Evidence classification method of chat text based on DSR and BGRU model

doi:10.11959/j.issn.2096-109x.2022007

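Abstract

Abstract (Chinese version):

Chat text generated by social software such as instant messaging is voluminous, and its content carries complex semantics such as criminal slang, so digital forensic examiners cannot quickly identify and extract the chat text evidence related to a criminal event. To address this, a chat text evidence classification model (DSR-BGRU) is proposed, based on the DSR (dynamic semantic representation) model and the BGRU (bidirectional gated recurrent unit) model. The chat text data are first preprocessed so that they retain features of the criminal domain. A semantic feature representation method for chat text evidence is then designed and implemented on the DSR model: a clustering algorithm selects the semantic words, the word vector of each non-semantic word is represented as a weighted combination of word attributes and semantic words, and the semantic words are also used to give new words a sparse representation. Using the Keras framework, a multi-layer chat text feature extraction and classification model is built, consisting of a DSR input layer, BGRU hidden layers, and a softmax classification layer. The model takes as input the text matrix formed from the DSR word vectors, which represents the chat text at the semantic level; the stacked BGRU hidden layers extract contextual features from the text composed of these word vectors, so that the semantics of the chat text are captured more accurately; and the softmax classification layer identifies and extracts the chat text evidence. Experimental results show that the DSR-BGRU model identifies and extracts chat record evidence more accurately than other text classification models and methods, and effectively extracts criminal text from chat messages, reaching an accuracy of 92.06% and an F1 score of 91.00%.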

Keywords: text semantic representation, polysemy, text classification, digital forensics
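
The classification pipeline described in the abstract (a text matrix of word vectors, bidirectional GRU hidden layers, a softmax output) can be illustrated with a minimal forward pass. This is a sketch under stated assumptions, not the authors' Keras implementation: it uses a single bidirectional layer rather than several, omits bias terms, uses random placeholder weights instead of trained ones, and substitutes random vectors for the DSR word vectors. All names (`GRUCell`, `bgru_classify`, the dimensions) are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    # Plain matrix-vector product on nested lists.
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class GRUCell:
    """One GRU direction; weights are random placeholders, not trained values."""

    def __init__(self, input_dim, hidden_dim, rng):
        self.hidden_dim = hidden_dim
        # One (W, U) pair per gate: update z, reset r, candidate h.
        self.W = {g: rand_matrix(hidden_dim, input_dim, rng) for g in "zrh"}
        self.U = {g: rand_matrix(hidden_dim, hidden_dim, rng) for g in "zrh"}

    def step(self, x, h):
        z = [sigmoid(a + b) for a, b in zip(matvec(self.W["z"], x), matvec(self.U["z"], h))]
        r = [sigmoid(a + b) for a, b in zip(matvec(self.W["r"], x), matvec(self.U["r"], h))]
        reset_h = [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(a + b)
                for a, b in zip(matvec(self.W["h"], x), matvec(self.U["h"], reset_h))]
        # Blend the previous state with the candidate state (one common convention).
        return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]

    def run(self, sequence):
        h = [0.0] * self.hidden_dim
        for x in sequence:
            h = self.step(x, h)
        return h


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def bgru_classify(sequence, fwd, bwd, out_weights):
    # Concatenate the final forward state with the final backward state.
    features = fwd.run(sequence) + bwd.run(list(reversed(sequence)))
    return softmax(matvec(out_weights, features))


rng = random.Random(0)
INPUT_DIM, HIDDEN_DIM, NUM_CLASSES = 4, 3, 2  # illustrative sizes only
fwd = GRUCell(INPUT_DIM, HIDDEN_DIM, rng)
bwd = GRUCell(INPUT_DIM, HIDDEN_DIM, rng)
out_weights = rand_matrix(NUM_CLASSES, 2 * HIDDEN_DIM, rng)
# Stand-in for a DSR text matrix: five words, each a 4-dimensional vector.
text_matrix = [[rng.uniform(-1, 1) for _ in range(INPUT_DIM)] for _ in range(5)]
probs = bgru_classify(text_matrix, fwd, bwd, out_weights)
```

Here `probs` holds one probability per class (normal or abnormal text). In Keras, the same structure would typically be expressed with a `Bidirectional` wrapper around a `GRU` layer followed by a `Dense` layer with softmax activation; the layer sizes and depth used in the paper are not given in the abstract.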

Abstract:

Chat text evidence related to criminal events is difficult to identify and extract efficiently, because chat content carries complex semantics such as slang and social software such as instant messaging generates a huge amount of chat text data. Motivated by this, a chat text evidence classification model (DSR-BGRU) based on the DSR (dynamic semantic representation) model and the BGRU (bidirectional gated recurrent unit) model is proposed. The chat text data are pre-processed to preserve the characteristics of the criminal domain. A multi-layer chat text feature extraction and classification model is then built with the Keras framework. Taking as input the text matrix composed of the DSR vector representations of words, the DSR input layer represents the chat text at the semantic level. The BGRU hidden layers then extract the contextual features of the text composed of these word vectors, and the softmax classification layer recognizes and extracts the chat text evidence. The experimental results show that the proposed DSR-BGRU model identifies and extracts chat record evidence more accurately than other text classification models and methods, and effectively extracts criminal text from chat messages, with an accuracy of 92.06% and an F1 score of 91.00%.

Key words: text semantic representation, polysemy, text classification, digital forensics
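
The accuracy and F1 score reported in the abstract follow from confusion-matrix counts. The sketch below shows the standard definitions for a binary task, treating abnormal (evidence) text as the positive class. The abstract does not say how F1 was averaged, so the binary form used here is an assumption, and the counts are hypothetical values chosen only to exercise the formulas; they are not results from the paper.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 for the positive (evidence) class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical confusion counts, not taken from the paper.
metrics = classification_metrics(tp=90, fp=10, fn=10, tn=90)
```

Because F1 combines precision and recall, it can fall below accuracy when the classifier misses positive examples or raises false alarms.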

CLC number: TP391

Yu ZHANG, Binglong LI, Xuejuan LI, Heyu ZHANG. Evidence classification method of chat text based on DSR and BGRU model[J]. Chinese Journal of Network and Information Security (网络与信息安全学报), 2022, 8(2): 150-159.

Figures and tables (15): Figures 1-8, Tables 1-7

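Original chat record	Text after word segmentation and POS tagging	Literal English gloss
有猪肉么	有/v猪肉/n么/o	"Got any pork?"
有信么，我这有洗信人	有/v信/n么/o，我/o这/o有/v洗信人/n	"Got any letters? I have a letter washer here."
四号到货了，要溜么	四号/n到/v货/n了/o，要/v溜/v么/o	"No. 4 has arrived, want to slide?"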

POS tag	Meaning
n	noun
v	verb
o	other
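
The tagged format shown in the segmentation table above (`word/tag` with no separator between tokens) can be split back into word and tag pairs with a short regular expression. This is a minimal sketch, assuming that words never contain `/` or lowercase Latin letters and that tags are lowercase letters such as `n`, `v`, and `o`; the names `TOKEN_PATTERN` and `parse_tagged` are hypothetical and not from the paper.

```python
import re

# A word is any run of characters other than '/', whitespace, or common
# Chinese punctuation; the tag is the lowercase letters after the slash.
TOKEN_PATTERN = re.compile(r"([^/\s，。！？]+)/([a-z]+)")


def parse_tagged(text):
    """Return (word, tag) pairs from a string such as '有/v猪肉/n么/o'."""
    return TOKEN_PATTERN.findall(text)


pairs = parse_tagged("四号/n到/v货/n了/o，要/v溜/v么/o")
nouns = [word for word, tag in pairs if tag == "n"]
```

Untagged punctuation such as the full-width comma is skipped, so `pairs` contains only the seven tagged tokens and `nouns` contains 四号 and 货.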

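Text type	Topic
Normal text	Daily life
Normal text	Entertainment
Normal text	Study
Normal text	Sports
Normal text	Finance
Abnormal text	Pornography
Abnormal text	Drugs
Abnormal text	Gambling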

Corpus split	Normal text	Abnormal text
Training set	8 944	10 336
Test set	2 236	2 584
Total	11 180	12 920
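
The counts in the corpus table above can be checked with a few lines of arithmetic. The sketch below confirms that the training and test counts add up to the stated totals and derives the split ratio and class balance; the variable names are hypothetical.

```python
# Counts copied from the corpus table.
corpus = {
    "normal": {"train": 8944, "test": 2236, "total": 11180},
    "abnormal": {"train": 10336, "test": 2584, "total": 12920},
}

# Each class total should equal its training count plus its test count.
totals_consistent = all(c["train"] + c["test"] == c["total"] for c in corpus.values())

# Fraction of each class assigned to the training set.
train_fraction = {label: c["train"] / c["total"] for label, c in corpus.items()}

total_samples = sum(c["total"] for c in corpus.values())  # 24100
abnormal_share = corpus["abnormal"]["total"] / total_samples
```

Both classes use an 80/20 split between training and test sets, and abnormal text makes up roughly 54% of the 24 100 samples, so the two classes are close to balanced.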

Dataset	Positive-sentiment samples	Negative-sentiment samples	Total
ChnSentiCorp_htl_all	5 000	2 000	7 000
waimai_10k	4 000	4 000	8 000
online_shopping_10_cats	3 000	3 000	6 000
weibo_senti_100k	5 000	5 000	10 000