Automatic key information extraction of police records based on deep learning

doi:10.11959/j.issn.2096-0271.2022052

Abstract

With the rise of intelligent policing, the public has more channels for reporting incidents, the volume of unstructured police records has grown sharply, and recognizing entities in those records has become harder. To address this problem, the BERT model is introduced to obtain word vectors, a self-attention mechanism is integrated to capture long-distance dependencies between characters, and a BERT-BiGRU-SelfAtt-CRF police entity recognition model is constructed. To verify the performance and generalization ability of the model, experiments were carried out on public datasets. To verify its feasibility and efficiency in the police domain, experiments were carried out on a purpose-built annotated police dataset. The results show that, on the police dataset, the proposed model achieves a precision of 82.45%, a recall of 79.03%, and an F1 value of 80.72%, outperforming the other models compared. The proposed model is therefore feasible and effective for police entity recognition and can meet the needs of practical police work.

Key words: deep learning, pretrained language model, self-attention mechanism, entity recognition in police records
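
As a rough illustration of how self-attention lets every character attend to every other character regardless of distance, here is a minimal single-head sketch in plain Python. It is a toy under stated assumptions, not the paper's implementation: it omits the learned query/key/value projections and the multi-head structure, and the names `softmax`, `self_attention`, and the toy input vectors are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x: List[Vector]) -> List[Vector]:
    """Scaled dot-product self-attention with identity projections.

    Assumption: queries, keys, and values are all the raw inputs `x`;
    a real model would first apply learned linear projections.
    """
    d = len(x[0])
    out = []
    for q in x:
        # Similarity of this position to every position, near or far.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        weights = softmax(scores)
        # Each output is a weighted average of all input vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out


# Hypothetical 2-dimensional embeddings for a 3-character sequence.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(tokens)
```

Because the attention weights for each position sum to 1, every output vector is a weighted average of all input vectors, which is what allows distant characters to influence one another directly.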

CLC number:

TP391.1

Yumeng CUI, Jingya WANG, Shangyi YAN, Zhizhong TAO. Automatic key information extraction of police records based on deep learning[J]. Big Data Research, 2022, 8(6): 127-142.

Figures/Tables (11): Fig. 1, Fig. 2, Fig. 3, Table 1, Fig. 4, Fig. 5, Table 2, Table 3, Table 4, Fig. 6, Fig. 7

Parameter type	Parameter name	Value
BERT parameters	layer_nums	4
	head_num	12
	hidden_size	768
BiGRU-CRF hyperparameters	BiGRU units	128
	max_seq_length	100
	dropout_rate	0.4
	SelfAtt_head	12
BiGRU-CRF training parameters	Epochs	5
	batch_size	64
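
The settings in the table above could be gathered into a single configuration object, sketched below. The values are copied from the table; the class name `ModelConfig`, the snake_case field names `bigru_units` and `selfatt_head`, and the helper `bert_head_dim` are hypothetical and not taken from the paper's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    """Hypothetical container for the parameters listed in the table."""

    # BERT parameters
    layer_nums: int = 4
    head_num: int = 12
    hidden_size: int = 768
    # BiGRU-CRF hyperparameters
    bigru_units: int = 128
    max_seq_length: int = 100
    dropout_rate: float = 0.4
    selfatt_head: int = 12
    # BiGRU-CRF training parameters
    epochs: int = 5
    batch_size: int = 64

    @property
    def bert_head_dim(self) -> int:
        """Width of each BERT attention head (hidden size split across heads)."""
        return self.hidden_size // self.head_num


config = ModelConfig()
# The hidden size must divide evenly across the attention heads.
assert config.hidden_size % config.head_num == 0
```

Freezing the dataclass keeps a run's settings immutable, so a logged configuration cannot drift from the one actually used.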

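Text example	Label	Meaning
北	B-LOC	Beginning of an incident-location entity
京	I-LOC	Inside of an incident-location entity
的	O	Non-entity
李	B-PER	Beginning of a caller-name entity
先	I-PER	Inside of a caller-name entity
生	I-PER	Inside of a caller-name entity
在	O	Non-entity
人	B-ORG	Beginning of an involved-organization entity
民	I-ORG	Inside of an involved-organization entity
医	I-ORG	Inside of an involved-organization entity
院	I-ORG	Inside of an involved-organization entity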
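
To show how character-level labels like those in the table above map back to entities, the following is a minimal sketch that groups B-/I-/O tags into spans. The function name `decode_bio` is hypothetical, and treating an I- tag that does not continue the open entity as a boundary is an assumption of this sketch rather than something the source specifies.

```python
from typing import List, Optional, Tuple


def decode_bio(chars: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group character-level BIO tags into (entity_text, entity_type) pairs."""
    entities: List[Tuple[str, str]] = []
    current_chars: List[str] = []
    current_type: Optional[str] = None

    def flush() -> None:
        # Close the entity currently being built, if any.
        nonlocal current_chars, current_type
        if current_chars and current_type is not None:
            entities.append(("".join(current_chars), current_type))
        current_chars, current_type = [], None

    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):
            flush()
            current_chars, current_type = [ch], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_chars.append(ch)
        else:
            # "O", or an I- tag that does not continue the open entity.
            flush()
    flush()
    return entities


# The example sentence and labels from the table.
chars = list("北京的李先生在人民医院")
tags = ["B-LOC", "I-LOC", "O", "B-PER", "I-PER", "I-PER", "O",
        "B-ORG", "I-ORG", "I-ORG", "I-ORG"]
entities = decode_bio(chars, tags)
```

For the table's example this yields one location (北京), one caller name (李先生), and one organization (人民医院).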

Model	Training epochs	Precision	Recall	F1	Time/min
CNN-LSTM	40	72.87%	74.94%	73.98%	91
BiLSTM-CRF	40	95.50%	92.09%	93.76%	398
BiGRU-CRF	40	96.38%	93.22%	93.54%	245
BiGRU-SelfAtt-CRF	40	96.14%	93.37%	93.61%	252
BERT-CNN-LSTM	40	90.38%	93.98%	93.53%	309
BERT-BiLSTM-CRF	40	92.68%	90.31%	91.48%	443
BERT-BiGRU-CRF	40	91.11%	91.03%	91.07%	441
BERT-BiGRU-SelfAtt-CRF	40	91.62%	90.69%	91.13%	459

Model	Training epochs	Precision	Recall	F1	Time/min
CNN-LSTM	50	30.95%	19.70%	24.07%	0.68
BiLSTM-CRF	50	68.92%	71.21%	69.79%	7.15
BiGRU-CRF	50	61.83%	68.18%	64.51%	2.62
BiGRU-SelfAtt-CRF	50	64.90%	69.27%	66.74%	3.27
BERT-CNN-LSTM	10	78.57%	65.67%	71.54%	10.28
BERT-BiLSTM-CRF	10	78.12%	74.63%	76.34%	17.22
BERT-BiGRU-CRF	10	79.69%	76.12%	77.86%	16.10
BERT-BiGRU-SelfAtt-CRF	10	82.45%	79.03%	80.72%	17.23
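
The F1 column is conventionally the harmonic mean of precision and recall. The sketch below assumes that standard definition (the source does not state its formula) and applies it to the last row of the table above, whose figures match the police-dataset results in the abstract. The function name `f1_score` is hypothetical.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall (in percent) for BERT-BiGRU-SelfAtt-CRF.
f1 = f1_score(82.45, 79.03)
print(f"{f1:.2f}")
```

The recomputed value is about 80.70, close to the reported 80.72%. A gap of this size is plausible if the reported F1 was calculated from unrounded precision and recall.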