Knowledge triple extraction in cybersecurity with adversarial active learning

doi:10.11959/j.issn.1000-436x.2020174

Abstract

Pipeline approaches to extracting cybersecurity knowledge triples propagate entity recognition errors, ignore the correlation between entity recognition and relation extraction, and suffer from a shortage of labeled training corpora. To address these problems, an end-to-end cybersecurity knowledge triple extraction method incorporating adversarial active learning is proposed. First, entity recognition and relation extraction are modeled as a single sequence labeling task through a joint labeling strategy. Then, a BiLSTM-LSTM model with a dynamic attention mechanism is designed to jointly extract entities and relations and form triples. Finally, a discriminator is trained within an adversarial learning framework to incrementally select high-quality samples for labeling, and the performance of the joint extraction model is continuously improved through iterative retraining. Experiments show that the proposed joint entity-relation extraction model outperforms existing cybersecurity knowledge extraction methods, and they demonstrate the effectiveness of the adversarial active learning scheme.

Key words: knowledge triple, cybersecurity, joint extraction, adversarial network, active learning

CLC number:

TP391

Tao LI, Yuanbo GUO, Ankang JU. Knowledge triple extraction in cybersecurity with adversarial active learning[J]. Journal on Communications, 2020, 41(10): 80-91.


Table 4

Examples of triple extraction results

Model	Extraction result
Example 1	Since the revelation of an [Adobe Flash Player]_{e1,hasVulnerability} zero day exploit exposed as part of the leaked Hacking Team arsenal in 2015 designated [CVE-2015-5119]_{e2,hasVulnerability}.
Att-PCNN_BiLSTM	Since the revelation of an [Adobe Flash Player]_{e1,uses} zero day exploit exposed as part of the leaked Hacking Team arsenal in 2015 designated [CVE-2015-5119]_{e2,uses}.
BiLSTM-CRF-Multi_head	Since the revelation of an [Adobe Flash Player]_{e1,hasVulnerability} zero day exploit exposed as part of the leaked Hacking Team arsenal in 2015 designated [CVE-2015-5119]_{e2,hasVulnerability}.
Dynamic-att-BiLSTM-LSTM	Since the revelation of an [Adobe Flash Player]_{e1,hasVulnerability} zero day exploit exposed as part of the leaked Hacking Team arsenal in 2015 designated [CVE-2015-5119]_{e2,hasVulnerability}.
Example 2	[Apt 28]_{e1,M} which we suspect is sponsored by [Russian]_{e2,comes-from} government, uses [spear phishing emails]_{e2,uses} to target its victims by specific topics.
Att-PCNN_BiLSTM	[Apt 28]_{e1,comes-from} which we suspect is sponsored by [Russian]_{e2,comes-from} government, uses [spear phishing emails] to target its victims by specific topics.
BiLSTM-CRF-Multi_head	[Apt 28]_{e1,comes-from} which we suspect is sponsored by [Russian]_{e2,comes-from} government, uses [spear phishing] emails to target its victims by specific topics.
Dynamic-att-BiLSTM-LSTM	[Apt 28]_{e1,M} which we suspect is sponsored by [Russian]_{e2,comes-from} government, uses [spear phishing emails]_{e2,uses} to target its victims by specific topics.
Example 3	One identified malware sample ([75193fc10145931ec0788d7c88fc8832]_{e1,indicates}, compiled in March 2014) uses a password-protected [.7z]_{e1,located-at} to deliver the [Etumbot installer]_{e2,M}, which is most likely contained within [spear phishing email]_{e2,located-at}.
Att-PCNN_BiLSTM	One identified malware sample ([75193fc10145931ec0788d7c88fc8832]_{e1,indicates}, compiled in March 2014) uses a password-protected [.7z] to deliver the [Etumbot installer]_{e2,indicates}, which is most likely contained within [spear phishing email].
BiLSTM-CRF-Multi_head	One identified malware sample ([75193fc10145931ec0788d7c88fc8832]_{e1,indicates}, compiled in March 2014) uses a password-protected [.7z] to deliver the [Etumbot installer]_{e2,indicates}, which is most likely contained within [spear phishing] email.
Dynamic-att-BiLSTM-LSTM	One identified malware sample ([75193fc10145931ec0788d7c88fc8832]_{e1,indicates}, compiled in March 2014) uses a password-protected [.7z] to deliver the [Etumbot installer]_{e2,M}, which is most likely contained within [spear phishing email]_{e2,located-at}.
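
The bracket-and-subscript notation in the table marks each entity with a role (e1 or e2) and a relation label. As a rough illustration of how such annotations map to triples, the sketch below parses that notation and pairs each e1 with the e2 that carries the same relation label. The function name `extract_triples`, the regular expression, and the pairing rule are assumptions made for this example, not the authors' decoding procedure. The `M` marker is not defined on this page, so the sketch does not interpret it, and entities tagged `M` produce no triple.

```python
import re
from collections import defaultdict

# Matches "[entity text]_{role,relation}", e.g. "[CVE-2015-5119]_{e2,hasVulnerability}".
TAGGED = re.compile(r"\[([^\]]+)\]_\{(e[12]),([^}]+)\}")


def extract_triples(sentence):
    """Pair each e1 with the e2 that shares its relation label (hypothetical rule)."""
    heads, tails = defaultdict(list), defaultdict(list)
    for text, role, relation in TAGGED.findall(sentence):
        (heads if role == "e1" else tails)[relation].append(text)

    triples = []
    for relation, head_list in heads.items():
        # zip stops at the shorter list, so unmatched entities are dropped.
        for head, tail in zip(head_list, tails.get(relation, [])):
            triples.append((head, relation, tail))
    return triples


example = (
    "Since the revelation of an [Adobe Flash Player]_{e1,hasVulnerability} "
    "zero day exploit exposed as part of the leaked Hacking Team arsenal in 2015 "
    "designated [CVE-2015-5119]_{e2,hasVulnerability}."
)
print(extract_triples(example))
# prints [('Adobe Flash Player', 'hasVulnerability', 'CVE-2015-5119')]
```

Applied to the Att-PCNN_BiLSTM row of Example 1, the same function would return the triple with the relation `uses`, which makes the relation error in that row easy to see.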



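Parameter	Value
Word embedding dimension	100
Character embedding dimension	25
Number of convolution kernels	20
Window size	3
BiLSTM encoder layer units	300
LSTM decoder layer units	600
Bias weight	10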

Model	Character embedding features	P	R	F1
CRF	×	56.12%	55.37%	55.74%
BiLSTM-CRF	×	60.41%	58.24%	59.31%
BiLSTM-CRF	○	61.83%	60.02%	60.91%
Self-att-BiLSTM-CRF	○	63.65%	62.32%	62.98%
Self-att-BiLSTM-LSTM	○	64.06%	62.19%	63.11%
Dynamic-att-BiLSTM-LSTM	○	65.75%	63.44%	64.57%
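
The F1 column in the table above is consistent with the standard harmonic mean of precision (P) and recall (R). The short check below recomputes F1 from the P and R values copied from the table. The helper name `f1_score` is chosen for this example, and the page itself does not state how the figures were computed, so this is a sanity check under that assumption.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall, both given in percent."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R, reported F1) in percent, copied from the table above.
rows = {
    "CRF": (56.12, 55.37, 55.74),
    "BiLSTM-CRF (with character embeddings)": (61.83, 60.02, 60.91),
    "Self-att-BiLSTM-LSTM": (64.06, 62.19, 63.11),
    "Dynamic-att-BiLSTM-LSTM": (65.75, 63.44, 64.57),
}

for model, (p, r, reported) in rows.items():
    computed = f1_score(p, r)
    print(f"{model}: computed F1 = {computed:.2f}%, reported F1 = {reported:.2f}%")
```

For these rows the recomputed values agree with the reported ones to two decimal places.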

	Model	P	R	F1
Pipeline extraction	SDP-LSTM	61.85%	59.03%	60.41%
	Att-PCNN_BiLSTM	63.28%	58.63%	60.87%
	BiLSTM_Bi-TreeLSTM	62.17%	60.58%	61.36%
Joint extraction	BiLSTM-CRF-Multi_head	64.73%	61.65%	63.20%
	Dynamic-att-BiLSTM-LSTM	65.75%	63.44%	64.57%

Labeled data size	P	R	F1
10%	33.28%	32.75%	33.01%
20%	52.72%	50.64%	51.66%
30%	62.15%	59.97%	61.04%
40%	64.95%	63.02%	63.97%
45%	65.62%	63.25%	64.41%
100%	65.75%	63.44%	64.57%
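
To make the labeling savings in the table above easier to read, the following sketch expresses each row's F1 as a gap to, and a fraction of, the F1 obtained with 100% of the labeled data. The values are copied from the table. The names `f1_by_share`, `gap_to_full`, and `relative_f1` are introduced only for this illustration.

```python
# F1 (%) by share of labeled data (%), copied from the table above.
f1_by_share = {10: 33.01, 20: 51.66, 30: 61.04, 40: 63.97, 45: 64.41, 100: 64.57}


def gap_to_full(share):
    """F1 points lost relative to training on 100% of the labeled data."""
    return f1_by_share[100] - f1_by_share[share]


def relative_f1(share):
    """F1 at the given share as a percentage of the full-data F1."""
    return 100 * f1_by_share[share] / f1_by_share[100]


for share in f1_by_share:
    print(
        f"{share:>3}% labeled: gap = {gap_to_full(share):.2f} points, "
        f"{relative_f1(share):.1f}% of full-data F1"
    )
```

Read this way, the 45% row sits 0.16 F1 points below the full-data result, roughly 99.8% of it, while the 10% row reaches only about half.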