Research on entity recognition and alignment of APT attack based on Bert and BiLSTM-CRF

doi:10.11959/j.issn.1000-436x.2022116

Abstract

Objectives: In a complex and rapidly changing network security environment, defending against Advanced Persistent Threat (APT) attacks has become an urgent problem for the entire security community. The large volume of APT attack analysis reports and threat intelligence produced by security companies has significant research value: it provides information about APT organizations and thereby assists the traceability analysis of network attack events. However, APT analysis reports are not yet fully utilized, and automated methods for generating structured knowledge and constructing feature portraits of hacker organizations are lacking. To address this, an automatic knowledge extraction method for APT attacks that combines entity recognition and entity alignment is proposed. The method automatically extracts entities from APT analysis reports and constructs structured knowledge about each APT organization.

Methods: An automatic extraction method for APT attack knowledge that integrates entity recognition and entity alignment is designed. First, 12 entity categories are defined according to the characteristics of APT attacks. The preprocessing layer then applies lowercase conversion, data cleaning, and data annotation to the corpus, and the preprocessed APT text sequence is represented as vectors. Second, the Bert model is pre-trained on the annotated corpus to encode each word and generate the corresponding word vector; a BiLSTM model captures long-distance and contextual semantic features; and an attention mechanism highlights key features and converts the vector sequence into an annotation probability matrix. Third, the CRF algorithm decodes the dependencies between the predicted output labels and generates the optimal label sequence. Finally, an entity alignment method based on semantic similarity and Birch is constructed; it improves the quality of the extracted APT attack knowledge by matching knowledge and merging it into the infobox of each APT organization.
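
The CRF decoding step described above can be illustrated with a minimal Viterbi sketch. This is not the authors' implementation: the label set, the emission scores (standing in for the annotation probability matrix), and the transition scores are all made-up values chosen only to show how transition scores can override a greedy per-token choice.

```python
from typing import Dict, List


def viterbi_decode(emissions: List[Dict[str, float]],
                   transitions: Dict[str, Dict[str, float]],
                   labels: List[str]) -> List[str]:
    """Return the highest-scoring label sequence.

    emissions[t][y]   : score of label y at position t (hypothetical values here)
    transitions[a][b] : score of moving from label a to label b
    """
    if not emissions:
        return []
    # best[t][y] = best score of any label path that ends in y at position t
    best = [{y: emissions[0][y] for y in labels}]
    back = []  # back[t-1][y] = best previous label for y at position t
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[-1][p] + transitions[p][y])
            scores[y] = best[-1][prev] + transitions[prev][y] + emissions[t][y]
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Assumed toy setup: three labels and one strongly penalized transition (O -> I-AG).
labels = ["O", "B-AG", "I-AG"]
transitions = {a: {b: 0.0 for b in labels} for a in labels}
transitions["O"]["I-AG"] = -10.0

# Made-up emission scores for three tokens.
emissions = [
    {"O": 2.0, "B-AG": 0.1, "I-AG": 0.1},
    {"O": 0.2, "B-AG": 1.0, "I-AG": 1.2},
    {"O": 0.1, "B-AG": 0.3, "I-AG": 2.0},
]

greedy = [max(labels, key=lambda y: e[y]) for e in emissions]
decoded = viterbi_decode(emissions, transitions, labels)
print(greedy)   # per-token argmax, ignores transitions
print(decoded)  # sequence-level optimum
```

In this toy case the per-token argmax starts an entity with an I- label directly after O, while the sequence-level decode picks B-AG first because the O to I-AG transition is penalized.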

Results: In terms of entity recognition, the proposed APT attack entity recognition method outperforms existing entity recognition methods (i.e., CRF, LSTM-CRF, GRU-CRF, BiLSTM-CRF, CNN-CRF, and Bert-CRF). Its precision, recall, and F1-score are 0.9296, 0.8733, and 0.9006, respectively. The F1-score improvements below are absolute differences (percentage points). Compared with CRF, the F1-score of the proposed model is higher by 14.32%. Compared with CNN-CRF, which integrates convolutional neural networks, it is higher by 6.92%. Compared with LSTM-CRF and BiLSTM-CRF, it is higher by 8.43% and 5.30%, respectively. Compared with GRU-CRF, it is higher by 8.74%, and compared with Bert-CRF, by 7.03%. In addition, the accuracy of the proposed model is 0.9004, which is 9.85% higher than the average of the other six models. Training is also more stable: the accuracy curve converges faster and reaches higher accuracy with fewer training batches, and the training error converges faster along a smoother curve. The proposed model performs best on the "attack method" entity category, with an F1-score of 0.9275. Two factors explain this: a large number of entities exist in this category, and these entities occur widely in semantically rich descriptions of APT attack events and carry the action characteristics of attack behavior. In terms of entity recognition with small-sample annotation, the proposed method's precision, recall, and F1-score are 0.7800, 0.5894, and 0.6714, respectively. Compared with the CRF, LSTM-CRF, GRU-CRF, BiLSTM-CRF, CNN-CRF, and Bert-CRF models, the F1-score of the proposed model is higher by 27.42%, 18.78%, 23.62%, 13.25%, 14.88%, and 14.46%, respectively. This experiment demonstrates that pre-training on a small-sample corpus with the Bert model improves entity recognition. In terms of entity alignment and knowledge fusion, the experiment automatically extracts the high-frequency named entities of each entity category, which often appear in APT attack events. For example, common APT organizations include "APT29", "APT32", "APT28", and "Turla"; common attack equipment includes "PowerShell", "Cobalt Strike", and "Mimikatz"; common attack methods include "Spearphishing", "C2", "Watering Hole Attack", and "Backdoor"; and common vulnerabilities include "CVE-2017-11882", "CVE-2017-0199", and "CVE-2012-0158". The proposed method combines corpus titles and keywords to fuse the entities of APT organization names. Finally, the infobox of the common APT organizations in this dataset is constructed, forming structured knowledge for each APT organization, and the attack domain knowledge of APT28 and APT32 is shown in detail.

Conclusions: According to the characteristics of APT attacks, an automatic extraction method of APT attack knowledge based on entity recognition and entity alignment is designed and implemented. This method can effectively identify APT attack entities, automatically extract advanced persistent threat knowledge under the condition of few-sample annotation, and generate structured feature portraits of common APT organizations, which will provide support for subsequent APT attack knowledge graph construction and attack traceability analysis.

Key words: advanced persistent threat, threat intelligence extraction, entity recognition, entity alignment, deep learning

CLC Number: TP309

Xiuzhang YANG, Guojun PENG, Zichuan LI, Yangqi LYU, Side LIU, Chenguang LI. Research on entity recognition and alignment of APT attack based on Bert and BiLSTM-CRF[J]. Journal on Communications, 2022, 43(6): 58-70.

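Tag	Entity category	Definition	Examples
AG	APT organization	Names of the groups behind common APT attacks	Lazarus, APT28, OceanLotus
AEQ	Attack equipment	Equipment used by APT organizations	CobaltStrike, Metasploit, Gh0st
AM	Attack method	Attack means and techniques of APT organizations	SQL injection, spear phishing, XSS attack
AV	Attack vulnerability	Vulnerabilities commonly used by APT organizations, identified mainly by CVE number or by a specific vulnerability name	CVE-2017-11882, CVE-2018-4878, EternalBlue, HeartBleed
AE	Attack event	Attack campaigns carried out by APT organizations in recent years	Operation Blockbuster, DarkSeoul
AT	Attack target	Companies, departments, and institutions attacked by APT organizations	Sony, Iranian nuclear power plant
AI	Attack industry	Industries attacked by APT organizations	financial, economic, trade policy
MF	Malicious file	Malicious files, sensitive directories, and malicious commands commonly used by APT organizations; file formats include exe, xls, and doc	cmdl32.exe, Agent.btz, wwlib.dll
MFA	Malware family	Malware families commonly used by APT organizations	Trojan/Win32.Occamy, ZeuS
RL	Region/location	Regions where APT organizations are located and regions they target	North Korea, Russia, South Asia
OS	Operating system	Operating system environment in which APT attacks are launched	Windows, Mac, Linux, Android
SI	Exploited software	Software environment in which APT attacks are launched	Chrome, Office, Firefox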
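
As a rough illustration of how the 12 category tags in the table above could drive sequence labeling, the sketch below builds a BIO label set and reads entities back from a tagged token sequence. The BIO scheme itself is an assumption (the page does not state which annotation scheme was used), and the example sentence and its labels are invented for demonstration.

```python
from typing import List, Tuple

# The 12 category tags listed in the table above.
ENTITY_TAGS = ["AG", "AEQ", "AM", "AV", "AE", "AT",
               "AI", "MF", "MFA", "RL", "OS", "SI"]


def build_label_set(tags: List[str]) -> List[str]:
    """Assumed BIO scheme: one 'O' label plus B-/I- labels per category."""
    labels = ["O"]
    for tag in tags:
        labels += [f"B-{tag}", f"I-{tag}"]
    return labels


def extract_entities(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Collect (entity text, category tag) pairs from a BIO-labeled sequence."""
    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_tag = None
    for token, label in zip(tokens, labels):
        prefix, _, tag = label.partition("-")
        if prefix == "B":
            # A new entity starts; close any open one first.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_tag))
            current_tokens, current_tag = [token], tag
        elif prefix == "I" and current_tokens and tag == current_tag:
            # Continuation of the open entity.
            current_tokens.append(token)
        else:
            # 'O' or a stray I- label: close any open entity, skip the token.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_tag))
            current_tokens, current_tag = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_tag))
    return entities


# Invented, lowercased example sentence with hand-assigned labels.
tokens = ["apt32", "delivers", "cobalt", "strike", "via", "spearphishing"]
tags = ["B-AG", "O", "B-AEQ", "I-AEQ", "O", "B-AM"]

all_labels = build_label_set(ENTITY_TAGS)
found = extract_entities(tokens, tags)
print(len(all_labels))
print(found)
```

Under this assumed scheme the tagger chooses among 25 labels per token (12 categories times two, plus O).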

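Entity category	Number of entities	Entity category	Number of entities
APT organization	128	Attack industry	34
Attack equipment	651	Malicious file	116
Attack method	65	Malware family	38
Attack vulnerability	60	Region/location	72
Attack event	16	Operating system	5
Attack target	31	Exploited software	48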

Model	Precision	Recall	F1-score	Accuracy
CRF	0.8002	0.7190	0.7574	0.7413
LSTM-CRF	0.8777	0.7628	0.8163	0.8169
GRU-CRF	0.8896	0.7489	0.8132	0.8052
BiLSTM-CRF	0.9514	0.7643	0.8476	0.8466
CNN-CRF	0.8536	0.8104	0.8314	0.8264
Bert-CRF	0.9236	0.7542	0.8303	0.8145
Proposed model	0.9296	0.8733	0.9006	0.9004
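
The relationship between the columns of the table above can be checked with a short sketch. It assumes the standard definition F1 = 2PR / (P + R) and treats the improvements quoted in the abstract as absolute F1 differences expressed in percentage points; the dictionary simply copies the precision, recall, and F1 values from the table (the last row is the proposed model).

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above.
REPORTED = {
    "CRF": (0.8002, 0.7190, 0.7574),
    "LSTM-CRF": (0.8777, 0.7628, 0.8163),
    "GRU-CRF": (0.8896, 0.7489, 0.8132),
    "BiLSTM-CRF": (0.9514, 0.7643, 0.8476),
    "CNN-CRF": (0.8536, 0.8104, 0.8314),
    "Bert-CRF": (0.9236, 0.7542, 0.8303),
    "Proposed": (0.9296, 0.8733, 0.9006),
}

# Recompute F1 from precision and recall; small gaps come from rounded inputs.
recomputed = {name: f1_score(p, r) for name, (p, r, _) in REPORTED.items()}

# Absolute F1 gain of the proposed model over each baseline, in percentage points.
proposed_f1 = REPORTED["Proposed"][2]
gains = {
    name: round((proposed_f1 - f1) * 100, 2)
    for name, (_, _, f1) in REPORTED.items()
    if name != "Proposed"
}

for name, gain in gains.items():
    print(f"{name}: recomputed F1 {recomputed[name]:.4f}, gain {gain:.2f} points")
```

The recomputed values agree with the reported F1-scores to within rounding, and the gains reproduce the figures quoted in the Results paragraph.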

Entity category	Precision	Recall	F1-score
APT organization	0.9370	0.8773	0.9062
Attack equipment	0.9517	0.8922	0.9210
Attack method	0.9473	0.9085	0.9275
Attack vulnerability	0.9074	0.8596	0.8829
Attack event	0.9304	0.8560	0.8917
Attack target	0.9318	0.8723	0.9011
Attack industry	0.9252	0.9004	0.9126
Malicious file	0.9102	0.8329	0.8698
Malware family	0.8909	0.8305	0.8596
Region/location	0.9475	0.8741	0.9093
Operating system	0.9421	0.8906	0.9157
Exploited software	0.9338	0.8853	0.9089
Average	0.9296	0.8733	0.9006

Model	Precision	Recall	F1-score
CRF	0.4670	0.3455	0.3972
LSTM-CRF	0.5536	0.4292	0.4836
GRU-CRF	0.5570	0.3571	0.4352
BiLSTM-CRF	0.6237	0.4744	0.5389
CNN-CRF	0.6124	0.4557	0.5226
Bert-CRF	0.5911	0.4751	0.5268
Proposed model	0.7800	0.5894	0.6714

Entity category	Successfully recognized named entities
APT organization	APT29; APT32; APT28; Turla; Sandworm; MuddyWater; OilRig; APT39; Kimsuky; FIN7; TA505
Attack equipment	PowerShell; Cobalt Strike; Mimikatz; LaZagne; Cannon; Dropper; Empire; NBTscan; TrickBot; FireMalv
Attack method	Spearphishing; C2; Anti-censorship; Backdoor; Payload; Persistence; SQL injection; Watering Hole Attack
Attack vulnerability	CVE-2017-11882; CVE-2017-0199; CVE-2012-0158; CVE-2019-19781; CVE-2014-4114; CVE-2018-0802
Attack event	DarkSeoul; Operation Blockbuster; Operation Flame; SolarWinds; Clinton Campaign; Stuxnet
Attack target	NATO; Nuclear Facility; OPCW; Sony; ASEAN; World Health Organization; High-tech Company
Attack industry	Government; Espionage; Industry; Military Institutions; Financial Company; Manufacturing; Telecommunication
Malicious file	mshta.exe; wmiexec.vbs; rundl132.exe; backup.pst; csrss.exe; regsvr32.exe; sqlceip.exe; msfte.dll; pubprn.vbs
Malware family	Trojan; Agent; Denes; Gh0st; Beacon; MSOffice.Alien.gen; CoreShell; Win32.Mimikatz; Win32.Cobalt
Region/location	America; Russia; North Korea; Iran; South Asia; Europe; U.S.; India; Germany
Operating system	Windows; Linux; Android; Mac OS; Unix; IOS; Kernel Operating System
Exploited software	Office; Firefox; Word; RDP; Microsoft Exchange; Outlook; Adobe; WinRAR; PDF; Defender; Gmail

Entity category	Entity knowledge
APT organization	APT28; Fancy Bear; Sofacy; Sednit; Strontium
Attack equipment	PowerShell; Mimikatz; Koadic; JHUHUGIT; Dropper
Attack method	Spearphishing; C2; Persistence; Script; DDoS; Backdoor
Attack vulnerability	CVE-2015-1701; CVE-2017-0263; CVE-2017-0262
Attack event	the Hillary Clinton Campaign; VPNFilter
Attack target	NATO; WADA; OSCE; OPCW; Nuclear Facility
Attack industry	Government; Industry; Organization; Education
Malicious file	rundll32.exe; explorer.exe; twain_64.dll; srhost.exe
Malware family	ChopStick; Trojan; Win32.Dynamer; Zebrocy
Region/location	Russia; U.S.; Europe; India; Germany; U.K.; Israel
Operating system	Windows; Android
Exploited software	Office; Microsoft Exchange; Gmail; PDF; NetBIOS; Delphi

Entity category	Entity knowledge
APT name	APT32; SeaLotus; OceanLotus; APT-C-00
Attack equipment	Cobalt Strike; PowerShell; Mimikatz; RC4; DKMC
Attack method	Backdoor; C2; Scheduled task; Spearphishing; Script
Attack vulnerability	CVE-2017-11882; CVE-2016-7255; CVE-2017-8759
Attack event	Cobalt Kitty; OceanLotus Blossoms
Attack target	ASEAN; Asian Nations; the Media; Civil Society
Attack industry	Government; Military Institutions; Industry
Malicious file	pubprn.vbs; rundll32.exe; regsvr32.exe; kb-10233.exe
Malware family	Denis; Gh0st; Trojan; Beacon; Win32.Agent
Region/location	Vietnam; Cambodia; Philippine; China; Laos
Operating system	Windows; Mac OS
Exploited software	Office; Outlook; COM; RTF; Dropbox; Amazon S3
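
The idea of merging duplicate mentions into one infobox entry, as in the tables above, can be sketched in a few lines. This is a deliberately simplified stand-in: it matches surface forms by exact equality after normalization (lowercasing and dropping non-alphanumeric characters), whereas the paper describes alignment based on semantic similarity and Birch. The function names and the mention list are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple


def normalize(mention: str) -> str:
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", mention.lower())


def align_mentions(mentions: List[Tuple[str, str]]) -> Dict[Tuple[str, str], List[str]]:
    """Group (category, surface form) pairs whose normalized forms coincide."""
    groups: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for category, surface in mentions:
        key = (category, normalize(surface))
        if surface not in groups[key]:
            groups[key].append(surface)
    return dict(groups)


def build_infobox(mentions: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Keep one entry per aligned group, using the first-seen surface form."""
    infobox: Dict[str, List[str]] = defaultdict(list)
    for (category, _), surfaces in align_mentions(mentions).items():
        infobox[category].append(surfaces[0])
    return dict(infobox)


# Hypothetical mentions extracted from several reports about one organization.
mentions = [
    ("AEQ", "Cobalt Strike"),
    ("AEQ", "CobaltStrike"),
    ("AM", "Spearphishing"),
    ("AM", "spear phishing"),
    ("AEQ", "PowerShell"),
]

infobox = build_infobox(mentions)
print(infobox)
```

Exact matching after normalization cannot link true aliases such as "APT28" and "Fancy Bear"; that is the case where a similarity measure and clustering, as described in the Methods paragraph, would be needed.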
