Journal on Communications ›› 2022, Vol. 43 ›› Issue (6): 58-70. doi: 10.11959/j.issn.1000-436x.2022116

• Academic Paper •

Research on entity recognition and alignment of APT attacks based on Bert and BiLSTM-CRF

Xiuzhang YANG1,2, Guojun PENG1,2, Zichuan LI1,2, Yangqi LYU1,2, Side LIU1,2, Chenguang LI1,2

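  1 Key Laboratory of Aerospace Information Security and Trusted Computing of Ministry of Education, Wuhan University, Wuhan 430072, Hubei, China
  2 School of Cyber Science and Engineering, Wuhan University, Wuhan 430072, Hubei, China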
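  • Revised: 2022-05-18  Online: 2022-06-01  Published: 2022-06-01
  • About the authors: Xiuzhang YANG (1991- ), male, born in Kaili, Guizhou, is a Ph.D. candidate at Wuhan University. His main research interest is network and information system security.
    Guojun PENG (1979- ), male, born in Jingzhou, Hubei, Ph.D., is a professor and doctoral supervisor at Wuhan University. His main research interest is network and information system security.
    Zichuan LI (1999- ), male, born in Handan, Hebei, is a master's student at Wuhan University. His main research interests are IoT security and automated vulnerability discovery and exploitation.
    Yangqi LYU (1997- ), female, born in Xiaogan, Hubei, is a master's student at Wuhan University. Her main research interest is network and information system security.
    Side LIU (1997- ), male, born in Jingzhou, Hubei, is a Ph.D. candidate at Wuhan University. His main research interests are malicious code detection and system security.
    Chenguang LI (1999- ), male, born in Shiyan, Hubei, is a master's student at Wuhan University. His main research interest is network and information system security.
  • Supported by:
    The National Natural Science Foundation of China (62172308, U1626107, 61972297, 62172144)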

Research on entity recognition and alignment of APT attack based on Bert and BiLSTM-CRF

Xiuzhang YANG1,2, Guojun PENG1,2, Zichuan LI1,2, Yangqi LYU1,2, Side LIU1,2, Chenguang LI1,2   

  1 Key Laboratory of Aerospace Information Security and Trusted Computing of Ministry of Education, Wuhan University, Wuhan 430072, China
  2 School of Cyber Science and Engineering, Wuhan University, Wuhan 430072, China
  • Revised: 2022-05-18  Online: 2022-06-01  Published: 2022-06-01
  • Supported by:
    The National Natural Science Foundation of China (62172308, U1626107, 61972297, 62172144)

Abstract:

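Objectives: In today's complex and changing cybersecurity environment, countering advanced persistent threat (APT) attacks has become an urgent problem for the entire security community. The large volume of APT attack analysis reports and threat intelligence produced by security companies has great research value: it tracks the activity of APT organizations and thereby supports the traceability analysis of cyberattack incidents. APT analysis reports are nevertheless not used effectively, and automated methods for generating structured knowledge and building feature portraits of hacker organizations are lacking. This paper therefore proposes an automatic APT attack knowledge extraction method that combines entity recognition and entity alignment, aiming to extract entities automatically from APT analysis reports and form structured knowledge about APT organizations.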

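Methods: An automatic APT attack knowledge extraction method combining entity recognition and entity alignment is designed. First, 12 entity categories are defined according to the characteristics of APT attacks; a preprocessing layer lowercases, cleans, and annotates the corpus, and the preprocessed APT text sequences are represented as vectors. Second, Bert pre-training encodes each word and generates the corresponding character vectors; a BiLSTM model captures long-distance and contextual semantic features, and an attention mechanism highlights key features, converting the vector sequence into a label probability matrix. Third, a CRF decodes the relationships between the predicted output labels and produces the optimal label sequence. Finally, an entity alignment method based on semantic similarity and Birch is built; knowledge matching improves the quality of the extracted APT attack knowledge, which is then merged into a knowledge infobox for each APT organization.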
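The preprocessing step (lowercasing and annotation) can be illustrated with a minimal sketch. The abstract does not state the annotation scheme, so the BIO scheme, the label names `ORG` and `TOOL`, the helper `bio_tags`, and the example sentence are all assumptions made for illustration, not the authors' actual categories or data.

```python
def bio_tags(tokens, spans):
    """Turn entity spans into one BIO tag per token.

    spans: list of (start, end_exclusive, label) over token indices.
    The BIO scheme and label names are assumptions for illustration.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags


# Illustrative sentence (not taken from the paper's corpus).
sentence = "APT32 used Cobalt Strike and Mimikatz"
tokens = sentence.lower().split()           # lowercase conversion, whitespace tokens
spans = [(0, 1, "ORG"), (2, 4, "TOOL"), (5, 6, "TOOL")]
tags = bio_tags(tokens, spans)

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Each token ends up with exactly one label, which is the form a sequence labeler such as BiLSTM-CRF is trained on.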

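Results: For entity recognition, the proposed APT attack entity recognition method improves on common existing methods (CRF, LSTM-CRF, GRU-CRF, BiLSTM-CRF, CNN-CRF, and Bert-CRF), with a precision, recall, and F1-score of 0.929 6, 0.873 3, and 0.900 6, respectively. Its F1-score is 14.32% higher than that of CRF, 6.92% higher than that of CNN-CRF (which incorporates a convolutional neural network), 8.43% and 5.30% higher than those of LSTM-CRF and BiLSTM-CRF, respectively, 8.74% higher than that of GRU-CRF, and 7.03% higher than that of Bert-CRF. Its accuracy is 0.900 4, which is 9.85% above the average of the other six models. Training is more stable: the accuracy curve converges faster and reaches high accuracy in fewer training batches, and the error converges faster over the training epochs with a smoother curve. The model predicts the "attack method" entity category best, with an F1-score of 0.927 5. This is partly because the category contains many entities and partly because such entities occur widely in semantically rich descriptions of APT attack events and carry the action characteristics of attack behavior, which makes them easier to recognize. For entity recognition with small-sample annotation, the precision, recall, and F1-score of the proposed method are 0.780 0, 0.589 4, and 0.671 4, respectively. Its F1-score is 27.42% higher than that of CRF, 18.78% higher than that of LSTM-CRF, 23.62% higher than that of GRU-CRF, 13.25% higher than that of BiLSTM-CRF, 14.88% higher than that of CNN-CRF, and 14.46% higher than that of Bert-CRF. This experiment shows that the proposed method can use the Bert model to pre-train on a small-sample corpus and thereby improve entity recognition. For entity alignment and knowledge fusion, the experiment automatically extracts the high-frequency named entities of each entity category, which commonly appear in APT attack events. For example, common APT organizations include "APT29", "APT32", "APT28", and "Turla"; common attack equipment includes "PowerShell", "Cobalt Strike", and "Mimikatz"; common attack methods include "Spearphishing", "C2", "Watering Hole Attack", and "Backdoor"; and common vulnerabilities include "CVE-2017-11882", "CVE-2017-0199", and "CVE-2012-0158". The names of APT organizations are fused using corpus titles and keywords, and knowledge infoboxes are built for the common APT organizations in the dataset, forming structured knowledge for each organization; the attack domain knowledge of APT28 and APT32 is shown in detail.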
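The idea of merging different surface forms of the same entity before building an infobox can be sketched as follows. This is a deliberately simplified stand-in: it uses character-bigram Jaccard similarity with greedy threshold grouping, not the semantic similarity and Birch clustering the paper describes. The function names, the threshold of 0.5, and the mention list are hypothetical.

```python
import re


def normalize(name):
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def bigrams(text):
    """Set of adjacent character pairs in text."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def similarity(a, b):
    """Jaccard similarity of character bigrams after normalization."""
    a, b = normalize(a), normalize(b)
    if a == b:
        return 1.0
    x, y = bigrams(a), bigrams(b)
    if not x or not y:
        return 0.0
    return len(x & y) / len(x | y)


def align(mentions, threshold=0.5):
    """Greedy grouping: a mention joins the first group whose first
    member is similar enough, otherwise it starts a new group."""
    groups = []
    for mention in mentions:
        for group in groups:
            if similarity(mention, group[0]) >= threshold:
                group.append(mention)
                break
        else:
            groups.append([mention])
    return groups


# Hypothetical mentions as they might come out of entity recognition.
mentions = ["APT28", "apt-28", "APT 28", "APT32", "Cobalt Strike", "CobaltStrike"]
groups = align(mentions)
for group in groups:
    print(group)
```

Spelling variants of one name collapse into a single group, while "APT28" and "APT32" stay apart because they share only two of six distinct bigrams. A real system would need semantic information to merge aliases that share no characters.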

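Conclusions: Based on the characteristics of APT attacks, this paper designs and implements an automatic APT attack knowledge extraction method that combines entity recognition and entity alignment. The method effectively recognizes APT attack entities, automatically extracts advanced persistent threat knowledge when few annotated samples are available, and generates structured feature portraits of common APT organizations, which will support subsequent APT attack knowledge graph construction and attack traceability analysis.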

Key words: advanced persistent threat, threat intelligence extraction, entity recognition, entity alignment, deep learning

Abstract:

Objectives: In today's complex and changing network security environment, countering Advanced Persistent Threat (APT) attacks has become an urgent problem for the entire security community. The large volume of APT attack analysis reports and threat intelligence produced by security companies has significant research value: it tracks the activities of APT organizations and thereby assists the traceability analysis of network attack events. However, APT analysis reports have not been fully utilized, and automated methods for generating structured knowledge and constructing feature portraits of hacker organizations are lacking. To address this, an automatic knowledge extraction method for APT attacks that combines entity recognition and entity alignment is proposed. The method automatically extracts entities from APT analysis reports and constructs structured knowledge of APT organizations.

Methods: An automatic extraction method for APT attack knowledge that integrates entity recognition and entity alignment is designed. First, 12 entity categories are defined according to the characteristics of APT attacks. A preprocessing layer then performs lowercase conversion, data cleaning, and data annotation on the corpus, and the preprocessed APT text sequences are represented as vectors. Second, the pre-trained Bert model encodes each word and generates the corresponding vector. A BiLSTM model captures long-distance and contextual semantic features, and an attention mechanism highlights key features, converting the vector sequence into an annotation probability matrix. Third, the CRF algorithm decodes the relationships between the predicted output labels and generates the optimal label sequence. Finally, an entity alignment method based on semantic similarity and Birch is constructed. It improves the quality of the extracted APT attack knowledge through knowledge matching, and the matched knowledge is merged into the infobox of each APT organization.
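The CRF decoding step can be illustrated with a small Viterbi sketch: given per-token label scores (the role played by the annotation probability matrix) and label-to-label transition scores, it finds the highest-scoring label sequence. The tag set, all scores, and the function name `viterbi` are hypothetical values chosen for illustration, not taken from the paper.

```python
def viterbi(emissions, transitions, tags):
    """Return the highest-scoring tag sequence.

    emissions[t][j]: score of tag j at position t.
    transitions[i][j]: score of moving from tag i to tag j.
    """
    n_tags = len(tags)
    score = list(emissions[0])              # best score ending in each tag so far
    backptr = []                            # best previous tag for each step
    for t in range(1, len(emissions)):
        new_score, ptr = [], []
        for j in range(n_tags):
            best_i = max(range(n_tags), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emissions[t][j])
            ptr.append(best_i)
        score = new_score
        backptr.append(ptr)
    best = max(range(n_tags), key=lambda j: score[j])
    path = [best]
    for ptr in reversed(backptr):           # follow back-pointers to the start
        path.append(ptr[path[-1]])
    path.reverse()
    return [tags[i] for i in path]


tags = ["O", "B-TOOL", "I-TOOL"]            # hypothetical tag set
emissions = [                               # toy scores for three tokens
    [2.0, 0.5, 0.0],
    [0.2, 1.0, 1.2],
    [0.1, 0.3, 1.5],
]
transitions = [                             # rows: from-tag, columns: to-tag
    [0.0, 0.0, -10.0],                      # O -> I-TOOL is heavily penalized
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]

decoded = viterbi(emissions, transitions, tags)
greedy = [tags[max(range(len(tags)), key=lambda j: row[j])] for row in emissions]
print("greedy :", greedy)
print("viterbi:", decoded)
```

Picking the best label per token independently yields the invalid sequence O, I-TOOL, I-TOOL, whereas the joint decoding returns O, B-TOOL, I-TOOL. Modeling dependencies between adjacent labels in this way is the usual motivation for placing a CRF on top of a BiLSTM.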

Results: In terms of entity recognition, the proposed APT attack entity recognition method outperforms common existing entity recognition methods (i.e., CRF, LSTM-CRF, GRU-CRF, BiLSTM-CRF, CNN-CRF, and Bert-CRF). Its precision, recall, and F1-score are 0.929 6, 0.873 3, and 0.900 6, respectively. Compared with CRF, the F1-score of the proposed model is increased by 14.32%. Compared with CNN-CRF, which integrates convolutional neural networks, the F1-score is increased by 6.92%. Compared with LSTM-CRF and BiLSTM-CRF, the F1-score is increased by 8.43% and 5.30%, respectively. Compared with GRU-CRF, the F1-score is increased by 8.74%. Compared with Bert-CRF, the F1-score is increased by 7.03%. In addition, the accuracy of the proposed model is 0.900 4, which is 9.85% higher than the average of the other six models. The training process of the proposed model is also more stable: the accuracy curve converges faster and reaches higher accuracy with fewer training batches, and the error converges faster over the training epochs with a smoother curve. Moreover, the proposed model performs best on the "attack method" entity category, with an F1-score of 0.927 5. On the one hand, this category contains a large number of entities. On the other hand, entities of this category appear widely in semantically rich APT attack events and carry the action characteristics of attack behavior, which makes them easier to recognize. In terms of entity recognition with small-sample annotation, the precision, recall, and F1-score of the proposed method are 0.780 0, 0.589 4, and 0.671 4, respectively. Compared with the CRF, LSTM-CRF, GRU-CRF, BiLSTM-CRF, CNN-CRF, and Bert-CRF models, the F1-score of the proposed model is improved by 27.42%, 18.78%, 23.62%, 13.25%, 14.88%, and 14.46%, respectively. This experiment shows that the proposed method can perform pre-training on a small-sample corpus through the Bert model, thereby improving entity recognition. In terms of entity alignment and knowledge fusion, the experiment automatically extracts the high-frequency named entities of each entity category, which often appear in APT attack events. For example, common APT organizations include "APT29", "APT32", "APT28", and "Turla"; common attack equipment includes "PowerShell", "Cobalt Strike", and "Mimikatz"; common attack methods include "Spearphishing", "C2", "Watering Hole Attack", and "Backdoor"; and common vulnerabilities include "CVE-2017-11882", "CVE-2017-0199", and "CVE-2012-0158". The proposed method combines corpus titles and keywords to fuse the names of APT organizations. Finally, the infoboxes of the common APT organizations in this dataset are constructed, forming structured knowledge of each APT organization, and the attack domain knowledge of APT28 and APT32 is shown in detail.
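As a quick consistency check, the reported F1-scores can be recomputed from the reported precision and recall with the standard definition F1 = 2PR / (P + R). The function name `f1_score` is an illustrative choice; the input values are the ones quoted in the Results.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


full = f1_score(0.9296, 0.8733)    # full annotated corpus
small = f1_score(0.7800, 0.5894)   # small-sample annotation

print(f"full corpus : {full:.4f}")   # 0.9006
print(f"small sample: {small:.4f}")  # 0.6714
```

Both values agree with the reported F1-scores of 0.900 6 and 0.671 4 to four decimal places.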

Conclusions: Based on the characteristics of APT attacks, an automatic extraction method for APT attack knowledge that combines entity recognition and entity alignment is designed and implemented. The method effectively identifies APT attack entities, automatically extracts advanced persistent threat knowledge when few annotated samples are available, and generates structured feature portraits of common APT organizations, which will support subsequent APT attack knowledge graph construction and attack traceability analysis.

Key words: advanced persistent threat, threat intelligence extraction, entity recognition, entity alignment, deep learning
