Code vulnerability detection method based on graph neural network

doi:10.11959/j.issn.2096-109x.2021039

Abstract

Most neural-network-based vulnerability detection schemes follow traditional natural language processing approaches: they treat source code as sequence samples and ignore the structural features of the code, and so may miss vulnerabilities. This paper proposes a code vulnerability detection method based on graph neural networks that performs function-level detection using control flow graph features of the intermediate language. First, the source code is compiled into an intermediate representation, from which the control flow graph carrying structural information is extracted; in parallel, a word-vector embedding algorithm initializes the basic-block vectors to capture code semantics. The two are then combined into graph-structured samples, and a multilayer graph neural network is trained and tested on these samples. The method was evaluated on test data generated from an open-source vulnerability sample dataset, and the results show that it effectively improves vulnerability detection capability.

Key words: vulnerability detection, graph neural network, control flow graph, intermediate representation
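
The pipeline in the abstract ends with a multilayer graph neural network applied to a control flow graph whose nodes are basic blocks with embedding vectors. The abstract does not specify the layer type, so the following is only a minimal sketch that assumes a GCN-style propagation rule, mean pooling, and a sigmoid readout. The four-block graph, the two-dimensional block features, and every weight value are hypothetical and chosen purely for illustration; in the method described above the features would come from the word-vector embedding step and the weights from training.

```python
import math

def normalized_adjacency(num_nodes, edges):
    """Return D^{-1/2} (A + I) D^{-1/2} for the graph given by `edges`."""
    a = [[0.0] * num_nodes for _ in range(num_nodes)]
    for i in range(num_nodes):
        a[i][i] = 1.0  # self-loop so each block keeps its own features
    for src, dst in edges:
        a[src][dst] = 1.0
        a[dst][src] = 1.0  # assumption: control-flow edges treated as symmetric
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(num_nodes)]
            for i in range(num_nodes)]

def matmul(x, y):
    """Plain dense matrix product of two lists of lists."""
    return [[sum(xi * yi for xi, yi in zip(row, col)) for col in zip(*y)]
            for row in x]

def gcn_layer(a_hat, h, w):
    """One propagation step: ReLU(A_hat @ H @ W)."""
    z = matmul(matmul(a_hat, h), w)
    return [[max(0.0, v) for v in row] for row in z]

def classify_graph(a_hat, h, layer_weights, readout_w, bias):
    """Stack GCN layers, mean-pool node states, return P(vulnerable)."""
    for w in layer_weights:
        h = gcn_layer(a_hat, h, w)
    pooled = [sum(col) / len(h) for col in zip(*h)]
    logit = sum(p * w for p, w in zip(pooled, readout_w)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical CFG of one function: entry -> (then | else) -> exit.
cfg_edges = [(0, 1), (0, 2), (1, 3), (2, 3)]
# Hypothetical 2-dimensional basic-block vectors (stand-ins for embeddings).
block_features = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [1.0, 1.0]]
# Hypothetical, untrained weights for a two-layer network.
layer_weights = [[[0.5, -0.2], [0.1, 0.3]], [[0.4, 0.0], [-0.3, 0.6]]]

a_hat = normalized_adjacency(4, cfg_edges)
score = classify_graph(a_hat, block_features, layer_weights, [1.0, -1.0], 0.0)
print(f"predicted vulnerability score: {score:.4f}")
```

Each layer mixes a block's vector with those of its control-flow neighbours, so stacking layers lets information travel along longer paths in the graph; this is the structural signal that a purely sequential model does not see.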

CLC number: TP309

Hao CHEN, Ping YI. Code vulnerability detection method based on graph neural network[J]. Chinese Journal of Network and Information Security, 2021, 7(3): 37-45.


Table 4 Comparison of evaluation indicators of different models on different types of test datasets

Dataset	Method	Acc	TPR	FPR	F1
CWE-190	Flawfinder	0.5343	0.1037	0.0271	0.1834
CWE-190	RATS	0.5212	0.1003	0.0502	0.1746
CWE-190	LSTM	0.8018	0.4759	0.0840	0.5561
CWE-190	Our Model	0.8355	0.6089	0.0856	0.6565
CWE-401	Flawfinder	0.4131	0.1854	0.3220	0.2536
CWE-401	RATS	0.5459	0.5339	0.4400	0.5585
CWE-401	LSTM	0.7619	0.4560	0.1074	0.5296
CWE-401	Our Model	0.7713	0.4897	0.1178	0.5476
CWE-590	Flawfinder	0.5630	0.1233	0.0043	0.2187
CWE-590	RATS	0.5360	0.1040	0.0389	0.1819
CWE-590	LSTM	0.7677	0.5371	0.0580	0.6659
CWE-590	Our Model	0.8140	0.7331	0.1234	0.7747
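
To make the comparison in Table 4 easier to read, the short sketch below computes the absolute difference between the proposed model and the LSTM baseline on each dataset. The numbers are copied from the Acc, FPR, and F1 columns of the table; the dictionary layout and the `gain` helper are illustrative names, not part of the original work.

```python
# Values copied from Table 4 (Acc, FPR, F1) for the two neural models.
table4 = {
    "CWE-190": {"LSTM": {"Acc": 0.8018, "FPR": 0.0840, "F1": 0.5561},
                "Our Model": {"Acc": 0.8355, "FPR": 0.0856, "F1": 0.6565}},
    "CWE-401": {"LSTM": {"Acc": 0.7619, "FPR": 0.1074, "F1": 0.5296},
                "Our Model": {"Acc": 0.7713, "FPR": 0.1178, "F1": 0.5476}},
    "CWE-590": {"LSTM": {"Acc": 0.7677, "FPR": 0.0580, "F1": 0.6659},
                "Our Model": {"Acc": 0.8140, "FPR": 0.1234, "F1": 0.7747}},
}

def gain(dataset, metric):
    """Absolute difference (Our Model minus LSTM), rounded to 4 decimals."""
    row = table4[dataset]
    return round(row["Our Model"][metric] - row["LSTM"][metric], 4)

for name in table4:
    print(name, {m: gain(name, m) for m in ("Acc", "FPR", "F1")})
```

On these figures the proposed model has higher accuracy and F1 than the LSTM on all three datasets, with the smallest margin on CWE-401, while its false positive rate is also higher in each case.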



Sample type	Number of samples
CWE-190	12802
CWE-401	4574
CWE-590	6899

Actual class	Predicted: vulnerable	Predicted: not vulnerable
Vulnerable sample	TP	FN
Non-vulnerable sample	FP	TN
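
The four cells of the confusion matrix above are the inputs to the indicators reported in the comparison table (Acc, TPR, FPR, F1). The sketch below uses the standard definitions of these metrics, which are assumed here to match the ones used in the paper; the example counts are hypothetical and do not come from the experiments.

```python
def detection_metrics(tp, fn, fp, tn):
    """Compute Acc, TPR, FPR, and F1 from confusion-matrix counts."""
    total = tp + fn + fp + tn
    acc = (tp + tn) / total if total else 0.0
    tpr = tp / (tp + fn) if (tp + fn) else 0.0        # recall on vulnerable samples
    fpr = fp / (fp + tn) if (fp + tn) else 0.0        # false alarms on clean samples
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * tpr / (precision + tpr)
          if (precision + tpr) else 0.0)              # harmonic mean of precision and TPR
    return {"acc": acc, "tpr": tpr, "fpr": fpr, "f1": f1}

# Hypothetical counts: 100 vulnerable and 100 non-vulnerable samples.
metrics = detection_metrics(tp=60, fn=40, fp=10, tn=90)
print(metrics)
```

Reporting TPR and FPR alongside accuracy matters for this task because a detector can reach moderate accuracy while finding few real vulnerabilities, as a low TPR would reveal.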

Dataset	Model training time/s	Model testing time/s	Flawfinder/s	RATS/s
CWE-190	757.034	2.116	1.97	0.86
CWE-401	263.315	0.718	1.46	0.087
CWE-590	397.540	1.068	1.52	0.15