Deep learning vulnerability detection method based on optimized inter-procedural semantics of programs

doi: 10.11959/j.issn.2096-109x.2023085

Chinese Journal of Network and Information Security, 2023, Vol. 9, Issue (6): 86-101

• Academic paper •

Yan LI¹,²,³, Weizhong QIANG¹,²,³, Zhen LI¹,²,³, Deqing ZOU¹,²,³, Hai JIN¹,⁴

¹ Services Computing Technology and System Lab, National Engineering Research Center for Big Data Technology, Wuhan 430074, China
² Hubei Key Laboratory of Distributed System Security, Wuhan 430074, China
³ School of Cyber Science and Engineering, Huazhong University of Science and Technology, Wuhan 430074, China
⁴ School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China

Revised: 2023-07-28 Online: 2023-12-01 Published: 2023-12-01
About the authors:
Yan LI (1998- ), female, born in Weinan, Shaanxi, is a master's student at Huazhong University of Science and Technology. Her main research interests are deep learning and vulnerability detection.
Weizhong QIANG (1977- ), male, born in Nantong, Jiangsu, is a professor and doctoral supervisor at Huazhong University of Science and Technology. His main research interests are confidential computing, cloud computing security, and software security.
Zhen LI (1981- ), female, born in Baoding, Hebei, Ph.D., is an associate professor at Huazhong University of Science and Technology. Her main research interests are software security and artificial intelligence security.
Deqing ZOU (1975- ), male, born in Xiangtan, Hunan, Ph.D., is a professor and doctoral supervisor at Huazhong University of Science and Technology. His main research interests are cloud computing security, network attack and defense and vulnerability detection, software security, and privacy protection.
Hai JIN (1966- ), male, born in Shanghai, Ph.D., is a professor and doctoral supervisor at Huazhong University of Science and Technology. His main research interests are computer architecture, virtualization technology, cluster computing and cloud computing, and storage and security.
Supported by:
The National Natural Science Foundation of China (62272187); The Joint Funds of the National Natural Science Foundation of China (U1936211)

Abstract

In recent years, software vulnerabilities have caused numerous security incidents, and discovering and patching vulnerabilities early can effectively reduce losses. Traditional rule-based vulnerability detection methods rely on rules defined by experts and suffer from a high false negative rate. Deep learning-based methods can automatically learn the latent features of vulnerable programs; however, as software complexity increases, their precision decreases when they are applied to real software. On the one hand, most existing methods operate at the function level and therefore cannot handle vulnerability samples that span functions. On the other hand, models such as BGRU and BLSTM degrade when the input sequence is too long and are not adept at capturing long-term dependencies between program statements. To address these issues, the existing program slicing method was optimized: intra-procedural and inter-procedural slicing were combined to perform a comprehensive context analysis of cross-function vulnerabilities and to capture the complete causal relationship that triggers a vulnerability. In addition, a Transformer neural network model with a multi-head attention mechanism was applied to the vulnerability detection task. The model jointly attends to information from different representation subspaces to extract deep features of nodes; compared with recurrent neural networks, it resolves the information decay problem and learns the syntactic and semantic information of source programs more effectively. Experimental results show that the method achieves an F1 score of 73.4% on a real software dataset, 13.6 to 40.8 percentage points higher than the compared methods, and that it successfully detected several vulnerabilities in open-source software, demonstrating its effectiveness and practicality.

Key words: vulnerability detection, program slice, deep learning, attention mechanism
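
The combination of intra-procedural and inter-procedural slicing described in the abstract can be pictured as backward reachability over a dependence graph whose edges may cross function boundaries. The following is a minimal Python sketch of that idea, not the authors' implementation: the graph, the "function:line" statement labels, and the function `backward_slice` are all hypothetical, and a real system would derive the dependences from a program analysis tool rather than a hand-written dictionary.

```python
from collections import deque

# Hypothetical dependence graph: each statement maps to the statements it
# depends on (data or control dependence). Labels are "function:line".
DEPENDS_ON = {
    "copy_data:3": ["copy_data:1", "copy_data:2"],  # memcpy(dst, src, len)
    "copy_data:2": ["copy_data:1"],                 # dst = buf + len
    "copy_data:1": ["main:5"],                      # parameter len, bound at the call site
    "main:5": ["main:4"],                           # copy_data(n)
    "main:4": ["main:2"],                           # n = atoi(input)
    "main:2": [],                                   # input = read_input()
    "main:7": ["main:2"],                           # unrelated statement
}


def backward_slice(criterion, depends_on, same_function_only=False):
    """Return, sorted, every statement the criterion transitively depends on."""
    function = criterion.split(":")[0]
    seen = {criterion}
    queue = deque([criterion])
    while queue:
        statement = queue.popleft()
        for dep in depends_on.get(statement, []):
            if same_function_only and dep.split(":")[0] != function:
                continue  # an intra-procedural slice stops at the function boundary
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return sorted(seen)


intra = backward_slice("copy_data:3", DEPENDS_ON, same_function_only=True)
inter = backward_slice("copy_data:3", DEPENDS_ON)
```

In this toy graph the intra-procedural slice contains only the three statements of `copy_data`, while the inter-procedural slice also reaches back to the statement in `main` where the untrusted length originates, which is the kind of cross-function context a function-level detector would miss.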

CLC Number: TP311

Yan LI, Weizhong QIANG, Zhen LI, Deqing ZOU, Hai JIN. Deep learning vulnerability detection method based on optimized inter-procedural semantics of programs[J]. Chinese Journal of Network and Information Security, 2023, 9(6): 86-101.

Figures/Tables (19): Figures 1-9, Tables 1-10

References (38)

[1]	CVE[EB].
[2]	SKYBOX SECURITY[EB].
[3]	WU T M, WEN S, XIANG Y, et al. Twitter spam detection: survey of new approaches and comparative study[J]. Computers & Security, 2018, 76(7): 265-284.
[4]	LECUN Y, BENGIO Y, HINTON G. Deep learning[J]. Nature, 2015, 521(7553): 436-444.
[5]	LI Z, ZOU D Q, XU S H, et al. VulDeePecker: a deep learning-based system for vulnerability detection[C]// The Network and Distributed System Security Symposium (NDSS). 2018: 18-21.
[6]	CHAKRABORTY S, KRISHNA R, DING Y, et al. Deep learning based vulnerability detection: are we there yet[J]. IEEE Transactions on Software Engineering (TSE), 2021, 48(9): 3280-3296.
[7]	DAM H K, TRAN T, PHAM T, et al. Automatic feature learning for predicting vulnerable software components[J]. IEEE Transactions on Software Engineering (TSE), 2018, 47(1): 67-85.
[8]	LI Z, ZOU D Q, XU S H, et al. SySeVR: a framework for using deep learning to detect software vulnerabilities[J]. IEEE Transactions on Dependable and Secure Computing (TDSC), 2021, 19(4): 2244-2258.
[9]	XIAO Y, CHEN B H, YU C D, et al. MVP: detecting vulnerabilities using patch-enhanced vulnerability signatures[C]// Proceedings of USENIX Security Symposium. 2020.
[10]	LIN G J, WEN S, HAN Q L, et al. Software vulnerability detection using deep neural networks: a survey[J]. Proceedings of the IEEE, 2020, 108(10): 1825-1848.
[11]	LIN G J, ZHANG J, LUO W, et al. POSTER: vulnerability discovery with function representation learning from unlabeled projects[C]// Proceedings of the Conference on Computer and Communications Security (CCS). 2017.
[12]	ZHOU Y Q, LIU S Q, SIOW J, et al. Devign: effective vulnerability identification by learning comprehensive program semantics via graph neural networks[C]// Proceedings of Annual Conference on Neural Information Processing Systems (NeurIPS). 2019.
[13]	VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]// Proceedings of Advances in Neural Information Processing Systems (NIPS). 2017.
[14]	GHAFFARIAN S M, SHAHRIARI H R. Software vulnerability analysis and discovery using machine-learning and data-mining techniques[J]. ACM Computing Surveys (CSUR), 2017, 50(4): 1-36.
[15]	ACHARYA M. Mining API patterns as partial orders from source code: from usage scenarios to specifications[C]// Proceedings of Joint Meeting of the European Software Engineering Conference & the ACM SIGSOFT Symposium on the Foundations of Software Engineering. 2007.
[16]	SCANDARIATO R, WALDEN J, HOVSEPYAN A, et al. Predicting vulnerable software components via text mining[J]. IEEE Transactions on Software Engineering (TSE), 2014, 40(10): 993-1006.
[17]	YAMAGUCHI F, WRESSNEGGER C, GASCON H, et al. Chucky: exposing missing checks in source code for vulnerability discovery[C]// Proceedings of the Conference on Computer & Communications Security (CCS). 2013.
[18]	WHITE M, VENDOME C, LINARES-VASQUEZ M, et al. Toward deep learning software repositories[C]// Proceedings of IEEE/ACM Working Conference on Mining Software Repositories. 2015.
[19]	WANG S, LIU T Y, TAN L. Automatically learning semantic features for defect prediction[C]// Proceedings of IEEE/ACM 38th International Conference on Software Engineering (ICSE). 2016.
[20]	YAMAGUCHI F, GOLDE N, ARP D, et al. Modeling and discovering vulnerabilities with code property graphs[C]// IEEE Symposium on Security and Privacy (S&P). 2014.
[21]	SHAR L K, BRIAND L, TAN H. Web application vulnerability prediction using hybrid program analysis and machine learning[J]. IEEE Transactions on Dependable and Secure Computing (TDSC), 2015, 12(6): 688-707.
[22]	LIN G J, ZHANG J, LUO W, et al. Cross-project transfer representation learning for vulnerable function discovery[J]. IEEE Transactions on Industrial Informatics, 2018, 14(7): 3289-3297.
[23]	LIN G J, XIAO W, ZHANG J, et al. Deep learning-based vulnerable function detection: a benchmark[C]// Proceedings of International Conference on Information and Communications Security (ICICS). 2019.
[24]	LI Y, HUANG C L, WANG Z F, et al. Survey of software vulnerability mining methods based on machine learning[J]. Journal of Software, 2020, 31(7): 2040-2061.
[25]	WANG H T, YE G X, TANG Z Y, et al. Combining graph-based learning with automated data collection for code vulnerability detection[J]. IEEE Transactions on Information Forensics and Security (TIFS), 2020, 16: 1943-1958.
[26]	LI Y, WANG S H, NGUYEN T N. Vulnerability detection with fine-grained interpretations[C]// Proceedings of ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE). 2021.
[27]	YING R, BOURGEOIS D, YOU J X, et al. GNNExplainer: generating explanations for graph neural networks[C]// Proceedings of International Conference on Neural Information Processing Systems (NIPS). 2019.
[28]	SONNEKALB T. Machine-learning supported vulnerability detection in source code[C]// Proceedings of ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE). 2019.
[29]	RUSSELL R L, KIM L, LEI H H, et al. Automated vulnerability detection in source code using deep representation learning[C]// 2018 IEEE 17th International Conference on Machine Learning and Applications (ICMLA). 2018.
[30]	CHENG X, WANG H Y, HUA J Y, et al. DeepWukong: statically detecting software vulnerabilities using deep graph neural network[J]. ACM Transactions on Software Engineering and Methodology (TOSEM), 2021, 30(3): 1-33.
[31]	DUAN X, WU J Z, LUO T Y, et al. Vulnerability mining method based on code property graph and attention BiLSTM[J]. Journal of Software, 2020, 31(11): 3404-3420.
[32]	CHEN Z X, ZOU D Q, LI Z, et al. Intelligent vulnerability detection system based on abstract syntax tree[J]. Journal of Cyber Security, 2020, 5(4): 1-13.
[33]	HU Y T, WANG S Y, WU Y M, et al. A slice-level vulnerability detection and interpretation method based on graph neural network[J]. Journal of Software, 2023, 34(6): 65-82.
[34]	Joern[EB].
[35]	Neo4j graph platform[EB].
[36]	MIKOLOV T, CHEN K, CORRADO G, et al. Efficient estimation of word representations in vector space[J]. Computer Science, 2013(1).
[37]	Checkmarx[EB].
[38]	Flawfinder[EB].

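CWE-ID	Vulnerability type description
CWE-119	Buffer overflow: improper operations on a memory buffer that read from or write to memory locations outside the buffer's intended boundary
CWE-125	Out-of-bounds read: the software reads data past the end, or before the beginning, of the intended buffer
CWE-787	Out-of-bounds write: the software writes data past the end, or before the beginning, of the intended buffer
CWE-189	Numeric errors: weaknesses related to incorrect numeric calculation or conversion
CWE-190	Integer overflow or wraparound: the software performs a calculation that can overflow or wrap around while the logic assumes the result is always larger than the original value; other weaknesses may be introduced when the result is used for resource management or execution control
CWE-20	Improper input validation: the software receives input or data but does not validate, or incorrectly validates, that the input is safe or correctly processed; an attacker may exploit this to alter control flow, control arbitrary resources, or execute arbitrary code
CWE-369	Divide by zero: often occurs in calculations involving physical dimensions such as length, width, and height, and may crash the system
CWE-415	Double free: free() is called twice on the same memory address, which may modify unexpected memory locations
CWE-416	Use after free: memory is referenced after it has been freed, which may cause a crash, use of unexpected values, or code execution
CWE-476	NULL pointer dereference: the program dereferences a pointer that it expects to be valid but is actually NULL, typically causing a crash or exit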

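Source statement feature	Example
Parameter input of the vulnerable function	vul_func(CV,…)
Local variable declaration and definition	CV = func/fread/fopen/…
Global variable declaration and definition	extern int CV = …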

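Vulnerability type	CWE-ID	Key variables and sink statement features
	CWE-119
Improper memory operation	CWE-125	Memory-sensitive APIs such as malloc, memcpy, and memset, where the key variable is the argument related to memory size; array use, where the key variable is the array index; pointer use, where the key variable is the pointer itself or an operand of pointer arithmetic
	CWE-787
	CWE-20
	CWE-189
Improper numeric operation	CWE-190	Integer arithmetic, where the key variables are the operands of the arithmetic; may further lead to a buffer overflow vulnerability
	CWE-369	Division or modulo operation, where the key variable is the divisor
	CWE-415	Double free, where the key variable is the pointer; the vulnerability is triggered when free/delete is called a second time on the same memory
Improper pointer use	CWE-416	Use after free, where the key variable is the dangling pointer left after the memory is freed; the vulnerability is triggered when the pointer is used again after free/delete has released the memory
	CWE-476	NULL pointer dereference, where the key variable is a pointer that is uninitialized or assigned NULL, such as an uninitialized structure or function pointer; the vulnerability is triggered on its first use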
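
To make the sink features listed in the table above more concrete, here is a rough Python sketch that flags single C statements by naive regular-expression matching. It is an illustration only: the pattern set, the category strings, and the function `classify_sink` are assumptions made for this example and are not the rules used in the paper, which identifies key variables through program analysis rather than raw text. Plain integer arithmetic is deliberately not covered, since it cannot be recognized reliably from text alone.

```python
import re

# Illustrative patterns only, loosely following the sink features in the table.
SINK_PATTERNS = [
    # memory-sensitive API calls
    ("improper memory operation", re.compile(r"\b(malloc|memcpy|memset)\s*\(")),
    # array indexing such as buf[i]
    ("improper memory operation", re.compile(r"\w+\s*\[[^\]]+\]")),
    # releasing memory: candidate for double free or use after free
    ("improper pointer use", re.compile(r"\b(free|delete)\b")),
    # division (excluding comment markers) or modulo; naive, '%' in a format
    # string would also match
    ("improper numeric operation", re.compile(r"(?<![/*])/(?![/*])|%")),
]


def classify_sink(statement):
    """Return the sorted sink categories whose pattern matches the statement."""
    return sorted(
        {name for name, pattern in SINK_PATTERNS if pattern.search(statement)}
    )


examples = [
    "memcpy(dst, src, len);",
    "free(p);",
    "avg = total / count;",
    "int x = 0;",
]
labels = {statement: classify_sink(statement) for statement in examples}
```

A statement can fall into more than one category (for example an array write whose value comes from a division), which is why the function returns a list rather than a single label.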

CWE-ID	Number of CVEs	CWE-ID	Number of CVEs
CWE-119	970	CWE-190	95
CWE-125	220	CWE-369	30
CWE-787	88	CWE-415	18
CWE-20	366	CWE-416	129
CWE-189	241	CWE-476	181

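Parameter	Setting	Parameter	Setting
Encoder layers	3	Attention heads	8
Loss function	Cross-entropy loss	Optimizer	Adam
Learning rate	0.0005	Dropout rate	0.5
Batch size	16	Training epochs	50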
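
The settings in the table above can be gathered into a single configuration object. The sketch below uses a plain Python dataclass; the class name, the field names, and the helper `batches_per_epoch` are hypothetical, and only the values are taken from the table. Model dimensions such as the embedding size are not given in this section and are deliberately left out.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TrainingConfig:
    encoder_layers: int = 3         # number of Transformer encoder layers
    attention_heads: int = 8        # heads in the multi-head attention
    loss: str = "cross_entropy"     # loss function
    optimizer: str = "adam"         # optimization algorithm
    learning_rate: float = 0.0005
    dropout: float = 0.5
    batch_size: int = 16
    epochs: int = 50

    def batches_per_epoch(self, num_samples: int) -> int:
        """Mini-batches needed to cover num_samples once (last one may be smaller)."""
        return -(-num_samples // self.batch_size)  # ceiling division


config = TrainingConfig()
settings = asdict(config)  # plain dict, convenient for logging the run
```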


Hardware: CPU	Hardware: GPU	Hardware: RAM	Hardware: Disk	Software	Operating system
Intel Xeon Gold 6234 CPU @ 3.30 GHz	Quadro RTX 5000	132 GB	4 TB	Python2.7, Joern-0.3.1, Neo4j-community-2.1.5, Python3.6, PyTorch-1.5.1, VS Code	Linux version 5.4.0-77-generic

Slicing method	Acc	P	R	F1
SySeVR	60.2%	63.8%	59.6%	61.6%
MVP	61.8%	64.7%	61.5%	63.1%
This paper	71.4%	75.0%	71.9%	73.4%

Model	Acc	P	R	F1
BGRU	57.4%	55.0%	60.4%	57.5%
BLSTM	59.1%	56.5%	63.3%	59.7%
Transformer	71.4%	75.0%	71.9%	73.4%

Detection method	Acc	P	R	F1
Checkmarx	50.2%	50.3%	24.2%	32.6%
FlawFinder	50.4%	50.7%	26.6%	34.9%
SySeVR	60.0%	51.2%	57.7%	54.2%
Devign	58.8%	58.5%	61.1%	59.8%
This paper's method	71.4%	75.0%	71.9%	73.4%
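
As a consistency check on the comparison table above, F1 is the harmonic mean of precision P and recall R, that is, F1 = 2PR / (P + R). The short Python sketch below recomputes F1 for each row and the gap between this paper's method and each baseline. The variable and function names are chosen for this example; the figures are copied from the table.

```python
# (precision, recall, reported F1) in percent, copied from the comparison table.
RESULTS = {
    "Checkmarx": (50.3, 24.2, 32.6),
    "FlawFinder": (50.7, 26.6, 34.9),
    "SySeVR": (51.2, 57.7, 54.2),
    "Devign": (58.5, 61.1, 59.8),
    "This paper": (75.0, 71.9, 73.4),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# F1 recomputed from P and R, to compare against the reported column.
recomputed = {name: f1_score(p, r) for name, (p, r, _) in RESULTS.items()}

# Gap in reported F1 between this paper's method and each baseline.
ours = RESULTS["This paper"][2]
gaps = {
    name: round(ours - f1, 1)
    for name, (_, _, f1) in RESULTS.items()
    if name != "This paper"
}
```

Because the table values are percentages, the gaps are differences in percentage points; the smallest and largest gaps correspond to the improvement range quoted in the abstract.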

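Vulnerability type	Vulnerable file path	Root cause analysis
CWE-476	src/storage/gstor/zekernel/kernel/table/*.c	A pointer set to NULL may be dereferenced; it needs to be initialized before use
CWE-415	src/gausskernel/optimizer/commands/*.cpp	Risk of double free; the pointer should be set to NULL promptly after it is freed
CWE-369	src/common/backend/utils/adt/*.cpp	The divisor is initialized to 0, but it is not checked for zero before the division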
