Feature dependence graph based source code loophole detection method

doi:10.11959/j.issn.1000-436x.2023018

Journal on Communications, 2023, Vol. 44, Issue (1): 103-117.

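Hongyu YANG¹,², Haiyun YANG², Liang ZHANG³, Xiang CHENG⁴,⁵

¹ School of Safety Science and Engineering, Civil Aviation University of China, Tianjin 300300, China
² School of Computer Science and Technology, Civil Aviation University of China, Tianjin 300300, China
³ School of Information, University of Arizona, Tucson AZ 85721, USA
⁴ School of Information Engineering, Yangzhou University, Yangzhou 225127, China
⁵ Jiangsu Engineering Research Center for Knowledge Management and Intelligent Service, Yangzhou 225127, China

Revised: 2022-12-03 Online: 2023-01-25 Published: 2023-01-01
About the authors: Hongyu YANG (1969- ), male, born in Changchun, Jilin, Ph.D., professor at Civil Aviation University of China. His main research interest is network information security.
Haiyun YANG (1997- ), male, born in Baoji, Shaanxi, master's student at Civil Aviation University of China. His main research interest is network information security.
Liang ZHANG (1987- ), male, born in Tianjin, Ph.D., researcher at the University of Arizona, USA. His main research interests are reinforcement learning and deep learning based signal processing.
Xiang CHENG (1988- ), male, born in Urumqi, Xinjiang, Ph.D., experimentalist at Yangzhou University. His main research interests are network and system security, network security situation awareness, and APT attack detection.
Supported by:
The National Natural Science Foundation of China (U1833107)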

Abstract:

Existing source code vulnerability detection methods do not explicitly preserve the vulnerability-related semantic information in source code, which makes the features of vulnerable statements hard to extract and leads to a high false positive rate. To address this problem, a source code vulnerability detection method based on the feature dependence graph (FDG) is proposed. First, candidate vulnerability statements are extracted from function slices, and a feature dependence graph is generated by analyzing the control dependence chains and data dependence chains of these candidate statements. Second, a word vector model is used to generate the initial representation vector of each node in the feature dependence graph. Finally, a vulnerability detection neural network oriented to the feature dependence graph is constructed, in which a graph learning network learns the heterogeneous neighbor node information of the graph and a detection network extracts global features and performs vulnerability detection. Experimental results show that the recall and F1 score of the proposed method improve by 1.50% to 22.32% and by 1.86% to 16.69%, respectively, over existing methods.

Key words: source code, vulnerability detection, semantic information, dependence graph, neural network
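
The first step of the abstract (building a graph from the control and data dependence chains of a candidate statement) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy statements, the hand-written dependence edges, and the function name `build_fdg` are hypothetical, and the backward traversal is only one plausible reading of "analyzing the dependence chains", not the extraction rules of the paper, which this page does not give.

```python
from collections import deque

# Hypothetical toy function slice: statement id -> source text.
statements = {
    1: "int n = read_len();",
    2: "char buf[16];",
    3: "if (n > 0)",
    4: "memcpy(buf, src, n);",   # assumed candidate vulnerability statement
    5: "log_done();",
}

# Hypothetical dependence edges, written as (dependent, depended_on).
data_dep = {(4, 1), (4, 2), (3, 1)}   # 4 uses n and buf; 3 uses n
control_dep = {(4, 3)}                # 4 runs only when the test in 3 holds


def build_fdg(candidate, data_dep, control_dep):
    """Follow dependence edges backward from the candidate statement.

    Returns the reached statement ids and the typed edges between them,
    so that control and data edges stay distinguishable (heterogeneous).
    """
    typed = [(s, d, "data") for s, d in data_dep]
    typed += [(s, d, "control") for s, d in control_dep]
    nodes = {candidate}
    edges = set()
    queue = deque([candidate])
    while queue:
        current = queue.popleft()
        for src, dst, kind in typed:
            if src != current:
                continue
            edges.add((src, dst, kind))
            if dst not in nodes:
                nodes.add(dst)
                queue.append(dst)
    return nodes, edges


nodes, edges = build_fdg(4, data_dep, control_dep)
for src, dst, kind in sorted(edges):
    print(f"{statements[src]!r} --{kind}--> {statements[dst]!r}")
```

In this toy slice, statement 5 has no dependence relation with the candidate statement and is therefore left out of the graph, which is the intended effect of keeping only vulnerability-related statements.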

CLC number: TP393

Hongyu YANG, Haiyun YANG, Liang ZHANG, Xiang CHENG. Feature dependence graph based source code loophole detection method[J]. Journal on Communications, 2023, 44(1): 103-117.

Figures and tables (21)

Table 3 Accuracy comparison of the methods on the SARD dataset

Method	CWE20	CWE78	CWE129	CWE190	CWE400	CWE787	CWE789
Russell	72.45%	92.71%	85.97%	87.49%	88.58%	74.42%	89.24%
VulDeePecker	75.19%	95.71%	83.31%	86.97%	91.26%	75.62%	89.85%
μVulDeePecker	75.93%	95.95%	89.41%	87.52%	93.71%	82.38%	89.59%
SySeVR	78.58%	97.01%	90.25%	93.00%	94.59%	85.69%	91.90%
VulDeeLocator	76.68%	97.06%	91.37%	93.39%	95.28%	84.65%	90.94%
Devign	77.93%	96.66%	90.19%	95.53%	96.47%	85.09%	91.97%
Reveal	78.93%	97.28%	93.49%	97.32%	97.99%	85.58%	92.46%
FBLD	81.42%	98.37%	96.24%	98.17%	97.93%	87.59%	94.72%

Table 4 Precision comparison of the methods on the SARD dataset

Method	CWE20	CWE78	CWE129	CWE190	CWE400	CWE787	CWE789
Russell	38.18%	87.01%	79.88%	85.12%	77.78%	53.44%	67.43%
VulDeePecker	42.86%	95.96%	75.61%	82.53%	83.72%	55.56%	68.93%
μVulDeePecker	44.98%	88.31%	84.61%	75.21%	85.57%	71.73%	66.27%
SySeVR	49.36%	92.94%	83.33%	90.16%	88.64%	66.75%	75.37%
VulDeeLocator	46.35%	92.44%	85.56%	85.45%	88.95%	66.07%	69.11%
Devign	48.18%	91.83%	83.10%	94.19%	92.59%	65.82%	75.55%
Reveal	50.00%	93.02%	84.77%	94.89%	93.35%	66.62%	76.60%
FBLD	54.55%	94.31%	90.63%	97.00%	93.33%	69.23%	83.92%

Table 5 Recall comparison of the methods on the SARD dataset

Method	CWE20	CWE78	CWE129	CWE190	CWE400	CWE787	CWE789
Russell	49.65%	79.76%	70.54%	57.83%	83.14%	51.02%	56.41%
VulDeePecker	53.19%	84.76%	67.18%	57.76%	85.51%	54.66%	60.44%
μVulDeePecker	62.31%	94.79%	78.67%	71.36%	92.67%	61.59%	64.14%
SySeVR	63.83%	94.04%	83.98%	79.42%	92.64%	95.12%	69.30%
VulDeeLocator	65.84%	94.88%	85.39%	87.25%	94.50%	93.28%	73.01%
Devign	62.65%	93.67%	84.08%	86.64%	95.04%	94.75%	69.70%
Reveal	65.01%	95.23%	95.20%	93.86%	99.98%	94.90%	72.52%
FBLD	70.92%	98.80%	97.41%	95.27%	99.76%	98.40%	80.74%

Table 6 F1 score comparison of the methods on the SARD dataset

Method	CWE20	CWE78	CWE129	CWE190	CWE400	CWE787	CWE789
Russell	43.17%	83.23%	74.92%	68.87%	80.37%	52.20%	61.43%
VulDeePecker	47.47%	90.01%	71.15%	67.96%	84.61%	55.11%	64.40%
μVulDeePecker	52.25%	91.43%	81.53%	73.24%	88.98%	66.27%	65.19%
SySeVR	55.67%	93.49%	83.66%	84.45%	90.59%	78.45%	72.21%
VulDeeLocator	54.40%	93.64%	85.47%	86.34%	91.64%	77.35%	71.00%
Devign	54.47%	92.74%	83.59%	90.26%	93.79%	77.68%	72.51%
Reveal	56.53%	94.12%	89.68%	94.37%	96.55%	78.28%	74.50%
FBLD	61.67%	96.51%	93.89%	96.13%	96.44%	81.27%	82.30%
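
The F1 scores above are consistent with the standard definition, the harmonic mean of precision and recall, F1 = 2PR / (P + R). The short check below recomputes a few cells from the precision and recall tables; the function name `f1_score` and the choice of cells are illustrative assumptions, and the page itself does not state the formula.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) in percent, copied from the tables above.
rows = {
    ("Russell", "CWE20"): (38.18, 49.65, 43.17),
    ("Reveal", "CWE400"): (93.35, 99.98, 96.55),
    ("FBLD", "CWE20"): (54.55, 70.92, 61.67),
    ("FBLD", "CWE789"): (83.92, 80.74, 82.30),
}

for (method, cwe), (p, r, reported) in rows.items():
    computed = f1_score(p, r)
    print(f"{method} {cwe}: computed {computed:.2f}%, reported {reported:.2f}%")
```

Each recomputed value agrees with the reported one to two decimal places, which also serves as a quick sanity check on the transcribed table cells.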

Table 8 Time consumption comparison of the vulnerability detection models

Method	Training time/min	Average detection time/s
Russell	18	0.95
VulDeePecker	24	1.24
μVulDeePecker	29	1.78
SySeVR	36	2.25
VulDeeLocator	32	2.12
Devign	49	3.40
Reveal	56	3.51
FBLD	87	4.89

References (18)

[1]	LIN G J, WEN S, HAN Q L, et al. Software vulnerability detection using deep neural networks: a survey[J]. Proceedings of the IEEE, 2020, 108(10): 1825-1848.
[2]	MIAO Y T, CHEN C, PAN L, et al. Machine learning-based cyber attacks targeting on controlled information[J]. ACM Computing Surveys, 2022, 54(7): 1-36.
[3]	LI Z, ZOU D Q, XU S H, et al. SySeVR: a framework for using deep learning to detect software vulnerabilities[J]. IEEE Transactions on Dependable and Secure Computing, 2022, 19(4): 2244-2258.
[4]	ZHANG J, PAN L, HAN Q L, et al. Deep learning based attack detection for cyber-physical system cybersecurity: a survey[J]. IEEE/CAA Journal of Automatica Sinica, 2022, 9(3): 377-391.
[5]	QIU J Y, ZHANG J, LUO W, et al. A survey of android malware detection with deep neural models[J]. ACM Computing Surveys, 2021, 53(6): 1-36.
[6]	WANG H T, YE G X, TANG Z Y, et al. Combining graph-based learning with automated data collection for code vulnerability detection[J]. IEEE Transactions on Information Forensics and Security, 2021, 16: 1943-1958.
[7]	CHAKRABORTY S, KRISHNA R, DING Y, et al. Deep learning based vulnerability detection: are we there yet?[J]. IEEE Transactions on Software Engineering, 2022, 48(9): 3280-3296.
[8]	YAMAGUCHI F, GOLDE N, ARP D, et al. Modeling and discovering vulnerabilities with code property graphs[C]// Proceedings of 2014 IEEE Symposium on Security and Privacy. Piscataway: IEEE Press, 2014: 590-604.
[9]	RUSSELL R, KIM L, HAMILTON L, et al. Automated vulnerability detection in source code using deep representation learning[C]// Proceedings of 2018 17th IEEE International Conference on Machine Learning and Applications. Piscataway: IEEE Press, 2018: 757-762.
[10]	LI Z, ZOU D Q, XU S H, et al. VulDeePecker: a deep learning-based system for vulnerability detection[J]. arXiv Preprint, arXiv:1801.01681, 2018.
[11]	ZOU D Q, WANG S J, XU S H, et al. μVulDeePecker: a deep learning-based system for multiclass vulnerability detection[J]. IEEE Transactions on Dependable and Secure Computing, 2021, 18(5): 2224-2236.
[12]	LI Z, ZOU D Q, XU S H, et al. VulDeeLocator: a deep learning-based fine-grained vulnerability detector[J]. IEEE Transactions on Dependable and Secure Computing, 2022, 19(4): 2821-2837.
[13]	ZHOU Y, LIU S, SIOW J, et al. Devign: effective vulnerability identification by learning comprehensive program semantics via graph neural networks[J]. Advances in Neural Information Processing Systems, 2019, 32(1): 10197-10207.
[14]	LI Y J, TARLOW D, BROCKSCHMIDT M, et al. Gated graph sequence neural networks[J]. arXiv Preprint, arXiv:1511.05493, 2015.
[15]	ALLAMANIS M, BROCKSCHMIDT M, KHADEMI M. Learning to represent programs with graphs[J]. arXiv Preprint, arXiv:1711.00740, 2017.
[16]	WU Z H, PAN S R, CHEN F W, et al. A comprehensive survey on graph neural networks[J]. IEEE Transactions on Neural Networks and Learning Systems, 2021, 32(1): 4-24.
[17]	HERREMANS D, CHUAN C H. Modeling musical context with Word2Vec[J]. arXiv Preprint, arXiv:1706.09088, 2017.
[18]	HU Z N, DONG Y X, WANG K S, et al. Heterogeneous graph transformer[C]// Proceedings of The Web Conference 2020. New York: ACM Press, 2020: 2704-2710.

Type	Name	Function slices	FDGs	Vulnerable FDGs	Benign FDGs
CWE20	Improper input validation	3 452	4 015	846	3 169
CWE78	Command injection	17 000	18 420	4 200	14 220
CWE129	Improper validation of array index	11 208	10 019	2 977	7 042
CWE190	Integer overflow	25 913	28 943	6 925	22 018
CWE400	Resource exhaustion	9 990	10 023	2 744	7 279
CWE787	Out-of-bounds write	14 797	14 980	4 210	10 770
CWE789	Uncontrolled memory allocation	7 300	8 167	1 241	6 926

Dataset	Function slices	FDGs	Vulnerable FDGs	Benign FDGs
Devign	27 313	28 760	13 754	15 006
Reveal	22 725	20 109	2 753	17 356
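
The two dataset tables above can be cross-checked with a few lines of arithmetic: in every row the vulnerable and benign FDG counts add up to the FDG total, and the share of vulnerable FDGs shows how imbalanced each subset is. This is only a reader-side sanity check under the assumption that the counts are transcribed correctly; the helper name `vulnerable_share` is hypothetical.

```python
# name -> (FDGs, vulnerable FDGs, benign FDGs), copied from the tables above.
datasets = {
    "CWE20": (4015, 846, 3169),
    "CWE78": (18420, 4200, 14220),
    "CWE129": (10019, 2977, 7042),
    "CWE190": (28943, 6925, 22018),
    "CWE400": (10023, 2744, 7279),
    "CWE787": (14980, 4210, 10770),
    "CWE789": (8167, 1241, 6926),
    "Devign": (28760, 13754, 15006),
    "Reveal": (20109, 2753, 17356),
}


def vulnerable_share(total, vulnerable):
    """Percentage of FDGs in a dataset that are labeled vulnerable."""
    return 100.0 * vulnerable / total


for name, (total, vulnerable, benign) in datasets.items():
    # Each row should be internally consistent.
    assert vulnerable + benign == total, name
    print(f"{name}: {vulnerable_share(total, vulnerable):.2f}% vulnerable")
```

Except for Devign, vulnerable graphs are a clear minority in every subset, which is worth keeping in mind when reading accuracy next to precision and recall.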

Item	Minimum	Median	75th percentile	Maximum
Number of nodes	4	13	21	306
Number of edges	6	65	132	6 307
Time consumed/s	0.001	2.53	14.19	78.96
