Dual-granularity lightweight model for vulnerability code slicing method assessment

doi: 10.11959/j.issn.1000-436x.2021196

Journal on Communications, 2021, Vol. 42, Issue (11): 233-241.

Bing ZHANG¹,², Zheng WEN¹,², Yuxuan ZHAO¹, Ning WANG¹, Jiadong REN¹,²

¹ School of Information Science and Engineering, Yanshan University, Qinhuangdao 066004, China
² Key Laboratory of Software Engineering of Hebei Province, Qinhuangdao 066004, China

Revised: 2021-09-23  Published: 2021-11-25  Released online: 2021-11-01
About the authors: Bing ZHANG (1989- ), male, born in Huanggang, Hubei; Ph.D.; associate professor and master's supervisor at Yanshan University. His main research interests are data mining, machine learning, and software security.
Zheng WEN (1998- ), male, born in Baoding, Hebei; master's student at Yanshan University. His main research interest is software security.
Yuxuan ZHAO (1997- ), male, born in Qinhuangdao, Hebei; master's student at Yanshan University. His main research interests are text mining and software security.
Ning WANG (1994- ), female, born in Yangquan, Shanxi; master's student at Yanshan University. Her main research interest is software security.
Jiadong REN (1967- ), male, born in Qiqihar, Heilongjiang; Ph.D.; professor and doctoral supervisor at Yanshan University. His main research interests are temporal data modeling and software security.
Supported by:
The National Natural Science Foundation of China (61802332, 61807028, 61772449); The Doctoral Foundation Program of Yanshan University (BL18012)

Abstract:

To address the problems in existing assessments of vulnerability code slicing methods, namely incomplete extraction of slice information, high model complexity with poor generalization ability, and an open-loop assessment process without feedback, a dual-granularity lightweight vulnerability code slicing evaluation (VCSE) model is proposed. For code snippets, a lightweight fusion model of TF-IDF and N-gram is constructed, which efficiently bypasses the out-of-vocabulary (OOV) problem, and the semantic and statistical features of code slices are extracted at two granularities, words and characters. A heterogeneous ensemble classifier with high precision and good generalization performance is designed for vulnerability prediction and analysis. The experimental results show that the assessment performance of the lightweight VCSE is clearly better than that of widely used deep learning models.

Key words: code slicing, vulnerability prediction, out of vocabulary, lightweight, assessment method
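
One way to see why character-level features can sidestep the OOV problem mentioned in the abstract: an identifier that was never seen as a whole word still shares character n-grams with known identifiers. The sketch below only illustrates that idea. The function name `char_ngrams`, the trigram size, and the identifier `safe_memcopy` are hypothetical and are not taken from the paper.

```python
def char_ngrams(token, n=3):
    """Return the set of contiguous character n-grams of a token."""
    return {token[i:i + n] for i in range(len(token) - n + 1)}


# Assumption: "memcopy" was seen during training, "safe_memcopy" was not.
known = char_ngrams("memcopy")
unseen = char_ngrams("safe_memcopy")

# A word-level vocabulary has no entry for "safe_memcopy", but at the
# character level every trigram of "memcopy" still appears in it.
shared = known & unseen
print(sorted(shared))
```

A word-level lookup would map `safe_memcopy` to an unknown token, whereas the character-level view keeps the overlap with `memcopy` available as a feature.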

CLC number: TP309

Bing ZHANG, Zheng WEN, Yuxuan ZHAO, Ning WANG, Jiadong REN. Dual-granularity lightweight model for vulnerability code slicing method assessment[J]. Journal on Communications, 2021, 42(11): 233-241.

Figures/Tables (13): Fig. 1, Table 1, Table 2, Fig. 2, Table 3, Fig. 3, Fig. 4, Fig. 5, Table 4, Table 5, Table 6, Table 7, Fig. 6


N	Split result
1	‘memcopy’, ‘(’, ‘buf’, ‘str’, ‘len’, ‘)’
2	‘memcopy (’, ‘( buf’, ‘buf str’, ‘str len’, ‘len )’
3	‘memcopy ( buf’, ‘( buf str’, ‘buf str len’, ‘str len )’
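
The split in the table above can be reproduced with a sliding window over the token sequence. This is a minimal sketch, assuming the slice has already been tokenized into the six tokens of the first row; the helper name `word_ngrams` is hypothetical.

```python
def word_ngrams(tokens, n):
    """Return all contiguous n-token windows, joined by single spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ["memcopy", "(", "buf", "str", "len", ")"]
for n in (1, 2, 3):
    print(n, word_ngrams(tokens, n))
```

A sequence of 6 tokens yields 6 unigrams, 5 bigrams, and 4 trigrams, which matches the row lengths in the table.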

D	t₁	t₂	…	t_m
d₁	tfidf_1,1	tfidf_1,2	…	tfidf_1,m
d₂	tfidf_2,1	tfidf_2,2	…	tfidf_2,m
⋮	⋮	⋮	⋮	⋮
d_n	tfidf_n,1	tfidf_n,2	…	tfidf_n,m
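
A matrix with this layout, one row per code slice d_i and one column per term t_j, can be built as follows. This is a sketch under stated assumptions: the weighting used here (term frequency normalized by slice length, times the natural log of n divided by the document frequency) is one common TF-IDF variant and is not confirmed to be the exact formula used in the paper. The function name `tfidf_matrix` and the two toy slices are hypothetical.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Build an n x m TF-IDF matrix for n non-empty tokenized documents.

    Assumed weighting: tf = count / document length, idf = ln(n / df).
    """
    vocab = sorted({t for d in docs for t in d})      # columns t_1 .. t_m
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))     # document frequency
    rows = []
    for d in docs:                                    # rows d_1 .. d_n
        counts = Counter(d)
        rows.append([(counts[t] / len(d)) * math.log(n / df[t]) for t in vocab])
    return vocab, rows


docs = [
    ["memcopy", "(", "buf", "str", "len", ")"],
    ["strlen", "(", "str", ")"],
]
vocab, matrix = tfidf_matrix(docs)
for row in matrix:
    print([round(x, 3) for x in row])
```

Terms that occur in every slice, such as the parentheses here, get weight 0 under this variant, while terms specific to one slice get a positive weight. The terms could equally be the word or character n-grams from the previous step rather than single tokens.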

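Type	Number of vulnerable code slices	Number of non-vulnerable code slices	Proportion of vulnerable code slices
Buffer overflow	10,400	39,753	26.3%
Resource management	7,285	21,885	33.3%
Array usage	10,926	42,229	25.9%
Arithmetic expression	3,475	22,154	15.7%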

Classifier	FPR	FNR	TPR	P	F1
Word2Vec+BLSTM	2.9%	18.0%	82.0%	91.7%	86.6%
S_sub+SVM	3.1%	18.4%	81.6%	90.5%	85.8%
S_sub+RF	2.4%	14.8%	85.2%	92.7%	88.8%
S_sub+TextCNN	2.3%	20.2%	79.8%	92.5%	85.7%
VCSE	2.7%	11.4%	88.6%	92.1%	90.3%
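
The column names in the table above are assumed to follow the usual confusion-matrix definitions (false positive rate, false negative rate, true positive rate, precision, and F1); this page does not spell them out. The sketch below computes them from raw counts and recomputes F1 from P and TPR for the VCSE row. The function names and the example counts are hypothetical.

```python
def slice_metrics(tp, fp, tn, fn):
    """Compute FPR, FNR, TPR, P and F1 from confusion-matrix counts."""
    fpr = fp / (fp + tn)   # non-vulnerable slices flagged as vulnerable
    fnr = fn / (fn + tp)   # vulnerable slices that were missed
    tpr = tp / (tp + fn)   # recall; equals 1 - FNR
    p = tp / (tp + fp)     # precision
    return {"FPR": fpr, "FNR": fnr, "TPR": tpr, "P": p, "F1": f1_score(p, tpr)}


def f1_score(p, tpr):
    """Harmonic mean of precision and true positive rate."""
    return 2 * p * tpr / (p + tpr)


# Hypothetical counts, for illustration only.
m = slice_metrics(tp=90, fp=10, tn=190, fn=10)
print({k: round(v, 3) for k, v in m.items()})

# Recompute F1 for the VCSE row from its reported P and TPR.
print(round(100 * f1_score(0.921, 0.886), 1))
```

With P = 92.1% and TPR = 88.6%, the recomputed F1 rounds to 90.3%, in agreement with the reported value.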

Classifier	FPR	FNR	TPR	P	F1
Word2Vec+BLSTM	2.8%	4.7%	95.3%	94.6%	95.0%
S_sub+TextCNN	1.8%	7.6%	92.3%	96.2%	94.2%
VCSE	1.2%	2.6%	97.4%	97.7%	97.5%


Classifier	FPR	FNR	TPR	P	F1
Word2Vec+BLSTM	3.8%	17.1%	92.7%	88.3%	85.5%
S_sub+TextCNN	4.0%	12.9%	87.2%	88.5%	87.8%
VCSE	4.6%	9.8%	90.1%	87.2%	88.7%

Classifier	FPR	FNR	TPR	P	F1
Word2Vec+BLSTM	1.5%	18.3%	96.9%	87.9%	84.7%
S_sub+TextCNN	0.9%	53.1%	47.0%	90.8%	61.9%
VCSE	1.2%	12.7%	87.3%	93.0%	90.1%
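
The VCSE rows in the tables above come from what the abstract calls a heterogeneous ensemble classifier. This page does not say which base learners are used or how their outputs are combined, so the sketch below shows only one common combination rule, hard majority voting, as an assumption. The function name `majority_vote` and the three prediction lists are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier label lists into one label per sample.

    `predictions` holds one list of labels per base classifier, all of the
    same length. With an odd number of classifiers and binary labels there
    are no ties.
    """
    combined = []
    for votes in zip(*predictions):                  # votes for one sample
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined


# Hypothetical outputs of three different base classifiers on four slices
# (1 = vulnerable, 0 = not vulnerable).
preds = [
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
]
print(majority_vote(preds))
```

Each base classifier here is wrong on a different slice, and the vote recovers a label that at least two of the three agree on, which is the usual motivation for mixing dissimilar learners.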
