Vulnerability identification technology research based on project version difference
(面向项目版本差异性的漏洞识别技术研究)

Chinese Journal of Network and Information Security, 2022, Vol. 8, Issue (1): 52-62. doi: 10.11959/j.issn.2096-109x.2021094

Column: Security Perception and Detection Methods

Cheng HUANG (黄诚)¹,², Mingxu SUN (孙明旭)¹, Renyu DUAN (段仁语)¹, Susheng WU (吴苏晟)¹, Bin CHEN (陈斌)¹

¹ School of Cyber Science and Engineering, Sichuan University, Chengdu 610065, China
² Guangxi Key Laboratory of Cryptography and Information Security, Guilin 541000, China

Revised: 2021-10-12 Online: 2022-02-15 Published: 2022-02-01
About the authors:
Cheng HUANG (born 1987), male, from Chongqing, is an associate professor at Sichuan University. His main research interests are cyberspace security, attack detection, threat tracing, data mining, social networks, machine learning and natural language processing.
Mingxu SUN (born 2000), male, from Suihua, Heilongjiang. His main research interests are data mining, natural language processing and vulnerability intelligence analysis.
Renyu DUAN (born 1998), male, from Chongqing. His main research interests are vulnerability intelligence mining, computer vision and artificial intelligence.
Susheng WU (born 1999), male, from Hangzhou, Zhejiang. His main research interests are vulnerability discovery and the analysis and construction of open source code vulnerability databases.
Bin CHEN (born 1999), male, from Nanchang, Jiangxi. His main research interests are vulnerability discovery and natural language processing.
Supported by:
The National Natural Science Foundation of China (61902265); Sichuan Science and Technology Program (2020YFG0047); Guangxi Key Laboratory of Cryptography and Information Security (GCIS201921)

Abstract:

Open source code hosting platforms have brought vitality and opportunities to software development, but they also carry many security risks. Irregular open source code, complex project dependencies and the passive way in which vulnerability disclosure platforms collect vulnerabilities all affect the security of open source projects and of closed source projects that include open source components. Most vulnerability fixes are not noticed and identified in time, which exposes the security risks of these projects directly to attackers. To discover vulnerability-fixing behavior in open source projects comprehensively and promptly, a vulnerability identification system based on project version difference, VpatchFinder, was designed and implemented. The system automatically collects the updated code and content data of open source projects, then extracts and analyzes the code and the text descriptions before and after each update. Difference features based on security behavior and code characteristics were proposed: 40 features in four groups (project comment information, page statistics, code statistics and vulnerability type) form the feature set, and a random forest algorithm is used to train a classifier that identifies vulnerabilities. In tests on real vulnerability data, VpatchFinder achieves a precision of 84.35%, an accuracy of 85.46% and a recall of 85.09%, which is better than the other common machine learning models compared. A further experiment on a collected set of CVE vulnerabilities in open source software from past years shows that VpatchFinder detects 68.07% of them earlier than the corresponding CVE. These results can provide effective technical support for software security architecture design, development and composition analysis.

Key words: vulnerability detection, open source platform, security patch, machine learning

CLC number: TP393

Cheng HUANG, Mingxu SUN, Renyu DUAN, Susheng WU, Bin CHEN. Vulnerability identification technology research based on project version difference[J]. Chinese Journal of Network and Information Security, 2022, 8(1): 52-62.

Figures and tables (11)

Table 2 Numerical results of random forest and other algorithms

Algorithm	Accuracy	Recall	Precision	F1
SVM	0.7842	0.7749	0.7688	0.7720
GBDT	0.8532	0.8569	0.8143	0.8350
DT	0.8147	0.8085	0.7993	0.8038
XGBoost	0.8492	0.8593	0.8286	0.8437
AdaBoost	0.8098	0.7192	0.8562	0.7817
RF	0.8546	0.8509	0.8435	0.8472
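
The F1 column can be cross-checked against the precision and recall columns. The sketch below assumes the standard definition of F1 as the harmonic mean of precision and recall; the names `TABLE_2`, `f1_score` and `recomputed_f1` are illustrative and do not come from the paper.

```python
# Rows of Table 2: algorithm -> (accuracy, recall, precision, reported F1).
TABLE_2 = {
    "SVM": (0.7842, 0.7749, 0.7688, 0.7720),
    "GBDT": (0.8532, 0.8569, 0.8143, 0.8350),
    "DT": (0.8147, 0.8085, 0.7993, 0.8038),
    "XGBoost": (0.8492, 0.8593, 0.8286, 0.8437),
    "AdaBoost": (0.8098, 0.7192, 0.8562, 0.7817),
    "RF": (0.8546, 0.8509, 0.8435, 0.8472),
}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    return 2 * precision * recall / (precision + recall)


def recomputed_f1(table=TABLE_2) -> dict:
    """Recompute F1 for every row from its precision and recall."""
    return {
        name: round(f1_score(precision, recall), 4)
        for name, (_accuracy, recall, precision, _f1) in table.items()
    }


if __name__ == "__main__":
    for name, f1 in recomputed_f1().items():
        print(f"{name:8s} reported={TABLE_2[name][3]:.4f} recomputed={f1:.4f}")
```

The recomputed values agree with the reported column to within 0.0005. Small differences are expected because the precision and recall inputs are themselves rounded to four decimals.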

References (33)

[1]	ALFADEL M , COSTA D E , SHIHAB E ,et al. On the use of dependabot security pull requests[C]// Proceedings of 2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR). 2021: 254-265.
[2]	PASHCHENKO I , PLATE H , PONTA S E ,et al. Vulnerable open source dependencies:counting those that matter[C]// Proceedings of the 12th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement. 2018: 1-10.
[3]	SABETTA A , BEZZI M . A practical approach to the automatic classification of security-relevant commits[C]// Proceedings of 2018 IEEE International Conference on Software Maintenance and Evolution. 2018: 579-582.
[4]	KAMIYA T , KUSUMOTO S , INOUE K . CCFinder:a multilinguistic token-based code clone detection system for large scale source code[J]. IEEE Transactions on Software Engineering, 2002,28(7): 654-670.
[5]	LI Z , LU S , MYAGMAR S ,et al. CP-Miner:finding copy-paste and related bugs in large-scale software code[J]. IEEE Transactions on Software Engineering, 2006,32(3): 176-192.
[6]	WANG Y W , YAO X H , GONG Y Z ,et al. A method of buffer overflow detection based on static code analysis[J]. Journal of Computer Research and Development, 2012,49(4): 839-845.
[7]	WANG L , LI F , LI L ,et al. Principle and practice of taint analysis[J]. Journal of Software, 2017,28(4): 860-882.
[8]	YAMAGUCHI F , LOTTMANN M , RIECK K . Generalized vulnerability extrapolation using abstract syntax trees[C]// Proceedings of the 28th Annual Computer Security Applications Conference. 2012: 359-368.
[9]	LI J Y , ERNST M D . CBCD:cloned buggy code detector[C]// Proceedings of 2012 34th International Conference on Software Engineering (ICSE). 2012: 310-320.
[10]	LI Z , ZOU D Q , XU S H ,et al. VulDeePecker:a deep learning-based system for vulnerability detection[C]// Proceedings 2018 Network and Distributed System Security Symposium. 2018.
[11]	TIAN Y , LAWALL J , LO D . Identifying Linux bug fixing patches[C]// Proceedings of 2012 34th International Conference on Software Engineering (ICSE). 2012: 386-396.
[12]	PERL H , DECHAND S , SMITH M ,et al. VCCFinder:finding potential vulnerabilities in open-source projects to assist code audits[C]// Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security. 2015: 426-437.
[13]	ZAMAN S , ADAMS B , HASSAN A E . Security versus performance bugs:a case study on Firefox[C]// Proceedings of the 8th Working Conference on Mining Software Repositories. 2011: 93-102.
[14]	LI F , PAXSON V . A large-scale empirical study of security patches[C]// Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security. 2017: 2201-2215.
[15]	WANG X D , SUN K , BATCHELLER A ,et al. An empirical study of secret security patch in open source software[M]// Adaptive Autonomous Secure Cyber Systems. 2020: 269-289.
[16]	NEUHAUS S , ZIMMERMANN T , HOLLER C ,et al. Predicting vulnerable software components[C]// Proceedings of the 14th ACM Conference on Computer and Communications Security. 2007: 529-540.
[17]	ZHENG R F , FANG Y , LIU L . Homology analysis of malicious code based on dynamic-behavior fingerprint[J]. Journal of Sichuan University (Natural Science Edition), 2016,53(4): 793-798.
[18]	KONG D G , ZHENG Q , CHEN C ,et al. ISA:a source code static vulnerability detection system based on data fusion[C]// Proceedings of the 2nd International ICST Conference on Scalable Information Systems. 2007:55.
[19]	SONNEKALB T . Machine-learning supported vulnerability detection in source code[C]// Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 2019: 1180-1183.
[20]	LI Y C , CUI Y Q , LYU J F ,et al. Combined deep learning method for open source software vulnerability detection[J]. Computer Engineering and Applications, 2019,55(11): 52-59.
[21]	JIANG L X , MISHERGHI G , SU Z D ,et al. DECKARD:scalable and accurate tree-based detection of code clones[C]// Proceedings of 29th International Conference on Software Engineering (ICSE'07). 2007: 96-105.
[22]	ALON U , ZILBERSTEIN M , LEVY O ,et al. code2vec:learning distributed representations of code[J]. Proceedings of the ACM on Programming Languages, 2019,3:40.
[23]	LIU C , CHEN C , HAN J W ,et al. GPLAG:detection of software plagiarism by program dependence graph analysis[C]// Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2006: 872-881.
[24]	PHAM N H , NGUYEN T T , NGUYEN H A ,et al. Detection of recurring software vulnerabilities[C]// ASE '10:Proceedings of the IEEE/ACM International Conference on Automated Software Engineering. 2010: 447-456.
[25]	LIU K , FANG Y , ZHANG L ,et al. Malware clustering based on graph convolutional networks[J]. Journal of Sichuan University (Natural Science Edition), 2019,56(4): 654-660.
[26]	SHIN Y , MENEELY A , WILLIAMS L ,et al. Evaluating complexity,code churn,and developer activity metrics as indicators of software vulnerabilities[J]. IEEE Transactions on Software Engineering, 2011,37(6): 772-787.
[27]	NEIL L , MITTAL S , JOSHI A . Mining threat intelligence about open-source projects and libraries from code repository issues and bug reports[C]// Proceedings of 2018 IEEE International Conference on Intelligence and Security Informatics. 2018: 7-12.
[28]	CAO Y , LIU L , WANG Y ,et al. Software patch comparison technology through semantic analysis on function[J]. Chinese Journal of Network and Information Security, 2019,5(5): 56-63.
[29]	PONTA S E , PLATE H , SABETTA A ,et al. A manually-curated dataset of fixes to vulnerabilities of open-source software[C]// Proceedings of 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR). 2019: 383-387.
[30]	RAMOS J . Using TF-IDF to determine word relevance in document queries[J]. Proceedings of the First Instructional Conference on Machine Learning, 2003: 29-48.
[31]	RISTAD E S , YIANILOS P N . Learning string-edit distance[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1998,20(5): 522-532.
[32]	LYU W M , LIU J . The classification and analysis on safety holes of C/C++ programs[J]. Computer Engineering and Applications, 2005,41(5): 123-125,228.
[33]	BREIMAN L . Random forests[J]. Machine Learning, 2001,45(1): 5-32.

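Feature group	Feature name	Symbol	Feature No.	Source
Comment information group α	Count of security keywords in Subject	α_swc	1	First proposed
	Count of non-security keywords in Subject	α_nswc	2	First proposed
Page statistics group β	Number of changed files	β_cfn	3	Refs. [13,14]
	Number of changed hunks	β_ccn	4	Refs. [11,12]
	Number of changed lines	β_cln	5~10	Refs. [12,13]
	Number of changed characters	β_ccn	11~16	Refs. [11,15]
	Similarity between added and removed code	β_ars	17~19	Ref. [15]
	Maximum number of occurrences of the same code change	β_msn	20	First proposed
	Patch file size	β_fis	21	First proposed
Code statistics group γ	Number of changed conditional statements	γ_ccn	22~27	Refs. [11,15]
	Number of changed loop statements	γ_cln	28~33	Refs. [11,15]
	Total number of changed arithmetic, logical and relational operators	γ_con	34~39	First proposed
Vulnerability type group δ	Count of vulnerability-critical functions in changed code	δ_cwc	40	First proposed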
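
A few of the features in this table can be computed directly from a commit subject and a unified diff. The following is a minimal sketch of that idea, not the authors' implementation: the keyword and function lists, the feature names, the sample commit and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions made for illustration, since this extract does not give the paper's actual lists or similarity metric.

```python
import re
from difflib import SequenceMatcher

# Hypothetical word lists for illustration; the paper's lists are not given here.
SECURITY_WORDS = {"overflow", "vulnerability", "cve", "security", "leak", "crash"}
RISKY_CALLS = {"strcpy", "memcpy", "sprintf", "malloc", "free"}


def extract_features(subject: str, diff_text: str) -> dict:
    """Compute a small subset of patch features from a subject and a unified diff."""
    added, removed = [], []
    files = hunks = 0
    for line in diff_text.splitlines():
        if line.startswith("diff --git"):
            files += 1                      # one header per changed file
        elif line.startswith("@@"):
            hunks += 1                      # one header per changed hunk
        elif line.startswith("+") and not line.startswith("+++"):
            added.append(line[1:])
        elif line.startswith("-") and not line.startswith("---"):
            removed.append(line[1:])
    changed = added + removed
    subject_words = re.findall(r"[a-z]+", subject.lower())
    # Identifiers directly followed by "(" are treated as call-like tokens.
    calls = [name for ln in changed for name in re.findall(r"\b(\w+)\s*\(", ln)]
    return {
        "subject_security_words": sum(w in SECURITY_WORDS for w in subject_words),
        "changed_files": files,
        "changed_hunks": hunks,
        "added_lines": len(added),
        "removed_lines": len(removed),
        "added_chars": sum(len(ln) for ln in added),
        "removed_chars": sum(len(ln) for ln in removed),
        # Stand-in similarity measure (assumption), in the range [0, 1].
        "add_remove_similarity": SequenceMatcher(
            None, "\n".join(removed), "\n".join(added)
        ).ratio(),
        "changed_conditions": sum(
            bool(re.search(r"\b(if|else|switch)\b", ln)) for ln in changed
        ),
        "changed_loops": sum(bool(re.search(r"\b(for|while)\b", ln)) for ln in changed),
        "risky_calls": sum(name in RISKY_CALLS for name in calls),
    }


# Hypothetical commit used only to exercise the function.
SAMPLE_SUBJECT = "Fix buffer overflow in header parser"
SAMPLE_DIFF = """\
diff --git a/print.c b/print.c
--- a/print.c
+++ b/print.c
@@ -10,1 +10,3 @@
-    memcpy(buf, src, len);
+    if (len > sizeof(buf))
+        return -1;
+    memcpy(buf, src, len);
"""

if __name__ == "__main__":
    for key, value in extract_features(SAMPLE_SUBJECT, SAMPLE_DIFF).items():
        print(key, value)
```

A feature vector of this kind, extended to all 40 features, is what a classifier such as a random forest would be trained on. The line-prefix parsing is deliberately simple and would misread a removed line whose content itself begins with `--`.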

Detection time	Count	Share
Detected earlier than CVE	162	68.07%
Detected at the same time as CVE	11	4.62%
Detected later than CVE	65	27.31%
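
The shares in this table follow from the three counts, which sum to 238 samples. The sketch below shows one way such a comparison might be bucketed and tallied. The function names and the example date pairs are hypothetical; only the three counts come from the table.

```python
from collections import Counter
from datetime import date


def timing_bucket(detected: date, cve_published: date) -> str:
    """Classify a detection date relative to the CVE date."""
    if detected < cve_published:
        return "earlier"
    if detected == cve_published:
        return "same"
    return "later"


def shares(counts: dict) -> dict:
    """Percentage share of each bucket, rounded to two decimals."""
    total = sum(counts.values())
    return {key: round(100 * value / total, 2) for key, value in counts.items()}


# Counts as reported in the table above.
REPORTED = {"earlier": 162, "same": 11, "later": 65}

# Hypothetical (detection date, CVE date) pairs, only to show the bucketing step.
EXAMPLE_PAIRS = [
    (date(2019, 2, 11), date(2019, 3, 1)),
    (date(2018, 8, 22), date(2018, 8, 22)),
    (date(2017, 9, 5), date(2017, 9, 1)),
]
EXAMPLE_COUNTS = Counter(timing_bucket(d, c) for d, c in EXAMPLE_PAIRS)

if __name__ == "__main__":
    print(shares(REPORTED))
    print(dict(EXAMPLE_COUNTS))
```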

Detection system	Reported vulnerabilities	Manually verified vulnerabilities	Correct rate	False positive rate
VpatchFinder	236	192	81.36%	18.64%
Cppcheck	318	182	57.23%	42.77%
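
The two rate columns appear to be derived from the report counts: the correct rate is the manually verified count divided by the reported count, and the false positive rate is its complement. A short check under that assumption (the names `report_rates`, `TOOLS` and `RATES` are illustrative):

```python
def report_rates(reported: int, verified: int) -> tuple:
    """Return (correct rate, false positive rate) in percent, two decimals."""
    correct = round(100 * verified / reported, 2)
    return correct, round(100 - correct, 2)


# (reported, manually verified) counts taken from the table above.
TOOLS = {"VpatchFinder": (236, 192), "Cppcheck": (318, 182)}
RATES = {name: report_rates(*counts) for name, counts in TOOLS.items()}

if __name__ == "__main__":
    for name, (correct, false_positive) in RATES.items():
        print(f"{name}: correct={correct}% false_positive={false_positive}%")
```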

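Vulnerability source	Commit date	Vulnerability type	Description
tcpdump(5e48…fb8)	Mon, 13 Jul 2015	Stack overflow	The function juniper_parseHeader() in print-juniper.c has a buffer overflow vulnerability; an attacker can exploit it to mount a denial-of-service attack that crashes the application
radare2(9bd0…7fb)	Fri, 15 Nov 2013	Stack overflow	The data length of a variable is not checked in cmd_write.c, which may cause a stack overflow
micropython(2daa…93b)	Fri, 1 Sep 2017	Stack overflow	The format of input data is not validated; an attacker can craft exploit code that overflows the struct buffer in modstruct.c to carry out a malicious attack
ImageMagick(a464…0d0)	Mon, 11 Feb 2019	Heap overflow	When processing SVG images, the data length of the variable message is not checked, which leads to a heap overflow
php-src(6ebe…d3c)	Thu, 27 May 2010	Null pointer	In mysqlnd.c, the pointer variable result is used without a null check; a remote attacker can bypass access restrictions or mount a denial-of-service attack that crashes the application
php-src(cdd9…004)	Wed, 22 Aug 2018	Memory leak	In array.c, the variable result is not freed; if the callback function fails, a memory leak may occur, leading to a system crash or disclosure of sensitive information
php-src(0a1c…b68)	Sat, 15 Oct 2011	Integer overflow	The emalloc call in ext/soap/php_http.c has an integer overflow vulnerability; an attacker with permission to execute the script can exploit it for privilege escalation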
