Journal on Communications ›› 2016, Vol. 37 ›› Issue (11): 104-113. doi: 10.11959/j.issn.1000-436x.2016225

• Academic paper •

Fast reused code tracing method based on simhash and inverted index

Yan-chen QIAO (1,2,3), Xiao-chun YUN (1,2,3), Yu-peng TUO (2,3), Yong-zheng ZHANG (2,3)

  1 Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080, China
  2 Graduate School, Chinese Academy of Sciences, Beijing 100039, China
  3 Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China
  • Online: 2016-11-25  Published: 2016-11-30
  • Supported by:
    The National Natural Science Foundation of China; The National High Technology Research and Development Program of China (863 Program); The National 242 Information Security Research Program of China; The Strategic Priority Research Program of the Chinese Academy of Sciences

Abstract:

A novel method for fast and accurate tracing of reused code is proposed. Working at function granularity and built on simhash and an inverted index, the method quickly traces similar functions in massive code collections. First, a code database with a three-level inverted index structure is constructed from a large set of samples using simhash. For a function to be traced, similar code blocks are found quickly from the simhash values of the function's code blocks. Potentially similar functions are then retrieved through the inverted index, and truly similar functions are identified by comparing the jump relationships among their similar code blocks. Finally, the matching functions are traced back to the malware samples that contain them. Experimental results show that, while maintaining high accuracy and recall, the method uses the code database to quickly identify compiler-inserted functions and reused functions in samples.

Key words: network security, reused code, retrieval method, homology identification, malware
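
The pipeline summarized in the abstract (block-level simhash, inverted-index lookup of candidate functions, a jump-relationship check, then tracing to samples) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the 64-bit fingerprint, the MD5-based token hash, the four-band split used as the first index level, the thresholds, and the names `CodeIndex`, `add_function` and `trace` are all hypothetical.

```python
import hashlib
from collections import defaultdict

HASH_BITS = 64  # assumed fingerprint width


def simhash(tokens):
    """Simhash of a token list (e.g. normalized instructions of one code block)."""
    weights = [0] * HASH_BITS
    for tok in tokens:
        h = int.from_bytes(hashlib.md5(tok.encode("utf-8")).digest()[:8], "big")
        for i in range(HASH_BITS):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(HASH_BITS) if weights[i] > 0)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


class CodeIndex:
    """Hypothetical three-level index: band -> block hash -> function -> sample."""

    def __init__(self, bands=4):
        self.bands = bands
        self.band_bits = HASH_BITS // bands
        self.band_to_hashes = defaultdict(set)   # level 1
        self.hash_to_funcs = defaultdict(set)    # level 2
        self.func_to_samples = defaultdict(set)  # level 3
        self.func_info = {}  # function id -> (block hashes, jump edges)

    def _band_keys(self, h):
        mask = (1 << self.band_bits) - 1
        return [(i, (h >> (i * self.band_bits)) & mask) for i in range(self.bands)]

    def add_function(self, sample_id, func_id, blocks, edges):
        """blocks: list of token lists; edges: (src, dst) block-index jumps."""
        hashes = [simhash(b) for b in blocks]
        for h in hashes:
            for key in self._band_keys(h):
                self.band_to_hashes[key].add(h)
            self.hash_to_funcs[h].add(func_id)
        self.func_to_samples[func_id].add(sample_id)
        self.func_info[func_id] = (hashes, set(edges))

    def similar_hashes(self, h, max_dist):
        # Pigeonhole: hashes within distance bands-1 agree on at least one band.
        cands = set()
        for key in self._band_keys(h):
            cands |= self.band_to_hashes.get(key, set())
        return {c for c in cands if hamming(h, c) <= max_dist}

    def trace(self, blocks, edges, max_dist=3, min_ratio=0.8):
        """Return {function id: sorted sample ids} for functions judged similar."""
        query = [simhash(b) for b in blocks]
        matches = [self.similar_hashes(h, max_dist) for h in query]
        candidates = set()
        for m in matches:
            for h in m:
                candidates |= self.hash_to_funcs[h]
        results = {}
        for f in candidates:
            f_hashes, f_edges = self.func_info[f]
            # Simplification: map each query block to the first similar stored block.
            mapping = {}
            for qi, m in enumerate(matches):
                for fi, fh in enumerate(f_hashes):
                    if fh in m:
                        mapping[qi] = fi
                        break
            if len(mapping) < min_ratio * max(len(query), len(f_hashes)):
                continue
            # Jump check: every mapped query jump must exist in the candidate.
            if all((mapping[s], mapping[d]) in f_edges
                   for s, d in edges if s in mapping and d in mapping):
                results[f] = sorted(self.func_to_samples[f])
        return results


# Toy data: one function (three blocks, three jumps) seen in two samples.
blocks = [["push ebp", "mov ebp esp", "sub esp imm"],
          ["cmp eax imm", "jz loc", "inc ecx"],
          ["xor eax eax", "leave", "ret"]]
edges = {(0, 1), (0, 2), (1, 2)}
index = CodeIndex()
index.add_function("s1", "f1", blocks, edges)
index.add_function("s2", "f1", blocks, edges)
hits = index.trace(blocks, edges)
print(hits)
```

The band split only limits how many stored fingerprints are compared by Hamming distance; the final decision rests on the jump check, so a query with the same blocks but different control flow is rejected. A real system would need a one-to-one block matching rather than the first-match mapping used here.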
