Fast reused code tracing method based on simhash and inverted index

doi:10.11959/j.issn.1000-436x.2016225

Abstract

A novel method for fast and accurate tracing of reused code is proposed. Working at function granularity and built on simhash and inverted indexing, the method quickly traces similar functions in massive code collections. First, a code database with a three-level inverted index structure is constructed from a large number of samples using simhash. For a function to be traced, similar code blocks are found quickly from the simhash values of the code blocks in the function. Potentially similar functions are then retrieved through the inverted index, and whether they are truly similar is decided precisely by comparing the jump relationships between the matched code blocks. Finally, the similar functions are traced back to the samples that contain them. The experimental results show that, while maintaining high accuracy and recall, the method can use the code database to quickly identify compiler-inserted functions and reused functions in samples.

Key words: network security, reused code, retrieval method, homology identification, malware
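As a rough illustration of the pipeline described in the abstract, the Python sketch below fingerprints code blocks with a 64-bit simhash and files them in a small inverted index (fingerprint to functions, function to samples). The class name `CodeIndex`, the MD5-based token hash, the sample name, and the two-map layout are assumptions made for illustration. The paper's actual three-level index structure and matching thresholds are not given in this section, and the linear scan over fingerprints is used only for clarity.

```python
import hashlib
from collections import defaultdict


def simhash(tokens, bits=64):
    """Charikar-style simhash over a list of string tokens (bits <= 128)."""
    weights = [0] * bits
    for tok in tokens:
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if weights[i] > 0)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


class CodeIndex:
    """Hypothetical index: block fingerprint -> functions -> samples."""

    def __init__(self):
        self.block_to_funcs = defaultdict(set)
        self.func_to_samples = defaultdict(set)

    def add_function(self, sample, func_id, blocks):
        # blocks: list of code blocks, each a list of normalized instructions
        for block in blocks:
            self.block_to_funcs[simhash(block)].add(func_id)
        self.func_to_samples[func_id].add(sample)

    def candidates(self, blocks, max_distance=0):
        """Count, per indexed function, how many query blocks have a near match."""
        hits = defaultdict(int)
        for block in blocks:
            fp = simhash(block)
            for known_fp, funcs in self.block_to_funcs.items():
                if hamming(fp, known_fp) <= max_distance:
                    for func_id in funcs:
                        hits[func_id] += 1
        return dict(hits)


index = CodeIndex()
index.add_function(
    "sample_a.exe",  # hypothetical sample name
    "sub_4023BC",
    [["push REG32", "mov REG32, MEM"], ["pop REG32", "retn"]],
)
query_hits = index.candidates([["pop REG32", "retn"]])
```

Functions returned by `candidates` would only be potential matches; the method described above still checks the jump relationships between blocks before declaring two functions similar.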

QIAO Yan-chen, YUN Xiao-chun, TUO Yu-peng, ZHANG Yong-zheng. Fast reused code tracing method based on simhash and inverted index[J]. Journal on Communications, 2016, 37(11): 104-113.

Figures and tables (12): Fig. 1, Fig. 2, Table 1, Table 2, Fig. 3, Fig. 4, Fig. 5, Fig. 6, Table 3, Table 4, Table 5, Table 6


Code block name	Content
loc_4023BC	push esi
	mov esi, [esp+4+arg_0]
	push 0
	and dword ptr [esi], 0
	call ds:GetModuleHandleA
	cmp word ptr [eax], 5A4Dh
	jnz short loc_4023E7
loc_4023E7	pop esi
	retn
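The jump relationships that the method compares can be pictured as edges between code blocks. The Python sketch below derives such edges from the listing in the table above. It assumes that `loc_4023BC` names the first seven instructions and `loc_4023E7` the last two, which is one reading of the table rather than something the source states outright. The function name `jump_edges` and the edge labels are hypothetical.

```python
import re

# Matches a jump to a local label, e.g. "jnz short loc_4023E7".
JUMP = re.compile(r"^(j\w+)\s+(?:short\s+)?(loc_[0-9A-Fa-f]+)$")


def jump_edges(blocks):
    """blocks: ordered list of (name, [instructions]).

    Returns a set of (source, target, kind) edges, where kind is
    "jump" or "fallthrough".
    """
    edges = set()
    for i, (name, instrs) in enumerate(blocks):
        last = instrs[-1].strip()
        nxt = blocks[i + 1][0] if i + 1 < len(blocks) else None
        m = JUMP.match(last)
        if m:
            edges.add((name, m.group(2), "jump"))
            # A conditional jump may also continue into the next block.
            if m.group(1) != "jmp" and nxt is not None:
                edges.add((name, nxt, "fallthrough"))
        elif not last.startswith("ret") and nxt is not None:
            edges.add((name, nxt, "fallthrough"))
    return edges


blocks = [
    ("loc_4023BC", [
        "push esi",
        "mov esi, [esp+4+arg_0]",
        "push 0",
        "and dword ptr [esi], 0",
        "call ds:GetModuleHandleA",
        "cmp word ptr [eax], 5A4Dh",
        "jnz short loc_4023E7",
    ]),
    ("loc_4023E7", ["pop esi", "retn"]),
]
edges = jump_edges(blocks)
```

Because the table seems to show only two blocks of the function, both the jump edge and the fall-through edge end at `loc_4023E7` here; a complete listing would normally have intermediate blocks between them.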

Original code block	Normalized code block
push esi	push REG32
mov esi, [esp+4+arg_0]	mov REG32, MEM
push 0	push VAL
and dword ptr [esi], 0	and MEM, VAL
call ds:GetModuleHandleA	call GetModuleHandleA
cmp word ptr [eax], 5A4Dh	cmp MEM, VAL
jnz short loc_4023E7	jnz loc_xxx
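The normalization in the table above can be approximated with a few rewriting rules. The Python sketch below is an assumption-laden reconstruction from these seven rows only: 32-bit registers become `REG32`, bracketed memory operands become `MEM`, immediates become `VAL`, segment prefixes on call targets are dropped, and jump targets become `loc_xxx`. The function names and the register list are hypothetical, and the paper may apply further rules that this section does not show.

```python
import re

REG32 = {"eax", "ebx", "ecx", "edx", "esi", "edi", "ebp", "esp"}
# Decimal immediates (e.g. "0") or IDA-style hex immediates (e.g. "5A4Dh").
IMMEDIATE = re.compile(r"^(?:[0-9][0-9A-Fa-f]*h|[0-9]+)$")


def normalize_operand(operand):
    operand = operand.strip()
    if "[" in operand:              # e.g. "dword ptr [esi]"
        return "MEM"
    if operand.lower() in REG32:
        return "REG32"
    if IMMEDIATE.match(operand):
        return "VAL"
    return operand                  # leave anything unrecognized as-is


def normalize(instruction):
    mnemonic, _, rest = instruction.strip().partition(" ")
    rest = rest.strip()
    if not rest:
        return mnemonic
    if mnemonic == "call":
        # "call ds:GetModuleHandleA" -> "call GetModuleHandleA"
        return "call " + rest.split(":")[-1]
    if mnemonic.startswith("j"):
        target = re.sub(r"^short\s+", "", rest)
        target = re.sub(r"loc_[0-9A-Fa-f]+", "loc_xxx", target)
        return f"{mnemonic} {target}"
    operands = [normalize_operand(op) for op in rest.split(",")]
    return f"{mnemonic} " + ", ".join(operands)


normalized = [normalize(line) for line in [
    "push esi",
    "mov esi, [esp+4+arg_0]",
    "and dword ptr [esi], 0",
]]
```

Normalizing away concrete registers, addresses, and constants is what lets two copies of the same function, compiled or linked at different addresses, produce the same or nearby simhash values.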

Ground truth	Traced	Not traced
Similar	1 486	62
Not similar	131	-
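If the table above is read as a confusion matrix (an interpretation, not something this section states explicitly), then similar-and-traced functions are true positives, similar-but-not-traced functions are false negatives, and not-similar-but-traced functions are false positives. Under that assumption, precision and recall can be computed as in the short Python sketch below; the function name is hypothetical, and the figures reported in the paper itself may be defined differently.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true positives, false positives, false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall


# Counts taken from the table: 1 486 traced and similar, 131 traced but
# not similar, 62 similar but not traced.
precision, recall = precision_recall(tp=1486, fp=131, fn=62)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

With these counts the sketch gives a precision of about 0.919 and a recall of about 0.960.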

Compiler-inserted function	Similar functions in WinXP system files (file: function, among others)
sub_40451A	imjputyc.dll: ___old_sbh_decommit_pages
sub_402A5B	TPPS.DLL: sub_10005660
sub_4023E9	TPVMW32.dll: sub_10006046
sub_4044C4	MSCOMCTL.OCX: sub_27608F15
sub_4023BC	tprdpw32.dll: sub_10009720
sub_404380	imjpuex.exe: ___old_sbh_new_region
sub_4045DC	tprdpw32.dll: sub_1000A73D
sub_404880	msvcr70.dll: ___old_sbh_alloc_block_from_page
sub_402531	tprdpw32.dll: sub_10009895
sub_404678	imjpdct.dll: ___old_sbh_alloc_block
sub_402E2A	imjprw.exe: _calloc
sub_403BA2	imjpmig.exe: ___sbh_free_block
sub_40292A	imjpcus.dll: __heap_alloc
sub_402799	imskdic.dll: __NMSG_WRITE
sub_404633	TPVMW32.dll: sub_1000708D
sub_401233	tprdpw32.dll: sub_1000F109

Non-Agobot-family sample	Number of similar functions	Total number of instructions
Net-Worm.Win32.Kolabc.eph	44	2 411
Trojan-Spy.Win32.SCKeyLog.fp	44	1 048
Trojan-Spy.Win32.SCKeyLog.ij	43	1 141
Backdoor.Win32.IRCBot.ol	40	4 036
Trojan-Dropper.Win32.Agent.tif	31	712
Trojan-Spy.Win32.SCKeyLog.fb	29	669
Trojan-Downloader.Win32.Delf.gij	28	711
Backdoor.Win32.SuperSpy.b	28	665
Trojan-Downloader.Win32.Realtens.h	28	619
Trojan-PSW.Win32.Zombie.10	26	646