Fast reused code tracing method based on simhash and inverted index

doi:10.11959/j.issn.1000-436x.2016225

Abstract

A novel method for fast and accurate tracing of reused code is proposed. Based on simhash and an inverted index, the method quickly traces similar functions in massive code collections. First, a code database with a three-level inverted index structure is constructed. For a function to be traced, similar code blocks are found quickly from the simhash values of the code blocks in the function. Potentially similar functions are then retrieved through the inverted index. Finally, truly similar functions are identified by comparing the jump relationships among the similar code blocks, and malware samples containing those functions can in turn be traced. Experimental results show that, using the code database, the method quickly identifies both compiler-inserted functions and reused functions while maintaining high accuracy and recall.

Key words: network security, reused code, retrieval method, homology identification, malware

Yan-chen QIAO, Xiao-chun YUN, Yu-peng TUO, Yong-zheng ZHANG. Fast reused code tracing method based on simhash and inverted index[J]. Journal on Communications, 2016, 37(11): 104-113.
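The retrieval pipeline summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not spell out the three index levels, so the block, function, and sample layering below is an assumption, as are the names `simhash`, `hamming`, and `BlockIndex`, the 64-bit fingerprint width, the use of MD5 as the per-token hash, and the linear scan used for near matches. The jump-relationship comparison step is omitted.

```python
import hashlib
from collections import defaultdict


def simhash(tokens, bits=64):
    """Charikar-style simhash of a list of string tokens (e.g. instructions)."""
    weights = [0] * bits
    for tok in tokens:
        # Assumption: MD5 truncated to 64 bits serves as the per-token hash.
        h = int.from_bytes(hashlib.md5(tok.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    # A fingerprint bit is set when the majority of tokens set that bit.
    return sum(1 << i for i in range(bits) if weights[i] > 0)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


class BlockIndex:
    """Hypothetical inverted index: block simhash -> functions -> samples."""

    def __init__(self):
        self.block_to_funcs = defaultdict(set)
        self.func_to_samples = defaultdict(set)

    def add(self, sample, func, blocks):
        """Register a function (a list of code blocks) found in a sample."""
        self.func_to_samples[func].add(sample)
        for block in blocks:
            self.block_to_funcs[simhash(block)].add(func)

    def candidates(self, blocks, max_distance=0):
        """Count, per indexed function, how many query blocks match it."""
        hits = defaultdict(int)
        for block in blocks:
            h = simhash(block)
            # Linear scan for clarity only; a large database would need a
            # faster near-duplicate lookup than this.
            for key, funcs in self.block_to_funcs.items():
                if hamming(h, key) <= max_distance:
                    for func in funcs:
                        hits[func] += 1
        return dict(hits)


# Usage with made-up sample names and two tiny normalized code blocks.
index = BlockIndex()
block_a = ["push REG32", "mov REG32, MEM", "push VAL"]
block_b = ["pop REG32", "retn"]
index.add("sample_a", "sub_4023BC", [block_a, block_b])
index.add("sample_b", "sub_10009720", [block_a])
print(index.candidates([block_a, block_b]))
print(index.func_to_samples["sub_10009720"])
```

Functions that share many block fingerprints with the query are only candidates; in the method described above they would still have to pass the comparison of jump relationships before being reported as similar.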

Code block name	Content
loc_4023BC	push esi
	mov esi, [esp+4+arg_0]
	push 0
	and dword ptr [esi], 0
	call ds:GetModuleHandleA
	cmp word ptr [eax], 5A4Dh
	jnz short loc_4023E7
loc_4023E7	pop esi
	retn

Original code block	Normalized code block
push esi	push REG32
mov esi, [esp+4+arg_0]	mov REG32, MEM
push 0	push VAL
and dword ptr [esi], 0	and MEM, VAL
call ds:GetModuleHandleA	call GetModuleHandleA
cmp word ptr [eax], 5A4Dh	cmp MEM, VAL
jnz short loc_4023E7	jnz loc_xxx
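The mapping in the table above can be reproduced with a small normalizer. This is a sketch inferred from the seven example rows only, not the authors' rule set: the register list, the immediate-value pattern, and the function names `normalize_operand` and `normalize` are assumptions, and operand kinds that the table does not show (for example 8-bit or 16-bit registers) are passed through unchanged.

```python
import re

# Assumption: only the eight general-purpose 32-bit registers map to REG32.
REG32 = {"eax", "ebx", "ecx", "edx", "esi", "edi", "ebp", "esp"}

# IDA-style immediates: plain decimal digits, or hex digits ending in "h".
IMMEDIATE = re.compile(r"^(?:[0-9][0-9A-Fa-f]*h|[0-9]+)$")


def normalize_operand(operand):
    """Replace one operand with its class token, as in the table."""
    operand = operand.strip()
    if "[" in operand:
        return "MEM"  # any bracketed memory reference
    if operand.lower() in REG32:
        return "REG32"
    if IMMEDIATE.match(operand):
        return "VAL"
    return operand  # unknown operand kinds are left as they are


def normalize(instruction):
    """Normalize one disassembled instruction line."""
    mnemonic, _, rest = instruction.strip().partition(" ")
    if not rest:
        return mnemonic  # e.g. "retn"
    if mnemonic == "call":
        # Keep the callee name but drop a segment prefix such as "ds:".
        return "call " + rest.split(":")[-1].strip()
    if mnemonic.startswith("j"):
        # Jump targets depend on addresses, so they are masked.
        return mnemonic + " loc_xxx"
    operands = [normalize_operand(op) for op in rest.split(",")]
    return mnemonic + " " + ", ".join(operands)


block = [
    "push esi",
    "mov esi, [esp+4+arg_0]",
    "push 0",
    "and dword ptr [esi], 0",
    "call ds:GetModuleHandleA",
    "cmp word ptr [eax], 5A4Dh",
    "jnz short loc_4023E7",
]
for line in block:
    print(f"{line:32s} -> {normalize(line)}")
```

Masking registers, addresses, and constants while keeping API names means that two copies of the same code compiled at different addresses yield the same token sequence, which is what lets a fingerprint of the block match across samples.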

Known	Traced	Not traced
Similar	1 486	62
Dissimilar	131	-
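The counts in this table can be turned into rates with a short calculation. The sketch below assumes the conventional reading of the cells: similar and traced functions are true positives (1 486), similar but untraced functions are false negatives (62), and dissimilar but traced functions are false positives (131). The function name `precision_recall` is illustrative, and treating the reported accuracy as precision is an assumption.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true/false positive and false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall


# Counts read from the table under the assumptions stated above.
p, r = precision_recall(tp=1486, fp=131, fn=62)
print(f"precision={p:.4f} recall={r:.4f}")
# prints: precision=0.9190 recall=0.9599
```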

Compiler function	Similar functions in WinXP system files
sub_40451A	___old_sbh_decommit_pages in imjputyc.dll, among others
sub_402A5B	sub_10005660 in TPPS.DLL, among others
sub_4023E9	sub_10006046 in TPVMW32.dll, among others
sub_4044C4	sub_27608F15 in MSCOMCTL.OCX, among others
sub_4023BC	sub_10009720 in tprdpw32.dll, among others
sub_404380	___old_sbh_new_region in imjpuex.exe, among others
sub_4045DC	sub_1000A73D in tprdpw32.dll, among others
sub_404880	___old_sbh_alloc_block_from_page in msvcr70.dll, among others
sub_402531	sub_10009895 in tprdpw32.dll, among others
sub_404678	___old_sbh_alloc_block in imjpdct.dll, among others
sub_402E2A	_calloc in imjprw.exe, among others
sub_403BA2	___sbh_free_block in imjpmig.exe, among others
sub_40292A	__heap_alloc in imjpcus.dll, among others
sub_402799	__NMSG_WRITE in imskdic.dll, among others
sub_404633	sub_1000708D in TPVMW32.dll, among others
sub_401233	sub_1000F109 in tprdpw32.dll, among others

Non-Agobot family sample	Number of similar functions	Total number of instructions
Net-Worm.Win32.Kolabc.eph	44	2 411
Trojan-Spy.Win32.SCKeyLog.fp	44	1 048
Trojan-Spy.Win32.SCKeyLog.ij	43	1 141
Backdoor.Win32.IRCBot.ol	40	4 036
Trojan-Dropper.Win32.Agent.tif	31	712
Trojan-Spy.Win32.SCKeyLog.fb	29	669
Trojan-Downloader.Win32.Delf.gij	28	711
Backdoor.Win32.SuperSpy.b	28	665
Trojan-Downloader.Win32.Realtens.h	28	619
Trojan-PSW.Win32.Zombie.10	26	646

Malware family	Number of associated samples
Trojan.Win32.Vapsup	2 566
Trojan.Win32.Agent	124
Trojan-Downloader.Win32.Agent	114
Trojan-Clicker.Win32.Agent	69
Trojan-Ransom.Win32.Hexzone	58
Trojan-GameThief.Win32.OnLineGames	47
Backdoor.Win32.Rbot	45
Backdoor.Win32.Agent	43
Trojan.Win32.BHO	40
Backdoor.Win32.SdBot	25
