Study of high-speed malicious Web page detection system based on two-step classifier

doi:10.11959/j.issn.2096-109x.2017.00186

Abstract

Traditional static malicious Web page detection schemes come under pressure when faced with massive numbers of newly created Web pages. To address this, a two-stage analysis and detection process is introduced, with a feature extraction scheme proposed for each stage. By applying an optimized naive Bayes algorithm and a support vector machine (SVM) algorithm hierarchically, a malicious Web page detection system that balances efficiency and functionality, TSMWD (two-step malicious Web page detection system), is designed and implemented. The first layer filters out the large volume of normal Web pages; it is efficient, fast, and easy to update and iterate, and it prioritizes the true positive rate. The second layer, which sees only the limited number of samples that pass the first-layer filter, pursues detection performance: it demands high accuracy, with looser constraints on time and resource cost. Experimental results show that the architecture increases detection speed while keeping overall detection accuracy essentially unchanged, so that more detection requests can be handled in a given time.

Key words: malicious Web page detection, network security, machine learning, feature extraction
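
The two-layer flow described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy scoring functions, and in particular the decision rule `lam * p_mal >= 1 - p_mal` are hypothetical. The abstract does not define how the parameter λ enters the first-layer decision; the sketch only assumes that a larger λ makes the first layer flag more pages as suspicious.

```python
from typing import Callable, Iterable, List, Tuple

Page = str  # hypothetical representation: a page is just its raw text


def two_step_detect(
    pages: Iterable[Page],
    stage1_malicious_prob: Callable[[Page], float],
    stage2_is_malicious: Callable[[Page], bool],
    lam: float = 1.4,
) -> Tuple[List[bool], int]:
    """Return (verdicts, number of pages that reached the second layer)."""
    verdicts: List[bool] = []
    stage2_calls = 0
    for page in pages:
        p_mal = stage1_malicious_prob(page)
        # Assumed rule (not from the source): lam > 1 biases the cheap
        # first layer toward "suspicious", trading false positives for recall.
        suspicious = lam * p_mal >= (1.0 - p_mal)
        if not suspicious:
            verdicts.append(False)  # filtered out as normal, no further cost
            continue
        stage2_calls += 1  # only suspicious pages pay for the slow classifier
        verdicts.append(stage2_is_malicious(page))
    return verdicts, stage2_calls


def toy_stage1(page: Page) -> float:
    """Stand-in for a naive Bayes score: counts a few suspicious tokens."""
    hits = sum(page.count(tok) for tok in ("<iframe", "eval(", "unescape("))
    return min(0.9, 0.1 + 0.3 * hits)


def toy_stage2(page: Page) -> bool:
    """Stand-in for the slower, more accurate second-layer classifier."""
    return "eval(" in page and "unescape(" in page


pages = [
    "<html><p>hello</p></html>",
    "<html><iframe src='x'></iframe></html>",
    "<script>eval(unescape('%61'))</script>",
]
verdicts, stage2_calls = two_step_detect(pages, toy_stage1, toy_stage2)
```

In a real system the two callables would wrap trained models; the point of the sketch is only that the second layer runs on the subset the first layer lets through.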

CLC number: TP393

Zheng-qi WANG, Xiao-bing FENG, Chi ZHANG. Study of high-speed malicious Web page detection system based on two-step classifier[J]. Chinese Journal of Network and Information Security, 2017, 3(8): 44-60.

References (29)

[1]	CNNIC (China Internet Network Information Center). The 37th statistical report on Internet development in China[R]. Beijing: CNNIC, 2016.
[2]	[EB/OL]. .
[3]	PROVOS N, MAVROMMATIS P, RAJAB M A, et al. All your iFRAMEs point to us[C]// Conference on Security Symposium. 2008: 1-15.
[4]	SHENG S, WARDMAN B, WARNER G, et al. An empirical analysis of phishing blacklists[C]// The Sixth Conference on Email and Anti-Spam (CEAS). 2009.
[5]	ESHETE B, VILLAFIORITA A, WELDEMARIAM K. Malicious website detection: effectiveness and efficiency issues[C]// SysSec Workshop. 2011: 123-126.
[6]	Making the Web safer[R/OL]. .
[7]	Malware domain list[EB/OL]. .
[8]	OpenDNS,PhishTank[EB/OL]. .
[9]	PRAKASH P, KUMAR M, KOMPELLA R R, et al. Phishnet: predictive blacklisting to detect phishing attacks[C]// INFOCOM. 2010: 1-5.
[10]	CHRISTODORESCU M, JHA S. Testing malware detectors[J]. ACM Sigsoft Software Engineering Notes, 2004, 29(4): 34-44.
[11]	CHOU N, LEDESMA R, TERAGUCHI Y, et al. Client-side defense against Web-based identity theft[C]// The 11th Annual Network & Distributed System Security Symposium (NDSS). 2004: 1-16.
[12]	HOU Y T, CHANG Y, CHEN T, et al. Malicious Web content detection by machine learning[J]. Expert Systems with Applications, 2010, 37(1): 55-60.
[13]	ROESCH M. Snort-lightweight intrusion detection for networks[J]. Lisa, 1999: 229-238.
[14]	LIN S F, HOU Y T, CHEN C M, et al. Malicious webpage detection by semantics-aware reasoning[C]// The Eighth International Conference on Intelligent Systems Design and Applications. 2008: 115-120.
[15]	ZHANG Y, HONG J I, CRANOR L F. Cantina: a content-based approach to detecting phishing web sites[C]// The 16th International Conference on World Wide Web. 2007: 639-648.
[16]	HOU Y T, CHANG Y, CHEN T, et al. Malicious Web content detection by machine learning[J]. Expert Systems with Applications, 2010, 37(1): 55-60.
[17]	JUSTIN M, SAUL L K, SAVAGE S, et al. Beyond blacklists: learning to detect malicious Web sites from suspicious URLs[C]// The 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2009: 1245-1254.
[18]	YOO S, KIM S, CHOUDHARY A, et al. Two-phase malicious web page detection scheme using misuse and anomaly detection[J]. International Journal of Reliable Information and Assurance, 2014, 2(1).
[19]	CANALI D, COVA M, VIGNA G, et al. Prophiler: a fast filter for the large-scale detection of malicious web pages[C]// The 20th International Conference on World Wide Web. 2011: 197-206.
[20]	The German honeyclient project[EB/OL]. .
[21]	The Honeynet Project. Know your enemy: honeynets[EB/OL]. .
[22]	MAYNOR D. Metasploit toolkit for penetration testing, exploit development, and vulnerability research[M]. Elsevier, 2011.
[23]	HAUTUS M L J. The formal Laplace transform for smooth linear systems[M]// Mathematical Systems Theory. Berlin: Springer, 1976: 29-47.
[24]	GOLUB G H, HEATH M, WAHBA G. Generalized cross-validation as a method for choosing a good ridge parameter[J]. Technometrics, 1979, 21(2): 215-223.
[25]	PRAKASH P, KUMAR M, KOMPELLA R R, et al. Phishnet: predictive blacklisting to detect phishing attacks[C]// INFOCOM. 2010: 1-5.
[26]	LEE S, KIM J. Warningbird: a near real-time detection system for suspicious URLs in twitter stream[J]. IEEE Transactions on Dependable and Secure Computing, 2013, 10(3): 183-195.
[27]	LIKARISH P, JUNG E, JO I. Obfuscated malicious javascript detection using classification techniques[C]// The 4th International Conference on Malicious and Unwanted Software (MALWARE). 2009: 47-54.
[28]	MA J, SAUL L K, SAVAGE S, et al. Beyond blacklists: learning to detect malicious Web sites from suspicious URLs[C]// The 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2009: 1245-1254.
[29]	LIU G, QIU B, WENYIN L. Automatic detection of phishing target from phishing webpage[C]// The 20th International Conference on Pattern Recognition (ICPR). 2010: 4153-4156.

λ	ACC	TPR	FPR
1	78.30%	76.10%	14.20%
1.2	71.30%	85.90%	21.60%
1.4	62.90%	92.70%	31.50%
1.6	53.50%	96.30%	42.20%
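
The ACC, TPR, and FPR columns used in these tables are presumably the standard confusion-matrix rates; the page itself does not spell the formulas out, so the definitions below are an assumption based on common usage. The sketch computes them from raw counts, and the counts in the example call are made up for illustration only.

```python
from typing import Tuple


def detection_rates(tp: int, fp: int, tn: int, fn: int) -> Tuple[float, float, float]:
    """Return (ACC, TPR, FPR) from confusion-matrix counts.

    tp: malicious pages flagged as malicious
    fp: normal pages flagged as malicious
    tn: normal pages passed as normal
    fn: malicious pages passed as normal
    """
    total = tp + fp + tn + fn
    if total == 0 or (tp + fn) == 0 or (fp + tn) == 0:
        raise ValueError("need at least one malicious and one normal sample")
    acc = (tp + tn) / total   # share of all pages classified correctly
    tpr = tp / (tp + fn)      # share of malicious pages that are caught
    fpr = fp / (fp + tn)      # share of normal pages wrongly flagged
    return acc, tpr, fpr


# Illustrative counts only (not taken from the paper).
acc, tpr, fpr = detection_rates(tp=90, fp=20, tn=80, fn=10)
```

Under these definitions, a filter tuned for a high TPR can still show a low ACC when it flags many normal pages, which is the pattern the λ rows above display.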

Algorithm	ACC	Precision	Recall
KNN	76.31%	79.22%	78.69%
C4.5	82.15%	85.74%	85.30%
CART	86.40%	90.48%	90.44%
SVM	93.57%	92.11%	91.80%

λ	Detection time (s)	ACC	TPR	FPR
1	1.49	79.28%	81.10%	8.81%
1.2	1.64	86.60%	86.71%	7.95%
1.4	1.81	91.14%	91.76%	7.69%
1.6	2.24	93.14%	93.87%	7.62%
Without TSMWD-I	3.47	93.57%	94.37%	7.61%
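
As a quick arithmetic check on the timing table above, the sketch below computes, for each λ, the speedup relative to the row without the first-layer filter (TSMWD-I) and the accuracy given up in exchange. The numbers are copied from the table; the function name is hypothetical, and the size of the test set behind the detection times is not stated on this page.

```python
from typing import List, Tuple

# (lambda, detection time in seconds, ACC in percent), copied from the table.
ROWS: List[Tuple[float, float, float]] = [
    (1.0, 1.49, 79.28),
    (1.2, 1.64, 86.60),
    (1.4, 1.81, 91.14),
    (1.6, 2.24, 93.14),
]
BASELINE_TIME_S = 3.47  # row without the first-layer filter
BASELINE_ACC = 93.57


def tradeoff(rows: List[Tuple[float, float, float]]) -> List[Tuple[float, float, float]]:
    """Return (lambda, speedup factor, ACC drop in percentage points) per row."""
    return [
        (lam, BASELINE_TIME_S / time_s, BASELINE_ACC - acc)
        for lam, time_s, acc in rows
    ]


results = tradeoff(ROWS)
for lam, speedup, acc_drop in results:
    print(f"lambda={lam}: {speedup:.2f}x faster, ACC lower by {acc_drop:.2f} points")
```

For λ = 1.4 this works out to roughly a 1.9x speedup for about 2.4 percentage points of accuracy, which matches the abstract's claim of faster detection at nearly unchanged accuracy.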

Detection scheme	Detection efficiency (ms per sample)	ACC	TPR	FPR
PhishNet	1	above 90%	—	—
WarningBird	1.5	91.53%	88.84%	1.23%
TSMWD(λ=1.4)	0.58	91.14%	91.76%	7.69%

Detection scheme	ACC	TPR	FPR
Justin	89.82%	91.0%	7.60%
Peter Likarish	92.0%	91.0%	7.60%
Gang Liu	91.44%	91.0%	7.60%
TSMWD(λ=1.4)	91.14%	91.0%	7.60%