Multitier ensemble classifiers for malicious network traffic detection

doi:10.11959/j.issn.1000-436x.2018224

Abstract

Attack detection in today's big-data network environments suffers from inaccurately trained attack models because samples for some attack steps are missing, and existing ensemble classifiers have shortcomings when used to build multilevel classifiers. To address both problems, a malicious network traffic detection method based on a multi-level distributed ensemble classifier is proposed. The method first uses an unsupervised learning framework to preprocess the data and group it into clusters, then removes noise from each cluster, and finally builds a multi-level distributed ensemble classifier, MLDE, to detect malicious network traffic. The MLDE framework uses base classifiers at the bottom level and different ensemble meta-classifiers at the higher levels. The framework is simple to build, processes large data sets concurrently, and scales the ensemble to the size of the data set. Experimental results show that the AUC reaches 0.999 when MLDE uses random forest at the base (first) level, a bagging ensemble classifier at the second level, and an AdaBoost ensemble classifier at the third level.

Key words: malicious network traffic, attack detection, attack phase, network traffic clustering, ensemble classifier
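
The three-level arrangement described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: `Stump` is a deliberately simple stand-in for the random forest base classifier, the ensemble sizes are arbitrary, boosting is done by weighted resampling, and the nesting order (AdaBoost wrapping bagging wrapping the base learner) is one reading of "base level, second level, third level". All class and parameter names are hypothetical.

```python
import math
import random
from collections import Counter


class Stump:
    """Level 1 stand-in: a one-feature threshold rule (NOT a random forest)."""

    def fit(self, X, y, rng):  # rng is unused here; kept for a uniform interface
        best = None
        for f in range(len(X[0])):
            for t in sorted({row[f] for row in X}):
                for sign in (1, -1):
                    err = sum((sign * (row[f] - t) > 0) != (lab == 1)
                              for row, lab in zip(X, y))
                    if best is None or err < best[0]:
                        best = (err, f, t, sign)
        _, self.f, self.t, self.sign = best
        return self

    def predict(self, row):
        return 1 if self.sign * (row[self.f] - self.t) > 0 else 0


class Bagging:
    """Level 2: majority vote over learners trained on bootstrap samples."""

    def __init__(self, make_base, size=5):
        self.make_base, self.size = make_base, size

    def fit(self, X, y, rng):
        self.members = []
        for _ in range(self.size):
            idx = [rng.randrange(len(X)) for _ in X]
            member = self.make_base().fit([X[i] for i in idx], [y[i] for i in idx], rng)
            self.members.append(member)
        return self

    def predict(self, row):
        return Counter(m.predict(row) for m in self.members).most_common(1)[0][0]


class AdaBoost:
    """Level 3: binary AdaBoost, implemented here by weighted resampling."""

    def __init__(self, make_base, rounds=3):
        self.make_base, self.rounds = make_base, rounds

    def fit(self, X, y, rng):
        n = len(X)
        w = [1.0 / n] * n
        self.members = []  # list of (alpha, model)
        for _ in range(self.rounds):
            idx = rng.choices(range(n), weights=w, k=n)
            model = self.make_base().fit([X[i] for i in idx], [y[i] for i in idx], rng)
            miss = [model.predict(X[i]) != y[i] for i in range(n)]
            err = sum(wi for wi, m in zip(w, miss) if m)
            if err >= 0.5:  # no better than chance: stop boosting
                break
            alpha = 0.5 * math.log((1 - max(err, 1e-10)) / max(err, 1e-10))
            self.members.append((alpha, model))
            w = [wi * math.exp(alpha if m else -alpha) for wi, m in zip(w, miss)]
            total = sum(w)
            w = [wi / total for wi in w]
        return self

    def predict(self, row):
        score = sum(a * (1 if m.predict(row) == 1 else -1) for a, m in self.members)
        return 1 if score > 0 else 0


# Toy data: label 1 ("malicious") when the first feature is at least 10.
X = [[i, (i * 7) % 5] for i in range(20)]
y = [1 if i >= 10 else 0 for i in range(20)]
mlde = AdaBoost(lambda: Bagging(Stump, size=5), rounds=3).fit(X, y, random.Random(0))
print(mlde.predict([15, 0]), mlde.predict([2, 3]))
```

Each level only needs a factory for the level below it, which is why the ensemble can be resized by changing `size` and `rounds` without touching the other levels.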

Chinese Library Classification (CLC) number:

TP302

Jie WANG, Lili YANG, Min YANG. Multitier ensemble classifiers for malicious network traffic detection[J]. Journal on Communications, 2018, 39(10): 155-165.

Figures and tables (16): Tables 1-3, Figures 1-13.


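No.	Reconnaissance (footprinting)	Scanning (weaponization)	Gaining access (delivery, exploitation)	Maintaining control (installation, command and control)	Launching the attack (harvesting)	Attack result
1	√					Failure
2		√				Failure
3			√			Success
4				√		Success
5					√	Success
6	√	√				Failure
7		√	√			Success
8			√	√		Success
9				√	√	Success
10	√	√	√			Success
11		√	√	√		Success
12			√	√	√	Success
13	√	√	√	√		Success
14		√	√	√	√	Success
15	√	√	√	√	√	Success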
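
The 15 rows of the table above appear to be exactly the contiguous runs of the five attack phases, ordered from the shortest runs to the longest. The sketch below enumerates them under that reading. The outcome rule (an attack succeeds once it reaches the gaining-access phase or any later phase) is an assumption inferred from the table, not a rule stated in the source, and the phase names and function names are illustrative.

```python
PHASES = [
    "reconnaissance",
    "scanning",
    "gaining access",
    "maintaining control",
    "launching the attack",
]


def contiguous_runs(phases):
    """Yield every contiguous run of phases, shortest runs first."""
    n = len(phases)
    for length in range(1, n + 1):
        for start in range(n - length + 1):
            yield phases[start:start + length]


def outcome(run):
    # Assumed rule (inferred from the table): the attack succeeds only if
    # the run reaches "gaining access" or a later phase.
    furthest = max(PHASES.index(p) for p in run)
    return "success" if furthest >= PHASES.index("gaining access") else "failure"


rows = [(i, run, outcome(run)) for i, run in enumerate(contiguous_runs(PHASES), start=1)]
for i, run, result in rows:
    print(i, " + ".join(run), "->", result)
```

With five phases there are 5 + 4 + 3 + 2 + 1 = 15 contiguous runs, which matches the number of rows in the table.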

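Feature	Description
pkts	Total number of packets
pkt_noPayload	Total number of packets without payload
bytes	Total number of bytes transferred
pay_bytes	Total number of payload bytes
duration	Flow duration
maxsz	Maximum packet size
minsz	Minimum packet size
avgsz	Mean packet size
stdsz	Standard deviation of packet size
maxpy	Maximum payload size
minpy	Minimum payload size
avgpy	Mean payload size
stdpy	Standard deviation of payload size
synflag	Number of SYN flags
rstflag	Number of RST flags
pushflag	Number of PSH flags
finflag	Number of FIN flags
ackflag	Number of ACK flags
syn_ackflag	Number of SYN_ACK flags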
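
As a rough illustration of how the flow features in the table above might be computed, the sketch below derives them from a list of packets belonging to one flow. The packet representation (a dict with `ts`, `size`, `payload`, and `flags`) is hypothetical, and several details are assumptions rather than facts from the source: the standard deviations are population standard deviations, `synflag` counts every packet with SYN set (including SYN-ACK packets), and the keys `avgsz` and `rstflag` are spelled by analogy with `avgpy` and `synflag`.

```python
import statistics


def flow_features(packets):
    """Compute per-flow features from packets given as dicts with keys
    ts (seconds), size (bytes), payload (bytes) and flags (set of str)."""
    sizes = [p["size"] for p in packets]
    payloads = [p["payload"] for p in packets]
    times = [p["ts"] for p in packets]

    def count(*flags):
        # Number of packets in which all the given TCP flags are set.
        wanted = set(flags)
        return sum(1 for p in packets if wanted <= p["flags"])

    return {
        "pkts": len(packets),
        "pkt_noPayload": sum(1 for b in payloads if b == 0),
        "bytes": sum(sizes),
        "pay_bytes": sum(payloads),
        "duration": max(times) - min(times),
        "maxsz": max(sizes),
        "minsz": min(sizes),
        "avgsz": statistics.fmean(sizes),
        "stdsz": statistics.pstdev(sizes),  # assumption: population std
        "maxpy": max(payloads),
        "minpy": min(payloads),
        "avgpy": statistics.fmean(payloads),
        "stdpy": statistics.pstdev(payloads),  # assumption: population std
        "synflag": count("SYN"),
        "rstflag": count("RST"),
        "pushflag": count("PSH"),
        "finflag": count("FIN"),
        "ackflag": count("ACK"),
        "syn_ackflag": count("SYN", "ACK"),
    }


# A made-up five-packet TCP flow, for illustration only.
packets = [
    {"ts": 0.00, "size": 60, "payload": 0, "flags": {"SYN"}},
    {"ts": 0.05, "size": 60, "payload": 0, "flags": {"SYN", "ACK"}},
    {"ts": 0.10, "size": 54, "payload": 0, "flags": {"ACK"}},
    {"ts": 0.20, "size": 554, "payload": 500, "flags": {"PSH", "ACK"}},
    {"ts": 0.50, "size": 54, "payload": 0, "flags": {"FIN", "ACK"}},
]
features = flow_features(packets)
for name, value in features.items():
    print(name, value)
```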

	Data before noise removal	Data after noise removal with the seed-extending algorithm
AUC	0.9308	0.9432
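
For readers unfamiliar with the metric in the table above, AUC can be computed as the probability that a randomly chosen positive (malicious) sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below uses that pairwise definition on made-up scores. The function name and the data are illustrative and unrelated to the paper's experiments.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: iterable of 0/1 (1 = positive class)
    scores: classifier scores, higher means "more likely positive"
    """
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    # Each positive/negative pair contributes 1 for a win and 0.5 for a tie.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(labels, scores))
```

The pairwise form is quadratic in the number of samples, so it suits small illustrations; a rank-based formulation is the usual choice for large data sets.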