A network traffic classification method based on random forest and improved convolutional neural network

doi:10.11959/j.issn.1000-0801.2023138

Abstract

To improve the efficiency of network traffic classification models and reduce their complexity, a classification method based on random forest and an improved convolutional neural network is proposed. First, random forest is used to evaluate the importance of each network traffic feature, and features are selected according to the importance ranking. Second, the AdamW optimizer and a triangular cyclic learning rate are adopted to optimize the convolutional neural network classification model. Finally, the model is deployed on a Spark cluster to parallelize model training. With a triangular cyclic learning rate of constant cycle amplitude, experiments using the 1024, 400, 256 and 100 most important features as input show that model accuracy is improved to 97.68%, 95.84%, 95.03% and 94.22%, respectively. With the 256 most important features, experiments with different learning rates show that the triangular cyclic learning rate with halved cycle amplitude performs best: model accuracy is improved to 95.25% and model training time is reduced by nearly half.

Key words: network traffic classification, random forest, convolutional neural network, Spark

CLC number: TP393

Bensheng YUN (云本胜), Xiaoya GAN (干潇雅), Yaguan QIAN (钱亚冠). A network traffic classification method based on random forest and improved convolutional neural network[J]. Telecommunications Science (电信科学), 2023, 39(7): 80-89.

References

[1]	GU Y, LI D, GAO K H. Research on network traffic classification based on machine learning and deep learning[J]. Telecommunications Science, 2021, 37(3): 105-113. (in Chinese)
[2]	FENG W B, HONG Z, WU L F, et al. Review of network protocol recognition techniques[J]. Journal of Computer Applications, 2019, 39(12): 3604-3614. (in Chinese)
[3]	WANG W, ZHU M, ZENG X W, et al. Malware traffic classification using convolutional neural network for representation learning[C]// Proceedings of 2017 International Conference on Information Networking (ICOIN). Piscataway: IEEE Press, 2017: 712-717.
[4]	FENG W B, HONG Z, WU L F, et al. Network protocol recognition based on convolutional neural network[J]. China Communications, 2020, 17(4): 125-139.
[5]	SUN Y L, YUN B S, QIAN Y G, et al. A Spark-based method for identifying large-scale network burst traffic[J]. Journal of Computers, 2021, 32(4): 123-136.
[6]	TONG V, TRAN H A, SOUIHI S, et al. A novel QUIC traffic classifier based on convolutional neural networks[C]// Proceedings of 2018 IEEE Global Communications Conference (GLOBECOM). Piscataway: IEEE Press, 2019: 1-6.
[7]	HU X Y, GU C X, WEI F S. CLD-net: a network combining CNN and LSTM for Internet encrypted traffic classification[J]. Security and Communication Networks, 2021: 1-15.
[8]	YU S, DONG Y N, QIU X H. A network traffic classification method based on deep feature fusion[J]. Journal of Nanjing University of Posts and Telecommunications (Natural Science), 2022, 42(3): 82-89. (in Chinese)
[9]	XUE J L, CHEN Y C, LI O. Intelligent feature extraction and real-time identification algorithm for unknown traffic data[J]. Journal of Information Engineering University, 2021, 22(5): 597-605. (in Chinese)
[10]	MARÍN G, CASAS P, CAPDEHOURAT G. DeepMAL - deep learning models for malware traffic detection and classification[C]// Data Science – Analytics and Applications. Wiesbaden: Springer Vieweg, 2021: 105-112.
[11]	REIS B, MAIA E, PRAÇA I. Selection and performance analysis of CICIDS2017 features importance[C]// International Symposium on Foundations and Practice of Security. Cham: Springer, 2020: 56-71.
[12]	BREIMAN L. Random forests[J]. Machine Learning, 2001, 45(1): 5-32.
[13]	CHEN Z, LYU N. Network intrusion detection model based on random forest and XGBoost[J]. Journal of Signal Processing, 2020, 36(7): 1055-1064. (in Chinese)
[14]	HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]// Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE Press, 2016: 770-778.
[15]	GAN Z Y. Research and implementation of lightweight malicious traffic identification and its distributed method based on deep learning[D]. Nanjing: Nanjing University of Posts and Telecommunications, 2021. (in Chinese)
[16]	LOSHCHILOV I, HUTTER F. Decoupled weight decay regularization[J]. arXiv preprint, 2017, arXiv:1711.05101.
[17]	KINGMA D P, BA J. Adam: a method for stochastic optimization[J]. arXiv preprint, 2014, arXiv:1412.6980.
[18]	LIU Y F, ZHANG J R. Research advances in deep neural networks learning rate strategies[J]. Control and Decision, 2022: 0147. (in Chinese)

Item	sparkmaster	sparkslaver 1	sparkslaver 2
Spark	Master	Worker	Worker
HDFS	NameNode, DataNode	DataNode	SecondaryNameNode, DataNode
YARN	NodeManager	ResourceManager, NodeManager	NodeManager
Note: Master is the Spark master node and Worker is a Spark worker node; NameNode is the HDFS management node, DataNode is an HDFS worker node, and SecondaryNameNode is the auxiliary management node; ResourceManager is the YARN management node and NodeManager is a YARN worker node.

C_imp	C_id	V_imp
1	0	0.0604
2	4	0.0387
3	3	0.0270
4	5	0.0218
5	2	0.0204
6	7	0.0186
7	1	0.0161
8	1455	0.0139
9	1448	0.0138
10	9	0.0128
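
The selection step behind this ranking can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the importance scores are already available (for instance from scikit-learn's `RandomForestClassifier.feature_importances_`, which the source does not name), and the helper names `select_top_k` and `project` are hypothetical. The scores are the ten values listed in the table above.

```python
# Importance scores keyed by feature index (C_id), copied from the table above.
# Assumption: in practice these come from a fitted random forest; the library
# used by the authors is not stated in the source.
importance_by_id = {
    0: 0.0604, 4: 0.0387, 3: 0.0270, 5: 0.0218, 2: 0.0204,
    7: 0.0186, 1: 0.0161, 1455: 0.0139, 1448: 0.0138, 9: 0.0128,
}


def select_top_k(importance, k):
    """Return the k feature indices with the highest importance, best first."""
    ranked = sorted(importance, key=lambda idx: importance[idx], reverse=True)
    return ranked[:k]


def project(sample, selected_ids):
    """Keep only the selected positions of one sample, in ranked order."""
    return [sample[i] for i in selected_ids]


top5 = select_top_k(importance_by_id, 5)
print(top5)  # [0, 4, 3, 5, 2]
```

Applying `project` to every sample with the top 1024, 400, 256 or 100 indices would give reduced inputs of the sizes compared in the experiments.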

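Feature selection method	Accuracy	Precision	Recall	F1 score	Training time (min)
First 1024 raw bytes as features [5]	90.42%	89.47%	91.62%	90.53%	-
1024 features selected by random forest	97.68%	97.94%	97.42%	97.68%	97
400 features selected by random forest	95.84%	94.95%	96.83%	95.88%	56
256 features selected by random forest	95.03%	92.35%	98.20%	95.19%	52
100 features selected by random forest	94.22%	94.19%	94.25%	94.22%	48
64 features selected by random forest	87.20%	92.27%	81.20%	86.38%	26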
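
As a consistency check on this table, the F1 column can be recomputed from the precision and recall columns using the standard definition F1 = 2PR / (P + R). The sketch below is illustrative only; the short row labels are shorthand introduced here, and the numbers are copied from the table.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) for each row of the table above.
rows = {
    "raw-1024": (89.47, 91.62, 90.53),
    "rf-1024": (97.94, 97.42, 97.68),
    "rf-400": (94.95, 96.83, 95.88),
    "rf-256": (92.35, 98.20, 95.19),
    "rf-100": (94.19, 94.25, 94.22),
    "rf-64": (92.27, 81.20, 86.38),
}

for name, (p, r, reported) in rows.items():
    print(f"{name}: computed {f1_score(p, r):.2f}, reported {reported:.2f}")
```

For every row the recomputed value agrees with the reported F1 to within rounding. The 64-feature row shows why the F1 score drops sharply there: recall falls to 81.20% while precision stays above 92%.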

Learning rate (value range in parentheses)	Accuracy	Precision	Recall	F1 score	Training time (min)
EXP (0.001)	93.77%	93.09%	94.55%	93.81%	41
TRI (0.0001 to 0.0015)	95.03%	92.35%	98.20%	95.19%	52
TRI2 (0.0001 to 0.0015)	95.25%	95.83%	94.61%	95.21%	44
TRIEXP (0.0001 to 0.0015)	94.24%	94.59%	93.85%	94.22%	44
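
The three cyclic schedules compared above can be sketched with the commonly used cyclical learning rate formulation. This is an assumption-laden illustration rather than the authors' implementation: it assumes TRI, TRI2 and TRIEXP correspond to the usual `triangular`, `triangular2` and `exp_range` variants (the mode names also used by PyTorch's `CyclicLR`), and the `step_size` and `gamma` values are hypothetical. Only the range 0.0001 to 0.0015 comes from the table.

```python
import math


def cyclic_lr(iteration, base_lr=0.0001, max_lr=0.0015, step_size=100,
              mode="triangular", gamma=0.999):
    """Learning rate at a given iteration for a triangular cyclic schedule.

    step_size (iterations per half cycle) and gamma are hypothetical values.
    """
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    triangle = max(0.0, 1.0 - x)  # rises 0 -> 1 -> 0 within each cycle

    if mode == "triangular":       # constant cycle amplitude
        scale = 1.0
    elif mode == "triangular2":    # amplitude halved after every cycle
        scale = 1.0 / (2 ** (cycle - 1))
    elif mode == "exp_range":      # amplitude decays exponentially
        scale = gamma ** iteration
    else:
        raise ValueError(f"unknown mode: {mode}")

    return base_lr + (max_lr - base_lr) * triangle * scale


for it in (0, 100, 200, 300):
    print(it, cyclic_lr(it), cyclic_lr(it, mode="triangular2"))
```

With these settings both schedules peak at 0.0015 in the first cycle (iteration 100). At the second peak (iteration 300) `triangular` returns to 0.0015 while `triangular2` reaches only about 0.0008, which is the halved amplitude described in the abstract.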