IoT intrusion detection method for unbalanced samples

doi:10.11959/j.issn.2096-109x.2023005

Abstract:

As devices iterate, network traffic grows exponentially and attacks against all kinds of applications multiply, so identifying and classifying attack traffic at the traffic level is of great importance. Meanwhile, the surge in Internet of Things (IoT) devices has brought a growing number of attacks on those devices, with increasingly serious damage. IoT intrusion detection can pick attack traffic out of this massive volume, protecting IoT devices at the traffic level and blocking attacks. To address the low detection accuracy for various attack types and the sample imbalance problem, an intrusion detection model based on resampled random forest (RF), named Resample-RF, is proposed. It comprises three algorithms: an optimal sample selection algorithm, a feature merging algorithm based on information entropy, and a multi-classification greedy transformation algorithm. For unbalanced samples in the IoT environment, the optimal sample selection algorithm increases the weight of small-sample classes and thereby improves model accuracy. For the low efficiency of random forest feature splitting, the entropy-based feature merging algorithm improves running efficiency. For the low accuracy of random forest multi-classification, the multi-classification greedy transformation algorithm further improves accuracy. The model was evaluated on two public datasets, reaching an F1 of 0.99 on the IoT-23 dataset and 1.0 on the Kaggle dataset. The experimental results show that the proposed model can effectively identify attack traffic within massive traffic, better guard against attacks on applications, protect IoT devices, and thereby protect their users.

Key words: traffic analysis, IoT, intrusion detection, random forest, unbalanced sample
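
The abstract does not spell out how the optimal sample selection algorithm works, so the following is only a minimal sketch of the general idea it builds on: giving rare classes more weight by resampling before training. It swaps in plain random oversampling, which is not the paper's algorithm. The function name `oversample_minority`, the `min_count` parameter, and the toy data are all hypothetical.

```python
import random
from collections import Counter

def oversample_minority(samples, labels, min_count, seed=0):
    """Randomly duplicate samples of rare classes until every class
    has at least `min_count` samples (plain random oversampling)."""
    rng = random.Random(seed)
    # Group the samples by class label.
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    out_x, out_y = [], []
    for y, xs in by_class.items():
        # Number of extra copies needed; zero for classes already large enough.
        extra = max(0, min_count - len(xs))
        resampled = xs + [rng.choice(xs) for _ in range(extra)]
        out_x.extend(resampled)
        out_y.extend([y] * len(resampled))
    return out_x, out_y

# Toy data (hypothetical): 8 benign flows and 2 attack flows.
samples = [[i, i % 3] for i in range(10)]
labels = ["benign"] * 8 + ["attack"] * 2

bal_x, bal_y = oversample_minority(samples, labels, min_count=6)
class_counts = Counter(bal_y)
print(class_counts)
```

The resampled set could then be passed to any random forest implementation. Duplicating rare samples raises their influence on each tree's bootstrap sample, at the cost of a higher risk of overfitting to the few originals.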

CLC number:

TP393

Tong PAN, Wei CHEN, Lifa WU. IoT intrusion detection method for unbalanced samples[J]. Chinese Journal of Network and Information Security, 2023, 9(1): 130-139.

Table 3 Comparison of improved random forest model results on IoT-23 dataset

Model	Time/s	Recall	F1	Acc
Random forest	57	0.946	0.959	0.971
Random forest 1000	369	0.946	0.959	0.979
XGBoost	691	0.982	0.982	0.982
LightGBM	30	0.985	0.986	0.985
Logistic regression	146	0.644	0.675	0.763
Reference [21]	727	0.946	0.959	0.979
Model in this section	1453	0.996	0.997	0.999

Table 4 Improved random forest model results on Kaggle dataset

Model	Time/s	Recall	F1	Acc
Random forest	104	0.9989	0.9984	0.9999
Random forest 1000	496	0.9989	0.9984	0.9999
XGBoost	1030	0.9990	0.9990	0.9999
LightGBM	28	0.9990	0.9995	0.9999
Logistic regression	313	0.3877	0.3939	0.9886
Reference [22]	670	0.9988	0.9989	0.9999
Model in this section	2841	1.0	1.0	1.0
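
The logistic regression row pairs an accuracy near 0.99 with recall and F1 below 0.4, a gap that typically appears when rare classes are missed on an unbalanced dataset. The sketch below shows how that can happen, assuming Recall and F1 are macro-averaged over classes (the excerpt does not state the averaging scheme). The helper `macro_scores` and the toy labels are hypothetical.

```python
def macro_scores(y_true, y_pred):
    """Return (accuracy, macro-averaged recall, macro-averaged F1)."""
    classes = sorted(set(y_true) | set(y_pred))
    recalls, f1s = [], []
    for c in classes:
        # Per-class confusion counts, treating class c as the positive class.
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        recalls.append(recall)
        f1s.append(f1)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return accuracy, sum(recalls) / len(classes), sum(f1s) / len(classes)

# Toy case (hypothetical): 98 benign flows, 2 attack flows,
# and a classifier that labels every flow as benign.
y_true = ["benign"] * 98 + ["attack"] * 2
y_pred = ["benign"] * 100

acc, recall, f1 = macro_scores(y_true, y_pred)
print(f"Acc={acc:.4f} Recall={recall:.4f} F1={f1:.4f}")
```

In this toy case accuracy stays at 0.98 while macro recall falls to 0.5 and macro F1 to about 0.49, because the missed attack class counts as much as the benign class in the average.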

References

[1]	BREIMAN L. Random forests[J]. Machine Learning, 2001, 45(1): 5-32.
[2]	ZHANG G P. Neural networks for classification: a survey[J]. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 2000, 30(4): 451-462.
[3]	ZHU Y W, YANG J H, ZHANG J X. Anomaly detection based on traffic information structure[J]. Journal of Software, 2010, 21(10): 2573-2583.
[4]	WANG G, HAO J, MA J, et al. A new approach to intrusion detection using artificial neural networks and fuzzy clustering[J]. Expert Systems with Applications, 2010, 37(9): 6225-6232.
[5]	MA W G, ZHANG Y D, GUO J. Abnormal traffic detection method based on LSTM and improved residual neural network optimization[J]. Journal on Communications, 2021, 42(5): 23-40.
[6]	GAO N, GAO L, HE Y Y, WANG H. A lightweight intrusion detection model based on autoencoder network with feature reduction[J]. Acta Electronica Sinica, 2017, 45(3): 730-739.
[7]	TAVALLAEE M, BAGHERI E, LU W, et al. A detailed analysis of the KDD CUP 99 data set[C]// 2009 IEEE Symposium on Computational Intelligence for Security and Defense Applications. 2009: 1-6.
[8]	ZHANG J, LING Y, FU X, et al. Model of the intrusion detection system based on the integration of spatial-temporal features[J]. Computers & Security, 2020, 89: 101681.
[9]	FENG X, SUN R, ZHU X, et al. Snipuzz: black-box fuzzing of IoT firmware via message snippet inference[C]// Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security. 2021: 337-350.
[10]	SHEKARI T, IRVENE C, CARDENAS A A, et al. MaMIoT: Manipulation of energy market leveraging high wattage IoT botnets[C]// Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security. 2021: 1338-1356.
[11]	REZAEI A. Using ensemble learning technique for detecting botnet on IoT[J]. SN Computer Science, 2021, 2(3): 1-14.
[12]	HAN C Y, ZHANG Y Z, ZHANG Y. Fast-flucos: malicious domain name detection method for fast-flux based on DNS traffic[J]. Journal on Communications, 2020, 41(5): 37-47.
[13]	HAN X, PASQUIER T, BATES A, et al. Unicorn: runtime provenance-based detector for advanced persistent threats[J]. arXiv preprint arXiv:2001.01525, 2020.
[14]	ZHANG J, LING Y, FU X, et al. Model of the intrusion detection system based on the integration of spatial-temporal features[J]. Computers & Security, 2020, 89: 101681.
[15]	PAXSON V. Bro: a system for detecting network intruders in real-time[J]. Computer Networks, 1999, 31(23-24): 2435-2463.
[16]	Flowmeter[EB].
[17]	GOGOI P, BHUYAN M H, BHATTACHARYYA D K, et al. Packet and flow based network intrusion dataset[C]// International Conference on Contemporary Computing. 2012: 322-334.
[18]	COMBS G. Tshark: Dump and analyze network traffic[J]. Wireshark, 2012.
[19]	GARCIA S, PARMISANO A, ERQUIAGA M J. IoT-23: a labeled dataset with malicious and benign IoT network traffic[R]. Stratosphere Lab. 2020.
[20]	VACCARI I, CHIOLA G, AIELLO M, et al. MQTTset, a new dataset for machine learning techniques on MQTT[J]. Sensors, 2020, 20(22): 6578.
[21]	HE H Y, HUANG G Y, ZHANG B, et al. Research on intrusion detection model based on multiple feature selection strategies[J]. Information Security Research, 2021, 7(3): 225-232.

Type	Count	Proportion
FileDownload	18	<0.01%
PartOfAPortScan	213 852 924	65.74%
C&C	21 995	<0.01%
C&C-FileDownload	53	<0.01%
Attack	9 398	<0.01%
DDoS	19 538 713	6.00%
C&C-HeartBeat	33 673	0.01%
Okiru	60 990 708	18.75%
C&C-Torii	30	<0.01%
Okiru-Attack	3	<0.01%
C&C-Mirai	2	<0.01%
C&C-HeartBeat-FileDownload	11	<0.01%
C&C-HeartBeat-Attack	834	<0.01%
C&C-PortScan	888	<0.01%
PortScan-Attack	5	<0.01%
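
To make the scale of the imbalance in the table above concrete, the sketch below computes the ratio between the largest and smallest listed classes and a set of inverse-frequency class weights. The weighting formula, total / (number of classes × class count), is a common heuristic (the same one scikit-learn uses for `class_weight="balanced"`) and is an assumption here, not the paper's optimal sample selection algorithm. Only the labels listed in the table are included, so the weights are relative to those labels alone.

```python
# Counts copied from the table above (only the labels listed there).
counts = {
    "FileDownload": 18,
    "PartOfAPortScan": 213_852_924,
    "C&C": 21_995,
    "C&C-FileDownload": 53,
    "Attack": 9_398,
    "DDoS": 19_538_713,
    "C&C-HeartBeat": 33_673,
    "Okiru": 60_990_708,
    "C&C-Torii": 30,
    "Okiru-Attack": 3,
    "C&C-Mirai": 2,
    "C&C-HeartBeat-FileDownload": 11,
    "C&C-HeartBeat-Attack": 834,
    "C&C-PortScan": 888,
    "PortScan-Attack": 5,
}

def balanced_weights(class_counts):
    """Inverse-frequency weights: total / (n_classes * count).
    Each class then contributes the same total weight."""
    total = sum(class_counts.values())
    k = len(class_counts)
    return {label: total / (k * n) for label, n in class_counts.items()}

weights = balanced_weights(counts)
imbalance_ratio = max(counts.values()) / min(counts.values())

# Show the three rarest labels, which receive the largest weights.
for label in sorted(weights, key=weights.get, reverse=True)[:3]:
    print(f"{label}: weight {weights[label]:.3g}")
print(f"largest/smallest class ratio: {imbalance_ratio:.3g}")
```

With a ratio on the order of 10^8 between PartOfAPortScan and C&C-Mirai, pure reweighting gives the rarest labels extremely large weights, which is one reason a method might combine weighting with sample selection rather than rely on weights alone.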

Type	Count	Proportion
dos	130 223	1.07%
slowite	9 202	0.08%
flood	613	<0.01%
bruteforce	14 501	0.12%
malformed	10 924	0.09%