Outlier detection algorithm based on fast density peak clustering outlier factor

doi:10.11959/j.issn.1000-436x.2022193

Journal on Communications ›› 2022, Vol. 43 ›› Issue (10): 186-195.

Zhongping ZHANG¹,²,³, Sen LI¹, Weixiong LIU¹, Shuxia LIU⁴

¹ College of Information Science and Engineering, Yanshan University, Qinhuangdao 066004, China
² The Key Laboratory for Computer Virtual Technology and System Integration of Hebei Province, Qinhuangdao 066004, China
³ The Key Laboratory of Software Engineering of Hebei Province, Qinhuangdao 066004, China
⁴ Hebei Normal University of Science and Technology, Qinhuangdao 066004, China

Revised: 2022-07-08 Online: 2022-10-25 Published: 2022-10-01
About the authors:
Zhongping ZHANG (born 1972), male, from Songyuan, Jilin; Ph.D.; professor at Yanshan University. His main research interests include big data, data mining, and semi-structured data.
Sen LI (born 1997), male, from Zhoukou, Henan; master's student at Yanshan University. His main research interest is data mining.
Weixiong LIU (born 1997), male, from Guangzhou, Guangdong; master's student at Yanshan University. His main research interest is data mining.
Shuxia LIU (born 1974), female, from Xingtai, Hebei; Ph.D.; lecturer at Hebei Normal University of Science and Technology. Her main research interests include big data technology, deep learning, and blockchain.
Supported by:
The National Natural Science Foundation of China (61972334); The National Social Science Foundation of China (20BJ122); Hebei Province Innovation Capability Improvement Plan Project (20557640D); The Intelligent Image Workpiece Recognition of Sida Railway (x2021134)

Abstract

Abstract: To address the problems that the density peak clustering algorithm requires manually set parameters and has high time complexity, an outlier detection algorithm based on the fast density peak clustering outlier factor was proposed. First, the k-nearest-neighbor algorithm was used in place of the density estimation in density peak clustering, and a KD-Tree index structure was used to compute the k nearest neighbors of each data object. Then, cluster centers were selected automatically using the product of density and distance. In addition, the centripetal relative distance and the fast density peak clustering outlier factor were defined to characterize the degree to which a data object is an outlier. The proposed algorithm was evaluated on artificial and real data sets and compared with several classical and recent algorithms. The results verify its effectiveness in terms of both correctness and time efficiency.

Key words: data mining, density peak clustering, outlier, k nearest neighbor, centripetal relative distance
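
The first two steps named in the abstract (a KD-Tree based k-nearest-neighbor density, then ranking candidate cluster centers by the product of density and distance) can be illustrated with a minimal, standard-library Python sketch. This page does not give the paper's formulas, so several parts are assumptions made only for illustration: the density is taken as k divided by the summed distance to the k nearest neighbors, the distance term follows the usual density peak clustering definition (distance to the nearest denser point), and the function names `build_kdtree`, `knn`, `knn_density`, and `center_scores` are hypothetical. The centripetal relative distance and the outlier factor itself are not reproduced here.

```python
import heapq
import math

def build_kdtree(points, indices=None, depth=0):
    """Return a nested tuple (index, axis, left, right), or None for an empty subtree."""
    if indices is None:
        indices = list(range(len(points)))
    if not indices:
        return None
    axis = depth % len(points[0])
    indices = sorted(indices, key=lambda i: points[i][axis])
    mid = len(indices) // 2
    return (indices[mid], axis,
            build_kdtree(points, indices[:mid], depth + 1),
            build_kdtree(points, indices[mid + 1:], depth + 1))

def knn(node, points, q, k, heap=None):
    """Collect the k nearest neighbours of points[q] as a max-heap of (-distance, index)."""
    if heap is None:
        heap = []
    if node is None:
        return heap
    idx, axis, left, right = node
    if idx != q:
        d = math.dist(points[q], points[idx])
        if len(heap) < k:
            heapq.heappush(heap, (-d, idx))
        elif d < -heap[0][0]:
            heapq.heapreplace(heap, (-d, idx))
    diff = points[q][axis] - points[idx][axis]
    near, far = (left, right) if diff < 0 else (right, left)
    knn(near, points, q, k, heap)
    # Visit the far side only if the splitting plane is closer than the current k-th neighbour.
    if len(heap) < k or abs(diff) < -heap[0][0]:
        knn(far, points, q, k, heap)
    return heap

def knn_density(points, k):
    """Assumed density (not taken from the paper): k / summed distance to the k nearest neighbours."""
    tree = build_kdtree(points)
    rho = []
    for q in range(len(points)):
        total = sum(-neg_d for neg_d, _ in knn(tree, points, q, k))
        rho.append(k / (total + 1e-12))  # small constant guards against duplicate points
    return rho

def center_scores(points, rho):
    """gamma_i = rho_i * delta_i, where delta_i is the distance to the nearest denser point."""
    order = sorted(range(len(points)), key=lambda i: -rho[i])  # densest first, ties by index
    gamma = [0.0] * len(points)
    for rank, i in enumerate(order):
        if rank == 0:  # densest point: use its largest distance to any other point
            delta = max(math.dist(points[i], p) for p in points)
        else:          # brute-force scan over denser points, kept simple for the sketch
            delta = min(math.dist(points[i], points[j]) for j in order[:rank])
        gamma[i] = rho[i] * delta
    return gamma

# Two small, well-separated groups; indices 0 and 5 are their central points.
points = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1),
          (50, 50), (51, 50), (50, 51), (49, 50), (50, 49)]
rho = knn_density(points, k=4)
gamma = center_scores(points, rho)
top_two = sorted(range(len(points)), key=lambda i: -gamma[i])[:2]
print(top_two)
```

In this toy data the two central points receive by far the largest products, which is the behavior a density-times-distance ranking relies on. How many top-ranked points are accepted as centers is left open here, since the automatic selection rule is not described on this page.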

CLC number: TP311

Zhongping ZHANG, Sen LI, Weixiong LIU, Shuxia LIU. Outlier detection algorithm based on fast density peak clustering outlier factor[J]. Journal on Communications, 2022, 43(10): 186-195.

Figures and tables: 13

Table 3 Precision of different algorithms on the artificial data sets

Algorithm	D₁	D₂	D₃	D₄
FDPC-OF	100.00%	97.67%	87.05%	95.83%
COF	86.04%	95.34%	76.47%	93.05%
NOF	95.34%	41.86%	57.64%	75.00%
LDF	93.02%	97.67%	84.70%	81.94%
RDOS	97.67%	32.55%	58.82%	84.72%
IForest	53.48%	93.02%	76.47%	26.38%
NANOD	48.83%	32.55%	71.76%	18.05%
MOD	83.72%	93.02%	76.47%	88.89%

Table 4 F1 scores of different algorithms on the artificial data sets

Algorithm	D₁	D₂	D₃	D₄
FDPC-OF	97.72%	95.45%	85.05%	93.87%
COF	85.23%	93.18%	75.29%	91.84%
NOF	93.18%	40.90%	56.32%	74.15%
LDF	92.05%	95.45%	83.91%	80.27%
RDOS	96.60%	31.81%	58.05%	82.99%
IForest	52.27%	92.05%	74.71%	25.85%
NANOD	47.72%	31.81%	70.11%	18.35%
MOD	81.82%	90.91%	76.44%	88.44%
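
For readers unfamiliar with the two metrics reported in these tables, the following Python sketch shows the standard definitions of precision and F1 applied to a top-n outlier ranking. The evaluation protocol used in the paper is not stated on this page, so the choice of n, the scores, and the helper names `top_n` and `precision_recall_f1` are assumptions for illustration only.

```python
def precision_recall_f1(true_outliers, detected):
    """Return (precision, recall, F1) for a collection of detected outlier indices."""
    truth, found = set(true_outliers), set(detected)
    tp = len(truth & found)  # correctly detected outliers
    precision = tp / len(found) if found else 0.0
    recall = tp / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

def top_n(scores, n):
    """Indices of the n objects with the largest outlier scores."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:n]

# Hypothetical outlier scores for six objects; objects 1, 3 and 4 are the true outliers.
scores = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7]
true_outliers = {1, 3, 4}
detected = top_n(scores, n=3)
precision, recall, f1 = precision_recall_f1(true_outliers, detected)
print(detected, round(precision, 4), round(recall, 4), round(f1, 4))
```

When n equals the number of true outliers, as in this toy case, precision, recall, and F1 coincide. They differ once the number of reported objects differs from the number of true outliers.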

Table 6 Precision of different algorithms on the real data sets

Algorithm	Ionosphere	Iris	Wdbc	Vowels
FDPC-OF	91.26%	70%	78.78%	78%
COF	77.77%	60%	36.36%	50%
RDOS	53.17%	60%	15.15%	4%
LDF	87.30%	70%	33.33%	48%
NOF	69.04%	50%	12.12%	26%
IForest	63.49%	50%	57.57%	14%
NANOD	73.81%	80%	78.78%	52%
MOD	84.92%	60%	78.78%	42%

Table 7 F1 scores of different algorithms on the real data sets

Algorithm	Ionosphere	Iris	Wdbc	Vowels
FDPC-OF	89.14%	70%	79.09%	76.47%
COF	76.36%	60%	37.27%	49.01%
RDOS	53.17%	60%	14.92%	3.92%
LDF	85.27%	70%	32.83%	48.03%
NOF	68.61%	50%	11.94%	26.45%
IForest	62.01%	50%	56.71%	14.65%
NANOD	72.48%	80%	77.61%	52.90%
MOD	84.12%	60%	79.09%	42.15%



Software/hardware environment	Parameter
CPU	2.60 GHz Intel i7-4720HQ
Hard disk/GB	512.0
Memory/GB	16.0
Development environment	PyCharm
Runtime environment	Python 3.8
Visualization tool	PyCharm

Data set	Number of samples	Number of outliers	Outlier ratio
D₁	1256	43	3.4%
D₂	1043	43	4.1%
D₃	1000	85	8.5%
D₄	1372	72	5.2%


Data set	Number of samples	Number of attributes	Number of outliers	Outlier ratio
Ionosphere	351	34	126	35.9%
Iris	110	4	10	9.1%
Wdbc	390	30	33	8.4%
Vowels	1456	12	50	3.4%
