Journal on Communications (通信学报) ›› 2021, Vol. 42 ›› Issue (9): 133-143. doi: 10.11959/j.issn.1000-436x.2021152

• Papers

ERDOF: outlier detection algorithm based on entropy weight distance and relative density outlier factor

Zhongping ZHANG (1,2,3), Weixiong LIU (1), Yuting ZHANG (1), Yu DENG (1), Mianxin WEI (1)

  1. College of Information Science and Engineering, Yanshan University, Qinhuangdao 066004, China
  2. The Key Laboratory for Computer Virtual Technology and System Integration of Hebei Province, Qinhuangdao 066004, China
  3. The Key Laboratory of Software Engineering of Hebei Province, Qinhuangdao 066004, China
  • Revised: 2021-06-30 Online: 2021-09-25 Published: 2021-09-01
  • About the authors:
    Zhongping ZHANG (1972- ), male, from Songyuan, Jilin, Ph.D., professor at Yanshan University. His main research interests include big data, data mining, and semi-structured data.
    Weixiong LIU (1997- ), male, from Guangzhou, Guangdong, master's student at Yanshan University. His main research interest is data mining.
    Yuting ZHANG (1996- ), male, from Fuyang, Anhui, master's student at Yanshan University. His main research interest is data mining.
    Yu DENG (1996- ), male, from Tangshan, Hebei, master's student at Yanshan University. His main research interest is data mining.
    Mianxin WEI (1997- ), male, from Shantou, Guangdong, master's student at Yanshan University. His main research interest is data mining.
  • Supported by:
    Hebei Province Innovation Capability Improvement Plan Project (20557640D)

Abstract:

To address the low accuracy of existing outlier detection algorithms on complex data distributions and high-dimensional data sets, an outlier detection algorithm based on entropy weight distance and relative density outlier factor (ERDOF) is proposed. First, entropy weight distance is introduced in place of Euclidean distance to improve detection accuracy. Then, Gaussian kernel density estimation is performed for each data object using the concept of natural neighbors. In addition, a relative distance is proposed to describe how far a data object deviates from its neighborhood, which improves the algorithm's ability to detect outliers in low-density regions. Finally, the entropy weight distance and relative density outlier factor is proposed to describe the degree to which a data object is an outlier. Experiments on artificial and real data sets show that the proposed algorithm adapts effectively to various data distributions and to outlier detection in high-dimensional data.

Key words: data mining, outlier detection, information entropy, kernel density estimation
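
The abstract names the building blocks of the method but gives no formulas, so the following is only a minimal sketch of the first two steps under stated assumptions, not the authors' ERDOF implementation. It assumes the textbook entropy weight method for the attribute weights, uses a fixed k nearest neighbors as a stand-in for natural neighbors, and omits the relative distance and the final outlier factor. All function names and the parameters `k` and `h` are hypothetical.

```python
import math

def entropy_weights(rows):
    """Assumed textbook entropy weight method: one weight per attribute, summing to 1."""
    n, d = len(rows), len(rows[0])
    divergence = []
    for j in range(d):
        col = [r[j] for r in rows]
        lo, span = min(col), max(col) - min(col)
        if span == 0:
            divergence.append(0.0)  # a constant attribute carries no information
            continue
        scaled = [(v - lo) / span for v in col]  # min-max scaling to [0, 1]
        total = sum(scaled)
        entropy = -sum((s / total) * math.log(s / total) for s in scaled if s > 0)
        divergence.append(max(0.0, 1.0 - entropy / math.log(n)))
    norm = sum(divergence)
    if norm == 0:
        return [1.0 / d] * d  # fall back to uniform weights
    return [g / norm for g in divergence]

def weighted_distance(x, y, w):
    """Euclidean distance with each squared difference scaled by its attribute weight."""
    return math.sqrt(sum(wj * (a - b) ** 2 for wj, a, b in zip(w, x, y)))

def knn_gaussian_density(rows, w, k, h):
    """Mean Gaussian kernel value over the k nearest neighbors (fixed k is an assumption)."""
    densities = []
    for i, x in enumerate(rows):
        dists = sorted(weighted_distance(x, y, w) for j, y in enumerate(rows) if j != i)
        densities.append(sum(math.exp(-(t * t) / (2 * h * h)) for t in dists[:k]) / k)
    return densities

# Toy data: a 5 x 5 grid of close points plus one isolated point at index 25.
points = [[0.1 * (i % 5), 0.1 * (i // 5)] for i in range(25)] + [[5.0, 5.0]]
weights = entropy_weights(points)
density = knn_gaussian_density(points, weights, k=4, h=0.5)
suspect = min(range(len(points)), key=density.__getitem__)
print("weights:", weights)
print("lowest-density index:", suspect)
```

In this toy run the isolated point receives by far the lowest density estimate. A density estimate alone does not capture the relative distance term that the abstract credits with improving detection in low-density regions, so this sketch should not be read as reproducing the paper's results.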
