Please wait a minute...
通信学报  2017 Issue (3)    DOI: 10.11959/j.issn.1000-436x.2017061
  学术论文 本期目录 | 过刊浏览 | 高级检索 |
基于快速高斯变换的不确定数据聚类算法
迟荣华1,程媛2,3,朱素霞2,黄少滨1,陈德运2,3
1 哈尔滨工程大学计算机科学与技术学院,黑龙江 哈尔滨 150001
2 哈尔滨理工大学计算机科学与技术学院,黑龙江 哈尔滨 150080
3 哈尔滨理工大学计算机科学与技术学院博士后流动站,黑龙江 哈尔滨 150080
Uncertain data analysis algorithm based on fast Gaussian transform
Rong-hua CHI1,Yuan CHENG2,3,Su-xia ZHU2,Shao-bin HUANG1,De-yun CHEN2,3
1 College of Computer Science and Technology,Harbin Engineering University,Harbin 150001,China
2 College of Computer Science and Technology,Harbin University of Science and Technology,Harbin 150080,China
3 Postdoctoral Research Station,College of Computer Science and Technology,Harbin University of Science and Technology,Harbin 150080,China
全文: PDF (1374KB)   ( 192 ) HTML     XML
输出: BibTeX | EndNote (RIS)  
摘要 

数据中不确定性的存在使对其聚类分析时要充分考虑不确定性的影响。针对现有不确定数据聚类算法中构建不确定数据模型以及距离度量时存在的影响结果准确性与聚类性能等问题,提出一种基于快速高斯变换的不确定数据聚类算法。首先在不假设数据分布的前提下,构建符合不确定性分布特征的数据模型;然后结合不确定对象的2个重要特征:属性特征与表示不确定数据分布特征的概率密度函数,度量不确定数据对象间的相似性;并以此为基础提出不确定数据聚类算法;最后在UCI以及真实数据集上的实验结果表明,所提算法在运行效率和聚类准确性方面均能取得较好效果。

服务
加入引用管理器
E-mail Alert
RSS
作者相关文章
迟荣华
程媛
朱素霞
黄少滨
陈德运
关键词 聚类分析不确定数据概率密度函数快速高斯变换核密度估计    
Abstract

The effect of the uncertainties needs to be taken full advantage during uncertain data clustering.An uncertain data clustering algorithm based on fast Gaussian transform was proposed,to solve the problems about the impact on the accuracy of clustering results and the clustering efficiency caused by the uncertainties,during the construction of uncertain data models and the distance measurement,which existed in the current researches.First,the data model according to the characteristic of the uncertainty distribution was constructed,without the premise of assuming the data distribution.And the similarity between uncertain data objects was measured by combining the two important features of uncertain objects,attribute features and the probability density function representing the characteristic of uncertainty distribution.And then the uncertain data clustering algorithm was proposed.Finally,the experiment results on UCI and real datasets indicate the better efficiency and accuracy of proposed algorithm.

Key words: clustering analysis    uncertain data    probability density function    fast Gaussian transform    kernel density estimation
出版日期: 2017-04-13
基金资助:国家自然科学青年基金资助项目;黑龙江省青年科学基金资助项目;黑龙江省普通高等学校青年学术骨干支持计划基金资助项目
链接本文:  
http://www.infocomm-journal.com/txxb/CN/10.11959/j.issn.1000-436x.2017061
图1   确定性对象与不确定性对象对比
  
数据集 对象个数 属性个数 类别数
Iris 150 4 3
Ecoli 336 8 8
Yeast 1 484 8 10
Abalone 4 177 8 16
Letter 20 000 16 26
表1   UCI数据集
  
服务人数 采集时间 经度/° 纬度/°
889 2013/2/4 1:00 126.683 3 45.750 97
888 2013/2/4 1:00 126.683 3 45.750 97
443 2013/2/4 2:00 126.683 3 45.750 97
441 2013/2/4 2:00 126.683 3 45.750 97
510 2013/2/4 1:00 128.038 8 45.950 6
表2   各基站下服务人数数据集
  
图2   数据模型构建运行时间相对对比结果
  
图3   UCI数据集上基于Purity和Entropy的聚类准确性对比结果
  
图4   UCI数据集上基于NMI和F-measure的聚类准确性对比结果
  
指标 IUK-means& UK-means IUK-means& MC IUK-means& LMHIK
Purity 0.001 0.746 0.039
Entropy 0.001 0.446 0.027
NMI 0.001 0.353 0.002
F-measure 0.001 0.221 0.017
表3   聚类结果准备性的成对数据T检验结果
  
服务人数观测值 采集周期 经度/° 纬度/°
889 888 … 888 887 2013/2/4 1:00 126.683 3 45.7509 7
443 441 … 444 442 2013/2/4 1:00 126.683 3 45.7509 7
418 417 … 417 418 2013/2/4 1:00 126.683 3 45.7509 7
510 509 … 511 508 2013/2/4 1:00 128.038 8 45.950 6
表4   用于不确定分析的基站下服务人数数据集
  
图5   真实数据集上聚类准确性对比结果
  
图6   真实数据集上聚类算法运行时间对比结果
  
图7   真实数据集上聚类结果示意
  
[1] WANG Y , LI X , LI X ,et al. A survey of queries over uncertain data[J]. Knowledge and Information Systems, 2013,37(3): 485-530.
[2] SEN P , DESHPANDE A . Representing and querying correlated tuples in probabilistic databases[C]// IEEE 23rd International Conference on Data Engineering. Istanbul,Turkey,IEEE, 2007: 596-605.
[3] MUZAMMAL M , RAMAN R . Mining sequential patterns from probabilistic databases[J]. Knowledge and Information Systems, 2015,44(2): 325-358.
[4] SOLIMAN M , ILYAS I , CHANG K . Top-k query processing in uncertain databases[C]// IEEE 23rd International Conference on Data Engineering. Istanbul,Turkey,IEEE, 2007: 896-905.
[5] AGGARWAL C . Managing and mining uncertain data[M]. USA: SpringerPress, 2009: 1-35.
[6] BARBARá D , GARCIA-MOLINA H , PORTER D . The management of probabilistic data[J]. IEEE Transactions on Knowledge and Data Engineering, 1992,4(5): 487-502.
[7] ANTOVA L , JANSEN T , KOCH C ,et al. Fast and simple relational processing of uncertain data[C]// IEEE 24th International Conference on Data Engineering. Cancun,Mexico,IEEE, 2008: 983-992.
[8] TANG R , CHENG R , WU H ,et al. A framework for conditioning uncertain relational data[J]. Database and Expert Systems Applications, 2012,37(3): 71-87.
[9] TASKAR B , SEGAL E , KOLLER D . Probabilistic classification and clustering in relational data[C]// The Seventeenth International Joint Conference on Artificial Intelligence.San Francisco,USA:Morgan Kaufmann Publishers Inc. 2001: 870-878.
[10] NGAI W K , KAO B , CHUI C K ,et al. Efficient clustering of uncertain data[C]// The Sixth International Conference on Data Mining. 2006: 436-445.
[11] KRIEGEL H , PFEIFLE M . Density-based clustering of uncertain data[C]// The 11st ACM SIGKDD International Conference on Knowledge Discovery in Data Mining. Chicago,USA:ACM, 2005: 672-677.
[12] KRIEGEL H , PFEIFLE M . Hierarchical density-based clustering of uncertain data[C]// Fifth IEEE International Conference on Data Mining. Houston,USA,IEEE, 2005: 689-692.
[13] 李云飞, 王丽珍, 周丽华 . 不确定数据的高效聚类算法[J]. 广西师范大学学报:自然科学版, 2011,29(2): 161-166. LI Y F , WANG L Z , ZHOU L H . More efficient clustering algorithm over uncertain data[J]. Journal of Guangxi Normal University (Natural Science Edition), 2011,29(2): 161-166.
[14] 金萍, 宗瑜, 屈世超 ,等. 面向不确定数据的近似骨架启发式聚类算法[J]. 南京大学学报 (自然科学), 2015,51(1): 197-205. JIN P , ZONG Y , QU S C ,et al. Approximate backbone guided heuristic clustering algorithm for uncertain data[J]. Journal of Nanjing University (Natural Sciences), 2015,51(1): 197-205.
[15] 肖宇鹏, 何云斌, 万静 ,等. 基于模糊 C-均值的空间不确定数据聚类[J]. 计算机工程, 2015,41(10): 47-52. XIAO Y P , HE Y B , WAN J ,et al. Clustering of space uncertain data based on fuzzy C-means[J]. Computer Engineering, 2015,41(10): 47-52.
[16] CORMODE G , MCGREGOR A . Approximation algorithms for clustering uncertain data[C]// The 27th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. Vancouver,Canada,ACM, 2008: 191-200.
[17] ZüFLE A , EMRICH T , SCHMID K A ,et al. Representative clustering of uncertain data[C]// The 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York,USA,ACM, 2014: 243-252.
[18] SCHUBERT E , KOOS A , EMRICH T ,et al. A framework for clustering uncertain data[C]// The 41st International Conference on Very Large Data Bases. Hawaii,USA,VLDB Endowment, 2015: 1976-1979.
[19] XU L , HU Q , HUNG E ,et al. Large margin clustering on uncertain data by considering probability distribution similarity[J]. Neurocomputing, 2015,158: 81-89.
[20] JIANG B , PEI J , TAO Y ,et al. Clustering uncertain data based on probability distribution similarity[J]. IEEE Transactions on Knowledge and Data Engineering, 2013,25(4): 751-763.
[21] LóPEZ-RUBIO E , JOSE . Probability densty function estimation with the frequency polygon transform[J]. Information Science, 2015,298: 136-158.
[22] ELGAMMAL A , DURAISWAMI R , DAVIS L S . Efficient kernel density estimation using the fast gauss transform with applications to color modeling and tracking[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2003,25(11): 1499-1504.
[23] SCOTT D . On optimal and data-based histograms[J]. Biometrika, 1979,66(3): 605-610.
[24] GREENGARD L , STRAIN J . The fast Gauss transform[J]. SIAM Journal on Scientific and Statistical Computing, 1991,12(1): 79-94.
[25] YANG C , DURAISWAMI R , GUMEROV N ,et al. Improved fast gauss transform and efficient kernel density estimation[C]// Ninth IEEE International Conference on Computer Vision. Nice,France:IEEE, 2003: 664-671.
[26] ZHAO Y , KARYPIS G . Criterion functions for document clustering:experiment and analysis[R]. Technical Report TR 01-40,Department of Computer Science,University of Minnesota,USA, 2001.
[27] MANNING C D , RAGHAVAN P , SCHUTZE H ,著. 王斌,译 信息检索导论[M]. 北京: 人民邮电出版社, 2010. MANNING C D , RAGHAVAN P , SCHUTZE H.WANG B,translate WANG B ,translate. Introduction to information retrieval[M]. Beijing: PTPRESS, 2010.
[28] DEMSAR J . Statistical comparisons of classifiers over multiple data sets[J]. Journal of Machine Learning Research, 2006(7): 1-30.
[29] ARBELAITZ O , GURRUTXAGA I , MUGUERZA J ,et al. An extensive comparative study of cluster validity indices[J]. Pattern Recognition, 2013,46(1): 243-256.
[1] 赵春晖,李雪源,崔 颖. 混合编码方式的图像聚类算法[J]. 通信学报, 2017, 38(2): 1-9.
[2] 房玉琢,许志勇,赵 兆. 基于近似核密度估计的近场多声源定位算法[J]. 通信学报, 2017, 38(1): 106-116.
[3] 丁 伟,徐 杰,卓文辉. 基于层次聚类的网络流识别算法研究[J]. 通信学报, 2014, 0(Z1): 9-45.
[4] 刘文杰1,伍之昂2,曹杰2,潘金贵1. 基于成对约束Info-Kmeans聚类的图像索引方法[J]. 通信学报, 2013, 0(7): 18-166.
[5] 闫健恩1,袁春阳2,许海燕1,张兆心1. 基于多维流量特征的IRC僵尸网络频道检测[J]. 通信学报, 2013, 0(10): 6-55.
本论文下载/浏览情况
全文


摘要

版权所有 © 2015 《通信学报》编辑部