Fuzzy co-clustering algorithm for high-order heterogeneous data

doi:10.3969/j.issn.1000-436x.2014.06.003

Abstract:

To analyze the clustering of high-order heterogeneous data more effectively where clusters overlap, a fuzzy co-clustering algorithm for high-order heterogeneous data (HFCC) is proposed. HFCC minimizes the weighted distances between objects and cluster centers in each feature space. Update rules for the fuzzy memberships of objects and the weights of features are derived, an iterative algorithm is designed for the clustering process, and the convergence of this iterative algorithm is proved theoretically. In addition, to estimate the number of clusters, the GXB validity index is proposed by generalizing the XB validity index so that it can measure the quality of high-order heterogeneous clustering results. Experimental results show that HFCC effectively detects the overlapping cluster structures hidden in the data and that its clustering quality is clearly better than that of five representative hard-partitioning high-order co-clustering algorithms. The GXB index also effectively determines the number of clusters in high-order heterogeneous data.

Key words: high-order heterogeneous data, co-clustering, fuzzy clustering
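The abstract does not state the GXB formula, so it is not reproduced here. For orientation only, the sketch below computes the classic XB (Xie-Beni) index that GXB is said to generalize, for a single feature space: the membership-weighted compactness divided by n times the minimum squared separation between cluster centers. The function name `xie_beni`, the fuzzifier default `m=2.0`, and the toy data are illustrative assumptions, not taken from the paper.

```python
def xie_beni(points, centers, memberships, m=2.0):
    """Classic XB index for one feature space (not the paper's GXB).

    points:      list of n coordinate tuples
    centers:     list of c coordinate tuples
    memberships: memberships[i][k] = membership of point k in cluster i
    m:           fuzzifier exponent (assumed default, not from the paper)
    """
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    n = len(points)
    c = len(centers)
    # Numerator: membership-weighted squared distances to the centers.
    compactness = sum(
        memberships[i][k] ** m * sqdist(points[k], centers[i])
        for i in range(c)
        for k in range(n)
    )
    # Denominator: n times the smallest squared distance between two centers.
    separation = min(
        sqdist(centers[i], centers[j])
        for i in range(c)
        for j in range(i + 1, c)
    )
    return compactness / (n * separation)


# Toy data (hypothetical): two tight, well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
centers = [(0.0, 0.5), (4.0, 0.5)]
memberships = [[1.0, 1.0, 0.0, 0.0],
               [0.0, 0.0, 1.0, 1.0]]
xb = xie_beni(points, centers, memberships)
print(xb)
```

Smaller XB values indicate more compact and better separated clusters, which is why the index is commonly evaluated over a range of candidate cluster numbers.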

Shao-bin HUANG, Xin-xin YANG, Lin-shan SHEN, Yan-mei LI. Fuzzy co-clustering algorithm for high-order heterogeneous data[J]. Journal on Communications, 2014, 35(6): 15-24.

Figures and tables (12): Figures 1-7, Tables 1-5

Abbreviation	Dataset	Number of words	Number of objects	Number of clusters	Topics
T1	Cora	2000	1000	2	database, artificial intelligence
T2	Cora	2000	1000	2	operating systems, architecture
I1	Corel	116	300	3	cow, grass, horses
I2	Corel	114	300	3	tree, bird, sky
P1	IAPR TC-12	1000	600	2	traveler, animal

Cluster	Membership values (16 objects; the per-object column labels were images and did not survive extraction)
Cluster 1	0.518	0.492	0.481	0.521	0.494	0.513	0.517	0.494	0.518	0.503	0.519	0.521	0.517	0.485	0.489	0.502
Cluster 2	0.482	0.508	0.519	0.479	0.506	0.487	0.483	0.506	0.482	0.497	0.481	0.479	0.483	0.515	0.511	0.498
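As a small illustration of how such fuzzy memberships can be read, the sketch below takes the 16 membership pairs from the table, checks that each object's memberships sum to 1, derives a hard label from the larger membership, and flags objects whose two memberships are nearly equal. The threshold `OVERLAP_GAP = 0.05` is a hypothetical choice for illustration and is not taken from the paper.

```python
# Membership values of the 16 objects, copied from the table above.
cluster1 = [0.518, 0.492, 0.481, 0.521, 0.494, 0.513, 0.517, 0.494,
            0.518, 0.503, 0.519, 0.521, 0.517, 0.485, 0.489, 0.502]
cluster2 = [0.482, 0.508, 0.519, 0.479, 0.506, 0.487, 0.483, 0.506,
            0.482, 0.497, 0.481, 0.479, 0.483, 0.515, 0.511, 0.498]

OVERLAP_GAP = 0.05  # hypothetical threshold, not from the paper

# In this table each object's memberships sum to 1 across the two clusters.
sums_ok = all(abs(u1 + u2 - 1.0) < 1e-9 for u1, u2 in zip(cluster1, cluster2))

# Hard label: the cluster with the larger membership.
hard_labels = [1 if u1 > u2 else 2 for u1, u2 in zip(cluster1, cluster2)]

# Objects whose two memberships are nearly equal lie in the overlap region.
overlap = [k for k, (u1, u2) in enumerate(zip(cluster1, cluster2))
           if abs(u1 - u2) < OVERLAP_GAP]

print(sums_ok, hard_labels.count(1), hard_labels.count(2), len(overlap))
# prints: True 10 6 16
```

With this threshold all 16 objects count as overlapping: a hard partition would assign 10 of them to cluster 1 and 6 to cluster 2, discarding the fact that every membership lies within about 0.02 of 0.5.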

c	HXB value
	m=1	m=1.5	m=2	m=2.5	m=3
2	2.53×10^-1	1.28×10^-1	1.26×10^-1	1.34×10^-1	1.67×10^-1
3	1.24×10^-5	5.79×10^-5	7.98×10^-4	1.45×10^-4	3.47×10^-1
4	3.18×10^-6	3.26×10^-6	8.02×10^-8	1.67×10^-8	1.27×10^-1
5	5.49×10^-7	3.48×10^-7	2.63×10^-9	2.39×10^-9	2.89×10^-2
6	7.49×10^-8	2.76×10^-8	1.08×10^-10	2.78×10^-9	1.03×10^-1

c	HXB value
	m=1	m=1.5	m=2	m=2.5	m=3
2	8.79×10^-1	3.19×10^-2	7.89×10^-2	2.09×10^-2	1.09×10^-2
3	2.20×10^-2	8.76×10^-1	4.13×10^-2	9.45×10^-1	9.39×10^-1
4	1.93×10^-3	2.54×10^-3	4.69×10^-3	1.83×10^-4	2.13×10^-4
5	3.63×10^-4	1.65×10^-5	9.29×10^-5	6.50×10^-6	6.20×10^-5
6	5.29×10^-5	8.40×10^-6	3.49×10^-5	7.28×10^-9	8.29×10^-6

c	HXB value
	m=1	m=1.5	m=2	m=2.5	m=3
2	3.79×10^-2	9.38×10^-2	6.52×10^-1	3.29×10^-3	6.90×10^-2
3	9.67×10^-2	1.16×10^-2	2.40×10^-2	9.89×10^-3	5.59×10^-3
4	6.09×10^-5	5.39×10^-4	3.80×10^-4	5.30×10^-5	6.60×10^-5
5	4.58×10^-8	7.39×10^-6	5.09×10^-3	3.59×10^-7	1.47×10^-7
6	9.28×10^-9	5.60×10^-7	4.39×10^-5	8.92×10^-6	5.40×10^-6