CMDC: an iterative algorithm for complementary multi-view document clustering

Journal on Communications, 2020, Vol. 41, Issue (8): 155-164. doi: 10.11959/j.issn.1000-436x.2020152

Ruizhang HUANG¹,², Ruina BAI¹, Yanping CHEN¹,², Yongbin QIN¹,², Xinyu CHENG¹,³, Youliang TIAN¹,²

¹ College of Computer Science and Technology, Guizhou University, Guiyang 550025, China
² Guizhou Provincial Key Laboratory of Public Big Data, Guiyang 550025, China
³ Guizhou Intelligent Human-Computer Interaction Engineering Technology Research Center, Guiyang 550025, China

Revised: 2020-06-10 Online: 2020-08-25 Published: 2020-09-05
About the authors:
Ruizhang HUANG (1979- ), female, born in Tianjin, Ph.D., associate professor and master's supervisor at Guizhou University. Her main research interests include data mining, text mining, machine learning and information retrieval.
Ruina BAI (1994- ), female, born in Xing County, Shanxi, master's student at Guizhou University. Her main research interests include text mining and machine learning.
Yanping CHEN (1980- ), male, born in Changshun, Guizhou, Ph.D., associate professor and master's supervisor at Guizhou University. His main research interests include artificial intelligence and natural language processing.
Yongbin QIN (1980- ), male, born in Zhaoyuan, Shandong, Ph.D., professor and doctoral supervisor at Guizhou University. His main research interests include smart computing and intelligent computing, and big data management and applications.
Xinyu CHENG (1978- ), male, born in Suiyang, Guizhou, associate professor at Guizhou University. His main research interests include machine learning and computer vision.
Youliang TIAN (1982- ), male, born in Panxian, Guizhou, Ph.D., professor at Guizhou University. His main research interests include algorithmic game theory, cryptography and security protocols, big data security and privacy protection, and electronic currency and blockchain technology.
Supported by:
The National Natural Science Foundation of China (61462011); The National Natural Science Foundation of China (91746116); The Joint Funds of the National Natural Science Foundation of China (U1836205); The Key Projects of Science and Technology of Guizhou ([2020]1Z055)

Abstract:

Traditional multi-view document clustering algorithms separate document representation from the clustering process and ignore the complementary characteristics among views. To address these problems, an iterative algorithm for complementary multi-view document clustering, CMDC, was proposed, which optimizes document clustering and feature adjustment in a unified manner. CMDC selects complementary documents from the clustering results of the individual views and, based on a local metric learning algorithm, uses them to tune the features used for clustering in each view. Partition consistency across views is achieved through the consistency of the learned view metrics. Experimental results show that CMDC effectively improves multi-view clustering performance.

Key words: multi-view document clustering, complementary text, constrained document clustering, metric calculation
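The iterative loop outlined in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`kmeans`, `complementary_pairs`, `update_weights`, `cmdc_sketch`) are hypothetical, plain k-means stands in for the per-view clusterer, a diagonal feature reweighting stands in for the local metric learning step, and "complementary documents" are assumed here to be document pairs that one view groups together while another view splits them.

```python
import numpy as np


def kmeans(X, k, weights, n_iter=20, seed=0):
    """Plain k-means under a diagonal metric (features scaled by sqrt(weights))."""
    rng = np.random.default_rng(seed)
    Xw = X * np.sqrt(weights)
    centers = Xw[rng.choice(len(Xw), size=k, replace=False)]
    for _ in range(n_iter):
        dists = ((Xw[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = Xw[labels == j].mean(axis=0)
    return labels


def complementary_pairs(labels_src, labels_dst, max_pairs=500, seed=0):
    """Assumed definition: pairs the source view groups but the target view splits."""
    same_src = labels_src[:, None] == labels_src[None, :]
    same_dst = labels_dst[:, None] == labels_dst[None, :]
    i, j = np.where(np.triu(same_src & ~same_dst, k=1))
    pairs = np.column_stack([i, j])
    if len(pairs) > max_pairs:
        rng = np.random.default_rng(seed)
        pairs = pairs[rng.choice(len(pairs), size=max_pairs, replace=False)]
    return pairs


def update_weights(X, pairs, weights, lr=0.5, eps=1e-12):
    """Stand-in for metric learning: down-weight features on which pairs differ most."""
    if len(pairs) == 0:
        return weights
    pair_diff = ((X[pairs[:, 0]] - X[pairs[:, 1]]) ** 2).mean(axis=0)
    ratio = pair_diff / (X.var(axis=0) + eps)
    target = 1.0 / (1.0 + ratio)
    target *= len(target) / target.sum()  # rescale so the weights average to 1
    return (1.0 - lr) * weights + lr * target


def cmdc_sketch(views, k, n_rounds=5):
    """Alternate per-view clustering with feature reweighting driven by other views."""
    weights = [np.ones(X.shape[1]) for X in views]
    labels = [kmeans(X, k, w) for X, w in zip(views, weights)]
    for _ in range(n_rounds):
        for t, X in enumerate(views):
            for s in range(len(views)):
                if s != t:
                    pairs = complementary_pairs(labels[s], labels[t])
                    weights[t] = update_weights(X, pairs, weights[t])
            labels[t] = kmeans(X, k, weights[t])
    return labels, weights
```

Calling `cmdc_sketch([view_a, view_b], k=3)` with one feature matrix per view (rows aligned by document) returns one label vector and one feature-weight vector per view. Pairwise same-cluster comparisons are used so that cluster indices never have to be aligned across views.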

CLC number: TP301

Cite this article: Ruizhang HUANG, Ruina BAI, Yanping CHEN, Yongbin QIN, Xinyu CHENG, Youliang TIAN. CMDC: an iterative algorithm for complementary multi-view document clustering[J]. Journal on Communications, 2020, 41(8): 155-164.

Figures/Tables (6): Figure 1, Table 1, Table 2, Table 3, Figure 2, Figure 3

Dataset	Number of samples	Number of views	Number of classes
AMiner	1,500	2	3
MHN	2,605	3	4

Algorithm	AMiner abstract view	AMiner user view	MHN title view	MHN body view	MHN topic view
k-means	0.679	0.710	0.525	0.825	0.709
MTCUBC	0.783	0.733	0.606	0.851	0.621
CMDC	0.818	0.850	0.668	0.837	0.756

Algorithm	AMiner	MHN
Mv+k-means	0.782	0.760
P-MLRSSC	0.826	0.748
C-MLRSSC	0.833	0.817
MSC_IAS	0.769	0.854
CMDC	0.850	0.837
