Time series and semantics-based Chinese microblog topic detection and tracking method

doi:10.11959/j.issn.2096-109x.2016.00048

Abstract

Microblog posts are short, spread quickly, and shift topics frequently, which poses new challenges for topic detection and tracking in social media. Exploiting the temporal behavior of microblog topics and the semantic similarity of short texts, this paper proposes a systematic, clustering-based method for microblog topic detection and tracking. First, temporal frequent word sets are defined for microblog text, and a feature word selection method oriented to hot topics is derived from them. Second, initial microblog clusters are obtained from the selected temporal frequent feature words using maximal frequent itemsets. Because posts can appear in more than one initial cluster, an overlap reduction algorithm based on the extended semantic membership of short texts is proposed, yielding fully separated initial clusters. Finally, an agglomerative topic clustering method is applied using the cluster semantic similarity matrix. Experiments on real-world Sina microblog data show that the method can be used to detect and track hot topics in Chinese microblogs.

Key words: microblog text, frequent word sets, feature selection, clustering, topic detection, time series, semantics
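
The first two steps of the abstract (temporal frequent feature words, then initial clusters from maximal frequent itemsets) can be illustrated with a minimal sketch. The abstract does not give the exact definition of a temporal frequent word, so the rule used here is an assumption: a word qualifies if its support in the current time window is high enough and has grown relative to the previous window. The function names, the thresholds `min_support`, `min_growth`, `min_count`, and the brute-force itemset enumeration are all hypothetical choices for illustration, and posts are assumed to be already segmented into word lists.

```python
from itertools import combinations


def temporal_frequent_words(prev_docs, curr_docs, min_support=0.3, min_growth=1.5):
    """Select words that are frequent now and rising versus the previous window.

    ASSUMPTION: this support-plus-growth rule is illustrative only; the paper's
    own definition of a temporal frequent word set is not given in the abstract.
    """
    def support(docs):
        counts = {}
        for doc in docs:
            for word in set(doc):
                counts[word] = counts.get(word, 0) + 1
        total = max(len(docs), 1)
        return {word: c / total for word, c in counts.items()}

    prev, curr = support(prev_docs), support(curr_docs)
    eps = 1e-9  # avoids division by zero for words unseen in the previous window
    return {
        word for word, s in curr.items()
        if s >= min_support and s / (prev.get(word, 0.0) + eps) >= min_growth
    }


def maximal_frequent_itemsets(docs, vocab, min_count=2, max_size=3):
    """Brute-force maximal frequent itemsets over a small feature vocabulary."""
    doc_sets = [set(doc) for doc in docs]
    frequent = []
    for k in range(1, max_size + 1):
        for items in combinations(sorted(vocab), k):
            candidate = frozenset(items)
            if sum(1 for d in doc_sets if candidate <= d) >= min_count:
                frequent.append(candidate)
    # Keep only itemsets that have no frequent proper superset.
    return [s for s in frequent if not any(s < t for t in frequent)]


def initial_clusters(docs, itemsets):
    """Each itemset labels the cluster of posts that contain all of its words."""
    return {s: [i for i, d in enumerate(docs) if s <= set(d)] for s in itemsets}


if __name__ == "__main__":
    previous = [["weather", "rain"], ["lunch", "noodles"], ["weather", "hot"]]
    current = [["jd", "price", "war"], ["jd", "price", "suning"],
               ["typhoon", "rain"], ["typhoon", "wind", "rain"], ["lunch"]]
    features = temporal_frequent_words(previous, current)
    labels = maximal_frequent_itemsets(current, features)
    print(initial_clusters(current, labels))
```

Because a post can contain more than one maximal itemset, clusters built this way may overlap, which is why the method follows this step with overlap reduction.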

CLC number: TP301

Tie-ming CHEN (陈铁明), Xiao-hao WANG (王小号), Wei-wei PANG (庞卫巍), Jie JIANG (江颉). Time series and semantics-based Chinese microblog topic detection and tracking method[J]. Chinese Journal of Network and Information Security, 2016, 2(5): 21-29.

Figures and tables (13): Figures 1-10 and Tables 1-3.


	t_i1	t_i2	…	t_i(n-1)	t_in
t_j1	sim(t_j1, t_i1)	sim(t_j1, t_i2)	…	sim(t_j1, t_i(n-1))	sim(t_j1, t_in)
t_j2	sim(t_j2, t_i1)	sim(t_j2, t_i2)	…	sim(t_j2, t_i(n-1))	sim(t_j2, t_in)
…	…	…	…	…	…
t_j(m-1)	sim(t_j(m-1), t_i1)	sim(t_j(m-1), t_i2)	…	sim(t_j(m-1), t_i(n-1))	sim(t_j(m-1), t_in)
t_jm	sim(t_jm, t_i1)	sim(t_jm, t_i2)	…	sim(t_jm, t_i(n-1))	sim(t_jm, t_in)
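
A word-pair similarity matrix like the one above can be reduced to a single similarity score between two short texts. The sketch below shows one common way to do that, assuming the score is the average of the best match for each row and the best match for each column; the source does not state which aggregation the paper uses, so this rule is an assumption. The lookup table `TOY_SIM` and the function `toy_word_sim` are hypothetical stand-ins for a real lexicon-based word similarity measure.

```python
def similarity_matrix(words_j, words_i, word_sim):
    """Rows follow words_j (t_j1..t_jm), columns follow words_i (t_i1..t_in)."""
    return [[word_sim(tj, ti) for ti in words_i] for tj in words_j]


def text_similarity(words_j, words_i, word_sim):
    """Average of best row matches and best column matches.

    ASSUMPTION: this aggregation is illustrative; the paper's own formula
    is not given in the text available here.
    """
    matrix = similarity_matrix(words_j, words_i, word_sim)
    if not matrix or not matrix[0]:
        return 0.0
    row_best = [max(row) for row in matrix]
    col_best = [max(col) for col in zip(*matrix)]
    row_score = sum(row_best) / len(row_best)
    col_score = sum(col_best) / len(col_best)
    return (row_score + col_score) / 2


# Hypothetical toy similarities standing in for a lexicon-based measure.
TOY_SIM = {frozenset(("phone", "handset")): 0.9}


def toy_word_sim(a, b):
    """Identical words score 1.0; unknown pairs score 0.0."""
    if a == b:
        return 1.0
    return TOY_SIM.get(frozenset((a, b)), 0.0)


if __name__ == "__main__":
    print(text_similarity(["phone", "launch"], ["handset", "launch"], toy_word_sim))
```

Taking the maximum per row and per column keeps the score symmetric in the two texts and avoids penalizing a text for containing words that have no counterpart in a shorter one.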

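No.	Labeled cluster	Cluster size
1	{保钓 (Baodiao), 钓鱼岛 (Diaoyu Islands)}	1,438
2	{京东 (JD.com), 电商 (e-commerce)}	1,427
3	{海葵 (Haikui)}	2,463
4	{流星雨 (meteor shower)}	157
5	{爆头哥 ("Headshot Brother"), 周克华 (Zhou Kehua)}	680
6	{伦敦 (London), 奥运 (Olympics)}	2,000
7	{小米 (Xiaomi)}	528
8	{星座 (zodiac signs), 运势 (fortune)}	3,115
9	{叶诗文 (Ye Shiwen)}	799
10	{羽毛球 (badminton)}	749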
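
Labeled initial clusters such as those in the table above can share posts, and the abstract describes two follow-up steps: reducing overlaps by semantic membership, then merging clusters agglomeratively by semantic similarity. The sketch below illustrates both under stated assumptions. The callables `membership` and `cluster_sim`, the `threshold` parameter, and the stopping rule are hypothetical; the paper's actual membership and cluster similarity definitions are not given in the text available here.

```python
def remove_overlaps(clusters, membership):
    """Assign each post to the single cluster where its membership is highest.

    clusters: {label (frozenset of words): set of post ids}
    membership: callable (post_id, label) -> float (ASSUMED to be supplied)
    """
    owners = {}
    for label, members in clusters.items():
        for post in members:
            owners.setdefault(post, []).append(label)
    disjoint = {label: set() for label in clusters}
    for post, labels in owners.items():
        best = max(labels, key=lambda lab: membership(post, lab))
        disjoint[best].add(post)
    return disjoint


def agglomerate(clusters, cluster_sim, threshold):
    """Merge the most similar pair of clusters until no pair reaches threshold.

    cluster_sim: callable (set of post ids, set of post ids) -> float (ASSUMED)
    Returns a list of (label words, post ids) pairs.
    """
    groups = [(set(label), set(members)) for label, members in clusters.items()]
    while len(groups) > 1:
        best = None
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                score = cluster_sim(groups[a][1], groups[b][1])
                if best is None or score > best[0]:
                    best = (score, a, b)
        score, a, b = best
        if score < threshold:
            break
        merged = (groups[a][0] | groups[b][0], groups[a][1] | groups[b][1])
        groups = [g for k, g in enumerate(groups) if k not in (a, b)] + [merged]
    return groups
```

Recomputing pairwise similarities in every round is equivalent to maintaining a cluster similarity matrix and updating it after each merge; the matrix form is cheaper for large numbers of clusters.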

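No.	Manually labeled topics (7)	Top-10 clustering results
1	电商大战 (e-commerce price war)	{京东 (JD.com), 苏宁 (Suning), 价格 (price)}
2	鸡蛋灌饼 (egg-filled pancake)	{垃圾 (garbage), 杭州 (Hangzhou), 里脊 (tenderloin), 生菜 (lettuce), 口感 (texture), 鸡蛋 (egg), 滋味 (flavor), 经典 (classic), 小吃 (snack), 外皮 (crust)}
3	钓鱼岛 (Diaoyu Islands)	{保钓 (Baodiao), 钓鱼岛 (Diaoyu Islands), 香港 (Hong Kong), 日本 (Japan), 人士 (persons)}
4	向日葵 (sunflower)	{偶像 (idol), 好友 (friend), 先锋 (pioneer), 盛宴 (feast), 大麦 (Damai), 门票 (tickets), 上海 (Shanghai)…}
5	高温 (high temperature)	{星座 (zodiac signs), 双鱼 (Pisces), 射手 (Sagittarius), 处女 (Virgo)}
6	小米手机 (Xiaomi phone)	{杂志 (magazine), 免费 (free), 书店 (bookstore)}
7	周克华 (Zhou Kehua)	{美国 (USA), 中国 (China)}
8		{今年 (this year), 向日葵 (sunflower), 土壤 (soil), 积累 (accumulation), 微生物 (microorganisms)}
9		{图片 (picture), 手机 (phone), 小米 (Xiaomi), 发布会 (launch event)}
10		{高温 (high temperature), 杭州 (Hangzhou), 气温 (air temperature), 重庆 (Chongqing)}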