网络与信息安全学报 (Chinese Journal of Network and Information Security) ›› 2016, Vol. 2 ›› Issue (5): 21-29. doi: 10.11959/j.issn.2096-109x.2016.00048

• Academic paper •

结合时序和语义的中文微博话题检测与跟踪方法

陈铁明,王小号,庞卫巍,江颉   

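  1. College of Computer Science & Technology, Zhejiang University of Technology, Hangzhou 310023, Zhejiang, China
  • Revised: 2016-04-27 Online: 2016-05-15 Published: 2020-03-26
  • About the authors: Tie-ming CHEN (born 1978), male, from Zhuji, Zhejiang, Ph.D., professor at Zhejiang University of Technology; main research interest: network and information security. | Xiao-hao WANG (born 1981), male, from Xinchang, Zhejiang, lecturer at Zhejiang University of Technology; main research interest: information security. | Wei-wei PANG (born 1989), male, from Shaoxing, Zhejiang, master's student at Zhejiang University of Technology; main research interests: network security and text mining. | Jie JIANG (born 1972), female, from Pinghu, Zhejiang, Ph.D., associate professor at Zhejiang University of Technology; main research interest: network information security.
  • Supported by:
    The National Natural Science Foundation of China (U1509214); The Natural Science Foundation of Zhejiang Province (LY16F020035)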

Time series and semantics-based Chinese microblog topic detection and tracking method

Tie-ming CHEN, Xiao-hao WANG, Wei-wei PANG, Jie JIANG

  1. College of Computer Science & Technology, Zhejiang University of Technology, Hangzhou 310023, China
  • Revised: 2016-04-27 Online: 2016-05-15 Published: 2020-03-26
  • Supported by:
    The National Natural Science Foundation of China (U1509214); The Natural Science Foundation of Zhejiang Province (LY16F020035)

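Abstract:

Microblog posts are short, spread quickly, and shift topics often, which poses new challenges for social topic detection and tracking. Drawing on the temporal behavior of microblog topics and the semantic similarity of short texts, this paper proposes a clustering-based system method for microblog topic detection and tracking. First, temporal frequent word sets are defined over microblog text, giving a feature-word selection method oriented toward hot topics. Next, based on the temporal frequent feature word sets, maximal frequent itemsets are used to obtain an initial clustering of the microblogs. Because initial clusters may share posts, an inter-cluster overlap reduction algorithm based on the extended semantic membership of short texts is proposed, producing fully separated initial clusters. Finally, an agglomerative topic clustering method is given, driven by the cluster semantic similarity matrix. Experiments on Sina Weibo data show that the proposed method can be used to detect and track hot topics in Chinese microblogs.

Key words: microblog text, frequent word sets, feature selection, clustering, topic detection, time series, semantics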
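
The first two steps of the abstract (selecting temporally frequent feature words, then forming initial clusters from maximal frequent itemsets) can be illustrated with a small sketch. The abstract does not give the paper's definitions, so everything concrete here is an assumption made for illustration: a word counts as "temporally frequent" when its document frequency in the current time window reaches `min_support` and is at least `min_growth` times its frequency in the previous window; itemsets are enumerated by brute force up to `max_size` words; and the function names are hypothetical.

```python
from itertools import combinations


def temporal_frequent_words(prev_posts, curr_posts, min_support=2, min_growth=2.0):
    """Assumed criterion: frequent now AND growing versus the previous window."""
    def doc_freq(posts):
        freq = {}
        for post in posts:
            for word in set(post):  # count each word once per post
                freq[word] = freq.get(word, 0) + 1
        return freq

    prev, curr = doc_freq(prev_posts), doc_freq(curr_posts)
    # max(..., 1) smooths words that were unseen in the previous window.
    return {w for w, c in curr.items()
            if c >= min_support and c >= min_growth * max(prev.get(w, 0), 1)}


def maximal_frequent_itemsets(posts, vocabulary, min_support=2, max_size=3):
    """Brute-force enumeration; only practical for a small vocabulary."""
    post_sets = [set(p) & vocabulary for p in posts]
    frequent = []
    for size in range(1, max_size + 1):
        for items in combinations(sorted(vocabulary), size):
            candidate = frozenset(items)
            support = sum(1 for p in post_sets if candidate <= p)
            if support >= min_support:
                frequent.append(candidate)
    # Keep only itemsets that are not a proper subset of another frequent one.
    return [s for s in frequent if not any(s < other for other in frequent)]


def initial_clusters(posts, itemsets):
    """One cluster per maximal itemset: indices of posts containing all its words."""
    return {s: [i for i, p in enumerate(posts) if s <= set(p)] for s in itemsets}


if __name__ == "__main__":
    previous = [["concert", "tickets"], ["concert", "tickets", "review"],
                ["weather", "rain"]]
    current = [["earthquake", "rescue", "team"], ["earthquake", "rescue"],
               ["earthquake", "rescue", "donation"], ["concert", "tickets", "sale"],
               ["concert", "tickets"], ["rescue", "donation"]]
    hot = temporal_frequent_words(previous, current)
    clusters = initial_clusters(current, maximal_frequent_itemsets(current, hot))
    for itemset, members in clusters.items():
        print(sorted(itemset), members)
```

In this toy data the steady "concert" topic is filtered out because it is no more frequent than in the previous window, and one post ends up in two initial clusters, which is the overlap situation the abstract's third step addresses.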

Abstract:

Microblogs, a widely used social networking tool, are characterized by short posts, rapid propagation, and fast-changing topics, which makes social topic detection and tracking challenging. A systematic framework for microblog topic detection and tracking is proposed, based on microblog clustering that exploits temporal trends and semantic similarity. First, a feature-word selection method for hot topics is presented by defining the temporal frequent word set. Second, an initial clustering is derived from the selected temporal frequent word sets. To handle overlaps between initial clusters, an overlap elimination algorithm is proposed that introduces an extended semantic membership for short documents and separates any overlapping initial clusters. Finally, an agglomerative topic clustering method is applied using the cluster semantic similarity matrix. Experiments on a real-world dataset from Sina microblog show that the method can be used for topic detection and tracking in Chinese microblogs.

Key words: microblog text, frequent words, feature selection, clustering, topic detection, time series, semantics
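
The last two steps of the abstract (removing overlaps between initial clusters, then merging clusters agglomeratively from a similarity matrix) can be sketched as follows. This is not the paper's algorithm: the abstract does not define the extended semantic membership, so plain bag-of-words cosine similarity against a cluster's term-count centroid is swapped in as a stand-in, and the merge `threshold` and all function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two term-count Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def centroid(posts, members, exclude=None):
    """Summed term counts of a cluster, optionally leaving one post out."""
    total = Counter()
    for i in members:
        if i != exclude:
            total.update(posts[i])
    return total


def remove_overlaps(posts, clusters):
    """Keep each shared post only in the cluster it is most similar to.

    Stand-in membership: cosine between the post and the cluster centroid
    computed without that post (an assumption, not the paper's measure).
    """
    result = [list(c) for c in clusters]
    owners = {}
    for k, members in enumerate(clusters):
        for i in members:
            owners.setdefault(i, []).append(k)
    for i, ks in owners.items():
        if len(ks) > 1:
            best = max(ks, key=lambda k: cosine(
                Counter(posts[i]), centroid(posts, clusters[k], exclude=i)))
            for k in ks:
                if k != best:
                    result[k].remove(i)
    return result


def agglomerate(posts, clusters, threshold=0.5):
    """Repeatedly merge the most similar cluster pair while similarity >= threshold."""
    clusters = [list(c) for c in clusters if c]
    while len(clusters) > 1:
        cents = [centroid(posts, c) for c in clusters]
        # Upper triangle of the cluster similarity matrix, taken as (sim, i, j).
        sim, a, b = max((cosine(cents[i], cents[j]), i, j)
                        for i in range(len(clusters))
                        for j in range(i + 1, len(clusters)))
        if sim < threshold:
            break
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]  # b > a, so index a is unaffected
    return clusters


if __name__ == "__main__":
    posts = [["earthquake", "rescue", "team"], ["earthquake", "rescue"],
             ["earthquake", "rescue", "donation"], ["concert", "tickets", "sale"],
             ["concert", "tickets"], ["rescue", "donation"]]
    disjoint = remove_overlaps(posts, [[0, 1, 2], [2, 5]])
    print(disjoint)
    print(agglomerate(posts, disjoint + [[3, 4]]))
```

Recomputing centroids after every merge keeps the sketch short at the cost of efficiency; a fuller implementation would more likely update the similarity matrix incrementally and use a semantic resource rather than raw word overlap.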
