Journal on Communications ›› 2016, Vol. 37 ›› Issue (8): 24-33. doi: 10.11959/j.issn.1000-436x.2016152

• Academic Paper

Feature importance analysis for spammer detection in Sina Weibo

Yu-xiang ZHANG¹,²,³, Yu SUN¹, Jia-hai YANG²,³, Da-lei ZHOU⁴, Xiang-fei MENG⁵, Chun-jing XIAO¹

    1 College of Computer Science, Civil Aviation University of China, Tianjin 300300, China
    2 Institute for Network Sciences and Cyberspace, Tsinghua University, Beijing 100084, China
    3 Tsinghua National Laboratory for Information Science and Technology (TNList), Beijing 100084, China
    4 Institute of Network Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China
    5 State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing 100876, China
  • Online: 2016-08-25 Published: 2016-09-01
  • Supported by:
    The National Basic Research Program of China (973 Program); The National Key Technology R&D Program of China; The National Natural Science Foundation of China; The National Natural Science Foundation of China; The National Natural Science Foundation of China; Ph.D. Programs Foundation of Ministry of Education of China

Abstract:

Microblogs attract not only legitimate users but also spammers, whose abnormal behavior and spam content significantly degrade the user experience. To improve spammer detection accuracy, most existing studies either define as many classification features as possible or propose new classifiers. This raises three questions: should research effort go first to finding better features or to improving classifiers, do more features always yield better detection, and do new classifiers significantly improve detection? This paper addresses these questions by combining different feature selection methods with different classifiers on a real Sina Weibo dataset. The experimental results show that the choice of feature set matters more than the choice of classifier for spammer detection. In addition, features should be drawn from multiple aspects, namely text content, user behavior, and social relationships, and the feature dimension should not be too high, since more features do not necessarily yield better detection. These results should help guide future breakthroughs in microblog anti-spam work.

Key words: Sina Weibo, feature definition, feature selection, spammer detection
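
The experimental design described in the abstract, crossing several feature selection methods with several classifiers and varying the number of retained features, can be sketched as follows. This is only an illustrative sketch under stated assumptions: the abstract does not name the selectors, classifiers, or features used in the paper, so the two scoring functions (`corr_score`, `fisher_score`), the two classifiers (`nearest_centroid`, `one_nn`), and the synthetic data from `make_data` are all hypothetical stand-ins, not the authors' method or the Sina Weibo dataset.

```python
import random
from statistics import mean, pvariance

def make_data(n=200, n_informative=3, n_noise=7, seed=0):
    """Synthetic stand-in for account features (NOT the paper's Sina Weibo data)."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)  # 1 = spammer, 0 = legitimate (assumed encoding)
        informative = [rng.gauss(1.5 * label, 1.0) for _ in range(n_informative)]
        noise = [rng.gauss(0.0, 1.0) for _ in range(n_noise)]
        X.append(informative + noise)
        y.append(label)
    return X, y

def corr_score(xs, ys):
    """Absolute Pearson correlation between one feature and the label."""
    mx, my = mean(xs), mean(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    return abs(cov) / ((vx * vy) ** 0.5 + 1e-12)

def fisher_score(xs, ys):
    """Squared gap between class means over the summed within-class variances."""
    pos = [a for a, b in zip(xs, ys) if b == 1]
    neg = [a for a, b in zip(xs, ys) if b == 0]
    return (mean(pos) - mean(neg)) ** 2 / (pvariance(pos) + pvariance(neg) + 1e-12)

def rank_features(X, y, score):
    """Feature indices sorted from most to least relevant under `score`."""
    cols = list(zip(*X))
    return sorted(range(len(cols)), key=lambda j: score(cols[j], y), reverse=True)

def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def nearest_centroid(train_X, train_y, test_X):
    """Predict the label whose class mean is closest to each test row."""
    centroids = {}
    for label in set(train_y):
        rows = [r for r, l in zip(train_X, train_y) if l == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return [min(centroids, key=lambda l: sq_dist(row, centroids[l])) for row in test_X]

def one_nn(train_X, train_y, test_X):
    """Predict the label of the single nearest training row."""
    preds = []
    for row in test_X:
        nearest = min(range(len(train_X)), key=lambda i: sq_dist(row, train_X[i]))
        preds.append(train_y[nearest])
    return preds

def evaluate(X, y, selectors, classifiers, ks, n_train):
    """Test accuracy for every (selector, classifier, k) combination."""
    train_X, train_y = X[:n_train], y[:n_train]
    test_X, test_y = X[n_train:], y[n_train:]
    results = {}
    for s_name, score in selectors.items():
        order = rank_features(train_X, train_y, score)  # rank on training data only
        for k in ks:
            keep = order[:k]
            tr = [[row[j] for j in keep] for row in train_X]
            te = [[row[j] for j in keep] for row in test_X]
            for c_name, clf in classifiers.items():
                preds = clf(tr, train_y, te)
                hits = sum(p == t for p, t in zip(preds, test_y))
                results[(s_name, c_name, k)] = hits / len(test_y)
    return results

SELECTORS = {"correlation": corr_score, "fisher": fisher_score}
CLASSIFIERS = {"centroid": nearest_centroid, "1nn": one_nn}

X, y = make_data()
results = evaluate(X, y, SELECTORS, CLASSIFIERS, ks=(1, 3, 10), n_train=140)
for key in sorted(results):
    print(key, round(results[key], 3))
```

Reading the printed grid along the `k` axis shows how accuracy changes as more features are kept, while comparing rows that share a selector and `k` isolates the effect of swapping the classifier. Features are ranked on the training split only, so the test rows do not influence which features are selected.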
