Weibo spammers’ identification algorithm based on Bayesian model

Journal on Communications, 2017, Vol. 38, Issue (1): 44-53. doi: 10.11959/j.issn.1000-436x.2017006

Yan-mei ZHANG¹, Ying-ying HUANG¹, Shi-jie GAN¹, Yi DING², Zhi-long MA³

¹ Information School, Central University of Finance and Economics, Beijing 100081, China
² Network and Data Security Key Laboratory of Sichuan Province, University of Electronic Science and Technology of China, Chengdu 610054, China
³ Computer Science and Engineering School, Xinjiang University of Finance and Economics, Urumqi 830012, China

Revised: 2016-09-26 Online: 2017-01-01 Published: 2017-01-23
About the authors: Yan-mei ZHANG (1976-), female, born in Jilin City, Jilin Province, Ph.D., associate professor at Central University of Finance and Economics; her research interests include intelligent data analysis and service computing. | Ying-ying HUANG (1995-), female, born in Haikou, Hainan; her research interest is intelligent data analysis. | Shi-jie GAN (1997-), male, born in Linshui, Sichuan; his research interests include information security and intelligent data analysis. | Yi DING (1985-), male, born in Yibin, Sichuan, Ph.D., associate professor at University of Electronic Science and Technology of China; his research interests include medical image processing and pattern recognition. | Zhi-long MA (1983-), male, born in Yumin, Xinjiang, lecturer at Xinjiang University of Finance and Economics; his research interest is information security.
Supported by:
The National Natural Science Foundation of China (61602536); The National Natural Science Foundation of China (61273293); The National Natural Science Foundation of China (61309029); Beijing Municipal Social Science Foundation (16YJA001); The Open Project of Network and Data Security Key Laboratory of Sichuan Province (NDSMS201605); The Discipline Construction Foundation of the Central University of Finance and Economics

Abstract:

To identify spammers effectively, and building on previous research, six feature attributes were selected to design a Weibo spammer classifier: the follower-to-following ratio, the average number of posts, the number of mutual follows, a comprehensive quality evaluation, the number of favorites, and the Sunshine Credit score. A spammer identification algorithm was then implemented based on a Bayesian model and a genetic optimization algorithm. The algorithm was validated on real Sina Weibo data. The experimental results show that the proposed Bayesian identification algorithm maintains spammer recognition accuracy without sacrificing the recognition rate for non-spammers, and that the proposed threshold optimization algorithm significantly improves spammer recognition accuracy.

Key words: network spammer, spammer identification, Weibo, Bayesian model, genetic algorithm
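
The classification step summarized in the abstract can be illustrated with a minimal sketch. It assumes that each attribute is discretized into intervals by a row of thresholds and that a naive Bayes rule picks the class with the larger posterior score. The function names (`discretize`, `fit`, `predict`), the Laplace smoothing, the two attributes used, and the toy data are all illustrative assumptions, not the authors' implementation.

```python
import math
from bisect import bisect_right

def discretize(x, thresholds):
    """Map each attribute value to the index of the threshold interval it falls in."""
    return [bisect_right(t, v) for v, t in zip(x, thresholds)]

def fit(samples, labels, thresholds):
    """Estimate class priors and per-attribute interval probabilities."""
    classes = sorted(set(labels))
    prior = {c: labels.count(c) / len(labels) for c in classes}
    # counts[c][i][b]: class-c samples whose attribute i falls in interval b.
    # Every count starts at 1 (Laplace smoothing, an assumption of this sketch).
    counts = {c: [[1] * (len(t) + 1) for t in thresholds] for c in classes}
    for x, y in zip(samples, labels):
        for i, b in enumerate(discretize(x, thresholds)):
            counts[y][i][b] += 1
    cond = {c: [[k / sum(row) for k in row] for row in counts[c]] for c in classes}
    return prior, cond

def predict(x, thresholds, prior, cond):
    """Return the class with the largest log posterior score."""
    bins = discretize(x, thresholds)
    scores = {
        c: math.log(prior[c]) + sum(math.log(cond[c][i][b]) for i, b in enumerate(bins))
        for c in prior
    }
    return max(scores, key=scores.get)

# Hypothetical toy data: [follower-to-following ratio, number of mutual follows]
thresholds = [[0.5, 2.0], [10, 100]]
samples = [[3.0, 150], [2.5, 120], [1.0, 50], [0.1, 2], [0.2, 5], [0.3, 20]]
labels = ["non-spammer"] * 3 + ["spammer"] * 3
prior, cond = fit(samples, labels, thresholds)
print(predict([0.15, 3], thresholds, prior, cond))    # spammer
print(predict([2.8, 130], thresholds, prior, cond))   # non-spammer
```

Working in log space avoids numerical underflow when many attribute probabilities are multiplied. The thresholds are fixed by hand here; in the paper they are tuned by a genetic algorithm.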

CLC number: TP393

Yan-mei ZHANG, Ying-ying HUANG, Shi-jie GAN, Yi DING, Zhi-long MA. Weibo spammers’ identification algorithm based on Bayesian model[J]. Journal on Communications, 2017, 38(1): 44-53.

References

[1]	SRIRAM B, FUHRY D, DEMIR E, et al. Short text classification in Twitter to improve information filtering[C]// 33rd Int’l ACM SIGIR Conf. on Research and Development in Information Retrieval (SIGIR 2010). New York: ACM Press, 2010: 841-842.
[2]	LIU B. Sentiment analysis and subjectivity[M]. Handbook of Natural Language Processing. Boca Raton: CRC Press, 2010: 627-666.
[3]	ZHAO Y Y, QIN B, LIU T. Sentiment analysis[J]. Journal of Software, 2010, 21(8): 1834-1848.
[4]	PARAMESWARAN M, RUI H, SAYIN S. A game theoretic model and empirical analysis of spammer strategies[C]// 7th Annual Collaboration, Electronic Messaging, Anti-Abuse and Spam Conf. 2010: 1-7.
[5]	GARGARI S M, OGUDUCU S G. A novel framework for spammer detection in social bookmarking systems[C]// IEEE/ACM Int’l Conf. on Advances in Social Networks Analysis and Mining (ASONAM 2012). 2012: 827-834.
[6]	MO Q, YANG K. Overview of Web spammer detection[J]. Journal of Software, 2014, 25(7): 1505-1526.
[7]	KRESTEL R, CHEN L. Using co-occurrence of tags and resources to identify spammers[C]// Discovery Challenge Workshop at the European Conf. on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD 2008). 2008: 38-46.
[8]	GAYO-AVELLO D, BRENES D J. Overcoming spammers in Twitter - a tale of five algorithms[C]// Spanish Conf. on Information Retrieval (CERI 2010). 2010: 41-52.
[9]	HAN Z M, YANG K, TAN X S. Analyzing spectrum features of weight user relation graph to identify large spammer groups in online shopping websites[J/OL].
[10]	ZHANG L, ZHU X, LI A P, et al. The spammer detection based on logistic regression[J]. Information Security and Technology, 2015(4): 57-62.
[11]	YE S R, SUN N. Research on Sina microblogging marketing spam review detection based on support vector machine[J]. Natural Science Journal of Xiangtan University, 2015, 37(4): 70-74.
[12]	CHENG X T, LIU C X, LIU S X. Graph-based features for identifying spammers in microblog networks[J]. Acta Automatica Sinica, 2015, 41(9): 1533-1541.
[13]	CHEN K, CHEN L, ZHU P D, et al. Interaction based method for spam detection in online social networks[J]. Journal on Communications, 2015, 36(7): 120-127.
[14]	YANG C C, XU X S, YE S R, et al. A method to find water armies in weibo based on text similarity[J]. Microelectronics & Computer, 2014, 31(3): 82-85.
[15]	YUAN X P, WANG R W, ZHAI B Y. Automatic recognition of micro-blog water army based on multi-index comprehensive index method and entropy method[J]. Journal of Intelligence, 2014, 33(7): 176-179.
[16]	NI P, ZHANG Y Q, WEN G X, et al. Detection of socialbot networks based on population characteristics[J]. Journal of University of Chinese Academy of Sciences, 2015, 31(5): 691-700.
[17]	DONG Y C, LIU Y, LUO J Y, et al. Hype microblog recognition method based on support vector machine[J]. Computer Engineering, 2015, 41(3): 7-14.
[18]	HAN Z M, XU F M, DUAN D G. Probabilistic graphical model for identifying water army in microblogging system[J]. Journal of Computer Research and Development, 2013, 50(S2): 180-186.
[19]	LIU K, YUAN Y Y, LIU P. A Weibo bot-users identification model based on random forest[J]. Acta Scientiarum Naturalium Universitatis Pekinensis, 2015, 52(2): 290-300.
[20]	STRINGHINI G, KRUEGEL C, VIGNA G. Detecting spammers on social networks[C]// 26th Annual Computer Security Applications Conf. (ACSAC 2010). 2010: 1-9.
[21]	MURMANN A J. Enhancing spammer detection in online social networks with trust-based metrics[D]. San Jose: San Jose State University, 2009.
[22]	SRIRAM B, FUHRY D, DEMIR E, et al. Short text classification in Twitter to improve information filtering[C]// 33rd Int’l ACM SIGIR Conf. on Research and Development in Information Retrieval (SIGIR 2010). 2010: 841-842.
[23]	MOH T S, MURMANN A J. Can you judge a man by his friends? Enhancing spammer detection on the Twitter microblogging platform using friends and followers[C]// Int’l Conf. on Information Systems and Technology Management (ICISTM 2010). 2010: 210-220.
[24]	BHAT S Y, ABULAISH M. Community-based features for identifying spammers in online social networks[C]// 2013 IEEE/ACM Int’l Conf. on Advances in Social Networks Analysis and Mining (ASONAM 2013). 2013: 100-107.
[25]	PAN Z M. Research on classification for imbalanced dataset[D]. Xi’an: Xi’an University of Architecture and Technology, 2012: 2-49.

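Reference	Attributes	Main algorithm
Ref. [10]	URL rate, text self-similarity, number of friends, number of followers, number of posts, etc.	Logistic regression
Ref. [11]	Comment time, comment ID, client source, number of followers, etc.	SVM, simhash algorithm
Ref. [12]	Nickname, list of followed users, microblog text, comments, etc.	Relation graph combined with naive Bayes, Bayesian network, or decision tree
Ref. [13]	Three relationship types (follower-propagator, publisher-propagator, propagator-propagator) used to distinguish propagation features	Decision tree
Ref. [14]	Web page feature codes	Text analysis, B-Tree index
Ref. [15]	Comprehensive index, information entropy	Comprehensive index calculation, entropy method
Ref. [16]	Registration time, nickname, active time	k-means clustering, depth-first search
Ref. [17]	Posting time, repost count, comment count, reposter ID, user ID, etc.	Support vector machine
Ref. [18]	User activity, user category, follower value, friend value, etc.	Probabilistic graph
Ref. [19]	Account attention, mutual-follow ratio, @ ratio, etc.	Decision tree
Ref. [20]	Friend request rate, URL rate, text similarity, etc.	honey-profiles
Ref. [21]	Mutual-follow ratio, number of favorites, etc.	Trust-based matrix, PageRank algorithm
Ref. [22]	Messages, events, comments, etc.	Bag-of-words model
Ref. [23]	Mutual-follow ratio, number of favorites, number of friends added per day, etc.	SVM, repeated incremental pruning algorithm
Ref. [24]	Total out-degree (e.g., messages sent), total in-degree, total number of cycles, etc.	OCTracker algorithm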

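Symbol	Meaning
FF	Follower-to-following ratio
AW	Average number of posts
IF	Number of mutual follows
QE	Comprehensive quality evaluation
C	Number of favorites
I	Number of matrix rows; here, the number of attributes
J	Number of matrix columns
M	Non-spammer threshold matrix
T	Non-spammer probability matrix
N	Spammer threshold matrix
S	Spammer probability matrix
x = {a₁, a₂, …, aₘ₋₁, aₘ}	Unclassified data item; each a is the value of one attribute of x
B = {y₁, y₂}	Set of classes; y₁ means the item is a non-spammer, y₂ means the item is a spammer
var	Threshold matrix (generic)
population	Population matrix; each row is one individual, and every 4 columns hold the gene value of one attribute of that individual
TP (true positive)	Number of spammer samples predicted as spammers
TN (true negative)	Number of non-spammer samples predicted as non-spammers
FN (false negative)	Number of actual spammer samples predicted incorrectly
FP (false positive)	Number of actual non-spammer samples predicted incorrectly
acc⁺	Classification accuracy of the classifier on spammer samples
acc⁻	Classification accuracy of the classifier on non-spammer samples
g	Average classification performance over the whole data set (i.e., overall classification accuracy)
SR (spammer recall)	Spammer recall
LR (legitimate recall)	Non-spammer recall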
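
The `population` entry in the notation table above says that every 4 columns of a row encode the gene value of one attribute. One possible reading is sketched below, assuming the 4 columns are binary digits that are decoded to an integer and scaled linearly into a per-attribute range. The binary encoding, the function name `decode_individual`, and the `bounds` values are hypothetical; the excerpt does not specify how the paper maps genes to thresholds.

```python
def decode_individual(row, bounds, bits=4):
    """Decode one row of the population matrix into one threshold per attribute.

    row    -- flat list of 0/1 genes, `bits` genes per attribute (assumed encoding)
    bounds -- list of (low, high) ranges, one per attribute (hypothetical values)
    """
    if len(row) != bits * len(bounds):
        raise ValueError("row length must equal bits * number of attributes")
    max_code = 2 ** bits - 1
    thresholds = []
    for k, (low, high) in enumerate(bounds):
        gene = row[k * bits:(k + 1) * bits]
        code = int("".join(str(b) for b in gene), 2)  # integer in 0 .. max_code
        thresholds.append(low + (high - low) * code / max_code)
    return thresholds

# Three attributes, 4 bits each: codes 0, 15 and 5
row = [0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1]
bounds = [(0, 1), (0, 30), (0, 150)]
print(decode_individual(row, bounds))  # [0.0, 30.0, 50.0]
```

With 4 bits per attribute, each threshold can take only 16 distinct values, so the granularity of the search depends on the chosen range.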

Class	Actual spammer	Actual non-spammer
Predicted spammer	TP	FP
Predicted non-spammer	FN	TN
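
The per-class measures in the notation table can be computed directly from the confusion matrix above. The sketch below assumes the usual definitions acc⁺ = TP/(TP+FN) and acc⁻ = TN/(TN+FP), and assumes that g is the geometric mean of the two, as is common in the imbalanced-classification literature; the excerpt itself describes g only as the average classification performance, so that formula is an assumption. The function name `class_metrics` and the counts are illustrative.

```python
import math

def class_metrics(tp, fn, fp, tn):
    """Return (acc_pos, acc_neg, g) from confusion-matrix counts.

    Assumes each class has at least one sample, so neither denominator is zero.
    """
    acc_pos = tp / (tp + fn)   # share of actual spammers classified correctly
    acc_neg = tn / (tn + fp)   # share of actual non-spammers classified correctly
    g = math.sqrt(acc_pos * acc_neg)  # geometric mean (assumed definition of g)
    return acc_pos, acc_neg, g

# Hypothetical counts: 100 spammers and 100 non-spammers
acc_pos, acc_neg, g = class_metrics(tp=80, fn=20, fp=10, tn=90)
print(acc_pos, acc_neg, round(g, 4))
```

Under these standard definitions, acc⁺ and acc⁻ coincide with the recall of the spammer and non-spammer classes respectively. A geometric mean penalizes a classifier that does well on one class at the expense of the other, which matters when the two classes are unbalanced.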
