Journal on Communications ›› 2016, Vol. 37 ›› Issue (8): 24-33. doi: 10.11959/j.issn.1000-436x.2016152
Yu-xiang ZHANG, Yu SUN, Jia-hai YANG, Da-lei ZHOU, Xiang-fei MENG, Chun-jing XIAO
Online: 2016-08-25
Published: 2016-09-01
Yu-xiang ZHANG,Yu SUN,Jia-hai YANG,Da-lei ZHOU,Xiang-fei MENG,Chun-jing XIAO. Feature importance analysis for spammer detection in Sina Weibo[J]. Journal on Communications, 2016, 37(8): 24-33.
Feature | Legitimate users | Spammers | Mean, legitimate users | Mean, spammers | Mean, content spammers | Mean, zombie spammers | Mean, banned spammers | Coefficient of variation (CV), legitimate users | CV, spammers | CV, content spammers | CV, zombie spammers | CV, banned spammers |
F1 | 317.00 | 1 107.00 | 492.92 | 1 110.95 | 1 106.02 | 1 396.29 | 1 004.48 | 1.01 | 0.48 | 0.60 | 0.41 | 0.33 |
F2 | 276.00 | 141.50 | 1 137.95 | 428.13 | 772.52 | 239.67 | 241.19 | 28.20 | 7.12 | 6.52 | 5.53 | 1.05 |
F3 | 119.00 | 4.00 | 184.42 | 67.66 | 146.75 | 46.22 | 15.07 | 1.28 | 2.88 | 1.93 | 3.30 | 4.41 |
F4 | 1.11 | 6.33 | 3.00 | 50.31 | 46.11 | 140.50 | 16.95 | 5.23 | 3.21 | 2.41 | 1.99 | 6.24 |
F5 | 2.50 | 249.71 | 18.85 | 419.37 | 316.20 | 560.24 | 443.94 | 5.94 | 1.19 | 1.73 | 1.13 | 0.81 |
F6 | 3.00 | 3.00 | 3.56 | 4.27 | 3.77 | 3.30 | 5.05 | 0.81 | 1.13 | 1.10 | 0.90 | 1.14 |
F7 | 555.00 | 349.00 | 1 271.76 | 541.86 | 818.45 | 372.96 | 397.47 | 1.73 | 2.22 | 2.37 | 1.59 | 0.65 |
F8 | 3.33 | 4.94 | 5.27 | 7.84 | 10.56 | 4.13 | 7.31 | 1.63 | 4.15 | 5.14 | 1.12 | 0.71 |
F9 | 62.00 | 62.00 | 86.51 | 135.35 | 165.86 | 247.50 | 66.56 | 1.00 | 1.09 | 0.96 | 0.81 | 0.55 |
F10 | 0.58 | 1.00 | 0.55 | 0.73 | 0.51 | 0.57 | 0.96 | 0.49 | 0.51 | 0.74 | 0.70 | 0.17 |
F11 | 0.09 | 0.01 | 0.17 | 0.23 | 0.47 | 0.29 | 0.03 | 1.22 | 1.50 | 0.81 | 1.21 | 5.15 |
F12 | 0.82 | 0.38 | 1.70 | 0.36 | 0.35 | 0.16 | 0.42 | 1.56 | 2.09 | 3.22 | 1.87 | 0.33 |
F13 | 1.50 | 0.00 | 2.75 | 0.20 | 0.41 | 0.15 | 0.03 | 1.37 | 4.73 | 3.35 | 3.26 | 6.36 |
F14 | 97.06 | 124.11 | 97.63 | 111.64 | 100.37 | 97.93 | 126.25 | 0.30 | 0.24 | 0.30 | 0.28 | 0.11 |
F15 | 0.04 | 0.04 | 0.07 | 0.09 | 0.16 | 0.07 | 0.04 | 1.18 | 1.51 | 1.15 | 1.54 | 0.89 |
F16 | 5.12 | 7.53 | 6.37 | 8.62 | 10.44 | 7.02 | 7.92 | 1.26 | 0.99 | 1.26 | 0.76 | 0.41 |
F17 | 0.05 | 0.05 | 0.05 | 0.07 | 0.09 | 0.07 | 0.05 | 0.50 | 0.71 | 0.60 | 0.41 | 0.33 |
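The coefficient-of-variation columns above express each feature's spread relative to its mean (standard deviation divided by mean). The following is a minimal Python sketch of that statistic. The sample values are hypothetical and are not data from the paper, and the use of the population standard deviation is an assumption, since the table does not state which estimator was used.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Return std / mean. Population std is an assumption, not the paper's stated choice."""
    m = mean(values)
    if m == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return pstdev(values) / m


# Hypothetical feature values for two small groups of accounts (illustration only).
spread_group = [100, 300, 80, 950, 150]
tight_group = [1000, 1100, 1200, 1050, 1150]

cv_spread = coefficient_of_variation(spread_group)
cv_tight = coefficient_of_variation(tight_group)
print(f"CV (spread group): {cv_spread:.2f}")
print(f"CV (tight group):  {cv_tight:.2f}")
```

A low value indicates that accounts in a group behave alike on that feature, while a high value indicates that the group is heterogeneous.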
No. | Name | Category | Evaluation criterion |
FS1 | CHI (chi-squared) | Filter | Chi-squared statistic |
FS2 | IG (information gain) | Filter | Information entropy |
FS3 | ReliefF | Filter | Euclidean distance |
FS4 | SVM-RFE (recursive feature elimination for SVM) | Wrapper | Predictive analysis |
FS5 | SU (symmetrical uncertainty) | Filter | Uncertainty analysis |
FS6 | CR (comprehensive ranking) | - | - |
FS7 | CFS (correlation-based feature selection) | Filter | Correlation analysis |
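Two of the filter methods listed above, information gain and symmetrical uncertainty, score a feature by how much it reduces the entropy of the class label. The sketch below illustrates both on discrete values using the standard definitions, IG = H(Y) - H(Y|X) and SU = 2·IG / (H(X) + H(Y)). It assumes features have already been discretized; the function names and toy data are hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def information_gain(feature, labels):
    """H(labels) minus the entropy of labels conditioned on the feature."""
    n = len(labels)
    conditional = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        conditional += len(subset) / n * entropy(subset)
    return entropy(labels) - conditional


def symmetrical_uncertainty(feature, labels):
    """Information gain normalized to [0, 1] by the two marginal entropies."""
    denom = entropy(feature) + entropy(labels)
    return 0.0 if denom == 0 else 2.0 * information_gain(feature, labels) / denom


# Toy example: one feature separates the classes perfectly, the other not at all.
labels = ["spam", "spam", "legit", "legit"]
informative = ["high", "high", "low", "low"]
uninformative = ["a", "b", "a", "b"]

print(information_gain(informative, labels), symmetrical_uncertainty(informative, labels))
print(information_gain(uninformative, labels), symmetrical_uncertainty(uninformative, labels))
```

Ranking features by either score in descending order gives a filter-style ordering of the kind these methods produce.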
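No. | Name | Description |
Classifier1 | Naive Bayes (NB) | Based on Bayes' theorem and the assumption that features are independent; computes the posterior probability of an object from its prior probability using Bayes' formula and assigns the object to the class with the highest posterior probability |
Classifier2 | Logistic regression (LR) | Uses the logistic (sigmoid) function to compute the posterior probability and classifies the given object according to that probability |
Classifier3 | Support vector machine (SVM) | Built on the structural risk minimization principle of statistical learning theory; maps points from a low-dimensional space into a high-dimensional space where they become linearly separable, then determines the classification boundary with a linear separator |
Classifier4 | Radial basis function network (RBFN) | A feedforward neural network that uses radial basis functions as activation functions |
Classifier5 | k-nearest neighbor (IBk/kNN) | An instance-based, nonparametric classification method: computes the distances between a new sample and the training samples, finds the k nearest neighbors, and assigns the sample to the class to which most of those neighbors belong |
Classifier6 | AdaBoost.M1 (ABM1) | A method for improving the accuracy of a given learning algorithm: trains different weak classifiers on the same training set, then combines them into a strong final classifier |
Classifier7 | Bootstrap aggregating (BA) | Like AdaBoost, an ensemble learning method, but it differs markedly in how training sets are selected and how the prediction function is built; AdaBoost usually attains higher classification accuracy than BA, whereas BA can effectively avoid overfitting |
Classifier8 | Decision trees (J48/C4.5) | A simple and fast nonparametric tree-structured classification method; uses the information gain ratio to select features, takes the feature with the largest gain ratio as the split node, and repeats this process on every branch |
Classifier9 | Random forest (RF) | An ensemble method with decision trees as base classifiers; contains multiple decision trees trained with the BA ensemble technique, and the final class of an input sample is decided by a vote over the outputs of the individual trees |
Classifier10 | Logistic model trees (LMT) | Introduces linear logistic regression into decision trees; nodes contain logistic regression functions |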
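The k-nearest-neighbor description in the table above (measure distances, take the k closest training samples, pick the majority class) can be sketched in a few lines of Python. This is an illustrative toy, not the IBk implementation used in the paper; the function name, the Euclidean metric, and the sample points are assumptions.

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Sort training samples by Euclidean distance to the query point.
    ranked = sorted(zip(train_points, train_labels), key=lambda pair: dist(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-feature samples (illustration only).
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["legit", "legit", "legit", "spam", "spam", "spam"]

print(knn_predict(points, labels, (0.5, 0.5)))
print(knn_predict(points, labels, (5.5, 5.5)))
```

Because the vote depends on raw distances, features measured on very different scales would normally be normalized before applying a method like this.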
δ | Method | Top1 | Top2 | Top3 | Top4 | Top5 | Top6 | Top7 | Top8 | Top9 | Top10 | Top11 | Top12 | Top13 | Top14 | Top15 | Top16 | Top17 |
5.9 | ChiSq | F5 | F10 | F1 | F14 | F17 | F11 | F4 | F9 | F3 | F13 | F15 | F6 | F7 | F12 | F16 | F8 | F2 |
5.9 | IG | F5 | F10 | F1 | F14 | F17 | F11 | F4 | F9 | F3 | F13 | F7 | F15 | F12 | F6 | F16 | F8 | F2 |
5.9 | ReliefF | F10 | F1 | F14 | F11 | F17 | F3 | F5 | F13 | F6 | F15 | F7 | F12 | F9 | F4 | F16 | F8 | F2 |
5.9 | SVM-RFE(c=0.05) | F5 | F4 | F3 | F1 | F7 | F17 | F10 | F11 | F9 | F6 | F13 | F15 | F12 | F8 | F14 | F16 | F2 |
5.9 | SU | F5 | F1 | F10 | F4 | F17 | F14 | F9 | F11 | F13 | F3 | F7 | F12 | F15 | F16 | F6 | F8 | F2 |
5.9 | CFS | {F5,F10,F13} |
1 | ChiSq | F10 | F1 | F5 | F17 | F12 | F14 | F11 | F13 | F9 | F3 | F4 | F7 | F15 | F16 | F6 | F8 | F2 |
1 | IG | F5 | F10 | F1 | F17 | F12 | F14 | F11 | F13 | F9 | F3 | F4 | F7 | F15 | F6 | F16 | F8 | F2 |
1 | ReliefF | F1 | F10 | F5 | F14 | F11 | F17 | F12 | F9 | F3 | F13 | F6 | F16 | F15 | F4 | F7 | F2 | F8 |
1 | SVM-RFE(c=0.01) | F10 | F1 | F5 | F11 | F17 | F12 | F14 | F9 | F3 | F13 | F7 | F15 | F6 | F16 | F4 | F8 | F2 |
1 | SU | F5 | F1 | F10 | F12 | F17 | F13 | F11 | F14 | F4 | F9 | F3 | F7 | F15 | F6 | F16 | F8 | F2 |
1 | CFS | {F1,F5,F10,F12,F17} |
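The page lists a comprehensive ranking (CR) method but does not say how the individual rankings are combined. One plausible way to merge several rankings is to average each feature's position across methods. The Python sketch below applies that mean-rank aggregation, which is an assumption and not necessarily the paper's CR procedure, to the first five full rankings in the table above (the δ = 5.9 group).

```python
def mean_rank_aggregate(rankings):
    """Order features by average position across rankings (ties broken by name).

    Assumes every ranking contains the same set of features.
    """
    totals = {}
    for ordered in rankings.values():
        for position, feature in enumerate(ordered, start=1):
            totals[feature] = totals.get(feature, 0) + position
    n = len(rankings)
    return sorted(totals, key=lambda f: (totals[f] / n, f))


# Rankings copied from the delta = 5.9 rows of the table (best feature first).
rankings = {
    "ChiSq": "F5 F10 F1 F14 F17 F11 F4 F9 F3 F13 F15 F6 F7 F12 F16 F8 F2".split(),
    "IG": "F5 F10 F1 F14 F17 F11 F4 F9 F3 F13 F7 F15 F12 F6 F16 F8 F2".split(),
    "ReliefF": "F10 F1 F14 F11 F17 F3 F5 F13 F6 F15 F7 F12 F9 F4 F16 F8 F2".split(),
    "SVM-RFE": "F5 F4 F3 F1 F7 F17 F10 F11 F9 F6 F13 F15 F12 F8 F14 F16 F2".split(),
    "SU": "F5 F1 F10 F4 F17 F14 F9 F11 F13 F3 F7 F12 F15 F16 F6 F8 F2".split(),
}

combined = mean_rank_aggregate(rankings)
print(combined[:3])   # features with the best average rank
print(combined[-1])   # feature with the worst average rank
```

Under this aggregation F5, F1, and F10 come out on top and F2 comes last, which matches the broad agreement among the individual methods in the table.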
[1] | Available online[EB/OL]. |
[2] | Available online[EB/OL]. |
[3] | SPIRIN N , HAN J W . Survey on web spam detection:principles and algorithms[J]. ACM SIGKDD Explorations Newsletter, 2012,13(2):50-64. |
[4] | MUKHERJEE A , LIU B , GLANCE N S . Spotting fake reviewer groups in consumer reviews[C]// The WWW. c2012:191-200. |
[5] | WANG T Y , WANG G , LI X . Characterizing and detecting malicious crowdsourcing[C]// The ACM SIGCOMM. c2013:537-538. |
[6] | WANG G , WILSON C , ZHAO X H . Serf and turf:crowdturfing for fun and profit[C]// The WWW. c2012:679-688. |
[7] | SRIDHARAN V , SHANKAR V , GUPTA M . Twitter games:how successful spammers pick targets[C]// The ACSAC. c2012:389-398. |
[8] | STRINGHINI G , KRUEGEL C , VIGNA G . Detecting spammers on social networks[C]// The ACSAC. c2012:1-9. |
[9] | IRANI D , WEBB S , PU C . Study of static classification of social spam profiles in MySpace[C]// The ICWSM. c2010:82-89. |
[10] | GAO H Y , HU J , WILSON C . Detecting and characterizing social spam campaigns[C]// The CCS. c2010:681-683. |
[11] | AGGARWAL A , ALMEIDA J M , KUMARAGURU P . Detection of spam tipping behaviour on foursquare[C]// The WWW. c2013:641-648. |
[12] | GAO Q , ABEL F , HOUBEN G J . A comparative study of user's microblogging behavior on Sina Weibo and Twitter[C]// The 20th International Conference on User Modeling. c2012:88-101. |
[13] | YU L , ASUR S , HUBERMAN BA . What trends in Chinese social media[C]// SNA-KDD Workshop. c2011:1-10. |
[14] | YU LL , ASUR S , HUBERMAN BA . Artificial inflation:the real story of trends and trend-setters in Sina Weibo[C]// The International Conference on Social Computing. c2012:514-519. |
[15] | FAN P Y , WANG H , JIANG Z H , et al. Measurement of microblogging network[J]. Journal of Computer Research and Development, 2012,49(4):691-699. |
[16] | SHARMA P , BISWAS S . Identifying spam in Twitter trending topics. Technical report[R]. USC (University of Southern California) Information Sciences Institute, 2011:1-4. |
[17] | BENEVENUTO F , MAGNO G , RODRIGUES T . Detecting spammers on Twitter[C]// The 7th Collaboration, Electronic Messaging, Anti-Abuse and Spam Conference. c2010:1-9. |
[18] | HASTIE T , TIBSHIRANI R . Discriminant adaptive nearest neighbor classification[J]. IEEE Trans. on Pattern Analysis and Machine Intelligence, 1996,18(6):607-616. |
[19] | FREUND Y , SCHAPIRE RE . A decision-theoretic generalization of on-line learning and an application to boosting[J]. Journal of Computer and System Sciences, 1997,55(1):119-139. |
[20] | ORR M J L . Regularization in the selection of radial basis function centres[J]. Neural Computation, 1995,7(3):606-623. |
[21] | HO T K . The random subspace method for constructing decision forests[J]. IEEE Trans. on Pattern Analysis and Machine Intelligence, 1998,20(8):832-844. |
[22] | MILLER Z , DICKINSON B , DEITRICK W , et al. Twitter spammer detection using data stream clustering[J]. Information Sciences, 2014,260(1):64-73. |
[23] | SHOBEIR F , JAMES F , MADHUSHDANA S , et al. Collective spammer detection in evolving multi-relation social networks[C]// The KDD. c2015:1769-1778. |
[24] | WANG A H . Detecting spam bots in online social networking sites:a machine learning approach[C]// DBSec. c2010:335-342. |
[25] | LEE K , CAVERLEE J , WEBB S , et al. Uncovering social spammers:social honeypots+machine learning[C]// The SIGIR. c2010:435-442. |
[26] | MARTINEZ R J , ARAUJO L . Detecting malicious tweets in trending topics using a statistical analysis of language[J]. Expert Systems with Applications, 2013,40(8):2992-3000. |
[27] | ZHU Y , WANG X , ZHONG E H . Discovering spammers in social networks[C]// The AAAI. c2012:1-7. |
[28] | HU X , TANG J L , GAO HJ , et al. Social spammer detection with sentiment information[C]// The ICDM. c2014:180-189. |
[29] | TAN E , GUO L , CHEN S , et al. Unik:unsupervised social network spam detection[C]// The ICDM. c2013:479-488. |
[30] | ZHANG X , ZHU S , LIANG W . Detecting spam and promoting campaigns in the Twitter social network[C]// The ICDM. c2012:1194-1199. |
[31] | SURENDRA S , AIXIN S . HSpam14:a collection of 14 million tweets for hashtag-oriented spam research[C]// The SIGIR. c2015:9-13. |
[32] | YANG C , HARKREADER R C , ZHANG J . Analyzing spammers' social networks for fun and profit:a case study of cyber criminal ecosystem on Twitter[C]// The WWW. c2012:71-80. |
[33] | HU X , TANG J L , LIU H . Online social spammer detection[C]// The AAAI. c2014:1-7. |
[34] | HU X , TANG J L , ZHANG Y C , et al. Social spammer detection in microblogging[C]// The IJCAI. c2013:177-188. |
[35] | CASTILLO C , MENDOZA M , POBLETE B . Information credibility on twitter[C]// The WWW. c2011:675-684. |
[36] | RATKIEWICZ J , CONOVER M , MEISS M . Detecting and tracking political abuse in social media[C]// The ICWSM. c2011:1-8. |
[37] | DING Z Y , ZHOU B , JIA Y , et al. Detecting spammers with a bidirectional vote algorithm based on statistical features in microblogs[J]. Journal of Computer Research and Development, 2013,50(11):2336-2348. |
[38] | HU X , TANG J L , ZHANG Y C , LIU H . Leveraging knowledge across media for spammer detection in microblogging[C]// The ACM SIGIR. c2014:547-556. |
[39] | Available online[EB/OL]. |
[40] | DASH M , LIU H . Feature selection for classifications[J]. Intelligent Data Analysis, 1997,16(21):131-156. |
[41] | LIU H , SETIONO R . CHI2:feature selection and discretization of numeric attributes[C]// The ICTAI. c1995:388-391. |
[42] | NOWOZIN S . Estimating attributes:analysis and extensions of RELIEF[C]// The ECML-PKDD. c2012:1-8. |
[43] | KONONENKO I . Estimating attributes:analysis and extensions of RELIEF[C]// The ECML-PKDD. c1994:171-182. |
[44] | GUYON I , WESTON J , BARNHILL SMD . Gene selection for cancer classification using support vector machines[J]. Machine Learning, 2002,46(1-3):389-422. |
[45] | STECK J B . Netpix:a method of feature selection leading to accurate sentiment-based classification models[D]. Central Connecticut State University, 2005. |
[46] | HALL M A . Correlation-based feature selection for discrete and numeric class machine learning[C]// The ICML. c2000:359-366. |
[47] | JOHN GH , EDU S , LANGLEY P . Estimating continuous distributions in Bayesian classifiers[C]// The UAI. c1995:338-345. |
[48] | KEERTHI S S , DUAN K , SHEVADE S K . A fast dual algorithm for kernel logistic regression[J]. Machine Learning, 2005,61(1):151-165. |
[49] | CORTES C , VAPNIK V N . Support-vector networks[J]. Machine Learning, 1995,20(3):273-297. |
[50] | ORR M J L . Regularization in the selection of radial basis function centres[J]. Neural Computation, 1995,7(3):606-623. |
[51] | BREIMAN L . Bagging predictors[J]. Machine Learning, 1996,24(2):123-140. |
[52] | QUINLAN J R . C4.5:programs for machine learning[M]. Morgan Kaufmann Publishers,San Mateo,California, 1993. |
[53] | LANDWEHR N , HALL M , FRANK E . Logistic model trees[J]. Machine Learning, 2005,59(1):161-205. |
[54] | KOHAVI R . A study of cross-validation and bootstrap for accuracy estimation and model selection[C]// The IJCAI. c1995:1137-1143. |