Construction of multi-modal social media dataset for fake news detection

doi:10.11959/j.issn.2096-109x.2023060

Chinese Journal of Network and Information Security, 2023, Vol. 9, Issue (4): 144-154.

Guopeng GAO¹, Yaodong FANG¹, Yanfang HAN¹, Zhenxing QIAN², Chuan QIN¹

¹ School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China
² School of Computer Science, Fudan University, Shanghai 200433, China

Revised: 2023-05-26 Online: 2023-08-01 Published: 2023-08-01
About the authors:
Guopeng GAO (1998- ), male, from Huai'an, Jiangsu, master's student at University of Shanghai for Science and Technology; main research interest: multimedia information security
Yaodong FANG (1997- ), male, from Taizhou, Jiangsu, master's student at University of Shanghai for Science and Technology; main research interest: multimedia information security
Yanfang HAN (1974- ), female, from Shanghai, Ph.D., lecturer at University of Shanghai for Science and Technology; main research interests: image processing and pattern recognition
Zhenxing QIAN (1981- ), male, from Nantong, Jiangsu, Ph.D., professor and doctoral supervisor at Fudan University; main research interests: information hiding and AI security
Chuan QIN (1980- ), male, from Wuhu, Anhui, Ph.D., professor and doctoral supervisor at University of Shanghai for Science and Technology; main research interests: multimedia information security and AI security
Supported by:
The National Natural Science Foundation of China (U20B2051); The National Natural Science Foundation of China (62172280); The Natural Science Foundation of Shanghai (21ZR1444600)

Abstract:

The advent of social media has changed people's lives. Social media makes it easy to access and share news, but it has also fueled the emergence and spread of fake news, which seriously threatens social security and stability. Fake news detection has therefore attracted wide attention from researchers. Although several deep learning-based solutions have been proposed, these methods rely on large amounts of data. Existing fake news datasets, particularly Chinese ones, are scarce, and the news they contain mostly belongs to a single category. To improve fake news detection, a new multi-modal fake news dataset (MFND) was constructed, comprising Chinese and English news from ten categories: politics, economy, entertainment, sports, international affairs, technology, military, education, health, and social life. The word frequencies and categories of MFND were analyzed, and MFND was compared with existing fake news datasets in terms of number of news items, news categories, modal information, and news languages. The comparison shows that MFND stands out in category information and news languages. Moreover, typical existing fake news detection methods were trained and validated on MFND, and the experimental results show that MFND improves model performance by approximately 10% compared with existing mainstream fake news datasets.

Key words: social media, fake news detection, multi-modal, multi-category, dataset

CLC number: TP393

Guopeng GAO, Yaodong FANG, Yanfang HAN, Zhenxing QIAN, Chuan QIN. Construction of multi-modal social media dataset for fake news detection[J]. Chinese Journal of Network and Information Security, 2023, 9(4): 144-154.

Table 3 Comparison of our MFND with existing multi-modal fake news datasets

Dataset	Real news	Fake news	Modality	Number of categories	Social context	Number of languages
Weibo^[14]	4 749	4 779	Text + image	-	Included	1
MediaEval^[23]	6 225	9 596	Text + image	-	Not included	1
FacebookHoax^[24]	6 577	8 923	Text + image	1	Included	1
TI-CNN^[25]	11 941	8 074	Text + image	1	Not included	1
Fakeddit^[26]	527 049	628 501	Text + image	-	Included	1
MM-COVID^[27]	7 192	3 981	Text + image	1	Included	6
MFND	5 000	5 000	Text, image, text + image	10	Not included	2
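
The real and fake counts in the comparison table can be turned into a class-balance figure per dataset. The following is a minimal sketch, not part of the original paper: the names `COUNTS` and `fake_share` are illustrative, and the numbers are copied from the table above.

```python
# Real/fake news counts copied from the comparison table above.
# COUNTS and fake_share are illustrative names, not taken from the paper.
COUNTS = {
    "Weibo": (4749, 4779),
    "MediaEval": (6225, 9596),
    "FacebookHoax": (6577, 8923),
    "TI-CNN": (11941, 8074),
    "Fakeddit": (527049, 628501),
    "MM-COVID": (7192, 3981),
    "MFND": (5000, 5000),
}


def fake_share(real: int, fake: int) -> float:
    """Return the fraction of items in a dataset that are labeled fake."""
    return fake / (real + fake)


if __name__ == "__main__":
    for name, (real, fake) in COUNTS.items():
        total = real + fake
        print(f"{name:13s} total={total:>9,d} fake share={fake_share(real, fake):.3f}")
```

A share near 0.5 indicates a balanced dataset, which matters when accuracy is used as an evaluation metric.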

References

[1]	ALLCOTT H , GENTZKOW M . Social media and fake news in the 2016 election[J]. Journal of Economic Perspectives, 2017,31(2): 211-236.
[2]	CONROY N K , RUBIN V L , CHEN Y . Automatic deception detection:methods for finding fake news[J]. Proceedings of the Association for Information Science and Technology, 2015,52(1): 1-4.
[3]	SHU K , SLIVA A , WANG S H ,et al. Fake news detection on social media:a data mining perspective[J]. ACM SIGKDD Explorations Newsletter, 2017,19(1): 22-36.
[4]	VOSOUGHI S , ROY D , ARAL S . The spread of true and false news online[J]. Science, 2018,359(6380): 1146-1151.
[5]	LI L , CAI G , CHEN N . A rumor events detection method based on deep bidirectional GRU neural network[C]// Proc of 2018 IEEE 3rd International Conference on Image,Vision and Computing. 2018: 755-759.
[6]	ZHENG H H , HAO Y N , YU H T ,et al. Rumor detection in social media based on enhanced Transformer[J]. Chinese Journal of Network and Information Security, 2022,8(4): 168-174.
[7]	CHENG M X , NAZARIAN S , BOGDAN P . Vroc:variational autoencoder-aided multi-task rumor classifier based on text[C]// Proc of the Web Conference 2020. 2020: 2892-2898.
[8]	MA J , GAO W , MITRA P ,et al. Detecting rumors from microblogs with recurrent neural networks[C]// Proc of the 25th International Joint Conference on Artificial Intelligence. 2016: 3818-3824.
[9]	YU F , LIU Q , WU S ,et al. A convolutional approach for misinformation identification[C]// Proc of the 26th International Joint Conference on Artificial Intelligence. 2017: 3901-3907.
[10]	BIAN T , XIAO X , XU T Y ,et al. Rumor detection on social media with bi-directional graph convolutional networks[C]// Proc of the AAAI Conference on Artificial Intelligence. 2020: 549-556.
[11]	ZHANG H W , FANG Q , QIAN S S ,et al. Multi-modal knowledge-aware event memory network for social media rumor detection[C]// Proc of the 27th ACM International Conference on Multimedia. 2019: 1942-1951.
[12]	AZRI A , FAVRE C , HARBI N ,et al. Calling to CNN-LSTM for rumor detection:a deep multi-channel model for message veracity classification in microblogs[C]// Proc of Joint European Conference on Machine Learning and Knowledge Discovery in Databases. 2021: 497-513.
[13]	QI P , CAO J , SHENG Q . Semantics-enhanced multi-modal fake news detection[J]. Journal of Computer Research and Development, 2021,58(7): 1456-1465.
[14]	JIN Z W , CAO J , GUO H ,et al. Multimodal fusion with recurrent neural networks for rumor detection on microblogs[C]// Proc of the 25th ACM International Conference on Multimedia. 2017: 795-816.
[15]	TUAN N M D , MINH P Q N . Multimodal fusion with BERT and attention mechanism for fake news detection[C]// Proc of 2021 RIVF International Conference on Computing and Communication Technologies. 2021: 1-6.
[16]	GOODFELLOW I , POUGET-ABADIE J , MIRZA M ,et al. Generative adversarial nets[C]// Proc of the 27th International Conference on Neural Information Processing Systems. 2014: 2672-2680.
[17]	MITRA T , GILBERT E . Credbank:a large-scale social media corpus with associated credibility annotations[C]// Proc of the International AAAI Conference on Web and Social Media. 2015: 258-267.
[18]	POTTHAST M , KIESEL J , REINARTZ K ,et al. A stylometric inquiry into hyperpartisan and fake news[J]. arXiv Preprint arXiv:1702.05638, 2017.
[19]	WANG W Y . “Liar,liar pants on fire”:a new benchmark dataset for fake news detection[J]. arXiv Preprint arXiv:1705.00648, 2017.
[20]	GAO Z W , YADA S , WAKAMIYA S ,et al. Naist covid:multilingual covid-19 Twitter and Weibo dataset[J]. arXiv Preprint arXiv:2004.08145, 2020.
[21]	SHAHI G K , NANDINI D . FakeCovid—a multilingual cross-domain fact check news dataset for COVID-19[J]. arXiv Preprint arXiv:2006.11343, 2020.
[22]	DU J S , DOU Y T , XIA C Y ,et al. Cross-lingual covid-19 fake news detection[C]// Proc of 2021 International Conference on Data Mining Workshops. 2021: 859-862.
[23]	BOIDIDOU C , ANDREADOU K , PAPADOPOULOS S ,et al. Verifying multimedia use at mediaeval 2015[C]// Proc of the MediaEval 2015 Multimedia Benchmark Workshop. 2015. 14-15.
[24]	TACCHINI E , BALLARIN G , DELLA VEDOVA M L ,et al. Some like it hoax:automated fake news detection in social networks[J]. arXiv Preprint arXiv:1704.07506, 2017.
[25]	YANG Y , ZHENG L , ZHANG J W ,et al. TI-CNN:convolutional neural networks for fake news detection[J]. arXiv Preprint arXiv:1806.00749, 2018.
[26]	NAKAMURA K , LEVY S , WANG W Y . Fakeddit:a new multimodal benchmark dataset for fine-grained fake news detection[J]. arXiv Preprint arXiv:1911.03854, 2019.
[27]	LI Y C , JIANG B H , SHU K ,et al. MM-COVID:a multilingual and multimodal data repository for combating covid-19 disinformation[J]. arXiv Preprint arXiv:2011.04088, 2020.
[28]	SLANEY M , CASEY M . Locality-sensitive hashing for finding nearest neighbors[J]. IEEE Signal Processing Magazine, 2008,25(2): 128.
[29]	WANG Y Q , MA F L , JIN Z W ,et al. Eann:event adversarial neural networks for multi-modal fake news detection[C]// Proc of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2018: 849-857.
[30]	KIM Y . Convolutional neural networks for sentence classification[C]// Proc of the 2014 Conference on Empirical Methods in Natural Language Processing. 2014: 1746-1751.
[31]	DEVLIN J , CHANG M W , LEE K ,et al. Bert:pre-training of deep bidirectional transformers for language understanding[J]. arXiv Preprint arXiv:1810.04805, 2018.
[32]	SINGHAL S , SHAH R R , CHAKRABORTY T ,et al. SpotFake:a multi-modal framework for fake news detection[C]// Proc of 2019 IEEE 5th International Conference on Multimedia Big Data. 2019: 39-47.

Language	Real news	Fake news
Chinese	4 000	4 000
English	1 000	1 000

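Detection method	Batch size	Optimizer	Word embedding size
tanh-RNN^[8]	50	Adaptive learning	100
TextCNN^[30]	50	Stochastic gradient descent (SGD)	768
BERT^[31]	256	Adaptive moment estimation (Adam)	768
EANN^[29]	100	Adaptive moment estimation (Adam)	32
SpotFake^[32]	256	Adaptive moment estimation (Adam)	768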
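
The settings table above maps naturally onto a small configuration structure. The sketch below is one possible encoding, assuming nothing beyond the table itself: `TrainConfig`, `CONFIGS`, and `batches_per_epoch` are hypothetical names, the optimizer strings simply restate the table entries, and the sample count of 1 000 in the usage line is an illustrative number, not a split size reported by the paper.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """One row of the settings table (hypothetical container)."""
    batch_size: int
    optimizer: str       # restated from the table, not a library identifier
    embedding_dim: int   # word embedding size


# Values copied from the settings table above.
CONFIGS = {
    "tanh-RNN": TrainConfig(50, "adaptive learning", 100),
    "TextCNN": TrainConfig(50, "SGD", 768),
    "BERT": TrainConfig(256, "Adam", 768),
    "EANN": TrainConfig(100, "Adam", 32),
    "SpotFake": TrainConfig(256, "Adam", 768),
}


def batches_per_epoch(num_samples: int, cfg: TrainConfig) -> int:
    """Number of mini-batches needed to cover num_samples once."""
    return math.ceil(num_samples / cfg.batch_size)


if __name__ == "__main__":
    # 1000 is an illustrative sample count, not taken from the paper.
    for name, cfg in CONFIGS.items():
        print(name, cfg, batches_per_epoch(1000, cfg))
```

Keeping the per-method settings in one immutable mapping makes it harder for a rerun to silently use a different batch size or embedding width than the one listed.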

		MFND				Weibo_Wang dataset^[29]
Type	Method	Accuracy	Precision	Recall	F1-score	Accuracy	Precision	Recall	F1-score
Single-modal	tanh-RNN^[8]	0.917	0.963	0.853	0.905	0.575	0.523	0.415	0.463
Single-modal	TextCNN^[30]	0.931	0.924	0.934	0.933	0.875	0.812	0.921	0.865
Single-modal	BERT^[31]	0.980	0.987	0.975	0.981	0.858	0.888	0.868	0.878
Multi-modal	EANN^[29]	0.982	0.969	0.995	0.982	0.803	0.866	0.731	0.793
Multi-modal	SpotFake^[32]	0.982	0.995	0.970	0.982	0.854	0.812	0.895	0.854
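
The F1-score column relates to the precision and recall columns through the standard harmonic-mean definition, F1 = 2PR / (P + R). The page does not spell out the formula, so the sketch below assumes that standard definition; `f1_score` and `MFND_ROWS` are illustrative names, and the three rows are copied from the MFND half of the results table.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    total = precision + recall
    return 0.0 if total == 0 else 2 * precision * recall / total


# (precision, recall, reported F1) for three MFND rows of the results table.
MFND_ROWS = {
    "tanh-RNN": (0.963, 0.853, 0.905),
    "BERT": (0.987, 0.975, 0.981),
    "EANN": (0.969, 0.995, 0.982),
}

if __name__ == "__main__":
    for method, (p, r, reported) in MFND_ROWS.items():
        print(f"{method:9s} recomputed={f1_score(p, r):.3f} reported={reported:.3f}")
```

Because the tabulated precision and recall are already rounded to three decimals, a recomputed F1 is only expected to match the reported value approximately, and other rows may differ in the third decimal place.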

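Dataset type	Dataset	Year	Number of labels
Single-modal	CREDBANK^[17]	2015	2
Single-modal	BuzzFeedNews^[18]	2016	4
Single-modal	LIAR^[19]	2017	6
Single-modal	NaistCovid^[20]	2020	2
Single-modal	FakeCovid^[21]	2020	2
Single-modal	CrossCOVID19^[22]	2021	2
Multi-modal	MediaEval^[23]	2016	2
Multi-modal	Weibo^[14]	2017	2
Multi-modal	FacebookHoax^[24]	2017	2
Multi-modal	TI-CNN^[25]	2018	2
Multi-modal	Fakeddit^[26]	2020	2, 3, 6
Multi-modal	MM-COVID^[27]	2020	2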
