Chinese Journal of Network and Information Security ›› 2023, Vol. 9 ›› Issue (4): 144-154. doi: 10.11959/j.issn.2096-109x.2023060

• Academic Papers •

Construction of multi-modal social media dataset for fake news detection

Guopeng GAO¹, Yaodong FANG¹, Yanfang HAN¹, Zhenxing QIAN², Chuan QIN¹

  ¹ School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China
  ² School of Computer Science, Fudan University, Shanghai 200433, China
  • Revised: 2023-05-26 Online: 2023-08-01 Published: 2023-08-01
  • About the authors:
    Guopeng GAO (高国鹏, 1998- ), male, from Huai'an, Jiangsu; master's student at University of Shanghai for Science and Technology; main research interest: multimedia information security
    Yaodong FANG (房耀东, 1997- ), male, from Taizhou, Jiangsu; master's student at University of Shanghai for Science and Technology; main research interest: multimedia information security
    Yanfang HAN (韩彦芳, 1974- ), female, from Shanghai; Ph.D., lecturer at University of Shanghai for Science and Technology; main research interests: image processing and pattern recognition
    Zhenxing QIAN (钱振兴, 1981- ), male, from Nantong, Jiangsu; Ph.D., professor and doctoral supervisor at Fudan University; main research interests: information hiding and AI security
    Chuan QIN (秦川, 1980- ), male, from Wuhu, Anhui; Ph.D., professor and doctoral supervisor at University of Shanghai for Science and Technology; main research interests: multimedia information security and AI security
  • Supported by:
    The National Natural Science Foundation of China (U20B2051); The National Natural Science Foundation of China (62172280); The Natural Science Foundation of Shanghai (21ZR1444600)

Abstract:

The advent of social media has brought significant changes to people’s lives. While social media makes it easy to access and share news, it has also become a breeding ground for the creation and spread of fake news, posing a serious threat to social security and stability. Consequently, fake news detection has attracted wide attention from researchers. Although several deep learning-based solutions have been proposed, these methods rely on large amounts of supporting data. Existing fake news datasets, particularly Chinese ones, are scarce, and the news they contain mostly belongs to a single category. To improve fake news detection, a new multi-modal fake news dataset (MFND) was constructed, comprising Chinese and English news data from ten categories: politics, economy, entertainment, sports, international affairs, technology, military, education, health, and social life. The word frequencies and categories of the proposed dataset were analyzed, and the dataset was compared with existing fake news datasets in terms of number of news items, news categories, modal information, and news languages. The comparison shows that MFND stands out in category information and news languages. Moreover, typical existing fake news detection methods were trained and validated on MFND, and the experimental results show that MFND improves model performance by approximately 10% compared with existing mainstream fake news datasets.

Key words: social media, fake news detection, multi-modal, multi-category, dataset

