Telecommunications Science ›› 2017, Vol. 33 ›› Issue (11): 73-82. doi: 10.11959/j.issn.1000-0801.2017313

• Research and Development •

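Design and implementation of spam filtering system based on topic model

Xiaohuai KOU, Hua CHENG

  1. College of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China
  • Revised: 2017-09-16 Online: 2017-11-01 Published: 2017-12-08
  • About the authors: Xiaohuai KOU (1989-), male, is a master's student at the College of Information Science and Engineering, East China University of Science and Technology. His research interests include information analysis and processing, intelligent signal processing, and network and information security. | Hua CHENG (1975-), male, Ph.D., is an associate professor at the College of Information Science and Engineering, East China University of Science and Technology. His research interests include information security, signal processing, network behavior, and traffic engineering.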

Abstract:

Spam filtering technology plays an important role in ensuring information security, improving resource utilization, and sorting information. Spam degrades the user experience and causes unnecessary losses of money and time. To address the shortcomings of existing spam filtering techniques, a naive Bayes spam classification method built on multiple topic words is proposed. For topic extraction, the LDA topic model is used to obtain the topics of each email and their topic words, and Word2Vec is then used to find synonyms and related words of those topic words, expanding the topic-word set. For classification, the prior probabilities of words are obtained by statistical learning on the training dataset. Based on the expanded topic-word set and its probabilities, the joint probability of a topic and an email is derived with Bayes' formula and used as the basis for the spam decision. The resulting topic-model-based spam filtering system is simple and easy to apply. Comparative experiments against other typical spam filtering methods show that both the topic-model-based classification method and its Word2Vec-based improvement effectively improve the accuracy of spam filtering.

Key words: text classification, spam, topic model, Bayesian theory
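
The classification step summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the topic-word set has already been produced by LDA and expanded with Word2Vec upstream, it uses two hypothetical labels (`spam`, `ham`) in place of the paper's topics, and it adds Laplace smoothing, which the abstract does not mention. The function names and the toy data are invented for illustration.

```python
import math
from collections import Counter


def train_word_probs(docs, topic_words, alpha=1.0):
    """Estimate P(word | label) over the topic-word set (Laplace smoothing is an assumption)."""
    counts = Counter(w for doc in docs for w in doc if w in topic_words)
    total = sum(counts.values())
    denom = total + alpha * len(topic_words)
    return {w: (counts[w] + alpha) / denom for w in topic_words}


def log_joint(tokens, label_prior, word_probs):
    """log P(label, message), counting only tokens found in the topic-word set."""
    score = math.log(label_prior)
    for w in tokens:
        if w in word_probs:
            score += math.log(word_probs[w])
    return score


def classify(tokens, priors, probs):
    """Return the label with the largest joint probability for this message."""
    return max(priors, key=lambda c: log_joint(tokens, priors[c], probs[c]))


# Hypothetical expanded topic-word set and toy training messages.
topic_words = {"free", "prize", "winner", "meeting", "report"}
spam_docs = [["free", "prize", "winner"], ["free", "prize", "now"]]
ham_docs = [["meeting", "report", "today"], ["report", "draft"]]

priors = {"spam": 0.5, "ham": 0.5}
probs = {
    "spam": train_word_probs(spam_docs, topic_words),
    "ham": train_word_probs(ham_docs, topic_words),
}

label = classify(["free", "prize", "inside"], priors, probs)
```

Working in log space avoids numerical underflow when a message contains many topic words; tokens outside the topic-word set are ignored, which mirrors the abstract's restriction to the expanded topic-word set.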
