Telecommunications Science, 2020, Vol. 36, Issue (3): 83-94. doi: 10.11959/j.issn.1000-0801.2020061

• Research and Development •

A short-text strategy mining method based on sub-semantic space

Yang SUN (孙洋), Li SU (粟栗), Xing ZHANG (张星), Fengsheng WANG (王峰生), Haitao DU (杜海涛)

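  1. China Mobile Research Institute, Beijing 100032, China
  • Revised: 2020-03-06 Online: 2020-03-20 Published: 2020-03-26
  • Author biographies:
    Yang SUN (1983- ), female, engineer at China Mobile Research Institute. Her main research interests are natural language processing, machine learning, and big data security.
    Li SU (1981- ), male, Ph.D., professor-level senior engineer at China Mobile Research Institute. His main research interests are big data security and cryptography.
    Xing ZHANG (1980- ), male, engineer and technical manager at China Mobile Research Institute. His main research interests are data security, data storage, and virtualization technology.
    Fengsheng WANG (1979- ), male, currently works at China Mobile Research Institute. His main research interest is mobile communication network security.
    Haitao DU (1979- ), male, Ph.D., senior engineer at China Mobile Research Institute. His main research interests are big data security and mobile communication network security.
  • Supported by:
    Ministry of Education-China Mobile Research Fund (MCM201805-2)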

Method of short text strategy mining based on sub-semantic space

Yang SUN, Li SU, Xing ZHANG, Fengsheng WANG, Haitao DU

  1. China Mobile Research Institute, Beijing 100032, China
  • Revised: 2020-03-06 Online: 2020-03-20 Published: 2020-03-26
  • Supported by:
    Ministry of Education-China Mobile Research Fund (MCM201805-2)

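Abstract:

To address the problem of accurately identifying short text data, a short-text strategy mining method based on sub-semantic space is proposed. The method first applies semantic space technology to address the “vocabulary gap” and “data sparsity” problems that arise when analyzing short texts. It then uses a clustering algorithm to divide the semantic space into multiple sub-semantic spaces and mines association rules in each sub-semantic space in parallel, which improves the efficiency and quality of strategy generation. Finally, a binary tree is used to merge the strategies and produce a minimal strategy set. Experiments show that, compared with traditional classification models, the strategy set generated by this scheme reaches an accuracy of up to 88% at a false positive rate of 6.5%. In detecting and handling illegal short messages, the strategy sets mined with this technique offer strong coverage and high accuracy, and are highly practical.

Keywords: sub-semantic space, strategy extraction, short text, association rule mining, clustering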

Abstract:

To identify short text data accurately, a method of short text strategy mining based on sub-semantic space is proposed. First, semantic space technology is used to address the “vocabulary gap” and “data sparseness” problems in short text analysis. Then, a clustering algorithm divides the semantic space into several sub-semantic spaces, and association rules are mined within each sub-semantic space, which improves the efficiency and quality of strategy generation. Finally, a binary tree is used to merge strategies and generate the simplest strategy set. Experiments show that, compared with the traditional classification model, the strategy set generated by the proposed scheme achieves an accuracy of 85% at a false positive rate of 6.5%. In the processing of illegal short messages, strategy sets mined with this technique offer strong coverage, high accuracy, and strong practicability.

Key words: sub-semantic space, strategy extraction, short text, association rule mining, clustering
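
The pipeline outlined in the abstract (project texts into a semantic space, split it into sub-spaces, mine association rules per sub-space, merge the rules) can be illustrated with a small Python sketch. This is a toy under stated assumptions, not the authors' implementation: the `CONCEPTS` word-to-concept map, the sample messages, the thresholds, and all function names are hypothetical. Grouping by the alphabetically first concept stands in for the clustering step, and a plain subsumption filter stands in for the binary-tree merge.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical word -> concept map standing in for a learned semantic space.
CONCEPTS = {
    "loan": "finance", "credit": "finance", "cash": "finance",
    "win": "prize", "prize": "prize", "lottery": "prize",
    "click": "link", "url": "link",
    "meeting": "work", "lunch": "daily",
}

def to_concepts(text):
    """Project a short text onto the set of concepts it mentions."""
    return frozenset(CONCEPTS[w] for w in text.lower().split() if w in CONCEPTS)

def split_subspaces(items):
    """Stand-in for clustering: group items by their alphabetically first concept."""
    groups = defaultdict(list)
    for concepts, label in items:
        if concepts:
            groups[min(concepts)].append((concepts, label))
    return groups

def mine_rules(group, min_support=2, min_conf=0.8, max_len=2):
    """Return concept sets whose presence predicts label 1 within one sub-space."""
    candidates = set()
    for concepts, _ in group:
        for k in range(1, max_len + 1):
            candidates.update(frozenset(c) for c in combinations(sorted(concepts), k))
    rules = []
    for cand in candidates:
        hits = [label for concepts, label in group if cand <= concepts]
        if len(hits) >= min_support and sum(hits) / len(hits) >= min_conf:
            rules.append(cand)
    return rules

def merge_rules(rules):
    """Drop every rule that is a strict superset of another, shorter rule."""
    unique = set(rules)
    kept = [r for r in unique if not any(other < r for other in unique)]
    return sorted(kept, key=sorted)

def mine_strategies(messages):
    """Run the whole toy pipeline on (text, label) pairs; label 1 marks a violation."""
    items = [(to_concepts(text), label) for text, label in messages]
    rules = []
    for group in split_subspaces(items).values():
        rules.extend(mine_rules(group))
    return merge_rules(rules)

messages = [
    ("win a prize click url", 1),
    ("lottery prize click", 1),
    ("cash loan click url", 1),
    ("credit loan url", 1),
    ("loan meeting", 0),
    ("lunch meeting", 0),
]
strategies = mine_strategies(messages)
print([sorted(rule) for rule in strategies])
```

Because each group is mined independently, the loop in `mine_strategies` could be mapped over a worker pool, which mirrors the parallel per-sub-space mining the abstract describes.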
