Journal on Communications (通信学报) ›› 2021, Vol. 42 ›› Issue (10): 173-181. doi: 10.11959/j.issn.1000-436x.2021192

• Academic paper

Text similarity detection method based on NLP

Xiaoli DAI1,2, Shifeng LIU1, Daqing GONG1

  1 School of Economics and Management, Beijing Jiaotong University, Beijing 100044, China
  2 China InfoCom Media Group, Beijing 100078, China
  • Revised: 2021-09-13 Online: 2021-10-25 Published: 2021-10-01
  • About the authors: Xiaoli DAI (1979- ), female, born in Anyang, Henan; Ph.D. candidate at Beijing Jiaotong University; main research interest: information management
    Shifeng LIU (1970- ), male, born in Baoding, Hebei; Ph.D.; professor and doctoral supervisor at Beijing Jiaotong University; main research interests: information management, big data analysis, etc.
    Daqing GONG (1982- ), male, born in Weihai, Shandong; Ph.D.; associate professor at Beijing Jiaotong University; main research interest: big data analysis and applications
  • Supported by:
    The National Natural Science Foundation of China (J1824031)

Abstract:

Current text similarity detection methods ignore document structure information and lack semantic relevance. To solve these problems, a text-oriented similarity detection method was proposed. First, the analytic hierarchy process (AHP) was used to calculate word position weights in order to extract feature words. Second, the Pearson correlation coefficient was used to measure the semantic correlation between words, and this correlation served as the weight in a generalized Dice coefficient for calculating similarity. Experimental results show that the proposed method improves both the precision of feature word extraction and the accuracy of the similarity calculation results.

Key words: text similarity, word position weight, analytic hierarchy process, feature word extraction, Pearson correlation coefficient
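
The three ingredients named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, since the abstract does not give their exact formulas. The following are assumptions made only for illustration: AHP weights are approximated with the row geometric-mean method, the Pearson coefficient is computed on arbitrary numeric word vectors, and the weighted Dice form 2·uᵀSv / (uᵀSu + vᵀSv) takes S as a word-to-word correlation matrix. All function names and the sample comparison matrix are hypothetical.

```python
import math


def ahp_weights(pairwise):
    """Approximate AHP priority weights with the row geometric-mean method.

    `pairwise` is a reciprocal comparison matrix: pairwise[i][j] states how
    much more important criterion i is than criterion j.
    """
    n = len(pairwise)
    geo_means = [math.prod(row) ** (1.0 / n) for row in pairwise]
    total = sum(geo_means)
    return [g / total for g in geo_means]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant vector is treated as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def weighted_dice(u, v, corr):
    """Generalized Dice coefficient with a word-to-word weight matrix `corr`.

    With `corr` equal to the identity matrix this reduces to the ordinary
    vector form 2*sum(u_i*v_i) / (sum(u_i^2) + sum(v_i^2)).
    """
    def bilinear(a, b):
        return sum(a[i] * corr[i][j] * b[j]
                   for i in range(len(a)) for j in range(len(b)))

    denom = bilinear(u, u) + bilinear(v, v)
    return 2.0 * bilinear(u, v) / denom if denom else 0.0


if __name__ == "__main__":
    # Hypothetical comparison of three word positions: title, first paragraph, body.
    position_matrix = [[1, 3, 5], [1 / 3, 1, 2], [1 / 5, 1 / 2, 1]]
    print("position weights:", ahp_weights(position_matrix))

    # Hypothetical feature-word vectors for two documents over three words.
    doc_a, doc_b = [1, 1, 0], [1, 0, 1]
    identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
    related = [[1, 0, 0], [0, 1, 1], [0, 1, 1]]  # words 2 and 3 fully correlated
    print("plain Dice:", weighted_dice(doc_a, doc_b, identity))
    print("correlation-weighted Dice:", weighted_dice(doc_a, doc_b, related))
```

In this sketch, two documents that share no surface form for a pair of strongly correlated words still receive a higher score than under the plain Dice coefficient. This is the kind of semantic association the abstract attributes to the Pearson weighting.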

