Text similarity detection method based on NLP

doi:10.11959/j.issn.1000-436x.2021192

Abstract

Current text similarity detection methods ignore document structure information and lack semantic relevance. To address these problems, a text-oriented similarity detection method was proposed. First, the analytic hierarchy process (AHP) was used to calculate word position weights in order to extract feature words. Second, the Pearson correlation coefficient was used to measure the semantic correlation between words, and this correlation served as the weight of the generalized Dice coefficient when calculating similarity. Experimental results show that the proposed method improves both the precision of feature word extraction and the accuracy of the similarity calculation results.

Key words: text similarity, word position weight, analytic hierarchy process, feature word extraction, Pearson correlation coefficient
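
The two measures named in the abstract can be sketched in a few lines. This is only an illustration of the standard definitions of the Pearson correlation coefficient and the generalized Dice coefficient for real-valued vectors; this page does not give the paper's exact weighting formula, so the sketch deliberately does not combine them. The function names `pearson` and `generalized_dice` are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("vectors must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if dev_x == 0 or dev_y == 0:
        # A constant vector has no defined correlation; return 0 by convention here.
        return 0.0
    return cov / (dev_x * dev_y)


def generalized_dice(a, b):
    """Generalized Dice coefficient: 2 * sum(a_i * b_i) / (sum(a_i^2) + sum(b_i^2))."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    numerator = 2 * sum(p * q for p, q in zip(a, b))
    denominator = sum(p * p for p in a) + sum(q * q for q in b)
    return numerator / denominator if denominator else 0.0
```

Identical vectors give a generalized Dice value of 1 and orthogonal vectors give 0, while the Pearson coefficient ranges from -1 to 1.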

CLC number: TP391

Xiaoli DAI, Shifeng LIU, Daqing GONG. Text similarity detection method based on NLP[J]. Journal on Communications, 2021, 42(10): 173-181.

Figures/Tables (10): Figure 1, Table 1, Table 2, Table 3, Table 4, Figure 2, Figure 3, Figure 4, Figure 5, Figure 6

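Importance of factor i relative to factor j	Quantified value
Equally important	1
Slightly more important	3
Strongly more important	5
Very strongly more important	7
Extremely more important	9
Intermediate value between two adjacent judgments	2, 4, 6, 8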

Factor	T	K	A	Weight
T	B_tt	B_tk	B_ta	W(T)
K	B_kt	B_kk	B_ka	W(K)
A	B_at	B_ak	B_aa	W(A)
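
A judgment matrix of the shape shown in the template above is usually filled in from the upper-triangle comparisons only, because AHP judgment matrices are reciprocal (B_ji = 1/B_ij, B_ii = 1). The sketch below shows that construction; the helper name `build_judgment_matrix` is hypothetical, and the example values 3, 5, 3 are the T/K, T/A and K/A judgments that appear in the filled-in matrix on this page.

```python
from fractions import Fraction


def build_judgment_matrix(factors, upper):
    """Build a reciprocal AHP judgment matrix.

    factors: ordered factor names, e.g. ["T", "K", "A"].
    upper:   dict mapping (row_factor, col_factor) to a 1-9 scale judgment
             for each pair above the diagonal.
    """
    n = len(factors)
    index = {name: i for i, name in enumerate(factors)}
    # Start with all ones so the diagonal (a factor compared with itself) is 1.
    matrix = [[Fraction(1) for _ in range(n)] for _ in range(n)]
    for (row_name, col_name), value in upper.items():
        i, j = index[row_name], index[col_name]
        judgment = Fraction(value)
        matrix[i][j] = judgment
        matrix[j][i] = 1 / judgment  # reciprocal entry below the diagonal
    return matrix


example = build_judgment_matrix(
    ["T", "K", "A"],
    {("T", "K"): 3, ("T", "A"): 5, ("K", "A"): 3},
)
for row in example:
    print([str(entry) for entry in row])
```

Using `Fraction` keeps entries such as 1/3 and 1/5 exact, which makes the reciprocal property easy to check.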

Corpus category	Number of extracted feature words
agriculture	6
art	8
computer	7
economy	6
environment	7
history	9
politics	7
space	8
mix	8

Factor	T	K	A	Weight
T	1	3	5	0.6362
K	1/3	1	3	0.2605
A	1/5	1/3	1	0.1061
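
The weights in the judgment matrix above can be approximated with a short calculation. The sketch below assumes the column-normalization (arithmetic mean) approximation of the principal eigenvector; this page does not state which method the authors used. Under that assumption the result is roughly 0.633, 0.260 and 0.106 for T, K and A, which agrees with the printed K and A weights and is close to, but not identical with, the printed T weight of 0.6362. The consistency check uses a random index of 0.58 for a 3 x 3 matrix, a commonly tabulated value that is also an assumption here. The names `ahp_weights` and `consistency_ratio` are hypothetical.

```python
def ahp_weights(matrix):
    """Approximate AHP weights: normalize each column, then average each row."""
    n = len(matrix)
    col_sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [
        sum(matrix[i][j] / col_sums[j] for j in range(n)) / n
        for i in range(n)
    ]


def consistency_ratio(matrix, weights, random_index=0.58):
    """Consistency ratio CR = CI / RI, with CI = (lambda_max - n) / (n - 1).

    random_index defaults to 0.58, an assumed value for n = 3.
    """
    n = len(matrix)
    # Estimate lambda_max as the mean of (A w)_i / w_i over all rows.
    lambda_max = sum(
        sum(matrix[i][j] * weights[j] for j in range(n)) / weights[i]
        for i in range(n)
    ) / n
    consistency_index = (lambda_max - n) / (n - 1)
    return consistency_index / random_index


# Judgment matrix for factors T, K, A, copied from the table above.
judgment = [
    [1, 3, 5],
    [1 / 3, 1, 3],
    [1 / 5, 1 / 3, 1],
]

weights = ahp_weights(judgment)
cr = consistency_ratio(judgment, weights)
print([round(w, 4) for w in weights])
print("consistent" if cr < 0.1 else "inconsistent")
```

A consistency ratio below 0.1 is the usual threshold for accepting a judgment matrix; this matrix falls under it with the assumed random index.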