Similar text positioning method based on slope-density cluster

Journal on Communications

Zou Du1, Tang Wenjun1, Long Weijiang2, Zhang Ling3

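1. Information Network Engineering Research Center, South China University of Technology, Guangzhou, Guangdong 510640; 2. School of Science, South China University of Technology, Guangzhou, Guangdong 510640; 3. School of Computer Science, South China University of Technology, Guangzhou, Guangdong 510640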

Publication date: 2013-12-25  Release date: 2013-12-17
Funding:
National Natural Science Foundation of China (61070092)

Similar text positioning method based on slope-density cluster

Online:2013-12-25 Published:2013-12-17

Abstract

Similar text positioning is an important step in plagiarism detection. Most existing positioning methods merge texts or fingerprints directly, so their accuracy is strongly affected by interfering information. To address this limitation, the semantic features of matched fingerprint pairs are analyzed and a similar-text clustering method based on slope density is proposed. The method converts the problem of merging text matches into a problem of clustering dense sample points, which improves the efficiency and accuracy of positioning. Experiments on the PAN public corpus show that the method outperforms the top three PAN10 entries on the main metrics. The method is now used for homework plagiarism checking on the featured-specialty teaching platform of South China University of Technology, with good results.
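The abstract gives no algorithmic detail, so the following is only a minimal sketch of the general idea of treating matched fingerprint pairs as sample points and clustering the dense ones, not the authors' method. It assumes each match is a point (offset in the suspicious text, offset in the source text), that copied passages show up as nearby points lying on a line of slope close to 1, and that isolated points are interference. The function names and the parameters `max_gap`, `slope_tol`, and `min_points` are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[int, int]          # (offset in suspicious text, offset in source text)
Span = Tuple[int, int, int, int]  # (susp_start, susp_end, src_start, src_end)


def is_slope_neighbor(p: Point, q: Point, max_gap: int, slope_tol: float) -> bool:
    """Assumed neighbor rule: q is close to p and the connecting slope is near 1."""
    dx, dy = q[0] - p[0], q[1] - p[1]
    if dx == 0 or abs(dx) > max_gap or abs(dy) > max_gap:
        return False
    return abs(dy / dx - 1.0) <= slope_tol


def cluster_matches(points: List[Point], max_gap: int = 50,
                    slope_tol: float = 0.3, min_points: int = 3) -> List[Span]:
    """Group matches into connected components under the neighbor rule.

    Components with fewer than min_points members are treated as noise.
    This is a simple O(n^2) illustration, not an optimized implementation.
    """
    pts = sorted(set(points))
    seen = [False] * len(pts)
    spans: List[Span] = []
    for start in range(len(pts)):
        if seen[start]:
            continue
        seen[start] = True
        stack, component = [start], []
        while stack:
            i = stack.pop()
            component.append(pts[i])
            for j in range(len(pts)):
                if not seen[j] and is_slope_neighbor(pts[i], pts[j], max_gap, slope_tol):
                    seen[j] = True
                    stack.append(j)
        if len(component) >= min_points:
            xs = [x for x, _ in component]
            ys = [y for _, y in component]
            spans.append((min(xs), max(xs), min(ys), max(ys)))
    return spans


# Two copied passages plus two stray matches (made-up offsets).
matches = [(0, 100), (10, 110), (20, 120), (30, 130),
           (15, 900), (500, 3),
           (1000, 40), (1012, 52), (1024, 64)]
print(cluster_matches(matches))
# [(0, 30, 100, 130), (1000, 1024, 40, 64)]
```

In this toy input the two stray matches are dropped because they have no slope-aligned neighbors, while each diagonal run is reported as one pair of text spans. Real thresholds would depend on the fingerprinting scheme and would need tuning on a corpus.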

Zou Du, Tang Wenjun, Long Weijiang, Zhang Ling. Similar text positioning method based on slope-density cluster[J]. Journal on Communications.