电信科学 ›› 2016, Vol. 32 ›› Issue (5): 96-104. doi: 10.11959/j.issn.1000-0801.2016142

• 研究与开发 •

基于子主题选择与三级分层结构的Web文本挖掘方法

史玉珍,单冬红   

  1. 平顶山学院软件学院,河南 平顶山 467000
  • 出版日期:2017-02-22 发布日期:2017-02-22
  • 基金资助:
    河南省科技厅科技重点攻关项目

Web text mining method based on subtopic selection and three-level stratified structure

Yuzhen SHI, Donghong SHAN

  1. School of Software, Pingdingshan University, Pingdingshan 467000, China
  • Online:2017-02-22 Published:2017-02-22
  • Supported by:
    Key Project of Science and Technology Department in Henan Province

摘要:

针对用户和查询之间的意图差距导致的查询模糊宽泛和数据稀疏问题,根据流行性和多样性返回可能子主题的排名列表,利用子主题选择与排序的分层结构进行Web 文本挖掘。首先,在名词性短语和可替代部分查询的基础上,使用简单模式提取各种相关的短语作为候选子主题;然后,使用网页文档集合中的相关文档构建候选子主题的三级层次结构;最后,综合考虑流行性和多样性,利用该结构和估计的流行度进行排序。实验使用了NTCIR-9库的100个日文查询和来自TREC 2009库的100个英文查询以及网络跟踪多样性任务,实验结果验证了本文方法可有效应用于各种搜索,对于高排名的子主题挖掘优于外部资源。

关键词: 数据稀疏, 文本挖掘, 层次结构, 多样性, 流行性

Abstract:

To address the ambiguous, overly broad queries and the data sparseness caused by the intent gap between users and their queries, a Web text mining method is proposed that returns a ranked list of possible subtopics according to popularity and diversity, using a stratified structure for subtopic selection and ranking. First, on the basis of noun phrases and alternative partial queries, simple patterns are used to extract a variety of related phrases as candidate subtopics. Then, relevant documents from a Web document collection are used to build a three-level stratified structure of the candidate subtopics. Finally, the candidates are ranked using this structure and their estimated popularity, taking both popularity and diversity into account. Experiments on 100 Japanese queries from the NTCIR-9 collection, 100 English queries from the TREC 2009 collection, and the Web Track diversity task verify that the proposed method can be applied effectively to various kinds of search and that, for high-ranked subtopics, it outperforms mining from external resources.

Key words: data sparseness, text mining, stratified structure, diversity, popularity
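
The final step of the abstract, ranking candidate subtopics by both popularity and diversity, can be illustrated with a small sketch. The abstract does not give the scoring formula, so everything below is an assumption made for illustration: popularity is taken as the share of relevant documents that mention a subtopic, diversity as one minus the largest Jaccard overlap with the subtopics already ranked, and the two are combined by a generic greedy re-ranking with a hypothetical weight `trade_off`. The names `rank_subtopics`, `jaccard` and `docs_by_subtopic` are likewise hypothetical and do not come from the paper, and the three-level structure is not modelled.

```python
from typing import Dict, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Overlap between two document-id sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_subtopics(
    docs_by_subtopic: Dict[str, Set[str]],
    trade_off: float = 0.5,
) -> List[str]:
    """Greedy ranking that balances popularity against redundancy.

    Assumed popularity: fraction of all relevant documents covering a subtopic.
    Assumed diversity: 1 - max Jaccard overlap with subtopics ranked so far.
    Neither definition is taken from the paper.
    """
    all_docs: Set[str] = set().union(*docs_by_subtopic.values())
    total = len(all_docs) or 1  # avoid division by zero on empty input
    popularity = {s: len(d) / total for s, d in docs_by_subtopic.items()}

    ranked: List[str] = []
    remaining = set(docs_by_subtopic)

    def score(subtopic: str) -> float:
        redundancy = max(
            (jaccard(docs_by_subtopic[subtopic], docs_by_subtopic[r]) for r in ranked),
            default=0.0,
        )
        return trade_off * popularity[subtopic] + (1 - trade_off) * (1 - redundancy)

    while remaining:
        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(remaining), key=score)
        ranked.append(best)
        remaining.remove(best)
    return ranked


if __name__ == "__main__":
    # Toy data: document ids in which each candidate subtopic occurs.
    candidates = {
        "jaguar car": {"d1", "d2", "d3", "d4"},
        "jaguar car price": {"d1", "d2", "d3"},
        "jaguar animal": {"d5", "d6"},
    }
    print(rank_subtopics(candidates))
    # → ['jaguar car', 'jaguar animal', 'jaguar car price']
```

With `trade_off=1.0` the ranking reduces to pure popularity, which would place the near-duplicate "jaguar car price" second; the diversity term is what pushes the less popular but distinct subtopic ahead of it.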
