Web text mining method based on subtopic selection and three-level stratified structure

doi:10.11959/j.issn.1000-0801.2016142

Abstract

To address the ambiguous, overly broad queries and the data sparseness caused by the intent gap between users and their queries, a Web text mining method is proposed that returns a ranked list of possible subtopics according to popularity and diversity, using a stratified structure for subtopic selection and ranking. First, starting from noun phrases and alternative partial queries, simple patterns are used to extract a variety of related phrases as candidate subtopics. Then, relevant documents from a Web document collection are used to build a three-level stratified structure over the candidate subtopics. Finally, the subtopics are ranked using this structure and their estimated popularity, taking both popularity and diversity into account. Experiments used 100 Japanese queries from the NTCIR-9 collection and 100 English queries from the TREC 2009 collection and the Web Track diversity task. The results show that the proposed method can be applied effectively to a variety of searches and that, for highly ranked subtopics, it outperforms mining from external resources.

Key words: data sparseness, text mining, stratified structure, diversity, popularity
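
The ranking step described in the abstract can be illustrated with a small sketch. The abstract does not give the scoring formula, so everything below is an assumption: the three levels are modelled as group, cluster, and subtopic; each subtopic carries an estimated popularity; and a greedy pass discounts a candidate each time a sibling from the same group or cluster has already been selected. The names `Subtopic`, `rank_subtopics`, and `penalty`, as well as the sample data, are hypothetical and not taken from the paper.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Subtopic:
    text: str          # candidate subtopic phrase
    group: str         # level 1 of the assumed hierarchy
    cluster: str       # level 2 (level 3 is the subtopic itself)
    popularity: float  # estimated popularity, higher means more popular


def rank_subtopics(candidates, penalty=0.5):
    """Greedy ranking: popularity discounted by already-selected siblings."""
    remaining = list(candidates)
    picked_groups = Counter()
    picked_clusters = Counter()
    ranking = []
    while remaining:
        def score(s):
            # Each earlier pick from the same group or the same cluster
            # multiplies the popularity by `penalty`, favouring new branches.
            repeats = picked_groups[s.group] + picked_clusters[(s.group, s.cluster)]
            return s.popularity * (penalty ** repeats)

        best = max(remaining, key=score)
        ranking.append(best)
        remaining.remove(best)
        picked_groups[best.group] += 1
        picked_clusters[(best.group, best.cluster)] += 1
    return ranking


# Made-up candidates for the ambiguous query "jaguar" (illustration only).
candidates = [
    Subtopic("jaguar car price", "car", "buying", 0.9),
    Subtopic("jaguar car dealers", "car", "buying", 0.8),
    Subtopic("jaguar car history", "car", "company", 0.5),
    Subtopic("jaguar animal habitat", "animal", "biology", 0.6),
]

for s in rank_subtopics(candidates):
    print(s.text)
```

With `penalty=1.0` the ranking reduces to sorting by popularity alone; smaller values push subtopics from branches not yet represented toward the top of the list.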

Yuzhen SHI, Donghong SHAN. Web text mining method based on subtopic selection and three-level stratified structure[J]. Telecommunications Science, 2016, 32(5): 96-104.


Language	Method	Mean I-rec@10	Mean D-nDCG@10	Mean D#-nDCG@10
Japanese	BASE-J-QS	0.3959	0.3880	0.3919
	BASE-J-AC	0.3468	0.3857	0.3528
	BASE-J-BP	0.3985	0.3904	0.3945
	PROP-J-PT	0.4178	0.4097	0.4138^qA
	PROP-J-HR	0.4378	0.4293	0.4336^QAB
	PROP-J-DC	0.4282	0.4087	0.4184^qAb
	PROP-J-DCA	0.4369	0.4306	0.4337^QAB
	EXT-AC	0.4355	0.4056	0.4205
English	BASE-J-QS	0.5119	0.5650	0.5385
	BASE-J-AC	0.4349	0.4495	0.4422
	BASE-J-BP	0.5288	0.5151	0.5219
	PROP-J-PT	0.5474	0.5556	0.5515^Ab
	PROP-J-HR	0.5614	0.5625	0.5619^qAB
	PROP-J-DC	0.5571	0.5696	0.5634^qAB
	PROP-J-DCA	0.5552	0.5623	0.5587^qAB

Language	Method	Mean I-rec@20	Mean D-nDCG@20	Mean D#-nDCG@20
Japanese	BASE-J-QS	0.5117	0.3938	0.4528
	BASE-J-AC	0.4533	0.3496	0.4015
	BASE-J-BP	0.5144	0.3914	0.4529
	PROP-J-PT	0.5553	0.4115	0.4834^qA
	PROP-J-HR	0.5818	0.4239	0.5029^QAB
	PROP-J-DC	0.5590	0.4072	0.4831^qAb
	PROP-J-DCA	0.5817	0.4248	0.5032^QAB
	EXT-AC	0.6307	0.4201	0.5254
English	BASE-J-QS	0.6070	0.5630	0.5850
	BASE-J-AC	0.5520	0.4332	0.4926
	BASE-J-BP	0.6181	0.5127	0.5654
	PROP-J-PT	0.6357	0.5700	0.6029^Ab
	PROP-J-HR	0.6640	0.5708	0.6174^QAB
	PROP-J-DC	0.6709	0.5780	0.6245^QAB
	PROP-J-DCA	0.6704	0.5716	0.6210^QAB

Language	Method	Mean I-rec@30	Mean D-nDCG@30	Mean D#-nDCG@30
Japanese	BASE-J-QS	0.5674	0.4016	0.4845
	BASE-J-AC	0.5153	0.3508	0.4331
	BASE-J-BP	0.5662	0.3915	0.4789
	PROP-J-PT	0.6245	0.4242	0.5244^qA
	PROP-J-HR	0.6489	0.4254	0.5371^QAB
	PROP-J-DC	0.6381	0.4174	0.5278^qAb
	PROP-J-DCA	0.6497	0.4247	0.5372^QAB
	EXT-AC	0.6533	0.3867	0.5200
English	BASE-J-QS	0.6601	0.5706	0.6154
	BASE-J-AC	0.5928	0.4206	0.5067
	BASE-J-BP	0.6630	0.5032	0.5831
	PROP-J-PT	0.7118	0.5798	0.6458^Ab
	PROP-J-HR	0.7257	0.5776	0.6517^QAB
	PROP-J-DC	0.7279	0.5805	0.6542^QAB
	PROP-J-DCA	0.7274	0.5784	0.6529^QAB
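
The three metric columns in the tables above are related. In the NTCIR diversity evaluation framework, D#-nDCG is a linear combination of intent recall and D-nDCG, D#-nDCG = γ · I-rec + (1 − γ) · D-nDCG, and γ = 0.5 is the usual setting. The reported values appear consistent with γ = 0.5 to within rounding in every row except the Japanese BASE-J-AC row at cutoff 10, where one of the values may have been garbled. The sketch below recomputes a few rows; the function name `d_sharp_ndcg` and the tolerance are assumptions made for illustration.

```python
def d_sharp_ndcg(i_rec: float, d_ndcg: float, gamma: float = 0.5) -> float:
    """Combine intent recall and D-nDCG into D#-nDCG."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    return gamma * i_rec + (1.0 - gamma) * d_ndcg


# (label, I-rec, D-nDCG, reported D#-nDCG) copied from the tables above
rows = [
    ("Japanese PROP-J-HR @10", 0.4378, 0.4293, 0.4336),
    ("Japanese BASE-J-AC @10", 0.3468, 0.3857, 0.3528),
    ("Japanese EXT-AC @20", 0.6307, 0.4201, 0.5254),
    ("English PROP-J-DC @30", 0.7279, 0.5805, 0.6542),
]

for name, i_rec, d_ndcg, reported in rows:
    computed = d_sharp_ndcg(i_rec, d_ndcg)
    # Allow for four-decimal rounding in the published figures.
    status = "consistent" if abs(computed - reported) < 1e-3 else "check values"
    print(f"{name}: computed {computed:.4f}, reported {reported:.4f} ({status})")
```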