Telecommunications Science (电信科学), 2017, Vol. 33, Issue (1): 77-84. doi: 10.11959/j.issn.1000-0801.2017001

• Research and Development •

大数据中基于时态特征和混合式搜索的博客筛选挖掘

张丽娜,匡泰,姜迪清   

  1. 浙江安防职业技术学院信息工程系,浙江 温州 325000
  • 修回日期:2016-09-14 出版日期:2017-01-01 发布日期:2017-06-04
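  • About the authors: Lina ZHANG (1980-), female, lecturer at Zhejiang College of Security Technology; her main research interests are data mining, graphics and imaging, intelligent algorithms and cloud computing. | Tai KUANG (1964-), male, associate professor and head of the Department of Information Engineering, Zhejiang College of Security Technology; his main research interests are big data and artificial intelligence. | Diqing JIANG (1965-), male, currently works at Zhejiang College of Security Technology; his main research interests include public opinion management and personnel management.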
  • 基金资助:
    浙江省2016年教育技术研究规划课题支持项目(JB139)

Blog screening and mining based on temporal features and hybrid search in big data

Lina ZHANG,Tai KUANG,Diqing JIANG   

  1. Department of Information Engineering,Zhejiang College of Security Technology,Wenzhou 325000,China
  • Revised:2016-09-14 Online:2017-01-01 Published:2017-06-04
  • Supported by:
    Educational Technology Research Program of Zhejiang Province in 2016 (JB139)

摘要:

针对现存很多博客筛选挖掘方法的相关性程度比较松散以及信息检索方法的缺陷,提出一种基于时态特征和混合式搜索的方法。考虑到用户评论是组合证据的重要来源以及时间因素的影响,提出的方法将博客文章的平均评论数量、消息来源的 BM25的相关性分数、最久博客文章的 BM25分数和最新相关博文和最旧博文的时间范围作为时态特征集。另外,考虑到线性搜索的局部性优势以及差分进化搜索的全局优势,将两种信息搜索方式组合。实验使用 BlogS06数据集,由博客主页、XML 源文件和其博客入口页面组成,用于TREC 2007和TREC 2008的博客筛选挖掘实验。实验结果表明,提出的方法在运行时间和有效性方面获得了满意的效果。

关键词: 博客筛选挖掘, 时态特征, 线性搜索, 差分进化, 大数据, BM25

Abstract:

Many existing blog screening and mining methods measure relevance only loosely and inherit the shortcomings of general information retrieval methods. To address this, a method based on temporal features and hybrid search is proposed. Because user comments are an important source of combined evidence, and because time also affects relevance, the method uses four quantities as its temporal feature set: the average number of comments per blog post, the BM25 relevance score of the source feed, the BM25 score of the oldest blog post, and the time span between the newest and the oldest relevant posts. In addition, to exploit the local search strength of linear search (LS) and the global search strength of differential evolution (DE), the two search strategies are combined. The experiments use the BlogS06 data set, which consists of blog home pages, XML feed files and blog entry pages, and which was used in the TREC 2007 and TREC 2008 blog screening and mining experiments. The experimental results show that the proposed method achieves satisfactory running time and effectiveness.
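
The abstract does not say how the two searches are combined, so the following is only a minimal sketch under stated assumptions: DE runs first as the global stage, and a coordinate-wise line search (standing in here for LS) refines the DE result locally. The names `differential_evolution`, `line_search`, `hybrid_search`, `toy_loss` and `TARGET` are all hypothetical. `toy_loss` is a stand-in objective over four feature weights; in a real system it would be replaced by a retrieval-effectiveness loss measured on training topics.

```python
import random

# Hypothetical "ideal" weights for the four temporal features (illustration only).
TARGET = [0.2, 0.4, 0.1, 0.3]

def toy_loss(weights):
    """Stand-in objective: squared distance to TARGET (lower is better)."""
    return sum((w - t) ** 2 for w, t in zip(weights, TARGET))

def differential_evolution(objective, dim, bounds=(0.0, 1.0), pop_size=20,
                           generations=50, f=0.5, cr=0.9, seed=0):
    """Classic DE/rand/1/bin minimiser; returns (best_vector, best_value)."""
    rng = random.Random(seed)
    lo, hi = bounds
    pop = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(pop_size)]
    scores = [objective(x) for x in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Three distinct population members, all different from i.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            forced = rng.randrange(dim)  # at least one coordinate is mutated
            trial = []
            for d in range(dim):
                if d == forced or rng.random() < cr:
                    v = pop[a][d] + f * (pop[b][d] - pop[c][d])
                    v = min(hi, max(lo, v))  # clamp to the search bounds
                else:
                    v = pop[i][d]
                trial.append(v)
            s = objective(trial)
            if s <= scores[i]:  # greedy selection
                pop[i], scores[i] = trial, s
    best = min(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]

def line_search(objective, x, bounds=(0.0, 1.0), step=0.1, shrink=0.5, tol=1e-4):
    """Coordinate-wise line search: try +/- step on each axis, shrink when stuck."""
    lo, hi = bounds
    x = list(x)
    best = objective(x)
    while step > tol:
        improved = False
        for d in range(len(x)):
            for delta in (step, -step):
                cand = list(x)
                cand[d] = min(hi, max(lo, x[d] + delta))
                s = objective(cand)
                if s < best:
                    x, best, improved = cand, s, True
        if not improved:
            step *= shrink
    return x, best

def hybrid_search(objective, dim, **de_kwargs):
    """Global stage (DE) followed by local refinement (line search)."""
    x, _ = differential_evolution(objective, dim, **de_kwargs)
    return line_search(objective, x)

if __name__ == "__main__":
    weights, loss = hybrid_search(toy_loss, dim=4)
    print([round(w, 3) for w in weights], loss)
```

Running DE first and refining afterwards is one common way to pair a global and a local optimiser; alternating the two stages or applying the line search inside each DE generation are equally plausible readings of the abstract.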

Key words: blog screening and mining, temporal feature, linear search, differential evolution, big data, BM25
