Telecommunications Science ›› 2011, Vol. 27 ›› Issue (11): 62-65. doi: 10.3969/j.issn.1000-0801.2011.11.018

• Cloud Computing Column •

Research on Term Weighting Based on MapReduce

Kai Wang1, Shuicai Shi1,2, Tao Wang1,2, Xueqiang Lv1,2

  1. Chinese Information Processing Research Center, Beijing Information Science and Technology University, Beijing 100101, China
  2. Beijing TRS Information Technology Co., Ltd., Beijing 100101, China
  • Online: 2011-11-15 Published: 2011-11-15
  • Supported by:
    National Natural Science Foundation of China; Beijing Natural Science Foundation; Science and Technology Development Program of the Beijing Municipal Education Commission

Abstract:

Term recognition is widely used in ontology construction, dictionary construction, and other fields, and term weighting is a key step in term recognition. This paper improves the TF-IDF formula by taking the length of a term as one of the weighting factors and by accounting for the domain relevance of a term within the document collection. The whole process is implemented in the MapReduce programming model, so that the weights of candidate domain terms are computed in a distributed manner on a Hadoop cloud platform. Experimental results show that the proposed method not only simplifies the steps of term weighting but also improves the execution efficiency of the algorithm.

Key words: term weight, TF-IDF, MapReduce, distributed
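
As a rough illustration of the idea summarized in the abstract, the following Python sketch simulates the map and reduce steps in a single process and computes a length-adjusted TF-IDF weight. It is not the authors' implementation: the abstract does not give the improved formula, so the `length_factor` (a logarithm of the number of words in the term), the function names, and the three-job layout are all assumptions made for illustration, and the domain-relevance factor is omitted because its form is not stated.

```python
import math
from collections import defaultdict


def map_phase(doc_id, terms):
    """Mapper: emit ((term, doc_id), 1) for every candidate term occurrence."""
    for term in terms:
        yield (term, doc_id), 1


def reduce_counts(pairs):
    """Reducer: sum the values emitted for each (term, doc_id) key."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return counts


def default_length_factor(term):
    """Hypothetical length factor: grows slowly with the number of words."""
    return math.log2(1 + len(term.split()))


def term_weights(docs, length_factor=default_length_factor):
    """Return a dict mapping (term, doc_id) to a length-adjusted TF-IDF weight.

    docs maps each doc_id to a list of candidate terms (words joined by spaces).
    """
    # Job 1: term frequency per (term, document).
    pairs = [kv for doc_id, terms in docs.items() for kv in map_phase(doc_id, terms)]
    tf = reduce_counts(pairs)

    # Job 2: document frequency per term, re-keying the job 1 output by term.
    df = defaultdict(int)
    for term, _doc_id in tf:
        df[term] += 1

    # Job 3: combine TF, IDF, and the assumed length factor.
    n_docs = len(docs)
    weights = {}
    for (term, doc_id), count in tf.items():
        idf = math.log(n_docs / df[term])
        weights[(term, doc_id)] = count * idf * length_factor(term)
    return weights
```

With a toy collection such as `{"d1": ["cloud computing", "cloud computing", "data"], "d2": ["data", "term weight"]}`, a term that appears in every document ("data") receives weight 0 because its IDF is `log(1)`, while a multi-word term concentrated in one document is boosted by both its IDF and the length factor. On a real Hadoop cluster each of the three jobs would run as a separate MapReduce pass over the distributed data.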
