Journal on Communications, 2016, Vol. 37, Issue 10: 9-17. doi: 10.11959/j.issn.1000-436x.2016190

• Academic paper

Information extraction from massive Web pages based on node property and text content

Hai-yan WANG (1, 2), Pan CAO (1)

  1. School of Computer Science and Technology, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
  2. Jiangsu High Technology Research Key Laboratory for Wireless Sensor Networks, Nanjing 210003, China
  • Online: 2016-10-25  Published: 2016-10-25
  • Supported by:
    The National Natural Science Foundation of China; The National Natural Science Foundation of China; Six Talent Peaks Project in Jiangsu Province; 333 High Level Personnel Training Project in Jiangsu Province

Abstract:

To extract valuable information from massive collections of Web pages in big-data scenarios, an information extraction method based on node properties and text content is proposed. Each Web page is converted into a document object model (DOM) tree, and a pruning and fusion algorithm is introduced to simplify the tree. A density property and a visual property are defined for each node in the DOM tree, and the page content is preprocessed according to these property values. The MapReduce framework is then employed to parallelize information extraction across massive Web pages. Simulation results show that the proposed method not only achieves better extraction performance but also offers better system scalability than the methods it is compared with.
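
The pipeline summarized above (DOM construction, pruning, density-based selection, per-page parallelism) can be illustrated with a small sketch. It is not the authors' algorithm: the abstract does not give the paper's definitions, so the density measure used here (characters of text per tag in a subtree), the list of pruned tags, and all function and variable names are assumptions made for illustration. The fusion step and the visual property are omitted, and the MapReduce job is stood in for by a plain function that maps over pages.

```python
from html.parser import HTMLParser

# Assumed noise tags to prune; the paper's actual pruning rules are not given here.
PRUNE_TAGS = {"script", "style", "noscript", "iframe", "nav", "footer"}
# Void elements never contain text, so they are not added to the tree.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link", "area", "base", "col"}


class Node:
    """A simplified DOM node; children are Node objects or text strings."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []

    def text_len(self):
        return sum(len(c) if isinstance(c, str) else c.text_len()
                   for c in self.children)

    def tag_count(self):
        return 1 + sum(c.tag_count() for c in self.children
                       if isinstance(c, Node))

    def density(self):
        # Assumed definition: characters of text per tag in the subtree.
        return self.text_len() / self.tag_count()

    def full_text(self):
        parts = [c if isinstance(c, str) else c.full_text()
                 for c in self.children]
        return " ".join(p for p in parts if p)


class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML using the standard-library parser."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # Close the nearest matching open element; ignore stray end tags.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        text = " ".join(data.split())
        if text:
            self.current.children.append(text)


def prune(node):
    """Drop noise subtrees and element nodes that end up empty."""
    kept = []
    for child in node.children:
        if isinstance(child, str):
            kept.append(child)
        elif child.tag not in PRUNE_TAGS:
            prune(child)
            if child.children:
                kept.append(child)
    node.children = kept


def iter_nodes(node):
    for child in node.children:
        if isinstance(child, Node):
            yield child
            yield from iter_nodes(child)


def extract_main_text(html):
    """Return the text of the densest node in the pruned tree."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    prune(builder.root)
    candidates = list(iter_nodes(builder.root))
    if not candidates:
        return ""
    return max(candidates, key=lambda n: n.density()).full_text()


def run_job(pages):
    """Stand-in for a MapReduce job: map each (url, html) pair to (url, text)."""
    return {url: extract_main_text(html) for url, html in pages.items()}


SAMPLE_PAGE = """<html><head><title>Sample</title><script>var x = 1;</script></head>
<body>
<nav><a href="/">Home</a><a href="/news">News</a></nav>
<div id="main"><p>MapReduce splits a large job into independent map tasks and merges their outputs in a reduce phase.</p></div>
<footer>Copyright</footer>
</body></html>"""
```

Because each page is processed independently, `extract_main_text` has the shape of a map function. In a real deployment it would presumably run inside the mapper of a MapReduce job, with the page URL as the key; `run_job` only imitates that structure on a single machine. Picking the single densest node favors leaf paragraphs, so a practical extractor would need a richer score than this one.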

Keywords: Web information, extraction, MapReduce, DOM tree
