Telecommunications Science ›› 2013, Vol. 29 ›› Issue (12): 1-8. doi: 10.3969/j.issn.1000-0801.2013.12.001

• Research and Development •

A Parallel ETL Tool Based on an Improved Chain-MapReduce Framework

Bin Wu, Xinguang Liu

  1. Telecommunication and Software Engineering Center, School of Computer Science, Beijing University of Posts and Telecommunications, Beijing 100876, China
  • Online: 2013-12-20 Published: 2017-07-04
  • Supported by:
    National Natural Science Foundation of China

Abstract:

This paper reviews related work on parallel ETL and common approaches to handling workflows that consist of multiple MapReduce jobs. It then presents an improved chain-MapReduce framework and applies it to a parallel ETL tool. Several flow-level optimization rules for ETL processing are also proposed; they make an ETL flow generate fewer MapReduce jobs, which reduces I/O and network transfer costs. The tool was compared with Hive in experiments on large-scale mobile Internet access data from one province. The results show that the tool reduces running time by 10% to 20% on average relative to Hive.

Key words: improved chain-MapReduce, ETL, optimization rule
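
The idea that chaining lets an ETL flow produce fewer MapReduce jobs can be illustrated with a small planning sketch. It is not the authors' framework or their optimization rules, which the abstract does not spell out. It assumes the standard Hadoop chain pattern (one or more mappers, one reducer, then zero or more mappers inside a single job) and a hypothetical `Op` type whose `needs_shuffle` flag marks operations such as group-by or join that require a reduce phase. The operation names in `EXAMPLE_FLOW` are likewise made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Op:
    """One ETL step (hypothetical type, for illustration only)."""
    name: str
    needs_shuffle: bool  # True for group-by/join/sort, False for record-wise steps


@dataclass
class Job:
    """One MapReduce job shaped as: mappers -> optional reducer -> mappers."""
    pre_maps: List[str] = field(default_factory=list)
    reduce: Optional[str] = None
    post_maps: List[str] = field(default_factory=list)


def plan_naive(ops: List[Op]) -> List[Job]:
    """Baseline: every ETL step becomes its own job."""
    jobs = []
    for op in ops:
        if op.needs_shuffle:
            jobs.append(Job(reduce=op.name))
        else:
            jobs.append(Job(pre_maps=[op.name]))
    return jobs


def plan_chained(ops: List[Op]) -> List[Job]:
    """Fuse record-wise steps into the job of the nearest shuffle step."""
    jobs: List[Job] = []
    current = Job()
    for op in ops:
        if op.needs_shuffle:
            # A job holds at most one reducer, so a second shuffle starts a new job.
            if current.reduce is not None:
                jobs.append(current)
                current = Job()
            current.reduce = op.name
        elif current.reduce is None:
            current.pre_maps.append(op.name)   # runs before the shuffle
        else:
            current.post_maps.append(op.name)  # runs after the reducer
    if current.pre_maps or current.reduce or current.post_maps:
        jobs.append(current)
    return jobs


# Hypothetical flow: clean, project, aggregate, derive a column, join, format.
EXAMPLE_FLOW = [
    Op("filter_invalid", False),
    Op("project_columns", False),
    Op("group_by_user", True),
    Op("derive_metric", False),
    Op("join_dimension", True),
    Op("format_output", False),
]

if __name__ == "__main__":
    print("naive jobs:  ", len(plan_naive(EXAMPLE_FLOW)))
    print("chained jobs:", len(plan_chained(EXAMPLE_FLOW)))
```

Each job boundary in such a plan implies writing intermediate results to the distributed file system and reading them back, so fewer jobs means less I/O and network transfer, which is the effect the abstract attributes to its optimization rules.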
