CC-MRSJ: Cache Conscious Star Join Algorithm on Hadoop Platform

doi:10.3969/j.issn.1000-0801.2013.10.007

Abstract

This paper presents CC-MRSJ, a cache-conscious MapReduce star join algorithm. Each column of the fact table is stored separately, and each dimension table is divided into several column families according to its dimension hierarchy. Each fact table foreign key column is co-located with its corresponding dimension table, which reduces data movement during the join. CC-MRSJ consists of two phases: first, each foreign key column is joined with its corresponding dimension table; then the intermediate results are joined and the measure columns are read by random access to produce the final result. Because CC-MRSJ reads only the data it needs, cache utilization is high, which makes the algorithm cache conscious; it also exploits late materialization to avoid unnecessary data access and movement. In experiments on SSB datasets, CC-MRSJ outperforms the Hive system.

Key words: star join, MapReduce, cache conscious, storage model

Guoliang Zhou, Yongli Zhu, Guilan Wang. CC-MRSJ: Cache Conscious Star Join Algorithm on Hadoop Platform [J]. Telecommunications Science, 2013, 29(10): 31-37.
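
The two phases described in the abstract can be illustrated with a minimal single-process sketch. It is not the authors' implementation and involves no Hadoop or MapReduce code: the fact table is modeled as a dictionary of Python lists (one list per column), each dimension table as a dictionary keyed by its primary key, and the names `join_fk_column`, `star_join`, and the SSB-style column names are assumptions made for illustration only.

```python
from typing import Callable, Dict, List, Tuple

DimRow = Dict[str, object]
DimTable = Dict[int, DimRow]


def join_fk_column(
    fk_column: List[int],
    dim_table: DimTable,
    predicate: Callable[[DimRow], bool],
) -> Dict[int, DimRow]:
    """Phase 1 (illustrative): join one foreign-key column with its dimension table.

    Only the foreign-key column is scanned; measure columns are not touched.
    Returns {row position: matching dimension row} for rows passing the predicate.
    """
    matches: Dict[int, DimRow] = {}
    for pos, key in enumerate(fk_column):
        row = dim_table.get(key)
        if row is not None and predicate(row):
            matches[pos] = row
    return matches


def star_join(
    fact_columns: Dict[str, List[int]],
    dims: List[Tuple[str, DimTable, Callable[[DimRow], bool]]],
    measure: str,
) -> List[Tuple[int, List[DimRow], int]]:
    """Phase 2 (illustrative): combine the per-dimension results by row position,
    then read the measure column only at the surviving positions
    (late materialization). Assumes at least one dimension is given.
    """
    partials = [
        join_fk_column(fact_columns[fk], table, pred) for fk, table, pred in dims
    ]
    positions = set(partials[0])
    for partial in partials[1:]:
        positions.intersection_update(partial)
    measure_column = fact_columns[measure]
    return [
        (pos, [partial[pos] for partial in partials], measure_column[pos])
        for pos in sorted(positions)
    ]


# Hypothetical toy data loosely modeled on SSB column names.
fact = {
    "lo_custkey": [1, 2, 1, 2],
    "lo_orderdate": [19970101, 19970101, 19980101, 19980101],
    "lo_revenue": [100, 200, 300, 400],
}
customer = {1: {"region": "ASIA"}, 2: {"region": "EUROPE"}}
date = {19970101: {"year": 1997}, 19980101: {"year": 1998}}

result = star_join(
    fact,
    [
        ("lo_custkey", customer, lambda r: r["region"] == "ASIA"),
        ("lo_orderdate", date, lambda r: r["year"] == 1997),
    ],
    "lo_revenue",
)
```

In this toy run only the first fact row satisfies both dimension predicates, so `lo_revenue` is read at a single position; the other measure values are never accessed, which is the effect the abstract attributes to late materialization.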



