Big Data Research

A survey of index structure in big data era

Zhaofeng YAN, Weihua ZHANG

2019, 5(4): 3-15. doi:10.11959/j.issn.2096-0271.2019028

With the advent of the big data era, the volume of stored data has exploded. To sustain system performance, concurrent index structures have become increasingly important. However, growing storage volumes and rising user performance requirements pose many challenges for index structures. Designing efficient, easy-to-use index structures has become an important issue in the systems field. Based on the current state of research, work on designing better-optimized and easier-to-use concurrency control strategies was discussed first, followed by research that exploits the features of new hardware architectures and possible future directions.

Graph processing engine based on graph query system

Xuehan KE, Rong CHEN

2019, 5(4): 16-26. doi:10.11959/j.issn.2096-0271.2019029

Graph query and graph processing have recently been the focus of research on graph-structured data. However, independent graph systems are a poor match for combined applications that need both query and processing. To avoid the problems caused by independent systems, such as wasted resources and data inconsistency, a graph processing engine built on a graph query system was proposed, so that query and processing operations are supported in a unified system. By adding indexes for graph processing and applying a pull-based graph propagation method to overcome the locality issue, the performance of computation and transmission was greatly improved. In addition, several optimizations were put forward for message updating and work balancing. The experimental results show that, by leveraging new designs and optimizations, the processing engine provides close (reduced by no more than 1x) or even better (up to 20x) performance compared with specialized graph processing systems (e.g., Gemini and PowerLyra), and also scales well.

Building storage systems in big data era: challenges, methods and trends

Youmin CHEN, Fei LI, Jiwu SHU

2019, 5(4): 27-40. doi:10.11959/j.issn.2096-0271.2019030

The rapid expansion of the Internet has led to explosive growth in global data. New applications, such as the Internet of Things and e-commerce, place higher demands on the latency and performance of data storage and processing. Thus, big data is not only “bigger” but also “faster”, and there is an urgent need to build large-scale, high-performance storage systems on new storage media. Two storage system design approaches were taken as starting points, namely flash storage and persistent memory storage. Their respective challenges were described in detail, and the existing solutions were summarized. Finally, future trends in the construction of data centers and storage systems were discussed.

A hybrid memory trace collection and analysis toolkit for big data applications

Zuojun LI, Haiyang PAN, Mingyu CHEN, Yungang BAO

2019, 5(4): 41-49. doi:10.11959/j.issn.2096-0271.2019031

The rise of in-memory computing frameworks represented by Spark, the deepening research on new non-volatile memory, and the increasingly severe data security situation have left existing memory behavior analysis tools unable to meet the demands of big data applications. A software-hardware hybrid memory trace collection and analysis toolkit for big data applications was proposed. Starting from the basic memory trace collected by hardware, memory behavior information with rich semantics can be obtained quickly, accurately, and without distortion by combining software information synchronization with offline annotation. The toolkit also provides a way to implement real-time security monitoring of large-scale data access. Finally, a group of real big data applications was analyzed with the toolkit.

Open-source chip, RISC-V and agile development

Huizhe WANG, Dan TANG, Zihao YU, Zhigang LIU, Biwei XIE, Yungang BAO

2019, 5(4): 50-66. doi:10.11959/j.issn.2096-0271.2019032

Due to the end of Moore’s Law, traditional chip development efforts focused on general-purpose performance cannot last long. However, high entry barriers and commercial constraints block further innovation and delay time to market. Therefore, open-source chips, a universal platform, and modern development methodologies are essential. The significance and development of open-source chips, the merits and impact of the RISC-V instruction set architecture, and agile development practices in logic design were introduced, and this new trend and its remaining shortcomings were commented on at the end.

Integrative analysis for big data in genomics

Xianghong HU, Heng PENG, Can YANG, Tsunghui CHANG, Xiang WAN, Zhiquan LUO

2019, 5(4): 67-88. doi:10.11959/j.issn.2096-0271.2019033

With the rapid development of biotechnology (e.g., genotyping chips and sequencing), researchers worldwide have accumulated massive data sets at different levels. Integrative analysis of multi-layered genomic data can greatly contribute to completing the causal chain from genetic variants to phenotypic variation, laying a scientific foundation for personalized and precision medicine. Integrative analysis was reviewed mainly from the following three aspects: identification of causal variants and their functional annotation, pleiotropy in human complex traits, and Mendelian randomization for causal inference between phenotypes. Several case studies were also provided. Finally, the importance of integrative analysis of genomic data for precision medicine was highlighted.

Research on the consensus of big data systems based on RDMA and NVM

Hao WU, Kang CHEN, Yongwei WU, Weimin ZHENG

2019, 5(4): 89-99. doi:10.11959/j.issn.2096-0271.2019034

Distributed storage systems and computing systems are the foundation of big data processing systems. High availability is the cornerstone of any distributed system, and high-availability technologies generally rely on consensus protocols. The classic non-Byzantine distributed consensus protocol was discussed, along with two newer technologies, the RDMA communication protocol and NVM storage media, which can be combined with it to build high-availability systems with higher performance. The consensus protocol was modified to make better use of the features of RDMA and NVM. The implemented system effectively improves the performance of the protocol while ensuring the consistency and availability of system data. Experiments show that the implemented system achieves a 40% performance improvement over existing systems.

Knowledge graph-based fraud detection for small and micro enterprise loans

Panshi JIN, Guangming WAN, Lizhong SHEN

2019, 5(4): 100-112. doi:10.11959/j.issn.2096-0271.2019035

While major commercial banks have been providing various business loans, the risk of loan fraud has risen at the same time. To address this challenge, an anti-fraud model based on the full-scale enterprise portrait and the enterprise relation graph was proposed. By integrating multi-source information to form a concrete enterprise profile and quantifying the interactions among enterprise entities, the fraud risk of new loan applications from SMEs could be quantitatively evaluated. Experiments show that, compared with considering enterprise attributes alone, the additional use of relational information yields a 5% increase in AUC on the test set, which helps banking institutions assess corporate fraud risk more accurately.

Research and practice on traffic big data application system of urban intelligent transportation in Guangzhou

Zi ZHANG, Qinyan HUANG, Chuan FENG

2019, 5(4): 113-120. doi:10.11959/j.issn.2096-0271.2019036

To build a modern traffic governance system and raise the level of people-centered transport services, there is an urgent need to establish a strongly driven, sustainable intelligent transportation system (ITS) based on traffic big data. First, the current state of traffic big data research and application was comprehensively analyzed. Then, the development demands and goals were analyzed. Finally, taking typical practice cases of traffic big data application in Guangzhou as examples, a framework for the traffic big data application system was built around “one center, three platforms”. All of this work supports the in-depth application and innovative development of traffic big data.

Current issue contents