Big Data Research

Progress on human-in-the-loop data preparation

Ju FAN, Yueguo CHEN, Xiaoyong DU

2019, 5(6): 3-18. doi:10.11959/j.issn.2096-0271.2019046

With the rapid development of data analytics, data preparation has become a major bottleneck. The two essential challenges of data preparation, cost and time, were analyzed. To address these challenges, research progress on human-in-the-loop data preparation was reviewed. Firstly, interactive data preparation was reviewed, which aims to reduce the time spent on data preparation by predictively interacting with end users. Then, crowdsourced data preparation was introduced, which utilizes the computational power of the human crowd to support fundamental data preparation tasks and develops algorithms for controlling result quality and reducing crowdsourcing cost. Finally, future research directions were summarized and discussed.

Data quality management of industrial temporal big data

Xiaoou DING, Hongzhi WANG, Shengjian YU

2019, 5(6): 19-29. doi:10.11959/j.issn.2096-0271.2019047

Industrial big data has become an important strategic resource for the transformation and upgrading of China’s manufacturing industry, and industrial big data analysis is attracting more and more attention. As an important form of industrial big data, time series suffer from many quality problems, which need to be detected and handled effectively by well-designed data cleaning methods. The characteristics of industrial time series big data and the difficulties of industrial data quality management were introduced. Then, recent developments in this area were analyzed and summarized. Finally, quality management methods for temporal big data and directions for improving system performance were put forward.

Data curation technologies and applications

Minghe YU, Tiezheng NIE, Guoliang LI

2019, 5(6): 30-46. doi:10.11959/j.issn.2096-0271.2019048

Data curation emerged so that data can be processed, stored and applied efficiently. It is the active and continuous management of data throughout its whole lifecycle. With data curation techniques, data can be used to the maximum extent, and the speed at which it becomes obsolete can be effectively slowed. The process and key techniques of data curation were described in terms of its goals, solutions and applications. For the crucial techniques, existing solutions were introduced and analyzed. In addition, applications of data curation in various domains were introduced and compared. Finally, development prospects and future challenges were discussed.

A data-space based platform for the integration and application of electronic health records

Xiaoyuan BAO, Kai ZHANG, Meng JIN, Shuanglian XIE, Kai SONG

2019, 5(6): 47-61. doi:10.11959/j.issn.2096-0271.2019049

To build an efficient, scalable and easy-to-manage data integration and application platform, a data space structure was adopted. Electronic medical records were integrated into an original data space, an anonymous data space, and a model data space according to data sensitivity, and the anonymous data were used for data mining and secondary analysis. Different storage, security protection and data access mechanisms were designed and implemented for each data space. The platform has been applied in the analysis of national health care performance and in the evaluation of the medical capabilities, quality, and efficiency of the affiliated hospitals of Peking University.

Container cloud resource prediction based on APMSSGA-LSTM

Xiaolan XIE, Zhengzheng ZHANG, Qiangqing ZHENG, Chaoquan CHEN

2019, 5(6): 62-72. doi:10.11959/j.issn.2096-0271.2019050

With the development and application of container clouds, the demand for high concurrency, high availability, and high flexibility of resources is becoming more and more intense. After investigating the current state of research on container cloud resource prediction, a container cloud resource prediction model was proposed that uses an adaptive probability multiselection strategy genetic algorithm (APMSSGA) to optimize a long short-term memory (LSTM) network. The experimental results show that, compared with the simple genetic algorithm (SGA), APMSSGA searches more efficiently for the optimal combination of LSTM parameters, and the APMSSGA-LSTM model has higher prediction accuracy.

Cluster computing mode for water environment simulation based on Hadoop

Jinfeng MA, Li TANG, Kaifeng RAO, Gang HONG, Mei MA

2019, 5(6): 73-84. doi:10.11959/j.issn.2096-0271.2019051

Water environment numerical models are effective tools for the simulation, analysis and prediction of pollutant transport and transformation processes in water. High-performance batch computation of water environment models has long been a hot topic. The distributed cluster computing mode based on big data technology is a promising approach to massive data management and batch computation, and it provides a viable solution for large-scale water environment simulations. The adaptability of water environment models under the framework of big data technology was explored, and a distributed cluster computing mode for water environment simulations was proposed. Moreover, the feasibility of adapting the Delft3D model for cluster computing in a Hadoop MapReduce environment was verified with real examples.

WEB: a fraud prediction method of Internet lending using network embedding

Cheng WANG, Pengfei SHU

2019, 5(6): 85-100. doi:10.11959/j.issn.2096-0271.2019052

Internet lending fraud prediction methods based on association graphs limit the efficiency and depth of feature mining, as well as the reusability and expressiveness of features. To solve this problem, network embedding technology was introduced, and the structural and semantic information in the network was expressed as vectors. A network update method based on periodic time windows and a decision batch method were proposed to improve the performance of network embedding with respect to the two business requirements of accuracy and real-time response. The experiments show that network embedding technology can automatically and effectively learn the implicit relationships and characteristics of the network. By combining the traditional method with the network embedding method, fraud prediction performance was significantly improved.

Research on the prediction of outpatient volume based on SARIMA-LSTM

Pengfei LU, Chengjie XU, Jingyi ZHANG, Lyu HAN, Jing LI

2019, 5(6): 101-110. doi:10.11959/j.issn.2096-0271.2019053

To achieve more robust and accurate outpatient volume prediction, a hybrid prediction model based on SARIMA-LSTM was constructed. The SARIMA model was used to build a single-index model of outpatient volume and to extract the cycle, trend and other information contained in the outpatient volume index. Then multiple related indexes, including the number of holidays, the number of legal working days, and the average maximum temperature, were used as input to a many-to-one LSTM model, in order to further learn the residual of the SARIMA model and extract the nonlinear relationship between the residual and the multiple variables. The empirical results show that the SARIMA-LSTM hybrid model has higher prediction accuracy than five mainstream prediction methods, so it has good practical application value.
