Current Issue

    15 November 2019, Volume 5 Issue 6
    TOPIC: BIG DATA WRANGLING
    Progress on human-in-the-loop data preparation
    Ju FAN, Yueguo CHEN, Xiaoyong DU
    2019, 5(6):  3-18.  doi:10.11959/j.issn.2096-0271.2019046

    With the rapid development of data analytics, data preparation has become a major bottleneck. The two essential challenges of data preparation, cost and time, were analyzed. To address these challenges, research progress on human-in-the-loop data preparation was reviewed. First, interactive data preparation was reviewed, which aimed to reduce the time spent on data preparation through predictive interaction with end users. Then, crowdsourced data preparation was introduced, which utilized human computation from the crowd to support fundamental data preparation tasks and developed algorithms for controlling result quality and reducing crowdsourcing cost. Finally, future research directions were summarized and discussed.

    Data quality management of industrial temporal big data
    Xiaoou DING, Hongzhi WANG, Shengjian YU
    2019, 5(6):  19-29.  doi:10.11959/j.issn.2096-0271.2019047

    Industrial big data has become an important strategic resource for the transformation and upgrading of China’s manufacturing industry, and industrial big data analysis is attracting more and more attention. As an important form of industrial big data, time series suffer from many quality problems, which need to be detected and handled effectively through well-designed data cleaning methods. The characteristics of industrial time series big data and the difficulties of industrial data quality management were introduced. Then, recent developments in this area were analyzed and summarized. Finally, quality management methods for temporal big data and directions for improving system performance were put forward.

    Data curation technologies and applications
    Minghe YU, Tiezheng NIE, Guoliang LI
    2019, 5(6):  30-46.  doi:10.11959/j.issn.2096-0271.2019048

    Data curation has emerged so that data can be processed, stored, and applied efficiently. It manages data actively and continuously throughout its whole lifecycle. With data curation techniques, data can be used to the maximum extent, and the rate at which it becomes obsolete can be effectively slowed. The process and key techniques of data curation were described in terms of its goals, solutions, and applications. For the crucial techniques, existing solutions were introduced and analyzed. In addition, applications of data curation in various domains were introduced and compared. Finally, development prospects and future challenges were discussed.

    A data-space based platform for the integration and application of electronic health records
    Xiaoyuan BAO, Kai ZHANG, Meng JIN, Shuanglian XIE, Kai SONG
    2019, 5(6):  47-61.  doi:10.11959/j.issn.2096-0271.2019049

    To build an efficient, scalable, and easy-to-manage data integration and application platform, a data space structure was adopted. Electronic medical records were integrated into an original data space, an anonymous data space, and a model data space according to data sensitivity, and the anonymous data were used for data mining and secondary analysis. Different storage, security protection, and data access mechanisms were designed and implemented for each data space. The platform has been applied to the analysis of national health care performance and to the evaluation of the medical capabilities, quality, and efficiency of the affiliated hospitals of Peking University.

    STUDY
    Container cloud resource prediction based on APMSSGA-LSTM
    Xiaolan XIE, Zhengzheng ZHANG, Qiangqing ZHENG, Chaoquan CHEN
    2019, 5(6):  62-72.  doi:10.11959/j.issn.2096-0271.2019050

    With the development and application of container cloud, the demand for high concurrency, high availability, and high flexibility of resources is becoming more and more intense. After investigating the current state of research on container cloud resource prediction, a container cloud resource prediction model was proposed that uses an adaptive probability multi-selection strategy genetic algorithm (APMSSGA) to optimize a long short-term memory (LSTM) network. The experimental results show that, compared with the simple genetic algorithm (SGA), APMSSGA searches for the optimal combination of LSTM parameters more efficiently, and the APMSSGA-LSTM model has higher prediction accuracy.

    Cluster computing mode for water environment simulation based on Hadoop
    Jinfeng MA, Li TANG, Kaifeng RAO, Gang HONG, Mei MA
    2019, 5(6):  73-84.  doi:10.11959/j.issn.2096-0271.2019051

    Water environment numerical models are effective tools for the simulation, analysis, and prediction of pollutant transport and transformation processes in water. High-performance batch computation of water environment models has long been a hot topic. The distributed cluster computing mode based on big data technology is a promising approach to massive data management and batch computation, and it provides a viable solution for large-scale water environment simulations. The adaptability of water environment models within the framework of big data technology was explored, and a distributed cluster computing mode for water environment simulations was proposed. Moreover, the feasibility of adapting the Delft3D model for cluster computing in a Hadoop MapReduce environment was verified with real examples.

    APPLICATION
    WEB: a fraud prediction method of Internet lending using network embedding
    Cheng WANG, Pengfei SHU
    2019, 5(6):  85-100.  doi:10.11959/j.issn.2096-0271.2019052

    Internet lending fraud prediction methods based on association graphs limit the efficiency and depth of feature mining, as well as the reusability and expressiveness of features. To solve this problem, network embedding technology was introduced to express the structural and semantic information of the network as vectors. A network update method based on a periodic time window and a decision batch method were proposed to improve the performance of network embedding with respect to the two business requirements of accuracy and real-time response. The experiments show that network embedding can automatically and effectively learn the implicit relationships and characteristics of the network. By combining the traditional method with the network embedding method, fraud prediction performance was significantly improved.

    Research on the prediction of outpatient volume based on SARIMA-LSTM
    Pengfei LU, Chengjie XU, Jingyi ZHANG, Lyu HAN, Jing LI
    2019, 5(6):  101-110.  doi:10.11959/j.issn.2096-0271.2019053

    To achieve more robust and accurate outpatient volume prediction, a hybrid prediction model based on SARIMA-LSTM was constructed. A SARIMA model was first used to build a single-index model of outpatient volume, extracting the cycle, trend, and other information contained in the outpatient volume index. Then multiple related indexes, including the number of holidays, the number of legal working days, and the average maximum temperature, were used as input to a many-to-one LSTM model in order to further learn the residual of the SARIMA model and extract the nonlinear relationship between the residual and these variables. The empirical results show that the SARIMA-LSTM hybrid model constructed in this paper has higher prediction accuracy than five mainstream prediction methods, so it has good practical application value.
