15 July 2024, Volume 10 Issue 4
TOPIC: BIG DATA AND CLOUD STORAGE
Big Data and Cloud Storage
2024, 10(4):  1-2.  doi:10.11959/j.issn.2096-0271.2024046-1
Research on key technologies for efficient storage and access of turbulent big data
Wendi CHENG, Xiao ZHANG, Zhaohui PAN, Youjun ZHAO, Chenguang SUN, Xueqiang SHAN, Yuzhan JIN, Xiaonan ZHAO
2024, 10(4):  3-20.  doi:10.11959/j.issn.2096-0271.2024046

With the development of measurement techniques and numerical simulation technologies, data-driven turbulence research has become a new approach in this field. In China, several wind tunnel laboratories and supercomputing centers have been established for turbulence simulations, producing a substantial collection of turbulence data. However, there is currently no centralized turbulence data management platform in China, which makes it difficult to exchange and share the expensive experimental and simulation data. Turbulence data is characterized by large volume, high dimensionality, high precision, and heterogeneity, which poses challenges for storage, access, and management efficiency. A turbulence big data distributed storage system called TDFS was designed, specifically targeting typical flow problems in aviation, aerospace, and marine applications. Considering the access characteristics of turbulence big data, novel metadata management methods and data access interfaces were designed in TDFS. Experimental results demonstrate that TDFS achieves interface response speed improvements of 54.38% and 57.7% compared with HDFS and GlusterFS, respectively. Additionally, to reduce the storage overhead of turbulence big data, a lazy replication compression mechanism based on HDF5 was designed, resulting in a 34% reduction in storage space compared with the original replication storage approach.
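The lazy replication compression idea can be sketched as follows: keep the primary replica uncompressed for fast access, and compress the backup replicas lazily in a later pass. This is a toy plain-Python illustration with zlib standing in for the HDF5-based mechanism; the class and method names are invented for this sketch and are not TDFS APIs.

```python
import zlib

class LazyReplicaStore:
    """Toy sketch of lazy replication compression: the primary replica stays
    uncompressed for fast reads, while the backup replicas are compressed
    lazily, trading some CPU for a large reduction in replica storage."""

    def __init__(self, n_replicas=3):
        self.n_replicas = n_replicas
        self.primary = {}   # key -> raw bytes, served to readers directly
        self.backups = {}   # key -> list of compressed backup replicas

    def put(self, key, data):
        self.primary[key] = data
        self.backups[key] = None          # replication is deferred ("lazy")

    def compact(self, key):
        # Materialize the backup replicas in compressed form.
        raw = self.primary[key]
        self.backups[key] = [zlib.compress(raw)
                             for _ in range(self.n_replicas - 1)]

    def storage_bytes(self, key):
        backups = self.backups[key] or []
        return len(self.primary[key]) + sum(len(b) for b in backups)

    def recover(self, key):
        # Any compressed backup can restore the primary after a loss.
        return zlib.decompress(self.backups[key][0])
```

With a 3-replica policy and compressible field data, total storage stays well below three full copies while recovery still yields the original bytes.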

System performance optimization practice for big data scenarios
Jibin WANG, Hailong YANG, Kai FENG, Xin SUN, Minda ZHANG, Kelun LEI, Zhiwen XIAO, Yifei ZHANG, Jiaxi WU
2024, 10(4):  21-33.  doi:10.11959/j.issn.2096-0271.2024049

In existing large-scale distributed environments, there is still much room for improvement in the performance and computational efficiency of big data applications. However, performance analysis and optimization in large-scale environments require substantial effort from domain experts. This paper proposes a general detection and optimization process for low-performance query statements in big data applications, summarizes four types of low-performance behaviors that significantly affect application performance, and proposes specific optimization strategies for each type. Finally, experimental evaluation verifies the effectiveness of the optimization scheme on an actual large-scale cluster.
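A detection process of this kind can be approximated by rule-based scanning of query text. The sketch below is illustrative only: the rule names and regular expressions are invented examples of common SQL anti-patterns, not the four behavior types the paper actually identifies.

```python
import re

# Hypothetical anti-pattern rules; the names and patterns are examples,
# not the paper's taxonomy of low-performance behaviors.
RULES = {
    "select_star": re.compile(r"select\s+\*", re.I),
    "no_filter": re.compile(r"^(?!.*\bwhere\b).*\bfrom\b", re.I | re.S),
    "leading_wildcard": re.compile(r"like\s+'%", re.I),
    "cartesian_join": re.compile(r"\bjoin\b(?!.*\bon\b)", re.I | re.S),
}

def detect_low_performance(sql: str):
    """Return the names of the anti-pattern rules the query matches."""
    return [name for name, pat in RULES.items() if pat.search(sql)]
```

Each matched rule would then map to a concrete rewrite strategy (projecting only needed columns, adding predicates, fixing the join condition, and so on).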

A polymorphic cooperative compression strategy for IoT time series data based on NVM
Tao CAI, Tianle LEI, Dejiao NIU, Jianfei DAI, Zeyu HUANG, Qiangqiang NI
2024, 10(4):  34-50.  doi:10.11959/j.issn.2096-0271.2024048

The compression strategy plays an important role in the performance of IoT time series data storage systems. However, current compression strategies cannot adapt to the characteristics of NVM and IoT time series data. This paper proposes a polymorphic cooperative compression strategy for IoT time series data based on NVM. First, the overall structure of IoT time series data is given. Then, to exploit the consistent patterns in IoT time series data and the different granularity between user-space and kernel-space operations on NVM, a dual-compression strategy is devised. A lightweight compression method is applied directly as IoT time series data is received in user space; it efficiently reduces the volume of data to be stored while minimizing the impact on the timeliness of data storage. A deep compression algorithm is then designed for kernel space, primarily for querying and analyzing anomalous time series data. Additionally, to address the competition for NVM bandwidth between deep compression and data storage, a dynamic adjustment algorithm that guarantees write bandwidth is proposed. Finally, a prototype of the polymorphic cooperative compression strategy is implemented and evaluated with YCSB-TS. The results show that the proposed method improves the write throughput of IoT time series data by up to 161.3% and reduces the storage space by up to 14.6%, compared with InfluxDB, OpenTSDB, KairosDB, and TVStore.
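The two-stage idea can be sketched with standard techniques: a cheap delta encoding on ingest (time series from regular sensors have small, repetitive deltas), followed by a heavier deflate pass later. This is a minimal stand-in for the paper's lightweight/deep pair, not its actual algorithms.

```python
import struct
import zlib

def delta_encode(values):
    """Lightweight ingest-path pass: keep the first value, then the deltas.
    Regular IoT series yield small, repetitive deltas that compress well."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    values = [deltas[0]]
    for d in deltas[1:]:
        values.append(values[-1] + d)
    return values

def deep_compress(deltas):
    """Deep pass (kernel-space in the paper): pack the deltas as int64
    and deflate the resulting byte string."""
    packed = struct.pack(f"{len(deltas)}q", *deltas)
    return zlib.compress(packed)

def deep_decompress(blob, n):
    return list(struct.unpack(f"{n}q", zlib.decompress(blob)))
```

The lightweight pass is a handful of integer subtractions per point, so it barely delays ingestion; the deflate pass does the expensive work off the write path.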

A dynamic bidirectional matching method of tasks and resources oriented to wide-area distributed computing
Jing SHANG, Limin XIAO, Zhiwen XIAO, Jinquan WANG, Zhihui WU, Huiyang LI, Yifei ZHANG, Yao SONG, Jibin WANG
2024, 10(4):  51-65.  doi:10.11959/j.issn.2096-0271.2024050

Owing to their huge computing and storage capacities, wide-area distributed computing environments have become important infrastructure supporting computing power and data interconnection. In such environments, matching tasks to resources is important for improving system performance. However, the diversity of tasks and resources and the geographical dispersion of resources increase the complexity of the matching problem. To address high response delay and low matching efficiency, a dynamic bidirectional matching method of tasks and resources oriented to wide-area distributed computing environments is proposed. The matching process is simplified and the response delay is mitigated by building a unified task requirement model and resource capability model. Moreover, task-oriented and resource-oriented matching degrees are defined to express the preferences of the task and resource perspectives, and a two-sided comprehensive matching degree is defined as a trade-off between the two. The comprehensive matching degrees are dynamically calculated for each task group and the resources to improve matching quality. Experimental results show that the proposed method effectively improves matching quality and significantly reduces response delay compared with existing methods.
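The two-sided degree can be illustrated with a small numeric sketch. The specific formulas below (requirement coverage for the task side, capacity utilization for the resource side, a weighted sum for the trade-off) are assumed for illustration and are not the paper's exact definitions.

```python
def task_degree(req, cap):
    # Task perspective: how well the resource covers each requirement,
    # capped at 1 since over-provisioning cannot over-satisfy a task.
    return min(min(cap[k] / req[k], 1.0) for k in req)

def resource_degree(req, cap):
    # Resource perspective: how fully the task would utilize the resource;
    # idle capacity is wasted, so fuller utilization scores higher.
    return sum(min(req[k] / cap[k], 1.0) for k in req) / len(req)

def comprehensive_degree(req, cap, alpha=0.5):
    # Trade off the two perspectives with a weight alpha (assumed form).
    return alpha * task_degree(req, cap) + (1 - alpha) * resource_degree(req, cap)
```

Under this sketch, a resource that exactly fits a task outranks a much larger one that also satisfies it, because the larger resource would be poorly utilized.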

Carbon emission prediction method of steel plants based on long short-term memory network
Fengyun LI, Zehui DOU, Peng LI, Wei GUO
2024, 10(4):  66-76.  doi:10.11959/j.issn.2096-0271.2024051

As the second largest carbon-emitting industry in China, iron and steel enterprises have great potential for carbon emission reduction. To facilitate the supervision and control of carbon emissions by relevant departments, carbon emission prediction research was carried out. Taking a steelmaking plant as the research object, the carbon dioxide emissions in the steelmaking process were first analyzed, and 10 energy substances that cause carbon emissions were identified. Basic energy data of the steelmaking plant from 2001 to 2023 were collected, and carbon emissions were calculated from these data according to the carbon emission accounting method. Second, a long short-term memory network was used to predict carbon emissions for the next 7 years; the training and test errors were close to 0.01, and the actual error was 1 323 307.46 tons of carbon dioxide. Then, the Mann-Kendall trend test was used to evaluate the overall carbon emission trend of the plant. Finally, reasonable suggestions were put forward for steelmaking plants to actively respond to the goal of low-carbon environmental protection.
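The accounting step referred to above is, in essence, activity data multiplied by per-carrier emission factors and summed. The factors below are made-up placeholders for illustration, not the plant's real coefficients or the 10 substances the paper identifies.

```python
# Illustrative carbon accounting: total CO2 is the sum over energy carriers
# of consumption times emission factor. Factor values are assumed examples.
EMISSION_FACTORS = {        # tonnes of CO2 per unit consumed (placeholder)
    "coal": 1.9,
    "coke": 2.86,
    "electricity": 0.58,    # per MWh
    "natural_gas": 2.16,
}

def carbon_emissions(consumption):
    """Activity data {carrier: units consumed} -> tonnes of CO2."""
    return sum(qty * EMISSION_FACTORS[k] for k, qty in consumption.items())
```

A yearly series produced this way is what the LSTM would then be trained on.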

STUDY
Multiple-feature fusion based generative adversarial network for image dehazing
Yazhong SI, Xulong ZHANG, Fan YANG, Jianzong WANG, Ning CHENG, Jing XIAO
2024, 10(4):  77-88.  doi:10.11959/j.issn.2096-0271.2024047

To enhance image clarity and address the difficulties in feature extraction and incomplete haze removal in traditional image dehazing, a multi-feature fusion based generative adversarial dehazing network is proposed. The network adopts a generative adversarial approach and consists of a generator and a discriminator. The generator uses an encoder-decoder structure and extracts haze-related feature maps from multiple receptive fields with a multi-feature extraction fusion (MFEF) block. The discriminator uses a series of convolutional calculations to analyze the feature differences between the generated and clear images, guiding the generator to output more realistic dehazed images. The experimental images show that the proposed method can effectively eliminate haze interference while preserving the original color tone of the image to the greatest extent possible. Quantitatively, the dehazed images produced by the algorithm improve peak signal-to-noise ratio and structural similarity by an average of 2.588 dB and 2.66%, respectively, compared with existing methods.
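The peak signal-to-noise ratio used in the evaluation is a standard metric and can be computed directly; here is a minimal definition over flat lists of pixel values (images are treated as 1-D for brevity).

```python
import math

def psnr(img_a, img_b, peak=255.0):
    """Peak signal-to-noise ratio between two same-sized images, given here
    as flat pixel lists; higher means closer to the reference image."""
    mse = sum((a - b) ** 2 for a, b in zip(img_a, img_b)) / len(img_a)
    if mse == 0:
        return float("inf")          # identical images
    return 10 * math.log10(peak ** 2 / mse)
```

A 2.588 dB average gain, as reported above, corresponds to a substantial reduction in mean squared error against the ground-truth clear images.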

A scalable parallel sorting algorithm by regular sampling for big data
Ying WANG, Zhiguang CHEN, Yutong LU
2024, 10(4):  89-105.  doi:10.11959/j.issn.2096-0271.2024021

Sorting is one of the basic algorithms in computer science and is used extensively in a variety of applications. In the big data era, as data volumes increase rapidly, parallel sorting has attracted much attention. Existing parallel sorting algorithms suffer from excessive communication overhead and load imbalance, making it difficult to scale massively. To solve these problems, a scalable parallel sorting algorithm by regular sampling (ScaPSRS) was proposed, in which all parallel processes, rather than a single designated process as in PSRS, sample the p-1 pivot elements that divide the entire data set into p disjoint subsets. Furthermore, ScaPSRS adopts a novel iterative pivot-update strategy to guarantee that workloads and data are evenly distributed among the parallel processes, ensuring superior overall performance. Experiments conducted on the Tianhe-Ⅱ supercomputer demonstrate that ScaPSRS scales to 32 000 cores and significantly outperforms state-of-the-art approaches.
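The regular-sampling core common to PSRS-family algorithms can be shown serially: each process draws p evenly spaced samples from its locally sorted block, the gathered samples are sorted, and p-1 pivots are taken from them to partition every block. This sketch simulates the parallel scheme in one process and uses one common pivot-index choice; it does not reproduce ScaPSRS's iterative pivot update.

```python
import bisect

def regular_sample_pivots(sorted_blocks, p):
    """Regular sampling: each of the p processes contributes p evenly spaced
    samples from its locally sorted block; p-1 pivots are then regular
    samples of the gathered, globally sorted sample set."""
    samples = []
    for block in sorted_blocks:
        n = len(block)
        samples += [block[i * n // p] for i in range(p)]
    samples.sort()
    return [samples[(i + 1) * p] for i in range(p - 1)]

def partition(sorted_block, pivots):
    """Split one locally sorted block into len(pivots)+1 buckets,
    one destined for each process."""
    cuts = [bisect.bisect_right(sorted_block, piv) for piv in pivots]
    bounds = [0] + cuts + [len(sorted_block)]
    return [sorted_block[bounds[i]:bounds[i + 1]]
            for i in range(len(bounds) - 1)]
```

After an all-to-all exchange, process i sorts bucket i locally; concatenating the buckets in order yields the globally sorted sequence.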

A dual channel semi-supervised network representation learning model
Hangyuan DU, Fuzhong XIE, Wenjian WANG, Liang BAI
2024, 10(4):  106-120.  doi:10.11959/j.issn.2096-0271.2024052

In semi-supervised network representation learning, node labels play an important role in guiding the establishment of network mapping relationships among different spaces. However, in many practical tasks, the available label information is limited or difficult to obtain, making it hard to provide sufficient and effective supervision for learning low-dimensional network representations. To solve this problem, a dual channel semi-supervised network representation learning model is proposed, composed of two information transmission channels, a self-supervised channel and a semi-supervised channel, within an autoencoder framework. The self-supervised information and the label information guide the establishment of the network representation mapping in the two channels respectively, complementing and reinforcing each other during learning. Considering possible information redundancy between the two channels, a redundancy recognition and elimination mechanism is designed from the perspective of mutual information. On this basis, an integrated optimization model is constructed to combine self-supervised and semi-supervised learning into a collaborative mechanism, enabling the learned representations to capture and preserve the structure and characteristics of the network. Experimental results on several real datasets show that the network representations learned by the proposed model outperform baseline methods in node classification, clustering, and visualization tasks.
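The mutual-information perspective on redundancy can be illustrated with a naive empirical estimator over discretized channel outputs: high MI between the two channels signals redundancy, near-zero MI signals complementary information. This plug-in estimator is a generic sketch, not the estimator the model actually optimizes.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete
    sequences of equal length, via the plug-in estimate
    I(X;Y) = sum p(x,y) * log(p(x,y) / (p(x) p(y)))."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )
```

A redundancy-elimination mechanism would then penalize or remove components whose MI with the other channel is high.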

APPLICATION
Construction of a paper-assistant reading system based on machine reading comprehension
Rongxin MI, Wenwen YAO, Hongkun RUAN
2024, 10(4):  121-129.  doi:10.11959/j.issn.2096-0271.2024039

In the era of informatization and digitization, the rapid increase in the number of scientific papers has given rise to various challenges, such as lengthy articles, difficulty in information extraction, and the high time cost of reading. Literature reading is increasingly tedious and time-consuming for researchers. Using language models, an assisted reading system for scientific papers has been designed to address these challenges. With machine reading comprehension technology at its core, the system parses scientific texts and offers a set of common questions to provide automated responses. By fully utilizing the pre-trained language model PERT, the system enhances its semantic understanding and information extraction capabilities, effectively resolving various challenges in reading scientific papers and helping readers improve the efficiency of scientific literature review.

STUDY
Elementarisation method for public data based on urban knowledge systems
Yu ZHENG, Xiuwen YI, Dekang QI, Zheyi PAN
2024, 10(4):  130-148.  doi:10.11959/j.issn.2096-0271.2024042

Data elements are a key momentum for boosting the digital economy. The data generated by public services provided by governments (a.k.a. public data) is ready to be transformed into data elements, because it has been well organized over the past decade. Unfortunately, public data is tightly coupled with the systems generating it, making it difficult for different applications to share data. The process of manual data governance is lagging, heavy, and inefficient, while relying on automatic extraction methods cannot ensure the accuracy of data elements. To tackle these challenges, leveraging the synergy between human and machine intelligence, we propose an elementarisation method for public data based on an urban knowledge system. Our method comprises an urban knowledge system, a set of digital controls, and some machine learning algorithms. The urban knowledge system consists of entities, relationships between entities, and the properties associated with these entities and relationships; it can be used to construct different kinds of public services and forms a standard data representation that can be shared among different applications. Powered by the urban knowledge system, the digital controls enable governments to flexibly create different applications as public services, in a configurable way without writing any code. Later, the information input by citizens through the digital controls in these applications is transformed into data elements automatically. Finally, the machine learning algorithms assist users in using the digital controls smoothly through intelligent recommendations. Our method can produce data elements automatically, efficiently, and accurately, unlocking the value of data for the digital economy.
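The entity/relationship/property shape described above can be sketched as a tiny triple store. The class, identifiers, and predicate names below are invented for illustration and are not the authors' schema.

```python
# Minimal sketch of a knowledge system holding entities, relationships
# between entities, and properties on both. All names are hypothetical.
class KnowledgeSystem:
    def __init__(self):
        self.entities = {}      # entity id -> properties dict
        self.relations = []     # (subject, predicate, object, properties)

    def add_entity(self, eid, **props):
        self.entities[eid] = props

    def relate(self, subj, pred, obj, **props):
        self.relations.append((subj, pred, obj, props))

    def neighbours(self, eid, pred=None):
        """Entities reachable from eid, optionally filtered by predicate."""
        return [o for s, p, o, _ in self.relations
                if s == eid and (pred is None or p == pred)]
```

Digital controls built over such a shared representation can read and write the same entities regardless of which application collected the data.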

Design and practice of spatial and temporal data center on digital government
Yun WANG, Zhishuang DU, Kang TIAN, Xiaobao SU, Pengfei CHANG, Difei MEI, Jinyu LI, Longjian JI, Yifeng GUO, Wuai ZHOU, Wanzhe ZHANG, Jianhua FENG
2024, 10(4):  149-160.  doi:10.11959/j.issn.2096-0271.2024054

Natural resources and geographic information big data are regarded as essential productive factors in the context of digital government, and an important component of the national integrated government big data system. Because data are dispersed and application programs are isolated across departments, it is difficult to share and apply data across departments and businesses, leading to a low utilization rate of data. In response to these issues, a digital government spatio-temporal data center was designed to meet the demands of natural resource and geographic information big data services. Its critical components, including storage and computation, data structure, and application support, are introduced in detail. By integrating the natural resources and geographic information database with the comprehensive population database and the comprehensive legal person database, an organic composition of people, enterprises, and geographic locations was established, which is illustrated by specific application practices.

FORUM
Spatial network and spillover effect of Chinese digital economy
Fenggao NIU, Ruoyu SHI
2024, 10(4):  161-171.  doi:10.11959/j.issn.2096-0271.2024045

With the vigorous development of digital technologies, the digital economy has become a brand-new economic model, providing a strong driving force for improving the matching of supply and demand, enhancing the allocation of resources, and promoting economic transformation and upgrading. To comprehensively analyze the overall development of the digital economy and its spatial relationships, a digital economy evaluation index system was first established for 31 provinces (municipalities and autonomous regions) in China, and gravity values were calculated with a revised gravity model to build the spatial network. Second, the spatial dependence of regions was explored through the global Moran's I index. Finally, a space-time fixed-effects Durbin model was established to analyze the influence of explanatory variables on the development of the digital economy and their spatial spillover. The results are as follows: China's digital economy spatial network is not compact enough; differences among regions are obvious and neighboring regions depend on each other. The improvement of the urbanization level not only promotes the development of the digital economy in the province involved but also indirectly drives the improvement of neighboring provinces, a strong spatial spillover effect, while the human capital level has a restraining effect.
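The global Moran's I statistic used for the spatial-dependence step is standard and compact enough to show directly; positive values indicate that similar regions cluster in space, negative values that dissimilar regions neighbor each other.

```python
def morans_i(values, weights):
    """Global Moran's I: values[i] is the indicator for region i,
    weights[i][j] the spatial weight between regions i and j.
    I = (n / W) * sum_ij w_ij (x_i - mean)(x_j - mean) / sum_i (x_i - mean)^2"""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    num = sum(weights[i][j] * dev[i] * dev[j]
              for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    w_sum = sum(sum(row) for row in weights)
    return (n / w_sum) * (num / den)
```

On a simple chain of four regions with binary adjacency weights, two similar neighbors clustered at each end give a positive I, while perfectly alternating values give I = -1.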

COLUMN: INFORMATION TECHNOLOGY APPLICATION INNOVATION: SYSTEMS FOR BIG DATA
LSTM training system based on heterogeneous hardware
Weixin HUANG, Weifang HU, Xuejiao CAO, Xuanhua SHI
2024, 10(4):  172-188.  doi:10.11959/j.issn.2096-0271.2024053

In the era of big data, deep neural network models represented by LSTM can process massive data and perform excellently in language processing, speech recognition, and time series data prediction. However, as model complexity increases, the training cost rises significantly. Existing LSTM training systems use acceleration methods such as operator fusion and multi-stream execution, but neglect the parallelism within a single training operator, which leads to low utilization of computing resources and long training time. Therefore, this paper designs a training acceleration system called TurboLSTM based on a fine-grained model partitioning method and a multi-stream parallel scheduling strategy. A new underlying training operator built on NVIDIA GPUs and domestic Ascend NPUs realizes reasonable utilization of computing resources for tasks. Compared with existing training systems, TurboLSTM on NVIDIA GPUs achieves about a 23% speed improvement for a single operator and about a 17% improvement in the overall training time of a model, while TurboLSTM on Ascend NPUs achieves about a 15% speed improvement for a single operator, with a significant increase in computing resource utilization. This shows that the acceleration method is efficient and generalizes well.
