
Current Issue

    15 January 2024, Volume 10 Issue 1
    STRATEGY RESEARCH
    Four issues to consider in building a computer system supporting large model training
    Weimin ZHENG
    2024, 10(1):  1-8.  doi:10.11959/j.issn.2096-0271.2024016

    There are three types of computer systems that support large model training, and the ecosystem around systems built on domestic AI chips is still immature. To change this situation, it is necessary to develop ten key pieces of software, such as AI compilers and parallel acceleration libraries. Moreover, systems based on supercomputers require careful software-hardware co-design to better serve large model training. This article proposes a four-point balanced design for building large model infrastructure that ensures system performance, reliability, and scalability.

    Big data and computing models
    Guojie LI
    2024, 10(1):  9-16.  doi:10.11959/j.issn.2096-0271.2024017

    At present, artificial intelligence continues to heat up. Large language models have attracted wide attention and set off a wave of enthusiasm around the world. The success of artificial intelligence is not essentially a "miracle" of large computing power, but a change in computing models. This paper first affirms the fundamental role of data in AI and points out that synthetic data will be the main source of data in the future. It then reviews the development of computing models and highlights the historic competition between neural network models and Turing models. It points out that an important hallmark of large language models is the emergence of intelligence in machines, emphasizes that the essence of large language models is "compression", and analyzes the causes of their "hallucinations". Finally, it calls on the scientific community to attach importance to large scientific models in "AI for research (AI4R)".

    STUDY
    Approximate nearest neighbor hybrid query algorithm based on tolerance factor
    Guangfu HE, Yuanhai XUE, Cuiting CHEN, Xiaoming YU, Xinran LIU, Xueqi CHENG
    2024, 10(1):  17-34.  doi:10.11959/j.issn.2096-0271.2024010

    Approximate nearest neighbor search (ANNS) is an important technique for efficient similarity search in computer science, enabling fast information retrieval over large-scale datasets. With the increasing demand for high-precision retrieval, hybrid queries that use both structured and unstructured information are increasingly common and have broad application prospects. However, filtered greedy algorithms based on nearest neighbor graphs may lose graph connectivity under the structural constraints of hybrid queries, ultimately harming search accuracy. This article proposes a filtered greedy algorithm based on a tolerance factor, which uses the tolerance factor to control how vertices that do not meet the structural constraints participate in routing. The method maintains the connectivity of the original nearest neighbor graph without changing the index structure and overcomes the negative impact of structural constraints on retrieval accuracy. The experimental results demonstrate that the new method achieves high-precision ANNS under different levels of structural constraints while maintaining retrieval efficiency. This study addresses the problem of graph-based ANNS in hybrid query scenarios and provides an effective solution for fast hybrid query retrieval in large-scale datasets.
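
    To make the tolerance-factor idea concrete, the sketch below shows one way a filtered greedy search over a nearest neighbor graph could penalize, rather than discard, vertices that violate the structured constraint. It is a minimal illustration assuming an in-memory adjacency-list graph; the parameter names, penalty rule, and stopping condition are ours, not the authors' implementation.

```python
import heapq
import numpy as np

def filtered_greedy_search(graph, vectors, labels, query, query_label,
                           entry, k=10, max_visits=500, tolerance=0.3):
    """Greedy search over a nearest neighbor graph with a tolerance factor.

    Vertices that fail the structured filter still participate in routing
    (with a distance penalty controlled by `tolerance`) so the graph stays
    connected, but only filter-satisfying vertices are returned.
    Illustrative sketch; not the paper's actual index or routing rule.
    """
    def dist(i):
        return float(np.linalg.norm(vectors[i] - query))

    visited = {entry}
    frontier = [(dist(entry), entry)]   # min-heap ordered by routing distance
    results = []                        # (true distance, vertex) of valid hits

    while frontier and len(visited) < max_visits:
        _, v = heapq.heappop(frontier)
        if labels[v] == query_label:
            results.append((dist(v), v))
        for u in graph[v]:
            if u in visited:
                continue
            visited.add(u)
            d = dist(u)
            # penalize, rather than drop, filter-violating vertices
            routing_d = d if labels[u] == query_label else d * (1.0 + tolerance)
            heapq.heappush(frontier, (routing_d, u))

    results.sort()
    return [v for _, v in results[:k]]
```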

    Edge-intelligence-based immersive metaverse: key technologies and prospects
    Zhi WANG, Shutao XIA, Rui MAO
    2024, 10(1):  35-45.  doi:10.11959/j.issn.2096-0271.2024001

    In recent years, applications such as 360° video, augmented reality, and virtual reality have developed rapidly and gradually formed a new mode of immersive metaverse experience. These new services and applications share common features such as high fidelity and immersive interaction. As a new paradigm, edge computing is increasingly ready to support these features. This paper first elucidates that the key to releasing the power of edge computing for immersive metaverse experiences is the processing of the AI models and data used for content generation. It then presents a generic framework for adaptive deep learning model deployment and data flow to support metaverse services and applications.

    Survey on entity extraction for low-resource scenarios
    Daozhu XU, Kailin ZHAO, Dong KANG, Chao MA, Yuming FENG, Zixuan LI, Burong YI, Xiaolong JIN
    2024, 10(1):  46-61.  doi:10.11959/j.issn.2096-0271.2023079

    Entity extraction is an essential task in information extraction. In recent years, under the trend of training models with big data, deep learning has achieved great success in entity extraction. However, in fields such as the natural environment, there are very few entity samples or labeled samples for types such as terrain and disasters, and labeling the unlabeled samples is time-consuming and laborious. Entity extraction for low-resource scenarios, also called low-resource or few-shot entity extraction, has therefore attracted more and more attention. This paper systematically reviews current approaches to low-resource entity extraction. It introduces the research status of three types of methods, based on meta-learning, multi-task learning, and prompt learning respectively. It then summarizes the low-resource entity extraction datasets and the experimental results of representative models on these datasets, and analyzes the current approaches. Finally, the paper summarizes the challenges of low-resource entity extraction and discusses future research directions in this field.
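
    As a concrete illustration of the meta-learning family surveyed here, the sketch below labels query tokens by their nearest class prototype, where each prototype is the mean of the support-set token embeddings for that entity type. The function and its inputs are illustrative placeholders, not a model from the surveyed literature.

```python
import numpy as np

def nearest_prototype_tagger(support_embs, support_tags, query_embs):
    """Label query tokens by their nearest class prototype.

    support_embs : (n_support, d) token embeddings with known tags
    support_tags : list of n_support tag strings, e.g. "LOC", "DIS", "O"
    query_embs   : (n_query, d) token embeddings to label

    A prototype is the mean embedding of a tag's support tokens; each query
    token takes the tag of the closest prototype (Euclidean distance).
    Toy sketch of the meta-learning approach; a real few-shot tagger would
    learn the embedding space episodically.
    """
    support_embs = np.asarray(support_embs)
    query_embs = np.asarray(query_embs)
    tags = sorted(set(support_tags))
    prototypes = np.stack([
        support_embs[[i for i, t in enumerate(support_tags) if t == tag]].mean(axis=0)
        for tag in tags
    ])
    dists = np.linalg.norm(query_embs[:, None, :] - prototypes[None, :, :], axis=-1)
    return [tags[i] for i in dists.argmin(axis=1)]
```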

    A survey on the fairness of federated learning
    Zhitao ZHU, Shijing SI, Jianzong WANG, Ning CHENG, Lingwei KONG, Zhangcheng HUANG, Jing XIAO
    2024, 10(1):  62-85.  doi:10.11959/j.issn.2096-0271.2022088

    Federated learning uses data from multiple participants to collaboratively train global models and has played an increasingly important role in recent years in facilitating inter-firm data collaboration. On the other hand, the federated learning training paradigm often faces the dilemma of insufficient data, so it is important to provide assurances of fairness to motivate more participants to contribute their valuable resources. This paper examines the issue of fairness in federated learning. Firstly, fairness is classified into three categories according to different equity goals: model performance balance, contribution assessment equity, and elimination of group discrimination. We then provide an in-depth introduction to and comparison of existing fairness promotion methods, aiming to help researchers develop new ones. Finally, by dissecting the needs that arise when federated learning is deployed, five directions for future research on federated learning fairness are proposed.
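
    As a simple illustration of the "model performance balance" fairness goal mentioned above, the sketch below reweights client updates by their local loss during aggregation, in the spirit of methods such as q-FFL; it is not one of the specific methods compared in the survey.

```python
import numpy as np

def fair_aggregate(client_params, client_losses, q=1.0):
    """Aggregate flattened client parameter vectors with loss-dependent weights.

    Clients with higher local loss receive proportionally more weight (q > 0),
    nudging the global model toward more uniform performance across
    participants, i.e. the "performance balance" notion of fairness.
    Simplified illustration in the spirit of q-FFL; not a surveyed method.
    """
    losses = np.asarray(client_losses, dtype=float)
    weights = losses ** q
    weights /= weights.sum()
    stacked = np.stack([np.asarray(p, dtype=float) for p in client_params])
    return (weights[:, None] * stacked).sum(axis=0)

# toy usage: three clients, two-parameter model
print(fair_aggregate(
    client_params=[[0.9, 1.1], [1.0, 1.0], [1.3, 0.7]],
    client_losses=[0.2, 0.5, 1.5],
    q=1.0))
```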

    APPLICATION
    Exploration and practice of XAI architecture
    Zhengxun XIA, Jianfei TANG, Yifan YANG, Shengmei LUO, Yan ZHANG, Fenglei TAN, Shengru TAN
    2024, 10(1):  86-109.  doi:10.11959/j.issn.2096-0271.2024013

    XAI (explainable AI) is an important component of trusted AI. The industry has carried out in-depth research on individual XAI techniques, but systematic research on engineering implementation is lacking. This paper proposes a general XAI technical architecture that starts from the following four aspects: atomic interpretation generation, core capability enhancement, business component embedding, and trusted interpretation application. We design four layers: an XAI foundation layer, an XAI core capability layer, an XAI business component layer, and an XAI application layer. Through the division of labor and cooperation among these layers, the implementation of XAI engineering is guaranteed throughout the whole process. Based on the XAI architecture presented in this paper, new technical modules can be introduced flexibly to support the industrialization of XAI, providing a reference for promoting XAI in industry.

    Industrial digital transformation: research on fault diagnosis methods
    Biao YANG, Yun XIONG, Ling FU, Weifeng XU, Jing LI
    2024, 10(1):  110-126.  doi:10.11959/j.issn.2096-0271.2023041

    Industrial digitalization is an important path for the transformation and upgrading of China's industry, and digital transformation has become an important trend in its development. The reliability and stability of industrial systems play an important role in the high-quality and sustainable development of industrial production. Failures affect the operation of industrial systems and can even cause major safety accidents and economic losses. To deal with this problem, fault diagnosis technology has emerged and gradually developed, and efficient, high-quality digital fault diagnosis has become a key technology for industrial digital transformation. This paper analyzes the research progress of digital fault diagnosis methods in industry. According to their development characteristics, the methods are divided into three stages: modeling methods led by domain experience, data-driven digital methods combined with domain experience, and data-driven digital methods combined with interpretability. Focusing on the basic ideas and characteristics of the methods in each stage, future research directions are discussed, providing references for promoting industrial digital transformation.

    STUDY
    Building bidirectional digital channels between governments and citizens
    Yu ZHENG
    2024, 10(1):  127-140.  doi:10.11959/j.issn.2096-0271.2024012

    Primary-level governance is the foundation of the national governance system, protecting the security of citizens and ensuring the stability of the country. However, given the increasingly complex setting of a city, communication between citizens and government is no longer efficient enough, creating a heavy workload for grassroots officials and also causing trouble for citizens. To address this issue, we propose a digital bidirectional channel between the government and the citizens of a city, which consists of several configurable components. The channel enables government officials to deploy online applications for dealing with different issues flexibly and efficiently through a very simple setup process. It also creates standard data resources that are connected to each other and can be shared among different applications as a production factor, so that citizens do not need to fill in the same information repeatedly in different online forms.

    Research on data legal system in China
    Yi XIE, Bo HE
    2024, 10(1):  141-156.  doi:10.11959/j.issn.2096-0271.2024003

    In promoting the development of China's digital economy, the relevant departments have given full play to the underpinning role of the rule of law, actively promoted the formulation of laws and regulations in the field of data, and put in place a foundational legal system for data that combines vertical and horizontal dimensions. Firstly, this paper gives an overview of China's data legal system, covering both its horizontal and vertical dimensions. Secondly, it analyzes the vertical system, including data legislation at the central and local levels and covering laws, administrative regulations, departmental rules, local regulations, and local administrative rules. Thirdly, it analyzes the horizontal system, including the legal regimes for data security and development, personal information protection, commercial data circulation, and government data management. Finally, the paper summarizes the achievements of China's data legislation and the deficiencies of the current data law system, and puts forward suggestions for improvement.

    INFORMATION TECHNOLOGY APPLICATION INNOVATION: SYSTEMS FOR BIG DATA
    Gansu smart tourism system based on big data
    Liang GUO, Yi YANG, Bingfeng QIN, Jianwen CAO, Min LI, Wei YUAN, Caihong LI, Juntao WANG
    2024, 10(1):  157-169.  doi:10.11959/j.issn.2096-0271.2024020

    With the continuous evolution of tourists' travel preferences, traditional tourism management and service models are no longer able to satisfy modern tourists' desire for personalized, high-quality travel experiences. To address this issue, the Gansu Province Smart Tourism System has been developed. The relevant research is reviewed first, followed by a detailed description of the system's composition and implementation, including the construction of the Gansu Smart Tourism Big Data Center and the design of the "Explore Gansu with One Mobile App" comprehensive service platform. Using a layered architecture and logical framework, the system achieves correlated mapping of tourism data and visitor behavior, along with fusion computation over diverse data. Finally, taking the highway self-driving traffic prediction model, the image selection model based on tourism destination impressions, and the analysis model of factors influencing sentiment in tourism reviews as examples, the intelligent services that the comprehensive service platform provides to government, industry, and tourists are elaborated. The application results demonstrate that the system effectively enhances the quality of tourism services and visitor satisfaction in Gansu, further propelling the rapid development of smart tourism in the region.

    BIG DATA DOMAIN APPLICATION
    Prediction of daily energy consumption for ship special coating maintenance based on random forest regression
    Ruiping GAN, Xinmin REN, Jun JIANG, Peng LI, Xiaobing ZHOU
    2024, 10(1):  170-184.  doi:10.11959/j.issn.2096-0271.2024018

    Predicting energy consumption is an important task in the intelligent energy efficiency optimization of ship maintenance, with special coating being a core aspect. In this work, a random forest regression (RFR) model was employed to analyze the daily energy consumption of special coating maintenance for ships. The dataset was preprocessed by removing outliers and by randomizing and standardizing the data. The RFR model was then trained and fitted on historical daily energy consumption data from ship maintenance and optimized using grid search with cross-validation, and the optimized model was used to analyze the daily energy consumption data of ship special coating maintenance. Comparative experiments were conducted against other models. The results reveal that the optimized RFR model outperformed several other models, achieving an R-squared value of 93.25% and a significantly lower mean squared error (MSE).
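
    The sketch below outlines the kind of pipeline the abstract describes using scikit-learn: outlier removal, shuffling and standardization, random forest regression, grid search with cross-validation, and evaluation by R-squared and MSE. The file name, column names, and hyperparameter grid are placeholders, not the authors' setting.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler

# hypothetical data file and target column; the actual features are not listed in the abstract
df = pd.read_csv("coating_energy.csv")
X = df.drop(columns=["daily_energy_kwh"]).values
y = df["daily_energy_kwh"].values

# crude outlier removal: drop rows whose target lies beyond 3 standard deviations
mask = np.abs(y - y.mean()) <= 3 * y.std()
X, y = X[mask], y[mask]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42)

scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# grid search with cross-validation over a small, illustrative hyperparameter grid
grid = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 10, 20]},
    cv=5, scoring="r2")
grid.fit(X_train, y_train)

pred = grid.best_estimator_.predict(X_test)
print("R2:", r2_score(y_test, pred), "MSE:", mean_squared_error(y_test, pred))
```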

    Data expansion method for special materials genome engineering with small-sample data
    Tao YANG, Zhaobo ZHANG, Tianyi ZHENG, Bao PENG
    2024, 10(1):  185-194.  doi:10.11959/j.issn.2096-0271.2024019

    With the increasing diversity and complexity of the material requirements for underground water conservancy and water pipeline networks, efficiently and conveniently designing special materials that meet individual needs through machine learning has become a topic of wide concern. Traditional supervised learning methods rely on large datasets to train models, but obtaining large datasets for the special materials required in deeply buried underground water pipeline networks and high-end military equipment, such as rare and high-entropy alloys, requires extremely high cost and a long time. To solve this problem, we propose a small-sample data expansion model, RX-SMOGN, which uses the XGBoost and RFECV algorithms for feature screening and enriches the dataset with the SMOGN algorithm. In this paper, the phase structure of high-entropy alloys is used as the research object, and traditional machine learning models are trained to predict it in order to verify the effectiveness of the RX-SMOGN model. The results of 5-fold cross-validation on four evaluation indicators show that the RX-SMOGN model substantially improves the performance of the machine learning models, providing a more convenient and efficient approach to alloy material design.
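
    A rough sketch of the two-stage pipeline described above: feature screening with XGBoost importance and RFECV, followed by data expansion with SMOGN. It assumes the third-party xgboost and smogn Python packages; the file name, target column, and parameters are placeholders, not the authors' RX-SMOGN code.

```python
import pandas as pd
from sklearn.feature_selection import RFECV
from sklearn.model_selection import KFold
from xgboost import XGBRegressor
import smogn  # third-party SMOGN implementation (assumption, not the authors' code)

# placeholder small-sample dataset with a continuous target column
df = pd.read_csv("alloy_samples.csv")
X, y = df.drop(columns=["target_property"]), df["target_property"]

# 1) feature screening: recursive feature elimination with cross-validation,
#    ranked by XGBoost feature importance
selector = RFECV(XGBRegressor(n_estimators=200), cv=KFold(n_splits=5), scoring="r2")
selector.fit(X, y)
kept = X.columns[selector.support_]

# 2) data expansion: SMOGN oversamples rare target regions with Gaussian noise
small = pd.concat([X[kept], y], axis=1).reset_index(drop=True)
expanded = smogn.smoter(data=small, y="target_property")

print(f"{len(small)} original rows -> {len(expanded)} rows after SMOGN")
```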
