
Current Issue

    15 July 2023, Volume 9 Issue 4
    TOPIC: CROSS-DOMAIN DATA MANAGEMENT
    Cross-Domain Data Management
    2023, 9(4):  1-2.  doi:10.11959/j.issn.2096-0271.2023040-1
    Distributed consensus algorithms for cross-domain data management: state of the art, challenges and perspectives
    Weiming LI, Tong LI, Dafang ZHANG, Longchao DAI, Yunpeng CHAI
    2023, 9(4):  3-15.  doi:10.11959/j.issn.2096-0271.2023040

    With the exponential growth of data and companies' cross-domain disaster recovery requirements, enterprises increasingly need to manage data across spatial domains. Cross-domain data management relies on distributed consensus algorithms to keep data consistent. However, existing distributed consensus algorithms consider only the single-data-center setting and ignore the uncertainty of network communication between data centers, so they suffer from long log synchronization delays and low system throughput in cross-domain scenarios. The current status of and new challenges for distributed consensus algorithms in cross-domain settings were systematically surveyed, and technical routes for addressing these challenges were discussed.
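    For intuition on why cross-domain deployments strain consensus protocols, consider majority-quorum replication, the mechanism in Raft/Paxos-family algorithms: a leader commits a log entry once a majority of replicas acknowledge it, so commit latency is set by the median acknowledgement. A minimal sketch, with illustrative latency numbers:

```python
def commit_latency(ack_latencies_ms):
    """Time until a majority-quorum leader can commit a log entry:
    the (n//2 + 1)-th fastest acknowledgement decides."""
    acks = sorted(ack_latencies_ms)
    majority = len(acks) // 2 + 1
    return acks[majority - 1]

# A single remote replica is tolerated: the local majority decides.
same_dc = commit_latency([1, 2, 150])           # commit at 2 ms
# When the quorum must span domains, a slow inter-DC link dominates.
cross_dc = commit_latency([1, 2, 80, 90, 150])  # commit at 80 ms
```

This is why log synchronization delay grows once no single data center holds a majority of replicas.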

    Harp: optimization algorithm for cross-domain distributed transactions
    Qiyu ZHUANG, Tong LI, Wei LU, Xiaoyong DU
    2023, 9(4):  16-31.  doi:10.11959/j.issn.2096-0271.2023043

    The paradigm of near-data computing has driven banks and securities firms to build multiple data centers nationally or globally. In the traditional business model, transactions mostly accessed data within a single data center. As business models change, distributed transactions across data centers have become common, such as transfers between bank accounts or equipment exchanges between game accounts whose data is stored in data centers in different regions. Distributed transaction processing relies on the two-phase commit protocol to ensure the atomicity of the sub-transactions submitted by each participating node. Because network latency between nodes is longer and more variable in cross-domain settings, traditional transaction processing techniques must be extended so that the system can sustain high throughput. After analyzing the problems and the optimization space of cross-domain distributed transactions, this paper proposes a new distributed transaction processing algorithm called Harp. Harp delays the execution of some sub-transactions according to differences in network latency while guaranteeing the serializable isolation level, reducing the duration of transaction lock contention and improving system concurrency and throughput. Experiments show that Harp improves performance by 1.39 times over the traditional algorithm under the YCSB workload.
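    The two-phase commit protocol the abstract refers to can be sketched in a few lines; the class and function names here are illustrative, not Harp's actual interface:

```python
class Participant:
    """One node holding a sub-transaction of a distributed transaction."""
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):
        # Phase 1: vote yes only if the sub-transaction can be made durable.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def finish(self, commit):
        # Phase 2: apply the coordinator's global decision.
        self.state = "committed" if commit else "aborted"

def two_phase_commit(participants):
    """Coordinator: commit globally only if every participant votes yes."""
    votes = [p.prepare() for p in participants]
    decision = all(votes)
    for p in participants:
        p.finish(decision)
    return decision
```

In a cross-domain deployment, each `prepare` round trip pays inter-data-center latency while locks are held; that contention window is what Harp shrinks by reordering sub-transaction execution.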

    Cross trust domain federated k-dominant skyline query processing
    Yexuan SHI, Yongxin TONG, Hao ZHOU, Ke XU, Weifeng LYU
    2023, 9(4):  32-43.  doi:10.11959/j.issn.2096-0271.2023047

    The k-dominant skyline is a popular variant of the skyline query with widespread applications in multi-criteria decision making and recommendation. As these applications scale up, there is increasing demand to support k-dominant skyline queries over a data federation consisting of multiple data silos, each holding disjoint columns of the entire dataset. Supporting k-dominant skyline over a data federation is challenging: strict security constraints are imposed on query processing over data federations, whereas naively adopting security techniques leads to unacceptably inefficient queries. In this paper, we presented an efficient and secure k-dominant skyline query method for a data federation. Specifically, we devised a novel private vector aggregation-based solution with a ciphertext compression-based optimization for efficient k-dominant skyline query processing with security guarantees. Extensive evaluations on both synthetic and real datasets showed the superiority of our method.
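    For readers unfamiliar with the query itself, here is a plain (non-federated, non-private) sketch of k-dominance under one common definition, where smaller values are better: p k-dominates q if on some k dimensions p is nowhere worse and somewhere strictly better. The function names are ours, not the paper's:

```python
from itertools import combinations

def k_dominates(p, q, k):
    """True if p k-dominates q (smaller is better): there is a k-subset
    of dimensions on which p <= q everywhere and p < q at least once."""
    for dims in combinations(range(len(p)), k):
        if all(p[i] <= q[i] for i in dims) and any(p[i] < q[i] for i in dims):
            return True
    return False

def k_dominant_skyline(points, k):
    """Points that no other point k-dominates."""
    return [p for p in points
            if not any(k_dominates(q, p, k) for q in points if q != p)]
```

The federated setting is harder because each silo sees only some columns of every point, so even evaluating a comparison like `p[i] <= q[i]` across silos requires secure aggregation rather than plain value exchange.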

    Trusted sharing of cross-domain intelligence based on data objects
    Tai PENG, Jing SUN, Xujian CHEN, Xian ZHOU, Yuming YE, Xiaoying BAI
    2023, 9(4):  44-58.  doi:10.11959/j.issn.2096-0271.2023049

    Intelligence data, as a high-value data asset, is stored on different platforms and held by different parties, and is characterized by high dispersion and low availability. Because of differing structural forms and storage methods, efficient aggregation and sharing of multi-source heterogeneous intelligence data is non-trivial, and the fusion, comprehensive analysis, and utilization of intelligence among multiple parties is difficult. There is therefore an urgent need for a secure and trustworthy sharing and interoperability mechanism among cross-domain intelligence parties that meets management requirements such as data validation and security auditing while enabling correlation analysis and cross-validation of intelligence and deep mining of its value. To address the needs and application characteristics of trustworthy cross-domain intelligence sharing, this paper proposes a data object-based intelligence management method and adopts the digital object architecture together with blockchain-based trusted access control to build a trustworthy cross-domain intelligence sharing system, realizing a unified view and trustworthy cross-domain sharing of multi-source heterogeneous intelligence data. This provides technical support for intelligence data fusion, convergence, and intelligent analysis, and helps fully exploit the huge potential of intelligence information.

    Research on iterative data cleaning of human-computer interaction
    Yida LIU, Xiaoou DING, Hongzhi WANG, Donghua YANG
    2023, 9(4):  59-68.  doi:10.11959/j.issn.2096-0271.2023048

    Advances in data collection technology have led to a rapid increase in dataset sizes. The large scale and high complexity of these data give rise to serious data quality issues, so data cleaning is a necessary and important step in data activities. To effectively reduce human annotation costs while ensuring cleaning accuracy, an iterative data cleaning method with human participation (IDCHI) was proposed. The method introduces a data selection optimization strategy in the detection module so that the classifier achieves high accuracy in the initial stage, and further proposes a method for selecting the data to be manually annotated, effectively reducing the amount of manual annotation required. Experimental results show that the proposed method cleans erroneous data effectively and efficiently.

    Urban traffic flow prediction based on multi-source heterogeneous spatio-temporal data fusion
    Yang AN, Jianwei SUN, Qian LI, Yongshun GONG
    2023, 9(4):  69-82.  doi:10.11959/j.issn.2096-0271.2023042

    Traffic flow forecasting involves multi-source heterogeneous data. Future traffic flow is related not only to the flow at previous moments but also to heterogeneous spatio-temporal factors such as the relationships between urban regions, weather conditions, and points of interest (POI). To address this, a traffic flow prediction model based on multi-source heterogeneous spatio-temporal data fusion, called MHF-STNet (multi-source heterogeneous fusion spatio-temporal network), was proposed. The model first uses clustering to identify different traffic patterns across urban areas, then integrates spatio-temporal data of multiple modalities, including traffic flow, location relationships between urban areas, weather, POI, and time of day, through concatenation, weighted addition, and attention mechanisms. Deep learning methods uniformly model the heterogeneous data and predict future traffic flow. Experiments on three real-world traffic datasets (TaxiBJ, TaxiNYC, and BikeNYC) showed that MHF-STNet achieved the best performance compared with classic traffic flow prediction models, verifying its effectiveness for unified modeling of heterogeneous spatio-temporal data.

    Research and application of cross-domain data authorization and operation
    Jilin ZHANG, Xiaowei GU, Yizhao ZHANG, Xiaolin ZHENG, Chaochao CHEN
    2023, 9(4):  83-97.  doi:10.11959/j.issn.2096-0271.2023050

    With the development of big data and cloud computing, data management is breaking down "data silos" and evolving from isolated single-domain services to cross-domain data sharing and collaborative services. Based on the public data authorization and operation framework, this paper presented the full-link structure of cross-domain data authorization and operation, and discussed the privacy and efficiency challenges of cross-domain data processing. In response to these challenges, two data processing models, centralized computing and privacy computing, were proposed, which improve data processing efficiency while protecting data privacy. Finally, an application case of a cross-domain data authorization and operation platform in a practical scenario was given.

    Argus: multi-source data-driven industrial control security situational awareness system
    Tianchen ZHU, Jun ZHAO, Bo LI, Jianxin LI
    2023, 9(4):  98-115.  doi:10.11959/j.issn.2096-0271.2023051

    Industrial control systems (ICS) are the brains of national industrial manufacturing and civil infrastructure. However, the security risks associated with ICS have become increasingly prominent, making them a significant target for cybersecurity protection. This paper proposed a solution to the problems of dispersed ICS security data and delayed threat perception. Specifically, it presented a multi-source data-driven ICS security situational awareness system named Argus, which incorporates an awareness chain for ICS security. Furthermore, it developed autonomous situational awareness technologies for ICS security, such as stateless high-speed device scanning, precise threat intelligence extraction, and suspicious attack behavior detection, to achieve multi-channel, three-dimensional ICS security monitoring and situational awareness. Experimental results indicated that, compared with conventional ICS situational awareness methods, the perception accuracy of Argus improved by over 10% and its efficiency by two orders of magnitude. Additionally, Argus enables proactive warning and mitigation of potential security risks.

    STUDY
    Research on data pricing model based on data market type
    Hongrun REN, Yangyong ZHU
    2023, 9(4):  116-138.  doi:10.11959/j.issn.2096-0271.2023052

    A market is the process through which product prices are established, and different markets establish prices in different ways. A product pricing model is an abstraction of this price-establishment process. At present, market demand for data has formed, but an effective data market has not, and data pricing remains exploratory. Most existing data pricing models are designed for specific data transaction scenarios rather than for specific types of data markets. This paper examines the economic market types of the data market and divides the current data market into five types from an economic perspective: monopoly, monopsony, oligopoly, centralized perfect competition, and decentralized perfect competition. Existing data pricing models are mapped to the corresponding market types. By analyzing the dependence between data market types and data pricing models, the "market type principle" of data pricing is proposed to provide theoretical guidance for the construction of data element markets and their data pricing.

    Overview of observational data-based time series causal inference
    Zefan ZENG, Siya CHEN, Xi LONG, Guang JIN
    2023, 9(4):  139-158.  doi:10.11959/j.issn.2096-0271.2022059

    With the increase of data storage and the improvement of computing power, using observational data to infer causality in time series has become a novel approach. Based on the properties and research status of time series causal inference, five observational data-based methods were surveyed: Granger causal analysis, information theory-based methods, causal network structure learning algorithms, structural causal model-based methods, and methods based on nonlinear state-space models. Typical applications in economics and finance, medicine and biology, earth system science, and other engineering fields were then briefly introduced. Further, the advantages and disadvantages of the five methods were compared, and ways to improve them were analyzed according to the focus and difficulties of time series causal inference. Finally, future research directions were discussed.
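    As a concrete taste of the first method class, Granger causal analysis asks whether lagged values of x reduce the prediction error of y beyond what y's own lags achieve. A minimal least-squares sketch (function names are ours; a real analysis would compare the F statistic against an F distribution rather than eyeballing its size):

```python
import numpy as np

def ar_rss(y, x=None, lags=2):
    """Residual sum of squares of a regression of y on its own lags,
    optionally augmented with lags of x."""
    rows, targets = [], []
    for t in range(lags, len(y)):
        row = [1.0] + [y[t - j] for j in range(1, lags + 1)]
        if x is not None:
            row += [x[t - j] for j in range(1, lags + 1)]
        rows.append(row)
        targets.append(y[t])
    A, b = np.array(rows), np.array(targets)
    coef, *_ = np.linalg.lstsq(A, b, rcond=None)
    resid = b - A @ coef
    return float(resid @ resid)

def granger_f(x, y, lags=2):
    """F statistic for 'lags of x help predict y'; larger values mean
    stronger evidence that x Granger-causes y."""
    rss_r = ar_rss(y, None, lags)   # restricted: y's lags only
    rss_u = ar_rss(y, x, lags)      # unrestricted: plus x's lags
    n = len(y) - lags
    k = 1 + 2 * lags                # parameters in the unrestricted model
    return ((rss_r - rss_u) / lags) / (rss_u / (n - k))
```

On a series where y is driven by lagged x, `granger_f(x, y)` is large while `granger_f(y, x)` stays near 1, reflecting the asymmetry that makes Granger analysis a causal-direction heuristic.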

    Medical named entity recognition algorithm based on probability distribution difference
    Cong LIU, Xuefeng LYU, Honglin WANG, Xiaowei WANG, Jin LU, Shun SUN, Songqi HU
    2023, 9(4):  159-171.  doi:10.11959/j.issn.2096-0271.2023008

    With improved data capabilities and the development of emerging technologies, profound changes are occurring in economic patterns and the competitive structure of industries. To respond better to future opportunities and challenges, and to improve enterprise competitiveness in the new situation, it is necessary to understand and master digital transformation. The new competitive situation, in which traditional enterprises are gradually replaced by digitally transformed ones, was discussed, and digital transformation was distinguished from digitalization. The main challenges facing traditional enterprises undergoing digital transformation were identified: lack of funds, talent, data, and awareness. A digital transformation service platform oriented to the new competitive situation was proposed, providing a feasible path to enhancing enterprise competitiveness and conducting digital transformation.

    FORUM
    PARIS principle: improving the usability of scientific data in the open collaborative environment
    Hongzhi SHEN, Xiaolin ZHANG, Xiaohuan ZHENG
    2023, 9(4):  172-188.  doi:10.11959/j.issn.2096-0271.2023013

    The demand for scientific data utilization is increasingly urgent. In the open environment brought about by new research paradigms such as the “Fourth Paradigm” and “Convergence Science”, data utilization is cross-boundary, end-to-end, dynamic, and collaborative. As products of the “era of the data repository”, the FAIR and TRUST principles can no longer provide in-depth guidance for the efficient use of scientific data in the open environment. This paper analyzed typical scenarios of scientific data utilization in detail, then presented the PARIS principles to promote scientific data utilization: processable, askable, reliable, incorporable, and suppliable. Finally, it gave a technical practice path that the PARIS principles can follow. As beneficial extensions of the FAIR and TRUST principles, the PARIS principles are expected to effectively improve the usability of scientific data.

    SCIENCE POPULARIZATION