Big Data Research ›› 2023, Vol. 9 ›› Issue (1): 1-20. doi: 10.11959/j.issn.2096-0271.2023009
• Strategic Research •
Hong MEI1, Xiaoyong DU2, Hai JIN3, Xueqi CHENG4,5, Yunpeng CHAI2, Xuanhua SHI3, Xiaolong JIN4,5, Yasha WANG1, Chi LIU6
Online:
2023-01-15
Published:
2023-01-01
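About the author:
Hong MEI (1963- ), male, PhD, is a professor at Peking University, director of the Key Laboratory of High Confidence Software Technologies (Peking University), Ministry of Education, an academician of the Chinese Academy of Sciences, a fellow of The World Academy of Sciences, a foreign member of Academia Europaea, and president of the China Computer Federation. His main research interests are software engineering and system software.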
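Abstract:
Major countries around the world attach great importance to the development of big data, and China has also made big data development a national strategy, so developing big data technologies is of great significance. Big data technologies span the full data life cycle, from collection and transmission to management, processing, analysis, and application, as well as data governance at every stage of that life cycle. This paper selects management, processing, and analysis technologies from the data life cycle, together with big data governance technologies, to survey the state of technology development in China and abroad, and in particular to assess the gap between China's big data technologies and the international state of the art. In addition, driven by the demands of big data applications, the computing technology system is being restructured, shifting from "computing-centric" to "data-centric". Under the new computing technology system, a series of fundamental theoretical and core technical problems urgently need to be solved, and new big data system technologies have become an important direction of development. Against the background of this restructuring, the paper puts forward four major technical challenges and ten development trends for big data technologies.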
Hong MEI, Xiaoyong DU, Hai JIN, Xueqi CHENG, Yunpeng CHAI, Xuanhua SHI, Xiaolong JIN, Yasha WANG, Chi LIU. Big data technologies forward-looking[J]. Big Data Research, 2023, 9(1): 1-20.