Big Data Research ›› 2019, Vol. 5 ›› Issue (6): 3-18. doi: 10.11959/j.issn.2096-0271.2019046
Ju FAN (范举)1,2, Yueguo CHEN (陈跃国)1,2, Xiaoyong DU (杜小勇)1,2
Online:
2019-11-15
Published:
2020-01-10
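About the authors:
Ju FAN (1984- ), male, Ph.D., is an associate professor at the Key Laboratory of Data Engineering and Knowledge Engineering (Ministry of Education) and the School of Information, Renmin University of China. He is a member of the China Computer Federation (CCF) and of the CCF Technical Committee on Databases. His main research interests are databases and big data, crowdsourced data management, and data preparation.
Yueguo CHEN (1978- ), male, Ph.D., is a professor and doctoral supervisor at the School of Information, Renmin University of China. He is a senior member of CCF, a member of the CCF Technical Committee on Databases, and a corresponding member of the CCF Big Data Expert Committee. His main research interests are big data analytics systems and semantic search.
Xiaoyong DU (1963- ), male, Ph.D., is a professor and doctoral supervisor at the School of Information, Renmin University of China, and director of the Key Laboratory of Data Engineering and Knowledge Engineering (Ministry of Education). He is a CCF fellow, chair of the CCF Technical Committee on Databases, deputy director of the editorial board of Big Data Research, and an editorial board member of ACM Transactions on Data Science. His main research interests are databases and big data, intelligent information retrieval, and knowledge engineering.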
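Abstract:
With the rapid development of data analytics, data preparation has increasingly become a bottleneck. Against the background of real-world data analysis scenarios, this paper analyzes the two core challenges of data preparation: high labor cost and long turnaround time. On this basis, it surveys research progress on human-in-the-loop data preparation. Interactive data preparation targets end users: it predicts the user's intent by interacting with the user and relies on effective prediction algorithms to shorten the time spent on data preparation. Crowdsourcing-based data preparation brings in large numbers of Internet users as crowd workers to extend computing capacity, thereby supporting the basic tasks of data preparation, and studies how to perform quality control and cost optimization for crowdsourcing. Finally, the paper summarizes human-in-the-loop data preparation and discusses challenging open problems.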
Ju FAN, Yueguo CHEN, Xiaoyong DU. Progress on human-in-the-loop data preparation (人在回路的数据准备技术研究进展)[J]. Big Data Research, 2019, 5(6): 3-18.