Big Data Research, 2019, Vol. 5, Issue (6): 3-18. doi: 10.11959/j.issn.2096-0271.2019046
• TOPIC: BIG DATA WRANGLING •
Ju FAN, Yueguo CHEN, Xiaoyong DU
Online: 2019-11-15
Published: 2020-01-10
Ju FAN, Yueguo CHEN, Xiaoyong DU. Progress on human-in-the-loop data preparation[J]. Big Data Research, 2019, 5(6): 3-18.