Big Data Research, 2019, Vol. 5, Issue 6: 30-46. doi: 10.11959/j.issn.2096-0271.2019048
• TOPIC: BIG DATA WRANGLING •
Minghe YU, Tiezheng NIE, Guoliang LI
Online: 2019-11-15
Published: 2020-01-10
Minghe YU, Tiezheng NIE, Guoliang LI. Data curation technologies and applications[J]. Big Data Research, 2019, 5(6): 30-46.
[1] | WANG F , SHEN J H . Advances in data curation abroad:research and practice[J]. Journal of Library Science in China, 2014,40(4): 116-128. |
[2] | BISHOP B W , HANK C . Data curation profiling of biocollections[C]// Annual Meeting of the Association for Information Science and Technology,October 14-18,2016,Copenhagen,Denmark. Hoboken:Wiley, 2016: 1-9. |
[3] | BOEHMKE B C . Data wrangling with R[M]. Switzerland: Springer Nature, 2016: 1-238. |
[4] | BEHESHTI S , TABEBORDBAR A , BENATALLAH B ,et al. On automating basic data curation tasks[C]// The 26th International Conference on World Wide Web Companion,April 3-7,2017,Perth,Australia. New York:ACM Press, 2017: 165-169. |
[5] | SINGH N , SINGH A K . Data privacy protection mechanisms in cloud[J]. Data Science and Engineering, 2018,3(1): 24-39. |
[6] | BUNEMAN P , CHENEY J , TAN W C ,et al. Curated databases[C]// Symposium on Principles of Database Systems,June 9-11,2008,Vancouver,Canada. New York:ACM Press, 2008: 1-12. |
[7] | BOHANNON P , FLASTER M , FAN W ,et al. A cost-based model and effective heuristic for repairing constraints by value modification[C]// International Conference on Management of Data,June 14-16,2005,Baltimore,USA. New York:ACM Press, 2005: 143-154. |
[8] | CHU X , ILYAS I F , PAPOTTI P . Holistic data cleaning:putting violations into context[C]// International Conference on Data Engineering,April 8-12,2013,Brisbane,Australia. Piscataway:IEEE Press, 2013: 458-469. |
[9] | CHU X , ILYAS I F , KRISHNAN S A ,et al. Data cleaning:overview and emerging challenges[C]// International Conference on Management of Data,June 26 - July 1,2016,San Francisco,USA. New York:ACM Press, 2016: 2201-2206. |
[10] | GOLAB L , KARLOFF H J , KORN F ,et al. On generating near-optimal tableaux for conditional functional dependencies[J]. Proceedings of the VLDB Endowment, 2008,1(1): 376-390. |
[11] | BESKALES G , ILYAS I F , GOLAB L ,et al. On the relative trust between inconsistent data and inaccurate constraints[C]// International Conference on Data Engineering,April 8-12,2013,Brisbane,Australia. Piscataway:IEEE Press, 2013: 541-552. |
[12] | YAKOUT M , ELMAGARMID A K , NEVILLE J ,et al. Guided data repair[J]. Proceedings of the VLDB Endowment, 2011,4(5): 279-289. |
[13] | WANG J , KRASKA T , FRANKLIN M J ,et al. CrowdER:crowdsourcing entity resolution[J]. Proceedings of the VLDB Endowment, 2012,5(11): 1483-1494. |
[14] | HAO S , TANG N , LI G ,et al. Cleaning relations using knowledge bases[C]// International Conference on Data Engineering,April 19-22,2017,San Diego,USA. Piscataway:IEEE Press, 2017: 933-944. |
[15] | MARCUS A , PARAMESWARAN A . Crowdsourced data management:industry and academic perspectives[J]. Foundations and Trends in Databases, 2013,6(1-2): 1-161. |
[16] | GOKHALE C , DAS S , DOAN A ,et al. Corleone:hands-off crowdsourcing for entity matching[C]// International Conference on Management of Data,June 22-27,2014,Snowbird,USA. New York:ACM Press, 2014: 601-612. |
[17] | HAAS D , WANG J , WU E ,et al. CLAMShell:speeding up crowds for low-latency data labeling[J]. Proceedings of the VLDB Endowment, 2015,9(4): 372-383. |
[18] | MOZAFARI B , SARKAR P , FRANKLIN M J ,et al. Scaling up crowd-sourcing to very large datasets:a case for active learning[J]. Proceedings of the VLDB Endowment, 2014,8(2): 125-136. |
[19] | ANANTHAKRISHNA R , CHAUDHURI S , GANTI V . Eliminating fuzzy duplicates in data warehouses[C]// International Conference on Very Large Data Bases,August 20-23,2002,Hong Kong,China. San Francisco:Morgan Kaufmann, 2002: 586-597. |
[20] | WANG J , KRISHNAN S , FRANKLIN M J ,et al. A sample-and-clean framework for fast and accurate query processing on dirty data[C]// International Conference on Management of Data,June 22-27,2014,Snowbird,USA. New York:ACM Press, 2014: 469-48 |
[21] | KOLB L , THOR A , RAHM E . Dedoop:efficient deduplication with Hadoop[J]. Proceedings of the VLDB Endowment, 2012,5(12): 1878-1881. |
[22] | KHAYYAT Z , ILYAS I F , JINDAL A ,et al. BigDansing:a system for big data cleansing[C]// International Conference on Management of Data,May 31-June 4,2015,Melbourne,Australia. New York:ACM Press, 2015: 1215-1230. |
[23] | CHU X , ILYAS I F , KOUTRIS P . Distributed data deduplication[R]. Waterloo:University of Waterloo, 2016. |
[24] | HUI J , LI L , ZHANG Z . Integration of big data:a survey[C]// International Conference of Pioneering Computer Scientists,Engineers and Educators,September 21-23,2018,Zhengzhou,China. Heidelberg:Springer, 2018: 101-121. |
[25] | LI F , LEE M , HSU W ,et al. Linking temporal records for profiling entities[C]// International Conference on Management of Data,May 31-June 4,2015,Melbourne,Australia. New York:ACM Press, 2015: 593-605. |
[26] | ABEDJAN Z , AKCORA C G , OUZZANI M ,et al. Temporal rules discovery for web data cleaning[J]. Proceedings of the VLDB Endowment, 2015,9(4): 336-347. |
[27] | PETERMANN A , JUNGHANNS M , MÜLLER R ,et al. Graph-based data integration and business intelligence with BIIIG[J]. Proceedings of the VLDB Endowment, 2014,7(13): 1577-1580. |
[28] | LI Q , LI Y , GAO J ,et al. Resolving conflicts in heterogeneous data by truth discovery and source reliability estimation[C]// International Conference on Management of Data,June 22-27,2014,Snowbird,USA. New York:ACM Press, 2014: 1187-1198. |
[29] | LI Q , LI Y , GAO J ,et al. A confidence-aware approach for truth discovery on long-tail data[J]. Proceedings of the VLDB Endowment, 2014,8(4): 425-436. |
[30] | REKATSINAS T , JOGLEKAR M , GARCIA-MOLINA H ,et al. SLiMFast:guaranteed results for data fusion and source reliability[C]// International Conference on Management of Data,May 14-19,2017,Chicago,USA. New York:ACM Press, 2017: 1399-1414. |
[31] | YU R , GADIRAJU U , FETAHU B ,et al. FuseM:query-centric data fusion on structured Web markup[C]// International Conference on Data Engineering,April 19-22,2017,San Diego,USA. Piscataway:IEEE Press, 2017: 179-182. |
[32] | SALLOUM M , DONG X L , SRIVASTAVA D ,et al. Online ordering of overlapping data sources[J]. Proceedings of the VLDB Endowment, 2013,7(3): 133-144. |
[33] | REKATSINAS T , DONG X L , SRIVASTAVA D . Characterizing and selecting fresh data sources[C]// International Conference on Management of Data,June 22-27,2014,Snowbird,USA. New York:ACM Press, 2014: 919-930. |
[34] | BONAQUE R , CAO T D , CAUTIS B ,et al. Mixed-instance querying:a lightweight integration architecture for data journalism[J]. Proceedings of the VLDB Endowment, 2016,9(13): 1513-1516. |
[35] | CHAMANARA J , KÖNIG-RIES B , JAGADISH H V . QUIS:InSitu heterogeneous data source querying[J]. Proceedings of the VLDB Endowment, 2017,10(12): 1877-1880. |
[36] | SAWADOGO P , KIBATA T , DARMONT J . Metadata management for textual documents in data lakes[C]// International Conference on Enterprise Information Systems,May 3-5,2019,Heraklion,Greece. [S.l.]:SciTePress, 2019: 72-83. |
[37] | STEIN B , MORRISON A . The enterprise data lake:better integration and deeper analytics[J]. Technology Forecast, 2014(1): 1-9. |
[38] | QUIX C , HAI R , VATOV I . Metadata extraction and management in data lakes with GEMMS[J]. Complex Systems Informatics and Modeling Quarterly, 2016(9): 67-83. |
[39] | HAI R , GEISLER S , QUIX C . Constance:an intelligent data lake system[C]// International Conference on Management of Data,June 26-July 1,2016,San Francisco,USA. New York:ACM Press, 2016: 2097-2100. |
[40] | INMON B . Data lake architecture:designing the data lake and avoiding the garbage dump[M]. [S.l.]: Technics Publications, 2016. |
[41] | FANG H . Managing data lakes in big data era:what’s a data lake and why has it became popular in data management ecosystem[C]// International Conference on Cyber Technology in Automation,Control and Intelligent Systems,June 8-12,2015,Shenyang,China. Piscataway:IEEE Press, 2015: 820-824. |
[42] | MILOSLAVSKAYA N G , TOLSTOY A I . Application of big data,fast data,and data lake concepts to information security issues[C]// International Conference on Future Internet of Things and Cloud Workshops,August 22-24,2016,Vienna,Austria. Piscataway:IEEE Press, 2016: 148-153. |
[43] | MACCIONI A , TORLONE R . Crossing the finish line faster when paddling the data lake with kayak[J]. Proceedings of the VLDB Endowment, 2017,10(12): 1853-1856. |
[44] | HERSCHEL M , DIESTELKÄMPER R , LAHMAR H B . A survey on provenance:what for,what form,what from[J]. The VLDB Journal, 2017,26(6): 881-906. |
[45] | CHENEY J , CHITICARIU L , TAN W C . Provenance in databases:why,how,and where[J]. Foundations and Trends in Databases, 2009,1(4): 379-474. |
[46] | BUNEMAN P , TAN W C . Data provenance:what next[J]. SIGMOD Record, 2018,47(3): 5-16. |
[47] | BHAGWAT D , CHITICARIU L , TAN W C ,et al. An annotation management system for relational databases[J]. The VLDB Journal, 2005,14(4): 373-396. |
[48] | CHITICARIU L , TAN W C , VIJAYVARGIYA G . DBNotes:a post-it system for relational databases based on provenance[C]// International Conference on Management of Data,June 14-16,2005,Maryland,USA. New York:ACM Press, 2005: 942-944. |
[49] | GEERTS F , KEMENTSIETSIDIS A , MILANO D . MONDRIAN:annotating and querying databases through colors and blocks[C]// International Conference on Data Engineering,April 3-8,2006,Atlanta,USA. Piscataway:IEEE Press, 2006. |
[50] | BUNEMAN P , CHENEY J , VANSUMMEREN S . On the expressiveness of implicit provenance in query and update languages[J]. ACM Transactions on Database Systems, 2008,33(4): 1-47. |
[51] | BUNEMAN P , KHANNA S , TAJIMA K ,et al. Archiving scientific data[J]. ACM Transactions on Database Systems, 2004,29: 2-42. |
[52] | HUANG S , XU L , LIU J ,et al. OrpheusDB:bolt-on versioning for relational databases[J]. Proceedings of the VLDB Endowment, 2017,10(10): 1130-1141. |
[53] | MADDOX M , GOEHRING D , ELMORE A J ,et al. Decibel:the relational dataset branching system[J]. Proceedings of the VLDB Endowment, 2016,9(9): 624-635. |
[54] | LAPPAS T , TERZI E , GUNOPULOS D . Finding effectors in social networks[C]// International Conference on Knowledge Discovery and Data Mining,July 25-28,2010,Washington,DC,USA. New York:ACM Press, 2010: 1059-1068. |
[55] | SHAH D , ZAMAN T . Rumors in a network:who’s the culprit[J]. IEEE Transactions on Information Theory, 2011,57(8): 5163-5181. |
[56] | BUNEMAN P , CHENEY J , LINDLEY S ,et al. DBWiki:a structured wiki for curated data and collaborative data management[C]// International Conference on Management of Data,June 12-16,2011,Athens,Greece. New York:ACM Press, 2011: 1335-1338. |
[57] | BRACHMANN M , BAUTISTA C , CASTELO S ,et al. Data debugging and exploration with Vizier[C]// International Conference on Management of Data,June 30-July 5,2019,Amsterdam,The Netherlands. New York:ACM Press, 2019: 1877-1880. |
[58] | CALLAHAN S P , FREIRE J , SANTOS E ,et al. VisTrails:visualization meets data management[C]// International Conference on Management of Data,June 27-29,2006,Chicago,USA. New York:ACM Press, 2006: 745-747. |
[59] | YANG Y , MENEGHETTI N , FEHLING R ,et al. An on-demand approach to ETL[J]. Proceedings of the VLDB Endowment, 2015,8(12): 1578-1589. |
[60] | MARINI L , GUTIERREZ-POLO I , KOOPER R ,et al. Clowder:open source data management for long tail data[C]// The Practice and Experience on Advanced Research Computing,July 22-26,2018,Pittsburgh,USA. New York:ACM Press, 2018: 1-8. |
[61] | VARGAS-SOLAR B , KEMP G , GALLEGOS I H ,et al. Demonstrating data collections curation and exploration with curare[C]// International Conference on Extending Database Technology,March 26-29,2019,Lisbon,Portugal. [S.l.:s.n.], 2019: 598-601. |
[62] | WOLLATZ L , SCOTT M , JOHNSTON S J ,et al. Curation of image data for medical research[C]// International Conference on e-Science,October 29-November 1,2018,Amsterdam,The Netherlands. Piscataway:IEEE Press, 2018: 105-113. |
[63] | DU X Y , CHEN Y G , FAN J ,et al. Data wrangling:a key technique of data governance[J]. Big Data Research, 2019,5(3): 13-22. |