Research of Data Cleaning Methods Based on Dependency Rules

Authors: Yang Bao, Shi Wei Deng, Wang Qun Lin

Abstract:

This paper introduces the concept and principles of data cleaning, analyzes the types and causes of dirty data, and outlines the key steps of a typical cleaning process. It puts forward a scalable and versatile data cleaning framework. For data with attribute dependency relations, it designs several violation data discovery algorithms, expressed as formal formulas, which can find the inconsistent data in every target column that depends on the condition attributes, whether the data is structured (SQL) or unstructured (NoSQL). Based on these algorithms, it gives six data cleaning methods.
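As an illustration of what violation data discovery under a dependency rule can look like, the following minimal Python sketch flags records that agree on the condition attributes but disagree on a target attribute. It is not the paper's algorithm: the function name `find_violations`, the `zip` and `city` attributes, and the sample records are all hypothetical. Records are plain dictionaries, so the same check can be applied to table rows or to schema-free documents.

```python
from collections import defaultdict


def find_violations(records, condition_attrs, target_attr):
    """Group records by their condition attributes and report every group
    whose members disagree on the target attribute.

    Hypothetical helper for illustration; not taken from the paper.
    Returns a dict mapping each violating key to the list of record indexes.
    """
    groups = defaultdict(list)
    for index, record in enumerate(records):
        # A record missing a condition attribute cannot be checked against the rule.
        if any(attr not in record for attr in condition_attrs):
            continue
        key = tuple(record[attr] for attr in condition_attrs)
        groups[key].append(index)

    violations = {}
    for key, indexes in groups.items():
        # More than one distinct target value under the same key breaks the rule.
        values = {records[i].get(target_attr) for i in indexes}
        if len(values) > 1:
            violations[key] = indexes
    return violations


# Assumed sample data: the rule "zip determines city" is violated for 10001.
records = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "NYC"},
    {"zip": "94105", "city": "San Francisco"},
    {"zip": "94105", "city": "San Francisco", "note": "extra field"},
]
print(find_violations(records, ["zip"], "city"))  # {('10001',): [0, 1]}
```

The sketch only detects the conflicting groups. Choosing which value to keep (for example by majority vote or a trusted source) is a separate repair step that it does not attempt.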

Keywords: Data cleaning, dependency rules, violation data discovery, data repair.

Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1109179

