Commenced in January 2007
Frequency: Monthly
Edition: International
Paper Count: 2

data cleaning Related Publications

2 Research of Data Cleaning Methods Based on Dependency Rules

Authors: Yang Bao, Shi Wei Deng, Wang Qun Lin

Abstract:

This paper introduces the concept and principle of data cleaning, analyzes the types and causes of dirty data, and proposes several key steps of typical cleaning process, puts forward a well scalability and versatility data cleaning framework, in view of data with attribute dependency relation, designs several of violation data discovery algorithms by formal formula, which can obtain inconsistent data to all target columns with condition attribute dependent no matter data is structured (SQL) or unstructured (NoSql), and gives 6 data cleaning methods based on these algorithms.

Keywords: data cleaning, dependency rules, data repair, violation data discovery

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1945
1 Data Preprocessing for Supervised Leaning

Authors: S. B. Kotsiantis, P. E. Pintelas, D. Kanellopoulos

Abstract:

Many factors affect the success of Machine Learning (ML) on a given task. The representation and quality of the instance data is first and foremost. If there is much irrelevant and redundant information present or noisy and unreliable data, then knowledge discovery during the training phase is more difficult. It is well known that data preparation and filtering steps take considerable amount of processing time in ML problems. Data pre-processing includes data cleaning, normalization, transformation, feature extraction and selection, etc. The product of data pre-processing is the final training set. It would be nice if a single sequence of data pre-processing algorithms had the best performance for each data set but this is not happened. Thus, we present the most well know algorithms for each step of data pre-processing so that one achieves the best performance for their data set.

Keywords: Data Mining, Feature selection, data cleaning

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 4461