Learning-based Data Cleaning

Christian Stade-Schuldt, Freie Universität Berlin

Thesis Talk, 16.12.2009

Outline

Motivation

Background
  The Data Cleaning Process
  Review of Machine Learning Techniques

Data-Cleaning Workflow

Results
  Clustering
  Classification

Summary

FU Berlin, Diploma Thesis, 16.12.2009

Why Data Cleaning?

• Many sources lead to different formats and standards.
• Migration becomes a costly issue.
• Built-in database techniques are not capable of dealing with dirty data.

Benefits of Data Cleaning

• Less time for data maintenance ⇒ more time for key job functions
• Removal of data inconsistencies
• More complete and accurate data sources
• Identify organizational, process and data issues ⇒ enforce standards

The Data Cleaning Process

Record Matching and Record Merging

• Simple: Record Matching based on a key or a set of rules
• Difficult: Record Matching without a key
• Database operations are primarily restricted to joins on fields and simple pattern matching.
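The contrast between the simple and the difficult case can be sketched as follows. This is only an illustration: the function names, the `id` field, and the sample records are hypothetical and do not come from the thesis. With a key, matching is a lookup (the equivalent of a database join). Without one, a naive approach has to consider every pair of records.

```python
from itertools import combinations


def match_by_key(left, right, key):
    """Pair records from two sources that share the same key value."""
    index = {record[key]: record for record in right}
    return [(record, index[record[key]])
            for record in left if record[key] in index]


def all_pairs(records):
    """Without a key, every pair of records is a potential match."""
    return list(combinations(range(len(records)), 2))


# Hypothetical sample data: two sources describing overlapping customers.
source_a = [{"id": 1, "name": "A. Meyer"}, {"id": 2, "name": "B. Schulz"}]
source_b = [{"id": 2, "name": "Bernd Schulz"}, {"id": 3, "name": "C. Wolf"}]

matched = match_by_key(source_a, source_b, "id")
pairs = all_pairs(source_a + source_b)
```

For n records the keyless case yields n(n-1)/2 candidate pairs, which is why comparing all of them becomes expensive on large sources.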

Canopy Clustering

Canopy Clustering allows efficient clustering of data sources which
• are large
• have records with a lot of attributes
• result in a lot of clusters

Idea: Apply a cheap distance measure to cluster the data into overlapping canopies.
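A minimal sketch of the canopy idea, assuming the commonly described two-threshold formulation (a loose threshold for membership and a tight threshold for removing candidate centers). The thresholds, the function name, and the toy one-dimensional data are assumptions for illustration, not details taken from the thesis. Any cheap distance function can be passed in.

```python
def canopies(points, distance, t_loose, t_tight):
    """Group point indices into possibly overlapping canopies.

    A point joins a canopy if it lies within t_loose of the center.
    Points within t_tight of the center stop being candidate centers.
    """
    if t_tight > t_loose:
        raise ValueError("t_tight must not exceed t_loose")
    remaining = list(range(len(points)))  # candidate centers
    result = []
    while remaining:
        center = remaining[0]
        # Assumption: membership is checked against all points,
        # so a point may end up in several canopies.
        dists = [distance(points[center], p) for p in points]
        result.append([i for i, d in enumerate(dists) if d <= t_loose])
        remaining = [i for i in remaining if dists[i] > t_tight]
    return result


# Toy data: numbers on a line, absolute difference as the cheap distance.
values = [0.0, 1.0, 2.5, 4.0, 5.0]
groups = canopies(values, lambda a, b: abs(a - b), t_loose=2.0, t_tight=1.0)
# groups == [[0, 1], [1, 2, 3], [2, 3, 4]]
```

Indices 1, 2 and 3 each appear in two canopies, which is the overlap the slide refers to. A more expensive comparison would then only be needed within each canopy.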

Canopy Clustering Distance Measure

• Use reverse indexing as a rough clustering constraint.
• Jaccard similarity coefficient: $J(A,B) = \frac{|A \cap B|}{|A \cup B|}$
• The ratio between the number of word matches and the number of total words between two records determines how similar the records are.
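The measure above can be sketched in a few lines, assuming that "reverse indexing" refers to an inverted index from words to record ids and that records are split into words on whitespace. Both are assumptions, and the helper names and sample records are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def words(record):
    """Lower-cased set of whitespace-separated words in a record."""
    return set(record.lower().split())


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two word sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def inverted_index(records):
    """Map each word to the ids of the records that contain it."""
    index = defaultdict(set)
    for record_id, record in enumerate(records):
        for word in words(record):
            index[word].add(record_id)
    return index


def candidate_pairs(index):
    """Only records sharing at least one word are ever compared."""
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


records = ["Freie Universität Berlin", "FU Berlin",
           "Technische Universität München"]
pairs = candidate_pairs(inverted_index(records))
scores = {p: jaccard(words(records[p[0]]), words(records[p[1]]))
          for p in pairs}
```

Here records 1 and 2 share no word, so the index never proposes that pair. The two remaining pairs score 1/4 and 1/5.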

Support Vector Machines

• Maximize the margin $m = \frac{2}{\|w\|}$
• Kernel trick
• Black box technique
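For reference, the margin formula follows from the textbook hard-margin formulation, sketched below. The symbols $x_i$, $y_i \in \{-1, +1\}$ and $b$ do not appear on the slide and are assumed here in their usual meaning (training points, class labels, bias).

```latex
\begin{align*}
\text{separating hyperplane:}\quad & w^\top x + b = 0 \\
\text{margin hyperplanes:}\quad & w^\top x + b = \pm 1
  \;\Rightarrow\; m = \frac{2}{\|w\|} \\
\text{training problem:}\quad & \min_{w,\,b}\; \tfrac{1}{2}\|w\|^2
  \quad \text{s.t.}\quad y_i\,(w^\top x_i + b) \ge 1 \quad \forall i
\end{align*}
```

Maximizing $m$ is equivalent to minimizing $\|w\|$. In the dual form of this problem the data enter only through inner products $x_i^\top x_j$, and the kernel trick replaces those with a kernel function $k(x_i, x_j)$.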

Model Generator

• Strings
• Abbreviation Detection
• Normalized Edit Distance
• Learning String Edit Distance
• Rule Engine
• Numbers
• Dates
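As an illustration of one of the string features listed above, the following sketch computes a normalized edit distance. It assumes one common normalization, the Levenshtein distance divided by the length of the longer string. The thesis may define the normalization differently, and the function names are hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the row (indexed by b) as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_edit_distance(a, b):
    """Edit distance scaled to [0, 1] by the longer string's length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0  # two empty strings are identical
    return levenshtein(a, b) / longest
```

Scaling to the range [0, 1] makes values comparable across fields of different lengths, which is convenient when the result is used as a feature for a classifier.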

Clustering the data

Classification and Backflow

Clustering Results

• Find "best" features and parameters
• Trade-off between quality and size of the search space

Classification Results for Dataset I

How does the number of training samples affect the results?

Classification Results for Dataset II

How does the computation of features affect the results?

Summary

• Data Cleaning using Clustering and Classification
• Business Value: Reduced Manpower + Improved Data Quality
• Future Work
  • Improved features
  • Automatic selection of parameters
  • Scalability
