Post on 10-Aug-2015

Learning-based Data Cleaning

Christian Stade-Schuldt
Freie Universität Berlin

Thesis Talk, 16.12.2009

Outline

Motivation

Background
  The Data Cleaning Process
  Review of Machine Learning Techniques

Data-Cleaning Workflow

Results
  Clustering
  Classification

Summary

FU Berlin, Diploma Thesis, 16.12.2009 2

Why Data Cleaning?

• Many sources lead to different formats and standards.
• Migration becomes a costly issue.
• Built-in database techniques are not capable of dealing with dirty data.

Benefits of Data Cleaning

• Less time for data maintenance ⇒ more time for key job functions
• Removal of data inconsistencies
• More complete and accurate data sources
• Identify organizational, process and data issues ⇒ enforce standards

The Data Cleaning Process

Record Matching and Record Merging

• Simple: record matching based on a key or a set of rules
• Difficult: record matching without a key
• Database operations are primarily restricted to joins on fields and simple pattern matching.
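
The contrast between the two cases can be sketched in a few lines of Python. This is only an illustration: the records, the field names, the use of `difflib.SequenceMatcher` as the similarity measure and the threshold of 0.85 are all assumptions, not part of the thesis.

```python
from difflib import SequenceMatcher

# Hypothetical toy records; the field names are assumptions for illustration.
customers = [
    {"id": 1, "name": "Freie Universität Berlin"},
    {"id": 2, "name": "Technische Universität Berlin"},
]
orders = [
    {"customer_id": 1, "name": "Freie Universitaet Berlin"},
    {"customer_id": None, "name": "Techn. Universität Berlin"},  # key is missing
]


def match_by_key(left, right):
    """Simple case: exact join on a shared key, as a database would do it."""
    index = {rec["id"]: rec for rec in left}
    return [(index[r["customer_id"]], r) for r in right if r["customer_id"] in index]


def match_without_key(left, right, threshold=0.85):
    """Difficult case: compare every pair by string similarity.

    Returns (left id, right name) for each pair at or above the threshold.
    """
    pairs = []
    for a in left:
        for b in right:
            score = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
            if score >= threshold:
                pairs.append((a["id"], b["name"]))
    return pairs


keyed = match_by_key(customers, orders)
fuzzy = match_without_key(customers, orders)
```

The key-based join only finds the order that carries a customer id, while the pairwise comparison also recovers the order whose key is missing. The price is one comparison per pair of records and a threshold that has to be chosen.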

Canopy Clustering

Canopy Clustering allows efficient clustering of data sources which
• are large
• have records with a lot of attributes
• result in a lot of clusters

Idea: Apply a cheap distance measure to cluster the data into overlapping canopies.
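
A minimal sketch of this idea, assuming the common two-threshold formulation of canopy clustering: a loose threshold decides canopy membership and a tight threshold decides which records may no longer start a canopy. The names `t_loose` and `t_tight`, the sample records and the in-order choice of centers (instead of a random choice) are assumptions made for illustration.

```python
def jaccard_distance(a, b):
    """Cheap distance on word sets: 1 - |A ∩ B| / |A ∪ B|."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def canopy_clustering(records, distance, t_loose, t_tight):
    """Group records into possibly overlapping canopies.

    Every record within t_loose of a center joins that center's canopy.
    Records within t_tight of the center are removed from the list of
    candidate centers, but they can still appear in later canopies.
    """
    if t_tight > t_loose:
        raise ValueError("t_tight must not exceed t_loose")
    candidates = list(range(len(records)))
    canopies = []
    while candidates:
        center = candidates[0]  # deterministic choice for reproducibility
        dists = [distance(records[center], rec) for rec in records]
        canopies.append([i for i, d in enumerate(dists) if d <= t_loose])
        candidates = [i for i in candidates if dists[i] > t_tight]
    return canopies


# Hypothetical records, already split into word sets.
records = [
    {"anna", "maria", "schmidt", "berlin"},
    {"anna", "schmidt", "berlin"},
    {"a", "schmidt", "berlin"},
    {"a", "schmidt", "potsdam"},
    {"peter", "müller", "hamburg"},
]
canopies = canopy_clustering(records, jaccard_distance, t_loose=0.7, t_tight=0.4)
```

Record 2 ends up in three canopies here, which is the intended overlap: a more expensive comparison then only needs to look at pairs that share a canopy.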

Canopy Clustering Distance Measure

• Use reverse indexing as a rough clustering constraint.
• Jaccard similarity coefficient: $J(A,B) = \frac{|A \cap B|}{|A \cup B|}$
• The ratio of the number of words two records share to the total number of distinct words in both records determines how similar the records are.
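
The following sketch shows how an index from words to records can restrict the Jaccard computation to pairs that share at least one word. It reads "reverse indexing" as an inverted index, which is an assumption; the whitespace tokenizer and the sample records are illustrative as well.

```python
from collections import defaultdict
from itertools import combinations


def tokenize(record):
    """Split a record into a set of lower-cased words (a simplifying assumption)."""
    return set(record.lower().split())


def build_inverted_index(token_sets):
    """Map each word to the ids of the records that contain it."""
    index = defaultdict(set)
    for rec_id, tokens in enumerate(token_sets):
        for token in tokens:
            index[token].add(rec_id)
    return index


def jaccard(a, b):
    """J(A, B) = |A ∩ B| / |A ∪ B| on word sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def candidate_similarities(records):
    """Compute Jaccard similarity only for pairs that share at least one word."""
    token_sets = [tokenize(r) for r in records]
    index = build_inverted_index(token_sets)
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(sorted(ids), 2))
    return {(i, j): jaccard(token_sets[i], token_sets[j]) for i, j in sorted(pairs)}


records = ["Freie Universität Berlin", "FU Berlin", "TU München"]
similarities = candidate_similarities(records)
```

Only records 0 and 1 share a word, so a single pair is scored and the third record is never compared with anything.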

Support Vector Machines

• Maximize the margin $m = \frac{2}{\|w\|}$
• Kernel trick
• Black box technique
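
For reference, the margin formula corresponds to the standard hard-margin formulation written out below. The training pairs $(x_i, y_i)$ with labels $y_i \in \{-1, +1\}$ and the offset $b$ are not on the slide and are introduced here as assumptions.

```latex
\begin{aligned}
&\text{Margin hyperplanes: } w^\top x + b = \pm 1
  \quad\Rightarrow\quad m = \frac{2}{\|w\|} \\[4pt]
&\max_{w,\,b}\ \frac{2}{\|w\|}
  \;\Longleftrightarrow\;
  \min_{w,\,b}\ \tfrac{1}{2}\,\|w\|^2
  \quad \text{subject to } y_i\,(w^\top x_i + b) \ge 1,\quad i = 1,\dots,n
\end{aligned}
```

Maximizing the margin is therefore the same as minimizing $\|w\|$ while every training point stays on the correct side of its margin hyperplane.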

Data-Cleaning Workflow

Model Generator

• Strings
• Abbreviation Detection
• Normalized Edit Distance
• Learning String Edit Distance
• Rule Engine
• Numbers
• Dates
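
As an illustration of one of the string features listed above, here is a minimal sketch of a normalized edit distance. It assumes the Levenshtein distance with unit costs, divided by the length of the longer string; the thesis may normalize differently.

```python
def edit_distance(a, b):
    """Levenshtein distance with unit costs for insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(
                prev[j] + 1,         # delete char_a
                curr[j - 1] + 1,     # insert char_b
                prev[j - 1] + cost,  # substitute or keep
            ))
        prev = curr
    return prev[-1]


def normalized_edit_distance(a, b):
    """Scale the edit distance to [0, 1] by the length of the longer string."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


same = normalized_edit_distance("Berlin", "Berlin")
spelling = normalized_edit_distance("Strasse", "Straße")
different = normalized_edit_distance("abc", "xyz")
```

Normalizing makes distances comparable across fields of different length, so one value per field pair can serve as a feature for the classifier.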

Clustering the data

Classification and Backflow

Clustering Results

• Find "best" features and parameters
• Trade-off between quality and size of the search space
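
One way to put numbers on this trade-off is sketched below. The slide does not name its metrics; reduction ratio (how much of the full pairwise search space is avoided) and pairs completeness (how many true matches survive) are the measures assumed here, and the canopies and true matches are made-up examples.

```python
from itertools import combinations


def candidate_pairs(canopies):
    """All record pairs that share at least one canopy."""
    pairs = set()
    for canopy in canopies:
        pairs.update(combinations(sorted(set(canopy)), 2))
    return pairs


def evaluate_blocking(canopies, true_matches, n_records):
    """Return (reduction_ratio, pairs_completeness) for a set of canopies."""
    candidates = candidate_pairs(canopies)
    total_pairs = n_records * (n_records - 1) // 2
    truth = {tuple(sorted(pair)) for pair in true_matches}
    reduction_ratio = 1 - len(candidates) / total_pairs if total_pairs else 0.0
    pairs_completeness = len(candidates & truth) / len(truth) if truth else 1.0
    return reduction_ratio, pairs_completeness


# Hypothetical clustering result and ground truth for five records.
canopies = [[0, 1, 2], [2, 3], [4]]
true_matches = [(0, 1), (2, 3), (1, 4)]
reduction_ratio, pairs_completeness = evaluate_blocking(canopies, true_matches, n_records=5)
```

Looser thresholds produce larger canopies, which raises pairs completeness but lowers the reduction ratio; tighter thresholds do the opposite.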

Classification Results for Dataset I
How does the number of training samples affect the results?

Classification Results for Dataset II
How does the computation of features affect the results?

Summary

• Data Cleaning using Clustering and Classification
• Business Value: Reduced Manpower + Improved Data Quality
• Future Work
  • Improved features
  • Automatic selection of parameters
  • Scalability
