Post on 16-Dec-2015
Potter’s Wheel: An Interactive Data Cleaning System
Vijayshankar Raman
Joseph M. Hellerstein
Outline
Background
Potter’s Wheel architecture
Discrepancy detection
Interactive transformation
Conclusions and Future Work
Motivation
Dirty data common
  E.g., in content integration, e-catalogs
    Inter-organizational differences in data representation
      Home Depot: 60,000 suppliers!
    Data often scraped off web pages, etc.
  E.g., in centralized systems
    Data entry “errors”, poor integrity constraints
Cleansing a prerequisite for analysis, transactions
Cleansing done by “content managers”
  Ease of use critical!
Standards can help a bit (e.g. UDDI)
  But graphical tools are the name of the game
Current solutions
Detect errors in data
  “eyeball” data in a spreadsheet
  data auditing tools
  domain-specific algorithms
Code up transforms to fix errors
  “ETL” (extract/transform/load) tools from warehousing world
    string together domain-specific cleansing rules
  scripting languages, custom code, etc.
Apply transforms on data
Iterate
  special cases
  nested discrepancies, e.g. 19997/10/31
[Figure: cycle of Detect, Code, Apply]
Problems
Slow, batch tasks
Significant human effort!
Specification of transforms
  regular expressions, grammars, custom scripts, etc.
Discrepancy detection
  notion of discrepancy domain-dependent
  want a mix of custom and standard techniques
  want to apply on parts of the data values
Rebecca by Daphne du Maurier (Mass Market Paperback) $6.29 ****
Sonnet 19. Craig W.J., ed. 1914. The Oxford Shakespeare
The Big Four Agatha Christie, Mass market paperback 5.39 10%
(from bartleby.com, bn.com)
Outline
Background
Potter’s Wheel architecture
Discrepancy detection
Interactive Transformation
Conclusions and Future Work
Potter’s Wheel: Design Goals
Eliminate wait time during each step
  Even on big data!
  Use Online Reordering (VLDB ‘99), sampling
  Ensure transform results can be seen/undone instantly
  Compile/optimize sequence of transforms when happy
Eliminate programming, but keep user “in the loop”
  Semi-automatic, “direct manipulation” GUI
  Support & leverage “eyeball” detection, verification (human input)
  Point-and-click transformation “by example”
Unify detection and transformation
  Detection always runs online in the background
  Detection always runs on transformed “view” of data
Extensibility
  Domain experts (vendors) should be able to plug in detectors/transforms
A mixed (“Systems!”) design challenge: Query Processing, HCI, Learning
  Limited appreciation for this kind of systems work
Potter’s Wheel UI
[Screenshot of the UI; callout: data read so far]
Dataflow in Potter’s Wheel
[Figure: dataflow diagram. Components: data source, online reorderer, transformation engine, spreadsheet display, discrepancy detector, optimized program. Edge labels: get page / scrollbar position, scroll, specify/undo transforms, check for errors, compile.]
Outline
Background
Potter’s Wheel architecture
Discrepancy detection
  Domains in Potter’s Wheel
  Structure inference
Interactive Transformation
Conclusions and Future Work
Discrepancy Detection
Challenge: find discrepancies in a column
Structure inference:
  Given:
    A set of (possibly composite) data items, including discrepancies
    A set of user-defined “domains” (atomic types)
  Choose a “structure” for the set
    A string of domains (w/repetition) that best fits the data
    E.g. candidate structures for “March 17, 2000”:
      *
      alpha* digit*, digit*
      [Machr]* 17, int
  Report rows that do not fit the chosen structure
PS: Must be an online algorithm!
Extensible Domains
As in Object-Relational, keep domains opaque.
class Domain {
    // Required inclusion function
    boolean match(char *value);

    // Helps in structure extraction
    int cardinality(int length);

    // For probabilistic discrepancy checking
    float matchWithConfidence(char *value, int dataSetSize);
    void updateState(char *value);

    // Helps in parsing
    boolean isRedundantAfter(Domain d);
}
e.g. integer, ispell word, money, standard part names
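The interface above could be mirrored in Python roughly as follows. This is a minimal sketch, not the system's actual code: the method names are snake_case renderings of the slide's functions, while `IntegerDomain`, the default method bodies, the redundancy rule, and the `discrepancies` helper are assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from typing import List, Optional, Tuple


class Domain(ABC):
    """Opaque user-defined domain, following the slide's interface."""

    @abstractmethod
    def match(self, value: str) -> bool:
        """Required inclusion function."""

    def cardinality(self, length: int) -> Optional[int]:
        """Number of domain values of the given length, if known."""
        return None

    def match_with_confidence(self, value: str, data_set_size: int) -> float:
        """Assumed default: full confidence on a match, none otherwise."""
        return 1.0 if self.match(value) else 0.0

    def update_state(self, value: str) -> None:
        """Hook for domains that learn from the values seen so far."""

    def is_redundant_after(self, other: "Domain") -> bool:
        return False


class IntegerDomain(Domain):
    """Hypothetical example domain: non-empty strings of ASCII digits."""

    def match(self, value: str) -> bool:
        return value.isascii() and value.isdigit()

    def cardinality(self, length: int) -> Optional[int]:
        return 10 ** length  # digit strings of this length

    def is_redundant_after(self, other: Domain) -> bool:
        # Assumption: two adjacent integer domains add nothing to a structure.
        return isinstance(other, IntegerDomain)


def discrepancies(values: List[str], domain: Domain) -> List[Tuple[int, str]]:
    """Report (row index, value) pairs that do not fit the domain."""
    return [(i, v) for i, v in enumerate(values) if not domain.match(v)]


flagged = discrepancies(["2000", "1999", "19a7"], IntegerDomain())
```

Because the detector only calls the interface, a vendor-supplied domain (money, part names, a spell checker) could be plugged in without changing `discrepancies`.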
Evaluating Structure Fit
Three desired characteristics
  Recall
    match as many values as possible
  Precision
    flag as many real discrepancies as possible
    e.g. Month day, day over alpha* digit*, digit*
  Conciseness
    avoid over-fitting examples, make use of the domains
    e.g. alpha* digit*, digit* over March 17, 2000
Evaluating Structure Fit, cont.
Given structure $S = d_1 d_2 \ldots d_p$ and string $v_i$, how good is $S$?
Minimum Description Length (MDL) principle
  Rissanen, ‘78, etc.
  $DL(v_i, S)$ = length of theory for $S$ + length to encode string $v_i$ with $S$
Computing $DL(v, S)$
  1) Length of theory = $p \log(\text{number of domains known})$
  2) If $v_i$ doesn’t match $S$, encode it explicitly
  3) Else encode $v_i = w_{i,1} w_{i,2} \ldots w_{i,p}$ where $w_{i,j} \in d_j$
     Encode the length of each $w_{i,j}$
     Encode each $w_{i,j}$ among all values of $d_j$ with that length (use the cardinality function)
  $DL = \mathrm{AVG}_i((1) + (2) + (3)) = \mathrm{AVG}_i(\text{UnConciseness} + \text{UnPrecision} + \text{UnRecall})$
Choose structure with minimum $DL(v, S)$
  Hard search problem; heuristics in paper
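The three cost terms can be made concrete with a small Python sketch. It is a simplification under stated assumptions, not the paper's algorithm: domains are modelled as regex fragments, a value is parsed with one greedy regex match rather than a search over alternate parses, and `ALPHABET_SIZE`, `MAX_LEN` and `KNOWN_DOMAINS` are made-up constants.

```python
import math
import re
from dataclasses import dataclass

ALPHABET_SIZE = 128  # assumption: values are ASCII
MAX_LEN = 64         # assumption: bound used to encode a substring length
KNOWN_DOMAINS = 4    # assumption: alpha*, digit*, any*, literal


@dataclass(frozen=True)
class Domain:
    name: str
    pattern: str  # regex fragment accepted by this domain
    base: int     # symbols per character; 0 marks a fixed literal

    def cardinality(self, length: int) -> int:
        # Number of distinct domain values with the given length.
        return 1 if self.base == 0 else self.base ** length


ALPHA = Domain("alpha*", "[A-Za-z]*", 52)
DIGIT = Domain("digit*", "[0-9]*", 10)
ANY = Domain("any*", ".*", ALPHABET_SIZE)


def literal(text: str) -> Domain:
    return Domain(repr(text), re.escape(text), 0)


def description_length(values, structure):
    """Average DL in bits of `values` under `structure` (a list of domains)."""
    regex = re.compile("".join(f"({d.pattern})" for d in structure))
    theory = len(structure) * math.log2(KNOWN_DOMAINS)       # term (1)
    total = 0.0
    for v in values:
        m = regex.fullmatch(v)
        if m is None:
            cost = len(v) * math.log2(ALPHABET_SIZE)          # term (2)
        else:
            cost = 0.0                                        # term (3)
            for d, w in zip(structure, m.groups()):
                if d.base:  # a literal has a fixed length, nothing to encode
                    cost += math.log2(MAX_LEN)
                cost += math.log2(d.cardinality(len(w)))
        total += theory + cost
    return total / len(values)


values = ["March 17, 2000", "April 3, 1999", "19997/10/31"]
general = [ANY]
dated = [ALPHA, literal(" "), DIGIT, literal(", "), DIGIT]
dl_general = description_length(values, general)
dl_dated = description_length(values, dated)
```

With these constants the `dated` structure gets a lower description length than the catch-all one even though it fails to match the third value: the cheaper encoding of the two matching rows outweighs the cost of spelling out the discrepancy explicitly.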
Potter’s Wheel UI
Outline
Background
Potter’s Wheel architecture
Discrepancy detection
Interactive Transformation
  transforms
  split-by-example
Conclusions and Future Work
Interactive transformation
Sequence of simple visual transforms rather than a single complex program
Each transform must be
  easy to specify
  immediately applicable on screen rows
Must be able to undo transforms
  compensatory transforms not always possible
  everything REDO-oriented at display-time
    no need for UNDO!
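One way to read the REDO-oriented design: the source rows are never modified, the user's transforms are kept in a log, and the rows on screen are recomputed from the log. Undo then just drops the last log entry. The sketch below illustrates that idea only; `TransformLog` and its methods are hypothetical names, not the system's API.

```python
from typing import Callable, List, Optional

Row = List[str]
# A transform maps one row to a list of rows, which covers both
# one-to-one and one-to-many mappings.
Transform = Callable[[Row], List[Row]]


class TransformLog:
    def __init__(self, source_rows: List[Row]) -> None:
        self._source = [list(r) for r in source_rows]  # never mutated
        self._log: List[Transform] = []

    def apply(self, transform: Transform) -> None:
        self._log.append(transform)

    def undo(self) -> None:
        # No compensating transform needed: forget the last step.
        if self._log:
            self._log.pop()

    def view(self, start: int = 0, count: Optional[int] = None) -> List[Row]:
        """Redo every logged transform on just the requested source rows."""
        end = None if count is None else start + count
        rows = [list(r) for r in self._source[start:end]]
        for transform in self._log:
            rows = [out for r in rows for out in transform(r)]
        return rows


log = TransformLog([["Stewart,Bob"], ["Dole,Jerry"]])
log.apply(lambda r: [[r[0].upper()]])
after_apply = log.view()
log.undo()
after_undo = log.view()
```

Because `view` only touches the rows it is asked for, the cost of a redo is tied to what is on screen rather than to the size of the data set.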
Transforms in Potter’s Wheel
Value translation
  Format(value): reg. expr. substitution, arithmetic ops, …
One-to-one row mappings
  Add/Drop/Copy columns
  Merge, Split columns
  Divide column by predicate
One-to-many row mappings
  Fold columns
    adapted from Fold of SchemaSQL [LSS’96]
    Resolve some higher-order differences
Example (1)
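[Figure, reconstructed from the slide residue; blank cells are empty]
Input
  Stewart,Bob |      |
              | Anna | Davis
  Dole,Jerry  |      |
              | Joan | Marsh
Format '(.*), (.*)' to '\2 \1'
  Bob Stewart |      |
              | Anna | Davis
  Jerry Dole  |      |
              | Joan | Marsh
Split at ' '
  Bob   | Stewart |      |
        |         | Anna | Davis
  Jerry | Dole    |      |
        |         | Joan | Marsh
2 Merges
  Bob   | Stewart
  Anna  | Davis
  Jerry | Dole
  Joan  | Marsh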
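The Format, Split and Merge steps of Example (1) can be imitated on plain lists of strings. This is an illustrative sketch: the three-column input layout is an assumption about the figure, the helper names are hypothetical, and the pattern uses `,\s*` instead of the slide's `', '` so that it also matches values written without a space after the comma.

```python
import re


def format_col(rows, col, pattern, repl):
    """Format: regular-expression substitution on one column."""
    rx = re.compile(pattern)
    return [r[:col] + [rx.sub(repl, r[col])] + r[col + 1:] for r in rows]


def split_col(rows, col, sep):
    """Split: cut one column into two at the first occurrence of sep."""
    out = []
    for r in rows:
        left, _, right = r[col].partition(sep)
        out.append(r[:col] + [left, right] + r[col + 1:])
    return out


def merge_cols(rows, a, b):
    """Merge: concatenate column b onto column a (a < b) and drop b."""
    return [r[:a] + [r[a] + r[b]] + r[a + 1:b] + r[b + 1:] for r in rows]


rows = [
    ["Stewart,Bob", "", ""],
    ["", "Anna", "Davis"],
    ["Dole,Jerry", "", ""],
    ["", "Joan", "Marsh"],
]
rows = format_col(rows, 0, r"(.*),\s*(.*)", r"\2 \1")  # Stewart,Bob -> Bob Stewart
rows = split_col(rows, 0, " ")                         # Bob | Stewart
rows = merge_cols(rows, 0, 2)                          # first names together
rows = merge_cols(rows, 1, 2)                          # last names together
```

Each helper is a pure function from rows to rows, so the same steps can be replayed on any page of data.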
Example (2)
[Figure, reconstructed from the slide residue; blank cells are empty]
Input
  Stewart,Bob
  Anna Davis
  Dole,Jerry
  Joan Marsh
Divide (like ’.*,.*’)
  Stewart,Bob |
              | Anna Davis
  Dole,Jerry  |
              | Joan Marsh
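Divide can be sketched as a function that routes each value to one of two new columns depending on a predicate. The helper names `divide` and `like` are hypothetical, and the row order of the input is an assumption about the figure.

```python
import re


def like(pattern):
    """Build a predicate that is true when the whole value matches pattern."""
    rx = re.compile(pattern)
    return lambda value: rx.fullmatch(value) is not None


def divide(rows, col, predicate):
    """Divide: replace one column by two; matching values go left, others right."""
    out = []
    for r in rows:
        value = r[col]
        pair = [value, ""] if predicate(value) else ["", value]
        out.append(r[:col] + pair + r[col + 1:])
    return out


names = [["Stewart,Bob"], ["Anna Davis"], ["Dole,Jerry"], ["Joan Marsh"]]
divided = divide(names, 0, like(".*,.*"))
```

After the divide, a Format or Split can be applied to the comma-separated column alone without touching the names that are already in the other format.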
Example (3)
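[Figure, reconstructed from the slide residue]
Input
  Name | Math | Bio
  Ann  | 43   | 78
  Bob  | 96   | 54
2 Formats (demotes)
  Name | Math    | Bio
  Ann  | Math:43 | Bio:78
  Bob  | Math:96 | Bio:54
Fold
  Name |
  Ann  | Math:43
  Ann  | Bio:78
  Bob  | Math:96
  Bob  | Bio:54
Split
  Name |      |
  Ann  | Math | 43
  Ann  | Bio  | 78
  Bob  | Math | 96
  Bob  | Bio  | 54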
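The demote, Fold and Split sequence of Example (3) can be sketched the same way. The helpers below are hypothetical and work on the data rows only; how the header row is carried through the figure is not modelled.

```python
def demote(rows, col, name):
    """Format: push a column name into its values, e.g. 43 -> Math:43."""
    return [r[:col] + [f"{name}:{r[col]}"] + r[col + 1:] for r in rows]


def fold(rows, cols):
    """Fold: emit one output row per folded column (one-to-many mapping)."""
    out = []
    for r in rows:
        rest = [v for i, v in enumerate(r) if i not in cols]
        for c in cols:
            out.append(rest + [r[c]])
    return out


def split(rows, col, sep):
    """Split: cut one column into two at the first occurrence of sep."""
    out = []
    for r in rows:
        left, _, right = r[col].partition(sep)
        out.append(r[:col] + [left, right] + r[col + 1:])
    return out


grades = [["Ann", "43", "78"], ["Bob", "96", "54"]]
grades = demote(grades, 1, "Math")
grades = demote(grades, 2, "Bio")
grades = fold(grades, [1, 2])
grades = split(grades, 1, ":")
```

The result has one row per (student, subject) pair, which is the sense in which Fold resolves a higher-order difference: subject names move from the schema into the data.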
Transforms summary
Power
  all one-to-{one,many} row mappings
  many-to-{one,many} mappings hard to do interactively
    must find/display companion rows for each row to transform
  higher-order transforms
Specification
  click on appropriate columns and choose transform
  but, Split is hard
    important transform in screen-scraping/wrapping
    need to enter regular expressions
    not always unambiguous, e.g.
      Taylor, Jane, $52,072
      Tony Smith, 1,00,533
    want to leverage domains
Split by Example
User marks split positions on examples
System infers structure, then parses rest
Parsing
  must identify matching substrings for structures
  multiple alternate parses could work
    search heuristics explored in paper
    DecreasingSpecificity seems good
Taylor, Jane|, $52,072
Tony Smith|, 1,00,533
infer structures < * >, <‘,’ Money>
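The ambiguity in parsing can be seen with a small sketch that applies the inferred structure < * >, <‘,’ Money> to the two example rows. Everything here is an assumption for illustration: the `MONEY` regex stands in for a Money domain, optional whitespace is allowed after the comma, and the "leftmost split" tie-break is a stand-in, not the paper's DecreasingSpecificity heuristic.

```python
import re

MONEY = r"\$?\d+(?:,\d+)*"             # assumed stand-in for a Money domain
SUFFIX = re.compile(r",\s*" + MONEY)   # the <',' Money> part of the structure


def candidate_splits(value):
    """All positions where the rest of the value parses as <',' Money>."""
    return [
        i for i, ch in enumerate(value)
        if ch == "," and SUFFIX.fullmatch(value, i) is not None
    ]


def split_by_structure(value):
    """Pick the leftmost candidate; None means the row does not fit."""
    candidates = candidate_splits(value)
    if not candidates:
        return None
    pos = candidates[0]
    return value[:pos], value[pos:]


taylor = candidate_splits("Taylor, Jane, $52,072")
tony = candidate_splits("Tony Smith, 1,00,533")
parsed = [split_by_structure(v) for v in ("Taylor, Jane, $52,072",
                                           "Tony Smith, 1,00,533",
                                           "no amount here")]
```

Each example row has more than one valid split position because a comma can belong either to the separator or to the amount, which is why a plain "split at ','" is not enough and the parser needs a rule for choosing among alternate parses.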
Related Work
Transformation languages -- e.g. SchemaSQL, YATL
Data cleaning tools
  commercial -- ETL and auditing tools
  research -- e.g. AJAX, Lee/Lu/Ling/Ko ’99
Custom auditing algorithms
  de-duplication (e.g. Hernandez/Stolfo ’97)
  outlier detection (e.g. Ramaswamy/Rastogi/Shim ’00)
  dependency inference (e.g. Kivinen/Mannila ’95)
Structure extraction techniques
  e.g. XTRACT, DataMold, Brazma ‘94
Transformation tools
  text-processing tools -- e.g. perl/awk/sed, LAPIS
  screen-scraping -- e.g. NoDoSE, XWRAP, OnDisplay, Cohera Connect, Telegraph Screen Scraper (TeSS)
Middleware, schema mapping
Conclusions
Interactive data cleaning
  Couple transformation and discrepancy detection
  Perform both interactively
    short, immediately applied steps
    specify visually, undo if needed
    contrast with declarative language
Parse values before discrepancy detection
  user-defined domains helpful
Software online (http://control.cs.berkeley.edu/abc)
Looking Ahead
Generalizing transform by example
Transforming nested data (XML, HTML)
More complex domain-expressions
Extend to generalized query processor
  client in Telegraph
  specify initial query
  refine by specifying transforms as results stream in
  dynamically choose transforms to be pushed into server
See Shankar’s upcoming thesis, Telegraph papers
Backup Slides
Optimization of Transform Sequences
In Potter’s Wheel, the system generates a program at the end
  hence opportunities for optimization
    remove redundant operations
    avoid expensive memory copies/allocations/deallocations by careful pipelining
    materialize intermediate strings only when necessary
up to 110% speedup for C programs
C programs 10x faster than Perl scripts