2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleaning Algorithms
TRANSCRIPT
![Page 1: 2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleaning Algorithms](https://reader036.vdocuments.net/reader036/viewer/2022070509/58a8e5531a28aba6598b5077/html5/thumbnails/1.jpg)
P. Arocena, B. Glavic, G. Mecca, R. J. Miller, P. Papotti, D. Santoro
University of Toronto, Illinois Institute of Technology, Università della Basilicata, Arizona State University
Sep 7th 2016
![Page 2: 2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleaning Algorithms](https://reader036.vdocuments.net/reader036/viewer/2022070509/58a8e5531a28aba6598b5077/html5/thumbnails/2.jpg)
P. Arocena, B. Glavic, G. Mecca, R. J. Miller, P. Papotti, D. Santoro
VLDB 2016 - Sep 7th
Overview
‣ Motivations and Goals
‣ Main Ideas
‣ Optimizations
‣ Experimental Results
![Page 3: 2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleaning Algorithms](https://reader036.vdocuments.net/reader036/viewer/2022070509/58a8e5531a28aba6598b5077/html5/thumbnails/3.jpg)
Motivation
• Data quality is a crucial task in data management
• Many automatic and semi-automatic data-cleaning algorithms have been proposed
  • constraint-based: Beskales et al. VLDB10, Bohannon et al. SIGMOD05, Chu et al. ICDE13, Cong et al. VLDB07, Geerts et al. VLDB14, …
  • statistics-based: Berti-Equille et al. ICDE11, Dasu et al. VLDB12, Prokoshyna et al. VLDB15, Yakout et al. SIGMOD13, …
![Page 4: 2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleaning Algorithms](https://reader036.vdocuments.net/reader036/viewer/2022070509/58a8e5531a28aba6598b5077/html5/thumbnails/4.jpg)
“What is the right tool for my data-cleaning task?”
![Page 5: 2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleaning Algorithms](https://reader036.vdocuments.net/reader036/viewer/2022070509/58a8e5531a28aba6598b5077/html5/thumbnails/5.jpg)
Challenges
• No openly available tools or datasets for benchmarking data-cleaning algorithms
• Approaches are usually evaluated using either
  • manually generated errors: very expensive!
  • automatically introduced errors in clean data: algorithms are highly sensitive to the characteristics of the errors!
• Need for scalable and robust evaluation
![Page 6: 2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleaning Algorithms](https://reader036.vdocuments.net/reader036/viewer/2022070509/58a8e5531a28aba6598b5077/html5/thumbnails/6.jpg)
Contribution
• BART: Benchmarking Algorithms for data Repairing and Translation
  • open-source error-generation system with a high level of control over the errors
• Input: a database that is clean wrt a set of data-quality rules, plus a set of configuration parameters
• Output: a dirty database (obtained through a set of cell changes) and an estimate of how hard it will be to restore the original values
![Page 7: 2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleaning Algorithms](https://reader036.vdocuments.net/reader036/viewer/2022070509/58a8e5531a28aba6598b5077/html5/thumbnails/7.jpg)
Overview
‣ Motivations and Goals
‣ Main Ideas
  ‣ Detectability
  ‣ Repairability
  ‣ Violation-Generation Queries
‣ Optimizations
‣ Experimental Results
![Page 8: 2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleaning Algorithms](https://reader036.vdocuments.net/reader036/viewer/2022070509/58a8e5531a28aba6598b5077/html5/thumbnails/8.jpg)
A Motivating Example
Player

|    | Name     | Season  | Team      | Stadium          | Goals |
|----|----------|---------|-----------|------------------|-------|
| t1 | Giovinco | 2013-14 | Juventus  | Juventus Stadium | 3     |
| t2 | Giovinco | 2014-15 | Toronto   | BMO Field        | 23    |
| t3 | Pirlo    | 2014-15 | Juventus  | Juventus Stadium | 5     |
| t4 | Pirlo    | 2015-16 | N.Y. City | Yankee St.       | 0     |
| t5 | Vidal    | 2014-15 | Juventus  | Juventus Stadium | 5     |
| t6 | Vidal    | 2015-16 | Bayern    | Allianz Arena    | 3     |

Quality Rules
Functional dependencies:
(1) Name, Season → Team
(2) Team → Stadium
Represented as Denial Constraints, a very expressive language that captures most data-quality rules used for data repairing: FDs, CFDs, Cleaning EGDs, Editing Rules, Fixing Rules, Ordering Constraints
dc1: ¬( Player(n, s, t, st, g), Player(n’, s’, t’, st’, g’), n=n’, s=s’, t ≠ t’ )
dc2: ¬( Player(n, s, t, st, g), Player(n’, s’, t’, st’, g’), t=t’, st ≠ st’ )
Violation: an instance I violates ¬(φ(x)) if there is an assignment m s.t. I ⊨ φ(m(x))
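The violation definition can be made concrete with a small check. The sketch below is only an illustration, not Bart's implementation: it encodes the bodies of dc1 and dc2 as Python predicates (the names `dc1`, `dc2` and `violations` are ours) and looks for pairs of Player tuples that satisfy a body.

```python
from itertools import permutations

# Player instance from the slide: (Name, Season, Team, Stadium, Goals)
PLAYER = {
    "t1": ("Giovinco", "2013-14", "Juventus", "Juventus Stadium", 3),
    "t2": ("Giovinco", "2014-15", "Toronto", "BMO Field", 23),
    "t3": ("Pirlo", "2014-15", "Juventus", "Juventus Stadium", 5),
    "t4": ("Pirlo", "2015-16", "N.Y. City", "Yankee St.", 0),
    "t5": ("Vidal", "2014-15", "Juventus", "Juventus Stadium", 5),
    "t6": ("Vidal", "2015-16", "Bayern", "Allianz Arena", 3),
}
NAME, SEASON, TEAM, STADIUM, GOALS = range(5)


def dc1(a, b):
    """Body of dc1: same name and season, different team."""
    return a[NAME] == b[NAME] and a[SEASON] == b[SEASON] and a[TEAM] != b[TEAM]


def dc2(a, b):
    """Body of dc2: same team, different stadium."""
    return a[TEAM] == b[TEAM] and a[STADIUM] != b[STADIUM]


def violations(instance, body):
    """Ordered pairs of tuple ids whose assignment satisfies the body."""
    return [(i, j) for i, j in permutations(instance, 2)
            if body(instance[i], instance[j])]


# The instance is clean: no assignment satisfies either body.
clean_ok = not violations(PLAYER, dc1) and not violations(PLAYER, dc2)
print(clean_ok)
```

An empty result for both constraints is what "clean wrt the rules" means for this instance.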
![Page 9: 2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleaning Algorithms](https://reader036.vdocuments.net/reader036/viewer/2022070509/58a8e5531a28aba6598b5077/html5/thumbnails/9.jpg)
A Motivating Example
Cell Changes
ch1: t5.Stadium := “Camp Nou”
✔ ch1 is a detectable change: dc2 is violated, since t1, t3 and t5 have the same team but different stadiums
  we call {t1, t3, t5} the context equivalence class
✔ easy to correct: the original value “Juventus Stadium” appears in t1 and t3
Repairability: the probability of restoring t5.Stadium to its original value by picking, uniformly at random, a Stadium value from its context equivalence class
Rep = 2 / 3 ≈ 0.67
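The repairability figure for ch1 can be recomputed directly from the definition. This is a minimal sketch under our own assumptions: only the Team and Stadium columns are kept, the context equivalence class is taken to be the tuples sharing t5's Team, and the function names are hypothetical.

```python
from fractions import Fraction

# (Team, Stadium) cells of the Player instance after ch1: t5.Stadium := "Camp Nou"
dirty = {
    "t1": ("Juventus", "Juventus Stadium"),
    "t2": ("Toronto", "BMO Field"),
    "t3": ("Juventus", "Juventus Stadium"),
    "t4": ("N.Y. City", "Yankee St."),
    "t5": ("Juventus", "Camp Nou"),
    "t6": ("Bayern", "Allianz Arena"),
}


def context_class(instance, tid):
    """Tuples that agree with `tid` on Team, the left-hand side of Team -> Stadium."""
    team = instance[tid][0]
    return sorted(t for t, row in instance.items() if row[0] == team)


def repairability(instance, tid, original):
    """Chance of drawing `original` when picking a Stadium uniformly from the class."""
    cls = context_class(instance, tid)
    hits = sum(1 for t in cls if instance[t][1] == original)
    return Fraction(hits, len(cls))


rep = repairability(dirty, "t5", "Juventus Stadium")
print(context_class(dirty, "t5"), rep)
```

Two of the three tuples in the class still carry the original value, which gives the 2/3 on the slide.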
![Page 10: 2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleaning Algorithms](https://reader036.vdocuments.net/reader036/viewer/2022070509/58a8e5531a28aba6598b5077/html5/thumbnails/10.jpg)
A Motivating Example
Cell Changes
ch2: t1.Season := “2014-15”
✔ ch2 is a detectable change: dc1 is violated, since t1 and t2 have the same name and season but different teams, stadiums and goals
✘ hard to correct: the original value “2013-14” disappears from the instance
Repairability: 0 / 2 = 0
![Page 11: 2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleaning Algorithms](https://reader036.vdocuments.net/reader036/viewer/2022070509/58a8e5531a28aba6598b5077/html5/thumbnails/11.jpg)
A Motivating Example
Cell Changes
ch3: t5.Name := “Pirlo”
✘ ch3 is an undetectable change
INTERACTION
ch2: t1.Season := “2014-15” ✔
ch4: t3.Name := “Pirlo” ✔
ch2 and ch4 applied together ✘
We need to keep track of the context of each change
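Whether a single change is detectable can be checked by applying it to the clean instance and re-evaluating the constraint. The sketch below does this for ch3 and ch2 against dc1 only; it assumes the (Name, Season, Team) projection is enough because neither change touches Team or Stadium, and the helper names are ours.

```python
from itertools import combinations

# (Name, Season, Team) projection of the Player instance from the slides
PLAYER = {
    "t1": ("Giovinco", "2013-14", "Juventus"),
    "t2": ("Giovinco", "2014-15", "Toronto"),
    "t3": ("Pirlo", "2014-15", "Juventus"),
    "t4": ("Pirlo", "2015-16", "N.Y. City"),
    "t5": ("Vidal", "2014-15", "Juventus"),
    "t6": ("Vidal", "2015-16", "Bayern"),
}


def dc1_violations(instance):
    """Unordered pairs with equal Name and Season but different Team."""
    return [(i, j) for i, j in combinations(sorted(instance), 2)
            if instance[i][:2] == instance[j][:2]
            and instance[i][2] != instance[j][2]]


def apply_change(instance, tid, attr, value):
    """Return a copy of the instance with one cell changed."""
    row = list(instance[tid])
    row[attr] = value
    changed = dict(instance)
    changed[tid] = tuple(row)
    return changed


after_ch3 = apply_change(PLAYER, "t5", 0, "Pirlo")    # ch3: t5.Name := "Pirlo"
after_ch2 = apply_change(PLAYER, "t1", 1, "2014-15")  # ch2: t1.Season := "2014-15"
print(dc1_violations(after_ch3))  # nothing to detect: t5 now agrees with t3 on Team
print(dc1_violations(after_ch2))  # t1 and t2 now conflict
```

After ch3, t5 matches t3 on Name and Season but also on Team, so no constraint fires and the error stays invisible to a constraint-based tool.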
![Page 12: 2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleaning Algorithms](https://reader036.vdocuments.net/reader036/viewer/2022070509/58a8e5531a28aba6598b5077/html5/thumbnails/12.jpg)
Violation-Generation Queries
• Each comparison of a dc suggests a different strategy for finding cells to modify to generate detectable errors
• Starting from a dc we generate a set of vio-gen queries

|    | Name     | Season  | Team      |
|----|----------|---------|-----------|
| t1 | Giovinco | 2013-14 | Juventus  |
| t2 | Giovinco | 2013-14 | Juventus  |
| t3 | Pirlo    | 2013-14 | N.Y. City |

dc1: ¬( Player(n, s, t, st, g), Player(n’, s’, t’, st’, g’), n=n’, s=s’, t ≠ t’ )

vio-gen query: Player(n, s, t, st, g), Player(n’, s’, t’, st’, g’), n=n’, s=s’, t = t’
Result of the query: t1, t2
We’ll have a detectable change by making t1.Team and t2.Team different
t1.Team := “Juve” ✔

vio-gen query: Player(n, s, t, st, g), Player(n’, s’, t’, st’, g’), n ≠ n’, s=s’, t ≠ t’
Result of the query: t2, t3
We’ll have a detectable change by making t2.Name and t3.Name equal
t3.Name := “Giovinco” ✔
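One way to read the two bullets above: each vio-gen query is the dc body with exactly one comparison flipped. The sketch below follows that reading on the three-tuple instance; the list encoding of dc1 and the function names are assumptions made for illustration, and pairs are reported unordered.

```python
import operator
from itertools import combinations

OPS = {"=": operator.eq, "≠": operator.ne}
FLIP = {"=": "≠", "≠": "="}

# dc1 as comparisons between the same attribute of two Player tuples
DC1 = [("Name", "="), ("Season", "="), ("Team", "≠")]

# Three-tuple instance from the slide
ROWS = {
    "t1": {"Name": "Giovinco", "Season": "2013-14", "Team": "Juventus"},
    "t2": {"Name": "Giovinco", "Season": "2013-14", "Team": "Juventus"},
    "t3": {"Name": "Pirlo", "Season": "2013-14", "Team": "N.Y. City"},
}


def vio_gen_queries(dc):
    """One query per comparison: flip that comparison, keep the others."""
    queries = []
    for k, (target, _) in enumerate(dc):
        query = [(a, FLIP[op] if i == k else op) for i, (a, op) in enumerate(dc)]
        queries.append((target, query))
    return queries


def run(query, rows):
    """Unordered tuple pairs satisfying every comparison of the query."""
    return [(i, j) for i, j in combinations(sorted(rows), 2)
            if all(OPS[op](rows[i][a], rows[j][a]) for a, op in query)]


for target, query in vio_gen_queries(DC1):
    print(target, query, run(query, ROWS))
```

Each returned pair is one attribute change away from violating dc1, and the flipped attribute says which cell to change. The query that flips Name returns the pair t1, t3 as well as the pair t2, t3 highlighted on the slide.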
![Page 13: 2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleaning Algorithms](https://reader036.vdocuments.net/reader036/viewer/2022070509/58a8e5531a28aba6598b5077/html5/thumbnails/13.jpg)
Error-Generation Task
EG-Task E = {S, Σ, I, CONF}
• S: relational schema
• Σ: a set of denial constraints over S
• I: an instance over schema S, clean wrt Σ
• CONF: configuration parameters
  • % of detectable errors, % of random errors
• Theorem 1: Generating the requested number of detectable errors is NP-Complete (data complexity)
![Page 14: 2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleaning Algorithms](https://reader036.vdocuments.net/reader036/viewer/2022070509/58a8e5531a28aba6598b5077/html5/thumbnails/14.jpg)
Overview
‣ Motivations and Goals
‣ Main Ideas
‣ Optimizations
‣ Experimental Results
![Page 15: 2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleaning Algorithms](https://reader036.vdocuments.net/reader036/viewer/2022070509/58a8e5531a28aba6598b5077/html5/thumbnails/15.jpg)
Optimizations
• Greedy PTIME algorithm
  • two cell changes cannot share a context
  • sound but not complete
  • in practice, for low error ratios (~10-20%), the probability of success is very high
• Main cost factor
  • executing vio-gen queries on the DBMS
  • optimizations for symmetric constraints and cross-products
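The "no shared context" rule can be sketched as a single greedy pass. This is an illustration under our own assumptions, not the algorithm from the paper: each candidate is a changed cell plus the set of cells forming its context, and a candidate is skipped as soon as it overlaps anything already used. The candidate list is hypothetical.

```python
def greedy_select(candidates, wanted):
    """Pick up to `wanted` changes whose contexts are pairwise disjoint.

    `candidates` is a list of (cell, context) pairs, where `cell` is a
    (tuple_id, attribute) pair and `context` is the set of cells that make
    the change detectable. Both shapes are assumptions of this sketch.
    """
    used = set()   # cells already changed or serving as context
    chosen = []
    for cell, context in candidates:
        footprint = set(context) | {cell}
        if footprint & used:
            continue  # would share a context with an earlier change: skip
        chosen.append(cell)
        used |= footprint
        if len(chosen) == wanted:
            break
    return chosen


# Hypothetical candidates, e.g. as produced by vio-gen queries
candidates = [
    (("t5", "Stadium"), {("t1", "Team"), ("t1", "Stadium"), ("t5", "Team")}),
    (("t1", "Season"), {("t1", "Name"), ("t1", "Team"), ("t2", "Name"),
                        ("t2", "Season"), ("t2", "Team")}),
    (("t6", "Team"), {("t4", "Team"), ("t4", "Stadium"), ("t6", "Stadium")}),
]
print(greedy_select(candidates, 2))
print(len(greedy_select(candidates, 3)))  # fewer than requested
```

Asking for three changes yields only two here, because the second candidate shares the cell t1.Team with the first. That is the "sound but not complete" trade-off in miniature: every returned change keeps its context intact, but the pass may fall short of the requested number.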
![Page 16: 2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleaning Algorithms](https://reader036.vdocuments.net/reader036/viewer/2022070509/58a8e5531a28aba6598b5077/html5/thumbnails/16.jpg)
Symmetric Constraints
• Computing joins may be expensive!
• We identify a class of DCs (that includes FDs and most CFDs) where group-by can be used to reduce the size of join inputs
• Idea: find and execute isomorphic subqueries to avoid redundant work

Player(n, s, t, st), Player(n’, s’, t’, st’), n=n’, s=s’, t ≠ t’

1. Formula Graph
[Figure: two Player nodes, each with attributes Name, Season, Team, Stadium (variables n, s, t, st and n’, s’, t’, st’), connected by the edges n = n’, s = s’, t ≠ t’]

2. Reduced Formula with adornments
Player(n=, s=, t≠, st)

3. Group-By Query
SELECT name, season, team FROM player
WHERE (name, season) IN
  (SELECT name, season FROM player
   GROUP BY name, season
   HAVING count(DISTINCT team) > 1)
ORDER BY name, season
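To see that the group-by form finds the same tuples as the self-join, both can be run on a small table. The sketch below uses Python's built-in `sqlite3` module with made-up rows; it assumes an SQLite build that supports row values in `IN` (3.15 or later), and the self-join is our own rendering of the formula, not a query taken from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE player (name TEXT, season TEXT, team TEXT, stadium TEXT)")
conn.executemany(
    "INSERT INTO player VALUES (?, ?, ?, ?)",
    [
        ("Giovinco", "2014-15", "Juventus", "Juventus Stadium"),  # made-up dirty row
        ("Giovinco", "2014-15", "Toronto", "BMO Field"),
        ("Pirlo", "2014-15", "Juventus", "Juventus Stadium"),
        ("Vidal", "2015-16", "Bayern", "Allianz Arena"),
    ],
)

# Self-join form: one row per ordered pair of conflicting tuples
JOIN_SQL = """
SELECT p1.name, p1.season, p1.team
FROM player p1 JOIN player p2
  ON p1.name = p2.name AND p1.season = p2.season AND p1.team <> p2.team
ORDER BY p1.name, p1.season, p1.team
"""

# Group-by form: the join input shrinks to the groups with more than one team
GROUP_SQL = """
SELECT name, season, team FROM player
WHERE (name, season) IN
  (SELECT name, season FROM player
   GROUP BY name, season
   HAVING count(DISTINCT team) > 1)
ORDER BY name, season, team
"""

join_rows = conn.execute(JOIN_SQL).fetchall()
group_rows = conn.execute(GROUP_SQL).fetchall()
print(group_rows)
```

The extra `team` in the `ORDER BY` is only there to make the output order deterministic for the comparison.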
![Page 17: 2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleaning Algorithms](https://reader036.vdocuments.net/reader036/viewer/2022070509/58a8e5531a28aba6598b5077/html5/thumbnails/17.jpg)
Cross Products
A Common Pattern
dc4: ¬( Player(n, s, t, st, g), Player(n’, s’, t’, st’, g’), t=t’, st ≠ st’ )
vio-gen query: Player(n, s, t, st, g), Player(n’, s’, t’, st’, g’), t ≠ t’, st ≠ st’
The result of the vio-gen query will be all possible pairs of players with different team and different stadium: quadratic cost
However: we are typically only interested in a small set of cells
Solution: we materialize a random sample of the tuples in Player in main memory and compute the cross product to identify cells to change and their contexts
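The sampling idea can be sketched in a few lines: draw a random subset of tuples, then enumerate pairs only within that subset. The function name, the fixed seed and the dictionary layout below are assumptions for illustration; the actual sampling strategy used by the system is not described on the slide.

```python
import random
from itertools import combinations


def sample_pairs(rows, sample_size, seed=0):
    """Cross product over a random sample: pairs with different Team and Stadium."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    ids = sorted(rows)
    sample = rng.sample(ids, min(sample_size, len(ids)))
    return [(i, j) for i, j in combinations(sorted(sample), 2)
            if rows[i]["Team"] != rows[j]["Team"]
            and rows[i]["Stadium"] != rows[j]["Stadium"]]


# Team and Stadium columns of the Player instance from the earlier slides
ROWS = {
    "t1": {"Team": "Juventus", "Stadium": "Juventus Stadium"},
    "t2": {"Team": "Toronto", "Stadium": "BMO Field"},
    "t3": {"Team": "Juventus", "Stadium": "Juventus Stadium"},
    "t4": {"Team": "N.Y. City", "Stadium": "Yankee St."},
    "t5": {"Team": "Juventus", "Stadium": "Juventus Stadium"},
    "t6": {"Team": "Bayern", "Stadium": "Allianz Arena"},
}

pairs = sample_pairs(ROWS, sample_size=3)
print(pairs)
```

With a sample of k tuples the number of pairs examined is at most k(k-1)/2, independent of the table size. Each returned pair is a candidate: setting one tuple's Team to the other's would violate dc4, and the pair itself is the context of that change.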
![Page 18: 2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleaning Algorithms](https://reader036.vdocuments.net/reader036/viewer/2022070509/58a8e5531a28aba6598b5077/html5/thumbnails/18.jpg)
Overview
‣ Motivations and Goals
‣ Main Ideas
‣ Optimizations
‣ Experimental Results
![Page 19: 2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleaning Algorithms](https://reader036.vdocuments.net/reader036/viewer/2022070509/58a8e5531a28aba6598b5077/html5/thumbnails/19.jpg)
Evaluation of the Tools
Tools
- Llunatic: Geerts et al. VLDB14
- Holistic: Chu et al. ICDE13
- Greedy: Bohannon et al. SIGMOD05, Cong et al. VLDB07
- Sampling: Beskales et al. VLDB10
Tasks
- Constraint-based with 5% errors and different repairability levels: High (~0.8), Med (~0.5), and Low (~0.25)
![Page 20: 2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleaning Algorithms](https://reader036.vdocuments.net/reader036/viewer/2022070509/58a8e5531a28aba6598b5077/html5/thumbnails/20.jpg)
Scalability Results
![Page 21: 2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleaning Algorithms](https://reader036.vdocuments.net/reader036/viewer/2022070509/58a8e5531a28aba6598b5077/html5/thumbnails/21.jpg)
Lessons Learned
• Automated tools are essential for robust and broad empirical evaluations
• Data repairing is not yet mature: no definitive automatic data-repairing algorithm yet
• Repairability matters
  • We need to document our dirty data
  • Algorithms are sensitive to error characteristics!
• Generating errors is hard
![Page 22: 2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleaning Algorithms](https://reader036.vdocuments.net/reader036/viewer/2022070509/58a8e5531a28aba6598b5077/html5/thumbnails/22.jpg)