paper presentation @ sebd '09
DESCRIPTION
Time-completeness trade-offs in record linkage using Adaptive query ProcessingTRANSCRIPT
![Page 1: Paper presentation @ SEBD '09](https://reader034.vdocuments.net/reader034/viewer/2022052523/5550669eb4c905c0448b5481/html5/thumbnails/1.jpg)
Time-completeness trade-offs in record linkage using Adaptive query Processing
Paolo Missier, Alvaro A. A. FernandesSchool of Computer Science, University of Manchester
Roald Lengu, Giovanna GuerriniDISI, Universita' di Genova, Italy
Marco MesitiDiCo, Universita' di Milano, Italy
SEBD 2009, Camogli, ItalyJune, 2009
![Page 2: Paper presentation @ SEBD '09](https://reader034.vdocuments.net/reader034/viewer/2022052523/5550669eb4c905c0448b5481/html5/thumbnails/2.jpg)
Conclusions• Optimistic approach to online record linkage
– Based on implicit referential integrity assumption– When assumption not true, goes back to pessimistic
• Technical approach based on autonomic computing– Adaptive query processing– Mix of exact / approximate physical join operators
• Applications: on-the-fly integration scenarios (mashups, personal dataspaces, sensor data streams)
• Need to relax the referential integrity assumption -- use other sources for result size estimation
P. Missier, SEBD, June 2009, Camogli, Italy
![Page 3: Paper presentation @ SEBD '09](https://reader034.vdocuments.net/reader034/viewer/2022052523/5550669eb4c905c0448b5481/html5/thumbnails/3.jpg)
• slight mismatches in records lead to incomplete integration– due to different encodings, conventions– due to errors in data
Context: On-the-fly data integration
3
A case of record linkage
R !"A=B S
Madchester
Manchester
B AAC D
ManchesterXYZ
XY Manchester ZManchester
• “Situational Applications”• Data mashups and personal dataspaces
P. Missier, SEBD, June 2009, Camogli, Italy
![Page 4: Paper presentation @ SEBD '09](https://reader034.vdocuments.net/reader034/viewer/2022052523/5550669eb4c905c0448b5481/html5/thumbnails/4.jpg)
Offline vs online linkage• Offline record linkage:
– performed once before queries involving joins – 1. reconcile R and S on joining attributes R.A, S.B using
your favourite record linkage technique
4
– 2. perform regular equijoin on the transformed tables:
➡ ok for tables that can be analysed ahead of the join➡ ok when multiple queries issued on integrated tables
R! !" S!
〈R,S〉→ 〈R’,S’〉
P. Missier, SEBD, June 2009, Camogli, Italy
![Page 5: Paper presentation @ SEBD '09](https://reader034.vdocuments.net/reader034/viewer/2022052523/5550669eb4c905c0448b5481/html5/thumbnails/5.jpg)
Offline vs online linkage• Offline record linkage:
– performed once before queries involving joins – 1. reconcile R and S on joining attributes R.A, S.B using
your favourite record linkage technique
4
– 2. perform regular equijoin on the transformed tables:
➡ ok for tables that can be analysed ahead of the join➡ ok when multiple queries issued on integrated tables
R! !" S!
• Online linkage:– performed while answering a query– exact join ⇒ similarity (or approximate) join
〈R,S〉→ 〈R’,S’〉
P. Missier, SEBD, June 2009, Camogli, Italy
![Page 6: Paper presentation @ SEBD '09](https://reader034.vdocuments.net/reader034/viewer/2022052523/5550669eb4c905c0448b5481/html5/thumbnails/6.jpg)
Record linkage and similarity joinsHistorical timeline:
5
from: N. Koudas, S. Sarawagi, and D. Srivastava. Record linkage: similarity measures and algorithms. Tutorial in SIGMOD '06.
P. Missier, SEBD, June 2009, Camogli, Italy
![Page 7: Paper presentation @ SEBD '09](https://reader034.vdocuments.net/reader034/viewer/2022052523/5550669eb4c905c0448b5481/html5/thumbnails/7.jpg)
Record linkage and similarity joinsHistorical timeline:
5
from: N. Koudas, S. Sarawagi, and D. Srivastava. Record linkage: similarity measures and algorithms. Tutorial in SIGMOD '06.
P. Missier, SEBD, June 2009, Camogli, Italy
![Page 8: Paper presentation @ SEBD '09](https://reader034.vdocuments.net/reader034/viewer/2022052523/5550669eb4c905c0448b5481/html5/thumbnails/8.jpg)
sim(s1, s2) =|q(s1) ! q(s2)||q(s1) " q(s2)|
Measuring string similarity using q-grams• q-grams map string s to a set q(s) of substrings of length q:
Ex.: 3-grams:
q(“Manchester”) = {‘Man’, ‘anc’, ‘nch’, ‘che’, ‘hes’, ‘est ’, ‘ste’, ‘ter’}.
q(“Madchester”) = {‘Mad’, ‘adc’, ‘dch’, ‘che’, ‘hes’, ‘est ’, ‘ste’, ‘ter’}.
(Jaccard coefficient)
sim(r1, r2) < !1 ! not match!2 < sim(r1, r2)! match
P. Missier, SEBD, June 2009, Camogli, Italy
![Page 9: Paper presentation @ SEBD '09](https://reader034.vdocuments.net/reader034/viewer/2022052523/5550669eb4c905c0448b5481/html5/thumbnails/9.jpg)
P. Missier, SEBD, June 2009, Camogli, Italy
Similarity symmetric hash join
7
R S
probe
probe
insert
xm
yn
R hash table
insert
y
x
r
s
S hash table
[R.m,S.s]
[R.n, S.r]
• Main sources of complexity:– overhead for storing and indexing q-grams– cost of computing set intersection
Efficient relational representation:[CGK06] S. Chaudhuri, V. Ganti and R. Kaushik,
“A primitive operator for similarity joins in data cleaning” (ICDE’06)
new primitive join operator [CGK06]
set intersection computed using a variation of the symmetric hash join
![Page 10: Paper presentation @ SEBD '09](https://reader034.vdocuments.net/reader034/viewer/2022052523/5550669eb4c905c0448b5481/html5/thumbnails/10.jpg)
Is full similarity join always necessary?
• Pessimistic: always pays full complexity cost• Typical mismatch rate in real datasets around 5%
8
Research Goal: explore optimistic approach• detect mismatches and react as you go• requires estimates of incremental join result size• statistical + reactive ⇒
• expect to sacrifice join result completeness for faster execution
Approach:- combine exact and approximate join operatorsusing Adaptive Query Processing techniques
P. Missier, SEBD, June 2009, Camogli, Italy
![Page 11: Paper presentation @ SEBD '09](https://reader034.vdocuments.net/reader034/viewer/2022052523/5550669eb4c905c0448b5481/html5/thumbnails/11.jpg)
[KC03] J. O. Kephart and D. M. Chess. The vision of autonomic computing. IEEE Computer, 36(1):41–50, 2003.
Autonomic computing framework
9P. Missier, SEBD, June 2009, Camogli, Italy
![Page 12: Paper presentation @ SEBD '09](https://reader034.vdocuments.net/reader034/viewer/2022052523/5550669eb4c905c0448b5481/html5/thumbnails/12.jpg)
monitor
respond assess
Autonomic computing framework
9P. Missier, SEBD, June 2009, Camogli, Italy
![Page 13: Paper presentation @ SEBD '09](https://reader034.vdocuments.net/reader034/viewer/2022052523/5550669eb4c905c0448b5481/html5/thumbnails/13.jpg)
monitor
respond assess
Autonomic computing framework
9
monitor incrementaljoin result size
P. Missier, SEBD, June 2009, Camogli, Italy
![Page 14: Paper presentation @ SEBD '09](https://reader034.vdocuments.net/reader034/viewer/2022052523/5550669eb4c905c0448b5481/html5/thumbnails/14.jpg)
monitor
respond assess
Autonomic computing framework
9
monitor incrementaljoin result size
estimatejoin result size
computedivergencefrom exp
P. Missier, SEBD, June 2009, Camogli, Italy
![Page 15: Paper presentation @ SEBD '09](https://reader034.vdocuments.net/reader034/viewer/2022052523/5550669eb4c905c0448b5481/html5/thumbnails/15.jpg)
monitor
respond assess
Autonomic computing framework
9
monitor incrementaljoin result size
estimatejoin result size
computedivergencefrom exp
swapjoin
operators
• when using exact join:• if observed/estimated sizes diverge “too much”, then switch to approximate join
• when using approximate join:• if observed/estimated converge, then switch to exact join
P. Missier, SEBD, June 2009, Camogli, Italy
![Page 16: Paper presentation @ SEBD '09](https://reader034.vdocuments.net/reader034/viewer/2022052523/5550669eb4c905c0448b5481/html5/thumbnails/16.jpg)
Technical approach and challenges
• Assess:– Need additional assumption for result size estimation– Estimating result size at specific points during join execution
• Respond:– When and how can physical join operators switched?– Can we avoid loss of work?
• operator replacement in pipelined query plans [EFP06]
[EFP06] K. Eurviriyanukul, A. A. A. Fernandes, and N. W. Paton. A foundation for the replacement of pipelined physical join operators in adaptive query processing. In EDBT Workshops 2006, LNCS 4254
10P. Missier, SEBD, June 2009, Camogli, Italy
![Page 17: Paper presentation @ SEBD '09](https://reader034.vdocuments.net/reader034/viewer/2022052523/5550669eb4c905c0448b5481/html5/thumbnails/17.jpg)
Assessment• Expectation of matching records• Used to derive simple result size estimation model
11
• When there are no mismatches:after scanning n < |S| tuples on S:P(a=x in |S| has been matched) = P(tuple c=x is in top n of R) = n/|R|
Thus, join result size On after n tuples is a binomial random variable:
R (parent) S (child)
xc
ydy
x
b
a
n
On ! bin(n,n
|R| )
symmetric hash join, again
P. Missier, SEBD, June 2009, Camogli, Italy
![Page 18: Paper presentation @ SEBD '09](https://reader034.vdocuments.net/reader034/viewer/2022052523/5550669eb4c905c0448b5481/html5/thumbnails/18.jpg)
where Pn,p(n) (.) is the binomial cdf with parameters n, p(n)
Detecting divergent observed result sizeObservation is outlier wrt expected result size On => divergence
12
Pn,p(n)(On ! O) ! !out
On
![Page 19: Paper presentation @ SEBD '09](https://reader034.vdocuments.net/reader034/viewer/2022052523/5550669eb4c905c0448b5481/html5/thumbnails/19.jpg)
where Pn,p(n) (.) is the binomial cdf with parameters n, p(n)
Detecting divergent observed result sizeObservation is outlier wrt expected result size On => divergence
12
Pn,p(n)(On ! O) ! !out
On
outlier detection on experimental datasets with various mismatch patterns
![Page 20: Paper presentation @ SEBD '09](https://reader034.vdocuments.net/reader034/viewer/2022052523/5550669eb4c905c0448b5481/html5/thumbnails/20.jpg)
where Pn,p(n) (.) is the binomial cdf with parameters n, p(n)
Detecting divergent observed result sizeObservation is outlier wrt expected result size On => divergence
12
Pn,p(n)(On ! O) ! !out
On
outlier detection on experimental datasets with various mismatch patterns
Note: when the data does not follow our referential constraint hypothesis, the model leads to a pure
similarity join
![Page 21: Paper presentation @ SEBD '09](https://reader034.vdocuments.net/reader034/viewer/2022052523/5550669eb4c905c0448b5481/html5/thumbnails/21.jpg)
Condition for operator replacement• Goal of AQP:
– switch operators without loss of intermediate work• sufficient condition: switch when the operator reaches
a quiescent state [EPF06]
13
[EPF06] K. Eurviriyanukul, A. A. A. Fernandes, and N. W. Paton. A foundation for the replacement of pipelined physical join operators in adaptive query processing. In EDBT Workshops 2006, LNCS 4254
![Page 22: Paper presentation @ SEBD '09](https://reader034.vdocuments.net/reader034/viewer/2022052523/5550669eb4c905c0448b5481/html5/thumbnails/22.jpg)
Adaptivity: responder state machineEach state corresponds to a combination of physical
join operators
14
left: exactright: exact
left: approximateright: approximate
left: approximateright: exact
left: exactright: approximate
P. Missier, SEBD, June 2009, Camogli, Italy
![Page 23: Paper presentation @ SEBD '09](https://reader034.vdocuments.net/reader034/viewer/2022052523/5550669eb4c905c0448b5481/html5/thumbnails/23.jpg)
Rationale for state transitions
lex /rex
lap /rap
lex /rap
lap /rex
evidence that left and /or right input perturbed
evidence that left and /or right input
no longer perturbed
this is formalized in terms of predicates on the observable variables- outlier detection and more (omitted)
P. Missier, SEBD, June 2009, Camogli, Italy
![Page 24: Paper presentation @ SEBD '09](https://reader034.vdocuments.net/reader034/viewer/2022052523/5550669eb4c905c0448b5481/html5/thumbnails/24.jpg)
Experimental evaluation• Marginal gain of hybrid algorithm:
– level of completeness
16
• R: result size for approx join only• r: result size for exact only• rabs: result size actually observed
grel = (rabs – r) / (R – r)
P. Missier, SEBD, June 2009, Camogli, Italy
![Page 25: Paper presentation @ SEBD '09](https://reader034.vdocuments.net/reader034/viewer/2022052523/5550669eb4c905c0448b5481/html5/thumbnails/25.jpg)
Experimental evaluation• Marginal gain of hybrid algorithm:
– level of completeness
16
• R: result size for approx join only• r: result size for exact only• rabs: result size actually observed
grel = (rabs – r) / (R – r)
• Marginal Cost:– baseline: exact join throughout
• model marginal cost of hybrid algorithm
unit cost of executing one step in one state– (experimental)
• number of steps in each state• unit state transition cost (experimental)• number of state transitions over entire join
P. Missier, SEBD, June 2009, Camogli, Italy
![Page 26: Paper presentation @ SEBD '09](https://reader034.vdocuments.net/reader034/viewer/2022052523/5550669eb4c905c0448b5481/html5/thumbnails/26.jpg)
Test datasetsDatasets chosen as representative of 4 distinct patterns
we expect our results to vary:• uniform perturbation: evidence grows slowly => slow reaction• bursty perturbation: strong evidence => timely reaction
P. Missier, SEBD, June 2009, Camogli, Italy
![Page 27: Paper presentation @ SEBD '09](https://reader034.vdocuments.net/reader034/viewer/2022052523/5550669eb4c905c0448b5481/html5/thumbnails/27.jpg)
Results
P. Missier, SEBD, June 2009, Camogli, Italy
![Page 28: Paper presentation @ SEBD '09](https://reader034.vdocuments.net/reader034/viewer/2022052523/5550669eb4c905c0448b5481/html5/thumbnails/28.jpg)
Results
P. Missier, SEBD, June 2009, Camogli, Italy
![Page 29: Paper presentation @ SEBD '09](https://reader034.vdocuments.net/reader034/viewer/2022052523/5550669eb4c905c0448b5481/html5/thumbnails/29.jpg)
Conclusions• Optimistic approach to online record linkage
– Based on implicit referential integrity assumption– When assumption not true, goes back to pessimistic
• Technical approach based on autonomic computing– Adaptive query processing– Mix of exact / approximate physical join operators
• Applications: on-the-fly integration scenarios (mashups, personal dataspaces, sensor data streams)
• Need to relax the referential integrity assumption -- use other sources for result size estimation
P. Missier, SEBD, June 2009, Camogli, Italy