Reasoning about Record Matching Rules
Wenfei Fan (1, 2), Xibei Jia (1), Shuai Ma (1)
(1) University of Edinburgh, (2) Bell Labs
Jianzhong Li
Harbin Institute of Technology
2
Record matching
To identify tuples (from one or more unreliable relations) that refer to the same real-world object.
Also known as record linkage, entity resolution, data deduplication, merge/purge, …

trans:
FN   LN     post                     phn      when        where  amount
M.   Smith  10 Oak St, EDI, EH8 9LE  null     1pm/7/7/09  EDI    $3,500
…    …      …                        …        …           …      …
Max  Smith  PO Box 25, EDI           3256777  2pm/7/7/09  NYC    $6,300

card:
FN    LN     address                  tel      DOB       gender
Mark  Smith  10 Oak St, EDI, EH8 9LE  3256777  10/27/97  M

The same person?
3
Why bother?
Data quality, data integration, payment card fraud detection, …
World-wide losses in 2006: $4.84 billion (www.sas.com)
card: records for card holders
trans: records for transaction logs
fraud?
4
Nontrivial: A longstanding problem
Pairwise comparing attributes via equality only does not work!
Real-life data is often dirty: errors in the data sources
Data is often represented differently in different sources
5
Matching rules (Hernández & Stolfo, 1995)
IF card[LN, address] = trans[LN, post] AND card[FN] and trans[FN] are similar, THEN identify the two tuples
Accommodate errors in the data sources
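The rule above can be sketched in Python. This is an illustration only: the slides do not say which similarity test is used for FN, so `similar_first_name` (an initial such as "M." matches any name starting with that letter, otherwise an edit distance of at most 2) is an assumption, as are all function names.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similar_first_name(a: str, b: str, max_dist: int = 2) -> bool:
    """Hypothetical similarity test (not from the slides)."""
    a, b = a.strip().lower(), b.strip().lower()
    for x, y in ((a, b), (b, a)):
        # An initial like "m." is taken to match any name starting with "m".
        if len(x) == 2 and x.endswith(".") and y.startswith(x[0]):
            return True
    return edit_distance(a, b) <= max_dist


def rule_matches(card: dict, trans: dict) -> bool:
    """IF card[LN, address] = trans[LN, post] AND FN similar THEN match."""
    return (card["LN"] == trans["LN"]
            and card["address"] == trans["post"]
            and similar_first_name(card["FN"], trans["FN"]))


# Tuples taken from the card and trans tables in the slides.
card = {"FN": "Mark", "LN": "Smith", "address": "10 Oak St, EDI, EH8 9LE"}
t1 = {"FN": "M.", "LN": "Smith", "post": "10 Oak St, EDI, EH8 9LE"}
t2 = {"FN": "Max", "LN": "Smith", "post": "PO Box 25, EDI"}
```

Under these assumptions the rule identifies `card` with `t1` but not with `t2`, whose postal address differs.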
6
A new class of dependencies for record matching
Identifying attributes (not necessarily entire records), across sources
card[tel] = trans[phn] → card[address] ⇌ trans[post]
card[LN, address] = trans[LN, post] ∧ card[FN] ≈ trans[FN] → card[X] ⇌ trans[Y]
What attributes to compare? How to compare them?
2^(m*n) configurations
7
Deducing new dependencies from given ones
Given:
card[tel] = trans[phn] → card[address] ⇌ trans[post]
card[LN, address] = trans[LN, post] ∧ card[FN] ≈ trans[FN] → card[X] ⇌ trans[Y]
Deduced:
card[LN, tel] = trans[LN, phn] ∧ card[FN] ≈ trans[FN] → card[X] ⇌ trans[Y]

trans:
FN   LN     post            phn      when        where  amount
Max  Smith  PO Box 25, EDI  3256777  2pm/7/7/09  NYC    $6,300

card:
FN    LN     address                  tel      DOB       gender
Mark  Smith  10 Oak St, EDI, EH8 9LE  3256777  10/27/97  M

Radically different, yet a match
Matched by the deduced rule, but NOT by the given ones!
8
Error correction, data enrichment, …
The need for matching dependencies and for reasoning about them
1. card[LN, address] = trans[LN, post] ∧ card[FN] ≈ trans[FN] → card[X] ⇌ trans[Y]
2. card[LN, tel] = trans[LN, phn] ∧ card[FN] ≈ trans[FN] → card[X] ⇌ trans[Y]
3. card[tel] = trans[phn] → card[address] ⇌ trans[post]
[Figure labels: inconsistent, enrich, Match]
9
Outline
Matching dependencies (MDs): a departure from traditional dependencies
– Dynamic semantics, similarity operators, across relations
Reasoning about matching dependencies
– A sound and complete inference system
– A low polynomial-time algorithm
Relative candidate keys (RCKs): matching rules
– Deducing RCKs from MDs: an exponential-time problem
– An effective (heuristic) polynomial-time algorithm
– Applications: record matching, blocking, windowing
Experimental study
A dependency theory for record matching
10
Matching dependencies (MDs)
(R1[A1] ≈1 R2[B1] ∧ . . . ∧ R1[Ak] ≈k R2[Bk]) → R1[Z1] ⇌ R2[Z2]
(Aj, Bj): pair of attributes in (R1, R2)
≈j: similarity operator (equality, edit distance, q-gram, Jaro distance, …)
(Z1, Z2): lists of attributes in (R1, R2), of the same length
⇌: matching operator (identify two lists of attributes via updates)
Examples, with R1[X]: card[X] and R2[Y]: trans[Y]
card[LN, address] = trans[LN, post] ∧ card[FN] ≈ trans[FN] → card[X] ⇌ trans[Y]
card[tel] = trans[phn] → card[address] ⇌ trans[post]
card[LN, tel] = trans[LN, phn] ∧ card[FN] ≈ trans[FN] → card[X] ⇌ trans[Y]
Semantic relationship on attributes across different sources
11
Dynamic semantics
φ = (R1[A1] ≈1 R2[B1] ∧ . . . ∧ R1[Ak] ≈k R2[Bk]) → R1[Z1] ⇌ R2[Z2]
Two instances are needed to cope with the dynamic semantics
(D1, D2) satisfies φ iff for all (t1, t2) ∈ D1, if t1[A1] ≈1 t2[B1] ∧ . . . ∧ t1[Ak] ≈k t2[Bk] in D1
– then (t1, t2) ∈ D2, and t1[Z1] = t2[Z2] in D2
If (t1, t2) match the LHS, then their RHS are updated and equalized

D1:
trans: phn = 3256777, post = PO Box 25, EDI
card:  tel = 3256777, address = 10 Oak St, EDI

D2:
trans: phn = 3256777, post = 10 Oak St, EDI, EH8 9LE
card:  tel = 3256777, address = 10 Oak St, EDI, EH8 9LE
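The two-instance semantics can be checked mechanically. The sketch below is only one way to encode it. Representing an instance as a dict of tuple ids, an MD as a pair of lists, and the name `satisfies` are all assumptions made for illustration.

```python
import operator


def satisfies(md, d1, d2):
    """Return True if the pair of instances (d1, d2) satisfies the MD.

    md = (lhs, rhs) with
      lhs: list of (card_attr, trans_attr, similarity_function)
      rhs: list of (card_attr, trans_attr) pairs to be identified.
    An instance is {"card": {id: tuple}, "trans": {id: tuple}}; d2 holds
    the updated versions of the same tuples, keyed by the same ids.
    """
    lhs, rhs = md
    for cid, c1 in d1["card"].items():
        for tid, t1 in d1["trans"].items():
            # The LHS is evaluated in the first instance ...
            if all(sim(c1[a], t1[b]) for a, b, sim in lhs):
                c2 = d2["card"].get(cid)
                t2 = d2["trans"].get(tid)
                if c2 is None or t2 is None:
                    return False
                # ... and the RHS must be equal in the second instance.
                if any(c2[a] != t2[b] for a, b in rhs):
                    return False
    return True


# card[tel] = trans[phn] -> card[address] is identified with trans[post]
md = ([("tel", "phn", operator.eq)], [("address", "post")])

D1 = {"card": {1: {"tel": "3256777", "address": "10 Oak St, EDI"}},
      "trans": {1: {"phn": "3256777", "post": "PO Box 25, EDI"}}}
D2 = {"card": {1: {"tel": "3256777", "address": "10 Oak St, EDI, EH8 9LE"}},
      "trans": {1: {"phn": "3256777", "post": "10 Oak St, EDI, EH8 9LE"}}}
```

With the slide's data, `(D1, D2)` satisfies the MD, whereas `(D1, D1)` does not, because the addresses have not yet been equalized.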
12
An extension of functional dependencies (FDs)?
A departure from traditional dependency theory
MD: (R1[A1] ≈1 R2[B1] ∧ . . . ∧ R1[Ak] ≈k R2[Bk]) → R1[Z1] ⇌ R2[Z2], to accommodate unreliable data
FD: tel → address, developed for schema design for "clean" data
– similarity operators vs. equality (=) only
– across different relations (R1, R2) vs. on a single relation
– dynamic semantics (matching operator ⇌) vs. static semantics

D1 (violation of the FD):
tel      address
3256777  10 Oak St, EDI
3256777  PO Box 25, EDI

D2:
tel      address
3256777  10 Oak St, EDI, EH8 9LE
3256777  10 Oak St, EDI, EH8 9LE

(D1, D2): satisfying the MD
13
An inference system for deduction of MDs
There is a finite set of axioms sound and complete for MD deduction
Example: MD φ is provable from {φ1, φ2} by using the inference system
φ1: card[tel] = trans[phn] → card[address] ⇌ trans[post]
φ1′ (from φ1 by the Augmentation Rule): card[LN, tel] = trans[LN, phn] → card[LN, address] ⇌ trans[LN, post]
φ2: card[LN, address] = trans[LN, post] ∧ card[FN] ≈ trans[FN] → card[X] ⇌ trans[Y]
φ (via the Transitivity Rule, using φ1′ and φ2): card[LN, tel] = trans[LN, phn] ∧ card[FN] ≈ trans[FN] → card[X] ⇌ trans[Y]
Recall Armstrong's axioms for FDs
More involved than Armstrong's axioms (11 axioms vs. 3): two relations, generic reasoning for similarity operators
14
An algorithm for deducing MDs from given MDs
Algorithm: MDClosure
Input: a set Σ of MDs and a single MD φ
Output: yes if φ can be deduced from Σ, in O(n^2) time
Main ideas:
Store deduced MDs in a table M
Process M based on inference rules, until M becomes stable
– If the LHS of an MD is in M, then its RHS is added to M
Return yes if the RHS of φ is in M, and no otherwise
The algorithm is well designed to have low complexity: O(n^2), comparable to O(n) time for FDs
The deduction analysis can be conducted efficiently
15
An algorithm for deducing MDs from given MDs
Example: MD φ can be deduced from {φ1, φ2}
φ: card[LN, tel] = trans[LN, phn] ∧ card[FN] ≈ trans[FN] → card[X] ⇌ trans[Y]
φ1: card[tel] = trans[phn] → card[address] ⇌ trans[post]
φ2: card[LN, address] = trans[LN, post] ∧ card[FN] ≈ trans[FN] → card[X] ⇌ trans[Y]
Step 1: M = {card[LN, tel] = trans[LN, phn], card[FN] ≈ trans[FN]} (add the LHS of φ)
Step 2: M = M ∪ {card[address] = trans[post]} (apply φ1)
Step 3: M = M ∪ {card[X] = trans[Y]} (apply φ2)
Return yes
A match may be found by deduced MDs, but NOT by given ones
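The three steps can be mimicked by a small fixpoint loop. What follows is a deliberately simplified sketch, not the MDClosure algorithm of the slides. It assumes that every attribute list is split into single attribute pairs, that X and Y are treated as one symbolic pair, and that an equality fact also satisfies a similarity condition on the same pair. It ignores the remaining inference rules for similarity operators and makes no attempt to reach the stated O(n^2) bound.

```python
def holds(atom, M):
    """An LHS condition holds if it is in M, or if equality on the pair is."""
    a, b, op = atom
    return (a, b, op) in M or (a, b, "=") in M


def md_closure(sigma, phi):
    """Simplified deduction test: can phi be derived from the MDs in sigma?

    An MD is (lhs, rhs): lhs is a list of (card_attr, trans_attr, op)
    with op in {"=", "≈"}; rhs is a list of (card_attr, trans_attr).
    """
    lhs, rhs = phi
    M = set(lhs)                      # Step 1: add the LHS of phi
    changed = True
    while changed:                    # process M until it becomes stable
        changed = False
        for s_lhs, s_rhs in sigma:
            if all(holds(atom, M) for atom in s_lhs):
                for a, b in s_rhs:
                    if (a, b, "=") not in M:
                        M.add((a, b, "="))   # identified pairs become equal
                        changed = True
    return all((a, b, "=") in M for a, b in rhs)


phi1 = ([("tel", "phn", "=")], [("address", "post")])
phi2 = ([("LN", "LN", "="), ("address", "post", "="), ("FN", "FN", "≈")],
        [("X", "Y")])
phi = ([("LN", "LN", "="), ("tel", "phn", "="), ("FN", "FN", "≈")],
       [("X", "Y")])
# A weaker candidate that should NOT be deducible: tel = phn alone.
phi_weak = ([("tel", "phn", "=")], [("X", "Y")])
```

Calling `md_closure([phi1, phi2], phi)` reproduces Steps 1 to 3 and answers yes, while `phi_weak` is rejected because the conditions on LN and FN are never established.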
16
Relative Candidate Keys (RCKs)
MD: (R1[A1] ≈1 R2[B1] ∧ . . . ∧ R1[Ak] ≈k R2[Bk]) → R1[X] ⇌ R2[Y]
RCK: (R1[A1, …, Ak], R2[B1, …, Bk] || [≈1, . . ., ≈k]), relative to R1[X] and R2[Y]: what to compare and how to compare
Examples, with R1[X]: card[X] and R2[Y]: trans[Y]
card[LN, address] = trans[LN, post] ∧ card[FN] ≈ trans[FN] → card[X] ⇌ trans[Y]
RCK: (card[LN, address, FN], trans[LN, post, FN] || [=, =, ≈])
card[LN, tel] = trans[LN, phn] ∧ card[FN] ≈ trans[FN] → card[X] ⇌ trans[Y]
RCK: (card[LN, tel, FN], trans[LN, phn, FN] || [=, =, ≈])
card[tel] = trans[phn] → card[address] ⇌ trans[post]: NOT an RCK
A departure from candidate keys: similarity, different sources
Ultimate goal: to decide whether R1[X] and R2[Y] refer to the same object
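An RCK can be read directly as a comparison recipe. The sketch below applies the two RCKs above to the "Max Smith" transaction from the slides. The similarity operator ≈ is stood in for by `difflib.SequenceMatcher` with a 0.5 threshold, which is purely an assumption for illustration. The slides list edit distance, q-grams and Jaro distance as possible choices.

```python
import operator
from difflib import SequenceMatcher


def similar(a: str, b: str, threshold: float = 0.5) -> bool:
    """Stand-in similarity operator (assumed, not from the slides)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


OPERATORS = {"=": operator.eq, "≈": similar}


def matches(rck, card_tuple, trans_tuple) -> bool:
    """Compare the attribute pairs listed in the RCK with its operators."""
    return all(OPERATORS[op](card_tuple[a], trans_tuple[b])
               for a, b, op in rck)


# (card[LN, address, FN], trans[LN, post, FN] || [=, =, ≈])
rck_address = [("LN", "LN", "="), ("address", "post", "="), ("FN", "FN", "≈")]
# (card[LN, tel, FN], trans[LN, phn, FN] || [=, =, ≈])
rck_tel = [("LN", "LN", "="), ("tel", "phn", "="), ("FN", "FN", "≈")]

card = {"FN": "Mark", "LN": "Smith",
        "address": "10 Oak St, EDI, EH8 9LE", "tel": "3256777"}
max_smith = {"FN": "Max", "LN": "Smith",
             "post": "PO Box 25, EDI", "phn": "3256777"}
```

Under these assumptions `rck_tel` matches the two tuples while `rck_address` does not, which is the situation described earlier: a match found by a deduced key but not by the given one.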
17
What is special about RCKs?
Matching rules: identify records from unreliable data sources
Optimization: efficiency is a big issue for record matching
– blocking: partition D into blocks (B1, B2, B3) via discriminating attributes; only records in the same block are compared
– windowing (sorted neighborhood): sort D via keys and slide a window of a fixed size over it; only records in the same window are compared
The match quality is highly dependent on the choices of keys
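Both optimizations reduce the set of record pairs that are ever compared. The following sketch generates candidate pairs for each strategy. The function names are assumptions, and the four sample records (including the "Jones" and "Smyth" rows) are made up for the example rather than taken from the slides.

```python
from collections import defaultdict
from itertools import combinations


def blocking_pairs(records, key):
    """Group records by a blocking key; compare only within a block."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[key(rec)].append(idx)
    return {pair for ids in blocks.values() for pair in combinations(ids, 2)}


def windowing_pairs(records, key, window):
    """Sorted neighborhood: sort by key, compare records that fall
    inside a sliding window of a fixed size."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[pos + 1: pos + window]:
            pairs.add((min(i, j), max(i, j)))
    return pairs


# Hypothetical records, identified by their list index.
records = [
    {"LN": "Smith", "tel": "3256777"},
    {"LN": "Smith", "tel": "3256777"},
    {"LN": "Jones", "tel": "1112222"},
    {"LN": "Smyth", "tel": "3256777"},
]

by_ln = blocking_pairs(records, key=lambda r: r["LN"])
by_tel = blocking_pairs(records, key=lambda r: r["tel"])
window_ln = windowing_pairs(records, key=lambda r: r["LN"], window=2)
```

Blocking on LN never compares "Smith" with "Smyth", while blocking on tel does, which illustrates why the choice of key matters for match quality.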
18
Deducing quality RCKs from MDs
Input: a set Σ of MDs, (R1[X], R2[Y]), and a number k
Output: a set of top k RCKs deduced from Σ
A quality metric:
– nonredundancy
– the diversity of attributes
– the lengths of attributes
– the accuracy of attributes
Nontrivial: first compute ALL RCKs, and then pick the top-k: exponential time
The deduction analysis can be conducted efficiently
19
A heuristic algorithm for deducing quality RCKs
Algorithm: findRCKs
Input: a set Σ of MDs, (R1[X], R2[Y]), and a number k
Output: a set of top k RCKs deduced from Σ, in O(k*n^3) time
n: the size of Σ (meta-data)
Main ideas
(R1[X], R2[Y] || [=, …, =]) itself is an RCK
A notion of completeness: stop if the RCKs deduced from Σ are already "covered" by smaller RCKs found so far
Deduction: from an RCK (R1[V1, Z1], R2[V2, Z2] || [≈, …, ≈]) and an MD (R1[U1] ≈ R2[U2] → R1[Z1] ⇌ R2[Z2]), derive a new RCK (R1[V1, U1], R2[V2, U2] || [≈, …, ≈])
Make use of algorithm MDClosure to deduce RCKs
One can efficiently deduce keys for matching, blocking, windowing
20
A heuristic algorithm for deducing quality RCKs
Example: Given a set Σ = {φ1, φ2} of MDs and (card[X], trans[Y]), deduce RCKs {rck1, rck2, rck3}.
φ1: card[LN, address] = trans[LN, post] ∧ card[FN] ≈ trans[FN] → card[X] ⇌ trans[Y]
φ2: card[tel] = trans[phn] → card[address] ⇌ trans[post]
Step 1: rck1 = (card[X], trans[Y] || [=, …, =])
Step 2: rk2 = (card[LN, address, FN], trans[LN, post, FN] || [=, =, ≈]) (apply φ1 to rck1)
Step 3: rck2 = minimize(rk2)
Step 4: rk3 = (card[LN, tel, FN], trans[LN, phn, FN] || [=, =, ≈]) (apply φ2 to rck2)
Step 5: rck3 = minimize(rk3)
Return {rck1, rck2, rck3}.
Minimize: remove redundant attribute pairs in an RCK
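The "apply an MD to an RCK" step of this example can be sketched as a substitution: the attribute pairs identified by the MD's RHS are replaced with the conditions of its LHS. This is an illustrative reading of the example, not the findRCKs algorithm itself. The representation, the name `apply_md`, and the treatment of X and Y as one symbolic pair are assumptions, and the minimize step (which would rely on MDClosure) is left out.

```python
def apply_md(rck, md):
    """Derive a candidate key by replacing the MD's RHS pairs in the RCK
    with the MD's LHS conditions. Returns None if the MD does not apply.

    rck: list of (card_attr, trans_attr, op)
    md:  (lhs, rhs) with lhs a list of (card_attr, trans_attr, op)
         and rhs a list of (card_attr, trans_attr)
    """
    lhs, rhs = md
    rhs_pairs = set(rhs)
    rck_pairs = {(a, b) for a, b, _ in rck}
    if not rhs_pairs <= rck_pairs:
        return None
    kept = [atom for atom in rck if (atom[0], atom[1]) not in rhs_pairs]
    present = {(a, b) for a, b, _ in kept}
    # Add LHS conditions for pairs that are not already compared.
    return kept + [atom for atom in lhs if (atom[0], atom[1]) not in present]


phi1 = ([("LN", "LN", "="), ("address", "post", "="), ("FN", "FN", "≈")],
        [("X", "Y")])
phi2 = ([("tel", "phn", "=")], [("address", "post")])

rck1 = [("X", "Y", "=")]          # Step 1
rk2 = apply_md(rck1, phi1)        # Step 2: apply φ1 to rck1
rk3 = apply_md(rk2, phi2)         # Step 4: apply φ2 to rk2 (already minimal here)
```

In this small example `rk2` and `rk3` contain no redundant pairs, so they coincide with rck2 and rck3.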
21
Experimental study: The reasoning algorithms
The algorithm scales well (100 seconds for 2k MDs & 50 RCKs)
scales well with the number of MDs
also scales well with k – the number of RCKs
22
The number of RCKs derived
Sufficient quality RCKs can be deduced from a small number of MDs
Quality: reasonably diverse
23
Experimental study: Match quality (FS)
Fellegi-Sunter method – a statistical method in action
Credit payment data scraped from the Web (relations of arity 21 and 13, with (X, Y) of length 11)
7 MDs, using Damerau-Levenshtein distance, soundex for similarity
Precision: ratio of true matches found to all matches found; recall: ratio of true matches found to all true matches
improving the precision without lowering the recall
RCKs indeed improve the match quality (up to 20%)
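For reference, the two measures can be computed as follows. The function name and the sample pairs are assumptions made for the illustration and are unrelated to the experimental data.

```python
def precision_recall(found, true_matches):
    """Precision: correct matches / all matches found.
    Recall: correct matches / all true matches."""
    found, true_matches = set(found), set(true_matches)
    correct = len(found & true_matches)
    precision = correct / len(found) if found else 0.0
    recall = correct / len(true_matches) if true_matches else 0.0
    return precision, recall


# Hypothetical (card id, trans id) pairs.
found = {(1, 1), (1, 2), (2, 3)}
truth = {(1, 1), (1, 2), (3, 3), (4, 4)}
precision, recall = precision_recall(found, truth)
```

Here two of the three reported pairs are correct, and two of the four true matches are found.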
24
Experimental study: Efficiency (FS)
RCKs do not incur extra cost while improving match quality
comparable performance
25
Experimental study: Precision (SN)
RCKs consistently improve the precision (by 20%)
Sorted neighborhood method – a rule-based method insensitive to data size
26
Experimental study: Recall (SN)
RCKs consistently improve the recall (by 20%)
27
Experimental study: Efficiency (SN)
RCKs reduce the number of comparisons and improve efficiency by 30%
28
Experimental study: Blocking
RCKs make effective blocking (windowing) keys
similar results for windowing
Partial RCKs as keys for blocking
Pair completeness: S/N, where S and N are the numbers of matches with and without blocking
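Pair completeness can be computed from the set of true matches and the candidate pairs that survive blocking. In the sketch below, N is taken to be the number of true matches (what an exhaustive comparison would find) and S the number of those still among the candidates. This reading, the function name, and the sample pairs are assumptions.

```python
def pair_completeness(true_matches, candidate_pairs):
    """S / N: share of true matches that remain after blocking."""
    true_matches = set(true_matches)
    if not true_matches:
        return 1.0
    s = len(true_matches & set(candidate_pairs))   # matches with blocking
    n = len(true_matches)                          # matches without blocking
    return s / n


# Hypothetical record-index pairs.
truth = {(0, 1), (0, 3), (2, 4)}
candidates = {(0, 1), (2, 4), (1, 2)}
pc = pair_completeness(truth, candidates)
```

A value close to 1 means the blocking key loses few true matches.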
29
Summary
A dependency theory for matching unreliable records
– Matching dependencies, relative candidate keys: dynamic semantics, similarity operators, across unreliable data sources
– A sound and complete inference system
– An O(n^2)-time algorithm for the deduction analysis
– An efficient (heuristic) algorithm for deducing quality RCKs
Record matching, optimization (blocking, windowing)
A practical tool for deducing matching rules
Future work
– Negative rules: if condition then NO match
– Conditions with constants
– Interaction of record matching and data repairing: currently treated as separate processes