Reasoning about Record Matching Rules

Wenfei Fan (1, 2), Xibei Jia (1), Shuai Ma (1)

(1) University of Edinburgh, (2) Bell Labs

Jianzhong Li

Harbin Institute of Technology


Record matching

To identify tuples (from one or more unreliable relations) that refer to the same real-world object.

Also known as: record linkage, entity resolution, data deduplication, merge/purge, …

card:
FN    LN     address                  tel      DOB       gender
Mark  Smith  10 Oak St, EDI, EH8 9LE  3256777  10/27/97  M

trans:
FN   LN     post                     phn      when        where  amount
M.   Smith  10 Oak St, EDI, EH8 9LE  null     1pm/7/7/09  EDI    $3,500
…    …      …                        …        …           …      …
Max  Smith  PO Box 25, EDI           3256777  2pm/7/7/09  NYC    $6,300

Do these tuples refer to the same person?


Why bother?

Data quality, data integration, payment card fraud detection, …

World-wide losses in 2006: $4.84 billion (www.sas.com)

Records for card holders (card) are matched against records for transaction logs (trans): is a transaction a fraud?


Nontrivial: A longstanding problem

Pairwise comparison of attributes via equality alone does not work!

– Real-life data is often dirty: errors in the data sources
– Data is often represented differently in different sources


Matching rules (Hernández & Stolfo, 1995)

IF card[LN, address] = trans[LN, post] AND card[FN] and trans[FN] are similar, THEN identify the two tuples

Accommodate errors in the data sources
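As a rough illustration, the rule above can be applied to the example tuples from the card and trans relations. This is only a sketch: the helper `similar_first_name` is a hypothetical similarity test (equal, or one name is an initial or prefix of the other) chosen for this example, not an operator prescribed by the slides.

```python
def similar_first_name(a: str, b: str) -> bool:
    """Hypothetical similarity test: one name is a prefix (or initial) of the other."""
    a, b = a.rstrip(".").lower(), b.rstrip(".").lower()
    return bool(a) and bool(b) and (a.startswith(b) or b.startswith(a))


def rule_matches(card: dict, trans: dict) -> bool:
    """IF card[LN, address] = trans[LN, post] AND card[FN] ~ trans[FN] THEN match."""
    return (
        card["LN"] == trans["LN"]
        and card["address"] == trans["post"]
        and similar_first_name(card["FN"], trans["FN"])
    )


# Example tuples taken from the card and trans tables.
card = {"FN": "Mark", "LN": "Smith", "address": "10 Oak St, EDI, EH8 9LE", "tel": "3256777"}
t1 = {"FN": "M.", "LN": "Smith", "post": "10 Oak St, EDI, EH8 9LE", "phn": None}
t2 = {"FN": "Max", "LN": "Smith", "post": "PO Box 25, EDI", "phn": "3256777"}

print(rule_matches(card, t1))  # True
print(rule_matches(card, t2))  # False
```

The rule identifies the first transaction with the card holder, but not the second one, whose address is represented differently.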


A new class of dependencies for record matching

Identifying attributes (not necessarily entire records), across sources

card[tel] = trans[phn] → card[address] ⇌ trans[post]

card[LN, address] = trans[LN, post] ∧ card[FN] ≈ trans[FN] → card[X] ⇌ trans[Y]

X: a list of attributes of card; Y: a list of attributes of trans

What attributes to compare? How to compare them? 2^(m*n) configurations


Deducing new dependencies from given ones

Given:
card[tel] = trans[phn] → card[address] ⇌ trans[post]
card[LN, address] = trans[LN, post] ∧ card[FN] ≈ trans[FN] → card[X] ⇌ trans[Y]

Deduced:
card[LN, tel] = trans[LN, phn] ∧ card[FN] ≈ trans[FN] → card[X] ⇌ trans[Y]

Matched by the deduced rule, but NOT by the given ones!
(the matched card and trans tuples look radically different)


The need for matching dependencies and for reasoning about them

1. card[LN, address] = trans[LN, post] ∧ card[FN] ≈ trans[FN] → card[X] ⇌ trans[Y]
2. card[LN, tel] = trans[LN, phn] ∧ card[FN] ≈ trans[FN] → card[X] ⇌ trans[Y]
3. card[tel] = trans[phn] → card[address] ⇌ trans[post]

Beyond matching: error correction, data enrichment, …


Outline

Matching dependencies (MDs): a departure from traditional dependencies
– Dynamic semantics, similarity operators, across relations

Reasoning about matching dependencies
– A sound and complete inference system
– A low polynomial-time algorithm

Relative candidate keys (RCKs): matching rules
– Deducing RCKs from MDs: an exponential-time problem
– An effective (heuristic) polynomial-time algorithm
– Applications: record matching, blocking, windowing

Experimental study

A dependency theory for record matching


Matching dependencies (MDs)

(R1[A1] ≈_1 R2[B1] ∧ … ∧ R1[Ak] ≈_k R2[Bk]) → R1[Z1] ⇌ R2[Z2]

Semantic relationship on attributes across different sources

– (Aj, Bj): pair of attributes in (R1, R2)
– ≈_j: similarity operator (equality, edit distance, q-gram, Jaro distance, …)
– (Z1, Z2): lists of attributes in (R1, R2), of the same length
– ⇌: matching operator (identify two lists of attributes via updates)

Examples, with R1[X]: card[X], R2[Y]: trans[Y]:
– card[LN, address] = trans[LN, post] ∧ card[FN] ≈ trans[FN] → card[X] ⇌ trans[Y]
– card[tel] = trans[phn] → card[address] ⇌ trans[post]
– card[LN, tel] = trans[LN, phn] ∧ card[FN] ≈ trans[FN] → card[X] ⇌ trans[Y]
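To make the notion of a similarity operator concrete, here is a minimal sketch of one of the operators listed above, a q-gram comparison. The Jaccard overlap, the padding character `#`, and the threshold of 0.5 are assumptions made for this illustration; the slides do not fix a particular definition.

```python
def qgrams(s: str, q: int = 2) -> set:
    """Set of q-grams of s, padded with '#' so that short strings still yield grams."""
    padded = "#" * (q - 1) + s.lower() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def qgram_similar(a: str, b: str, threshold: float = 0.5, q: int = 2) -> bool:
    """Similarity test: Jaccard overlap of the q-gram sets is at least the threshold."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    if not ga and not gb:
        return True
    return len(ga & gb) / len(ga | gb) >= threshold


print(qgram_similar("Smith", "Smyth"))  # True  (4 shared bigrams out of 8)
print(qgram_similar("Smith", "Jones"))  # False (no shared bigram)
```

By construction this test is reflexive and symmetric, and two equal strings are always similar.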


Dynamic semantics

φ = (R1[A1] ≈_1 R2[B1] ∧ … ∧ R1[Ak] ≈_k R2[Bk]) → R1[Z1] ⇌ R2[Z2]

Two instances are needed to cope with the dynamic semantics

(D1, D2) satisfies φ iff for all (t1, t2) ∈ D1,
– if t1[A1] ≈_1 t2[B1] ∧ … ∧ t1[Ak] ≈_k t2[Bk] in D1,
– then (t1, t2) ∈ D2, and t1[Z1] = t2[Z2] in D2

If (t1, t2) match the LHS, then their RHS are updated and equalized

D1:
trans:  phn      post                     …
        3256777  PO Box 25, EDI
card:   tel      address                  …
        3256777  10 Oak St, EDI

D2:
trans:  phn      post                     …
        3256777  10 Oak St, EDI, EH8 9LE
card:   tel      address                  …
        3256777  10 Oak St, EDI, EH8 9LE
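The satisfaction condition can be sketched in a few lines. The representation below is an assumption for illustration only: each instance is a dictionary from a pair identifier to a (card tuple, trans tuple) pair, so that the same pair can be looked up before (D1) and after (D2) the update. The function and variable names are hypothetical.

```python
from operator import eq


def satisfies(d1, d2, lhs, rhs):
    """Check whether the instance pair (d1, d2) satisfies one MD.

    d1, d2: dicts mapping a pair id to (card_tuple, trans_tuple).
    lhs:    list of (card_attr, trans_attr, similarity_function).
    rhs:    list of (card_attr, trans_attr) pairs to be identified.
    """
    for pair_id, (c1, t1) in d1.items():
        # The LHS is evaluated in d1 ...
        if all(sim(c1[a], t1[b]) for a, b, sim in lhs):
            # ... and the RHS must be equal in d2.
            if pair_id not in d2:
                return False
            c2, t2 = d2[pair_id]
            if any(c2[a] != t2[b] for a, b in rhs):
                return False
    return True


# card[tel] = trans[phn] -> card[address] <=> trans[post]
lhs = [("tel", "phn", eq)]
rhs = [("address", "post")]

d1 = {0: ({"tel": "3256777", "address": "10 Oak St, EDI"},
          {"phn": "3256777", "post": "PO Box 25, EDI"})}
d2 = {0: ({"tel": "3256777", "address": "10 Oak St, EDI, EH8 9LE"},
          {"phn": "3256777", "post": "10 Oak St, EDI, EH8 9LE"})}

print(satisfies(d1, d2, lhs, rhs))  # True: the addresses are equalized in D2
print(satisfies(d1, d1, lhs, rhs))  # False: without the update the RHS still differs
```

The second call shows why a single instance is not enough: read statically on D1 alone, the dependency would simply be violated.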


An extension of functional dependencies (FDs)?

A departure from traditional dependency theory

FD: tel → address
– developed for schema design for “clean” data

MD: (R1[A1] ≈_1 R2[B1] ∧ … ∧ R1[Ak] ≈_k R2[Bk]) → R1[Z1] ⇌ R2[Z2]
– to accommodate unreliable data

MDs vs. FDs:
– similarity operators vs. equality (=) only
– across different relations (R1, R2) vs. on a single relation
– dynamic semantics (matching operator ⇌) vs. static semantics

D1:
tel      address                  …
3256777  10 Oak St, EDI
3256777  PO Box 25, EDI

D2:
tel      address                  …
3256777  10 Oak St, EDI, EH8 9LE
3256777  10 Oak St, EDI, EH8 9LE

D1 is a violation of the FD, while (D1, D2) satisfies the MD.


An inference system for deduction of MDs

There is a finite set of axioms that is sound and complete for MD deduction (recall Armstrong’s axioms for FDs)

Example: MD φ is provable from {φ1, φ2} by using the inference system

φ1: card[tel] = trans[phn] → card[address] ⇌ trans[post]

φ1’: card[LN, tel] = trans[LN, phn] → card[LN, address] ⇌ trans[LN, post]   (Augmentation Rule, from φ1)

φ2: card[LN, address] = trans[LN, post] ∧ card[FN] ≈ trans[FN] → card[X] ⇌ trans[Y]

φ: card[LN, tel] = trans[LN, phn] ∧ card[FN] ≈ trans[FN] → card[X] ⇌ trans[Y]   (Transitivity Rule, from φ1’ and φ2)

More involved than Armstrong’s axioms (11 axioms vs. 3): two relations, generic reasoning for similarity operators


An algorithm for deducing MDs from given MDs

Algorithm: MDClosure
Input: a set Σ of MDs and a single MD φ
Output: yes if φ can be deduced from Σ, and no otherwise, in O(n^2) time

Main ideas:
– Store deduced MDs in a table M
– Process M based on inference rules, until M becomes stable: if the LHS of an MD is in M, then its RHS is added to M
– Return yes if the RHS of φ is in M, and no otherwise

The algorithm is designed to have low complexity: O(n^2), comparable to O(n) time for FDs

The deduction analysis can be conducted efficiently


An algorithm for deducing MDs from given MDs

Example: MD φ can be deduced from {φ1, φ2}

φ: card[LN, tel] = trans[LN, phn] ∧ card[FN] ≈ trans[FN] → card[X] ⇌ trans[Y]

Step 1: M = {card[LN, tel] = trans[LN, phn], card[FN] ≈ trans[FN]}   (add the LHS of φ)

Step 2: M = M ∪ {card[address] = trans[post]}   (apply φ1)

Step 3: M = M ∪ {card[X] = trans[Y]}   (apply φ2)

Return yes

A match may be found by deduced MDs, but NOT by given ones
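The three steps above can be mimicked by a simple fixpoint loop. The sketch below is a deliberately simplified stand-in, not the MDClosure algorithm itself: it only chains MDs whose LHS is already in M, it assumes that equality implies every similarity operator, it treats X and Y as opaque attribute names, and it makes no attempt to reach the O(n^2) bound. All identifiers are hypothetical.

```python
EQ, SIM = "=", "~"


def implied(conjunct, m):
    """A conjunct (a, b, op) holds in m if it is stored, or if op is a
    similarity test and the equality of the same attribute pair is stored."""
    a, b, op = conjunct
    return conjunct in m or (op != EQ and (a, b, EQ) in m)


def deducible(sigma, phi):
    """Simplified closure test: True if phi follows from sigma by chaining."""
    lhs, rhs = phi
    m = set(lhs)                      # Step 1: add the LHS of phi
    changed = True
    while changed:                    # process m until it becomes stable
        changed = False
        for md_lhs, md_rhs in sigma:
            if all(implied(c, m) for c in md_lhs):
                new = {(a, b, EQ) for a, b in md_rhs} - m
                if new:               # the RHS of a fired MD is added to m
                    m |= new
                    changed = True
    return all((a, b, EQ) in m for a, b in rhs)


# Each MD is (LHS conjuncts, RHS attribute pairs); attributes are (card, trans).
phi1 = ([("tel", "phn", EQ)], [("address", "post")])
phi2 = ([("LN", "LN", EQ), ("address", "post", EQ), ("FN", "FN", SIM)], [("X", "Y")])
phi = ([("LN", "LN", EQ), ("tel", "phn", EQ), ("FN", "FN", SIM)], [("X", "Y")])

print(deducible([phi1, phi2], phi))  # True
print(deducible([phi2], phi))        # False: the address link is missing
```

Dropping the MD on tel and phn makes the deduction fail, which mirrors the role of Step 2 in the example.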


Relative Candidate Keys (RCKs)

A key relative to R1[X] and R2[Y] is an MD of the form

(R1[A1] ≈_1 R2[B1] ∧ … ∧ R1[Ak] ≈_k R2[Bk]) → R1[X] ⇌ R2[Y]

written as

(R1[A1, …, Ak], R2[B1, …, Bk] || [≈_1, …, ≈_k])

which states what to compare and how to compare.

Examples, with R1[X]: card[X], R2[Y]: trans[Y]:

– card[LN, address] = trans[LN, post] ∧ card[FN] ≈ trans[FN] → card[X] ⇌ trans[Y]
  written as (card[LN, address, FN], trans[LN, post, FN] || [=, =, ≈])

– card[LN, tel] = trans[LN, phn] ∧ card[FN] ≈ trans[FN] → card[X] ⇌ trans[Y]
  written as (card[LN, tel, FN], trans[LN, phn, FN] || [=, =, ≈])

– card[tel] = trans[phn] → card[address] ⇌ trans[post]
  NOT an RCK

A departure from candidate keys: similarity, different sources

Ultimate goal: to decide whether R1[X] and R2[Y] refer to the same object


What is special about RCKs?

Matching rules: identify records from unreliable data sources

Optimization: efficiency is a big issue for record matching
– blocking: partition D into blocks (B1, B2, B3, …) via discriminating attributes; only records in the same block are compared
– windowing (sorted neighborhood): sort D via keys and slide a window of a fixed size over it; only records in the same window are compared

The match quality is highly dependent on the choices of keys
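Both optimizations restrict which pairs of records are compared at all. The following sketch generates the candidate pairs for each strategy; the record layout and the choice of `tel` and `LN` as keys are assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def blocking_pairs(records, key):
    """Candidate pairs under blocking: only records sharing a block key."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[key(rec)].append(idx)
    pairs = set()
    for members in blocks.values():
        pairs.update(combinations(members, 2))
    return pairs


def window_pairs(records, key, size):
    """Candidate pairs under windowing: sort by key, slide a fixed-size window."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        # Pair record i with the next size - 1 records in sorted order.
        for j in order[pos + 1:pos + size]:
            pairs.add((min(i, j), max(i, j)))
    return pairs


records = [
    {"LN": "Smith", "tel": "3256777"},
    {"LN": "Smith", "tel": "3256777"},
    {"LN": "Jones", "tel": "1112222"},
    {"LN": "Smyth", "tel": "3256777"},
]

print(sorted(blocking_pairs(records, key=lambda r: r["tel"])))
# [(0, 1), (0, 3), (1, 3)]
print(sorted(window_pairs(records, key=lambda r: r["LN"], size=2)))
# [(0, 1), (0, 2), (1, 3)]
```

With four records there are six possible pairs; each strategy compares only three of them, and which three depends entirely on the chosen key.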


Deducing quality RCKs from MDs

Input: a set Σ of MDs, (R1[X], R2[Y]), and a number k

Output: a set of top k RCKs deduced from Σ

The deduction analysis can be conducted efficiently

A quality metric:
– nonredundancy
– the diversity of attributes
– the lengths of attributes
– the accuracy of attributes

Nontrivial: first computing ALL RCKs and then picking the top k takes exponential time


A heuristic algorithm for deducing quality RCKs

Algorithm: findRCKs
Input: a set Σ of MDs, (R1[X], R2[Y]), and a number k
Output: a set of top k RCKs deduced from Σ, in O(k*n^3) time
n: the size of Σ (meta-data)

Main ideas
– (R1[X], R2[Y] || [=, …, =]) itself is an RCK
– Deduction: from (R1[V1, Z1], R2[V2, Z2] || [≈, …, ≈]) and an MD (R1[U1] ≈ R2[U2] → R1[Z1] ⇌ R2[Z2]), derive a new RCK (R1[V1, U1], R2[V2, U2] || [≈, …, ≈])
– A notion of completeness: stop if the RCKs deduced from Σ are already “covered” by smaller RCKs already found
– Make use of algorithm MDClosure to deduce RCKs

One can efficiently deduce keys for matching, blocking, windowing


A heuristic algorithm for deducing quality RCKs

Example: Given a set {φ1, φ2} of MDs and (card[X], trans[Y]), deduce RCKs {rck1, rck2, rck3}.

Step 1: rck1 = (card[X], trans[Y] || [=, …, =])

Step 2: rk2 = (card[LN, address, FN], trans[LN, post, FN] || [=, =, ≈])   (apply φ1 to rck1)

Step 3: rck2 = minimize(rk2)

Step 4: rk3 = (card[LN, tel, FN], trans[LN, phn, FN] || [=, =, ≈])   (apply φ2 to rck2)

Step 5: rck3 = minimize(rk3)

Return {rck1, rck2, rck3}.

Minimize: remove redundant attribute pairs in an RCK
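The deduction step used in this example, replacing the attribute pairs on the RHS of an MD by the pairs on its LHS, can be sketched as follows. This is an illustration under assumptions, not the findRCKs algorithm: the minimize step, the quality metric and the top-k selection are omitted, (X, Y) is treated as a single opaque attribute pair, and the MDs are named by their content (`md_name_address`, `md_tel`) rather than by number.

```python
EQ, SIM = "=", "~"


def apply_md(rck, md):
    """Replace the RHS attribute pairs of md inside rck by the LHS of md.

    rck: list of (card_attr, trans_attr, operator).
    md:  (LHS conjuncts, RHS attribute pairs).
    Returns the new candidate key, or None if md does not apply to rck.
    """
    lhs, rhs = md
    covered = {(a, b) for a, b, _ in rck}
    if not all(pair in covered for pair in rhs):
        return None
    kept = [c for c in rck if (c[0], c[1]) not in rhs]
    kept_pairs = {(a, b) for a, b, _ in kept}
    added = [c for c in lhs if (c[0], c[1]) not in kept_pairs]
    return kept + added


md_name_address = (
    [("LN", "LN", EQ), ("address", "post", EQ), ("FN", "FN", SIM)],
    [("X", "Y")],
)
md_tel = ([("tel", "phn", EQ)], [("address", "post")])

rck1 = [("X", "Y", EQ)]                  # (card[X], trans[Y] || [=, ..., =])
rk2 = apply_md(rck1, md_name_address)    # compare LN, address/post, FN
rk3 = apply_md(rk2, md_tel)              # address/post replaced by tel/phn

print(rk2)
print(rk3)
```

Applying `md_tel` directly to `rck1` returns `None` in this sketch, because the pair (address, post) is hidden inside the opaque (X, Y) pair.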


Experimental study: The reasoning algorithms

The algorithm scales well (100 seconds for 2k MDs & 50 RCKs)

scales well with the number of MDs

also scales well with k, the number of RCKs


The number of RCKs derived

Sufficient quality RCKs can be deduced from a small number of MDs

Quality: reasonably diverse

Experimental study: Match quality (FS)

RCKs indeed improve the match quality (up to 20%), improving the precision without lowering the recall

– Fellegi-Sunter method: a statistical method in action
– Credit payment data scraped from the Web (relations of arity 21 and 13, with (X, Y) of length 11)
– 7 MDs, using Damerau-Levenshtein distance and soundex for similarity
– Precision (ratio of true matches found to all matches found), recall (ratio of true matches found to all true matches)
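For reference, the two quality measures can be computed as below. The sets of matched pairs are made-up values used only to exercise the function; they are not results from the experiments.

```python
def precision_recall(found: set, true_matches: set):
    """Precision: correct matches / all matches found.
    Recall: correct matches / all true matches."""
    correct = len(found & true_matches)
    precision = correct / len(found) if found else 0.0
    recall = correct / len(true_matches) if true_matches else 0.0
    return precision, recall


# Hypothetical pairs (card id, trans id), for illustration only.
found = {(1, 1), (2, 2), (3, 4)}
true_matches = {(1, 1), (2, 2), (3, 3), (5, 5)}

precision, recall = precision_recall(found, true_matches)
print(round(precision, 3), recall)  # 0.667 0.5
```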


Experimental study: Efficiency (FS)

RCKs do not incur extra cost while improving match quality

comparable performance


Experimental study: Precision (SN)

RCKs consistently improve the precision (by 20%)

Sorted neighborhood method: a rule-based method

insensitive to data size


Experimental study: Recall (SN)

RCKs consistently improve the recall (by 20%)


Experimental study: Efficiency (SN)

RCKs reduce the number of comparisons and improve efficiency by 30%


Experimental study: Blocking

RCKs make effective blocking (windowing) keys

similar results for windowing

Partial RCKs as keys for blocking

Pair completeness: S/N, where S and N are the numbers of matches with and without blocking


Summary

A dependency theory for matching unreliable records
– Matching dependencies, relative candidate keys: dynamic semantics, similarity operators, across unreliable data sources
– A sound and complete inference system
– An O(n^2)-time algorithm for the deduction analysis
– An efficient (heuristic) algorithm for deducing quality RCKs

A practical tool for deducing matching rules
– Record matching, optimization (blocking, windowing)

Future work
– Negative rules: if condition then NO match
– Conditions with constants
– Interaction of record matching and data repairing: currently treated as separate processes
