On Concise Set of Relative Candidate Keys Shaoxu Song (Tsinghua), Lei Chen (HKUST), Hong Cheng (CUHK)


Post on 20-Jan-2016


Page 1: On Concise Set of Relative Candidate Keys Shaoxu Song (Tsinghua), Lei Chen (HKUST), Hong Cheng (CUHK)

On Concise Set of Relative Candidate Keys

Shaoxu Song (Tsinghua), Lei Chen (HKUST), Hong Cheng (CUHK)

Page 2:

Example of Matching Keys

• For identifying the same real-world entities, matching keys specify

– What attributes to compare and

– How to compare them

• ψ2 : (name,address,department∥[0,4],[0,2],[0,0])

• states that for any tuples ti, tj in a relation

– if their distance on attribute name is in [0, 4], i.e., ≥ 0 and ≤ 4,

– their distance on address is in [0, 2], and

– they have the same department, i.e., distance in [0, 0],

– then ti and tj must be identified on ssn

     SSN     Name         Address    Department
t1   234***  Jason Smith  Mark Road  Social Science
t2   2****3  J Smith      Mark Rd    Social Science

[Fan et al., VLDB09]
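The check that ψ2 performs on t1 and t2 can be sketched as follows. The slides do not name the distance function, so this sketch assumes character-level edit (Levenshtein) distance; the helper names edit_distance and agrees are illustrative, not taken from the paper.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (assumed metric; the slides do not specify one)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def agrees(ti: dict, tj: dict, key: dict) -> bool:
    """True if every compared attribute has a distance inside its range."""
    return all(lo <= edit_distance(ti[attr], tj[attr]) <= hi
               for attr, (lo, hi) in key.items())


# psi2 : (name, address, department || [0,4], [0,2], [0,0])
psi2 = {"name": (0, 4), "address": (0, 2), "department": (0, 0)}

t1 = {"name": "Jason Smith", "address": "Mark Road", "department": "Social Science"}
t2 = {"name": "J Smith", "address": "Mark Rd", "department": "Social Science"}

print(agrees(t1, t2, psi2))  # True: distances are 4, 2 and 0
```

Under this assumed metric, t1 and t2 fall inside all three ranges, so ψ2 identifies them on ssn.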

Page 3:

Relative Candidate Keys (RCKs)

• A matching key ψ2 is said to be redundant w.r.t. a relation if

– all the tuple pairs that can be identified by ψ2

– can also be identified by another ψ1

• ψ1 : (name,address∥[0,4],[0,3])

• ψ2 : (name,address,department∥[0,4],[0,3],[0,0])

• RCKs, a special group of matching keys

– the number of compared attributes is minimized

– Analogous to candidate keys, w.r.t. functional dependencies

     SSN     Name           Address      Department
t1   234***  Jason Smith    Mark Road    Social Science
t2   2****3  J Smith        Mark Rd      Social Science
t3   862***  Wixom J Smith  Park St      Social Science
t4   862***  W J Smith      Park Street  Social Science

ψ1 is an RCK

[Fan et al., VLDB09]

Page 4:

Minimal Matching Keys

• Redundancy issues exist

– not only w.r.t. “what attributes to compare”

– but also in “how to compare them”

• ψ1 : (name,address∥[0,4],[0,2])

• ψ3 : (name,address∥[0,0],[0,2])

• Redundancy among matching keys on the same attributes

– any tuple pair agreeing on ψ3, with name distance in [0, 0], always satisfies the [0, 4] restriction of ψ1

     SSN     Name          Address      Department
t1   234***  Jason Smith   Mark Road    Social Science
t2   2****3  Smith         Mark Rd      Social Science
t3   862***  Smith         Park St      Social Science
t4   862***  Will J Smith  Park Street  Social Science
t5   0****5  C Green       Mark Road    Computing
t6   0****5  C Green       Mark Rd      Computing

ψ3 is an RCK but not minimal

Page 5:

Reliable Matching Keys

• Consider a training data instance

– the same real-world entities in attribute Y are pre-identified

– e.g., the matching tuple pairs (t1,t2),(t3,t4),(t5,t6) on ssn

• Support

– the number of tuple pairs covered by ψ, reported as a fraction of all tuple pairs in the instance

• Confidence

– the proportion of covered tuple pairs that correspond to true identifications on Y

• ψ5 : (name, address ∥ [0, 4], [0, 4])

supp(ψ5) = 4/15, conf(ψ5) = 3/4

     SSN     Name          Address      Department
t1   234863  Jason Smith   Mark Road    Social Science
t2   234863  J Smith       Mark Rd      Social Science
t3   862731  W J Smith     Park St      Social Science
t4   862731  Will J Smith  Park Street  Social Science
t5   068335  C Green       Mark Road    Computing
t6   068335  C Green       Mark Rd      Computing
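The support and confidence of ψ5 on this instance can be reproduced with a short sketch. It assumes character-level edit (Levenshtein) distance, which the slides do not state explicitly, and it normalizes support by the 15 unordered tuple pairs of the six tuples. All helper names are illustrative.

```python
from fractions import Fraction
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (assumed metric)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


# Training instance from the slide; ssn plays the role of Y.
rows = {
    "t1": {"ssn": "234863", "name": "Jason Smith", "address": "Mark Road"},
    "t2": {"ssn": "234863", "name": "J Smith", "address": "Mark Rd"},
    "t3": {"ssn": "862731", "name": "W J Smith", "address": "Park St"},
    "t4": {"ssn": "862731", "name": "Will J Smith", "address": "Park Street"},
    "t5": {"ssn": "068335", "name": "C Green", "address": "Mark Road"},
    "t6": {"ssn": "068335", "name": "C Green", "address": "Mark Rd"},
}

# psi5 : (name, address || [0,4], [0,4])
psi5 = {"name": (0, 4), "address": (0, 4)}


def agrees(ti: dict, tj: dict, key: dict) -> bool:
    return all(lo <= edit_distance(ti[a], tj[a]) <= hi
               for a, (lo, hi) in key.items())


all_pairs = list(combinations(rows, 2))  # 15 unordered pairs
covered = [(a, b) for a, b in all_pairs if agrees(rows[a], rows[b], psi5)]
correct = [(a, b) for a, b in covered if rows[a]["ssn"] == rows[b]["ssn"]]

supp = Fraction(len(covered), len(all_pairs))
conf = Fraction(len(correct), len(covered))
print(supp, conf)  # 4/15 3/4
```

Under this assumed metric, the one covered pair that is not a true match is (t2, t3): their names and addresses are close, but their ssn values differ.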

Page 6:

Matching Key Set

• Consider a set Φ of matching keys relative to the same Y

– ψ1 : (name,address∥[0,4],[0,2])

– ψ6 : (name,department∥[0,0],[0,0])

• A tuple pair may agree on (be covered by) several keys ψ ∈ Φ

• To avoid duplicate counting, consider the distinct tuple pairs that are covered by a set of matching keys

     SSN     Name         Address      Department
t1   234863  Jason Smith  Mark Road    Social Science
t2   234863  J Smith      Mark Rd      Social Science
t3   862731  W J Smith    Park St      Business
t4   862731  W J Smith    Park Street  Business
t5   068335  C Green      Mark Road    Computing
t6   068335  C Green      Mark Rd      Computing

supp(ψ1) = 2/15, supp(ψ6) = 2/15, supp(Φ) = 3/15
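Counting distinct covered pairs amounts to taking the union of the pair sets covered by each key. The sketch below reproduces the three support values above, again assuming character-level edit distance; cover and the other helper names are illustrative.

```python
from fractions import Fraction
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (assumed metric)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


rows = {
    "t1": {"name": "Jason Smith", "address": "Mark Road", "department": "Social Science"},
    "t2": {"name": "J Smith", "address": "Mark Rd", "department": "Social Science"},
    "t3": {"name": "W J Smith", "address": "Park St", "department": "Business"},
    "t4": {"name": "W J Smith", "address": "Park Street", "department": "Business"},
    "t5": {"name": "C Green", "address": "Mark Road", "department": "Computing"},
    "t6": {"name": "C Green", "address": "Mark Rd", "department": "Computing"},
}

psi1 = {"name": (0, 4), "address": (0, 2)}
psi6 = {"name": (0, 0), "department": (0, 0)}
all_pairs = list(combinations(rows, 2))


def cover(key: dict) -> set:
    """Set of tuple pairs whose distances fall inside every range of the key."""
    return {(a, b) for a, b in all_pairs
            if all(lo <= edit_distance(rows[a][x], rows[b][x]) <= hi
                   for x, (lo, hi) in key.items())}


supp_psi1 = Fraction(len(cover(psi1)), len(all_pairs))
supp_psi6 = Fraction(len(cover(psi6)), len(all_pairs))
# The union counts the shared pair (t5, t6) only once.
supp_phi = Fraction(len(cover(psi1) | cover(psi6)), len(all_pairs))
print(supp_psi1, supp_psi6, supp_phi)  # 2/15 2/15 1/5 (1/5 is 3/15 reduced)
```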

Page 7:

Hardness and Solutions

• Given a relation instance r of R, a Y over R, a constant k, and

– the minimum requirements of support ηs and confidence ηc,

• To find a set Φ of matching keys such that

– supp(Φ) ≥ ηs, conf(Φ) ≥ ηc, and

– the size of the set |Φ| is minimized

• The problem is NP-hard

• Greedy solution

– Select a ψ with the maximum support in each iteration

– Repeat until the minimum support ηs is satisfied
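The greedy loop can be sketched in the style of greedy set cover. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes the covered pair set of each candidate key is precomputed, that candidates were already filtered by the confidence requirement ηc, and that "maximum support" means the largest number of newly covered pairs. The function name greedy_select is hypothetical.

```python
from fractions import Fraction


def greedy_select(cover: dict, total_pairs: int, eta_s: Fraction):
    """Pick keys until the covered fraction of pairs reaches eta_s.

    cover maps a key name to the set of tuple pairs it covers.
    Returns the selected key names, or None if eta_s cannot be reached.
    """
    selected, covered = [], set()
    remaining = dict(cover)
    while Fraction(len(covered), total_pairs) < eta_s:
        # Assumption: rank candidates by pairs not yet covered.
        best = max(remaining, key=lambda k: len(remaining[k] - covered),
                   default=None)
        if best is None or not (remaining[best] - covered):
            return None  # no candidate adds coverage
        selected.append(best)
        covered |= remaining.pop(best)
    return selected


# Pair sets of psi1 and psi6 from the Matching Key Set slide.
cover = {
    "psi1": {("t1", "t2"), ("t5", "t6")},
    "psi6": {("t3", "t4"), ("t5", "t6")},
}
print(greedy_select(cover, 15, Fraction(3, 15)))  # ['psi1', 'psi6']
print(greedy_select(cover, 15, Fraction(4, 15)))  # None
```

The second call illustrates the case where the support requirement is too high for any valid key set to exist.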

Page 8:

Redundancy Free Results

• Subsume: on distance restrictions,

– [0,4] subsumes [0,2]

• Dominate: ψ1 ≺ ψ2,

– if all distance restrictions in ψ1 subsume the corresponding restrictions of ψ2

• Minimal: a ψ is minimal

– if there does not exist any ψ′ such that

– ψ′ ≺ ψ (and conf(ψ′) ≥ ηc)

• Minimal matching keys are always RCKs

– the “minimal” definition is stricter than the RCK definition

• Greedy algorithm returns minimal results

– For any ψ1 ≺ ψ2, we have supp(ψ1) ≥ supp(ψ2)

ψ1 : (name,address∥[0,4],[0,2])
ψ2 : (name,address,department∥[0,4],[0,2],[0,0])
ψ3 : (name,address∥[0,0],[0,2])
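The subsume and dominate relations on these three keys can be sketched as below. Two points are assumptions rather than statements from the slides: an attribute that ψ1 does not compare is treated as unrestricted (so a key on fewer attributes can dominate one on more), and dominance is taken to be strict (a key does not dominate itself).

```python
def subsumes(outer: tuple, inner: tuple) -> bool:
    """[lo, hi] range `outer` subsumes `inner` if it contains it."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def dominates(k1: dict, k2: dict) -> bool:
    """k1 ≺ k2: every restriction of k1 subsumes the one in k2.

    Assumption: attributes missing from k1 are unrestricted, so only
    the attributes that k1 actually compares need to be checked.
    """
    if k1 == k2:
        return False  # assumption: dominance is strict
    return all(attr in k2 and subsumes(rng, k2[attr])
               for attr, rng in k1.items())


psi1 = {"name": (0, 4), "address": (0, 2)}
psi2 = {"name": (0, 4), "address": (0, 2), "department": (0, 0)}
psi3 = {"name": (0, 0), "address": (0, 2)}

print(subsumes((0, 4), (0, 2)))  # True
print(dominates(psi1, psi2))     # True: psi2 adds a restriction
print(dominates(psi1, psi3))     # True: [0,4] subsumes [0,0] on name
print(dominates(psi3, psi1))     # False
```

Any pair covered by a dominated key is also covered by the dominating key, which is why supp(ψ1) ≥ supp(ψ2) whenever ψ1 ≺ ψ2.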

Page 9:

Pruning Idea

• If ψ1 is selected into the result set Φ

– any ψ2 with ψ1 ≺ ψ2 makes no further contribution to supp(Φ), i.e., supp({ψ1}) = supp({ψ1, ψ2})

– ψ2 can be directly ignored

• Example: suppose that ψ1 is selected into Φ

– supp({ψ1}) = supp({ψ1, ψ2}) = supp({ψ1, ψ3}) = 2/15

– ψ2 and ψ3 can be pruned in the following computation

     SSN     Name          Address      Department
t1   234863  Jason Smith   Mark Road    Social Science
t2   234863  J Smith       Mark Rd      Social Science
t3   862731  W J Smith     Park St      Social Science
t4   862731  Will J Smith  Park Street  Social Science
t5   068335  C Green       Mark Road    Computing
t6   068335  C Green       Mark Rd      Computing

ψ1 : (name,address∥[0,4],[0,2])
ψ2 : (name,address,department∥[0,4],[0,2],[0,0])
ψ3 : (name,address∥[0,0],[0,2])
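The pruning step can be sketched as a filter over the remaining candidates once a key has been selected. This is an illustrative sketch, not the authors' code: prune is a hypothetical helper, attributes a key does not compare are assumed to be unrestricted, and ψ6 (from the Matching Key Set slide) is added only to show a candidate that survives.

```python
def dominates(k1: dict, k2: dict) -> bool:
    """k1 ≺ k2 under the assumption that missing attributes are unrestricted."""
    if k1 == k2:
        return False
    return all(attr in k2
               and rng[0] <= k2[attr][0] and k2[attr][1] <= rng[1]
               for attr, rng in k1.items())


def prune(candidates: dict, selected: dict) -> dict:
    """Drop every candidate dominated by the key that was just selected."""
    return {name: key for name, key in candidates.items()
            if not dominates(selected, key)}


psi1 = {"name": (0, 4), "address": (0, 2)}
candidates = {
    "psi2": {"name": (0, 4), "address": (0, 2), "department": (0, 0)},
    "psi3": {"name": (0, 0), "address": (0, 2)},
    "psi6": {"name": (0, 0), "department": (0, 0)},  # not dominated by psi1
}

print(sorted(prune(candidates, psi1)))  # ['psi6']
```

Because a dominated key covers only pairs that the selected key already covers, skipping it never changes supp(Φ), so the filter saves work without affecting the result.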

Page 10:

Experiments

• The returned set size is affected by ηs and ηc

– Higher ηs and ηc lead to larger set size.

– When both ηs and ηc are too high, there may not exist any valid matching key set

• The pruning technique significantly reduces the time cost

Page 11:

Experiments

• Concise RCK sets with support commitment ηs

– higher accuracy

• Compared with considering all RCKs

– using all RCKs gives high recall

– but the precision is low

• many irrational keys with low support

• which probably overfit the data

Page 12:

Conclusion

• Relative candidate keys (RCKs) clear up redundant semantics

– w.r.t. “what attributes to compare”

– minimal on the number of compared attributes

• Minimal matching keys, a concise set of RCKs

– Redundancy among RCKs on the same attributes

– about “how to compare them”

• Introduce a greedy discovery algorithm

• The returned results are guaranteed to be

– RCKs (minimal w.r.t. attributes), and also

– minimal w.r.t. distance restrictions, i.e., redundancy free w.r.t. “how to compare the attributes”

Page 13:

Thanks