On Concise Set of Relative Candidate Keys
Shaoxu Song (Tsinghua), Lei Chen (HKUST), Hong Cheng (CUHK)
Example of Matching Keys
• For identifying the same real-world entities, matching keys specify
– What attributes to compare and
– How to compare them
• ψ2 : (name,address,department∥[0,4],[0,2],[0,0])
• states that for any tuples ti,tj in a relation
– if their distance on attribute name is in [0, 4], i.e., ≥ 0 and ≤ 4,
– their distance on address is in [0, 2], and
– they have the same department, i.e., distance in [0, 0],
– then their ssn values must identify the same entity

| | SSN | Name | Address | Department |
|---|---|---|---|---|
| t1 | 234*** | Jason Smith | Mark Road | Social Science |
| t2 | 2****3 | J Smith | Mark Rd | Social Science |

[Fan et al., VLDB09]
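
The check that ψ2 performs on a tuple pair can be sketched as follows. This is a minimal illustration, assuming the distance is character-level edit (Levenshtein) distance, which the slide does not state explicitly. The helper names `edit_distance` and `agrees` and the dictionary encoding of a key are hypothetical, chosen only for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (assumed metric; the slide does not name one)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def agrees(key, t1, t2):
    """True if the pair's distance on every attribute lies within the key's interval."""
    return all(lo <= edit_distance(t1[attr], t2[attr]) <= hi
               for attr, (lo, hi) in key.items())


# psi2 from the slide: attribute -> (lower, upper) distance bound
psi2 = {"name": (0, 4), "address": (0, 2), "department": (0, 0)}

t1 = {"name": "Jason Smith", "address": "Mark Road", "department": "Social Science"}
t2 = {"name": "J Smith", "address": "Mark Rd", "department": "Social Science"}

print(agrees(psi2, t1, t2))  # True: distances are 4, 2 and 0
```

Under this assumption, t1 and t2 agree on ψ2, so their ssn values would be treated as referring to the same entity.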
Relative Candidate Keys (RCKs)
• A matching key ψ2 is said to be redundant w.r.t. a relation, if
– all the tuple pairs that can be identified by ψ2
– can also be identified by another key ψ1
• ψ1 : (name,address∥[0,4],[0,3])
• ψ2 : (name,address,department∥[0,4],[0,3],[0,0])
• RCKs, a special group of matching keys
– the number of compared attributes is minimized
– Analogous to candidate keys, w.r.t. functional dependencies

| | SSN | Name | Address | Department |
|---|---|---|---|---|
| t1 | 234*** | Jason Smith | Mark Road | Social Science |
| t2 | 2****3 | J Smith | Mark Rd | Social Science |
| t3 | 862*** | Wixom J Smith | Park St | Social Science |
| t4 | 862*** | W J Smith | Park Street | Social Science |

ψ1 is an RCK
[Fan et al., VLDB09]
Minimal Matching Keys
• Redundancy issues exist
– not only w.r.t. “what attributes to compare”
– but also in “how to compare them”
• ψ1 : (name,address∥[0,4],[0,2])
• ψ3 : (name,address∥[0,0],[0,2])
• Redundancy among matching keys on the same attributes
– any tuple pair that agrees on ψ3, with name distance in [0, 0], always satisfies the restriction [0, 4] of ψ1

| | SSN | Name | Address | Department |
|---|---|---|---|---|
| t1 | 234*** | Jason Smith | Mark Road | Social Science |
| t2 | 2****3 | Smith | Mark Rd | Social Science |
| t3 | 862*** | Smith | Park St | Social Science |
| t4 | 862*** | Will J Smith | Park Street | Social Science |
| t5 | 0****5 | C Green | Mark Road | Computing |
| t6 | 0****5 | C Green | Mark Rd | Computing |

ψ3 is an RCK but not minimal
Reliable Matching Keys
• Consider a training data instance
– the same real-world entities in attribute Y are pre-identified
– e.g., the matching tuple pairs (t1,t2),(t3,t4),(t5,t6) on ssn
• Support
– the number of tuple pairs covered by ψ, expressed as a fraction of all tuple pairs
• Confidence
– the proportion of covered tuple pairs that correspond to true identifications on Y
• ψ5 : (name, address ∥ [0, 4], [0, 4])
– supp(ψ5) = 4/15, conf(ψ5) = 3/4 (6 tuples give 15 pairs; ψ5 covers 4 of them, 3 of which are true matches)

| | SSN | Name | Address | Department |
|---|---|---|---|---|
| t1 | 234863 | Jason Smith | Mark Road | Social Science |
| t2 | 234863 | J Smith | Mark Rd | Social Science |
| t3 | 862731 | W J Smith | Park St | Social Science |
| t4 | 862731 | Will J Smith | Park Street | Social Science |
| t5 | 068335 | C Green | Mark Road | Computing |
| t6 | 068335 | C Green | Mark Rd | Computing |

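
The support and confidence of ψ5 can be recomputed from the training instance above. The sketch below assumes character-level edit distance as the metric (the slides do not name one) and treats two tuples as a true match when their ssn values are equal. The variable and function names are illustrative only.

```python
from fractions import Fraction
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance (assumed metric)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


# Training instance from the slide: id -> (ssn, name, address)
rows = {
    "t1": ("234863", "Jason Smith", "Mark Road"),
    "t2": ("234863", "J Smith", "Mark Rd"),
    "t3": ("862731", "W J Smith", "Park St"),
    "t4": ("862731", "Will J Smith", "Park Street"),
    "t5": ("068335", "C Green", "Mark Road"),
    "t6": ("068335", "C Green", "Mark Rd"),
}

NAME_MAX, ADDRESS_MAX = 4, 4  # psi5 : (name, address || [0, 4], [0, 4])

all_pairs = list(combinations(rows, 2))  # 15 pairs for 6 tuples

# Pairs whose name and address distances both fall inside psi5's intervals
covered = [
    (a, b) for a, b in all_pairs
    if edit_distance(rows[a][1], rows[b][1]) <= NAME_MAX
    and edit_distance(rows[a][2], rows[b][2]) <= ADDRESS_MAX
]
# Covered pairs that are true identifications, i.e. share the same ssn
true_matches = [(a, b) for a, b in covered if rows[a][0] == rows[b][0]]

supp = Fraction(len(covered), len(all_pairs))
conf = Fraction(len(true_matches), len(covered))
print(supp, conf)  # 4/15 3/4
```

Under these assumptions the one false positive is (t2, t3): "J Smith" and "W J Smith" are within distance 4 on both name and address, but carry different ssn values.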
Matching Key Set
• Consider a set Φ of matching keys relative to the same Y
– ψ1 : (name,address∥[0,4],[0,2])
– ψ6 : (name,department∥[0,0],[0,0])
• A tuple pair may agree on (be covered by) several keys ψ ∈ Φ
• To avoid duplicate counting, consider the distinct tuple pairs that are covered by a set of matching keys

| | SSN | Name | Address | Department |
|---|---|---|---|---|
| t1 | 234863 | Jason Smith | Mark Road | Social Science |
| t2 | 234863 | J Smith | Mark Rd | Social Science |
| t3 | 862731 | W J Smith | Park St | Business |
| t4 | 862731 | W J Smith | Park Street | Business |
| t5 | 068335 | C Green | Mark Road | Computing |
| t6 | 068335 | C Green | Mark Rd | Computing |

supp(ψ1) = 2/15, supp(ψ6) = 2/15, supp(Φ) = 3/15
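
Counting distinct pairs amounts to taking the union of the pairs covered by each key. The sketch below hard-codes the covered pairs for ψ1 and ψ6; these sets are an assumption, derived by applying edit distance to the table above, and the names `cover` and `supp` are hypothetical.

```python
from fractions import Fraction

TOTAL_PAIRS = 15  # 6 tuples give C(6, 2) = 15 tuple pairs

# Pairs covered by each key (assumed, based on edit distance over the table)
cover = {
    "psi1": {("t1", "t2"), ("t5", "t6")},  # (name, address || [0,4], [0,2])
    "psi6": {("t3", "t4"), ("t5", "t6")},  # (name, department || [0,0], [0,0])
}


def supp(keys):
    """Support of a key set: distinct covered pairs over all pairs."""
    distinct = set().union(*(cover[k] for k in keys))
    return Fraction(len(distinct), TOTAL_PAIRS)


print(supp(["psi1"]), supp(["psi6"]))  # 2/15 2/15
print(supp(["psi1", "psi6"]))          # 1/5, i.e. 3/15 in lowest terms
```

The pair (t5, t6) is covered by both keys but is counted once, which is why supp(Φ) is 3/15 rather than 4/15.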
Hardness and Solutions
• Given a relation instance r of R, a Y over R, a constant k, and
– the minimum requirements of support ηs and confidence ηc,
• find a set Φ of matching keys such that
– supp(Φ) ≥ ηs, conf(Φ) ≥ ηc, and
– the size of the set |Φ| is minimized
• The problem is NP-hard
• Greedy solution
– Select the ψ with the maximum support in each iteration
– Repeat until the minimum support ηs is satisfied
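
The greedy strategy can be sketched in the style of greedy set cover. This is an illustration under stated assumptions, not the authors' implementation: "maximum support" is read as the largest number of newly covered pairs, a candidate is only accepted if the resulting set still meets ηc, and the function name `greedy_key_set` and its inputs are hypothetical.

```python
def greedy_key_set(cover, truth, total_pairs, eta_s, eta_c):
    """Pick keys greedily until supp >= eta_s; return None if that is impossible.

    cover: key name -> set of tuple pairs the key covers
    truth: set of tuple pairs that are true matches on Y
    """
    chosen, covered = [], set()
    remaining = dict(cover)
    while len(covered) / total_pairs < eta_s:
        best, best_gain = None, 0
        for name, pairs in remaining.items():
            merged = covered | pairs
            gain = len(merged) - len(covered)  # newly covered distinct pairs
            conf = len(merged & truth) / len(merged) if merged else 0.0
            if conf >= eta_c and gain > best_gain:
                best, best_gain = name, gain
        if best is None:
            return None  # no remaining key helps without violating eta_c
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen


# Illustrative input: covered pairs assumed for two keys over six tuples
cover = {
    "psi1": {("t1", "t2"), ("t5", "t6")},
    "psi6": {("t3", "t4"), ("t5", "t6")},
}
truth = {("t1", "t2"), ("t3", "t4"), ("t5", "t6")}

print(greedy_key_set(cover, truth, 15, eta_s=0.15, eta_c=1.0))
```

With ηs = 0.15, one key alone (2/15) is not enough, so the sketch selects both keys. With a threshold such as ηs = 0.5 it returns `None`, matching the case where no valid key set exists.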
Redundancy Free Results
• Subsume: on distance restrictions,
– [0,4] subsumes [0,2]
• Dominate: ψ1 ≺ ψ2,
– if all distance restrictions in ψ1 subsume those of ψ2
• Minimal: a ψ is minimal
– if there does not exist any ψ′ such that
– ψ′ ≺ ψ (and conf(ψ′) ≥ ηc)
• Minimal matching keys are always RCKs
– the “minimal” definition is stricter than the RCK definition
• Greedy algorithm returns minimal results
– For any ψ1 ≺ ψ2, we have supp(ψ1) ≥ supp(ψ2)
ψ1 : (name,address∥[0,4],[0,2])
ψ2 : (name,address,department∥[0,4],[0,2],[0,0])
ψ3 : (name,address∥[0,0],[0,2])
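
The subsume and dominate relations can be checked mechanically. The sketch below encodes a key as a dictionary from attribute to interval and assumes that an attribute missing from a key is unrestricted, so that ψ1 dominates the three-attribute ψ2. The function names are hypothetical.

```python
def subsumes(outer, inner):
    """[0,4] subsumes [0,2]: the outer interval contains the inner one."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def dominates(a, b):
    """a ≺ b: every restriction in a subsumes the matching restriction in b.

    Assumption: an attribute absent from a key is unrestricted, so b may
    carry extra attributes, but a may not restrict anything b leaves open.
    """
    if a == b:
        return False
    return all(attr in b and subsumes(interval, b[attr])
               for attr, interval in a.items())


psi1 = {"name": (0, 4), "address": (0, 2)}
psi2 = {"name": (0, 4), "address": (0, 2), "department": (0, 0)}
psi3 = {"name": (0, 0), "address": (0, 2)}

print(dominates(psi1, psi2), dominates(psi1, psi3))  # True True
print(dominates(psi3, psi1), dominates(psi2, psi1))  # False False
```

Because ψ1 is the looser key in both comparisons, every pair covered by ψ2 or ψ3 is also covered by ψ1, which is consistent with supp(ψ1) ≥ supp(ψ2).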
Pruning Idea
• If ψ1 is selected into the result set Φ
– any ψ2 with ψ1 ≺ ψ2 makes no further contribution to supp(Φ), i.e., supp({ψ1}) = supp({ψ1, ψ2})
– ψ2 can be directly ignored
• Example: suppose that ψ1 is selected into Φ
– supp({ψ1}) = supp({ψ1, ψ2}) = supp({ψ1, ψ3}) = 2/15
– ψ2, ψ3 can be pruned in the following computation

| | SSN | Name | Address | Department |
|---|---|---|---|---|
| t1 | 234863 | Jason Smith | Mark Road | Social Science |
| t2 | 234863 | J Smith | Mark Rd | Social Science |
| t3 | 862731 | W J Smith | Park St | Social Science |
| t4 | 862731 | Will J Smith | Park Street | Social Science |
| t5 | 068335 | C Green | Mark Road | Computing |
| t6 | 068335 | C Green | Mark Rd | Computing |

ψ1 : (name,address∥[0,4],[0,2])
ψ2 : (name,address,department∥[0,4],[0,2],[0,0])
ψ3 : (name,address∥[0,0],[0,2])
Experiments
• The returned set size is affected by ηs and ηc
– Higher ηs and ηc lead to a larger set.
– When both ηs and ηc are too high, there may not exist any valid matching key set
• The pruning technique significantly reduces the time cost
Experiments
• Concise RCK sets with support commitment ηs
– higher accuracy
• Compared with considering all RCKs
– using all RCKs gives high recall
– but low precision
• many irrational keys with low support
• which probably overfit the data
Conclusion
• Relative candidate keys (RCKs) clear up redundant semantics
– w.r.t. “what attributes to compare”
– minimal on the number of compared attributes
• Minimal matching keys, a concise set of RCKs
– Redundancy among RCKs on the same attributes
– about “how to compare them”
• A greedy discovery algorithm is introduced
• The returned results are guaranteed to be
– RCKs (minimal w.r.t. attributes), and also
– minimal w.r.t. distance restrictions, i.e., redundancy free w.r.t. “how to compare the attributes”
Thanks