1 dependencies for improving data quality conditional functional dependencies (cfds; chapter 2)...
TRANSCRIPT
![Page 1: 1 Dependencies for improving data quality Conditional functional dependencies (CFDs; Chapter 2) –Syntax and semantics –Static analysis: consistency and](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649e255503460f94b13732/html5/thumbnails/1.jpg)
1
Dependencies for improving data quality
Conditional functional dependencies (CFDs; Chapter 2)
– Syntax and semantics
– Static analysis: consistency and implication, axiom system Conditional inclusion dependencies (CINDs; Chapter 3)
– Syntax and semantics
– Static analysis: consistency and implication Matching dependencies for record matching (MDs; Chapter 4)
– Syntax and semantics
– Relative candidate keys
TDD: Topics in Distributed Databases
![Page 2: 1 Dependencies for improving data quality Conditional functional dependencies (CFDs; Chapter 2) –Syntax and semantics –Static analysis: consistency and](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649e255503460f94b13732/html5/thumbnails/2.jpg)
2
Characterizing the consistency of data
One of the central technical problems for data consistency is how
to tell whether the data is dirty or clean
Integrity constraints (data dependencies) as data quality rules
Inconsistencies emerge as violations of constraints
Traditional dependencies: – functional dependencies– inclusion dependencies– denial constraints (a special case of full dependencies)– . . .
Question: are these traditional dependencies sufficient?
![Page 3: 1 Dependencies for improving data quality Conditional functional dependencies (CFDs; Chapter 2) –Syntax and semantics –Static analysis: consistency and](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649e255503460f94b13732/html5/thumbnails/3.jpg)
3
Example: customer relation
Schema: Cust(country, area-code, phone, street, city, zip)
Instance:
country area-code phone street city zip
44 131 1234567 Mayfield NYC EH4 8LE
44 131 3456789 Crichton NYC EH4 8LE
01 908 3456789 Mountain Ave NYC 07974
functional dependencies (FDs):
cust[country, area-code, phone] cust[street, city, zip]
cust[country, area-code] cust[city]
The database satisfies the FDs. Is the data consistent?
![Page 4: 1 Dependencies for improving data quality Conditional functional dependencies (CFDs; Chapter 2) –Syntax and semantics –Static analysis: consistency and](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649e255503460f94b13732/html5/thumbnails/4.jpg)
4
Capturing inconsistencies in the data
cust ([country = 44, zip] [street])
In the UK, zip code uniquely determines the street
The constraint may not hold for other countries It expresses a fundamental part of the semantics of the data It can NOT be expressed as a traditional FD
– It does not hold on the entire relation; instead, it holds on
tuples representing UK customers only
country area-code phone street city zip
44 131 1234567 Mayfield NYC EH4 8LE
44 131 3456789 Crichton NYC EH4 8LE
01 908 3456789 Mountain Ave NYC 07974
![Page 5: 1 Dependencies for improving data quality Conditional functional dependencies (CFDs; Chapter 2) –Syntax and semantics –Static analysis: consistency and](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649e255503460f94b13732/html5/thumbnails/5.jpg)
5
Two more constraints
cust([country = 44, area-code = 131, phone] [street, zip, city = EDI])
cust([country = 01, area-code = 908, phone] [street, zip, city = MH])
– In the UK, if the area code is 131, then the city has to be EDI– In the US, if the area code is 908, then the city has to be MH
t1, t2 and t3 violate these constraints
– refining cust([country, area-code, phno] [street, city, zip])
– combining data values and variables
id country Area-code phone street city zip
t1 44 131 1234567 Mayfield NYC EH4 8LE
t2 44 131 3456789 Crichton NYC EH4 8LE
t3 01 908 3456789 Mountain Ave NYC 07974
![Page 6: 1 Dependencies for improving data quality Conditional functional dependencies (CFDs; Chapter 2) –Syntax and semantics –Static analysis: consistency and](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649e255503460f94b13732/html5/thumbnails/6.jpg)
6
The need for new constraints
cust([country = 44, zip] [street])
cust([country = 44, area-code = 131, phone] [street, zip, city = EDI])
cust([country = 01, area-code = 908, phone] [street, zip, city = MH])
They capture inconsistencies that traditional FDs cannot detect
Traditional constraints were developed for schema design, not for
data cleaning! Data integration in real-life: source constraints
– hold on a subset of sources– hold conditionally on the integrated data
They are NOT expressible as traditional FDs– do not hold on the entire relation– contain constant data values, besides logical variables
![Page 7: 1 Dependencies for improving data quality Conditional functional dependencies (CFDs; Chapter 2) –Syntax and semantics –Static analysis: consistency and](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649e255503460f94b13732/html5/thumbnails/7.jpg)
7
An extension of traditional FDs: (R: X Y, Tp) X Y: embedded traditional FD on R Tp: a pattern tableau
– attributes: X Y
– each tuple in Tp consists of constants and unnamed
variable _
Example: cust([country = 44, zip] [street]) (cust (country, zip street), Tp) pattern tableau Tp
Conditional Functional Dependencies (CFDs)
country zip street
44 _ _
![Page 8: 1 Dependencies for improving data quality Conditional functional dependencies (CFDs; Chapter 2) –Syntax and semantics –Static analysis: consistency and](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649e255503460f94b13732/html5/thumbnails/8.jpg)
8
cust([country = 44, area-code = 131, phone] [street, zip, city = EDI])
cust([country = 01, area-code = 908, phone] [street, zip, city = MH])
cust([country, area-code, phone] [street, city, zip])
as a SINGLE CFD: (cust(country, area-code, phone street, city, zip), Tp) pattern tableau Tp: one tuple for each constraint
Example CFDs
country area-code phone street city zip
44 131 _ _ Edi _
01 908 _ _ MH _
_ _ _ _ _ _
CFDs subsume traditional FDs. Why?
![Page 9: 1 Dependencies for improving data quality Conditional functional dependencies (CFDs; Chapter 2) –Syntax and semantics –Static analysis: consistency and](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649e255503460f94b13732/html5/thumbnails/9.jpg)
9
Express
cust[country, area-code] cust[city]
as a CFD: (cust(country, area-code, city), Tp) pattern tableau Tp: a single tuple consisting of _ only
Traditional FDs as a special case
country area-code city
_ _ _
![Page 10: 1 Dependencies for improving data quality Conditional functional dependencies (CFDs; Chapter 2) –Syntax and semantics –Static analysis: consistency and](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649e255503460f94b13732/html5/thumbnails/10.jpg)
10
a b (a matches b) if – either a or b is _ – both a and b are constants and a = b
tuple t1 matches t2: t1 t2
(a, b) (a, _), but (a, b) does not match (a, c)
DB satisfies (R: X Y, Tp) iff for any tuple tp in the pattern tableau Tp and for any tuples t1, t2 in DB, if t1[X] = t2[X] tp[X], then t1[Y] = t2[Y] tp[Y]– tp[X]: identifying the set of tuples on which the constraint tp
applies, ie, { t | t[X] tp[X]} – t1[Y] = t2[Y] tp[Y]: enforcing the embedded FD, and the
pattern of tp
Semantics of CFDs
![Page 11: 1 Dependencies for improving data quality Conditional functional dependencies (CFDs; Chapter 2) –Syntax and semantics –Static analysis: consistency and](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649e255503460f94b13732/html5/thumbnails/11.jpg)
11
cust([country = 44, zip] [street])
Tuples t1 and t2 violate the CFD t1[country, zip] = t2[country, zip] tp[country, zip] t1[street] t2[street]
The CFD applies to t1 and t2 since they match tp[country, zip]
Example: violation of CFDs
id country area-code phone street city zip
t1 44 131 1234567 Mayfield NYC EH8 8LE
t2 44 131 3456789 Crichton NYC EH8 8LE
t3 01 908 3456789 Mountain Ave NYC 07974
country zip street
44 _ _
CFDs: enforcing binding of semantically related data values
![Page 12: 1 Dependencies for improving data quality Conditional functional dependencies (CFDs; Chapter 2) –Syntax and semantics –Static analysis: consistency and](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649e255503460f94b13732/html5/thumbnails/12.jpg)
12
(cust(country, area-code city), Tp)
Tuple t1 does not satisfy the CFD t1[country, area-code] = t1[country, area-code] tp1[country, area-code] t1[city] = t1[city]; however, t1[city] does not match tp1[city]
In contrast to traditional FDs, a single tuple may violate a CFD
Violation of CFDs by a single tuple
id country area-code city
tp1 44 131 Edi
tp2 01 908 MH
tp3 _ _ _
id country area-code phone street city zip
t1 44 131 1234567 Mayfield NYC EH8 8LE
t2 44 131 3456789 Crichton NYC EH8 8LE
t3 01 908 3456789 Mountain Ave NYC 07974
![Page 13: 1 Dependencies for improving data quality Conditional functional dependencies (CFDs; Chapter 2) –Syntax and semantics –Static analysis: consistency and](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649e255503460f94b13732/html5/thumbnails/13.jpg)
13
(cust(country, area-code, phno street, city, zip), Tp)
Tuple t3 violates the CFD. Why? Tuples t1 and t4 violate the CFD. Why?
Exercise
id country area-code phone street city zip
t1 44 131 1234567 Mayfield Edi EH4 8LE
t2 44 131 3456789 Mayfield NYC 19082
t3 01 908 3456789 Mountain Ave NYC 19082
t4 44 131 1234567 Chrichton EDI EH8 9LE
id country area-code phon street city zip
tp1 44 131 _ _ Edi _
tp2 01 908 _ _ MH _
tp3 _ _ _ _ _ _
![Page 14: 1 Dependencies for improving data quality Conditional functional dependencies (CFDs; Chapter 2) –Syntax and semantics –Static analysis: consistency and](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649e255503460f94b13732/html5/thumbnails/14.jpg)
14
“Dirty” constraints?
A set of CFDs may be inconsistent!
Inconsistent: (R(A B), Tp)
In any nonempty database DB and for any tuple t in DB,– tp1: t[B] must be b– tp2: t[B] must be c– Inconsistent if b and c are different
inconsistent = { 1, 2 }, 1 = (R(A B), Tp1), 2 = (R(B A), Tp2)
Why?
id A B
tp1 _ b
tp2 _ cTp
id A B
tp1 true b
tp2 false c
id B A
tp3 b false
tp4 c true
![Page 15: 1 Dependencies for improving data quality Conditional functional dependencies (CFDs; Chapter 2) –Syntax and semantics –Static analysis: consistency and](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649e255503460f94b13732/html5/thumbnails/15.jpg)
15
The consistency problem
The consistency problem for CFDs is to determine, given a set
of CFDs, whether or not there exists a nonempty database DB
that satisfies , i.e., for any in , DB satisfies .
Whether or not makes sense
For traditional FDs, the consistency problem is not an issue: one
can specify any FDs without worrying about their consistency
A set of CFDs may be inconsistent!
Theorem. The consistency problem for CFDs is NP-complete.
Nontrivial: contrast this with the trivial consistency analysis of FDs!
![Page 16: 1 Dependencies for improving data quality Conditional functional dependencies (CFDs; Chapter 2) –Syntax and semantics –Static analysis: consistency and](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649e255503460f94b13732/html5/thumbnails/16.jpg)
16
The implication problem
The implication problem for CFDs is to determine, given a set of
CFDs and a single CFD , whether implies , denoted by |=
, i.e., for any database DB, if DB satisfies , then DB satisfies .
Example:
= { 1, 2 } , 1 = (R(A B), Tp1), 2 = (R(B C), Tp2)
= (R(A C), Tp)
|= .
id A B
tp1 _ b
Tp1 Tp2 id B C
tp1 _ c
id A C
tp a c
![Page 17: 1 Dependencies for improving data quality Conditional functional dependencies (CFDs; Chapter 2) –Syntax and semantics –Static analysis: consistency and](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649e255503460f94b13732/html5/thumbnails/17.jpg)
17
The complexity of the implication problem
For traditional FDs, the implication problem is in linear time
In contrast, the implication problem for CFDs is intractable
Theorem. The implication problem for CFDs is coNP-complete.
The expressive power of CFDs comes at a price
Question: how about constant CFDs (without
wildcard)? Would it simplify the consistency
and implication analyses?
![Page 18: 1 Dependencies for improving data quality Conditional functional dependencies (CFDs; Chapter 2) –Syntax and semantics –Static analysis: consistency and](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649e255503460f94b13732/html5/thumbnails/18.jpg)
18
Finite axiomatizability: Flashback
Armstrong’s axioms can be found in every database textbook:
Reflexivity: If Y X, then X Y
Augmentation: If X Y , then XZ YZ
Transitivity: If X Y and Y Z, then X Z
Sound and complete for FD implication, i.e, |= iff can be
inferred from using reflexivity, augmentation, transitivity.
Question: is there a sound and complete inference system for the
implication analysis of CFDs?
![Page 19: 1 Dependencies for improving data quality Conditional functional dependencies (CFDs; Chapter 2) –Syntax and semantics –Static analysis: consistency and](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649e255503460f94b13732/html5/thumbnails/19.jpg)
19
Finite axiomatizability of CFDs
Theorem. There is a sound and complete inference system I for
implication analysis of CFDs
Sound: if |- , i.e., can be proved from using I, then |= Complete: if |= , then |- using I
The inference system is more involved than its counterpart for
traditional FDs, namely, Armstrong’s axioms.
There are 5 axioms.
A normal form of CFDs: (R: X A, tp), tp is a single pattern tuple.
![Page 20: 1 Dependencies for improving data quality Conditional functional dependencies (CFDs; Chapter 2) –Syntax and semantics –Static analysis: consistency and](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649e255503460f94b13732/html5/thumbnails/20.jpg)
20
Axioms for CFDs: transitivity
Transitivity: if ([A1,…,Ak] [B1,…,Bm], tp)
and ([B1,…,Bm] [C1,…,Cn], t’p)
A1 … Ak B1 … Bm
tp[A1] … tp[Ak] tp[B1] tp[Bm]
A1 … Ak C1 … Cn
tp[A1] … tp[Ak] t’p[C1] t’p[Cn]
B1 … Bm C1 … Cn
tp’[B1] … t’p[Bm] t’p[C1] t’p[Cm]
([A1,…,Ak] [C1,…,Cn], t’p)
match
![Page 21: 1 Dependencies for improving data quality Conditional functional dependencies (CFDs; Chapter 2) –Syntax and semantics –Static analysis: consistency and](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649e255503460f94b13732/html5/thumbnails/21.jpg)
21
Axioms for CFDs: reduction
reduction: if ([B, X] A, tp), tp[B] = _, and tp[A] = a
A1 … Ak B A
tp[A1] … tp[Ak] _ a
then (X A, t’p) A1 … Ak A
tp[A1] … tp[Ak] a
![Page 22: 1 Dependencies for improving data quality Conditional functional dependencies (CFDs; Chapter 2) –Syntax and semantics –Static analysis: consistency and](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649e255503460f94b13732/html5/thumbnails/22.jpg)
22
Static analyses: CFD vs. FD
satisfiability implication finite axiom’tyCFD NP-complete coNP-complete yes
FD O(1) O(n) yes
General setting:
satisfiability implication finite axiom’tyCFD O(n2) O(n2) yes
FD O(1) O(n) yes
in the absence of finite-domain attributes:
complications: finite-domain attributes
![Page 23: 1 Dependencies for improving data quality Conditional functional dependencies (CFDs; Chapter 2) –Syntax and semantics –Static analysis: consistency and](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649e255503460f94b13732/html5/thumbnails/23.jpg)
23
Conditional functional dependencies (CFDs)– Syntax and semantics– Static analysis: consistency and implication, axiom system– SQL techniques for inconsistency detection and incremental
detection Conditional inclusion dependencies (CINDs)
– Syntax and semantics– Static analysis: consistency and implication
Matching dependencies for record matching (MDs)– Syntax and semantics– Relative candidate keys
Conditional Constraints for Data Cleaning
![Page 24: 1 Dependencies for improving data quality Conditional functional dependencies (CFDs; Chapter 2) –Syntax and semantics –Static analysis: consistency and](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649e255503460f94b13732/html5/thumbnails/24.jpg)
24
Example: Amazon database
Schema:
order(asin, title, type, price, country, county) -- source
book(asin, isbn, title, price, format) -- target
CD(asin, title, price, genre)
asin: Amazon standard identification number
Instances: asin title type price country county
a23 H. Porter book 17.99 US DL
a12 J. Denver CD 7.94 UK Reyden
asin isbn title price
a23 b32 Harry Porter 17.99
a56 b65 Snow white 7.94
asin title price genre
a12 J. Denver 17.99 country
a56 Snow White 7.94 a-book
order
book CD
![Page 25: 1 Dependencies for improving data quality Conditional functional dependencies (CFDs; Chapter 2) –Syntax and semantics –Static analysis: consistency and](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649e255503460f94b13732/html5/thumbnails/25.jpg)
25
Schema matching
Traditional inclusion dependencies:
order[asin, title, price] book[asin, title, price]
order[asin, title, price] CD[asin, title, price]
These inclusion dependencies do not make sense!
Inclusion dependencies from source to target (e.g., Clio)
asin title type price country county
asin isbn title price asin title price genre
Do these make sense?
![Page 26: 1 Dependencies for improving data quality Conditional functional dependencies (CFDs; Chapter 2) –Syntax and semantics –Static analysis: consistency and](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649e255503460f94b13732/html5/thumbnails/26.jpg)
26
Schema matching: dependencies with conditions
asin title type price country county
asin isbn title price asin title price genre
Conditional inclusion dependencies:
order[asin, title, price; type = book] book[asin, title, price]
order[asin, title, price; type = CD] CD[asin, title, price]
order[asin, title, price] book[asin, title, price] holds only if type = book
order[asin, title, price] CD[asin, title, price] holds only if type = CD
The constraints do not hold on the entire order table
![Page 27: 1 Dependencies for improving data quality Conditional functional dependencies (CFDs; Chapter 2) –Syntax and semantics –Static analysis: consistency and](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649e255503460f94b13732/html5/thumbnails/27.jpg)
27
Date cleaning with conditional dependencies
CIND1: order[asin, title, price; type = book] book[asin, title, price]
CIND2: order[asin, title, price; type = CD] CD[asin, title, price]
Tuple t1 violates CIND1
Tuple t2 violates CIND2, why?
id asin title type price country county
t1 a23 H. Porter book 17.99 US DL
t2 a12 J. Denver CD 7.94 UK Reyden
asin isbn title price
a23 b32 Harry Porter 17.99
a56 b65 Snow white 7.94
asin title price genre
a12 J. Denver 17.99 country
a56 Snow White 7.94 a-book
order
book CD
![Page 28: 1 Dependencies for improving data quality Conditional functional dependencies (CFDs; Chapter 2) –Syntax and semantics –Static analysis: consistency and](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649e255503460f94b13732/html5/thumbnails/28.jpg)
28
More on data cleaning
CD[asin, title, price; genre = ‘a-book’] book[asin, title, price; format = ‘audio’]
Inclusion dependency CD[asin, title, price] book[asin, title, price]
holds only if genre = ‘a-book’, i.e., when the CD is an audio book In addition, the format of the corresponding book must be audio –
a pattern for the referenced tuple
asin isbn title price format
a23 b32 Harry Porter 17.99 Hard cover
a56 b65 Snow White 17.94 audio
asin title price genre
a12 J. Denver 17.99 country
a56 Snow White 7.94 a-book
book
CD
And what?
![Page 29: 1 Dependencies for improving data quality Conditional functional dependencies (CFDs; Chapter 2) –Syntax and semantics –Static analysis: consistency and](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649e255503460f94b13732/html5/thumbnails/29.jpg)
29
(R1[X; Xp] R2[Y; Yp], Tp) R1[X] R2[Y]: embedded traditional IND from R1 to R2 Tp: a pattern tableau
– attributes: Xp Yp
– tuples in Tp consist of constants and unnamed variable _
Example: express
CIND1: order[asin, title, price; type = book] book[asin, title, price]
(order[asin, title, price; type] book[asin, title, price; nil], Tp)
nil: empty list pattern tableau Tp
Conditional Inclusion Dependencies (CINDs)
type
book
![Page 30: 1 Dependencies for improving data quality Conditional functional dependencies (CFDs; Chapter 2) –Syntax and semantics –Static analysis: consistency and](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649e255503460f94b13732/html5/thumbnails/30.jpg)
30
R1[X] R2[Y] X: [A1, …, An] Y : [B1, …, Bn]
As a CIND: (R1[X; nil] R2[Y; nil], Tp)
What is the pattern tableau?
pattern tableau Tp: a single tuple ( )
Traditional CINDs as a special case
CINDs subsume traditional INDs
![Page 31: 1 Dependencies for improving data quality Conditional functional dependencies (CFDs; Chapter 2) –Syntax and semantics –Static analysis: consistency and](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649e255503460f94b13732/html5/thumbnails/31.jpg)
31
Express the following as CINDs:
CIND2: order[asin, title, price; type = CD] CD[asin, title, price]
CIND3: CD[asin, title, price; genre = ‘a-book’] book[asin, title,
price; format = ‘audio’]
Exercise
type
CD
genre format
a-book audio
(order[asin, title, price; type] CD[asin, title, price; nil], Tp)
(CD[asin, title, price; genre] book[asin, title, price; format], Tp)
![Page 32: 1 Dependencies for improving data quality Conditional functional dependencies (CFDs; Chapter 2) –Syntax and semantics –Static analysis: consistency and](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649e255503460f94b13732/html5/thumbnails/32.jpg)
32
DB = (DB1, DB2), where DBj is an instance of Rj, j = 1, 2.
DB satisfies (R1[X; Xp] R2[Y; Yp], Tp) iff for any tuples t1 in DB1,
and any tuple tp in the pattern tableau Tp, if t1[Xp] tp[Xp],
then there exists t2 in DB2 such that t1[Y] = t2[Y] (traditional IND semantics) t2[Yp] tp[Yp] (matching the pattern tuple on Y, Yp)
Patterns: t1[Xp] tp[Xp]: identifying the set of R1 tuples on which tp
applies: { t1 | t1[Xp] tp[Xp] } t2[Yp] tp[Yp]: enforcing the embedded IND and the constraint
specified by patterns Yp
Semantics of CINDs
![Page 33: 1 Dependencies for improving data quality Conditional functional dependencies (CFDs; Chapter 2) –Syntax and semantics –Static analysis: consistency and](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649e255503460f94b13732/html5/thumbnails/33.jpg)
33
(CD[asin, title, price; genre] book[asin, title, price; format], Tp)
The following DB satisfies the CIND
Example
asin isbn title price format
a23 b32 Harry Porter 17.99 Hard cover
a56 b65 Snow white 7.94 audio
asin title price genre
a12 J. Denver 17.99 country
a56 Snow White 7.94 a-book
book
CD
genre format
a-book audio
![Page 34: 1 Dependencies for improving data quality Conditional functional dependencies (CFDs; Chapter 2) –Syntax and semantics –Static analysis: consistency and](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649e255503460f94b13732/html5/thumbnails/34.jpg)
34
Exercise
CIND1: (order[asin, title, price; type] book[asin, title, price; nil], Tp)
The following DB violates CIND1. Why?
id asin title type price country county
t1 a23 H. Porter book 17.99 US DL
t2 a12 J. Denver CD 7.94 UK Reyden
asin isbn title price
a23 b32 Harry Porter 17.99
a56 b65 Snow white 7.94
asin title price genre
a12 J. Denver 17.99 country
a56 S. White 7.94 a-book
order
book CD
type
book
![Page 35: 1 Dependencies for improving data quality Conditional functional dependencies (CFDs; Chapter 2) –Syntax and semantics –Static analysis: consistency and](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649e255503460f94b13732/html5/thumbnails/35.jpg)
35
The satisfiability problem for CINDs
The consistency problem for CINDs is to determine, given a set of
CINDs, whether or not there exists a nonempty database DB
that satisfies , i.e., for any in , DB satisfies .
Recall
Any set of traditional INDs is always consistent!
For CFDs, the satisfiability problem is intractable.
In contrast.
Theorem. Any set of CINDs is always consistent!
Despite the increased expressive power, the complexity of the
satisfiability analysis does not go up.
![Page 36: 1 Dependencies for improving data quality Conditional functional dependencies (CFDs; Chapter 2) –Syntax and semantics –Static analysis: consistency and](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649e255503460f94b13732/html5/thumbnails/36.jpg)
36
The implication problem for CINDs
The implication problem for CINDs is to decide, given a set of
CINDs and a single CIND , whether implies ( |= ).
For traditional INDs, the implication problem is PSPACE-complete
For CINDs, the complexity does not hike up, to an extent:
Theorem. For CINDs containing no finite-domain attributes, the
implication problem is PSPACE-complete
In the general setting, however, we have to pay a price:
Theorem. The implication problem for CINDs is EXPTIME-complete
![Page 37: 1 Dependencies for improving data quality Conditional functional dependencies (CFDs; Chapter 2) –Syntax and semantics –Static analysis: consistency and](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649e255503460f94b13732/html5/thumbnails/37.jpg)
37
Finite axiomatizability of CINDs
Rules for inferring IND implication:
– Reflexivity: If R[X] R[X]
– Projection and Permutation: If R1[A1, …, Ak] R2[B1, …, Bk],
then R1[Ai1, …, Aik] R2[Bi1, …, Bik],
– Transitivity: If R1[X] R2[Y] and R2[Y] R3[Z], then
R1[X] R3[Z]
Sound and complete for IND implication
CINDs retain the finite axiomatizability
Theorem. There is a sound and complete inference system for
implication analysis of CINDs
There are 8 axioms.
![Page 38: 1 Dependencies for improving data quality Conditional functional dependencies (CFDs; Chapter 2) –Syntax and semantics –Static analysis: consistency and](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649e255503460f94b13732/html5/thumbnails/38.jpg)
38
Inference rules for CINDs
Normal form of CINDs: (R1[X; Xp] R2[Y; Yp], tp), tp is a single pattern tuple tp[A] is a constant iff A is in Xp or Yp (tp[B] = _ if B is in X or Y)
Inference rules
Reflexivity: (R[X; nil] R[X; nil], tp), where tp = ( )
Projection and permutation: If (R1[X; Xp] R2[Y; Yp], tp), then (R1[X’; X’p] R2[Y’; Y’p], t’p), for any permutation of X, Xp
Xp Yp
tp[Xp] tp[Yp]
tp tp’
X’p Y’p
tp[X’p] tp[Y’p]
![Page 39: 1 Dependencies for improving data quality Conditional functional dependencies (CFDs; Chapter 2) –Syntax and semantics –Static analysis: consistency and](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649e255503460f94b13732/html5/thumbnails/39.jpg)
39
Axioms for CINDs: transitivity
Transitivity: if (R1[X; Xp] R2[Y; Yp], tp),
and (R2[Y; Yp] R3[Z; Zp], t’p),
Xp Yp
tp[Xp] tp[Yp]
Xp Zp
tp[Xp] t’p[Zp]
Yp Zp
tp[Yp] t’p[Zp]
(R1[X; Xp] R3[Z; Zp], t”p)
equal
![Page 40: 1 Dependencies for improving data quality Conditional functional dependencies (CFDs; Chapter 2) –Syntax and semantics –Static analysis: consistency and](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649e255503460f94b13732/html5/thumbnails/40.jpg)
40
Axioms for CINDs: augmentation
augmentation: if (R1[X; Xp] R2[Y; Yp], tp), A attr(R1),
Xp Yp
tp[Xp] tp[Yp]
Xp A Yp
tp[Xp] a tp[Yp]
(R1[X; Xp, A] R2[Y; Yp], t’p)
![Page 41: 1 Dependencies for improving data quality Conditional functional dependencies (CFDs; Chapter 2) –Syntax and semantics –Static analysis: consistency and](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649e255503460f94b13732/html5/thumbnails/41.jpg)
41
Static analyses: CIND vs. IND
satisfiability implication finite axiom’tyCIND O(1) EXPTIME-complete yes
IND O(1) PSPACE-complete yes
General setting:
satisfiability implication finite axiom’tyCIND O(1) PSPACE-complete yes
IND O(1) PSPACE-complete yes
in the absence of finite-domain attributes:
CINDs retain most complexity bounds of their traditional counterpart
![Page 42: 1 Dependencies for improving data quality Conditional functional dependencies (CFDs; Chapter 2) –Syntax and semantics –Static analysis: consistency and](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649e255503460f94b13732/html5/thumbnails/42.jpg)
42
CFDs and CINDs taken together
We need both CFDs and CINDs for data cleaning schema matching
Theorem. The implication problem for CFDs and CINDs is
undecidable
Not surprising: The implication problem for traditional FDs and INDs
is already undecidable
Theorem. The consistency problem for CFDs and CINDs is
undecidable
In contrast, any set of traditional FDs and INDs is consistent!
![Page 43: 1 Dependencies for improving data quality Conditional functional dependencies (CFDs; Chapter 2) –Syntax and semantics –Static analysis: consistency and](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649e255503460f94b13732/html5/thumbnails/43.jpg)
43
Static analyses: CFD + CIND vs. FD + IND
satisfiability implication finite axiom’tyCFD + CIND undecidable undecidable No
FD + IND O(1) undecidable No
CINDs and CFDs properly subsume FDs and INDs
Both the satisfiability analysis and implication analysis are
beyond reach in practice
This calls for effective heuristic methods
![Page 44: 1 Dependencies for improving data quality Conditional functional dependencies (CFDs; Chapter 2) –Syntax and semantics –Static analysis: consistency and](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649e255503460f94b13732/html5/thumbnails/44.jpg)
44
Conditional functional dependencies (CFDs)– Syntax and semantics– Static analysis: consistency and implication, axiom system– SQL techniques for inconsistency detection and incremental
detection Conditional inclusion dependencies (CINDs)
– Syntax and semantics– Static analysis: consistency and implication
Matching dependencies for record matching (MDs)– Syntax and semantics– Relative candidate keys
Conditional Constraints for Data Cleaning
![Page 45: 1 Dependencies for improving data quality Conditional functional dependencies (CFDs; Chapter 2) –Syntax and semantics –Static analysis: consistency and](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649e255503460f94b13732/html5/thumbnails/45.jpg)
4545
Record matching
FN LN post phn when where amount
M. Smith 10 Oak St, EDI, EH8 9LE null 1pm/7/7/09 EDI $3,500
… … … … … … …
Max Smith PO Box 25, EDI 3256777 2pm/7/7/09 NYC $6,300
To identify tuples from one or more unreliable sources that refer to
the same real-world object.
Record linkage, entity resolution, data deduplication, merge/purge, …
FN LN address tel DOB gender
Mark Smith 10 Oak St, EDI, EH8 9LE 3256777 10/27/97 M
the same person?Nontrivial: Real-life data is often dirty: errors in the
data sources Data is often represented differently in
different sourcesPairwise comparison of attributes
via equality only does not work!
![Page 46: 1 Dependencies for improving data quality Conditional functional dependencies (CFDs; Chapter 2) –Syntax and semantics –Static analysis: consistency and](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649e255503460f94b13732/html5/thumbnails/46.jpg)
4646
Matching rules (Hernndez & Stolfo, 1995)
FN LN post phn when where amount
M. Smith 10 Oak St, EDI, EH8 9LE null 1pm/7/7/09 EDI $3,500
… … … … … … …
Max Smith PO Box 25, EDI 3256777 2pm/7/7/09 NYC $6,300
FN LN address tel DOB gender
Mark Smith 10 Oak St, EDI, EH8 9LE 3256777 10/27/97 M
IF card[LN, address] = trans[LN, post] AND card[FN] and
trans[FN] are similar, THEN identify the two tuples
Accommodate errors in the data sources
Match= card
trans
![Page 47: 1 Dependencies for improving data quality Conditional functional dependencies (CFDs; Chapter 2) –Syntax and semantics –Static analysis: consistency and](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649e255503460f94b13732/html5/thumbnails/47.jpg)
4747
Dependencies for record matching
Identifying attributes (not necessarily entire records), across sources
FN LN post phn when where amount
Max Smith PO Box 25, EDI 3256777 2pm/7/7/09 NYC $6,300
FN LN address tel DOB gender
Mark Smith 10 Oak St, EDI, EH8 9LE 3256777 10/27/97 M
2(m*n) configurations
card[tel] = trans[phn] card[address] trans[post]
card[LN, address] = trans[LN, post] card[FN] trans[FN] card[X] trans[Y]
What attributes to compare? How to compare them?
X
Y
card
trans
![Page 48: 1 Dependencies for improving data quality Conditional functional dependencies (CFDs; Chapter 2) –Syntax and semantics –Static analysis: consistency and](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649e255503460f94b13732/html5/thumbnails/48.jpg)
4848
Deducing new dependencies from given rules
FN LN post phn when where amount
Max Smith PO Box 25, EDI 3256777 2pm/7/7/09 NYC $6,300
FN LN address tel DOB gender
Mark Smith 10 Oak St, EDI, EH8 9LE 3256777 10/27/97 M
card[tel] = trans[phn] card[address] trans[post]
card[LN,address] = trans[LN,post] card[FN] trans[FN] card[X] trans[Y]
Matched by the deduced rule, but NOT by the given ones!
card[LN, tel] = trans[LN, phn] card[FN] trans[FN] card[X] trans[Y]
deduction
Match
card
transRadically different
![Page 49: 1 Dependencies for improving data quality Conditional functional dependencies (CFDs; Chapter 2) –Syntax and semantics –Static analysis: consistency and](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649e255503460f94b13732/html5/thumbnails/49.jpg)
4949
Matching dependencies (MDs)
(R1[A1] 1 R2[B1] . . . R1[Ak] k R2[Bk]) R1[Z1] R2[Z2]
R1[X]: card[FN, LN, address] , R2[Y]: trans[FN, LN, post] card[LN, address] = trans[LN, post] card[FN] trans[FN] card[X] trans[Y] card[tel] = trans[phn] card[address] trans[post] card[LN, tel] = trans[LN, phn] card[FN] trans[FN] card[X] trans[Y]
Semantic relationship on attributes across different sources
R1[X], R2[Y]: entities to be identified
(Z1, Z2): lists of attributes in (X, Y), of the same length
j : similarity operator (edit distance, q-gram, jaro distance, …)
: matching operator (identify two lists of attributes via updates)
tel and phn are not part of X, Y
![Page 50: 1 Dependencies for improving data quality Conditional functional dependencies (CFDs; Chapter 2) –Syntax and semantics –Static analysis: consistency and](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649e255503460f94b13732/html5/thumbnails/50.jpg)
5050
Dynamic semantics
= (R1[A1] 1 R2[B1] . . . R1[Ak] k R2[Bk]) R1[Z1] R2[Z2]
Two instances are needed to cope with the dynamic semantics
(D1, D2) satisfies iff for all (t1, t2) D1, if t1[A1] 1 t2[B1] . . . t1[Ak] k t2[Bk] in D1
– then (t1, t2) D2, and t1[Z1] = t2[Z2] in D2
If (t1, t2) match the LHS, then their RHS are updated and equalized
phn post …
3256777 PO Box 25, EDI
tel address …
3256777 10 Oak St, EDI
phn post …
3256777 10 Oak St, EDI, EH8 9LE
tel address …
3256777 10 Oak St, EDI, EH8 9LE
D1 D2
Different from FDs?
![Page 51: 1 Dependencies for improving data quality Conditional functional dependencies (CFDs; Chapter 2) –Syntax and semantics –Static analysis: consistency and](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649e255503460f94b13732/html5/thumbnails/51.jpg)
5151
An extension of functional dependencies (FDs)?
A departure from traditional dependency theory
tel address …
3256777 10 Oak St, EDI
3256777 PO Box 25, EDI
tel address …
3256777 10 Oak St, EDI, EH8 9LE
3256777 10 Oak St, EDI, EH8 9LE
D1 D2
similarity operators vs. equality (=) only across different relations (R1, R2) vs. on a single relation dynamic semantic (matching operator ) vs. static semantics
FD: tel address
violationof the FD satisfying
the MD
to accommodate unreliable data
MD: (R1[A1] 1 R2[B1] . . . R1[Ak] k R2[Bk]) R1[Z1] R2[Z2]developed for schema design for “clean” data
![Page 52: 1 Dependencies for improving data quality Conditional functional dependencies (CFDs; Chapter 2) –Syntax and semantics –Static analysis: consistency and](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649e255503460f94b13732/html5/thumbnails/52.jpg)
5252
Deduction of new MDs from given MDs
Given a set of MDs and a single , can be deduced from ?
Different from our familiar notion of implication
For all (D1, D2) if (D1, D2) satisfies and (D2, D2) satisfies
then (D1, D2) satisfies D1 D2
“fixpoint” reached by enforcing
No matter how is interpreted, if is enforced, so is
is a logical
consequence of
Example: deduction of from {1, 2}, where
: card[LN, tel] = trans[LN, phn] card[FN] trans[FN] card[X] trans[Y]
1: card[tel] = trans[phn] card[address] trans[post]
2: card[LN,address] = trans[LN,post] card[FN] trans[FN] card[X] trans[Y]
![Page 53: 1 Dependencies for improving data quality Conditional functional dependencies (CFDs; Chapter 2) –Syntax and semantics –Static analysis: consistency and](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649e255503460f94b13732/html5/thumbnails/53.jpg)
5353
An inference system for MDs
There is a finite set of axioms sound and complete for MD deduction
1: card[tel] = trans[phn] card[address] trans[post]
: card[LN, tel] = trans[LN, phn] card[FN] trans[FN] card[X] trans[Y]
Augmentation Rule
card[LN, tel] = trans[LN, phn] card[LN, address] trans[LN, post]
2: card[LN,address] = trans[LN,post] card[FN] trans[FN] card[X] trans[Y]
Transitivity Rule
Example: MD is provable from {1, 2} by using the inference system
Recall Armstrong’s
axioms for FDs
More involved than Armstrong’s axioms (11 axioms vs. 3) two relations, generic reasoning for similarity operators
![Page 54: 1 Dependencies for improving data quality Conditional functional dependencies (CFDs; Chapter 2) –Syntax and semantics –Static analysis: consistency and](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649e255503460f94b13732/html5/thumbnails/54.jpg)
5454
Relative Candidate Keys (RCKs)
(R1[A1] 1 R2[B1] . . . R1[Ak] k R2[Bk]) R1[X] R2[Y]
(R1[A1, …, Ak], R2[B1, …, Bk] || [1 , . . ., k])
R1[X]: card[X] , R2[Y]: trans[Y] card[LN,address] = trans[LN,post] card[FN] trans[FN] card[X]trans[Y]
(card[LN, address, FN], trans[LN, post, FN] || [=, =, ])
card[tel] = trans[phn] card[address] trans[post] NOT an RCK
card[LN, tel] = trans[LN, phn] card[FN] trans[FN] card[X] trans[Y]
(card[LN, tel, FN], trans[LN, phn, FN] || [=, =, ])
A departure from candidate keys: similarity, different sources
Ultimate goal: to decide whether R1[X] and R2[Y] refer to the same objectrelative to R1[X]
and R2[Y]
what to compare and how to compare
![Page 55: 1 Dependencies for improving data quality Conditional functional dependencies (CFDs; Chapter 2) –Syntax and semantics –Static analysis: consistency and](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649e255503460f94b13732/html5/thumbnails/55.jpg)
5555
What is special about RCKs?
The match quality is highly dependent on the choices of keys
only records in the same block are
compared
– windowing (sorted neighborhood)
D B2B1
B3 discriminating
attributes
D D sortingvia keys
slidingwindow
window of a fixed size; only records in the same window are
compared;
Matching rules: identify records from unreliable data sources
Optimization: efficiency is a big issue for record matching– blocking
RCKs can be deduced from matching dependencies
![Page 56: 1 Dependencies for improving data quality Conditional functional dependencies (CFDs; Chapter 2) –Syntax and semantics –Static analysis: consistency and](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649e255503460f94b13732/html5/thumbnails/56.jpg)
56
Summary and review
What are CFDs? CINDs? Why do we need new constraints?
What is the consistency problem? Complexity?
What is the implication problem? Inference system? Sound and
complete?
What is record matching? Why bother?
What are matching rules?
Why matching dependencies and relative candidate keys? Why are they
different from functional dependencies and relational keys? What are blocking and windowing?
Problems to think about:
– CFDs across different relations? With numeric values?
– Negative rules for MDs: if condition then NO match
– Conditions for MDs: if then match
![Page 57: 1 Dependencies for improving data quality Conditional functional dependencies (CFDs; Chapter 2) –Syntax and semantics –Static analysis: consistency and](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649e255503460f94b13732/html5/thumbnails/57.jpg)
57
Project (1)
Given a relational database D and a set S of data quality rules, error detection is to find all pairs of tuples in D that violate at least one dependency in S. Develop two MapReduce algorithms for error detection, when S consists of
• conditional functional dependencies; or• conditional inclusion dependencies
Prove the correctness of your algorithms, give their complexity analyses, and show that they are scalable with both D and SExperimentally evaluate your algorithms, by randomly generating dependencies in S and large datasets D
57
A development project
![Page 58: 1 Dependencies for improving data quality Conditional functional dependencies (CFDs; Chapter 2) –Syntax and semantics –Static analysis: consistency and](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649e255503460f94b13732/html5/thumbnails/58.jpg)
58
Project (2)
Given a relation D and a set S of matching dependencies, record matching is to identify all pairs of tuples in D based on dependencies in S. For matching dependencies, read
W. Fan, J. Li, X. Jia, and S. Ma. Dynamic constraints for record matching, VLDB, 2009.
Develop a MapReduce algorithm for record matching. Prove the correctness of your algorithm, give its complexity
analyses, and show that it is scalable with both D and S Develop optimization techniques to improve its response time Experimentally evaluate your algorithm, by randomly generating
dependencies in S and large datasets D
Development project
![Page 59: 1 Dependencies for improving data quality Conditional functional dependencies (CFDs; Chapter 2) –Syntax and semantics –Static analysis: consistency and](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649e255503460f94b13732/html5/thumbnails/59.jpg)
59
Project (3)
Write a survey on integrity constraints studied for improving data quality, including
• Traditional functional and inclusion dependencies• Conditional functional and inclusion dependencies• Denial constraints, and• at least two other forms of constraints
Develop a set of criteria for evaluating these data quality rules, in terms of expressive power, complexity, and practical impact, etc.Critically evaluate each of these classes based on your criteriaPropose possible extensions to these constraints as data quality rules for big data.
59A survey project
![Page 60: 1 Dependencies for improving data quality Conditional functional dependencies (CFDs; Chapter 2) –Syntax and semantics –Static analysis: consistency and](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649e255503460f94b13732/html5/thumbnails/60.jpg)
Reading for the next weekhttp://homepages.inf.ed.ac.uk/wenfei/publication.html
60
1. P. Bohannon, W. Fan, M. Flaster, and R. Rastogi, A Cost-Based Model and Effective Heuristic for Repairing Constraints by Value Modification, SIGMOD 2005.
2. W. Fan, F. Geerts, S. Ma, and H. Muller, Detecting inconsistencies in distributed data, ICDE 2010.
3. W. Fan, J. LI, N. Tang, and W. Yu. Incremental detection of inconsistencies in distributed data, TKDE 26(6). 2014
4. W. Fan, J. Li, S. Ma, N. Tang and W. Yu, Interaction between Record Matching and Data Repairing, SIGMOD, 2011.
5. W. Fan, J. Li, S. Ma, N. Tang, and W. Yu, Towards certain fixes with editing rules and master data, VLDB 2010.
6. M. Yakout, A. K. Elmagarmid, J. Neville, M. Ouzzani, and I. F. Ilyas. Guided data repair. PVLDB 2011. https://cs.uwaterloo.ca/~ilyas/papers/YakoutVLDB2011.pdf