DATA FUSION
Resolving Inconsistencies at Schema, Tuple and Value Level
Naveen Rajamoorthy
Nachiappan Chidambaram
Arunkarthikeyan Palaniswamy
Sriramakrishnan Soundarrajan
Need for Data Fusion
To compare different data sets.
Examples:
◦ Shopping agents
◦ Disaster management systems
GOALS OF DATA INTEGRATION
Completeness: amount of data (number of attributes and tuples); achieved by adding more data sources.
Conciseness: number of unique objects and number of unique attributes of the objects; achieved by reducing schematic inconsistencies through schema mapping.
Correctness: validity of data; achieved by performing duplicate detection and data fusion.
Data Sources
Schema Mapping
Duplicate Detection
Data Fusion
Humboldt Merger (HumMer)
Fuses data from heterogeneous sources.
All steps are performed at run time.
Data cleaning
Maximum flexibility
Heterogeneous and Dirty data
Three Steps
1. Schema Matching and Data Transformation
2. Duplicate Detection
3. Data Fusion
Components of Data Fusion
Three Steps in Data Fusion
Resolve inconsistencies at schema level
Resolve inconsistencies at tuple level
Resolve inconsistencies at value level
Schema Matching and Data Transformation
Process of resolving schematic heterogeneity.
1. DUMAS Schema Matching Algorithm (Duplicate-based Matching of Schemas)
2. TF-IDF Similarity (term frequency–inverse document frequency)
Schema Matching
Example: consider the relations R(A, B, C, D, E) and S(B', F, E', G).

R    A     B      C  D             E
R1   John  Doe    M  (408)7573339  (408)7573338
R2   Joe   Smith  M  (249)3615616  (249)2342366
R3   Suzy  Klein  F  (358)2436321  (358)2436321
R4   Sam   Adams  M  (541)8127100  (541)8121164

S    B'     F      E'           G
S1   Doe    Jdoe   408-9182043  XP
S2   Deen   Jdean  369-3663625  XP
S3   Klein  suzy   358-2436321  Unix
S4   Adams  Adams  541-8121164  W2000
Example: a duplicate pair found in R and S

R    A     B      C  D             E
R3   Suzy  Klein  F  (358)2436321  (358)2436321

S    B'     F     E'           G
S3   Klein  Suzy  358-2436321  Unix
Overlap of R and S schema
Schema Matching
Attributes in R   Attributes in S
A                 ----
B                 B'
C                 ----
D                 ----
E                 E'
----              F
----              G
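The idea of deriving attribute correspondences from duplicates can be sketched as follows. This is a simplified illustration, not the DUMAS algorithm itself: it replaces the TF-IDF based field similarity with an exact comparison of normalized values, and it assumes (hypothetically) that R1/S1, R3/S3 and R4/S4 have already been identified as duplicate pairs. The function name `match_attributes` and the threshold of 0.5 are illustrative choices.

```python
import re
from collections import defaultdict

def normalize(value):
    """Lower-case a value and keep only letters and digits."""
    return re.sub(r"[^a-z0-9]", "", value.lower())

# Assumed duplicate pairs (R tuple, S tuple) taken from the example tables.
DUPLICATES = [
    ({"A": "John", "B": "Doe", "C": "M", "D": "(408)7573339", "E": "(408)7573338"},
     {"B'": "Doe", "F": "Jdoe", "E'": "408-9182043", "G": "XP"}),
    ({"A": "Suzy", "B": "Klein", "C": "F", "D": "(358)2436321", "E": "(358)2436321"},
     {"B'": "Klein", "F": "suzy", "E'": "358-2436321", "G": "Unix"}),
    ({"A": "Sam", "B": "Adams", "C": "M", "D": "(541)8127100", "E": "(541)8121164"},
     {"B'": "Adams", "F": "Adams", "E'": "541-8121164", "G": "W2000"}),
]

def match_attributes(duplicate_pairs, threshold=0.5):
    """Return a one-to-one mapping from R attributes to S attributes."""
    hits = defaultdict(int)
    for r_tuple, s_tuple in duplicate_pairs:
        for r_attr, r_val in r_tuple.items():
            for s_attr, s_val in s_tuple.items():
                if normalize(r_val) == normalize(s_val):
                    hits[(r_attr, s_attr)] += 1
    # Average agreement of each attribute pair over all duplicate pairs.
    scores = {pair: n / len(duplicate_pairs) for pair, n in hits.items()}
    mapping, used_s = {}, set()
    # Greedy assignment: best-scoring attribute pairs first.
    for (r_attr, s_attr), score in sorted(scores.items(), key=lambda kv: -kv[1]):
        if score >= threshold and r_attr not in mapping and s_attr not in used_s:
            mapping[r_attr] = s_attr
            used_s.add(s_attr)
    return mapping

print(match_attributes(DUPLICATES))
```

Under these assumptions only B/B' and E/E' agree in more than half of the duplicate pairs, which is the overlap listed in the table above.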
Preferred schema
Names of attributes are renamed or determined.
sourceID attribute is added to all tables in the schema.
Transformation
Duplicate Detection
EXAMPLE
Source A
<pub>
  <Name>Database Systems: The Complete Book</Name>
  <Authors>
    <Author>Hector Garcia-Molina</Author>
    <Author>Jeffrey D. Ullman</Author>
    <Author>Jennifer D. Widom</Author>
  </Authors>
</pub>
Source B
<publication>
  <title>Database Systems: The Complete Book</title>
  <author>Molina & Ullman</author>
  <year>1990</year>
</publication>
SCHEMA MAPPING
Source A and Source B are both mapped to a common schema:
<pub>
  <title> </title>
  <Authors>
    <author> </author>
    <author> </author>
  </Authors>
  <year> </year>
</pub>
DATA TRANSFORMATION
Source A before transformation:
<pub>
  <Name>Database Systems: The Complete Book</Name>
  <Authors>
    <Author>Hector Garcia-Molina</Author>
    <Author>Jeffrey D. Ullman</Author>
    <Author>Jennifer D. Widom</Author>
  </Authors>
</pub>
Source A after the XQuery transformation:
<pub>
  <title>Database Systems: The Complete Book</title>
  <Authors>
    <Author>Hector Garcia-Molina</Author>
    <Author>Jeffrey D. Ullman</Author>
    <Author>Jennifer D. Widom</Author>
  </Authors>
</pub>
Source B after the XQuery transformation:
<pub>
  <title>Database Systems: The Complete Book</title>
  <Author>Hector Garcia-Molina</Author>
  <Author>Jeffrey D. Ullman</Author>
  <Author>Jennifer D. Widom</Author>
  <year>1990</year>
</pub>
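The slides perform this step with XQuery. As a rough equivalent, the renaming of Source A's `Name` element to `title` can be sketched in Python with the standard library's `xml.etree.ElementTree`. The helper name `to_target_schema` is hypothetical, and only this one renaming is shown.

```python
import xml.etree.ElementTree as ET

SOURCE_A = """
<pub>
  <Name>Database Systems: The Complete Book</Name>
  <Authors>
    <Author>Hector Garcia-Molina</Author>
    <Author>Jeffrey D. Ullman</Author>
    <Author>Jennifer D. Widom</Author>
  </Authors>
</pub>
"""

def to_target_schema(xml_text):
    """Parse a Source A record and rename <Name> to <title>."""
    root = ET.fromstring(xml_text)
    for name in root.findall("Name"):
        name.tag = "title"  # element tags can be reassigned in place
    return root

transformed = to_target_schema(SOURCE_A)
print(ET.tostring(transformed, encoding="unicode"))
```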
DUPLICATE DETECTION AND FUSION
The two transformed records
<pub>
  <title>Database Systems: The Complete Book</title>
  <Authors>
    <Author>Hector Garcia-Molina</Author>
    <Author>Jeffrey D. Ullman</Author>
    <Author>Jennifer D. Widom</Author>
  </Authors>
</pub>
<pub>
  <title>Database Systems: The Complete Book</title>
  <Author>Hector Garcia-Molina</Author>
  <Author>Jeffrey D. Ullman</Author>
  <Author>Jennifer D. Widom</Author>
  <year>1990</year>
</pub>
are detected as duplicates and fused into
<pub>
  <title>Database Systems: The Complete Book</title>
  <Authors>
    <Author>Hector Garcia-Molina</Author>
    <Author>Jeffrey D. Ullman</Author>
    <Author>Jennifer D. Widom</Author>
  </Authors>
  <year>1990</year>
</pub>
QUESTION
Give the correct order in which integration needs to be carried out:
A) Data Transformation -> Schema Mapping -> Duplicate Detection -> Fusion
B) Duplicate Detection -> Data Transformation -> Schema Mapping -> Fusion
C) Schema Mapping -> Data Transformation -> Duplicate Detection -> Fusion
D) Data Transformation -> Schema Mapping -> Fusion -> Duplicate Detection
Duplicate Detection
Problem
◦ Given one or more data sets, find all sets of objects that represent the same real-world entity.
Difficulties
◦ Duplicates are not identical: similarity measures (Levenshtein, Jaccard, etc.)
◦ Large volume, cannot compare all pairs: partitioning strategies (Sorted Neighborhood, Blocking, etc.)
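The two similarity measures named above can be sketched in a few lines. This is a minimal illustration: `levenshtein` is the standard dynamic-programming edit distance over characters, and `jaccard` is computed here over whitespace-separated lower-cased tokens, which is one common choice among several.

```python
def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]

def jaccard(a, b):
    """Jaccard similarity of the token sets of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

print(levenshtein("Star Wars", "Starwars"))  # 2
print(jaccard("George Lucas", "Lucas"))      # 0.5
```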
PARTITIONING STRATEGIES
◦ General Strategy
◦ Sorted Neighborhood Method
GENERAL STRATEGY
Compare each record with every other record and calculate a distance measure. If the database holds n records, n(n-1)/2 distance measures must be computed.

X              Y             Z
Star Wars      Lucas         1985
Indiana Jones  Lucas         1989
Home Alone     Wright        1991
Starwars       George Lucas  1985
Shrek          Adamson       2001
Snatch         Ritcie        1999

Number of records: n = 6
Number of distance measures to be computed: 6 * 5 / 2 = 15
For 100,000 records, the number of distance measures to be computed is 100,000 * 99,999 / 2, which is about 5 * 10^9.
EXPENSIVE
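The growth of the all-pairs strategy can be checked directly. The sketch below enumerates every pair of the six example titles with `itertools.combinations` and uses `math.comb` for the larger case; the variable names are illustrative.

```python
from itertools import combinations
from math import comb

titles = ["Star Wars", "Indiana Jones", "Home Alone",
          "Starwars", "Shrek", "Snatch"]

# Every unordered pair of records is one distance computation.
all_pairs = list(combinations(titles, 2))
print(len(all_pairs))   # n(n-1)/2 for n = 6
print(comb(100_000, 2)) # n(n-1)/2 for n = 100,000
```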
SORTED NEIGHBORHOOD METHOD
The Sorted Neighborhood method reduces the number of potential duplicate pairs.
◦ One or more fields are identified as the key.
◦ The database is sorted using this key.
◦ After sorting, a window of fixed size slides over the sorted database, and duplicate records are identified within the window.
The technique generates O(wN) pairs, where w is the window size and N is the total number of records in the database.
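A minimal sketch of the candidate generation step, using the movie records from the earlier example. The sorting key (lower-cased title without spaces) and the function name `sorted_neighborhood_pairs` are assumptions made for illustration; any key that brings likely duplicates close together would do.

```python
MOVIES = [
    ("Star Wars", "Lucas", 1985),
    ("Indiana Jones", "Lucas", 1989),
    ("Home Alone", "Wright", 1991),
    ("Starwars", "George Lucas", 1985),
    ("Shrek", "Adamson", 2001),
    ("Snatch", "Ritcie", 1999),
]

def title_key(record):
    """Sorting key: lower-cased title with spaces removed."""
    return record[0].lower().replace(" ", "")

def sorted_neighborhood_pairs(records, key, window):
    """Sort by key, then pair each record with the next window-1 records."""
    ordered = sorted(records, key=key)
    pairs = []
    for i, record in enumerate(ordered):
        for neighbor in ordered[i + 1:i + window]:
            pairs.append((record, neighbor))
    return pairs

candidates = sorted_neighborhood_pairs(MOVIES, title_key, window=2)
for left, right in candidates:
    print(left[0], "<->", right[0])
```

With a window of 2 only adjacent records are compared, so the six records yield 5 candidate pairs instead of 15, and "Star Wars" and "Starwars" still end up next to each other.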
DUPLICATE DETECTION WITH DESCRIPTIONS
Criteria for attribute selection. Attributes should be:
(i) related to the currently considered object
    Example: child elements that have a foreign key constraint over the attributes of the parent table.
(ii) usable by our similarity measure
    Example: the attribute City, which corresponds to the attribute Zip code, cannot be used to calculate the similarity measure.
(iii) likely to distinguish duplicates from non-duplicates
    Example: the attribute Denomination is unlikely to distinguish duplicate records.
DUPLICATE DETECTION WITH DESCRIPTIONS
Description: consider attributes from other tables that have a foreign key relationship with the existing tables.
For efficiency, only direct child attributes are considered, i.e. descendants reached by following more than one reference are discarded.
Movie (Title, Year, Duration)
Film (Name, Date, Rating)
Actor (Name, Movie)
Actress (Name, Movie)
Actors (Name, Film)
Prod-Com (Name, Film)
Let tables T1 and T2 be the two matched tables, and let {T1,1, . . . , T1,k} and {T2,1, . . . , T2,m} be their respective children tables.
Then every pair of tables (T1,i, T2,j), 1 <= i <= k, 1 <= j <= m, is matched.
Thus Actor(Movie), Actress(Movie) and Actors(Film) can also be used for duplicate detection.
Example

Table 1
ID  Country
1   USA
2   United States
3   US

Table 2
ID  City        Country ID
1   Charlotte   1
2   California  1
3   Charlotte   2
4   California  2
5   Charlotte   3
6   California  3

Country ID in Table 2 is a foreign key referencing ID in Table 1.
From Sim(Country) in Table 1 we conclude that rows 1 and 3 are duplicates (row 1 = row 3).
Using the attribute City in the child table (Table 2) for duplicate detection as well, we come to the conclusion that row 1 = row 2 = row 3 in Table 1, i.e. USA = United States = US.
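Detection From Similarity Measure
Source 1 and Source 2 are combined (Source1 x Source2), partitioned, and compared with a similarity measure. Each candidate pair falls into one of three classes:
◦ Non-duplicates: sim < θ1
◦ Possible duplicates: θ1 ≤ sim ≤ θ2
◦ Sure duplicates: sim > θ2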
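The three-way decision with two thresholds can be sketched as a small function. The threshold values 0.4 and 0.8 below are arbitrary illustrative defaults, not values from the slides.

```python
def classify(sim, theta1=0.4, theta2=0.8):
    """Classify a candidate pair by its similarity score."""
    if sim > theta2:
        return "sure duplicate"
    if sim < theta1:
        return "non-duplicate"
    return "possible duplicate"  # theta1 <= sim <= theta2

for score in (0.95, 0.6, 0.1):
    print(score, classify(score))
```

Pairs in the middle band are typically the ones handed to a human expert for review.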
Data Fusion
Objective: given a duplicate, create a single object representation while resolving conflicting data values.
Simple example:
Source 1: 98765432, R.J.Ludlum, 3.50, Year
Source 2: 98765432, Trevayne, Robert Ludlum, 4.00, Month
Fused record: ID, Max_length(author), Min(price), Concat(Month, Year)
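A minimal sketch of this fusion step in Python. The field names `id`, `author` and `price` are assumed from the resolution functions shown; the Month and Year values are placeholders on the slide and are therefore left out.

```python
source1 = {"id": "98765432", "author": "R.J.Ludlum", "price": 3.50}
source2 = {"id": "98765432", "author": "Robert Ludlum", "price": 4.00}

def fuse(a, b):
    """Fuse two records that describe the same object."""
    return {
        "id": a["id"],                                     # shared identifier
        "author": max(a["author"], b["author"], key=len),  # Max_length(author)
        "price": min(a["price"], b["price"]),              # Min(price)
    }

fused = fuse(source1, source2)
print(fused)
```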
TYPES OF DATA CONFLICT

Uncertainty (NULL value vs. non-NULL value, the "easy" case)
A conflict between a non-null value and one or more null values that are all used to describe the same property of a real-world entity.
Cause: missing information, such as null values in a source or a completely missing attribute in a source.

Title   Year  Director  Source
Snatch  2000  Ritchie   S1
Snatch  2000  null      S2

Contradiction (non-NULL value vs. different non-NULL value)
A conflict between two or more different non-null values that are all used to describe the same property of the same entity.
Cause: different sources provide different values for the same attribute of a real-world entity.

Title   Year  Director  Source
Snatch  2000  Ritchie   S1
Snatch  2000  Benaud    S2
NULL TYPES
unknown
◦ There is a value, but I do not know it.
◦ E.g.: unknown date of birth
not applicable
◦ There is no meaningful value.
◦ E.g.: spouse for singles
withheld
◦ There is a value, but we are not authorized to see it.
◦ E.g.: private phone line
________ refers to “Conflict between a non-null value and one or more null values that are all used to describe the same property of a real-world entity”
A. Contradiction B. Uncertainty C. Resolution D. Ignorance
Question
Classification of Functions
conflict resolution strategies
  conflict ignorance
  conflict avoidance
    instance based
    metadata based
  conflict resolution
    instance based
      deciding
      mediating
    metadata based
      deciding
      mediating
Functions placed in the classification: Coalesce, Choose, ChooseDepending, Concat, AVG, SUM, MIN, MAX, Random, Vote, MostRecent, MostAbstract, MostSpecific, CommonAncestor, Escalate
Conflict Resolution Functions

Function                                    Description                                    Examples
Min, Max, Sum, Count, Avg                   Standard aggregation                           NumChildren, Salary, Height
Random                                      Random choice                                  Shoe size
Longest, Shortest                           Longest/shortest value                         First_name
Choose(source)                              Value from a particular source                 DoB (DMV), CEO (SEC)
ChooseDepending(val, col)                   Value depends on value chosen in other column  city & zip, e-mail & employer
Vote                                        Majority decision                              Rating
Coalesce                                    First non-null value                           First_name
Group, Concat                               Group or concatenate all values                Book_reviews
MostRecent                                  Most recent (up-to-date) value                 Address
MostAbstract, MostSpecific, CommonAncestor  Use a taxonomy / ontology                      Location
Escalate                                    Export conflicting values                      gender
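A few of these functions are easy to express over a list of candidate values, with `None` standing in for NULL. The sketch below is illustrative only; the function names mirror the table, and the behavior on ties or empty input is an assumption rather than something the slides specify.

```python
from collections import Counter

def coalesce(values):
    """First non-null value, or None if all values are null."""
    return next((v for v in values if v is not None), None)

def vote(values):
    """Most frequent non-null value (majority decision)."""
    non_null = [v for v in values if v is not None]
    return Counter(non_null).most_common(1)[0][0] if non_null else None

def longest(values):
    """Longest non-null string value."""
    non_null = [v for v in values if v is not None]
    return max(non_null, key=len) if non_null else None

def concat(values, sep=" "):
    """Concatenate all non-null values."""
    return sep.join(str(v) for v in values if v is not None)

print(coalesce([None, "x", "y"]))           # x
print(vote([4, 5, 4, None]))                # 4
print(longest(["R.J.", "Robert"]))          # Robert
print(concat(["Michael", None, "King"]))    # Michael King
```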
Data Fusion Goals
Assume 2 sources, Source 1(A,B,C) and Source 2(A,B,D).

Case                  Source 1  Source 2  Tuples over (A,B,C,D)     Fused tuple
Identical tuples      a, b, -   a, b, -   a, b, -, -   a, b, -, -   a, b, -, -
Subsumed tuples       a, b, c   a, b, -   a, b, c, -   a, b, -, -   a, b, c, -
Conflicting tuples    a, b, c   a, e, d   a, b, c, -   a, e, -, d   a, f(b,e), c, d
Complementing tuples  a, b, c   a, b, d   a, b, c, -   a, b, -, d   a, b, c, d
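The four cases can be told apart mechanically once both tuples are expressed over the combined schema (A, B, C, D). The sketch below uses `None` for the "-" placeholder; the function name `relation` is hypothetical.

```python
def relation(t1, t2):
    """Classify two tuples over the same schema (None marks a missing value)."""
    if t1 == t2:
        return "identical"
    # Two different non-null values in the same position contradict each other.
    if any(a is not None and b is not None and a != b for a, b in zip(t1, t2)):
        return "conflicting"
    # One tuple carries every non-null value of the other: subsumption.
    t1_covers_t2 = all(b is None or a == b for a, b in zip(t1, t2))
    t2_covers_t1 = all(a is None or a == b for a, b in zip(t1, t2))
    if t1_covers_t2 or t2_covers_t1:
        return "subsumed"
    return "complementing"

print(relation(("a", "b", None, None), ("a", "b", None, None)))
print(relation(("a", "b", "c", None), ("a", "b", None, None)))
print(relation(("a", "b", "c", None), ("a", "e", None, "d")))
print(relation(("a", "b", "c", None), ("a", "b", None, "d")))
```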
Relational Operators – Overview
◦ Identical tuples (duplicates): UNION, OUTER UNION
◦ Subsumed tuples (uncertainty): MINIMUM UNION
◦ Complementing tuples (uncertainty): COMPLEMENT UNION, MERGE
◦ Conflicting tuples (contradiction): MATCH, GROUP, FUSE
UNION

R
Title  Author  ISBN
A      X       123456789
B      Y       213456789

S
Name  Author  ID
D     P       312456789
A     X       123456789
B     Y       213456789

R UNION S
Name  Author  ISBN
A     X       123456789
B     Y       213456789
D     P       312456789

(SELECT Title AS Name, Author, ISBN FROM R)
UNION
(SELECT Name, Author, ID AS ISBN FROM S)
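MINIMUM UNION
A tuple t1 subsumes a tuple t2 if both have the same schema, t1 has fewer NULL values than t2, and t1 agrees with t2 on all of t2's non-NULL values.

R
A  B  C
a  b  c
e  f  g
m  n  o

S
A  B  D
a  b  -
e  f  h
m  p  -

Outer union of R and S
A  B  C  D
a  b  c  -
a  b  -  -
e  f  g  -
e  f  -  h
m  n  o  -
m  p  -  -

Minimum union of R and S (the subsumed tuple a, b, -, - is removed)
A  B  C  D
a  b  c  -
e  f  g  -
e  f  -  h
m  n  o  -
m  p  -  -

The outer union is computed as:
SELECT A, B, C, NULL AS D FROM R
UNION ALL
SELECT A, B, NULL AS C, D FROM S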
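The subsumption test and the removal of subsumed tuples can be sketched as follows. The sketch assumes that the S row "m p" has p in column B and a NULL in column D, represents NULL as `None`, and uses hypothetical function names.

```python
def subsumes(t1, t2):
    """True if t1 subsumes t2: t1 agrees with t2 on all of t2's non-null
    values and has fewer nulls."""
    if len(t1) != len(t2):
        return False
    agrees = all(b is None or a == b for a, b in zip(t1, t2))
    return agrees and t1.count(None) < t2.count(None)

def minimum_union(r, s):
    """Outer union of R(A,B,C) and S(A,B,D) without subsumed tuples."""
    outer = [(a, b, c, None) for a, b, c in r] + \
            [(a, b, None, d) for a, b, d in s]
    distinct = list(dict.fromkeys(outer))  # drop exact duplicates, keep order
    return [t for t in distinct
            if not any(subsumes(u, t) for u in distinct)]

R = [("a", "b", "c"), ("e", "f", "g"), ("m", "n", "o")]
S = [("a", "b", None), ("e", "f", "h"), ("m", "p", None)]

for row in minimum_union(R, S):
    print(row)
```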
FULL DISJUNCTION
(Rows with fewer values than columns contain NULLs.)

R (A, B, C)
a b c
e f g
k o
k m

S (A, B, D)
a b
e f h
m p
k q r

R |⋈| S (A, B, C, D)
a b c
e f g h
m p
k o
k m
k q r

SELECT * FROM R FULL OUTER JOIN S ON R.A = S.A AND R.B = S.B;
R
A  B  C
a  b  c
e  f  g
m  n  o
m  n  -
q  r  s

S
A  B  D
a  b  -
e  f  h
m  p  -

R |⋈ S
A  B          C  D
a  COAL(b,b)  c  -
e  COAL(f,f)  g  h
m  COAL(n,p)  o  -
m  COAL(n,p)  -  -
q  r          s  -
=
A  B  C  D
a  b  c  -
e  f  g  h
m  n  o  -
m  n  -  -
q  r  s  -

R ⋈| S
A  B          C  D
a  COAL(b,b)  c  -
e  COAL(f,f)  g  h
m  COAL(p,n)  o  -
m  COAL(p,n)  -  -
=
A  B  C  D
a  b  c  -
e  f  g  h
m  p  o  -
m  p  -  -
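MERGE AND PRIORITIZED MERGE
SELECT R.A, COALESCE(R.B, S.B) AS B, R.C, S.D FROM R LEFT OUTER JOIN S ON R.A = S.A
UNION
SELECT S.A, COALESCE(S.B, R.B) AS B, R.C, S.D FROM R RIGHT OUTER JOIN S ON R.A = S.A;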
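The merge of the two example tables can also be sketched procedurally, as the union of two outer joins on A whose shared attribute B is resolved with a coalesce. This is one reading of the example, with NULL represented as `None`, R given as (A, B, C) tuples and S as (A, B, D) tuples; the function names are hypothetical.

```python
def coalesce(*values):
    """First non-null argument, or None."""
    return next((v for v in values if v is not None), None)

def merge(r, s):
    """Union of the left and right outer joins of R(A,B,C) and S(A,B,D) on A."""
    left, right = [], []
    for ra, rb, rc in r:  # left outer join: B prefers R
        partners = [t for t in s if t[0] == ra]
        if partners:
            left += [(ra, coalesce(rb, sb), rc, sd) for _, sb, sd in partners]
        else:
            left.append((ra, rb, rc, None))
    for sa, sb, sd in s:  # right outer join: B prefers S
        partners = [t for t in r if t[0] == sa]
        if partners:
            right += [(sa, coalesce(sb, rb), rc, sd) for _, rb, rc in partners]
        else:
            right.append((sa, sb, None, sd))
    return list(dict.fromkeys(left + right))  # set union, order preserved

R = [("a", "b", "c"), ("e", "f", "g"), ("m", "n", "o"),
     ("m", "n", None), ("q", "r", "s")]
S = [("a", "b", None), ("e", "f", "h"), ("m", "p", None)]

merged = merge(R, S)
for row in merged:
    print(row)
```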
FUSE BY
SELECT Name, RESOLVE(Age, max), RESOLVE(Address, choose(EE_Students))
FUSE FROM EE_Students, CS_Students
FUSE BY (Name)

First input table
Name    Age  Address
Ram     20   ABCD
Rajesh  21   EFGH

Second input table
Name    Age  Address
Ram     23   ABCD
Rajesh  20   PQRS

RESULT
Name    Age  Address
Ram     23   ABCD
Rajesh  21   PQRS
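The effect of this query can be imitated in plain Python by grouping on Name and applying one resolution function per column. The input tables are not labeled on the slide; the sketch assumes that the table containing (Rajesh, 20, PQRS) is EE_Students, since that is the only assignment consistent with the result shown. The function name `fuse_by_name` is hypothetical.

```python
from collections import defaultdict

# Assumed assignment of the two unlabeled input tables.
ee_students = [("Ram", 23, "ABCD"), ("Rajesh", 20, "PQRS")]
cs_students = [("Ram", 20, "ABCD"), ("Rajesh", 21, "EFGH")]

def fuse_by_name(ee, cs):
    """Group by Name; resolve Age with max and Address with choose(EE_Students)."""
    groups = defaultdict(list)
    for source, rows in (("EE_Students", ee), ("CS_Students", cs)):
        for name, age, address in rows:
            groups[name].append({"source": source, "age": age, "address": address})
    fused = []
    for name, rows in groups.items():
        age = max(row["age"] for row in rows)
        from_ee = [row["address"] for row in rows if row["source"] == "EE_Students"]
        fused.append((name, age, from_ee[0] if from_ee else None))
    return fused

print(fuse_by_name(ee_students, cs_students))
```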
FUSE BY
SELECT ID, RESOLVE(Title, Choose(IMDB)), RESOLVE(Year, Max),
       RESOLVE(Director, Concat), RESOLVE(Rating)
FUSE FROM IMDB, Filmdienst
FUSE BY (ID) ON ORDER Year DESC

IMDB
ID    Title  Year  Director  Rating
1101  A      1975  Michael   Null
1102  B      1987  John      5
1103  C      1999  Mark      Null

FILMBUFF
ID    Title  Year  Director  Rating
1101  C      1976  King      4
1102  B      1983  Davis     Null
1103  D      1997  Anthony   2

RESULT
ID    Title  Year  Director      Rating
1103  C      1999  Mark Anthony  2
1102  B      1987  John Davis    5
1101  A      1976  Michael King  4
Question
Match the following:
1. a, b, c, -  and  a, b, -, d        a. Identical tuples
2. a, b, -, -  and  a, b, -, -        b. Subsumed tuples
3. a, b, c, -  and  a, e, -, d        c. Conflicting tuples
4. a, b, c, -  and  a, b, -, -        d. Complementing tuples
HumMer Screenshots
REFERENCES
http://coitweb.uncc.edu/~wwu18/itcs6010/presentation/fusion_vldb.pdf
http://vldb.idi.ntnu.no/program/slides/demo/s1251-bilke.pdf
http://coitweb.uncc.edu/~wwu18/itcs6010/presentation/fusion-3step.pdf
http://www.hpi.uni-potsdam.de/fileadmin/hpi/FG_Naumann/publications/Modena05.pdf
http://vldb2009.org/files/DataFusionFinal.pdf
http://disi.unitn.it/~p2p/RelatedWork/Matching/dublicatesICDE05.pdf