DATA FUSION Resolving Inconsistencies at Schema, Tuple and Value Level Naveen Rajamoorthy Nachiappan Chidambaram Arunkarthikeyan Palaniswamy Sriramakrishnan Soundarrajan


Post on 24-Feb-2016


Page 1: DATA FUSION

DATA FUSION
Resolving Inconsistencies at Schema, Tuple and Value Level

Naveen Rajamoorthy
Nachiappan Chidambaram
Arunkarthikeyan Palaniswamy
Sriramakrishnan Soundarrajan

Page 2: DATA FUSION

Need for Data Fusion

To compare different data sets.

Examples:
- Shopping agents
- Disaster management systems

Page 3: DATA FUSION

GOALS OF DATA INTEGRATION

Completeness
- amount of data (number of attributes and tuples)
- achieved by adding more data sources

Conciseness
- number of unique objects
- number of unique attributes of the objects
- achieved by reducing schematic inconsistencies through schema mapping

Correctness
- validity of data
- achieved by performing duplicate detection and data fusion

Data Sources -> Schema Mapping -> Duplicate Detection -> Data Fusion

Page 4: DATA FUSION

Humboldt Merger (HumMer)

- Fuses data from heterogeneous sources.
- All steps are performed at run time.
- Data cleaning
- Maximum flexibility

Page 5: DATA FUSION

Components of Data Fusion

Heterogeneous and dirty data

Three steps:

1. Schema Matching and Data Transformation

2. Duplicate Detection

3. Data Fusion

Page 6: DATA FUSION

Three Steps in Data Fusion

Resolve inconsistencies at schema level

Resolve inconsistencies at tuple level

Resolve inconsistencies at value level

Page 7: DATA FUSION


Schema Matching and Data Transformation

Page 8: DATA FUSION

Schema Matching

Process of resolving schematic heterogeneity.

1. DUMAS schema matching algorithm (Duplicate-based Matching of Schemas)

2. TF-IDF similarity (term frequency–inverse document frequency)
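A minimal sketch of TF-IDF similarity between whole tuples, assuming each tuple is treated as one bag of tokens and compared by cosine similarity. The function names (`tokenize`, `tfidf_vectors`, `cosine`, `best_match`) are illustrative and not taken from DUMAS; the data is the R and S example used on the following slides.

```python
import math
import re
from collections import Counter


def tokenize(record):
    """Treat a whole tuple as one string and split it into lowercase tokens."""
    return re.findall(r"[a-z0-9]+", " ".join(record).lower())


def tfidf_vectors(records):
    """Return one {token: weight} vector per record, weight = tf * log(N / df)."""
    docs = [Counter(tokenize(r)) for r in records]
    df = Counter(token for doc in docs for token in doc)
    n = len(docs)
    return [{t: tf * math.log(n / df[t]) for t, tf in doc.items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either vector is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


R = [
    ("John", "Doe", "M", "(408)7573339", "(408)7573338"),
    ("Joe", "Smith", "M", "(249)3615616", "(249)2342366"),
    ("Suzy", "Klein", "F", "(358)2436321", "(358)2436321"),
    ("Sam", "Adams", "M", "(541)8127100", "(541)8121164"),
]
S = [
    ("Doe", "Jdoe", "408-9182043", "XP"),
    ("Deen", "Jdean", "369-3663625", "XP"),
    ("Klein", "suzy", "358-2436321", "Unix"),
    ("Adams", "Adams", "541-8121164", "W2000"),
]

vectors = tfidf_vectors(R + S)
r_vecs, s_vecs = vectors[:len(R)], vectors[len(R):]


def best_match(i):
    """Index of the S tuple most similar to R tuple i."""
    return max(range(len(S)), key=lambda j: cosine(r_vecs[i], s_vecs[j]))
```

Here `best_match(2)` points R3 at S3 and `best_match(3)` points R4 at S4, without using any knowledge of which attributes correspond.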

Page 9: DATA FUSION

Example

Consider the relations R(A, B, C, D, E) and S(B’, F, E’, G).

R    A     B      C  D             E
R1   John  Doe    M  (408)7573339  (408)7573338
R2   Joe   Smith  M  (249)3615616  (249)2342366
R3   Suzy  Klein  F  (358)2436321  (358)2436321

S    B’     F      E’           G
S1   Doe    Jdoe   408-9182043  XP
S2   Deen   Jdean  369-3663625  XP
S3   Klein  suzy   358-2436321  Unix
S4   Adams  Adams  541-8121164  W2000

Page 10: DATA FUSION

Example

Consider the relations R(A, B, C, D, E) and S(B’, F, E’, G).

R    A     B      C  D             E
R1   John  Doe    M  (408)7573339  (408)7573338
R2   Joe   Smith  M  (249)3615616  (249)2342366
R3   Suzy  Klein  F  (358)2436321  (358)2436321
R4   Sam   Adams  M  (541)8127100  (541)8121164

S    B’     F      E’           G
S1   Doe    Jdoe   408-9182043  XP
S2   Deen   Jdean  369-3663625  XP
S3   Klein  suzy   358-2436321  Unix
S4   Adams  Adams  541-8121164  W2000

Page 11: DATA FUSION

Example

R    A     B      C  D             E
R3   Suzy  Klein  F  (358)2436321  (358)2436321

S    B’     F     E’           G
S3   Klein  Suzy  358-2436321  Unix

Page 12: DATA FUSION

Example

R    A     B      C  D             E
R1   John  Doe    M  (408)7573339  (408)7573338
R2   Joe   Smith  M  (249)3615616  (249)2342366
R3   Suzy  Klein  F  (358)2436321  (358)2436321
R4   Sam   Adams  M  (541)8127100  (541)8121164

S    B’     F      E’           G
S1   Doe    Jdoe   408-9182043  XP
S2   Deen   Jdean  369-3663625  XP
S3   Klein  suzy   358-2436321  Unix
S4   Adams  Adams  541-8121164  W2000

Page 13: DATA FUSION

Schema Matching

Overlap of the R and S schemas:

Attributes in R   Attributes in S
A                 ----
B                 B’
C                 ----
D                 ----
E                 E’
----              F
----              G
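A rough sketch of how attribute correspondences could be derived from known duplicate pairs: compare the duplicates field by field, average the similarities, and keep the best-scoring attribute pairs. This is a simplification under stated assumptions, not the DUMAS implementation: it uses `difflib.SequenceMatcher` as the field similarity, a greedy one-to-one assignment, and an arbitrary threshold of 0.9. The duplicate pairs (R3, S3) and (R4, S4) are taken from the example tables.

```python
from difflib import SequenceMatcher

R_ATTRS = ["A", "B", "C", "D", "E"]
S_ATTRS = ["B'", "F", "E'", "G"]

# Duplicate pairs assumed to be already detected: (R3, S3) and (R4, S4).
DUPLICATES = [
    (("Suzy", "Klein", "F", "(358)2436321", "(358)2436321"),
     ("Klein", "suzy", "358-2436321", "Unix")),
    (("Sam", "Adams", "M", "(541)8127100", "(541)8121164"),
     ("Adams", "Adams", "541-8121164", "W2000")),
]


def normalize(value):
    """Lowercase and drop punctuation so '(358)2436321' equals '358-2436321'."""
    return "".join(ch for ch in value.lower() if ch.isalnum())


def field_sim(a, b):
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match_schemas(r_attrs, s_attrs, duplicates, threshold=0.9):
    """Average field similarity over all duplicates, then assign greedily."""
    scores = {}
    for i, ra in enumerate(r_attrs):
        for j, sa in enumerate(s_attrs):
            total = sum(field_sim(r[i], s[j]) for r, s in duplicates)
            scores[(ra, sa)] = total / len(duplicates)
    matches, used = {}, set()
    for (ra, sa), score in sorted(scores.items(), key=lambda kv: -kv[1]):
        if score >= threshold and ra not in matches and sa not in used:
            matches[ra] = sa
            used.add(sa)
    return matches


correspondences = match_schemas(R_ATTRS, S_ATTRS, DUPLICATES)
```

With these two duplicates the sketch yields B matched to B' and E matched to E'. Attribute D scores lower than E against E' because the phone numbers of R4 differ in D, which is why more than one duplicate pair is useful.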

Page 14: DATA FUSION

Transformation

- A preferred schema is chosen.
- Attribute names are determined, and attributes are renamed where necessary.
- A sourceID attribute is added to every table in the schema.

Page 15: DATA FUSION


Duplicate Detection

Page 16: DATA FUSION

EXAMPLE

Source A

<pub>
  <Name> Database Systems: The Complete Book</Name>
  <Authors>
    <Author> Hector Garcia-Molina</Author>
    <Author> Jeffrey D. Ullman</Author>
    <Author> Jennifer D. Widom</Author>
  </Authors>
</pub>

Source B

<publication>
  <title> Database Systems: The Complete Book </title>
  <author> Molina & Ullman</author>
  <year> 1990 </year>
</publication>

Page 17: DATA FUSION

SCHEMA MAPPING

Source A

<pub>
  <Name> Database Systems: The Complete Book</Name>
  <Authors>
    <Author> Hector Garcia-Molina</Author>
    <Author> Jeffrey D. Ullman</Author>
    <Author> Jennifer D. Widom</Author>
  </Authors>
</pub>

Source B

<publication>
  <title> Database Systems: The Complete Book </title>
  <Author> Molina & Ullman</Author>
  <year> 1990 </year>
</publication>

Mapped schema

<pub>
  <title> </title>
  <Authors>
    <author> </author>
    <author> </author>
  </Authors>
  <year> </year>
</pub>

Page 18: DATA FUSION

DATA TRANSFORMATION

Source A

<pub>
  <Name> Database Systems: The Complete Book</Name>
  <Authors>
    <Author> Hector Garcia-Molina</Author>
    <Author> Jeffrey D. Ullman</Author>
    <Author> Jennifer D. Widom</Author>
  </Authors>
</pub>

Source B

<pub>
  <title> Database Systems: The Complete Book</title>
  <Author> Hector Garcia-Molina</Author>
  <Author> Jeffrey D. Ullman</Author>
  <Author> Jennifer D. Widom</Author>
  <year> 1990 </year>
</pub>

Each source is transformed with an XQuery. Transformed data:

<pub>
  <title> Database Systems: The Complete Book </title>
  <Authors>
    <Author> Hector Garcia-Molina</Author>
    <Author> Jeffrey D. Ullman</Author>
    <Author> Jennifer D. Widom</Author>
  </Authors>
</pub>
<pub>
  <title> Database Systems: The Complete Book</title>
  <Author> Hector Garcia-Molina</Author>
  <Author> Jeffrey D. Ullman</Author>
  <Author> Jennifer D. Widom</Author>
  <year> 1990 </year>
</pub>

Page 19: DATA FUSION

DUPLICATE DETECTION AND FUSION

Detected duplicates:

<pub>
  <title> Database Systems: The Complete Book </title>
  <Authors>
    <Author> Hector Garcia-Molina</Author>
    <Author> Jeffrey D. Ullman</Author>
    <Author> Jennifer D. Widom</Author>
  </Authors>
</pub>
<pub>
  <title> Database Systems: The Complete Book</title>
  <Author> Hector Garcia-Molina</Author>
  <Author> Jeffrey D. Ullman</Author>
  <Author> Jennifer D. Widom</Author>
  <year> 1990 </year>
</pub>

Fused result:

<pub>
  <title> Database Systems: The Complete Book </title>
  <Authors>
    <Author> Hector Garcia-Molina</Author>
    <Author> Jeffrey D. Ullman</Author>
    <Author> Jennifer D. Widom</Author>
  </Authors>
  <year> 1990 </year>
</pub>

Page 20: DATA FUSION

QUESTION

Give the correct order in which integration needs to be carried out:

A) Data Transformation -> Schema Mapping -> Duplicate Detection -> Fusion

B) Duplicate Detection -> Data Transformation -> Schema Mapping -> Fusion

C) Schema Mapping -> Data Transformation -> Duplicate Detection -> Fusion

D) Data Transformation -> Schema Mapping -> Fusion -> Duplicate Detection

Page 21: DATA FUSION

Duplicate Detection

Problem
◦ Given one or more data sets, find all sets of objects that represent the same real-world entity.

Difficulties
◦ Duplicates are not identical
  - Similarity measures: Levenshtein, Jaccard, etc.
◦ Large volume, cannot compare all pairs
  - Partitioning strategies: Sorted Neighborhood, Blocking, etc.
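The two similarity measures named above can be sketched in a few lines. This is a plain textbook version, assuming Levenshtein distance counts single-character insertions, deletions and substitutions, and Jaccard similarity is computed over sets. The `bigrams` helper is an illustrative choice for turning a string into a set and is not prescribed by the slides.

```python
def levenshtein(a, b):
    """Edit distance via dynamic programming, keeping one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def bigrams(text):
    """Character bigrams of the lowercased string with spaces removed."""
    text = text.lower().replace(" ", "")
    return {text[i:i + 2] for i in range(len(text) - 1)}


distance = levenshtein("Star Wars", "Starwars")
overlap = jaccard(bigrams("Star Wars"), bigrams("Starwars"))
```

For the two spellings of the same film, the edit distance is 2 (one deleted space, one changed letter case), while the normalized bigram sets are identical.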

Page 22: DATA FUSION

PARTITIONING STRATEGIES

- General Strategy
- Sorted Neighborhood Method

Page 23: DATA FUSION

GENERAL STRATEGY

Compare each record with every other record and calculate a distance measure. Assuming there are n records in the database, we need to compute n(n-1)/2 distance measures.

X              Y             Z
Star Wars      Lucas         1985
Indiana Jones  Lucas         1989
Home Alone     Wright        1991
Starwars       George Lucas  1985
Shrek          Adamson       2001
Snatch         Ritcie        1999

Number of records, n = 6
Number of distance measures to be computed = 6*5/2 = 15

If there are, say, 100000 records, then the number of distance measures to be computed is about 5*10^9.

EXPENSIVE
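The pair count n(n-1)/2 can be checked directly. The sketch below enumerates all pairs for the six example titles and evaluates the formula for 100000 records; `pair_count` is an illustrative helper name.

```python
from itertools import combinations


def pair_count(n):
    """Number of unordered pairs among n records: n(n-1)/2."""
    return n * (n - 1) // 2


titles = ["Star Wars", "Indiana Jones", "Home Alone", "Starwars", "Shrek", "Snatch"]
all_pairs = list(combinations(titles, 2))

small = pair_count(len(titles))  # 15 comparisons for n = 6
large = pair_count(100_000)      # 4,999,950,000, roughly 5 * 10^9
```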

Page 24: DATA FUSION

SORTED NEIGHBORHOOD METHOD

- The Sorted Neighborhood method reduces the number of potential duplicate pairs.
- A sorting key is built from selected fields.
- The database is sorted using this key.
- After sorting, a window of fixed size slides over the sorted database, and only records inside the same window are compared to identify duplicates.
- The technique generates O(wN) pairs, where w is the window size and N is the total number of records in the database.
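The steps above can be sketched as follows, assuming the key is the lowercased title without spaces (a hypothetical choice; any key built from the record's fields would do) and that the window covers `window` consecutive records.

```python
def sorted_neighborhood_pairs(records, key, window):
    """Sort by key, then pair each record with the next window-1 records."""
    ordered = sorted(records, key=key)
    pairs = []
    for i, record in enumerate(ordered):
        for other in ordered[i + 1:i + window]:
            pairs.append((record, other))
    return pairs


def normalized_title(record):
    """Illustrative sorting key: title in lowercase with spaces removed."""
    return record[0].lower().replace(" ", "")


movies = [
    ("Star Wars", "Lucas", 1985),
    ("Indiana Jones", "Lucas", 1989),
    ("Home Alone", "Wright", 1991),
    ("Starwars", "George Lucas", 1985),
    ("Shrek", "Adamson", 2001),
    ("Snatch", "Ritcie", 1999),
]

candidates = sorted_neighborhood_pairs(movies, key=normalized_title, window=2)
```

With a window of 2 only 5 candidate pairs are produced instead of 15, and the pair ("Star Wars", "Starwars") is still among them because both records receive the same key and end up adjacent after sorting. The method depends on the key: duplicates whose keys sort far apart are never compared.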

Page 25: DATA FUSION

DUPLICATE DETECTION WITH DESCRIPTIONS

Criteria for attribute selection. Attributes that are:

(i) related to the currently considered object
    Child elements having a foreign key constraint over the attributes of the parent table.

(ii) usable by our similarity measure
    Attribute City corresponding to attribute Zip code cannot be used to calculate the similarity measure.

(iii) likely to distinguish duplicates from non-duplicates
    An attribute such as Denomination is unlikely to distinguish duplicate records.

Page 26: DATA FUSION

DUPLICATE DETECTION WITH DESCRIPTIONS

Description:
- Consider attributes from other tables that have a foreign key relationship with the existing tables.
- For efficiency, only direct child attributes are considered, i.e. descendants reached by following more than one reference are discarded.

Tables:

Movie(Title, Year, Duration)
Film(Name, Date, Rating)
Actor(Name, Movie)
Actress(Name, Movie)
Actors(Name, Film)
Prod-Com(Name, Film)

Let tables T1 and T2 be the two matched tables, and let {T1,1, . . . , T1,k} and {T2,1, . . . , T2,m} be their respective children tables.

Then every pair of tables (T1,i, T2,j), 1 <= i <= k, 1 <= j <= m, is matched.

Thus Actor(Movie), Actress(Movie) and Actors(Film) can also be used for duplicate detection.

Page 27: DATA FUSION

Example

Table 1

ID  Country
1   USA
2   United States
3   US

Table 2

ID  City        Country ID
1   Charlotte   1
2   California  1
3   Charlotte   2
4   California  2
5   Charlotte   3
6   California  3

Country ID in Table 2 is a foreign key referencing ID in Table 1.

From Sim(Country) in Table 1 we understand that rows 1 and 3 are duplicates (row 1 = row 3).

Using the attribute City in the child table (Table 2) for duplicate detection, we come to the conclusion that row 1 = row 2 = row 3 in Table 1, i.e. USA = United States = US.

Page 28: DATA FUSION

Detection From Similarity Measure

Source 1, Source 2 -> Source1 x Source2 -> Partitioning -> Similarity measure

The similarity measure sorts each candidate pair into one of three classes:

- Non-duplicates: sim < θ1
- Possible duplicates: θ1 <= sim <= θ2
- Sure duplicates: sim > θ2
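A small sketch of the two-threshold decision, assuming θ1 <= θ2 and that a similarity lying exactly on a threshold counts as a possible duplicate. The function name and the threshold values in the usage lines are illustrative only.

```python
def classify_pair(sim, theta1, theta2):
    """Map a similarity score to one of the three classes."""
    if theta1 > theta2:
        raise ValueError("theta1 must not exceed theta2")
    if sim > theta2:
        return "sure duplicate"
    if sim < theta1:
        return "non-duplicate"
    return "possible duplicate"


# Hypothetical thresholds chosen only for illustration.
labels = [classify_pair(s, theta1=0.4, theta2=0.8) for s in (0.95, 0.6, 0.1)]
```

Pairs labelled as possible duplicates are the ones that would typically be handed to a user for a manual decision.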

Page 29: DATA FUSION

Data Fusion

Objective
Given a duplicate, create a single object representation while resolving conflicting data values.

Simple example:

Source 1: 98765432, R.J.Ludlum, 3.50, Year
Source 2: 98765432, Trevayne, Robert Ludlum, 4.00, Month

Fusion functions: ID, Max_length(author), Min(price), Concat(Month, Year)

Page 30: DATA FUSION

TYPES OF DATA CONFLICT

Uncertainty
- A conflict between a non-null value and one or more null values that are all used to describe the same property of a real-world entity.
- NULL value vs. non-NULL value: the “easy” case.
- Causes: missing information, such as null values in a source or a completely missing attribute in a source.

Title   Year  Director  Source
Snatch  2000  Ritchie   S1
Snatch  2000  null      S2

Contradiction
- A conflict between two or more different non-null values that are all used to describe the same property of the same entity.
- Non-NULL value vs. (different) non-NULL value.
- Causes: different sources providing different values for the same attribute of a real-world entity.

Title   Year  Director  Source
Snatch  2000  Ritchie   S1
Snatch  2000  Benaud    S2
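The distinction can be expressed as a small check over the values that different sources report for one attribute of one entity. This is a sketch assuming `None` stands for NULL; the function name is illustrative.

```python
def conflict_type(values):
    """Classify the values reported for one attribute of one real-world entity."""
    non_null = {v for v in values if v is not None}
    has_null = any(v is None for v in values)
    if len(non_null) > 1:
        return "contradiction"   # different non-null values
    if len(non_null) == 1 and has_null:
        return "uncertainty"     # non-null value vs. null value(s)
    return "no conflict"


director_case_1 = conflict_type(["Ritchie", None])      # S1 vs. S2, first table
director_case_2 = conflict_type(["Ritchie", "Benaud"])  # S1 vs. S2, second table
```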

Page 31: DATA FUSION

NULL TYPES

unknown
◦ There is a value, but I do not know it.
◦ E.g.: Unknown date-of-birth

not applicable
◦ There is no meaningful value.
◦ E.g.: Spouse for singles

withheld
◦ There is a value, but we are not authorized to see it.
◦ E.g.: Private phone line

Page 32: DATA FUSION

Question

________ refers to “Conflict between a non-null value and one or more null values that are all used to describe the same property of a real-world entity”

A. Contradiction
B. Uncertainty
C. Resolution
D. Ignorance

Page 33: DATA FUSION

Classification of Functions

[Figure: taxonomy of conflict resolution strategies. Top-level branches: conflict ignorance, conflict avoidance, conflict resolution. Branch labels shown: instance based, metadata based, deciding, mediating. Functions shown: Coalesce, Choose, ChooseDepending, Concat, AVG, SUM, MIN, MAX, Random, Vote, MostRecent, MostAbstract, MostSpecific, CommonAncestor, Escalate.]

Page 34: DATA FUSION

Conflict Resolution Functions

Function                                    Description                                     Examples
Min, Max, Sum, Count, Avg                   Standard aggregation                            NumChildren, Salary, Height
Random                                      Random choice                                   Shoe size
Longest, Shortest                           Longest/shortest value                          First_name
Choose(source)                              Value from a particular source                  DoB (DMV), CEO (SEC)
ChooseDepending(val, col)                   Value depends on value chosen in other column   city & zip, e-mail & employer
Vote                                        Majority decision                               Rating
Coalesce                                    First non-null value                            First_name
Group, Concat                               Group or concatenate all values                 Book_reviews
MostRecent                                  Most recent (up-to-date) value                  Address
MostAbstract, MostSpecific, CommonAncestor  Use a taxonomy / ontology                       Location
Escalate                                    Export conflicting values                       gender
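A few of the functions in the table can be sketched as plain Python functions over the list of conflicting values. The signatures below are assumptions made for illustration and do not reproduce HumMer's implementation; `None` stands for NULL, and the sample values passed to `choose` are made up.

```python
from collections import Counter


def coalesce(values):
    """First non-null value, or None if all values are null."""
    return next((v for v in values if v is not None), None)


def vote(values):
    """Most frequent non-null value; ties go to the value seen first."""
    counts = Counter(v for v in values if v is not None)
    return counts.most_common(1)[0][0] if counts else None


def longest(values):
    """Longest non-null string value."""
    candidates = [v for v in values if v is not None]
    return max(candidates, key=len) if candidates else None


def choose(values_by_source, source):
    """Value reported by one particular, trusted source."""
    return values_by_source.get(source)


first_name = coalesce([None, "Robert", None])
rating = vote([4, 5, 4, None])
author = longest(["R.J.Ludlum", "Robert Ludlum"])
dob = choose({"DMV": "1970-01-01", "web": None}, "DMV")
```

Coalesce and Choose avoid the conflict without looking at all values, while Vote and Longest inspect the values themselves before deciding.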

Page 35: DATA FUSION

Data Fusion Goals

Assume 2 sources, Source 1(A, B, C) and Source 2(A, B, D). A dash (-) denotes a NULL value.

Case                  Source 1  Source 2  Tuples over (A, B, C, D)   Fused result
Identical tuples      a, b, -   a, b, -   a, b, -, - ; a, b, -, -    a, b, -, -
Subsumed tuples       a, b, c   a, b, -   a, b, c, - ; a, b, -, -    a, b, c, -
Complementing tuples  a, b, c   a, b, d   a, b, c, - ; a, b, -, d    a, b, c, d
Conflicting tuples    a, b, c   a, e, d   a, b, c, - ; a, e, -, d    a, f(b,e), c, d
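The four cases can be told apart mechanically once both tuples are expressed over the common schema (A, B, C, D). The sketch below assumes the two tuples are already known to describe the same real-world object and uses `None` for NULL; the function name is illustrative.

```python
def tuple_relationship(t1, t2):
    """Classify two tuples over the same schema that describe the same object."""
    if t1 == t2:
        return "identical"
    pairs = list(zip(t1, t2))
    if any(a is not None and b is not None and a != b for a, b in pairs):
        return "conflicting"      # two different non-null values somewhere
    t1_adds = any(a is not None and b is None for a, b in pairs)
    t2_adds = any(a is None and b is not None for a, b in pairs)
    if t1_adds and t2_adds:
        return "complementing"    # each tuple fills a gap of the other
    return "subsumed"             # one tuple only has additional nulls


cases = {
    "identical": (("a", "b", None, None), ("a", "b", None, None)),
    "subsumed": (("a", "b", "c", None), ("a", "b", None, None)),
    "complementing": (("a", "b", "c", None), ("a", "b", None, "d")),
    "conflicting": (("a", "b", "c", None), ("a", "e", None, "d")),
}
results = {name: tuple_relationship(*pair) for name, pair in cases.items()}
```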

Page 36: DATA FUSION

Relational Operators – Overview

Identical tuples (duplicates)
- UNION, OUTER UNION

Subsumed tuples (uncertainty)
- MINIMUM UNION

Complementing tuples (uncertainty)
- COMPLEMENT UNION, MERGE

Conflicting tuples (contradiction)
- MATCH, GROUP, FUSE

Page 37: DATA FUSION

UNION

R

Title  Author  ISBN
A      X       123456789
B      Y       213456789

S

Name  Author  ID
D     P       312456789
A     X       123456789
B     Y       213456789

R UNION S

Name  Author  ISBN
A     X       123456789
B     Y       213456789
D     P       312456789

(SELECT Title AS Name, Author, ISBN FROM R)
UNION
(SELECT Name, Author, ID AS ISBN FROM S)
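The UNION example can be reproduced with Python's built-in `sqlite3` module. This is a sketch under two assumptions: the values are stored as text, and the parentheses around the two SELECT statements are dropped because SQLite does not accept them in a compound query. The ORDER BY clause is added only to make the output order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE R (Title TEXT, Author TEXT, ISBN TEXT);
    CREATE TABLE S (Name TEXT, Author TEXT, ID TEXT);
    INSERT INTO R VALUES ('A', 'X', '123456789'), ('B', 'Y', '213456789');
    INSERT INTO S VALUES ('D', 'P', '312456789'),
                         ('A', 'X', '123456789'),
                         ('B', 'Y', '213456789');
""")

# UNION removes the identical tuples that occur in both sources.
rows = conn.execute("""
    SELECT Title AS Name, Author, ISBN FROM R
    UNION
    SELECT Name, Author, ID AS ISBN FROM S
    ORDER BY Name
""").fetchall()
conn.close()
```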

Page 38: DATA FUSION

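MINIMUM UNION

A dash (-) denotes a NULL value.

R

A  B  C
a  b  c
e  f  g
m  n  o

S

A  B  D
a  b  -
e  f  h
m  p  -

A tuple t1 subsumes a tuple t2 if it has the same schema, has fewer NULL values, and coincides with t2 in all of t2's non-NULL values.

Outer union of R and S:

A  B  C  D
a  b  c  -
a  b  -  -
e  f  g  -
e  f  -  h
m  n  o  -
m  p  -  -

SELECT A, B, C, NULL AS D FROM R
UNION ALL
SELECT A, B, NULL AS C, D FROM S

Minimum union, R + S (outer union with subsumed tuples removed):

A  B  C  D
a  b  c  -
e  f  g  -
e  f  -  h
m  n  o  -
m  p  -  -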
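A minimal sketch of minimum union over the example relations, assuming R has schema (A, B, C), S has schema (A, B, D), and `None` stands for NULL. It first builds the outer union over (A, B, C, D) and then drops every tuple that is subsumed by another one. The function names are illustrative.

```python
def outer_union(r_rows, s_rows):
    """Pad R(A, B, C) and S(A, B, D) to the common schema (A, B, C, D)."""
    padded_r = [(a, b, c, None) for a, b, c in r_rows]
    padded_s = [(a, b, None, d) for a, b, d in s_rows]
    return padded_r + padded_s


def subsumes(t1, t2):
    """t1 has fewer nulls and agrees with t2 on all non-null values of t2."""
    nulls1 = sum(v is None for v in t1)
    nulls2 = sum(v is None for v in t2)
    return nulls1 < nulls2 and all(y is None or x == y for x, y in zip(t1, t2))


def minimum_union(r_rows, s_rows):
    rows = list(dict.fromkeys(outer_union(r_rows, s_rows)))  # drop exact duplicates
    return [t for t in rows if not any(subsumes(other, t) for other in rows)]


R = [("a", "b", "c"), ("e", "f", "g"), ("m", "n", "o")]
S = [("a", "b", None), ("e", "f", "h"), ("m", "p", None)]

result = minimum_union(R, S)
```

Only (a, b, -, -) is removed, because (a, b, c, -) subsumes it. The tuples (e, f, g, -) and (e, f, -, h) complement each other and both remain, which is the case minimum union does not resolve.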

Page 39: DATA FUSION

FULL DISJUNCTION

A dash (-) denotes a NULL value.

R

A  B  C
a  b  c
e  f  g
k  -  o
k  m  -

S

A  B  D
a  b  -
e  f  h
m  p  -
k  q  r

R |⋈| S =

A  B  C  D
a  b  c  -
e  f  g  h
m  p  -  -
k  -  o  -
k  m  -  -
k  q  -  r

SELECT * FROM R FULL OUTER JOIN S ON R.A = S.A AND R.B = S.B;

Page 40: DATA FUSION

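MERGE AND PRIORITIZED MERGE

A dash (-) denotes a NULL value; COAL is short for COALESCE.

R

A  B  C
a  b  c
e  f  g
m  n  o
m  n  -
q  r  s

S

A  B  D
a  b  -
e  f  h
m  p  -

R |⋈ S (left outer join, value of B taken from R first):

A  B          C  D
a  COAL(b,b)  c  -
e  COAL(f,f)  g  h
m  COAL(n,p)  o  -
m  COAL(n,p)  -  -
q  r          s  -

=

A  B  C  D
a  b  c  -
e  f  g  h
m  n  o  -
m  n  -  -
q  r  s  -

R ⋈| S (right outer join, value of B taken from S first):

A  B          C  D
a  COAL(b,b)  c  -
e  COAL(f,f)  g  h
m  COAL(p,n)  o  -
m  COAL(p,n)  -  -

=

A  B  C  D
a  b  c  -
e  f  g  h
m  p  o  -
m  p  -  -

SELECT R.A, COALESCE(R.B, S.B) AS B, R.C, S.D FROM R LEFT OUTER JOIN S ON R.A = S.A;
SELECT S.A, COALESCE(S.B, R.B) AS B, R.C, S.D FROM R RIGHT OUTER JOIN S ON R.A = S.A;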

Page 41: DATA FUSION

FUSE BY

SELECT Name, RESOLVE(Age, max), RESOLVE(Address, choose(EE_Students))
FUSE FROM EE_Students, CS_Students
FUSE BY (Name)

Input tables:

Name    Age  Address
Ram     20   ABCD
Rajesh  21   EFGH

Name    Age  Address
Ram     23   ABCD
Rajesh  20   PQRS

RESULT

Name    Age  Address
Ram     23   ABCD
Rajesh  21   PQRS
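The effect of this FUSE BY query can be imitated in plain Python: group the tuples of all sources by the FUSE BY attribute and apply one resolution function per remaining column. This is only a sketch and not HumMer's implementation. The transcript does not say which input table is which, so the code assumes the second table is EE_Students, the reading under which choose(EE_Students) yields the address shown in the result.

```python
def fuse_by(tables, key, resolvers):
    """Group rows from all sources by `key` and resolve each other column."""
    groups = {}
    for source, rows in tables.items():
        for row in rows:
            groups.setdefault(row[key], []).append((source, row))
    fused = []
    for key_value, members in groups.items():
        out = {key: key_value}
        for column, resolve in resolvers.items():
            out[column] = resolve([(src, row[column]) for src, row in members])
        fused.append(out)
    return fused


def resolve_max(candidates):
    """RESOLVE(col, max): largest non-null value."""
    return max(v for _, v in candidates if v is not None)


def choose(source):
    """RESOLVE(col, choose(source)): value reported by the named source."""
    def resolver(candidates):
        return next((v for src, v in candidates if src == source), None)
    return resolver


tables = {
    # Assumption: first table on the slide is CS_Students, second is EE_Students.
    "CS_Students": [{"Name": "Ram", "Age": 20, "Address": "ABCD"},
                    {"Name": "Rajesh", "Age": 21, "Address": "EFGH"}],
    "EE_Students": [{"Name": "Ram", "Age": 23, "Address": "ABCD"},
                    {"Name": "Rajesh", "Age": 20, "Address": "PQRS"}],
}

result = fuse_by(tables, key="Name",
                 resolvers={"Age": resolve_max, "Address": choose("EE_Students")})
```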

Page 42: DATA FUSION

FUSE BY

SELECT ID, RESOLVE(Title, Choose(IMDB)), RESOLVE(Year, Max), RESOLVE(Director, Concat), RESOLVE(Rating)
FUSE FROM IMDB, Filmdienst
FUSE BY (ID) ON ORDER Year DESC

IMDB

ID    Title  Year  Director  Rating
1101  A      1975  Michael   Null
1102  B      1987  John      5
1103  C      1999  Mark      Null

FILMBUFF

ID    Title  Year  Director  Rating
1101  C      1976  King      4
1102  B      1983  Davis     Null
1103  D      1997  Anthony   2

RESULT

ID    Title  Year  Director      Rating
1103  C      1999  Mark Anthony  2
1102  B      1987  John Davis    5
1101  A      1976  Michael King  4

Page 43: DATA FUSION

Question

Match the following:

1. a, b, c, - ; a, b, -, d        a. Identical tuples
2. a, b, -, - ; a, b, -, -        b. Subsumed tuples
3. a, b, c, - ; a, e, -, d        c. Conflicting tuples
4. a, b, c, - ; a, b, -, -        d. Complementing tuples

Page 44: DATA FUSION

HumMer Screenshot

Page 45: DATA FUSION

HumMer Screenshot

Page 46: DATA FUSION

HumMer Screenshot

Page 47: DATA FUSION

HumMer Screenshot

Page 48: DATA FUSION

REFERENCES

http://coitweb.uncc.edu/~wwu18/itcs6010/presentation/fusion_vldb.pdf

http://vldb.idi.ntnu.no/program/slides/demo/s1251-bilke.pdf

http://coitweb.uncc.edu/~wwu18/itcs6010/presentation/fusion-3step.pdf

http://www.hpi.uni-potsdam.de/fileadmin/hpi/FG_Naumann/publications/Modena05.pdf

http://vldb2009.org/files/DataFusionFinal.pdf

http://disi.unitn.it/~p2p/RelatedWork/Matching/dublicatesICDE05.pdf