integrating conflicting data_pverconf_may2011

Xin Luna Dong (AT&T Labs—Research) Laure Berti (Universite de Rennes 1, visiting AT&T) Divesh Srivastava (AT&T Labs—Research)


DESCRIPTION

May 2011 Personal Vali

TRANSCRIPT

Page 1: Integrating Conflicting Data_PVERConf_May2011

Xin Luna Dong (AT&T Labs—Research)

Laure Berti (Universite de Rennes 1, visiting AT&T)

Divesh Srivastava (AT&T Labs—Research)

Page 2: Integrating Conflicting Data_PVERConf_May2011

The WWW is Great

Page 3: Integrating Conflicting Data_PVERConf_May2011
Page 4: Integrating Conflicting Data_PVERConf_May2011

False Information on the Web (I)

Maurice Jarre (1924-2009) French Conductor and Composer

“One could say my life itself has been one long soundtrack. Music was my life, music brought me to life, and music is how I will be remembered long after I leave this life. When I die there will be a final waltz playing in my head and that only I can hear.”

2:29, 30 March 2009

Page 5: Integrating Conflicting Data_PVERConf_May2011

False Information on the Web (II)

Posted by Andrew Breitbart in his blog

Page 6: Integrating Conflicting Data_PVERConf_May2011

The Internet needs a way to help people separate rumor from real science.

– Tim Berners-Lee

We now live in this media culture where something goes up on YouTube or a blog and everybody scrambles. - Barack Obama

Page 7: Integrating Conflicting Data_PVERConf_May2011

Why is the Problem Hard? (A Well-Predicted Problem)

Facts and truth really don’t have much to do with each other.

– William Faulkner

             S1      S2        S3
Stonebraker  MIT     Berkeley  MIT
Dewitt       MSR     MSR       UWisc
Bernstein    MSR     MSR       MSR
Carey        UCI     AT&T      BEA
Halevy       Google  Google    UW

Page 8: Integrating Conflicting Data_PVERConf_May2011

Why is the Problem Hard? (A Well-Predicted Problem)

Facts and truth really don’t have much to do with each other.

– William Faulkner

             S1      S2        S3
Stonebraker  MIT     Berkeley  MIT
Dewitt       MSR     MSR       UWisc
Bernstein    MSR     MSR       MSR
Carey        UCI     AT&T      BEA
Halevy       Google  Google    UW

Naïve voting works

Page 9: Integrating Conflicting Data_PVERConf_May2011

Why is the Problem Hard? (A Well-Predicted Problem)

A lie told often enough becomes the truth. – Vladimir Lenin

             S1      S2        S3     S4     S5
Stonebraker  MIT     Berkeley  MIT    MIT    MS
Dewitt       MSR     MSR       UWisc  UWisc  UWisc
Bernstein    MSR     MSR       MSR    MSR    MSR
Carey        UCI     AT&T      BEA    BEA    BEA
Halevy       Google  Google    UW     UW     UW

Naïve voting works only if data sources are independent.
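
As a rough illustration of the slide's point, the sketch below runs a plain majority vote over the affiliation table, first on S1-S3 and then on all five sources. The function name `naive_vote` and the dictionary layout are assumptions made for this example; they are not part of the talk.

```python
from collections import Counter

# Affiliations claimed by sources S1..S5, in that order, copied from the slide.
CLAIMS = {
    "Stonebraker": ["MIT", "Berkeley", "MIT", "MIT", "MS"],
    "Dewitt": ["MSR", "MSR", "UWisc", "UWisc", "UWisc"],
    "Bernstein": ["MSR", "MSR", "MSR", "MSR", "MSR"],
    "Carey": ["UCI", "AT&T", "BEA", "BEA", "BEA"],
    "Halevy": ["Google", "Google", "UW", "UW", "UW"],
}


def naive_vote(values):
    """Return every value tied for the highest count, sorted for determinism."""
    counts = Counter(values)
    best = max(counts.values())
    return sorted(v for v, c in counts.items() if c == best)


# Vote with the first three sources only, then with all five.
votes_3 = {name: naive_vote(vals[:3]) for name, vals in CLAIMS.items()}
votes_5 = {name: naive_vote(vals) for name, vals in CLAIMS.items()}

for name in CLAIMS:
    print(f"{name:12s} S1-S3: {votes_3[name]}   S1-S5: {votes_5[name]}")
```

With three sources the vote picks MSR for Dewitt and leaves Carey in a three-way tie. Once S4 and S5 are added, which repeat S3 almost value for value, the majority flips to UWisc, BEA, and UW.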

Page 10: Integrating Conflicting Data_PVERConf_May2011

Our Goal: Truth Discovery w. Awareness of Dependence Between Sources

You can fool some of the people all the time, and all of the people some of the time, but you cannot fool all of the people all the time.

– Abraham Lincoln

             S1      S2        S3     S4     S5
Stonebraker  MIT     Berkeley  MIT    MIT    MS
Dewitt       MSR     MSR       UWisc  UWisc  UWisc
Bernstein    MSR     MSR       MSR    MSR    MSR
Carey        UCI     AT&T      BEA    BEA    BEA
Halevy       Google  Google    UW     UW     UW

Naïve voting works only if data sources are independent.

Page 11: Integrating Conflicting Data_PVERConf_May2011

Challenges in Dependence Discovery

1. Sharing common data does not in itself imply copying.

             S1      S2        S3     S4     S5
Stonebraker  MIT     Berkeley  MIT    MIT    MS
Dewitt       MSR     MSR       UWisc  UWisc  UWisc
Bernstein    MSR     MSR       MSR    MSR    MSR
Carey        UCI     AT&T      BEA    BEA    BEA
Halevy       Google  Google    UW     UW     UW

2. With only a snapshot it is hard to decide which source is a copier.

3. A copier can also provide or verify some data by itself, so it is inappropriate to ignore all of its data.

Page 12: Integrating Conflicting Data_PVERConf_May2011

High-Level Intuitions for Dependence Detection

Intuition I: decide dependence (w/o direction)

Let D1, D2 be data from two sources. D1 and D2 are dependent if

Pr(D1, D2) ≠ Pr(D1) · Pr(D2).

Page 13: Integrating Conflicting Data_PVERConf_May2011

Dependence?

Source 1 on USA Presidents:

1st : George Washington

2nd : John Adams

3rd : Thomas Jefferson

4th : James Madison

41st : George H.W. Bush

42nd : William J. Clinton

43rd : George W. Bush

44th: Barack Obama

Source 2 on USA Presidents:

1st : George Washington

2nd : John Adams

3rd : Thomas Jefferson

4th : James Madison

41st : George H.W. Bush

42nd : William J. Clinton

43rd : George W. Bush

44th: Barack Obama

Are Source 1 and Source 2 dependent?

Not necessarily

Page 14: Integrating Conflicting Data_PVERConf_May2011

Dependence?

Source 1 on USA Presidents:

1st : George Washington

2nd : Benjamin Franklin

3rd : Tom Jefferson

4th : Abraham Lincoln

41st : George W. Bush

42nd : Hillary Clinton

43rd : Mickey Mouse

44th: Barack Obama

Source 2 on USA Presidents:

1st : George Washington

2nd : Benjamin Franklin

3rd : Tom Jefferson

4th : Abraham Lincoln

41st : George W. Bush

42nd : Hillary Clinton

43rd : Mickey Mouse

44th: John McCain

Are Source 1 and Source 2 dependent?

-- Common Errors

Very likely
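
The "common errors" argument can be made concrete by counting how the two lists overlap. The sketch below assumes a reference list of correct names (the one on the preceding slide, where both sources agree) and uses exact string matching, so "Tom Jefferson" counts as a false value. The helper `overlap_profile` is a hypothetical name introduced for illustration.

```python
# Reference list of presidents; assumed to be the ground truth for this sketch.
TRUTH = {
    1: "George Washington", 2: "John Adams", 3: "Thomas Jefferson",
    4: "James Madison", 41: "George H.W. Bush", 42: "William J. Clinton",
    43: "George W. Bush", 44: "Barack Obama",
}

# The two error-laden sources from this slide; they differ only on the 44th.
SOURCE_1 = {
    1: "George Washington", 2: "Benjamin Franklin", 3: "Tom Jefferson",
    4: "Abraham Lincoln", 41: "George W. Bush", 42: "Hillary Clinton",
    43: "Mickey Mouse", 44: "Barack Obama",
}
SOURCE_2 = {**SOURCE_1, 44: "John McCain"}


def overlap_profile(a, b, truth):
    """Count (same true, same false, different) values over shared objects."""
    shared = a.keys() & b.keys()
    same_true = sum(1 for k in shared if a[k] == b[k] == truth[k])
    same_false = sum(1 for k in shared if a[k] == b[k] and a[k] != truth[k])
    different = sum(1 for k in shared if a[k] != b[k])
    return same_true, same_false, different


print(overlap_profile(TRUTH, TRUTH, TRUTH))        # two fully correct sources
print(overlap_profile(SOURCE_1, SOURCE_2, TRUTH))  # the sources on this slide
```

Two fully correct sources share only true values, which independent sources would also do. The sources on this slide share six identical false values, which is hard to explain without copying.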

Page 15: Integrating Conflicting Data_PVERConf_May2011

High-Level Intuitions for Dependence Detection

Intuition I: decide dependence (w/o direction)

Let D1, D2 be data from two sources. D1 and D2 are dependent if

Pr(D1, D2) ≠ Pr(D1) · Pr(D2).

Intuition II: decide copying direction

Let F be a property function of the data; e.g., accuracy of data. D1 is likely to be dependent on D2 if

|F(D1 ∩ D2) - F(D1 - D2)| > |F(D1 ∩ D2) - F(D2 - D1)|.

Page 16: Integrating Conflicting Data_PVERConf_May2011

Dependence?

Source 2 on USA Presidents:

1st : George Washington

2nd : Benjamin Franklin

3rd : Tom Jefferson

4th : Abraham Lincoln

41st : George W. Bush

42nd : Hillary Clinton

43rd : Mickey Mouse

44th: John McCain

Are Source 1 and Source 2 dependent?

-- Different Accuracy

Source 1 on USA Presidents:

1st : George Washington

2nd : John Adams

3rd : Thomas Jefferson

4th : Abraham Lincoln

41st : George W. Bush

42nd : Hillary Clinton

43rd : George W. Bush

44th: John McCain

S1 more likely to be a copier
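
A minimal sketch of the direction test, taking F to be accuracy against an assumed reference list and using exact string matching. The helper names and the choice to score an empty set as 0.0 are assumptions of this example.

```python
# Assumed ground truth for this sketch.
TRUTH = {
    1: "George Washington", 2: "John Adams", 3: "Thomas Jefferson",
    4: "James Madison", 41: "George H.W. Bush", 42: "William J. Clinton",
    43: "George W. Bush", 44: "Barack Obama",
}
# The two sources shown on this slide.
S1 = {
    1: "George Washington", 2: "John Adams", 3: "Thomas Jefferson",
    4: "Abraham Lincoln", 41: "George W. Bush", 42: "Hillary Clinton",
    43: "George W. Bush", 44: "John McCain",
}
S2 = {
    1: "George Washington", 2: "Benjamin Franklin", 3: "Tom Jefferson",
    4: "Abraham Lincoln", 41: "George W. Bush", 42: "Hillary Clinton",
    43: "Mickey Mouse", 44: "John McCain",
}


def accuracy(pairs, truth):
    """Fraction of (object, value) pairs matching the truth; 0.0 if empty."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(truth[k] == v for k, v in pairs) / len(pairs)


def direction_gaps(d1, d2, truth):
    """Return (|F(D1∩D2) - F(D1-D2)|, |F(D1∩D2) - F(D2-D1)|)."""
    shared = [(k, v) for k, v in d1.items() if d2.get(k) == v]
    only_1 = [(k, v) for k, v in d1.items() if d2.get(k) != v]
    only_2 = [(k, v) for k, v in d2.items() if d1.get(k) != v]
    f_shared = accuracy(shared, truth)
    return (abs(f_shared - accuracy(only_1, truth)),
            abs(f_shared - accuracy(only_2, truth)))


gap_1, gap_2 = direction_gaps(S1, S2, TRUTH)
print(gap_1, gap_2)
```

Here the shared values are mostly wrong, S1's own values are all right, and S2's own values are all wrong. S1's accuracy gap is therefore the larger one, which matches the slide's conclusion that S1 is more likely the copier.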

Page 17: Integrating Conflicting Data_PVERConf_May2011

Outline

• Motivation and intuitions for solution
• Techniques
  - Problem definition
  - Copying detection
  - Truth discovery
• Experimental Results
• Framework of the Solomon project

Page 18: Integrating Conflicting Data_PVERConf_May2011

Problem Definition

INPUT

• Objects: an aspect of a real-world entity
  - E.g., director of a movie, author list of a book
  - Each associated with one true value
• Sources: each providing values for a subset of objects

OUTPUT: the true value for each object

Page 19: Integrating Conflicting Data_PVERConf_May2011

Source Dependence

Source dependence: two sources S and T deriving the same part of data directly or transitively from a common source (can be one of S or T).

• Independent source
• Copier
  - copying part (or all) of data from other sources
  - may verify or revise some of the copied values
  - may add additional values

Assumptions

• Independent values
• Independent copying
• No loop copying

Page 20: Integrating Conflicting Data_PVERConf_May2011

Models for a Static World

Core case

Conditions

1. Same source accuracy
2. Uniform false-value distribution
3. Categorical value

Proposition: W. independent “good” sources, Naïve voting selects values with highest probability to be true.

Models

• Depen
• Accu: Remove Cond 1
• NonUni: Remove Cond 2
• Sim: Remove Cond 3
• AccuPR: Consider value probabilities in dependence analysis

Page 21: Integrating Conflicting Data_PVERConf_May2011

Bayesian Analysis – Basic

[Figure: S1 and S2 overlap; Same Values (TRUE: O.At, FALSE: O.Af) vs. Different Values (O.Ad)]

Observation: Ф

Goal: Pr(S1⊥S2 | Ф), Pr(S1~S2 | Ф) (sum up to 1)

According to the Bayes Rule, we need to know Pr(Ф | S1⊥S2), Pr(Ф | S1~S2)

Key: computing Pr(Ф_O.A | S1⊥S2), Pr(Ф_O.A | S1~S2) for each O.A ∈ S1 ∩ S2

Page 22: Integrating Conflicting Data_PVERConf_May2011

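Bayesian Analysis – Probability Computation

Pr     Independence                                       Copying
O.At   $(1-\varepsilon)^2$                                $(1-\varepsilon)\,c + (1-\varepsilon)^2\,(1-c)$
O.Af   $\varepsilon^2/n$                                  $\varepsilon\,c + (\varepsilon^2/n)\,(1-c)$
O.Ad   $P_d = 1 - (1-\varepsilon)^2 - \varepsilon^2/n$    $P_d\,(1-c)$

ε-error rate; n-#wrong-values; c-copy rate

For Same Values (O.At, O.Af): Pr under Copying > Pr under Independence

[Figure: S1 and S2 overlap; Same Values (TRUE: O.At, FALSE: O.Af) vs. Different Values (O.Ad)]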
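
To show how these per-object probabilities feed the Bayes rule, the sketch below computes the posterior probability of copying from three counts: shared true values, shared false values, and differing values. The parameter values (ε = 0.2, n = 100, c = 0.8) and the prior probability of copying (0.2) are assumptions chosen for illustration, and objects are treated as independent, in line with the "independent values" assumption stated earlier.

```python
import math


def copy_posterior(k_true, k_false, k_diff, eps=0.2, n=100, c=0.8, prior=0.2):
    """Posterior Pr(copying | observation) for two sources.

    k_true / k_false / k_diff: number of objects on which the sources share a
    true value, share a false value, or give different values.
    eps, n, c, prior are illustrative defaults, not values from the talk.
    """
    # Per-object probabilities under independence.
    pt_ind = (1 - eps) ** 2
    pf_ind = eps ** 2 / n
    pd_ind = 1 - pt_ind - pf_ind
    # Per-object probabilities under copying.
    pt_cp = (1 - eps) * c + pt_ind * (1 - c)
    pf_cp = eps * c + pf_ind * (1 - c)
    pd_cp = pd_ind * (1 - c)

    def log_likelihood(pt, pf, pd):
        return (k_true * math.log(pt) + k_false * math.log(pf)
                + k_diff * math.log(pd))

    log_cp = math.log(prior) + log_likelihood(pt_cp, pf_cp, pd_cp)
    log_ind = math.log(1 - prior) + log_likelihood(pt_ind, pf_ind, pd_ind)
    # Normalize in log space so tiny likelihoods do not underflow.
    top = max(log_cp, log_ind)
    num = math.exp(log_cp - top)
    return num / (num + math.exp(log_ind - top))


# Counts from the two presidents examples: 8 shared true values, and
# 1 shared true / 6 shared false / 1 different value.
print(copy_posterior(8, 0, 0))
print(copy_posterior(1, 6, 1))
```

Under these assumed parameters, eight shared true values leave the posterior close to even, while six shared false values push it very close to 1. This matches the "not necessarily" and "very likely" answers on the earlier slides.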

Page 23: Integrating Conflicting Data_PVERConf_May2011

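Considering Source Accuracy

Pr     Independence                                           S1 Copies S2 / S2 Copies S1
O.At   $P_t = (1-\varepsilon(S_1))\,(1-\varepsilon(S_2))$     $(1-\varepsilon(S_i))\,c + P_t\,(1-c)$
O.Af   $P_f = \varepsilon(S_1)\,\varepsilon(S_2)/n$           $\varepsilon(S_i)\,c + P_f\,(1-c)$
O.Ad   $P_d = 1 - P_t - P_f$                                  $P_d\,(1-c)$

(i = 1 in one copying column and i = 2 in the other)

[Figure: S1 and S2 overlap; Same Values (TRUE: O.At, FALSE: O.Af) vs. Different Values (O.Ad)]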

Page 24: Integrating Conflicting Data_PVERConf_May2011

II. Finding the True Value

$A(S) = \mathrm{Avg}_{v \in V(S)}\, P(v)$

$A'(S) = \ln \dfrac{n\,A(S)}{1 - A(S)}$

$C(v) = \sum_{S \in S(v)} A'(S)$

$P(v) = \dfrac{e^{C(v)}}{\sum_{v_0 \in D(O)} e^{C(v_0)}}$

Consider dependence

$C(v) = \sum_{S \in S(v)} A'(S)\, I(S)$
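
The vote-count formulas on this slide can be sketched as follows for a single object (Carey's affiliation). The accuracy values, n = 100, and the independence weights I(S) are illustrative assumptions. The sketch also normalizes only over the values that some source actually provides, which may be narrower than D(O).

```python
import math


def accuracy_score(a, n=100):
    """A'(S) = ln(n * A(S) / (1 - A(S)))."""
    return math.log(n * a / (1 - a))


def value_probabilities(claims, accuracy, independence=None, n=100):
    """Return value -> P(v) for one object.

    claims: source -> value; accuracy: source -> A(S);
    independence: optional source -> I(S) weight (defaults to 1 for all).
    """
    votes = {}
    for source, value in claims.items():
        weight = 1.0 if independence is None else independence[source]
        score = accuracy_score(accuracy[source], n) * weight
        votes[value] = votes.get(value, 0.0) + score  # C(v)
    # Softmax over vote counts, shifted by the max for numerical stability.
    top = max(votes.values())
    exp_votes = {v: math.exp(c - top) for v, c in votes.items()}
    total = sum(exp_votes.values())
    return {v: e / total for v, e in exp_votes.items()}


carey = {"S1": "UCI", "S2": "AT&T", "S3": "BEA", "S4": "BEA", "S5": "BEA"}
acc = {s: 0.8 for s in carey}  # assumed equal starting accuracy
# Illustrative I(S): S4 and S5 are treated as likely copiers and discounted.
indep = {"S1": 1.0, "S2": 1.0, "S3": 1.0, "S4": 0.2, "S5": 0.04}

plain = value_probabilities(carey, acc)
discounted = value_probabilities(carey, acc, indep)
print(plain)
print(discounted)
```

With these assumed numbers, discounting the likely copiers lowers BEA's probability noticeably but does not by itself overturn it. Changing the winner also requires the sources' accuracies to diverge over the rounds.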

Page 25: Integrating Conflicting Data_PVERConf_May2011

Solution on the Motivating Example

Round 1

             S1      S2        S3     S4     S5
Stonebraker  MIT     Berkeley  MIT    MIT    MS
Dewitt       MSR     MSR       UWisc  UWisc  UWisc
Bernstein    MSR     MSR       MSR    MSR    MSR
Carey        UCI     AT&T      BEA    BEA    BEA
Halevy       Google  Google    UW     UW     UW

[Figure, Copying Relationship: graph over S1-S5 with edge labels .87, .2, .2, .99, .99, .99]

[Figure, Truth Discovery: candidate values UCI, AT&T, BEA with sources S1-S5; annotations (1-.99*.8=.2) and (.22)]

Page 26: Integrating Conflicting Data_PVERConf_May2011

Solution on the Motivating Example

Round 2

             S1      S2        S3     S4     S5
Stonebraker  MIT     Berkeley  MIT    MIT    MS
Dewitt       MSR     MSR       UWisc  UWisc  UWisc
Bernstein    MSR     MSR       MSR    MSR    MSR
Carey        UCI     AT&T      BEA    BEA    BEA
Halevy       Google  Google    UW     UW     UW

[Figure, Copying Relationship: graph over S1-S5 with edge labels .14, .49, .49, .49, .08, .49, .49, .49]

[Figure, Truth Discovery: candidate values UCI, AT&T, BEA with sources S1-S5]

Page 27: Integrating Conflicting Data_PVERConf_May2011

Solution on the Motivating Example

Round 3

             S1      S2        S3     S4     S5
Stonebraker  MIT     Berkeley  MIT    MIT    MS
Dewitt       MSR     MSR       UWisc  UWisc  UWisc
Bernstein    MSR     MSR       MSR    MSR    MSR
Carey        UCI     AT&T      BEA    BEA    BEA
Halevy       Google  Google    UW     UW     UW

[Figure, Copying Relationship: graph over S1-S5 with edge labels .12, .49, .49, .49, .06, .49, .49, .49]

[Figure, Truth Discovery: candidate values UCI, AT&T, BEA with sources S1-S5]

Page 28: Integrating Conflicting Data_PVERConf_May2011

Solution on the Motivating Example

Round 4

             S1      S2        S3     S4     S5
Stonebraker  MIT     Berkeley  MIT    MIT    MS
Dewitt       MSR     MSR       UWisc  UWisc  UWisc
Bernstein    MSR     MSR       MSR    MSR    MSR
Carey        UCI     AT&T      BEA    BEA    BEA
Halevy       Google  Google    UW     UW     UW

[Figure, Copying Relationship: graph over S1-S5 with edge labels .10, .48, .49, .50, .05, .49, .48, .50]

[Figure, Truth Discovery: candidate values UCI, AT&T, BEA with sources S1-S5]

Page 29: Integrating Conflicting Data_PVERConf_May2011

Solution on the Motivating Example

Round 5

             S1      S2        S3     S4     S5
Stonebraker  MIT     Berkeley  MIT    MIT    MS
Dewitt       MSR     MSR       UWisc  UWisc  UWisc
Bernstein    MSR     MSR       MSR    MSR    MSR
Carey        UCI     AT&T      BEA    BEA    BEA
Halevy       Google  Google    UW     UW     UW

[Figure, Copying Relationship: graph over S1-S5 with edge labels .09, .47, .49, .51, .04, .49, .47, .51]

[Figure, Truth Discovery: candidate values UCI, AT&T, BEA with sources S1-S5]

Page 30: Integrating Conflicting Data_PVERConf_May2011

Solution on the Motivating Example

Round 13

             S1      S2        S3     S4     S5
Stonebraker  MIT     Berkeley  MIT    MIT    MS
Dewitt       MSR     MSR       UWisc  UWisc  UWisc
Bernstein    MSR     MSR       MSR    MSR    MSR
Carey        UCI     AT&T      BEA    BEA    BEA
Halevy       Google  Google    UW     UW     UW

[Figure, Copying Relationship: graph over S1-S5 with edge labels .55, .49, .55, .49, .44, .44]

[Figure, Truth Discovery: candidate values UCI, AT&T, BEA with sources S1-S5]

Page 31: Integrating Conflicting Data_PVERConf_May2011

Combining Accuracy and Dependence

[Figure: cycle of three components, Truth Discovery, Source-accuracy Computation, and Dependence Detection, connected by arrows labeled Step 1, Step 2, Step 3]

Theorem: w/o accuracy, converges

Observation: w. accuracy, converges when #objs >> #srcs
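
The sketch below illustrates only part of this loop: it alternates truth discovery and source-accuracy computation on the motivating example and leaves dependence detection out entirely. The starting accuracy of 0.8, n = 100, the clamping bounds, and the stopping rule are all assumptions of this example.

```python
import math

SOURCES = ["S1", "S2", "S3", "S4", "S5"]
TABLE = {  # object -> values claimed by S1..S5, from the motivating example
    "Stonebraker": ["MIT", "Berkeley", "MIT", "MIT", "MS"],
    "Dewitt": ["MSR", "MSR", "UWisc", "UWisc", "UWisc"],
    "Bernstein": ["MSR", "MSR", "MSR", "MSR", "MSR"],
    "Carey": ["UCI", "AT&T", "BEA", "BEA", "BEA"],
    "Halevy": ["Google", "Google", "UW", "UW", "UW"],
}


def value_probs(values, acc, n):
    """Truth discovery for one object: accuracy-weighted votes -> P(v)."""
    votes = {}
    for src, val in zip(SOURCES, values):
        a = min(max(acc[src], 0.01), 0.99)  # clamp so the log stays finite
        votes[val] = votes.get(val, 0.0) + math.log(n * a / (1 - a))
    top = max(votes.values())
    exps = {v: math.exp(c - top) for v, c in votes.items()}
    total = sum(exps.values())
    return {v: e / total for v, e in exps.items()}


def iterate(n=100, start_acc=0.8, max_rounds=50, tol=1e-6):
    """Alternate truth discovery and accuracy updates until accuracies settle."""
    acc = {s: start_acc for s in SOURCES}
    for rnd in range(1, max_rounds + 1):
        probs = {obj: value_probs(vals, acc, n) for obj, vals in TABLE.items()}
        # A(S) = average probability of the values S provides.
        new_acc = {
            s: sum(probs[obj][vals[i]] for obj, vals in TABLE.items()) / len(TABLE)
            for i, s in enumerate(SOURCES)
        }
        change = max(abs(new_acc[s] - acc[s]) for s in SOURCES)
        acc = new_acc
        if change < tol:
            break
    return probs, acc, rnd


probs, acc, rounds = iterate()
for obj, p in probs.items():
    print(obj, max(p, key=p.get))
```

In this accuracy-only sketch the values repeated by S3, S4, and S5 (for example UWisc for Dewitt) keep winning, because repeated values raise the estimated accuracy of the sources that repeat them. That reinforcement is what the dependence-detection step is meant to break.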

Page 32: Integrating Conflicting Data_PVERConf_May2011

Outline

• Motivation and intuitions for solution
• Techniques
  - Problem definition
  - Copying detection
  - Truth discovery
• Experimental Results
• Framework of the Solomon project

Page 33: Integrating Conflicting Data_PVERConf_May2011

Experimental Setup

Dataset: AbeBooks

• 877 bookstores
• 1263 CS books
• 24364 listings, w. ISBN, author-list
• After pre-cleaning, each book on avg has 19 listings and 4 author lists (ranges from 1-23)

Gold standard: 100 random books

• Manually check author list from book cover

Measure: Precision = #(Corr author lists) / #(All lists)

Page 34: Integrating Conflicting Data_PVERConf_May2011

Naïve Voting and Types of Errors

Naïve voting has precision .71

Page 35: Integrating Conflicting Data_PVERConf_May2011

Contributions of Various Components

Methods                  Prec  #Rnds  Time(s)
Naïve                    .71   1      .2
Only value similarity    .74   1      .2
Only source accuracy     .79   23     1.1
Only source dependence   .83   3      28.3
Depen+accu               .87   22     185.8
Depen+accu+sim           .89   18     197.5

Precision improves by 25.4% over Naïve

Considering dependence improves the results most

Reasonably fast
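
The 25.4% figure is a relative improvement, not a difference in percentage points. A quick check using the precision values from the table (the variable names are just for this example):

```python
naive_precision = 0.71  # "Naïve" row
best_precision = 0.89   # "Depen+accu+sim" row

# Relative gain of the full method over naïve voting.
relative_gain = (best_precision - naive_precision) / naive_precision
print(f"{relative_gain:.1%}")
```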

Page 36: Integrating Conflicting Data_PVERConf_May2011

Outline

• Motivation and intuitions for solution
• Techniques
  - Problem definition
  - Copying detection
  - Truth discovery
• Experimental Results
• Framework of the Solomon project

Page 37: Integrating Conflicting Data_PVERConf_May2011

Data Integration Faces 3 Challenges

Page 38: Integrating Conflicting Data_PVERConf_May2011

Data Integration Faces 3 Challenges

Page 39: Integrating Conflicting Data_PVERConf_May2011

Data Integration Faces 3 Challenges

Scissors

Paper Scissors

Page 40: Integrating Conflicting Data_PVERConf_May2011

Data Integration Faces 3 Challenges

Scissors

Glue

Page 41: Integrating Conflicting Data_PVERConf_May2011

Data Integration Faces 3 Challenges

• Schema matching
• Model management
• Query answering using views
• Information extraction

• String matching (edit distance, token-based, etc.)
• Object matching (aka. record linkage, reference reconciliation, …)

• Data fusion
• Truth discovery

Assume INDEPENDENCE of data sources

Page 42: Integrating Conflicting Data_PVERConf_May2011

Source Copying Adds A New Dimension to Data Integration

Page 43: Integrating Conflicting Data_PVERConf_May2011

Copying Can Be Large Scaled [VLDB’10a] (Copying of AbeBooks Data)

Data collected from AbeBooks [Yin et al., 2007]

Page 44: Integrating Conflicting Data_PVERConf_May2011

Related Work

• Data provenance [Buneman et al., PODS’08]
  - Focus on effective presentation and retrieval
  - Assume knowledge of provenance/lineage
• Opinion pooling [Clemen&Winkler, 1985]
  - Combine probability distributions from multiple experts
  - Again, assume knowledge of dependence
• Detect plagiarism of text, image/video, programs, etc. [Dong, Sigmod’11 tutorial]

Page 45: Integrating Conflicting Data_PVERConf_May2011

http://www2.research.att.com/~yifanhu/SourceCopying/