integrating conflicting data_pverconf_may2011

Post on 26-May-2015

DESCRIPTION

May 2011 Personal Vali

TRANSCRIPT

Xin Luna Dong (AT&T Labs—Research)

Laure Berti (Universite de Rennes 1, visiting AT&T)

Divesh Srivastava (AT&T Labs—Research)

The WWW is Great

False Information on the Web (I)

Maurice Jarre (1924-2009) French Conductor and Composer

“One could say my life itself has been one long soundtrack. Music was my life, music brought me to life, and music is how I will be remembered long after I leave this life. When I die there will be a final waltz playing in my head and that only I can hear.”

2:29, 30 March 2009

False Information on the Web (II)

Posted by Andrew Breitbart in his blog

The Internet needs a way to help people separate rumor from real science.

– Tim Berners-Lee

We now live in this media culture where something goes up on YouTube or a blog and everybody scrambles. - Barack Obama

Why is the Problem Hard? (A Well-Predicted Problem)

Facts and truth really don’t have much to do with each other.

– William Faulkner

              S1      S2        S3
Stonebraker   MIT     Berkeley  MIT
Dewitt        MSR     MSR       UWisc
Bernstein     MSR     MSR       MSR
Carey         UCI     AT&T      BEA
Halevy        Google  Google    UW

Naïve voting works

Why is the Problem Hard? (A Well-Predicted Problem)

A lie told often enough becomes the truth. – Vladimir Lenin

              S1      S2        S3     S4     S5
Stonebraker   MIT     Berkeley  MIT    MIT    MS
Dewitt        MSR     MSR       UWisc  UWisc  UWisc
Bernstein     MSR     MSR       MSR    MSR    MSR
Carey         UCI     AT&T      BEA    BEA    BEA
Halevy        Google  Google    UW     UW     UW

Naïve voting works only if data sources are independent.
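
The effect of adding S4 and S5 can be checked with a small majority-vote sketch. The claims are the ones in the table above; the names `CLAIMS` and `naive_vote` are illustrative and not from the talk.

```python
from collections import Counter

# Affiliation claims from the motivating example; list positions are S1..S5.
CLAIMS = {
    "Stonebraker": ["MIT", "Berkeley", "MIT", "MIT", "MS"],
    "Dewitt": ["MSR", "MSR", "UWisc", "UWisc", "UWisc"],
    "Bernstein": ["MSR", "MSR", "MSR", "MSR", "MSR"],
    "Carey": ["UCI", "AT&T", "BEA", "BEA", "BEA"],
    "Halevy": ["Google", "Google", "UW", "UW", "UW"],
}


def naive_vote(claims, num_sources):
    """Pick the most frequent value per object among the first num_sources sources."""
    winners = {}
    for obj, values in claims.items():
        counts = Counter(values[:num_sources])
        winners[obj] = counts.most_common(1)[0][0]
    return winners


three = naive_vote(CLAIMS, 3)  # S1..S3 only
five = naive_vote(CLAIMS, 5)   # S1..S5

print(three["Dewitt"], five["Dewitt"])  # MSR UWisc
print(three["Halevy"], five["Halevy"])  # Google UW
```

With three sources Carey is a three-way tie; once S4 and S5 repeat S3's values, BEA wins outright, and the results for Dewitt and Halevy change as well.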

Our Goal: Truth Discovery with Awareness of Dependence Between Sources

You can fool some of the people all the time, and all of the people some of the time, but you cannot fool all of the people all the time.

– Abraham Lincoln

Challenges in Dependence Discovery

1. Sharing common data does not in itself imply copying.

2. With only a snapshot it is hard to decide which source is a copier.

3. A copier can also provide or verify some data by itself, so it is inappropriate to ignore all of its data.

High-Level Intuitions for Dependence Detection

Intuition I: decide dependence (w/o direction)

Let D1, D2 be data from two sources. D1 and D2 are dependent if

Pr(D1, D2) ≠ Pr(D1) · Pr(D2).

Dependence?

Source 1 on USA Presidents:

1st : George Washington

2nd : John Adams

3rd : Thomas Jefferson

4th : James Madison

41st : George H.W. Bush

42nd : William J. Clinton

43rd : George W. Bush

44th: Barack Obama

Source 2 on USA Presidents:

1st : George Washington

2nd : John Adams

3rd : Thomas Jefferson

4th : James Madison

41st : George H.W. Bush

42nd : William J. Clinton

43rd : George W. Bush

44th: Barack Obama

Are Source 1 and Source 2 dependent?

Not necessarily

Dependence?

Source 1 on USA Presidents:

1st : George Washington

2nd : Benjamin Franklin

3rd : Tom Jefferson

4th : Abraham Lincoln

41st : George W. Bush

42nd : Hillary Clinton

43rd : Mickey Mouse

44th: Barack Obama

Source 2 on USA Presidents:

1st : George Washington

2nd : Benjamin Franklin

3rd : Tom Jefferson

4th : Abraham Lincoln

41st : George W. Bush

42nd : Hillary Clinton

43rd : Mickey Mouse

44th: John McCain

Are Source 1 and Source 2 dependent?

-- Common Errors

Very likely
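
The contrast between the two president lists can be made concrete by counting how the two sources overlap. This is only a sketch: it assumes a reference list (`TRUTH`) is available for counting, which a real copy detector would not have, and the function name `overlap_counts` is hypothetical.

```python
# Reference values, used here only to classify the overlap.
TRUTH = {
    "1st": "George Washington", "2nd": "John Adams",
    "3rd": "Thomas Jefferson", "4th": "James Madison",
    "41st": "George H.W. Bush", "42nd": "William J. Clinton",
    "43rd": "George W. Bush", "44th": "Barack Obama",
}

# The second pair of sources from the slides (many shared mistakes).
SOURCE_1 = {
    "1st": "George Washington", "2nd": "Benjamin Franklin",
    "3rd": "Tom Jefferson", "4th": "Abraham Lincoln",
    "41st": "George W. Bush", "42nd": "Hillary Clinton",
    "43rd": "Mickey Mouse", "44th": "Barack Obama",
}
SOURCE_2 = dict(SOURCE_1, **{"44th": "John McCain"})


def overlap_counts(s1, s2, truth):
    """Return (same true value, same false value, different values) counts."""
    same_true = same_false = different = 0
    for key in s1.keys() & s2.keys():
        if s1[key] != s2[key]:
            different += 1
        elif s1[key] == truth[key]:
            same_true += 1
        else:
            same_false += 1
    return same_true, same_false, different


print(overlap_counts(TRUTH, TRUTH, TRUTH))        # (8, 0, 0)
print(overlap_counts(SOURCE_1, SOURCE_2, TRUTH))  # (1, 6, 1)
```

Two accurate sources agree on eight true values, which says little about copying. The second pair agrees on six values that do not match the reference list (counting "Tom Jefferson" as a mismatch by plain string comparison), which is hard to explain by chance.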

High-Level Intuitions for Dependence Detection

Intuition I: decide dependence (w/o direction)

Let D1, D2 be data from two sources. D1 and D2 are dependent if

Pr(D1, D2) ≠ Pr(D1) · Pr(D2).

Intuition II: decide copying direction

Let F be a property function of the data; e.g., accuracy of data. D1 is likely to be dependent on D2 if

|F(D1 ∩ D2) − F(D1 − D2)| > |F(D1 ∩ D2) − F(D2 − D1)|.

Dependence?

Source 2 on USA Presidents:

1st : George Washington

2nd : Benjamin Franklin

3rd : Tom Jefferson

4th : Abraham Lincoln

41st : George W. Bush

42nd : Hillary Clinton

43rd : Mickey Mouse

44th: John McCain

Are Source 1 and Source 2 dependent?

-- Different Accuracy

Source 1 on USA Presidents:

1st : George Washington

2nd : John Adams

3rd : Thomas Jefferson

4th : Abraham Lincoln

41st : George W. Bush

42nd : Hillary Clinton

43rd : George W. Bush

44th: John McCain

S1 more likely to be a copier
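
The direction test can be worked through on this pair of sources, taking F to be accuracy. The sketch below assumes that the shared part of two sources is the set of positions where they give the same value and that a reference list (`TRUTH`) is available to measure accuracy; both are simplifications for illustration, and the names are hypothetical.

```python
TRUTH = {
    "1st": "George Washington", "2nd": "John Adams",
    "3rd": "Thomas Jefferson", "4th": "James Madison",
    "41st": "George H.W. Bush", "42nd": "William J. Clinton",
    "43rd": "George W. Bush", "44th": "Barack Obama",
}
S1 = {
    "1st": "George Washington", "2nd": "John Adams",
    "3rd": "Thomas Jefferson", "4th": "Abraham Lincoln",
    "41st": "George W. Bush", "42nd": "Hillary Clinton",
    "43rd": "George W. Bush", "44th": "John McCain",
}
S2 = {
    "1st": "George Washington", "2nd": "Benjamin Franklin",
    "3rd": "Tom Jefferson", "4th": "Abraham Lincoln",
    "41st": "George W. Bush", "42nd": "Hillary Clinton",
    "43rd": "Mickey Mouse", "44th": "John McCain",
}


def accuracy(data, keys, truth):
    """Fraction of the given keys whose value matches the reference list."""
    keys = list(keys)  # must be non-empty in this simple sketch
    return sum(data[k] == truth[k] for k in keys) / len(keys)


def direction_gaps(d1, d2, truth):
    """Return (|F(shared) - F(d1 only)|, |F(shared) - F(d2 only)|)."""
    shared = [k for k in d1 if k in d2 and d1[k] == d2[k]]
    own1 = [k for k in d1 if k not in shared]
    own2 = [k for k in d2 if k not in shared]
    f_shared = accuracy(d1, shared, truth)
    gap1 = abs(f_shared - accuracy(d1, own1, truth))
    gap2 = abs(f_shared - accuracy(d2, own2, truth))
    return gap1, gap2


gap1, gap2 = direction_gaps(S1, S2, TRUTH)
print(gap1 > gap2)  # True: S1 looks like the copier
```

S1 is fully accurate on the three values it does not share with S2 but only 1 of 5 on the shared ones, so its gap (about 0.8) exceeds S2's (about 0.2), matching the slide's conclusion.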

Outline

• Motivation and intuitions for solution
• Techniques
  • Problem definition
  • Copying detection
  • Truth discovery
• Experimental Results
• Framework of the Solomon project

Problem Definition

INPUT

• Objects: an aspect of a real-world entity
  • E.g., director of a movie, author list of a book
  • Each associated with one true value
• Sources: each providing values for a subset of objects

OUTPUT: the true value for each object

Source Dependence

Source dependence: two sources S and T deriving the same part of data directly or transitively from a common source (can be one of S or T).

• Independent source
• Copier
  • copying part (or all) of data from other sources
  • may verify or revise some of the copied values
  • may add additional values

Assumptions

• Independent values
• Independent copying
• No loop copying

Models for a Static World

Core case

Conditions

1. Same source accuracy
2. Uniform false-value distribution
3. Categorical value

Proposition: With independent “good” sources, naïve voting selects values with highest probability to be true.

Models

• Depen
• AccuPR: consider value probabilities in dependence analysis
• Accu: remove Cond 1
• Sim: remove Cond 3
• NonUni: remove Cond 2

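Bayesian Analysis – Basic

[Figure: the values of S1 and S2 split into three sets: O.At (same values, TRUE), O.Af (same values, FALSE), and O.Ad (different values).]

Observation: $\Phi$

Goal: $\Pr(S_1 \perp S_2 \mid \Phi)$, $\Pr(S_1 \sim S_2 \mid \Phi)$ (sum up to 1)

According to the Bayes Rule, we need to know $\Pr(\Phi \mid S_1 \perp S_2)$, $\Pr(\Phi \mid S_1 \sim S_2)$

Key: computing $\Pr(\Phi_{O.A} \mid S_1 \perp S_2)$, $\Pr(\Phi_{O.A} \mid S_1 \sim S_2)$ for each $O.A \in S_1 \cap S_2$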

Bayesian Analysis – Probability Computation

Pr     Independence                                          Copying
O.At   $(1-\varepsilon)^2$                                   $(1-\varepsilon)\,c + (1-\varepsilon)^2 (1-c)$
O.Af   $\varepsilon^2 / n$                                   $\varepsilon\,c + (\varepsilon^2 / n)(1-c)$
O.Ad   $P_d = 1 - (1-\varepsilon)^2 - \varepsilon^2 / n$     $P_d\,(1-c)$

ε: error rate; n: #wrong-values; c: copy rate
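
As a rough illustration of how these per-object probabilities combine under the Bayes rule, the sketch below multiplies them over the counts of objects in O.At, O.Af, and O.Ad. The prior probability of copying (`alpha`) and all parameter values are assumptions made for the example; they do not appear on the slides, and the function name is hypothetical.

```python
import math


def dependence_posterior(kt, kf, kd, eps, n, c, alpha):
    """Posterior probability of copying given the overlap counts.

    kt, kf, kd: number of objects with the same true value, the same
    false value, and different values. alpha is an assumed prior
    probability of copying (not taken from the slides).
    """
    pt = (1 - eps) ** 2
    pf = eps ** 2 / n
    pd = 1 - pt - pf
    # Log-likelihood of the observation under independence and under copying.
    log_ind = kt * math.log(pt) + kf * math.log(pf) + kd * math.log(pd)
    log_cop = (kt * math.log((1 - eps) * c + pt * (1 - c))
               + kf * math.log(eps * c + pf * (1 - c))
               + kd * math.log(pd * (1 - c)))
    # Bayes rule, computed in log space to avoid underflow.
    a = math.log(alpha) + log_cop
    b = math.log(1 - alpha) + log_ind
    m = max(a, b)
    return math.exp(a - m) / (math.exp(a - m) + math.exp(b - m))


params = dict(eps=0.2, n=100, c=0.8, alpha=0.2)
shared_truth = dependence_posterior(8, 0, 0, **params)
shared_errors = dependence_posterior(1, 6, 1, **params)
print(shared_errors > shared_truth)  # True
```

With these assumed parameters, eight shared true values leave the posterior near one half, while six shared false values push it very close to 1, in line with the "Not necessarily" and "Very likely" answers on the president slides.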

Considering Source Accuracy

Pr     Independence                                              S1 Copies S2                              S2 Copies S1
O.At   $P_t = (1-\varepsilon_{S_1})(1-\varepsilon_{S_2})$        $(1-\varepsilon_{S_2})\,c + P_t\,(1-c)$   $(1-\varepsilon_{S_1})\,c + P_t\,(1-c)$
O.Af   $P_f = \varepsilon_{S_1}\varepsilon_{S_2} / n$            $\varepsilon_{S_2}\,c + P_f\,(1-c)$       $\varepsilon_{S_1}\,c + P_f\,(1-c)$
O.Ad   $P_d = 1 - P_t - P_f$                                     $P_d\,(1-c)$                              $P_d\,(1-c)$

II. Finding the True Value

$P(v) = \frac{e^{C(v)}}{\sum_{v_0 \in D(O)} e^{C(v_0)}}$

$A(S) = \mathrm{Avg}_{v \in V(S)}\, P(v)$

$A'(S) = \ln \frac{n\,A(S)}{1 - A(S)}$

$C(v) = \sum_{S \in S(v)} A'(S)$

Consider dependence:

$C(v) = \sum_{S \in S(v)} A'(S)\, I(S)$
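
The formulas on this slide can be sketched in a few lines. The function names are hypothetical, and the weight I(S) is assumed here to be a product of factors 1 − c · Pr(dependence), which is only what the "1 − .99 × .8 ≈ .2" annotation on the Round 1 slide suggests. The example also assumes equal accuracy scores of 1 and that the second and third providers of BEA are discounted once and twice.

```python
import math


def accuracy_score(accuracy, n_false):
    """A'(S) = ln(n * A(S) / (1 - A(S)))."""
    return math.log(n_false * accuracy / (1.0 - accuracy))


def independence_weight(dep_probs, copy_rate):
    """Assumed form of I(S): product of (1 - c * Pr(dependence))."""
    weight = 1.0
    for p in dep_probs:
        weight *= 1.0 - copy_rate * p
    return weight


def vote_counts(providers, scores, weights):
    """C(v) = sum over the sources providing v of A'(S) * I(S)."""
    return {v: sum(scores[s] * weights.get(s, 1.0) for s in srcs)
            for v, srcs in providers.items()}


def value_probabilities(counts):
    """P(v) = exp(C(v)) / sum over v0 of exp(C(v0))."""
    m = max(counts.values())  # subtract the maximum for numerical stability
    exps = {v: math.exp(c - m) for v, c in counts.items()}
    total = sum(exps.values())
    return {v: e / total for v, e in exps.items()}


# Carey's affiliation in the motivating example.
providers = {"UCI": ["S1"], "AT&T": ["S2"], "BEA": ["S3", "S4", "S5"]}
scores = {s: 1.0 for s in ["S1", "S2", "S3", "S4", "S5"]}
weights = {
    "S4": independence_weight([0.99], 0.8),        # 0.208
    "S5": independence_weight([0.99, 0.99], 0.8),  # 0.208 ** 2
}
counts = vote_counts(providers, scores, weights)
probs = value_probabilities(counts)
```

Under these assumptions the three BEA votes count for roughly 1.25 rather than 3, so BEA's lead over UCI and AT&T shrinks sharply once dependence is considered.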

Solution on the Motivating Example

Round 1

[Figure: "Copying Relationship" graph over S1, S2, S3, S4, S5, and "Truth Discovery" graph for Carey linking UCI to S1, AT&T to S2, and BEA to S3, S4, S5. Labels as extracted: .87, .2, .2, .99, .99, .99. Annotations: (1-.99*.8=.2), (.22).]

Solution on the Motivating Example

Round 2

[Figure: "Copying Relationship" graph over S1, S2, S3, S4, S5, and "Truth Discovery" graph for Carey linking UCI to S1, AT&T to S2, and BEA to S3, S4, S5. Labels as extracted: .14, .49, .49, .49, .08, .49, .49, .49.]

Solution on the Motivating Example

Round 3

[Figure: "Copying Relationship" graph over S1, S2, S3, S4, S5, and "Truth Discovery" graph for Carey linking UCI to S1, AT&T to S2, and BEA to S3, S4, S5. Labels as extracted: .12, .49, .49, .49, .06, .49, .49, .49.]

Solution on the Motivating Example

Round 4

[Figure: "Copying Relationship" graph over S1, S2, S3, S4, S5, and "Truth Discovery" graph for Carey linking UCI to S1, AT&T to S2, and BEA to S3, S4, S5. Labels as extracted: .10, .48, .49, .50, .05, .49, .48, .50.]

Solution on the Motivating Example

Round 5

[Figure: "Copying Relationship" graph over S1, S2, S3, S4, S5, and "Truth Discovery" graph for Carey linking UCI to S1, AT&T to S2, and BEA to S3, S4, S5. Labels as extracted: .09, .47, .49, .51, .04, .49, .47, .51.]

Solution on the Motivating Example

Round 13

[Figure: "Copying Relationship" graph over S1, S2, S3, S4, S5, and "Truth Discovery" graph for Carey linking UCI to S1, AT&T to S2, and BEA to S3, S4, S5. Labels as extracted: .55, .49, .55, .49, .44, .44.]

Combining Accuracy and Dependence

[Figure: a cycle connecting Truth Discovery, Source-accuracy Computation, and Dependence Detection, with arrows labeled Step 1, Step 2, and Step 3.]

Theorem: w/o accuracy, converges

Observation: w. accuracy, converges when #objs >> #srcs
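
To show the shape of the iteration, here is a minimal sketch of an accuracy-only variant that alternates truth discovery and source-accuracy computation on the motivating example. Dependence detection is deliberately left out, so this is not the full method from the talk; the initial accuracy, the number of false values, the fixed round count, and the clamping bounds are all assumptions.

```python
import math

# Claims from the motivating example: object -> {source: value}.
CLAIMS = {
    "Stonebraker": {"S1": "MIT", "S2": "Berkeley", "S3": "MIT", "S4": "MIT", "S5": "MS"},
    "Dewitt": {"S1": "MSR", "S2": "MSR", "S3": "UWisc", "S4": "UWisc", "S5": "UWisc"},
    "Bernstein": {"S1": "MSR", "S2": "MSR", "S3": "MSR", "S4": "MSR", "S5": "MSR"},
    "Carey": {"S1": "UCI", "S2": "AT&T", "S3": "BEA", "S4": "BEA", "S5": "BEA"},
    "Halevy": {"S1": "Google", "S2": "Google", "S3": "UW", "S4": "UW", "S5": "UW"},
}


def iterate(claims, n_false=10, init_accuracy=0.8, rounds=20):
    """Alternate value-probability and source-accuracy updates (no dependence)."""
    sources = sorted({s for votes in claims.values() for s in votes})
    acc = {s: init_accuracy for s in sources}
    probs = {}
    for _ in range(rounds):
        # Truth discovery: vote counts from accuracy scores, then probabilities.
        for obj, votes in claims.items():
            counts = {}
            for s, v in votes.items():
                a = min(max(acc[s], 0.01), 0.99)  # keep the logarithm finite
                counts[v] = counts.get(v, 0.0) + math.log(n_false * a / (1 - a))
            m = max(counts.values())
            z = sum(math.exp(c - m) for c in counts.values())
            probs[obj] = {v: math.exp(c - m) / z for v, c in counts.items()}
        # Source accuracy: average probability of the values a source provides.
        for s in sources:
            ps = [probs[obj][votes[s]] for obj, votes in claims.items() if s in votes]
            acc[s] = sum(ps) / len(ps)
    return probs, acc


probs, acc = iterate(CLAIMS)
best = {obj: max(p, key=p.get) for obj, p in probs.items()}
print(best["Carey"])  # BEA
```

Because S3, S4, and S5 reinforce each other, this accuracy-only loop still settles on the values they share, which illustrates why the dependence-detection step is part of the cycle.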

Experimental Setup

Dataset: AbeBooks

• 877 bookstores
• 1263 CS books
• 24364 listings, with ISBN and author list
• After pre-cleaning, each book on average has 19 listings and 4 author lists (ranges from 1-23)

Gold standard: 100 random books

• Manually check author list from book cover

Measure: Precision = #(correct author lists) / #(all lists)

Naïve Voting and Types of Errors

Naïve voting has precision .71

Contributions of Various Components

Methods                  Prec  #Rnds  Time(s)
Naïve                    .71   1      .2
Only value similarity    .74   1      .2
Only source accuracy     .79   23     1.1
Only source dependence   .83   3      28.3
Depen+accu               .87   22     185.8
Depen+accu+sim           .89   18     197.5

Precision improves by 25.4% over Naïve

Considering dependence improves the results most

Reasonably fast
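
The 25.4% figure is a relative improvement, which can be checked from the precision column. The helper names below are illustrative only.

```python
# Precision values from the table above.
PRECISION = {
    "Naïve": 0.71,
    "Only value similarity": 0.74,
    "Only source accuracy": 0.79,
    "Only source dependence": 0.83,
    "Depen+accu": 0.87,
    "Depen+accu+sim": 0.89,
}


def precision(correct_lists, all_lists):
    """Precision = #(correct author lists) / #(all lists)."""
    return correct_lists / all_lists


def relative_improvement(new, base):
    """Relative gain of `new` over `base`, as a fraction of `base`."""
    return (new - base) / base


gain = relative_improvement(PRECISION["Depen+accu+sim"], PRECISION["Naïve"])
print(round(gain * 100, 1))  # 25.4
```

The same calculation for "Only source dependence" gives about 16.9%, the largest gain among the single-component rows.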

Data Integration Faces 3 Challenges

[Figure labels: Scissors, Paper, Glue]

• Schema matching
• Model management
• Query answering using views
• Information extraction

• String matching (edit distance, token-based, etc.)
• Object matching (aka. record linkage, reference reconciliation, …)

• Data fusion
• Truth discovery

Assume INDEPENDENCE of data sources

Source Copying Adds A New Dimension to Data Integration

Copying Can Be Large Scaled [VLDB’10a] (Copying of AbeBooks Data)

Data collected from AbeBooks [Yin et al., 2007]

Related Work

• Data provenance [Buneman et al., PODS’08]
  • Focus on effective presentation and retrieval
  • Assume knowledge of provenance/lineage
• Opinion pooling [Clemen&Winkler, 1985]
  • Combine pr distributions from multiple experts
  • Again, assume knowledge of dependence
• Detect plagiarism of text, image/video, programs, etc. [Dong, Sigmod’11 tutorial]

http://www2.research.att.com/~yifanhu/SourceCopying/
