Chapter 13: Incorporating Uncertainty into Data Integration

AnHai Doan, Alon Halevy, Zachary Ives
Principles of Data Integration


Post on 21-Jan-2016




Outline

Sources of uncertainty in data integration
Representing uncertain data (brief overview)
Probabilistic schema mappings


Managing Uncertain Data

Databases typically model certain data:
A tuple is either true (in the database) or false (not in the database).

Real life involves a lot of uncertainty:
“The thief had either blond or brown hair.”
The sensor reading is often unreliable.

Uncertain databases try to model such uncertain data and to answer queries in a principled fashion.

Data integration involves multiple facets of uncertainty!


Uncertainty in Data Integration

Data itself may be uncertain (perhaps it’s extracted from an unreliable source)

Schema mappings can be approximate (perhaps created by an automatic tool)

Reference reconciliation (and hence any join that relies on it) is approximate

If the domain is broad enough, even the mediated schema could involve uncertainty

Queries, often posed as keywords, have uncertain intent.


Outline

Sources of uncertainty in data integration
Representing uncertain data (brief overview)
Probabilistic schema mappings


Principles of Uncertain Databases

Instead of describing one possible state of the world, an uncertain database describes a set of possible worlds.

The expressive power of the data model determines which sets of possible worlds the database can represent:
Is the uncertainty on the values of an attribute?
Or on the presence of a tuple?
Can dependencies between tuples be represented?


C-Tables: Uncertainty without Probabilities

Alice and Bob want to go on a vacation together, but will go to either Tahiti or Ulaanbaatar. Candace will definitely go to Ulaanbaatar.

Possible worlds result from different assignments to the variables.
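The c-table figure did not survive in this transcript, so the following Python sketch assumes one plausible encoding of the example: Alice's and Bob's tuples share a single variable x ranging over the two destinations, and Candace's tuple is a constant. The names `VARIABLES`, `C_TABLE`, and `possible_worlds` are illustrative and not taken from the slides.

```python
from itertools import product

# Assumed encoding: a destination is either a constant or a variable name.
VARIABLES = {"x": ["Tahiti", "Ulaanbaatar"]}  # domain of each variable
C_TABLE = [
    ("Alice", "x"),
    ("Bob", "x"),
    ("Candace", "Ulaanbaatar"),
]

def possible_worlds(c_table, variables):
    """Yield one concrete relation per assignment of values to variables."""
    names = sorted(variables)
    for values in product(*(variables[n] for n in names)):
        assignment = dict(zip(names, values))
        # Variables are replaced by their assigned value; constants stay as-is.
        yield {(person, assignment.get(dest, dest)) for person, dest in c_table}

worlds = list(possible_worlds(C_TABLE, VARIABLES))
for world in worlds:
    print(sorted(world))
```

Because both tuples use the same variable, every generated world sends Alice and Bob to the same place, which is the dependency the example is meant to capture.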


Representing Complex Distributions

The c-table represents mutual exclusion of tuples, but doesn’t represent probability distributions.

Representing complex probability distributions and correlations between tuples requires using probabilistic graphical models.

A couple of simpler models:
Independent tuple probabilities
Block independent probabilities


Tuple Independent Model

Assign each tuple a probability.

The probability of every possible world is the appropriate product of the probabilities for each of the rows: p_i if row i is in the database, and (1 - p_i) if it is not.

Cannot represent correlations between tuples.
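As a minimal sketch of the product rule above, the following Python code enumerates every possible world of a small tuple-independent table and computes its probability. The tuples and their probabilities are hypothetical values chosen only for illustration, and `world_probabilities` is an assumed helper name.

```python
from itertools import product

def world_probabilities(tuple_probs):
    """Map each possible world (a frozenset of tuples) to its probability."""
    tuples = list(tuple_probs)
    result = {}
    for present in product([True, False], repeat=len(tuples)):
        prob = 1.0
        for t, is_in in zip(tuples, present):
            p = tuple_probs[t]
            # p_i if the row is in this world, (1 - p_i) otherwise.
            prob *= p if is_in else (1.0 - p)
        world = frozenset(t for t, is_in in zip(tuples, present) if is_in)
        result[world] = prob
    return result

# Hypothetical tuples and probabilities, for illustration only.
probs = {("Alice", "Tahiti"): 0.6, ("Bob", "Tahiti"): 0.3}
worlds = world_probabilities(probs)
for world, p in worlds.items():
    print(sorted(world), round(p, 2))
```

With n tuples there are 2^n worlds, and their probabilities sum to 1. Since each factor depends on a single row, no choice of the p_i can make one tuple's presence depend on another's.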


Block Independent Model

You choose one tuple from every block according to the distribution of that block.

Can represent mutual exclusion, but not co-dependence (e.g., Alice and Bob going to the same location).
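The following Python sketch illustrates the block independent model under the assumption stated above, that exactly one tuple is chosen per block. The blocks and their probabilities are hypothetical; `BLOCKS` and `block_worlds` are illustrative names.

```python
from itertools import product

# Hypothetical blocks: the tuples within a block are mutually exclusive and
# their probabilities sum to 1 (exactly one tuple is chosen per block).
BLOCKS = {
    "Alice": {("Alice", "Tahiti"): 0.7, ("Alice", "Ulaanbaatar"): 0.3},
    "Bob": {("Bob", "Tahiti"): 0.4, ("Bob", "Ulaanbaatar"): 0.6},
}

def block_worlds(blocks):
    """Enumerate worlds by picking one tuple per block, independently."""
    per_block = [list(choices.items()) for choices in blocks.values()]
    worlds = {}
    for picks in product(*per_block):
        prob = 1.0
        for _tuple, p in picks:
            prob *= p
        worlds[frozenset(t for t, _ in picks)] = prob
    return worlds

worlds = block_worlds(BLOCKS)
# Probability mass of the worlds where both travelers share a destination.
same_place = sum(p for w, p in worlds.items()
                 if len({dest for _, dest in w}) == 1)
print(len(worlds), round(same_place, 2))
```

Each person has exactly one destination in every world (mutual exclusion), but the worlds in which Alice and Bob travel to different places keep nonzero probability, so the model cannot force them to go together.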


Outline

Sources of uncertainty in data integration
Representing uncertain data (brief overview)
Probabilistic schema mappings


Probabilistic Schema Mappings

Source schema: S=(pname, email-addr, home-addr, office-addr)

Target schema: T=(name, mailing-addr)

We may not be sure which attribute of S mailing-addr should map to.

Probabilistic schema mappings let us handle such uncertainty.


Probabilistic Schema Mappings

S=(pname, email-addr, home-addr, office-addr)

T=(name, mailing-addr)

Intuitively, we want to give each mapping a probability:

Possible Mapping                              Probability
{(pname,name),(home-addr, mailing-addr)}      0.5
{(pname,name),(office-addr, mailing-addr)}    0.4
{(pname,name),(email-addr, mailing-addr)}     0.1


What are the Semantics?

S=(pname, email-addr, home-addr, office-addr)

T=(name, mailing-addr)

Possible Mapping                              Probability
{(pname,name),(home-addr, mailing-addr)}      0.5
{(pname,name),(office-addr, mailing-addr)}    0.4
{(pname,name),(email-addr, mailing-addr)}     0.1

Should a single mapping apply to the entire table (by-table semantics), or can different mappings apply to different tuples (by-tuple semantics)?


By-Table versus By-Tuple Semantics

Ds =

pname   email-addr   home-addr       office-addr
Alice   alice@       Mountain View   Sunnyvale
Bob     bob@         Sunnyvale       Sunnyvale

There are 3 possible databases DT:

DT with Pr(m1)=0.5:

name    mailing-addr
Alice   Mountain View
Bob     Sunnyvale

DT with Pr(m2)=0.4:

name    mailing-addr
Alice   Sunnyvale
Bob     Sunnyvale

DT with Pr(m3)=0.1:

name    mailing-addr
Alice   alice@
Bob     bob@
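The by-table construction can be sketched in Python as follows. The source rows and probabilities come from the slide; reducing each mapping to the source attribute that feeds mailing-addr (since pname maps to name in all three) is a simplification made here, and `apply_mapping` and `by_table_targets` are illustrative names.

```python
# Source instance Ds from the slide.
SOURCE_COLUMNS = ("pname", "email-addr", "home-addr", "office-addr")
DS = [
    ("Alice", "alice@", "Mountain View", "Sunnyvale"),
    ("Bob", "bob@", "Sunnyvale", "Sunnyvale"),
]

# Each mapping is reduced to the source attribute mapped to mailing-addr,
# paired with the mapping's probability (m1, m2, m3 in that order).
P_MAPPING = [("home-addr", 0.5), ("office-addr", 0.4), ("email-addr", 0.1)]

def apply_mapping(row, addr_attr):
    """Translate one source row into a (name, mailing-addr) target row."""
    record = dict(zip(SOURCE_COLUMNS, row))
    return (record["pname"], record[addr_attr])

def by_table_targets(source, p_mapping):
    """One target instance per mapping; the same mapping is used for all rows."""
    return [([apply_mapping(r, attr) for r in source], pr)
            for attr, pr in p_mapping]

targets = by_table_targets(DS, P_MAPPING)
for instance, pr in targets:
    print(pr, instance)
```

The number of candidate target instances equals the number of mappings, regardless of how many rows the source has.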


By-Table versus By-Tuple Semantics

Ds =

pname   email-addr   home-addr       office-addr
Alice   alice@       Mountain View   Sunnyvale
Bob     bob@         Sunnyvale       Sunnyvale

There are 9 possible databases DT, one per sequence of mappings; three of them are:

DT with Pr(<m1,m3>)=0.05:

name    mailing-addr
Alice   Mountain View
Bob     bob@

DT with Pr(<m2,m3>)=0.04:

name    mailing-addr
Alice   Sunnyvale
Bob     bob@

DT with Pr(<m3,m3>)=0.01:

name    mailing-addr
Alice   alice@
Bob     bob@
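Under by-tuple semantics each source tuple picks its own mapping independently, so the candidates are sequences of mappings. The Python sketch below enumerates them for the slide's example; the labels m1, m2, m3 follow the slide, while `by_tuple_sequences` and the grouping step are illustrative additions.

```python
from collections import defaultdict
from itertools import product

SOURCE_COLUMNS = ("pname", "email-addr", "home-addr", "office-addr")
DS = [
    ("Alice", "alice@", "Mountain View", "Sunnyvale"),
    ("Bob", "bob@", "Sunnyvale", "Sunnyvale"),
]
# m1, m2, m3 from the slide, keyed by the attribute mapped to mailing-addr.
P_MAPPING = {"m1": ("home-addr", 0.5),
             "m2": ("office-addr", 0.4),
             "m3": ("email-addr", 0.1)}

def by_tuple_sequences(source, p_mapping):
    """Yield (mapping sequence, target instance, probability)."""
    for seq in product(p_mapping, repeat=len(source)):
        prob = 1.0
        instance = []
        for row, name in zip(source, seq):
            attr, pr = p_mapping[name]
            record = dict(zip(SOURCE_COLUMNS, row))
            instance.append((record["pname"], record[attr]))
            prob *= pr  # mappings are chosen independently per tuple
        yield seq, tuple(instance), prob

sequences = {seq: (inst, pr)
             for seq, inst, pr in by_tuple_sequences(DS, P_MAPPING)}

# Group sequences that happen to produce the same target instance.
distinct = defaultdict(float)
for inst, pr in sequences.values():
    distinct[inst] += pr

print(len(sequences), len(distinct))
```

With k mappings and n source tuples there are k^n sequences, here 3^2 = 9. In this particular instance Bob's home and office addresses coincide, so the 9 sequences yield only 6 distinct target instances.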


Complexity of Query Answering

Answering queries is more expensive under by-tuple semantics:

                     By-table   By-tuple
Data Complexity      PTIME      #P-complete
Mapping Complexity   PTIME      PTIME
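One way to see why by-table query answering stays tractable is the following Python sketch: it evaluates the query once per mapping and adds that mapping's probability to every answer it produces. Materializing each target instance is a simplification made here for readability, and the query `all_addresses` (roughly SELECT mailing-addr FROM T) is a hypothetical example, not one from the slides.

```python
from collections import defaultdict

SOURCE_COLUMNS = ("pname", "email-addr", "home-addr", "office-addr")
DS = [
    ("Alice", "alice@", "Mountain View", "Sunnyvale"),
    ("Bob", "bob@", "Sunnyvale", "Sunnyvale"),
]
P_MAPPING = [("home-addr", 0.5), ("office-addr", 0.4), ("email-addr", 0.1)]

def by_table_answers(source, p_mapping, query):
    """Return {answer tuple: probability} under by-table semantics."""
    answers = defaultdict(float)
    for attr, pr in p_mapping:
        # Build the target instance induced by this single mapping.
        target = []
        for row in source:
            record = dict(zip(SOURCE_COLUMNS, row))
            target.append((record["pname"], record[attr]))
        # An answer gains the probability of each mapping that yields it;
        # set() avoids counting a duplicate answer twice for one mapping.
        for answer in set(query(target)):
            answers[answer] += pr
    return dict(answers)

def all_addresses(target):
    """Hypothetical query: project the mailing-addr column."""
    return [(addr,) for _name, addr in target]

answers = by_table_answers(DS, P_MAPPING, all_addresses)
for answer, pr in sorted(answers.items()):
    print(answer, round(pr, 2))
```

The loop runs once per mapping, so the work grows linearly with the number of mappings. A naive by-tuple evaluation would instead have to range over all mapping sequences, whose number grows exponentially with the number of source tuples.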


Summary of Chapter 13

Uncertainty is everywhere in data integration.
Work in this area is really only beginning.
  Great opportunity for further research.
Probabilistic schema mappings:
  By-table versus by-tuple semantics.
  By-tuple semantics is computationally expensive, but restricted cases can be found where query answering is still polynomial.
Where do the probabilities come from?
  Sometimes we interpret statistics as probabilities.
  Sometimes the provenance of the data is more meaningful than the probabilities.