PRINCIPLES OF DATA INTEGRATION
AnHai Doan, Alon Halevy, Zachary Ives

Chapter 13: Incorporating Uncertainty into Data Integration
Outline
Sources of uncertainty in data integration
Representing uncertain data (brief overview)
Probabilistic schema mappings
Managing Uncertain Data
Databases typically model certain data: a tuple is either true (in the database) or false (not in the database).
Real life involves a lot of uncertainty:
“The thief had either blond or brown hair.”
Sensor readings are often unreliable.
Uncertain databases try to model such uncertain data and to answer queries over it in a principled fashion.
Data integration involves multiple facets of uncertainty!
Uncertainty in Data Integration
Data itself may be uncertain (perhaps it’s extracted from an unreliable source)
Schema mappings can be approximate (perhaps created by an automatic tool)
Reference reconciliation (and hence joins) is approximate
If the domain is broad enough, even the mediated schema could involve uncertainty
Queries, often posed as keywords, have uncertain intent.
Outline
Sources of uncertainty in data integration
Representing uncertain data (brief overview)
Probabilistic schema mappings
Principles of Uncertain Databases
Instead of describing one possible state of the world, an uncertain database describes a set of possible worlds.
The expressive power of the data model determines which sets of possible worlds the database can represent:
Is the uncertainty on the values of an attribute, or on the presence of a tuple?
Can dependencies between tuples be represented?
C-Tables: Uncertainty without Probabilities
Alice and Bob want to go on a vacation together, but will go to either Tahiti or Ulaanbaatar. Candace will definitely go to Ulaanbaatar.
Possible worlds result from different assignments to the variables.
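A minimal sketch of how such a c-table could be evaluated. The variable name x, the two-letter city codes, and the Python encoding are assumptions for illustration, not from the book:

```python
# Hypothetical c-table for the vacation example: each tuple carries a
# condition over a variable x that picks the shared destination.
# x = "T" means Tahiti, x = "U" means Ulaanbaatar.
ctable = [
    ("Alice",   "Tahiti",      lambda x: x == "T"),
    ("Alice",   "Ulaanbaatar", lambda x: x == "U"),
    ("Bob",     "Tahiti",      lambda x: x == "T"),
    ("Bob",     "Ulaanbaatar", lambda x: x == "U"),
    ("Candace", "Ulaanbaatar", lambda x: True),  # unconditional tuple
]

def possible_worlds(ctable, domain=("T", "U")):
    """One possible world per assignment to x: keep exactly the tuples
    whose condition holds under that assignment."""
    for x in domain:
        yield [(who, where) for (who, where, cond) in ctable if cond(x)]

for world in possible_worlds(ctable):
    print(world)
```

Because Alice's and Bob's tuples share the variable x, the two worlds send them to the same city, while Candace appears in both worlds.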
Representing Complex Distributions
The c-table represents mutual exclusion of tuples, but doesn’t represent probability distributions.
Representing complex probability distributions and correlations between tuples requires probabilistic graphical models.
A couple of simpler models:
Independent tuple probabilities
Block-independent probabilities
Tuple Independent Model
Assign each tuple a probability. The probability of a possible world is the product over all tuples: pi if row i is in the world, and (1 - pi) if it is not.
Cannot represent correlations between tuples.
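A short sketch of the world-probability computation, assuming made-up tuple ids and probabilities:

```python
# Tuple-independent model: each tuple appears independently with its
# own probability. (Tuple ids and probabilities here are made up.)
probs = {"t1": 0.9, "t2": 0.5, "t3": 0.3}

def world_probability(present, probs):
    """Probability of the possible world containing exactly the tuples
    in `present`: p_i for each row in the world, (1 - p_i) otherwise."""
    p = 1.0
    for t, pt in probs.items():
        p *= pt if t in present else (1 - pt)
    return p

# World {t1, t2} without t3: 0.9 * 0.5 * (1 - 0.3), about 0.315.
print(world_probability({"t1", "t2"}, probs))
```

Summing over all 2^n subsets of tuples gives total probability 1, which is a quick sanity check on the model.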
Block Independent Model
Tuples are partitioned into blocks; choose one tuple from each block according to that block’s distribution.
Can represent mutual exclusion, but not co-dependence (e.g., Alice and Bob going to the same location).
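A sketch of the block-independent model on the vacation example; the per-block probabilities are invented for illustration:

```python
from itertools import product

# Block-independent model: tuples are partitioned into blocks; exactly
# one tuple per block appears, drawn from the block's own distribution.
# The probabilities below are made-up illustrations.
blocks = [
    [("Alice", "Tahiti", 0.6), ("Alice", "Ulaanbaatar", 0.4)],
    [("Bob",   "Tahiti", 0.7), ("Bob",   "Ulaanbaatar", 0.3)],
]

def possible_worlds(blocks):
    """Every combination of one tuple per block, with its probability
    (blocks are independent, so probabilities multiply)."""
    for choice in product(*blocks):
        prob = 1.0
        world = []
        for name, dest, p in choice:
            prob *= p
            world.append((name, dest))
        yield world, prob

# Note that worlds where Alice and Bob are in different cities get
# nonzero probability: the model cannot force them to travel together.
for world, prob in possible_worlds(blocks):
    print(world, prob)
```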
Outline
Sources of uncertainty in data integration
Representing uncertain data (brief overview)
Probabilistic schema mappings
Probabilistic Schema Mappings
Source schema: S = (pname, email-addr, home-addr, office-addr)
Target schema: T = (name, mailing-addr)
We may not be sure which attribute of S mailing-addr should map to.
Probabilistic schema mappings let us handle such uncertainty.
Probabilistic Schema Mappings
S = (pname, email-addr, home-addr, office-addr)
T = (name, mailing-addr)

Intuitively, we want to give each mapping a probability:

Possible Mapping                                 Probability
{(pname, name), (home-addr, mailing-addr)}       0.5
{(pname, name), (office-addr, mailing-addr)}     0.4
{(pname, name), (email-addr, mailing-addr)}      0.1
What are the Semantics?
S = (pname, email-addr, home-addr, office-addr)
T = (name, mailing-addr)

Possible Mapping                                 Probability
{(pname, name), (home-addr, mailing-addr)}       0.5
{(pname, name), (office-addr, mailing-addr)}     0.4
{(pname, name), (email-addr, mailing-addr)}      0.1

Should a single mapping apply to the entire table (by-table semantics), or can different mappings apply to different tuples (by-tuple semantics)?
By-Table versus By-Tuple Semantics
DS =
pname   email-addr   home-addr       office-addr
Alice   alice@       Mountain View   Sunnyvale
Bob     bob@         Sunnyvale       Sunnyvale

Under by-table semantics there are 3 possible databases DT, one per mapping:

m1, Pr(m1) = 0.5:
name    mailing-addr
Alice   Mountain View
Bob     Sunnyvale

m2, Pr(m2) = 0.4:
name    mailing-addr
Alice   Sunnyvale
Bob     Sunnyvale

m3, Pr(m3) = 0.1:
name    mailing-addr
Alice   alice@
Bob     bob@
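By-table query answering can be sketched by enumerating one target database per mapping. The dictionary encoding of tuples and mappings is an assumption for illustration:

```python
# By-table semantics: one mapping is drawn and applied to EVERY tuple.
# Each mapping here says which source attribute feeds mailing-addr.
mappings = [("home-addr", 0.5), ("office-addr", 0.4), ("email-addr", 0.1)]

source = [
    {"pname": "Alice", "email-addr": "alice@",
     "home-addr": "Mountain View", "office-addr": "Sunnyvale"},
    {"pname": "Bob", "email-addr": "bob@",
     "home-addr": "Sunnyvale", "office-addr": "Sunnyvale"},
]

def by_table_worlds(source, mappings):
    """One possible target database per mapping, with its probability."""
    for attr, p in mappings:
        target = [{"name": r["pname"], "mailing-addr": r[attr]}
                  for r in source]
        yield target, p

for target, p in by_table_worlds(source, mappings):
    print(p, target)
```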
By-Table versus By-Tuple Semantics
DS =
pname   email-addr   home-addr       office-addr
Alice   alice@       Mountain View   Sunnyvale
Bob     bob@         Sunnyvale       Sunnyvale

Under by-tuple semantics there are 9 possible databases DT (one per sequence of mappings, one mapping per tuple), for example:

<m1, m3>, Pr = 0.05:
name    mailing-addr
Alice   Mountain View
Bob     bob@

<m2, m3>, Pr = 0.04:
name    mailing-addr
Alice   Sunnyvale
Bob     bob@

<m3, m3>, Pr = 0.01:
name    mailing-addr
Alice   alice@
Bob     bob@

…
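By-tuple semantics can be sketched the same way, except each source tuple independently draws its own mapping, so the possible databases multiply. The encoding is the same assumed one as before:

```python
from itertools import product

# By-tuple semantics: each source tuple independently chooses a
# mapping, giving |mappings|^|source| possible target databases.
mappings = [("home-addr", 0.5), ("office-addr", 0.4), ("email-addr", 0.1)]

source = [
    {"pname": "Alice", "email-addr": "alice@",
     "home-addr": "Mountain View", "office-addr": "Sunnyvale"},
    {"pname": "Bob", "email-addr": "bob@",
     "home-addr": "Sunnyvale", "office-addr": "Sunnyvale"},
]

def by_tuple_worlds(source, mappings):
    """One possible target database per mapping sequence (one mapping
    per source tuple), with the product of the chosen probabilities."""
    for seq in product(mappings, repeat=len(source)):
        prob = 1.0
        target = []
        for row, (attr, p) in zip(source, seq):
            prob *= p
            target.append({"name": row["pname"], "mailing-addr": row[attr]})
        yield target, prob

worlds = list(by_tuple_worlds(source, mappings))
print(len(worlds))  # 9 possible target databases
```

For instance, the sequence <m1, m3> (Alice via home-addr, Bob via email-addr) gets probability 0.5 * 0.1 = 0.05, matching the example above.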
Complexity of Query Answering
Answering queries is more expensive under by-tuple semantics:

                     By-table   By-tuple
Data complexity      PTIME      #P-complete
Mapping complexity   PTIME      PTIME
Summary of Chapter 13
Uncertainty is everywhere in data integration.
Work in this area is only beginning; there is great opportunity for further research.
Probabilistic schema mappings:
By-table versus by-tuple semantics.
By-tuple semantics is computationally expensive, but restricted cases can be found where query answering is still polynomial.
Where do the probabilities come from?
Sometimes we interpret statistics as probabilities.
Sometimes the provenance of the data is more meaningful than the probabilities.