
Managing Uncertain Data

Anish Das Sarma, Stanford University

April 21, 2023

What is Uncertain Data?


(Certain) Data                          | Uncertain Data
Temperature is 74.634589 F              | Sensor reported 75 ±0.5 F
Bob works for Yahoo                     | Bob works for either Yahoo or Microsoft
Mary sighted a Finch                    | Mary sighted either a Finch (80%) or a Sparrow (20%)
It will rain in Stanford tomorrow       | There is a 60% chance of rain in Stanford tomorrow
Yahoo stock will be at 100 in a month   | Yahoo stock will be between 60 and 120 in a month
John’s age is 23                        | John’s age is in [20,30]

Why Does It Arise?


• Precision of devices: Sensor reported 75 ±0.5 F

• Lack of information: Bob works for either Yahoo or Microsoft; Mary sighted either a Finch (80%) or a Sparrow (20%)

• Uncertainty about the future: There is a 60% chance of rain in Stanford tomorrow; Yahoo stock will be between 60 and 120 in a month

• Anonymization: John’s age is in [20,30]


Applications: Information Extraction

Restaurant     | Zip
Hard Rock Cafe | 94111, 94133, 94109


Applications: Information Integration

name,hPhone,oPhone,hAddr,oAddr

name,phone,address

Combined View


Applications: Deduplication

Name: John Doe

J. Doe? 80% match


Applications: Scientific & Medical Experiments

Probably not cancer

How Do Database Management Systems (DBMS) Handle Uncertainty?

They don’t


What Do (Most) Applications Do?

• Clean: turn into data that DBMSs can handle


(1) Loss of information
(2) Errors compound insidiously

Observer | Bird-1                         | Bird-1 after cleaning
Mary     | Finch: 80%, Sparrow: 20%       | Finch
Susan    | Dove: 70%, Sparrow: 30%        | Dove
Jane     | Hummingbird: 65%, Sparrow: 35% | Hummingbird
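To make the loss of information concrete, here is a minimal Python sketch using the three observations from the slide. It keeps only the most likely bird per observer (the cleaning step) and compares that with a question the uncertain data can still answer. The function names and the assumption that the three observers are independent are illustrative, not part of the talk.

```python
# Observations from the slide: each observer reports alternatives with probabilities.
observations = {
    "Mary": {"Finch": 0.8, "Sparrow": 0.2},
    "Susan": {"Dove": 0.7, "Sparrow": 0.3},
    "Jane": {"Hummingbird": 0.65, "Sparrow": 0.35},
}


def clean(obs):
    """Keep only the most likely bird per observer (the 'cleaning' step)."""
    return {who: max(alts, key=alts.get) for who, alts in obs.items()}


def prob_someone_saw(obs, bird):
    """P(at least one observer saw `bird`), ASSUMING the observers are independent."""
    p_none = 1.0
    for alts in obs.values():
        p_none *= 1.0 - alts.get(bird, 0.0)
    return 1.0 - p_none


cleaned = clean(observations)
p_sparrow = prob_someone_saw(observations, "Sparrow")

print(cleaned)              # no observer is recorded as seeing a Sparrow
print(round(p_sparrow, 3))  # yet a Sparrow sighting is more likely than not
```

Under the independence assumption, the cleaned table contains no Sparrow at all, while the uncertain data puts the chance of at least one Sparrow sighting at about 0.64.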

Outline of The Talk

• Part 1: Managing Uncertainty in a DBMS (theory, systems)

• Part 2: Handling Uncertainty in Data Integration (systems, theory)

• Other Research (trailer)

• Future Plans


Part 1: Managing Uncertain Data

• Primarily in the context of the Trio project
  1) Data
  2) Uncertainty
  3) Lineage

• Today’s focus: how lineage helps


Uncertain Data


Sensor reported 75 ±0.5 F

Bob works for either Yahoo or Microsoft

Mary sighted either a Finch (80%) or a Sparrow (20%)


• An uncertain database represents a set of possible instances (or, possible worlds)

• Our work: finite sets of possible instances


Representing Uncertain Data

• 20+ years of work (mostly theoretical)
• There appears to be a fundamental trade-off between expressiveness & intuitiveness
• We spent some time exploring the space of models for uncertainty

Hierarchy of Models [ICDE 06]

R: relations
A: or-sets
?: maybe-tuples
2: 2-clauses
prop: full propositional logic
sets: tuple-sets

Top of the hierarchy: + Expressive, - Complex
Bottom of the hierarchy: + Intuitive, - Inexpressive

Next
1. Consider a model M
2. Isolate inexpressiveness
3. Solve problem with lineage

Running Example: Crime-Solver

• Saw (witness, color, car) // may be uncertain

• Drives (person, color, car) // may be uncertain

• Suspects (person) = πperson(Saw ⋈ Drives)



Simple Model M

1. Alternatives: uncertainty about value
2. ‘?’ (Maybe) Annotations

Saw (witness, color, car)
Amy | red, Honda ∥ red, Toyota ∥ orange, Mazda

Three possible instances


Simple Model M

1. Alternatives
2. ‘?’ (Maybe): uncertainty about presence

Saw (witness, color, car)
Amy   | red, Honda ∥ red, Toyota ∥ orange, Mazda
Betty | blue, Acura   ?

Six possible instances
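The count of six follows from three alternatives for Amy's tuple times two presence choices for Betty's ‘?’ tuple. A minimal sketch that enumerates the possible instances is below. The encoding of a relation as a list of (alternatives, maybe-flag) pairs is an assumption made for illustration, not Trio's storage format.

```python
from itertools import product

# Each uncertain tuple: (list of alternatives, True if annotated with '?').
saw = [
    ([("Amy", "red", "Honda"), ("Amy", "red", "Toyota"), ("Amy", "orange", "Mazda")], False),
    ([("Betty", "blue", "Acura")], True),  # '?': the tuple may be absent
]


def possible_instances(relation):
    """Enumerate every possible instance of a Model M relation."""
    choices = []
    for alternatives, maybe in relation:
        options = list(alternatives)
        if maybe:
            options.append(None)  # None stands for "tuple absent"
        choices.append(options)
    # One instance per way of picking an option for each uncertain tuple.
    return [frozenset(t for t in combo if t is not None) for combo in product(*choices)]


instances = possible_instances(saw)
print(len(instances))  # 3 alternatives x 2 presence choices
```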



Review: Relational Queries

D: Saw (witness, color, car)
Amy, red, Honda
Betty, blue, Acura

Q: πwitness(σcolor=red(Saw))

S: W (witness)
Amy

Queries on Uncertain Data

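[Diagram: an uncertain database D has possible instances I1, I2, …, In. Running Q on each instance yields J1, J2, …, Jm. A direct implementation of Q on D produces D′, which should be a representation of the instances J1, J2, …, Jm (the up-arrow from the Js to D′).]

Closure: the up-arrow always exists

Completeness: all sets of possible instances can be represented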

Model M is Not Closed

Saw (witness, car)
Cathy | Honda ∥ Mazda

Drives (person, car)
Jimmy, Toyota ∥ Jimmy, Mazda
Billy, Honda ∥ Frank, Honda
Hank, Honda

Suspects = πperson(Saw ⋈ Drives)

Suspects
Jimmy           ?
Billy ∥ Frank   ?
Hank            ?

Model M CANNOT correctly capture the possible instances in the result
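The gap can be checked by brute force. The sketch below enumerates the possible instances of Saw and Drives, runs the join and projection in each one, and compares the resulting set of Suspects instances with what a three-tuple Model M table (each tuple marked ‘?’) would represent. The list-of-alternatives encoding and the helper names are assumptions for illustration.

```python
from itertools import product

# Base relations from the slide: each uncertain tuple is a list of alternatives.
saw = [[("Cathy", "Honda"), ("Cathy", "Mazda")]]
drives = [
    [("Jimmy", "Toyota"), ("Jimmy", "Mazda")],
    [("Billy", "Honda"), ("Frank", "Honda")],
    [("Hank", "Honda")],
]


def worlds(relation):
    """All possible instances: pick one alternative per uncertain tuple."""
    return [set(combo) for combo in product(*relation)]


def suspects(saw_inst, drives_inst):
    """pi_person(Saw join Drives) on ordinary (certain) relations."""
    return frozenset(p for (_, c1) in saw_inst for (p, c2) in drives_inst if c1 == c2)


# Correct answer: run the query in every possible instance.
true_result = {suspects(s, d) for s in worlds(saw) for d in worlds(drives)}

# Model M attempt: three independent tuples, each possibly absent (None).
attempt = [["Jimmy", None], ["Billy", "Frank", None], ["Hank", None]]
attempt_result = {
    frozenset(x for x in combo if x is not None) for combo in product(*attempt)
}

print(len(true_result), len(attempt_result))
print(frozenset({"Jimmy", "Hank"}) in attempt_result)  # allowed by the attempt
print(frozenset({"Jimmy", "Hank"}) in true_result)     # but never actually possible
```

The attempt admits instances such as {Jimmy, Hank} that cannot occur: Jimmy is a suspect only if Cathy saw a Mazda, and Hank only if she saw a Honda.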


Lineage to the Rescue

Model M + Lineage = Completeness



Example with Lineage

ID | Saw (witness, car)
11 | Cathy | Honda ∥ Mazda

ID | Drives (person, car)
21 | Jimmy, Toyota ∥ Jimmy, Mazda
22 | Billy, Honda ∥ Frank, Honda
23 | Hank, Honda

ID | Suspects
31 | Jimmy           ?
32 | Billy ∥ Frank   ?
33 | Hank            ?

λ(31) = (11,2) ∧ (21,2)
λ(32,1) = (11,1) ∧ (22,1); λ(32,2) = (11,1) ∧ (22,2)
λ(33) = (11,1) ∧ 23

Correctly captures possible instances in the result
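As a sanity check, the lineage on the slide can be evaluated in every possible instance of the base data and compared with running the query directly. This is a minimal sketch: the dictionary encoding, the 1-based alternative numbers, and the function names are assumptions for illustration, and tuple 23 (which has a single alternative) is written as (23, 1).

```python
from itertools import product

# Base tuples by ID; each lists its alternatives (alternative numbers are 1-based).
base = {
    11: [("Cathy", "Honda"), ("Cathy", "Mazda")],   # Saw
    21: [("Jimmy", "Toyota"), ("Jimmy", "Mazda")],  # Drives
    22: [("Billy", "Honda"), ("Frank", "Honda")],   # Drives
    23: [("Hank", "Honda")],                        # Drives
}
SAW_IDS, DRIVES_IDS = [11], [21, 22, 23]

# Lineage from the slide: result alternative -> conjunction of (base ID, alternative).
lineage = {
    "Jimmy": [(11, 2), (21, 2)],  # lambda(31)
    "Billy": [(11, 1), (22, 1)],  # lambda(32,1)
    "Frank": [(11, 1), (22, 2)],  # lambda(32,2)
    "Hank": [(11, 1), (23, 1)],   # lambda(33)
}


def worlds():
    """Yield every choice of one alternative per base tuple."""
    ids = sorted(base)
    for picks in product(*(range(1, len(base[i]) + 1) for i in ids)):
        yield dict(zip(ids, picks))


def by_query(world):
    """Suspects computed by joining Saw and Drives in this world."""
    saw_cars = {base[i][world[i] - 1][1] for i in SAW_IDS}
    return frozenset(
        base[i][world[i] - 1][0]
        for i in DRIVES_IDS
        if base[i][world[i] - 1][1] in saw_cars
    )


def by_lineage(world):
    """Suspects whose lineage conjunction holds in this world."""
    return frozenset(
        person
        for person, conj in lineage.items()
        if all(world[i] == alt for i, alt in conj)
    )


agree = all(by_query(w) == by_lineage(w) for w in worlds())
instances = {by_lineage(w) for w in worlds()}
print(agree, len(instances))
```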


Trio’s Data Model

1. Alternatives
2. ‘?’ (Maybe) Annotations
3. Confidence values (next)
4. Lineage

Uncertainty-Lineage Databases (ULDBs)

Theorem: ULDBs are closed and complete [VLDB 06]

Formally studied properties like minimization, equivalence, approximation and membership. [VLDB 06, VLDB J. 08]

Confidence Values in Trio

• Confidence values supplied with base data
  – Default probabilistic interpretation

• Problem: Compute confidence values on result data [ICDE 08]

• 5-minute DBClip
  – Search “confidence computation” on YouTube.

Problem Description

ID | Saw (witness,car)
11 | (Amy, Honda) : 0.5
12 | (Betty, Acura) : 0.6

ID | Drives (person,car)
21 | (Jimmy, Honda) : 0.9
22 | (Billy, Honda) : 0.8
23 | (Hank, Acura) : 1.0

Cars = πcar(Saw ⋈ Drives)

ID | Cars
41 | Honda : ?
42 | Acura : ?

Operator-by-Operator


ID | Saw (witness,car)
11 | (Amy, Honda) : 0.5
12 | (Betty, Acura) : 0.6

ID | Drives (person,car)
21 | (Jimmy, Honda) : 0.9
22 | (Billy, Honda) : 0.8
23 | (Hank, Acura) : 1.0

ID | Saw ⋈ Drives
31 | (Amy,Jimmy,Honda) : 0.5*0.9 = 0.45
32 | (Amy,Billy,Honda) : 0.5*0.8 = 0.4
33 | (Betty,Hank,Acura) : 0.6

ID | Cars = πcar(Saw ⋈ Drives)
41 | Honda : 0.45 + 0.4 - (0.45*0.4) = 0.67
42 | Acura : 0.6

Wrong!! Tuples 31 and 32 are not independent: both depend on Saw tuple 11.

Database Query Processing 101


Query Q → space of execution plans → pick and execute the best plan (using statistics, indexes)

Operator-by-Operator Confidence Computation


Query Q → plans that compute confidences operator by operator: this set of plans can be much smaller, or empty

Decouple Data and Confidence Computation


Query Q → plans

1. Compute data
2. Use lineage to compute confidences (on demand)

Theorem: Arbitrary improvement. [ICDE 08]

Our Approach


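ID | Saw (witness,car)
11 | (Amy, Honda) : 0.5
12 | (Betty, Acura) : 0.6

ID | Drives (person,car)
21 | (Jimmy, Honda) : 0.9
22 | (Billy, Honda) : 0.8
23 | (Hank, Acura) : 1.0

ID | Cars
41 | Honda : 0.5 * (0.9 + 0.8 - 0.9*0.8) = 0.49
42 | Acura : 0.6

λ(41) = 11 ∧ (21 ∨ 22)
λ(42) = 12 ∧ 23

Correct!!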
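The value 0.49 can be cross-checked straight from the possible-worlds semantics, without using lineage at all. The sketch below enumerates every combination of present and absent base tuples, weights each world by its probability, and runs πcar(Saw ⋈ Drives) in it. It assumes the base tuples are mutually independent; the dictionary layout and function name are illustrative.

```python
from itertools import product

# Base data from the slides: ID -> (tuple, confidence).
saw = {11: (("Amy", "Honda"), 0.5), 12: (("Betty", "Acura"), 0.6)}
drives = {
    21: (("Jimmy", "Honda"), 0.9),
    22: (("Billy", "Honda"), 0.8),
    23: (("Hank", "Acura"), 1.0),
}


def cars_confidence():
    """Exact confidences of Cars by enumerating possible worlds.

    ASSUMPTION: base tuples are present or absent independently of each other.
    """
    base = {**saw, **drives}
    ids = sorted(base)
    conf = {"Honda": 0.0, "Acura": 0.0}
    for bits in product([True, False], repeat=len(ids)):
        present = dict(zip(ids, bits))
        weight = 1.0
        for i in ids:
            p = base[i][1]
            weight *= p if present[i] else 1.0 - p
        saw_cars = {saw[i][0][1] for i in saw if present[i]}
        driven_cars = {drives[i][0][1] for i in drives if present[i]}
        for car in saw_cars & driven_cars:  # pi_car(Saw join Drives) in this world
            conf[car] += weight
    return conf


exact = cars_confidence()
# Operator-by-operator value, treating the two Honda join results as independent.
naive_honda = 0.45 + 0.4 - 0.45 * 0.4

print(round(exact["Honda"], 2), round(exact["Acura"], 2), round(naive_honda, 2))
```

The enumeration agrees with the lineage-based value for Honda and differs from the operator-by-operator value of 0.67.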


Algorithm


[Figure: lineage tree for a tuple t over tuples t1, t2, t4, t5, t6, t7, with λ(t) = f(t4,t5,t6,t7); confidence values shown: 0.7, 0.9, 1.0, 0.4, 0.823]

1. Expand lineage to base data
2. Get confidence of base data
3. Evaluate the probability of λ(t)

• Detecting independence
• Memoization
• Batch computation
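The three steps can be sketched as follows, reusing the Saw/Drives/Cars data from the earlier slides. This is an illustrative sketch, not Trio's implementation: lineage is encoded as nested ("and"/"or", ...) tuples, base tuples are assumed independent, the probability is evaluated by plain enumeration, and memoization is reduced to caching the result per tuple ID. Independence detection and batch computation are not shown.

```python
from functools import lru_cache
from itertools import product

# Confidences of base tuples (from the slides).
base_conf = {11: 0.5, 12: 0.6, 21: 0.9, 22: 0.8, 23: 1.0}

# Lineage of derived tuples, one operator at a time (hypothetical encoding).
lineage = {
    31: ("and", 11, 21),
    32: ("and", 11, 22),
    33: ("and", 12, 23),
    41: ("or", 31, 32),
    42: ("or", 33),
}


def expand(node):
    """Step 1: rewrite lineage until it mentions only base tuples."""
    if isinstance(node, tuple):
        return (node[0],) + tuple(expand(arg) for arg in node[1:])
    if node in lineage:
        return expand(lineage[node])
    return node


def base_ids(formula):
    """Collect the base tuple IDs that a fully expanded formula refers to."""
    if isinstance(formula, tuple):
        return set().union(*(base_ids(arg) for arg in formula[1:]))
    return {formula}


def holds(formula, present):
    """Evaluate the formula given which base tuples are present."""
    if isinstance(formula, tuple):
        values = [holds(arg, present) for arg in formula[1:]]
        return all(values) if formula[0] == "and" else any(values)
    return present[formula]


@lru_cache(maxsize=None)  # simple memoization: each tuple's confidence is computed once
def confidence(tuple_id):
    formula = expand(tuple_id)
    ids = sorted(base_ids(formula))  # Step 2: only these base confidences are needed
    total = 0.0
    # Step 3: sum the probability of every assignment that satisfies the formula.
    for bits in product([True, False], repeat=len(ids)):
        present = dict(zip(ids, bits))
        weight = 1.0
        for i in ids:
            weight *= base_conf[i] if present[i] else 1.0 - base_conf[i]
        if holds(formula, present):
            total += weight
    return total


print(round(confidence(41), 2), round(confidence(42), 2))
```

Enumeration is exponential in the number of base tuples in the lineage, which is where techniques such as detecting independent sub-formulas would matter in practice.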

Some Other Trio Work


Modifications and Versioning [TR 08]
- Stored derived relations
- Modifications, versions

Indexes and Statistics [MUD 08]
- Specialized indexes, histograms

Functional Dependencies & Schema Design [TR 07]
- Definitions, sound and complete axiomatization of FDs
- Lossless decomposition
- FD testing, finding, and inference

Related Work (sample)

• Modeling Uncertainty: Plenty, covered in textbooks
• Systems: Avatar, BayesStore, MayBMS, MYSTIQ, ORION, PrDB, ProbView, Trio, others?


Part 2: Data Integration

• Reboot!


or, wake up!

Traditional Data Integration: Setup

Sources: D1, D2, D3, D4, D5
  Bib(title, authors, conf, year)
  Author(aid, name), Paper(pid, title, year), AuthoredBy(aid, pid)

1. Mediated Schema
   Publication(title, author, conf, year)

2. Schema Mappings
   Mapping:
   SELECT P.title AS title, A.name AS author, NULL AS conf, P.year AS year
   FROM Author AS A, Paper AS P, AuthoredBy AS B
   WHERE A.aid = B.aid AND P.pid = B.pid

3. Query Answering

Significant up-front effort
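To show what such a schema mapping does, the sketch below runs the mapping query from the slide against a tiny in-memory SQLite database using Python's standard sqlite3 module. The rows ('Author A', 'Paper X', and so on) are made-up placeholders, not data from the talk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Author(aid INTEGER, name TEXT);
    CREATE TABLE Paper(pid INTEGER, title TEXT, year INTEGER);
    CREATE TABLE AuthoredBy(aid INTEGER, pid INTEGER);
    -- Hypothetical sample rows, for illustration only.
    INSERT INTO Author VALUES (1, 'Author A'), (2, 'Author B');
    INSERT INTO Paper VALUES (10, 'Paper X', 1995), (11, 'Paper Y', 1998);
    INSERT INTO AuthoredBy VALUES (1, 10), (2, 10), (1, 11);
    """
)

# The mapping from the source schema to Publication(title, author, conf, year).
MAPPING = """
SELECT P.title AS title, A.name AS author, NULL AS conf, P.year AS year
FROM Author AS A, Paper AS P, AuthoredBy AS B
WHERE A.aid = B.aid AND P.pid = B.pid
"""

# Sort on title and author so the output order is deterministic.
rows = sorted(conn.execute(MAPPING).fetchall(), key=lambda r: (r[0], r[1]))
for row in rows:
    print(row)
```

The conf column is NULL for every row because this source has no conference attribute, which is one reason answers over the mediated schema can be incomplete.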

Who authored the most SIGMOD papers in the 90’s?

Mike Carey

“Pay-As-You-Go” Data Integration

1. Automated best-effort integration from the outset
2. Further improve the system over time with feedback

How advanced a starting point can we provide?


• Automatic integration
  – Make guesses
  – Model probabilities

• Specifically
  – Probabilistic schema mappings
  – Probabilistic mediated-schema

Uncertainty to the Rescue

>90% accuracy in automatically integrating 50-800 data sources for several domains [SIGMOD 08]

Next

1. Probabilistic mediated schemas
2. Probabilistic schema mappings
3. Experimental results


Mediated Schema

S1(name, email, phone-num, address)
S2(person-name, phone, mailing-addr)

Med-S (name, email, phone, addr)

{name, person-name}

{phone-num, phone}

{address,mailing-addr}

{email}

A mediated schema is a clustering of a subset of the set of all attributes appearing in source schemas.


Example

S1(name, hPhone, oPhone, hAddr, oAddr)
S2(name, phone, address)

Med1 ({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr})

Q: SELECT name, hPhone, oPhone FROM Med

Med2 ({name}, {phone, hPhone}, {oPhone}, {address, oAddr}, {hAddr})

Q: SELECT name, phone, address FROM Med

Med3 ({name}, {phone, hPhone}, {oPhone}, {address, hAddr}, {oAddr})

Med4 ({name}, {phone, oPhone}, {hPhone}, {address, oAddr}, {hAddr})

Med5 ({name}, {phone}, {hPhone}, {oPhone}, {address}, {hAddr}, {oAddr})

Probabilistic Mediated Schema



[Figure: two mediated schemas, each with Pr=0.5]

• A Probabilistic Mediated Schema (p-med-schema) is a set M = {(M1,Pr(M1)), …, (Mk,Pr(Mk))} where
  • Mi is a med-schema; i ≠ j ⇒ Mi ≠ Mj
  • Pr(Mi) ∈ (0,1]; Σ Pr(Mi) = 1
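The definition translates directly into a small validity check. In the sketch below a mediated schema is a set of attribute clusters, and a p-med-schema is a list of (schema, probability) pairs. Pairing Med3 and Med4 with probability 0.5 each is an assumption made for the example; the function names are hypothetical.

```python
import math


def make_schema(*clusters):
    """A mediated schema: a set of attribute clusters (order does not matter)."""
    return frozenset(frozenset(c) for c in clusters)


med3 = make_schema({"name"}, {"phone", "hPhone"}, {"oPhone"}, {"address", "hAddr"}, {"oAddr"})
med4 = make_schema({"name"}, {"phone", "oPhone"}, {"hPhone"}, {"address", "oAddr"}, {"hAddr"})


def is_valid_p_med_schema(pms):
    """Check the three conditions of the definition on a list of (schema, Pr) pairs."""
    schemas = [m for m, _ in pms]
    distinct = len(set(schemas)) == len(schemas)                # i != j  =>  Mi != Mj
    in_range = all(0.0 < p <= 1.0 for _, p in pms)              # Pr(Mi) in (0, 1]
    sums_to_one = math.isclose(sum(p for _, p in pms), 1.0)     # sum of Pr(Mi) = 1
    return distinct and in_range and sums_to_one


# ASSUMPTION: the two schemas and their probabilities are chosen for illustration.
print(is_valid_p_med_schema([(med3, 0.5), (med4, 0.5)]))  # satisfies all conditions
print(is_valid_p_med_schema([(med3, 0.5), (med3, 0.5)]))  # duplicate schema
print(is_valid_p_med_schema([(med3, 0.7), (med4, 0.5)]))  # probabilities sum to 1.2
```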

P-Mappings

PM1

Med3 (name, hPP, oP, hAA, oA)
S1(name, hP, oP, hA, oA)
Pr=.64

PM2

Med4 (name, oPP, hP, oAA, hA)
S1(name, hP, oP, hA, oA)
Pr=.04

Expressive Power of P-Med-Schema & P-Mapping

Theorem 1. For one-to-many mappings: (p-med-schema + p-mappings) = (mediated schema + p-mapping) > (p-med-schema + mappings)

Theorem 2. When restricted to one-to-one mappings: (p-med-schema + p-mappings) = (p-med-schema + mappings) > (mediated schema + p-mapping)


Next

• Creating p-med-schemas (briefly)
• Creating p-mappings (briefly)
• Experimental Results


P-med-schema Creation

[Figure: graph over the attributes of sources S1 and S2 (name, address, email-address, pname, home-address), with edge weights 1, .6, .6, .2]

1. Certain/uncertain edges

2. Clustering
[Figure: four candidate clusterings of the attributes]

3. Assign probabilities
[Figure: the four clusterings with Pr=1/6, Pr=1/6, Pr=1/3, Pr=1/3]

P-mapping Creation

S=(num, pname, home-addr, office-addr)

T=(name, mailing-addr)

[Figure: weighted correspondences between attributes of S and T, with weights 0.8, 0.9, 0.9, 0.2]

Goal: find a p-mapping that is consistent with a set of weighted correspondences

Theorem: A p-mapping consistent with a set of weighted correspondences exists if and only if, for every source or target attribute a, the sum of the weights of all correspondences that involve a is at most 1.
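The condition in the theorem is easy to test mechanically. The sketch below sums the weights per source attribute and per target attribute and checks that none exceeds 1. The attribute names come from S and T on the slide, but which weight belongs to which correspondence is an assumption made for the example, as is the function name.

```python
from collections import defaultdict


def admits_consistent_p_mapping(correspondences, tol=1e-9):
    """Check the theorem's condition on (source_attr, target_attr, weight) triples."""
    totals = defaultdict(float)
    for src, tgt, weight in correspondences:
        totals[("S", src)] += weight  # weight mass on the source attribute
        totals[("T", tgt)] += weight  # weight mass on the target attribute
    return all(total <= 1.0 + tol for total in totals.values())


# HYPOTHETICAL weight assignments, for illustration only.
within_limit = [
    ("pname", "name", 0.8),
    ("num", "name", 0.2),
    ("home-addr", "mailing-addr", 0.5),
    ("office-addr", "mailing-addr", 0.5),
]
over_limit = [
    ("pname", "name", 0.8),
    ("num", "name", 0.2),
    ("home-addr", "mailing-addr", 0.9),
    ("office-addr", "mailing-addr", 0.9),  # mailing-addr now totals 1.8
]

print(admits_consistent_p_mapping(within_limit))
print(admits_consistent_p_mapping(over_limit))
```

When the condition fails, the weights would have to be adjusted (for example, normalized) before a consistent p-mapping can exist; how the talk's system does this is not stated on the slide.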

Experiments

Data: tables extracted from HTML tables on the web

Domain | #Sources | Search Keywords
Movie  | 161      | movie, year
Car    | 817      | make, model
People | 49       | job/title, organization/company/employer
Course | 647      | course/class, instructor/teacher/lecturer, subject/department/title
Bib    | 649      | author, title, year, journal/conference

• Gold standard: manual
• Approximate standard: semi-automatic
• Precision, recall, F-measure for several SQL queries varying attributes, selectivities

Experiments

Quality of Query Answering

Domain | Precision | Recall | F-measure

Golden Standard
People | 1    | .849 | .918
Course | 1    | .852 | .92

Approximate Golden Standard
Movie  | .95  | 1    | .924
Car    | 1    | .917 | .957
People | .958 | .984 | .971
Course | 1    | 1    | 1
Bib    | 1    | .955 | .977
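For reference, the standard F-measure is the harmonic mean of precision and recall, F = 2PR / (P + R). The short sketch below applies it to a few rows of the table. It assumes the slide uses this standard definition; reported values that are averaged over several queries need not match the formula applied to averaged precision and recall.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs taken from the table.
people_gold = f_measure(1, 0.849)
car_approx = f_measure(1, 0.917)
bib_approx = f_measure(1, 0.955)

print(round(people_gold, 3), round(car_approx, 3), round(bib_approx, 3))
```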

Comparison with Other Approaches

Keyword search obtained low precision and low recall.

Querying the sources directly or considering only the highest probability mapping obtained low recall.

We obtained highest F-measure in all domains.


Comparison with Other Mediated-Schema Generation Methods

Using p-med-schema obtained highest F-measure in all domains.


System Setup Time (one domain)


Brief Related Work

• Approximate schema mappings [Magnani et al. 2007], [Gal 2007], [Dong et al. 2007]

• Automatic generation of mediated schemas [He et al. 2003]

• More (see paper)


Finally…

• Other Research
  – Data Integration (2)
  – Deduplication (2)
  – Quality Estimation of Sensor/RFID Streams [IQIS 06]

• Future Plans


Data Integration


Problem: Foundations for integration of uncertain data
Solution [TR 08]:
- Define open- and closed-containment for uncertain data
- Algorithms, complexity of consistency checking and finding maximally-correct query answers

Problem: Dependencies in web-data integration (e.g., deep-web, plagiarism)
Solution [TR 08]: Algorithms, complexity of fundamental problems: coverage estimation, cost minimization and coverage maximization, and source ordering

Deduplication


[SIGMOD 07]
- Leveraging real-world constraints for deduplication
- Tractable optimal solution and experiments over DBLP and ACM publication data

[WWW 07]
- Detecting near-duplicate web-pages for crawling
- Efficient indexing scheme supporting crawling speeds over web-scale data

Future Work


Short & Medium-Term
1. View management over uncertain databases: materialized view updates, versioning, partial materialization, …
2. More applications of uncertain data
3. More on lineage: internal/external lineage, approximate lineage, uncertain lineage, …

Future Work


Long-term
1. Applying uncertainty to other data management problems: query optimization? cloud computing?
2. Improve quality of data through conflict resolution and feedback
3. Web-data management: Handling huge amounts of data that is conflicting, uncertain, redundant, dependent, …

Thanks!


Anish Das Sarma

http://i.stanford.edu/~anishds (or search “Anish Das Sarma”)

