Managing Uncertain Data
Anish Das Sarma
Stanford University
April 21, 2023
What is Uncertain Data?
(Certain) Data | Uncertain Data
Temperature is 74.634589 F | Sensor reported 75 ±0.5 F
Bob works for Yahoo | Bob works for either Yahoo or Microsoft
Mary sighted a Finch | Mary sighted either a Finch (80%) or a Sparrow (20%)
It will rain in Stanford tomorrow | There is a 60% chance of rain in Stanford tomorrow
Yahoo stock will be at 100 in a month | Yahoo stock will be between 60 and 120 in a month
John’s age is 23 | John’s age is in [20,30]
Why Does It Arise?
Precision of devices
Lack of information
Uncertainty about the future
Anonymization
Applications: Information Extraction
Restaurant | Zip
Hard Rock Cafe | 94111 or 94133 or 94109
Applications: Information Integration
name,hPhone,oPhone,hAddr,oAddr
name,phone,address
Combined View
Applications: Deduplication
Name
John Doe
J. Doe? 80% match
Applications: Scientific & Medical Experiments
Probably not cancer
How Do Database Management Systems (DBMS) Handle Uncertainty?
They don’t
What Do (Most) Applications Do?
• Clean: turn into data that DBMSs can handle
(1) Loss of information
(2) Errors compound insidiously
Observer | Bird-1 (uncertain) | Bird-1 (cleaned)
Mary | Finch: 80%, Sparrow: 20% | Finch
Susan | Dove: 70%, Sparrow: 30% | Dove
Jane | Hummingbird: 65%, Sparrow: 35% | Hummingbird
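The two problems with cleaning can be made concrete with a small sketch. It uses the sighting probabilities from the table and assumes, purely for illustration, that the three observers are independent; the names `sightings`, `clean`, `p_all_correct` and `p_same_bird` are hypothetical.

```python
from itertools import product

# Sighting distributions copied from the slide. Treating the observers as
# independent is an assumption made only for this sketch.
sightings = {
    "Mary": {"Finch": 0.80, "Sparrow": 0.20},
    "Susan": {"Dove": 0.70, "Sparrow": 0.30},
    "Jane": {"Hummingbird": 0.65, "Sparrow": 0.35},
}

def clean(dist):
    """Keep only the most likely value, as a 'clean first' application would."""
    return max(dist, key=dist.get)

cleaned = {observer: clean(dist) for observer, dist in sightings.items()}

# Errors compound: probability that every cleaned value is actually right.
p_all_correct = 1.0
for observer, dist in sightings.items():
    p_all_correct *= dist[cleaned[observer]]

# Loss of information: probability that all three observers saw the same
# species, computed on the uncertain data.
p_same_bird = 0.0
for combo in product(*(dist.items() for dist in sightings.values())):
    if len({species for species, _ in combo}) == 1:
        p = 1.0
        for _, prob in combo:
            p *= prob
        p_same_bird += p

print(cleaned)
print(round(p_all_correct, 3), round(p_same_bird, 3))
```

Under these assumptions the cleaned table is entirely correct with probability of only about 0.364, and it rules out the "all three saw a Sparrow" scenario that the uncertain data still allows with probability of about 0.021.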
Outline of The Talk
• Part 1: Managing Uncertainty in a DBMS (theory, systems)
• Part 2: Handling Uncertainty in Data Integration (systems, theory)
• Other Research (trailer)
• Future Plans
Part 1: Managing Uncertain Data
• Primarily in the context of the Trio project
1) Data
2) Uncertainty
3) Lineage
• Today’s focus: how lineage helps
Uncertain Data
Sensor reported 75 ±0.5 F
Bob works for either Yahoo or Microsoft
Mary sighted either a Finch (80%) or a Sparrow (20%)
There is a 60% chance of rain in Stanford tomorrow
• An uncertain database represents a set of possible instances (or, possible worlds)
• Our work: finite sets of possible instances
Representing Uncertain Data
• 20+ years of work (mostly theoretical)
• Appears to be a fundamental trade-off between expressiveness & intuitiveness
• We spent some time exploring the space of models for uncertainty
Hierarchy of Models [ICDE 06]
R: relations
A: or-sets
?: maybe-tuples
2: 2-clauses
prop: full propositional logic
sets: tuple-sets
Trade-off: + Expressive, - Complex versus + Intuitive, - Inexpressive
Next
1. Consider a model M
2. Isolate inexpressiveness
3. Solve problem with lineage
Running Example: Crime-Solver
• Saw (witness, color, car) // may be uncertain
• Drives (person, color, car) // may be uncertain
• Suspects (person) = πperson(Saw ⋈ Drives)
Simple Model M
1. Alternatives: uncertainty about value
2. ‘?’ (Maybe) Annotations
Saw (witness, color, car)
Amy | red, Honda ∥ red, Toyota ∥ orange, Mazda
Three possible instances
Simple Model M
1. Alternatives
2. ‘?’ (Maybe): uncertainty about presence
Saw (witness, color, car)
Amy | red, Honda ∥ red, Toyota ∥ orange, Mazda
Betty | blue, Acura ?
Six possible instances
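The count of six possible instances can be checked by enumeration. The following is a minimal sketch, not Trio's implementation: the encoding of a tuple as `(witness, alternatives, maybe)` and the function name `possible_instances` are assumptions made for illustration.

```python
from itertools import product

# Each uncertain tuple is (witness, list of alternatives, has '?' annotation).
# The data is from the slide; the encoding is hypothetical.
saw = [
    ("Amy", [("red", "Honda"), ("red", "Toyota"), ("orange", "Mazda")], False),
    ("Betty", [("blue", "Acura")], True),
]

def possible_instances(relation):
    """Enumerate every possible instance (possible world) of the relation."""
    per_tuple_choices = []
    for witness, alternatives, maybe in relation:
        choices = [(witness, *alt) for alt in alternatives]
        if maybe:
            choices.append(None)  # None stands for "tuple is absent"
        per_tuple_choices.append(choices)
    return [
        frozenset(t for t in combo if t is not None)
        for combo in product(*per_tuple_choices)
    ]

instances = possible_instances(saw)
for instance in instances:
    print(sorted(instance))
print(len(instances))
```

Amy's three alternatives combine with Betty's two states (present or absent), so the enumeration yields 3 × 2 = 6 instances; dropping Betty's tuple leaves the three instances of the previous slide.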
Review: Relational Queries
A query Q maps a database D to a result S.
D: Saw (witness, color, car)
Amy, red, Honda
Betty, blue, Acura
Q: πwitness(σcolor=red(Saw))
S: W (witness)
Amy
Queries on Uncertain Data
[Figure: D represents the possible instances I1, I2, …, In. Applying Q on each instance gives J1, J2, …, Jm. A direct implementation of Q on D gives D′, a representation of the instances J1, J2, …, Jm.]
Closure: the up-arrow (from J1, J2, …, Jm to a representation D′) always exists
Completeness: all sets of possible instances can be represented
Model M is Not Closed
Saw (witness, car)
Cathy | Honda ∥ Mazda
Drives (person, car)
Jimmy, Toyota ∥ Jimmy, Mazda
Billy, Honda ∥ Frank, Honda
Hank, Honda
Suspects = πperson(Saw ⋈ Drives)
Suspects
Jimmy ?
Billy ∥ Frank ?
Hank ?
Model M CANNOT correctly capture the possible instances in the result.
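Why the Model M representation of Suspects fails can be checked by brute force. The sketch below enumerates the possible instances of Saw and Drives, evaluates the query on each, and compares the outcome with what three independent maybe-tuples would represent. The encoding and the names `true_results` and `m_results` are assumptions made for illustration.

```python
from itertools import product

# Each inner list holds the alternatives of one uncertain tuple (from the slide).
saw = [[("Cathy", "Honda"), ("Cathy", "Mazda")]]
drives = [
    [("Jimmy", "Toyota"), ("Jimmy", "Mazda")],
    [("Billy", "Honda"), ("Frank", "Honda")],
    [("Hank", "Honda")],
]

def suspects(saw_instance, drives_instance):
    """Project person from the join of Saw and Drives on car."""
    return frozenset(
        person
        for _, seen_car in saw_instance
        for person, driven_car in drives_instance
        if seen_car == driven_car
    )

# Correct semantics: run the query on every possible instance.
true_results = {
    suspects(s, d) for s in product(*saw) for d in product(*drives)
}

# The attempted Model M result: three maybe-tuples with independent choices.
m_repr = [["Jimmy", None], ["Billy", "Frank", None], ["Hank", None]]
m_results = {
    frozenset(p for p in combo if p is not None) for combo in product(*m_repr)
}

print(sorted(sorted(r) for r in true_results))
print(sorted(sorted(r) for r in m_results - true_results))
```

The query has four possible result instances, while the Model M table represents twelve, including impossible ones such as {Jimmy, Hank}: Jimmy is a suspect only if Cathy saw a Mazda, and Hank only if she saw a Honda.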
Lineage to the Rescue
Model M + Lineage = Completeness
Example with Lineage
ID | Saw (witness, car)
11 | Cathy | Honda ∥ Mazda
ID | Drives (person, car)
21 | Jimmy, Toyota ∥ Jimmy, Mazda
22 | Billy, Honda ∥ Frank, Honda
23 | Hank, Honda
Suspects = πperson(Saw ⋈ Drives)
ID | Suspects
31 | Jimmy ?
32 | Billy ∥ Frank ?
33 | Hank ?
λ(31) = (11,2) ∧ (21,2)
λ(32,1) = (11,1) ∧ (22,1); λ(32,2) = (11,1) ∧ (22,2)
λ(33) = (11,1) ∧ 23
Correctly captures possible instances in the result
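How lineage restores the correct possible instances can be sketched as follows. The lineage formulas are the ones on the slide; representing each as a set of (base tuple ID, alternative index) pairs, and writing tuple 23 as (23, 1), are encoding choices made for this illustration.

```python
from itertools import product

# Number of alternatives of each base tuple (11 from Saw; 21, 22, 23 from Drives).
base_alternatives = {11: 2, 21: 2, 22: 2, 23: 1}

# Lineage from the slide: result alternative -> conjunction of base alternatives.
lineage = {
    (31, "Jimmy"): {(11, 2), (21, 2)},
    (32, "Billy"): {(11, 1), (22, 1)},
    (32, "Frank"): {(11, 1), (22, 2)},
    (33, "Hank"): {(11, 1), (23, 1)},
}

ids = sorted(base_alternatives)
results = set()
for choice in product(*(range(1, base_alternatives[i] + 1) for i in ids)):
    world = set(zip(ids, choice))  # one chosen alternative per base tuple
    # A result alternative is present exactly when its lineage holds in the world.
    results.add(
        frozenset(person for (_, person), lam in lineage.items() if lam <= world)
    )

print(sorted(sorted(r) for r in results))
```

With lineage attached, the representation yields exactly the four instances {}, {Jimmy}, {Billy, Hank} and {Frank, Hank}, and no longer admits combinations such as {Jimmy, Hank}.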
Trio’s Data Model
1. Alternatives
2. ‘?’ (Maybe) Annotations
3. Confidence values (next)
4. Lineage
Uncertainty-Lineage Databases (ULDBs)
Theorem: ULDBs are closed and complete [VLDB 06]
Formally studied properties like minimization, equivalence, approximation and membership. [VLDB 06, VLDB J. 08]
Confidence Values in Trio
• Confidence values supplied with base data
– Default probabilistic interpretation
• Problem: Compute confidence values on result data [ICDE 08]
• 5-minute DBClip
– Search “confidence computation” on YouTube.
Problem Description
ID | Saw (witness, car)
11 | (Amy, Honda) : 0.5
12 | (Betty, Acura) : 0.6
ID | Drives (person, car)
21 | (Jimmy, Honda) : 0.9
22 | (Billy, Honda) : 0.8
23 | (Hank, Acura) : 1.0
Cars = πcar(Saw ⋈ Drives)
ID | Cars
41 | Honda : ?
42 | Acura : ?
Operator-by-Operator
ID | Saw (witness, car)
11 | (Amy, Honda) : 0.5
12 | (Betty, Acura) : 0.6
ID | Drives (person, car)
21 | (Jimmy, Honda) : 0.9
22 | (Billy, Honda) : 0.8
23 | (Hank, Acura) : 1.0
Saw ⋈ Drives
31 | (Amy, Jimmy, Honda) : 0.5 * 0.9 = 0.45
32 | (Amy, Billy, Honda) : 0.5 * 0.8 = 0.4
33 | (Betty, Hank, Acura) : 0.6 * 1.0 = 0.6
πcar
ID | Cars
41 | Honda : 0.45 + 0.4 - (0.45 * 0.4) = 0.67
42 | Acura : 0.6
Wrong!! Tuples 31 and 32 are not independent: both derive from Saw tuple 11.
Database Query Processing 101
[Figure: a query Q has many execution plans; the DBMS picks and executes the best plan, using statistics and indexes.]
Operator-by-Operator Confidence Computation
[Figure: for a query Q, the set of plans usable when confidences are computed operator by operator can be much smaller, or empty.]
Decouple Data and Confidence Computation
[Figure: query Q and its plans]
1. Compute data
2. Use lineage to compute confidences (on demand)
Theorem: Arbitrary improvement. [ICDE 08]
Our Approach
ID | Saw (witness, car)
11 | (Amy, Honda) : 0.5
12 | (Betty, Acura) : 0.6
ID | Drives (person, car)
21 | (Jimmy, Honda) : 0.9
22 | (Billy, Honda) : 0.8
23 | (Hank, Acura) : 1.0
ID | Cars
41 | Honda : 0.5 * (0.9 + 0.8 - 0.9 * 0.8) = 0.49
42 | Acura : 0.6
λ(41) = 11 ∧ (21 ∨ 22)
λ(42) = 12 ∧ 23
Correct!!
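The lineage-based confidences can be reproduced by summing the probability of every combination of base tuples in which the lineage formula holds. This is a brute-force sketch, not the algorithm from the talk; it assumes the base tuples are independent, and the names `conf`, `lineage` and `confidence` are hypothetical.

```python
from itertools import product

# Confidences of the base tuples, from the slide. Independence of base
# tuples is an assumption of this sketch.
conf = {11: 0.5, 12: 0.6, 21: 0.9, 22: 0.8, 23: 1.0}

# Lineage formulas from the slide, written as predicates over a world
# (a dict mapping base tuple ID -> present or not).
lineage = {
    41: lambda w: w[11] and (w[21] or w[22]),  # Honda
    42: lambda w: w[12] and w[23],             # Acura
}

def confidence(formula):
    """Total probability of the worlds in which the lineage formula is true."""
    ids = sorted(conf)
    total = 0.0
    for bits in product([True, False], repeat=len(ids)):
        world = dict(zip(ids, bits))
        if formula(world):
            p = 1.0
            for i in ids:
                p *= conf[i] if world[i] else 1.0 - conf[i]
            total += p
    return total

honda = confidence(lineage[41])
acura = confidence(lineage[42])
naive_honda = 0.45 + 0.4 - 0.45 * 0.4  # operator-by-operator result

print(round(honda, 2), round(acura, 2), round(naive_honda, 2))
```

The enumeration gives about 0.49 for Honda, compared with 0.67 from the operator-by-operator plan, which counts Saw tuple 11 twice. Enumeration is exponential in the number of base tuples, so it only serves to check small examples.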
Algorithm
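1. Expand lineage to base data
2. Get confidence of base data
3. Evaluate the probability of λ(t)
• Detecting independence
• Memoization
• Batch computation
[Figure: a result tuple t in R is derived through intermediate tuples t1, t2 and t4 down to the tuples t5, t6, t7, with λ(t) = f(t4, t5, t6, t7). The figure annotates tuples with confidences 0.7, 0.9, 1.0 and 0.4 and shows a computed confidence of 0.823.]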
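The three steps can be sketched on the Cars example from the earlier slides, where result tuple 41 derives from the join tuples 31 and 32. This is only an illustration under stated assumptions: the `immediate` lineage table, the nested-tuple formula encoding and the brute-force evaluation are hypothetical, base tuples are assumed independent, and of the listed techniques only memoization appears (via `functools.lru_cache`).

```python
from functools import lru_cache
from itertools import product

# Immediate lineage: derived tuple ID -> (operator, IDs it was derived from).
# Base tuples have no entry. IDs follow the Saw/Drives/Cars example.
immediate = {
    41: ("or", (31, 32)),
    31: ("and", (11, 21)),
    32: ("and", (11, 22)),
}
base_conf = {11: 0.5, 21: 0.9, 22: 0.8}

@lru_cache(maxsize=None)
def expand(tid):
    """Step 1: rewrite lineage as a formula over base tuples (memoized)."""
    if tid not in immediate:
        return tid  # base tuple
    op, children = immediate[tid]
    return (op, tuple(expand(c) for c in children))

def base_ids(formula):
    """Collect the base tuple IDs mentioned in an expanded formula."""
    if isinstance(formula, int):
        return {formula}
    return set().union(*(base_ids(f) for f in formula[1]))

def holds(formula, world):
    """Evaluate an expanded formula in one world (ID -> present or not)."""
    if isinstance(formula, int):
        return world[formula]
    op, parts = formula
    values = [holds(p, world) for p in parts]
    return all(values) if op == "and" else any(values)

def confidence(tid):
    formula = expand(tid)
    ids = sorted(base_ids(formula))  # Step 2: only the base data involved
    total = 0.0
    for bits in product([True, False], repeat=len(ids)):  # Step 3
        world = dict(zip(ids, bits))
        if holds(formula, world):
            p = 1.0
            for i in ids:
                p *= base_conf[i] if world[i] else 1.0 - base_conf[i]
            total += p
    return total

print(expand(41))
print(round(confidence(41), 2))
```

Expanding tuple 41 gives (11 ∧ 21) ∨ (11 ∧ 22), which is equivalent to the slide's λ(41) = 11 ∧ (21 ∨ 22), and evaluates to about 0.49.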
Some Other Trio Work
Modifications and Versioning [TR 08]
- Stored derived relations
- Modifications, versions
Indexes and Statistics [MUD 08]
- Specialized indexes, histograms
Functional Dependencies & Schema Design [TR 07]
- Definitions, sound and complete axiomatization of FDs
- Lossless decomposition
- FD testing, finding, and inference
Related Work (sample)
• Modeling Uncertainty: Plenty, covered in textbooks
• Systems: Avatar, BayesStore, MayBMS, MYSTIQ, ORION, PrDB, ProbView, Trio, others?
Part 2: Data Integration
• Reboot! (or, wake up!)
Traditional Data Integration: Setup
Sources: D1, D2, D3, D4, D5
Source schemas:
Bib(title, authors, conf, year)
Author(aid, name), Paper(pid, title, year), AuthoredBy(aid, pid)
1. Mediated Schema
Publication(title, author, conf, year)
2. Schema Mappings
Mapping:
SELECT P.title AS title, A.name AS author, NULL AS conf, P.year AS year
FROM Author AS A, Paper AS P, AuthoredBy AS B
WHERE A.aid=B.aid AND P.pid=B.pid
3. Query Answering
Who authored the most SIGMOD papers in the 90’s? Mike Carey
Significant up-front effort
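To see what the schema mapping produces, it can be run against a toy source. The sketch below uses Python's built-in `sqlite3` module; the column types and the sample rows are made up for illustration, and only the mapping query itself comes from the slide.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Source schema from the slide; column types and rows are hypothetical.
con.executescript("""
CREATE TABLE Author(aid INTEGER, name TEXT);
CREATE TABLE Paper(pid INTEGER, title TEXT, year INTEGER);
CREATE TABLE AuthoredBy(aid INTEGER, pid INTEGER);
INSERT INTO Author VALUES (1, 'A. Author'), (2, 'B. Writer');
INSERT INTO Paper VALUES (10, 'Paper X', 1995), (11, 'Paper Y', 1998);
INSERT INTO AuthoredBy VALUES (1, 10), (2, 10), (1, 11);
""")

# The mapping from the source to Publication(title, author, conf, year).
mapping = """
SELECT P.title AS title, A.name AS author, NULL AS conf, P.year AS year
FROM Author AS A, Paper AS P, AuthoredBy AS B
WHERE A.aid = B.aid AND P.pid = B.pid
"""
publication = con.execute(mapping).fetchall()
for row in publication:
    print(row)
```

Each author-paper pair becomes one Publication row, and `conf` is NULL because this source does not record the conference.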
“Pay-As-You-Go” Data Integration
1. Automated best-effort integration from the outset
2. Further improve the system over time with feedback
How advanced a starting point can we provide?
Uncertainty to the Rescue
• Automatic integration
– Make guesses
– Model probabilities
• Specifically
– Probabilistic schema mappings
– Probabilistic mediated-schema
>90% accuracy in automatically integrating 50-800 data sources for several domains [SIGMOD 08]
Next
1. Probabilistic mediated schemas
2. Probabilistic schema mappings
3. Experimental results
Mediated Schema
S1(name, email, phone-num, address)
S2(person-name, phone, mailing-addr)
Med-S (name, email, phone, addr)
name: {name, person-name}
phone: {phone-num, phone}
addr: {address, mailing-addr}
email: {email}
A mediated schema is a clustering of a subset of the set of all attributes appearing in source schemas.
Example
S1(name, hPhone, oPhone, hAddr, oAddr)
S2(name, phone, address)
Candidate mediated schemas:
Med1 ({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr})
Med2 ({name}, {phone, hPhone}, {oPhone}, {address, oAddr}, {hAddr})
Med3 ({name}, {phone, hPhone}, {oPhone}, {address, hAddr}, {oAddr})
Med4 ({name}, {phone, oPhone}, {hPhone}, {address, oAddr}, {hAddr})
Med5 ({name}, {phone}, {hPhone}, {oPhone}, {address}, {hAddr}, {oAddr})
Q: SELECT name, hPhone, oPhone FROM Med
Q: SELECT name, phone, address FROM Med
Probabilistic Mediated Schema
S1(name, hPhone, oPhone, hAddr, oAddr)
S2(name, phone, address)
Med3 ({name}, {phone, hPhone}, {oPhone}, {address, hAddr}, {oAddr}) Pr=0.5
Med4 ({name}, {phone, oPhone}, {hPhone}, {address, oAddr}, {hAddr}) Pr=0.5
• Probabilistic Mediated Schema (p-med-schema) is a set M = {(M1, Pr(M1)), …, (Mk, Pr(Mk))} where
• Mi is a med-schema; i ≠ j ⇒ Mi ≠ Mj
• Pr(Mi) ∈ (0,1]; Σ Pr(Mi) = 1
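The definition can be read as a set of checks. The sketch below encodes a med-schema as a set of attribute clusters and tests the conditions on the Med3/Med4 example; the helper names `as_schema` and `is_valid_p_med_schema` are hypothetical, and exact fractions are used so that the sum-to-one check is not affected by rounding.

```python
from fractions import Fraction

def as_schema(clusters):
    """Encode a med-schema as a hashable set of attribute clusters."""
    return frozenset(frozenset(c) for c in clusters)

med3 = as_schema([{"name"}, {"phone", "hPhone"}, {"oPhone"},
                  {"address", "hAddr"}, {"oAddr"}])
med4 = as_schema([{"name"}, {"phone", "oPhone"}, {"hPhone"},
                  {"address", "oAddr"}, {"hAddr"}])

p_med_schema = [(med3, Fraction(1, 2)), (med4, Fraction(1, 2))]

def is_valid_p_med_schema(pairs):
    schemas = [m for m, _ in pairs]
    probs = [p for _, p in pairs]
    # i != j implies Mi != Mj
    distinct = len(set(schemas)) == len(schemas)
    # Pr(Mi) in (0, 1]
    in_range = all(0 < p <= 1 for p in probs)
    # Each med-schema is a clustering: no attribute is in two clusters.
    disjoint = all(
        sum(len(c) for c in m) == len(frozenset().union(*m)) for m in schemas
    )
    # The probabilities sum to 1.
    return distinct and in_range and disjoint and sum(probs) == 1

print(is_valid_p_med_schema(p_med_schema))
```

Listing Med3 twice, or giving probabilities that do not sum to 1, makes the check fail.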
P-Mappings
PM1: four possible mappings between S1(name, hP, oP, hA, oA) and Med3 (name, hPP, oP, hAA, oA), with Pr=.64, Pr=.16, Pr=.16, Pr=.04
PM2: four possible mappings between S1(name, hP, oP, hA, oA) and Med4 (name, oPP, hP, oAA, hA), with Pr=.64, Pr=.16, Pr=.16, Pr=.04
[Figure: each mapping is drawn as a set of attribute correspondences between S1 and the mediated schema.]
Expressive Power of P-Med-Schema & P-Mapping
Theorem 1. For one-to-many mappings: (p-med-schema + p-mappings) = (mediated schema + p-mapping) > (p-med-schema + mappings)
Theorem 2. When restricted to one-to-one mappings: (p-med-schema + p-mappings) = (p-med-schema + mappings) > (mediated schema + p-mapping)
Next
• Creating p-med-schemas (briefly)
• Creating p-mappings (briefly)
• Experimental Results
P-med-schema Creation
S1: name, address, email-address
S2: pname, home-address
[Figure: graph over the source attributes with weighted edges 1, .6, .6, .2]
1. Certain/uncertain edges
2. Clustering [Figure: four candidate clusterings of the attributes]
3. Assign probabilities: Pr=1/6, Pr=1/6, Pr=1/3, Pr=1/3
P-mapping Creation
S = (num, pname, home-addr, office-addr)
T = (name, mailing-addr)
[Figure: weighted correspondences between S and T, with weights 0.8, 0.9, 0.9, 0.2]
Goal: find a p-mapping that is consistent with a set of weighted correspondences
Theorem: A p-mapping consistent with a set of weighted correspondences exists if and only if, for every source or target attribute a, the sum of the weights of all correspondences that involve a is at most 1.
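The condition in the theorem is easy to test for a given set of weighted correspondences. In the sketch below, the function name `consistent_p_mapping_exists` is hypothetical, and so are the attribute pairings and weights in `ok` and `bad`: the transcript does not say which correspondence carries which weight, so these are illustrative inputs only.

```python
from collections import defaultdict

def consistent_p_mapping_exists(correspondences):
    """Check the theorem's condition: for every source or target attribute,
    the weights of the correspondences involving it sum to at most 1."""
    totals = defaultdict(float)
    for source_attr, target_attr, weight in correspondences:
        totals[("S", source_attr)] += weight
        totals[("T", target_attr)] += weight
    # A small tolerance guards against floating-point rounding.
    return all(total <= 1.0 + 1e-9 for total in totals.values())

# Hypothetical weighted correspondences (source attr, target attr, weight).
ok = [
    ("pname", "name", 0.8),
    ("home-addr", "mailing-addr", 0.6),
    ("office-addr", "mailing-addr", 0.4),
]
bad = [
    ("pname", "name", 0.8),
    ("home-addr", "mailing-addr", 0.9),
    ("office-addr", "mailing-addr", 0.9),
]

print(consistent_p_mapping_exists(ok))
print(consistent_p_mapping_exists(bad))
```

In `bad`, the two correspondences into `mailing-addr` sum to 1.8, so by the theorem no consistent p-mapping exists for those weights as given.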
Experiments
Data: tables extracted from HTML tables on the web
Domain | #Sources | Search Keywords
Movie | 161 | movie, year
Car | 817 | make, model
People | 49 | job/title, organization/company/employer
Course | 647 | course/class, instructor/teacher/lecturer, subject/department/title
Bib | 649 | author, title, year, journal/conference
• Gold standard: manual
• Approximate standard: semi-automatic
• Precision, recall, F-measure for several SQL queries varying attributes, selectivities
Quality of Query Answering
Domain | Precision | Recall | F-measure
Golden Standard
People | 1 | .849 | .918
Course | 1 | .852 | .92
Approximate Golden Standard
Movie | .95 | 1 | .924
Car | 1 | .917 | .957
People | .958 | .984 | .971
Course | 1 | 1 | 1
Bib | 1 | .955 | .977
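For reference, F-measure is commonly defined as the harmonic mean of precision and recall. The slides do not state how the reported values were aggregated over queries, so the sketch below simply assumes that definition and recomputes it from the precision and recall columns; the names `f_measure`, `reported` and `mismatches` are hypothetical.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (assumed definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (domain, precision, recall, reported F-measure), copied from the table.
reported = [
    ("People (golden)", 1.0, 0.849, 0.918),
    ("Course (golden)", 1.0, 0.852, 0.92),
    ("Movie (approximate)", 0.95, 1.0, 0.924),
    ("Car (approximate)", 1.0, 0.917, 0.957),
    ("People (approximate)", 0.958, 0.984, 0.971),
    ("Course (approximate)", 1.0, 1.0, 1.0),
    ("Bib (approximate)", 1.0, 0.955, 0.977),
]

# Rows whose recomputed value differs from the reported one by more than 0.001.
mismatches = [
    name for name, p, r, f in reported if abs(f_measure(p, r) - f) > 0.001
]

for name, p, r, f in reported:
    print(f"{name}: recomputed {f_measure(p, r):.3f}, reported {f}")
print(mismatches)
```

Under this assumption every row agrees to three decimals except Movie, where the recomputed value is about .974 against the reported .924. Since the reported numbers may be averages over several queries, that difference is not necessarily an error in the table.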
Comparison with Other Approaches
Keyword search obtained low precision and low recall.
Querying the sources directly or considering only the highest probability mapping obtained low recall.
We obtained highest F-measure in all domains.
Comparison with Other Mediated-Schema Generation Methods
Using p-med-schema obtained highest F-measure in all domains.
System Setup Time (one domain)
Brief Related Work
• Approximate schema mappings [Magnani et al. 2007], [Gal 2007], [Dong et al. 2007]
• Automatic generation of mediated schemas [He et al. 2003]
• More (see paper)
Finally…
• Other Research
– Data Integration (2)
– Deduplication (2)
– Quality Estimation of Sensor/RFID Streams [IQIS 06]
• Future Plans
Data Integration
Problem: Foundations for integration of uncertain data
Solution [TR 08]:
- Define open- and closed-containment for uncertain data
- Algorithms, complexity of consistency checking and finding maximally-correct query answers
Problem: Dependencies in web-data integration (e.g., deep-web, plagiarism)
Solution [TR 08]: Algorithms, complexity of fundamental problems: coverage estimation, cost minimization and coverage maximization, and source ordering
Deduplication
[SIGMOD 07]
- Leveraging real-world constraints for deduplication
- Tractable optimal solution and experiments over DBLP and ACM publication data
[WWW 07]
- Detecting near-duplicate web-pages for crawling
- Efficient indexing scheme supporting crawling speeds over web-scale data
Future Work
Short & Medium-Term
1. View management over uncertain databases: materialized view updates, versioning, partial materialization, …
2. More applications of uncertain data
3. More on lineage: internal/external lineage, approximate lineage, uncertain lineage, …
Long-term
1. Applying uncertainty to other data management problems: query optimization? cloud computing?
2. Improve quality of data through conflict resolution and feedback
3. Web-data management: Handling huge amounts of data that is conflicting, uncertain, redundant, dependent, …
Thanks!
Anish Das Sarma
http://i.stanford.edu/~anishds (or search “Anish Das Sarma”)