probability boolean expressions
8/2/2019 Probability Boolean Expressions
http://slidepdf.com/reader/full/probability-boolean-expressions 1/33
Probabilistic/Uncertain Data Management -- III
Slides based on the Suciu/Dalvi SIGMOD’05 tutorial
1. Dalvi, Suciu. “Efficient query evaluation on probabilistic databases”, VLDB’2004.
2. Das Sarma et al. “Working models for uncertain data”, ICDE’2006.
What is a Probabilistic Database ?
• “An item belongs to the database” is a probabilistic event
 – Tuple-existence uncertainty
 – Attribute-value uncertainty
• “A tuple is an answer to the query” is a probabilistic event
• Can be extended to all data models; we discuss only probabilistic relational data
Possible Worlds Semantics
The set of all possible database instances:
$\mathrm{INST} = \{I_1, I_2, I_3, \ldots, I_N\}$
Definition A probabilistic database Ip is a probability distribution on INST:
$\Pr : \mathrm{INST} \to [0,1]$ s.t. $\sum_{i=1}^{N} \Pr(I_i) = 1$
Definition A possible world is an instance I s.t. $\Pr(I) > 0$
Query Semantics
Given a query Q and a probabilistic database Ip, what is the meaning of Q(Ip)?
Query Semantics
Semantics 1: Possible Answers
A probability distribution on sets of tuples
$\forall A.\ \Pr(Q = A) = \sum_{I \in \mathrm{INST},\ Q(I) = A} \Pr(I)$
Semantics 2: Possible Tuples
A probability function on tuples
$\forall t.\ \Pr(t \in Q) = \sum_{I \in \mathrm{INST},\ t \in Q(I)} \Pr(I)$
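The two semantics can be sketched directly over an explicit list of possible worlds. The worlds, their probabilities, and the example query below are made up for illustration; only the two summation rules come from the slide.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical probabilistic database: each possible world (a frozenset of
# tuples) is mapped to its probability. The numbers are illustrative only.
worlds = {
    frozenset({("John", "Seattle"), ("Sue", "Denver")}): Fraction(1, 2),
    frozenset({("John", "Seattle")}): Fraction(1, 4),
    frozenset({("Sue", "Denver")}): Fraction(1, 4),
}

def query(instance):
    """Example query Q (an assumption): project every tuple onto its city."""
    return frozenset(city for _name, city in instance)

def possible_answers(q, worlds):
    """Pr(Q = A): sum of Pr(I) over all worlds I with Q(I) = A."""
    dist = defaultdict(Fraction)
    for instance, pr in worlds.items():
        dist[q(instance)] += pr
    return dict(dist)

def possible_tuples(q, worlds):
    """Pr(t in Q): sum of Pr(I) over all worlds I with t in Q(I)."""
    dist = defaultdict(Fraction)
    for instance, pr in worlds.items():
        for t in q(instance):
            dist[t] += pr
    return dict(dist)

answers = possible_answers(query, worlds)
tuple_probs = possible_tuples(query, worlds)
```

Here `answers` is a distribution over whole answer sets, while `tuple_probs` only gives a marginal probability per answer tuple, so the tuple probabilities need not sum to 1.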
Possible Worlds Query Semantics
Possible answers semantics
• Precise
• Can be used to compose queries
• Difficult user interface
Possible tuples semantics
• Less precise, but simple; sufficient for most apps
• Cannot be used to compose queries
• Simple user interface
Possible Worlds Semantics: Summary
Complete model; clean formal semantics for SQL queries
Not very useful as a representation or implementation tool
• HUGE number of possible worlds!
Need more effective representation formalisms
• Something that users can understand/explore
• Allow more efficient query execution
 – Avoid “possible worlds explosion”
• Perhaps giving up completeness
Representation Formalisms
Problem
Need a good representation formalism
• Will be interpreted as possible worlds
• Several formalisms exist, but no winner
Main open problem in probabilistic db
Evaluation of Formalisms
Completeness?
• What possible worlds can it represent?
• What probability distributions on worlds?
Closure?
• Is it closed under evaluation of query operators?
A Complete Formalism: Intensional Databases
Atomic event ids: e1, e2, e3, …
Probabilities: p1, p2, p3, … ∈ [0,1]
Event expressions: built with ∧, ∨, ¬, e.g. e3 ∧ (e5 ∨ ¬e2)
Intensional probabilistic database J:
each tuple t has an event attribute t.E
Intensional DB ⇒ Possible Worlds
J =
Name  Address  E
John  Seattle  e1 ∧ (e2 ∨ e3)
Sue   Denver   (e1 ∧ e2) ∨ (e2 ∧ e3)
Ip =
World                        e1e2e3              Pr
∅                            000, 001, 010, 100  (1-p1)(1-p2)(1-p3) + (1-p1)(1-p2)p3 + (1-p1)p2(1-p3) + p1(1-p2)(1-p3)
{John Seattle}               101                 p1(1-p2)p3
{Sue Denver}                 011                 (1-p1)p2p3
{John Seattle, Sue Denver}   110, 111            p1p2(1-p3) + p1p2p3
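The mapping from event expressions to possible worlds can be checked by enumerating all truth assignments of e1, e2, e3. This is a minimal sketch that assumes the atomic events are independent; the numeric probabilities and the function name `worlds_of` are illustrative choices, not part of the slides.

```python
from collections import defaultdict
from itertools import product

# Event expressions of the two tuples in J, written as Python predicates
# over a truth assignment (e1, e2, e3).
J = {
    ("John", "Seattle"): lambda e1, e2, e3: e1 and (e2 or e3),
    ("Sue", "Denver"): lambda e1, e2, e3: (e1 and e2) or (e2 and e3),
}

def worlds_of(J, probs):
    """Map each possible world to its probability (independent events assumed)."""
    dist = defaultdict(float)
    for assignment in product([False, True], repeat=len(probs)):
        weight = 1.0
        for value, p in zip(assignment, probs):
            weight *= p if value else 1.0 - p
        # A tuple is in the world exactly when its event expression holds.
        world = frozenset(t for t, expr in J.items() if expr(*assignment))
        dist[world] += weight
    return dict(dist)

p1, p2, p3 = 0.5, 0.4, 0.3  # illustrative values, not from the slides
dist = worlds_of(J, (p1, p2, p3))
```

The eight assignments collapse into four worlds, and the accumulated weights agree with the per-world sums listed on the slide.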
Possible Worlds ⇒ Intensional DB
Ip =
Tuples in world                               Pr
{John Seattle, John Boston, Sue Seattle}      p1
{John Seattle, Sue Seattle}                   p2
{Sue Seattle}                                 p3
{John Boston}                                 p4
J =
Name  Address  E
John  Seattle  E1 ∨ E2
John  Boston   E1 ∨ E4
Sue   Seattle  E1 ∨ E2 ∨ E3
“Prefix code”
E1 = e1
E2 = ¬e1 ∧ e2
E3 = ¬e1 ∧ ¬e2 ∧ e3
E4 = ¬e1 ∧ ¬e2 ∧ ¬e3 ∧ e4
Pr(e1) = p1
Pr(e2) = p2 / (1-p1)
Pr(e3) = p3 / (1-p1-p2)
Pr(e4) = p4 / (1-p1-p2-p3)
Intensional DBs are complete
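A quick way to see why the “prefix code” works is to recompute Pr(E_i) from the atomic event probabilities and check that the world probabilities come back. The sketch below assumes independent atomic events; the four world probabilities and both function names are made up for illustration.

```python
from fractions import Fraction

def prefix_code_probs(world_probs):
    """Pr(e_i) = p_i / (1 - p_1 - ... - p_{i-1}), as on the slide."""
    event_probs, remaining = [], Fraction(1)
    for p in world_probs:
        event_probs.append(p / remaining)
        remaining -= p
    return event_probs

def prefix_event_probs(event_probs):
    """Pr(E_i) = Pr(not e_1) * ... * Pr(not e_{i-1}) * Pr(e_i)."""
    result, none_so_far = [], Fraction(1)
    for pe in event_probs:
        result.append(none_so_far * pe)
        none_so_far *= 1 - pe
    return result

# Illustrative world probabilities p1..p4 (they sum to 1).
world_probs = [Fraction(1, 10), Fraction(2, 10), Fraction(3, 10), Fraction(4, 10)]
event_probs = prefix_code_probs(world_probs)
recovered = prefix_event_probs(event_probs)
```

Because the events E1..E4 are mutually exclusive by construction, each one selects exactly one world, and `recovered` equals `world_probs`.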
Closure Under Operators
σ:  (v, E)  →  (v, E)
×:  (v1, E1), (v2, E2)  →  (v1 v2, E1 ∧ E2)
Π:  (v, E1), (v, E2), …  →  (v, E1 ∨ E2 ∨ …)
−:  (v, E1), (v, E2)  →  (v, E1 ∧ ¬E2)
One still needs to compute probability of event expression
Summary on Intensional Databases
Event expression for each tuple
• Possible worlds: any subset
• Probability distribution: any
Complete… but impractical
• Evaluate the probability of long event expressions
Important abstraction: consider restrictions
Related to c-tables [Imielinski&Lipski:1984]
A Restricted Formalism: Explicit Independent Tuples
Tuple independent probabilistic database
TUP = {t1, t2, …, tM} = all tuples
$pr : \mathrm{TUP} \to [0,1]$ (no restrictions)
INST = P(TUP), $N = 2^M$
$\Pr(I) = \prod_{t \in I} pr(t) \times \prod_{t \notin I} (1 - pr(t))$
Tuple Prob. ⇒ Possible Worlds
J =
Name  City     pr
John  Seattle  p1 = 0.8
Sue   Boston   p2 = 0.6
Fred  Boston   p3 = 0.9
Ip =
World  Tuples                                    Pr
I1     ∅                                         (1-p1)(1-p2)(1-p3)
I2     {John Seattle}                            p1(1-p2)(1-p3)
I3     {Sue Boston}                              (1-p1)p2(1-p3)
I4     {Fred Boston}                             (1-p1)(1-p2)p3
I5     {John Seattle, Sue Boston}                p1p2(1-p3)
I6     {John Seattle, Fred Boston}               p1(1-p2)p3
I7     {Sue Boston, Fred Boston}                 (1-p1)p2p3
I8     {John Seattle, Sue Boston, Fred Boston}   p1p2p3
The eight probabilities sum to 1
E[ size(Ip) ] = 2.3 tuples
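The eight worlds and the expected size of 2.3 tuples can be reproduced by enumerating all subsets of the three tuples. This is a small sketch using the probabilities from the table; the helper names `world_prob` and `all_worlds` are hypothetical.

```python
from itertools import combinations

# Tuple probabilities from the slide.
pr = {("John", "Seattle"): 0.8, ("Sue", "Boston"): 0.6, ("Fred", "Boston"): 0.9}

def world_prob(world, pr):
    """Pr(I) = product of pr(t) for t in I times product of (1 - pr(t)) for t not in I."""
    prob = 1.0
    for t, p in pr.items():
        prob *= p if t in world else 1.0 - p
    return prob

def all_worlds(pr):
    """Yield every subset of the tuple set, i.e. INST = P(TUP)."""
    tuples = list(pr)
    for k in range(len(tuples) + 1):
        for subset in combinations(tuples, k):
            yield frozenset(subset)

dist = {w: world_prob(w, pr) for w in all_worlds(pr)}
total = sum(dist.values())
expected_size = sum(len(w) * p for w, p in dist.items())
```

By linearity of expectation the expected size is simply 0.8 + 0.6 + 0.9, which the enumeration confirms.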
Tuple-Independent DBs are Incomplete
Name  Address  pr
John  Seattle  p1
Sue   Seattle  p2
Ip =
Tuples in world               Pr
{John Seattle, Sue Seattle}   p1p2
{John Seattle}                p1
∅                             1 - p1 - p1p2
In this Ip, Sue appears only in worlds that also contain John; independent tuple probabilities cannot express such a correlation in general.
Very limited – cannot capture correlations across tuples
Not Closed
• Query operators can introduce complex correlations!
Tuple Prob. ) Query Evaluation
Name City pr
John Seattle p1
Sue Boston p2
Fred Boston p3
Customer Product Date pr
John Gizmo . . . q1
John Gadget . . . q2
John Gadget . . . q3
Sue Camera . . . q4
Sue Gadget . . . q5
Sue Gadget . . . q6
Fred Gadget . . . q7
SELECT DISTINCT x.city
FROM Person x, Purchase y
WHERE x.Name = y.Customer
and y.Product = 'Gadget'
Tuple    Probability
Seattle  p1 (1 - (1-q2)(1-q3))
Boston   1 - (1 - p2 (1 - (1-q5)(1-q6))) × (1 - p3 q7)
Application: Similarity Predicates
Name City Profession
John Seattle statistician
Sue Boston musician
Fred Boston physicist
Cust Product Category
John Gizmo dishware
John Gadget instrument
John Gadget instrument
Sue Camera musicware
Sue Gadget microphone
Sue Gadget instrument
Fred Gadget microphone
SELECT DISTINCT x.city
FROM Person x, Purchase y
WHERE x.Name = y.Cust
and y.Product = 'Gadget'
and x.profession ~ 'scientist'
and y.category ~ 'music'
Step 1:
evaluate ~ predicates
Application: Similarity Predicates
Name City Profession pr
John Seattle statistician p1=0.8
Sue Boston musician p2=0.2
Fred Boston physicist p3=0.9
Cust Product Category pr
John Gizmo dishware q1=0.2
John Gadget instrument q2=0.6
John Gadget instrument q3=0.6
Sue Camera musicware q4=0.9
Sue Gadget microphone q5=0.7
Sue Gadget instrument q6=0.6
Fred Gadget microphone q7=0.7
SELECT DISTINCT x.city
FROM Personp x, Purchasep y
WHERE x.Name = y.Cust
and y.Product = 'Gadget'
and x.profession ~ 'scientist'
and y.category ~ 'music'
Tuple    Probability
Seattle  p1 (1 - (1-q2)(1-q3))
Boston   1 - (1 - p2 (1 - (1-q5)(1-q6))) × (1 - p3 q7)
Step 1: evaluate ~ predicates
Step 2: evaluate rest of query
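As a sanity check, the two closed-form answers can be compared with a brute-force sum over all 2^10 possible worlds of the two tables. The sketch below assumes all ten tuples are independent and treats the two identical John/Gadget rows as distinct tuples; the function name `city_probability` is hypothetical.

```python
from itertools import product

# (name, city, pr) and (customer, product, pr), with probabilities from the slide.
person = [("John", "Seattle", 0.8), ("Sue", "Boston", 0.2), ("Fred", "Boston", 0.9)]
purchase = [("John", "Gizmo", 0.2), ("John", "Gadget", 0.6), ("John", "Gadget", 0.6),
            ("Sue", "Camera", 0.9), ("Sue", "Gadget", 0.7), ("Sue", "Gadget", 0.6),
            ("Fred", "Gadget", 0.7)]

def city_probability(city):
    """Sum Pr(world) over all worlds whose query answer contains `city`."""
    rows = person + purchase
    total = 0.0
    for present in product([False, True], repeat=len(rows)):
        weight = 1.0
        for is_in, row in zip(present, rows):
            weight *= row[2] if is_in else 1.0 - row[2]
        people = [r for is_in, r in zip(present[:len(person)], person) if is_in]
        bought = [r for is_in, r in zip(present[len(person):], purchase) if is_in]
        # The city is an answer if some person there bought a Gadget.
        if any(name == cust and c == city and item == "Gadget"
               for name, c, _ in people for cust, item, _ in bought):
            total += weight
    return total

p1, p2, p3 = 0.8, 0.2, 0.9
q2, q3, q5, q6, q7 = 0.6, 0.6, 0.7, 0.6, 0.7
seattle_formula = p1 * (1 - (1 - q2) * (1 - q3))
boston_formula = 1 - (1 - p2 * (1 - (1 - q5) * (1 - q6))) * (1 - p3 * q7)
seattle_brute = city_probability("Seattle")
boston_brute = city_probability("Boston")
```

With these numbers the formulas give 0.672 for Seattle and 0.69512 for Boston, and the enumeration agrees up to floating-point error.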
Summary on Explicit Independent Tuples
Independent tuples
• Possible worlds: subsets
• Probability distribution: restricted
• Closure: no
Query Evaluation on Probabilistic DBs
• Focus on possible tuple semantics
– Compute likelihood of individual answer tuples
• Probability of Boolean expressions
• Complexity of query evaluation
Probability of Boolean Expressions
E = X1X3 ∨ X1X4 ∨ X2X5 ∨ X2X6
Randomly make each variable true with the following probabilities
Pr(X1) = p1, Pr(X2) = p2, …, Pr(X6) = p6
What is Pr(E)?
Answer: re-group cleverly: E = X1 (X3 ∨ X4) ∨ X2 (X5 ∨ X6)
Pr(E) = 1 - (1 - p1(1-(1-p3)(1-p4))) (1 - p2(1-(1-p5)(1-p6)))
Needed for query processing
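The regrouped formula can be checked against a direct sum over all 64 truth assignments. The sketch assumes the six variables are independent; the probability values and the names `brute_force` and `factored` are illustrative.

```python
from itertools import product

def brute_force(expr, probs):
    """Pr(expr) as the total weight of all satisfying truth assignments."""
    total = 0.0
    for xs in product([False, True], repeat=len(probs)):
        if expr(*xs):
            weight = 1.0
            for x, p in zip(xs, probs):
                weight *= p if x else 1.0 - p
            total += weight
    return total

def E(x1, x2, x3, x4, x5, x6):
    return (x1 and x3) or (x1 and x4) or (x2 and x5) or (x2 and x6)

def factored(p1, p2, p3, p4, p5, p6):
    """Closed form after regrouping E = X1(X3 v X4) v X2(X5 v X6)."""
    left = p1 * (1 - (1 - p3) * (1 - p4))    # Pr(X1 and (X3 or X4))
    right = p2 * (1 - (1 - p5) * (1 - p6))   # Pr(X2 and (X5 or X6))
    return 1 - (1 - left) * (1 - right)      # the two groups share no variable

probs = (0.5, 0.4, 0.3, 0.2, 0.6, 0.7)  # illustrative values
exact = brute_force(E, probs)
closed_form = factored(*probs)
```

The regrouping is valid because after factoring, every variable occurs exactly once, so each ∧ and ∨ combines independent sub-events.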
Now let's try this:
E = X1X2 ∨ X1X3 ∨ X2X3
No clever grouping seems possible.
Brute force:
X1  X2  X3  E  Pr
0   0   0   0
0   0   1   0
0   1   0   0
0   1   1   1  (1-p1)p2p3
1   0   0   0
1   0   1   1  p1(1-p2)p3
1   1   0   1  p1p2(1-p3)
1   1   1   1  p1p2p3
Pr(E) = (1-p1)p2p3 + p1(1-p2)p3 + p1p2(1-p3) + p1p2p3
Seems inefficient in general…
Complexity of Boolean Expression Probability
Theorem [Valiant:1979]
For a boolean expression E, computing Pr(E) is #P-complete
NP = class of problems of the form “is there a witness?” (SAT)
#P = class of problems of the form “how many witnesses?” (#SAT)
The decision problem for 2CNF is in PTIME
The counting problem for 2CNF is #P-complete [Valiant:1979]
Summary on Boolean Expression Probability
• #P-complete
• It's hard even in simple cases: 2DNF
• Can approximate through Monte Carlo (MC) simulation
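A naive Monte Carlo estimator illustrates the last bullet: sample truth assignments and count how often the expression holds. This is only a plain sampling sketch, not the specific estimator the tutorial may have in mind; the sample count, seed, and function names are assumptions.

```python
import random

def monte_carlo(expr, probs, samples=100_000, seed=0):
    """Estimate Pr(expr) by sampling independent truth assignments."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        xs = [rng.random() < p for p in probs]
        hits += bool(expr(*xs))
    return hits / samples

def E(x1, x2, x3):
    # The expression with no obvious regrouping: X1X2 v X1X3 v X2X3.
    return (x1 and x2) or (x1 and x3) or (x2 and x3)

def exact(p1, p2, p3):
    """Brute-force sum over the four satisfying rows of the truth table."""
    return ((1 - p1) * p2 * p3 + p1 * (1 - p2) * p3
            + p1 * p2 * (1 - p3) + p1 * p2 * p3)

probs = (0.5, 0.5, 0.5)  # illustrative values
estimate = monte_carlo(E, probs)
truth = exact(*probs)
```

The standard error shrinks like 1/sqrt(samples), so plain sampling works well when Pr(E) is not tiny; very small probabilities need many more samples or a smarter estimator.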
Query Complexity
Data complexity of a query Q:
• Compute Q(Ip), for probabilistic database Ip
Simplest scenario only:
• Possible tuples semantics for Q
• Independent tuples for Ip
[Fuhr&Roellke:1997, Dalvi&Suciu:2004]
Extensional Query Evaluation
Relational ops compute probabilities
σ:  (v, p)  →  (v, p)
×:  (v1, p1), (v2, p2)  →  (v1 v2, p1 p2)
Π:  (v, p1), (v, p2), …  →  (v, 1-(1-p1)(1-p2)…)
−:  (v, p1), (v, p2)  →  (v, p1(1-p2))
[Fuhr&Roellke:1997, Dalvi&Suciu:2004]
Unlike intensional evaluation, data complexity: PTIME
[Dalvi&Suciu:2004]
SELECT DISTINCT x.City
FROM Personp x, Purchasep y
WHERE x.Name = y.Cust
and y.Product = 'Gadget'
Input:
Personp:    Jon Sea p1
Purchasep:  Jon q1
            Jon q2
            Jon q3
Plan 1: join (×) first, then project
Jon Sea p1q1
Jon Sea p1q2
Jon Sea p1q3
Sea 1-(1-p1q1)(1-p1q2)(1-p1q3)
Wrong !
Plan 2: project Purchasep first, then join (×)
Jon 1-(1-q1)(1-q2)(1-q3)
Jon Sea p1(1-(1-q1)(1-q2)(1-q3))
Correct
Depends on plan !!!
[Dalvi&Suciu:2004]
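The plan dependence is easy to reproduce numerically. The sketch below applies the extensional rules for join and projection in both orders and compares each result with the exact probability obtained by enumerating worlds; the probability values and the function names are illustrative assumptions.

```python
from itertools import product
from math import prod

def join_prob(p, q):
    """Extensional join: multiply the probabilities of the joined tuples."""
    return p * q

def project_prob(ps):
    """Extensional duplicate-eliminating projection: 1 - prod(1 - p_i)."""
    return 1 - prod(1 - p for p in ps)

def exact(p1, qs):
    """Pr(Jon is in Person and at least one of his purchases exists)."""
    total = 0.0
    for person_in, *bought in product([False, True], repeat=1 + len(qs)):
        weight = p1 if person_in else 1 - p1
        for b, q in zip(bought, qs):
            weight *= q if b else 1 - q
        if person_in and any(bought):
            total += weight
    return total

p1, qs = 0.5, [0.5, 0.5, 0.5]  # illustrative values

# Plan 1: join first, then project. The three joined tuples all share the
# Person tuple, so treating them as independent is incorrect.
join_then_project = project_prob([join_prob(p1, q) for q in qs])

# Plan 2: project Purchase onto the customer first, then join.
project_then_join = join_prob(p1, project_prob(qs))

truth = exact(p1, qs)
```

With these values plan 1 yields 0.578125 while plan 2 and the enumeration both yield 0.4375, so only the second plan is safe here.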
Query Complexity
Sometimes ∄ correct (“safe”) extensional plan
Qbad :- R(x), S(x,y), T(y)    Data complexity is #P-complete
Theorem The following are equivalent
• Q has PTIME data complexity
• Q admits an extensional plan (and one finds it in PTIME)
• Q does not have Qbad as a subquery
[Dalvi&Suciu:2004]
Computing a Safe SPJ Extensional Plan
Problem is due to projection operations
• An “unsafe” extensional projection combines tuples that are correlated, assuming independence
Projection over a join that projects away at least one of the join attrs ⇒ unsafe projection!
• Intuition: joins create correlated output tuples
Computing a Safe SPJ Extensional Plan
Algorithm for Safe Extensional SPJ Evaluation
• Apply safe projections as late as possible in the
plan
• If no more safe projections exist, look for joins
where all attributes are included in the output
– Recurse on the LHS, RHS of the join
Sound and complete safe SPJ evaluation algorithm
• If a safe plan exists, the algo finds it!
Summary on Query Complexity
Extensional query evaluation:
• Very popular
• Guarantees polynomial complexity
• However, result depends on query plan and correctness not always possible!
General query complexity
• #P complete (not surprising, given #SAT)
• Already #P hard for very simple query (Qbad)
Probabilistic databases have high query complexity