TRANSCRIPT
Bayes' Rule

P(b|a) = P(a|b) P(b) / P(a)

Conditioned on background evidence a:

P(b|c,a) = P(c|b,a) P(b|a) / P(c|a)

Naive Bayes (full joint probability distribution):

P(Cause, Effect_1, ..., Effect_n) = P(Cause) ∏_i P(Effect_i | Cause)
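Bayes' rule is easy to check numerically. A minimal sketch in Python; the probability values here are made up purely for illustration:

```python
def bayes_rule(p_a_given_b, p_b, p_a):
    """P(b|a) = P(a|b) * P(b) / P(a)."""
    return p_a_given_b * p_b / p_a

# Hypothetical numbers: P(a|b) = 0.9, P(b) = 0.1, P(a) = 0.2
p_b_given_a = bayes_rule(0.9, 0.1, 0.2)
print(round(p_b_given_a, 2))  # 0.45
```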
Bayes' Rule and Reasoning

Allows use of uncertain causal knowledge.
Knowledge: given a cause, what is the likelihood of seeing particular effects (conditional probabilities)?
Reasoning: seeing some effects, how do we infer the likelihood of a cause?
This can be very complicated: we need the joint probability distribution of (k+1) variables, i.e., 2^(k+1) numbers.
Use conditional independence to simplify expressions. Allows sequential, step-by-step computation.
P(H | e_1, e_2, ..., e_k) = P(e_1, e_2, ..., e_k | H) P(H) / P(e_1, e_2, ..., e_k)
Bayesian/Belief Network

To avoid the problem of enumerating large joint probabilities, use causal knowledge and independence to simplify reasoning and draw inferences.
P(X_1, ..., X_n) = P(X_n | X_1, ..., X_{n-1}) P(X_1, ..., X_{n-1})
= P(X_n | X_{n-1}, ..., X_1) P(X_{n-1} | X_{n-2}, ..., X_1) ... P(X_2 | X_1) P(X_1)
Bayesian Networks

Also called a Belief Network or probabilistic network.
Nodes: random variables, one variable per node.
Directed links between pairs of nodes: A → B means A has a direct influence on B. There are no directed cycles.
A conditional distribution for each node given its parents:

P(X_i | Parents(X_i))

Example topology: Weather stands alone; Cavity is a parent of both Toothache and Catch.
Must determine the domain-specific topology.
Bayesian Networks

The next step is to determine the conditional probability distribution for each variable, represented as a conditional probability table (CPT) giving the distribution over X_i for each combination of parent values.
Once the CPTs are determined, the full joint probability distribution is represented by the network.
The network provides a complete description of the domain.
Belief Networks: Example

If you go to college, this will affect the likelihood that you will study and the likelihood that you will party. Studying and partying affect your chances of exam success, and partying affects your chances of having fun.
Variables: College, Study, Party, Exam (success), Fun.
Causal relations: College affects studying. College affects partying. Studying and partying affect exam success. Partying affects having fun.
Network: College → Study, College → Party; Study → Exam, Party → Exam; Party → Fun.
College example: CPTs

(Discrete variables only in this format.)

P(C) = 0.2

C      P(S)
true   0.8
false  0.2

C      P(P)
true   0.6
false  0.5

S      P      P(E)
true   true   0.6
true   false  0.9
false  true   0.1
false  false  0.2

P      P(F)
true   0.9
false  0.7
Belief Networks: Compactness

A CPT for a Boolean variable X_i with k Boolean parents has 2^k rows, one for each combination of parent values. Each row requires one number p for X_i = true (the number for X_i = false is 1 - p, so each row sums to 1).
If each variable has no more than k parents, the complete network requires O(n·2^k) numbers, i.e., the numbers grow linearly in n, vs. O(2^n) for the full joint distribution.
The College net has 1 + 2 + 2 + 4 + 2 = 11 numbers.
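The parameter count can be checked directly. A small sketch, with parent counts taken from the College network in these slides:

```python
# Boolean nodes: each CPT needs 2**k numbers, where k = number of parents
parents = {"College": 0, "Study": 1, "Party": 1, "Exam": 2, "Fun": 1}

cpt_numbers = sum(2 ** k for k in parents.values())
full_joint = 2 ** len(parents) - 1  # independent numbers in the full joint

print(cpt_numbers)  # 11
print(full_joint)   # 31
```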
Belief Networks: Joint Probability Distribution Calculation

Global semantics defines the full joint distribution as the product of the local distributions:

P(X_1, ..., X_n) = ∏_{i=1}^{n} P(X_i | Parents(X_i))

Probability of going to college and that you will study and be successful on your exams, but not party or have fun:

P(c ∧ s ∧ ¬p ∧ e ∧ ¬f) = P(c) P(s|c) P(¬p|c) P(e|s,¬p) P(¬f|¬p) = 0.2 × 0.8 × 0.4 × 0.9 × 0.3 = 0.01728

We can use the network to make inferences: every value in the full joint probability distribution can be calculated.
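The global-semantics product can be verified in a few lines. A sketch using the CPT values from the College example (the variable and dictionary names are mine):

```python
# CPTs from the College network (values from the slides)
P_C = 0.2
P_S_given_C = {True: 0.8, False: 0.2}
P_P_given_C = {True: 0.6, False: 0.5}
P_E_given_SP = {(True, True): 0.6, (True, False): 0.9,
                (False, True): 0.1, (False, False): 0.2}
P_F_given_P = {True: 0.9, False: 0.7}

def joint(c, s, p, e, f):
    """P(C,S,P,E,F) = P(C) P(S|C) P(P|C) P(E|S,P) P(F|P)."""
    pc = P_C if c else 1 - P_C
    ps = P_S_given_C[c] if s else 1 - P_S_given_C[c]
    pp = P_P_given_C[c] if p else 1 - P_P_given_C[c]
    pe = P_E_given_SP[(s, p)] if e else 1 - P_E_given_SP[(s, p)]
    pf = P_F_given_P[p] if f else 1 - P_F_given_P[p]
    return pc * ps * pp * pe * pf

# College, study, no party, exam success, no fun:
print(round(joint(True, True, False, True, False), 5))  # 0.01728
```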
Network Construction

Must ensure the network and distribution are good representations of the domain. We want to rely on conditional independence relationships.
First, rewrite the joint distribution in terms of a conditional probability, then repeat for each conjunctive probability:
P(x_1, ..., x_n) = P(x_n | x_{n-1}, ..., x_1) P(x_{n-1}, ..., x_1)
= P(x_n | x_{n-1}, ..., x_1) P(x_{n-1} | x_{n-2}, ..., x_1) ... P(x_2 | x_1) P(x_1)
= ∏_{i=1}^{n} P(x_i | x_{i-1}, ..., x_1)    (Chain Rule)
Network Construction

Note that

P(x_1, ..., x_n) = ∏_{i=1}^{n} P(x_i | x_{i-1}, ..., x_1)

is equivalent to

P(X_i | X_{i-1}, ..., X_1) = P(X_i | Parents(X_i))

provided that Parents(X_i) ⊆ {X_1, ..., X_{i-1}}, where the partial order is defined by the graph structure.
The above equation says that the network correctly represents the domain only if each node is conditionally independent of its predecessors in the node ordering, given the node's parents. This means the parents of X_i need to contain all nodes in X_1, ..., X_{i-1} that have a direct influence on X_i.
College example:
P(F|C, S, P, E) = P(F|P)
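The asserted independence P(F|C,S,P,E) = P(F|P) can be verified numerically from the joint distribution. A sketch using the slide's CPT values (the helper names are mine):

```python
from itertools import product

# College CPTs (from the slides)
P_C = 0.2
P_S = {True: 0.8, False: 0.2}
P_P = {True: 0.6, False: 0.5}
P_E = {(True, True): 0.6, (True, False): 0.9,
       (False, True): 0.1, (False, False): 0.2}
P_F = {True: 0.9, False: 0.7}

def val(p, x):
    """P(X=x) given P(X=true) = p."""
    return p if x else 1 - p

def joint(c, s, p, e, f):
    return (val(P_C, c) * val(P_S[c], s) * val(P_P[c], p)
            * val(P_E[(s, p)], e) * val(P_F[p], f))

# P(F=true | C, S, P, E) equals P(F=true | P) for every assignment:
for c, s, p, e in product([False, True], repeat=4):
    num = joint(c, s, p, e, True)
    den = num + joint(c, s, p, e, False)
    assert abs(num / den - P_F[p]) < 1e-12
print("P(F | C,S,P,E) == P(F | P) holds")
```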
Compact Networks

Bayesian networks are sparse and therefore much more compact than the full joint distribution.
Sparse: each subcomponent interacts directly with a bounded number of other nodes, independent of the total number of components. Usually linearly bounded complexity.
The College net has 1 + 2 + 2 + 4 + 2 = 11 numbers. A fully connected network is equivalent to the full joint distribution.
Must determine the correct network topology: add "root causes" first, then the variables that they influence.
Network Construction

Need a method such that a series of locally testable assertions of conditional independence guarantees the required global semantics.

1. Choose an ordering of variables X_1, ..., X_n
2. For i = 1 to n:
   add X_i to the network
   select parents from X_1, ..., X_{i-1} such that P(X_i | Parents(X_i)) = P(X_i | X_1, ..., X_{i-1})

The choice of parents guarantees the global semantics:
P(X_1, ..., X_n) = ∏_{i=1}^{n} P(X_i | X_{i-1}, ..., X_1)    (chain rule)
= ∏_{i=1}^{n} P(X_i | Parents(X_i))    (by construction)
Constructing Bayes' networks: Example

Choose the ordering F, E, P, S, C.

P(E|F) = P(E)?
P(S|F,E) = P(S)? P(S|F,E) = P(S|E)?
P(P|F) = P(P)?
P(C|F,E,P,S) = P(C)? P(C|F,E,P,S) = P(C|P,S)?

Note that this network has additional dependencies.
Compact Networks

[Figure: the network built with the ordering F, E, P, S, C compared with the original College network.]
Network Construction: Alternative

Start with the topological semantics that specifies the conditional independence relationships, defined by either:
A node is conditionally independent of its non-descendants, given its parents.
A node is conditionally independent of all other nodes given its parents, children, and children's parents: its Markov blanket.
Then reconstruct the CPTs.
Network Construction: Alternative

Local semantics: each node is conditionally independent of its non-descendants, given its parents. The local semantics give rise to the global semantics.
Example: Exam is independent of College, given the values of Study and Party.
Network Construction: Alternative

Each node is conditionally independent of all other nodes given its parents, children, and children's parents: its Markov blanket.

[Figure: the Markov blanket of a node X — its parents U_1, ..., U_m, its children Y_1, ..., Y_n, and its children's other parents Z_1j, ..., Z_nj.]

College is independent of Fun, given Party.
Canonical Distributions

Completing a node's CPT requires up to O(2^k) numbers (k = number of parents). If the parent-child relationship is arbitrary, this can be difficult to do.
Standard patterns can be named along with a few parameters to fill in the CPT: canonical distributions.
Deterministic Nodes

The simplest form is a deterministic node: its value is specified exactly by its parents' values, with no uncertainty.
But what about relationships that are uncertain? If someone has a fever, do they have a cold, the flu, or a stomach bug? Can you have a cold or stomach bug without a fever?
Noisy-OR Relationships

A noisy-OR relationship permits uncertainty about whether each parent causes the child to be true: the causal relationship may be inhibited. It assumes:
All possible causes are known. (A miscellaneous category, a "leak node", can be added if necessary.)
Inhibition of a particular parent is independent of inhibition of the other parents.
Can you have a cold or stomach bug without a fever? Fever is true iff Cold, Flu, or Malaria is true (unless inhibited).
Example

Given:

P(¬fever | cold, ¬flu, ¬malaria) = 0.6
P(¬fever | ¬cold, flu, ¬malaria) = 0.2
P(¬fever | ¬cold, ¬flu, malaria) = 0.1
Example

Cold   Flu    Malaria   P(¬Fever)
F      F      F         1.0
F      F      T         0.1
F      T      F         0.2
F      T      T         0.2 × 0.1 = 0.02
T      F      F         0.6
T      F      T         0.6 × 0.1 = 0.06
T      T      F         0.6 × 0.2 = 0.12
T      T      T         0.6 × 0.2 × 0.1 = 0.012

Requires O(k) parameters rather than O(2^k).
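The whole noisy-OR CPT follows from the three inhibition probabilities. A sketch; the function and dictionary names are mine:

```python
from itertools import product

# Inhibition probabilities from the slides:
# q_cold = 0.6, q_flu = 0.2, q_malaria = 0.1
q = {"cold": 0.6, "flu": 0.2, "malaria": 0.1}

def p_not_fever(**present):
    """Noisy-OR: P(not fever | causes) = product of q_i over causes present."""
    result = 1.0
    for cause, is_present in present.items():
        if is_present:
            result *= q[cause]
    return result

for c, f, m in product([False, True], repeat=3):
    print(c, f, m, round(p_not_fever(cold=c, flu=f, malaria=m), 3))
# e.g. cold & flu & malaria -> 0.6 * 0.2 * 0.1 = 0.012
```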
Networks with Continuous Variables

How are continuous variables represented?
Discretization using intervals: can result in loss of accuracy and large CPTs.
Probability density functions specified by a finite number of parameters, e.g., a Gaussian distribution.
Hybrid Bayesian Networks

Contain both discrete and continuous variables. Specification of such a network requires:
A conditional distribution for a continuous variable with discrete or continuous parents.
A conditional distribution for a discrete variable with continuous parents.
Example

Network: Subsidy → Cost, Harvest → Cost, Cost → Buys. Cost is a continuous child with a discrete parent (Subsidy) and a continuous parent (Harvest).
The discrete parent is explicitly enumerated. The continuous parent is represented as a distribution: Cost c depends on the distribution function for h, and a linear Gaussian distribution can be used. We have to define the distribution for both values of Subsidy.
Example

Network: Subsidy → Cost, Harvest → Cost, Cost → Buys. Buys is a discrete child with a continuous parent (Cost).
Set a threshold for cost. We can use an integral of the standard normal distribution: the underlying decision process has a hard threshold, but the threshold's location moves based upon random Gaussian noise. This is the probit distribution.
Example

Probit distribution: usually a better fit for real problems.
Logit distribution: uses the sigmoid function to determine the threshold. Can be mathematically easier to work with.
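The two soft thresholds can be compared side by side. A sketch; the threshold value of 5 and the cost values are hypothetical, chosen only to show the shapes:

```python
import math

def probit(x, mu=0.0, sigma=1.0):
    """Probit: standard normal CDF of (x - mu)/sigma (soft threshold)."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

def logit(x, mu=0.0, sigma=1.0):
    """Logit: sigmoid of the scaled distance from the threshold."""
    return 1 / (1 + math.exp(-(x - mu) / sigma))

# Hypothetical: P(buys | cost) falls as cost rises above a threshold of 5;
# at cost = 5 both curves give exactly 0.5
for cost in [3, 5, 7]:
    print(cost, round(probit(-(cost - 5)), 3), round(logit(-(cost - 5)), 3))
```

The probit curve drops off much faster in the tails than the logit curve, which is one reason the slides note it often fits real threshold behavior better.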
Bayes' Networks and Exact Inference

Notation:
X: the query variable.
E: the set of evidence variables E_1, ..., E_m.
e: a particular observed event.
Y: the set of nonevidence variables Y_1, ..., Y_l, also called hidden variables.
The complete set of variables: X = {X} ∪ E ∪ Y.
A query: P(X|e).
Example Query

If you succeeded on an exam and had fun, what is the probability of partying? P(Party | Exam = true, Fun = true)
Inference by Enumeration

From Chapter 13 we know:

P(X|e) = α P(X, e) = α Σ_y P(X, e, y)

From this chapter we have:

P(X_1, ..., X_n) = ∏_{i=1}^{n} P(X_i | Parents(X_i))

Each P(x, e, y) term in the joint distribution can be represented as a product of conditional probabilities.
Inference by Enumeration

A query can be answered using a Bayes net by computing the sums of products of conditional probabilities from the network.
Example Query

If you succeeded on an exam and had fun, what is the probability of partying? P(Party | Exam = true, Fun = true)
What are the hidden variables?
Example Query

Let: C = College, PR = Party, S = Study, E = Exam, F = Fun.
Then we have from eq. 13.6 (p. 476):

P(pr | e, f) = α P(pr, e, f) = α Σ_C Σ_S P(C, S, pr, e, f)
Example Query

Using

P(X_1, ..., X_n) = ∏_{i=1}^{n} P(X_i | Parents(X_i))

we can put the query in terms of the CPT entries:

P(pr | e, f) = α Σ_C Σ_S P(C) P(pr|C) P(S|C) P(e|S, pr) P(f|pr)

The worst-case complexity of this equation is O(n·2^n) for n variables.
Example Query

Improving the calculation: P(f|pr) is a constant, so it can be moved outside the summations over C and S. Then move the terms that involve only C and not S outside the summation over S:

P(pr | e, f) = α P(f|pr) Σ_C P(C) P(pr|C) Σ_S P(S|C) P(e|S, pr)
College example: CPTs

P(C) = 0.2

C      P(S)
true   0.8
false  0.2

C      P(PR)
true   0.6
false  0.5

S      PR     P(E)
true   true   0.6
true   false  0.9
false  true   0.1
false  false  0.2

PR     P(F)
true   0.9
false  0.7
Example Query

Evaluating the expression with the CPT entries (the slides show this as an evaluation tree):

Inner sums over S:
for C = c:   P(s|c) P(e|s,pr) + P(¬s|c) P(e|¬s,pr) = 0.8 × 0.6 + 0.2 × 0.1 = 0.48 + 0.02 = 0.5
for C = ¬c:  P(s|¬c) P(e|s,pr) + P(¬s|¬c) P(e|¬s,pr) = 0.2 × 0.6 + 0.8 × 0.1 = 0.12 + 0.08 = 0.2

Sum over C:
P(c) P(pr|c) × 0.5 + P(¬c) P(pr|¬c) × 0.2 = 0.2 × 0.6 × 0.5 + 0.8 × 0.5 × 0.2 = 0.06 + 0.08 = 0.14

Multiplying by P(f|pr) = 0.9 gives α × 0.126. Similarly for P(¬pr | e, f).
Still O(2^n).
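The enumeration can be reproduced in a few lines by summing the joint over the hidden variables C and S. A sketch using the slide's CPT values (names are mine):

```python
from itertools import product

# College-network CPTs (from the slides); PR = Party
P_C = 0.2
P_S = {True: 0.8, False: 0.2}            # P(S=true | C)
P_PR = {True: 0.6, False: 0.5}           # P(PR=true | C)
P_E = {(True, True): 0.6, (True, False): 0.9,
       (False, True): 0.1, (False, False): 0.2}  # P(E=true | S, PR)
P_F = {True: 0.9, False: 0.7}            # P(F=true | PR)

def joint(c, s, pr, e, f):
    def val(p, x):  # P(X=x) given P(X=true) = p
        return p if x else 1 - p
    return (val(P_C, c) * val(P_S[c], s) * val(P_PR[c], pr)
            * val(P_E[(s, pr)], e) * val(P_F[pr], f))

# P(PR | e=true, f=true): sum the joint over the hidden variables C, S
unnorm = {pr: sum(joint(c, s, pr, True, True)
                  for c, s in product([False, True], repeat=2))
          for pr in [True, False]}
print(round(unnorm[True], 4))                          # 0.126, as in the slide
alpha = 1 / sum(unnorm.values())
print(round(alpha * unnorm[True], 4))                  # ≈ 0.4777 after normalizing
```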
Variable Elimination

A problem with the enumeration method is that particular products can be computed multiple times, reducing efficiency. Reduce the number of duplicate calculations by doing each calculation once and saving the result for later.
Variable elimination evaluates expressions from right to left, stores the intermediate results, and sums over each variable only for the portions of the expression that depend on that variable.
Variable Elimination

First, factor the equation:

P(PR | e, f) = α P(f|PR) Σ_C P(C) P(PR|C) Σ_S P(S|C) P(e|S, PR)
                    F          C      PR         S        E

Second, store the factor for E: a 2×2 matrix f_E(S, PR) of the values P(e|S, PR).
Third, store the factor for S: a 2×2 matrix

f_S(S, C) = | P(s|c)   P(s|¬c)  |
            | P(¬s|c)  P(¬s|¬c) |
Variable Elimination

Fourth, sum out S from the product of the first two factors (the bar marks a summed-out variable):

f_S̄E(C, PR) = Σ_s f_S(s, C) × f_E(s, PR) = f_S(s, C) × f_E(s, PR) + f_S(¬s, C) × f_E(¬s, PR)

This is called a pointwise product. It creates a new factor whose variables are the union of the variables of the two factors in the product:

f(X_1..X_j, Y_1..Y_k, Z_1..Z_l) = f_1(X_1..X_j, Y_1..Y_k) × f_2(Y_1..Y_k, Z_1..Z_l)

Any factor that does not depend on the variable to be summed out can be moved outside the summation.
Variable Elimination

Fifth, store the factor for PR: a 2×2 matrix

f_PR(PR, C) = | P(pr|c)   P(pr|¬c)  |
              | P(¬pr|c)  P(¬pr|¬c) |

Sixth, store the factor for C:

f_C(C) = | P(c)  |
         | P(¬c) |

This gives:

P(PR | e, f) = α P(f|PR) Σ_c P(C) P(PR|C) f_S̄E(C, PR)
Variable Elimination

Seventh, sum out C from the product of the factors:

f_C̄PRS̄E(PR) = Σ_c f_C(c) × f_PR(PR, c) × f_S̄E(c, PR)
             = f_C(c) × f_PR(PR, c) × f_S̄E(c, PR) + f_C(¬c) × f_PR(PR, ¬c) × f_S̄E(¬c, PR)

where

P(PR | e, f) = α P(f|PR) Σ_c P(C) P(PR|C) f_S̄E(C, PR)
Variable Elimination

Next, store the factor for F:

f_F(PR) = | P(f|pr)  |
          | P(f|¬pr) |

Finally, calculate the result:

P(PR | e, f) = α P(f|PR) f_C̄PRS̄E(PR) = α f_F(PR) × f_C̄PRS̄E(PR)
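The whole elimination can be sketched with factors as dictionaries. The factor names follow the slides (f_E, f_S, f_PR, f_C, f_F); the representation and helper functions are my own illustration:

```python
from itertools import product

# A factor is (variable_names, table) with Boolean-tuple keys.
def pointwise(f1, f2):
    """Pointwise product: the new factor's variables are the union."""
    v1, t1 = f1
    v2, t2 = f2
    vs = v1 + [v for v in v2 if v not in v1]
    table = {}
    for vals in product([False, True], repeat=len(vs)):
        a = dict(zip(vs, vals))
        table[vals] = (t1[tuple(a[v] for v in v1)]
                       * t2[tuple(a[v] for v in v2)])
    return vs, table

def sum_out(var, f):
    """Sum a variable out of a factor."""
    vs, t = f
    i = vs.index(var)
    table = {}
    for vals, p in t.items():
        key = vals[:i] + vals[i + 1:]
        table[key] = table.get(key, 0.0) + p
    return vs[:i] + vs[i + 1:], table

# Factors for the query P(PR | e, f), evidence e = f = true (CPTs from the slides):
f_E = (["S", "PR"], {(True, True): 0.6, (True, False): 0.9,
                     (False, True): 0.1, (False, False): 0.2})
f_S = (["S", "C"], {(True, True): 0.8, (True, False): 0.2,
                    (False, True): 0.2, (False, False): 0.8})
f_PR = (["PR", "C"], {(True, True): 0.6, (True, False): 0.5,
                      (False, True): 0.4, (False, False): 0.5})
f_C = (["C"], {(True,): 0.2, (False,): 0.8})
f_F = (["PR"], {(True,): 0.9, (False,): 0.7})

g = sum_out("S", pointwise(f_S, f_E))                  # f_SE(C, PR)
h = sum_out("C", pointwise(f_C, pointwise(f_PR, g)))   # f_CPRSE(PR)
vs, t = pointwise(f_F, h)
alpha = 1 / sum(t.values())
print(round(alpha * t[(True,)], 4))  # ≈ 0.4777, matching enumeration
```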
Elimination Simplification

Any leaf node that is not a query variable or an evidence variable can be removed. Every variable that is not an ancestor of a query variable or an evidence variable is irrelevant to the query and can be eliminated.
Elimination Simplification

Book example: what is the probability that John calls if there is a burglary?

P(J | b) = α P(b) Σ_e P(e) Σ_a P(a|b, e) P(J|a) Σ_m P(m|a)

Does the MaryCalls term matter? Σ_m P(m|a) = 1, so MaryCalls is irrelevant to this query and can be dropped.

[Figure: the Burglary network — Burglary and Earthquake are parents of Alarm; Alarm is the parent of JohnCalls and MaryCalls.]
Complexity of Exact Inference

Variable elimination is more efficient than enumeration. Time and space requirements are dominated by the size of the largest factor constructed, which is determined by the order of variable elimination and the network structure.

Polytrees

Polytrees are singly connected networks: there is at most one undirected path between any two nodes. Time and space requirements are linear in the size of the network, where size is the number of CPT entries.
Polytrees

[Figures: the Burglary network and the College network.]

Are these networks polytrees? The Burglary network is (its undirected skeleton is a tree); the College network is not (College–Study–Exam–Party–College forms an undirected cycle).

Applying variable elimination to multiply connected networks has worst-case exponential time and space complexity.
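The polytree test can be automated: a DAG is a polytree iff its undirected skeleton is acyclic, i.e., a forest. A sketch using a small union-find; the single-letter node names abbreviate the networks in these slides:

```python
# A DAG is a polytree iff adding its edges to a union-find never joins
# two nodes that are already connected (no undirected cycle).
def is_polytree(nodes, edges):
    parent = {n: n for n in nodes}
    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n
    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra == rb:
            return False  # undirected cycle found
        parent[ra] = rb
    return True

burglary = [("B", "A"), ("E", "A"), ("A", "J"), ("A", "M")]
college = [("C", "S"), ("C", "P"), ("S", "E"), ("P", "E"), ("P", "F")]
print(is_polytree("BEAJM", burglary))  # True
print(is_polytree("CSPEF", college))   # False: C-S-E-P-C is an undirected cycle
```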