Marginalization & Conditioning
• Marginalization (summing out): for any sets of variables Y and Z:
  P(Y) = Σ_z P(Y, z), where the sum ranges over all combinations of values z = (z1, …, zn) of the variables in Z.
• Conditioning (variant of marginalization):
  P(Y) = Σ_z P(Y | z) P(z)
• Often we want to do this for P(Y | X) instead of P(Y). Recall P(Y | X) = P(Y, X) / P(X).
Example of Marginalization
• Using the full joint distribution:
  P(cavity) = Σ_{toothache, catch} P(cavity, toothache, catch)
  = P(cavity, toothache, catch) + P(cavity, toothache, ¬catch) + P(cavity, ¬toothache, catch) + P(cavity, ¬toothache, ¬catch)
  = 0.108 + 0.012 + 0.072 + 0.008
  = 0.2
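A minimal Python sketch of this summing-out, to make the computation concrete. The dictionary layout and function name are illustrative; the two joint entries not shown on the slide (0.144 and 0.576) are the usual textbook values and should be treated as an assumption here.

# Marginalization sketch: P(Y) = sum over z of P(Y, z).
# Keys are (cavity, toothache, catch) truth values.
from itertools import product

joint = {
    (True,  True,  True):  0.108,
    (True,  True,  False): 0.012,
    (True,  False, True):  0.072,
    (True,  False, False): 0.008,
    (False, True,  True):  0.016,
    (False, True,  False): 0.064,
    (False, False, True):  0.144,   # assumed textbook value
    (False, False, False): 0.576,   # assumed textbook value
}

def p_cavity(cavity_value):
    """Sum out Toothache and Catch from the full joint."""
    return sum(joint[(cavity_value, t, c)]
               for t, c in product([True, False], repeat=2))

print(p_cavity(True))   # 0.2, as on the slide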
Inference By Enumeration using Full Joint Distribution
• Let X be a random variable whose probabilities we want to know, given some evidence (values e for a set E of other variables). Let the remaining (unobserved, so-called hidden) variables be Y. The query is P(X|e), and it can be answered using the full joint distribution by
P(X | e) = α P(X, e) = α Σ_y P(X, e, y),

where α = 1 / P(e) is a normalization constant and the sum ranges over all settings y of the hidden variables Y.
Example of Inference By Enumeration using Full Joint Distribution
P(cavity | toothache) = P(cavity, toothache) / P(toothache)
  = [P(cavity, toothache, catch) + P(cavity, toothache, ¬catch)] / P(toothache)
  = (0.108 + 0.012) / [(0.108 + 0.012) + (0.016 + 0.064)] = 0.6

P(¬cavity | toothache) = P(¬cavity, toothache) / P(toothache)
  = [P(¬cavity, toothache, catch) + P(¬cavity, toothache, ¬catch)] / P(toothache)
  = (0.016 + 0.064) / [(0.108 + 0.012) + (0.016 + 0.064)] = 0.4

The normalization constant here is 1 / [(0.108 + 0.012) + (0.016 + 0.064)] = 1 / 0.2 = 5.
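A sketch of the same enumeration in Python; the function name is illustrative, and the joint table is the one from the marginalization sketch above.

# Inference by enumeration: sum out the hidden variable Catch, then
# normalize over the values of Cavity (this is where 1/0.2 = 5 comes from).
joint = {  # full joint over (cavity, toothache, catch), as used earlier
    (True, True, True): 0.108, (True, True, False): 0.012,
    (True, False, True): 0.072, (True, False, False): 0.008,
    (False, True, True): 0.016, (False, True, False): 0.064,
    (False, False, True): 0.144, (False, False, False): 0.576,
}

def p_cavity_given_toothache(cavity_value):
    numerator = sum(joint[(cavity_value, True, c)] for c in (True, False))
    p_toothache = sum(joint[(cv, True, c)]
                      for cv in (True, False) for c in (True, False))
    return numerator / p_toothache

print(p_cavity_given_toothache(True))   # 0.6
print(p_cavity_given_toothache(False))  # 0.4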
Independence
• Propositions a and b are independent if and only if P(a, b) = P(a) P(b).
• Equivalently (by the product rule): P(a | b) = P(a).
• Equivalently: P(b | a) = P(b).
Illustration of Independence
• We know (product rule) that
  P(toothache, catch, cavity, Weather = cloudy) =
  P(Weather = cloudy | toothache, catch, cavity) P(toothache, catch, cavity).
• By independence:
  P(Weather = cloudy | toothache, catch, cavity) = P(Weather = cloudy).
• Therefore we have that
  P(toothache, catch, cavity, Weather = cloudy) =
  P(Weather = cloudy) P(toothache, catch, cavity).
Illustration continued
• Allows us to represent a 32-element table for full joint on Weather, Toothache, Catch, Cavity by an 8-element table for the joint of Toothache, Catch, Cavity, and a 4-element table for Weather.
• If we add a Boolean variable X to the 8-element table, we get 16 elements. A new 2-element table suffices with independence.
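A small sketch of the saving: store an 8-entry dental joint and a 4-entry Weather table, and recover any of the 32 full-joint entries by multiplication. The Weather probabilities below are illustrative assumptions, not values from the slides.

# With Weather independent of the dental variables, the 32-entry joint
# factors as P(weather) * P(toothache, catch, cavity).
weather = {"sunny": 0.6, "rain": 0.1, "cloudy": 0.29, "snow": 0.01}  # assumed values
dental = {  # 8-entry joint over (cavity, toothache, catch), as used earlier
    (True, True, True): 0.108, (True, True, False): 0.012,
    (True, False, True): 0.072, (True, False, False): 0.008,
    (False, True, True): 0.016, (False, True, False): 0.064,
    (False, False, True): 0.144, (False, False, False): 0.576,
}

def full_joint(w, cavity, toothache, catch):
    """One of the 4 x 8 = 32 entries, recovered from 4 + 8 stored numbers."""
    return weather[w] * dental[(cavity, toothache, catch)]

print(full_joint("cloudy", True, True, True))  # 0.29 * 0.108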
Difficulty with Bayes’ Rule with More than Two Variables
The definition of Bayes' Rule extends naturally to multiple variables:

P(X1, …, Xm | Y1, …, Yn) = P(Y1, …, Yn | X1, …, Xm) P(X1, …, Xm) / P(Y1, …, Yn).

But notice that to apply it we must know conditional probabilities like P(Y1, …, Yn | X1, …, Xm) for all 2^n settings of the y's and all settings of the x's (assuming Booleans). Might as well use the full joint.
Conditional Independence
• X and Y are conditionally independent given Z if and only if P(X,Y|Z) = P(X|Z) P(Y|Z).
• Y1,…,Yn are conditionally independent given X1,…,Xm if and only if P(Y1,…,Yn|X1,…,Xm) = P(Y1|X1,…,Xm) P(Y2|X1,…,Xm) … P(Yn|X1,…,Xm).
• We’ve reduced 2^n·2^m to 2n·2^m. Additional conditional independencies may reduce the 2^m factor.
Conditional Independence
• As with absolute independence, the equivalent forms of X and Y being conditionally independent given Z can also be used:
P(X|Y, Z) = P(X|Z) and
P(Y|X, Z) = P(Y|Z)
Benefits of Conditional Independence
• Allows probabilistic systems to scale up (tabular representations of full joint distributions quickly become too large.)
• Conditional independence is much more commonly available than is absolute independence.
Decomposing a Full Joint by Conditional Independence
• Might assume Toothache and Catch are conditionally independent given Cavity: P(Toothache,Catch|Cavity) = P(Toothache|Cavity) P(Catch|Cavity).
• Then P(Toothache,Catch,Cavity) =[product rule] P(Toothache,Catch|Cavity) P(Cavity) =[conditional independence] P(Toothache|Cavity) P(Catch|Cavity) P(Cavity).
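A small numeric check of this decomposition for one atomic event. The conditional values below are derived from the dental full-joint numbers used earlier, so they are consistent with those slides rather than given explicitly here.

# P(toothache, catch, cavity) should equal
# P(toothache | cavity) * P(catch | cavity) * P(cavity).
p_toothache_given_cavity = 0.6   # (0.108 + 0.012) / 0.2
p_catch_given_cavity = 0.9       # (0.108 + 0.072) / 0.2
p_cavity = 0.2

print(p_toothache_given_cavity * p_catch_given_cavity * p_cavity)  # 0.108, the full-joint entry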
Naive Bayes Algorithm
Let Fi be the i-th feature, vj a value of that feature, and Out be the target feature.
• We can use training data to estimate:
P(Fi = vj)
P(Fi = vj | Out = True)
P(Fi = vj | Out = False)
P(Out = True)
P(Out = False)
Naive Bayes Algorithm
• For a test example described by F1 = v1 , ..., Fn = vn , we need to compute:
P(Out = True | F1 = v1 , ..., Fn = vn )
Applying Bayes' rule:
P(Out = True | F1 = v1 , ..., Fn = vn ) =
P(F1 = v1 , ..., Fn = vn | Out = True) P(Out = True)
_______________________________________
P(F1 = v1 , ..., Fn = vn)
Naive Bayes Algorithm
• By the independence assumption:
P(F1 = v1 , ..., Fn = vn) = P(F1 = v1 )x ...x P(Fn = vn)
• Similarly, the conditional independence assumption gives:
P(F1 = v1 , ..., Fn = vn | Out = True) =
P(F1 = v1 | Out = True) x ...x P(Fn = vn | Out = True)
Naive Bayes Algorithm
P(Out = True | F1 = v1 , ..., Fn = vn ) =
P(F1 = v1 | Out = True) x ...x P(Fn = vn | Out = True)x P(Out = True)
_______________________________________
P(F1 = v1 )x ...x P(Fn = vn)
• All terms are computed using the training data!
• Works well despite the strong assumptions (see [Domingos and Pazzani, MLJ 1997]) and thus provides a simple benchmark test-set accuracy for a new data set.
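A minimal Naive Bayes sketch in Python, estimating the required probabilities from a toy training set and scoring a test example. The names, data layout, and toy numbers are illustrative assumptions; there is no smoothing, and the result is normalized over both outcomes rather than computed with the independent-denominator form shown on the slide.

# Naive Bayes sketch: estimate P(Out) and P(Fi = v | Out) from counts,
# then score a test example using the conditional-independence assumption.
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (feature_tuple, out) pairs, out being True/False."""
    out_counts = Counter()
    feat_counts = defaultdict(Counter)   # feat_counts[(i, out)][value] -> count
    for feats, out in examples:
        out_counts[out] += 1
        for i, v in enumerate(feats):
            feat_counts[(i, out)][v] += 1
    return out_counts, feat_counts

def p_out_true(feats, out_counts, feat_counts):
    """P(Out = True | F1 = v1, ..., Fn = vn), normalized over both outcomes."""
    total = sum(out_counts.values())
    score = {}
    for out in (True, False):
        s = out_counts[out] / total                           # estimate of P(Out)
        for i, v in enumerate(feats):
            s *= feat_counts[(i, out)][v] / out_counts[out]   # estimate of P(Fi = v | Out)
        score[out] = s
    return score[True] / (score[True] + score[False])

# Tiny illustrative data set: features (fever, cough), target "flu".
data = [((True, True), True), ((True, False), True), ((True, True), True),
        ((False, True), False), ((False, False), False), ((False, True), False)]
oc, fc = train(data)
print(p_out_true((True, True), oc, fc))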
Bayesian Networks: Motivation
• Although the full joint distribution can answer any question about the domain, it can become intractably large as the number of variables grows.
• Specifying probabilities for atomic events is rather unnatural and may be very difficult.
• Use a graphical representation for which we can more easily investigate the complexity of inference and can search for efficient inference algorithms.
Bayesian Networks
• Capture independence and conditional independence where they exist, thus reducing the number of probabilities that need to be specified.
• Represent dependencies among variables and encode a concise specification of the full joint distribution.
A Bayesian Network is a ...
• Directed Acyclic Graph (DAG) in which …
• … the nodes denote random variables
• … each node X has a conditional probability distribution P(X|Parents(X)).
• The intuitive meaning of an arc from X to Y is that X directly influences Y.
Additional Terminology
• If X and its parents are discrete, we can
represent the distribution P(X|Parents(X)) by a conditional probability table (CPT) specifying the probability of each value of X given each possible combination of settings for the variables in Parents(X).
• A conditioning case is a row in this CPT (a setting of values for the parent nodes). Each row must sum to 1.
Bayesian Network Semantics
• A Bayesian Network completely specifies a full joint distribution over its random variables, as below -- this is its meaning.
• P(x1, …, xn) = Π_{i=1..n} P(xi | Parents(Xi))
• In the above, P(x1,…,xn) is shorthand notation for P(X1=x1,…,Xn=xn).
Inference Example
• What is the probability that the alarm sounds, but neither a burglary nor an earthquake has occurred, and both John and Mary call?
• Using j for John Calls, a for Alarm, etc.:
P(j, m, a, ¬b, ¬e) =
P(j | a) P(m | a) P(a | ¬b, ¬e) P(¬b) P(¬e) =
(0.9)(0.7)(0.001)(0.999)(0.998) = 0.00062
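A sketch of the same calculation in Python, encoding just the CPT rows this event needs. The dictionary layout is an assumption, and the prior values 0.001 and 0.002 for Burglary and Earthquake are the usual textbook numbers implied by P(¬b) = 0.999 and P(¬e) = 0.998 above.

# BN semantics: multiply one CPT entry per variable.
p_b = {True: 0.001, False: 0.999}        # P(Burglary)
p_e = {True: 0.002, False: 0.998}        # P(Earthquake)
p_a_given = {(False, False): 0.001}      # P(alarm | b, e); other rows omitted
p_j_given = {True: 0.90}                 # P(john calls | alarm)
p_m_given = {True: 0.70}                 # P(mary calls | alarm)

prob = (p_j_given[True] * p_m_given[True] * p_a_given[(False, False)]
        * p_b[False] * p_e[False])
print(prob)   # about 0.000628, shown rounded on the slide as 0.00062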
Chain Rule
• Generalization of the product rule, easily proven by repeated application of the product rule.
• Chain Rule:
  P(x1, …, xn) = P(xn | xn-1, …, x1) P(xn-1 | xn-2, …, x1) … P(x2 | x1) P(x1)
               = Π_{i=1..n} P(xi | xi-1, …, x1)
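A small numeric check of the chain rule on the dental numbers used earlier, with the ordering (cavity, toothache, catch). The ordering and variable names are illustrative choices.

# P(cavity, toothache, catch)
#   = P(catch | toothache, cavity) P(toothache | cavity) P(cavity)
p_cav = 0.2                              # from the marginalization example
p_tooth_given_cav = 0.12 / 0.2           # (0.108 + 0.012) / P(cavity)
p_catch_given_tooth_cav = 0.108 / 0.12   # P(catch, toothache, cavity) / P(toothache, cavity)

print(p_catch_given_tooth_cav * p_tooth_given_cav * p_cav)  # 0.108, the joint entry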
Chain Rule and BN Semantics
BN semantics: P(x1, …, xn) = Π_{i=1..n} P(xi | Parents(Xi))
Key Property: P(Xi | Xi-1, …, X1) = P(Xi | Parents(Xi)), provided Parents(Xi) ⊆ {X1, …, Xi-1}.
Says a node is conditionally independent of its predecessors in the node ordering given its parents, and suggests an incremental procedure for network construction.
Example of the Key Property
• The following conditional independence holds:
P(MaryCalls |JohnCalls, Alarm, Earthquake, Burglary) =
P(MaryCalls | Alarm)
Procedure for BN Construction
• Choose relevant random variables.
• While there are variables left:
1. Choose some next variable Xi and add a node for it.
2. Set Parents(Xi) to a minimal set of existing nodes such that the Key Property (previous slide) is satisfied.
3. Define the conditional distribution P(Xi | Parents(Xi)).
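A schematic Python sketch of this loop. It assumes a key_property_holds oracle (domain knowledge or a statistical test) that can check the Key Property for a candidate parent set; all names here are hypothetical, and step 3 is left out.

# Skeleton of the construction procedure above.
from itertools import combinations

def build_structure(ordered_vars, key_property_holds):
    """Return {variable: parent_tuple} following the three-step loop.

    ordered_vars: variables in the chosen node ordering (step 1 picks them in turn).
    key_property_holds(x, parents, predecessors): hypothetical oracle answering
        whether P(x | predecessors) = P(x | parents).
    """
    parents = {}
    for i, x in enumerate(ordered_vars):
        predecessors = ordered_vars[:i]
        # Step 2: smallest predecessor subset satisfying the Key Property.
        for size in range(len(predecessors) + 1):
            candidates = [c for c in combinations(predecessors, size)
                          if key_property_holds(x, c, predecessors)]
            if candidates:
                parents[x] = candidates[0]
                break
        # Step 3 (defining P(X | Parents(X)) from data or expert judgment)
        # is not shown here.
    return parents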
Principles to Guide Choices
• Goal: build a locally structured (sparse) network -- each component interacts with a bounded number of other components.
• Add root causes first, then the variables that they influence.