Marginalization & Conditioning
• Marginalization (summing out): for any sets of variables Y and Z:
  P(Y) = Σ_z P(Y, z), where the sum ranges over all combinations of values z = (z1, …, zn) of the variables in Z.
• Conditioning (variant of marginalization):
  P(Y) = Σ_z P(Y | z) P(z)
• Often we want to do this for P(Y | X) instead of P(Y). Recall P(Y | X) = P(Y, X) / P(X).
Example of Marginalization
• Using the full joint distribution:
  P(cavity) = Σ_{toothache, catch} P(cavity, toothache, catch)
  = P(cavity, toothache, catch) + P(cavity, toothache, ¬catch) + P(cavity, ¬toothache, catch) + P(cavity, ¬toothache, ¬catch)
  = 0.108 + 0.012 + 0.072 + 0.008
  = 0.2
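A minimal Python sketch of this summing-out, to make the computation concrete. The dictionary layout and function name are illustrative; the two joint entries not shown on the slide (0.144 and 0.576) are the usual textbook values and should be treated as an assumption here.

# Marginalization sketch: P(Y) = sum over z of P(Y, z).
# Keys are (cavity, toothache, catch) truth values.
from itertools import product

joint = {
    (True,  True,  True):  0.108,
    (True,  True,  False): 0.012,
    (True,  False, True):  0.072,
    (True,  False, False): 0.008,
    (False, True,  True):  0.016,
    (False, True,  False): 0.064,
    (False, False, True):  0.144,   # assumed textbook value
    (False, False, False): 0.576,   # assumed textbook value
}

def p_cavity(cavity_value):
    """Sum out Toothache and Catch from the full joint."""
    return sum(joint[(cavity_value, t, c)]
               for t, c in product([True, False], repeat=2))

print(p_cavity(True))   # 0.2, as on the slide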
Inference By Enumeration using Full Joint Distribution
• Let X be a random variable whose probabilities we want to know, given some evidence (values e for a set E of other variables). Let the remaining (unobserved, so-called hidden) variables be Y. The query is P(X|e), and it can be answered using the full joint distribution by
P(X | e) = α P(X, e) = α Σ_y P(X, e, y),

where α = 1 / P(e) is a normalization constant and the sum ranges over all settings y of the hidden variables Y.
Example of Inference By Enumeration using Full Joint Distribution
P(cavity | toothache) = P(cavity, toothache) / P(toothache)
  = [P(cavity, toothache, catch) + P(cavity, toothache, ¬catch)] / P(toothache)
  = (0.108 + 0.012) / [(0.108 + 0.012) + (0.016 + 0.064)] = 0.6

P(¬cavity | toothache) = P(¬cavity, toothache) / P(toothache)
  = [P(¬cavity, toothache, catch) + P(¬cavity, toothache, ¬catch)] / P(toothache)
  = (0.016 + 0.064) / [(0.108 + 0.012) + (0.016 + 0.064)] = 0.4

The normalization constant here is 1 / [(0.108 + 0.012) + (0.016 + 0.064)] = 1 / 0.2 = 5.
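A sketch of the same enumeration in Python; the function name is illustrative, and the joint table is the one from the marginalization sketch above.

# Inference by enumeration: sum out the hidden variable Catch, then
# normalize over the values of Cavity (this is where 1/0.2 = 5 comes from).
joint = {  # full joint over (cavity, toothache, catch), as used earlier
    (True, True, True): 0.108, (True, True, False): 0.012,
    (True, False, True): 0.072, (True, False, False): 0.008,
    (False, True, True): 0.016, (False, True, False): 0.064,
    (False, False, True): 0.144, (False, False, False): 0.576,
}

def p_cavity_given_toothache(cavity_value):
    numerator = sum(joint[(cavity_value, True, c)] for c in (True, False))
    p_toothache = sum(joint[(cv, True, c)]
                      for cv in (True, False) for c in (True, False))
    return numerator / p_toothache

print(p_cavity_given_toothache(True))   # 0.6
print(p_cavity_given_toothache(False))  # 0.4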
Independence
• Propositions a and b are independent if and only if P(a, b) = P(a) P(b).
• Equivalently (by the product rule): P(a | b) = P(a).
• Equivalently: P(b | a) = P(b).
Illustration of Independence
• We know (product rule) that
  P(toothache, catch, cavity, Weather = cloudy) =
  P(Weather = cloudy | toothache, catch, cavity) P(toothache, catch, cavity).
• By independence:
  P(Weather = cloudy | toothache, catch, cavity) = P(Weather = cloudy).
• Therefore we have that
  P(toothache, catch, cavity, Weather = cloudy) =
  P(Weather = cloudy) P(toothache, catch, cavity).
Illustration continued
• Allows us to represent a 32-element table for full joint on Weather, Toothache, Catch, Cavity by an 8-element table for the joint of Toothache, Catch, Cavity, and a 4-element table for Weather.
• If we add a Boolean variable X to the 8-element table, we get 16 elements. A new 2-element table suffices with independence.
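A small sketch of the saving: store an 8-entry dental joint and a 4-entry Weather table, and recover any of the 32 full-joint entries by multiplication. The Weather probabilities below are illustrative assumptions, not values from the slides.

# With Weather independent of the dental variables, the 32-entry joint
# factors as P(weather) * P(toothache, catch, cavity).
weather = {"sunny": 0.6, "rain": 0.1, "cloudy": 0.29, "snow": 0.01}  # assumed values
dental = {  # 8-entry joint over (cavity, toothache, catch), as used earlier
    (True, True, True): 0.108, (True, True, False): 0.012,
    (True, False, True): 0.072, (True, False, False): 0.008,
    (False, True, True): 0.016, (False, True, False): 0.064,
    (False, False, True): 0.144, (False, False, False): 0.576,
}

def full_joint(w, cavity, toothache, catch):
    """One of the 4 x 8 = 32 entries, recovered from 4 + 8 stored numbers."""
    return weather[w] * dental[(cavity, toothache, catch)]

print(full_joint("cloudy", True, True, True))  # 0.29 * 0.108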
Difficulty with Bayes’ Rule with More than Two Variables
The definition of Bayes' Rule extends naturally to multiple variables:

P(X1, …, Xm | Y1, …, Yn) = P(Y1, …, Yn | X1, …, Xm) P(X1, …, Xm) / P(Y1, …, Yn).

But notice that to apply it we must know conditional probabilities like P(Y1, …, Yn | X1, …, Xm) for all 2^n settings of the y's and all settings of the x's (assuming Booleans). Might as well use the full joint.
Conditional Independence
• X and Y are conditionally independent given Z if and only if P(X,Y|Z) = P(X|Z) P(Y|Z).
• Y1,…,Yn are conditionally independent given X1,…,Xm if and only if P(Y1,…,Yn|X1,…,Xm) = P(Y1|X1,…,Xm) P(Y2|X1,…,Xm) … P(Yn|X1,…,Xm).
• We’ve reduced 2^n·2^m to 2n·2^m. Additional conditional independencies may reduce the 2^m factor.
Conditional Independence
• As with absolute independence, the equivalent forms of X and Y being conditionally independent given Z can also be used:
P(X|Y, Z) = P(X|Z) and
P(Y|X, Z) = P(Y|Z)
Benefits of Conditional Independence
• Allows probabilistic systems to scale up (tabular representations of full joint distributions quickly become too large.)
• Conditional independence is much more commonly available than is absolute independence.
Decomposing a Full Joint by Conditional Independence
• Might assume Toothache and Catch are conditionally independent given Cavity: P(Toothache,Catch|Cavity) = P(Toothache|Cavity) P(Catch|Cavity).
• Then P(Toothache,Catch,Cavity) =[product rule] P(Toothache,Catch|Cavity) P(Cavity) =[conditional independence] P(Toothache|Cavity) P(Catch|Cavity) P(Cavity).
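A small numeric check of this decomposition for one atomic event. The conditional values below are derived from the dental full-joint numbers used earlier, so they are consistent with those slides rather than given explicitly here.

# P(toothache, catch, cavity) should equal
# P(toothache | cavity) * P(catch | cavity) * P(cavity).
p_toothache_given_cavity = 0.6   # (0.108 + 0.012) / 0.2
p_catch_given_cavity = 0.9       # (0.108 + 0.072) / 0.2
p_cavity = 0.2

print(p_toothache_given_cavity * p_catch_given_cavity * p_cavity)  # 0.108, the full-joint entry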
Naive Bayes Algorithm
Let Fi be the i-th feature, vj a value of that feature, and Out be the target feature.
• We can use training data to estimate:
P(Fi = vj)
P(Fi = vj | Out = True)
P(Fi = vj | Out = False)
P(Out = True)
P(Out = False)
Naive Bayes Algorithm
• For a test example described by F1 = v1 , ..., Fn = vn , we need to compute:
P(Out = True | F1 = v1 , ..., Fn = vn )
Applying Bayes' rule:
P(Out = True | F1 = v1 , ..., Fn = vn ) =
P(F1 = v1 , ..., Fn = vn | Out = True) P(Out = True)
_______________________________________
P(F1 = v1 , ..., Fn = vn)
Naive Bayes Algorithm
• By the independence assumption:
P(F1 = v1 , ..., Fn = vn) = P(F1 = v1 )x ...x P(Fn = vn)
• Similarly, the conditional independence assumption gives:
P(F1 = v1 , ..., Fn = vn | Out = True) =
P(F1 = v1 | Out = True) x ...x P(Fn = vn | Out = True)
Naive Bayes Algorithm
P(Out = True | F1 = v1 , ..., Fn = vn ) =
P(F1 = v1 | Out = True) x ...x P(Fn = vn | Out = True)x P(Out = True)
_______________________________________
P(F1 = v1 )x ...x P(Fn = vn)
• All terms are computed using the training data!
• Works well despite the strong assumptions (see [Domingos and Pazzani, MLJ 1997]) and thus provides a simple benchmark test-set accuracy for a new data set.
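A minimal Naive Bayes sketch in Python, estimating the required probabilities from a toy training set and scoring a test example. The names, data layout, and toy numbers are illustrative assumptions; there is no smoothing, and the result is normalized over both outcomes rather than computed with the independent-denominator form shown on the slide.

# Naive Bayes sketch: estimate P(Out) and P(Fi = v | Out) from counts,
# then score a test example using the conditional-independence assumption.
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (feature_tuple, out) pairs, out being True/False."""
    out_counts = Counter()
    feat_counts = defaultdict(Counter)   # feat_counts[(i, out)][value] -> count
    for feats, out in examples:
        out_counts[out] += 1
        for i, v in enumerate(feats):
            feat_counts[(i, out)][v] += 1
    return out_counts, feat_counts

def p_out_true(feats, out_counts, feat_counts):
    """P(Out = True | F1 = v1, ..., Fn = vn), normalized over both outcomes."""
    total = sum(out_counts.values())
    score = {}
    for out in (True, False):
        s = out_counts[out] / total                           # estimate of P(Out)
        for i, v in enumerate(feats):
            s *= feat_counts[(i, out)][v] / out_counts[out]   # estimate of P(Fi = v | Out)
        score[out] = s
    return score[True] / (score[True] + score[False])

# Tiny illustrative data set: features (fever, cough), target "flu".
data = [((True, True), True), ((True, False), True), ((True, True), True),
        ((False, True), False), ((False, False), False), ((False, True), False)]
oc, fc = train(data)
print(p_out_true((True, True), oc, fc))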
Bayesian Networks: Motivation
• Although the full joint distribution can answer any question about the domain, it can become intractably large as the number of variables grows.
• Specifying probabilities for atomic events is rather unnatural and may be very difficult.
• Use a graphical representation for which we can more easily investigate the complexity of inference and can search for efficient inference algorithms.
Bayesian Networks
• Capture independence and conditional independence where they exist, thus reducing the number of probabilities that need to be specified.
• Represent dependencies among variables and encode a concise specification of the full joint distribution.
A Bayesian Network is a ...
• Directed Acyclic Graph (DAG) in which …
• … the nodes denote random variables
• … each node X has a conditional probability distribution P(X|Parents(X)).
• The intuitive meaning of an arc from X to Y is that X directly influences Y.
Additional Terminology
• If X and its parents are discrete, we can
represent the distribution P(X|Parents(X)) by a conditional probability table (CPT) specifying the probability of each value of X given each possible combination of settings for the variables in Parents(X).
• A conditioning case is a row in this CPT (a setting of values for the parent nodes). Each row must sum to 1.
Bayesian Network Semantics
• A Bayesian Network completely specifies a full joint distribution over its random variables, as below -- this is its meaning.
• P(x1, …, xn) = Π_{i=1..n} P(xi | Parents(Xi))
• In the above, P(x1,…,xn) is shorthand notation for P(X1=x1,…,Xn=xn).
Inference Example
• What is the probability that the alarm sounds, but neither a burglary nor an earthquake has occurred, and both John and Mary call?
• Using j for John Calls, a for Alarm, etc.:
P(j, m, a, ¬b, ¬e) =
P(j | a) P(m | a) P(a | ¬b, ¬e) P(¬b) P(¬e) =
(0.9)(0.7)(0.001)(0.999)(0.998) = 0.00062
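A sketch of the same calculation in Python, encoding just the CPT rows this event needs. The dictionary layout is an assumption, and the prior values 0.001 and 0.002 for Burglary and Earthquake are the usual textbook numbers implied by P(¬b) = 0.999 and P(¬e) = 0.998 above.

# BN semantics: multiply one CPT entry per variable.
p_b = {True: 0.001, False: 0.999}        # P(Burglary)
p_e = {True: 0.002, False: 0.998}        # P(Earthquake)
p_a_given = {(False, False): 0.001}      # P(alarm | b, e); other rows omitted
p_j_given = {True: 0.90}                 # P(john calls | alarm)
p_m_given = {True: 0.70}                 # P(mary calls | alarm)

prob = (p_j_given[True] * p_m_given[True] * p_a_given[(False, False)]
        * p_b[False] * p_e[False])
print(prob)   # about 0.000628, shown rounded on the slide as 0.00062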
Chain Rule
• Generalization of the product rule, easily proven by repeated application of the product rule.
• Chain Rule:
  P(x1, …, xn) = P(xn | xn-1, …, x1) P(xn-1 | xn-2, …, x1) … P(x2 | x1) P(x1)
               = Π_{i=1..n} P(xi | xi-1, …, x1)
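A small numeric check of the chain rule on the dental numbers used earlier, with the ordering (cavity, toothache, catch). The ordering and variable names are illustrative choices.

# P(cavity, toothache, catch)
#   = P(catch | toothache, cavity) P(toothache | cavity) P(cavity)
p_cav = 0.2                              # from the marginalization example
p_tooth_given_cav = 0.12 / 0.2           # (0.108 + 0.012) / P(cavity)
p_catch_given_tooth_cav = 0.108 / 0.12   # P(catch, toothache, cavity) / P(toothache, cavity)

print(p_catch_given_tooth_cav * p_tooth_given_cav * p_cav)  # 0.108, the joint entry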
Chain Rule and BN Semantics
BN semantics: P(x1, …, xn) = Π_{i=1..n} P(xi | Parents(Xi))
Key Property: P(Xi | Xi-1, …, X1) = P(Xi | Parents(Xi)), provided Parents(Xi) ⊆ {X1, …, Xi-1}.
Says a node is conditionally independent of its predecessors in the node ordering given its parents, and suggests an incremental procedure for network construction.
Example of the Key Property
• The following conditional independence holds:
P(MaryCalls |JohnCalls, Alarm, Earthquake, Burglary) =
P(MaryCalls | Alarm)
Procedure for BN Construction
• Choose relevant random variables.
• While there are variables left:
1. Choose some next variable Xi and add a node for it.
2. Set Parents(Xi) to a minimal set of existing nodes such that the Key Property (previous slide) is satisfied.
3. Define the conditional distribution P(Xi | Parents(Xi)).
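A schematic Python sketch of this loop. It assumes a key_property_holds oracle (domain knowledge or a statistical test) that can check the Key Property for a candidate parent set; all names here are hypothetical, and step 3 is left out.

# Skeleton of the construction procedure above.
from itertools import combinations

def build_structure(ordered_vars, key_property_holds):
    """Return {variable: parent_tuple} following the three-step loop.

    ordered_vars: variables in the chosen node ordering (step 1 picks them in turn).
    key_property_holds(x, parents, predecessors): hypothetical oracle answering
        whether P(x | predecessors) = P(x | parents).
    """
    parents = {}
    for i, x in enumerate(ordered_vars):
        predecessors = ordered_vars[:i]
        # Step 2: smallest predecessor subset satisfying the Key Property.
        for size in range(len(predecessors) + 1):
            candidates = [c for c in combinations(predecessors, size)
                          if key_property_holds(x, c, predecessors)]
            if candidates:
                parents[x] = candidates[0]
                break
        # Step 3 (defining P(X | Parents(X)) from data or expert judgment)
        # is not shown here.
    return parents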
Principles to Guide Choices
• Goal: build a locally structured (sparse) network -- each component interacts with a bounded number of other components.
• Add root causes first, then the variables that they influence.