

Aug 25th, 2001

Copyright © 2001, Andrew W. Moore

Slide 1

Probabilistic Models

Brigham S. Anderson

School of Computer Science

Carnegie Mellon University

www.cs.cmu.edu/~brigham

brigham@cmu.edu

Probability = relative area

2

OUTLINE

Probability
• Axioms
• Rules

Probabilistic Models
• Probability Tables
• Bayes Nets

Machine Learning
• Inference
• Anomaly detection
• Classification

3

Overview

The point of this section:

1. Introduce probability models.

2. Introduce Bayes Nets as powerful probability models.

4

Representing P(A,B,C)

• How can we represent the function P(A,B,C)?

• It ideally should contain all the information in this Venn diagram

[Venn diagram of events A, B, C with region probabilities 0.05, 0.25, 0.10, 0.05, 0.05, 0.10, 0.10, 0.30]

5

Recipe for making a joint distribution of M variables:

1. Make a truth table listing all combinations of values of your variables (if there are M boolean variables then the table will have 2^M rows).

2. For each combination of values, say how probable it is.

3. If you subscribe to the axioms of probability, those numbers must sum to 1.

A B C Prob

0 0 0 0.30

0 0 1 0.05

0 1 0 0.10

0 1 1 0.05

1 0 0 0.05

1 0 1 0.10

1 1 0 0.25

1 1 1 0.10

Example: P(A, B, C)

[Venn diagram of events A, B, C with region probabilities 0.05, 0.25, 0.10, 0.05, 0.05, 0.10, 0.10, 0.30]

The Joint Probability Table

6

Using the Prob. Table

Once you have the PT you can ask for the probability of any logical expression involving your attributes:

P(E) = Σ_{rows matching E} P(row)
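A minimal sketch of this rule (not from the slides): the joint table for three boolean attributes as a Python dict, with P(E) computed by summing the probabilities of the rows that satisfy a logical expression E.

```python
# Joint probability table from the example above: (A, B, C) -> probability.
joint = {
    (0, 0, 0): 0.30, (0, 0, 1): 0.05, (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10, (1, 1, 0): 0.25, (1, 1, 1): 0.10,
}

def prob(joint, expr):
    """P(E): sum P(row) over all rows for which expr(row) is true."""
    return sum(p for row, p in joint.items() if expr(row))

print(prob(joint, lambda r: r[0] == 1))                # P(A)        = 0.50
print(prob(joint, lambda r: r[0] == 1 and r[1] == 0))  # P(A ^ ~B)   = 0.15
```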

7

Using the Prob. Table

P(Poor, Male) = 0.4654

P(E) = Σ_{rows matching E} P(row)

8

Using the Prob. Table

P(Poor) = 0.7604

P(E) = Σ_{rows matching E} P(row)

9

Inference with the Prob. Table

P(E1 | E2) = P(E1, E2) / P(E2) = Σ_{rows matching E1 and E2} P(row) / Σ_{rows matching E2} P(row)

10

Inference with the Prob. Table

P(E1 | E2) = P(E1, E2) / P(E2) = Σ_{rows matching E1 and E2} P(row) / Σ_{rows matching E2} P(row)

P(Male | Poor) = 0.4654 / 0.7604 = 0.612
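A minimal sketch of this conditional-probability computation (not from the slides), reusing the `joint` dict convention from the earlier sketch; the census table behind P(Male | Poor) is not reproduced here, so the example below runs on the A, B, C table instead.

```python
def cond_prob(joint, e1, e2):
    """P(E1 | E2): ratio of two row sums over the joint table."""
    num = sum(p for row, p in joint.items() if e1(row) and e2(row))
    den = sum(p for row, p in joint.items() if e2(row))
    return num / den

# P(A | B) = P(A, B) / P(B) = 0.35 / 0.50 = 0.70 on the example table.
print(cond_prob(joint, lambda r: r[0] == 1, lambda r: r[1] == 1))
```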

11

Inference is a big deal

• I've got this evidence. What's the chance that this conclusion is true?
• I've got a sore neck: how likely am I to have meningitis?
• I see my lights are out and it's 9pm. What's the chance my spouse is already asleep?
• I see that there's a long queue by my gate and the clouds are dark and I'm traveling by US Airways. What's the chance I'll be home tonight? (Answer: 0.00000001)
• There's a thriving set of industries growing based around Bayesian Inference. Highlights are: Medicine, Pharma, Help Desk Support, Engine Fault Diagnosis.

12

The Full Probability Table

Problem: We usually don't know P(A,B,C)!

Solutions:
1. Construct it analytically
2. Learn it from data: P̂(A,B,C)

A B C Prob

0 0 0 0.30

0 0 1 0.05

0 1 0 0.10

0 1 1 0.05

1 0 0 0.05

1 0 1 0.10

1 1 0 0.25

1 1 1 0.10

P(A,B,C)

13

Learning a Probability Table

Build a PT for your attributes in which the probabilities are unspecified

Then fill in each row with:

P̂(row) = (# records matching row) / (total number of records)

A B C Prob

0 0 0 ?

0 0 1 ?

0 1 0 ?

0 1 1 ?

1 0 0 ?

1 0 1 ?

1 1 0 ?

1 1 1 ?

A B C Prob

0 0 0 0.30

0 0 1 0.05

0 1 0 0.10

0 1 1 0.05

1 0 0 0.05

1 0 1 0.10

1 1 0 0.25

1 1 1 0.10

(e.g. the entry for A=1, B=1, C=0 is the fraction of all records in which A and B are True but C is False)
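A minimal sketch of this counting rule; the six records below are invented for illustration only (they are not the census data used on the later slides).

```python
from collections import Counter

# Made-up (A, B, C) records; in practice these come from your dataset.
records = [(1, 1, 0), (0, 0, 0), (1, 1, 0), (0, 1, 1), (1, 0, 1), (0, 0, 0)]

counts = Counter(records)
pt_hat = {row: c / len(records) for row, c in counts.items()}

print(pt_hat[(1, 1, 0)])  # 2/6: fraction of records with A=1, B=1, C=0
```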

14

Example of Learning a Prob. Table

• This PTable was obtained by learning from three attributes in the UCI “Adult” Census Database [Kohavi 1995]

15

Where are we?

• We have covered the fundamentals of probability

• We have become content with PTables

• And we even know how to learn PTables from data.

16

Probability Table as Anomaly Detector

• Our PTable is our first example of something called an Anomaly Detector.

• An anomaly detector tells you how likely a datapoint is given a model.

[Diagram: Data point x → Anomaly Detector → P(x)]

17

Evaluating a Model

• Given a record x, a model M can tell you how likely the record is:

P̂(x | M)

• Given a dataset with R records, a model can tell you how likely the dataset is:

P̂(dataset | M) = P̂(x_1, x_2, …, x_R | M) = Π_{k=1..R} P̂(x_k | M)

(Under the assumption that all records were independently generated from the model's probability function.)

18

A small dataset: Miles Per Gallon

From the UCI repository (thanks to Ross Quinlan)

192 Training Set Records

mpg modelyear maker

good 75to78 asia
bad  70to74 america
bad  75to78 europe
bad  70to74 america
bad  70to74 america
bad  70to74 asia
bad  70to74 asia
bad  75to78 america
:    :      :
bad  70to74 america
good 79to83 america
bad  75to78 america
good 79to83 america
bad  75to78 america
good 79to83 america
good 79to83 america
bad  70to74 america
good 75to78 europe
bad  75to78 europe

19

A small dataset: Miles Per Gallon

192 Training Set Records

mpg modelyear maker

[same mpg / modelyear / maker table as the previous slide]

20

A small dataset: Miles Per Gallon

192 Training Set Records

mpg modelyear maker

[same mpg / modelyear / maker table as the previous slides]

P̂(dataset | M) = P̂(x_1, x_2, …, x_R | M) = Π_{k=1..R} P̂(x_k | M) = 3.4 × 10^-203 (in this case)

21

Log Probabilities

Since probabilities of datasets get so small we usually use log probabilities

log P̂(dataset | M) = log Π_{k=1..R} P̂(x_k | M) = Σ_{k=1..R} log P̂(x_k | M)
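A minimal sketch of this log-likelihood computation; `prob_of_record` is an assumed callable (e.g. a lookup into the learned probability table), not something defined on the slides.

```python
import math

def dataset_log_likelihood(records, prob_of_record):
    """log P_hat(dataset | M) = sum over k of log P_hat(x_k | M)."""
    return sum(math.log(prob_of_record(x)) for x in records)
```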

22

A small dataset: Miles Per Gallon

192 Training Set Records

mpg modelyear maker

[same mpg / modelyear / maker table as the previous slides]

log P̂(dataset | M) = Σ_{k=1..R} log P̂(x_k | M) = -466.19 (in this case)

23

Summary: The Good News

Prob. Tables allow us to learn P(X) from data.

• Can do anomaly detection: spot suspicious / incorrect records
• Can do inference: P(E1|E2). Automatic Doctor, Help Desk, etc.
• Can do Bayesian classification. Coming soon…

24

Summary: The Bad News

Learning a Prob. Table for P(X) is for Cowboy data miners

• Those probability tables get big fast!
• With so many parameters relative to our data, the resulting model could be pretty wild.

25

Using a test set

An independent test set with 196 cars has a worse log likelihood

(actually it’s a billion quintillion quintillion quintillion quintillion times less likely)

….Density estimators can overfit. And the full joint density estimator is the overfittiest of them all!

26

Overfitting Prob. Tables

log P̂(testset | M) = Σ_{k=1..R} log P̂(x_k | M) = -∞ if P̂(x_k | M) = 0 for any k

If this ever happens, it means there are certain combinations that we learn are impossible.

27

Using a test set

The only reason that our test set didn't score -infinity is that Andrew's code is hard-wired to always predict a probability of at least one in 10^20.

We need P() estimators that are less prone to overfitting
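A tiny sketch of the floor the slide describes (the actual code behind the slides is not shown): clamp every record probability at 1e-20 so an unseen combination contributes log(1e-20) rather than minus infinity to the test-set log-likelihood.

```python
import math

FLOOR = 1e-20  # assumed value, matching the "one in 10^20" mentioned above

def safe_log_prob(p):
    """Log probability with a hard floor, so a zero estimate never gives -inf."""
    return math.log(max(p, FLOOR))
```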

28

Naïve Density Estimation

The problem with the Probability Table is that it just mirrors the training data.

In fact, it is just another way of storing the data: we could reconstruct the original dataset perfectly from our Table.

We need something which generalizes more usefully.

The naïve model generalizes strongly: Assume that each attribute is distributed independently of any of the other attributes.

29

Probability Models

Full Prob. Table:
• No assumptions
• Overfitting-prone
• Scales horribly

Naïve Density:
• Strong assumptions
• Overfitting-resistant
• Scales incredibly well

30

Independently Distributed Data

• Let x[i] denote the i'th field of record x.
• The independently distributed assumption says that for any i, v, u_1, u_2, …, u_{i-1}, u_{i+1}, …, u_M:

P(x[i]=v | x[1]=u_1, x[2]=u_2, …, x[i-1]=u_{i-1}, x[i+1]=u_{i+1}, …, x[M]=u_M) = P(x[i]=v)

• Or in other words, x[i] is independent of {x[1], x[2], …, x[i-1], x[i+1], …, x[M]}.
• This is often written as x[i] ⊥ {x[1], x[2], …, x[i-1], x[i+1], …, x[M]}.

31

A note about independence

• Assume A and B are Boolean Random Variables. Then

“A and B are independent”

if and only if

P(A|B) = P(A)

• “A and B are independent” is often notated as

A ⊥ B

32

Independence Theorems

• Assume P(A|B) = P(A). Then P(A, B) = P(A) P(B)

• Assume P(A|B) = P(A). Then P(B|A) = P(B)

33

Independence Theorems

• Assume P(A|B) = P(A). Then P(~A|B) = P(~A)

• Assume P(A|B) = P(A). Then P(A|~B) = P(A)

34

Multivalued Independence

For Random Variables A and B, A ⊥ B if and only if

∀ u, v: P(A=u | B=v) = P(A=u)

from which you can then prove things like…

∀ u, v: P(A=u, B=v) = P(A=u) P(B=v)
∀ u, v: P(B=v | A=u) = P(B=v)

35

Back to Naïve Density Estimation

• Let x[i] denote the i'th field of record x.
• Naïve DE assumes x[i] is independent of {x[1], x[2], …, x[i-1], x[i+1], …, x[M]}
• Example:

• Suppose that each record is generated by randomly shaking a green die and a red die

• Dataset 1: A = red value, B = green value

• Dataset 2: A = red value, B = sum of values

• Dataset 3: A = sum of values, B = difference of values

• Which of these datasets violates the naïve assumption?

36

Using the Naïve Distribution

• Once you have a Naïve Distribution you can easily compute any row of the joint distribution.

• Suppose A, B, C and D are independently distributed.

What is P(A^~B^C^~D)?

= P(A) P(~B) P(C) P(~D)

37

Naïve Distribution General Case

• Suppose x[1], x[2], … x[M] are independently distributed.

P(x[1]=u_1, x[2]=u_2, …, x[M]=u_M) = Π_{k=1..M} P(x[k]=u_k)

• So if we have a Naïve Distribution we can construct any row of the implied Joint Distribution on demand.

• So we can do any inference*
• But how do we learn a Naïve Density Estimator?

38

Learning a Naïve Density Estimator

P̂(x[i]=u) = (# records in which x[i]=u) / (total number of records)

Another trivial learning algorithm!
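A minimal sketch of that algorithm plus the product rule from the previous slide (my own representation, not code from the lecture): estimate each attribute's marginal by counting, then score any full record as the product of its marginals.

```python
from collections import Counter

def learn_naive(records, num_attributes):
    """Per-attribute {value: probability} dicts, estimated by counting."""
    n = len(records)
    return [
        {v: c / n for v, c in Counter(r[i] for r in records).items()}
        for i in range(num_attributes)
    ]

def naive_prob(marginals, record):
    """P(x[1]=u1, ..., x[M]=uM) = product over k of P_hat(x[k]=uk)."""
    p = 1.0
    for i, value in enumerate(record):
        p *= marginals[i].get(value, 0.0)
    return p
```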

39

Contrast

Joint DE:
• Can model anything
• No problem to model "C is a noisy copy of A"
• Given 100 records and more than 6 boolean attributes will screw up badly

Naïve DE:
• Can model only very boring distributions
• "C is a noisy copy of A" is outside Naïve's scope
• Given 100 records and 10,000 multivalued attributes will be fine

40

Empirical Results: “Hopeless”

The “hopeless” dataset consists of 40,000 records and 21 boolean attributes called a,b,c, … u. Each attribute in each record is generated 50-50 randomly as 0 or 1.

Despite the vast amount of data, “Joint” overfits hopelessly and does much worse

Average test set log probability during 10 folds of k-fold cross-validation* (*described in a future Andrew lecture)

41

Empirical Results: "Logical"

The "logical" dataset consists of 40,000 records and 4 boolean attributes called a, b, c, d where a, b, c are generated 50-50 randomly as 0 or 1. D = A^~C, except that in 10% of records it is flipped.

[Figures: the DE learned by "Joint" and the DE learned by "Naive" (not reproduced here)]

42

Empirical Results: "Logical"

The "logical" dataset consists of 40,000 records and 4 boolean attributes called a, b, c, d where a, b, c are generated 50-50 randomly as 0 or 1. D = A^~C, except that in 10% of records it is flipped.

[Figures: the DE learned by "Joint" and the DE learned by "Naive" (not reproduced here)]

43

Empirical Results: "MPG"

The "MPG" dataset consists of 392 records and 8 attributes.

[Figures: a tiny part of the DE learned by "Joint", and the DE learned by "Naive" (not reproduced here)]

44

Empirical Results: "Weight vs MPG"

Suppose we train only from the "Weight" and "MPG" attributes.

[Figures: the DE learned by "Joint" and the DE learned by "Naive" (not reproduced here)]

45

Empirical Results: "Weight vs MPG"

Suppose we train only from the "Weight" and "MPG" attributes.

[Figures: the DE learned by "Joint" and the DE learned by "Naive" (not reproduced here)]

46

"Weight vs MPG": The best that Naïve can do

[Figures: the DE learned by "Joint" and the DE learned by "Naive" (not reproduced here)]

47

Reminder: The Good News

• We have two ways to learn a Density Estimator from data.

• There are vastly more impressive* Density Estimators
  *(Mixture Models, Bayesian Networks, Density Trees, Kernel Densities and many more)
• Density estimators can do many good things…
  • Anomaly detection
  • Inference: P(E1|E2) (Automatic Doctor / Help Desk etc)
  • Ingredient for Bayes Classifiers

48

Probability Models

Full Prob. Table:
• No assumptions
• Overfitting-prone
• Scales horribly

Naïve Density:
• Strong assumptions
• Overfitting-resistant
• Scales incredibly well

Bayes Nets:
• Carefully chosen assumptions
• Overfitting and scaling properties depend on assumptions

49

Bayes Nets

50

Bayes Nets

• What are they?
  Bayesian nets are a network-based framework for representing and analyzing models involving uncertainty.
• What are they used for?
  Intelligent decision aids, data fusion, 3-E feature recognition, intelligent diagnostic aids, automated free text understanding, data mining.
• Where did they come from?
  Cross-fertilization of ideas between the artificial intelligence, decision analysis, and statistics communities.
• Why the sudden interest?
  Development of propagation algorithms followed by availability of easy-to-use commercial software, and a growing number of creative applications.
• How are they different from other knowledge representation and probabilistic analysis tools?
  Different from other knowledge-based systems tools because uncertainty is handled in a mathematically rigorous yet efficient and simple way.

51

Bayes Net Concepts

1. Chain Rule: P(A,B) = P(A) P(B|A)
2. Conditional Independence: P(B|A) = P(B)

52

A Simple Bayes Net

• Let's assume that we already have P(Mpg, Horse).

How would you rewrite this using the Chain rule?

P(Mpg, Horse):
  P(good, low) = 0.36, P(good, high) = 0.04
  P(bad, low) = 0.12, P(bad, high) = 0.48

P(Mpg, Horse) = ?

53

Review: Chain Rule

P(Mpg, Horse):
  P(good, low) = 0.36, P(good, high) = 0.04
  P(bad, low) = 0.12, P(bad, high) = 0.48

P(Mpg, Horse) = P(Mpg) * P(Horse | Mpg)

P(Mpg): P(good) = 0.4, P(bad) = 0.6
P(Horse | Mpg): P(low|good) = 0.89, P(low|bad) = 0.21, P(high|good) = 0.11, P(high|bad) = 0.79

54

Review: Chain Rule

P(Mpg, Horse):
  P(good, low) = 0.36, P(good, high) = 0.04
  P(bad, low) = 0.12, P(bad, high) = 0.48

P(Mpg, Horse) = P(Mpg) * P(Horse | Mpg)

P(Mpg): P(good) = 0.4, P(bad) = 0.6
P(Horse | Mpg): P(low|good) = 0.89, P(low|bad) = 0.21, P(high|good) = 0.11, P(high|bad) = 0.79

P(good, low) = P(good) * P(low|good) = 0.4 * 0.89
P(good, high) = P(good) * P(high|good) = 0.4 * 0.11
P(bad, low) = P(bad) * P(low|bad) = 0.6 * 0.21
P(bad, high) = P(bad) * P(high|bad) = 0.6 * 0.79

55

How to Make a Bayes Net

P(Mpg, Horse) = P(Mpg) * P(Horse | Mpg)

[Bayes net: Mpg → Horse]

56

How to Make a Bayes Net

P(Mpg, Horse) = P(Mpg) * P(Horse | Mpg)

[Bayes net: Mpg → Horse]

P(Mpg): P(good) = 0.4, P(bad) = 0.6
P(Horse | Mpg): P(low|good) = 0.89, P(low|bad) = 0.21, P(high|good) = 0.11, P(high|bad) = 0.79

57

How to Make a Bayes Net

[Bayes net: Mpg → Horse]

P(Mpg): P(good) = 0.4, P(bad) = 0.6
P(Horse | Mpg): P(low|good) = 0.89, P(low|bad) = 0.21, P(high|good) = 0.11, P(high|bad) = 0.79

• Each node is a probability function
• Each arc denotes conditional dependence

58

How to Make a Bayes Net

So, what have we accomplished thus far?

Nothing; we’ve just “Bayes Net-ified”

P(Mpg, Horse) using the Chain rule.

…the real excitement starts when we wield conditional independence

[Bayes net: Mpg → Horse, with node tables P(Mpg) and P(Horse|Mpg)]

59

How to Make a Bayes Net

Before we apply conditional independence, we need a worthier opponent than puny P(Mpg, Horse)…

We’ll use P(Mpg, Horse, Accel)

60

How to Make a Bayes Net

Rewrite joint using the Chain rule.

P(Mpg, Horse, Accel) = P(Mpg) P(Horse | Mpg) P(Accel | Mpg, Horse)

Note: Obviously, we could have written this 3! = 6 different ways…

P(M, H, A) = P(M) * P(H|M) * P(A|M,H)
           = P(M) * P(A|M) * P(H|M,A)
           = P(H) * P(M|H) * P(A|H,M)
           = P(H) * P(A|H) * P(M|H,A)
           = …

61

How to Make a Bayes Net

[Bayes net: Mpg → Horse, Mpg → Accel, Horse → Accel]

Rewrite joint using the Chain rule.

P(Mpg, Horse, Accel) = P(Mpg) P(Horse | Mpg) P(Accel | Mpg, Horse)

62

How to Make a Bayes Net

[Bayes net: Mpg → Horse, Mpg → Accel, Horse → Accel]

P(Mpg): P(good) = 0.4, P(bad) = 0.6
P(Horse | Mpg): P(low|good) = 0.89, P(low|bad) = 0.21, P(high|good) = 0.11, P(high|bad) = 0.79
P(Accel | Mpg, Horse):
  P(low|good, low) = 0.97, P(low|good, high) = 0.15
  P(low|bad, low) = 0.90, P(low|bad, high) = 0.05
  P(high|good, low) = 0.03, P(high|good, high) = 0.85
  P(high|bad, low) = 0.10, P(high|bad, high) = 0.95

* Note: I made these up…

63

How to Make a Bayes Net

[Bayes net: Mpg → Horse, Mpg → Accel, Horse → Accel]

P(Mpg): P(good) = 0.4, P(bad) = 0.6
P(Horse | Mpg): P(low|good) = 0.89, P(low|bad) = 0.21, P(high|good) = 0.11, P(high|bad) = 0.79
P(Accel | Mpg, Horse):
  P(low|good, low) = 0.97, P(low|good, high) = 0.15
  P(low|bad, low) = 0.90, P(low|bad, high) = 0.05
  P(high|good, low) = 0.03, P(high|good, high) = 0.85
  P(high|bad, low) = 0.10, P(high|bad, high) = 0.95

A Miracle Occurs!

You are told by God (or another domain expert) that Accel is independent of Mpg given Horse!

P(Accel | Mpg, Horse) = P(Accel | Horse)

64

How to Make a Bayes Net

[Bayes net: Mpg → Horse → Accel]

P(Mpg): P(good) = 0.4, P(bad) = 0.6
P(Horse | Mpg): P(low|good) = 0.89, P(low|bad) = 0.21, P(high|good) = 0.11, P(high|bad) = 0.79
P(Accel | Horse): P(low|low) = 0.95, P(low|high) = 0.11, P(high|low) = 0.05, P(high|high) = 0.89

65

How to Make a Bayes Net

[Bayes net: Mpg → Horse → Accel]

P(Mpg): P(good) = 0.4, P(bad) = 0.6
P(Horse | Mpg): P(low|good) = 0.89, P(low|bad) = 0.21, P(high|good) = 0.11, P(high|bad) = 0.79
P(Accel | Horse): P(low|low) = 0.95, P(low|high) = 0.11, P(high|low) = 0.05, P(high|high) = 0.89

Thank you, domain expert! Now I only need to learn 5 parameters instead of 7 from my data! My parameter estimates will be more accurate as a result!

(5 = 1 parameter for P(Mpg) + 2 for P(Horse|Mpg) + 2 for P(Accel|Horse), versus 1 + 2 + 4 if we had to keep the full P(Accel|Mpg,Horse).)

66

Independence

"The Acceleration does not depend on the Mpg once I know the Horsepower."

This can be specified very simply:

P(Accel | Mpg, Horse) = P(Accel | Horse)

This is a powerful statement!

It required extra domain knowledge. A different kind of knowledge than numerical probabilities. It needed an understanding of causation.

67

Bayes Net-Building Tools

1. Chain Rule: P(A,B) = P(A) P(B|A)
2. Conditional Independence: P(B|A) = P(B)

This is Ridiculously Useful in general

68

Bayes Nets Formalized

A Bayes net (also called a belief network) is an augmented directed acyclic graph, represented by the pair (V, E) where:

• V is a set of vertices.
• E is a set of directed edges joining vertices. No loops of any length are allowed.

Each vertex in V contains the following information:
• The name of a random variable
• A probability distribution table indicating how the probability of this variable's values depends on all possible combinations of parental values.

69

Where are we now?

• We have a methodology for building Bayes nets.

• We don’t require exponential storage to hold our probability table. Only exponential in the maximum number of parents of any node.

• We can compute probabilities of any given assignment of truth values to the variables. And we can do it in time linear with the number of nodes.

• So we can also compute answers to any questions.

E.g., what could we do to compute P(R | T, ~S)?

[Bayes net: S → L ← M, M → R, L → T]

P(S)=0.3, P(M)=0.6
P(R|M)=0.3, P(R|~M)=0.6
P(T|L)=0.3, P(T|~L)=0.8
P(L|M^S)=0.05, P(L|M^~S)=0.1, P(L|~M^S)=0.1, P(L|~M^~S)=0.2

70

Where are we now?

• We have a methodology for building Bayes nets.

• We don’t require exponential storage to hold our probability table. Only exponential in the maximum number of parents of any node.

• We can compute probabilities of any given assignment of truth values to the variables. And we can do it in time linear with the number of nodes.

• So we can also compute answers to any questions.

E.g., what could we do to compute P(R | T, ~S)?

[Bayes net: S → L ← M, M → R, L → T]

P(S)=0.3, P(M)=0.6
P(R|M)=0.3, P(R|~M)=0.6
P(T|L)=0.3, P(T|~L)=0.8
P(L|M^S)=0.05, P(L|M^~S)=0.1, P(L|~M^S)=0.1, P(L|~M^~S)=0.2

Step 1: Compute P(R ^ T ^ ~S)

Step 2: Compute P(~R ^ T ^ ~S)

Step 3: Return

P(R ^ T ^ ~S)

-------------------------------------

P(R ^ T ^ ~S)+ P(~R ^ T ^ ~S)

71

Where are we now?

• We have a methodology for building Bayes nets.

• We don’t require exponential storage to hold our probability table. Only exponential in the maximum number of parents of any node.

• We can compute probabilities of any given assignment of truth values to the variables. And we can do it in time linear with the number of nodes.

• So we can also compute answers to any questions.

E.g., what could we do to compute P(R | T, ~S)?

[Bayes net: S → L ← M, M → R, L → T]

P(S)=0.3, P(M)=0.6
P(R|M)=0.3, P(R|~M)=0.6
P(T|L)=0.3, P(T|~L)=0.8
P(L|M^S)=0.05, P(L|M^~S)=0.1, P(L|~M^S)=0.1, P(L|~M^~S)=0.2

Step 1: Compute P(R ^ T ^ ~S)

Step 2: Compute P(~R ^ T ^ ~S)

Step 3: Return

P(R ^ T ^ ~S)

-------------------------------------

P(R ^ T ^ ~S)+ P(~R ^ T ^ ~S)

Sum of all the rows in the Joint that match R ^ T ^ ~S

Sum of all the rows in the Joint that match ~R ^ T ^ ~S

72

Where are we now?

• We have a methodology for building Bayes nets.

• We don’t require exponential storage to hold our probability table. Only exponential in the maximum number of parents of any node.

• We can compute probabilities of any given assignment of truth values to the variables. And we can do it in time linear with the number of nodes.

• So we can also compute answers to any questions.

E.g., what could we do to compute P(R | T, ~S)?

[Bayes net: S → L ← M, M → R, L → T]

P(S)=0.3, P(M)=0.6
P(R|M)=0.3, P(R|~M)=0.6
P(T|L)=0.3, P(T|~L)=0.8
P(L|M^S)=0.05, P(L|M^~S)=0.1, P(L|~M^S)=0.1, P(L|~M^~S)=0.2

Step 1: Compute P(R ^ T ^ ~S)

Step 2: Compute P(~R ^ T ^ ~S)

Step 3: Return

P(R ^ T ^ ~S)

-------------------------------------

P(R ^ T ^ ~S)+ P(~R ^ T ^ ~S)

Sum of all the rows in the Joint that match R ^ T ^ ~S

Sum of all the rows in the Joint that match ~R ^ T ^ ~S

Each of these obtained by the “computing a joint probability entry” method of the earlier slides

4 joint computes

4 joint computes

73

Independence

We've stated:

P(M) = 0.6
P(S) = 0.3
P(S | M) = P(S)

M S Prob

T T

T F

F T

F F

From these statements, we can derive the full joint pdf.

And since we now have the joint pdf, we can make any queries we like.

74

The good news

We can do inference. We can compute any conditional probability:

P(Some variable | Some other variable values)

P(E1 | E2) = P(E1 ^ E2) / P(E2) = Σ_{joint entries matching E1 and E2} P(joint entry) / Σ_{joint entries matching E2} P(joint entry)

75

The good news

We can do inference. We can compute any conditional probability:

P(Some variable | Some other variable values)

P(E1 | E2) = P(E1 ^ E2) / P(E2) = Σ_{joint entries matching E1 and E2} P(joint entry) / Σ_{joint entries matching E2} P(joint entry)

Suppose you have m binary-valued variables in your Bayes Net and expression E2 mentions k variables.

How much work is the above computation?

76

The sad, bad news

Conditional probabilities by enumerating all matching entries in the joint are expensive:

Exponential in the number of variables.

77

The sad, bad news

Conditional probabilities by enumerating all matching entries in the joint are expensive:

Exponential in the number of variables.

But perhaps there are faster ways of querying Bayes nets?
• In fact, if I ever ask you to manually do a Bayes Net inference, you'll find there are often many tricks to save you time.
• So we've just got to program our computer to do those tricks too, right?

78

The sad, bad news

Conditional probabilities by enumerating all matching entries in the joint are expensive:

Exponential in the number of variables.

But perhaps there are faster ways of querying Bayes nets?
• In fact, if I ever ask you to manually do a Bayes Net inference, you'll find there are often many tricks to save you time.
• So we've just got to program our computer to do those tricks too, right?

Sadder and worse news: General querying of Bayes nets is NP-complete.

79

Case Study I

Pathfinder system. (Heckerman 1991, Probabilistic Similarity Networks, MIT Press, Cambridge MA).

• Diagnostic system for lymph-node diseases.

• 60 diseases and 100 symptoms and test-results.

• 14,000 probabilities

• Expert consulted to make net.

• 8 hours to determine variables.

• 35 hours for net topology.

• 40 hours for probability table values.

• Apparently, the experts found it quite easy to invent the causal links and probabilities.

• Pathfinder is now outperforming the world experts in diagnosis. Being extended to several dozen other medical domains.

80

What you should know

• The meanings and importance of independence and conditional independence.

• The definition of a Bayes net.
• Computing probabilities of assignments of variables (i.e. members of the joint p.d.f.) with a Bayes net.
• The slow (exponential) method for computing arbitrary conditional probabilities.
• The stochastic simulation method and likelihood weighting.

81

82

Chain Rule

P(Mpg, Horse):
  P(good, low) = 0.36, P(good, high) = 0.04
  P(bad, low) = 0.12, P(bad, high) = 0.48

P(Mpg) & P(Horse | Mpg):
  P(good) = 0.4, P(bad) = 0.6
  P(low|good) = 0.89, P(high|good) = 0.11, P(low|bad) = 0.21, P(high|bad) = 0.79

P(Horse) & P(Mpg | Horse):
  P(low) = 0.48, P(high) = 0.52
  P(good|low) = 0.72, P(bad|low) = 0.28, P(good|high) = 0.08, P(bad|high) = 0.92

P(Mpg, Horse) = P(Mpg) * P(Horse | Mpg)
P(Mpg, Horse) = P(Horse) * P(Mpg | Horse)

83

Chain Rule

P(Mpg, Horse) = P(Mpg) * P(Horse | Mpg)
P(Mpg, Horse) = P(Horse) * P(Mpg | Horse)

P(Mpg, Horse):
  P(good, low) = 0.36, P(good, high) = 0.04
  P(bad, low) = 0.12, P(bad, high) = 0.48

P(Mpg): P(good) = 0.4, P(bad) = 0.6
P(Horse | Mpg): P(low|good) = 0.89, P(high|good) = 0.11, P(low|bad) = 0.21, P(high|bad) = 0.79

P(Horse): P(low) = 0.48, P(high) = 0.52
P(Mpg | Horse): P(good|low) = 0.72, P(bad|low) = 0.28, P(good|high) = 0.08, P(bad|high) = 0.92

84

Making a Bayes Net

1. Start with a Joint: P(Horse, Mpg)
2. Rewrite the joint with the Chain Rule: P(Horse) P(Mpg | Horse)
3. Each component becomes a node (with arcs)

85

P(Mpg, Horsepower)

We can rewrite P(Mpg, Horsepower) as a Bayes net!

1. Decompose with the Chain rule.
2. Make each element of the decomposition into a node.

[Bayes net: Mpg → Horsepower]

P(Mpg): P(good) = 0.4, P(bad) = 0.6
P(Horsepower | Mpg): P(low|good) = 0.89, P(high|good) = 0.11, P(low|bad) = 0.21, P(high|bad) = 0.79

86

P(Mpg, Horsepower)

[Bayes net: Horsepower → Mpg]

P(Horse): P(low) = 0.48, P(high) = 0.52
P(Mpg | Horse): P(good|low) = 0.72, P(bad|low) = 0.28, P(good|high) = 0.08, P(bad|high) = 0.92

87

The Joint Distribution

       h      ~h
  f    0.01   0.01
  ~f   0.09   0.89

88

The Joint Distribution

[Venn diagram over the events h, f, a]

89

Joint Distribution

Up to this point, we have represented a probability model (or joint distribution) in two ways:

Venn Diagrams
• Visual
• Good for tutorials

Probability Tables
• Efficient
• Good for computation

P(Mpg, Horse):
  P(good, low) = 0.36, P(good, high) = 0.04
  P(bad, low) = 0.12, P(bad, high) = 0.48

90

Where do Joint Distributions come from?

• Idea One: Expert Humans with lots of free time
• Idea Two: Simpler probabilistic facts and some algebra

Example: Suppose you knew

P(A) = 0.7

P(B|A) = 0.2
P(B|~A) = 0.1

P(C|A,B) = 0.1
P(C|A,~B) = 0.8
P(C|~A,B) = 0.3
P(C|~A,~B) = 0.1

Then you can automatically compute the PT using the chain rule:

P(x, y, z) = P(x) P(y|x) P(z|x,y)

Later: Bayes Nets will do this systematically.
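A minimal sketch of that chain-rule construction, turning the seven numbers above into the full eight-row joint P(A, B, C).

```python
p_a = 0.7
p_b_given_a = {True: 0.2, False: 0.1}                   # P(B | A=a)
p_c_given_ab = {(True, True): 0.1, (True, False): 0.8,  # P(C | A=a, B=b)
                (False, True): 0.3, (False, False): 0.1}

joint = {}
for a in (True, False):
    for b in (True, False):
        for c in (True, False):
            pa = p_a if a else 1 - p_a
            pb = p_b_given_a[a] if b else 1 - p_b_given_a[a]
            pc = p_c_given_ab[(a, b)] if c else 1 - p_c_given_ab[(a, b)]
            joint[(a, b, c)] = pa * pb * pc  # P(a,b,c) = P(a) P(b|a) P(c|a,b)

print(sum(joint.values()))  # 1.0 (up to floating point)
```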

91

Where do Joint Distributions come from?

• Idea Three: Learn them from data!

Prepare to be impressed…

92

A more interesting case

• M : Manuela teaches the class
• S : It is sunny
• L : The lecturer arrives slightly late.

Assume both lecturers are sometimes delayed by bad weather. Andrew is more likely to arrive late than Manuela.

Let's begin with writing down knowledge we're happy about:

P(S | M) = P(S), P(S) = 0.3, P(M) = 0.6

Lateness is not independent of the weather and is not independent of the lecturer.

93

A more interesting case

• M : Manuela teaches the class
• S : It is sunny
• L : The lecturer arrives slightly late.

Assume both lecturers are sometimes delayed by bad weather. Andrew is more likely to arrive late than Manuela.

Let's begin with writing down knowledge we're happy about:

P(S | M) = P(S), P(S) = 0.3, P(M) = 0.6

Lateness is not independent of the weather and is not independent of the lecturer.

We already know the Joint of S and M, so all we need now is P(L | S=u, M=v) in the 4 cases of u/v = True/False.

94

A more interesting case

• M : Manuela teaches the class
• S : It is sunny
• L : The lecturer arrives slightly late.

Assume both lecturers are sometimes delayed by bad weather. Andrew is more likely to arrive late than Manuela.

P(S | M) = P(S)
P(S) = 0.3
P(M) = 0.6

P(L | M ^ S) = 0.05
P(L | M ^ ~S) = 0.1
P(L | ~M ^ S) = 0.1
P(L | ~M ^ ~S) = 0.2

Now we can derive a full joint p.d.f. with a “mere” six numbers instead of seven*

*Savings are larger for larger numbers of variables.

95

A more interesting case

• M : Manuela teaches the class
• S : It is sunny
• L : The lecturer arrives slightly late.

Assume both lecturers are sometimes delayed by bad weather. Andrew is more likely to arrive late than Manuela.

P(S | M) = P(S)
P(S) = 0.3
P(M) = 0.6

P(L | M ^ S) = 0.05
P(L | M ^ ~S) = 0.1
P(L | ~M ^ S) = 0.1
P(L | ~M ^ ~S) = 0.2

Question: Express P(L=x ^ M=y ^ S=z)

in terms that only need the above expressions, where x,y and z may each be True or False.

96

A bit of notation

P(S | M) = P(S)
P(S) = 0.3
P(M) = 0.6

P(L | M ^ S) = 0.05
P(L | M ^ ~S) = 0.1
P(L | ~M ^ S) = 0.1
P(L | ~M ^ ~S) = 0.2

[Bayes net: S → L ← M; node S carries P(S)=0.3, node M carries P(M)=0.6, node L carries the P(L | M, S) table above]

97

A bit of notation

P(S | M) = P(S)
P(S) = 0.3
P(M) = 0.6

P(L | M ^ S) = 0.05
P(L | M ^ ~S) = 0.1
P(L | ~M ^ S) = 0.1
P(L | ~M ^ ~S) = 0.2

[Bayes net: S → L ← M; node S carries P(S)=0.3, node M carries P(M)=0.6, node L carries the P(L | M, S) table above]

Read the absence of an arrow between S and M to mean "it would not help me predict M if I knew the value of S".

Read the two arrows into L to mean that if I want to know the value of L it may help me to know M and to know S.

This kind of stuff will be thoroughly formalized later.

98

An even cuter trick

Suppose we have these three events:
• M : Lecture taught by Manuela
• L : Lecturer arrives late
• R : Lecture concerns robots

Suppose:
• Andrew has a higher chance of being late than Manuela.
• Andrew has a higher chance of giving robotics lectures.

What kind of independence can we find?

How about:
• P(L | M) = P(L) ?
• P(R | M) = P(R) ?
• P(L | R) = P(L) ?

99

Conditional independence

Once you know who the lecturer is, then whether they arrive late doesn’t affect whether the lecture concerns robots.

P(R | M, L) = P(R | M) and
P(R | ~M, L) = P(R | ~M)

We express this in the following way:

"R and L are conditionally independent given M"

…which is also notated by the following diagram:

[Bayes net: L ← M → R]

Given knowledge of M, knowing anything else in the diagram won't help us with L, etc.

100

Conditional Independence formalized

R and L are conditionally independent given M if

for all x,y,z in {T,F}:

P(R=x | M=y ^ L=z) = P(R=x | M=y)

More generally:

Let S1 and S2 and S3 be sets of variables.

Set-of-variables S1 and set-of-variables S2 are conditionally independent given S3 if for all assignments of values to the variables in the sets,

P(S1's assignments | S2's assignments & S3's assignments) = P(S1's assignments | S3's assignments)

101

Example:

R and L are conditionally independent given M if

for all x,y,z in {T,F}:

P(R=x | M=y ^ L=z) = P(R=x | M=y)

More generally:

Let S1 and S2 and S3 be sets of variables.

Set-of-variables S1 and set-of-variables S2 are conditionally independent given S3 if for all assignments of values to the variables in the sets,

P(S1's assignments | S2's assignments & S3's assignments) = P(S1's assignments | S3's assignments)

“Shoe-size is conditionally independent of Glove-size given height weight and age”

means

forall s,g,h,w,a

P(ShoeSize=s|Height=h,Weight=w,Age=a)

=

P(ShoeSize=s|Height=h,Weight=w,Age=a,GloveSize=g)

102

Example:

R and L are conditionally independent given M if

for all x,y,z in {T,F}:

P(R=x | M=y ^ L=z) = P(R=x | M=y)

More generally:

Let S1 and S2 and S3 be sets of variables.

Set-of-variables S1 and set-of-variables S2 are conditionally independent given S3 if for all assignments of values to the variables in the sets,

P(S1's assignments | S2's assignments & S3's assignments) = P(S1's assignments | S3's assignments)

“Shoe-size is conditionally independent of Glove-size given height weight and age”

does not mean

forall s,g,h

P(ShoeSize=s|Height=h)

=

P(ShoeSize=s|Height=h, GloveSize=g)

103

Conditional independence

[Bayes net: L ← M → R]

We can write down P(M). And then, since we know L is only directly influenced by M, we can write down the values of P(L|M) and P(L|~M) and know we've fully specified L's behavior. Ditto for R.

P(M) = 0.6
P(L|M) = 0.085
P(L|~M) = 0.17
P(R|M) = 0.3
P(R|~M) = 0.6

‘R and L conditionally independent given M’

104

Conditional independence

[Bayes net: L ← M → R]

P(M) = 0.6
P(L|M) = 0.085
P(L|~M) = 0.17
P(R|M) = 0.3
P(R|~M) = 0.6

Conditional Independence:
P(R|M,L) = P(R|M)
P(R|~M,L) = P(R|~M)

Again, we can obtain any member of the Joint prob dist that we desire:

P(L=x ^ R=y ^ M=z) = ?

105

Assume five variables

T: The lecture started by 10:35
L: The lecturer arrives late
R: The lecture concerns robots
M: The lecturer is Manuela
S: It is sunny

• T only directly influenced by L (i.e. T is conditionally independent of R,M,S given L)

• L only directly influenced by M and S (i.e. L is conditionally independent of R given M & S)

• R only directly influenced by M (i.e. R is conditionally independent of L,S, given M)

• M and S are independent

106

Making a Bayes net

[Bayes net variables: S, M, L, R, T]

Step One: add variables.
• Just choose the variables you'd like to be included in the net.

T: The lecture started by 10:35
L: The lecturer arrives late
R: The lecture concerns robots
M: The lecturer is Manuela
S: It is sunny

107

Making a Bayes net

[Bayes net: S → L ← M, M → R, L → T]

Step Two: add links.
• The link structure must be acyclic.
• If node X is given parents Q1, Q2, …, Qn you are promising that any variable that's a non-descendent of X is conditionally independent of X given {Q1, Q2, …, Qn}.

T: The lecture started by 10:35
L: The lecturer arrives late
R: The lecture concerns robots
M: The lecturer is Manuela
S: It is sunny

108

Making a Bayes net

[Bayes net: S → L ← M, M → R, L → T]

P(S)=0.3, P(M)=0.6
P(R|M)=0.3, P(R|~M)=0.6
P(T|L)=0.3, P(T|~L)=0.8
P(L|M^S)=0.05, P(L|M^~S)=0.1, P(L|~M^S)=0.1, P(L|~M^~S)=0.2

Step Three: add a probability table for each node.
• The table for node X must list P(X | Parent Values) for each possible combination of parent values.

T: The lecture started by 10:35
L: The lecturer arrives late
R: The lecture concerns robots
M: The lecturer is Manuela
S: It is sunny

109

Making a Bayes net

[Bayes net: S → L ← M, M → R, L → T]

P(S)=0.3, P(M)=0.6
P(R|M)=0.3, P(R|~M)=0.6
P(T|L)=0.3, P(T|~L)=0.8
P(L|M^S)=0.05, P(L|M^~S)=0.1, P(L|~M^S)=0.1, P(L|~M^~S)=0.2

• Two unconnected variables may still be correlated
• Each node is conditionally independent of all non-descendants in the tree, given its parents.
• You can deduce many other conditional independence relations from a Bayes net. See the next lecture.

T: The lecture started by 10:35
L: The lecturer arrives late
R: The lecture concerns robots
M: The lecturer is Manuela
S: It is sunny

110

Bayes Nets Formalized

A Bayes net (also called a belief network) is an augmented directed acyclic graph, represented by the pair (V, E) where:

• V is a set of vertices.
• E is a set of directed edges joining vertices. No loops of any length are allowed.

Each vertex in V contains the following information:
• The name of a random variable
• A probability distribution table indicating how the probability of this variable's values depends on all possible combinations of parental values.

111

Building a Bayes Net

1. Choose a set of relevant variables.
2. Choose an ordering for them.
3. Assume they're called X1 .. Xm (where X1 is the first in the ordering, X2 is the second, etc).
4. For i = 1 to m:
   1. Add the Xi node to the network.
   2. Set Parents(Xi) to be a minimal subset of {X1…Xi-1} such that we have conditional independence of Xi and all other members of {X1…Xi-1} given Parents(Xi).
   3. Define the probability table of P(Xi = k | Assignments of Parents(Xi)).
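A minimal sketch of one possible representation for the result of this procedure (my own data structure, not from the slides): each variable maps to its parent list and a CPT keyed by the tuple of parent values, using the lecture's S/M/L/R/T numbers.

```python
# Each CPT entry gives P(variable=True | parent assignment); the False case
# is 1 minus that entry.
bayes_net = {
    "S": {"parents": [],         "cpt": {(): 0.3}},
    "M": {"parents": [],         "cpt": {(): 0.6}},
    "L": {"parents": ["M", "S"], "cpt": {(True, True): 0.05, (True, False): 0.1,
                                         (False, True): 0.1, (False, False): 0.2}},
    "R": {"parents": ["M"],      "cpt": {(True,): 0.3, (False,): 0.6}},
    "T": {"parents": ["L"],      "cpt": {(True,): 0.3, (False,): 0.8}},
}
```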

112

Example Bayes Net Building

Suppose we're building a nuclear power station. There are the following random variables:

GRL : Gauge Reads Low.
CTL : Core temperature is low.
FG : Gauge is faulty.
FA : Alarm is faulty.
AS : Alarm sounds.

• If alarm working properly, the alarm is meant to sound if the gauge stops reading a low temp.

• If gauge working properly, the gauge is meant to read the temp of the core.

113

Computing a Joint Entry

How to compute an entry in a joint distribution?

E.g.: What is P(S ^ ~M ^ L ^ ~R ^ T)?

[Bayes net: S → L ← M, M → R, L → T]

P(S)=0.3, P(M)=0.6
P(R|M)=0.3, P(R|~M)=0.6
P(T|L)=0.3, P(T|~L)=0.8
P(L|M^S)=0.05, P(L|M^~S)=0.1, P(L|~M^S)=0.1, P(L|~M^~S)=0.2

114

Computing with Bayes Net

P(T ^ ~R ^ L ^ ~M ^ S)
= P(T | ~R ^ L ^ ~M ^ S) * P(~R ^ L ^ ~M ^ S)
= P(T | L) * P(~R ^ L ^ ~M ^ S)
= P(T | L) * P(~R | L ^ ~M ^ S) * P(L ^ ~M ^ S)
= P(T | L) * P(~R | ~M) * P(L ^ ~M ^ S)
= P(T | L) * P(~R | ~M) * P(L | ~M ^ S) * P(~M ^ S)
= P(T | L) * P(~R | ~M) * P(L | ~M ^ S) * P(~M | S) * P(S)
= P(T | L) * P(~R | ~M) * P(L | ~M ^ S) * P(~M) * P(S)

[Bayes net: S → L ← M, M → R, L → T]

P(S)=0.3, P(M)=0.6
P(R|M)=0.3, P(R|~M)=0.6
P(T|L)=0.3, P(T|~L)=0.8
P(L|M^S)=0.05, P(L|M^~S)=0.1, P(L|~M^S)=0.1, P(L|~M^~S)=0.2
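A minimal sketch of this joint-entry computation, reusing the `bayes_net` dict from the earlier sketch: multiply each variable's CPT entry given its parents' values in the assignment.

```python
def joint_entry(net, assignment):
    """P(X1=x1 ^ ... ^ Xn=xn) = product over i of P(Xi=xi | parents(Xi))."""
    p = 1.0
    for var, node in net.items():
        parent_vals = tuple(assignment[parent] for parent in node["parents"])
        p_true = node["cpt"][parent_vals]
        p *= p_true if assignment[var] else 1 - p_true
    return p

# P(T ^ ~R ^ L ^ ~M ^ S) from the derivation above:
print(joint_entry(bayes_net, {"S": True, "M": False, "L": True,
                              "R": False, "T": True}))
# = P(T|L) * P(~R|~M) * P(L|~M^S) * P(~M) * P(S)
# = 0.3 * 0.4 * 0.1 * 0.4 * 0.3 = 0.00144
```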

115

The general case

P(X1=x1 ^ X2=x2 ^ … ^ Xn-1=xn-1 ^ Xn=xn)
= P(Xn=xn ^ Xn-1=xn-1 ^ … ^ X2=x2 ^ X1=x1)
= P(Xn=xn | Xn-1=xn-1 ^ … ^ X2=x2 ^ X1=x1) * P(Xn-1=xn-1 ^ … ^ X2=x2 ^ X1=x1)
= P(Xn=xn | Xn-1=xn-1 ^ … ^ X2=x2 ^ X1=x1) * P(Xn-1=xn-1 | Xn-2=xn-2 ^ … ^ X2=x2 ^ X1=x1) * P(Xn-2=xn-2 ^ … ^ X2=x2 ^ X1=x1)
= …
= Π_{i=1..n} P(Xi=xi | Xi-1=xi-1 ^ … ^ X1=x1)
= Π_{i=1..n} P(Xi=xi | Assignments of Parents(Xi))

So any entry in joint pdf table can be computed. And so any conditional probability can be computed.

116

Where are we now?

• We have a methodology for building Bayes nets.

• We don’t require exponential storage to hold our probability table. Only exponential in the maximum number of parents of any node.

• We can compute probabilities of any given assignment of truth values to the variables. And we can do it in time linear with the number of nodes.

• So we can also compute answers to any questions.

E.g., what could we do to compute P(R | T, ~S)?

[Bayes net: S → L ← M, M → R, L → T]

P(S)=0.3, P(M)=0.6
P(R|M)=0.3, P(R|~M)=0.6
P(T|L)=0.3, P(T|~L)=0.8
P(L|M^S)=0.05, P(L|M^~S)=0.1, P(L|~M^S)=0.1, P(L|~M^~S)=0.2

117

Where are we now?

• We have a methodology for building Bayes nets.

• We don’t require exponential storage to hold our probability table. Only exponential in the maximum number of parents of any node.

• We can compute probabilities of any given assignment of truth values to the variables. And we can do it in time linear with the number of nodes.

• So we can also compute answers to any questions.

E.g., what could we do to compute P(R | T, ~S)?

[Bayes net: S → L ← M, M → R, L → T]

P(S)=0.3, P(M)=0.6
P(R|M)=0.3, P(R|~M)=0.6
P(T|L)=0.3, P(T|~L)=0.8
P(L|M^S)=0.05, P(L|M^~S)=0.1, P(L|~M^S)=0.1, P(L|~M^~S)=0.2

Step 1: Compute P(R ^ T ^ ~S)

Step 2: Compute P(~R ^ T ^ ~S)

Step 3: Return

P(R ^ T ^ ~S)

-------------------------------------

P(R ^ T ^ ~S)+ P(~R ^ T ^ ~S)

118

Where are we now?

• We have a methodology for building Bayes nets.

• We don’t require exponential storage to hold our probability table. Only exponential in the maximum number of parents of any node.

• We can compute probabilities of any given assignment of truth values to the variables. And we can do it in time linear with the number of nodes.

• So we can also compute answers to any questions.

E.g., what could we do to compute P(R | T, ~S)?

[Bayes net: S → L ← M, M → R, L → T]

P(S)=0.3, P(M)=0.6
P(R|M)=0.3, P(R|~M)=0.6
P(T|L)=0.3, P(T|~L)=0.8
P(L|M^S)=0.05, P(L|M^~S)=0.1, P(L|~M^S)=0.1, P(L|~M^~S)=0.2

Step 1: Compute P(R ^ T ^ ~S)

Step 2: Compute P(~R ^ T ^ ~S)

Step 3: Return

P(R ^ T ^ ~S)

-------------------------------------

P(R ^ T ^ ~S)+ P(~R ^ T ^ ~S)

Sum of all the rows in the Joint that match R ^ T ^ ~S

Sum of all the rows in the Joint that match ~R ^ T ^ ~S

119

Where are we now?

• We have a methodology for building Bayes nets.

• We don’t require exponential storage to hold our probability table. Only exponential in the maximum number of parents of any node.

• We can compute probabilities of any given assignment of truth values to the variables. And we can do it in time linear with the number of nodes.

• So we can also compute answers to any questions.

E.G. What could we do to compute P(R T,~S)?

S M

RL

T

P(s)=0.3P(M)=0.6

P(RM)=0.3P(R~M)=0.6

P(TL)=0.3P(T~L)=0.8

P(LM^S)=0.05P(LM^~S)=0.1P(L~M^S)=0.1P(L~M^~S)=0.2

Step 1: Compute P(R ^ T ^ ~S)

Step 2: Compute P(~R ^ T ^ ~S)

Step 3: Return

P(R ^ T ^ ~S)

-------------------------------------

P(R ^ T ^ ~S)+ P(~R ^ T ^ ~S)

Sum of all the rows in the Joint that match R ^ T ^ ~S

Sum of all the rows in the Joint that match ~R ^ T ^ ~S

Each of these obtained by the “computing a joint probability entry” method of the earlier slides

4 joint computes

4 joint computes
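A minimal sketch of this three-step enumeration, reusing `joint_entry` and `bayes_net` from the earlier sketches: sum the joint entries over every completion of the unobserved variables, then take the ratio.

```python
from itertools import product

def enumerate_prob(net, query):
    """Sum of joint entries consistent with the partial assignment `query`."""
    hidden = [v for v in net if v not in query]
    total = 0.0
    for values in product([True, False], repeat=len(hidden)):
        assignment = dict(query, **dict(zip(hidden, values)))
        total += joint_entry(net, assignment)
    return total

# P(R | T, ~S) = P(R ^ T ^ ~S) / (P(R ^ T ^ ~S) + P(~R ^ T ^ ~S));
# each call enumerates the 4 completions of the hidden variables M and L.
num = enumerate_prob(bayes_net, {"R": True, "T": True, "S": False})
den = num + enumerate_prob(bayes_net, {"R": False, "T": True, "S": False})
print(num / den)
```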
