CSE 573: Artificial Intelligence, Spring 2012
Learning Bayesian Networks
Dan Weld
Slides adapted from Carlos Guestrin, Krzysztof Gajos, Dan Klein, Stuart Russell, Andrew Moore & Luke Zettlemoyer
Way too much of KG's coin junk; get to the point!
MAP vs. Bayesian estimate; Beta prior.
Should be 5-10 min, not 30.
What action next?

[Diagram: an agent receives percepts from, and sends actions to, the environment]

Static vs. Dynamic
Fully vs. Partially Observable
Perfect vs. Noisy
Deterministic vs. Stochastic
Instantaneous vs. Durative
What action next?
Blind search
Heuristic search
Minimax & expectimax
MDPs (& POMDPs)
Reinforcement learning
State estimation
Algorithms
Knowledge Representation
HMMs
Bayesian networks
First-order logic
Description logic
Constraint networks
Markov logic networks
…
Learning?
What is Machine Learning?
©2005-2009 Carlos Guestrin
Machine Learning
Study of algorithms that improve their performance at some task with experience
Data → Machine Learning → Understanding
Exponential Growth in Data

Data → Machine Learning → Understanding
Supremacy of Machine Learning
Machine learning is the preferred approach to:
speech recognition and natural language processing
web search (result ranking)
computer vision
medical outcomes analysis
robot control
computational biology
sensor networks
…
This trend is accelerating:
improved machine learning algorithms
improved data capture, networking, faster computers
software too complex to write by hand
new sensors / IO devices
demand for self-customization to user and environment
Space of ML Problems

Columns give the type of supervision (e.g., experience, feedback):

| What is Being Learned? | Labeled Examples | Reward | Nothing |
|------------------------|------------------|--------|---------|
| Discrete function      | Classification   |        | Clustering |
| Continuous function    | Regression       |        |         |
| Policy                 | Apprenticeship Learning | Reinforcement Learning | |
©2009 Carlos Guestrin
Classification
from data to discrete classes
Spam filtering
data → prediction
Weather prediction
Object detection
Example training images for each orientation
(Prof. H. Schneiderman)
The classification pipeline
Training phase → Testing phase
Machine Learning
  Supervised Learning
    Parametric
    Non-parametric
      Nearest neighbor
      Kernel density estimation
      Support vector machines
  Unsupervised Learning
  Reinforcement Learning
Machine Learning: Supervised Learning (Parametric)

Y Discrete:
  Probability of class given features:
    1. Learn P(Y) and P(X|Y); apply Bayes' rule
    2. Learn P(Y|X) with gradient descent
  Non-probabilistic linear classifier: learn with gradient descent
  Decision trees: greedy search; pruning

Y Continuous:
  Gaussians: learned in closed form
  Linear functions:
    1. Learned in closed form
    2. Using gradient descent
Regression
predicting a numeric value
Stock market
Weather prediction revisited
Temperature
Clustering
discovering structure in data
Machine Learning
Supervised Learning (Parametric / Non-parametric) · Reinforcement Learning

Unsupervised Learning:
  Agglomerative Clustering
  K-means
  Expectation Maximization (EM)
  Principal Component Analysis (PCA)
Clustering Data: Group similar things
Clustering images
©2009 Carlos Guestrin [Goldberger et al.]
Set of Images
Clustering web search results
In Summary

Columns give the type of supervision (e.g., experience, feedback):

| What is Being Learned? | Labeled Examples | Reward | Nothing |
|------------------------|------------------|--------|---------|
| Discrete function      | Classification   |        | Clustering |
| Continuous function    | Regression       |        |         |
| Policy                 | Apprenticeship Learning | Reinforcement Learning | |
Key Concepts
Classifier

Hypothesis: a function for labeling examples.

[Scatter plot: training examples marked + and −, a decision boundary separating the region labeled + from the region labeled −, and several unlabeled query points marked ?]
Generalization
Hypotheses must generalize to correctly classify instances not in the training data.
Simply memorizing the training examples yields a consistent hypothesis that does not generalize.
© Daniel S. Weld
A Learning Problem
Hypothesis Spaces
Why is Learning Possible?
Experience alone never justifies any conclusion about any unseen instance.
Learning occurs when PREJUDICE meets DATA!
Learning a “Frobnitz”
Frobnitz Not a Frobnitz
Bias
The nice word for prejudice is "bias" (different from "bias" in statistics).

What kind of hypotheses will you consider?
  What is the allowable range of functions you use when approximating?
What kind of hypotheses do you prefer?
Some Typical Biases
Occam's razor: "It is needless to do more when less will suffice"
  (William of Occam, died 1349 of the Black Plague)

MDL: minimum description length

Concepts can be approximated by...
  conjunctions of predicates
  linear functions
  short decision trees
A Frobnitz?
ML = Function Approximation
c(x): the target function to learn; h(x): the hypothesis.

There may not be any perfect fit. Classification ≈ learning discrete functions.

h(x) = contains(`nigeria', x) ∨ contains(`wire-transfer', x)
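As a minimal runnable sketch of such a hypothesis (the `contains` helper and its substring test are illustrative assumptions, not the lecture's definition):

```python
def contains(word, x):
    """Assumed helper: True if word occurs in message x (case-insensitive substring test)."""
    return word in x.lower()

def h(x):
    # The disjunction from the slide: flag a message if it mentions either term.
    return contains("nigeria", x) or contains("wire-transfer", x)

print(h("please wire-transfer the inheritance"))  # True
print(h("lecture notes attached"))                # False
```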
Learning as Optimization
Preference bias = the loss function:
  Minimize loss over the training data (ideally, over test data)
  Loss(h, data) = error(h, data) + complexity(h)
  i.e., error + regularization

Methods:
  Closed form
  Greedy search
  Gradient descent (ascent, for likelihood)
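A toy instance of "error + regularization" minimized by gradient steps; the data, the L2 (ridge) penalty, and the step size are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                 # made-up features
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=50)   # made-up noisy targets

LAM = 0.1   # regularization strength (weight on the complexity term)

def loss(w):
    """Loss(h, data) = error(h, data) + complexity(h): squared error + L2 penalty."""
    err = y - X @ w
    return err @ err + LAM * (w @ w)

def grad(w):
    return -2 * X.T @ (y - X @ w) + 2 * LAM * w

w = np.zeros(3)
for _ in range(500):
    w -= 0.005 * grad(w)   # descend on the loss (equivalently, ascend on its negation)

print(np.round(w, 2))      # close to true_w
```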
Bias / Variance Tradeoff
(Slide from T. Dietterich)

Variance: E[(h(x*) − E[h(x*)])²]
  How much h(x*) varies between training sets
  Reducing variance risks underfitting

Bias: E[h(x*)] − f(x*)
  Describes the average error of h(x*)
  Reducing bias risks overfitting

Note: inductive bias vs. estimator bias
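These definitions can be checked numerically. A Monte-Carlo sketch (the target function sin, the noise level, and the polynomial degrees are made up for illustration): fit a simple and a complex model on many independent training sets and compare bias and variance of h(x*):

```python
import numpy as np

rng = np.random.default_rng(1)
f = np.sin          # assumed "true" function f
x_star = 1.5        # fixed test point x*

def bias_variance(degree, n_sets=200, n_points=15):
    """Fit a polynomial of the given degree on many independent training sets
    and estimate bias and variance of h(x*) across them."""
    preds = []
    for _ in range(n_sets):
        x = rng.uniform(0.0, 3.0, size=n_points)
        y = f(x) + 0.3 * rng.normal(size=n_points)
        coeffs = np.polyfit(x, y, degree)
        preds.append(np.polyval(coeffs, x_star))
    preds = np.array(preds)
    bias = preds.mean() - f(x_star)        # E[h(x*)] - f(x*)
    variance = preds.var()                 # E[(h(x*) - E[h(x*)])^2]
    return bias, variance

bias_lo, var_lo = bias_variance(degree=1)  # simple model: high bias, low variance
bias_hi, var_hi = bias_variance(degree=7)  # complex model: low bias, high variance
```

The simple (degree-1) model should show a clearly nonzero bias at x* but small variance; the flexible model the reverse.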
Regularization
Regularization: vs.
Overfitting
Hypothesis H is overfit when there exists an alternative hypothesis H′ such that H has smaller error than H′ on the training examples, but H′ has smaller error than H on the test examples.
Causes of overfitting:
  Training set is too small
  Large number of features

Overfitting is a big problem in machine learning.
Solutions: bias, regularization, a validation set.
Overfitting
[Plot: accuracy (roughly 0.6-0.9) on training data and on test data vs. model complexity (e.g., number of nodes in the decision tree)]
Learning Bayes Nets
Learning parameters for a Bayesian network:
  Fully observable:
    Maximum Likelihood (ML)
    Maximum A Posteriori (MAP)
    Bayesian
  Hidden variables: the EM algorithm
Learning the structure of Bayesian networks
What’s in a Bayes Net?
Structure: Earthquake → {Alarm, Radio}; Burglary → Alarm; Alarm → {Nbr1Calls, Nbr2Calls}

| Pr(B=t) | Pr(B=f) |
|---------|---------|
| 0.05    | 0.95    |

Pr(A | E, B):

| E, B   | Pr(A=t) | Pr(A=f) |
|--------|---------|---------|
| e, b   | 0.90    | 0.10    |
| e, ¬b  | 0.20    | 0.80    |
| ¬e, b  | 0.85    | 0.15    |
| ¬e, ¬b | 0.01    | 0.99    |
Parameter Estimation and Bayesian Networks
| E | B | R | A | J | M |
|---|---|---|---|---|---|
| T | F | T | T | F | T |
| F | F | F | F | F | T |
| F | T | F | T | T | T |
| F | F | F | T | T | T |
| F | T | F | F | F | F |
| … |   |   |   |   |   |

We have: the Bayes net structure and observations.
We need: the Bayes net parameters.
Parameter Estimation and Bayesian Networks
| E | B | R | A | J | M |
|---|---|---|---|---|---|
| T | F | T | T | F | T |
| F | F | F | F | F | T |
| F | T | F | T | T | T |
| F | F | F | T | T | T |
| F | T | F | F | F | F |

P(B) = 2/5 = 0.4 (fraction of rows with B = T)
P(¬B) = 1 − P(B) = 0.6
Parameter Estimation and Bayesian Networks
(Same five observations as above.)

P(A|E,B) = ?
P(A|E,¬B) = ?
P(A|¬E,B) = ?
P(A|¬E,¬B) = ?
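The maximum-likelihood answers are just relative frequencies. A sketch that counts them from the five rows shown above (the `p` helper and T/F-as-1/0 encoding are my illustration, not lecture code):

```python
# Columns E, B, R, A, J, M from the table above, with T = 1 and F = 0.
data = [
    dict(E=1, B=0, R=1, A=1, J=0, M=1),
    dict(E=0, B=0, R=0, A=0, J=0, M=1),
    dict(E=0, B=1, R=0, A=1, J=1, M=1),
    dict(E=0, B=0, R=0, A=1, J=1, M=1),
    dict(E=0, B=1, R=0, A=0, J=0, M=0),
]

def p(var, value=1, given=None):
    """ML estimate: (# rows with var = value matching the conditions) / (# rows matching)."""
    rows = data if given is None else [
        r for r in data if all(r[k] == v for k, v in given.items())
    ]
    if not rows:
        return None   # no matching rows: the estimate is undefined from this data
    return sum(r[var] == value for r in rows) / len(rows)

print(p("B"))                        # 0.4, as on the slide
print(p("A", given=dict(E=0, B=1)))  # 0.5 from these five rows
```

Conditional estimates like P(A|E,B) simply restrict the count to rows matching the parents' values; with only five rows, some combinations never occur and the estimate is undefined.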
Parameter Estimation and Bayesian Networks
Coin
Coin Flip

P(H|C1) = 0.1   P(H|C2) = 0.5   P(H|C3) = 0.9

Which coin will I use?
P(C1) = 1/3   P(C2) = 1/3   P(C3) = 1/3

Prior: probability of a hypothesis before we make any observations.
Uniform prior: all hypotheses are equally likely before we make any observations.
Experiment 1: Heads

Which coin did I use?
P(C1|H) = ?   P(C2|H) = ?   P(C3|H) = ?

P(H|C1) = 0.1   P(H|C2) = 0.5   P(H|C3) = 0.9
P(C1) = 1/3   P(C2) = 1/3   P(C3) = 1/3
P(C1|H) = 0.066   P(C2|H) = 0.333   P(C3|H) = 0.600

Posterior: probability of a hypothesis given the data.
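The posterior above is just Bayes' rule with the three likelihoods; a sketch (variable and function names are mine):

```python
p_heads = {"C1": 0.1, "C2": 0.5, "C3": 0.9}   # P(H | coin), from the slides
uniform = {"C1": 1/3, "C2": 1/3, "C3": 1/3}   # uniform prior

def posterior(flips, prior):
    """P(coin | flips) by Bayes' rule: proportional to likelihood x prior."""
    unnorm = {}
    for c, pc in prior.items():
        like = 1.0
        for f in flips:                        # flips is a string like "H" or "HT"
            like *= p_heads[c] if f == "H" else 1.0 - p_heads[c]
        unnorm[c] = like * pc
    z = sum(unnorm.values())                   # normalizing constant, P(flips)
    return {c: v / z for c, v in unnorm.items()}

post = posterior("H", uniform)
print({c: round(v, 3) for c, v in post.items()})   # C1: 0.067, C2: 0.333, C3: 0.6
```

The same function applied to "HT" reproduces the next slide's posteriors (0.21, 0.58, 0.21).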
Terminology
Prior: Probability of a hypothesis before we see any data
Uniform prior: a prior that makes all hypotheses equally likely

Posterior: probability of a hypothesis after we have seen some data
Likelihood: Probability of data given hypothesis
Experiment 2: Tails

Now, which coin did I use?
P(C1|HT) = ?   P(C2|HT) = ?   P(C3|HT) = ?

P(H|C1) = 0.1   P(H|C2) = 0.5   P(H|C3) = 0.9
P(C1) = 1/3   P(C2) = 1/3   P(C3) = 1/3
P(C1|HT) = 0.21   P(C2|HT) = 0.58   P(C3|HT) = 0.21
Your Estimate?

What is the probability of heads after two experiments?
P(H|C1) = 0.1   P(H|C2) = 0.5   P(H|C3) = 0.9
P(C1) = 1/3   P(C2) = 1/3   P(C3) = 1/3

Most likely coin: C2.   Best estimate for P(H): P(H|C2) = 0.5.
Your Estimate?

Most likely coin: C2 (P(C2) = 1/3).   Best estimate for P(H): P(H|C2) = 0.5.

Maximum Likelihood Estimate: the best hypothesis that fits the observed data, assuming a uniform prior.
Using Prior Knowledge

Should we always use a uniform prior? Background knowledge:
  Heads ⇒ we have to buy Dan chocolate.
  Dan likes chocolate…
  ⇒ Dan is more likely to use a coin biased in his favor.

P(H|C1) = 0.1   P(H|C2) = 0.5   P(H|C3) = 0.9
Using Prior Knowledge

We can encode it in the prior:
P(C1) = 0.05   P(C2) = 0.25   P(C3) = 0.70
(with P(H|C1) = 0.1, P(H|C2) = 0.5, P(H|C3) = 0.9 as before)
Experiment 1: Heads

Which coin did I use?
P(C1|H) = ?   P(C2|H) = ?   P(C3|H) = ?

P(H|C1) = 0.1   P(H|C2) = 0.5   P(H|C3) = 0.9
P(C1) = 0.05   P(C2) = 0.25   P(C3) = 0.70
P(C1|H) = 0.006   P(C2|H) = 0.165   P(C3|H) = 0.829

Compare with the posteriors under the uniform prior after Experiment 1:
P(C1|H) = 0.066   P(C2|H) = 0.333   P(C3|H) = 0.600
Experiment 2: Tails

Which coin did I use?
P(C1|HT) = ?   P(C2|HT) = ?   P(C3|HT) = ?

P(H|C1) = 0.1   P(H|C2) = 0.5   P(H|C3) = 0.9
P(C1) = 0.05   P(C2) = 0.25   P(C3) = 0.70
P(C1|HT) = 0.035   P(C2|HT) = 0.481   P(C3|HT) = 0.485
Your Estimate?

What is the probability of heads after two experiments?
P(H|C1) = 0.1   P(H|C2) = 0.5   P(H|C3) = 0.9
P(C1) = 0.05   P(C2) = 0.25   P(C3) = 0.70

Most likely coin: C3.   Best estimate for P(H): P(H|C3) = 0.9.
Your Estimate?

Most likely coin: C3 (P(C3) = 0.70).   Best estimate for P(H): P(H|C3) = 0.9.

Maximum A Posteriori (MAP) Estimate: the best hypothesis that fits the observed data, assuming a non-uniform prior.
Did We Do The Right Thing?

P(C1|HT) = 0.035   P(C2|HT) = 0.481   P(C3|HT) = 0.485
P(H|C1) = 0.1   P(H|C2) = 0.5   P(H|C3) = 0.9
![Page 75: CSE 573: Artificial Intelligence Spring 2012 Learning Bayesian Networks Dan Weld Slides adapted from Carlos Guestrin, Krzysztof Gajos, Dan Klein, Stuart](https://reader033.vdocuments.net/reader033/viewer/2022042718/56649e865503460f94b8919c/html5/thumbnails/75.jpg)
Did We Do The Right Thing?
P(C1|HT) = 0.035  P(C2|HT) = 0.481  P(C3|HT) = 0.485
P(H|C1) = 0.1  P(H|C2) = 0.5  P(H|C3) = 0.9
C1  C2  C3
C2 and C3 are almost equally likely
![Page 76: CSE 573: Artificial Intelligence Spring 2012 Learning Bayesian Networks Dan Weld Slides adapted from Carlos Guestrin, Krzysztof Gajos, Dan Klein, Stuart](https://reader033.vdocuments.net/reader033/viewer/2022042718/56649e865503460f94b8919c/html5/thumbnails/76.jpg)
A Better Estimate
P(H|C1) = 0.1  P(H|C2) = 0.5  P(H|C3) = 0.9
C1  C2  C3
Recall: P(H|HT) = Σi P(Ci|HT) P(H|Ci) = 0.680
P(C1|HT) = 0.035  P(C2|HT) = 0.481  P(C3|HT) = 0.485
![Page 77: CSE 573: Artificial Intelligence Spring 2012 Learning Bayesian Networks Dan Weld Slides adapted from Carlos Guestrin, Krzysztof Gajos, Dan Klein, Stuart](https://reader033.vdocuments.net/reader033/viewer/2022042718/56649e865503460f94b8919c/html5/thumbnails/77.jpg)
Bayesian Estimate
P(C1|HT) = 0.035  P(C2|HT) = 0.481  P(C3|HT) = 0.485
P(H|C1) = 0.1  P(H|C2) = 0.5  P(H|C3) = 0.9
C1  C2  C3
P(H|HT) = Σi P(Ci|HT) P(H|Ci) = 0.680
Bayesian Estimate: minimizes prediction error, given the data, assuming an arbitrary prior
![Page 78: CSE 573: Artificial Intelligence Spring 2012 Learning Bayesian Networks Dan Weld Slides adapted from Carlos Guestrin, Krzysztof Gajos, Dan Klein, Stuart](https://reader033.vdocuments.net/reader033/viewer/2022042718/56649e865503460f94b8919c/html5/thumbnails/78.jpg)
Comparison. After more experiments: HTHHHHHHHHH
ML (Maximum Likelihood): P(H) = 0.5; after 10 experiments: P(H) = 0.9
MAP (Maximum A Posteriori): P(H) = 0.9; after 10 experiments: P(H) = 0.9
Bayesian: P(H) = 0.68; after 10 experiments: P(H) = 0.9
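The three estimates can be reproduced in a few lines of Python; this is a sketch of the slides' three-coin calculation (variable names are my own):

```python
# Three-coin example from the slides: priors over coins and P(heads) per coin.
priors = [0.05, 0.25, 0.70]
p_heads = [0.1, 0.5, 0.9]

# Observed data: one head, one tail, so the likelihood per coin is p * (1 - p).
likelihoods = [p * (1 - p) for p in p_heads]

# Posterior P(coin | HT) by Bayes' rule.
unnorm = [pr * lk for pr, lk in zip(priors, likelihoods)]
z = sum(unnorm)
posteriors = [u / z for u in unnorm]

# MAP estimate: P(heads) of the a-posteriori most probable coin.
map_estimate = p_heads[max(range(3), key=lambda i: posteriors[i])]

# Bayesian estimate: posterior-weighted average of P(heads).
bayes_estimate = sum(po * ph for po, ph in zip(posteriors, p_heads))

print([round(p, 3) for p in posteriors])  # [0.035, 0.481, 0.485]
print(map_estimate)                       # 0.9
print(round(bayes_estimate, 3))           # 0.68
```

The printed posteriors and the 0.68 Bayesian estimate match the numbers on the slides.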
![Page 79: CSE 573: Artificial Intelligence Spring 2012 Learning Bayesian Networks Dan Weld Slides adapted from Carlos Guestrin, Krzysztof Gajos, Dan Klein, Stuart](https://reader033.vdocuments.net/reader033/viewer/2022042718/56649e865503460f94b8919c/html5/thumbnails/79.jpg)
Summary
Maximum Likelihood Estimate: uniform prior; hypothesis = the most likely; easy to compute
Maximum A Posteriori Estimate: any prior; hypothesis = the most likely; still easy to compute; incorporates prior knowledge
Bayesian Estimate: any prior; hypothesis = weighted combination; minimizes error; great when data is scarce; potentially much harder to compute
![Page 80: CSE 573: Artificial Intelligence Spring 2012 Learning Bayesian Networks Dan Weld Slides adapted from Carlos Guestrin, Krzysztof Gajos, Dan Klein, Stuart](https://reader033.vdocuments.net/reader033/viewer/2022042718/56649e865503460f94b8919c/html5/thumbnails/80.jpg)
Bayesian Learning
Use Bayes rule:
Posterior: P(Y | X) = P(X | Y) P(Y) / P(X)
Or equivalently: P(Y | X) ∝ P(X | Y) P(Y)
Here P(Y) is the prior, P(X | Y) the data likelihood, and P(X) the normalization constant.
![Page 81: CSE 573: Artificial Intelligence Spring 2012 Learning Bayesian Networks Dan Weld Slides adapted from Carlos Guestrin, Krzysztof Gajos, Dan Klein, Stuart](https://reader033.vdocuments.net/reader033/viewer/2022042718/56649e865503460f94b8919c/html5/thumbnails/81.jpg)
Parameter Estimation and Bayesian Networks
E B R A J M
T F T T F T
F F F F F T
F T F T T T
F F F T T T
F T F F F F
...
P(B) = ?
Prior [plot: distribution over the parameter, 0 to 1] + data = Posterior [plot: distribution over the parameter, 0 to 1]
Now compute either MAP or Bayesian estimate
![Page 82: CSE 573: Artificial Intelligence Spring 2012 Learning Bayesian Networks Dan Weld Slides adapted from Carlos Guestrin, Krzysztof Gajos, Dan Klein, Stuart](https://reader033.vdocuments.net/reader033/viewer/2022042718/56649e865503460f94b8919c/html5/thumbnails/82.jpg)
What Prior to Use? Previously, you knew it was one of only three coins.
Now things are more complicated…
The following are two common priors:
Binary variable: Beta. Conjugate to the binomial likelihood, so the posterior is again a Beta distribution; easy to compute the posterior.
Discrete variable: Dirichlet. Conjugate to the multinomial likelihood, so the posterior is again a Dirichlet distribution; easy to compute the posterior.
© Daniel S. Weld
© Daniel S. Weld
![Page 83: CSE 573: Artificial Intelligence Spring 2012 Learning Bayesian Networks Dan Weld Slides adapted from Carlos Guestrin, Krzysztof Gajos, Dan Klein, Stuart](https://reader033.vdocuments.net/reader033/viewer/2022042718/56649e865503460f94b8919c/html5/thumbnails/83.jpg)
Beta Distribution
![Page 84: CSE 573: Artificial Intelligence Spring 2012 Learning Bayesian Networks Dan Weld Slides adapted from Carlos Guestrin, Krzysztof Gajos, Dan Klein, Stuart](https://reader033.vdocuments.net/reader033/viewer/2022042718/56649e865503460f94b8919c/html5/thumbnails/84.jpg)
Beta Distribution Example: flip a coin with a Beta distribution as the prior over p [prob(heads)]
1. Parameterized by two positive numbers: a, b
2. Mean of the distribution, E[p], is a/(a+b)
3. Specify our prior belief for p as a/(a+b)
4. Specify confidence in this belief with high initial values for a and b
Update the prior belief based on data by incrementing a for every heads outcome and incrementing b for every tails outcome
So after h heads out of n flips, our posterior distribution says P(heads) = (a+h)/(a+b+n)
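The update rule above is easy to check in code; a minimal sketch (the function name is illustrative):

```python
def beta_posterior_mean(a, b, heads, tails):
    """Posterior mean of P(heads) after observing data, with a Beta(a, b) prior.

    The Beta prior is conjugate to the coin-flip likelihood, so the posterior
    is Beta(a + heads, b + tails), whose mean is available in closed form.
    """
    return (a + heads) / (a + b + heads + tails)

# Weak prior belief P(heads) = 0.5 (a = b = 1), then observe 9 heads, 1 tail.
print(beta_posterior_mean(1, 1, 9, 1))    # 10/12 ≈ 0.833

# A more confident prior (a = b = 10) pulls the estimate back toward 0.5.
print(beta_posterior_mean(10, 10, 9, 1))  # 19/30 ≈ 0.633
```

The second call illustrates point 4 above: larger a and b mean the same data moves the estimate less.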
![Page 85: CSE 573: Artificial Intelligence Spring 2012 Learning Bayesian Networks Dan Weld Slides adapted from Carlos Guestrin, Krzysztof Gajos, Dan Klein, Stuart](https://reader033.vdocuments.net/reader033/viewer/2022042718/56649e865503460f94b8919c/html5/thumbnails/85.jpg)
One Prior: Beta Distribution
Beta(a, b): P(p) = Γ(a+b) / (Γ(a) Γ(b)) · p^(a−1) (1−p)^(b−1)
For any positive integer y, Γ(y) = (y−1)!
![Page 86: CSE 573: Artificial Intelligence Spring 2012 Learning Bayesian Networks Dan Weld Slides adapted from Carlos Guestrin, Krzysztof Gajos, Dan Klein, Stuart](https://reader033.vdocuments.net/reader033/viewer/2022042718/56649e865503460f94b8919c/html5/thumbnails/86.jpg)
Parameter Estimation and Bayesian Networks
E B R A J M
T F T T F T
F F F F F T
F T F T T T
F F F T T T
F T F F F F
...
P(B|data) = ?
Prior Beta(1,4) + data = Beta(3,7)
B: 0.3  ¬B: 0.7
Prior P(B) = 1/(1+4) = 20%, with equivalent sample size 5
![Page 87: CSE 573: Artificial Intelligence Spring 2012 Learning Bayesian Networks Dan Weld Slides adapted from Carlos Guestrin, Krzysztof Gajos, Dan Klein, Stuart](https://reader033.vdocuments.net/reader033/viewer/2022042718/56649e865503460f94b8919c/html5/thumbnails/87.jpg)
Parameter Estimation and Bayesian Networks
E B R A J M
T F T T F T
F F F F F T
F T F T T T
F F F T T T
F T F F F F
...
P(A|E,B) = ?  P(A|E,¬B) = ?  P(A|¬E,B) = ?  P(A|¬E,¬B) = ?
Prior: Beta(2,3)
![Page 88: CSE 573: Artificial Intelligence Spring 2012 Learning Bayesian Networks Dan Weld Slides adapted from Carlos Guestrin, Krzysztof Gajos, Dan Klein, Stuart](https://reader033.vdocuments.net/reader033/viewer/2022042718/56649e865503460f94b8919c/html5/thumbnails/88.jpg)
Parameter Estimation and Bayesian Networks
E B R A J M
T F T T F T
F F F F F T
F T F T T T
F F F T T T
F T F F F F
...
P(A|E,B) = ?  P(A|E,¬B) = ?  P(A|¬E,B) = ?  P(A|¬E,¬B) = ?
Prior Beta(2,3) + data = Beta(3,4)
![Page 89: CSE 573: Artificial Intelligence Spring 2012 Learning Bayesian Networks Dan Weld Slides adapted from Carlos Guestrin, Krzysztof Gajos, Dan Klein, Stuart](https://reader033.vdocuments.net/reader033/viewer/2022042718/56649e865503460f94b8919c/html5/thumbnails/89.jpg)
Output of Learning
E B R A J M
T F T T F T
F F F F F T
F T F T T T
F F F T T T
F T F F F F
...
Pr(A | E, B):
  e, b:   0.9 (0.1)
  e, ¬b:  0.2 (0.8)
  ¬e, b:  0.85 (0.15)
  ¬e, ¬b: 0.01 (0.99)
Pr(B=t) = 0.05  Pr(B=f) = 0.95
![Page 90: CSE 573: Artificial Intelligence Spring 2012 Learning Bayesian Networks Dan Weld Slides adapted from Carlos Guestrin, Krzysztof Gajos, Dan Klein, Stuart](https://reader033.vdocuments.net/reader033/viewer/2022042718/56649e865503460f94b8919c/html5/thumbnails/90.jpg)
Did Learning Work Well?
E B R A J M
T F T T F T
F F F F F T
F T F T T T
F F F T T T
F T F F F F
...
Pr(A | E, B):
  e, b:   0.9 (0.1)
  e, ¬b:  0.2 (0.8)
  ¬e, b:  0.85 (0.15)
  ¬e, ¬b: 0.01 (0.99)
Pr(B=t) = 0.05  Pr(B=f) = 0.95
Can easily calculate P(data) for learned parameters
![Page 91: CSE 573: Artificial Intelligence Spring 2012 Learning Bayesian Networks Dan Weld Slides adapted from Carlos Guestrin, Krzysztof Gajos, Dan Klein, Stuart](https://reader033.vdocuments.net/reader033/viewer/2022042718/56649e865503460f94b8919c/html5/thumbnails/91.jpg)
Learning with Continuous Variables
Earthquake
Pr(E=x): Gaussian with mean μ = ? and variance σ² = ?
![Page 92: CSE 573: Artificial Intelligence Spring 2012 Learning Bayesian Networks Dan Weld Slides adapted from Carlos Guestrin, Krzysztof Gajos, Dan Klein, Stuart](https://reader033.vdocuments.net/reader033/viewer/2022042718/56649e865503460f94b8919c/html5/thumbnails/92.jpg)
Using Bayes Nets for Classification
One method of classification: use a probabilistic model!
Features are observed random variables Fi
Y is the query variable
Use probabilistic inference to compute the most likely Y
You already know how to do this inference
![Page 93: CSE 573: Artificial Intelligence Spring 2012 Learning Bayesian Networks Dan Weld Slides adapted from Carlos Guestrin, Krzysztof Gajos, Dan Klein, Stuart](https://reader033.vdocuments.net/reader033/viewer/2022042718/56649e865503460f94b8919c/html5/thumbnails/93.jpg)
A Popular Structure: Naïve Bayes
Y (class value)
F1  F2  F3  …  FN
Assume that features are conditionally independent given the class variable.
Works surprisingly well for classification (predicting the right class), but forces probabilities towards 0 and 1.
![Page 94: CSE 573: Artificial Intelligence Spring 2012 Learning Bayesian Networks Dan Weld Slides adapted from Carlos Guestrin, Krzysztof Gajos, Dan Klein, Stuart](https://reader033.vdocuments.net/reader033/viewer/2022042718/56649e865503460f94b8919c/html5/thumbnails/94.jpg)
Naïve Bayes
Naïve Bayes assumption: features are independent given the class:
P(X1, …, Xn | Y) = ∏i P(Xi | Y)
More generally: P(Y, X1, …, Xn) = P(Y) ∏i P(Xi | Y)
How many parameters? Suppose X is composed of n binary features.
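To answer the parameter-count question: a full joint over binary Y and n binary features needs 2^(n+1) − 1 numbers, while naïve Bayes needs only 2n + 1 (one for P(Y) plus two per feature, one for each class value). A quick sketch (function names are mine):

```python
def full_joint_params(n):
    # Joint over binary Y and n binary features: 2^(n+1) outcomes,
    # minus 1 because probabilities must sum to one.
    return 2 ** (n + 1) - 1

def naive_bayes_params(n):
    # 1 parameter for P(Y=t), plus P(Fi=t | Y) for each feature
    # and each of the two class values.
    return 1 + 2 * n

print(full_joint_params(20))   # 2097151
print(naive_bayes_params(20))  # 41
```

With 20 features the independence assumption shrinks the table from millions of entries to a few dozen, which is the whole point of the naïve Bayes structure.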
![Page 95: CSE 573: Artificial Intelligence Spring 2012 Learning Bayesian Networks Dan Weld Slides adapted from Carlos Guestrin, Krzysztof Gajos, Dan Klein, Stuart](https://reader033.vdocuments.net/reader033/viewer/2022042718/56649e865503460f94b8919c/html5/thumbnails/95.jpg)
A Spam Filter
Naïve Bayes spam filter
Data: collection of emails, labeled spam or ham
Note: someone has to hand-label all this data!
Split into training, held-out, and test sets
Classifiers: learn on the training set, (tune on a held-out set), test on new emails
Dear Sir.
First, I must solicit your confidence in this transaction, this is by virture of its nature as being utterly confidencial and top secret. …
TO BE REMOVED FROM FUTURE MAILINGS, SIMPLY REPLY TO THIS MESSAGE AND PUT "REMOVE" IN THE SUBJECT.
99 MILLION EMAIL ADDRESSES FOR ONLY $99
Ok, I know this is blatantly OT but I'm beginning to go insane. Had an old Dell Dimension XPS sitting in the corner and decided to put it to use, I know it was working pre being stuck in the corner, but when I plugged it in, hit the power nothing happened.
![Page 96: CSE 573: Artificial Intelligence Spring 2012 Learning Bayesian Networks Dan Weld Slides adapted from Carlos Guestrin, Krzysztof Gajos, Dan Klein, Stuart](https://reader033.vdocuments.net/reader033/viewer/2022042718/56649e865503460f94b8919c/html5/thumbnails/96.jpg)
Naïve Bayes for Text: Bag-of-Words Naïve Bayes
Predict unknown class label (spam vs. ham)
Assume evidence features (e.g., the words) are independent. Warning: subtly different assumptions than before!
Generative model
Tied distributions and bag-of-words: usually, each variable gets its own conditional probability distribution P(F|Y). In a bag-of-words model, each position is identically distributed and all positions share the same conditional probabilities P(W|C). Why make this assumption?
Note: W is the word at position i, not the ith word in the dictionary!
![Page 97: CSE 573: Artificial Intelligence Spring 2012 Learning Bayesian Networks Dan Weld Slides adapted from Carlos Guestrin, Krzysztof Gajos, Dan Klein, Stuart](https://reader033.vdocuments.net/reader033/viewer/2022042718/56649e865503460f94b8919c/html5/thumbnails/97.jpg)
Estimation: Laplace Smoothing
Laplace's estimate: pretend you saw every outcome once more than you actually did:
P_LAP(x) = (count(x) + 1) / (N + |X|)
Can derive this as a MAP estimate with a Dirichlet prior (Bayesian justification)
Example data: H H T, so P_LAP(H) = (2+1)/(3+2) = 3/5
![Page 98: CSE 573: Artificial Intelligence Spring 2012 Learning Bayesian Networks Dan Weld Slides adapted from Carlos Guestrin, Krzysztof Gajos, Dan Klein, Stuart](https://reader033.vdocuments.net/reader033/viewer/2022042718/56649e865503460f94b8919c/html5/thumbnails/98.jpg)
NB with Bag of Words for Text Classification
Learning phase:
Prior P(Y): count how many documents come from each topic
P(Xi|Y): for each of m topics, count how many times you saw word Xi in documents of that topic (+ k for the prior), then divide by the total number of words seen in that topic (+ k|words|)
Test phase: for each document, use the naïve Bayes decision rule
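The learning and test phases above can be sketched as a tiny add-k smoothed classifier (all names and the toy data are illustrative):

```python
import math
from collections import Counter

def train_nb(docs, labels, k=1):
    """Bag-of-words naïve Bayes with add-k (Laplace) smoothing.

    docs: list of token lists; labels: parallel list of class labels.
    """
    classes = set(labels)
    vocab = {w for d in docs for w in d}
    # Prior P(Y): fraction of documents per class, kept in log space.
    log_prior = {c: math.log(labels.count(c) / len(labels)) for c in classes}
    # Word counts and total word counts per class.
    counts = {c: Counter() for c in classes}
    totals = {c: 0 for c in classes}
    for d, y in zip(docs, labels):
        counts[y].update(d)
        totals[y] += len(d)
    return log_prior, counts, totals, vocab, k

def classify(model, doc):
    log_prior, counts, totals, vocab, k = model
    scores = {}
    for c in log_prior:
        s = log_prior[c]
        for w in doc:
            if w in vocab:  # ignore words never seen in training
                # Smoothed P(w | c) = (count + k) / (total + k * |vocab|).
                s += math.log((counts[c][w] + k) / (totals[c] + k * len(vocab)))
        scores[c] = s
    return max(scores, key=scores.get)

docs = [["free", "money", "now"], ["meeting", "at", "noon"],
        ["free", "offer"], ["lunch", "at", "noon"]]
labels = ["spam", "ham", "spam", "ham"]
model = train_nb(docs, labels)
print(classify(model, ["free", "money"]))  # spam
```

Note the scores are accumulated as sums of logs rather than products of probabilities, which matters for the underflow issue discussed on the next slide.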
![Page 99: CSE 573: Artificial Intelligence Spring 2012 Learning Bayesian Networks Dan Weld Slides adapted from Carlos Guestrin, Krzysztof Gajos, Dan Klein, Stuart](https://reader033.vdocuments.net/reader033/viewer/2022042718/56649e865503460f94b8919c/html5/thumbnails/99.jpg)
Probabilities: Important Detail!
Any more potential problems here?
P(spam | X1 … Xn) = ∏i P(spam | Xi)
We are multiplying lots of small numbers: danger of underflow!
0.5^57 ≈ 7 × 10^−18
Solution? Use logs and add! p1 · p2 = e^(log(p1) + log(p2))
Always keep probabilities in log form
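A small demonstration of the underflow issue and the log-space fix, using the slide's 0.5^57 example:

```python
import math

probs = [0.5] * 57

# Naive product: already near the bottom of double precision;
# longer products eventually round to exactly 0.0.
product = 1.0
for p in probs:
    product *= p
print(product)  # ≈ 6.9e-18

# Log-space: sum the logs instead, which stays well-scaled.
log_product = sum(math.log(p) for p in probs)
print(log_product)  # 57 * log(0.5) ≈ -39.5

# The two agree: exponentiating the log-sum recovers the product.
assert math.isclose(product, math.exp(log_product))
```

For classification you never need to exponentiate at all: compare the per-class log scores directly and take the argmax.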
![Page 100: CSE 573: Artificial Intelligence Spring 2012 Learning Bayesian Networks Dan Weld Slides adapted from Carlos Guestrin, Krzysztof Gajos, Dan Klein, Stuart](https://reader033.vdocuments.net/reader033/viewer/2022042718/56649e865503460f94b8919c/html5/thumbnails/100.jpg)
Naïve Bayes
Y (class value)
F1  F2  F3  …  FN
Assume that features are conditionally independent given the class variable.
Works surprisingly well for classification (predicting the right class), but forces probabilities towards 0 and 1.
![Page 101: CSE 573: Artificial Intelligence Spring 2012 Learning Bayesian Networks Dan Weld Slides adapted from Carlos Guestrin, Krzysztof Gajos, Dan Klein, Stuart](https://reader033.vdocuments.net/reader033/viewer/2022042718/56649e865503460f94b8919c/html5/thumbnails/101.jpg)
Example Bayes’ Net: Car
![Page 102: CSE 573: Artificial Intelligence Spring 2012 Learning Bayesian Networks Dan Weld Slides adapted from Carlos Guestrin, Krzysztof Gajos, Dan Klein, Stuart](https://reader033.vdocuments.net/reader033/viewer/2022042718/56649e865503460f94b8919c/html5/thumbnails/102.jpg)
What if we don’t know structure?
![Page 103: CSE 573: Artificial Intelligence Spring 2012 Learning Bayesian Networks Dan Weld Slides adapted from Carlos Guestrin, Krzysztof Gajos, Dan Klein, Stuart](https://reader033.vdocuments.net/reader033/viewer/2022042718/56649e865503460f94b8919c/html5/thumbnails/103.jpg)
Learning the Structure of Bayesian Networks
Search through the space of possible network structures! (for now, still assume we can observe all values)
For each structure, learn parameters, as just shown
Pick the one that fits the observed data best: calculate P(data)
![Page 104: CSE 573: Artificial Intelligence Spring 2012 Learning Bayesian Networks Dan Weld Slides adapted from Carlos Guestrin, Krzysztof Gajos, Dan Klein, Stuart](https://reader033.vdocuments.net/reader033/viewer/2022042718/56649e865503460f94b8919c/html5/thumbnails/104.jpg)
[Figure: three candidate network structures over nodes A, B, C, D, E]
Two problems:
• Fully connected will be most probable
• Exponential number of structures
![Page 105: CSE 573: Artificial Intelligence Spring 2012 Learning Bayesian Networks Dan Weld Slides adapted from Carlos Guestrin, Krzysztof Gajos, Dan Klein, Stuart](https://reader033.vdocuments.net/reader033/viewer/2022042718/56649e865503460f94b8919c/html5/thumbnails/105.jpg)
Learning the Structure of Bayesian Networks
Search through the space of possible network structures!
For each structure, learn parameters, as just shown
Pick the one that fits the observed data best: calculate P(data)
Two problems:
• Fully connected will be most probable → add a penalty term (regularization) for model complexity
• Exponential number of structures → use local search
![Page 106: CSE 573: Artificial Intelligence Spring 2012 Learning Bayesian Networks Dan Weld Slides adapted from Carlos Guestrin, Krzysztof Gajos, Dan Klein, Stuart](https://reader033.vdocuments.net/reader033/viewer/2022042718/56649e865503460f94b8919c/html5/thumbnails/106.jpg)
[Figure: candidate network structures over nodes A, B, C, D, E explored by local search]
![Page 107: CSE 573: Artificial Intelligence Spring 2012 Learning Bayesian Networks Dan Weld Slides adapted from Carlos Guestrin, Krzysztof Gajos, Dan Klein, Stuart](https://reader033.vdocuments.net/reader033/viewer/2022042718/56649e865503460f94b8919c/html5/thumbnails/107.jpg)
Score Functions
Bayesian Information Criterion (BIC): log P(D | BN) − penalty, where penalty = ½ (# parameters) × log(# data points)
MAP score: P(BN | D) ∝ P(D | BN) P(BN); P(BN) must decay exponentially with the number of parameters for this to work well
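The BIC trade-off can be illustrated with a toy comparison (the function and the numbers are hypothetical, chosen only to show the penalty at work):

```python
import math

def bic_score(log_likelihood, n_params, n_data):
    """BIC as on the slide: log-likelihood of the data under the network,
    minus half the parameter count times the log of the data-set size."""
    return log_likelihood - 0.5 * n_params * math.log(n_data)

# A denser structure fits the data slightly better but pays
# for its extra parameters.
sparse = bic_score(log_likelihood=-1000.0, n_params=10, n_data=500)
dense  = bic_score(log_likelihood=-995.0,  n_params=40, n_data=500)
print(sparse > dense)  # True: the penalty outweighs the small gain in fit
```

This is why the fully connected network, despite always having the highest likelihood, does not win under a BIC-style score.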
![Page 108: CSE 573: Artificial Intelligence Spring 2012 Learning Bayesian Networks Dan Weld Slides adapted from Carlos Guestrin, Krzysztof Gajos, Dan Klein, Stuart](https://reader033.vdocuments.net/reader033/viewer/2022042718/56649e865503460f94b8919c/html5/thumbnails/108.jpg)
Learning as Optimization
Preference bias: the loss function
Minimize loss over training data (and ultimately test data): Loss(h, data) = error(h, data) + complexity(h), i.e., error + regularization
Methods: closed form, greedy search, gradient ascent
![Page 109: CSE 573: Artificial Intelligence Spring 2012 Learning Bayesian Networks Dan Weld Slides adapted from Carlos Guestrin, Krzysztof Gajos, Dan Klein, Stuart](https://reader033.vdocuments.net/reader033/viewer/2022042718/56649e865503460f94b8919c/html5/thumbnails/109.jpg)
Topics
Learning parameters for a Bayesian network
Fully observable: Maximum Likelihood (ML), Maximum A Posteriori (MAP), Bayesian
Hidden variables (EM algorithm)
Learning the structure of Bayesian networks