Statistical Learning: Bayesian and ML (COMP155, Sections 20.1-20.2, May 2, 2007)
Statistical Learning:Bayesian and ML
COMP155
Sections 20.1-20.2May 2, 2007
Definitions
• a posteriori: derived from observed facts
• a priori: based on hypothesis or theory rather than experiment
Bayesian Learning
• Make predictions using all hypotheses, weighted by their probabilities
• Bayes’ rule: P(a | b) = α P(b | a) P(a)
• For each hypothesis hi, observed data d:
• P(hi | d) = α P(d | hi) P(hi)
• P(d | hi) is the likelihood of d under hypothesis hi
• P(hi) is the hypothesis prior
α is a normalization constant = 1 / ∑i P(d | hi) P(hi)
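The update rule above is easy to sketch in code; the function name and the example numbers below are illustrative placeholders, not from the slides:

```python
def posterior(priors, likelihoods):
    """P(h_i | d) = alpha * P(d | h_i) * P(h_i), with alpha normalizing the sum to 1."""
    unnorm = [lik * pri for lik, pri in zip(likelihoods, priors)]
    alpha = 1.0 / sum(unnorm)  # alpha = 1 / sum_i P(d | h_i) P(h_i)
    return [alpha * u for u in unnorm]

# Two equally likely hypotheses; the data is twice as likely under the second,
# so the posteriors come out 1/3 and 2/3.
print(posterior(priors=[0.5, 0.5], likelihoods=[0.2, 0.4]))
```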
Bayesian Learning
• We want to predict some quantity X:
P(X | d) = ∑i P(X | d, hi) P(hi | d) = ∑i P(X | hi) P(hi | d)
• The predictions are weighted averages over the predictions of the individual hypotheses
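A minimal sketch of this weighted average (the function name is mine; the numbers are the one-lime posteriors from the candy example later in the deck):

```python
def predict(per_hypothesis_preds, posteriors):
    """P(X | d) = sum_i P(X | h_i) * P(h_i | d): a posterior-weighted average."""
    return sum(p * w for p, w in zip(per_hypothesis_preds, posteriors))

# Candy example after one lime: P(lime | h_i) and P(h_i | one lime).
p_lime_given_h = [0.0, 0.25, 0.5, 0.75, 1.0]
posteriors = [0.0, 0.1, 0.4, 0.3, 0.2]
print(predict(p_lime_given_h, posteriors))  # ≈ 0.65
```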
Example
• Suppose we know that there are 5 kinds of bags of candy:

| Type | cherry | lime | % of all bags |
| --- | --- | --- | --- |
| Type 1 | 100% | 0% | 10% |
| Type 2 | 75% | 25% | 20% |
| Type 3 | 50% | 50% | 40% |
| Type 4 | 25% | 75% | 20% |
| Type 5 | 0% | 100% | 10% |
Example: priors
• Given a new bag of candy, predict the type of the bag
• Five hypotheses:
  • h1: bag is type 1, P(h1) = 0.1
  • h2: bag is type 2, P(h2) = 0.2
  • h3: bag is type 3, P(h3) = 0.4
  • h4: bag is type 4, P(h4) = 0.2
  • h5: bag is type 5, P(h5) = 0.1
• With no evidence, we use the hypothesis priors
Example: one lime candy
• Suppose we unwrap one candy and determine that it is lime. The unnormalized values sum to 0.5, so α = 1/0.5 = 2:
• P(h1 | onelime) = α P(onelime | h1) P(h1) = 2 × (0 × 0.1) = 0
• P(h2 | onelime) = α P(onelime | h2) P(h2) = 2 × (0.25 × 0.2) = 0.1
• P(h3 | onelime) = α P(onelime | h3) P(h3) = 2 × (0.5 × 0.4) = 0.4
• P(h4 | onelime) = α P(onelime | h4) P(h4) = 2 × (0.75 × 0.2) = 0.3
• P(h5 | onelime) = α P(onelime | h5) P(h5) = 2 × (1.0 × 0.1) = 0.2
Example: two lime candies
• Suppose we unwrap another candy and it is also lime. The unnormalized values now sum to 0.325, so α = 1/0.325 ≈ 3.08:
• P(h1 | twolime) = α P(twolime | h1) P(h1) = 3.08 × (0 × 0.1) = 0
• P(h2 | twolime) = α P(twolime | h2) P(h2) = 3.08 × (0.0625 × 0.2) ≈ 0.04
• P(h3 | twolime) = α P(twolime | h3) P(h3) = 3.08 × (0.25 × 0.4) ≈ 0.31
• P(h4 | twolime) = α P(twolime | h4) P(h4) = 3.08 × (0.5625 × 0.2) ≈ 0.35
• P(h5 | twolime) = α P(twolime | h5) P(h5) = 3.08 × (1.0 × 0.1) ≈ 0.31
Example: n lime candies
• Suppose we unwrap n candies and they are all lime. With normalization constant α_n:
• P(h1 | nlime) = α_n (0^n × 0.1)
• P(h2 | nlime) = α_n (0.25^n × 0.2)
• P(h3 | nlime) = α_n (0.5^n × 0.4)
• P(h4 | nlime) = α_n (0.75^n × 0.2)
• P(h5 | nlime) = α_n (1^n × 0.1)
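These closed-form posteriors are easy to check numerically. A minimal sketch (variable names are mine) using the priors and per-bag lime probabilities from the example:

```python
PRIORS = [0.1, 0.2, 0.4, 0.2, 0.1]     # P(h_i) for bag types 1..5
P_LIME = [0.0, 0.25, 0.5, 0.75, 1.0]   # P(lime | h_i)

def posteriors_after_n_limes(n):
    """P(h_i | n limes) = alpha_n * P(lime | h_i)^n * P(h_i)."""
    unnorm = [p ** n * prior for p, prior in zip(P_LIME, PRIORS)]
    alpha_n = 1.0 / sum(unnorm)
    return [alpha_n * u for u in unnorm]

# Matches the one-lime posteriors above: ≈ [0, 0.1, 0.4, 0.3, 0.2]
print(posteriors_after_n_limes(1))
```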
Prediction: what candy is next?
• P(nextlime | nlime) = ∑i P(nextlime | hi) P(hi | nlime)
  = P(nextlime | h1) P(h1 | nlime) + P(nextlime | h2) P(h2 | nlime) + P(nextlime | h3) P(h3 | nlime) + P(nextlime | h4) P(h4 | nlime) + P(nextlime | h5) P(h5 | nlime)
  = 0 × α_n (0^n × 0.1) + 0.25 × α_n (0.25^n × 0.2) + 0.5 × α_n (0.5^n × 0.4) + 0.75 × α_n (0.75^n × 0.2) + 1 × α_n (1^n × 0.1)
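A sketch that evaluates this sum for any n (names are mine); the prediction climbs toward 1 as the all-lime bag dominates the posterior:

```python
PRIORS = [0.1, 0.2, 0.4, 0.2, 0.1]     # P(h_i) for bag types 1..5
P_LIME = [0.0, 0.25, 0.5, 0.75, 1.0]   # P(lime | h_i)

def p_next_lime(n):
    """P(nextlime | n limes) = sum_i P(lime | h_i) * P(h_i | n limes)."""
    unnorm = [p ** n * prior for p, prior in zip(P_LIME, PRIORS)]
    alpha_n = 1.0 / sum(unnorm)
    return sum(p * alpha_n * u for p, u in zip(P_LIME, unnorm))

# Prediction rises with each lime observed: 0.5, 0.65, 0.73, ...
for n in (0, 1, 2, 5, 10):
    print(n, round(p_next_lime(n), 3))
```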
(Plot: P(nextlime | nlime) rises toward 1 as n grows; the plotted value 0.97 corresponds to n = 10.)
Analysis: Bayesian Prediction
• The true hypothesis eventually dominates
• The posterior probability of any false hypothesis eventually vanishes
• The probability of uncharacteristic data becomes vanishingly small
• Bayesian prediction is optimal
• Bayesian prediction is expensive
  • The hypothesis space may be very large (or infinite)
MAP Approximation
• To avoid the expense of Bayesian learning, one approach is to simply choose the most probable hypothesis and assume it is correct
• MAP = maximum a posteriori
• hmap = the hi with the highest value of P(hi | d)
• In the candy example, after 3 limes have been selected a MAP learner will always predict that the next candy is lime with 100% probability
• Less accurate, but much cheaper
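A sketch of the MAP shortcut (names are mine): pick the single argmax hypothesis and predict with it alone. Since α is the same for every hypothesis, normalization can be skipped when taking the argmax:

```python
PRIORS = [0.1, 0.2, 0.4, 0.2, 0.1]     # P(h_i) for bag types 1..5
P_LIME = [0.0, 0.25, 0.5, 0.75, 1.0]   # P(lime | h_i)

def map_predict_lime(n):
    """Predict P(lime) using only the most probable hypothesis after n limes."""
    unnorm = [p ** n * prior for p, prior in zip(P_LIME, PRIORS)]
    h_map = max(range(len(unnorm)), key=unnorm.__getitem__)  # argmax posterior
    return P_LIME[h_map]

# By the third lime, the all-lime bag is MAP and lime is predicted with certainty.
print([map_predict_lime(n) for n in (1, 2, 3)])  # [0.5, 0.75, 1.0]
```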
Avoiding Complexity
• As we’ve seen earlier, allowing overly complex hypotheses can lead to overfitting
• Bayesian and MAP learning use the hypothesis prior to penalize complex hypotheses
  • Complex hypotheses typically have lower priors: since there are many more complex hypotheses than simple ones, each individual one gets less prior probability
• We get the simplest hypothesis consistent with the data (as per Ockham’s razor)
ML Approximation
• For large data sets the priors become irrelevant; in this case we may use maximum likelihood (ML) learning
• Choose the hml that maximizes P(d | hi)
• That is, choose the hypothesis that has the highest probability of producing the observed data
• Identical to MAP for uniform priors
• ML is the standard (non-Bayesian) statistical learning method
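A sketch of ML learning for the candy example (names are mine): ignore the priors and pick the hypothesis under which the observed counts are most likely. Note how ML leaps to the extreme all-lime bag after a single lime, where MAP is tempered by the priors:

```python
P_LIME = [0.0, 0.25, 0.5, 0.75, 1.0]   # P(lime | h_i) for bag types 1..5

def ml_hypothesis(n_lime, n_cherry):
    """argmax_i P(d | h_i), with d = n_lime limes and n_cherry cherries observed."""
    def likelihood(p):
        return p ** n_lime * (1 - p) ** n_cherry
    return max(range(len(P_LIME)), key=lambda i: likelihood(P_LIME[i]))

print(ml_hypothesis(1, 0))  # 4: one lime already makes type 5 the ML hypothesis
print(ml_hypothesis(1, 1))  # 2: a lime and a cherry make the 50/50 bag most likely
```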
Exercise
• Suppose we were pulling candy from a 50/50 bag (type 3) or a 25/75 bag (type 4)
• With full Bayesian learning, what would the posterior probability and prediction plots look like after 100 candies?
• What would the prediction plots look like for MAP and ML learning after 1000 candies?
Bayesian 50/50 bag
Bayesian 50/50 bag
Bayesian 25/75 bag
Bayesian 25/75 bag
MAP 50/50 bag
ML 50/50 bag
MAP 25/75 bag
ML 25/75 bag
Exercise
Answer