CS B553: ALGORITHMS FOR OPTIMIZATION AND LEARNING
Monte Carlo Methods for Probabilistic Inference
AGENDA
- Monte Carlo methods
  - O(1/sqrt(N)) standard deviation
- Monte Carlo methods for Bayesian inference
  - Likelihood weighting
  - Gibbs sampling
MONTE CARLO INTEGRATION
Estimate large integrals/sums:
  $I = \int f(x)\,p(x)\,dx$   or   $I = \sum_x f(x)\,p(x)$
Using N i.i.d. samples $x^{(1)}, \ldots, x^{(N)}$ drawn from $p(x)$:
  $I \approx \frac{1}{N}\sum_{i=1}^{N} f(x^{(i)})$
Examples:
- $\int_a^b f(x)\,dx \approx \frac{b-a}{N}\sum_{i=1}^{N} f(x^{(i)})$, with $x^{(i)}$ sampled uniformly from $[a,b]$
- $E[X] = \int x\,p(x)\,dx \approx \frac{1}{N}\sum_{i=1}^{N} x^{(i)}$
- Volume of a set in $\mathbb{R}^n$
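As a quick illustration of the estimator above (not from the slides), here is a minimal Python sketch; the choice of integrand sin(x) on [0, pi] is arbitrary:

```python
import math
import random

def mc_integrate(f, a, b, n=100_000):
    """Estimate the integral of f over [a, b] with uniform samples:
    (b - a)/N * sum_i f(x_i)."""
    total = 0.0
    for _ in range(n):
        x = random.uniform(a, b)
        total += f(x)
    return (b - a) * total / n

# Example: the integral of sin(x) over [0, pi] is exactly 2.
print(mc_integrate(math.sin, 0.0, math.pi))  # ~2.0; error shrinks like 1/sqrt(N)
```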
MEAN & VARIANCE OF ESTIMATE
Let $I_N$ be the random variable denoting the estimate of the integral with N samples.

What is the bias (mean error) $E[I - I_N]$?
  $E[I - I_N] = I - E[I_N]$  (linearity of expectation)
  $= E[f(x)] - \frac{1}{N}\sum_{i=1}^{N} E[f(x^{(i)})]$  (definition of $I$ and $I_N$)
  $= \frac{1}{N}\sum_{i=1}^{N}\left(E[f(x)] - E[f(x^{(i)})]\right) = \frac{1}{N}\sum_{i=1}^{N} 0$  ($x$ and $x^{(i)}$ are distributed w.r.t. $p(x)$)
  $= 0$
=> Unbiased estimator.

What is the variance $\mathrm{Var}[I_N]$?
  $\mathrm{Var}[I_N] = \mathrm{Var}\!\left[\frac{1}{N}\sum_{i=1}^{N} f(x^{(i)})\right]$  (definition)
  $= \frac{1}{N^2}\,\mathrm{Var}\!\left[\sum_{i=1}^{N} f(x^{(i)})\right]$  (scaling of variance)
  $= \frac{1}{N^2}\sum_{i=1}^{N}\mathrm{Var}[f(x^{(i)})]$  (variance of a sum of independent variables)
  $= \frac{1}{N}\,\mathrm{Var}[f(x)]$  (i.i.d. sample)
Standard deviation: $O(1/\sqrt{N})$
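A small empirical check of the $O(1/\sqrt{N})$ behavior (illustrative, not from the slides); the integrand $f(x) = x^2$ with $X \sim \mathrm{Uniform}(0,1)$ is an arbitrary choice with true mean 1/3:

```python
import random
import statistics

def estimate(n):
    # Monte Carlo estimate of E[f(X)] with f(x) = x^2, X ~ Uniform(0,1); true value 1/3.
    return sum(random.random() ** 2 for _ in range(n)) / n

for n in (100, 10_000):
    runs = [estimate(n) for _ in range(200)]
    print(n, statistics.stdev(runs))
# Multiplying N by 100 shrinks the standard deviation by about a factor of 10.
```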
APPROXIMATE INFERENCE THROUGH SAMPLING
Unconditional simulation: to estimate the probability of a coin flipping heads, I can flip it a huge number of times and count the fraction of heads observed.

Conditional simulation: to estimate the probability P(H) that a coin picked out of a bucket B flips heads, repeat for i = 1, ..., N:
1. Pick a coin C out of a random bucket b(i) chosen with probability P(B)
2. h(i) = flip C according to probability P(H | b(i))
3. Sample (h(i), b(i)) comes from distribution P(H, B)
The resulting samples approximate P(H, B).
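A minimal Python sketch of this conditional-simulation loop (illustrative; the two buckets and their heads probabilities are made-up values):

```python
import random

# Hypothetical setup: two buckets, each holding coins with a different heads probability.
P_BUCKET = {"fair": 0.5, "biased": 0.5}   # P(B)
P_HEADS = {"fair": 0.5, "biased": 0.9}    # P(H=1 | B)

def sample_joint():
    b = random.choices(list(P_BUCKET), weights=list(P_BUCKET.values()))[0]  # 1. pick bucket ~ P(B)
    h = 1 if random.random() < P_HEADS[b] else 0                            # 2. flip ~ P(H | b)
    return h, b                                                             # 3. (h, b) ~ P(H, B)

samples = [sample_joint() for _ in range(100_000)]
print(sum(h for h, _ in samples) / len(samples))  # estimates P(H=1) = 0.7 here
```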
MONTE CARLO INFERENCE IN BAYES NETS
BN over variables X. Repeat for i = 1, ..., N:
- In top-down order, generate x(i) as follows: sample $x_j^{(i)} \sim P(X_j \mid pa_{X_j}^{(i)})$ (the RHS is obtained by plugging the parent values already in the sample into the CPT for $X_j$)
The samples x(1), ..., x(N) approximate the distribution over X.
APPROXIMATE INFERENCE: MONTE-CARLO SIMULATION
Sample from the joint distribution of the Burglary network (Burglary -> Alarm <- Earthquake; Alarm -> JohnCalls, Alarm -> MaryCalls):

  P(B) = 0.001        P(E) = 0.002

  B  E | P(A|B,E)       A | P(J|A)       A | P(M|A)
  T  T | 0.95           T | 0.90         T | 0.70
  T  F | 0.94           F | 0.05         F | 0.01
  F  T | 0.29
  F  F | 0.001

Example sample: B=0, E=0, A=0, J=1, M=0
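A minimal Python sketch of this top-down (ancestral) sampling for the network above, with CPT values transcribed from the tables (illustrative code, not from the slides):

```python
import random

# CPTs of the Burglary network, as given in the slides.
P_B = 0.001                                                       # P(B=1)
P_E = 0.002                                                       # P(E=1)
P_A = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}   # P(A=1 | B, E)
P_J = {1: 0.90, 0: 0.05}                                          # P(J=1 | A)
P_M = {1: 0.70, 0: 0.01}                                          # P(M=1 | A)

def bernoulli(p):
    return 1 if random.random() < p else 0

def forward_sample():
    """Top-down sampling: draw each variable from its CPT after its parents."""
    b = bernoulli(P_B)
    e = bernoulli(P_E)
    a = bernoulli(P_A[(b, e)])
    j = bernoulli(P_J[a])
    m = bernoulli(P_M[a])
    return {"B": b, "E": e, "A": a, "J": j, "M": m}

samples = [forward_sample() for _ in range(100_000)]
# The empirical distribution of the samples approximates the joint P(B,E,A,J,M).
print(sum(s["J"] for s in samples) / len(samples))  # estimates P(J=1)
```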
APPROXIMATE INFERENCE: MONTE-CARLO SIMULATION
As more samples are generated, the distribution of the samples approaches the joint distribution:
  B=0, E=0, A=0, J=1, M=0
  B=0, E=0, A=0, J=0, M=0
  B=0, E=0, A=0, J=0, M=0
  B=1, E=0, A=1, J=1, M=0
BASIC METHOD FOR HANDLING EVIDENCE
Inference: given evidence E=e (e.g., J=1), approximate P(X\E | E=e).
Remove the samples that conflict with the evidence:
  B=0, E=0, A=0, J=1, M=0   (kept)
  B=0, E=0, A=0, J=0, M=0   (removed: J=0 conflicts)
  B=0, E=0, A=0, J=0, M=0   (removed: J=0 conflicts)
  B=1, E=0, A=1, J=1, M=0   (kept)
The distribution of the remaining samples approximates the conditional distribution.
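A sketch of this rejection approach, reusing forward_sample from the snippet above:

```python
def rejection_sample(n, evidence):
    """Keep only the forward samples that agree with the evidence; the
    survivors are distributed according to the conditional distribution."""
    return [s for s in (forward_sample() for _ in range(n))
            if all(s[var] == val for var, val in evidence.items())]

kept = rejection_sample(200_000, {"J": 1})
if kept:  # estimate P(B=1 | J=1) from the surviving samples
    print(sum(s["B"] for s in kept) / len(kept))
```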
RARE EVENT PROBLEM
What if some events are really rare (e.g., burglary & earthquake)? The number of samples must be huge to get a reasonable estimate.
Solution: likelihood weighting
- Enforce that each sample agrees with the evidence
- While generating a sample, keep track of the ratio
  (how likely the sampled value is to occur in the real world) / (how likely you were to generate the sampled value)
LIKELIHOOD WEIGHTING
Suppose the evidence is Alarm=1 & MaryCalls=1. Sample B and E with P=0.5 each (CPTs as in the network above); evidence variables are never sampled, and each sample carries a weight w.

Sample 1:
- Start: w = 1
- Sample B=0, E=1: w = P(B=0) × P(E=1) / (0.5 × 0.5) = (0.999 × 0.002) / 0.25 ≈ 0.008
- A=1 is enforced, and the weight is updated to reflect the likelihood that this occurs: w ≈ 0.008 × P(A=1|B=0,E=1) = 0.008 × 0.29 ≈ 0.0023
- M=1 is enforced: w ≈ 0.0023 × P(M=1|A=1) = 0.0023 × 0.7 ≈ 0.0016; J=1 is sampled from P(J|A=1), which leaves the weight unchanged

Sample 2:
- Sample B=0, E=0: w = (0.999 × 0.998) / 0.25 ≈ 3.988
- A=1 enforced: w ≈ 3.988 × 0.001 ≈ 0.004
- M=1 enforced: w ≈ 0.004 × 0.7 ≈ 0.0028; J=1 sampled

Sample 3:
- Sample B=1, E=0, then A=1 enforced: w = (0.001 × 0.998) / 0.25 × 0.94 ≈ 0.00375
- M=1 enforced: w ≈ 0.00375 × 0.7 ≈ 0.0026; J=1 sampled

Sample 4:
- B=1, E=1, A=1, M=1, J=1: w = 5e-7 (≈ 0)

The four weighted samples:
  B=0, E=1, A=1, M=1, J=1   w = 0.0016
  B=0, E=0, A=1, M=1, J=1   w = 0.0028
  B=1, E=0, A=1, M=1, J=1   w = 0.0026
  B=1, E=1, A=1, M=1, J=1   w ≈ 0

N=4 gives P(B|A,M) ≈ 0.371; exact inference gives P(B|A,M) = 0.375.
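Below is an illustrative sketch of likelihood weighting, reusing the CPT dictionaries and the bernoulli helper from the forward-sampling snippet. One deliberate difference from the slides: they sample B and E with probability 0.5 each (with the matching weight correction), while this sketch uses the more common variant that samples non-evidence variables directly from their CPTs, so those variables contribute a weight ratio of 1:

```python
def likelihood_weighted_sample(evidence):
    """One weighted sample: evidence variables are clamped to their observed
    values and the weight accumulates their CPT probabilities; everything
    else is sampled top-down from its CPT."""
    w = 1.0
    s = {"B": bernoulli(P_B), "E": bernoulli(P_E)}
    for var, cond in (("A", lambda: P_A[(s["B"], s["E"])]),
                      ("J", lambda: P_J[s["A"]]),
                      ("M", lambda: P_M[s["A"]])):
        p = cond()  # P(var=1 | sampled parents)
        if var in evidence:
            s[var] = evidence[var]
            w *= p if evidence[var] == 1 else 1 - p
        else:
            s[var] = bernoulli(p)
    return s, w

# Estimate P(B=1 | A=1, M=1) as a weighted average.
pairs = [likelihood_weighted_sample({"A": 1, "M": 1}) for _ in range(100_000)]
print(sum(w for s, w in pairs if s["B"] == 1) / sum(w for _, w in pairs))
# Approaches the exact posterior (0.375 per the slides).
```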
ANOTHER RARE-EVENT PROBLEM
Network: A1, A2, ..., A10, each Ai the parent of a corresponding Bi (B1, B2, ..., B10).
B=b is given as evidence, and each bi is rare under all but one setting of Ai (say, Ai=1).
The chance of sampling all 1's is very low => most likelihood weights will be too low.
Problem: the evidence is not being used to sample the A's effectively (i.e., near P(Ai|b)).
GIBBS SAMPLING
Idea: reduce the computational burden of sampling from a multidimensional distribution P(x) = P(x1, ..., xn) by doing repeated draws of individual attributes:
- Cycle through j = 1, ..., n
- Sample $x_j \sim P(x_j \mid x_1, \ldots, x_{j-1}, x_{j+1}, \ldots, x_n)$
Over the long run, the random walk taken by x approaches the true distribution P(x).
GIBBS SAMPLING IN BNS
Each Gibbs sampling step: 1) pick a variable Xi, 2) sample xi ~ P(Xi | X\Xi).
Only the values of Xi's "Markov blanket" matter:
- Parents Pa_Xi
- Children Y1, ..., Yk
- Parents of children (excluding Xi): Pa_Y1\Xi, ..., Pa_Yk\Xi
Xi is independent of the rest of the network given its Markov blanket, so
  $x_i \sim P(X_i \mid Pa_{X_i}, Y_1, Pa_{Y_1}\setminus X_i, \ldots, Y_k, Pa_{Y_k}\setminus X_i) = \frac{1}{Z}\, P(X_i \mid Pa_{X_i}) \prod_{j=1}^{k} P(Y_j \mid Pa_{Y_j})$
i.e., the product of Xi's factor and the factors of its children.
HANDLING EVIDENCE
Simply set each evidence variable to its observed value; don't sample it.
The resulting walk approximates the distribution P(X\E | E=e).
This uses evidence more efficiently than likelihood weighting.
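A minimal Gibbs sampler for the Burglary network combining the two slides above: evidence variables stay clamped while each free variable is resampled given all the others. For brevity it scores full assignments with the joint, which is proportional to the Markov-blanket product; it reuses the CPT dictionaries and bernoulli from the earlier sketch:

```python
def joint_prob(s):
    """Unnormalized joint probability of a complete assignment."""
    p = (P_B if s["B"] else 1 - P_B) * (P_E if s["E"] else 1 - P_E)
    pa = P_A[(s["B"], s["E"])]
    p *= pa if s["A"] else 1 - pa
    for var, cpt in (("J", P_J), ("M", P_M)):
        q = cpt[s["A"]]
        p *= q if s[var] else 1 - q
    return p

def gibbs_estimate(evidence, n_sweeps=50_000, burn_in=1_000):
    state = {"B": 0, "E": 0, "A": 0, "J": 0, "M": 0}
    state.update(evidence)                      # evidence stays clamped
    free = [v for v in state if v not in evidence]
    count = total = 0
    for t in range(n_sweeps):
        for v in free:                          # resample v given all other variables
            p1 = joint_prob(dict(state, **{v: 1}))
            p0 = joint_prob(dict(state, **{v: 0}))
            state[v] = bernoulli(p1 / (p1 + p0))
        if t >= burn_in:                        # discard the mixing-time prefix
            count += state["B"]
            total += 1
    return count / total

print(gibbs_estimate({"A": 1, "M": 1}))  # approaches P(B=1 | A=1, M=1)
```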
GIBBS SAMPLING ISSUES
- Demonstrating correctness & convergence requires examining the Markov chain random walk (more later)
- Many steps may be needed before the effects of poor initialization wear off (mixing time); it is difficult to tell how many are needed a priori
- Numerous variants exist, known collectively as Markov chain Monte Carlo (MCMC) techniques
NEXT TIME
Continuous and hybrid distributions