Austerity in MCMC Land: Cutting the Computational Budget
Max Welling (U. Amsterdam / UC Irvine)
Collaborators: Yee Whye Teh (University of Oxford)
S. Ahn, A. Korattikara, Y. Chen (PhD students, UCI)
The Big Data Hype
(and what it means if you’re a Bayesian)
Why be a Big Bayesian?
• If there is so much data anyway, why bother being Bayesian?
• Answer 1: If you don’t have to worry about over-fitting, your model is likely too small.
• Answer 2: Big Data may mean big D instead of big N.
• Answer 3: Not every variable may be able to use all the data-items to reduce its uncertainty.
Bayesian Modeling
• Bayes rule allows us to express the posterior over parameters in terms of the prior and likelihood terms:
$p(\theta \mid x_1, \dots, x_N) = \frac{p(\theta) \prod_{i=1}^{N} p(x_i \mid \theta)}{p(x_1, \dots, x_N)}$
MCMC for Posterior Inference
• Predictions can be approximated by performing a Monte Carlo average over posterior samples:
$p(y \mid x, \mathcal{D}) \approx \frac{1}{T} \sum_{t=1}^{T} p(y \mid x, \theta_t), \qquad \theta_t \sim p(\theta \mid \mathcal{D})$
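As a toy illustration (not from the slides; `f` and `posterior_samples` are placeholder names), the Monte Carlo average above is just:

```python
import numpy as np

def predictive_mean(f, posterior_samples):
    """Monte Carlo approximation of E[f(theta) | data]:
    average f over samples theta_t ~ p(theta | data), e.g. from MCMC."""
    return np.mean([f(theta) for theta in posterior_samples], axis=0)
```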
Mini-Tutorial: MCMC
Following example copied from: An Introduction to MCMC for Machine Learning, Andrieu, de Freitas, Doucet, Jordan, Machine Learning, 2003.
Examples of MCMC in CS/Eng.
Image Segmentation by Data-Driven MCMC (Tu & Zhu, TPAMI, 2002)
Simultaneous Localization and Mapping
Simulation by Dieter Fox
MCMC
• We can generate a correlated sequence of samples that has the posterior as its equilibrium distribution.
Painful when N = 1,000,000,000: every accept/reject decision requires evaluating all N likelihood terms.
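For concreteness, a minimal sketch of one standard MH step (not the speaker's code; `log_prior`, `log_lik`, `propose`, and `log_q` are hypothetical user-supplied callables). The point is the sum over the full data set X:

```python
import numpy as np

def mh_step(theta, X, log_prior, log_lik, propose, log_q, rng):
    """One standard Metropolis-Hastings step.
    log_q(a, b) = log q(a | b), the proposal density.
    The acceptance test sums over the full data set: all N likelihood
    terms are evaluated to produce a single accept/reject bit."""
    theta_new = propose(theta, rng)
    log_ratio = (log_prior(theta_new) - log_prior(theta)
                 + sum(log_lik(x, theta_new) - log_lik(x, theta) for x in X)
                 + log_q(theta, theta_new) - log_q(theta_new, theta))
    if np.log(rng.uniform()) < log_ratio:   # accept
        return theta_new
    return theta                            # reject
```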
What are we doing (wrong)?
1 billion real numbers (N log-likelihoods)
1 bit (accept or reject sample)
At every iteration, we compute 1 billion (N) real numbers to make a single binary decision….
• Observation 1: In the context of Big Data, stochastic gradient descent can make fairly good decisions before MCMC has made a single move.
• Observation 2: We don’t think very much about errors caused by sampling from the wrong distribution (bias) and errors caused by randomness (variance).
• We think “asymptotically”: reduce bias to zero in burn-in phase, then start sampling to reduce variance.
• For Big Data we don't have that luxury: time is finite and computation is on a budget.
Can we do better?
[Diagram: trading off bias and variance against computation]
Markov Chain Convergence
(Early phase, during burn-in: error dominated by bias. Late phase, during sampling: error dominated by variance.)
The MCMC Tradeoff
• You have T units of computation to achieve the lowest possible error.
• Your MCMC procedure has a knob to create bias in return for computation:
Turn right: fast; strong bias, low variance.
Turn left: slow; small bias, high variance.
• Claim: the optimal setting depends on T!
Two Ways to turn a Knob
• Accept a proposal with a given confidence: easy proposals now require far fewer data-items for a decision.
• Knob = Confidence [Korattikara et al., ICML 2013 (under review)]
• Langevin dynamics based on stochastic gradients: ignore the MH step.
• Knob = Stepsize [W. & Teh, ICML 2011; Ahn et al., ICML 2012]
Metropolis-Hastings on a Budget
Standard MH rule. Accept if:
$u < \min\left(1,\ \frac{p(\theta')\prod_{i=1}^{N} p(x_i \mid \theta')\; q(\theta_t \mid \theta')}{p(\theta_t)\prod_{i=1}^{N} p(x_i \mid \theta_t)\; q(\theta' \mid \theta_t)}\right), \qquad u \sim \mathrm{Uniform}[0,1]$
Equivalently: accept if $\mu > \mu_0$, where
$\mu = \frac{1}{N}\sum_{i=1}^{N}\left[\log p(x_i \mid \theta') - \log p(x_i \mid \theta_t)\right], \qquad \mu_0 = \frac{1}{N}\log\!\left[u\,\frac{p(\theta_t)\, q(\theta' \mid \theta_t)}{p(\theta')\, q(\theta_t \mid \theta')}\right]$
• Frame as a statistical test: given n < N data-items, can we confidently conclude $\mu > \mu_0$ (or $\mu < \mu_0$)?
MH as a Statistical Test
• Construct a t-statistic using a random draw of n data-cases out of N data-cases, without replacement:
$t = \frac{\bar{l} - \mu_0}{\frac{s_l}{\sqrt{n}}\sqrt{1 - \frac{n-1}{N-1}}}$
where $\bar{l}$ and $s_l$ are the sample mean and standard deviation of the log-likelihood differences, and $\sqrt{1 - \frac{n-1}{N-1}}$ is the correction factor for sampling without replacement.
• If the test is confident: accept or reject the proposal. If not: collect more data.
Sequential Hypothesis Tests
• Our algorithm draws more data (without replacement) until a decision is made.
• When n = N the test is equivalent to the standard MH test (the decision is forced).
• The procedure is related to the Pocock sequential design.
• We can bound the error in the equilibrium distribution because we control the error in the transition probability.
• Easy decisions (e.g. during burn-in) can now be made very fast, as in the sketch below.
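A minimal Python sketch of this sequential test, under stated assumptions: `loglik_diff(idx)` is a hypothetical callable returning the log-likelihood differences for the given data indices, `eps` is the allowed uncertainty, and the batching and test details follow the t-test description above rather than the authors' code:

```python
import numpy as np
from scipy import stats

def approx_mh_accept(loglik_diff, N, mu0, eps=0.05, batch=100, rng=None):
    """Sequential approximate MH test (sketch).
    Accept iff the mean of all N log-likelihood differences exceeds the
    threshold mu0, decided early once a t-test is confident at level eps."""
    rng = np.random.default_rng() if rng is None else rng
    perm = rng.permutation(N)              # draw data without replacement
    seen = np.empty(0)
    while len(seen) < N:
        idx = perm[len(seen):len(seen) + batch]
        seen = np.concatenate([seen, loglik_diff(idx)])
        n = len(seen)
        lbar, s = seen.mean(), seen.std(ddof=1)
        # standard error with the finite-population (no-replacement) correction
        se = (s / np.sqrt(n)) * np.sqrt(1.0 - (n - 1) / (N - 1))
        if se == 0.0:                      # n == N (or degenerate): decide now
            break
        tstat = (lbar - mu0) / se
        if stats.t.sf(abs(tstat), df=n - 1) < eps:
            return lbar > mu0              # confident early decision
    return seen.mean() > mu0               # equivalent to the exact MH test
```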
Tradeoff
[Plot: percentage of data used and percentage of wrong decisions as a function of the allowed uncertainty for making a decision]
Logistic Regression on MNIST
Two Ways to turn a Knob
• Accept a proposal with a given confidence: easy proposals now require far fewer data-items for a decision.
• Knob = Confidence [Korattikara et al., ICML 2013 (under review)]
• Langevin dynamics based on stochastic gradients: ignore the MH step.
• Knob = Stepsize [W. & Teh, ICML 2011; Ahn et al., ICML 2012]
Stochastic Gradient Descent
$\theta_{t+1} = \theta_t + \frac{\epsilon_t}{2}\left(\nabla \log p(\theta_t) + \frac{N}{n}\sum_{i=1}^{n} \nabla \log p(x_{t_i} \mid \theta_t)\right)$
Not painful when N = 1,000,000,000: each update touches only a mini-batch of n ≪ N data-items.
• Due to redundancy in the data, this method learns a good model long before it has seen all the data.
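A minimal sketch of this update in Python (illustrative only; `grad_log_prior` and `grad_log_lik` are assumed user-supplied gradient functions):

```python
import numpy as np

def sgd(grad_log_prior, grad_log_lik, X, theta0, eps, n_batch=100,
        n_iter=10_000, rng=None):
    """Stochastic gradient ascent on the log posterior (sketch).
    Each step uses a mini-batch of n_batch points scaled by N/n_batch,
    so the per-iteration cost does not depend on N."""
    rng = np.random.default_rng() if rng is None else rng
    theta = np.asarray(theta0, dtype=float)
    N = len(X)
    for _ in range(n_iter):
        idx = rng.choice(N, size=n_batch, replace=False)
        grad = grad_log_prior(theta) + (N / n_batch) * sum(
            grad_log_lik(theta, X[i]) for i in idx)
        theta = theta + 0.5 * eps * grad   # eps/2 as in the update above
    return theta
```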
Langevin Dynamics
• Add Gaussian noise to gradient ascent with the right variance.
• This will sample from the posterior if the stepsize goes to 0.
• One can add an accept/reject step and use larger stepsizes.
• This is one step of Hamiltonian Monte Carlo (HMC).
Langevin Dynamics with Stochastic Gradients
• Combine SGD with Langevin dynamics.
• No accept/reject rule, but decreasing stepsize instead.
• In the limit this non-homogeneous Markov chain converges to the correct posterior.
• But: mixing will slow down as the stepsize decreases…
Gradient Ascent:
$\theta_{t+1} = \theta_t + \frac{\epsilon}{2}\left(\nabla \log p(\theta_t) + \sum_{i=1}^{N} \nabla \log p(x_i \mid \theta_t)\right)$
↓ add noise, then a Metropolis-Hastings accept step
Langevin Dynamics:
$\theta_{t+1} = \theta_t + \frac{\epsilon}{2}\left(\nabla \log p(\theta_t) + \sum_{i=1}^{N} \nabla \log p(x_i \mid \theta_t)\right) + \eta_t, \qquad \eta_t \sim \mathcal{N}(0, \epsilon)$

Stochastic Gradient Ascent (mini-batch of n items, decreasing stepsize, e.g. $\epsilon_t = a(b+t)^{-\gamma}$):
$\theta_{t+1} = \theta_t + \frac{\epsilon_t}{2}\left(\nabla \log p(\theta_t) + \frac{N}{n}\sum_{i=1}^{n} \nabla \log p(x_{t_i} \mid \theta_t)\right)$
↓ add noise, no Metropolis-Hastings accept step
Stochastic Gradient Langevin Dynamics:
$\theta_{t+1} = \theta_t + \frac{\epsilon_t}{2}\left(\nabla \log p(\theta_t) + \frac{N}{n}\sum_{i=1}^{n} \nabla \log p(x_{t_i} \mid \theta_t)\right) + \eta_t, \qquad \eta_t \sim \mathcal{N}(0, \epsilon_t)$
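Putting the SGLD update above into code, a minimal sketch (again with assumed gradient callables and an illustrative stepsize schedule):

```python
import numpy as np

def sgld(grad_log_prior, grad_log_lik, X, theta0, a, b, gamma=0.55,
         n_batch=100, n_iter=10_000, rng=None):
    """Stochastic Gradient Langevin Dynamics (sketch).
    Compared to plain stochastic gradient ascent, each step adds
    N(0, eps_t) noise and uses a decreasing stepsize
    eps_t = a * (b + t)**(-gamma); there is no accept/reject step."""
    rng = np.random.default_rng() if rng is None else rng
    theta = np.asarray(theta0, dtype=float)
    N = len(X)
    samples = []
    for t in range(n_iter):
        eps_t = a * (b + t) ** (-gamma)
        idx = rng.choice(N, size=n_batch, replace=False)
        grad = grad_log_prior(theta) + (N / n_batch) * sum(
            grad_log_lik(theta, X[i]) for i in idx)
        theta = (theta + 0.5 * eps_t * grad
                 + rng.normal(0.0, np.sqrt(eps_t), size=theta.shape))
        samples.append(theta.copy())
    return np.array(samples)
```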
A Closer Look …
• For large ϵ the gradient term, of order ϵ, dominates the injected noise, of order √ϵ: the algorithm behaves like stochastic optimization.
• For small ϵ the injected noise dominates the gradient terms: the algorithm behaves like a sampler for the posterior.
Example: Mixture of Gaussians (MoG)
Mixing Issues
• The gradient is large in high-curvature directions, yet we need large variance in the directions of low curvature → slow convergence & mixing.
• We need a preconditioning matrix C (see the sketch below).
• For large N we know from the Bayesian CLT that the posterior is approximately normal (if the conditions apply).
• Can we exploit this to sample approximately with large stepsizes?
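As a sketch of what preconditioning looks like, here is the generic preconditioned Langevin update with a fixed matrix C (the actual SGFS update of Ahn et al. estimates C from stochastic gradients and differs in detail):

$$\theta_{t+1} = \theta_t + \frac{\epsilon}{2}\, C \left( \nabla \log p(\theta_t) + \sum_{i=1}^{N} \nabla \log p(x_i \mid \theta_t) \right) + \eta_t, \qquad \eta_t \sim \mathcal{N}(0,\ \epsilon\, C)$$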
The Bernstein-von Mises Theorem (Bayesian CLT)
$p(\theta \mid x_1, \dots, x_N) \approx \mathcal{N}\!\left(\theta_0,\ \tfrac{1}{N} I^{-1}(\theta_0)\right)$
with $\theta_0$ the "true" parameter and $I(\theta_0)$ the Fisher information at $\theta_0$.
Sampling Accuracy vs. Mixing Rate Tradeoff
• Stochastic Gradient Langevin Dynamics with preconditioning: samples from the correct posterior, $p(\theta \mid X)$, at low ϵ. High sampling accuracy, low mixing rate.
• Markov chain for the approximate Gaussian posterior: samples from the approximate (Bernstein-von Mises) posterior at any ϵ. High mixing rate, limited sampling accuracy.
A Hybrid
• Small ϵ: high sampling accuracy, low mixing rate.
• Large ϵ: high mixing rate, samples from the approximate (Gaussian) posterior.
Experiments (LR on MNIST)
No additional noise was added (all noise comes from subsampling the data). Batchsize = 300.
Diagonal approximation of the Fisher information (the approximation would become better if we decreased the stepsize and added noise).
Ground truth (HMC)
Experiments (LR on MNIST)
X-axis: mixing rate per unit of computation = the inverse of the total auto-correlation time times the wall-clock time per iteration.
Y-axis: error after T units of computation.
Every marker is a different value of the stepsize, alpha, etc.
Slope down: faster mixing still decreases the error (variance reduction).
Slope up: faster mixing increases the error (the error floor, i.e. bias, has been reached).
SGFS in a Nutshell
[Diagram: large stepsize → stochastic optimization; small stepsize → accurate sampling from the posterior.]
Conclusions
• Bayesian methods need to be scaled to Big Data problems.
• MCMC for Bayesian posterior inference can be much more efficient if we allow sampling with asymptotically biased procedures.
• Approximate MH-MCMC performs sequential tests to accept or reject proposals.
• SGLD/SGFS perform updates at the cost of O(100) data-points per iteration.
• Future research: an optimal policy for dialing down the bias over time.
QUESTIONS?