deep generative learning_icml_part1
DESCRIPTION
Workshop on Deep Generative Learning at ICML 2014, Part 1.
TRANSCRIPT
Bayesian Posterior Inference in the Big Data Arena
Max Welling, Anoop Korattikara
Outline
• Introduction
• Stochastic Variational Inference
  – Variational Inference 101
  – Stochastic Variational Inference
  – Deep Generative Models with SVB
• MCMC with mini-batches
  – MCMC 101
  – MCMC using noisy gradients
  – MCMC using noisy Metropolis-Hastings
  – Theoretical results
Big Data (mine is bigger than yours)
The Square Kilometer Array (SKA) will produce 1 exabyte per day by 2024… (if you are interested in doing approximate inference on this data, talk to me)
Introduction
Why do we need posterior inference if the datasets are BIG?
p >> N: big data may mean large p, small N
Examples: gene expression data, fMRI data
Planning
Planning against uncertainty needs probabilities
Little data inside big data
Not every data case carries information about every model component
New user with no ratings (cold start problem)
Big Models!
Models grow faster than useful information in data
• 1943: First NN (N ≈ 10)
• 1988: NetTalk (N ≈ 20K)
• 2009: Hinton's Deep Belief Net (N ≈ 10M)
• 2013: Google/Y! (N ≈ 10B)
Two Ingredients for Big Data Bayes
Any big data posterior inference algorithm should:
1. easily run on a distributed architecture.
2. only use a small mini-batch of the data at every iteration.
Bayesian Posterior Inference
Two routes: Variational vs. Sampling
(Variational methods search within a restricted variational family Q; sampling explores all probability distributions.)

Variational:
• Deterministic
• Biased
• Local minima
• Easy to assess convergence

Sampling:
• Stochastic (sample error)
• Unbiased
• Hard to mix between modes
• Hard to assess convergence
Variational Bayes
Hinton & van Camp (1993); Neal & Hinton (1999); Saul & Jordan (1996); Saul, Jaakkola & Jordan (1996); Attias (1999, 2000); Wiegerinck (2000); Ghahramani & Beal (2000, 2001)
Coordinate descent on Q
[Figure: Q approximating P (Bishop, Pattern Recognition and Machine Learning)]
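For reference, the standard variational-Bayes objective that "coordinate descent on Q" optimizes (the slide's own equations did not survive extraction; this is the usual textbook form, e.g. in Bishop):

```latex
\log p(X) \;=\; \mathcal{L}(Q) \;+\; \mathrm{KL}\big(Q(Z)\,\|\,p(Z \mid X)\big),
\qquad
\mathcal{L}(Q) \;=\; \mathbb{E}_{Q}\big[\log p(X, Z) - \log Q(Z)\big].
```

Maximizing the lower bound L(Q) over one factor of Q at a time (coordinate descent) is equivalent to minimizing the KL divergence from Q to the posterior.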
Stochastic VB (Hoffman, Blei & Bach, 2010)
Stochastic natural gradient descent on Q
• P and Q in the exponential family.
• Q factorized.
• At every iteration: subsample n << N data cases:
  • solve analytically.
  • update the parameter using stochastic natural gradient descent.
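For a conjugate exponential-family model, this stochastic natural-gradient step reduces to a convex combination of the current global parameter and an "as if the mini-batch were the whole dataset" estimate. A minimal sketch; the names (svi_update, minibatch_stats, prior_nat, tau0, kappa) are illustrative, not from the slides:

```python
import numpy as np

def svi_update(lam, minibatch_stats, n, N, t, prior_nat, tau0=1.0, kappa=0.6):
    """One stochastic natural-gradient step on a global natural parameter lam.

    minibatch_stats: expected sufficient statistics aggregated over the n
    local factors of the mini-batch (solved analytically, as on the slide).
    """
    rho = (tau0 + t) ** (-kappa)                     # Robbins-Monro step size
    lam_hat = prior_nat + (N / n) * minibatch_stats  # intermediate global parameter
    return (1.0 - rho) * lam + rho * lam_hat         # natural-gradient update
```

The step sizes satisfy the usual Robbins-Monro conditions, so the updates converge to a local optimum of the variational objective.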
General SVB
• Subsample X into mini-batches and sample from Q to estimate the gradient of the bound (ignoring latent variables Z).
• This naive estimator has very high variance.
Reparameterization Trick
Kingma 2013; Bengio 2013; Kingma & W. 2014

Other solutions to the same "large variance problem":
- Variational Bayesian Inference with Stochastic Search [D.M. Blei, M.I. Jordan and J.W. Paisley, 2012]
- Fixed-Form Variational Posterior Approximation through Stochastic Linear Regression [T. Salimans and A. Knowles, 2013]
- Black Box Variational Inference [R. Ranganath, S. Gerrish and D.M. Blei, 2013]
- Stochastic Variational Inference [M.D. Hoffman, D. Blei, C. Wang and J. Paisley, 2013]
- Estimating or Propagating Gradients Through Stochastic Neurons [Y. Bengio, 2013]
- Neural Variational Inference and Learning in Belief Networks [A. Mnih and K. Gregor, 2014]

Talk: Monday June 23, 15:20, in Track F (Deep Learning II)
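A minimal NumPy sketch of the trick: write z = mu + sigma * eps with eps ~ N(0, 1), so z becomes a deterministic, differentiable function of the variational parameters and gradients can be estimated pathwise with low variance. The toy objective f(z) = z² is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Reparameterization for a Gaussian q(z) = N(mu, sigma^2): sample noise that does
# not depend on the parameters, then transform it deterministically.
def sample_z(mu, sigma, n_samples=1):
    eps = rng.standard_normal(n_samples)   # eps ~ N(0, 1), parameter-free noise
    return mu + sigma * eps                # gradients w.r.t. mu, sigma flow through here

# Toy example: Monte Carlo gradient of E_q[z^2] w.r.t. mu.
# d/dmu E[(mu + sigma*eps)^2] = E[2*(mu + sigma*eps)] = 2*mu.
mu, sigma = 0.5, 1.0
z = sample_z(mu, sigma, n_samples=1000)
grad_mu_estimate = np.mean(2.0 * z)        # pathwise (reparameterized) estimator
print(grad_mu_estimate)                    # close to the exact value 2*mu = 1.0
```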
Auto-Encoding Variational Bayes
Both P(X|Z) and Q(Z|X) are general models (e.g. deep neural net)
Kingma & W., 2013, Rezende et al 2014
Related: the Helmholtz machine and the Wake/Sleep algorithm (Dayan, Hinton, Neal, Zemel, 1995)
[Figure: graphical model with latent Z and observed X; recognition model Q(Z|X), generative model P(X|Z)P(Z)]
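A minimal sketch of the resulting ELBO estimator for one data point, assuming (purely for illustration) a diagonal-Gaussian recognition model, a unit-variance Gaussian likelihood, and a linear "decoder" standing in for the deep networks the slide mentions:

```python
import numpy as np

rng = np.random.default_rng(1)

# AEVB-style Monte Carlo ELBO for one data point x:
#   q(z|x) = N(mu, diag(sigma^2)),  p(x|z) = N(decoder(z), I),  p(z) = N(0, I).
def elbo_estimate(x, mu, log_sigma, decoder, n_samples=10):
    sigma = np.exp(log_sigma)
    eps = rng.standard_normal((n_samples, mu.size))
    z = mu + sigma * eps                               # reparameterized samples from q(z|x)
    recon = np.array([decoder(zi) for zi in z])        # decoder means for p(x|z)
    log_lik = -0.5 * np.sum((x - recon) ** 2, axis=1)  # Gaussian log-likelihood, up to a constant
    # KL(q(z|x) || p(z)) between diagonal Gaussians, available in closed form:
    kl = 0.5 * np.sum(sigma ** 2 + mu ** 2 - 1.0 - 2.0 * log_sigma)
    return np.mean(log_lik) - kl                       # Monte Carlo estimate of the ELBO

# Toy usage with a hypothetical linear decoder (a deep net in the actual models).
W = rng.standard_normal((2, 3))                        # maps z in R^3 to x in R^2
x = np.array([0.3, -1.2])
print(elbo_estimate(x, mu=np.zeros(3), log_sigma=np.zeros(3), decoder=lambda z: W @ z))
```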
The VB Landscape
• SVB: Stochastic Variational Bayes
• AEVB: Auto-Encoding Variational Bayes
• SSVB: Structured Stochastic Variational Bayes
• FSSVB: Fully Structured Stochastic Variational Bayes (ICML 2015)
Variational Auto-Encoder (with 2 latent variables)
Face Model
Semi-supervised Model
[Figure: graphical model with latent Z, label Y, and observed X]
Generative model: P(X,Z,Y) = P(X|Z,Y)P(Y)P(Z)
Recognition model: Q(Y,Z|X) = Q(Z|Y,X)Q(Y|X)
Analogies: fix Z, vary Y, sample X|Z,Y
Kingma, Rezende, Mohamed, Wierstra, W., 2014
REFERENCES SVB:
- Practical Variational Inference for Neural Networks [A. Graves, 2011]
- Variational Bayesian Inference with Stochastic Search [D.M. Blei, M.I. Jordan and J.W. Paisley, 2012]
- Fixed-Form Variational Posterior Approximation through Stochastic Linear Regression. Bayesian Analysis [T. Salimans and A. Knowles, 2013]
- Black Box Variational Inference [R. Ranganath, S. Gerrish and D.M. Blei, 2013]
- Stochastic Variational Inference [M.D. Hoffman, D. Blei, C. Wang and J. Paisley, 2013]
- Stochastic Structured Mean Field Variational Inference [Matthew Hoffman, 2013]
- Doubly Stochastic Variational Bayes for non-Conjugate Inference [M.K. Titsias and M. Lázaro-Gredilla, 2014]
REFERENCES STOCHASTIC BACKPROP AND DEEP GENERATIVE MODELS
- Fast Gradient-Based Inference with Continuous Latent Variable Models in Auxiliary Form [D.P. Kingma, 2013]
- Estimating or Propagating Gradients Through Stochastic Neurons [Y. Bengio, 2013]
- Auto-Encoding Variational Bayes [D.P. Kingma and M. W., 2013]
- Semi-supervised Learning with Deep Generative Models [D.P. Kingma, D.J. Rezende, S. Mohamed, M. W., 2014]
- Efficient Gradient-Based Inference through Transformations between Bayes Nets and Neural Nets [D.P. Kingma and M. W., 2014]
- Deep Generative Stochastic Networks Trainable by Backprop [Y. Bengio, E. Laufer, G. Alain, J. Yosinski, 2014]
- Stochastic Back-propagation and Approximate Inference in Deep Generative Models [D.J. Rezende, S. Mohamed and D. Wierstra, 2014]
- Deep AutoRegressive Networks [K. Gregor, A. Mnih and D. Wierstra, 2014]
- Neural Variational Inference and Learning in Belief Networks [A. Mnih and K. Gregor, 2014]
References: Lots of action at ICML 2014!
Sampling 101 – Why MCMC?

Generating independent samples: sample from a proposal g and suppress samples with low p(θ|X), e.g. (a) rejection sampling, (b) importance sampling.
- Does not scale to high dimensions.

Markov Chain Monte Carlo:
• Make steps by perturbing the previous sample.
• The probability of visiting a state is equal to P(θ|X).
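As a concrete illustration of the "sample from g and suppress" idea, here is a toy rejection sampler; the target density, the N(0, 3²) proposal, and the bound M are all invented for illustration (the slide does not specify g or p(θ|X)):

```python
import numpy as np

rng = np.random.default_rng(2)

# Rejection sampling: draw from a proposal g and keep a draw with probability
# proportional to the (unnormalized) target density at that point.
def rejection_sample(p_tilde, M, n_samples, prop_sigma=3.0):
    out = []
    while len(out) < n_samples:
        theta = prop_sigma * rng.standard_normal()              # theta ~ g = N(0, prop_sigma^2)
        g = np.exp(-0.5 * (theta / prop_sigma) ** 2) / (prop_sigma * np.sqrt(2 * np.pi))
        if rng.uniform() < p_tilde(theta) / (M * g):            # accept w.p. p_tilde / (M * g)
            out.append(theta)
    return np.array(out)

# Unnormalized toy target: a small mixture of two Gaussians.
p_tilde = lambda t: np.exp(-0.5 * (t - 1.0) ** 2) + 0.5 * np.exp(-0.5 * (t + 2.0) ** 2)
print(rejection_sample(p_tilde, M=12.0, n_samples=5))
```

The acceptance rate collapses as the dimension grows, which is the "does not scale to high dimensions" point on the slide.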
Sampling 101 – What is MCMC?

• Run the chain and throw away the initial burn-in samples.
• After burn-in, the chain yields samples from the stationary distribution S0.
• Autocorrelation time τ: high τ means strongly correlated, slowly mixing samples; low τ means nearly independent samples.
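To make the role of τ concrete (standard definitions, not shown on the slide): with integrated autocorrelation time τ, an average over T correlated samples behaves roughly like an average over T/τ independent ones:

```latex
\hat{f} \;=\; \frac{1}{T}\sum_{t=1}^{T} f(\theta_t),
\qquad
\operatorname{Var}\big[\hat{f}\,\big] \;\approx\; \frac{\tau \, \sigma_f^2}{T}.
```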
Sampling 101 – Metropolis-Hastings

Transition kernel T(θt+1|θt): propose a new state, then apply an accept/reject test.
The accept/reject test asks:
• Is the new state more probable?
• Is it easy to come back to the current state?
The accept/reject test is O(N): for a Bayesian posterior it touches the full dataset.

For Bayesian posterior inference:
1) Burn-in is unnecessarily slow.
2) The autocorrelation time (per unit of computation) is too high.
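A minimal random-walk Metropolis-Hastings sketch showing where the O(N) cost comes from: the acceptance test evaluates the full-data log posterior at every step. The model, prior, and step size here are toy choices, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(3)

# Random-walk Metropolis-Hastings for a 1-D parameter theta.
# log_post returns log p(theta | X) up to a constant; for a Bayesian posterior it
# sums over all N data cases, hence the O(N) cost of each accept/reject test.
def metropolis_hastings(log_post, theta0, n_steps, step=0.5):
    theta = theta0
    samples = []
    for _ in range(n_steps):
        proposal = theta + step * rng.standard_normal()    # symmetric random-walk proposal
        log_alpha = log_post(proposal) - log_post(theta)   # MH log acceptance ratio
        if np.log(rng.uniform()) < log_alpha:              # accept with prob min(1, alpha)
            theta = proposal
        samples.append(theta)
    return np.array(samples)

# Toy example: posterior of a Gaussian mean, prior N(0, 10), data x_i ~ N(theta, 1).
x = rng.normal(1.5, 1.0, size=1000)
log_post = lambda t: -0.5 * t ** 2 / 10.0 - 0.5 * np.sum((x - t) ** 2)
print(metropolis_hastings(log_post, theta0=0.0, n_steps=2000)[-5:])
```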
Approximate MCMC

An approximation knob ϵ trades bias for variance:
• Large ϵ: low variance (fast), but high bias.
• Decreasing ϵ: high variance (slow), but low bias.
[Figure: scatter plots of samples illustrating the two regimes]
Minimizing Risk

Risk = Bias² + Variance
[Figure: x-axis ϵ; y-axis Bias², Variance, and Risk, for increasing computational time]
Given finite sampling time, ϵ = 0 is not the optimal setting.
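A stylized way to read this (a paraphrase, not the speakers' derivation): for a fixed computational budget T, increasing ϵ buys more samples per unit time, so

```latex
\mathrm{Risk}(\epsilon) \;=\; B(\epsilon)^2 + V(\epsilon),
\qquad
V(\epsilon) \;\propto\; \frac{1}{\#\text{samples}(\epsilon, T)},
```

and although B(0) = 0, the variance term at ϵ = 0 is large for finite T, so the risk is minimized at some ϵ > 0.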
Designing fast MCMC samplers

Propose, then Accept/Reject (the accept/reject step is O(N)).

Method 1: Develop an approximate accept/reject test that uses only a fraction of the data.
Method 2: Develop a proposal with acceptance probability ≈ 1 and avoid the expensive accept/reject test.
Stochastic Gradient Langevin Dynamics (W. & Teh, 2011)

• Langevin dynamics: take a gradient step on the log posterior with injected Gaussian noise; θt+1 is then accepted/rejected using a Metropolis-Hastings test.
• Stochastic Gradient Langevin Dynamics (SGLD): replace the full-data gradient with a mini-batch estimate.
• Avoid the expensive Metropolis-Hastings test by keeping ε small.
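A minimal sketch of the SGLD update in the spirit of Welling & Teh (2011): a mini-batch gradient step on the log posterior plus injected Gaussian noise whose variance matches the step size, with a polynomially decreasing step size. The function and argument names (sgld, grad_log_prior, grad_log_lik, a, b, gamma) and the toy model are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)

def sgld(theta, X, grad_log_prior, grad_log_lik, n_iter=1000, batch=32,
         a=1e-2, b=10.0, gamma=0.55):
    """SGLD sketch: theta_{t+1} = theta_t + (eps_t/2) * noisy_grad + N(0, eps_t)."""
    N = X.shape[0]
    samples = []
    for t in range(n_iter):
        eps = a * (b + t) ** (-gamma)                    # polynomially decreasing step size
        idx = rng.choice(N, size=batch, replace=False)   # mini-batch of the data
        grad = grad_log_prior(theta) + (N / batch) * np.sum(grad_log_lik(theta, X[idx]), axis=0)
        noise = np.sqrt(eps) * rng.standard_normal(theta.shape)
        theta = theta + 0.5 * eps * grad + noise         # Langevin step with a noisy gradient
        samples.append(theta.copy())
    return np.array(samples)

# Toy usage: posterior of a Gaussian mean, prior N(0, 10), likelihood N(theta, 1).
X = rng.normal(1.5, 1.0, size=(1000, 1))
g_prior = lambda th: -th / 10.0                          # gradient of log N(theta; 0, 10)
g_lik = lambda th, x: (x - th)                           # per-case gradient of log N(x; theta, 1)
print(sgld(np.zeros(1), X, g_prior, g_lik)[-5:].ravel())
```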
SGLD & Optimization
[Figure: SGLD and optimization, large ε]
[Figure: SGLD and optimization, small ε]
The SGLD Knob
Decrease ϵ over time: the chain passes through burn-in, then a biased sampling regime, and approaches exact sampling as ϵ shrinks.
• Large ϵ: low variance (fast), high bias.
• Small ϵ: high variance (slow), low bias.
[Figure: scatter plots of samples along the decreasing-ϵ schedule]
Demo: SGLD