deep generative learning_icml_part1
DESCRIPTION
Workshop on Deep Generative Learning at ICML 2014, Part 1.
TRANSCRIPT
Bayesian Posterior Inference in the Big Data Arena
Max Welling, Anoop Korattikara
Outline
• Introduction
• Stochastic Variational Inference
  – Variational Inference 101
  – Stochastic Variational Inference
  – Deep Generative Models with SVB
• MCMC with mini-batches
  – MCMC 101
  – MCMC using noisy gradients
  – MCMC using noisy Metropolis-Hastings
  – Theoretical results
Big Data (mine is bigger than yours)
The Square Kilometer Array (SKA) will produce 1 exabyte per day by 2024… (if you are interested in doing approximate inference on this data, talk to me)
Introduction
Why do we need posterior inference if the datasets are BIG?
p >> N: big data may mean large p, small N
Examples: gene expression data, fMRI data
Planning
Planning against uncertainty needs probabilities
Little data inside big data
Not every data case carries information about every model component
New user with no ratings (cold start problem)
Big Models!
Models grow faster than useful information in data
• 1943: First NN (N ≈ 10)
• 1988: NetTalk (N ≈ 20K)
• 2009: Hinton's Deep Belief Net (N ≈ 10M)
• 2013: Google/Y! (N ≈ 10B)
Two Ingredients for Big Data Bayes
Any big data posterior inference algorithm should:
1. easily run on a distributed architecture.
2. only use a small mini-batch of the data at every iteration.
Bayesian Posterior Inference
Two routes: Variational vs. Sampling
(Variational methods search within a restricted variational family Q; sampling explores all probability distributions.)

Variational:
• Deterministic
• Biased
• Local minima
• Easy to assess convergence

Sampling:
• Stochastic (sample error)
• Unbiased
• Hard to mix between modes
• Hard to assess convergence
Variational Bayes
Hinton & van Camp (1993); Neal & Hinton (1999); Saul & Jordan (1996); Saul, Jaakkola & Jordan (1996); Attias (1999, 2000); Wiegerinck (2000); Ghahramani & Beal (2000, 2001)
Coordinate descent on Q
[Figure: Q approximating P (Bishop, Pattern Recognition and Machine Learning)]
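For reference, the standard variational-Bayes objective that "coordinate descent on Q" optimizes (the slide's own equations did not survive extraction; this is the usual textbook form, e.g. in Bishop):

```latex
\log p(X) \;=\; \mathcal{L}(Q) \;+\; \mathrm{KL}\big(Q(Z)\,\|\,p(Z \mid X)\big),
\qquad
\mathcal{L}(Q) \;=\; \mathbb{E}_{Q}\big[\log p(X, Z) - \log Q(Z)\big].
```

Maximizing the lower bound L(Q) over one factor of Q at a time (coordinate descent) is equivalent to minimizing the KL divergence from Q to the posterior.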
Stochastic VB (Hoffman, Blei & Bach, 2010)
Stochastic natural gradient descent on Q
• P and Q in the exponential family.
• Q factorized.
• At every iteration: subsample n << N data cases:
  • solve analytically.
  • update the parameter using stochastic natural gradient descent.
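For a conjugate exponential-family model, this stochastic natural-gradient step reduces to a convex combination of the current global parameter and an "as if the mini-batch were the whole dataset" estimate. A minimal sketch; the names (svi_update, minibatch_stats, prior_nat, tau0, kappa) are illustrative, not from the slides:

```python
import numpy as np

def svi_update(lam, minibatch_stats, n, N, t, prior_nat, tau0=1.0, kappa=0.6):
    """One stochastic natural-gradient step on a global natural parameter lam.

    minibatch_stats: expected sufficient statistics aggregated over the n
    local factors of the mini-batch (solved analytically, as on the slide).
    """
    rho = (tau0 + t) ** (-kappa)                     # Robbins-Monro step size
    lam_hat = prior_nat + (N / n) * minibatch_stats  # intermediate global parameter
    return (1.0 - rho) * lam + rho * lam_hat         # natural-gradient update
```

The step sizes satisfy the usual Robbins-Monro conditions, so the updates converge to a local optimum of the variational objective.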
General SVB
• Subsample X into mini-batches and sample from Q to estimate the gradient of the bound (ignoring latent variables Z).
• This naive estimator has very high variance.
Reparameterization Trick
Kingma 2013; Bengio 2013; Kingma & W. 2014

Other solutions to the same "large variance problem":
- Variational Bayesian Inference with Stochastic Search [D.M. Blei, M.I. Jordan and J.W. Paisley, 2012]
- Fixed-Form Variational Posterior Approximation through Stochastic Linear Regression [T. Salimans and A. Knowles, 2013]
- Black Box Variational Inference [R. Ranganath, S. Gerrish and D.M. Blei, 2013]
- Stochastic Variational Inference [M.D. Hoffman, D. Blei, C. Wang and J. Paisley, 2013]
- Estimating or Propagating Gradients Through Stochastic Neurons [Y. Bengio, 2013]
- Neural Variational Inference and Learning in Belief Networks [A. Mnih and K. Gregor, 2014]

Talk: Monday June 23, 15:20, in Track F (Deep Learning II)
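A minimal NumPy sketch of the trick: write z = mu + sigma * eps with eps ~ N(0, 1), so z becomes a deterministic, differentiable function of the variational parameters and gradients can be estimated pathwise with low variance. The toy objective f(z) = z² is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Reparameterization for a Gaussian q(z) = N(mu, sigma^2): sample noise that does
# not depend on the parameters, then transform it deterministically.
def sample_z(mu, sigma, n_samples=1):
    eps = rng.standard_normal(n_samples)   # eps ~ N(0, 1), parameter-free noise
    return mu + sigma * eps                # gradients w.r.t. mu, sigma flow through here

# Toy example: Monte Carlo gradient of E_q[z^2] w.r.t. mu.
# d/dmu E[(mu + sigma*eps)^2] = E[2*(mu + sigma*eps)] = 2*mu.
mu, sigma = 0.5, 1.0
z = sample_z(mu, sigma, n_samples=1000)
grad_mu_estimate = np.mean(2.0 * z)        # pathwise (reparameterized) estimator
print(grad_mu_estimate)                    # close to the exact value 2*mu = 1.0
```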
Auto-Encoding Variational Bayes
Both P(X|Z) and Q(Z|X) are general models (e.g. deep neural net)
Kingma & W., 2013, Rezende et al 2014
Related: the Helmholtz machine and the Wake/Sleep algorithm (Dayan, Hinton, Neal, Zemel, 1995)
[Figure: graphical model with latent Z and observed X; recognition model Q(Z|X), generative model P(X|Z)P(Z)]
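A minimal sketch of the resulting ELBO estimator for one data point, assuming (purely for illustration) a diagonal-Gaussian recognition model, a unit-variance Gaussian likelihood, and a linear "decoder" standing in for the deep networks the slide mentions:

```python
import numpy as np

rng = np.random.default_rng(1)

# AEVB-style Monte Carlo ELBO for one data point x:
#   q(z|x) = N(mu, diag(sigma^2)),  p(x|z) = N(decoder(z), I),  p(z) = N(0, I).
def elbo_estimate(x, mu, log_sigma, decoder, n_samples=10):
    sigma = np.exp(log_sigma)
    eps = rng.standard_normal((n_samples, mu.size))
    z = mu + sigma * eps                               # reparameterized samples from q(z|x)
    recon = np.array([decoder(zi) for zi in z])        # decoder means for p(x|z)
    log_lik = -0.5 * np.sum((x - recon) ** 2, axis=1)  # Gaussian log-likelihood, up to a constant
    # KL(q(z|x) || p(z)) between diagonal Gaussians, available in closed form:
    kl = 0.5 * np.sum(sigma ** 2 + mu ** 2 - 1.0 - 2.0 * log_sigma)
    return np.mean(log_lik) - kl                       # Monte Carlo estimate of the ELBO

# Toy usage with a hypothetical linear decoder (a deep net in the actual models).
W = rng.standard_normal((2, 3))                        # maps z in R^3 to x in R^2
x = np.array([0.3, -1.2])
print(elbo_estimate(x, mu=np.zeros(3), log_sigma=np.zeros(3), decoder=lambda z: W @ z))
```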
The VB Landscape
• SVB: Stochastic Variational Bayes
• AEVB: Auto-Encoding Variational Bayes
• SSVB: Structured Stochastic Variational Bayes
• FSSVB: Fully Structured Stochastic Variational Bayes (ICML 2015)
Variational Auto-Encoder (with 2 latent variables)
Face Model
Semi-supervised Model
[Figure: graphical model with latent Z, label Y, and observed X]
Generative model: P(X,Z,Y) = P(X|Z,Y)P(Y)P(Z)
Recognition model: Q(Y,Z|X) = Q(Z|Y,X)Q(Y|X)
Analogies: fix Z, vary Y, sample X|Z,Y
Kingma, Rezende, Mohamed, Wierstra, W., 2014
REFERENCES SVB:
- Practical Variational Inference for Neural Networks [A. Graves, 2011]
- Variational Bayesian Inference with Stochastic Search [D.M. Blei, M.I. Jordan and J.W. Paisley, 2012]
- Fixed-Form Variational Posterior Approximation through Stochastic Linear Regression. Bayesian Analysis [T. Salimans and A. Knowles, 2013]
- Black Box Variational Inference [R. Ranganath, S. Gerrish and D.M. Blei, 2013]
- Stochastic Variational Inference [M.D. Hoffman, D. Blei, C. Wang and J. Paisley, 2013]
- Stochastic Structured Mean Field Variational Inference [Matthew Hoffman, 2013]
- Doubly Stochastic Variational Bayes for non-Conjugate Inference [M.K. Titsias and M. Lázaro-Gredilla, 2014]
REFERENCES STOCHASTIC BACKPROP AND DEEP GENERATIVE MODELS
- Fast Gradient-Based Inference with Continuous Latent Variable Models in Auxiliary Form [D.P. Kingma, 2013]
- Estimating or Propagating Gradients Through Stochastic Neurons [Y. Bengio, 2013]
- Auto-Encoding Variational Bayes [D.P. Kingma and M. W., 2013]
- Semi-supervised Learning with Deep Generative Models [D.P. Kingma, D.J. Rezende, S. Mohamed, M. W., 2014]
- Efficient Gradient-Based Inference through Transformations between Bayes Nets and Neural Nets [D.P. Kingma and M. W., 2014]
- Deep Generative Stochastic Networks Trainable by Backprop [Y. Bengio, E. Laufer, G. Alain, J. Yosinski, 2014]
- Stochastic Back-propagation and Approximate Inference in Deep Generative Models [D.J. Rezende, S. Mohamed and D. Wierstra, 2014]
- Deep AutoRegressive Networks [K. Gregor, A. Mnih and D. Wierstra, 2014]
- Neural Variational Inference and Learning in Belief Networks [A. Mnih and K. Gregor, 2014]
References: Lots of action at ICML 2014!
Sampling 101 – Why MCMC?

Generating independent samples: sample from a proposal g and suppress samples with low p(θ|X), e.g. (a) rejection sampling, (b) importance sampling.
- Does not scale to high dimensions.

Markov Chain Monte Carlo:
• Make steps by perturbing the previous sample.
• The probability of visiting a state is equal to P(θ|X).
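As a concrete illustration of the "sample from g and suppress" idea, here is a toy rejection sampler; the target density, the N(0, 3²) proposal, and the bound M are all invented for illustration (the slide does not specify g or p(θ|X)):

```python
import numpy as np

rng = np.random.default_rng(2)

# Rejection sampling: draw from a proposal g and keep a draw with probability
# proportional to the (unnormalized) target density at that point.
def rejection_sample(p_tilde, M, n_samples, prop_sigma=3.0):
    out = []
    while len(out) < n_samples:
        theta = prop_sigma * rng.standard_normal()              # theta ~ g = N(0, prop_sigma^2)
        g = np.exp(-0.5 * (theta / prop_sigma) ** 2) / (prop_sigma * np.sqrt(2 * np.pi))
        if rng.uniform() < p_tilde(theta) / (M * g):            # accept w.p. p_tilde / (M * g)
            out.append(theta)
    return np.array(out)

# Unnormalized toy target: a small mixture of two Gaussians.
p_tilde = lambda t: np.exp(-0.5 * (t - 1.0) ** 2) + 0.5 * np.exp(-0.5 * (t + 2.0) ** 2)
print(rejection_sample(p_tilde, M=12.0, n_samples=5))
```

The acceptance rate collapses as the dimension grows, which is the "does not scale to high dimensions" point on the slide.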
Sampling 101 – What is MCMC?

• Run the chain and throw away the initial burn-in samples.
• After burn-in, the chain yields samples from the stationary distribution S0.
• Autocorrelation time τ: high τ means strongly correlated, slowly mixing samples; low τ means nearly independent samples.
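To make the role of τ concrete (standard definitions, not shown on the slide): with integrated autocorrelation time τ, an average over T correlated samples behaves roughly like an average over T/τ independent ones:

```latex
\hat{f} \;=\; \frac{1}{T}\sum_{t=1}^{T} f(\theta_t),
\qquad
\operatorname{Var}\big[\hat{f}\,\big] \;\approx\; \frac{\tau \, \sigma_f^2}{T}.
```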
Sampling 101 – Metropolis-Hastings

Transition kernel T(θt+1|θt): propose a new state, then apply an accept/reject test.
The accept/reject test asks:
• Is the new state more probable?
• Is it easy to come back to the current state?
The accept/reject test is O(N): for a Bayesian posterior it touches the full dataset.

For Bayesian posterior inference:
1) Burn-in is unnecessarily slow.
2) The autocorrelation time (per unit of computation) is too high.
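A minimal random-walk Metropolis-Hastings sketch showing where the O(N) cost comes from: the acceptance test evaluates the full-data log posterior at every step. The model, prior, and step size here are toy choices, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(3)

# Random-walk Metropolis-Hastings for a 1-D parameter theta.
# log_post returns log p(theta | X) up to a constant; for a Bayesian posterior it
# sums over all N data cases, hence the O(N) cost of each accept/reject test.
def metropolis_hastings(log_post, theta0, n_steps, step=0.5):
    theta = theta0
    samples = []
    for _ in range(n_steps):
        proposal = theta + step * rng.standard_normal()    # symmetric random-walk proposal
        log_alpha = log_post(proposal) - log_post(theta)   # MH log acceptance ratio
        if np.log(rng.uniform()) < log_alpha:              # accept with prob min(1, alpha)
            theta = proposal
        samples.append(theta)
    return np.array(samples)

# Toy example: posterior of a Gaussian mean, prior N(0, 10), data x_i ~ N(theta, 1).
x = rng.normal(1.5, 1.0, size=1000)
log_post = lambda t: -0.5 * t ** 2 / 10.0 - 0.5 * np.sum((x - t) ** 2)
print(metropolis_hastings(log_post, theta0=0.0, n_steps=2000)[-5:])
```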
Approximate MCMC

An approximation knob ϵ trades bias for variance:
• Large ϵ: low variance (fast), but high bias.
• Decreasing ϵ: high variance (slow), but low bias.
[Figure: scatter plots of samples illustrating the two regimes]
Minimizing Risk

Risk = Bias² + Variance
[Figure: x-axis ϵ; y-axis Bias², Variance, and Risk, for increasing computational time]
Given finite sampling time, ϵ = 0 is not the optimal setting.
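A stylized way to read this (a paraphrase, not the speakers' derivation): for a fixed computational budget T, increasing ϵ buys more samples per unit time, so

```latex
\mathrm{Risk}(\epsilon) \;=\; B(\epsilon)^2 + V(\epsilon),
\qquad
V(\epsilon) \;\propto\; \frac{1}{\#\text{samples}(\epsilon, T)},
```

and although B(0) = 0, the variance term at ϵ = 0 is large for finite T, so the risk is minimized at some ϵ > 0.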
Designing fast MCMC samplers

Propose, then Accept/Reject (the accept/reject step is O(N)).

Method 1: Develop an approximate accept/reject test that uses only a fraction of the data.
Method 2: Develop a proposal with acceptance probability ≈ 1 and avoid the expensive accept/reject test.
Stochastic Gradient Langevin Dynamics (W. & Teh, 2011)

• Langevin dynamics: take a gradient step on the log posterior with injected Gaussian noise; θt+1 is then accepted/rejected using a Metropolis-Hastings test.
• Stochastic Gradient Langevin Dynamics (SGLD): replace the full-data gradient with a mini-batch estimate.
• Avoid the expensive Metropolis-Hastings test by keeping ε small.
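A minimal sketch of the SGLD update in the spirit of Welling & Teh (2011): a mini-batch gradient step on the log posterior plus injected Gaussian noise whose variance matches the step size, with a polynomially decreasing step size. The function and argument names (sgld, grad_log_prior, grad_log_lik, a, b, gamma) and the toy model are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)

def sgld(theta, X, grad_log_prior, grad_log_lik, n_iter=1000, batch=32,
         a=1e-2, b=10.0, gamma=0.55):
    """SGLD sketch: theta_{t+1} = theta_t + (eps_t/2) * noisy_grad + N(0, eps_t)."""
    N = X.shape[0]
    samples = []
    for t in range(n_iter):
        eps = a * (b + t) ** (-gamma)                    # polynomially decreasing step size
        idx = rng.choice(N, size=batch, replace=False)   # mini-batch of the data
        grad = grad_log_prior(theta) + (N / batch) * np.sum(grad_log_lik(theta, X[idx]), axis=0)
        noise = np.sqrt(eps) * rng.standard_normal(theta.shape)
        theta = theta + 0.5 * eps * grad + noise         # Langevin step with a noisy gradient
        samples.append(theta.copy())
    return np.array(samples)

# Toy usage: posterior of a Gaussian mean, prior N(0, 10), likelihood N(theta, 1).
X = rng.normal(1.5, 1.0, size=(1000, 1))
g_prior = lambda th: -th / 10.0                          # gradient of log N(theta; 0, 10)
g_lik = lambda th, x: (x - th)                           # per-case gradient of log N(x; theta, 1)
print(sgld(np.zeros(1), X, g_prior, g_lik)[-5:].ravel())
```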
SGLD & Optimization
[Figure: SGLD and optimization, large ε]
[Figure: SGLD and optimization, small ε]
The SGLD Knob
Decrease ϵ over time: the chain passes through burn-in, then a biased sampling regime, and approaches exact sampling as ϵ shrinks.
• Large ϵ: low variance (fast), high bias.
• Small ϵ: high variance (slow), low bias.
[Figure: scatter plots of samples along the decreasing-ϵ schedule]
Demo: SGLD