Deep Generative Learning at ICML 2014, Part 1

Bayesian Posterior Inference in the Big Data Arena
Max Welling, Anoop Korattikara


DESCRIPTION

Workshop on Deep Generative Learning at ICML 2014, Part 1

TRANSCRIPT

Page 1: Bayesian Posterior Inference in the Big Data Arena

Max Welling, Anoop Korattikara

Page 2: Outline

• Introduction

• Stochastic Variational Inference
  – Variational Inference 101
  – Stochastic Variational Inference
  – Deep Generative Models with SVB

• MCMC with mini-batches
  – MCMC 101
  – MCMC using noisy gradients
  – MCMC using noisy Metropolis-Hastings
  – Theoretical results

Page 3: Big Data (mine is bigger than yours)

The Square Kilometer Array (SKA) will produce 1 exabyte per day by 2024… (interested in doing approximate inference on this data? talk to me)

Page 4: Introduction

Why do we need posterior inference if the datasets are BIG?

Page 5: p >> N

Big data may mean large p, small N.

Examples: gene expression data, fMRI data.

Page 6: Planning

Planning against uncertainty needs probabilities.

Page 7: Little data inside Big data

Not every data-case carries information about every model component.

Example: a new user with no ratings (the cold-start problem).

Page 8: Big Models!

Models grow faster than useful information in data.

1943: First NN (N ≈ 10)
1988: NetTalk (N ≈ 20K)
2009: Hinton's Deep Belief Net (N ≈ 10M)
2013: Google/Y! (N ≈ 10B)

Page 9: Two Ingredients for Big Data Bayes

Any big-data posterior inference algorithm should:

1. easily run on a distributed architecture.
2. only use a small mini-batch of the data at every iteration.

Page 10: Bayesian Posterior Inference

[Figure: the space of all probability distributions, with the variational family Q as a subset.]

Two routes: Variational vs. Sampling.

Variational:
• Deterministic
• Biased
• Local minima
• Easy to assess convergence

Sampling:
• Stochastic (sample error)
• Unbiased
• Hard to mix between modes
• Hard to assess convergence

Page 11: Variational Bayes

Coordinate descent on Q.

[Figure: the true posterior P and the variational approximation Q; from Bishop, Pattern Recognition and Machine Learning.]

Hinton & van Camp (1993); Neal & Hinton (1999); Saul & Jordan (1996); Saul, Jaakkola & Jordan (1996); Attias (1999, 2000); Wiegerinck (2000); Ghahramani & Beal (2000, 2001)
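For reference, the objective behind "coordinate descent on Q" (standard material, not reconstructed from the slide images): maximizing the variational lower bound is equivalent to minimizing the KL divergence from Q to the posterior,

\[
\log p(X) = \underbrace{\mathbb{E}_{q(\theta)}\big[\log p(X,\theta) - \log q(\theta)\big]}_{\mathcal{L}(q)} \;+\; \mathrm{KL}\big(q(\theta)\,\|\,p(\theta\,|\,X)\big),
\]

so raising \(\mathcal{L}(q)\) over a factorized family Q, one factor at a time, is the coordinate descent the slide refers to.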

Page 12: Stochastic VB (Hoffman, Blei & Bach, 2010)

Stochastic natural gradient descent on Q.

• P and Q in the exponential family.
• Q factorized (into a global factor and per-data-case factors).
• At every iteration, subsample n << N data-cases:
  – solve the local (per-data-case) variational parameters analytically;
  – update the global parameter using stochastic natural gradient descent.
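The update this describes, in the standard form of the cited Hoffman et al. work (my transcription, since the slide's equations are images): with global variational parameter \(\lambda\), prior natural parameter \(\alpha\), sufficient statistics \(t(\cdot)\), and decreasing step sizes \(\rho_t\),

\[
\hat{\lambda}_t = \alpha + \frac{N}{n}\sum_{i \in \text{mini-batch}} \mathbb{E}_{q(z_i)}\big[t(x_i, z_i)\big],
\qquad
\lambda_{t+1} = (1-\rho_t)\,\lambda_t + \rho_t\,\hat{\lambda}_t .
\]

Rescaling the mini-batch sum by \(N/n\) keeps the stochastic natural gradient unbiased.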

Page 13: General SVB

(Ignoring latent variables Z.)

[Figure: the variational bound and its gradient; the expectation over Q is estimated by sampling, and the sum over the data by subsampling X. The resulting naive gradient estimator has very high variance.]
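The "very high variance" estimator alluded to here is, in its standard form (my reconstruction, the slide equations being images), the score-function (REINFORCE) estimator:

\[
\nabla_{\phi}\,\mathbb{E}_{q_{\phi}(\theta)}\big[f(\theta)\big]
= \mathbb{E}_{q_{\phi}(\theta)}\big[f(\theta)\,\nabla_{\phi}\log q_{\phi}(\theta)\big]
\approx \frac{1}{S}\sum_{s=1}^{S} f\big(\theta^{(s)}\big)\,\nabla_{\phi}\log q_{\phi}\big(\theta^{(s)}\big),
\qquad \theta^{(s)} \sim q_{\phi},
\]

with \(f(\theta)\) containing the subsampled bound, e.g. \(\frac{N}{n}\sum_{i}\log p(x_i|\theta) + \log p(\theta) - \log q_\phi(\theta)\). It is unbiased but typically too noisy to be useful without variance reduction; hence the next slide.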

Pages 14–15: Reparameterization Trick

Kingma 2013; Bengio 2013; Kingma & W. 2014

Other solutions to the same "large variance problem":
- Variational Bayesian Inference with Stochastic Search [D.M. Blei, M.I. Jordan and J.W. Paisley, 2012]
- Fixed-Form Variational Posterior Approximation through Stochastic Linear Regression [T. Salimans and A. Knowles, 2013]
- Black Box Variational Inference [R. Ranganath, S. Gerrish and D.M. Blei, 2013]
- Stochastic Variational Inference [M.D. Hoffman, D. Blei, C. Wang and J. Paisley, 2013]
- Estimating or propagating gradients through stochastic neurons [Y. Bengio, 2013]
- Neural Variational Inference and Learning in Belief Networks [A. Mnih and K. Gregor, 2014]

Talk: Monday June 23, 15:20, Track F (Deep Learning II).
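A minimal numpy sketch of the trick itself (my illustration; the function names are mine, not from the talk): for a Gaussian \(q\), writing \(z = \mu + \sigma\varepsilon\) with \(\varepsilon \sim \mathcal{N}(0,1)\) moves the randomness into \(\varepsilon\), so a Monte Carlo estimate of \(\nabla_{\mu,\sigma}\,\mathbb{E}_q[f(z)]\) becomes an ordinary pathwise gradient.

```python
import numpy as np

rng = np.random.default_rng(0)

def reparam_sample(mu, log_sigma, n_samples):
    """Draw z ~ N(mu, sigma^2) as z = mu + sigma * eps, eps ~ N(0, 1).

    Because eps carries all the randomness, z is a deterministic,
    differentiable function of (mu, log_sigma): gradients of a Monte
    Carlo objective can flow through the sample itself.
    """
    eps = rng.standard_normal(n_samples)
    return mu + np.exp(log_sigma) * eps

# Pathwise gradient of E_q[f(z)] w.r.t. mu, here with f(z) = z**2:
# d/dmu E[(mu + sigma*eps)^2] = E[2*(mu + sigma*eps)].
mu, log_sigma = 0.5, np.log(0.3)
z = reparam_sample(mu, log_sigma, 10_000)
grad_mu = np.mean(2.0 * z)   # compare to the analytic value 2*mu = 1.0
print(grad_mu)
```

The same quantity estimated with the score-function formula above typically has far higher variance for the same number of samples.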

Page 16: Auto-Encoding Variational Bayes (Kingma & W., 2013; Rezende et al., 2014)

Both P(X|Z) and Q(Z|X) are general models (e.g. deep neural nets).

[Diagram: latent variable Z and observation X, with recognition model Q(Z|X) and generative model P(X|Z)P(Z).]

Compare the Helmholtz machine and the wake-sleep algorithm (Dayan, Hinton, Neal, Zemel, 1995).
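The AEVB objective from the cited Kingma & Welling paper, for one data point \(x\):

\[
\mathcal{L}(\theta,\phi;x) = \mathbb{E}_{q_{\phi}(z|x)}\big[\log p_{\theta}(x|z)\big] - \mathrm{KL}\big(q_{\phi}(z|x)\,\|\,p(z)\big),
\]

with the expectation estimated via the reparameterization trick and the KL term available in closed form for a Gaussian \(q_\phi\) and prior. Unlike wake-sleep, both networks are trained on this single objective by backpropagation.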

Page 17: The VB Landscape

SVB: Stochastic Variational Bayes
AEVB: Auto-Encoding Variational Bayes
SSVB: Structured Stochastic Variational Bayes
FSSVB: Fully Structured Stochastic Variational Bayes (ICML 2015)

Pages 18–19: Variational Auto-Encoder (with 2 latent variables)

[Figures]

Pages 20–21: Face Model

[Figures]

Page 22: Semi-supervised Model (Kingma, Rezende, Mohamed, Wierstra, W., 2014)

[Diagram: observed X, label Y, latent Z.]

Generative model: P(X, Z, Y) = P(X|Z, Y) P(Y) P(Z)
Recognition model: Q(Y, Z|X) = Q(Z|Y, X) Q(Y|X)

Analogies: fix Z, vary Y, sample X | Z, Y.

Page 23: References (lots of action at ICML 2014!)

REFERENCES: SVB
- Practical Variational Inference for Neural Networks [Alex Graves, 2011]
- Variational Bayesian Inference with Stochastic Search [D.M. Blei, M.I. Jordan and J.W. Paisley, 2012]
- Fixed-Form Variational Posterior Approximation through Stochastic Linear Regression. Bayesian Analysis [T. Salimans and A. Knowles, 2013]
- Black Box Variational Inference [R. Ranganath, S. Gerrish and D.M. Blei, 2013]
- Stochastic Variational Inference [M.D. Hoffman, D. Blei, C. Wang and J. Paisley, 2013]
- Stochastic Structured Mean Field Variational Inference [Matthew Hoffman, 2013]
- Doubly Stochastic Variational Bayes for non-Conjugate Inference [M.K. Titsias and M. Lázaro-Gredilla, 2014]

REFERENCES: STOCHASTIC BACKPROP AND DEEP GENERATIVE MODELS
- Fast Gradient-Based Inference with Continuous Latent Variable Models in Auxiliary Form [D.P. Kingma, 2013]
- Estimating or propagating gradients through stochastic neurons [Y. Bengio, 2013]
- Auto-Encoding Variational Bayes [D.P. Kingma and M. W., 2013]
- Semi-supervised Learning with Deep Generative Models [D.P. Kingma, D.J. Rezende, S. Mohamed, M. W., 2014]
- Efficient Gradient-Based Inference through Transformations between Bayes Nets and Neural Nets [D.P. Kingma and M. W., 2014]
- Deep Generative Stochastic Networks Trainable by Backprop [Y. Bengio, E. Laufer, G. Alain, J. Yosinski, 2014]
- Stochastic Back-propagation and Approximate Inference in Deep Generative Models [D.J. Rezende, S. Mohamed and D. Wierstra, 2014]
- Deep AutoRegressive Networks [K. Gregor, A. Mnih and D. Wierstra, 2014]
- Neural Variational Inference and Learning in Belief Networks [A. Mnih and K. Gregor, 2014]

Pages 24–29: Sampling 101 – Why MCMC?

Generating independent samples: sample from a proposal g and suppress samples with low p(θ|X), e.g.
a) Rejection sampling
b) Importance sampling
- Does not scale to high dimensions.

Markov Chain Monte Carlo:
• Make steps by perturbing the previous sample.
• The probability of visiting a state is equal to P(θ|X).
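A small numpy illustration (mine, not from the slides) of why the independent-sample route "does not scale to high dimensions": rejection-sampling a d-dimensional Gaussian target from a slightly wider Gaussian proposal has average acceptance rate \(\sigma^d\), which collapses exponentially in d.

```python
import numpy as np

rng = np.random.default_rng(0)

def rejection_acceptance_rate(d, sigma=0.9, n=100_000):
    """Rejection-sample the target N(0, sigma^2 I_d) from the proposal N(0, I_d).

    The tightest envelope constant is M = sigma**-d, so a proposal x is
    accepted with probability p(x) / (M g(x)).  The average acceptance
    rate works out to sigma**d: exponential decay with dimension.
    """
    x = rng.standard_normal((n, d))
    c = 0.5 * (sigma**-2 - 1.0)
    accept_prob = np.exp(-c * np.sum(x**2, axis=1))  # = p(x) / (M g(x))
    return accept_prob.mean()

for d in [1, 5, 20, 50]:
    print(d, rejection_acceptance_rate(d))  # ~0.9, 0.59, 0.12, 0.005
```

MCMC sidesteps this by perturbing the previous sample locally instead of proposing each sample from scratch.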

Pages 30–38: Sampling 101 – What is MCMC?

[Figure: a chain θ₀ → θ₁ → θ₂ → … generated by repeatedly applying the transition kernel.]

Burn-in (throw away): the early samples are still distributed according to the initial distribution S₀ and are discarded.

Autocorrelation time τ: high τ (slowly mixing chain) vs. low τ (fast mixing).
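A rough sketch (my own; it assumes the usual integrated autocorrelation time \(\tau = 1 + 2\sum_{k\ge 1}\rho_k\)) of estimating \(\tau\) from a chain. High \(\tau\) means each new sample adds little independent information.

```python
import numpy as np

rng = np.random.default_rng(0)

def integrated_autocorr_time(x, max_lag=200):
    """Estimate tau = 1 + 2 * sum_k rho_k from a single chain (naive cutoff)."""
    x = x - x.mean()
    var = np.mean(x**2)
    rho = np.array([np.mean(x[:-k] * x[k:]) / var for k in range(1, max_lag)])
    return 1.0 + 2.0 * np.sum(rho)

# AR(1) chain x_t = phi * x_{t-1} + noise has the known value
# tau = (1 + phi) / (1 - phi), a handy sanity check.
phi, n = 0.9, 200_000
x = np.empty(n)
x[0] = 0.0
for t in range(1, n):
    x[t] = phi * x[t - 1] + rng.standard_normal()
print(integrated_autocorr_time(x))  # ~19 = (1 + 0.9) / (1 - 0.9)
```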

Pages 39–46: Sampling 101 – Metropolis-Hastings

Transition kernel T(θt+1|θt): Propose, then Accept/Reject test.

The accept/reject test asks:
• Is the new state more probable?
• Is it easy to come back to the current state?

The test is O(N): it evaluates the likelihood of every data-case.

For Bayesian posterior inference:
1) Burn-in is unnecessarily slow.
2) The autocorrelation time τ is too high.
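A minimal sketch of a full Metropolis-Hastings step for a Bayesian posterior (my illustration; the toy model and names are mine). Note that every call to log_posterior sums over all N data-cases, which is exactly the O(N) cost flagged on the slide.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_posterior(theta, x, sigma=1.0):
    """Unnormalized log posterior for a Gaussian mean with a N(0,1) prior.

    The sum over all N data-cases is what makes each MH test O(N).
    """
    log_prior = -0.5 * theta**2
    log_lik = -0.5 * np.sum((x - theta) ** 2) / sigma**2
    return log_prior + log_lik

def mh_step(theta, x, step=0.02):
    """One Metropolis-Hastings transition with a symmetric Gaussian proposal."""
    prop = theta + step * rng.standard_normal()
    # Accept with prob min(1, p(prop|x)/p(theta|x)); the symmetric proposal
    # cancels the q-ratio ("is it easy to come back?").
    log_alpha = log_posterior(prop, x) - log_posterior(theta, x)
    return prop if np.log(rng.uniform()) < log_alpha else theta

x = rng.normal(0.7, 1.0, size=10_000)   # N = 10,000 observations
theta = 0.0
samples = []
for t in range(5_000):
    theta = mh_step(theta, x)
    samples.append(theta)
print(np.mean(samples[1_000:]))  # posterior mean, close to 0.7
```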

Pages 47–49: Approximate MCMC

[Figure: two clouds of samples over the target, illustrating the two regimes.]

Low variance (fast), high bias  vs.  high variance (slow), low bias.

Decreasing ϵ moves from the first regime to the second.

Pages 50–54: Minimizing Risk

Risk = Bias² + Variance

[Figure: Bias², Variance, and Risk as a function of ϵ, for several computational-time budgets.]

Given finite sampling time, ϵ = 0 is not the optimal setting.
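The decomposition on the slide, spelled out for an estimator \(\hat{I}\) of \(I = \mathbb{E}_{p(\theta|X)}[f(\theta)]\):

\[
\text{Risk} = \mathbb{E}\big[(\hat{I}-I)^2\big]
= \big(\mathbb{E}[\hat{I}] - I\big)^2 + \mathbb{E}\Big[\big(\hat{I}-\mathbb{E}[\hat{I}]\big)^2\Big]
= \text{Bias}^2 + \text{Variance}.
\]

With a fixed time budget, a larger ϵ buys more (cheaper) samples and shrinks the variance term at the price of bias; the risk-minimizing ϵ is therefore strictly positive, and only approaches 0 as the budget grows.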

Pages 55–57: Designing fast MCMC samplers

Propose → Accept/Reject, with the accept/reject test costing O(N).

Method 1: develop an approximate accept/reject test that uses only a fraction of the data.

Method 2: develop a proposal with acceptance probability ≈ 1, and avoid the expensive accept/reject test.
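Method 1 can be made concrete roughly as follows (a sketch in the spirit of the sequential test of Korattikara, Chen & Welling, 2014, not their exact procedure; all names are mine). The MH test "u < acceptance ratio" can be rewritten as asking whether the mean per-data-point log-likelihood difference exceeds a threshold μ₀ that collects log u and the prior/proposal ratios; that comparison can often be decided from a mini-batch with a t-test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def approx_mh_accept(loglik_diff, mu0, batch=100, eps=0.05):
    """Decide 'mean(loglik_diff) > mu0?' from a growing mini-batch.

    loglik_diff[i] = log p(x_i|theta') - log p(x_i|theta); mu0 absorbs
    log(u) and the prior/proposal ratios, divided by N.  A t-test decides
    the comparison with error probability below eps; if unsure, the batch
    grows, falling back to the exact O(N) test in the worst case.
    """
    N = len(loglik_diff)
    perm = rng.permutation(N)
    n = 0
    while True:
        n = min(n + batch, N)
        sample = loglik_diff[perm[:n]]
        if n == N:                      # exact test: all data consulted
            return sample.mean() > mu0
        # standard error with finite-population correction (sampling w/o replacement)
        se = sample.std(ddof=1) / np.sqrt(n) * np.sqrt(1.0 - n / N)
        t = (sample.mean() - mu0) / se
        if 2.0 * stats.t.sf(abs(t), df=n - 1) < eps:
            return sample.mean() > mu0  # confident decision from n << N points
```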

Pages 58–63: Stochastic Gradient Langevin Dynamics (W. & Teh, 2011)

Langevin dynamics: a gradient step plus injected Gaussian noise; θt+1 is then accepted/rejected using a Metropolis-Hastings test.

Stochastic Gradient Langevin Dynamics (SGLD): replace the full-data gradient with a mini-batch gradient.

Avoid the expensive Metropolis-Hastings test by keeping ε small.
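The updates behind these slides, as published in the cited Welling & Teh (2011) paper: Langevin dynamics proposes

\[
\theta_{t+1} = \theta_t + \frac{\epsilon}{2}\Big(\nabla\log p(\theta_t) + \sum_{i=1}^{N}\nabla\log p(x_i|\theta_t)\Big) + \eta_t,
\qquad \eta_t \sim \mathcal{N}(0,\epsilon),
\]

and SGLD replaces the full-data sum by the rescaled mini-batch sum \(\frac{N}{n}\sum_{i\in\text{batch}}\nabla\log p(x_i|\theta_t)\). A minimal sketch on the same Gaussian-mean model as the MH example above (my code, not the authors'):

```python
import numpy as np

rng = np.random.default_rng(0)

def sgld_step(theta, x_batch, N, eps):
    """One SGLD update for the Gaussian-mean model (N(0,1) prior, unit noise).

    The mini-batch likelihood gradient is rescaled by N/n to stay unbiased;
    the injected noise has variance eps, per the Langevin discretization.
    """
    n = len(x_batch)
    grad = -theta + (N / n) * np.sum(x_batch - theta)
    return theta + 0.5 * eps * grad + np.sqrt(eps) * rng.standard_normal()

N = 10_000
x = rng.normal(0.7, 1.0, size=N)
theta, eps = 0.0, 1e-6
samples = []
for t in range(20_000):
    batch = x[rng.integers(0, N, size=100)]
    theta = sgld_step(theta, batch, N, eps)
    samples.append(theta)
print(np.mean(samples[5_000:]))   # ~0.7, never touching all N points at once
```

Each step touches only n = 100 of the N = 10,000 points; for small enough ε the injected noise dominates the mini-batch gradient noise, so skipping the MH test incurs only a small bias.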

Pages 64–65: SGLD & Optimization

[Figures: SGLD compared with optimization, for large ε and for small ε.]

Page 66: The SGLD Knob

Burn-in → Biased → Exact: decrease ϵ over time.

Low variance (fast), high bias  vs.  high variance (slow), low bias.

[Figure: the same bias/variance sample clouds as before, traversed as ϵ is annealed.]

Pages 67–68: Demo: SGLD

[Demo]