TRANSCRIPT
Modern Deep Learning through Bayesian Eyes
Yarin Gal
To keep things interesting, a photo or an equation in every slide! (Unless specified otherwise, photos are either original work or taken from Wikimedia under a Creative Commons license.)
Modern deep learning

Conceptually simple models...
- Attract tremendous attention from popular media,
- Have fundamentally affected the way ML is used in industry,
- Driven by pragmatic developments...
  - of tractable models...
  - that work well...
  - and scale well.
But many unanswered questions...

- Why does my model work?
  We don't understand many of the tools that we use. E.g. stochastic regularisation techniques (dropout, multiplicative Gaussian noise¹) are used in most deep learning models to avoid over-fitting. Why do they work?
- What does my model know?
  We can't tell whether our models are certain or not. E.g. what would be the CO2 concentration level in Mauna Loa, Hawaii, in 20 years' time?
- Why does my model predict this and not that?
  Our models are black boxes and not interpretable. Physicians and others need to understand why a model predicts an output.

Surprisingly, we can use Bayesian modelling to answer the questions above.

¹ Wager et al. (2013) and Baldi and Sadowski (2013) attempt to explain dropout as sparse regularisation but cannot generalise to other techniques.
Outline

- Many unanswered questions
- Why does my model work?
  - Bayesian modelling and neural networks
  - Modern deep learning as approximate inference
  - Real-world implications
- What does my model know?
- Why does my model predict this and not that, and other open problems
- Conclusions
Bayesian modelling and inference

- Observed inputs X = {x_i}_{i=1}^N and outputs Y = {y_i}_{i=1}^N
- Capture the stochastic process believed to have generated the outputs
- Define model parameters ω as random variables
- Prior distribution over ω: p(ω)
- Likelihood: p(Y|ω, X)
- Posterior: p(ω|X,Y) = p(Y|ω,X) p(ω) / p(Y|X)   (Bayes' theorem)
- Predictive distribution given a new input x*:

    p(y*|x*, X, Y) = ∫ p(y*|x*, ω) p(ω|X,Y) dω

  where p(ω|X,Y) is the posterior.
- But... p(ω|X,Y) is often intractable
Approximate inference

[Figure: a sequence of approximating distributions q_θ1(ω), ..., q_θ5(ω) being fitted, step by step, to the posterior p(ω|X,Y).]
Approximate inference

- Approximate p(ω|X,Y) with a simple distribution q_θ(ω)
- Minimise the divergence from the posterior w.r.t. θ:

    KL(q_θ(ω) || p(ω|X,Y))

- Identical to minimising

    L_VI(θ) := −∫ q_θ(ω) log p(Y|X,ω) dω + KL(q_θ(ω) || p(ω))

  where p(Y|X,ω) is the likelihood and p(ω) is the prior.
- We can approximate the predictive distribution:

    q_θ(y*|x*) = ∫ p(y*|x*, ω) q_θ(ω) dω
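The approximate predictive distribution above can be estimated by simple Monte Carlo: sample ω from q_θ and average the model's predictions. A minimal sketch, assuming a toy 1-D linear model y = w·x with q_θ(w) = N(mu, sigma²) (the model, `mu`, and `sigma` are illustrative, not from the slides):

```python
import random
import statistics

# Monte Carlo approximation of q_theta(y*|x*) = ∫ p(y*|x*, w) q_theta(w) dw
# for a toy linear model y = w * x, with q_theta(w) = N(mu, sigma^2).
def predictive_mean(x_star, mu, sigma, T=10_000, seed=0):
    rng = random.Random(seed)
    # Draw T parameter samples w_t ~ q_theta(w) and average the predictions.
    samples = [rng.gauss(mu, sigma) * x_star for _ in range(T)]
    return statistics.fmean(samples)

mean = predictive_mean(x_star=2.0, mu=1.5, sigma=0.1)
# Converges to E_q[w] * x* = 1.5 * 2.0 ≈ 3.0
```

The same averaging idea reappears later in the talk as "MC dropout", where the parameter samples come from stochastic forward passes.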
What does this have to do with deep learning?

We'll look at dropout specifically:
- It is used in most modern deep learning models
- It somehow circumvents over-fitting
- And improves performance

With Bayesian modelling we can explain why.
The link — Bayesian neural networks

- Place a prior p(W_i): W_i ~ N(0, I) for i ≤ L (and write ω := {W_i}_{i=1}^L).
- The output is a random variable:

    f(x, ω) = W_L σ(... W_2 σ(W_1 x + b_1) ...)

- Softmax likelihood for classification:

    p(y|x, ω) = softmax(f(x, ω))

  or a Gaussian for regression:

    p(y|x, ω) = N(y; f(x, ω), τ⁻¹ I)

- But it is difficult to evaluate the posterior p(ω|X,Y).

Many have tried...
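Because the weights are random variables, the network output f(x, ω) is one too: each draw of ω gives a different function. A minimal pure-Python sketch, assuming a tiny two-layer network with ReLU σ and prior N(0, 1) on every weight (sizes and seeds are illustrative):

```python
import random

def relu(v):
    return [max(0.0, u) for u in v]

# One draw of omega = {W1, b1, W2} from the prior, then one forward pass.
# The output is a sample of the random variable f(x, omega).
def sample_output(x, hidden=3, seed=None):
    rng = random.Random(seed)
    W1 = [[rng.gauss(0, 1) for _ in x] for _ in range(hidden)]
    b1 = [rng.gauss(0, 1) for _ in range(hidden)]
    W2 = [rng.gauss(0, 1) for _ in range(hidden)]
    h = relu([sum(w * xj for w, xj in zip(row, x)) + b
              for row, b in zip(W1, b1)])
    return sum(w * hj for w, hj in zip(W2, h))

# Different draws of omega (almost surely) give different outputs
# for the same input:
y1 = sample_output([1.0, -0.5], seed=1)
y2 = sample_output([1.0, -0.5], seed=2)
```

Evaluating the posterior over all these weights jointly is what is intractable; the rest of the talk is about approximating it.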
Long history¹

- Denker, Schwartz, Wittner, Solla, Howard, Jackel, Hopfield (1987)
- Denker and LeCun (1991)
- MacKay (1992)
- Hinton and van Camp (1993)
- Neal (1995)
- Barber and Bishop (1998)

And more recently...
- Graves (2011)
- Blundell, Cornebise, Kavukcuoglu, and Wierstra (2015)
- Hernandez-Lobato and Adams (2015)

But we don't use these... do we?

¹ Complete references at end of slides
Deep learning as approx. inference

Approximate inference in Bayesian NNs

- Define q_θ(ω) to approximate the posterior p(ω|X,Y)
- KL divergence to minimise:

    KL(q_θ(ω) || p(ω|X,Y)) ∝ −∫ q_θ(ω) log p(Y|X,ω) dω + KL(q_θ(ω) || p(ω)) =: L(θ)

- Approximate the integral with MC integration, ω̂ ~ q_θ(ω):

    L̂(θ) := − log p(Y|X, ω̂) + KL(q_θ(ω) || p(ω))
Deep learning as approx. inference

Stochastic approx. inference in Bayesian NNs

- Unbiased estimator:

    E_{ω̂ ~ q_θ(ω)}( L̂(θ) ) = L(θ)

- Converges to the same optima as L(θ)
- For inference, repeat:
  - Sample ω̂ ~ q_θ(ω)
  - And minimise (one step)

      L̂(θ) = − log p(Y|X, ω̂) + KL(q_θ(ω) || p(ω))

    w.r.t. θ.
Deep learning as approx. inference

Specifying q_θ(·)

- Given Bernoulli r.v.s z_{i,j} and variational parameters θ = {M_i}_{i=1}^L (a set of matrices):

    z_{i,j} ~ Bernoulli(p_i)   for i = 1, ..., L,  j = 1, ..., K_{i−1}
    W_i = M_i · diag([z_{i,j}]_{j=1}^{K_i})
    q_θ(ω) = ∏_i q_{M_i}(W_i)
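Sampling from this q_θ is just masking columns: draw a Bernoulli variable per input unit and multiply the corresponding column of M_i by it. A minimal sketch with toy sizes (the matrix values and seed are illustrative assumptions):

```python
import random

# Sample W = M . diag(z) with z_j ~ Bernoulli(p), i.e. randomly zero
# whole columns of the variational parameter matrix M.
def sample_dropout_weights(M, p, rng):
    n_cols = len(M[0])
    z = [1.0 if rng.random() < p else 0.0 for _ in range(n_cols)]
    return [[m * zj for m, zj in zip(row, z)] for row in M]

rng = random.Random(0)
M = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
W = sample_dropout_weights(M, p=0.5, rng=rng)
# Each column of W is either the corresponding column of M or all zeros.
```

Zeroing a column of M_i is exactly dropping the corresponding unit of the network, which is the observation the next slide makes.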
Deep learning as approx. inference

In summary:

Minimise the divergence between q_θ(ω) and p(ω|X,Y):

- Repeat:
  - Sample z_{i,j} ~ Bernoulli(p_i) and set

      W_i = M_i · diag([z_{i,j}]_{j=1}^{K_i}),   ω̂ = {W_i}_{i=1}^L

    (equivalently: randomly set columns of M_i to zero, i.e. randomly set units of the network to zero)
  - Minimise (one step)

      L̂(θ) = − log p(Y|X, ω̂) + KL(q_θ(ω) || p(ω))

    w.r.t. θ = {M_i}_{i=1}^L (a set of matrices).
Deep learning as approx. inference

Sounds familiar?²

    L̂(θ) =  − log p(Y|X, ω̂)   +   KL(q_θ(ω) || p(ω))
             (= the loss)           (= L2 regularisation)

² For more details see the appendix of Gal and Ghahramani (2015): yarin.co/dropout
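In code, the correspondence is that the KL term (for a Gaussian prior) reduces, up to a constant, to L2 weight decay on the variational matrices M_i, so the VI objective is exactly a standard dropout loss plus weight decay. A toy sketch with made-up numbers (the matrices and decay coefficient are illustrative assumptions):

```python
# Dropout VI objective = data loss + weight decay on the M_i matrices.
def dropout_vi_objective(neg_log_lik, Ms, weight_decay):
    # Sum of squared entries over all variational matrices (the KL term,
    # up to a constant, for a Gaussian prior).
    l2 = sum(m * m for M in Ms for row in M for m in row)
    return neg_log_lik + weight_decay * l2

obj = dropout_vi_objective(neg_log_lik=1.25,
                           Ms=[[[1.0, -2.0]], [[0.5]]],
                           weight_decay=1e-2)
# 1.25 + 0.01 * (1 + 4 + 0.25) = 1.3025
```

This is why an off-the-shelf dropout network trained with weight decay can be read as performing approximate variational inference.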
Why does my model work?

Now we can answer: "Why does dropout work?"

- It adds noise
- Sexual reproduction³
- Because it approximately integrates over the model parameters
  - The noise is a side-effect of approximate integration
  - This explains model over-specification, "adaptive model capacity"
  - We fit the process that generated our data

³ Srivastava et al. (2014)
Diving deeper into dropout

- "Why this q_θ(·)?"
  - Bernoullis are cheap
  - Dropout at test time ≈ propagating the mean E(W_i) = p_i M_i
- It constrains the weights to near the origin:
  - Posterior uncertainty decreases with more data
  - Var(W_i) = M_i M_i^T (p_i − p_i²)
  - For fixed p_i, to decrease uncertainty we must decrease ||M_i||
  - Smallest ||M_i|| = strongest regularisation, at p_i = 0.5.
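The last point is just the Bernoulli variance factor p − p² peaking at p = 0.5: for a fixed target uncertainty, the larger this factor, the smaller ||M_i|| must be. A quick numeric check (illustrative only):

```python
# The Bernoulli variance factor p - p^2 that scales Var(W_i) is
# maximised at p = 0.5, so the pull of M_i toward the origin is
# strongest at dropout probability 0.5.
factors = {p / 10: p / 10 - (p / 10) ** 2 for p in range(1, 10)}
best_p = max(factors, key=factors.get)
# best_p == 0.5, with factor 0.25
```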
Other stochastic reg. techniques

- Multiplicative Gaussian noise (Srivastava et al., 2014):
  - Multiply the network units by N(1, 1)
  - Same performance as dropout

Multiplicative Gaussian noise as approximate inference⁴:

    z_{i,j} ~ N(1, 1)   for i = 1, ..., L,  j = 1, ..., K_{i−1}
    W_i = M_i · diag([z_{i,j}]_{j=1}^{K_i})
    q_θ(ω) = ∏_i q_{M_i}(W_i)

Similarly for drop-connect (Wan et al., 2013) and hashed neural networks (Chen et al., 2015).

⁴ See Gal and Ghahramani (2015) and Kingma et al. (2015)
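The only change from the Bernoulli case is the distribution of the column noise: scale each column of M by z_j ~ N(1, 1) instead of masking it, leaving the mean E(W) = M unchanged. A toy sketch (matrix values and seed are illustrative assumptions):

```python
import random

# Multiplicative Gaussian noise as a variational distribution:
# W = M . diag(z) with z_j ~ N(1, 1), one noise value per column.
def sample_mgn_weights(M, rng):
    n_cols = len(M[0])
    z = [rng.gauss(1.0, 1.0) for _ in range(n_cols)]
    return [[m * zj for m, zj in zip(row, z)] for row in M]

rng = random.Random(0)
M = [[1.0, 2.0],
     [3.0, 4.0]]
W = sample_mgn_weights(M, rng)
# Every entry in a column of W is the matching entry of M scaled by the
# same Gaussian factor z_j.
```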
Real world implications

"A theory is worth nothing if you can't use it to make better code."
– DeadMG, Jun 10 '12, stackexchange

- Better use of dropout
- Model structure selection
  - (No time: use Bayesian statistics to understand model architecture)
Better use of dropout

How do we use dropout with convolutional neural networks (convnets)?

[Figure: LeNet convnet structure, with dropout applied after the final inner-product layers and no dropout after the convolutions. Image source: LeCun et al. (1998)]
Better use of dropout

Why not use dropout (and related techniques) with convolutions?

- It doesn't work
  - Low co-adaptation in convolutions
- Because it's not used correctly
  - Standard dropout averages the weights at test time
Better use of dropout

Instead, use the predictive mean, approximated with MC integration:

    E_{q_θ(y*|x*)}(y*) ≈ (1/T) Σ_{t=1}^T ŷ(x*, ω̂_t),   with ω̂_t ~ q_θ(ω).

- In practice: average stochastic forward passes through the network (referred to as "MC dropout").⁵
- Dropout after convolutions and averaging forward passes = approximate inference in Bayesian convnets.⁶

⁵ Also suggested in Srivastava et al. (2014) as model averaging.
⁶ See yarin.co/bcnn for more details
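MC dropout keeps the Bernoulli masks active at test time and averages T stochastic passes. A toy sketch with a one-layer "network" (the weights, input, and keep probability are illustrative assumptions):

```python
import random
import statistics

# One forward pass with units kept with probability p (mask resampled
# every call), i.e. one draw y_hat(x*, omega_t).
def stochastic_forward(x, M, p, rng):
    return sum(m * xj * (1.0 if rng.random() < p else 0.0)
               for m, xj in zip(M, x))

# MC dropout: average T stochastic forward passes instead of scaling
# the weights once at test time.
def mc_dropout_mean(x, M, p=0.5, T=20_000, seed=0):
    rng = random.Random(seed)
    return statistics.fmean(stochastic_forward(x, M, p, rng)
                            for _ in range(T))

y = mc_dropout_mean(x=[1.0, 2.0], M=[0.5, -0.25], p=0.5)
# For this linear toy model the MC estimate converges to the
# weight-averaged prediction p * (M . x) = 0.5 * 0.0 = 0.0
```

For linear layers the two coincide in expectation; after nonlinearities (and convolutions) they differ, which is why MC dropout helps.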
Huge improvement (MNIST)

[Figure: MNIST test error (%) vs. training batches (log scale, 10^4 to 10^7) for five models: standard dropout (lenet-all), MC dropout (lenet-all), standard dropout (lenet-ip), MC dropout (lenet-ip), and no dropout (lenet-none). Red: standard LeNet, no dropout ("before dropout"). Green: standard dropout LeNet, dropout at the end ("empirical dropout"). Dashed blue: Bayesian LeNet with weight averaging (FAIL). Solid blue: Bayesian LeNet with MC dropout ("principled dropout").]
Over-fitting on small data

- Robustness to over-fitting on smaller datasets:

[Figure: test error (%) vs. training batches for (a) entire MNIST, standard dropout convnet; (b) entire MNIST, Bayesian convnet; (c) 1/4 of MNIST, standard dropout convnet; (d) 1/4 of MNIST, Bayesian convnet.]
State-of-the-art on CIFAR-10

CIFAR Test Error (and Std.)

Model           | Standard Dropout          | MC Dropout
NIN             | 10.43 (Lin et al., 2013)  | 10.27 ± 0.05
DSN             |  9.37 (Lee et al., 2014)  |  9.32 ± 0.02
Augmented-DSN   |  7.95 (Lee et al., 2014)  |  7.71 ± 0.09

Table: Bayesian techniques (MC dropout) with existing state-of-the-art
RW: Camera pose localisation

- Find the location from which a picture was taken⁷
- Kendall and Cipolla (2015) show a 10–15% improvement on the state-of-the-art with Bayesian convnets

[Figure: localisation accuracy for different error thresholds]

⁷ Figures used with author permission
Outline

- Many unanswered questions
- Why does my model work?
- What does my model know?
  - Why should I care about uncertainty?
  - How can I get uncertainty in deep learning?
  - What does this uncertainty look like?
  - Real-world implications
- Why does my model predict this and not that, and other open problems
- Conclusions
Why should I care about uncertainty?

- We train a model to recognise dog breeds
- And are given a cat to classify
- What would you want your model to do?
- Similar problems arise in decision making, physics, life science, etc.⁸
- For the practitioner: pass inputs with low confidence to specialised models
- But I already have uncertainty in classification! Well... no

[Figure: (a) softmax input as a function of the data x: f(x); (b) softmax output as a function of the data x: σ(f(x))]

- We need to be able to tell what our model knows and what it doesn't.⁹

⁸ Complete references at end of slides
⁹ Friendly introduction given in yarin.co/blog
How to get uncertainty in deep learning

- We fit a distribution; we have already used its first moment:

    E(y*) ≈ (1/T) Σ_{t=1}^T ŷ(x*, ω̂_t),   with ω̂_t ~ q_θ(ω).

- For uncertainty (in regression), look at the second moment:

    Var(y*) ≈ τ⁻¹ I + (1/T) Σ_{t=1}^T ŷ(x*, ω̂_t)^T ŷ(x*, ω̂_t) − E(y*)^T E(y*)

- As simple as looking at the sample variance of stochastic forward passes through the network (plus the observation noise).¹⁰

¹⁰ See yarin.co/dropout for more details
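Both moments come from the same T stochastic passes. A scalar-output sketch (the toy "network", τ, and T are illustrative assumptions):

```python
import random
import statistics

# MC dropout predictive mean and variance for scalar outputs:
# mean of T stochastic passes, and their raw second moment minus the
# squared mean, plus the observation-noise term tau^{-1}.
def predictive_moments(forward, x, T, tau, seed=0):
    rng = random.Random(seed)
    ys = [forward(x, rng) for _ in range(T)]
    mean = statistics.fmean(ys)
    var = 1.0 / tau + statistics.fmean(y * y for y in ys) - mean * mean
    return mean, var

# Toy stochastic "network": the input plus a Bernoulli(0.5)-masked offset.
def toy_forward(x, rng):
    return x + (1.0 if rng.random() < 0.5 else 0.0)

mean, var = predictive_moments(toy_forward, x=1.0, T=50_000, tau=4.0)
# mean ≈ 1.5; var ≈ 1/4 + Var(Bernoulli(0.5)) = 0.25 + 0.25 = 0.5
```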
What does this uncertainty look like?

What would be the CO2 concentration level in Mauna Loa, Hawaii, in 20 years' time?

- Normal dropout (weight averaging, 5 layers, ReLU units):
- Same network, Bayesian perspective:
What does this uncertainty look like?

[Online demo]¹¹

¹¹ yarin.co/blog
What does this uncertainty look like?

- How good is our uncertainty estimate?
Deep Reinforcement Learning

- We have a "Roomba"¹²
- Penalised −5 for walking into a wall, rewarded +10 for collecting dirt
- Our environment is stochastic and ever changing
- We want a net to learn what actions to take in different situations

¹² Code based on Karpathy and authors. github.com/karpathy/convnetjs
Deep Reinforcement Learning

Behavioural policies:

- Epsilon-greedy: take random actions with probability ε and optimal actions otherwise
- Using uncertainty we can learn faster:
  - Thompson sampling: draw a realisation from the current belief over the world, and choose the action with the highest value
  - In practice: simulate a stochastic forward pass through the dropout network and choose the action with the highest value
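Thompson sampling can be sketched with a toy bandit: keep a belief over each action's value, draw one realisation per action, and act greedily on the draws. With a dropout Q-network, the "draw" is one stochastic forward pass. Everything below (beliefs, seed, trial count) is an illustrative assumption:

```python
import random

# One Thompson-sampling step: sample a value for each action from the
# current belief (mean, std), then pick the argmax of the draws.
def thompson_action(beliefs, rng):
    draws = [rng.gauss(mu, sd) for mu, sd in beliefs]
    return max(range(len(draws)), key=lambda a: draws[a])

rng = random.Random(0)
# Two plausible actions with overlapping beliefs, and one action the
# agent is confident is bad.
beliefs = [(1.0, 0.5), (0.8, 0.5), (0.0, 0.01)]
counts = [0, 0, 0]
for _ in range(2000):
    counts[thompson_action(beliefs, rng)] += 1
# Higher believed value is chosen more often, while the confidently bad
# action is almost never explored: uncertainty drives the exploration.
```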
Deep Reinforcement Learning

[Online demo]¹³

¹³ yarin.co/blog
Deep Reinforcement Learning

[Figure: average reward over time (log scale) for Thompson sampling (using uncertainty) vs. epsilon greedy.]
RW: Camera localisation (with uncertainty!)

- Where was a picture taken? (Kendall and Cipolla, 2015)¹⁴
- Uncertainty increases as a test photo diverges from the training distribution
- Test photos with high uncertainty show strong occlusion from vehicles, pedestrians, or other objects
- Localisation error correlates with uncertainty

¹⁴ Figures used with author permission
RW: Image segmentation

- Scene understanding: what's in a photo and where? (Kendall, Badrinarayanan, and Cipolla, 2015)¹⁵

¹⁵ Figures used with author permission
RW: DNA methylation

- Angermueller and Stegle (2015) fit a network to predict DNA methylation, used for gene regulation
- Looking at the methylation rate of different embryonic stem cells, uncertainty increases in genomic contexts that are hard to predict (e.g. LMR or H3K27me3)
What lies ahead

Use the theory to answer many questions: How can we...

- ... build interpretable models?
- ... combine Bayesian techniques & deep models?
- ... practically use deep learning uncertainty in existing models?
- ... extend deep learning in a principled way?
Many unanswered questions left

- Interpretable models?
  - Will you trust a decision made by a black box?
  - Rich literature on interpretable Bayesian models (e.g. Sun (2006), Letham (2014))
  - Combine Bayesian and deep models in a principled way?
- Combine Bayesian techniques & deep models?
  - Unsupervised learning: Bayesian data analysis?
  - Bayesian models with complex data? (sequence data, image data)
Many unanswered questions left
I Practical deep learning uncertainty?
I Capture language ambiguity?
Image Source: cs224d.stanford.edu/lectures/CS224d-Lecture8.pdf
I Weight uncertainty for model debugging?
I Principled extensions of deep learning?
I Dropout in recurrent networks?
Image Source: karpathy.github.io/2015/05/21/rnn-effectiveness
I New appr. distributions = new stochastic reg. techniques? qθ(ω) = ?
I Model compression: Wi ∼ discrete distribution w. continuous base measure?
Work in progress!
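To make “dropout in recurrent networks” concrete, one candidate design (an illustrative assumption, not a result stated on the slides) is to sample a single Bernoulli mask for the recurrent state and reuse it at every time step, instead of resampling per step:

```python
import numpy as np

rng = np.random.default_rng(1)

# Tiny vanilla RNN cell; weights are random placeholders, not a trained model.
d_in, d_h = 4, 8
Wx = rng.standard_normal((d_in, d_h)) * 0.1
Wh = rng.standard_normal((d_h, d_h)) * 0.1

def rnn_with_tied_dropout(xs, p_drop=0.25):
    """Run the cell over a sequence, applying ONE dropout mask on the
    recurrent state, sampled once and reused at every time step."""
    mask = (rng.random(d_h) > p_drop) / (1.0 - p_drop)  # tied mask
    h = np.zeros(d_h)
    for x in xs:
        h = np.tanh(x @ Wx + (h * mask) @ Wh)  # same mask each step
    return h

xs = rng.standard_normal((5, d_in))
h_final = rnn_with_tied_dropout(xs)
```

Tying the mask across time keeps the sampled network fixed within a sequence, which is one way a stochastic regulariser could correspond to a single draw from an approximating distribution over weights.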
Outline
I Many unanswered questions
I Why does my model work?
I What does my model know?
I Why does my model predict this and not that, and other open problems
I Conclusions
Conclusions
The theory above means that modern deep learning:
I captures stochastic processes underlying observed data
I can use the vast Bayesian statistics literature
I can be explained by a mathematically rigorous theory
I can be extended in a principled way
I can be combined with Bayesian models / techniques in a practical way (we saw this!)
I has uncertainty estimates built-in (we saw this as well!)
But...
New horizons
Most exciting is the work to come:
I Practical uncertainty in deep learning
I Principled extensions to deep learning
I Hybrid deep learning – Bayesian models
and much, much more.
Thank you for listening.
References
I Y Gal, Z Ghahramani, “Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning”, arXiv preprint, arXiv:1506.02142 (2015).
I Y Gal, Z Ghahramani, “Dropout as a Bayesian Approximation: Appendix”, arXiv preprint, arXiv:1506.02157 (2015).
I Y Gal, Z Ghahramani, “Bayesian Convolutional Neural Networks with Bernoulli Approximate Variational Inference”, arXiv preprint, arXiv:1506.02158 (2015).
I A Kendall, R Cipolla, “Modelling Uncertainty in Deep Learning for Camera Relocalization”, arXiv preprint, arXiv:1509.05909 (2015).
I C Angermueller and O Stegle, “Multi-task deep neural network to predict CpG methylation profiles from low-coverage sequencing data”, NIPS MLCB workshop (2015).
I JM Hernández-Lobato, RP Adams, “Probabilistic Backpropagation for Scalable Learning of Bayesian Neural Networks”, ICML (2015).
I DP Kingma, T Salimans, M Welling, “Variational Dropout and the Local Reparameterization Trick”, NIPS (2015).
I DJ Rezende, S Mohamed, D Wierstra, “Stochastic Backpropagation and Approximate Inference in Deep Generative Models”, ICML (2014).
Refs – Bayesian neural networks
I Denker, Schwartz, Wittner, Solla, Howard, Jackel, and Hopfield, “Large Automatic Learning, Rule Extraction, and Generalization”, Complex Systems (1987).
I Tishby, Levin, and Solla, “A statistical approach to learning and generalization in layered neural networks”, COLT (1989).
I Denker and LeCun, “Transforming neural-net output levels to probability distributions”, NIPS (1991).
I D MacKay, “A practical Bayesian framework for backpropagation networks”, Neural Computation (1992).
I GE Hinton and D van Camp, “Keeping the neural networks simple by minimizing the description length of the weights”, Computational learning theory (1993).
I R Neal, “Bayesian Learning for Neural Networks”, PhD dissertation (1995).
I D Barber and CM Bishop, “Ensemble learning in Bayesian neural networks”, Computer and Systems Sciences (1998).
I A Graves, “Practical variational inference for neural networks”, NIPS (2011).
I C Blundell, J Cornebise, K Kavukcuoglu, and D Wierstra, “Weight uncertainty in neural networks”, ICML (2015).
Refs – importance of being uncertain
I Krzywinski and Altman, “Points of significance: Importance of being uncertain”, Nature Methods (2013).
I Herzog and Ostwald, “Experimental biology: Sometimes Bayesian statistics are better”, Nature (2013).
I Nuzzo, “Scientific method: Statistical errors”, Nature (2014).
I Woolston, “Psychology journal bans P values”, Nature (2015).
I Ghahramani, “Probabilistic machine learning and artificial intelligence”, Nature (2015).