TRANSCRIPT
Modern Deep Learning through Bayesian Eyes
Yarin Gal
To keep things interesting, a photo or an equation in every slide! (Unless specified otherwise, photos are either original work or taken from Wikimedia under a Creative Commons license.)
Modern deep learning

Conceptually simple models...
- Attract tremendous attention from popular media,
- Have fundamentally affected the way ML is used in industry,
- Driven by pragmatic developments...
  - of tractable models...
  - that work well...
  - and scale well.
But many unanswered questions...

- Why does my model work?
  We don't understand many of the tools that we use. E.g. stochastic regularisation techniques (dropout, multiplicative Gaussian noise¹) are used in most deep learning models to avoid over-fitting. Why do they work?
- What does my model know?
  We can't tell whether our models are certain or not. E.g. what would be the CO2 concentration level in Mauna Loa, Hawaii, in 20 years' time?
- Why does my model predict this and not that?
  Our models are black boxes and not interpretable. Physicians and others need to understand why a model predicts an output.

Surprisingly, we can use Bayesian modelling to answer the questions above.

¹ Wager et al. (2013) and Baldi and Sadowski (2013) attempt to explain dropout as sparse regularisation but cannot generalise to other techniques.
Outline

- Many unanswered questions
- Why does my model work?
  - Bayesian modelling and neural networks
  - Modern deep learning as approximate inference
  - Real-world implications
- What does my model know?
- Why does my model predict this and not that, and other open problems
- Conclusions
Bayesian modelling and inference

- Observed inputs X = {x_i}_{i=1}^N and outputs Y = {y_i}_{i=1}^N
- Capture the stochastic process believed to have generated the outputs
- Define model parameters ω as random variables
- Prior distribution over ω: p(ω)
- Likelihood: p(Y|ω, X)
- Posterior: p(ω|X,Y) = p(Y|ω,X) p(ω) / p(Y|X)   (Bayes' theorem)
- Predictive distribution given a new input x*:

    p(y*|x*, X, Y) = ∫ p(y*|x*, ω) p(ω|X,Y) dω

  where p(ω|X,Y) is the posterior.
- But... p(ω|X,Y) is often intractable
Approximate inference

[Figure: a sequence of approximating distributions q_θ1(ω), ..., q_θ5(ω) being fitted, step by step, to the posterior p(ω|X,Y).]
Approximate inference

- Approximate p(ω|X,Y) with a simple distribution q_θ(ω)
- Minimise the divergence from the posterior w.r.t. θ:

    KL(q_θ(ω) || p(ω|X,Y))

- Identical to minimising

    L_VI(θ) := −∫ q_θ(ω) log p(Y|X,ω) dω + KL(q_θ(ω) || p(ω))

  where p(Y|X,ω) is the likelihood and p(ω) is the prior.
- We can approximate the predictive distribution:

    q_θ(y*|x*) = ∫ p(y*|x*, ω) q_θ(ω) dω
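The approximate predictive distribution above can be estimated by simple Monte Carlo: sample ω from q_θ and average the model's predictions. A minimal sketch, assuming a toy 1-D linear model y = w·x with q_θ(w) = N(mu, sigma²) (the model, `mu`, and `sigma` are illustrative, not from the slides):

```python
import random
import statistics

# Monte Carlo approximation of q_theta(y*|x*) = ∫ p(y*|x*, w) q_theta(w) dw
# for a toy linear model y = w * x, with q_theta(w) = N(mu, sigma^2).
def predictive_mean(x_star, mu, sigma, T=10_000, seed=0):
    rng = random.Random(seed)
    # Draw T parameter samples w_t ~ q_theta(w) and average the predictions.
    samples = [rng.gauss(mu, sigma) * x_star for _ in range(T)]
    return statistics.fmean(samples)

mean = predictive_mean(x_star=2.0, mu=1.5, sigma=0.1)
# Converges to E_q[w] * x* = 1.5 * 2.0 ≈ 3.0
```

The same averaging idea reappears later in the talk as "MC dropout", where the parameter samples come from stochastic forward passes.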
What does this have to do with deep learning?

We'll look at dropout specifically:
- It is used in most modern deep learning models
- It somehow circumvents over-fitting
- And improves performance

With Bayesian modelling we can explain why.
The link — Bayesian neural networks

- Place a prior p(W_i): W_i ~ N(0, I) for i ≤ L (and write ω := {W_i}_{i=1}^L).
- The output is a random variable:

    f(x, ω) = W_L σ(... W_2 σ(W_1 x + b_1) ...)

- Softmax likelihood for classification:

    p(y|x, ω) = softmax(f(x, ω))

  or a Gaussian for regression:

    p(y|x, ω) = N(y; f(x, ω), τ⁻¹ I)

- But it is difficult to evaluate the posterior p(ω|X,Y).

Many have tried...
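Because the weights are random variables, the network output f(x, ω) is one too: each draw of ω gives a different function. A minimal pure-Python sketch, assuming a tiny two-layer network with ReLU σ and prior N(0, 1) on every weight (sizes and seeds are illustrative):

```python
import random

def relu(v):
    return [max(0.0, u) for u in v]

# One draw of omega = {W1, b1, W2} from the prior, then one forward pass.
# The output is a sample of the random variable f(x, omega).
def sample_output(x, hidden=3, seed=None):
    rng = random.Random(seed)
    W1 = [[rng.gauss(0, 1) for _ in x] for _ in range(hidden)]
    b1 = [rng.gauss(0, 1) for _ in range(hidden)]
    W2 = [rng.gauss(0, 1) for _ in range(hidden)]
    h = relu([sum(w * xj for w, xj in zip(row, x)) + b
              for row, b in zip(W1, b1)])
    return sum(w * hj for w, hj in zip(W2, h))

# Different draws of omega (almost surely) give different outputs
# for the same input:
y1 = sample_output([1.0, -0.5], seed=1)
y2 = sample_output([1.0, -0.5], seed=2)
```

Evaluating the posterior over all these weights jointly is what is intractable; the rest of the talk is about approximating it.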
Long history¹

- Denker, Schwartz, Wittner, Solla, Howard, Jackel, Hopfield (1987)
- Denker and LeCun (1991)
- MacKay (1992)
- Hinton and van Camp (1993)
- Neal (1995)
- Barber and Bishop (1998)

And more recently...
- Graves (2011)
- Blundell, Cornebise, Kavukcuoglu, and Wierstra (2015)
- Hernandez-Lobato and Adams (2015)

But we don't use these... do we?

¹ Complete references at end of slides
Deep learning as approx. inference

Approximate inference in Bayesian NNs

- Define q_θ(ω) to approximate the posterior p(ω|X,Y)
- KL divergence to minimise:

    KL(q_θ(ω) || p(ω|X,Y)) ∝ −∫ q_θ(ω) log p(Y|X,ω) dω + KL(q_θ(ω) || p(ω)) =: L(θ)

- Approximate the integral with MC integration, ω̂ ~ q_θ(ω):

    L̂(θ) := − log p(Y|X, ω̂) + KL(q_θ(ω) || p(ω))
Deep learning as approx. inference

Stochastic approx. inference in Bayesian NNs

- Unbiased estimator:

    E_{ω̂ ~ q_θ(ω)}( L̂(θ) ) = L(θ)

- Converges to the same optima as L(θ)
- For inference, repeat:
  - Sample ω̂ ~ q_θ(ω)
  - And minimise (one step)

      L̂(θ) = − log p(Y|X, ω̂) + KL(q_θ(ω) || p(ω))

    w.r.t. θ.
Deep learning as approx. inference

Specifying q_θ(·)

- Given Bernoulli r.v.s z_{i,j} and variational parameters θ = {M_i}_{i=1}^L (a set of matrices):

    z_{i,j} ~ Bernoulli(p_i)   for i = 1, ..., L,  j = 1, ..., K_{i−1}
    W_i = M_i · diag([z_{i,j}]_{j=1}^{K_i})
    q_θ(ω) = ∏_i q_{M_i}(W_i)
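Sampling from this q_θ is just masking columns: draw a Bernoulli variable per input unit and multiply the corresponding column of M_i by it. A minimal sketch with toy sizes (the matrix values and seed are illustrative assumptions):

```python
import random

# Sample W = M . diag(z) with z_j ~ Bernoulli(p), i.e. randomly zero
# whole columns of the variational parameter matrix M.
def sample_dropout_weights(M, p, rng):
    n_cols = len(M[0])
    z = [1.0 if rng.random() < p else 0.0 for _ in range(n_cols)]
    return [[m * zj for m, zj in zip(row, z)] for row in M]

rng = random.Random(0)
M = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
W = sample_dropout_weights(M, p=0.5, rng=rng)
# Each column of W is either the corresponding column of M or all zeros.
```

Zeroing a column of M_i is exactly dropping the corresponding unit of the network, which is the observation the next slide makes.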
Deep learning as approx. inference

In summary:

Minimise the divergence between q_θ(ω) and p(ω|X,Y):

- Repeat:
  - Sample z_{i,j} ~ Bernoulli(p_i) and set

      W_i = M_i · diag([z_{i,j}]_{j=1}^{K_i}),   ω̂ = {W_i}_{i=1}^L

    (equivalently: randomly set columns of M_i to zero, i.e. randomly set units of the network to zero)
  - Minimise (one step)

      L̂(θ) = − log p(Y|X, ω̂) + KL(q_θ(ω) || p(ω))

    w.r.t. θ = {M_i}_{i=1}^L (a set of matrices).
Deep learning as approx. inference

Sounds familiar?²

    L̂(θ) =  − log p(Y|X, ω̂)   +   KL(q_θ(ω) || p(ω))
             (= the loss)           (= L2 regularisation)

² For more details see the appendix of Gal and Ghahramani (2015): yarin.co/dropout
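In code, the correspondence is that the KL term (for a Gaussian prior) reduces, up to a constant, to L2 weight decay on the variational matrices M_i, so the VI objective is exactly a standard dropout loss plus weight decay. A toy sketch with made-up numbers (the matrices and decay coefficient are illustrative assumptions):

```python
# Dropout VI objective = data loss + weight decay on the M_i matrices.
def dropout_vi_objective(neg_log_lik, Ms, weight_decay):
    # Sum of squared entries over all variational matrices (the KL term,
    # up to a constant, for a Gaussian prior).
    l2 = sum(m * m for M in Ms for row in M for m in row)
    return neg_log_lik + weight_decay * l2

obj = dropout_vi_objective(neg_log_lik=1.25,
                           Ms=[[[1.0, -2.0]], [[0.5]]],
                           weight_decay=1e-2)
# 1.25 + 0.01 * (1 + 4 + 0.25) = 1.3025
```

This is why an off-the-shelf dropout network trained with weight decay can be read as performing approximate variational inference.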
Why does my model work?

Now we can answer: "Why does dropout work?"

- It adds noise
- Sexual reproduction³
- Because it approximately integrates over the model parameters
  - The noise is a side-effect of approximate integration
  - This explains model over-specification, "adaptive model capacity"
  - We fit the process that generated our data

³ Srivastava et al. (2014)
Diving deeper into dropout

- "Why this q_θ(·)?"
  - Bernoullis are cheap
  - Dropout at test time ≈ propagating the mean E(W_i) = p_i M_i
- It constrains the weights to near the origin:
  - Posterior uncertainty decreases with more data
  - Var(W_i) = M_i M_i^T (p_i − p_i²)
  - For fixed p_i, to decrease uncertainty we must decrease ||M_i||
  - Smallest ||M_i|| = strongest regularisation, at p_i = 0.5.
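The last point is just the Bernoulli variance factor p − p² peaking at p = 0.5: for a fixed target uncertainty, the larger this factor, the smaller ||M_i|| must be. A quick numeric check (illustrative only):

```python
# The Bernoulli variance factor p - p^2 that scales Var(W_i) is
# maximised at p = 0.5, so the pull of M_i toward the origin is
# strongest at dropout probability 0.5.
factors = {p / 10: p / 10 - (p / 10) ** 2 for p in range(1, 10)}
best_p = max(factors, key=factors.get)
# best_p == 0.5, with factor 0.25
```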
Other stochastic reg. techniques

- Multiplicative Gaussian noise (Srivastava et al., 2014):
  - Multiply the network units by N(1, 1)
  - Same performance as dropout

Multiplicative Gaussian noise as approximate inference⁴:

    z_{i,j} ~ N(1, 1)   for i = 1, ..., L,  j = 1, ..., K_{i−1}
    W_i = M_i · diag([z_{i,j}]_{j=1}^{K_i})
    q_θ(ω) = ∏_i q_{M_i}(W_i)

Similarly for drop-connect (Wan et al., 2013) and hashed neural networks (Chen et al., 2015).

⁴ See Gal and Ghahramani (2015) and Kingma et al. (2015)
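The only change from the Bernoulli case is the distribution of the column noise: scale each column of M by z_j ~ N(1, 1) instead of masking it, leaving the mean E(W) = M unchanged. A toy sketch (matrix values and seed are illustrative assumptions):

```python
import random

# Multiplicative Gaussian noise as a variational distribution:
# W = M . diag(z) with z_j ~ N(1, 1), one noise value per column.
def sample_mgn_weights(M, rng):
    n_cols = len(M[0])
    z = [rng.gauss(1.0, 1.0) for _ in range(n_cols)]
    return [[m * zj for m, zj in zip(row, z)] for row in M]

rng = random.Random(0)
M = [[1.0, 2.0],
     [3.0, 4.0]]
W = sample_mgn_weights(M, rng)
# Every entry in a column of W is the matching entry of M scaled by the
# same Gaussian factor z_j.
```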
Real world implications

"A theory is worth nothing if you can't use it to make better code."
– DeadMG, Jun 10 '12, stackexchange

- Better use of dropout
- Model structure selection
  - (No time: use Bayesian statistics to understand model architecture)
Better use of dropout

How do we use dropout with convolutional neural networks (convnets)?

[Figure: LeNet convnet structure, with dropout applied after the final inner-product layers and no dropout after the convolutions. Image source: LeCun et al. (1998)]
Better use of dropout

Why not use dropout (and related techniques) with convolutions?

- It doesn't work
  - Low co-adaptation in convolutions
- Because it's not used correctly
  - Standard dropout averages the weights at test time
Better use of dropout

Instead, use the predictive mean, approximated with MC integration:

    E_{q_θ(y*|x*)}(y*) ≈ (1/T) Σ_{t=1}^T ŷ(x*, ω̂_t),   with ω̂_t ~ q_θ(ω).

- In practice: average stochastic forward passes through the network (referred to as "MC dropout").⁵
- Dropout after convolutions and averaging forward passes = approximate inference in Bayesian convnets.⁶

⁵ Also suggested in Srivastava et al. (2014) as model averaging.
⁶ See yarin.co/bcnn for more details
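MC dropout keeps the Bernoulli masks active at test time and averages T stochastic passes. A toy sketch with a one-layer "network" (the weights, input, and keep probability are illustrative assumptions):

```python
import random
import statistics

# One forward pass with units kept with probability p (mask resampled
# every call), i.e. one draw y_hat(x*, omega_t).
def stochastic_forward(x, M, p, rng):
    return sum(m * xj * (1.0 if rng.random() < p else 0.0)
               for m, xj in zip(M, x))

# MC dropout: average T stochastic forward passes instead of scaling
# the weights once at test time.
def mc_dropout_mean(x, M, p=0.5, T=20_000, seed=0):
    rng = random.Random(seed)
    return statistics.fmean(stochastic_forward(x, M, p, rng)
                            for _ in range(T))

y = mc_dropout_mean(x=[1.0, 2.0], M=[0.5, -0.25], p=0.5)
# For this linear toy model the MC estimate converges to the
# weight-averaged prediction p * (M . x) = 0.5 * 0.0 = 0.0
```

For linear layers the two coincide in expectation; after nonlinearities (and convolutions) they differ, which is why MC dropout helps.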
Huge improvement (MNIST)

[Figure: MNIST test error (%) vs. training batches (log scale, 10^4 to 10^7) for five models: standard dropout (lenet-all), MC dropout (lenet-all), standard dropout (lenet-ip), MC dropout (lenet-ip), and no dropout (lenet-none). Red: standard LeNet, no dropout ("before dropout"). Green: standard dropout LeNet, dropout at the end ("empirical dropout"). Dashed blue: Bayesian LeNet with weight averaging (FAIL). Solid blue: Bayesian LeNet with MC dropout ("principled dropout").]
Over-fitting on small data

- Robustness to over-fitting on smaller datasets:

[Figure: test error (%) vs. training batches for (a) entire MNIST, standard dropout convnet; (b) entire MNIST, Bayesian convnet; (c) 1/4 of MNIST, standard dropout convnet; (d) 1/4 of MNIST, Bayesian convnet.]
State-of-the-art on CIFAR-10

CIFAR Test Error (and Std.)

Model           | Standard Dropout          | MC Dropout
NIN             | 10.43 (Lin et al., 2013)  | 10.27 ± 0.05
DSN             |  9.37 (Lee et al., 2014)  |  9.32 ± 0.02
Augmented-DSN   |  7.95 (Lee et al., 2014)  |  7.71 ± 0.09

Table: Bayesian techniques (MC dropout) with existing state-of-the-art
RW: Camera pose localisation

- Find the location from which a picture was taken⁷
- Kendall and Cipolla (2015) show a 10–15% improvement on the state-of-the-art with Bayesian convnets

[Figure: localisation accuracy for different error thresholds]

⁷ Figures used with author permission
Outline

- Many unanswered questions
- Why does my model work?
- What does my model know?
  - Why should I care about uncertainty?
  - How can I get uncertainty in deep learning?
  - What does this uncertainty look like?
  - Real-world implications
- Why does my model predict this and not that, and other open problems
- Conclusions
Why should I care about uncertainty?

- We train a model to recognise dog breeds
- And are given a cat to classify
- What would you want your model to do?
- Similar problems arise in decision making, physics, life science, etc.⁸
- For the practitioner: pass inputs with low confidence to specialised models
- But I already have uncertainty in classification! Well... no

[Figure: (a) softmax input as a function of the data x: f(x); (b) softmax output as a function of the data x: σ(f(x))]

- We need to be able to tell what our model knows and what it doesn't.⁹

⁸ Complete references at end of slides
⁹ Friendly introduction given in yarin.co/blog
How to get uncertainty in deep learning

- We fit a distribution; we have already used its first moment:

    E(y*) ≈ (1/T) Σ_{t=1}^T ŷ(x*, ω̂_t),   with ω̂_t ~ q_θ(ω).

- For uncertainty (in regression), look at the second moment:

    Var(y*) ≈ τ⁻¹ I + (1/T) Σ_{t=1}^T ŷ(x*, ω̂_t)^T ŷ(x*, ω̂_t) − E(y*)^T E(y*)

- As simple as looking at the sample variance of stochastic forward passes through the network (plus the observation noise).¹⁰

¹⁰ See yarin.co/dropout for more details
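Both moments come from the same T stochastic passes. A scalar-output sketch (the toy "network", τ, and T are illustrative assumptions):

```python
import random
import statistics

# MC dropout predictive mean and variance for scalar outputs:
# mean of T stochastic passes, and their raw second moment minus the
# squared mean, plus the observation-noise term tau^{-1}.
def predictive_moments(forward, x, T, tau, seed=0):
    rng = random.Random(seed)
    ys = [forward(x, rng) for _ in range(T)]
    mean = statistics.fmean(ys)
    var = 1.0 / tau + statistics.fmean(y * y for y in ys) - mean * mean
    return mean, var

# Toy stochastic "network": the input plus a Bernoulli(0.5)-masked offset.
def toy_forward(x, rng):
    return x + (1.0 if rng.random() < 0.5 else 0.0)

mean, var = predictive_moments(toy_forward, x=1.0, T=50_000, tau=4.0)
# mean ≈ 1.5; var ≈ 1/4 + Var(Bernoulli(0.5)) = 0.25 + 0.25 = 0.5
```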
What does this uncertainty look like?

What would be the CO2 concentration level in Mauna Loa, Hawaii, in 20 years' time?

- Normal dropout (weight averaging, 5 layers, ReLU units):
- Same network, Bayesian perspective:
What does this uncertainty look like?

[Online demo]¹¹

¹¹ yarin.co/blog
What does this uncertainty look like?

- How good is our uncertainty estimate?
Deep Reinforcement Learning

- We have a "Roomba"¹²
- Penalised −5 for walking into a wall, rewarded +10 for collecting dirt
- Our environment is stochastic and ever changing
- We want a net to learn what actions to take in different situations

¹² Code based on Karpathy and authors. github.com/karpathy/convnetjs
Deep Reinforcement Learning

Behavioural policies:

- Epsilon-greedy: take random actions with probability ε and optimal actions otherwise
- Using uncertainty we can learn faster:
  - Thompson sampling: draw a realisation from the current belief over the world, and choose the action with the highest value
  - In practice: simulate a stochastic forward pass through the dropout network and choose the action with the highest value
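Thompson sampling can be sketched with a toy bandit: keep a belief over each action's value, draw one realisation per action, and act greedily on the draws. With a dropout Q-network, the "draw" is one stochastic forward pass. Everything below (beliefs, seed, trial count) is an illustrative assumption:

```python
import random

# One Thompson-sampling step: sample a value for each action from the
# current belief (mean, std), then pick the argmax of the draws.
def thompson_action(beliefs, rng):
    draws = [rng.gauss(mu, sd) for mu, sd in beliefs]
    return max(range(len(draws)), key=lambda a: draws[a])

rng = random.Random(0)
# Two plausible actions with overlapping beliefs, and one action the
# agent is confident is bad.
beliefs = [(1.0, 0.5), (0.8, 0.5), (0.0, 0.01)]
counts = [0, 0, 0]
for _ in range(2000):
    counts[thompson_action(beliefs, rng)] += 1
# Higher believed value is chosen more often, while the confidently bad
# action is almost never explored: uncertainty drives the exploration.
```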
Deep Reinforcement Learning

[Online demo]¹³

¹³ yarin.co/blog
Deep Reinforcement Learning

[Figure: average reward over time (log scale) for Thompson sampling (using uncertainty) vs. epsilon greedy.]
RW: Camera localisation (with uncertainty!)

- Where was a picture taken? (Kendall and Cipolla, 2015)¹⁴
- Uncertainty increases as a test photo diverges from the training distribution
- Test photos with high uncertainty show strong occlusion from vehicles, pedestrians, or other objects
- Localisation error correlates with uncertainty

¹⁴ Figures used with author permission
RW: Image segmentation

- Scene understanding: what's in a photo and where? (Kendall, Badrinarayanan, and Cipolla, 2015)¹⁵

¹⁵ Figures used with author permission
RW: DNA methylation

- Angermueller and Stegle (2015) fit a network to predict DNA methylation, used for gene regulation
- Looking at the methylation rate of different embryonic stem cells, uncertainty increases in genomic contexts that are hard to predict (e.g. LMR or H3K27me3)
What lies ahead

Use the theory to answer many questions: How can we...

- ... build interpretable models?
- ... combine Bayesian techniques & deep models?
- ... practically use deep learning uncertainty in existing models?
- ... extend deep learning in a principled way?
Many unanswered questions left

- Interpretable models?
  - Will you trust a decision made by a black box?
  - Rich literature on interpretable Bayesian models (e.g. Sun (2006), Letham (2014))
  - Combine Bayesian and deep models in a principled way?
- Combine Bayesian techniques & deep models?
  - Unsupervised learning: Bayesian data analysis?
  - Bayesian models with complex data? (sequence data, image data)
Many unanswered questions left
I Practical deep learning uncertainty?
I Capture language ambiguity?
Image Source: cs224d.stanford.edu/lectures/CS224d-Lecture8.pdf
I Weight uncertainty for model debugging?
I Principled extensions of deep learning?
I Dropout in recurrent networks?
Image Source: karpathy.github.io/2015/05/21/rnn-effectiveness
I New appr. distributions = new stochastic reg. techniques? qθ(ω) = ?
I Model compression: Wi ∼ discrete distribution w. continuous base measure?
Work in progress!
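To make “dropout in recurrent networks” concrete, one candidate design (an illustrative assumption, not a result stated on the slides) is to sample a single Bernoulli mask for the recurrent state and reuse it at every time step, instead of resampling per step:

```python
import numpy as np

rng = np.random.default_rng(1)

# Tiny vanilla RNN cell; weights are random placeholders, not a trained model.
d_in, d_h = 4, 8
Wx = rng.standard_normal((d_in, d_h)) * 0.1
Wh = rng.standard_normal((d_h, d_h)) * 0.1

def rnn_with_tied_dropout(xs, p_drop=0.25):
    """Run the cell over a sequence, applying ONE dropout mask on the
    recurrent state, sampled once and reused at every time step."""
    mask = (rng.random(d_h) > p_drop) / (1.0 - p_drop)  # tied mask
    h = np.zeros(d_h)
    for x in xs:
        h = np.tanh(x @ Wx + (h * mask) @ Wh)  # same mask each step
    return h

xs = rng.standard_normal((5, d_in))
h_final = rnn_with_tied_dropout(xs)
```

Tying the mask across time keeps the sampled network fixed within a sequence, which is one way a stochastic regulariser could correspond to a single draw from an approximating distribution over weights.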
Outline
I Many unanswered questions
I Why does my model work?
I What does my model know?
I Why does my model predict this and not that, and other open problems
I Conclusions
Conclusions
The theory above means that modern deep learning:
I captures stochastic processes underlying observed data
I can use the vast Bayesian statistics literature
I can be explained by a mathematically rigorous theory
I can be extended in a principled way
I can be combined with Bayesian models / techniques in a practical way (we saw this!)
I has uncertainty estimates built-in (we saw this as well!)
But...
New horizons
Most exciting is the work to come:
I Practical uncertainty in deep learning
I Principled extensions to deep learning
I Hybrid deep learning – Bayesian models
and much, much more.
Thank you for listening.
References
I Y Gal, Z Ghahramani, “Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning”, arXiv preprint, arXiv:1506.02142 (2015).
I Y Gal, Z Ghahramani, “Dropout as a Bayesian Approximation: Appendix”, arXiv preprint, arXiv:1506.02157 (2015).
I Y Gal, Z Ghahramani, “Bayesian Convolutional Neural Networks with Bernoulli Approximate Variational Inference”, arXiv preprint, arXiv:1506.02158 (2015).
I A Kendall, R Cipolla, “Modelling Uncertainty in Deep Learning for Camera Relocalization”, arXiv preprint, arXiv:1509.05909 (2015).
I C Angermueller and O Stegle, “Multi-task deep neural network to predict CpG methylation profiles from low-coverage sequencing data”, NIPS MLCB workshop (2015).
I JM Hernández-Lobato, RP Adams, “Probabilistic Backpropagation for Scalable Learning of Bayesian Neural Networks”, ICML (2015).
I DP Kingma, T Salimans, M Welling, “Variational Dropout and the Local Reparameterization Trick”, NIPS (2015).
I DJ Rezende, S Mohamed, D Wierstra, “Stochastic Backpropagation and Approximate Inference in Deep Generative Models”, ICML (2014).
Refs – Bayesian neural networks
I Denker, Schwartz, Wittner, Solla, Howard, Jackel, and Hopfield, “Large Automatic Learning, Rule Extraction, and Generalization”, Complex Systems (1987).
I Tishby, Levin, and Solla, “A statistical approach to learning and generalization in layered neural networks”, COLT (1989).
I Denker and LeCun, “Transforming neural-net output levels to probability distributions”, NIPS (1991).
I D MacKay, “A practical Bayesian framework for backpropagation networks”, Neural Computation (1992).
I GE Hinton and D van Camp, “Keeping the neural networks simple by minimizing the description length of the weights”, Computational learning theory (1993).
I R Neal, “Bayesian Learning for Neural Networks”, PhD dissertation (1995).
I D Barber and CM Bishop, “Ensemble learning in Bayesian neural networks”, Computer and Systems Sciences (1998).
I A Graves, “Practical variational inference for neural networks”, NIPS (2011).
I C Blundell, J Cornebise, K Kavukcuoglu, and D Wierstra, “Weight uncertainty in neural networks”, ICML (2015).
Refs – importance of being uncertain
I Krzywinski and Altman, “Points of significance: Importance of being uncertain”, Nature Methods (2013).
I Herzog and Ostwald, “Experimental biology: Sometimes Bayesian statistics are better”, Nature (2013).
I Nuzzo, “Scientific method: Statistical errors”, Nature (2014).
I Woolston, “Psychology journal bans P values”, Nature (2015).
I Ghahramani, “Probabilistic machine learning and artificial intelligence”, Nature (2015).