
Page 1: How to capture uncertainties with Neural Networks

How to capture uncertainties with

Neural Networks

A high level intro of probabilistic deep learning models

1

Beate Sick

Page 2: How to capture uncertainties with Neural Networks

Probabilistic Deep Learning is done in team work

2

Lisa Herzog

The topic was started in 2017 with Elvis and Oliver at ZHAW and with Lisa at UZH, and since then a lot has been going on…

Page 3: How to capture uncertainties with Neural Networks

Outline

• Why are traditional neural network models “not probabilistic”?

• How can a NN model capture the data inherent variation?

spoiler: by modeling a parametric conditional probability distribution (CPD)

• How to capture the uncertainty of the parameters that fix the CPD?

spoiler: use Bayesian methods to get a distribution of plausible parameter values

• Aleatoric and epistemic uncertainty: The new buzzwords in the field

• Using epistemic uncertainty to identify novel classes in a real-world classification task.

3

Page 4: How to capture uncertainties with Neural Networks

4

Probabilistic Neural Networks

for simple regression

Page 5: How to capture uncertainties with Neural Networks

Can we predict the blood pressure from the age of a woman?

5

People say that age is just a state of mind. I say it's more about the state of your body.

Geoffrey Parfitt

The sbp (systolic blood pressure) tends to increase with age.

Hypothesis: Knowing the age of a woman helps us to predict the blood pressure.

Page 6: How to capture uncertainties with Neural Networks

Simple regression via a NN: no probabilistic model in mind

6

$y(x_i) = a \cdot x_i + b$

Page 7: How to capture uncertainties with Neural Networks

How to express uncertainty in statistics or machine learning?

• Probability is the language of uncertainty

• Probability distributions are the tool to quantify uncertainties

7

Page 8: How to capture uncertainties with Neural Networks

CPD

Conditional Probability

Distribution

8

Page 9: How to capture uncertainties with Neural Networks

Fitting a probabilistic model allows us to capture variations

9

For each age (x-position) we have fitted a whole distribution of possible sbp values (y-values or response values – here systolic blood pressure).

Assume constant variance

Page 10: How to capture uncertainties with Neural Networks

Simple linear regression models the $\mu_x$-parameter of $N(\mu_x, \sigma^2)$

10

Traditional view with focus on means and residuals:

$y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma^2)$

We model the expected value: $\mu_{x_i} = a \cdot x_i + b$

Slightly changed view with focus on conditional distributions:

$(Y \mid X = x_i) \sim N(\mu_{x_i}, \sigma^2)$ — the Conditional Probability Distribution (CPD)

We model the $\mu_{x_i}$-parameter of the CPD: $\mu_{x_i} = a \cdot x_i + b$

Page 11: How to capture uncertainties with Neural Networks

Bird's-eye view on a probabilistic prediction model

11

Input $X$ → Artificial Intelligence ;-) → Parameters of the CPD (outcome distribution)

$x_i \;\rightarrow\; \mu_{x_i},\, \sigma_{x_i}$

Page 12: How to capture uncertainties with Neural Networks

Using a neural network for simple linear regression

12

Input: $age_i$ of woman i

Input $age_i$ → Artificial Intelligence ;-) → Outcome distribution $sbp_i \sim N(\mu_i = out_i, \sigma^2)$

$\mu_{x_i} = out_i = a \cdot age_i + b$

Page 13: How to capture uncertainties with Neural Networks

Fitting the CPD parameter

via Maximum Likelihood

13

Page 14: How to capture uncertainties with Neural Networks

How to find the weights in the NN that yield good $\mu$-estimates?

Maximum Likelihood principle:

14

$Y_{x_i} \sim N(\mu_{x_i} = out_i, \sigma^2)$

$w_{ML} = \operatorname*{argmax}_{w} \prod_{i=1}^{n} f(y_i \mid x_i, w)$

For the Gaussian CPD with constant variance this is equivalent to least squares:

$(\hat a, \hat b)_{ML} = \operatorname*{argmin}_{a,b} \frac{1}{n} \sum_{i=1}^{n} \left(a \cdot x_i + b - y_i\right)^2$

$\hat\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} \left(\hat a \cdot x_i + \hat b - y_i\right)^2$

$\hat a, \hat b$ can be found analytically or via gradient descent.

Page 15: How to capture uncertainties with Neural Networks

Few code lines to fit the model

15

As machine learners we are interested in the outcome prediction, not in the parameter values!
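The code itself is not reproduced in this transcript, so here is a minimal sketch of what "a few code lines" could look like in Keras (toy data and variable names are made up; only the model structure follows the slides): a single linear unit models $\mu_{x_i} = a \cdot x_i + b$, and because the Gaussian CPD has constant variance, maximizing the likelihood is the same as minimizing the MSE.

```python
# Minimal sketch (toy data, hypothetical values), assuming TensorFlow/Keras:
# one linear unit models mu_x = a*x + b; with a Gaussian CPD of constant
# variance, maximum likelihood is equivalent to minimizing the MSE.
import numpy as np
import tensorflow as tf

x = np.random.uniform(20, 80, size=(200, 1)).astype("float32")                  # "age"
y = (90 + 0.7 * x + np.random.normal(0, 10, size=(200, 1))).astype("float32")   # "sbp"

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])  # out = a*x + b
model.compile(optimizer="adam", loss="mse")   # MSE <=> Gaussian ML with constant sigma
model.fit(x, y, epochs=200, verbose=0)

sigma_hat = np.std(y - model.predict(x, verbose=0))   # sigma estimated from the residuals
```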

Page 16: How to capture uncertainties with Neural Networks

How to do linear regression with non-constant variance?

16

Idea 1: Transform data so that variance is stable.

Idea 2: Fit a conditional probability distribution where both parameters ($\mu$, $\sigma^2$) depend on x.

Page 17: How to capture uncertainties with Neural Networks

How to do linear regression with non-constant variance?

17

$Y_{x_i} \sim N\!\left(\mu_{x_i} = out_{1,i},\; \sigma^2_{x_i} = \left(e^{out_{2,i}}\right)^2\right)$

Input $x_i$ → Neural Network → Outcome distribution $Y_{x_i}$

[Network diagram: the input $x_i$ and a bias node 1 feed two linear output nodes, $out_1$ (weights $a$, $b$) and $out_2$ (weights $c$, $d$), which provide $\mu_{x_i}$ and $\sigma_{x_i}$.]

$out_{1,i} = a \cdot x_i + b$

$out_{2,i} = c \cdot x_i + d$

$\mu_{x_i} = out_{1,i}, \qquad \sigma_{x_i} = e^{out_{2,i}}$

Page 18: How to capture uncertainties with Neural Networks

How to find the weights in the NN that yield good $\mu$-estimates?

Maximum Likelihood principle:

19

$w_{ML} = \operatorname*{argmin}_{w} \left(-\sum_{i=1}^{n} \log f(y_i \mid x_i, w)\right)$

$Y_{x_i} \sim N\!\left(\mu_{x_i} = out_{1,i},\; \sigma^2_{x_i} = \left(e^{out_{2,i}}\right)^2\right)$

The weights $\hat a, \hat b, \hat c, \hat d$ are found via gradient descent.

[Plot: $out_1 \pm 2 \cdot e^{out_2}$ – 3 lines are plotted.]
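As an illustration of this slide, here is a hedged sketch (not the original course code) of the maximum-likelihood fit for the two-output model: the network emits $out_1$ and $out_2$, the loss is the Gaussian negative log-likelihood with $\mu = out_1$ and $\sigma = e^{out_2}$, and gradient descent finds $\hat a, \hat b, \hat c, \hat d$.

```python
# Hedged sketch of the ML fit for the two-output model: mu = out1, sigma = e^{out2}.
# The custom loss is the Gaussian negative log-likelihood (up to a constant).
import tensorflow as tf

def gaussian_nll(y_true, out):
    mu = out[:, 0:1]
    sigma = tf.exp(out[:, 1:2])              # sigma_x = e^{out2}, as on the slide
    return tf.reduce_mean(tf.math.log(sigma) + 0.5 * tf.square((y_true - mu) / sigma))

# linear heads only (no hidden layer): out1 = a*x + b, out2 = c*x + d
model = tf.keras.Sequential([tf.keras.layers.Dense(2, input_shape=(1,))])
model.compile(optimizer="adam", loss=gaussian_nll)
# model.fit(x, y, epochs=500, verbose=0)     # x, y as in the earlier sketch
```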

Page 19: How to capture uncertainties with Neural Networks

Outcome CPD is given by its parameter values

20

Credits for animated gif: https://crumplab.github.io/statistics/gifs.html

$(Y \mid X = x_i) \sim N(\mu_{x_i}, \sigma^2)$

In traditional linear regression we only model the $\mu$-parameter of the CPD as dependent on x and assume that the variance $\sigma^2$ does not depend on x:

$\mu_{x_i} = a \cdot x_i + b$

[Animated plot: the density of the CPD over y at each x, with the observation depicted as a blue dot.]

With the two-output NN both parameters depend on x:

$Y_{x_i} \sim N(\mu_{x_i}, \sigma^2_{x_i})$ with $\mu_{x_i} = out_{1,i}$ and $\sigma_{x_i} = e^{out_{2,i}} = e^{c \cdot x_i + d}$

[Plot: $out_1 \pm 2 \cdot e^{out_2}$ – 3 lines are plotted.]

Page 20: How to capture uncertainties with Neural Networks

Let's use the same NN architecture for a different data set

21

What is going wrong? Without a hidden layer, $out_2$ is a linear function of x, so sigma is a monotone function of x!

$(Y \mid X = x_i) \sim N(\mu_{x_i}, \sigma^2_{x_i})$

$\mu_{x_i} = a \cdot x_i + b, \qquad \sigma_{x_i} = e^{c \cdot x_i + d}$

$\mu_{x_i} = out_{1,i}, \qquad \sigma_{x_i} = e^{out_{2,i}}$

[Plot: $out_1 \pm 2 \cdot e^{out_2}$ – 3 lines are plotted.]

Page 21: How to capture uncertainties with Neural Networks

NN architecture for a non-monotone relation of input and variance

22

$(Y \mid X = x_i) \sim N(\mu_{x_i}, \sigma^2_{x_i})$

$\mu_{x_i} = out_{1,i}, \qquad \sigma_{x_i} = e^{out_{2,i}}$

[Plot: $out_1 \pm 2 \cdot e^{out_2}$ – 3 lines are plotted.]

Page 22: How to capture uncertainties with Neural Networks

23

NN architecture to fit non-linear heteroscedastic data

$(Y \mid X = x_i) \sim N(\mu_{x_i}, \sigma^2_{x_i})$

$\mu_{x_i} = out_{1,i}, \qquad \sigma_{x_i} = e^{out_{2,i}}$
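A possible implementation of this architecture, sketched with TensorFlow Probability (TFP is mentioned later in the slides; the hidden-layer sizes 30 and 20 are taken from the later Bayesian example and are only illustrative here):

```python
# Sketch with TensorFlow Probability (assuming tensorflow_probability is installed);
# hidden-layer sizes are illustrative. The network outputs out1 and out2, and the
# DistributionLambda layer turns them into the CPD N(mu = out1, sigma = e^{out2}).
import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions

model = tf.keras.Sequential([
    tf.keras.layers.Dense(30, activation="relu", input_shape=(1,)),
    tf.keras.layers.Dense(20, activation="relu"),
    tf.keras.layers.Dense(2),                                   # out1, out2
    tfp.layers.DistributionLambda(
        lambda t: tfd.Normal(loc=t[..., 0:1],                   # mu_x    = out1
                             scale=tf.exp(t[..., 1:2]))         # sigma_x = e^{out2}
    ),
])
negloglik = lambda y, dist: -dist.log_prob(y)                   # ML = minimize NLL
model.compile(optimizer="adam", loss=negloglik)
# model.fit(x, y, epochs=1000, verbose=0)
```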

Page 23: How to capture uncertainties with Neural Networks

25

Parameter uncertainty

via Bayesian approach

Page 24: How to capture uncertainties with Neural Networks

Recap: Bayesian models have distributions over their parameters

26


Bayesian fit of a polynomial function with 6 parameters

$\hat y = \hat\theta_0 + \hat\theta_1 x + \hat\theta_2 x^2 + \hat\theta_3 x^3 + \hat\theta_4 x^4 + \hat\theta_5 x^5$

with $\hat\theta_0 \sim \text{Distribution}$, e.g. $N(\mu_0, \sigma_0)$, …, $\hat\theta_5 \sim \text{Distribution}$, e.g. $N(\mu_5, \sigma_5)$

A fitted Bayesian model represents not 1 polynomial but a distribution of polynomials.

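A tiny numeric illustration of this statement (the posterior means and standard deviations below are invented numbers, not fitted values): drawing the six coefficients repeatedly from their distributions yields a whole band of plausible polynomials.

```python
# Tiny illustration: each coefficient has its own (posterior) distribution, so every
# draw gives a different polynomial. The means and sds below are made-up numbers.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 5, 100)

post_mu = np.array([0.5, -1.2, 0.8, -0.15, 0.012, -0.0003])   # hypothetical posterior means
post_sd = np.array([0.3, 0.4, 0.2, 0.05, 0.004, 0.0001])      # hypothetical posterior sds

curves = []
for _ in range(50):                                  # 50 plausible polynomials
    theta = rng.normal(post_mu, post_sd)             # one draw of all 6 coefficients
    curves.append(sum(t * x**k for k, t in enumerate(theta)))
# plotting the 50 curves shows a band of polynomials instead of a single line
```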

Page 25: How to capture uncertainties with Neural Networks

Frequentist’s vs Bayesian’s answer when asked to do prediction

27

Page 26: How to capture uncertainties with Neural Networks

Bayesian framework

• In Bayesian modeling we define a prior distribution over the parameters W: $W_i \sim N(0, I)$, defining $p(w_i)$

• For regression NN we have the likelihood:

• Given a dataset X, Y we then look for the posterior distribution capturing the most probable model parameters given the observed data:

$p(W \mid X, Y) = \dfrac{\overbrace{p(Y \mid X, W)}^{\text{likelihood}}\;\overbrace{p(W)}^{\text{prior}}}{\underbrace{p(Y \mid X)}_{\text{normalizer = marginal likelihood}}}$

28

$(Y \mid X = x_i) \sim N(\mu_{x_i}, \sigma^2_{x_i})$ with $\mu_{x_i} = out_{1,i}$ and $\sigma_{x_i} = \ln\!\left(1 + e^{out_{2,i}}\right)$

The posterior is not possible to derive exactly in the case of a NN.

Page 27: How to capture uncertainties with Neural Networks

Approximate Bayesian NN by variational inference

29

Oliver, Elvis and I plan to give additional BBSs on the theory and frameworks of Bayesian DL models. There was a BBS on Dropout as a Bayesian approximation method: https://tensorchiefs.github.io/bbs/files/dropouts-brownbag.pdf
*Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning, https://arxiv.org/abs/1506.02142

TFp

We should sample from the weight distribution during test time.

MC Dropout: randomly drop nodes in each run. Usually this is done during training.

[Network diagram with dropout; node labels $p_1, p_2, p_3$ and outputs $out_1$, $out_2$.]
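A hedged sketch of MC Dropout as described on this slide (model layout and dropout rates are placeholders, not the original code): dropout is kept active at prediction time, and each forward pass samples a different sub-network.

```python
# Hedged sketch of MC Dropout (placeholder architecture, not the original code):
# keep dropout active at prediction time and average many stochastic forward passes.
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(30, activation="relu", input_shape=(1,)),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(20, activation="relu"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(2),                        # out1 (mu), out2 (-> sigma)
])
# ... compile and fit as in the earlier sketches ...

def mc_predict(model, x_new, T=100):
    # training=True keeps dropout switched on, so each pass samples a new sub-network
    runs = np.stack([model(x_new, training=True).numpy() for _ in range(T)])
    return runs.mean(axis=0), runs.std(axis=0)       # MC mean and spread per output
```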

Page 28: How to capture uncertainties with Neural Networks

Bayesian NNs provide uncertainty on the parameter values

30

To indicate the uncertainty about $\mu_{x_i}$ we plot here a curve for $\text{mean}(out_1(x))$ and two curves for $\text{mean}(out_1(x)) \pm 2 \cdot \text{sd}(out_1(x))$.

We would expect higher uncertainty where we have less data – here the uncertainty is incredibly small – why?

[Network diagram: input $x$ and a bias node 1 connect linearly to $out_1$ (weights $a$, $b$) and $out_2$ (weights $c$, $d$).]

There is no hidden layer: the fit is linear over the whole range, so we can use all points to estimate this line (only 2 degrees of freedom)!

Predictive distribution $p(\mu \mid x, w = (a, b, c, d))$: can be approximated by sampling, realized by repeated predictions.

$\text{mean}(out_1(x)) \pm 2 \cdot \text{sd}(out_1(x))$ – 3 lines are plotted.

$\mu_{x_i} = out_{1,i}, \qquad \sigma_{x_i} = \ln\!\left(1 + e^{out_{2,i}}\right)$

We concentrate on the uncertainty of $\mu_{x_i} = out_{1,i}$.

Page 29: How to capture uncertainties with Neural Networks

NNs with hidden layers do not impose a linear model

31

With 2 hidden layers (30,20)

With hidden layers there is no constraint that allows borrowing information from the whole range: $\text{mean}(out_1(x))$ needs to be estimated locally from few data, so there is large uncertainty about $\mu_{x_i}$.

Predictive distribution $p(\mu \mid x, w = (a, b, c, d))$: can be approximated by sampling, realized by repeated predictions.

We concentrate on the uncertainty of $\mu$.

$\text{mean}(out_1(x)) \pm 2 \cdot \text{sd}(out_1(x))$ – 3 lines are plotted.

Page 30: How to capture uncertainties with Neural Networks

Aleatoric uncertainty

&

Epistemic uncertainty

32

Page 31: How to capture uncertainties with Neural Networks

Getting philosophical – Epistemology and Aleatoricism

33

Epistemology is the branch of philosophy

concerned with the theory of knowledge.

Aleatoricism is the incorporation of

chance into the process of creation.

The word derives from the Greek word epistēmē, meaning 'knowledge'.

The word derives from the Latin word alea, the rolling of dice.

https://en.wikipedia.org/wiki/Epistemology

[Image: a Roman die – "Alea iacta est"]

https://en.wikipedia.org/wiki/Aleatoricism
https://en.wikipedia.org/wiki/Alea_iacta_est

Page 32: How to capture uncertainties with Neural Networks

A random experiment with two kinds of uncertainties

34

• We pull a poker chip out of a bag with mixed poker chips (black or blue) and toss it

• Sometimes we get a black chip and sometimes a blue chip

• In 50% of the tosses we see the side with the emoji picture, otherwise the backside

1) Pull a chip 2) Toss the chip Observe color and face of the chip

See http://www.stat.columbia.edu/~gelman/stuff_for_blog/ohagan.pdf

Page 33: How to capture uncertainties with Neural Networks

A random experiment with two kinds of uncertainties

35

• The uncertainties on the color and the face of the drawn and tossed chip are different

• Without any knowledge we would assign a probability of 0.5 to both colors and faces

• By collecting data we gain knowledge about the color probability: epistemic

• The random tossing process always gives a 50% chance to both faces: aleatoric

1) Pull a chip 2) Toss the chip Observe color and face of the chip

See http://www.stat.columbia.edu/~gelman/stuff_for_blog/ohagan.pdf

Page 34: How to capture uncertainties with Neural Networks

“Aleatoric” and “Epistemic” in probabilistic models

- my understanding

37

«aleatoric uncertainty» = inherent data uncertainty

due to the stochastic process that generates the data

captured by the fitted parametric distribution (CPD)

cannot be reduced by providing more training data

«epistemic uncertainty» = uncertainty about the parameter values

captured by CIs or predictive distributions

can be reduced by providing more training data

Need to learn about the composition of the bag

Page 35: How to capture uncertainties with Neural Networks

Probabilistic Deep Learning

for image classification

38

Page 36: How to capture uncertainties with Neural Networks

DL revolutionized the field of image classification

A “CNN” trained to classify dog breeds – Correct!

We fit a probabilistic model: we get a CPD (multinomial) which captures the aleatoric variability. The parameters $(p_1, p_2, p_3)$ of the CPD are estimated by a NN.

Page 37: How to capture uncertainties with Neural Networks

What happens if we present an image of a novel class to the CNN

This is what you would expect, right?

Best would be all zero

(not possible due to softmax)

A network trained to

classify dog breeds

Page 38: How to capture uncertainties with Neural Networks

Traditional deep NNs tend to be over-confident…

You call me a

collie ? #@*$!!

Are you serious?

This is what you get!

A network trained to

classify dog breeds – Plain wrong!

Remedy?: Bayesian Approaches to DL

Page 39: How to capture uncertainties with Neural Networks

The epistemic (parameter) uncertainty is missing

42

We need some error bars!

Page 40: How to capture uncertainties with Neural Networks

MC probability prediction

CNN predicts class “collie” but with high uncertainty

Many dropout runs in the forward pass: use dropout also during prediction.

Remark: The means of the marginals give the components of the mean of the multivariate distribution.

p*max

Page 41: How to capture uncertainties with Neural Networks

MC probability prediction provides epistemic uncertainty

Many Dropout Runs in forward pass

CNN predicts class “collie” this time with low uncertainty

use dropout also during prediction

Remark: The means of the marginals give the components of the mean of the multivariate distribution.

p*max
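For illustration, a hedged sketch of such an MC probability prediction for a classifier (cnn is a placeholder for any dropout-trained CNN that outputs softmax probabilities; it is not the network from the slides):

```python
# Illustrative sketch: many dropout runs give a whole distribution of softmax
# vectors per image; "cnn" is a placeholder for a dropout-trained classifier
# whose forward pass returns class probabilities.
import numpy as np

def mc_class_prediction(cnn, image_batch, T=100):
    # shape (T, n_images, n_classes); training=True keeps dropout active
    probs = np.stack([cnn(image_batch, training=True).numpy() for _ in range(T)])
    mean_probs = probs.mean(axis=0)                  # center of the MC distribution
    pred_class = mean_probs.argmax(axis=-1)          # class with highest mean probability
    spread = probs.std(axis=0)                       # per-class epistemic spread
    return pred_class, mean_probs, spread
```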

Page 42: How to capture uncertainties with Neural Networks

Novelty detection via

epistemic uncertainty

45

Page 43: How to capture uncertainties with Neural Networks

Experiment with novel classes in test set

46

Dürr O, Murina E, Siegismund D, Tolkachev V, Steigele S, Sick B. Know when you don’t know. Assay Drug Dev Technol. 2018.

Page 44: How to capture uncertainties with Neural Networks

Classifying images of known and unknown phenotype

47

Page 45: How to capture uncertainties with Neural Networks

48

Known / Unknown

Classifying images of known and unknown phenotype

Page 46: How to capture uncertainties with Neural Networks

The center of mass quantifies the predicted probability

p*max

The spread quantifies in addition the uncertainty of the predicted probability

σ*: total standard deviation

PE*: entropy

MI*: mutual information

VR*: variation ratio

49

How to quantify the spread of a MC* probability distribution
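As an illustration, here is one way to compute these spread measures from T MC softmax samples for a single image; the formulas follow the usual MC-dropout literature and may differ in detail from the measures used in the paper cited earlier.

```python
# One way to compute the spread measures from T MC softmax samples of shape
# (T, n_classes) for a single image; definitions follow the common MC-dropout
# literature and may differ in detail from the paper's exact implementation.
import numpy as np

def mc_uncertainty_measures(probs, eps=1e-12):
    mean_p = probs.mean(axis=0)                                    # center of mass
    p_max = mean_p.max()                                           # p*max
    total_sd = np.sqrt(probs.var(axis=0).sum())                    # sigma*: total SD
    pe = -(mean_p * np.log(mean_p + eps)).sum()                    # PE*: predictive entropy
    expected_h = -(probs * np.log(probs + eps)).sum(axis=1).mean()
    mi = pe - expected_h                                           # MI*: mutual information
    modes = probs.argmax(axis=1)
    vr = 1.0 - (modes == np.bincount(modes).argmax()).mean()       # VR*: variation ratio
    return {"p_max": p_max, "sigma": total_sd, "PE": pe, "MI": mi, "VR": vr}
```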

Page 47: How to capture uncertainties with Neural Networks

50

Experiments with images of known and unknown phenotype

[Figure panels: Classical (no-MC) CNN – probability of the predicted class; Center of the MC distribution – probability* of the predicted class; Total SD* of the MC distribution; PE* of the MC distribution.]

Page 48: How to capture uncertainties with Neural Networks

Experiments with images of known and unknown phenotype

51

All MC-Dropout-based approaches are superior compared to the non-MC approach.

Page 49: How to capture uncertainties with Neural Networks

Experiments with images of known and unknown phenotype

52

In all cases the Bayesian epistemic uncertainty measures are better suited to identify novel classes than traditional point estimates of the class probability.

Page 50: How to capture uncertainties with Neural Networks

Summary

53

A probabilistic NN fits a whole conditional outcome distribution (CPD)

We need to choose a parametric model for this CPD (e.g. a Gaussian for regression)

We use ML or Bayesian methods to estimate the parameters of the CPD

We use Bayes methods to estimate the uncertainty of the parameter values

Aleatoric uncertainty is the data-inherent variability captured by the CPD

Epistemic uncertainty is the uncertainty about the parameter values

Epistemic uncertainty can be used for novelty detection