A brief overview of probability theory in data science

Geert Verdoolaege
1 Department of Applied Physics, Ghent University, Ghent, Belgium
2 Laboratory for Plasma Physics, Royal Military Academy (LPP–ERM/KMS), Brussels, Belgium

Tutorial, 3rd IAEA Technical Meeting on Fusion Data Processing, Validation and Analysis, 27-05-2019



1 Origins of probability

2 Frequentist methods and statistics

3 Principles of Bayesian probability theory

4 Monte Carlo computational methods

5 Applications: classification, regression analysis

6 Conclusions and references

Overview


Earliest traces in Western civilization: Jewish writings, Aristotle

Notion of probability in law, based on evidence

Usage in finance

Usage and demonstration in gambling

Early history of probability

World is knowable but uncertainty due to human ignorance

William of Ockham: Ockham’s razor

Probabilis: a supposedly ‘provable’ opinion

Counting of authorities

Later: degree of truth, a scale

Quantification:

Law, faith → Bayesian notion

Gaming → frequentist notion

Middle Ages

17th century: Pascal, Fermat, Huygens

Comparative testing of hypotheses

Population statistics

1713: Ars Conjectandi by Jacob Bernoulli:

Weak law of large numbers

Principle of indifference

De Moivre (1718): The Doctrine of Chances

Quantification

Paper by Thomas Bayes (1763): inversion of binomial distribution

Pierre Simon Laplace: practical applications in the physical sciences

1812: Théorie Analytique des Probabilités

Bayes and Laplace

George Boole (1854), John Venn (1866)

Sampling from a ‘population’

Notion of ‘ensembles’

Andrey Kolmogorov (1930s): axioms of probability

Early 20th century classical statistics: Ronald Fisher, Karl Pearson, Jerzy Neyman, Egon Pearson

Applications in social sciences, medicine, natural sciences

Frequentism and sampling theory

Statistical mechanics with Ludwig Boltzmann (1868): maximum entropy energy distribution

Josiah Willard Gibbs (1874–1878)

Claude Shannon (1948): maximum entropy in frequentist language

Edwin Thompson Jaynes: Bayesian re-interpretation

Harold Jeffreys (1939): standard work on (Bayesian) probability

Richard T. Cox (1946): derivation of probability laws

Computers: flowering of Bayesian methods

Maximum entropy and Bayesian methods


Probability = frequency

Straightforward:

Number of 1s in 60 dice throws ≈ 10: p = 1/6

Probability of plasma disruption p ≈ Ndisr/Ntot

Less straightforward:

Probability of fusion electricity by 2050?

Probability that the mass of Saturn mS lies in a given range, e.g. 90–100 Earth masses?

Frequentist probability

Populations vs. sample

E.g. weight w of Belgian men: unknown but fixed for every individual

Average weight µw in the population?

Random variable W

Sample: W1, W2, . . . , Wn

Average weight: statistic (estimator) W̄

Central limit theorem:

W ∼ p(W|µw, σw)  ⇒  W̄ ∼ N(µw, σw/√n)

Statistics
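The CLT scaling above can be checked with a quick simulation; the population parameters and sample sizes below are illustrative, and only the Python standard library is used:

```python
import random
import statistics

# Empirical check of the CLT: the sample mean W̄ of n draws should have
# standard deviation close to sigma_w / sqrt(n).
random.seed(0)
mu_w, sigma_w, n = 80.0, 10.0, 25

means = [statistics.mean(random.gauss(mu_w, sigma_w) for _ in range(n))
         for _ in range(5000)]

sd_of_mean = statistics.stdev(means)
# Theory predicts sigma_w / sqrt(n) = 10 / 5 = 2.0
assert abs(sd_of_mean - sigma_w / n ** 0.5) < 0.2
```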

Maximum likelihood (ML) principle:

µ̂w = arg max_{µw ∈ ℝ⁺} p(W1, . . . , Wn | µw, σw)

   ≈ arg max_{µw ∈ ℝ⁺} ∏_{i=1}^{n} [1/(√(2π) σw)] exp[−(Wi − µw)²/(2σw²)]

   = arg max_{µw ∈ ℝ⁺} [1/(√(2π) σw)]ⁿ exp[−∑_{i=1}^{n} (Wi − µw)²/(2σw²)]

ML estimator (known σw):

µ̂w = W̄ = (1/n) ∑_{i=1}^{n} Wi

Maximum likelihood parameter estimation
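The ML result above — for known σw the sample mean maximizes the Gaussian likelihood — can be verified numerically; the data and parameters below are illustrative:

```python
import math
import random

def gaussian_loglik(data, mu, sigma):
    """Log-likelihood of i.i.d. normal data with known sigma."""
    n = len(data)
    return (-n * math.log(math.sqrt(2 * math.pi) * sigma)
            - sum((x - mu) ** 2 for x in data) / (2 * sigma ** 2))

random.seed(0)
sigma_w = 10.0
sample = [random.gauss(80.0, sigma_w) for _ in range(1000)]

mu_hat = sum(sample) / len(sample)  # ML estimator = sample mean

# The sample mean attains a higher likelihood than nearby values:
assert gaussian_loglik(sample, mu_hat, sigma_w) > gaussian_loglik(sample, mu_hat + 0.5, sigma_w)
assert gaussian_loglik(sample, mu_hat, sigma_w) > gaussian_loglik(sample, mu_hat - 0.5, sigma_w)
```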

Weight of Dutch men compared to Belgian men (populations)

Observed sample averages W̄NL, W̄BE

Null hypothesis H0: µw,NL = µw,BE

Test statistic:

(W̄NL − W̄BE) / σ_{W̄NL − W̄BE} ∼ N(0, 1)  (under H0)

Frequentist hypothesis testing
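A minimal sketch of the test statistic above, assuming known population standard deviations for both groups (an idealization; with estimated variances a t-test would be used, and all numbers here are illustrative):

```python
import math

def two_sample_z(mean1, mean2, sigma1, sigma2, n1, n2):
    """Two-sample z statistic for H0: mu1 = mu2, with known population sigmas."""
    # Standard deviation of the difference of the two sample means
    se_diff = math.sqrt(sigma1 ** 2 / n1 + sigma2 ** 2 / n2)
    return (mean1 - mean2) / se_diff

z = two_sample_z(mean1=84.0, mean2=82.0, sigma1=10.0, sigma2=10.0, n1=400, n2=400)
# se_diff = sqrt(100/400 + 100/400) = sqrt(0.5), so z = 2 / sqrt(0.5) ≈ 2.83
assert abs(z - 2.0 / math.sqrt(0.5)) < 1e-12
```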


Statistical (aleatoric) vs. epistemic uncertainty?

Every piece of information has uncertainty

Uncertainty = lack of information

Observation may reduce uncertainty

Probability (distribution) quantifies uncertainty

Probability theory: quantifying uncertainty

Measurement of physical quantity

Origin of stochasticity:

Apparatus

Microscopic fluctuations

Systematic uncertainty is assigned a probability distribution

E.g. coin tossing, voltage measurement, probability of one hypothesis vs. another, . . .

Bayesian: no random variables

Example: physical sciences

Objective Bayesian view

Probability = real number ∈ [0, 1]

Always conditioned on known information

Notation: p(A|B) or p(A|I)

Extension of logic: measure of degree to which B implies A

Degree of plausibility, but subject to consistency

Same information⇒ same probabilities

Probability distribution: outcome→ probability

What is probability?

p(x, y), p(x), p(y), p(x|y), p(y|x)

Joint, marginal and conditional distributions


Normal/Gaussian probability density function (PDF):

p(x|µ, σ, I) = [1/(√(2π) σ)] exp[−(x − µ)²/(2σ²)]

Probability that x1 ≤ x < x1 + dx: p(x1|µ, σ, I) dx

Inverse problem: µ, σ given x?

Example: normal distribution

Bayes’ theorem

p(θ|x, I) = p(x|θ, I) p(θ|I) / p(x|I)

x = data vector, θ = vector of model parameters, I = implicit knowledge

Likelihood p(x|θ, I): misfit between model and data

Prior distribution p(θ|I): ‘expert’ or diffuse knowledge

Evidence:

p(x|I) = ∫ p(x, θ|I) dθ = ∫ p(x|θ, I) p(θ|I) dθ

Posterior distribution p(θ|x, I)

Updating information states
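A small grid-based sketch of Bayes’ theorem in action, using coin-bias estimation as a stand-in for the generic parameter θ (the grid size and data are illustrative):

```python
# Posterior over a coin's heads probability theta after observing
# 7 heads in 10 tosses, computed on a discrete grid.
N_GRID = 1001
thetas = [i / (N_GRID - 1) for i in range(N_GRID)]

prior = [1.0 / N_GRID] * N_GRID                      # flat prior p(theta|I)
heads, tosses = 7, 10
likelihood = [t ** heads * (1 - t) ** (tosses - heads) for t in thetas]

unnorm = [l * p for l, p in zip(likelihood, prior)]
evidence = sum(unnorm)                               # p(x|I), the normalizer
posterior = [u / evidence for u in unnorm]

# Posterior mass sums to 1 and peaks at the observed fraction 0.7
assert abs(sum(posterior) - 1.0) < 1e-9
assert abs(thetas[posterior.index(max(posterior))] - 0.7) < 0.01
```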


Updating state of knowledge = learning

‘Uninformative’ prior (distribution)

Assigning uninformative priors:

Transformation invariance (Jaynes ’68): e.g. uniform for a location variable

Principle of indifference

Testable information: maximum entropy

. . .

Show me the prior!

Practical considerations

n measurements xi → x

Independent and identically distributed xi:

p(x|µ, σ, I) = ∏_{i=1}^{n} [1/(√(2π) σ)] exp[−(xi − µ)²/(2σ²)]

Bayes’ rule:

p(µ, σ|x, I) ∝ p(x|µ, σ, I) p(µ, σ|I)

Suppose σ ≡ σe → delta-function prior for σ

Assume µ ∈ [µmin, µmax] → uniform prior:

p(µ|I) = 1/(µmax − µmin) if µ ∈ [µmin, µmax], 0 otherwise

Let µmin → −∞, µmax → +∞ ⇒ improper prior

Ensure a proper posterior

Mean of a normal distribution: uniform prior

Posterior:

p(µ|x, I) ∝ exp[−∑_{i=1}^{n} (xi − µ)²/(2σe²)]

Define

x̄ ≡ (1/n) ∑_{i=1}^{n} xi,  (∆x)² ≡ (1/n) ∑_{i=1}^{n} (xi − x̄)²

Completing the square (adding and subtracting n x̄²),

p(µ|x, I) ∝ exp{−[(µ − x̄)² + (∆x)²]/(2σe²/n)}

Retaining the dependence on µ,

p(µ|x, I) ∝ exp[−(µ − x̄)²/(2σe²/n)],  i.e. µ ∼ N(x̄, σe²/n)

Posterior for µ


Normal prior: µ ∼ N(µ0, τ²)

Posterior:

p(µ|x, I) ∝ exp[−∑_{i=1}^{n} (xi − µ)²/(2σe²)] × exp[−(µ − µ0)²/(2τ²)]

Expanding and completing the square,

µ ∼ N(µn, σn²),

where

µn ≡ σn² (n x̄/σe² + µ0/τ²)  and  σn² ≡ (n/σe² + 1/τ²)⁻¹

µn is a weighted average of µ0 and x̄

Mean of a normal distribution: normal prior
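The normal-prior update above is a conjugate computation that can be coded directly; the helper name and data below are illustrative:

```python
# Conjugate update for the mean of a normal with known sigma_e:
# prior mu ~ N(mu0, tau^2), data x1..xn -> posterior N(mu_n, sigma_n^2).
def normal_mean_update(data, sigma_e, mu0, tau):
    n = len(data)
    xbar = sum(data) / n
    prec = n / sigma_e ** 2 + 1 / tau ** 2      # posterior precision = 1/sigma_n^2
    sigma_n2 = 1.0 / prec
    mu_n = sigma_n2 * (n * xbar / sigma_e ** 2 + mu0 / tau ** 2)
    return mu_n, sigma_n2

mu_n, sigma_n2 = normal_mean_update([4.0, 6.0, 5.0, 5.0], sigma_e=2.0, mu0=0.0, tau=1.0)
# n/sigma_e^2 = 1 and 1/tau^2 = 1, so sigma_n^2 = 0.5 and mu_n = 0.5*(5 + 0) = 2.5
assert abs(sigma_n2 - 0.5) < 1e-12 and abs(mu_n - 2.5) < 1e-12
```

Note how the posterior precision is the sum of the data precision n/σe² and the prior precision 1/τ², which is exactly the weighted-average statement on the slide.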

Intrinsic part of Bayesian theory, enabling coherent learning

High-quality data may be scarce

Regularization of ill-posed problems (e.g. tomography)

Prior can retain influence: e.g. regression with errors in all variables

Care for the prior

Repeated measurements → information on σ

Scale variable σ → Jeffreys’ scale prior:

p(σ|I) ∝ 1/σ,  σ ∈ ]0, +∞[

Posterior:

p(µ, σ|x, I) ∝ (1/σⁿ) exp[−((µ − x̄)² + (∆x)²)/(2σ²/n)] × (1/σ)

Unknown mean and standard deviation

Marginalization = integrating out a (nuisance) parameter:

p(µ|x, I) = ∫₀^{+∞} p(µ, σ|x, I) dσ

          ∝ ∫₀^{+∞} (1/2) [((µ − x̄)² + (∆x)²)/(2/n)]^{−n/2} s^{n/2−1} e^{−s} ds

          = (1/2) Γ(n/2) [((µ − x̄)² + (∆x)²)/(2/n)]^{−n/2},

where

s ≡ ((µ − x̄)² + (∆x)²)/(2σ²/n)

Marginal posterior for µ (1)

After normalization:

p(µ|x, I) = Γ(n/2) / [√(π(∆x)²) Γ((n−1)/2)] × [1 + (µ − x̄)²/(∆x)²]^{−n/2}

Changing variables,

t ≡ (µ − x̄)/√((∆x)²/(n − 1)),  with p(t|x, I) dt ≡ p(µ|x, I) dµ,

p(t|x, I) = Γ(n/2) / [√((n − 1)π) Γ((n−1)/2)] × [1 + t²/(n − 1)]^{−n/2}

Student’s t-distribution with parameter ν = n − 1

If n ≫ 1,

p(µ|x, I) → [1/√(2π(∆x)²/n)] exp[−(µ − x̄)²/(2(∆x)²/n)]

Marginal posterior for µ (2)

Marginalization of µ:

p(σ|x, I) = ∫_{−∞}^{+∞} p(µ, σ|x, I) dµ

          ∝ (1/σⁿ) exp[−(∆x)²/(2σ²/n)]

Setting X ≡ n(∆x)²/σ²,

p(X|x, I) = 1/[2^{k/2} Γ(k/2)] X^{k/2−1} e^{−X/2},  k ≡ n − 1

χ² distribution with parameter k

Marginal posterior for σ

Laplace (saddle point) approximation of distributions around the mode (= maximum)

E.g. the marginal for µ:

p(µ|x, I) = Γ(n/2) / [√(π(∆x)²) Γ((n−1)/2)] × [1 + (µ − x̄)²/(∆x)²]^{−n/2}

Taylor expansion around the mode:

ln[p(µ|x, I)] ≈ ln[p(x̄|x, I)] + (1/2) d²(ln p)/dµ²|_{µ=x̄} (µ − x̄)²

             = ln Γ(n/2) − ln Γ((n−1)/2) − (1/2) ln[π(∆x)²] − n(µ − x̄)²/(2(∆x)²)

The Laplace approximation (1)

On the original scale:

p(µ|x, I) ≈ [Γ(n/2)/Γ((n−1)/2)] × [1/√(π(∆x)²)] exp[−(µ − x̄)²/(2(∆x)²/n)]

Standard deviation σL from the curvature of ln p:

σL = [−d²(ln p)/dµ²|_{µ=x̄}]^{−1/2}

The Laplace approximation (2)
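A numerical sketch of the Laplace approximation for the Student-t-type marginal above: the curvature of ln p at the mode is estimated by a finite difference and converted to σL; the values of n and (Δx)² are illustrative:

```python
import math

# Log of the (unnormalized) marginal posterior for mu: a Student-t form
n = 50
xbar = 0.0
dx2 = 4.0  # (Δx)², assumed sample variance

def log_p(mu):
    return -0.5 * n * math.log(1.0 + (mu - xbar) ** 2 / dx2)

# Curvature of ln p at the mode via a central finite difference
h = 1e-4
d2 = (log_p(xbar + h) - 2 * log_p(xbar) + log_p(xbar - h)) / h ** 2
sigma_L = (-d2) ** -0.5

# Analytical curvature is -n/(Δx)², so sigma_L = Δx / sqrt(n)
assert abs(sigma_L - math.sqrt(dx2 / n)) < 1e-6
```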

Laplace approximation: example 1

Laplace approximation: example 2

For θ = [θ1, . . . , θp]ᵗ,

p(θ|θ0, I) ∝ exp[(1/2)(θ − θ0)ᵗ [∇∇(ln p)]_{θ=θ0} (θ − θ0)]

∇∇(ln p): Hessian matrix, with

ΣL = −{[∇∇(ln p)]_{θ=θ0}}^{−1}

Multivariate Laplace approximation

Let {Hi} be a complete set of hypotheses

Data D to support or reject the hypotheses

Bayes’ rule:

p(Hi|D, I) = p(D|Hi, I) p(Hi|I) / p(D|I),  p(D|I) = ∑_i p(D|Hi, I) p(Hi|I)

Assume a single hypothesis H and its complement H̄

Odds ratio o:

o ≡ p(H|D, I)/p(H̄|D, I) = [p(D|H, I)/p(D|H̄, I)] × [p(H|I)/p(H̄|I)]
                            (Bayes factor)         (prior odds)

Model comparison (hypothesis testing)

E.g. n measurements xi of a quantity x

Assume a normal distribution with known variance σ²

Question: are the data compatible with mean µ = µ0?

Yes: H

No: H̄

Under H:

p(x̄|H, I) = C exp[−(x̄ − µ0)²/(2σ²/n)]

Under H̄:

p(x̄|H̄, I) = ∫ p(x̄|µ, σ, I) p(µ|H̄, I) dµ    (1)

Testing a Gaussian mean (1)

Assume bounds µmin and µmax:

p(µ|H̄, I) = [1/|µmax − µmin|] 𝟙(µmin ≤ µ ≤ µmax)

Then (1) becomes

p(x̄|H̄, I) = [C/|µmax − µmin|] ∫_{µmin}^{µmax} exp[−(x̄ − µ)²/(2σ²/n)] dµ

With SE ≡ σ/√n the standard error, assume

|x̄ − µmin|, |x̄ − µmax| ≫ SE

Testing a Gaussian mean (2)

Then

p(x̄|H̄, I) ≈ [C/|µmax − µmin|] ∫_{−∞}^{+∞} exp[−(x̄ − µ)²/(2σ²/n)] dµ

           = C SE √(2π) / |µmax − µmin|

Bayes factor BF:

BF = p(x̄|H, I)/p(x̄|H̄, I) = [|µmax − µmin|/SE] × (1/√(2π)) e^{−z²/2},  z ≡ |x̄ − µ0|/SE

Cf. frequentist hypothesis test

Testing a Gaussian mean (3)
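The Bayes factor formula above translates directly into code; the function name and all numbers are illustrative:

```python
import math

def bayes_factor(xbar, mu0, sigma, n, mu_min, mu_max):
    """Bayes factor for H: mu = mu0 against a uniform prior on [mu_min, mu_max]."""
    se = sigma / math.sqrt(n)
    z = abs(xbar - mu0) / se
    return (mu_max - mu_min) / se / math.sqrt(2 * math.pi) * math.exp(-0.5 * z * z)

# Data exactly at mu0 (z = 0): BF grows with the prior width, because the
# diffuse alternative wastes prior mass (an Occam penalty on H̄).
bf_narrow = bayes_factor(5.0, 5.0, sigma=2.0, n=100, mu_min=0.0, mu_max=10.0)
bf_wide = bayes_factor(5.0, 5.0, sigma=2.0, n=100, mu_min=-50.0, mu_max=60.0)
assert bf_wide > bf_narrow > 1.0
```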


(Strong) nonlinearities through prior or likelihood

Skewed, heavy-tailed, multimodal distributions

Sampling (simulating) from distributions p

Estimate properties of p

Marginalization: sampling from the joint distribution → sampling from the marginals

Monte Carlo sampling

Metropolis-Hastings sampling, Gibbs sampling

Simple algorithms, but implementation issues might arise

Target distribution π(θ|I)

Markov chain θ(t): conditional on last state

Sample from proposal distribution

Markov chain Monte Carlo (MCMC) (1)

Subchains sample from the corresponding marginals

Monte Carlo averages, e.g.

θ̄j = (1/n) ∑_{i=1}^{n} θj^{(tc+i)},  (∆θj)² = (1/n) ∑_{i=1}^{n} (θj^{(tc+i)} − θ̄j)²

(tc = burn-in length)

Markov chain Monte Carlo (2)

1 At t = 0, start from θ^{(0)} = [θ1^{(0)}, θ2^{(0)}, . . . , θp^{(0)}]ᵗ
2 repeat
3   for j = 1 to p
      y ∼ qj(η | θ1^{(t+1)}, θ2^{(t+1)}, . . . , θ_{j−1}^{(t+1)}, θj^{(t)}, θ_{j+1}^{(t)}, . . . , θp^{(t)}, I)
      θj^{(t+1)} = { y       with probability ρ
                   { θj^{(t)} with probability 1 − ρ
4 end

Metropolis-Hastings algorithm (1)

Acceptance probability:

ρ(y, θ1^{(t+1)}, . . . , θ_{j−1}^{(t+1)}, θj^{(t)}, θ_{j+1}^{(t)}, . . . , θp^{(t)}) ≡

  min{ 1,
       [π(θ1^{(t+1)}, . . . , θ_{j−1}^{(t+1)}, y, θ_{j+1}^{(t)}, . . . , θp^{(t)}, I) × qj(θj^{(t)} | θ1^{(t+1)}, . . . , θ_{j−1}^{(t+1)}, y, θ_{j+1}^{(t)}, . . . , θp^{(t)}, I)]
     / [π(θ1^{(t+1)}, . . . , θ_{j−1}^{(t+1)}, θj^{(t)}, θ_{j+1}^{(t)}, . . . , θp^{(t)}, I) × qj(y | θ1^{(t+1)}, . . . , θ_{j−1}^{(t+1)}, θj^{(t)}, θ_{j+1}^{(t)}, . . . , θp^{(t)}, I)] }

Example (random-walk Gaussian proposal):

qj(η | θj^{(t)}, I) ∝ exp[−(η − θj^{(t)})²/(2σj²)]

Metropolis-Hastings algorithm (2)
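A minimal random-walk Metropolis sampler for a one-dimensional target, a special case of the component-wise scheme above in which the symmetric Gaussian proposal makes the q-ratio cancel (all settings and tolerances are illustrative):

```python
import math
import random

def metropolis_hastings(log_target, x0, n_samples, burn_in=1000, step=1.0, seed=1):
    """Random-walk Metropolis sampler for a 1-D target density (up to a constant)."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for t in range(burn_in + n_samples):
        y = x + rng.gauss(0.0, step)             # symmetric Gaussian proposal
        log_rho = log_target(y) - log_target(x)  # q-ratio cancels for symmetric q
        if math.log(rng.random()) < log_rho:
            x = y                                # accept, else keep current state
        if t >= burn_in:
            samples.append(x)
    return samples

# Target: standard normal, ln pi(x) = -x^2/2 up to a constant
samples = metropolis_hastings(lambda x: -0.5 * x * x, x0=5.0, n_samples=20000)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
assert abs(mean) < 0.1 and abs(var - 1.0) < 0.15
```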

MCMC example


Classification or clustering

M clusters, in total n data points xi in P-dimensional space

Known class labels ωj (j = 1, . . . , M) of the xi

Bayes’ rule for a new point x:

p(ωj|x, I) = p(x|ωj, I) p(ωj|I) / p(x|I)

Maximum a posteriori (MAP) classification rule for x:

Assign x to ωi = arg max_{ωj} p(ωj|x, I) = arg max_{ωj} p(x|ωj, I) p(ωj|I)

Simple Bayesian classification

Examples of prior probabilities:

Indifference: p(ωi|I) = p(ωj|I), ∀i, j

Count class membership: p(ωi|I) ≡ ni/n,  i = 1, . . . , M

Examples of likelihoods:

Naive Bayesian classifier:

p(x|ωi) = ∏_{k=1}^{P} p(xk|ωi),  i = 1, . . . , M

Multivariate Gaussian:

p(x|ωj, I) = 1/[(2π)^{P/2} |Σj|^{1/2}] exp[−(1/2)(x − µj)ᵗ Σj^{−1} (x − µj)]

Examples of priors and likelihoods
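A minimal sketch of the MAP rule above with one-dimensional Gaussian class likelihoods; the class names and parameters are illustrative:

```python
import math

# MAP classifier with 1-D Gaussian class likelihoods and fixed priors
classes = {
    "omega1": {"mu": 0.0, "sigma": 1.0, "prior": 0.5},
    "omega2": {"mu": 4.0, "sigma": 1.0, "prior": 0.5},
}

def log_posterior(x, c):
    """ln p(omega|x) up to a class-independent constant."""
    p = classes[c]
    return (-math.log(p["sigma"]) - (x - p["mu"]) ** 2 / (2 * p["sigma"] ** 2)
            + math.log(p["prior"]))

def classify(x):
    return max(classes, key=lambda c: log_posterior(x, c))

assert classify(0.5) == "omega1"
assert classify(3.7) == "omega2"
# Equal priors and sigmas: the decision boundary sits halfway between the means
assert classify(1.9) == "omega1" and classify(2.1) == "omega2"
```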

Bayesian classifier minimizes probability of misclassification

Optimality of Bayesian classifier

MAP: maximize w.r.t. ωj:

ln p(ωj|x, I) = ln p(x|ωj, I) + ln p(ωj|I)

Define (M = 2):

g(x) ≡ ln p(ω1|x, I) − ln p(ω2|x, I)

For a normal likelihood:

g(x) = (1/2)(xᵗΣ2^{−1}x − xᵗΣ1^{−1}x)                                      [quadratic]
     + µ1ᵗΣ1^{−1}x − µ2ᵗΣ2^{−1}x                                           [linear]
     − (1/2)µ1ᵗΣ1^{−1}µ1 + (1/2)µ2ᵗΣ2^{−1}µ2 + (1/2) ln(|Σ2|/|Σ1|) + ln[p(ω1|I)/p(ω2|I)]   [constant]

g(x) = 0 separates the classes: decision hypersurface

MAP classification: decision surfaces

Discriminant analysis

Quadratic discriminant analysis (QDA)

Linear discriminant analysis (LDA): Σ1 = Σ2

QDA and LDA

A. Shabbir et al., Fusion Eng. Des. 123, 717, 2017

Pinput − 1.41ΓD2 = 7.47

ELM type classification

Linear classifier with normal likelihood (Σ1 = Σ2 ≡ Σ):

g(x) = θᵗ(x − x0) = 0,

θ ≡ Σ^{−1}(µ1 − µ2)

x0 ≡ (1/2)(µ1 + µ2) − ln[p(ω1|I)/p(ω2|I)] (µ1 − µ2)/‖µ1 − µ2‖²_{Σ^{−1}}

‖µ1 − µ2‖_{Σ^{−1}} ≡ √((µ1 − µ2)ᵗΣ^{−1}(µ1 − µ2))   (Mahalanobis distance)

p(ω1|I) = p(ω2|I) ⇒ minimum Mahalanobis distance classifier (to the cluster centroid)

Σ = σ²I ⇒ minimum Euclidean distance classifier

k-nearest neighbor classification (kNN)

Minimum distance classifiers (1)
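A sketch of the minimum Mahalanobis distance classifier for the equal-prior case above, using a shared diagonal covariance so the quadratic form reduces to a per-coordinate sum (all numbers illustrative):

```python
# Minimum Mahalanobis distance classifier in 2-D with a shared,
# diagonal covariance matrix.
sigma_diag = [1.0, 4.0]            # diagonal of the shared Sigma
centroids = {"omega1": [0.0, 0.0], "omega2": [3.0, 0.0]}

def mahalanobis2(x, mu):
    # (x - mu)^t Sigma^{-1} (x - mu) for diagonal Sigma
    return sum((xi - mi) ** 2 / s for xi, mi, s in zip(x, mu, sigma_diag))

def classify(x):
    # Equal priors: assign to the centroid at minimum Mahalanobis distance
    return min(centroids, key=lambda c: mahalanobis2(x, centroids[c]))

assert classify([0.5, 1.0]) == "omega1"
assert classify([2.6, -0.5]) == "omega2"
```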

Minimum distance classifiers (2)


Regression model:

y = β0 + β1x1 + β2x2 + . . . + βpxp + ε

ε ∼ N(0, σ²), σ known

Negligible uncertainty on the xj

Likelihood:

p(y|x, β, σ, I) = [1/(√(2π) σ)] exp[−(y − β0 − ∑_{j=1}^{p} βjxj)²/(2σ²)],

β ≡ [β0, βpᵗ]ᵗ,  βp ≡ [β1, . . . , βp]ᵗ

Multilinear regression model

Take n measurements:

y ≡ [y1, . . . , yn]ᵗ

X ≡ [ 1  x11  . . .  x1p
      ⋮   ⋮    ⋱    ⋮
      1  xn1  . . .  xnp ]

Conditional independence:

p(y|X, β, σ, I) = (2π)^{−n/2} σ^{−n} exp[−(1/(2σ²))(y − Xβ)ᵗ(y − Xβ)]

ML: maximize w.r.t. β:

0 = ∇β (y − Xβ)ᵗ(y − Xβ) = −2Xᵗy + 2XᵗXβ

⇒ βML = (XᵗX)^{−1}Xᵗ y = βLS,  with (XᵗX)^{−1}Xᵗ the Moore–Penrose pseudoinverse

Maximum likelihood solution
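The normal equations XᵗXβ = Xᵗy can be solved by hand for a single regressor with an intercept; a self-contained sketch on exact toy data:

```python
# Least-squares fit of y = b0 + b1*x by solving the 2x2 normal equations
# X^t X b = X^t y in closed form (single regressor, p = 1).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # exactly y = 1 + 2x

n = len(xs)
sx = sum(xs); sy = sum(ys)
sxx = sum(x * x for x in xs); sxy = sum(x * y for x, y in zip(xs, ys))

# Normal equations: [n  sx ] [b0]   [sy ]
#                   [sx sxx] [b1] = [sxy]
det = n * sxx - sx * sx
b0 = (sy * sxx - sx * sxy) / det
b1 = (n * sxy - sx * sy) / det

assert abs(b0 - 1.0) < 1e-12 and abs(b1 - 2.0) < 1e-12
```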

Uniform priors on the βj (not the most uninformative!):

p(β|y, X, σ, I) ∝ exp[−(1/(2σ²))(y − Xβ)ᵗ(y − Xβ)]

Due to linearity and Gaussianity: βMAP = βML = βLS

Taylor expansion (exact!):

(y − Xβ)ᵗ(y − Xβ) = (y − XβMAP)ᵗ(y − XβMAP) + (β − βMAP)ᵗ XᵗX (β − βMAP)

Hence,

p(β|y, X, σ, I) = (2π)^{−(p+1)/2} |Σ|^{−1/2} exp[−(1/2)(β − βMAP)ᵗ Σ^{−1} (β − βMAP)],

Σ ≡ σ²(XᵗX)^{−1}

MAP solution and posterior
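The posterior covariance Σ = σ²(XᵗX)⁻¹ for the same single-regressor setup, with the 2×2 inverse written out explicitly (toy design points and an assumed noise level):

```python
# Posterior covariance Sigma = sigma^2 (X^t X)^{-1} for a model with an
# intercept and one regressor.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
sigma = 0.5  # assumed known noise level

n = len(xs)
sx = sum(xs); sxx = sum(x * x for x in xs)
det = n * sxx - sx * sx
# (X^t X)^{-1} = (1/det) [[sxx, -sx], [-sx, n]]
Sigma = [[sigma ** 2 * sxx / det, -sigma ** 2 * sx / det],
         [-sigma ** 2 * sx / det, sigma ** 2 * n / det]]

# Var(beta1) = sigma^2 * n / det = 0.25 * 5 / 50 = 0.025
assert abs(Sigma[1][1] - 0.025) < 1e-12
# Intercept and slope estimates are negatively correlated for positive x
assert Sigma[0][1] < 0
```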

New predictions by the model?

Posterior predictive distribution:

p(ynew|xnew, y, X, σ, I) = ∫_{ℝ^{p+1}} p(ynew, β|xnew, y, X, σ, I) dβ

                         = ∫_{ℝ^{p+1}} p(ynew|xnew, β, I) p(β|y, X, σ, I) dβ

But

p(ynew|xnew, β, I) = δ(ynew − βᵗxnew)

Fix β0 = ynew − β1xnew,1 − . . . − βpxnew,p

Posterior predictive distribution (1)

Marginalize over βp with flat priors:

p(ynew|xnew, y, X, σ, I)
  ∝ σ^{−n} ∫_{ℝ^p} exp{−(1/(2σ²)) ∑_{i=1}^{n} [yi − ynew + ∑_{j=1}^{p} βj(xnew,j − xij)]²} dβp

After (quite some) algebra, one finds simply

ynew,MAP = ∑_{j=1}^{p} xnew,j βMAP,j + βMAP,0

Simpler derivation based on properties of E and Var

General posterior more complicated!

Posterior predictive distribution (2)


Frequentist vs. Bayesian methods: interpretation of probability

Bayesian probability: extension of logic to situations with uncertainty

Posterior probability of parameters or hypotheses

Numerical approach in general

Underlies or explains many machine learning techniques

Conclusions

D.S. Sivia and J. Skilling, Data Analysis: A Bayesian Tutorial, 2nd edition, Oxford University Press, 2006

W. von der Linden, V. Dose and U. von Toussaint, Bayesian Probability Theory: Applications in the Physical Sciences, Cambridge University Press, 2014

P. Gregory, Bayesian Logical Data Analysis for the Physical Sciences, Cambridge University Press, 2005

S. Theodoridis, Machine Learning: A Bayesian and Optimization Perspective, Academic Press (Elsevier), 2015

E.T. Jaynes (G.L. Bretthorst, ed.), Probability Theory: The Logic of Science, Cambridge University Press, 2003

S.B. McGrayne, The Theory That Would Not Die: How Bayes’ Rule Cracked the Enigma Code, Hunted Down Russian Submarines, and Emerged Triumphant from Two Centuries of Controversy, Yale University Press, 2011

References