TRANSCRIPT
A brief overview of probability theory in data science
Geert Verdoolaege
1 Department of Applied Physics, Ghent University, Ghent, Belgium
2 Laboratory for Plasma Physics, Royal Military Academy (LPP–ERM/KMS), Brussels, Belgium
Tutorial, 3rd IAEA Technical Meeting on Fusion Data Processing, Validation and Analysis, 27-05-2019
Overview
1 Origins of probability
2 Frequentist methods and statistics
3 Principles of Bayesian probability theory
4 Monte Carlo computational methods
5 Applications: Classification, Regression analysis
6 Conclusions and references
Earliest traces in Western civilization: Jewish writings, Aristotle
Notion of probability in law, based on evidence
Usage in finance
Usage and demonstration in gambling
Early history of probability
World is knowable but uncertainty due to human ignorance
William of Ockham: Ockham’s razor
Probabilis: a supposedly ‘provable’ opinion
Counting of authorities
Later: degree of truth, a scale
Quantification:
Law, faith→ Bayesian notion
Gaming→ frequentist notion
Middle Ages
17th century: Pascal, Fermat, Huygens
Comparative testing of hypotheses
Population statistics
1713: Ars Conjectandi by Jacob Bernoulli:
Weak law of large numbers
Principle of indifference
De Moivre (1718): The Doctrine of Chances
Quantification
Paper by Thomas Bayes (1763): inversion of binomial distribution
Pierre Simon Laplace: practical applications in the physical sciences; 1812: Théorie Analytique des Probabilités
Bayes and Laplace
George Boole (1854), John Venn (1866)
Sampling from a ‘population’
Notion of ‘ensembles’
Andrey Kolmogorov (1930s): axioms of probability
Early 20th century classical statistics: Ronald Fisher, Karl Pearson, Jerzy Neyman, Egon Pearson
Applications in social sciences, medicine, natural sciences
Frequentism and sampling theory
Statistical mechanics with Ludwig Boltzmann (1868): maximum entropy energy distribution
Josiah Willard Gibbs (1874–1878)
Claude Shannon (1948): maximum entropy in frequentist language
Edwin Thompson Jaynes: Bayesian re-interpretation
Harold Jeffreys (1939): standard work on (Bayesian) probability
Richard T. Cox (1946): derivation of probability laws
Computers: flowering of Bayesian methods
Maximum entropy and Bayesian methods
Probability = frequency
Straightforward:
Number of 1s in 60 dice throws ≈ 10: p = 1/6
Probability of plasma disruption p ≈ Ndisr/Ntot
Less straightforward:
Probability of fusion electricity by 2050?
Probability of mass of Saturn 90 mA ≤ mS < 100 mA?
Frequentist probability
Populations vs. sample
E.g. weight w of Belgian men: unknown but fixed for every individual
Average weight µw in population?
Random variable W
Sample: W1, W2, . . . , Wn
Average weight: statistic (estimator) W̄
Central limit theorem:
W ∼ p(W|µw, σw) ⇒ W̄ ∼ N(µw, σw/√n)
Statistics
Maximum likelihood (ML) principle:
µ̂w = arg max_{µw} p(W1, . . . , Wn|µw, σw)
   ≈ arg max_{µw} ∏_{i=1}^n 1/(√(2π) σw) exp[−(Wi − µw)²/(2σw²)]   (assuming independent Wi)
   = arg max_{µw} [1/(√(2π) σw)]^n exp[−∑_{i=1}^n (Wi − µw)²/(2σw²)]
ML estimator (known σw):
µ̂w = W̄ = (1/n) ∑_{i=1}^n Wi
Maximum likelihood parameter estimation
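As a quick numerical check of the ML result, a minimal sketch (the sample is simulated; the population values µw = 80 and σw = 10 are illustrative assumptions, not survey data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: weights ~ N(mu_w = 80, sigma_w = 10); values are illustrative
mu_w, sigma_w = 80.0, 10.0
W = rng.normal(mu_w, sigma_w, size=1000)

# With sigma_w known, the ML estimator of mu_w is the sample mean W_bar
mu_hat = W.mean()

# Sampling distribution of W_bar: standard deviation sigma_w / sqrt(n)
se = sigma_w / np.sqrt(len(W))
```

mu_hat should fall within a few multiples of se of the true mean, as the central limit theorem on the previous slide predicts.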
Weight of Dutch men compared to Belgian men (populations)
Observed sample averages W̄NL, W̄BE
Null hypothesis H0: µw,NL = µw,BE
Test statistic:
(W̄NL − W̄BE)/σ_{W̄NL−W̄BE} ∼ N(0, 1) (under H0)
Frequentist hypothesis testing
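The two-sample z-test above can be sketched as follows (sample sizes, means, and the known σ are made-up numbers for illustration):

```python
import numpy as np
from math import erfc, sqrt

rng = np.random.default_rng(1)

# Illustrative samples; sigma is treated as known, as on the slide
sigma = 10.0
W_NL = rng.normal(84.0, sigma, size=500)
W_BE = rng.normal(82.0, sigma, size=500)

# Test statistic: difference of sample means over its standard error under H0
se_diff = sigma * np.sqrt(1.0 / len(W_NL) + 1.0 / len(W_BE))
z = (W_NL.mean() - W_BE.mean()) / se_diff

# Two-sided p-value from the standard normal tail
p_value = erfc(abs(z) / sqrt(2.0))
```

A small p_value would lead a frequentist to reject H0 at the chosen significance level.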
Statistical (aleatoric) vs. epistemic uncertainty?
Every piece of information has uncertainty
Uncertainty = lack of information
Observation may reduce uncertainty
Probability (distribution) quantifies uncertainty
Probability theory: quantifying uncertainty
Measurement of physical quantity
Origin of stochasticity:
Apparatus
Microscopic fluctuations
Systematic uncertainty is assigned a probability distribution
E.g. coin tossing, voltage measurement, probability of one hypothesis vs. another, . . .
Bayesian: no random variables
Example: physical sciences
Objective Bayesian view
Probability = real number ∈ [0, 1]
Always conditioned on known information
Notation: p(A|B) or p(A|I)
Extension of logic: measure of degree to which B implies A
Degree of plausibility, but subject to consistency
Same information ⇒ same probabilities
Probability distribution: outcome → probability
What is probability?
p(x, y), p(x), p(y), p(x|y), p(y|x)
Joint, marginal and conditional distributions
Normal/Gaussian probability density function (PDF):
p(x|µ, σ, I) = 1/(√(2π) σ) exp[−(x − µ)²/(2σ²)]
Probability that x1 ≤ x < x1 + dx: p(x1|µ, σ, I) dx
Inverse problem: µ, σ given x?
Example: normal distribution
Bayes’ theorem:
p(θ|x, I) = p(x|θ, I) p(θ|I) / p(x|I)
x = data vector, θ = vector of model parameters, I = implicit knowledge
Likelihood: misfit between model and data
Prior distribution: ‘expert’ or diffuse knowledge
Evidence:
p(x|I) = ∫ p(x, θ|I) dθ = ∫ p(x|θ, I) p(θ|I) dθ
Posterior distribution
Updating information states
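Bayes' theorem can be applied numerically by discretizing θ on a grid; this sketch (simulated data, flat prior, known σ = 1, all assumed for illustration) computes posterior ∝ likelihood × prior and normalizes by the evidence:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(1.5, 1.0, size=20)       # data; sigma = 1 treated as known

theta = np.linspace(-3.0, 5.0, 801)     # grid over the parameter
dtheta = theta[1] - theta[0]
prior = np.ones_like(theta)             # flat prior on the grid

# Gaussian log-likelihood summed over the data, evaluated on the grid
log_like = -0.5 * ((x[:, None] - theta[None, :]) ** 2).sum(axis=0)

# Posterior ~ likelihood * prior; subtract the max before exp for stability
post = prior * np.exp(log_like - log_like.max())
post /= post.sum() * dtheta             # normalize to a density (evidence)

theta_mean = (theta * post).sum() * dtheta
```

With a flat prior, the posterior mean reproduces the sample mean, matching the analytic result derived later for the mean of a normal distribution.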
Updating state of knowledge = learning
‘Uninformative’ prior (distribution)
Assigning uninformative priors:
Transformation invariance (Jaynes ’68): e.g. uniform for location variable
Principle of indifference
Testable information: maximum entropy
. . .
Show me the prior!
Practical considerations
n measurements xi → x
Independent and identically distributed xi:
p(x|µ, σ, I) = ∏_{i=1}^n 1/(√(2π) σ) exp[−(xi − µ)²/(2σ²)]
Bayes’ rule:
p(µ, σ|x, I) ∝ p(x|µ, σ, I) p(µ, σ|I)
Suppose σ ≡ σe → delta function
Assume µ ∈ [µmin, µmax] → uniform prior:
p(µ|I) = 1/(µmax − µmin) if µ ∈ [µmin, µmax], 0 otherwise
Let µmin → −∞, µmax → +∞ ⇒ improper prior
Ensure proper posterior
Mean of a normal distribution: uniform prior
Posterior:
p(µ|x, I) ∝ exp[−∑_{i=1}^n (xi − µ)²/(2σe²)]
Define
x̄ ≡ (1/n) ∑_{i=1}^n xi, (∆x)² ≡ (1/n) ∑_{i=1}^n (xi − x̄)²
Completing the square,
p(µ|x, I) ∝ exp{−[(µ − x̄)² + (∆x)²]/(2σe²/n)}
Retaining the dependence on µ,
p(µ|x, I) ∝ exp[−(µ − x̄)²/(2σe²/n)]
µ ∼ N(x̄, σe²/n)
Posterior for µ
Normal prior: µ ∼ N(µ0, τ²)
Posterior:
p(µ|x, I) ∝ exp[−∑_{i=1}^n (xi − µ)²/(2σe²)] × exp[−(µ − µ0)²/(2τ²)]
Expanding and completing the square,
µ ∼ N(µn, σn²),
where
µn ≡ σn² (n x̄/σe² + µ0/τ²) and σn² ≡ (n/σe² + 1/τ²)⁻¹
µn is a weighted average of µ0 and x̄
Mean of a normal distribution: normal prior
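The normal-prior update rule above is easy to verify numerically; a minimal sketch (simulated data, illustrative prior parameters):

```python
import numpy as np

def normal_posterior(x, sigma_e, mu0, tau):
    """Conjugate update: posterior N(mu_n, sigma_n^2) for the mean of a
    normal likelihood with known sigma_e and prior mu ~ N(mu0, tau^2)."""
    n = len(x)
    sigma_n2 = 1.0 / (n / sigma_e**2 + 1.0 / tau**2)
    mu_n = sigma_n2 * (n * np.mean(x) / sigma_e**2 + mu0 / tau**2)
    return mu_n, np.sqrt(sigma_n2)

rng = np.random.default_rng(3)
x = rng.normal(2.0, 1.0, size=50)
mu_n, sigma_n = normal_posterior(x, sigma_e=1.0, mu0=0.0, tau=10.0)
```

With a wide prior (τ ≫ σe/√n), µn is pulled almost entirely toward the sample mean; with a narrow prior it stays near µ0, illustrating the weighted-average behavior.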
Intrinsic part of Bayesian theory, enabling coherent learning
High-quality data may be scarce
Regularization of ill-posed problems (e.g. tomography)
Prior can retain influence: e.g. regression with errors in all variables
Care for the prior
Repeated measurements → information on σ
Scale variable σ → Jeffreys’ scale prior:
p(σ|I) ∝ 1/σ, σ ∈ ]0, +∞[
Posterior:
p(µ, σ|x, I) ∝ (1/σ^n) exp[−((µ − x̄)² + (∆x)²)/(2σ²/n)] × 1/σ
Unknown mean and standard deviation
Marginalization = integrating out a (nuisance) parameter:
p(µ|x, I) = ∫_0^{+∞} p(µ, σ|x, I) dσ
          ∝ ∫_0^{+∞} (1/2) [((µ − x̄)² + (∆x)²)/(2/n)]^{−n/2} s^{n/2−1} e^{−s} ds
          = (1/2) Γ(n/2) [((µ − x̄)² + (∆x)²)/(2/n)]^{−n/2},
where
s ≡ ((µ − x̄)² + (∆x)²)/(2σ²/n)
Marginal posterior for µ (1)
After normalization:
p(µ|x, I) = Γ(n/2) / [√(π(∆x)²) Γ((n−1)/2)] × [1 + (µ − x̄)²/(∆x)²]^{−n/2}
Changing variables,
t ≡ (µ − x̄)/√((∆x)²/(n − 1)), with p(t|x, I) dt ≡ p(µ|x, I) dµ,
p(t|x, I) = Γ(n/2) / [√((n − 1)π) Γ((n−1)/2)] × [1 + t²/(n − 1)]^{−n/2}
Student’s t-distribution with parameter ν = n − 1
If n ≫ 1,
p(µ|x, I) → 1/√(2π(∆x)²/n) exp[−(µ − x̄)²/(2(∆x)²/n)]
Marginal posterior for µ (2)
Marginalization of µ:
p(σ|x, I) = ∫_{−∞}^{+∞} p(µ, σ|x, I) dµ ∝ (1/σ^n) exp[−(∆x)²/(2σ²/n)]
Setting X ≡ n(∆x)²/σ²,
p(X|x, I) = 1/(2^{k/2} Γ(k/2)) X^{k/2−1} e^{−X/2}, k ≡ n − 1
χ² distribution with parameter k
Marginal posterior for σ
Laplace (saddle point) approximation of distributions around the mode (= maximum)
E.g. marginal for µ:
p(µ|x, I) = Γ(n/2) / [√(π(∆x)²) Γ((n−1)/2)] × [1 + (µ − x̄)²/(∆x)²]^{−n/2}
Taylor expansion around the mode:
ln[p(µ|x, I)] ≈ ln[p(x̄|x, I)] + (1/2) d²(ln p)/dµ²|_{µ=x̄} (µ − x̄)²
              = ln[Γ(n/2)] − ln[Γ((n−1)/2)] − (1/2) ln[π(∆x)²] − n/(2(∆x)²) (µ − x̄)²
The Laplace approximation (1)
On the original scale:
p(µ|x, I) ≈ Γ(n/2)/Γ((n−1)/2) × 1/√(π(∆x)²) × exp[−(µ − x̄)²/(2(∆x)²/n)]
Standard deviation σL from the curvature of ln p:
σL = [−d²(ln p)/dµ²|_{µ=x̄}]^{−1/2}
The Laplace approximation (2)
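The curvature recipe above can be checked numerically with finite differences; in this sketch (simulated data) the Laplace width of the Student-t marginal reproduces the analytic σL = √((∆x)²/n):

```python
import numpy as np

def laplace_sd(logp, mode, h=1e-3):
    """Laplace width: sigma_L = [-d^2 ln p / d mu^2 at the mode]^(-1/2),
    with the second derivative taken by central finite differences."""
    d2 = (logp(mode + h) - 2.0 * logp(mode) + logp(mode - h)) / h**2
    return (-d2) ** -0.5

rng = np.random.default_rng(4)
x = rng.normal(0.0, 2.0, size=200)
xbar, dx2, n = x.mean(), x.var(), len(x)

# Unnormalized log of the Student-t marginal posterior for mu
log_post = lambda mu: -(n / 2.0) * np.log(1.0 + (mu - xbar) ** 2 / dx2)

sigma_L = laplace_sd(log_post, mode=xbar)   # should be close to sqrt(dx2 / n)
```

The same finite-difference trick generalizes to the multivariate case on the next slide, with the Hessian replacing the second derivative.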
Laplace approximation: example 1
Laplace approximation: example 2
For θ = [θ1, . . . , θp]^t,
p(θ|θ0, I) ∝ exp[(1/2)(θ − θ0)^t [∇∇(ln p)]_{θ=θ0} (θ − θ0)]
∇∇(ln p): Hessian matrix, where
ΣL = −{[∇∇(ln p)]_{θ=θ0}}⁻¹
Multivariate Laplace approximation
Let {Hi} be a complete set of hypotheses
Data D to support or reject hypotheses
Bayes’ rule:
p(Hi|D, I) = p(D|Hi, I) p(Hi|I) / p(D|I), with p(D|I) = ∑_i p(D|Hi, I) p(Hi|I)
Assume a single hypothesis H and its complement H̄
Odds ratio o:
o ≡ p(H|D, I)/p(H̄|D, I) = [p(D|H, I)/p(D|H̄, I)] × [p(H|I)/p(H̄|I)]
    (Bayes factor × prior odds)
Model comparison (hypothesis testing)
E.g. n measurements xi of a quantity x
Assume a normal distribution with known variance σ²
Question: are the data compatible with mean µ = µ0?
Yes: H; No: H̄
Under H:
p(x̄|H, I) = C exp[−(x̄ − µ0)²/(2σ²/n)]
Under H̄:
p(x̄|H̄, I) = ∫ p(x̄|µ, σ, I) p(µ|H̄, I) dµ (1)
Testing a Gaussian mean (1)
Assume bounds µmin and µmax:
p(µ|H̄, I) = 1/|µmax − µmin| × 1(µmin ≤ µ ≤ µmax)
Then (1) becomes
p(x̄|H̄, I) = C/|µmax − µmin| ∫_{µmin}^{µmax} exp[−(x̄ − µ)²/(2σ²/n)] dµ
With SE ≡ σ/√n the standard error, assume
|x̄ − µmin|, |x̄ − µmax| ≫ SE
Testing a Gaussian mean (2)
Then
p(x̄|H̄, I) ≈ C/|µmax − µmin| ∫_{−∞}^{+∞} exp[−(x̄ − µ)²/(2σ²/n)] dµ
           = C SE √(2π)/|µmax − µmin|
Bayes factor BF:
BF = p(x̄|H, I)/p(x̄|H̄, I) = |µmax − µmin|/(SE √(2π)) × e^{−z²/2}, z ≡ |x̄ − µ0|/SE
Cf. frequentist hypothesis test
Testing a Gaussian mean (3)
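A small numerical sketch of this Bayes factor (simulated data; the prior range [µmin, µmax] is an illustrative assumption satisfying the wide-prior condition):

```python
import numpy as np

def bayes_factor(x, mu0, sigma, mu_min, mu_max):
    """BF = p(xbar|H) / p(xbar|Hbar) with H: mu = mu0 and, under Hbar,
    a uniform prior on [mu_min, mu_max] wide compared to SE."""
    n = len(x)
    se = sigma / np.sqrt(n)
    z = abs(np.mean(x) - mu0) / se
    return (mu_max - mu_min) / (se * np.sqrt(2.0 * np.pi)) * np.exp(-0.5 * z**2)

rng = np.random.default_rng(5)
x = rng.normal(0.0, 1.0, size=100)   # data generated with true mean 0

bf_h0_true = bayes_factor(x, mu0=0.0, sigma=1.0, mu_min=-10.0, mu_max=10.0)
bf_h0_false = bayes_factor(x, mu0=1.0, sigma=1.0, mu_min=-10.0, mu_max=10.0)
```

The evidence favors the hypothesis that generated the data: bf_h0_true should greatly exceed bf_h0_false.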
(Strong) nonlinearities through prior or likelihood
Skewed, heavy-tailed, multimodal distributions
Sampling (simulating) from distributions p
Estimate properties of p
Marginalization: sampling from the joint distribution → sampling from the marginals
Monte Carlo sampling
Metropolis-Hastings sampling, Gibbs sampling
Simple algorithms, but implementation issues might arise
Target distribution π(θ|I)
Markov chain θ(t): conditional on last state
Sample from proposal distribution
Markov chain Monte Carlo (MCMC) (1)
Subchains sample from the corresponding marginals
Monte Carlo averages, e.g.
θ̄j = (1/n) ∑_{i=1}^n θj^(tc+i), (∆θj)² = (1/n) ∑_{i=1}^n (θj^(tc+i) − θ̄j)²
Markov chain Monte Carlo (2)
1 At t = 0, start from θ^(t) = [θ1^(t), θ2^(t), . . . , θp^(t)]^t
2 repeat
3   for j = 1 to p:
      y ∼ qj(η | θ1^(t+1), . . . , θj−1^(t+1), θj^(t), θj+1^(t), . . . , θp^(t), I)
      θj^(t+1) = y with probability ρ, θj^(t) with probability 1 − ρ
4 end
Metropolis-Hastings algorithm (1)
Acceptance probability:
ρ(y, θ1^(t+1), . . . , θj−1^(t+1), θj^(t), θj+1^(t), . . . , θp^(t))
≡ min{1, [π(θ1^(t+1), . . . , θj−1^(t+1), y, θj+1^(t), . . . , θp^(t), I) /
          π(θ1^(t+1), . . . , θj−1^(t+1), θj^(t), θj+1^(t), . . . , θp^(t), I)]
       × [qj(θj^(t) | θ1^(t+1), . . . , θj−1^(t+1), y, θj+1^(t), . . . , θp^(t), I) /
          qj(y | θ1^(t+1), . . . , θj−1^(t+1), θj^(t), θj+1^(t), . . . , θp^(t), I)]}
Example (random-walk proposal):
qj(η | θj^(t), I) ∝ exp[−(η − θj^(t))²/(2σj²)]
Metropolis-Hastings algorithm (2)
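A minimal random-walk Metropolis sampler illustrating the scheme (single parameter; with a symmetric Gaussian proposal the q-ratio in ρ cancels, leaving the plain Metropolis rule):

```python
import numpy as np

def metropolis(log_target, theta0, n_steps=20000, step=2.0, seed=0):
    """Random-walk Metropolis: propose y ~ N(theta, step^2), accept with
    probability min(1, pi(y)/pi(theta)); the symmetric proposal cancels."""
    rng = np.random.default_rng(seed)
    theta = theta0
    samples = np.empty(n_steps)
    for t in range(n_steps):
        y = theta + step * rng.standard_normal()
        if np.log(rng.random()) < log_target(y) - log_target(theta):
            theta = y                     # accept the proposal
        samples[t] = theta                # otherwise keep the old state
    return samples

# Target: standard normal, specified only up to a constant
samples = metropolis(lambda th: -0.5 * th**2, theta0=3.0)
burned = samples[2000:]                   # discard burn-in
```

Monte Carlo averages of the chain approximate the target's moments (mean ≈ 0, variance ≈ 1), exactly as in the Monte Carlo averages on the earlier slide.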
MCMC example
Classification or clustering
M clusters of in total n data points xi in P-dimensional space
Known class labels ωj (j = 1, . . . , M) of the xi
Bayes’ rule for a new point x:
p(ωj|x, I) = p(x|ωj, I) p(ωj|I) / p(x|I)
Maximum a posteriori (MAP) classification rule for x:
Assign x to ωi = arg max_{ωj} p(ωj|x, I) = arg max_{ωj} p(x|ωj, I) p(ωj|I)
Simple Bayesian classification
Examples of prior probabilities (indifference):
p(ωi|I) = p(ωj|I), ∀i, j
Count class membership:
p(ωi|I) ≡ ni/n, i = 1, . . . , M
Examples of likelihoods:
Naive Bayesian classifier:
p(x|ωi) = ∏_{k=1}^P p(xk|ωi), i = 1, . . . , M
Multivariate Gaussian:
p(x|ωj, I) = 1/[(2π)^{P/2} |Σj|^{1/2}] exp[−(1/2)(x − µj)^t Σj⁻¹ (x − µj)]
Examples of priors and likelihoods
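A direct implementation of the MAP rule with Gaussian class likelihoods (the two classes and their parameters are invented for illustration):

```python
import numpy as np

def map_classify(x, means, covs, priors):
    """Assign x to the class maximizing ln p(x|w_j, I) + ln p(w_j|I)
    for multivariate Gaussian class likelihoods."""
    scores = []
    for mu, S, pw in zip(means, covs, priors):
        d = x - mu
        log_like = -0.5 * (d @ np.linalg.inv(S) @ d
                           + np.log(np.linalg.det(S))
                           + len(x) * np.log(2.0 * np.pi))
        scores.append(log_like + np.log(pw))
    return int(np.argmax(scores))

# Two illustrative classes with equal priors (principle of indifference)
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.eye(2), np.eye(2)]
priors = [0.5, 0.5]

label = map_classify(np.array([2.5, 2.8]), means, covs, priors)  # assigned to class 1
```

With equal priors and equal (identity) covariances this reduces to the minimum Euclidean distance classifier of the later slides.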
Bayesian classifier minimizes probability of misclassification
Optimality of Bayesian classifier
MAP: maximize w.r.t. ωj:
ln p(ωj|x, I) = ln p(x|ωj, I) + ln p(ωj|I)
Define (M = 2):
g(x) ≡ ln p(ω1|x, I) − ln p(ω2|x, I)
For normal likelihoods:
g(x) = (1/2)(x^t Σ2⁻¹ x − x^t Σ1⁻¹ x)   [quadratic]
     + µ1^t Σ1⁻¹ x − µ2^t Σ2⁻¹ x         [linear]
     − (1/2) µ1^t Σ1⁻¹ µ1 + (1/2) µ2^t Σ2⁻¹ µ2 + (1/2) ln(|Σ2|/|Σ1|) + ln[p(ω1|I)/p(ω2|I)]   [constant]
g(x) = 0 separates the classes: decision hypersurface
MAP classification: decision surfaces
Discriminant analysis
Quadratic discriminant analysis (QDA)
Linear discriminant analysis (LDA): Σ1 = Σ2
QDA and LDA
A. Shabbir et al., Fusion Eng. Des. 123, 717, 2017
Pinput − 1.41ΓD2 = 7.47
ELM type classification
Linear classifier with normal likelihood (Σ1 = Σ2 ≡ Σ):
g(x) = θ^t (x − x0) = 0,
θ ≡ Σ⁻¹(µ1 − µ2)
x0 ≡ (1/2)(µ1 + µ2) − ln[p(ω1|I)/p(ω2|I)] (µ1 − µ2)/‖µ1 − µ2‖²_{Σ⁻¹}
‖µ1 − µ2‖_{Σ⁻¹} ≡ √[(µ1 − µ2)^t Σ⁻¹ (µ1 − µ2)]   (Mahalanobis distance)
p(ω1|I) = p(ω2|I) ⇒ minimum Mahalanobis distance classifier (to cluster centroid)
Σ = σ²I ⇒ minimum Euclidean distance classifier
k-nearest neighbor classification (kNN)
Minimum distance classifiers (1)
Minimum distance classifiers (2)
Regression model:
y = β0 + β1 x1 + β2 x2 + . . . + βp xp + ε
ε ∼ N(0, σ²), σ known
Negligible uncertainty on the xj
Likelihood:
p(y|x, β, σ, I) = 1/(√(2π) σ) exp[−(y − β0 − ∑_{j=1}^p βj xj)²/(2σ²)],
β ≡ [β0, βp^t]^t, βp ≡ [β1, . . . , βp]^t
Multilinear regression model
Take n measurements:
y ≡ [y1, . . . , yn]^t
X ≡ the n × (p+1) design matrix with rows [1, xi1, . . . , xip]
Conditional independence:
p(y|X, β, σ, I) = (2π)^{−n/2} σ^{−n} exp[−(y − Xβ)^t (y − Xβ)/(2σ²)]
ML: maximize w.r.t. β:
0 = ∇β (y − Xβ)^t (y − Xβ) = −2X^t y + 2X^t X β
⇒ βML = (X^t X)⁻¹ X^t y = βLS, with (X^t X)⁻¹ X^t the Moore-Penrose pseudoinverse
Maximum likelihood solution
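The ML/least-squares solution via the pseudoinverse, checked on simulated data (the true coefficients and noise level are chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(6)

# Design matrix with an intercept column; true betas are illustrative
n, p = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(0.0, 0.1, size=n)

# beta_ML = (X^t X)^{-1} X^t y, i.e. the Moore-Penrose pseudoinverse of X
beta_ml = np.linalg.pinv(X) @ y

# Numerically preferable equivalent in practice
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Forming (X^t X)⁻¹ explicitly squares the condition number of X, which is why `lstsq` (QR/SVD-based) is the usual choice in code even though the algebra is the same.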
Uniform priors on the βj (not the most uninformative!):
p(β|y, X, σ, I) ∝ exp[−(y − Xβ)^t (y − Xβ)/(2σ²)]
Due to linearity and Gaussianity: βMAP = βML = βLS
Taylor expansion (exact!):
(y − Xβ)^t (y − Xβ) = (y − XβMAP)^t (y − XβMAP) + (1/2)(β − βMAP)^t 2X^t X (β − βMAP)
Hence,
p(β|y, X, σ, I) = (2π)^{−(p+1)/2} |Σ|^{−1/2} exp[−(1/2)(β − βMAP)^t Σ⁻¹ (β − βMAP)],
Σ ≡ σ²(X^t X)⁻¹
MAP solution and posterior
New predictions by the model?
Posterior predictive distribution:
p(ynew|xnew, y, X, σ, I) = ∫_{R^{p+1}} p(ynew, β|xnew, y, X, σ, I) dβ
                        = ∫_{R^{p+1}} p(ynew|xnew, β, I) p(β|y, X, σ, I) dβ
But
p(ynew|xnew, β, I) = δ(ynew − β^t xnew)
Fix β0 = ynew − β1 xnew,1 − . . . − βp xnew,p
Posterior predictive distribution (1)
Marginalize over βp with flat priors:
p(ynew|xnew, y, X, σ, I)
∝ σ^{−n} ∫_{R^p} exp{−(1/(2σ²)) ∑_{i=1}^n [yi − ynew + ∑_{j=1}^p βj(xnew,j − xij)]²} dβp
After (quite some) algebra, one finds simply
ynew,MAP = ∑_{j=1}^p xnew,j βMAP,j + βMAP,0
Simpler derivation based on properties of E and Var
General posterior more complicated!
Posterior predictive distribution (2)
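The result that the MAP prediction is just the fitted surface evaluated at the new point can be sketched directly (simulated data; flat priors so βMAP = βLS):

```python
import numpy as np

rng = np.random.default_rng(7)

# Training data from a one-regressor model (coefficients are illustrative)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([0.5, 1.5]) + rng.normal(0.0, 0.2, size=n)

beta_map = np.linalg.pinv(X) @ y        # flat priors: beta_MAP = beta_LS

# MAP prediction at a new point: y_new = beta_0 + sum_j beta_j * x_new,j
x_new = np.array([1.0, 2.0])            # [1, x_new,1]
y_new_map = float(x_new @ beta_map)
```

This recovers the "simpler derivation" mentioned above: the predictive mode is the regression surface at xnew, while the full predictive distribution would also carry the parameter uncertainty.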
Frequentist vs. Bayesian methods: interpretation of probability
Bayesian probability: extension of logic to situations with uncertainty
Posterior probability of parameters or hypotheses
Numerical approach in general
Underlies or explains many machine learning techniques
Conclusions
D.S. Sivia and J. Skilling, Data Analysis: A Bayesian Tutorial, 2nd edition, Oxford University Press, 2006
W. von der Linden, V. Dose and U. von Toussaint, Bayesian Probability Theory: Applications in the Physical Sciences, Cambridge University Press, 2014
P. Gregory, Bayesian Logical Data Analysis for the Physical Sciences, Cambridge University Press, 2005
S. Theodoridis, Machine Learning: A Bayesian and Optimization Perspective, Academic Press (Elsevier), 2015
E.T. Jaynes (G.L. Bretthorst, ed.), Probability Theory: The Logic of Science, Cambridge University Press, 2003
S.B. McGrayne, The Theory That Would Not Die: How Bayes' Rule Cracked the Enigma Code, Hunted Down Russian Submarines, and Emerged Triumphant from Two Centuries of Controversy, Yale University Press, 2011
References