TRANSCRIPT
A brief overview of probability theory in data science
Geert Verdoolaege
1 Department of Applied Physics, Ghent University, Ghent, Belgium
2 Laboratory for Plasma Physics, Royal Military Academy (LPP–ERM/KMS), Brussels, Belgium
Tutorial, 3rd IAEA Technical Meeting on Fusion Data Processing, Validation and Analysis, 27-05-2019
Overview
1 Origins of probability
2 Frequentist methods and statistics
3 Principles of Bayesian probability theory
4 Monte Carlo computational methods
5 Applications: Classification, Regression analysis
6 Conclusions and references
Earliest traces in Western civilization: Jewish writings, Aristotle
Notion of probability in law, based on evidence
Usage in finance
Usage and demonstration in gambling
Early history of probability
World is knowable but uncertainty due to human ignorance
William of Ockham: Ockham’s razor
Probabilis: a supposedly ‘provable’ opinion
Counting of authorities
Later: degree of truth, a scale
Quantification:
Law, faith→ Bayesian notion
Gaming→ frequentist notion
Middle Ages
17th century: Pascal, Fermat, Huygens
Comparative testing of hypotheses
Population statistics
1713: Ars Conjectandi by Jacob Bernoulli:
Weak law of large numbers
Principle of indifference
De Moivre (1718): The Doctrine of Chances
Quantification
Paper by Thomas Bayes (1763): inversion of binomial distribution
Pierre Simon Laplace: practical applications in the physical sciences; 1812: Théorie Analytique des Probabilités
Bayes and Laplace
George Boole (1854), John Venn (1866)
Sampling from a ‘population’
Notion of ‘ensembles’
Andrey Kolmogorov (1930s): axioms of probability
Early 20th century classical statistics: Ronald Fisher, Karl Pearson, Jerzy Neyman, Egon Pearson
Applications in social sciences, medicine, natural sciences
Frequentism and sampling theory
Statistical mechanics with Ludwig Boltzmann (1868): maximum entropy energy distribution
Josiah Willard Gibbs (1874–1878)
Claude Shannon (1948): maximum entropy in frequentist language
Edwin Thompson Jaynes: Bayesian re-interpretation
Harold Jeffreys (1939): standard work on (Bayesian) probability
Richard T. Cox (1946): derivation of probability laws
Computers: flowering of Bayesian methods
Maximum entropy and Bayesian methods
Probability = frequency
Straightforward:
Number of 1s in 60 dice throws ≈ 10: p = 1/6
Probability of plasma disruption p ≈ Ndisr/Ntot
Less straightforward:
Probability of fusion electricity by 2050?
Probability of mass of Saturn 90 mA ≤ mS < 100 mA?
Frequentist probability
Populations vs. sample
E.g. weight w of Belgian men: unknown but fixed for every individual
Average weight µw in population?
Random variable W
Sample: W1, W2, . . . , Wn
Average weight: statistic (estimator) W̄
Central limit theorem:
W ∼ p(W|µw, σw) ⇒ W̄ ∼ N(µw, σw/√n)
Statistics
Maximum likelihood (ML) principle:
µ̂w = arg max_{µw} p(W1, . . . , Wn|µw, σw)
   ≈ arg max_{µw} ∏_{i=1}^n 1/(√(2π) σw) exp[−(Wi − µw)²/(2σw²)]   (assuming independent Wi)
   = arg max_{µw} [1/(√(2π) σw)]^n exp[−∑_{i=1}^n (Wi − µw)²/(2σw²)]
ML estimator (known σw):
µ̂w = W̄ = (1/n) ∑_{i=1}^n Wi
Maximum likelihood parameter estimation
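As a quick numerical check of the ML result, a minimal sketch (the sample is simulated; the population values µw = 80 and σw = 10 are illustrative assumptions, not survey data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: weights ~ N(mu_w = 80, sigma_w = 10); values are illustrative
mu_w, sigma_w = 80.0, 10.0
W = rng.normal(mu_w, sigma_w, size=1000)

# With sigma_w known, the ML estimator of mu_w is the sample mean W_bar
mu_hat = W.mean()

# Sampling distribution of W_bar: standard deviation sigma_w / sqrt(n)
se = sigma_w / np.sqrt(len(W))
```

mu_hat should fall within a few multiples of se of the true mean, as the central limit theorem on the previous slide predicts.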
Weight of Dutch men compared to Belgian men (populations)
Observed sample averages W̄NL, W̄BE
Null hypothesis H0: µw,NL = µw,BE
Test statistic:
(W̄NL − W̄BE)/σ_{W̄NL−W̄BE} ∼ N(0, 1) (under H0)
Frequentist hypothesis testing
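The two-sample z-test above can be sketched as follows (sample sizes, means, and the known σ are made-up numbers for illustration):

```python
import numpy as np
from math import erfc, sqrt

rng = np.random.default_rng(1)

# Illustrative samples; sigma is treated as known, as on the slide
sigma = 10.0
W_NL = rng.normal(84.0, sigma, size=500)
W_BE = rng.normal(82.0, sigma, size=500)

# Test statistic: difference of sample means over its standard error under H0
se_diff = sigma * np.sqrt(1.0 / len(W_NL) + 1.0 / len(W_BE))
z = (W_NL.mean() - W_BE.mean()) / se_diff

# Two-sided p-value from the standard normal tail
p_value = erfc(abs(z) / sqrt(2.0))
```

A small p_value would lead a frequentist to reject H0 at the chosen significance level.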
Statistical (aleatoric) vs. epistemic uncertainty?
Every piece of information has uncertainty
Uncertainty = lack of information
Observation may reduce uncertainty
Probability (distribution) quantifies uncertainty
Probability theory: quantifying uncertainty
Measurement of physical quantity
Origin of stochasticity:
Apparatus
Microscopic fluctuations
Systematic uncertainty is assigned a probability distribution
E.g. coin tossing, voltage measurement, probability of one hypothesis vs. another, . . .
Bayesian: no random variables
Example: physical sciences
Objective Bayesian view
Probability = real number ∈ [0, 1]
Always conditioned on known information
Notation: p(A|B) or p(A|I)
Extension of logic: measure of degree to which B implies A
Degree of plausibility, but subject to consistency
Same information ⇒ same probabilities
Probability distribution: outcome → probability
What is probability?
p(x, y), p(x), p(y), p(x|y), p(y|x)
Joint, marginal and conditional distributions
Normal/Gaussian probability density function (PDF):
p(x|µ, σ, I) = 1/(√(2π) σ) exp[−(x − µ)²/(2σ²)]
Probability that x1 ≤ x < x1 + dx: p(x1|µ, σ, I) dx
Inverse problem: µ, σ given x?
Example: normal distribution
Bayes’ theorem:
p(θ|x, I) = p(x|θ, I) p(θ|I) / p(x|I)
x = data vector, θ = vector of model parameters, I = implicit knowledge
Likelihood: misfit between model and data
Prior distribution: ‘expert’ or diffuse knowledge
Evidence:
p(x|I) = ∫ p(x, θ|I) dθ = ∫ p(x|θ, I) p(θ|I) dθ
Posterior distribution
Updating information states
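Bayes' theorem can be applied numerically by discretizing θ on a grid; this sketch (simulated data, flat prior, known σ = 1, all assumed for illustration) computes posterior ∝ likelihood × prior and normalizes by the evidence:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(1.5, 1.0, size=20)       # data; sigma = 1 treated as known

theta = np.linspace(-3.0, 5.0, 801)     # grid over the parameter
dtheta = theta[1] - theta[0]
prior = np.ones_like(theta)             # flat prior on the grid

# Gaussian log-likelihood summed over the data, evaluated on the grid
log_like = -0.5 * ((x[:, None] - theta[None, :]) ** 2).sum(axis=0)

# Posterior ~ likelihood * prior; subtract the max before exp for stability
post = prior * np.exp(log_like - log_like.max())
post /= post.sum() * dtheta             # normalize to a density (evidence)

theta_mean = (theta * post).sum() * dtheta
```

With a flat prior, the posterior mean reproduces the sample mean, matching the analytic result derived later for the mean of a normal distribution.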
Updating state of knowledge = learning
‘Uninformative’ prior (distribution)
Assigning uninformative priors:
Transformation invariance (Jaynes ’68): e.g. uniform for location variable
Principle of indifference
Testable information: maximum entropy
. . .
Show me the prior!
Practical considerations
n measurements xi → x
Independent and identically distributed xi:
p(x|µ, σ, I) = ∏_{i=1}^n 1/(√(2π) σ) exp[−(xi − µ)²/(2σ²)]
Bayes’ rule:
p(µ, σ|x, I) ∝ p(x|µ, σ, I) p(µ, σ|I)
Suppose σ ≡ σe → delta function
Assume µ ∈ [µmin, µmax] → uniform prior:
p(µ|I) = 1/(µmax − µmin) if µ ∈ [µmin, µmax], 0 otherwise
Let µmin → −∞, µmax → +∞ ⇒ improper prior
Ensure proper posterior
Mean of a normal distribution: uniform prior
Posterior:
p(µ|x, I) ∝ exp[−∑_{i=1}^n (xi − µ)²/(2σe²)]
Define
x̄ ≡ (1/n) ∑_{i=1}^n xi, (∆x)² ≡ (1/n) ∑_{i=1}^n (xi − x̄)²
Completing the square,
p(µ|x, I) ∝ exp{−[(µ − x̄)² + (∆x)²]/(2σe²/n)}
Retaining the dependence on µ,
p(µ|x, I) ∝ exp[−(µ − x̄)²/(2σe²/n)]
µ ∼ N(x̄, σe²/n)
Posterior for µ
Normal prior: µ ∼ N(µ0, τ²)
Posterior:
p(µ|x, I) ∝ exp[−∑_{i=1}^n (xi − µ)²/(2σe²)] × exp[−(µ − µ0)²/(2τ²)]
Expanding and completing the square,
µ ∼ N(µn, σn²),
where
µn ≡ σn² (n x̄/σe² + µ0/τ²) and σn² ≡ (n/σe² + 1/τ²)⁻¹
µn is a weighted average of µ0 and x̄
Mean of a normal distribution: normal prior
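The normal-prior update rule above is easy to verify numerically; a minimal sketch (simulated data, illustrative prior parameters):

```python
import numpy as np

def normal_posterior(x, sigma_e, mu0, tau):
    """Conjugate update: posterior N(mu_n, sigma_n^2) for the mean of a
    normal likelihood with known sigma_e and prior mu ~ N(mu0, tau^2)."""
    n = len(x)
    sigma_n2 = 1.0 / (n / sigma_e**2 + 1.0 / tau**2)
    mu_n = sigma_n2 * (n * np.mean(x) / sigma_e**2 + mu0 / tau**2)
    return mu_n, np.sqrt(sigma_n2)

rng = np.random.default_rng(3)
x = rng.normal(2.0, 1.0, size=50)
mu_n, sigma_n = normal_posterior(x, sigma_e=1.0, mu0=0.0, tau=10.0)
```

With a wide prior (τ ≫ σe/√n), µn is pulled almost entirely toward the sample mean; with a narrow prior it stays near µ0, illustrating the weighted-average behavior.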
Intrinsic part of Bayesian theory, enabling coherent learning
High-quality data may be scarce
Regularization of ill-posed problems (e.g. tomography)
Prior can retain influence: e.g. regression with errors in all variables
Care for the prior
Repeated measurements → information on σ
Scale variable σ → Jeffreys’ scale prior:
p(σ|I) ∝ 1/σ, σ ∈ ]0, +∞[
Posterior:
p(µ, σ|x, I) ∝ (1/σ^n) exp[−((µ − x̄)² + (∆x)²)/(2σ²/n)] × 1/σ
Unknown mean and standard deviation
Marginalization = integrating out a (nuisance) parameter:
p(µ|x, I) = ∫_0^{+∞} p(µ, σ|x, I) dσ
          ∝ ∫_0^{+∞} (1/2) [((µ − x̄)² + (∆x)²)/(2/n)]^{−n/2} s^{n/2−1} e^{−s} ds
          = (1/2) Γ(n/2) [((µ − x̄)² + (∆x)²)/(2/n)]^{−n/2},
where
s ≡ ((µ − x̄)² + (∆x)²)/(2σ²/n)
Marginal posterior for µ (1)
After normalization:
p(µ|x, I) = Γ(n/2) / [√(π(∆x)²) Γ((n−1)/2)] × [1 + (µ − x̄)²/(∆x)²]^{−n/2}
Changing variables,
t ≡ (µ − x̄)/√((∆x)²/(n − 1)), with p(t|x, I) dt ≡ p(µ|x, I) dµ,
p(t|x, I) = Γ(n/2) / [√((n − 1)π) Γ((n−1)/2)] × [1 + t²/(n − 1)]^{−n/2}
Student’s t-distribution with parameter ν = n − 1
If n ≫ 1,
p(µ|x, I) → 1/√(2π(∆x)²/n) exp[−(µ − x̄)²/(2(∆x)²/n)]
Marginal posterior for µ (2)
Marginalization of µ:
p(σ|x, I) = ∫_{−∞}^{+∞} p(µ, σ|x, I) dµ ∝ (1/σ^n) exp[−(∆x)²/(2σ²/n)]
Setting X ≡ n(∆x)²/σ²,
p(X|x, I) = 1/(2^{k/2} Γ(k/2)) X^{k/2−1} e^{−X/2}, k ≡ n − 1
χ² distribution with parameter k
Marginal posterior for σ
Laplace (saddle point) approximation of distributions around the mode (= maximum)
E.g. marginal for µ:
p(µ|x, I) = Γ(n/2) / [√(π(∆x)²) Γ((n−1)/2)] × [1 + (µ − x̄)²/(∆x)²]^{−n/2}
Taylor expansion around the mode:
ln[p(µ|x, I)] ≈ ln[p(x̄|x, I)] + (1/2) d²(ln p)/dµ²|_{µ=x̄} (µ − x̄)²
              = ln[Γ(n/2)] − ln[Γ((n−1)/2)] − (1/2) ln[π(∆x)²] − n/(2(∆x)²) (µ − x̄)²
The Laplace approximation (1)
On the original scale:
p(µ|x, I) ≈ Γ(n/2)/Γ((n−1)/2) × 1/√(π(∆x)²) × exp[−(µ − x̄)²/(2(∆x)²/n)]
Standard deviation σL from the curvature of ln p:
σL = [−d²(ln p)/dµ²|_{µ=x̄}]^{−1/2}
The Laplace approximation (2)
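The curvature recipe above can be checked numerically with finite differences; in this sketch (simulated data) the Laplace width of the Student-t marginal reproduces the analytic σL = √((∆x)²/n):

```python
import numpy as np

def laplace_sd(logp, mode, h=1e-3):
    """Laplace width: sigma_L = [-d^2 ln p / d mu^2 at the mode]^(-1/2),
    with the second derivative taken by central finite differences."""
    d2 = (logp(mode + h) - 2.0 * logp(mode) + logp(mode - h)) / h**2
    return (-d2) ** -0.5

rng = np.random.default_rng(4)
x = rng.normal(0.0, 2.0, size=200)
xbar, dx2, n = x.mean(), x.var(), len(x)

# Unnormalized log of the Student-t marginal posterior for mu
log_post = lambda mu: -(n / 2.0) * np.log(1.0 + (mu - xbar) ** 2 / dx2)

sigma_L = laplace_sd(log_post, mode=xbar)   # should be close to sqrt(dx2 / n)
```

The same finite-difference trick generalizes to the multivariate case on the next slide, with the Hessian replacing the second derivative.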
Laplace approximation: example 1
Laplace approximation: example 2
For θ = [θ1, . . . , θp]^t,
p(θ|θ0, I) ∝ exp[(1/2)(θ − θ0)^t [∇∇(ln p)]_{θ=θ0} (θ − θ0)]
∇∇(ln p): Hessian matrix, where
ΣL = −{[∇∇(ln p)]_{θ=θ0}}⁻¹
Multivariate Laplace approximation
Let {Hi} be a complete set of hypotheses
Data D to support or reject hypotheses
Bayes’ rule:
p(Hi|D, I) = p(D|Hi, I) p(Hi|I) / p(D|I), with p(D|I) = ∑_i p(D|Hi, I) p(Hi|I)
Assume a single hypothesis H and its complement H̄
Odds ratio o:
o ≡ p(H|D, I)/p(H̄|D, I) = [p(D|H, I)/p(D|H̄, I)] × [p(H|I)/p(H̄|I)]
    (Bayes factor × prior odds)
Model comparison (hypothesis testing)
E.g. n measurements xi of a quantity x
Assume a normal distribution with known variance σ²
Question: are the data compatible with mean µ = µ0?
Yes: H; No: H̄
Under H:
p(x̄|H, I) = C exp[−(x̄ − µ0)²/(2σ²/n)]
Under H̄:
p(x̄|H̄, I) = ∫ p(x̄|µ, σ, I) p(µ|H̄, I) dµ (1)
Testing a Gaussian mean (1)
Assume bounds µmin and µmax:
p(µ|H̄, I) = 1/|µmax − µmin| × 1(µmin ≤ µ ≤ µmax)
Then (1) becomes
p(x̄|H̄, I) = C/|µmax − µmin| ∫_{µmin}^{µmax} exp[−(x̄ − µ)²/(2σ²/n)] dµ
With SE ≡ σ/√n the standard error, assume
|x̄ − µmin|, |x̄ − µmax| ≫ SE
Testing a Gaussian mean (2)
Then
p(x̄|H̄, I) ≈ C/|µmax − µmin| ∫_{−∞}^{+∞} exp[−(x̄ − µ)²/(2σ²/n)] dµ
           = C SE √(2π)/|µmax − µmin|
Bayes factor BF:
BF = p(x̄|H, I)/p(x̄|H̄, I) = |µmax − µmin|/(SE √(2π)) × e^{−z²/2}, z ≡ |x̄ − µ0|/SE
Cf. frequentist hypothesis test
Testing a Gaussian mean (3)
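A small numerical sketch of this Bayes factor (simulated data; the prior range [µmin, µmax] is an illustrative assumption satisfying the wide-prior condition):

```python
import numpy as np

def bayes_factor(x, mu0, sigma, mu_min, mu_max):
    """BF = p(xbar|H) / p(xbar|Hbar) with H: mu = mu0 and, under Hbar,
    a uniform prior on [mu_min, mu_max] wide compared to SE."""
    n = len(x)
    se = sigma / np.sqrt(n)
    z = abs(np.mean(x) - mu0) / se
    return (mu_max - mu_min) / (se * np.sqrt(2.0 * np.pi)) * np.exp(-0.5 * z**2)

rng = np.random.default_rng(5)
x = rng.normal(0.0, 1.0, size=100)   # data generated with true mean 0

bf_h0_true = bayes_factor(x, mu0=0.0, sigma=1.0, mu_min=-10.0, mu_max=10.0)
bf_h0_false = bayes_factor(x, mu0=1.0, sigma=1.0, mu_min=-10.0, mu_max=10.0)
```

The evidence favors the hypothesis that generated the data: bf_h0_true should greatly exceed bf_h0_false.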
(Strong) nonlinearities through prior or likelihood
Skewed, heavy-tailed, multimodal distributions
Sampling (simulating) from distributions p
Estimate properties of p
Marginalization: sampling from the joint distribution → sampling from the marginals
Monte Carlo sampling
Metropolis-Hastings sampling, Gibbs sampling
Simple algorithms, but implementation issues might arise
Target distribution π(θ|I)
Markov chain θ(t): conditional on last state
Sample from proposal distribution
Markov chain Monte Carlo (MCMC) (1)
Subchains sample from the corresponding marginals
Monte Carlo averages, e.g.
θ̄j = (1/n) ∑_{i=1}^n θj^(tc+i), (∆θj)² = (1/n) ∑_{i=1}^n (θj^(tc+i) − θ̄j)²
Markov chain Monte Carlo (2)
1 At t = 0, start from θ^(t) = [θ1^(t), θ2^(t), . . . , θp^(t)]^t
2 repeat
3   for j = 1 to p:
      y ∼ qj(η | θ1^(t+1), . . . , θj−1^(t+1), θj^(t), θj+1^(t), . . . , θp^(t), I)
      θj^(t+1) = y with probability ρ, θj^(t) with probability 1 − ρ
4 end
Metropolis-Hastings algorithm (1)
Acceptance probability:
ρ(y, θ1^(t+1), . . . , θj−1^(t+1), θj^(t), θj+1^(t), . . . , θp^(t))
≡ min{1, [π(θ1^(t+1), . . . , θj−1^(t+1), y, θj+1^(t), . . . , θp^(t), I) /
          π(θ1^(t+1), . . . , θj−1^(t+1), θj^(t), θj+1^(t), . . . , θp^(t), I)]
       × [qj(θj^(t) | θ1^(t+1), . . . , θj−1^(t+1), y, θj+1^(t), . . . , θp^(t), I) /
          qj(y | θ1^(t+1), . . . , θj−1^(t+1), θj^(t), θj+1^(t), . . . , θp^(t), I)]}
Example (random-walk proposal):
qj(η | θj^(t), I) ∝ exp[−(η − θj^(t))²/(2σj²)]
Metropolis-Hastings algorithm (2)
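A minimal random-walk Metropolis sampler illustrating the scheme (single parameter; with a symmetric Gaussian proposal the q-ratio in ρ cancels, leaving the plain Metropolis rule):

```python
import numpy as np

def metropolis(log_target, theta0, n_steps=20000, step=2.0, seed=0):
    """Random-walk Metropolis: propose y ~ N(theta, step^2), accept with
    probability min(1, pi(y)/pi(theta)); the symmetric proposal cancels."""
    rng = np.random.default_rng(seed)
    theta = theta0
    samples = np.empty(n_steps)
    for t in range(n_steps):
        y = theta + step * rng.standard_normal()
        if np.log(rng.random()) < log_target(y) - log_target(theta):
            theta = y                     # accept the proposal
        samples[t] = theta                # otherwise keep the old state
    return samples

# Target: standard normal, specified only up to a constant
samples = metropolis(lambda th: -0.5 * th**2, theta0=3.0)
burned = samples[2000:]                   # discard burn-in
```

Monte Carlo averages of the chain approximate the target's moments (mean ≈ 0, variance ≈ 1), exactly as in the Monte Carlo averages on the earlier slide.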
MCMC example
Classification or clustering
M clusters of in total n data points xi in P-dimensional space
Known class labels ωj (j = 1, . . . , M) of the xi
Bayes’ rule for a new point x:
p(ωj|x, I) = p(x|ωj, I) p(ωj|I) / p(x|I)
Maximum a posteriori (MAP) classification rule for x:
Assign x to ωi = arg max_{ωj} p(ωj|x, I) = arg max_{ωj} p(x|ωj, I) p(ωj|I)
Simple Bayesian classification
Examples of prior probabilities (indifference):
p(ωi|I) = p(ωj|I), ∀i, j
Count class membership:
p(ωi|I) ≡ ni/n, i = 1, . . . , M
Examples of likelihoods:
Naive Bayesian classifier:
p(x|ωi) = ∏_{k=1}^P p(xk|ωi), i = 1, . . . , M
Multivariate Gaussian:
p(x|ωj, I) = 1/[(2π)^{P/2} |Σj|^{1/2}] exp[−(1/2)(x − µj)^t Σj⁻¹ (x − µj)]
Examples of priors and likelihoods
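A direct implementation of the MAP rule with Gaussian class likelihoods (the two classes and their parameters are invented for illustration):

```python
import numpy as np

def map_classify(x, means, covs, priors):
    """Assign x to the class maximizing ln p(x|w_j, I) + ln p(w_j|I)
    for multivariate Gaussian class likelihoods."""
    scores = []
    for mu, S, pw in zip(means, covs, priors):
        d = x - mu
        log_like = -0.5 * (d @ np.linalg.inv(S) @ d
                           + np.log(np.linalg.det(S))
                           + len(x) * np.log(2.0 * np.pi))
        scores.append(log_like + np.log(pw))
    return int(np.argmax(scores))

# Two illustrative classes with equal priors (principle of indifference)
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.eye(2), np.eye(2)]
priors = [0.5, 0.5]

label = map_classify(np.array([2.5, 2.8]), means, covs, priors)  # assigned to class 1
```

With equal priors and equal (identity) covariances this reduces to the minimum Euclidean distance classifier of the later slides.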
Bayesian classifier minimizes probability of misclassification
Optimality of Bayesian classifier
MAP: maximize w.r.t. ωj:
ln p(ωj|x, I) = ln p(x|ωj, I) + ln p(ωj|I)
Define (M = 2):
g(x) ≡ ln p(ω1|x, I) − ln p(ω2|x, I)
For normal likelihoods:
g(x) = (1/2)(x^t Σ2⁻¹ x − x^t Σ1⁻¹ x)   [quadratic]
     + µ1^t Σ1⁻¹ x − µ2^t Σ2⁻¹ x         [linear]
     − (1/2) µ1^t Σ1⁻¹ µ1 + (1/2) µ2^t Σ2⁻¹ µ2 + (1/2) ln(|Σ2|/|Σ1|) + ln[p(ω1|I)/p(ω2|I)]   [constant]
g(x) = 0 separates the classes: decision hypersurface
MAP classification: decision surfaces
Discriminant analysis
Quadratic discriminant analysis (QDA)
Linear discriminant analysis (LDA): Σ1 = Σ2
QDA and LDA
A. Shabbir et al., Fusion Eng. Des. 123, 717, 2017
Pinput − 1.41ΓD2 = 7.47
ELM type classification
Linear classifier with normal likelihood (Σ1 = Σ2 ≡ Σ):
g(x) = θ^t (x − x0) = 0,
θ ≡ Σ⁻¹(µ1 − µ2)
x0 ≡ (1/2)(µ1 + µ2) − ln[p(ω1|I)/p(ω2|I)] (µ1 − µ2)/‖µ1 − µ2‖²_{Σ⁻¹}
‖µ1 − µ2‖_{Σ⁻¹} ≡ √[(µ1 − µ2)^t Σ⁻¹ (µ1 − µ2)]   (Mahalanobis distance)
p(ω1|I) = p(ω2|I) ⇒ minimum Mahalanobis distance classifier (to cluster centroid)
Σ = σ²I ⇒ minimum Euclidean distance classifier
k-nearest neighbor classification (kNN)
Minimum distance classifiers (1)
Minimum distance classifiers (2)
Regression model:
y = β0 + β1 x1 + β2 x2 + . . . + βp xp + ε
ε ∼ N(0, σ²), σ known
Negligible uncertainty on the xj
Likelihood:
p(y|x, β, σ, I) = 1/(√(2π) σ) exp[−(y − β0 − ∑_{j=1}^p βj xj)²/(2σ²)],
β ≡ [β0, βp^t]^t, βp ≡ [β1, . . . , βp]^t
Multilinear regression model
Take n measurements:
y ≡ [y1, . . . , yn]^t
X ≡ the n × (p+1) design matrix with rows [1, xi1, . . . , xip]
Conditional independence:
p(y|X, β, σ, I) = (2π)^{−n/2} σ^{−n} exp[−(y − Xβ)^t (y − Xβ)/(2σ²)]
ML: maximize w.r.t. β:
0 = ∇β (y − Xβ)^t (y − Xβ) = −2X^t y + 2X^t X β
⇒ βML = (X^t X)⁻¹ X^t y = βLS, with (X^t X)⁻¹ X^t the Moore-Penrose pseudoinverse
Maximum likelihood solution
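The ML/least-squares solution via the pseudoinverse, checked on simulated data (the true coefficients and noise level are chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(6)

# Design matrix with an intercept column; true betas are illustrative
n, p = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(0.0, 0.1, size=n)

# beta_ML = (X^t X)^{-1} X^t y, i.e. the Moore-Penrose pseudoinverse of X
beta_ml = np.linalg.pinv(X) @ y

# Numerically preferable equivalent in practice
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Forming (X^t X)⁻¹ explicitly squares the condition number of X, which is why `lstsq` (QR/SVD-based) is the usual choice in code even though the algebra is the same.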
Uniform priors on the βj (not the most uninformative!):
p(β|y, X, σ, I) ∝ exp[−(y − Xβ)^t (y − Xβ)/(2σ²)]
Due to linearity and Gaussianity: βMAP = βML = βLS
Taylor expansion (exact!):
(y − Xβ)^t (y − Xβ) = (y − XβMAP)^t (y − XβMAP) + (1/2)(β − βMAP)^t 2X^t X (β − βMAP)
Hence,
p(β|y, X, σ, I) = (2π)^{−(p+1)/2} |Σ|^{−1/2} exp[−(1/2)(β − βMAP)^t Σ⁻¹ (β − βMAP)],
Σ ≡ σ²(X^t X)⁻¹
MAP solution and posterior
New predictions by the model?
Posterior predictive distribution:
p(ynew|xnew, y, X, σ, I) = ∫_{R^{p+1}} p(ynew, β|xnew, y, X, σ, I) dβ
                        = ∫_{R^{p+1}} p(ynew|xnew, β, I) p(β|y, X, σ, I) dβ
But
p(ynew|xnew, β, I) = δ(ynew − β^t xnew)
Fix β0 = ynew − β1 xnew,1 − . . . − βp xnew,p
Posterior predictive distribution (1)
Marginalize over βp with flat priors:
p(ynew|xnew, y, X, σ, I)
∝ σ^{−n} ∫_{R^p} exp{−(1/(2σ²)) ∑_{i=1}^n [yi − ynew + ∑_{j=1}^p βj(xnew,j − xij)]²} dβp
After (quite some) algebra, one finds simply
ynew,MAP = ∑_{j=1}^p xnew,j βMAP,j + βMAP,0
Simpler derivation based on properties of E and Var
General posterior more complicated!
Posterior predictive distribution (2)
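The result that the MAP prediction is just the fitted surface evaluated at the new point can be sketched directly (simulated data; flat priors so βMAP = βLS):

```python
import numpy as np

rng = np.random.default_rng(7)

# Training data from a one-regressor model (coefficients are illustrative)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([0.5, 1.5]) + rng.normal(0.0, 0.2, size=n)

beta_map = np.linalg.pinv(X) @ y        # flat priors: beta_MAP = beta_LS

# MAP prediction at a new point: y_new = beta_0 + sum_j beta_j * x_new,j
x_new = np.array([1.0, 2.0])            # [1, x_new,1]
y_new_map = float(x_new @ beta_map)
```

This recovers the "simpler derivation" mentioned above: the predictive mode is the regression surface at xnew, while the full predictive distribution would also carry the parameter uncertainty.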
Frequentist vs. Bayesian methods: interpretation of probability
Bayesian probability: extension of logic to situations with uncertainty
Posterior probability of parameters or hypotheses
Numerical approach in general
Underlies or explains many machine learning techniques
Conclusions
D.S. Sivia and J. Skilling, Data Analysis: A Bayesian Tutorial, 2nd edition, Oxford University Press, 2006
W. von der Linden, V. Dose and U. von Toussaint, Bayesian Probability Theory: Applications in the Physical Sciences, Cambridge University Press, 2014
P. Gregory, Bayesian Logical Data Analysis for the Physical Sciences, Cambridge University Press, 2005
S. Theodoridis, Machine Learning: A Bayesian and Optimization Perspective, Academic Press (Elsevier), 2015
E.T. Jaynes (G.L. Bretthorst, ed.), Probability Theory: The Logic of Science, Cambridge University Press, 2003
S.B. McGrayne, The Theory That Would Not Die: How Bayes' Rule Cracked the Enigma Code, Hunted Down Russian Submarines, and Emerged Triumphant from Two Centuries of Controversy, Yale University Press, 2011
References