Bayesian Methods I: Parameter Estimation
“A statistician is a person who draws a mathematically precise line from an unwarranted assumption to a foregone conclusion.”
Anon
The Medical Test

You take a test for a rare debilitating disease, Frequentitis:

• False positive rate for the test = 5%
• False negative rate for the test = 1%
• The incidence of Frequentitis in the population is 0.1%
• Data D = you test positive

What is the probability that you have the disease (hypothesis H)?

Bayes' theorem:

  P(H|D,I) = P(H|I) P(D|H,I) / P(D|I)

The normalization factor P(D|I) ensures Σi P(Hi|D,I) = 1. By the sum rule:

  P(D|I) = P(H|I) P(D|H,I) + P(H̄|I) P(D|H̄,I)

  P(H|D,I) = (0.1% × 99%) / (0.1% × 99% + 99.9% × 5%) = 0.019

The probability is 1.9% (not 95%) that you have the disease!
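The arithmetic above can be checked in a few lines (a minimal sketch; the variable names are mine, the numbers are from the slide):

```python
# Posterior probability of disease given a positive test, via Bayes' theorem.
# Slide values: 5% false positive, 1% false negative, 0.1% incidence.
p_disease = 0.001            # P(H|I): prior incidence
p_pos_given_disease = 0.99   # P(D|H,I): 1 - false negative rate
p_pos_given_healthy = 0.05   # P(D|H̄,I): false positive rate

# P(D|I) by the sum rule over the two hypotheses
p_pos = (p_disease * p_pos_given_disease
         + (1 - p_disease) * p_pos_given_healthy)

posterior = p_disease * p_pos_given_disease / p_pos
print(round(posterior, 3))   # → 0.019
```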
Two Basic Classes of Inference

1. Model Comparison

Which of two or more competing models is the most probable, given our present state of knowledge?

• Competing models may have free parameters
• Models may vary in complexity (some with more free parameters)
• Generally, model comparison is not concerned with finding parameter values
• Free parameters are usually marginalized out in the analysis

2. Parameter Estimation

Given a certain model, what is the probability density function for each of its free parameters?

• Suppose model M has free parameters f and A
• We wish to find p(f|D,M,I) and p(A|D,M,I)
• p(f|D,M,I) is known as the marginal posterior distribution for f
Spectral Line Fitting

Gaussian line profile in noisy data. We are given the model M:

  T fi = T exp[ -(νi - ν0)² / (2σL²) ]

where ν0 = 37 and σL = 2 (channels).

The noise has been independently characterized as Gaussian with σn = 1 (in units of the signal).

Estimates of T from theory are uncertain over three orders of magnitude, from 0.1 to 100.
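This setup is easy to simulate. Below is a minimal sketch; the channel range (1–64) and the line strength T = 3 are illustrative assumptions, not given on the slide:

```python
import numpy as np

# Gaussian line profile T*f_i = T*exp(-(nu_i - nu0)^2 / (2*sigma_L^2)),
# with nu0 = 37 and sigma_L = 2 channels, as on the slide.
nu0, sigma_L, sigma_n = 37.0, 2.0, 1.0
channels = np.arange(1, 65)          # assumed channel range for the demo
f = np.exp(-(channels - nu0) ** 2 / (2 * sigma_L ** 2))

rng = np.random.default_rng(0)
T_true = 3.0                         # assumed line strength for the demo
d = T_true * f + rng.normal(0, sigma_n, channels.size)  # data = signal + noise
print(f.max())                       # profile peaks at 1, at channel nu0 = 37
```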
Parameter Estimation: Spectral Line Fit

  P(T|D,M,I) = P(T|M,I) P(D|M,T,I) / P(D|M,I)
                (prior)  (likelihood)

Calculating the likelihood P(D|M,T,I): the data are di = T fi + ei, so for Gaussian noise

  P(D|M,T,I) = P(E1, E2, ..., EN | M,T,I) = ∏i P(Ei|M,T,I)

             = ∏i [ 1 / (σn √(2π)) ] exp[ -(di - T fi)² / (2σn²) ]

             = σn^(-N) (2π)^(-N/2) exp[ -Σi (di - T fi)² / (2σn²) ]
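This likelihood can be evaluated directly (in log form, for numerical stability). The sketch below mirrors the slide's values; the simulated data and channel range are illustrative assumptions:

```python
import numpy as np

# ln P(D|M,T,I) for the Gaussian line model, evaluated on a grid of T.
nu0, sigma_L, sigma_n = 37.0, 2.0, 1.0
channels = np.arange(1, 65)          # assumed channel range
f = np.exp(-(channels - nu0) ** 2 / (2 * sigma_L ** 2))

rng = np.random.default_rng(1)
T_true = 3.0                         # assumed line strength for the demo
d = T_true * f + rng.normal(0, sigma_n, channels.size)

def log_likelihood(T):
    # ln P(D|M,T,I) = -N ln(sigma_n) - (N/2) ln(2*pi) - sum((d - T*f)^2)/(2*sigma_n^2)
    N = d.size
    return (-N * np.log(sigma_n) - 0.5 * N * np.log(2 * np.pi)
            - np.sum((d - T * f) ** 2) / (2 * sigma_n ** 2))

T_grid = np.linspace(0.1, 100, 10000)
T_ml = T_grid[np.argmax([log_likelihood(T) for T in T_grid])]
print(T_ml)  # maximum-likelihood estimate, near T_true
```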
Parameter Estimation: Spectral Line Fit
What should we use for the prior P(T|M,I)? For now, let us assume a uniform prior over Tmin < T < Tmax. The posterior is then proportional to the likelihood, so the value of T that maximizes the posterior is the Maximum Likelihood estimator.
The Choice of Prior

Our choice of prior can have a strong influence on the outcome of a Bayesian analysis. In our example, we adopted a uniform prior for the unknown line strength. Was this the right thing to do?

  Tmin < T < Tmax, where Tmin = 0.1 and Tmax = 100 (given in the problem)

Implication: we don't know the scale. A uniform prior heavily weights the upper decade of the range. In such cases, consider the scale-invariant Jeffreys prior (equal probability per decade), defined as:

  P(T|I) = 1 / [ T ln(Tmax / Tmin) ]

[Figure: the uniform and Jeffreys priors, shown both as PDFs and as probability per log interval]
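A quick numerical check of the "equal probability per decade" claim (a sketch, using the slide's range 0.1 to 100):

```python
import numpy as np

# Probability mass each prior assigns to each decade of 0.1 <= T <= 100.
T_min, T_max = 0.1, 100.0

def uniform_mass(a, b):
    # uniform prior: mass proportional to interval length
    return (b - a) / (T_max - T_min)

def jeffreys_mass(a, b):
    # Jeffreys prior 1/(T ln(T_max/T_min)): mass proportional to log width
    return np.log(b / a) / np.log(T_max / T_min)

for a, b in [(0.1, 1), (1, 10), (10, 100)]:
    print(f"[{a}, {b}]: uniform {uniform_mass(a, b):.3f}, "
          f"jeffreys {jeffreys_mass(a, b):.3f}")
```

The uniform prior puts about 90% of its mass in the top decade [10, 100], while the Jeffreys prior assigns exactly one third to each decade.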
Varying the Prior: Spectral Line Fit

[Figure: posterior PDF for line strength T under the uniform and Jeffreys priors]
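A comparison of the two posteriors can be sketched numerically on a grid (an illustration under assumed simulated data with a weak line, T = 1; not the slide's actual dataset):

```python
import numpy as np

# Posterior on a grid of T under each prior: posterior ∝ prior × likelihood.
nu0, sigma_L, sigma_n = 37.0, 2.0, 1.0
T_min, T_max = 0.1, 100.0
channels = np.arange(1, 65)          # assumed channel range
f = np.exp(-(channels - nu0) ** 2 / (2 * sigma_L ** 2))

rng = np.random.default_rng(2)
d = 1.0 * f + rng.normal(0, sigma_n, channels.size)   # weak line, T_true = 1

T = np.linspace(T_min, T_max, 20000)
log_like = np.array([-np.sum((d - t * f) ** 2) / (2 * sigma_n ** 2) for t in T])
like = np.exp(log_like - log_like.max())

dT = T[1] - T[0]
means = {}
for name, prior in [("uniform", np.ones_like(T)), ("jeffreys", 1.0 / T)]:
    post = prior * like
    post /= post.sum() * dT          # normalize on the grid
    means[name] = (T * post).sum() * dT
    print(name, round(means[name], 2))
```

Because the 1/T weighting favors small values, the Jeffreys posterior mean sits below the uniform-prior mean; for a weak line the difference is noticeable.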
Increasing the Line Strength T

In the case of a stronger line detection, the data place a more powerful constraint on the parameters, so the choice of prior is less critical.

[Figure: posterior PDF for line strength T under the uniform and Jeffreys priors, stronger-line case]
Ignorance Priors

How do we select a prior when (we think) we have no clue? The principle of indifference states p(Ai|B) = 1/N, where there are N possible states.

Location parameter: a location measured from some origin:

  p(X|I) = p(x → x+dx | I)

From a different (arbitrary) origin, p(X'|I) = p(x' → x'+dx' | I), where x' = x + c.

Indifference requires p(X|I) = p(X'|I), so that pdf(x) = pdf(x') = pdf(x+c).

The solution to this is pdf(x) = constant (a uniform prior).
Ignorance Priors

Scale parameter: for example, the half-life of a new radioactive element. Ignorance of a scale parameter implies that the distribution should be invariant whether measured in units t or t' = βt.

Then p(T|I) dT = p(T'|I) dT', with dT' = β dT, so that

  pdf(t) = β pdf(t') = β pdf(βt)

The solution to this is pdf(t) = constant/t (the Jeffreys prior).
Improper Priors

Suppose we have absolutely no idea of the limits Xmin and Xmax (a recent physics example: the distance to a GRB).

A uniform prior with an infinite range cannot be normalized. Such priors are known as improper priors.

Improper priors can still be used for parameter estimation problems (like the previous problem), but not for model comparison (Lecture 4), where the normalization of the prior is required to obtain probabilities.
Nuisance Parameters

Frequently, we are only interested in a subset of the model parameters. The uninteresting parameters are called nuisance parameters.

• Example: we may be interested in the frequency ω of a sinusoidal signal in a noisy dataset, but not in the phase φ or the amplitude a.

We obtain the marginal posterior for ω by marginalization (integration) over the nuisance parameters:

  P(ω|D,I) = ∫dφ ∫da P(ω, φ, a|D,I)
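Marginalization is straightforward to do numerically on a grid. The sketch below is a simplified, assumed example (a sinusoid with known phase, so only the amplitude is a nuisance parameter), not the slide's full three-parameter problem:

```python
import numpy as np

# Toy model y = A*sin(omega*t) + noise; we want P(omega|D,I) with the
# amplitude A as a nuisance parameter, under uniform priors.
sigma = 0.5
rng = np.random.default_rng(3)
t = np.linspace(0, 10, 200)
y = 1.5 * np.sin(2.0 * t) + rng.normal(0, sigma, t.size)   # true omega = 2.0

omega = np.linspace(1.5, 2.5, 201)
A = np.linspace(0.1, 3.0, 201)

# log joint posterior on the (omega, A) grid
log_post = np.array([[-np.sum((y - a * np.sin(w * t)) ** 2) / (2 * sigma ** 2)
                      for a in A] for w in omega])
post = np.exp(log_post - log_post.max())

# marginalize out A: P(omega|D,I) = ∫ dA P(omega, A|D,I)
dA = A[1] - A[0]
marginal = post.sum(axis=1) * dA
omega_best = omega[np.argmax(marginal)]
print(round(omega_best, 2))  # peak of the marginal posterior, near 2.0
```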
Likely Examples of Nuisance Parameters

Spectral line fitting: a Gaussian line profile in noisy data. We are given the model M:

  T fi = T exp[ -(νi - ν0)² / (2σL²) ]

where ν0 = 37 and σL = 2 (channels).

The noise has been independently characterized as Gaussian with σn = 1 (in units of the signal). Estimates of T from theory are uncertain over three orders of magnitude, from 0.1 to 100.
Maximum Likelihood

Hypothesis H pertains to the PDF for a variable x:

  P(X|D,I) = P(X|I) P(D|X,I) / P(D|I)

Assume a uniform prior and ignore the normalization factor. Then we have:

  P(X|D,I) ∝ P(D|X,I)

The value x0 that maximizes the posterior in this case is the one that maximizes the likelihood function P(D|X,I), and is referred to as the Maximum Likelihood estimator.
Maximum Likelihood and Least Squares

Assuming the noise is Gaussian, then for each datum:

  P(Di|X,I) = [ 1 / (σi √(2π)) ] exp[ -(Fi - Di)² / (2σi²) ]

where Fi = f(xi) is our ideal (noiseless) model prediction.

Given a set of data D whose individual points are independent, the likelihood is:

  P(D|X,I) = ∏i P(Di|X,I) ∝ exp(-χ²/2), where χ² = Σi (Fi - Di)² / σi²

Since the location of a maximum is not affected by a monotonic transformation:

  L = ln[ P(X|D,I) ] = constant - χ²/2

Maximum Likelihood is obtained by minimizing χ²: we have recovered the well-known Least Squares optimization result.
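The equivalence can be verified numerically: minimizing χ² by brute force should land on the standard weighted least-squares solution. A sketch for an assumed straight-line model (data, model, and grids are all illustrative):

```python
import numpy as np

# Chi-squared minimization vs. weighted least squares for F_i = m*x_i + b.
rng = np.random.default_rng(4)
x = np.linspace(0, 10, 50)
sigma = np.full(x.size, 0.3)
y = 2.0 * x + 1.0 + rng.normal(0, sigma)     # true m = 2, b = 1

def chi2(m, b):
    # chi^2 = sum_i (F_i - D_i)^2 / sigma_i^2
    return np.sum(((m * x + b - y) / sigma) ** 2)

# brute-force minimum over a (m, b) grid...
m_grid = np.linspace(1.5, 2.5, 301)
b_grid = np.linspace(0.0, 2.0, 301)
chi = np.array([[chi2(m, b) for b in b_grid] for m in m_grid])
i, j = np.unravel_index(chi.argmin(), chi.shape)

# ...against the standard weighted least-squares fit
m_ls, b_ls = np.polyfit(x, y, 1, w=1 / sigma)
print(m_grid[i], b_grid[j], m_ls, b_ls)      # the two solutions agree
```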