13 Probability and Statistics

13.1 Probability and Thomas Bayes

The probability P(A) of an outcome in a set A is the sum of the probabilities P_j of all the different (mutually exclusive) outcomes j in A

P(A) = \sum_{j \in A} P_j .   (13.1)

For instance, if one throws two fair dice, then the probability that the sum is 2 is P(1,1) = 1/36, while the probability that the sum is 3 is P(1,2) + P(2,1) = 1/18.

If A and B are two sets of possible outcomes, then the probability of an outcome in the union A \cup B is the sum of the probabilities P(A) and P(B) minus that of their intersection A \cap B

P(A \cup B) = P(A) + P(B) - P(A \cap B) .   (13.2)

If the outcomes are mutually exclusive, then P(A \cap B) = 0, and the probability of the union is the sum P(A \cup B) = P(A) + P(B). The joint probability P(A, B) \equiv P(A \cap B) is the probability of an outcome that is in both sets A and B. If the joint probability is the product P(A \cap B) = P(A) P(B), then the outcomes in sets A and B are statistically independent.

The probability that a result in set B also is in set A is the conditional probability P(A|B), the probability of A given B

P(A|B) = \frac{P(A \cap B)}{P(B)} .   (13.3)


Interchanging A and B, we get as the probability of B given A

P(B|A) = \frac{P(B \cap A)}{P(A)} .   (13.4)

Since A \cap B = B \cap A, the last two equations (13.3 & 13.4) tell us that

P(A \cap B) = P(B \cap A) = P(B|A) \, P(A) = P(A|B) \, P(B)   (13.5)

in which the last equality is Bayes's theorem

P(A|B) = \frac{P(B|A) \, P(A)}{P(B)} \quad \text{as well as} \quad P(B|A) = \frac{P(A|B) \, P(B)}{P(A)}   (13.6)

(Thomas Bayes, 1702–1761).

If B = C \cap D in (13.3), then P(A|C \cap D) = P(A \cap C \cap D)/P(C \cap D).

If a set B of outcomes is contained in a union of n sets A_j that are mutually exclusive,

B \subset \bigcup_{j=1}^{n} A_j \quad \text{and} \quad A_i \cap A_k = \emptyset ,   (13.7)

then we must sum over them

P(B) = \sum_{j=1}^{n} P(B|A_j) \, P(A_j) .   (13.8)

If, for example, P(A_j) were the probability of selecting an atom with Z_j protons and N_j neutrons, and if P(B|A_j) were the probability that such a nucleus would decay in time t, then the probability that the nucleus of the selected atom would decay in time t would be given by the sum (13.8). The probabilities P(A_j) are called a priori probabilities. In this case, if we replace A by A_k in the formula (13.5) for the probability P(A \cap B), then we get P(B \cap A_k) = P(B|A_k) \, P(A_k) = P(A_k|B) \, P(B). This last equality and the sum (13.8) give us these forms of Bayes's theorem

P(A_k|B) = \frac{P(B|A_k) \, P(A_k)}{\sum_{j=1}^{n} P(B|A_j) \, P(A_j)}   (13.9)

P(B|A_k) = \frac{P(A_k|B)}{P(A_k)} \sum_{j=1}^{n} P(B|A_j) \, P(A_j) .   (13.10)

Example 13.1 (The Low-Base-Rate Problem). Suppose the incidence of a rare disease in a population is P(D) = 0.001. Suppose a test for the disease has a sensitivity of 99%, that is, the probability that a carrier of the disease will test positive is P(+|D) = 0.99. Suppose the test also is highly selective, with a false-positive rate of only P(+|N) = 0.005. Then the probability that a random person in the population would test positive is by (13.8)

P(+) = P(+|D) \, P(D) + P(+|N) \, P(N) = 0.005985 .   (13.11)

And by Bayes's theorem (13.6), the probability that a person who tests positive actually has the disease is only

P(D|+) = \frac{P(+|D) \, P(D)}{P(+)} = \frac{0.99 \times 0.001}{0.005985} = 0.165   (13.12)

and the probability that a person testing positive actually is healthy is P(N|+) = 1 - P(D|+) = 0.835.

Even with an excellent test, screening for rare diseases is problematic. Similarly, screening for rare behaviors, such as disloyalty in the FBI, is dicey with a good test and absurd with a poor one like a polygraph.
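A short numerical check of this example is easy to write. The following sketch (in Python, one of the languages mentioned in section 13.4; the variable names are ours) evaluates the sum (13.8) and Bayes's theorem (13.6) for the numbers quoted above:

    # Bayes's theorem for the low-base-rate problem of example 13.1
    p_d = 0.001        # incidence of the disease, P(D)
    p_pos_d = 0.99     # sensitivity, P(+|D)
    p_pos_n = 0.005    # false-positive rate, P(+|N)
    p_n = 1.0 - p_d    # P(N)

    p_pos = p_pos_d * p_d + p_pos_n * p_n   # P(+) by the sum (13.8)
    p_d_pos = p_pos_d * p_d / p_pos         # P(D|+) by Bayes's theorem (13.6)
    print(p_pos, p_d_pos, 1.0 - p_d_pos)    # about 0.005985, 0.165, 0.835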

Example 13.2 (The Three-Door Problem). A prize lies behind one of three closed doors. A contestant gets to pick which door to open, but before the chosen door is opened, a door that does not lead to the prize and was not picked by the contestant swings open. Should the contestant switch and choose a different door?

We note that a contestant who picks the wrong door and switches always wins, so P(W|Sw, WD) = 1, while one who picks the right door and switches never wins, P(W|Sw, RD) = 0. Since the probability of picking the wrong door is P(WD) = 2/3, the probability of winning if one switches is

P(W|Sw) = P(W|Sw, WD) \, P(WD) + P(W|Sw, RD) \, P(RD) = 2/3 .   (13.13)

The probability of picking the right door is P(RD) = 1/3, and the probability of winning if one picks the right door and stays put is P(W|Sp, RD) = 1. So the probability of winning if one stays put is

P(W|Sp) = P(W|Sp, RD) \, P(RD) + P(W|Sp, WD) \, P(WD) = 1/3 .   (13.14)

Thus, one should switch after the door opens.
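One can also check this result by simulation. The sketch below (our Monte Carlo, not part of the text) plays the game many times with and without switching:

    import random

    def play(switch, trials=100_000):
        """Estimate the probability of winning the three-door game."""
        wins = 0
        for _ in range(trials):
            prize = random.randrange(3)   # door hiding the prize
            pick = random.randrange(3)    # contestant's first pick
            # a door that is neither the pick nor the prize swings open
            opened = random.choice([d for d in range(3) if d != pick and d != prize])
            if switch:
                pick = next(d for d in range(3) if d != pick and d != opened)
            wins += (pick == prize)
        return wins / trials

    print(play(switch=True), play(switch=False))   # about 2/3 and 1/3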

If the set A is the interval (x - dx/2, x + dx/2) of the real line, then P(A) = P(x) \, dx, and version (13.9) of Bayes's theorem says

P(x|B) = \frac{P(B|x) \, P(x)}{\int_{-\infty}^{\infty} P(B|x') \, P(x') \, dx'} .   (13.15)

Example 13.3 (A tiny poll). We ask 4 people if they will vote for Nancy Pelosi, and 3 say yes. If the probability that a random voter will vote for her is y, then the probability that 3 in our sample of 4 will is

P(3|y) = 4 \, y^3 (1 - y)   (13.16)

which is the value P_b(3, y, 4) of the binomial distribution (section 13.3, equation 13.49) for n = 3, p = y, and N = 4. We don't know the prior probability distribution P(y), so we set it equal to unity on the interval (0, 1). Then the continuous form of Bayes's theorem (13.15) and our cheap poll give the probability distribution of the fraction y who will vote for her as

P(y|3) = \frac{P(3|y) \, P(y)}{\int_0^1 P(3|y') \, P(y') \, dy'} = \frac{P(3|y)}{\int_0^1 P(3|y') \, dy'} = \frac{4 \, y^3 (1 - y)}{\int_0^1 4 \, y'^3 (1 - y') \, dy'} = 20 \, y^3 (1 - y) .   (13.17)

Our best guess then for the probability that she will win the election is

\int_{1/2}^{1} P(y|3) \, dy = \int_{1/2}^{1} 20 \, y^3 (1 - y) \, dy = \frac{13}{16}   (13.18)

which is slightly higher than the naive estimate of 3/4.
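The integral (13.18) is elementary, but it is also a one-liner to check numerically. The sketch below assumes scipy is available (our assumption, not something the text requires):

    from scipy.integrate import quad   # assumed available; the integral is elementary

    posterior = lambda y: 20 * y**3 * (1 - y)   # P(y|3) of equation (13.17)
    prob_win, _ = quad(posterior, 0.5, 1.0)     # probability that y > 1/2
    print(prob_win, 13 / 16)                    # 0.8125  0.8125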

13.2 Mean and Variance

In roulette and many other games, N outcomes x_j can occur with probabilities P_j that sum to unity

\sum_{j=1}^{N} P_j = 1 .   (13.19)

The expected value E[x] of the outcome x is its mean \mu or average value \langle x \rangle = \bar{x}

E[x] = \mu = \langle x \rangle = \bar{x} = \sum_{j=1}^{N} x_j \, P_j .   (13.20)

The expected value E[x] also is called the expectation of x or the expectation value of x (and should be called the mean value of x). The \ell th moment is

E[x^\ell] = \mu_\ell = \langle x^\ell \rangle = \sum_{j=1}^{N} x_j^\ell \, P_j   (13.21)


and the \ell th central moment is

E[(x - \mu)^\ell] = \nu_\ell = \sum_{j=1}^{N} (x_j - \mu)^\ell \, P_j   (13.22)

where always \mu_0 = \nu_0 = 1 and \nu_1 = 0 (exercise 13.3).

The variance V[x] is the second central moment \nu_2

V[x] \equiv E[(x - \langle x \rangle)^2] = \nu_2 = \sum_{j=1}^{N} (x_j - \langle x \rangle)^2 \, P_j   (13.23)

which one may write as (exercise 13.5)

V[x] = \langle x^2 \rangle - \langle x \rangle^2   (13.24)

and the standard deviation \sigma is its square root

\sigma = \sqrt{V[x]} .   (13.25)

If the values of x are distributed continuously according to a probability distribution or density P(x) normalized to unity

\int P(x) \, dx = 1   (13.26)

then the mean value is

E[x] = \mu = \langle x \rangle = \int x \, P(x) \, dx   (13.27)

and the \ell th moment is

E[x^\ell] = \mu_\ell = \langle x^\ell \rangle = \int x^\ell \, P(x) \, dx .   (13.28)

The \ell th central moment is

E[(x - \mu)^\ell] = \nu_\ell = \int (x - \mu)^\ell \, P(x) \, dx .   (13.29)

The variance of the distribution is the second central moment

V[x] = \nu_2 = \int (x - \langle x \rangle)^2 \, P(x) \, dx = \mu_2 - \mu^2   (13.30)

and the standard deviation \sigma is its square root \sigma = \sqrt{V[x]}.

Many authors use f(x) for the probability distribution P(x) and F(x) for


the cumulative probability \Pr(-\infty, x) of an outcome in the interval (-\infty, x)

F(x) \equiv \Pr(-\infty, x) = \int_{-\infty}^{x} P(x') \, dx' = \int_{-\infty}^{x} f(x') \, dx'   (13.31)

a function that is necessarily monotonic

F'(x) = \Pr'(-\infty, x) = f(x) = P(x) \ge 0 .   (13.32)

Some mathematicians reserve the term probability distribution for probabilities like \Pr(-\infty, x) and P_j and call a continuous distribution P(x) a probability density function. But the usage in physics of the Maxwell-Boltzmann distribution is too widespread for me to observe this distinction.

Although a probability distribution P(x) is normalized (13.26), it can have fat tails, which are important in financial applications (Bouchaud and Potters, 2003). Fat tails can make the variance and even the mean absolute deviation

E_{\mathrm{abs}} \equiv \int |x - \mu| \, P(x) \, dx   (13.33)

diverge.

Example 13.4 (Heisenberg's Uncertainty Principle). In quantum mechanics, the absolute-value squared |\psi(x)|^2 of a wave function \psi(x) is the probability distribution P(x) = |\psi(x)|^2 of the position x of the particle, and P(x) \, dx is the probability that the particle is found between x - dx/2 and x + dx/2. The variance \langle (x - \langle x \rangle)^2 \rangle of the position operator x is written as the square (\Delta x)^2 of the standard deviation \sigma = \Delta x, which is the uncertainty in the position of the particle. Similarly, the square of the uncertainty in the momentum (\Delta p)^2 is the variance \langle (p - \langle p \rangle)^2 \rangle of the momentum.

For the wave function (3.73)

\psi(x) = \left( \frac{2}{\pi} \right)^{1/4} \frac{1}{\sqrt{a}} \, e^{-(x/a)^2} ,   (13.34)

these uncertainties are \Delta x = a/2 and \Delta p = \hbar/a. They provide a saturated example \Delta x \, \Delta p = \hbar/2 of Heisenberg's uncertainty principle

\Delta x \, \Delta p \ge \frac{\hbar}{2} .   (13.35)

If x and y are two random variables that occur with a joint distribution P(x, y), then the expected value of the linear combination a x^n y^m + b x^p y^q is

E[a x^n y^m + b x^p y^q] = \int (a x^n y^m + b x^p y^q) \, P(x, y) \, dx \, dy
= a \int x^n y^m \, P(x, y) \, dx \, dy + b \int x^p y^q \, P(x, y) \, dx \, dy
= a \, E[x^n y^m] + b \, E[x^p y^q] .   (13.36)

This result and its analog for discrete probability distributions show that expected values are linear.

Example 13.5 (Jensen's inequalities). A convex function is one that lies above its tangents:

f(x) \ge f(y) + (x - y) f'(y) .   (13.37)

For example, e^x lies above 1 + x, which is its tangent at x = 0. Multiplying both sides of the definition (13.37) by the probability distribution P(x) and integrating over x with y = \langle x \rangle, we find that the mean value of a convex function

\langle f(x) \rangle = \int f(x) \, P(x) \, dx \ge \int \left[ f(\langle x \rangle) + (x - \langle x \rangle) f'(\langle x \rangle) \right] P(x) \, dx = \int f(\langle x \rangle) \, P(x) \, dx = f(\langle x \rangle)   (13.38)

exceeds its value at \langle x \rangle. Equivalently, E[f(x)] \ge f(E[x]).

For a concave function, the inequalities (13.37) and (13.38) reverse, and f(E[x]) \ge E[f(x)]. Thus since \log(x) is concave, we have

\log(E[x]) = \log\left( \frac{1}{n} \sum_{i=1}^{n} x_i \right) \ge E[\log(x)] = \sum_{i=1}^{n} \frac{1}{n} \log(x_i) .   (13.39)

Exponentiating both sides, we get the inequality of arithmetic and geometric means

\frac{1}{n} \sum_{i=1}^{n} x_i \ge \left( \prod_{i=1}^{n} x_i \right)^{1/n}   (13.40)

(Johan Jensen, 1859–1925).

The correlation coefficient or covariance of two variables x and y that occur with a joint distribution P(x, y) is

C[x, y] \equiv \int P(x, y) \, (x - \bar{x})(y - \bar{y}) \, dx \, dy = \langle (x - \bar{x})(y - \bar{y}) \rangle = \langle x y \rangle - \langle x \rangle \langle y \rangle .   (13.41)


The variables x and y are said to be independent if

P(x, y) = P(x) \, P(y) .   (13.42)

Independence implies that the covariance vanishes, but C[x, y] = 0 does not guarantee that x and y are independent (Roe, 2001, p. 9).

The variance of x + y

\langle (x + y)^2 \rangle - \langle x + y \rangle^2 = \langle x^2 \rangle - \langle x \rangle^2 + \langle y^2 \rangle - \langle y \rangle^2 + 2 \left( \langle x y \rangle - \langle x \rangle \langle y \rangle \right)   (13.43)

is the sum

V[x + y] = V[x] + V[y] + 2 \, C[x, y] .   (13.44)

It follows (exercise 13.6) that for any constants a and b the variance of a x + b y is

V[a x + b y] = a^2 V[x] + b^2 V[y] + 2 a b \, C[x, y] .   (13.45)

More generally (exercise 13.7), the variance of the sum a_1 x_1 + a_2 x_2 + \dots + a_N x_N is

V[a_1 x_1 + \dots + a_N x_N] = \sum_{j=1}^{N} a_j^2 \, V[x_j] + \sum_{j,k=1, \, j<k}^{N} 2 a_j a_k \, C[x_j, x_k] .   (13.46)

If the variables x_j and x_k are independent for j \ne k, then their covariances vanish, C[x_j, x_k] = 0, and the variance of the sum a_1 x_1 + \dots + a_N x_N is

V[a_1 x_1 + \dots + a_N x_N] = \sum_{j=1}^{N} a_j^2 \, V[x_j] .   (13.47)

13.3 The binomial distribution

If the probability of success is p on each try, then we expect that in N tries the mean number of successes will be

\langle n \rangle = N p .   (13.48)

The probability of failure on each try is q = 1 - p. So the probability of a particular sequence of successes and failures, such as n successes followed by N - n failures, is p^n q^{N-n}. There are N!/[n! \, (N - n)!] different sequences of n successes and N - n failures, all with the same probability p^n q^{N-n}.


[Figure 13.1: If the probability of success on any try is p, then the probability P_b(n, p, N) of n successes in N tries is given by equation (13.49). For p = 0.2, this binomial probability distribution P_b(n, p, N) is plotted against n for N = 125 (solid), 250 (dashes), 500 (dot-dash), and 1000 tries (dots).]

So the probability of n successes (and N - n failures) in N tries is

P_b(n, p, N) = \frac{N!}{n! \, (N - n)!} \, p^n q^{N-n} = \binom{N}{n} p^n (1 - p)^{N-n} .   (13.49)

This binomial distribution also is called Bernoulli's distribution (Jacob Bernoulli, 1654–1705).

The sum (4.93) of the probabilities P_b(n, p, N) for n = 0, 1, 2, \dots, N is unity

\sum_{n=0}^{N} P_b(n, p, N) = \sum_{n=0}^{N} \binom{N}{n} p^n (1 - p)^{N-n} = (p + 1 - p)^N = 1 .   (13.50)


In Fig. 13.1, the probabilities P_b(n, p, N) for 0 \le n \le 250 and p = 0.2 are plotted for N = 125, 250, 500, and 1000 tries.

The mean number of successes

\mu = \langle n \rangle_B = \sum_{n=0}^{N} n \, P_b(n, p, N) = \sum_{n=0}^{N} n \binom{N}{n} p^n q^{N-n}   (13.51)

is a partial derivative with respect to p with q held fixed

\langle n \rangle_B = p \frac{\partial}{\partial p} \sum_{n=0}^{N} \binom{N}{n} p^n q^{N-n} = p \frac{\partial}{\partial p} (p + q)^N = N p \, (p + q)^{N-1} = N p   (13.52)

which verifies the estimate (13.48).

One may show (exercise 13.9) that the variance (13.23) of the binomial distribution is

V_B = \langle (n - \langle n \rangle)^2 \rangle = p \, (1 - p) \, N .   (13.53)

Its standard deviation (13.25) is

\sigma_B = \sqrt{V_B} = \sqrt{p \, (1 - p) \, N} .   (13.54)

The ratio of the width to the mean

\frac{\sigma_B}{\langle n \rangle_B} = \frac{\sqrt{p \, (1 - p) \, N}}{N p} = \sqrt{\frac{1 - p}{N p}}   (13.55)

decreases with N as 1/\sqrt{N}.
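These formulas are easy to verify numerically. A minimal sketch (Python; the values p = 0.2 and N = 1000 match the largest curve in Fig. 13.1) sums n P_b and (n - \langle n \rangle)^2 P_b directly:

    from math import comb

    def binomial_pmf(n, p, N):
        """Pb(n, p, N) of equation (13.49)."""
        return comb(N, n) * p**n * (1 - p)**(N - n)

    p, N = 0.2, 1000
    probs = [binomial_pmf(n, p, N) for n in range(N + 1)]
    mean = sum(n * P for n, P in enumerate(probs))
    var = sum((n - mean)**2 * P for n, P in enumerate(probs))
    print(mean, var)   # about 200 = Np and 160 = p(1-p)N, as in (13.52) and (13.53)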

Example 13.6 (Avogadro's number). A mole of gas is Avogadro's number N_A = 6 \times 10^{23} of molecules. If the gas is in a cubical box, then the chance that each molecule will be in the left half of the cube is p = 1/2. The mean number of molecules there is \langle n \rangle_b = p N_A = 3 \times 10^{23}, and the uncertainty in n is \sigma_b = \sqrt{p (1 - p) N} = \sqrt{3 \times 10^{23}/4} \approx 3 \times 10^{11}. So the numbers of gas molecules in the two halves of the box are equal to within \sigma_b / \langle n \rangle_b = 10^{-12}, or to 1 part in 10^{12}.

Example 13.7 (Counting fluorescent molecules). Molecular biologists can insert the DNA that codes for a fluorescent protein next to the DNA that codes for a specific natural protein in the genome of a bacterium. The bacterium then will make its natural protein with the fluorescent protein attached to it, and the labeled protein will produce light of a specific color when suitably illuminated by a laser. The intensity I of the light is proportional to the number n of labeled protein molecules, I = \alpha n, and one can find the constant of proportionality \alpha by measuring the light given off as the bacteria divide. When a bacterium divides, it randomly separates the total number N of fluorescent proteins inside it into its two daughter bacteria, giving one n fluorescent molecules and the other N - n. The variance of the difference is

\langle (n - (N - n))^2 \rangle = \langle (2n - N)^2 \rangle = 4 \langle n^2 \rangle - 4 \langle n \rangle N + N^2 .   (13.56)

The mean number (13.52) is \langle n \rangle = p N, and our variance formula (13.53) tells us that

\langle n^2 \rangle = \langle n \rangle^2 + p(1 - p) N = (p N)^2 + p (1 - p) N .   (13.57)

Since the probability p = 1/2, the variance of the difference is

\langle (n - (N - n))^2 \rangle = (2p - 1)^2 N^2 + 4 p (1 - p) N = N .   (13.58)

Thus the ratio of the variance of the difference of the daughters' intensities to the intensity of the parent bacterium reveals the unknown constant of proportionality \alpha (Phillips et al., 2012)

\frac{\langle (I_n - I_{N-n})^2 \rangle}{\langle I_N \rangle} = \frac{\alpha^2 \langle (n - (N - n))^2 \rangle}{\alpha \langle N \rangle} = \frac{\alpha^2 N}{\alpha N} = \alpha .   (13.59)

13.4 Coping with big factorials

Because n! increases very rapidly with n, the rule

P_b(k + 1, p, n) = \frac{p}{1 - p} \, \frac{n - k}{k + 1} \, P_b(k, p, n)   (13.60)

is helpful when n is big. But when n exceeds a few hundred, the formula (13.49) for P_b(k, p, n) becomes unmanageable even in quadruple precision. One solution is to work with the logarithm of the expression of interest. The Fortran function log_gamma(x), the C function lgamma(x), the Matlab function gammaln(x), and the Python function loggamma(x) all give \log(\Gamma(x)) = \log((x - 1)!) for real x. Using the very tame logarithm of the gamma function, one may compute P_b(k, p, n) even for n = 10^7 as

\binom{n}{k} p^k q^{n-k} = \exp\left[ \log(\Gamma(n + 1)) - \log(\Gamma(n - k + 1)) - \log(\Gamma(k + 1)) + k \log p + (n - k) \log q \right] .   (13.61)
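In Python, for example, the standard-library function math.lgamma returns \log \Gamma(x); a minimal sketch of (13.61) using it (our code, not part of the text) is

    from math import exp, lgamma, log

    def binomial_pmf_log(k, p, n):
        """Pb(k, p, n) computed through logarithms of gamma functions, eq. (13.61)."""
        log_pb = (lgamma(n + 1) - lgamma(n - k + 1) - lgamma(k + 1)
                  + k * log(p) + (n - k) * log(1.0 - p))
        return exp(log_pb)

    print(binomial_pmf_log(2_000_000, 0.2, 10_000_000))   # n = 10^7 poses no problem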

Another way to cope with huge factorials is to use Stirling's formula (4.41), n! \approx \sqrt{2 \pi n} \, (n/e)^n, or Srinivasa Ramanujan's correction (4.42), or Mermin's even more accurate approximations (4.43–4.45).

A third way to cope with the unwieldy factorials in the binomial formula P_b(k, p, n) is to use its limiting forms due to Poisson and to Gauss.

13.5 Poisson’s distribution

Poisson approximated the formula (13.49) for the binomial distribution P_b(n, p, N) by taking the two limits N \to \infty and p = \langle n \rangle / N \to 0 while keeping n and the product p N = \langle n \rangle constant. Using Stirling's formula n! \approx \sqrt{2 \pi n} \, (n/e)^n (5.326) for the two huge factorials N! and (N - n)!, we get as n/N \to 0 and \langle n \rangle / N \to 0 with \langle n \rangle = p N kept fixed

P_b(n, p, N) = \binom{N}{n} p^n (1 - p)^{N-n} = \frac{N!}{(N - n)!} \frac{p^n}{n!} (1 - p)^{N-n}
\approx \sqrt{\frac{N}{N - n}} \left( \frac{N}{e} \right)^N \left( \frac{e}{N - n} \right)^{N-n} \frac{p^n}{n!} (1 - p)^{N-n}
\approx e^{-n} \left( 1 - \frac{n}{N} \right)^{-N+n} \frac{\langle n \rangle^n}{n!} \left( 1 - \frac{\langle n \rangle}{N} \right)^{N-n} .   (13.62)

So using the definition \exp(-x) = \lim_{N \to \infty} (1 - x/N)^N to take the limits

\left( 1 - \frac{n}{N} \right)^{-N} \left( 1 - \frac{n}{N} \right)^{n} \to e^{n} \quad \text{and} \quad \left( 1 - \frac{\langle n \rangle}{N} \right)^{N} \left( 1 - \frac{\langle n \rangle}{N} \right)^{-n} \to e^{-\langle n \rangle} ,   (13.63)

we get from the binomial distribution Poisson's estimate

P_P(n, \langle n \rangle) = \frac{\langle n \rangle^n}{n!} \, e^{-\langle n \rangle}   (13.64)

of the probability of n successes in a very large number N of tries, each with a tiny chance p = \langle n \rangle / N of success. (Siméon-Denis Poisson, 1781–1840. Incidentally, poisson means fish and sounds like pwahsahn.)

The Poisson distribution is normalized to unity

\sum_{n=0}^{\infty} P_P(n, \langle n \rangle) = \sum_{n=0}^{\infty} \frac{\langle n \rangle^n}{n!} \, e^{-\langle n \rangle} = e^{\langle n \rangle} \, e^{-\langle n \rangle} = 1 .   (13.65)


Its mean \mu is the parameter \langle n \rangle = p N of the binomial distribution

\mu = \sum_{n=0}^{\infty} n \, P_P(n, \langle n \rangle) = \sum_{n=1}^{\infty} n \frac{\langle n \rangle^n}{n!} e^{-\langle n \rangle} = \langle n \rangle \sum_{n=1}^{\infty} \frac{\langle n \rangle^{n-1}}{(n-1)!} e^{-\langle n \rangle} = \langle n \rangle \sum_{n=0}^{\infty} \frac{\langle n \rangle^n}{n!} e^{-\langle n \rangle} = \langle n \rangle .   (13.66)

As N \to \infty and p \to 0 with p N = \langle n \rangle fixed, the variance (13.53) of the binomial distribution tends to the limit

V_P = \lim_{N \to \infty, \, p \to 0} V_B = \lim_{N \to \infty, \, p \to 0} p \, (1 - p) \, N = \langle n \rangle .   (13.67)

Thus the mean and the variance of a Poisson distribution are equal

V_P = \langle (n - \langle n \rangle)^2 \rangle = \langle n \rangle = \mu   (13.68)

as one may show directly (exercise 13.11).

Example 13.8 (Accuracy of Poisson's distribution). If p = 0.0001 and N = 10{,}000, then \langle n \rangle = 1, and Poisson's approximation to the probability that n = 2 is 1/(2e). The exact binomial probability (13.61) and Poisson's estimate are P_b(2, 0.0001, 10^4) = 0.18395 and P_P(2, 1) = 0.18394.
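A sketch that reproduces these two numbers (our code, using only the standard library):

    from math import comb, exp, factorial

    p, N, n = 0.0001, 10_000, 2
    exact = comb(N, n) * p**n * (1 - p)**(N - n)        # binomial probability (13.49)
    poisson = (p * N)**n * exp(-p * N) / factorial(n)   # Poisson estimate (13.64)
    print(exact, poisson)                               # 0.18395...  0.18394...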

Example 13.9 (Coherent states). The coherent state |\alpha\rangle introduced in equation (2.146)

|\alpha\rangle = e^{-|\alpha|^2/2} \, e^{\alpha a^\dagger} |0\rangle = e^{-|\alpha|^2/2} \sum_{n=0}^{\infty} \frac{\alpha^n}{\sqrt{n!}} \, |n\rangle   (13.69)

is an eigenstate a |\alpha\rangle = \alpha |\alpha\rangle of the annihilation operator a with eigenvalue \alpha. The probability P(n) of finding n quanta in the state |\alpha\rangle is the square of the absolute value of the inner product \langle n | \alpha \rangle

P(n) = |\langle n | \alpha \rangle|^2 = \frac{|\alpha|^{2n}}{n!} \, e^{-|\alpha|^2}   (13.70)

which is a Poisson distribution P(n) = P_P(n, |\alpha|^2) with mean and variance \mu = \langle n \rangle = V(\alpha) = |\alpha|^2.

Example 13.10 (Radiation and cancer). If a cell becomes cancerous only after being hit N times by ionizing radiation, then the probability of cancer P(\langle n \rangle)_N rises with the dose or mean number \langle n \rangle of hits per cell as

P(\langle n \rangle)_N = \sum_{n=N}^{\infty} \frac{\langle n \rangle^n}{n!} \, e^{-\langle n \rangle}   (13.71)

or P(\langle n \rangle)_N \approx \langle n \rangle^N / N! for \langle n \rangle \ll 1. Although the incidence of cancer P(\langle n \rangle)_N rises linearly (solid) with the dose \langle n \rangle of radiation if a single hit, N = 1, can cause a cell to become cancerous, it rises more slowly if the threshold for cancer is N = 2 (dot-dash), 3 (dashes), or 4 (dots) hits, as shown in Fig. 13.2. Most mutations are harmless. The number of harmful mutations that occur before a cell becomes cancerous is about 4 but varies with the affected organ from 1 to 10 (Martincorena et al., 2017).

[Figure 13.2: The incidence of cancer P(\langle n \rangle)_N rises linearly (solid) with the dose or mean number \langle n \rangle of times a cell is struck by ionizing radiation if a single hit, N = 1 (solid), can cause a cell to become cancerous. It rises more slowly if the threshold for cancer is N = 2 (dot-dash), 3 (dashes), or 4 (dots) hits.]


13.6 Gauss’s distribution

Gauss considered the binomial distribution in the limits n \to \infty and N \to \infty with the probability p fixed. In this limit, all three factorials are huge, and we may apply Stirling's formula to each of them

P_b(n, p, N) = \frac{N!}{n! \, (N-n)!} \, p^n q^{N-n}
\approx \sqrt{\frac{N}{2 \pi n (N-n)}} \left( \frac{N}{e} \right)^N \left( \frac{e}{n} \right)^n \left( \frac{e}{N-n} \right)^{N-n} p^n q^{N-n}
= \sqrt{\frac{N}{2 \pi n (N-n)}} \left( \frac{pN}{n} \right)^n \left( \frac{qN}{N-n} \right)^{N-n} .   (13.72)

This probability P_b(n, p, N) is tiny unless n is near p N, which means that n \approx p N and N - n \approx (1 - p) N = q N are comparable. So we set y = n - p N and treat y/N as small. Since n = p N + y and N - n = (1 - p) N + p N - n = q N - y, we can write the square root as

\sqrt{\frac{N}{2 \pi n (N-n)}} = \frac{1}{\sqrt{2 \pi N \, [(pN + y)/N] \, [(qN - y)/N]}} = \frac{1}{\sqrt{2 \pi p q N \, (1 + y/pN)(1 - y/qN)}} .   (13.73)

Because y remains finite as N \to \infty, the limit of the square root is

\lim_{N \to \infty} \sqrt{\frac{N}{2 \pi n (N-n)}} = \frac{1}{\sqrt{2 \pi p q N}} .   (13.74)

Substituting p N + y for n and q N - y for N - n in (13.72), we find

P_b(n, p, N) \approx \frac{1}{\sqrt{2 \pi p q N}} \left( \frac{pN}{pN + y} \right)^{pN + y} \left( \frac{qN}{qN - y} \right)^{qN - y}
= \frac{1}{\sqrt{2 \pi p q N}} \left( 1 + \frac{y}{pN} \right)^{-(pN + y)} \left( 1 - \frac{y}{qN} \right)^{-(qN - y)}   (13.75)


which implies

\ln\left[ P_b(n, p, N) \sqrt{2 \pi p q N} \right] \approx -(pN + y) \ln\left( 1 + \frac{y}{pN} \right) - (qN - y) \ln\left( 1 - \frac{y}{qN} \right) .   (13.76)

The first two terms of the power series (4.101) for \ln(1 + \epsilon) are

\ln(1 + \epsilon) \approx \epsilon - \tfrac{1}{2} \epsilon^2 .   (13.77)

So applying this expansion to the two logarithms and using the relation 1/p + 1/q = (p + q)/(pq) = 1/(pq), we get

\ln\left[ P_b(n, p, N) \sqrt{2 \pi p q N} \right] \approx -(pN + y) \left[ \frac{y}{pN} - \frac{1}{2} \left( \frac{y}{pN} \right)^2 \right] - (qN - y) \left[ -\frac{y}{qN} - \frac{1}{2} \left( \frac{y}{qN} \right)^2 \right] \approx -\frac{y^2}{2 p q N} .   (13.78)

Remembering that y = n - p N, we get Gauss's approximation to the binomial probability distribution

P_{bG}(n, p, N) = \frac{1}{\sqrt{2 \pi p q N}} \exp\left( -\frac{(n - pN)^2}{2 p q N} \right) .   (13.79)

This probability distribution is normalized

\sum_{n=0}^{\infty} \frac{1}{\sqrt{2 \pi p q N}} \exp\left( -\frac{(n - pN)^2}{2 p q N} \right) = 1   (13.80)

almost exactly for p N > 100.

Extending the integer n to a continuous variable x, we have

P_G(x, p, N) = \frac{1}{\sqrt{2 \pi p q N}} \exp\left( -\frac{(x - pN)^2}{2 p q N} \right)   (13.81)

which on the real line (-\infty, \infty) is (exercise 13.12) a normalized probability distribution with mean \langle x \rangle = \mu = p N and variance \langle (x - \mu)^2 \rangle = \sigma^2 = p q N. Replacing p N by \mu and p q N by \sigma^2, we get the standard form of Gauss's distribution

P_G(x, \mu, \sigma) = \frac{1}{\sigma \sqrt{2 \pi}} \exp\left( -\frac{(x - \mu)^2}{2 \sigma^2} \right) .   (13.82)

This distribution occurs so often in mathematics and in Nature that it is often called the normal distribution. Its odd central moments all vanish, \nu_{2n+1} = 0, and its even ones are \nu_{2n} = (2n - 1)!! \, \sigma^{2n} (exercise 13.14).


[Figure 13.3: Actin fibers in HeLa cells. Conventional (left, fuzzy) and dSTORM (right, sharp) images of actin fibers in HeLa cells. The actin is labeled with Alexa Fluor 647 Phalloidin. The white rectangles are 5 microns in length. Images courtesy of Fang Huang and Keith Lidke.]

Example 13.11 (Accuracy of Gauss's distribution). If p = 0.1 and N = 10^4, then Gauss's approximation to the probability that n = 10^3 is 1/(30\sqrt{2\pi}). The exact binomial probability (13.61) and Gauss's estimate are P_b(10^3, 0.1, 10^4) = 0.013297 and P_G(10^3, 0.1, 10^4) = 0.013298.

Example 13.12 (Single-Molecule Super-Resolution Microscopy). If the wavelength of visible light were a nanometer, microscopes would yield much sharper images. Each photon from a (single-molecule) fluorophore entering the lens of a microscope would follow ray optics and be focused within a tiny circle of about a nanometer on a detector. Instead, a photon arrives not at x = (x_1, x_2) but at y_i = (y_{1i}, y_{2i}) with gaussian probability

P(y_i) = \frac{1}{2 \pi \sigma^2} \, e^{-(y_i - x)^2 / 2\sigma^2}   (13.83)

where \sigma \approx 150 nm is about a quarter of a wavelength. What to do?

In the centroid method, one collects N \approx 500 points y_i and finds the point x that maximizes the joint probability of the N image points

P = \prod_{i=1}^{N} P(y_i) = d^N \prod_{i=1}^{N} e^{-(y_i - x)^2/(2\sigma^2)} = d^N \exp\left[ -\sum_{i=1}^{N} (y_i - x)^2/(2\sigma^2) \right]   (13.84)


where d = 1/(2\pi\sigma^2), by solving for k = 1 and 2 the equations

\frac{\partial P}{\partial x_k} = 0 = P \frac{\partial}{\partial x_k} \left[ -\sum_{i=1}^{N} (y_i - x)^2/(2\sigma^2) \right] = \frac{P}{\sigma^2} \sum_{i=1}^{N} (y_{ik} - x_k) .   (13.85)

This maximum-likelihood estimate of the image point x is the average of the observed points y_i

x = \frac{1}{N} \sum_{i=1}^{N} y_i .   (13.86)

This method is an improvement, but it is biased by auto-fluorescence and out-of-focus fluorophores. Fang Huang and Keith Lidke use direct stochastic optical reconstruction microscopy (dSTORM) to locate the image point x of the fluorophore in ways that account for the finite accuracy of their pixilated detector and the randomness of photo-detection.

Actin filaments are double helices of the protein actin some 5–9 nm wide. They occur throughout a eukaryotic cell but are concentrated near its surface and determine its shape. Together with tubulin and intermediate filaments, they form a cell's cytoskeleton. Figure 13.3 shows conventional (left, fuzzy) and dSTORM (right, sharp) images of actin filaments. The finite size of the fluorophore and the motion of the molecules of living cells limit dSTORM's improvement in resolution to a factor of 10 to 20.
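A minimal sketch of the centroid estimate (13.86), assuming numpy and using the values \sigma \approx 150 nm and N \approx 500 quoted above (the fluorophore position below is hypothetical, chosen only for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    sigma, N = 150.0, 500                  # nm and photon count, as in the text
    x_true = np.array([100.0, 200.0])      # hypothetical fluorophore position (nm)

    photons = rng.normal(x_true, sigma, size=(N, 2))   # image points y_i of (13.83)
    x_hat = photons.mean(axis=0)                       # maximum-likelihood centroid (13.86)
    print(x_hat, sigma / np.sqrt(N))   # estimate and its expected error, about 6.7 nm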

13.7 The error function erf

The probability that a random variable x distributed according to Gauss's distribution (13.82) has a value between \mu - \delta and \mu + \delta is

P(|x - \mu| < \delta) = \int_{\mu - \delta}^{\mu + \delta} P_G(x, \mu, \sigma) \, dx = \frac{1}{\sigma \sqrt{2\pi}} \int_{\mu - \delta}^{\mu + \delta} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right) dx
= \frac{1}{\sigma \sqrt{2\pi}} \int_{-\delta}^{\delta} \exp\left( -\frac{x^2}{2\sigma^2} \right) dx = \frac{2}{\sqrt{\pi}} \int_{0}^{\delta/\sigma\sqrt{2}} e^{-t^2} dt .   (13.87)

The last integral is the error function

\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2} dt   (13.88)

so in terms of it the probability that x lies within \delta of the mean \mu is

P(|x - \mu| < \delta) = \mathrm{erf}\left( \frac{\delta}{\sigma \sqrt{2}} \right) .   (13.89)


[Figure 13.4: The error function erf(x) is plotted for 0 < x < 2.5. The vertical lines are at x = \delta/(\sigma\sqrt{2}) for \delta = \sigma, 2\sigma, and 3\sigma with \sigma = 1/\sqrt{2}.]

In particular, the probabilities that x falls within one, two, or three standard deviations of \mu are

P(|x - \mu| < \sigma) = \mathrm{erf}(1/\sqrt{2}) = 0.6827
P(|x - \mu| < 2\sigma) = \mathrm{erf}(2/\sqrt{2}) = 0.9545
P(|x - \mu| < 3\sigma) = \mathrm{erf}(3/\sqrt{2}) = 0.9973 .   (13.90)

The error function erf(x) is plotted in Fig. 13.4, in which the vertical lines are at x = \delta/(\sigma\sqrt{2}) for \delta = \sigma, 2\sigma, and 3\sigma.

The probability that x falls between a and b is (exercise 13.15)

P(a < x < b) = \frac{1}{2} \left[ \mathrm{erf}\left( \frac{b - \mu}{\sigma\sqrt{2}} \right) - \mathrm{erf}\left( \frac{a - \mu}{\sigma\sqrt{2}} \right) \right] .   (13.91)
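The error function also is available in Python's math module; a minimal sketch that reproduces the probabilities (13.90):

    from math import erf, sqrt

    for k in (1, 2, 3):               # one, two, or three standard deviations
        print(k, erf(k / sqrt(2)))    # 0.6827, 0.9545, 0.9973 as in (13.90)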

In particular, the cumulative probability P(-\infty, x) that the random variable is less than x is, for \mu = 0 and \sigma = 1,

P(-\infty, x) = \frac{1}{2} \left[ \mathrm{erf}\left( \frac{x}{\sqrt{2}} \right) - \mathrm{erf}\left( \frac{-\infty}{\sqrt{2}} \right) \right] = \frac{1}{2} \left[ \mathrm{erf}\left( \frac{x}{\sqrt{2}} \right) + 1 \right] .   (13.92)

The complement erfc of the error function is defined as

\mathrm{erfc}(x) = \frac{2}{\sqrt{\pi}} \int_x^{\infty} e^{-t^2} dt = 1 - \mathrm{erf}(x)   (13.93)

and is numerically useful for large x, where round-off errors may occur in subtracting erf(x) from unity. Both erf and erfc are intrinsic functions in Fortran, available without any effort on the part of the programmer.

Example 13.13 (Summing Binomial Probabilities). To add up several binomial probabilities when the factorials in P_b(n, p, N) are too big to handle, we first use Gauss's approximation (13.79)

P_b(n, p, N) = \frac{N!}{n! \, (N-n)!} \, p^n q^{N-n} \approx \frac{1}{\sqrt{2 \pi p q N}} \exp\left( -\frac{(n - pN)^2}{2 p q N} \right) .   (13.94)

Then using (13.91) with \mu = p N, we find (exercise 13.13)

P_b(n, p, N) \approx \frac{1}{2} \left[ \mathrm{erf}\left( \frac{n + \tfrac{1}{2} - pN}{\sqrt{2 p q N}} \right) - \mathrm{erf}\left( \frac{n - \tfrac{1}{2} - pN}{\sqrt{2 p q N}} \right) \right]   (13.95)

which we can sum over the integer n to get

\sum_{n=n_1}^{n_2} P_b(n, p, N) \approx \frac{1}{2} \left[ \mathrm{erf}\left( \frac{n_2 + \tfrac{1}{2} - pN}{\sqrt{2 p q N}} \right) - \mathrm{erf}\left( \frac{n_1 - \tfrac{1}{2} - pN}{\sqrt{2 p q N}} \right) \right]   (13.96)

which is easy to evaluate.
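A sketch comparing the approximation (13.96) with a direct sum for a case in which the factorials are still manageable (our choice of n_1, n_2, p, and N):

    from math import comb, erf, sqrt

    def erf_sum(n1, n2, p, N):
        """Approximate sum of Pb(n, p, N) for n1 <= n <= n2, equation (13.96)."""
        s = sqrt(2.0 * p * (1.0 - p) * N)
        return 0.5 * (erf((n2 + 0.5 - p * N) / s) - erf((n1 - 0.5 - p * N) / s))

    p, N = 0.2, 1000
    direct = sum(comb(N, n) * p**n * (1 - p)**(N - n) for n in range(180, 221))
    print(direct, erf_sum(180, 220, p, N))   # both are close to 0.895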

Example 13.14 (Polls). Suppose in a poll of 1000 likely voters, 600 have said they would vote for Nancy Pelosi. Repeating the analysis of example 13.3, we see that if the probability that a random voter will vote for her is y, then the probability that 600 in our sample of 1000 will is by (13.94)

P(600|y) = P_b(600, y) = \binom{1000}{600} y^{600} (1 - y)^{400} \approx \frac{1}{10\sqrt{20\pi \, y(1-y)}} \exp\left( -\frac{20 \, (3 - 5y)^2}{y (1 - y)} \right)   (13.97)

and so the probability density that a fraction y of the voters will vote for her is

P(y|600) = \frac{P(600|y)}{\int_0^1 P(600|y') \, dy'} = \frac{[y(1-y)]^{-1/2} \exp\left( -\frac{20(3-5y)^2}{y(1-y)} \right)}{\int_0^1 [y'(1-y')]^{-1/2} \exp\left( -\frac{20(3-5y')^2}{y'(1-y')} \right) dy'} .   (13.98)

This normalized probability distribution is negligible except for y near 3/5 (exercise 13.16), where it is approximately Gauss's distribution

P(y|600) \approx \frac{1}{\sigma\sqrt{2\pi}} \exp\left( -\frac{(y - 3/5)^2}{2\sigma^2} \right)   (13.99)

with mean \mu = 3/5 and variance

\sigma^2 = \frac{3}{12500} = 2.4 \times 10^{-4} .   (13.100)

The probability that y > 1/2 then is by (13.91)

P(\tfrac{1}{2} < y < 1) = \frac{1}{2} \left[ \mathrm{erf}\left( \frac{1 - \mu}{\sigma\sqrt{2}} \right) - \mathrm{erf}\left( \frac{1/2 - \mu}{\sigma\sqrt{2}} \right) \right] = \frac{1}{2} \left[ \mathrm{erf}\left( \frac{20}{\sqrt{1.2}} \right) - \mathrm{erf}\left( \frac{-5}{\sqrt{1.2}} \right) \right] \approx 1 .   (13.101)

The probability that y < 1/2 is 5.4 \times 10^{-11}.

13.8 Error analysis

The mean value \bar{f} = \langle f \rangle of a smooth function f(x) of a random variable x is

\bar{f} = \int f(x) \, P(x) \, dx \approx \int \left[ f(\mu) + (x - \mu) f'(\mu) + \frac{1}{2} (x - \mu)^2 f''(\mu) \right] P(x) \, dx = f(\mu) + \frac{1}{2} \sigma^2 f''(\mu)   (13.102)


as long as the higher central moments \nu_n and the higher derivatives f^{(n)}(\mu) are small. The mean value of f^2 then is

\langle f^2 \rangle = \int f^2(x) \, P(x) \, dx \approx \int \left[ f(\mu) + (x - \mu) f'(\mu) + \frac{1}{2}(x - \mu)^2 f''(\mu) \right]^2 P(x) \, dx
\approx \int \left[ f^2(\mu) + (x - \mu)^2 f'^2(\mu) + (x - \mu)^2 f(\mu) f''(\mu) \right] P(x) \, dx = f^2(\mu) + \sigma^2 f'^2(\mu) + \sigma^2 f(\mu) f''(\mu) .   (13.103)

Subtraction gives the variance of the variable f(x)

\sigma_f^2 = \langle (f - \bar{f})^2 \rangle = \langle f^2 \rangle - \bar{f}^2 \approx \sigma^2 f'^2(\mu) .   (13.104)

A similar formula gives the variance of a smooth function f(x_1, \dots, x_n) of several independent variables x_1, \dots, x_n as

\sigma_f^2 = \langle (f - \bar{f})^2 \rangle = \langle f^2 \rangle - \bar{f}^2 \approx \sum_{i=1}^{n} \sigma_i^2 \left( \frac{\partial f(x)}{\partial x_i} \right)^2 \bigg|_{x = \bar{x}}   (13.105)

in which \bar{x} is the vector (\mu_1, \dots, \mu_n) of mean values, and \sigma_i^2 = \langle (x_i - \mu_i)^2 \rangle is the variance of x_i.

This formula (13.105) implies that the variance of a sum f(x, y) = c x + d y is

\sigma_{cx+dy}^2 = c^2 \sigma_x^2 + d^2 \sigma_y^2 .   (13.106)

Similarly, the variance formula (13.105) gives as the variance of a product f(x, y) = x y

\sigma_{xy}^2 = \sigma_x^2 \mu_y^2 + \sigma_y^2 \mu_x^2 = \mu_x^2 \mu_y^2 \left( \frac{\sigma_x^2}{\mu_x^2} + \frac{\sigma_y^2}{\mu_y^2} \right)   (13.107)

and as the variance of a ratio f(x, y) = x/y

\sigma_{x/y}^2 = \frac{\sigma_x^2}{\mu_y^2} + \sigma_y^2 \frac{\mu_x^2}{\mu_y^4} = \frac{\mu_x^2}{\mu_y^2} \left( \frac{\sigma_x^2}{\mu_x^2} + \frac{\sigma_y^2}{\mu_y^2} \right) .   (13.108)

The variance of a power f(x) = x^a follows from the variance (13.104) of a function of a single variable

\sigma_{x^a}^2 = \sigma_x^2 \left( a \, \mu_x^{a-1} \right)^2 = a^2 \mu_x^{2(a-1)} \sigma_x^2 .   (13.109)

In general, the standard deviation \sigma is the square root of the variance \sigma^2.
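One can test these propagation formulas by direct sampling. A minimal sketch (assuming numpy; the means and standard deviations are arbitrary values of ours) checks the product rule (13.107):

    import numpy as np

    rng = np.random.default_rng(1)
    mu_x, sig_x, mu_y, sig_y = 10.0, 0.3, 4.0, 0.2      # arbitrary means and widths
    x = rng.normal(mu_x, sig_x, 1_000_000)
    y = rng.normal(mu_y, sig_y, 1_000_000)

    var_mc = (x * y).var()                                  # Monte Carlo variance of x y
    var_formula = sig_x**2 * mu_y**2 + sig_y**2 * mu_x**2   # propagation formula (13.107)
    print(var_mc, var_formula)                              # both are near 5.44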


Example 13.15 (Photon density). The 2009 COBE/FIRAS measurement of the temperature of the cosmic microwave background (CMB) radiation is T_0 = 2.7255 \pm 0.0006 K. The mass density (4.110) of these photons is

\rho_\gamma = \frac{8 \pi^5 (k_B T_0)^4}{15 h^3 c^5} = 4.6451 \times 10^{-31} \ \mathrm{kg \ m^{-3}} .   (13.110)

Our formula (13.109) for the variance of a power says that the standard deviation \sigma_\rho of the photon density is its temperature derivative times the standard deviation \sigma_T of the temperature

\sigma_\rho = \rho_\gamma \, \frac{4 \sigma_T}{T_0} = 0.00088 \, \rho_\gamma .   (13.111)

So the probability that the photon mass density lies within the range

\rho_\gamma = (4.6451 \pm 0.0041) \times 10^{-31} \ \mathrm{kg \ m^{-3}}   (13.112)

is 0.68.

13.9 The Maxwell-Boltzmann distribution

It is a small jump from Gauss's distribution (13.82) to the Maxwell-Boltzmann distribution of velocities of molecules in a gas. We start in one dimension and focus on a single molecule that is being hit fore and aft with equal probabilities by other molecules. If each tiny hit increases or decreases its speed by dv, then after n aft hits and N - n fore hits, the speed v_x of a molecule initially at rest would be

v_x = n \, dv - (N - n) \, dv = (2n - N) \, dv .   (13.113)

The probability of this speed is given by Gauss's approximation (13.79) to the binomial distribution P_b(n, \tfrac{1}{2}, N) as

P_{bG}(n, \tfrac{1}{2}, N) = \sqrt{\frac{2}{\pi N}} \exp\left( -\frac{(2n - N)^2}{2N} \right) = \sqrt{\frac{2}{\pi N}} \exp\left( -\frac{v_x^2}{2 N \, dv^2} \right) .   (13.114)

In this formula, the product N \, dv^2 is the variance \sigma_{v_x}^2, which is the mean value \langle v_x^2 \rangle because \langle v_x \rangle = 0. Kinetic theory says that this variance is \sigma_{v_x}^2 = \langle v_x^2 \rangle = kT/m, in which m is the mass of the molecule, k Boltzmann's constant, and T the temperature. So the probability of the molecule's having


velocity v_x is the Maxwell-Boltzmann distribution

P_G(v_x) = \frac{1}{\sigma_v \sqrt{2\pi}} \exp\left( -\frac{v_x^2}{2 \sigma_v^2} \right) = \sqrt{\frac{m}{2 \pi k T}} \exp\left( -\frac{m v_x^2}{2 k T} \right)   (13.115)

when normalized over the line -\infty < v_x < \infty.

In three space dimensions, the Maxwell-Boltzmann distribution P_{MB}(v) is the product

P_{MB}(v) \, d^3v = P_G(v_x) \, P_G(v_y) \, P_G(v_z) \, d^3v = \left( \frac{m}{2 \pi k T} \right)^{3/2} e^{-\frac{1}{2} m v^2/(kT)} \, 4 \pi v^2 dv .   (13.116)

The mean value of the velocity of a Maxwell-Boltzmann gas vanishes

\langle \boldsymbol{v} \rangle = \int \boldsymbol{v} \, P_{MB}(v) \, d^3v = 0   (13.117)

but the mean value of the square of the velocity v^2 = \boldsymbol{v} \cdot \boldsymbol{v} is the sum of the three variances \sigma_x^2 = \sigma_y^2 = \sigma_z^2 = kT/m

\langle v^2 \rangle = V[\boldsymbol{v}] = \int v^2 \, P_{MB}(v) \, d^3v = 3 k T / m   (13.118)

which is the familiar statement

\frac{1}{2} m \langle v^2 \rangle = \frac{3}{2} k T   (13.119)

that each degree of freedom gets kT/2 of energy.
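A sketch that samples Maxwell-Boltzmann velocities and checks \langle v^2 \rangle = 3kT/m (13.118); it assumes numpy, and the mass below is roughly that of a nitrogen molecule (our choice of numbers):

    import numpy as np

    k_B, T, m = 1.380649e-23, 300.0, 4.65e-26    # J/K, K, kg (roughly an N2 molecule)
    sigma_v = np.sqrt(k_B * T / m)               # each velocity component has variance kT/m

    rng = np.random.default_rng(2)
    v = rng.normal(0.0, sigma_v, size=(1_000_000, 3))   # Maxwell-Boltzmann velocities
    print((v**2).sum(axis=1).mean(), 3 * k_B * T / m)   # both about 2.7e5 m^2/s^2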

13.10 Fermi-Dirac and Bose-Einstein distributions

The commutation and anticommutation relations (10.131)

\psi_s(t, \boldsymbol{x}) \, \psi_{s'}(t, \boldsymbol{x}') - (-1)^{2j} \, \psi_{s'}(t, \boldsymbol{x}') \, \psi_s(t, \boldsymbol{x}) = 0   (13.120)

of Bose fields, (-1)^{2j} = 1, and of Fermi fields, (-1)^{2j} = -1, determine the statistics of bosons and fermions.

One can put any number N_n of noninteracting bosons, such as photons or gravitons, into any state |n\rangle of energy E_n. The energy of that state is N_n E_n, and the states |n, N_n\rangle form an independent thermodynamical system for each state |n\rangle. The grand canonical ensemble (1.394) gives the probability of the state |n, N_n\rangle as

\rho_{n, N_n} = \langle n, N_n | \rho | n, N_n \rangle = \frac{e^{-\beta (E_n - \mu) N_n}}{\sum_{N_n} e^{-\beta (E_n - \mu) N_n}} .   (13.121)


For each state |n\rangle, the partition function is a geometric sum

Z(\beta, \mu, n) = \sum_{N_n=0}^{\infty} e^{-\beta(E_n - \mu) N_n} = \sum_{N_n=0}^{\infty} \left( e^{-\beta(E_n - \mu)} \right)^{N_n} = \frac{1}{1 - e^{-\beta(E_n - \mu)}} .   (13.122)

So the probability of the state |n, N_n\rangle is

\rho_{n, N_n} = e^{-\beta(E_n - \mu) N_n} \left( 1 - e^{-\beta(E_n - \mu)} \right) ,   (13.123)

and the mean number of bosons in the state |n\rangle is

\langle N_n \rangle = \frac{1}{\beta} \frac{\partial}{\partial \mu} \ln Z(\beta, \mu, n) = \frac{1}{e^{\beta(E_n - \mu)} - 1} .   (13.124)

One can put at most one fermion into a given state |n\rangle. If, like neutrinos, the fermions don't interact, then the states |n, 0\rangle and |n, 1\rangle form an independent thermodynamical system for each state |n\rangle. So for noninteracting fermions, the partition function is the sum of only two terms

Z(\beta, \mu, n) = \sum_{N_n=0}^{1} e^{-\beta(E_n - \mu) N_n} = 1 + e^{-\beta(E_n - \mu)} ,   (13.125)

and the probability of the state |n, N_n\rangle is

\rho_{n, N_n} = \frac{e^{-\beta(E_n - \mu) N_n}}{1 + e^{-\beta(E_n - \mu)}} .   (13.126)

So the mean number of fermions in the state |n\rangle is

\langle N_n \rangle = \frac{1}{e^{\beta(E_n - \mu)} + 1} .   (13.127)

13.11 Diffusion

We may apply the same reasoning as in the preceding section (13.9) to the diffusion of a gas of particles treated as a random walk with step size dx. In one dimension, after n steps forward and N - n steps backward, a particle starting at x = 0 is at x = (2n - N) \, dx. Thus as in (13.114), the probability of being at x is given by Gauss's approximation (13.79) to the binomial distribution P_b(n, \tfrac{1}{2}, N) as

P_{bG}(n, \tfrac{1}{2}, N) = \sqrt{\frac{2}{\pi N}} \exp\left( -\frac{(2n - N)^2}{2N} \right) = \sqrt{\frac{2}{\pi N}} \exp\left( -\frac{x^2}{2 N \, dx^2} \right) .   (13.128)


In terms of the diffusion constant

D = \frac{N \, dx^2}{2 t}   (13.129)

this distribution is

P_G(x) = \left( \frac{1}{4 \pi D t} \right)^{1/2} \exp\left( -\frac{x^2}{4 D t} \right)   (13.130)

when normalized to unity on (-\infty, \infty).

In three dimensions, this gaussian distribution is the product

P(\boldsymbol{r}, t) = P_G(x) \, P_G(y) \, P_G(z) = \left( \frac{1}{4 \pi D t} \right)^{3/2} \exp\left( -\frac{r^2}{4 D t} \right) .   (13.131)

The variance \sigma^2 = 2 D t gives the average of the squared displacement of each of the three coordinates. Thus the mean of the squared displacement \langle r^2 \rangle rises linearly with the time as

\langle r^2 \rangle = V[\boldsymbol{r}] = 3 \sigma^2 = \int r^2 \, P(\boldsymbol{r}, t) \, d^3r = 6 D t .   (13.132)

The distribution P(\boldsymbol{r}, t) satisfies the diffusion equation

\dot{P}(\boldsymbol{r}, t) = D \nabla^2 P(\boldsymbol{r}, t)   (13.133)

in which the dot means time derivative.
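A random-walk sketch (assuming numpy; we set the step size and time step to unity) that checks \langle r^2 \rangle = 6 D t (13.132):

    import numpy as np

    rng = np.random.default_rng(3)
    n_walkers, n_steps, dx, dt = 5000, 400, 1.0, 1.0
    D = dx**2 / (2 * dt)      # diffusion constant (13.129) when N steps take time N dt

    # each step is +dx or -dx with equal probability in each of the 3 directions
    steps = rng.choice([-dx, dx], size=(n_walkers, n_steps, 3))
    r = steps.sum(axis=1)                                    # positions after n_steps
    print((r**2).sum(axis=1).mean(), 6 * D * n_steps * dt)   # both about 1200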

13.12 Langevin’s Theory of Brownian Motion

Einstein made the first theory of brownian motion in 1905, but Langevin's approach (Langevin, 1908) is simpler. A tiny particle of colloidal size and mass m in a fluid is buffeted by a force \boldsymbol{F}(t) due to the 10^{21} collisions per second it suffers with the molecules of the surrounding fluid. Its equation of motion is

m \frac{d\boldsymbol{v}(t)}{dt} = \boldsymbol{F}(t) .   (13.134)

Langevin suggested that the force \boldsymbol{F}(t) is the sum of a viscous drag -\boldsymbol{v}(t)/B and a rapidly fluctuating part \boldsymbol{f}(t)

\boldsymbol{F}(t) = -\boldsymbol{v}(t)/B + \boldsymbol{f}(t)   (13.135)

so that

m \frac{d\boldsymbol{v}(t)}{dt} = -\frac{\boldsymbol{v}(t)}{B} + \boldsymbol{f}(t) .   (13.136)


The parameter B is called the mobility. The ensemble average (the average over the set of particles) of the fluctuating force \boldsymbol{f}(t) is zero

\langle \boldsymbol{f}(t) \rangle = 0 .   (13.137)

Thus the ensemble average of the velocity satisfies

m \frac{d \langle \boldsymbol{v} \rangle}{dt} = -\frac{\langle \boldsymbol{v} \rangle}{B}   (13.138)

whose solution with \tau = m B is

\langle \boldsymbol{v}(t) \rangle = \langle \boldsymbol{v}(0) \rangle \, e^{-t/\tau} .   (13.139)

The instantaneous equation (13.136) divided by the mass m is

\frac{d\boldsymbol{v}(t)}{dt} = -\frac{\boldsymbol{v}(t)}{\tau} + \boldsymbol{a}(t)   (13.140)

in which \boldsymbol{a}(t) = \boldsymbol{f}(t)/m is the acceleration. The ensemble average of the scalar product of the position vector \boldsymbol{r} with this equation is

\left\langle \boldsymbol{r} \cdot \frac{d\boldsymbol{v}}{dt} \right\rangle = -\frac{\langle \boldsymbol{r} \cdot \boldsymbol{v} \rangle}{\tau} + \langle \boldsymbol{r} \cdot \boldsymbol{a} \rangle .   (13.141)

But since the ensemble average \langle \boldsymbol{r} \cdot \boldsymbol{a} \rangle of the scalar product of the position vector \boldsymbol{r} with the random, fluctuating part \boldsymbol{a} of the acceleration vanishes, we have

\left\langle \boldsymbol{r} \cdot \frac{d\boldsymbol{v}}{dt} \right\rangle = -\frac{\langle \boldsymbol{r} \cdot \boldsymbol{v} \rangle}{\tau} .   (13.142)

Now

\frac{1}{2} \frac{d r^2}{dt} = \frac{1}{2} \frac{d}{dt} (\boldsymbol{r} \cdot \boldsymbol{r}) = \boldsymbol{r} \cdot \boldsymbol{v}   (13.143)

and so

\frac{1}{2} \frac{d^2 r^2}{dt^2} = \boldsymbol{r} \cdot \frac{d\boldsymbol{v}}{dt} + v^2 .   (13.144)

The ensemble average of this equation is

\frac{d^2 \langle r^2 \rangle}{dt^2} = 2 \left\langle \boldsymbol{r} \cdot \frac{d\boldsymbol{v}}{dt} \right\rangle + 2 \langle v^2 \rangle   (13.145)

or in view of (13.142)

\frac{d^2 \langle r^2 \rangle}{dt^2} = -\frac{2 \langle \boldsymbol{r} \cdot \boldsymbol{v} \rangle}{\tau} + 2 \langle v^2 \rangle .   (13.146)


We now use (13.143) to replace \langle \boldsymbol{r} \cdot \boldsymbol{v} \rangle with half the first time derivative of \langle r^2 \rangle, so that we have

\frac{d^2 \langle r^2 \rangle}{dt^2} = -\frac{1}{\tau} \frac{d \langle r^2 \rangle}{dt} + 2 \langle v^2 \rangle .   (13.147)

If the fluid is in equilibrium, then the ensemble average of v^2 is given by the Maxwell-Boltzmann value (13.119)

\langle v^2 \rangle = \frac{3 k T}{m}   (13.148)

and so the acceleration (13.147) is

\frac{d^2 \langle r^2 \rangle}{dt^2} + \frac{1}{\tau} \frac{d \langle r^2 \rangle}{dt} = \frac{6 k T}{m} ,   (13.149)

which we can integrate. The general solution (6.13) to a second-order linear inhomogeneous differential equation is the sum of any particular solution to the inhomogeneous equation plus the general solution of the homogeneous equation. The function \langle r^2(t) \rangle_{pi} = 6 k T \tau t / m is a particular solution of the inhomogeneous equation. The general solution to the homogeneous equation is \langle r^2(t) \rangle_{gh} = U + W \exp(-t/\tau), where U and W are constants. So \langle r^2(t) \rangle is

\langle r^2(t) \rangle = U + W e^{-t/\tau} + 6 k T \tau t / m   (13.150)

where U and W make \langle r^2(t) \rangle fit the boundary conditions. If the individual particles start out at the origin \boldsymbol{r} = 0, then one boundary condition is

\langle r^2(0) \rangle = 0   (13.151)

which implies that

U + W = 0 .   (13.152)

And since the particles start out at \boldsymbol{r} = 0 with an isotropic distribution of initial velocities, the formula (13.143) for r^2 implies that at t = 0

\frac{d \langle r^2 \rangle}{dt} \bigg|_{t=0} = 2 \langle \boldsymbol{r}(0) \cdot \boldsymbol{v}(0) \rangle = 0 .   (13.153)

This boundary condition means that our solution (13.150) must satisfy

\frac{d \langle r^2(t) \rangle}{dt} \bigg|_{t=0} = -\frac{W}{\tau} + \frac{6 k T \tau}{m} = 0 .   (13.154)

Thus W = -U = 6 k T \tau^2 / m, and so our solution (13.150) is

\langle r^2(t) \rangle = \frac{6 k T \tau^2}{m} \left( \frac{t}{\tau} + e^{-t/\tau} - 1 \right) .   (13.155)


At times short compared to \tau, the first two terms in the power series for the exponential \exp(-t/\tau) cancel the terms -1 + t/\tau, leaving

\langle r^2(t) \rangle = \frac{6 k T \tau^2}{m} \left( \frac{t^2}{2 \tau^2} \right) = \frac{3 k T}{m} t^2 = \langle v^2 \rangle \, t^2 .   (13.156)

But at times long compared to \tau, the exponential vanishes, leaving

\langle r^2(t) \rangle = \frac{6 k T \tau}{m} t = 6 B k T t .   (13.157)

The diffusion constant D is defined by

\langle r^2(t) \rangle = 6 D t   (13.158)

and so we arrive at Einstein's relation

D = B k T   (13.159)

which often is written in terms of the viscous-friction coefficient v

v \equiv \frac{1}{B} = \frac{m}{\tau}   (13.160)

as

v D = k T .   (13.161)

This equation expresses Boltzmann's constant k in terms of three quantities v, D, and T that were accessible to measurement in the first decade of the 20th century. It enabled scientists to measure Boltzmann's constant k for the first time. And since Avogadro's number N_A was the known gas constant R divided by k, the number of molecules in a mole was revealed to be N_A = 6.022 \times 10^{23}. Chemists could then divide the mass of a mole of any pure substance by 6.022 \times 10^{23} and find the mass of the molecules that composed it. Suddenly the masses of the molecules of chemistry became known, and molecules were recognized as real particles and not tricks for balancing chemical equations.

13.13 The Einstein-Nernst relation

If a particle of mass m carries an electric charge q and is exposed to an electric field \boldsymbol{E}, then in addition to the viscosity -\boldsymbol{v}/B and random buffeting \boldsymbol{f}, the constant force q\boldsymbol{E} acts on it

m \frac{d\boldsymbol{v}}{dt} = -\frac{\boldsymbol{v}}{B} + q\boldsymbol{E} + \boldsymbol{f} .   (13.162)


The mean value of its velocity satisfies the differential equation

\left\langle \frac{d\boldsymbol{v}}{dt} \right\rangle = -\frac{\langle \boldsymbol{v} \rangle}{\tau} + \frac{q\boldsymbol{E}}{m}   (13.163)

where \tau = m B. A particular solution of this inhomogeneous equation is

\langle \boldsymbol{v}(t) \rangle_{pi} = \frac{q \tau \boldsymbol{E}}{m} = q B \boldsymbol{E} .   (13.164)

The general solution of its homogeneous version is \langle \boldsymbol{v}(t) \rangle_{gh} = \boldsymbol{A} \exp(-t/\tau), in which the constant \boldsymbol{A} is chosen to give \langle \boldsymbol{v}(0) \rangle at t = 0. So by (6.13), the general solution \langle \boldsymbol{v}(t) \rangle to equation (13.163) is (exercise 13.17) the sum of \langle \boldsymbol{v}(t) \rangle_{pi} and \langle \boldsymbol{v}(t) \rangle_{gh}

\langle \boldsymbol{v}(t) \rangle = q B \boldsymbol{E} + \left[ \langle \boldsymbol{v}(0) \rangle - q B \boldsymbol{E} \right] e^{-t/\tau} .   (13.165)

By applying the tricks of the previous section (13.12), one may show (exercise 13.18) that the variance of the position \boldsymbol{r} about its mean \langle \boldsymbol{r}(t) \rangle is

\left\langle (\boldsymbol{r} - \langle \boldsymbol{r}(t) \rangle)^2 \right\rangle = \frac{6 k T \tau^2}{m} \left( \frac{t}{\tau} - 1 + e^{-t/\tau} \right)   (13.166)

where \langle \boldsymbol{r}(t) \rangle = (q \tau^2 \boldsymbol{E}/m) \left( t/\tau - 1 + e^{-t/\tau} \right) if \langle \boldsymbol{r}(0) \rangle = \langle \boldsymbol{v}(0) \rangle = 0. So for times t \gg \tau, this variance is

\left\langle (\boldsymbol{r} - \langle \boldsymbol{r}(t) \rangle)^2 \right\rangle = \frac{6 k T \tau}{m} t .   (13.167)

Since the diffusion constant D is defined by (13.158) as

\left\langle (\boldsymbol{r} - \langle \boldsymbol{r}(t) \rangle)^2 \right\rangle = 6 D t   (13.168)

we arrive at the Einstein-Nernst relation

D = B k T = \frac{\mu}{q} k T   (13.169)

in which the electric mobility is \mu = q B.

13.14 Fluctuation and Dissipation

Let’s look again at Langevin’s equation (13.140)

dv(t)

dt+

v(t)

⌧= a(t). (13.170)

If we multiply both sides by the exponential exp(t/⌧)✓dv

dt+

v

◆et/⌧ =

d

dt

⇣v et/⌧

⌘= a(t) et/⌧ (13.171)


and integrate from 0 to t

\int_0^t \frac{d}{dt'} \left( \boldsymbol{v} \, e^{t'/\tau} \right) dt' = \boldsymbol{v}(t) \, e^{t/\tau} - \boldsymbol{v}(0) = \int_0^t \boldsymbol{a}(t') \, e^{t'/\tau} \, dt'   (13.172)

then we get

\boldsymbol{v}(t) = e^{-t/\tau} \boldsymbol{v}(0) + e^{-t/\tau} \int_0^t \boldsymbol{a}(t') \, e^{t'/\tau} \, dt' .   (13.173)

Thus the ensemble average of the square of the velocity is

\langle v^2(t) \rangle = e^{-2t/\tau} \langle v^2(0) \rangle + 2 e^{-2t/\tau} \int_0^t \langle \boldsymbol{v}(0) \cdot \boldsymbol{a}(t') \rangle \, e^{t'/\tau} \, dt'
+ e^{-2t/\tau} \int_0^t \int_0^t \langle \boldsymbol{a}(t_1) \cdot \boldsymbol{a}(t_2) \rangle \, e^{(t_1 + t_2)/\tau} \, dt_1 \, dt_2 .   (13.174)

The second term on the right-hand side is zero, so we have

\langle v^2(t) \rangle = e^{-2t/\tau} \langle v^2(0) \rangle + e^{-2t/\tau} \int_0^t \int_0^t \langle \boldsymbol{a}(t_1) \cdot \boldsymbol{a}(t_2) \rangle \, e^{(t_1 + t_2)/\tau} \, dt_1 \, dt_2 .   (13.175)

The ensemble average

C(t_1, t_2) = \langle \boldsymbol{a}(t_1) \cdot \boldsymbol{a}(t_2) \rangle   (13.176)

is an example of an autocorrelation function.

All autocorrelation functions have some simple properties, which are easy to prove (Pathria, 1972, p. 458):

1. If the system is independent of time, then its autocorrelation function for any given variable A(t) depends only upon the time delay s:

C(t, t + s) = \langle \boldsymbol{A}(t) \cdot \boldsymbol{A}(t + s) \rangle \equiv C(s) .   (13.177)

2. The autocorrelation function for s = 0 is necessarily non-negative

C(t, t) = \langle \boldsymbol{A}(t) \cdot \boldsymbol{A}(t) \rangle = \langle \boldsymbol{A}(t)^2 \rangle \ge 0 .   (13.178)

If the system is time independent, then C(t, t) = C(0) \ge 0.

3. The absolute value of C(t_1, t_2) is never greater than the average of C(t_1, t_1) and C(t_2, t_2) because

\langle |\boldsymbol{A}(t_1) \pm \boldsymbol{A}(t_2)|^2 \rangle = \langle \boldsymbol{A}(t_1)^2 \rangle + \langle \boldsymbol{A}(t_2)^2 \rangle \pm 2 \langle \boldsymbol{A}(t_1) \cdot \boldsymbol{A}(t_2) \rangle \ge 0   (13.179)

which implies that \mp 2 \, C(t_1, t_2) \le C(t_1, t_1) + C(t_2, t_2), or

2 \, |C(t_1, t_2)| \le C(t_1, t_1) + C(t_2, t_2) .   (13.180)


For a time-independent system, this inequality is |C(s)| \le C(0) for every time delay s.

4. If the variables \boldsymbol{A}(t_1) and \boldsymbol{A}(t_2) commute, then their autocorrelation function is symmetric

C(t_1, t_2) = \langle \boldsymbol{A}(t_1) \cdot \boldsymbol{A}(t_2) \rangle = \langle \boldsymbol{A}(t_2) \cdot \boldsymbol{A}(t_1) \rangle = C(t_2, t_1) .   (13.181)

For a time-independent system, this symmetry is C(s) = C(-s).

5. If the variable \boldsymbol{A}(t) is randomly fluctuating with zero mean, then we expect both that its ensemble average vanishes

\langle \boldsymbol{A}(t) \rangle = 0   (13.182)

and that there is some characteristic time scale T beyond which the correlation function falls to zero:

\langle \boldsymbol{A}(t_1) \cdot \boldsymbol{A}(t_2) \rangle \to \langle \boldsymbol{A}(t_1) \rangle \cdot \langle \boldsymbol{A}(t_2) \rangle = 0   (13.183)

when |t_1 - t_2| \gg T.

In terms of the autocorrelation function C(t_1, t_2) = \langle \boldsymbol{a}(t_1) \cdot \boldsymbol{a}(t_2) \rangle of the acceleration, the variance of the velocity (13.175) is

\langle v^2(t) \rangle = e^{-2t/\tau} \langle v^2(0) \rangle + e^{-2t/\tau} \int_0^t \int_0^t C(t_1, t_2) \, e^{(t_1 + t_2)/\tau} \, dt_1 \, dt_2 .   (13.184)

Since C(t_1, t_2) is big only for tiny values of |t_2 - t_1|, it makes sense to change variables to

s = t_2 - t_1 \quad \text{and} \quad w = \tfrac{1}{2}(t_1 + t_2) .   (13.185)

The element of area then is by (12.6–12.14)

dt_1 \wedge dt_2 = dw \wedge ds   (13.186)

and the limits of integration are -2w \le s \le 2w for 0 \le w \le t/2 and -2(t - w) \le s \le 2(t - w) for t/2 \le w \le t. So \langle v^2(t) \rangle is

\langle v^2(t) \rangle = e^{-2t/\tau} \langle v^2(0) \rangle + e^{-2t/\tau} \int_0^{t/2} e^{2w/\tau} dw \int_{-2w}^{2w} C(s) \, ds + e^{-2t/\tau} \int_{t/2}^{t} e^{2w/\tau} dw \int_{-2(t-w)}^{2(t-w)} C(s) \, ds .   (13.187)

Since by (13.183) the autocorrelation function C(s) vanishes outside a narrow window of width 2T, we may approximate each of the s-integrals by

C = \int_{-\infty}^{\infty} C(s) \, ds .   (13.188)


It follows then that

\langle v^2(t) \rangle = e^{-2t/\tau} \langle v^2(0) \rangle + C \, e^{-2t/\tau} \int_0^t e^{2w/\tau} dw
= e^{-2t/\tau} \langle v^2(0) \rangle + C \, e^{-2t/\tau} \, \frac{\tau}{2} \left( e^{2t/\tau} - 1 \right)
= e^{-2t/\tau} \langle v^2(0) \rangle + \frac{C \tau}{2} \left( 1 - e^{-2t/\tau} \right) .   (13.189)

As t \to \infty, \langle v^2(t) \rangle must approach its equilibrium value of 3kT/m, and so

\lim_{t \to \infty} \langle v^2(t) \rangle = \frac{C \tau}{2} = \frac{3 k T}{m}   (13.190)

which implies that

C = \frac{6 k T}{m \tau} \quad \text{or} \quad \frac{1}{B} = \frac{m^2 C}{6 k T} .   (13.191)

Our final formula for \langle v^2(t) \rangle then is

\langle v^2(t) \rangle = e^{-2t/\tau} \langle v^2(0) \rangle + \frac{3 k T}{m} \left( 1 - e^{-2t/\tau} \right) .   (13.192)

Referring back to the definition (13.160) of the viscous-friction coefficient v = 1/B, we see that v is related to the integral

v = \frac{1}{B} = \frac{m^2}{6 k T} C = \frac{m^2}{6 k T} \int_{-\infty}^{\infty} \langle \boldsymbol{a}(0) \cdot \boldsymbol{a}(s) \rangle \, ds = \frac{1}{6 k T} \int_{-\infty}^{\infty} \langle \boldsymbol{f}(0) \cdot \boldsymbol{f}(s) \rangle \, ds   (13.193)

of the autocorrelation function of the random acceleration \boldsymbol{a}(t) or equivalently of the random force \boldsymbol{f}(t). This equation relates the dissipation of viscous friction to the random fluctuations. It is an example of a fluctuation-dissipation theorem.

If we substitute our formula (13.192) for \langle v^2(t) \rangle into the expression (13.147) for the acceleration of \langle r^2 \rangle, then we get

\frac{d^2 \langle r^2(t) \rangle}{dt^2} = -\frac{1}{\tau} \frac{d \langle r^2(t) \rangle}{dt} + 2 e^{-2t/\tau} \langle v^2(0) \rangle + \frac{6 k T}{m} \left( 1 - e^{-2t/\tau} \right) .   (13.194)

⌘. (13.194)

The solution with both hr2(0)i = 0 and dhr2(0)i/dt = 0 is (exercise 13.19)

hr2(t)i = hv2(0)i ⌧2⇣1� e�t/⌧

⌘2� 3kT

m⌧2⇣1� e�t/⌧

⌘⇣3� e�t/⌧

⌘+6kT ⌧

mt.

(13.195)


13.15 The Fokker-Planck equation

Let P(\boldsymbol{v}, t) be the probability distribution of particles in velocity space at time t, and \psi(\boldsymbol{v}; \boldsymbol{u}) be a normalized transition probability that the velocity changes from \boldsymbol{v} to \boldsymbol{v} + \boldsymbol{u} in the time interval [t, t + \Delta t]. We take the interval \Delta t to be much longer than the interval between successive particle collisions but much shorter than the time over which the velocity \boldsymbol{v} changes appreciably. So |\boldsymbol{u}| \ll |\boldsymbol{v}|. We also assume that the successive changes in the velocities of the particles form a Markoff stochastic process, that is, that the changes are random and that what happens at time t depends only upon the state of the system at that time and not upon the history of the system. We then expect that the velocity distribution at time t + \Delta t is related to that at time t by

P(\boldsymbol{v}, t + \Delta t) = \int P(\boldsymbol{v} - \boldsymbol{u}, t) \, \psi(\boldsymbol{v} - \boldsymbol{u}; \boldsymbol{u}) \, d^3u .   (13.196)

Since |\boldsymbol{u}| \ll |\boldsymbol{v}|, we can expand P(\boldsymbol{v}, t + \Delta t), P(\boldsymbol{v} - \boldsymbol{u}, t), and \psi(\boldsymbol{v} - \boldsymbol{u}; \boldsymbol{u}) in Taylor series in \boldsymbol{v} like

\psi(\boldsymbol{v} - \boldsymbol{u}; \boldsymbol{u}) = \psi(\boldsymbol{v}; \boldsymbol{u}) - \boldsymbol{u} \cdot \nabla_v \psi(\boldsymbol{v}; \boldsymbol{u}) + \frac{1}{2} \sum_{i,j} u_i u_j \frac{\partial^2 \psi(\boldsymbol{v}; \boldsymbol{u})}{\partial v_i \partial v_j}   (13.197)

and get

P(\boldsymbol{v}, t) + \Delta t \, \frac{\partial P(\boldsymbol{v}, t)}{\partial t} = \int \left[ P(\boldsymbol{v}, t) - \boldsymbol{u} \cdot \nabla_v P(\boldsymbol{v}, t) + \frac{1}{2} \sum_{i,j} u_i u_j \frac{\partial^2 P(\boldsymbol{v}, t)}{\partial v_i \partial v_j} \right]
\times \left[ \psi(\boldsymbol{v}; \boldsymbol{u}) - \boldsymbol{u} \cdot \nabla_v \psi(\boldsymbol{v}; \boldsymbol{u}) + \frac{1}{2} \sum_{i,j} u_i u_j \frac{\partial^2 \psi(\boldsymbol{v}; \boldsymbol{u})}{\partial v_i \partial v_j} \right] d^3u .   (13.198)

The normalization of the transition probability and the average changes in velocity are

1 = \int \psi(\boldsymbol{v}; \boldsymbol{u}) \, d^3u, \quad
\langle u_i \rangle = \int u_i \, \psi(\boldsymbol{v}; \boldsymbol{u}) \, d^3u, \quad
\langle u_i u_j \rangle = \int u_i u_j \, \psi(\boldsymbol{v}; \boldsymbol{u}) \, d^3u   (13.199)


in which the dependence of the mean values \langle u_i \rangle and \langle u_i u_j \rangle upon the velocity \boldsymbol{v} is implicit. In these terms, the expansion (13.198) is

\Delta t \, \frac{\partial P(\boldsymbol{v}, t)}{\partial t} = -\langle \boldsymbol{u} \rangle \cdot \nabla_v P(\boldsymbol{v}, t) + \frac{1}{2} \sum_{i,j} \langle u_i u_j \rangle \frac{\partial^2 P(\boldsymbol{v}, t)}{\partial v_i \partial v_j}
- P(\boldsymbol{v}, t) \, \nabla_v \cdot \langle \boldsymbol{u} \rangle + \frac{1}{2} P(\boldsymbol{v}, t) \sum_{i,j} \frac{\partial^2 \langle u_i u_j \rangle}{\partial v_i \partial v_j} + \sum_{i,j} \frac{\partial P(\boldsymbol{v}, t)}{\partial v_i} \frac{\partial \langle u_i u_j \rangle}{\partial v_j} .   (13.200)

Combining terms, we get the Fokker-Planck equation in its most general form (Chandrasekhar, 1943)

\Delta t \, \frac{\partial P(\boldsymbol{v}, t)}{\partial t} = -\nabla_v \cdot \left[ P(\boldsymbol{v}, t) \, \langle \boldsymbol{u} \rangle \right] + \frac{1}{2} \sum_{i,j} \frac{\partial^2}{\partial v_i \partial v_j} \left[ P(\boldsymbol{v}, t) \, \langle u_i u_j \rangle \right] .   (13.201)

Example 13.16 (Brownian motion). Langevin's equation (13.136) gives the change \boldsymbol{u} in the velocity \boldsymbol{v} as the viscous drag plus some 10^{21} random tiny accelerations per second

\boldsymbol{u} = -\frac{\boldsymbol{v} \, \Delta t}{m B} + \frac{\boldsymbol{f} \, \Delta t}{m}   (13.202)

in which B is the mobility of the colloidal particle. The random changes \boldsymbol{f} \Delta t / m in velocity are gaussian, and the transition probability is

\psi(\boldsymbol{v}; \boldsymbol{u}) = \left( \frac{\beta m^2 B}{4 \pi \Delta t} \right)^{3/2} \exp\left( -\frac{\beta m^2 B}{4 \Delta t} \left| \boldsymbol{u} + \frac{\boldsymbol{v} \Delta t}{m B} \right|^2 \right) .   (13.203)

Here \beta = 1/kT, and Stokes's formula for the mobility B of a spherical colloidal particle of radius r in a fluid of viscosity \eta is 1/B = 6 \pi r \eta. The moments (13.199) of the changes \boldsymbol{u} in velocity are in the limit \Delta t \to 0

\langle \boldsymbol{u} \rangle = -\frac{\boldsymbol{v} \, \Delta t}{m B} \quad \text{and} \quad \langle u_i u_j \rangle = \frac{2 \, \delta_{ij} \, k T}{m^2 B} \, \Delta t .   (13.204)

So for Brownian motion, the Fokker-Planck equation is

\frac{\partial P(\boldsymbol{v}, t)}{\partial t} = \frac{1}{m B} \nabla_v \cdot \left[ P(\boldsymbol{v}, t) \, \boldsymbol{v} \right] + \frac{k T}{m^2 B} \nabla_v^2 P(\boldsymbol{v}, t) .   (13.205)


13.16 Characteristic and moment-generating functions

The Fourier transform (3.9) of a probability distribution P(x) is its characteristic function P(k), sometimes written as \chi(k)

P(k) \equiv \chi(k) \equiv E[e^{ikx}] = \int e^{ikx} \, P(x) \, dx .   (13.206)

The probability distribution P(x) is the inverse Fourier transform (3.9)

P(x) = \int e^{-ikx} \, P(k) \, \frac{dk}{2\pi} .   (13.207)

Example 13.17 (Gauss). The characteristic function of the gaussian

P_G(x, \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{(x - \mu)^2}{2 \sigma^2} \right)   (13.208)

is by (3.19)

P_G(k, \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} \int \exp\left( i k x - \frac{(x - \mu)^2}{2 \sigma^2} \right) dx = \frac{e^{ik\mu}}{\sigma \sqrt{2\pi}} \int \exp\left( i k x - \frac{x^2}{2 \sigma^2} \right) dx = \exp\left( i \mu k - \frac{1}{2} \sigma^2 k^2 \right) .   (13.209)

For a discrete probability distribution P_n the characteristic function is

\chi(k) \equiv E[e^{ikx}] = \sum_n e^{i k x_n} P_n .   (13.210)

The normalization of both continuous and discrete probability distributions implies that their characteristic functions satisfy P(0) = \chi(0) = 1.

Example 13.18 (Poisson). The Poisson distribution (13.64)

P_P(n, \langle n \rangle) = \frac{\langle n \rangle^n}{n!} \, e^{-\langle n \rangle}   (13.211)

has the characteristic function

\chi(k) = \sum_{n=0}^{\infty} e^{ikn} \frac{\langle n \rangle^n}{n!} \, e^{-\langle n \rangle} = e^{-\langle n \rangle} \sum_{n=0}^{\infty} \frac{(\langle n \rangle e^{ik})^n}{n!} = \exp\left[ \langle n \rangle \left( e^{ik} - 1 \right) \right] .   (13.212)


The moment-generating function is the characteristic function evaluated at an imaginary argument

M(k) \equiv E[e^{kx}] = \tilde P(-ik) = \chi(-ik).   (13.213)

For a continuous probability distribution P(x), it is

M(k) = E[e^{kx}] = \int e^{kx}\, P(x)\, dx   (13.214)

and for a discrete probability distribution P_n, it is

M(k) = E[e^{kx}] = \sum_n e^{kx_n}\, P_n.   (13.215)

In both cases, the normalization of the probability distribution implies that M(0) = 1.

Derivatives of the moment-generating function and of the characteristic function give the moments

E[x^n] = \mu_n = \left.\frac{d^n M(k)}{dk^n}\right|_{k=0} = (-i)^n \left.\frac{d^n \tilde P(k)}{dk^n}\right|_{k=0}.   (13.216)

Example 13.19 (Gauss and Poisson). The moment-generating functions for the distributions of Gauss (13.208) and Poisson (13.211) are

M_G(k, \mu, \sigma) = \exp\left( \mu k + \tfrac{1}{2}\sigma^2 k^2 \right) \qquad \text{and} \qquad M_P(k, \langle n\rangle) = \exp\left[ \langle n\rangle \left( e^k - 1 \right) \right].   (13.217)

They give as the first three moments of these distributions

\mu_{G0} = 1, \quad \mu_{G1} = \mu, \quad \mu_{G2} = \mu^2 + \sigma^2   (13.218)

\mu_{P0} = 1, \quad \mu_{P1} = \langle n\rangle, \quad \mu_{P2} = \langle n\rangle + \langle n\rangle^2   (13.219)

(exercise 13.20).
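As a quick worked check of (13.216) in the Poisson case (a step along the lines of exercise 13.20, not spelled out in the text), differentiating M_P(k) = exp[⟨n⟩(e^k − 1)] gives

M_P'(k) = \langle n\rangle e^k\, M_P(k) \quad\Longrightarrow\quad \mu_1 = M_P'(0) = \langle n\rangle,

M_P''(k) = \left( \langle n\rangle e^k + \langle n\rangle^2 e^{2k} \right) M_P(k) \quad\Longrightarrow\quad \mu_2 = M_P''(0) = \langle n\rangle + \langle n\rangle^2,

which reproduces (13.219).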

Since the characteristic and moment-generating functions have derivatives (13.216) proportional to the moments μ_n, their Taylor series are

\tilde P(k) = E[e^{ikx}] = \sum_{n=0}^{\infty} \frac{(ik)^n}{n!}\, E[x^n] = \sum_{n=0}^{\infty} \frac{(ik)^n}{n!}\, \mu_n   (13.220)

and

M(k) = E[e^{kx}] = \sum_{n=0}^{\infty} \frac{k^n}{n!}\, E[x^n] = \sum_{n=0}^{\infty} \frac{k^n}{n!}\, \mu_n.   (13.221)


The cumulants c_n of a probability distribution are the derivatives of the logarithm of its moment-generating function at k = 0

c_n = \left.\frac{d^n \ln M(k)}{dk^n}\right|_{k=0} = (-i)^n \left.\frac{d^n \ln \tilde P(k)}{dk^n}\right|_{k=0}.   (13.222)

One may show (exercise 13.22) that the first five cumulants of an arbitrary probability distribution are

c_0 = 0, \quad c_1 = \mu, \quad c_2 = \sigma^2, \quad c_3 = \nu_3, \quad \text{and} \quad c_4 = \nu_4 - 3\sigma^4   (13.223)

where the ν's are its central moments (13.29). The third and fourth normalized cumulants are the skewness v = c_3/σ³ = ν_3/σ³ and the kurtosis κ = c_4/σ⁴ = ν_4/σ⁴ − 3.

Example 13.20 (Gaussian Cumulants). The logarithm of the moment-generating function (13.217) of Gauss's distribution is μk + σ²k²/2. Thus by (13.222), P_G(x, μ, σ) has no skewness or kurtosis, its cumulants vanish, c_{Gn} = 0 for n > 2, and its fourth central moment is ν_4 = 3σ⁴.

13.17 Fat Tails

The gaussian probability distribution P_G(x, μ, σ) falls off for |x − μ| ≫ σ very fast, as exp(−(x − μ)²/2σ²). Many other probability distributions fall off more slowly; they have fat tails. Rare "black-swan" events (wild fluctuations, market bubbles, and crashes) lurk in their fat tails.

Gosset's distribution, which is known as Student's t-distribution with ν degrees of freedom,

P_S(x, \nu, a) = \frac{1}{\sqrt{\pi}}\, \frac{\Gamma((1+\nu)/2)}{\Gamma(\nu/2)}\, \frac{a^{\nu}}{(a^2 + x^2)^{(1+\nu)/2}}   (13.224)

has power-law tails. Its even moments are

\mu_{2n} = (2n-1)!!\, \frac{\Gamma(\nu/2 - n)}{\Gamma(\nu/2)} \left( \frac{a^2}{2} \right)^n   (13.225)

for 2n < ν and infinite otherwise. For ν = 1, it coincides with the Breit-Wigner or Cauchy distribution

P_S(x, 1, a) = \frac{1}{\pi}\, \frac{a}{a^2 + x^2}   (13.226)

in which x = E − E_0 and a = Γ/2 is the half-width at half-maximum.
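One can feel the effect of these fat tails numerically. The sketch below (mine, not the book's) draws Cauchy variates by the inverse-transform trick x = a tan(π(u − 1/2)), where u is uniform on (0, 1), and prints the running sample mean; unlike a gaussian sample mean it never settles down, because the Cauchy distribution has no finite mean.

// Sketch (not from the text): Cauchy variates via inverse transform; the running
// sample mean keeps wandering because the distribution has no finite mean.
#include <cmath>
#include <iostream>
#include <random>

int main() {
  const double a = 1.0, pi = 3.141592653589793;
  std::mt19937 gen(7);
  std::uniform_real_distribution<double> u01(0.0, 1.0);
  double sum = 0.0;
  for (long n = 1; n <= 1000000; ++n) {
    sum += a * std::tan(pi * (u01(gen) - 0.5));   // x = F^{-1}(u) for the Cauchy CDF
    if (n % 100000 == 0)
      std::cout << "N = " << n << "   running mean = " << sum / n << "\n";
  }
}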


Two representative cumulative probabilities are (Bouchaud and Potters, 2003, pp. 15-16)

\Pr(x, \infty) = \int_x^{\infty} P_S(x', 3, 1)\, dx' = \frac{1}{2} - \frac{1}{\pi} \left[ \arctan x + \frac{x}{1 + x^2} \right]   (13.227)

\Pr(x, \infty) = \int_x^{\infty} P_S(x', 4, \sqrt{2})\, dx' = \frac{1}{2} - \frac{3}{4}\, u + \frac{1}{4}\, u^3   (13.228)

where u = x/\sqrt{2 + x^2} and a is picked so σ² = 1. William Gosset (1876-1937), who worked for Guinness, wrote as Student because Guinness didn't let its employees publish.

The log-normal probability distribution on (0, ∞)

P_{\ln}(x) = \frac{1}{\sigma x \sqrt{2\pi}}\, \exp\left[ -\frac{\ln^2(x/x_0)}{2\sigma^2} \right]   (13.229)

describes distributions of rates of return (Bouchaud and Potters, 2003, p. 9). Its moments are (exercise 13.25)

\mu_n = x_0^n\, e^{n^2 \sigma^2 / 2}.   (13.230)
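A one-line check of (13.230), not written out in the text: the substitution y = ln(x/x_0), with x = x_0 e^y and dx = x dy, turns the moment integral into a gaussian one,

\mu_n = \int_0^{\infty} x^n\, P_{\ln}(x)\, dx = \frac{x_0^n}{\sigma\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{n y}\, e^{-y^2/2\sigma^2}\, dy = x_0^n\, e^{n^2\sigma^2/2}.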

The exponential distribution on [0, ∞)

P_e(x) = \alpha\, e^{-\alpha x}   (13.231)

has (exercise 13.26) mean μ = 1/α and variance σ² = 1/α². The sum of n independent exponentially and identically distributed random variables x = x_1 + ... + x_n is distributed on [0, ∞) as (Feller, 1966, p. 10)

P_{n,e}(x) = \alpha\, \frac{(\alpha x)^{n-1}}{(n-1)!}\, e^{-\alpha x}.   (13.232)

The sum of the squares x² = x_1² + ... + x_n² of n independent normally and identically distributed random variables of zero mean and variance σ² gives rise to Pearson's chi-squared distribution on (0, ∞)

P_{n,P}(x, \sigma)\, dx = \frac{\sqrt{2}}{\Gamma(n/2)} \left( \frac{x}{\sigma\sqrt{2}} \right)^{n-1} e^{-x^2/(2\sigma^2)}\, \frac{dx}{\sigma}   (13.233)

which for x = v, n = 3, and σ² = kT/m is (exercise 13.27) the Maxwell-Boltzmann distribution (13.116). In terms of χ = x/σ, it is

P_{n,P}(\chi^2/2)\, d\chi^2 = \frac{1}{\Gamma(n/2)} \left( \frac{\chi^2}{2} \right)^{n/2-1} e^{-\chi^2/2}\, d\!\left( \frac{\chi^2}{2} \right).   (13.234)


It has mean and variance

\mu = n \qquad \text{and} \qquad \sigma^2 = 2n   (13.235)

and is used in the chi-squared test (Pearson, 1900).

Personal income, the amplitudes of catastrophes, the price changes of financial assets, and many other phenomena occur on both small and large scales. Levy distributions describe such multi-scale phenomena. The characteristic function for a symmetric Levy distribution is for ν ≤ 2

L_{\nu}(k, a_{\nu}) = \exp\left( -a_{\nu} |k|^{\nu} \right).   (13.236)

Its inverse Fourier transform (13.207) is for ν = 1 (exercise 13.28) the Cauchy or Lorentz distribution

L_1(x, a_1) = \frac{a_1}{\pi (x^2 + a_1^2)}   (13.237)

and for ν = 2 the gaussian

L_2(x, a_2) = P_G(x, 0, \sqrt{2 a_2}) = \frac{1}{2\sqrt{\pi a_2}}\, \exp\left( -\frac{x^2}{4 a_2} \right)   (13.238)

but for other values of ν no simple expression for L_ν(x, a_ν) is available. For 0 < ν < 2 and as x → ±∞, it falls off as |x|^{-(1+ν)}, and for ν > 2 it assumes negative values, ceasing to be a probability distribution (Bouchaud and Potters, 2003, pp. 10-13).

13.18 The Central Limit Theorem and Jarl Lindeberg

We have seen in sections 13.9 and 13.11 that unbiased fluctuations tend to distribute the position and velocity of molecules according to Gauss's distribution (13.82). Gaussian distributions occur very frequently. The central limit theorem suggests why they occur so often.

Let x_1, ..., x_N be N independent random variables described by probability distributions P_1(x_1), ..., P_N(x_N) with finite means μ_j and finite variances σ_j². The P_j's may be all different. The central limit theorem says that as N → ∞ the probability distribution P^{(N)}(y) for the average of the x_j's

y = \frac{1}{N}\left( x_1 + x_2 + \cdots + x_N \right)   (13.239)

tends to a gaussian in y quite independently of what the underlying probability distributions P_j(x_j) happen to be.


Because expected values are linear (13.36), the mean value of the average y is the average of the N means

\mu_y = E[y] = E[(x_1 + \cdots + x_N)/N] = \frac{1}{N}\left( E[x_1] + \cdots + E[x_N] \right) = \frac{1}{N}\left( \mu_1 + \cdots + \mu_N \right).   (13.240)

Similarly, our rule (13.47) for the variance of a linear combination of independent variables tells us that the variance of the average y is the sum

\sigma_y^2 = V[(x_1 + \cdots + x_N)/N] = \frac{1}{N^2}\left( \sigma_1^2 + \cdots + \sigma_N^2 \right).   (13.241)

The independence of the random variables x_1, x_2, ..., x_N implies (13.42) that their joint probability distribution factorizes

P(x_1, \ldots, x_N) = P_1(x_1)\, P_2(x_2) \cdots P_N(x_N).   (13.242)

We can use a delta function (3.36) to write the probability distribution P^{(N)}(y) for the average y = (x_1 + x_2 + ... + x_N)/N of the x_j's as

P^{(N)}(y) = \int P(x_1, \ldots, x_N)\, \delta((x_1 + x_2 + \cdots + x_N)/N - y)\, d^N x   (13.243)

where d^N x = dx_1 ... dx_N. Its characteristic function

\tilde P^{(N)}(k) = \int e^{iky}\, P^{(N)}(y)\, dy
 = \int e^{iky} \int P(x_1, \ldots, x_N)\, \delta((x_1 + x_2 + \cdots + x_N)/N - y)\, d^N x\, dy
 = \int \exp\left[ \frac{ik}{N}(x_1 + x_2 + \cdots + x_N) \right] P(x_1, \ldots, x_N)\, d^N x
 = \int \exp\left[ \frac{ik}{N}(x_1 + x_2 + \cdots + x_N) \right] P_1(x_1)\, P_2(x_2) \cdots P_N(x_N)\, d^N x   (13.244)

is then the product

\tilde P^{(N)}(k) = \tilde P_1(k/N)\, \tilde P_2(k/N) \cdots \tilde P_N(k/N)   (13.245)

of the characteristic functions

\tilde P_j(k/N) = \int e^{ikx_j/N}\, P_j(x_j)\, dx_j   (13.246)

of the probability distributions P_1(x_1), ..., P_N(x_N).


The Taylor series (13.220) for each characteristic function is

\tilde P_j(k/N) = \sum_{n=0}^{\infty} \frac{(ik)^n}{n!\, N^n}\, \mu_{nj}   (13.247)

and so for big N we can use the approximation

\tilde P_j(k/N) \approx 1 + \frac{ik}{N}\mu_j - \frac{k^2}{2N^2}\mu_{2j}   (13.248)

in which μ_{2j} = σ_j² + μ_j² by the formula (13.24) for the variance. So we have

\tilde P_j(k/N) \approx 1 + \frac{ik}{N}\mu_j - \frac{k^2}{2N^2}\left( \sigma_j^2 + \mu_j^2 \right)   (13.249)

or for large N

\tilde P_j(k/N) \approx \exp\left( \frac{ik}{N}\mu_j - \frac{k^2}{2N^2}\sigma_j^2 \right).   (13.250)

Thus as N → ∞, the characteristic function (13.245) for the variable y converges to

\tilde P^{(N)}(k) = \prod_{j=1}^{N} \tilde P_j(k/N) = \prod_{j=1}^{N} \exp\left( \frac{ik}{N}\mu_j - \frac{k^2}{2N^2}\sigma_j^2 \right)
 = \exp\left[ \sum_{j=1}^{N} \left( \frac{ik}{N}\mu_j - \frac{k^2}{2N^2}\sigma_j^2 \right) \right]
 = \exp\left( i\mu_y k - \tfrac{1}{2}\sigma_y^2 k^2 \right)   (13.251)

which is the characteristic function (13.209) of a gaussian (13.208) with mean and variance

\mu_y = \frac{1}{N} \sum_{j=1}^{N} \mu_j \qquad \text{and} \qquad \sigma_y^2 = \frac{1}{N^2} \sum_{j=1}^{N} \sigma_j^2.   (13.252)

The inverse Fourier transform (13.207) now gives the probability distribution P^{(N)}(y) for the average y = (x_1 + x_2 + ... + x_N)/N as

P^{(N)}(y) = \int_{-\infty}^{\infty} e^{-iky}\, \tilde P^{(N)}(k)\, \frac{dk}{2\pi}   (13.253)

which in view of (13.251) and (13.209) tends as N → ∞ to Gauss's distribution P_G(y, μ_y, σ_y)


\lim_{N\to\infty} P^{(N)}(y) = \int_{-\infty}^{\infty} e^{-iky} \lim_{N\to\infty} \tilde P^{(N)}(k)\, \frac{dk}{2\pi}
 = \int_{-\infty}^{\infty} e^{-iky} \exp\left( i\mu_y k - \tfrac{1}{2}\sigma_y^2 k^2 \right) \frac{dk}{2\pi}
 = P_G(y, \mu_y, \sigma_y) = \frac{1}{\sigma_y\sqrt{2\pi}}\, \exp\left[ -\frac{(y - \mu_y)^2}{2\sigma_y^2} \right]   (13.254)

with mean μ_y and variance σ_y² as given by (13.252). The sense in which P^{(N)}(y) converges to P_G(y, μ_y, σ_y) is that for all a and b the probability Pr_N(a < y < b) that y lies between a and b as determined by P^{(N)}(y) converges as N → ∞ to the probability that y lies between a and b as determined by the gaussian P_G(y, μ_y, σ_y)

\lim_{N\to\infty} \Pr_N(a < y < b) = \lim_{N\to\infty} \int_a^b P^{(N)}(y)\, dy = \int_a^b P_G(y, \mu_y, \sigma_y)\, dy.   (13.255)

This type of convergence is called convergence in probability (Feller, 1966, pp. 231, 241-248).

For the special case in which all the means and variances are the same, with μ_j = μ and σ_j² = σ², the definitions in (13.252) imply that μ_y = μ and σ_y² = σ²/N. In this case, one may show (exercise 13.30) that in terms of the variable

u \equiv \frac{\sqrt{N}\,(y - \mu)}{\sigma} = \frac{\left( \sum_{j=1}^{N} x_j \right) - N\mu}{\sqrt{N}\,\sigma}   (13.256)

P^{(N)}(y) converges to a distribution that is normal

\lim_{N\to\infty} P^{(N)}(y)\, dy = \frac{1}{\sqrt{2\pi}}\, e^{-u^2/2}\, du.   (13.257)

To get a clearer idea of when the central limit theorem holds, let us write the sum of the N variances as

S_N \equiv \sum_{j=1}^{N} \sigma_j^2 = \sum_{j=1}^{N} \int_{-\infty}^{\infty} (x_j - \mu_j)^2\, P_j(x_j)\, dx_j   (13.258)

and the part of this sum due to the regions within δ of the means μ_j as

S_N(\delta) \equiv \sum_{j=1}^{N} \int_{\mu_j - \delta}^{\mu_j + \delta} (x_j - \mu_j)^2\, P_j(x_j)\, dx_j.   (13.259)


Figure 13.5 The probability distributions P^{(N)}(y) (Eq. 13.243) for the mean y = (x_1 + ... + x_N)/N of N random variables drawn from the uniform distribution are plotted for N = 1 (dots), 2 (dot-dash), 4 (dashes), and 8 (solid). The distributions P^{(N)}(y) rapidly approach gaussians with the same mean μ_y = 1/2 but with shrinking variances σ_y² = 1/(12N).

In terms of these definitions, Jarl Lindeberg (1876-1932) showed that P^{(N)}(y) converges (in probability) to the gaussian (13.254) as long as the part S_N(δ) is most of S_N in the sense that for every ε > 0

\lim_{N\to\infty} \frac{S_N\!\left( \epsilon\sqrt{S_N} \right)}{S_N} = 1.   (13.260)

This is Lindeberg's condition (Feller 1968, p. 254; Feller 1966, pp. 252-259; Gnedenko 1968, p. 304).

Because we dropped all but the first three terms of the series (13.247) for the characteristic functions P̃_j(k/N), we may infer that the convergence of the distribution P^{(N)}(y) to a gaussian is quickest near its mean μ_y. If the higher moments μ_{nj} are big, then for finite N the distribution P^{(N)}(y) can have tails that are fatter than those of the limiting gaussian P_G(y, μ_y, σ_y).


Example 13.21 (Illustration of the Central Limit Theorem). The simplest probability distribution is a random number x uniformly distributed on the interval (0, 1). The probability distribution P^{(2)}(y) of the mean of two such random numbers is the integral

P^{(2)}(y) = \int_0^1 dx_1 \int_0^1 dx_2\, \delta((x_1 + x_2)/2 - y).   (13.261)

Letting u_1 = x_1/2, we find

P^{(2)}(y) = 4 \int_{\max(0,\, y - \frac{1}{2})}^{\min(y,\, \frac{1}{2})} \theta\!\left( \tfrac{1}{2} + u_1 - y \right) du_1 = 4y\, \theta\!\left( \tfrac{1}{2} - y \right) + 4(1 - y)\, \theta\!\left( y - \tfrac{1}{2} \right)   (13.262)

which is the dot-dashed triangle in Fig. 13.5. The probability distribution P^{(4)}(y) is the dashed, somewhat gaussian curve in the figure, while P^{(8)}(y) is the solid, nearly gaussian curve.
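A few lines of code reproduce the trend in Fig. 13.5. The sketch below is mine, not the book's program: it draws many means of N uniform random numbers and compares the sample variance of those means with σ_y² = 1/(12N); the trial count is an arbitrary choice.

// Sketch (not from the text): variance of the mean of N uniform randoms vs 1/(12N).
#include <iostream>
#include <random>

int main() {
  std::mt19937 gen(42);
  std::uniform_real_distribution<double> u01(0.0, 1.0);
  const int trials = 200000;
  for (int N : {1, 2, 4, 8}) {
    double s = 0.0, s2 = 0.0;
    for (int t = 0; t < trials; ++t) {
      double y = 0.0;
      for (int j = 0; j < N; ++j) y += u01(gen);
      y /= N;
      s += y;  s2 += y * y;
    }
    double mean = s / trials;
    double var = s2 / trials - mean * mean;
    std::cout << "N = " << N << "   var(y) = " << var
              << "   1/(12N) = " << 1.0 / (12.0 * N) << "\n";
  }
}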

To work through a more complicated example of the central limit theorem, we first need to learn how to generate random numbers that follow an arbitrary distribution.

13.19 Random-Number Generators

To generate truly random numbers, one might use decaying nuclei or an electronic device that makes white noise. But people usually settle for pseudorandom numbers computed by a mathematical algorithm. Such algorithms are deterministic, so the numbers they generate are not truly random. But for most purposes, they are random enough.

The easiest way to generate pseudorandom numbers is to use a random-number algorithm that is part of one's favorite fortran, C, or C++ compiler. To run it, one first gives it a random starting point called a seed. When using the Intel fortran90 compiler, one includes in the code the line

call random_seed()

to set a random seed. When using the GNU fortran95 compiler, one may use the statement call init_random_seed() to call the subroutine init_random_seed() defined in the program clt.f95 of section 13.20. In either case, one subsequently may use the line

call random_number(x)


to generate a random number x uniformly distributed on the interval 0 < x < 1, or an array of such random numbers.

Some applications require random numbers of very high quality. For such applications, one might use Luscher's ranlux (Luscher, 1994; James, 1994).

Most random-number generators are periodic with very long periods. The Mersenne Twister (Saito and Matsumoto, 2007) has the exceptionally long period 2^19937 − 1 ≈ 4.3 × 10^6001. Matlab uses it. The GNU fortran95 compiler uses George Marsaglia's KISS generator random_number, which has a period in excess of 2^59.

Random-number generators distribute random numbers uniformly on the interval (0, 1). How do we make them follow an arbitrary distribution P(x)? If the distribution is strictly positive, P(x) > 0, on the relevant interval (a, b), then its integral

F(x) = \int_a^x P(x')\, dx'   (13.263)

is a strictly increasing function on (a, b), that is, a < x < y < b implies F(x) < F(y). Moreover, the function F(x) rises from F(a) = 0 to F(b) = 1 and takes on every value 0 < y < 1 for exactly one x in the interval (a, b). Thus the inverse function F^{-1}(y)

x = F^{-1}(y) \quad \text{if and only if} \quad y = F(x)   (13.264)

is well defined on the interval (0, 1).

Our random-number generator gives us random numbers u that are uniform on (0, 1). We want a random variable r whose probability Pr(r < x) of being less than x is F(x). The trick (Knuth, 1981, p. 116) is to set

r = F^{-1}(u)   (13.265)

so that Pr(r < x) = Pr(F^{-1}(u) < x). For by (13.264), F^{-1}(u) < x if and only if u < F(x). So Pr(r < x) = Pr(F^{-1}(u) < x) = Pr(u < F(x)) = F(x). The trick works.

Example 13.22 (P(r) = 3r²). To turn a distribution of random numbers u uniform on (0, 1) into a distribution P(r) = 3r² of random numbers r, we integrate and find

F(x) = \int_0^x P(x')\, dx' = \int_0^x 3x'^2\, dx' = x^3.   (13.266)

We then set r = F^{-1}(u) = u^{1/3}.
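The same trick carries over directly to the C++ <random> library. The sketch below is mine, not part of the text; the bin count and number of draws are arbitrary choices. Each bin's share of the draws should approximate the integral of 3r² over the bin.

// Sketch (not from the text): inverse transform (13.263-13.265) in C++;
// u uniform on (0,1) mapped to r = u^(1/3), which is distributed as 3 r^2.
#include <algorithm>
#include <cmath>
#include <iostream>
#include <random>
#include <vector>

int main() {
  std::mt19937 gen(2718);
  std::uniform_real_distribution<double> u01(0.0, 1.0);
  const int nBins = 10, nDraws = 1000000;
  std::vector<long> hist(nBins, 0);
  for (int i = 0; i < nDraws; ++i) {
    double r = std::cbrt(u01(gen));                          // r = F^{-1}(u) = u^{1/3}
    ++hist[std::min(nBins - 1, static_cast<int>(nBins * r))];
  }
  for (int b = 0; b < nBins; ++b) {
    double left = double(b) / nBins, right = double(b + 1) / nBins;
    std::cout << "[" << left << "," << right << "): observed "
              << double(hist[b]) / nDraws
              << "  expected " << (right * right * right - left * left * left) << "\n";
  }
}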


13.20 Illustration of the Central Limit Theorem

To make things simple, we'll take all the probability distributions P_j(x) to be the same and equal to P_j(x_j) = 3x_j² on the interval (0, 1) and zero elsewhere. Our random-number generator gives us random numbers u that are uniformly distributed on (0, 1), so by example 13.22 the variable r = u^{1/3} is distributed as P_j(x) = 3x².

The central limit theorem tells us that the distribution

P^{(N)}(y) = \int 3x_1^2\, 3x_2^2 \cdots 3x_N^2\, \delta((x_1 + x_2 + \cdots + x_N)/N - y)\, d^N x   (13.267)

of the mean y = (x_1 + ... + x_N)/N tends as N → ∞ to Gauss's distribution

\lim_{N\to\infty} P^{(N)}(y) = \frac{1}{\sigma_y\sqrt{2\pi}}\, \exp\left( -\frac{(y - \mu_y)^2}{2\sigma_y^2} \right)   (13.268)

with mean μ_y and variance σ_y² given by (13.252). Since the P_j's are all the same, they all have the same mean

\mu_y = \mu_j = \int_0^1 3x^3\, dx = \frac{3}{4}   (13.269)

and the same variance

\sigma_j^2 = \int_0^1 3x^4\, dx - \left( \frac{3}{4} \right)^2 = \frac{3}{5} - \frac{9}{16} = \frac{3}{80}.   (13.270)

By (13.252), the variance of the mean y is then σ_y² = 3/(80N). Thus as N increases, the mean y tends to a gaussian with mean μ_y = 3/4 and ever narrower peaks.

For N = 1, the probability distribution P^{(1)}(y) is

P^{(1)}(y) = \int 3x_1^2\, \delta(x_1 - y)\, dx_1 = 3y^2   (13.271)

which is the probability distribution we started with. In Fig. 13.6, this is the quadratic, dotted curve. For N = 2, the probability distribution P^{(2)}(y) is (exercise 13.29)

P^{(2)}(y) = \int 3x_1^2\, 3x_2^2\, \delta((x_1 + x_2)/2 - y)\, dx_1\, dx_2
 = \theta\!\left( \tfrac{1}{2} - y \right) \frac{96}{5}\, y^5 + \theta\!\left( y - \tfrac{1}{2} \right) \left( \frac{36}{5} - \frac{96}{5}\, y^5 + 48 y^2 - 36 y \right).   (13.272)

You can get the probability distributions P^{(N)}(y) for N = 2^j by running the fortran95 program


Figure 13.6 The probability distributions P^{(N)}(y) (Eq. 13.267) for the mean y = (x_1 + ... + x_N)/N of N random variables drawn from the quadratic distribution P(x) = 3x² are plotted for N = 1 (dots), 2 (dot-dash), 4 (dashes), and 8 (solid). The four distributions P^{(N)}(y) rapidly approach gaussians with the same mean μ_y = 3/4 but with shrinking variances σ_y² = 3/(80N).

program clt

implicit none ! avoids typos

character(len=1)::ch_i1

integer,parameter::dp = kind(1.d0) !define double precision

integer::j,k,n,m

integer,dimension(100)::plot = 0

real(dp)::y

real(dp),dimension(100)::rplot

real(dp),allocatable,dimension(:)::r,u

real(dp),parameter::onethird = 1.d0/3.d0

write(6,*)'What is j?'; read(5,*) j


allocate(u(2**j),r(2**j))

call init_random_seed() ! set new seed, see below

do k = 1, 10000000 ! Make the N = 2**j plot

call random_number(u)

r = u**onethird

y = sum(r)/2**j

n = 100*y + 1

plot(n) = plot(n) + 1

end do

rplot = 100*real(plot)/sum(plot)

write(ch_i1,"(i1)") j ! turns integer j into character ch_i1

open(7,file='plot'//ch_i1) ! opens and names files

do m = 1, 100

write(7,*) 0.01d0*(m-0.5), rplot(m)

end do

end program clt

subroutine init_random_seed()

implicit none

integer i, n, clock

integer, dimension(:), allocatable :: seed

call random_seed(size = n) ! find size of seed

allocate(seed(n))

call system_clock(count=clock) ! get time of processor clock

seed = clock + 37 * (/ (i-1, i=1, n) /) ! make seed

call random_seed(put=seed) ! set seed

deallocate(seed)

end subroutine init_random_seed

The distributions P^{(N)}(y) for N = 1, 2, 4, and 8 are plotted in Fig. 13.6. P^{(1)}(y) = 3y² is the original distribution. P^{(2)}(y) is trying to be a gaussian, while P^{(4)}(y) and P^{(8)}(y) have almost succeeded. The variance σ_y² = 3/(80N) shrinks with N.

Although fortran95 is an ideal language for computation, C++ is more versatile, more modular, and more suited to large projects involving many programmers. An equivalent C++ code written by Sean Cahill is:

#include <stdlib.h>

#include <time.h>

#include <math.h>

#include <string>

#include <iostream>


#include <fstream>

#include <sstream>

#include <iomanip>

#include <valarray>

using namespace std;

// Fills the array val with random numbers between 0 and 1

void rand01(valarray<double>& val)

{

// Records the size

unsigned int size = val.size();

// Loops through the size

unsigned int i=0;

for (i=0; i<size; i++)

{

// Generates a random number between 0 and 1

val[i] = static_cast<double>(rand()) / RAND_MAX;

}

}

void clt ()

{

// Declares local constants

const int PLOT_SIZE = 100;

const int LOOP_CALC_ITR = 10000000;

const double ONE_THIRD = 1.0 / 3.0;

// Inits local variables

double y=0;

int i=0, j=0, n=0;

// Gets the value of J

cout << "What is J? ";

cin >> j;

// Bases the vec size on J

const int VEC_SIZE = static_cast<int>(pow(2.0,j));


// Inits vectors

valarray<double> plot(PLOT_SIZE);

valarray<double> rplot(PLOT_SIZE);

valarray<double> r(VEC_SIZE);

// Seeds random number generator

srand ( time(NULL) );

// Performs the calculations

for (i=0; i<LOOP_CALC_ITR; i++)

{

rand01(r);

r = pow(r, ONE_THIRD);

y = r.sum() / VEC_SIZE;

n = static_cast<int>(100 * y);

plot[n]++;

}

// Normalizes RPLOT

rplot = plot * (100.0 / plot.sum());

// Opens a data file

ostringstream fileName;

fileName << "plot_" << j << ".txt";

ofstream fileHandle;

fileHandle.open (fileName.str().c_str());

// Sets precision

fileHandle.setf(ios::fixed,ios::floatfield);

fileHandle.precision(7);

// Writes the data to a file

for (i=1; i<=PLOT_SIZE; i++)

fileHandle << 0.01*(i-0.5) << " " << rplot[i-1] << endl;

// Closes the data file

fileHandle.close();

}
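The listing defines rand01 and clt but stops short of a main function. A one-line driver (my addition, not part of the original listing) makes it a complete program:

// Driver added for completeness; not part of the original listing.
int main()
{
    clt();
    return 0;
}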


13.21 Measurements, Estimators, and Friedrich Bessel

A probability distribution P(x; θ) for a stochastic variable x may depend upon one or more unknown parameters θ = (θ_1, ..., θ_m) such as the mean μ and the variance σ².

Experimenters seek to determine the unknown parameters θ by collecting data in the form of values x = x_1, ..., x_N of the stochastic variable x. They assume that the probability distribution for the sequence x = (x_1, ..., x_N) is the product of N factors of the physical distribution P(x; θ)

P(x; \theta) = \prod_{j=1}^{N} P(x_j; \theta).   (13.273)

They approximate the unknown value of a parameter θ_ℓ as the mean value of an estimator u_ℓ^{(N)}(x) of θ_ℓ

E[u_\ell^{(N)}] = \int u_\ell^{(N)}(x)\, P(x; \theta)\, d^N x = \theta_\ell + b_\ell^{(N)}   (13.274)

in which the bias b_ℓ^{(N)} depends upon θ and N. If as N → ∞ the bias b_ℓ^{(N)} → 0, then the estimator u_ℓ^{(N)}(x) is consistent.

Inasmuch as the mean (13.27) is the integral of the physical distribution

\mu = \int x\, P(x; \theta)\, dx   (13.275)

a natural estimator for the mean is

u_\mu^{(N)}(x) = (x_1 + \cdots + x_N)/N.   (13.276)

Its expected value is

E[u_\mu^{(N)}] = \int u_\mu^{(N)}(x)\, P(x; \theta)\, d^N x = \int \frac{x_1 + \cdots + x_N}{N}\, P(x; \theta)\, d^N x
 = \frac{1}{N} \sum_{k=1}^{N} \int x_k\, P(x_k; \theta)\, dx_k \prod_{j \ne k} \int P(x_j; \theta)\, dx_j = \frac{1}{N} \sum_{k=1}^{N} \mu = \mu.   (13.277)

Thus the natural estimator u_μ^{(N)}(x) of the mean (13.276) has zero bias, b^{(N)} = 0, and so it is a consistent and unbiased estimator for the mean.

Since the variance (13.30) of the probability distribution P(x; θ) is the integral

\sigma^2 = \int (x - \mu)^2\, P(x; \theta)\, dx   (13.278)


the variance of the estimator u_μ^{(N)} is

V[u_\mu^{(N)}] = \int \left( u_\mu^{(N)}(x) - \mu \right)^2 P(x; \theta)\, d^N x = \int \left[ \frac{1}{N} \sum_{j=1}^{N} (x_j - \mu) \right]^2 P(x; \theta)\, d^N x
 = \frac{1}{N^2} \sum_{j,k=1}^{N} \int (x_j - \mu)(x_k - \mu)\, P(x; \theta)\, d^N x
 = \frac{1}{N^2} \sum_{j,k=1}^{N} \delta_{jk} \int (x_j - \mu)^2\, P(x; \theta)\, d^N x = \frac{1}{N^2} \sum_{j,k=1}^{N} \delta_{jk}\, \sigma^2 = \frac{\sigma^2}{N}   (13.279)

in which σ² is the variance (13.278) of the physical distribution P(x; θ). We'll learn in the next section that no estimator of the mean can have a lower variance than this.

A natural estimator for the variance of the probability distribution P(x; θ) is

u_{\sigma^2}^{(N)}(x) = B \sum_{j=1}^{N} \left( x_j - u_\mu^{(N)}(x) \right)^2   (13.280)

in which B = B(N) is a constant of proportionality. The naive choice B(N) = 1/N leads to a biased estimator. To find the correct value of B, we set the expected value E[u_{σ²}^{(N)}] equal to σ²

E[u_{\sigma^2}^{(N)}] = \int B \sum_{j=1}^{N} \left( x_j - u_\mu^{(N)}(x) \right)^2 P(x; \theta)\, d^N x = \sigma^2   (13.281)

and solve for B. Subtracting the mean μ from both x_j and u_μ^{(N)}(x), we express σ²/B as the sum of three terms

\frac{\sigma^2}{B} = \sum_{j=1}^{N} \int \left[ x_j - \mu - \left( u_\mu^{(N)}(x) - \mu \right) \right]^2 P(x; \theta)\, d^N x = S_{jj} + S_{j\mu} + S_{\mu\mu}   (13.282)

the first of which is

S_{jj} = \sum_{j=1}^{N} \int (x_j - \mu)^2\, P(x; \theta)\, d^N x = N\sigma^2.   (13.283)


The cross-term S_{jμ} is

S_{j\mu} = -2 \sum_{j=1}^{N} \int (x_j - \mu) \left( u_\mu^{(N)}(x) - \mu \right) P(x; \theta)\, d^N x
 = -\frac{2}{N} \sum_{j=1}^{N} \int (x_j - \mu) \sum_{k=1}^{N} (x_k - \mu)\, P(x; \theta)\, d^N x = -2\sigma^2.   (13.284)

The third term is related to the variance (13.279)

S_{\mu\mu} = \sum_{j=1}^{N} \int \left( u_\mu^{(N)}(x) - \mu \right)^2 P(x; \theta)\, d^N x = N\, V[u_\mu^{(N)}] = \sigma^2.   (13.285)

Thus the factor B must satisfy

\sigma^2 / B = N\sigma^2 - 2\sigma^2 + \sigma^2 = (N - 1)\sigma^2   (13.286)

which tells us that B = 1/(N − 1), which is Bessel's correction. Our estimator for the variance σ² = E[u_{σ²}^{(N)}] of the probability distribution P(x; θ) then is

u_{\sigma^2}^{(N)}(x) = \frac{1}{N-1} \sum_{j=1}^{N} \left( x_j - u_\mu^{(N)}(x) \right)^2 = \frac{1}{N-1} \sum_{j=1}^{N} \left( x_j - \frac{1}{N} \sum_{k=1}^{N} x_k \right)^2.   (13.287)

It is consistent and unbiased since E[u_{σ²}^{(N)}] = σ² by construction (13.281). It gives for the variance σ² of a single measurement the undefined ratio 0/0, as it should, whereas the naive choice B = 1/N absurdly gives zero.

On the basis of N measurements x_1, ..., x_N we can estimate the mean of the unknown probability distribution P(x; θ) as μ_N = (x_1 + ... + x_N)/N. And we can use Bessel's formula (13.287) to estimate the variance σ² of the unknown distribution P(x; θ). Our formula (13.279) for the variance σ²(μ_N) of the mean μ_N then gives

\sigma^2(\mu_N) = \frac{\sigma^2}{N} = \frac{1}{N(N-1)} \sum_{j=1}^{N} \left( x_j - \frac{1}{N} \sum_{k=1}^{N} x_k \right)^2.   (13.288)

Thus we can use N measurements x_j to estimate the mean μ to within a standard error or standard deviation of

\sigma(\mu_N) = \sqrt{\frac{\sigma^2}{N}} = \sqrt{ \frac{1}{N(N-1)} \sum_{j=1}^{N} \left( x_j - \frac{1}{N} \sum_{k=1}^{N} x_k \right)^2 }.   (13.289)


Few formulas have seen so much use.
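In code, the estimators of this section are only a few lines. The sketch below is mine, not the book's; the data values are made up for illustration.

// Sketch (not from the text): the estimators (13.276) and (13.287) and the
// standard error of the mean (13.289) applied to hypothetical measurements.
#include <cmath>
#include <iostream>
#include <vector>

int main() {
  std::vector<double> x = {9.8, 10.1, 9.9, 10.3, 10.0, 9.7};  // hypothetical data
  const std::size_t N = x.size();
  double mean = 0.0;
  for (double xi : x) mean += xi;
  mean /= N;                                   // u_mu = (x_1 + ... + x_N)/N
  double s2 = 0.0;
  for (double xi : x) s2 += (xi - mean) * (xi - mean);
  s2 /= (N - 1);                               // Bessel's estimator (13.287)
  double stderrMean = std::sqrt(s2 / N);       // standard error (13.289)
  std::cout << "mean = " << mean << "   variance = " << s2
            << "   standard error of the mean = " << stderrMean << "\n";
}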

13.22 Information and Ronald Fisher

The Fisher information matrix of a distribution P(x; θ) is the mean of products of its partial logarithmic derivatives

F_{k\ell}(\theta) \equiv E\left[ \frac{\partial \ln P(x;\theta)}{\partial \theta_k}\, \frac{\partial \ln P(x;\theta)}{\partial \theta_\ell} \right]
 = \int \frac{\partial \ln P(x;\theta)}{\partial \theta_k}\, \frac{\partial \ln P(x;\theta)}{\partial \theta_\ell}\, P(x;\theta)\, d^N x   (13.290)

(Ronald Fisher, 1890-1962). Fisher's matrix (exercise 13.31) is symmetric, F_{kℓ} = F_{ℓk}, and nonnegative (1.39), and when it is positive (1.40), it has an inverse. By differentiating the normalization condition

\int P(x;\theta)\, d^N x = 1   (13.291)

we have

0 = \int \frac{\partial P(x;\theta)}{\partial \theta_k}\, d^N x = \int \frac{\partial \ln P(x;\theta)}{\partial \theta_k}\, P(x;\theta)\, d^N x   (13.292)

which says that the mean value of the logarithmic derivative of the probability distribution, a quantity called the score, vanishes. Using the notation (ln P)_{,k} ≡ ∂ ln P/∂θ_k and (ln P)_{,kℓ} ≡ ∂(∂ ln P/∂θ_k)/∂θ_ℓ and differentiating again, one has (exercise 13.32)

\int (\ln P)_{,k}\, (\ln P)_{,\ell}\, P\, d^N x = -\int (\ln P)_{,k\ell}\, P\, d^N x   (13.293)

so that another form of Fisher's information matrix is

F_{k\ell}(\theta) = -E\left[ (\ln P)_{,k\ell} \right] = -\int (\ln P)_{,k\ell}\, P\, d^N x.   (13.294)

Cramer and Rao used Fisher's information matrix to form a lower bound on the covariance matrix (13.41) C[u_k, u_ℓ] of any two estimators. To see how this works, we use the vanishing (13.292) of the mean of the score to write the covariance of the kth score V_k ≡ (ln P(x;θ))_{,k} with the ℓth estimator u_ℓ(x) as a derivative ⟨u_ℓ⟩_{,k} of the mean ⟨u_ℓ⟩

C[V_k, u_\ell] = \int (\ln P)_{,k}\, (u_\ell - b_\ell - \theta_\ell)\, P\, d^N x = \int (\ln P)_{,k}\, u_\ell\, P\, d^N x
 = \int P_{,k}(x;\theta)\, u_\ell(x)\, d^N x = \langle u_\ell \rangle_{,k}.   (13.295)


Thus for any two sets of constants y_k and w_ℓ, we have with P = √P √P

\sum_{k,\ell=1}^{m} y_k\, \partial_k \langle u_\ell \rangle\, w_\ell = \int \sum_{k,\ell=1}^{m} y_k\, (\ln P)_{,k}\, \sqrt{P}\; (u_\ell - b_\ell - \theta_\ell)\, w_\ell\, \sqrt{P}\, d^N x.   (13.296)

We can suppress some indices by grouping the y_j's, the w_j's, and so forth into the vectors Y^T = (y_1, ..., y_m), W^T = (w_1, ..., w_m), U^T = (u_1, ..., u_m), B^T = (b_1, ..., b_m), and Θ^T = (θ_1, ..., θ_m), and by grouping the ∂_k⟨u_ℓ⟩'s into a matrix (∇U)_{kℓ} which by (13.274) is

(\nabla U)_{k\ell} \equiv \partial_k \langle u_\ell \rangle = \partial_k \left( \theta_\ell + b_\ell \right) = \delta_{k\ell} + \partial_k b_\ell.   (13.297)

In this compact notation, our relation (13.296) is

Y^T \nabla U\, W = \int Y^T (\nabla \ln P)\, \sqrt{P}\; (U^T - B^T - \Theta^T)\, W\, \sqrt{P}\, d^N x.   (13.298)

Squaring, we apply Schwarz's inequality (6.457)

\left[ Y^T \nabla U\, W \right]^2 = \left[ \int Y^T (\nabla \ln P)\, \sqrt{P}\; (U^T - B^T - \Theta^T)\, W\, \sqrt{P}\, d^N x \right]^2
 \le \int \left[ Y^T (\nabla \ln P)\, \sqrt{P} \right]^2 d^N x \int \left[ (U^T - B^T - \Theta^T)\, W\, \sqrt{P} \right]^2 d^N x
 = \int \left[ Y^T \nabla \ln P \right]^2 P\, d^N x \int \left[ (U^T - B^T - \Theta^T)\, W \right]^2 P\, d^N x.   (13.299)

In the last line, we recognize the first integral as Y^T F Y, where F is Fisher's matrix (13.290), and the second as W^T C W, in which C is the covariance of the estimators

C_{k\ell} \equiv C[U, U]_{k\ell} = C[u_k - b_k - \theta_k,\; u_\ell - b_\ell - \theta_\ell].   (13.300)

So (13.299) says

\left( Y^T \nabla U\, W \right)^2 \le Y^T F Y\; W^T C W.   (13.301)

Thus as long as the symmetric nonnegative matrix F is positive and so has an inverse, we can set the arbitrary constant vector Y = F^{-1} \nabla U\, W and get

\left( W^T \nabla U^T F^{-1} \nabla U\, W \right)^2 \le W^T \nabla U^T F^{-1} \nabla U\, W\; W^T C W.   (13.302)

Canceling a common factor, we obtain the Cramer-Rao inequality

W^T C W \ge W^T \nabla U^T F^{-1} \nabla U\, W   (13.303)

often written as

C \ge \nabla U^T\, F^{-1}\, \nabla U.   (13.304)


By (13.297), the matrix ∇U is the identity matrix I plus the gradient of the bias B

\nabla U = I + \nabla B.   (13.305)

Thus another form of the Cramer-Rao inequality is

C \ge (I + \nabla B)^T\, F^{-1}\, (I + \nabla B)   (13.306)

or in terms of the arbitrary vector W

W^T C W \ge W^T (I + \nabla B)^T\, F^{-1}\, (I + \nabla B)\, W.   (13.307)

Letting the arbitrary vector W be W_j = δ_{jk}, one arrives at (exercise 13.33) the Cramer-Rao lower bound on the variance V[u_k] = C[u_k, u_k]

V[u_k] \ge (F^{-1})_{kk} + \sum_{\ell=1}^{m} 2\,(F^{-1})_{k\ell}\, \partial_\ell b_k + \sum_{\ell,j=1}^{m} (F^{-1})_{\ell j}\, \partial_\ell b_k\, \partial_j b_k.   (13.308)

If the estimator u_k(x) is unbiased, then this lower bound simplifies to

V[u_k] \ge (F^{-1})_{kk}.   (13.309)

Example 13.23 (Cramer-Rao Bound for a Gaussian). The elements of Fisher's information matrix for the mean μ and variance σ² of Gauss's distribution for N data points x_1, ..., x_N

P_G^{(N)}(x, \mu, \sigma) = \prod_{j=1}^{N} P_G(x_j; \mu, \sigma) = \left( \frac{1}{\sigma\sqrt{2\pi}} \right)^N \exp\left( -\sum_{j=1}^{N} \frac{(x_j - \mu)^2}{2\sigma^2} \right)   (13.310)

are

F_{\mu\mu} = \int \left[ \left( \ln P_G^{(N)}(x, \mu, \sigma) \right)_{,\mu} \right]^2 P_G^{(N)}(x, \mu, \sigma)\, d^N x
 = \sum_{i,j=1}^{N} \int \left( \frac{x_i - \mu}{\sigma^2} \right) \left( \frac{x_j - \mu}{\sigma^2} \right) P_G^{(N)}(x, \mu, \sigma)\, d^N x
 = \sum_{i=1}^{N} \int \left( \frac{x_i - \mu}{\sigma^2} \right)^2 P_G^{(N)}(x, \mu, \sigma)\, d^N x = \frac{N}{\sigma^2}   (13.311)

F_{\mu\sigma^2} = \int \left( \ln P_G^{(N)}(x, \mu, \sigma) \right)_{,\mu} \left( \ln P_G^{(N)}(x, \mu, \sigma) \right)_{,\sigma^2} P_G^{(N)}(x, \mu, \sigma)\, d^N x
 = \sum_{i,j=1}^{N} \int \frac{x_i - \mu}{\sigma^2} \left[ \frac{(x_j - \mu)^2}{2\sigma^4} - \frac{1}{2\sigma^2} \right] P_G^{(N)}(x, \mu, \sigma)\, d^N x = 0


F_{σ²μ} = F_{μσ²} = 0, and

F_{\sigma^2\sigma^2} = \int \left[ \left( \ln P_G^{(N)}(x, \mu, \sigma) \right)_{,\sigma^2} \right]^2 P_G^{(N)}(x, \mu, \sigma)\, d^N x
 = \sum_{i,j=1}^{N} \int \left[ \frac{(x_i - \mu)^2}{2\sigma^4} - \frac{1}{2\sigma^2} \right] \left[ \frac{(x_j - \mu)^2}{2\sigma^4} - \frac{1}{2\sigma^2} \right] P_G^{(N)}(x, \mu, \sigma)\, d^N x = \frac{N}{2\sigma^4}.   (13.312)

The inverse of Fisher's matrix then is diagonal with (F^{-1})_{μμ} = σ²/N and (F^{-1})_{σ²σ²} = 2σ⁴/N.

The variance of any unbiased estimator u_μ(x) of the mean must exceed its Cramer-Rao lower bound (13.309), and so V[u_μ] ≥ (F^{-1})_{μμ} = σ²/N. The variance V[u_μ^{(N)}] of the natural estimator of the mean u_μ^{(N)}(x) = (x_1 + ... + x_N)/N is σ²/N by (13.279), and so it respects and saturates the lower bound (13.309)

V[u_\mu^{(N)}] = E[(u_\mu^{(N)} - \mu)^2] = \sigma^2/N = (F^{-1})_{\mu\mu}.   (13.313)

One may show (exercise 13.34) that the variance V[u_{σ²}^{(N)}] of Bessel's estimator (13.287) of the variance is (Riley et al., 2006, p. 1248)

V[u_{\sigma^2}^{(N)}] = \frac{1}{N} \left( \nu_4 - \frac{N-3}{N-1}\, \sigma^4 \right)   (13.314)

where ν_4 is the fourth central moment (13.28) of the probability distribution. For the gaussian P_G(x; μ, σ) one may show (exercise 13.35) that this moment is ν_4 = 3σ⁴, and so for it

V_G[u_{\sigma^2}^{(N)}] = \frac{2}{N-1}\, \sigma^4.   (13.315)

Thus the variance of Bessel's estimator of the variance respects but does not saturate its Cramer-Rao lower bound (13.309, 13.312)

V_G[u_{\sigma^2}^{(N)}] = \frac{2}{N-1}\, \sigma^4 > \frac{2}{N}\, \sigma^4.   (13.316)

Estimators that saturate their Cramer-Rao lower bounds are efficient. The natural estimator u_μ^{(N)}(x) of the mean is efficient as well as consistent and unbiased, and Bessel's estimator u_{σ²}^{(N)}(x) of the variance is consistent and unbiased but not efficient.
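A Monte Carlo check of (13.313)-(13.316) takes only a few lines. The sketch below is mine, not the book's; μ, σ, N, and the number of trials are arbitrary choices. For many gaussian samples of size N it estimates the variances of the sample mean and of Bessel's estimator and compares them with σ²/N, 2σ⁴/(N − 1), and the Cramer-Rao bound 2σ⁴/N.

// Sketch (not from the text): Monte Carlo check of (13.313)-(13.316) for gaussian data.
#include <cmath>
#include <iostream>
#include <random>
#include <vector>

int main() {
  const double mu = 1.0, sigma = 2.0;
  const int N = 10, trials = 200000;
  std::mt19937 gen(3);
  std::normal_distribution<double> gauss(mu, sigma);
  double m1 = 0, m2 = 0, v1 = 0, v2 = 0;       // running sums for the two estimators
  for (int t = 0; t < trials; ++t) {
    std::vector<double> x(N);
    double mean = 0;
    for (double& xi : x) { xi = gauss(gen); mean += xi; }
    mean /= N;
    double s2 = 0;
    for (double xi : x) s2 += (xi - mean) * (xi - mean);
    s2 /= (N - 1);                             // Bessel's estimator (13.287)
    m1 += mean;  m2 += mean * mean;
    v1 += s2;    v2 += s2 * s2;
  }
  double varMean = m2 / trials - (m1 / trials) * (m1 / trials);
  double varS2   = v2 / trials - (v1 / trials) * (v1 / trials);
  std::cout << "V[u_mu]      = " << varMean
            << "  vs sigma^2/N = " << sigma * sigma / N << "\n";
  std::cout << "V[u_sigma^2] = " << varS2
            << "  vs 2 sigma^4/(N-1) = " << 2 * std::pow(sigma, 4) / (N - 1)
            << "  and bound 2 sigma^4/N = " << 2 * std::pow(sigma, 4) / N << "\n";
}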


13.23 Maximum Likelihood

Suppose we measure some quantity x at various values of another variable t and find the values x_1, x_2, ..., x_N at the known points t_1, t_2, ..., t_N. We might want to fit these measurements to a curve x = f(t; α) where α = α_1, ..., α_M is a set of M < N parameters. In view of the central limit theorem, we'll assume that the points x_j fall in Gauss's distribution about the values x_j = f(t_j; α) with some known variance σ². The probability of getting the N values x_1, ..., x_N then is

P(x) = \prod_{j=1}^{N} P(x_j, t_j, \sigma) = \left( \frac{1}{\sigma\sqrt{2\pi}} \right)^N \exp\left( -\sum_{j=1}^{N} \frac{(x_j - f(t_j;\alpha))^2}{2\sigma^2} \right).   (13.317)

To find the M parameters α, we maximize the likelihood P(x) by minimizing the argument of its exponential

0 = \frac{\partial}{\partial\alpha_\ell} \sum_{j=1}^{N} (x_j - f(t_j;\alpha))^2 = -2 \sum_{j=1}^{N} (x_j - f(t_j;\alpha))\, \frac{\partial f(t_j;\alpha)}{\partial\alpha_\ell}.   (13.318)

If the function f(t; α) depends nonlinearly upon the parameters α, then we may need to use numerical methods to solve this least-squares problem. But if the function f(t; α) depends linearly upon the M parameters α

f(t;\alpha) = \sum_{k=1}^{M} g_k(t)\, \alpha_k   (13.319)

then the equations (13.318) that determine these parameters α are linear

0 = \sum_{j=1}^{N} \left( x_j - \sum_{k=1}^{M} g_k(t_j)\, \alpha_k \right) g_\ell(t_j).   (13.320)

In matrix notation with G the N×M rectangular matrix with entries G_{jk} = g_k(t_j), they are

G^T x = G^T G\, \alpha.   (13.321)

The basis functions g_k(t) may depend nonlinearly upon the independent variable t. If one chooses them to be sufficiently different that the columns of G are linearly independent, then the rank of G is M, and the nonnegative matrix G^T G has an inverse. The matrix G then has a pseudoinverse (1.431)

G^+ = \left( G^T G \right)^{-1} G^T   (13.322)


and it maps the N-vector x into our parameters α

\alpha = G^+ x.   (13.323)

The product G^+ G = I_M is the M×M identity matrix, while

G\, G^+ = P   (13.324)

is an N×N projection operator (exercise 13.36) onto the M-dimensional subspace on which G^+ G = I_M acts as the identity operator. Like all projection operators, P satisfies P² = P.
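For M = 2 basis functions g_1(t) = 1 and g_2(t) = t, the normal equations (13.321) are a 2×2 system that can be solved by hand. The sketch below is mine, not the book's; the data points are invented for illustration.

// Sketch (not from the text): straight-line least squares via the normal
// equations (13.321), f(t; alpha) = alpha_1 + alpha_2 t.
#include <iostream>
#include <vector>

int main() {
  std::vector<double> t = {0, 1, 2, 3, 4, 5};
  std::vector<double> x = {1.1, 2.9, 5.2, 7.1, 8.8, 11.2};   // roughly 1 + 2 t
  // Entries of G^T G and G^T x for g_1(t) = 1, g_2(t) = t.
  double s1 = 0, st = 0, stt = 0, sx = 0, stx = 0;
  for (std::size_t j = 0; j < t.size(); ++j) {
    s1 += 1;  st += t[j];  stt += t[j] * t[j];
    sx += x[j];  stx += t[j] * x[j];
  }
  // Solve the 2x2 system (G^T G) alpha = G^T x.
  double det = s1 * stt - st * st;
  double a1 = (stt * sx - st * stx) / det;    // intercept alpha_1
  double a2 = (s1 * stx - st * sx) / det;     // slope alpha_2
  std::cout << "alpha_1 (intercept) = " << a1 << "   alpha_2 (slope) = " << a2 << "\n";
}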

13.24 Karl Pearson’s Chi-Squared Statistic

The argument of the exponential (13.317) in P(x) is (the negative of) Karl Pearson's chi-squared statistic (Pearson, 1900)

\chi^2 \equiv \sum_{j=1}^{N} \frac{(x_j - f(t_j;\alpha))^2}{2\sigma^2}.   (13.325)

When the function f(t; α) is linear (13.319) in α, the N-vector of fitted values f(t_j; α) is f = Gα. Pearson's χ² then is

\chi^2 = (x - G\alpha)^2 / 2\sigma^2.   (13.326)

Now (13.323) tells us that α = G^+ x, and so in terms of the projection operator P = G G^+, the vector x − Gα is

x - G\alpha = x - G G^+ x = \left( I - G G^+ \right) x = (I - P)\, x.   (13.327)

So χ² is proportional to the squared length

\chi^2 = x_\perp^2 / 2\sigma^2   (13.328)

of the vector

x_\perp \equiv (I - P)\, x.   (13.329)

Thus if the matrix G has rank M, and the vector x has N independent components, then the vector x_⊥ has only N − M independent components.

Example 13.24 (Two Position Measurements). Suppose we measure a position twice with error σ, get x_1 and x_2, and choose G^T = (1, 1). Then the


single parameter α is their average, α = (x_1 + x_2)/2, and χ² is

\chi^2 = \left\{ [x_1 - (x_1 + x_2)/2]^2 + [x_2 - (x_1 + x_2)/2]^2 \right\} \big/ 2\sigma^2
 = \left\{ [(x_1 - x_2)/2]^2 + [(x_2 - x_1)/2]^2 \right\} \big/ 2\sigma^2
 = \left[ (x_1 - x_2)/\sqrt{2} \right]^2 \big/ 2\sigma^2.   (13.330)

Thus instead of having two independent components x_1 and x_2, χ² just has one, (x_1 − x_2)/√2.

We can see how this happens more generally if we use as basis vectors the N − M orthonormal vectors |j⟩ in the kernel of P (that is, the |j⟩'s annihilated by P)

P|j\rangle = 0 \qquad 1 \le j \le N - M   (13.331)

and the M that lie in the range of the projection operator P

P|k\rangle = |k\rangle \qquad N - M + 1 \le k \le N.   (13.332)

In terms of these basis vectors, the N-vector x is

x = \sum_{j=1}^{N-M} x_j |j\rangle + \sum_{k=N-M+1}^{N} x_k |k\rangle   (13.333)

and the last M components of the vector x_⊥ vanish

x_\perp = (I - P)\, x = \sum_{j=1}^{N-M} x_j |j\rangle.   (13.334)

Example 13.25 (N position measurements). Suppose the N values of x_j are the measured values of the position f(t_j; α) = x_j of some object. Then M = 1, and we choose G_{j1} = g_1(t_j) = 1 for j = 1, ..., N. Now G^T G = N is a 1×1 matrix, the number N, and the parameter α is the mean x̄

\alpha = G^+ x = \left( G^T G \right)^{-1} G^T x = \frac{1}{N} \sum_{j=1}^{N} x_j = \bar x   (13.335)

of the N position measurements x_j. So the vector x_⊥ has components x_j − x̄ and is orthogonal to G^T = (1, 1, ..., 1)

G^T x_\perp = \left( \sum_{j=1}^{N} x_j \right) - N \bar x = 0.   (13.336)


The matrix G^T has rank 1, and the vector x_⊥ has N − 1 independent components.

Suppose now that we have determined our M parameters α and have a theoretical fit

x = f(t;\alpha) = \sum_{k=1}^{M} g_k(t)\, \alpha_k   (13.337)

which when we apply it to N measurements x_j gives χ² as

\chi^2 = (x_\perp)^2 / 2\sigma^2.   (13.338)

How good is our fit? A χ² distribution with N − M degrees of freedom has by (13.235) mean

E[\chi^2] = N - M   (13.339)

and variance

V[\chi^2] = 2(N - M).   (13.340)

So our χ² should be about

\chi^2 \approx N - M \pm \sqrt{2(N - M)}.   (13.341)

If it lies within this range, then (13.337) is a good fit to the data. But if it exceeds N − M + √(2(N − M)), then the fit isn't so good. On the other hand, if χ² is less than N − M − √(2(N − M)), then we may have used too many parameters or overestimated σ. Indeed, by using N parameters with G G^+ = I_N, we could get χ² = 0 every time.

The probability that χ² exceeds χ_0² is the integral (13.234)

\Pr_n(\chi^2 > \chi_0^2) = \int_{\chi_0^2}^{\infty} P_n(\chi^2/2)\, d\chi^2 = \int_{\chi_0^2}^{\infty} \frac{1}{2\,\Gamma(n/2)} \left( \frac{\chi^2}{2} \right)^{n/2-1} e^{-\chi^2/2}\, d\chi^2   (13.342)

in which n = N − M is the number of data points minus the number of parameters, and Γ(n/2) is the gamma function (5.105, 4.68). So an M-parameter fit to N data points has only a chance of ε of being good if its χ² is greater than a χ_0² for which Pr_{N−M}(χ² > χ_0²) = ε. These probabilities Pr_{N−M}(χ² > χ_0²) are plotted in Fig. 13.7 for N − M = 2, 4, 6, 8, and 10. In particular, the probability of a value of χ² greater than χ_0² = 20 is 0.000045, 0.000499, 0.00277, 0.010336, and 0.029253 for N − M = 2, 4, 6, 8, and 10, respectively.
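The tail probability (13.342) is easy to evaluate numerically. The sketch below is mine, not the book's: it integrates the density in (13.342) by Simpson's rule in the variable t = χ²/2; for n = 2 the exact tail is e^{-χ_0²/2}, a handy check, and the cutoff and step count are arbitrary choices.

// Sketch (not from the text): Pr_n(chi^2 > chi0^2) from (13.342) by Simpson's rule.
#include <cmath>
#include <iostream>

double prChi2Exceeds(int n, double chi0sq) {
  const double a = chi0sq / 2.0;        // lower limit in t = chi^2/2
  const double b = a + 60.0;            // far enough into the tail to be negligible
  const int steps = 20000;              // even number of Simpson intervals
  const double h = (b - a) / steps;
  auto f = [n](double t) { return std::pow(t, 0.5 * n - 1.0) * std::exp(-t); };
  double sum = f(a) + f(b);
  for (int i = 1; i < steps; ++i) sum += f(a + i * h) * (i % 2 ? 4.0 : 2.0);
  return sum * h / 3.0 / std::tgamma(0.5 * n);
}

int main() {
  for (int n : {2, 4, 6, 8, 10})
    std::cout << "n = " << n << "   Pr(chi^2 > 20) = " << prChi2Exceeds(n, 20.0) << "\n";
}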


Figure 13.7 The chi-squared test: the probabilities Pr_{N−M}(χ² > χ_0²) are plotted from left to right for N − M = 2, 4, 6, 8, and 10 degrees of freedom as functions of χ_0².

13.25 Kolmogorov’s Test

Suppose we want to use a sequence of N measurements x_j to determine the probability distribution that they come from. Our empirical probability distribution is

P_e^{(N)}(x) = \frac{1}{N} \sum_{j=1}^{N} \delta(x - x_j).   (13.343)

Our cumulative probability for events less than x then is

\Pr_e^{(N)}(-\infty, x) = \int_{-\infty}^{x} P_e^{(N)}(x')\, dx' = \int_{-\infty}^{x} \frac{1}{N} \sum_{j=1}^{N} \delta(x' - x_j)\, dx'.   (13.344)


So if we label our events in increasing order x_1 ≤ x_2 ≤ ... ≤ x_N, then the probability of an event less than x is a staircase

\Pr_e^{(N)}(-\infty, x) = \frac{j}{N} \qquad \text{for} \quad x_j < x < x_{j+1}.   (13.345)

Having approximately and experimentally determined our empirical cumulative probability distribution Pr_e^{(N)}(−∞, x), we might want to know whether it comes from some hypothetical, theoretical cumulative probability distribution Pr_t(−∞, x). One way to do this is to compute the distance D_N between the two cumulative probability distributions

D_N = \sup_{-\infty < x < \infty} \left| \Pr_e^{(N)}(-\infty, x) - \Pr_t(-\infty, x) \right|   (13.346)

in which sup stands for supremum and means least upper bound. Since cumulative probabilities lie between zero and one, it follows (exercise 13.37) that the Kolmogorov distance is bounded by

0 \le D_N \le 1.   (13.347)

In general, as the number N of data points increases, we expect that our empirical distribution Pr_e^{(N)}(−∞, x) should approach the actual distribution Pr_e(−∞, x) from which the events x_j came. In this case, the Kolmogorov distance D_N should converge to a limiting value D_∞

\lim_{N\to\infty} D_N = D_\infty = \sup_{-\infty < x < \infty} \left| \Pr_e(-\infty, x) - \Pr_t(-\infty, x) \right| \in [0, 1].   (13.348)

If the empirical distribution Pr_e(−∞, x) is the same as the theoretical distribution Pr_t(−∞, x), then we expect that D_∞ = 0. This expectation is confirmed by a theorem due to Glivenko (Glivenko, 1933; Cantelli, 1933) according to which the probability that the Kolmogorov distance D_N should go to zero as N → ∞ is unity

\Pr(D_\infty = 0) = 1.   (13.349)

The real issue is how fast D_N should decrease with N if our events x_j do come from Pr_t(−∞, x). This question was answered by Kolmogorov, who showed (Kolmogorov, 1933) that if the theoretical distribution Pr_t(−∞, x) is continuous, then for large N (and for u > 0) the probability that √N D_N is less than u is given by the Kolmogorov function K(u)

\lim_{N\to\infty} \Pr(\sqrt{N}\, D_N < u) = K(u) \equiv 1 + 2 \sum_{k=1}^{\infty} (-1)^k\, e^{-2k^2 u^2}.   (13.350)
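In practice one sorts the sample, computes D_N from the staircase (13.345), and compares √N D_N with K(u). The sketch below is mine, not the book's; it tests gaussian input against the standard-normal cumulative distribution (13.351), and the sample size and seed are arbitrary choices.

// Sketch (not from the text): Kolmogorov distance (13.346) of a sample from the
// standard-normal CDF, and the series (13.350) for K(u).
#include <algorithm>
#include <cmath>
#include <iostream>
#include <random>
#include <vector>

double gaussCdf(double x) { return 0.5 * (std::erf(x / std::sqrt(2.0)) + 1.0); }

double kolmogorovK(double u) {   // K(u) = 1 + 2 sum_k (-1)^k exp(-2 k^2 u^2)
  double k = 1.0;
  for (int j = 1; j <= 100; ++j)
    k += 2.0 * ((j % 2) ? -1.0 : 1.0) * std::exp(-2.0 * j * j * u * u);
  return k;
}

int main() {
  const int N = 10000;
  std::mt19937 gen(1);
  std::normal_distribution<double> gauss(0.0, 1.0);
  std::vector<double> x(N);
  for (double& xi : x) xi = gauss(gen);
  std::sort(x.begin(), x.end());
  double d = 0.0;                // D_N from the empirical staircase (13.345)
  for (int j = 0; j < N; ++j) {
    double F = gaussCdf(x[j]);
    d = std::max({d, std::fabs((j + 1.0) / N - F), std::fabs(j / double(N) - F)});
  }
  double u = std::sqrt(double(N)) * d;
  std::cout << "sqrt(N) D_N = " << u << "   K(u) = " << kolmogorovK(u) << "\n";
}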


Figure 13.8 Kolmogorov's cumulative probability distribution K(u) defined by (13.350) rises from zero to unity as u runs from zero to about two.

Amazingly, this limiting distribution K(u) is universal, independent of the particular probability distributions Pr_e(−∞, x) and Pr_t(−∞, x).

On the other hand, if our events x_j come from a different probability distribution Pr_e(−∞, x), then as N → ∞ we should expect that Pr_e^{(N)}(−∞, x) → Pr_e(−∞, x), and so that D_N should converge to a positive constant D_∞ ∈ (0, 1]. In this case, we expect that as N → ∞ the quantity √N D_N should grow with N as √N D_∞.

Example 13.26 (Kolmogorov's Test). How do we use (13.350)? As illustrated in Fig. 13.8, Kolmogorov's distribution K(u) rises from zero to unity on (0, ∞), reaching 0.9993 already at u = 2. So if our points x_j come from the theoretical distribution, then Kolmogorov's theorem (13.350) tells us that as N → ∞, the probability that √N D_N is less than 2 is more than 99.9%. But if the experimental points x_j do not come from the theoretical distribution, then the quantity √N D_N should grow as √N D_∞ as N → ∞.

To see what this means in practice, I took as the theoretical distribution P_t(x) = P_G(x, 0, 1), which has the cumulative probability distribution (13.92)


Figure 13.9 The probability distributions of Gauss, P_G(x, 0, 1), and of Gosset (Student), P_S(x, 3, 1), both with zero mean and unit variance.

\Pr_t(-\infty, x) = \frac{1}{2} \left[ \mathrm{erf}\!\left( x/\sqrt{2} \right) + 1 \right].   (13.351)

I generated N = 10^m points x_j for m = 1, 2, 3, 4, 5, and 6 from this theoretical distribution P_t(x) = P_G(x, 0, 1) and computed u_N = √(10^m) D_{10^m} for these points. I found √(10^m) D_{10^m} = 0.6928, 0.7074, 1.2000, 0.7356, 1.2260, and 1.0683. All were less than 2, as expected since I had taken the experimental points x_j from the theoretical distribution.

To see what happens when the experimental points do not come from the theoretical distribution P_t(x) = P_G(x, 0, 1), I generated N = 10^m points x_j for m = 1, 2, 3, 4, 5, and 6 from Gosset's Student's distribution P_S(x, 3, 1) defined by (13.224) with ν = 3 and a = 1. Both P_t(x) = P_G(x, 0, 1) and P_S(x, 3, 1) have the same mean μ = 0 and standard deviation σ = 1, as illustrated in Fig. 13.9. For these points, I computed u_N = √N D_N and found √(10^m) D_{10^m} = 0.7741, 1.4522, 3.3837, 9.0478, 27.6414, and 87.8147.


Figure 13.10 Kolmogorov's test applied to points x_j taken from Gauss's distribution P_G(x, 0, 1) and from Gosset's Student's distribution P_S(x, 3, 1) to see whether the x_j came from P_G(x, 0, 1). The thick smooth curve is Kolmogorov's universal cumulative probability distribution K(u) defined by (13.350). The thin jagged curve that clings to K(u) is the cumulative probability distribution Pr^{(4)}_{e,G,G}(−∞, u) made (13.352) from points taken from P_G(x, 0, 1). The other curves Pr^{(m)}_{e,S,G}(−∞, u) for m = 2 and 3 are made (13.353) from 10^m points taken from P_S(x, 3, 1).

Only the first two are less than 2, and the last four grow as √N, indicating that the x_j had not come from the theoretical distribution. In fact, we can approximate the limiting value of D_N as D_∞ ≈ u_{10^6}/√(10^6) = 0.0878. The exact value is (exercise 13.40) D_∞ = 0.0868552356.

At the risk of overemphasizing this example, I carried it one step further. I generated ℓ = 1, 2, ..., 100 sets of N = 10^m points x_j^{(ℓ)} for m = 2, 3, and 4 drawn from P_G(x, 0, 1) and from P_S(x, 3, 1) and used them to form 100 empirical cumulative probabilities Pr^{(ℓ,10^m)}_{e,G}(−∞, x) and Pr^{(ℓ,10^m)}_{e,S}(−∞, x) as defined by (13.343-13.345). Next, I computed the distances D^{(ℓ)}_{G,G,10^m}


and D^{(ℓ)}_{S,G,10^m} of each of these cumulative probabilities from the gaussian distribution P_G(x, 0, 1). I labeled the two sets of 100 quantities u^{(ℓ,m)}_{G,G} = √(10^m) D^{(ℓ)}_{G,G,10^m} and u^{(ℓ,m)}_{S,G} = √(10^m) D^{(ℓ)}_{S,G,10^m} in increasing order as u^{(m)}_{G,G,1} ≤ u^{(m)}_{G,G,2} ≤ ... ≤ u^{(m)}_{G,G,100} and u^{(m)}_{S,G,1} ≤ u^{(m)}_{S,G,2} ≤ ... ≤ u^{(m)}_{S,G,100}. I then used (13.343-13.345) to form the cumulative probabilities

\Pr^{(m)}_{e,G,G}(-\infty, u) = \frac{j}{N_s} \qquad \text{for} \quad u^{(m)}_{G,G,j} < u < u^{(m)}_{G,G,j+1}   (13.352)

and

\Pr^{(m)}_{e,S,G}(-\infty, u) = \frac{j}{N_s} \qquad \text{for} \quad u^{(m)}_{S,G,j} < u < u^{(m)}_{S,G,j+1}   (13.353)

for N_s = 100 sets of 10^m points.

I plotted these cumulative probabilities in Fig. 13.10. The thick smooth curve is Kolmogorov's universal cumulative probability distribution K(u) defined by (13.350). The thin jagged curve that clings to K(u) is the cumulative probability distribution Pr^{(4)}_{e,G,G}(−∞, u) made from 100 sets of 10^4 points taken from P_G(x, 0, 1). As the number of sets increases beyond 100 and the number of points 10^m rises further, the probability distributions Pr^{(m)}_{e,G,G}(−∞, u) converge to the universal cumulative probability distribution K(u) and provide a numerical verification of Kolmogorov's theorem. Such curves make poor figures, however, because they hide beneath K(u).

The curves labeled Pr^{(m)}_{e,S,G}(−∞, u) for m = 2 and 3 are made from 100 sets of N = 10^m points taken from P_S(x, 3, 1) and tested as to whether they instead come from P_G(x, 0, 1). Note that as N = 10^m increases from 100 to 1000, the cumulative probability distribution Pr^{(m)}_{e,S,G}(−∞, u) moves farther from Kolmogorov's universal cumulative probability distribution K(u). In fact, the curve Pr^{(4)}_{e,S,G}(−∞, u) made from 100 sets of 10^4 points lies beyond u > 8, too far to the right to fit in the figure. Kolmogorov's test gets more conclusive as the number of points N → ∞.

Warning, mathematical hazard: while binned data are ideal for chi-squared fits, they ruin Kolmogorov tests. The reason is that if the data are in bins of width w, then the empirical cumulative probability distribution Pr_e^{(N)}(−∞, x) is a staircase function with steps as wide as the bin width w even in the limit N → ∞. Thus even if the data come from the theoretical distribution, the limiting value D_∞ of the Kolmogorov distance will be positive. In fact, one may show (exercise 13.41) that when the data do come from the theoretical probability distribution P_t(x), assumed to be


continuous, then the value of D_∞ is

D_\infty \approx \sup_{-\infty < x < \infty} \frac{w\, P_t(x)}{2}.   (13.354)

Thus in this case, the quantity √N D_N would diverge as √N D_∞ and lead one to believe that the data had not come from P_t(x).

Suppose we have made some changes in our experimental apparatus and our software, and we want to see whether the new data x'_1, x'_2, ..., x'_{N'} we took after the changes are consistent with the old data x_1, x_2, ..., x_N we took before the changes. Then following equations (13.343-13.345), we can make two empirical cumulative probability distributions: one, Pr_e^{(N)}(−∞, x), made from the N old points x_j, and the other, Pr_e^{(N')}(−∞, x), made from the N' new points x'_j. Next, we compute the distances

D^+_{N,N'} = \sup_{-\infty < x < \infty} \left( \Pr_e^{(N)}(-\infty, x) - \Pr_e^{(N')}(-\infty, x) \right)

D_{N,N'} = \sup_{-\infty < x < \infty} \left| \Pr_e^{(N)}(-\infty, x) - \Pr_e^{(N')}(-\infty, x) \right|.   (13.355)

Smirnov (Smirnov 1939; Gnedenko 1968, p. 453) has shown that as N, N' → ∞ the probabilities that

u^+_{N,N'} = \sqrt{\frac{N N'}{N + N'}}\, D^+_{N,N'} \qquad \text{and} \qquad u_{N,N'} = \sqrt{\frac{N N'}{N + N'}}\, D_{N,N'}   (13.356)

are less than u are

\lim_{N,N'\to\infty} \Pr(u^+_{N,N'} < u) = 1 - e^{-2u^2} \qquad \text{and} \qquad \lim_{N,N'\to\infty} \Pr(u_{N,N'} < u) = K(u)   (13.357)

in which K(u) is Kolmogorov's distribution (13.350).

Further reading

Students can learn more about probability and statistics in Mathematical Methods for Physics and Engineering (Riley et al., 2006), An Introduction to Probability Theory and Its Applications I, II (Feller, 1968, 1966), Theory of Financial Risk and Derivative Pricing (Bouchaud and Potters, 2003), and Probability and Statistics in Experimental Physics (Roe, 2001).

Exercises

13.1 Find the probabilities that the sum on two thrown fair dice is 4, 5, or 6.


13.2 Redo the three-door example for the case in which there are 100 doors, and 98 are opened to reveal empty rooms after one picks a door. Should one switch? What are the odds?
13.3 Show that the zeroth moment μ_0 and the zeroth central moment ν_0 always are unity, and that the first central moment ν_1 always vanishes.
13.4 Compute the variance of the uniform distribution on (0, 1).
13.5 In the formulas (13.23 & 13.30) for the variances of discrete and continuous distributions, show that E[(x − ⟨x⟩)²] = μ_2 − μ².
13.6 (a) Show that the covariance ⟨(x − x̄)(y − ȳ)⟩ is equal to ⟨xy⟩ − ⟨x⟩⟨y⟩ as asserted in (13.41). (b) Derive (13.45) for the variance V[ax + by].
13.7 Derive expression (13.46) for the variance of a sum of N variables.
13.8 Find the range of pq = p(1 − p) for 0 ≤ p ≤ 1.
13.9 Show that the variance of the binomial distribution (13.49) is given by (13.53).
13.10 Redo the polling example (13.16-13.18) for the case of a slightly better poll in which 16 people were asked and 13 said they'd vote for Nancy Pelosi. What's the probability that she'll win the election? (You may use Maple or some other program to do the tedious integral.)
13.11 Without using the fact that the Poisson distribution is a limiting form of the binomial distribution, show from its definition (13.64) and its mean (13.66) that its variance is equal to its mean, as in (13.68).
13.12 Show that Gauss's approximation (13.81) to the binomial distribution is a normalized probability distribution with mean ⟨x⟩ = μ = pN and variance V[x] = pqN.
13.13 Derive the approximations (13.95 & 13.96) for binomial probabilities for large N.
13.14 Compute the central moments (13.29) of the gaussian (13.82).
13.15 Derive formula (13.91) for the probability that a gaussian random variable falls within an interval.
13.16 Show that the expression (13.98) for P(y|600) is negligible on the interval (0, 1) except for y near 3/5.
13.17 Determine the constant A of the homogeneous solution ⟨v(t)⟩_{gh} and derive expression (13.165) for the general solution ⟨v(t)⟩ to (13.163).
13.18 Derive equation (13.166) for the variance of the position r about its mean ⟨r(t)⟩. You may assume that ⟨r(0)⟩ = ⟨v(0)⟩ = 0 and that ⟨(v − ⟨v(t)⟩)²⟩ = 3kT/m.
13.19 Derive equation (13.195) for the ensemble average ⟨r²(t)⟩ for the case in which ⟨r²(0)⟩ = 0 and d⟨r²(0)⟩/dt = 0.
13.20 Use (13.216) to derive the lower moments (13.218 & 13.219) of the distributions of Gauss and Poisson.


13.21 Find the third and fourth moments μ_3 and μ_4 for the distributions of Poisson (13.211) and Gauss (13.208).
13.22 Derive formula (13.223) for the first five cumulants of an arbitrary probability distribution.
13.23 Show that, like the characteristic function, the moment-generating function M(t) for an average of several independent random variables factorizes: M(t) = M_1(t/N) M_2(t/N) ... M_N(t/N).
13.24 Derive formula (13.230) for the moments of the log-normal probability distribution (13.229).
13.25 Why doesn't the log-normal probability distribution (13.229) have a sensible power series about x = 0? What are its derivatives there?
13.26 Compute the mean and variance of the exponential distribution (13.231).
13.27 Show that the chi-squared distribution P_{3,P}(v, σ) with variance σ² = kT/m is the Maxwell-Boltzmann distribution (13.116).
13.28 Compute the inverse Fourier transform (13.207) of the characteristic function (13.236) of the symmetric Levy distribution for ν = 1 and 2.
13.29 Show that the integral that defines P^{(2)}(y) gives formula (13.272) with two Heaviside functions. Hint: keep x_1 and x_2 in the interval (0, 1).
13.30 Derive the normal distribution (13.257) in the variable (13.256) from the central limit theorem (13.254) for the case in which all the means and variances are the same.
13.31 Show that Fisher's matrix (13.290) is symmetric, F_{kℓ} = F_{ℓk}, and nonnegative (1.39), and that when it is positive (1.40), it has an inverse.
13.32 Derive the integral equations (13.292 & 13.293) from the normalization condition ∫ P(x; θ) d^N x = 1.
13.33 Derive the Cramer-Rao lower bound (13.308) on the variance V[u_k] from the inequality (13.303).
13.34 Show that the variance V[u_{σ²}^{(N)}] of Bessel's estimator (13.287) is given by (13.314).
13.35 Compute the fourth central moment (13.29) of Gauss's probability distribution P_G(x; μ, σ²).
13.36 Show that when the real N×M matrix G has rank M, the matrices P = G G^+ and P_⊥ = I − P are projection operators that are mutually orthogonal, P(I − P) = (I − P)P = 0.
13.37 Show that Kolmogorov's distance D_N is bounded as in (13.347).
13.38 Show that Kolmogorov's distance D_N is the greater of the two Smirnov


distances

D^+_N = \sup_{-\infty < x < \infty} \left( \Pr_e^{(N)}(-\infty, x) - \Pr_t(-\infty, x) \right)

D^-_N = \sup_{-\infty < x < \infty} \left( \Pr_t(-\infty, x) - \Pr_e^{(N)}(-\infty, x) \right).   (13.358)

13.39 Derive the formulas

D^+_N = \sup_{1 \le j \le N} \left( \frac{j}{N} - \Pr_t(-\infty, x_j) \right)

D^-_N = \sup_{1 \le j \le N} \left( \Pr_t(-\infty, x_j) - \frac{j-1}{N} \right)   (13.359)

for D^+_N and D^-_N.
13.40 Compute the exact limiting value D_∞ of the Kolmogorov distance between P_G(x, 0, 1) and P_S(x, 3, 1). Use the cumulative probabilities (13.351 & 13.227) to find the value of x that maximizes their difference. Using Maple or some other program, you should find x = 0.6276952185 and then D_∞ = 0.0868552356.
13.41 Show that when the data do come from the theoretical probability distribution (assumed to be continuous) but are in bins of width w, then the limiting value D_∞ of the Kolmogorov distance is given by (13.354).
13.42 Suppose in a poll of 1000 likely voters, 510 have said they would vote for Nancy Pelosi. Redo example 13.14.