Probabilistic Combinatorics

Amin Coja-Oghlan ([email protected])

June 25, 2013

Abstract

Random structures play a crucial role in combinatorics, because they enjoy properties that are difficult to obtain via deterministic constructions. For example, random structures are essential in the design of algorithms or error correcting codes. Furthermore, random discrete structures can be used to model a large variety of objects in physics, biology, or computer science (e.g., social networks). The goal of this lecture is to convey the most important models and the main analysis techniques, as well as a few instructive applications. Topics include

• fundamentals of discrete probability distributions,

• techniques for the analysis of rare events,

• random trees and graphs,

• statistical mechanics models,

• applications to counting and sampling.


1 Introduction and motivation

Probabilistic combinatorics is a relatively new area of mathematics. It started in the 1950s and 60s and has been strongly influenced by the Hungarian mathematicians Paul Erdos and Alfred Renyi. Their primary motivation was to address problems in combinatorics, particularly extremal combinatorics. In that area, it is often necessary to produce discrete objects (e.g., graphs) with properties that seem contradictory. The stroke of genius that marks the beginning of probabilistic combinatorics is that sometimes there is a simple random experiment whose likely outcome is the desired object. This way of producing discrete objects with particular properties has become known as the probabilistic method; we will see an example, the lower bound on the Ramsey number, in Section 2.

Coming up with an explicit or ‘deterministic’ construction of the desired object is often difficult. In fact, in many cases (such as Ramsey numbers) no deterministic construction is known that can hold a candle to very basic probabilistic ones, despite decades of extensive research.

The probabilistic method has since become a key tool in far broader areas than merely extremal combinatorics. Modern applications abound in areas ranging from information theory (the design of efficient codes) to computer science (algorithm design, computational complexity theory). We will discuss (some of) the techniques and various applications of the probabilistic method in due course.

A second area where random discrete structures play an important role is statistical mechanics. This branch of physics, pioneered by Ludwig Boltzmann and Josiah W. Gibbs in the late 19th century, is concerned with the interaction of myriads of tiny particles, with the goal of studying the arising macroscopic phenomena. Because it seems hopeless (as well as pointless) to deterministically trace the microscopic states and positions of each individual elementary particle, the system is modeled probabilistically. That is, the macroscopic “potentials” of the system are random variables. Their distribution comes with a handful of macroscopic parameters (such as the temperature), and the name of the game is to study how these parameters affect the potentials. Of particular interest are phase transitions, i.e., values of the parameters (such as the temperature) where the macroscopic behavior of the system changes abruptly. An example of this is the freezing of water at zero centigrade.

Physicists have developed a host of ingenious ideas and methods for studying such phenomena. However, usually their derivations are mathematically non-rigorous. In effect, the need for a rigorous foundation for statistical mechanics has been a driving force behind the development of probabilistic combinatorics.

None of the material in these lecture notes is original. In fact, many of the lectures closely follow either textbooks, research monographs [2, 10, 14] or research papers. More precisely, Sections 2, 3, 4, 6 and 7 follow [2]. Moreover, Section 5 is based on [10].

Only a very basic knowledge of probability theory and discrete mathematics will be assumed throughout the course.

2 A first example: lower bounds on Ramsey numbers

This section follows [2, Section 3.1].

Ramsey’s theorem is one of the cornerstones of extremal combinatorics. To state the theorem, we need to recap a few concepts from graph theory. Recall that a graph is a pair G = (V, E) consisting of a finite set V of vertices and a set E of edges, each of which is a set containing two elements of V. A set S ⊂ V is a clique if {v, w} ∈ E for any two (distinct) v, w ∈ S. Moreover, a set T ⊂ V is independent if {v, w} ∉ E for any two v, w ∈ T. Now, Ramsey’s theorem can be stated as follows.

Theorem 2.1 For any s ≥ 1 there is a number r such that any graph on at least r vertices contains either a clique or an independent set of size s.

A proof of this theorem will be given in the seminar. The obvious question associated with Ramsey’s theorem is how big r has to be (as a function of s). More precisely, we let R(s) be the minimum r so that any graph on at least r vertices contains either a clique or an independent set of size s. Then the problem is to figure out R(s).

There are several possible ways to answer this question. The most ambitious one would be an explicit formula that yields R(s) for any value of s. Such a formula is not known. In fact, if one exists, it may well


be so complicated as to be rather unilluminating. A second possibility is to figure out R(s) for ‘small’ s. For instance, one could hope for a fast algorithm that computes R(s). Such an algorithm is not known. In fact, for small s the only values of R(s) that are known exactly are R(2) = 2, R(3) = 6, R(4) = 18. The Ramsey number R(5) is unknown, although 43 ≤ R(5) ≤ 49. With respect to an algorithm for computing the Ramsey number, the state of affairs was described neatly by Joel Spencer, who attributes the metaphor to Paul Erdos:

Imagine an alien force, vastly more powerful than us, landing on Earth and demanding the value of R(5) or they will destroy our planet. In that case, we should marshal all our computers and all our mathematicians and attempt to find the value. But suppose, instead, that they ask for R(6). In that case, we should attempt to destroy the aliens.

This scepticism is due to the sheer number of possible graphs on a given number n of vertices, namely

2^{\binom{n}{2}} = 2^{n(n-1)/2}.

Essentially the only known way to compute R(s) explicitly is by inspecting all graphs on n vertices. For instance, to check whether R(5) = 43, the number of possible graphs that one would have to check is

2^{\binom{43}{2}} = 2^{903},

a number with 272 decimal digits. Even if a billion possibilities could be checked in a second, this would take 10^{255} years. Thus, in the absence of a far smarter method for finding R(s), the exact computation of these numbers is hopeless.

Instead, what one could hope for is reasonably tight upper and lower bounds on R(s). Ideally, these should describe exactly how R(s) scales as a function of s as s → ∞. What we will see is that

2^{s/2} ≤ R(s) ≤ 2^{2s}.

You will see the proof of the upper bound in the seminar. The proof of the lower bound, which is arguably the far greater challenge, is the first example of the use of the probabilistic method in extremal combinatorics (Erdos 1947).

To prove the lower bound, we need to produce a graph that has neither a ‘large’ independent set nor a large clique. We will obtain this graph by a simple random experiment, namely: just choose a graph with vertex set V = {1, . . . , n} uniformly at random from the set of all 2^{\binom{n}{2}} such graphs. We denote this random object by G(n, 1/2). Formally, G(n, 1/2) is a random variable ranging over the set of all graphs with n vertices. To prove the desired lower bound on R(s), we start with the following lemma.

Lemma 2.2 Let I_s be the number of independent sets of size s in G(n, 1/2). Similarly, let C_s be the number of cliques of size s in G(n, 1/2). Then

E(I_s) = E(C_s) = \binom{n}{s} \cdot 2^{-\binom{s}{2}}.

Proof. For a specific set S ⊂ V of size s we define a random variable

I_S = 1 if S is independent in G(n, 1/2), and I_S = 0 otherwise.

Then I_s = \sum_S I_S, and thus

E(I_s) = \sum_S E(I_S) ≤ \binom{n}{s} \cdot \max_S E(I_S). (1)

Here S ranges over subsets of V of size s. Moreover, we have used the fact that the binomial coefficient \binom{n}{s} counts the number of subsets of V of size s.


Thus, we need to compute E(I_S) for a fixed set S. Of course, E(I_S) is nothing but the probability that S is an independent set. In order to compute this probability, recall that G(n, 1/2) is just a uniformly chosen graph on the vertex set V. Hence, if we look at the subgraph of G(n, 1/2) induced on the vertices S, then this is just a uniformly distributed graph with vertex set S. Since |S| = s, the total number of possible graphs on S equals 2^{\binom{s}{2}}. Consequently,

E(I_S) = P[S is independent] = 2^{-\binom{s}{2}}.

Plugging this estimate into (1), we obtain the assertion. □
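As an aside (not part of the original notes), the identity of Lemma 2.2 is easy to check empirically. The following small Python sketch, using only the standard library and arbitrarily chosen parameters n, s, samples G(n, 1/2) and compares the empirical average of I_s with \binom{n}{s} 2^{-\binom{s}{2}}.

```python
import itertools, random
from math import comb

def sample_gnhalf(n):
    """Sample G(n, 1/2): include each of the binom(n, 2) possible edges with probability 1/2."""
    return {frozenset(e) for e in itertools.combinations(range(n), 2) if random.random() < 0.5}

def count_independent_sets(edges, n, s):
    """Count size-s vertex sets spanning no edge (the random variable I_s)."""
    return sum(all(frozenset(e) not in edges for e in itertools.combinations(S, 2))
               for S in itertools.combinations(range(n), s))

n, s, trials = 12, 4, 1000
avg = sum(count_independent_sets(sample_gnhalf(n), n, s) for _ in range(trials)) / trials
print("empirical E(I_s):", avg)
print("Lemma 2.2 value :", comb(n, s) * 2 ** (-comb(s, 2)))
```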

To proceed, we need Stirling’s formula, which approximates k! for integers k: it reads

\sqrt{2\pi k} \cdot (k/e)^k ≤ k! ≤ \exp(1/(12k)) \cdot \sqrt{2\pi k} \cdot (k/e)^k.

Theorem 2.3 For s ≥ 3 we have R(s) > \frac{1}{e\sqrt{2}} \cdot s 2^{s/2}.

Proof. We claim that there is a graph on

n = \lfloor s \cdot 2^{s/2} / (e\sqrt{2}) \rfloor

vertices that neither has a clique nor an independent set of size s. For this, consider the random graph G(n, 1/2). Then by Markov’s inequality,

P[I_s > 0 ∨ C_s > 0] = P[I_s + C_s > 0] ≤ E(I_s + C_s).

Therefore, by Lemma 2.2 and Stirling’s formula,

P[I_s > 0 ∨ C_s > 0] ≤ E(I_s + C_s) = E(I_s) + E(C_s) = \binom{n}{s} \cdot 2^{1-\binom{s}{2}} < \frac{n^s}{s!} \cdot 2^{1-\binom{s}{2}}
  ≤ \frac{n^s}{\sqrt{2\pi s}(s/e)^s} \cdot 2^{1-\binom{s}{2}}
  ≤ \frac{s^s 2^{s^2/2} \exp(-s) 2^{-s/2}}{\sqrt{2\pi s}(s/e)^s} \cdot 2^{1-\binom{s}{2}} = \frac{2}{\sqrt{2\pi s}} < 1.

In effect,

P[I_s = 0 ∧ C_s = 0] = 1 − P[I_s > 0 ∨ C_s > 0] > 0,

whence R(s) > n. □

The proof of Theorem 2.3 shows that there exists a graph with n = \lfloor s \cdot 2^{s/2}/(e\sqrt{2}) \rfloor vertices that does not have a clique or an independent set of size s. Rather than by constructing such a graph deterministically, the proof is by showing that a random graph has this property with a strictly positive probability. This kind of argument is what is called the probabilistic method. In spite of more than 60 years of intensive research, no deterministic construction of a ‘Ramsey graph’ is known that could hold a candle to the simple probabilistic argument above.

The proof of Theorem 2.3 relies on a so-called first moment argument. The name stems from the fact that in order to show the absence of a clique/independent set of size s, we compute the expected number (aka “first moment”) of cliques/independent sets.
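To get a feeling for the numbers involved, the following hedged Python sketch (standard library only; the values of s are arbitrary) evaluates the first-moment bound \binom{n}{s} 2^{1-\binom{s}{2}} from the proof of Theorem 2.3 for the choice n = \lfloor s 2^{s/2}/(e\sqrt{2}) \rfloor; the values stay below 1, as the proof asserts.

```python
from math import comb, e, sqrt, floor

def first_moment_bound(s):
    """E(I_s + C_s) for the n of Theorem 2.3; the proof shows this is < 1."""
    n = floor(s * 2 ** (s / 2) / (e * sqrt(2)))
    return n, comb(n, s) * 2 ** (1 - comb(s, 2))

for s in (5, 10, 20, 40):
    n, bound = first_moment_bound(s)
    print(f"s={s:2d}  n={n:8d}  E(I_s + C_s) <= {bound:.3e}")
```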

3 The Erdos-Ko-Rado theorem

This section follows [2, p. 12].

Let n ≥ 1 be an integer and let V = {0, . . . , n − 1}. A family F of subsets of V is intersecting if for any two A, B ∈ F we have A ∩ B ≠ ∅.


Theorem 3.1 (Erdos-Ko-Rado) Let 1 ≤ k ≤ n/2. If F is an intersecting family such that all sets in F have size k, then |F| ≤ \binom{n-1}{k-1}.

Before we come to the proof, note that this bound is tight: the family of all subsets of V of size k that contain 0 contains \binom{n-1}{k-1} sets.

To prove the theorem, let us denote by m mod n ∈ V the remainder of the integer m upon division by n. We need the following lemma.

Lemma 3.2 Suppose that F is an intersecting family such that all sets in F have size k. For any s ∈ V define a set

A_s = {s + i mod n : 0 ≤ i < k}.

Then F contains at most k of these sets A_s. (In symbols, |F ∩ {A_s : s ∈ V}| ≤ k.)

Proof. Suppose that A_s ∈ F for some s. The sets A_t, t ≠ s, such that A_s ∩ A_t ≠ ∅ can be partitioned into pairs {A_{s−i mod n}, A_{s+k−i mod n}} with 1 ≤ i ≤ k − 1. Since A_{s−i mod n} ∩ A_{s+k−i mod n} = ∅, F can only contain one set from each of these k − 1 pairs. This implies that F contains at most k of the sets A_t, t ∈ V, in total. □

Proof of Theorem 3.1. Choose a permutation σ : V → V and an element i ∈ V uniformly at random from all n · n! possible choices. Let

A_{i,σ} = {σ(i), σ(i + 1 mod n), . . . , σ(i + k − 1 mod n)}.

Lemma 3.2 shows that for any permutation σ_0 : V → V we have

P[A_{i,σ} ∈ F | σ = σ_0] = |F ∩ {A_{j,σ_0} : j ∈ V}| / n ≤ k/n, (2)

because i ∈ V is chosen uniformly and independently of σ. Furthermore, for any i_0 ∈ V and any set S ⊂ V of size k we have

P[A_{i,σ} = S | i = i_0] = 1/\binom{n}{k}, (3)

because σ is chosen uniformly and independently of i. Combining (2) and (3), we get

k/n ≥ P[A_{i,σ} ∈ F] = |F| / \binom{n}{k}.

Hence, |F| ≤ (k/n) \binom{n}{k} = \binom{n-1}{k-1}. □
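The permutation argument can also be illustrated numerically. The sketch below (standard library only; n, k chosen arbitrarily) estimates P[A_{i,σ} ∈ F] by Monte Carlo for the extremal ‘star’ family and compares it with |F|/\binom{n}{k} and with the bound k/n.

```python
import itertools, random
from math import comb

n, k = 9, 4
# The extremal intersecting family: all k-subsets of {0, ..., n-1} that contain 0.
F = {frozenset(S) for S in itertools.combinations(range(n), k) if 0 in S}

def random_cyclic_set():
    """Draw sigma and i uniformly and return A_{i, sigma} as in the proof of Theorem 3.1."""
    sigma = random.sample(range(n), n)
    i = random.randrange(n)
    return frozenset(sigma[(i + j) % n] for j in range(k))

trials = 100_000
hits = sum(random_cyclic_set() in F for _ in range(trials))
print("empirical P[A_{i,sigma} in F]:", hits / trials)
print("|F| / binom(n, k)            :", len(F) / comb(n, k))
print("upper bound k / n            :", k / n)
```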

4 Independent sets in triangle-free graphs

This section follows [2, p. 272–273].

Theorem 4.1 Let G = (V, E) be a triangle-free graph on n vertices whose largest degree is bounded by d ≥ 16. Then α(G) ≥ n log_2(d) / (8d), where α(G) denotes the maximum size of an independent set in G.

Proof. Let I be a uniformly random independent set of G. For each vertex v let

X_v = d · |{v} ∩ I| + |N(v) ∩ I|.

Think of v itself having “cost” d, while each of its neighbors has “cost” 1. We are going to estimate E X_v. Fix v and let H be the subgraph induced on V \ ({v} ∪ N(v)). Let S be an independent set of H and let X be the set of all vertices w ∈ N(v) without a neighbor in S. We claim that

E[X_v | I ∩ V(H) = S] ≥ log_2(d) / 4. (4)


Given S, there are precisely 2^x + 1 possible ways to choose I (either S together with some subset of X, or S together with v itself), where x = |X|. Therefore,

E[X_v | I ∩ V(H) = S] = d/(2^x + 1) + (1/(2^x + 1)) \sum_{i=1}^{x} i \binom{x}{i} = d/(2^x + 1) + x 2^{x-1}/(2^x + 1).

With a bit of calculus it is easily verified that this implies (4). Now, (4) yields E \sum_{v∈V} X_v ≥ n log_2(d)/4. Further, \sum_{v∈V} X_v ≤ 2d|I| by double-counting. Hence, E|I| ≥ n log_2(d)/(8d). □
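The calculus step behind (4) can be spot-checked numerically. The following sketch (not part of the original notes) evaluates the conditional expectation d/(2^x + 1) + x 2^{x−1}/(2^x + 1) over a finite range of d ≥ 16 and x and confirms that it never drops below log_2(d)/4 on that range; this is a numerical check over an arbitrary range, not a proof.

```python
from math import log2

def cond_exp(d, x):
    """The conditional expectation d/(2^x + 1) + x*2^(x-1)/(2^x + 1), with x = |X| as in the proof."""
    return (d + x * 2 ** (x - 1)) / (2 ** x + 1)

# Numerical spot-check of claim (4): the conditional expectation is >= log2(d)/4 for d >= 16.
worst = min(cond_exp(d, x) - log2(d) / 4 for d in range(16, 2001) for x in range(0, 41))
print("minimal gap over the tested range (should be positive):", worst)
```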

5 The binomial distribution

In this section we follow [10, Chapter 2].

One of the most important probability distributions that occurs in the theory of random discrete structures is the binomial distribution. To define it, we first define a probability distribution on the set of 0/1 sequences. More precisely, let n ≥ 1 be an integer, and let 0 ≤ p ≤ 1 be a real. Let Ω = {0, 1}^n be the set of all 0/1 sequences of length n. For any sequence x = (x_1, . . . , x_n) ∈ Ω, we let

S(x) = \sum_{i=1}^{n} x_i

be the number of ‘ones’ in that sequence. We define a probability distribution µ_{n,p} on Ω by letting

P{x} = p^{S(x)} (1 − p)^{n−S(x)}. (5)

Thus, for any subset A ⊂ Ω we have

P(A) = \sum_{x∈A} P{x}.

Now, S : Ω → R is a random variable. The distribution P_S of S is called the binomial distribution Bin(n, p). Thus, the binomial distribution is a probability distribution on R, and for any s ∈ R we have

P_S{s} = µ_{n,p}{x ∈ Ω : S(x) = s} = \sum_{x∈Ω: S(x)=s} µ_{n,p}{x}.

Clearly, S can only take the values 0, 1, . . . , n, and thus P_S{s} = 0 for all s ∉ {0, 1, . . . , n}. Furthermore, for any k ∈ {0, 1, . . . , n} we have

P_S{k} = \binom{n}{k} p^k (1 − p)^{n−k}.

The definition (5) of the probability measure on Ω ensures that for x ∈ Ω each entry x_i is equal to one with probability p independently of all the others. (You may want to verify this!) Therefore, by the linearity of the expectation,

E(S) = n · p.

Furthermore, since the variance of a sum of independent variables equals the sum of the variances, we see that

Var(S) = n · p(1 − p).

This expression for the variance can sometimes be used in combination with Chebyshev’s inequality:

Lemma 5.1 (Chebyshev’s inequality) Let X be a real-valued random variable such that Var(X) < ∞. Then for any t > 0 we have

P[|X − E(X)| > t · \sqrt{Var(X)}] ≤ t^{−2}.


While we’re at it, we might as well recall Markov’s inequality.

Lemma 5.2 (Markov’s inequality) Let X be a non-negative real-valued random variable such that E(X) < ∞. Then for any t > 1 we have

P[X > t · E(X)] ≤ t^{−1}.

A key feature of the binomial distribution is that it is ‘concentrated about its expectation’. This can be quantified precisely by the so-called Chernoff bounds. The method for deriving these bounds is instructive and can be applied to many other distributions as well.

The fundamental idea is to apply Markov’s inequality to the random variable exp(u · S) for a suitable real u. More precisely, for any u, t ≥ 0 we have

P[S ≥ E(S) + t] = P[exp(uS) ≥ exp(u(E(S) + t))] ≤ E(exp(uS)) / exp(u(E(S) + t)). (6)

Analogously, if u ≤ 0 and t ≥ 0, then

P[S ≤ E(S) − t] = P[exp(uS) ≥ exp(u(E(S) − t))] ≤ E(exp(uS)) / exp(u(E(S) − t)). (7)

To proceed, we need to compute E(exp(uS)), and then optimize over u. Since S(x) = \sum_{i=1}^{n} x_i and the entries x_i are mutually independent, we have

E(exp(uS)) = E[exp(u \sum_{i=1}^{n} x_i)] = E[\prod_{i=1}^{n} exp(u x_i)] = \prod_{i=1}^{n} E[exp(u x_i)] = \prod_{i=1}^{n} (1 − p + p exp(u)) = (1 − p + p exp(u))^n. (8)

To continue, let λ = E(S) = np, and assume that λ + t < n. Plugging (8) into (6), we obtain

P[S ≥ λ + t] ≤ exp(−u(λ + t)) · (1 − p + p exp(u))^n for any u ≥ 0. (9)

Differentiating this expression shows that the r.h.s. is minimized if u is chosen so that

exp(u) = (λ + t)(1 − p) / ((n − λ − t) p).

Substituting this value of u in (9), we obtain

P[S ≥ λ + t] ≤ ((n − λ − t)p / ((λ + t)(1 − p)))^{λ+t} · (1 − p + (1 − p)(λ + t)/(n − λ − t))^n
  = ((n − λ − t)/(λ + t))^{λ+t} (p/(1 − p))^{λ+t} (1 − p)^n (1 + (λ + t)/(n − λ − t))^n
  = ((n − λ − t)/(λ + t))^{λ+t} (p/(1 − p))^{λ+t} (1 − p)^n (n/(n − λ − t))^n
  = ((n − λ − t)/(λ + t))^{λ+t} (λ/(n − λ))^{λ+t} (n − λ)^n (n − λ − t)^{−n}
  = (λ/(λ + t))^{λ+t} ((n − λ)/(n − λ − t))^{n−λ−t}. (10)

This bound holds for all 0 ≤ t < n − λ. With the convention that the second factor in the last line is 1, it extends to t = n − λ as well. Furthermore, for t > n − λ, the probability on the left hand side is trivially zero.

The bound on P[S ≥ λ + t] that we just obtained is often a bit awkward to apply. The following theorem provides a more handy one.


Theorem 5.3 Let S be a random variable that has a binomial distribution Bin(n, p); that is, for any integer 0 ≤ k ≤ n we have

P[S = k] = \binom{n}{k} p^k (1 − p)^{n−k}.

Let λ = np = E(S). Furthermore, define a function

ϕ : (−1, ∞) → R_{≥0}, x ↦ (1 + x) ln(1 + x) − x.

Then

P[S ≥ λ + t] ≤ exp(−λ · ϕ(t/λ)) ≤ exp(−t²/(2(λ + t/3))) for any t ≥ 0,
P[S ≤ λ − t] ≤ exp(−λ · ϕ(−t/λ)) ≤ exp(−t²/(2λ)) for any 0 ≤ t < λ.

Proof. In terms of ϕ, (10) reads

P[S ≥ λ + t] ≤ exp[−λϕ(t/λ) − (n − λ)ϕ(−t/(n − λ))] (11)

for 0 ≤ t ≤ n − λ. Replacing S by n − S (which has binomial distribution Bin(n, 1 − p)), we obtain from (10)

P[S ≤ λ − t] ≤ exp[−λϕ(−t/λ) − (n − λ)ϕ(t/(n − λ))] (12)

for 0 ≤ t ≤ λ. Since ϕ is positive, (11) and (12) directly yield the ‘middle’ bounds in the theorem (i.e., the ones stated in terms of ϕ). Using elementary calculus, one verifies that ϕ(x) ≥ x²/2 for −1 < x < 0, and that

ϕ(x) ≥ x²/(2(1 + x/3))

for x > 0. □
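For concreteness, the following Python sketch (standard library only; all parameters arbitrary) compares the exact binomial upper tail with the two bounds of Theorem 5.3.

```python
from math import comb, exp, log

def binom_upper_tail(n, p, k):
    """Exact P[S >= k] for S ~ Bin(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))

def chernoff_upper(n, p, t):
    """The bound exp(-lambda * phi(t/lambda)) from Theorem 5.3."""
    lam = n * p
    x = t / lam
    return exp(-lam * ((1 + x) * log(1 + x) - x))

n, p = 1000, 0.3
lam = n * p
for t in (20, 50, 100):
    exact = binom_upper_tail(n, p, int(lam + t))
    weaker = exp(-t ** 2 / (2 * (lam + t / 3)))
    print(f"t={t:3d}  exact={exact:.3e}  Chernoff={chernoff_upper(n, p, t):.3e}  weaker bound={weaker:.3e}")
```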

The previous results concern the properties of the binomial distribution for any fixed n, p. In the rest of this section, we will deal with the asymptotics of the binomial distribution as n → ∞. That is, we will consider a sequence of random variables (S_n)_{n≥1} such that S_n has a binomial distribution Bin(n, p(n)), where p(n) ∈ [0, 1]. To simplify the notation, we will usually omit the index n and just write S instead of S_n. Similarly, we will just write p instead of p(n). Nevertheless, we will have to keep in mind that formally p depends on n, and that we’re talking about a sequence of random variables rather than a single one.

There are two fundamentally different cases. The first case is that the sequence (n · p(n))_{n≥1} is bounded. The second case is that n · p(n) → ∞ as n → ∞. Let us begin with the first case. In fact, let us assume that there is a real λ > 0 such that

lim_{n→∞} n · p(n) = λ.

We will see that, in this case, the binomial variable S can be approximated well by a random variable with another distribution, the so-called Poisson distribution. More precisely, we say that a random variable L has a Poisson distribution Po(λ) if

P[L = k] = λ^k / (k! exp(λ)) for any integer k ≥ 0.

Theorem 5.4 Suppose that np → λ for a real λ > 0. Then for any integer k we have

lim_{n→∞} P[S = k] = λ^k / (k! exp(λ)).


Proof. We may assume that n > k. Then

P[S = k] = \binom{n}{k} p^k (1 − p)^{n−k} ≤ ((np)^k / k!) exp(−p(n − k)) = ((np)^k / k!) exp(−np) · exp(kp), (13)

P[S = k] = \binom{n}{k} p^k (1 − p)^{n−k} ≥ ((n − k)^k p^k / k!) · (1 − p)^n ≥ ((np)^k / k!) (1 − p)^n (1 − k/n)^k. (14)

Using elementary calculus, one easily verifies that 1 − x ≥ exp(−x − x²) for 0 ≤ x ≤ 1/2. As np → λ, we know that p → 0. Similarly, for any fixed integer k we have k/n → 0. Therefore, for n sufficiently large we obtain from (14)

P[S = k] ≥ ((np)^k / k!) exp(−np) · exp(−k²/n − k³/n² − p²n). (15)

Since p → 0, k²/n → 0, and p²n → 0 for large n, (13) and (15) imply the assertion. □
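Theorem 5.4 is easy to visualise numerically; the following sketch (values of λ, n chosen arbitrarily) compares the Bin(n, λ/n) point probabilities with the Po(λ) probabilities.

```python
from math import comb, exp, factorial

lam, n = 2.0, 500
p = lam / n
for k in range(6):
    binom = comb(n, k) * p ** k * (1 - p) ** (n - k)
    poisson = lam ** k / (factorial(k) * exp(lam))
    print(f"k={k}  Bin(n, lam/n) = {binom:.5f}   Po(lam) = {poisson:.5f}")
```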

The second case is that np → ∞. We may confine ourselves to p ≤ 1/2, because otherwise we could simply replace S by n − S. What we will be interested in is the so-called center of the binomial distribution, i.e., the values close to λ = np. For this part, we have the following limit theorem, showing convergence to the normal distribution.

Theorem 5.5 Suppose that p = p(n) ≤ 1/2 and np → ∞. Let σ = σ(n) = \sqrt{np(1 − p)}. Then for any reals a < b we have

lim_{n→∞} P[(S − λ)/σ ∈ [a, b]] = (1/\sqrt{2π}) ∫_a^b exp(−x²/2) dx.

To prove the theorem, we need the following estimate.

Lemma 5.6 Suppose that p = p(n) ≤ 1/2 and np → ∞. Let σ = σ(n) = \sqrt{np(1 − p)}. Then for any real C > 0 we have

lim_{n→∞} max_{k: |k| ≤ Cσ, λ+k ∈ Z} | log[ P[S = λ + k] · \sqrt{2π}σ · exp(k²/(2σ²)) ] | = 0.

Proof. Fix any k such that |k| ≤ Cσ. By Stirling’s formula,

P[S = λ + k] = \binom{n}{λ + k} p^{λ+k} (1 − p)^{n−λ−k}
  = (1 + o(1)) · \sqrt{n / (2π(λ + k)(n − λ − k))} · (np/(λ + k))^{λ+k} (n(1 − p)/(n − λ − k))^{n−λ−k}
  = (1 + o(1)) · \sqrt{n / (2π(λ + k)(n − λ − k))} · (λ/(λ + k))^{λ+k} ((n − λ)/(n − λ − k))^{n−λ−k}.

Here and throughout, the o(1) hides a term that tends to zero as n → ∞. Let us first take a look at the square root. Since λ → ∞, λ ≤ n/2, and k ≤ C\sqrt{λ}, we get

\sqrt{n / (2π(λ + k)(n − λ − k))} = (1 + o(1)) · \sqrt{n / (2πλ(n − λ))} = (1 + o(1)) / (\sqrt{2π}σ).

Thus,

P[S = λ + k] = ((1 + o(1)) / (\sqrt{2π}σ)) · (λ/(λ + k))^{λ+k} ((n − λ)/(n − λ − k))^{n−λ−k}. (16)


The right part of the expression in (16) is identical to the expression we had in (10). In (11) we expressed this in terms of the function ϕ(x) = (1 + x) ln(1 + x) − x. Namely, we have

P[S = λ + k] = ((1 + o(1)) / (\sqrt{2π}σ)) · exp[−λϕ(k/λ) − (n − λ)ϕ(−k/(n − λ))]. (17)

Since |k| ≤ Cσ, the quotients k/λ and k/(n − λ) tend to zero. Furthermore, Taylor expanding ϕ around 0, we get ϕ(x) = x²/2 + O(x³). Hence,

−λϕ(k/λ) − (n − λ)ϕ(−k/(n − λ)) = −λ [ (1/2)(k/λ)² + O((k/λ)³) ] − (n − λ) [ (1/2)(k/(n − λ))² + O((k/(n − λ))³) ]
  = −k²/(2λ) − k²/(2(n − λ)) + O(k³/λ²) = −k²/(2σ²) + O(k³/λ²).

Since k ≤ Cσ ≤ C\sqrt{λ}, we have k³/λ² → 0. Hence, (17) yields

P[S = λ + k] = ((1 + o(1)) / (\sqrt{2π}σ)) exp[−k²/(2σ²)],

as claimed. □

Proof of Theorem 5.5. We will only consider the case that 0 ≤ a < b. The case a < b ≤ 0 can be dealt with analogously. Moreover, in the case a < 0 < b the assertion follows by applying the previous two cases to the two intervals [a, 0] and [0, b]. Thus, assume from now on that 0 ≤ a < b.

Letting K = {k : aσ ≤ k ≤ bσ, λ + k ∈ Z}, we have

P[(S − λ)/σ ∈ [a, b]] = \sum_{k∈K} P[S = λ + k] = ((1 + o(1)) / (\sqrt{2π}σ)) \sum_{k∈K} exp(−k²/(2σ²)) [by Lemma 5.6]. (18)

Recall that the o(1) hides an expression that tends to zero as n → ∞. The function x ↦ exp(−x²/(2σ²)) is monotonically decreasing on the positive reals. Therefore,

\sum_{k∈K} exp(−k²/(2σ²)) = δ + ∫_{aσ}^{bσ} exp(−x²/(2σ²)) dx, (19)

with |δ| ≤ 2 exp(−a²/2) ≤ 2. Plugging (19) into (18), we get

P[(S − λ)/σ ∈ [a, b]] = ((1 + o(1)) / (\sqrt{2π}σ)) [ δ + ∫_{aσ}^{bσ} exp(−x²/(2σ²)) dx ]
  = ((1 + o(1)) / (\sqrt{2π}σ)) [ δ + σ ∫_a^b exp(−z²/2) dz ]
  = ((1 + o(1)) / \sqrt{2π}) ∫_a^b exp(−z²/2) dz,

where the last step follows because |δ| ≤ 2 and σ ≥ \sqrt{λ/2} → ∞. □

Remark 5.7 Theorem 5.5 actually is a special case of the ‘central limit theorem’ from probability theory.


6 The Weierstrass approximation theorem

In this section we follow [2, p. 113–114].

The following is a well-known result from analysis. Here we see a probabilistic proof, which is by way of the binomial distribution.

Theorem 6.1 (Weierstrass) For any continuous f : [0, 1] → R and any ε > 0 there is a polynomial p(x) such that sup_{x∈[0,1]} |f(x) − p(x)| < ε.

Proof. The function f is indeed uniformly continuous on [0, 1]. That is, there is δ > 0 such that

sup_{x,y∈[0,1]: |x−y|<δ} |f(x) − f(y)| < ε/2.

Furthermore, ‖f‖_∞ = sup_{x∈[0,1]} |f(x)| < ∞. Now, let n = n(ε) > 0 be a large enough integer and consider the polynomial

p(x) = \sum_{i=0}^{n} P[Bin(n, x) = i] · f(i/n) = \sum_{i=0}^{n} \binom{n}{i} x^i (1 − x)^{n−i} f(i/n).

Then for any x ∈ [0, 1] we have

|f(x) − p(x)| ≤ \sum_{i=0}^{n} \binom{n}{i} x^i (1 − x)^{n−i} |f(i/n) − f(x)|
  ≤ \sum_{i: |i−nx|<n^{2/3}} P[Bin(n, x) = i] · |f(i/n) − f(x)| + 2‖f‖_∞ · \sum_{i: |i−nx|≥n^{2/3}} P[Bin(n, x) = i]
  ≤ sup_{x,y∈[0,1]: |x−y|<δ} |f(x) − f(y)| + 2‖f‖_∞ · P[|Bin(n, x) − nx| ≥ n^{2/3}]
  ≤ ε/2 + 2 n^{−1/3} ‖f‖_∞ < ε,

as desired. □
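The polynomial used in the proof is the Bernstein polynomial of f. As a quick illustration (not part of the original notes), the following sketch builds it for an arbitrary test function and reports the sup-norm error on a grid.

```python
from math import comb, cos, pi

def bernstein(f, n):
    """The polynomial p(x) = sum_i P[Bin(n, x) = i] * f(i/n) from the proof of Theorem 6.1."""
    def p(x):
        return sum(comb(n, i) * x ** i * (1 - x) ** (n - i) * f(i / n) for i in range(n + 1))
    return p

f = lambda x: cos(2 * pi * x)          # an arbitrary continuous test function on [0, 1]
for n in (10, 50, 200):
    p = bernstein(f, n)
    err = max(abs(f(x) - p(x)) for x in (k / 200 for k in range(201)))
    print(f"n={n:4d}  sup-norm error on a grid ≈ {err:.4f}")
```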

7 The second moment method

In this section we follow [3, Chapter 7].

Remember the random graph G(n, p) on V = {1, . . . , n}, where each edge is present with probability p ∈ [0, 1] independently? We are going to study G(n, p) “in the limit” n → ∞. More precisely, let (p(n))_{n≥1} be a sequence of numbers p(n) ∈ [0, 1]. We say, somewhat sloppily, that G(n, p(n)) has some property E with high probability (‘w.h.p.’) if

lim_{n→∞} P[G(n, p(n)) ∈ E] = 1.

As usual, we will just write p instead of p(n). A graph G = (V, E) is connected if for any set S ⊂ V such that ∅ ≠ S ≠ V there is an edge e ∈ E such that e ∩ S ≠ ∅ ≠ e \ S. In other words, there is an edge that leads from S to V \ S.

Theorem 7.1 Fix any ε > 0.

1. If p ≤ (1 − ε) ln n/n, then G(n, p) is disconnected w.h.p.

2. If p ≥ (1 + ε) ln n/n, then G(n, p) is connected w.h.p.


Proof. Suppose that p ≤ (1 − ε) ln n/n. We call a vertex v of G(n, p) isolated if there is no edge e such that v ∈ e. Let I_v = 1 if v is isolated, and set I_v = 0 otherwise. Let I = \sum_{v∈V} I_v be the number of isolated vertices. Since for any v ∈ V the number of edges e such that v ∈ e is binomially distributed Bin(n − 1, p), we have

E[I_v] = (1 − p)^{n−1} ≥ exp(−p(n − 1) − O(np²)) ≥ (1 + o(1)) n^{ε−1}.

Hence,

E[I] = \sum_{v∈V} E[I_v] ≥ (1 + o(1)) n^ε. (20)

Furthermore, the second moment of I is

E[I²] = \sum_{(v,w)∈V×V} E[I_v I_w] = n(n − 1) P[I_1 = I_2 = 1] + E[I].

Thus, we need to compute the probability that the first two vertices are isolated. Clearly, 1 and 2 are isolated iff there is no edge that touches either. The total number of possible edges touching 1 or 2 is 2(n − 1) − 1 = 2n − 3. Hence,

P[I_1 = I_2 = 1] = (1 − p)^{2(n−1)−1} = (1 − p)^{−1} (1 − p)^{2(n−1)}.

Consequently, we obtain

E[I²] − E[I]² − E[I] = n(n − 1) P[I_1 = I_2 = 1] − \sum_{(v,w)∈V×V} E[I_v] E[I_w]
  ≤ n(n − 1) [ (1 − p)^{−1} (1 − p)^{2(n−1)} − (1 − p)^{2(n−1)} ] ≤ c · E[I]

for a constant c > 0. Thus, Var[I] = E[I²] − E[I]² ≤ (c + 1) E[I]. Now, Chebyshev’s inequality and (20) show that

P[I = 0] ≤ P[|I − E[I]| ≥ E[I]] ≤ Var[I] / E[I]² ≤ (c + 1) / E[I] = o(1),

as ε > 0 is fixed as n → ∞. This means that I > 0 w.h.p., i.e., w.h.p. the random graph G(n, p) has at least one isolated vertex. In particular, it is disconnected.

Now, assume that np ≥ (1 + ε) ln n. We are going to show that w.h.p. there is no set S ⊂ V of size 1 ≤ |S| ≤ n/2 such that there is no edge connecting S and V \ S. If we fix the size 1 ≤ s ≤ n/2 of this set S, then the number of potential edges equals s(n − s). Hence, for any one set S ⊂ V of size |S| = s the probability that there is no S–(V \ S) edge equals (1 − p)^{s(n−s)}. Thus, letting X_s denote the number of such sets S, we obtain

E[X_s] ≤ \binom{n}{s} · (1 − p)^{s(n−s)} ≤ (en/s)^s exp(−ps(n − s))
  ≤ exp(s [ 1 + ln(n/s) − (1 + ε) ln n · (n − s)/n ])
  = exp(s [ 1 + (1 + ε)(s/n) ln n − ε ln n − ln s ]).

The function s ↦ 1 + (1 + ε)(s/n) ln n − ε ln n − ln s is convex on the interval [1, n/2], so it attains its maximum at one of the endpoints; comparing the two endpoints shows that (for ε ≤ 1) the maximum is attained at s = 1, whence

E[X_s] ≤ \binom{n}{s} · (1 − p)^{s(n−s)} ≤ exp(−(1 + o(1)) ε ln(n) · s) = n^{(o(1)−1)ε·s},

and thus Markov’s inequality shows that \sum_{s=1}^{n/2} X_s = 0 w.h.p. □
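A small simulation illustrates the threshold of Theorem 7.1. The sketch below (standard library only; n and the constants are arbitrary choices) samples G(n, p), counts isolated vertices and tests connectivity for p below and above ln n / n.

```python
import random
from math import log

def sample_gnp_stats(n, p):
    """Sample G(n, p); return (number of isolated vertices, whether the graph is connected)."""
    adj = [[] for _ in range(n)]
    for v in range(n):
        for w in range(v + 1, n):
            if random.random() < p:
                adj[v].append(w)
                adj[w].append(v)
    isolated = sum(1 for v in range(n) if not adj[v])
    seen, stack = {0}, [0]
    while stack:
        v = stack.pop()
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return isolated, len(seen) == n

n, trials = 300, 50
for c in (0.7, 1.3):
    p = c * log(n) / n
    results = [sample_gnp_stats(n, p) for _ in range(trials)]
    avg_iso = sum(r[0] for r in results) / trials
    frac_conn = sum(r[1] for r in results) / trials
    print(f"p = {c} ln(n)/n : avg #isolated = {avg_iso:.2f}, fraction connected = {frac_conn:.2f}")
```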


8 The “giant component”

The following result was first proved by Erdos and Renyi in their seminal 1960 paper on random graphs [8]. Here we present a recent, very compact proof from [11].

Theorem 8.1 For any ε > 0 the following is true.

1. If np < 1 − ε, then the largest connected component of G(n, p) contains O(ln n) vertices w.h.p.

2. If np > 1 + ε, then the largest connected component of G(n, p) contains Ω(n) vertices w.h.p.

The proof is based on tracing the depth first search process for exploring the connected components. Let us describe this process on a general graph G = (V, E) with vertex set V = {1, . . . , n}. At all times every vertex is either marked complete, active or unexplored. Initially, all vertices are unexplored. One step of the process proceeds as follows.

• If there is an active vertex, let v be the one that got activated last (i.e., most recently). Check for each unexplored vertex w whether w is adjacent to v, proceeding in the natural order (recall that V is an ordered set). Once a w that is adjacent to v has been found, declare this vertex active and stop; should the quest from v resume at a later time, consider only unexplored vertices > w. If none of the unexplored w is adjacent to v, declare v complete.

• If there is no active vertex, declare the least unexplored vertex active. The entire process stops once there are no unexplored vertices left.

Observe that the active vertices always form a path. Let N ≤ \binom{n}{2} be the total number of edges probed by the process.

Every time the process runs out of active vertices, the exploration of one component is complete. We refer to the time interval between the activation of the first and the completion of the last vertex of a component as an epoch of the process.

On a random graph G(n, p), each vertex is adjacent to every other with probability p independently. Indeed, every time the above process checks whether two vertices are adjacent in G = G(n, p), the answer is a Bernoulli variable with mean p, and these random variables are independent. This is because no edge gets probed twice. Thus, the answers to the edge queries can be described as a sequence X = (X_i)_{i≥1} of mutually independent Be(p) variables.

The proof of Theorem 8.1 hinges upon the following lemma.

Lemma 8.2 Let ε > 0 be a sufficiently small constant.

1. Assume np = 1 − ε and set k = 7ε^{−2} ln n. W.h.p. there is no interval I of length kn in {1, . . . , N} such that \sum_{i∈I} X_i ≥ k.

2. Assume np = 1 + ε and set N_0 = εn²/2. W.h.p. we have \sum_{i=1}^{N_0} X_i = ε(1 + ε)n/2 + o(n^{2/3}).

Proof. ad 1: For any such I the sum \sum_{i∈I} X_i is distributed as Bin(kn, p). By the Chernoff bound,

P[Bin(kn, p) ≥ k] ≤ exp[−(k − knp)²/(2(knp + (k − knp)/3))] ≤ exp[−ε²k²/(2((1 − ε)k + k/3))] ≤ exp(−ε²k/3) = o(n^{−2}).

Since the total number of possible intervals is bounded by N ≤ n², the assertion follows from the union bound.

ad 2: We have E \sum_{i=1}^{N_0} X_i = pN_0 = (1 + ε)N_0/n = ε(1 + ε)n/2 and

Var \sum_{i=1}^{N_0} X_i = \sum_{i=1}^{N_0} Var X_i = N_0 p(1 − p).


Thus, the assertion follows from Chebyshev’s inequality. □

Proof of Theorem 8.1. ad 1: Assume that G = G(n, p) has a component C with more than k = 7ε^{−2} ln n vertices. Consider the epoch when C gets explored and the time t in that epoch when the (k+1)st vertex v of C gets activated. From the start of the epoch to time t the depth first search process has performed a number of edge probes; their outcomes are recorded by the random variables (X_i)_{i∈I} for a certain interval I. Let S_C be the set of complete vertices and U_C the set of active vertices of C at time t. Then |S_C ∪ U_C| > k and thus \sum_{i∈I} X_i ≥ k. Furthermore, the total number of edges probed in the C-epoch up to time t is bounded by \binom{k}{2} + k(n − k) < kn; this is because only edges with at least one vertex in (S_C ∪ U_C) \ {v} have been probed. Thus, the interval I violates 1. in Lemma 8.2.

ad 2: Consider the state of the vertices after the process has performed the first N_0 = εn²/2 edge probes. We claim that fewer than n/3 vertices are complete. Indeed, assume otherwise, and let N_* ≤ N_0 be the number of edge queries at the moment the ⌈n/3⌉th vertex was completed. At that time at most 1 + \sum_{i=1}^{N_*} X_i vertices were active, and by 2. in Lemma 8.2 we have

1 + \sum_{i=1}^{N_*} X_i ≤ 1 + \sum_{i=1}^{N_0} X_i < n/3.

Thus, the number of unexplored vertices must be at least n/3. Moreover, the process must have probed all potential edges between complete and unexplored vertices, and there are at least (n/3)² such edges. This is a contradiction because (n/3)² > N_0, which is the total number of connections probed.

Now, let C be the set of complete vertices, let A be the set of active vertices and let U contain the unexplored ones, all after N_0 edge probes. We already know that |C| < n/3. If also |A| < ε²n/5, then U ≠ ∅. Since each i ≤ N_0 such that X_i = 1 activates a vertex, property 2. in Lemma 8.2 implies

|A ∪ C| ≥ \sum_{i=1}^{N_0} X_i ≥ ε(1 + ε)n/2 − o(n^{2/3}).

Assume for contradiction that |A| ≤ ε²n/5. Then

|C| ≥ εn/2 + 3ε²n/10 − o(n^{2/3}). (21)

Further, by construction the depth first search process has already probed all possible |C × U| connections between C and U, and

|U| ≥ n − |A| − |C| ≥ n(1 − ε²/5) − |C|.

Since the total number of connections probed is bounded by N_0, we thus get

εn²/2 = N_0 ≥ |C × U| ≥ |C| (n(1 − ε²/5) − |C|). (22)

Subject to (21) this term is minimized if |C| = εn/2 + 3ε²n/10 − o(n^{2/3}) is as small as possible. Hence,

|C| (n(1 − ε²/5) − |C|) ≥ n² [ ε/2 + 3ε²/10 − o(1) ] · [ 1 − ε²/5 − ε/2 − 3ε²/10 + o(1) ] > εn²/2. (23)

Thus, (22) contradicts (23). This shows that |A| > ε²n/5. In particular, the random graph contains a path, and thus a connected component, on more than ε²n/5 = Ω(n) vertices. □
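The emergence of the giant component can likewise be observed in simulation. The following sketch (parameters arbitrary) samples G(n, c/n) and reports the size of the largest component for c below and above 1.

```python
import random

def largest_component(n, p):
    """Sample G(n, p) and return the size of its largest connected component."""
    adj = [[] for _ in range(n)]
    for v in range(n):
        for w in range(v + 1, n):
            if random.random() < p:
                adj[v].append(w)
                adj[w].append(v)
    seen, best = [False] * n, 0
    for s in range(n):
        if seen[s]:
            continue
        stack, size, seen[s] = [s], 0, True
        while stack:
            v = stack.pop()
            size += 1
            for w in adj[v]:
                if not seen[w]:
                    seen[w] = True
                    stack.append(w)
        best = max(best, size)
    return best

n = 1000
for c in (0.5, 0.9, 1.1, 1.5):
    sizes = [largest_component(n, c / n) for _ in range(5)]
    print(f"np = {c:.1f}   largest component sizes: {sizes}")
```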

There is a lot of literature on the giant component in G(n, p) and many other models of random graphs (and hypergraphs, and . . . ). More details on the subject can be found in [4, 10]. The problem is also closely related to the area of percolation. In fact, the emergence of the giant component in G(n, p) is sometimes referred to as “mean-field percolation”.


9 The entropy

The goal is to provide a measure for the “uncertainty” of a random variable. This section follows [14, Chapters 1 and 4]. Suppose that X : Ω → X, where X (the range of X) is a finite set. Let us write

p(x) = P[X = x] (x ∈ X).

Using the convention 0 ln 0 = 0, we define the entropy of X as

H_X = H(p) = −\sum_{x∈X} p(x) ln p(x).

The entropy does indeed provide a measure of the intuitive notion of uncertainty. Indeed, if X is a deterministic quantity (say, X = 0 with certainty), then H_X = 0. On the other hand, if X has the uniform distribution on X, then H_X = ln |X|, and we shall see in due course that this is the maximum value that the entropy can take.

A closely related concept is the Kullback-Leibler divergence of two probability distributions p, q on X, defined by

D(q‖p) = \sum_{x∈X} q(x) ln(q(x)/p(x)).

(Here we use the convention that 0 ln(0/0) = 0, and we assume that q(x) > 0 only if p(x) > 0.) Recall Jensen’s inequality: if f is a convex function and X is a random variable, then

E[f(X)] ≥ f(E[X]). (24)
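For readers who like to experiment, here is a minimal Python sketch of the two quantities just defined; it also illustrates Gibbs’ inequality (Proposition 9.1, part 3 below) numerically on arbitrarily chosen distributions.

```python
from math import log

def entropy(p):
    """H(p) = -sum p(x) ln p(x), with the convention 0 ln 0 = 0."""
    return -sum(px * log(px) for px in p if px > 0)

def kl(q, p):
    """Kullback-Leibler divergence D(q || p); assumes q(x) > 0 only where p(x) > 0."""
    return sum(qx * log(qx / px) for qx, px in zip(q, p) if qx > 0)

uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]
print("H(uniform) =", entropy(uniform), " (= ln 4 =", log(4), ")")
print("H(skewed)  =", entropy(skewed))
print("D(skewed || uniform) =", kl(skewed, uniform), ">= 0, as Gibbs' inequality asserts")
```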

Proposition 9.1 The Kullback-Leibler divergence has the following properties.

1. D(q‖p) is convex in q; that is, for any q, q′ and any 0 < α < 1 we have

D((1 − α)q + αq′ ‖ p) ≤ (1 − α) D(q‖p) + α D(q′‖p).

2. D(q‖p) is convex in p.

3. D(q‖p) ≥ 0 (“Gibbs’ inequality”) and D(q‖p) = 0 iff p = q.

4. If p = p_1 ⊗ p_2 and q = q_1 ⊗ q_2 are product distributions, then

D(q‖p) = D(q_1‖p_1) + D(q_2‖p_2).

Proof. ad 1.: the differential of D(q‖p) with respect to q(x) is

∂D(q‖p)/∂q(x) = 1 + ln(q(x)/p(x)).

Further, for x ≠ x′ the second differentials are

∂²D(q‖p)/∂q(x)² = 1/q(x),   ∂²D(q‖p)/(∂q(x)∂q(x′)) = 0.

Thus, the Hessian is positive semidefinite, whence D(q‖p) is convex in q.

ad 2.: because z ↦ − ln z is a convex function, we have

q(x) ln( q(x) / ((1 − α)p(x) + αp′(x)) ) ≤ (1 − α) q(x) ln(q(x)/p(x)) + α q(x) ln(q(x)/p′(x));

summing over x ∈ X yields the assertion.

ad 3.: let E_q denote the expectation over the probability distribution q. Then D(q‖p) = −E_q[ln Y], where Y(x) = 1_{p(x)q(x)>0} · p(x)/q(x). Because z ↦ − ln z is convex, (24) yields

D(q‖p) = −E_q ln Y ≥ − ln E_q[Y] = − ln \sum_{x∈X: q(x)>0} q(x) · p(x)/q(x) = − ln \sum_{x∈X: q(x)>0} p(x) ≥ 0.


Further, if q ≠ p, then the first inequality above is strict.

ad 4.: assume that X = X_1 × X_2 and q = q_1 ⊗ q_2, p = p_1 ⊗ p_2. Then

D(q‖p) = \sum_{(x_1,x_2)∈X} q_1(x_1) q_2(x_2) ln( q_1(x_1) q_2(x_2) / (p_1(x_1) p_2(x_2)) )
  = \sum_{(x_1,x_2)∈X} q_1(x_1) q_2(x_2) [ ln(q_1(x_1)/p_1(x_1)) + ln(q_2(x_2)/p_2(x_2)) ]
  = \sum_{x_1∈X_1} \sum_{x_2∈X_2} q_1(x_1) ln(q_1(x_1)/p_1(x_1)) · q_2(x_2) + \sum_{x_1∈X_1} \sum_{x_2∈X_2} q_2(x_2) ln(q_2(x_2)/p_2(x_2)) · q_1(x_1)
  = D(q_1‖p_1) + D(q_2‖p_2),

as claimed. □

We emphasize that generally D(q‖p) ≠ D(p‖q).

Proposition 9.2 The entropy has the following properties.

1. H_X ≥ 0 and H_X = 0 iff X takes one value with probability 1.

2. The entropy is maximized if p(x) is the uniform distribution.

3. Let X, Y be random variables, and let H_{X,Y} be shorthand for the entropy of the random variable (X, Y). Then

H_{X,Y} ≤ H_X + H_Y.

Furthermore, H_{X,Y} = H_X + H_Y iff X, Y are independent.

4. Let A ⊂ X be an event and let p_A and p_{Ā} be the conditional distributions of X given that A occurs resp. does not occur. Let 1_A be the indicator of the event A. Then

H_X = H_{1_A} + P[A] · H(p_A) + (1 − P[A]) · H(p_{Ā}).

Proof. The first claim simply follows from the fact that −z ln z > 0 for z ∈ (0, 1). To obtain the second statement, let p be the distribution of X and let u be the uniform distribution on its range X. Then

0 ≤ D(p‖u) = ln |X| − H_X.

Similarly, to prove 3. let p_{X,Y} be the joint distribution of X, Y and let p_X ⊗ p_Y be the product distribution. Then

0 ≤ D(p_{X,Y} ‖ p_X ⊗ p_Y) = −H_{X,Y} − \sum_{(x,y)} p_{X,Y}(x, y) ln(p_X(x) p_Y(y))
  = −H_{X,Y} − \sum_{(x,y)} p_{X,Y}(x, y) ln p_X(x) − \sum_{(x,y)} p_{X,Y}(x, y) ln p_Y(y)
  = −H_{X,Y} + H_X + H_Y,

as desired.


With respect to 4. we have

H_X = −\sum_{x∈A} p(x) ln p(x) − \sum_{x∉A} p(x) ln p(x)
  = −P[A] \sum_{x∈A} (p(x)/P[A]) ln p(x) − (1 − P[A]) \sum_{x∉A} (p(x)/(1 − P[A])) ln p(x)
  = −P[A] \sum_{x∈A} (p(x)/P[A]) [ ln(p(x)/P[A]) + ln P[A] ] − (1 − P[A]) \sum_{x∉A} (p(x)/(1 − P[A])) [ ln(p(x)/(1 − P[A])) + ln(1 − P[A]) ]
  = −P[A] ln P[A] − (1 − P[A]) ln(1 − P[A]) + P[A] H(p_A) + (1 − P[A]) H(p_{Ā})
  = H_{1_A} + P[A] H(p_A) + (1 − P[A]) H(p_{Ā}),

as claimed. □

Suppose that X, Y are random variables with values in X, Y respectively. Let p_{X,Y} be their joint distribution. Moreover, let p(y|x) = p(x, y)/p(x). The conditional entropy is

H_{Y|X} = −\sum_{x∈X} p(x) \sum_{y∈Y} p(y|x) ln p(y|x) = \sum_{x∈X} p(x) · H(p( · |x)).

In addition, we define the mutual information as

I_{X,Y} = \sum_{(x,y)∈X×Y} p(x, y) ln( p(x, y) / (p(x) p(y)) ).

Proposition 9.3

1. H_{X,Y} = H_X + H_{Y|X} (“chain rule”).

2. I_{X,Y} = H_Y − H_{Y|X} = H_X − H_{X|Y}.

3. I_{X,Y} ≥ 0 and I_{X,Y} = 0 iff X, Y are independent.

Proof. ad 1.: we have

H_{X,Y} = −\sum_{(x,y)∈X×Y} p(x, y) ln p(x, y) = −\sum_{y∈Y} p(y) \sum_{x∈X} (p(x, y)/p(y)) [ ln(p(x, y)/p(y)) + ln p(y) ] = H_Y + H_{X|Y}.

ad 2.: we have

I_{X,Y} = \sum_{(x,y)∈X×Y} p(x, y) [ ln(p(x, y)/p(y)) − ln p(x) ]
  = −\sum_{(x,y)∈X×Y} p(x, y) ln p(x) + \sum_{y∈Y} p(y) \sum_{x∈X} (p(x, y)/p(y)) ln(p(x, y)/p(y))
  = H_X − H_{X|Y}.

ad 3.: note that I_{X,Y} = D(p_{X,Y} ‖ p_X ⊗ p_Y), where p_X, p_Y are the distributions of X, Y. Thus, the assertion follows from Gibbs’ inequality. □

We are going to see two applications of the above concepts. The first one is Sanov’s theorem. Let p be a probability distribution over a finite set X and let X_1, X_2, . . . be independent random variables with distribution p. Define ξ_n as the vector with entries

ξ_n(x) = (1/n) \sum_{i=1}^{n} 1_{X_i = x}.

Thus, ξ_n is the “empirical distribution” after n trials.


Theorem 9.4 Let ∅ ≠ K ⊂ [0, 1]^X be a compact set of probability distributions on X. Moreover, let

δ = min_{q∈K} D(q‖p).

Then P[ξ_n ∈ K] = exp(−δn + o(n)).

Proof. Call a probability distribution q ∈ K feasible if P[ξ_n = q] > 0. For any feasible q we have

P[ξ_n = q] = \binom{n}{(q_x n)_{x∈X}} \prod_{x∈X} p(x)^{q_x n}.

By Stirling’s formula,

(1/n) ln P[ξ_n = q] ∼ \sum_{x∈X} q_x ln(p_x/q_x) = −D(q‖p).

The assertion follows from the fact that there are no more than (n + 1)^{|X|} = exp(o(n)) feasible q. □

The second application is an alternative approach to the Chernoff bound for the binomial distribution. In a sense, this corresponds to the case that X = {0, 1} in Sanov’s theorem.

Theorem 9.5 Let p, q ∈ (0, 1). Then P[Bin(n, p) = ⌈qn⌉] = Θ(n^{−1/2}) exp(−D(q‖p) n).

Proof. Let Q = ⌈qn⌉ and let q̄ = (Q/n, 1 − Q/n). By Stirling’s formula,

P[Bin(n, p) = Q] = \binom{n}{Q} p^Q (1 − p)^{n−Q}
  = Θ((q(1 − q)n)^{−1/2}) (np/Q)^Q (n(1 − p)/(n − Q))^{n−Q}
  = Θ((q(1 − q)n)^{−1/2}) exp[ n( (Q/n) ln(np/Q) + ((n − Q)/n) ln(n(1 − p)/(n − Q)) ) ]
  = Θ((q(1 − q)n)^{−1/2}) exp[−n D(q̄‖p)]. (25)

For any 0 < z < 1 the function

f : z ↦ z ln(z/p) + (1 − z) ln((1 − z)/(1 − p))

is differentiable. In particular, f(z + δ) = f(z) + O(δ) as δ → 0. As |Q/n − q| = O(1/n), we thus obtain

D(q̄‖p) = D(q‖p) + O(1/n).

Thus, the assertion follows from (25). □

Theorem 9.5 (and, more generally, Sanov’s theorem) gives us a very accurate estimate of the probability of an “unlikely” outcome of the binomial distribution. Indeed, we obtain bounds that are tight on a logarithmic scale, in contrast to mere “concentration inequalities”. Results of this type are called large deviations principles.

The Chernoff bound can be obtained fairly easily from Theorem 9.5. All that is required is a bit of calculus (basically one has to differentiate D(q‖p) twice in q and apply Taylor’s theorem).
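As an illustration of the large deviations estimate (not part of the original notes), the following sketch compares ln P[Bin(n, p) = ⌈qn⌉], computed in log space via lgamma, with −n D(q‖p) − (1/2) ln n for arbitrarily chosen p, q; the difference stays bounded, as Theorem 9.5 predicts.

```python
from math import lgamma, log, ceil

def bernoulli_kl(q, p):
    """D(q || p) for Bernoulli distributions with success probabilities q and p."""
    return q * log(q / p) + (1 - q) * log((1 - q) / (1 - p))

def log_binom_pmf(n, k, p):
    """ln P[Bin(n, p) = k], computed in log space to avoid overflow for large n."""
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(p) + (n - k) * log(1 - p))

p, q = 0.3, 0.5
for n in (100, 1000, 10000):
    Q = ceil(q * n)
    print(f"n={n:6d}  ln P[Bin = Q] = {log_binom_pmf(n, Q, p):9.2f}   "
          f"-n*D(q||p) - 0.5*ln(n) = {-n * bernoulli_kl(q, p) - 0.5 * log(n):9.2f}")
```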

10 The Boltzmann distribution

This section follows [14, Chapters 2 and 4].

Let X be a finite set whose elements we call configurations. The idea is that a configuration x ∈ X gives a full microscopic description of some physical system. Moreover, let E : X → R be a function; we


refer to E(x) as the energy of x. Given a number β > 0, called the inverse temperature, we can define the Boltzmann distribution

µ_β : X → [0, 1], µ_β(x) = exp(−βE(x)) / Z(β), where Z(β) = \sum_{x∈X} exp(−βE(x)).

We call Z(β) the partition function.

If we think of 1/β as the “temperature” of the system, then we realise two extreme cases. If the temperature tends to zero (and thus β → ∞), then the main contribution to Z(β), and thus to the Boltzmann distribution, stems from configurations x where E(x) is “small”. Indeed, we call a configuration x_0 ∈ X such that E(x_0) = min_{x∈X} E(x) a ground state. By contrast, as β → 0 the impact of the energy function diminishes.

Thinking of β as a fixed number, we denote the expectation of a random variable O with respect to the Boltzmann distribution by ⟨O⟩. Thus,

⟨O⟩ = \sum_{x∈X} µ_β(x) O(x).

When working with Boltzmann distributions, we frequently encounter the hyperbolic functions: letting i = \sqrt{−1}, we have

cosh(x) = cos(ix) = (exp(x) + exp(−x))/2,
sinh(x) = −i sin(ix) = (exp(x) − exp(−x))/2,
tanh(x) = −i tan(ix) = sinh x / cosh x = (exp(x) − exp(−x))/(exp(x) + exp(−x)),
coth(x) = 1/tanh(x) = cosh x / sinh x = (exp(x) + exp(−x))/(exp(x) − exp(−x)).

As a first example of a Boltzmann distribution, consider the configuration set X = {−1, 1} and let E(x) = −Bx for some fixed number B > 0 (the “external field”). Then

µ_β(x) = Z(β)^{−1} exp(−βE(x)) with Z(β) = exp(−βB) + exp(βB) = 2 cosh(βB).

This system is called the Ising spin. The magnetization is just the average of x itself over the Boltzmann distribution:

⟨x⟩ = \sum_{x=±1} µ_β(x) x = tanh(βB).

Thus, at a fixed β the magnetization is a smooth function of the “external magnetic field” B.
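The Ising spin is small enough to evaluate by machine. The following sketch (standard library only; the value of B is an arbitrary choice) computes the Boltzmann distribution on {−1, +1} and checks that ⟨x⟩ = tanh(βB).

```python
from math import exp, tanh

def boltzmann(energies, beta):
    """Boltzmann distribution mu_beta(x) = exp(-beta E(x)) / Z(beta) on a finite configuration set."""
    weights = {x: exp(-beta * E) for x, E in energies.items()}
    Z = sum(weights.values())
    return {x: w / Z for x, w in weights.items()}

B = 0.8                                  # external field (arbitrary)
energies = {+1: -B, -1: +B}              # the Ising spin: E(x) = -B x
for beta in (0.5, 1.0, 2.0):
    mu = boltzmann(energies, beta)
    magnetization = sum(x * mu[x] for x in mu)
    print(f"beta={beta}  <x> = {magnetization:.4f}  tanh(beta*B) = {tanh(beta * B):.4f}")
```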

Typically the energy function will take a particular form. Suppose that G is a bipartite graph whose vertex set decomposes into two disjoint subsets V and F such that each edge links a vertex in V with one in F. We call the vertices in V the variables and the ones in F the factors. For a vertex y of G we let ∂y denote the set of neighbors of y. We refer to G as the factor graph.

Let X_0 be a set of spins. Then our set X of configurations comprises all maps x : V → X_0. Further, with each factor f ∈ F we associate a function E_f : X_0^{∂f} → R. Then

E : X → R, x ↦ \sum_{f∈F} E_f(x_{∂f}),

where x_{∂f} : ∂f → X_0, j ↦ x_j is the restriction of x to ∂f.

The sets ∂f, f ∈ F, will normally be “small”, while the number |V| of variable nodes will be huge (or even tend to ∞). If max_{f∈F} |∂f| = k, then we speak of k-body interaction. The case of 1-body interaction is also called interaction-free. Observe that in this case the partition function factorizes.

even tend to∞). If maxf∈F |∂f | = k, then we speak of k-body interaction. The case of 1-body interactionis also called interaction-free. Observe that in this case the partition function factorizes.


The Boltzmann distribution gives rise to a number of “thermodynamic potentials”. The single most important one is the free energy

F(β) = −(1/β) ln Z(β).

By omitting the −1/β in front, we obtain its sibling, the free entropy

Φ(β) = ln Z(β).

Additionally, we have the internal energy

U(β) = ∂(βF(β))/∂β

and the canonical entropy

S(β) = β² ∂F(β)/∂β.

Straightforward calculations yield

Proposition 10.1 We have

µ_β(x) = exp(−β(E(x) − F(β))),
U(β) = ⟨E(x)⟩,
S(β) = H(µ_β),
F(β) = U(β) − S(β)/β,
−∂²(βF(β))/∂β² = ⟨E(x)²⟩ − ⟨E(x)⟩².

As mentioned earlier, we are going to be interested in systems where the number n of “particles” (= variable nodes) tends to infinity. We refer to this as the thermodynamic limit. That is, instead of a single “system” we consider a sequence of “systems” (i.e., Boltzmann distributions). Each of them induces the above thermodynamic potentials such as F_n, U_n, etc. Given such a sequence, we refer to

f(β) = lim_{n→∞} F_n(β)/n

as the free energy density. Analogous definitions are provided for the other potentials. Of course, for any n the free energy F_n(β) is an analytic function, because Z(β) is. However, the free energy density f(β) may be non-analytic. The points β where this occurs are of special interest. We call them phase transitions. They are expected to mirror qualitative changes in the physical nature of the system.

The Boltzmann distribution µ_β on the configuration space X can be viewed as the minimizer of a certain functional. Namely, for a probability distribution P on X we define

G[P] = ⟨E(x)⟩_P − H(P)/β = \sum_{x∈X} P(x) E(x) + (1/β) \sum_{x∈X} P(x) ln P(x).

The functional P ↦ G[P] is called the Gibbs free energy.

Proposition 10.2 The Gibbs free energy is a convex functional. It attains its unique minimum at the Boltzmann distribution µ_β, where it takes the value G[µ_β] = F(β).

Proof. We have

G[P] = \sum_{x∈X} P(x) E(x) + (1/β) \sum_{x∈X} P(x) ln P(x)
  = (1/β) \sum_{x∈X} P(x) [ ln(Z(β)/exp(−βE(x))) + ln P(x) − ln Z(β) ]
  = F(β) + (1/β) \sum_{x∈X} P(x) ln(P(x)/µ_β(x)) = F(β) + D(P‖µ_β)/β.


Thus, the assertion follows from Proposition 9.1. □

A consequence of the above result is that for any probability distribution P, G[P] yields an upper bound on the free energy.

11 The one-dimensional Ising model

This section follows [14, Section 2.5].

The Ising model, invented by Lenz in the 1920s, is one of the most basic models of statistical mechanics. At the same time, many important questions regarding the Ising model remain open. In the Ising model, the variable nodes V are the points of the d-dimensional integer grid {1, . . . , n}^d. There is a function node connected to each single variable node. In addition, there is a function node between two variable nodes v, w iff ‖v − w‖_1 = 1. The configuration space is X = {±1}^V, and the energy function is

E : X → R, σ ↦ −\sum_{i,j∈V: ‖i−j‖_1=1} σ_i σ_j − B \sum_{i∈V} σ_i.

We refer to B as the external field. Let N = |V|.

Currently, despite considerable efforts, the free energy density f(β) is known in the Ising model only in dimensions d = 1, 2. The case d = 1 was solved by Ernst Ising (1924). Lars Onsager, who received the Nobel prize in 1968, solved the d = 2 case in 1944.

The average magnetization is

M_N(β, B) = (1/N) \sum_{i∈V} ⟨σ_i⟩.

Clearly, M_N(β, 0) = 0. We define the spontaneous magnetization as

M_+(β) = lim_{B↘0} lim_{N→∞} M_N(β, B).

We say that the system is paramagnetic at β if M_+(β) = 0, and ferromagnetic otherwise.

Theorem 11.1 For d = 1 the free entropy density of the Ising model is

φ(β, B) = ln( exp(β) cosh(βB) + \sqrt{exp(2β) sinh²(βB) + exp(−2β)} ).

Proof. The proof is based on the transfer matrix method. Let us label the vertices on the path from “left to right” as 1, . . . , N. Then

E(σ) = −\sum_{i=1}^{N−1} σ_i σ_{i+1} − B \sum_{i=1}^{N} σ_i.

Thus, the partition function is

Z_N(β, B) = \sum_{σ∈{±1}^N} exp[ β \sum_{i=1}^{N−1} σ_i σ_{i+1} + βB \sum_{i=1}^{N} σ_i ].

Define

z_p(β, B, σ_{p+1}) = \sum_{σ_1,...,σ_p=±1} exp[ β \sum_{i=1}^{p} σ_i σ_{i+1} + βB \sum_{i=1}^{p} σ_i ].

Then

Z_N(β, B) = \sum_{σ_N=±1} z_{N−1}(β, B, σ_N) exp(βBσ_N).

The key observation is that there is a simple linear recurrence between the z_p’s. Namely, define

T(σ, σ′) = exp(βσσ′ + βBσ′).


In matrix notation,

T = \begin{pmatrix} exp(β + βB) & exp(−β − βB) \\ exp(−β + βB) & exp(β − βB) \end{pmatrix}.

Now,

z_p(β, B, σ_{p+1}) = \sum_{σ_p=±1} T(σ_{p+1}, σ_p) z_{p−1}(β, B, σ_p).

Letting

ψ_L = (exp(βB), exp(−βB))^T,   ψ_R = (1, 1)^T,

we obtain

Z_N(β, B) = ⟨ψ_L, T^{N−1} ψ_R⟩. (26)

The matrix T has the two eigenvalues

λ_{1/2} = exp(β) cosh(βB) ± \sqrt{exp(2β) sinh²(βB) + exp(−2β)}.

Let ψ_1, ψ_2 be the corresponding eigenvectors. Then (26) becomes

Z_N(β, B) = ⟨ψ_R, ψ_1⟩ ⟨ψ_L, ψ_1⟩ λ_1^{N−1} + ⟨ψ_R, ψ_2⟩ ⟨ψ_L, ψ_2⟩ λ_2^{N−1}.

One verifies that ⟨ψ_R, ψ_1⟩, ⟨ψ_L, ψ_1⟩ ≠ 0. As λ_1 > λ_2, we get

φ(β, B) = lim_{N→∞} (1/N) ln Z_N(β, B) = ln λ_1,

as claimed. □
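The transfer matrix formula can be checked against brute-force enumeration for small chains; the sketch below (parameters arbitrary) compares (1/N) ln Z_N with ln λ_1.

```python
from math import exp, cosh, sinh, sqrt, log
from itertools import product

def ln_Z_direct(N, beta, B):
    """Brute-force ln Z_N(beta, B) for the one-dimensional Ising chain."""
    Z = 0.0
    for sigma in product((-1, 1), repeat=N):
        interaction = sum(sigma[i] * sigma[i + 1] for i in range(N - 1))
        field = sum(sigma)
        Z += exp(beta * interaction + beta * B * field)
    return log(Z)

beta, B = 0.7, 0.3
lam1 = exp(beta) * cosh(beta * B) + sqrt(exp(2 * beta) * sinh(beta * B) ** 2 + exp(-2 * beta))
for N in (4, 8, 12, 16):
    print(f"N={N:2d}  ln Z_N / N = {ln_Z_direct(N, beta, B) / N:.4f}   ln(lambda_1) = {log(lam1):.4f}")
```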

Corollary 11.2 In the 1-dimensional Ising model,

lim_{N→∞} M_N(β, B) = sinh(βB) / \sqrt{sinh²(βB) + exp(−4β)}.

Proof. We have lim_{N→∞} M_N(β, B) = (1/β) ∂φ(β, B)/∂B. □

There are two interesting variants of the Ising model. These arise by modifying the energy function: we introduce couplings J = (J_{ij})_{i,j∈V} such that J_{ij} = J_{ji} ∈ R for all i, j. Given these couplings, we define the energy function

E_J : X → R, σ ↦ −\sum_{i,j∈V: ‖i−j‖_1=1} J_{ij} σ_i σ_j − B \sum_{i∈V} σ_i.

The first variant of the Ising model is obtained by setting J_{ij} = −1 for all i, j. This is called the Ising antiferromagnet, because the energy function “encourages” adjacent variables to take opposite spins.

The second variant is the Edwards-Anderson model of “spin glasses”. In this model, the couplings J_{ij} themselves are chosen randomly. For instance, the J_{ij} could be independent Gaussians (with an appropriate variance). Thus, in this case the energy function, and thus the Boltzmann distribution itself, is a random object. The goal is to estimate E[φ(β, B)], with the expectation taken over the choice of J. As it turns out, the random choice of J leads to frustration, i.e., there generally is no zero-energy ground state. The mathematical understanding of both the Ising antiferromagnet and of the Edwards-Anderson model is rudimentary.


12 The Curie-Weiss model

We continue to follow [14, Chapter 2].

In the Curie-Weiss model, we have N variables V = {v_1, . . . , v_N}, and there is a factor node f_{ij} associated with any pair 1 ≤ i < j ≤ N of variables, and one factor node f_i associated with each variable v_i. Thus, there are N(N + 1)/2 factor nodes in total. The configuration space is X = {±1}^N, and the energy function is

E : X → R, σ ↦ −(1/N) \sum_{1≤i<j≤N} σ_i σ_j − B \sum_{i=1}^{N} σ_i.

This is an example of a mean-field model: there is “no underlying geometry” (of bounded dimension). This is “unrealistic”, but the upshot is that we can actually determine the free entropy density. Let

h(z) = −z ln z − (1 − z) ln(1 − z)

be the entropy function.

Theorem 12.1 Let

φ(m, β, B) = −(β/2)(1 − m²) + βBm + h((1 + m)/2).

Then the free entropy density of the Curie-Weiss model is

φ(β, B) = max_{m∈[−1,1]} φ(m, β, B).

Proof. For a configuration σ = (σ_1, . . . , σ_N) we define

m(σ) = (1/N) \sum_{i=1}^{N} σ_i;

this is called the empirical magnetization. The key observation is that

E(σ) = N/2 − (N/2) m(σ)² − NB m(σ).

Therefore, summing over the (finite number of) possible empirical magnetizations m, we find

Z_N(β, B) = exp(−Nβ/2) \sum_m \binom{N}{N(1 + m)/2} exp[ (Nβ/2) m² + NβB m ]. (27)

The binomial coefficient \binom{N}{N(1+m)/2} simply enumerates the number of σ with m(σ) = m. By Stirling’s formula,

(1/N) ln \binom{N}{N(1 + m)/2} ∼ h((1 + m)/2).

Hence,

Z_N(β, B) = exp(o(N)) \sum_m exp(N φ(m, β, B)). (28)

Since there are O(N) summands in (28) and because m ↦ φ(m, β, B) is continuous, we see that

(1/N) ln Z_N(β, B) ∼ max_{m∈[−1,1]} φ(m, β, B),

as claimed. □

As in the case of the Ising model, we consider

M_N(β, B) = (1/N) \sum_{i∈V} ⟨σ_i⟩ ∼ (1/β) ∂φ(β, B)/∂B

and

M_+(β) = lim_{B↘0} lim_{N→∞} M_N(β, B) = lim_{B↘0} (1/β) ∂φ(β, B)/∂B.


Corollary 12.2 For $\beta<\beta_c = 1$ we have $M_+(\beta) = 0$. By contrast, $M_+(\beta)>0$ for $\beta>\beta_c$. In particular, there is a phase transition at $\beta = \beta_c$.

Proof. Suppose first that $B = 0$. Then the function $m\mapsto\phi(m,\beta,B)$ is symmetric in $m$, and thus its differential at $m = 0$ vanishes. But where the function attains its global maximum depends on β. More precisely, it turns out that for $\beta<\beta_c = 1$ (the "Curie point") the global maximum is attained at $m = 0$. Indeed, the differentials of $\phi(m,\beta,0)$ are
\[ \partial_m\phi(m,\beta,0) = \beta m-\frac{\ln(1+m)-\ln(1-m)}2,\qquad \frac{\partial^2}{\partial m^2}\phi(m,\beta,0) = \beta-\frac1{1-m^2}. \]
Thus, for $\beta<1$ the function is concave, with its maximum at $m = 0$. However, for $\beta>1$ the global maximum is attained at another point $m_+(\beta)>0$ (and, by symmetry, at its mirror image $m_-(\beta) = -m_+(\beta)<0$), while there is a local minimum at $m = 0$.

In effect, for $\beta<\beta_c$ we have $M_+(\beta) = 0$, because for sufficiently small $B>0$ the maximizer of $\phi(m,\beta,B)$ stays close to $m = 0$ and tends to $0$ as $B\searrow0$. By contrast, for $\beta>\beta_c$ we have $M_+(\beta)>0$ (because $B>0$ rules out $m_-(\beta)$ as a global maximum). □
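Setting the first differential to zero at $B = 0$ gives the stationarity condition $m = \tanh(\beta m)$. The following sketch (Python; an illustration, not part of the notes) iterates this fixed-point equation to approximate $m_+(\beta)$, making the phase transition at $\beta_c = 1$ visible.

```python
import numpy as np

def m_plus(beta, iters=400):
    """Largest solution of m = tanh(beta*m), i.e. the stationarity condition
    d/dm phi(m, beta, 0) = 0.  Convergence is only algebraic right at beta = 1."""
    m = 1.0                      # start at m = 1 to converge to the largest root
    for _ in range(iters):
        m = np.tanh(beta * m)
    return float(m)

for beta in (0.5, 0.9, 1.1, 1.5, 2.0):
    print(beta, round(m_plus(beta), 4))   # ~0 below beta_c = 1, strictly positive above
```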

As in the case of the Ising model, there is a "spin glass" variant of the Curie-Weiss model. In the Sherrington-Kirkpatrick model we introduce couplings $J = (J_{ij})_{1\le i<j\le N}$ that are mutually independent standard Gaussians. This gives rise to the energy function
\[ E(\sigma) = -\frac1{\sqrt N}\sum_{1\le i<j\le N}J_{ij}\sigma_i\sigma_j-B\sum_{i=1}^N\sigma_i. \]
Confirming the non-rigorous prediction of Giorgio Parisi (1980), Talagrand [16] determined the free energy density in this model.

13 The random energy model

This section follows [15, Chapter 1].

Let $\mathcal X = \{\pm1\}^N$. The energy function in the random energy model is defined by letting $(E(\sigma))_{\sigma\in\mathcal X}$ be mutually independent centred Gaussian variables with
\[ \mathrm{Var}(E(\sigma)) = N/2. \]
In particular, the energy function is random. We are thus interested in computing
\[ \mathrm E[\phi(\beta)] = \lim_{N\to\infty}\frac1N\,\mathrm E\ln Z(\beta). \]
By convexity, this is upper bounded by $\lim_{N\to\infty}\frac1N\ln\mathrm E[Z(\beta)]$. However, generally these two quantities do not coincide. Because there is no interaction between different configurations, the random energy model is an extreme case.

We begin with a simple fact about Gaussian random variables. First, we need the following tail bounds. Let $g$ be a centred Gaussian with variance $\tau^2$; then for any $t\ge0$,
\[ \mathrm P[g\ge t]\le\exp\Big[-\frac{t^2}{2\tau^2}\Big], \qquad(29) \]
\[ \mathrm P[g\ge t]\ge\frac1{L\cdot(1+t/\tau)}\exp\Big[-\frac{t^2}{2\tau^2}\Big]\quad\text{with }L>0\text{ an absolute constant.} \qquad(30) \]

Page 25: Probabilistic Combinatorics - math.uni-frankfurt.deacoghlan/probcomb.pdf · 3 THE ERDOS-KO-RADO THEOREM˝ 4 Thus, we need to compute E(I S) for a fixed set S. Of course, E(I S) is

13 THE RANDOM ENERGY MODEL 25

Proposition 13.1 Let $g_1,\dots,g_M$ be (not necessarily independent) centred Gaussian random variables such that $\max_{i\le M}\mathrm{Var}(g_i)\le\tau^2$. Then
\[ \mathrm E\ln\sum_{i\le M}\exp(\beta g_i)\le\frac{\beta^2\tau^2}2+\ln M, \qquad \mathrm E\max_{i\le M}g_i\le\tau\sqrt{2\ln M}, \]
\[ \mathrm E\ln\sum_{i\le M}\exp(\beta g_i)\le\beta\tau\sqrt{2\ln M}\quad\text{if }\beta\ge\sqrt{2\ln M}/\tau. \]

Proof. This will be covered in the exercises. □

Corollary 13.2 The free entropy density of the random energy model satisfies
\[ \phi(\beta)\le\frac{\beta^2}4+\ln2 \qquad\text{and}\qquad \phi(\beta)\le\beta\sqrt{\ln2}\ \text{ if }\beta\ge2\sqrt{\ln2}. \]

Proof. This follows from Proposition 13.1 by setting $M = 2^N$ and $\tau = \sqrt{N/2}$. □

Theorem 13.3 The free entropy density of the random energy model satisfies
\[ \phi(\beta) = \begin{cases}\frac{\beta^2}4+\ln2 & \text{if }\beta\le2\sqrt{\ln2},\\ \beta\sqrt{\ln2} & \text{if }\beta\ge2\sqrt{\ln2}.\end{cases} \]

Proof. Let $X_s = |\{\sigma\in\mathcal X: E(\sigma)\le-sN\}|$ and let
\[ a_s = \mathrm P[E(\sigma_0)\le-sN], \]
where $\sigma_0\in\mathcal X$ is any fixed configuration. Using (29) and (30), we can estimate $a_s$ as follows:
\[ \frac1{Ls\sqrt N}\exp(-Ns^2)\le a_s\le\exp(-Ns^2). \qquad(31) \]
Furthermore, the random variable $X_s$ has the binomial distribution ${\rm Bin}(2^N,a_s)$ (because the random variables $(E(\sigma))_{\sigma\in\mathcal X}$ are independent). Let $A_s = \{X_s\le2^{N-1}a_s\}$. Then by Chebyshev's inequality,
\[ \mathrm P[A_s]\le\frac{\mathrm{Var}(X_s)}{(2^{N-1}a_s)^2}\le\frac4{2^Na_s}. \qquad(32) \]
If $A_s$ does not occur, then $Z_N\ge X_s\exp(\beta Ns)\ge2^{N-1}a_s\exp(\beta Ns)$, and thus
\[ \frac1N\,\mathrm E\big[\ln Z_N\,\big|\,\bar A_s\big]\ge\frac{N-1}N\ln2+\frac1N\ln a_s+\beta s. \qquad(33) \]
Furthermore, for any fixed $\sigma_0\in\mathcal X$ we have $Z_N\ge\exp(-\beta E(\sigma_0))$. Hence,
\[ \frac1N\,\mathrm E[\mathbf 1_{A_s}\ln Z_N]\ge-\frac\beta N\,\mathrm E[\mathbf 1_{A_s}|E(\sigma_0)|]\ge-\frac\beta N\,\mathrm E|E(\sigma_0)| = -\frac1{\sqrt\pi}\cdot\frac\beta{\sqrt N}. \qquad(34) \]
Combining (32)–(34), we see that
\[ \frac1N\,\mathrm E\ln Z_N\ge\Big(1-\frac4{2^Na_s}\Big)\Big(\frac{N-1}N\ln2+\frac1N\ln a_s+\beta s\Big)-\frac{L\beta}{\sqrt N}. \qquad(35) \]
Finally, we combine (31) and (35). If $s<\sqrt{\ln2}$, then (31) shows that $2^Na_s\to\infty$. Hence, (35) yields
\[ \frac1N\,\mathrm E\ln Z_N\ge\ln2+\beta s-s^2-o(1). \qquad(36) \]
If $\beta<2\sqrt{\ln2}$, then let $s = \beta/2$, and if $\beta\ge2\sqrt{\ln2}$, then let $s\to\sqrt{\ln2}$. This yields the desired lower bound on $\phi(\beta)$. The upper bound follows from Corollary 13.2. □
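Theorem 13.3 is easy to probe numerically. The sketch below (Python; illustration only) draws the $2^N$ independent energies for a moderate $N$, computes $\frac1N\ln Z$ by a log-sum-exp, and compares the Monte Carlo average with the limiting formula; finite-size corrections are noticeable, especially for $\beta$ above $2\sqrt{\ln2}$.

```python
import numpy as np

rng = np.random.default_rng(0)

def rem_phi_mc(N, beta, samples=20):
    """Monte Carlo estimate of (1/N) E ln Z for the random energy model."""
    vals = []
    for _ in range(samples):
        E = rng.normal(0.0, np.sqrt(N / 2), size=2 ** N)   # Var(E(sigma)) = N/2
        lnZ = np.logaddexp.reduce(-beta * E)                # log of sum of exp(-beta*E)
        vals.append(lnZ / N)
    return float(np.mean(vals))

def rem_phi_limit(beta):
    bc = 2 * np.sqrt(np.log(2))
    return beta ** 2 / 4 + np.log(2) if beta <= bc else beta * np.sqrt(np.log(2))

for beta in (0.5, 1.5, 2.5):
    print(beta, round(rem_phi_mc(18, beta), 4), "->", round(rem_phi_limit(beta), 4))
```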

In the random energy model there is a qualitative difference between the "high temperature" regime $\beta<2\sqrt{\ln2}$ and the "low temperature" case $\beta>2\sqrt{\ln2}$. The following result, which will be discussed in the exercises, characterizes the high temperature case.

Theorem 13.4 Assume that $\beta<2\sqrt{\ln2}$. Let $n>0$ and $\eta_1,\dots,\eta_n\in\{\pm1\}$ remain fixed as $N\to\infty$. Then
\[ \lim_{N\to\infty}\mathrm E\Big\langle\prod_{i=1}^n\mathbf 1_{\sigma_i=\eta_i}\Big\rangle = 2^{-n}. \]

To characterize the low temperature case, we need to define a particular probability distribution. The starting point is as follows. Let µ be a finite measure on $\mathbb R_{\ge0}$ that is, let's say, absolutely continuous with respect to the Lebesgue measure. A Poisson point process of intensity µ is a probability distribution over finite subsets $\Pi\subset\mathbb R_{\ge0}$ such that

• $|\Pi|$ has distribution ${\rm Po}(|\mu|)$, where $|\mu| = \mu(\mathbb R_{\ge0})$.

• Given that $|\Pi| = k$, the set Π is distributed like $\{X_1,\dots,X_k\}$, where $X_1,\dots,X_k$ are i.i.d. with distribution $\mu/|\mu|$.

The existence of the Poisson point process is proved in probability theory (and possibly discussed in the exercises).

Now, assume that µ is infinite but Lebesgue-absolutely continuous and $\mu(\varepsilon,\infty)<\infty$ for any $\varepsilon>0$. Then we can extend the construction of the Poisson point process to µ by letting $\Pi = \bigcup_{k\ge0}\Pi_k$, where $\Pi_k$ is a Poisson point process of the finite measure $\mu_k = \mathbf 1_{(2^{-k},2^{1-k}]}\,d\mu$. Of course, we refer to Π as the Poisson point process of intensity µ. Observe that Π is a countable set.

Specifically, let $\mu = x^{-m-1}\,dx$ for a number $m\in(0,1)$. Then for any $\varepsilon>0$ we have $\mu(\varepsilon,\infty)<\infty$. Moreover,
\[ \int_0^1x\,d\mu(x)<\infty. \qquad(37) \]
Therefore, the Poisson point process Π of intensity µ is such that $|\Pi\cap(\varepsilon,\infty)|<\infty$ for any $\varepsilon>0$. Hence, we can enumerate Π as a non-increasing sequence $(u_\alpha)_{\alpha\ge1}$ almost surely. Due to (37), the sum
\[ S = \sum_{\alpha\ge1}u_\alpha \]
is finite almost surely. Hence,
\[ v_\alpha = \frac{u_\alpha}S \]
is a non-increasing sequence such that $\sum_\alpha v_\alpha = 1$.

We endow the space $\mathcal S$ of non-increasing sequences $(y_\alpha)$ with $\sum_\alpha y_\alpha\le1$ with the weakest topology under which all projections $y\mapsto y_\alpha$ are continuous. Then the random sequence $(v_\alpha)$ induces a probability measure $\Lambda_m$ on $\mathcal S$, the Poisson-Dirichlet distribution.

Finally, in the random energy model let $(w_\alpha)_{\alpha\ge1}$ be the Boltzmann weights $(\mu(\sigma))_{\sigma\in\mathcal X}$ arranged in non-increasing order for $\alpha\le2^N$, and let $w_\alpha = 0$ for $\alpha>2^N$.

Theorem 13.5 If $\beta>2\sqrt{\ln2}$, then $w = (w_\alpha)$ converges in distribution to $\Lambda_m$, with $m = 2\sqrt{\ln2}/\beta$. This means that $\lim_{N\to\infty}\mathrm E[f(w)] = \mathrm E[f(v)]$ for any continuous $f$ on $\mathcal S$.

Roughly speaking, Theorem 13.5 states that a bounded number of configurations dominate the Boltzmann distribution. This phenomenon goes by the name of condensation in physics. The proof of Theorem 13.5 requires several steps. We begin with the following observation.
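The condensation phenomenon is easy to observe in simulation. The following sketch (Python; an illustration under the convention $\mathrm{Var}(E(\sigma)) = N/2$, not part of the notes) prints the few largest Boltzmann weights of a sampled random energy model at low and at high temperature.

```python
import numpy as np

rng = np.random.default_rng(1)

def top_boltzmann_weights(N, beta, k=5):
    """The k largest Boltzmann weights w_alpha of one REM sample."""
    E = rng.normal(0.0, np.sqrt(N / 2), size=2 ** N)
    logw = -beta * E - np.logaddexp.reduce(-beta * E)     # log of mu(sigma)
    return np.sort(np.exp(logw))[::-1][:k]

print(top_boltzmann_weights(20, beta=3.5))   # beta > 2*sqrt(ln 2): a few weights dominate
print(top_boltzmann_weights(20, beta=0.5))   # beta < 2*sqrt(ln 2): all weights are tiny
```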

Page 27: Probabilistic Combinatorics - math.uni-frankfurt.deacoghlan/probcomb.pdf · 3 THE ERDOS-KO-RADO THEOREM˝ 4 Thus, we need to compute E(I S) for a fixed set S. Of course, E(I S) is

13 THE RANDOM ENERGY MODEL 27

Lemma 13.6 Let
\[ \nu = \frac1{\sqrt\pi}\exp(-2t\sqrt{\ln2})\,dt. \]
Let $(c_\alpha)_{\alpha\ge1}$ be a non-increasing enumeration of the Poisson point process with intensity ν and set $m = 2\sqrt{\ln2}/\beta$. Then the sequence $v = (v_\alpha)$ defined by
\[ v_\alpha = \frac{\exp(\beta c_\alpha)}{\sum_{\gamma\ge1}\exp(\beta c_\gamma)} \]
has distribution $\Lambda_m$.

Proof. We may rescale the sequence $(c_\gamma)$ by a constant factor without distorting the distribution of $v$. Thus, let $u_\alpha = C\cdot\exp(\beta c_\alpha)$ for a suitable constant $C>0$ (that will be fixed below). Then $v_\alpha = u_\alpha/\sum_{\gamma\ge1}u_\gamma$.

Let $\psi:x\mapsto C\exp(\beta x)$. Then $(u_\alpha)$ is the sequence obtained by ordering the Poisson point process with intensity $\mu = \psi(\nu)$ non-increasingly; hence, µ is the image of ν, defined by letting
\[ \mu(A) = \nu(\psi^{-1}(A)) = \frac1{\sqrt\pi}\int_{\psi^{-1}(A)}\exp(-2t\sqrt{\ln2})\,dt = \frac{C^m}{\beta\sqrt\pi}\int_Ax^{-m-1}\,dx \qquad[\text{with }x = C\exp(\beta t),\ \text{using }m = 2\sqrt{\ln2}/\beta]. \]
Finally, we choose $C$ such that $C^m/(\beta\sqrt\pi) = 1$. Then $\mu = x^{-m-1}\,dx$, which means that we have just arrived at the definition of $\Lambda_m$. □

Let
\[ a_N^2 = \frac1N\ln\Big(\frac{2^N}{\sqrt N}\Big) \]
and $E'(\sigma) = E'_N(\sigma) = E(\sigma)+Na_N$. Let $(h_\alpha)$ be the numbers $-E'(\sigma)$ in non-increasing order. Furthermore, let
\[ \nu_b = \mathbf 1_{t\ge b}\,d\nu = \frac{\mathbf 1_{t\ge b}}{\sqrt\pi}\exp(-2t\sqrt{\ln2})\,dt. \]

Lemma 13.7 For any fixed integer $k\ge0$ we have
\[ \mathrm P[|\{\alpha:h_\alpha\ge b\}| = k]\sim\frac{|\nu_b|^k}{k!}\exp(-|\nu_b|). \]
Furthermore, let
\[ \nu_{b,N} = \exp\Big[-2ta_N-\frac{t^2}N\Big]\mathbf 1_{t\ge b}\,dt. \]
Given that $|\{\alpha:h_\alpha\ge b\}| = k$, the set $\{h_1,\dots,h_k\}$ has the same distribution as $\{X_1,\dots,X_k\}$, where $X_1,\dots,X_k$ are i.i.d. with distribution $\nu_{b,N}/|\nu_{b,N}|$.

Proof. Because $E(\sigma)$ is Gaussian with variance $N/2$,
\begin{align*}
\mathrm P[-E'(\sigma)\ge x] &= \mathrm P[-E(\sigma)\ge Na_N+x] = \frac1{\sqrt{\pi N}}\int_{Na_N+x}^\infty\exp(-t^2/N)\,dt\\
&= \frac1{\sqrt{\pi N}}\int_x^\infty\exp(-(t+Na_N)^2/N)\,dt = \frac1{\sqrt{\pi N}}\int_x^\infty\exp\Big[-\frac{t^2+2tNa_N+N^2a_N^2}N\Big]\,dt\\
&= \frac1{2^N\sqrt\pi}\int_x^\infty\exp\big[-t^2/N-2ta_N\big]\,dt \equiv d_N \qquad[\text{by the choice of }a_N]. \qquad(38)
\end{align*}
Hence, $|\{\alpha:h_\alpha\ge b\}|$ has distribution ${\rm Bin}(2^N,d_N)$ (with $x = b$). Because $2^Nd_N = \Theta(1)$, we can approximate the binomial distribution by a Poisson distribution with the same mean $2^Nd_N$. Since $\lim_{N\to\infty}2^Nd_N = |\nu_b|$, the first assertion follows. Furthermore, (38) shows the assertion about the conditional distribution of the points $h_1,\dots,h_k$. □

Define $e_\alpha = \exp(\beta h_\alpha)$ and let $Z = \sum_{\alpha=1}^{2^N}e_\alpha$. Furthermore, for a real $b$ let
\[ w_\alpha^b = \mathbf 1_{h_\alpha\ge b}\cdot\frac{e_\alpha}{Z_b},\qquad\text{where } Z_b = \sum_\alpha\mathbf 1_{h_\alpha\ge b}\cdot e_\alpha. \]

Lemma 13.8 We have $\sum_\alpha|w_\alpha-w_\alpha^b| = \frac{2(Z-Z_b)}Z$.

Proof. Let $A = \{\alpha:h_\alpha\ge b\}$. Note that $w_\alpha = e_\alpha/Z$, because the energy function $E'$ is simply obtained from $E$ by adding the number $Na_N$, which is independent of the configuration. For any $\alpha\in A$ we have
\[ w_\alpha-w_\alpha^b = e_\alpha\big[Z^{-1}-Z_b^{-1}\big] = e_\alpha\cdot\frac{Z_b-Z}{ZZ_b}. \qquad(39) \]
Summing the absolute value of (39) over $\alpha\in A$ and using $\sum_{\alpha\in A}e_\alpha = Z_b$, we find
\[ \sum_{\alpha\in A}|w_\alpha-w_\alpha^b| = \frac{Z-Z_b}Z. \qquad(40) \]
Furthermore, if $\alpha\notin A$, then $w_\alpha-w_\alpha^b = w_\alpha$. Hence,
\[ \sum_{\alpha\notin A}|w_\alpha-w_\alpha^b| = \sum_{\alpha\notin A}w_\alpha = Z^{-1}\sum_{\alpha\notin A}e_\alpha = \frac{Z-Z_b}Z. \qquad(41) \]
Combining (40) and (41) completes the proof. □

To perform the next step, we need the "partial integration" formula: let $X$ be a random variable, let $0\le a<b$, and let $F\in C^1(\mathbb R,\mathbb R_{\ge0})$ be non-decreasing with $\lim_{x\to-\infty}F(x) = 0$. Then
\[ \mathrm E\big[F(X)\cdot\mathbf 1_{X\in[a,b]}\big]\le F(a)\,\mathrm P[X\ge a]+\int_a^bF'(t)\,\mathrm P[X\ge t]\,dt. \qquad(42) \]

Lemma 13.9 For any $\varepsilon>0$ there is $b\in\mathbb R$ such that
\[ \limsup_{N\to\infty}\mathrm P\Big[\frac{Z-Z_b}Z\ge\varepsilon\Big]\le\varepsilon. \]

Proof. We first claim that for any $\varepsilon>0$ there is $\eta>0$ such that
\[ \mathrm P[Z\le\eta]\le\frac\varepsilon2. \qquad(43) \]
Indeed, if $Z\le\exp(\beta x)$, then there is no α such that $h_\alpha\ge x$. Hence, by Lemma 13.7 we have $\mathrm P[Z\le\exp(\beta x)]\le\exp(-|\nu_x|)+o(1)$. Choosing $x<0$ sufficiently small, we can ensure that the r.h.s. is less than $\varepsilon/2$.

For any fixed $\sigma\in\mathcal X$ we have
\[ \mathrm E[Z-Z_b] = 2^N\,\mathrm E\big[\exp(-\beta E'(\sigma))\mathbf 1_{E'(\sigma)\ge-b}\big]. \]
Applying (42) with $X = -E'(\sigma)$ and $F(x) = \exp(\beta x)$, we obtain
\begin{align*}
\mathrm E[Z-Z_b] &\le \beta\int_{-\infty}^b\exp(\beta x)\cdot2^N\,\mathrm P[E'(\sigma)\le-x]\,dx\\
&\le \frac\beta{\sqrt\pi}\int_{-\infty}^b\int_x^\infty\exp(\beta x)\exp(-2ta_N)\,dt\,dx \qquad[\text{by (38)}]\\
&= \frac\beta{2\sqrt\pi a_N}\int_{-\infty}^b\exp(\beta x)\cdot\exp(-2xa_N)\,dx = \frac\beta{2\sqrt\pi a_N}\int_{-\infty}^b\exp(x(\beta-2a_N))\,dx.
\end{align*}
Consequently, since $\beta>2\sqrt{\ln2}\ge2a_N$, we can choose $b<0$ small enough such that $\mathrm E[Z-Z_b]<\varepsilon^2\eta/2$. Hence, by Markov's inequality
\[ \mathrm P[Z-Z_b\ge\varepsilon\eta]\le\varepsilon/2. \qquad(44) \]
Combining (43) and (44) completes the proof. □

Proof of Theorem 13.5. Let $(c_\alpha)$ be a non-increasing ordering of the Poisson point process with intensity ν. Set
\[ S = \sum_{\alpha\ge1}\exp(\beta c_\alpha),\qquad S_b = \sum_{\alpha\ge1}\mathbf 1_{c_\alpha\ge b}\cdot\exp(\beta c_\alpha). \]
Moreover, let
\[ v_\alpha = \frac{\exp(\beta c_\alpha)}S,\qquad v_\alpha^b = \mathbf 1_{c_\alpha\ge b}\cdot\frac{\exp(\beta c_\alpha)}{S_b}. \]
Let $\delta>0$. By Lemmas 13.8 and 13.9, there is $b<0$ such that
\[ \mathrm P\Big[\sum_{\alpha\ge1}|w_\alpha-w_\alpha^b|>\delta\Big]<\delta. \qquad(45) \]
The same argument (i.e., the counterparts of Lemmas 13.8 and 13.9) applies to $v$ and $v^b$ and yields
\[ \mathrm P\Big[\sum_{\alpha\ge1}|v_\alpha-v_\alpha^b|>\delta\Big]<\delta, \qquad(46) \]
provided $b$ is small enough. Recall that the topology on $\mathcal S$ is the weakest topology under which the projections $x\mapsto x_\alpha$ are continuous. Under this topology $\mathcal S$ is compact. Moreover, this topology is no stronger than the topology induced by the metric
\[ d(x,y) = \sum_{\alpha\ge1}2^{-\alpha}|x_\alpha-y_\alpha|. \]
Hence, (45) and (46) show that for any continuous function $f$ on $\mathcal S$ and any $\varepsilon>0$, we can choose $\delta>0$ small enough so that
\[ \big|\mathrm E[f(w)]-\mathrm E[f(w^b)]\big|<\varepsilon+2\delta\|f\|_\infty<2\varepsilon, \qquad(47) \]
\[ \big|\mathrm E[f(v)]-\mathrm E[f(v^b)]\big|<2\varepsilon. \qquad(48) \]
In addition, we claim that
\[ \lim_{N\to\infty}\big|\mathrm E[f(w^b)]-\mathrm E[f(v^b)]\big| = 0. \qquad(49) \]
Indeed, the sequences $w^b = (w_\alpha^b)$ and $v^b = (v_\alpha^b)$ are constructed from the sequences $(h_\alpha)$ and $(c_\alpha)$ in the same way. Moreover, by Lemma 13.7 the number of indices α with $h_\alpha\ge b$ is asymptotically ${\rm Po}(|\nu_b|)$ distributed. In the case of the $c_\alpha$, the corresponding distribution is ${\rm Po}(|\nu_b|)$ exactly. Furthermore, given that this number equals $k$, the $k$ points $h_\alpha\ge b$ are i.i.d. from the distribution $\nu_{b,N}/|\nu_{b,N}|$, which is asymptotically proportional to $\nu_b$ (i.e., $\nu_{b,N}/|\nu_{b,N}|\to\nu_b/|\nu_b|$), the corresponding distribution in the case of the points $c_\alpha$. Hence, we obtain (49). Finally, the assertion follows from (47), (48), (49) and the triangle inequality. □

14 Belief Propagation

This section follows [14, Chapter 14].


14.1 Example: the 1-dimensional Ising model

The goal is to develop a technique for computing the free entropy of Boltzmann distributions whose factor graphs are trees. We begin with an example, namely the 1-dimensional Ising model; our goal is to compute the average magnetization $\frac1N\sum_{i=1}^N\langle\sigma_i\rangle$ for given values of β and B. To this end, we introduce the following "messages" for each $j\in[N]$:
\[ \nu_{\to j}(\sigma_j) = \frac1{Z_{\to j}}\sum_{\sigma_1,\dots,\sigma_{j-1}=\pm1}\exp\Big[\beta\sum_{i=1}^{j-1}\sigma_i\sigma_{i+1}+\beta B\sum_{i=1}^{j-1}\sigma_i\Big], \]
\[ \nu_{j\leftarrow}(\sigma_j) = \frac1{Z_{j\leftarrow}}\sum_{\sigma_{j+1},\dots,\sigma_N=\pm1}\exp\Big[\beta\sum_{i=j}^{N-1}\sigma_i\sigma_{i+1}+\beta B\sum_{i=j+1}^N\sigma_i\Big]. \]
Here $Z_{\to j}$ and $Z_{j\leftarrow}$ are chosen such that
\[ \nu_{\to j}(1)+\nu_{\to j}(-1) = \nu_{j\leftarrow}(1)+\nu_{j\leftarrow}(-1) = 1. \]

In the sequel, we are often going to suppress this normalization; this is expressed by the use of the ∝ ("proportional to") symbol. Denote the marginal distribution of $\sigma_j$ by
\[ \mu_j(\pm1) = \mathrm P[\sigma_j = \pm1]. \]
Then in terms of the above messages,
\[ \mu_j(\sigma_j)\propto\nu_{\to j}(\sigma_j)\cdot\exp(\beta B\sigma_j)\cdot\nu_{j\leftarrow}(\sigma_j). \qquad(50) \]
Now, the key insight is that the messages ν satisfy the following recurrence:
\[ \nu_{\to i+1}(\sigma_{i+1})\propto\sum_{\sigma_i=\pm1}\nu_{\to i}(\sigma_i)\exp[\beta\sigma_i\sigma_{i+1}+\beta B\sigma_i], \qquad(51) \]
with the initial point $\nu_{\to1}$ being the uniform distribution. There is a similar recurrence for the $\nu_{i\leftarrow}$. Solving these recurrences determines the marginals $\mu_j$ due to (50). These, in turn, yield the average magnetization:
\[ \frac1N\sum_{i=1}^N\langle\sigma_i\rangle = \frac1N\sum_{i=1}^N\mu_i(1)-\mu_i(-1). \]
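The recurrences (50)–(51) translate directly into code. The following sketch (Python; an illustration, not part of the notes) computes the forward and backward messages for a short chain and checks the resulting marginals against exhaustive enumeration.

```python
import numpy as np
from itertools import product

def bp_marginals(N, beta, B):
    """Marginals of the open 1-d Ising chain via the message recurrences (50)-(51)."""
    spins = np.array([1.0, -1.0])
    fwd = [np.array([0.5, 0.5])]                 # nu_{->1} is uniform
    bwd = [np.array([0.5, 0.5])]                 # nu_{N<-} is uniform
    for msgs in (fwd, bwd):
        for _ in range(N - 1):
            prev = msgs[-1]
            new = np.array([sum(prev[k] * np.exp(beta * spins[k] * s + beta * B * spins[k])
                                for k in range(2)) for s in spins])
            msgs.append(new / new.sum())
    bwd = bwd[::-1]                              # bwd[j] is now nu_{(j+1)<-}
    marg = []
    for j in range(N):
        m = fwd[j] * np.exp(beta * B * spins) * bwd[j]
        marg.append(m / m.sum())
    return marg

def brute_marginals(N, beta, B):
    """Exact marginals by enumerating all 2^N configurations."""
    Z, acc = 0.0, np.zeros((N, 2))
    for sigma in product([1, -1], repeat=N):
        s = np.array(sigma, dtype=float)
        w = np.exp(beta * np.sum(s[:-1] * s[1:]) + beta * B * np.sum(s))
        Z += w
        for j in range(N):
            acc[j, 0 if sigma[j] == 1 else 1] += w
    return acc / Z

N, beta, B = 6, 0.8, 0.2
print(np.round([m[0] for m in bp_marginals(N, beta, B)], 4))   # P[sigma_j = +1] from messages
print(np.round(brute_marginals(N, beta, B)[:, 0], 4))          # same by enumeration
```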

In fact, we can use the messages to approximate the average magnetization analytically.

Theorem 14.1 Let $\beta>0$ and let $B\in\mathbb R$. Define
\[ f(x) = \frac1\beta\,\mathrm{atanh}\big(\tanh(\beta)\tanh(\beta x)\big). \]
The function $u\mapsto f(u+B)$ has a unique fixed point $u^*$. Furthermore, for any $\varepsilon>0$ and any $i_N\in[\varepsilon N,(1-\varepsilon)N]$ we have
\[ \lim_{N\to\infty}\langle\sigma_{i_N}\rangle = \tanh(\beta(2u^*+B)). \]

Proof. We begin by summarizing the two messages $\nu_{\to i}(\pm1)$ by the single number
\[ u_{\to i} = \frac1{2\beta}\ln\frac{\nu_{\to i}(1)}{\nu_{\to i}(-1)}, \]
the so-called effective magnetic field. The messages can be recovered via the relation
\[ \nu_{\to i}(\sigma) = \frac{1+\sigma\cdot\tanh(\beta u_{\to i})}2. \]
Hence, in terms of the $u_{\to i}$, (51) becomes
\[ u_{\to i+1} = f(u_{\to i}+B). \]
Furthermore, for any $\beta,B$ and any $u$ we have $|\frac\partial{\partial u}f(u+B)|<1$. Therefore, the function $u\mapsto f(u+B)$ has a unique fixed point $u^*$, and $\lim_{i\to\infty}u_{\to i} = u^*$. By symmetry, we also have $\lim_{i\to\infty}u_{i\leftarrow} = u^*$. The assertion thus follows from (50). □
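Since $|f'|<1$, the fixed point $u^*$ can be found by straightforward iteration. The sketch below (Python; illustration only, arbitrary parameters) computes $u^*$ and the resulting bulk magnetization $\tanh(\beta(2u^*+B))$; it should approximately match the bulk marginals produced by the message recursion above.

```python
import numpy as np

def u_star(beta, B, iters=500):
    """Unique fixed point of u -> f(u + B), with f as in Theorem 14.1."""
    u = 0.0
    for _ in range(iters):
        u = np.arctanh(np.tanh(beta) * np.tanh(beta * (u + B))) / beta
    return float(u)

beta, B = 0.8, 0.2
u = u_star(beta, B)
print("bulk magnetization <sigma_i> ~", np.tanh(beta * (2 * u + B)))
```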


14.2 Tree factor graphs

Consider a factor graph $G$ with variable nodes $\mathcal V$ and function nodes $\mathcal F$. Throughout this section we assume that $G$ is a tree, i.e., it is connected and has no cycles. Let $N = |\mathcal V|$ and $M = |\mathcal F|$. Let $\mathcal X$ be a finite set (of possible "spins"). As in Section 10, we consider a Boltzmann distribution
\[ \mu(x) = \frac1Z\prod_{a\in\mathcal F}\psi_a(x_{\partial a}) \qquad(x\in\mathcal X^{\mathcal V}), \]
where
\[ \psi_a(x_{\partial a}) = \exp(-\beta E_a(x_{\partial a})). \]
The ultimate goal is to devise a formalism for computing the free entropy $\frac1N\ln Z$.

For starters, we generalize the approach from the previous section in order to compute the marginals
\[ \mu_i(x'_i) = \big\langle\mathbf 1_{x_i=x'_i}\big\rangle \qquad\text{for }i\in\mathcal V,\ x'_i\in\mathcal X. \]

To this end, we are going to construct "messages" from variable nodes to function nodes along with messages from function nodes to variable nodes. More precisely, for any $i\in\mathcal V$, any $a\in\partial i$ and any $x'\in\mathcal X$ there will be a message $\nu_{i\to a}(x')\in[0,1]$ "from $i$ to $a$" and a message $\hat\nu_{a\to i}(x')\in[0,1]$ "from $a$ to $i$". The messages are normalized such that
\[ \sum_{x'\in\mathcal X}\nu_{i\to a}(x') = \sum_{x'\in\mathcal X}\hat\nu_{a\to i}(x') = 1 \qquad(52) \]
for all $i,a$. Let $\mathcal M$ be the set of all pairs $(\nu,\hat\nu)$ of families $\nu = (\nu_{i\to a}(x'))_{i,a,x'}$, $\hat\nu = (\hat\nu_{a\to i}(x'))_{i,a,x'}$ of numbers in $[0,1]$ that satisfy the normalization condition (52). We refer to $\mathcal M$ as the message space.

Now, we would like to define the Belief Propagation operator ${\rm BP}:\mathcal M\to\mathcal M$, $(\nu,\hat\nu)\mapsto(\nu',\hat\nu')$ by letting
\[ \nu'_{j\to a}(x_j)\propto\prod_{b\in\partial j\setminus a}\hat\nu_{b\to j}(x_j), \qquad \hat\nu'_{a\to j}(x_j)\propto\sum_{x_{\partial a\setminus j}}\psi_a(x_{\partial a})\prod_{k\in\partial a\setminus j}\nu'_{k\to a}(x_k). \]
In the second expression, we sum over all maps $x_{\partial a\setminus j}\in\mathcal X^{\partial a\setminus j}$. Moreover, $x_{\partial a}$ is meant to be the map $\partial a\to\mathcal X$ that coincides with $x_{\partial a\setminus j}$ on $\partial a\setminus j$ and that takes the value $x_j$ at $j$. Note that if $\partial j\setminus a = \emptyset$, then $\nu'_{j\to a}$ is the uniform distribution over $\mathcal X$ (because the empty product is understood to be equal to one). Moreover, if $\partial a = \{j\}$, then the second sum is understood to be $\psi_a(x_j)$.

There is one more problem: the normalization that is implicit in the ∝ sign is possible only if
\[ \sum_{x_j\in\mathcal X}\prod_{b\in\partial j\setminus a}\hat\nu_{b\to j}(x_j)>0 \quad\text{and}\quad \sum_{x_j\in\mathcal X}\sum_{x_{\partial a\setminus j}}\psi_a(x_{\partial a})\prod_{k\in\partial a\setminus j}\nu'_{k\to a}(x_k)>0. \]
If one of these sums is equal to zero, we let $\nu' = \nu$, $\hat\nu' = \hat\nu$. This completes the construction of the Belief Propagation operator.

Let $(\nu,\hat\nu)\in\mathcal M$. For each variable $i$ we define a probability measure $\nu_i$ on $\mathcal X$ by letting
\[ \nu_i(x_i)\propto\prod_{a\in\partial i}\hat\nu_{a\to i}(x_i), \]
provided that $\sum_{x'\in\mathcal X}\prod_{a\in\partial i}\hat\nu_{a\to i}(x')>0$; otherwise, we let $\nu_i$ be the uniform distribution over $\mathcal X$. The idea is that $\nu_i$ is the "estimate" of the marginal of variable $i$ given $(\nu,\hat\nu)$.

Recall that the diameter of a graph is the maximum distance between any two vertices, where "distance" refers to the number of edges on a shortest path.

Theorem 14.2 In a tree factor graph of diameter t∗ the following is true.


1. There exists $(\nu^*,\hat\nu^*)\in\mathcal M$ such that ${\rm BP}(\nu^*,\hat\nu^*) = (\nu^*,\hat\nu^*)$. In other words, the Belief Propagation operator has a fixed point.

2. For any $(\nu^0,\hat\nu^0)\in\mathcal M$ and any $t>t^*$ we have ${\rm BP}^t(\nu^0,\hat\nu^0) = (\nu^*,\hat\nu^*)$. Here ${\rm BP}^t$ denotes the $t$-fold iteration of the Belief Propagation operator. In particular, the Belief Propagation operator has a unique fixed point.

3. For any $i\in\mathcal V$ and any $x\in\mathcal X$ we have $\nu^*_i(x) = \mu_i(x)$.

Proof. Let $i$ be a variable and let $a\in\partial i$. We denote by $G_{i\to a}$ the subgraph of the factor graph induced on the vertices $y\in\mathcal V\cup\mathcal F$ such that every path from $y$ to $a$ in $G$ passes through $i$. Thus, $G_{i\to a}$ is a sub-tree of $G$ that contains $i$ but not $a$. Let $\mathcal V_{i\to a}$ be the set of variable nodes and let $\mathcal F_{i\to a}$ be the set of function nodes in $G_{i\to a}$. Furthermore, let $\mu_{i\to a}(\,\cdot\,)$ denote the Boltzmann distribution induced on $G_{i\to a}$, i.e.,
\[ \mu_{i\to a}(x)\propto\prod_{b\in\mathcal F_{i\to a}}\psi_b(x_{\partial b}) \qquad(x\in\mathcal X^{\mathcal V_{i\to a}}). \qquad(53) \]
Let $\mu_{i,i\to a}(\,\cdot\,)$ denote the marginal distribution of variable $i$ under $\mu_{i\to a}$. In addition, we define the depth $t_{i\to a}$ as the maximum distance between $i$ and a leaf of $G_{i\to a}$.

Let $(\nu^0,\hat\nu^0)\in\mathcal M$ be any starting point and let $(\nu^{(t)},\hat\nu^{(t)}) = {\rm BP}^t(\nu^0,\hat\nu^0)$. We claim that for any $i\in\mathcal V$, $a\in\partial i$ we have
\[ \mu_{i,i\to a} = \nu^{(t)}_{i\to a} \qquad\text{if }t>t_{i\to a}. \qquad(54) \]
The proof is by induction on $t_{i\to a}$. If $t_{i\to a} = 0$, then $i$ is a leaf of $G$. In particular, $\partial i = \{a\}$. Hence, by the construction of the Belief Propagation operator $\nu^{(t)}_{i\to a}$ is the uniform distribution for any $t\ge1$. Furthermore, since $G_{i\to a}$ consists only of the vertex $i$, the product on the r.h.s. of (53) is empty, which means that $\mu_{i\to a}$ is the uniform distribution on $\mathcal X$, too. Thus, we have got (54) in the case $t_{i\to a} = 0$.

Now, assume that (54) is true for all $i',a'$ such that $t_{i\to a}>t_{i'\to a'}$ and let $t>t_{i\to a}$. Let $x_i\in\mathcal X$, let $b\in\partial i\setminus a$, and let $x_{\partial b\setminus i}\in\mathcal X^{\partial b\setminus i}$. We are going to show that
\[ \mu_{i\to a}(x_{\partial b})\propto\psi_b(x_{\partial b})\prod_{j\in\partial b\setminus i}\mu_{j,j\to b}(x_j). \qquad(55) \]
Indeed, given the value of variable $i$, the values of the variables in $\partial b\setminus i$ are independent of the values of the variables in $\bigcup_{b'\in\partial i\setminus\{a,b\}}\partial b'\setminus i$, because $i$ is the only connection between them (of course, this argument depends crucially on the fact that $G$ is a tree). Therefore, unravelling the definition (53), we obtain
\begin{align*}
\mu_{i\to a}(x_{\partial b}) &\propto \psi_b(x_{\partial b})\sum_{y:\,y_{\partial b}=x_{\partial b}}\prod_{j\in\partial b\setminus i}\prod_{c\in\mathcal F_{j\to b}}\psi_c(y_{\partial c}) &&[\text{summing over }y\in\mathcal X^{\mathcal V_{i\to a}}]\\
&\propto \psi_b(x_{\partial b})\prod_{j\in\partial b\setminus i}\sum_{y^{(j)}:\,y^{(j)}_j=x_j}\prod_{c\in\mathcal F_{j\to b}}\psi_c(y^{(j)}_{\partial c}) &&[\text{summing over }y^{(j)}\in\mathcal X^{\mathcal V_{j\to b}}]\\
&\propto \psi_b(x_{\partial b})\prod_{j\in\partial b\setminus i}\mu_{j,j\to b}(x_j).
\end{align*}
Using (55), we find
\begin{align*}
\mu_{i,i\to a}(x_i) &= \sum_{y:\,y_i=x_i}\mu_{i\to a}(y) &&[\text{summing over }y\in\mathcal X^{\bigcup_{b\in\partial i\setminus a}\partial b}]\\
&= \prod_{b\in\partial i\setminus a}\sum_{y_{\partial b}\in\mathcal X^{\partial b}:\,y_i=x_i}\mu_{i\to a}(y_{\partial b}) &&[\text{because different subtrees are independent}]\\
&\propto \prod_{b\in\partial i\setminus a}\sum_{y_{\partial b}\in\mathcal X^{\partial b}:\,y_i=x_i}\psi_b(y_{\partial b})\prod_{j\in\partial b\setminus i}\mu_{j,j\to b}(y_j)\\
&\propto \prod_{b\in\partial i\setminus a}\sum_{y_{\partial b}\in\mathcal X^{\partial b}:\,y_i=x_i}\psi_b(y_{\partial b})\prod_{j\in\partial b\setminus i}\nu^{(t-1)}_{j\to b}(y_j) &&[\text{by induction}]\\
&\propto \prod_{b\in\partial i\setminus a}\hat\nu^{(t-1)}_{b\to i}(x_i)\propto\nu^{(t)}_{i\to a}(x_i) &&[\text{by definition}].
\end{align*}


In fact, since both the first and the last term are probability distributions, we have $\mu_{i,i\to a}(x_i) = \nu^{(t)}_{i\to a}(x_i)$. This completes the proof of (54).

Finally, (54) implies that $\nu^*_{i\to a} = \mu_{i,i\to a}$ is the unique fixed point of the Belief Propagation operator. This establishes the first and the second assertion. To obtain the third assertion, we perform a similar computation as above once more, this time including all $b\in\partial i$: by the definition of the Boltzmann distribution,
\begin{align*}
\mu_i(x_i) &= \sum_{y:\,y_i=x_i}\mu(y) &&[\text{summing over }y\in\mathcal X^{\bigcup_{b\in\partial i}\partial b}]\\
&= \prod_{b\in\partial i}\sum_{y_{\partial b}\in\mathcal X^{\partial b}:\,y_i=x_i}\mu(y_{\partial b}) \propto \prod_{b\in\partial i}\sum_{y_{\partial b}\in\mathcal X^{\partial b}:\,y_i=x_i}\psi_b(y_{\partial b})\prod_{j\in\partial b\setminus i}\nu^*_{j\to b}(y_j)\\
&\propto \prod_{b\in\partial i}\hat\nu^*_{b\to i}(x_i)\propto\nu^*_i(x_i) &&[\text{by definition}],
\end{align*}
as desired. □
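To see Theorem 14.2 in action, here is a compact sketch (Python; an illustration on a hypothetical toy model, not part of the notes) that runs the BP operator on a tiny tree factor graph with pairwise factors and compares the marginal estimates $\nu_i$ with exact enumeration.

```python
import numpy as np
from itertools import product

# Tiny tree factor graph: variables 0..3 with spins {+1,-1}; one pairwise
# factor per edge of the tree 0-1, 1-2, 1-3 (illustrative example data).
X = [1, -1]
beta, h = 0.6, 0.4
edges = [(0, 1), (1, 2), (1, 3)]
psi = lambda xu, xv: np.exp(beta * xu * xv + h * xv)   # psi_a(x_u, x_v) for a = (u, v)

def bp_marginals(iters=20):
    nu = {(i, a): np.ones(2) / 2 for a in edges for i in a}    # variable -> factor
    hat = {(a, i): np.ones(2) / 2 for a in edges for i in a}   # factor -> variable
    for _ in range(iters):
        for a in edges:                                        # nu'_{j->a}
            for i in a:
                m = np.ones(2)
                for b in edges:
                    if i in b and b != a:
                        m = m * hat[(b, i)]
                nu[(i, a)] = m / m.sum()
        for a in edges:                                        # hat'_{a->j}
            u, v = a
            for i in a:
                j = v if i == u else u
                m = np.array([sum((psi(xi, xj) if i == u else psi(xj, xi))
                                  * nu[(j, a)][X.index(xj)] for xj in X) for xi in X])
                hat[(a, i)] = m / m.sum()
    marg = {}
    for i in range(4):                                         # nu_i estimates
        m = np.ones(2)
        for a in edges:
            if i in a:
                m = m * hat[(a, i)]
        marg[i] = m / m.sum()
    return marg

def exact_marginals():
    Z, acc = 0.0, np.zeros((4, 2))
    for x in product(X, repeat=4):
        w = np.prod([psi(x[u], x[v]) for (u, v) in edges])
        Z += w
        for i in range(4):
            acc[i, X.index(x[i])] += w
    return acc / Z

print({i: np.round(m, 4) for i, m in bp_marginals().items()})
print(np.round(exact_marginals(), 4))
```

After more iterations than the diameter of the tree, the two outputs agree, as Theorem 14.2 predicts.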

14.3 The Bethe free entropy

We keep the assumptions and the notation from the previous section. Theorem 14.2 enables us to express not only the marginal distribution of each variable in terms of the Belief Propagation fixed point, but also the joint distribution of several variables.

Corollary 14.3 Assume that $G$ is a tree. Let $F\subset\mathcal F$ be a set of function nodes and let $V = \partial F$ be the set of variable nodes that are adjacent to a node in $F$. Assume that the subgraph $G[V\cup F]$ of the factor graph induced on $V\cup F$ is connected. Moreover, let $D$ be the set of function nodes that are adjacent to a node in $V$ and also to a node in $\mathcal V\setminus V$. Then each node $a\in D$ has a unique neighbor $i(a)$ in $V$, and the joint distribution of the variables in $V$ satisfies
\[ \mu(x_V)\propto\prod_{a\in F}\psi_a(x_{\partial a})\prod_{a\in D}\hat\nu^*_{a\to i(a)}(x_{i(a)}). \]

Proof. The fact that each node $a\in D$ has a unique neighbor $i(a)\in V$ simply follows because $G$ is a tree and $G[V\cup F]$ is connected. Furthermore, given the values $x_V$ of the variables in $V$, the values that the variables in different components of $G-V$ take are mutually independent. Indeed, each of these components is connected to $V$ only through a unique $a\in D$. Therefore, Theorem 14.2 implies that in the factor graph obtained by removing the nodes $F$, the probability that $i(a)$ takes the value $x_{i(a)}$ equals $\hat\nu^*_{a\to i(a)}(x_{i(a)})$. Since the "missing factors" are just $\psi_b(x_{\partial b})$ for $b\in F$, the assertion follows. □

An important special case of Corollary 14.3 is that $F = \{a\}$ is a singleton. Let us denote the joint distribution of the variables $\partial a$ by $\mu_a$. Then Corollary 14.3 implies

Corollary 14.4 Assume that $G$ is a tree. For any function node $a$,
\[ \mu_a(x_{\partial a})\propto\psi_a(x_{\partial a})\prod_{i\in\partial a}\nu^*_{i\to a}(x_i). \]

In fact, we can express the entire distribution µ merely in terms of the marginal distributions $\mu_i,\mu_a$.

Corollary 14.5 Assume that $G$ is a tree. We have
\[ \mu(x) = \prod_{a\in\mathcal F}\mu_a(x_{\partial a})\prod_{i\in\mathcal V}\mu_i(x_i)^{1-|\partial i|}. \]

Proof. We proceed by induction on $M$. The cases $M = 0,1$ are immediate. Now, suppose that $M>1$. Because $G$ is a tree, there is a function node $a$ such that exactly one variable $i\in\partial a$ has degree greater than one.

We set up a new factor graph $G'$ with one less function node as follows. The set of variables is $\mathcal V' = \mathcal V\setminus(\partial a\setminus i)$. Moreover, the set of function nodes is $\mathcal F' = \mathcal F\setminus\{a\}$. Furthermore, we modify the factors $\psi'_b(\,\cdot\,)$ for $b\in\mathcal F'$ as follows. Pick one function node $a'\in\partial i\setminus a$. Now, let
\[ \psi'_{a'}(x_{\partial a'}) = \psi_{a'}(x_{\partial a'})\cdot\sum_{x_{\partial a\setminus i}}\psi_a(x_{\partial a}), \]


while $\psi'_b = \psi_b$ for all $b\ne a'$. Because there is one less function node, by induction the Boltzmann distribution $\mu'$ of $G'$ satisfies
\[ \mu'(x') = \prod_{b\in\mathcal F'}\mu'_b(x'_{\partial b})\prod_{j\in\mathcal V'}\mu'_j(x'_j)^{1-|\partial_{G'}j|}. \qquad(56) \]
Observe that by construction, variable $i$ has the same marginal under both µ and $\mu'$, i.e.,
\[ \mu_i(x_i) = \mu'_i(x_i). \qquad(57) \]
Therefore, for any $x':\mathcal V'\to\mathcal X$ we have
\[ \mu(x'\,|\,x_i) = \frac{\sum_{x_{\partial a\setminus i}}\psi_a(x_{\partial a})}{\mu_i(x_i)}\cdot\prod_{b\ne a}\psi_b(x_{\partial b}) \qquad(58) \]
\[ \overset{(57)}{=} \frac{\prod_{b\ne a}\psi'_b(x_{\partial b})}{\mu'_i(x_i)} = \frac{\mu'(x')}{\mu'_i(x_i)} = \mu'(x'\,|\,x_i). \qquad(59) \]
Combining (57) and (58) and using Bayes' rule, we see that $\mu(x') = \mu'(x')$. In particular, for all variables $j\in\mathcal V'$ we have $\mu'_j(x'_j) = \mu_j(x'_j)$. Moreover, for all factors $b\ne a,a'$ we see that $\mu_b(x'_{\partial b}) = \mu'_b(x'_{\partial b})$. In addition, we claim that $\mu'_{a'}(x'_{\partial a'}) = \mu_{a'}(x'_{\partial a'})$. To see this, think of the factor $\psi'_{a'}$ as being composed of two function nodes: one that corresponds to the function node $a'$ in $G$, and one function node $a^*$ that is connected to $i$ only, namely the one that implements the term $\sum_{x_{\partial a\setminus i}}\psi_a(x_{\partial a})$. These two function nodes are connected only through $i$. Then Bayes' rule yields
\[ \mu'_{a'}(x'_{\partial a'}) = \mu'_{a'}(x_i)\,\mu'_{a'}(x_{\partial a'}\,|\,x_i)\,\mu'_{a^*}(x_{\partial a^*}\,|\,x_i). \]
Since $\partial a^* = \{i\}$, the last factor simply is equal to one. Moreover, because $\mu(x') = \mu'(x')$ we have
\[ \mu'_{a'}(x_i)\,\mu'_{a'}(x_{\partial a'}\,|\,x_i) = \mu_{a'}(x'_{\partial a'}). \]
Hence, (56) yields
\begin{align*}
\mu(x') &= \prod_{b\in\mathcal F'}\mu'_b(x'_{\partial b})\prod_{j\in\mathcal V'}\mu'_j(x'_j)^{1-|\partial_{G'}j|} = \mu'_{a'}(x'_{\partial a'})\cdot\prod_{b\ne a,a'}\mu_b(x'_{\partial b})\prod_{j\in\mathcal V'}\mu_j(x'_j)^{1-|\partial_{G'}j|}\\
&= \mu_i(x_i)^{2-|\partial i|}\cdot\prod_{b\ne a}\mu_b(x'_{\partial b})\prod_{j\in\mathcal V'\setminus i}\mu_j(x'_j)^{1-|\partial j|}. \qquad(60)
\end{align*}
Finally, because $i$ is the only link between $a$ and $\mathcal F\setminus a$, we obtain
\[ \mu(x) = \mu(x_{\mathcal V'})\,\mu(x\,|\,x_{\mathcal V'}) = \mu(x_{\mathcal V'})\,\mu(x_{\partial a\setminus i}\,|\,x_i) = \mu(x_{\mathcal V'})\,\mu_a(x_{\partial a})/\mu_i(x_i) \overset{(60)}{=} \prod_{a\in\mathcal F}\mu_a(x_{\partial a})\prod_{i\in\mathcal V}\mu_i(x_i)^{1-|\partial i|}, \]
as claimed. □

Corollary 14.5 enables us to express the free entropy of the Boltzmann distribution in terms of the fixed point of Belief Propagation. The expression $F^*$ in Corollary 14.6 is called the Bethe free entropy.

Corollary 14.6 Assume that $G$ is a tree. Let $(\nu^*,\hat\nu^*)$ be the Belief Propagation fixed point and set
\[ F^* = \sum_{a\in\mathcal F}F_a+\sum_{i\in\mathcal V}F_i-\sum_{i\in\mathcal V,\,a\in\partial i}F_{ai}, \qquad\text{where} \]
\[ F_a = \ln\sum_{x_{\partial a}}\psi_a(x_{\partial a})\prod_{i\in\partial a}\nu^*_{i\to a}(x_i), \qquad F_i = \ln\sum_{x_i}\prod_{b\in\partial i}\hat\nu^*_{b\to i}(x_i), \qquad F_{ai} = \ln\sum_{x_i}\nu^*_{i\to a}(x_i)\,\hat\nu^*_{a\to i}(x_i). \]
Then the free entropy of µ is $\Phi = \ln Z = F^*$.

Page 35: Probabilistic Combinatorics - math.uni-frankfurt.deacoghlan/probcomb.pdf · 3 THE ERDOS-KO-RADO THEOREM˝ 4 Thus, we need to compute E(I S) for a fixed set S. Of course, E(I S) is

14 BELIEF PROPAGATION 35

Proof. By Corollary 14.5 the entropy of µ is nothing but
\[ H(\mu) = -\sum_{a\in\mathcal F}\sum_{x_{\partial a}}\mu_a(x_{\partial a})\ln\mu_a(x_{\partial a})-\sum_{i\in\mathcal V}(1-|\partial i|)\sum_{x_i}\mu_i(x_i)\ln\mu_i(x_i). \]
Furthermore, the internal energy comes to
\[ U = \langle E(x)\rangle = \sum_x\mu(x)E(x) = -\sum_x\frac{\mu(x)}\beta\ln\prod_{a\in\mathcal F}\psi_a(x_{\partial a}) = -\frac1\beta\sum_{a\in\mathcal F}\sum_{x_{\partial a}}\mu_a(x_{\partial a})\ln\psi_a(x_{\partial a}). \]
Since $\Phi = H(\mu)-\beta U$ by Proposition 10.1, we find that
\[ \Phi = -\sum_{a\in\mathcal F}\sum_{x_{\partial a}}\mu_a(x_{\partial a})\ln\frac{\mu_a(x_{\partial a})}{\psi_a(x_{\partial a})}-\sum_{i\in\mathcal V}(1-|\partial i|)\sum_{x_i}\mu_i(x_i)\ln\mu_i(x_i). \qquad(61) \]

By Theorem 14.2 and Corollary 14.4, we can express the marginals in terms of the Belief Propagation fixed point:
\[ \mu_i(x_i)\propto\prod_{a\in\partial i}\hat\nu^*_{a\to i}(x_i), \qquad \mu_a(x_{\partial a})\propto\psi_a(x_{\partial a})\prod_{i\in\partial a}\nu^*_{i\to a}(x_i). \]
Plugging these expressions into the summands of (61), we get
\begin{align*}
-\sum_{a\in\mathcal F}\sum_{x_{\partial a}}\mu_a(x_{\partial a})\ln\frac{\mu_a(x_{\partial a})}{\psi_a(x_{\partial a})} &= -\sum_{a\in\mathcal F}\sum_{x_{\partial a}}\mu_a(x_{\partial a})\ln\frac{\prod_{i\in\partial a}\nu^*_{i\to a}(x_i)}{\sum_{y_{\partial a}}\psi_a(y_{\partial a})\prod_{i\in\partial a}\nu^*_{i\to a}(y_i)}\\
&= \sum_{a\in\mathcal F}F_a-\sum_{a\in\mathcal F}\sum_{x_{\partial a}}\mu_a(x_{\partial a})\ln\prod_{i\in\partial a}\nu^*_{i\to a}(x_i)\\
&= \sum_{a\in\mathcal F}F_a-\sum_{a\in\mathcal F}\sum_{x_{\partial a}}\mu_a(x_{\partial a})\sum_{i\in\partial a}\ln\nu^*_{i\to a}(x_i)\\
&= \sum_{a\in\mathcal F}F_a-\sum_{i\in\mathcal V}\sum_{a\in\partial i}\sum_{x_i}\mu_i(x_i)\ln\nu^*_{i\to a}(x_i)\\
&= \sum_{a\in\mathcal F}F_a-\sum_{i\in\mathcal V}\sum_{a\in\partial i}\sum_{x_i}\mu_i(x_i)\ln\frac{\prod_{b\in\partial i\setminus a}\hat\nu^*_{b\to i}(x_i)}{\sum_{y_i}\prod_{b\in\partial i\setminus a}\hat\nu^*_{b\to i}(y_i)}\\
&= \sum_{a\in\mathcal F}F_a+\sum_{i\in\mathcal V}\sum_{a\in\partial i}\ln\sum_{y_i}\prod_{b\in\partial i\setminus a}\hat\nu^*_{b\to i}(y_i)-\sum_{i\in\mathcal V}\sum_{x_i}\sum_{a\in\partial i}\sum_{b\in\partial i\setminus a}\mu_i(x_i)\ln\hat\nu^*_{b\to i}(x_i)\\
&= \sum_{a\in\mathcal F}F_a+\sum_{i\in\mathcal V}\sum_{a\in\partial i}\ln\sum_{y_i}\prod_{b\in\partial i\setminus a}\hat\nu^*_{b\to i}(y_i)-\sum_{i\in\mathcal V}\sum_{x_i}\sum_{b\in\partial i}(|\partial i|-1)\,\mu_i(x_i)\ln\hat\nu^*_{b\to i}(x_i).
\end{align*}
Furthermore,
\begin{align*}
-\sum_{i\in\mathcal V}(1-|\partial i|)\sum_{x_i}\mu_i(x_i)\ln\mu_i(x_i) &= -\sum_{i\in\mathcal V}(1-|\partial i|)\sum_{x_i}\mu_i(x_i)\ln\frac{\prod_{a\in\partial i}\hat\nu^*_{a\to i}(x_i)}{\sum_{y_i}\prod_{a\in\partial i}\hat\nu^*_{a\to i}(y_i)}\\
&= \sum_{i\in\mathcal V}(1-|\partial i|)F_i-\sum_{i\in\mathcal V}\sum_{x_i}\sum_{a\in\partial i}(1-|\partial i|)\,\mu_i(x_i)\ln\hat\nu^*_{a\to i}(x_i).
\end{align*}


Observing that the triple sums cancel, we get
\begin{align*}
\Phi &= \sum_{a\in\mathcal F}F_a+\sum_{i\in\mathcal V}(1-|\partial i|)F_i+\sum_{i\in\mathcal V}\sum_{a\in\partial i}\ln\sum_{y_i}\prod_{b\in\partial i\setminus a}\hat\nu^*_{b\to i}(y_i)\\
&= \sum_{a\in\mathcal F}F_a+\sum_{i\in\mathcal V}F_i+\sum_{i\in\mathcal V}\sum_{a\in\partial i}\ln\frac{\sum_{y_i}\prod_{b\in\partial i\setminus a}\hat\nu^*_{b\to i}(y_i)}{\sum_{x_i}\prod_{b\in\partial i}\hat\nu^*_{b\to i}(x_i)}. \qquad(62)
\end{align*}
Finally, for any $i\in\mathcal V$ and any $a\in\partial i$,
\[ \ln\frac{\sum_{y_i}\prod_{b\in\partial i\setminus a}\hat\nu^*_{b\to i}(y_i)}{\sum_{x_i}\prod_{b\in\partial i}\hat\nu^*_{b\to i}(x_i)} = -\ln\Big[\sum_{x_i}\hat\nu^*_{a\to i}(x_i)\cdot\frac{\prod_{b\in\partial i\setminus a}\hat\nu^*_{b\to i}(x_i)}{\sum_{y_i}\prod_{b\in\partial i\setminus a}\hat\nu^*_{b\to i}(y_i)}\Big] = -\ln\Big[\sum_{x_i}\hat\nu^*_{a\to i}(x_i)\cdot\nu^*_{i\to a}(x_i)\Big] = -F_{ai}. \]
Thus, the assertion follows from (62). □
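The identity $\ln Z = F^*$ can be checked directly on a small tree. The sketch below (Python; a self-contained illustration with hypothetical factors, not part of the notes) runs BP to its fixed point on a three-variable chain, evaluates $F_a$, $F_i$, $F_{ai}$, and compares $F^*$ with $\ln Z$ obtained by enumeration.

```python
import numpy as np
from itertools import product

# Chain factor graph: variables 0-1-2, factors a=(0,1) and b=(1,2), spins {+1,-1}.
X = [1, -1]
beta, h = 0.7, 0.3
edges = [(0, 1), (1, 2)]
psi = lambda xu, xv: np.exp(beta * xu * xv + h * xv)   # generic pairwise factor psi_a(x_u, x_v)

def run_bp(iters=10):
    nu = {(i, a): np.ones(2) / 2 for a in edges for i in a}     # variable -> factor
    hat = {(a, i): np.ones(2) / 2 for a in edges for i in a}    # factor -> variable
    for _ in range(iters):
        for a in edges:
            for i in a:
                m = np.ones(2)
                for b in edges:
                    if i in b and b != a:
                        m = m * hat[(b, i)]
                nu[(i, a)] = m / m.sum()
        for a in edges:
            u, v = a
            for i in a:
                j = v if i == u else u
                m = np.array([sum((psi(xi, xj) if i == u else psi(xj, xi))
                                  * nu[(j, a)][X.index(xj)] for xj in X) for xi in X])
                hat[(a, i)] = m / m.sum()
    return nu, hat

def bethe_free_entropy(nu, hat):
    Fa = sum(np.log(sum(psi(xu, xv) * nu[(a[0], a)][X.index(xu)] * nu[(a[1], a)][X.index(xv)]
                        for xu in X for xv in X)) for a in edges)
    Fi = 0.0
    for i in range(3):
        m = np.ones(2)
        for a in edges:
            if i in a:
                m = m * hat[(a, i)]
        Fi += np.log(m.sum())
    Fai = sum(np.log(np.dot(nu[(i, a)], hat[(a, i)])) for a in edges for i in a)
    return Fa + Fi - Fai

nu, hat = run_bp()
lnZ = np.log(sum(np.prod([psi(x[u], x[v]) for (u, v) in edges]) for x in product(X, repeat=3)))
print(bethe_free_entropy(nu, hat), lnZ)    # the two values should agree (Corollary 14.6)
```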

14.4 Applications and extensions

In this section we sketch some applications of Belief Propagation. This is not meant to be a complete, formal treatment.

14.4.1 Belief Propagation Guided Decimation

Belief Propagation allows us to sample from the Boltzmann distribution on trees. More precisely, the Belief Propagation Guided Decimation algorithm proceeds as follows. Suppose that the variable nodes are $v_1,\dots,v_N$.

1. For $i = 1,\dots,N$ do the following.

2. Use BP to compute the marginal $\mu_{v_i}$ of variable $v_i$ in the current factor graph.

3. Choose a random value $\sigma_i$ from the distribution $\mu_{v_i}$.

4. Add a function node that "freezes" $v_i$ to $\sigma_i$.

5. Return σ.

It is easily verified that the result σ is indeed a perfect sample from the Boltzmann distribution. By comparison to other techniques such as Markov Chain Monte Carlo, the big advantage of BP is that it works on any tree model. However, there may be numerical issues.

14.4.2 Low-density parity check codes

In coding theory, the fundamental problem is to send a message over a noisy channel in such a way that errors can be detected and/or corrected. There are many different constructions of "error-correcting codes" based on various methods (e.g., finite field theory). Among the most efficient constructions are so-called low-density parity check codes ("LDPC codes").

Roughly speaking, an LDPC code is based on a factor graph (with cycles) whose function nodes represent linear equations over $\mathbb Z_2$, while the variable nodes represent the bits of the codeword. In addition, there are function nodes that represent the received word $y\in\mathbb Z_2^N$ (which may contain transmission errors). The goal is to reconstruct the "most likely" sent word $x$ given the received word $y$.

The Boltzmann distribution takes the form
\[ \mu_y(x) = \frac1{Z(y)}\prod_{i=1}^NQ(y_i\,|\,x_i)\prod_{a=1}^M\mathbf 1\{x_{i_1(a)}\oplus\cdots\oplus x_{i_{k(a)}(a)} = 0\}, \]
where $i_1(a),\dots,i_{k(a)}(a)$ are the bits that occur in the $a$th parity check.

Page 37: Probabilistic Combinatorics - math.uni-frankfurt.deacoghlan/probcomb.pdf · 3 THE ERDOS-KO-RADO THEOREM˝ 4 Thus, we need to compute E(I S) for a fixed set S. Of course, E(I S) is

15 DILUTED MEAN-FIELD MODELS 37

Here $Q(y_i\,|\,x_i)$ is the probability that $y_i$ is received given that $x_i$ was sent. Thus, $Q$ represents the noisy channel. We assume that the channel is memoryless, i.e., the probability of receiving word $y$ given that $x$ was sent is
\[ Q(y\,|\,x) = \prod_{i=1}^NQ(y_i\,|\,x_i). \]
While the factor graph corresponding to $\mu_y$ typically has cycles, Belief Propagation performs quite well under certain conditions. These involve the structure of the linear equations as well as the distribution $Q$. In an LDPC code, the linear equations are typically generated randomly according to a particular degree distribution, which has a significant impact on the quality of the code (e.g., the number of errors that can be corrected) and on the performance of BP. The mathematical understanding of both the "idealized" error-correcting capacity of LDPC codes and of the performance of BP is incomplete.

14.4.3 The infinite-tree limit

From a statistical mechanics perspective we are typically interested in the thermodynamic limit, i.e., the limit as the size $N$ of the system tends to infinity. In the case of tree factor graphs, this means going to infinite-depth trees.

Of course, in order to even consider this limit one has to define an appropriate "shape" that allows for infinite-depth limits. One possibility is to consider regular trees, i.e., each variable node and each function node has an equal number of children. But there are various alternatives. For instance, one could consider models where all function nodes have the same number of children, while the number of children of a variable node is a Poisson random variable.

How do we compute the fixed point messages on an infinite tree? The basic idea, called density evolution in physics, is to pass to distributional fixed point equations. That is, we postulate that the distributions of the messages satisfy the Belief Propagation equations. Currently, little is known in terms of rigorous mathematical results about the existence of these fixed point distributions.

15 Diluted mean-field models

The next step up (in terms of "realism") from tree graphical models is the class of diluted mean-field models. We already know two examples of this type of model: the Erdos-Renyi random graph and the LDPC codes. In this section we consider the diluted mean-field Potts antiferromagnet at zero temperature, aka the random graph coloring problem. Although Belief Propagation and related methods from physics have been used to put forward conjectures on this problem, we are going to study the problem by means of "classical" methods of probabilistic combinatorics. The present section is based on [1, 5, 10, 13].

15.1 The concentration of the chromatic number

Consider the random graph $G(n,p)$ with $np = O(1)$ as $n\to\infty$. Let $\chi(G)$ denote the chromatic number of the graph $G$. Thus, $\chi(G(n,p))$ is a random variable. Surprisingly, it turns out that $\chi(G(n,p))$ is tightly concentrated.

Theorem 15.1 Suppose that $p = p(n)$ is such that $np = \Theta(1)$ as $n\to\infty$. There exists $k = k(n)$ such that
\[ \lim_{n\to\infty}\mathrm P[\chi(G(n,p))\in\{k,k+1\}] = 1. \]
In words, the chromatic number of the random graph $G(n,p)$ is concentrated on two consecutive integers.

The proof of Theorem 15.1 rests on an important concentration inequality, the Azuma-Hoeffding inequality.

Lemma 15.2 Suppose that $(X_k)_{k=0,1,\dots,n}$ is a martingale such that $X_0 = \mathrm E[X_n]$. Moreover, assume that there exist numbers $c_k>0$ such that
\[ |X_k-X_{k-1}|\le c_k \quad\text{almost surely} \]
for all $1\le k\le n$. Then for any $t>0$ we have
\[ \mathrm P[X_n\ge\mathrm E[X_n]+t]\le\exp\Big[-\frac{t^2}{2\sum_{k=1}^nc_k^2}\Big], \qquad \mathrm P[X_n\le\mathrm E[X_n]-t]\le\exp\Big[-\frac{t^2}{2\sum_{k=1}^nc_k^2}\Big]. \]

Proof. Let $(\mathcal F_t)$ be the filtration underlying the martingale $(X_k)$. We may assume that $\mathcal F_0$ is the trivial σ-algebra (with two elements). Let $\Delta_k = X_k-X_{k-1}$ and $S_k = \sum_{i=1}^k\Delta_i = X_k-X_0$. Then, by a similar token as in the proof of the Chernoff bound, for any $u>0$ we have
\[ \mathrm P[X_n-\mathrm E[X_n]\ge t] = \mathrm P[S_n\ge t]\le\exp[-ut]\cdot\mathrm E[\exp(uS_n)]. \qquad(63) \]
Thus, we need to estimate $\mathrm E[\exp(uS_n)]$. By the martingale property,
\[ \mathrm E[\exp(uS_n)] = \mathrm E[\mathrm E[\exp(uS_n)\,|\,\mathcal F_{n-1}]] = \mathrm E[\exp(uS_{n-1})\,\mathrm E[\exp(u\Delta_n)\,|\,\mathcal F_{n-1}]]. \qquad(64) \]
We are going to see below that for all $1\le k\le n$,
\[ \mathrm E[\exp(u\Delta_k)\,|\,\mathcal F_{k-1}]\le\exp\big[u^2c_k^2/2\big]\quad\text{almost surely}. \qquad(65) \]
Hence, (64) yields
\[ \mathrm E[\exp(uS_n)]\le\mathrm E[\exp(uS_{n-1})]\exp\big[u^2c_n^2/2\big]. \]
Proceeding inductively, we find
\[ \mathrm E[\exp(uS_n)]\le\exp\Big[u^2\sum_{k=1}^nc_k^2/2\Big]. \]
Thus, (63) yields
\[ \mathrm P[X_n-\mathrm E[X_n]\ge t]\le\exp\Big[-ut+u^2\sum_{k=1}^nc_k^2/2\Big]. \]
Setting $u = t/\sum_{k=1}^nc_k^2$ gives the first inequality. The second one follows by considering $-X_k$ instead.

We are left to prove (65). We claim that, more generally, for any random variable $Z$ with $\|Z\|_\infty\le a$ and $\mathrm E[Z] = 0$ we have
\[ \mathrm E[\exp(uZ)]\le\exp(u^2a^2/2). \qquad(66) \]
Indeed, because the function $z\mapsto\exp(uz)$ is convex, we have
\[ \exp(uZ)\le\frac{a+Z}{2a}\exp(ua)+\frac{a-Z}{2a}\exp(-ua) \qquad\Big[\text{using }Z = \frac{a+Z}{2a}\cdot a+\frac{a-Z}{2a}\cdot(-a)\Big]. \]
Because $\mathrm E[Z] = 0$, we thus find
\[ \mathrm E[\exp(uZ)]\le\frac12\exp(ua)+\frac12\exp(-ua) = \cosh(ua)\le\exp(u^2a^2/2), \]
where the last inequality follows from the Taylor expansion of cosh. Applying this with $Z = \Delta_k$ (conditionally on $\mathcal F_{k-1}$) and $a = c_k$ yields (66) and thus (65). □

Corollary 15.3 Let $Z_1,\dots,Z_n$ be independent random variables and let $f(Z_1,\dots,Z_n)$ be a function such that the following is true:
\[ \text{if }z = (z_1,\dots,z_n)\text{ and }z' = (z'_1,\dots,z'_n)\text{ satisfy }z_i = z'_i\text{ for all }i\in\{1,\dots,n\}\setminus\{k\},\text{ then }|f(z)-f(z')|\le c_k, \qquad(67) \]
with $c_k\ge0$ a real number. Then $X = f(Z_1,\dots,Z_n)$ satisfies
\[ \mathrm P[X\ge\mathrm E[X]+t]\le\exp\Big[-\frac{t^2}{2\sum_{k=1}^nc_k^2}\Big], \qquad \mathrm P[X\le\mathrm E[X]-t]\le\exp\Big[-\frac{t^2}{2\sum_{k=1}^nc_k^2}\Big] \]
for any $t>0$.

Proof. Let $\mathcal F_k$ be the coarsest σ-algebra in which $Z_1,\dots,Z_k$ are measurable. Then $\mathcal F_0\subset\cdots\subset\mathcal F_n$ is a filtration. Let $X_k = \mathrm E[X\,|\,\mathcal F_k]$. Then $(X_0,\dots,X_n)$ is a Doob martingale that satisfies the assumptions of Lemma 15.2. □

We are going to apply Corollary 15.3 to a particular family of random variables, the so-called vertex exposure martingale. For $1\le i,j\le n$ let $a_{ij} = 1$ if the edge $\{i,j\}$ is present in $G(n,p)$, and let $a_{ij} = 0$ otherwise. Furthermore, let $A_i = (a_{ij})_{1\le j<i}$ for $i = 2,\dots,n$. Then the 0/1 vectors $A_2,\dots,A_n$ are mutually independent. We say that a random variable $X$ is Lipschitz with respect to the vertex exposure martingale if $X = f(A_2,\dots,A_n)$ for a function $f$ such that for any $k\in\{2,\dots,n\}$ and any two sequences $(A_2,\dots,A_n)$, $(A'_2,\dots,A'_n)$ with $A'_i = A_i$ for all $i\ne k$ we have
\[ |f(A_2,\dots,A_n)-f(A'_2,\dots,A'_n)|\le1. \]

Corollary 15.4 Assume $X$ is Lipschitz with respect to the vertex exposure martingale. Then for any $t>0$ we have
\[ \mathrm P[X\ge\mathrm E[X]+t]\le\exp\Big[-\frac{t^2}{2n}\Big], \qquad \mathrm P[X\le\mathrm E[X]-t]\le\exp\Big[-\frac{t^2}{2n}\Big]. \qquad(68) \]
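As a minimal numerical sanity check of the bounded-differences inequality (Corollary 15.3), the sketch below (Python; illustration only) uses the simplest 1-Lipschitz function, a sum of independent Bernoulli variables, and compares the empirical upper-tail probability with the bound $\exp(-t^2/(2n))$. The bound is valid but typically far from tight.

```python
import numpy as np

rng = np.random.default_rng(2)
n, trials, t = 100, 200000, 15
# f(Z_1,...,Z_n) = Z_1 + ... + Z_n with Z_i ~ Bernoulli(1/2): changing one
# coordinate changes f by at most c_k = 1, so Corollary 15.3 applies.
X = rng.integers(0, 2, size=(trials, n)).sum(axis=1)
emp = np.mean(X >= n / 2 + t)
bound = np.exp(-t * t / (2 * n))
print("empirical tail:", emp, " Azuma/McDiarmid bound:", bound)
```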

To prove Theorem 15.1, consider a function $\omega = \omega(n)$ that tends to ∞ slowly, say $\omega(n) = \ln\ln n$. Let $k = k(n)$ be the least integer such that
\[ \mathrm P[\chi(G(n,p))\le k]\ge1/\omega. \qquad(69) \]
We are going to show that
\[ \mathrm P[\chi(G(n,p))\le k+1] = 1-o(1). \qquad(70) \]
The assertion is immediate from (69)–(70).

To prove (70) let
\[ X(G) = \min\{|S|: S\subset\{1,\dots,n\}\text{ is such that }\chi(G-S)\le k\}. \]
In words, $X(G)$ is the number of vertices that we have to remove from $G$ to make the graph $k$-colorable. (The following argument is attributed to A. Frieze.) The random variable $X = X(G(n,p))$ is Lipschitz with respect to the vertex exposure martingale. Therefore, (68) shows that
\[ \mathrm P\big[|X-\mathrm E[X]|\ge\sqrt n\ln n\big]\le1/n. \qquad(71) \]
Since $\mathrm P[X = 0]\ge1/\omega$ by (69), (71) implies that $\mathrm E[X]\le\sqrt n\ln n$. Hence, applying (71) a second time, we find that
\[ \mathrm P\big[X>2\sqrt n\ln n\big]\le1/n. \qquad(72) \]

Now, consider the following process. Let $Q$ be a set of vertices such that $G(n,p)-Q$ is $k$-colorable. While there exist two adjacent vertices $v,w\in N(Q)\setminus Q$, add $v,w$ to $Q$. Let $Q'$ be the final outcome of this process.

Lemma 15.5 W.h.p. the set $Q'$ has size $\le100\sqrt n\ln n$.

Lemma 15.6 W.h.p. for any set $Y$ of size $|Y|\le100\sqrt n\ln n$ the subgraph of $G(n,p)$ induced on $Y$ is 3-colorable.


Before proving Lemmas 15.5 and 15.6, we show how they imply Theorem 15.1. Namely, let $Q'$ be the outcome of the aforementioned process. Moreover, let $\sigma:V\setminus Q'\to\{1,\dots,k\}$ be a $k$-coloring of $G(n,p)-Q'$. Let $U = N(Q')\setminus Q'$. Then $U$ is an independent set in $G(n,p)$, because the process has terminated. Let $\sigma(u) = k+1$ for all $u\in U$. Further, by Lemmas 15.5 and 15.6, w.h.p. the graph induced on $Q'$ has a 3-coloring $\tau:Q'\to\{1,2,3\}$, which we use to color the vertices of $Q'$. Since all neighbours of $Q'$ outside $Q'$ received colour $k+1$, the resulting map $\sigma:V\to\{1,\dots,k+1\}$ is a proper $(k+1)$-coloring of $G(n,p)$.

To prove Lemmas 15.5 and 15.6, we need the following lemma.

Lemma 15.7 Assume that $np = O(1)$. W.h.p. the random graph $G(n,p)$ has the following property: for every set $S$ of at most $100\sqrt n\ln n$ vertices, the number of edges spanned by $S$ satisfies
\[ e(S)\le\frac75|S|. \]

Proof. Fix one set $S_0$ of size $1\le|S_0| = s_0\le100\sqrt n\ln n$. Then
\[ \mathrm P[e(S_0)>7s_0/5]\le\binom{\binom{s_0}2}{7s_0/5}p^{7s_0/5}\le\Big(\frac{5es_0^2p}{14s_0}\Big)^{7s_0/5}\le(s_0p)^{7s_0/5}. \]
Furthermore, the total number of sets of size $s_0$ is $\binom n{s_0}$. Hence, by the union bound the probability that there is one such set with $e(S_0)>\frac75|S_0|$ is bounded by
\[ \binom n{s_0}(s_0p)^{7s_0/5}\le\Big[\frac{en}{s_0}\cdot(s_0p)^{7/5}\Big]^{s_0}\le\big[e(s_0p)^{2/5}np\big]^{s_0} = o(1)^{s_0}. \]
Summing over all possible $s_0$ completes the proof. □

Proof of Lemma 15.5. We are going to prove the following stronger statement: w.h.p., for any set $Z$ of at most $2\sqrt n\ln n$ vertices and $Z'$ constructed from $Z$ as above, we have $|Z'|\le100\sqrt n\ln n$. Indeed, assume that $|Z'|>100\sqrt n\ln n$. Let $Z''$ be the first set of size at least $\lfloor100\sqrt n\ln n\rfloor$ obtained during the process. For any pair of vertices that we add, the number of edges spanned by the current set increases by at least three. Hence, the number of edges spanned by $Z''$ is at least $e(Z'')\ge(\frac32-o(1))\cdot98\sqrt n\ln n$. Thus,
\[ e(Z'')/|Z''|\ge(1-o(1))\frac{147}{100}>\frac75. \]
But by Lemma 15.7, w.h.p. there is no set $Z''$ of such a high density. □

Proof of Lemma 15.6. Assume that $Y$ is a set of minimum size such that the subgraph induced on $Y$ fails to be 3-colorable. We claim that $|Y|>100\sqrt n\ln n$ w.h.p. Indeed, if $|Y|\le100\sqrt n\ln n$, then Lemma 15.7 implies that the subgraph induced on $Y$ has average degree at most $14/5<3$ w.h.p. Hence, there is a vertex $v\in Y$ of degree less than 3 in that subgraph. Since any 3-coloring of $Y\setminus v$ could be extended to $v$, the subgraph spanned by $Y\setminus v$ must also fail to be 3-colorable. This contradicts the minimality of $|Y|$. □

15.2 The first moment bound

While Theorem 15.1 establishes the concentration of the chromatic number, it does not reveal its typical value. To investigate it, we consider the following random graph model: let $G(n,m)$ be a graph on $n$ vertices with $m$ edges, chosen uniformly at random among all such graphs. We begin with the following lemma.

Lemma 15.8 There is $\varepsilon = O(\ln k/k)$ such that for $d = 2m/n>(2k-1)\ln k+\varepsilon$ we have $\chi(G(n,m))>k$ w.h.p.

Proof. Let $\sigma:V = \{1,\dots,n\}\to\{1,\dots,k\}$ be a map. Let $V_i = \sigma^{-1}(i)$ and $n_i = |V_i|$ for $i = 1,\dots,k$. Let $p_\sigma$ be the probability that σ is a $k$-coloring of $G(n,m)$. Then
\[ p_\sigma = \binom{\binom n2-\sum_{i=1}^k\binom{n_i}2}{m}\Big/\binom{\binom n2}{m}. \]


By Stirling's formula,
\[ \frac1n\ln p_\sigma\sim\frac d2\ln\Big(1-\sum_{i=1}^k\frac{n_i^2}{n^2}\Big)\le\frac d2\ln(1-1/k), \]
because $\sum_{i=1}^k\frac{n_i^2}{n^2}$ is minimized if $n_i = n/k$ for all $i$. Hence, by the union bound the exponential rate of the probability that there is a $k$-coloring is bounded by
\begin{align*}
\frac1n\ln\sum_\sigma p_\sigma &\le \ln k+\frac d2\ln(1-1/k)\le\ln k-\frac d2\Big(\frac1k+\frac1{2k^2}\Big)\\
&\le \ln k-\frac{(2k-1)\ln k+\varepsilon}2\Big(\frac1k+\frac1{2k^2}\Big) = \frac{\ln k}{4k^2}-\frac\varepsilon{2k}-\frac\varepsilon{4k^2}<0,
\end{align*}
provided that the constant hidden in the $O(\cdot)$ is not too small. □

Corollary 15.9 There is $\varepsilon' = O(\ln k/k)$ such that for $d = np>(2k-1)\ln k+\varepsilon'$ we have $\chi(G(n,p))>k$ w.h.p.

Proof. The number of edges in $G(n,p)$ has distribution ${\rm Bin}(\binom n2,p)$. Furthermore, given that the number of edges is $m$, $G(n,p)$ is identical to $G(n,m)$. Since
\[ \mathrm P\Big[{\rm Bin}\Big(\binom n2,p\Big)<\binom n2p-\sqrt n\ln n\Big] = o(1), \]
the assertion follows from Lemma 15.8. □
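A quick numerical illustration of the first moment bound (Python; not part of the notes): for a fixed $k$, the exponent $\ln k+\frac d2\ln(1-1/k)$ changes sign at an average degree very close to $(2k-1)\ln k$.

```python
import numpy as np

def first_moment_exponent(k, d):
    """(1/n) ln E[# k-colorings], up to o(1), as in the proof of Lemma 15.8."""
    return np.log(k) + d / 2 * np.log(1 - 1 / k)

k = 5
d_grid = np.linspace(1, 40, 40001)
d_crit = d_grid[np.argmax(first_moment_exponent(k, d_grid) < 0)]   # first d with negative exponent
print("k =", k, " exponent changes sign near d =", round(float(d_crit), 3),
      " (2k-1) ln k =", round((2 * k - 1) * np.log(k), 3))
```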

15.3 The second moment bound

To also obtain an upper bound on $\chi(G(n,p))$, we are going to carry out a second moment argument. The result will be the following theorem.

Theorem 15.10 There is an absolute constant $c>0$ such that in the case $d = 2m/n<(2k-2)\ln k-c$ we have $\chi(G(n,m))\le k+1$ w.h.p.

In the following, we keep the assumption on $d$ from Theorem 15.10. We are not actually going to work with all $k$-colorings. Instead, let us call a map $\sigma:V\to\{1,\dots,k\}$ balanced if $\big||\sigma^{-1}(i)|-n/k\big|\le\sqrt n$ for all $i$. Let $Z$ denote the number of balanced $k$-colorings of $G(n,m)$. The proof of the following lemma is similar to the proof of Lemma 15.8.

Lemma 15.11 We have $\frac1n\ln\mathrm E[Z]\sim\ln k+\frac d2\ln(1-1/k)$.

To compute the second moment, we define the overlap matrix $\rho(\sigma,\tau)$ of two balanced maps σ, τ as follows: $\rho(\sigma,\tau) = (\rho_{ij}(\sigma,\tau))_{i,j=1,\dots,k}$ is a $k\times k$ matrix with entries
\[ \rho_{ij}(\sigma,\tau) = \frac kn\big|\sigma^{-1}(i)\cap\tau^{-1}(j)\big|. \]
The row and column sums of this matrix are "approximately" equal to 1 (because σ, τ are balanced). For a $k\times k$ matrix $\rho = (\rho_{ij})$ we let
\[ \|\rho\|_2 = \Big(\sum_{i,j=1}^k\rho_{ij}^2\Big)^{1/2} \]


denote the Frobenius norm. Furthermore, let
\[ f(\rho) = \ln k-\frac1k\sum_{i,j=1}^k\rho_{ij}\ln\rho_{ij}+\frac d2\ln\Big(1-\frac2k+k^{-2}\|\rho\|_2^2\Big). \]

Lemma 15.12 Let σ, τ be balanced. Then
\[ \frac1n\ln\mathrm P[\text{both }\sigma,\tau\text{ are }k\text{-colorings of }G(n,m)]\sim f(\rho(\sigma,\tau)). \]

Proof. This follows from Stirling's formula and a similar calculation as in the proof of Lemma 15.8. □
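One useful sanity check on the function $f$: at the "flat" overlap matrix $\bar\rho$ with all entries $1/k$ it equals exactly twice the first moment exponent of Lemma 15.11, which is why the second moment can be of the same order as $\mathrm E[Z]^2$. The sketch below (Python; illustration only, arbitrary $k$ and $d$) verifies this numerically.

```python
import numpy as np

def f_overlap(rho, k, d):
    """The overlap function f(rho) from the second moment computation."""
    rho = np.asarray(rho, dtype=float)
    ent = np.log(k) - np.sum(rho * np.log(rho, where=rho > 0, out=np.zeros_like(rho))) / k
    eng = d / 2 * np.log(1 - 2 / k + np.sum(rho ** 2) / k ** 2)
    return ent + eng

k, d = 5, 12.0
rho_bar = np.full((k, k), 1 / k)                        # the flat overlap matrix
first_moment = np.log(k) + d / 2 * np.log(1 - 1 / k)    # (1/n) ln E[Z] from Lemma 15.11
print(f_overlap(rho_bar, k, d), 2 * first_moment)       # f(rho_bar) = 2 * (1/n) ln E[Z]
```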

Let $\mathcal B$ be the set of all balanced maps $V\to\{1,\dots,k\}$. Moreover, let $\mathcal R$ be the set of all possible overlap matrices $\rho(\sigma,\tau)$ with $\sigma,\tau\in\mathcal B$. Furthermore, let $\bar\rho$ be the $k\times k$ matrix with all entries equal to $1/k$, and let $\mathcal R_1$ consist of all $\rho\in\mathcal R$ with $\|\rho-\bar\rho\|_2^2\le\eta$, with $\eta = \eta(k)>0$ a sufficiently small constant. The core of the proof lies in establishing the following two propositions.

Proposition 15.13 There is a number $\delta = \delta(k)>0$ such that for all $\rho\in\mathcal R\setminus\mathcal R_1$ we have $f(\rho)<f(\bar\rho)-\delta$.

Let $Z_1$ be the number of pairs $\sigma,\tau\in\mathcal B$ of $k$-colorings with $\rho(\sigma,\tau)\in\mathcal R_1$.

Proposition 15.14 There is a number $C = C(k)>0$ such that $\mathrm E[Z_1]\le C\cdot\mathrm E[Z]^2$.

Proof of Theorem 15.10. Propositions 15.13 and 15.14 imply that
\[ \mathrm E[Z^2]\le C\cdot\mathrm E[Z]^2, \]
with $C = C(k)>0$ independent of $n$. Therefore, we can apply the Paley-Zygmund inequality (which you know from the exercises):
\[ \mathrm P[Z>0]\ge\frac{\mathrm E[Z]^2}{\mathrm E[Z^2]}\ge1/C>0. \]
Hence, $G(n,m)$ is $k$-colorable with a probability that remains bounded away from 0 as $n\to\infty$. We can, therefore, carry out the proof of Theorem 15.1 with this choice of $k$ in (69). Consequently, we find that $\chi(G(n,m))\le k+1$ w.h.p. □

By a similar argument as in the proof of Corollary 15.9 we obtain

Corollary 15.15 There is an absolute constant $c'>0$ such that in the case $d = np<(2k-2)\ln k-c'$ we have $\chi(G(n,p))\le k+1$ w.h.p.

Corollary 15.16 There is $d_0>0$ such that for all $d\ge d_0$ the following is true. Let $k$ be the least integer such that $d<2k\ln k$. Then w.h.p. we have $\chi(G(n,p))\in\{k,k+1\}$.

15.3.1 Proof of Proposition 15.13

Throughout this section we assume that $k\ge k_0$ is sufficiently large; this is possible at the expense of choosing $c$ in Theorem 15.10 sufficiently large. We noticed earlier that for any $\rho\in\mathcal R$ the row and column sums are $1+o(1)$. In fact, because the function $f$ is continuous, we may confine ourselves to matrices ρ whose row and column sums are exactly equal to one. In other words, we may assume that the matrix ρ is doubly stochastic. Thus, let $\mathcal D$ denote the set of all doubly stochastic $k\times k$ matrices. Our aim is to maximize $f$ over $\rho\in\mathcal D$.

Solving this problem directly turns out to be difficult. We will sidestep the issue by considering a relaxation. That is, we are actually going to optimize $f$ over the set $\mathcal S\supset\mathcal D$ of singly stochastic $k\times k$ matrices (i.e., matrices whose row sums are equal to one, while the column sums can be anything). To carry out this optimisation, we perform a local variations argument. We begin by differentiating $f$.


We denote by $\tilde O(F(k))$ a quantity that is bounded by $O(F(k))\cdot(\ln k)^{O(1)}$ in absolute value. Furthermore, let
\[ H(\rho) = \ln k-\frac1k\sum_{i,j=1}^k\rho_{ij}\ln\rho_{ij}, \qquad E(\rho) = \frac d2\ln\Big(1-\frac2k+k^{-2}\|\rho\|_2^2\Big). \]
We observe that
\[ \frac\partial{\partial y}\,\frac d2\ln\big(1-2/k+y/k^2\big) = \frac d{2k^2(1-2/k+y/k^2)} = \frac{\ln k}k(1+O(1/k)), \qquad(73) \]
\[ \frac{\partial^2}{\partial y^2}\,\frac d2\ln\big(1-2/k+y/k^2\big) = \frac{-d}{2k^4(1-2/k+y/k^2)^2} = \tilde O(k^{-3}). \qquad(74) \]
In addition, we let $\rho_i = (\rho_{i1},\dots,\rho_{ik})$ denote the $i$th row of ρ. Moreover,
\[ H(\rho_i) = -\sum_{j=1}^k\rho_{ij}\ln\rho_{ij} \]
denotes the entropy of $\rho_i$, viewed as a probability distribution on $\{1,\dots,k\}$. Finally, let $r = \ln k+O_k(1)$ be such that
\[ d = (2k-1)\ln k-r. \]

Lemma 15.17 Let $\rho\in\mathcal S$. Let $i,j,l\in[k]$ and set $\delta = \rho_{il}-\rho_{ij}$. Suppose that $\rho_{ij},\rho_{il}>0$. Then
\[ \mathrm{sign}\Big(\frac{\partial f}{\partial\rho_{ij}}-\frac{\partial f}{\partial\rho_{il}}\Big)\Big|_\rho = \mathrm{sign}\Big(1+\frac\delta{\rho_{ij}}-\exp\Big(\frac d{k(1-\frac2k+\frac1{k^2}\|\rho\|_2^2)}\cdot\delta\Big)\Big). \]

Proof. A direct calculation yields
\[ \frac{\partial f}{\partial\rho_{ij}}-\frac{\partial f}{\partial\rho_{il}} = \frac1k\Big[\ln\Big(\frac{\rho_{il}}{\rho_{ij}}\Big)-\frac dk\cdot\frac{\rho_{il}-\rho_{ij}}{1-\frac2k+\frac1{k^2}\|\rho\|_2^2}\Big]. \]
Substituting $\delta = \rho_{il}-\rho_{ij}$, we find
\[ \ln\Big(\frac{\rho_{il}}{\rho_{ij}}\Big)-\frac dk\cdot\frac{\rho_{il}-\rho_{ij}}{1-\frac2k+\frac1{k^2}\|\rho\|_2^2} = \ln(1+\delta/\rho_{ij})-\frac d{k(1-\frac2k+\frac1{k^2}\|\rho\|_2^2)}\cdot\delta. \]
Taking exponentials yields the assertion. □

Thus, to understand what happens if we slightly increase $\rho_{ij}$ at the expense of decreasing $\rho_{il}$, we need to investigate how the line $1+\delta/\rho_{ij}$ intersects the exponential function $\exp\big(\delta d/(k(1-2/k+k^{-2}\|\rho\|_2^2))\big)$.

Lemma 15.18 Let ρ be a stochastic $k\times k$ matrix. Let $i,j\in[k]$ and assume that $\rho_{ij}>0$.

1. If
\[ \frac1{\rho_{ij}}>\frac d{k(1-\frac2k+\frac1{k^2}\|\rho\|_2^2)}, \qquad(75) \]
then there exists a unique $\delta^*>0$ such that
\[ 1+\frac{\delta^*}{\rho_{ij}} = \exp\Big[\frac d{k(1-\frac2k+\frac1{k^2}\|\rho\|_2^2)}\cdot\delta^*\Big]. \]
Furthermore, for all $0<\delta<\delta^*$ we have
\[ 1+\frac\delta{\rho_{ij}}-\exp\Big[\frac d{k(1-\frac2k+\frac1{k^2}\|\rho\|_2^2)}\cdot\delta\Big]>0. \]

2. If (75) does not hold, then for all $\delta>0$ we have
\[ 1+\frac\delta{\rho_{ij}}<\exp\Big[\frac d{k(1-\frac2k+\frac1{k^2}\|\rho\|_2^2)}\cdot\delta\Big]. \]

Proof. There is at most one $\delta^*>0$ where the straight line $\delta\mapsto1+\frac\delta{\rho_{ij}}$ intersects the strictly convex function
\[ \delta\mapsto\exp\Big[\frac d{k(1-\frac2k+\frac1{k^2}\|\rho\|_2^2)}\cdot\delta\Big]. \]
In fact, there is exactly one such $\delta^*$ iff the derivative of the linear function is greater than that of the exponential function at $\delta = 0$. Since
\[ \frac\partial{\partial\delta}\Big(1+\frac\delta{\rho_{ij}}\Big)\Big|_{\delta=0} = \frac1{\rho_{ij}}, \qquad \frac\partial{\partial\delta}\exp\Big[\frac d{k(1-\frac2k+\frac1{k^2}\|\rho\|_2^2)}\cdot\delta\Big]\Big|_{\delta=0} = \frac d{k(1-\frac2k+\frac1{k^2}\|\rho\|_2^2)}, \]
the assertion follows. □

Corollary 15.19 Suppose that $\rho\in\mathcal S$. Let $i\in[k]$ and $J\subset[k]$ be such that for some $\lambda>10^{-4}$ we have
\[ |J|\ge k\lambda \quad\text{and}\quad \max_{j\in J}\rho_{ij}<\lambda/2-10/\ln k. \]
Let $\hat\rho$ be the matrix with entries $\hat\rho_{ab} = \rho_{ab}$ for all $(a,b)\notin\{i\}\times J$ and
\[ \hat\rho_{ia} = |J|^{-1}\sum_{j\in J}\rho_{ij} \quad\text{for }a\in J. \]
Then $f(\rho)\le f(\hat\rho)$. In fact, if $\rho\ne\hat\rho$, then $f(\rho)<f(\hat\rho)$.

Proof. If $\rho_{ij} = 0$ for all $j\in J$, then $\hat\rho = \rho$ and there is nothing to show. Thus, assume that $\sum_{j\in J}\rho_{ij}>0$. Suppose that $\rho^*\in\mathcal S$ maximizes $f(\rho^*)$ subject to the conditions

i. $\rho^*_{ab} = \rho_{ab}$ for all $(a,b)\notin\{i\}\times J$ and

ii. $\max_{j\in J}\rho^*_{ij}\le\max_{j\in J}\rho_{ij}$.

Such a maximizer $\rho^*$ exists because i.–ii. define a compact domain. We claim that $\rho^*_{ij}>0$ for all $j\in J$. Indeed, assume that $\rho^*_{ij} = 0$ for some $j\in J$ but $\rho^*_{il}>0$ for some other $l\in J$. Then there is $\xi>0$ such that the matrix $\rho'$ obtained from $\rho^*$ by replacing $\rho^*_{ij}$ by ξ and $\rho^*_{il}$ by $\rho^*_{il}-\xi$ satisfies $f(\rho')>f(\rho^*)$. This is because the derivative of the function $z\mapsto-z\ln z$ tends to infinity as $z\to0$. Hence, we obtain a contradiction to the maximality of $\rho^*$.

Thus, let $a$ be such that $\rho^*_{ia} = \min_{j\in J}\rho^*_{ij}>0$. Because $\rho^*$ is stochastic, we have $\|\rho^*\|_2^2\in[1,k]$ and $|J|\rho^*_{ia}\le\sum_{j\in J}\rho^*_{ij}\le1$. Therefore,
\[ 1/\rho^*_{ia}\ge|J|\ge k\lambda\ge3\ln k>\frac d{k(1-2/k+\|\rho^*\|_2^2/k^2)}. \qquad(76) \]
Thus, (75) is satisfied. Our assumptions $\delta\le\lambda/2-10/\ln k$ and $\lambda\ge10^{-4}$ ensure that
\[ \exp\Big(\frac{d\delta}{k(1-2/k+\|\rho^*\|_2^2/k^2)}\Big)\le\exp\Big(\frac{2\delta\ln k}{(1-1/k)^2}\Big)\le\exp(-10)k\lambda\le\exp(-10)|J|<1+\delta/\rho^*_{ia}. \qquad(77) \]
Let $b\in J$ be such that $\rho^*_{ib} = \max_{j\in J}\rho^*_{ij}$. Assume that $\delta = \rho^*_{ib}-\rho^*_{ia}>0$. Since $\delta\le\rho^*_{ib}\le\lambda/2-10/\ln k$, (76) and (77) yield in combination with Lemmas 15.17 and 15.18 that
\[ \frac{\partial f}{\partial\rho_{ia}}-\frac{\partial f}{\partial\rho_{ib}}\Big|_{\rho^*}>0, \]
in contradiction to our assumption that $\rho^*$ maximizes $f$ subject to i.–ii. Hence, $\min_{j\in J}\rho^*_{ij} = \rho^*_{ia} = \rho^*_{ib} = \max_{j\in J}\rho^*_{ij}$, which means that $\rho^* = \hat\rho$. □

Corollary 15.19 enables us to optimise $f$ over matrices ρ that do not have an entry near $1/2$.

Lemma 15.20 Let ρ be a stochastic $k\times k$ matrix. Let $i\in[k]$ be such that $\rho_{ij}\notin[0.49,0.51]$ for all $j\in[k]$.

1. Suppose that $\rho_{ij}\le0.49$ for all $j\in[k]$. Let $\rho'$ be the matrix with entries
\[ \rho'_{hj} = \rho_{hj} \quad\text{and}\quad \rho'_{ij} = 1/k \qquad\text{for all }j\in[k],\ h\in[k]\setminus\{i\}. \]
Then $f(\rho)\le f(\rho')$.

2. Suppose that $\rho_{ij}\ge0.51$ for some $j\in[k]$. Then there is a number $\alpha = 1/k+\tilde O(1/k^2)$ such that for the matrix $\rho''$ with entries
\[ \rho''_{hj} = \rho_{hj}, \qquad \rho''_{ii} = 1-\alpha, \qquad \rho''_{ih} = \frac\alpha{k-1} \qquad\text{for all }j\in[k],\ h\in[k]\setminus\{i\} \]
we have $f(\rho)\le f(\rho'')$.

Proof. To obtain the first assertion, we simply apply Corollary 15.19 to row $i$ and $J = [k]$ (with $\lambda = 0.999$). Now, suppose that $\rho_{ij}\ge0.51$ for some pair $(i,j)$. Without loss of generality, we may assume that $(i,j) = (1,1)$ and that $\rho_{11}\ge\cdots\ge\rho_{1k}$. Let $\rho^*\in\mathcal S$ be the matrix that maximizes $f$ subject to the conditions

i. $\rho^*_{11}\ge0.51$.

ii. $\rho^*_a = \rho_a$ for all $a\in\{2,\dots,k\}$.

We aim to prove that $\rho^* = \rho''$. Since $\rho^*_{12}\le1-\rho^*_{11}\le0.49$, Corollary 15.19 applies to $J = \{2,\dots,k\}$ (with $\lambda = 0.999$) and yields
\[ \rho^*_{12} = \cdots = \rho^*_{1k} = \frac{1-\rho^*_{11}}{k-1}. \]
Let $\delta = \rho^*_{11}-\rho^*_{12} = \rho^*_{11}-O(1/k)$ and let $0\le\lambda\le0.49k$ be such that $\rho^*_{11} = 1-\lambda/k$. Since $\|\rho^*\|_2^2\in[1,k]$, we have $Q = 1-2/k+\|\rho^*\|_2^2/k^2\ge(1-1/k)^2$. Therefore,
\[ \exp\Big[\frac{\delta d}{kQ}\Big] = k^{2\rho^*_{11}}\big(1+\tilde O(1/k)\big) = k^{2(1-\lambda/k)}\big(1+\tilde O(1/k)\big). \]
Furthermore,
\[ 1+\frac\delta{\rho^*_{12}} = \frac{\rho^*_{11}}{\rho^*_{12}} = \frac{(k-1)\rho^*_{11}}{1-\rho^*_{11}} = \frac{k^2}\lambda(1-\lambda/k)(1+O(1/k)). \]
Define
\[ \xi:l\mapsto k^{2l/k}\Big(\frac1l-\frac1k\Big), \]
so that
\[ \Big(1+\frac\delta{\rho^*_{12}}\Big)\exp\Big[-\frac{\delta d}{kQ}\Big] = \big(1+\tilde O(1/k)\big)\cdot\xi(\lambda). \]
We have
\[ \frac d{dl}\xi(l) = k^{2l/k}\Big[\frac{2\ln k}k\Big(\frac1l-\frac1k\Big)-\frac1{l^2}\Big]. \]


Figure 1: the function ξ(l) for k = 30.

This derivative vanishes at $\frac k2\big(1\pm\sqrt{1-2/\ln k}\big)$. At $\mu = \frac k2\big(1-\sqrt{1-2/\ln k}\big) = (1+o_k(1))\frac k{2\ln k}$ the function ξ attains a local minimum, while $\frac k2\big(1+\sqrt{1-2/\ln k}\big)>k/2$ is a local maximum. Furthermore, $\xi(1) = 1+\tilde O(1/k)$ and $\xi'(1) = -1+o_k(1)$. Therefore, there is $\gamma = \tilde O(1/k)$ such that for $1+\gamma<\lambda<\mu$ we have
\[ \Big(1+\frac\delta{\rho^*_{12}}\Big)\exp\Big[-\frac{\delta d}{kQ}\Big] = \big(1+\tilde O(1/k)\big)\xi(\lambda)\le\big(1+\tilde O(1/k)\big)\xi(1+\gamma)<1. \]
Furthermore, if $\lambda = 0.49k$, then
\[ \Big(1+\frac\delta{\rho^*_{12}}\Big)\exp\Big[-\frac{\delta d}{kQ}\Big] = \big(1+\tilde O(1/k)\big)\xi(\lambda)\le k^{0.98}\Big(\frac1{0.49k}-\frac1k\Big)\le3k^{-0.01}<1. \]
As a consequence, we have
\[ 1+\frac\delta{\rho^*_{12}}<\exp\Big[\frac{\delta d}{kQ}\Big] \qquad\text{for all }1+\gamma\le\lambda\le0.49k. \qquad(78) \]
In addition, because μ is the unique local minimum of ξ and because $\xi(1) = 1+\tilde O(1/k)$ and $\xi'(1) = -1+o_k(1)$, we can choose $\gamma = \tilde O(1/k)$ such that
\[ \Big(1+\frac\delta{\rho^*_{12}}\Big)\exp\Big[-\frac{\delta d}{kQ}\Big] = \big(1+\tilde O(1/k)\big)\xi(\lambda)\ge\big(1+\tilde O(1/k)\big)\xi(1-\gamma)>1 \qquad\text{for }0<\lambda<1-\gamma. \qquad(79) \]
Thus, Lemma 15.17 and the maximality of $f(\rho^*)$ imply that $\rho^*_{11} = 1-1/k+\tilde O(1/k^2)$, as claimed. □

The following lemma allows us to get rid of rows that have an entry close to $1/2$.

Lemma 15.21 Let ρ be a stochastic matrix such that $\rho_{ij}\in[0.49,0.51]$ for some $(i,j)\in[k]^2$. Then there is a stochastic matrix $\rho'$ such that $\rho'_{ij}\notin[0.49,0.51]$ for all $j\in[k]$ and such that $f(\rho')\ge f(\rho)+\frac{\ln k}{5k}$.

Proof. Without loss of generality we may assume that (i, j) = (1, 1) and that ρ ∈ S maximizes f subjectto the condition that ρ11 ∈ [0.49, 0.51]. There are two cases.

Case 1: ρ1j < 0.49 for all j ≥ 2. Applying Corollary 15.19 to the set J = 2, . . . , k (with λ = 0.999),we see that ρ1j = 1−ρ11

k−1 for all j ≥ 2 due to the maximality of f(ρ). Hence,

H(ρ1) ≤ h(ρ11) + (1− ρ11) ln(k − 1) ≤ ln 2 + 0.51 ln k. (80)

Page 47: Probabilistic Combinatorics - math.uni-frankfurt.deacoghlan/probcomb.pdf · 3 THE ERDOS-KO-RADO THEOREM˝ 4 Thus, we need to compute E(I S) for a fixed set S. Of course, E(I S) is

15 DILUTED MEAN-FIELD MODELS 47

Moreover, because $\rho_{11} \le 0.51$ we have
\[
  \|\rho_1\|_2^2 \le 0.51^2 + (k-1)\Big(\frac{1-\rho_{11}}{k-1}\Big)^2 \le 0.261. \tag{81}
\]
Let $\rho'$ be the matrix obtained from $\rho$ by replacing the first row by the vector $(1, 0, \ldots, 0)$. Since $H(1, 0, \ldots, 0) = 0$, (80) yields
\[
  f(\rho) - f(\rho') = H(\rho) - H(\rho') + E(\rho) - E(\rho') \le \frac{\ln 2 + 0.51\ln k}{k} + E(\rho) - E(\rho'). \tag{82}
\]
Furthermore, (81) entails $\|\rho\|_2^2 - \|\rho'\|_2^2 \le \|\rho_1\|_2^2 - 1 \le -0.739$. Hence,
\[
  E(\rho) - E(\rho') \le -0.739\,\big(1 + O(1/k)\big)\ln k/k \le -0.73\ln k/k. \tag{83}
\]
Combining (82) and (83), we obtain $f(\rho) - f(\rho') \le \frac{1}{k}\left[\ln 2 - 0.22\ln k\right] \le -\ln k/(5k)$.

Case 2: there is $j \ge 2$ such that $\rho_{1j} > 0.49$. We may assume that $j = 2$. Applying Corollary 15.19 to $J = \{3, \ldots, k\}$ (with $\lambda = 0.999$), we find that $\rho_{1j} = (1 - \rho_{11} - \rho_{12})/(k-2)$ for all $j \ge 3$ due to the maximality of $f(\rho)$. Hence,
\[
  H(\rho_1) \le 2\ln 2 + 0.02\ln k. \tag{84}
\]
Further, because $\rho_{11}^2 + \rho_{12}^2 \le 0.51^2 + 0.49^2$ (as $\rho_{11}, \rho_{12} \in [0.49, 0.51]$ and $\rho_{11} + \rho_{12} \le 1$), we see that
\[
  \|\rho_1\|_2^2 \le 0.51^2 + 0.49^2 + (k-2)\Big(\frac{1 - \rho_{11} - \rho_{12}}{k-2}\Big)^2 \le 0.501. \tag{85}
\]

As in the first case, let $\rho'$ be the matrix obtained from $\rho$ by replacing the first row by the vector $(1, 0, \ldots, 0)$. Then (84) gives
\[
  f(\rho) - f(\rho') = H(\rho) - H(\rho') + E(\rho) - E(\rho') \le \frac{1}{k}\left[2\ln 2 + 0.02\ln k\right] + E(\rho) - E(\rho'). \tag{86}
\]
Further, from (85) we obtain $\|\rho\|_2^2 - \|\rho'\|_2^2 \le 0.501 - 1 = -0.499$. Hence,
\[
  E(\rho) - E(\rho') \le -0.499\,\big(1 + O(1/k)\big)\ln k/k \le -0.49\ln k/k. \tag{87}
\]
Combining (86) and (87), we get $f(\rho) - f(\rho') \le \frac{1}{k}\left[2\ln 2 - 0.47\ln k\right] \le -\ln k/(5k)$.

Hence, in either case we obtain the desired bound. $\Box$
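The numerical constants in (81) and (85) are easy to verify: the sketch below (a quick check of ours, not part of the lecture) maximizes the two row norms over the admissible range of $\rho_{11}$ and $\rho_{12}$; note that the concluding inequalities of both cases, like all estimates in this section, are asymptotic in $k$.

```python
import numpy as np

def case1_norm(r11, k):
    # row (rho11, (1-rho11)/(k-1), ..., (1-rho11)/(k-1)), cf. (81)
    return r11 ** 2 + (k - 1) * ((1 - r11) / (k - 1)) ** 2

def case2_norm(r11, r12, k):
    # row (rho11, rho12, (1-rho11-rho12)/(k-2), ...), cf. (85)
    return r11 ** 2 + r12 ** 2 + (k - 2) * ((1 - r11 - r12) / (k - 2)) ** 2

grid = np.linspace(0.49, 0.51, 201)
for k in (300, 1000, 10_000):
    m1 = max(case1_norm(r, k) for r in grid)
    m2 = max(case2_norm(r11, r12, k) for r11 in grid for r12 in grid if r11 + r12 <= 1)
    print(k, round(m1, 5), round(m2, 5))   # below 0.261 and 0.501 once k is large enough

# The final inequalities ln 2 - 0.22 ln k <= -ln(k)/5 (Case 1) and
# 2 ln 2 - 0.47 ln k <= -ln(k)/5 (Case 2) hold only for large k
# (roughly k >= 2**50 and k >= 170, respectively).
```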

Lemmas 15.20 and 15.21 show that for any stochastic matrix $\rho$ there is a stochastic matrix $\hat\rho$ whose rows are either of the form $\frac{1}{k}\mathbf{1}$ or close to the vector $(1, 0, \ldots, 0)$ such that $f(\hat\rho) \ge f(\rho)$. The following lemma provides a bound on $f(\hat\rho)$. Recall that $d = 2k\ln k - \ln k - r$ with $r = \ln k + O_k(1)$.

Lemma 15.22 Let $\rho = (\rho_{ij})$ be the $k \times k$ matrix with entries
\[
  \rho_{ij} =
  \begin{cases}
    1 & \text{if } i = j \in [s],\\
    0 & \text{if } i \in [s],\ j \ne i,\\
    1/k & \text{otherwise.}
  \end{cases}
\]
Then
\[
  f(\rho) \le \frac{r}{k} + \frac{s}{k}\Big(1 - \frac{s}{k}\Big)\cdot\frac{\ln k}{2k} - \frac{rs}{2k^2} + o(1/k).
\]


Proof. We have $\|\rho\|_2^2 = s + 1 - s/k$. Hence,
\begin{align*}
  E(\rho) &= \frac{d}{2k^2}\left[-2k + \|\rho\|_2^2 - 2\Big(1 - \frac{\|\rho\|_2^2}{2k}\Big)^2\right] + o(1/k)\\
  &= \frac{d}{2k^2}\left[-2k + s + 1 - s/k - 2\Big(1 - \frac{s}{2k}\Big)^2\right] + o(1/k)\\
  &= \frac{d}{2k^2}\left[-2k - 1 + s + s/k - \frac{s^2}{2k^2}\right] + o(1/k)\\
  &= \frac{d}{2k^2}\left[-2k - 1 + s\Big(1 + \frac{1}{k} - \frac{s}{2k^2}\Big)\right] + o(1/k)\\
  &= -2\ln k + \frac{r}{k} + \frac{s}{2k}\Big(\frac{d}{k} + \frac{d}{k^2} - \frac{ds}{2k^3}\Big) + o(1/k)\\
  &= -2\ln k + \frac{r}{k} + \frac{s}{k}\Big(\ln k - \frac{\ln k}{2k} - \frac{r}{2k} + \frac{\ln k}{k} - \frac{s\ln k}{2k^2}\Big) + o(1/k)\\
  &= -2\ln k + \frac{r}{k} + \frac{s\ln k}{k}\Big(1 + \frac{1}{2k} - \frac{s}{2k^2}\Big) - \frac{rs}{2k^2} + o(1/k).
\end{align*}

Further, $H(\rho) = \ln k + (1 - s/k)\ln k = 2\ln k - \frac{s}{k}\ln k$. Thus,
\begin{align*}
  H(\rho) + E(\rho) &= \frac{r}{k} + \frac{s\ln k}{k}\Big(\frac{1}{2k} - \frac{s}{2k^2}\Big) - \frac{rs}{2k^2} + o(1/k)\\
  &= \frac{r}{k} + \frac{s}{k}(1 - s/k)\cdot\frac{\ln k}{2k} - \frac{rs}{2k^2} + o(1/k),
\end{align*}
as claimed. $\Box$
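This calculation is easy to reproduce numerically once $H$ and $E$ are spelled out. The sketch below assumes $H(\rho) = \ln k + \frac{1}{k}\sum_i H(\rho_i)$ (with $H(\rho_i)$ the entropy of row $i$) and $E(\rho) = \frac{d}{2}\ln\big(1 - 2/k + \|\rho\|_2^2/k^2\big)$; this is a reconstruction consistent with the expansions and derivatives used in this section, not a quotation of Lemma 15.12, and the choice $r = \ln k$ is just one admissible value.

```python
import numpy as np

def f_block(k, s, r):
    # f(rho) = H(rho) + E(rho) for the block matrix of Lemma 15.22:
    # s rows equal to a standard basis vector, the remaining k - s rows flat.
    d = 2 * k * np.log(k) - np.log(k) - r
    H = np.log(k) + (1 - s / k) * np.log(k)            # rows in [s] have entropy 0, the rest ln k
    norm_sq = s + 1 - s / k                            # ||rho||_2^2
    E = (d / 2) * np.log(1 - 2 / k + norm_sq / k ** 2)
    return H + E

def bound(k, s, r):
    # main term of Lemma 15.22 (the o(1/k) error is not included)
    return r / k + (s / k) * (1 - s / k) * np.log(k) / (2 * k) - r * s / (2 * k ** 2)

k, r = 1000, np.log(1000)
for s in (0, k // 2, k):
    print(s, f_block(k, s, r), bound(k, s, r))   # the difference is well below 1/k
```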

Proof of Proposition 15.13. Lemma 15.21 implies that the maximum $\max_{\rho\in S} f(\rho)$ is attained at a matrix $\rho$ without entries in $[0.49, 0.51]$. Therefore, Lemma 15.20 shows that the maximizer $\rho$ has the following form for some integer $0 \le s \le k$ and some $\alpha = 1/k + O(1/k^2)$:
\[
  \rho_{ij} =
  \begin{cases}
    1 - \alpha & \text{if } i = j \in [s],\\
    \frac{\alpha}{k-1} & \text{if } i \in [s],\ j \ne i,\\
    1/k & \text{otherwise.}
  \end{cases}
\]

Thus, for $i \in [s]$ we have
\begin{align*}
  H(\rho_i) &= h(1-\alpha) + \alpha\ln(k-1) \le h(\alpha) + \alpha\ln k,\\
  \|\rho_i\|_2^2 &= (1-\alpha)^2 + \alpha^2/(k-1).
\end{align*}
Let $\rho'$ be the matrix obtained from $\rho$ by replacing the first $s$ rows by $(1, 0, \ldots, 0)$. Then
\begin{align}
  H(\rho) - H(\rho') &\le \frac{s}{k}\left[h(\alpha) + \alpha\ln k\right] \le \frac{\alpha s}{k}\left[1 - \ln\alpha + \ln k\right] \le \frac{2\alpha s}{k}\left[1 + \ln k\right], \tag{88}\\
  \|\rho\|_2^2 - \|\rho'\|_2^2 &\le s\left[(1-\alpha)^2 + \frac{\alpha^2}{k-1} - 1\right] = \alpha s\left[-2 + \alpha(1 + 1/(k-1))\right] = \alpha s\left[-2 + O(1/k)\right]. \tag{89}
\end{align}

Plugging (89) into (73) yields
\[
  E(\rho) - E(\rho') \le \alpha s\left[-2 + O(1/k)\right]\cdot\big(1 + O(1/k)\big)\frac{\ln k}{k} \le -\frac{2\alpha s}{k}\left[\ln k + O(1/k)\right]. \tag{90}
\]
Combining (88) and (90), we obtain
\[
  f(\rho) - f(\rho') \le \frac{2\alpha s}{k}\left[1 + O(1/k)\right] \le 3/k.
\]

Thus, the assertion follows from Lemma 15.22, provided that the constant $c$ is chosen sufficiently big. $\Box$


15.3.2 Proof of Proposition 15.14

Stirling's formula implies that
\[
  \sum_{\rho\in R_1} \mathrm{E}\left[Z_\rho\right] \le O\big(n^{(1-k^2)/2}\big)\sum_{\rho\in R_1}\exp(nf(\rho)) \tag{91}
\]
(cf. Lemma 15.12). By construction, we have $\sum_{i,j=1}^k \rho_{ij} = k$ for all $\rho \in R_1$. Therefore, we can parametrize the set $R_1$ as follows. Let
\[
  L : [0,1]^{k^2-1} \to [0,1]^{k^2},\qquad \hat\rho = (\rho_{ij})_{(i,j)\in[k]^2\setminus\{(k,k)\}} \mapsto L(\hat\rho) = (L_{ij}(\hat\rho))_{i,j\in[k]},
\]
where $L_{ij}(\hat\rho) = \rho_{ij}$ for $(i,j) \ne (k,k)$ and $L_{kk}(\hat\rho) = k - \sum_{(i,j)\ne(k,k)}\rho_{ij}$. Let $\hat R_1 = L^{-1}(R_1)$. Then $L$ induces a bijection $\hat R_1 \to R_1$. Thus,
\[
  \sum_{\rho\in R_1}\exp(nf(\rho)) = \sum_{\hat\rho\in\hat R_1}\exp\big(n\cdot f\circ L(\hat\rho)\big). \tag{92}
\]

To study the function $f\circ L = H\circ L + E\circ L$ for $\hat\rho \in \hat R_1$, we compute its first two differentials. A direct calculation yields, for $(i,j) \ne (k,k)$ and $(a,b) \notin \{(i,j),(k,k)\}$,
\begin{align*}
  \frac{\partial}{\partial\rho_{ij}}\, H\circ L(\hat\rho) &= \frac{1}{k}\ln\frac{L_{kk}(\hat\rho)}{L_{ij}(\hat\rho)},\\
  \frac{\partial^2}{\partial\rho_{ij}^2}\, H\circ L(\hat\rho) &= -\frac{1}{k}\left[\frac{1}{L_{ij}(\hat\rho)} + \frac{1}{L_{kk}(\hat\rho)}\right],\\
  \frac{\partial^2}{\partial\rho_{ij}\,\partial\rho_{ab}}\, H\circ L(\hat\rho) &= -\frac{1}{k\,L_{kk}(\hat\rho)}.
\end{align*}

Furthermore, letting $F(\rho) = \|\rho\|_2^2$, we find
\[
  \frac{\partial}{\partial\rho_{ij}}\, F\circ L(\hat\rho) = 2\big(L_{ij}(\hat\rho) - L_{kk}(\hat\rho)\big),\qquad
  \frac{\partial^2}{\partial\rho_{ij}^2}\, F\circ L(\hat\rho) = 4,\qquad
  \frac{\partial^2}{\partial\rho_{ij}\,\partial\rho_{ab}}\, F\circ L(\hat\rho) = 2.
\]

Thus, the chain rule, (73) and (74) yield
\begin{align*}
  \frac{\partial}{\partial\rho_{ij}}\, E\circ L(\hat\rho) &= \frac{\big(L_{ij}(\hat\rho) - L_{kk}(\hat\rho)\big)\cdot d}{k^2\big(1 - 2/k + F\circ L(\hat\rho)/k^2\big)},\\
  \frac{\partial^2}{\partial\rho_{ij}^2}\, E\circ L(\hat\rho) &= \frac{2d}{k^2\big(1 - 2/k + F\circ L(\hat\rho)/k^2\big)} - \frac{2d\big(L_{ij}(\hat\rho) - L_{kk}(\hat\rho)\big)^2}{k^4\big(1 - 2/k + F\circ L(\hat\rho)/k^2\big)^2} = O(1/k),\\
  \frac{\partial^2}{\partial\rho_{ij}\,\partial\rho_{ab}}\, E\circ L(\hat\rho) &= \frac{d}{k^2\big(1 - 2/k + F\circ L(\hat\rho)/k^2\big)} - \frac{2d\big(L_{ij}(\hat\rho) - L_{kk}(\hat\rho)\big)\big(L_{ab}(\hat\rho) - L_{kk}(\hat\rho)\big)}{k^4\big(1 - 2/k + F\circ L(\hat\rho)/k^2\big)^2} = O(1/k).
\end{align*}

In particular, we see that $Df\circ L(k^{-1}\mathbf{1}) = 0$. Furthermore, for $\eta$ small enough the second partial differentials of $E\circ L$ are all positive for all $\hat\rho \in \hat R_1$. Hence, the Hessian of $E\circ L$ is positive definite, while that of $H\circ L$ is negative definite; since the entries of the former are only $O(1/k)$ while those of the latter are of order one, the sum of the two Hessians remains negative definite. More precisely, writing $\bar\rho = k^{-1}\mathbf{1}$ for the flat matrix with all entries equal to $1/k$, we find that there exist constants $\xi > 0$, $\eta > 0$ such that for all $\hat\rho \in \hat R_1$ we have
\[
  f\circ L(\hat\rho) \le f(\bar\rho) - \xi\sum_{(i,j)\ne(k,k)}\big(\rho_{ij} - 1/k\big)^2. \tag{93}
\]
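The vanishing gradient and the negative definiteness behind (93) can be sanity-checked numerically for a small $k$. The sketch below uses the same reconstructed forms of $H$ and $E$ as the sketch after Lemma 15.22 (again an assumption of ours, not a quotation) and central finite differences at the flat point.

```python
import numpy as np

k = 5
d = 2 * k * np.log(k) - 2 * np.log(k)   # d = 2k ln k - ln k - r with the admissible choice r = ln k

def f_of_L(x):
    # x holds the k^2 - 1 entries rho_ij with (i,j) != (k,k); the last entry makes the total sum k
    rho = np.append(x, k - x.sum()).reshape(k, k)
    H = np.log(k) + (1.0 / k) * np.sum(-rho * np.log(rho))          # assumed form of H
    E = (d / 2) * np.log(1 - 2.0 / k + np.sum(rho ** 2) / k ** 2)   # assumed form of E
    return H + E

flat = np.full(k * k - 1, 1.0 / k)
h, n = 1e-5, k * k - 1
grad, hess = np.zeros(n), np.zeros((n, n))
for a in range(n):
    ea = np.zeros(n); ea[a] = h
    grad[a] = (f_of_L(flat + ea) - f_of_L(flat - ea)) / (2 * h)
    for b in range(n):
        eb = np.zeros(n); eb[b] = h
        hess[a, b] = (f_of_L(flat + ea + eb) - f_of_L(flat + ea - eb)
                      - f_of_L(flat - ea + eb) + f_of_L(flat - ea - eb)) / (4 * h ** 2)

print("max |gradient| at the flat point:", np.abs(grad).max())        # ~ 0, i.e. the gradient vanishes
print("largest Hessian eigenvalue:", np.linalg.eigvalsh(hess).max())  # < 0: negative definite
```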


Thus, we obtain
\begin{align*}
  \sum_{\rho\in R_1} \mathrm{E}\left[Z_\rho\right]
  &\le \exp\big(f(\bar\rho)n\big)\cdot O\big(n^{(1-k^2)/2}\big)\sum_{\hat\rho\in\hat R_1}\exp\Big[-n\xi\sum_{(i,j)\ne(k,k)}\big(\rho_{ij} - 1/k\big)^2\Big]\\
  &\le \exp\big(f(\bar\rho)n\big)\cdot O(1)\int_{\mathbb{R}^{k^2-1}}\exp\Big[-\xi\sum_{(i,j)\ne(k,k)}\big(z_{ij} - 1/k\big)^2\Big]\,\mathrm dz\\
  &\le \exp\big(f(\bar\rho)n\big)\cdot O(1)\left[\int_{-\infty}^{\infty}\exp\left[-\xi z^2\right]\mathrm dz\right]^{k^2-1} = O(1)\cdot\exp\big(f(\bar\rho)n\big),
\end{align*}
as desired. Here, in the second step, we bounded the sum over the grid $\hat R_1$ (whose points are spaced $1/n$ apart in each coordinate) by $n^{k^2-1}$ times the corresponding integral and substituted $z_{ij} = 1/k + \sqrt{n}\,(\rho_{ij} - 1/k)$; the factor $n^{(k^2-1)/2}$ that this produces is absorbed by the prefactor $O(n^{(1-k^2)/2})$.
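This grid-versus-integral comparison is the usual Laplace-type estimate; per coordinate the sum contributes a factor of order $\sqrt{n}$, which is exactly what one power of the prefactor absorbs. A one-dimensional numerical illustration (the constants below are arbitrary choices of ours, not from the lecture):

```python
import numpy as np

xi_const, c = 0.7, 0.2     # arbitrary positive constant and centre of the quadratic
for n in (10 ** 3, 10 ** 4, 10 ** 5):
    grid = np.arange(n + 1) / n                         # one coordinate of the grid, spacing 1/n
    s = np.exp(-n * xi_const * (grid - c) ** 2).sum()   # one-dimensional analogue of the sum above
    gauss = np.sqrt(np.pi * n / xi_const)               # sqrt(n) * integral of exp(-xi z^2)
    print(n, s, gauss, s / gauss)                       # ratio tends to 1, so the sum is O(sqrt(n))
```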

15.4 Extensions and outlook

Together with other methods, the proof of Theorem 15.10 actually yields the asymptotic number of $k$-colorings of the random graph $G(n,m)$. In fact, let us denote by $Z_k = Z_k(G(n,m))$ the total number of $k$-colorings of the random graph. This can be viewed as the partition function of the Potts antiferromagnet on $G(n,m)$ at zero temperature. Thus, it seems natural to consider the free entropy density
\[
  \varphi(d) = \lim_{n\to\infty}\frac{1}{n}\,\mathrm{E}\left[\ln\max\{Z_k(G(n,m)),\,1\}\right].
\]
(The $\max\{\cdot,\cdot\}$ is needed because $Z_k(G(n,m))$ may be equal to zero.) Unfortunately, we do not currently know that the limit $\varphi(d)$ exists for all $d$. However, it is possible to prove the following.

Theorem 15.23 ([5]) There is $\varepsilon_k = o_k(1)$ such that the following is true. Suppose that
\[
  d \le (2k-1)\ln k - 2\ln 2 - \varepsilon_k.
\]
Then
\[
  \varphi(d) = \ln k + \frac{d}{2}\ln(1 - 1/k). \tag{94}
\]

This theorem matches the Bethe free entropy that one obtains by applying Belief Propagation to an infinite random tree in which the variable nodes (which correspond to the vertices of $G(n,m)$) have degree $\mathrm{Po}(d)$ and the function nodes (which correspond to the edges of $G(n,m)$ and impose the condition that adjacent vertices must be colored differently) have degree two.

However, formula (94) breaks down for $d$ beyond $(2k-1)\ln k - 2\ln 2$. More precisely, we have

Theorem 15.24 ([5]) There is $\varepsilon_k = o_k(1)$ such that the following is true. Suppose that
\[
  d \ge (2k-1)\ln k - 2\ln 2 + \varepsilon_k.
\]
Then either the limit $\varphi(d)$ does not exist, or
\[
  \varphi(d) < \ln k + \frac{d}{2}\ln(1 - 1/k). \tag{95}
\]

Theorem 15.24 shows that the limit $\varphi(d)$ is non-analytic as a function of $d$ (in fact, the limit may not exist for some $d$). Thus, one could say that the point $d_c = (2k-1)\ln k - 2\ln 2 + o_k(1)$ marks a phase transition. Beyond this phase transition, the Bethe free entropy obtained from Belief Propagation on the infinite tree does not match the free entropy on $G(n,m)$ anymore!
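To get a feeling for where this transition sits, the short computation below (a numeric aside of ours, not from the lecture) tabulates $d_c$ and the value of the right-hand side of (94) at $d = d_c$ for a few $k$; the latter stays positive and appears to be roughly $\ln 2/k$.

```python
import math

for k in (3, 5, 10, 20, 100):
    d_c = (2 * k - 1) * math.log(k) - 2 * math.log(2)   # location of the phase transition
    phi = math.log(k) + d_c / 2 * math.log(1 - 1 / k)   # right-hand side of (94) at d = d_c
    print(k, round(d_c, 3), round(phi, 4), round(math.log(2) / k, 4))
```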

In the physics literature, this conundrum is resolved through the notion of replica symmetry breaking. This leads to an extension of Belief Propagation, called Survey Propagation (see [14] for further reading). Currently, only a few rigorous results about this extension exist (in any model).


References

[1] D. Achlioptas, A. Naor: The two possible values of the chromatic number of a random graph. Annals of Mathematics 162 (2005) 1333–1349.

[2] N. Alon, J. Spencer: The probabilistic method. Wiley (2nd ed. or later).

[3] B. Bollobas: Modern graph theory. Springer 1998.

[4] B. Bollobas: Random graphs. Cambridge University Press (2nd ed.).

[5] A. Coja-Oghlan, D. Vilenchik: Chasing the k-colorability threshold. arXiv:1304.1063 [cs.DM].

[6] R. Durrett: Probability theory and examples. Thomson (3rd ed.).

[7] R. Durrett: Random graph dynamics. Cambridge University Press.

[8] P. Erdos, A. Renyi: On the evolution of random graphs. Magyar Tud. Akad. Mat. Kutato Int. Kozl. 5 (1960) 17–61.

[9] W. Feller: An introduction to probability theory and its applications, volume 1. Wiley 1966.

[10] S. Janson, T. Łuczak, A. Rucinski: Random graphs. Wiley 2000.

[11] M. Krivelevich, B. Sudakov: The phase transition in random graphs – a simple proof. Preprint (2012).

[12] D. Levin, Y. Peres, E. Wilmer: Markov chains and mixing times. AMS 2009.

[13] T. Łuczak: A note on the sharp concentration of the chromatic number of random graphs. Combinatorica 11 (1991) 295–297.

[14] M. Mezard, A. Montanari: Information, physics and computation. Oxford University Press 2009.

[15] M. Talagrand: Spin glasses: a challenge for mathematicians. Springer 2003.

[16] M. Talagrand: The Parisi formula. Annals of Mathematics 163 (2006) 221–263.