
LARGE DEVIATIONS

S. R. S. VARADHAN

Date: Spring, 2010.


1. Introduction

The theory of large deviations deals with the rates at which probabilities of certain events decay as a natural parameter in the problem varies. It is best to think of a specific example to clarify the idea. Let us suppose that we have $n$ independent and identically distributed random variables $X_i$ having mean zero and variance one with a common distribution $\mu$. The distribution of $\bar{X}_n = \frac{X_1+\cdots+X_n}{n}$ converges to the degenerate distribution at $0$, while $Z_n = \frac{X_1+\cdots+X_n}{\sqrt{n}}$ has a limiting normal distribution according to the Central Limit Theorem.

In particular
$$\lim_{n\to\infty} P[Z_n \ge \ell] = \frac{1}{\sqrt{2\pi}}\int_\ell^\infty e^{-x^2/2}\,dx. \tag{1.1}$$

While the convergence is uniform in $\ell$, it does not say much if $\ell$ is large. A natural question to ask is whether the ratio
$$\frac{P[Z_n \ge \ell]}{\frac{1}{\sqrt{2\pi}}\int_\ell^\infty e^{-x^2/2}\,dx}$$
tends to $1$ even when $\ell \to \infty$. It depends on how rapidly $\ell$ grows. If $\ell \ll \sqrt{n}$ it holds under suitable conditions. But if $\ell \simeq \sqrt{n}$ it clearly does not: for instance, if the $X_i$ are bounded by a constant $C$ and $\ell > C\sqrt{n}$, then $P[Z_n \ge \ell] = 0$ while the Gaussian probability is not. Large deviations concern the regime $\ell \simeq \sqrt{n}$; when $\ell \ll \sqrt{n}$ one speaks of "moderate deviations". While moderate deviations are refinements of the Central Limit Theorem, large deviations are different. It is better to think of them as estimating the probability
$$p_n(\ell) = P[X_1 + \cdots + X_n \ge n\ell] = P[\bar{X}_n \ge \ell].$$

It is expected that this probability decays exponentially. In the Gaussian case it is clear that
$$p_n(\ell) = \frac{\sqrt{n}}{\sqrt{2\pi}}\int_\ell^\infty e^{-nx^2/2}\,dx = e^{-n\frac{\ell^2}{2} + o(n)} = e^{-n I(\ell) + o(n)}.$$
The function $I(\ell) = \frac{\ell^2}{2}$ reflects the fact that the common distribution of the $X_i$ was Gaussian to begin with.

Following Cramér, let us take for the common distribution of our random variables $X_1, X_2, \cdots, X_n$ an arbitrary distribution $\mu$. We shall try to estimate $P[\bar{X}_n \in A]$ where $A = \{x : x \ge \ell\}$ for some $\ell > m$, $m$ being the mean of the distribution $\mu$. Under suitable assumptions $\mu_n(A)$ should again decay exponentially fast in $n$ as $n\to\infty$, and our goal is to find the precise exponential constant. In other words, we want to calculate
$$\lim_{n\to\infty}\frac{1}{n}\log\mu_n(A) = -I_\mu(\ell)$$
as explicitly as possible. The attractive feature of large deviation theory is that such objects can be readily computed. If we denote the sum by $S_n = \sum_{i=1}^n X_i$, then
$$\mu_n(A) = P[S_n \ge n\ell]$$


can be estimated by the standard Chebyshev-type bound
$$\mu_n(A) \le e^{-\theta n\ell}\, E\big[\exp[\theta S_n]\big]$$
for any $\theta > 0$. Denoting the moment generating function of the underlying distribution $\mu$ by
$$M(\theta) = \int e^{\theta x}\,d\mu$$
and its logarithm by
$$\psi(\theta) = \log M(\theta),$$
we obtain the obvious inequality
$$\mu_n(A) \le e^{-\theta n\ell}\,[M(\theta)]^n,$$
and by taking logarithms on both sides and dividing by $n$,
$$\frac{1}{n}\log\mu_n(A) \le -\theta\ell + \psi(\theta).$$
Since the above inequality is valid for every $\theta \ge 0$ we should optimize and get
$$\frac{1}{n}\log\mu_n(A) \le -\sup_{\theta\ge 0}\,[\theta\ell - \psi(\theta)].$$

By an application of Jensen's inequality we can see that if $\theta < 0$, then $\psi(\theta) \ge m\theta \ge \ell\theta$. Because $0$ is always a trivial lower bound, replacing $\sup_{\theta\ge 0}[\theta\ell - \psi(\theta)]$ by $\sup_\theta[\theta\ell - \psi(\theta)]$ does not increase its value. Therefore we have
$$\frac{1}{n}\log\mu_n(A) \le -\sup_\theta\,[\theta\ell - \psi(\theta)]. \tag{1.2}$$
It is in fact more convenient to introduce the conjugate function
$$h(\ell) = \sup_\theta\,[\theta\ell - \psi(\theta)], \tag{1.3}$$
which is seen to be a nonnegative convex function of $\ell$ with a minimum value of $0$ at the mean $\ell = m$. From the convexity we see that $h(\ell)$ is nonincreasing for $\ell \le m$ and nondecreasing for $\ell \ge m$. We can now rewrite our upper bound as
$$\frac{1}{n}\log\mu_n(A) \le -h(\ell)$$

for $\ell \ge m$. A similar statement is valid for sets of the form $A = \{x : x \le \ell\}$ with $\ell \le m$. We shall now prove that the upper bounds that we obtained are optimal. To do this we need effective lower bounds. Let us first assume that $\ell = m$. In this case we know by the central limit theorem that
$$\lim_{n\to\infty}\mu_n(A) = \frac{1}{2}$$
and clearly it follows that
$$\liminf_{n\to\infty}\frac{1}{n}\log\mu_n(A) \ge 0 = -h(m).$$


To do the general case let us assume that the distribution $\mu$ does not live on any proper subinterval of $(-\infty,\infty)$. Then $\psi(\cdot)$ grows superlinearly at $\infty$ and the supremum in $\sup_\theta[\theta\ell - \psi(\theta)]$ is attained at some value $\theta_0$ of $\theta$. Equating the derivative to $0$,
$$\ell = \psi'(\theta_0) = \frac{M'(\theta_0)}{M(\theta_0)} = \frac{1}{M(\theta_0)}\int x\,e^{\theta_0 x}\,d\mu(x). \tag{1.4}$$
If we now define a new probability distribution $\mu'$ by the relation
$$\frac{d\mu'}{d\mu}(x) = \frac{e^{\theta_0 x}}{M(\theta_0)}, \tag{1.5}$$
then $\mu'$ has $\ell$ for its expected value. If we denote by $\mu'_n$ the distribution of the mean of the $n$ random variables under the assumption that their common distribution is $\mu'$ rather than $\mu$, an elementary calculation yields
$$\mu_n(A) = \int_A\big[M(\theta_0)\,e^{-\theta_0 x}\big]^n\,d\mu'_n(x).$$
To get a lower bound let us replace $A$ by $A_\delta = \{x : \ell \le x \le \ell+\delta\}$ for some $\delta > 0$. Then
$$\mu_n(A) \ge \mu_n(A_\delta) \ge \big[M(\theta_0)\,e^{-\theta_0(\ell+\delta)}\big]^n\,\mu'_n(A_\delta).$$
Again applying the central limit theorem, but now to $\mu'_n$, we see that $\mu'_n(A_\delta)\to\frac{1}{2}$, and taking logarithms we get
$$\liminf_{n\to\infty}\frac{1}{n}\log\mu_n(A) \ge \psi(\theta_0) - \theta_0(\ell+\delta).$$
Since $\delta > 0$ was arbitrary we can let it go to $0$ and obtain
$$\liminf_{n\to\infty}\frac{1}{n}\log\mu_n(A) \ge \psi(\theta_0) - \theta_0\ell = -h(\ell).$$
A similar proof works for intervals of the form $A = \{x : x \le \ell\}$ as well. We have therefore calculated the precise exponential decay rate of the relevant probabilities.
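The change of measure in this proof is also a practical recipe: under the tilted distribution $\mu'$ the event $\{\bar{X}_n \ge \ell\}$ is typical, so it can be simulated efficiently and reweighted by the Radon--Nikodym factor $[M(\theta_0)e^{-\theta_0 x}]^n$ appearing above. Here is a minimal numerical sketch (ours, not part of the notes) for standard Gaussian $X_i$, where $\psi(\theta) = \theta^2/2$, $\theta_0 = \ell$ and $h(\ell) = \ell^2/2$:

```python
import numpy as np

def tilted_estimate(ell, n, samples=100_000, seed=0):
    """Estimate p_n(ell) = P[(X_1+...+X_n)/n >= ell] for i.i.d. N(0,1) X_i
    by sampling the mean under the tilted law mu' (each X_i ~ N(theta0, 1))."""
    rng = np.random.default_rng(seed)
    theta0 = ell                    # psi'(theta) = theta for N(0,1), so theta0 = ell
    psi0 = theta0 ** 2 / 2          # psi(theta0)
    xbar = rng.normal(theta0, 1.0 / np.sqrt(n), size=samples)  # mean under mu'_n
    weights = np.exp(n * (psi0 - theta0 * xbar))               # [M(theta0) e^{-theta0 x}]^n
    return np.mean(weights * (xbar >= ell))

ell, n = 1.0, 50
print(tilted_estimate(ell, n), np.exp(-n * ell**2 / 2))  # estimate vs. leading order e^{-n h(ell)}
```

The estimate comes out smaller than $e^{-n h(\ell)}$ by a polynomial prefactor, consistent with $p_n(\ell) = e^{-n h(\ell) + o(n)}$.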

Remark 1.1. This way of computing the probability of a rare event, by changing the underlying model to one where the rare event is no longer rare and using the explicit form of the Radon--Nikodym derivative to estimate the probability, goes back to Cramér--Lundberg in the 1930's. The problem is to estimate
$$P\big[\sup_t X(t) \ge \ell\big] = q(\ell), \tag{1.6}$$
where $X(t) = \sum_{\tau_i\le t}\xi_i - mt$ is all the claims paid up to time $t$ less the premiums collected, i.e. the net total loss up to time $t$, and $\ell$ is the initial capital of an insurance company. The claims follow a compound Poisson process with the Lévy--Khinchine representation
$$\log E\big[\exp[\theta X(t)]\big] = t\,\psi(\theta) = t\lambda\int(e^{\theta x} - 1)\,dF(x) - m\theta t, \tag{1.7}$$
where $\lambda$ is the claims rate, $F$ is the distribution of the individual claim amounts, and $m$ is the rate at which premium is being collected. It is assumed that the company is profitable, i.e. $\lambda\int x\,dF(x) < m$. If $\ell$ is large the probability of ever running out of cash should be small.


The problem is to estimate $q(\ell)$ for large $\ell$. First consider the imbedded random walk, which is the net outflow of cash after each claim is paid: $S_{n+1} = S_n + \xi_{n+1} - \tau_{n+1} m = S_n + Y_{n+1}$, where $\xi_{n+1}$ is the amount of the new claim and $\tau_{n+1}$ is the gap between claims, during which a premium of $m\tau_{n+1}$ was collected. The idea of "tilting" is to find $\theta_0 > 0$ so that
$$E[e^{\theta_0 Y}] = E[e^{\theta_0(\xi - m\tau)}] = E[e^{\theta_0\xi}]\,E[e^{-m\theta_0\tau}] = 1.$$
Solving,
$$M(\theta_0) = \frac{\lambda + m\theta_0}{\lambda}.$$
Such a $\theta_0 \ne 0$ exists because $E[Y] < 0$. Tilt so that $(\xi,\tau)$ is now distributed as $e^{\theta_0 y}\,dF(y)\;\lambda e^{-m\theta_0\tau}e^{-\lambda\tau}\,d\tau$. The tilted process $Q$ will have a net positive outflow: we have made the claims more frequent and more expensive, while keeping the premium the same. It will run out of cash. Let $\sigma_\ell$ be the overshoot, i.e. the shortfall when that happens. For the tilted process $E_Q[e^{-\theta_0 Y}] = 1$, and
$$q(\ell) = E_Q\big[e^{-\theta_0(\ell+\sigma_\ell)}\big] = e^{-\theta_0\ell}\,E_Q\big[e^{-\theta_0\sigma_\ell}\big].$$
$E_Q[e^{-\theta_0\sigma_\ell}]$ has a limit as $\ell\to\infty$ that is calculated by the renewal theorem.

Remark 1.2. If the underlying distribution were the standard Gaussian, then we see that $h(\ell) = \frac{\ell^2}{2}$.
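Indeed (a one-line check), $\psi(\theta) = \log\int e^{\theta x}\frac{e^{-x^2/2}}{\sqrt{2\pi}}\,dx = \frac{\theta^2}{2}$, so
$$h(\ell) = \sup_\theta\Big[\theta\ell - \frac{\theta^2}{2}\Big] = \frac{\ell^2}{2},$$
the supremum being attained at $\theta_0 = \ell$.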

Remark 1.3. One should really think of $h(\ell)$ as giving the "local" decay rate of the probability at or near $\ell$. Because $a^n + b^n \le 2[\max(a,b)]^n$ and the factor $2$ leaves no trace when we take logarithms and divide by $n$, the global decay rate is really the worst local decay rate. The correct form of the result should state
$$\lim_{n\to\infty}\frac{1}{n}\log\mu_n(A) = -\inf_{x\in A} h(x).$$
Using the monotonicity of $h$ on either side of $m$ one can see that this is indeed correct for sets of the form $A = (-\infty,\ell)$ or $(\ell,\infty)$. In fact the upper bound
$$\limsup_{n\to\infty}\frac{1}{n}\log\mu_n(A) \le -\inf_{x\in\bar{A}} h(x)$$
is easily seen to be valid for more or less any set $A$. Just consider the smallest set of the form $(-\infty,a]\cup[b,\infty)$, with $a \le m \le b$, that contains $A$. While $a$ and $b$ may not be in $A$, they clearly belong to $\bar{A}$, the closure of $A$. If we use the continuity of $h$ we see that $A$ can be arbitrary; if we do not want to use the continuity of $h$, then it is surely true for closed sets. The lower bound is more of a problem. In the Gaussian case, if $A$ is a set of Lebesgue measure zero, then $\mu_n(A) = 0$ for all $n$ and the lower bound clearly does not hold. Or, in the coin tossing or Binomial case, if $A$ contains only irrationals, we are again out of luck. We can see that the proof we gave works if the point $\ell$ is an interior point of $A$. Therefore it is natural to assume for the lower bound that $A$ is an open set, just as it is natural to assume that $A$ is closed for the upper bound.


Exercise 1. For coin tossing, where $\mu$ is the discrete distribution with masses $p$ and $q = 1-p$ at $1$ and $0$ respectively, calculate explicitly the function $h(x)$. In this case our proof of the lower bound may not be valid because $\mu$ lives on a bounded interval. Provide two proofs for the lower bound: one by explicit combinatorial calculation and Stirling's formula, and the other by an argument that extends our proof to the general case.

Exercise 2. Extend the result to the case where $X_1, X_2, \cdots, X_n$ take values in $R^d$ with a common distribution $\mu$ on $R^d$. Chebyshev's inequality gives estimates of probabilities for half-spaces, and for a ball we can try to find the half-space that gives the optimal estimate. If the ball is small this is adequate. A compact set is then covered by a finite number of small balls, and a closed set is approximated by closed bounded sets by cutting it off. This will give the upper bound for closed sets; the lower bound for open sets presents no additional difficulties in the multidimensional case.

Exercise 3. Let us look at the special case of $\mu$ on $R^d$ that has probabilities $\pi_1, \pi_2, \cdots, \pi_d$ (with $\sum_i\pi_i = 1$) at the unit vectors $e_1, e_2, \cdots, e_d$ in the coordinate directions. Calculate explicitly the rate function $h(x)$ in this case. Is there a way of recovering from this example the one-dimensional result for a general discrete $\mu$ with $d$ mass points?

Exercise 4. If we replace the compound Poisson process by Brownian motion with a positive drift, $x(t) = \beta(t) + mt$ with $m > 0$, then
$$q(\ell) = P\big[\inf_{t\ge 0} x(t) \le -\ell\big] = e^{-2m\ell}.$$

References:
1. H. Cramér, Collective Risk Theory, Stockholm, 1955.
2. H. Cramér, Sur un nouveau théorème limite de la théorie des probabilités, Actualités Scientifiques, Paris, 1938.

2. General Principles

In this section we will develop certain basic principles that govern the theory of large deviations. We will be working for the most part on Polish (i.e. complete separable metric) spaces. Let $X$ be such a space. A function $I(\cdot): X\to[0,\infty]$ will be called a (proper) rate function if it is lower semicontinuous and the level sets $K_\ell = \{x : I(x)\le\ell\}$ are compact in $X$ for every $\ell < \infty$. Let $P_n$ be a sequence of probability distributions on $X$. We say that $P_n$ satisfies the large deviation principle (LDP) on $X$ with rate function $I(\cdot)$ if the following two statements hold. For every closed set $C\subset X$,
$$\limsup_{n\to\infty}\frac{1}{n}\log P_n(C) \le -\inf_{x\in C} I(x),$$


and for every open set $G\subset X$,
$$\liminf_{n\to\infty}\frac{1}{n}\log P_n(G) \ge -\inf_{x\in G} I(x).$$

Remark 2.1. The value of $+\infty$ is allowed for the function $I(\cdot)$. All the infima are effectively taken only on the set where $I(\cdot) < \infty$.

Remark 2.2. The examples of the previous section are instances where the LDP is valid. One should check that the assumption of compact level sets holds in those cases; this verification is left as an exercise.

Remark 2.3. Under our assumptions the infimum $\inf_x I(x)$ is clearly attained, and since $P_n(X) = 1$, this infimum has to be $0$. If we define $K_0 = \{x : I(x) = 0\}$, then $K_0$ is a compact set and the sequence $P_n$ is tight, with any limit point $Q$ of the sequence satisfying $Q(K_0) = 1$. In particular, if $K_0$ is a single point $x_0$ of $X$, then $P_n \Rightarrow \delta_{x_0}$, i.e. $P_n$ converges weakly to the distribution degenerate at the point $x_0$.
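A simple concrete instance (ours, for illustration): on $X = R$, let $P_n$ be the distribution of a $N(0,\frac{1}{n})$ random variable. The LDP holds with $I(x) = \frac{x^2}{2}$; the level sets $K_\ell = \{|x|\le\sqrt{2\ell}\}$ are compact, $K_0 = \{0\}$, and indeed $P_n\Rightarrow\delta_0$.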

There are certain general properties that are easy to prove.

Theorem 2.4. Suppose $P_n$ is a sequence that satisfies an LDP on $X$ with respect to a proper rate function $I(\cdot)$. Then for any $\ell < \infty$ there exists a compact set $D_\ell\subset X$ such that
$$P_n(D_\ell) \ge 1 - e^{-n\ell}$$
for every $n$.

Proof. Let us consider the compact level set $A_\ell = \{x : I(x)\le\ell+2\}$. From the compactness of $A_\ell$ it follows that for each $k$ it can be covered by a finite number of open balls of radius $\frac{1}{k}$. The union, which we will denote by $U_k$, is an open set, and $I(x) \ge \ell+2$ on the closed set $U_k^c$. By the LDP, for every $k$,
$$\limsup_{n\to\infty}\frac{1}{n}\log P_n(U_k^c) \le -(\ell+2).$$
In particular there is an $n_0 = n_0(k)$ such that for $n\ge n_0$,
$$P_n(U_k^c) \le e^{-n(\ell+1)}.$$
We can assume without loss of generality that $n_0(k)\ge k$ for every $k\ge 1$. We then have, for $n\ge n_0$,
$$P_n(U_k^c) \le e^{-k}e^{-n\ell}.$$
For $j = 1, 2, \cdots, n_0(k)$ we can find compact sets $B_{k,1}, B_{k,2}, \cdots, B_{k,n_0}$ such that $P_j(B_{k,j}^c)\le e^{-k}e^{-j\ell}$ for every $j$. Let us define $E = \cap_k\big[U_k\cup(\cup_j B_{k,j})\big]$. Then $E$ is clearly totally bounded, and
$$P_n(E^c) \le \Big[\sum_{k\ge 1}e^{-k}\Big]e^{-n\ell} \le e^{-n\ell}.$$
Taking $D_\ell$ to be the closure of $E$, we are done.


Remark 2.5. The conclusion of the theorem, which is similar in spirit to Prohorov's tightness condition in the context of weak convergence, will be called superexponential tightness.

There are some elementary relationships where the validity of an LDP in one context implies the same in other related situations. We shall develop a couple of them.

Theorem 2.6. Suppose $P_n$ and $Q_n$ are two sequences on two spaces $X$ and $Y$ satisfying the LDP with rate functions $I(\cdot)$ and $J(\cdot)$ respectively. Then the sequence of product measures $R_n = P_n\times Q_n$ on $X\times Y$ satisfies an LDP with the rate function $K(x,y) = I(x) + J(y)$.

Proof. The proof is typical of large deviation arguments, and we will go through it once for the record.

Step 1. Let us pick a point $z = (x,y)$ in $Z = X\times Y$ and let $\varepsilon > 0$ be given. We wish to show that there exists an open set $N = N_{z,\varepsilon}$ containing $z$ such that
$$\limsup_{n\to\infty}\frac{1}{n}\log R_n(N) \le -K(z) + \varepsilon.$$
Let us find an open set $U_1$ in $X$ such that $I(x')\ge I(x) - \frac{1}{2}\varepsilon$ for all $x'\in U_1$. This is possible by the lower semicontinuity of $I(\cdot)$. By general separation theorems in a metric space we can find an open set $U_2$ such that $x\in U_2\subset\bar{U}_2\subset U_1$. By the LDP of the sequence $P_n$,
$$\limsup_{n\to\infty}\frac{1}{n}\log P_n(U_2) \le \limsup_{n\to\infty}\frac{1}{n}\log P_n(\bar{U}_2) \le -\inf_{x'\in\bar{U}_2} I(x') \le -I(x) + \frac{1}{2}\varepsilon.$$
We can repeat a similar construction around $y$ in $Y$ to get
$$\limsup_{n\to\infty}\frac{1}{n}\log Q_n(V_2) \le \limsup_{n\to\infty}\frac{1}{n}\log Q_n(\bar{V}_2) \le -\inf_{y'\in\bar{V}_2} J(y') \le -J(y) + \frac{1}{2}\varepsilon$$
for open sets $V_1, V_2$ with $y\in V_2\subset\bar{V}_2\subset V_1$. If we take $N = U_2\times V_2$ as the neighborhood of $z = (x,y)$ we are done.

Step 2. Let $D\subset Z = X\times Y$ be a compact set and let $\varepsilon > 0$ be given. We will show that for some neighborhood $D^\varepsilon$ of $D$,
$$\limsup_{n\to\infty}\frac{1}{n}\log R_n(D^\varepsilon) \le -\inf_{z\in D} K(z) + \varepsilon.$$
We know from Step 1 that for each $z\in D$ there is a neighborhood $N_z$ such that
$$\limsup_{n\to\infty}\frac{1}{n}\log R_n(N_z) \le -K(z) + \varepsilon \le -\inf_{z\in D} K(z) + \varepsilon.$$
From the open covering $\{N_z\}$ of $D$ we extract a finite subcover $\{N_j : 1\le j\le k\}$, and if we take $D^\varepsilon = \cup_j N_j$, then
$$R_n(D^\varepsilon) \le \sum_j R_n(N_j) \le k\sup_j R_n(N_j).$$


Since $k$ leaves no trace after taking logarithms, dividing by $n$ and passing to the limit, we get
$$\limsup_{n\to\infty}\frac{1}{n}\log R_n(D^\varepsilon) \le -\inf_{z\in D} K(z) + \varepsilon.$$
In particular, because $\varepsilon > 0$ is arbitrary, we get
$$\limsup_{n\to\infty}\frac{1}{n}\log R_n(D) \le -\inf_{z\in D} K(z).$$

Step 3. From the superexponential tightness, for any given $\ell$ there are compact sets $A_\ell$ and $B_\ell$ in $X$ and $Y$ respectively such that $P_n([A_\ell]^c)\le e^{-n\ell}$ and $Q_n([B_\ell]^c)\le e^{-n\ell}$. If we define the compact set $C_\ell\subset Z$ by $C_\ell = A_\ell\times B_\ell$, then $R_n([C_\ell]^c)\le 2e^{-n\ell}$. We can complete the proof of the theorem by writing any closed set $C$ as the union $C = [C\cap C_\ell]\cup[C\cap(C_\ell)^c]$. An easy calculation yields
$$\limsup_{n\to\infty}\frac{1}{n}\log R_n(C) \le \max\Big(-\inf_{z\in C\cap C_\ell} K(z),\ -\ell\Big) \le \max\Big(-\inf_{z\in C} K(z),\ -\ell\Big),$$
and we can let $\ell\to\infty$ to obtain the upper bound for arbitrary closed sets.

Step 4. The lower bound is much simpler. We need only prove that if $z\in Z$ is arbitrary and $N$ is a neighborhood of $z$ in $Z$, then
$$\liminf_{n\to\infty}\frac{1}{n}\log R_n(N) \ge -K(z).$$
Since any neighborhood of $z$ contains a product $U\times V$ of neighborhoods $U$ and $V$, the lower bound for $R_n$ follows easily from the lower bounds in the LDP for $P_n$ and $Q_n$.

The next result has to do with the behavior of the LDP under a continuous mapping from one space to another and is referred to as the "contraction principle".

Theorem 2.7. If $P_n$ satisfies an LDP on $X$ with rate function $I(\cdot)$, and $F$ is a continuous mapping from the Polish space $X$ to another Polish space $Y$, then the family $Q_n = P_n F^{-1}$ satisfies an LDP on $Y$ with the rate function $J(\cdot)$ given by
$$J(y) = \inf_{x : F(x) = y} I(x).$$

Proof. Let $C\subset Y$ be closed. Then $D = F^{-1}C = \{x : F(x)\in C\}$ is a closed subset of $X$, and
$$\limsup_{n\to\infty}\frac{1}{n}\log Q_n(C) = \limsup_{n\to\infty}\frac{1}{n}\log P_n(D) \le -\inf_{x\in D} I(x) = -\inf_{y\in C}\inf_{x : F(x)=y} I(x) = -\inf_{y\in C} J(y).$$


The lower bound is proved just as easily from the definition:
$$\liminf_{n\to\infty}\frac{1}{n}\log Q_n(U) = \liminf_{n\to\infty}\frac{1}{n}\log P_n(V) \ge -\inf_{x\in V} I(x) = -\inf_{y\in U}\inf_{x : F(x)=y} I(x) = -\inf_{y\in U} J(y),$$
where $V = F^{-1}U\subset X$ is the open set corresponding to the open set $U\subset Y$.
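As an illustration of how the last two theorems combine (ours, added for the record): if $P_n$ and $Q_n$ satisfy LDPs on $R^d$ with rates $I$ and $J$, then applying Theorem 2.7 with the continuous map $F(x,y) = x+y$ to the product measures of Theorem 2.6, the distributions of the sums satisfy an LDP with rate
$$K(z) = \inf_{u+v=z}\,[I(u) + J(v)],$$
the inf-convolution of $I$ and $J$.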

Suppose we want to study the behavior of integrals of the form
$$a_n = \int_0^1 e^{nF(x)}\,d\mu_n(x),$$
where $d\mu_n(x) = e^{-nI(x)}\,dx$. It is clear that the contribution comes mostly from the point where $F(x) - I(x)$ achieves its maximum. In fact, under mild conditions it is easy to see that
$$\lim_{n\to\infty}\frac{1}{n}\log a_n = \sup_x\,[F(x) - I(x)].$$
This essentially remains true under an LDP.

Theorem 2.8. Assume that $P_n$ satisfies an LDP with rate function $I(\cdot)$ on $X$. Suppose that $F(\cdot)$ is a bounded continuous function on $X$. Then
$$\lim_{n\to\infty}\frac{1}{n}\log a_n = \sup_x\,[F(x) - I(x)],$$
where
$$a_n = \int_X e^{nF(x)}\,dP_n(x).$$

Proof. The basic idea of the proof is that a sum of a finite number of exponentials grows like the largest of them, and since they are all nonnegative there is no chance of any cancellation.

Step 1. Let $\varepsilon > 0$ be given. For every $x\in X$ there is a neighborhood $N_x$ such that $F(x')\le F(x)+\varepsilon$ on $N_x$ and $I(x')\ge I(x)-\varepsilon$ on the closure $\bar{N}_x$ of $N_x$. We have here used the lower semicontinuity of $I(\cdot)$ and the continuity (in fact only the upper semicontinuity) of $F(\cdot)$. If we denote by
$$a_n(A) = \int_A e^{nF(x)}\,dP_n(x),$$
we have shown that for any $x\in X$ there is a neighborhood $N_x$ of $x$ such that
$$a_n(N_x) \le e^{n[F(x)+\varepsilon]}P_n(N_x) \le e^{n[F(x)+\varepsilon]}P_n(\bar{N}_x)$$


and by the LDP
$$\limsup_{n\to\infty}\frac{1}{n}\log a_n(N_x) \le [F(x) - I(x)] + 2\varepsilon.$$

Step 2. By a standard compactness argument, for any compact $K\subset X$,
$$\limsup_{n\to\infty}\frac{1}{n}\log a_n(K) \le \sup_x\,[F(x) - I(x)] + 2\varepsilon.$$

Step 3. As for the contribution from the complement of a compact set: by superexponential tightness, for any $\ell < \infty$ there exists a compact set $K_\ell$ such that
$$P_n(K_\ell^c) \le e^{-n\ell},$$
and if $F$ is bounded by a constant $M$, the contribution from $K_\ell^c$ can be estimated by
$$\limsup_{n\to\infty}\frac{1}{n}\log a_n(K_\ell^c) \le M - \ell.$$
Putting the two pieces together we get
$$\limsup_{n\to\infty}\frac{1}{n}\log a_n = \limsup_{n\to\infty}\frac{1}{n}\log a_n(X) \le \max\Big(M-\ell,\ \sup_x\,[F(x)-I(x)] + 2\varepsilon\Big).$$
If we let $\varepsilon\to 0$ and $\ell\to\infty$ we are done.

Step 4. The lower bound is elementary. Let $x\in X$ be arbitrary. By the continuity of $F(\cdot)$ (in fact lower semicontinuity this time), $F(x')\ge F(x)-\varepsilon$ in some neighborhood $N_x$ of $x$, and
$$a_n = a_n(X) \ge a_n(N_x) \ge e^{n[F(x)-\varepsilon]}P_n(N_x).$$
Taking logarithms, dividing by $n$ and letting $n\to\infty$, we get
$$\liminf_{n\to\infty}\frac{1}{n}\log a_n \ge [F(x) - I(x)] - \varepsilon.$$
We let $\varepsilon\to 0$ and take the supremum over $x\in X$ to get our result.
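A quick numerical sanity check of the theorem (ours, not from the notes): take $P_n = N(0,\frac{1}{n})$ on $R$, so that $I(x) = \frac{x^2}{2}$, and the bounded continuous $F(x) = \cos x$; then $\sup_x[F(x)-I(x)] = 1$, attained at $x = 0$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400
x = rng.normal(0.0, 1.0 / np.sqrt(n), size=200_000)   # samples from P_n = N(0, 1/n)
# (1/n) log of the Monte Carlo average of e^{n cos x}, computed stably via log-sum-exp
log_an = np.logaddexp.reduce(n * np.cos(x)) - np.log(x.size)
print(log_an / n)   # should be close to sup_x [cos x - x^2/2] = 1
```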

Remark 2.9. We have used only the upper semicontinuity of $F$ for the upper bound and the lower semicontinuity of $F$ for the lower bound. It is often necessary to weaken the regularity assumptions on $F$, and this has to be done with care. More on this later, when we start applying the theory to specific circumstances.

Remark 2.10. The boundedness of $F$ is only needed for the upper bound; the lower bound is purely local. For unbounded $F$ we could surely try our luck with truncation and try to estimate the errors. With some control this can be done.

Finally, we end the section by transferring the LDP from $P_n$ to $Q_n$, where $Q_n$ is defined by
$$Q_n(A) = \frac{\int_A e^{nF(x)}\,dP_n(x)}{\int_X e^{nF(x)}\,dP_n(x)}$$
for Borel subsets $A\subset X$.


Theorem 2.11. If $P_n$ satisfies an LDP with rate function $I(\cdot)$ and $F$ is a bounded continuous function on $X$, then $Q_n$ defined above satisfies an LDP on $X$ as well, with the new rate function $J(\cdot)$ given by
$$J(x) = \sup_{x'\in X}\,[F(x') - I(x')] - [F(x) - I(x)].$$

Proof. The proof essentially repeats the arguments of Theorem 2.8. In the notation of that theorem, along the lines of the proof there, one can establish for closed $C$
$$\limsup_{n\to\infty}\frac{1}{n}\log a_n(C) \le \sup_{x\in C}\,[F(x) - I(x)].$$
Now
$$\log Q_n(C) = \log a_n(C) - \log a_n,$$
and we get
$$\limsup_{n\to\infty}\frac{1}{n}\log Q_n(C) \le \sup_{x\in C}\,[F(x)-I(x)] - \sup_{x\in X}\,[F(x)-I(x)] = -\inf_{x\in C} J(x).$$
This gives us the upper bound; the lower bound is just as easy and is left as an exercise.


1. Sanov's Theorem

Here we consider a sequence of i.i.d. random variables with values in some complete separable metric space $\mathcal{X}$ with a common distribution $\alpha$. The empirical distribution
$$\beta_n = \frac{1}{n}\sum_{j=1}^n \delta_{x_j}$$
maps $\mathcal{X}^n\to\mathcal{M}(\mathcal{X})$, and the product measure $\alpha^n$ generates a measure $P_n$ on the space $\mathcal{M}(\mathcal{X})$: the distribution of the empirical distribution. The law of large numbers implies, with a little extra work, that $P_n\to\delta_\alpha$; i.e. for large $n$ the empirical distribution $\beta_n$ is very close to the true distribution $\alpha$, close here being in the sense of weak convergence. We want to prove a large deviation result for $P_n$.

Theorem 1.1. The sequence $P_n$ satisfies a large deviation principle on $\mathcal{M}(\mathcal{X})$ with the rate function $I(\beta)$ given by
$$I(\beta) = h_\alpha(\beta) = h(\alpha;\beta) = +\infty$$
unless $\beta \ll \alpha$ and $\frac{d\beta}{d\alpha}\,\big|\log\frac{d\beta}{d\alpha}\big|$ is in $L_1(\alpha)$, in which case
$$I(\beta) = \int\frac{d\beta}{d\alpha}\log\frac{d\beta}{d\alpha}\,d\alpha = \int\log\frac{d\beta}{d\alpha}\,d\beta.$$
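A special case worth recording (ours): if $\alpha$ gives masses $p$ and $1-p$ to the two points $1$ and $0$, and $\beta$ gives masses $x$ and $1-x$, then
$$I(\beta) = x\log\frac{x}{p} + (1-x)\log\frac{1-x}{1-p},$$
which is exactly the rate function $h(x)$ of the coin-tossing example, as it must be: the contraction principle applied to the (here injective) map $\beta\mapsto\int z\,d\beta(z)$ recovers Cramér's theorem from Sanov's.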

Before we begin the proof of the theorem, we prove a useful lemma.

Lemma 1.2. Let $\alpha, \beta$ be two probability distributions on a measure space $(\mathcal{X},\mathcal{B})$. Let $B(\mathcal{X})$ be the space of bounded measurable functions on $(\mathcal{X},\mathcal{B})$. Then
$$I(\beta) = \sup_{f\in B(\mathcal{X})}\Big[\int f(x)\,d\beta(x) - \log\int e^{f(x)}\,d\alpha(x)\Big].$$

Proof. The functions $x\log x$ and $e^y - 1$ are dual to each other in the sense that
$$x\log x - x + 1 = \sup_y\,[xy - (e^y - 1)],\qquad e^y - 1 = \sup_x\,[xy - (x\log x - x + 1)].$$
If $b(x) = \frac{d\beta}{d\alpha}$, then
$$\int f(x)\,d\beta = \int f(x)\,b(x)\,d\alpha(x) \le \int\Big[[b(x)\log b(x) - b(x) + 1] + [e^{f(x)} - 1]\Big]\,d\alpha(x) = I(\beta) + \int e^{f(x)}\,d\alpha(x) - 1.$$

Writing $f(x) = [f(x) - c] + c$ and optimizing with respect to $c$,
$$\int f(x)\,d\beta \le \inf_c\Big[I(\beta) + c + \int[e^{f(x)-c} - 1]\,d\alpha(x)\Big] = I(\beta) + \log\int e^{f(x)}\,d\alpha(x),$$
with the choice $c = \log\int e^{f(x)}\,d\alpha(x)$. On the other hand, if for every $f$
$$\int f(x)\,d\beta(x) \le C + \log\int e^{f(x)}\,d\alpha(x),$$
then taking $f(x) = \lambda\mathbf{1}_A(x)$,
$$\beta(A) \le \frac{1}{\lambda}\Big[C + \log\big[e^\lambda\alpha(A) + (1-\alpha(A))\big]\Big].$$
With $\lambda = \log\frac{1}{\alpha(A)}$,
$$\beta(A) \le \frac{C + 2}{\log\frac{1}{\alpha(A)}},$$
proving $\beta\ll\alpha$. Let $b(x) = \frac{d\beta}{d\alpha}$, so that for every bounded $f$
$$\int f(x)\,b(x)\,d\alpha(x) \le C + \log\int e^{f(x)}\,d\alpha(x).$$
We would like to pick $f(x) = \log b(x)$, but $\log b(x)$ may not be a bounded function. One can, however, truncate and pass to the limit. Note that $(b\log b)^-$ is bounded; the trouble comes only from the positive part $(b\log b)^+$, and this is controlled by the upper bound. If $\mathcal{X}$ is a nice metric space we can replace $B(\mathcal{X})$ by $C(\mathcal{X})$, the space of bounded continuous functions: one can use Lusin's theorem to approximate $b(x)$ by bounded continuous functions with respect to both $\alpha$ and $\beta$ and pass to the limit. We will then have
$$I(\beta) = \sup_{f\in C(\mathcal{X})}\Big[\int f(x)\,d\beta(x) - \log\int e^{f(x)}\,d\alpha(x)\Big].$$

We now turn to the proof of the theorem.

Proof. First we note that $I(\beta)$ is convex and lower semicontinuous in the topology of weak convergence of probability distributions.

To prove that $D_\ell = \{\beta : I(\beta)\le\ell\}$ are compact sets, we need to produce, for each $\epsilon > 0$, a compact set $K_\epsilon$ such that $\beta(K_\epsilon^c)\le\epsilon$ for all $\beta\in D_\ell$. Let us pick $K_\epsilon$ so that $\alpha(K_\epsilon^c)\le e^{-\frac{\ell+2}{\epsilon}}$. Then for any $\beta\in D_\ell$,
$$\beta(K_\epsilon^c) \le \frac{\ell+2}{\log\frac{1}{\alpha(K_\epsilon^c)}} \le \epsilon.$$


We now show that, given any $\beta$ and any $\epsilon > 0$, there is an open set $U_\beta$, a small neighborhood of $\beta$, such that
$$\limsup_{n\to\infty}\frac{1}{n}\log P_n(U_\beta) \le -I(\beta) + 2\epsilon.$$
First we pick $f$ so that
$$\int f(x)\,d\beta(x) - \log\int e^{f(x)}\,d\alpha(x) \ge I(\beta) - \epsilon.$$
Then
$$E\big[e^{n\int f(x)\,d\beta_n(x)}\big] = E\big[\exp[f(x_1)+f(x_2)+\cdots+f(x_n)]\big] = \Big[\int e^{f(x)}\,d\alpha(x)\Big]^n.$$
If $U_\beta = \{\gamma : |\int f(x)\,d\beta - \int f(x)\,d\gamma(x)| < \epsilon\}$, then by Chebyshev's inequality
$$\limsup_{n\to\infty}\frac{1}{n}\log P_n(U_\beta) \le -\int f(x)\,d\beta + \epsilon + \log\int e^{f(x)}\,d\alpha(x) \le -I(\beta) + 2\epsilon.$$

If $D$ is any compact subset of $\mathcal{M}(\mathcal{X})$, then since $D$ can be covered by a finite number of the $U_\beta$, we can conclude that for any compact $D\subset\mathcal{M}(\mathcal{X})$
$$\limsup_{n\to\infty}\frac{1}{n}\log P_n(D) \le -\inf_{\beta\in D} I(\beta) + 2\epsilon,$$
and since $\epsilon$ is arbitrary we actually have
$$\limsup_{n\to\infty}\frac{1}{n}\log P_n(D) \le -\inf_{\beta\in D} I(\beta).$$

If we can show that for any $\ell < \infty$ there is some compact set $D_\ell\subset\mathcal{M}(\mathcal{X})$ such that
$$\limsup_{n\to\infty}\frac{1}{n}\log P_n(D_\ell^c) \le -\ell,$$
then it would follow that for any closed set $C$
$$\limsup_{n\to\infty}\frac{1}{n}\log P_n(C) \le -\inf_{\beta\in C} I(\beta),$$
by writing $P_n(C)\le P_n(C\cap D_\ell) + P_n(D_\ell^c)$: then
$$\limsup_{n\to\infty}\frac{1}{n}\log P_n(C) \le \max\Big\{-\inf_{\beta\in C} I(\beta),\ -\ell\Big\},$$
and by letting $\ell\to\infty$ we would get the upper bound. Let us pick $K_j$ so that $\alpha(K_j^c)\le\epsilon_j$, and let
$$D = \{\gamma : \gamma(K_j^c)\le\delta_j\ \text{for all}\ j\ge 1\}.$$
The number of sample points in $K_j^c$ is stochastically dominated by a Binomial $B(n,\epsilon_j)$ random variable, so
$$P_n(D^c) \le \sum_{j=1}^\infty P[\beta_n(K_j^c)\ge\delta_j] = \sum_j P[B(n,\epsilon_j)\ge n\delta_j] \le \sum_j\big[\epsilon_j e^{\theta_j} + (1-\epsilon_j)\big]^n e^{-n\theta_j\delta_j}.$$

Since we can choose $\delta_j\downarrow 0$, $\epsilon_j\downarrow 0$ and $\theta_j\uparrow\infty$ arbitrarily, we can do it in such a way that
$$\sum_j\big[\epsilon_j e^{\theta_j} + (1-\epsilon_j)\big]^n e^{-n\theta_j\delta_j} \le e^{-\ell n}.$$
For instance $\delta_j = \frac{1}{j}$, $\theta_j = j(\ell+\log 2+j)$, $\epsilon_j = e^{-\theta_j}$ will do it.
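A quick check of this choice (ours): since $\epsilon_j e^{\theta_j} = 1$,
$$\big[\epsilon_j e^{\theta_j} + (1-\epsilon_j)\big]^n e^{-n\theta_j\delta_j} \le 2^n e^{-n(\ell+\log 2+j)} = e^{-n(\ell+j)},$$
and summing over $j$ gives at most $e^{-n\ell}\sum_{j\ge 1}e^{-nj}\le e^{-n\ell}$. The set $D$ is compact by Prohorov's theorem: its elements are uniformly tight, and $D$ is closed under weak convergence because the sets $K_j^c$ are open.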

To prove the lower bound, we tilt the measure from $P_n$ to $Q_n$, based on i.i.d. variables with common distribution $\beta$ for each component. Let $U_\beta$ be a neighborhood of $\beta$ and let $b = \frac{d\beta}{d\alpha}$. Then
$$P_n(U_\beta) = \int_{\beta_n\in U_\beta}[b(x_1)b(x_2)\cdots b(x_n)]^{-1}\,d\beta(x_1)\cdots d\beta(x_n).$$
If $b(x) = 0$ on a set of positive $\alpha$ measure, we would still have the lower bound
$$P_n(U_\beta) \ge \int_{\beta_n\in U_\beta}[b(x_1)b(x_2)\cdots b(x_n)]^{-1}\,d\beta(x_1)\cdots d\beta(x_n).$$
In any case
$$P_n(U_\beta) \ge \int_{\{\beta_n\in U_\beta\}\cap\{|\int\log b(x)\,\beta_n(dx) - I(\beta)|\le\epsilon\}}[b(x_1)\cdots b(x_n)]^{-1}\,d\beta(x_1)\cdots d\beta(x_n)$$
$$\ge e^{-n[I(\beta)+\epsilon]}\int_{\{\beta_n\in U_\beta\}\cap\{|\int\log b(x)\,\beta_n(dx) - I(\beta)|\le\epsilon\}}d\beta(x_1)\cdots d\beta(x_n) = e^{-n[I(\beta)+\epsilon]}(1 + o(1))$$
by the law of large numbers. This completes the proofs of both the upper and lower bounds.

Sanov's theorem has the following corollary.

Corollary 1.3. Let $X_i$ be i.i.d. random variables with values in a separable Banach space $\mathcal{X}$ with a common distribution $\alpha$. Assume
$$E[e^{\theta\|X\|}] < \infty \quad\text{for all}\ \theta > 0.$$
Then the mean $\frac{1}{n}\sum_{i=1}^n X_i$ satisfies a large deviation principle with rate function
$$H(x) = \sup_{y\in\mathcal{X}^*}\Big[\langle y, x\rangle - \log\int e^{\langle y, z\rangle}\,d\alpha(z)\Big].$$

Before starting on the proof, let us establish a certain formula.

Lemma 1.4. Let $\alpha$ be a probability distribution on $\mathcal{X}$ with $\int e^{\theta\|z\|}\,d\alpha(z) < \infty$ for all $\theta > 0$. Let
$$M(y) = \int e^{\langle y, z\rangle}\,d\alpha(z)$$
and
$$H(x) = \sup_{y\in\mathcal{X}^*}\big[\langle y, x\rangle - \log M(y)\big].$$
Then
$$H(x) = \inf_{b:\ \int z\,b(z)\,d\alpha(z) = x}\ \int b(z)\log b(z)\,d\alpha(z).$$

Remark 1.5. The proof will depend on the following minimax theorem from convex analysis. Let $F(x,y)$ be a function on $X\times Y$ which is convex and lower semicontinuous in $y$ for each $x$, and concave and upper semicontinuous in $x$ for each $y$. Let $C_1\subset X$ and $C_2\subset Y$ be closed convex subsets. Let either $C_1$ or $C_2$ be compact, or let either all the level sets $D^\ell_x = \{y : F(x,y)\le\ell\}$ or all the level sets $D^\ell_y = \{x : F(x,y)\ge\ell\}$ be compact. Then
$$\inf_{y\in C_2}\sup_{x\in C_1} F(x,y) = \sup_{x\in C_1}\inf_{y\in C_2} F(x,y).$$

Proof. We consider the function
$$F(y, b(\cdot)) = \langle y, x\rangle - \int\langle y, z\rangle\,b(z)\,d\alpha(z) + \int b(z)\log b(z)\,d\alpha(z)$$
on $\mathcal{X}^*\times N$, where $N$ is the set of $b(z)$ that are nonnegative with $\int b(z)\,d\alpha(z) = 1$ and $\int b(z)\log b(z)\,d\alpha(z) < \infty$. On one hand
$$\sup_y\inf_b F(y, b(\cdot)) = \sup_y\big[\langle y, x\rangle - \log M(y)\big] = H(x),$$
while
$$\sup_y\Big[\langle y, x\rangle - \int\langle y, z\rangle\,b(z)\,d\alpha(z)\Big] = \infty$$
unless $x = \int z\,b(z)\,d\alpha(z)$, in which case it is $0$. Therefore
$$\inf_b\sup_y F(y, b(\cdot)) = \inf_{b:\ \int z\,b(z)\,d\alpha(z) = x}\ \int b(z)\log b(z)\,d\alpha(z).$$
It is not hard to verify the conditions for the minimax theorem to be applicable.

Proof (of Corollary). Step 1. First we assume that $\|X_i\|$ is bounded by $\ell$. Suppose $\beta_n$ is a sequence of probability measures supported in the ball of radius $\ell$, converging weakly to $\beta$. If $x_n = \int z\,d\beta_n(z)$, then $\langle y, x_n\rangle\to\langle y, x\rangle$, where $x = \int z\,d\beta(z)$, for each $y\in\mathcal{X}^*$. By Prohorov's condition there is a compact set $K_\epsilon$ that has probability at least $1-\epsilon$ under all the $\beta_n$ as well as under $\beta$. If $x_n^\epsilon = \int_{K_\epsilon} z\,d\beta_n(z)$, then $x_n^\epsilon$ lies in the closed convex hull of $K_\epsilon\cup\{0\}$, which is compact. Moreover $\|x_n^\epsilon - x_n\|\le\epsilon\ell$. This is enough to conclude that $\|x_n - x\|\to 0$: the map $\beta\mapsto\int z\,d\beta(z)$ is continuous on measures carried by the ball of radius $\ell$, and the contraction principle together with Lemma 1.4 yields the LDP with rate function $H$ in the bounded case.

Step 2. We truncate and write $X = Y^\ell + Z^\ell$, where $Y^\ell$ is supported in the ball of radius $\ell$. Since $E[e^{\theta\|Z^\ell\|}]\to 1$ as $\ell\to\infty$ for all $\theta > 0$, we see that
$$\limsup_{\ell\to\infty}\limsup_{n\to\infty}\frac{1}{n}\log P\Big[\frac{1}{n}\sum_{j=1}^n\|Z_j^\ell\|\ge\epsilon\Big] \le \limsup_{\ell\to\infty}\inf_{\theta>0}\big[-\theta\epsilon + \log E[e^{\theta\|Z^\ell\|}]\big] = -\infty.$$

Step 3. Finally,
$$P\Big[\frac{1}{n}\sum_{j=1}^n X_j\in C\Big] \le P\Big[\frac{1}{n}\sum_{j=1}^n Y_j^\ell\in C^\epsilon\Big] + P\Big[\frac{1}{n}\sum_{j=1}^n\|Z_j^\ell\|\ge\epsilon\Big].$$
If we denote by
$$H_\ell(x) = \sup_{y\in\mathcal{X}^*}\big[\langle y, x\rangle - \log E[e^{\langle y, Y^\ell\rangle}]\big],$$
then
$$\limsup_{n\to\infty}\frac{1}{n}\log P\Big[\frac{1}{n}\sum_{j=1}^n X_j\in C\Big] \le \max\Big\{-\inf_{x\in C^\epsilon} H_\ell(x),\ \limsup_{n\to\infty}\frac{1}{n}\log P\Big[\frac{1}{n}\sum_{j=1}^n\|Z_j^\ell\|\ge\epsilon\Big]\Big\}$$
$$\le -\liminf_{\epsilon\to 0}\liminf_{\ell\to\infty}\inf_{x\in C^\epsilon} H_\ell(x) = -\inf_{x\in C} H(x).$$
The last step needs the fact that if $H_\ell(x_\ell^\epsilon)$ stays bounded for some $x_\ell^\epsilon\in C^\epsilon$, then $\{x_\ell^\epsilon\}$ is relatively compact, any limit point $x^\epsilon$ is in $C^\epsilon$, and as $\epsilon\to 0$ this produces a limit point $x\in C$ with $H(x)\le\liminf_{\epsilon\to 0}\liminf_{\ell\to\infty}H_\ell(x_\ell^\epsilon)$. The following lemma justifies the step.

Lemma 1.6. Let $\mu_\sigma$, $\sigma\in S$, be a collection of probability measures on $\mathcal{X}$ satisfying the tightness condition that for any given $\epsilon > 0$ there exists a compact set $K_\epsilon\subset\mathcal{X}$ such that $\mu_\sigma(K_\epsilon)\ge 1-\epsilon$ for all $\sigma\in S$, and
$$\sup_{\sigma\in S}\int e^{\theta\|z\|}\,d\mu_\sigma(z) = m(\theta) < \infty$$
for every $\theta > 0$. Let
$$H_\sigma(x) = \sup_{y\in\mathcal{X}^*}\Big[\langle y, x\rangle - \log\int e^{\langle y, z\rangle}\,d\mu_\sigma(z)\Big].$$
Then the set $D_\ell = \cup_{\sigma\in S}\{x : H_\sigma(x)\le\ell\}$ has compact closure in $\mathcal{X}$.

Proof. If $H_\sigma(x_\sigma)\le\ell$, then there exists $b_\sigma$ such that $\int b_\sigma(z)\log b_\sigma(z)\,d\mu_\sigma(z)\le\ell$ and $x_\sigma = \int z\,d\beta_\sigma(z) = \int z\,b_\sigma(z)\,d\mu_\sigma(z)$. From the entropy bound $I(\beta_\sigma)\le\ell$ and the tightness of $\{\mu_\sigma\}$ it follows that $\{\beta_\sigma\}$ is tight and $\int\|z\|\,d\beta_\sigma(z)$ is uniformly integrable. Therefore $\{x_\sigma\}$ is relatively compact.

This concludes the proof of the corollary.

Gaussian distributions on a Banach space are defined by a mean and a covariance. Suppose $X$ is a Gaussian random variable with mean $0$ and covariance $B(y_1, y_2) = E[\langle y_1, X\rangle\langle y_2, X\rangle]$. Then the distribution of $X_1 + X_2 + \cdots + X_n$ is the same as that of $\sqrt{n}\,X$. If there is a large deviation principle for $\frac{1}{n}\sum_{j=1}^n X_j$, this gives a Gaussian tail behavior for $X$:
$$\limsup_{\lambda\to\infty}\frac{1}{\lambda^2}\log P[X\in\lambda C] \le -\inf_{x\in C} H(x),$$
where
$$H(x) = \sup_{y\in\mathcal{X}^*}\Big[\langle y, x\rangle - \frac{1}{2}B(y, y)\Big].$$
According to our result we only need to check that $E[e^{\theta\|X\|}] < \infty$ for all $\theta > 0$. There is a simple argument due to Fernique which shows that any Gaussian distribution on a Banach space must necessarily be such that $E[e^{\theta\|X\|^2}] < \infty$ for some $\theta > 0$. The proof uses the independence of $X+Y$ and $X-Y$ for independent, identically distributed Gaussian $X$ and $Y$: the pair $\frac{X+Y}{\sqrt{2}}$, $\frac{X-Y}{\sqrt{2}}$ is independent, each with the same distribution as $X$. If $\|X\|\ge\ell$ and $\|Y\|\le a$, we must have both $\|X+Y\|\ge\ell-a$ and $\|X-Y\|\ge\ell-a$. Therefore
$$P\big[\|X+Y\|\ge\ell-a,\ \|X-Y\|\ge\ell-a\big] = P\Big[\|X\|\ge\frac{\ell-a}{\sqrt{2}}\Big]^2 \ge P[\|X\|\ge\ell]\,P[\|Y\|\le a].$$
We can pick $a$ so that $P[\|Y\|\le a]\ge\frac{1}{2}$. We then have
$$P[\|X\|\ge\ell] \le 2P\Big[\|X\|\ge\frac{\ell-a}{\sqrt{2}}\Big]^2,$$
or
$$P\big[\|X\|\ge\sqrt{2}\,\ell + a\big] \le 2P[\|X\|\ge\ell]^2.$$
Iterating this inequality gives the necessary bound. Define $\ell_{n+1} = \sqrt{2}\,\ell_n + a$. Then for $T(\ell) = P[\|X\|\ge\ell]$ we have
$$T(\ell_{n+1}) \le 2[T(\ell_n)]^2, \quad\text{hence}\quad 2T(\ell_n) \le [2T(\ell_1)]^{2^{n-1}}.$$
With a proper choice of $a$ and $\ell_1$, $\ell_n\simeq C\,2^{n/2}$ and $T(\ell_n)\le\delta^{2^n}$ for some $\delta < 1$. That does it.
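To spell out the last step (ours): with $\ell_n\simeq C\,2^{n/2}$ and $T(\ell_n)\le\delta^{2^n}$,
$$E\big[e^{\theta\|X\|^2}\big] \le e^{\theta\ell_1^2} + \sum_n e^{\theta\ell_{n+1}^2}\,T(\ell_n) \le e^{\theta\ell_1^2} + \sum_n\exp\Big[2^n\Big(2\theta C^2 - \log\frac{1}{\delta}\Big)\Big] < \infty$$
as soon as $0 < \theta < \frac{1}{2C^2}\log\frac{1}{\delta}$.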

2. Schilder's Theorem.

One of the advantages of formulating the large deviation principle in fairly general terms is the ability to apply it in several infinite-dimensional contexts. The first such example is a result concerning the behavior of Brownian motion with a small parameter. Let us consider the family of stochastic processes $x_\epsilon(t)$ defined by
$$x_\epsilon(t) = \sqrt{\epsilon}\,\beta(t),$$
or, equivalently in distribution,
$$x_\epsilon(t) = \beta(\epsilon t),$$
for $t$ in some fixed time interval, say $[0,1]$, where $\beta(\cdot)$ is standard Brownian motion. The distributions of $x_\epsilon(\cdot)$ induce a family of scaled Wiener measures on $C[0,1]$ that we denote by $Q_\epsilon$. We are interested in establishing an LDP for $Q_\epsilon$ as $\epsilon\to 0$, now with normalization $\epsilon\log Q_\epsilon$. The rate function will turn out to be
$$I(f) = \frac{1}{2}\int_0^1[f'(t)]^2\,dt$$
if $f(0) = 0$ and $f(\cdot)$ is absolutely continuous with a square integrable derivative $f'(\cdot)$; otherwise $I(f) = +\infty$. The main theorem of this section, due to M. Schilder, is the following.

Theorem 2.1. The family $Q_\epsilon$ satisfies an LDP with rate function $I(\cdot)$.


Proof. First let us note that as soon as we have a bound on the rate function, say $I(f)\le\ell$, then $f$ satisfies a Hölder estimate:
$$|f(t) - f(s)| = \Big|\int_s^t f'(\sigma)\,d\sigma\Big| \le |t-s|^{1/2}\Big(\int_s^t[f'(\sigma)]^2\,d\sigma\Big)^{1/2} \le \sqrt{2\ell}\,|t-s|^{1/2}.$$
The lower semicontinuity of $I(\cdot)$ is obvious, and since $f(0) = 0$, by the Ascoli--Arzelà theorem the level sets are totally bounded and hence compact.

Now let us turn to the proof of the upper bound. Suppose $C\subset\Omega$ is a closed subset of the space $\Omega = C[0,1]$. For any $f$ in $\Omega$ and any positive integer $N$, let us divide the interval $[0,1]$ into $N$ equal subintervals and construct the piecewise linear approximation $f_N = \pi_N f$ by matching $f_N = f$ at the points $\{\frac{j}{N} : 0\le j\le N\}$ and interpolating linearly in between. An elementary calculation yields, writing $I_N(f) = I(\pi_N f)$,
$$I_N(f) = \frac{N}{2}\sum_{j=0}^{N-1}\Big[f\Big(\frac{j+1}{N}\Big) - f\Big(\frac{j}{N}\Big)\Big]^2.$$
We can estimate $Q_\epsilon(C)$ by
$$Q_\epsilon(C) = Q_\epsilon[f\in C] \le Q_\epsilon[f_N\in C^\delta] + Q_\epsilon[\|f_N - f\|\ge\delta],$$
where $\|\cdot\|$ is the supremum norm on $\Omega$ and $C^\delta$ is the $\delta$-neighborhood of the closed set $C$ in the uniform metric. If we denote
$$\ell_\delta = \inf_{f\in C^\delta} I(f) \quad\text{and}\quad \ell = \inf_{f\in C} I(f),$$
then by the compactness of the level sets and the lower semicontinuity of $I(\cdot)$ it follows that $\ell_\delta\uparrow\ell$ as $\delta\downarrow 0$. Clearly
$$Q_\epsilon[f_N\in C^\delta] \le Q_\epsilon[I_N(f)\ge\ell_\delta].$$

We know that, under $Q_\epsilon$, $2I_N(f)$ is essentially a sum of squares of $N$ independent identically distributed Gaussians with mean $0$, and has a scaled $\chi^2$ distribution with $N$ degrees of freedom:
$$Q_\epsilon[I_N(f)\ge\ell_\delta] = \Big(\frac{1}{\epsilon}\Big)^{N/2}\frac{1}{\Gamma(\frac{N}{2})}\int_{\ell_\delta}^\infty u^{\frac{N}{2}-1}\exp\Big[-\frac{u}{\epsilon}\Big]\,du.$$
It is elementary to conclude that for fixed $N$ and $\delta$
$$\limsup_{\epsilon\to 0}\,\epsilon\log Q_\epsilon[I_N(f)\ge\ell_\delta] \le -\ell_\delta.$$

As for the second term $Q_\epsilon[\|f_N - f\|\ge\delta]$,
$$\|f_N - f\| \le 2\sup_{0\le j\le N-1}\ \sup_{\frac{j}{N}\le t\le\frac{j+1}{N}}\Big|f(t) - f\Big(\frac{j}{N}\Big)\Big|,$$
and the events $\{\sup_{\frac{j}{N}\le t\le\frac{j+1}{N}}|f(t) - f(\frac{j}{N})|\ge\frac{\delta}{2}\}$ all have the same probability:
$$Q_\epsilon\Big[\sup_{\frac{j}{N}\le t\le\frac{j+1}{N}}\Big|f(t) - f\Big(\frac{j}{N}\Big)\Big|\ge\frac{\delta}{2}\Big] = Q_\epsilon\Big[\sup_{0\le t\le\frac{1}{N}}|f(t)|\ge\frac{\delta}{2}\Big] \le 2Q_\epsilon\Big[\sup_{0\le t\le\frac{1}{N}}f(t)\ge\frac{\delta}{2}\Big] \quad\text{(by symmetry)}$$
$$= 4Q_\epsilon\Big[f\Big(\frac{1}{N}\Big)\ge\frac{\delta}{2}\Big] \quad\text{(by the reflection principle)}.$$

We can now easily estimate $Q_\epsilon[\|f_N - f\|\ge\delta]$ by
$$Q_\epsilon[\|f_N - f\|\ge\delta] \le 4N\,Q_\epsilon\Big[f\Big(\frac{1}{N}\Big)\ge\frac{\delta}{2}\Big],$$
and for fixed $N$ and $\delta$,
$$\limsup_{\epsilon\to 0}\,\epsilon\log Q_\epsilon\big[\|f_N - f\|\ge\delta\big] \le -\frac{N\delta^2}{8}.$$

Now, combining both terms,
$$\limsup_{\epsilon\to 0}\,\epsilon\log Q_\epsilon[C] \le -\min\Big\{\ell_\delta,\ \frac{N\delta^2}{8}\Big\}.$$
Since $N$ and $\delta$ are arbitrary, we can let $N\to\infty$ first and then $\delta\to 0$ to get the upper bound
$$\limsup_{\epsilon\to 0}\,\epsilon\log Q_\epsilon[C] \le -\ell,$$
and we are done.

For the lower bound, in order to show that for open sets $G\subset\Omega$
$$\liminf_{\epsilon\to 0}\,\epsilon\log Q_\epsilon(G) \ge -\inf_{g\in G} I(g),$$
it is sufficient, because there is always a neighborhood $N_g$ of $g$ with $g\in N_g\subset G$, to show that for any $g\in\Omega$ and any neighborhood $N\ni g$,
$$\liminf_{\epsilon\to 0}\,\epsilon\log Q_\epsilon(N) \ge -I(g).$$
Actually, for any $g$ with $I(g) < \infty$, any neighborhood $N\ni g$ and any $\delta > 0$, there exists a smooth $h\in N$ with a neighborhood $N'$ satisfying $h\in N'\subset N$ and $I(h)\le I(g)+\delta$. So for the purpose of establishing the lower bound there is no loss of generality in assuming that $g$ is smooth. If we denote by $Q_{\epsilon,g}$ the measure obtained from $Q_\epsilon$ by the translation $f\to f-g$ mapping $\Omega\to\Omega$, then
$$Q_\epsilon[f : \|f-g\| < \delta] = Q_{\epsilon,g}[f : \|f\| < \delta] = \int_{\|f\|<\delta}\frac{dQ_{\epsilon,g}}{dQ_\epsilon}(f)\,dQ_\epsilon.$$

The Cameron--Martin formula evaluates the Radon--Nikodym derivative:
$$\frac{dQ_{\epsilon,g}}{dQ_\epsilon}(f) = \exp\Big[-\frac{1}{\epsilon}\int_0^1 g'(t)\,df(t) - \frac{1}{2\epsilon}\int_0^1[g'(t)]^2\,dt\Big].$$
The second term in the exponential is a constant and comes out of the integral as $\exp[-\frac{I(g)}{\epsilon}]$. As for the integral
$$\int_{\|f\|<\delta}\exp\Big[-\frac{1}{\epsilon}\int_0^1 g'(t)\,df(t)\Big]\,dQ_\epsilon,$$
we first note that, because $g$ is smooth, we can integrate by parts to rewrite the "$df$" integral as
$$\int_0^1 g'(t)\,df(t) = g'(1)f(1) - \int_0^1 f(t)g''(t)\,dt.$$
We restrict the integration to the set $\{\|f\| < \delta\}\cap\{f : \int_0^1 g'(t)\,df(t)\le 0\}$ and use the symmetry of $Q_\epsilon$ under $f\to -f$ to conclude that
$$\int_{\|f\|<\delta}\exp\Big[-\frac{1}{\epsilon}\int_0^1 g'(t)\,df(t)\Big]\,dQ_\epsilon \ge \frac{1}{2}Q_\epsilon[\|f\| < \delta].$$
Clearly, as $\epsilon\to 0$, $Q_\epsilon[\|f\| < \delta]\to 1$, and it follows that
$$\liminf_{\epsilon\to 0}\,\epsilon\log Q_\epsilon[f : \|f-g\| < \delta] \ge -I(g),$$
and the proof of the lower bound is complete.

Remark 2.2. A calculation involving the covariance $\min(s,t)$ of Brownian motion and its formal inverse quadratic form $\int_0^1[f'(t)]^2\,dt$ allows us to write the formal expression
$$dQ_\epsilon = \exp\Big[-\frac{1}{2\epsilon}\int_0^1[f'(t)]^2\,dt\Big]\,\Pi_t\,df(t),$$
from which our rate function can be guessed. But the above density is with respect to an infinite-dimensional Lebesgue measure that does not exist, and the density itself is an expression of doubtful meaning, since almost surely Brownian paths are nowhere differentiable. Nevertheless the LDP is valid, as we have shown. In fact the LDP we established for Gaussian distributions on a Banach space is applicable here, and it is not difficult to calculate for the covariance $\min(s,t)$ of Brownian motion that
$$\sup_g\Big[\int_0^T f(t)g(t)\,dt - \frac{1}{2}\int_0^T\!\!\int_0^T\min(s,t)\,g(s)g(t)\,ds\,dt\Big] = \frac{1}{2}\int_0^T[f'(t)]^2\,dt,$$
provided $f(0) = 0$ and $f'$ exists in $L^2[0,T]$.

Remark 2.3. This method of establishing the upper bound for approximations that can be handled directly, while controlling the error by superexponential estimates, will be a recurring theme. Similarly, the lower bound is often established for a suitably dense set of smooth points. The trick of Cramér, involving a change of measure so that the large deviation becomes typical and getting a lower bound in terms of the Radon--Nikodym derivative, will also be a recurring theme.
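As a numerical illustration of the normalization $\epsilon\log Q_\epsilon$ (ours, not part of the notes), consider the event $\{\sup_{t\le 1} f(t)\ge 1\}$, whose rate is $\inf\{I(f) : \sup_t f(t)\ge 1\} = \frac{1}{2}$, attained by the straight line $f(t) = t$. Convergence is slow because of logarithmic prefactor corrections, but the trend is visible:

```python
import numpy as np
rng = np.random.default_rng(2)

def crossing_prob(eps, paths=40_000, steps=128):
    """Monte Carlo estimate of Q_eps[ sup_{t<=1} f(t) >= 1 ] with f = sqrt(eps)*beta."""
    dW = rng.normal(0.0, np.sqrt(eps / steps), size=(paths, steps))
    return np.mean(np.cumsum(dW, axis=1).max(axis=1) >= 1.0)

for eps in (0.4, 0.2, 0.1):
    print(eps, eps * np.log(crossing_prob(eps)))   # decreases slowly toward -1/2
```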


1. Strassen's Law of the Iterated Logarithm.

Let $P$ be the Wiener measure on the space $\Omega = C[0,\infty)$ of continuous functions on $[0,\infty)$ that start at time $0$ from the point $0$. For $\lambda\ge 3$ we define the rescaled process
$$x_\lambda(t) = \frac{1}{\sqrt{\lambda\log\log\lambda}}\,x(\lambda t).$$
As $\lambda\to\infty$, $x_\lambda(t)$ will go to $0$ in probability with respect to $P$, but the convergence will not be almost sure. Strassen's theorem states:

Theorem 1.1. On any fixed time interval, say $[0,1]$, for almost all $\omega = x(\cdot)$, the family $\{x_\lambda(\cdot) : \lambda\ge 3\}$ is relatively compact and has as its set of limit points the compact set
$$K = \{f : I(f)\le 1\},$$
where
$$I(f) = \frac{1}{2}\int_0^1[f'(t)]^2\,dt.$$

Proof. We will divide the proof into several steps. The first part is to prove that if $K^\epsilon$ is a neighborhood of size $\epsilon$ around $K$, then for almost all $\omega$, $x_\lambda(\cdot)\in K^\epsilon$ for sufficiently large $\lambda$. This is proved in two steps.

First we sample $x_\lambda(\cdot)$ along a discrete sequence $\lambda = \rho^n$ for some $\rho > 1$ and show that almost surely, for any such $\rho$, $x_{\rho^n}(\cdot)\in K^{\epsilon/2}$ for sufficiently large $n$. This requires just the Borel--Cantelli lemma: we need to show that
$$\sum_n \Pr\big[x_{\rho^n}(\cdot)\in(K^{\epsilon/2})^c\big] < \infty.$$
We use the LDP proved in the last section to estimate, for any closed set $C$,
$$\Pr[x_{\rho^n}(\cdot)\in C] = P_{a_n}[C],$$
where $P_{a_n}$ is the Wiener measure scaled by $\frac{1}{\sqrt{\log\log\rho^n}}\sim\frac{1}{\sqrt{\log n}}$. From the results of the last section,
$$\log P_{a_n}[C] \le -[\ell-\delta]\log n$$
for sufficiently large $n$, and to complete this step we need only that
$$\ell = \inf_{f\in C} I(f) = \inf_{f\in(K^{\epsilon/2})^c} I(f) > 1,$$

which is obvious because we have removed a neighborhood of the set $K$ consisting of all $f$ with $I(f)\le 1$; choosing $\delta$ so small that $\ell-\delta > 1$ makes the series $\sum_n n^{-(\ell-\delta)}$ converge.

The second step is to show that the price for sampling is not too high. More precisely, we will show that almost surely
$$\limsup_{n\to\infty}\ \sup_{\rho^n\le\lambda\le\rho^{n+1}}\|x_\lambda(\cdot) - x_{\rho^{n+1}}(\cdot)\| \le \theta(\rho),$$
where $\theta(\rho)$ is nonrandom and tends to $0$ as $\rho\downarrow 1$. We then pick $\rho$ so that $\theta(\rho) < \frac{\epsilon}{2}$ to complete the proof. If $\lambda_2 > \lambda_1$ we can estimate
$$|x_{\lambda_2}(t) - x_{\lambda_1}(t)| \le \Big|\frac{1}{\sqrt{\lambda_2\log\log\lambda_2}} - \frac{1}{\sqrt{\lambda_1\log\log\lambda_1}}\Big|\,|x(\lambda_1 t)| + \frac{1}{\sqrt{\lambda_2\log\log\lambda_2}}\,|x(\lambda_2 t) - x(\lambda_1 t)|$$
$$\le \Big|\sqrt{\frac{\lambda_1\log\log\lambda_1}{\lambda_2\log\log\lambda_2}} - 1\Big|\,\|x_{\lambda_1}(\cdot)\| + \Big|x_{\lambda_2}(t) - x_{\lambda_2}\Big(\frac{\lambda_1}{\lambda_2}t\Big)\Big|.$$
Taking the supremum over $0\le t\le 1$,
$$\|x_{\lambda_2}(\cdot) - x_{\lambda_1}(\cdot)\| \le \Big|\sqrt{\frac{\lambda_2\log\log\lambda_2}{\lambda_1\log\log\lambda_1}} - 1\Big|\,\|x_{\lambda_2}(\cdot)\| + \sup_{|t-s|\le|\frac{\lambda_1}{\lambda_2}-1|}|x_{\lambda_2}(t) - x_{\lambda_2}(s)|.$$
If we now take $\lambda_2 = \rho^{n+1}$ and $\lambda_1 = \lambda$ with $\rho^n\le\lambda\le\rho^{n+1}$,
$$\limsup_{n\to\infty}\sup_{\rho^n\le\lambda\le\rho^{n+1}}\|x_\lambda(\cdot) - x_{\rho^{n+1}}(\cdot)\| \le |\sqrt{\rho}-1|\limsup_{n\to\infty}\|x_{\rho^{n+1}}(\cdot)\| + \limsup_{n\to\infty}\sup_{|t-s|\le 1-\frac{1}{\rho}}|x_{\rho^{n+1}}(t) - x_{\rho^{n+1}}(s)|.$$
One consequence of the result proved in the earlier step is that for any continuous functional $F:\Omega\to R$, almost surely,
$$\limsup_{n\to\infty} F(x_{\rho^n}(\cdot)) \le \sup_{f\in K} F(f).$$
Therefore, almost surely,
$$\limsup_{n\to\infty}\sup_{\rho^n\le\lambda\le\rho^{n+1}}\|x_\lambda(\cdot) - x_{\rho^{n+1}}(\cdot)\| \le |\sqrt{\rho}-1|\sup_{f\in K}\|f\| + \sup_{f\in K}\sup_{|t-s|\le 1-\frac{1}{\rho}}|f(t)-f(s)| = \theta(\rho),$$

and it is easily seen that $\theta(\rho)\to 0$ as $\rho\downarrow 1$.

Now we turn to the second part, where we need to prove that $x_\lambda(\cdot)$ returns infinitely often to any neighborhood of any point $f\in K$. We can assume without loss of generality that $I(f) = \ell < 1$, for such points are dense in $K$. Again we apply the Borel--Cantelli lemma, but now we need independence. Let us define $a_n = \rho^n - \rho^{n-1}$, and for $0\le t\le 1$
$$y_n(t) = \frac{1}{\sqrt{a_n\log\log a_n}}\big[x(\rho^{n-1} + a_n t) - x(\rho^{n-1})\big].$$
The distribution of $y_n(\cdot)$ is the same as that of Brownian motion scaled by $\frac{1}{\sqrt{\log\log a_n}}$, and from the LDP results of the last section, for any $\eta > 0$,
$$\log\Pr\Big[\|y_n(\cdot) - f\| < \frac{\delta}{2}\Big] \ge -(\ell+\eta)\log n$$
for sufficiently large $n$. Since $\ell+\eta < 1$ for small $\eta$, this shows that
$$\sum_n\Pr\Big[\|y_n(\cdot) - f\| < \frac{\delta}{2}\Big] = +\infty.$$


Because the $y_n(\cdot)$ are independent, by the Borel--Cantelli lemma $y_n(\cdot)$ returns infinitely often to the $\frac{\delta}{2}$-neighborhood of $f$. The last piece of the proof is then to show that, almost surely,
$$\limsup_{n\to\infty}\|x_{\rho^n}(\cdot) - y_n(\cdot)\| \le \theta(\rho),$$
where now $\theta(\rho)\to 0$ as $\rho\uparrow\infty$. We can then complete the proof by picking $\rho$ large enough that $\theta(\rho) < \frac{\delta}{2}$: then $x_{\rho^n}(\cdot)$ returns infinitely often to the $\delta$-neighborhood of $f$, for every $\delta > 0$, and hence so does $x_\lambda(\cdot)$.

Indeed,
$$|x_{\rho^n}(t) - y_n(t)| = \Big|\frac{x(\rho^n t)}{\sqrt{\rho^n\log\log\rho^n}} - \frac{x(\rho^{n-1} + a_n t) - x(\rho^{n-1})}{\sqrt{a_n\log\log a_n}}\Big|$$
$$\le \Big|\frac{1}{\sqrt{\rho^n\log\log\rho^n}} - \frac{1}{\sqrt{a_n\log\log a_n}}\Big|\,|x(\rho^n t)| + \frac{1}{\sqrt{a_n\log\log a_n}}\,|x(\rho^n t) - x(\rho^{n-1} + a_n t)| + \frac{1}{\sqrt{a_n\log\log a_n}}\,|x(\rho^{n-1})|$$
$$\le \Big|\sqrt{\frac{\rho^n\log\log\rho^n}{a_n\log\log a_n}} - 1\Big|\,|x_{\rho^n}(t)| + \sqrt{\frac{\rho^n\log\log\rho^n}{a_n\log\log a_n}}\,\Big|x_{\rho^n}(t) - x_{\rho^n}\Big(\frac{1}{\rho} + \Big[1-\frac{1}{\rho}\Big]t\Big)\Big| + \sqrt{\frac{\rho^n\log\log\rho^n}{a_n\log\log a_n}}\,\Big|x_{\rho^n}\Big(\frac{1}{\rho}\Big)\Big|.$$
Taking the supremum over $0\le t\le 1$,
$$\|x_{\rho^n}(\cdot) - y_n(\cdot)\| \le \Big|\sqrt{\frac{\rho^n\log\log\rho^n}{a_n\log\log a_n}} - 1\Big|\,\|x_{\rho^n}(\cdot)\| + \sqrt{\frac{\rho^n\log\log\rho^n}{a_n\log\log a_n}}\Big[\sup_{|t-s|\le\frac{1}{\rho}}|x_{\rho^n}(t) - x_{\rho^n}(s)| + \Big|x_{\rho^n}\Big(\frac{1}{\rho}\Big)\Big|\Big].$$
Again from the first part we conclude that, almost surely,
$$\limsup_{n\to\infty}\|x_{\rho^n}(\cdot) - y_n(\cdot)\| \le \Big|\sqrt{\frac{\rho}{\rho-1}} - 1\Big|\sup_{f\in K}\|f\| + \sqrt{\frac{\rho}{\rho-1}}\,\sup_{f\in K}\Big[\Big|f\Big(\frac{1}{\rho}\Big)\Big| + \sup_{|t-s|\le\frac{1}{\rho}}|f(t)-f(s)|\Big] = \theta(\rho).$$
It is easily checked that $\theta(\rho)\to 0$ as $\rho\uparrow\infty$. This concludes the proof.

Remark 1.2. As we commented earlier, we can calculate for any continuous $F:\Omega\to R$,
$$\limsup_{\lambda\to\infty} F(x_\lambda(\cdot)) = \sup_{f\in K} F(f)$$
almost surely. Some simple examples: if $F(f) = f(1)$ we get, almost surely,
$$\limsup_{t\to\infty}\frac{x(t)}{\sqrt{t\log\log t}} = \sqrt{2} = \sup_{f\in K} f(1),$$
or, if $F(f) = \sup_{0\le s\le 1}|f(s)|$, we get for almost all Wiener paths
$$\limsup_{t\to\infty}\frac{\sup_{0\le s\le t}|x(s)|}{\sqrt{t\log\log t}} = \sqrt{2} = \sup_{f\in K}\sup_{0\le s\le 1}|f(s)|.$$
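The value $\sqrt{2}$ is a one-line computation: for $f\in K$, by the Cauchy--Schwarz inequality,
$$f(1) = \int_0^1 f'(t)\,dt \le \Big(\int_0^1[f'(t)]^2\,dt\Big)^{1/2} = \sqrt{2I(f)} \le \sqrt{2},$$
with equality for $f(t) = \sqrt{2}\,t$, which has $I(f) = 1$.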

Remark 1.3. There is a way of recovering laws of the iterated logarithm for sums of independent random variables from Strassen's theorem for Brownian motion. This requires a concept known as Skorohod imbedding. If $X$ is a random variable with mean zero and variance $\sigma^2$, we find a stopping time $\tau$ (perhaps randomized) such that $E\tau = \sigma^2$ and $x(\tau)$ has the same distribution as $X$. Then the random walk gets imbedded in the Brownian motion, and the LIL for the random walk is deduced from the LIL for Brownian motion. For instance, if $X = \pm 1$ with probability $\frac{1}{2}$ each, $\tau$ is the hitting time of $\pm 1$. As an exercise, look up the reference and study the details.

2. Behavior of Diffusions with a Small Parameter.

In this section we will investigate the large deviation behavior of the family of diffusion processes $P^\epsilon_x$ corresponding to the generator
$$L_\epsilon = \frac{\epsilon^2}{2}\sum_{i,j} a_{i,j}(x)\frac{\partial^2}{\partial x_i\,\partial x_j} + \sum_j b_j(x)\frac{\partial}{\partial x_j}$$
that start from the point $x\in R^d$. The family $P^\epsilon_x$ will be viewed as a family of measures on the space $C[[0,T]; R^d]$ of continuous functions on $[0,T]$ with values in $R^d$. As we let $\epsilon\to 0$, the generator converges to the first order operator
$$L_0 = \sum_j b_j(x)\frac{\partial}{\partial x_j}.$$

If we impose enough regularity on $b(x) = \{b_j(\cdot)\}$ so that the trajectories of the ODE
$$\frac{dx(t)}{dt} = b(x(t))$$
are unique, then the processes $P^\epsilon_x$ will converge as $\epsilon\to 0$ to the degenerate distribution concentrated on the unique solution of the above ODE starting from the initial point $x(0) = x$. We are interested in the validity of an LDP for these measures $P^\epsilon_x$.

If we use the theory of stochastic differential equations, we would take a square root $\sigma$ such that $\sigma\sigma^* = a = \{a_{ij}\}$ and solve the SDE
$$dx(t) = \epsilon\,\sigma(x(t))\,d\beta(t) + b(x(t))\,dt$$
with $x(0) = x$. We would then view the solution $x(t)$ as a map $\Phi_{x,\epsilon}$ of the Wiener space $C[[0,T]; R^d]$, with the Wiener measure $Q$, back to $C[[0,T]; R^d]$, and this map induces $P^\epsilon_x$ as the image of $Q$. In fact we can absorb $\epsilon$ into $\beta$ and rewrite the SDE as
$$dx(t) = \sigma(x(t))\,d[\epsilon\beta(t)] + b(x(t))\,dt,$$
and think of the map $\Phi_x$ as independent of $\epsilon$, mapping the scaled Wiener measures $Q_\epsilon$ (here $Q_\epsilon$ denotes the distribution of $\epsilon\beta(\cdot)$, for which the LDP of the previous section holds with normalization $\epsilon^2\log Q_\epsilon$) into $P^\epsilon_x$. The advantage now is that we may try to appeal to the contraction principle and deduce the LDP of $P^\epsilon_x$ from that of $Q_\epsilon$. The map $\Phi_x : f(\cdot)\to x(\cdot)$ is defined by
$$dx(t) = \sigma(x(t))\,df(t) + b(x(t))\,dt,\qquad x(0) = x.$$
Let us ignore for the moment that this map is not defined everywhere and is far from continuous. At the level of rate functions we see the map as
$$g'(t) = \sigma(g(t))\,f'(t) + b(g(t)),\qquad g(0) = x,$$
and if we assume that $\sigma$, or equivalently $a$, is invertible,
$$f'(t) = \sigma^{-1}(g(t))\,[g'(t) - b(g(t))]$$
and
$$\frac{1}{2}\int_0^T\|f'(t)\|^2\,dt = \frac{1}{2}\int_0^T\langle[g'(t)-b(g(t))],\ a^{-1}(g(t))\,[g'(t)-b(g(t))]\rangle\,dt.$$

The reasoning above is not quite valid because the maps $\Phi_x$ are not continuous and the contraction principle is not directly applicable. We will have to replace $\Phi_x$ by continuous maps $\Phi_{n,x}$ and try to interchange limits. Although we could do it in one step, in order to illustrate our methods better we will perform this in two steps. Let us suppose first that $b\equiv 0$. The map $\Phi_{n,x}$ is defined by
$$x_n(t) = x + \int_0^t\sigma(x_n(\pi_n(s)))\,d\beta(s),$$
where
$$\pi_n(s) = \frac{[ns]}{n}.$$
Although $x_n(\cdot)$ appears to be defined implicitly, it is in fact defined explicitly, by induction on $j$, using the updating rule
$$x_n(t) = x_n\Big(\frac{j}{n}\Big) + \int_{\frac{j}{n}}^t\sigma\Big(x_n\Big(\frac{j}{n}\Big)\Big)\,d\beta(s)$$
for $\frac{j}{n}\le t\le\frac{j+1}{n}$. The maps $\Phi_{n,x}$ are clearly continuous, and the contraction principle applies to yield an LDP for the distribution $P^\epsilon_{n,x}$ of $\Phi_{n,x}$ under $Q_\epsilon$, with a rate function that is easily seen to equal
$$I_n(g) = \frac{1}{2}\int_0^T\langle g'(t),\ a^{-1}(g(\pi_n(t)))\,g'(t)\rangle\,dt$$
for functions $g$ with $g(0) = x$ that have a square integrable derivative; otherwise $I_n(g) = +\infty$. We will prove the following superexponential approximation theorem.

Theorem 2.1. For any $\delta > 0$ and compact set $K\subset R^d$,
$$\limsup_{n\to\infty}\limsup_{\epsilon\to 0}\sup_{x\in K}\,\epsilon^2\log Q_\epsilon\big[\|\Phi_{n,x}(\cdot) - \Phi_x(\cdot)\|\ge\delta\big] = -\infty.$$

If we have the approximation theorem, then it is straightforward to interchange the $\epsilon$ and $n$ limits.

Theorem 2.2. The measures $P^\epsilon_x$ satisfy an LDP with the rate function
$$I_x(f) = \frac{1}{2}\int_0^T\langle f'(t),\ a^{-1}(f(t))\,f'(t)\rangle\,dt$$
for functions $f(t)$ with a square integrable derivative that satisfy $f(0) = x$, and equal to $+\infty$ otherwise.

Proof. Let $C\subset C[[0,T]; R^d]$ be closed and let $\delta > 0$. Then
$$P^\epsilon_x[C] = Q_\epsilon\big[\Phi_x(\cdot)\in C\big] \le Q_\epsilon\big[\Phi_{n,x}(\cdot)\in C^\delta\big] + Q_\epsilon\big[\|\Phi_{n,x}(\cdot) - \Phi_x(\cdot)\|\ge\delta\big].$$
Taking logarithms, multiplying by $\epsilon^2$ and taking limsups,
$$\limsup_{\epsilon\to 0}\,\epsilon^2\log P^\epsilon_x[C] \le \max\{-a_{n,x}(\delta),\ b_{n,x}(\delta)\},$$
where
$$a_{n,x}(\delta) = \inf_{g\in C^\delta} I_{n,x}(g)$$
and $b_{n,x}(\delta) = \limsup_{\epsilon\to 0}\epsilon^2\log Q_\epsilon[\|\Phi_{n,x}(\cdot)-\Phi_x(\cdot)\|\ge\delta]$; in view of the superexponential estimate,
$$\limsup_{n\to\infty} b_{n,x}(\delta) = -\infty.$$
We therefore obtain, for every $\delta > 0$,
$$\limsup_{\epsilon\to 0}\,\epsilon^2\log P^\epsilon_x[C] \le -\limsup_{n\to\infty} a_{n,x}(\delta),$$
and finally, letting $\delta\to 0$, it is easily seen that
$$\limsup_{\delta\to 0}\limsup_{n\to\infty} a_{n,x}(\delta) \ge \inf_{g\in C} I_x(g).$$
To prove the lower bound, we take a small ball $B(g,\delta)$ around $g$:
$$P^\epsilon_x[B(g,\delta)] = Q_\epsilon[\Phi_x\in B(g,\delta)] \ge Q_\epsilon\Big[\Phi_{n,x}\in B\Big(g,\frac{\delta}{2}\Big)\Big] - Q_\epsilon\Big[\|\Phi_{n,x}(\cdot) - \Phi_x(\cdot)\|\ge\frac{\delta}{2}\Big].$$
For large $n$ the second term decays much faster than the first, and so we get the lower bound
$$\liminf_{\epsilon\to 0}\,\epsilon^2\log P^\epsilon_x[B(g,\delta)] \ge -\liminf_{n\to\infty} I_{n,x}(g) = -I_x(g),$$
and we are done.
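The approximating map $\Phi_{n,x}$ used above is precisely the Euler scheme for the SDE, with $\sigma$ frozen at the left endpoint of each interval $[\frac{j}{n},\frac{j+1}{n}]$. A minimal sketch (ours; the toy $\sigma$ and all names are illustrative only):

```python
import numpy as np

def euler_map(f, sigma, x0):
    """Phi_{n,x}: map a driving path f (values at times k/n, an (n+1, d) array)
    to the Euler solution of dx = sigma(x) df with x(0) = x0."""
    x = np.empty_like(f)
    x[0] = x0
    for k in range(len(f) - 1):
        # sigma frozen at the left endpoint, exactly as in Phi_{n,x}
        x[k + 1] = x[k] + sigma(x[k]) @ (f[k + 1] - f[k])
    return x

# driving path: eps * Brownian motion on [0, 1] in d = 1
n, eps, rng = 500, 0.3, np.random.default_rng(3)
f = np.concatenate([[0.0], np.cumsum(rng.normal(0.0, eps / np.sqrt(n), n))]).reshape(-1, 1)
sigma = lambda x: np.eye(1) * (1.0 + 0.5 * np.sin(x[0]))   # a toy bounded Lipschitz sigma
path = euler_map(f, sigma, np.zeros(1))
```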

We now return to the proof of the theorem on superexponential estimates. We will carry it out in several steps, each formulated as a lemma.


Lemma 2.3. Let $\xi(t)$ be a stochastic integral of the form
$$\xi(t) = \int_0^t\sigma(s,\omega)\,d\beta(s)$$
with values in $R^d$, with $\sigma$ satisfying the bound
$$\sigma(s,\omega)\,\sigma^*(s,\omega) \le C\,(\|\xi(s)\|^2 + \delta^2)\,I$$
for some constant $C$. Let $\alpha > 0$ be a positive number and let $\tau_\alpha$ be the stopping time
$$\tau_\alpha = \inf\{t : \|\xi(t)\|\ge\alpha\}.$$
Then for any $T > 0$
$$\Pr[\tau_\alpha\le T] \le \exp\Big[-\frac{[\log(1+\frac{\alpha^2}{\delta^2})]^2}{4k_d C T}\Big],$$
where $k_d$ is a constant depending only on the dimension.

Proof. Consider the function
$$U(x) = (\delta^2 + \|x\|^2)^N,$$
with the choice of $N$ to be made later. One can easily estimate
$$\frac{1}{2}\sum_{i,j}(\sigma\sigma^*)_{i,j}(s,\omega)\frac{\partial^2 U}{\partial x_i\,\partial x_j}(\xi(s)) \le k_d C N^2\,U(\xi(s)),$$
where $k_d$ is a constant depending only on the dimension $d$. If we pick $N$ so that, for a given $\lambda > 0$,
$$k_d C N^2 = \lambda,$$
then $e^{-\lambda t}U(\xi(t))$ is a supermartingale, and
$$E\big[\exp[-\lambda\tau_\alpha]\big] \le \Big[\frac{\delta^2}{\delta^2+\alpha^2}\Big]^N.$$
By Chebyshev's inequality,
$$\Pr[\tau_\alpha\le T] \le \exp[\lambda T]\Big[\frac{\delta^2}{\delta^2+\alpha^2}\Big]^N.$$
Making the optimal choice
$$\lambda = \frac{[\log(1+\frac{\alpha^2}{\delta^2})]^2}{4k_d C T^2},$$
we obtain
$$\Pr[\tau_\alpha\le T] \le \exp\Big[-\frac{[\log(1+\frac{\alpha^2}{\delta^2})]^2}{4k_d C T}\Big].$$


Lemma 2.4. If we drop the assumption that
$$\sigma(s,\omega)\,\sigma^*(s,\omega) \le C\,(\|\xi(s)\|^2 + \delta^2)\,I,$$
we get instead the estimate
$$\Pr\big[\tau_\alpha\le T\big] \le \exp\Big[-\frac{[\log(1+\frac{\alpha^2}{\delta^2})]^2}{4k_d C T}\Big] + 1 - \Pr\big[\sigma(s,\omega)\,\sigma^*(s,\omega) \le C\,(\|\xi(s)\|^2+\delta^2)\,I\ \ \forall s\in[0,T]\big].$$

Proof. The proof is routine; one just has to make sure that the stochastic integral is modified in an admissible manner. Try the details as an exercise.

We now return to prove the superexponential estimate.

Proof. We write
$$x_n(t) - x(t) = \epsilon\int_0^t[\sigma(x_n(\pi_n(s))) - \sigma(x(s))]\,d\beta(s) = \epsilon\int_0^t[\sigma(x_n(\pi_n(s))) - \sigma(x_n(s))]\,d\beta(s) + \epsilon\int_0^t[\sigma(x_n(s)) - \sigma(x(s))]\,d\beta(s).$$
If we fix $n$ and consider
$$\xi(t) = \int_0^t\sigma(s,\omega)\,d\beta(s)$$
with
$$\sigma(s,\omega) = \epsilon\big[[\sigma(x_n(\pi_n(s))) - \sigma(x_n(s))] + [\sigma(x_n(s)) - \sigma(x(s))]\big],$$
then $\xi(t) = x_n(t) - x(t)$ and, $\sigma$ being uniformly Lipschitz,
$$\sigma\sigma^*(s,\omega) \le C\epsilon^2\big[\|\xi(s)\|^2 + \delta^2\big]\,I$$
provided
$$\sup_{0\le s\le T}\|x_n(\pi_n(s)) - x_n(s)\| \le \delta.$$
If we apply the earlier lemma we get
$$Q_\epsilon\big[\|\Phi_{n,x}(\cdot) - \Phi_x(\cdot)\|\ge\alpha\big] \le \exp\Big[-\frac{[\log(1+\frac{\alpha^2}{\delta^2})]^2}{4k_d C\epsilon^2 T}\Big] + Q_\epsilon\Big[\sup_{0\le s\le T}\|x_n(\pi_n(s)) - x_n(s)\|\ge\delta\Big].$$
If we assume that $\sigma(\cdot)$ is uniformly bounded, the second term is dominated by
$$nT\exp\Big[-\frac{n\delta^2}{A\epsilon^2}\Big]$$
for some constant $A$. We now conclude the proof of the superexponential estimate by noting that
$$\limsup_{\epsilon\to 0}\,\epsilon^2\log Q_\epsilon\big[\|\Phi_{n,x}(\cdot) - \Phi_x(\cdot)\|\ge\alpha\big] \le -\min\Big\{\frac{[\log(1+\frac{\alpha^2}{\delta^2})]^2}{4k_d C T},\ \frac{n\delta^2}{A}\Big\}.$$
For fixed $\alpha$ and $\delta$ we can let $n\to\infty$, and then let $\delta\to 0$ keeping $\alpha$ fixed, to get $-\infty$ on the right-hand side.


Remark 2.5. If we look at the distribution at time $1$ of $P^\epsilon_x$, we have the transition probability $p^\epsilon(1,x,dy)$, viewed as a probability measure on $R^d$. The LDP for these is now obtained by the contraction principle, with the rate function
$$I(x,y) = \inf_{\substack{f(0)=x\\ f(1)=y}}\ \frac{1}{2}\int_0^1\langle f'(t),\ a^{-1}(f(t))\,f'(t)\rangle\,dt = \frac{[d(x,y)]^2}{2},$$
where $d$ is the Riemannian distance in the metric $ds^2 = \sum_{i,j} a^{-1}_{i,j}(x)\,dx_i\,dx_j$.

Remark 2.6. The distribution $p^\epsilon(1,x,dy)$ is the same as $p(\epsilon^2,x,dy)$, and so for the transition probability distributions $p(t,x,dy)$ of the diffusion process with generator
$$L = \frac{1}{2}\sum_{i,j} a_{i,j}(x)\frac{\partial^2}{\partial x_i\,\partial x_j}$$
we have an LDP as $t\to 0$ with rate function $\frac{[d(x,y)]^2}{2}$.

Remark 2.7. All the results are valid locally uniformly in the starting point $x$. What we mean by that is: if we take $P_\epsilon = P_{x_\epsilon,\epsilon}$, then so long as $x_\epsilon\to x$ as $\epsilon\to 0$, the LDP is valid for $P_\epsilon$ with the same rate function as the one for $P_{x,\epsilon}$.

Remark 2.8. Just for the record, the assumptions on $a$ are boundedness, uniform ellipticity (i.e. the boundedness of $a^{-1}$), and a uniform Lipschitz condition on $a$, or equivalently on $\sigma$.

Remark 2.9. Under the above regularity conditions one can look in PDE books and find that for the equation
$$\frac{\partial u}{\partial t} = Lu$$
there is a fundamental solution, which is nothing else but the density $p(t,x,y)$ of the transition probability distribution. There are some reasonable estimates for it that imply
$$\limsup_{|x-y|\to 0}\limsup_{t\to 0}\,t\log p(t,x,y) = \liminf_{|x-y|\to 0}\liminf_{t\to 0}\,t\log p(t,x,y) = 0.$$
One can use the Chapman--Kolmogorov equations to write
$$p(t,x,y) = \int_{R^d} p(\alpha t,x,dz)\,p((1-\alpha)t,z,y)$$
and bootstrap from the LDP of $p(t,x,dy)$ to the behavior of $p(t,x,y)$ itself:
$$\lim_{t\to 0}\,t\log p(t,x,y) = -\frac{[d(x,y)]^2}{2}.$$

We now turn our attention to the operator
$$L_\epsilon = \frac{\epsilon^2}{2}\sum_{i,j} a_{i,j}(x)\frac{\partial^2}{\partial x_i\,\partial x_j} + \sum_j b_j(x)\frac{\partial}{\partial x_j}.$$


If we denote by $P^\epsilon_{x,b}$ the probability measure corresponding to the above operator, and by $P^\epsilon_x$ the measure corresponding to the operator with $b\equiv 0$, the Girsanov formula provides the Radon--Nikodym derivative
$$\frac{dP^\epsilon_{x,b}}{dP^\epsilon_x} = \exp\Big[\frac{1}{\epsilon^2}\int_0^T\langle a^{-1}(x(s))\,b(x(s)),\ dx(s)\rangle - \frac{1}{2\epsilon^2}\int_0^T\langle b(x(s)),\ a^{-1}(x(s))\,b(x(s))\rangle\,ds\Big].$$
If we pretend that the exponent
$$F(\omega) = \int_0^T\langle a^{-1}(x(s))\,b(x(s)),\ dx(s)\rangle - \frac{1}{2}\int_0^T\langle b(x(s)),\ a^{-1}(x(s))\,b(x(s))\rangle\,ds$$
is a bounded continuous function on $C[[0,T]; R^d]$, then the LDP for $P^\epsilon_{x,b}$ would follow from the LDP for $P^\epsilon_x$ with the rate function
$$I_{x,b}(g) = I_x(g) - F(g) = \frac{1}{2}\int_0^T\langle[g'(t)-b(g(t))],\ a^{-1}(g(t))\,[g'(t)-b(g(t))]\rangle\,dt,$$
which is what we wanted to prove. The problem really is taking care of the irregular nature of the function $F(\cdot)$. We leave it as an exercise to show that the following lemma will do the trick.
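To check the sign (ours), expand the square using $dg = g'\,dt$ along $g$, writing $a = a(g(t))$ and $b = b(g(t))$:
$$I_x(g) - F(g) = \frac{1}{2}\int_0^T\langle g', a^{-1}g'\rangle\,dt - \int_0^T\langle a^{-1}b,\ g'\rangle\,dt + \frac{1}{2}\int_0^T\langle b, a^{-1}b\rangle\,dt = \frac{1}{2}\int_0^T\langle g'-b,\ a^{-1}(g'-b)\rangle\,dt.$$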

Lemma 2.10. There exist bounded continuous functions $F_{n,\ell}(\cdot)$ such that for every real $\lambda$ and $C < \infty$,
$$\limsup_{\ell\to\infty}\limsup_{n\to\infty}\limsup_{\epsilon\to 0}\ \epsilon^2\log E\Big[\exp\Big[\frac{\lambda}{\epsilon^2}\,[F_{n,\ell}(\omega) - F(\omega)]\Big]\Big] = 0$$
and
$$\lim_{\ell\to\infty}\lim_{n\to\infty} F_{n,\ell}(g) = F(g)$$
uniformly on $\{g : I_x(g)\le C\}$.

Proof. Since only the stochastic integral part is troublesome, we need to approximate
$$G(\omega) = \int_0^T\langle a^{-1}(x(s))\,b(x(s)),\ dx(s)\rangle.$$
It is routine to see that
$$G_{n,\ell}(\omega) = \begin{cases} G_n(\omega) & \text{if}\ |G_n(\omega)|\le\ell,\\ 0 & \text{otherwise,}\end{cases}$$
with
$$G_n(\omega) = \int_0^T\Big\langle a^{-1}\Big(x\Big(\frac{[ns]}{n}\Big)\Big)\,b\Big(x\Big(\frac{[ns]}{n}\Big)\Big),\ dx(s)\Big\rangle,$$
will do the trick, by repeating the steps of the similar approximation we used earlier in the proof of the case $b\equiv 0$. We leave the details to the reader.

Remark 2.11. The local uniformity in the starting point continues to hold without any problems.


3. The Exit Problem.

The exit problem deals with the following question. Suppose we are given a diffusion generator $L_\epsilon$ of the form
$$L_\epsilon = \frac{\epsilon^2}{2}\sum_{i,j} a_{i,j}(x)\frac{\partial^2}{\partial x_i\,\partial x_j} + \sum_j b_j(x)\frac{\partial}{\partial x_j}.$$
We suppose that $a_{i,j}$ and $b_j$ are smooth and that $a$ is uniformly elliptic, i.e. boundedly invertible. As $\epsilon\to 0$ the process will converge to the deterministic solution of the ODE
$$\frac{dx}{dt} = b(x(t)).$$
Let us assume that we have a bounded open set $G$ with the property that, for some point $x_0$ inside $G$, all trajectories of the above ODE starting from arbitrary points $x\in G$ remain forever within $G$ and converge to $x_0$ as $t\to\infty$. In other words, $x_0$ is a stable equilibrium point for the flow, and all trajectories starting from $G$ remain in $G$ and approach $x_0$. We can solve the Dirichlet problem for $L_\epsilon$, i.e. solve the equation
$$L_\epsilon U_\epsilon(x) = 0\ \text{ for }\ x\in G,\qquad U_\epsilon(y) = f(y)\ \text{ for }\ y\in\partial G,$$
where $f$ is the boundary value of $U_\epsilon$ specified on the boundary $\partial G$ of $G$. We will assume that the domain $G$ is regular and the function $f$ is continuous. The exit problem is to understand the behavior of $U_\epsilon$ as $\epsilon\to 0$. If we denote by $P_{\epsilon,x}$ the measure on path space corresponding to the generator $L_\epsilon$ starting from the point $x\in G$ at time $0$, then we know that $U_\epsilon$ has a representation in terms of the first exit place $x(\tau)$, where
$$\tau = \inf\{t : x(t)\notin G\}.$$
For any $\epsilon > 0$,
$$U_\epsilon(x) = E^{P_{\epsilon,x}}\big[f(x(\tau))\big].$$
It is tempting to let $\epsilon\to 0$ in the above representation. While $P_{\epsilon,x}$ converges weakly to $P_x$, the degenerate measure at the solution of the ODE, since the ODE never exits we have $\tau = +\infty$ a.e. $P_x$ for every $x\in G$. What happens to $x(\tau)$ is far from clear.

The solution of the exit problem is a good application of large deviations, as originally carried out by Wentzell and Freidlin. Let us consider the rate function
$$I(x(\cdot)) = \frac{1}{2}\int_0^T\langle[x'(t)-b(x(t))],\ a^{-1}(x(t))\,[x'(t)-b(x(t))]\rangle\,dt$$
and define, for any $x\in G$ and boundary point $y\in\partial G$,
$$V(x,y) = \inf_{0<T<\infty}\ \inf_{\substack{x(0)=x\\ x(T)=y}} I(x(\cdot)),$$
and when $x = x_0$,
$$R(y) = V(x_0,y).$$
We have the following theorem.


Theorem 3.1. Suppose for some (necessarily unique) point $y_0$ on $\partial G$,
$$R(y) > R(y_0)\ \text{ for all }\ y\in\partial G,\ y\ne y_0.$$
Then
$$\lim_{\epsilon\to 0} U_\epsilon(x) = f(y_0)$$
for all $x\in G$.

The proof will be based on the following lemma, which will be proved at the end.

Lemma 3.2. Let $N$, a neighborhood of $y_0$ on the boundary, be given. Then there are two neighborhoods $B$ and $H$ of $x_0$, with $x_0\in B\subset\bar{B}\subset H\subset\bar{H}\subset G$, such that
$$\lim_{\epsilon\to 0}\frac{\sup_{x\in\partial H} P_{\epsilon,x}[E_1]}{\inf_{x\in\partial H} P_{\epsilon,x}[E_2]} = 0,$$
where $E_1$ and $E_2$ are defined by
$$E_1 = \{x(\cdot) : x(\cdot)\ \text{exits}\ G\ \text{in}\ N^c\ \text{before visiting}\ B\},\qquad E_2 = \{x(\cdot) : x(\cdot)\ \text{exits}\ G\ \text{in}\ N\ \text{before visiting}\ B\}.$$

Proof. Given the lemma, the proof of the main theorem is easy. Suppose $\epsilon$ is small and the process starts from a point $x$ in $G$. It will follow the ODE and end up inside $B$ before it exits from $G$. Then it will hang around $x_0$ inside $H$ for a very long time. Because $\epsilon > 0$ it will eventually exit from $H$ at some boundary point of $H$. It may then exit from $G$ before entering back into $B$, with a very small probability, and this small probability is split between exiting in $N$ and in $N^c$. Denoting by $p_{n,\epsilon}$ and $q_{n,\epsilon}$ the respective probabilities of exiting in $N$ and in $N^c$ during the $n$-th excursion from $\partial H$ before returning to $B$,
$$P_{\epsilon,x}[x(\tau)\notin N] = \sum_n q_{n,\epsilon} \le C_\epsilon\sum_n p_{n,\epsilon} = C_\epsilon\,P_{\epsilon,x}[x(\tau)\in N],$$
where $C_\epsilon\to 0$ as $\epsilon\to 0$ according to the lemma. Clearly
$$P_{\epsilon,x}[x(\tau)\notin N] \le \frac{C_\epsilon}{C_\epsilon + 1} \to 0 \quad\text{as}\ \epsilon\to 0.$$
The proof of the theorem is complete.

We now turn to the proof of the lemma.

Proof. We will break up the proof into two steps. First we show that in estimating various quantities we can limit ourselves to a finite time interval.

Step 1. Let us denote by τ_B and τ_G the hitting times of ∂B and ∂G respectively. We prove that

C(T) = lim sup_{ε→0} ε² sup_{x∈∂H} log P_{ε,x}[(τ_B ≥ T) ∩ (τ_G ≥ T)]


tends to −∞ as T → ∞. By the large deviation property,

C(T) ≤ − inf { I(x(·)) : x(0) ∈ ∂H, x(t) ∈ G ∩ B^c for 0 ≤ t ≤ T }.

If C(T) does not go to −∞ as T → ∞, then by the additivity property of the rate function there will be a long interval [T_1, T_2] over which

∫_{T_1}^{T_2} ⟨ x′(t) − b(x(t)), a^{−1}(x(t)) [x′(t) − b(x(t))] ⟩ dt

will be small. In the limit this will give us a trajectory of the ODE that lies in G ∩ B^c for all times, thereby contradicting stability.

Step 2. One can construct trajectories that connect points close to each other and have rate functions no larger than a constant multiple of the distance between them: just go from one point to the other at speed 1 on a straight line. Suppose R(y_0) = ℓ and R(y) ≥ ℓ + δ on N^c. We can pick a small ball H around x_0 such that for some T_0 < ∞ and for any x ∈ ∂H the following statements are true: every path from x that exits from G in the set N^c within time T_0 has a rate function that exceeds ℓ + (7/8)δ, and there is a path from x that exits G in N with a rate function of at most ℓ + (1/8)δ. We can assume that T_0 is large enough that C(T_0) of Step 1 is smaller than −2ℓ − δ. It is easy to use the LDP and derive the following estimates uniformly for x ∈ ∂H:

P_{ε,x}[τ_G < τ_B, x(τ_G) ∈ N^c] ≤ exp[−(1/ε²)(ℓ + (7/8)δ)] + exp[−(1/ε²)(2ℓ + δ)]

and

P_{ε,x}[τ_G < τ_B, x(τ_G) ∈ N] ≥ exp[−(1/ε²)(ℓ + (1/8)δ)] − exp[−(1/ε²)(2ℓ + δ)].

The lemma follows from this.

Remark 3.3. In the special case when a(x) ≡ I and b(x) = −(1/2)∇F for some F, we can carry out the calculation of the rate function more or less explicitly. We have

∫_0^T ‖x′(t) + (1/2)(∇F)(x(t))‖² dt = ∫_0^T ‖x′(t) − (1/2)(∇F)(x(t))‖² dt + 2 ∫_0^T ⟨(∇F)(x(t)), x′(t)⟩ dt

 ≥ 2 ∫_0^T ⟨(∇F)(x(t)), x′(t)⟩ dt

 = 2 [F(x(T)) − F(x(0))].

Therefore

V(x, y) ≥ F(y) − F(x).

On the other hand, if x(t) solves the ODE

x′(t) = −(1/2)(∇F)(x(t))

with x(0) = y and x(T) = x, then the trajectory

y(t) = x(T − t) for 0 ≤ t ≤ T


connects x to y, and its rate function is

I[y(·)] = (1/2) ∫_0^T ‖y′(t) + (1/2)(∇F)(y(t))‖² dt

 = (1/2) ∫_0^T ‖x′(t) − (1/2)(∇F)(x(t))‖² dt

 = (1/2) ∫_0^T ‖x′(t) + (1/2)(∇F)(x(t))‖² dt − ∫_0^T ⟨(∇F)(x(t)), x′(t)⟩ dt

 = F(x(0)) − F(x(T))

 = F(y) − F(x).

It follows then that, for any y in the domain of attraction of x_0,

R(y) = V(x_0, y) = F(y) − F(x_0),

and the exit point is obtained by minimizing the potential F(y) on ∂G.
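A quick numerical check of this computation, as a sketch under assumed data (the hypothetical one-dimensional potential F(x) = x², so b = −(1/2)F′ and x_0 = 0): integrate the reversed trajectory y′ = (1/2)(∇F)(y) by Euler's method and compare its rate function with F(y(T)) − F(y(0)).

    import numpy as np

    F = lambda x: x * x            # hypothetical potential
    Fp = lambda x: 2.0 * x         # F'

    def uphill_rate(y0=0.05, T=3.0, n=300000):
        # Integrate y' = (1/2) grad F and accumulate the rate function
        # I = (1/2) * integral of |y' + (1/2) grad F|^2 dt along the way.
        dt = T / n
        y, I = y0, 0.0
        for _ in range(n):
            v = 0.5 * Fp(y)
            I += 0.5 * (v + 0.5 * Fp(y)) ** 2 * dt
            y += v * dt
        return y, I

    y_end, I = uphill_rate()
    print(I, F(y_end) - F(0.05))   # the two values nearly coincide

Both printed numbers are approximately 1.006 here, confirming that the uphill path realizes the rate F(y) − F(x).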


1. Markov Chains.

Let us switch from independent to dependent random variables. Suppose X_1, ···, X_n, ··· is a Markov Chain on a finite state space F consisting of points x ∈ F. The Markov Chain will be assumed to have a stationary transition probability given by a stochastic matrix π = {π(x, y)}, π(x, y) being the probability of a transition from x to y. We will assume that all the entries of π are positive, imposing thereby a strong irreducibility condition on the Markov Chain. Under these conditions there is a unique invariant or stationary distribution p(x) satisfying

Σ_x p(x) π(x, y) = p(y).

Let us suppose that V : F → R is a function defined on the state space, with mean value m = Σ_x V(x) p(x) with respect to the invariant distribution. By the ergodic theorem, for any starting point x,

lim_{n→∞} P_x[ |(1/n) Σ_j V(X_j) − m| ≥ a ] = 0,

where a > 0 is arbitrary and P_x denotes, as is customary, the measure corresponding to the Markov Chain initialized to start from the point x ∈ F. We expect the rates to be exponential and the goal, as it was in Cramer's theorem, is to calculate

lim_{n→∞} (1/n) log P_x[ (1/n) Σ_j V(X_j) ≥ a ]

for a > m. First, let us remark that for any V,

lim_{n→∞} (1/n) log E_x[ exp[V(X_1) + V(X_2) + ··· + V(X_n)] ] = log σ(V)

exists, where σ(V) is the eigenvalue with the largest modulus of the matrix

π_V = {π_V(x, y)} = {π(x, y) e^{V(y)}};

this eigenvalue is positive, exists by Frobenius theory, and is characterized by the fact that its eigenvector has positive entries. To see this we only need to write down the formula

E_x[ exp[V(X_1) + V(X_2) + ··· + V(X_n)] ] = Σ_y [π_V]^n(x, y),

which is easily proved by induction. The geometric rate of growth of any of the entries of the n-th power of a positive matrix is of course given by the Frobenius eigenvalue σ. Once we know that

lim_{n→∞} (1/n) log E_x[ exp[V(X_1) + V(X_2) + ··· + V(X_n)] ] = log σ(V),


we can get an upper bound by Tchebychev's inequality:

lim sup_{n→∞} (1/n) log P_x[ V(X_1) + V(X_2) + ··· + V(X_n) ≥ na ] ≤ log σ(V) − a,

or, replacing V by λV with λ > 0,

lim sup_{n→∞} (1/n) log P_x[ V(X_1) + V(X_2) + ··· + V(X_n) ≥ na ] ≤ log σ(λV) − λa.

Since we can optimize over λ ≥ 0, we obtain

lim sup_{n→∞} (1/n) log P_x[ V(X_1) + V(X_2) + ··· + V(X_n) ≥ na ] ≤ −h(a) = − sup_{λ≥0} [λa − log σ(λV)].

By Jensen's inequality,

log σ(λV) = lim_{n→∞} (1/n) log E^{P_x}[ exp[λ(V(X_1) + V(X_2) + ··· + V(X_n))] ]

 ≥ lim_{n→∞} (1/n) E^{P_x}[ λ(V(X_1) + V(X_2) + ··· + V(X_n)) ] = λm.

Therefore λa − log σ(λV) ≤ λ(a − m) ≤ 0 for λ ≤ 0, so that

h(a) = sup_{λ≥0} [λa − log σ(λV)] = sup_{λ∈R} [λa − log σ(λV)].

For the lower bound we can again perform the trick of Cramer to change the measure from P_x to a Q_x such that under Q_x the event in question has probability nearly 1. The Radon-Nikodym derivative of P_x with respect to Q_x will then control the large deviation lower bound. Our plan then is to replace π by a stochastic matrix π̃ whose invariant measure q(x) satisfies Σ_x V(x) q(x) = a. If Q_x is the distribution of the chain with transition probability π̃, then

P_x[E] = ∫_E R_n dQ_x,

where

R_n = exp[ Σ_j log ( π(X_j, X_{j+1}) / π̃(X_j, X_{j+1}) ) ].

This provides us a lower bound for the probability:

− lim inf_{n→∞} (1/n) log P_x[ V(X_1) + V(X_2) + ··· + V(X_n) ≥ na ] ≤ J(π̃),

where

J(π) = limn→∞

1

nEQ

x [− log Rn]

=∑

x,y

logπ(x, y)

π(x, y)π(x, y)q(x)


The last step comes from applying the ergodic theorem for the π̃ chain, with q(·) as the invariant distribution, to the average

− (1/n) log R_n = (1/n) Σ_{j=1}^n log ( π̃(X_j, X_{j+1}) / π(X_j, X_{j+1}) ).

Since any π̃ with Σ_x V(x) q(x) = a will do, where q(·) is the corresponding invariant measure, the lower bound we have is quickly improved to

lim inf_{n→∞} (1/n) log P_x[ V(X_1) + V(X_2) + ··· + V(X_n) ≥ na ] ≥ − inf_{π̃ : Σ_x V(x) q(x) = a} J(π̃).

If we find the λ such that λa − log σ(λV) = h(a), then for such a choice of λ,

a = σ′(λV) / σ(λV),

and we take π(x, y) e^{λV(y)} as our nonnegative matrix. We can find, for the eigenvalue σ, a column eigenvector f and a row eigenvector g. We define

π̃(x, y) = (1/σ) π(x, y) e^{λV(y)} f(y) / f(x).

One can check that π̃ is a stochastic matrix with invariant measure q(x) = g(x) f(x) (properly normalized). An elementary perturbation argument involving eigenvalues gives the relationship

a = Σ_x q(x) V(x),

and

J = Σ_{x,y} [ − log σ + λV(y) + log f(y) − log f(x) ] π̃(x, y) q(x)

 = λa − log σ

 = h(a),

thereby matching the upper bound. We have therefore proved the following theorem.

Theorem 1.1. For any Markov Chain with a transition probability matrix π with positive entries, the probability distribution of (1/n) Σ_{j=1}^n V(X_j) satisfies an LDP with rate function

h(a) = sup_λ [λa − log σ(λV)].

There is an interesting way of looking at σ(V). If V(x) = log ( u(x) / (πu)(x) ), then f(x) = (πu)(x) is a column eigenfunction for π_V with eigenvalue σ = 1. Therefore

log σ( log(u/πu) ) = 0,


and because log σ(V + c) = log σ(V) + c for any constant c, log σ(V) is the amount c by which V has to be shifted so that V − c is in the class

M = { V : V = log(u/πu) for some u > 0 }.
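Numerically, the shift c = log σ(V) and a witness u can be read off from a Perron eigenpair: the relation u = e^{V−c}(πu) is exactly the eigenvalue equation for the matrix diag(e^V)π, which is similar to π_V and so has the same Perron eigenvalue e^c. A sketch with the same made-up π and V as before:

    import numpy as np

    pi = np.array([[0.7, 0.3],           # made-up stochastic matrix
                   [0.4, 0.6]])
    V = np.array([0.0, 1.0])             # made-up function on F

    B = np.diag(np.exp(V)) @ pi          # u = e^{V-c} (pi u)  <=>  B u = e^c u
    w, R = np.linalg.eig(B)
    k = np.argmax(np.real(w))
    c = np.log(np.real(w[k]))            # c = log sigma(V)
    u = np.real(R[:, k])
    u = u * np.sign(u[0])                # Perron eigenvector, made positive

    print(np.allclose(V - c, np.log(u / (pi @ u))))   # True: V - c lies in M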

We now turn our attention to the number of visits

L_x^n = Σ_{j=1}^n χ_x(X_j)

to the state x, or to

ℓ_x^n = (1/n) L_x^n,

the proportion of time spent in the state x. If we view {ℓ_x^n ; x ∈ F} as a point in the space P of probability measures on F, then we get a distribution ν_x^n for any starting point x and time n. The ergodic theorem asserts that ν_x^n converges weakly to the degenerate distribution at p(·), the invariant measure for the chain. In fact we have actually proved the large deviation principle for ν_x^n.

Theorem 1.2. The distributions ν_x^n of the empirical distributions satisfy an LDP with rate function

I(q) = sup_{V∈M} Σ_x q(x) V(x).

Proof. Upper Bound. Let q ∈ P be arbitrary, and suppose V ∈ M, say V = log(u/πu) with u > 0. Then

E^{P_x}[ exp[ Σ_{i=1}^{n+1} V(X_i) ] (πu)(X_{n+1}) ] = (πu)(x).

Since V and πu are bounded above and below away from 0 on the finite set F, this gives

lim sup_{n→∞} (1/n) log E_x[ exp[ Σ_{j=1}^n V(X_j) ] ] ≤ 0,

which implies that

lim sup_{U↓q} lim sup_{n→∞} (1/n) log ν_x^n[U] ≤ − Σ_x q(x) V(x),

where the U are neighborhoods of q that shrink to q. Since the space P is compact, this provides immediately the upper bound in the LDP with the rate function I(q). The lower bound is a matter of finding a π̃ such that q(x) is the invariant measure for π̃ and J(π̃) = I(q). One can do the variational problem of minimizing J(π̃) over those π̃ that have q as an invariant measure. The minimum is attained at a point where

π̃(x, y) = π(x, y) f(y) / (πf)(x),


with q as the invariant measure for π̃. Since q(x) is the invariant distribution for π̃, an easy calculation gives

J = Σ_{x,y} [ log f(y) − log (πf)(x) ] π̃(x, y) q(x)

 = Σ_x [ log f(x) − log (πf)(x) ] q(x)

 = Σ_x V(x) q(x) for some V ∈ M

 ≤ I(q).

Since we already have the upper bound with I(·), we are done.

Remark 1.3. It is important to note that in the special case of IID random variables with a finite set of possible values, we can take π(x, y) = π(y), and for any V the matrix π(y) e^{V(y)} is of rank one, with its nonzero eigenvalue equal to σ(V) = Σ_x e^{V(x)} π(x); the analysis reduces to that of Cramer. In particular

I(q) = Σ_x q(x) log ( q(x) / π(x) )

is the relative entropy with respect to π(·).
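One can verify this collapse numerically. A sketch with made-up π and q: since (πu)(x) = Σ_y π(y)u(y) is a constant in the IID case, maximizing Σ_x q(x)V(x) over V = log(u/πu) reduces to an unconstrained optimization over u > 0, and the supremum is the relative entropy.

    import numpy as np
    from scipy.optimize import minimize

    pi = np.array([0.5, 0.3, 0.2])   # made-up one-step distribution pi(y)
    q = np.array([0.2, 0.3, 0.5])    # made-up target empirical distribution

    def neg_objective(log_u):
        # V = log(u/(pi u)); in the IID case (pi u)(x) = sum_y pi(y) u(y)
        u = np.exp(log_u)
        return -(q @ (np.log(u) - np.log(pi @ u)))

    best = -minimize(neg_objective, np.zeros(3)).fun
    print(best, float(np.sum(q * np.log(q / pi))))   # both equal the relative entropy

The maximizer is u proportional to q/π, which plugged into the objective gives Σ_x q(x) log(q(x)/π(x)) directly.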

Remark 1.4. A similar analysis can be done for continuous time Markov chains as well. Suppose we have a Markov chain, again on the finite state space F, with transition probabilities π(t, x, y) given by π(t) = exp[tA], where the infinitesimal generator, or rate matrix, A satisfies a(x, y) ≥ 0 for x ≠ y and Σ_{y∈F} a(x, y) = 0. For any function V : F → R the operator (A + V) defined by

(A + V)f(x) = Σ_y a(x, y) f(y) + V(x) f(x)

has the property that the eigenvalue with the largest real part is real, and if we make the strong irreducibility assumption as before, i.e. a(x, y) > 0 for x ≠ y, then this eigenvalue is unique and is characterized by having row (or equivalently column) eigenfunctions that are positive. We denote this eigenvalue by θ(V). If u > 0 is a positive function on F, then Au + (−Au/u)u = Au − Au = 0. Therefore, for V = −Au/u, the vector u is a column eigenvector for the eigenvalue 0, and for such V, θ(V) = 0. If we now denote by M = { V : V = −Au/u for some u > 0 }, then θ(V) is the unique constant such that V − θ ∈ M.

We now have the exact analogues of the discrete case: with the rate function

I(q) = sup_{V∈M} Σ_x q(x) V(x)

we have an LDP for the empirical measures

ℓ_x(t) = (1/t) ∫_0^t χ_x(X(s)) ds

that represent the proportion of time spent in the state x ∈ F.

Remark 1.5. In the special case when A is symmetric one can explicitly calculate

I(q) = − Σ_{x,y} a(x, y) √q(x) √q(y).

The large deviation theory gives the asymptotic relation

lim_{t→∞} (1/t) log E[ exp( ∫_0^t V(X(s)) ds ) ] = sup_{q(·)∈P} [ Σ_x q(x) V(x) − I(q) ]

 = sup_{f : f≥0, ‖f‖_2=1} [ Σ_x V(x) [f(x)]² + Σ_{x,y} a(x, y) f(x) f(y) ]

 = sup_{f : ‖f‖_2=1} [ Σ_x V(x) [f(x)]² + Σ_{x,y} a(x, y) f(x) f(y) ].

Notice that Σ_{x,y} a(x, y) f(x) f(y) can be rewritten as −(1/2) Σ_{x,y} a(x, y) [f(x) − f(y)]², and replacing f by |f| lowers the Dirichlet form (1/2) Σ_{x,y} a(x, y) [f(x) − f(y)]², thereby increasing the expression to be maximized; this is why we could drop the assumption that f ≥ 0. This is the familiar variational formula for the largest eigenvalue of a symmetric matrix.
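This last identification can be confirmed numerically. A sketch with a made-up symmetric rate matrix A (off-diagonal entries nonnegative, rows summing to zero) and potential V: the supremum equals the largest eigenvalue of the symmetric matrix A + diag(V), attained at q = f² for the normalized top eigenvector f.

    import numpy as np

    A = np.array([[-1.0, 0.6, 0.4],      # made-up symmetric rate matrix,
                  [ 0.6, -1.1, 0.5],     # off-diagonal >= 0, rows sum to 0
                  [ 0.4, 0.5, -0.9]])
    V = np.array([0.3, -0.2, 0.1])       # made-up potential

    # largest eigenvalue of the symmetric matrix A + diag(V)
    w, vecs = np.linalg.eigh(A + np.diag(V))
    theta = w[-1]
    f = vecs[:, -1]                      # unit-norm top eigenvector

    # evaluate sum_x q(x)V(x) - I(q) at q = f^2,
    # with I(q) = -sum_{x,y} a(x,y) sqrt(q(x)) sqrt(q(y))
    q = f ** 2 / np.sum(f ** 2)
    r = np.sqrt(q)
    value = q @ V + r @ A @ r
    print(theta, value)                  # the two numbers agree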