Stochastics: models, visualization, theorems

István Fazekas

University of Debrecen, Hungary

Lublin, 2013


1 Neural Networks

Artificial neural networks
1. Motivated by biological neural networks (the human brain)
2. A computation mechanism
3. Consists of many small units
4. Connections among the units
5. Learning from data
However, the phenomenon is considered a black box.

The perceptron
(1) Input
x(n) = (x1(n), ..., xm(n))⊤ is a known m-dimensional vector. At any time n = 1, 2, ..., N we have an input vector.
(2) Synaptic weights
The true weights w1, ..., wm are unknown. We have to find them. At time n, w1(n), ..., wm(n) are known. These approximate the true weights. w = (w1, ..., wm)⊤ is the true weight vector, w(n) = (w1(n), ..., wm(n))⊤ is its approximation.
(3) Bias
The true bias b is unknown. Its nth approximation is b(n).
(4) Summing junction

v(n) = b(n) + ∑_{i=1}^m wi(n) xi(n).   (1.1)

(5) Activation function, transfer function φ
(6) Output
For input x(n), the neuron produces the output y(n) = φ(v(n)).

Training
Steps:
1. design
2. training
3. application
Training set: the input x(n) with output d(n) is known for n = 1, ..., N. These are the training points. The perceptron produces for input x(n) the output y(n). The error is

E(n) = (d(n) − y(n))².

Training means finding the minimum of this error.


Activation functions
Logistic function

φ(x) = 1/(1 + exp(−ax)),  x ∈ ℝ,

where a > 0. φ(0) = 1/2, φ(·) is increasing, φ(∞) = 1, φ(−∞) = 0. The derivative is

φ′(x) = a exp(−ax)/(1 + exp(−ax))²,

so φ′(0) = a/4.
[Figure: the logistic function with a = 1.]

Threshold function, Heaviside function, hard limit

φ(x) = 1, if x ≥ 0;  0, if x < 0.

[Figure: the threshold function.]


Piecewise linear

φ(x) = 0, if x < −0.5;  x + 0.5, if −0.5 ≤ x < 0.5;  1, if x ≥ 0.5.

[Figure: the piecewise linear function.]
(It is the distribution function of the uniform distribution on [−0.5, 0.5].)

Hyperbolic tangent

φ(x) = 2/(1 + exp(−2x)) − 1 = tanh(x).

[Figure: the hyperbolic tangent function.]

Sign

φ(x) = −1, if x < 0;  0, if x = 0;  1, if x > 0.


[Figure: the sign function.]

Linear
φ(x) = x, x ∈ ℝ.
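The activation functions above are easy to code; a minimal sketch in Python/NumPy (the function names are my own, not from the notes):

    import numpy as np

    def logistic(x, a=1.0):
        # phi(x) = 1 / (1 + exp(-a x)); phi'(0) = a / 4
        return 1.0 / (1.0 + np.exp(-a * x))

    def heaviside(x):
        return np.where(x >= 0, 1.0, 0.0)

    def piecewise_linear(x):
        return np.clip(x + 0.5, 0.0, 1.0)

    def tanh_act(x):
        # 2 / (1 + exp(-2x)) - 1 = tanh(x)
        return np.tanh(x)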

Training the perceptron
[Figure: two linearly separable sets A1 and A2.]

• If the training point is correctly classified, then no correction is made. That is, if either w(n)⊤x(n) > 0 and x(n) ∈ A1, or w(n)⊤x(n) ≤ 0 and x(n) ∈ A2, then

w(n+1) = w(n).

• If w(n)⊤x(n) ≤ 0 but x(n) ∈ A1, then we move w(n) in the direction of x(n):

w(n+1) = w(n) + η(n)x(n).

• If w(n)⊤x(n) > 0 but x(n) ∈ A2, then we move w(n) in the direction opposite to x(n):

w(n+1) = w(n) − η(n)x(n).
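A minimal sketch of this update rule in Python/NumPy (the fixed learning rate, the label coding, and the omission of the bias are my own illustrative choices):

    import numpy as np

    def train_perceptron(X, labels, eta=0.1, epochs=100):
        """X: (N, m) inputs; labels: +1 for class A1, -1 for class A2."""
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            for x, label in zip(X, labels):
                predicted_A1 = w @ x > 0
                if label == 1 and not predicted_A1:
                    w = w + eta * x      # move w towards x(n)
                elif label == -1 and predicted_A1:
                    w = w - eta * x      # move w away from x(n)
        return w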


[Figure: two types of corrections; training points of A1 (+) and A2 (o), the true weight vector w, its approximation w(n), and points x(n) ∈ A1, x(n) ∈ A2.]

Perceptron convergence theorem (Rosenblatt, Novikoff). Assume that A1 and A2 are linearly separable. Assume that |w⊤x(n)| ≥ δ > 0 and ∥x(n)∥² ≤ R < ∞ for each training point. Let the training parameter η > 0 be fixed. Then the perceptron algorithm terminates at a hyperplane that correctly separates the two sets.

Multi Layer Perceptron (MLP)
[Figure: two linearly non-separable sets A and B.]
A perceptron separates two linearly separable sets. An MLP separates two non-linearly separable sets.

Parts of an MLP:
1. an input layer
2. hidden layers
3. an output layer
The input signal comes in at the input end of the network, propagates forward (neuron by neuron) through the network and finally produces the output of the network. The inputs of a neuron i are the outputs of the neurons just to the left of neuron i. The outputs of a neuron i are the inputs of the neurons just to the right of neuron i. Each neuron has its own weights and activation function.

Training an MLP
The main steps:

• Initialization of the weights

• Give training points

• The input signal propagates through the network without changing the weights

• The output signal is compared with the true output

• The error is propagated backward through the network, the weights are updated

Notation
i, j, k: the ith, jth, kth neuron (i, j, k are ordered from left to right);
n: the nth step of training;
yi(n): the output of neuron i (at the same time the input of neuron j if the layer of j is just to the right of the layer of i);
y0(n) ≡ 1;
wj0(n) = bj(n): the bias of j;
wji(n): the weight on the edge from i to j;
vj(n): the value produced by the summing junction of j:

vj(n) = ∑_{i ∈ left neighbour layer of j} wji(n) yi(n);

φj(·): the transfer function of neuron j;
yj(n): the output of neuron j: yj(n) = φj(vj(n));
dj(n): the true output (we compare yj(n) with it; dj(n) is known in the output layer).

Denote by C the set of the neurons in the output layer. Then the error at the nth step is

E(n) = (1/2) ∑_{j∈C} ej(n)² = (1/2) ∑_{j∈C} (dj(n) − yj(n))².

To find the minimum, use the gradient method (delta rule):

wji(n+1) − wji(n) = ∆wji(n) = −η ∂E(n)/∂wji(n),   (1.2)

where η > 0 is the training parameter.


Error back-propagation
Numerical problem: calculate the gradient of E.
Error back-propagation is a recursive calculation of the gradient. It is calculated backwards, layer by layer. The local gradient of neuron j can be obtained from the local gradients of the neurons to the right of j, and the local gradients in the output layer can be calculated directly.
The correction of wji(n) is

∆wji(n) = η δj(n) yi(n),

where δj(n) is the local gradient:

δj(n) = −∂E(n)/∂vj(n) = ej(n) φ′j(vj(n)).   (1.3)

When j is a neuron in the output layer, ej(n) = dj(n) − yj(n) is known. If j is a neuron in a hidden layer, ej(n) is not known directly; however, the local gradient can be calculated recursively:

δj(n) = φ′j(vj(n)) ∑_{k ∈ right neighbour layer of j} δk(n) wkj(n).   (1.4)

This is the most important formula of back-propagation.
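A minimal sketch of the delta rule and formula (1.4) for a network with one hidden layer, in Python/NumPy (logistic activations; the shapes, the learning rate and the omission of biases are illustrative assumptions):

    import numpy as np

    def logistic(v):
        return 1.0 / (1.0 + np.exp(-v))

    def train_step(x, d, W1, W2, eta=0.5):
        """One back-propagation step; W1: hidden-layer weights, W2: output-layer weights."""
        # forward pass
        v1 = W1 @ x;  y1 = logistic(v1)        # hidden layer
        v2 = W2 @ y1; y2 = logistic(v2)        # output layer
        # local gradients: formula (1.3) in the output layer ...
        delta2 = (d - y2) * y2 * (1 - y2)      # e_j(n) * phi'_j(v_j(n)), with phi' = y(1 - y)
        # ... and formula (1.4) in the hidden layer
        delta1 = (W2.T @ delta2) * y1 * (1 - y1)
        # delta rule: Delta w_ji(n) = eta * delta_j(n) * y_i(n)
        W2 += eta * np.outer(delta2, y1)
        W1 += eta * np.outer(delta1, x)
        return W1, W2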

Versions of error back-propagation
The gradient is calculated by the error back-propagation algorithm and applied in other methods.
Conjugate gradient methods: the Fletcher-Reeves formula (conjugate gradient back-propagation with Fletcher-Reeves updates), the Polak-Ribiere formula (conjugate gradient back-propagation with Polak-Ribiere updates), the Powell-Beale formula (conjugate gradient back-propagation with Powell-Beale restarts).
Quasi-Newton methods: the Broyden-Fletcher-Goldfarb-Shanno formula (BFGS quasi-Newton back-propagation).
The Levenberg-Marquardt method (Levenberg-Marquardt back-propagation).
The generalized delta rule or momentum method (gradient descent with momentum back-propagation).
Training can be sequential (one training point at a time) or batch (by epochs).

Generalization
A network generalizes well if it produces correct outputs for inputs not in the training set. If the shape of the separating curve (or of the approximating function) is very 'irregular', then the network is probably overfitted.


[Figure: left: a network that generalizes well; right: an overfitted (overtrained) network.]

Radial-Basis Function networks (RBF)
Let

xi ∈ ℝ^m0, i = 1, 2, ..., N,

be the input vectors and

di ∈ ℝ, i = 1, 2, ..., N,

the corresponding output values. Find the function F : ℝ^m0 → ℝ that approximates well the value di at the point xi for each i.
According to Tikhonov's regularization, the goodness of fit is measured by two terms.
(i) Standard error term:

Es(F) = (1/2) ∑_{i=1}^N (di − yi)² = (1/2) ∑_{i=1}^N [di − F(xi)]².

(ii) Regularizing term: Ec(F) = (1/2)∥DF∥²_H.
Here D is a linear differential operator. Actually, D is defined on a space of square integrable functions F : ℝ^m0 → ℝ. This Hilbert space is denoted by H and our operator is D : H → H. ∥DF∥²_H is the square of the norm of DF in the space H. The function F is an unknown element of (H, ⟨·, ·⟩_H). D includes the prior information about the problem. Ec(F) is the penalty function.
We have to minimize

E(F) = Es(F) + λEc(F) = (1/2) ∑_{i=1}^N [di − F(xi)]² + (1/2) λ ∥DF∥²_H,   (1.5)

where λ > 0 is the regularization parameter. If λ is small, then the training points determine the solution Fλ. If λ is large, then the prior information specifies the solution. E(F) is called Tikhonov's functional.
Denote the minimum point of E(F) by Fλ. Fλ is a minimum point of E(F) if

E(Fλ) ≤ E(Fλ + βh)


for any function h and any scalar β. So for any fixed non-zero function h ∈ H,

[d/dβ E(Fλ + βh)]_{β=0} = 0.

Let G be the Green function of DD. Then

Fλ(x) = (1/λ) ∑_{i=1}^N [di − Fλ(xi)] G(x, xi).   (1.6)

That is, if the training points (xi, di), i = 1, ..., N, are given, the Green function G is known, and the regularization parameter λ is given, then the minimum point of the Tikhonov functional E is Fλ in (1.6).
Let

wi = (1/λ)[di − Fλ(xi)], i = 1, 2, ..., N,   (1.7)

w = (w1, w2, ..., wN)⊤,  d = (d1, d2, ..., dN)⊤,

G = ( G(x1,x1) ⋯ G(x1,xN)
      ⋮              ⋮
      G(xN,x1) ⋯ G(xN,xN) ).

Then

(G + λI)w = d,   (1.8)

where I is the N × N unit matrix. Solving equation (1.8), we obtain the solution of the regularization problem:

Fλ(x) = ∑_{i=1}^N wi G(x, xi).   (1.9)

This is the regularization network. However, the Green function is not known; we have to choose it. Generally, it is of the form

G(x, xi) = G(∥x − xi∥).

Therefore

Fλ(x) = ∑_{i=1}^N wi G(∥x − xi∥).   (1.10)

Example.

G(x, xi) = exp( −∥x − xi∥²/(2σi²) ),

where x, xi ∈ ℝ^m0, σi > 0. It is the density function of the Nm0(xi, σi² I) distribution (up to a scalar factor).
The number of neurons in the regularization network is large (N).
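A minimal sketch of the regularization network (1.8)-(1.9) with the Gaussian Green function, in Python/NumPy (the choices of σ and λ are illustrative assumptions):

    import numpy as np

    def gaussian_green(x, xi, sigma=1.0):
        return np.exp(-np.sum((x - xi) ** 2) / (2 * sigma ** 2))

    def fit_regularization_network(X, d, lam=0.1, sigma=1.0):
        """X: (N, m0) inputs, d: (N,) outputs; returns the weights w of (1.9)."""
        N = len(X)
        G = np.array([[gaussian_green(X[i], X[j], sigma) for j in range(N)] for i in range(N)])
        return np.linalg.solve(G + lam * np.eye(N), d)    # equation (1.8)

    def predict(x, X, w, sigma=1.0):
        # F_lambda(x) = sum_i w_i G(||x - x_i||), equation (1.9)
        return sum(wi * gaussian_green(x, xi, sigma) for wi, xi in zip(w, X))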


We want to find a new solution of the form

F*(x) = ∑_{i=1}^{m1} wi G(∥x − ti∥).   (1.11)

Find the new weights wi, i = 1, 2, ..., m1, which minimize the Tikhonov functional E(F*):

E(F*) = (1/2) ∑_{j=1}^N ( dj − ∑_{i=1}^{m1} wi G(∥xj − ti∥) )² + (λ/2) ∥DF*∥²_H.   (1.12)

Let d = (d1, d2, ..., dN)⊤, w = (w1, w2, ..., wm1)⊤,

G = ( G(x1,t1) ⋯ G(x1,tm1)
      ⋮              ⋮
      G(xN,t1) ⋯ G(xN,tm1) ),

G0 = ( G(t1,t1) ⋯ G(t1,tm1)
       ⋮               ⋮
       G(tm1,t1) ⋯ G(tm1,tm1) ).

Then w is the solution of

(G⊤G + λG0)w = G⊤d.   (1.13)

We solve it, and then the generalized RBF is

F*(x) = ∑_{i=1}^{m1} wi G(∥x − ti∥).

Finding the centers ti.

• Fixed centers selected at random

• k-means clustering

• Supervised selection of centers

Non-parametric statistical estimators
Let K : ℝ^m0 → ℝ be a bounded, continuous, symmetric function with its maximum at the origin. Assume ∫_{ℝ^m0} K(x)dx = 1. Then K is called a kernel function.
Let x1, ..., xN be a sample from a population with density function f. The kernel type estimator of f is (Parzen-Rosenblatt):

fN(x) = (1/(N h^m0)) ∑_{i=1}^N K((x − xi)/h),  x ∈ ℝ^m0.   (1.14)

Here h is the bandwidth.
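A minimal sketch of the Parzen-Rosenblatt estimator (1.14) with a Gaussian kernel, in Python/NumPy (the kernel and the bandwidth are illustrative choices):

    import numpy as np

    def gaussian_kernel(u):
        # standard normal density in m0 dimensions
        m0 = u.shape[-1]
        return np.exp(-0.5 * np.sum(u ** 2, axis=-1)) / (2 * np.pi) ** (m0 / 2)

    def kde(x, sample, h=0.5):
        """Parzen-Rosenblatt estimate f_N(x); sample has shape (N, m0)."""
        N, m0 = sample.shape
        u = (x - sample) / h                  # (x - x_i) / h for every sample point
        return np.sum(gaussian_kernel(u)) / (N * h ** m0)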


We want to estimate the (one-dimensional) Y with a function of X (m0-dimensional). The optimal choice is the conditional expectation:

g(x) = E(Y | X = x).

g is the regression function. Then

g(x) = ∫ y f(y|x) dy = ∫ y f(y,x)/f(x) dy,

where f(y,x) is the joint density function of Y and X, f(x) is the density function of X, and f(y|x) is the conditional density of Y given X.
Let (y1,x1), ..., (yN,xN) be a sample for (Y,X). We estimate f by (1.14). We estimate the joint density by

f(y,x) = (1/(N h^{m0+1})) ∑_{i=1}^N K((x − xi)/h) K0((y − yi)/h),   (1.15)

y ∈ ℝ, x ∈ ℝ^m0. (Here K0 is a kernel.) Then

g(x) = ∫ y f(y,x)/f(x) dy =

= [ (1/(N h^{m0+1})) ∑_{i=1}^N K((x − xi)/h) ∫ y K0((y − yi)/h) dy ] / [ (1/(N h^m0)) ∑_{i=1}^N K((x − xi)/h) ] =

= ∑_{i=1}^N yi K((x − xi)/h) / ∑_{i=1}^N K((x − xi)/h).

We obtained

g(x) = ∑_{i=1}^N yi K((x − xi)/h) / ∑_{i=1}^N K((x − xi)/h).

This is the kernel type estimator of the regression function g (Nadaraya-Watson).
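A minimal sketch of the Nadaraya-Watson estimator in Python/NumPy (the Gaussian kernel and the bandwidth are illustrative choices):

    import numpy as np

    def nadaraya_watson(x, X, y, h=0.5):
        """Kernel regression estimate g(x); X: (N, m0) inputs, y: (N,) responses."""
        u = (x - X) / h
        weights = np.exp(-0.5 * np.sum(u ** 2, axis=1))   # Gaussian kernel K((x - x_i)/h)
        return np.sum(weights * y) / np.sum(weights)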

Constrained optimization
Kuhn-Tucker theorem, Karush-Kuhn-Tucker theorem (see the book by Boyd and Vandenberghe).
Let fk(x), k = 0, ..., m, be real valued functions of d variables. Assume that the intersection of their domains is non-empty; denote it by D. The primal optimization problem is:

minimize f0(x)
subject to fk(x) ≤ 0, k = 1, ..., m.

Denote by p* the optimal value, that is, the infimum of f0(x) subject to the constraints. Let

L = L(x,λ) = f0(x) + ∑_{k=1}^m λk fk(x)

be the Lagrange function, where λ = (λ1, ..., λm) are the Lagrange multipliers.


Let

g(λ) = inf_{x∈D} L(x,λ)

be the Lagrange dual function. g is always concave. Moreover, for λ ≥ 0 we have g(λ) ≤ p*. Using this inequality, we want to find the best lower bound. This leads to the following.
The Lagrange dual problem is:

maximize g(λ)
subject to λi ≥ 0, i = 1, ..., m.

Denote by d* the optimal value, that is, the supremum of g(λ) subject to the constraints. Then d* ≤ p*. The quantity p* − d* ≥ 0 is the optimal duality gap.
Assume that fk(x), k = 0, ..., m, are differentiable. Then D is open.

Theorem. Let x* be the optimal solution of the primal problem and let λ* = (λ*1, ..., λ*m) be the optimal solution of the dual optimization problem with zero duality gap. Then the Kuhn-Tucker conditions are satisfied:

fk(x*) ≤ 0, k = 1, ..., m,
λ*k ≥ 0, k = 1, ..., m,
λ*k fk(x*) = 0, k = 1, ..., m,
df0(x*)/dx + ∑_{k=1}^m λ*k dfk(x*)/dx = 0.

For convex functions the above conditions are sufficient.

Theorem. Let fk(x), k = 0, ..., m, be convex. Assume that the points x and λ1, ..., λm satisfy the Kuhn-Tucker conditions, that is,

fk(x) ≤ 0, k = 1, ..., m,   (1.16)
λk ≥ 0, k = 1, ..., m,   (1.17)
λk fk(x) = 0, k = 1, ..., m,   (1.18)
df0(x)/dx + ∑_{k=1}^m λk dfk(x)/dx = 0.   (1.19)

Then x is the optimal solution of the primal optimization problem, λ = (λ1, ..., λm) is the optimal solution of the dual optimization problem, and the optimal duality gap is zero.

The Slater condition: there exists a point x in the interior of D such that fi(x) < 0, i = 1, ..., m (that is, we have strict inequality).
If fk(x), k = 0, ..., m, are convex, then the Slater condition implies that the duality gap is zero.
Assume that fk(x), k = 0, ..., m, are convex and differentiable. Assume that the Slater condition is satisfied. Then there is a solution of (1.16)-(1.19).
The Kuhn-Tucker conditions (1.16)-(1.19) are necessary and sufficient for optimality: that is, x is the optimal solution of the primal optimization problem, λ = (λ1, ..., λm) is the optimal solution of the dual optimization problem, and the optimal duality gap is zero.
The optimality (with zero duality gap) of x, λ is equivalent to

min_x L(x, λ) = L(x, λ) = max_{λ≥0} L(x, λ),

that is, (x, λ) is a saddle point.

Support Vector Machines (SVM)
Statistical learning theory, Vapnik.
Applied in image processing, bioinformatics, data mining, ...

Linear separation, the optimal hyperplane

Assume that

(x1, y1), ..., (xN, yN),  xi ∈ ℝ^d, yi ∈ {−1, 1},   (1.20)

is the training set. If xi belongs to class A1, then yi = 1; if it belongs to class A2, then yi = −1. A1 and A2 are separable by the hyperplane

⟨x,φ⟩ = c   (1.21)

if

⟨xi,φ⟩ > c, if yi = 1,   (1.22)
⟨xi,φ⟩ < c, if yi = −1,   (1.23)

where φ ∈ ℝ^d is a unit vector, c ∈ ℝ, and ⟨a,b⟩ denotes the scalar product of a and b.

Optimal hyperplane: the widest zone between the two classes.
Steps: geometric conditions, Kuhn-Tucker theorem, numerical procedure. Assume that the two classes are linearly separable. For any unit vector φ let

c1(φ) = min_{yi=1} ⟨xi,φ⟩,   (1.24)
c2(φ) = max_{yi=−1} ⟨xi,φ⟩.   (1.25)

Let the unit vector φ0 be the maximum point of

ϱ(φ) = (c1(φ) − c2(φ))/2   (1.26)

given ∥φ∥ = 1.

Theorem. φ0 and

c0 = (c1(φ0) + c2(φ0))/2   (1.27)

give the optimal hyperplane ⟨x,φ0⟩ = c0.


Theorem. The optimal hyperplane is unique.

An alternative form of the task is the following. Find the vector ψ0 and scalar b0 which satisfy the inequalities

⟨xi,ψ⟩ + b ≥ 1, if yi = 1,   (1.28)
⟨xi,ψ⟩ + b ≤ −1, if yi = −1,   (1.29)

and give the minimum of

∥ψ∥² = ⟨ψ,ψ⟩.   (1.30)

Theorem. The normalized form of the vector ψ0 that minimizes the quadratic function (1.30) under the linear constraints (1.28)-(1.29) gives the normal vector φ0 of the optimal hyperplane:

φ0 = ψ0/∥ψ0∥.   (1.31)

Furthermore, the margin between the optimal hyperplane and the separated vectors is

ϱ(φ0) = sup_{∥φ∥=1} (1/2)( min_{yi=1} ⟨xi,φ⟩ − max_{yi=−1} ⟨xi,φ⟩ ) = 1/∥ψ0∥.   (1.32)

Under the constraint

yi(⟨xi,ψ⟩ + b) ≥ 1, i = 1, ..., N,   (1.33)

find the minimum of (1/2)∥ψ∥².

It is a convex optimization problem. Apply the Kuhn-Tucker theorem. The Lagrange function is

L(ψ, b, α) = (1/2)⟨ψ,ψ⟩ − ∑_{i=1}^N αi (yi[⟨xi,ψ⟩ + b] − 1) =   (1.34)
           = (1/2)⟨ψ,ψ⟩ − ⟨ ∑_{i=1}^N αi yi xi, ψ ⟩ − b ∑_{i=1}^N αi yi + ∑_{i=1}^N αi,   (1.35)

where αi ≥ 0, i = 1, ..., N, are the Lagrange multipliers. We should find the saddle point of L: the minimum with respect to ψ and b, the maximum with respect to α. We can find the minimum with respect to ψ and b analytically. We obtain the following.
Find the maximum of

W(α) = ∑_{i=1}^N αi − (1/2) ∑_{i,j=1}^N yi yj αi αj ⟨xi,xj⟩   (1.36)

under the constraints αi ≥ 0 and

∑_{i=1}^N αi yi = 0.   (1.37)


So we find the vector defining the optimal hyperplane:

ψ0 = ∑_{i=1}^N yi α0i xi.

b0 can be obtained from the following equations. For non-zero values of α0i the following should be satisfied:

yi(⟨xi,ψ0⟩ + b0) = 1.   (1.38)

These vectors xi are called support vectors. Finally, the optimal hyperplane is

f(x) = ∑_{i=1}^N yi α0i ⟨xi,x⟩ + b0 = 0.

Linearly non-separable sets
[Figure: a normal density surface.]
[Figure: two samples from normal distributions.]


Soft margin
Vapnik's proposal: introduce the non-negative variables {ξi}_{i=1}^N and require only

yi(⟨ψ,xi⟩ + b) ≥ 1 − ξi, i = 1, ..., N.   (1.39)

One should minimize ∑_{i=1}^N ξi. The target function is

Φ(ψ, ξ) = (1/2)⟨ψ,ψ⟩ + C ∑_{i=1}^N ξi.   (1.40)

Minimize (1.40) under the constraints (1.39) and ξi ≥ 0, i = 1, 2, ..., N. It is called the soft margin.
Apply the Kuhn-Tucker theorem to find the minimum of (1.40) under the constraints (1.39) and ξi ≥ 0, i = 1, 2, ..., N. The Lagrange function is

L(ψ, b, ξ, α, β) = (1/2)⟨ψ,ψ⟩ + C ∑_{i=1}^N ξi − ∑_{i=1}^N αi (yi(⟨ψ,xi⟩ + b) − 1 + ξi) − ∑_{i=1}^N βi ξi,

where αi ≥ 0, βi ≥ 0 are multipliers. We want to find a saddle point: the minimum with respect to ψ, b and ξi, the maximum with respect to αi, βi. We obtain

W(α) = ∑_{i=1}^N αi − (1/2) ∑_{i,j=1}^N yi yj ⟨xi,xj⟩ αi αj.   (1.41)

We have to maximize (1.41) with respect to α1, ..., αN under the constraints

∑_{i=1}^N yi αi = 0,   (1.42)
0 ≤ αi ≤ C, i = 1, ..., N.   (1.43)

Denote the solution by α01, ..., α0N. Then

ψ0 = ∑_{i=1}^N yi α0i xi.

Moreover, if 0 < α0i < C, then

yi(⟨xi,ψ0⟩ + b0) = 1.   (1.44)

It gives b0. Finally, the optimal hyperplane is

f(x) = ∑_{i=1}^N yi α0i ⟨xi,x⟩ + b0 = 0.


Non-linear separation
If the two sets are not linearly separable, then we transform them into a higher dimensional space and try to find a linear separation in that space.
Let φ : ℝ^d → ℝ^q, d < q, be a non-linear function. The image space is called the feature space. We know that to find the optimal hyperplane we need only the inner products of the vectors.
The inner product:

⟨φ(x),φ(y)⟩ = ∑_i µi φi(x) φi(y), µi > 0.

Integral operator:

(A_K f)(x) = ∫_a^b K(x, y) f(y) dy.

φ : [a, b] → ℝ is an eigenfunction and µ is an eigenvalue if A_K φ = µφ.

Mercer's theorem. Let K : [a, b] × [a, b] → ℝ be a continuous, symmetric, positive semi-definite function. Then the expansion

K(x, y) = ∑_{i=1}^∞ µi φi(x) φi(y)   (1.45)

is uniformly convergent.
Kernel functions:
1. Polynomial:
K(x, xi) = (⟨x,xi⟩ + 1)^p.
We should choose p.
2. Radial basis function:
K(x, xi) = exp( −∥x − xi∥²/(2σ²) ).
3. Two-layer perceptron:
K(x, xi) = tanh(β0⟨x,xi⟩ + β1).

The training set is (x1, y1), ..., (xN, yN). Instead of these points we should separate

(φ(x1), y1), ..., (φ(xN), yN).

However, the optimal hyperplane has the form

⟨ψ0,x⟩ + b0 = 0,

where

ψ0 = ∑_{i=1}^N yi α0i xi,

and in the optimization problem only the inner products of the vectors are included, so we replace every inner product by a kernel function: replace ⟨·,·⟩ by K(·,·). Then the separating surface is

f(x, α) = ∑_{i=1}^N yi αi K(xi, x) + b = 0.   (1.46)

To find αi and b we should maximize

W(α) = ∑_{i=1}^N αi − (1/2) ∑_{i,j=1}^N αi αj yi yj K(xi,xj)   (1.47)

subject to

∑_{i=1}^N yi αi = 0,   (1.48)
0 ≤ αi ≤ C, i = 1, ..., N.   (1.49)

Vectors xi corresponding to non-zero α0i are called support vectors. We classify a point z according to

d(z, α) = sgn( ∑_{xi is a support vector} yi α0i K(z, xi) + b0 ).

Sequential Minimal Optimization (SMO)
How do we find the maximum of W? Platt, Cristianini and Shawe-Taylor: SMO.
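As a hedged illustration (not part of the original notes), a kernel SVM of this form can be fitted with scikit-learn, whose solver (LIBSVM, cited in the references) is based on SMO; the data and parameters below are illustrative assumptions:

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)    # non-linearly separable classes

    # RBF kernel K(x, x_i) = exp(-gamma ||x - x_i||^2), soft-margin parameter C
    clf = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X, y)
    print(len(clf.support_), "support vectors")
    print(clf.predict([[0.1, 0.2], [2.0, 0.0]]))              # the sign rule d(z, alpha)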

SVM regression
Loss functions
Least squares:

(y − f(x,α))²,

where x is the input and y is the output. Robust statistics (Huber): apply other loss functions.
The ε-insensitive loss function:

Lε(y − f(x,α)) = |y − f(x,α)|ε,

where

|y − f(x,α)|ε = 0, if |f(x,α) − y| ≤ ε;  |f(x,α) − y| − ε, if |f(x,α) − y| > ε.


SVM for linear regression
We want to fit a function

f(x, ψ) = ∑_{i=1}^d ψi xi + b,

where x = (x1, ..., xd)⊤ ∈ ℝ^d, ψ = (ψ1, ..., ψd)⊤ ∈ ℝ^d.
Let (x1, y1), ..., (xN, yN) be the sample points (training set), where xi ∈ ℝ^d, yi ∈ ℝ, i = 1, ..., N. Then the empirical risk is

Remp(ψ, b) = (1/N) ∑_{i=1}^N Lεi(yi − ⟨ψ,xi⟩ − b).

We should minimize it. Introduce the non-negative variables {ξi}_{i=1}^N and {ξ*i}_{i=1}^N. Consider the conditions

yi − ⟨ψ,xi⟩ − b ≤ εi + ξi, i = 1, ..., N,
⟨ψ,xi⟩ + b − yi ≤ εi + ξ*i, i = 1, ..., N,
ξi ≥ 0, i = 1, ..., N,
ξ*i ≥ 0, i = 1, ..., N.   (1.50)

The minimum of Remp(ψ, b) in ψ and b is attained at the same point as the minimum of ∑_{i=1}^N (ξi + ξ*i) with respect to ξi, ξ*i, ψ, b subject to the constraints (1.50).

Vapnik's proposal. Minimize

Φ(ψ, ξ, ξ*) = C ∑_{i=1}^N (ξi + ξ*i) + (1/2)⟨ψ,ψ⟩

subject to the constraints (1.50). Apply the Kuhn-Tucker theorem. Introduce the non-negative multipliers αi, α*i, βi, β*i. The Lagrange function is

L(ψ, b, ξ, ξ*, α, α*, β, β*) = C ∑_i (ξi + ξ*i) + (1/2)⟨ψ,ψ⟩ + ∑_i αi (yi − ⟨ψ,xi⟩ − b − εi − ξi)
    + ∑_i α*i (−yi + ⟨ψ,xi⟩ + b − εi − ξ*i) − ∑_i βi ξi − ∑_i β*i ξ*i.

We have to find its saddle point: minimize with respect to ψ, b, ξi, ξ*i, and maximize with respect to αi, α*i, βi, β*i. We should maximize

W(α, α*) = ∑_{i=1}^N yi(αi − α*i) − ∑_{i=1}^N εi(αi + α*i) − (1/2) ∑_{i,j=1}^N (αi − α*i)(αj − α*j)⟨xi,xj⟩   (1.51)


subject to the constraints

∑_{i=1}^N (αi − α*i) = 0,
0 ≤ αi ≤ C, i = 1, ..., N,
0 ≤ α*i ≤ C, i = 1, ..., N.   (1.52)

Solving this quadratic programming problem, we find the optimal α0 and α*0. Then

ψ0 = ∑_{i=1}^N (α0i − α*0i) xi.

The optimal b is obtained by minimizing Remp(ψ0, b). The regression hyperplane is

f(x) = ∑_{i=1}^N (α0i − α*0i)⟨x,xi⟩ + b0.

SVM for non-linear regression
Apply kernel functions. Maximize

W(α, α*) = ∑_{i=1}^N yi(αi − α*i) − ∑_{i=1}^N εi(αi + α*i) − (1/2) ∑_{i,j=1}^N (αi − α*i)(αj − α*j) K(xi,xj)

subject to the constraints (1.52). Denote the maximum point by α0i, α*0i, i = 1, 2, ..., N. Finally, the approximating function is

f(x) = ∑_{i=1}^N (α0i − α*0i) K(x, xi) + b0.
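Kernel SVM regression of this form can likewise be sketched with scikit-learn's SVR (ε-insensitive loss); the data and parameters are illustrative assumptions:

    import numpy as np
    from sklearn.svm import SVR

    rng = np.random.default_rng(1)
    X = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)
    y = np.sin(X).ravel() + 0.1 * rng.normal(size=80)

    # RBF kernel, epsilon-insensitive loss, soft-margin parameter C
    reg = SVR(kernel="rbf", C=1.0, epsilon=0.1, gamma=1.0).fit(X, y)
    y_hat = reg.predict(X)    # f(x) = sum_i (alpha_i - alpha*_i) K(x, x_i) + b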

References

TEXTBOOKS
Haykin, Simon: Neural Networks. A Comprehensive Foundation, 1999, Prentice Hall, London.
Vapnik, V. N.: Statistical Learning Theory, 1998, John Wiley and Sons Inc., New York.
Cristianini, Nello and Shawe-Taylor, John: An Introduction to Support Vector Machines and Other Kernel-based Methods, 2000, Cambridge University Press, Cambridge.
Devroye, Luc; Gyorfi, Laszlo; Lugosi, Gabor: A Probabilistic Theory of Pattern Recognition, 1996, Springer, New York.


Hastie, T.; Tibshirani, R.; Friedman, J.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition, Springer, 2009.
Fletcher, R.: Practical Methods of Optimization. Second edition. A Wiley-Interscience Publication. John Wiley and Sons, Ltd., Chichester, 1987.
Prakasa Rao, B. L. S.: Nonparametric Functional Estimation. Probability and Mathematical Statistics. Academic Press, Inc. (Harcourt Brace Jovanovich, Publishers), New York, 1983.
Boyd, Stephen; Vandenberghe, Lieven: Convex Optimization. Cambridge University Press, Cambridge, 2004.
Hagan, Martin T.; Demuth, Howard B.; Beale, Mark H.: Neural Network Design. PWS Publishing, Boston, MA, 1996.

RESEARCH PAPERS AND BOOKS
McCulloch, W. and Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, vol. 5, issue 4, 115–133.
Hebb, D. O. (1949). The Organization of Behavior: A Neuropsychological Theory. New York: Wiley.
Rosenblatt, Frank (1958). The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, vol. 65, no. 6, pp. 386–408.
Novikoff, A. B. (1962). On convergence proofs on perceptrons. Proceedings of the Symposium on the Mathematical Theory of Automata, vol. 12, 615–622. Polytechnic Institute of Brooklyn.
Widrow, B. and Hoff, M. E., Jr. (1960). Adaptive switching circuits. In 1960 IRE WESCON Convention Record, Part 4, New York: IRE, pp. 96–104.
Minsky, Marvin and Papert, Seymour (1972; 2nd edition with corrections, first edition 1969). Perceptrons: An Introduction to Computational Geometry. The MIT Press, Cambridge, MA.
Rumelhart, David E.; Hinton, Geoffrey E.; Williams, Ronald J. (1986). Learning representations by back-propagating errors. Nature 323, 533–536 (9 October 1986).
Platt, John C. (1998). Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines. Microsoft Research, Technical Report MSR-TR-98-14. http://research.microsoft.com/en-us/um/people/jplatt/smoTR.pdf
Kohonen, Teuvo (1982). "Self-organized formation of topologically correct feature maps". Biological Cybernetics 43 (1): 59–69.
Vapnik, V. and Chervonenkis, A. (1971). "On the uniform convergence of relative frequencies of events to their probabilities." Theory of Probability and its Applications, 16(2): 264–280.
Tikhonov, A. N. (1963). Solution of incorrectly formulated problems and the regularization method. (English. Russian original) Sov. Math., Dokl. 5, 1035–1038 (1963); translation from Dokl. Akad. Nauk SSSR 151, 501–504 (1963).


Tikhonov, A. N.; Arsenin, V. Ya.: Metody resheniya nekorrektnykh zadach. (Russian) [Methods for the Solution of Ill-posed Problems] Izdat. “Nauka”, Moscow, 1974.
Kuhn, H. W.; Tucker, A. W.: Nonlinear programming. Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, 1950, pp. 481–492. University of California Press, Berkeley and Los Angeles, 1951.
Chang, Chih-Chung and Lin, Chih-Jen: LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011.
MATLAB – The Language of Technical Computing. MathWorks, http://www.mathworks.com/products/matlab/


2 Strong laws of large numbers

Topics
1. The phenomenon: random walks with normalizations.
2. The laws: laws of large numbers (LLN), the law of the iterated logarithm (LIL), the central limit theorem.
3. Significance: relative frequency and probability, the fundamental theorem of mathematical statistics, consistency of estimators, Monte Carlo methods.
4. Weak laws and strong laws: stochastic convergence, almost sure convergence.
5. Pairwise independence, complete independence: Etemadi's theorem, Kolmogorov's theorem.

1. The phenomenon: the random walk
ξi = ±1 in the coin tossing experiment, Sn = ∑_{i=1}^n ξi.
[Figure: a trajectory of the symmetric random walk (Sn).]
The law of large numbers: Sn/n → 0.


[Figure: the normalized trajectory of the symmetric random walk (Sn/n).]

Standardization, another phenomenon
[Figure: the standardized random walk (Sn/√n).]

The central limit theorem: Sn/√n ≈ N(0, 1).
[Figure: 300 repetitions of the first 100 steps of the random walk; the histogram of S100/√100 is bell shaped.]

Between the two phenomena: LIL

P( lim sup_{n→∞} Sn/√(2n log log n) = 1 ) = 1.


[Figure: LIL; n: 1 000 – 50 000.]
[Figure: LIL; n: 500 000 – 1 000 000.]
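These phenomena are easy to reproduce by simulation; a minimal sketch in Python/NumPy (the number of steps is an illustrative choice):

    import numpy as np

    rng = np.random.default_rng(42)
    n = 100_000
    S = np.cumsum(rng.choice([-1, 1], size=n))        # random walk S_1, ..., S_n
    k = np.arange(1, n + 1)

    print("LLN:  S_n / n       =", S[-1] / n)             # close to 0
    print("CLT:  S_n / sqrt(n) =", S[-1] / np.sqrt(n))    # roughly N(0, 1)
    print("LIL:  max_k S_k / sqrt(2 k log log k) =",
          np.max(S[2:] / np.sqrt(2 * k[2:] * np.log(np.log(k[2:])))))   # about 1 for large n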

2. The main features of limit theorems

Definition 2.1. We say that the sequence of random variables η1, η2, ... converges in probability to a random variable η if

∀ ε > 0 we have lim_{n→∞} P(|ηn − η| > ε) = 0.

In other words: stochastic convergence, convergence in measure.

Definition 2.2. We say that the sequence of random variables η1, η2, ... converges almost surely (a.s.) to a random variable η if there exists an event N with P(N) = 0 and

lim_{n→∞} ηn(ω) = η(ω), for ω ∈ Ω \ N.

In other words: convergence with probability 1, convergence almost everywhere.
Almost sure convergence is pointwise convergence except on a set of zero probability. Almost sure convergence implies stochastic convergence. Stochastic convergence does not imply almost sure convergence. Therefore in the strong laws the convergence is stronger than in the weak laws.

Example 2.1. We construct a sequence converging in probability but not converging almost surely.
Let (Ω, A, P) be the interval [0, 1) endowed with the σ-algebra of the Borel sets and with the Lebesgue measure. Let
ξ1(ω) = 1, ω ∈ [0, 1);
ξ2(ω) = 1, if ω ∈ [0, 1/2), and 0 otherwise;
ξ3(ω) = 1, if ω ∈ [1/2, 1), and 0 otherwise;
ξ4(ω) = 1, if ω ∈ [0, 1/4), and 0 otherwise;
...
ξ8(ω) = 1, if ω ∈ [0, 1/8), and 0 otherwise;
...
As the length of the intervals where ξn ≠ 0 converges to 0, ξn → 0 in probability. However, the interval where ξn = 1 'returns infinitely many times' above any point, so in the sequence ξn(ω) there are infinitely many zeros and infinitely many ones. So ξn(ω) is not convergent, ω ∈ [0, 1).

[Figure: panels (a)-(d) showing the indicator functions ξn on subintervals of [0, 1).]

We say that ξ1, ξ2, ... is fundamental (Cauchy) with probability 1 if ξ1(ω), ξ2(ω), ... is fundamental (Cauchy) for almost all ω ∈ Ω.

Theorem 2.1. lim_{n→∞} ξn = ξ almost surely iff

P( sup_{k≥n} |ξk − ξ| ≥ ε ) → 0,   (2.1)

as n → ∞, for every ε > 0.

Theorem 2.2. Let

∑_{k=1}^∞ P( |ξk − ξ| ≥ ε ) < ∞,   (2.2)

for every ε > 0. Then lim_{n→∞} ξn = ξ almost surely.

Theorem 2.3. The sequence ξ1, ξ2, ... is fundamental with probability 1 iff

P( sup_{k≥0} |ξn+k − ξn| ≥ ε ) → 0,   (2.3)

as n → ∞, for every ε > 0.


Let ξ1, ξ2, ... be independent identically distributed random variables; let Sn = ∑_{i=1}^n ξi.
The Marcinkiewicz SLLN
Let E|ξi|^r < ∞, where 0 < r < 2, and let m = Eξi if r ≥ 1, and m = 0 otherwise. Then

lim_{n→∞} (Sn − nm)/n^{1/r} = 0 almost surely.

Kolmogorov's SLLN is the case r = 1.

Rate of convergence: Hsu, Robbins, Erdos, Baum, Katz, ...
Non-identically distributed r.v.'s
Not independent r.v.'s
Banach space valued case
Multiindex case
The law of the iterated logarithm
If 0 < σ² = Eξ1² < ∞, Eξ1 = 0, then

P( lim sup_{n→∞} Sn/√(2σ²n log log n) = 1 ) = 1,
P( lim inf_{n→∞} Sn/√(2σ²n log log n) = −1 ) = 1.

The central limit theorem
Assume σ² = D²ξ1 is finite and positive, and let m = Eξ1. Then

lim_{n→∞} P( (Sn − nm)/(√n σ) < x ) = ∫_{−∞}^x (1/√(2π)) e^{−t²/2} dt,  ∀ x ∈ ℝ.

Rate of convergence: Berry-Esseen
Local CLT
Non-identically distributed r.v.'s: Lyapunov, Lindeberg
Not independent r.v.'s
Functional CLT: Donsker's theorem

An interplay between SLLN and CLT
Almost sure central limit theorems
Let X1, X2, ... be i.i.d. real r.v.'s with mean 0 and variance 1. Let Sk = X1 + ... + Xk. Then

(1/log n) ∑_{k=1}^n (1/k) δ_{Sk(ω)/√k} ⇒ N(0, 1), for almost every ω ∈ Ω,   (2.4)

as n → ∞, where δx is the unit mass at the point x and ⇒ N(0, 1) denotes weak convergence to the standard normal law.
See Brosamler (1988), Schatte (1988), Lacey-Philipp (1990), Rychlik (1994), ...


A.s. CLT
A trajectory of a standardized random walk: Sk/√k, k = 1, ..., 10000.
[Figure: 10 000 steps of a random walk.]
The histogram of the discrete distribution

P( η = Sk(ω)/√k ) ≈ (1/log n)(1/k), k = 1, ..., 10000.

3. Applications of the LLN: relative frequency and probability
Consider an experiment K and an event A with P(A) = p. Repeat K n times under fixed conditions and independently. Denote by kA/n the relative frequency of A. Then kA/n = Sn/n, if Sn = ξ1 + ... + ξn, where ξi is the indicator of A during the i-th performance of the experiment. Then ξ1, ..., ξn are independent with Bernoulli distribution, i.e. P(ξi = 1) = p, P(ξi = 0) = 1 − p, m = Eξi = p = P(A). Therefore

lim_{n→∞} kA/n = lim_{n→∞} Sn/n = m = P(A).

It is the Bernoulli LLN.
Theorem in probability theory: the relative frequency converges to the probability.


Empirical fact: in a real life experiment the relative frequency 'converges' to the probability. So our model is in accordance with the empirical facts.

Applications in mathematical analysis
The Weierstrass approximation theorem; Bernstein polynomials.

Applications in statistics
A sample in statistics: independent identically distributed random variables ξ1, ..., ξn.
(1) The empirical mean (sample mean, average) is a consistent estimator of the theoretical mean. That is, ξ̄ → m, where ξ̄ = (1/n) ∑_{i=1}^n ξi is the empirical mean and m = Eξ is the expectation.
(2) The fundamental theorem of mathematical statistics. The following is true with probability 1: the empirical distribution function F*n converges to the theoretical distribution function F uniformly on the whole real line.

[Figure: an empirical distribution function.]
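A minimal simulation of the fundamental theorem (Glivenko-Cantelli) in Python, comparing the empirical distribution function with the standard normal distribution function (the sample size is an illustrative choice):

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    sample = np.sort(rng.normal(size=2000))
    ecdf = np.arange(1, len(sample) + 1) / len(sample)    # F*_n at the order statistics

    # approximate sup-distance between F*_n and F; it tends to 0 as n grows
    print(np.max(np.abs(ecdf - norm.cdf(sample))))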

Monte Carlo methods (stochastic simulation)
Calculating integrals. Let f : [0, 1] → [0, 1].

∫_0^1 f(x) dx = ?

Let ξ1, η1, ξ2, η2, ... be independent, uniformly distributed on [0, 1]. Then (ξi, ηi), i = 1, 2, ..., are independent and uniformly distributed on the unit square. Let

ϱi = 1, if f(ξi) > ηi;  0, if f(ξi) ≤ ηi.

Then ϱ1, ..., ϱn are independent and identically distributed, and Eϱi = P(f(ξi) > ηi) = ∫_0^1 f(x) dx. By the SLLN

lim_{n→∞} (1/n) ∑_{i=1}^n ϱi = ∫_0^1 f(x) dx almost surely.

So, observing the sequence (ξi, ηi), i = 1, 2, ..., n, we can calculate ∫_0^1 f(x) dx. An approximation of the integral is the number of points (ξi, ηi) falling below the curve divided by the total number of points (ξi, ηi).
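A minimal sketch of this hit-or-miss Monte Carlo method in Python/NumPy (the integrand f(x) = x² is an illustrative choice):

    import numpy as np

    def hit_or_miss(f, n=100_000, seed=0):
        rng = np.random.default_rng(seed)
        xi, eta = rng.uniform(size=n), rng.uniform(size=n)
        return np.mean(f(xi) > eta)        # fraction of points below the curve

    print(hit_or_miss(lambda x: x ** 2))   # approximately 1/3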


4. Weak law: stochastic convergence; strong law: almost sure convergence

Kolmogorov's zero-one law
Let A1, A2, ... be σ-algebras.

Definition 2.3. Let

A_k^∞ = σ{Ak, Ak+1, ...},  T = ∩_{n=1}^∞ A_n^∞.

T is called the tail σ-algebra.

Theorem 2.4 (Kolmogorov's zero-one law). Let A1, A2, ... be independent σ-algebras. If A ∈ T, then P(A) is either zero or one.

Kolmogorov's inequality
Let ξ1, ξ2, ..., ξn be random variables and let Sk = ∑_{i=1}^k ξi be the partial sum.

Theorem 2.5 (Kolmogorov's inequality). Let ξ1, ξ2, ..., ξn be independent random variables with Eξi = 0, Eξi² < ∞, i ≤ n.
(a) Then for every ε > 0

P( max_{1≤k≤n} |Sk| ≥ ε ) ≤ ESn²/ε².   (2.5)

(b) If also P(|ξi| ≤ c) = 1, i ≤ n, then

P( max_{1≤k≤n} |Sk| ≥ ε ) ≥ 1 − (c + ε)²/ESn².   (2.6)

Proof. Consider the first hitting time of level ε:

Ak = { |Si| < ε, i = 1, 2, ..., k − 1, |Sk| ≥ ε }.


One-Series Theorem

Theorem 2.6 (Kolmogorov and Khinchin). Let ξ1, ξ2, ... be independent random variables with Eξi = 0.
(a) If

∑ Eξn² < ∞,   (2.7)

then the series ∑ ξn converges with probability 1.
(b) If the random variables ξn, n ≥ 1, are uniformly bounded (i.e., P(|ξn| ≤ c) = 1, c < ∞), the converse is true: the convergence of ∑ ξn with probability 1 implies (2.7).

Proof. (a) Use Theorem 2.3 and part (a) of Kolmogorov's inequality.
(b) Use Theorem 2.3 and part (b) of Kolmogorov's inequality.

Two-Series Theorem

Theorem 2.7 (Two-Series Theorem). (a) A sufficient condition for the convergence of the series ∑ ξn of independent random variables, with probability 1, is that both series ∑ Eξn and ∑ D²ξn converge.
(b) If P(|ξn| ≤ c) = 1, the condition is also necessary.

Proof. (a) Use the One-Series Theorem.
(b) Symmetrization and use of the One-Series Theorem.

Notation: ξ^c = ξ, if |ξ| ≤ c, and ξ^c = 0, if |ξ| > c.

Kolmogorov's Three-Series Theorem

Theorem 2.8 (Kolmogorov's Three-Series Theorem). Let ξ1, ξ2, ... be a sequence of independent random variables.
(a) A sufficient condition for the convergence of ∑ ξn with probability 1 is that the series

∑ Eξn^c,  ∑ D²ξn^c,  ∑ P(|ξn| ≥ c)

converge for some c > 0;
(b) a necessary condition is that these series converge for every c > 0.

Proof. Use the Two-Series Theorem.


The Borel-Cantelli lemma
a) If ∑_{i=1}^∞ P(Ai) < ∞, then the probability that infinitely many of the events Ai occur is 0.
b) Let the events A1, A2, ... be independent. If ∑_{n=1}^∞ P(An) = ∞, then the probability that infinitely many of the events Ai occur is 1.
Remark that Erdos and Renyi proved that part b) of the Borel-Cantelli lemma is true for pairwise independent events, too.

lim sup An = ∩_{n=1}^∞ ∪_{k=n}^∞ Ak

means that An occurs for infinitely many n.

Proof of the Borel-Cantelli lemma
a) We should prove that ∑_{i=1}^∞ P(Ai) < ∞ implies P(lim sup Ai) = 0. We have

P(lim sup Ai) = P( ∩_{n=1}^∞ ∪_{k=n}^∞ Ak ) ≤ P( ∪_{k=n}^∞ Ak ) for every n.

But

P( ∪_{k=n}^∞ Ak ) ≤ ∑_{k=n}^∞ P(Ak) → 0, as n → ∞,

as it is the tail of a convergent series.
b) We should prove that ∑_{n=1}^∞ P(An) = ∞ implies P(lim sup An) = 1. By the continuity of the probability,

P(lim sup An) = P( ∩_{n=1}^∞ ∪_{k=n}^∞ Ak ) = lim_{n→∞} P( ∪_{k=n}^∞ Ak ) = lim_{n→∞} lim_{m→∞} P( ∪_{k=n}^m Ak ).   (2.8)

Using de Morgan's law and the independence, for the complements Ak^c we have

P( (∪_{k=n}^m Ak)^c ) = P( ∩_{k=n}^m Ak^c ) = ∏_{k=n}^m P(Ak^c).

Applying P(Ak^c) = 1 − P(Ak) ≤ exp{−P(Ak)}, the previous expression has the following upper bound:

∏_{k=n}^m exp{−P(Ak)} = exp{ −∑_{k=n}^m P(Ak) }.

By our assumption, this expression converges to 0 as m → ∞. So lim_{m→∞} P( ∪_{k=n}^m Ak ) = 1. Therefore, by (2.8), P(lim sup An) = 1.

Lemma 2.1 (Chebyshev's inequality). Assume that the variance of ξ is finite. Then for any ε > 0 we have

P( |ξ − Eξ| ≥ ε ) ≤ D²(ξ)/ε².


Cantelli's SLLN

Theorem 2.9 (Cantelli). Let ξ1, ξ2, ... be independent random variables with finite fourth moments and let

E|ξn − Eξn|⁴ ≤ C, n ≥ 1,

for some constant C. Then as n → ∞

(Sn − ESn)/n → 0 (P-a.s.).   (2.9)

Proof. Apply Chebyshev's inequality and the Borel-Cantelli lemma.

Weak laws of large numbers
Let ξ1, ξ2, ... be a sequence of r.v.'s, and let Sn = ξ1 + ... + ξn, n = 1, 2, ..., be the sequence of their partial sums. The weak law of large numbers states the stochastic convergence of Sn/n.

Theorem 2.10. Let ξ1, ξ2, ... be pairwise independent identically distributed r.v.'s. Assume Eξi² < ∞. Denote by m = Eξi the common expectation. Then

lim_{n→∞} Sn/n = m in probability.

So the average of the observations converges to the theoretical mean. 'Weak' means that the convergence is stochastic. That is, for large n, Sn/n is close to m with high probability. Stochastic convergence is a metric convergence. Almost sure convergence is not a metric convergence.
Khintchine proved that the above theorem remains valid if instead of Eξi² < ∞ we assume only E|ξi| < ∞.

Proof of the weak law

Proof. By Chebyshev's inequality, for any ε > 0

P( |Sn/n − m| > ε ) = P( |Sn/n − E(Sn/n)| > ε ) ≤ (1/ε²) D²(Sn/n) = (1/(ε²n²)) ∑_{i=1}^n D²ξi = (1/(ε²n)) D²ξ1 → 0,

as n → ∞. (We used that the variance is additive for pairwise independent r.v.'s.)

5. Independence
Khintchine (1929): pairwise independence ⇒ WLLN
Kolmogorov (1933): (complete) independence ⇒ SLLN
Etemadi (1981): pairwise independence ⇒ SLLN


The Toeplitz Lemma

Lemma 2.1 (Toeplitz). Let {an} be a sequence of nonnegative numbers, bn = ∑_{i=1}^n ai, bn > 0 for n ≥ 1, and bn ↑ ∞, n → ∞. Let {xn} be a sequence of numbers converging to x. Then

(1/bn) ∑_{j=1}^n aj xj → x.   (2.10)

In particular, if an = 1, then

(x1 + ... + xn)/n → x.   (2.11)

The Kronecker Lemma

Lemma 2.2 (Kronecker). Let {bn} be an increasing sequence of positive numbers, bn ↑ ∞, n → ∞, and let {xn} be a sequence of numbers such that ∑ xn converges. Then

(1/bn) ∑_{j=1}^n bj xj → 0, n → ∞.   (2.12)

In particular, if bn = n, xn = yn/n and ∑ (yn/n) converges, then

(y1 + ... + yn)/n → 0, n → ∞.   (2.13)

Proof. Apply the Toeplitz Lemma.

A Kolmogorov SLLN

Theorem 2.11 (Kolmogorov). Let ξ1, ξ2, ... be a sequence of independent random variables with finite second moments, and let there be positive numbers bn such that bn ↑ ∞ and

∑ D²ξn/bn² < ∞.   (5)

Then

(Sn − ESn)/bn → 0 (P-a.s.).   (2.14)

In particular, if

∑ D²ξn/n² < ∞,   (2.15)

then

(Sn − ESn)/n → 0 (P-a.s.).   (2.16)

Proof. Apply the Two-Series Theorem and Kronecker's Lemma.


A Lemma for Moments

Lemma 2.3. Let ξ be a nonnegative random variable. Then

∑_{n=1}^∞ P(ξ ≥ n) ≤ Eξ ≤ 1 + ∑_{n=1}^∞ P(ξ ≥ n).   (2.17)

Kolmogorov's SLLN

Theorem 2.12 (Kolmogorov). Let ξ1, ξ2, ... be a sequence of independent identically distributed random variables. If E|ξ1| < ∞, then

Sn/n → m (P-a.s.),   (2.18)

where m = Eξ1. If

Sn/n → C (P-a.s.),   (2.19)

where C is a finite constant, then E|ξ1| < ∞.

Proof. Use truncation, Lemma 2.3, and Theorem 2.11.

Etemadi's SLLN
It turned out that pairwise independence implies the strong law. Etemadi's result implies both Khintchine's WLLN and Kolmogorov's SLLN:

Theorem 2.13 (Etemadi's SLLN). Let ξ1, ξ2, ... be pairwise independent identically distributed r.v.'s. Let Sn = ξ1 + ... + ξn. Assume E|ξi| < ∞, m = Eξi. Then

lim_{n→∞} Sn/n = m almost surely.   (2.20)

Lemma 2.2. Let ξ be a non-negative r.v. Then

Eξ < ∞ if and only if ∑_{n=1}^∞ P(ξ ≥ n) < ∞.

Proof.

∑_{n=1}^∞ P(ξ ≥ n) = ∑_{n=1}^∞ ∑_{k=n}^∞ P(k ≤ ξ < k+1) =
= ∑_{k=1}^∞ k P(k ≤ ξ < k+1) ≤ ∑_{k=0}^∞ E(ξ χ_{k≤ξ<k+1}) = Eξ = ... ≤
≤ ∑_{k=0}^∞ (k+1) P(k ≤ ξ < k+1) = ... = 1 + ∑_{n=1}^∞ P(ξ ≥ n).


Proof of Etemadi's SLLN
Let

ξi⁺ = ξi χ_{ξi≥0},  ξi⁻ = −ξi χ_{ξi≤0}

be the positive and the negative parts of ξi, i = 1, 2, .... Then both ξ1⁺, ξ2⁺, ... and ξ1⁻, ξ2⁻, ... are sequences of identically distributed, pairwise independent r.v.'s. Moreover, ξi = ξi⁺ − ξi⁻. Therefore, if the theorem is valid for the positive parts and the negative parts, then it is valid for ξi. So it is enough to prove it for non-negative r.v.'s.
Assume ξi ≥ 0. Truncate ξi at level i:

ηi = ξi χ_{ξi≤i}, i = 1, 2, ...;  S*n = η1 + ... + ηn, n = 1, 2, ....

Let α > 1 be fixed and let kn = [αⁿ] (integer part). First we prove that a subsequence of S*n/n is convergent. More precisely, we shall prove that

lim_{n→∞} S*_{kn}/kn = m, almost surely.   (2.21)

To this end, first we prove that

lim_{n→∞} ES*_{kn}/kn = m.   (2.22)

By the monotone convergence theorem,

m = Eξ1 = lim_{n→∞} Eξ1χ_{ξ1≤n} = lim_{n→∞} Eξnχ_{ξn≤n} = lim_{n→∞} Eηn.

However, ES*_{kn}/kn is the arithmetic mean of the numbers Eη1, ..., Eη_{kn}. As the sequence of arithmetic means of a convergent sequence has the same limit as the sequence itself, (2.22) follows.
Later we shall show that for any ε > 0

∆ = ∑_{n=1}^∞ P( |S*_{kn} − ES*_{kn}|/kn > ε ) < ∞.   (2.23)

But (2.23) and (2.22) imply (2.21). To see it, we remark that by the Borel-Cantelli lemma and (2.23), for any ε > 0 the following event has probability 1: |S*_{kn} − ES*_{kn}|/kn can be greater than ε only for finitely many indices kn. Applying this property to a null sequence of ε's, we obtain

(S*_{kn} − ES*_{kn})/kn → 0, almost surely.

This and (2.22) imply (2.21). Now we prove that (2.21) is valid with Sn in place of S*n, that is,

lim_{n→∞} S_{kn}/kn = m, almost surely.   (2.24)


We have

∑_{n=1}^∞ P(ηn ≠ ξn) = ∑_{n=1}^∞ P(ξn > n) = ∑_{n=1}^∞ P(ξ1 > n) < ∞

because E|ξ1| < ∞. So, by the Borel-Cantelli lemma, with probability 1, ξn ≠ ηn can happen only for finitely many n. Therefore (2.21) implies (2.24).
Now we turn to the indices not belonging to the subsequence kn. Let kn < i ≤ k_{n+1}. Then, by the non-negativity of ξi,

S_{kn}/k_{n+1} ≤ Si/i ≤ S_{k_{n+1}}/kn.

Therefore

(kn/k_{n+1}) · S_{kn}/kn ≤ Si/i ≤ (k_{n+1}/kn) · S_{k_{n+1}}/k_{n+1}.

So, using (2.24),

(1/α) m ≤ lim inf_{n→∞} Sn/n ≤ lim sup_{n→∞} Sn/n ≤ α m

almost surely. It is valid for any α > 1, so (2.20) is true. To complete the proof we have to prove (2.23), that is,

∆ = ∑_{n=1}^∞ P( |S*_{kn} − ES*_{kn}|/kn > ε ) < ∞.

Using Chebyshev's inequality, the additivity of the variance for pairwise independent r.v.'s, and the fact that D²(X) ≤ E(X²), we obtain

∆ ≤ c ∑_{n=1}^∞ D²(S*_{kn})/kn² = c ∑_{n=1}^∞ (1/kn²) ∑_{i=1}^{kn} D²(ηi) ≤ c ∑_{n=1}^∞ (1/kn²) ∑_{i=1}^{kn} Eηi² = c ∑_{i=1}^∞ Eηi² ∑_{kn≥i} 1/kn²,

where at the last step we interchanged the order of summations. (Here and in what follows c is a constant which may depend on the formula.)
By the definition of kn, we obtain

∑_{kn≥i} 1/kn² ≤ c i⁻²(1 + α⁻² + α⁻⁴ + ···) ≤ c i⁻².

By the definition of ηi, using the additivity of the expectation, then interchanging summations,

∆ ≤ c ∑_{i=1}^∞ Eηi² ∑_{kn≥i} 1/kn² ≤ c ∑_{i=1}^∞ Eηi²/i² = c ∑_{i=1}^∞ E(ξ1² χ_{ξ1≤i})/i² =
= c ∑_{i=1}^∞ (1/i²) ∑_{k=0}^{i−1} E(ξ1² χ_{k<ξ1≤k+1}) = c ∑_{k=0}^∞ E(ξ1² χ_{k<ξ1≤k+1}) ∑_{i=k+1}^∞ 1/i² ≤
≤ c ∑_{k=0}^∞ (1/(k+1)) E(ξ1² χ_{k<ξ1≤k+1}) ≤ c ∑_{k=0}^∞ E(ξ1 χ_{k<ξ1≤k+1}) = c Eξ1 < ∞.

In the last but one estimate we applied that ∑_{i=k+1}^∞ i⁻² ≤ c(k+1)⁻¹, which is a consequence of approximating the sum by an integral.


References

Etemadi, N. (1981); An elementary proof of the strong law of large numbers. Z. Wahrsch. Verw. Gebiete, 55(1):119–122, 1981.
Khinchin, A. (1929); "Sur la loi des grands nombres." Comptes rendus de l'Académie des Sciences 189, 477–479, 1929.
Kolmogorov, Andrey (1933); Grundbegriffe der Wahrscheinlichkeitsrechnung (in German). Berlin: Julius Springer.
Ash, Robert B. (1972); Real Analysis and Probability. Academic Press, New York–London, 1972.
Bauer, Heinz (1996); Probability Theory. Walter de Gruyter and Co., Berlin, 1996.
Gut, Allan (2005); Probability: A Graduate Course. Springer Texts in Statistics. Springer, New York, 2005.
Loeve, Michel (1963); Probability Theory. D. Van Nostrand Co., Inc., Princeton, N.J.–Toronto, Ont.–London, 1963.
Renyi, A. (1970); Probability Theory. North-Holland Publishing Co., Amsterdam–London, 1970.
Shiryayev, A. N. (1984); Probability. Graduate Texts in Mathematics, 95. Springer-Verlag, New York, 1984.


3 General approach to strong laws of large numbers

Introduction
Hajek and Renyi (1955) proved the following inequality. Let X1, ..., Xn be independent random variables with zero mean values and finite variances EXk² = σk², k = 1, ..., n. Denote by Sk = X1 + ... + Xk, k = 1, ..., n, the partial sums. Let β1, ..., βn be a non-decreasing sequence of positive numbers. Then for any ε > 0 and for any m with 1 ≤ m ≤ n we have

P( max_{m≤l≤n} |Sl/βl| ≥ ε ) ≤ (1/ε²) [ (1/βm²) ∑_{l=1}^m σl² + ∑_{l=m+1}^n σl²/βl² ].   (3.1)

In Hajek-Renyi (1955) this inequality was used to obtain strong laws of large numbers (SLLN).
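As a quick numerical illustration (my own, not from the notes), inequality (3.1) can be checked by simulation for independent standard normal Xi and βl = l:

    import numpy as np

    rng = np.random.default_rng(0)
    n, m, eps, reps = 50, 10, 0.5, 20_000
    sigma2 = np.ones(n)                                  # variances of X_1, ..., X_n
    beta = np.arange(1, n + 1).astype(float)             # non-decreasing beta_l = l

    X = rng.normal(size=(reps, n)) * np.sqrt(sigma2)
    S = np.cumsum(X, axis=1)
    lhs = np.mean(np.max(np.abs(S[:, m - 1:] / beta[m - 1:]), axis=1) >= eps)
    rhs = (sigma2[:m].sum() / beta[m - 1] ** 2 + (sigma2[m:] / beta[m:] ** 2).sum()) / eps ** 2
    print(lhs, "<=", rhs)    # estimated left-hand side of (3.1) versus the bound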

In Fazekas and Klesov (2000) it was shown that a Hajek-Renyi type maximal inequality for moments is always a consequence of an appropriate Kolmogorov type maximal inequality. Moreover, the Hajek-Renyi type maximal inequality automatically implies the SLLN. The most important point is that no restriction is assumed on the dependence structure of the random variables.
We list some abstract versions of the Hajek-Renyi inequality. Then we show that several Hajek-Renyi type inequalities and SLLN's can be inserted into the framework of our general theory. We shall mention results e.g. on martingales, mixingales, ϱ-mixing sequences, associated sequences, negatively associated sequences, and demimartingales.

Notation
X1, X2, ... will denote a sequence of random variables. The partial sums will be Sn = ∑_{i=1}^n Xi for n ≥ 1 and S0 = 0. A sequence {bn} will be called non-decreasing if bi ≤ bi+1 for i ≥ 1.

Hajek-Renyi type maximal inequalities for moments

Definition 3.1. We say that the random variables X1, ..., Xn satisfy the first Kolmogorov type maximal inequality for moments, if for each m with 1 ≤ m ≤ n

E[ max_{1≤l≤m} |Sl| ]^r ≤ K ∑_{l=1}^m αl,   (3.2)

where α1, ..., αn are non-negative numbers, r > 0, and K > 0.

Definition 3.2. We say that the random variables X1, ..., Xn satisfy the first Hajek-Renyi type maximal inequality for moments, if

E[ max_{1≤l≤n} |Sl/βl| ]^r ≤ C ∑_{l=1}^n αl/βl^r,   (3.3)

where β1, ..., βn is a non-decreasing sequence of positive numbers, α1, ..., αn are non-negative numbers, r > 0, and C > 0.


Theorem 3.1. (Fazekas-Klesov (2000), Theorem 1.1.) Let the random variables X1, ..., Xn be fixed. If the first Kolmogorov type maximal inequality (3.2) for moments is satisfied, then the first Hajek-Renyi type maximal inequality (3.3) for moments is satisfied with C = 4K.

Here and in what follows we mean that the Hajek-Renyi type maximal inequality is valid with the same parameters n, α1, ..., αn, r > 0 as the appropriate Kolmogorov inequality and for an arbitrary non-decreasing sequence of positive numbers β1, ..., βn. Moreover, C may depend on K and r only.
The above result allows us to obtain an abstract form of the SLLN.

Theorem 3.2. (Theorem 2.1 in Fazekas-Klesov (2000).) Let X1, X2, ... be a sequence of random variables. Let the non-negative numbers α1, α2, ..., r > 0, and K > 0 be fixed. Assume that for each m ≥ 1 the first Kolmogorov type maximal inequality (3.2) for moments is satisfied. Let b1, b2, ... be a non-decreasing unbounded sequence of positive numbers. If

∑_{l=1}^∞ αl/bl^r < ∞,   (3.4)

then

lim_{n→∞} Sn/bn = 0 a.s.   (3.5)

Proof. The proof in Fazekas-Klesov (2000) is based on a theorem of Dini (see Lemma 3.2 below). Actually it suffices to apply Lemma 3.1 below, which is a simple consequence of Dini's theorem. Let {βn} be a sequence satisfying the properties given in Lemma 3.1. Then apply Theorem 3.1.

Lemma 3.1. Let {bk} be a non-decreasing unbounded sequence of positive numbers. Let {αk} be a sequence of non-negative numbers with ∑_{k=1}^∞ αk/bk^r < ∞, where r > 0. Then there exists a non-decreasing unbounded sequence {βk} of positive numbers such that ∑_{k=1}^∞ αk/βk^r < ∞ and lim_{k→∞} βk/bk = 0.

In Fazekas-Klesov (1998) a direct proof of Lemma 3.1 is given.

A more general Hajek-Renyi type inequality

Definition 3.3. We say that the random variables X1, ..., Xn satisfy the second Kolmogorov type maximal inequality for moments, if for each k, m with 1 ≤ k < m ≤ n

E[ max_{k≤l≤m} |Xk + ... + Xl| ]^r ≤ K ∑_{l=k}^m αl,   (3.6)

where α1, ..., αn are non-negative numbers, r > 0, and K > 0.

Definition 3.4. We say that the random variables X1, ..., Xn satisfy the second Hajek-Renyi type maximal inequality for moments, if for each m with 1 ≤ m ≤ n

E[ max_{m≤l≤n} |Sl/βl| ]^r ≤ C [ (1/βm^r) ∑_{l=1}^m αl + ∑_{l=m+1}^n αl/βl^r ],   (3.7)

where β1, ..., βn is a non-decreasing sequence of positive numbers, α1, ..., αn are non-negative numbers, r > 0, and C > 0.


Theorem 3.3. Let the random variables X1, ..., Xn be fixed. If the second Kolmogorov type maximal inequality (3.6) for moments is satisfied, then the second Hajek-Renyi type maximal inequality (3.7) for moments is satisfied with C = 4D_r K, where D_r = 1 for 0 < r ≤ 1, and D_r = 2^{r−1} for r ≥ 1.

A popular way to obtain the SLLN is the application of the second Hajek-Renyi type maximal inequality.

Second proof of Theorem 3.2 if the second Kolmogorov type maximal inequality (3.6) for moments is satisfied. By Theorem 3.3,

E[ sup_{k≥m} |Sk/bk| ]^r = lim_{n→∞} E[ max_{m≤k≤n} |Sk/bk| ]^r ≤ C [ (1/bm^r) ∑_{l=1}^m αl + ∑_{l=m+1}^∞ αl/bl^r ].

As m → ∞, the above expression converges to 0.

Applications of the moment inequalities
Some basic inequalities
A Kolmogorov type inequality serves as the starting point of each application of our main theorem. Doob's inequalities for martingales are well known.
Qi-Man Shao (2000) proved a comparison theorem for moment inequalities between negatively associated and independent random variables. That general theorem implies the following Kolmogorov type inequality for negatively associated random variables X1, ..., Xn:

E[ max_{1≤k≤n} |∑_{i=1}^k Xi| ]^p ≤ 2^{3−p} ∑_{i=1}^n E|Xi|^p for 1 < p ≤ 2.   (3.8)

It was obtained by Matuła (1992) for p = 2.

Applications of the moment inequalities
Proofs of SLLN's
It is worth fixing the conditions explicitly, as we did in Theorem 3.2, because it helps to handle some difficult particular cases.
In Fazekas-Klesov (2000) the general SLLN was applied, among others, to prove a Brunk-Prokhorov type SLLN for martingales, to extend Shao's (1995) Marcinkiewicz-Zygmund type SLLN for ϱ-mixing sequences, and to extend Hansen's (1991) SLLN for mixingales.
Using the general Theorem 3.2, Kuczmaszewska (2005) proved an SLLN for negatively associated sequences. (She obtained the Kolmogorov type maximal inequality which serves as the base of the proof.) Moreover, she presented a general Marcinkiewicz-Zygmund type SLLN for ϱ-mixing sequences. (Both Kuczmaszewska (2005) and Fazekas-Klesov (2000) applied the Kolmogorov type maximal inequality for ϱ-mixing sequences given in Shao (1995).)


Rate of convergence in the SLLN

Theorem 3.4 (Shuhe-Ming). (Lemma 1.2 in Shuhe-Ming (2006).) Let X1, X2, ... be a sequence of random variables. Let the non-negative numbers α1, α2, ..., r > 0, and K > 0 be fixed. Assume that for each m ≥ 1 the first Kolmogorov type maximal inequality (3.2) for moments is satisfied. Let b1, b2, ... be a non-decreasing unbounded sequence of positive numbers. If (3.4) is satisfied, i.e. ∑_{l=1}^∞ αl/bl^r < ∞, then lim_{n→∞} Sn/bn = 0 a.s.; moreover,

Sn/bn = O(βn/bn) a.s.,

where

βn = max_{1≤k≤n} bk νk^{δ/r},  νk = ∑_{l=k}^∞ αl/bl^r,

and δ is an arbitrary number with 0 < δ < 1.

We remark that lim_{n→∞} βn/bn = 0.
Shuhe and Ming (2006) followed the approach described in Fazekas-Klesov (2000) but they utilized the full strength of Dini's theorem.

Lemma 3.2 (Dini). (Dini's theorem, see Fikhtengolts (1969), sect. 375.5.) Let c1, c2, ... be non-negative numbers, νn = ∑_{k=n}^∞ ck. If 0 < νn < ∞ for all n = 1, 2, ..., then for any 0 < δ < 1 we have ∑_{n=1}^∞ cn/νn^δ < ∞.

Theorem 3.4 was used to obtain convergence rates in SLLN's for random variables satisfying certain dependence conditions. We list those cases where the SLLN was obtained in Fazekas-Klesov (2000) while the rate of convergence in Shuhe-Ming (2006): sequences with a superadditive moment function (the proofs are based on an inequality in Moricz (1976)); the Marcinkiewicz-Zygmund SLLN for sequences with a superadditive moment function; the SLLN using Petrov's natural characteristics of the order of growth of sums.

The simple versions of the so-called almost sure limit theorems are based on an SLLN with logarithmic normalizing factors. In Fazekas-Klesov (2000), Theorem 8.1, a short proof is given for the next SLLN. Shuhe and Ming (2006) gave the rate of convergence.

Theorem 3.5 (Shuhe-Ming). (Shuhe-Ming (2006), Theorem 2.5.) Let, for some β > 0 and C > 0,

|cov(Xk, Xl)| ≤ C (l/k)^β, 1 ≤ l ≤ k.   (3.9)

Then

lim_{n→∞} (1/log n) ∑_{k=1}^n (Xk − EXk)/k = 0 a.s.   (3.10)

Moreover, for any 0 < δ < 1/2,

(1/log n) ∑_{k=1}^n (Xk − EXk)/k = O(1/(log n)^δ) a.s.   (3.11)


Hajek-Renyi type inequalities for the probability

Definition 3.5. We say that the random variables X1, . . . , Xn satisfy the first Kol-mogorov type maximal inequality for the probability, if for each m with 1 ≤ m ≤ n

P(

max1≤l≤m

|Sl| ≥ ε)≤ K

εr

m∑l=1

αl for any ε > 0 (3.12)

where α1, . . . , αn are non-negative numbers, r > 0, and K > 0.

Definition 3.6. We say that the random variables X1, . . . , Xn satisfy the first Hajek-Renyi type maximal inequality for the probability, if

P(

max1≤l≤n

∣∣∣∣Sl

βl

∣∣∣∣ ≥ ε

)≤ C

εr

n∑l=1

αl

βrl

for any ε > 0 (3.13)

where β1, . . . , βn is a non-decreasing sequence of positive numbers, α1, . . . , αn are non-negative numbers, r > 0, and C > 0.

Theorem 3.6. (Tomács-Líbor (2006).) Let the r.v.'s X_1, ..., X_n be fixed. If the first Kolmogorov type maximal inequality (3.12) for the probability is satisfied, then the first Hajek-Renyi type maximal inequality (3.13) for the probability is satisfied with C = 4K.

Theorem 3.7 (Tomács-Líbor (2006), SLLN). Let X_1, X_2, ... be a sequence of random variables. Let the non-negative numbers α_1, α_2, ..., r > 0, and K > 0 be fixed. Assume that for each m ≥ 1 the first Kolmogorov type maximal inequality (3.12) for the probability is satisfied. Let b_1, b_2, ... be a non-decreasing unbounded sequence of positive numbers. If

    ∑_{l=1}^{∞} α_l / b_l^r < ∞,    (3.14)

then

    lim_{n→∞} S_n / b_n = 0  a.s.    (3.15)
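As a simple illustration (added here, not from the text): for independent zero-mean variables with finite variances, the classical Kolmogorov inequality yields (3.12) with r = 2, α_l = EX_l² and K = 1; choosing b_n = n, condition (3.14) becomes ∑ EX_l²/l² < ∞, and Theorem 3.7 recovers Kolmogorov's SLLN. The sketch below uses a hypothetical bounded pattern of variances.

import numpy as np

# Sketch for Theorem 3.7 (illustration only): independent zero-mean variables,
# where Kolmogorov's inequality gives (3.12) with r = 2, alpha_l = E X_l^2, K = 1.
# With b_n = n, condition (3.14) reads sum_l E X_l^2 / l^2 < infinity.
rng = np.random.default_rng(1)
N = 10**6
l = np.arange(1, N + 1, dtype=float)
sigma = 1.0 + 0.5 * np.sin(l)               # a hypothetical bounded pattern of standard deviations
X = sigma * rng.standard_normal(N)

alpha = sigma**2                            # alpha_l = E X_l^2
b = l                                       # b_n = n
print("sum alpha_l / b_l^2 =", round(np.sum(alpha / b**2), 4), " (finite, so (3.14) holds)")

S = np.cumsum(X)
for n in (10**2, 10**4, 10**6):
    print(f"n = {n:8d}   S_n / b_n = {S[n-1] / n:+.5f}")
# S_n / b_n approaches 0, as stated in (3.15).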

A more general Hajek-Renyi type inequality for the probability

Definition 3.7. We say that the random variables X_1, ..., X_n satisfy the second Kolmogorov type maximal inequality for the probability, if for each k, m with 1 ≤ k < m ≤ n

    P( max_{k≤l≤m} |X_k + · · · + X_l| ≥ ε ) ≤ (K / ε^r) ∑_{l=k}^{m} α_l    for each ε > 0,    (3.16)

where α_1, ..., α_n are non-negative numbers, r > 0, and K > 0.

Definition 3.8. We say that the random variables X_1, ..., X_n satisfy the second Hajek-Renyi type maximal inequality for the probability, if for each m with 1 ≤ m ≤ n

    P( max_{m≤l≤n} |S_l / β_l| ≥ ε ) ≤ (C / ε^r) [ (1 / β_m^r) ∑_{l=1}^{m} α_l + ∑_{l=m+1}^{n} α_l / β_l^r ]    for any ε > 0,    (3.17)

where β_1, ..., β_n is a non-decreasing sequence of positive numbers, α_1, ..., α_n are non-negative numbers, r > 0, and C > 0.


Theorem 3.8. Let the random variables X_1, ..., X_n be fixed. If the second Kolmogorov type maximal inequality (3.16) for the probability is satisfied, then the second Hajek-Renyi type maximal inequality (3.17) for the probability is satisfied with C = (1 + 4^{1/r})^r K.

We show that by applying the second Hajek-Renyi type maximal inequality we can prove the SLLN.

Second proof of Theorem 3.7, assuming the second Kolmogorov type maximal inequality for the probability is satisfied. By Theorem 3.8,

    P( sup_{k≥m} |S_k / b_k| > ε ) ≤ (C / ε^r) [ (1 / b_m^r) ∑_{l=1}^{m} α_l + ∑_{l=m+1}^{∞} α_l / b_l^r ].

As m → ∞, the above expression converges to 0: the first term in the bracket tends to 0 by the Kronecker lemma, while the second term is the tail of the convergent series (3.14).
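The last step can be made concrete numerically. The sketch below (an illustration only) takes the hypothetical choice b_l = l, α_l = 1, r = 2, for which (3.14) holds, and evaluates the bracket for increasing m.

import numpy as np

# Numerical sketch of the final step of the proof (illustration only), with the
# hypothetical choice b_l = l, alpha_l = 1, r = 2, for which (3.14) holds.
N = 10**6
l = np.arange(1, N + 1, dtype=float)
alpha = np.ones(N)
b = l
r = 2.0

head = np.cumsum(alpha)                       # sum_{l <= m} alpha_l
tail = np.cumsum((alpha / b**r)[::-1])[::-1]  # tail[m] = sum_{l > m} alpha_l / b_l^r (truncated at N)

for m in (10, 10**2, 10**4, 10**5):
    bracket = head[m - 1] / b[m - 1]**r + tail[m]
    print(f"m = {m:7d}   bracket = {bracket:.6f}")
# Both terms behave like 1/m, so the bound on P(sup_{k >= m} |S_k / b_k| > eps) vanishes.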

Applications of the probability inequalities

Alternative proofs for the Hajek-Renyi inequalities and SLLN's

Most classical Hajek-Renyi type inequalities were proved by direct methods. Below we list some known results and point out how we can insert them into our general framework. Note, however, that our general method reproduces the Hajek-Renyi type inequalities only up to an absolute constant multiplier.

First we remark that in Hajek-Renyi (1955) a direct proof is given for inequality (3.1) if the random variables are independent. Now we see that the original Hajek-Renyi inequality is a consequence of the original Kolmogorov inequality (up to a constant).

Another classical Hajek-Renyi type inequality was proved by Chow (1960) for submartingales. We can see that it is a consequence of Doob's inequality (up to a constant).

Kounias and Weng (1969) proved the following Hajek-Renyi type inequality without assuming any dependence condition. Let X_1, ..., X_n be random variables. Let r > 0 be fixed. Let the moments v_i = E|X_i|^r be finite for each i. Let s = 1 if 0 < r ≤ 1 and s = r if r > 1. Then

    P( max_{1≤l≤n} |S_l / β_l| ≥ ε ) ≤ (1 / ε^r) ( ∑_{l=1}^{n} (v_l / β_l^r)^{1/s} )^s    for any ε > 0,    (3.18)

where β_1, ..., β_n is an arbitrary non-decreasing sequence of positive numbers.

This inequality is of type (3.13) for r ≤ 1. But for r > 1 it seems to be different from type (3.13). However, we shall see that it can be inserted into our framework for r > 1, too. The appropriate Kolmogorov type inequality is of the form (3.12) with

    α_k = ( ∑_{i=1}^{k} a_i )^r − ( ∑_{i=1}^{k−1} a_i )^r,    a_i = (E|X_i|^r)^{1/r}.

We can prove that

    ∑_{l=1}^{n} α_l / β_l^r ≤ ( ∑_{l=1}^{n} a_l / β_l )^r.

So in this case (3.13) implies (3.18) up to a multiplier 4.
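The claimed comparison is easy to check numerically. The sketch below (an illustration, with randomly generated positive a_i and non-decreasing β_l) evaluates both sides for several values of r > 1.

import numpy as np

# Numerical check of the comparison (illustration with random data): with
# alpha_k = (a_1 + ... + a_k)^r - (a_1 + ... + a_{k-1})^r one always gets
# sum_l alpha_l / beta_l^r <= (sum_l a_l / beta_l)^r for r > 1 and non-decreasing beta.
rng = np.random.default_rng(2)
n = 50
for trial in range(5):
    r = 1.0 + 4.0 * rng.random()              # some exponent r > 1
    a = rng.random(n) + 0.1                   # a_i = (E|X_i|^r)^{1/r}, arbitrary positive here
    beta = np.sort(rng.random(n) + 0.1)       # non-decreasing positive beta_l
    A = np.cumsum(a)
    alpha = A**r - np.concatenate(([0.0], A[:-1]))**r
    lhs = np.sum(alpha / beta**r)
    rhs = np.sum(a / beta)**r
    print(f"r = {r:.2f}   lhs = {lhs:12.3f}   rhs = {rhs:12.3f}   lhs <= rhs: {bool(lhs <= rhs)}")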

Szynal (1973) obtained the following Hajek-Renyi type inequality without assuming independence and moment conditions. Let X_1, ..., X_n be arbitrary random variables. Let r > 0 be fixed. Let s = 1 if 0 < r ≤ 1 and s = r if r > 1. Then

    P( max_{m≤l≤n} |S_l / β_l| ≥ 3ε ) ≤ 2 [ ∑_{l=1}^{m} E^{1/s}( |X_l|^r / ((β_m ε)^r + |X_l|^r) ) + ∑_{l=m+1}^{n} E^{1/s}( |X_l|^r / ((β_l ε)^r + |X_l|^r) ) ]^s    (3.19)


for any ε > 0, where β_1, ..., β_n is an arbitrary non-decreasing sequence of positive numbers. It seems that (3.19) cannot be inserted into the framework of our theorem.

Bickel (1970) obtained a Hajek-Renyi type generalization of Levy's inequality. Chandra and Ghosal (1996) obtained a Kolmogorov type inequality and a Marcinkiewicz-Zygmund type SLLN for asymptotically almost negatively associated (AANA) random variables.

The sequence X_1, X_2, ... is called AANA if there exists a non-negative sequence q_n → 0 such that

    cov( f(X_m), g(X_{m+1}, ..., X_{m+k}) ) ≤ q_m ( var(f(X_m)) var(g(X_{m+1}, ..., X_{m+k})) )^{1/2}    (3.20)

for all m, k ≥ 1 and for all coordinatewise increasing continuous functions f and g, whenever the right hand side of the above inequality is finite.

The negatively associated (in particular the independent) sequences are AANA. The Kolmogorov type inequality for AANA sequences is the following.

Theorem 3.9. (Chandra-Ghosal (1996), Theorem 1.) Let X_1, ..., X_n be zero mean square integrable r.v.'s such that (3.20) holds for every 1 ≤ m < m + k ≤ n. Let A_n = ∑_{l=1}^{n−1} q_l^2. Then

    P( max_{1≤l≤n} |S_l| ≥ ε ) ≤ (2 / ε^2) ( A_n + √(1 + A_n^2) )^2 ∑_{l=1}^{n} EX_l^2    for any ε > 0.    (3.21)
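As a minimal check of (3.21) (an illustration, not from Chandra-Ghosal): for independent variables we may take q_l = 0, hence A_n = 0, and the right-hand side reduces to (2/ε²) ∑ EX_l². A Monte Carlo estimate of the left-hand side stays below this bound.

import numpy as np

# Monte Carlo sketch of (3.21) in the simplest AANA case (illustration only):
# independent variables, so q_l = 0, A_n = 0, and the bound becomes (2/eps^2) * sum_l E X_l^2.
rng = np.random.default_rng(3)
n, reps, eps = 200, 20_000, 40.0
X = rng.standard_normal((reps, n))            # zero mean, E X_l^2 = 1
max_abs_S = np.max(np.abs(np.cumsum(X, axis=1)), axis=1)

empirical = np.mean(max_abs_S >= eps)
bound = 2.0 / eps**2 * n                      # (2/eps^2) * sum_l E X_l^2 with A_n = 0
print(f"empirical P(max |S_l| >= {eps}) = {empirical:.4f}   bound from (3.21) = {bound:.4f}")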

Using (3.21), Kim, Ko and Lee (2004) obtained Hajek-Renyi inequalities of the form (3.13) and (3.17), moreover, if A = ∑_{l=1}^{∞} q_l^2 < ∞, an SLLN of the form of Theorem 3.7. Now we can see that these results are immediate consequences of our general theorems.

Rate of convergence in the SLLN's

Combining the method of Theorem 3.7 and the ideas of Shuhe and Ming (2006) we can obtain further results concerning the rate of convergence in the SLLN.

Theorem 3.10. Let X_1, X_2, ... be a sequence of random variables. Let the non-negative numbers α_1, α_2, ..., r > 0, and K > 0 be fixed. Assume that for each m ≥ 1 the first Kolmogorov type maximal inequality (3.12) for the probability is satisfied. Let b_1, b_2, ... be a non-decreasing unbounded sequence of positive numbers. If (3.14) is satisfied, i.e. ∑_{l=1}^{∞} α_l / b_l^r < ∞, then lim_{n→∞} S_n / b_n = 0 a.s., moreover

    S_n / b_n = O(β_n / b_n)  a.s.,

where β_n is defined in Theorem 3.4.

This theorem can be used to obtain convergence rates in the SLLN for random variables satisfying certain dependence conditions. For example, Christofides (2000) obtained a Kolmogorov type maximal inequality for demimartingales; using that inequality, an SLLN can be obtained for demimartingales. Now we can see that our general theorems immediately imply the SLLN from the Kolmogorov type inequality, and moreover the rate of convergence in the SLLN. The sequence of partial sums of zero mean associated random variables is a demimartingale. We mention that for associated random variables


Kolmogorov type inequalities were obtained by Newman and Wright (1981) and by Matuła (1996). From those inequalities Matuła (1996) and Shuhe-Ming (2006) obtained the SLLN and the rate of convergence in the SLLN, respectively. Now we see that those theorems are covered by our general method.

We see that for AANA sequences, using the Kolmogorov type inequality of Chandra-Ghosal (1996), our method gives the SLLN and the rate of convergence in it. We mention that negatively associated sequences are AANA. In the special case of negatively associated sequences, the Kolmogorov inequality was obtained by Matuła (1992), the Hajek-Renyi type inequality and the SLLN by Liu, Gan and Chen (1999), while the rate of convergence in the SLLN by Shuhe and Ming (2006). Those results are covered by our method.

References

Andrews, D. W. K., (1988). Laws of large numbers for dependent non-identically distributed random variables. Econometric Theory, 4, 458–467.
Bickel, P. J., (1970). A Hajek-Renyi extension of Levy's inequality and some applications. Acta Math. Acad. Sci. Hungar. 21, 199–206.
Brunk, H. D., (1948). The strong law of large numbers. Duke Math. J. 15, 181–195.
Burkholder, D. L., (1973). Distribution function inequalities for martingales. Ann. Probab. 1(1), 19–42.
Chandra, T. K. and Ghosal, S., (1996). Extensions of the strong law of large numbers of Marcinkiewicz and Zygmund for dependent variables. Acta Math. Hungar. 71(4), 327–336.
Chow, Y. S., (1960). A martingale inequality and the law of large numbers. Proc. Amer. Math. Soc. 11, 107–111.
Christofides, T. C., (2000). Maximal inequalities for demimartingales and a strong law of large numbers. Statist. Probab. Lett. 50(4), 357–363.
Csorgo, M., (1968). On the strong law of large numbers and the central limit theorem for martingales. Trans. Amer. Math. Soc. 131, 259–275.
Egorov, V. A., (1970). On the strong law of large numbers and the law of the iterated logarithm for a sequence of independent random variables. Probab. Theory Appl. 15(3), 520–527.
Fazekas, I. and Klesov, O. I., (1998). A general approach to the strong laws of large numbers. Technical Report No. 4, Kossuth Lajos University, Hungary.
Fazekas, I. and Klesov, O. I., (2000). A general approach to the strong laws of large numbers. Teor. Veroyatnost. i Primenen. 45(3), 568–583; Theory Probab. Appl. 45(3) (2002), 436–449.
Fikhtengolts, G. M., (1969). A Course of Differential and Integral Calculus. Vol. 2. (Russian) Nauka, Moscow.
Hajek, J. and Renyi, A., (1955). Generalization of an inequality of Kolmogorov. Acta Math. Acad. Sci. Hungar. 6(3-4), 281–283.
Hall, P. and Heyde, C. C., (1980). Martingale Limit Theory and its Application. Academic Press, New York.
Hansen, B. E., (1991). Strong laws for dependent heterogeneous processes. Econometric Theory, 7, 213–221. Erratum: Econometric Theory, 8 (1992), 421–422.


Kim, Tae-Sung; Ko, Mi-Hwa and Lee, Il-Hyun, (2004). On the strong law for asymptotically almost negatively associated random variables. Rocky Mountain J. Math. 34(3), 979–989.
Kounias, E. G. and Weng, T.-S., (1969). An inequality and almost sure convergence. Ann. Math. Statist. 40(3), 1091–1093.
Kuczmaszewska, A., (2005). The strong law of large numbers for dependent random variables. Statist. Probab. Lett. 73(3), 305–314.
Liu, Jingjun; Gan, Shixin and Chen, Pingyan, (1999). The Hajek-Renyi inequality for the NA random variables and its application. Statist. Probab. Lett. 43(1), 99–105.
Longnecker, M. and Serfling, R. J., (1977). General moment and probability inequalities for the maximum partial sum. Acta Math. Acad. Sci. Hungar. 30(1-2), 129–133.
Matuła, P., (1992). A note on the almost sure convergence of sums of negatively dependent random variables. Statist. Probab. Lett. 15(3), 209–213.
Matuła, P., (1996). Convergence of weighted averages of associated random variables. Probab. Math. Statist. 16(2), 337–343.
McLeish, D. L., (1975). A maximal inequality and dependent strong laws. Ann. Probab. 3(5), 829–839.
Mori, T. F., (1993). On the strong law of large numbers for logarithmically weighted sums. Annales Univ. Sci. Budapest, 36, 35–46.
Moricz, F., (1976). Moment inequalities and the strong laws of large numbers. Z. Wahrscheinlichkeitstheorie verw. Gebiete, 35, 299–314.
Newman, C. M. and Wright, A. L., (1982). Associated random variables and martingale inequalities. Z. Wahrsch. Verw. Gebiete, 59(3), 361–371.
Petrov, V. V., (1975). Sums of Independent Random Variables. Springer-Verlag, New York.
Prokhorov, Yu. V., (1950). On the strong law of large numbers. (Russian.) Izv. AN SSSR, Ser. Matem. 14(6), 523–536.
Shao, Qi-Man, (1995). Maximal inequalities for partial sums of ϱ-mixing sequences. Ann. Probab. 23(2), 948–965.
Shao, Qi-Man, (2000). A comparison theorem on moment inequalities between negatively associated and independent random variables. J. Theoret. Probab. 13(2), 343–356.
Hu Shuhe and Hu Ming, (2006). A general approach rate to the strong law of large numbers. Statist. Probab. Lett. 76(8), 843–851.
Szynal, D., (1973). An extension of the Hajek-Renyi inequality for one maximum of partial sums. Ann. Statist. 1, 740–744.
Teicher, H., (1968). Some new conditions for the strong law. Proc. Nat. Acad. Sci. U.S.A. 59(3), 705–707.
Tomács, T. and Líbor, Zs., (2006). A Hajek-Renyi type inequality and its applications. Ann. Math. Inf. 33, 141–149.
Sung, Soo Hak; Hu, Tien-Chung and Volodin, Andrei, (2008). A note on the growth rate in the Fazekas-Klesov general law of large numbers and on the weak law of large numbers for tail series. Publ. Math. Debrecen 73(1-2), 1–10.
