
Foundations of Data Analysis (MA4800)

Massimo Fornasier

Fakultät für Mathematik
Technische Universität München

[email protected]
http://www-m15.ma.tum.de/

Slides of Lecture, June 20, 2017


Preliminaries on Linear Algebra (LA)

We take for granted familiarity with the basics of LA taught in standard courses, in particular:

• vector spaces, spans and linear combinations, linear bases, linear maps and matrices, eigenvalues, complex numbers, scalar products, theory of symmetric matrices.

For more details we refer to the lecture notes (in German)

G. Kemper, Lineare Algebra für Informatik, TUM, 2017.

of the course taught for Computer Scientists at TUM, or to any other standard international textbook on LA.


Matrix Notations

The index sets I , J, L, . . . are assumed to be finite.

As soon as complex conjugate values appear¹, the scalar field is restricted to K ∈ {R, C}.

The set K^{I×J} is the vector space of matrices A ∈ K^{I×J}, whose entries are denoted by A_ij, for i ∈ I and j ∈ J. Vice versa, numbers a_ij ∈ K may be used to define A := (a_ij)_{i∈I, j∈J} ∈ K^{I×J}.

If A ∈ K^{I×J}, the transposed matrix A^T ∈ K^{J×I} has entries (A^T)_{ji} := A_ij. A matrix A ∈ K^{I×I} is symmetric if A^T = A. The Hermitian transposed matrix A^H ∈ K^{J×I} is the entrywise complex conjugate of A^T. If K = R then clearly A^H = A^T. A Hermitian matrix satisfies A^H = A.

Often we will need to consider the rows A(i) and the columns A(j) of a matrix, defined respectively as the vectors A(i) = (a_ij)_{j∈J} and A(j) = (a_ij)_{i∈I} = (A^T)(j).

¹In the case K = R, ᾱ = α for all α ∈ K.


Matrix Notations

We assume that the standard matrix-vector and matrix-matrix multiplications are done in the usual way:

(Ax)_i = ∑_{j∈J} a_ij x_j  and  (AB)_{iℓ} = ∑_{j∈J} A_ij B_{jℓ},

for x ∈ K^J, A ∈ K^{I×J}, and B ∈ K^{J×L}.

The Kronecker symbol is defined by

δ_ij = 1 if i = j ∈ I, and 0 otherwise.

The unit vector e(i) ∈ K^I is defined by

e(i) = (δ_ij)_{j∈I}.

The symbol I = (δ_ij)_{i∈I, j∈I} is used for the identity matrix. Since matrices and index sets do not appear in the same place, the simultaneous use of the symbol I does not create confusion (example: I ∈ K^{I×I}).


Matrix Notations

The range of a matrix A ∈ K^{I×J} is

range(A) = {Ax : x ∈ K^J} = span{A(j) : j ∈ J}.

Hence, the range of a matrix is the vector space spanned by its columns.

The Euclidean scalar product in K^I is given by

⟨x, y⟩ = y^H x = ∑_{i∈I} x_i ȳ_i,  (1)

where for K = R the conjugate sign can be ignored.

It is often useful and very important to notice that the matrix-vector product Ax and the matrix-matrix product AB can also be expressed in terms of scalar products of the rows of A with x and of the rows of A with the columns of B, i.e., according to our terminology,

(Ax)_i = ⟨A(i), x⟩,  (AB)_{iℓ} = ⟨A(i), B(ℓ)⟩.  (2)
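A minimal NumPy sketch (not part of the slides) verifying (2) for a real matrix, where the Euclidean scalar product reduces to the plain dot product; all array shapes and names are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))   # A in R^{I x J}, #I = 4, #J = 3
B = rng.standard_normal((3, 5))   # B in R^{J x L}
x = rng.standard_normal(3)        # x in R^J

# (Ax)_i = <A(i), x>: scalar product of the i-th row of A with x
Ax_rowwise = np.array([A[i, :] @ x for i in range(A.shape[0])])
assert np.allclose(A @ x, Ax_rowwise)

# (AB)_{il} = <A(i), B(l)>: rows of A against columns of B
AB_entrywise = np.array([[A[i, :] @ B[:, l] for l in range(B.shape[1])]
                         for i in range(A.shape[0])])
assert np.allclose(A @ B, AB_entrywise)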


Matrix Notations

Two vectors x, y ∈ K^I are orthogonal (and we may write x ⊥ y) if ⟨x, y⟩ = 0.

We often consider sets which are mutually orthogonal. In this case, for X, Y ⊂ K^I we say that X is orthogonal to Y, and we write X ⊥ Y, if ⟨x, y⟩ = 0 for all x ∈ X and y ∈ Y.

When X and Y are linear subspaces, their orthogonality can be simply checked by showing that they possess bases which are mutually orthogonal. This is a very important principle we will use often.

A family of vectors X = {x_ν}_{ν∈F} ⊂ K^I is orthogonal if the vectors x_ν are pairwise orthogonal, i.e., ⟨x_ν, x_ν′⟩ = 0 for ν ≠ ν′. The family is additionally called orthonormal if ⟨x_ν, x_ν⟩ = 1 for all ν ∈ F.


Matrix Notations

A matrix A ∈ K^{I×J} is called orthogonal if the columns of A are orthonormal, equivalently if

A^H A = I ∈ K^{J×J}.

An orthogonal square matrix A ∈ K^{I×I} is called unitary. Differently from merely orthogonal matrices, unitary matrices satisfy

A^H A = AA^H = I,

i.e., A^H = A^{-1} is the inverse matrix of A.

Assume that the index sets satisfy either I ⊂ J or J ⊂ I. Then a (rectangular) matrix A ∈ K^{I×J} is diagonal if A_ij = 0 for all i ≠ j.
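A quick NumPy illustration (not from the slides) of the difference between a rectangular orthogonal matrix and a unitary one; obtaining the matrices from a QR factorization is an assumption made only for this example.

import numpy as np

rng = np.random.default_rng(1)

# Rectangular orthogonal matrix: orthonormal columns, A^H A = I but A A^H != I
Q, _ = np.linalg.qr(rng.standard_normal((5, 3)))     # Q in R^{5x3}
print(np.allclose(Q.T @ Q, np.eye(3)))               # True
print(np.allclose(Q @ Q.T, np.eye(5)))               # False in general

# Unitary (here: real orthogonal square) matrix: both products give the identity
U, _ = np.linalg.qr(rng.standard_normal((4, 4)))
print(np.allclose(U.T @ U, np.eye(4)), np.allclose(U @ U.T, np.eye(4)))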


Matrix Rank

Proposition

Let A ∈ K^{I×J}. The following statements are equivalent and may all be used as a definition of the matrix rank r = rank(A).

• (a) r = dim range(A);

• (b) r = dim range(A^H);

• (c) r is the maximal number of linearly independent rows of A;

• (d) r is the maximal number of linearly independent columns of A;

• (e) r is minimal with the property

A = ∑_{i=1}^r a_i b_i^H,  where a_i ∈ K^I, b_i ∈ K^J;

• (f) r is maximal with the property that there exists an invertible r × r submatrix of A;

• (g) r is the number of positive singular values (soon!).
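These equivalent characterizations can be sanity-checked numerically; a minimal NumPy sketch (not from the slides), building a rank-2 matrix as in (e) and counting its positive singular values as in (g):

import numpy as np

rng = np.random.default_rng(2)

# Build A = sum_{i=1}^2 a_i b_i^H, so rank(A) <= 2 by characterization (e)
a1, a2 = rng.standard_normal(6), rng.standard_normal(6)
b1, b2 = rng.standard_normal(4), rng.standard_normal(4)
A = np.outer(a1, b1) + np.outer(a2, b2)

# (g): number of positive singular values (up to a numerical tolerance)
sing_vals = np.linalg.svd(A, compute_uv=False)
print(np.sum(sing_vals > 1e-10))          # 2
print(np.linalg.matrix_rank(A))           # 2, NumPy's own rank estimate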


Matrix Rank

The rank is bounded by the maximal rank, which, for matrices, is always given by r_max = min{#I, #J}, and this bound is attained by full-rank matrices.

As linear independence may depend on the field K, one may question whether the rank of a real-valued matrix depends on considering it in R or in C. For matrices it turns out that it does not matter: the rank is the same.

We will often work with matrices of bounded rank r ≤ k. We denote such a set by R_k = {A ∈ K^{I×J} : rank(A) ≤ k}. Notice that this set is not a vector space (exercise!).


Norms

In the following we consider an abstract vector space V over the field K. As a typical example we may keep in mind the Euclidean plane V = R², endowed with the classical Euclidean norm.

We recall below the axioms of an abstract norm on V: a norm is a map ‖ · ‖ : V → [0, ∞) with the following properties:

• ‖v‖ = 0 if and only if v = 0;
• ‖λv‖ = |λ| ‖v‖ for all v ∈ V and λ ∈ K;
• ‖v + w‖ ≤ ‖v‖ + ‖w‖ for all v, w ∈ V (triangle inequality).

A norm is always continuous, as a consequence of the (inverse) triangle inequality:

|‖v‖ − ‖w‖| ≤ ‖v − w‖, for all v, w ∈ V.

A vector space V endowed with a norm ‖ · ‖ (we write the pair (V, ‖ · ‖) to indicate it) is called a normed vector space. As mentioned above, a typical example is the Euclidean plane.


Scalar products and (pre)-Hilbert spaces

A normed vector space (V, ‖ · ‖) is a pre-Hilbert space if its norm is defined by

‖v‖ = √⟨v, v⟩,  v ∈ V,

where ⟨·, ·⟩ : V × V → K is a scalar product on V, i.e., it fulfills the properties

• ⟨v, v⟩ > 0 for v ≠ 0;
• ⟨v, w⟩ is the complex conjugate of ⟨w, v⟩, for v, w ∈ V;
• ⟨u + λv, w⟩ = ⟨u, w⟩ + λ⟨v, w⟩ for u, v, w ∈ V and λ ∈ K;
• ⟨w, u + λv⟩ = ⟨w, u⟩ + λ̄⟨w, v⟩ for u, v, w ∈ V and λ ∈ K.

The triangle inequality for the norm follows from the Schwarz inequality

|⟨v, w⟩| ≤ ‖v‖ ‖w‖,  v, w ∈ V.


Scalar products and (pre)-Hilbert spaces

We also describe the pre-Hilbert space by the pair (V, ⟨·, ·⟩). A typical example of a scalar product over K^I is the one we introduced in (1), which generates the Euclidean norm on K^I:

‖v‖_2 = √(∑_{i∈I} |v_i|²),  v ∈ K^I.

As before, one can define orthogonality between vectors, and orthogonal and orthonormal sets of vectors.


Hilbert spaces

A Hilbert space is a pre-Hilbert space (V, ⟨·, ·⟩) whose topology, induced by the associated norm ‖ · ‖, is complete, i.e., any Cauchy sequence in such a vector space is convergent.

It is important to stress that every finite dimensional pre-Hilbert space (V, ⟨·, ·⟩) is actually complete, hence always a Hilbert space.

Thus, those who are not familiar with topological notions (e.g., completeness, Cauchy sequences, etc.) should, in what follows, just reason according to their Euclidean geometric intuition.


Projections

Recall now that a set C in a vector space is convex if, for all v, w ∈ C and all t ∈ [0, 1], tv + (1 − t)w ∈ C.

Given a closed convex set C in a Hilbert space (V, ⟨·, ·⟩), one defines the projection of any vector v onto C as

P_C(v) = argmin_{w∈C} ‖v − w‖.

This definition is well posed, as the projection is actually unique, and an equivalent definition is given by fulfilling the following inequality:

Re ⟨z − P_C(v), v − P_C(v)⟩ ≤ 0,

for all z ∈ C. This is left as an exercise.
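A small NumPy sketch (not from the slides) illustrating the variational inequality for the projection onto the closed Euclidean unit ball, for which the projection has the explicit form v / max(1, ‖v‖_2); the specific set and points are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(3)

def project_unit_ball(v):
    """Orthogonal projection onto C = {w : ||w||_2 <= 1}."""
    return v / max(1.0, np.linalg.norm(v))

v = 3.0 * rng.standard_normal(5)          # a point (very likely) outside the ball
p = project_unit_ball(v)

# Check Re<z - P_C(v), v - P_C(v)> <= 0 for random points z in C
for _ in range(1000):
    z = rng.standard_normal(5)
    z = z / np.linalg.norm(z) * rng.uniform(0, 1)   # random direction, scaled to lie inside C
    assert np.dot(z - p, v - p) <= 1e-12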


Projections onto subspaces: Pythagoras-Fourier Theorem

Of extreme importance for us are the orthogonal projections onto subspaces.

In case C = W ⊂ V is actually a closed linear subspace of V, the projection onto W can be readily computed as soon as an orthonormal basis for W is available. Let {w_ν}_{ν∈F} be a (countable) orthonormal basis for W; then

P_W(v) = ∑_{ν∈F} ⟨v, w_ν⟩ w_ν,  for all v ∈ V.  (3)

Moreover, the Pythagoras-Fourier Theorem holds:

‖P_W(v)‖² = ∑_{ν∈F} |⟨v, w_ν⟩|²,  for all v ∈ V.
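A NumPy sketch (not from the slides) of formula (3) and of the Pythagoras-Fourier identity, using an orthonormal basis of a random 3-dimensional subspace of R^6 obtained via QR; the dimensions are an arbitrary choice for the demo.

import numpy as np

rng = np.random.default_rng(4)

# Orthonormal basis w_1, ..., w_3 of a 3-dimensional subspace W of R^6
W, _ = np.linalg.qr(rng.standard_normal((6, 3)))
v = rng.standard_normal(6)

# P_W(v) = sum_nu <v, w_nu> w_nu  -- formula (3)
coeffs = W.T @ v                      # the scalar products <v, w_nu>
Pv = W @ coeffs

# Pythagoras-Fourier: ||P_W(v)||^2 = sum_nu |<v, w_nu>|^2
print(np.isclose(np.linalg.norm(Pv) ** 2, np.sum(coeffs ** 2)))   # True
# The residual v - P_W(v) is orthogonal to W
print(np.allclose(W.T @ (v - Pv), 0))                              # True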


Pythagoras-Fourier Theorem and orthonormal expansions

We will make extensive use of this characterization of orthogonal projections and of the Pythagoras-Fourier Theorem, especially in the case where W = V.

In this case, obviously, P_V = I and we have the orthonormal expansion of any vector v ∈ V,

v = ∑_{ν∈F} ⟨v, w_ν⟩ w_ν,

and the norm equivalence

‖v‖² = ∑_{ν∈F} |⟨v, w_ν⟩|².


Trace of a matrix

The effort of defining abstract scalar products and norms allows us now to introduce several norms for matrices.

First of all we need to introduce the concept of trace, which is the map tr : K^{I×I} → K defined by

tr(A) = ∑_{i∈I} A_ii,

i.e., it is the sum of the diagonal elements of the matrix A.


Trace of a matrix: properties

The trace enjoys several properties, which we collect in the following:

Proposition (Properties of the trace)

(a) tr(AB) = tr(BA), for any A ∈ K^{I×J} and B ∈ K^{J×I};

(b) tr(ABC) = tr(BCA), for any matrices A, B, C of compatible sizes and index sets; this property is called the circularity property of the trace;

(c) as a consequence of the previous property we obtain the invariance of the trace under unitary transformations, i.e., tr(A) = tr(UAU^H) for A ∈ K^{I×I} and any unitary matrix U ∈ K^{I×I};

(d) tr(A) = ∑_{i∈I} λ_i, where {λ_i : i ∈ I} is the set of eigenvalues of A (counted with multiplicity).
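A NumPy check (not from the slides) of properties (a)-(d) on small random matrices; the shapes are illustrative.

import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 3))
C = rng.standard_normal((3, 3))
M = rng.standard_normal((3, 3))
U, _ = np.linalg.qr(rng.standard_normal((3, 3)))    # a (real) unitary matrix

print(np.isclose(np.trace(A @ B), np.trace(B @ A)))                  # (a)
print(np.isclose(np.trace(A @ B @ C), np.trace(B @ C @ A)))          # (b) circularity
print(np.isclose(np.trace(M), np.trace(U @ M @ U.conj().T)))         # (c) unitary invariance
print(np.isclose(np.trace(M), np.sum(np.linalg.eigvals(M)).real))    # (d) sum of eigenvalues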


Matrix norms

The Frobenius norm of a matrix is essentially the Euclidean norm computed over the entries of the matrix (considered as a vector):

‖A‖_F := √(∑_{i∈I, j∈J} |A_ij|²),  A ∈ K^{I×J}.

It is also known as the Schur norm or Hilbert-Schmidt norm. This norm is generated by the scalar product (that's why we made the effort of introducing abstract scalar products!)

⟨A, B⟩_F := ∑_{i∈I} ∑_{j∈J} A_ij B̄_ij = tr(AB^H) = tr(B^H A).  (4)

In particular,

‖A‖_F² = tr(AA^H) = tr(A^H A)  (5)

holds.


Matrix norms

Let ‖ · ‖_X and ‖ · ‖_Y be vector norms on the vector spaces X = K^I and Y = K^J, respectively. Then the associated matrix norm is

‖A‖ := ‖A‖_{X→Y} := sup_{z≠0} ‖Az‖_Y / ‖z‖_X,  A ∈ K^{I×J}.

If both ‖ · ‖_X and ‖ · ‖_Y coincide with the Euclidean norm ‖ · ‖_2 on X = K^I and Y = K^J, respectively, then the associated matrix norm is called the spectral norm and it is denoted by ‖A‖_∞.

As the spectral norm has a central importance in many applications, it is often denoted just by ‖A‖.


Unitary invariance and submultiplicativity

Both the matrix norms we introduced so far are invariant with respect to unitary transformations, i.e., ‖A‖_F = ‖UAV^H‖_F and ‖A‖_∞ = ‖UAV^H‖_∞ for unitary matrices U, V.

Moreover, both are submultiplicative, i.e., ‖AB‖_F ≤ ‖A‖_∞ ‖B‖_F ≤ ‖A‖_F ‖B‖_F and ‖AB‖ ≤ ‖A‖ ‖B‖, for matrices A, B of compatible sizes.


Introduction of the Singular Value Decomposition

We introduce and analyze the singular value decomposition (SVD) of a matrix A, which is the factorization of A into the product of three matrices, A = UΣV^H, where U, V are orthogonal matrices of compatible sizes and the matrix Σ is diagonal with positive real entries.

All these terms (orthogonal matrix, diagonal matrix) have been introduced in the previous lectures.


Geometrical derivation: principal component analysis

To gain insight into the SVD, treat the n = #I rows of a matrix A ∈ K^{I×J} as points in a d-dimensional space, where d = #J, and consider the problem of finding the best k-dimensional subspace with respect to this set of points.

Here best means minimizing the sum of the squares of the perpendicular distances of the points to the subspace. An orthonormal basis for this subspace is built from the fundamental directions of maximal variance of the group of high-dimensional points; these directions are called principal components.


Geometrical derivation: principal component analysis


Pythagoras Theorem, scalar products, and orthogonal projections

We begin with a special case of the problem where the subspace is1-dimensional, a line through the origin.

We will see later that the best-fitting k-dimensional subspace can be found by k applications of the best-fitting line algorithm.

Finding the best-fitting line through the origin with respect to a set of points {x_i := A(i) ∈ K² : i ∈ I} in the Euclidean plane K² means minimizing the sum of the squared distances of the points to the line. Here distance is measured perpendicular to the line. The problem is called the best least squares fit.


Pythagoras Theorem, scalar products, and orthogonal projections

In the best least squares fit, one is minimizing the distance to a subspace. Now, consider projecting a point x_i orthogonally onto a line through the origin. Then, by Pythagoras theorem,

x_{i1}² + x_{i2}² + · · · + x_{id}² = (length of projection)² + (distance of point to line)².


Pythagoras Theorem, scalar products, and orthogonal projections

In particular, from the formula above, one has

(distance of point to line)² = x_{i1}² + x_{i2}² + · · · + x_{id}² − (length of projection)².

To minimize the sum of the squares of the distances to the line, one could minimize ∑_i (x_{i1}² + x_{i2}² + · · · + x_{id}²) minus the sum of the squares of the lengths of the projections of the points onto the line.

However, the first term of this difference is a constant (independent of the line), so minimizing the sum of the squares of the distances is equivalent to maximizing the sum of the squares of the lengths of the projections onto the line.


Pythagoras Theorem, scalar products, and orthogonal projections


Averaging through the points and matrix notation

For best-fit subspaces, we could maximize the sum of the squared lengths of the projections onto the subspace instead of minimizing the sum of squared distances to the subspace.

Consider the n = #I rows of the matrix A ∈ K^{I×J} as points in a d-dimensional space, where d = #J, and consider the best-fit line through the origin.

Let v be a unit vector along this line. The length of the projection of A(i), the i-th row of A, onto v is, according to our definition of scalar product, |⟨A(i), v⟩|.

From this, we see that the sum of the squared lengths of the projections is

‖Av‖_2² = ∑_{i∈I} |⟨A(i), v⟩|².  (6)


Best-fit and first singular direction

The best-fit line is the one maximizing ‖Av‖_2² and hence minimizing the sum of the squared distances of the points to the line.

With this in mind, define the first singular vector v_1 of A, which is a column vector, as the unit vector along the best-fit line through the origin for the n points in d-dimensional space that are the rows of A. Thus

v_1 = argmax_{‖v‖_2=1} ‖Av‖_2.

The value σ_1(A) = ‖Av_1‖_2 is called the first singular value of A. Note that σ_1(A)² is the sum of the squares of the lengths of the projections of the points onto the line determined by v_1.
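A NumPy sketch (not from the slides) computing v_1 and σ_1(A) with np.linalg.svd and checking, on random unit directions, that none does better than v_1; the data matrix is an arbitrary Gaussian example.

import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((50, 8))        # n = 50 points in d = 8 dimensions (the rows of A)

# Right singular vectors are the rows of Vt; v_1 is the first one
U, S, Vt = np.linalg.svd(A, full_matrices=False)
v1, sigma1 = Vt[0], S[0]

print(np.isclose(np.linalg.norm(A @ v1), sigma1))    # sigma_1(A) = ||A v_1||_2

# No random unit direction gives a larger sum of squared projections
for _ in range(2000):
    v = rng.standard_normal(8)
    v /= np.linalg.norm(v)
    assert np.linalg.norm(A @ v) <= sigma1 + 1e-12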


First link: linear algebra VS optimization!!!

This is the first time we use an optimization criterion to approach a data interpretation problem and to create a connection between linear algebra and optimization.

As we will discuss along this course, optimization plays indeed a huge role in data analysis and we will have to discuss it in more detail later on.


A greedy approach to best k-dimensional fit

The greedy approach to finding the best-fit 2-dimensional subspace for a matrix A takes v_1 as the first basis vector for the 2-dimensional subspace and finds the best 2-dimensional subspace containing v_1.

The fact that we are using the sum of squared distances will again help, thanks to Pythagoras Theorem: for every 2-dimensional subspace containing v_1, the sum of squared lengths of the projections onto the subspace equals the sum of squared projections onto v_1 plus the sum of squared projections along a vector perpendicular to v_1 in the subspace.

Thus, instead of looking for the best 2-dimensional subspace containing v_1, look for a unit vector, call it v_2, perpendicular to v_1 that maximizes ‖Av‖_2 among all such unit vectors.

Using the same greedy strategy to find the best three- and higher-dimensional subspaces, one defines v_3, v_4, . . . in a similar manner.


Formal definitions

More formally, the second singular vector v_2 is defined by the best-fit line perpendicular to v_1:

v_2 = argmax_{‖v‖_2=1, ⟨v_1,v⟩=0} ‖Av‖_2.

The value σ_2(A) = ‖Av_2‖_2 is called the second singular value of A.

The k-th singular vector v_k is defined similarly by

v_k = argmax_{‖v‖_2=1, ⟨v_1,v⟩=0, ..., ⟨v_{k−1},v⟩=0} ‖Av‖_2,  (7)

and so on.

The process stops when we have found v_1, . . . , v_r as singular vectors and

0 = max_{‖v‖_2=1, ⟨v_1,v⟩=0, ..., ⟨v_r,v⟩=0} ‖Av‖_2.


Optimality of the greedy algorithm

If instead of finding the v_1 that maximizes ‖Av‖_2 and then the best-fit 2-dimensional subspace containing v_1, we had found the best-fit 2-dimensional subspace directly, we might have done better.

Surprisingly enough, this is actually not the case. We now give a simple proof that the greedy algorithm indeed finds the best subspaces of every dimension.

Proposition

Let A ∈ K^{I×J} and let v_1, v_2, . . . , v_r be the singular vectors defined above. For 1 ≤ k ≤ r, let V_k be the subspace spanned by v_1, v_2, . . . , v_k. Then for each k, V_k is the best-fit k-dimensional subspace for A.


The natural order of the singular values

We conclude this part with an important property left as an exercise: show that necessarily

σ_1(A) ≥ σ_2(A) ≥ · · · ≥ σ_r(A),  (8)

i.e., the singular values always come in a natural nonincreasing order.


On matrix norms again

Note that the vector Av_i is really a list of lengths (with signs) of the projections of the rows of A onto v_i. Think of σ_i(A) = ‖Av_i‖_2 as the component of the matrix A along v_i.

For this interpretation to make sense, it should be true that adding up the squares of the components of A along each of the v_i gives the square of the whole content of the matrix A.

This is indeed the case and is the matrix analogy of decomposing a vector into its components along orthogonal directions.


On matrix norms again

Consider one row, say A(i), of A. Since v_1, v_2, . . . , v_r span the space of all rows of A (exercise!), ⟨A(i), v⟩ = 0 for all v orthogonal to v_1, v_2, . . . , v_r.

Thus, for each row A(i), by Pythagoras Theorem,

‖A(i)‖_2² = ∑_{k=1}^r |⟨A(i), v_k⟩|².

Summing over all rows i ∈ I and recalling (6), we obtain

∑_{i∈I} ‖A(i)‖_2² = ∑_{k=1}^r ∑_{i∈I} |⟨A(i), v_k⟩|² = ∑_{k=1}^r ‖Av_k‖_2² = ∑_{k=1}^r σ_k(A)².

But

∑_{i∈I} ‖A(i)‖_2² = ∑_{i∈I} ∑_{j∈J} |A_ij|²

is the square of the Frobenius norm of A, so that we obtain

‖A‖_F = √(∑_{k=1}^r σ_k(A)²).  (9)


On matrix norms again

We might also observe that

σ_1(A) = max_{‖v‖_2=1} ‖Av‖_2 = max_{v≠0} ‖Av‖_2 / ‖v‖_2 = ‖A‖,  (10)

i.e., the first singular value corresponds to the spectral norm of A. This allows us to define other and more general norms for matrices, called the Schatten-p-norms, defined for 1 ≤ p < ∞ by

‖A‖_p = (∑_{k=1}^r σ_k(A)^p)^{1/p},

and for p = ∞ by

‖A‖_∞ = max_{k=1,...,r} σ_k(A).

Notice that these definitions give precisely ‖A‖_F = ‖A‖_2 and ‖A‖ = ‖A‖_∞ as particular Schatten norms. Of particular relevance in certain applications related to recommender systems is the so-called nuclear norm ‖A‖_* = ‖A‖_1, corresponding to the Schatten-1-norm.
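A NumPy sketch (not from the slides) computing the Schatten-p norms from the singular values and comparing them with NumPy's built-in Frobenius, spectral, and nuclear norms; the test matrix is arbitrary.

import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((6, 4))
s = np.linalg.svd(A, compute_uv=False)           # singular values sigma_1 >= ... >= sigma_r

def schatten(p):
    return np.max(s) if np.isinf(p) else np.sum(s ** p) ** (1.0 / p)

print(np.isclose(schatten(2), np.linalg.norm(A, 'fro')))     # ||A||_F = ||A||_2 (Schatten-2)
print(np.isclose(schatten(np.inf), np.linalg.norm(A, 2)))    # spectral norm = sigma_1
print(np.isclose(schatten(1), np.linalg.norm(A, 'nuc')))     # nuclear norm ||A||_*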


On rank again

We just mentioned above, and left it as an exercise, that the orthogonal basis constituted by the vectors v_1, v_2, . . . , v_r spans the space of all rows of A.

This means that the number r (of them) is actually the rank of the matrix! And this is a proof of (g) in the Proposition above characterizing the rank!


The Singular Value Decomposition in its full glory

The vectors Av_1, Av_2, . . . , Av_r form yet another fundamental set of vectors associated with A: we normalize them to length one by setting

u_i = (1/σ_i(A)) Av_i.

The vectors u_1, u_2, . . . , u_r are called the left singular vectors of A.

The v_i are called the right singular vectors.

The SVD theorem (soon!) will fully explain the reason for these terms.


Orthonormality of the left singular vectors

Clearly, the right singular vectors are orthogonal by definition.

We now show that the left singular vectors are also orthogonal.

Theorem
Let A ∈ K^{I×J} be a rank-r matrix. The left singular vectors of A, u_1, u_2, . . . , u_r, are orthogonal.


SVD in its full glory

Theorem
Let A ∈ K^{I×J} with right singular vectors v_1, v_2, . . . , v_r, left singular vectors u_1, u_2, . . . , u_r, and corresponding singular values σ_1, . . . , σ_r. Then

A = ∑_{k=1}^r σ_k u_k v_k^H.


Best rank-k approximation

Given the singular value decomposition A = ∑_{k=1}^r σ_k u_k v_k^H of a matrix A ∈ K^{I×J}, we define the rank-k truncation by

A_k = ∑_{ℓ=1}^k σ_ℓ u_ℓ v_ℓ^H.

It should be clear that A_k is a rank-k matrix.

Lemma
The rows of A_k are the orthogonal projections of the rows of A onto the subspace V_k spanned by the first k singular vectors of A.


Best rank-k Frobenius approximation

Theorem
For any matrix B of rank at most k,

‖A − A_k‖_F ≤ ‖A − B‖_F.


Best rank-k spectral approximation

Lemma
‖A − A_k‖² = σ_{k+1}², i.e., ‖A − A_k‖ = σ_{k+1}.

Theorem
For any matrix B of rank at most k,

‖A − A_k‖ ≤ ‖A − B‖.
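A NumPy sketch (not from the slides) forming the truncation A_k and checking the two optimality statements against randomly generated rank-k competitors; the sizes and the value of k are arbitrary choices.

import numpy as np

rng = np.random.default_rng(8)
A = rng.standard_normal((20, 12))
k = 3

U, S, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]       # rank-k truncation

print(np.isclose(np.linalg.norm(A - A_k, 2), S[k]))   # ||A - A_k|| = sigma_{k+1}

# A_k beats random rank-k matrices B in both Frobenius and spectral norm
for _ in range(200):
    B = rng.standard_normal((20, k)) @ rng.standard_normal((k, 12))   # rank <= k
    assert np.linalg.norm(A - A_k, 'fro') <= np.linalg.norm(A - B, 'fro') + 1e-9
    assert np.linalg.norm(A - A_k, 2) <= np.linalg.norm(A - B, 2) + 1e-9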


Nonconvexity and curse of dimensionality

We introduced the SVD by means of a greedy algorithm, which at each step is supposed to solve a nonconvex optimization problem of the type

v_k = argmax_{‖v‖_2=1, ⟨v_1,v⟩=0, ..., ⟨v_{k−1},v⟩=0} ‖Av‖_2.  (11)

Well, the issue with this is that it is neither a convex optimization problem nor, in principle, stated on a low dimensional space. In fact our goal is often to deal with "big data" (i.e., many data points, each of very high dimension), which eventually implies large dimensionalities of the matrix A.

How can we then approach the computation of the SVD without running into the so-called "curse of dimensionality", which is an emphatic way of describing the computational infeasibility of a certain numerical task?


The Power Method

It is easiest to describe the power method first in the case when A is real-valued, square, and symmetric and has the same right and left singular vectors, namely,

A = ∑_{k=1}^r σ_k v_k v_k^T.

In this case we have

A² = (∑_{k=1}^r σ_k v_k v_k^T)(∑_{ℓ=1}^r σ_ℓ v_ℓ v_ℓ^T) = ∑_{k,ℓ=1}^r σ_k σ_ℓ v_k (v_k^T v_ℓ) v_ℓ^T = ∑_{k=1}^r σ_k² v_k v_k^T,

since v_k^T v_ℓ = δ_{kℓ}.

Similarly, if we take the m-th power of A, again all the cross terms are zero and we will get

A^m = ∑_{k=1}^r σ_k^m v_k v_k^T.


We had the spectral norm of A and a spectral gap

If we had σ_1 ≫ σ_2, we would have

lim_{m→∞} A^m / σ_1^m = v_1 v_1^T.

While it is not so simple to compute σ_1 (why?!), which corresponds to the spectral norm of A, one can easily compute the Frobenius norm ‖A^m‖_F, so that we can consider A^m / ‖A^m‖_F, which again converges for m → ∞ to v_1 v_1^T, from which v_1 may be computed (exercise!).
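A NumPy sketch (not from the slides) of this idea for a symmetric matrix with a clear spectral gap (the eigenvalues 5, 1, 0.5 are an assumption of the demo); v_1 is recovered from a column of A^m / ‖A^m‖_F.

import numpy as np

rng = np.random.default_rng(9)

# Symmetric matrix with eigenvalues 5, 1, 0.5, so sigma_1 >> sigma_2
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
A = Q @ np.diag([5.0, 1.0, 0.5]) @ Q.T
v1_true = Q[:, 0]

M = np.linalg.matrix_power(A, 30)
M /= np.linalg.norm(M, 'fro')        # A^m / ||A^m||_F  ~  v_1 v_1^T

# Any nonzero column of v_1 v_1^T is a multiple of v_1
v1_est = M[:, np.argmax(np.abs(M).sum(axis=0))]
v1_est /= np.linalg.norm(v1_est)
print(min(np.linalg.norm(v1_est - v1_true), np.linalg.norm(v1_est + v1_true)))  # ~ 0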


What if A is not square?

But then B = AA^T is square. If again the SVD of A is A = ∑_{k=1}^r σ_k u_k v_k^T, then

B = ∑_{k=1}^r σ_k² u_k u_k^T.

This is the spectral decomposition of B. Using the same kind of calculation as above, one obtains

B^m = ∑_{k=1}^r σ_k^{2m} u_k u_k^T.

As m increases, for k > 1 the ratio σ_k^{2m}/σ_1^{2m} goes to zero and B^m becomes approximately equal to

σ_1^{2m} u_1 u_1^T,

provided again that σ_1 ≫ σ_2. This suggests how to compute both σ_1 and u_1, simply by powering B.


Two relevant issues

• If we do not have a spectral gap σ_1 ≫ σ_2, we get into trouble.

• Computing B^m costs m matrix-matrix multiplications; done in the schoolbook way this costs O(md³) operations, or O(d³ log m) operations when done by successive squaring.


Use a random vector application instead!

Instead, we compute

B^m x,

where x is a random unit-length vector.

The idea is that the component of x in the direction of u_1, i.e., x_1 := ⟨x, u_1⟩, is actually bounded away from zero with high probability, and it gets multiplied by σ_1² at each application of B, while the other components of x along the other u_i's are multiplied only by σ_k² ≪ σ_1².

Each increase in m requires multiplying B by the vector B^{m−1}x, which we can further break up into

B^m x = A(A^T(B^{m−1}x)).

This requires two matrix-vector products, involving the matrices A^T and A. Since B^m x ≈ σ_1^{2m} u_1 (u_1^T x) is a scalar multiple of u_1, u_1 can be recovered from B^m x by normalization.
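A NumPy sketch (not from the slides) of this random-vector variant: each step applies A^T and then A, and u_1, σ_1 are read off after normalization. The test matrix with a clear gap σ_1 = 10 ≫ σ_2 = 3 is an assumption of the demo.

import numpy as np

rng = np.random.default_rng(10)

U0, _ = np.linalg.qr(rng.standard_normal((200, 40)))
V0, _ = np.linalg.qr(rng.standard_normal((40, 40)))
A = U0 @ np.diag(np.concatenate(([10.0, 3.0], np.linspace(1.0, 0.1, 38)))) @ V0.T

x = rng.standard_normal(200)
x /= np.linalg.norm(x)                  # random unit-length starting vector

for _ in range(30):                     # 30 applications of B = A A^H
    x = A @ (A.T @ x)                   # computed as A(A^T(.)), two matrix-vector products
    x /= np.linalg.norm(x)              # renormalize to avoid overflow

u1_est = x
sigma1_est = np.linalg.norm(A.T @ u1_est)    # sigma_1 = ||A^H u_1||_2 for the exact u_1

U, S, _ = np.linalg.svd(A, full_matrices=False)
print(abs(sigma1_est - S[0]))                                        # ~ 0
print(min(np.linalg.norm(u1_est - U[:, 0]),
          np.linalg.norm(u1_est + U[:, 0])))                         # ~ 0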


Concentration of measure phenomenon

The concentration of measure phenomenon, informally speaking, describes the fact that in high dimension certain events happen with high probability. In particular, we have

Lemma
Let x ∈ R^d be a unit d-dimensional vector with components x = (x_1, . . . , x_d) with respect to the canonical basis, picked at random from the set {x : ‖x‖_2 ≤ 1}. The probability that |x_1| ≥ α > 0 is at least 1 − 2α√(d − 1).


Extensions

Remark
Notice that the previous result essentially shows also that, independently of the dimension d, the component x_1 = ⟨x, u_1⟩ of a random unit vector x with respect to any orthonormal basis² u_1, . . . , u_d is bounded away from zero with overwhelming probability.

Remark
In view of the isometric mapping (a, b) → a + ib from R² to C, the previous result extends to random unit vectors in C^d simply by modifying the statement as follows: the probability that, for a randomly chosen unit vector z ∈ C^d, |z_1| ≥ α > 0 holds is at least 1 − 2α√(2d − 1).

²Since the sphere is rotation invariant, the arguments above apply to any orthonormal basis by rotating it to coincide with the canonical basis!


Randomized Power Method

By injecting randomization into the process, we are able to use the blessing of dimensionality (concentration of measure) against the curse of dimensionality.

Theorem
Let A ∈ K^{I×J} and let x ∈ K^I be a random unit-length vector. Let V be the space spanned by the left singular vectors of A corresponding to singular values greater than (1 − ε)σ_1. Let m be Ω(ln(d/ε)/ε). Let w* be the unit vector after m iterations of the power method, namely,

w* = (AA^H)^m x / ‖(AA^H)^m x‖_2.

The probability that w* has a component of at most O(ε/(αd)) orthogonal to V is at least 1 − 2α√(2d − 1).

To make the previous statement concrete, let us take α = 1/(10√2·d). Then, with probability at least 9/10 (i.e., very close to 1), after a number m of iterations of order Ω(ln(d/ε)/ε), the vector w* has a component of at most O(ε) orthogonal to V.


Applications of the SVD: Principal Component Analysis

The traditional use of the SVD is in Principal Component Analysis (PCA). PCA is illustrated by an example: customer-product data where there are n customers buying d products.


Applications of the SVD: Principal Component Analysis

Let the matrix A with elements A_ij represent the probability of customer i purchasing product j (or the amount or utility of product j to customer i).

One hypothesizes that there are really only k underlying basic factors like age, income, family size, etc. that determine a customer's purchase behavior.

An individual customer's behavior is determined by some weighted combination of these underlying factors.


Principal Component Analysis

That is, customer i's purchase behavior can be characterized by a k-dimensional vector (u_{ℓ,i})_{ℓ=1,...,k}, where k is much smaller than n and d.

The components u_{ℓ,i}, for ℓ = 1, . . . , k, of the vector are weights for each of the basic factors.

Associated with each basic factor ℓ is a vector of probabilities v_ℓ, each component of which is the probability of purchasing a given product by someone whose behavior depends only on that factor.

More abstractly, A = UV^H is an n × d matrix that can be expressed as the product of two matrices U and V, where U is an n × k matrix expressing the factor weights for each customer and V is a k × d matrix expressing the purchase probabilities of products that correspond to that factor:

(A(i))^H = ∑_{ℓ=1}^k u_{ℓ,i} v_ℓ^H.


Principal Component Analysis

One twist is that A may not be exactly equal to UV^T, but close to it, since there may be noise or random perturbations.

Taking the best rank-k approximation A_k from the SVD (as described above) gives us such a U, V. In this traditional setting, one assumed that A was available fully and we wished to find U, V to identify the basic factors, or in some applications to denoise A (if we think of A − UV^T as noise).


Low rankness and recommender systems

Now imagine that n and d are very large, on the order of thousands or even millions; then there is probably little one could do to estimate or even store A.

In this setting, we may assume that we are given just a few entries of A and wish to estimate A. If A were an arbitrary matrix of size n × d, this would require Ω(nd) pieces of information and cannot be done with a few entries.

But hypothesize again that A is a low-rank matrix with added noise. If now we also assume that the given entries are randomly drawn according to some known distribution, then there is a possibility that the SVD can be used to estimate the whole of A.

This area is called collaborative filtering or low-rank matrix completion, and one of its uses is to target an ad to a customer based on one or two purchases.


SVD as a compression algorithm

Suppose A is the pixel intensity matrix of a large image.

The entry A_ij gives the intensity of the ij-th pixel. If A is n × n, the transmission of A requires transmitting O(n²) real numbers. Instead, one could send A_k, that is, the top k singular values σ_1, σ_2, . . . , σ_k along with the left and right singular vectors u_1, u_2, . . . , u_k and v_1, v_2, . . . , v_k.

This would require sending O(kn) real numbers instead of O(n²) real numbers.

If k is much smaller than n, this results in savings. For many images, a k much smaller than n can be used to reconstruct the image, provided that a very low resolution version of the image is sufficient.

Thus, one could use SVD as a compression method.
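A NumPy sketch (not from the slides) of this counting argument on a synthetic "image"; the 512×512 array and the value k = 20 are assumptions standing in for a real pixel intensity matrix.

import numpy as np

rng = np.random.default_rng(11)
n, k = 512, 20

# Synthetic low-complexity image: a smooth rank-structured pattern plus a little noise
x = np.linspace(0, 1, n)
img = np.outer(np.sin(4 * np.pi * x), np.cos(2 * np.pi * x)) + 0.01 * rng.standard_normal((n, n))

U, S, Vt = np.linalg.svd(img, full_matrices=False)
img_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]

full_cost = n * n                          # O(n^2) numbers for the raw image
compressed_cost = k * (2 * n + 1)          # k triples (u_l, sigma_l, v_l): O(kn) numbers
rel_err = np.linalg.norm(img - img_k, 'fro') / np.linalg.norm(img, 'fro')
print(full_cost, compressed_cost, rel_err)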


SVD as a compression algorithm


Sparsifying bases

It turns out that, in a more sophisticated approach, for certain classes of pictures one could use a fixed sparsifying basis so that the top (say) hundred singular vectors are sufficient to represent any picture approximately.

This means that the space spanned by the top hundred sparsifying basis vectors is not too different from the space spanned by the top two hundred singular vectors of a given matrix in that class.

Compressing these matrices by this sparsifying basis can save substantially, since the sparsifying basis is transmitted only once and a matrix is transmitted by sending the top several hundred singular values for the sparsifying basis.

We will discuss later on the use of sparsifying bases.


Singular Vectors and ranking documents

Consider representing a document by a vector, each component of which corresponds to the number of occurrences of a particular word in the document.

The English language has on the order of 25,000 words. Thus, such a document is represented by a 25,000-dimensional vector. This representation of a document is called the word vector model.

A collection of n documents may be represented by a collection of 25,000-dimensional vectors, one vector per document. The vectors may be arranged as the rows of an n × 25,000 matrix.


Singular Vectors and ranking documents

An important task for a document collection is to rank the documents.

A naive method would be to rank in order of the total length of the document (which is the sum of the components of its term-document vector). Clearly, this is not a good measure in many cases.

This naive method attaches equal weight to each term and essentially takes the projection of each term-document vector in the direction of the vector whose components all have value 1.

Is there a better weighting of terms, i.e., a better projection direction, which would measure the intrinsic relevance of the document to the collection?


Singular Vectors and ranking documents

A good candidate is the best-fit direction for the collection of term-document vectors, namely the top (left) singular vector of the term-document matrix.

An intuitive reason for this is that this direction has the maximum sum of squared projections of the collection and so can be thought of as a synthetic term-document vector best representing the document collection.


WWW ranking

Ranking in order of the projection of each document's term vector along the best-fit direction has a nice interpretation in terms of the power method. For this, we consider a different example: that of the web with hypertext links.

The World Wide Web can be represented by a directed graph whose nodes correspond to web pages and whose directed edges correspond to hypertext links between pages.

Some web pages, called authorities, are the most prominent sources for information on a given topic. Other pages, called hubs, are ones that identify the authorities on a topic.

Authority pages are pointed to by many hub pages, and hub pages point to many authorities.


WWW ranking


WWW ranking

One would like to assign hub weights and authority weights to each node of the web.

If there are n nodes, the hub weights form an n-dimensional vector u and the authority weights form an n-dimensional vector v. Suppose A is the adjacency matrix representing the directed graph: A_ij is 1 if there is a hypertext link from page i to page j and 0 otherwise. Given a hub vector u, the authority vector v could be computed by the formula

v_j = ∑_{i=1}^n u_i a_ij,

since the right-hand side is the sum of the hub weights of all the nodes that point to node j. In matrix terms,

v = A^T u.

Similarly, given an authority vector v, the hub vector could be computed by u = Av.


WWW ranking

Of course, at the start, we have neither vector.

But the above suggests a power-like iteration: start with any v; set u = Av, then set v = A^T u, and repeat the process.

We know from the power method that this converges to the left and right singular vectors. So, after sufficiently many iterations, we may use the left vector u as the hub weight vector, project each column of A onto this direction, and rank the columns (authorities) in order of this projection.

But the projections just form the vector A^T u, which equals v. So we can just rank by order of v_j.

This is the basis of an algorithm called the HITS algorithm, which was one of the early proposals for ranking web pages. A different ranking called PageRank is widely used. It is based on a random walk on the graph described above, but we do not explore it here.
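A minimal NumPy sketch (not from the slides) of a HITS-style iteration u = Av, v = A^T u on a small random directed graph; the graph, its size, and the number of iterations are assumptions of the demo, and normalization at each step keeps the vectors bounded.

import numpy as np

rng = np.random.default_rng(12)
n = 30
A = (rng.random((n, n)) < 0.15).astype(float)   # adjacency matrix of a random directed graph
np.fill_diagonal(A, 0)

v = np.ones(n) / np.sqrt(n)                     # start with any authority vector v
for _ in range(100):
    u = A @ v                                   # hub scores from authority scores
    u /= np.linalg.norm(u)
    v = A.T @ u                                 # authority scores from hub scores
    v /= np.linalg.norm(v)

# The iteration approaches the top left/right singular vectors of A (up to sign)
U, S, Vt = np.linalg.svd(A)
print(min(np.linalg.norm(v - Vt[0]), np.linalg.norm(v + Vt[0])))   # small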


Pseudo-inverse matrices

Given A ∈ K^{I×J} of rank r and its singular value decomposition

A = UΣV^H = ∑_{k=1}^r σ_k u_k v_k^H,

we define the Moore-Penrose pseudo-inverse A† ∈ K^{J×I} as

A† = ∑_{k=1}^r σ_k^{-1} v_k u_k^H.

In matrix form it is

A† = V Σ^{-1} U^H,

where the inverse ·^{-1} is applied only to the nonzero singular values of A.
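A NumPy sketch (not from the slides) building A† from the SVD and comparing it with np.linalg.pinv; the matrix is an arbitrary full-column-rank example.

import numpy as np

rng = np.random.default_rng(13)
A = rng.standard_normal((7, 4))

U, S, Vt = np.linalg.svd(A, full_matrices=False)
r = np.sum(S > 1e-12)                              # numerical rank
# A^dagger = V Sigma^{-1} U^H, inverting only the nonzero singular values
A_pinv = Vt[:r].T @ np.diag(1.0 / S[:r]) @ U[:, :r].T

print(np.allclose(A_pinv, np.linalg.pinv(A)))      # True
print(np.allclose(A_pinv @ A, np.eye(4)))          # left inverse here, since rank = #J = 4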


Pseudo-inverse matrices

If A^H A ∈ K^{J×J} is invertible (implying #I ≥ #J), and this is true if r = #J and the columns are linearly independent, then

A† = (A^H A)^{-1} A^H.

Indeed,

(A^H A)^{-1} A^H = (VΣ²V^H)^{-1} VΣU^H = (VΣ^{-2}V^H) VΣU^H = VΣ^{-1}U^H = A†.

In this case, two fundamental properties of A† should also be noted. First of all, it is a left inverse of A, i.e.,

I = (A^H A)^{-1} A^H A = A† A,

where I ∈ K^{r×r}. On the other hand, the matrix

P_{range(A)} = AA†

corresponds to the orthogonal projection of K^I onto the range of the matrix A, i.e., the span of the columns of A.


Pseudo-inverse matrices

Similarly, if AA^H ∈ K^{I×I} is invertible (implying #I ≤ #J), and this is true if r = #I and the rows are linearly independent, then

A† = A^H (AA^H)^{-1}.

In this case, A† is a right inverse of A, i.e.,

I = AA^H (AA^H)^{-1} = AA†.


Least squares problems

• The system Ax = y is overdetermined, i.e., there are more equations than unknowns. In this case the problem might not even be solvable, and approximate solutions are needed, i.e., those that minimize the mismatch ‖Ax − y‖_2.

• The system Ax = y is underdetermined, i.e., there are fewer equations than unknowns. In this case the problem has an infinite number of solutions, and one would wish to determine one of them with special properties, by means of a proper selection criterion. A common criterion for selecting solutions of the system Ax = y is to pick the solution with minimal Euclidean norm ‖x‖_2.


Pseudo-inverse matrices and least squares problems

Let us first connect least squares problems with the Moore-Penrose pseudo-inverse introduced above.

Proposition

Let A ∈ K^{I×J} and y ∈ K^I. Define M ⊂ K^J to be the set of minimizers of the map x ↦ ‖Ax − y‖_2. The convex optimization problem

argmin_{x∈M} ‖x‖_2  (12)

has the unique solution x* = A†y.

Proof at the blackboard ...
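A NumPy sketch (not from the slides) illustrating the Proposition in an underdetermined example: among all least-squares minimizers, A†y has minimal Euclidean norm, and it agrees with NumPy's lstsq; the sizes are arbitrary.

import numpy as np

rng = np.random.default_rng(14)
A = rng.standard_normal((3, 8))        # underdetermined: many exact solutions of Ax = y
y = rng.standard_normal(3)

x_star = np.linalg.pinv(A) @ y                          # x* = A^dagger y
x_lstsq = np.linalg.lstsq(A, y, rcond=None)[0]          # minimum-norm least-squares solution
print(np.allclose(x_star, x_lstsq))                     # True

# Any other solution of Ax = y has a norm at least as large
for _ in range(500):
    z = rng.standard_normal(8)
    x_other = x_star + (np.eye(8) - np.linalg.pinv(A) @ A) @ z   # still solves Ax = y
    assert np.linalg.norm(x_star) <= np.linalg.norm(x_other) + 1e-12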


Overdetermined systems

Corollary

Let A ∈ K^{I×J} with n = #I ≥ #J = d be of full rank r = d, and let y ∈ K^I. Then the least squares problem

argmin_{x∈K^J} ‖Ax − y‖_2  (13)

has the unique solution x = A†y.

This result follows from the Proposition above because the minimizer x* of ‖Ax − y‖_2 is in this case unique.


Underdetermined systems

Corollary

Let A ∈ K^{I×J} with n = #I ≤ #J = d be of full rank r = n, and let y ∈ K^I. Then the least squares problem

argmin_{x∈K^J} ‖x‖_2 subject to Ax = y  (14)

has the unique solution x = A†y.


Stability of the Singular Value Decomposition

In many situations there is a data matrix of interest A, but one has only a perturbed version of it, Â = A + E, where E is a relatively small, quantifiable perturbation. To what extent can we rely on the singular value decomposition of Â to estimate the one of A? In many real-life cases, certain sets of data tend to behave as random vectors generated in the vicinity of a lower dimensional linear subspace.

The more data we get, the more they take a specific enveloping shape around that subspace. Hence, we could consider many matrices built out of sets of such randomly generated points. What do the SVDs of such matrices look like?

What happens if we acquire more and more data and the matrices become larger and larger?

Also, to be able to respond in a quantitative way to such questions, we need to develop a theory of stability of singular value decompositions under small perturbations.


Spectral theorem

Assume that A ∈ K^{I×I} is a Hermitian matrix, i.e., A = A^H. We recall that a nonzero vector v ∈ K^I is an eigenvector of A if Av = λv for some scalar λ ∈ K, called the corresponding eigenvalue.

Theorem (Spectral theorem for Hermitian matrices)

If A ∈ K^{I×I} and A = A^H, then there exists an orthonormal basis {v_1, . . . , v_n} consisting of eigenvectors of A, with real corresponding eigenvalues λ_1, . . . , λ_n, such that

A = ∑_{k=1}^n λ_k v_k v_k^H.

The previous representation is called the spectral decomposition of A.

In fact, by setting u_k = sign(λ_k) v_k and σ_k = |λ_k|, then

A = ∑_{k=1}^n λ_k v_k v_k^H = ∑_{k=1}^n sign(λ_k) |λ_k| v_k v_k^H = ∑_{k=1}^n σ_k u_k v_k^H,

which is actually the SVD of A (up to reordering the singular values nonincreasingly).


Weyl’s bounds

Assume now that E is Hermitian, so that Â = Â^H holds as well. As the eigenvalues of such matrices are always real, we can order them in nonincreasing order, so that λ_1 is the largest (not necessarily positive!). Then

λ_1(Â) = max_{‖v‖_2=1} v^H (A + E) v
      ≤ max_{‖v‖_2=1} v^H A v + max_{‖v‖_2=1} v^H E v
      = λ_1(A) + λ_1(E).

By using the spectral decomposition of A + E, prove as an exercise the first equality above.


Weyl’s bounds

Let v_1 be the eigenvector of A associated to the top eigenvalue of A, and

λ_1(Â) = max_{‖v‖_2=1} v^H (A + E) v
      ≥ v_1^H (A + E) v_1
      = λ_1(A) + v_1^H E v_1
      ≥ λ_1(A) + λ_n(E).

From these estimates we obtain the bounds

λ_1(A) + λ_n(E) ≤ λ_1(A + E) ≤ λ_1(A) + λ_1(E).


Weyl’s boundsThis chain of inequalities can be properly extended to the 2nd, 3rd,etc. eigenvalues (we do not perform this extension), leading to thefollowing important result.

Theorem (Weyl)

If A,E ∈ KI×I are two Hermitian matrices, then for all k = 1, . . . , n

λk(A) + λn(E ) ≤ λk(A + E ) ≤ λk(A) + λ1(E ).

This is a very strong result because it does not depend at all fromthe magnitude of the perturbation matrix E . As a corollary

Corollary

If A,E ∈ KI×I are two arbitrary matrices, then for all k = 1, . . . , n

|σk(A + E )− σk(A)| ≤ ‖E‖,

where σk ’s are the corresponding singular vectors and ‖E‖ is thespectral norm of E .
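A NumPy check (not from the slides) of the corollary: perturbing A by E moves every singular value by at most ‖E‖; the matrices are arbitrary examples.

import numpy as np

rng = np.random.default_rng(15)
A = rng.standard_normal((8, 6))
E = 0.1 * rng.standard_normal((8, 6))          # an arbitrary (small) perturbation

s_A = np.linalg.svd(A, compute_uv=False)
s_AE = np.linalg.svd(A + E, compute_uv=False)
spec_E = np.linalg.norm(E, 2)                  # spectral norm of E

print(np.max(np.abs(s_AE - s_A)), "<=", spec_E)
print(np.all(np.abs(s_AE - s_A) <= spec_E + 1e-12))   # True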


Mirsky’s bounds

Actually more can be said:

Theorem (Mirsky)

If A, E ∈ K^{I×I} are two arbitrary matrices, then

√(∑_{k=1}^n |σ_k(A + E) − σ_k(A)|²) ≤ ‖E‖_F,

where ‖E‖_F is the Frobenius norm of E.

In its original form, Mirsky's theorem holds for an arbitrary unitarily invariant norm (hence for the spectral norm) and includes Weyl's theorem as a special case.


Stability of singular spaces: principal angles

To quantify the stability properties of the singular value decomposition under perturbations, we need to introduce the notion of principal angles between subspaces. To introduce the concept, we recall that in the Euclidean space R^d the angle θ(v, w) between two vectors v, w ∈ R^d is defined by

cos θ(v, w) = w^T v / (‖v‖_2 ‖w‖_2).

In particular, for unit vectors v, w one has cos θ(v, w) = w^T v.


Stability of singular spaces: principal angles

Definition
Let V, W ∈ K^{J×I} be orthogonal matrices spanning V and W, n-dimensional subspaces of K^J (here we may assume that n = #I ≤ #J = d)³. We define the principal angles between V and W by

cos Θ(V, W) = Σ(W^H V),

where Σ(W^H V) is the list of the singular values of the matrix W^H V.

³Here we commit an abuse of notation, as we denote by V, W both the matrices of orthonormal bases and the two subspaces they span. However, this identification is legitimate here, as all we need to know about the subspaces are some orthonormal bases of them.

Page 85

Stability of singular spaces: principal angles

This definition sounds a bit abstract, and it is perhaps useful to characterize the principal angles in terms of a geometric description. Recalling the explicit characterization of the orthogonal projection onto a subspace given in (3), we observe that

PV = V V^H,  PW = W W^H.

The following result essentially says that the principal angles quantify how different the orthogonal projections onto V and W are.

Proposition

Let V, W ∈ K^{J×I} be orthogonal matrices spanning V and W, n-dimensional subspaces of K^J (here we may assume that n = #I ≤ #J = d). Then

‖PV − PW‖F = ‖V V^H − W W^H‖F = √2 ‖sin Θ(V, W)‖2.

Proof at the blackboard ...
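
As an illustrative aside (not in the original slides), the Proposition can be verified numerically: compute the cosines of the principal angles as the singular values of W^H V and compare ‖PV − PW‖F with √2 ‖sin Θ(V, W)‖2. The dimensions below are arbitrary, and the computation is done over R, so ^H reduces to transposition.

```python
# Principal angles between two random n-dimensional subspaces of R^d and the
# identity ||P_V - P_W||_F = sqrt(2) ||sin Theta(V, W)||_2.
import numpy as np

rng = np.random.default_rng(1)
d, n = 10, 3                                          # arbitrary dimensions

V, _ = np.linalg.qr(rng.standard_normal((d, n)))      # orthonormal basis of V
W, _ = np.linalg.qr(rng.standard_normal((d, n)))      # orthonormal basis of W

cos_theta = np.clip(np.linalg.svd(W.T @ V, compute_uv=False), -1.0, 1.0)
sin_theta = np.sqrt(1.0 - cos_theta**2)

PV, PW = V @ V.T, W @ W.T                             # orthogonal projections onto V and W
lhs = np.linalg.norm(PV - PW, 'fro')
rhs = np.sqrt(2.0) * np.linalg.norm(sin_theta)
print(lhs, rhs)                                       # the two numbers agree up to rounding
```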

Page 86

Stability of singular spaces: principal angles

The orthogonal complement W⊥ of any subspace W of K^J is defined by W⊥ = {v ∈ K^J : ⟨v, w⟩ = 0, ∀w ∈ W}. We show yet another description of the principal angles. The projections onto W and W⊥ are related by

PW = I − PW⊥,  or  W W^H = I − W⊥ W⊥^H. (15)

Proposition

Let V, W ∈ K^{J×I} be orthogonal matrices spanning V and W, n-dimensional subspaces of K^J (here we may assume that n = #I ≤ #J = d). We denote by W⊥ an orthogonal matrix spanning W⊥. Then

‖W⊥^H V‖F = ‖sin Θ(V, W)‖2.

Proof at the blackboard ...

Page 87

Stability of singular spaces: principal angles

We leave it as an additional exercise to show that, as a corollary of the previous result, one also has

‖PW⊥ PV‖F = ‖sin Θ(V, W)‖2. (16)

Again, one has to use the properties of the trace operator!

Page 88

Stability of singular spaces: Wedin's bound

Let A, Ã ∈ K^{I×J} and define E = Ã − A. We fix 1 ≤ k ≤ rmax = min{#I, #J} and we consider the SVD

A = ( U1 U0 ) · diag(Σ1, Σ0) · ( V1 V0 )^H = U1 Σ1 V1^H + U0 Σ0 V0^H =: A1 + A0,

where

V1 = [v1, . . . , vk], V0 = [v_{k+1}, . . . , v_{rmax}], U1 = [u1, . . . , uk], U0 = [u_{k+1}, . . . , u_{rmax}],

and

Σ1 = diag(σ1, . . . , σk), Σ0 = diag(σ_{k+1}, . . . , σ_{rmax}).

For the same k we use the corresponding notation for the analogous decomposition of Ã, except that all the symbols carry a tilde on top (Ã1, Ã0, Ũ1, Ṽ1, Σ̃1, . . .). It is obvious that

A + E = A1 + A0 + E = Ã1 + Ã0 = Ã. (17)

Page 89

Stability of singular spaces: Wedin's bound

Let us notice that range(A1) is spanned by the orthonormal system U1, while range(A1^H) is spanned by V1. Similar statements hold for Ã1. We now define two "residual" matrices

R11 = A Ṽ1 − Ũ1 Σ̃1  and  R21 = A^H Ũ1 − Ṽ1 Σ̃1^H.

Page 90

Stability of singular spaces: Wedin's bound

A connection between R11 and E is seen through the rewriting

R11 = A Ṽ1 − Ũ1 Σ̃1 = (Ã − E) Ṽ1 − Ũ1 (Ũ1^H Ã Ṽ1) = −E Ṽ1.

We used that Σ̃1 = Ũ1^H Ã Ṽ1 = Ũ1^H Ã1 Ṽ1 and that Ũ1 Ũ1^H is the orthogonal projection onto range(Ã1), hence Ũ1 Ũ1^H Ã1 = Ã1 and Ã1 Ṽ1 = Ã Ṽ1. Analogously, it can be shown that

R21 = −E^H Ũ1.

Page 91

Stability of singular spaces: Wedin's bound

Theorem (Wedin)

Under the notation used so far in this section, assume that there exist α ≥ 0 and δ > 0 such that

σk(Ã) ≥ α + δ  and  σ_{k+1}(A) ≤ α. (18)

Then for every unitarily invariant norm ‖·‖⋆ (in particular for the spectral and Frobenius norms)

max{ ‖sin Θ(V1, Ṽ1)‖⋆, ‖sin Θ(U1, Ũ1)‖⋆ } ≤ max{ ‖R11‖⋆, ‖R21‖⋆ } / δ ≤ ‖E‖⋆ / δ.
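
A rough numerical sketch of Wedin's bound (an added illustration, not from the slides): for a small random perturbation we compare the largest sine of the principal angles between the leading right singular subspaces with ‖E‖/δ, where δ is chosen according to (18) with α = σ_{k+1}(A). The sizes, the value of k and the noise level are arbitrary.

```python
# Compare the largest principal-angle sine between the leading right singular
# subspaces of A and A~ = A + E with the Wedin bound ||E||_2 / delta.
import numpy as np

rng = np.random.default_rng(2)
m, d, k = 12, 8, 2                                   # arbitrary sizes and k
A = rng.standard_normal((m, d))
E = 1e-3 * rng.standard_normal((m, d))               # a small perturbation
At = A + E                                           # plays the role of A~

U, s, Vh = np.linalg.svd(A)
Ut, st, Vth = np.linalg.svd(At)
V1, Vt1 = Vh[:k].T, Vth[:k].T                        # bases of the leading right singular subspaces

cosines = np.clip(np.linalg.svd(Vt1.T @ V1, compute_uv=False), -1.0, 1.0)
sines = np.sqrt(1.0 - cosines**2)

# a valid choice in (18): alpha = sigma_{k+1}(A), delta = sigma_k(A~) - alpha
delta = st[k - 1] - s[k]
print("max sin(theta) =", sines.max(), " bound ||E||_2/delta =", np.linalg.norm(E, 2) / delta)
```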

Page 92

Introduction to probability theory

In the probabilistic proof of the convergence of the power method for computing singular vectors of a matrix A, we used the following random process to define a random point on the sphere: we considered an imaginary darts player X, with no experience, throwing darts - actually at random - at the target, the two-dimensional disk Ω = Br, the disk of radius r and center 0. To every point ω ∈ Ω within the target hit by a dart we assign a point on the boundary of the target (the one-dimensional sphere S^1), X(ω) = ω/‖ω‖2 ∈ S^1.

Page 93

Introduction to probability theory

It is also true that the one-dimensional sphere S^1 is parametrized by the angle θ ∈ [0, 2π); hence we can also consider the player X as a map from ω ∈ Ω to θ = arg(ω/‖ω‖2) ∈ [0, 2π) ⊂ R,

X : Ω → R.

We defined the probability that the very inexperienced player X hits a point ω ∈ B ⊂ Ω simply as the ratio of the area of B relative to the entire area of the target Ω:

P(ω ∈ B) = Vol(B) / Vol(Ω).

What is then the probability of the event

B = {ω ∈ Ω : θ1 ≤ X(ω) ≤ θ2}?

Page 94

Introduction to probability theory

What is then the probability of the event

B = {ω ∈ Ω : θ1 ≤ X(ω) ≤ θ2}?

The area of B is the area of the disk sector of vectors with angles between θ1 and θ2, which is computed using polar coordinates:

Vol(B) = ∫_{θ1}^{θ2} ∫_0^r s ds dθ = r²(θ2 − θ1)/2.

Hence, recalling that Vol(Ω) = πr², we have

P({ω ∈ Ω : θ1 ≤ X(ω) ≤ θ2}) = Vol(B)/Vol(Ω) = r²(θ2 − θ1)/(2πr²) = (θ2 − θ1)/(2π).

Page 95

Introduction to probability theory

What is the average outcome of the (random) player? Well, if we consider N attempts of the player, the empirical average result would be

(1/N) ∑_{i=1}^{N} X(ω_i),

which, for N → ∞, converges to

EX := ∫_Ω X(ω) dP(ω) = (1/(2π)) ∫_0^{2π} θ dθ = (2π)²/(4π) = π.

The function φ(θ) = (1/(2π)) I_{[0,2π)}(θ) is called the probability density of the random player X. In general one computes the quantities

Eg(X) := ∫_Ω g(X(ω)) dP(ω) = ∫_R g(θ) φ(θ) dθ.
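
A quick Monte Carlo illustration of the dart player (added here, not part of the slides): sampling uniform points in the disk and mapping them to their angle reproduces both the probability (θ2 − θ1)/(2π) and the expectation EX = π. The sample size and the angles θ1, θ2 are arbitrary choices.

```python
# Sample uniform points in the disk, map them to their angle, and compare with
# the exact probability (theta2 - theta1)/(2 pi) and the expectation EX = pi.
import numpy as np

rng = np.random.default_rng(3)
N, r = 200_000, 1.0                                      # arbitrary sample size and radius

pts = rng.uniform(-r, r, size=(2 * N, 2))                # rejection sampling in the square
pts = pts[np.sum(pts**2, axis=1) <= r**2][:N]            # keep points inside the disk

theta = np.arctan2(pts[:, 1], pts[:, 0]) % (2 * np.pi)   # X(omega) = arg(omega / ||omega||_2)

theta1, theta2 = 0.5, 2.0                                # arbitrary angles
print("empirical P:", np.mean((theta >= theta1) & (theta <= theta2)),
      " exact:", (theta2 - theta1) / (2 * np.pi))
print("empirical EX:", theta.mean(), " exact:", np.pi)
```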

Page 96

Introduction to probability theory

If we denote by Σ the set of all possible events (identified with measurable subsets of Ω), the triple (Ω, Σ, P) is called a probability space.

The random player, which we described above with the function X : Ω → R, is an example of a random variable.

The average outcome of its playing, EX, is the expected value of the random variable.

As already mentioned above, this particular random variable has a density function φ(θ), which allows us to compute interesting quantities (such as the probabilities of certain events) by integrals on the real line.

This relatively intuitive construction is generalized to abstract probability spaces as follows.

Page 97

Abstract probability theory

Let (Ω, Σ, P) be a probability space, where Σ denotes a σ-algebra (of the events) on the sample space Ω and P a probability measure on (Ω, Σ).

The probability of an event B ∈ Σ is denoted by

P(B) = ∫_B dP(ω) = ∫_Ω I_B(ω) dP(ω),

where the characteristic function I_B(ω) takes the value 1 if ω ∈ B and 0 otherwise.

The union bound (or Bonferroni's inequality, or Boole's inequality) states that for a collection of events B_ℓ ∈ Σ, ℓ = 1, . . . , n, we have

P( ⋃_{ℓ=1}^{n} B_ℓ ) ≤ ∑_{ℓ=1}^{n} P(B_ℓ). (19)

Page 98

Random variables

A random variable X is a real-valued measurable function on (Ω, Σ). Recall that X is called measurable if the preimage X^{−1}(A) = {ω ∈ Ω : X(ω) ∈ A} is contained in Σ for all Borel measurable subsets A ⊂ R.

Usually, every reasonable function X will be measurable; in particular, all functions appearing in these lectures.

In what follows we will usually not mention the underlying probability space (Ω, Σ, P) when speaking about random variables.

The distribution function F = F_X of X is defined as

F(t) = P(X ≤ t), t ∈ R. (20)

Page 99

Densities

A random variable X possesses a probability density function φ : R → R+ if

P(a < X ≤ b) = ∫_a^b φ(t) dt  for all a < b ∈ R. (21)

Then φ(t) = (d/dt) F(t). The expectation or mean of a random variable will be denoted by

EX = ∫_Ω X(ω) dP(ω). (22)

If X has probability density function φ, then for a function g : R → R,

Eg(X) = ∫_{−∞}^{+∞} g(t) φ(t) dt (23)

whenever the integral exists.

Page 100

Moments

The quantities EX^p, p > 0, are called moments of X, while E|X|^p are called absolute moments. (Sometimes we may omit "absolute".)

The quantity E(X − EX)² = EX² − (EX)² is called the variance.

For 1 ≤ p < ∞, (E|X|^p)^{1/p} defines a norm on the space L_p(Ω, P) of all p-integrable random variables; in particular, the triangle inequality

(E|X + Y|^p)^{1/p} ≤ (E|X|^p)^{1/p} + (E|Y|^p)^{1/p} (24)

holds for all p-integrable random variables X, Y on (Ω, Σ, P).

Hölder's inequality states that, for random variables X, Y on a common probability space and p, q ≥ 1 with 1/p + 1/q = 1, we have

|EXY| ≤ (E|X|^p)^{1/p} (E|Y|^q)^{1/q}.

Page 101

Sequences of random variables and convergence

Let X_n, n ∈ N, be a sequence of random variables such that X_n converges to X as n → ∞ in the sense that lim_{n→∞} X_n(ω) = X(ω) for all ω.

Lebesgue's dominated convergence theorem states that if there exists a random variable Y with E|Y| < ∞ such that |X_n| ≤ |Y| a.s., then lim_{n→∞} EX_n = EX.

Lebesgue's dominated convergence theorem also has an obvious formulation for integrals of sequences of functions.

Page 102

Fubini's theorem

Fubini's theorem on the integration of functions of two variables can be formulated as follows.

Let f : A × B → C be measurable, where (A, ν) and (B, µ) are measure spaces.

If ∫_{A×B} |f(x, y)| d(ν ⊗ µ)(x, y) < ∞, then

∫_A ( ∫_B f(x, y) dµ(y) ) dν(x) = ∫_B ( ∫_A f(x, y) dν(x) ) dµ(y).

Page 103

Moment computations and Cavalieri's formula

Absolute moments can be computed by means of the following formula.

Proposition

The absolute moments of a random variable X can be expressed as

E|X|^p = p ∫_0^∞ P(|X| ≥ t) t^{p−1} dt, p > 0.

Corollary

For a random variable X the expectation satisfies

EX = ∫_0^∞ P(X ≥ t) dt − ∫_0^∞ P(X ≤ −t) dt.
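
As an added numerical check (not in the slides), the Proposition can be tested for an Exp(1) random variable, for which P(|X| ≥ t) = e^{−t} and E|X|^p = Γ(p + 1); the value p = 2.5 and the integration grid are arbitrary.

```python
# A small numerical check of the Proposition for X ~ Exp(1):
# P(|X| >= t) = exp(-t) and E|X|^p = Gamma(p+1), so both sides should agree.
import math
import numpy as np

p = 2.5                                   # an arbitrary choice of p > 0
t = np.linspace(0.0, 50.0, 200_001)       # truncating the integral at t = 50 is harmless here
integrand = np.exp(-t) * t**(p - 1)       # P(|X| >= t) * t^(p-1)

dt = t[1] - t[0]
rhs = p * float(np.sum((integrand[:-1] + integrand[1:]) / 2.0) * dt)  # trapezoidal rule
print(rhs, math.gamma(p + 1))             # both approximately 3.3234
```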

Page 104

Tail bounds and Markov inequality

The function t ↦ P(|X| ≥ t) is called the tail of X. A simple but often effective tool to estimate the tail by expectations and moments is the Markov inequality.

Theorem
Let X be a random variable. Then

P(|X| ≥ t) ≤ E|X| / t  for all t > 0.

Page 105

Tail bounds and Markov inequality

Remark
As an important consequence we note that, for p > 0,

P(|X| ≥ t) = P(|X|^p ≥ t^p) ≤ t^{−p} E|X|^p  for all t > 0.

The special case p = 2 is referred to as the Chebyshev inequality. Similarly, for θ > 0 we obtain

P(X ≥ t) = P( exp(θX) ≥ exp(θt) ) ≤ exp(−θt) E exp(θX)  for all t ∈ R.

The function θ ↦ E exp(θX) is usually called the Laplace transform or the moment generating function of X.
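
A small added simulation (not in the slides) comparing the actual tail of an Exp(1) variable with the Markov bound (p = 1) and the second-moment bound (p = 2); the thresholds t are arbitrary.

```python
# Compare the empirical tail of X ~ Exp(1) with the Markov bound E|X|/t
# and with the p = 2 moment bound E|X|^2 / t^2.
import numpy as np

rng = np.random.default_rng(4)
X = rng.exponential(scale=1.0, size=1_000_000)

for t in (2.0, 4.0, 6.0):                      # arbitrary thresholds
    tail = np.mean(X >= t)
    markov = np.mean(X) / t                    # Markov (p = 1)
    moment2 = np.mean(X**2) / t**2             # p = 2 bound
    print(f"t={t}: tail {tail:.4f} <= Markov {markov:.4f}, p=2 bound {moment2:.4f}")
```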

Page 106

Normal distributions

A normally distributed random variable, or Gaussian random variable, X has probability density function

ψ(t) = (1/√(2πσ²)) exp( −(t − µ)²/(2σ²) ). (25)

Its distribution is often denoted by N(µ, σ).

It has mean EX = µ and variance E(X − µ)² = σ².

A standard Gaussian random variable (or standard normal, or simply standard Gaussian), usually denoted g, is a Gaussian random variable with Eg = 0 and Eg² = 1.

Page 107

Random vectors

A random vector X = [X1, . . . , Xn]^⊤ ∈ R^n is a collection of n random variables on a common probability space (Ω, Σ, P).

Its expectation is the vector EX = [EX1, . . . , EXn]^⊤ ∈ R^n, while its joint distribution function is defined as

F(t1, . . . , tn) = P(X1 ≤ t1, . . . , Xn ≤ tn), t1, . . . , tn ∈ R.

Similarly to the univariate case, the random vector X has a joint probability density if there exists a function φ : R^n → R+ such that for any measurable domain D ⊂ R^n

P(X ∈ D) = ∫_D φ(t1, . . . , tn) dt1 · · · dtn.

A complex random vector Z = X + iY ∈ C^n is a special case of a 2n-dimensional real random vector (X, Y) ∈ R^{2n}.

Page 108

Independence

A collection of random variables X1, . . . , Xn is (stochastically) independent if, for all t1, . . . , tn ∈ R,

P(X1 ≤ t1, . . . , Xn ≤ tn) = ∏_{ℓ=1}^{n} P(Xℓ ≤ tℓ).

For independent random variables, we have

E[ ∏_{ℓ=1}^{n} Xℓ ] = ∏_{ℓ=1}^{n} E[Xℓ]. (26)

If they have a joint probability density function φ, then the latter factorizes as

φ(t1, . . . , tn) = φ1(t1) × · · · × φn(tn),

where φ1, . . . , φn are the probability density functions of X1, . . . , Xn.

Page 109

Independence

In general, a collection X1 ∈ R^{n1}, . . . , Xm ∈ R^{nm} of random vectors is independent if for any collection of measurable sets Aℓ ⊂ R^{nℓ}, ℓ ∈ [m],

P(X1 ∈ A1, . . . , Xm ∈ Am) = ∏_{ℓ=1}^{m} P(Xℓ ∈ Aℓ).

If furthermore fℓ : R^{nℓ} → R^{Nℓ}, ℓ = 1, . . . , m, are measurable functions, then also the random vectors f1(X1), . . . , fm(Xm) are independent.

A collection X1, . . . , Xm ∈ R^n of independent random vectors that all have the same distribution is called independent identically distributed (i.i.d.). A random vector X′ will be called an independent copy of X if X and X′ are independent and have the same distribution.

Page 110

Sum of independent random variables

The sum X + Y of two independent random variables X, Y having probability density functions φX, φY has probability density function φ_{X+Y} given by the convolution

φ_{X+Y}(t) = (φX ∗ φY)(t) = ∫_{−∞}^{+∞} φX(u) φY(t − u) du. (27)
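
An added numerical illustration of (27), not from the slides: approximating the convolution of two Uniform(0, 1) densities on a grid recovers the triangular density of the sum; the grid spacing and evaluation points are arbitrary.

```python
# Approximate the convolution of two Uniform(0,1) densities on a grid and compare
# with the exact triangular density of the sum at two points.
import numpy as np

h = 1e-3                                     # grid spacing (arbitrary)
u = np.arange(0.0, 1.0, h)
phi_X = np.ones_like(u)                      # density of Uniform(0,1) on its support
phi_Y = np.ones_like(u)

phi_sum = np.convolve(phi_X, phi_Y) * h      # discrete approximation of (phi_X * phi_Y)(t)

for t0 in (0.5, 1.5):
    exact = t0 if t0 <= 1.0 else 2.0 - t0    # triangular density on [0, 2]
    print(t0, phi_sum[int(round(t0 / h))], exact)
```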

Page 111

Fubini's theorem for expectations

The theorem takes the following form. Let X, Y ∈ R^n be two independent random vectors (or simply random variables) and f : R^n × R^n → R a measurable function such that E|f(X, Y)| < ∞. Then the functions

f1 : R^n → R, f1(x) = Ef(x, Y),  f2 : R^n → R, f2(y) = Ef(X, y)

are measurable, E|f1(X)| < ∞ and E|f2(Y)| < ∞, and

Ef1(X) = Ef2(Y) = Ef(X, Y). (28)

The random variable f1(X) is also called conditional expectation or expectation conditional on X, and will sometimes be denoted by E_Y f(X, Y).

Page 112

Gaussian vectors

A random vector g ∈ R^n is called a standard Gaussian vector if its components are independent standard normally distributed random variables. More generally, a random vector X ∈ R^n is said to be a Gaussian vector, or multivariate normally distributed, if there exists a matrix A ∈ R^{n×k} such that X = Ag + µ, where g ∈ R^k is a standard Gaussian vector and µ ∈ R^n is the mean of X.

The matrix Σ = AA^⊤ is then the covariance matrix of X, i.e., Σ = E(X − µ)(X − µ)^⊤. If Σ is non-degenerate, i.e., Σ is positive definite, then X has a joint probability density function of the form

ψ(x) = ( 1 / ((2π)^{n/2} √det(Σ)) ) exp( −(1/2) ⟨x − µ, Σ^{−1}(x − µ)⟩ ).

In the degenerate case, when Σ is not invertible, X does not have a density.

It is easily deduced from the density that a rotated standard Gaussian Ug, where U is an orthogonal matrix, has the same distribution as g itself.
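
An added sketch (not in the slides) of the construction X = Ag + µ: sampling many such vectors, the empirical covariance is close to Σ = AA^⊤ and the empirical mean is close to µ; the dimensions, A and µ below are arbitrary choices.

```python
# Sample X = A g + mu and compare the empirical covariance with Sigma = A A^T.
import numpy as np

rng = np.random.default_rng(5)
n, k, N = 3, 2, 200_000                         # arbitrary dimensions and sample size
A = rng.standard_normal((n, k))
mu = np.array([1.0, -2.0, 0.5])                 # an arbitrary mean vector

g = rng.standard_normal((k, N))                 # N independent standard Gaussian vectors in R^k
X = A @ g + mu[:, None]                         # N samples of X = A g + mu

print("max |emp. cov - A A^T| =", np.max(np.abs(np.cov(X) - A @ A.T)))
print("max |emp. mean - mu|   =", np.max(np.abs(X.mean(axis=1) - mu)))
```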

Page 113

Sum of Gaussian variables

If X1, . . . , Xn are independent and normally distributed random variables with means µ1, . . . , µn and variances σ1², . . . , σn², then X = [X1, . . . , Xn]^⊤ has a multivariate normal distribution and the sum Z = ∑_{ℓ=1}^{n} Xℓ has the univariate normal distribution with mean µ = ∑_{ℓ=1}^{n} µℓ and variance σ² = ∑_{ℓ=1}^{n} σℓ², as can be calculated from (27).

Page 114

Jensen's inequality

Jensen's inequality reads as follows.

Theorem
Let f : R^n → R be a convex function, and let X ∈ R^n be a random vector. Then

f(EX) ≤ Ef(X).

Note that −f is convex if f is concave, so that for concave functions f, Jensen's inequality reads

Ef(X) ≤ f(EX). (29)

Page 115

Cramer's Theorem

We often encounter sums of independent mean-zero random variables. Deviation inequalities bound the tail of such sums.

We recall that the moment generating function of a (real-valued) random variable X is defined by

θ ↦ E exp(θX),

for all θ ∈ R for which the expectation on the right-hand side is well-defined.

Its logarithm is the cumulant generating function

C_X(θ) = ln E exp(θX).

With the help of these definitions we can formulate Cramer's theorem.

Page 116

Cramer's Theorem

Theorem
Let X1, . . . , XM be a sequence of independent (real-valued) random variables with cumulant generating functions C_{Xℓ}, ℓ ∈ [M] := {1, . . . , M}. Then, for t > 0,

P( ∑_{ℓ=1}^{M} Xℓ ≥ t ) ≤ exp( inf_{θ>0} ( −θt + ∑_{ℓ=1}^{M} C_{Xℓ}(θ) ) ).

Remark
The function

t ↦ inf_{θ>0} ( −θt + ∑_{ℓ=1}^{M} C_{Xℓ}(θ) )

appearing in the exponential is closely connected to a convex conjugate function appearing in convex analysis.

Page 117

Hoeffding's inequality

Theorem
Let X1, . . . , XM be a sequence of independent random variables such that EXℓ = 0 and |Xℓ| ≤ Bℓ almost surely, ℓ ∈ [M]. Then

P( ∑_{ℓ=1}^{M} Xℓ ≥ t ) ≤ exp( − t² / (2 ∑_{ℓ=1}^{M} Bℓ²) ),

and consequently,

P( | ∑_{ℓ=1}^{M} Xℓ | ≥ t ) ≤ 2 exp( − t² / (2 ∑_{ℓ=1}^{M} Bℓ²) ). (30)
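
An added Monte Carlo sanity check of (30), not from the slides, for Rademacher variables (Xℓ = ±1 with probability 1/2, hence Bℓ = 1); the values of M, t and the number of trials are arbitrary.

```python
# Monte Carlo sanity check of (30) for Rademacher variables (B_l = 1).
import numpy as np

rng = np.random.default_rng(6)
M, trials, t = 100, 50_000, 25.0               # arbitrary sizes and threshold

X = rng.choice([-1.0, 1.0], size=(trials, M))  # independent, mean zero, |X_l| <= 1
S = X.sum(axis=1)

empirical = np.mean(np.abs(S) >= t)
hoeffding = 2.0 * np.exp(-t**2 / (2.0 * M))
print("empirical tail:", empirical, " Hoeffding bound:", hoeffding)
```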

Page 118

Bernstein inequalities

Theorem
Let X1, . . . , XM be independent mean-zero random variables such that, for all integers n ≥ 2,

E|Xℓ|^n ≤ (n!/2) R^{n−2} σℓ²  for all ℓ ∈ [M], (31)

for some constants R > 0 and σℓ > 0, ℓ ∈ [M]. Then, for all t > 0,

P( | ∑_{ℓ=1}^{M} Xℓ | ≥ t ) ≤ 2 exp( − t² / (2(σ² + Rt)) ), (32)

where σ² := ∑_{ℓ=1}^{M} σℓ².

Page 119

Bernstein inequalities

Before providing the proof we give two consequences. The first is the Bernstein inequality for bounded random variables.

Corollary

Let X1, . . . , XM be independent random variables with zero mean such that |Xℓ| < K almost surely, for ℓ ∈ [M] and some constant K > 0. Further assume E|Xℓ|² ≤ σℓ² for constants σℓ > 0, ℓ ∈ [M]. Then, for all t > 0,

P( | ∑_{ℓ=1}^{M} Xℓ | ≥ t ) ≤ 2 exp( − t² / (2(σ² + Kt/3)) ), (33)

where σ² := ∑_{ℓ=1}^{M} σℓ².

Page 120

Bernstein inequalities

As a second consequence, we present the Bernstein inequality for subexponential random variables.

Corollary

Let X1, . . . , XM be independent mean-zero subexponential random variables, that is, P(|Xℓ| ≥ t) ≤ β e^{−κt} for some constants β, κ > 0, for all t > 0 and ℓ ∈ [M]. Then

P( | ∑_{ℓ=1}^{M} Xℓ | ≥ t ) ≤ 2 exp( − (κt)² / (2(2βM + κt)) ). (34)

Page 121

The Johnson-Lindenstrauss Lemma

Theorem (Johnson-Lindenstrauss Lemma)

For any 0 < ε < 1 and any integer n, let k be a positive integer such that

k ≥ 2β(ε²/2 − ε³/3)^{−1} ln n,

for some β ≥ 2. Then for any set P of n points in R^d there is a map f : R^d → R^k such that for all v, w ∈ P

(1 − ε)‖v − w‖2² ≤ ‖f(v) − f(w)‖2² ≤ (1 + ε)‖v − w‖2². (35)

Furthermore, this map can be generated at random with probability 1 − (n^{2−β} − n^{1−β}).

Let us comment on the probability: choosing β > 2 leads to a very favorable probability 1 − (n^{2−β} − n^{1−β}) ≈ 1 as soon as n is large enough.

Page 122

The Johnson-Lindenstrauss Lemma

To clarify the impact of this result, consider a cloud of n = 10⁷ points (ten million points) in dimension d = 10⁶ (one million).

Then there exists a map f, generated at random with probability close to 1, which allows us to project the 10 million points into dimension k = O(ε⁻² log 10⁶) = O(6ε⁻²), preserving the distances within the point cloud with a maximal distortion of O(ε).

For example, for ε = 0.1 we obtain k of the order of a thousand, while the original dimension d was one million. Hence, we obtain a quasi-isometric dimensionality reduction of 3-4 orders of magnitude.

Page 123

Projection onto random subspaces

One proof of the lemma takes f to be a suitable multiple of the orthogonal projection onto a random subspace of dimension k in R^d (Dasgupta and Gupta).

In this case f would be linear and representable by a matrix A.

Then, by considering the unit vectors z = (v − w)/‖v − w‖2, we obtain that (35) is equivalent to

(1 − ε) ≤ ‖Az‖2² ≤ (1 + ε),

or

| ‖Az‖2² − 1 | ≤ ε.

Page 124

... or a random point onto a fixed subspace

Hence the aim is essentially to estimate the length of a unit vector in R^d when it is projected onto a random k-dimensional subspace.

However, this length has the same distribution as the length of a random unit vector projected down onto a fixed k-dimensional subspace.

Here we take this subspace to be the space spanned by the first k coordinate vectors, for simplicity.

Page 125

Gaussianly distributed vectors

We recall that X ∼ N(0, 1) if its density is

ψ(x) = (1/√(2π)) exp(−x²/2).

Let X1, . . . , Xd be d independent Gaussian N(0, 1) random variables, and let Y = ‖X‖2^{−1} (X1, . . . , Xd).

It is easy to see that Y is a point chosen uniformly at random from the surface of the d-dimensional sphere S^{d−1} (exercise!).

Let the vector Z ∈ R^k be the projection of Y onto its first k coordinates, and let L = ‖Z‖2². Clearly the expected squared length of Z is µ = E[‖Z‖2²] = k/d (exercise!).

Page 126

Concentration inequalities

Lemma
Let k < d. Then,

(a) If α < 1, then

P( L ≤ αk/d ) ≤ α^{k/2} ( 1 + (1 − α)k/(d − k) )^{(d−k)/2} ≤ exp( (k/2)(1 − α + ln α) ).

(b) If α > 1, then

P( L ≥ αk/d ) ≤ α^{k/2} ( 1 + (1 − α)k/(d − k) )^{(d−k)/2} ≤ exp( (k/2)(1 − α + ln α) ).
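
An added Monte Carlo check (not in the slides) of E[L] = k/d and of the exponential bound in case (a); d, k, α and the sample size are arbitrary choices.

```python
# Monte Carlo check of E[L] = k/d and of the bound in case (a) of the Lemma.
import numpy as np

rng = np.random.default_rng(7)
d, k, N, alpha = 100, 10, 200_000, 0.5             # arbitrary choices with alpha < 1

G = rng.standard_normal((N, d))
Y = G / np.linalg.norm(G, axis=1, keepdims=True)   # uniform points on S^{d-1}
L = np.sum(Y[:, :k]**2, axis=1)                    # squared length of the first k coordinates

print("E[L] ~", L.mean(), " k/d =", k / d)
empirical = np.mean(L <= alpha * k / d)
bound = np.exp(k / 2.0 * (1.0 - alpha + np.log(alpha)))
print("P(L <= alpha k/d) ~", empirical, " bound:", bound)
```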

Page 127

Proof techniques

The proof of this lemma uses by now standard techniques for proving large deviation bounds on sums of random variables, such as the ones we used for proving the Hoeffding and Bernstein inequalities.

The proof of the Johnson-Lindenstrauss Lemma - as developed at the blackboard - follows by a union bound over the applications of the concentration inequalities of the Lemma.

Page 128

Johnson-Lindenstrauss Lemma by Gaussian matrices

Theorem (Johnson-Lindenstrauss Lemma)

For any 0 < ε < 1/2 and any integer n, let k be a positive integer such that

k ≥ β ε⁻² ln n.

Then for any set P of n points in R^d there is a map f : R^d → R^k such that for all v, w ∈ P

(1 − ε)‖v − w‖2² ≤ ‖f(v) − f(w)‖2² ≤ (1 + ε)‖v − w‖2². (36)

Furthermore, this map can be generated at random with probability 1 − (n^{2−β(1−ε)} − n^{1−β(1−ε)}).

Page 129

Concentration inequalities for Gaussian matrices

Lemma
Let x ∈ R^d. Assume that the entries of A ∈ R^{k×d} are sampled independently and identically distributed according to N(0, 1). Then

P( | ‖(1/√k) A x‖2² − ‖x‖2² | > ε‖x‖2² ) ≤ 2 e^{−(ε²−ε³)k/4},

or, equivalently,

P( (1 − ε)‖x‖2² ≤ ‖(1/√k) A x‖2² ≤ (1 + ε)‖x‖2² ) ≥ 1 − 2 e^{−(ε²−ε³)k/4}.
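
An added illustrative experiment (not from the slides): embedding n random points of R^d with f(x) = (1/√k) Ax for a Gaussian matrix A and measuring the worst relative distortion of pairwise squared distances, in the spirit of (36). The constant 8 in the choice of k and all sizes below are ad hoc choices.

```python
# Embed n random points of R^d with f(x) = (1/sqrt(k)) A x, A Gaussian, and
# measure the worst relative distortion of pairwise squared distances.
import numpy as np

rng = np.random.default_rng(8)
d, n, eps = 1000, 50, 0.2
k = int(np.ceil(8 * np.log(n) / eps**2))    # k = O(eps^-2 log n); the constant 8 is ad hoc

P = rng.standard_normal((n, d))             # the point cloud
A = rng.standard_normal((k, d))
FP = (P @ A.T) / np.sqrt(k)                 # images f(v) of all points at once

worst = 0.0
for i in range(n):
    for j in range(i + 1, n):
        orig = np.sum((P[i] - P[j])**2)
        proj = np.sum((FP[i] - FP[j])**2)
        worst = max(worst, abs(proj / orig - 1.0))
print("k =", k, " worst relative distortion:", worst, " target eps:", eps)
```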

Page 130

Nothing special in Gaussian matrices

Lemma
Let x ∈ R^d. Assume that the entries of A ∈ R^{k×d} are independent and take the values +1 or −1, each with probability 1/2. Then

P( | ‖(1/√k) A x‖2² − ‖x‖2² | > ε‖x‖2² ) ≤ 2 e^{−(ε²−ε³)k/4},

or, equivalently,

P( (1 − ε)‖x‖2² ≤ ‖(1/√k) A x‖2² ≤ (1 + ε)‖x‖2² ) ≥ 1 − 2 e^{−(ε²−ε³)k/4}.

Page 131

Scalar product preservation

Corollary

Let P be a set of points in the unit ball B(1) ⊂ R^d and let f : R^d → R^k be a linear map such that

(1 − ε)‖u ± v‖2² ≤ ‖f(u ± v)‖2² ≤ (1 + ε)‖u ± v‖2², (37)

for all u, v ∈ P. Then

|⟨u, v⟩ − ⟨f(u), f(v)⟩| ≤ ε,

for all u, v ∈ P.

Page 132

Scalar product preservation

Corollary

Let u, v be two points in the unit ball B(1) ⊂ R^d and let A ∈ R^{k×d} be a matrix with random entries such that, for any given x ∈ R^d,

P( (1 − ε)‖x‖2² ≤ ‖(1/√k) A x‖2² ≤ (1 + ε)‖x‖2² ) ≥ 1 − 2 e^{−(ε²−ε³)k/4}.

(For instance, A is a matrix with Gaussian random entries or with entries ±1 with probability 1/2.) Then, with probability at least 1 − 4 e^{−(ε²−ε³)k/4},

|⟨u, v⟩ − ⟨f(u), f(v)⟩| ≤ ε,

where f(x) = (1/√k) A x.
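
A last added illustration (not in the slides): for two correlated points in the unit ball and a Gaussian matrix A, the inner product is approximately preserved by f(x) = (1/√k) Ax; the dimensions and the way u, v are generated below are arbitrary.

```python
# Inner products of two points in the unit ball before and after f(x) = (1/sqrt(k)) A x.
import numpy as np

rng = np.random.default_rng(9)
d, k = 500, 2000                                           # arbitrary dimensions

u = rng.standard_normal(d)
u /= np.linalg.norm(u)                                     # a unit vector
v = 0.6 * u + 0.4 * rng.standard_normal(d) / np.sqrt(d)    # correlated with u, in the unit ball w.h.p.

A = rng.standard_normal((k, d))
fu, fv = (A @ u) / np.sqrt(k), (A @ v) / np.sqrt(k)

print("      <u, v> =", float(u @ v))
print("<f(u), f(v)> =", float(fu @ fv))                    # differs by an error of order 1/sqrt(k)
```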