singular value decomposition - cornell university · 2008-07-16 · rank deﬁciency • suppose a...

Singular Value Decomposition

CS3220 - Summer 2008Jonathan Kaldor

Another Factorization?

• We’ve already looked at A=LU (for n x n matrices) and A=QR (for m x m matrices)

• Both of them are exceedingly useful, but somewhat specialized

Extending QR

• We factored A=QR because we wanted an “easy” system to solve for the least squares problem (namely, upper triangular system)

• Recall also that when solving n x n systems, we observed that diagonal systems were even easier to solve

• Can we come up with a factorization where we only have orthogonal and diagonal matrices?

The SVD

• For any m x n matrix A, we can factor it into A = U∑VT, where:

U: m x m orthogonal matrixV: n x n orthogonal matrix∑: m x n diagonal matrix, with ∑i,i = σi ≥ 0

σi’s are typically ordered so σi ≥ σi+1 for i=1...n

Terminology

• The σis are called the singular values of the matrix A

• The columns ui of U are called the left singular vectors of A

• The columns vi of V are called the right singular vectors of A

Existence

• Note: we have not made any qualifications about A

• In particular, A doesn’t need to be full rank, and m can be less than, equal to, or greater than n (works for all sized matrices A). Essentially, every matrix has an SVD factorization

Uniqueness

• SVD is “mostly” unique. Singular values σi are unique, and singular vectors ui and vi are unique up to choice of sign if σis are distinct. If σi = σi+1 for some i, then SVD is not unique

• Example: identity matrix

How to Compute?

• Short answer: in MATLAB, with [U, S, V] = svd(A);

• Longer answer: algorithm is beyond the scope of this course

• Sufficient to know that factorization exists, we can compute it, and that computing it is more expensive than LU or QR factorization.

What does it mean?

• Let A = U∑VT. What is Av1 (A multiplied by the first column of V)?

• VTv1 = e1 = [1;0;0;...0]

• ∑e1 = σ1e1 = [σ1;0;0;...0]

• Uσ1e1 = σ1Ue1 = σ1u1

• Extending this shows that Avi = σiui for all i

What does it mean?

• Take A = 2 x 2 matrix. Then U and V are both 2 x 2 matrices

v1v2

Unit Circle

What does it mean?


v1v2

What does it mean?


v1v2

v1

v2

Multiply by VT

(rotates v vectors to axes)

What does it mean?


v1

v2

What does it mean?


What does it mean?


Multiply by ∑ (scales 1st axis by σ1,

second by σ2)

What does it mean?


What does it mean?


Multiply by U (rotates ellipse to u1, u2)

u1

u2

What does it mean?


u1

u2

v1v2

Multiply by U∑VT

What does it mean?

• In 2D, says that A takes directions v1 and v2, scales them by σ1 and σ2, and then rotates them to u1 and u2

• Maps unit circle defined by v1 and v2 to ellipse defined by axes u1 and u2

• This geometric argument is true in general (but best not to try and imagine it for n > 3!)

Rank Deficiency

• Suppose A is 2 x 2, not the zero matrix, and not full rank (i.e. its singular). This means that it maps to a subspace of the 2D plane (i.e. a line). This can also be seen to be mapping the unit circle to a degenerate ellipse where one axis has length 0

• In terms of our rotation and scaling operations, this is equivalent to having σ2=0 (why σ2?)

Rank Deficiency

• Thus, we can use the SVD to determine the rank of a matrix: Compute the SVD, and count the number of singular values > 0 (in practice, need to count number of singular values > some small epsilon to account for floating point issues)

Rank Deficiency

• If matrix is rank r

U1 U2

V1T

V2T

σ1 ⋱

σr

∑1

r m - r r

r

n-rm

n

n-r

Range and Null Space

• Suppose vi is a vector in V2. What is Avi?

• If vi is in V2, then σi = 0. That means that Avi = σiui = 0

• Thus, vi is in the null space of A

• This means that V2 is an orthonormal basis for the null space of A (or null(A))


• Similarly, let x be any random vector. Then Ax is a linear combination of the columns of U1

• This means that U1 is an orthonormal basis for range(A)


• Since columns in V1 are orthonormal to V2, and V2 is an orthonormal basis for null(A), it follows that V1 is an orthonormal basis for the orthogonal complement of null(A) (or null(A)⊥ )

• Similarly, columns of U2 are orthonormal to U1, so U2 is an orthonormal basis for ran(A)⊥

Skinny SVD

• Just like QR, there are some columns of U and V that we don’t need

• Namely, any vector in V2 gets zeroed out when multiplying by ∑, and any vector in U2 is zeroed when multiplying by ∑

• Can reconstruct A using only U1, ∑(1:r, 1:r), and V1, where U1 is m x r and V1 is n x r

Generalizing Least Squares

• We know how to solve least squares as long as A is of full rank

• Suppose A is instead of rank r (σr+1 = ... = σn = 0). Can we still find an acceptable least squares solution?

• Note: have to combine both overdetermined and underdetermined strategies (may not be an exact solution, and we can add any vector in the null space without changing the answer)

Rank-Deficient Least Squares

• ‖Ax-b‖22 = ‖U∑VTx - b‖22

=

=

=

U1 U2 ∑1 00 0

V1T

V2Tx

b-

U1T

U2T∑1 00 0

V1T

V2Tx b-

U1T∑1 0 V1T

V2Tx b- U2T b+

Rank Deficient Least Squares

• =

Let y = = and a = U1Tb. Then

U1T∑1 0 V1T

V2Tx b-

V1T

V2Tx y1

y2

∑1 0 y1

y2

a-


• Can choose y2 to be anything and still satisfy equation. Let it be 0 to minimize norm

Then our solution is simply ∑1y1 = a and y2 = 0. Recall that y1 = V1Tx and a = U1Tb. Substituting everything, we get x = V1∑1-1U1Tb

∑1 0 y1

y2

a-


• Note that this included both of the observations we made while solving overdetermined and underdetermined systems (ignoring ‖U2Tb‖ since there was no way to minimize it, and seeing that we could arbitrarily set part of our solution vector (V2Tx) without changing the solution)

Matrix Inverse

• Let A = U∑VT be an n x n matrix of full rank. Then A-1=V∑-1UT

• What can we observe about this?

• Roles of U and V are changed

• Singular values of inverse are reciprocals of singular values of A: 1/σi (note that they are in reverse order because of the way we order singular values)

Matrix (Pseudo)Inverse

• We can generalize this notion of the matrix inverse to come up with the pseudoinverse, which exists for m x n matrices of rank r:A+ = V1∑1-1U1T, where V1, ∑1, and U1 are defined from the skinny SVD

• This is in a sense the closest matrix to the inverse for matrices that don’t have an inverse

Matrix (Pseudo)Inverse

• Note that when A is n x n and full rank, U = U1, V = V1, and ∑ = ∑1, so the pseudoinverse is the inverse.

• Note that the pseudoinverse is just what we came up with in the general rank-deficient least squares case: the pseudoinverse A+b solves the least squares problem Ax=b

Sidestep: Matrix Norms

• When introducing least squares, we introduced the concept of vector norms, which measured size.

• There is a corresponding notion of norms for matrices as well, which have similar properties as the vector norms:‖A‖ > 0 if A≠0‖c A‖ = |c| ‖A‖‖A + B‖ ≤ ‖A‖ + ‖B‖


• Frobenius norm of a matrix: ‖A‖F = sqrt(sum(sum(Ai,j2)))


• If ‖‖n is some vector norm, then there is a corresponding matrix norm defined as

‖A‖n = max ‖Ax‖n / ‖x‖n

Note: this does not necessarily give us a rule for how to compute the matrix norm

x≠0


• ‖A‖1 happens to be easy to compute directly: it’s max ‖A(:,i)‖1, i.e. the maximum one-norm of the columns of A

• ‖A‖2 unfortunately is rather difficult to compute directly... at least, without our fancy new SVD. Namely, ‖A‖2 = σ1


• Matrix norms are useful for measuring the maximum amount any vector x is scaled by when multiplying by A

• We’ll see on Wednesday that they are also useful for measuring error propagation in linear systems

Rank-k Approximation

• A = U∑VT can be expressed as

sum(ui σi viT)

from i = 1 ... n

• Idea: cut off summation at value k < n

• Gives us the rank-k approximation

Rank-k Approximation

• The rank-k approximation is a rank-k matrix that is the closest rank-k matrix to the original matrix A when measured using the Frobenius matrix norm

• ‖A-Ak‖F is minimal over all rank k matrices

• Same as setting k+1...n singular values to 0.

Principal Component Analysis (PCA)

• Powerful tool for data analysis

• General idea: we have some data in a high dimensional space, but it actually comes from some low dimensional model, maybe corrupted by some noise

Principal Component Analysis

• Example: Suppose we have a zip line (rope between two trees) that we can slide down. We set up a bunch of cameras to record our motion as we slide down. Each camera records an (x,y) position for us at each point in time - our position in camera space. We have 2n measurements, if we have n cameras. However, our motion is primarily in one direction (plus some error)

Principal Component Analysis

Variance and Covariance

• Variance of a set of m measurements: how much it spreads from the mean∑(Xi - X)2/mwhere X is the mean of the data

• Covariance: how much a pair of measurements change together (dependence between the two variables)∑(Xi - X)(Yi - Y)/m


• Idea: choose basis that maximizes variance, minimizes covariance of measurements in that basis

• Maximizing variance: finding important dimensions

• Minimizing covariance: reducing dependence between pairs of measurements


• If we have experimental data for measurement X in m x 1 vector x, then variance is simply xTx/m

• If we have experimental data for two measurements X and Y in m x 1 vectors x and y, then covariance is simply xTy/m


• Let X be our m x n matrix of experimental data (m trials, n measurements per trial, adjusted so mean of each measurement is 0)

• Covariance matrix can then be expressed as XTX

• Want to find orthonormal basis that diagonalizes covariance matrix: this will minimize covariance, maximize variance (in that basis)

PCA

• We can use the SVD to find a good basis: if X = U∑VT, then XTX = V∑∑VT

• Singular values σi give us relative importances of each dimension

• Columns of V give the orthonormal basis we’re interested in.

singular value decomposition - cornell university · 2008-07-16 · rank deﬁciency • suppose a...

Documents