TRANSCRIPT
Uniform Sampling for Matrix Approximation
Michael Cohen, Yin Tat Lee, Cameron Musco, Christopher Musco, Richard Peng, Aaron Sidford
Massachusetts Institute of Technology
November 20, 2014
Overview
Goal
Reduce a large matrix A to some smaller matrix Ã. Use Ã to approximate the solution to some problem, e.g. regression.
Main Result
Simple and efficient iterative sampling algorithms for matrix approximation.
Alternatives to Johnson-Lindenstrauss (random projection) type approaches.
Main technique
Understanding what information is preserved when we sample rows of a matrix uniformly at random.
Outline
1 Spectral Matrix Approximation
2 Leverage Score Sampling
3 Iterative Leverage Score Computation
4 Coherence Reducing Reweighting
Spectral Matrix Approximation
Goal
Find Ã with many fewer rows than A such that, for all x:
(1 − ε)‖Ax‖_2^2 ≤ ‖Ãx‖_2^2 ≤ (1 + ε)‖Ax‖_2^2
Applications
Approximate linear algebra - e.g. regression
Preconditioning
Spectral Graph Sparsification
Etc...
Regression
Goal:
min_x ‖Ax − b‖_2^2
Solve:
Ax = b ⟹ A^⊤(Ax) = A^⊤b
Set x to:
x = (A^⊤A)^{-1} A^⊤b
Problem: A is very tall. Too slow to compute A^⊤A.
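As a concrete reference point, here is a minimal NumPy sketch of the exact normal-equations solve, with illustrative sizes; forming A^⊤A is exactly the O(nd^2) bottleneck just described.

```python
import numpy as np

# Minimal sketch of the exact solve, with illustrative sizes (n >> d).
# Forming A.T @ A takes O(n d^2) time: the bottleneck for very tall A.
rng = np.random.default_rng(0)
n, d = 100_000, 50
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

x = np.linalg.solve(A.T @ A, A.T @ b)  # x = (A^T A)^{-1} A^T b
```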
Regression
Solution: spectrally approximate the combined matrix [A b], then solve the smaller problem:
‖Ax − b‖_2^2 = ‖[A b][x; −1]‖_2^2 ≈_ε ‖[Ã b̃][x; −1]‖_2^2 = ‖Ãx − b̃‖_2^2
x̃ = argmin_x ‖Ãx − b̃‖_2^2
Regression
Solution:
Or, use a preconditioned iterative method:
κ((Ã^⊤Ã)^{-1}(A^⊤A)) = O(1).
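A hedged sketch-and-solve demo of the first idea. A dense Gaussian map stands in for whatever construction of [Ã b̃] is available (the rest of the talk builds it by row sampling instead); the sketch size k is illustrative, not a tuned O(d/ε^2) bound.

```python
import numpy as np

# Hedged sketch-and-solve demo: S is a dense Gaussian map standing in for
# any construction that makes [A~ b~] = S [A b] a spectral approximation
# of [A b]; k = 500 is illustrative, not a tuned O(d / eps^2) bound.
rng = np.random.default_rng(0)
n, d, k = 10_000, 50, 500
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

S = rng.standard_normal((k, n)) / np.sqrt(k)
A_sk, b_sk = S @ A, S @ b

x_sk, *_ = np.linalg.lstsq(A_sk, b_sk, rcond=None)   # argmin ||A~x - b~||
x_opt, *_ = np.linalg.lstsq(A, b, rcond=None)        # argmin ||Ax - b||
print(np.linalg.norm(A @ x_sk - b) / np.linalg.norm(A @ x_opt - b))
```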
Spectral Matrix Approximation
All equivalent:
Norm: ‖Ãx‖_2^2 = (1 ± ε)‖Ax‖_2^2
Quadratic Form: x^⊤(Ã^⊤Ã)x = (1 ± ε) x^⊤(A^⊤A)x
Loewner Ordering: (1 − ε) A^⊤A ⪯ Ã^⊤Ã ⪯ (1 + ε) A^⊤A
Inverse: (1/(1 + ε)) (A^⊤A)^{-1} ⪯ (Ã^⊤Ã)^{-1} ⪯ (1/(1 − ε)) (A^⊤A)^{-1}
How to Find a Spectral Approximation?
Just take a matrix square root of A^⊤A!
U^⊤U = A^⊤A ⟹ ‖Ux‖_2^2 = ‖Ax‖_2^2
Cholesky decomposition, SVD, etc. give U ∈ ℝ^{d×d}.
Runs in something like O(nd^2) time.
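For concreteness, a minimal NumPy version of the square-root idea:

```python
import numpy as np

# U = Cholesky factor of A^T A gives an exact d x d "approximation":
# U^T U = A^T A, hence ||Ux||^2 = ||Ax||^2 for every x. Cost: O(n d^2).
rng = np.random.default_rng(0)
A = rng.standard_normal((10_000, 40))

L = np.linalg.cholesky(A.T @ A)   # lower-triangular, L @ L.T = A^T A
U = L.T                           # U^T U = A^T A, U is 40 x 40

x = rng.standard_normal(40)
print(np.linalg.norm(U @ x)**2, np.linalg.norm(A @ x)**2)  # agree
```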
How to Find one Faster?
Use Subspace Embeddings [Sarlos '06], [Nelson, Nguyen '13], [Clarkson, Woodruff '13], [Mahoney, Meng '13]
Left multiply by a sparse 'Johnson-Lindenstrauss' matrix.
Can apply in O(nnz(A)) time.
Reduces to O(d/ε^2) dimensions.
How to Find one Faster?
‘Squishes’ together rows
What if we want to preserve structure/sparsity?
Use Row Sampling
Leverage Score Sampling
Sample rows with probability proportional to their leverage scores.
Leverage Score
τ_i(A) = a_i^⊤ (A^⊤A)^{-1} a_i
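As a sanity check, leverage scores can be computed two equivalent ways; a minimal sketch assuming A has full column rank (the QR route is the "row norms of an orthonormal basis" view discussed below):

```python
import numpy as np

# Two equivalent formulas for leverage scores, assuming full column rank:
# tau_i = a_i^T (A^T A)^{-1} a_i, or the squared row norms of an
# orthonormal basis Q for A's column span.
rng = np.random.default_rng(0)
A = rng.standard_normal((1_000, 20))

G_inv = np.linalg.inv(A.T @ A)
tau = np.einsum('ij,jk,ik->i', A, G_inv, A)   # a_i^T (A^T A)^{-1} a_i

Q, _ = np.linalg.qr(A)                        # thin QR
assert np.allclose(tau, (Q**2).sum(axis=1))
print(tau.sum())                              # = d = 20 (Foster's theorem)
```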
Leverage Score Sampling
Sample each row independently with p_i = O(τ_i(A) log d / ε^2). Since Σ_i τ_i(A) = d, this gives a reduction to O(d log d / ε^2) rows.
Straightforward analysis with matrix Chernoff bounds.
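A hedged sketch of the sampling step itself. The constant c is illustrative (the real constant comes from the matrix Chernoff bound), and kept rows are rescaled by 1/√p_i, the standard trick that makes Ã^⊤Ã an unbiased estimator of A^⊤A.

```python
import numpy as np

def leverage_sample(A, tau, eps=0.5, c=4.0, rng=None):
    """Keep row i independently with p_i = min(1, c * tau_i * log d / eps^2),
    rescaling kept rows by 1/sqrt(p_i) so that E[A~^T A~] = A^T A.
    c = 4 is illustrative; the real constant comes from matrix Chernoff."""
    rng = rng or np.random.default_rng()
    n, d = A.shape
    p = np.minimum(1.0, c * tau * np.log(d) / eps**2)
    keep = rng.random(n) < p
    return A[keep] / np.sqrt(p[keep])[:, None]
```

With exact scores this keeps O(d log d / ε^2) rows in expectation.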
What is the leverage score?
Statistics: Outlier detection
Spectral Graph Theory: Effective resistance, commute time
Matrix Approximation: A row's importance in composing the quadratic form of A^⊤A
What is the leverage score?
τ_i(A) = a_i^⊤ (A^⊤A)^{-1} a_i
How to understand this equation?
Row norms of an orthonormal basis for A's columns
The correct 'scaling' to make the matrix Chernoff bound work
How easily a row can be reconstructed from the other rows
What is the leverage score?
How easily a row can be reconstructed from the other rows:
min_{x : x^⊤A = a_i^⊤} ‖x‖_2^2 = τ_i(A)
If a_i has a component orthogonal to all other rows: x = e_i, ‖x‖_2^2 = 1.
If there are many rows pointing 'in the direction of' a_i: x is well spread and has small norm.
What is the leverage score?
How easily a row can be reconstructed from the other rows:
If, say, a_i appears four times among A's rows, then x = [0, 0, . . . , 1/4, 1/4, 1/4, 1/4, 0, . . .] works, so
τ_i(A) ≤ ‖x‖_2^2 = 4 · (1/4)^2 = 1/4.
What is the leverage score?
Mathematically:
τ_i(A) = min_{x : x^⊤A = a_i^⊤} ‖x‖_2^2
To minimize, set x^⊤ = a_i^⊤ (A^⊤A)^{-1} A^⊤.
So:
‖x‖_2^2 = a_i^⊤ (A^⊤A)^{-1} A^⊤A (A^⊤A)^{-1} a_i = a_i^⊤ (A^⊤A)^{-1} a_i = τ_i(A)
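This characterization is easy to check numerically; a small sketch, using lstsq on the underdetermined system A^⊤x = a_i to obtain the minimum-norm solution:

```python
import numpy as np

# Verify: the min-norm x with x^T A = a_i^T satisfies ||x||^2 = tau_i(A).
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 10))
i = 7

x, *_ = np.linalg.lstsq(A.T, A[i], rcond=None)   # min-norm soln of A^T x = a_i
tau_i = A[i] @ np.linalg.inv(A.T @ A) @ A[i]
assert np.isclose(x @ x, tau_i)
```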
What is the leverage score?
The interpretation tells us:
τ_i(A) ≤ 1
Adding rows to A can only decrease the leverage scores of existing rows
Removing rows can only increase leverage scores
In fact we always have Σ_{i=1}^n τ_i(A) = d (Foster's Theorem)
Computing Leverage Scores
Great! Nicely interpretable scores that allow us to sample down A.
Problem: Computing leverage scores naively requires computing (A^⊤A)^{-1}.
Computing Leverage Scores
Traditional Solution
Overestimates are good enough; they just increase the number of rows taken.
Given a constant factor spectral approximation Ã we have:
a_i^⊤ (Ã^⊤Ã)^{-1} a_i ≈_c a_i^⊤ (A^⊤A)^{-1} a_i
Σ τ̃_i(A) ≤ Σ c · τ_i(A) = c Σ τ_i(A) = c · d
So we can still sample O(d log d / ε^2) rows using the τ̃_i(A).
But how do we obtain a constant factor spectral approximation in the first place?
Unfortunately, computing even a constant factor spectral approximation still requires either leverage score estimates, a subspace embedding, or a matrix square root.
Computing Leverage Scores
A spectral approximation Ã gives a for-each guarantee:
τ̃_i(A) ≤ c · τ_i(A)
This is stronger than we need though! We just need a bound on the sum:
Σ_i τ̃_i(A) ≤ c · d
Iterative Leverage Score Computation
Review of what our goal is:
Want to find Ã such that ‖Ãx‖_2^2 ≈_ε ‖Ax‖_2^2.
Want to avoid JL projection; just use random row sampling.
Need a way to efficiently compute approximations τ̃_i(A) such that Σ τ̃_i(A) = O(d).
Efficient: O(nnz(A) + R(d, d)), where R(d, d) is the cost of solving a d × d regression problem.
Uniform Row Sampling
In practice, data is typically incoherent: no row has a high leverage score [Kumar, Mohri, Talwalkar '12], [Avron, Maymounkov, Toledo '10]:
∀i τ_i(A) ≤ O(d/n)
In this case, we can uniformly sample O(d log d) rows to obtain a spectral approximation:
Σ_i τ_i(A) = O(d/n) · n = O(d)
Uniform Row Sampling
No guarantees on what uniform sampling does for general matrices.
E.g., suppose only one row of A has a nonzero first coordinate. For x = [1, 0, . . . , 0]^⊤ we have ‖Ax‖_2^2 = 1, but with good probability that row is missed and ‖Ãx‖_2^2 = 0.
Can we still use uniform sampling in some way?
Uniform Row Sampling
One simple idea for an algorithm:
1 Uniformly sample O(d log d) rows of A to obtain A_u
2 Compute (A_u^⊤A_u)^{-1}, and use it to estimate the leverage scores of A
3 Sample rows of A using these estimated leverage scores to obtain Ã, which spectrally approximates A
We want to bound how large Ã must be.
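A hedged sketch of steps 1 and 2. Clamping the estimates at 1 is safe because a true leverage score never exceeds 1; the tiny ridge term is only an implementation convenience to keep A_u^⊤A_u invertible and is not part of the analysis.

```python
import numpy as np

def scores_from_uniform(A, m, rng):
    """Steps 1-2: uniformly sample m rows (A_u), then overestimate each
    leverage score by min(a_i^T (A_u^T A_u)^{-1} a_i, 1). The clamp at 1
    is safe since true scores are at most 1; the ridge term only keeps
    A_u^T A_u invertible and is not part of the analysis."""
    n, d = A.shape
    Au = A[rng.choice(n, size=m, replace=False)]
    G_inv = np.linalg.inv(Au.T @ Au + 1e-10 * np.eye(d))
    t = np.einsum('ij,jk,ik->i', A, G_inv, A)
    return np.minimum(t, 1.0)
```

Step 3 then feeds these estimates into the leverage-score sampler sketched earlier. The theorem below bounds the sum of the estimates, and hence the size of Ã.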
Uniform Row Sampling
Theorem
Let A_u be obtained by uniformly sampling m rows of A. Let A_{u∪i} be A_u with a_i appended if not already included, and define τ̃_i(A) = a_i^⊤ (A_{u∪i}^⊤ A_{u∪i})^{-1} a_i. Then:
τ̃_i(A) ≥ τ_i(A)
E Σ_i τ̃_i(A) ≤ nd/m
Note: Sherman-Morrison gives an equation to compute τ̃_i(A) from a_i^⊤ (A_u^⊤A_u)^{-1} a_i:
τ̃_i(A) = a_i^⊤ (A_u^⊤A_u)^{-1} a_i if a_i ∈ A_u, and τ̃_i(A) = 1 / (1 + 1/(a_i^⊤ (A_u^⊤A_u)^{-1} a_i)) otherwise.
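In code, the Sherman-Morrison correction is a one-liner; a small sketch where t = a_i^⊤(A_u^⊤A_u)^{-1}a_i:

```python
def tau_tilde(t, in_sample):
    """tau~_i from t = a_i^T (A_u^T A_u)^{-1} a_i. Appending a_i to A_u
    (Sherman-Morrison) turns the quadratic form t into t / (1 + t),
    which equals 1 / (1 + 1/t) as on the slide."""
    return t if in_sample else t / (1.0 + t)
```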
Uniform Row Sampling
What does this bound give us?
E Σ_i τ̃_i(A) ≤ nd/m
Set m = n/2. Then E Σ_i τ̃_i(A) ≤ 2d.
Reminiscent of the MST algorithm from [Karger, Klein, Tarjan '95].
Iterative Leverage Score Computation
This immediately yields a recursive algorithm for obtaining Ã:
1 Recursively obtain a constant factor spectral approximation of a uniformly sampled half of A
2 Use it to approximate the τ̃_i(A)
3 Use these approximate scores to obtain Ã (with any (1 ± ε) multiplicative error you want)
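Putting the pieces together, a hedged end-to-end sketch of the recursion. The base case, constants, and the ridge term are all illustrative; the paper's algorithm is more careful about each.

```python
import numpy as np

def spectral_approx(A, eps=0.5, rng=None):
    """Recursive sketch: approximate a uniform half of A, use it to
    overestimate leverage scores (clamped at 1, as in the Sherman-Morrison
    discussion), then sample and rescale rows. Illustrative constants."""
    rng = rng or np.random.default_rng()
    n, d = A.shape
    if n <= 8 * d:                            # base case: A is already small
        return A
    half = A[rng.choice(n, size=n // 2, replace=False)]
    B = spectral_approx(half, eps, rng)       # constant-factor approx of half
    G_inv = np.linalg.inv(B.T @ B + 1e-10 * np.eye(d))
    tau = np.minimum(np.einsum('ij,jk,ik->i', A, G_inv, A), 1.0)
    p = np.minimum(1.0, 4.0 * tau * np.log(d + 1) / eps**2)
    keep = rng.random(n) < p
    return A[keep] / np.sqrt(p[keep])[:, None]
```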
Iterative Leverage Score Computation
This algorithm, along with a few other variations on it, gives the main algorithmic results of our paper:
Obtain a spectral approximation Ã of A with arbitrary error
Avoid JL projecting and densifying A
O(nnz(A) + R(d, d)) time
Proof of Main Theorem
Theorem
Let A_u be obtained by uniformly sampling m rows of A. Let A_{u∪i} be A_u with a_i appended if not already included, and define τ̃_i(A) = a_i^⊤ (A_{u∪i}^⊤ A_{u∪i})^{-1} a_i. Then:
τ̃_i(A) ≥ τ_i(A) (1)
E Σ_i τ̃_i(A) ≤ nd/m (2)
(1) follows from the fact that removing rows of A can only increase leverage scores.
Proof of Main Theorem
Goal: E_u Σ_i τ̃_i(A) ≤ nd/m.
Consider choosing a uniformly random row a_j. By symmetry:
E_u Σ_i τ̃_i(A) = n · E_{u,j} τ̃_j(A).
Proof of Main Theorem
Further, the expectation of this process is exactly the same as the expected leverage score if we sample m rows and then compute the leverage score of a random one of them.
Proof of Main Theorem
Remember, Σ_i τ_i(A_u) = d.
A_u has m rows, so the expected score of a randomly chosen row is d/m:
E_j τ̃_j(A) = d/m, so E_u Σ_i τ̃_i(A) = nd/m.
Review
What did this theorem just tell us?
E Σ_i τ̃_i(A) = E Σ_i a_i^⊤ (A_{u∪i}^⊤ A_{u∪i})^{-1} a_i is bounded. And this is all we need!
Recall that uniform sampling from A does not give us a spectral approximation.
We cannot bound a_i^⊤ (A_u^⊤A_u)^{-1} a_i.
Coherence Reducing Reweighting
Reminder: If our data is incoherent, then all leverage scores are small, O(d/n), and we can uniformly sample rows to obtain a small spectral approximation.
We show: Even if our matrix is not incoherent, it is 'close' to some incoherent matrix.
Coherence Reducing Reweighting
What do we mean by close?
We can reweight d/α rows of A to obtain A′ with τ_i(A′) ≤ α for all i.
E.g., we can reweight n/2 rows so that every τ_i is bounded by 2d/n.
Coherence Reducing Reweighting
What does this result give us (without going into details)?
Uniformly sample m rows of A and set τ̃_i(A) = min{a_i^⊤ (A_u^⊤A_u)^{-1} a_i, 1}.
This sampling does a good job of approximating A′.
It gives good estimates for the leverage scores of the rows that are not reweighted.
Few rows are reweighted, so overall we are able to significantly reduce the matrix size.
We don't need to actually compute the reweighting W; its existence is enough.
Coherence Reducing Reweighting
How to prove the existence of such a reweighting?
Σ_i τ_i(A) = d, so there are at most d/α rows with τ_i(A) ≥ α.
Can we just delete them?
Coherence Reducing Reweighting
Not quite. Deleting some rows may cause the leverage scores of other rows to increase.
Coherence Reducing Reweighting
Alternative idea:
1 Cycle through the rows
2 Each time you see a row with τ_i(A) ≥ α, cut its weight so that τ_i(A) = α
3 Repeat
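A hedged sketch of this cycle-and-cut process. The multiplicative cut w_i *= α/τ_i is a simple approximate correction (cutting one weight perturbs the other scores, which is exactly why the sweep repeats), and the sweep count is illustrative.

```python
import numpy as np

def reduce_coherence(A, alpha, n_sweeps=25):
    """Maintain row weights w; whenever the weighted score of row i
    exceeds alpha, shrink w_i. For weighted rows sqrt(w_i) * a_i the
    score is w_i * a_i^T (A_w^T A_w)^{-1} a_i. The cut w_i *= alpha/tau_i
    is approximate (scores shift as weights change), hence the repeats."""
    n, d = A.shape
    w = np.ones(n)
    for _ in range(n_sweeps):
        Aw = np.sqrt(w)[:, None] * A
        G_inv = np.linalg.inv(Aw.T @ Aw)
        tau = w * np.einsum('ij,jk,ik->i', A, G_inv, A)
        high = tau > alpha
        if not high.any():
            break
        w[high] *= alpha / tau[high]   # cutting weight only raises other scores
    return w
```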
Key observation:
Decreasing the weight of a row will only increase the leverage scores of other rows.
At any step, every row that has been reweighted has leverage score ≥ α, so there are at most d/α reweighted rows.
Coherence Reducing Reweighting
But does this reweighting process converge?
Rows that keep violating the constraint will have their weight cut to ≈ 0.
We can remove them without significantly affecting the leverage scores of other rows.
So overall, reweighting d/α rows is enough to cut all leverage scores below α.
Conclusion
Take away:
A very simple analysis shows how leverage scores can be approximated with uniform sampling.
Simple iterative spectral approximation algorithms matching state-of-the-art runtimes follow.
Open Questions:
Analogous algorithms for low rank approximation?
For other types of regression?
Generally, our result shows that it is possible to go beyond relative condition number (spectral approximation) invariants for iterative algorithms. Can this type of analysis be useful elsewhere?
Thanks! Questions?