
Page 1

Empirical Testing of Sparse Approximation and Matrix Completion Algorithms

Jared Tanner

Workshop on Sparsity, Compressed Sensing and Applications

University of Oxford
Joint with Blanchard, Donoho, and Wei

Page 2

Three sparse approximation questions to test

Sparse approximation:

$\min_x \|x\|_0 \ \text{ s.t. } \ \|Ax - b\|_2 \le \tau$, with $A \in \mathbb{R}^{m \times n}$

1. Are there algorithms that have the same behaviour for different A?
2. Which algorithm is fastest while retaining a high recovery probability?

Matrix completion:

$\min_X \operatorname{rank}(X) \ \text{ s.t. } \ \|\mathcal{A}(X) - b\|_2 \le \tau$, with $\mathcal{A}: \mathbb{R}^{m \times n} \to \mathbb{R}^p$

3. What is the largest rank that can be recovered with an efficient algorithm?

Information about each question can be gleaned from large-scale empirical testing. Let's use some HPC resources. (A minimal problem-instance sketch follows.)
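To make the testing protocol concrete, here is a minimal sketch of how a single random instance and its success criterion can be set up for question 1, using a Gaussian ensemble. It is written in Python/NumPy purely for illustration; the function names are mine and this is not the Matlab/GPU code used in the study.

```python
import numpy as np

def sparse_instance(m, n, k, rng):
    """Draw a Gaussian A, a k-sparse x0 with Gaussian nonzeros, and b = A @ x0."""
    A = rng.standard_normal((m, n)) / np.sqrt(m)      # roughly unit-norm columns
    x0 = np.zeros(n)
    support = rng.choice(n, size=k, replace=False)    # random support of size k
    x0[support] = rng.standard_normal(k)
    return A, x0, A @ x0

def recovered(x_hat, x0, tol=1e-3):
    """Declare success when the relative l2 error is below tol."""
    return np.linalg.norm(x_hat - x0) <= tol * np.linalg.norm(x0)
```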

Page 5

Sparse approximation phase transition

- Problem characterized by three numbers: k ≤ m ≤ n
  • n, signal length, the "Nyquist" sampling rate
  • m, number of inner product measurements
  • k, signal complexity (sparsity), k := min_x ‖x‖_0

- Mixed under/over-sampling rates compared to naive/optimal:
  Undersampling: $\delta_m := m/n$,   Oversampling: $\rho_m := k/m$

- Testing model: for a given matrix ensemble and algorithm, draw A and a k-sparse x0, and let Π(k, m, n) be the probability of recovery.

- For fixed (δ_m, ρ_m), Π(k, m, n) converges to 1 or 0 with increasing m; the two regimes are separated by a phase transition curve ρ(δ).

- Is there an algorithm with ρ(δ) large whose Π(k, m, n) is insensitive to the matrix ensemble? (A sketch of such an empirical sweep follows.)
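As an illustration only (Python/NumPy again, with hypothetical names; the actual study used Matlab and GPU code over millions of instances), Π(k, m, n) can be estimated by Monte Carlo and the transition traced by sweeping k at fixed m and n:

```python
import numpy as np

def success_probability(solver, m, n, k, trials=100, seed=0):
    """Estimate Pi(k, m, n): the fraction of random instances `solver` recovers."""
    rng = np.random.default_rng(seed)
    wins = 0
    for _ in range(trials):
        A, x0, b = sparse_instance(m, n, k, rng)   # generator from the earlier sketch
        wins += bool(recovered(solver(A, b, k), x0))
    return wins / trials

# Fix delta = m/n, sweep rho = k/m, and look for the 50% crossing:
# pi = [success_probability(my_solver, m, n, k) for k in range(1, m + 1)]
```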

Page 6

Phase Transition: $\ell_1$ ball, $C^n$

- With overwhelming probability on the measurements $A_{m,n}$: for any ε > 0, as (k, m, n) → ∞
  • All k-sparse signals recovered if $k/m \le \rho_S(m/n, C)(1 - \varepsilon)$
  • Most k-sparse signals recovered if $k/m \le \rho_W(m/n, C)(1 - \varepsilon)$
  • Failure is typical if $k/m \ge \rho_W(m/n, C)(1 + \varepsilon)$

[Figure: strong and weak phase transitions ρ_S and ρ_W for the ℓ1 ball C^n, plotted as k/m versus δ = m/n; below ρ_S all signals are recovered, below ρ_W most signals are recovered.]

- Asymptotic behaviour as δ → 0: $\rho(m/n) \sim [2(e)\log(n/m)]^{-1}$
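As a rough numerical illustration (my reading, assuming natural logarithms, with the factor e left optional as bracketed on the slide): at n/m = 100,

\[ \rho \sim \frac{1}{2\ln 100} \approx 0.11, \qquad \text{or} \qquad \frac{1}{2e\ln 100} \approx 0.04 \ \text{with the factor } e. \]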

Page 7

Phase Transition: Simplex, $T^{n-1}$, x ≥ 0

- With overwhelming probability on the measurements $A_{m,n}$: for any ε > 0 and x ≥ 0, as (k, m, n) → ∞
  • All k-sparse signals recovered if $k/m \le \rho_S(m/n, T)(1 - \varepsilon)$
  • Most k-sparse signals recovered if $k/m \le \rho_W(m/n, T)(1 - \varepsilon)$
  • Failure is typical if $k/m \ge \rho_W(m/n, T)(1 + \varepsilon)$

[Figure: strong and weak phase transitions ρ_S and ρ_W for the simplex T^{n−1}, plotted as k/m versus δ = m/n; below ρ_S all signals are recovered, below ρ_W most signals are recovered.]

- Asymptotic behaviour as δ → 0: $\rho(m/n) \sim [2(e)\log(n/m)]^{-1}$

Page 8

$\ell_1$-Weak Phase Transitions: Visual agreement

- Testing beyond the proven theory, 6.4 CPU years later...
- Black: weak phase transition, x ≥ 0 (top), x signed (bottom)
- Overlaid: empirical evidence of the 50% success rate:

[Figure: weak phase transition ρ(δ, Q) (black curve) with overlaid empirical 50% success-rate points, ρ = k/n versus δ = n/N, for Gaussian, Bernoulli, Fourier, Ternary (p = 2/3, 2/5, 1/10), Hadamard, Expander (p = 1/5), and Rademacher ensembles.]

- Gaussian, Bernoulli, Fourier, Hadamard, Rademacher
- Ternary (p): P(0) = 1 − p and P(±1) = p/2
- Expander (p): ⌈p · n⌉ ones per column, otherwise zeros
- Rigorous statistical comparison shows $n^{-1/2}$ convergence
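For concreteness, here is a minimal sketch (Python/NumPy, illustrative names; the matrix here is m × n, so the slide's "⌈p·n⌉ ones per column" in its n × N convention becomes ⌈p·m⌉ below) of how the ternary, expander, and Rademacher ensembles can be drawn:

```python
import numpy as np

def ternary_matrix(m, n, p, rng):
    """Ternary(p): each entry is 0 with probability 1-p, and +1 or -1 each with p/2."""
    signs = rng.choice([-1.0, 1.0], size=(m, n))
    return signs * (rng.random((m, n)) < p)

def expander_matrix(m, n, p, rng):
    """Expander(p): ceil(p * m) ones placed in randomly chosen rows of each column."""
    d = int(np.ceil(p * m))
    A = np.zeros((m, n))
    for j in range(n):
        A[rng.choice(m, size=d, replace=False), j] = 1.0
    return A

def rademacher_matrix(m, n, rng):
    """Rademacher: i.i.d. +/-1 entries, scaled so columns have unit norm."""
    return rng.choice([-1.0, 1.0], size=(m, n)) / np.sqrt(m)
```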

Page 10

Bulk Z-scores

[Figure: bulk Z-scores plotted against δ = n/N for four ensembles: (a) Bernoulli, (b) Fourier, (c) Ternary (1/3), (d) Rademacher.]

- n = 200, n = 400 and n = 1600
- Linear trend with δ = m/n, decaying at rate $n^{-1/2}$
- Proven for matrices with sub-Gaussian tails (Montanari 2012)

Page 11

Which algorithm is fastest and has a high phase transition?

State-of-the-art algorithms for sparse approximation

- Hard Thresholding, $H_k(A^T b)$, followed by a subspace-restricted linear solver: Conjugate Gradient
- Normalized IHT (NIHT): $H_k(x^t + \kappa A^T(b - Ax^t))$ (steepest descent)
- Hard Thresholding Pursuit (HTP): NIHT with a pseudo-inverse
- CSMPSP (a hybrid of CoSaMP and Subspace Pursuit):

  $v^{t+1} = H_{\alpha k}(x^t + \kappa A^T(b - Ax^t))$
  $I_t = \operatorname{supp}(v^{t+1}) \cup \operatorname{supp}(x^t)$   (join support sets)
  $w_{I_t} = (A_{I_t}^T A_{I_t})^{-1} A_{I_t}^T b$   (least-squares fit)
  $x^{t+1} = H_{\beta k}(w)$   (second threshold)

- SpaRSA [Lee and Wright '08]
- Testing environment with random problem generation, or with a user-supplied matrix and measurements
- Matrix ensembles: Discrete Cosine Transform, sparse matrices, Gaussian (a NIHT sketch follows below)
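A minimal NIHT iteration, written here in Python/NumPy for a dense A purely as an illustration (the GPU code works with DCT, sparse, and Gaussian ensembles and has more careful stepsize and support logic), might look like:

```python
import numpy as np

def hard_threshold(v, k):
    """H_k: keep the k largest-magnitude entries of v, zero out the rest."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

def niht(A, b, k, iters=300, tol=1e-6):
    """Normalized IHT: x <- H_k(x + kappa * A.T @ (b - A @ x))."""
    x = hard_threshold(A.T @ b, k)
    for _ in range(iters):
        r = b - A @ x
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        g = A.T @ r
        gs = np.zeros_like(g)
        support = np.flatnonzero(x)                   # Lambda_t, current support
        gs[support] = g[support]
        denom = np.linalg.norm(A @ gs) ** 2
        kappa = np.linalg.norm(gs) ** 2 / denom if denom > 0 else 1.0
        x = hard_threshold(x + kappa * g, k)
    return x
```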

Page 14

Ingredients of Greedy CS Algorithms:

- Descent: $\nu^t := x^t + \kappa A^T(b - Ax^t)$ with
  $\kappa = \dfrac{\|A_{\Lambda_t}^T (b - Ax^t)\|_2^2}{\|A_{\Lambda_t} A_{\Lambda_t}^T (b - Ax^t)\|_2^2}$;
  requires two matvecs, one transpose matvec, and vector additions.

- Support: identification of the support set for $x^{t+1} = H_k(\nu^t)$, hard thresholding, and calculating κ. Linear binning gives a fast parallel order-statistic calculation, and is only carried out when the support set could change; this reduced the support-set time to a small fraction of one DCT matvec.

- Generation: when testing millions of problems, problem generation can become slow, especially using Matlab's randn.

Total time (for large problems) is reduced to essentially the matvecs. (A sketch of the linear-binning selection follows.)
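A rough serial sketch of the linear-binning idea (Python/NumPy here; the real implementation is a parallel GPU kernel, and the bin count below is an arbitrary choice):

```python
import numpy as np

def threshold_by_linear_binning(v, k, nbins=4096):
    """Approximate the k-th largest magnitude of v with a single linear pass."""
    mags = np.abs(v)
    vmax = mags.max()
    if vmax == 0.0:
        return 0.0
    counts, edges = np.histogram(mags, bins=nbins, range=(0.0, vmax))
    total = 0
    for b in range(nbins - 1, -1, -1):               # walk bins from largest down
        total += counts[b]
        if total >= k:
            return edges[b]                           # lower edge of the k-th bin
    return 0.0

# Support estimate: entries at or above the approximate k-th largest magnitude.
# support = np.flatnonzero(np.abs(v) >= threshold_by_linear_binning(v, k))
```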

Page 15

Computing environment

CPU:

- Intel Xeon 5650 (released March 2010)

- 6 cores, 2.66 GHz

- 12 GB of DDR3-1066, 6.4 GT/s

- Matlab 2010a, 64-bit (built-in multi-core threading)

GPU:

- NVIDIA Tesla C2050 (released April 2010)

- 448 cores, peak performance 1.03 Tflop/s

- 3 GB GDDR5 (on-device memory)

- Error correction (ECC)

Is it faster?

Page 16

Multiplicative acceleration factor for NIHT: CPU/GPU

Matrix ensemble    n       nonZeros   Descent   Support   Generation
dct               2^14        -         63.21     42.16       1.04
dct               2^16        -         64.46     41.59       1.77
dct               2^18        -         54.11     38.45       3.20
dct               2^20        -         57.94     38.82       5.80
smv               2^12        4          0.52      4.10      32.32
smv               2^14        4          1.41     14.64     135.08
smv               2^16        4          4.29     43.04     521.60
smv               2^18        4         10.43     71.50    1630.08
smv               2^12        7          0.63      3.48      33.92
smv               2^14        7          1.86     12.86     142.53
smv               2^16        7          5.42     37.11     526.82
smv               2^18        7         10.80     55.60    1556.44
gen               2^10        -          1.06      2.07       0.34
gen               2^12        -         10.36      4.09       2.53
gen               2^14        -         16.75      6.17       5.85

Page 17

Algorithm Selection for DCT, map, n = 2^16

[Figure: algorithm selection map, n = 65536; k/m versus m/n, marking the fastest algorithm at each point: NIHT (circle), HTP (plus), CSMPSP (square), ThresholdCG (times).]

NIHT dominant near phase transition.

Page 18

Algorithm Selection for DCT, map, n = 2^18

[Figure: algorithm selection map, n = 262144; k/m versus m/n, marking the fastest algorithm at each point: NIHT (circle), HTP (plus), CSMPSP (square), ThresholdCG (times).]

NIHT dominant near phase transition.

Page 19

Algorithm Selection for DCT, map, n = 2^20

[Figure: algorithm selection map, n = 1048576; k/m versus m/n, marking the fastest algorithm at each point: NIHT (circle), HTP (plus), CSMPSP (square), ThresholdCG (times).]

NIHT dominant near phase transition, though HTP nearly as fast.

Page 20

HTP / best time for DCT, n = 2^20

[Figure: ratio of HTP time to the fastest algorithm's time, n = 1048576; k/m versus m/n, colour scale roughly 1 to 3.]

NIHT and HTP have essentially identical average case behaviour.

Page 21

Best time for DCT, n = 2^14

[Figure: time (ms) of the fastest algorithm, n = 16384; k/m versus m/n, roughly 25-45 ms.]

Page 22

Best time for DCT, n = 2^16

[Figure: time (ms) of the fastest algorithm, n = 65536; k/m versus m/n, roughly 25-60 ms.]

Page 23

Best time for DCT, n = 2^18

[Figure: time (ms) of the fastest algorithm, n = 262144; k/m versus m/n, roughly 40-110 ms.]

Page 24

Best time for DCT, n = 2^20

[Figure: time (ms) of the fastest algorithm, n = 1048576; k/m versus m/n, roughly 100-300 ms.]

Page 25

Concentration phenomenon: NIHT for DCT, δ = 0.25

- Logit fit, $\frac{\exp(\beta_0 + \beta_1 k)}{1 + \exp(\beta_0 + \beta_1 k)}$, to data collected from about $10^5$ tests

- $\rho_W^{\mathrm{niht}}(1/4) \approx 0.25967$ (note: $\rho_W(1/4, C) = 0.2674$)

- Transition width proportional to $n^{-1/2}$ (a sketch of such a fit follows)
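As a sketch of the fitting step only (Python/SciPy; the function name and choice of optimizer are mine, not the study's), one can maximize the binomial likelihood of the logistic model and read off the 50% crossing:

```python
import numpy as np
from scipy.optimize import minimize

def fit_logit_transition(k_vals, successes, trials):
    """Fit P(success|k) = exp(b0 + b1 k)/(1 + exp(b0 + b1 k)) by maximum likelihood."""
    k_vals = np.asarray(k_vals, dtype=float)
    s = np.asarray(successes, dtype=float)            # successes observed at each k
    t = np.asarray(trials, dtype=float)               # trials run at each k

    def neg_log_likelihood(beta):
        z = beta[0] + beta[1] * k_vals
        return -np.sum(s * z - t * np.logaddexp(0.0, z))   # binomial log-likelihood

    beta = minimize(neg_log_likelihood, x0=np.array([0.0, -0.01]),
                    method="Nelder-Mead").x
    k_half = -beta[0] / beta[1]                       # k where the fit crosses 50%
    return beta, k_half

# rho_half = k_half / m gives the empirical 50% transition at fixed delta = m/n.
```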

Page 26

Optimal order recovery - matrix completion

- Four defining numbers: r ≤ m ≤ n and p
  • m × n, matrix size; mn is the "Nyquist" sampling rate
  • p, number of inner product measurements
  • r, matrix complexity (rank)

- For what (r, m, n, p) does an encoder/decoder pair recover a suitable approximation of X from (b, A)?
  • p = r(m + n − r) is the optimal oracle rate
  • p ∼ r(m + n − r) is possible using efficient algorithms

- Mixed under/over-sampling rates compared to naive/optimal:
  Undersampling: $\delta := p/(mn)$,   Oversampling: $\rho := r(m + n - r)/p$
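For a hypothetical example of these rates: with m = n = 800 and r = 40, the oracle rate is r(m + n − r) = 40 · 1560 = 62,400 degrees of freedom out of mn = 640,000 entries; observing p = 128,000 entries then gives δ = 0.2 and ρ = 62,400/128,000 ≈ 0.49.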

Page 29

Largest rank recoverable with an efficient algorithm

- Compressed sensing algorithms all behave about the same

- How about matrix completion: do simple methods work well?

- NIHT: alternating projection with a column-subspace stepsize

$X^{j+1} = H_r\big(X^j + \mu_j \mathcal{A}^*(b - \mathcal{A}(X^j))\big)$

with

$\mu_j := \dfrac{\|P_U^j \mathcal{A}^*(b - \mathcal{A}(X^j))\|_F^2}{\|\mathcal{A}\big(P_U^j \mathcal{A}^*(b - \mathcal{A}(X^j))\big)\|_2^2}$

where $P_U^j := U_j U_j^*$. (Column and row projection does not work.)

- Contrast NIHT with nuclear norm minimization via semidefinite programming, and with simple Power Factorization. (A sketch of this NIHT iteration for entry sensing follows.)
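A minimal sketch of the iteration above for entry sensing (Python/NumPy, using a full SVD for the rank-r projection; illustrative only, not the tested implementation):

```python
import numpy as np

def rank_r_projection(Y, r):
    """H_r: nearest rank-r matrix via a truncated SVD; also return its column space."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :], U[:, :r]

def niht_matrix_completion(b, rows, cols, shape, r, iters=500, tol=1e-8):
    """NIHT for entry sensing: observations b = X0[rows, cols]."""
    X = np.zeros(shape)
    X[rows, cols] = b                                 # A*(b)
    X, U = rank_r_projection(X, r)
    for _ in range(iters):
        R = np.zeros(shape)
        R[rows, cols] = b - X[rows, cols]             # A*(residual)
        if np.linalg.norm(R[rows, cols]) <= tol * np.linalg.norm(b):
            break
        G = U @ (U.T @ R)                             # project onto current column space
        denom = np.linalg.norm(G[rows, cols]) ** 2
        mu = np.linalg.norm(G, 'fro') ** 2 / denom if denom > 0 else 1.0
        X, U = rank_r_projection(X + mu * R, r)
    return X
```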

Page 30

Three matrix completion algorithms to compare

- Nuclear norm minimization (the extension of $\ell_1$ in CS)

$\min_X \|X\|_* := \sum_i \sigma_i(X) \quad \text{subject to} \quad \mathcal{A}(X) = b$

- NIHT for matrix completion (how to select $\mu_j$?)

$X^{j+1} = H_r\big(X^j + \mu_j \mathcal{A}^*(b - \mathcal{A}(X^j))\big)$

- Power Factorization (alternating least squares over the factors of $X := RV$)

$\min_{R,V} \|\mathcal{A}(RV) - b\|_2$

Benchmark the algorithms' ability to recover low-rank matrices, and contrast speed and memory requirements. 4.3 CPU years later... (A Power Factorization sketch follows.)
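A corresponding Power Factorization sketch under the same entry-sensing setup (alternating least squares over R and V; again Python/NumPy and illustrative rather than the benchmarked code):

```python
import numpy as np

def power_factorization(b, rows, cols, shape, r, iters=200, tol=1e-8, seed=0):
    """Alternating least squares on X = R @ V for entry observations b = X0[rows, cols]."""
    m, n = shape
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((m, r))
    V = rng.standard_normal((r, n))
    for _ in range(iters):
        for j in range(n):                            # fix R, solve each column of V
            obs = np.flatnonzero(cols == j)
            if obs.size:
                V[:, j] = np.linalg.lstsq(R[rows[obs], :], b[obs], rcond=None)[0]
        for i in range(m):                            # fix V, solve each row of R
            obs = np.flatnonzero(rows == i)
            if obs.size:
                R[i, :] = np.linalg.lstsq(V[:, cols[obs]].T, b[obs], rcond=None)[0]
        X = R @ V
        if np.linalg.norm(b - X[rows, cols]) <= tol * np.linalg.norm(b):
            break
    return R @ V
```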

Page 31

NIHT vs “state of the art”, Gaussian sensing (m = n = 80)

[Figure: recovery phase transition (gamma = 1.000), rho versus p/mn, for Gaussian measurements: NIHT with column projection (0.999), Power Factorization, and Nuclear Norm Minimization.]

- Simple NIHT has nearly optimal recovery ability

- Convex relaxation is consistent with the theory of Hassibi et al.

Page 32

NIHT vs “state of the art”, entry sensing (m = n = 800)

[Figure: recovery phase transition (gamma = 1.000), rho versus p/mn, for entry measurements: NIHT with column projection (0.999), Power Factorization, and Nuclear Norm Minimization.]

- Simple NIHT has nearly optimal recovery ability

- Convex relaxation is slow and has a small recovery region

Page 33

Conclusions

- There are many algorithms for sparse approximation and matrix completion, all proven to achieve optimal-order recovery: $m \ge \mathrm{Const}\cdot k \log(n/m)$ and $p \ge \mathrm{Const}\cdot r(m + n - r)$.

- Empirical testing can suggest conjectures and point us to the "best" methods.

- High-performance computing tools allow testing large numbers of problems, and individual problems quickly: the GPU software solves problems of size $n = 10^6$ in under one second.

Two new findings:

- Near universality of the phase transitions of CS algorithms ($\ell_1$)

- Convexification is less effective for matrix completion; simple methods for rank minimization have a higher phase transition

Page 34

References

- Donoho and Tanner, Observed universality of phase transitions in high-dimensional geometry, with implications for modern data analysis and signal processing, Phil. Trans. Roy. Soc. A (2009).

- Blanchard and Tanner, GPU Accelerated Greedy Algorithms for compressed sensing (2012).

- Tanner and Wei, Normalized iterative hard thresholding for matrix completion (2012).

Thanks for your time
