
Page 1: Optimization for Sparse Solutions, A Tutorial

Optimization for Sparse Solutions, A Tutorial

Wotao Yin (CAAM, Rice U.)

MIIS 2012, Xi’an, China

Page 2: Optimization for Sparse Solutions, A Tutorial

Structure means simplicity

Examples

- signals that are sparse (left figure) or sparse under bases, frames, or transforms

- a set of jointly sparse signals

- low-rank matrix (right figure); low rank + sparse; low rank + jointly sparse

- low-dimensional manifold

More structures will be discovered and used

Page 3: Optimization for Sparse Solutions, A Tutorial

Why structured solutions?

- cheap to store / transmit

- more meaningful, easy to understand

- easy to use

- more robust to errors

- need fewer data to begin with

- easy to find by optimization (both speed and storage)

Page 4: Optimization for Sparse Solutions, A Tutorial

ℓ1 gives sparse solutions

ℓ1 minimization has a long history in data processing (geophysics: Claerbout and Muir'73, Santosa and Symes'86), image processing (Rudin, Osher, and Fatemi'92), and statistics (Chen, Donoho, and Saunders'98; Tibshirani'96).

I learned from S. Osher that Galileo Galilei (1564–1642) used the median filter, which for a given vector y solves

min_x ‖x − y‖_1.

Heuristically, ℓ1 is known as a good convex approximation (the convex envelope) of ℓ0, and ℓ1 tends to give sparse solutions. This is so widely known that many people simply try it without bothering to check the theory.
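For a scalar unknown, the minimizer of ‖x − y‖_1 is exactly the median of the entries of y. A minimal numpy check of this fact (an illustration added here, not part of the original slides):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=101)                 # odd length, so the median is a data point

# The objective sum_i |x - y_i| is piecewise linear in x, so a minimizer
# is attained at one of the data points; evaluate it at all of them.
obj = np.abs(y[:, None] - y[None, :]).sum(axis=1)
x_l1 = y[np.argmin(obj)]

print(x_l1 == np.median(y))              # True: the l1 minimizer is the median
```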

Page 5: Optimization for Sparse Solutions, A Tutorial

ℓ1 gives sparse solutions

Goal: to recover sparse vector x0 from b = Ax0

- Support S := supp(x0)

- Zero set Z := S^c

- Sparsity k = ‖x0‖_0 = |supp(x0)|.

For x0 to be the solution of

min_x ‖x‖_1, subject to Ax = b,

we need

‖x0 + v‖_1 ≥ ‖x0‖_1, ∀ v ∈ N(A).
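As a concrete companion to this slide (not from the original), the problem min ‖x‖_1 s.t. Ax = b can be posed as a linear program by splitting x into nonnegative parts; a small scipy sketch with a random Gaussian A:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
m, n, k = 40, 128, 5
A = rng.normal(size=(m, n)) / np.sqrt(m)          # random Gaussian sensing matrix
x0 = np.zeros(n)
x0[rng.choice(n, k, replace=False)] = rng.normal(size=k)
b = A @ x0

# Write x = xp - xm with xp, xm >= 0, so that ||x||_1 = 1'(xp + xm).
c = np.ones(2 * n)
A_eq = np.hstack([A, -A])
res = linprog(c, A_eq=A_eq, b_eq=b, bounds=[(0, None)] * (2 * n), method="highs")
x_hat = res.x[:n] - res.x[n:]

print(np.linalg.norm(x_hat - x0))                 # near zero: x0 is l1-recoverable here
```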

Page 6: Optimization for Sparse Solutions, A Tutorial

[Figure: left, x0 is recovered (good); right, x0 is not recovered (bad)]

When the diamond-shaped ℓ1-ball touches the sparse x0, we need the affine space {x : Ax = b} = {x0 + v : v ∈ N(A)} to not cut through the ℓ1-ball.

Page 7: Optimization for Sparse Solutions, A Tutorial

‖x0 + v‖_1 = ‖x0_S + v_S‖_1 + ‖v_Z‖_1
           ≥ ‖x0_S‖_1 + ‖v_Z‖_1 − ‖v_S‖_1
           = ‖x0‖_1 + ‖v‖_1 − 2‖v_S‖_1
           ≥ ‖x0‖_1 + ‖v‖_1 − 2√k ‖v‖_2.

Therefore, x0 is ℓ1-recoverable provided

‖v‖_1 / ‖v‖_2 ≥ 2√k, ∀ v ∈ N(A), v ≠ 0.

In general, we only have

1 ≤ ‖v‖_1 / ‖v‖_2 ≤ √n.

Page 8: Optimization for Sparse Solutions, A Tutorial

However, ‖v‖_1 / ‖v‖_2 is significantly greater than 1 if V = N(A) is random.

Kashin'77 and Garnaev and Gluskin'84 show that a randomly drawn (n − m)-dimensional subspace V of R^n satisfies

‖v‖_1 / ‖v‖_2 ≥ c1 √m / √(1 + log(n/m)), ∀ v ∈ V, v ≠ 0,

with probability at least 1 − exp(−c0(n − m)), where c0, c1 > 0 are independent of m and n.

Hence, x0 is ℓ1-recoverable with high probability if A is random and

k ≤ O(m / log(n/m)) or, equivalently, m ≥ O(k log(n/k)).

The number of measurements m is a multiple of k log(n/k), typically much less than n.

Page 9: Optimization for Sparse Solutions, A Tutorial

Other Analyses

- Gorodnitsky, Rao'97: Spark – the smallest number of columns of A that are linearly dependent.

- Donoho, Elad'03: Coherence – the smallest angle between any two columns of A.

- Donoho, Tanner'06: analysis similar to the above

- Candes, Romberg, and Tao'04 '06: Restricted Isometry Property (RIP) – any 2k-column submatrix of A must be almost orthogonal.

- RIP ⇒ ℓ1 recovery: the above, Cohen-Dahmen-DeVore, Foucart-Lai, Cai-Wang-Xu, Li-Mo, ...

- More ...

Page 10: Optimization for Sparse Solutions, A Tutorial

Applications

See Rice Compressed Sensing Repository: http://dsp.rice.edu/cs

- Signal sensing and processing, image processing

- Medical imaging (faster imaging, less radiation, multi-modality)

- (Wireless) communications (joint sparsity, channel estimation)

- Multi/hyperspectral imaging (spectral + spatial structure, data to information)

- Statistics, machine learning (feature selection; LASSO, logistic regression, SVM)

- Geophysical (seismic) data analysis

- Face recognition, video processing (robust PCA: common structure + sparse difference)

- Distributed computation / cloud computing (terabytes of data are spread out, centralized computing is infeasible, sparse recovery at minimal communication cost?)

Many applications require nonnegativity, structured (model) sparsity, complex data.

Page 11: Optimization for Sparse Solutions, A Tutorial

Optimization/Algorithmic Techniques Used, Part 1

- First-order: (sub)gradient descent, gradient projection, continuation, coordinate descent, prox-point, conjugate gradient, forward-backward splitting, accelerated first-order methods

- Higher-order: interior-point, semismooth Newton, hybrid 1st and 2nd order, BFGS, L-BFGS

- Final debiasing

- Duality: augmented Lagrangian, alternating direction, Bregman (original, linearized, split), primal-dual methods. They are applied to both primal and dual formulations.

- Homotopy: parametric quadratic programming

Page 12: Optimization for Sparse Solutions, A Tutorial

Optimization/Algorithmic Techniques Used, Part 2

- Greedy algorithms (OMP family), support detection

- Non-convex approximations to ℓ0, ℓq minimization, reweighted iterations

- Stochastic approximation (gradient, Hessian), sample-average approximation

- Distributed reconstruction

- Parallel and GPU-friendly algorithms

- Numerical linear algebra

- Heuristics. Solution structure may be domain-specific.

Sources: Optimization Online and the Rice CS Repository. We pick a few to discuss...

Page 13: Optimization for Sparse Solutions, A Tutorial

Forward-backward Splitting / Prox-Linear

To solve

min_x µJ(x) + F(x),

iterate

x^{k+1} ← argmin_x µJ(x) + 〈∇F(x^k), x〉 + (1/(2δ_k)) ‖x − x^k‖_2^2.

δ_k: step size. The iteration can be obtained from forward-backward operator splitting.

The subproblem is simple for many choices of J(x): ℓ1, TV, ‖·‖_*, ‖·‖_{2,1}, etc.

Properties:

- If δ_k < 2/L (L: Lipschitz constant of ∇F), the iteration is non-expansive in ‖x^k − x^*‖; a fixed point is a solution

- Adaptive δ_k: accelerates convergence; Barzilai-Borwein steps and nonmonotone line search give sufficient descent

Page 14: Optimization for Sparse Solutions, A Tutorial

Sparse Vector Recovery

For J(x) = ‖x‖_1 and given ∇F(x^k), the subproblem is O(n) and parallel. It is called shrinkage or soft-thresholding.

In compressed sensing, ∇F(x^k) = A^T(Ax^k − b) is often cheap (e.g., DFT, discrete cosine transform, chirp or convolution/circulant based sensing). Codes: SpaRSA, FPC, SPGL1

∇F(x^k) can be demanding in logistic regression and when dealing with non-Gaussian noise. Code: LPS
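A minimal prox-gradient (ISTA-style) sketch of these two steps for min_x µ‖x‖_1 + (1/2)‖Ax − b‖_2^2, written from the formulas above rather than taken from any of the codes named; the fixed step 1/L uses L = ‖A‖_2^2:

```python
import numpy as np

def shrink(z, t):
    """Soft-thresholding: the componentwise prox of t*||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista(A, b, mu, iters=500):
    """Prox-gradient iterations for  min_x  mu*||x||_1 + 0.5*||Ax - b||^2."""
    L = np.linalg.norm(A, 2) ** 2                  # Lipschitz constant of A'(Ax - b)
    delta = 1.0 / L                                # fixed step size, below 2/L
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ x - b)                   # forward (gradient) step ...
        x = shrink(x - delta * grad, delta * mu)   # ... then backward (prox/shrinkage) step
    return x
```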

Page 15: Optimization for Sparse Solutions, A Tutorial

Low-Rank Matrix Recovery

For low-rank matrix recovery, J(X) = ‖X‖_* := ∑_i σ_i(X). The subproblem is

min_X µ‖X‖_* + (1/2)‖X − Y^k‖_F^2

Since both terms are unitarily invariant, it is solved by an SVD followed by shrinking the singular values.

Codes: FPCA, SVT (linearized Bregman). Apply an approximate SVD or compute a partial SVD.
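A sketch of this singular-value shrinkage step with a full SVD for clarity (the codes above use approximate or partial SVDs instead); an illustration, not the authors' implementation:

```python
import numpy as np

def svd_shrink(Y, mu):
    """Solve  min_X  mu*||X||_* + 0.5*||X - Y||_F^2  by shrinking singular values."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    s_shrunk = np.maximum(s - mu, 0.0)             # soft-threshold the singular values
    return (U * s_shrunk) @ Vt                     # rebuild the (now lower-rank) matrix
```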

Lee et al’10 uses the same framework with the max-norm.

Page 16: Optimization for Sparse Solutions, A Tutorial

Application to Transform-ℓ1 and Total Variation

If J(x) = ‖Lx‖_1 and L is non-singular or even unitary (e.g., orthogonal wavelets), the subproblem has closed-form solutions.

If J(x) = ‖∇x‖_1 (total variation, or TV), there are graph-cut (max-flow/min-cut) algorithms to solve the subproblems.

If J(x) = ‖Lx‖_1 and L is non-invertible, the subproblem is less trivial. Solutions: operator splitting, smoothing (e.g., the Huber norm replacing ℓ1).

Page 17: Optimization for Sparse Solutions, A Tutorial

Prox-Linear Theory and Practice

Forward-backward splitting theory applies. (Combettes and Wajs’05)

Convergence for convex J(x) is quite standard; it also holds with line search and other variants (e.g., Wright, Figueiredo, Nowak'08).

Can extend to prox-regular functions (Lewis and Wright'08) and hybrid 1st/2nd-order iterations (Wen, Yin, Zhang, Goldfarb'11).

Convergence speed depends on the solution structure, controlled by µ:

- Larger µ gives a faster and more structured solution. Reason for ℓ1: a smaller optimal support is easier to identify, and afterward the iteration essentially runs on that smaller support with a better condition number

- Smaller µ gives a slower and less structured solution.

Continuation: start from a large µ and decrease µ sequentially, using the previous solution to warm-start the current iteration (e.g., code FPC).
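A sketch of this continuation loop with a prox-gradient inner solver; the geometric schedule µ/4 and the starting value 0.9‖A^T b‖_∞ are illustrative choices, not FPC's actual rules:

```python
import numpy as np

def shrink(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def continuation(A, b, mu_final, iters_per_stage=100):
    """Decrease mu geometrically and warm-start each stage from the previous solution."""
    delta = 1.0 / np.linalg.norm(A, 2) ** 2
    mu = 0.9 * np.linalg.norm(A.T @ b, np.inf)        # large mu: very sparse, easy start
    x = np.zeros(A.shape[1])
    while True:
        for _ in range(iters_per_stage):              # a few prox-gradient steps per stage
            x = shrink(x - delta * (A.T @ (A @ x - b)), delta * mu)
        if mu <= mu_final:
            return x
        mu = max(mu / 4.0, mu_final)                  # shrink mu toward the target value
```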

Page 18: Optimization for Sparse Solutions, A Tutorial

Techniques Applied with Prox-Linear

- Two-step descents (TwIST)

- Barzilai-Borwein steps, non-monotone line search (Wen et al.'09)

- (Block) coordinate descent: apply descent only to a subset of components (Tseng and Yun'09, Li and Osher'09, Wright'11)

- Active set: estimate the optimal support and solve a reduced subproblem (Shi et al.'08, Wen et al.'09)

- Use (approximate) second-order information, e.g. in logistic regression (Byrd et al.'10; Shi et al.'08)

- Accelerated first-order methods: given a convex function with L-Lipschitz gradients and no other structure, the iteration count is reduced from O(L/ε) to O(L/√ε) for an ε-optimal objective value (a FISTA-style sketch follows below)

  - min F(x): Nesterov'83;
  - min J(x) + F(x): Nesterov'04, Beck and Teboulle'08, Tseng'08;
  - ALM: Ma, Scheinberg, Goldfarb'10
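A minimal accelerated prox-gradient sketch in the Beck-Teboulle (FISTA) style for µ‖x‖_1 + (1/2)‖Ax − b‖_2^2; an illustration of the acceleration idea, not any of the cited authors' code:

```python
import numpy as np

def shrink(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def fista(A, b, mu, iters=500):
    """Accelerated prox-gradient for  min_x  mu*||x||_1 + 0.5*||Ax - b||^2."""
    L = np.linalg.norm(A, 2) ** 2                      # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    z = x.copy()                                       # extrapolated point
    t = 1.0
    for _ in range(iters):
        x_new = shrink(z - (A.T @ (A @ z - b)) / L, mu / L)   # prox-gradient step at z
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        z = x_new + ((t - 1.0) / t_new) * (x_new - x)  # Nesterov-style extrapolation
        x, t = x_new, t_new
    return x
```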

Page 19: Optimization for Sparse Solutions, A Tutorial

Variable Splitting

Consider a linear operator L and

min_x J(Lx) + F(x)

Rewrite it as

min_{x,y} J(y) + F(x), s.t. y = Lx,

which "separates" J(·) from F(·).

Augmented Lagrangian: generates x^k, y^k along with multipliers λ^k:

(x^{k+1}, y^{k+1}) ← argmin_{x,y} L(x, y; λ^k) = J(y) + F(x) + 〈λ^k, Lx − y〉 + (c/2)‖Lx − y‖_2^2,

λ^{k+1} ← λ^k + c(Lx^{k+1} − y^{k+1}).

Convergence needs bounded λk > 0 if J and F are convex.

The joint minimization subproblem is still hard to solve.

Page 20: Optimization for Sparse Solutions, A Tutorial

Variable Splitting

Consider a linear operator L and

min_x J(Lx) + F(x)

Introduce

min_{x,y} J(y) + F(x), s.t. y = Lx,

which "separates" J(·) from F(·).

Alternating Direction Method (ADM): alternate the x^k and y^k updates

x^{k+1} ← argmin_x L(x, y^k; λ^k),

y^{k+1} ← argmin_y L(x^{k+1}, y; λ^k),

λ^{k+1} ← λ^k + δc(Lx^{k+1} − y^{k+1}).

Converges if δ ∈ (0, (√5 + 1)/2). The x- and y-subproblems are decoupled.
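A minimal ADM sketch for µ‖x‖_1 + (1/2)‖Ax − b‖_2^2 with the simple splitting y = x (i.e., L = I), written directly from the updates above; it is an illustration, not one of the codes named on the later slides:

```python
import numpy as np

def shrink(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def adm_l1(A, b, mu, c=1.0, iters=300):
    """ADM for  min  mu*||x||_1 + 0.5*||Ax - b||^2,  split as y = x."""
    m, n = A.shape
    x, y, lam = np.zeros(n), np.zeros(n), np.zeros(n)
    Atb = A.T @ b
    chol = np.linalg.cholesky(A.T @ A + c * np.eye(n))           # factor (A'A + cI) once
    for _ in range(iters):
        rhs = Atb + c * y - lam
        x = np.linalg.solve(chol.T, np.linalg.solve(chol, rhs))  # x-subproblem: linear system
        y = shrink(x + lam / c, mu / c)                          # y-subproblem: shrinkage
        lam = lam + c * (x - y)                                  # multiplier update (delta = 1)
    return y
```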

Page 21: Optimization for Sparse Solutions, A Tutorial

Variable Splitting

If a subproblem, say the y-subproblem, is still expensive to solve, one can consider

Gradient descent:

y^{k+1} ← y^k − τ^k ∇_y L(x^{k+1}, y^k; λ^k).

Proximal descent:

y^{k+1} ← argmin_y J(y) + 〈∇_y(〈λ^k, Lx^{k+1} − y〉 + (c/2)‖Lx^{k+1} − y‖_2^2), y − y^k〉 + (1/(2τ)) ‖y − y^k‖_2^2.

Page 22: Optimization for Sparse Solutions, A Tutorial

Alternating Direction Method, History

Yin Zhang: “the idea of ADM goes back to Sun-Tze (400 BC) and Caesar (100 BC):

Back in the 1950s–70s, it appeared as operator splitting for PDEs, studied by Douglas, Peaceman, and Rachford, and then Glowinski et al.'81–89 and Gabay'83.

Subsequent studies are in the context of variational inequalities (Eckstein and Bertsekas'92, He et al.'02).

Extensions to multiple blocks (He, Yuan, and collaborators)

Page 23: Optimization for Sparse Solutions, A Tutorial

Alternating Direction Method, Applications

For sparse optimization, one subproblem is shrinkage or its variant.

The other subproblem is smooth, very often a linear system with (A^T A + c_k L^T L).

For TV(·) = ‖∇·‖_1, the splitting separates L = ∇ from ‖·‖_1 (Wang, Yin, and Zhang'08). The linear system can be diagonalized by the DFT for various A (e.g., I and convolution) and suitable boundary conditions.

Codes: FTVd, Split Bregman, YALL1, RecPF, SALSA.

Page 24: Optimization for Sparse Solutions, A Tutorial

Variable Splitting is Versatile

Variable splitting lets one apply ADM to many extensions of ℓ1: transform-ℓ1, frame-ℓ1, constrained/penalized, weighted, nonnegative, isotropic/anisotropic TV, complex, group ℓ2,1, nuclear norm, and even some non-convex models.

For example, the codes YALL1 and YALL1-Group solve such models and their group/joint sparse versions.

Page 25: Optimization for Sparse Solutions, A Tutorial

Related: the Bregman Methods (Osher et al’06, Yin et al’08)

Bregman distance: D(u, u^k) := J(u) − (J(u^k) + 〈p^k, u − u^k〉), p^k ∈ ∂J(u^k).

For min_x J(x) subject to Ax = b, the Bregman algorithm updates x^k and p^k:

x^{k+1} ← argmin_x µ[J(x) − (J(x^k) + 〈p^k, x − x^k〉)] + (1/2)‖Ax − b‖_2^2,

p^{k+1} ← p^k + A^T(b − Ax^{k+1}).

Relation to augmented Lagrangian: p^k = A^T λ^k. Split Bregman (applying Bregman to the splitting formulation) is ADM.

Bregman is not novel, but it leads to new observations:

- Successively linearizing J(·) gives rise to Ax = b in the limit

- Better denoising; error forgetting

Page 26: Optimization for Sparse Solutions, A Tutorial

Example: to recover x0 from noisy measurements b = Ax0 + ω

1. ℓ1 denoising: min_x µ‖x‖_1 + (1/2)‖Ax − b‖_2^2

[Figure: true signal vs. BPDN recovery, for µ = 48.5 (left) and µ = 49 (right)]

2. ℓ1 Bregman iteration

[Figure: true signal vs. Bregman recovery, µ = 150, 5 iterations]

Better denoising results are also observed on image reconstruction.

Page 27: Optimization for Sparse Solutions, A Tutorial

The Add-Back Update, Error Forgetting

After a change of variable, the Bregman update has an add-back form:

x^{k+1} ← argmin_x µJ(x) + (1/2)‖Ax − b^k‖_2^2,

b^{k+1} ← b + (b^k − Ax^{k+1}).

- The subproblem has the same ℓ1 + ℓ2^2 form

- Subproblem solution errors are iteratively cancelled

Simulation: subproblems are solved at low accuracy 10^-6 by ISTA, FPC-BB, GPSR, and SpaRSA, yet the relative errors drop to machine precision.

[Figure: relative error vs. Bregman iteration for the subproblem solvers ISTA, FPC-BB, GPSR, SpaRSA, and Mosek]
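A sketch of the add-back iteration above, with a plain prox-gradient loop standing in for the (inexact) subproblem solvers named in the simulation; an illustration, not the compared codes:

```python
import numpy as np

def shrink(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def bregman_addback(A, b, mu, outer=10, inner=200):
    """Bregman iteration in add-back form: every subproblem is the same l1 + l2^2 model."""
    delta = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    bk = b.copy()
    for _ in range(outer):
        for _ in range(inner):                     # inexact subproblem solve (prox-gradient)
            x = shrink(x - delta * (A.T @ (A @ x - bk)), delta * mu)
        bk = b + (bk - A @ x)                      # add the residual back into the data
    return x
```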

Page 28: Optimization for Sparse Solutions, A Tutorial

Non-Convex ℓp Minimization, p < 1

Pioneered by Rao et al.'97; popularized by Chartrand et al. for CS

[Figure: ℓ1-ball (left) and ℓ1/2-ball (right)]

However, there are multiple local minima for min_x ‖x‖_p^p, s.t. Ax = b.

[Figure: ‖x‖_p^p colorized over the affine set {x0 + v : v ∈ N(A)}]

Page 29: Optimization for Sparse Solutions, A Tutorial

Non-Convex Minimization

Chartrand and Yin'08, Daubechies et al.'09, and Lai-Wang'10 smooth out the "small" local minima by replacing ‖x‖_p^p by

∑_{i=1}^n x_i^2 / (x_i^2 + ε)^{1−p/2}.

[Figure: sharp ℓ1/2-ball (left) and smoothed ℓ1/2-ball (right)]

Works better than ℓ1 on fast-decaying signals (nonzero entries have a fast-decaying distribution)
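A sketch of this smoothing in the iteratively-reweighted-least-squares style used by these papers, keeping the constraint Ax = b; the ε schedule mirrors the continuation shown on the next slides, but the exact values here are illustrative:

```python
import numpy as np

def irls_lp(A, b, p=0.5, eps=1.0, outer=15, inner=5):
    """IRLS: repeatedly minimize sum_i w_i x_i^2 s.t. Ax = b, w_i = (x_i^2 + eps)^(p/2 - 1)."""
    x = np.linalg.lstsq(A, b, rcond=None)[0]          # least-squares starting point
    for _ in range(outer):
        for _ in range(inner):
            w = (x ** 2 + eps) ** (p / 2.0 - 1.0)     # weights from the current iterate
            Winv = 1.0 / w
            # Weighted least-norm solution of Ax = b:  x = W^{-1} A' (A W^{-1} A')^{-1} b
            x = Winv * (A.T @ np.linalg.solve((A * Winv) @ A.T, b))
        eps = max(eps / 10.0, 1e-8)                   # continuation: shrink the smoothing
    return x
```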

Page 30: Optimization for Sparse Solutions, A Tutorial

For fast-decaying sparse signals, smoothing + continuation help avoid local minima.

Red: sparse solution; Green: smoothed ℓp solutions

ε = 1.0

Courtesy of Rick Chartrand

Page 31: Optimization for Sparse Solutions, A Tutorial

For fast-decaying sparse signals, smoothing + continuation help avoid local minima.

Red: sparse solution; Green: smoothed ℓp solutions

ε = 0.1

Courtesy of Rick Chartrand

Page 32: Optimization for Sparse Solutions, A Tutorial

For fast-decaying sparse signals, smoothing + continuation help avoid local minima.

Red: sparse solution; Green: smoothed ℓp solutions

ε = 0.01

Courtesy of Rick Chartrand

Page 33: Optimization for Sparse Solutions, A Tutorial

For fast-decaying sparse signals, smoothing + continuation help avoid local minima.

Red: sparse solution; Green: smoothed ℓp solutions

ε = 0.001

Courtesy of Rick Chartrand

Page 34: Optimization for Sparse Solutions, A Tutorial

Other non-convex techniques

- Candes, Wakin, and Boyd'08 use ∑_{i=1}^n |x_i| / (|x_i| + ε)^{1−p} (a reweighting sketch follows below).

- Wang and Yin'10 use ∑_i w_i |x_i| with binary w_i. Code: Threshold-ISD

- Mohimani, Babaie-Zadeh, and Jutten use other approximations. Code: SL0

- Extensions to joint sparse vectors and low-rank matrix recovery ...

Nearly all need fast-decaying signals to recover significantly better than ℓ1.
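A sketch of the reweighted-ℓ1 idea in the spirit of Candes, Wakin, and Boyd'08, reusing the LP formulation from the earlier basis pursuit sketch; the weight update 1/(|x_i| + ε) is the commonly used choice, and the parameters are illustrative:

```python
import numpy as np
from scipy.optimize import linprog

def reweighted_l1(A, b, iters=5, eps=0.1):
    """Repeatedly solve a weighted l1 problem; small entries get large weights next round."""
    m, n = A.shape
    w = np.ones(n)
    A_eq = np.hstack([A, -A])                          # x = xp - xm with xp, xm >= 0
    for _ in range(iters):
        c = np.concatenate([w, w])                     # weighted l1 objective
        res = linprog(c, A_eq=A_eq, b_eq=b, bounds=[(0, None)] * (2 * n), method="highs")
        x = res.x[:n] - res.x[n:]
        w = 1.0 / (np.abs(x) + eps)                    # reweight toward the support of x
    return x
```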

Page 35: Optimization for Sparse Solutions, A Tutorial

Matrix Completion

(Candes and Recht'09; Recht, Fazel, and Parrilo'10) A matrix M can be exactly recovered from enough yet few random subsamples with overwhelming probability, provided

- Low rank

- Row and column subspaces are incoherent

Motivating example: low rank due to only a few factors contributing to users' preferences

Page 36: Optimization for Sparse Solutions, A Tutorial

Matrix Completion Model

Given samples in Ω, recover X by

min_X µJ(X) + (1/2)‖X_Ω − M_Ω‖_F^2

Choices of J(X ):

- rank(X): computationally intractable

- ‖X‖_*: the convex envelope of rank(X)

- ‖X‖_max: another convex regularizer of rank(X)

- Nonconvex ℓp semi-norm of the singular values of X

Codes: SVT, FPCA, APGL, ...

Also, trace((X^T X)^{p/2}): equals ‖X‖_* when p = 1; it is non-convex when p < 1 and is the ℓp-norm of the singular values

Page 37: Optimization for Sparse Solutions, A Tutorial

On terabyte-scale data, computing SVDs becomes almost prohibitively expensive.

The code LMaFit avoids SVDs by exploiting the low-rank factorization M ≈ XY and solving

min_{X,Y} ‖P_Ω(XY − M)‖_F

Given an approximate rank r, X is m-by-r and Y is r-by-n.
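A plain alternating-least-squares sketch of this factorization model (LMaFit itself uses a more refined nonlinear SOR-type scheme; this illustration also assumes every row and column has at least one observed entry):

```python
import numpy as np

def als_completion(M, mask, r, iters=50, seed=0):
    """Alternately fit X and Y so that X @ Y matches M on the observed entries.

    M    : m-by-n array of observed values (unobserved entries are ignored)
    mask : boolean m-by-n array, True where M is observed
    r    : assumed rank of the factorization
    """
    m, n = M.shape
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(m, r))
    Y = rng.normal(size=(r, n))
    for _ in range(iters):
        for i in range(m):                 # update each row of X from its observed entries
            cols = mask[i]
            X[i] = np.linalg.lstsq(Y[:, cols].T, M[i, cols], rcond=None)[0]
        for j in range(n):                 # update each column of Y from its observed entries
            rows = mask[:, j]
            Y[:, j] = np.linalg.lstsq(X[rows], M[rows, j], rcond=None)[0]
    return X, Y
```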

Drawbacks:

- non-convex, no theoretical guarantees known yet

- needs a rank estimate or a dynamic rank update

Advantages:

- cheap (alternating minimization)

- does not store any full matrix

Other models and codes:

- OptSpace (descent on the Grassmannian manifold)

- Low-rank + Sparse, splitting methods: Yuan, Yang'09, Lin et al.'10, etc.

Page 38: Optimization for Sparse Solutions, A Tutorial

To finish up

Some observations:

- The work of finding structured solutions is interdisciplinary

- Recognizing the structure is important

- It is a common ground for existing and novel algorithms

- This field grows quickly, and its tools are becoming standard techniques in computational mathematics and engineering

Benefits of structured solutions:

- cheap to store / transmit

- more meaningful, easy to understand

- easy to use

- robust to errors

- easy to find

Acknowledgements: R. Chartrand for some nice figures, and Yangyang Xu for some slides.