
Page 1: Optimization for Sparse Solutions, A Tutorial

Optimization for Sparse Solutions, A Tutorial

Wotao Yin (CAAM, Rice U.)

MIIS 2012, Xi’an, China

Page 2: Optimization for Sparse Solutions, A Tutorial

Structure means simplicity

Examples

- signals that are sparse (left figure) or sparse under bases, frames, or transforms

- a set of jointly sparse signals

- low-rank matrix (right figure); low rank + sparse; low rank + jointly sparse

- low-dimensional manifold

More structures will be discovered and used

Page 3: Optimization for Sparse Solutions, A Tutorial

Why structured solutions?

- cheap to store / transmit

- more meaningful, easy to understand

- easy to use

- more robust to errors

- need fewer data to begin with

- easy to find by optimization (both speed and storage)

Page 4: Optimization for Sparse Solutions, A Tutorial

ℓ1 gives sparse solutions

ℓ1 minimization has a long history in data processing (geophysics: Claerbout and Muir'73, Santosa and Symes'86), image processing (Rudin, Osher, and Fatemi'92), and statistics (Chen, Donoho, and Saunders'98; Tibshirani'96).

I learned from S. Osher that Galileo Galilei (1564–1642) used the median filter, which for a given vector y solves

min_x ‖x − y‖_1.

Heuristically, ℓ1 is known as a good convex approximation (the convex envelope) of ℓ0, and ℓ1 tends to give sparse solutions. This is so widely known that many people simply try it without bothering to check the theory.
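For a scalar unknown, the minimizer of ‖x − y‖_1 is exactly the median of the entries of y. A minimal numpy check of this fact (an illustration added here, not part of the original slides):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=101)                 # odd length, so the median is a data point

# The objective sum_i |x - y_i| is piecewise linear in x, so a minimizer
# is attained at one of the data points; evaluate it at all of them.
obj = np.abs(y[:, None] - y[None, :]).sum(axis=1)
x_l1 = y[np.argmin(obj)]

print(x_l1 == np.median(y))              # True: the l1 minimizer is the median
```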

Page 5: Optimization for Sparse Solutions, A Tutorial

ℓ1 gives sparse solutions

Goal: to recover sparse vector x0 from b = Ax0

- Support S := supp(x0)

- Zero set Z := S^c

- Sparsity k = ‖x0‖_0 = |supp(x0)|.

For x0 to be the solution of

min_x ‖x‖_1, subject to Ax = b,

we need

‖x0 + v‖_1 ≥ ‖x0‖_1, ∀ v ∈ N(A).
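As a concrete companion to this slide (not from the original), the problem min ‖x‖_1 s.t. Ax = b can be posed as a linear program by splitting x into nonnegative parts; a small scipy sketch with a random Gaussian A:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
m, n, k = 40, 128, 5
A = rng.normal(size=(m, n)) / np.sqrt(m)          # random Gaussian sensing matrix
x0 = np.zeros(n)
x0[rng.choice(n, k, replace=False)] = rng.normal(size=k)
b = A @ x0

# Write x = xp - xm with xp, xm >= 0, so that ||x||_1 = 1'(xp + xm).
c = np.ones(2 * n)
A_eq = np.hstack([A, -A])
res = linprog(c, A_eq=A_eq, b_eq=b, bounds=[(0, None)] * (2 * n), method="highs")
x_hat = res.x[:n] - res.x[n:]

print(np.linalg.norm(x_hat - x0))                 # near zero: x0 is l1-recoverable here
```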

Page 6: Optimization for Sparse Solutions, A Tutorial

[Figure: left, x0 is recovered (good); right, x0 is not recovered (bad)]

When the diamond-shaped ℓ1-ball touches the sparse x0, we need the affine space {x : Ax = b} = {x0 + v : v ∈ N(A)} to not cut through the ℓ1-ball.

Page 7: Optimization for Sparse Solutions, A Tutorial

‖x0 + v‖_1 = ‖x0_S + v_S‖_1 + ‖v_Z‖_1
           ≥ ‖x0_S‖_1 + ‖v_Z‖_1 − ‖v_S‖_1
           = ‖x0‖_1 + ‖v‖_1 − 2‖v_S‖_1
           ≥ ‖x0‖_1 + ‖v‖_1 − 2√k ‖v‖_2.

Therefore, x0 is ℓ1-recoverable provided

‖v‖_1 / ‖v‖_2 ≥ 2√k, ∀ v ∈ N(A), v ≠ 0.

In general, we only have

1 ≤ ‖v‖_1 / ‖v‖_2 ≤ √n.

Page 8: Optimization for Sparse Solutions, A Tutorial

However, ‖v‖_1 / ‖v‖_2 is significantly greater than 1 if V = N(A) is random.

Kashin'77 and Garnaev and Gluskin'84 show that a randomly drawn (n − m)-dimensional subspace V of R^n satisfies

‖v‖_1 / ‖v‖_2 ≥ c1 √m / √(1 + log(n/m)), ∀ v ∈ V, v ≠ 0,

with probability at least 1 − exp(−c0(n − m)), where c0, c1 > 0 are independent of m and n.

Hence, x0 is ℓ1-recoverable with high probability if A is random and

k ≤ O(m / log(n/m)) or, equivalently, m ≥ O(k log(n/k)).

The number of measurements m is a multiple of k log(n/k), typically much less than n.

Page 9: Optimization for Sparse Solutions, A Tutorial

Other Analyses

- Gorodnitsky, Rao'97: Spark – the smallest number of columns of A that are linearly dependent.

- Donoho, Elad'03: Coherence – the smallest angle between any two columns of A.

- Donoho, Tanner'06: analysis similar to the above

- Candes, Romberg, and Tao'04 '06: Restricted Isometry Property (RIP) – any 2k-column submatrix of A must be almost orthogonal.

- RIP ⇒ ℓ1 recovery: the above, Cohen-Dahmen-DeVore, Foucart-Lai, Cai-Wang-Xu, Li-Mo, ...

- More ...

Page 10: Optimization for Sparse Solutions, A Tutorial

Applications

See Rice Compressed Sensing Repository: http://dsp.rice.edu/cs

- Signal sensing and processing, image processing

- Medical imaging (faster imaging, less radiation, multi-modality)

- (Wireless) communications (joint sparsity, channel estimation)

- Multi/hyperspectral imaging (spectral + spatial structure, data to information)

- Statistics, machine learning (feature selection; LASSO, logistic regression, SVM)

- Geophysical (seismic) data analysis

- Face recognition, video processing (robust PCA: common structure + sparse difference)

- Distributed computation / cloud computing (terabytes of data are spread out, centralized computing is infeasible, sparse recovery at minimal communication cost?)

Many applications require nonnegativity, structured (model) sparsity, complex data.

Page 11: Optimization for Sparse Solutions, A Tutorial

Optimization/Algorithmic Techniques Used, Part 1

- First-order: (sub)gradient descent, gradient projection, continuation, coordinate descent, prox-point, conjugate gradient, forward-backward splitting, accelerated first-order methods

- Higher-order: interior-point, semismooth Newton, hybrid 1st and 2nd order, BFGS, L-BFGS

- Final debiasing

- Duality: augmented Lagrangian, alternating direction, Bregman (original, linearized, split), primal-dual methods. They are applied to both primal and dual formulations.

- Homotopy: parametric quadratic programming

Page 12: Optimization for Sparse Solutions, A Tutorial

Optimization/Algorithmic Techniques Used, Part 2

- Greedy algorithms (OMP family), support detection

- Non-convex approximations to ℓ0, ℓq minimization, reweighted iterations

- Stochastic approximation (gradient, Hessian), sample-average approximation

- Distributed reconstruction

- Parallel and GPU-friendly algorithms

- Numerical linear algebra

- Heuristics. Solution structure may be domain-specific.

Sources: Optimization Online and the Rice CS Repository. We pick a few to discuss...

Page 13: Optimization for Sparse Solutions, A Tutorial

Forward-backward Splitting / Prox-Linear

To solve

min_x µJ(x) + F(x),

iterate

x^{k+1} ← argmin_x µJ(x) + 〈∇F(x^k), x〉 + (1/(2δ_k)) ‖x − x^k‖_2^2.

δ_k: step size. The iteration can be obtained from forward-backward operator splitting.

The subproblem is simple for many choices of J(x): ℓ1, TV, ‖·‖_*, ‖·‖_{2,1}, etc.

Properties:

- If δ_k < 2/L (L: Lipschitz constant of ∇F), the iteration is non-expansive in ‖x^k − x^*‖; a fixed point is a solution

- Adaptive δ_k: accelerates convergence; Barzilai-Borwein steps and nonmonotone line search give sufficient descent

Page 14: Optimization for Sparse Solutions, A Tutorial

Sparse Vector Recovery

For J(x) = ‖x‖_1 and given ∇F(x^k), the subproblem is O(n) and parallel. It is called shrinkage or soft-thresholding.

In compressed sensing, ∇F(x^k) = A^T(Ax^k − b) is often cheap (e.g., DFT, discrete cosine transform, chirp or convolution/circulant based sensing). Codes: SpaRSA, FPC, SPGL1

∇F(x^k) can be demanding in logistic regression and when dealing with non-Gaussian noise. Code: LPS
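A minimal prox-gradient (ISTA-style) sketch of these two steps for min_x µ‖x‖_1 + (1/2)‖Ax − b‖_2^2, written from the formulas above rather than taken from any of the codes named; the fixed step 1/L uses L = ‖A‖_2^2:

```python
import numpy as np

def shrink(z, t):
    """Soft-thresholding: the componentwise prox of t*||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista(A, b, mu, iters=500):
    """Prox-gradient iterations for  min_x  mu*||x||_1 + 0.5*||Ax - b||^2."""
    L = np.linalg.norm(A, 2) ** 2                  # Lipschitz constant of A'(Ax - b)
    delta = 1.0 / L                                # fixed step size, below 2/L
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ x - b)                   # forward (gradient) step ...
        x = shrink(x - delta * grad, delta * mu)   # ... then backward (prox/shrinkage) step
    return x
```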

Page 15: Optimization for Sparse Solutions, A Tutorial

Low-Rank Matrix Recovery

For low-rank matrix recovery, J(X) = ‖X‖_* := ∑_i σ_i(X). The subproblem is

min_X µ‖X‖_* + (1/2)‖X − Y^k‖_F^2

Since both terms are unitarily invariant, it is solved by an SVD followed by shrinking the singular values.

Codes: FPCA, SVT (linearized Bregman). Apply an approximate SVD or compute a partial SVD.
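A sketch of this singular-value shrinkage step with a full SVD for clarity (the codes above use approximate or partial SVDs instead); an illustration, not the authors' implementation:

```python
import numpy as np

def svd_shrink(Y, mu):
    """Solve  min_X  mu*||X||_* + 0.5*||X - Y||_F^2  by shrinking singular values."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    s_shrunk = np.maximum(s - mu, 0.0)             # soft-threshold the singular values
    return (U * s_shrunk) @ Vt                     # rebuild the (now lower-rank) matrix
```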

Lee et al’10 uses the same framework with the max-norm.

Page 16: Optimization for Sparse Solutions, A Tutorial

Application to Transform-ℓ1 and Total Variation

If J(x) = ‖Lx‖_1 and L is non-singular or even unitary (e.g., orthogonal wavelets), the subproblem has closed-form solutions.

If J(x) = ‖∇x‖_1 (total variation, or TV), there are graph-cut (max-flow/min-cut) algorithms to solve the subproblems.

If J(x) = ‖Lx‖_1 and L is non-invertible, the subproblem is less trivial. Solutions: operator splitting, smoothing (e.g., the Huber norm replacing ℓ1).

Page 17: Optimization for Sparse Solutions, A Tutorial

Prox-Linear Theory and Practice

Forward-backward splitting theory applies. (Combettes and Wajs’05)

Convergence for convex J(x) is quite standard; it also holds with line search and other variants (e.g., Wright, Figueiredo, Nowak'08).

Can extend to prox-regular functions (Lewis and Wright'08) and hybrid 1st/2nd-order iterations (Wen, Yin, Zhang, Goldfarb'11).

Convergence speed depends on the solution structure, controlled by µ:

- Larger µ gives a faster and more structured solution. Reason for ℓ1: a smaller optimal support is easier to identify, and afterward the iteration essentially runs on that smaller support with a better condition number

- Smaller µ gives a slower and less structured solution.

Continuation: start from a large µ and decrease µ sequentially, using the previous solution to warm-start the current iteration (e.g., code FPC).
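A sketch of this continuation loop with a prox-gradient inner solver; the geometric schedule µ/4 and the starting value 0.9‖A^T b‖_∞ are illustrative choices, not FPC's actual rules:

```python
import numpy as np

def shrink(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def continuation(A, b, mu_final, iters_per_stage=100):
    """Decrease mu geometrically and warm-start each stage from the previous solution."""
    delta = 1.0 / np.linalg.norm(A, 2) ** 2
    mu = 0.9 * np.linalg.norm(A.T @ b, np.inf)        # large mu: very sparse, easy start
    x = np.zeros(A.shape[1])
    while True:
        for _ in range(iters_per_stage):              # a few prox-gradient steps per stage
            x = shrink(x - delta * (A.T @ (A @ x - b)), delta * mu)
        if mu <= mu_final:
            return x
        mu = max(mu / 4.0, mu_final)                  # shrink mu toward the target value
```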

Page 18: Optimization for Sparse Solutions, A Tutorial

Techniques Applied with Prox-Linear

- Two-step descents (TwIST)

- Barzilai-Borwein steps, non-monotone line search (Wen et al.'09)

- (Block) coordinate descent: apply descent only to a subset of components (Tseng and Yun'09, Li and Osher'09, Wright'11)

- Active set: estimate the optimal support and solve a reduced subproblem (Shi et al.'08, Wen et al.'09)

- Use (approximate) second-order information, e.g. in logistic regression (Byrd et al.'10; Shi et al.'08)

- Accelerated first-order methods: given a convex function with L-Lipschitz gradients and no other structure, the iteration count is reduced from O(L/ε) to O(L/√ε) for an ε-optimal objective value (a FISTA-style sketch follows below)

  - min F(x): Nesterov'83;
  - min J(x) + F(x): Nesterov'04, Beck and Teboulle'08, Tseng'08;
  - ALM: Ma, Scheinberg, Goldfarb'10
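A minimal accelerated prox-gradient sketch in the Beck-Teboulle (FISTA) style for µ‖x‖_1 + (1/2)‖Ax − b‖_2^2; an illustration of the acceleration idea, not any of the cited authors' code:

```python
import numpy as np

def shrink(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def fista(A, b, mu, iters=500):
    """Accelerated prox-gradient for  min_x  mu*||x||_1 + 0.5*||Ax - b||^2."""
    L = np.linalg.norm(A, 2) ** 2                      # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    z = x.copy()                                       # extrapolated point
    t = 1.0
    for _ in range(iters):
        x_new = shrink(z - (A.T @ (A @ z - b)) / L, mu / L)   # prox-gradient step at z
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        z = x_new + ((t - 1.0) / t_new) * (x_new - x)  # Nesterov-style extrapolation
        x, t = x_new, t_new
    return x
```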

Page 19: Optimization for Sparse Solutions, A Tutorial

Variable Splitting

Consider a linear operator L and

min_x J(Lx) + F(x)

Rewrite it as

min_{x,y} J(y) + F(x), s.t. y = Lx,

which "separates" J(·) from F(·).

Augmented Lagrangian: generates x^k, y^k along with multipliers λ^k:

(x^{k+1}, y^{k+1}) ← argmin_{x,y} L(x, y; λ^k) = J(y) + F(x) + 〈λ^k, Lx − y〉 + (c/2)‖Lx − y‖_2^2,

λ^{k+1} ← λ^k + c(Lx^{k+1} − y^{k+1}).

Convergence needs bounded λk > 0 if J and F are convex.

The joint minimization subproblem is still hard to solve.

Page 20: Optimization for Sparse Solutions, A Tutorial

Variable Splitting

Consider a linear operator L and

min_x J(Lx) + F(x)

Introduce

min_{x,y} J(y) + F(x), s.t. y = Lx,

which "separates" J(·) from F(·).

Alternating Direction Method (ADM): alternate the x^k and y^k updates

x^{k+1} ← argmin_x L(x, y^k; λ^k),

y^{k+1} ← argmin_y L(x^{k+1}, y; λ^k),

λ^{k+1} ← λ^k + δc(Lx^{k+1} − y^{k+1}).

Converges if δ ∈ (0, (√5 + 1)/2). The x- and y-subproblems are decoupled.
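A minimal ADM sketch for µ‖x‖_1 + (1/2)‖Ax − b‖_2^2 with the simple splitting y = x (i.e., L = I), written directly from the updates above; it is an illustration, not one of the codes named on the later slides:

```python
import numpy as np

def shrink(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def adm_l1(A, b, mu, c=1.0, iters=300):
    """ADM for  min  mu*||x||_1 + 0.5*||Ax - b||^2,  split as y = x."""
    m, n = A.shape
    x, y, lam = np.zeros(n), np.zeros(n), np.zeros(n)
    Atb = A.T @ b
    chol = np.linalg.cholesky(A.T @ A + c * np.eye(n))           # factor (A'A + cI) once
    for _ in range(iters):
        rhs = Atb + c * y - lam
        x = np.linalg.solve(chol.T, np.linalg.solve(chol, rhs))  # x-subproblem: linear system
        y = shrink(x + lam / c, mu / c)                          # y-subproblem: shrinkage
        lam = lam + c * (x - y)                                  # multiplier update (delta = 1)
    return y
```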

Page 21: Optimization for Sparse Solutions, A Tutorial

Variable Splitting

If a subproblem, say the y-subproblem, is still expensive to solve, one can consider

Gradient descent:

y^{k+1} ← y^k − τ^k ∇_y L(x^{k+1}, y^k; λ^k).

Proximal descent:

y^{k+1} ← argmin_y J(y) + 〈∇_y(〈λ^k, Lx^{k+1} − y〉 + (c/2)‖Lx^{k+1} − y‖_2^2), y − y^k〉 + (1/(2τ)) ‖y − y^k‖_2^2.

Page 22: Optimization for Sparse Solutions, A Tutorial

Alternating Direction Method, History

Yin Zhang: “the idea of ADM goes back to Sun-Tze (400 BC) and Caesar (100 BC):

Back in the 1950s–70s, it appeared as operator splitting for PDEs, studied by Douglas, Peaceman, and Rachford, and then Glowinski et al.'81–89 and Gabay'83.

Subsequent studies are in the context of variational inequalities (Eckstein and Bertsekas'92, He et al.'02).

Extensions to multiple blocks (He, Yuan, and collaborators)

Page 23: Optimization for Sparse Solutions, A Tutorial

Alternating Direction Method, Applications

For sparse optimization, one subproblem is shrinkage or its variant.

The other subproblem is smooth, very often a linear system with (A^T A + c_k L^T L).

For TV(·) = ‖∇·‖_1, the splitting separates L = ∇ from ‖·‖_1 (Wang, Yin, and Zhang'08). The linear system can be diagonalized by the DFT for various A (e.g., I and convolution) and suitable boundary conditions.

Codes: FTVd, Split Bregman, YALL1, RecPF, SALSA.

Page 24: Optimization for Sparse Solutions, A Tutorial

Variable Splitting is Versatile

Variable splitting lets one apply ADM to many extensions of ℓ1: transform-ℓ1, frame-ℓ1, constrained/penalized, weighted, nonnegative, isotropic/anisotropic TV, complex, group ℓ2,1, nuclear norm, and even some non-convex models.

For example, the codes YALL1 and YALL1-Group solve such models and their group/joint sparse versions.

Page 25: Optimization for Sparse Solutions, A Tutorial

Related: the Bregman Methods (Osher et al’06, Yin et al’08)

Bregman distance: D(u, u^k) := J(u) − (J(u^k) + 〈p^k, u − u^k〉), p^k ∈ ∂J(u^k).

For min_x J(x) subject to Ax = b, the Bregman algorithm updates x^k and p^k:

x^{k+1} ← argmin_x µ[J(x) − (J(x^k) + 〈p^k, x − x^k〉)] + (1/2)‖Ax − b‖_2^2,

p^{k+1} ← p^k + A^T(b − Ax^{k+1}).

Relation to augmented Lagrangian: p^k = A^T λ^k. Split Bregman (applying Bregman to the splitting formulation) is ADM.

Bregman is not novel, but it leads to new observations:

- Successively linearizing J(·) gives rise to Ax = b in the limit

- Better denoising; error forgetting

Page 26: Optimization for Sparse Solutions, A Tutorial

Example: to recover x0 from noisy measurements b = Ax0 + ω

1. ℓ1 denoising: min_x µ‖x‖_1 + (1/2)‖Ax − b‖_2^2

[Figure: true signal vs. BPDN recovery, for µ = 48.5 (left) and µ = 49 (right)]

2. ℓ1 Bregman iteration

[Figure: true signal vs. Bregman recovery, µ = 150, 5 iterations]

Better denoising results are also observed on image reconstruction.

Page 27: Optimization for Sparse Solutions, A Tutorial

The Add-Back Update, Error Forgetting

After a change of variable, the Bregman update has an add-back form:

x^{k+1} ← argmin_x µJ(x) + (1/2)‖Ax − b^k‖_2^2,

b^{k+1} ← b + (b^k − Ax^{k+1}).

- The subproblem has the same ℓ1 + ℓ2^2 form

- Subproblem solution errors are iteratively cancelled

Simulation: subproblems are solved at low accuracy 10^-6 by ISTA, FPC-BB, GPSR, and SpaRSA, yet the relative errors drop to machine precision.

[Figure: relative error vs. Bregman iteration for the subproblem solvers ISTA, FPC-BB, GPSR, SpaRSA, and Mosek]
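A sketch of the add-back iteration above, with a plain prox-gradient loop standing in for the (inexact) subproblem solvers named in the simulation; an illustration, not the compared codes:

```python
import numpy as np

def shrink(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def bregman_addback(A, b, mu, outer=10, inner=200):
    """Bregman iteration in add-back form: every subproblem is the same l1 + l2^2 model."""
    delta = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    bk = b.copy()
    for _ in range(outer):
        for _ in range(inner):                     # inexact subproblem solve (prox-gradient)
            x = shrink(x - delta * (A.T @ (A @ x - bk)), delta * mu)
        bk = b + (bk - A @ x)                      # add the residual back into the data
    return x
```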

Page 28: Optimization for Sparse Solutions, A Tutorial

Non-Convex ℓp Minimization, p < 1

Pioneered by Rao et al.'97; popularized by Chartrand et al. for CS

[Figure: ℓ1-ball (left) and ℓ1/2-ball (right)]

However, there are multiple local minima for min_x ‖x‖_p^p, s.t. Ax = b.

[Figure: ‖x‖_p^p colorized over the affine set {x0 + v : v ∈ N(A)}]

Page 29: Optimization for Sparse Solutions, A Tutorial

Non-Convex Minimization

Chartrand and Yin'08, Daubechies et al.'09, and Lai-Wang'10 smooth out the "small" local minima by replacing ‖x‖_p^p by

∑_{i=1}^n x_i^2 / (x_i^2 + ε)^{1−p/2}.

[Figure: sharp ℓ1/2-ball (left) and smoothed ℓ1/2-ball (right)]

Works better than ℓ1 on fast-decaying signals (nonzero entries have a fast-decaying distribution)
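A sketch of this smoothing in the iteratively-reweighted-least-squares style used by these papers, keeping the constraint Ax = b; the ε schedule mirrors the continuation shown on the next slides, but the exact values here are illustrative:

```python
import numpy as np

def irls_lp(A, b, p=0.5, eps=1.0, outer=15, inner=5):
    """IRLS: repeatedly minimize sum_i w_i x_i^2 s.t. Ax = b, w_i = (x_i^2 + eps)^(p/2 - 1)."""
    x = np.linalg.lstsq(A, b, rcond=None)[0]          # least-squares starting point
    for _ in range(outer):
        for _ in range(inner):
            w = (x ** 2 + eps) ** (p / 2.0 - 1.0)     # weights from the current iterate
            Winv = 1.0 / w
            # Weighted least-norm solution of Ax = b:  x = W^{-1} A' (A W^{-1} A')^{-1} b
            x = Winv * (A.T @ np.linalg.solve((A * Winv) @ A.T, b))
        eps = max(eps / 10.0, 1e-8)                   # continuation: shrink the smoothing
    return x
```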

Page 30: Optimization for Sparse Solutions, A Tutorial

For fast-decaying sparse signals, smoothing + continuation help avoid local minima.

Red: sparse solution; Green: smoothed ℓp solutions

ε = 1.0

Courtesy of Rick Chartrand

Page 31: Optimization for Sparse Solutions, A Tutorial

For fast-decaying sparse signals, smoothing + continuation help avoid local minima.

Red: sparse solution; Green: smoothed ℓp solutions

ε = 0.1

Courtesy of Rick Chartrand

Page 32: Optimization for Sparse Solutions, A Tutorial

For fast-decaying sparse signals, smoothing + continuation help avoid local minima.

Red: sparse solution; Green: smoothed ℓp solutions

ε = 0.01

Courtesy of Rick Chartrand

Page 33: Optimization for Sparse Solutions, A Tutorial

For fast-decaying sparse signals, smoothing + continuation help avoid local minima.

Red: sparse solution; Green: smoothed ℓp solutions

ε = 0.001

Courtesy of Rick Chartrand

Page 34: Optimization for Sparse Solutions, A Tutorial

Other non-convex techniques

- Candes, Wakin, and Boyd'08 use ∑_{i=1}^n |x_i| / (|x_i| + ε)^{1−p} (a reweighting sketch follows below).

- Wang and Yin'10 use ∑_i w_i |x_i| with binary w_i. Code: Threshold-ISD

- Mohimani, Babaie-Zadeh, and Jutten use other approximations. Code: SL0

- Extensions to joint sparse vectors and low-rank matrix recovery ...

Nearly all need fast-decaying signals to recover significantly better than ℓ1.
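A sketch of the reweighted-ℓ1 idea in the spirit of Candes, Wakin, and Boyd'08, reusing the LP formulation from the earlier basis pursuit sketch; the weight update 1/(|x_i| + ε) is the commonly used choice, and the parameters are illustrative:

```python
import numpy as np
from scipy.optimize import linprog

def reweighted_l1(A, b, iters=5, eps=0.1):
    """Repeatedly solve a weighted l1 problem; small entries get large weights next round."""
    m, n = A.shape
    w = np.ones(n)
    A_eq = np.hstack([A, -A])                          # x = xp - xm with xp, xm >= 0
    for _ in range(iters):
        c = np.concatenate([w, w])                     # weighted l1 objective
        res = linprog(c, A_eq=A_eq, b_eq=b, bounds=[(0, None)] * (2 * n), method="highs")
        x = res.x[:n] - res.x[n:]
        w = 1.0 / (np.abs(x) + eps)                    # reweight toward the support of x
    return x
```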

Page 35: Optimization for Sparse Solutions, A Tutorial

Matrix Completion

(Candes and Recht'09; Recht, Fazel, and Parrilo'10) A matrix M can be exactly recovered from enough yet few random subsamples with overwhelming probability, provided

- Low rank

- Row and column subspaces are incoherent

Motivating example: low rank due to only a few factors contributing to users' preferences

Page 36: Optimization for Sparse Solutions, A Tutorial

Matrix Completion Model

Given samples in Ω, recover X by

min_X µJ(X) + (1/2)‖X_Ω − M_Ω‖_F^2

Choices of J(X ):

- rank(X): computationally intractable

- ‖X‖_*: the convex envelope of rank(X)

- ‖X‖_max: another convex regularizer of rank(X)

- Nonconvex ℓp semi-norm of the singular values of X

Codes: SVT, FPCA, APGL, ...

Also, trace((X^T X)^{p/2}): equals ‖X‖_* when p = 1; it is non-convex when p < 1 and is the ℓp-norm of the singular values

Page 37: Optimization for Sparse Solutions, A Tutorial

On terabyte-scale data, computing SVDs becomes almost prohibitively expensive.

The code LMaFit avoids SVDs by exploiting the low-rank factorization M ≈ XY and solving

min_{X,Y} ‖P_Ω(XY − M)‖_F

Given an approximate rank r, X is m-by-r and Y is r-by-n.
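A plain alternating-least-squares sketch of this factorization model (LMaFit itself uses a more refined nonlinear SOR-type scheme; this illustration also assumes every row and column has at least one observed entry):

```python
import numpy as np

def als_completion(M, mask, r, iters=50, seed=0):
    """Alternately fit X and Y so that X @ Y matches M on the observed entries.

    M    : m-by-n array of observed values (unobserved entries are ignored)
    mask : boolean m-by-n array, True where M is observed
    r    : assumed rank of the factorization
    """
    m, n = M.shape
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(m, r))
    Y = rng.normal(size=(r, n))
    for _ in range(iters):
        for i in range(m):                 # update each row of X from its observed entries
            cols = mask[i]
            X[i] = np.linalg.lstsq(Y[:, cols].T, M[i, cols], rcond=None)[0]
        for j in range(n):                 # update each column of Y from its observed entries
            rows = mask[:, j]
            Y[:, j] = np.linalg.lstsq(X[rows], M[rows, j], rcond=None)[0]
    return X, Y
```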

Drawbacks:

- non-convex, no theoretical guarantees known yet

- needs a rank estimate or a dynamic rank update

Advantages:

- cheap (alternating minimization)

- does not store any full matrix

Other models and codes:

- OptSpace (descent on the Grassmannian manifold)

- Low-rank + Sparse, splitting methods: Yuan, Yang'09, Lin et al.'10, etc.

Page 38: Optimization for Sparse Solutions, A Tutorial

To finish up

Some observations:

- The work of finding structured solutions is interdisciplinary

- Recognizing the structure is important

- It is a common ground for existing and novel algorithms

- This field grows quickly, and its tools are becoming standard techniques in computational mathematics and engineering

Benefits of structured solutions:

- cheap to store / transmit

- more meaningful, easy to understand

- easy to use

- robust to errors

- easy to find

Acknowledgements: R. Chartrand for some nice figures, and Yangyang Xu for some slides.