-
Sparse regularization path by differential inclusion
Wotao Yin (UCLA Math)
joint with: Stanley Osher, Ming Yan (UCLA)
Feng Ruan, Jiechao Xiong, Yuan Yao (Peking U)
ICERM Approximation, Integration, and Optimization Workshop
October 2, 2014
1 / 42
-
Background
• Assume vector x∗ ∈ R^n is sparse, unknown
• Goal: Recover x∗ from
      b = Ax∗ + ε,
  where A ∈ R^{m×n}, b ∈ R^m, and ε is unknown noise.
• Consider m ≪ n, the under-determined case
2 / 42
-
Background: Regularization by optimization
Examples:
• convex: LASSO: minimize λ‖x‖₁ + (1/2m)‖Ax − b‖₂²
• nonconvex: SCAD, ℓp-seminorm minimization with p ∈ (0, 1)
Optimization approach:
• convex penalty: avoids overfitting, tractable, but leads to bias
• nonconvex penalty: may work better, but performance is unpredictable
Both have a tuning parameter, the best choice of which is often unknown.
So, we need model selection: vary the parameter value, solve many instances,
and then pick the best one.
3 / 42
-
Background: Regularization path by an algorithm
Algorithmic regularization:
• An algorithm generates a regularization path. (Points on the path may not
  minimize an energy function.)
• Model selection is done by choosing when to stop: at a time (for a
  continuous dynamic) or at an iteration (for a discrete update)
Examples:
• LASSO/LARS: solve parameterized LASSO KKT conditions
• the xMP (matching pursuit) family
4 / 42
-
This talk introduces a continuous regularization path by differential
inclusions, with
• recovery guarantees
• fast implementation
• and generalization to other structured solutions
5 / 42
-
Introduced: two Inverse Scale Space (ISS) dynamics
• Let x(t), p(t) ∈ R^n be the primal-dual regularization path; t is time.
• Bregman ISS dynamic: {x(t), p(t)}t≥0 is governed by
      ṗ(t) = (1/m) Aᵀ(b − Ax(t)),
      p(t) ∈ ∂‖x(t)‖₁.
  Initial solution x(0) = p(0) = 0.
• Linearized Bregman ISS dynamic: {x(t), p(t)}t≥0 is governed by
      ṗ(t) + (1/κ) ẋ(t) = (1/m) Aᵀ(b − Ax(t)),
      p(t) ∈ ∂‖x(t)‖₁.
  Initial solution x(0) = p(0) = 0.
• For a well-defined (and unique) path, make technical assumptions:
  • p(t) is right continuously differentiable, and
  • x(t) is right continuous
6 / 42
-
Generalization
Given any convex optimization model
      minimize_x  r(x) + t·f(x),
one can generate the related Bregman ISS model
      ṗ(t) = −f′(x(t)),
      p(t) ∈ ∂r(x(t)),
where
• r is a convex regularizer: weighted ℓ₁, ℓ₁,₂, nuclear norm, and so on;
  it can incorporate nonnegativity or box constraints as indicator functions
• f is a convex fitting term: square loss, logistic loss, etc.
Linearized Bregman ISS: add a strongly convex function to r.
7 / 42
-
Major claims for Bregman ISS applied to ℓ₁
The solution path {x(t), p(t)}t≥0:
• x(t) is sparse w.h.p., since p ∈ ∂‖x‖₁ ∩ R(Aᵀ) and A is fat
• x(t) is less biased than LASSO, better than LASSO+debiasing
• the path can be computed piece-wise very quickly
• sign-consistency: sign(x(t)) = sign(x∗) at some t, under conditions
In less technical language, the new method
• recovers sparse nonzero elements like ℓ₁ but avoids its bias
• can generate a regularization path much more quickly than ℓ₁
• wherever ℓ₁ extends, it does too
8 / 42
-
Background: `1 subgradient
• Subdifferential of a convex function f :
      ∂f(y) = {p : f(x) ≥ f(y) + ⟨p, x − y⟩, ∀x ∈ dom f}.
  Each p ∈ ∂f(y) is a subgradient of f at y.
• Subdifferential of | · |:
      ∂|xᵢ| = {1}      if xᵢ > 0;
              [−1, 1]  if xᵢ = 0;
              {−1}     if xᵢ < 0.
  ⇒ let pᵢ ∈ ∂|xᵢ|; then
      xᵢ ≥ 0  if pᵢ = 1;
      xᵢ = 0  if pᵢ ∈ (−1, 1);
      xᵢ ≤ 0  if pᵢ = −1.
9 / 42
-
Sparsity and `1 subgradient
Although x ↔ p is not one-to-one in general, it is in some cases: when xᵢ is nonzero, pᵢ must equal its sign; when pᵢ ∈ (−1, 1), xᵢ has to be zero.
p is like an array of 3-position switches: −1, (−1, 1), +1
10 / 42
-
Toy example 1
Consider
      b = ax + ε,
where b, a, x are strictly positive scalars.
Bregman ISS:
• start: x(0) = 0 and p(0) = 0
• stage 1: p evolves before reaching 1; meanwhile x stays 0:
      ṗ = a(b − ax) = ab  ⇒  p(t) = (ab)t
• stage 2: p reaches 1 at t = 1/(ab) but cannot exceed 1, so ṗ(t) ≤ 0 and
  thus x(t) ≠ 0. The right-continuity assumption makes ṗ(t) < 0 impossible, as
  it would immediately force x(t+) = 0. Therefore,
      at t ≥ 1/(ab),  0 = ṗ(t) = a(b − ax(t))  ⇒  x(t) = b/a,  p(t) = 1.
Once p(t) = 1 (“switch is on”), the signal x(t) = b/a immediately pops out!
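A minimal numerical sketch (not from the talk) that reproduces this two-stage behavior by forward-Euler integration of the scalar dynamic; the values a = 2, b = 4 and the step size are illustrative.

```python
import numpy as np

# Forward-Euler simulation of the scalar Bregman ISS (illustrative a, b).
a, b = 2.0, 4.0
dt, p, x = 1e-4, 0.0, 0.0
for k in range(int(1.0 / dt)):
    p += dt * a * (b - a * x)   # stage 1: p ramps at rate ab while x = 0
    if p >= 1.0:                # stage 2: switch on, clamp p, pop x out
        p, x = 1.0, b / a
print(x, p)  # x = b/a = 2.0 and p = 1.0 once t exceeds 1/(ab) = 0.125
```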
11 / 42
-
LASSO:
      x(t) = arg min_x  |x| + (t/2)|ax − b|²
⇒ optimality condition:
      0 = p + ta(ax − b),  p ∈ ∂|x|
⇒ solution:
      p(t) = (ab)t,  x(t) = 0,              t ∈ [0, 1/(ab));
      p(t) = 1,      x(t) = b/a − 1/(ta²),  t ∈ [1/(ab), ∞).
In this example, LASSO has the same p(t) path but a different x(t) path.
LASSO’s x(t) path reduces the signal strength.
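For contrast, a small sketch (with illustrative a = 2, b = 4, as before) evaluating both closed-form paths past the breakpoint t = 1/(ab):

```python
import numpy as np

# Closed-form x(t) paths past the breakpoint t = 1/(ab) (illustrative values).
a, b = 2.0, 4.0
t = np.linspace(0.2, 2.0, 4)
x_iss = np.full_like(t, b / a)        # Bregman ISS: unbiased b/a
x_lasso = b / a - 1.0 / (t * a**2)    # LASSO: shrunken by the bias 1/(t a^2)
print(x_iss)    # [2. 2. 2. 2.]
print(x_lasso)  # approaches 2 only as t -> infinity
```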
12 / 42
-
ℓ₁ subgradient and sparsity
Faces of ∂‖x‖₁
• ℓ₁ subdifferential: ∂‖x‖₁ = ∂|x₁| × · · · × ∂|xₙ|.
• The image of ∂‖x‖₁ is [−1, 1]^n
• Let p ∈ ∂‖x‖₁. For xᵢ ≠ 0, pᵢ must equal ±1 and is thus exposed.
• More pᵢ exposed ⇔ p lies on a low-dim face of [−1, 1]^n
Observation:
      vector x is sparse ⇔ few pᵢ = ±1 ⇔ p ∉ a low-dim face of [−1, 1]^n
13 / 42
-
• If matrix A is fat (or Aᵀ is thin), then R(Aᵀ) is a small subspace
• A is random and p ∈ ∂‖x‖₁ ∩ R(Aᵀ)
  ⇒ p is unlikely to lie on a low-dim face of [−1, 1]^n
  ⇒ very few pᵢ = ±1
  ⇒ sparse x
Bregman ISS update:
      ṗ(t) = (1/m) Aᵀ(b − Ax(t)),  p(t) ∈ ∂‖x(t)‖₁
⇒ p ∈ ∂‖x‖₁ ∩ R(Aᵀ)
Conclusion: if A is fat, then x(t) is typically sparse.
14 / 42
-
Toy example 2
• x ∈ R^n, measurement b is a scalar:
      b = aᵀx + ε ∈ R
  Suppose a₁ = 1 > a₂ ≥ · · · ≥ aₙ > 0 and b > 0, w.l.o.g.
• Bregman ISS solution:
      x₁(t) = 0 for t < 1/b;   x₁(t) = b for t ≥ 1/b.
      x₂(t) = · · · = xₙ(t) = 0,  t ≥ 0.
15 / 42
-
• LASSO solution:
      x₁(t) = 0 for t < 1/b;   x₁(t) = b − 1/t for t ≥ 1/b.
      x₂(t) = · · · = xₙ(t) = 0,  t ≥ 0.
• Both solutions are sparse. Like before, the LASSO solution is a
  strength-reduced signal, which is not good.
16 / 42
-
Oracle estimator
• The unknown S := supp(x∗) is disclosed by an oracle
• The oracle estimator is the least-squares solution restricted to S:
      x̃∗_S = arg min { (1/2m)‖Ax − b‖₂² : supp(x) = S }
• Define the submatrix A_S of A and Σₘ := (1/m) A_Sᵀ A_S. The oracle estimate
      x̃∗_S = Σₘ⁻¹ ((1/m) A_Sᵀ b) = x∗_S + (1/m) Σₘ⁻¹ A_Sᵀ ε
  has the oracle properties:
  • consistency: supp(x̃∗_S) = S
  • normality: x̃∗_S ∼ N(x∗_S, (σ²/m) Σₘ⁻¹). In particular, E[x̃∗_S] = x∗_S: unbiased.
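A small sketch of the oracle estimator, assuming the support S is given (as the oracle discloses it):

```python
import numpy as np

# Oracle estimator sketch: least squares restricted to a known support S.
def oracle_estimator(A, b, S):
    x = np.zeros(A.shape[1])
    xS, *_ = np.linalg.lstsq(A[:, S], b, rcond=None)  # min ||A_S x_S - b||_2
    x[S] = xS
    return x
```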
17 / 42
-
LASSO fails to have oracle properties
Tibshirani’96 (LASSO) and Chen-Donoho-Saunders’96 (BPDN):
      minimize  ‖x‖₁ + (t/2m)‖Ax − b‖₂²
Optimality conditions:
      p = (t/m) Aᵀ(b − Ax),  p ∈ ∂‖x‖₁.
Pros:
• p ∈ ∂‖x‖₁ ∩ R(Aᵀ), so x is sparse
• efficient solvers for fixed t
• sign-consistency under conditions
Cons:
• x(t) is always biased!
• computing for many values of t is slow or inaccurate
18 / 42
-
The LASSO bias
At some t, suppose supp(x̃LASSO) = supp(x∗) =: S. Then
      x̃LASSO_S = [ x∗_S + (1/m) Σₘ⁻¹ A_Sᵀ ε ] − (1/t) Σₘ⁻¹ sign(x̃LASSO_S),
where the bracketed part is the oracle estimate and the last term is the bias.
The bias is caused by the part of the ℓ₁ norm applied to x_S.
LASSO’s ℓ₁ minimization enforces x_{Sᶜ} = 0 but hurts the signals in x_S!
19 / 42
-
Debias LASSO
Two approaches:
• Exact debias: add (1/t) Σₘ⁻¹ sign(x̃LASSO_S) to x̃LASSO_S
• Pseudo debias:
      minimize_x  ‖Ax − b‖²  subject to  supp(x) = supp(x̃LASSO)
  It’s “pseudo” since the debiased solution may have changed signs.
Issues:
• extra computation
• the bias has a negative effect on the signs of x̃LASSO, which is not removed
  by debiasing; therefore, x̃LASSO_S often misses small signals, which are not
  recovered by debiasing
• does not work for problems with “continuous support” (e.g., in low-rank
  matrix recovery)
20 / 42
-
Bregman ISS: a “debiasing” interpretation
• LASSO optimality condition:
      p = (t/m) Aᵀ(b − Ax)
• Differentiate w.r.t. t ⇒
      ṗ = (1/m) Aᵀ(b − A(tẋ + x))
• Important: recognize that tẋ + x is the debiased LASSO solution!
• Idea: replace tẋ + x by x ⇒ Bregman ISS:
      ṗ = (1/m) Aᵀ(b − Ax)
• No bias is ever introduced!
• Note: Bregman ISS ≠ LASSO+debiasing. Bregman ISS is better and faster.
21 / 42
-
Compute the Bregman ISS path
Theorem
The solution path of
      ṗ₊(t) = (1/m) Aᵀ(b − Ax(t)),  p(t) ∈ ∂‖x(t)‖₁
with initial conditions t₀ = 0, p(0) = 0, x(0) = 0 is given piece-wise, for
k = 1, 2, . . . , K, by:
• p(t) is piece-wise linear:
      p(t) = p(tₖ₋₁) + ((t − tₖ₋₁)/m) Aᵀ(b − Ax(tₖ₋₁)),  t ∈ [tₖ₋₁, tₖ],
  where tₖ := sup{t > tₖ₋₁ : p(t) ∈ ∂‖x(tₖ₋₁)‖₁}.
• x(t) = x(tₖ₋₁) is piece-wise constant for t ∈ [tₖ₋₁, tₖ); if tₖ ≠ ∞,
      x(tₖ) = arg min_u ‖Au − b‖₂²  subject to  uᵢ ≥ 0 where pᵢ(tₖ) = 1,
                                                uᵢ = 0 where pᵢ(tₖ) ∈ (−1, 1),
                                                uᵢ ≤ 0 where pᵢ(tₖ) = −1.
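Below is a hedged sketch, not the authors' implementation, of one path segment: advance p linearly until a coordinate reaches ±1, then update x by the sign-constrained least squares above (here via SciPy's bounded least squares on the active set).

```python
import numpy as np
from scipy.optimize import lsq_linear

def advance_p(A, b, x, p, m):
    """Move p linearly until the next coordinate of p reaches +/-1."""
    v = (A.T @ (b - A @ x)) / m                     # dp/dt on this segment
    with np.errstate(divide="ignore", invalid="ignore"):
        hit = np.where(v > 0, (1 - p) / v,
                       np.where(v < 0, (-1 - p) / v, np.inf))
    hit[np.isclose(np.abs(p), 1.0)] = np.inf        # skip already-active coords
    dt = hit.min()
    return p + dt * v, dt

def update_x(A, b, p):
    """Solve min ||Au - b||^2 with u_i sign-constrained where p_i = +/-1, else u_i = 0."""
    x = np.zeros(A.shape[1])
    act = np.isclose(np.abs(p), 1.0)
    if act.any():
        signs = np.sign(p[act])
        res = lsq_linear(A[:, act] * signs, b, bounds=(0.0, np.inf))
        x[act] = signs * res.x                      # map back to original signs
    return x
```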
22 / 42
-
Faster alternative: Linearized Bregman ISS
      ṗ(t) + (1/κ) ẋ(t) = (1/m) Aᵀ(b − Ax(t)),
      p(t) ∈ ∂‖x(t)‖₁.
• The solution is piece-wise smooth and in closed form.
• It approximates Bregman ISS: it converges to the Bregman ISS solution
  exponentially fast in κ.
• It reduces to one nonlinear ODE:
      ż(t) = (1/m) Aᵀ(b − κA shrink(z(t))).
Insight: the mapping z(t) = p(t) + (1/κ) x(t) is one-to-one. Given z(t), recover
      x(t) = κ shrink(z(t)),  p(t) = z(t) − (1/κ) x(t),
where
      shrink(u) = prox_{‖·‖₁}(u) = arg min_y ‖y‖₁ + (1/2)‖y − u‖₂².
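A minimal sketch of shrink (soft-thresholding) and the z ↦ (x, p) recovery just described:

```python
import numpy as np

# shrink is soft-thresholding, the prox of the l1 norm (threshold 1 here).
def shrink(u):
    return np.sign(u) * np.maximum(np.abs(u) - 1.0, 0.0)

# Recover (x, p) from z = p + (1/kappa) x.
def split(z, kappa):
    x = kappa * shrink(z)   # if |z_i| <= 1 then x_i = 0
    p = z - x / kappa       # p_i = sign(z_i) where x_i != 0, else p_i = z_i
    return x, p
```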
23 / 42
-
Discrete Linearized Bregman Iteration
• Nonlinear ODE (from the last slide):
      ż = (1/m) Aᵀ(b − κA shrink(z(t))).
• Forward Euler:
      z^{k+1} = z^k + (αₖ/m) Aᵀ(b − Ax^k),  where x^k := κ shrink(z^k).
• Easy to parallelize for very large datasets. For example, split
      A = [A₁ A₂ · · · A_L],  where each block A_ℓ is distributed.
Distributed implementation, for ℓ = 1, . . . , L in parallel:
      z^{k+1}_ℓ = z^k_ℓ + (αₖ/m) A_ℓᵀ (b − w^k)
      w^{k+1}_ℓ = κ A_ℓ shrink(z^{k+1}_ℓ)
followed by an all-reduce sum: w^{k+1} = Σ_{ℓ=1}^{L} w^{k+1}_ℓ.
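A single-node sketch of this forward-Euler iteration (κ, the step size, and the iteration count are illustrative, not tuned):

```python
import numpy as np

# Discrete linearized Bregman iteration: z-update with x^k = kappa*shrink(z^k).
def linearized_bregman(A, b, kappa=10.0, alpha=1.0, n_iter=1000):
    m, n = A.shape
    z = np.zeros(n)
    for _ in range(n_iter):
        x = kappa * np.sign(z) * np.maximum(np.abs(z) - 1.0, 0.0)
        z += (alpha / m) * (A.T @ (b - A @ x))   # forward Euler step
    return kappa * np.sign(z) * np.maximum(np.abs(z) - 1.0, 0.0)
```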
24 / 42
-
Comparison to the ISTA iteration for LASSO
• Linearized Bregman (LB) iteration:
      z^{k+1} = z^k − (αₖ/m) Aᵀ(A(κ shrink(z^k)) − b)
• ISTA (forward-backward splitting, FPC, SpaRSA, ...) iteration:
      x^{k+1} = shrink(x^k − (αₖ/m) Aᵀ(Ax^k − b), λ)
Comparison:
• ISTA: the intermediate x^k is dense; solves LASSO for a fixed λ as k → ∞
• LB: the intermediate x^k is sparse (useful as a regularization path);
  as k → ∞, it solves
      minimize ‖x‖₁ + (1/2κ)‖x‖²  subject to  Ax = b,
  with the exact penalty property: a sufficiently large κ gives an ‖x‖₁ minimizer
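For contrast, a minimal ISTA sketch in the same notation; λ and the step size are illustrative, and the update follows the slide's convention of thresholding by λ:

```python
import numpy as np

# ISTA for LASSO at a fixed lambda; note the iterate x^k is generally dense.
def ista(A, b, lam=0.1, alpha=1.0, n_iter=1000):
    m, n = A.shape
    x = np.zeros(n)
    for _ in range(n_iter):
        g = x - (alpha / m) * (A.T @ (A @ x - b))          # gradient step
        x = np.sign(g) * np.maximum(np.abs(g) - lam, 0.0)  # shrink by lambda
    return x
```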
25 / 42
-
Comparison to orthogonal matching pursuit (OMP)¹
OMP: start with index set S = ∅ and vector x = 0;
iterate:
1. compute the correlation vector Aᵀ(b − Ax) and add the index of its
   largest-magnitude entry to S
2. set x ← arg min ‖b − Ax‖₂²  subject to  xᵢ = 0 ∀i ∉ S.
Differences:
• OMP: grows the index set S (OMP variants evolve S in other ways)
• ISS: evolves p ∈ ∂‖x‖₁, which encodes more information
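A minimal OMP sketch following the two steps above:

```python
import numpy as np

# Orthogonal matching pursuit: grow S greedily, refit by least squares.
def omp(A, b, n_steps):
    S, x = [], np.zeros(A.shape[1])
    for _ in range(n_steps):
        r = A.T @ (b - A @ x)                # correlations with the residual
        S.append(int(np.argmax(np.abs(r))))  # add the most correlated index
        xS, *_ = np.linalg.lstsq(A[:, S], b, rcond=None)
        x = np.zeros(A.shape[1])
        x[S] = xS
    return x
```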
¹Mallat-Zhang’93, Tropp-Gilbert’07
26 / 42
-
Generalization (once again)
Bregman ISS model:
      ṗ(t) = −f′(x(t)),
      p(t) ∈ ∂r(x(t)),
where
• r is a convex regularizer: weighted ℓ₁, ℓ₁,₂, nuclear norm, etc.
• f is a convex fitting term: square loss, logistic loss, etc.
27 / 42
-
Next: Numerical examples
28 / 42
-
20-Dimensional Example
[Figure: solution paths x(t) (values in [−100, 100], top) and p(t) (values in
[−1, 1], bottom) over t ∈ [0, 2.5×10⁻⁶]; the original deck animates this
figure over several frames.]
29 / 42
-
Predict prostate tumor size
• given 8 clinical features, select predictors for prostate tumor size
• data: 67 training cases + 30 testing cases
Predictor LS Subset LASSO ISS
Intercept 2.452 2.466 2.481 2.476
lcavol 0.716 0.667 0.622 0.554
lweight 0.293 0.366 0.289 0.279
age -0.143 0 -0.096 0
lbph 0.212 0 0.188 0.198
svi 0.310 0.268 0.262 0.238
lcp -0.289 -0.291 -0.164 0
gleason -0.021 0 0 0
pgg45 0.277 0.227 0.187 0.122
#Features 8 5 7 5
Test Error 0.586 0.587 0.543 0.541
LS = least squares, Subset = best subset regression, LASSO solved by glmnet
Bregman ISS achieves the least test error with the fewest features!
34 / 42
-
Relation to discrete Bregman iteration
• Forward Euler of ṗ = (1/m) Aᵀ(b − Ax):
      p^{k+1} = p^k + (δ/m) Aᵀ(b − Ax^k),
  which is the first-order optimality condition of
      x^{k+1} ← arg min_x  D_{‖·‖₁}(x; x^k) + (δ/2m)‖Ax − b‖²,
  where D_{‖·‖₁}(x; x^k) := ‖x‖₁ − ‖x^k‖₁ − ⟨p^k, x − x^k⟩.
• By a change of variable, this is the “add-back-the-residual” iteration
      x^{k+1} ← arg min_x  ‖x‖₁ + (δ/2m)‖Ax − b^k‖²,
      b^{k+1} ← b^k + (b − Ax^{k+1}).
  This remains true if ‖·‖₁ is replaced by any convex regularizer.
• Message: keep your existing solver, use a small δ, “add back the residual”
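A hedged sketch of this recipe, where `lasso_solver(A, bk, delta)` is a hypothetical stand-in for whatever solver you already have:

```python
# "Add-back-the-residual" Bregman iteration around an existing solver.
# `lasso_solver(A, bk, delta)` is a hypothetical stand-in that returns
# arg min_x ||x||_1 + (delta/2m)||Ax - bk||^2.
def bregman_iteration(A, b, lasso_solver, delta, n_outer=10):
    bk = b.copy()
    for _ in range(n_outer):
        x = lasso_solver(A, bk, delta)
        bk = bk + (b - A @ x)   # add back the residual
    return x
```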
35 / 42
-
Test with noisy measurements and tiny signals
[Figure: recovery of a length-250 signal with entries in [−2, 2]. Left panel:
true signal vs. BPDN recovery and hand-tuned LASSO. Right panel: true signal
vs. Bregman recovery and the Bregman 5th iteration.]
36 / 42
-
Related observation
YALL1 paper (Yang-Zhang’08): tested different algorithms for LASSO
      min ‖u‖₁ + (t/2n)‖Au − b‖₂².
Strange observation: ADM algorithms do better than the model itself!
37 / 42
-
Theory: path consistency
Question: does there exist t so that the solution x(t) has the following properties?
• no false positive: if x∗ᵢ = 0, then xᵢ(t) = 0
• no false negative: if x∗ᵢ ≠ 0, then xᵢ(t) ≠ 0
• sign consistency: furthermore, sign(x(t)) = sign(x∗).
Theorem
Under the assumptions
• Gaussian noise: ε ∼ N(0, σ²I),
• normalized columns: (1/n) maxⱼ ‖Aⱼ‖₂ ≤ 1,
and under irrepresentable and strong-signal conditions, Bregman ISS reaches
sign consistency and gives an unbiased estimate of x∗.
Proof is based on the next two lemmas.
38 / 42
-
No false positive
Define the true support S := supp(x∗), and let T := Sᶜ.
Lemma
Under the assumptions, if A_S has full column rank and
      max_{j∈T} ‖Aⱼᵀ A_S (A_Sᵀ A_S)⁻¹‖₁ ≤ 1 − η
for some η ∈ (0, 1), then with high probability
      supp(x(s)) ⊆ S,  ∀s ≤ t̄ := O( (η/σ) √(m / log n) ).
Proof uses: (i) a concentration inequality and (ii) if supp(x(s)) ⊆ S for
s ≤ t, then
      p_T(s) = A_Tᵀ A_S (A_Sᵀ A_S)⁻¹ p_S(s) + s A_Tᵀ P⊥_{A_S} ε,  s ≤ t.
39 / 42
-
No false negative / sign consistency
Lemma
Under the assumptions, if A_Sᵀ A_S ⪰ γI and the smallest signal magnitude
x∗min := min_{i∈S} |x∗ᵢ| obeys
      x∗min ≥ max{ O( (σ/√γ) √(log|S| / m) ),  O( (σ log|S| / (ηγ)) √(log n / m) ) },
then there exists t∗ (which can be given explicitly) so that with high probability
      sign(x(t∗)) = sign(x∗)
and x(t∗) = x∗_S + (A_Sᵀ A_S)⁻¹ A_Sᵀ ε obeys
      ‖x(t∗) − x∗‖∞ ≤ x∗min/2.
• the first term in the max ensures ‖(A_Sᵀ A_S)⁻¹ A_Sᵀ ε‖∞ ≤ x∗min/2
• the second term ensures inf{t : sign(x_S(t)) = sign(x∗_S)} ≤ t̄.
40 / 42
-
Related work
Discrete:
• Bregman iteration for imaging (TV) and compressed sensing ℓ₁:
  Osher-Burger-Goldfarb-Xu-Yin’06, Yin-Osher-Goldfarb-Darbon’08
• Linearized Bregman on ℓ₁: Yin-Osher-Goldfarb-Darbon’08, Yin’10, Lai-Yin’13
• Matrix completion SVT on ‖X‖∗: Cai-Candès-Shen’10
• Extension and analysis: Zhang’13, Zhang’14
Continuous:
• Inverse scale space (ISS) on TV: Burger-Gilboa-Osher-Xu’06
• Adaptive ISS on ℓ₁: Burger-Möller-Benning-Osher’11
• Greedy ISS on ℓ₁: Möller-Zhang’13
41 / 42
-
Summary
Instead of solving  minimize_x r(x) + t·f(x),  just try
      ṗ(t) = −f′(x(t)),  p(t) ∈ ∂r(x(t)).
It will
• keep solution structure
• remove bias
• give a solution path efficiently
Even simpler for you: keep your existing solver, apply “add back the residual”
42 / 42