General principles for high-dimensional estimation: Statistics and computation · 2011-10-03
TRANSCRIPT
General principles for high-dimensional estimation: Statistics and computation
Martin Wainwright
Statistics and EECS
UC Berkeley
Joint work with: Garvesh Raskutti, Sahand Negahban, Pradeep Ravikumar, Bin Yu
Introduction

High-dimensional data sets are everywhere:
biological data (genes, proteins, etc.)
social network data
recommender systems (Amazon, Netflix, etc.)
astronomy datasets

Question: Suppose that n = 100 and p = 1000. Do we expect theory requiring n → +∞ and p = O(1) to be useful?

Modern viewpoint:
non-asymptotic results (valid for all (n, p))
allow for n ≪ p or n ≍ p
investigate various types of low-dimensional structure
Loss functions and regularization

Models: indexed class of probability distributions {Pθ | θ ∈ Ω}
Data: samples Z_1^n = {(x_i, y_i), i = 1, ..., n} drawn from unknown Pθ∗
Estimation: minimize a loss function plus a regularization term:

    θ̂ ∈ argmin_{θ ∈ Ω} { L(θ; Z_1^n) + λ_n R(θ) }
       [estimate]         [loss function]  [regularizer]

Goal: for a given error norm ‖·‖_e, obtain upper bounds on ‖θ̂ − θ∗‖_e that are non-asymptotic and allow (n, p, s_1, s_2, ...) → ∞, where
    n ≡ sample size
    p ≡ dimension of parameter space Ω
    s_i ≡ structural parameters (e.g., sparsity, rank, graph degree)
Example: Sparse linear regression

[Figure: observation model y = Xθ∗ + w, with design matrix X ∈ R^{n×p} and θ∗ supported on a subset S, zero on S^c]

Set-up: noisy observations y = Xθ∗ + w with sparse θ∗

Estimator: Lasso program

    θ̂ ∈ argmin_θ { (1/n) Σ_{i=1}^n (y_i − x_i^T θ)^2 + λ_n Σ_{j=1}^p |θ_j| }

Some past work: Tibshirani, 1996; Chen et al., 1998; Donoho & Huo, 2001; Tropp, 2004; Fuchs, 2004; Efron et al., 2004; Meinshausen & Buhlmann, 2005; Candes & Tao, 2005; Donoho, 2005; Haupt & Nowak, 2005; Zhou & Yu, 2006; Zou, 2006; Koltchinskii, 2007; van de Geer, 2007; Bickel, Ritov & Tsybakov, 2008; Zhang, 2009 .....
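A minimal numerical sketch of the Lasso program above (my addition, not from the talk), assuming NumPy and scikit-learn are available; the problem sizes, noise level, and regularization constant are arbitrary illustrative choices.

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    n, p, s, sigma = 100, 500, 10, 0.5              # n << p, s-sparse truth
    X = rng.standard_normal((n, p))
    theta_star = np.zeros(p)
    theta_star[:s] = rng.standard_normal(s)
    y = X @ theta_star + sigma * rng.standard_normal(n)

    # sklearn's Lasso minimizes (1/2n)||y - X theta||_2^2 + alpha ||theta||_1,
    # the same program up to the constant in front of the quadratic loss.
    lam = 2 * sigma * np.sqrt(np.log(p) / n)        # rate-motivated choice of lambda_n
    theta_hat = Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_

    print("l2 error:", np.linalg.norm(theta_hat - theta_star))
    print("nonzeros in estimate:", np.count_nonzero(theta_hat))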
Example: Structured (inverse) covariance matrices

[Figure: zero pattern of the inverse covariance matrix and the corresponding undirected graph on nodes {1, ..., 5}]

Set-up: samples from a random vector with sparse covariance Σ or sparse inverse covariance Θ∗ ∈ R^{p×p}.

Estimator (for inverse covariance):

    Θ̂ ∈ argmin_Θ { 〈〈 (1/n) Σ_{i=1}^n x_i x_i^T , Θ 〉〉 − log det(Θ) + λ_n Σ_{b ∈ B} |||Θ_b|||_F }

Some past work: Yuan & Lin, 2006; d’Aspremont et al., 2007; Bickel & Levina, 2007; El Karoui, 2007; Rothman et al., 2007; Zhou et al., 2007; Friedman et al., 2008; Lam & Fan, 2008; Ravikumar et al., 2008; Zhou, Cai & Huang, 2009
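A hedged sketch (my addition) of the special case in which each block b is a single off-diagonal entry, i.e. the elementwise-ℓ1 graphical Lasso; it assumes scikit-learn's GraphicalLasso, whose objective matches the display above up to its treatment of the diagonal.

    import numpy as np
    from sklearn.covariance import GraphicalLasso

    rng = np.random.default_rng(0)
    p, n = 20, 500
    # sparse precision matrix: tridiagonal, i.e. a chain-structured graph
    Theta_star = np.eye(p) + 0.4 * (np.eye(p, k=1) + np.eye(p, k=-1))
    Sigma_star = np.linalg.inv(Theta_star)
    X = rng.multivariate_normal(np.zeros(p), Sigma_star, size=n)

    # solves  min_Theta  <<S, Theta>> - log det(Theta) + alpha * sum_{j != k} |Theta_jk|
    Theta_hat = GraphicalLasso(alpha=0.05).fit(X).precision_

    off_diag_support = np.count_nonzero(np.abs(Theta_hat) > 1e-3) - p
    print("estimated nonzero off-diagonal entries:", off_diag_support)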
Example: Sparse principal components analysis

[Figure: Σ = Z Z^T + D]

Set-up: covariance matrix Σ = Z Z^T + D, where the leading eigenspace Z has sparse columns.

Estimator:

    Θ̂ ∈ argmin_Θ { −〈〈Θ, Σ̂〉〉 + λ_n Σ_{(j,k)} |Θ_{jk}| }

where Σ̂ denotes the sample covariance.

Some past work: Johnstone, 2001; Jolliffe et al., 2003; Johnstone & Lu, 2004; Zou et al., 2004; d’Aspremont et al., 2007; Johnstone & Paul, 2008; Amini & Wainwright, 2008
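For a runnable point of comparison (my addition): scikit-learn's SparsePCA solves a related ℓ1-penalized dictionary-learning formulation rather than the SDP relaxation displayed above, but it targets the same sparse leading eigenvector; all constants here are illustrative.

    import numpy as np
    from sklearn.decomposition import SparsePCA

    rng = np.random.default_rng(0)
    p, n, s = 50, 400, 5
    z = np.zeros(p)
    z[:s] = 1.0 / np.sqrt(s)                        # sparse leading eigenvector
    Sigma = 4.0 * np.outer(z, z) + np.eye(p)        # spiked covariance Z Z^T + D
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

    v_hat = SparsePCA(n_components=1, alpha=1.0, random_state=0).fit(X).components_[0]
    nrm = np.linalg.norm(v_hat)
    if nrm > 0:
        v_hat = v_hat / nrm

    print("estimated support:", np.flatnonzero(np.abs(v_hat) > 1e-3))
    print("|<v_hat, z>| =", abs(v_hat @ z))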
Example: Low-rank matrix approximation

[Figure: Θ∗ = U D V^T, with Θ∗ ∈ R^{p1×p2}, U of size p1 × r, D of size r × r, V^T of size r × p2]

Set-up: matrix Θ∗ ∈ R^{p1×p2} with rank r ≪ min{p1, p2}.

Estimator:

    Θ̂ ∈ argmin_Θ { (1/n) Σ_{i=1}^n (y_i − 〈〈X_i, Θ〉〉)^2 + λ_n Σ_{j=1}^{min{p1,p2}} σ_j(Θ) }

Some past work: Fazel, 2001; Srebro et al., 2004; Recht, Fazel & Parrilo, 2007; Bach, 2008; Candes & Recht, 2008; Keshavan et al., 2009; Rohde & Tsybakov, 2009; Recht, 2009; Negahban & W., 2009 ...
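A proximal-gradient sketch (my addition) of the nuclear-norm program above, using singular-value soft-thresholding as the proximal step; the random observation matrices X_i, the step size, and λ_n are illustrative assumptions rather than anything prescribed in the talk.

    import numpy as np

    def svd_soft_threshold(M, tau):
        # proximal operator of tau * (nuclear norm): soft-threshold the singular values
        U, s, Vt = np.linalg.svd(M, full_matrices=False)
        return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

    rng = np.random.default_rng(0)
    p1, p2, r, n, sigma = 30, 30, 3, 2000, 0.1
    Theta_star = rng.standard_normal((p1, r)) @ rng.standard_normal((r, p2))
    Xs = rng.standard_normal((n, p1, p2))                  # Gaussian observation matrices X_i
    y = np.einsum('nij,ij->n', Xs, Theta_star) + sigma * rng.standard_normal(n)

    lam = 2 * sigma * (np.sqrt(p1) + np.sqrt(p2)) / np.sqrt(n)   # rate-motivated choice
    step = 0.2                                                    # heuristic step size
    Theta = np.zeros((p1, p2))
    for _ in range(200):
        resid = np.einsum('nij,ij->n', Xs, Theta) - y
        grad = (2.0 / n) * np.einsum('n,nij->ij', resid, Xs)      # gradient of the squared loss
        Theta = svd_soft_threshold(Theta - step * grad, step * lam)

    print("rank of estimate:", np.linalg.matrix_rank(Theta, tol=1e-3))
    print("Frobenius error:", np.linalg.norm(Theta - Theta_star))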
Matrix decomposition: Low-rank plus sparse

Matrix Y can be (approximately) decomposed into a sum:

[Figure: Y ∈ R^{d1×d2} ≈ U D V^T (low-rank, with U of size d1 × r, D of size r × r, V^T of size r × d2) plus a sparse matrix]

    Y = Θ∗ + Γ∗,   where Θ∗ is the low-rank component and Γ∗ is the sparse component

exact decomposition: initially studied by Chandrasekaran, Sanghavi, Parrilo & Willsky, 2009
some subsequent work: Candes et al., 2010; Hsu et al., 2010; Agarwal et al., 2011

Various applications:
robust collaborative filtering
graphical model selection with hidden variables
image/video segmentation
Matrix decomposition: Low-rank plus column sparse

Matrix Y can be (approximately) decomposed into a sum:

[Figure: Y ∈ R^{d1×d2} ≈ U D V^T (low-rank) plus a column-sparse matrix]

    Y = Θ∗ + Γ∗,   where Θ∗ is the low-rank component and Γ∗ is the column-sparse component

exact decomposition: initially studied by Xu, Caramanis & Sanghavi, 2010
later work: Agarwal, Negahban & W., 2011

Various applications:
robust collaborative filtering
robust principal components analysis
Motivation and roadmap

many results on different high-dimensional models
all based on estimators of the type:

    θ̂_{λ_n} ∈ argmin_{θ ∈ Ω} { L(θ; Z_1^n) + λ_n R(θ) }
      [estimate]                [loss function]  [regularizer]

Question: Is there a common set of underlying principles?

Answer: Yes, two essential ingredients.
(I) Restricted strong convexity of the loss function
(II) Decomposability of the regularizer
(I) Role of curvature

1. Curvature controls the difficulty of estimation:

[Figure: loss difference δL as a function of the perturbation ∆ around θ̂ and θ∗. (a) High curvature: easy to estimate. (b) Low curvature: harder.]

2. Curvature is captured by a lower bound on the Taylor-series error T_L(∆; θ∗):

    T_L(∆; θ∗) := L(θ∗ + ∆) − L(θ∗) − 〈∇L(θ∗), ∆〉 ≥ (γ/2) ‖∆‖_e^2

for all ∆ around θ∗.
High dimensions: no strong convexity!

[Figure: a loss surface over R^2 that is completely flat along one direction]

When p > n, the Hessian ∇^2 L(θ; Z_1^n) has a nullspace of dimension at least p − n.
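A two-line numerical check of this rank deficiency (my addition), using the least-squares Hessian X^T X / n as a concrete instance.

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 100, 1000
    X = rng.standard_normal((n, p))

    hessian = X.T @ X / n                    # Hessian of the least-squares loss
    rank = np.linalg.matrix_rank(hessian)
    print("rank:", rank, " nullspace dimension:", p - rank)   # at least p - n = 900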
Restricted strong convexity

Definition. The loss function L_n satisfies restricted strong convexity (RSC) with respect to the regularizer R if

    T_L(∆; θ∗) := L_n(θ∗ + ∆) − [ L_n(θ∗) + 〈∇L_n(θ∗), ∆〉 ]  ≥  (γ_ℓ/2) ‖∆‖_e^2 − τ_ℓ^2 R^2(∆)
                        [Taylor error]                            [lower curvature]   [tolerance]

for all ∆ in a suitable neighborhood of θ∗.

ordinary strong convexity: the special case with tolerance τ_ℓ = 0; it does not hold for most loss functions when p > n
RSC enforces a lower bound on curvature, but only when R^2(∆) ≪ ‖∆‖_e^2
Least-squares: RSC ≡ restricted eigenvalue

for the least-squares loss L(θ) = (1/2n) ‖y − Xθ‖_2^2, the Taylor error is exact:

    T_L(∆; θ∗) = L_n(θ∗ + ∆) − L_n(θ∗) − 〈∇L_n(θ∗), ∆〉 = (1/2n) ‖X∆‖_2^2.

Restricted nullspace/eigenvalue (RE) condition (van de Geer, 2007; Bickel et al., 2009):

    ‖X∆‖_2^2 / (2n) ≥ γ ‖∆‖_2^2   for all ∆ ∈ R^p with ‖∆‖_1 ≤ 2√s ‖∆‖_2

(an earlier formulation imposes the bound only over the cone {‖∆_{S^c}‖_1 ≤ ‖∆_S‖_1} for subsets S of size s)

holds with high probability for various sub-Gaussian designs when n ≳ s log(p/s)
fairly strong dependency between covariates is possible
Restricted strong convexity for GLMs

generalized linear model linking covariates x ∈ R^p to output y ∈ R:

    P(y | x, θ∗) ∝ exp{ y 〈x, θ∗〉 − Φ(〈x, θ∗〉) }.

Taylor series expansion involves the random Hessian

    H(θ) = (1/n) Σ_{i=1}^n Φ''(〈θ, X_i〉) X_i X_i^T ∈ R^{p×p}.

Proposition (Negahban, W., Ravikumar & Yu, 2010)
For zero-mean sub-Gaussian covariates X_i with covariance Σ and any GLM,

    T_L(∆; θ∗) ≥ c_3 ‖Σ^{1/2} ∆‖_2 { ‖Σ^{1/2} ∆‖_2 − c_4 κ(Σ) √(log p / n) ‖∆‖_1 }   for all ‖∆‖_2 ≤ 1,

with probability at least 1 − c_1 exp(−c_2 n). Here κ(Σ) = max_j Σ_jj.
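As one concrete GLM instance (my addition), a sketch of ℓ1-regularized logistic regression; scikit-learn parameterizes the penalty through C, which corresponds roughly to 1/(n λ_n), and the constants below are illustrative.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n, p, s = 200, 1000, 10
    X = rng.standard_normal((n, p))
    theta_star = np.zeros(p)
    theta_star[:s] = 1.0
    prob = 1.0 / (1.0 + np.exp(-X @ theta_star))       # logistic link: Phi(t) = log(1 + e^t)
    y = rng.binomial(1, prob)

    lam = np.sqrt(np.log(p) / n)                       # rate-motivated choice
    clf = LogisticRegression(penalty='l1', C=1.0 / (n * lam),
                             solver='liblinear', fit_intercept=False).fit(X, y)
    theta_hat = clf.coef_.ravel()
    print("nonzeros:", np.count_nonzero(theta_hat),
          " l2 error:", np.linalg.norm(theta_hat - theta_star))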
Violating matrix incoherence (elementwise/RIP)

Important: incoherence/RIP conditions imply the restricted nullspace/eigenvalue condition, but they are far from necessary. They are very easy to violate.....
Violating matrix incoherence (elementwise/RIP)

Form a random design matrix

    X = [x_1  x_2  ...  x_p] ∈ R^{n×p}   (p columns, n rows X_1^T, ..., X_n^T),   each row X_i ~ N(0, Σ), i.i.d.

Example: for some µ ∈ (0, 1), consider the covariance matrix

    Σ = (1 − µ) I_{p×p} + µ 1 1^T.

Elementwise incoherence violated: for any j ≠ k,

    P[ 〈x_j, x_k〉 / n ≥ µ − ε ] ≥ 1 − c_1 exp(−c_2 n ε^2).

RIP constants tend to infinity as (n, s) increases:

    P[ ||| X_S^T X_S / n − I_{s×s} |||_2 ≥ µ (s − 1) − 1 − ε ] ≥ 1 − c_1 exp(−c_2 n ε^2).
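A small simulation (my addition) of these two claims for the design Σ = (1 − µ)I + µ11^T; sizes are arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)
    n, p, mu = 200, 500, 0.5
    Sigma = (1 - mu) * np.eye(p) + mu * np.ones((p, p))
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

    # elementwise incoherence: off-diagonal inner products concentrate near mu, not 0
    G = X.T @ X / n
    print("mean off-diagonal <x_j, x_k>/n:", G[~np.eye(p, dtype=bool)].mean())

    # RIP-type constant grows with the subset size s
    for s in (10, 50, 200):
        dev = np.linalg.norm(G[:s, :s] - np.eye(s), ord=2)   # ||| X_S^T X_S / n - I |||_2
        print(f"s = {s:3d}:  spectral deviation = {dev:6.1f}   (mu*(s-1) = {mu * (s - 1):5.1f})")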
Noiseless ℓ1 recovery for µ = 0.5

[Figure: probability of exact recovery versus the rescaled sample size α := n / (s log(p/s)), for p = 128, 256, 512.]
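A simulation in the spirit of this figure (my addition): solve the noiseless basis-pursuit LP, min ‖θ‖_1 subject to Xθ = y, with SciPy's linprog for the correlated design above, and record the empirical recovery probability as the rescaled sample size α = n/(s log(p/s)) grows. The problem sizes are kept small so the LPs stay fast; thresholds and trial counts are arbitrary.

    import numpy as np
    from scipy.optimize import linprog

    def basis_pursuit(X, y):
        # min ||theta||_1  s.t.  X theta = y, as an LP in (theta_plus, theta_minus) >= 0
        n, p = X.shape
        res = linprog(np.ones(2 * p), A_eq=np.hstack([X, -X]), b_eq=y,
                      bounds=(0, None), method='highs')
        return res.x[:p] - res.x[p:]

    rng = np.random.default_rng(0)
    p, s, mu, trials = 128, 8, 0.5, 20
    Sigma = (1 - mu) * np.eye(p) + mu * np.ones((p, p))

    for alpha in (0.5, 1.0, 1.5, 2.0):
        n = int(np.ceil(alpha * s * np.log(p / s)))
        hits = 0
        for _ in range(trials):
            X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
            theta_star = np.zeros(p)
            theta_star[rng.choice(p, size=s, replace=False)] = rng.standard_normal(s)
            theta_hat = basis_pursuit(X, X @ theta_star)
            hits += np.linalg.norm(theta_hat - theta_star, np.inf) < 1e-4
        print(f"alpha = {alpha:.1f} (n = {n:3d}):  P[exact recovery] ~ {hits / trials:.2f}")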
(II) Decomposable regularizers

Subspace M: approximation to the model parameters.
Complementary subspace M⊥: undesirable deviations.

Regularizer R decomposes across (M, M⊥) if

    R(α + β) = R(α) + R(β)   for all α ∈ M and β ∈ M⊥.

Includes: (weighted) ℓ1-norms, group-sparse norms, the nuclear norm, and sums of decomposable norms.
Example: Sparse vectors and ℓ1-regularization

for each subset S ⊂ {1, ..., p}, define the subspace pair

    M(S) := {θ ∈ R^p | θ_{S^c} = 0},    M⊥(S) := {θ ∈ R^p | θ_S = 0}.

decomposability of the ℓ1-norm:

    ‖θ_S + θ_{S^c}‖_1 = ‖θ_S‖_1 + ‖θ_{S^c}‖_1   for all θ_S ∈ M(S) and θ_{S^c} ∈ M⊥(S).

natural extension to the group Lasso: for a collection of groups {G_j} that partition {1, ..., p}, use the group norm

    ‖θ‖_{G,α} = Σ_j ‖θ_{G_j}‖_α   for some α ∈ [1, ∞].
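A tiny numerical illustration (my addition) of both decompositions for a fixed support S; the vectors and groups are arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)
    p = 10
    alpha, beta = np.zeros(p), np.zeros(p)
    alpha[:4] = rng.standard_normal(4)             # alpha in M(S),       S = {0, 1, 2, 3}
    beta[4:] = rng.standard_normal(6)              # beta  in M^perp(S):  zero on S

    lhs, rhs = np.abs(alpha + beta).sum(), np.abs(alpha).sum() + np.abs(beta).sum()
    print("l1 decomposability:", np.isclose(lhs, rhs))

    # group norm with groups of size 2 (alpha = 2): also decomposes, because each
    # group lies entirely inside S or inside S^c
    groups = np.arange(p).reshape(5, 2)
    gnorm = lambda t: sum(np.linalg.norm(t[g]) for g in groups)
    print("group decomposability:", np.isclose(gnorm(alpha + beta), gnorm(alpha) + gnorm(beta)))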
Example: Low-rank matrices and nuclear norm

for each pair of r-dimensional subspaces U ⊆ R^{p1} and V ⊆ R^{p2}:

    M(U, V) := {Θ ∈ R^{p1×p2} | row(Θ) ⊆ V, col(Θ) ⊆ U}
    M⊥(U, V) := {Γ ∈ R^{p1×p2} | row(Γ) ⊆ V⊥, col(Γ) ⊆ U⊥}.

[Figure: sparsity patterns of (a) Θ ∈ M, (b) Γ ∈ M⊥, (c) Σ ∈ M]

by construction, Θ^T Γ = 0 for all Θ ∈ M(U, V) and Γ ∈ M⊥(U, V)

decomposability of the nuclear norm |||Θ|||_1 = Σ_{j=1}^{min{p1,p2}} σ_j(Θ):

    |||Θ + Γ|||_1 = |||Θ|||_1 + |||Γ|||_1   for all Θ ∈ M(U, V) and Γ ∈ M⊥(U, V).
Significance of decomposability

[Figure: the constraint set C plotted in the coordinates R(Π_M(∆)) and R(Π_{M⊥}(∆)): (a) C for an exact model is a cone; (b) C for an approximate model is star-shaped.]

Lemma
Suppose that L is convex and R is decomposable w.r.t. M. Then as long as λ_n ≥ 2 R∗(∇L(θ∗; Z_1^n)), the error vector ∆̂ = θ̂_{λ_n} − θ∗ belongs to

    C(M, M⊥; θ∗) := {∆ ∈ Ω | R(Π_{M⊥}(∆)) ≤ 3 R(Π_M(∆)) + 4 R(Π_{M⊥}(θ∗))}.
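An empirical check of the lemma for the Lasso (my addition): here R is the ℓ1-norm, R∗ the ℓ∞-norm, M = M(S), and θ∗ ∈ M(S), so with λ_n = 2‖X^T w/n‖_∞ the error should satisfy ‖∆̂_{S^c}‖_1 ≤ 3‖∆̂_S‖_1. Problem sizes are arbitrary.

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(1)
    n, p, s, sigma = 100, 400, 5, 0.5
    X = rng.standard_normal((n, p))
    theta_star = np.zeros(p)
    theta_star[:s] = 1.0
    w = sigma * rng.standard_normal(n)
    y = X @ theta_star + w

    lam = 2 * np.abs(X.T @ w / n).max()            # lambda_n = 2 R*(grad L(theta*)) exactly
    theta_hat = Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_

    delta = theta_hat - theta_star
    in_cone = np.abs(delta[s:]).sum() <= 3 * np.abs(delta[:s]).sum() + 1e-10
    print("error vector lies in the cone C:", in_cone)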
Main theorem

Estimator:

    θ̂_{λ_n} ∈ argmin_{θ ∈ R^p} { L_n(θ; Z_1^n) + λ_n R(θ) },

where L satisfies RSC(γ, τ) w.r.t. the regularizer R.

Theorem (Negahban, Ravikumar, W., & Yu, 2009)
For any regularization parameter λ_n ≥ 2 R∗(∇L(θ∗; Z_1^n)), any solution θ̂ satisfies

    ‖θ̂ − θ∗‖_e^2 ≲ (λ'_n / γ(L)) [ (λ'_n / γ(L)) Ψ^2(M) + R(Π_{M⊥}(θ∗)) ],

where λ'_n = max{λ_n, τ(L)}. In particular, if θ∗ ∈ M, the approximation term vanishes and

    ‖θ̂_{λ_n} − θ∗‖_e^2 ≲ (1 / γ^2(L)) (λ'_n)^2 Ψ^2(M).

Quantities that control the rates:
curvature in RSC: γ_ℓ
tolerance in RSC: τ
dual norm of the regularizer: R∗(v) := sup_{R(u) ≤ 1} 〈v, u〉
optimal subspace constant: Ψ(M) = sup_{θ ∈ M \ {0}} R(θ) / ‖θ‖_e
Example: Linear regression (exact sparsity)

Lasso program:  min_{θ ∈ R^p} { (1/2n) ‖y − Xθ‖_2^2 + λ_n ‖θ‖_1 }

RSC corresponds to a lower bound on restricted eigenvalues of X^T X ∈ R^{p×p}
for an s-sparse vector, ‖θ‖_1 ≤ √s ‖θ‖_2

Corollary
Suppose that the true parameter θ∗ is exactly s-sparse. Under RSC, and with λ_n ≥ 2 ‖X^T w / n‖_∞, any Lasso solution satisfies

    ‖θ̂ − θ∗‖_2^2 ≤ (4 / γ^2) s λ_n^2.

Some stochastic instances (recovering known results):
Compressed sensing: X_ij ~ N(0, 1) and bounded noise ‖w‖_2 ≤ σ √n
Deterministic design: X with bounded columns and w_i ~ N(0, σ^2)

In either case,

    ‖X^T w / n‖_∞ ≤ √(3 σ^2 log p / n)  w.h.p.   ⟹   ‖θ̂ − θ∗‖_2^2 ≤ (12 σ^2 / γ^2) (s log p / n).

(e.g., Candes & Tao, 2007; Huang & Zhang, 2008; Meinshausen & Yu, 2008; Bickel et al., 2008)
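A numerical companion to this corollary (my addition): fit the Lasso at λ_n = 2√(3σ² log p / n) and compare the squared error to the predicted s log p / n scaling; the restricted-eigenvalue constant γ is not computed here, so only the scaling in n is examined.

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    p, s, sigma = 1000, 10, 1.0
    for n in (200, 400, 800, 1600):
        X = rng.standard_normal((n, p))
        theta_star = np.zeros(p)
        theta_star[:s] = 1.0
        y = X @ theta_star + sigma * rng.standard_normal(n)

        lam = 2 * np.sqrt(3 * sigma**2 * np.log(p) / n)
        theta_hat = Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_

        err2 = np.sum((theta_hat - theta_star) ** 2)
        rate = sigma**2 * s * np.log(p) / n           # predicted scaling, up to 12 / gamma^2
        print(f"n = {n:5d}:  squared error = {err2:.3f},   s log p / n = {rate:.3f}")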
Example: Linear regression (weak sparsity)

for some q ∈ [0, 1], say θ∗ belongs to the ℓ_q-“ball”

    B_q(R_q) := { θ ∈ R^p | Σ_{j=1}^p |θ_j|^q ≤ R_q }.

Corollary
For θ∗ ∈ B_q(R_q), any Lasso solution satisfies (w.h.p.)

    ‖θ̂ − θ∗‖_2^2 ≲ σ^2 R_q ( log p / n )^{1 − q/2}.

rate known to be minimax optimal (Raskutti, W. & Yu, 2009)
Example: Group-structured regularizers

Many applications exhibit sparsity with more structure.....

[Figure: a coefficient vector partitioned into groups G_1, G_2, G_3]

divide the index set {1, 2, ..., p} into groups G = {G_1, G_2, ..., G_T}
for parameters ν_t ∈ [1, ∞], define the block norm

    ‖θ‖_{ν,G} := Σ_{t=1}^T ‖θ_{G_t}‖_{ν_t}

group/block Lasso program:

    θ̂_{λ_n} ∈ argmin_{θ ∈ R^p} { (1/2n) ‖y − Xθ‖_2^2 + λ_n ‖θ‖_{ν,G} }

different versions studied by various authors (Wright et al., 2005; Tropp et al., 2006; Yuan & Lin, 2006; Baraniuk, 2008; Obozinski et al., 2008; Zhao et al., 2008; Bach et al., 2009)
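scikit-learn has no general group Lasso, so here is a minimal proximal-gradient sketch (my addition) of the ℓ1/ℓ2 instance of the program above (ν_t ≡ 2); the proximal step is blockwise soft-thresholding, and the group structure, noise level, and λ_n are illustrative.

    import numpy as np

    def prox_group_l2(theta, groups, tau):
        # blockwise soft-thresholding: prox of tau * sum_t ||theta_{G_t}||_2
        out = theta.copy()
        for g in groups:
            nrm = np.linalg.norm(theta[g])
            out[g] = 0.0 if nrm <= tau else (1 - tau / nrm) * theta[g]
        return out

    rng = np.random.default_rng(0)
    T, m, s_G, n, sigma = 50, 4, 3, 200, 0.5           # T groups of size m, s_G of them active
    p = T * m
    groups = [np.arange(t * m, (t + 1) * m) for t in range(T)]
    X = rng.standard_normal((n, p))
    theta_star = np.zeros(p)
    for g in groups[:s_G]:
        theta_star[g] = rng.standard_normal(m)
    y = X @ theta_star + sigma * rng.standard_normal(n)

    lam = 2 * sigma * (np.sqrt(m) + np.sqrt(np.log(T))) / np.sqrt(n)   # rate-motivated choice
    step = 1.0 / (np.linalg.norm(X, ord=2) ** 2 / n)                   # 1 / Lipschitz constant
    theta = np.zeros(p)
    for _ in range(500):
        grad = X.T @ (X @ theta - y) / n               # gradient of (1/2n)||y - X theta||_2^2
        theta = prox_group_l2(theta - step * grad, groups, step * lam)

    active = [t for t, g in enumerate(groups) if np.linalg.norm(theta[g]) > 1e-6]
    print("active groups:", active, " l2 error:", np.linalg.norm(theta - theta_star))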
Convergence rates for general group Lasso

Corollary
Say θ∗ is supported on s_G groups, and X satisfies RSC. Then for regularization parameter

    λ_n ≥ 2 max_{t=1,...,T} ‖(X^T w / n)_{G_t}‖_{ν_t^∗},   where 1/ν_t^∗ = 1 − 1/ν_t,

any solution θ̂_{λ_n} satisfies

    ‖θ̂_{λ_n} − θ∗‖_2 ≤ (2 / γ_ℓ) Ψ_ν(S_G) λ_n,   where Ψ_ν(S_G) = sup_{θ ∈ M(S_G) \ {0}} ‖θ‖_{ν,G} / ‖θ‖_2.

Some special cases, with m ≡ maximum group size:

1. ℓ1/ℓ2 regularization: group norm with ν = 2

    ‖θ̂_{λ_n} − θ∗‖_2^2 = O( s_G m / n + s_G log T / n ).

This rate is minimax-optimal. (Raskutti, W. & Yu, 2010)

2. ℓ1/ℓ∞ regularization: group norm with ν = ∞

    ‖θ̂_{λ_n} − θ∗‖_2^2 = O( s_G m^2 / n + s_G log T / n ).
Example: Low-rank matrices and nuclear norm

low-rank matrix Θ∗ ∈ R^{p1×p2} that is exactly (or approximately) low-rank
noisy/partial observations of the form

    y_i = 〈〈X_i, Θ∗〉〉 + w_i,   i = 1, ..., n,   with w_i i.i.d. noise

estimate by solving the semidefinite program (SDP):

    Θ̂ ∈ argmin_Θ { (1/n) Σ_{i=1}^n (y_i − 〈〈X_i, Θ〉〉)^2 + λ_n |||Θ|||_1 },   where |||Θ|||_1 = Σ_{j=1}^{min{p1,p2}} σ_j(Θ)

studied in past work (Fazel, 2001; Srebro et al., 2004; Bach, 2008)
observations based on random projection (Recht, Fazel & Parrilo, 2007)
work on matrix completion (Srebro, 2004; Candes & Recht, 2008; Recht, 2009; Negahban & W., 2010)
other work on general noisy observation models (Rohde & Tsybakov, 2009; Negahban & W., 2009)
Rates for (near) low-rank estimation

For a parameter q ∈ [0, 1], the set of near low-rank matrices:

    B_q(R_q) = { Θ∗ ∈ R^{p1×p2} | Σ_{j=1}^{min{p1,p2}} |σ_j(Θ∗)|^q ≤ R_q }.

Corollary (Negahban & W., 2009)
Under the RSC condition, with regularization parameter λ_n ≥ 16 σ ( √(p1/n) + √(p2/n) ), we have w.h.p.

    |||Θ̂ − Θ∗|||_F^2 ≤ (c_0 R_q / γ_ℓ^2) ( σ^2 (p1 + p2) / n )^{1 − q/2}.

for a rank-r matrix M,

    |||M|||_1 = Σ_{j=1}^r σ_j(M) ≤ √r √( Σ_{j=1}^r σ_j^2(M) ) = √r |||M|||_F

solve the nuclear-norm regularized program with λ_n ≥ (2/n) ||| Σ_{i=1}^n w_i X_i |||_2
Weighted matrix completion

Random operator X : R^{p×p} → R^n with

    [X(Θ∗)]_i = Θ∗_{a(i) b(i)},

where (a(i), b(i)) is a matrix index sampled according to a weight function ω.

Even in the noiseless setting, the model is unidentifiable. Consider the rank-one matrix

    Θ∗ = e_1 e_1^T   (a single 1 in the top-left entry, zeros elsewhere).

Past work has imposed eigen-incoherence conditions. (e.g., Recht & Candes, 2008; Gross, 2009)
A milder “spikiness” condition

Consider the “poisoned” low-rank matrix

    Θ∗ = Γ∗ + δ e_1 e_1^T,

where Γ∗ has rank r − 1 and all of its eigenvectors are perpendicular to e_1.

Excluded by eigen-incoherence for all δ > 0.

Control by the spikiness ratio (Negahban & W., 2010):

    1 ≤ p ‖Θ∗‖_∞ / |||Θ∗|||_F ≤ p.
Restricted strong convexity for matrix completion

matrix completion is based on a random operator X_n : R^{p×p} → R^n such that

    [X_n(Θ)]_i = Θ_{a_i b_i},   where (a_i, b_i) is chosen according to the weighted measure ω

in expectation: E[ ‖X_n(Θ)‖_2^2 ] / n = |||Θ|||_{ω(F)}^2.

Proposition (Negahban & W., 2010)
For the weighted matrix completion operator X_n, there are universal positive constants (c_1, c_2) such that the deviation Z_n(Θ) := | ‖X_n(Θ)‖_2^2 / n − |||Θ|||_{ω(F)}^2 | satisfies, uniformly over Θ ∈ R^{p×p},

    Z_n(Θ) ≤ c_1 p ‖Θ‖_{ω(∞)} |||Θ|||_{ω(nuc)} √(p log p / n)  +  c_2 ( p ‖Θ‖_{ω(∞)} √(p log p / n) )^2
              [“low-rank” term]                                   [“spikiness” term]

with probability at least 1 − exp(−p log p).
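A sketch of noisy matrix completion under a uniform sampling measure ω (my addition): proximal gradient on the nuclear-norm program, with the spikiness condition imitated crudely by clipping the entrywise magnitude of the iterates. Sampling rates, step size, λ_n, and the clipping level are all illustrative.

    import numpy as np

    def svd_soft_threshold(M, tau):
        U, s, Vt = np.linalg.svd(M, full_matrices=False)
        return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

    rng = np.random.default_rng(0)
    p, r, sigma = 50, 3, 0.1
    Theta_star = rng.standard_normal((p, r)) @ rng.standard_normal((r, p)) / np.sqrt(r)

    n = 5 * r * p                                        # about 30% of the p^2 entries
    rows, cols = rng.integers(0, p, size=n), rng.integers(0, p, size=n)   # uniform omega
    y = Theta_star[rows, cols] + sigma * rng.standard_normal(n)

    counts = np.zeros((p, p))
    np.add.at(counts, (rows, cols), 1.0)
    step = n / (4.0 * counts.max())                      # below 1/L for this separable quadratic
    lam = 2 * sigma * np.sqrt(p * np.log(p) / n)         # rate-motivated choice
    clip = 3 * np.abs(Theta_star).max()                  # crude stand-in for spikiness control

    Theta = np.zeros((p, p))
    for _ in range(300):
        grad = np.zeros((p, p))
        np.add.at(grad, (rows, cols), 2.0 * (Theta[rows, cols] - y) / n)
        Theta = svd_soft_threshold(Theta - step * grad, step * lam)
        Theta = np.clip(Theta, -clip, clip)              # keep iterates non-spiky

    print("relative Frobenius error:",
          np.linalg.norm(Theta - Theta_star) / np.linalg.norm(Theta_star))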
Summary

convergence rates for high-dimensional estimators, based on two ingredients:
decomposability of the regularizer R
restricted strong convexity of the loss function

actual rates determined by:
noise, measured in the dual norm R∗
subspace constant Ψ for moving from R to the error norm ‖·‖_e
restricted strong convexity constant

recovered some known results as corollaries:
Lasso with hard sparsity
multivariate group Lasso
inverse covariance matrix estimation via log-determinant

derived some new results on ℓ2 or Frobenius norm error:
models with weak sparsity
log-linear models with weak/exact sparsity
low-rank matrix estimation

other models? other error metrics?
Some papers (www.eecs.berkeley.edu/~wainwrig)

1. A. Agarwal, S. Negahban, and M. J. Wainwright (2011). Fast global convergence rates of gradient methods for high-dimensional inference. Arxiv, March 2011.

2. S. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu (2009). A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Full-length version: arxiv.org/abs/1010.2731v1. Preliminary version: NIPS Conference, December 2009.

3. S. Negahban and M. J. Wainwright (2010). Restricted strong convexity and (weighted) matrix completion: Optimal bounds with noise. arxiv.org/abs/0112.5100, September 2010.

4. G. Raskutti, M. J. Wainwright, and B. Yu (2009). Minimax rates for linear regression over ℓq-balls. IEEE Transactions on Information Theory, October 2011. (First appeared at Allerton Conference, September 2009.)