General principles for high-dimensional estimation: Statistics and computation · 2011-10-03

TRANSCRIPT

Page 1:

General principles for high-dimensional estimation: Statistics and computation

Martin Wainwright
Statistics and EECS, UC Berkeley

Joint work with: Garvesh Raskutti, Sahand Negahban, Pradeep Ravikumar, Bin Yu

Pages 2-4:

Introduction

High-dimensional data sets are everywhere:
biological data (genes, proteins, etc.)
social network data
recommender systems (Amazon, Netflix, etc.)
astronomy datasets

Question: Suppose that n = 100 and p = 1000. Do we expect theory requiring n → +∞ and p = O(1) to be useful?

Modern viewpoint:
non-asymptotic results (valid for all (n, p))
allow for n ≪ p or n ≍ p
investigate various types of low-dimensional structure

Page 5:

Loss functions and regularization

Models: indexed class of probability distributions $\{P_\theta \mid \theta \in \Omega\}$

Data: samples $Z_1^n = \{(x_i, y_i),\; i = 1, \ldots, n\}$ drawn from unknown $P_{\theta^*}$

Estimation: minimize a loss function plus a regularization term:

$$\underbrace{\hat{\theta}}_{\text{estimate}} \;\in\; \arg\min_{\theta \in \Omega} \Big\{ \underbrace{\mathcal{L}(\theta; Z_1^n)}_{\text{loss function}} \;+\; \lambda_n \underbrace{\mathcal{R}(\theta)}_{\text{regularizer}} \Big\}.$$

Goal: for a given error norm $\|\cdot\|_e$, want upper bounds on $\|\hat{\theta} - \theta^*\|_e$ that are non-asymptotic, allowing $(n, p, s_1, s_2, \ldots) \to \infty$, where
⋆ $n$ ≡ sample size
⋆ $p$ ≡ dimension of parameter space $\Omega$
⋆ $s_i$ ≡ structural parameters (e.g., sparsity, rank, graph degree)
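As a concrete rendering of this template, the sketch below (Python/NumPy; a minimal sketch, with illustrative function names not taken from the talk) writes down the generic objective for one running example, the least-squares loss with an ℓ1 regularizer:

```python
import numpy as np

def squared_loss(theta, X, y):
    """Least-squares loss L(theta; Z_1^n) = (1/2n) * ||y - X theta||_2^2."""
    n = X.shape[0]
    resid = y - X @ theta
    return resid @ resid / (2 * n)

def l1_regularizer(theta):
    """R(theta) = ||theta||_1, one example of a decomposable regularizer."""
    return np.abs(theta).sum()

def objective(theta, X, y, lam):
    """Generic regularized M-estimator objective: loss + lambda_n * R."""
    return squared_loss(theta, X, y) + lam * l1_regularizer(theta)
```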

Page 6:

Example: Sparse linear regression

[Figure: observation model $y = X\theta^* + w$ with sparse $\theta^*$ supported on $S$ and zero on $S^c$; design matrix $X$ is $n \times p$.]

Set-up: noisy observations $y = X\theta^* + w$ with sparse $\theta^*$.

Estimator: Lasso program

$$\hat{\theta} \;\in\; \arg\min_{\theta} \Big\{ \frac{1}{n} \sum_{i=1}^n (y_i - x_i^T \theta)^2 \;+\; \lambda_n \sum_{j=1}^p |\theta_j| \Big\}$$

Some past work: Tibshirani, 1996; Chen et al., 1998; Donoho & Huo, 2001; Tropp, 2004; Fuchs, 2004; Efron et al., 2004; Meinshausen & Buhlmann, 2005; Candes & Tao, 2005; Donoho, 2005; Haupt & Nowak, 2005; Zhao & Yu, 2006; Zou, 2006; Koltchinskii, 2007; van de Geer, 2007; Bickel, Ritov & Tsybakov, 2008; Zhang, 2009; ...
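A minimal solver sketch for this program (Python/NumPy, plain proximal gradient / ISTA; the step-size choice and iteration count are illustrative assumptions, not prescriptions from the talk):

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft-thresholding: prox of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Proximal gradient (ISTA) for (1/n)||y - X theta||_2^2 + lam * ||theta||_1."""
    n, p = X.shape
    # Lipschitz constant of the gradient of the smooth part, (2/n) X^T (X theta - y).
    L = 2.0 * np.linalg.norm(X, 2) ** 2 / n
    theta = np.zeros(p)
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ theta - y) / n
        theta = soft_threshold(theta - grad / L, lam / L)
    return theta
```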

Page 7:

Example: Structured (inverse) covariance matrices

[Figure: graph on vertices 1-5 and the matching zero pattern of its inverse covariance matrix.]

Set-up: samples from a random vector with sparse covariance $\Sigma$ or sparse inverse covariance $\Theta^* \in \mathbb{R}^{p \times p}$.

Estimator (for inverse covariance):

$$\hat{\Theta} \;\in\; \arg\min_{\Theta} \Big\{ \Big\langle\!\Big\langle \frac{1}{n} \sum_{i=1}^n x_i x_i^T,\; \Theta \Big\rangle\!\Big\rangle \;-\; \log\det(\Theta) \;+\; \lambda_n \sum_{b \in B} |||\Theta_b|||_F \Big\}$$

Some past work: Yuan & Lin, 2006; d'Aspremont et al., 2007; Bickel & Levina, 2007; El Karoui, 2007; Rothman et al., 2007; Zhou et al., 2007; Friedman et al., 2008; Lam & Fan, 2008; Ravikumar et al., 2008; Zhou, Cai & Huang, 2009
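A sketch of evaluating this penalized log-determinant objective numerically (Python/NumPy; the elementwise off-diagonal ℓ1 penalty below is the simplest instance of the block penalty above, and the function name is illustrative):

```python
import numpy as np

def logdet_objective(Theta, X, lam):
    """Penalized Gaussian log-determinant loss:
    <<S, Theta>> - log det(Theta) + lam * sum_{j != k} |Theta_jk|,
    where S is the sample covariance of the (mean-zero) rows of X."""
    n = X.shape[0]
    S = X.T @ X / n
    sign, logdet = np.linalg.slogdet(Theta)
    if sign <= 0:                        # objective is +inf off the PD cone
        return np.inf
    offdiag = np.abs(Theta).sum() - np.abs(np.diag(Theta)).sum()
    return np.sum(S * Theta) - logdet + lam * offdiag
```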

Page 8:

Example: Sparse principal components analysis

[Figure: covariance decomposition $\Sigma = ZZ^T + D$.]

Set-up: covariance matrix $\Sigma = ZZ^T + D$, where the leading eigenspace $Z$ has sparse columns.

Estimator:

$$\hat{\Theta} \;\in\; \arg\min_{\Theta} \Big\{ -\langle\!\langle \Theta, \hat{\Sigma} \rangle\!\rangle \;+\; \lambda_n \sum_{(j,k)} |\Theta_{jk}| \Big\}$$

Some past work: Johnstone, 2001; Jolliffe et al., 2003; Johnstone & Lu, 2004; Zou et al., 2004; d'Aspremont et al., 2007; Johnstone & Paul, 2008; Amini & Wainwright, 2008

Page 9:

Example: Low-rank matrix approximation

[Figure: factorization $\Theta^* = U D V^T$, with $\Theta^*$ of size $p_1 \times p_2$, $U$ of size $p_1 \times r$, $D$ of size $r \times r$, and $V^T$ of size $r \times p_2$.]

Set-up: matrix $\Theta^* \in \mathbb{R}^{p_1 \times p_2}$ with rank $r \ll \min\{p_1, p_2\}$.

Estimator:

$$\hat{\Theta} \;\in\; \arg\min_{\Theta} \Big\{ \frac{1}{n} \sum_{i=1}^n \big(y_i - \langle\!\langle X_i, \Theta \rangle\!\rangle\big)^2 \;+\; \lambda_n \sum_{j=1}^{\min\{p_1, p_2\}} \sigma_j(\Theta) \Big\}$$

Some past work: Fazel, 2001; Srebro et al., 2004; Recht, Fazel & Parrilo, 2007; Bach, 2008; Candes & Recht, 2008; Keshavan et al., 2009; Rohde & Tsybakov, 2009; Recht, 2009; Negahban & W., 2009; ...
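The nuclear-norm penalty admits a closed-form proximal operator via singular-value soft-thresholding, which is the computational workhorse behind programs of this form; a brief sketch (Python/NumPy, illustrative):

```python
import numpy as np

def svd_soft_threshold(Theta, t):
    """Prox of t * |||.|||_1 (nuclear norm): soft-threshold the singular values."""
    U, s, Vt = np.linalg.svd(Theta, full_matrices=False)
    s_shrunk = np.maximum(s - t, 0.0)
    return U @ np.diag(s_shrunk) @ Vt
```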

Pages 10-11:

Matrix decomposition: Low-rank plus sparse

Matrix $Y$ can be (approximately) decomposed into a sum:

[Figure: $Y \approx U D V^T + \text{sparse matrix}$, with $Y$ of size $d_1 \times d_2$, $U$ of size $d_1 \times r$, $D$ of size $r \times r$, and $V^T$ of size $r \times d_2$.]

$$Y \;=\; \underbrace{\Theta^*}_{\text{low-rank component}} \;+\; \underbrace{\Gamma^*}_{\text{sparse component}}$$

exact decomposition: initially studied by Chandrasekaran, Sanghavi, Parrilo & Willsky, 2009
some subsequent work: Candes et al., 2010; Hsu et al., 2010; Agarwal et al., 2011

Various applications:
robust collaborative filtering
graphical model selection with hidden variables
image/video segmentation

Pages 12-13:

Matrix decomposition: Low-rank plus column sparse

Matrix $Y$ can be (approximately) decomposed into a sum:

[Figure: $Y \approx U D V^T + \text{column-sparse matrix}$, with $Y$ of size $d_1 \times d_2$, $U$ of size $d_1 \times r$, $D$ of size $r \times r$, and $V^T$ of size $r \times d_2$.]

$$Y \;=\; \underbrace{\Theta^*}_{\text{low-rank component}} \;+\; \underbrace{\Gamma^*}_{\text{column-sparse component}}$$

exact decomposition: initially studied by Xu, Caramanis & Sanghavi, 2010
later work: Agarwal, Negahban & W., 2011

Various applications:
robust collaborative filtering
robust principal components analysis

Pages 14-16:

Motivation and roadmap

many results on different high-dimensional models

all based on estimators of the type:

$$\underbrace{\hat{\theta}_{\lambda_n}}_{\text{estimate}} \;\in\; \arg\min_{\theta \in \Omega} \Big\{ \underbrace{\mathcal{L}(\theta; Z_1^n)}_{\text{loss function}} \;+\; \lambda_n \underbrace{\mathcal{R}(\theta)}_{\text{regularizer}} \Big\}.$$

Question: Is there a common set of underlying principles?

Answer: Yes, two essential ingredients.

(I) Restricted strong convexity of the loss function

(II) Decomposability of the regularizer

Pages 17-18:

(I) Role of curvature

1. Curvature controls the difficulty of estimation:

[Figure: loss difference $\delta\mathcal{L}$ around $\hat{\theta}$ and $\theta^*$ in two cases. (a) High curvature: easy to estimate. (b) Low curvature: harder.]

2. Captured by a lower bound on the Taylor-series error $T_{\mathcal{L}}(\Delta; \theta^*)$:

$$\underbrace{\mathcal{L}(\theta^* + \Delta) - \mathcal{L}(\theta^*) - \langle \nabla\mathcal{L}(\theta^*), \Delta \rangle}_{T_{\mathcal{L}}(\Delta;\, \theta^*)} \;\ge\; \frac{\gamma}{2}\, \|\Delta\|_e^2$$

for all $\Delta$ around $\theta^*$.
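For the least-squares loss this Taylor error can be computed exactly; here is a quick numerical check (Python/NumPy, illustrative) that it matches the quadratic form $\|X\Delta\|_2^2 / (2n)$ used later:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 200
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

def loss(theta):
    """Least-squares loss (1/2n)||y - X theta||_2^2."""
    r = y - X @ theta
    return r @ r / (2 * n)

def grad(theta):
    return -X.T @ (y - X @ theta) / n

theta_star = rng.standard_normal(p)
Delta = rng.standard_normal(p)

taylor_err = loss(theta_star + Delta) - loss(theta_star) - grad(theta_star) @ Delta
print(np.isclose(taylor_err, (X @ Delta) @ (X @ Delta) / (2 * n)))  # True
```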

Page 19:

High dimensions: no strong convexity!

[Figure: surface plot of a loss function with a flat direction, illustrating the absence of strong convexity.]

When p > n, the Hessian $\nabla^2 \mathcal{L}(\theta; Z_1^n)$ has a nullspace of dimension at least p − n.
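A three-line check of this rank deficiency (Python/NumPy; for the least-squares loss the Hessian is simply $X^T X / n$):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 1000
X = rng.standard_normal((n, p))
H = X.T @ X / n                      # Hessian of the least-squares loss
print(np.linalg.matrix_rank(H))      # at most n = 100, so nullity >= p - n = 900
```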

Pages 20-22:

Restricted strong convexity

Definition. The loss function $\mathcal{L}_n$ satisfies restricted strong convexity (RSC) with respect to the regularizer $\mathcal{R}$ if

$$\underbrace{\mathcal{L}_n(\theta^* + \Delta) - \big\{ \mathcal{L}_n(\theta^*) + \langle \nabla \mathcal{L}_n(\theta^*), \Delta \rangle \big\}}_{\text{Taylor error } T_{\mathcal{L}}(\Delta;\, \theta^*)} \;\ge\; \underbrace{\frac{\gamma_\ell}{2}\, \|\Delta\|_e^2}_{\text{lower curvature}} \;-\; \underbrace{\frac{\tau_\ell}{2}\, \mathcal{R}^2(\Delta)}_{\text{tolerance}}$$

for all $\Delta$ in a suitable neighborhood of $\theta^*$.

ordinary strong convexity: special case with tolerance $\tau_\ell = 0$; does not hold for most loss functions when p > n

RSC enforces a lower bound on curvature, but only when $\mathcal{R}^2(\Delta) \ll \|\Delta\|_e^2$

Pages 23-24:

Least-squares: RSC ≡ restricted eigenvalue

For the least-squares loss $\mathcal{L}(\theta) = \frac{1}{2n}\|y - X\theta\|_2^2$:

$$T_{\mathcal{L}}(\Delta; \theta^*) \;=\; \mathcal{L}_n(\theta^* + \Delta) - \mathcal{L}_n(\theta^*) - \langle \nabla\mathcal{L}_n(\theta^*), \Delta \rangle \;=\; \frac{1}{2n}\|X\Delta\|_2^2.$$

Restricted nullspace/eigenvalue (RE) condition (van de Geer, 2007; Bickel et al., 2009):

$$\frac{\|X\Delta\|_2^2}{2n} \;\ge\; \gamma\, \|\Delta\|_2^2 \qquad \text{for all } \|\Delta_{S^c}\|_1 \le \|\Delta_S\|_1.$$

Pages 25-26:

Least-squares: RSC ≡ restricted eigenvalue

For the least-squares loss $\mathcal{L}(\theta) = \frac{1}{2n}\|y - X\theta\|_2^2$:

$$T_{\mathcal{L}}(\Delta; \theta^*) \;=\; \frac{1}{2n}\|X\Delta\|_2^2.$$

Restricted nullspace/eigenvalue (RE) condition (van de Geer, 2007; Bickel et al., 2009), in the relaxed form

$$\frac{\|X\Delta\|_2^2}{2n} \;\ge\; \gamma\, \|\Delta\|_2^2 \qquad \text{for all } \Delta \in \mathbb{R}^p \text{ with } \|\Delta\|_1 \le 2\sqrt{s}\,\|\Delta\|_2.$$

holds with high probability for various sub-Gaussian designs when $n \gtrsim s \log(p/s)$

fairly strong dependency between covariates is possible
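A small Monte Carlo illustration of this condition (Python/NumPy; it samples random directions from the cone rather than certifying the infimum, so it is only suggestive):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, s = 100, 400, 5
X = rng.standard_normal((n, p))

# Sample directions Delta with ||Delta||_1 <= 2*sqrt(s)*||Delta||_2
# by taking s-sparse vectors, for which this inequality holds automatically.
ratios = []
for _ in range(2000):
    Delta = np.zeros(p)
    support = rng.choice(p, size=s, replace=False)
    Delta[support] = rng.standard_normal(s)
    ratios.append((np.linalg.norm(X @ Delta) ** 2 / (2 * n)) / np.linalg.norm(Delta) ** 2)

print(min(ratios))  # bounded away from 0 even though p > n
```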

Pages 27-29:

Restricted strong convexity for GLMs

Generalized linear model linking covariates $x \in \mathbb{R}^p$ to output $y \in \mathbb{R}$:

$$\mathbb{P}(y \mid x, \theta^*) \;\propto\; \exp\big\{ y\, \langle x, \theta^* \rangle - \Phi(\langle x, \theta^* \rangle) \big\}.$$

Taylor-series expansion involves the random Hessian

$$H(\theta) \;=\; \frac{1}{n} \sum_{i=1}^n \Phi''\big(\langle \theta, X_i \rangle\big)\, X_i X_i^T \;\in\; \mathbb{R}^{p \times p}.$$

Proposition (Negahban, W., Ravikumar & Yu, 2010). For zero-mean sub-Gaussian covariates $X_i$ with covariance $\Sigma$ and any GLM,

$$T_{\mathcal{L}}(\Delta; \theta^*) \;\ge\; c_3\, \|\Sigma^{1/2}\Delta\|_2 \Big\{ \|\Sigma^{1/2}\Delta\|_2 - c_4\, \kappa(\Sigma) \Big(\frac{\log p}{n}\Big)^{1/2} \|\Delta\|_1 \Big\} \qquad \text{for all } \|\Delta\|_2 \le 1$$

with probability at least $1 - c_1 \exp(-c_2 n)$. Here $\kappa(\Sigma) = \max_j \Sigma_{jj}$.

Page 30:

Violating matrix incoherence (elementwise/RIP)

Important: incoherence/RIP conditions imply restricted nullspace/eigenvalue, but are far from necessary. Very easy to violate them...

Pages 31-33:

Violating matrix incoherence (elementwise/RIP)

Form a random design matrix

$$X = \underbrace{\begin{bmatrix} x_1 & x_2 & \cdots & x_p \end{bmatrix}}_{p \text{ columns}} = \underbrace{\begin{bmatrix} X_1^T \\ X_2^T \\ \vdots \\ X_n^T \end{bmatrix}}_{n \text{ rows}} \in \mathbb{R}^{n \times p}, \qquad \text{each row } X_i \sim N(0, \Sigma), \text{ i.i.d.}$$

Example: for some $\mu \in (0, 1)$, consider the covariance matrix

$$\Sigma = (1 - \mu)\, I_{p \times p} + \mu\, \mathbf{1}\mathbf{1}^T.$$

Elementwise incoherence violated: for any $j \ne k$,

$$\mathbb{P}\Big[ \frac{\langle x_j, x_k \rangle}{n} \ge \mu - \epsilon \Big] \;\ge\; 1 - c_1 \exp(-c_2 n \epsilon^2).$$

RIP constants tend to infinity as $(n, s)$ increase:

$$\mathbb{P}\Big[ \Big|\Big|\Big| \frac{X_S^T X_S}{n} - I_{s \times s} \Big|\Big|\Big|_2 \ge \mu(s - 1) - 1 - \epsilon \Big] \;\ge\; 1 - c_1 \exp(-c_2 n \epsilon^2).$$
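A quick simulation of this violation (Python/NumPy, illustrative): with the equicorrelated Σ above, normalized column inner products concentrate near μ rather than near the small values that elementwise incoherence demands.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, mu = 500, 50, 0.5
Sigma = (1 - mu) * np.eye(p) + mu * np.ones((p, p))
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

# Off-diagonal entries of X^T X / n estimate the column "coherence".
G = X.T @ X / n
offdiag = G[~np.eye(p, dtype=bool)]
print(offdiag.mean())   # close to mu = 0.5, far from near-0
```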

Page 34:

Noiseless ℓ1 recovery for μ = 0.5

[Figure: probability of exact recovery versus rescaled sample size $\alpha := \frac{n}{s \log(p/s)}$, for p ∈ {128, 256, 512}.]

Pages 35-37:

(II) Decomposable regularizers

[Figure: a subspace pair $(\mathcal{M}, \mathcal{M}^\perp)$.]

Subspace $\mathcal{M}$: approximation to model parameters.
Complementary subspace $\mathcal{M}^\perp$: undesirable deviations.

Regularizer $\mathcal{R}$ decomposes across $(\mathcal{M}, \mathcal{M}^\perp)$ if

$$\mathcal{R}(\alpha + \beta) = \mathcal{R}(\alpha) + \mathcal{R}(\beta) \qquad \text{for all } \alpha \in \mathcal{M} \text{ and } \beta \in \mathcal{M}^\perp.$$

Includes:
(weighted) ℓ1-norms
group-sparse norms
nuclear norm
sums of decomposable norms

Page 38:

Example: Sparse vectors and ℓ1-regularization

For each subset $S \subset \{1, \ldots, p\}$, define the subspace pair

$$\mathcal{M}(S) := \{\theta \in \mathbb{R}^p \mid \theta_{S^c} = 0\}, \qquad \mathcal{M}^\perp(S) := \{\theta \in \mathbb{R}^p \mid \theta_S = 0\}.$$

Decomposability of the ℓ1-norm:

$$\|\theta_S + \theta_{S^c}\|_1 = \|\theta_S\|_1 + \|\theta_{S^c}\|_1 \qquad \text{for all } \theta_S \in \mathcal{M}(S) \text{ and } \theta_{S^c} \in \mathcal{M}^\perp(S).$$

Natural extension to the group Lasso: for a collection of groups $\{G_j\}$ that partition $\{1, \ldots, p\}$, the group norm

$$\|\theta\|_{\mathcal{G}, \alpha} = \sum_j \|\theta_{G_j}\|_\alpha \qquad \text{for some } \alpha \in [1, \infty].$$
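A two-line numerical confirmation of this identity (Python/NumPy; the split below is the generic S versus S^c decomposition):

```python
import numpy as np

rng = np.random.default_rng(4)
p, S = 10, np.array([0, 3, 7])
theta_S = np.zeros(p); theta_S[S] = rng.standard_normal(S.size)   # lives in M(S)
theta_Sc = rng.standard_normal(p); theta_Sc[S] = 0.0              # lives in M-perp(S)

lhs = np.abs(theta_S + theta_Sc).sum()
rhs = np.abs(theta_S).sum() + np.abs(theta_Sc).sum()
print(np.isclose(lhs, rhs))  # True: the l1-norm decomposes across (M, M-perp)
```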

Pages 39-41:

Example: Low-rank matrices and nuclear norm

For each pair of r-dimensional subspaces $U \subseteq \mathbb{R}^{p_1}$ and $V \subseteq \mathbb{R}^{p_2}$:

$$\mathcal{M}(U, V) := \{\Theta \in \mathbb{R}^{p_1 \times p_2} \mid \mathrm{row}(\Theta) \subseteq V,\; \mathrm{col}(\Theta) \subseteq U\}$$
$$\mathcal{M}^\perp(U, V) := \{\Gamma \in \mathbb{R}^{p_1 \times p_2} \mid \mathrm{row}(\Gamma) \subseteq V^\perp,\; \mathrm{col}(\Gamma) \subseteq U^\perp\}.$$

[Figure: sparsity patterns of (a) $\Theta \in \mathcal{M}$, (b) $\Gamma \in \mathcal{M}^\perp$, (c) $\Sigma \in \mathcal{M}$.]

By construction, $\Theta^T \Gamma = 0$ for all $\Theta \in \mathcal{M}(U, V)$ and $\Gamma \in \mathcal{M}^\perp(U, V)$.

Decomposability of the nuclear norm $|||\Theta|||_1 = \sum_{j=1}^{\min\{p_1, p_2\}} \sigma_j(\Theta)$:

$$|||\Theta + \Gamma|||_1 = |||\Theta|||_1 + |||\Gamma|||_1 \qquad \text{for all } \Theta \in \mathcal{M}(U, V) \text{ and } \Gamma \in \mathcal{M}^\perp(U, V).$$
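The same check for the nuclear norm (Python/NumPy; U and V are drawn as random orthonormal bases, an illustrative construction):

```python
import numpy as np

rng = np.random.default_rng(5)
p1, p2, r = 8, 6, 2
U, _ = np.linalg.qr(rng.standard_normal((p1, r)))   # orthonormal basis for column space
V, _ = np.linalg.qr(rng.standard_normal((p2, r)))   # orthonormal basis for row space

Theta = U @ rng.standard_normal((r, r)) @ V.T                  # in M(U, V)
# Build Gamma with column space in U-perp and row space in V-perp:
PU_perp = np.eye(p1) - U @ U.T
PV_perp = np.eye(p2) - V @ V.T
Gamma = PU_perp @ rng.standard_normal((p1, p2)) @ PV_perp      # in M-perp(U, V)

nuc = lambda A: np.linalg.svd(A, compute_uv=False).sum()
print(np.isclose(nuc(Theta + Gamma), nuc(Theta) + nuc(Gamma)))  # True
```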

Page 42:

Significance of decomposability

[Figure: the constraint set C for (a) an exact model (a cone) and (b) an approximate model (star-shaped), drawn in terms of $\mathcal{R}(\Pi_{\mathcal{M}}(\Delta))$ and $\mathcal{R}(\Pi_{\mathcal{M}^\perp}(\Delta))$.]

Lemma. Suppose that $\mathcal{L}$ is convex and $\mathcal{R}$ is decomposable w.r.t. $\mathcal{M}$. Then as long as $\lambda_n \ge 2\,\mathcal{R}^*(\nabla\mathcal{L}(\theta^*; Z_1^n))$, the error vector $\hat{\Delta} = \hat{\theta}_{\lambda_n} - \theta^*$ belongs to

$$C(\mathcal{M}, \overline{\mathcal{M}};\, \theta^*) := \big\{ \Delta \in \Omega \;\mid\; \mathcal{R}(\Pi_{\mathcal{M}^\perp}(\Delta)) \le 3\,\mathcal{R}(\Pi_{\overline{\mathcal{M}}}(\Delta)) + 4\,\mathcal{R}(\Pi_{\mathcal{M}^\perp}(\theta^*)) \big\}.$$

Pages 43-44:

Main theorem

Estimator:

$$\hat{\theta}_{\lambda_n} \in \arg\min_{\theta \in \mathbb{R}^p} \big\{ \mathcal{L}_n(\theta; Z_1^n) + \lambda_n \mathcal{R}(\theta) \big\},$$

where $\mathcal{L}$ satisfies RSC$(\gamma, \tau)$ w.r.t. the regularizer $\mathcal{R}$.

Theorem (Negahban, Ravikumar, W., & Yu, 2009). Suppose that $\theta^* \in \mathcal{M}$. For any regularization parameter $\lambda_n \ge 2\,\mathcal{R}^*(\nabla\mathcal{L}(\theta^*; Z_1^n))$, any solution $\hat{\theta}_{\lambda_n}$ satisfies

$$\|\hat{\theta}_{\lambda_n} - \theta^*\|_e^2 \;\lesssim\; \frac{1}{\gamma^2(\mathcal{L})}\, (\lambda_n')^2\, \Psi^2(\mathcal{M}), \qquad \text{where } \lambda_n' = \max\{\lambda_n, \tau(\mathcal{L})\}.$$

Quantities that control rates:
curvature in RSC: $\gamma_\ell$
tolerance in RSC: $\tau$
dual norm of regularizer: $\mathcal{R}^*(v) := \sup_{\mathcal{R}(u) \le 1} \langle v, u \rangle$
optimal subspace constant: $\Psi(\mathcal{M}) = \sup_{\theta \in \mathcal{M} \setminus \{0\}} \mathcal{R}(\theta)/\|\theta\|_e$

Page 45:

Main theorem (general version)

Theorem (Negahban, Ravikumar, W., & Yu, 2009). Without assuming $\theta^* \in \mathcal{M}$: for any regularization parameter $\lambda_n \ge 2\,\mathcal{R}^*(\nabla\mathcal{L}(\theta^*; Z_1^n))$, any solution $\hat{\theta}$ satisfies

$$\|\hat{\theta} - \theta^*\|_e^2 \;\lesssim\; \frac{\lambda_n'}{\gamma(\mathcal{L})} \Big\{ \frac{\lambda_n'}{\gamma(\mathcal{L})}\, \Psi^2(\mathcal{M}) + \mathcal{R}(\Pi_{\mathcal{M}^\perp}(\theta^*)) \Big\}, \qquad \text{where } \lambda_n' = \max\{\lambda_n, \tau(\mathcal{L})\}.$$

Pages 46-47:

Example: Linear regression (exact sparsity)

Lasso program: $\min_{\theta \in \mathbb{R}^p} \big\{ \|y - X\theta\|_2^2 + \lambda_n \|\theta\|_1 \big\}$

RSC corresponds to a lower bound on restricted eigenvalues of $X^T X \in \mathbb{R}^{p \times p}$.

For an s-sparse vector, we have $\|\theta\|_1 \le \sqrt{s}\, \|\theta\|_2$.

Corollary. Suppose the true parameter $\theta^*$ is exactly s-sparse. Under RSC and with $\lambda_n \ge 2\big\|\frac{X^T w}{n}\big\|_\infty$, any Lasso solution satisfies

$$\|\hat{\theta} - \theta^*\|_2^2 \;\le\; \frac{4}{\gamma^2}\, s\, \lambda_n^2.$$

Some stochastic instances: recover known results.

Compressed sensing: $X_{ij} \sim N(0, 1)$ and bounded noise $\|w\|_2 \le \sigma\sqrt{n}$.

Deterministic design: $X$ with bounded columns and $w_i \sim N(0, \sigma^2)$:

$$\Big\|\frac{X^T w}{n}\Big\|_\infty \le \sqrt{\frac{3\sigma^2 \log p}{n}} \;\text{ w.h.p.} \;\Longrightarrow\; \|\hat{\theta} - \theta^*\|_2^2 \le \frac{12\sigma^2}{\gamma^2}\, \frac{s \log p}{n}.$$

(e.g., Candes & Tao, 2007; Huang & Zhang, 2008; Meinshausen & Yu, 2008; Bickel et al., 2008)
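A simulation of the regularization choice in this corollary (Python/NumPy, illustrative): the dual-norm quantity $\|X^T w / n\|_\infty$ indeed concentrates at the $\sqrt{\sigma^2 \log p / n}$ scale that drives the $s \log p / n$ rate.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, sigma = 200, 1000, 1.0
X = rng.standard_normal((n, p))            # columns have norm ~ sqrt(n)

vals = []
for _ in range(200):
    w = sigma * rng.standard_normal(n)
    vals.append(np.abs(X.T @ w / n).max()) # dual (l-infinity) norm of the score

print(np.mean(vals), np.sqrt(3 * sigma**2 * np.log(p) / n))  # comparable magnitudes
```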

Page 48:

Example: Linear regression (weak sparsity)

For some $q \in [0, 1]$, say $\theta^*$ belongs to the $\ell_q$-"ball"

$$B_q(R_q) := \Big\{\theta \in \mathbb{R}^p \;\Big|\; \sum_{j=1}^p |\theta_j|^q \le R_q \Big\}.$$

Corollary. For $\theta^* \in B_q(R_q)$, any Lasso solution satisfies (w.h.p.)

$$\|\hat{\theta} - \theta^*\|_2^2 \;\lesssim\; \sigma^2 R_q \Big(\frac{\log p}{n}\Big)^{1 - q/2}.$$

This rate is known to be minimax-optimal (Raskutti, W. & Yu, 2009).

Page 49:

Example: Group-structured regularizers

Many applications exhibit sparsity with more structure...

[Figure: index set partitioned into groups G1, G2, G3.]

Divide the index set $\{1, 2, \ldots, p\}$ into groups $\mathcal{G} = \{G_1, G_2, \ldots, G_T\}$.

For parameters $\nu_t \in [1, \infty]$, define the block norm

$$\|\theta\|_{\nu, \mathcal{G}} := \sum_{t=1}^T \|\theta_{G_t}\|_{\nu_t}.$$

Group/block Lasso program:

$$\hat{\theta}_{\lambda_n} \in \arg\min_{\theta \in \mathbb{R}^p} \Big\{ \frac{1}{2n}\|y - X\theta\|_2^2 + \lambda_n \|\theta\|_{\nu, \mathcal{G}} \Big\}.$$

Different versions studied by various authors (Wright et al., 2005; Tropp et al., 2006; Yuan & Lin, 2006; Baraniuk, 2008; Obozinski et al., 2008; Zhao et al., 2008; Bach et al., 2009).

Pages 50-52:

Convergence rates for general group Lasso

Corollary. Say $\theta^*$ is supported on $s_{\mathcal{G}}$ groups, and $X$ satisfies RSC. Then for regularization parameter

$$\lambda_n \ge 2 \max_{t = 1, \ldots, T} \Big\|\frac{X^T w}{n}\Big\|_{\nu_t^*}, \qquad \text{where } \frac{1}{\nu_t^*} = 1 - \frac{1}{\nu_t},$$

any solution $\hat{\theta}_{\lambda_n}$ satisfies

$$\|\hat{\theta}_{\lambda_n} - \theta^*\|_2 \;\le\; \frac{2}{\gamma_\ell}\, \Psi_\nu(S_{\mathcal{G}})\, \lambda_n, \qquad \text{where } \Psi_\nu(S_{\mathcal{G}}) = \sup_{\theta \in \mathcal{M}(S_{\mathcal{G}}) \setminus \{0\}} \frac{\|\theta\|_{\nu, \mathcal{G}}}{\|\theta\|_2}.$$

Some special cases, with $m \equiv$ maximum group size:

1. ℓ1/ℓ2 regularization: group norm with $\nu = 2$:

$$\|\hat{\theta}_{\lambda_n} - \theta^*\|_2^2 = O\Big(\frac{s_{\mathcal{G}}\, m}{n} + \frac{s_{\mathcal{G}} \log T}{n}\Big).$$

This rate is minimax-optimal (Raskutti, W. & Yu, 2010).

2. ℓ1/ℓ∞ regularization: group norm with $\nu = \infty$:

$$\|\hat{\theta}_{\lambda_n} - \theta^*\|_2^2 = O\Big(\frac{s_{\mathcal{G}}\, m^2}{n} + \frac{s_{\mathcal{G}} \log T}{n}\Big).$$

Pages 53-54:

Example: Low-rank matrices and nuclear norm

Matrix $\Theta^* \in \mathbb{R}^{p_1 \times p_2}$ that is exactly (or approximately) low-rank.

Noisy/partial observations of the form

$$y_i = \langle\!\langle X_i, \Theta^* \rangle\!\rangle + w_i, \qquad i = 1, \ldots, n, \quad w_i \text{ i.i.d. noise.}$$

Estimate by solving the semidefinite program (SDP):

$$\hat{\Theta} \in \arg\min_{\Theta} \Big\{ \frac{1}{n} \sum_{i=1}^n \big(y_i - \langle\!\langle X_i, \Theta \rangle\!\rangle\big)^2 + \lambda_n \underbrace{\sum_{j=1}^{\min\{p_1, p_2\}} \sigma_j(\Theta)}_{|||\Theta|||_1} \Big\}$$

studied in past work (Fazel, 2001; Srebro et al., 2004; Bach, 2008)
observations based on random projection (Recht, Fazel & Parrilo, 2007)
work on matrix completion (Srebro, 2004; Candes & Recht, 2008; Recht, 2009; Negahban & W., 2010)
other work on general noisy observation models (Rohde & Tsybakov, 2009; Negahban & W., 2009)

Pages 55-56:

Rates for (near) low-rank estimation

For parameter $q \in [0, 1]$, set of near low-rank matrices:

$$B_q(R_q) = \Big\{\Theta^* \in \mathbb{R}^{p_1 \times p_2} \;\Big|\; \sum_{j=1}^{\min\{p_1, p_2\}} |\sigma_j(\Theta^*)|^q \le R_q \Big\}.$$

Corollary (Negahban & W., 2009). Under the RSC condition, with regularization parameter $\lambda_n \ge 16\sigma\big(\sqrt{\frac{p_1}{n}} + \sqrt{\frac{p_2}{n}}\big)$, we have w.h.p.

$$|||\hat{\Theta} - \Theta^*|||_F^2 \;\le\; c_0\, \frac{R_q}{\gamma_\ell^2} \Big(\frac{\sigma^2 (p_1 + p_2)}{n}\Big)^{1 - \frac{q}{2}}.$$

For a rank-r matrix $M$,

$$|||M|||_1 = \sum_{j=1}^r \sigma_j(M) \;\le\; \sqrt{r}\, \sqrt{\sum_{j=1}^r \sigma_j^2(M)} \;=\; \sqrt{r}\, |||M|||_F.$$

Solve the nuclear-norm regularized program with $\lambda_n \ge \frac{2}{n} \big|\big|\big| \sum_{i=1}^n w_i X_i \big|\big|\big|_2$.

Pages 57-59:

Weighted matrix completion

Random operator $\mathfrak{X} : \mathbb{R}^{p \times p} \to \mathbb{R}^n$ with

$$[\mathfrak{X}(\Theta^*)]_i = \Theta^*_{a(i)\, b(i)},$$

where $(a(i), b(i))$ is a matrix index sampled according to a weight function $\omega$.

Even in the noiseless setting, the model is unidentifiable. Consider a rank-one matrix:

$$\Theta^* = e_1 e_1^T = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 0 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & 0 \end{bmatrix}$$

Past work has imposed eigen-incoherence conditions (e.g., Recht & Candes, 2008; Gross, 2009).

Pages 60-61:

A milder "spikiness" condition

Consider the "poisoned" low-rank matrix

$$\Theta^* = \Gamma^* + \delta\, e_1 e_1^T,$$

where $\Gamma^*$ has rank $r - 1$ and all its eigenvectors are perpendicular to $e_1$.

Excluded by eigen-incoherence for all $\delta > 0$.

Control by spikiness ratio (Negahban & W., 2010):

$$1 \;\le\; \frac{p\, \|\Theta^*\|_\infty}{|||\Theta^*|||_F} \;\le\; p.$$
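Computing this ratio for two extremes (Python/NumPy, illustrative) shows why it separates diffuse from spiky matrices:

```python
import numpy as np

def spikiness(Theta):
    """Spikiness ratio p * ||Theta||_inf / |||Theta|||_F; lies in [1, p]."""
    p = Theta.shape[0]
    return p * np.abs(Theta).max() / np.linalg.norm(Theta, "fro")

p = 100
diffuse = np.ones((p, p)) / p                  # maximally spread-out rank-one matrix
spiky = np.zeros((p, p)); spiky[0, 0] = 1.0    # e1 e1^T, a single spike

print(spikiness(diffuse))  # 1.0: perfectly non-spiky
print(spikiness(spiky))    # 100.0 = p: maximally spiky
```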

Pages 62-64:

Restricted strong convexity for matrix completion

Matrix completion is based on a random operator $\mathfrak{X}_n : \mathbb{R}^{p \times p} \to \mathbb{R}^n$ such that

$$[\mathfrak{X}_n(\Theta)]_i = \Theta_{a_i b_i}, \qquad \text{where } (a_i, b_i) \text{ is chosen according to the weighted measure } \omega.$$

In expectation: $\mathbb{E}\big[\|\mathfrak{X}_n(\Theta)\|_2^2\big]/n = |||\Theta|||^2_{\omega(F)}$.

Proposition (Negahban & W., 2010). For the weighted matrix completion operator $\mathfrak{X}_n$, there are universal positive constants $(c_1, c_2)$ such that $Z_n(\Theta) := \big| \frac{\|\mathfrak{X}_n(\Theta)\|_2^2}{n} - |||\Theta|||^2_{\omega(F)} \big|$ is bounded as

$$\sup_{\Theta \in \mathbb{R}^{p \times p}} Z_n(\Theta) \;\le\; \underbrace{c_1\, p\,\|\Theta\|_{\omega(\infty)}\, |||\Theta|||_{\omega(\mathrm{nuc})}\, \sqrt{\frac{p \log p}{n}}}_{\text{"low-rank" term}} \;+\; \underbrace{c_2 \Big( p\,\|\Theta\|_{\omega(\infty)}\, \sqrt{\frac{p \log p}{n}} \Big)^2}_{\text{"spikiness" term}}$$

with probability at least $1 - \exp(-p \log p)$.
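A sketch of this sampling operator and its expectation identity in the simplest uniform-weight case (Python/NumPy; with ω uniform, $|||\Theta|||^2_{\omega(F)}$ reduces to the mean of the squared entries, an assumption made here purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
p, n = 30, 20000
Theta = rng.standard_normal((p, p))

# Uniform sampling operator: n i.i.d. entry observations.
a = rng.integers(0, p, size=n)
b = rng.integers(0, p, size=n)
X_Theta = Theta[a, b]

empirical = (X_Theta ** 2).mean()          # ||X_n(Theta)||_2^2 / n
expected = (Theta ** 2).mean()             # |||Theta|||^2_{omega(F)} for uniform omega
print(empirical, expected)                 # close for large n
```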

Page 65:

Summary

convergence rates for high-dimensional estimators:
decomposability of the regularizer R
restricted strong convexity of the loss function

actual rates determined by:
noise measured in the dual function R∗
subspace constant Ψ in moving from R to the error norm ‖·‖e
restricted strong convexity constant

recovered some known results as corollaries:
Lasso with hard sparsity
multivariate group Lasso
inverse covariance matrix estimation via log-determinant

derived some new results on ℓ2 or Frobenius norm:
models with weak sparsity
log-linear models with weak/exact sparsity
low-rank matrix estimation
other models? other error metrics?

Page 66:

Some papers (www.eecs.berkeley.edu/~wainwrig)

1. A. Agarwal, S. Negahban, and M. J. Wainwright (2011). Fast global convergence rates of gradient methods for high-dimensional inference. Arxiv, March 2011.

2. S. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu (2009). A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Full-length version: arxiv.org/abs/1010.2731v1. Preliminary version: NIPS Conference, December 2009.

3. S. Negahban and M. J. Wainwright (2010). Restricted strong convexity and (weighted) matrix completion: Optimal bounds with noise. arxiv.org/abs/0112.5100, September 2010.

4. G. Raskutti, M. J. Wainwright, and B. Yu (2009). Minimax rates for linear regression over ℓq-balls. IEEE Transactions on Information Theory, October 2011. (First appeared at Allerton Conference, September 2009.)