
SpAM: Sparse Additive Models

Pradeep Ravikumar† Han Liu†‡ John Lafferty∗† Larry Wasserman‡†

†Machine Learning Department   ‡Department of Statistics

∗Computer Science Department

Carnegie Mellon University, Pittsburgh, PA 15213

Abstract

We present a new class of models for high-dimensional nonparametric regression and classification called sparse additive models (SpAM). Our methods combine ideas from sparse linear modeling and additive nonparametric regression. We derive a method for fitting the models that is effective even when the number of covariates is larger than the sample size. A statistical analysis of the properties of SpAM is given together with empirical results on synthetic and real data, showing that SpAM can be effective in fitting sparse nonparametric models in high dimensional data.

1 Introduction

Substantial progress has been made recently on the problem of fitting high dimensional linear regression models of the form $Y_i = X_i^T\beta + \varepsilon_i$, for $i = 1,\dots,n$. Here $Y_i$ is a real-valued response, $X_i$ is a $p$-dimensional predictor and $\varepsilon_i$ is a mean zero error term. Finding an estimate of $\beta$ when $p > n$ that is both statistically well-behaved and computationally efficient has proved challenging; however, the lasso estimator (Tibshirani (1996)) has been remarkably successful. The lasso estimator $\hat\beta$ minimizes the $\ell_1$-penalized sum of squares

$$
\sum_i \bigl(Y_i - X_i^T\beta\bigr)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \qquad (1)
$$

with the $\ell_1$ penalty $\|\beta\|_1$ encouraging sparse solutions, where many components $\beta_j$ are zero. The good empirical success of this estimator has recently been backed up by results confirming that it has strong theoretical properties; see (Greenshtein and Ritov, 2004; Zhao and Yu, 2007; Meinshausen and Yu, 2006; Wainwright, 2006).
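As a point of reference for the nonparametric extension developed below, the criterion (1) can be minimized with standard software. The following is a minimal sketch (our own illustration, not code from the paper) using scikit-learn's Lasso on simulated data; scikit-learn's alpha parameter corresponds to the penalty level λ up to its 1/(2n) scaling convention.

```python
# Minimal sketch: minimizing the lasso objective (1) on simulated data.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 200                                   # p > n regime
beta_true = np.zeros(p)
beta_true[:5] = [2.0, -1.5, 1.0, 0.5, -0.5]       # sparse truth
X = rng.standard_normal((n, p))
Y = X @ beta_true + 0.5 * rng.standard_normal(n)

lasso = Lasso(alpha=0.1).fit(X, Y)                # alpha plays the role of lambda
print("nonzero coefficients:", np.flatnonzero(lasso.coef_))
```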

The nonparametric regression model $Y_i = m(X_i) + \varepsilon_i$, where $m$ is a general smooth function, relaxes the strong assumptions made by a linear model, but is much more challenging in high dimensions. Hastie and Tibshirani (1999) introduced the class of additive models of the form

$$
Y_i = \sum_{j=1}^{p} m_j(X_{ij}) + \varepsilon_i \qquad (2)
$$

which is less general, but can be more interpretable and easier to fit; in particular, an additive model can be estimated using a coordinate descent Gauss-Seidel procedure called backfitting. An extension of the additive model is the functional ANOVA model

$$
Y_i = \sum_{1\le j\le p} m_j(X_{ij}) + \sum_{j<k} m_{j,k}(X_{ij}, X_{ik}) + \sum_{j<k<\ell} m_{j,k,\ell}(X_{ij}, X_{ik}, X_{i\ell}) + \cdots + \varepsilon_i \qquad (3)
$$


which allows interactions among the variables. Unfortunately, additive models only have good statistical and computational behavior when the number of variables $p$ is not large relative to the sample size $n$.

In this paper we introduce sparse additive models (SpAM) that extend the advantages of sparse linear models to the additive, nonparametric setting. The underlying model is the same as in (2), but constraints are placed on the component functions $\{m_j\}_{1\le j\le p}$ to simultaneously encourage smoothness of each component and sparsity across components. The SpAM estimation procedure we introduce allows the use of arbitrary nonparametric smoothing techniques, and in the case where the underlying component functions are linear, it reduces to the lasso. It naturally extends to classification problems using generalized additive models. The main results of the paper are (i) the formulation of a convex optimization problem for estimating a sparse additive model, (ii) an efficient backfitting algorithm for constructing the estimator, (iii) simulations showing the estimator has excellent behavior on some simulated and real data, even when $p$ is large, and (iv) a statistical analysis of the theoretical properties of the estimator that supports its good empirical performance.

2 The SpAM Optimization Problem

In this section we describe the key idea underlying SpAM. We first present a population version of the procedure that intuitively suggests how sparsity is achieved. We then present an equivalent convex optimization problem. In the following section we derive a backfitting procedure for solving this optimization problem in the finite sample setting.

To motivate our approach, we first consider a formulation that scales each component function $g_j$ by a scalar $\beta_j$, and then imposes an $\ell_1$ constraint on $\beta = (\beta_1,\dots,\beta_p)^T$. For $j \in \{1,\dots,p\}$, let $\mathcal{H}_j$ denote the Hilbert space of measurable functions $f_j(x_j)$ of the single scalar variable $x_j$, such that $\mathbb{E}(f_j(X_j)) = 0$ and $\mathbb{E}(f_j(X_j)^2) < \infty$, furnished with the inner product

$$
\langle f_j, f_j' \rangle = \mathbb{E}\bigl(f_j(X_j)\, f_j'(X_j)\bigr). \qquad (4)
$$

Let $\mathcal{H}_{\mathrm{add}} = \mathcal{H}_1 + \mathcal{H}_2 + \cdots + \mathcal{H}_p$ denote the Hilbert space of functions of $(x_1,\dots,x_p)$ that have an additive form: $f(x) = \sum_j f_j(x_j)$. The standard additive model optimization problem, in the population setting, is

$$
\min_{f_j \in \mathcal{H}_j,\; 1\le j\le p} \; \mathbb{E}\Bigl(Y - \sum_{j=1}^{p} f_j(X_j)\Bigr)^2 \qquad (5)
$$

and $m(x) = \mathbb{E}(Y \mid X = x)$ is the unknown regression function. Now consider the following modification of this problem that imposes additional constraints:

$$
(P)\quad \min_{\beta \in \mathbb{R}^p,\; g_j \in \mathcal{H}_j} \; \mathbb{E}\Bigl(Y - \sum_{j=1}^{p} \beta_j g_j(X_j)\Bigr)^2 \qquad (6a)
$$

subject to

$$
\sum_{j=1}^{p} |\beta_j| \le L \qquad (6b)
$$

$$
\mathbb{E}\bigl(g_j^2\bigr) = 1, \quad j = 1,\dots,p \qquad (6c)
$$

$$
\mathbb{E}(g_j) = 0, \quad j = 1,\dots,p \qquad (6d)
$$

noting that $g_j$ is a function while $\beta$ is a vector. Intuitively, the constraint that $\beta$ lies in the $\ell_1$-ball $\{\beta : \|\beta\|_1 \le L\}$ encourages sparsity of the estimated $\beta$, just as for the parametric lasso. When $\beta$ is sparse, the estimated additive function $f(x) = \sum_{j=1}^{p} f_j(x_j) = \sum_{j=1}^{p} \beta_j g_j(x_j)$ will also be sparse, meaning that many of the component functions $f_j(\cdot) = \beta_j g_j(\cdot)$ are identically zero. The constraints (6c) and (6d) are imposed for identifiability; without (6c), for example, one could always satisfy (6b) by rescaling.

While this optimization problem makes plain the role of $\ell_1$ regularization of $\beta$ in achieving sparsity, it has the unfortunate drawback of not being convex. More specifically, while the optimization problem is convex in $\beta$ and $\{g_j\}$ separately, it is not convex in $\beta$ and $\{g_j\}$ jointly.


However, consider the following related optimization problem:

$$
(Q)\quad \min_{f_j \in \mathcal{H}_j} \; \mathbb{E}\Bigl(Y - \sum_{j=1}^{p} f_j(X_j)\Bigr)^2 \qquad (7a)
$$

subject to

$$
\sum_{j=1}^{p} \sqrt{\mathbb{E}\bigl(f_j^2(X_j)\bigr)} \le L \qquad (7b)
$$

$$
\mathbb{E}(f_j) = 0, \quad j = 1,\dots,p. \qquad (7c)
$$

This problem is convex in $\{f_j\}$, as a quadratically constrained quadratic program (QCQP). Moreover, the solutions to problems $(P)$ and $(Q)$ are equivalent: $(\beta^*, \{g_j^*\})$ optimizes $(P)$ implies $\{f_j^* = \beta_j^* g_j^*\}$ optimizes $(Q)$; $\{f_j^*\}$ optimizes $(Q)$ implies $(\beta^* = (\|f_j^*\|_2)^T,\; \{g_j^* = f_j^*/\|f_j^*\|_2\})$ optimizes $(P)$.

While optimization problem $(Q)$ has the important virtue of being convex, the way it encourages sparsity is not intuitive; the following observation provides some insight. Consider the set $C \subset \mathbb{R}^4$ defined by $C = \bigl\{(f_{11}, f_{12}, f_{21}, f_{22})^T \in \mathbb{R}^4 : \sqrt{f_{11}^2 + f_{12}^2} + \sqrt{f_{21}^2 + f_{22}^2} \le L\bigr\}$. Then the projection $\pi_{12}C$ onto the first two components is an $\ell_2$ ball. However, the projection $\pi_{13}C$ onto the first and third components is an $\ell_1$ ball. In this way, it can be seen that the constraint $\sum_j \|f_j\|_2 \le L$ acts as an $\ell_1$ constraint across components to encourage sparsity, while it acts as an $\ell_2$ constraint within components to encourage smoothness, as in a ridge regression penalty. It is thus crucial that the norm $\|f_j\|_2$ appears in the constraint, and not its square $\|f_j\|_2^2$. For the purposes of sparsity, this constraint could be replaced by $\sum_j \|f_j\|_q \le L$ for any $q \ge 1$. In case each $f_j$ is linear, $(f_j(x_{1j}),\dots,f_j(x_{nj})) = \beta_j(x_{1j},\dots,x_{nj})$, the optimization problem reduces to the lasso.

The use of scaling coefficients together with a nonnegative garrote penalty, similar to our problem $(P)$, is considered by Yuan (2007). However, the component functions $g_j$ are fixed, so that the procedure is not asymptotically consistent. The form of the optimization problem $(Q)$ is similar to that of the COSSO for smoothing spline ANOVA models (Lin and Zhang, 2006); however, our method differs significantly from the COSSO, as discussed below. In particular, our method is scalable and easy to implement even when $p$ is much larger than $n$.

3 A Backfitting Algorithm for SpAM

We now derive a coordinate descent algorithm for fitting a sparse additive model. We assume that we observe $Y = m(X) + \varepsilon$, where $\varepsilon$ is mean zero Gaussian noise. We write the Lagrangian for the optimization problem $(Q)$ as

$$
\mathcal{L}(f, \lambda, \mu) = \frac{1}{2}\,\mathbb{E}\Bigl(Y - \sum_{j=1}^{p} f_j(X_j)\Bigr)^2 + \lambda \sum_{j=1}^{p} \sqrt{\mathbb{E}\bigl(f_j^2(X_j)\bigr)} + \sum_j \mu_j \mathbb{E}(f_j). \qquad (8)
$$

Let $R_j = Y - \sum_{k\ne j} f_k(X_k)$ be the $j$th residual. The stationary condition for minimizing $\mathcal{L}$ as a function of $f_j$, holding the other components $f_k$ fixed for $k \ne j$, is expressed in terms of the Frechet derivative $\delta\mathcal{L}$ as

$$
\delta\mathcal{L}(f, \lambda, \mu; \delta f_j) = \mathbb{E}\bigl[(f_j - R_j + \lambda v_j)\,\delta f_j\bigr] = 0 \qquad (9)
$$

for any $\delta f_j \in \mathcal{H}_j$ satisfying $\mathbb{E}(\delta f_j) = 0$, where $v_j \in \partial\sqrt{\mathbb{E}(f_j^2)}$ is an element of the subgradient, satisfying $\sqrt{\mathbb{E}(v_j^2)} \le 1$ and $v_j = f_j\big/\sqrt{\mathbb{E}(f_j^2)}$ if $\mathbb{E}(f_j^2) \ne 0$. Therefore, conditioning on $X_j$, the stationary condition (9) implies

$$
f_j + \lambda v_j = \mathbb{E}(R_j \mid X_j). \qquad (10)
$$


Input: Data $(X_i, Y_i)$, regularization parameter $\lambda$.

Initialize $\hat f_j = \hat f_j^{(0)}$, for $j = 1,\dots,p$.

Iterate until convergence:

  For each $j = 1,\dots,p$:

    Compute the residual: $R_j = Y - \sum_{k\ne j} \hat f_k(X_k)$;

    Estimate the projection $P_j = \mathbb{E}[R_j \mid X_j]$ by smoothing: $\hat P_j = \mathcal{S}_j R_j$;

    Estimate the norm $s_j = \sqrt{\mathbb{E}[P_j^2]}$ using, for example, (15) or (35);

    Soft-threshold: $\hat f_j = \bigl[1 - \lambda/\hat s_j\bigr]_+ \hat P_j$;

    Center: $\hat f_j \leftarrow \hat f_j - \mathrm{mean}(\hat f_j)$.

Output: Component functions $\hat f_j$ and estimator $\hat m(X_i) = \sum_j \hat f_j(X_{ij})$.

Figure 1: The SpAM Backfitting Algorithm

Letting $P_j = \mathbb{E}[R_j \mid X_j]$ denote the projection of the residual onto $\mathcal{H}_j$, the solution satisfies

$$
\Bigl(1 + \frac{\lambda}{\sqrt{\mathbb{E}(f_j^2)}}\Bigr) f_j = P_j \quad \text{if } \sqrt{\mathbb{E}(P_j^2)} > \lambda \qquad (11)
$$

and $f_j = 0$ otherwise. Condition (11), in turn, implies

$$
\Bigl(1 + \frac{\lambda}{\sqrt{\mathbb{E}(f_j^2)}}\Bigr)\sqrt{\mathbb{E}(f_j^2)} = \sqrt{\mathbb{E}(P_j^2)} \quad\text{or}\quad \sqrt{\mathbb{E}(f_j^2)} = \sqrt{\mathbb{E}(P_j^2)} - \lambda. \qquad (12)
$$

Thus, we arrive at the following multiplicative soft-thresholding update for $f_j$:

$$
f_j = \Bigl[1 - \frac{\lambda}{\sqrt{\mathbb{E}(P_j^2)}}\Bigr]_+ P_j \qquad (13)
$$

where $[\cdot]_+$ denotes the positive part. In the finite sample case, as in standard backfitting (Hastie and Tibshirani, 1999), we estimate the projection $\mathbb{E}[R_j \mid X_j]$ by a smooth of the residuals:

$$
\hat P_j = \mathcal{S}_j R_j \qquad (14)
$$

where $\mathcal{S}_j$ is a linear smoother, such as a local linear or kernel smoother. Let $\hat s_j$ be an estimate of $\sqrt{\mathbb{E}[P_j^2]}$. A simple but biased estimate is

$$
\hat s_j = \frac{1}{\sqrt{n}}\,\|\hat P_j\|_2 = \sqrt{\mathrm{mean}(\hat P_j^2)}. \qquad (15)
$$

More accurate estimators are possible; an example is given in the appendix. We have thus derived the SpAM backfitting algorithm given in Figure 1.
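For concreteness, the following is a minimal sketch (our own code, not the authors' implementation) of the backfitting loop of Figure 1, using a Gaussian kernel smoother for the $\mathcal{S}_j$ and the simple norm estimate (15); the bandwidth and the number of sweeps are illustrative choices.

```python
import numpy as np

def kernel_smoother_matrix(x, bandwidth=0.1):
    """Gaussian kernel smoother matrix; rows are normalized to sum to one."""
    diff = x[:, None] - x[None, :]
    W = np.exp(-0.5 * (diff / bandwidth) ** 2)
    return W / W.sum(axis=1, keepdims=True)

def spam_backfit(X, Y, lam, bandwidth=0.1, n_iter=20):
    """Sketch of the SpAM backfitting algorithm of Figure 1.

    X: (n, p) design, Y: (n,) response, lam: regularization parameter.
    Returns an (n, p) matrix of fitted component functions at the data points.
    """
    n, p = X.shape
    S = [kernel_smoother_matrix(X[:, j], bandwidth) for j in range(p)]
    f = np.zeros((n, p))
    Yc = Y - Y.mean()
    for _ in range(n_iter):
        for j in range(p):
            R_j = Yc - f.sum(axis=1) + f[:, j]      # residual R_j
            P_j = S[j] @ R_j                        # smooth of residual, eq. (14)
            s_j = np.sqrt(np.mean(P_j ** 2))        # norm estimate, eq. (15)
            scale = max(0.0, 1.0 - lam / s_j) if s_j > 0 else 0.0
            f[:, j] = scale * P_j                   # soft-threshold, eq. (13)
            f[:, j] -= f[:, j].mean()               # center
    return f
```

On a design like the simulation of Section 5.1, inspecting the column norms of the returned matrix for a suitably tuned λ should zero out most of the irrelevant components.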

While the motivating optimization problem $(Q)$ is similar to that considered in the COSSO (Lin and Zhang, 2006) for smoothing splines, the SpAM backfitting algorithm decouples smoothing and sparsity, through a combination of soft-thresholding and smoothing. In particular, SpAM backfitting can be carried out with any nonparametric smoother; it is not restricted to splines. Moreover, by iteratively estimating over the components and using soft thresholding, our procedure is simple to implement and scales to high dimensions.


3.1 SpAM for Nonparametric Logistic Regression

The SpAM backfitting procedure can be extended to nonparametric logistic regression for classification. The additive logistic model is

$$
\mathbb{P}(Y = 1 \mid X) \equiv p(X; f) = \frac{\exp\bigl(\sum_{j=1}^{p} f_j(X_j)\bigr)}{1 + \exp\bigl(\sum_{j=1}^{p} f_j(X_j)\bigr)} \qquad (16)
$$

where $Y \in \{0, 1\}$, and the population log-likelihood is $\ell(f) = \mathbb{E}\bigl[Y f(X) - \log\bigl(1 + \exp f(X)\bigr)\bigr]$. Recall that in the local scoring algorithm for generalized additive models (Hastie and Tibshirani, 1999) in the logistic case, one runs the backfitting procedure within Newton's method. Here one iteratively computes the transformed response for the current estimate $f_0$

$$
Z_i = f_0(X_i) + \frac{Y_i - p(X_i; f_0)}{p(X_i; f_0)\bigl(1 - p(X_i; f_0)\bigr)} \qquad (17)
$$

and weights $w(X_i) = p(X_i; f_0)\bigl(1 - p(X_i; f_0)\bigr)$, and carries out a weighted backfitting of $(Z, X)$ with weights $w$. The weighted smooth is given by

$$
\hat P_j = \frac{\mathcal{S}_j(w R_j)}{\mathcal{S}_j w}. \qquad (18)
$$

To incorporate the sparsity penalty, we first note that the Lagrangian is given by

$$
\mathcal{L}(f, \lambda, \mu) = \mathbb{E}\bigl[\log\bigl(1 + \exp f(X)\bigr) - Y f(X)\bigr] + \lambda \sum_{j=1}^{p} \sqrt{\mathbb{E}\bigl(f_j^2(X_j)\bigr)} + \sum_j \mu_j \mathbb{E}(f_j) \qquad (19)
$$

and the stationary condition for component function $f_j$ is $\mathbb{E}(p - Y \mid X_j) + \lambda v_j = 0$, where $v_j$ is an element of the subgradient $\partial\sqrt{\mathbb{E}(f_j^2)}$. As in the unregularized case, this condition is nonlinear in $f$, and so we linearize the gradient of the log-likelihood around $f_0$. This yields the linearized condition $\mathbb{E}\bigl[w(X)\bigl(f(X) - Z\bigr) \mid X_j\bigr] + \lambda v_j = 0$. When $\mathbb{E}(f_j^2) \ne 0$, this implies the condition

$$
\Bigl(\mathbb{E}(w \mid X_j) + \frac{\lambda}{\sqrt{\mathbb{E}(f_j^2)}}\Bigr) f_j(X_j) = \mathbb{E}(w R_j \mid X_j). \qquad (20)
$$

In the finite sample case, in terms of the smoothing matrix $\mathcal{S}_j$, this becomes

$$
\hat f_j = \frac{\mathcal{S}_j(w R_j)}{\mathcal{S}_j w + \lambda \big/ \sqrt{\mathbb{E}(f_j^2)}}. \qquad (21)
$$

If $\|\mathcal{S}_j(w R_j)\|_2 < \lambda$, then $\hat f_j = 0$. Otherwise, this implicit, nonlinear equation for $\hat f_j$ cannot be solved explicitly, so we propose to iterate until convergence:

$$
\hat f_j \leftarrow \frac{\mathcal{S}_j(w R_j)}{\mathcal{S}_j w + \lambda\sqrt{n}\,/\,\|\hat f_j\|_2}. \qquad (22)
$$

When $\lambda = 0$, this yields the standard local scoring update (18). An example of logistic SpAM is given in Section 5.
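The following is a minimal sketch (our own code) of one local-scoring pass with the penalized update (22); the weight clipping and the warm start for the fixed-point iteration are illustrative numerical safeguards, not details specified in the paper.

```python
import numpy as np

def logistic_spam_pass(f, Y, smoothers, lam, inner_iter=5):
    """One local-scoring pass of logistic SpAM (Section 3.1); sketch only.

    f: (n, p) current component fits, smoothers: list of (n, n) smoother
    matrices S_j, lam: penalty level. Returns the updated f.
    """
    n, p = f.shape
    for j in range(p):
        eta = f.sum(axis=1)                              # additive predictor
        prob = 1.0 / (1.0 + np.exp(-eta))
        w = np.clip(prob * (1.0 - prob), 1e-4, None)     # weights w(X_i)
        Z = eta + (Y - prob) / w                         # transformed response, eq. (17)
        R_j = Z - (eta - f[:, j])                        # partial residual for component j
        num = smoothers[j] @ (w * R_j)                   # S_j(w R_j)
        if np.linalg.norm(num) < lam:                    # thresholding condition
            f[:, j] = 0.0
            continue
        den0 = smoothers[j] @ w                          # S_j w
        if np.allclose(f[:, j], 0.0):
            f[:, j] = num / den0                         # warm start: unpenalized smooth
        for _ in range(inner_iter):                      # fixed-point iteration, eq. (22)
            f[:, j] = num / (den0 + lam * np.sqrt(n) / np.linalg.norm(f[:, j]))
    return f
```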

4 Properties of SpAM

4.1 SpAM is Persistent

The notion of risk consistency, or persistence, was studied by Juditsky and Nemirovski (2000) and Greenshtein and Ritov (2004) in the context of linear models. Let $(X, Y)$ denote a new pair (independent of the observed data) and define the predictive risk when predicting $Y$ with $f(X)$ by

$$
R(f) = \mathbb{E}\bigl(Y - f(X)\bigr)^2. \qquad (23)
$$


Since we consider predictors of the form $f(x) = \sum_j \beta_j g_j(x_j)$ we also write the risk as $R(\beta, g)$ where $\beta = (\beta_1,\dots,\beta_p)$ and $g = (g_1,\dots,g_p)$. Following Greenshtein and Ritov (2004), we say that an estimator $\hat m_n$ is persistent relative to a class of functions $\mathcal{M}_n$ if

$$
R(\hat m_n) - R(m_n^*) \xrightarrow{P} 0 \qquad (24)
$$

where $m_n^* = \operatorname{argmin}_{f \in \mathcal{M}_n} R(f)$ is the predictive oracle. Greenshtein and Ritov (2004) showed that the lasso is persistent for the class of linear models $\mathcal{M}_n = \{f(x) = x^T\beta : \|\beta\|_1 \le L_n\}$ if $L_n = o\bigl((n/\log n)^{1/4}\bigr)$. We show a similar result for SpAM.

Theorem 4.1. Suppose that $p_n \le e^{n^{\xi}}$ for some $\xi < 1$. Then SpAM is persistent relative to the class of additive models $\mathcal{M}_n = \bigl\{f(x) = \sum_{j=1}^{p} \beta_j g_j(x_j) : \|\beta\|_1 \le L_n\bigr\}$ if $L_n = o\bigl(n^{(1-\xi)/4}\bigr)$.

4.2 SpAM is Sparsistent

In the case of linear regression, with $m_j(X_j) = \beta_j^T X_j$, Wainwright (2006) shows that under certain conditions on $n$, $p$, $s = |\mathrm{supp}(\beta)|$, and the design matrix $X$, the lasso recovers the sparsity pattern asymptotically; that is, the lasso estimator $\hat\beta_n$ is sparsistent: $\mathbb{P}\bigl(\mathrm{supp}(\beta) = \mathrm{supp}(\hat\beta_n)\bigr) \to 1$. We show a similar result for SpAM with the sparse backfitting procedure.

For the purpose of analysis, we use orthogonal function regression as the smoothing procedure. For each $j = 1,\dots,p$ let $\psi_j$ be an orthogonal basis for $\mathcal{H}_j$. We truncate the basis to finite dimension $d_n$, and let $d_n \to \infty$ such that $d_n/n \to 0$. Let $\Psi_j$ denote the $n \times d$ matrix $\Psi_j(i, k) = \psi_{jk}(X_{ij})$. If $A \subset \{1,\dots,p\}$, we denote by $\Psi_A$ the $n \times d|A|$ matrix where for each $i \in A$, $\Psi_i$ appears as a submatrix in the natural way. The SpAM optimization problem can then be written as

$$
\min_{\beta}\; \frac{1}{2n}\Bigl\|Y - \sum_{j=1}^{p} \Psi_j \beta_j\Bigr\|_2^2 + \lambda_n \sum_{j=1}^{p} \sqrt{\frac{1}{n}\,\beta_j^T \Psi_j^T \Psi_j \beta_j} \qquad (25)
$$

where each $\beta_j$ is a $d$-dimensional vector. Let $S$ denote the true set of variables $\{j : m_j \ne 0\}$, with $s = |S|$, and let $S^c$ denote its complement. Let $\hat S_n = \{j : \hat\beta_j \ne 0\}$ denote the estimated set of variables from the minimizer $\hat\beta_n$ of (25).

Theorem 4.2. Suppose that $\Psi$ satisfies the conditions

$$
\Lambda_{\max}\Bigl(\tfrac{1}{n}\Psi_S^T \Psi_S\Bigr) \le C_{\max} < \infty \quad\text{and}\quad \Lambda_{\min}\Bigl(\tfrac{1}{n}\Psi_S^T \Psi_S\Bigr) \ge C_{\min} > 0 \qquad (26)
$$

$$
\Bigl\|\bigl(\tfrac{1}{n}\Psi_{S^c}^T \Psi_S\bigr)\bigl(\tfrac{1}{n}\Psi_S^T \Psi_S\bigr)^{-1}\Bigr\|_2^2 \, s \log s \le 1 - \delta < 1. \qquad (27)
$$

Let the regularization parameter $\lambda_n \to 0$ be chosen to satisfy

$$
\lambda_n \sqrt{s} \to 0, \qquad \frac{s}{d_n \lambda_n} \to 0, \qquad\text{and}\qquad \frac{d_n\bigl(\log d_n + \log(p-s)\bigr)}{n \lambda_n^2} \to 0. \qquad (28)
$$

Then SpAM is sparsistent: $\mathbb{P}\bigl(\hat S_n = S\bigr) \to 1$.

5 Experiments

In this section we present experimental results for SpAM applied to both synthetic and real data, including regression and classification examples that illustrate the behavior of the algorithm in various conditions. We first use simulated data to investigate the performance of the SpAM backfitting algorithm, where the true sparsity pattern is known. We then apply SpAM to some real data. If not explicitly stated otherwise, the data are always rescaled to lie in a $d$-dimensional cube $[0, 1]^d$, and a kernel smoother with Gaussian kernel is used. To tune the penalization parameter $\lambda$, we use a $C_p$ statistic, which is defined as

$$
C_p(\hat f) = \frac{1}{n}\sum_{i=1}^{n}\Bigl(Y_i - \sum_{j=1}^{p} \hat f_j(X_{ij})\Bigr)^2 + \frac{2\hat\sigma^2}{n}\sum_{j=1}^{p} \mathrm{trace}(\mathcal{S}_j)\,\mathbb{1}[\hat f_j \ne 0] \qquad (29)
$$

where $\mathcal{S}_j$ is the smoothing matrix for the $j$-th dimension and $\hat\sigma^2$ is the estimated variance.
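A direct transcription of (29) is short; in the sketch below (our own code) the estimated variance $\hat\sigma^2$ and the list of smoother matrices $\mathcal{S}_j$ are assumed to be supplied by the caller.

```python
import numpy as np

def cp_score(Y, f, smoothers, sigma2):
    """Cp statistic (29) for a fitted sparse additive model.

    Y: (n,) response, f: (n, p) fitted component functions at the data,
    smoothers: list of (n, n) smoother matrices S_j, sigma2: estimated
    noise variance.
    """
    n, p = f.shape
    rss = np.mean((Y - f.sum(axis=1)) ** 2)
    df = sum(np.trace(smoothers[j]) for j in range(p)
             if not np.allclose(f[:, j], 0.0))       # trace(S_j) over nonzero components
    return rss + 2.0 * sigma2 * df / n
```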

5.1 Simulations

We first apply SpAM to an example from (Hardle et al., 2004). A dataset with sample size $n = 150$ is generated from the following 200-dimensional additive model:

$$
Y_i = f_1(x_{i1}) + f_2(x_{i2}) + f_3(x_{i3}) + f_4(x_{i4}) + \varepsilon_i \qquad (30)
$$

$$
f_1(x) = -2\sin(2x), \quad f_2(x) = x^2 - \tfrac{1}{3}, \quad f_3(x) = x - \tfrac{1}{2}, \quad f_4(x) = e^{-x} + e^{-1} - 1 \qquad (31)
$$

and $f_j(x) = 0$ for $j \ge 5$, with noise $\varepsilon_i \sim \mathcal{N}(0, 1)$. These data therefore have 196 irrelevant dimensions. The results of applying SpAM with the plug-in bandwidths are summarized in Figure 2.
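The simulated design is straightforward to reproduce. The sketch below (our own code) assumes the covariates are drawn i.i.d. Uniform(0, 1), which matches the rescaling to $[0, 1]^d$ described above, although the exact design of Hardle et al. (2004) may differ.

```python
import numpy as np

def generate_spam_example(n=150, p=200, seed=0):
    """Synthetic data from the additive model (30)-(31); covariates assumed
    i.i.d. Uniform(0, 1) for illustration."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n, p))
    f1 = -2.0 * np.sin(2.0 * X[:, 0])
    f2 = X[:, 1] ** 2 - 1.0 / 3.0
    f3 = X[:, 2] - 0.5
    f4 = np.exp(-X[:, 3]) + np.exp(-1.0) - 1.0
    Y = f1 + f2 + f3 + f4 + rng.standard_normal(n)   # noise N(0, 1)
    return X, Y
```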


Figure 2: (Simulated data) Upper left: The empirical $\ell_2$ norm of the estimated components as plotted against the tuning parameter $\lambda$; the value on the x-axis is proportional to $\sum_j \|\hat f_j\|_2$. Upper center: The $C_p$ scores against the tuning parameter $\lambda$; the dashed vertical line corresponds to the value of $\lambda$ which has the smallest $C_p$ score. Upper right: The proportion of 200 trials where the correct relevant variables are selected, as a function of sample size $n$ (curves for $p = 128$ and $p = 256$). Lower (from left to right): Estimated (solid lines) versus true additive component functions (dashed lines) for the first 6 dimensions; the remaining components are zero.

5.2 Boston Housing

The Boston housing data was collected to study house values in the suburbs of Boston; there are altogether 506 observations with 10 covariates. The dataset has been studied by many other authors (Hardle et al., 2004; Lin and Zhang, 2006), with various transformations proposed for different covariates. To explore the sparsistency properties of our method, we add 20 irrelevant variables. Ten of them are randomly drawn from Uniform(0, 1), the remaining ten are a random permutation of the original ten covariates, so that they have the same empirical densities.

The full model (containing all 10 chosen covariates) for the Boston Housing data is:

$$
\text{medv} = \alpha + f_1(\text{crim}) + f_2(\text{indus}) + f_3(\text{nox}) + f_4(\text{rm}) + f_5(\text{age}) + f_6(\text{dis}) + f_7(\text{tax}) + f_8(\text{ptratio}) + f_9(\text{b}) + f_{10}(\text{lstat}) \qquad (32)
$$


Figure 3: (Boston housing) Left: The empirical $\ell_2$ norm of the estimated components versus the regularization parameter $\lambda$. Center: The $C_p$ scores against $\lambda$; the dashed vertical line corresponds to the best $C_p$ score. Right: Additive fits for four relevant variables (crim, rm, ptratio, lstat).

The result of applying SpAM to this 30-dimensional dataset is shown in Figure 3. SpAM identifies 6 nonzero components. It correctly zeros out both types of irrelevant variables. From the full solution path, the important variables are seen to be rm, lstat, ptratio, and crim. The importance of the variables nox and b is borderline. These results are basically consistent with those obtained by other authors (Hardle et al., 2004). However, using $C_p$ as the selection criterion, the variables indus, age, dis, and tax are estimated to be irrelevant, a result not seen in other studies.

5.3 SpAM for Spam

Here we consider an email spam classification problem, using the logistic SpAM backfitting algorithm from Section 3.1. This dataset has been studied by Hastie et al. (2001), using a set of 3,065 emails as a training set, and conducting hypothesis tests to choose significant variables; there are a total of 4,601 observations with $p = 57$ attributes, all numeric. The attributes measure the percentage of specific words or characters in the email, the average and maximum run lengths of upper case letters, and the total number of such letters. To demonstrate how well SpAM performs with sparse data, we only sample $n = 300$ emails as the training set, with the remaining 4,301 data points used as the test set. We also use the test data as the hold-out set to tune the penalization parameter $\lambda$.

The results of a typical run of logistic SpAM are summarized in Figure 4, using plug-in bandwidths. It is interesting to note that even with this relatively small sample size, logistic SpAM recovers a sparsity pattern that is consistent with previous authors' results. For example, in the best model chosen by logistic SpAM according to error rate, the 33 selected variables cover 80% of the significant predictors as determined by Hastie et al. (2001).

λ (×10⁻³)   Error        # zeros   selected variables
5.5         0.2009       55        {8, 54}
5.0         0.1725       51        {8, 9, 27, 53, 54, 57}
4.5         0.1354       46        {7, 8, 9, 17, 18, 27, 53, 54, 57, 58}
4.0         0.1083 (√)   20        {4, 6–10, 14–22, 26, 27, 38, 53–58}
3.5         0.1117       0         all
3.0         0.1174       0         all
2.5         0.1251       0         all
2.0         0.1259       0         all

Figure 4: (Email spam) Classification accuracies and variable selection for logistic SpAM. Left: error rates and selected variables over a range of penalization parameters λ, with the best error rate marked (√). Right: empirical prediction error as a function of the penalization parameter.

References

Greenshtein, E. and Ritov, Y. (2004). Persistency in high dimensional linear predictor-selection and the virtue of over-parametrization. Bernoulli 10 971–988.

Hardle, W., Muller, M., Sperlich, S. and Werwatz, A. (2004). Nonparametric and Semiparametric Models. Springer-Verlag Inc.

Hastie, T. and Tibshirani, R. (1999). Generalized additive models. Chapman & Hall Ltd.

Hastie, T., Tibshirani, R. and Friedman, J. H. (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag.

Juditsky, A. and Nemirovski, A. (2000). Functional aggregation for nonparametric regression. Ann. Statist. 28 681–712.

Lin, Y. and Zhang, H. H. (2006). Component selection and smoothing in multivariate nonparametric regression. Ann. Statist. 34 2272–2297.

Meinshausen, N. and Yu, B. (2006). Lasso-type recovery of sparse representations for high-dimensional data. Tech. Rep. 720, Department of Statistics, UC Berkeley.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, Methodological 58 267–288.

Wainwright, M. (2006). Sharp thresholds for high-dimensional and noisy recovery of sparsity. Tech. Rep. 709, Department of Statistics, UC Berkeley.

Yuan, M. (2007). Nonnegative garrote component selection in functional ANOVA models. In Proceedings of AI and Statistics, AISTATS.

Zhao, P. and Yu, B. (2007). On model selection consistency of lasso. J. of Mach. Learn. Res. 7 2541–2567.


Appendices

6 Estimating $\sqrt{\mathbb{E}[P_j^2]}$

To construct a more accurate estimator of $\sqrt{\mathbb{E}[P_j^2]}$, let $\mathcal{S}_j(x) = \bigl(S(x, X_{1j}),\dots,S(x, X_{nj})\bigr)^T$ denote the $x$th column of the smoothing matrix, and $G_x = \mathcal{S}_j(x)\mathcal{S}_j(x)^T$. Then $\hat P_j(x) = \mathcal{S}_j(x)^T R_j$, and $\hat P_j(x)^2 = R_j^T G_x R_j$. To estimate $\mathbb{E}[P_j^2(x)]$, we use the quadratic form identity

$$
\mathbb{E}(X^T Q X) = \mathrm{tr}(\Sigma Q) + \mu^T Q \mu \quad \text{in case } X \sim N(\mu, \Sigma). \qquad (33)
$$

Thus, if the noise $\varepsilon_i \sim N(0, \sigma^2)$ is Gaussian, then

$$
\mathbb{E}\bigl(R_j^T G_x R_j \mid X_j\bigr) = \sigma^2 \mathrm{tr}(G_x) + \mathbb{E}(R_j \mid X_j)^T G_x\, \mathbb{E}(R_j \mid X_j). \qquad (34)
$$

Defining $\bar G = \frac{1}{n}\sum_i G_{X_i}$ and plugging in our estimate $\hat P_j$ for $\mathbb{E}(R_j \mid X_j)$ yields the estimator

$$
\hat s_j = \sqrt{\sigma^2 \mathrm{tr}(\bar G) + \hat P_j^T \bar G \hat P_j}. \qquad (35)
$$
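In matrix form, if the rows of the $n \times n$ smoother matrix $\mathcal{S}_j$ are the vectors $\mathcal{S}_j(X_i)^T$, then $\bar G = \frac{1}{n}\mathcal{S}_j^T\mathcal{S}_j$ and (35) can be computed directly; a minimal sketch (our own code):

```python
import numpy as np

def norm_estimate(Sj, Rj, sigma2):
    """Estimator (35) of sqrt(E[P_j^2]); assumes the rows of the smoother
    matrix Sj are the vectors S_j(X_i)^T, so that G_bar = (1/n) Sj^T Sj."""
    n = Sj.shape[0]
    Pj = Sj @ Rj                      # smoothed residual
    G_bar = Sj.T @ Sj / n             # average of the G_x matrices
    return np.sqrt(sigma2 * np.trace(G_bar) + Pj @ G_bar @ Pj)
```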

7 Proof of Theorem 4.1

Let $(Y_1, X_1),\dots,(Y_n, X_n)$ be $n$ data points where $X_i \in \mathbb{R}^p$ and $Y_i \in \mathbb{R}$. The model is

$$
Y_i = \alpha + m(X_i) + \varepsilon_i \qquad (36)
$$

where

$$
m \in \mathcal{M}_n(L_n) = \Bigl\{ m = \sum_{j=1}^{p} \beta_j m_j(x_j),\; m_j \in \mathcal{T}_j,\; \sum_{j=1}^{p} |\beta_j| \le L_n \Bigr\}, \qquad (37)
$$

$$
\mathcal{T}_j = \Bigl\{ m_j \in \mathcal{H}_j : \int m_j(x_j)\,dx_j = 0,\; \int m_j^2(x_j)\,dx_j = 1,\; \sup_x |m_j(x)| \le C \Bigr\} \qquad (38)
$$

and $\mathcal{H}_j$ is a class of smooth functions such as the Sobolev space:

$$
\mathcal{H}_j = \Bigl\{ m_j : \int m_j''(x_j)^2\,dx_j < \infty,\; m_j, m_j' \text{ are absolutely continuous} \Bigr\}. \qquad (39)
$$

We begin with some notation. If $\mathcal{M}$ is a class of functions then the $L_\infty$ bracketing number $N_{[\,]}(\varepsilon)$ is the smallest number of pairs $B = \{(\ell_1, u_1),\dots,(\ell_k, u_k)\}$ such that $\|u_j - \ell_j\|_\infty \le \varepsilon$, $1 \le j \le k$, and such that for every $m \in \mathcal{M}$ there exists $(\ell, u) \in B$ such that $\ell \le m \le u$. For the Sobolev space $\mathcal{T}_j$,

$$
\log N_{[\,]}(\varepsilon, \mathcal{T}_j) \le K\Bigl(\frac{1}{\varepsilon}\Bigr)^{1/2} \qquad (40)
$$

for some $K > 0$. The bracketing integral is defined to be

$$
J_{[\,]}(\delta) = \int_0^{\delta} \sqrt{\log N_{[\,]}(u)}\,du. \qquad (41)
$$

A useful empirical process inequality (see Corollary 19.35 of van der Vaart 1998, for example) is

$$
\mathbb{E}\Bigl(\sup_{g \in \mathcal{M}} |\hat\mu(g) - \mu(g)|\Bigr) \le \frac{C\, J_{[\,]}(\|F\|_\infty)}{\sqrt{n}} \qquad (42)
$$

for some $C > 0$, where $F(x) = \sup_{g \in \mathcal{M}} |g(x)|$, $\mu(g) = \mathbb{E}(g(X))$ and $\hat\mu(g) = n^{-1}\sum_{i=1}^{n} g(X_i)$.


Set $Z \equiv (Z_0,\dots,Z_p) = (Y, X_1,\dots,X_p)$ and note that

$$
R(\beta, g) = \sum_{j=0}^{p}\sum_{k=0}^{p} \beta_j \beta_k\, \mathbb{E}\bigl(g_j(Z_j) g_k(Z_k)\bigr) \qquad (43)
$$

where we define $g_0(z_0) = z_0$ and $\beta_0 = -1$. Also define

$$
\hat R(\beta, g) = \frac{1}{n}\sum_{i=1}^{n}\sum_{j=0}^{p}\sum_{k=0}^{p} \beta_j \beta_k\, g_j(Z_{ij}) g_k(Z_{ik}). \qquad (44)
$$

Note that the SpAM estimator is the minimizer of $\hat R(\beta, g)$ subject to $\sum_j \beta_j g_j(x_j) \in \mathcal{M}_n(L_n)$. For all $(\beta, g)$,

$$
|\hat R(\beta, g) - R(\beta, g)| \le \|\beta\|_1^2\, \max_{j,k}\; \sup_{g_j \in \mathcal{T}_j,\, g_k \in \mathcal{T}_k} |\hat\mu_{jk}(g) - \mu_{jk}(g)| \qquad (45)
$$

where $\hat\mu_{jk}(g) = n^{-1}\sum_{i=1}^{n} g_j(Z_{ij}) g_k(Z_{ik})$ and $\mu_{jk}(g) = \mathbb{E}\bigl(g_j(Z_j) g_k(Z_k)\bigr)$. From (40) it follows that

$$
\log N_{[\,]}(\varepsilon, \mathcal{M}_n) \le 2\log p_n + K\Bigl(\frac{1}{\varepsilon}\Bigr)^{1/2}. \qquad (46)
$$

Hence, $J_{[\,]}(C, \mathcal{M}) = O(\sqrt{\log p_n})$ and it follows from (42) that

$$
\max_{j,k}\; \sup_{g_j \in \mathcal{T}_j,\, g_k \in \mathcal{T}_k} |\hat\mu_{jk}(g) - \mu_{jk}(g)| = O_P\Bigl(\sqrt{\frac{\log p_n}{n}}\Bigr) = O_P\Bigl(\frac{1}{n^{(1-\xi)/2}}\Bigr). \qquad (47)
$$

We conclude that

$$
\sup_{g \in \mathcal{M}} |\hat R(g) - R(g)| = O_P\Bigl(\frac{L_n^2}{n^{(1-\xi)/2}}\Bigr) = o_P(1). \qquad (48)
$$

Therefore,

$$
R(m_n^*) \le R(\hat m_n) \le \hat R(\hat m_n) + o_P(1) \le \hat R(m_n^*) + o_P(1) \le R(m_n^*) + o_P(1)
$$

and the conclusion follows. $\square$

8 Proof of Theorem 4.2

There exists an orthonormal basis (the Fourier basis) $\mathcal{B}_j = \{\psi_{j1}, \psi_{j2},\dots\}$ for the second-order Sobolev space $\mathcal{H}_j$ such that $m_j \in \mathcal{H}_j$ if and only if

$$
m_j = \sum_{k=1}^{\infty} \beta_{jk}\psi_{jk} \qquad (49)
$$

and $\sum_{k=1}^{\infty} \beta_{jk}^2 k^4 < C^2$, for some $C < \infty$. The basis is bounded, with $\sup_x |\psi_{jk}(x)| \le B$, for a constant $B < \infty$. Thus we can write

$$
Y_i = \sum_{j=1}^{p}\sum_{k=1}^{\infty} \beta^*_{jk}\psi_{jk}(X_{ij}) + \varepsilon_i. \qquad (50)
$$

It can be shown that the sparse backfitting procedure with an orthogonal function regression smoother, with a truncated basis of size $d_n$, solves the following optimization problem:

$$
\min_{\beta}\; \frac{1}{2n}\sum_{i=1}^{n}\Bigl(Y_i - \sum_{j=1}^{p}\sum_{k=1}^{d_n}\beta_{jk}\psi_{jk}(X_{ij})\Bigr)^2 + \lambda \sum_{j=1}^{p}\sqrt{\frac{1}{n}\sum_i \sum_{k,k'}\beta_{jk}\beta_{jk'}\psi_{jk}(X_{ij})\psi_{jk'}(X_{ij})}. \qquad (51)
$$


To simplify notation, let $\beta_j$ be the $d_n$-dimensional vector $\{\beta_{jk},\, k = 1,\dots,d_n\}$, and $\Psi_j$ the $n \times d_n$ matrix $\Psi_j[i, k] = \psi_{jk}(X_{ij})$. If $A \subset \{1,\dots,p\}$, we denote by $\Psi_A$ the $n \times d|A|$ matrix where for each $i \in A$, $\Psi_i$ appears as a submatrix in the natural way.

The optimization task can then be written as

$$
\min_{\beta}\; \frac{1}{2n}\Bigl\|Y - \sum_{j=1}^{p}\Psi_j\beta_j\Bigr\|_2^2 + \lambda\sum_{j=1}^{p}\sqrt{\frac{1}{n}\,\beta_j^{\top}\Psi_j^{\top}\Psi_j\beta_j}. \qquad (52)
$$

We now state assumptions on the design and design parameters. Let $S$ be the true sparsity pattern, and let $S^c = \{1,\dots,p\}\setminus S$ be the set of irrelevant covariates.

(A1) Dependence conditions:

$$
\Lambda_{\max}\Bigl(\frac{1}{n}\Psi_S^{\top}\Psi_S\Bigr) \le C_{\max} < \infty \qquad (53)
$$

$$
\Lambda_{\min}\Bigl(\frac{1}{n}\Psi_S^{\top}\Psi_S\Bigr) \ge C_{\min} > 0 \qquad (54)
$$

(A2) Incoherence conditions:

$$
\Bigl\|\Bigl(\frac{1}{n}\Psi_{S^c}^{\top}\Psi_S\Bigr)\Bigl(\frac{1}{n}\Psi_S^{\top}\Psi_S\Bigr)^{-1}\Bigr\|_2^2\, s\log s \le (1 - \epsilon) \qquad (55)
$$

for some $\epsilon > 0$.

(A3) Truncation conditions:

$$
d_n \to \infty \qquad (56)
$$

$$
\frac{d_n}{n} \to 0 \qquad (57)
$$

(A4) Penalty conditions:

$$
\lambda_n\sqrt{s} \to 0 \qquad (58)
$$

$$
\frac{d_n\bigl(\log d_n + \log(p-s)\bigr)}{n\lambda_n^2} \to 0 \qquad (59)
$$

$$
\frac{s}{d_n}\,\frac{1}{\lambda_n} \to 0 \qquad (60)
$$

Theorem 8.1. Given the model in (50), and design settings $(n, p, s, d, \lambda)$ such that conditions (A1) through (A4) are satisfied, then $\mathbb{P}\bigl(\hat S_n = S\bigr) \to 1$.

A commonly used basis truncation size is $d_n = n^{1/5}$, which achieves the minimax error rate in the one-dimensional case. Under this design setting, the theorem gives:

Corollary 8.2. Given the model in (50), penalty settings such that

$$
\lambda_n\sqrt{s} \to 0, \qquad \frac{\log\bigl(n(p-s)\bigr)}{n^{4/5}\lambda_n^2} \to 0, \qquad \frac{s}{n^{1/5}\lambda_n} \to 0 \qquad (61)
$$

and design settings such that $d = O(n^{1/5})$ and conditions (A1) and (A2) are satisfied, then $\mathbb{P}\bigl(\hat S_n = S\bigr) \to 1$.

Let $F(\beta)$ denote the objective function of the optimization problem in (51), and let $G(\beta) = \sum_{j=1}^{p}\sqrt{\frac{1}{n}\beta_j^{\top}\Psi_j^{\top}\Psi_j\beta_j}$ denote the penalty part. Then a vector $\beta \in \mathbb{R}^{d_n p}$ is an optimum of the above objective function if and only if there exists a subgradient $g \in \partial G(\beta)$ such that, for each $j$,

$$
\frac{1}{n}\Psi_j^{\top}\Bigl(\sum_{l=1}^{p}\Psi_l\beta_l - Y\Bigr) + \lambda_n g_j = 0. \qquad (62)
$$


The subdifferential $g$ of $G(\beta)$ takes the form

$$
g_j = \frac{\frac{1}{n}\Psi_j^{\top}\Psi_j\beta_j}{\sqrt{\frac{1}{n}\beta_j^{\top}\Psi_j^{\top}\Psi_j\beta_j}}, \quad \text{for } \beta_j \ne 0 \qquad (63)
$$

$$
g_j^{\top}\Bigl[\frac{1}{n}\Psi_j^{\top}\Psi_j\Bigr]^{-1} g_j \le 1, \quad \text{for } \beta_j = 0. \qquad (64)
$$

We now proceed by a "witness" proof technique. We set $\hat\beta_{S^c} = 0$ and $\hat g_S = \partial G_S(\hat\beta_S)$, and, obtaining $\hat\beta_S$ and $\hat g_{S^c}$ from the stationary condition in (62), we show that with high probability,

$$
\hat g_j^{\top}\Bigl(\frac{1}{n}\Psi_j^{\top}\Psi_j\Bigr)^{-1}\hat g_j \le 1, \quad j \in S^c \qquad (65)
$$

$$
\hat\beta_S \ne 0. \qquad (66)
$$

This shows that there exists an optimal solution to the optimization problem in (51) which has the same sparsity pattern as the model. Since it can be shown that every solution to the optimization problem has the same sparsity pattern, this will prove the required result.

Setting $\hat\beta_{S^c} = 0$ and $\hat g_S = \partial G_S(\hat\beta_S)$ in the stationary condition for $\hat\beta_S$ gives

$$
\frac{1}{n}\Psi_j^{\top}\bigl(\Psi_S\hat\beta_S - Y\bigr) + \lambda_n \hat g_j = 0, \quad j \in S \qquad (67)
$$

which can be summarized as

$$
\frac{1}{n}\Psi_S^{\top}\bigl[\Psi_S\hat\beta_S - Y\bigr] + \lambda_n \hat g_S = 0. \qquad (68)
$$

Let $V_n = Y - \Psi_S\beta^*_S - W_n$, where $W_n$ denotes the noise vector. Then

$$
|V_i| = \Bigl|\sum_{j\in S}\sum_{k=d+1}^{\infty}\beta^*_{jk}\psi_{jk}(X_{ij})\Bigr| \qquad (69)
$$

$$
\le B\sum_{j\in S}\sum_{k=d+1}^{\infty}|\beta^*_{jk}| \qquad (70)
$$

$$
= B\sum_{j\in S}\sum_{k=d+1}^{\infty}\frac{|\beta^*_{jk}|\,k^2}{k^2} \qquad (71)
$$

$$
\le B\sum_{j\in S}\sqrt{\sum_{k=d+1}^{\infty}\beta^{*2}_{jk}k^4}\,\sqrt{\sum_{k=d+1}^{\infty}\frac{1}{k^4}} \qquad (72)
$$

$$
\le sBC\sqrt{\sum_{k=d+1}^{\infty}\frac{1}{k^4}} \qquad (73)
$$

$$
\le \frac{sB'}{d_n^{3/2}} \quad \text{for some constant } B' > 0. \qquad (74)
$$

Therefore

$$
\|V_n\|_\infty \le B'\, s\, d^{-3/2}. \qquad (75)
$$

Letting $\Sigma_{SS} = \frac{1}{n}\bigl[\Psi_S^{\top}\Psi_S\bigr]$, we have

$$
\hat\beta_S - \beta^*_S = \Sigma_{SS}^{-1}\Bigl[\frac{1}{n}\Psi_S^{\top}W_n\Bigr] + \Sigma_{SS}^{-1}\Bigl[\frac{1}{n}\Psi_S^{\top}V_n\Bigr] - \lambda_n\Sigma_{SS}^{-1}\hat g_S. \qquad (76)
$$


This allows us to get the $L_\infty$ bound,

$$
\|\hat\beta_S - \beta^*_S\|_\infty \le \Bigl\|\Sigma_{SS}^{-1}\Bigl[\frac{1}{n}\Psi_S^{\top}W_n\Bigr]\Bigr\|_\infty + \bigl\|\Sigma_{SS}^{-1}\bigr\|_\infty\Bigl\|\frac{1}{n}\Psi_S^{\top}V_n\Bigr\|_\infty + \lambda_n\bigl\|\Sigma_{SS}^{-1}\hat g_S\bigr\|_\infty. \qquad (77)
$$

Let $\rho_n = \min_{j\in S}\max_{k\in\{1,\dots,d_n\}}|\beta^*_{jk}| > 0$. It suffices to show that $\|\hat\beta_S - \beta^*_S\|_\infty < \frac{\rho_n}{2}$ to ensure that $\hat\beta_j \not\equiv 0$, $j \in S$. We now proceed to bound the quantities in (77).

$$
\|\Sigma_{SS}^{-1}\|_\infty \le \|\Sigma_{SS}^{-1}\|_2\,\sqrt{sd} \le \frac{\sqrt{sd}}{C_{\min}} \qquad (78\text{--}79)
$$

$$
1 \ge \hat g_j^{\top}\Bigl[\frac{1}{n}\Psi_j^{\top}\Psi_j\Bigr]^{-1}\hat g_j \ge \frac{1}{C_{\max}}\|\hat g_j\|_2^2, \quad\text{so}\quad \|\hat g_j\|_2 \le \sqrt{C_{\max}}. \qquad (80\text{--}82)
$$

This gives the following bounds,

$$
\|\hat g_S\|_\infty = \max_{j\in S}\|\hat g_j\|_\infty \le \max_{j\in S}\|\hat g_j\|_2 \le \sqrt{C_{\max}}. \qquad (83\text{--}85)
$$

Also,

$$
\bigl\|\Sigma_{SS}^{-1}\hat g_S\bigr\|_\infty \le \bigl\|\Sigma_{SS}^{-1}\hat g_S\bigr\|_2 \le \bigl\|\Sigma_{SS}^{-1}\bigr\|_2\,\|\hat g_S\|_2 \le \frac{\sqrt{C_{\max}\,s}}{C_{\min}}. \qquad (86\text{--}88)
$$

From (75), we get

$$
\frac{1}{n}\Psi_{jk}^{\top}V_n \le \Bigl|\frac{1}{n}\sum_i \psi_{jk}(X_{ij})\Bigr|\,\|V_n\|_\infty \le B'^2\, s\, d^{-3/2}. \qquad (89\text{--}90)
$$

Finally, consider $Z := \Sigma_{SS}^{-1}\bigl[\frac{1}{n}\Psi_S^{\top}W_n\bigr]$. Note that $W_n \sim N(0, \sigma^2 I)$, so that $Z$ is Gaussian as well, with mean zero. Consider its $l$-th component, $Z_l = e_l^{\top}Z$. Then $\mathbb{E}[Z_l] = 0$, and $\mathrm{Var}[Z_l] = \frac{\sigma^2}{n}\, e_l^{\top}\Sigma_{SS}^{-1}e_l \le \frac{\sigma^2}{C_{\min}\,n}$. It can then be shown (Ledoux and Talagrand, 1991) that

$$
\mathbb{E}\bigl[\|Z\|_\infty\bigr] \le 3\sqrt{\log(sd)\,\max_l \mathrm{Var}[Z_l]} \qquad (92)
$$

$$
\le 3\sqrt{\frac{\sigma^2\log(sd)}{n\,C_{\min}}}. \qquad (93)
$$

An application of Markov's inequality then proves the desired result,

$$
\mathbb{P}\Bigl[\|\hat\beta_S - \beta^*_S\|_\infty > \frac{\rho_n}{2}\Bigr] \le \mathbb{P}\Bigl[\|Z\|_\infty + \lambda_n\frac{\sqrt{C_{\max}\,s}}{C_{\min}} + \sqrt{s}\,B'^2 s\, d^{-1} \ge \frac{\rho_n}{2}\Bigr] \qquad (94)
$$

$$
\le \frac{1}{\rho_n}\Bigl\{\mathbb{E}\bigl[\|Z\|_\infty\bigr] + \lambda_n\frac{\sqrt{C_{\max}\,s}}{C_{\min}} + B'^2\bigl(s^{3/2}/d\bigr)\Bigr\} \qquad (95)
$$

$$
\le \frac{1}{\rho_n}\Biggl\{3\sqrt{\frac{\sigma^2\log(sd)}{n\,C_{\min}}} + \lambda_n\frac{\sqrt{C_{\max}\,s}}{C_{\min}} + B'^2\bigl(s^{3/2}/d\bigr)\Biggr\} \qquad (96)
$$


which converges to zero under the given assumptions.

We now analyze $\hat g_{S^c}$. The stationary condition for $q \in S^c$ is given by

$$
\frac{1}{n}\Psi_q^{\top}\bigl[\Psi_S\hat\beta_S - \Psi_S\beta^*_S - W_n - V_n\bigr] + \lambda_n\hat g_q = 0. \qquad (97)
$$

Thus,

$$
\lambda_n\hat g_q = -\Bigl[\frac{1}{n}\Psi_q^{\top}\Psi_S\Bigr]\bigl[\hat\beta_S - \beta^*_S\bigr] + \Bigl(\frac{1}{n}\Psi_q^{\top}\Bigr)\bigl[W_n + V_n\bigr] \qquad (98)
$$

$$
= -\Sigma_{S^cS}\bigl[\hat\beta_S - \beta^*_S\bigr] + \Bigl(\frac{1}{n}\Psi_q^{\top}\Bigr)\bigl[W_n + V_n\bigr] \qquad (99)
$$

$$
= \lambda_n\Sigma_{S^cS}\Sigma_{SS}^{-1}\hat g_S - \Sigma_{S^cS}\Sigma_{SS}^{-1}\Bigl[\Bigl(\frac{1}{n}\Psi_S^{\top}\Bigr)(W_n + V_n)\Bigr] + \Bigl(\frac{1}{n}\Psi_q^{\top}\Bigr)\bigl[W_n + V_n\bigr]. \qquad (100)
$$

Letting $\Sigma_{qq} = \frac{1}{n}\Psi_q^{\top}\Psi_q$, we want $\hat g_q^{\top}\Sigma_{qq}^{-1}\hat g_q \le 1$. Since

$$
\sqrt{\hat g_q^{\top}\Sigma_{qq}^{-1}\hat g_q} \le \sqrt{\lambda_{\max}\bigl[\Sigma_{qq}^{-1}\bigr]}\;\|\hat g_q\|_2 \le \frac{1}{\sqrt{l_{\min}}}\,\|\hat g_q\|_2 \qquad (101\text{--}102)
$$

it suffices to show that $\sup_{q\in S^c}\|\hat g_q\|_2 \le \sqrt{l_{\min}}$. From (98), we have,

$$
\lambda_n\|\hat g_q\|_2 \le \bigl\|\Sigma_{S^cS}\Sigma_{SS}^{-1}\bigr\|_2\,\lambda_n\|\hat g_S\|_2 + \bigl\|\Sigma_{S^cS}\Sigma_{SS}^{-1}\bigr\|_2\Bigl\{\Bigl\|\frac{1}{n}\Psi_S^{\top}W_n\Bigr\|_2 + \Bigl\|\frac{1}{n}\Psi_S^{\top}V_n\Bigr\|_2\Bigr\} + \Bigl\{\Bigl\|\frac{1}{n}\Psi_q^{\top}W_n\Bigr\|_2 + \Bigl\|\frac{1}{n}\Psi_q^{\top}V_n\Bigr\|_2\Bigr\} \qquad (103)
$$

$$
\le \lambda_n\delta\sqrt{sC_{\max}} + \delta\sqrt{sd}\Bigl\{\Bigl\|\frac{1}{n}\Psi_S^{\top}W_n\Bigr\|_\infty + \Bigl\|\frac{1}{n}\Psi_S^{\top}V_n\Bigr\|_\infty\Bigr\} + \sqrt{d}\Bigl\{\Bigl\|\frac{1}{n}\Psi_q^{\top}W_n\Bigr\|_\infty + \Bigl\|\frac{1}{n}\Psi_q^{\top}V_n\Bigr\|_\infty\Bigr\} \qquad (104)
$$

$$
\le \lambda_n\delta\sqrt{sC_{\max}} + \delta\sqrt{sd}\,Z_1 + \sqrt{d}\,Z_{2q} + \delta\sqrt{sd}\,R_1 + \sqrt{d}\,R_2 \qquad (105)
$$

where $Z_1 := \bigl\|\frac{1}{n}\Psi_S^{\top}W_n\bigr\|_\infty$ and $Z_{2q} := \bigl\|\frac{1}{n}\Psi_q^{\top}W_n\bigr\|_\infty$ are the Gaussian residuals, and $R_1 := \bigl\|\frac{1}{n}\Psi_S^{\top}V_n\bigr\|_\infty$ and $R_2 := \bigl\|\frac{1}{n}\Psi_q^{\top}V_n\bigr\|_\infty$ are the basis residuals.

From (75), we have $R_1, R_2 \le B'^2\, s\, d^{-3/2}$. Thus,

$$
\frac{1}{\lambda_n}\bigl[\delta\sqrt{sd}\,R_1 + \sqrt{d}\,R_2\bigr] \le \frac{\sqrt{d}\,B'^2 s\, d^{-3/2}\bigl(\delta\sqrt{s} + 1\bigr)}{\lambda_n} \le \frac{B'^2 d^{-1} s\bigl(\delta\sqrt{s} + 1\bigr)}{\lambda_n} \to 0. \qquad (106\text{--}108)
$$

For the Gaussian residuals, proceeding as earlier,

$$
\mathbb{E}[Z_1] \le 3\sqrt{\frac{\sigma^2\log(sd)\,C_{\max}}{n}} \qquad (109)
$$

$$
\mathbb{E}\Bigl[\sup_{q\in S^c} Z_{2q}\Bigr] \le 3\sqrt{\frac{\sigma^2\log\bigl(d(p-s)\bigr)\,C_{\max}}{n}}. \qquad (110)
$$


Thus,

$$
\mathbb{P}\Bigl[\frac{1}{\lambda_n}\sqrt{d}\Bigl(\delta\sqrt{s}\,Z_1 + \sup_{q\in S^c} Z_{2q}\Bigr) > \frac{\epsilon}{2}\Bigr] \qquad (111)
$$

$$
\le \frac{\sqrt{d}}{\epsilon\lambda_n}\Bigl(\delta\sqrt{s}\,\mathbb{E}[Z_1] + \mathbb{E}\bigl[\sup_{q\in S^c} Z_{2q}\bigr]\Bigr) \qquad (112)
$$

$$
\le \frac{\sqrt{d}}{\epsilon\lambda_n}\Biggl(\delta\sqrt{s}\cdot 3\sqrt{\frac{\sigma^2\log(sd)\,C_{\max}}{n}} + 3\sqrt{\frac{\sigma^2\log\bigl(d(p-s)\bigr)\,C_{\max}}{n}}\Biggr) \qquad (113)
$$

$$
\le \frac{1}{\epsilon}\Biggl[3\sqrt{\frac{\sigma^2 C_{\max}\,\delta^2 s\, d\log(sd)}{n\lambda_n^2}} + 3\sqrt{\frac{\sigma^2 C_{\max}\, d\log\bigl(d(p-s)\bigr)}{n\lambda_n^2}}\Biggr] \to 0. \qquad (114)
$$

Thus, $\mathbb{P}\bigl(\sup_{q\in S^c}\|\hat g_q\|_2 \le \sqrt{l_{\min}}\bigr) \to 1$, which proves the result. $\square$
