Bernoulli 19(4), 2013, 1484–1500
DOI: 10.3150/13-BEJSP14

Stability

BIN YU

Departments of Statistics and EECS, University of California at Berkeley, Berkeley, CA 94720, USA. E-mail: [email protected]

Reproducibility is imperative for any scientific discovery. More often than not, modern scientific findings rely on statistical analysis of high-dimensional data. At a minimum, reproducibility manifests itself in stability of statistical results relative to “reasonable” perturbations to data and to the model used. Jackknife, bootstrap, and cross-validation are based on perturbations to data, while robust statistics methods deal with perturbations to models.

In this article, a case is made for the importance of stability in statistics. Firstly, we motivate the necessity of stability for interpretable and reliable encoding models from brain fMRI signals. Secondly, we find strong evidence in the literature to demonstrate the central role of stability in statistical inference, such as sensitivity analysis and effect detection. Thirdly, a smoothing parameter selector based on estimation stability (ES), ES-CV, is proposed for Lasso, in order to bring stability to bear on cross-validation (CV). ES-CV is then utilized in the encoding models to reduce the number of predictors by 60% with almost no loss (1.3%) of prediction performance across over 2,000 voxels. Last, a novel “stability” argument is seen to drive new results that shed light on the intriguing interactions between sample to sample variability and heavier tail error distribution (e.g., double-exponential) in high-dimensional regression models with p predictors and n independent samples. In particular, when p/n → κ ∈ (0.3,1) and the error distribution is double-exponential, the Ordinary Least Squares (OLS) is a better estimator than the Least Absolute Deviation (LAD) estimator.

Keywords: cross-validation; double exponential error; estimation stability; fMRI; high-dim regression; Lasso; movie reconstruction; robust statistics; stability

1. Introduction

In his seminal paper “The Future of Data Analysis” (Tukey, 1962), John W. Tukey writes:

“It will still be true that there will be aspects of data analysis well called technology, but there will also be the hallmarks of stimulating science: intellectual adventure, demanding calls upon insight, and a need to find out ‘how things really are’ by investigation and the confrontation of insights with experience” (p. 63).

Fast forward to 2013 in the age of information technology, these words of Tukey ring as true as fifty years ago, but with a new twist: the ubiquitous and massive data today were impossible to imagine in 1962. From the point of view of science, information technology and data are a blessing, and a curse. The reasons for them to be a blessing are many and obvious. The reasons for them to be a curse are less obvious. One of them is well articulated recently by two prominent biologists in an editorial, Casadevall and Fang (2011), in Infection and Immunity (of the American Society for Microbiology):

“Although scientists have always comforted themselves with the thought that science is self-correcting, the immediacy and rapidity with which knowledge disseminates today means that incorrect information can have a profound impact before any corrective process can take place” (p. 893).

1350-7265 © 2013 ISI/BS

“A recent study analyzed the cause of retraction for 788 retracted papers and found that error and fraud were responsible for 545 (69%) and 197 (25%) cases, respectively, while the cause was unknown in 46 (5.8%) cases (31)” (p. 893).

The study referred to is Steen (2011) in the Journal of Medical Ethics. Of the 788 retracted papers from PubMed from 2000 to 2010, 69% are marked as “errors” on the retraction records. Statistical analyses are likely to be involved in these errors. Casadevall and Fang go on to call for “enhanced training in probability and statistics,” among other remedies including “reembracing philosophy.” More often than not, modern scientific findings rely on statistical analyses of high-dimensional data, and reproducibility is imperative for any scientific discovery. Scientific reproducibility therefore is a responsibility of statisticians. At a minimum, reproducibility manifests itself in the stability of statistical results relative to “reasonable” perturbations to data and to the method or model used.

Reproducibility of scientific conclusions is closely related to their reliability. It is receiving much well-deserved attention lately in the scientific community (e.g., Ioannidis, 2005; Kraft et al., 2009; Casadevall and Fang, 2011; Nosek et al., 2012) and in the media (e.g., Naik, 2011; Booth, 2012). Drawing a scientific conclusion involves multiple steps. First, data are collected by one laboratory or one group, ideally with a clear hypothesis in the mind of the experimenter or scientist. In the age of information technology, however, more and more massive amounts of data are collected for fishing expeditions to “discover” scientific facts. These expeditions involve running computer codes on data for data cleaning and analysis (modeling and validation). Before these facts become “knowledge,” they have to be reproduced or replicated through new sets of data by the same group or preferably by other groups. Given a fixed set of data, Donoho et al. (2009) discuss reproducible research in computational harmonic analysis with implications on computer-code or computing-environment reproducibility in computational sciences including statistics. Fonio et al. (2012) discuss replicability between laboratories as an important screening mechanism for discoveries. Reproducibility could have a multitude of meanings to different people. One articulation on the meanings of reproducibility, replication, and repeatability can be found in Stodden (2011).

In this paper, we advocate for more involvement of statisticians in science and for an enhanced emphasis on stability within the statistical framework. Stability has been of great concern in statistics. For example, in the words of Hampel et al. (1986), “. . . robustness theories can be viewed as stability theories of statistical inference” (p. 8). Even in low-dimensional linear regression models, collinearity is known to cause instability of OLS or problems for individual parameter estimates so that significance testing for these estimates becomes unreliable. Here we demonstrate the importance of statistics for understanding our brain; we describe our methodological work on estimation stability that helps interpret models reliably in neuroscience; and we articulate how our solving neuroscience problems motivates theoretical work on stability and robust statistics in high-dimensional regression models. In other words, we tell an intertwining story of scientific investigation and statistical developments.

The rest of the paper is organized as follows. In Section 2, we cover our “intellectual adventure” into neuroscience, in collaboration with the Gallant Lab at UC Berkeley, to understand the human visual pathway via fMRI brain signals invoked by natural stimuli (images or movies) (cf. Kay et al., 2008, Naselaris et al., 2009, Kay and Gallant, 2009, Naselaris et al., 2011). In particular, we describe how our statistical encoding and decoding models are the backbones of “mind-reading computers,” named one of the 50 best inventions of 2011 by Time Magazine (Nishimoto et al., 2011). In order to find out “how things really are,” we argue that reliable interpretation needs stability. We define stability relative to a data perturbation scheme. In Section 3, we briefly review the vast literature on different data perturbation schemes such as jackknife, subsampling, and bootstrap. (We note that data perturbation in general means not only taking subsets of data units from a given data set, but also sampling from an underlying distribution or replicating the experiment for a new set of data.)

In Section 4, we review an estimation stability (ES) measure taken from Lim and Yu (2013) for regression feature selection. Combining ES with CV as in Lim and Yu (2013) gives rise to a smoothing parameter selector ES-CV for Lasso (or other regularization methods). When we apply ES-CV to the movie-fMRI data, we obtain a 60% reduction of the model size or the number of features selected at a negligible loss of 1.3% in terms of prediction accuracy. Subsequently, the ES-CV-Lasso models are both sparse and more reliable, hence better suited for interpretation due to their stability and simplicity. The stability considerations in our neuroscience endeavors have prompted us to connect with the concept of stability from the robust statistics point of view. In El Karoui et al. (2013), we obtain very interesting theoretical results in high-dimensional regression models with p predictors and n samples, shedding light on how sample variability in the design matrix meets heavier tail error distributions when p/n is approximately a constant in (0,1) or in the random matrix regime. We describe these results in an important special case in Section 5. In particular, we see that when p/n → κ and 1 > κ > 0.3 or so, the Ordinary Least Squares (OLS) estimator is better than the Least Absolute Deviation (LAD) estimator when the error distribution is double exponential. We conclude in Section 6.

2. Stable models are necessary for understanding visual pathway

Neuroscience holds the key to understanding how our mind works. Modern neuroscience is invigorated by massive and multi-modal forms of data enabled by advances in technology (cf. Atkil, Martone and Van Essen, 2012). Building mathematical/statistical models on this data, computational neuroscience is at the frontier of neuroscience. The Gallant Lab at UC Berkeley is a leading neuroscience lab specializing in understanding the visual pathway, and is a long-term collaborator with the author’s research group. It pioneered the use of natural stimuli in experiments to invoke brain signals, in contrast to synthetic signals such as white noise and moving bars or checker boards as previously done.

Simply put, the human visual pathway works as follows. Visual signals are recorded by the retina and, through the relay center LGN, they are transmitted to primary visual cortex area V1, on to V2 and V4, on the “what” pathway (in contrast to the “where” pathway) (cf. Goodale and Milner, 1992). Computational vision neuroscience aims at modeling two related tasks carried out by the brain (cf. Dayan and Abbott, 2005) through two kinds of models. The first kind, the encoding model, predicts brain signals from visual stimuli, while the second kind, the decoding model, recovers visual stimuli from brain signals. Often, decoding models are built upon encoding models and hence indirectly validate the latter, but they are important in their own right. In the September issue of Current Biology, our joint paper with the Gallant Lab, Nishimoto et al. (2011), invents a decoding (or movie reconstruction) algorithm to reconstruct movies from fMRI brain signals. This work has received intensive and extensive coverage by the media including The Economist’s Oct. 29th 2011 issue (“Reading the Brain: Mind-Goggling”) and the National Public Radio in their program “Forum with Michael Krasny” on Tue, Sept. 27, 2011 at 9:30 am (“Reconstructing the Mind’s Eye”). As mentioned earlier, it was selected by Time Magazine as one of the best 50 inventions of 2011 and dubbed “Mind-reading Computers” on the cover page of Time’s invention issue.

What is really behind the movie reconstruction algorithm?

Can we learn something from it about how the brain works?

The movie reconstruction algorithm consists of statistical encoding and decoding models, both of which employ regularization. The former are sparse models, so they are concise enough to be viewed, and are built via Lasso + CV for each voxel separately. However, as is well known, Lasso + CV results are not stable or reliable enough for scientific interpretation due to the L1 regularization and the emphasis of CV on prediction performance. So Lasso + CV is not estimation stable. The decoding model uses the estimated encoding model for each voxel and Tikhonov regularization or Ridge in covariance estimation to pool information across different voxels over V1, V2 and V4 (Nishimoto et al., 2011). Then an empirical prior for clips of short videos is used from movie trailers and YouTube to induce posterior weights on video clips in the empirical prior database. Tikhonov or Ridge regularization concerns itself with the estimation of the covariance between voxels that is not of interest for interpretation. The encoding phase is the focus here from now on.

V1 is a primary visual cortex area and the best understood area in the visual cortex. Hubel and Wiesel received a Nobel Prize in Physiology or Medicine in 1981 for two major scientific discoveries. One is Hubel and Wiesel (1959) that uses cat physiology data to show, roughly speaking, that simple V1 neuron cells act like Gabor filters or as angled edge detectors. Later, using solely image data, Olshausen and Field (1996) showed that image patches can be sparsely represented on Gabor-like basis image patches. The appearance of Gabor filters in both places is likely not a coincidence, due to the fact that our brain has evolved to represent the natural world. These Gabor filters have different locations, frequencies and orientations. Previous work from the Gallant Lab has built a filter-bank of such Gabor filters and successfully used them to design encoding models with single neuron signals in V1 invoked by static natural image stimuli (Kay et al., 2008, Naselaris et al., 2011).

In Nishimoto et al. (2011), we use fMRI brain signals observed over 2700 voxels in different areas of the visual cortex. fMRI signals are indirect and non-invasive measures of neural activities in the brain and have good spatial coverage and temporal resolution in seconds. Each voxel is roughly a cube of 1 mm by 1 mm by 1 mm and contains hundreds of thousands of neurons. Leveraging the success of Gabor-filter based models for single neuron brain signals, for a given image, a vector of features is extracted by 2-d wavelet filters. This feature vector has been used to build encoding models for fMRI brain signals in Kay et al. (2008) and Naselaris et al. (2011). Invoked by clips of videos/movies, fMRI signals from three subjects are collected with the same experimental set-up. To model fMRI signals invoked by movies, a 3-dim motion-energy Gabor filter bank has been built in Nishimoto et al. (2011) to extract a feature vector of dimension 26K. Linear models are then built on these features at the observed time point and lagged time points.

At present, sparse linear regression models are favorites of the Gallant Lab through Lasso or ε-L2Boost. These sparse models give similar prediction performance on validation data as neural nets and kernel machines on image-fMRI data; they correspond well to the neuroscience knowledge on V1; and they are easier to interpret than neural net and kernel machine models that include all features or variables.

For each subject, following a rigorous protocol in the Gallant Lab, the movie data (how many frames per second?) consists of three batches: training, test and validation. The training data is used to fit a sparse encoding model via Lasso or ε-L2Boost and the test data is used to select the smoothing parameter by CV. These data are averages of two or three replicates. That is, the same movie is played to one subject two or three times and the resulting fMRI signals are called replicates. Then a completely determined encoding model is used to predict the fMRI signals in the validation data (with 10+ replicates) and the prediction performance is measured by the correlation between the predicted fMRI signals and observed fMRI signals, for each voxel and for each subject. Good prediction performance is observed for such encoding models (cf. Figure 2).

3. Stability considerations in the literature

Prediction and movie reconstruction are good steps to validate the encoding model in order to understand the human visual pathway. But the science lies in finding the features that might drive a voxel, or to use Tukey’s words, finding out “how things really are.”

It is often the case that the number of data units could easily have been different from what is in the collected data. There are some hard resource constraints, such as that human subjects cannot lie inside an fMRI machine for too long, and it also costs money to use the fMRI machine. But whether the data collected is for 2 hours as in the data, or 1 hour 50 min, or 2 hours and 10 min is a judgement call by the experimenter given the constraints. Consequently, scientific conclusions, or in our case, candidates for driving features, should be stable relative to removing a small proportion of data units, which is one form of reasonable or appropriate data perturbation, or reproducible without a small proportion of the data units. With a smaller set of data, a more conservative scientific conclusion is often reached, which is deemed worthwhile for the sake of more reliable results.

Statistics is not the only field that uses mathematics to describe phenomena in the natural world. Other such fields include numerical analysis, dynamical systems and PDE and ODE. Concepts of stability are central in all of them, implying the importance of stability in quantitative methods or models when applied to real world problems.

The necessity for a procedure to be robust to data perturbation is a very natural idea, easily explainable to a child. Data perturbation has had a long history in statistics, and it has at least three main forms: jackknife, sub-sampling and bootstrap. Huber (2002) writes in “John W. Tukey’s Contribution to Robust Statistics”:

“[Tukey] preferred to rely on the actual batch of data at hand rather than on a hypothetical underlying population of which it might be a sample” (p. 1643).

All three main forms of data perturbation rely on an “actual batch of data” even though their theoretical analyses do assume hypothetical underlying populations of which data is a sample. They all have had long histories.

Jackknife can be traced back at least to Quenouille (1949, 1956) where jackknife was used to estimate the bias of an estimator. Tukey (1958), an abstract in the Annals of Mathematical Statistics, has been regarded as a key development because of his use of jackknife for variance estimation. Miller (1974) is an excellent early review on jackknife with extensions to regression and time series situations. Hinkley (1977) proposes weighted jackknife for unbalanced data for which Wu (1986) provides a theoretical study. Künsch (1989) develops jackknife further for time series.

Sub-sampling on the other hand was started three years earlier than jackknife by Mahalanobis (1946). Hartigan (1969, 1975) builds a framework for confidence interval estimation based on subsampling. Carlstein (1986) applies subsampling (which he called subseries) to the time series context. Politis and Romano (1992) study subsampling for general weakly dependent processes.

Cross-validation (CV) has a more recent start in Allen (1974) and Stone (1974). It gives an estimated prediction error that can be used to select a particular model in a class of models or along a path of regularized models. It has been wildly popular for modern data problems, especially for high-dimensional data and machine learning methods. Hall (1983) and Li (1986) are examples of theoretical analyses of CV.

Efron’s (1979) bootstrap is widely used and it can be viewed as simplified jackknife or subsampling. Examples of early theoretical studies of bootstrap are Bickel and Freedman (1981) and Beran (1984) for the i.i.d. case, and Künsch (1989) for time series. Much more on these three data perturbation schemes can be found in books, for example, by Efron and Tibshirani (1993), Shao and Tu (1995) and Politis, Romano and Wolf (1999).
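As a minimal illustration of two of these data perturbation schemes, the following Python sketch computes delete-one jackknife and bootstrap estimates of the variance of a sample mean. The function names, the simulated Gaussian batch and the number of bootstrap replicates are assumptions made for illustration; they are not taken from the works cited above.

```python
import random
import statistics


def jackknife_variance(data, estimator):
    """Delete-one jackknife estimate of the variance of estimator(data)."""
    n = len(data)
    # Recompute the estimator with each data unit left out in turn.
    leave_one_out = [estimator(data[:i] + data[i + 1:]) for i in range(n)]
    center = sum(leave_one_out) / n
    return (n - 1) / n * sum((t - center) ** 2 for t in leave_one_out)


def bootstrap_variance(data, estimator, n_boot=2000, seed=0):
    """Bootstrap estimate of the variance of estimator(data)."""
    rng = random.Random(seed)
    n = len(data)
    # Resample n units with replacement from the batch of data at hand.
    replicates = [estimator(rng.choices(data, k=n)) for _ in range(n_boot)]
    return statistics.variance(replicates)


# Hypothetical batch of data: 50 simulated standard Gaussian observations.
rng = random.Random(1)
sample = [rng.gauss(0.0, 1.0) for _ in range(50)]
jack = jackknife_variance(sample, statistics.fmean)
boot = bootstrap_variance(sample, statistics.fmean)
print(f"jackknife: {jack:.4f}  bootstrap: {boot:.4f}")
```

For the sample mean, the delete-one jackknife estimate coincides with the familiar s²/n, which gives a simple check on the sketch; the bootstrap estimate should be close to it but depends on the random resamples.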

If we look into the literature of probability theory, the mathematical foundation of statistics, we see that a perturbation argument is central to limiting law results such as the Central Limit Theorem (CLT).

The CLT has been the bedrock for classical statistical theory. One proof of the CLT is composed of two steps and is well exposited in Terence Tao’s lecture notes available at his website (Tao, 2012). Given a normalized sum of i.i.d. random variables, the first step proves the universality of a limiting law through a perturbation argument or Lindeberg’s swapping trick. That is, one proves that a perturbation in the (normalized) sum by a random variable with matching first and second moments does not change the (normalized) sum distribution. The second step finds the limit law by way of solving an ODE.
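The swapping step can be written out as a short telescoping argument. The sketch below is a standard textbook form rather than a transcription of Tao (2012); the notation $S^{(k)}$, $U_k$, the smooth test function $f$ and the finite third moment condition are assumptions introduced here to keep the display short.

```latex
% X_1,...,X_n i.i.d. with mean 0, variance 1 and finite third moment;
% Y_1,...,Y_n i.i.d. N(0,1), independent of the X_i.
% Hybrid sums: the first k summands have been swapped for Gaussians.
S^{(k)} = \frac{Y_1 + \cdots + Y_k + X_{k+1} + \cdots + X_n}{\sqrt{n}},
\qquad k = 0, 1, \ldots, n.

% Telescoping: S^{(0)} is the normalized sum of the X_i, S^{(n)} is N(0,1).
E f\bigl(S^{(0)}\bigr) - E f\bigl(S^{(n)}\bigr)
  = \sum_{k=1}^{n} \Bigl[ E f\bigl(S^{(k-1)}\bigr) - E f\bigl(S^{(k)}\bigr) \Bigr].

% S^{(k-1)} = U_k + X_k/\sqrt{n} and S^{(k)} = U_k + Y_k/\sqrt{n},
% where U_k is independent of (X_k, Y_k). A second-order Taylor expansion
% of f around U_k cancels in expectation because the first two moments match:
\Bigl| E f\bigl(S^{(k-1)}\bigr) - E f\bigl(S^{(k)}\bigr) \Bigr|
  \le \frac{\|f'''\|_\infty}{6}\,
      \frac{E|X_k|^3 + E|Y_k|^3}{n^{3/2}}.

% Summing the n swaps gives a total error of order n^{-1/2}, which vanishes.
```

Each single swap is a small perturbation of the sum, and the limit is stable under it; that is the sense in which the universality step is a stability argument.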

Recent generalizations to obtain other universal limiting distributions can be found in Chatterjee (2006) for Wigner law under non-Gaussian assumptions and in Suidan (2006) for last passage percolation. It is not hard to see that the cornerstone of theoretical high-dimensional statistics, concentration results, also assumes stability-type conditions. In learning theory, stability is closely related to good generalization performance (Devroye and Wagner, 1979, Kearns and Ron, 1999, Bousquet and Elisseeff, 2002, Kutin and Niyogi, 2002, Mukherjee et al., 2006, Shalev-Shwartz et al., 2010).

To further our discussion on stability, we would like to explain what we mean by statistical stability. We say statistical stability holds if statistical conclusions are robust or stable to appropriate perturbations to data. That is, statistical stability is well defined relative to a particular aim and a particular perturbation to data (or model). For example, the aim could be estimation, prediction or limiting law. It is not difficult to get statisticians to agree on what are appropriate data perturbations when data units are i.i.d. or exchangeable in general, in which case subsampling or bootstrap are appropriate. When data units are dependent, transformations of the original data are necessary to arrive at modified data that are close to i.i.d. or exchangeable, such as in parametric bootstrap in linear models or block-bootstrap in time series. When subsampling is carried out, the reduced sample size in the subsample does have an effect on the detectable difference, say between treatment and control. If the difference size is large, this reduction in sample size would be negligible. When the difference size is small, we might not detect the difference with a reduced sample size, leading to a more conservative scientific conclusion. Because of the utmost importance of reproducibility for science, I believe that this conservatism is acceptable and may even be desirable in the current scientific environment of over-claims.

4. Estimation stability: Seeking more stable models than Lasso + CV

For the fMRI problem, let us recall that for each voxel, Lasso or ε-L2Boost is used to fit the mean function in the encoding model with CV to choose the smoothing parameter. Different model selection criteria have been known to be unstable. Breiman (1996) compares predictive stability among forward selection, two versions of garotte and Ridge, and their stability increases in that order. He goes on to propose averaging unstable estimators over different perturbed data sets in order to stabilize unstable estimators. Such estimators are prediction driven, however, and they are not sparse and thereby not suitable for interpretation.

In place of bootstrap for prediction error estimation as in Efron (1982), Zhang (1993) uses multi-fold cross-validation while Shao (1996) uses m out of n bootstrap samples with m ≪ n. They then select models with this estimated prediction error, and provide theoretical results for low dimensional or fixed p linear models. Heuristically, the m out of n bootstrap in Shao (1996) is needed because the model selection procedure is a discrete (or set) valued estimator for the true model predictor set and hence non-smooth (cf. Bickel, Götze, and van Zwet, 1997).

The Lasso (Tibshirani, 1996) is a modern model selection method for linear regression and very popular in high-dimensions:

$$\hat\beta(\lambda) = \arg\min_{\beta \in \mathbb{R}^p} \left\{ \|Y - X\beta\|_2^2 + \lambda \|\beta\|_1 \right\},$$

where $Y \in \mathbb{R}^n$ is the response vector and $X \in \mathbb{R}^{n \times p}$ is the design matrix. That is, there are n data units and p predictors. For each λ, there is a unique L1 norm for its solution that we can use to index the solution as $\hat\beta(\tau)$ where

$$\tau = \tau(\lambda) = \|\hat\beta(\lambda)\|_1.$$
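To make the Lasso criterion and the index τ concrete, here is a minimal pure-Python sketch that minimizes the criterion by cyclic coordinate descent. Coordinate descent is a standard solver chosen here for brevity; it is not claimed to be the algorithm used for the fMRI analysis, and the function names, the fixed number of sweeps and the toy data are assumptions.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t (the proximal map of t * |.|)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, Y, lam, n_sweeps=200):
    """Minimize ||Y - X b||_2^2 + lam * ||b||_1 by cyclic coordinate descent.

    X is a list of n rows, each a list of p numbers; Y is a list of n numbers.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(Y)  # residual Y - X beta, with beta = 0 initially
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Inner product of column j with the residual that excludes feature j.
            rho = sum(X[i][j] * resid[i] for i in range(n)) + col_sq[j] * beta[j]
            # No 1/2 in front of the squared loss, so the threshold is lam / 2.
            new = soft_threshold(rho, lam / 2.0) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


def l1_norm(beta):
    """tau(lambda): the L1 norm of the Lasso solution at lambda."""
    return sum(abs(b) for b in beta)


# Toy data (hypothetical): two orthogonal predictors, four data units.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
Y = [3.0, 0.2, 3.0, 0.2]
beta_hat = lasso_cd(X, Y, lam=2.0)
print(beta_hat, l1_norm(beta_hat))
```

In this toy design the second coefficient is set exactly to zero while the first is shrunk from its least squares value, which is the sparsity-inducing behavior of the L1 penalty.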

Cross-validation (CV) is used most of the time to select λ or τ, but Lasso + CV is unstable relative to bootstrap or subsampling perturbations when predictors are correlated (cf. Meinshausen and Bühlmann, 2010, Bach, 2008).

Using bootstrap in a different manner than Shao (1996), Bach (2008) proposes BoLasso to improve Lasso’s model selection consistency property by taking the smallest intersecting model of selected models over different bootstrap samples. For particular smoothing parameter sequences, the BoLasso selector is shown by Bach (2008) to be model selection consistent for the low dimensional case without the irrepresentable condition needed for Lasso (cf. Meinshausen and Bühlmann, 2006, Zhao and Yu, 2006; Zou, 2006; Wainwright, 2009). Meinshausen and Bühlmann (2010) also weaken the irrepresentable condition for model selection consistency of a stability selection criterion built on top of Lasso. They bring perturbations to a Lasso path through a random scalar vector in the Lasso L1 penalty, resulting in many random Lasso paths. A threshold parameter is needed to distinguish important features based on these random paths. They do not consider the problem of selecting one smoothing parameter value for Lasso as in Lim and Yu (2013).

We would like to seek a specific model along the Lasso path to interpret and hence select a specific λ or τ. It is well known that CV does not provide a good interpretable model because Lasso + CV is unstable. Lim and Yu (2013) propose a stability-based criterion that is termed Estimation Stability (ES). They use the cross-validation data perturbation scheme. That is, n data units are randomly partitioned into V blocks of pseudo data sets of size (n − d) or subsamples where d = ⌊n/V⌋.¹
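The perturbation scheme just described can be sketched in a few lines of Python. The function name, the seed and the convention for handling leftover units are assumptions for illustration: when n is not a multiple of V, the n − V·d units that fall outside every deleted block are simply kept in all pseudo data sets.

```python
import random


def cv_pseudo_datasets(n, V, seed=0):
    """Return V index lists, each of size n - d with d = floor(n / V).

    Pseudo data set v is obtained by deleting the v-th block of d shuffled units.
    """
    d = n // V
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    pseudo = []
    for v in range(V):
        held_out = set(idx[v * d:(v + 1) * d])
        pseudo.append([i for i in idx if i not in held_out])
    return pseudo


# Hypothetical sizes: n = 23 data units and V = 5 blocks, so d = 4.
blocks = cv_pseudo_datasets(n=23, V=5)
print([len(b) for b in blocks])  # each pseudo data set has n - d = 19 units
```

A Lasso fit on each index list then gives the V perturbed estimates that ES compares.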

Given a smoothing parameter λ, a Lasso estimate $\hat\beta_v(\lambda)$ is obtained for the vth block, v = 1, . . . , V. Since the L1 norm is a meaningful quantity to line up the V different estimates, Lim and Yu (2013)² use it, denoted as τ below, to line up these estimates to form an estimate $\hat m(\tau)$ for the mean regression function and an approximate delete-d jackknife estimator for the variance of $\hat m(\tau)$:

$$\hat m(\tau) = \frac{1}{V} \sum_v X\hat\beta_v(\tau),$$

$$\hat T(\tau) = \frac{n-d}{d}\,\frac{1}{V} \sum_v \left( \|X\hat\beta_v(\tau) - \hat m(\tau)\|^2 \right).$$

The last expression is only an approximate delete-d jackknife variance estimator unless $V = \binom{n}{n-d}$, when all the subsamples of size n − d are used. Define the (estimation) statistical stability measure as

$$ES(\tau) = \frac{\frac{1}{V}\sum_v \|X\hat\beta_v(\tau) - \hat m(\tau)\|^2}{\hat m^2(\tau)} = \frac{d}{n-d}\,\frac{\hat T(\tau)}{\hat m^2(\tau)} = \frac{d}{n-d}\,\frac{1}{Z^2(\tau)},$$

where $Z(\tau) = \hat m(\tau)/\sqrt{\hat T(\tau)}$.
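The following Python sketch evaluates these quantities at one value of τ, given the V fitted vectors (X times the vth Lasso estimate) as plain lists. It rests on one labeled assumption: the squared mean estimate in the denominator of ES is read as the squared Euclidean norm of the averaged fit. The function names are hypothetical.

```python
def mean_fit_and_spread(fits):
    """Return the averaged fit and the average squared distance of the fits from it."""
    V, n = len(fits), len(fits[0])
    # Coordinate-wise average of the V fitted vectors.
    m_hat = [sum(f[i] for f in fits) / V for i in range(n)]
    spread = sum(sum((f[i] - m_hat[i]) ** 2 for i in range(n)) for f in fits) / V
    return m_hat, spread


def t_hat(fits, d):
    """Approximate delete-d jackknife variance estimate at one tau."""
    n = len(fits[0])
    _, spread = mean_fit_and_spread(fits)
    return (n - d) / d * spread


def estimation_stability(fits):
    """ES at one tau; assumes the denominator is the squared norm of the mean fit."""
    m_hat, spread = mean_fit_and_spread(fits)
    norm_sq = sum(m ** 2 for m in m_hat)
    return spread / norm_sq


# Toy input (hypothetical): V = 2 fitted vectors for n = 2 data units.
fits = [[1.0, 1.0], [3.0, 3.0]]
print(estimation_stability(fits), t_hat(fits, d=1))
```

Small values of ES mean that the V fits nearly agree relative to the size of their average, so ES rewards solutions whose fitted values change little under the cross-validation perturbation.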

For nonlinear regression functions, ES can still be applied if we take an average of the estimated regression functions. Note that ES aims at estimation stability, while CV aims at prediction stability. In fact, ES is the reciprocal of a test statistic for testing

$$H_0 : X\beta = 0.$$

¹ ⌊x⌋ is the floor function or the largest integer that is smaller than or equal to x.
² It is also fine to use λ to line up the different solutions, but not a good idea to use the ratio of λ and its maximum value for each pseudo data set.

Since $Z(\tau) = \hat m(\tau)/\sqrt{\hat T(\tau)}$ is a test statistic for H0, Z²(τ) is also a test statistic. ES(τ) is a scaled version of the reciprocal 1/Z²(τ).

To combat the high noise situation where ES would not have a well-defined minimum, Lim and Yu (2013) combine ES with CV to propose the ES-CV selection criterion for smoothing parameter τ:

Choose the largest τ that minimizes ES(τ) and is smaller than or equal to the CV selection.
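One literal reading of this rule on a finite grid of candidate τ values is sketched below in Python. The grid, the tie-breaking by exact equality and the function name are assumptions; Lim and Yu (2013) should be consulted for the precise procedure.

```python
def es_cv_select(taus, es_values, cv_errors):
    """Pick tau by ES-CV on a common grid of candidate tau values."""
    # CV selection: the tau with the smallest cross-validated prediction error.
    k_cv = min(range(len(taus)), key=lambda k: cv_errors[k])
    tau_cv = taus[k_cv]
    # Candidates: grid points no larger than the CV choice.
    candidates = [k for k in range(len(taus)) if taus[k] <= tau_cv]
    es_min = min(es_values[k] for k in candidates)
    # Among the minimizers of ES over the candidates, take the largest tau.
    return max(taus[k] for k in candidates if es_values[k] == es_min)


# Hypothetical grid and criterion values, for illustration only.
taus = [0.5, 1.0, 1.5, 2.0, 2.5]
es_values = [0.9, 0.3, 0.3, 0.6, 0.1]
cv_errors = [5.0, 3.0, 2.0, 1.5, 1.8]
print(es_cv_select(taus, es_values, cv_errors))
```

Here CV picks τ = 2.0, the smallest ES value among τ ≤ 2.0 is attained at both 1.0 and 1.5, and the rule returns the larger of the two. The grid point τ = 2.5 has a smaller ES but is excluded because it exceeds the CV selection.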

ES-CV is applicable to smoothing parameter selection in Lasso, and other regularization methods such as Tikhonov or Ridge regularization (see, for example, Tikhonov, 1943, Markovich, 2007, Hoerl, 1962, Hoerl and Kennard, 1970). ES-CV is as well suited for parallel computation as CV, and incurs only a negligible computation overhead because the $\hat m(\tau)$ are already computed for CV. Moreover, simulation studies in Lim and Yu (2013) indicate that, when compared with Lasso + CV, ES-CV applied to Lasso gains dramatically in terms of false discovery rate while it loses only somewhat in terms of true discovery rate.

The features or predictors in the movie-fMRI problem are 3-d Gabor wavelet filters, and each of them is characterized by a (discretized) spatial location on the image, a (discretized) frequency of the filter, a (discretized) orientation of the filter, and 4 (discrete) time-lags on the corresponding image that the 2-d filter is acting on. For the results comparing CV and ES-CV in Figure 1, we have a sample size n = 7,200 and use a reduced set of p = 8,556 features or predictors, corresponding to a coarser set of filter frequencies than what is used in Nishimoto et al. (2011) with p = 26,220 predictors.

Figure 1. For three voxels (one particular subject), we display the (jittered) locations that index the Gabor features selected by CV-Lasso (top row) and ESCV-Lasso (bottom row).

Figure 2. Comparisons of ESCV(Lasso) and CV(Lasso) in terms of model size and prediction correlation. The scatter plots on the left compare ESCV and CV while the histograms on the right display the differences of model size and prediction correlation.

We apply both CV and ES-CV to select the smoothing parameters in Lasso (or ε-L2Boost). For three voxels (and a particular subject), for the simplicity of display, we show the locations of the selected features (regardless of their frequencies, orientations and time-lags) in Figure 1. For these three voxels, ES-CV maintains almost the same prediction correlation performances as CV (0.70 vs. 0.72) while ES-CV selects many fewer and more concentrated locations than CV. Figure 2 shows the comparison results across 2088 voxels in the visual cortex that are selected for their high SNRs. It is composed of four sub-plots. The upper two plots compare prediction correlation performance of the models built via Lasso with CV and ES-CV on validation data. For each model fitted on training data and each voxel, predicted responses over the validation data are calculated. Its correlation with the observed response vector is the “prediction correlation” displayed in Figure 2. The lower two plots compare the sparsity properties of the models or model size. Because of the definition of ES-CV, it is expected that the ES-CV models are always smaller than or equal to the CV models. The sparsity advantage of ES-CV is apparent with a huge overall reduction of 60% on the number of selected features and a minimal loss of overall prediction accuracy by only 1.3%. The average size of the ES-CV models is 24.3 predictors, while that for the CV models is 58.8 predictors; the average prediction correlation performance of the ES-CV models is 0.499, while that for the CV models is 0.506.

5. Sample variability meets robust statistics in high-dimensions

Robust statistics also deals with stability, relative to model perturbation. In the preface of his book “Robust Statistics,” Huber (1981) states:

“Primarily, we are concerned with distributional robustness: the shape of the true underlying distribution deviates slightly from the assumed model.”

Hampel, Rousseeuw, Ronchetti and Stahel (1986) write:

“Overall, and in analogy with, for example, the stability aspects of differential equations or of numerical computations, robustness theories can be viewed as stability theories of statistical inference” (p. 8).

Tukey (1958) has generally been regarded as the first paper on robust statistics. Fundamental contributions were made by Huber (1964) on M-estimation of location parameters, and by Hampel (1968, 1971, 1974) on the “break-down” point and influence curve. Further important contributions can be found, for example, in Andrews et al. (1972) and Bickel (1975) on the one-step Huber estimator, and in Portnoy (1977) for M-estimation in the dependent case.

For most statisticians, robust statistics in linear regression is associated with studying estimation problems when the errors have heavier-tailed distributions than the Gaussian distribution. In the fMRI problem, we fit mean functions with an L2 loss. What if the “errors” have heavier tails than Gaussian tails? Since the L1 loss is commonly used in robust statistics to deal with heavier-tailed errors in regression, we may wonder whether the L1 loss would add more stability to the fMRI problem. In fact, for high-dimensional data such as in our fMRI problem, removing some data units could severely change the outcomes of our model because of feature dependence. This phenomenon is also seen in simulated data from linear models with Gaussian errors in high dimensions.

How does sample to sample variability interact with heavy tail errors in high dimensions? In our recent work (El Karoui et al., 2013), we seek insights into this question through analytical work. We are able to see interactions between sample variability and double-exponential tail errors in a high-dimensional linear regression model. That is, let us assume the following linear regression model

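\[
Y_{n \times 1} = X_{n \times p}\,\beta_{p \times 1} + \varepsilon_{n \times 1},
\]

where

\[
X_i \sim N(0, \Sigma_p) \ \text{i.i.d.}, \qquad \varepsilon_i \ \text{i.i.d.}, \qquad E\varepsilon_i = 0, \qquad E\varepsilon_i^2 = \sigma^2 < \infty.
\]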

An M-estimator with respect to loss function ρ is given as

\[
\hat{\beta} = \operatorname*{argmin}_{\beta \in \mathbb{R}^p} \sum_i \rho(Y_i - X_i'\beta).
\]

We consider the random-matrix high-dimensional regime:

p/n → κ ∈ (0,1).
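To make the setup concrete, the following Python sketch simulates one data set in this regime and computes two M-estimators, OLS (squared loss) and LAD (absolute loss). It is a minimal illustration under our own assumptions: the sizes `n = 200`, `p = 100`, the seed, and a zero true coefficient vector are arbitrary choices, and the LAD fit uses iteratively reweighted least squares (IRLS) as a simple approximation, which is not necessarily the algorithm used in El Karoui et al. (2013).

```python
import numpy as np

def ols_fit(X, y):
    """M-estimator for rho(u) = u**2: ordinary least squares."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def lad_fit_irls(X, y, n_iter=200, floor=1e-6):
    """Approximate M-estimator for rho(u) = |u| (LAD) via IRLS.

    Each step solves a weighted least-squares problem with weights
    1 / max(|residual|, floor); `floor` guards against division by zero.
    """
    beta = ols_fit(X, y)
    for _ in range(n_iter):
        resid = y - X @ beta
        sqrt_w = 1.0 / np.sqrt(np.maximum(np.abs(resid), floor))
        beta = np.linalg.lstsq(X * sqrt_w[:, None], y * sqrt_w, rcond=None)[0]
    return beta

def l1_loss(X, y, beta):
    return float(np.sum(np.abs(y - X @ beta)))

def l2_loss(X, y, beta):
    return float(np.sum((y - X @ beta) ** 2))

# Hypothetical simulation settings: kappa = p / n = 0.5.
rng = np.random.default_rng(0)
n, p = 200, 100
X = rng.standard_normal((n, p))                # rows X_i ~ N(0, I_p)
eps = rng.laplace(loc=0.0, scale=1.0, size=n)  # double-exponential errors
beta_true = np.zeros(p)                        # chosen for convenience
y = X @ beta_true + eps

beta_ols = ols_fit(X, y)
beta_lad = lad_fit_irls(X, y)

print("squared estimation error, OLS:", np.sum((beta_ols - beta_true) ** 2))
print("squared estimation error, LAD:", np.sum((beta_lad - beta_true) ** 2))
```

A single draw like this is noisy; averaging the squared errors over many replications would be needed before comparing the two losses.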

Due to rotation invariance, WLOG, we can assume $\Sigma_p = I_p$ and $\beta = 0$. We cite below a result from El Karoui et al. (2013) for the important special case of $\Sigma_p = I_p$:

Result 1 (El Karoui et al., 2013). Under the aforementioned assumptions, let $r_\rho(p,n) = \|\hat{\beta}\|$; then $\hat{\beta}$ is distributed as

\[
r_\rho(p,n)\,U,
\]

where $U \sim \mathrm{uniform}(S^{p-1}(1))$, and

\[
r_\rho(p,n) \to r_\rho(\kappa),
\]

as $n, p \to \infty$ and $p/n \to \kappa \in (0,1)$.

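Denote

\[
z_\varepsilon := \varepsilon + r_\rho(\kappa) Z,
\]

where $Z \sim N(0,1)$ is independent of $\varepsilon$, and let

\[
\mathrm{prox}_c(\rho)(x) = \operatorname*{argmin}_{y \in \mathbb{R}} \left[ \rho(y) + \frac{(x-y)^2}{2c} \right].
\]

Then $r_\rho(\kappa)$ satisfies the following system of equations together with some nonnegative $c$:

\[
E\{[\mathrm{prox}_c(\rho)]'(z_\varepsilon)\} = 1 - \kappa,
\]

\[
E\{[z_\varepsilon - \mathrm{prox}_c(\rho)(z_\varepsilon)]^2\} = \kappa\, r_\rho^2(\kappa).
\]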
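As a sanity check on Result 1, the system can be solved by hand for the squared loss. The derivation below is our own worked example rather than part of the cited result, and it assumes the scaling $\rho(y) = y^2/2$, which has the same minimizer as OLS. For this loss,

\[
\mathrm{prox}_c(\rho)(x) = \operatorname*{argmin}_{y \in \mathbb{R}} \left[ \frac{y^2}{2} + \frac{(x-y)^2}{2c} \right] = \frac{x}{1+c}, \qquad [\mathrm{prox}_c(\rho)]'(x) = \frac{1}{1+c}.
\]

The first equation gives $1/(1+c) = 1 - \kappa$, that is, $c = \kappa/(1-\kappa)$. Then $z_\varepsilon - \mathrm{prox}_c(\rho)(z_\varepsilon) = \frac{c}{1+c}\, z_\varepsilon = \kappa z_\varepsilon$, and since $E z_\varepsilon^2 = \sigma^2 + r_\rho^2(\kappa)$ by independence of $\varepsilon$ and $Z$, the second equation becomes

\[
\kappa^2 \left( \sigma^2 + r_\rho^2(\kappa) \right) = \kappa\, r_\rho^2(\kappa) \quad \Longrightarrow \quad r_\rho^2(\kappa) = \frac{\kappa \sigma^2}{1 - \kappa}.
\]

This agrees with the classical finite-sample formula $E\|\hat{\beta}\|^2 = \sigma^2 p/(n-p-1)$ for OLS with a standard Gaussian design, whose limit as $p/n \to \kappa$ is $\kappa\sigma^2/(1-\kappa)$.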

In our limiting result, the norm of an M-estimator stabilizes. It is worth mentioning that, in the proof, a “leave-one-out” trick is used both row-wise and column-wise, such that rows are deleted one by one and, similarly, columns are deleted one by one. The estimators with deletions are then compared to the estimator with no deletion. This is in effect a perturbation argument and is reminiscent of the “swapping trick” for proving the CLT as discussed before. Our analytical derivations involve prox functions, which are reminiscent of the second step in proving normality in the CLT. This is because a prox function is a form of derivative, and not dissimilar to the derivative appearing in the ODE derivation of the analytical form of the limiting distribution (e.g., the normal distribution) in the CLT.

In the case of i.i.d. double-exponential errors, El Karoui et al. (2013) numerically solve the two equations in Result 1 to show that when $\kappa \in (0.3,1)$, L2 loss fitting (OLS) is better than L1 loss fitting (LAD) in terms of MSE or variance. They also show that the numerical results match simulation (Monte Carlo) results very well. At a high level, $z_\varepsilon$ holds the key to this interesting phenomenon. Being a weighted convolution of $Z$ and $\varepsilon$, it embeds the interaction between sample variability (expressed in $Z$) and error variability (expressed in $\varepsilon$), and this interaction is captured in the optimal loss function (cf. El Karoui et al., 2013). In other words, $z_\varepsilon$ acts more like a double-exponential variable when the influence of the standard normal $Z$ in $z_\varepsilon$ is not dominant (that is, when $\kappa < 0.3$ or so, as we discover when we solve the equations), and in this case the optimal loss function is closer to the LAD loss. When $\kappa > 0.3$, $z_\varepsilon$ acts more like Gaussian noise, leading to the better performance of OLS (because the optimal loss is closer to LS).
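The two equations in Result 1 can be explored numerically. The Python sketch below is our own illustration and not the numerical procedure of El Karoui et al. (2013): it assumes double-exponential errors with scale 1 (so $\sigma^2 = 2$), replaces expectations by averages over a fixed Monte Carlo sample, and solves the LAD system by a simple fixed-point iteration of our choosing. For $\rho(y) = |y|$ the prox function is soft-thresholding at $c$, which is what the code uses; the function names are hypothetical.

```python
import numpy as np

def r2_ols(kappa, sigma2):
    """Closed-form solution of the system for rho(y) = y**2 / 2."""
    return kappa * sigma2 / (1.0 - kappa)

def r2_lad(kappa, eps, z, max_iter=500, tol=1e-10):
    """Solve the system for rho(y) = |y| by fixed-point iteration.

    For this loss, prox_c is soft-thresholding at c, so
      [prox_c]'(x) = 1{|x| > c}   and   x - prox_c(x) = clip(x, -c, c).
    Expectations are replaced by averages over the fixed samples eps, z.
    """
    r2 = 0.0
    for _ in range(max_iter):
        abs_z_eps = np.abs(eps + np.sqrt(r2) * z)
        # First equation: P(|z_eps| > c) = 1 - kappa, so c is the kappa-quantile.
        c = np.quantile(abs_z_eps, kappa)
        # Second equation: E[min(|z_eps|, c)^2] = kappa * r^2.
        r2_new = float(np.mean(np.minimum(abs_z_eps, c) ** 2) / kappa)
        if abs(r2_new - r2) < tol:
            return r2_new
        r2 = r2_new
    return r2

rng = np.random.default_rng(0)
m = 100_000                                # Monte Carlo sample size (assumption)
eps = rng.laplace(0.0, 1.0, size=m)        # double-exponential, variance 2
z = rng.standard_normal(m)
sigma2 = 2.0

for kappa in (0.1, 0.3, 0.5, 0.7):
    print(f"kappa={kappa}: OLS r^2={r2_ols(kappa, sigma2):.3f}, "
          f"LAD r^2={r2_lad(kappa, eps, z):.3f}")
```

If the sketch is faithful to Result 1, the LAD value should fall below the OLS value for small $\kappa$ and above it for larger $\kappa$, which is the crossover described in the text; the crossover point obtained from such a Monte Carlo approximation should be treated as indicative only.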

Moreover, for double-exponential errors, the M-estimator LAD is an MLE, and we are in a high-dimensional situation. It is well known that the MLE does not work in high dimensions. Remedies have been found through penalized MLE, where a bias is introduced to reduce variance and consequently reduce the MSE. In contrast, when $\kappa \in (0.3,1)$, the better estimator OLS is also unbiased, but nevertheless has a smaller variance. The variance reduction is achieved through a better loss function (LS rather than LAD) and because of a concentration of quadratic forms of the design matrix. This concentration does not hold for fixed orthogonal designs, however. A follow-up work (Bean et al., 2013) addresses the question of obtaining the optimal loss function. The performance of estimators from penalized OLS and penalized LAD when the error distribution is double-exponential is a subject of current research. Preliminary results indicate that similar phenomena occur in non-sparse cases.

Furthermore, simulations with a design matrix from an fMRI experiment and double-exponential errors show the same phenomenon, that is, when $\kappa = p/n > 0.3$ or so, OLS is better than LAD. This provides some assurance for using the L2 loss function in the fMRI project. It is worth noting that El Karoui et al. (2013) contains results for more general settings.

6. Conclusions

In this paper, we cover three problems facing statisticians in the 21st century: figuring out how vision works with fMRI data, developing a smoothing parameter selection method for Lasso, and connecting perturbation in the case of high-dimensional data with classical robust statistics through analytical work. These three problems are tied together by stability. Stability is well defined if we describe the data perturbation scheme for which stability is desirable, and such schemes include bootstrap, subsampling, and cross-validation. Moreover, we briefly review results in the probability literature to explain that stability is driving limiting results such as the Central Limit Theorem, which is a foundation for classical asymptotic statistics.

Using these three problems as a backdrop, we make four points. Firstly, statistical stability considerations can effectively aid the pursuit of interpretable and reliable scientific models, especially in high dimensions. Stability in a broad sense includes replication, repeatability, and different data perturbation schemes. Secondly, stability is a general principle on which to build statistical methods for different purposes. Thirdly, the meaning of stability needs articulation in high dimensions because it could be brought about by sample variability and/or heavy tails in the errors of a linear regression model. Last but not least, emphasis should be placed on the stability aspects of statistical inference and conclusions, in the referee process of scientific and applied statistics papers and in our current statistics curriculum.

Statistical stability in the age of massive data is an important area for research and action because high dimensions provide ample opportunities for instability to reveal itself and to challenge the reproducibility of scientific findings.

As we began this article with words of Tukey, it seems fitting to end also with his words:

“What of the future? The future of data analysis can involve great progress, the overcoming of real difficulties, and the provision of a great service to all fields of science and technology. Will it? That remains to us, to our willingness to take up the rocky road of real problems in preference to the smooth road of unreal assumptions, arbitrary criteria, and abstract results without real attachments. Who is for the challenge?” – Tukey (p. 64, 1962).

Acknowledgements

This paper is based on the 2012 Tukey Lecture of the Bernoulli Society delivered by the author at the 8th World Congress of Probability and Statistics in Istanbul on July 9, 2012. For their scientific influence and friendship, the author is indebted to her teachers/mentors/colleagues, the late Professor Lucien Le Cam, Professor Terry Speed, the late Professor Leo Breiman, Professor Peter Bickel, and Professor Peter Bühlmann. This paper was invited for the special issue of Bernoulli commemorating the 300th anniversary of the publication of Jakob Bernoulli’s Ars Conjectandi in 1713.

The author would like to thank Yuval Benjamini for his help in generating the results in the figures. She would also like to thank two referees for their detailed and insightful comments, and Yoav Benjamini and Victoria Stodden for helpful discussions. Partial support from NSF Grants SES-0835531 (CDI) and DMS-11-07000, ARO Grant W911NF-11-1-0114, and the NSF Science and Technology Center on Science of Information through Grant CCF-0939370 is gratefully acknowledged.

References

Allen, D.M. (1974). The relationship between variable selection and data augmentation and a method for prediction. Technometrics 16 125–127. MR0343481

Andrews, D.F., Bickel, P.J., Hampel, F.R., Huber, P.J., Rogers, W.H. and Tukey, J.W. (1972). Robust Estimates of Location: Survey and Advances. Princeton, NJ: Princeton Univ. Press. MR0331595

Atkil, H., Martone, M.E. and Essen, D.C.V. (2012). Challenges and opportunities in mining neuroscience data. Science 331 708–712.

Bach, F. (2008). Bolasso: Model consistent lasso estimation through the bootstrap. In Proc. of ICML. Helsinki, Finland.

Bean, D., Bickel, P.J., El Karoui, N. and Yu, B. (2013). Optimal M-estimation in high-dimensional regression. Proc. Natl. Acad. Sci. USA. To appear.

Beran, R. (1984). Bootstrap methods in statistics. Jahresber. Deutsch. Math.-Verein. 86 14–30. MR0736625

Bickel, P.J. (1975). One-step Huber estimates in the linear model. J. Amer. Statist. Assoc. 70 428–434. MR0386168

Bickel, P.J. and Freedman, D.A. (1981). Some asymptotic theory for the bootstrap. Ann. Statist. 9 1196–1217. MR0630103

Bickel, P.J., Götze, F. and van Zwet, W.R. (1997). Resampling fewer than n observations: Gains, losses, and remedies for losses. Statist. Sinica 7 1–31. MR1441142

Booth, B. (2012). Scientific reproducibility: Begley’s six rules. Forbes September 26.

Bousquet, O. and Elisseeff, A. (2002). Stability and generalization. J. Mach. Learn. Res. 2 499–526. MR1929416

Breiman, L. (1996). Heuristics of instability and stabilization in model selection. Ann. Statist. 24 2350–2383. MR1425957

Carlstein, E. (1986). The use of subseries values for estimating the variance of a general statistic from a stationary sequence. Ann. Statist. 14 1171–1179. MR0856813

Casadevall, A. and Fang, F.C. (2011). Reforming science: Methodological and cultural reforms. Infection and Immunity 80 891–896.

Chatterjee, S. (2006). A generalization of the Lindeberg principle. Ann. Probab. 34 2061–2076. MR2294976

Dayan, P. and Abbott, L.F. (2005). Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems. Cambridge, MA: MIT Press. MR1985615

Devroye, L.P. and Wagner, T.J. (1979). Distribution-free inequalities for the deleted and holdout error estimates. IEEE Trans. Inform. Theory 25 202–207. MR0521311

Donoho, D.L., Maleki, A., Shahram, M., Rahman, I.U. and Stodden, V. (2009). Reproducible research in computational harmonic analysis. IEEE Computing in Science and Engineering 11 8–18.

Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Ann. Statist. 7 1–26. MR0515681

Efron, B. (1982). The Jackknife, the Bootstrap and Other Resampling Plans. CBMS-NSF Regional Conference Series in Applied Mathematics 38. Philadelphia, PA: SIAM. MR0659849

Efron, B. and Tibshirani, R.J. (1993). An Introduction to the Bootstrap. Monographs on Statistics and Applied Probability 57. New York: Chapman & Hall. MR1270903

El Karoui, N., Bean, D., Bickel, P.J., Lim, C. and Yu, B. (2013). On robust regression with high-dimensional predictors. Proc. Natl. Acad. Sci. USA. To appear.

Fonio, E., Golani, I. and Benjamini, Y. (2012). Measuring behavior of animal models: Faults and remedies. Nature Methods 9 1167–1170.

Goodale, M.A. and Milner, A.D. (1992). Separate visual pathways for perception and action. Trends Neurosci. 15 20–25.

Hall, P. (1983). Large sample optimality of least squares cross-validation in density estimation. Ann. Statist. 11 1156–1174. MR0720261

Hampel, F.R. (1968). Contributions to the theory of robust estimation. Ph.D. thesis, Univ. California, Berkeley. MR2617979

Hampel, F.R. (1971). A general qualitative definition of robustness. Ann. Math. Statist. 42 1887–1896. MR0301858

Hampel, F.R. (1974). The influence curve and its role in robust estimation. J. Amer. Statist. Assoc. 69 383–393. MR0362657

Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J. and Stahel, W.A. (1986). Robust Statistics: The Approach Based on Influence Functions. Wiley Series in Probability and Mathematical Statistics: Probability and Mathematical Statistics. New York: Wiley. MR0829458

Hartigan, J.A. (1969). Using subsample values as typical values. J. Amer. Statist. Assoc. 64 1303–1317. MR0261737

Hartigan, J.A. (1975). Necessary and sufficient conditions for asymptotic joint normality of a statistic and its subsample values. Ann. Statist. 3 573–580. MR0391346

Hinkley, D.V. (1977). Jackknifing in unbalanced situations. Technometrics 19 285–292. MR0458734

Hoerl, A.E. (1962). Application of ridge analysis to regression problems. Chemical Engineering Progress 58 54–59.

Hoerl, A.E. and Kennard, R.W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 42 80–86.

Hubel, D.H. and Wiesel, T.N. (1959). Receptive fields of single neurones in the cat’s striate cortex. Journal of Physiology 148 574–591.

Huber, P.J. (1964). Robust estimation of a location parameter. Ann. Math. Statist. 35 73–101. MR0161415

Huber, P.J. (1981). Robust Statistics. New York: Wiley. MR0606374

Huber, P.J. (2002). John W. Tukey’s contributions to robust statistics. Ann. Statist. 30 1640–1648. MR1969444

Ioannidis, J.P.A. (2005). Why most published research findings are false. PLoS Med. 2 696–701.

Kay, K.N. and Gallant, J.L. (2009). I can see what you see. Nat. Neurosci. 12 245.

Kay, K.N., Naselaris, T., Prenger, R.J. and Gallant, J.L. (2008). Identifying natural images from human brain activity. Nature 452 352–355.

Kearns, M. and Ron, D. (1999). Algorithmic stability and sanity-check bounds for leave-one-out cross-validation. Neural Comput. 11 1427–1453.

Kraft, P., Zeggini, E. and Ioannidis, J.P.A. (2009). Replication in genome-wide association studies. Statist. Sci. 24 561–573. MR2779344

Künsch, H.R. (1989). The jackknife and the bootstrap for general stationary observations. Ann. Statist. 17 1217–1241. MR1015147

Kutin, S. and Niyogi, P. (2002). Almost-everywhere algorithmic stability and generalization error. In Proc. of UAI: Uncertainty in Artificial Intelligence 18.

Li, K.C. (1986). Asymptotic optimality of $C_L$ and generalized cross-validation in ridge regression with application to spline smoothing. Ann. Statist. 14 1101–1112. MR0856808

Lim, C. and Yu, B. (2013). Estimation stability with cross-validation (ES-CV). Available at arXiv.org/abs/1303.3128.

Mahalanobis, P. (1946). Sample surveys of crop yields in India. Sankhya, Series A 7 269–280.

Markovich, N. (2007). Nonparametric Analysis of Univariate Heavy-Tailed Data: Research and Practice. Wiley Series in Probability and Statistics. Chichester: Wiley. MR2364666

Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the lasso. Ann. Statist. 34 1436–1462. MR2278363

Meinshausen, N. and Bühlmann, P. (2010). Stability selection. J. R. Stat. Soc. Ser. B Stat. Methodol. 72 417–473. MR2758523

Miller, R.G. (1974). The jackknife - A review. Biometrika 61 1–15. MR0391366

Mukherjee, S., Niyogi, P., Poggio, T. and Rifkin, R. (2006). Learning theory: Stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization. Adv. Comput. Math. 25 161–193. MR2231700

Naik, G. (2011). Scientists’ elusive goal: Reproducing study results. Wall Street Journal (Health Industry Section) December 2.

Naselaris, T., Prenger, R.J., Kay, K.N., Oliver, M. and Gallant, J.L. (2009). Bayesian reconstruction of natural images from human brain activity. Neuron 63 902–915.

Naselaris, T., Kay, K.N., Nishimoto, S. and Gallant, J.L. (2011). Encoding and decoding in fMRI. Neuroimage 56 400–410.

Nishimoto, S., Vu, A.T., Naselaris, T., Benjamini, Y., Yu, B. and Gallant, J.L. (2011). Reconstructing visual experiences from brain activity evoked by natural movies. Current Biology 21 1641–1646.

Nosek, B.A., Spies, J.R. and Motyl, M. (2012). Scientific utopia: II. Restructuring incentives and practices to promote truth over publishability. In Proc. of CoRR.

Olshausen, B.A. and Field, D.J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 381 607–609.

Politis, D.N. and Romano, J.P. (1992). A general theory for large sample confidence regions based on subsamples under minimal assumptions. Technical Report 399. Dept. Statistics, Stanford Univ.

Politis, D.N., Romano, J.P. and Wolf, M. (1999). Subsampling. New York: Springer. MR1707286

Portnoy, S.L. (1977). Robust estimation in dependent situations. Ann. Statist. 5 22–43. MR0445716

Quenouille, M.H. (1949). Approximate tests of correlation in time-series. J. R. Stat. Soc. Ser. B Stat. Methodol. 11 68–84. MR0032176

Quenouille, M.H. (1956). Notes on bias in estimation. Biometrika 43 353–360. MR0081040

Shalev-Shwartz, S., Shamir, O., Srebro, N. and Sridharan, K. (2010). Learnability, stability and uniform convergence. J. Mach. Learn. Res. 11 2635–2670. MR2738779

Shao, J. (1996). Bootstrap model selection. J. Amer. Statist. Assoc. 91 655–665. MR1395733

Shao, J. and Tu, D.S. (1995). The Jackknife and Bootstrap. New York: Springer. MR1351010

Steen, R.G. (2011). Retractions in the scientific literature: Do authors deliberately commit fraud? J. Med. Ethics 37 113–117.

Stodden, V. (2011). Trust your science? Open your data and code. AMSTATNEWS. Available at http://magazine.amstat.org/blog/2011/07/01/trust-your-science/.

Stone, M. (1974). Cross-validatory choice and assessment of statistical prediction. J. R. Stat. Soc. Ser. B Stat. Methodol. 36 111–147.

Suidan, T. (2006). A remark on a theorem of Chatterjee and last passage percolation. J. Phys. A 39 8977–8981. MR2240468

Tao, T. (2012). Lecture notes on the central limit theorem. Available at http://terrytao.wordpress.com/2010/01/05/254a-notes-2-the-central-limit-theorem/.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Stat. Methodol. 58 267–288. MR1379242

Tikhonov, A.N. (1943). On the stability of inverse problems. Doklady Akademii Nauk SSSR 39 195–198.

Tukey, J.W. (1958). Bias and confidence in not quite large samples. Ann. Math. Statist. 29 614.

Tukey, J.W. (1962). The future of data analysis. Ann. Math. Statist. 33 1–67. MR0133937

Wainwright, M.J. (2009). Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (Lasso). IEEE Trans. Inform. Theory 55 2183–2202. MR2729873

Wu, C.F.J. (1986). Jackknife, bootstrap and other resampling methods in regression analysis (with discussion). Ann. Statist. 14 1261–1295. MR0868303

Zhang, P. (1993). Model selection via multifold cross validation. Ann. Statist. 21 299–313. MR1212178

Zhao, P. and Yu, B. (2006). On model selection consistency of Lasso. J. Mach. Learn. Res. 7 2541–2563. MR2274449

Zou, H. (2006). The adaptive lasso and its oracle properties. J. Amer. Statist. Assoc. 101 1418–1429. MR2279469
