The Bootstrap and Beyond: Using JSL for Resampling

Copyright © 2012, SAS Institute Inc. All rights reserved. THE BOOTSTRAP AND BEYOND: USING JSL FOR RESAMPLING Michael Crotty & Clay Barker Research Statisticians JMP Division, SAS Institute

Post on 29-Nov-2014

DESCRIPTION

This presentation was originally given live at JMP Discovery Summit 2013 in San Antonio, Texas, USA. To sign up to attend this year's conference, visit http://jmp.com/summit

TRANSCRIPT

Page 1: The Bootstrap and Beyond: Using JSL for Resampling

THE BOOTSTRAP AND BEYOND:

USING JSL FOR RESAMPLING

• Michael Crotty & Clay Barker

• Research Statisticians

• JMP Division, SAS Institute

Page 2: The Bootstrap and Beyond: Using JSL for Resampling

OUTLINE

• Nonparametric Bootstrap

• Permutation Testing

• Parametric Bootstrap

• Bootstrap Aggregating (Bagging)

• Bootstrapping in R

• Stability Selection

• Wrap-up/Questions

Page 3: The Bootstrap and Beyond: Using JSL for Resampling

NONPARAMETRIC BOOTSTRAP

Page 4: The Bootstrap and Beyond: Using JSL for Resampling

NONPARAMETRIC BOOTSTRAP: INTRODUCTION TO THE BOOTSTRAP

• Introduced by Brad Efron in 1979; has grown in popularity as computing power has increased

• Resampling technique that allows you to estimate the variance of statistics, even when analytical expressions for the variance are difficult to obtain

• You want to know about the population, but all you have is one sample

• Treat the sample as a population and sample from it with replacement

• This is called a bootstrap sample

• Repeating this sampling scheme produces bootstrap replications

• For each bootstrap sample, you can calculate the statistic(s) of interest
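The resampling scheme on this slide can be sketched in a few lines. The block below is a minimal illustration in Python rather than JSL; the data values, the function names `bootstrap_replicates` and `percentile_interval`, and the choice of the median as the statistic are assumptions made for the example and do not come from the talk.

```python
import random
import statistics


def bootstrap_replicates(sample, stat, n_boot=2000, seed=1):
    """Return n_boot bootstrap replicates of stat(sample)."""
    rng = random.Random(seed)
    n = len(sample)
    reps = []
    for _ in range(n_boot):
        # A bootstrap sample: n draws from the data, with replacement.
        resample = [sample[rng.randrange(n)] for _ in range(n)]
        reps.append(stat(resample))
    return reps


def percentile_interval(reps, alpha=0.05):
    """Percentile interval: empirical alpha/2 and 1 - alpha/2 quantiles."""
    ordered = sorted(reps)
    lo = ordered[int((alpha / 2) * len(ordered))]
    hi = ordered[int((1 - alpha / 2) * len(ordered)) - 1]
    return lo, hi


# Hypothetical data, used only to exercise the functions.
data = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 4.9, 3.1, 3.6]
reps = bootstrap_replicates(data, statistics.median)
low, high = percentile_interval(reps)
print(f"bootstrap SE of the median: {statistics.stdev(reps):.3f}")
print(f"95% percentile interval: ({low:.2f}, {high:.2f})")
```

The standard deviation of the replicates serves as the bootstrap standard error, and the interval is read directly from the sorted replicates.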

Page 5: The Bootstrap and Beyond: Using JSL for Resampling

NONPARAMETRIC BOOTSTRAP: BOOTSTRAP WORLD

• Efron & Tibshirani (1993) diagram of the Real world and the Bootstrap world to illustrate why bootstrapping works

Page 6: The Bootstrap and Beyond: Using JSL for Resampling

NONPARAMETRIC BOOTSTRAP: THE BOOTSTRAP IN JMP

• Possible to do a bootstrap analysis prior to JMP 10 using a script

• “One-click bootstrap” added to JMP Pro in Version 10

• Available in most Analysis platforms

• Rows need to be independent for one-click bootstrap to be implemented

• Takes advantage of the Automatic Recalc feature

• Results can be analyzed in the Distribution platform, which will know to provide Bootstrap Confidence Limits, based on the percentile interval method (Efron & Tibshirani 1993)

Page 7: The Bootstrap and Beyond: Using JSL for Resampling

NONPARAMETRIC BOOTSTRAP: NON-STANDARD QUANTITIES

• By non-standard, I mean statistics for which we don’t readily have standard errors

• Could be unavailable in JMP

• Could be difficult to obtain analytically

• Example: Adjusted R^2 value in linear regression
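As a rough illustration of bootstrapping a non-standard quantity, the sketch below resamples rows and recomputes the adjusted R^2 of a one-predictor regression each time. It is written in Python rather than JSL; the simulated data and the function names `adjusted_r2` and `bootstrap_adjusted_r2` are assumptions for the example.

```python
import random


def adjusted_r2(xs, ys, p=1):
    """Adjusted R^2 for a least-squares line of ys on xs (p = 1 predictor)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # degenerate resample: R^2 is undefined
    r2 = sxy ** 2 / (sxx * syy)
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


def bootstrap_adjusted_r2(xs, ys, n_boot=1000, seed=1):
    """Sorted bootstrap replicates of adjusted R^2, resampling rows."""
    rng = random.Random(seed)
    n = len(xs)
    reps = []
    while len(reps) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # rows, with replacement
        value = adjusted_r2([xs[i] for i in idx], [ys[i] for i in idx])
        if value is not None:
            reps.append(value)
    return sorted(reps)


# Simulated data (an assumption for the example, not from the talk).
rng = random.Random(42)
xs = [i / 2 for i in range(30)]
ys = [1.0 + 0.8 * x + rng.gauss(0, 1.5) for x in xs]
reps = bootstrap_adjusted_r2(xs, ys)
print("observed adjusted R^2:", round(adjusted_r2(xs, ys), 3))
print("95% percentile interval:", round(reps[25], 3), round(reps[974], 3))
```

The interval endpoints are the 2.5% and 97.5% points of the 1000 sorted replicates, in the spirit of the percentile interval method mentioned earlier.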

Page 8: The Bootstrap and Beyond: Using JSL for Resampling

NONPARAMETRIC BOOTSTRAP: RECAP

• Bootstrap is a powerful feature with many uses

• Primarily a UI feature, but capability is enhanced when scripted in JSL

• Allows us to get confidence intervals for statistics, functions of statistics and curves

• Examples from Discovery 2012:

• Non-standard quantities

• Functions of the output

• Multiple tables in one bootstrap run

• Model from the Fit Curve platform

Page 9: The Bootstrap and Beyond: Using JSL for Resampling

PERMUTATION TESTS

Page 10: The Bootstrap and Beyond: Using JSL for Resampling

PERMUTATION TESTS: INTRODUCTION

• Introduced by R.A. Fisher in the 1930s

• Fisher wanted to demonstrate the validity of Student’s t test without the normality assumption

• Provide exact results, but only apply to a narrow range of problems

• Must have something to permute (i.e., change the order of)

• Bootstrap hypothesis testing extends permutation testing to more problems

Page 11: The Bootstrap and Beyond: Using JSL for Resampling

PERMUTATION TESTS: CONCEPTS

• Basic idea:

• Sample repeatedly from the permutation distribution

• Note that this is sampling without replacement (not with replacement)

• Resampling purpose is to permute (change the order of) the observations

• Compare the number of results more extreme than the observed result

• Calculate a p-value: $p = \dfrac{\#\{\text{results more extreme}\}}{\#\{\text{total iterations}\}}$

Page 12: The Bootstrap and Beyond: Using JSL for Resampling

PERMUTATION TESTS: TWO-SAMPLE EXAMPLE

• Consider comparing the (possibly different) distributions ($F$, $G$) of two samples (sizes $n$, $m$ with $N = n + m$)

• $H_0$: $F = G$

• Under $H_0$, all permutations of the observations across $F$, $G$ are equally likely

• There are $\binom{N}{n}$ possible permutations; generally sampling from these is sufficient.

• For each permutation replication, determine if the difference ($\hat{\theta}^*$) is at least as large as the observed difference ($\hat{\theta}$).

• Tabulate the number of times $\hat{\theta}^* \ge \hat{\theta}$ and divide by the number of replications.

• This is a one-sided permutation test; a two-sided test can be performed by taking absolute values of $\hat{\theta}^*$ and $\hat{\theta}$.
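The two-sample procedure on this slide can be sketched as follows. This is a Python illustration rather than the JSL used in the demonstrations; the difference in means as the statistic, the sample values, and the function name `permutation_test` are assumptions for the example.

```python
import random


def permutation_test(x, y, n_perm=5000, two_sided=False, seed=1):
    """Monte Carlo permutation p-value for a difference in means."""
    rng = random.Random(seed)
    n = len(x)
    pooled = list(x) + list(y)

    def diff(values):
        # The first n values play the role of sample x, the rest of sample y.
        first, second = values[:n], values[n:]
        return sum(first) / len(first) - sum(second) / len(second)

    observed = diff(pooled)
    if two_sided:
        observed = abs(observed)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reorder the pooled data, without replacement
        theta_star = diff(pooled)
        if two_sided:
            theta_star = abs(theta_star)
        if theta_star >= observed:
            count += 1
    return count / n_perm


# Hypothetical samples, used only to exercise the function.
treated = [12.1, 14.3, 13.8, 15.2, 14.9, 13.5]
control = [11.2, 12.0, 12.9, 11.8, 13.1, 12.4]
print("one-sided p-value:", permutation_test(treated, control))
print("two-sided p-value:", permutation_test(treated, control, two_sided=True))
```

Because only a random subset of the possible permutations is visited, the result is a Monte Carlo estimate of the exact permutation p-value.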

Page 13: The Bootstrap and Beyond: Using JSL for Resampling

PERMUTATION TESTS: DEMONSTRATION #1

• Oneway platform in JMP can compute robust mean estimates

• Test included is a Wald test with an asymptotic Chi Square distribution p-value

• We wish to use a permutation test to avoid the distributional assumption of the asymptotic test

Page 14: The Bootstrap and Beyond: Using JSL for Resampling

PERMUTATION TESTS: DEMONSTRATION #1

• Script input:

• continuous response

• categorical predictor

• # of permutation replications

• Script output:

• original Robust Fit

• permutation test results

newX = xVals[random shuffle(iVec)];

// set the x values to a random permutation

column(eval expr(xCol)) << set values(newX);

Page 15: The Bootstrap and Beyond: Using JSL for Resampling

PERMUTATION TESTS: DEMONSTRATION #1

Robust mean permutation test demo

Page 16: The Bootstrap and Beyond: Using JSL for Resampling

PERMUTATION TESTS: DEMONSTRATION #2

• Contingency platform in JMP performs two Chi Square tests for testing if responses differ across levels of the X variable

• Tests require that expected counts of contingency table cells be > 5

• We wish to use a permutation test to avoid this requirement

Page 17: The Bootstrap and Beyond: Using JSL for Resampling

PERMUTATION TESTS: DEMONSTRATION #2

• Script input:

• categorical response

• categorical predictor

• # of permutation replications

• Script output:

• results of two original Chi Square tests

• permutation test results

newY = responseVals[random shuffle(ivec)];

column(eval expr(responseCol)) << set values(newY);
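The same idea can be sketched outside JMP. The Python block below shuffles the response labels and recomputes a Pearson chi-square statistic each time; it is not the demonstration script, and the labels and the function names `chi_square` and `permutation_chi_square` are assumptions for the example.

```python
import random
from collections import Counter


def chi_square(xs, ys):
    """Pearson chi-square statistic for the table of xs by ys."""
    n = len(xs)
    cell = Counter(zip(xs, ys))
    row = Counter(xs)
    col = Counter(ys)
    stat = 0.0
    for r in row:
        for c in col:
            expected = row[r] * col[c] / n
            stat += (cell[(r, c)] - expected) ** 2 / expected
    return stat


def permutation_chi_square(xs, ys, n_perm=2000, seed=1):
    """Observed statistic and its Monte Carlo permutation p-value."""
    rng = random.Random(seed)
    observed = chi_square(xs, ys)
    shuffled = list(ys)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # permute the responses; margins stay fixed
        # Small tolerance guards against floating-point ties.
        if chi_square(xs, shuffled) >= observed - 1e-12:
            count += 1
    return observed, count / n_perm


# Hypothetical small table where some expected counts fall below 5.
group = ["a"] * 6 + ["b"] * 5 + ["c"] * 4
response = ["yes"] * 5 + ["no"] + ["yes"] * 2 + ["no"] * 3 + ["no"] * 4
stat, p = permutation_chi_square(group, response)
print(f"chi-square = {stat:.3f}, permutation p-value = {p:.4f}")
```

No reference to the chi-square distribution is needed, so the expected-count requirement does not arise.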

Page 18: The Bootstrap and Beyond: Using JSL for Resampling

PERMUTATION TESTS: DEMONSTRATION #2

Contingency table permutation test demo

Page 19: The Bootstrap and Beyond: Using JSL for Resampling

PERMUTATION TESTS: RECAP

• When available, full permutation tests provide exact results

• …and incomplete permutation tests still give good results

• If something can be permuted, these tests are easy to implement

• For other significance testing situations, bootstrap hypothesis testing works

Page 20: The Bootstrap and Beyond: Using JSL for Resampling

PARAMETRIC BOOTSTRAP

Page 21: The Bootstrap and Beyond: Using JSL for Resampling

PARAMETRIC BOOTSTRAP: INTRODUCTION

• JMP provides a one-click non-parametric bootstrap feature.

• But there are other variations of the bootstrap: residual resampling, Bayesian bootstrap, …

• This section will provide an introduction to the parametric bootstrap and how it can be implemented in JSL.

Page 22: The Bootstrap and Beyond: Using JSL for Resampling

PARAMETRIC BOOTSTRAP: WHY NOT SAMPLE ROWS?

• There are times when we may not want to resample rows of our data.

[Figure: nonparametric bootstrap sample]

Page 23: The Bootstrap and Beyond: Using JSL for Resampling

PARAMETRIC BOOTSTRAP: WHY NOT SAMPLE ROWS?

• For a sigmoid curve, we don’t want to lose any sections of the curve:

• Upper asymptote

• Lower asymptote

• Inflection point

• Similar issues arise with logistic regression, where resampling rows can lead to problems with separation.

• There are several alternatives to resampling rows; we will focus on the parametric bootstrap.

Page 24: The Bootstrap and Beyond: Using JSL for Resampling

PARAMETRIC BOOTSTRAP: DETAILS

• Nearly identical to the nonparametric bootstrap, except for the way that we generate our bootstrap samples.

• Nonparametric bootstrap samples from the empirical distribution of our data.

• Parametric bootstrap samples from the fitted parametric model for our data.

Suppose we use $F(\beta)$ to model our response $Y$. Fitting the model to our observed data gives us $F(\hat{\beta})$, our fitted parametric model.

A parametric bootstrap algorithm is

1. Obtain $F(\hat{\beta})$ by fitting the parametric model to the observed data.

2. Use $F(\hat{\beta})$ to generate $Y_j^*$, a vector of random pseudo-responses.

3. Fit $F(\beta)$ to $Y_j^*$, giving us $\hat{\beta}_j^*$.

4. Store $\hat{\beta}_j^*$ and return to step 2, for $j = 1, \ldots, B$.
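The four steps can be sketched for a simpler model than the sigmoid curve. The Python block below assumes a straight line with normal errors so that the fit has a closed form; the model choice, the simulated data, and the function names `fit_line` and `parametric_bootstrap` are assumptions for the example, not the talk's JSL implementation.

```python
import random


def fit_line(xs, ys):
    """Least-squares fit of y = b0 + b1*x; returns (b0, b1, sigma)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    b0 = my - b1 * mx
    sse = sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))
    return b0, b1, (sse / (n - 2)) ** 0.5


def parametric_bootstrap(xs, ys, n_boot=1000, seed=1):
    """Return the original fit and n_boot refits to simulated responses."""
    rng = random.Random(seed)
    b0, b1, sigma = fit_line(xs, ys)  # step 1: fit to the observed data
    reps = []
    for _ in range(n_boot):  # step 4: repeat for j = 1, ..., B
        # Step 2: pseudo-responses drawn from the fitted model.
        y_star = [b0 + b1 * x + rng.gauss(0, sigma) for x in xs]
        # Step 3: refit the same model to the pseudo-responses.
        reps.append(fit_line(xs, y_star))
    return (b0, b1, sigma), reps


# Simulated data (an assumption for the example).
rng = random.Random(7)
xs = [i / 4 for i in range(40)]
ys = [0.6 + 0.5 * x + rng.gauss(0, 0.3) for x in xs]
fit, reps = parametric_bootstrap(xs, ys)
slopes = sorted(r[1] for r in reps)
print("fitted slope:", round(fit[1], 3))
print("95% percentile interval:", round(slopes[25], 3), round(slopes[974], 3))
```

Every pseudo-sample keeps the original x values, so no section of the curve is ever lost to resampling.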

Page 25: The Bootstrap and Beyond: Using JSL for Resampling

PARAMETRIC BOOTSTRAP: SIGMOID CURVE EXAMPLE

• A few slides ago, we saw an example of a sigmoid curve.

$F(\beta) = g(x, \beta) = \beta_3 + \dfrac{\beta_4 - \beta_3}{1 + \exp[-\beta_1 (x - \beta_2)]}$

• Assume the response is normally distributed: $y \sim N(g(x, \beta), \sigma^2)$.

• Fitting the curve to our original data set gives us an estimate for our coefficients and error variance.

Term       β1     β2     β3     β4     σ

Estimate   5.72   0.32   0.60   1.11   0.011

Page 26: The Bootstrap and Beyond: Using JSL for Resampling

PARAMETRIC BOOTSTRAP: SIGMOID EXAMPLE CONTINUED

• Using our estimated coefficients and error variance:

$y_j^* = g(x_j, \hat{\beta}) + \hat{\sigma} \epsilon_j$

where the $\epsilon_j$ are independent and identically distributed standard normal.

[Figure: parametric bootstrap sample]

Page 27: The Bootstrap and Beyond: Using JSL for Resampling

PARAMETRIC BOOTSTRAP: MORE DETAILS

• Resampling rows can be problematic.

• Any other reasons to use the parametric bootstrap?

• Results will be close to “textbook” formulae when available.

• Very nice for doing goodness of fit tests.

• Example: Normal goodness of fit test

$H_0$: $N(\mu, \sigma)$ appropriate for $y$ vs. $H_1$: $N(\mu, \sigma)$ not appropriate

Parametric bootstrap gives us the distribution of the test statistic under $H_0$ → perfect for calculating p-values
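A minimal sketch of the normal goodness of fit example follows, in Python. The Kolmogorov-Smirnov distance is used here as the test statistic, which is an assumption (the slide does not name one), as are the sample and the function names `ks_statistic` and `normal_gof_pvalue`.

```python
import random
from statistics import NormalDist, mean, stdev


def ks_statistic(sample):
    """KS distance between the sample and a normal fitted to that sample.

    Assumes the sample has nonzero standard deviation.
    """
    fitted = NormalDist(mean(sample), stdev(sample))
    ordered = sorted(sample)
    n = len(ordered)
    d = 0.0
    for i, value in enumerate(ordered):
        cdf = fitted.cdf(value)
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d


def normal_gof_pvalue(sample, n_boot=1000, seed=1):
    """Parametric bootstrap p-value for H0: the data are normal."""
    rng = random.Random(seed)
    mu, sigma = mean(sample), stdev(sample)
    observed = ks_statistic(sample)
    exceed = 0
    for _ in range(n_boot):
        # Simulate under H0 from the fitted normal, then re-estimate the
        # parameters inside ks_statistic, as was done for the observed data.
        y_star = [rng.gauss(mu, sigma) for _ in sample]
        if ks_statistic(y_star) >= observed:
            exceed += 1
    return observed, exceed / n_boot


# Hypothetical sample drawn from a normal, so H0 should look plausible.
rng = random.Random(11)
y = [rng.gauss(10, 2) for _ in range(25)]
d_obs, p = normal_gof_pvalue(y)
print(f"KS distance = {d_obs:.3f}, bootstrap p-value = {p:.3f}")
```

Refitting the parameters inside each replicate matters: it is what lets the simulated statistics stand in for the null distribution when $\mu$ and $\sigma$ are estimated.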

Page 28: The Bootstrap and Beyond: Using JSL for Resampling

PARAMETRIC BOOTSTRAP: …AND JMP

• JMP provides a one-click nonparametric bootstrap, but a little bit of scripting gets us the parametric bootstrap as well.

• Most (all?) modeling platforms in JMP allow you to save a prediction formula.

We can also create columns of random values for many distributions.

Put these two things together and we’re well on our way!

Page 29: The Bootstrap and Beyond: Using JSL for Resampling

PARAMETRIC BOOTSTRAPPING: DEMONSTRATION

Parametric Bootstrapping Demonstration in JMP

Page 30: The Bootstrap and Beyond: Using JSL for Resampling

BOOTSTRAP AGGREGATING (BAGGING)

Page 31: The Bootstrap and Beyond: Using JSL for Resampling

BAGGING: INTRODUCTION

• We have seen bootstrapping for inference; we can also use the bootstrap to improve prediction.

• Breiman (1996a) introduced the notion of “bootstrap aggregating” (or bagging for short) to improve predictions.

• The name says it all…aggregate predictions across bootstrap samples.

Page 32: The Bootstrap and Beyond: Using JSL for Resampling

BAGGING: UNSTABLE PREDICTORS

• Breiman (1996b) introduced the idea of instability in model selection.

• Let’s say that we are using our data

$D = \{ (\boldsymbol{x}_i, y_i),\ i = 1, \ldots, n \}$

to create a prediction function $\mu(x, D)$.

• If a small change in $D$ results in a large change in $\mu(\cdot, \cdot)$, we have an unstable predictor.

• A variety of techniques have been shown to be unstable:

Regression trees, best subset regression, forward selection, …

• Instability is a major concern when predicting new observations.

Page 33: The Bootstrap and Beyond: Using JSL for Resampling

BAGGING: THE BAGGING ALGORITHM

• A natural way to deal with instability is to observe the behavior of $\mu(\cdot, \cdot)$ for repeated perturbations of the data → bootstrap it!

• Basic bagging algorithm:

1. Take a bootstrap sample $D_j$ from the observed data $D$.

2. Fit your model of choice to $D_j$, giving you predictor $\mu_j(\boldsymbol{x})$.

Repeat 1 and 2 for $j = 1, \ldots, b$

Then the bagged prediction rule is

$\mu_{bag}(\boldsymbol{x}) = \dfrac{1}{b} \sum_{j=1}^{b} \mu_j(\boldsymbol{x})$

• Bagging a classifier is slightly different. You can either average over the probability formula or use a voting scheme.
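The basic algorithm can be sketched with a one-split regression tree as the model of choice. This is a Python illustration, not the JSL from the demonstration; the single-split tree, the data, and the function names `fit_stump` and `bagged_predictor` are assumptions for the example.

```python
import random


def fit_stump(xs, ys):
    """One-split regression tree: choose the cut with the smallest SSE."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between equal x values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, (pairs[i - 1][0] + pairs[i][0]) / 2, ml, mr)
    if best is None:  # every x identical: predict the overall mean
        m = sum(ys) / len(ys)
        return lambda x: m
    _, cut, ml, mr = best
    return lambda x: ml if x < cut else mr


def bagged_predictor(xs, ys, b=200, seed=1):
    """Average the predictions of b stumps fit to bootstrap samples."""
    rng = random.Random(seed)
    n = len(xs)
    fits = []
    for _ in range(b):
        idx = [rng.randrange(n) for _ in range(n)]  # step 1: sample D_j
        fits.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))  # step 2
    return lambda x: sum(f(x) for f in fits) / b


# Hypothetical data with a jump and one influential point at the far right.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 21]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1, 3.0, 6.0]
single = fit_stump(xs, ys)
bagged = bagged_predictor(xs, ys)
for x0 in (2, 6, 15):
    print(x0, round(single(x0), 2), round(bagged(x0), 2))
```

A single stump makes a hard jump at its cut point, while the bagged predictor averages over many cut points and so changes more gradually.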

Page 34: The Bootstrap and Beyond: Using JSL for Resampling

BAGGING: REGRESSION TREES

• Regression trees (as well as classification trees) are known to be particularly unstable.

Ex:

$y(x) = \begin{cases} 1 & x < 2 \\ 5 & 2 \le x \le 4 \\ 3 & x > 4 \end{cases}$

• A regression tree looks for optimal splits in your predictors and fits a simple mean to each section.

Page 35: The Bootstrap and Beyond: Using JSL for Resampling

BAGGING: MORE ON INSTABILITY

• In general, techniques that involve “hard decision rules” (like binary splits or including/excluding terms) are likely to be unstable.

• Regression trees: binary splits

• Best subset: X1 is either included or left out of the model (nothing in between)

• Is anything stable???

One example: Penalized regression techniques can shrink estimates, which is kind of like letting a variable partially enter the model.

Page 36: The Bootstrap and Beyond: Using JSL for Resampling

BAGGING: REGRESSION TREE EXAMPLE

A regression tree for these data will change drastically depending on whether or not we include the point at x=21.

Page 37: The Bootstrap and Beyond: Using JSL for Resampling

BAGGING: REGRESSION TREE EXAMPLE

Predictions for a tree with a single split.

Blue includes x=21.

Red excludes x=21.

This kind of difference is crucial when we observe new data.

Page 38: The Bootstrap and Beyond: Using JSL for Resampling

BAGGING: REGRESSION TREE EXAMPLE

The bagged predictor is a compromise between the two.

Page 39: The Bootstrap and Beyond: Using JSL for Resampling

BAGGING: RECAP

• Bagging is a very useful tool that improves upon unstable predictors.

Added bonus: we also get a measure of uncertainty.

• Several features in JMP and JSL make it convenient to do bagging yourself.

Most platforms (if not all) allow you to save a prediction formula

linearModel << save prediction formula;

predFormula = column("Y Predictor") << get formula;

And we get a lot of mileage out of the resample freq() function

rsfCol = newColumn("rsf", set formula(resample freq(1)));

• An implementation of Random Forests was added to JMP Pro in version 10.

Random forests resample rows and columns.
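The frequency-column idea can be illustrated outside JMP. The Python sketch below builds a vector of bootstrap counts and uses it as weights, which is roughly what a column of resampling frequencies does for an analysis platform as I understand it. It is an analogy, not a reimplementation of the JSL function, and the name `resample_freq` and the data are assumptions for the example.

```python
import random


def resample_freq(n, seed=None):
    """Bootstrap frequency vector: counts from n draws with replacement."""
    rng = random.Random(seed)
    freq = [0] * n
    for _ in range(n):
        freq[rng.randrange(n)] += 1
    return freq


def weighted_mean(values, freq):
    """Mean of the bootstrap sample implied by the frequency vector."""
    return sum(v * f for v, f in zip(values, freq)) / sum(freq)


# Hypothetical response values.
y = [4.1, 5.3, 2.8, 6.0, 3.7, 5.1]
freq = resample_freq(len(y), seed=3)
print("frequencies:", freq)
print("bootstrap-sample mean:", round(weighted_mean(y, freq), 3))
```

Rows with frequency 0 are left out of that bootstrap sample and rows with frequency 2 or more count multiple times, so the data table never has to be physically resampled.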

Page 40: The Bootstrap and Beyond: Using JSL for Resampling

BAGGING: DEMONSTRATION

Bagging Demonstration in JMP

Page 41: The Bootstrap and Beyond: Using JSL for Resampling

USING R FOR BOOTSTRAPPING

Page 42: The Bootstrap and Beyond: Using JSL for Resampling

USING R FOR BOOTSTRAPPING: THE JMP INTERFACE TO R

• JMP 10 added the ability to transfer information between JMP and R (a very powerful open-source statistical software package).

• R has packages to do bootstrapping, in particular the “boot” package.

• We can do the bootstrap in R using a custom-made JMP interface.

Page 43: The Bootstrap and Beyond: Using JSL for Resampling

USING R FOR BOOTSTRAPPING: CONNECTING TO R

• JMP is a very nice complement to R, making it easy to create convenient interfaces to R and then present R results in JMP.

• A handful of JSL functions allow you to communicate with R.

R Init(); // Initializes the connection to R

x = [1, 2, 3];

R Send( x ); // sends the matrix x to R

R Submit("x <- 2*x"); // submits R Code

y = R Get(x);// gets the object x from R and names it y

R Term(); // Terminates the connection to R

• There are a few more JSL functions for communicating with R, but the functions listed above will handle the majority of your needs.

Page 44: The Bootstrap and Beyond: Using JSL for Resampling

USING R FOR BOOTSTRAPPING: CONNECTING TO R

• The R connection allows us to combine the strengths of R and JMP

Page 45: The Bootstrap and Beyond: Using JSL for Resampling

USING R FOR BOOTSTRAPPING: DEMONSTRATION

JMP and R Integration Demonstration

Page 46: The Bootstrap and Beyond: Using JSL for Resampling

STABILITY SELECTION

Page 47: The Bootstrap and Beyond: Using JSL for Resampling

STABILITY SELECTION: INTRODUCTION

• Bagging is a way to use resampling to improve prediction.

We can also use resampling to improve variable selection techniques.

• Meinshausen and Bühlmann (2010) introduced stability selection, a very general modification that can be used in conjunction with any traditional variable selection technique.

• The motivation behind stability selection is simple and very intuitive:

If a predictor is typically included in the final model after doing variable selection on a subset of the data, then it is probably a meaningful variable.

Page 48: The Bootstrap and Beyond: Using JSL for Resampling

STABILITY SELECTION: THE VARIABLE SELECTION PROBLEM

• Suppose that we have observed data

$D = \{ (\boldsymbol{x}_i, y_i),\ i = 1, \ldots, n \}$

$\boldsymbol{x}_i$ is a $p \times 1$ vector of predictors for observation $i$.

• We want to build a model for the response $y$ using a subset of the predictors to improve both interpretation and predictive ability.

This is the classic variable selection problem.

• No shortage of variable selection techniques:

stepwise regression, best subset, penalized regression, …

Page 49: The Bootstrap and Beyond: Using JSL for Resampling

STABILITY SELECTION: THE VARIABLE SELECTION PROBLEM

• Usual linear models problem: we want to estimate the coefficient vector $\beta$ for the model $Y = X\beta + \varepsilon$.

• Want our variable selection technique to set $\beta_j = 0$ for some of the terms.

• We always have at least one tuning parameter ($\lambda$) that controls the complexity of the model.

Doing variable selection yields a set

$S(\lambda) = \{ k : \hat{\beta}_k \ne 0 \}$

• We tune the model using Cross-Validation, AIC, BIC, …

Variable Selection Technique    Tuning Parameter

Forward Selection               Alpha-to-enter

Backward Elimination            Alpha-to-leave

Best Subset                     Maximum model size considered

Lasso                           L1 norm

Least Angle Regression          Number of nonzero variables

Page 50: The Bootstrap and Beyond: Using JSL for Resampling

STABILITY SELECTION: THE DETAILS

• The stability selection algorithm:

1. Choose a random subsample $D_k$ of size $n/2$ from $D$, without replacement.

2. Use a variable selection technique on $D_k$ to obtain $S_k(\lambda)$, the set of variables with nonzero coefficients selected for tuning parameter value $\lambda$.

3. Repeat steps 1 and 2 for $k = 1, \ldots, B$.

• Easy to implement: only as complicated as the underlying selection technique.

• We calculate $\Pi_j(\lambda)$, the probability variable $j$ is included in the model when doing selection (with tuning $\lambda$) on a random subset of the data.

$\Pi_j(\lambda) = \dfrac{1}{B} \sum_{k=1}^{B} I(x_j \in S_k(\lambda))$
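The algorithm can be sketched with a deliberately simple stand-in for the selection step. In the Python block below, a predictor is "selected" when its absolute correlation with the response is at least $\lambda$. That rule is an assumption chosen to keep the example short; it is not forward selection or any other technique from the talk, and the simulated data and the function names are likewise assumptions.

```python
import random


def correlation(a, b):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    if saa == 0 or sbb == 0:
        return 0.0
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / (saa * sbb) ** 0.5


def select(xcols, y, lam):
    """Stand-in selector: indices j with |corr(x_j, y)| >= lam."""
    return [j for j, col in enumerate(xcols) if abs(correlation(col, y)) >= lam]


def stability_selection(xcols, y, lam, n_sub=100, seed=1):
    """Estimated inclusion probability for each predictor."""
    rng = random.Random(seed)
    n = len(y)
    counts = [0] * len(xcols)
    for _ in range(n_sub):  # step 3: repeat for each subsample
        rows = rng.sample(range(n), n // 2)  # step 1: size n/2, no replacement
        sub_y = [y[i] for i in rows]
        sub_x = [[col[i] for i in rows] for col in xcols]
        for j in select(sub_x, sub_y, lam):  # step 2: selected set
            counts[j] += 1
    return [c / n_sub for c in counts]


# Simulated data: only the first two of five predictors affect y.
rng = random.Random(5)
n = 60
X = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(5)]
y = [2 * X[0][i] - 1.5 * X[1][i] + rng.gauss(0, 1) for i in range(n)]
for lam in (0.2, 0.4, 0.6):
    probs = stability_selection(X, y, lam)
    print(lam, [round(p, 2) for p in probs])
```

Running the loop over several values of $\lambda$ gives the inclusion probabilities as a function of the tuning parameter, one row per value.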

Page 51: The Bootstrap and Beyond: Using JSL for Resampling

STABILITY SELECTION: MORE DETAILS

• Applying the algorithm to a meaningful range of $\lambda$ values shows us how inclusion probabilities change as a function of the tuning parameter.

• It makes sense that if a term maintains a high selection probability, then it should be in our final model.

Page 52: The Bootstrap and Beyond: Using JSL for Resampling

STABILITY SELECTION: MORE DETAILS

• After looking across a meaningful range of $\lambda$ values, we include variable $x_j$ in our final model if

$\max_{\lambda} \Pi_j(\lambda) \ge \Pi_{thr}$

• Here $\Pi_{thr}$ is a tuning parameter chosen in advance. Meinshausen and Bühlmann (2010) show that the results are not sensitive to the choice of $\Pi_{thr}$.

Page 53: The Bootstrap and Beyond: Using JSL for Resampling

STABILITY SELECTION: MORE DETAILS

• Improves unstable variable selection problems.

• Very general, can be applied in most variable selection settings.

• Can greatly outperform cross-validation for high dimensional data ($n \ll p$).

• Theory shows that stability selection controls the false discovery rate.

• Why subsample instead of the usual bootstrap?

Taking a subsample (without replacement) of size $n/2$ provides a similar reduction in information as a standard bootstrap sample.

Saves time when the underlying variable selection technique is computationally intense.

Page 54: The Bootstrap and Beyond: Using JSL for Resampling

STABILITY SELECTION: …AND JMP

• JMP provides several built-in variable selection techniques:

• Forward selection, backward elimination, Lasso, Elastic Net, …

• The most convenient in JMP? Forward selection with alpha-to-enter tuning.

• Scripting stability selection for forward selection is fairly straightforward using a frequency column in conjunction with the Stepwise platform.

newFreq = J(n,1,0);

newFreq[random shuffle(rows)[1::floor(n/2)]]=1;

fCol << set values(newFreq);

• We have implemented a slight modification of the stability selection algorithm, which Shah and Samworth (2013) show to have some added perks.

Page 55: The Bootstrap and Beyond: Using JSL for Resampling

STABILITY SELECTION: DEMONSTRATION

Stability Selection Demonstration in JMP

Page 56: The Bootstrap and Beyond: Using JSL for Resampling

WRAP-UP

Page 57: The Bootstrap and Beyond: Using JSL for Resampling

WRAP-UP: REFERENCES

• Breiman, L. (1996a). Bagging Predictors. Machine Learning, 24, 123-140.

• Breiman, L. (1996b). Heuristics of Instability and Stabilization in Model Selection. The Annals of Statistics, 24, 2350-2383.

• Efron, B. and Tibshirani, R. (1998). An Introduction to the Bootstrap, Chapman & Hall/CRC.

• Meinshausen, N. and Bühlmann, P. (2010). Stability Selection. Journal of the Royal Statistical Society Series B, 72, 417-473.

• Shah, R. and Samworth, R. (2013). Variable selection with error control: another look at stability selection. Journal of the Royal Statistical Society Series B, 75, 55-80.

Page 58: The Bootstrap and Beyond: Using JSL for Resampling

www.SAS.com

THANK YOU!

The Bootstrap and Beyond: Using JSL for resampling

Michael Crotty

Clay Barker

Research Statisticians

JMP Division, SAS Institute