
Data Science, Statistics, and Health
with a focus on Statistical Learning and Sparsity, and applications to biomedicine

Rob Tibshirani
Departments of Biomedical Data Science & Statistics, Stanford University

Columbia Data Science Summit 2020

Outline

1. Some general thoughts about data science and health

2. Some comments about supervised learning, sparsity, and deep learning.

3. General advice for data scientists

4. Example: Predicting platelet usage at Stanford Hospital

5. Some recent advances:
   • LassoNet [A feature-sparse Neural Network]
   • Lasso and elastic net for GWAS [Polygenic risk scores]
   • Contrast trees and boosting [New from Jerry Friedman!]
   • Generalized Random Forests [Estimating causal effects]

There is a lot of excitement about Data Science and health

• Artificial intelligence, predictive analytics, precision medicine are all hot areas with huge potential

• A wealth of data is now available in every area of public health and medicine, from new machines and assays, smartphones, smart watches ...

• Already, there have been good successes in data science in pathology, radiology, and other diagnostic specialties.

For Statisticians: 15 minutes of fame

• 2009: “I keep saying the sexy job in the next ten years will be statisticians.” (Hal Varian, Chief Economist, Google)

• 2012: “Data Scientist: The Sexiest Job of the 21st Century” (Harvard Business Review)

Some obstacles to this success

• Data siloing is a problem. In the US health care system, researchers tend not to share data openly. Access to large, representative datasets is essential for this work.

• The bar is higher in health. There is more at stake in the health area than in, say, a recommendation system for movies. Errors are much more costly.

• The public may be less tolerant of data science/machine errors than human errors (analogous to self-driving cars).

High-level comments about Data Science, Statistics and Machine Learning

• Data Science and Statistics involve much more than simply running a machine learning algorithm on data

• For example:
  • What is the question of interest? How can I collect data or run an experiment to address this question?
  • What inferences can be drawn from the data?
  • What action should I take as a result of what I’ve learned?
  • Do I need to worry about bias, confounding, generalizability, concept drift ...?

• ⇒ Statistical concepts and training are essential

Important points for data science applications in health

• Models should be kept as simple as possible, as they are easier to understand, and we can more easily anticipate how/when they might break. See Brad Efron’s recent paper “Prediction, Estimation, and Attribution”

• Uncertainty quantification is essential. We must build trust in our models.

• Algorithmic fairness is important

• Example of a key unsolved problem: in classification with high-dimensional features, how can we arrange for our classifier to sometimes “abstain” (because it is extrapolating)?

• Estimation of heterogeneous treatment effects (HTE), e.g. for personalized medicine, is a very important area in need of research. A simple unsolved problem: how can we do internal cross-validation of a model for HTE?

The Supervised Learning Paradigm

Training Data → Fitting → Prediction

Traditional statistics: domain experts work for 10 years to learn good features; they bring the statistician a small, clean dataset.
Today’s approach: we start with a large dataset with many features, and use a machine learning algorithm to find the good ones. A huge change.

This talk is about supervised learning: building models from data for predicting an outcome using a collection of input features.

Big data vary in shape. Different shapes call for different approaches.

• Wide data (thousands / millions of variables, hundreds of samples) → Lasso & Elastic Net.
  We have too many variables and are prone to overfitting. The lasso fits linear models that are sparse in the variables; it does automatic variable selection.

• Tall data (tens / hundreds of variables, thousands / tens of thousands of samples) → Random Forests & Gradient Boosting.
  Sometimes simple (linear) models don’t suffice. We have enough samples to fit nonlinear models with many interactions, and not too many variables. A Random Forest is an automatic and powerful way to do this.

The Lasso

The Lasso is an estimator defined by the following optimization problem:

$$\underset{\beta_0,\beta}{\text{minimize}} \;\; \frac{1}{2}\sum_i \Big(y_i - \beta_0 - \sum_j x_{ij}\beta_j\Big)^2 \quad \text{subject to} \quad \sum_j |\beta_j| \le s$$

• Penalty =⇒ sparsity (feature selection)

• Convex problem (good for computation and theory)

• Our lab has written an open-source R package called glmnet for fitting lasso models (Friedman, Hastie, Simon, Tibs). Available on CRAN.

• glmnet v3.0 (just out) now features the relaxed lasso and other goodies (like a progress bar!)

Almost 2 million downloads
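As a small illustration (not from the talk), here is roughly how a lasso path might be fit with glmnet; the simulated data and settings are made up for the example.

```r
library(glmnet)

# Simulated wide data: n = 100 observations, p = 500 features, 5 true signals
set.seed(1)
n <- 100; p <- 500
x <- matrix(rnorm(n * p), n, p)
beta <- c(rep(2, 5), rep(0, p - 5))
y <- x %*% beta + rnorm(n)

# Fit the lasso path (a sequence of lambda values)
fit <- glmnet(x, y)

# Cross-validation to choose lambda; inspect the CV curve and chosen coefficients
cvfit <- cv.glmnet(x, y)
plot(cvfit)
coef(cvfit, s = "lambda.min")   # sparse coefficient vector at the chosen lambda

# glmnet >= 3.0: relaxed lasso via relax = TRUE
rfit <- cv.glmnet(x, y, relax = TRUE)
```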


The Elephant in the Room: DEEP LEARNING

Will it eat the lasso and other statistical models?


Deep Nets/Deep Learning

[Diagram: a neural network with input layer L1 (inputs x1, ..., x4), a hidden layer L2, and an output layer L3 producing f(x).]

Neural network diagram with a single hidden layer. The hidden layer derives transformations of the inputs (nonlinear transformations of linear combinations), which are then used to model the output.

What has changed since the invention of Neural nets?

• Much bigger training sets and faster computation (especially GPUs)

• Many clever ideas: convolution filters, stochastic gradient descent, input distortion ...

• Use of multiple layers (the “Deep” part): Tommy Poggio says that the universal approximation theorems for single-layer NNs in the 1980s set the field back 30 years

• Confession: I was Geoff Hinton’s colleague at Univ. of Toronto (1985-1998) and didn’t appreciate the potential of Neural Networks!

What makes Deep Nets so powerful

(and challenging to analyze!)

It’s not one “mathematical model” but a customizable framework: a set of engineering tools that can exploit the special aspects of the problem (weight-sharing, convolution, feedback, recurrence ...)

Will Deep Nets eat the lasso and other statistical models?

Deep Nets are especially powerful when the features have some spatial or temporal organization (signals, images), and the SNR is high

But they are not a good approach when

• we have moderate #obs or wide data (#obs < #features),

• SNR is low, or

• interpretability is important

• It’s difficult to find examples where Deep Nets beat lasso or GBM in low-SNR settings with “generic” features

General advice for Data Scientists

Details matter!

Q: Have you tried method X?
A: I tried that last year and it didn’t work very well!

Q: What do you mean? What exactly did you try? How did you measure performance?
A: I can’t remember.

General tips

• Try a number of methods and use cross-validation to tune and compare models (our “lab” is the computer: experiments are quick and free)

• Be systematic when you run methods and make comparisons: compare methods on the same training and test sets (a small sketch follows this list).

• Keep carefully documented scripts and data archives (where practical).

• Your work should be reproducible by you and others, even a few years from now!
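A minimal sketch of a systematic comparison on one shared train/test split, using simulated data; the lasso is tuned by cross-validation on the training set only, and both methods are scored on the same test set. (Illustrative only; the randomForest package is assumed as the second method.)

```r
library(glmnet)
library(randomForest)

# Simulated data with a mild interaction
set.seed(2)
n <- 500; p <- 50
x <- matrix(rnorm(n * p), n, p)
y <- x[, 1] - 2 * x[, 2] + x[, 3] * x[, 4] + rnorm(n)

# One shared train/test split
train <- sample(n, 400)
xtr <- x[train, ]; ytr <- y[train]
xte <- x[-train, ]; yte <- y[-train]

# Lasso, tuned by internal cross-validation on the training set
cvfit <- cv.glmnet(xtr, ytr)
pred_lasso <- predict(cvfit, newx = xte, s = "lambda.min")

# Random forest fit on the same training set
rf <- randomForest(xtr, ytr)
pred_rf <- predict(rf, xte)

# Compare both methods on the same test set
c(lasso = mean((yte - pred_lasso)^2), rf = mean((yte - pred_rf)^2))
```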


In Praise of Simplicity

“Simplicity is the ultimate sophistication” — Leonardo da Vinci

• Many times I have been asked to review a data analysis by a biology postdoc or a company employee. Almost every time, they are unnecessarily complicated: multiple steps, each one poorly justified.

• Why? I think we all like to justify, internally and externally, our advanced degrees. And then there’s the “hit everything with deep learning” problem

• Suggestion: Always try simple methods first. Move on to more complex methods only if necessary

Outline again

1. Some general thoughts about data science and health

2. Some comments about supervised learning, sparsity, and deep learning

3. General advice for data scientists

4. → Example: Predicting platelet usage at Stanford Hospital

5. Some recent advances:
   • LassoNet [A feature-sparse Neural Network]
   • Lasso and elastic net for GWAS [Polygenic risk scores]
   • Contrast trees and boosting [New from Jerry Friedman!]
   • Generalized Random Forests [Estimating causal effects]

How many units of platelets will the Stanford Hospital need tomorrow?

Allison Zemek, Tho Pham, Saurabh Gombar

Leying Guan, Xiaoying Tian, Balasubramanian Narasimhan

Background

• Each day Stanford Hospital orders some number of units (bags) of platelets from the Stanford Blood Center, based on the estimated need (roughly 45 units)

• The daily needs are estimated “manually”

• Platelets have just 5 days of shelf-life, and they are safety-tested for 2 days. Hence they are usable for just 3 days.

• Currently about 1400 units (bags) are wasted each year. That’s about 8% of the total number ordered.

• There’s rarely any shortage (shortage is bad but not catastrophic)

• Can we do better?

Data overview

[Figure: two time-series panels over roughly 600 days: three days’ total platelet consumption (about 60 to 160 units) and daily consumption (about 20 to 60 units).]

Data description

Daily platelet use from 2/8/2013 - 2/8/2015.

• Response: number of platelet transfusions on a given day.

• Covariates:

1. Complete blood count (CBC) data: platelet count, white blood cell count, red blood cell count, hemoglobin concentration, number of lymphocytes, . . .

2. Census data: location of the patient, admission date, discharge date, . . .

3. Surgery schedule data: scheduled surgery date, type of surgical services, . . .

4. . . .


Notation

$y_i$ : actual PLT usage on day $i$
$x_i$ : amount of new PLT that arrives on day $i$
$r_i(k)$ : remaining PLT which can be used in the following $k$ days, $k = 1, 2$
$w_i$ : PLT wasted on day $i$
$s_i$ : PLT shortage on day $i$

• Overall objective: waste as little as possible, with little or no shortage


Our first approach

• Build a supervised learning model (via the lasso) to predict usage $y_i$ for the next three days (other methods like random forests or gradient boosting didn’t give better accuracy).

• Use the estimates $\hat{y}_i$ to decide how many units $x_i$ to order. Add a buffer to the predictions to ensure there is no shortage. Do this in a “rolling” manner (a sketch follows below).

• This worked quite well, reducing waste to 2.8%, but the loss function here is not ideal.
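A schematic sketch of this first approach, not the group's actual code (see the pip package linked later); the data frame `platelets` and its columns (`next3day_usage`, `remaining1`, `remaining2`, lagged usage, day-of-week, ward census) are hypothetical names for illustration.

```r
library(glmnet)

# Rolling scheme: refit a lasso on all days seen so far, predict the next
# three days' total need, then order predicted need plus a safety buffer,
# minus the usable units already on the shelf.
rolling_order <- function(platelets, buffer = 10) {
  x <- as.matrix(platelets[, setdiff(names(platelets), "next3day_usage")])
  y <- platelets$next3day_usage
  orders <- rep(NA_real_, nrow(platelets))
  for (i in 400:(nrow(platelets) - 1)) {
    fit <- cv.glmnet(x[1:i, ], y[1:i])
    pred <- predict(fit, newx = x[i + 1, , drop = FALSE], s = "lambda.min")
    on_shelf <- platelets$remaining1[i + 1] + platelets$remaining2[i + 1]
    orders[i + 1] <- max(0, pred + buffer - on_shelf)
  }
  orders
}
```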


More direct approach

This approach minimizes the waste directly:

$$J(\beta) = \sum_{i=1}^{n} w_i + \lambda \|\beta\|_1$$

where, for all $i = 1, 2, \ldots, n$,

$$\text{three days' total need: } \; t_i = z_i^T \beta$$
$$\text{number to order: } \; x_{i+3} = t_i - r_i(1) - r_i(2) - x_{i+1} - x_{i+2}$$
$$\text{waste: } \; w_i = [\,r_{i-1}(1) - y_i\,]_+$$
$$\text{actual remaining: } \; r_i(1) = [\,r_{i-1}(2) + r_{i-1}(1) - y_i - w_i\,]_+$$
$$r_i(2) = \big[\,x_i - [\,y_i + w_i - r_{i-1}(2) - r_{i-1}(1)\,]_+\,\big]_+$$
$$\text{constraint (fresh bags remaining): } \; r_i(2) \ge c_0$$

This can be shown to be a convex problem (LP).


Results

Chose sensible features: previous platelet use, day of week, # patients in key wards.

Over 2 years of backtesting: no shortage, and waste is reduced from 1400 bags/year (8%) to just 339 bags/year (1.9%).

This corresponds to a predicted direct savings at Stanford of $350,000/year. If implemented nationally, it could result in approximately $110 million in savings.

Moving forward

• The system has just been deployed at the Stanford Blood Center (R Shiny app), but is not yet in production

• We are distributing the software around the world, for other centers to train and deploy

• See the platelet inventory R package: https://bnaras.github.io/pip/

• Just this week we learned that the predictions have been way off for the past month! Data problem? Model problem? Not sure yet.

• A remaining challenge: how can we detect if/when the system is no longer working well?

Outline once again

1. Some general thoughts about data science and health

2. Some comments about supervised learning, sparsity, and deep learning

3. General advice for data scientists

4. Example: Predicting platelet usage at Stanford Hospital

5. → Some recent advances:
   • LassoNet [A feature-sparse Neural Network]
   • Lasso and elastic net for GWAS [Polygenic risk scores]
   • Contrast trees and boosting [New from Jerry Friedman!]
   • Generalized Random Forests [Estimating causal effects]

A feature-sparse neural net: “LassoNet”

Joint work with my Statistics student Ismael Lemhadri and former Stanford Statistics student Feng Ruan

Motivation

• Two methods for Supervised learning:

1. The Lasso fits linear models with sparsity (feature selection). Interpretable, but may not predict well (misses non-linearities and interactions).

2. Neural networks fit flexible nonlinear models, but are hard to interpret. Typically all features are used. (Some methods have been proposed for doing “post-mortem” analyses, to try to deduce feature importance.)

Can we marry the two and get the best of both worlds?

The Lasso

For $i = 1, 2, \ldots, N$, let $y_i$ be the output for observation $i$, and $(x_{i1}, \ldots, x_{ip})$ be the features (inputs) from which we will predict $y_i$.

The Lasso is an estimator defined by the following optimization problem:

$$\underset{\beta_0,\beta}{\text{minimize}} \;\; \frac{1}{2}\sum_i \Big(y_i - \beta_0 - \sum_j x_{ij}\beta_j\Big)^2 + \lambda \sum_j |\beta_j|$$

• The parameters $\beta_j$ are feature weights chosen to minimize the objective function; the penalty delivers sparsity (feature selection).

• The value $\lambda$ is a user-defined tuning parameter: larger $\lambda$ means more sparsity. We solve this problem for an entire path of $\lambda$ values.

• This is a convex problem (good for computation and theory).

LassoNet

We assume the model

$$y_i = \beta_0 + \sum_j x_{ij}\beta_j + \sum_{k=1}^{K}\big[\alpha_k + \gamma_k \cdot f(\theta_k^T x_i)\big] + \varepsilon_i$$

with $\varepsilon \sim (0, \sigma^2)$. Here $f$ is a monotone, nonlinear function such as a sigmoid or rectified linear unit, and each $\theta_k = (\theta_{1k}, \ldots, \theta_{pk})$ is a $p$-vector. Our objective is to minimize

$$J(\beta, \Theta, \alpha, \gamma) = \frac{1}{2}\sum_i (y_i - \hat{y}_i)^2 + \lambda \sum_j |\beta_j| + \lambda \sum_{jk} |\theta_{jk}|$$

subject to $|\theta_{jk}| \le |\beta_j| \;\; \forall j, k$ ← secret sauce

In a nutshell

“Features can participate in the hidden layer only if they have non-zero main effects”

• A neural network is a complex model, but it is made simpler if the model is a function of only a subset of the features

• E.g. a protein-based blood test for a disease: if only 10 proteins out of 1000 need to be measured, then the complicated neural net mapping of these 10 proteins is still acceptable to scientists

Computational strategy

• We seek the solution over a path of $\lambda$ values, using current solutions as warm starts. We use a projected proximal gradient procedure at each $\lambda$.

• Unlike the lasso, the objective is non-convex. This makes optimization much harder; e.g. the starting values for $(\beta, \theta)$ matter!

• In the lasso (our glmnet package), we compute a path of solutions from sparse ($\lambda$ large) to dense ($\lambda$ small).

• This worked badly for LassoNet; we tried many tricks; finally Feng suggested that we go from dense to sparse. Bingo!

Toy Example

[Figure: test error versus number of nonzero predictors (0 to 20) for the lasso, a neural net, and LassoNet.]

Boston housing data with 13 noise predictors added. The test error shown for a (single hidden layer) neural net is the minimum achieved over 200 epochs.

Example: Response to HIV drugs

Predictors are ∼500 mutations (binary). From Dr. Bob Schafer.

[Figure: test error versus number of predictors (0 to 200) for the lasso, a neural net, and LassoNet.]

[Figure: typical “1” and “2” digits; misclassification (test) error versus number of nonzero betas (0 to 800) for the lasso, a neural net, and LassoNet; and images of the NN weights, the LassoNet betas, and the LassoNet thetas.]

MNIST example, discriminating 1s from 2s. Shown are typical digits, the test error from the lasso, NN and LassoNet, and the estimated weights from the latter two methods.

Lasso and elastic net for GWAS

Estimation of Polygenic risk scores

• With current software (e.g. glmnet), the lasso and elastic net cannot be applied to data of size, say, 500K (patients) by 800K (SNPs)

• We have developed a new approach, using the idea of strong screening rules (Tibshirani et al., JRSSB 2012), that successfully carries out this computation in hours (exact, within machine precision)

• Joint work with PhD students Junyang Qian, Yosuke Tanigawa, Trevor Hastie and the Manny Rivas group at Stanford DBDS

• “A Fast and Flexible Algorithm for Solving the Lasso in Large-scale and Ultrahigh-dimensional Problems”, Qian et al., bioRxiv 2019

Strong Rules

For the lasso fit, with active set $\mathcal{A}$:

$$\big|\langle x_j,\; y - X\hat\beta(\lambda)\rangle\big| \;=\; \lambda \;\;\forall j \in \mathcal{A}, \qquad \le \lambda \;\;\forall j \notin \mathcal{A}$$

So variables nearly in $\mathcal{A}$ will have inner products with the residuals nearly equal to $\lambda$.

Suppose the fit at $\lambda_\ell$ is $X\hat\beta(\lambda_\ell)$, and we want to compute the fit at $\lambda_{\ell+1} < \lambda_\ell$. Strong rules gamble on the set

$$S_{\ell+1} = \Big\{ j : \big|\langle x_j,\; y - X\hat\beta(\lambda_\ell)\rangle\big| > \lambda_{\ell+1} - (\lambda_\ell - \lambda_{\ell+1}) \Big\}$$

glmnet screens at every $\lambda$ step and, after convergence, checks for any violations. Mostly $\mathcal{A}_{\ell+1} \subseteq S_{\ell+1}$.

∗ Tibshirani, Bien, Friedman, Hastie, Simon, Taylor, Tibshirani (JRSSB 2012)
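A minimal sketch of the sequential strong-rule screen above (illustrative only; glmnet implements this internally, and uses a 1/n scaling of lambda that the inner products would need to match):

```r
# Sequential strong rule: which variables to gamble on at the next, smaller lambda.
# X: feature matrix, y: response, beta_prev: solution at lambda_prev.
strong_set <- function(X, y, beta_prev, lambda_prev, lambda_next) {
  resid <- y - X %*% beta_prev        # residual at the previous lambda
  score <- abs(crossprod(X, resid))   # |<x_j, y - X beta(lambda_prev)>| for each j
  # keep j if the inner product exceeds lambda_next - (lambda_prev - lambda_next)
  which(score > 2 * lambda_next - lambda_prev)
}
```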


[Figure: number of predictors after filtering versus number of predictors in the model (0 to 250), comparing global STRONG, sequential DPP (exact), and sequential STRONG screening; the top axis shows percent variance explained.]

Strong rules were inspired by El Ghaoui, Viallon and Rabbani (2010); the sequential DPP rule is due to Wang, Lin, Gong, Wonka and Ye (2013).

Batch Screening Iterative Lasso

(BASIL)

1. Initialize the active variables $\mathcal{A}_0 = \emptyset$, $\lambda = \lambda_{\max}$
2. For $k = 1, 2, \ldots$
   • Screening: at the current $\lambda$, form the strong set $S_k = \mathcal{A}_k \cup \mathcal{E}_k$, where $\mathcal{A}_k$ is the set of $M$ variables with the largest inner products with the residual and $\mathcal{E}_k$ are the ever-active variables
   • Fitting: solve the lasso using just the strong set of predictors $S_k$
   • Checking: find the smallest $\lambda_{k+1}$ so that the KKT conditions are satisfied (i.e. no predictors have been omitted that should be present)

Typically $M = 1000$. Screening and Checking require the full dataset, and are done efficiently using memory-mapped I/O (the bigmemory R package).
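A rough in-memory sketch of the BASIL idea using glmnet; the real snpnet implementation works out-of-core on UK Biobank-scale data and fits a whole batch of lambdas per iteration, whereas the simplified KKT check and one-lambda-per-step loop below are assumptions made for illustration (X is assumed pre-standardized).

```r
library(glmnet)

# X: n x p matrix (columns standardized), y: response,
# lambdas: decreasing lambda sequence, M: screening batch size.
basil_sketch <- function(X, y, lambdas, M = 1000) {
  n <- nrow(X); p <- ncol(X)
  ever_active <- integer(0)
  beta <- rep(0, p); b0 <- mean(y)
  for (k in seq_along(lambdas)) {
    # Screening: ever-active set plus the M largest inner products with the residual
    resid <- y - b0 - X %*% beta
    score <- abs(crossprod(X, resid)) / n
    strong <- union(ever_active, order(score, decreasing = TRUE)[1:min(M, p)])
    # Fitting: lasso restricted to the strong set, at this lambda
    fit <- glmnet(X[, strong, drop = FALSE], y, lambda = lambdas[k], standardize = FALSE)
    b0 <- as.numeric(coef(fit)[1])
    beta <- rep(0, p)
    beta[strong] <- as.numeric(coef(fit)[-1])
    # Checking (simplified KKT check): no screened-out variable may violate the bound
    resid <- y - b0 - X %*% beta
    grad <- abs(crossprod(X, resid)) / n
    if (any(grad[-strong] > lambdas[k])) warning("KKT violation: enlarge the strong set")
    ever_active <- union(ever_active, strong[abs(beta[strong]) > 0])
  }
  beta
}
```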


Algorithmic details in pictures

[Figure 1 from Qian et al.: lasso coefficient profiles (coefficients versus lambda index) over four BASIL iterations, showing the progression of the algorithm. The previously finished part of the path is colored grey, the newly completed and verified part is green, and the part that is newly computed but failed verification is red.]

GWAS example

• Large cohort of 500K British adults (Bycroft et al 2018)

• Each individual genotyped at 805K locations (AA, Aa, aa or NA)

• 100s of phenotypes measured on each subject

• We looked at the white British subset of 337K, and illustrate with the height phenotype

• Divided the data 60% training, 20% validation, 20% test.

Package snpnet available on GitHub (link on Hastie’s website, and to the 2019 report)

Lasso fit to height

[Figure 3 from Qian et al.: $R^2$ versus lambda index for height, for the lasso and relaxed lasso on the training and validation sets; the top axis shows the number of active variables in the model (from 15 up to 47,673).]

[Figure 4 from Qian et al.: left, actual versus predicted height (cm) on 5000 random samples from the test set (correlation between actual and predicted height is 0.9416); right, histogram of the lasso residuals for height (residual standard deviation 5.05 cm).]

R-squared (or heritability) is shown for the regularization paths.

Detailed results

(Excerpt from Qian et al., bioRxiv 2019:)

... smallest univariate p-values, and a multivariate linear or logistic regression is fit on those variants jointly. The sequence of models is evaluated on the validation set, and the one with the smallest validation error is chosen. We call this method Sequential LR (Model 6).

Results: We present results of the lasso and related methods for quantitative traits including standing height and BMI, and for qualitative traits including asthma and high cholesterol.

Standing height: Height is a polygenic and heritable trait that has been studied for a long time. It has been used as a model for other quantitative traits, since it is easy to measure reliably. From twin and sibling studies, the narrow-sense heritability is estimated to be 70-80% (Silventoinen et al., 2003; Visscher et al., 2006, 2010). Recent estimates controlling for shared environmental factors present in twin studies calculate heritability at 0.69 (Zaitlen et al., 2013; Hemani et al., 2013). A linear model with common SNPs explains 45% of the variance (Yang et al., 2010), and a model including imputed variants explains 56% of the variance, almost matching the estimated heritability (Yang et al., 2015). So far, GWAS studies have discovered 697 associated variants that explain one fifth of the heritability (Lango Allen et al., 2010; Wood et al., 2014). Recently, a large sample study was able to identify more variants with low frequencies that are associated with height (Marouli et al., 2017). Using the lasso with the larger UK Biobank dataset allows both a better estimate of the proportion of variance that can be explained by genomic predictors and simultaneous selection of SNPs that may be associated. We obtain R2 values that are close to the estimated heritability. The results are summarized in Table 3.

Model  Form                    R2train   R2val    R2test      Size
(1)    Age + Sex               0.5300    0.5260   0.5288         2
(2)    Age + Sex + 10 PCs      0.5344    0.5304   0.5336        12
(3)    Strong Single SNP       0.5364    0.5323   0.5355        13
(4)    10K Combined            0.5482    0.5408   0.5444    10,012
(5)    100K Combined           0.5833    0.5515   0.5551   100,012
(6)    Sequential LR           0.7416    0.6596   0.6601    17,012
(7)    Lasso                   0.8304    0.6992   0.6999    47,673
(8)    Relaxed Lasso           0.7789    0.6718   0.6727    13,395

Table 3: R2 values for height. For Sequential LR, the lasso and the relaxed lasso, the chosen model is based on the maximum R2 on the validation set. Models (3) to (8) each include Model (2) plus their own specification as stated in the Form column.

A large number (47,673) of SNPs need to be selected in order to achieve the optimal R2test = 0.6992 for the lasso. Comparatively, the relaxed lasso sacrifices some predictive performance by including a much smaller subset of variables (13,395). Past the optimal point, the additional variance introduced by refitting such large models may be larger than the reduction in bias. The large models confirm the extreme polygenicity of standing height.

Contrast trees and boosting

NEW! Jerome Friedman, arXiv (2019). Especially useful when N >> p.

• Start with data $(x_i, y_i, z_i)$, $i = 1, 2, \ldots, N$.

• $x, y$ are the usual features and outcome; $z = f(x)$ is an auxiliary quantity. Examples: $z_i = \hat{y}_i$ from some fitted model, or $z_i$ equal to some quantile of the distribution of $y_i \mid x_i$.

• A contrast tree is defined as follows. We define a tree split via a discrepancy measure over a node $R_m$, such as

$$d_m = \frac{1}{N_m} \sum_{x_i \in R_m} |y_i - z_i|$$

• The split criterion is

$$(f_m^{\,l} \cdot f_m^{\,r}) \cdot \max\!\big(d_m^{(l)}, d_m^{(r)}\big)^2$$

where $f_m^{\,l}, f_m^{\,r}$ are the fractions of observations in the “parent” region $R_m$ associated with each of the two daughters.

• Notice the max instead of the usual sum!

• This criterion is used to build a contrast tree, and then forms the basis for contrast boosting (a small sketch follows).
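As a small illustration (not Friedman's implementation), the node discrepancy and split criterion above might be evaluated like this for a single candidate split:

```r
# Evaluate the contrast-tree split criterion for one candidate split of a node.
# y, z: outcome and auxiliary quantity for the observations in the node;
# left: logical vector saying which observations go to the left daughter.
split_quality <- function(y, z, left) {
  d_left  <- mean(abs(y[left]  - z[left]))     # discrepancy in the left daughter
  d_right <- mean(abs(y[!left] - z[!left]))    # discrepancy in the right daughter
  f_left  <- mean(left)                        # fraction of the node going left
  f_right <- 1 - f_left
  (f_left * f_right) * max(d_left, d_right)^2  # criterion from the slide
}

# Example usage: contrast observed y against predictions z from some fitted model,
# splitting on whether a feature x exceeds its median:
# split_quality(y, z, left = x > median(x))
```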


Example

• Predicting income > 50,000

• 14 predictor variables consisting of various demographic and financial properties associated with each person. N = 50,000. The overall test error rate from gradient boosting is 13%.

• Choose $z_i$ = fitted value from gradient boosting

Generalized Random Forests

Estimates the treatment effect: the predicted difference in outcome under two different actions (treatments).

See the grf (generalized random forests) R package on CRAN/GitHub.

(Susan Athey, Julie Tibshirani and Stefan Wager)
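A minimal sketch of estimating heterogeneous treatment effects with grf; the simulated data below are made up for illustration.

```r
library(grf)

# Simulated data: X covariates, W binary treatment, Y outcome,
# with a treatment effect that depends on X[, 1]
set.seed(3)
n <- 2000; p <- 10
X <- matrix(rnorm(n * p), n, p)
W <- rbinom(n, 1, 0.5)
tau <- pmax(X[, 1], 0)                    # heterogeneous treatment effect
Y <- X[, 2] + tau * W + rnorm(n)

# Causal forest: estimates the conditional average treatment effect tau(x)
cf <- causal_forest(X, Y, W)
tau_hat <- predict(cf)$predictions        # out-of-bag estimates of tau(X_i)

# Variance estimates (for confidence intervals) and the overall ATE
pred <- predict(cf, estimate.variance = TRUE)
average_treatment_effect(cf)
```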


Other features of Generalized Random Forests

• non-parametric least-squares regression, quantile regression

• confidence intervals for heterogeneous treatment effects

• customizable split criteria


For further reading

Many of the methods used are described in detail in our books on Statistical Learning (the last one by Efron & Hastie):

• An Introduction to Statistical Learning, with Applications in R, by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani (Springer Texts in Statistics)

• The Elements of Statistical Learning: Data Mining, Inference, and Prediction, by Trevor Hastie, Robert Tibshirani and Jerome Friedman (Springer Series in Statistics)

All available online for free

Packages on CRAN/github: SNPnet, lassoNet, grf
