Supervised Learning from Micro-Array Data: Datamining with Care

Trevor Hastie, Stanford University

joint work with Robert Tibshirani, Balasubramanian Narasimhan, Gil Chu, Pat Brown and David Botstein

September, 2003, Stanford Statistics

TRANSCRIPT

Page 1:

September, 2003 Stanford Statistics 1

Supervised Learning from Micro-Array Data: Datamining with Care

Trevor Hastie, Stanford University

joint work with Robert Tibshirani, Balasubramanian Narasimhan, Gil Chu, Pat Brown and David Botstein

Page 2:

Grade 9 view of cell biology

[Figure: DNA → mRNA transcription in the nucleus; the mature mRNA is transported through the nuclear membrane to the cytoplasm for protein synthesis (translation). Credit: National Human Genome Research Institute, National Institutes of Health, Division of Intramural Research.]

Page 3:

DNA microarrays

• Exciting new technology for measuring gene expression of tens of thousands of genes SIMULTANEOUSLY in a single sample of cells

• first multivariate, quantitative way of measuring gene expression

• a key idea: to find genes, follow around messenger RNA

• also known as “gene chips” — there are a number of different technologies: Affymetrix, Incyte, Brown Lab, ...

• techniques for analysis of microarray data are also applicable to SNP data, protein arrays, etc.

Page 4:

DNA microarray process

Page 5:

The entire Yeast genome on a chip

Page 6:

Statistical challenges

• Typically have ∼ 5,000–40,000 genes measured over ∼ 50–100 samples.

• Goal: understand patterns in data, and look for genes that explain known features in the samples.

• Biologists don’t want to miss anything (low Type II error). Statisticians have to help them to appreciate Type I error, and find ways to get a handle on it.

Page 7:

Types of problems

• Preprocessing: Li, Wong (dChip), Speed

• The analysis of expression arrays can be separated into: unsupervised — “class discovery” — and supervised — “class prediction”.

• In unsupervised problems, only the expression data is available. Clustering techniques are popular: hierarchical (Eisen’s TreeView — next slide), K-means, SOMs, block-clustering, gene-shaving (H&T), plaid models (Owen & Lazzeroni). The SVD is also useful.

• In supervised problems, a response measurement is available for each sample, for example a survival time or cancer class.

Page 8:

Two-way hierarchical clustering

Page 9:

Some editorial comments

• Most statistical work in this area is being done by non-statisticians.

• Journals are filled with papers of the form

“Application of <machine-learning method> to Microarray Data”

• Many are a waste of time. P ≫ N, i.e. many more variables (genes) than samples. Data-mining research has produced exotic enhancements of standard statistical models for the N ≫ P situation (neural networks, boosting, ...). Here we need to restrict the standard models; we cannot even do linear regression.

• Simple is better: a complicated method is only worthwhile if it works significantly better than the simple one.
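The claim that one cannot even do linear regression when P ≫ N can be made concrete. Below is a minimal numpy sketch on purely synthetic data (the sizes and names are illustrative, not from the talk): with far more predictors than samples, least squares fits even a pure-noise response exactly, so a perfect training fit carries no information.

```python
# Why ordinary least squares breaks when p >> N: X'X is singular and
# the fit interpolates the training response perfectly (zero training
# error) without any unique or trustworthy solution.
import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 5000                       # 50 samples, 5000 "genes"
X = rng.standard_normal((N, p))
y = rng.standard_normal(N)            # a pure-noise response

# Minimum-norm least-squares solution (one of infinitely many)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
train_resid = y - X @ beta
print(np.max(np.abs(train_resid)))    # ~0: the noise is fit exactly
```

With rank(X) = N, the columns of X span all of R^N, so any response vector, signal or noise, is reproduced exactly; hence the need to restrict (regularize) even the simplest linear models.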

Page 10:

• Give scientists good statistical software, with methods they can understand. They know their science better than you. With your help, they can do a better job analyzing their data than you can do alone.

• Software should be easy to install (e.g. R) and easy to use (e.g. SAM and PAM are Excel add-ins).

Page 11:

SAM

How to do 7000 t-tests all at once!

Significance Analysis of Microarrays (Tusher, Tibshirani and Chu, 2001)

[Figure: histogram of the 7,000 t statistics; x-axis: t statistic.]

• At what threshold should we call a gene significant?

• how many false positives can we expect?
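The two questions above can be sketched numerically. A hedged illustration on synthetic data: the statistic below is a simplified version of SAM's modified t statistic (real SAM chooses the variance offset s0 from the data, which is not done here), and the expected number of false calls at a threshold is estimated by permuting the class labels.

```python
# SAM-style sketch: modified two-sample t statistic per gene, with a
# permutation estimate of how many genes exceed the threshold by chance.
import numpy as np

rng = np.random.default_rng(1)
genes, n1, n2 = 1000, 10, 10
X = rng.standard_normal((genes, n1 + n2))
X[:50, n1:] += 2.0                    # 50 truly differential genes
labels = np.array([0] * n1 + [1] * n2)

def mod_t(X, labels, s0=0.1):
    a, b = X[:, labels == 0], X[:, labels == 1]
    num = b.mean(axis=1) - a.mean(axis=1)
    se = np.sqrt(a.var(axis=1, ddof=1) / a.shape[1]
                 + b.var(axis=1, ddof=1) / b.shape[1])
    return num / (se + s0)            # s0 guards against tiny variances

d = mod_t(X, labels)
thresh = 3.0
called = int(np.sum(np.abs(d) > thresh))

# Permutation null: average count of |d| > threshold under label shuffles
B = 200
null_counts = [np.sum(np.abs(mod_t(X, rng.permutation(labels))) > thresh)
               for _ in range(B)]
fdr = np.mean(null_counts) / max(called, 1)
print(called, round(fdr, 3))
```

The threshold plays the role of the band half-width ∆ in the SAM plot on the next slides: lowering it calls more genes significant at the cost of a higher estimated false discovery rate.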

Page 12:

SAM plot

[Figure: SAM plot of observed d(i) versus expected d(i) order statistics; genes falling outside a band around the 45° line are called significant.]

Page 13:

  ∆     Ave # falsely    # Called       False
        significant      significant    discovery rate
 0.3        75.1            294            .255
 0.4        33.6            196            .171
 0.5        19.8            160            .123
 0.7        10.1             94            .107
 1.0         4.0             46            .086

∆ is the half-width of the band around the 45° line.

Page 14:

• SAM is popular in genetics and biochemistry labs at Stanford and worldwide already

• SAM is freely available for academic and non-profit use. The SAM site is: www-stat.stanford.edu/∼tibs/SAM

• For commercial use, software is available for licensing from Stanford: [email protected]

Page 15:

Nearest Prototype Classification

An extremely simple classifier that

• performs well on test data, and

• produces subsets of informative genes.

Tibshirani, Hastie, Narasimhan and Chu (2002), “Diagnosis of multiple cancer types by shrunken centroids of gene expression”, PNAS 99:6567–6572 (May 14).

Page 16:

Classification of Samples

Example: small round blue cell tumors; Khan et al, Nature Medicine, 2001

• Tumors classified as BL (Burkitt lymphoma), EWS (Ewing), NB (neuroblastoma) and RMS (rhabdomyosarcoma).

• There are 63 training samples and 25 test samples, although five of the latter were not SRBCTs. 2308 genes.

• Khan et al report zero training and test errors, using a complex neural network model. They decided that 96 genes were “important”.

• Upon close examination, the network is linear. It is essentially extracting linear principal components, and classifying in their subspace.

Page 17:

Khan data

[Figure: expression heatmap of the Khan data; samples grouped by class: BL, EWS, NB, RMS.]

Page 18:

Khan’s neural network

Page 19:

Nearest Shrunken Centroids

[Figure: four panels (BL, EWS, NB, RMS) showing average expression per gene for each class centroid.]

Centroids are shrunk towards the overall centroid using soft-thresholding. Classification is to the nearest shrunken centroid.

Page 20:

Nearest Shrunken Centroids

• Simple; includes the nearest centroid classifier as a special case.

• Thresholding denoises large effects, and sets small ones to zero, thereby selecting genes.

• With more than two classes, the method can select different genes, and different numbers of genes, for each class.

• Still very simple. In statistical parlance, this is a restricted version of a naive Bayes classifier (also called idiot’s Bayes!)
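The idea can be sketched in a few lines. This is a simplified illustration on synthetic data: the published method also standardizes each centroid difference by a pooled within-class standard deviation (plus an offset s0) and a class-size factor, both omitted here for brevity.

```python
# Nearest shrunken centroids, simplified: soft-threshold each class
# centroid's difference from the overall centroid, then classify each
# sample to the nearest shrunken centroid.
import numpy as np

def soft_threshold(x, delta):
    return np.sign(x) * np.maximum(np.abs(x) - delta, 0.0)

def shrunken_centroids(X, y, delta):
    """X: samples x genes, y: integer class labels."""
    classes = np.unique(y)
    overall = X.mean(axis=0)
    cents = np.array([overall + soft_threshold(X[y == k].mean(axis=0)
                                               - overall, delta)
                      for k in classes])
    return classes, cents

def predict(X, classes, cents):
    d = ((X[:, None, :] - cents[None, :, :]) ** 2).sum(axis=2)
    return classes[np.argmin(d, axis=1)]

rng = np.random.default_rng(2)
X = rng.standard_normal((40, 200))
y = np.repeat([0, 1], 20)
X[y == 1, :10] += 1.5                 # only 10 informative genes
classes, cents = shrunken_centroids(X, y, delta=0.5)
acc = (predict(X, classes, cents) == y).mean()

# Genes whose centroid differences survive the shrinkage are "selected"
active = np.sum(np.abs(cents - X.mean(axis=0)) > 1e-12, axis=0)
print(acc, int(np.sum(active > 0)))
```

Increasing delta shrinks more differences to zero, selecting fewer genes; delta is the tuning parameter chosen by cross-validation on the following slides.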

Page 21:

Results on Khan data

At the optimal point chosen by 10-fold CV, there are 27 active genes, 0 training and 0 test errors.

[Figure: training (tr), 10-fold cross-validation (cv), and test (te) error versus the amount of shrinkage ∆; the top axis shows the number of active genes, from 2308 at ∆ = 0 down to 0.]

Page 22:

Predictions and Probabilities

[Figure: estimated class probabilities (BL, EWS, NB, RMS) for each sample; top panel: the 63 training samples; bottom panel: the 25 test samples, with the five non-SRBCT samples marked.]

Page 23:

The genes that matter

[Figure: shrunken centroid differences for the 27 selected genes, by class (BL, EWS, NB, RMS). Gene clone IDs: 295985, 866702, 814260, 770394, 377461, 810057, 365826, 41591, 629896, 308231, 325182, 812105, 44563, 244618, 796258, 298062, 784224, 296448, 207274, 563673, 504791, 204545, 21652, 308163, 212542, 183337, 241412.]

Page 24:

Heatmap of selected genes

[Figure: heatmap of the 27 selected genes; samples grouped by class: BL, EWS, NB, RMS.]

PAM — “Prediction Analysis for Microarrays”

R package, available at

http://www-stat.stanford.edu/∼tibs/PAM

Page 25:

Leukemia classification

Golub et al 1999, Science. They use a “voting” procedure for each gene, where votes are based on a t-like statistic.

Method    CV err    Test err
Golub      3/38       4/34
PAM        1/38       2/34

Breast Cancer classification

Hedenfalk et al 2001, NEJM. They use a “compound predictor” Σ_j w_j x_j, where the weights are t-statistics.

Method         BRCA1+   BRCA1-   BRCA2+   BRCA2-
Heden et al.    3/7      2/15     3/8      1/14
PAM             2/7      1/15     2/8      0/14

Page 26:

Other Promising Approaches

• Quadratically penalized linear regression models. For example, ridge regression

      min_β Σ_{i=1}^N (y_i − x_i^T β)² + λ β^T β

• Quadratically penalized logistic regression (binomial and multinomial), Cox model, etc.

• Regularized linear and quadratic discriminant analysis

• Optimal separating hyperplanes and Support Vector Machines

• Lasso and L1-penalized regression methods

      min_β Σ_{i=1}^N L(y_i, x_i^T β) + λ ‖β‖₁
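As a concrete illustration of the L1-penalized criterion above on a P ≫ N problem, here is a hedged sketch on synthetic data that solves it by proximal gradient descent (ISTA). This is not the LARS algorithm referenced later in the talk, just one simple solver for the same criterion; the data sizes and penalty choice are illustrative.

```python
# Lasso via ISTA: gradient step on the squared-error loss followed by
# soft-thresholding, which drives most coefficients exactly to zero.
import numpy as np

def soft(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=1000):
    """Minimize 0.5*||y - X b||^2 + lam*||b||_1."""
    L = np.linalg.norm(X, 2) ** 2      # Lipschitz bound on the gradient
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y)
        b = soft(b - grad / L, lam / L)
    return b

rng = np.random.default_rng(3)
N, p = 40, 500                          # p >> N, as on microarrays
X = rng.standard_normal((N, p))
true = np.zeros(p)
true[:5] = 3.0                          # only 5 "genes" carry signal
y = X @ true + 0.1 * rng.standard_normal(N)

lam = 0.2 * np.max(np.abs(X.T @ y))     # penalty at 20% of max correlation
b = lasso_ista(X, y, lam)
print(np.count_nonzero(b))              # sparse: few genes selected
```

Unlike a quadratic (ridge) penalty, the L1 penalty yields exact zeros, so the fitted model doubles as a gene-selection device.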

Page 27:

Computational tricks

• SVMs use the kernel trick. Even though the model has p ≫ N parameters, computations are done in the N-dimensional dual space.

• In fact ANY of the quadratically regularized linear models can benefit from a similar trick:

  – Let X = R · Q, with R N×N, Q N×p, and the rows of Q orthogonal.

  – Fit the model with R rather than X (N vs p variables), same quadratic penalty; let the solution be β̂∗.

  – Then the solution to the original problem is β̂ = Q^T β̂∗.

• This applies to regularized logistic regression, multinomial, any GLM, the Cox model, and LDA. It also applies to Euclidean methods like K-nearest neighbor classification.
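The three steps above can be checked directly for ridge regression. A sketch using the thin SVD, X = U D Vᵀ, taking R = U D (N×N) and Q = Vᵀ (rows orthonormal); the synthetic sizes are illustrative.

```python
# The p >> N computational trick for quadratically penalized models:
# fit ridge regression with the N-column matrix R instead of the
# p-column matrix X, then map the solution back with Q^T.
import numpy as np

rng = np.random.default_rng(4)
N, p, lam = 30, 500, 1.0
X = rng.standard_normal((N, p))
y = rng.standard_normal(N)

def ridge(Z, y, lam):
    return np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ y)

# Direct fit: solve a p x p system (expensive when p >> N)
beta_direct = ridge(X, y, lam)

# Trick: X = R Q with R = U D (N x N) and Q = Vt (orthonormal rows)
U, D, Vt = np.linalg.svd(X, full_matrices=False)
R = U * D                        # N x N
beta_star = ridge(R, y, lam)     # solve only an N x N system
beta_trick = Vt.T @ beta_star    # map back to p dimensions

print(np.allclose(beta_direct, beta_trick))  # True
```

The agreement follows from the push-through identity: (XᵀX + λI)⁻¹Xᵀ y = Xᵀ(XXᵀ + λI)⁻¹ y, and Q drops out of the N×N system because its rows are orthonormal.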

Page 28:

Example: Lasso and Leukemia Classification

[Figure: lasso coefficient paths for the leukemia data: standardized coefficients versus |beta|/max|beta|; a few genes (labeled 2968, 6801, 2945, 461, 2267) enter the model first.]

Page 29:

10-fold Cross-validation for the Lasso

[Figure: 10-fold cross-validation error versus the fraction |β|/max|β| for the Leukemia expression data (lasso).]

See http://www-stat.stanford.edu/∼hastie/Papers/#LARS

Page 30:

Summary

• With large numbers of genes as variables (P ≫ N), we have to learn how to tame even our simplest supervised learning techniques. Even linear models are way too aggressive.

• We need to talk with the biologists to learn their priors; i.e. genes work in groups.

• We need to provide them with easy-to-use software to try out our ideas, and involve them in the design. They are much better at understanding their data than we are.