Supervised Learning from Micro-Array Data: Datamining with Care

Trevor Hastie, Stanford University

joint work with Robert Tibshirani, Balasubramanian Narasimhan, Gil Chu, Pat Brown and David Botstein

September, 2003, Stanford Statistics

TRANSCRIPT

Page 1:

September, 2003 Stanford Statistics 1

Supervised Learning from Micro-Array Data: Datamining with Care

Trevor Hastie, Stanford University

joint work with Robert Tibshirani, Balasubramanian Narasimhan, Gil Chu, Pat Brown and David Botstein

Page 2:

Grade 9 view of cell biology

[Figure: DNA → mRNA transcription in the nucleus; the mature mRNA is transported through the nuclear membrane to the cytoplasm for protein synthesis (translation). Credit: National Human Genome Research Institute, National Institutes of Health, Division of Intramural Research.]

Page 3:

DNA microarrays

• Exciting new technology for measuring gene expression of tens of thousands of genes SIMULTANEOUSLY in a single sample of cells

• first multivariate, quantitative way of measuring gene expression

• a key idea: to find genes, follow around messenger RNA

• also known as “gene chips” — there are a number of different technologies: Affymetrix, Incyte, Brown Lab, ...

• techniques for analysis of microarray data are also applicable to SNP data, protein arrays, etc.

Page 4:

DNA microarray process

Page 5:

The entire Yeast genome on a chip

Page 6:

Statistical challenges

• Typically have ∼ 5,000–40,000 genes measured over ∼ 50–100 samples.

• Goal: understand patterns in data, and look for genes that explain known features in the samples.

• Biologists don’t want to miss anything (low Type II error). Statisticians have to help them to appreciate Type I error, and find ways to get a handle on it.

Page 7:

Types of problems

• Preprocessing: Li, Wong (dChip), Speed

• The analysis of expression arrays can be separated into: unsupervised — “class discovery” — and supervised — “class prediction”.

• In unsupervised problems, only the expression data is available. Clustering techniques are popular: hierarchical (Eisen’s TreeView — next slide), K-means, SOMs, block-clustering, gene-shaving (H&T), plaid models (Owen & Lazzeroni). The SVD is also useful.

• In supervised problems, a response measurement is available for each sample, for example a survival time or cancer class.

Page 8:

Two-way hierarchical clustering

Page 9:

Some editorial comments

• Most statistical work in this area is being done by non-statisticians.

• Journals are filled with papers of the form

“Application of <machine-learning method> to Microarray Data”

• Many are a waste of time. P ≫ N, i.e. many more variables (genes) than samples. Data-mining research has produced exotic enhancements of standard statistical models for the N ≫ P situation (neural networks, boosting, ...). Here we need to restrict the standard models; we cannot even do linear regression.

• Simple is better: a complicated method is only worthwhile if it works significantly better than the simple one.
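The claim that one cannot even do linear regression when P ≫ N can be made concrete. Below is a minimal numpy sketch on purely synthetic data (the sizes and names are illustrative, not from the talk): with far more predictors than samples, least squares fits even a pure-noise response exactly, so a perfect training fit carries no information.

```python
# Why ordinary least squares breaks when p >> N: X'X is singular and
# the fit interpolates the training response perfectly (zero training
# error) without any unique or trustworthy solution.
import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 5000                       # 50 samples, 5000 "genes"
X = rng.standard_normal((N, p))
y = rng.standard_normal(N)            # a pure-noise response

# Minimum-norm least-squares solution (one of infinitely many)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
train_resid = y - X @ beta
print(np.max(np.abs(train_resid)))    # ~0: the noise is fit exactly
```

With rank(X) = N, the columns of X span all of R^N, so any response vector, signal or noise, is reproduced exactly; hence the need to restrict (regularize) even the simplest linear models.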

Page 10:

• Give scientists good statistical software, with methods they can understand. They know their science better than you. With your help, they can do a better job analyzing their data than you can do alone.

• Software should be easy to install (e.g. R) and easy to use (e.g. SAM and PAM are Excel add-ins).

Page 11:

SAM

How to do 7000 t-tests all at once!

Significance Analysis of Microarrays (Tusher, Tibshirani and Chu, 2001)

[Figure: histogram of the 7,000 t statistics; x-axis: t statistic.]

• At what threshold should we call a gene significant?

• how many false positives can we expect?
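The two questions above can be sketched numerically. A hedged illustration on synthetic data: the statistic below is a simplified version of SAM's modified t statistic (real SAM chooses the variance offset s0 from the data, which is not done here), and the expected number of false calls at a threshold is estimated by permuting the class labels.

```python
# SAM-style sketch: modified two-sample t statistic per gene, with a
# permutation estimate of how many genes exceed the threshold by chance.
import numpy as np

rng = np.random.default_rng(1)
genes, n1, n2 = 1000, 10, 10
X = rng.standard_normal((genes, n1 + n2))
X[:50, n1:] += 2.0                    # 50 truly differential genes
labels = np.array([0] * n1 + [1] * n2)

def mod_t(X, labels, s0=0.1):
    a, b = X[:, labels == 0], X[:, labels == 1]
    num = b.mean(axis=1) - a.mean(axis=1)
    se = np.sqrt(a.var(axis=1, ddof=1) / a.shape[1]
                 + b.var(axis=1, ddof=1) / b.shape[1])
    return num / (se + s0)            # s0 guards against tiny variances

d = mod_t(X, labels)
thresh = 3.0
called = int(np.sum(np.abs(d) > thresh))

# Permutation null: average count of |d| > threshold under label shuffles
B = 200
null_counts = [np.sum(np.abs(mod_t(X, rng.permutation(labels))) > thresh)
               for _ in range(B)]
fdr = np.mean(null_counts) / max(called, 1)
print(called, round(fdr, 3))
```

The threshold plays the role of the band half-width ∆ in the SAM plot on the next slides: lowering it calls more genes significant at the cost of a higher estimated false discovery rate.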

Page 12:

SAM plot

[Figure: SAM plot of observed d(i) versus expected d(i) order statistics; genes falling outside a band around the 45° line are called significant.]

Page 13:

  ∆     Ave # falsely    # Called       False
        significant      significant    discovery rate
 0.3        75.1            294            .255
 0.4        33.6            196            .171
 0.5        19.8            160            .123
 0.7        10.1             94            .107
 1.0         4.0             46            .086

∆ is the half-width of the band around the 45° line.

Page 14:

• SAM is popular in genetics and biochemistry labs at Stanford and worldwide already

• SAM is freely available for academic and non-profit use. The SAM site is: www-stat.stanford.edu/∼tibs/SAM

• For commercial use, software is available for licensing from Stanford: [email protected]

Page 15:

Nearest Prototype Classification

An extremely simple classifier that

• performs well on test data, and

• produces subsets of informative genes.

Tibshirani, Hastie, Narasimhan and Chu (2002), “Diagnosis of multiple cancer types by shrunken centroids of gene expression”, PNAS 99:6567–6572 (May 14).

Page 16:

Classification of Samples

Example: small round blue cell tumors; Khan et al, Nature Medicine, 2001

• Tumors classified as BL (Burkitt lymphoma), EWS (Ewing), NB (neuroblastoma) and RMS (rhabdomyosarcoma).

• There are 63 training samples and 25 test samples, although five of the latter were not SRBCTs. 2308 genes.

• Khan et al report zero training and test errors, using a complex neural network model. They decided that 96 genes were “important”.

• Upon close examination, the network is linear. It is essentially extracting linear principal components, and classifying in their subspace.

Page 17:

Khan data

[Figure: expression heatmap of the Khan data; samples grouped by class: BL, EWS, NB, RMS.]

Page 18:

Khan’s neural network

Page 19:

Nearest Shrunken Centroids

[Figure: four panels (BL, EWS, NB, RMS) showing average expression per gene for each class centroid.]

Centroids are shrunk towards the overall centroid using soft-thresholding. Classification is to the nearest shrunken centroid.

Page 20:

Nearest Shrunken Centroids

• Simple; includes the nearest centroid classifier as a special case.

• Thresholding denoises large effects, and sets small ones to zero, thereby selecting genes.

• With more than two classes, the method can select different genes, and different numbers of genes, for each class.

• Still very simple. In statistical parlance, this is a restricted version of a naive Bayes classifier (also called idiot’s Bayes!)
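The idea can be sketched in a few lines. This is a simplified illustration on synthetic data: the published method also standardizes each centroid difference by a pooled within-class standard deviation (plus an offset s0) and a class-size factor, both omitted here for brevity.

```python
# Nearest shrunken centroids, simplified: soft-threshold each class
# centroid's difference from the overall centroid, then classify each
# sample to the nearest shrunken centroid.
import numpy as np

def soft_threshold(x, delta):
    return np.sign(x) * np.maximum(np.abs(x) - delta, 0.0)

def shrunken_centroids(X, y, delta):
    """X: samples x genes, y: integer class labels."""
    classes = np.unique(y)
    overall = X.mean(axis=0)
    cents = np.array([overall + soft_threshold(X[y == k].mean(axis=0)
                                               - overall, delta)
                      for k in classes])
    return classes, cents

def predict(X, classes, cents):
    d = ((X[:, None, :] - cents[None, :, :]) ** 2).sum(axis=2)
    return classes[np.argmin(d, axis=1)]

rng = np.random.default_rng(2)
X = rng.standard_normal((40, 200))
y = np.repeat([0, 1], 20)
X[y == 1, :10] += 1.5                 # only 10 informative genes
classes, cents = shrunken_centroids(X, y, delta=0.5)
acc = (predict(X, classes, cents) == y).mean()

# Genes whose centroid differences survive the shrinkage are "selected"
active = np.sum(np.abs(cents - X.mean(axis=0)) > 1e-12, axis=0)
print(acc, int(np.sum(active > 0)))
```

Increasing delta shrinks more differences to zero, selecting fewer genes; delta is the tuning parameter chosen by cross-validation on the following slides.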

Page 21:

Results on Khan data

At the optimal point chosen by 10-fold CV, there are 27 active genes, 0 training and 0 test errors.

[Figure: training (tr), 10-fold cross-validation (cv), and test (te) error versus the amount of shrinkage ∆; the top axis shows the number of active genes, from 2308 at ∆ = 0 down to 0.]

Page 22:

Predictions and Probabilities

[Figure: estimated class probabilities (BL, EWS, NB, RMS) for each sample; top panel: the 63 training samples; bottom panel: the 25 test samples, with the five non-SRBCT samples marked.]

Page 23:

The genes that matter

[Figure: shrunken centroid differences for the 27 selected genes, by class (BL, EWS, NB, RMS). Gene clone IDs: 295985, 866702, 814260, 770394, 377461, 810057, 365826, 41591, 629896, 308231, 325182, 812105, 44563, 244618, 796258, 298062, 784224, 296448, 207274, 563673, 504791, 204545, 21652, 308163, 212542, 183337, 241412.]

Page 24:

Heatmap of selected genes

[Figure: heatmap of the 27 selected genes; samples grouped by class: BL, EWS, NB, RMS.]

PAM — “Prediction Analysis for Microarrays”

R package, available at

http://www-stat.stanford.edu/∼tibs/PAM

Page 25:

Leukemia classification

Golub et al 1999, Science. They use a “voting” procedure for each gene, where votes are based on a t-like statistic.

Method    CV err    Test err
Golub      3/38       4/34
PAM        1/38       2/34

Breast Cancer classification

Hedenfalk et al 2001, NEJM. They use a “compound predictor” Σ_j w_j x_j, where the weights are t-statistics.

Method         BRCA1+   BRCA1-   BRCA2+   BRCA2-
Heden et al.    3/7      2/15     3/8      1/14
PAM             2/7      1/15     2/8      0/14

Page 26:

Other Promising Approaches

• Quadratically penalized linear regression models. For example, ridge regression

      min_β Σ_{i=1}^N (y_i − x_i^T β)² + λ β^T β

• Quadratically penalized logistic regression (binomial and multinomial), Cox model, etc.

• Regularized linear and quadratic discriminant analysis

• Optimal separating hyperplanes and Support Vector Machines

• Lasso and L1-penalized regression methods

      min_β Σ_{i=1}^N L(y_i, x_i^T β) + λ ‖β‖₁
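As a concrete illustration of the L1-penalized criterion above on a P ≫ N problem, here is a hedged sketch on synthetic data that solves it by proximal gradient descent (ISTA). This is not the LARS algorithm referenced later in the talk, just one simple solver for the same criterion; the data sizes and penalty choice are illustrative.

```python
# Lasso via ISTA: gradient step on the squared-error loss followed by
# soft-thresholding, which drives most coefficients exactly to zero.
import numpy as np

def soft(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=1000):
    """Minimize 0.5*||y - X b||^2 + lam*||b||_1."""
    L = np.linalg.norm(X, 2) ** 2      # Lipschitz bound on the gradient
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y)
        b = soft(b - grad / L, lam / L)
    return b

rng = np.random.default_rng(3)
N, p = 40, 500                          # p >> N, as on microarrays
X = rng.standard_normal((N, p))
true = np.zeros(p)
true[:5] = 3.0                          # only 5 "genes" carry signal
y = X @ true + 0.1 * rng.standard_normal(N)

lam = 0.2 * np.max(np.abs(X.T @ y))     # penalty at 20% of max correlation
b = lasso_ista(X, y, lam)
print(np.count_nonzero(b))              # sparse: few genes selected
```

Unlike a quadratic (ridge) penalty, the L1 penalty yields exact zeros, so the fitted model doubles as a gene-selection device.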

Page 27:

Computational tricks

• SVMs use the kernel trick. Even though the model has p ≫ N parameters, computations are done in the N-dimensional dual space.

• In fact ANY of the quadratically regularized linear models can benefit from a similar trick:

  – Let X = R · Q, with R N×N, Q N×p, and the rows of Q orthogonal.

  – Fit the model with R rather than X (N vs p variables), same quadratic penalty; let the solution be β̂∗.

  – Then the solution to the original problem is β̂ = Q^T β̂∗.

• This applies to regularized logistic regression, multinomial, any GLM, the Cox model, and LDA. It also applies to Euclidean methods like K-nearest neighbor classification.
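The three steps above can be checked directly for ridge regression. A sketch using the thin SVD, X = U D Vᵀ, taking R = U D (N×N) and Q = Vᵀ (rows orthonormal); the synthetic sizes are illustrative.

```python
# The p >> N computational trick for quadratically penalized models:
# fit ridge regression with the N-column matrix R instead of the
# p-column matrix X, then map the solution back with Q^T.
import numpy as np

rng = np.random.default_rng(4)
N, p, lam = 30, 500, 1.0
X = rng.standard_normal((N, p))
y = rng.standard_normal(N)

def ridge(Z, y, lam):
    return np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ y)

# Direct fit: solve a p x p system (expensive when p >> N)
beta_direct = ridge(X, y, lam)

# Trick: X = R Q with R = U D (N x N) and Q = Vt (orthonormal rows)
U, D, Vt = np.linalg.svd(X, full_matrices=False)
R = U * D                        # N x N
beta_star = ridge(R, y, lam)     # solve only an N x N system
beta_trick = Vt.T @ beta_star    # map back to p dimensions

print(np.allclose(beta_direct, beta_trick))  # True
```

The agreement follows from the push-through identity: (XᵀX + λI)⁻¹Xᵀ y = Xᵀ(XXᵀ + λI)⁻¹ y, and Q drops out of the N×N system because its rows are orthonormal.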

Page 28:

Example: Lasso and Leukemia Classification

[Figure: lasso coefficient paths for the leukemia data: standardized coefficients versus |beta|/max|beta|; a few genes (labeled 2968, 6801, 2945, 461, 2267) enter the model first.]

Page 29:

10-fold Cross-validation for the Lasso

[Figure: 10-fold cross-validation error versus the fraction |β|/max|β| for the Leukemia expression data (lasso).]

See http://www-stat.stanford.edu/∼hastie/Papers/#LARS

Page 30:

Summary

• With large numbers of genes as variables (P ≫ N), we have to learn how to tame even our simplest supervised learning techniques. Even linear models are way too aggressive.

• We need to talk with the biologists to learn their priors; i.e. genes work in groups.

• We need to provide them with easy-to-use software to try out our ideas, and involve them in the design. They are much better at understanding their data than we are.