sta306b may 24, 2011 rob tibshirani, stanford:...

Sta306b May 24, 2011 Rob Tibshirani, Stanford: 1�

�

�

�

Biomarker discovery andvalidation

a challenge for the field of statistics...

Robert Tibshirani

Stanford University

Email:[email protected]

http://www-stat.stanford.edu/~tibs


�

�

�

Outline

• a recent controversy in Cancer genomics

• some general suggestions for dealing with this problem


�

�

�

A recent experience

• Dave et al published a high-profile study in NEJM, reporting

that they had found two sets of genes whose expression were

highly predictive of survival in patients with Follicular

Lyphoma.

• the paper got a lot of attention at the recent ASH meeting,

because the genes in the clusters were largely expressed in

non-tumor cells, suggesting that the host-response was the

important factor

• One of my medical collaborators- Ron Levy, asked me to look

over their paper- he wanted to apply their model to the

Stanford FL patient population.


�

�

�

n engl j med

351;21

www.nejm.org november

18, 2004

2159

The

new englandjournal

of

medicine

established in 1812

november

18

,

2004

vol. 351 no. 21

Prediction of Survival in Follicular Lymphoma Based on Molecular Features of Tumor-Infiltrating Immune Cells

Sandeep S. Dave, M.D., George Wright, Ph.D., Bruce Tan, M.D., Andreas Rosenwald, M.D.,

Randy D. Gascoyne, M.D., Wing C. Chan, M.D., Richard I. Fisher, M.D., Rita M. Braziel, M.D.,

Lisa M. Rimsza, M.D., Thomas M. Grogan, M.D., Thomas P. Miller, M.D., Michael LeBlanc, Ph.D.,

Timothy C. Greiner, M.D., Dennis D. Weisenburger, M.D., James C. Lynch, Ph.D., Julie Vose, M.D.,

James O. Armitage, M.D., Erlend B. Smeland, M.D., Ph.D., Stein Kvaloy, M.D., Ph.D., Harald Holte, M.D., Ph.D.,

Jan Delabie, M.D., Ph.D., Joseph M. Connors, M.D., Peter M. Lansdorp, M.D., Ph.D., Qin Ouyang, Ph.D.,

T. Andrew Lister, M.D., Andrew J. Davies, M.D., Andrew J. Norton, M.D., H. Konrad Muller-Hermelink, M.D.,

German Ott, M.D., Elias Campo, M.D., Emilio Montserrat, M.D., Wyndham H. Wilson, M.D., Ph.D.,

Elaine S. Jaffe, M.D., Richard Simon, Ph.D., Liming Yang, Ph.D., John Powell, M.S., Hong Zhao, M.S.,

Neta Goldschmidt, M.D., Michael Chiorazzi, B.A., and Louis M. Staudt, M.D., Ph.D.

abstract

From National Cancer Institute (S.S.D.,G.W., B.T., A.R., W.H.W., E.S.J., R.S., H.Z.,N.G., M.C., L.M.S.); Center for InformationTechnology (L.Y., J.P.); and National Heart,Lung, and Blood Institute (S.S.D.) — all inBethesda, Md.; British Columbia CancerCenter, Vancouver, Canada (R.D.G., J.M.C.,P.M.L., Q.O.); University of NebraskaMedical Center, Omaha (W.C.C., T.C.G.,D.D.W., J.C.L., J.V., J.O.A.); Southwest On-cology Group, San Antonio, Tex. (R.I.F.,T.M.G., T.P.M., M.L.); University of Roch-ester School of Medicine, Rochester, N.Y.(R.I.F.); Oregon Health and Science Univer-sity, Portland (R.M.B.); University of ArizonaCancer Center, Tucson (L.M.R., T.M.G.,T.P.M.); Fred Hutchinson Cancer ResearchCenter, Seattle (M.L.); Norwegian RadiumHospital, Oslo (E.B.S., S.K., H.H., J.D.);Cancer Research UK, St. Bartholomew’sHospital, London (T.A.L., A.J.D., A.J.N.);University of Würzburg, Würzburg, Ger-many (A.R., H.K.M.-H., G.O.); and Univer-sity of Barcelona, Barcelona, Spain (E.C.,E.M.). Address reprint requests to Dr.Staudt at the National Cancer Institute,Bldg. 10, Rm. 4N114, NIH, Bethesda, MD20892, or at [email protected].

N Engl J Med 2004;351:2159-69.

Copyright © 2004 Massachusetts Medical Society.

background

Patients with follicular lymphoma may survive for periods of less than 1 year to morethan 20 years after diagnosis. We used gene-expression profiles of tumor-biopsy spec-imens obtained at diagnosis to develop a molecular predictor of the length of survival.

methods

Gene-expression profiling was performed on 191 biopsy specimens obtained from pa-tients with untreated follicular lymphoma. Supervised methods were used to discoverexpression patterns associated with the length of survival in a training set of 95 speci-mens. A molecular predictor of survival was constructed from these genes and validat-ed in an independent test set of 96 specimens.

results

Individual genes that predicted the length of survival were grouped into gene-expres-sion signatures on the basis of their expression in the training set, and two such signa-tures were used to construct a survival predictor. The two signatures allowed patientswith specimens in the test set to be divided into four quartiles with widely disparate me-dian lengths of survival (13.6, 11.1, 10.8, and 3.9 years), independently of clinicalprognostic variables. Flow cytometry showed that these signatures reflected gene ex-pression by nonmalignant tumor-infiltrating immune cells.

conclusions

The length of survival among patients with follicular lymphoma correlates with themolecular features of nonmalignant immune cells present in the tumor at diagnosis.

Copyright © 2004 Massachusetts Medical Society. All rights reserved. Downloaded from www.nejm.org at Stanford University on May 10, 2005 .


�

�

�

What happened next...

• I downloaded the data

• Applied some familiar statistical tools- eg SAM (Significance

analysis of microarrays), less familiar ones- supervised principal

components, and also gave the data to Brad Efron. Our initial

finding- no significant correlation between gene expression and

survival.

• I spent 2-3 weeks emailing back and forth with their

statistician (George Wright) and programming in R, to recreate

their analysis

• Confession- it was fun being a “forensic statistician”


�

�

�

Summary of their findings

• They started with the expression of approximately 49,000

genes measured on 189 patient samples, derived from DNA

microarrays. A survival time (possibly censored) was available

for each patient

• they randomly split the data into a training set of 89 patients

and a test set of 90 patients

• using a multi-step procedure (described below), they extracted

two clusters of genes, called IR1 (immune response 1) and IR2

(immune response 2).

• They averaged the gene expression of the genes in each cluster,

to create two “super-genes”.


�

�

�

... continued

• They then fit these super-genes together in a Cox model for

survival, and applied it to the training and test sets. The

p-value in the training set was < 10e− 7 and 0.003 in the test

set. IR1 correlates with good prognosis; IR2 with poor

prognosis

• In the remainder of the paper they interpret the genes in their

model


�

�

�

Details of their analysis

1) Divide the data randomly into training and test sets of

approximately equal numbers of patients. Apply the following

recipe [steps 2–6] to the training set.

2) Choose all genes with univariate Cox score > 1.5 in absolute

value. This reduced the number of genes from roughly 49,000

to roughly 3,000, with about a 50-50 split between good

prognosis genes (negative scores) and poor prognosis genes

(positive scores).

3) Do separate hierarchical clusterings (correlation metric, average

linkage) of the good and poor prognosis genes.


�

�

�

Details continued...

4) Find all clusters in the dendrograms (clustering trees)

containing between 25 and 50 genes, with internal correlation

at least 0.5. Represent each cluster by the average expression

of all genes in the cluster– a “supergene” Try every pair of

supergenes as predictors in Cox models for predicting survival.

5) Choose the most significant pair from this process. The authors

call the resulting pair of clusters IR1 (good prognosis) and IR2

(poor prognosis).

6) Finally use the model (IR1, IR2) in a Cox model to predict

survival in the test set.


�

�

�

genes

patients

survival times

Training set

Expression data

Cox scores

>1.5

< −1.5

49,000

Extract clusters withbetween 25 and 50 genes,and corr > 0.5

Hierarchicalclustering

89

Test set90 patients

Apply best model to test set

in a Cox survival model

and try all possible pairs of supergenes

Average each cluster into a supergene,~1500 Positive genes

~1500 Negative genes


�

�

�

n engl j med

351;21

www.nejm.org november

18, 2004

prediction of survival in follicular lymphoma

2163

Training Set of TumorSpecimens (N=95)

Genes Associated

with Poor

Prognosis

0.33 31

Relative level of expression(¬ median value)

Immune-Response 1 Signature

Immune-Response 2 Signature

ACTN1ATP8B2BIN2C1RLC6orf37C9orf52CD7CD8B1DDEF2DKFZP566G1424DKFZP761D1624FLJ32274FLNAFLT3LGGALNT12GNAQHCSTHOXB2IL7R

TNFRSF1BTNFRSF25TNFSF12TNFSF13B

BLVRAC17orf31C1QAC1QBC3AR1C4AC6orf145CEB1DHRS3DUSP3F8FCGR1AGPRC5BHOXD8LGMNME1

Genes Associated

with Favorable Prognosis

IMAGE:5289004INPP1ITKJAM KIAA1128KIAA1450LEF1LGALS2LOC340061NFICPTRFRAB27ARALGDSSEMA4CSEPW1STAT4TBC1D4TEAD1TMEPAI

MITFMRVI1NDNOASLPELOSCARB2SEPT10TLR5

Pro

bab

ilit

y o

f S

urv

ival 0.8

0.6

0.2

1.0

0.4

0.00 3 6 9 12 15

Years

No. at Risk 12296699147187

A B

C D

Follicular lymphoma biopsy specimens(N=191)

Training set (N=95) Test set (N=96)

Identify genes with expression patterns

associated with favorable prognosis

Create optimal multi-variate model withsurvival signature

averages

Validate survival predictor in test set

Identify genes with expression patterns

associated with poor prognosis

Average gene- expression levels for genes in each survival signature

Use hierarchicalclustering to

identify survivalsignatures

Use hierarchicalclustering to

identify survivalsignatures

Average gene- expression levels for genes in each survival signature



�

�

�

n en

gl j m

ed

351;21

ww

w.n

ejm.o

rgn

ovem

ber

18, 200

4

pr

ed

ict

ion

of

sur

viv

al

in f

ol

lic

ul

ar

lym

ph

om

a

2165

Figure 2. Development of a Molecular Predictor of Survival in Follicular Lymphoma.

Panel A shows overall survival among the patients with biopsy specimens in the test set, according to the quartile of the survival-predictor score (SPS). Panel B shows overall survival

according to the International Prognostic Index (IPI) risk group for all the patients for whom these data were available. Panel C shows overall survival among the patients with specimens

in the test set for the indicated IPI risk group, stratified according to the quartile of the SPS.

Pro

bab

ilit

y o

f S

urv

ival

0.8

0.6

0.2

1.0

0.4

0.00 3 6 9 12 15

P<0.001

Years

No. at RiskIPI, 0 or 1

IPI, 2 or 3

IPI, 4 or 5

8 1 0

19 5 0

38 18 0

55 31 1

75 48 6

91 64 9

Pro

bab

ilit

y o

f S

urv

ival

0.8

0.6

0.2

1.0

0.4

0.00 3 6 9 12 15

P<0.001

Years

No. at RiskSPS quartile 1

SPS quartile 2

SPS quartile 3

SPS quartile 4

1 2 2 1

4 3 6 1

9 10 10 4

14 16 13 7

21 23 18 12

23 242423

1

2

3

4

51 58 54 29

86 86 69 38

13.6 11.1 10.8 3.9

Pro

bab

ilit

y o

f S

urv

ival

0.8

0.6

0.2

1.0

0.4

0.00 3 6 9 12 15

P<0.001

Years

No. at RiskSPS quartile 1

SPS quartile 2

SPS quartile 3

SPS quartile 4

1 1 2 0

3 2 4 0

5 3 6 1

13 7 8 2

9 5 8 2

15 8 9 6

0.8

0.6

0.2

1.0

0.4

0.00 3 6 9 12 15

P<0.001

Years

0 0 0 0

0 0 2 0

2 4 4 2

3 7 5 3

3 10 9 6

3 10 11 12

Quartile of SPS SurvivalMedian (yr) 5 yr (%) 10 yr (%)

A

B CIPI, 0 or 1 IPI, 2 or 3

Copyright ©

2004 Massachusetts M

edical Society. A

ll rights reserved. D

ownloaded from

ww

w.nejm

.org at Stanford U

niversity on May 10, 2005 .


�

�

�

p-values of all cluster pairs

1e−08 1e−06 1e−04 1e−02

5e−

045e−

035e−

025e−

01

train set p−value

test

set

p−

valu

e

The total number of points (cluster pairs) with test set p-values less

than 0.05 (239) is far fewer than we’d expect to see by chance (735)


�

�

�

Univariate p-values

1e−04 5e−04 2e−03 1e−02

0.05

0.20

0.50

training pval

test

pva

l

poor prognosis genes

5e−04 2e−03 1e−020.

10.

20.

51.

0

training pval

test

pva

l

good prognosis genes


�

�

�

Swapping train and test sets

1e−11 1e−08 1e−05 1e−02

0.00

10.

005

0.05

00.

500

train set p−value

test

set

p−

valu

e

There are only 85 pairs out of 11628 that are significant in the test

set at the 0.05 level, while we would expect 11628*.05=581 pairs

just by chance.


�

�

�

Swapping train and test sets, ctd...

2e−04 1e−03 5e−03 2e−02

0.1

0.2

0.5

1.0

training pval

test

pva

l

poor prognosis genes

5e−05 5e−04 5e−030.

10.

20.

51.

0

training pval

test

pva

l

good prognosis genes


�

�

�

Cluster size ranges (30,60) rather than (25,50)

1e−07 1e−05 1e−03 1e−01

5e−

045e−

035e−

025e−

01

train set p−value

test

set

p−

valu

e


�

�

�

The aftermath

• I published a short letter to NEJM in March 2005; full details

of my re-analysis appear on my website

• The authors published a rebuttal in the same issue. Their

arguments:

– (1) we followed standard statistical procedures, found a

small p-value on the test set, therefore our finding is correct;

– (2) our method found an interaction, which SAM can’t find

– (3) we get small p-values if we apply our model to random

halves of the data (????!!!!!!)


�

�

�n engl j med 352;14 www.nejm.org april 7, 2005

correspondence

1497

Our predictor could not have been discoveredwith the SAM method, which relies solely on uni-variate associations with survival. Rather, our pre-dictor derives its strength from the synergisticcombination of two gene-expression signatures ina multivariate model. Tibshirani confuses the abil-ity of our method to discover a survival associationwith the fact that we actually found one that validat-ed the association. When he exchanged the trainingset for the test set, he was unable to rediscover ourgene-expression predictor because some genes inour predictor fell below the P value threshold forassociation with survival in the test set. This doesnot negate the fact that our model is highly associ-ated with survival in the test set.

Hong et al. have made three errors. First, in oursorted subpopulations, the CD19¡ fraction con-tained, on average, 12.6 percent contaminationwith follicular-lymphoma cells, not 25 percent, asthey claim. To believe that the higher expression ofthe immune-response signatures in the CD19¡

fraction is due to this 12.6 percent contaminationrequires that the lymphoma cells in the CD19¡fraction have expression of the immune-responsesignatures that was more than eight times as highas that in their counterparts in the CD19+ fraction.Second, Hong et al. incorrectly discount the im-mune-response 1 signature, which contributes sig-nificantly to the survival model in the test set(P<0.001). Third, many of the immune-responsesignature genes are selectively expressed in T cells,monocytes, or dendritic cells, or in more than oneof these, but not in B cells, making the contentionof Hong et al. even more implausible.

Louis M. Staudt, M.D., Ph.D.

George Wright, Ph.D.

Sandeep Dave, M.D.

National Cancer InstituteBethesda, MD [email protected]

1. Ransohoff DF. Rules of evidence for cancer molecular-markerdiscovery and validation. Nat Rev Cancer 2004;4:309-14.

When Doctors Go to War

to the editor: Like Bloche and Marks in their Per-spective article on doctors in combat (Jan. 6 is-sue),1 the American Medical Association (AMA)applauds the outstanding work of military physi-cians in treating wounded soldiers under extremely

challenging circumstances.2 Physicians are de-fined by their common calling to prevent harm andtreat people who are ill or injured and by their uni-versal commitment to uphold recognized principlesof medical ethics whenever patients rely on their

Figure 1. Results of Samplings of Half-Sets and Association between Gene Expression and Survival.

No

. o

f H

alf-

Set

s O

ut

of

2000 S

amp

lin

gs

200

250

150

100

50

010¡11 10¡10 10¡9 10¡8 10¡7 10¡6 10¡5 10¡4 10¡3 10¡2 10¡1 100

P Value for Association of Gene-Expression–Based Model with Survival

300



�

�

�

General comments

• Their finding is fragile. I don’t believe that it is real or

reproducible

• This experience uncovers a problem that is of general

importance to our field:

– with many predictors, it is too easy to overfit the data and

find spurious results

– we can inadvertently mislead the reader, and mislead

ourselves. I have been guilty of this too


�

�

�

Some recommendations

• encourage authors to publish not only the raw data, but a

script of their analysis

• encourage authors to use “canned” methods/packages, with

built-in cross-validation to validate the model search process.

• develop software systems to encoruage reproducible research

• have journals encourage, not discourage, criticism in print of

published work.


�

�

�

A more radical idea

• Incentives do not exist in the current system to encourage

correctness of publshed analyses. The first priority for journals

and authors is to publish “hot” articles. There are no real bad

consequences for errors found years later in published articles.

• The reputation of authors and journal must be on the line.

• Proposal: An IMDB for scientific papers— “ISDB”

sta306b may 24, 2011 rob tibshirani, stanford:...

Documents