sta306b may 24, 2011 rob tibshirani, stanford:...
TRANSCRIPT
Sta306b May 24, 2011 Rob Tibshirani, Stanford: 1�
�
�
�
Biomarker discovery andvalidation
a challenge for the field of statistics...
Robert Tibshirani
Stanford University
Email:[email protected]
http://www-stat.stanford.edu/~tibs
Sta306b May 24, 2011 Rob Tibshirani, Stanford: 2�
�
�
�
Outline
• a recent controversy in Cancer genomics
• some general suggestions for dealing with this problem
Sta306b May 24, 2011 Rob Tibshirani, Stanford: 3�
�
�
�
A recent experience
• Dave et al published a high-profile study in NEJM, reporting
that they had found two sets of genes whose expression were
highly predictive of survival in patients with Follicular
Lyphoma.
• the paper got a lot of attention at the recent ASH meeting,
because the genes in the clusters were largely expressed in
non-tumor cells, suggesting that the host-response was the
important factor
• One of my medical collaborators- Ron Levy, asked me to look
over their paper- he wanted to apply their model to the
Stanford FL patient population.
Sta306b May 24, 2011 Rob Tibshirani, Stanford: 4�
�
�
�
n engl j med
351;21
www.nejm.org november
18, 2004
2159
The
new englandjournal
of
medicine
established in 1812
november
18
,
2004
vol. 351 no. 21
Prediction of Survival in Follicular Lymphoma Based on Molecular Features of Tumor-Infiltrating Immune Cells
Sandeep S. Dave, M.D., George Wright, Ph.D., Bruce Tan, M.D., Andreas Rosenwald, M.D.,
Randy D. Gascoyne, M.D., Wing C. Chan, M.D., Richard I. Fisher, M.D., Rita M. Braziel, M.D.,
Lisa M. Rimsza, M.D., Thomas M. Grogan, M.D., Thomas P. Miller, M.D., Michael LeBlanc, Ph.D.,
Timothy C. Greiner, M.D., Dennis D. Weisenburger, M.D., James C. Lynch, Ph.D., Julie Vose, M.D.,
James O. Armitage, M.D., Erlend B. Smeland, M.D., Ph.D., Stein Kvaloy, M.D., Ph.D., Harald Holte, M.D., Ph.D.,
Jan Delabie, M.D., Ph.D., Joseph M. Connors, M.D., Peter M. Lansdorp, M.D., Ph.D., Qin Ouyang, Ph.D.,
T. Andrew Lister, M.D., Andrew J. Davies, M.D., Andrew J. Norton, M.D., H. Konrad Muller-Hermelink, M.D.,
German Ott, M.D., Elias Campo, M.D., Emilio Montserrat, M.D., Wyndham H. Wilson, M.D., Ph.D.,
Elaine S. Jaffe, M.D., Richard Simon, Ph.D., Liming Yang, Ph.D., John Powell, M.S., Hong Zhao, M.S.,
Neta Goldschmidt, M.D., Michael Chiorazzi, B.A., and Louis M. Staudt, M.D., Ph.D.
abstract
From National Cancer Institute (S.S.D.,G.W., B.T., A.R., W.H.W., E.S.J., R.S., H.Z.,N.G., M.C., L.M.S.); Center for InformationTechnology (L.Y., J.P.); and National Heart,Lung, and Blood Institute (S.S.D.) — all inBethesda, Md.; British Columbia CancerCenter, Vancouver, Canada (R.D.G., J.M.C.,P.M.L., Q.O.); University of NebraskaMedical Center, Omaha (W.C.C., T.C.G.,D.D.W., J.C.L., J.V., J.O.A.); Southwest On-cology Group, San Antonio, Tex. (R.I.F.,T.M.G., T.P.M., M.L.); University of Roch-ester School of Medicine, Rochester, N.Y.(R.I.F.); Oregon Health and Science Univer-sity, Portland (R.M.B.); University of ArizonaCancer Center, Tucson (L.M.R., T.M.G.,T.P.M.); Fred Hutchinson Cancer ResearchCenter, Seattle (M.L.); Norwegian RadiumHospital, Oslo (E.B.S., S.K., H.H., J.D.);Cancer Research UK, St. Bartholomew’sHospital, London (T.A.L., A.J.D., A.J.N.);University of Würzburg, Würzburg, Ger-many (A.R., H.K.M.-H., G.O.); and Univer-sity of Barcelona, Barcelona, Spain (E.C.,E.M.). Address reprint requests to Dr.Staudt at the National Cancer Institute,Bldg. 10, Rm. 4N114, NIH, Bethesda, MD20892, or at [email protected].
N Engl J Med 2004;351:2159-69.
Copyright © 2004 Massachusetts Medical Society.
background
Patients with follicular lymphoma may survive for periods of less than 1 year to morethan 20 years after diagnosis. We used gene-expression profiles of tumor-biopsy spec-imens obtained at diagnosis to develop a molecular predictor of the length of survival.
methods
Gene-expression profiling was performed on 191 biopsy specimens obtained from pa-tients with untreated follicular lymphoma. Supervised methods were used to discoverexpression patterns associated with the length of survival in a training set of 95 speci-mens. A molecular predictor of survival was constructed from these genes and validat-ed in an independent test set of 96 specimens.
results
Individual genes that predicted the length of survival were grouped into gene-expres-sion signatures on the basis of their expression in the training set, and two such signa-tures were used to construct a survival predictor. The two signatures allowed patientswith specimens in the test set to be divided into four quartiles with widely disparate me-dian lengths of survival (13.6, 11.1, 10.8, and 3.9 years), independently of clinicalprognostic variables. Flow cytometry showed that these signatures reflected gene ex-pression by nonmalignant tumor-infiltrating immune cells.
conclusions
The length of survival among patients with follicular lymphoma correlates with themolecular features of nonmalignant immune cells present in the tumor at diagnosis.
Copyright © 2004 Massachusetts Medical Society. All rights reserved. Downloaded from www.nejm.org at Stanford University on May 10, 2005 .
Sta306b May 24, 2011 Rob Tibshirani, Stanford: 5�
�
�
�
What happened next...
• I downloaded the data
• Applied some familiar statistical tools- eg SAM (Significance
analysis of microarrays), less familiar ones- supervised principal
components, and also gave the data to Brad Efron. Our initial
finding- no significant correlation between gene expression and
survival.
• I spent 2-3 weeks emailing back and forth with their
statistician (George Wright) and programming in R, to recreate
their analysis
• Confession- it was fun being a “forensic statistician”
Sta306b May 24, 2011 Rob Tibshirani, Stanford: 6�
�
�
�
Summary of their findings
• They started with the expression of approximately 49,000
genes measured on 189 patient samples, derived from DNA
microarrays. A survival time (possibly censored) was available
for each patient
• they randomly split the data into a training set of 89 patients
and a test set of 90 patients
• using a multi-step procedure (described below), they extracted
two clusters of genes, called IR1 (immune response 1) and IR2
(immune response 2).
• They averaged the gene expression of the genes in each cluster,
to create two “super-genes”.
Sta306b May 24, 2011 Rob Tibshirani, Stanford: 7�
�
�
�
... continued
• They then fit these super-genes together in a Cox model for
survival, and applied it to the training and test sets. The
p-value in the training set was < 10e− 7 and 0.003 in the test
set. IR1 correlates with good prognosis; IR2 with poor
prognosis
• In the remainder of the paper they interpret the genes in their
model
Sta306b May 24, 2011 Rob Tibshirani, Stanford: 8�
�
�
�
Details of their analysis
1) Divide the data randomly into training and test sets of
approximately equal numbers of patients. Apply the following
recipe [steps 2–6] to the training set.
2) Choose all genes with univariate Cox score > 1.5 in absolute
value. This reduced the number of genes from roughly 49,000
to roughly 3,000, with about a 50-50 split between good
prognosis genes (negative scores) and poor prognosis genes
(positive scores).
3) Do separate hierarchical clusterings (correlation metric, average
linkage) of the good and poor prognosis genes.
Sta306b May 24, 2011 Rob Tibshirani, Stanford: 9�
�
�
�
Details continued...
4) Find all clusters in the dendrograms (clustering trees)
containing between 25 and 50 genes, with internal correlation
at least 0.5. Represent each cluster by the average expression
of all genes in the cluster– a “supergene” Try every pair of
supergenes as predictors in Cox models for predicting survival.
5) Choose the most significant pair from this process. The authors
call the resulting pair of clusters IR1 (good prognosis) and IR2
(poor prognosis).
6) Finally use the model (IR1, IR2) in a Cox model to predict
survival in the test set.
Sta306b May 24, 2011 Rob Tibshirani, Stanford: 10�
�
�
�
genes
patients
survival times
Training set
Expression data
Cox scores
>1.5
< −1.5
49,000
Extract clusters withbetween 25 and 50 genes,and corr > 0.5
Hierarchicalclustering
89
Test set90 patients
Apply best model to test set
in a Cox survival model
and try all possible pairs of supergenes
Average each cluster into a supergene,~1500 Positive genes
~1500 Negative genes
Sta306b May 24, 2011 Rob Tibshirani, Stanford: 11�
�
�
�
n engl j med
351;21
www.nejm.org november
18, 2004
prediction of survival in follicular lymphoma
2163
Training Set of TumorSpecimens (N=95)
Genes Associated
with Poor
Prognosis
0.33 31
Relative level of expression(¬ median value)
Immune-Response 1 Signature
Immune-Response 2 Signature
ACTN1ATP8B2BIN2C1RLC6orf37C9orf52CD7CD8B1DDEF2DKFZP566G1424DKFZP761D1624FLJ32274FLNAFLT3LGGALNT12GNAQHCSTHOXB2IL7R
TNFRSF1BTNFRSF25TNFSF12TNFSF13B
BLVRAC17orf31C1QAC1QBC3AR1C4AC6orf145CEB1DHRS3DUSP3F8FCGR1AGPRC5BHOXD8LGMNME1
Genes Associated
with Favorable Prognosis
IMAGE:5289004INPP1ITKJAM KIAA1128KIAA1450LEF1LGALS2LOC340061NFICPTRFRAB27ARALGDSSEMA4CSEPW1STAT4TBC1D4TEAD1TMEPAI
MITFMRVI1NDNOASLPELOSCARB2SEPT10TLR5
Pro
bab
ilit
y o
f S
urv
ival 0.8
0.6
0.2
1.0
0.4
0.00 3 6 9 12 15
Years
No. at Risk 12296699147187
A B
C D
Follicular lymphoma biopsy specimens(N=191)
Training set (N=95) Test set (N=96)
Identify genes with expression patterns
associated with favorable prognosis
Create optimal multi-variate model withsurvival signature
averages
Validate survival predictor in test set
Identify genes with expression patterns
associated with poor prognosis
Average gene- expression levels for genes in each survival signature
Use hierarchicalclustering to
identify survivalsignatures
Use hierarchicalclustering to
identify survivalsignatures
Average gene- expression levels for genes in each survival signature
Copyright © 2004 Massachusetts Medical Society. All rights reserved. Downloaded from www.nejm.org at Stanford University on May 10, 2005 .
Sta306b May 24, 2011 Rob Tibshirani, Stanford: 12�
�
�
�
n en
gl j m
ed
351;21
ww
w.n
ejm.o
rgn
ovem
ber
18, 200
4
pr
ed
ict
ion
of
sur
viv
al
in f
ol
lic
ul
ar
lym
ph
om
a
2165
Figure 2. Development of a Molecular Predictor of Survival in Follicular Lymphoma.
Panel A shows overall survival among the patients with biopsy specimens in the test set, according to the quartile of the survival-predictor score (SPS). Panel B shows overall survival
according to the International Prognostic Index (IPI) risk group for all the patients for whom these data were available. Panel C shows overall survival among the patients with specimens
in the test set for the indicated IPI risk group, stratified according to the quartile of the SPS.
Pro
bab
ilit
y o
f S
urv
ival
0.8
0.6
0.2
1.0
0.4
0.00 3 6 9 12 15
P<0.001
Years
No. at RiskIPI, 0 or 1
IPI, 2 or 3
IPI, 4 or 5
8 1 0
19 5 0
38 18 0
55 31 1
75 48 6
91 64 9
Pro
bab
ilit
y o
f S
urv
ival
0.8
0.6
0.2
1.0
0.4
0.00 3 6 9 12 15
P<0.001
Years
No. at RiskSPS quartile 1
SPS quartile 2
SPS quartile 3
SPS quartile 4
1 2 2 1
4 3 6 1
9 10 10 4
14 16 13 7
21 23 18 12
23 242423
1
2
3
4
51 58 54 29
86 86 69 38
13.6 11.1 10.8 3.9
Pro
bab
ilit
y o
f S
urv
ival
0.8
0.6
0.2
1.0
0.4
0.00 3 6 9 12 15
P<0.001
Years
No. at RiskSPS quartile 1
SPS quartile 2
SPS quartile 3
SPS quartile 4
1 1 2 0
3 2 4 0
5 3 6 1
13 7 8 2
9 5 8 2
15 8 9 6
0.8
0.6
0.2
1.0
0.4
0.00 3 6 9 12 15
P<0.001
Years
0 0 0 0
0 0 2 0
2 4 4 2
3 7 5 3
3 10 9 6
3 10 11 12
Quartile of SPS SurvivalMedian (yr) 5 yr (%) 10 yr (%)
A
B CIPI, 0 or 1 IPI, 2 or 3
Copyright ©
2004 Massachusetts M
edical Society. A
ll rights reserved. D
ownloaded from
ww
w.nejm
.org at Stanford U
niversity on May 10, 2005 .
Sta306b May 24, 2011 Rob Tibshirani, Stanford: 13�
�
�
�
p-values of all cluster pairs
1e−08 1e−06 1e−04 1e−02
5e−
045e−
035e−
025e−
01
train set p−value
test
set
p−
valu
e
The total number of points (cluster pairs) with test set p-values less
than 0.05 (239) is far fewer than we’d expect to see by chance (735)
Sta306b May 24, 2011 Rob Tibshirani, Stanford: 14�
�
�
�
Univariate p-values
1e−04 5e−04 2e−03 1e−02
0.05
0.20
0.50
training pval
test
pva
l
poor prognosis genes
5e−04 2e−03 1e−020.
10.
20.
51.
0
training pval
test
pva
l
good prognosis genes
Sta306b May 24, 2011 Rob Tibshirani, Stanford: 15�
�
�
�
Swapping train and test sets
1e−11 1e−08 1e−05 1e−02
0.00
10.
005
0.05
00.
500
train set p−value
test
set
p−
valu
e
There are only 85 pairs out of 11628 that are significant in the test
set at the 0.05 level, while we would expect 11628*.05=581 pairs
just by chance.
Sta306b May 24, 2011 Rob Tibshirani, Stanford: 16�
�
�
�
Swapping train and test sets, ctd...
2e−04 1e−03 5e−03 2e−02
0.1
0.2
0.5
1.0
training pval
test
pva
l
poor prognosis genes
5e−05 5e−04 5e−030.
10.
20.
51.
0
training pval
test
pva
l
good prognosis genes
Sta306b May 24, 2011 Rob Tibshirani, Stanford: 17�
�
�
�
Cluster size ranges (30,60) rather than (25,50)
1e−07 1e−05 1e−03 1e−01
5e−
045e−
035e−
025e−
01
train set p−value
test
set
p−
valu
e
Sta306b May 24, 2011 Rob Tibshirani, Stanford: 18�
�
�
�
The aftermath
• I published a short letter to NEJM in March 2005; full details
of my re-analysis appear on my website
• The authors published a rebuttal in the same issue. Their
arguments:
– (1) we followed standard statistical procedures, found a
small p-value on the test set, therefore our finding is correct;
– (2) our method found an interaction, which SAM can’t find
– (3) we get small p-values if we apply our model to random
halves of the data (????!!!!!!)
Sta306b May 24, 2011 Rob Tibshirani, Stanford: 19�
�
�
�n engl j med 352;14 www.nejm.org april 7, 2005
correspondence
1497
Our predictor could not have been discoveredwith the SAM method, which relies solely on uni-variate associations with survival. Rather, our pre-dictor derives its strength from the synergisticcombination of two gene-expression signatures ina multivariate model. Tibshirani confuses the abil-ity of our method to discover a survival associationwith the fact that we actually found one that validat-ed the association. When he exchanged the trainingset for the test set, he was unable to rediscover ourgene-expression predictor because some genes inour predictor fell below the P value threshold forassociation with survival in the test set. This doesnot negate the fact that our model is highly associ-ated with survival in the test set.
Hong et al. have made three errors. First, in oursorted subpopulations, the CD19¡ fraction con-tained, on average, 12.6 percent contaminationwith follicular-lymphoma cells, not 25 percent, asthey claim. To believe that the higher expression ofthe immune-response signatures in the CD19¡
fraction is due to this 12.6 percent contaminationrequires that the lymphoma cells in the CD19¡fraction have expression of the immune-responsesignatures that was more than eight times as highas that in their counterparts in the CD19+ fraction.Second, Hong et al. incorrectly discount the im-mune-response 1 signature, which contributes sig-nificantly to the survival model in the test set(P<0.001). Third, many of the immune-responsesignature genes are selectively expressed in T cells,monocytes, or dendritic cells, or in more than oneof these, but not in B cells, making the contentionof Hong et al. even more implausible.
Louis M. Staudt, M.D., Ph.D.
George Wright, Ph.D.
Sandeep Dave, M.D.
National Cancer InstituteBethesda, MD [email protected]
1. Ransohoff DF. Rules of evidence for cancer molecular-markerdiscovery and validation. Nat Rev Cancer 2004;4:309-14.
When Doctors Go to War
to the editor: Like Bloche and Marks in their Per-spective article on doctors in combat (Jan. 6 is-sue),1 the American Medical Association (AMA)applauds the outstanding work of military physi-cians in treating wounded soldiers under extremely
challenging circumstances.2 Physicians are de-fined by their common calling to prevent harm andtreat people who are ill or injured and by their uni-versal commitment to uphold recognized principlesof medical ethics whenever patients rely on their
Figure 1. Results of Samplings of Half-Sets and Association between Gene Expression and Survival.
No
. o
f H
alf-
Set
s O
ut
of
2000 S
amp
lin
gs
200
250
150
100
50
010¡11 10¡10 10¡9 10¡8 10¡7 10¡6 10¡5 10¡4 10¡3 10¡2 10¡1 100
P Value for Association of Gene-Expression–Based Model with Survival
300
Copyright © 2005 Massachusetts Medical Society. All rights reserved. Downloaded from www.nejm.org at Stanford University on May 10, 2005 .
Sta306b May 24, 2011 Rob Tibshirani, Stanford: 20�
�
�
�
General comments
• Their finding is fragile. I don’t believe that it is real or
reproducible
• This experience uncovers a problem that is of general
importance to our field:
– with many predictors, it is too easy to overfit the data and
find spurious results
– we can inadvertently mislead the reader, and mislead
ourselves. I have been guilty of this too
Sta306b May 24, 2011 Rob Tibshirani, Stanford: 21�
�
�
�
Some recommendations
• encourage authors to publish not only the raw data, but a
script of their analysis
• encourage authors to use “canned” methods/packages, with
built-in cross-validation to validate the model search process.
• develop software systems to encoruage reproducible research
• have journals encourage, not discourage, criticism in print of
published work.
Sta306b May 24, 2011 Rob Tibshirani, Stanford: 22�
�
�
�
A more radical idea
• Incentives do not exist in the current system to encourage
correctness of publshed analyses. The first priority for journals
and authors is to publish “hot” articles. There are no real bad
consequences for errors found years later in published articles.
• The reputation of authors and journal must be on the line.
• Proposal: An IMDB for scientific papers— “ISDB”