Michael Biehl
Kerstin Bunte
Petra Schneider
DREAM 6 / FlowCAP 2 Challenge: Molecular Classification of Acute
Myeloid Leukaemia
Johann Bernoulli Institute for Mathematics and Computer Science, University of Groningen, The Netherlands
Centre for Diabetes, Endocrinology & Metabolism, School of Clinical & Experimental Medicine, University of Birmingham, UK
Team Admire-LVQ: Adaptive Distance Measures In Relevance Learning Vector Quantization
DREAM6/FlowCAP2 challenge 2011
The DREAM project [www.the-dream-project.org]
Dialogue for Reverse Engineering Assessments and Methods
Organizers: Gustavo Stolovitzky, Robert Prill, Raquel Norel, Pablo Meyer (IBM Computational Biology Center); Julio Saez-Rodriguez (European Bioinformatics Institute, EMBL-EBI)
FlowCAP initiative [http://flowcap.flowsite.org]
Flow Cytometry: Critical Assessment of Population Identification Methods
Organizers: Ryan Brinkman (British Columbia Cancer Agency); Raphael Gottardo (Fred Hutchinson Cancer Research Center); Tim Mosmann (University of Rochester); Richard H. Scheuermann (University of Texas Southwestern Medical Center)
flow cytometry
preprocessing
cell size, granularity,
+26 protein markers
(ten-) thousands
of events per marker
training set: 23 AML patients, 156 healthy donors
test set: 180 unlabeled patients
Wade Rogers,
U. of Pennsylvania
peripheral blood/bone marrow aspirate
fluorophore-conjugated antibodies for specific proteins
© www.the-dream-project.org
list of markers
1 FS lin (~ cell size)
2 SS log (~ granularity)
3 CD45 (protein marker)
(markers 1-3: measured in all cells)
four diff. features
possible workflow:
- selection of cells, based on e.g. FS Lin, SS Log, CD-45
- inspection of all markers only for selected cells
e.g. differential diagnosis (subtypes)
list of markers
here: classification based on entire cell population and all markers
target diagnosis: AML patient / healthy donor
unspecific with respect to types of AML
consideration of frequencies / histograms only
information about single cells disregarded
class-conditional mean histograms
healthy donors
AML patients
suggested set of features
(1) mean (2) standard deviation (3) skewness
(4) kurtosis (5) median (6) interquartile range
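The six suggested features can be computed per marker from the measured intensities. The following is a minimal sketch, not the team's implementation: the names `marker_features` and `feature_vector` are hypothetical, population moments (divided by n) and non-excess kurtosis are assumed, and quartiles use linear interpolation. If each marker channel contributes six features, 31 channels give a 186-dim. vector.

```python
import math
import statistics


def marker_features(values):
    """Return [mean, std. dev., skewness, kurtosis, median, IQR] for one marker."""
    n = len(values)
    mean = sum(values) / n
    # Central moments in population form (divided by n): an assumption.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    std = math.sqrt(m2)
    skewness = m3 / m2 ** 1.5 if m2 > 0 else 0.0
    kurtosis = m4 / m2 ** 2 if m2 > 0 else 0.0  # non-excess kurtosis (assumption)
    median = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [mean, std, skewness, kurtosis, median, q3 - q1]


def feature_vector(markers):
    """Concatenate the six features of every marker into one flat vector."""
    vector = []
    for values in markers:
        vector.extend(marker_features(values))
    return vector
```

Other conventions (sample moments, excess kurtosis, different quantile rules) would shift the numbers slightly; the slides do not say which were used.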
feature vectors (186-dim.)
healthy donors (mean)
AML patients (mean)
matrix relevance LVQ
∙ cost function based Generalized Matrix LVQ (GMLVQ)
simplest setting: 1 prototype per class, healthy donors / AML patients
vectors w in 186-dim. feature space
nearest prototype classifier according to adaptive distance measure
$$ d(w, x) = (x - w)^\top \Lambda \, (x - w) = [\Omega (x - w)]^2 \quad \text{with} \quad \Lambda = \Omega^\top \Omega \quad (186 \times 186) $$
Training:
$$ E = \sum_m \frac{d(w_J, x_m) - d(w_K, x_m)}{d(w_J, x_m) + d(w_K, x_m)} $$
w_J: correct prototype, w_K: wrong prototype
∙ gradient based optimization of E (prototypes and matrix Ω)
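To make the ingredients of this slide concrete, here is a minimal sketch, reading the adaptive distance as d(w, x) = [Ω(x − w)]² and each term of E as (d_J − d_K) / (d_J + d_K). The function names and the dictionary layout of `prototypes` (class label mapped to prototype vector, one prototype per class) are assumptions for illustration; the gradient-based training itself is not shown.

```python
def matvec(omega, v):
    """Multiply matrix omega (list of rows) with vector v."""
    return [sum(o * x for o, x in zip(row, v)) for row in omega]


def distance(omega, w, x):
    """Adaptive distance d(w, x) = [Omega (x - w)]^2 = (x - w)^T Lambda (x - w)."""
    diff = [xi - wi for xi, wi in zip(x, w)]
    return sum(c * c for c in matvec(omega, diff))


def classify(omega, prototypes, x):
    """Nearest prototype classification: label of the closest prototype."""
    return min(prototypes, key=lambda label: distance(omega, prototypes[label], x))


def cost_term(omega, prototypes, x, label):
    """Contribution (d_J - d_K) / (d_J + d_K) of one labeled example to E."""
    d_correct = distance(omega, prototypes[label], x)
    d_wrong = min(distance(omega, w, x)
                  for lab, w in prototypes.items() if lab != label)
    return (d_correct - d_wrong) / (d_correct + d_wrong)


# Toy usage in 2 dimensions with Omega set to the identity matrix.
omega = [[1.0, 0.0], [0.0, 1.0]]
prototypes = {"healthy": [0.0, 0.0], "AML": [2.0, 0.0]}
print(classify(omega, prototypes, [0.5, 0.0]))
print(cost_term(omega, prototypes, [0.5, 0.0], "healthy"))
```

A negative cost term means the example is closer to its correct prototype than to the wrong one, so minimizing E pushes examples toward correct classification.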
validation
- 5/6 of data for training, 1/6 for validation
- ROC, threshold-average over 50 random splits
[Figure: ROC curves, true positive rate vs. false positive rate; panels: FS Lin, SS Log, CD45 / all markers]
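The validation scheme (random 5/6 vs. 1/6 splits, ROC analysis) can be sketched as follows. This is an illustration under assumptions: `random_split` and `auc` are hypothetical helper names, labels are coded 1 (AML) and 0 (healthy), and the area under the ROC curve is computed per split via the rank statistic. The threshold averaging of ROC curves mentioned on the slide is not implemented here.

```python
import random


def random_split(n_samples, train_fraction=5 / 6, rng=None):
    """Shuffle sample indices and split them into training and validation parts."""
    rng = rng or random.Random()
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_train = round(train_fraction * n_samples)
    return indices[:n_train], indices[n_train:]


def auc(scores, labels):
    """Area under the ROC curve: fraction of (positive, negative) pairs
    in which the positive example has the higher score; ties count 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# 179 labeled patients in the training set (23 AML + 156 healthy).
train_idx, val_idx = random_split(179, rng=random.Random(0))
print(len(train_idx), len(val_idx))
```

The split is not stratified, so a validation part could in principle contain no AML patients, in which case `auc` is undefined; a stratified split would avoid this.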
validation
- 5/6 of data for training, 1/6 for validation
- ROC, threshold-average over 50 random splits
- note: patient 116 consistently misclassified
[Figure: ROC curve, true positive rate vs. false positive rate]
validation
[Figure: training set errors and validation set errors; marked: patient “116” (AML)]
visualization
[Figure: scatter plot of the data; axis label: projection on first eigenvector of Λ; marked: prototypes, patient 116]
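A projection on the first eigenvector of Λ can be obtained without a linear algebra library by power iteration, since Λ = Ω^T Ω is symmetric and positive semi-definite. The sketch below is an assumption about how such a projection could be computed, not the code behind the figure; all function names are hypothetical.

```python
import math


def relevance_matrix(omega):
    """Lambda = Omega^T Omega for a matrix given as a list of rows."""
    n = len(omega[0])
    return [[sum(row[i] * row[j] for row in omega) for j in range(n)]
            for i in range(n)]


def leading_eigenvector(lam, iterations=500):
    """Power iteration for the eigenvector of the largest eigenvalue of Lambda."""
    n = len(lam)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        u = [sum(lam[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(c * c for c in u))
        if norm == 0.0:
            break
        v = [c / norm for c in u]
    return v


def project(v, x):
    """Scalar projection of feature vector x on direction v."""
    return sum(vi * xi for vi, xi in zip(v, x))


lam = relevance_matrix([[2.0, 0.0], [0.0, 1.0]])
v1 = leading_eigenvector(lam)
print(project(v1, [3.0, 4.0]))
```

Power iteration fails if the start vector happens to be orthogonal to the leading eigenvector and converges slowly when the two largest eigenvalues are close; a library eigensolver is the more robust choice in practice.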
prediction: 180 test set patients
[Figure: scatter plot; axis label: projection on first eigenvector of Λ; marked: test set, prototypes]
prediction: 180 test set patients
$$ 0 \le s = \frac{1}{2} \left[ \frac{d(w_1, x) - d(w_2, x)}{d(w_1, x) + d(w_2, x)} + 1 \right] \le 1 $$
“AML – score”
20 AML cases!
perfect test set prediction
e.g. AUROC = 1
(achieved by 8 teams!)
Note: GMLVQ scores are
not directly interpretable
as “certainties” or
probabilistic assignments
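Reading the score on this slide as s = 1/2 [ (d(w_1, x) − d(w_2, x)) / (d(w_1, x) + d(w_2, x)) + 1 ], it can be computed directly from the two prototype distances. In this sketch the function name is hypothetical, and it is an assumption that w_1 is the healthy prototype and w_2 the AML prototype, so that samples close to the AML prototype receive a score near 1.

```python
def aml_score(d1, d2):
    """s = 1/2 [ (d1 - d2) / (d1 + d2) + 1 ], which lies in [0, 1].

    d1: distance to prototype w_1 (assumed: healthy),
    d2: distance to prototype w_2 (assumed: AML).
    """
    return 0.5 * ((d1 - d2) / (d1 + d2) + 1.0)


print(aml_score(3.0, 1.0))
```

Algebraically the expression reduces to d1 / (d1 + d2), which makes the range [0, 1] evident for non-negative distances.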
prototypes
difference vector: “AML - healthy” prototype
here: components corresponding to mean values
relevances
[Figure: relevance of markers ← diagonal elements of Λ; in detail: iqr, median, kurtosis, skewness, std. dev., mean]
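The diagonal elements of Λ = Ω^T Ω follow directly from Ω as Λ_ii = Σ_k Ω_ki². The sketch below computes them and, as an illustrative assumption only, sums them in groups of six to obtain one value per marker; this presumes the six features of each marker are stored next to each other in the feature vector, which the slides do not state. Function names are hypothetical.

```python
def relevances(omega):
    """Diagonal of Lambda = Omega^T Omega: Lambda_ii = sum_k Omega_ki^2."""
    n = len(omega[0])
    return [sum(row[i] ** 2 for row in omega) for i in range(n)]


def marker_relevances(diagonal, features_per_marker=6):
    """Sum feature relevances per marker (assumes features grouped by marker)."""
    return [sum(diagonal[i:i + features_per_marker])
            for i in range(0, len(diagonal), features_per_marker)]


print(relevances([[1.0, 2.0], [3.0, 4.0]]))
```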
relevances
[Figure: relevance of markers; in detail: iqr, median, kurtosis, skewness, std. dev., mean; label: SS log]
scores, certainties, ranking ?
$$ 0 \le s = \frac{1}{2} \left[ \frac{d(w_1, x) - d(w_2, x)}{d(w_1, x) + d(w_2, x)} + 1 \right] \le 1 $$
“AML – score”
20 AML cases!
perfect test set prediction
e.g. AUC =1 (ROC)
comparison:
scores vs. ground truth (?) :
Pearson-correlation: 0.9703
sum of |differences|: 3.8455
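The two comparison measures named here, Pearson correlation and sum of absolute differences between scores and ground truth, can be sketched as below. It is an assumption that the ground truth is coded as 0 (healthy) and 1 (AML); the function names and the toy numbers are illustrative and are not the challenge data.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def sum_abs_differences(a, b):
    """Sum of |a_i - b_i| over all samples."""
    return sum(abs(x - y) for x, y in zip(a, b))


toy_scores = [0.1, 0.2, 0.9]   # hypothetical scores
toy_truth = [0, 0, 1]          # hypothetical labels
print(pearson(toy_scores, toy_truth), sum_abs_differences(toy_scores, toy_truth))
```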
scores, certainties, ranking ?
$$ 0 \le \frac{\tanh(3 s)}{\tanh(3)} \le 1 $$
“transformed AML – score”
20 AML cases!
perfect test set prediction
e.g. AUC = 1 (ROC)
comparison:
scores vs. ground truth:
Pearson-correlation: 0.9820
sum of |differences|: 4.4347
for comparison, untransformed score s:
Pearson-correlation: 0.9703
sum of |differences|: 3.8455
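Reading the transformation on this slide as tanh(3 s) / tanh(3), a minimal sketch is given below. The function name and the `steepness` parameter are hypothetical; the value 3 is taken from the slide. Because the mapping is strictly increasing, it changes score values but not the ranking of patients, so the ROC and AUC are unaffected.

```python
import math


def transformed_score(s, steepness=3.0):
    """Map s in [0, 1] to tanh(steepness * s) / tanh(steepness), also in [0, 1]."""
    return math.tanh(steepness * s) / math.tanh(steepness)


print(transformed_score(0.5))
```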
summary
feature vectors:
moment based characteristics of flow cytometry data
[mean, standard deviation, skewness, kurtosis, median, iqr ]
Matrix Relevance Learning Vector Quantization
- perfect classification with respect to training and test set
(e.g. AUC(roc)=1)
- weighting of features (pairs of features) according to
their relevance in the classification
- visualization of the data set
- identification of outliers (“116” ?)
outlook
selection of reduced feature set:
relevance matrix results suggest a selection of
protein markers and/or specific features
identification / diagnosis of AML subtypes
- AML subtypes to be identified by specific marker profiles
- machine learning approach requires larger data sets, e.g.
GMLVQ with several prototypes representing AML
- back to gating – selection of cells for differential diagnosis?
direct classification of histograms
non-Euclidean, histogram-specific distance measures
e.g. Divergence-based LVQ [Mwebaze et al., 2010]
references (www.cs.rug.nl/~biehl)
The method (GMLVQ):
P. Schneider, M. Biehl, B. Hammer, Adaptive relevance matrices in learning vector quantization. Neural Computation 21: 3532-3561 (2009)
A recent application in tumor classification:
W. Arlt, M. Biehl, A.E. Taylor et al., Urine Steroid Metabolomics as a Biomarker Tool for Detecting Malignancy in Patients with Adrenal Tumors. J Clinical Endocrinology & Metabolism, in press (2011)
Thanks