DREAM 6 / FlowCAP 2 Challenge: Molecular Classification of Acute Myeloid Leukaemia

Michael Biehl, Kerstin Bunte, Petra Schneider

Johann Bernoulli Institute for Mathematics and Computer Science, University of Groningen, The Netherlands

Centre for Diabetes, Endocrinology & Metabolism, School of Clinical & Experimental Medicine, University of Birmingham, UK

Team Admire-LVQ: Adaptive Distance Measures In Relevance Learning Vector Quantization

DREAM6/FlowCAP2 challenge 2011

The DREAM project [www.the-dream-project.org]

Dialogue for Reverse Engineering Assessments and Methods

Organizers: Gustavo Stolovitzky, Robert Prill, Raquel Norel, Pablo Meyer (IBM Computational Biology Center); Julio Saez-Rodriguez (European Bioinformatics Institute, EMBL-EBI)

FlowCAP initiative [http://flowcap.flowsite.org]

Flow Cytometry: Critical Assessment of Population Identification Methods

Organizers: Ryan Brinkman (British Columbia Cancer Agency); Raphael Gottardo (Fred Hutchinson Cancer Research Center); Tim Mosmann (University of Rochester); Richard H. Scheuermann (University of Texas Southwestern Medical Center)

flow cytometry

preprocessing

cell size, granularity, + 26 protein markers

(tens of) thousands of events per marker

training set: 23 AML patients, 156 healthy donors

test set: 180 unlabeled patients

Wade Rogers, U. of Pennsylvania

peripheral blood / bone marrow aspirate

fluorophore-conjugated antibodies for specific proteins

© www.the-dream-project.org

list of markers

1 FS lin (~ cell size)

2 SS log (~ granularity)

3 CD45 (protein marker)

(1-3: measured in all cells)

four diff. features

possible workflow:

- selection of cells, based on e.g. FS Lin, SS Log, CD45

- inspection of all markers only for selected cells, e.g. differential diagnosis (subtypes)

here: classification based on entire cell population and all markers

- target diagnosis: AML patient / healthy donor, unspecific with respect to types of AML

- consideration of frequencies / histograms only, information about single cells disregarded

class-conditional mean histograms

[Figure: class-conditional mean histograms for healthy donors and AML patients]

suggested set of features

(1) mean (2) standard deviation (3) skewness (4) kurtosis (5) median (6) interquartile range
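The six suggested features can be sketched in a few lines of standard-library Python. This is only an illustration: the population (divide-by-n) form of the moments, the non-excess kurtosis, the "inclusive" quartile rule and the helper names `summary_features` and `feature_vector` are assumptions, not details taken from the slides.

```python
import math
import statistics

def summary_features(values):
    """Return [mean, std. dev., skewness, kurtosis, median, iqr] for one marker."""
    n = len(values)
    mean = sum(values) / n
    # central moments in population form (divided by n); an assumed convention
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    std = math.sqrt(m2)
    skewness = m3 / std ** 3 if std > 0 else 0.0
    kurtosis = m4 / m2 ** 2 if m2 > 0 else 0.0  # non-excess: 3 for a Gaussian
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [mean, std, skewness, kurtosis, statistics.median(values), q3 - q1]

def feature_vector(sample):
    """Concatenate the six statistics of every marker of one patient.

    `sample` (hypothetical layout) maps a marker name to its list of event values.
    """
    vector = []
    for marker in sorted(sample):  # fixed marker order for all patients
        vector.extend(summary_features(sample[marker]))
    return vector
```

Concatenating the six statistics over all markers yields one fixed-length vector per patient, which is the kind of input the classifier operates on.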


feature vectors (186-dim.)

[Figure: mean feature vectors of healthy donors and of AML patients]

matrix relevance LVQ

simplest setting: 1 prototype per class, healthy donors / AML patients

vectors w in 186-dim. feature space

nearest prototype classifier according to adaptive distance measure

$$ d(w, x) = (w - x)^\top \Lambda \, (w - x) = [\Omega (w - x)]^2 \quad \text{with} \quad \Lambda = \Omega^\top \Omega \quad (186 \times 186) $$

Training:

∙ cost function based Generalized Matrix LVQ (GMLVQ)

$$ E = \sum_m \frac{d(w_J, x_m) - d(w_K, x_m)}{d(w_J, x_m) + d(w_K, x_m)}, \qquad w_J: \text{correct prototype}, \quad w_K: \text{wrong prototype} $$

∙ gradient based optimization of E (prototypes and matrix Ω)
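The adaptive distance, the nearest prototype rule, the cost E and one stochastic gradient step can be sketched as follows. This is a minimal illustration with one prototype per class; the function names, the learning rates `eta_w` and `eta_m`, and the normalization of Ω after each step (keeping the trace of Λ at 1) are assumptions chosen for the example, not settings reported on the slides.

```python
import math

def mat_vec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def distance(omega, w, x):
    """d(w, x) = [Omega (w - x)]^2 = (w - x)^T Lambda (w - x)."""
    diff = [wi - xi for wi, xi in zip(w, x)]
    return sum(c * c for c in mat_vec(omega, diff))

def classify(omega, prototypes, x):
    """Nearest prototype classifier: label of the closest prototype."""
    return min(prototypes, key=lambda c: distance(omega, prototypes[c], x))

def closest_wrong(omega, prototypes, x, label):
    wrong = [c for c in prototypes if c != label]
    return min(wrong, key=lambda c: distance(omega, prototypes[c], x))

def cost(omega, prototypes, data):
    """E = sum_m (d_J - d_K) / (d_J + d_K) over (x, label) pairs."""
    total = 0.0
    for x, label in data:
        d_j = distance(omega, prototypes[label], x)
        d_k = distance(omega, prototypes[closest_wrong(omega, prototypes, x, label)], x)
        total += (d_j - d_k) / (d_j + d_k)
    return total

def sgd_step(omega, prototypes, x, label, eta_w=0.1, eta_m=0.01):
    """One gradient step on the cost term of a single example (assumed rates)."""
    wrong = closest_wrong(omega, prototypes, x, label)
    d_j = distance(omega, prototypes[label], x)
    d_k = distance(omega, prototypes[wrong], x)
    denom = (d_j + d_k) ** 2
    # derivatives of (d_J - d_K)/(d_J + d_K) with respect to d_J and d_K
    factors = {label: 2.0 * d_k / denom, wrong: -2.0 * d_j / denom}
    new_protos = dict(prototypes)
    new_omega = [row[:] for row in omega]
    for c, g in factors.items():
        diff = [wi - xi for wi, xi in zip(prototypes[c], x)]
        o_diff = mat_vec(omega, diff)  # Omega (w - x)
        # dd/dw = 2 Omega^T Omega (w - x)
        grad_w = [2.0 * sum(omega[i][j] * o_diff[i] for i in range(len(omega)))
                  for j in range(len(diff))]
        new_protos[c] = [wi - eta_w * g * gw for wi, gw in zip(prototypes[c], grad_w)]
        # dd/dOmega = 2 [Omega (w - x)] (w - x)^T
        for i in range(len(omega)):
            for j in range(len(diff)):
                new_omega[i][j] -= eta_m * g * 2.0 * o_diff[i] * diff[j]
    # rescale so that sum_i Lambda_ii = sum_ij Omega_ij^2 = 1
    norm = math.sqrt(sum(v * v for row in new_omega for v in row))
    return [[v / norm for v in row] for row in new_omega], new_protos
```

Each cost term is a ratio of distances, so rescaling Ω leaves E unchanged; the normalization only fixes the overall scale of Λ.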

validation

- 5/6 of data for training, 1/6 for validation

- ROC, threshold-average over 50 random splits

[Figure: ROC curves, true positive rate vs. false positive rate; panels: FS Lin, SS Log, CD45 / all markers]
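The validation scheme (random 5/6 : 1/6 splits, ROC averaged per threshold) could look roughly like the sketch below. The interface is hypothetical: `fit(train_x, train_y)` stands for any training routine that returns a scoring function, labels are assumed to be 1 for AML and 0 for healthy, and splits whose validation part contains only one class are skipped, which the slides do not specify.

```python
import random

def roc_point(scores, labels, threshold):
    """(false positive rate, true positive rate); score >= threshold predicts AML (1)."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    pos = sum(labels)
    neg = len(labels) - pos
    return fp / neg, tp / pos

def random_split(n, rng, train_fraction=5 / 6):
    """Shuffle the indices 0..n-1 and cut them into training and validation part."""
    idx = list(range(n))
    rng.shuffle(idx)
    cut = round(train_fraction * n)
    return idx[:cut], idx[cut:]

def threshold_averaged_roc(samples, labels, fit, thresholds, n_splits=50, seed=0):
    """Average FPR and TPR per threshold over random train/validation splits."""
    rng = random.Random(seed)
    sums = [[0.0, 0.0] for _ in thresholds]
    done = 0
    while done < n_splits:
        train, val = random_split(len(samples), rng)
        val_y = [labels[i] for i in val]
        if len(set(val_y)) < 2:
            continue  # assumed rule: both classes must occur in the validation part
        score = fit([samples[i] for i in train], [labels[i] for i in train])
        val_s = [score(samples[i]) for i in val]
        for k, t in enumerate(thresholds):
            fpr, tpr = roc_point(val_s, val_y, t)
            sums[k][0] += fpr
            sums[k][1] += tpr
        done += 1
    return [(f / n_splits, t / n_splits) for f, t in sums]
```

The data set must contain both classes, otherwise the loop never finds a usable split.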

- note: patient 116 consistently misclassified

[Figure: ROC curve, true positive rate vs. false positive rate]

[Figure: training set errors and validation set errors; marked: patient “116” (AML)]

visualization

[Figure: data projected on first eigenvector of Λ; marked: prototypes, patient 116]
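A projection of this kind can be sketched without external libraries. The example below forms Λ = ΩᵀΩ and approximates its leading eigenvector by power iteration; the iteration count, the random start vector and the function names are assumptions made for the illustration.

```python
import math
import random

def relevance_matrix(omega):
    """Lambda = Omega^T Omega."""
    n = len(omega[0])
    return [[sum(row[i] * row[j] for row in omega) for j in range(n)]
            for i in range(n)]

def leading_eigenvector(lam, iterations=200, seed=0):
    """Power iteration for a symmetric positive semi-definite matrix."""
    rng = random.Random(seed)
    n = len(lam)
    v = [rng.random() + 0.1 for _ in range(n)]  # positive random start vector
    for _ in range(iterations):
        u = [sum(lam[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(c * c for c in u))
        if norm == 0.0:
            break  # Lambda maps v to zero; keep the previous estimate
        v = [c / norm for c in u]
    return v

def project(x, direction):
    """Coordinate of x along the given unit vector."""
    return sum(a * b for a, b in zip(x, direction))
```

Feature vectors and prototypes are projected in the same way, so both can be shown in one plot.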

prediction: 180 test set patients

[Figure: test set projected on first eigenvector of Λ; marked: prototypes]

“AML – score”:

$$ 0 \le s = \frac{1}{2} \left[ \frac{d(w_1, x) - d(w_2, x)}{d(w_1, x) + d(w_2, x)} + 1 \right] \le 1 $$

20 AML cases!

perfect test set prediction, e.g. AUROC = 1 (achieved by 8 teams!)

Note: GMLVQ scores are not directly interpretable as “certainties” or probabilistic assignments
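As a small illustration, the score and a rank-based AUROC can be written as below. The assignment of the two prototypes (w_1 healthy, w_2 AML, so that a small distance to the AML prototype gives a score near 1) is an assumption, as are the function names and the 0/1 label coding.

```python
def aml_score(d_healthy, d_aml):
    """s = 1/2 [ (d1 - d2) / (d1 + d2) + 1 ], assuming w_1 = healthy, w_2 = AML."""
    return 0.5 * ((d_healthy - d_aml) / (d_healthy + d_aml) + 1.0)

def auroc(scores, labels):
    """Fraction of (AML, healthy) pairs ranked correctly; ties count one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The score equals d_healthy / (d_healthy + d_aml), so it depends only on the ratio of the two distances. An AUROC of 1 means that every AML case scores higher than every healthy donor, which says nothing about how close the scores are to 0 or 1.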

prototypes

[Figure: difference vector “AML - healthy” prototype; here: components corresponding to mean values]

relevances

relevance of markers (← diagonal elements of Λ)

in detail: mean, std. dev., skewness, kurtosis, median, iqr

[Figure: relevance of markers and relevance per feature; highlighted: SS log]
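Reading relevances off the diagonal of Λ can be sketched as follows. The grouping into markers assumes that the feature vector is ordered marker by marker with a fixed number of statistics each; the slides do not state the actual ordering, so `marker_relevances` is only a hypothetical helper.

```python
def diagonal_relevances(omega):
    """Lambda_ii = sum_k Omega_ki^2, the diagonal of Lambda = Omega^T Omega."""
    n = len(omega[0])
    return [sum(row[i] ** 2 for row in omega) for i in range(n)]

def marker_relevances(omega, n_features=6):
    """Sum the diagonal relevances of the statistics that belong to each marker.

    Assumes consecutive blocks of `n_features` entries per marker.
    """
    diag = diagonal_relevances(omega)
    return [sum(diag[i:i + n_features]) for i in range(0, len(diag), n_features)]
```

With Ω normalized so that the diagonal of Λ sums to 1, the returned values can be read as shares of the total relevance.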

scores, certainties, ranking ?

“AML – score”:

$$ 0 \le s = \frac{1}{2} \left[ \frac{d(w_1, x) - d(w_2, x)}{d(w_1, x) + d(w_2, x)} + 1 \right] \le 1 $$

20 AML cases! e.g. AUC = 1 (ROC)

comparison: scores vs. ground truth (?):

Pearson-correlation: 0.9703

sum of |differences|: 3.8455
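The two comparison measures are straightforward to compute. The sketch below assumes that the ground truth is coded as 1 for AML and 0 for healthy, which the slides leave open; the toy values in the usage lines are made up and unrelated to the challenge data.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def sum_abs_differences(a, b):
    """Sum of |a_i - b_i|."""
    return sum(abs(x - y) for x, y in zip(a, b))

# toy usage with invented scores
truth = [0, 0, 1, 1]
scores = [0.1, 0.2, 0.8, 0.9]
r = pearson(scores, truth)
total = sum_abs_differences(scores, truth)
```

Both measures reward scores that sit close to 0 and 1, whereas the AUC only depends on the ranking.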

“transformed AML – score”:

$$ 0 \le \frac{\tanh(3 s)}{\tanh(3)} \le 1 $$

20 AML cases! e.g. AUC = 1 (ROC)

comparison: scores vs. ground truth:
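The transformation of the score is a one-liner; the sketch below only makes its properties explicit. The parameter name `steepness` is introduced for illustration, with the value 3 taken from the slide.

```python
import math

def transformed_score(s, steepness=3.0):
    """tanh(steepness * s) / tanh(steepness), mapping [0, 1] onto [0, 1]."""
    return math.tanh(steepness * s) / math.tanh(steepness)
```

The mapping is strictly increasing, so the ranking of the patients and therefore the AUC stay the same; only the spread of the scores between 0 and 1 changes.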



summary

feature vectors: moment based characteristics of flow cytometry data
[mean, standard deviation, skewness, kurtosis, median, iqr]

Matrix Relevance Learning Vector Quantization

- perfect classification with respect to training and test set (e.g. AUC (ROC) = 1)

- weighting of features (pairs of features) according to their relevance in the classification

- visualization of the data set

- identification of outliers (“116” ?)

outlook

selection of reduced feature set:

- relevance matrix results suggest a selection of protein markers and/or specific features

identification / diagnosis of AML subtypes:

- AML subtypes to be identified by specific marker profiles

- machine learning approach requires larger data sets, e.g. GMLVQ with several prototypes representing AML

- back to gating – selection of cells for differential diagnosis?

direct classification of histograms:

- non-Euclidean, histogram-specific distance measures, e.g. Divergence-based LVQ [Mwebaze et al., 2010]
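To give an idea of what a histogram-specific dissimilarity looks like, the sketch below computes the Cauchy-Schwarz divergence between two normalized histograms. The choice of this particular divergence is an assumption made for illustration; the slides do not say which measure would be used.

```python
import math

def normalize(hist):
    """Scale non-negative bin counts so that they sum to 1."""
    total = sum(hist)
    return [h / total for h in hist]

def cauchy_schwarz_divergence(p, q):
    """D(p, q) = 1/2 * log( (p.p)(q.q) / (p.q)^2 ) for two histograms."""
    p, q = normalize(p), normalize(q)
    pq = sum(a * b for a, b in zip(p, q))
    if pq == 0.0:
        return math.inf  # histograms without any common occupied bin
    pp = sum(a * a for a in p)
    qq = sum(b * b for b in q)
    return 0.5 * math.log(pp * qq / (pq * pq))
```

The measure is symmetric and vanishes for histograms of identical shape, so in an LVQ scheme it could take the place of the quadratic distance between a histogram and a prototype.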

references (www.cs.rug.nl/~biehl)

The method (GMLVQ):

P. Schneider, M. Biehl, B. Hammer, Adaptive relevance matrices in learning vector quantization, Neural Computation 21: 3532-3561 (2009)

A recent application in tumor classification:

W. Arlt, M. Biehl, A.E. Taylor et al., Urine Steroid Metabolomics as a Biomarker Tool for Detecting Malignancy in Patients with Adrenal Tumors, J Clinical Endocrinology & Metabolism, in press (2011)

thanks
