
Page 1

Michael Biehl

Kerstin Bunte

Petra Schneider

DREAM 6 / FlowCAP 2 Challenge: Molecular Classification of Acute Myeloid Leukaemia

Johann Bernoulli Institute for Mathematics and Computer Science, University of Groningen, The Netherlands

(1) Centre for Diabetes, Endocrinology & Metabolism, School of Clinical & Experimental Medicine, University of Birmingham, UK

Team Admire-LVQ: Adaptive Distance Measures In Relevance Learning Vector Quantization

Page 2

Page 3

DREAM6/FlowCAP2 challenge 2011

The DREAM project [www.the-dream-project.org]

Dialogue for Reverse Engineering Assessments and Methods

Organizers: Gustavo Stolovitzky, Robert Prill, Raquel Norel, Pablo Meyer, IBM Computational Biology Center; Julio Saez-Rodriguez, European Bioinformatics Institute (EMBL-EBI)

FlowCAP initiative [http://flowcap.flowsite.org]

Flow Cytometry: Critical Assessment of Population Identification Methods

Organizers: Ryan Brinkman, British Columbia Cancer Agency; Raphael Gottardo, Fred Hutchinson Cancer Research Center; Tim Mosmann, University of Rochester; Richard H. Scheuermann, University of Texas Southwestern Medical Center

Page 4

flow cytometry

peripheral blood / bone marrow aspirate

fluorophore-conjugated antibodies for specific proteins

cell size, granularity, + 26 protein markers

(tens of) thousands of events per marker

preprocessing

training set: 23 AML patients, 156 healthy donors

test set: 180 unlabeled patients

Wade Rogers, U. of Pennsylvania

© www.the-dream-project.org

Page 5

list of markers

1 FS Lin (~ cell size)

2 SS Log (~ granularity)

3 CD45 (protein marker)

(markers 1-3: measured in all cells)

four diff. features

© www.the-dream-project.org

Page 6

list of markers

possible workflow:

- selection of cells, based on e.g. FS Lin, SS Log, CD45

- inspection of all markers only for selected cells, e.g. differential diagnosis (subtypes)

here: classification based on entire cell population and all markers

- target diagnosis: AML patient / healthy donor, unspecific with respect to types of AML

- consideration of frequencies / histograms only

- information about single cells disregarded

Page 7

class-conditional mean histograms

[figure: mean histograms for healthy donors and for AML patients]

suggested set of features

(1) mean (2) standard deviation (3) skewness

(4) kurtosis (5) median (6) interquartile range
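
The six suggested features can be sketched per marker as follows. This is a minimal illustration, not the team's code: the function names, the synthetic data and the marker count of 31 are assumptions (31 is inferred only from 186 = 6 × 31 on the feature-vector slide), and the source does not say whether kurtosis was used in its excess form or which standard deviation estimator was applied.

```python
import numpy as np

def summary_features(events):
    """Six summary statistics of one marker's event intensities (1-D array)."""
    x = np.asarray(events, dtype=float)
    mean = x.mean()
    std = x.std()                        # population standard deviation (assumption)
    z = (x - mean) / std
    skewness = np.mean(z ** 3)
    kurtosis = np.mean(z ** 4)           # non-excess kurtosis (assumption)
    median = np.median(x)
    q75, q25 = np.percentile(x, [75, 25])
    return np.array([mean, std, skewness, kurtosis, median, q75 - q25])

def patient_feature_vector(markers):
    """Concatenate the six statistics of every marker into one vector."""
    return np.concatenate([summary_features(m) for m in markers])

# Synthetic stand-in for one patient: 31 markers, 10,000 events each (assumed sizes).
rng = np.random.default_rng(0)
markers = [rng.normal(loc=i, scale=1.0, size=10_000) for i in range(31)]
features = patient_feature_vector(markers)
print(features.shape)   # (186,)
```

Each patient is thereby reduced to one fixed-length vector, regardless of how many events were recorded.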

Page 9

feature vectors (186-dim.)

healthy donors (mean)

AML patients (mean)

Page 10

matrix relevance LVQ

∙ cost function based Generalized Matrix LVQ (GMLVQ)

simplest setting: 1 prototype per class, healthy donors / AML patients

vectors w in 186-dim. feature space

nearest prototype classifier according to adaptive distance measure

$d(w, x) = (x - w)^\top \Lambda \, (x - w) = [\Omega (x - w)]^2$ with $\Lambda = \Omega^\top \Omega$ (186 × 186)

Training:

$E = \sum_m \frac{d(w_J, x_m) - d(w_K, x_m)}{d(w_J, x_m) + d(w_K, x_m)}$

$w_J$: correct prototype, $w_K$: wrong prototype

∙ gradient based optimization of E (prototypes and matrix Ω)
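
The adaptive distance, the nearest prototype rule and the cost function E can be sketched as below. This is only an illustration under stated assumptions: all function names and the toy data are hypothetical, prototypes are set to class means rather than trained, and the gradient based optimization of prototypes and Ω is not shown.

```python
import numpy as np

def gmlvq_distance(w, x, omega):
    """d(w, x) = (x - w)^T Lambda (x - w) with Lambda = Omega^T Omega."""
    diff = omega @ (x - w)
    return float(diff @ diff)

def classify(x, prototypes, labels, omega):
    """Nearest prototype classification under the adaptive distance."""
    dists = [gmlvq_distance(w, x, omega) for w in prototypes]
    return labels[int(np.argmin(dists))]

def gmlvq_cost(X, y, prototypes, labels, omega):
    """E = sum_m (d_J - d_K) / (d_J + d_K); d_J: closest correct, d_K: closest wrong prototype."""
    total = 0.0
    for x, label in zip(X, y):
        d = np.array([gmlvq_distance(w, x, omega) for w in prototypes])
        correct = np.array([l == label for l in labels])
        d_J, d_K = d[correct].min(), d[~correct].min()
        total += (d_J - d_K) / (d_J + d_K)
    return total

# Toy data in 4 dimensions instead of 186 (illustrative values only).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (20, 4)), rng.normal(3.0, 1.0, (20, 4))])
y = np.array([0] * 20 + [1] * 20)
prototypes = np.array([X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)])
labels = [0, 1]
omega = np.eye(4) / 2.0   # arbitrary initial Omega (assumption), trace(Lambda) = 1

print(classify(X[0], prototypes, labels, omega))
print(gmlvq_cost(X, y, prototypes, labels, omega))
```

Each term of the cost lies between -1 and 1 and is negative exactly when the sample is closer to a prototype of its own class, so lower values of E correspond to better separation.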

Page 11

validation

- 5/6 of data for training, 1/6 for validation

- ROC, threshold-average over 50 random splits

[figure: ROC curves, true positive rate vs. false positive rate; panels: "FS Lin, SS Log, CD45" and "all markers"]
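
The validation scheme (random 5/6 vs. 1/6 splits, repeated 50 times) can be sketched as follows. The sketch is a simplification under stated assumptions: it averages the area under the ROC curve per split instead of threshold-averaging the ROC curves, it uses a Euclidean class-mean scorer as a stand-in for a trained GMLVQ system, and all names and the synthetic data are hypothetical.

```python
import numpy as np

def auc(scores, labels):
    """Area under the ROC curve: fraction of (positive, negative) pairs ranked correctly."""
    pos = scores[labels == 1][:, None]
    neg = scores[labels == 0][None, :]
    return float(np.mean((pos > neg) + 0.5 * (pos == neg)))

def class_mean_score(X_train, y_train, X_val):
    """Stand-in scorer (not GMLVQ): relative Euclidean distance to the two class means."""
    w0 = X_train[y_train == 0].mean(axis=0)
    w1 = X_train[y_train == 1].mean(axis=0)
    d0 = ((X_val - w0) ** 2).sum(axis=1)
    d1 = ((X_val - w1) ** 2).sum(axis=1)
    return 0.5 * (1.0 + (d0 - d1) / (d0 + d1))

def random_split_auc(X, y, n_splits=50, val_fraction=1 / 6, seed=0):
    """Mean validation AUC over random train/validation splits."""
    rng = np.random.default_rng(seed)
    n_val = int(round(val_fraction * len(y)))
    aucs = []
    while len(aucs) < n_splits:
        perm = rng.permutation(len(y))
        val, train = perm[:n_val], perm[n_val:]
        if len(np.unique(y[val])) < 2 or len(np.unique(y[train])) < 2:
            continue   # redraw splits that miss one of the classes
        aucs.append(auc(class_mean_score(X[train], y[train], X[val]), y[val]))
    return float(np.mean(aucs))

# Synthetic data with the class sizes of the training set (23 AML, 156 healthy).
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 1.0, (156, 5)), rng.normal(1.5, 1.0, (23, 5))])
y = np.array([0] * 156 + [1] * 23)
mean_auc = random_split_auc(X, y)
print(mean_auc)
```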

Page 12

validation

- 5/6 of data for training, 1/6 for validation

- ROC, threshold-average over 50 random splits

- note: patient 116 consistently misclassified

[figure: ROC curve, true positive rate vs. false positive rate]

Page 13

validation

[figure: training set errors and validation set errors; marked: patient “116” (AML)]

Page 14

visualization

[figure: projection of the data; both axes labelled "projection on first eigenvector of Λ"; marked: prototypes, patient 116]
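
A projection of this kind can be sketched by diagonalizing Λ and mapping the feature vectors onto its leading eigenvectors. This is a hedged illustration: the function name and the random stand-in for a trained Ω are hypothetical, and using the second eigenvector for a second axis is an assumption, since the slide labels only the first eigenvector.

```python
import numpy as np

def leading_eigenvectors(lam, k=2):
    """Eigenvectors of the symmetric matrix Lambda, sorted by decreasing eigenvalue."""
    eigvals, eigvecs = np.linalg.eigh(lam)     # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:k]
    return eigvals[order], eigvecs[:, order]

rng = np.random.default_rng(3)
omega = rng.normal(size=(6, 6))                # stand-in for a trained Omega
lam = omega.T @ omega
X = rng.normal(size=(10, 6))                   # stand-in feature vectors

eigvals, eigvecs = leading_eigenvectors(lam, k=2)
projections = X @ eigvecs                      # one row of 2-D coordinates per sample
print(projections.shape)                       # (10, 2)
```

Prototypes can be projected with the same matrix product, which places them in the same plot as the data.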

Page 15

prediction: 180 test set patients

[figure: projection of the test set; both axes labelled "projection on first eigenvector of Λ"; marked: prototypes, test set]

Page 16

prediction: 180 test set patients

“AML-score”:

$0 \le s = \frac{1}{2} \left[ 1 + \frac{d(w_1, x) - d(w_2, x)}{d(w_1, x) + d(w_2, x)} \right] \le 1$

20 AML cases!

perfect test set prediction, e.g. AUROC = 1 (achieved by 8 teams!)

Note: GMLVQ scores are not directly interpretable as “certainties” or probabilistic assignments
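
A score of this form can be sketched as a relative distance difference between the two prototypes. The following is an illustration only: the function name is hypothetical, and the assignment of w_1 to the healthy prototype and w_2 to the AML prototype, as well as the sign convention, are assumptions chosen so that the score grows towards 1 near the AML prototype.

```python
import numpy as np

def aml_score(x, w_healthy, w_aml, omega):
    """s = 1/2 * (1 + (d1 - d2) / (d1 + d2)), d1: distance to healthy, d2: to AML prototype."""
    d1 = np.sum((omega @ (x - w_healthy)) ** 2)
    d2 = np.sum((omega @ (x - w_aml)) ** 2)
    return 0.5 * (1.0 + (d1 - d2) / (d1 + d2))

w_healthy = np.array([0.0, 0.0])
w_aml = np.array([2.0, 0.0])
omega = np.eye(2)

print(aml_score(w_healthy, w_healthy, w_aml, omega))             # 0.0
print(aml_score(w_aml, w_healthy, w_aml, omega))                 # 1.0
print(aml_score(np.array([1.0, 5.0]), w_healthy, w_aml, omega))  # 0.5
```

The value 0.5 marks the decision boundary, where a sample is equidistant from both prototypes. It says nothing about probability, in line with the note on the slide.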

Page 17

difference vector “AML - healthy” prototype

here: components corresponding to mean values

prototypes

Page 18

relevances

relevance of markers:

← diagonal elements of Λ

in detail:

[figure: relevance per marker for each statistic; rows: iqr, median, kurtosis, skewness, std. dev., mean]
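
Reading relevances off the diagonal of Λ can be sketched as below. The layout of the 186-dimensional feature vector (six consecutive statistics per marker, 31 markers), the trace normalization and the random stand-in for a trained Ω are all assumptions made for illustration; the function name is hypothetical.

```python
import numpy as np

STATISTICS = ["mean", "std. dev.", "skewness", "kurtosis", "median", "iqr"]

def relevance_table(lam, n_markers, n_stats=len(STATISTICS)):
    """Reshape diag(Lambda) into a (markers x statistics) table."""
    return np.diag(lam).reshape(n_markers, n_stats)

rng = np.random.default_rng(4)
omega = rng.normal(size=(186, 186))            # stand-in for a trained Omega
lam = omega.T @ omega
lam /= np.trace(lam)                           # normalise: relevances sum to 1

table = relevance_table(lam, n_markers=31)
per_marker = table.sum(axis=1)                 # total relevance of each marker
top = int(np.argmax(per_marker))
print(top, table[top])
```

Summing each row gives one relevance per marker, while the individual rows give the breakdown by statistic.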

Page 19

relevances

relevance of markers:

in detail:

[figure: relevance per marker for each statistic; rows: iqr, median, kurtosis, skewness, std. dev., mean; labelled marker: SS log]

Page 20

scores, certainties, ranking ?

“AML-score”:

$0 \le s = \frac{1}{2} \left[ 1 + \frac{d(w_1, x) - d(w_2, x)}{d(w_1, x) + d(w_2, x)} \right] \le 1$

20 AML cases!

perfect test set prediction, e.g. AUC = 1 (ROC)

comparison: scores vs. ground truth (?):

Pearson-correlation: 0.9703

sum of |differences|: 3.8455
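
The two comparison measures can be sketched as follows, assuming the ground truth is coded as 0 (healthy) and 1 (AML). The function name and the five example scores are made up for illustration and are not challenge data.

```python
import numpy as np

def compare_to_ground_truth(scores, truth):
    """Pearson correlation and sum of absolute differences between scores and 0/1 labels."""
    scores = np.asarray(scores, dtype=float)
    truth = np.asarray(truth, dtype=float)
    pearson = np.corrcoef(scores, truth)[0, 1]
    abs_diff = np.abs(scores - truth).sum()
    return float(pearson), float(abs_diff)

# Made-up scores for five patients (not challenge data).
truth = [0, 0, 0, 1, 1]
scores = [0.1, 0.0, 0.2, 0.9, 1.0]
pearson, abs_diff = compare_to_ground_truth(scores, truth)
print(round(pearson, 4), round(abs_diff, 4))
```

Unlike the AUC, both measures depend on the actual score values and not only on their ranking.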

Page 21

scores, certainties, ranking ?

“transformed AML-score”:

$0 \le \frac{\tanh(3 s)}{\tanh(3)} \le 1$

20 AML cases!

perfect test set prediction, e.g. AUC = 1 (ROC)

comparison: scores vs. ground truth:

Pearson-correlation: 0.9820

sum of |differences|: 4.4347

(untransformed score: Pearson-correlation: 0.9703, sum of |differences|: 3.8455)
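
As far as the formula can be read from the slide, the transformation rescales the score s in [0, 1] by tanh(3 s) / tanh(3). A minimal sketch under that reading follows; the function name and the `steepness` parameter are hypothetical.

```python
import numpy as np

def transformed_score(s, steepness=3.0):
    """tanh(steepness * s) / tanh(steepness): maps [0, 1] onto [0, 1], monotonically."""
    return np.tanh(steepness * np.asarray(s, dtype=float)) / np.tanh(steepness)

s = np.linspace(0.0, 1.0, 5)
t = transformed_score(s)
print(t.round(3))
```

Because the map is strictly increasing, it leaves the ranking of patients and therefore the AUC unchanged. Only value-based measures such as the Pearson correlation and the sum of absolute differences are affected.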

Page 22

summary

feature vectors:

moment based characteristics of flow cytometry data

[mean, standard deviation, skewness, kurtosis, median, iqr ]

Matrix Relevance Learning Vector Quantization

- perfect classification with respect to training and test set

(e.g. AUC(roc)=1)

- weighting of features (pairs of features) according to

their relevance in the classification

- visualization of the data set

- identification of outliers (“116” ?)

Page 23

outlook

selection of reduced feature set:

relevance matrix results suggest a selection of

protein markers and/or specific features

identification / diagnosis of AML subtypes

- AML subtypes to be identified by specific marker profiles

- machine learning approach requires larger data sets, e.g.

GMLVQ with several prototypes representing AML

- back to gating – selection of cells for differential diagnosis?

direct classification of histograms

non-Euclidean, histogram-specific distance measures

e.g. Divergence-based LVQ [Mwebaze et al., 2010]

Page 24

references (www.cs.rug.nl/~biehl)

The method (GMLVQ):

P. Schneider, M. Biehl, B. Hammer, Adaptive relevance matrices in learning vector quantization. Neural Computation 21: 3532-3561 (2009)

A recent application in tumor classification:

W. Arlt, M. Biehl, A.E. Taylor et al., Urine Steroid Metabolomics as a Biomarker Tool for Detecting Malignancy in Patients with Adrenal Tumors. J Clinical Endocrinology & Metabolism, in press (2011)

Page 25

Thanks