TRANSCRIPT
-
Machine learning
DUBii - Module - Statistics with R
Jacques van Helden (ORCID 0000-0002-8799-8584)
Institut Français de Bioinformatique (IFB), French node of the European ELIXIR bioinformatics infrastructure
Aix-Marseille Université (AMU), Lab. Theory and Approaches of Genomic Complexity (TAGC)
https://www.france-bioinformatique.fr/ · https://elixir-europe.org/ · https://tagc.univ-amu.fr/
-
Brain-learning exercise: assign individuals to groups based on their features
-
Conceptual illustration with two predictor variables
- In the next slides, we provide a higher-resolution series of plots, each representing a study case.
- Exercise: intuitively assign each individual (black dot) to one of the two groups (A, B).
- At each step, ask yourself the following questions:
  - Which criterion did you use to assign an individual to a group?
  - How confident do you feel about each of your predictions?
  - What is the effect of the respective means?
  - What is the effect of the respective standard deviations?
  - What is the effect of the correlations between the two variables?
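The intuition built by this exercise can be simulated. Below is a minimal sketch (Python is used here for illustration, and all group parameters are invented) that assigns each individual to the group with the nearest centre; moving the centres closer together, or increasing the standard deviations, lowers the proportion of correct assignments.

```python
import math
import random

random.seed(42)

# Hypothetical study case: two groups (A, B), each described by two features
# drawn from normal distributions with group-specific means and standard
# deviations (all parameter values are invented for illustration).
def sample_group(mean, sd, n):
    return [(random.gauss(mean[0], sd[0]), random.gauss(mean[1], sd[1]))
            for _ in range(n)]

centres = {"A": (0.0, 0.0), "B": (4.0, 4.0)}
group_a = sample_group(centres["A"], (1.0, 1.0), 50)
group_b = sample_group(centres["B"], (1.0, 1.0), 50)

def assign(point):
    """Assign a point to the group with the nearest centre (Euclidean)."""
    return min(centres, key=lambda g: math.dist(point, centres[g]))

correct = (sum(assign(p) == "A" for p in group_a)
           + sum(assign(p) == "B" for p in group_b))
print(f"{correct}/100 individuals assigned to their true group")
```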
-
Conceptual illustration with two variables – Study case 1
- Inspect the distribution of points for the two groups of individuals (pink, blue) in the 2-dimensional feature space.
[Scatter plot of the two groups in the (X1, X2) feature plane]
-
Conceptual illustration with two variables – Study case 2
- Effect of the group centre location.
-
Conceptual illustration with two variables – Study case 3
- Effect of the group variance.
-
Conceptual illustration with two variables – Study case 4
- Effect of the group variance.
-
Conceptual illustration with two variables – Study case 5
- Impact of the group-specific variances (heteroscedasticity of the data).
-
Conceptual illustration with two variables – Study case 6
- Impact of the group-specific variances (heteroscedasticity of the data).
-
Conceptual illustration with two variables – Study case 7
- Effect of the covariance between features.
-
Conceptual illustration with two variables – Study case 8
- Effect of the covariance between features.
-
Conceptual illustration with two variables – Study case 9
- Group-specific covariances between features:
  - the two groups have different covariance matrices: the clouds of points are elongated in different directions;
  - how does this difference affect group assignments?
-
Multivariate analysis - Introduction
Statistics Applied to Bioinformatics
-
Multivariate data
- Each row represents one object (also called a unit).
- Each column represents one variable.

               variable 1   variable 2   ...   variable p
individual 1      x11          x12       ...      x1p
individual 2      x21          x22       ...      x2p
individual 3      x31          x32       ...      x3p
...               ...          ...       ...      ...
individual n      xn1          xn2       ...      xnp
-
Multivariate data with an outcome variable
- The outcome variable (also called criterion variable) can be:
  - qualitative (nominal): classes (e.g. cancer type);
  - quantitative (e.g. survival expectation for a cancer patient).

               Predictor variables                        Outcome variable
               variable 1   variable 2   ...   variable p   variable p+1
individual 1      x11          x12       ...      x1p           y1
individual 2      x21          x22       ...      x2p           y2
...               ...          ...       ...      ...           ...
individual n      xn1          xn2       ...      xnp           yn
-
Predictive approaches - Training set
- Training set (individuals 1 ... N_train): predictor variables (variable 1 ... variable p) together with known values of the outcome variable (variable p+1).
- Set to predict (individuals 1 ... N_pred): the same predictor variables, but unknown outcome values (?).
- The training set is used to build a predictive function.
- This function is then used to predict the value of the outcome variable for new objects.
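As a minimal sketch of this two-step workflow (the data values and the choice of a nearest-centroid rule are purely illustrative, not the course's method):

```python
# Build a predictive function from a training set, then apply it to new
# objects. A nearest-centroid rule stands in for the predictive function.
def fit(X, y):
    """Training step: compute one centroid per class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def predict(centroids, x):
    """Prediction step: assign x to the class with the nearest centroid."""
    def sqdist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(centroids, key=lambda lab: sqdist(x, centroids[lab]))

# Training set: predictor variables with known outcome values
X_train = [[1.0, 1.2], [0.8, 1.0], [3.9, 4.1], [4.2, 3.8]]
y_train = ["A", "A", "B", "B"]

model = fit(X_train, y_train)
print(predict(model, [1.1, 0.9]))  # a new object with unknown outcome
```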
-
Evaluation of prediction with a testing set
- Training set (individuals 1 ... n_train): predictor variables and known outcome values y, used to build the predictive function.
- Testing set (individuals 1 ... n_test): predictor variables, the known outcome value y, and the predicted outcome value y', used to evaluate the predictions by comparing y' with y.
- Set to predict (individuals 1 ... n_pred): predictor variables only; the outcome values are unknown (?).
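Once each testing-set individual has both a known outcome y and a predicted outcome y', the evaluation itself is a simple comparison; a sketch with invented values:

```python
# Compare known outcomes (y) with predicted outcomes (y') on a testing set.
y_known = ["ALL", "ALL", "AML", "ALL", "AML"]   # invented known values
y_pred  = ["ALL", "AML", "AML", "ALL", "AML"]   # invented predictions

errors = sum(known != pred for known, pred in zip(y_known, y_pred))
accuracy = 1 - errors / len(y_known)
print(f"misclassified: {errors}/{len(y_known)}, accuracy = {accuracy:.2f}")
```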
-
Flowchart of the approaches in multivariate analysis
- Start from a multivariate table X; the dimensions can be reduced (variable selection, principal component analysis), or the table converted into a distance matrix (multidimensional scaling).
- Is there an outcome variable Y?
  - None: exploratory analysis, i.e. cluster analysis (discovered classes + assignment of individuals), visualisation, graphical representations.
  - Quantitative: regression analysis, i.e. predicted value of a quantitative variable, y_est = f(x).
  - Nominal: supervised classification, i.e. assignment of individuals to predefined classes, g = f(x).
-
Quiz
Check your understanding of the concepts presented in the previous slides by applying them to your own data.
1. Describe in one sentence a typical case of multidimensional data handled in your domain.
2. Explain how you would organise this dataset into a multivariate structure:
   - What would correspond to the individuals?
   - What would correspond to the variables?
   - How many individuals (n) would you have?
   - How many variables (p) would you have?
   - Do you have one or several outcome variable(s)?
   - If so, are they quantitative, qualitative, or both?
3. Based on the conceptual framework defined above, which kinds of approaches would you envisage to extract which kinds of relevant information from these data? Note that several approaches can be combined to address different questions.
-
Historical (vintage) examples
-
Historical example of clustering heat map
- Spellman et al. (1998): systematic detection of genes regulated in a periodic way during the cell cycle.
- Several experiments were regrouped, with various ways of synchronization (elutriation, cdc mutants, ...).
- ~800 genes showing a periodic pattern of expression were selected (by Fourier analysis).
- Figure: time profiles of yeast cells followed during the cell cycle.
- Spellman, P. T., Sherlock, G., Zhang, M. Q., Iyer, V. R., Anders, K., Eisen, M. B., Brown, P. O., Botstein, D. & Futcher, B. (1998). Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell 9, 3273-97.
-
Stress response in yeast
- Gasch et al. (2000) tested the transcriptional response of the yeast genome to:
  - various stress conditions (heat shock, osmotic shock, ...);
  - drugs;
  - alternative carbon sources;
  - ...
- The heatmap shows clusters of genes having similar profiles of response to the different types of stress.
- Gasch, A. P., Spellman, P. T., Kao, C. M., Carmel-Harel, O., Eisen, M. B., Storz, G., Botstein, D. & Brown, P. O. (2000). Genomic expression programs in the response of yeast cells to environmental changes. Mol Biol Cell 11, 4241-57.
-
Cancer types (Golub, 1999)
- Compared the expression profiles of ~7000 human genes in patients suffering from two different cancer types: ALL or AML.
- Selected the 50 genes most correlated with the cancer type.
- Goal: use these genes as molecular signatures for the diagnosis of new patients.
- Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D. & Lander, E. S. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, 531-7.
-
Den Boer et al., 2009: procedure
- Data source: Den Boer et al. (2009). A subtype of childhood acute lymphoblastic leukaemia with poor treatment outcome: a genome-wide classification study. Lancet Oncol 10(2): 125-134.
- Den Boer et al. (2009) use Affymetrix microarrays to characterize the transcriptome of 190 Acute Lymphoblastic Leukemia samples of different types.
- They use these profiles to select "transcriptome signatures" that will serve for diagnostic purposes: assigning new samples to one of the cancer types.
- They apply an elaborate procedure relying on an inner and an outer loop of cross-validation.
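The cross-validation loops rely on repeatedly partitioning the samples into training and testing folds. A minimal sketch of k-fold index splitting (the fold-assignment scheme is illustrative, not the one used in the paper):

```python
def kfold_indices(n, k):
    """Partition sample indices 0..n-1 into k folds; for each fold,
    return (training indices, testing indices)."""
    folds = [list(range(f, n, k)) for f in range(k)]
    return [([i for i in range(n) if i not in set(test)], test)
            for test in folds]

# 190 samples, 10 folds: each sample is used exactly once for testing
splits = kfold_indices(190, 10)
print(len(splits), [len(test) for _, test in splits][:3])
```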
-
Den Boer 2009 - The transcriptomic signature
- Den Boer et al. (2009). A subtype of childhood acute lymphoblastic leukaemia with poor treatment outcome: a genome-wide classification study. Lancet Oncol 10(2): 125-134.
- The training procedure selects 100 genes whose combined expression levels can be used to assign samples to cancer subtypes.
- The heatmaps show that the selected genes are differentially expressed:
  - between subtypes of the training set (left);
  - between subtypes of the testing set (right).
- The heatmap is bi-clustered, in order to identify simultaneously groups of patients (rows) and groups of genes (columns), based on the similarity between expression profiles.
-
Supervised classification
Statistics for bioinformatics
Jacques van Helden
-
Supervised classification - Introduction
- In the previous chapter, we presented the problem of clustering, which consists in grouping objects without any a priori definition of the groups. The group definitions emerge from the clustering itself (class discovery). Clustering is thus unsupervised.
- In some cases, one would like to focus on pre-defined classes:
  - classifying tissues as cancer or non-cancer;
  - classifying tissues between different cancer types;
  - classifying genes according to pre-defined functional classes (e.g. metabolic pathways, different phases of the cell cycle, ...).
- The classifier can be built with a training set, and used later for classifying new objects. This is called supervised classification.
-
Supervised classification methods
- There are many alternative methods for supervised classification:
  - discriminant analysis (linear: LDA, or quadratic: QDA)
  - Bayesian classifier
  - K-nearest neighbours (KNN)
  - Support Vector Machine (SVM)
  - decision tree
  - Random Forest (RF)
  - neural network (NN)
  - ...
- Questions:
  - Which method should we choose?
  - How should we tune its parameters?
  - How should we evaluate the respective performances of the methods and parametric choices?
-
Supervised classification methods
Choosing the best method is not trivial.
- Some methods rely on strong assumptions:
  - LDA and QDA: multivariate normality;
  - LDA: all the classes have the same covariance matrix.
- Some methods implicitly rely on Euclidean distance (e.g. KNN).
- Some methods require a large training set, to avoid over-fitting.
- Global versus local classifiers:
  - global classifiers (e.g. LDA, QDA): the same classification rule applies in the whole data space, and is built on the whole training set;
  - local classifiers (e.g. KNN): rules are built in different sub-spaces, on the basis of the neighbouring training points.
- The choice of the method thus depends on the structure and on the size of the data sets.

Choosing the best parameters is not trivial.
- KNN: number of neighbours
- LDA, QDA: prior/posterior probabilities
- SVM: kernel
- Decision trees
- RF: number of iterations
- ...
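As an illustration of parameter sensitivity, here is a small KNN sketch (invented data) where the predicted class of one point changes with the number of neighbours k:

```python
import math
from collections import Counter

def knn_predict(X, y, x, k):
    """Predict the class of x by majority vote among its k nearest
    training points (Euclidean distance)."""
    nearest = sorted(zip(X, y), key=lambda xy: math.dist(xy[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

X = [(0, 0), (1, 1), (3, 3), (3, 4), (4, 3), (4, 4)]
y = ["A", "A", "B", "B", "B", "B"]

# The nearest single neighbour of (1.8, 1.8) is an "A", but two of its
# three nearest neighbours are "B": the prediction flips with k.
for k in (1, 3):
    print(k, knn_predict(X, y, (1.8, 1.8), k))
```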
-
Study case 1: ALL versus AML (data from Golub et al., 1999)
-
Cancer types (Golub, 1999)
- A founding paper: Golub et al. (1999).
- Compared the expression profiles of ~7000 human genes in patients suffering from two different cancer types: ALL or AML.
- Selected the 50 genes most correlated with the cancer type.
- Goal: use these genes as molecular signatures for the diagnosis of new patients.
- Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D. & Lander, E. S. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, 531-7.
-
Motivation
- The article by Golub et al. (1999) was motivated by the need for efficient diagnostics to predict the cancer type from blood samples of patients.
- They proposed a "molecular signature" of cancer type, allowing AML to be discriminated from ALL.
- This first "historical" study relied on somewhat arbitrary criteria to select the genes composing the signature, and on the way to apply them to classify new patients.
- We present here the classical statistical methods used to classify "objects" (patients, genes) into pre-defined classes.
-
Golub et al (1999)
- Data source: Golub et al. (1999), the first historical publication searching for molecular signatures of cancer type.
- Training set:
  - 38 samples from 2 types of leukemia:
    - 27 acute lymphoblastic leukemia (note: 2 subtypes, ALL-T and ALL-B);
    - 11 acute myeloid leukemia;
  - the original data set contains ~7000 genes;
  - filtering out poorly expressed genes retains 3051 genes.
- We re-analyze the data using different methods.
- Selection of differentially expressed genes (DEG): a Welch t-test with robust estimators (median, IQR) retains 367 differentially expressed genes with E-value = 0.
- Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D. and Lander, E. S. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, 531-7.
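A sketch of the (non-robust) Welch statistic underlying this selection; the expression values are invented, and the robust variant mentioned above would replace the mean by the median and estimate the standard deviation from the IQR:

```python
# Welch t statistic for comparing the expression of one gene between two
# groups of samples with possibly unequal variances (values invented).
def welch_t(x1, x2):
    n1, n2 = len(x1), len(x2)
    m1, m2 = sum(x1) / n1, sum(x2) / n2
    v1 = sum((v - m1) ** 2 for v in x1) / (n1 - 1)
    v2 = sum((v - m2) ** 2 for v in x2) / (n2 - 1)
    return (m1 - m2) / (v1 / n1 + v2 / n2) ** 0.5

all_expr = [2.1, 2.4, 2.2, 2.6, 2.3]   # one gene, ALL samples (invented)
aml_expr = [4.0, 4.3, 3.9, 4.4]        # same gene, AML samples (invented)
print(f"t = {welch_t(all_expr, aml_expr):.2f}")
```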
[Volcano plots of the Welch test results (standardization with median and IQR): golub.t.result.robust$means.diff on the x axis versus golub.t.result.robust$sig on the y axis; companion panel: difference between the means versus standard error on the difference.]
-
Golub 1999 - Profiles of selected genes
- The 367 genes selected by the t-test apparently have different profiles:
  - some genes seem greener for the ALL patients (27 leftmost samples);
  - some genes seem greener for the AML patients (11 rightmost samples).
[Heatmap of the selected genes across the ALL and AML samples]
-
Golub – hierarchical clustering of DEG genes / profiles
- Hierarchical clustering perfectly separates the two cancer types (AML versus ALL).
- This perfect separation is observed for various metrics (Euclidean, correlation, dot product) and agglomeration rules (complete, average, Ward).
- Sample clustering further reveals subgroups of ALL.
- Gene clustering reveals 4 groups of profiles:
  - AML red, ALL green;
  - AML green, ALL red;
  - overall green, stronger in AML;
  - overall red, stronger in ALL.
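A minimal sketch of agglomerative clustering with complete linkage (toy 2-D points stand in for expression profiles; the course data would use a full distance matrix):

```python
import math

def complete_linkage(points):
    """Agglomerative clustering: repeatedly merge the two clusters with the
    smallest complete-linkage distance (max pairwise distance)."""
    clusters = [[i] for i in range(len(points))]
    def link(a, b):
        return max(math.dist(points[i], points[j]) for i in a for j in b)
    merges = []
    while len(clusters) > 1:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: link(clusters[ij[0]], clusters[ij[1]]))
        merges.append((clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges

# Two tight groups of samples (e.g. "AML-like" 0-1 and "ALL-like" 2-3)
pts = [(0, 0), (0.5, 0), (5, 5), (5.5, 5)]
print(complete_linkage(pts)[-1])  # the last merge joins the two groups
```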
[Heatmap: hierarchical clustering of the Golub samples (Euclidean distance, complete linkage). The AML samples (X28–X38) and the ALL samples (X1–X27) form separate clusters.]
-
Principal component analysis (PCA)
- Principal component analysis (PCA) relies on a transformation of a multivariate table into a multi-dimensional table of "components".
- With the Golub dataset:
  - most of the variance is captured by the first component;
  - the first component clearly separates ALL from AML patients;
  - the second component splits the ALL set into two well-separated groups, which correspond almost perfectly to T-cells and B-cells, respectively.
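The "most variance on the first component" behaviour can be checked on a toy example: with two strongly correlated features (invented values), the larger eigenvalue of the 2×2 covariance matrix dominates.

```python
# Fraction of variance captured by the first principal component, for two
# strongly correlated features. The eigenvalues of the 2x2 covariance
# matrix are obtained in closed form from its trace and determinant.
def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)

x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [1.1, 2.0, 2.9, 4.2, 5.1, 5.8]   # strongly correlated with x1

c11, c22, c12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)
tr, det = c11 + c22, c11 * c22 - c12 ** 2
disc = (tr ** 2 / 4 - det) ** 0.5
lam1, lam2 = tr / 2 + disc, tr / 2 - disc   # eigenvalues, lam1 >= lam2
print(f"PC1 captures {lam1 / (lam1 + lam2):.1%} of the variance")
```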
[PCA plots: samples projected on the first two components (PC1, PC2), with the ALL_B, ALL_T and AML samples forming distinct groups; biplot of gene loadings; scree plot of the golub.pca variances.]
-
Study case 2: ALL subtypes (data from Den Boer et al., 2009)
-
Den Boer et al., 2009: procedure
n Den Boer et al. (2009) use Affymetrix microarrays to characterize the transcriptome of 190 acute lymphoblastic leukaemia (ALL) samples of different subtypes.
n They use these profiles to select "transcriptome signatures" that will serve diagnostic purposes: assigning new samples to one of the cancer subtypes.
n They apply an elaborate procedure relying on an inner and an outer loop of cross-validation.
n Data source: Den Boer et al. 2009. A subtype of childhood acute lymphoblastic leukaemia with poor treatment outcome: a genome-wide classification study. Lancet Oncol 10(2): 125-134.
38
Subtype                      Samples
hyperdiploid                 44
pre-B ALL                    44
TEL-AML1                     43
T-ALL                        36
E2A-rearranged (EP)          8
BCR-ABL                      4
E2A-rearranged (E-sub)       4
MLL                          4
BCR-ABL + hyperdiploidy      1
E2A-rearranged (E)           1
TEL-AML1 + hyperdiploidy     1
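The inner and outer cross-validation loops mentioned above can be sketched as follows. This is a minimal sketch in Python (the course itself uses R); the fold counts and the index bookkeeping are illustrative assumptions, not the authors' actual settings.

```python
import random

def k_folds(n, k, seed=0):
    """Split indices 0..n-1 into k shuffled, roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def nested_cv(n, outer_k=5, inner_k=3):
    """Yield (train, test, inner_splits) for each outer fold.
    The inner splits repartition the outer training set only, so the
    outer test fold is never used while tuning the signature."""
    for test in k_folds(n, outer_k):
        train = sorted(set(range(n)) - set(test))
        inner = []
        for val_pos in k_folds(len(train), inner_k, seed=1):
            val = [train[j] for j in val_pos]
            tr = sorted(set(train) - set(val))
            inner.append((tr, val))
        yield train, test, inner
```

The point of the nesting: signature selection and parameter tuning happen only within the inner splits, so each outer test fold gives an unbiased estimate of the final classifier's accuracy.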
-
Den Boer 2009 - The transcriptomic signature
n The training procedure selects 100 genes whose combined expression levels can be used to assign samples to cancer subtypes.
n The heatmaps show that the selected genes are differentially expressed
q between subtypes of the training set (left);
q between subtypes of the testing set (right).
n Den Boer et al. 2009. A subtype of childhood acute lymphoblastic leukaemia with poor treatment outcome: a genome-wide classification study. Lancet Oncol 10(2): 125-134. 39
-
Den Boer 2009 - Accuracy of the classifier
n The signature has an excellent diagnostic value: for the well-represented cancer types, the sensitivity and specificity are >90%.
n Note: accuracy can be misleading: some subtypes reach 98% accuracy despite 0% sensitivity.
n Den Boer et al. 2009. A subtype of childhood acute lymphoblastic leukaemia with poor treatment outcome: a genome-wide classification study. Lancet Oncol 10(2): 125-134. 40
Sn = TP / (TP + FN)
Sp = TN / (TN + FP)
PPV = TP / (TP + FP)
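These definitions, and the accuracy caveat above, can be checked with a small sketch (plain Python for illustration, though the course uses R; the confusion-table counts are made up to mimic a rare subtype):

```python
def diagnostic_stats(tp, fp, tn, fn):
    """Sensitivity, specificity, positive predictive value and
    accuracy from the four cells of a confusion table."""
    return {
        "Sn": tp / (tp + fn),                      # sensitivity
        "Sp": tn / (tn + fp),                      # specificity
        "PPV": tp / (tp + fp) if tp + fp else float("nan"),
        "Acc": (tp + tn) / (tp + fp + tn + fn),    # accuracy
    }

# A rare subtype the classifier never predicts: 0 of 2 positives
# detected, 98 negatives correctly rejected.
stats = diagnostic_stats(tp=0, fp=0, tn=98, fn=2)
```

Here `Acc` is 0.98 although `Sn` is 0: with imbalanced classes, accuracy alone hides a classifier that misses every positive.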
-
Den Boer 2009 - Exploring some profiles
n Left: expression for 2 genes selected at random. Each symbol represents one sample, coloured by cancer type. All cancer types are intermingled.
n Right: expression of the 2 genes with the highest sample-wise variance. The first gene (CD9) separates cell types T and Bt (low expression) from Bh, Bep, Br (high expression). Bo is dispersed over the whole range.
n Question: how can we identify a combination of genes that discriminates the different subtypes as well as possible?
41
[Figure: two scatter plots of the Den Boer (2009) samples, each symbol one sample labelled by subtype (T, Bt, Bh, BEp, BEs, Bc, BM, Bo). Left, "Den Boer (2009), randomly selected genes": CCDC28A|209479_at vs TCF3|215260_s_at. Right, "2 genes with the highest variance": CD9|201005_at vs IL23A|211796_s_at.]
-
Supervised classification: methodological principles
Statistical Analysis of Microarray Data
-
Multivariate data with a nominal criterion variable
n We have a set of objects (the sample) that have been previously assigned to predefined classes.
n Each object is characterized by a series of quantitative variables (the predictors), and its class is indicated in a separate column (the criterion variable).
43
             Predictor variables                          Criterion variable
             variable 1   variable 2   ...   variable p   class
object 1     x1,1         x2,1         ...   xp,1         A
object 2     x1,2         x2,2         ...   xp,2         A
object 3     x1,3         x2,3         ...   xp,3         A
...          ...          ...          ...   ...          ...
object i     x1,i         x2,i         ...   xp,i         B
object i+1   x1,i+1       x2,i+1       ...   xp,i+1       B
object i+2   x1,i+2       x2,i+2       ...   xp,i+2       B
...          ...          ...          ...   ...          ...
object n-1   x1,n-1       x2,n-1       ...   xp,n-1       K
object n     x1,n         x2,n         ...   xp,n         K
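A table of this shape can be represented minimally as predictor rows paired with class labels. A Python sketch for illustration (the values are invented, not taken from any dataset):

```python
# Each object: (predictor values, class label).
samples = [
    ([5.1, 3.5, 1.4], "A"),
    ([4.9, 3.0, 1.3], "A"),
    ([6.3, 2.9, 5.6], "B"),
    ([6.5, 3.0, 5.8], "B"),
]
X = [features for features, _ in samples]   # predictor variables
y = [label for _, label in samples]         # criterion variable
```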
-
Supervised classification – training and prediction
n Training phase (training + evaluation)
q The sample is used to build a discriminant function.
q The quality of the discriminant function is evaluated.
n Prediction phase
q The discriminant function is used to predict the value of the criterion variable for new objects.
44
Training set (predictor variables + known criterion variable):

              variable 1   variable 2   ...   variable p   class
object 1      x11          x21          ...   xp1          A
object 2      x12          x22          ...   xp2          A
object 3      x13          x23          ...   xp3          B
...           ...          ...          ...   ...          ...
object ntrain x1n          x2n          ...   xpn          K

New objects (predictor variables only; class to predict):

              variable 1   variable 2   ...   variable p   class
object 1      x11          x21          ...   xp1          ?
object 2      x12          x22          ...   xp2          ?
object 3      x13          x23          ...   xp3          ?
...           ...          ...          ...   ...          ...
object npred  x1n          x2n          ...   xpn          ?
-
Supervised classification: steps of the general procedure

n Training: individuals of known class
q Training set → Training → Trained classifier
n Testing: individuals of known class
q Testing set → Prediction → Predicted class → Comparison with the known class → Confusion table (for evaluation) → Validated classifier
n Prediction: individuals of unknown class
q Individuals of unknown class → Prediction → Predicted class

45
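The three steps (train, test, predict) can be sketched with a toy nearest-centroid classifier. This is a minimal Python illustration; the classifier choice and the data are assumptions, not the method used in the study.

```python
from collections import defaultdict

def train(X, y):
    """Training: compute one centroid (mean vector) per class."""
    groups = defaultdict(list)
    for features, label in zip(X, y):
        groups[label].append(features)
    return {g: [sum(col) / len(rows) for col in zip(*rows)]
            for g, rows in groups.items()}

def predict(model, x):
    """Prediction: assign x to the class with the nearest centroid."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(model, key=lambda g: dist2(model[g]))

def confusion(model, X_test, y_test):
    """Testing: count (known class, predicted class) pairs."""
    table = defaultdict(int)
    for x, true in zip(X_test, y_test):
        table[(true, predict(model, x))] += 1
    return dict(table)
```

Usage mirrors the flowchart: `train` on individuals of known class, build the confusion table on a held-out testing set, then apply the validated classifier to individuals of unknown class.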
-
Discriminant analysis
Statistical Analysis of Microarray Data
-
Linear or quadratic discriminant analysis (LDA vs QDA)

n Equal covariance matrix between groups?
q Linear Discriminant Analysis (LDA) is appropriate.
q Green lines on the graph.
q The discrimination rule amounts to drawing a straight line between the gravity centres of the training groups.
n Different covariance matrices?
q Quadratic Discriminant Analysis (QDA) is recommended (red boundaries on the graphs).
47
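Why the boundary is linear in one case and quadratic in the other can be seen in one dimension. A sketch in Python (illustration only; means and variances are invented): with a pooled variance the quadratic term in x is shared by all groups and cancels, leaving a linear score, whereas group-specific variances keep the quadratic term.

```python
import math

def qda_score(x, mean, var):
    """Log Gaussian density with a group-specific variance (QDA, 1-D):
    the (x - mean)**2 / var term differs per group -> quadratic boundary."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def lda_score(x, mean, pooled_var):
    """Same with a pooled variance (LDA, 1-D): the x**2 term is common
    to all groups and cancels -> linear score, boundary at the midpoint
    between the two group means."""
    return x * mean / pooled_var - mean ** 2 / (2 * pooled_var)

def classify(x, params, score):
    """Assign x to the group with the highest score."""
    return max(params, key=lambda g: score(x, *params[g]))

# Equal variances: LDA boundary halfway between the means (at x = 2).
params_lda = {"A": (0.0, 1.0), "B": (4.0, 1.0)}
# Same means, different variances: QDA gives the narrow group a
# central region and the wide group both tails (two boundaries).
params_qda = {"A": (0.0, 0.25), "B": (0.0, 9.0)}
```

The heteroscedastic case shows the qualitative difference: a straight line cannot carve out the central region that the narrow group occupies.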
-
Classification rules
n New units can be classified on the basis of rules based on the calibration sample.
n Several alternative rules can be used.
q Maximum likelihood rule, based on the density function: assign unit u to group g if

f(X | g) > f(X | g')  for all g' ≠ g

q Inverse probability rule, based on the probability: assign unit u to group g if

P(X | g) > P(X | g')  for all g' ≠ g

q Posterior probability rule: assign unit u to group g if

P(g | X) > P(g' | X)  for all g' ≠ g

Where:
X is the unit vector;
g, g' are two groups;
f(X | g) is the density function of the value X for group g;
P(X | g) is the probability to emit the value X given the group g;
P(g | X) is the probability to belong to group g, given the value X.

48
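The maximum likelihood and posterior rules disagree exactly when the group priors differ, since P(g | X) is proportional to P(g) f(X | g) by Bayes' theorem. A sketch with 1-D Gaussian densities (Python for illustration; the priors, means and variances are made-up values):

```python
import math

def gauss_density(x, mean, var):
    """f(X | g): Gaussian density of x for one group."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def posterior(x, groups):
    """P(g | X) proportional to P(g) * f(X | g), normalised over groups.
    groups maps each group to (prior, mean, variance)."""
    scores = {g: prior * gauss_density(x, m, v)
              for g, (prior, m, v) in groups.items()}
    total = sum(scores.values())
    return {g: s / total for g, s in scores.items()}

# Illustration: group A is 9 times more frequent than B a priori.
groups = {"A": (0.9, 0.0, 1.0), "B": (0.1, 2.0, 1.0)}
```

At x = 1.2 the point lies closer to B's mean, so the maximum likelihood rule picks B; but A's much larger prior makes the posterior rule pick A.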
-
Posterior probability