TRANSCRIPT
-
Machine learning
DUBii - Module - Statistics with R
Jacques van Helden (ORCID 0000-0002-8799-8584)
Institut Français de Bioinformatique (IFB), French node of the European ELIXIR bioinformatics infrastructure
Aix-Marseille Université (AMU), Lab. Theory and Approaches of Genomic Complexity (TAGC)
https://www.france-bioinformatique.fr/ · https://elixir-europe.org/ · https://tagc.univ-amu.fr/
-
Brain-learning exercise: assign individuals to groups based on their features
-
Conceptual illustration with two predictor variables
- In the next slides, we provide a higher-resolution series of plots, each representing a study case.
- Exercise: intuitively assign each individual (black dot) to one of the two groups (A, B).
- At each step, ask yourself the following questions:
  - Which criterion did you use to assign an individual to a group?
  - How confident do you feel about each of your predictions?
  - What is the effect of the respective means?
  - What is the effect of the respective standard deviations?
  - What is the effect of the correlations between the two variables?
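The intuition built by this exercise can be simulated. Below is a minimal sketch (Python is used here for illustration, and all group parameters are invented) that assigns each individual to the group with the nearest centre; moving the centres closer together, or increasing the standard deviations, lowers the proportion of correct assignments.

```python
import math
import random

random.seed(42)

# Hypothetical study case: two groups (A, B), each described by two features
# drawn from normal distributions with group-specific means and standard
# deviations (all parameter values are invented for illustration).
def sample_group(mean, sd, n):
    return [(random.gauss(mean[0], sd[0]), random.gauss(mean[1], sd[1]))
            for _ in range(n)]

centres = {"A": (0.0, 0.0), "B": (4.0, 4.0)}
group_a = sample_group(centres["A"], (1.0, 1.0), 50)
group_b = sample_group(centres["B"], (1.0, 1.0), 50)

def assign(point):
    """Assign a point to the group with the nearest centre (Euclidean)."""
    return min(centres, key=lambda g: math.dist(point, centres[g]))

correct = (sum(assign(p) == "A" for p in group_a)
           + sum(assign(p) == "B" for p in group_b))
print(f"{correct}/100 individuals assigned to their true group")
```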
-
Conceptual illustration with two variables – Study case 1
- Inspect the distribution of points for the two groups of individuals (pink, blue) in the 2-dimensional feature space.
[Scatter plot of the two groups in the (X1, X2) feature plane]
-
Conceptual illustration with two variables – Study case 2
- Effect of the group centre location.
-
Conceptual illustration with two variables – Study case 3
- Effect of the group variance.
-
Conceptual illustration with two variables – Study case 4
- Effect of the group variance.
-
Conceptual illustration with two variables – Study case 5
- Impact of the group-specific variances (heteroscedasticity of the data).
-
Conceptual illustration with two variables – Study case 6
- Impact of the group-specific variances (heteroscedasticity of the data).
-
Conceptual illustration with two variables – Study case 7
- Effect of the covariance between features.
-
Conceptual illustration with two variables – Study case 8
- Effect of the covariance between features.
-
Conceptual illustration with two variables – Study case 9
- Group-specific covariances between features:
  - the two groups have different covariance matrices: the clouds of points are elongated in different directions;
  - how does this difference affect group assignments?
-
Multivariate analysis - Introduction
Statistics Applied to Bioinformatics
-
Multivariate data
- Each row represents one object (also called a unit).
- Each column represents one variable.

               variable 1   variable 2   ...   variable p
individual 1      x11          x12       ...      x1p
individual 2      x21          x22       ...      x2p
individual 3      x31          x32       ...      x3p
...               ...          ...       ...      ...
individual n      xn1          xn2       ...      xnp
-
Multivariate data with an outcome variable
- The outcome variable (also called criterion variable) can be:
  - qualitative (nominal): classes (e.g. cancer type);
  - quantitative (e.g. survival expectation for a cancer patient).

               Predictor variables                        Outcome variable
               variable 1   variable 2   ...   variable p   variable p+1
individual 1      x11          x12       ...      x1p           y1
individual 2      x21          x22       ...      x2p           y2
...               ...          ...       ...      ...           ...
individual n      xn1          xn2       ...      xnp           yn
-
Predictive approaches - Training set
- Training set (individuals 1 ... N_train): predictor variables (variable 1 ... variable p) together with known values of the outcome variable (variable p+1).
- Set to predict (individuals 1 ... N_pred): the same predictor variables, but unknown outcome values (?).
- The training set is used to build a predictive function.
- This function is then used to predict the value of the outcome variable for new objects.
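As a minimal sketch of this two-step workflow (the data values and the choice of a nearest-centroid rule are purely illustrative, not the course's method):

```python
# Build a predictive function from a training set, then apply it to new
# objects. A nearest-centroid rule stands in for the predictive function.
def fit(X, y):
    """Training step: compute one centroid per class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def predict(centroids, x):
    """Prediction step: assign x to the class with the nearest centroid."""
    def sqdist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(centroids, key=lambda lab: sqdist(x, centroids[lab]))

# Training set: predictor variables with known outcome values
X_train = [[1.0, 1.2], [0.8, 1.0], [3.9, 4.1], [4.2, 3.8]]
y_train = ["A", "A", "B", "B"]

model = fit(X_train, y_train)
print(predict(model, [1.1, 0.9]))  # a new object with unknown outcome
```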
-
Evaluation of prediction with a testing set
- Training set (individuals 1 ... n_train): predictor variables and known outcome values y, used to build the predictive function.
- Testing set (individuals 1 ... n_test): predictor variables, the known outcome value y, and the predicted outcome value y', used to evaluate the predictions by comparing y' with y.
- Set to predict (individuals 1 ... n_pred): predictor variables only; the outcome values are unknown (?).
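Once each testing-set individual has both a known outcome y and a predicted outcome y', the evaluation itself is a simple comparison; a sketch with invented values:

```python
# Compare known outcomes (y) with predicted outcomes (y') on a testing set.
y_known = ["ALL", "ALL", "AML", "ALL", "AML"]   # invented known values
y_pred  = ["ALL", "AML", "AML", "ALL", "AML"]   # invented predictions

errors = sum(known != pred for known, pred in zip(y_known, y_pred))
accuracy = 1 - errors / len(y_known)
print(f"misclassified: {errors}/{len(y_known)}, accuracy = {accuracy:.2f}")
```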
-
Flowchart of the approaches in multivariate analysis
- Start from a multivariate table X; the dimensions can be reduced (variable selection, principal component analysis), or the table converted into a distance matrix (multidimensional scaling).
- Is there an outcome variable Y?
  - None: exploratory analysis, i.e. cluster analysis (discovered classes + assignment of individuals), visualisation, graphical representations.
  - Quantitative: regression analysis, i.e. predicted value of a quantitative variable, y_est = f(x).
  - Nominal: supervised classification, i.e. assignment of individuals to predefined classes, g = f(x).
-
Quiz
Check your understanding of the concepts presented in the previous slides by applying them to your own data.
1. Describe in one sentence a typical case of multidimensional data handled in your domain.
2. Explain how you would organise this dataset into a multivariate structure:
   - What would correspond to the individuals?
   - What would correspond to the variables?
   - How many individuals (n) would you have?
   - How many variables (p) would you have?
   - Do you have one or several outcome variable(s)?
   - If so, are they quantitative, qualitative, or both?
3. Based on the conceptual framework defined above, which kinds of approaches would you envisage to extract which kinds of relevant information from these data? Note that several approaches can be combined to address different questions.
-
Historical (vintage) examples
-
Historical example of clustering heat map
- Spellman et al. (1998): systematic detection of genes regulated in a periodic way during the cell cycle.
- Several experiments were regrouped, with various ways of synchronization (elutriation, cdc mutants, ...).
- ~800 genes showing a periodic pattern of expression were selected (by Fourier analysis).
- Figure: time profiles of yeast cells followed during the cell cycle.
- Spellman, P. T., Sherlock, G., Zhang, M. Q., Iyer, V. R., Anders, K., Eisen, M. B., Brown, P. O., Botstein, D. & Futcher, B. (1998). Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell 9, 3273-97.
-
Stress response in yeast
- Gasch et al. (2000) tested the transcriptional response of the yeast genome to:
  - various stress conditions (heat shock, osmotic shock, ...);
  - drugs;
  - alternative carbon sources;
  - ...
- The heatmap shows clusters of genes having similar profiles of response to the different types of stress.
- Gasch, A. P., Spellman, P. T., Kao, C. M., Carmel-Harel, O., Eisen, M. B., Storz, G., Botstein, D. & Brown, P. O. (2000). Genomic expression programs in the response of yeast cells to environmental changes. Mol Biol Cell 11, 4241-57.
-
Cancer types (Golub, 1999)
- Compared the expression profiles of ~7000 human genes in patients suffering from two different cancer types: ALL or AML.
- Selected the 50 genes most correlated with the cancer type.
- Goal: use these genes as molecular signatures for the diagnosis of new patients.
- Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D. & Lander, E. S. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, 531-7.
-
Den Boer et al., 2009: procedure
- Data source: Den Boer et al. (2009). A subtype of childhood acute lymphoblastic leukaemia with poor treatment outcome: a genome-wide classification study. Lancet Oncol 10(2): 125-134.
- Den Boer et al. (2009) use Affymetrix microarrays to characterize the transcriptome of 190 Acute Lymphoblastic Leukemia samples of different types.
- They use these profiles to select "transcriptome signatures" that will serve for diagnostic purposes: assigning new samples to one of the cancer types.
- They apply an elaborate procedure relying on an inner and an outer loop of cross-validation.
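The cross-validation loops rely on repeatedly partitioning the samples into training and testing folds. A minimal sketch of k-fold index splitting (the fold-assignment scheme is illustrative, not the one used in the paper):

```python
def kfold_indices(n, k):
    """Partition sample indices 0..n-1 into k folds; for each fold,
    return (training indices, testing indices)."""
    folds = [list(range(f, n, k)) for f in range(k)]
    return [([i for i in range(n) if i not in set(test)], test)
            for test in folds]

# 190 samples, 10 folds: each sample is used exactly once for testing
splits = kfold_indices(190, 10)
print(len(splits), [len(test) for _, test in splits][:3])
```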
-
Den Boer 2009 - The transcriptomic signature
- Den Boer et al. (2009). A subtype of childhood acute lymphoblastic leukaemia with poor treatment outcome: a genome-wide classification study. Lancet Oncol 10(2): 125-134.
- The training procedure selects 100 genes whose combined expression levels can be used to assign samples to cancer subtypes.
- The heatmaps show that the selected genes are differentially expressed:
  - between subtypes of the training set (left);
  - between subtypes of the testing set (right).
- The heatmap is bi-clustered, in order to identify simultaneously groups of patients (rows) and groups of genes (columns), based on the similarity between expression profiles.
-
Supervised classification
Statistics for bioinformatics
Jacques van Helden
-
Supervised classification - Introduction
- In the previous chapter, we presented the problem of clustering, which consists in grouping objects without any a priori definition of the groups. The group definitions emerge from the clustering itself (class discovery). Clustering is thus unsupervised.
- In some cases, one would like to focus on pre-defined classes:
  - classifying tissues as cancer or non-cancer;
  - classifying tissues between different cancer types;
  - classifying genes according to pre-defined functional classes (e.g. metabolic pathways, different phases of the cell cycle, ...).
- The classifier can be built with a training set, and used later for classifying new objects. This is called supervised classification.
-
Supervised classification methods
- There are many alternative methods for supervised classification:
  - discriminant analysis (linear: LDA, or quadratic: QDA)
  - Bayesian classifier
  - K-nearest neighbours (KNN)
  - Support Vector Machine (SVM)
  - decision tree
  - Random Forest (RF)
  - neural network (NN)
  - ...
- Questions:
  - Which method should we choose?
  - How should we tune its parameters?
  - How should we evaluate the respective performances of the methods and parametric choices?
-
Supervised classification methods
Choosing the best method is not trivial.
- Some methods rely on strong assumptions:
  - LDA and QDA: multivariate normality;
  - LDA: all the classes have the same covariance matrix.
- Some methods implicitly rely on Euclidean distance (e.g. KNN).
- Some methods require a large training set, to avoid over-fitting.
- Global versus local classifiers:
  - global classifiers (e.g. LDA, QDA): the same classification rule applies in the whole data space, and is built on the whole training set;
  - local classifiers (e.g. KNN): rules are built in different sub-spaces, on the basis of the neighbouring training points.
- The choice of the method thus depends on the structure and on the size of the data sets.

Choosing the best parameters is not trivial.
- KNN: number of neighbours
- LDA, QDA: prior/posterior probabilities
- SVM: kernel
- Decision trees
- RF: number of iterations
- ...
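As an illustration of parameter sensitivity, here is a small KNN sketch (invented data) where the predicted class of one point changes with the number of neighbours k:

```python
import math
from collections import Counter

def knn_predict(X, y, x, k):
    """Predict the class of x by majority vote among its k nearest
    training points (Euclidean distance)."""
    nearest = sorted(zip(X, y), key=lambda xy: math.dist(xy[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

X = [(0, 0), (1, 1), (3, 3), (3, 4), (4, 3), (4, 4)]
y = ["A", "A", "B", "B", "B", "B"]

# The nearest single neighbour of (1.8, 1.8) is an "A", but two of its
# three nearest neighbours are "B": the prediction flips with k.
for k in (1, 3):
    print(k, knn_predict(X, y, (1.8, 1.8), k))
```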
-
Study case 1: ALL versus AML (data from Golub et al., 1999)
-
Cancer types (Golub, 1999)
- A founding paper: Golub et al. (1999).
- Compared the expression profiles of ~7000 human genes in patients suffering from two different cancer types: ALL or AML.
- Selected the 50 genes most correlated with the cancer type.
- Goal: use these genes as molecular signatures for the diagnosis of new patients.
- Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D. & Lander, E. S. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, 531-7.
-
Motivation
- The article by Golub et al. (1999) was motivated by the need for efficient diagnostics to predict the cancer type from blood samples of patients.
- They proposed a "molecular signature" of cancer type, allowing AML to be discriminated from ALL.
- This first "historical" study relied on somewhat arbitrary criteria to select the genes composing the signature, and on the way to apply them to classify new patients.
- We present here the classical statistical methods used to classify "objects" (patients, genes) into pre-defined classes.
-
Golub et al (1999)
- Data source: Golub et al. (1999), the first historical publication searching for molecular signatures of cancer type.
- Training set:
  - 38 samples from 2 types of leukemia:
    - 27 acute lymphoblastic leukemia (note: 2 subtypes, ALL-T and ALL-B);
    - 11 acute myeloid leukemia;
  - the original data set contains ~7000 genes;
  - filtering out poorly expressed genes retains 3051 genes.
- We re-analyze the data using different methods.
- Selection of differentially expressed genes (DEG): a Welch t-test with robust estimators (median, IQR) retains 367 differentially expressed genes with E-value = 0.
- Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D. and Lander, E. S. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, 531-7.
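A sketch of the (non-robust) Welch statistic underlying this selection; the expression values are invented, and the robust variant mentioned above would replace the mean by the median and estimate the standard deviation from the IQR:

```python
# Welch t statistic for comparing the expression of one gene between two
# groups of samples with possibly unequal variances (values invented).
def welch_t(x1, x2):
    n1, n2 = len(x1), len(x2)
    m1, m2 = sum(x1) / n1, sum(x2) / n2
    v1 = sum((v - m1) ** 2 for v in x1) / (n1 - 1)
    v2 = sum((v - m2) ** 2 for v in x2) / (n2 - 1)
    return (m1 - m2) / (v1 / n1 + v2 / n2) ** 0.5

all_expr = [2.1, 2.4, 2.2, 2.6, 2.3]   # one gene, ALL samples (invented)
aml_expr = [4.0, 4.3, 3.9, 4.4]        # same gene, AML samples (invented)
print(f"t = {welch_t(all_expr, aml_expr):.2f}")
```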
[Volcano plots of the Welch test results (standardization with median and IQR): golub.t.result.robust$means.diff on the x axis versus golub.t.result.robust$sig on the y axis; companion panel: difference between the means versus standard error on the difference.]
-
Golub 1999 - Profiles of selected genes
- The 367 genes selected by the t-test apparently have different profiles:
  - some genes seem greener for the ALL patients (27 leftmost samples);
  - some genes seem greener for the AML patients (11 rightmost samples).
[Heatmap of the selected genes across the ALL and AML samples]
-
Golub – hierarchical clustering of DEG genes / profiles
- Hierarchical clustering perfectly separates the two cancer types (AML versus ALL).
- This perfect separation is observed for various metrics (Euclidean, correlation, dot product) and agglomeration rules (complete, average, Ward).
- Sample clustering further reveals subgroups of ALL.
- Gene clustering reveals 4 groups of profiles:
  - AML red, ALL green;
  - AML green, ALL red;
  - overall green, stronger in AML;
  - overall red, stronger in ALL.
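A minimal sketch of agglomerative clustering with complete linkage (toy 2-D points stand in for expression profiles; the course data would use a full distance matrix):

```python
import math

def complete_linkage(points):
    """Agglomerative clustering: repeatedly merge the two clusters with the
    smallest complete-linkage distance (max pairwise distance)."""
    clusters = [[i] for i in range(len(points))]
    def link(a, b):
        return max(math.dist(points[i], points[j]) for i in a for j in b)
    merges = []
    while len(clusters) > 1:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: link(clusters[ij[0]], clusters[ij[1]]))
        merges.append((clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges

# Two tight groups of samples (e.g. "AML-like" 0-1 and "ALL-like" 2-3)
pts = [(0, 0), (0.5, 0), (5, 5), (5.5, 5)]
print(complete_linkage(pts)[-1])  # the last merge joins the two groups
```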
[Heatmap: hierarchical clustering of the Golub samples (Euclidean distance, complete linkage). The AML samples (X28–X38) and the ALL samples (X1–X27) form separate clusters.]
-
Principal component analysis (PCA)
- Principal component analysis (PCA) relies on a transformation of a multivariate table into a multi-dimensional table of "components".
- With the Golub dataset:
  - most of the variance is captured by the first component;
  - the first component clearly separates ALL from AML patients;
  - the second component splits the ALL set into two well-separated groups, which correspond almost perfectly to T-cells and B-cells, respectively.
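The "most variance on the first component" behaviour can be checked on a toy example: with two strongly correlated features (invented values), the larger eigenvalue of the 2×2 covariance matrix dominates.

```python
# Fraction of variance captured by the first principal component, for two
# strongly correlated features. The eigenvalues of the 2x2 covariance
# matrix are obtained in closed form from its trace and determinant.
def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)

x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [1.1, 2.0, 2.9, 4.2, 5.1, 5.8]   # strongly correlated with x1

c11, c22, c12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)
tr, det = c11 + c22, c11 * c22 - c12 ** 2
disc = (tr ** 2 / 4 - det) ** 0.5
lam1, lam2 = tr / 2 + disc, tr / 2 - disc   # eigenvalues, lam1 >= lam2
print(f"PC1 captures {lam1 / (lam1 + lam2):.1%} of the variance")
```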
[PCA plots: samples projected on the first two components (PC1, PC2), with the ALL_B, ALL_T and AML samples forming distinct groups; biplot of gene loadings; scree plot of the golub.pca variances.]
-
Study case 2: ALL subtypes (data from Den Boer et al., 2009)
-
Den Boer et al., 2009: procedure
n Den Boer et al. (2009) use Affymetrix microarrays to characterize the transcriptome of 190 acute lymphoblastic leukaemia (ALL) samples of different subtypes.
n They use these profiles to select "transcriptome signatures" that will serve diagnostic purposes: assigning new samples to one of the cancer subtypes.
n They apply an elaborate procedure relying on an inner and an outer loop of cross-validation.
n Data source: Den Boer et al. 2009. A subtype of childhood acute lymphoblastic leukaemia with poor treatment outcome: a genome-wide classification study. Lancet Oncol 10(2): 125-134.
38
Subtype                      Samples
hyperdiploid                 44
pre-B ALL                    44
TEL-AML1                     43
T-ALL                        36
E2A-rearranged (EP)          8
BCR-ABL                      4
E2A-rearranged (E-sub)       4
MLL                          4
BCR-ABL + hyperdiploidy      1
E2A-rearranged (E)           1
TEL-AML1 + hyperdiploidy     1
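The inner and outer cross-validation loops mentioned above can be sketched as follows. This is a minimal sketch in Python (the course itself uses R); the fold counts and the index bookkeeping are illustrative assumptions, not the authors' actual settings.

```python
import random

def k_folds(n, k, seed=0):
    """Split indices 0..n-1 into k shuffled, roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def nested_cv(n, outer_k=5, inner_k=3):
    """Yield (train, test, inner_splits) for each outer fold.
    The inner splits repartition the outer training set only, so the
    outer test fold is never used while tuning the signature."""
    for test in k_folds(n, outer_k):
        train = sorted(set(range(n)) - set(test))
        inner = []
        for val_pos in k_folds(len(train), inner_k, seed=1):
            val = [train[j] for j in val_pos]
            tr = sorted(set(train) - set(val))
            inner.append((tr, val))
        yield train, test, inner
```

The point of the nesting: signature selection and parameter tuning happen only within the inner splits, so each outer test fold gives an unbiased estimate of the final classifier's accuracy.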
-
Den Boer 2009 - The transcriptomic signature
n The training procedure selects 100 genes whose combined expression levels can be used to assign samples to cancer subtypes.
n The heatmaps show that the selected genes are differentially expressed
q between subtypes of the training set (left);
q between subtypes of the testing set (right).
n Den Boer et al. 2009. A subtype of childhood acute lymphoblastic leukaemia with poor treatment outcome: a genome-wide classification study. Lancet Oncol 10(2): 125-134. 39
-
Den Boer 2009 - Accuracy of the classifier
n The signature has an excellent diagnostic value: for the well-represented cancer types, the sensitivity and specificity are >90%.
n Note: accuracy can be misleading: some subtypes reach 98% accuracy despite 0% sensitivity.
n Den Boer et al. 2009. A subtype of childhood acute lymphoblastic leukaemia with poor treatment outcome: a genome-wide classification study. Lancet Oncol 10(2): 125-134. 40
Sn = TP / (TP + FN)
Sp = TN / (TN + FP)
PPV = TP / (TP + FP)
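These definitions, and the accuracy caveat above, can be checked with a small sketch (plain Python for illustration, though the course uses R; the confusion-table counts are made up to mimic a rare subtype):

```python
def diagnostic_stats(tp, fp, tn, fn):
    """Sensitivity, specificity, positive predictive value and
    accuracy from the four cells of a confusion table."""
    return {
        "Sn": tp / (tp + fn),                      # sensitivity
        "Sp": tn / (tn + fp),                      # specificity
        "PPV": tp / (tp + fp) if tp + fp else float("nan"),
        "Acc": (tp + tn) / (tp + fp + tn + fn),    # accuracy
    }

# A rare subtype the classifier never predicts: 0 of 2 positives
# detected, 98 negatives correctly rejected.
stats = diagnostic_stats(tp=0, fp=0, tn=98, fn=2)
```

Here `Acc` is 0.98 although `Sn` is 0: with imbalanced classes, accuracy alone hides a classifier that misses every positive.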
-
Den Boer 2009 - Exploring some profiles
n Left: expression for 2 genes selected at random. Each symbol represents one sample, coloured by cancer type. All cancer types are intermingled.
n Right: expression of the 2 genes with the highest sample-wise variance. The first gene (CD9) separates cell types T and Bt (low expression) from Bh, Bep, Br (high expression). Bo is dispersed over the whole range.
n Question: how can we identify a combination of genes that discriminates the different subtypes as well as possible?
41
[Figure: two scatter plots of the Den Boer (2009) samples, each symbol one sample labelled by subtype (T, Bt, Bh, BEp, BEs, Bc, BM, Bo). Left, "Den Boer (2009), randomly selected genes": CCDC28A|209479_at vs TCF3|215260_s_at. Right, "2 genes with the highest variance": CD9|201005_at vs IL23A|211796_s_at.]
-
Supervised classification: methodological principles
Statistical Analysis of Microarray Data
-
Multivariate data with a nominal criterion variable
n We have a set of objects (the sample) that have been previously assigned to predefined classes.
n Each object is characterized by a series of quantitative variables (the predictors), and its class is indicated in a separate column (the criterion variable).
43
             Predictor variables                          Criterion variable
             variable 1   variable 2   ...   variable p   class
object 1     x1,1         x2,1         ...   xp,1         A
object 2     x1,2         x2,2         ...   xp,2         A
object 3     x1,3         x2,3         ...   xp,3         A
...          ...          ...          ...   ...          ...
object i     x1,i         x2,i         ...   xp,i         B
object i+1   x1,i+1       x2,i+1       ...   xp,i+1       B
object i+2   x1,i+2       x2,i+2       ...   xp,i+2       B
...          ...          ...          ...   ...          ...
object n-1   x1,n-1       x2,n-1       ...   xp,n-1       K
object n     x1,n         x2,n         ...   xp,n         K
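A table of this shape can be represented minimally as predictor rows paired with class labels. A Python sketch for illustration (the values are invented, not taken from any dataset):

```python
# Each object: (predictor values, class label).
samples = [
    ([5.1, 3.5, 1.4], "A"),
    ([4.9, 3.0, 1.3], "A"),
    ([6.3, 2.9, 5.6], "B"),
    ([6.5, 3.0, 5.8], "B"),
]
X = [features for features, _ in samples]   # predictor variables
y = [label for _, label in samples]         # criterion variable
```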
-
Supervised classification – training and prediction
n Training phase (training + evaluation)
q The sample is used to build a discriminant function.
q The quality of the discriminant function is evaluated.
n Prediction phase
q The discriminant function is used to predict the value of the criterion variable for new objects.
44
Training set (predictor variables + known criterion variable):

              variable 1   variable 2   ...   variable p   class
object 1      x11          x21          ...   xp1          A
object 2      x12          x22          ...   xp2          A
object 3      x13          x23          ...   xp3          B
...           ...          ...          ...   ...          ...
object ntrain x1n          x2n          ...   xpn          K

New objects (predictor variables only; class to predict):

              variable 1   variable 2   ...   variable p   class
object 1      x11          x21          ...   xp1          ?
object 2      x12          x22          ...   xp2          ?
object 3      x13          x23          ...   xp3          ?
...           ...          ...          ...   ...          ...
object npred  x1n          x2n          ...   xpn          ?
-
Supervised classification: steps of the general procedure

n Training: individuals of known class
q Training set → Training → Trained classifier
n Testing: individuals of known class
q Testing set → Prediction → Predicted class → Comparison with the known class → Confusion table (for evaluation) → Validated classifier
n Prediction: individuals of unknown class
q Individuals of unknown class → Prediction → Predicted class

45
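The three steps (train, test, predict) can be sketched with a toy nearest-centroid classifier. This is a minimal Python illustration; the classifier choice and the data are assumptions, not the method used in the study.

```python
from collections import defaultdict

def train(X, y):
    """Training: compute one centroid (mean vector) per class."""
    groups = defaultdict(list)
    for features, label in zip(X, y):
        groups[label].append(features)
    return {g: [sum(col) / len(rows) for col in zip(*rows)]
            for g, rows in groups.items()}

def predict(model, x):
    """Prediction: assign x to the class with the nearest centroid."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(model, key=lambda g: dist2(model[g]))

def confusion(model, X_test, y_test):
    """Testing: count (known class, predicted class) pairs."""
    table = defaultdict(int)
    for x, true in zip(X_test, y_test):
        table[(true, predict(model, x))] += 1
    return dict(table)
```

Usage mirrors the flowchart: `train` on individuals of known class, build the confusion table on a held-out testing set, then apply the validated classifier to individuals of unknown class.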
-
Discriminant analysis
Statistical Analysis of Microarray Data
-
Linear or quadratic discriminant analysis (LDA vs QDA)

n Equal covariance matrix between groups?
q Linear Discriminant Analysis (LDA) is appropriate.
q Green lines on the graph.
q The discrimination rule amounts to drawing a straight line between the gravity centres of the training groups.
n Different covariance matrices?
q Quadratic Discriminant Analysis (QDA) is recommended (red boundaries on the graphs).
47
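Why the boundary is linear in one case and quadratic in the other can be seen in one dimension. A sketch in Python (illustration only; means and variances are invented): with a pooled variance the quadratic term in x is shared by all groups and cancels, leaving a linear score, whereas group-specific variances keep the quadratic term.

```python
import math

def qda_score(x, mean, var):
    """Log Gaussian density with a group-specific variance (QDA, 1-D):
    the (x - mean)**2 / var term differs per group -> quadratic boundary."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def lda_score(x, mean, pooled_var):
    """Same with a pooled variance (LDA, 1-D): the x**2 term is common
    to all groups and cancels -> linear score, boundary at the midpoint
    between the two group means."""
    return x * mean / pooled_var - mean ** 2 / (2 * pooled_var)

def classify(x, params, score):
    """Assign x to the group with the highest score."""
    return max(params, key=lambda g: score(x, *params[g]))

# Equal variances: LDA boundary halfway between the means (at x = 2).
params_lda = {"A": (0.0, 1.0), "B": (4.0, 1.0)}
# Same means, different variances: QDA gives the narrow group a
# central region and the wide group both tails (two boundaries).
params_qda = {"A": (0.0, 0.25), "B": (0.0, 9.0)}
```

The heteroscedastic case shows the qualitative difference: a straight line cannot carve out the central region that the narrow group occupies.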
-
Classification rules
n New units can be classified on the basis of rules based on the calibration sample.
n Several alternative rules can be used.
q Maximum likelihood rule, based on the density function: assign unit u to group g if

f(X | g) > f(X | g')  for all g' ≠ g

q Inverse probability rule, based on the probability: assign unit u to group g if

P(X | g) > P(X | g')  for all g' ≠ g

q Posterior probability rule: assign unit u to group g if

P(g | X) > P(g' | X)  for all g' ≠ g

Where:
X is the unit vector;
g, g' are two groups;
f(X | g) is the density function of the value X for group g;
P(X | g) is the probability to emit the value X given the group g;
P(g | X) is the probability to belong to group g, given the value X.

48
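The maximum likelihood and posterior rules disagree exactly when the group priors differ, since P(g | X) is proportional to P(g) f(X | g) by Bayes' theorem. A sketch with 1-D Gaussian densities (Python for illustration; the priors, means and variances are made-up values):

```python
import math

def gauss_density(x, mean, var):
    """f(X | g): Gaussian density of x for one group."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def posterior(x, groups):
    """P(g | X) proportional to P(g) * f(X | g), normalised over groups.
    groups maps each group to (prior, mean, variance)."""
    scores = {g: prior * gauss_density(x, m, v)
              for g, (prior, m, v) in groups.items()}
    total = sum(scores.values())
    return {g: s / total for g, s in scores.items()}

# Illustration: group A is 9 times more frequent than B a priori.
groups = {"A": (0.9, 0.0, 1.0), "B": (0.1, 2.0, 1.0)}
```

At x = 1.2 the point lies closer to B's mean, so the maximum likelihood rule picks B; but A's much larger prior makes the posterior rule pick A.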
-
Posterior probability