
  • Machine learning

    DUBii - Module - Statistics with R

    Jacques van Helden (ORCID 0000-0002-8799-8584)

    Institut Français de Bioinformatique (IFB), French node of the European ELIXIR bioinformatics infrastructure

    Aix-Marseille Université (AMU), Lab. Theory and Approaches of Genomic Complexity (TAGC)

    https://www.france-bioinformatique.fr/
    https://elixir-europe.org/
    https://tagc.univ-amu.fr/

  • Brain-learning exercise: assign individuals to groups based on their features

  • Conceptual illustration with two predictor variables

    - The next slides provide higher-resolution versions of the plots, which represent a study case.

    - Exercise: intuitively assign each individual (black dot) to one of the two groups (A, B). At each step, ask yourself the following questions:

      - Which criterion did you use to assign an individual to a group?

      - How confident do you feel about each of your predictions?

      - What is the effect of the respective means?

      - What is the effect of the respective standard deviations?

      - What is the effect of the correlations between the two variables?

  • Conceptual illustration with two variables – Study case 1

    - Inspect the distribution of points for the two groups of individuals (pink, blue) in the two-dimensional feature space.

    [Figure: scatter plot of the two groups in the feature space (axes X1 and X2).]

  • Conceptual illustration with two variables – Study case 2

    - Effect of the group centre location.

    [Figure: scatter plot of the two groups in the feature space (axes X1 and X2).]

  • Conceptual illustration with two variables – Study case 3

    - Effect of the group variance.

    [Figure: scatter plot of the two groups in the feature space (axes X1 and X2).]

  • Conceptual illustration with two variables – Study case 4

    - Effect of the group variance.

    [Figure: scatter plot of the two groups in the feature space (axes X1 and X2).]

  • Conceptual illustration with two variables – Study case 5

    - Impact of the group-specific variances (heteroscedasticity of the data).

    [Figure: scatter plot of the two groups in the feature space (axes X1 and X2).]

  • Conceptual illustration with two variables – Study case 6

    - Impact of the group-specific variances (heteroscedasticity of the data).

    [Figure: scatter plot of the two groups in the feature space (axes X1 and X2).]

  • Conceptual illustration with two variables – Study case 7

    - Effect of the covariance between features.

    [Figure: scatter plot of the two groups in the feature space (axes X1 and X2).]

  • Conceptual illustration with two variables – Study case 8

    - Effect of the covariance between features.

    [Figure: scatter plot of the two groups in the feature space (axes X1 and X2).]

  • Conceptual illustration with two variables – Study case 9

    - Group-specific covariances between features: the two groups have different covariance matrices, so the clouds of points are elongated in different directions.

    - How does this difference affect group assignments?

    [Figure: scatter plot of the two groups in the feature space (axes X1 and X2).]

  • Multivariate analysis – Introduction

    Statistics Applied to Bioinformatics

    Jacques van Helden (ORCID 0000-0002-8799-8584)

    Institut Français de Bioinformatique (IFB), French node of the European ELIXIR bioinformatics infrastructure

    Aix-Marseille Université (AMU), Lab. Theory and Approaches of Genomic Complexity (TAGC)

    https://www.france-bioinformatique.fr/
    https://elixir-europe.org/
    https://tagc.univ-amu.fr/

  • Multivariate data

    - Each row represents one object (also called a unit).
    - Each column represents one variable.

                     variable 1   variable 2   ...   variable p
      individual 1      x11          x21       ...      xp1
      individual 2      x12          x22       ...      xp2
      individual 3      x13          x23       ...      xp3
      individual 4      x14          x24       ...      xp4
      individual 5      x15          x25       ...      xp5
      individual 6      x16          x26       ...      xp6
      individual 7      x17          x27       ...      xp7
      individual 8      x18          x28       ...      xp8
      ...               ...          ...       ...      ...
      individual n      x1n          x2n       ...      xpn

  • Multivariate data with an outcome variable

    - The outcome variable (also called criterion variable) can be
      - qualitative (nominal): classes (e.g. cancer type);
      - quantitative (e.g. survival expectation for a cancer patient).

                     -------- Predictor variables --------   Outcome variable
                     variable 1   variable 2   ...   variable p   variable p+1
      individual 1      x11          x21       ...      xp1            y1
      individual 2      x12          x22       ...      xp2            y2
      individual 3      x13          x23       ...      xp3            y3
      individual 4      x14          x24       ...      xp4            y4
      individual 5      x15          x25       ...      xp5            y5
      individual 6      x16          x26       ...      xp6            y6
      individual 7      x17          x27       ...      xp7            y7
      individual 8      x18          x28       ...      xp8            y8
      ...               ...          ...       ...      ...           ...
      individual n      x1n          x2n       ...      xpn            yn

  • Predictive approaches - Training set

    - The training set is used to build a predictive function.
    - This function is then used to predict the value of the outcome variable for new objects.

    Training set (predictor variables + known outcome):

                         variable 1   variable 2   ...   variable p   variable p+1 (outcome)
      individual 1          x11          x21       ...      xp1            y1
      individual 2          x12          x22       ...      xp2            y2
      individual 3          x13          x23       ...      xp3            y3
      ...                   ...          ...       ...      ...           ...
      individual N_train    x1n          x2n       ...      xpn            yn

    Set to predict (predictor variables only, outcome unknown):

                         variable 1   variable 2   ...   variable p   variable p+1 (outcome)
      individual 1          x11          x21       ...      xp1            ?
      individual 2          x12          x22       ...      xp2            ?
      individual 3          x13          x23       ...      xp3            ?
      ...                   ...          ...       ...      ...           ...
      individual N_pred     x1n          x2n       ...      xpn            ?
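The train-then-predict idea can be sketched as follows. This is a minimal illustration on simulated two-group data (the course material uses R; the sketch below uses Python with scikit-learn, and the group centres, sample sizes and the K-nearest-neighbours classifier are all illustrative choices, not taken from the course):

```python
# Sketch: fit a predictive function g = f(x) on a labelled training set,
# then apply it to new objects whose outcome is unknown.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Training set: n_train individuals x p = 2 predictor variables, nominal outcome.
X_train = np.vstack([rng.normal(0, 1, (20, 2)),     # group "A" centred at (0, 0)
                     rng.normal(3, 1, (20, 2))])    # group "B" centred at (3, 3)
y_train = np.array(["A"] * 20 + ["B"] * 20)

# Build the predictive function from the training set.
clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# New objects (outcome unknown): predict their class.
X_new = np.array([[0.1, -0.2], [3.2, 2.9]])
print(clf.predict(X_new))  # with these well-separated groups: ['A' 'B']
```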

  • Evaluation of prediction with a testing set

    Training set (known outcome):

                        variable 1   variable 2   ...   variable p   variable p+1
      individual 1         x11          x12       ...      x1p            y1
      individual 2         x21          x22       ...      x2p            y2
      individual 3         x31          x32       ...      x3p            y3
      ...                  ...          ...       ...      ...           ...
      individual ntrain    xn1          xn2       ...      xnp            yn

    Testing set (known outcome, compared with the predicted outcome):

                        variable 1   variable 2   ...   variable p   variable p+1 (known)   variable p+1 (predicted)
      individual 1         x11          x12       ...      x1p            y1                       y'1
      individual 2         x21          x22       ...      x2p            y2                       y'2
      individual 3         x31          x32       ...      x3p            y3                       y'3
      ...                  ...          ...       ...      ...           ...                       ...
      individual ntest     xn1          xn2       ...      xnp            yntest                   y'ntest

    Set to predict (unknown outcome):

                        variable 1   variable 2   ...   variable p   variable p+1
      individual 1         x11          x12       ...      x1p            ?
      individual 2         x21          x22       ...      x2p            ?
      individual 3         x31          x32       ...      x3p            ?
      ...                  ...          ...       ...      ...           ...
      individual npred     xn1          xn2       ...      xnp            ?
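Test-set evaluation can be sketched as follows: hold out part of the labelled data, predict y' on the held-out individuals, and compare with the known y. The data, split fraction and classifier below are illustrative (Python/scikit-learn stand-in, not the course's R code):

```python
# Sketch: split labelled data into training and testing sets, then compare
# the predicted outcome y' with the known outcome y on the testing set.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
y = np.array(["ALL"] * 50 + ["AML"] * 50)   # illustrative labels

# Hold out 30% of the labelled individuals as a testing set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
y_pred = clf.predict(X_test)           # y' on the testing set
acc = accuracy_score(y_test, y_pred)   # fraction of cases where y' == y
print(round(acc, 2))
```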

  • Flowchart of the approaches in multivariate analysis

    Starting from a multivariate table X:

    - Reduction of dimensions: variable selection, principal component analysis.
    - Multidimensional scaling, based on a distance matrix.

    Is there an outcome variable Y?

    - None → exploratory analysis:
      - cluster analysis → discovered classes + individual assignment;
      - visualisation → graphical representations.
    - Quantitative → regression analysis → predicted value of a quantitative variable, yest = f(x).
    - Nominal → supervised classification → assignment of individuals to predefined classes, g = f(x).

  • Quiz

    Check your understanding of the concepts presented in the previous slides by applying them to your own data.

    1. Describe in one sentence a typical case of multidimensional data that is handled in your domain.

    2. Explain how you would organise this dataset into a multivariate structure.
      - What would correspond to the individuals?
      - What would correspond to the variables?
      - How many individuals (n) would you have?
      - How many variables (p) would you have?
      - Do you have one or several outcome variable(s)?
      - If so, are they quantitative, qualitative or both?

    3. Based on the conceptual framework defined above, which kinds of approaches would you envisage to extract which kinds of relevant information from these data? Note that several approaches can be combined to address different questions.

  • Historical (vintage) examples

  • Historical example of clustering heat map

    - Spellman et al. (1998).
    - Systematic detection of genes regulated in a periodic way during the cell cycle.
    - Several experiments were regrouped, with various ways of synchronization (elutriation, cdc mutants, …).
    - ~800 genes showing a periodic pattern of expression were selected (by Fourier analysis).

    [Figure: time profiles of yeast cells followed during the cell cycle.]

    Spellman, P. T., Sherlock, G., Zhang, M. Q., Iyer, V. R., Anders, K., Eisen, M. B., Brown, P. O., Botstein, D. & Futcher, B. (1998). Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell 9, 3273-97.

  • Stress response in yeast

    - Gasch et al. (2000) tested the transcriptional response of the yeast genome to
      - various stress conditions (heat shock, osmotic shock, …);
      - drugs;
      - alternative carbon sources;
      - …
    - The heatmap shows clusters of genes having similar profiles of response to the different types of stress.

    Gasch, A. P., Spellman, P. T., Kao, C. M., Carmel-Harel, O., Eisen, M. B., Storz, G., Botstein, D. & Brown, P. O. (2000). Genomic expression programs in the response of yeast cells to environmental changes. Mol Biol Cell 11, 4241-57.

  • Cancer types (Golub, 1999)

    - Compared the expression profiles of ~7000 human genes in patients suffering from two different cancer types: ALL or AML, respectively.
    - Selected the 50 genes most correlated with the cancer type.
    - Goal: use these genes as molecular signatures for the diagnosis of new patients.

    Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D. & Lander, E. S. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, 531-7.

  • Den Boer et al., 2009: procedure

    - Data source: Den Boer et al. 2009. A subtype of childhood acute lymphoblastic leukaemia with poor treatment outcome: a genome-wide classification study. Lancet Oncol 10(2): 125-134.
    - Den Boer et al. (2009) use Affymetrix microarrays to characterize the transcriptomes of 190 acute lymphoblastic leukemia samples of different types.
    - They use these profiles to select “transcriptome signatures” that will serve diagnostic purposes: assigning new samples to one of the cancer types.
    - They apply an elaborate procedure relying on an inner and an outer loop of cross-validation.
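An inner/outer cross-validation loop of the kind described here can be sketched as follows: the inner loop tunes a parameter, while the outer loop estimates the performance of the whole tuned procedure on folds never used for tuning. This is a generic sketch on simulated data (Python/scikit-learn; the classifier, parameter grid and fold counts are illustrative, not those of Den Boer et al.):

```python
# Sketch of nested (inner/outer) cross-validation.
import numpy as np
from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (60, 5)), rng.normal(2, 1, (60, 5))])
y = np.array([0] * 60 + [1] * 60)

inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Inner loop: choose the number of neighbours by cross-validation.
tuned = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 3, 5, 7]}, cv=inner)

# Outer loop: evaluate the tuned classifier on held-out folds.
scores = cross_val_score(tuned, X, y, cv=outer)
print(scores.mean().round(2))
```

The point of the nesting is that the parameter choice is itself part of the procedure being evaluated, so it must be redone inside each outer training fold.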

  • Den Boer 2009 - The transcriptomic signature

    - Den Boer et al. 2009. A subtype of childhood acute lymphoblastic leukaemia with poor treatment outcome: a genome-wide classification study. Lancet Oncol 10(2): 125-134.
    - The training procedure selects 100 genes whose combined expression levels can be used to assign samples to cancer subtypes.
    - The heatmaps show that the selected genes are differentially expressed
      - between subtypes of the training set (left);
      - between subtypes of the testing set (right).
    - The heatmap is bi-clustered, in order to identify simultaneously groups of patients (rows) and groups of genes (columns), based on the similarity between expression profiles.

  • Supervised classification

    Statistics for bioinformatics

    Jacques van Helden

    Aix-Marseille Université (AMU), Lab. Theory and Approaches of Genomic Complexity (TAGC)

    Institut Français de Bioinformatique (IFB), French node of the European ELIXIR bioinformatics infrastructure

    https://orcid.org/0000-0002-8799-8584
    https://tagc.univ-amu.fr/
    https://www.france-bioinformatique.fr/
    https://elixir-europe.org/

  • Supervised classification - Introduction

    - In the previous chapter, we presented the problem of clustering, which consists in grouping objects without any a priori definition of the groups. The group definitions emerge from the clustering itself (class discovery). Clustering is thus unsupervised.
    - In some cases, one would like to focus on pre-defined classes:
      - classifying tissues as cancer or non-cancer;
      - classifying tissues between different cancer types;
      - classifying genes according to pre-defined functional classes (e.g. metabolic pathways, different phases of the cell cycle, ...).
    - The classifier can be built on a training set, and used later for classifying new objects. This is called supervised classification.

  • Supervised classification methods

    - There are many alternative methods for supervised classification:
      - discriminant analysis (linear: LDA, or quadratic: QDA);
      - Bayesian classifier;
      - K-nearest neighbours (KNN);
      - Support Vector Machine (SVM);
      - decision tree;
      - Random Forest (RF);
      - neural network (NN);
      - ...
    - Questions:
      - Which method should we choose?
      - How should we tune its parameters?
      - How can we evaluate the respective performances of the methods and parametric choices?

  • Supervised classification methods

    Choosing the best method is not trivial.
    - Some methods rely on strong assumptions:
      - LDA and QDA: multivariate normality;
      - LDA: all the classes have the same covariance matrix.
    - Some methods implicitly rely on the Euclidean distance (e.g. KNN).
    - Some methods require a large training set, to avoid over-fitting.
    - Global vs local classifiers:
      - global classifiers (e.g. LDA, QDA) apply the same classification rule in the whole data space; the rule is built on the whole training set;
      - local classifiers (e.g. KNN) build rules in different sub-spaces, on the basis of the neighbouring training points.
    - The choice of the method thus depends on the structure and on the size of the data sets.

    Choosing the best parameters is not trivial either.
    - KNN: number of neighbours.
    - LDA, QDA: prior/posterior probabilities.
    - SVM: kernel.
    - Decision trees.
    - RF: number of iterations.
    - ...
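The choice among methods can itself be assessed empirically. As a small illustration, the sketch below compares a global classifier (LDA) with a local one (KNN) by cross-validation on simulated data where LDA's assumption (two Gaussian classes with the same covariance matrix) holds; all data and settings are illustrative, and this is a Python/scikit-learn stand-in for the course's R tools:

```python
# Sketch: compare a global (LDA) and a local (KNN) classifier by cross-validation.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
# Two Gaussian classes sharing the same covariance matrix: LDA's assumption holds.
X = np.vstack([rng.normal(0, 1, (80, 2)), rng.normal(1.5, 1, (80, 2))])
y = np.array([0] * 80 + [1] * 80)

results = {}
for name, clf in [("LDA", LinearDiscriminantAnalysis()),
                  ("KNN", KNeighborsClassifier(n_neighbors=7))]:
    results[name] = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name}: {results[name]:.2f}")
```

On data matching its assumptions, LDA typically performs at least as well as KNN while using far fewer effective parameters; on heteroscedastic or non-Gaussian data the ranking can reverse.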

  • Study case 1: ALL versus AML (data from Golub et al., 1999)

  • Cancer types (Golub, 1999)

    - A founding paper: Golub et al. (1999).
    - Compared the expression profiles of ~7000 human genes in patients suffering from two different cancer types: ALL or AML, respectively.
    - Selected the 50 genes most correlated with the cancer type.
    - Goal: use these genes as molecular signatures for the diagnosis of new patients.

    Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D. & Lander, E. S. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, 531-7.

  • Motivation

    - The article by Golub et al. (1999) was motivated by the need to develop efficient diagnostics to predict the cancer type from blood samples of patients.
    - They proposed a “molecular signature” of cancer type, making it possible to discriminate AML from ALL.
    - This first “historical” study relied on somewhat arbitrary criteria to select the genes composing this signature, and on the way to apply them to classify new patients.
    - We present here the classical methods used in statistics to classify “objects” (patients, genes) into pre-defined classes.

  • Golub et al. (1999)

    - Data source: Golub et al. (1999). First historical publication searching for molecular signatures of cancer type.
    - Training set:
      - 38 samples from 2 types of leukemia:
        - 27 acute lymphoblastic leukemia (note: 2 subtypes, ALL-T and ALL-B);
        - 11 acute myeloid leukemia.
      - The original data set contains ~7000 genes.
      - Filtering out poorly expressed genes retains 3051 genes.
    - We re-analyze the data using different methods.
    - Selection of differentially expressed genes (DEG):
      - a Welch t-test with robust estimators (median, IQR) retains 367 differentially expressed genes with E-value = 0.

    Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D. & Lander, E. S. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, 531-7.

    [Figures: volcano plots of the Welch t-test results. Left: “volcano plot − standardization with median and IQR”; x axis: golub.t.result.robust$means.diff, y axis: golub.t.result.robust$sig. Right: difference between the means (x axis) versus standard error on the difference (y axis).]
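The gene-by-gene selection step can be sketched as follows. This uses scipy's standard Welch t-test rather than the robust (median/IQR) variant mentioned on the slide, on a simulated expression matrix with the same sample layout (27 ALL, 11 AML); the gene count, effect size and E-value threshold are illustrative:

```python
# Sketch: per-gene Welch t-test and selection by E-value
# (E-value = p-value x number of tests = expected false positives).
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
expr = rng.normal(0, 1, (200, 38))   # 200 genes x 38 samples
expr[:20, 27:] += 3                  # make the first 20 genes higher in "AML"

all_idx = np.arange(27)              # 27 ALL samples
aml_idx = np.arange(27, 38)          # 11 AML samples

# Welch version: equal_var=False (no common-variance assumption).
t, p = stats.ttest_ind(expr[:, all_idx], expr[:, aml_idx],
                       axis=1, equal_var=False)

evalue = p * expr.shape[0]           # multiple-testing correction as E-value
selected = np.where(evalue < 1)[0]   # illustrative threshold
print(len(selected))
```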

  • Golub 1999 - Profiles of selected genes

    - The 367 genes selected by the t-test have visibly different profiles:
      - some genes appear greener for the ALL patients (27 leftmost samples);
      - some genes appear greener for the AML patients (11 rightmost samples).

    [Figure: expression heatmap of the selected genes; columns labelled ALL and AML.]

  • Golub – hierarchical clustering of DEG genes / profiles

    - Hierarchical clustering perfectly separates the two cancer types (AML versus ALL).
    - This perfect separation is observed for various metrics (Euclidean distance, correlation, dot product) and agglomeration rules (complete, average, Ward).
    - Sample clustering further reveals subgroups of ALL.
    - Gene clustering reveals 4 groups of profiles:
      - AML red, ALL green;
      - AML green, ALL red;
      - overall green, stronger in AML;
      - overall red, stronger in ALL.

    [Figure: bi-clustered heatmap of the differentially expressed genes (Euclidean distance, complete linkage); columns are samples labelled AML or ALL.]
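Hierarchical clustering with Euclidean distance and complete linkage (the combination named on the slide) can be sketched on simulated ALL/AML-like profiles; the data, dimensions and tree cut below are illustrative (Python/scipy stand-in for R's hclust):

```python
# Sketch: agglomerative clustering of samples, Euclidean distance, complete linkage.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(5)
samples = np.vstack([rng.normal(0, 1, (27, 10)),   # "ALL"-like profiles
                     rng.normal(3, 1, (11, 10))])  # "AML"-like profiles

Z = linkage(samples, method="complete", metric="euclidean")
groups = fcluster(Z, t=2, criterion="maxclust")    # cut the tree into 2 clusters
print(groups)
```

With well-separated profiles, cutting the tree into two clusters recovers the two sample types; with noisier data the cut and the linkage rule both matter.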

  • Principal component analysis (PCA)

    - Principal component analysis (PCA) relies on a transformation of a multivariate table into a multi-dimensional table of “components”.
    - With the Golub dataset:
      - most of the variance is captured by the first component;
      - the first component (Y axis) clearly separates ALL from AML patients;
      - the second component splits the ALL set into two well-separated groups, which correspond almost perfectly to T-cells and B-cells, respectively.

    [Figures: PCA of the Golub dataset (golub.pca). Left: samples plotted on PC1 versus PC2, labelled by cancer type (ALL_B, ALL_T, AML). Middle: biplot with gene indices. Right: scree plot of the component variances.]
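A PCA of a samples x genes matrix can be sketched as follows, on simulated data standing in for the Golub matrix (the course uses R's PCA tools; the Python/scikit-learn code, group shift and component count below are illustrative):

```python
# Sketch: PCA of a samples x genes matrix; the first component should capture
# the dominant two-group separation.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, (27, 50)),    # "ALL"-like samples
               rng.normal(2, 1, (11, 50))])   # "AML"-like samples, shifted

pca = PCA(n_components=5)
scores = pca.fit_transform(X)                 # sample coordinates on PC1..PC5

# Share of variance captured by each component (the scree plot).
print(pca.explained_variance_ratio_.round(2))
```

Plotting the first column of `scores` against the second gives the kind of PC1/PC2 sample map shown on the slide.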

  • Study case 2: ALL subtypes (data from Den Boer et al., 2009)

  • Den Boer et al., 2009: procedure

    - Den Boer et al. (2009) use Affymetrix microarrays to characterize the transcriptomes of 190 acute lymphoblastic leukemia samples of different types.
    - They use these profiles to select “transcriptome signatures” that will serve diagnostic purposes: assigning new samples to one of the cancer types.
    - They apply an elaborate procedure relying on an inner and an outer loop of cross-validation.
    - Data source: Den Boer et al. 2009. A subtype of childhood acute lymphoblastic leukaemia with poor treatment outcome: a genome-wide classification study. Lancet Oncol 10(2): 125-134.

    Sample counts per subtype:

      hyperdiploid                 44
      pre-B ALL                    44
      TEL-AML1                     43
      T-ALL                        36
      E2A-rearranged (EP)           8
      BCR-ABL                       4
      E2A-rearranged (E-sub)        4
      MLL                           4
      BCR-ABL + hyperdiploidy       1
      E2A-rearranged (E)            1
      TEL-AML1 + hyperdiploidy      1

  • Den Boer 2009 - The transcriptomic signature

    - The training procedure selects 100 genes whose combined expression levels can be used to assign samples to cancer subtypes.
    - The heatmaps show that the selected genes are differentially expressed
      - between subtypes of the training set (left);
      - between subtypes of the testing set (right).
    - Den Boer et al. 2009. A subtype of childhood acute lymphoblastic leukaemia with poor treatment outcome: a genome-wide classification study. Lancet Oncol 10(2): 125-134.

  • Den Boer 2009 - Accuracy of the classifier

    - The signature has an excellent diagnostic value: for the well-represented cancer types, the sensitivity and specificity are >90%.
    - Note: accuracy can be misleading: some subtypes reach 98% accuracy with 0% sensitivity.
    - Den Boer et al. 2009. A subtype of childhood acute lymphoblastic leukaemia with poor treatment outcome: a genome-wide classification study. Lancet Oncol 10(2): 125-134.

    Sn = TP / (TP + FN)
    Sp = TN / (TN + FP)
    PPV = TP / (TP + FP)
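These formulas can be checked on a toy confusion matrix that reproduces the slide's caveat: for a rare subtype that the classifier always predicts as negative, sensitivity is 0% while accuracy stays high. The counts below are hypothetical, chosen only to make the point:

```python
# Hypothetical counts: 2 true positives in 100 samples, classifier predicts
# everything negative.
TP, FN = 0, 2
TN, FP = 98, 0

Sn = TP / (TP + FN)                    # sensitivity
Sp = TN / (TN + FP)                    # specificity
acc = (TP + TN) / (TP + TN + FP + FN)  # accuracy
# PPV = TP / (TP + FP) is undefined here: no positive predictions were made.
print(Sn, Sp, acc)                     # 0.0 1.0 0.98
```

Sensitivity is 0 (both positives are missed), yet accuracy is 0.98, because the negatives dominate the count.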

  • Den Boer 2009 - Exploring some profiles

    n Left: expression for 2 genes selected at random. Each symbol represents one sample, coloured by cancer type. All cancer types are intermingled.

    n Right: expression of the 2 genes with the highest sample-wise variance. The first gene (CD9) separates cell types T and Bt (low expression) from Bh, Bep, Br (high expression). Bo is dispersed over the whole range.

    n Question: how can we identify a combination of genes that discriminates the different subtypes as well as possible?


    [Figure: scatter plots of the Den Boer (2009) samples in gene-expression space, each symbol labelled by subtype (T, Bt, Bh, BEp, BEs, Bc, BM, Bo). Left panel: two randomly selected genes (CCDC28A|209479_at vs TCF3|215260_s_at). Right panel: the 2 genes with the highest variance (CD9|201005_at vs IL23A|211796_s_at).]

  • Supervised classification: methodological principles

    Statistical Analysis of Microarray Data

  • Multivariate data with a nominal criterion variable

    n We have a set of objects (the sample) that have been previously assigned to predefined classes.

    n Each object is characterized by a series of quantitative variables (the predictors), and its class is indicated in a separate column (the criterion variable).


                Predictor variables                            Criterion variable
                variable 1   variable 2   ...   variable p     class
    object 1    x1,1         x2,1         ...   xp,1           A
    object 2    x1,2         x2,2         ...   xp,2           A
    object 3    x1,3         x2,3         ...   xp,3           A
    ...         ...          ...          ...   ...            ...
    object i    x1,i         x2,i         ...   xp,i           B
    object i+1  x1,i+1       x2,i+1       ...   xp,i+1         B
    object i+2  x1,i+2       x2,i+2       ...   xp,i+2         B
    ...         ...          ...          ...   ...            ...
    object n-1  x1,n-1       x2,n-1       ...   xp,n-1         K
    object n    x1,n         x2,n         ...   xp,n           K
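
In R, such a dataset is naturally stored as a data frame: one column per predictor variable, plus a factor holding the criterion variable. A minimal sketch with made-up values:

```r
# Toy dataset: 6 objects, 2 predictor variables, one nominal criterion variable
dataset <- data.frame(
  x1    = c(1.2, 0.8, 1.1, 3.4, 3.9, 3.6),
  x2    = c(2.1, 2.4, 1.9, 5.2, 4.8, 5.5),
  class = factor(c("A", "A", "A", "B", "B", "B"))
)

predictors <- dataset[, c("x1", "x2")]  # quantitative variables
criterion  <- dataset$class             # predefined class of each object
table(criterion)                        # objects per class
```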

  • Supervised classification – training and prediction

    n Training phase (training + evaluation)
        q The sample is used to build a discriminant function
        q The quality of the discriminant function is evaluated

    n Prediction phase
        q The discriminant function is used to predict the value of the criterion variable for new objects


    Training set (criterion variable known):

                   Predictor variables                         Criterion variable
                   variable 1   variable 2   ...   variable p  class
    object 1       x11          x21          ...   xp1         A
    object 2       x12          x22          ...   xp2         A
    object 3       x13          x23          ...   xp3         B
    ...            ...          ...          ...   ...         ...
    object ntrain  x1n          x2n          ...   xpn         K

    New objects (criterion variable to predict):

                   Predictor variables                         Criterion variable
                   variable 1   variable 2   ...   variable p  class
    object 1       x11          x21          ...   xp1         ?
    object 2       x12          x22          ...   xp2         ?
    object 3       x13          x23          ...   xp3         ?
    ...            ...          ...          ...   ...         ...
    object npred   x1n          x2n          ...   xpn         ?
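
The two phases can be sketched in R, using the built-in iris data as a stand-in sample and MASS::lda() as an example classifier (any other method would follow the same pattern):

```r
library(MASS)  # provides lda()

set.seed(42)
train_idx <- sample(nrow(iris), 100)
train <- iris[train_idx, ]              # objects of known class
new_objects <- iris[-train_idx, 1:4]    # predictors only: class to be predicted

fit  <- lda(Species ~ ., data = train)     # training phase
pred <- predict(fit, new_objects)$class    # prediction phase
head(pred)
```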

  • Supervised classification: steps of the general procedure

    n Training (individuals of known class): the training set is used to train a classifier.

    n Testing (individuals of known class): the trained classifier predicts the class of each individual in the testing set; comparing the predicted and known classes yields a confusion table, used for evaluation. A classifier that passes this evaluation is a validated classifier.

    n Prediction (individuals of unknown class): the validated classifier is used to predict their class.
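
The evaluation step, comparing predicted and known classes on a testing set, is a one-liner with table(). A self-contained R sketch, using iris and MASS::lda() as stand-ins for the data and the classifier:

```r
library(MASS)

set.seed(1)
idx   <- sample(nrow(iris), 100)
train <- iris[idx, ]    # training set: known classes
test  <- iris[-idx, ]   # testing set: known classes, hidden from training

fit       <- lda(Species ~ ., data = train)          # training
predicted <- predict(fit, test)$class                # prediction on testing set
confusion <- table(predicted, known = test$Species)  # confusion table
confusion
accuracy <- sum(diag(confusion)) / sum(confusion)    # fraction on the diagonal
```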

  • Discriminant analysis

    Statistical Analysis of Microarray Data

  • Linear or quadratic discriminant analysis (LDA vs QDA)

    n Equal covariance matrices between groups?
        q Linear Discriminant Analysis (LDA) is appropriate (green lines on the graph).
        q The discrimination rule amounts to drawing a straight line between the gravity centers of the training groups.

    n Different covariance matrices?
        q Quadratic Discriminant Analysis (QDA) is recommended (red boundaries on the graphs).
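
Both analyses are available in R's MASS package. A minimal sketch fitting each on the same data (iris as a stand-in for the study case; the decision boundaries themselves are not plotted here):

```r
library(MASS)

lda_fit <- lda(Species ~ ., data = iris)  # pooled covariance: linear boundaries
qda_fit <- qda(Species ~ ., data = iris)  # per-group covariances: quadratic boundaries

# Resubstitution error rates (optimistic; shown only to compare the two fits)
err_lda <- mean(predict(lda_fit)$class != iris$Species)
err_qda <- mean(predict(qda_fit)$class != iris$Species)
c(lda = err_lda, qda = err_qda)
```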

  • Classification rules

    n New units can be classified on the basis of rules derived from the calibration sample.

    n Several alternative rules can be used:
        q Maximum likelihood rule, based on the density function: assign unit u to group g if f(X | g) > f(X | g') for g' ≠ g
        q Inverse probability rule, based on the probability: assign unit u to group g if P(X | g) > P(X | g') for g' ≠ g
        q Posterior probability rule: assign unit u to group g if P(g | X) > P(g' | X) for g' ≠ g

    n Where
        X is the unit vector,
        g, g' are two groups,
        f(X | g) is the density function of the value X for group g,
        P(X | g) is the probability to emit the value X given group g,
        P(g | X) is the probability to belong to group g, given the value X.
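
For discriminant analysis in R, predict() returns exactly the quantity needed for the posterior probability rule. A sketch on iris (used as a stand-in sample):

```r
library(MASS)

fit  <- lda(Species ~ ., data = iris)
post <- predict(fit)$posterior  # matrix of P(g | X): one row per unit, one column per group

# Posterior probability rule: assign each unit to the group with the highest P(g | X)
assigned <- colnames(post)[max.col(post)]
head(assigned)
```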

  • Posterior probability