
Page 1: Machine learning in bioinformatics and drug discovery (members.cbio.mines-paristech.fr/~jvert/talks/091116curie/...)

Machine learning in bioinformatics and drug discovery

Jean-Philippe Vert, [email protected]

Mines ParisTech / Institut Curie / Inserm

Workshop on Bioinformatics for Medical and Pharmaceutical Research, Institut Curie, Paris, November 16, 2009.

Jean-Philippe Vert (ParisTech-Curie) Machine learning in bioinformatics 1 / 43


Where we are

A joint lab about “Cancer computational genomics, bioinformatics, biostatistics and epidemiology”
Located in the Institut Curie, a major hospital and cancer research institute in Europe


Statistical machine learning for cancer informatics

Main topics
- Towards better diagnosis, prognosis, and personalized medicine: supervised classification of genomic, transcriptomic, proteomic data; heterogeneous data integration.
- Towards new drug targets: systems biology, reconstruction of gene networks, pathway enrichment analysis, multidimensional phenotyping of cell populations.
- Towards new drugs: ligand-based virtual screening, in silico chemogenomics.


Towards personalized medicine: diagnosis/prognosis from genome/transcriptome

From Golub et al., Science, 1999.

Towards new drug targets: inference of biological networks

From Mordelet and Vert, Bioinformatics, 2008.

Towards new drugs: ligand-based virtual screening and QSAR

(Figure: six molecular structures labelled “active” or “inactive”.)

NCI AIDS screen results (from http://cactus.nci.nih.gov).


Pattern recognition, aka supervised classification

Challenges
- High dimension
- Few samples
- Structured data
- Prior knowledge
- Fast and scalable implementations
- Interpretable models


Linear classifiers

The model
- Each sample is represented by a vector x = (x_1, . . . , x_p).
- Goal: from a training set of samples with known labels, estimate a linear function

  f_\beta(x) = \sum_{i=1}^{p} \beta_i x_i + \beta_0 ,

  whose sign is a good predictor.
- Interpretability: the weight \beta_i quantifies the influence of feature i (but...)
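As a minimal sketch of this decision rule (all names and numbers below are illustrative, not from the talk), the prediction is just the sign of an inner product plus an offset:

```python
import numpy as np

def predict(beta, beta0, x):
    """Predict the class of sample x as the sign of f_beta(x) = <beta, x> + beta0."""
    return 1 if float(np.dot(beta, x) + beta0) >= 0 else -1

beta = np.array([0.5, -1.0, 0.0])   # one weight per feature (toy values)
x = np.array([2.0, 0.5, 3.0])       # one sample with p = 3 features
label = predict(beta, 0.1, x)       # score = 1.0 - 0.5 + 0.0 + 0.1 = 0.6 > 0
```

Here the zero weight on the third feature means that feature has no influence on the prediction, which is what makes the weights interpretable.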


Estimating a linear classifier

We have a training set of samples (x^{(1)}, . . . , x^{(n)}) with known classes (y^{(1)}, . . . , y^{(n)}).
For any candidate set of weights \beta = (\beta_1, . . . , \beta_p), we quantify how "good" the linear function f_\beta is on the training set with some average loss, e.g.,

R(\beta) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(f_\beta(x^{(i)}), y^{(i)}\big) .

We choose the \beta that achieves the minimum risk, subject to some constraint on \beta, e.g.:

\Omega(\beta) \leq C .
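For concreteness, here is a sketch of this program with squared loss and the penalized (Lagrangian) form of the constraint Ω(β) = ‖β‖²; the data and helper name are hypothetical, and the closed-form solution is specific to this choice of loss and penalty:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize (1/n) sum_i (y_i - <beta, x_i>)^2 + lam * ||beta||^2:
    penalized empirical risk minimization with squared loss, closed form."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))                 # n = 50 samples, p = 5 features
beta_true = np.array([1.0, -2.0, 0.0, 0.0, 3.0])
y = X @ beta_true                                # noiseless labels for illustration
beta_hat = ridge_fit(X, y, lam=1e-6)             # tiny penalty: recovers beta_true
```

Increasing `lam` (i.e., decreasing C) shrinks the weights toward zero, which is the regularization effect discussed on the next slide.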


Importance of the constraint Ω(β) ≤ C

Why it is necessary
- Prevents overfitting (especially when n is small)
- Helps to overcome numerical issues (regularization)

Why it is useful
- Can lead to efficient implementations (convexification)
- A good place to put prior knowledge!


Outline

1 Gene selection for transcriptomic signatures

2 Prognosis from array CGH data

3 Pathway signatures


Tissue profiling with DNA chips

Data
- Gene expression measures for more than 10k genes
- Measured typically on fewer than 100 samples of two (or more) different classes (e.g., different tumors)


Tissue classification from microarray data

Goal
- Design a classifier to automatically assign a class to future samples from their expression profile
- Interpret biologically the differences between the classes

Difficulty
- Large dimension
- Few samples


Gene signature

The idea
- We look for a limited set of genes that are sufficient for prediction.
- Equivalently, the linear classifier will be sparse.

Motivations
- Bet on sparsity: we believe the "true" model is sparse.
- Interpretation: we will get a biological interpretation more easily by looking at the selected genes.
- Accuracy: by restricting the class of classifiers, we "increase the bias" but "decrease the variance". This should be helpful in high dimensions (it is better to estimate a wrong model well than to estimate a good model badly).


Example: MAMMAPRINT


How to estimate a sparse linear model?

Best subset selection
We look for a sparse weight vector \beta by solving the problem:

\min_\beta R(f_\beta) \quad \text{s.t.} \quad \|\beta\|_0 \leq k .

- This is usually an NP-hard problem, feasible only for p up to about 30 or 40.
- The state of the art is branch-and-bound optimization, known as "leaps and bounds" for least squares (Furnival and Wilson, 1974).
- Not useful in practice for us...
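To make the combinatorial nature concrete, here is a brute-force best-subset search with squared loss (toy data; the helper name is hypothetical). With p features there are C(p, k) subsets to try, which is exactly why the exact problem does not scale to microarray dimensions:

```python
import numpy as np
from itertools import combinations

def best_subset(X, y, k):
    """Exhaustive best-subset selection: least-squares fit on every
    size-k subset of features, keep the one with the smallest RSS."""
    best, best_rss = None, np.inf
    for subset in combinations(range(X.shape[1]), k):
        Xs = X[:, subset]
        beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        rss = float(np.sum((y - Xs @ beta) ** 2))
        if rss < best_rss:
            best, best_rss = subset, rss
    return set(best)

rng = np.random.default_rng(4)
X = rng.standard_normal((40, 8))   # p = 8 is tractable; p = 10000 genes is hopeless
y = X[:, 2] + 3.0 * X[:, 5]        # only features 2 and 5 matter here
```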


Efficient feature selection

To work with more variables, we must use different methods. The state of the art is split among:
- Filter methods: the predictors are preprocessed and ranked from the most relevant to the least relevant; the subsets are then obtained from this list, starting from the top.
- Wrapper methods: the feature selection is iterative, and uses the ERM algorithm in the inner loop.
- Embedded methods: the feature selection is part of the ERM algorithm itself (see the shrinkage estimators later).


Filter methods

Associate a score S(i) to each feature i, then rank the features by decreasing score. Many scores / criteria can be used:
- Loss of the ERM trained on a single feature
- Statistical tests (Fisher, t-test)
- Other performance criteria of the ERM restricted to a single feature (AUC, ...)
- Information-theoretic criteria (mutual information, ...)

Pros
- Simple, scalable, good empirical success

Cons
- Selection of redundant features
- Some variables that are useless alone can become useful together
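A sketch of such a filter, using a per-feature two-sample t-statistic as the score S(i) (the data and function name are made up for illustration):

```python
import numpy as np

def t_scores(X, y):
    """Absolute Welch t-statistic of each feature between classes 0 and 1:
    a typical single-feature filter score."""
    X0, X1 = X[y == 0], X[y == 1]
    diff = X1.mean(axis=0) - X0.mean(axis=0)
    se = np.sqrt(X0.var(axis=0, ddof=1) / len(X0)
                 + X1.var(axis=0, ddof=1) / len(X1))
    return np.abs(diff / se)

rng = np.random.default_rng(1)
y = np.array([0] * 20 + [1] * 20)
X = rng.standard_normal((40, 100))     # 100 "genes", 40 samples
X[y == 1, 0] += 3.0                    # gene 0 is truly differential
ranking = np.argsort(-t_scores(X, y))  # most relevant genes first
```

Note that each gene is scored on its own, which is exactly why redundant genes are kept and jointly informative genes can be missed.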


Wrapper methods

Forward stepwise selection
- Start from no features
- Sequentially add to the model the feature that most improves the fit

Backward stepwise selection (if n > p)
- Start from all features
- Sequentially remove from the model the feature that least degrades the fit

Other variants
- Hybrid stepwise selection strategies that consider both forward and backward moves at each stage, and make the "best" move
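A sketch of forward stepwise selection, with a least-squares fit playing the role of the inner ERM (toy data; the helper name is hypothetical):

```python
import numpy as np

def forward_select(X, y, k):
    """Greedy forward selection: repeatedly add the feature whose
    inclusion most reduces the residual sum of squares."""
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(k):
        def rss_with(j):
            Xs = X[:, selected + [j]]
            beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
            return float(np.sum((y - Xs @ beta) ** 2))
        best = min(remaining, key=rss_with)
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(2)
X = rng.standard_normal((60, 20))
y = 2.0 * X[:, 3] - 1.5 * X[:, 7]   # the true model uses features 3 and 7
```

Each step refits the model once per remaining feature, so the cost is roughly k passes over p model fits, far cheaper than the exhaustive search but with no optimality guarantee.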


Embedded methods (LASSO)

\min_\beta \; R(\beta) + \sum_{i=1}^{p} |\beta_i| .
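The ℓ1-penalized problem can be minimized by coordinate descent, where each coordinate update is a soft-thresholding step. The sketch below solves the scaled form (1/2n)‖y − Xβ‖² + λ‖β‖₁ on toy data (names and data are illustrative, and this is one of several possible solvers):

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward 0 by t and clip at 0: the prox of t * |.|."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_sweeps=300):
    """Coordinate descent for min_beta (1/2n)||y - X beta||^2 + lam * ||beta||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_norm = (X ** 2).sum(axis=0) / n
    for _ in range(n_sweeps):
        for j in range(p):
            resid = y - X @ beta + X[:, j] * beta[j]  # residual without feature j
            z = X[:, j] @ resid / n
            beta[j] = soft_threshold(z, lam) / col_norm[j]
    return beta

rng = np.random.default_rng(3)
X = rng.standard_normal((100, 10))
beta_true = np.zeros(10)
beta_true[0] = 1.5
beta_true[4] = -2.0
y = X @ beta_true
```

For large λ every coordinate is thresholded to exactly zero, which is the mechanism behind the sparsity of the LASSO.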


Why LASSO leads to sparse solutions


Outline

1 Gene selection for transcriptomic signatures

2 Prognosis from array CGH data

3 Pathway signatures


Chromosomic aberrations in cancer


Comparative Genomic Hybridization (CGH)

Motivation
- Comparative genomic hybridization (CGH) data measure the DNA copy number along the genome
- Very useful, in particular in cancer research
- Can we classify CGH arrays for diagnosis or prognosis purposes?


Aggressive vs non-aggressive melanoma

(Figure: array CGH copy-number profiles of ten melanoma samples, log-ratio plotted along roughly 2,500 probe positions.)

Example: CGH array classification

Prior knowledge
For a CGH profile x = (x_1, . . . , x_p), we focus on linear classifiers, i.e., the sign of:

f(x) = \sum_{i=1}^{p} \beta_i x_i .

We expect \beta to be:
- sparse: not all positions should be discriminative
- piecewise constant: within a selected region, all probes should contribute equally


A penalty for CGH array classification

The fused LASSO penalty (Tibshirani et al., 2005)

\Omega_{\text{fusedlasso}}(\beta) = \sum_{i} |\beta_i| + \sum_{i \sim j} |\beta_i - \beta_j| .

- The first term leads to sparse solutions
- The second term leads to piecewise constant solutions
- Combined with a hinge loss, this leads to a fused SVM (Rapaport et al., 2008)
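The penalty is easy to compute when the neighbours i ∼ j are adjacent probes along the genome; a quick numerical illustration on toy vectors:

```python
import numpy as np

def fused_lasso_penalty(beta):
    """sum_i |beta_i|  +  sum over adjacent probes |beta_i - beta_{i+1}|."""
    return float(np.abs(beta).sum() + np.abs(np.diff(beta)).sum())

flat = np.array([0.0, 0.0, 1.0, 1.0, 1.0, 0.0])    # sparse and piecewise constant
wiggly = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0])  # same L1 norm, many jumps
```

Both vectors have ℓ1 norm 3, but the piecewise-constant profile pays far less for its jumps (penalty 5 versus 8), which is exactly the structure we expect from copy-number data.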


Application: metastasis prognosis in melanoma

(Figure: two classifier weight profiles along the genome; horizontal axis BAC index, vertical axis weight.)

Outline

1 Gene selection for transcriptomic signatures

2 Prognosis from array CGH data

3 Pathway signatures


Motivation

Challenging the idea of gene signature
- We often observe little stability in the genes selected...
- Is gene selection the most biologically relevant hypothesis?
- What about thinking instead of "pathway" or "module" signatures?


Gene networks

(Figure: global KEGG gene network; annotated modules include glycan biosynthesis, protein kinases, DNA and RNA polymerase subunits, glycolysis/gluconeogenesis, oxidative phosphorylation / TCA cycle, and several amino-acid and cofactor metabolism pathways.)

Gene networks and expression data

Motivation
- Basic biological functions usually involve the coordinated action of several proteins:
  - formation of protein complexes
  - activation of metabolic, signalling or regulatory pathways
- Many pathways and protein-protein interactions are already known
- Hypothesis: the weights of the classifier should be “coherent” with respect to this prior knowledge


Graph based penalty

Prior hypothesis
Genes near each other on the graph should have similar weights.

Two solutions (Rapaport et al., 2007, 2008):

\Omega_{\text{spectral}}(\beta) = \sum_{i \sim j} (\beta_i - \beta_j)^2 ,

\Omega_{\text{graphfusion}}(\beta) = \sum_{i \sim j} |\beta_i - \beta_j| + \sum_{i} |\beta_i| .
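The spectral penalty is the quadratic form of the graph Laplacian of the gene network, which is what connects it to the spectral filtering used later. A small check on a toy graph (the edges are hypothetical):

```python
import numpy as np

edges = [(0, 1), (1, 2), (3, 4)]   # toy gene graph: the i ~ j pairs
p = 5
L = np.zeros((p, p))               # combinatorial graph Laplacian
for i, j in edges:
    L[i, i] += 1.0
    L[j, j] += 1.0
    L[i, j] -= 1.0
    L[j, i] -= 1.0

def omega_spectral(beta):
    """sum over edges of (beta_i - beta_j)^2, computed as beta^T L beta."""
    return float(beta @ L @ beta)

beta = np.array([1.0, 1.0, 1.0, 0.0, 2.0])   # constant on the module {0, 1, 2}
```

A weight vector that is constant on each connected module pays nothing inside the module; only the 3–4 edge contributes here.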


Classifiers (Rapaport et al.)


Fig. 4. Global connection map of KEGG with mapped coefficients of the decision function obtained by applying a customary linear SVM (left) and using high-frequency eigenvalue attenuation (80% of high-frequency eigenvalues have been removed) (right). Spectral filtering divided the whole network into modules having coordinated responses, with the activation of low-frequency eigen modes being determined by microarray data. Positive coefficients are marked in red, negative coefficients are in green, and the intensity of the colour reflects the absolute values of the coefficients. Rhombuses highlight proteins participating in the Glycolysis/Gluconeogenesis KEGG pathway. Some other parts of the network are annotated, including big highly connected clusters corresponding to protein kinases and DNA and RNA polymerase sub-units.

5 DISCUSSION

Our algorithm groups predictor variables according to highly connected "modules" of the global gene network. We assume that the genes within a tightly connected network module are likely to contribute similarly to the prediction function because of the interactions between the genes. This motivates the filtering of gene expression profiles to remove the noisy high-frequency modes of the network.

Such grouping of variables is a very useful feature of the resulting classification function because the function becomes meaningful for interpreting and suggesting biological factors that cause the class separation. This allows classifications based on functions, pathways and network modules rather than on individual genes. This can lead to a more robust behaviour of the classifier in independent tests and to equal if not better classification results. Our results on the dataset we analysed show only a slight improvement, although this may be due to its limited size. Therefore we are currently extending our work to larger data sets.

An important remark to bear in mind when analyzing pictures such as figs. 4 and 5 is that the colors represent the weights of the classifier, and not gene expression levels. There is of course a relationship between the classifier weights and the typical expression levels of genes in irradiated and non-irradiated samples: irradiated samples tend to have expression profiles positively correlated with the classifier, while non-irradiated samples tend to be negatively correlated. Roughly speaking, the classifier tries to find a smooth function that has this property. If more samples were available, a better non-smooth classifier might be learned by the algorithm, but constraining the smoothness of the classifier is a way to reduce the complexity of the learning problem when a limited number of samples are available. This means in particular that the pictures provide virtually no information regarding the over-

Classifier: spectral analysis of gene expression profiles using gene networks

Fig. 5. The glycolysis/gluconeogenesis pathways of KEGG with mapped coefficients of the decision function obtained by applying a customary linear SVM (a) and using high-frequency eigenvalue attenuation (b). The pathways are mutually exclusive in a cell, as clearly highlighted by our algorithm.

or under-expression of individual genes, which is the cost to pay to obtain instead an interpretation in terms of more global pathways. Constraining the classifier to rely on just a few genes would have a similar effect of reducing the complexity of the problem, but would lead to a more difficult interpretation in terms of pathways.

An advantage of our approach over other pathway-based clustering methods is that we consider the network modules that naturally appear from spectral analysis rather than a historically defined separation of the network into pathways. Thus, pathway cross-talk is taken into account, which is difficult to do using other approaches. It can however be noticed that the implicit decomposition into pathways that we obtain is biased by the very incomplete knowledge of the network, and that certain regions of the network are better understood, leading to a higher connection concentration.

Like most approaches aiming at comparing expression data with gene networks such as KEGG, the scope of this work is limited by two important constraints. First, the gene network we use is only a convenient but rough approximation to describe complex biochemical processes; second, the transcriptional analysis of a sample cannot give any information regarding post-transcriptional regulation and modifications. Nevertheless, we believe that our basic assumptions remain valid, in that we assume that the expression of the genes belonging to the same metabolic pathway module are coordinately regulated. Our interpretation of the results supports this assumption.

Another important caveat is that we simplify the network description as an undirected graph of interactions. Although this would seem to be relevant for simplifying the description of metabolic networks, real gene regulation networks are influenced by the direction, sign and importance of the interaction. Although the incorporation of weights into the Laplacian (equation 1) is straightforward and allows the extension of the approach to weighted undirected graphs, the incorporation of directions and signs to represent signalling or regulatory pathways requires more work, but could lead to important advances for the interpretation of microarray data in cancer studies, for example.

ACKNOWLEDGMENTS
This work was supported by the grant ACI-IMPBIO-2004-47 of the French Ministry for Research and New Technologies.

Example: finding discriminant modules in gene networks

Prior hypothesis
Genes near each other on the graph should have non-zero weights (i.e., the support of \beta should be made of a few connected components).

Two solutions?

\Omega_{\text{intersection}}(\beta) = \sum_{i \sim j} \sqrt{\beta_i^2 + \beta_j^2} ,

\Omega_{\text{union}}(\beta) = \sup_{\alpha \in \mathbb{R}^p \,:\, \forall i \sim j,\ \alpha_i^2 + \alpha_j^2 \leq 1} \alpha^\top \beta .


Example: finding discriminant modules in gene networks

Groups (1,2) and (2,3). Left: \Omega_{\text{intersection}}(\beta). Right: \Omega_{\text{union}}(\beta). Vertical axis is \beta_2.


Graph lasso vs kernel on graph

Graph lasso:

\Omega_{\text{graph lasso}}(w) = \sum_{i \sim j} \sqrt{w_i^2 + w_j^2} ,

constrains the sparsity, not the values.

Graph kernel:

\Omega_{\text{graph kernel}}(w) = \sum_{i \sim j} (w_i - w_j)^2 ,

constrains the values (smoothness), not the sparsity.
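A numerical contrast of the two penalties on a toy chain graph (the vectors and edges are illustrative):

```python
import numpy as np

edges = [(0, 1), (1, 2), (2, 3)]   # a 4-gene chain

def omega_graph_lasso(w):
    """sum_{i~j} sqrt(w_i^2 + w_j^2): encourages entire edges to be zero."""
    return float(sum(np.sqrt(w[i] ** 2 + w[j] ** 2) for i, j in edges))

def omega_graph_kernel(w):
    """sum_{i~j} (w_i - w_j)^2: encourages smoothness along edges."""
    return float(sum((w[i] - w[j]) ** 2 for i, j in edges))

smooth_w = np.array([0.5, 0.5, 0.5, 0.5])   # constant: maximally smooth, nowhere zero
sparse_w = np.array([1.0, 1.0, 0.0, 0.0])   # supported on one edge, with a jump
```

The kernel penalty vanishes on the constant vector but charges the sparse one for its jump, while the graph lasso penalty only cares about which (w_i, w_j) pairs vanish, not about how abruptly the values change.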


Preliminary results

Breast cancer data
- Gene expression data for 8,141 genes in 295 breast cancer tumors.
- Canonical pathways from MSigDB containing 639 groups of genes, 637 of which involve genes from our study.

METHOD        | ℓ1                 | Ω_group
ERROR         | 0.38 ± 0.04        | 0.36 ± 0.03
♯ PATH.       | 148, 58, 183       | 6, 5, 78
PROP. PATH.   | 0.32, 0.14, 0.41   | 0.01, 0.01, 0.17

Graph on the genes.

METHOD          | ℓ1            | Ω_graph
ERROR           | 0.39 ± 0.04   | 0.36 ± 0.01
AV. SIZE C.C.   | 1.1, 1, 1.0   | 1.3, 1.4, 1.2


Conclusion

- Machine learning provides many solutions for the analysis of high-throughput data (more examples later...)
- The development of dedicated methods is increasingly important to overcome the challenges (few samples, high dimension, structures...)
- This increasingly requires tight collaboration with domain experts


Thanks!

Franck Rapaport, Emmanuel Barillot, Andrei Zynoviev, Laurent Jacob, Anne-Claire Haury (Institut Curie / Mines ParisTech)
Guillaume Obozinski (UC Berkeley / INRIA)

... and the INSERM-JSPS grant for collaborative research, which supports this workshop.
