Metabolic Network Inference from Multiple Types of Genomic Data

Yoshihiro Yamanishi

Centre de Bio-informatique,
Ecole des Mines de Paris

Outline

Motivation: metabolic network

Method: network inference
- Supervised network inference
- Multiple data integration

Application
- Global network prediction

Concluding remarks

Metabolic network

The metabolic network consists of enzyme proteins and chemical compounds.

- 6018 genes in the yeast genome
- 1120 genes with EC numbers
- 668 genes with pathway information

(in KEGG as of Sep. 2004)

Problem: unknown parts of pathways and many missing enzyme genes

Network inference methods

For gene regulatory networks
- Bayesian network (Friedman et al., 2000; Imoto et al., 2002)
- Boolean network (Akutsu et al., 2000)
- Graphical modeling (Toh et al., 2001)

For protein interaction networks
- Joint graph method (Marcotte et al., 1999)
- Mirror tree method (Pazos et al., 2001)

Objectives

- Develop a method to infer metabolic gene networks in a supervised context
- Integrate heterogeneous genomic data in the framework of network inference
- Reconstruct unknown pathways and identify genes for missing enzymes

Kernel in this study

Kernel $k(x, x')$: representation of the similarity between two genes $x$ and $x'$ (e.g., correlation coefficient)

Kernel matrix: similarity matrix of a set of $N$ genes $x_1, x_2, \ldots, x_N$

$$K_{ij} = k(x_i, x_j), \quad i, j = 1, 2, \ldots, N$$

An example of the kernel

Suppose we have a set of genes $x_1, x_2, \ldots, x_N$ and represent them by gene expression profiles, for example

$$x_1 = (0.1, 0.4, 0.2, 0.3)^\top, \quad x_2 = (0.2, 0.3, 0.3, 0.2)^\top$$

$$K(x_1, x_2) = \langle x_1, x_2 \rangle = 0.1 \times 0.2 + 0.4 \times 0.3 + 0.2 \times 0.3 + 0.3 \times 0.2 = 0.26$$

An example of kernel matrix

Kernel matrix:

$$K = \begin{pmatrix} K(x_1, x_1) & K(x_1, x_2) \\ K(x_2, x_1) & K(x_2, x_2) \end{pmatrix} = \begin{pmatrix} 0.3 & 0.26 \\ 0.26 & 0.26 \end{pmatrix}$$

This can be regarded as a kind of similarity matrix.
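The worked example above can be checked numerically. The following is a minimal sketch using NumPy; the function name `linear_kernel_matrix` is introduced here for illustration, and the inner-product kernel is the one used in the slide example.

```python
import numpy as np

# Expression profiles of the two genes from the slide example.
x1 = np.array([0.1, 0.4, 0.2, 0.3])
x2 = np.array([0.2, 0.3, 0.3, 0.2])

def linear_kernel_matrix(X):
    """Return K with K[i, j] = <x_i, x_j> for the rows x_i of X."""
    return X @ X.T

X = np.vstack([x1, x2])       # one gene per row
K = linear_kernel_matrix(X)   # 2 x 2 kernel (similarity) matrix
print(np.round(K, 2))
```

The entries agree with the matrix on the slide: 0.3 and 0.26 on the diagonal, 0.26 off the diagonal.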

Direct network inference

Assumption: connected proteins in the network share high similarity in the data

[Figure: similarity matrix of genes 1 to 9 based on a genomic dataset, shown next to the configuration of the genes; in the predicted network, gene pairs with high similarity are connected]
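One way to read the direct approach is as thresholding of the similarity matrix. The sketch below assumes that reading; the slides do not state how edges are selected, so the function `direct_inference`, the threshold value, and the toy matrix `K_toy` are all hypothetical.

```python
import numpy as np

def direct_inference(K, threshold):
    """Predict an edge (i, j), i < j, whenever the similarity K[i, j] exceeds threshold."""
    n = K.shape[0]
    return [(i, j) for i in range(n) for j in range(i + 1, n) if K[i, j] > threshold]

# Hypothetical similarity matrix for four genes (not data from the study).
K_toy = np.array([
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.8, 0.3],
    [0.2, 0.8, 1.0, 0.1],
    [0.1, 0.3, 0.1, 1.0],
])

edges = direct_inference(K_toy, threshold=0.5)
print(edges)
```

Sweeping the threshold from high to low yields the sequence of predicted networks over which an ROC curve can be computed.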

Evaluation of the direct approach: using gene expression data

Gold standard data: metabolic network of 668 yeast genes in KEGG/Pathway

Gene expression data: profiles $x_1, x_2, x_3, \ldots$ over 157 experiments (SMD)

[Figure: ROC curve, true positives against false positives]

Outline

- Supervised network inference
- Global network prediction
- Missing enzyme gene estimation

Concluding remarks

An illustration of formalism

[Figure: protein network containing an unknown pathway, shown next to the similarity matrix in expression; the known part of the network is used for training]

Supervised network inference

Key idea: use of partially known network information

[Figure: genes $x_1$, $x_2$, $x_3$ of the training set in the original space; legend: edge predicted by the direct approach, true edge]

Supervised network inference

Step 1: map proteins to a feature space with a mapping $f$, where interacting proteins are close to each other

Step 2: predict interacting protein pairs involving the test set

[Figure: training-set and test-set genes $x_1$, $x_2$, $x_3$ in the original space and their images $f(x_1)$, $f(x_2)$, $f(x_3)$ in the feature space; legend: true edge]

Algorithm

Suppose we have a partially known graph $G = (V, E)$ with $V = \{x_1, x_2, \ldots, x_n\}$

$$f = \arg\min_f \sum_{(x_i, x_j) \in E} \left( f(x_i) - f(x_j) \right)^2$$

- Kernel CCA (Yamanishi et al., 2004)
- Distance metric learning (Vert et al., 2004)
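The two steps can be illustrated with a simplified linear stand-in. The sketch below is not kernel CCA and not the distance metric learning method cited above: it assumes a linear map $f(x) = W^\top (x - \bar{x})$, adds a regularised unit-variance constraint (an assumption made here to exclude the trivial solution $f = 0$), and solves the resulting generalised eigenproblem. All function names, the toy data, and the distance threshold are hypothetical.

```python
import numpy as np

def fit_linear_embedding(X, edges, dim=1, reg=1e-3):
    """Step 1: learn f(x) = W^T (x - mean) that keeps linked training genes close.

    Minimises sum_{(i,j) in E} ||f(x_i) - f(x_j)||^2 under a regularised
    unit-variance constraint (an assumption of this sketch).
    """
    n, p = X.shape
    mean = X.mean(axis=0)
    Xc = X - mean
    # Graph Laplacian L = D - A, so that f^T L f = sum over edges (f_i - f_j)^2.
    L = np.zeros((n, n))
    for i, j in edges:
        L[i, i] += 1
        L[j, j] += 1
        L[i, j] -= 1
        L[j, i] -= 1
    A = Xc.T @ L @ Xc                      # edge-wise spread of a direction w
    B = Xc.T @ Xc / n + reg * np.eye(p)    # regularised variance of a direction w
    # Solve A w = mu B w by whitening with B^{-1/2}.
    s, U = np.linalg.eigh(B)
    B_inv_sqrt = U @ np.diag(1.0 / np.sqrt(s)) @ U.T
    _, vecs = np.linalg.eigh(B_inv_sqrt @ A @ B_inv_sqrt)
    W = B_inv_sqrt @ vecs[:, :dim]         # directions with the smallest eigenvalues
    return mean, W

def embed(X, mean, W):
    """Apply the learned map f to the rows of X."""
    return (X - mean) @ W

def predict_edges(F_test, F_train, threshold):
    """Step 2: link test gene t to training gene i if they are close in feature space."""
    d = np.linalg.norm(F_test[:, None, :] - F_train[None, :, :], axis=2)
    return [(t, i) for t in range(d.shape[0]) for i in range(d.shape[1])
            if d[t, i] < threshold]

# Hypothetical training data: the first feature separates two known pathways,
# the second feature is unrelated to the network.
X_train = np.array([[0, -3], [0, 3], [0, 0], [1, 3], [1, -3], [1, 0]], dtype=float)
train_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]

mean, W = fit_linear_embedding(X_train, train_edges)
F_train = embed(X_train, mean, W)
F_test = embed(np.array([[0.0, 2.0]]), mean, W)   # one hypothetical test gene
predicted = predict_edges(F_test, F_train, threshold=0.5)
print(predicted)
```

In this toy case the learned direction ignores the uninformative second feature, so the test gene is linked only to the three training genes of the first pathway. A kernelised version would replace the linear map with an expansion over kernel functions.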

Result of the supervised learning: ROC curve by cross-validation

[Figure: ROC curves of the direct approach and the supervised approach]

Outline

- Multiple data integration

Application
- Global network prediction


Various genomic data

Data                   Gene-gene relationship       Structure
Gene expression        Co-expression similarity     Numerical vectors
Localization data      Co-localization similarity   Bit strings
Phylogenetic profile   Evolutionary similarity      Bit strings

Data of the yeast S. cerevisiae

- Expression: 6059 genes with 157 experiments (SMD database)
- Localization: 6059 proteins with 23 intracellular locations (Huh et al., 2003)
- Phylogenetic profile: 6059 proteins with 145 organisms (KEGG/Ortholog Cluster)

Gene expression profiles

Numerical vectors of the gene expression ratio (rows: genes; columns: experiments or time series)

         exp1  exp2  exp3  exp4  exp5  …  exp P
gene 1  ( 0.1,  0.4,  0.6,  0.2, -0.3, … ,  1.5)
gene 2  ( 0.2,  0.9,  1.8,  0.7, -0.3, … ,  0.4)
gene 3  ( 0.6,  0.7, -1.0,  0.8,  1.2, … ,  0.6)
…
gene N  ( 1.2,  0.3,  1.9, -0.1, -0.7, … ,  0.1)

Gene-gene similarity: $K_{\mathrm{exp}}(x, x') = x \cdot x'$, where $x$ is the vector of a gene expression profile

Phylogenetic profiles

Bit strings in which the presence and absence of the genes are coded as 1 or 0 across organisms (rows: genes; columns: organisms)

         org1 org2 org3 org4 org5  …  org P
gene 1  (1,   1,   0,   0,   0,   … , 1)
gene 2  (1,   0,   1,   0,   1,   … , 0)
gene 3  (0,   1,   0,   0,   1,   … , 0)
…
gene N  (1,   0,   1,   0,   0,   … , 1)

Gene-gene similarity: $K_{\mathrm{phy}}(x, x') = x \cdot x'$, where $x$ is the bit string of a phylogenetic profile

An illustration of our network inference procedure

INPUT: gene expression, protein localization, phylogenetic profile

Similarity matrix of genes

OUTPUT (inferred): gene network

Data representation and integration

Genomic data            Similarity matrix
expression data         $K_{\mathrm{exp}}$
localization data       $K_{\mathrm{loc}}$
phylogenetic profile    $K_{\mathrm{phy}}$
integration             $K_{\mathrm{int}} = K_{\mathrm{exp}} + K_{\mathrm{loc}} + K_{\mathrm{phy}}$
weighted integration    $K_{\mathrm{wint}} = w_1 K_{\mathrm{exp}} + w_2 K_{\mathrm{loc}} + w_3 K_{\mathrm{phy}}$
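Both forms of integration are sums of kernel matrices, which the following sketch makes concrete. The helper `integrate_kernels` and the three small data matrices are hypothetical illustrations; the inner-product kernels follow the definitions of $K_{\mathrm{exp}}$ and $K_{\mathrm{phy}}$ in the slides, and using the same inner product for the localization bit strings is an assumption.

```python
import numpy as np

def integrate_kernels(kernels, weights=None):
    """Return sum_k w_k * K_k; with weights=None this is the unweighted sum."""
    if weights is None:
        weights = [1.0] * len(kernels)
    return sum(w * K for w, K in zip(weights, kernels))

# Hypothetical data for three genes (one gene per row), not data from the study.
X_exp = np.array([[0.1, 0.4, 0.6], [0.2, 0.9, 1.8], [0.6, 0.7, -1.0]])  # expression ratios
X_loc = np.array([[1, 0], [1, 0], [0, 1]], dtype=float)                  # localization bits
X_phy = np.array([[1, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)         # phylogenetic bits

# Inner-product kernel for each data source.
K_exp, K_loc, K_phy = (X @ X.T for X in (X_exp, X_loc, X_phy))

K_int = integrate_kernels([K_exp, K_loc, K_phy])
K_wint = integrate_kernels([K_exp, K_loc, K_phy], weights=[0.31, 0.16, 0.53])
```

Because a non-negative weighted sum of kernel matrices is again a valid kernel matrix, the integrated matrix can be passed to the same inference procedure as any single-source kernel.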

Evaluating the weight for each data source

1. Individual application to each data source
2. Evaluation of its biological relevance by the ROC score

ROC score: area under the ROC curve

[Figure: ROC curve]
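The ROC score can be computed without drawing the curve. The sketch below uses the standard pairwise (Mann-Whitney) formulation of the area under the ROC curve; the function name `roc_score` and the toy scores and labels are hypothetical.

```python
import numpy as np

def roc_score(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen true edge receives a higher
    score than a randomly chosen non-edge (ties count as 1/2).
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# Hypothetical gene pairs: predicted similarity and gold-standard label (1 = true edge).
pair_scores = [0.9, 0.8, 0.3, 0.6, 0.2]
pair_labels = [1, 1, 0, 0, 1]
auc = roc_score(pair_scores, pair_labels)
print(auc)
```

A score of 0.5 corresponds to random prediction and 1.0 to perfect ranking of true edges above non-edges. The pairwise comparison is quadratic in the number of gene pairs, so a rank-based implementation would be preferable at genome scale.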

Evaluating the weight by the ROC scores

For each data source (expression, localization, phylogenetic profile), compute (ROC score - 0.5), which is used as the weight.

The resulting normalized weights:

- $w_1 = 0.31$ (expression)
- $w_2 = 0.16$ (localization)
- $w_3 = 0.53$ (phylogenetic profile)

Evolutionary information seems to be useful.

$$K_{\mathrm{wint}} = w_1 K_{\mathrm{exp}} + w_2 K_{\mathrm{loc}} + w_3 K_{\mathrm{phy}}$$
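The weighting rule can be sketched as follows. The ROC scores below are hypothetical placeholders, not the values measured in the study (those are not given in the slides), and clipping negative values at zero is an assumption added here for robustness.

```python
def weights_from_roc(roc_scores):
    """Map each ROC score to max(score - 0.5, 0) and normalise the results to sum to 1."""
    raw = {name: max(score - 0.5, 0.0) for name, score in roc_scores.items()}
    total = sum(raw.values())
    if total == 0:
        raise ValueError("no data source scores above chance (0.5)")
    return {name: value / total for name, value in raw.items()}

# Hypothetical ROC scores, NOT the values obtained in the study.
example_scores = {"expression": 0.60, "localization": 0.55, "phylogenetic profile": 0.65}
weights = weights_from_roc(example_scores)
print(weights)
```

Subtracting 0.5 means that a data source performing at chance level contributes nothing to the integrated kernel, while normalisation keeps the weights comparable across experiments.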

The effect of data integration

[Figure: ROC curve]

Outline

- Global network prediction


Comprehensive prediction of a global gene network

We predicted a network of 6059 genes

Possible biological applications

1. Estimate unknown pathways
2. Predict biochemical function for hypothetical proteins
3. Identify missing enzyme genes

Prediction for a role in pathways

YJR137C (its detailed function was unknown as of Sep. 2003) is connected with EC:1.8.4.8 and EC:2.5.1.47 in the predicted network.

Recently, YJR137C has been reported to be annotated as EC:1.8.1.2.

Outline

- Global network prediction

Concluding remarks

Summary

- We developed supervised approaches to infer the metabolic network from multiple genomic data
- Accuracy improved with supervised learning and with weighted data integration
- We showed some possibilities for obtaining new biological findings

Collaborators

For the methods
- Jean-Philippe Vert (Ecole des Mines)
- Minoru Kanehisa (Kyoto University)

For the biochemical experiments
- Hisaaki Mihara, Motoharu Ohsaki, Hisashi Muramatsu, Nobuyoshi Esaki (Kyoto University)
