metabolic network inference from multiple types of genomic data yoshihiro yamanishi centre de...
TRANSCRIPT
Metabolic Network Inference from Multiple Types of Genomic Data
Yoshihiro Yamanishi
Centre de Bio-informatique,
Ecole des Mines de Paris
Outline
Motivation: metabolic network
Method: network inference
- Supervised network inference
- Multiple data integration
Application
- Global network prediction
Concluding remarks
Metabolic network
The metabolic network consists of enzyme proteins and chemical compounds
- 6018 genes in yeast genome
- 1120 genes with EC numbers
- 668 genes with pathway information
(in the KEGG as of Sep. 2004)
Problem: unknown part of pathways and many missing enzyme genes
Network inference methods
For gene regulatory network
- Bayesian network (Friedman et al., 2000, Imoto et al., 2002)
- Boolean network (Akutsu et al., 2000)
- Graphical modeling (Toh et al., 2001)
For protein interaction network
- Joint graph method (Marcotte et al., 1999)
- Mirror tree method (Pazos et al., 2001)
Objectives
- Develop a method to infer metabolic gene networks in a supervised context
- Integrate heterogeneous genomic data in the framework of network inference
- Reconstruct unknown pathways and identify genes for missing enzymes
Kernel in this study
Kernel $k(x, x')$: representation of the similarity between two genes $x$ and $x'$ (e.g., correlation coefficient)
Kernel matrix: similarity matrix of a set of $N$ genes $x_1, x_2, \ldots, x_N$
$K_{ij} = k(x_i, x_j), \quad i, j = 1, 2, \ldots, N$
An example of the kernel
Suppose we have a set of genes x1, x2,…, xN and represent them by gene expression profiles, for example
$x_1 = (0.1, 0.4, 0.2, 0.3)^\top, \quad x_2 = (0.2, 0.3, 0.3, 0.2)^\top$
$K(x_1, x_2) = \langle x_1, x_2 \rangle = 0.1 \cdot 0.2 + 0.4 \cdot 0.3 + 0.2 \cdot 0.3 + 0.3 \cdot 0.2 = 0.26$
An example of kernel matrix
Kernel matrix:
$K = \begin{pmatrix} K(x_1, x_1) & K(x_1, x_2) \\ K(x_2, x_1) & K(x_2, x_2) \end{pmatrix} = \begin{pmatrix} 0.30 & 0.26 \\ 0.26 & 0.26 \end{pmatrix}$
This can be regarded as a kind of similarity matrix
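The kernel matrix of the example can be checked numerically. The following is a minimal sketch assuming the inner-product kernel of the example and the two four-dimensional profiles shown there; the function name `linear_kernel_matrix` is introduced here for illustration only.

```python
import numpy as np

# Toy expression profiles from the slide example (rows = genes).
X = np.array([
    [0.1, 0.4, 0.2, 0.3],  # x1
    [0.2, 0.3, 0.3, 0.2],  # x2
])

def linear_kernel_matrix(X):
    """Return K with K[i, j] = <x_i, x_j> (inner product of rows i and j)."""
    X = np.asarray(X, dtype=float)
    return X @ X.T

K = linear_kernel_matrix(X)
# K[0, 1] is the inner product <x1, x2> worked out on the slide.
print(K)
```

The matrix is symmetric by construction, which is why the two off-diagonal entries coincide.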
Direct network inference
Assumption: connected proteins in the network share high similarity in the data
[Figure: similarity matrix of genes 1-9 based on a genomic dataset, and the configuration of the genes]
[Figure: similarity matrix of genes 1-9 and the network predicted from it]
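The direct approach can be sketched as thresholding the similarity matrix: a pair of genes is predicted to be connected when its similarity is high. The snippet below is an illustrative sketch, not the implementation used in the talk; the threshold value, the 4-gene matrix, and the name `predict_edges_direct` are assumptions made for the example.

```python
import numpy as np

def predict_edges_direct(K, threshold):
    """Predict an edge (i, j), i < j, whenever K[i, j] exceeds the threshold."""
    K = np.asarray(K, dtype=float)
    n = K.shape[0]
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            if K[i, j] > threshold:
                edges.append((i, j))
    return edges

# Hypothetical similarity matrix of four genes (not data from the talk).
K_toy = np.array([
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
])

edges_toy = predict_edges_direct(K_toy, threshold=0.5)
print(edges_toy)
```

Lowering the threshold adds edges in order of decreasing similarity, so sweeping it from high to low traces out a sequence of progressively denser predicted networks.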
Evaluation of the direct approach: using gene expression data
Gold standard data: metabolic network of 668 genes of the yeast in the KEGG/Pathway
Data: gene expression profiles x1, x2, x3, ... from 157 experiments (SMD)
[Figure: ROC curve; axes: false positives, true positives]
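As a rough sketch of this evaluation, each gene pair can be scored by its similarity and labeled by whether it is an edge in the gold standard; the area under the ROC curve then summarizes how well the scores rank true edges above non-edges. The code assumes small toy inputs (the pairwise comparison uses memory proportional to positives times negatives), and the function names and the 4-gene matrix are hypothetical.

```python
import numpy as np

def roc_auc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) statistic."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos = scores[labels]
    neg = scores[~labels]
    # Count (positive, negative) pairs ranked correctly; ties count one half.
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def pair_scores_and_labels(K, true_edges):
    """Flatten the upper triangle of K into scores with gold-standard labels."""
    K = np.asarray(K, dtype=float)
    edge_set = {tuple(sorted((int(i), int(j)))) for i, j in true_edges}
    iu, ju = np.triu_indices(K.shape[0], k=1)
    scores = K[iu, ju]
    labels = np.array([(int(i), int(j)) in edge_set for i, j in zip(iu, ju)])
    return scores, labels

# Hypothetical similarity matrix and gold-standard edges (not data from the talk).
K_eval = np.array([
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
])
scores_eval, labels_eval = pair_scores_and_labels(K_eval, [(0, 1), (2, 3)])
auc_eval = roc_auc(scores_eval, labels_eval)
print(auc_eval)
```

A score of 0.5 corresponds to random ranking and 1.0 to perfect separation of true edges from non-edges.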
Outline
Motivation: metabolic network
Method: network inference
- Supervised network inference
- Multiple data integration
Application
- Global network prediction
- Missing enzyme gene estimation
Concluding remarks
An illustration of formalism
[Figure: protein network containing an unknown pathway, alongside the similarity matrix in expression; the known part of the network is marked as training]
Supervised network inference
Key idea: use of partially known network information
[Figure: genes x1, x2, x3 in the original space, with the training set, the edges predicted by the direct approach, and the true edges marked]
Supervised network inference 1/2
Step 1: map proteins to a space, where interacting proteins are close to each other
[Figure: mapping f from the original space (x1, x2, x3; training set and true edges marked) to the feature space (f(x1), f(x2), f(x3))]
Supervised network inference 2/2
Step 2: predict interacting protein pairs involving the test set
[Figure: mapping f from the original space (x1, x2, x3; training set, test set, and true edges marked) to the feature space (f(x1), f(x2), f(x3))]
Algorithm
Suppose we have a partially known graph $G = (V, E)$ with $V = (x_1, x_2, \ldots, x_n)$
$f = \arg\min \sum_{(x_i, x_j) \in E} \left( f(x_i) - f(x_j) \right)^2$
Kernel CCA (Yamanishi et al., 2004)
Distance metric learning (Vert et al., 2004)
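The objective can be illustrated with a simplified linear stand-in. This sketch is not the kernel CCA or distance metric learning method cited on the slide: it assumes a linear map f(x) = w · (x - mean), and it adds a variance constraint with a small ridge term `reg` so that the trivial solution f = 0 is excluded. The constraint, the ridge term, the toy data, and the names `fit_edge_embedding` and `embedded_distances` are all assumptions made for the example.

```python
import numpy as np

def fit_edge_embedding(X, edges, n_components=1, reg=1e-3):
    """Find directions w minimizing sum over edges of (f(x_i) - f(x_j))^2
    subject to w^T (Xc^T Xc + reg * I) w = 1, with f(x) = w . (x - mean)."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    Xc = X - mean
    n, d = Xc.shape
    # Graph Laplacian of the known edges: f^T L f = sum over edges (f_i - f_j)^2.
    L = np.zeros((n, n))
    for i, j in edges:
        L[i, i] += 1.0
        L[j, j] += 1.0
        L[i, j] -= 1.0
        L[j, i] -= 1.0
    A = Xc.T @ L @ Xc                    # edge objective as a quadratic form in w
    B = Xc.T @ Xc + reg * np.eye(d)      # regularized variance constraint
    # Solve the generalized problem A w = lambda B w by whitening with B = C C^T.
    C = np.linalg.cholesky(B)
    C_inv = np.linalg.inv(C)
    M = C_inv @ A @ C_inv.T
    _, vecs = np.linalg.eigh(M)          # eigenvalues in ascending order
    W = C_inv.T @ vecs[:, :n_components] # smallest eigenvalues = closest edges
    return mean, W

def embedded_distances(X, mean, W):
    """Pairwise Euclidean distances between genes after applying the map."""
    Z = (np.asarray(X, dtype=float) - mean) @ W
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Hypothetical data: connected genes share feature 0 and differ in feature 1.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
known_edges = [(0, 1), (2, 3)]

mean_toy, W_toy = fit_edge_embedding(X_toy, known_edges)
D_toy = embedded_distances(X_toy, mean_toy, W_toy)
print(W_toy.ravel())
```

In this toy case the learned direction keeps feature 0 and discards feature 1, so connected genes coincide in the feature space. Step 2 would then apply the same map to test genes and predict edges for pairs whose embedded distance is small.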
Result of the supervised learning: ROC curve by cross-validation
[Figure: ROC curves of the direct approach and the supervised approach]
Outline
Motivation: metabolic network
Method: network inference
- Supervised network inference
- Multiple data integration
Application
- Global network prediction
- Missing enzyme gene estimation
Concluding remarks
Various genomic data
Data                    Structure           Gene-gene relationship
Phylogenetic profile    Bit strings         Evolutionary similarity
Localization data       Bit strings         Co-localization similarity
Gene expression         Numerical vectors   Co-expression similarity
Data of the yeast S. cerevisiae
- Expression: 6059 genes with 157 experiments (SMD database)
- Localization: 6059 proteins with 23 intracellular locations (Huh et al., 2003)
- Phylogenetic profile: 6059 proteins with 145 organisms (KEGG/Ortholog Cluster)
Gene expression profiles
Numerical vectors of the gene expression ratio (rows: genes, columns: experiments or time series)
         exp1  exp2  exp3  exp4  exp5  …  exp P
gene 1  ( 0.1,  0.4,  0.6,  0.2, -0.3, …,  1.5)
gene 2  ( 0.2,  0.9,  1.8,  0.7, -0.3, …,  0.4)
gene 3  ( 0.6,  0.7, -1.0,  0.8,  1.2, …,  0.6)
…
gene N  ( 1.2,  0.3,  1.9, -0.1, -0.7, …,  0.1)
Gene-gene similarity: $K_{exp}(x, x') = x \cdot x'$, where $x$ is the vector of the gene expression profile
Phylogenetic profiles
Bit strings in which the presence and absence of the genes are coded as 1 or 0 across organisms (rows: genes, columns: organisms)
         org1  org2  org3  org4  org5  …  org P
gene 1  (1,    1,    0,    0,    0,    …, 1)
gene 2  (1,    0,    1,    0,    1,    …, 0)
gene 3  (0,    1,    0,    0,    1,    …, 0)
…
gene N  (1,    0,    1,    0,    0,    …, 1)
Gene-gene similarity: $K_{phy}(x, x') = x \cdot x'$, where $x$ is the bit string of the phylogenetic profile
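For bit strings, an inner-product similarity simply counts the organisms in which both genes are present. The sketch below assumes that reading of the kernel and uses only the first five columns shown for genes 1 to 3; the columns elided on the slide are left out rather than guessed.

```python
import numpy as np

# First five organisms of the profiles shown on the slide (rows = genes).
# The remaining columns are elided in the source and are not reproduced here.
P = np.array([
    [1, 1, 0, 0, 0],  # gene 1
    [1, 0, 1, 0, 1],  # gene 2
    [0, 1, 0, 0, 1],  # gene 3
])

# K_phy[i, j] = number of organisms containing both gene i and gene j.
K_phy = P @ P.T
print(K_phy)
```

The diagonal holds the number of organisms in which each gene occurs, so genes with broad distributions get larger self-similarity.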
An illustration of our network inference procedure
[Figure: INPUT (similarity matrices of genes from gene expression, protein localization, and phylogenetic profile) is used to infer the OUTPUT (gene network)]
Data representation and integration
Genomic data            Similarity matrix
expression data         $K_{exp}$
localization data       $K_{loc}$
phylogenetic profile    $K_{phy}$
integration             $K_{int} = K_{exp} + K_{loc} + K_{phy}$
weighted integration    $K_{wint} = w_1 K_{exp} + w_2 K_{loc} + w_3 K_{phy}$
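Both forms of integration amount to summing kernel matrices over the same set of genes, with or without weights. The following sketch uses hypothetical 2x2 matrices and a helper name, `integrate_kernels`, introduced only for illustration.

```python
import numpy as np

def integrate_kernels(kernels, weights=None):
    """Return sum_k weights[k] * kernels[k]; unweighted sum when weights is None."""
    kernels = [np.asarray(K, dtype=float) for K in kernels]
    if weights is None:
        weights = [1.0] * len(kernels)
    if len(weights) != len(kernels):
        raise ValueError("need exactly one weight per kernel matrix")
    return sum(w * K for w, K in zip(weights, kernels))

# Hypothetical kernel matrices for two genes (not data from the talk).
K_a = np.eye(2)
K_b = np.ones((2, 2))
K_c = 2.0 * np.eye(2)

K_sum = integrate_kernels([K_a, K_b, K_c])
K_weighted = integrate_kernels([K_a, K_b, K_c], weights=[0.5, 0.25, 0.25])
print(K_sum)
print(K_weighted)
```

All matrices must index the genes in the same order, since the sum is taken entry by entry.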
Evaluating the weight for each data source
1. Individual application to each data source
2. Evaluation of its biological relevance by the ROC score
ROC score: area under the ROC curve
[Figure: ROC curve]
Evaluating the weight by the ROC scores
For each data source, compute the ROC score - 0.5, which is used as the weight
[Figure: ROC curves for expression, localization, and phylogenetic profile]
The resulting normalized weights:
$w_1 = 0.31$ (expression), $w_2 = 0.16$ (localization), $w_3 = 0.53$ (phylogenetic profile)
$K_{wint} = w_1 K_{exp} + w_2 K_{loc} + w_3 K_{phy}$
Evolutionary information seems to be useful
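The weighting rule can be sketched as follows: subtract 0.5 (the score of a random predictor) from each ROC score and normalize the results to sum to one. The ROC scores in this example are made-up placeholders, since the talk reports only the final weights; the function name `roc_weights` is also an assumption.

```python
def roc_weights(roc_scores):
    """Map {source: ROC score} to normalized weights proportional to score - 0.5."""
    excess = {name: score - 0.5 for name, score in roc_scores.items()}
    if any(value < 0 for value in excess.values()):
        raise ValueError("a ROC score below 0.5 would give a negative weight")
    total = sum(excess.values())
    if total == 0:
        raise ValueError("all ROC scores equal 0.5; weights are undefined")
    return {name: value / total for name, value in excess.items()}

# Hypothetical ROC scores, chosen only to illustrate the arithmetic.
example_scores = {
    "expression": 0.60,
    "localization": 0.55,
    "phylogenetic profile": 0.65,
}
example_weights = roc_weights(example_scores)
print(example_weights)
```

A data source that performs no better than chance receives weight zero and therefore drops out of the weighted sum.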
The effect of data integration
[Figure: ROC curve]
Outline
Motivation: metabolic network
Method: network inference
- Supervised network inference
- Multiple data integration
Application
- Global network prediction
- Missing enzyme gene estimation
Concluding remarks
Comprehensive prediction of a global gene network
We predicted a network of 6059 genes
Possible biological applications
1. Estimate unknown pathways
2. Predict biochemical function for hypothetical proteins
3. Identify missing enzyme genes
Prediction for a role in pathways
YJR137C (the detailed function was unknown as of Sep. 2003) is connected with EC:1.8.4.8 and EC:2.5.1.47 in the predicted network
Recently, there has been a report that YJR137C is annotated as EC:1.8.1.2
Outline
Motivation: metabolic network
Method: network inference
- Supervised network inference
- Multiple data integration
Application
- Global network prediction
Concluding remarks
Summary
- We developed supervised approaches to infer the metabolic network from multiple genomic data
- The accuracy improved with the supervised learning and the weighted data integration
- We showed some possibilities to obtain new biological findings
Collaborator
For the methods
- Jean-Philippe Vert (Ecole des Mines)
- Minoru Kanehisa (Kyoto University)
For the biochemical experiments
- Hisaaki Mihara, Motoharu Ohsaki, Hisashi Muramatsu, Nobuyoshi Esaki (Kyoto University)