Genome-Scale Metabolic Networks: reconstruction, properties, and applications Microme Workshop on Microbial Metabolism - EMBL-EBI, October 9, 2013 - csb computational systems biology 1 Mathias Ganter


Page 1: Genome-Scale Metabolic Networks: reconstruction, properties, and

Genome-Scale Metabolic Networks: reconstruction, properties, and applications

Microme Workshop on Microbial Metabolism, EMBL-EBI, October 9, 2013

csb: computational systems biology

Mathias Ganter

Page 2: Genome-Scale Metabolic Networks: reconstruction, properties, and

Metabolic network models

http://www.cs.cmu.edu/~blmt/Seminar/SeminarMaterials/IntroMolBasDisease.html

Motivation:

• knowledge repository

• phenotype growth simulations

• model-driven discovery

E. coli: reactions ~2400, metabolites ~1600, genes ~1400

A. thaliana (unpublished): reactions ~4200, metabolites ~3700, genes ~2400

Page 3: Genome-Scale Metabolic Networks: reconstruction, properties, and

Metabolic reactions

Reed et al., Nature Reviews Genetics 7, 130-141 (February 2006)

Step-wise incorporation of information

[Figure, Box 1: the lactate dehydrogenase reaction described at five levels of detail]

Level 1: Metabolite specificity
Primary metabolites: LAC, PYR. Coenzymes: NAD, NADH.

Level 2: Metabolite formulae
Neutral formulae: LAC C3H6O3; PYR C3H4O3; NAD C21H28N7O14P2; NADH C21H29N7O14P2
Charged formulae: LAC C3H5O3 (1–); PYR C3H3O3 (1–); NAD C21H26N7O14P2 (1–); NADH C21H27N7O14P2 (2–)

Level 3: Stoichiometry
1 LAC + 1 NAD = 1 PYR + 1 NADH + 1 H

Level 4: Thermodynamic considerations and/or directionality
1 LAC + 1 NAD ↔ 1 PYR + 1 NADH + 1 H

Level 5: Localization
1 LAC [c] + 1 NAD [c] ↔ 1 PYR [c] + 1 NADH [c] + 1 H [c]

Compartment codes (prokaryotes and eukaryotes): [c]: cytoplasm; [e]: extracellular; [p]: periplasm; [n]: nucleus; [g]: Golgi apparatus; [v]: vacuole; [m]: mitochondria; [x]: peroxisome; [h]: chloroplast; [l]: lysosome; [r]: endoplasmic reticulum

How to reconstruct metabolic networks. Although high gene- or protein-sequence homology implies a similar function for gene products, a one-dimensional annotation that is based purely on sequence homology is a hypothesis [43] that needs biochemical verification. Several details need to be considered for translating a one-dimensional annotation of a gene into a set of defined biochemical reactions (BOX 1). Scientists who want to reconstruct biochemical-reaction networks should pay attention to the issues that are outlined below and summarized in BOX 1.

As a first step in generating enzyme-specific biochemical reactions, the substrate specificity of an enzyme has to be determined. In general, enzymes can be classified into two groups on the basis of substrate specificity: those that function only on one or a few highly similar substrates and those with a broader substrate specificity that can function on a class of compounds with similar functional groups (for example, alcohol dehydrogenase). The substrates that are recognized by either type of these enzymes might differ across organisms. The substrate specificity can differ for primary metabolites, as well as coenzymes (such as NADH versus NADPH and ATP versus GTP). BRENDA [44], an online database, contains detailed information about enzyme substrate specificities for a number of organisms and links to relevant publications.

Once the molecular formulae have been determined for the participating metabolites, the stoichiometry of the reaction can be specified. Here the overall charge and every element (including C, H, N, O, S and P) of the substrates and products have to be balanced. The stoichiometry for the metabolites is generally available in biochemical databases (TABLE 1), although protons and water molecules are often left out of the reactions in these databases. The directionality or reversibility of a reaction, which is a function of the thermodynamics of the reaction, also needs to be defined. Biochemical characterization studies will sometimes test the reversibility of enzyme reactions, but the directionality can differ between in vitro and in vivo environments owing to differences in temperature, pH and metabolite concentrations.

Reactions and proteins need to be assigned to specific cellular compartments. This task is relatively easy for prokaryotes, which have only a small number of cellular compartments, but becomes challenging for eukaryotes, which have significantly more subcellular compartments (BOX 1). Incorrect assignment of the location of a reaction can lead to further gaps in the metabolic network and misrepresentation of the network properties. In the absence of experimental data, proteins can be assumed to reside in the cytosol [45].

Algorithms, such as PSORT [46] and SubLoc [47], predict the cellular localization of proteins on the basis of nucleotide or amino-acid sequences (see REF. 48 for a review of the algorithms). Additionally, high-throughput experimental approaches have been developed for determining the cellular localization of proteins, such as immunofluorescence [49] and GFP tagging [50] of individual proteins. In multicellular organisms, the expression of individual genes can vary across cell types [33]; in these cases tissue-specific reconstructions might be more functionally relevant.

Box 1 | Defining metabolic reactions

Different levels of information are needed to obtain a detailed description of a biochemical transformation. Biochemical accuracy is especially important if the mathematical representation of the reconstruction is to be used for subsequent computations, otherwise the calculated network properties are likely to be incorrect. The first level defines the metabolite specificity of a gene product. Although primary metabolites are often the same for homologous enzymes across organisms, the use of coenzymes might vary. In the case of lactate dehydrogenase in Escherichia coli (see figure), NAD serves as an electron acceptor for lactate (LAC) resulting in the formation of pyruvate (PYR) and NADH. The second level of detail accounts for the charged molecular formula of each metabolite at a physiological pH. The knowledge of the chemical formula leads to the third level of detail, the stoichiometric coefficients of the reaction. By balancing out the elements and charge in the reaction, the overall stoichiometry of the reaction can be defined. It is here that protons and water molecules are often added to balance the chemical equation. The directionality of the reaction represents the fourth level, at which biochemical studies and thermodynamic properties define the in vivo reaction directionality. At the fifth level, the cellular compartment in which the reaction takes place has to be determined. See supplementary information S1 (box) for more details.
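The balancing step described in Box 1 (level 3) can be checked mechanically. The following is a minimal sketch, not the authors' procedure: it uses the charged formulae and charges shown in the Box 1 figure for lactate dehydrogenase, while the helper names `parse_formula` and `imbalance` and the dictionary layout are assumptions introduced only for illustration.

```python
import re
from collections import Counter

def parse_formula(formula):
    """Return element counts for a simple formula such as 'C3H5O3'."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts

# Charged formulae and charges as listed for Box 1 (level 2).
METABOLITES = {
    "LAC": ("C3H5O3", -1),
    "NAD": ("C21H26N7O14P2", -1),
    "PYR": ("C3H3O3", -1),
    "NADH": ("C21H27N7O14P2", -2),
    "H": ("H", 1),
}

# Negative coefficients are substrates, positive coefficients are products.
LDH = {"LAC": -1, "NAD": -1, "PYR": 1, "NADH": 1, "H": 1}

def imbalance(reaction, metabolites):
    """Return (unbalanced elements, net charge); ({}, 0) means balanced."""
    elements = Counter()
    charge = 0
    for met, coeff in reaction.items():
        formula, met_charge = metabolites[met]
        for element, n in parse_formula(formula).items():
            elements[element] += coeff * n
        charge += coeff * met_charge
    return {e: n for e, n in elements.items() if n != 0}, charge

if __name__ == "__main__":
    print(imbalance(LDH, METABOLITES))
```

Dropping the proton from `LDH` leaves one hydrogen and one unit of charge unaccounted for, which illustrates why protons often have to be added at this level.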


metabolic reaction


Glycolytic reactions (all cytoplasmic, [c])

Abbreviation  Substrates        Products            Genes
HEX1          GLC + ATP         G6P + ADP + H       glk
PGI           G6P               F6P                 pgi
PFK           ATP + F6P         ADP + FDP + H       pfkA, pfkB
FBA           FDP               DHAP + G3P          fbaA, fbaB
TPI           DHAP              G3P                 tpiA
GAPD          G3P + NAD + PI    13DPG + H + NADH    gapA, gapC1, gapC2
PGK           13DPG + ADP       3PG + ATP           pgk
PGM           3PG               2PG                 gpmA, gpmB
ENO           2PG               H2O + PEP           eno
PYK           ADP + H + PEP     ATP + PYR           pykA, pykF

Stoichiometric matrix (rows: metabolites, columns: reactions)

       HEX1  PGI  PFK  FBA  TPI GAPD  PGK  PGM  ENO  PYK
ATP      -1    0   -1    0    0    0    1    0    0    1
GLC      -1    0    0    0    0    0    0    0    0    0
ADP       1    0    1    0    0    0   -1    0    0   -1
G6P       1   -1    0    0    0    0    0    0    0    0
H         1    0    1    0    0    1    0    0    0   -1
F6P       0    1   -1    0    0    0    0    0    0    0
FDP       0    0    1   -1    0    0    0    0    0    0
DHAP      0    0    0    1   -1    0    0    0    0    0
G3P       0    0    0    1    1   -1    0    0    0    0
NAD       0    0    0    0    0   -1    0    0    0    0
PI        0    0    0    0    0   -1    0    0    0    0
13DPG     0    0    0    0    0    1   -1    0    0    0
NADH      0    0    0    0    0    1    0    0    0    0
3PG       0    0    0    0    0    0    1   -1    0    0
2PG       0    0    0    0    0    0    0    1   -1    0
PEP       0    0    0    0    0    0    0    0    1   -1
H2O       0    0    0    0    0    0    0    0    1    0
PYR       0    0    0    0    0    0    0    0    0    1

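[Figure, Box 3: gene–protein–reaction (GPR) associations]

GAPD: gapA (b1779) gives GapA, or gapC1 (b1417) and gapC2 (b1416) give GapC

ENO: eno (b2779) gives Eno

PYK: pykF (b1676) gives PykF, or pykA (b1854) gives PykA

Other reactions in the figure: HEX1, PGI, PFK, FBA, TPI, PGK, PGM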

Boolean rules: Logic statements that use Boolean operators (and, or, not) to evaluate the ‘on/off’ state of a variable.

P/O ratio: The number of ATP molecules (P) that are formed per oxygen atom (O) consumed during respiration.

Network gap: One or more reactions that are missing from the network reconstruction owing to the lack of direct genetic or biochemical evidence.

Blocked reactions: Reactions that, at steady state, can have no net flux (reactions that involve dead-end metabolites are blocked reactions).

Pathway holes: Missing reactions from defined metabolic pathways such as glycolysis and amino-acid biosynthesis.

constructed solely on the basis of genomic and biochemical evidence often contain many network gaps. Network gaps can be identified by analysing the ability of the network to generate individual biomass components that are needed for growth. For example, if a metabolic network is unable to generate a non-essential amino acid owing to missing steps in the biosynthetic pathways, the network gaps can be closed by completing the pathway with the missing reactions.

Physiological data, such as the growth capabilities of an organism, can be used to identify missing reactions or refine existing pathways. For example, metabolic pathways that are involved in the use of a carbon source can be added to a network reconstruction even in the absence of genomic or biochemical information if the organism can grow on the compound. The growth requirements of an organism therefore provide important evidence for improving, refining and expanding the quality and the content of the reconstructed networks. Reactions that are added to the network at this stage should be assigned low confidence scores because there are no genetic or biochemical data to confirm them.

Analytical tools can also be used to identify network gaps that involve reactions (blocked reactions or pathway holes) or metabolites (dead-end metabolites) that are isolated from the rest of the network. Isolated reactions can be identified computationally using flux-coupling analysis [71] (or Pathway Tools) and isolated metabolites can be identified through metabolite connectivity [39]. Addition of any reaction to the reconstructed network to fill network gaps should be supported, if possible, by previous observations and/or presence in phylogenetically related organisms. Subsequently, for each added reaction, putative genes can be identified using homology-based and context-based computational techniques (such as those that are described in the section on one-dimensional annotation) [36,37,68]. Such added reactions and putative assignments form a set of testable hypotheses that are subject to further experimental investigation. Reactions that cause network gaps can be removed from the network; for example, pathways that have many gaps might not occur in an organism and the functional assignment of associated genes should be re-examined [38]. On the other hand, gaps that were included on the basis of biochemical data indicate missing metabolic knowledge and should remain.
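The connectivity check for dead-end metabolites mentioned above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the toy network, the reaction names and the `(stoichiometry, reversible)` tuple layout are hypothetical, and the single pass shown here is much weaker than flux-coupling analysis.

```python
def dead_end_metabolites(reactions):
    """Return metabolites that can only be produced or only be consumed."""
    producible, consumable, seen = set(), set(), set()
    for stoich, reversible in reactions.values():
        for met, coeff in stoich.items():
            seen.add(met)
            if coeff > 0 or reversible:
                producible.add(met)
            if coeff < 0 or reversible:
                consumable.add(met)
    return sorted(seen - (producible & consumable))

def reactions_touching(reactions, metabolites):
    """Return reactions that involve any of the given metabolites."""
    targets = set(metabolites)
    return sorted(rid for rid, (stoich, _) in reactions.items()
                  if targets & set(stoich))

# Hypothetical toy network: D is produced by R3 but never consumed.
TOY = {
    "EX_A": ({"A": 1}, False),           # uptake of A
    "R1": ({"A": -1, "B": 1}, False),
    "R2": ({"B": -1, "C": 1}, False),
    "R3": ({"B": -1, "D": 1}, False),
    "EX_C": ({"C": -1}, False),          # secretion of C
}

if __name__ == "__main__":
    dead = dead_end_metabolites(TOY)
    print(dead, reactions_touching(TOY, dead))
```

In the toy network the reaction that produces the dead-end metabolite cannot carry steady-state flux, so it is a candidate either for gap filling or for removal, as discussed in the text.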

Discrepancies between predicted and experimental phenotypic data for genetic perturbations (either knockouts or knockdowns through small interfering RNA) on defined growth conditions can also be used to evaluate the content of the metabolic network. As described above, false negatives (for example, experimental growth but no predicted in silico growth) can indicate that reactions are missing from the metabolic network or the existence of isozymes [35,45]. False positives (for example, growth that is predicted in silico without

Box 3 | Assembly and representation

A list of charge and elementally balanced metabolic reactions can be represented in a stoichiometric matrix (S), where rows and columns correspond to metabolites and reactions and the elements are the stoichiometric coefficients. In genome-scale metabolic networks these stoichiometric matrices contain few non-zero elements, as relatively few metabolites participate in a given reaction. Connections between genes and reactions can be represented as gene–protein–reaction (GPR) associations by using Boolean rules or visualized using graphic images. In the GPR scheme, the first level (teal) corresponds to genetic loci, the second level (pink) to transcripts, the third level (orange) to functional proteins, and the fourth level (blue) to reactions. [c], cytoplasmic reactions.
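As a rough illustration of Box 3, the sketch below assembles the stoichiometric matrix from the ten glycolytic reactions and evaluates GPR Boolean rules for a set of deleted genes. The reaction stoichiometries and locus tags follow the box; the nested-tuple encoding of the rules and the function names are assumptions made for this example, not a format used by any particular toolbox.

```python
# Glycolytic reactions from Box 3: negative = consumed, positive = produced.
REACTIONS = {
    "HEX1": {"GLC": -1, "ATP": -1, "G6P": 1, "ADP": 1, "H": 1},
    "PGI": {"G6P": -1, "F6P": 1},
    "PFK": {"ATP": -1, "F6P": -1, "ADP": 1, "FDP": 1, "H": 1},
    "FBA": {"FDP": -1, "DHAP": 1, "G3P": 1},
    "TPI": {"DHAP": -1, "G3P": 1},
    "GAPD": {"G3P": -1, "NAD": -1, "PI": -1, "13DPG": 1, "H": 1, "NADH": 1},
    "PGK": {"13DPG": -1, "ADP": -1, "3PG": 1, "ATP": 1},
    "PGM": {"3PG": -1, "2PG": 1},
    "ENO": {"2PG": -1, "H2O": 1, "PEP": 1},
    "PYK": {"ADP": -1, "H": -1, "PEP": -1, "ATP": 1, "PYR": 1},
}

def stoichiometric_matrix(reactions):
    """Return (metabolite ids, reaction ids, S) with S[i][j] for met i, rxn j."""
    metabolites = sorted({m for stoich in reactions.values() for m in stoich})
    rxn_ids = list(reactions)
    matrix = [[reactions[r].get(m, 0) for r in rxn_ids] for m in metabolites]
    return metabolites, rxn_ids, matrix

# GPR rules as nested tuples (an illustrative encoding): a string is a locus,
# ("and", ...) needs all parts, ("or", ...) needs at least one.
GPR = {
    "GAPD": ("or", "b1779", ("and", "b1416", "b1417")),
    "ENO": "b2779",
    "PYK": ("or", "b1676", "b1854"),
}

def is_active(rule, deleted_genes):
    """Evaluate a GPR rule given a set of deleted loci."""
    if isinstance(rule, str):
        return rule not in deleted_genes
    operator, *parts = rule
    results = [is_active(part, deleted_genes) for part in parts]
    return all(results) if operator == "and" else any(results)

if __name__ == "__main__":
    mets, rxns, S = stoichiometric_matrix(REACTIONS)
    nonzero = sum(1 for row in S for value in row if value != 0)
    print(len(mets), len(rxns), nonzero)
    print(is_active(GPR["PYK"], {"b1676"}))
```

Even in this small example most entries of S are zero, and deleting one isozyme gene of PYK leaves the reaction available while deleting both removes it.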


gene-protein-reaction relation

Page 4: Genome-Scale Metabolic Networks: reconstruction, properties, and

From genomes to models

Annotated genome

External knowledge

Metabolic model

http://www.cs.cmu.edu/~blmt/Seminar/SeminarMaterials/IntroMolBasDisease.html

Page 5: Genome-Scale Metabolic Networks: reconstruction, properties, and

Problems

directions

dead-end

localization

missing reactions

erroneous annotations

missing pathways

common namespace: glucose, GLC, met1, C00031, CHEBI:4167
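The namespace problem on this slide (five different names for the same metabolite) is usually handled with an alias table. The sketch below is only an illustration: the aliases are the ones shown on the slide, while the canonical identifier `glc` and the function name `to_canonical` are hypothetical choices.

```python
# Aliases from the slide, all mapped to one hypothetical canonical identifier.
ALIASES = {
    "glucose": "glc",
    "GLC": "glc",
    "met1": "glc",
    "C00031": "glc",
    "CHEBI:4167": "glc",
}

def to_canonical(stoichiometry, aliases):
    """Rewrite metabolite ids to canonical ids, merging duplicate entries."""
    result = {}
    for metabolite, coefficient in stoichiometry.items():
        key = aliases.get(metabolite, metabolite)
        result[key] = result.get(key, 0) + coefficient
    return result

if __name__ == "__main__":
    model_a = {"GLC": -1, "ATP": -1, "G6P": 1, "ADP": 1, "H": 1}
    model_b = {"C00031": -1, "ATP": -1, "G6P": 1, "ADP": 1, "H": 1}
    # After normalization the two reactions compare equal.
    print(to_canonical(model_a, ALIASES) == to_canonical(model_b, ALIASES))
```

Only after such a normalization can reactions from different databases or models be compared or merged reliably.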

Page 6: Genome-Scale Metabolic Networks: reconstruction, properties, and

Special reactions for modeling

• Uptake and secretion reactions

• Biomass reaction: conversion of amino acids, lipids, nucleotides, cofactors, ... into 1 gram of biomass

• ATP maintenance reaction (all energy demands not related to growth)
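The three kinds of special reactions on this slide can be written in the same stoichiometric form as ordinary reactions. The following sketch is purely illustrative: the precursor amounts and molecular weights are placeholder numbers chosen so that the masses sum to 1 gram, not measured values for any organism, and all identifiers are hypothetical.

```python
# Hypothetical composition: precursor -> (mmol per gram dry weight, g/mol).
# Placeholder values only; chosen so the precursor masses add up to 1 g.
COMPOSITION = {
    "amino_acids": (5.0, 110.0),
    "lipids": (0.2, 750.0),
    "nucleotides": (0.8, 330.0),
    "cofactors": (0.05, 720.0),
}

def biomass_reaction(composition):
    """Build a reaction that drains precursors and yields one unit of biomass."""
    reaction = {met: -mmol for met, (mmol, _) in composition.items()}
    reaction["biomass"] = 1.0
    return reaction

def biomass_mass_in_grams(composition):
    """Precursor mass consumed per unit of biomass (mmol * g/mol = mg)."""
    return sum(mmol * weight for mmol, weight in composition.values()) / 1000.0

SPECIAL_REACTIONS = {
    # Exchange reaction: negative flux is read as uptake, positive as secretion.
    "EX_glc": {"GLC_e": -1},
    "BIOMASS": biomass_reaction(COMPOSITION),
    # ATP maintenance: ATP hydrolysis not coupled to growth.
    "ATPM": {"ATP": -1, "H2O": -1, "ADP": 1, "PI": 1, "H": 1},
}

if __name__ == "__main__":
    print(biomass_mass_in_grams(COMPOSITION))
```

Checking that the precursors add up to 1 gram keeps the flux through the biomass reaction interpretable as a growth rate.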

Page 7: Genome-Scale Metabolic Networks: reconstruction, properties, and

Current reconstruction protocol

NATURE PROTOCOLS | VOL.4 NO.12 | 2009

INTRODUCTION

Metabolic network reconstruction has become an indispensable tool for studying the systems biology of metabolism [1–7]. The number of organisms for which metabolic reconstructions have been created is increasing at a pace similar to whole genome sequencing. However, the quality of metabolic reconstructions differs considerably, which is partially caused by varying amounts of available data for the target organisms and also by a missing standard operating procedure that describes the reconstruction process in detail. This protocol details a procedure by which a quality-controlled, quality-assured reconstruction can be built to ensure high quality and comparability between reconstructions. In particular, the protocol points out data that are necessary for the reconstruction process and that should accompany reconstructions. Moreover, standard tests are presented, which are necessary to verify functionality and applicability of reconstruction-derived metabolic models. Finally, this protocol presents strategies to debug non- or malfunctioning models. Although the reconstruction process has been reviewed conceptually by numerous groups [8–11] and a good general overview of the necessary data and steps is available, no detailed description of the reconstruction, debugging and iterative validation process has been published. This protocol seeks to make this process explicit and generally available.

The presented protocol describes the procedure necessary to reconstruct metabolic networks intended to be used for computational modeling, including the constraint-based reconstruction and analysis (COBRA) approach [11,12]. These network reconstructions, and in silico models, are created in a bottom–up manner based on genomic and bibliomic data, and thus represent a biochemical, genetic and genomic (BiGG) knowledge-base for the target organism [9]. These BiGG reconstructions can be converted into mathematical models and their systems and physiological properties can be determined. For example, they can be used to simulate the maximal growth of a cell in a given environmental condition using flux-balance analysis (FBA) [13,14]. In contrast, the generation of networks derived from top–down approaches (high-throughput data-based inference of component interactions) is not discussed here, as they do not generally result in functional, mathematical models.
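For orientation, the flux-balance analysis mentioned above is commonly written as a linear program. The statement below is the standard textbook form rather than a quotation from the protocol; the symbols are assumed notation: S is the stoichiometric matrix, v the vector of reaction fluxes, c a weight vector that selects the objective (typically the biomass reaction), and v^min and v^max the flux bounds that encode directionality and medium conditions.

```latex
\begin{aligned}
\max_{v} \quad & c^{\top} v \\
\text{subject to} \quad & S v = 0, \\
& v^{\min} \le v \le v^{\max}
\end{aligned}
```

The equality constraint expresses the steady-state assumption (no net accumulation of any metabolite), and setting a lower bound of zero for a reaction makes it irreversible.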

The metabolic reconstruction process described herein is usually very labor and time intensive, spanning from 6 months for a well-studied, medium-sized bacterial genome, to 2 years (and six people) for the metabolic reconstruction of human metabolism [15]. Often, the reconstruction process is iterative, as demonstrated by the metabolic network of Escherichia coli, whose reconstruction has been expanded and refined over the last 19 years [7]. As the number of reconstructed organisms increases, the need to find automated, or at least semi-automated, ways to reconstruct metabolic networks straight from the genome annotation is growing. Despite the growing experience and knowledge, to date, we are still not able to completely automatically reconstruct high-quality metabolic networks that can be used as predictive models. Recent reviews highlight current problems with genome annotations and databases, which make automated reconstructions challenging and thus require manual evaluation [8,9]. Organism-specific features, such as substrate and cofactor utilization of enzymes, intracellular pH and reaction directionality, remain problematic and thus require manual evaluation. However, some organism-specific databases and approaches exist, which can be used for automation. We describe here the manual reconstruction process in detail.

A limited number of software tools and packages are available (freely and commercially), which aim at assisting and facilitating the reconstruction process (Table 1). This protocol can, in principle, be combined with those reconstruction tools. For generality, we present the entire procedure using a spreadsheet, namely Excel workbook (Microsoft), and a numeric computation and visualization software package, namely Matlab (MathWorks). Free spreadsheets (e.g., OpenOffice and Google Docs) could be used instead of the listed spreadsheet. Alternatively, MySQL databases may be used, as they are very helpful to structure and track data. Matlab was also used to encode the COBRA Toolbox, which is a suite of COBRA


A protocol for generating a high-quality genome-scale metabolic reconstruction

Ines Thiele (1,2) & Bernhard Ø Palsson (1)

1Department of Bioengineering, University of California, San Diego, La Jolla, California, USA. 2Current address: Center for Systems Biology, Faculty of Industrial Engineering, Mechanical Engineering and Computer Science, University of Iceland, Reykjavik, Iceland. Correspondence should be addressed to B.Ø.P. ([email protected]).

Published online XX XX 2009; doi:10.1038/nprot.2009.203

Network reconstructions are a common denominator in systems biology. Bottom–up metabolic network reconstructions have been developed over the last 10 years. These reconstructions represent structured knowledge bases that abstract pertinent information on the biochemical transformations taking place within specific target organisms. The conversion of a reconstruction into a mathematical format facilitates a myriad of computational biological studies, including evaluation of network content, hypothesis testing and generation, analysis of phenotypic characteristics and metabolic engineering. To date, genome-scale metabolic reconstructions for more than 30 organisms have been published and this number is expected to increase rapidly. However, these reconstructions differ in quality and coverage that may minimize their predictive potential and use as knowledge bases. Here we present a comprehensive protocol describing each step necessary to build a high-quality genome-scale metabolic reconstruction, as well as the common trials and tribulations. Therefore, this protocol provides a helpful manual for all stages of the reconstruction process.


functions commonly used for simulation [16]. This Toolbox was extended to facilitate the reconstruction, debugging and manual curation process described herein.

The protocol describes in detail the process to generate metabolic reconstructions applicable for representatives of all domains of life. The process of reconstructing prokaryotic and eukaryotic metabolic networks is, in principle, identical, although eukaryote reconstructions are more challenging because of the size of genomes, coverage of knowledge and the multitude of cellular compartments. Specific properties and pitfalls are highlighted.

The described reconstruction and debugging process requires organism-specific information. The minimum information includes the genome sequence, from which key metabolic functions can be obtained, and physiological data, such as growth conditions, which allow the comparison of model prediction to refine the network’s content. In general, the more information about physiology, biochemistry and genetics is available for the target organism, the better the predictive capacity of the models. This property becomes obvious considering that the network evaluation and validation process relies on comparing predicted phenotypes (e.g., growth rate) with experimental observations. Additional cellular objectives (other than maximal growth rate) may be compared with the experimental data but they are not detailed in this protocol [15,17–20].

Although this protocol presents the reconstruction process in terms of metabolic networks, the same approach can be, and has been, applied for reconstructing signaling [21,22] and transcription/translation networks [23]. Regulatory networks have not yet been constructed in a fully stoichiometric manner, although a pseudo-stoichiometric approach has been proposed [24,25]. The reconstruction process for these networks is not as well established as for metabolic networks, and is thus still subject to active research.

A myriad of data sources are used during the reconstruction process, rendering metabolic network reconstructions as knowledge bases, which summarize and structure the available BiGG knowledge about the target organism. Frequently used organism-unspecific, and some of the organism-specific, resources are listed in Table 1. It should be noted that the quality and wealth of organism-specific information will directly affect the quality and coverage of the metabolic reconstruction. Great resources are organism-specific books that have been published for a growing number of organisms [26–29]. In cases where organism-specific information is scarce, data from phylogenetic neighbors may be of great help. It is important to ensure that, in cases where the reconstruction relies extensively on information from related organisms, the overall behavior of the model matches the target organism. This assurance can be achieved by carefully comparing the predictions with experimental and physiological data, such as growth conditions, secretion products and knock-out phenotypes.

The resulting knowledge bases can be queried, used for mapping experimental data (e.g., gene expression, proteomic, fluxomic and metabolomic data), and converted into a mathematical format to investigate metabolic capabilities and generate new biological hypotheses. The multitude of possible applications of BiGG knowledge bases distinguishes them from other automated efforts. By introducing standards in content and format with this protocol it will soon be possible to compare metabolic reconstructions between different organisms, which will further enhance our understanding of the evolutionary processes and may provide a complementary approach to comparative genomics.

Experimental design

The metabolic network reconstruction process described herein consists of four major stages followed by its prospective use in Stage 5 (Fig. 1). The order of steps in the different stages is a recommendation and may be altered within each stage, and with some limitations between stages, as long as they are completed. The quality of the reconstruction is generally ensured by carrying out all the steps.

Stage 1: Creating a draft reconstruction. It is to be noted that the creation of a draft reconstruction and the manual reconstruction refinement (next stage) may be combined for bacterial reconstructions with main emphasis on reconstruction refinement.

The first stage consists of the generation of a draft reconstruction based on the genome annotation of the target organism and biochemical databases. This draft reconstruction, or automated reconstruction, is thus a collection of genome-encoded metabolic functions, some of which may be falsely included while others are missing (e.g., because of missing, wrong or incomplete annotations). Software tools such as Pathway Tools [30] or metaSHARK [31] can be used for the generation of a draft reconstruction, but they do not replace the manual curation.

Genome annotation (Step 1): Genomic information is important to unambiguously define the gene properties with respect to the organism’s genome, as well as to allow data mapping (e.g., gene expression) in subsequent studies. As the draft reconstruction, and to some extent the curated reconstruction, relies mainly

1. Draft reconstruction

1| Obtain genome annotation.
2| Identify candidate metabolic functions.
3| Obtain candidate metabolic reactions.
4| Assemble draft reconstruction.
5| Collect experimental data.

2. Refinement of reconstruction

6| Determine and verify substrate and cofactor usage.
7| Obtain neutral formula for each metabolite.
8| Determine the charged formula.
9| Calculate reaction stoichiometry.
10| Determine reaction directionality.
11| Add information for gene and reaction localization.
12| Add subsystems information.
13| Verify gene–protein–reaction association.
14| Add metabolite identifier.
15| Determine and add confidence score.
16| Add references and notes.
17| Flag information from other organisms.
18| Repeat Steps 6 to 17 for all genes.
19| Add spontaneous reactions to the reconstruction.
20| Add extracellular and periplasmic transport reactions.
21| Add exchange reactions.
22| Add intracellular transport reactions.
23| Draw metabolic map (optional).
24–32| Determine biomass composition.
33| Add biomass reaction.
34| Add ATP-maintenance reaction (ATPM).
35| Add demand reactions.
36| Add sink reactions.
37| Determine growth medium requirements.

3. Conversion of reconstruction into computable format

38| Initialize the COBRA toolbox.
39| Load reconstruction into Matlab.
40| Verify S matrix.
41| Set objective function.
42| Set simulation constraints.

4. Network evaluation

43–44| Test if network is mass- and charge-balanced.
45| Identify metabolic dead-ends.
46–48| Gap analysis.
49| Add missing exchange reactions to model.
50| Set exchange constraints for a simulation condition.
51–58| Test for stoichiometrically balanced cycles.
59| Re-compute gap list.
60–65| Test if biomass precursors can be produced in standard medium.
66| Test if biomass precursors can be produced in other growth media.
67–75| Test if the model can produce known secretion products.
76–78| Check for blocked reactions.
79–80| Compute single gene deletion phenotypes.
81–82| Test for known incapabilities of the organism.
83| Compare predicted physiological properties with known properties.
84–87| Test if the model can grow fast enough.
88–94| Test if the model grows too fast.

Data assembly and dissemination

95| Print Matlab model content.
96| Add gap information to the reconstruction output.

Figure 1 | Overview of the procedure to iteratively reconstruct metabolic networks. In particular, Stages 2–4 are continuously iterated until model predictions are similar to the phenotypic characteristics of the target organism and/or all experimental data for comparison are exhausted.

days to a week

ongoing

month to a year

days to a week

week to months

days to weeks

Goal: automatic model reconstruction

total: up to 2 years and 6 people (e.g. human)

Page 8: Genome-Scale Metabolic Networks: reconstruction, properties, and

Model coverage


Recent advances in reconstruction and applications of genome-scale metabolic models

Tae Yong Kim (1,2), Seung Bum Sohn (1,2), Yu Bin Kim (1,2), Won Jun Kim (1,2) and Sang Yup Lee (1,2,3)

In the last decade, reconstruction and applications of genome-scale metabolic models have greatly influenced the field of systems biology by providing a platform on which high-throughput computational analysis of metabolic networks can be performed. The last two years have seen an increase in volume of more than 33% in the number of published genome-scale metabolic models, signifying a high demand for these metabolic models in studying specific organisms. The diversity in modeling different types of cells, from photosynthetic microorganisms to human cell types, also demonstrates their growing influence in biology. Here we review the recent advances and current state of genome-scale metabolic models, the methods employed towards ensuring high quality models, their biotechnological applications, and the progress towards the automated reconstruction of genome-scale metabolic models.

Addresses

(1) Metabolic and Biomolecular Engineering National Research Laboratory, Department of Chemical and Biomolecular Engineering (BK21 program), Center for Systems and Synthetic Biotechnology, Institute for the BioCentury, Korea Advanced Institute of Science and Technology (KAIST), Daejeon 305-701, Republic of Korea

(2) BioInformatics Research Center, KAIST, Daejeon 305-701, Republic of Korea

(3) Department of Bio and Brain Engineering, BioProcess Engineering Research Center, KAIST, Daejeon 305-701, Republic of Korea

Corresponding author: Lee, Sang Yup ([email protected])

Current Opinion in Biotechnology 2012, 23:617–623

This review comes from a themed issue on Systems biology

Edited by Jens Nielsen and Sang Yup Lee

For a complete overview see the Issue and the Editorial

Available online 4th November 2011

0958-1669/$ – see front matter, © 2011 Elsevier Ltd. All rights reserved.

http://dx.doi.org/10.1016/j.copbio.2011.10.007

IntroductionGenome-scale metabolic models have become an import-ant tool in the study of metabolic networks in biotechnol-ogy. The explosion in the number of new genome-scalemetabolic models reconstructed over the last decade, andin particular in last several years, is a proof of its greatusefulness in the study and applications of biologicalsystems (Figure 1). It also highlights the increasing import-ance of these metabolic models in pharmaceutical, chemi-cal, and environmental industries. Initial genome-scale

metabolic models were employed towards understandingthe characteristics of microbial pathogens at genome-scale,which was followed by developing strategies for metabo-lically engineering microbial hosts for enhanced pro-duction of various bioproducts. As developing superiormicroorganisms for biorefinery applications have becomeincreasingly important, these metabolic models are widelyused in metabolic engineering studies to overcome thelimitations of established knowledge on the metabolicnetwork and to identify new non-intuitive metabolic reac-tions to be engineered for further improvement of strains.

Availability of ever increasing number of genome-scalemetabolic models is of course important, but the quality ofthese metabolic models is more important. Validation ofthese metabolic models ensures the quality of the meta-bolic model and their ability to correctly predict thephysiological characteristics of the organism. This entailsthe use of experimental data that can be compared againstthe predicted physiological characteristics of the metabolicmodel. By comparing the simulated physiological charac-teristics with the observed experimental results, theaccuracy of the metabolic model can be improved. Further-more, algorithms have been developed to incorporate otheraspects of cellular characteristics, other than metabolicfunctions, to increase the accuracy of the model.

Recently, the genome-scale metabolic models havebecome more refined and complex, allowing for theexpanded scopes in their applications. Algorithms havebeen developed to examine metabolic models from variousangles; for instance, calculating the redistribution of themetabolic flux in response to genetic or environmentalperturbation [1,2]. Reconstruction of metabolic models ofyeast species has been employed to investigate the pro-duction of heterologous therapeutic proteins that are unsui-table for production in bacterial hosts owing to the absenceof eukaryotic post-translational modification mechanisms[3]. Pathogenic metabolic models allow for the develop-ment of novel drugs to combat infection with minimal sideeffect to the host [4!,5]. The metabolic models of mammals,such as Homo sapiens, have been employed to study varioushuman diseases and develop strategies for potential treat-ments [6,7!!].

The advantages acquired by employing genome-scale metabolic models have consequently driven the development of new metabolic models and algorithms. As the availability of the complete genome sequences of new

www.sciencedirect.com Current Opinion in Biotechnology 2012, 23:617–623

Page 9: Genome-Scale Metabolic Networks: reconstruction, properties, and

Automatic model reconstruction

Computational challenge: combinatorial optimization

Manual (assisted) network reconstruction

Gap-filling by genome annotation

Database and information integration

Automated reconstruction & prediction of novel network components

Computational complexity

Page 10: Genome-Scale Metabolic Networks: reconstruction, properties, and

Iterative model updates: E. coli

© 2008 Nature Publishing Group http://www.nature.com/naturebiotechnology

linear updates: hardly any problems

Page 11: Genome-Scale Metabolic Networks: reconstruction, properties, and

Iterative model updates: Yeast

also improved to be able to describe the growth on a mixed substrate of glucose and ethanol. Another metabolic network model, primarily aimed for simulation of anaerobic growth on glucose, was reconstructed by Nissen et al. (1997). This model contained 37 reactions and 27 intracellular metabolites. The model was employed for metabolic flux analysis to estimate the carbon channeling by the intracellular reactions.

The yeast prototype metabolic models were used and adapted for many different types of studies during the coming years through stoichiometric modeling approaches. Jin et al. (1997) applied metabolic modeling for heterologous protein production by a recombinant yeast strain, and these models were also used for identification of targets for improving ethanol production through metabolic engineering (Nissen et al., 2000). Metabolic networks were also used together with 13C-labeling experiments as a successful tool to gain insight into the yeast metabolism (Gombert et al., 2001; Maaheimo et al., 2001). Carlson et al. (2002) reconstructed a metabolic model for yeast and used topological analysis to calculate the elementary flux modes (EFMs) of the system for poly-beta-hydroxybutyrate (PHB) production, and Förster et al. (2002) combined metabolomic data with metabolic network modeling to infer the function of some unannotated genes (orphans).

The development of prototype metabolic networks with focus on the central carbon metabolism of yeast and their applicability provides a solid basis for development of comprehensive metabolic models following the availability of the yeast genome sequence.

3.2. Genome-scale metabolic models

The first comprehensive reconstruction of the yeast metabolism, which also is the first genome-scale model for a eukaryotic organism, was a joint effort from the groups of Nielsen and Palsson (Förster et al., 2003a). The model consists of 708 metabolic genes associated with 1175 reactions and 733 metabolites. Three cellular compartments, namely cytosol, mitochondria and the extracellular space were included. The physiological validations and predictions by the model using flux balance analysis (FBA) were shown to have good agreements with many experimental datasets (Famili et al., 2003). Furthermore, large-scale in silico gene deletion analysis of the model showed high accuracy in predicting gene essentiality when comparing to in vivo knock-out phenotypes (Förster et al., 2003b). The first genome-scale model was named iFF708 where FF stands for Förster and Famili, which are the names of the model creators and 708 is the number of ORFs included in the model. After iFF708 had been released

Fig. 2. The figure shows the development of metabolic modeling in Saccharomyces cerevisiae during the past 16 years. Each box represents a metabolic model, the bold text is the name of the model as referred to in the text and the small text in each box summarizes the scope of that model. The arrows between the boxes show the relationship between the models. The graph shows the number of reactions, metabolites and genes in the different models. The first model of yeast metabolism was presented in 1995 and was a reconstruction of the central carbon metabolism. The first genome-scale metabolic model iFF708 was published in 2003 in the post-genomic era. This first genome-scale reconstruction has served as the basis for many new reconstructions. The latest model called Yeast 4.0 includes 932 metabolic genes, 1865 reactions and 1319 metabolites.
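The in silico gene deletion analysis mentioned in section 3.2 can be illustrated on a toy network. The sketch below is a deliberate simplification and not the FBA-based procedure used for iFF708: instead of solving a linear program, it checks by network expansion whether a target metabolite is still reachable from the nutrients after a gene is removed. The reactions, gene names (`gA` to `gF`), and metabolite names are hypothetical; gene-protein-reaction rules are written as a list of enzyme complexes (OR across the list, AND within a complex).

```python
def active_reactions(reactions, deleted_gene):
    """Keep reactions that still have at least one intact enzyme complex.

    Reactions without any gene association (empty rule) are always kept.
    """
    return [
        r for r in reactions
        if not r["gpr"] or any(deleted_gene not in cplx for cplx in r["gpr"])
    ]


def producible(reactions, nutrients):
    """Network expansion: repeatedly fire reactions whose substrates are available."""
    available = set(nutrients)
    changed = True
    while changed:
        changed = False
        for r in reactions:
            if r["substrates"] <= available and not r["products"] <= available:
                available |= r["products"]
                changed = True
    return available


def essential_genes(reactions, nutrients, target):
    """Genes whose single deletion makes the target metabolite unreachable."""
    genes = sorted({g for r in reactions for cplx in r["gpr"] for g in cplx})
    return [
        g for g in genes
        if target not in producible(active_reactions(reactions, g), nutrients)
    ]


# Hypothetical four-reaction network.
toy_reactions = [
    {"substrates": {"glc"}, "products": {"g6p"}, "gpr": [{"gA"}, {"gB"}]},  # isozymes
    {"substrates": {"g6p"}, "products": {"pyr"}, "gpr": [{"gC", "gD"}]},    # two-subunit complex
    {"substrates": {"pyr"}, "products": {"biomass"}, "gpr": [{"gE"}]},
    {"substrates": {"pyr"}, "products": {"lac"}, "gpr": [{"gF"}]},          # side branch
]

essential = essential_genes(toy_reactions, nutrients={"glc"}, target="biomass")
print(essential)
```

The isozymes `gA` and `gB` back each other up and the side-branch gene `gF` is dispensable, whereas both subunits of the complex and the final step are flagged. A full analysis would replace the reachability check with an FBA growth optimization.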

T. Österlund et al. / Biotechnology Advances 30 (2012) 979–988

Research review paper

Fifteen years of large scale metabolic modeling of yeast: Developments and impacts

Tobias Österlund, Intawat Nookaew, Jens Nielsen *
Department of Chemical and Biological Engineering, Chalmers University of Technology, SE-412 96 Gothenburg, Sweden

Article info

Available online 6 August 2011

Keywords: Genome-scale metabolic model; Systems biology; Metabolic engineering; Computational algorithms; Evolution

Abstract

Since the first large-scale reconstruction of the Saccharomyces cerevisiae metabolic network 15 years ago the development of yeast metabolic models has progressed rapidly, resulting in no less than nine different yeast genome-scale metabolic models. Here we review the historical development of large-scale mathematical modeling of yeast metabolism and the growing scope and impact of applications of these models in four different areas: as guide for metabolic engineering and strain improvement, as a tool for biological interpretation and discovery, applications of novel computational framework and for evolutionary studies.

© 2011 Elsevier Inc. All rights reserved.

Contents

1. Introduction
2. Framework for reconstructing genome-scale metabolic models
2.1. Metabolic network reconstruction
2.2. Mathematical formulation and debugging
2.3. Validation with experimental data
3. Development of yeast metabolic models
3.1. Models of central carbon metabolism
3.2. Genome-scale metabolic models
4. Applications
4.1. Guidance for metabolic engineering and strain improvement
4.2. Biological interpretation and discovery
4.3. Applications of novel computational framework
4.4. Evolutionary elucidation
5. Conclusions
Acknowledgment
References

1. Introduction

The yeast Saccharomyces cerevisiae serves as an important cell factory in biotech production of food, beer, wine, nutraceuticals, pharmaceuticals, chemicals and fuels. It is also a very important model organism for eukaryal biology as it has a number of features that are conserved with higher eukaryotes, including humans. Its genome was among the first to be completely sequenced (Cherry et al., 1997; Goffeau et al., 1996) and many functional genomics tools have been pioneered using this yeast as a model organism (Chien et al., 1991; Winzeler et al., 1999; Wodicka et al., 1997). Thus there are comprehensive databases available, including the highly structured Saccharomyces Genome Database (SGD) (www.yeastgenome.org) (Weng et al., 2003). Many different yeast strains have been sequenced with the objective to understand evolution towards different kinds of applications (Borneman et al., 2008; Legras et al., 2007; Rainieri et al., 2006), e.g. wine, bread and beer production, and for providing a basis for advancing metabolic engineering (Otero et al., 2011). With the advancement of systems biology, in particular for gaining new insight from high-throughput experimental data, S. cerevisiae has also played an important role (Mustacchi et al., 2006; Nielsen and Jewett, 2008). In this interface between experiments and mathematical modeling the concept of genome-scale metabolic models (Covert et al., 2001a) plays an important role, as it allows for direct integration of high-throughput experimental data with mathematical modeling, and hence advance our

* Corresponding author. Tel.: +46 31 772 3804; fax: +46 31 772 3801. E-mail address: [email protected] (J. Nielsen).

0734-9750/$ – see front matter © 2011 Elsevier Inc. All rights reserved. doi:10.1016/j.biotechadv.2011.07.021

many different branches: many problems

Page 12: Genome-Scale Metabolic Networks: reconstruction, properties, and

Application: Drug discovery

motivation: resistance against existing drugs

Page 13: Genome-Scale Metabolic Networks: reconstruction, properties, and

Application: Drug discovery - malaria

Reconstruction and flux-balance analysis of the Plasmodium falciparum metabolic network

German Plata1,2,6, Tzu-Lin Hsiao1,3,6, Kellen L Olszewski4,5, Manuel Llinas4,5,* and Dennis Vitkup1,3,*

1 Center for Computational Biology and Bioinformatics, Columbia University, New York City, NY, USA
2 Integrated Program in Cellular, Molecular, Structural and Genetic Studies, Columbia University, New York City, NY, USA
3 Department of Biomedical Informatics, Columbia University, New York City, NY, USA
4 Department of Molecular Biology, Princeton University, Princeton, NJ, USA
5 Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ, USA
6 These authors contributed equally to this work
* Corresponding authors. D Vitkup, Department of Biomedical Informatics, Center for Computational Biology and Bioinformatics, Columbia University, 1130 St Nicholas Avenue 803A, New York City, NY 10032, USA. Tel.: +1 212 851 5151; Fax: +1 212 851 5149; E-mail: [email protected] or M Llinas, Department of Molecular Biology, Lewis-Sigler Institute for Integrative Genomics, Princeton University, 246 Carl Icahn Lab, Princeton, NJ 08544, USA. Tel.: +1 609 258 9391; Fax: +1 609 258 3565. E-mail: [email protected]

Received 19.4.10; accepted 9.7.10

Genome-scale metabolic reconstructions can serve as important tools for hypothesis generation and high-throughput data integration. Here, we present a metabolic network reconstruction and flux-balance analysis (FBA) of Plasmodium falciparum, the primary agent of malaria. The compartmentalized metabolic network accounts for 1001 reactions and 616 metabolites. Enzyme–gene associations were established for 366 genes and 75% of all enzymatic reactions. Compared with other microbes, the P. falciparum metabolic network contains a relatively high number of essential genes, suggesting little redundancy of the parasite metabolism. The model was able to reproduce phenotypes of experimental gene knockout and drug inhibition assays with up to 90% accuracy. Moreover, using constraints based on gene-expression data, the model was able to predict the direction of concentration changes for external metabolites with 70% accuracy. Using FBA of the reconstructed network, we identified 40 enzymatic drug targets (i.e. in silico essential genes), with no or very low sequence identity to human proteins. To demonstrate that the model can be used to make clinically relevant predictions, we experimentally tested one of the identified drug targets, nicotinate mononucleotide adenylyltransferase, using a recently discovered small-molecule inhibitor.

Molecular Systems Biology 6: 408; published online 7 September 2010; doi:10.1038/msb.2010.60
Subject Categories: metabolic and regulatory networks; microbiology and pathogens
Keywords: flux-balance analysis; Plasmodium falciparum metabolism; systems biology

This is an open-access article distributed under the terms of the Creative Commons Attribution Noncommercial No Derivative Works 3.0 Unported License, which permits distribution and reproduction in any medium, provided the original author and source are credited. This license does not permit commercial exploitation or the creation of derivative works without specific permission.

Introduction

Malaria is an ancient disease, which can be dated back to 2800 BC (Nerlich et al, 2008), and remains one of the most severe public health challenges worldwide. Currently, about half of the Earth’s population is at risk from this infectious disease according to the World Health Organization (WHO, 2008). Malaria inflicts acute illness on hundreds of millions of people worldwide and leads to at least one million deaths annually (Baird, 2005; WHO, 2008). It ranks as a leading cause of death and disease in many developing countries, where the most affected groups are young children and pregnant women (WHO, 2008). The disease is transmitted to humans by the female Anopheles mosquito and is caused by at least five species of Plasmodium parasites. The life cycle of the parasite is highly complex and includes various hosts and tissue types.

During a blood meal, sporozoites are transmitted from the mosquito to humans and initiate infection in the liver where they reproduce prolifically but are asymptomatic. In the next stage of infection, the parasites are released from the liver cyst into the bloodstream in the form of merozoites, where they invade red blood cells (RBCs) and reproduce asexually (Aly et al, 2009). The destruction of RBCs coupled with the significant load imposed on the host metabolism is ultimately responsible for the major clinical symptoms of malaria, which are often fatal (Haldar and Mohandas, 2009).

Although several anti-malarial drugs are currently available, most of them are losing efficacy due to acquired drug resistance in the most lethal causative agent, Plasmodium falciparum (Wongsrichanalai et al, 2002; Mackinnon and Marsh, 2010). The loss of drug efficiency in resistant strains poses a great threat to malaria control and has been linked to

Molecular Systems Biology 6; Article number 408; doi:10.1038/msb.2010.60
Citation: Molecular Systems Biology 6:408
© 2010 EMBO and Macmillan Publishers Limited. All rights reserved 1744-4292/10
www.molecularsystemsbiology.com

schizont developmental stages (see Materials and methods). Following Colijn et al (2009), the maximum flux allowed through enzymes was constrained proportionally to the relative expression level of the corresponding genes.

We compared the accuracy of our predictions to the experimentally measured metabolic changes in Plasmodium-infected RBCs (Olszewski et al, 2009). In Figure 3, we show the predicted and experimentally measured changes, indicating either an increase or decrease in metabolic concentrations for the transition from the ring to trophozoite and from trophozoite to schizont stages. The predicted shifts in metabolic concentrations agree with the experimental results in 70% of the measurements (binomial, P-value = 9 × 10⁻⁴). In addition, we found a significant correlation between the magnitudes of the change in metabolite concentrations and the predicted flux values (Pearson’s correlation: 0.34, P-value = 6 × 10⁻³, Spearman’s correlation: 0.25, P-value = 0.04).

In order to further investigate the statistical significance of the results, we repeated flux predictions after randomly shuffling expression values between P. falciparum genes. In only 2% of these random trials, the accuracy of the predictions made with the shuffled data were higher than those obtained using the original expression values (Supplementary Figure S3). To explore the effects of multiple optimal FBA solutions (Mahadevan and Schilling, 2003) on the prediction accuracy, we used the centering hit-and-run algorithm (Kaufman and Smith, 1998), implemented in the COBRA toolbox (Becker et al, 2007), to randomly sample the solution space associated with the expression constraints. The 70% accuracy value, obtained for a single solution, is close to the mean of solutions sampled from alternative optima (mean 0.69, s.d. 0.05; see Supplementary Figure S3). Moreover, there is a significant difference (Mann–Whitney U, P-value < 10⁻¹⁰) between the results for randomized expression values and those based on the multiple alternative optima. These results illustrate the ability of the model, with appropriate constraints, to predict physiological changes unrelated to gene knockouts. It also suggests that expression and metabolomics measurements, which are being rapidly accumulated for various stages of parasite growth (Winzeler, 2008; Kafsack and Llinas, 2010), can be integrated with the model to gain a better understanding of the P. falciparum physiology.
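The binomial test quoted for the direction-of-change predictions can be sketched as follows. This is an illustration under stated assumptions rather than a reproduction of the published analysis: the excerpt does not give the number of measurements, so the ten sign values below are hypothetical, and a chance level of 0.5 with a one-sided tail is assumed. The helper names `direction_accuracy` and `binomial_tail` are likewise made up for this sketch.

```python
from math import comb


def direction_accuracy(predicted, measured):
    """Count how often predicted and measured changes share the same sign."""
    pairs = list(zip(predicted, measured))
    hits = sum((p > 0) == (m > 0) for p, m in pairs)
    return hits, len(pairs)


def binomial_tail(successes, trials, p=0.5):
    """One-sided P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Hypothetical signs of concentration changes (+1 increase, -1 decrease).
predicted = [+1, -1, +1, +1, -1, -1, +1, -1, +1, +1]
measured = [+1, -1, +1, -1, -1, -1, +1, +1, +1, -1]

hits, total = direction_accuracy(predicted, measured)
p_value = binomial_tail(hits, total)
print(f"{hits}/{total} correct, one-sided P = {p_value:.4f}")
```

With only ten hypothetical measurements, 70% agreement is not significant at conventional thresholds, which shows that the reported P-value depends on the number of measurements as much as on the accuracy itself.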

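[Figure 2 graphic. Panel A: schematic of NAD(P) synthesis and recycling with host, NA, NM, NaMN, NaAD, NAD+, NADP+ and histone acetylation, the enzymes NMase, NPRT, NMNAT, NADS and NADK, and compound 1_03. Panel B: parasite stages (ring, troph, late troph, schizont, reinvasion) at 12, 24, 36, 48 and 66 h for control and 100 µM compound 1_03 cultures.]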

Figure 2 Small-molecule inhibition of the parasite nicotinate mononucleotide adenylyltransferase (NMNAT). (A) Schematic of the P. falciparum NAD(P) synthesis and recycling pathway determined from the genome sequence. Nicotinamide (NM) and nicotinic acid (NA) can be scavenged from the host. Compound 1_03 is an inhibitor targeting NMNAT. (B) Compound 1_03 causes growth arrest of intraerythrocytic P. falciparum. Cultures were resuspended in niacin-free medium containing 0 or 100 µM of compound 1_03 at early ring stage and observed for 66 h (see Materials and methods). Untreated parasites undergo normal development and reinvasion, whereas drug-treated parasites arrest at the trophozoite (‘troph’) stage and do not reinvade. NM, nicotinamide; NA, nicotinic acid; NaMN, nicotinate mononucleotide; NaAD, nicotinate adenine dinucleotide; NAD(P)+, nicotinamide adenine dinucleotide (phosphate), reduced; NMase, nicotinamidase; NPRT, nicotinate phosphoribosyltransferase; NMNAT, nicotinate mononucleotide adenylyltransferase; NADS, NAD synthase; NADK, NAD kinase.

metabolic model

essential genes

non-human homologs

check for reported drugs in other pathogens

biol. experiment

no growth / growth

targeted experiments
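The workflow on this slide (essential genes, then non-human homologs, then a check for reported drugs in other pathogens) can be sketched as a small filtering and ranking step. Everything in this example is hypothetical: the gene names, the sequence identity values, the 0.3 identity cutoff, and the function name `prioritize_targets` are assumptions made for illustration and do not come from the Plata et al. study.

```python
def prioritize_targets(essential, best_human_identity, reported_drug_targets,
                       max_identity=0.3):
    """Rank in silico essential genes as candidate drug targets.

    essential: set of genes predicted essential by the metabolic model
    best_human_identity: gene -> best sequence identity to any human protein
        (genes without an entry are treated as having no detectable homolog)
    reported_drug_targets: genes whose homologs are drug targets in other pathogens
    max_identity: assumed cutoff above which a human homolog is considered too similar
    """
    candidates = []
    for gene in sorted(essential):
        identity = best_human_identity.get(gene, 0.0)
        if identity <= max_identity:
            candidates.append({
                "gene": gene,
                "human_identity": identity,
                "has_reported_drug": gene in reported_drug_targets,
            })
    # Genes with a reported drug come first, then lowest similarity to human proteins.
    candidates.sort(
        key=lambda c: (not c["has_reported_drug"], c["human_identity"], c["gene"])
    )
    return candidates


# Hypothetical inputs.
essential = {"gene_a", "gene_b", "gene_c", "gene_d"}
best_human_identity = {"gene_a": 0.12, "gene_b": 0.65, "gene_c": 0.05}
reported_drug_targets = {"gene_a"}

ranked = [c["gene"] for c in
          prioritize_targets(essential, best_human_identity, reported_drug_targets)]
print(ranked)
```

The ranked list is only a shortlist for the targeted experiments at the end of the workflow; the growth or no-growth outcome of those experiments remains the actual test of each candidate.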

Page 14: Genome-Scale Metabolic Networks: reconstruction, properties, and

Difficult problems/challenges

• combinatorial explosion

• reaction direction assignment

• compartmentalization of euk. models

• automation of reconstruction protocol for eukaryotes

• integration of exp. data and predictions

• reconstruction of models with few exp. data

• tissue-specific models

• multicellular models

• whole organism models