1
Semantic Aggregation, Integration, and
Inference of Pathway Data
Co-Destructors:
Joanne Luciano, PhD [email protected]
Jeremy [email protected]
ISMB 2005 Tutorial, Detroit, Michigan, June 25th 2005
http://www.biopathways.org/ismb2005tutorial-am6/
(Pedantic Aggravation, Irritation, and Interference)
2
Overview
Introduction (45 minutes)
Time Out (15 minutes)
Workshop Case Studies & Exercises (2 hrs 15 minutes)
Subdivide into groups of triads and dyads
• Case Study I (45 minutes)
• Case Study II (45 minutes)
• Case Study III (45 minutes)
Time Out (15 minutes)
Lessons Learned (30 minutes)
Lessons Not Yet Learned (take home)
3
Introduction (45 minutes)
Semantic Aggregation, Integration and Inference of Pathway Data
Pathway Data (domain)
– What is it?
– What does it look like?
– Why do we care? (motivation)
Definitions & Disclaimers
Strategies
4
What is it?
Pathway Databases
So many pathway databases, so little time.
Pathway Data (domain)
Graphic from Mike Cary and Gary Bader
5
The Main Categories
Different types of pathways
• Metabolic Pathways: Glycolysis
• Molecular Interaction Networks: Protein-Protein
• Signaling Pathways: Apoptosis
• Gene Regulation: Lac Operon
(different strokes for different folks, it’s OK.)
6
Different representations of the same pathways
KEGG Reference Pathway GLYCOLYSIS
<!ELEMENT reaction (substrate*,product*)>
<!ATTLIST reaction name %keggid.type; #REQUIRED>
<!ATTLIST reaction type %reaction-type.type; #REQUIRED>
<!ELEMENT substrate EMPTY>
<!ATTLIST substrate name %keggid.type; #REQUIRED>
<!ELEMENT product EMPTY>
<!ATTLIST product name %keggid.type; #REQUIRED>
starts at -D-Glucose 1P
7
Different representations of the same pathways
BioCYC Reference Pathway GLYCOLYSIS
reactions.dat This file lists all chemical reactions in the PGDB.
Attributes: UNIQUE-ID TYPES COMMON-NAME ACTIVATORS BASAL-TRANSCRIPTION-VALUE DBLINKS DELTAG0 DEPRESSORS EC-LIST EC-NUMBER ENZYMATIC-REACTION EQUILIBRIUM-CONSTANT IN-PATHWAY INHIBITORS LEFT MOVED-IN MOVED-OUT OFFICIAL-EC? REACTANTS REQUIREMENTS RIGHT SIGNAL SPECIES SPONTANEOUS? STIMULATORS SYNONYMS
starts at -D-glucose6-phosphate
8
Different representations of the same pathways
Reactome Pathway GLYCOLYSIS
<reaction name="R_alpha_D_glucose_6_phosphate_D_fructose_6_phosphate" id="R_163457">
<listOfReactants>
<speciesReference species="R_30537_alpha_D_Glucose_6_phosphate" />
</listOfReactants>
<listOfProducts>
<speciesReference species="R_29512_D_Fructose_6_phosphate" />
</listOfProducts>
<listOfModifiers>
<modifierSpeciesReference species="R_163455_glucose_6_phosphate_isomerase_dimer_name_copied_from_complex_in_Homo_sapiens_" />
</listOfModifiers>
</reaction>
DatabaseObject [41245]
Event [8285]
Reaction [6598]
ConcreteReaction [4034]
GenericReaction [2564]
9
Different representations of the same pathways
BioCarta Reference Pathway GLYCOLYSIS
Does not compute.
Pretty, but useless
Starts at Glucose (but it doesn’t matter)
Reactions clickable but...
10
Pathway Data Why do we care?
Pathway Research has Broad Impact
– Drug Discovery (pathway of target, safety)
– Basic Science (identify pathways)
– Disease Research (cancer pathways)
– Environmental Research (microbial research)
Combine knowledge from multiple sources
– Whole is greater than the sum of its parts
– Biological knowledge is fragmented
– Need database to manage resources
11
Definitions & Disclaimers
Aggregation: 2 (or more) data sources, different data models, common link between (among) them.
Integration: 2 (or more) data sources, same data model, semantic mapping and instance merging required.
Inference: 1 (or more) data sources, one data model, creating new instances or new relationships. (Evidence code type kind of “inference”)
Disclaimer: “Controlled” Vocabulary scope = this tutorial
12
Assembling Knowledge: Aggregation, Integration, Inference
Use Case I, Use Case II, Use Case III
“When it comes to data cleaning, there’s no such thing as a free lunch.” Tim Berners-Lee
Some tasks are specific to a use case, some are common to more than one and there’s no escaping others.
13
Bridging Chemistry and Molecular Biology
Uniprot:P49841
• Different Views have different semantics: Lenses
• When there is a correspondence between objects, a semantic binding is possible
Apply Correspondence Rule:
if ?target.xref.lsid == ?bpx:prot.xref.lsid
then ?target.correspondsTo.?bpx:prot
Source: Eric Neumann Haystack BioDASH Demo http://www.w3.org/2005/04/swls/BioDash/Demo/
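The correspondence rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the BioDASH implementation: the `Record` class, the record names and the LSID strings are hypothetical, built around the Uniprot:P49841 identifier on the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """A hypothetical object from one 'lens' (e.g. a drug target or a BioPAX protein)."""
    name: str
    xref_lsids: set  # LSIDs that this record cross-references
    corresponds_to: list = field(default_factory=list)


def apply_correspondence_rule(targets, proteins):
    """Bind each target to every protein that shares at least one xref LSID."""
    for target in targets:
        for prot in proteins:
            if target.xref_lsids & prot.xref_lsids:
                target.corresponds_to.append(prot.name)
    return targets


# Illustrative LSID strings; only the accession P49841 comes from the slide.
shared = "urn:lsid:uniprot.org:uniprot:P49841"
targets = [Record("target_1", {shared})]
proteins = [
    Record("bpx_protein_P49841", {shared}),
    Record("bpx_protein_other", {"urn:lsid:example.org:protein:1"}),
]

apply_correspondence_rule(targets, proteins)
print(targets[0].corresponds_to)
```

The binding is made on shared identifiers only, so the two views keep their own semantics and are merely linked.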
14
[Figure: RDF data sources linked through shared entities. Sources: UniProt.rdf, GO.rdf, Keywords.rdf, Taxonomy.rdf, PubMed.xml, IntAct.rdf, Enzymes.rdf, OMIM.rdf, KEGG.rdf, ProbeSet.rdf, GO2Keyword.rdf, GO2OMIM.rdf, GO2Enzyme.rdf, GO2UniProt.rdf. Shared entities: Citation, Organism, MIM Id, Keyword, Protein, Enzyme, Gene, Probe, Pathway, Compound]
1. Differentiate different forms of disease
2. Identify patient subgroups
3. Identify top biomarkers
4. Identify function
5. Identify biological and chemical properties and disease associations of biomarker
6. Identify documents
7. Identify role in metabolic pathways
8. Identify compounds that interact
9. Identify and compare function in other organisms
10. Identify any prior art
Seamark Demonstration: Identification of new drug candidates
15
SBML integration using BioPAX
Use BioPAX to Address SBML’s data integration issues
• Different data types, same representation
• Same data, different representations
• External references…
• Synonyms…
• Provenance…
16
A problem: same representation, different semantics (SBML)
Protein-Protein Interaction
<reaction id="pyruvate_dehydrogenase_cplx">
  <listOfReactants>
    <speciesRef species="PdhA"/>
    <speciesRef species="PdhB"/>
  </listOfReactants>
  <listOfProducts>
    <speciesRef species="Pyruvate_dehydrogenase_E1"/>
  </listOfProducts>
</reaction>
Biochemical Reaction
<reaction id="pyruvate_dehydrogenase_rxn">
  <listOfReactants>
    <speciesRef species="NADP+"/>
    <speciesRef species="CoA"/>
    <speciesRef species="pyruvate"/>
  </listOfReactants>
  <listOfProducts>
    <speciesRef species="NADPH"/>
    <speciesRef species="acetyl-CoA"/>
    <speciesRef species="CO2"/>
  </listOfProducts>
  <listOfModifiers>
    <modifierSpeciesRef species="pyruvate_dehydrogenase_E1"/>
  </listOfModifiers>
</reaction>
17
SBML annotated with BioPAX
<sbml xmlns:bp="http://www.biopax.org/release1/biopax-release1.owl"
      xmlns:owl="http://www.w3.org/2002/07/owl#"
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <listOfSpecies>
    <species id="PdhA" metaid="PdhA">
      <annotation>
        <bp:protein rdf:ID="#PdhA"/>
      </annotation>
    </species>
    <species id="NADP+" metaid="NADP+">
      <annotation>
        <bp:smallMolecule rdf:ID="#NADP+"/>
      </annotation>
    </species>
  </listOfSpecies>
  <listOfReactions>
    <reaction id="pyruvate_dehydrogenase_cplx">
      <annotation>
        <bp:complexAssembly rdf:ID="#pyruvate_dehydrogenase_cplx"/>
      </annotation>
    </reaction>
  </listOfReactions>
</sbml>
species is protein
protein is PdhA
species is small molecule
small molecule is NADP+
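A minimal sketch of how a consumer might read such annotations, assuming the simplified SBML-like markup of this slide rather than a schema-valid SBML file. The function name `biopax_types` is hypothetical; only Python's standard `xml.etree.ElementTree` is used.

```python
import xml.etree.ElementTree as ET

# Namespace URI as written on the slide.
BP = "http://www.biopax.org/release1/biopax-release1.owl"

SBML = """<sbml xmlns:bp="http://www.biopax.org/release1/biopax-release1.owl"
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <listOfSpecies>
    <species id="PdhA" metaid="PdhA">
      <annotation><bp:protein rdf:ID="#PdhA"/></annotation>
    </species>
    <species id="NADP+" metaid="NADP+">
      <annotation><bp:smallMolecule rdf:ID="#NADP+"/></annotation>
    </species>
  </listOfSpecies>
  <listOfReactions>
    <reaction id="pyruvate_dehydrogenase_cplx">
      <annotation><bp:complexAssembly rdf:ID="#pyruvate_dehydrogenase_cplx"/></annotation>
    </reaction>
  </listOfReactions>
</sbml>"""


def biopax_types(sbml_text):
    """Map each annotated element id to the local name of its BioPAX class."""
    root = ET.fromstring(sbml_text)
    types = {}
    for element in root.iter():
        annotation = element.find("annotation")  # direct child only
        if annotation is None:
            continue
        for child in annotation:
            if child.tag.startswith("{" + BP + "}"):
                types[element.get("id")] = child.tag.split("}", 1)[1]
    return types


types = biopax_types(SBML)
for element_id, bp_class in types.items():
    print(element_id, "is a", bp_class)
```

With the annotation in place, the same `species` and `reaction` elements can be told apart as protein, small molecule or complex assembly.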
18
BioPAX: External References
<species id="pyruvate" metaid="pyruvate">
  <annotation xmlns:bp="http://biopax.org/release1/biopax-release1.owl">
    <bp:smallMolecule rdf:ID="#pyruvate">
      <bp:Xref>
        <bp:unificationXref rdf:ID="#unificationXref119">
          <bp:DB>LIGAND</bp:DB>
          <bp:ID>c00022</bp:ID>
        </bp:unificationXref>
      </bp:Xref>
    </bp:smallMolecule>
  </annotation>
</species>
19
BioPAX: Synonyms
<species id="pyruvate" metaid="pyruvate">
  <annotation xmlns:bp="http://biopax.org/release1/biopax_release1.owl">
    <bp:smallMolecule rdf:ID="#pyruvate">
      <bp:SYNONYMS>2-oxo-propionic acid</bp:SYNONYMS>
      <bp:SYNONYMS>2-oxopropanoate</bp:SYNONYMS>
      <bp:SYNONYMS>BTS</bp:SYNONYMS>
      <bp:SYNONYMS>pyruvic acid</bp:SYNONYMS>
    </bp:smallMolecule>
  </annotation>
</species>
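A sketch of why unification xrefs and synonyms matter for instance merging: records from different sources that point at the same (DB, ID) pair are treated as the same molecule and their names are pooled. The record layout, the source names `db_one` and `db_two`, and the case-insensitive ID comparison are assumptions made for illustration; this is not a BioPAX API.

```python
def merge_by_unification_xref(records):
    """Group records by (DB, ID) and pool their names, synonyms and sources."""
    merged = {}
    for rec in records:
        # Assumption: accession IDs are compared case-insensitively (c00022 == C00022).
        key = (rec["db"], rec["id"].upper())
        entry = merged.setdefault(key, {"names": set(), "sources": set()})
        entry["names"].add(rec["name"])
        entry["names"].update(rec.get("synonyms", []))
        entry["sources"].add(rec["source"])
    return merged


# Hypothetical records from two hypothetical sources.
records = [
    {"source": "db_one", "name": "pyruvate", "db": "LIGAND", "id": "c00022",
     "synonyms": ["pyruvic acid", "2-oxopropanoate"]},
    {"source": "db_two", "name": "pyruvic acid", "db": "LIGAND", "id": "C00022",
     "synonyms": ["BTS"]},
    {"source": "db_two", "name": "acetyl-CoA", "db": "LIGAND", "id": "C00024"},
]

merged = merge_by_unification_xref(records)
for key, entry in merged.items():
    print(key, sorted(entry["names"]), sorted(entry["sources"]))
```

Matching on names alone would also have worked for pyruvate here, but only by luck; the xref is what makes the merge safe.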
20
Strategies
• Develop bridging technologies
• Develop pathway representation standard within the Life Science community (BioPAX) (Social Engineering!)
• Utilize Semantic Web Integration Technologies (LSID, RDF/OWL)
How do we get to a Standard Pathway Representation? (Game plan: Take over the world or have the world take over itself?)
21
Exchange Formats in Pathway Data Space (Scope)
[Figure: scope of exchange formats across pathway data. Database Exchange Formats: BioPAX, PSI-MI 2. Simulation Model Exchange Formats: SBML, CellML. Areas shown: Genetic Interactions; Molecular Interactions (Pro:Pro, All:All); Interaction Networks (Molecular, Non-molecular; Pro:Pro, TF:Gene, Genetic); Regulatory Pathways (Low Detail to High Detail); Metabolic Pathways (Low Detail to High Detail); Biochemical Reactions; Small Molecules (Low Detail to High Detail); Rate Formulas]
Graphic from Mike Cary & Gary Bader
22
BioPAX Objectives
• Accommodate existing database representations
• Integration and exchange of pathway data
• Interchange through a common (standard) representation
• Provide a basis for future databases
• Enable development of tools for searching and reasoning over the data
23
BioPAX Motivation
Before BioPAX With BioPAX
Common format will make data more accessible, promoting data sharing and distributed curation efforts
>180 DBs and tools
Database
Application
User
24
BioPAX: Biological PAthway eXchange
A data exchange ontology and format for biological pathway integration, aggregation and inference
Initiative arose from the community
25
Biological pathways of the Cell
What is a Pathway?
• Metabolic Pathways: Glycolysis
• Molecular Interaction Networks: Protein-Protein
• Signaling Pathways: Apoptosis
• Gene Regulation: Lac Operon
[Figure: the pathway categories above, marked with the coverage of BioPAX Level 1 and BioPAX Level 2]
26
Aggregation, Integration, Inference
1. Multiple kinds of pathway databases
– metabolic
– molecular interactions
– signal transduction
2. Constructs designed for integration
– DB References
– XRefs (Publication, Unification, Relationship)
– synonyms
– provenance
3. OWL DL – to enable reasoning
27
phosphoglucoseisomerase 5.3.1.9
OWL(schema)
Instances (Individuals)
(data)
BioPAX Biochemical Reaction
28
BioPAX Ontology
• Conceptual framework based upon existing DB schemas: aMAZE, BIND, EcoCyc, WIT, KEGG, Reactome, etc.
• Allows wide range of detail, multiple levels of abstraction
• BioPAX ontology in OWL (XML)
• Designed for pathway database integration
– Database ID
– Unification X-REF
– Relationship X-REF
– Publication X-REF
– Synonyms
– Provenance
29
BioPAX uses other ontologies
• Use pointers to existing ontologies to provide supplemental annotation where appropriate
– Cellular location: GO Component
– Cell type: Cell.obo
– Organism: NCBI taxon DB
• Incorporate other standards where appropriate
– Chemical structure: SMILES, CML, INCHI
30
BioPAX Ontology: Overview
Level 1 v1.0 (July 7th, 2004)
parts
how the parts are known to interact
a set of interactions
31
BioPAX Ontology: Top Level
• Pathway
– A set of interactions
– E.g. Glycolysis, MAPK, Apoptosis
• Interaction
– A set of entities and some relationship between them
– E.g. Reaction, Molecular Association, Catalysis
• Physical Entity
– A building block of simple interactions
– E.g. Small molecule, Protein, DNA, RNA
[Figure: class diagram relating Entity, Pathway, Interaction and Physical Entity. Legend: Subclass (is a), Contains (has a)]
Graphic from Gary Bader
32
BioPAX Ontology: Root
• Root class: Entity– Any concept referred to as a discrete biological unit when describing pathways. This is the root class for all biological concepts in the ontology, which include pathways, interactions and physical entities
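The top-level classes can be mirrored as a small Python class hierarchy. This is a sketch for orientation only: the real ontology is defined in OWL, and the attribute names used here (`participants`, `interactions`) are simplifications chosen for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Entity:
    """Root class: any discrete biological unit used when describing pathways."""
    name: str


@dataclass
class PhysicalEntity(Entity):
    """A building block of simple interactions (small molecule, protein, DNA, RNA)."""


@dataclass
class Interaction(Entity):
    """A set of entities and some relationship between them."""
    participants: List[Entity] = field(default_factory=list)

    def __post_init__(self):
        # An interaction cannot be defined without the entities it relates.
        if not self.participants:
            raise ValueError("an interaction needs at least one participant")


@dataclass
class Pathway(Entity):
    """A set of interactions."""
    interactions: List[Interaction] = field(default_factory=list)


g6p = PhysicalEntity("glucose-6-phosphate")
f6p = PhysicalEntity("fructose-6-phosphate")
isomerization = Interaction("glucose-6-phosphate isomerization", [g6p, f6p])
glycolysis = Pathway("Glycolysis", [isomerization])

print(isinstance(glycolysis, Entity), len(glycolysis.interactions))
```

"Subclass (is a)" maps onto inheritance and "Contains (has a)" onto the list-valued attributes.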
33
Metabolic Pathways
Interaction sub-classes
Definition: An entity that defines a single biochemical interaction between two or more entities.
An interaction cannot be defined without the entities it relates.
participants
34
Metabolic Pathways
Interaction sub-classes
Definition: Two terms exist under interaction: Control and conversion. In future BioPAX levels, this list may be extended to include other classes, such as genetic interactions.
Examples: Enzyme catalysis controls a biochemical reaction, transport catalysis controls transport, a small molecule that inhibits a pathway by an unknown mechanism controls the pathway.
35
BioPAX as a solution to Aggregation, Integration, Inference
1. Multiple kinds of pathway databases
– metabolic
– molecular interactions
– signal transduction
– gene regulatory
2. Constructs designed for integration
– DB References
– XRefs (Publication, Unification, Relationship)
– Synonyms
– Provenance (not yet implemented)
3. OWL DL – to enable reasoning
36
Time Out
(15 minutes)
37
Workshop Case Studies & Exercises
(2 hrs 15 minutes)
Break into groups of triads and dyads
Case Study I (45 minutes)
• Use Case 1: Inference of a Metabolic Flux Model from an Annotated Genome
• Group Exercise 1
Case Study II (45 minutes)
• Use Case 2: Integration of a metabolic flux model from two sources
• Group Exercise 2
Case Study III (45 minutes)
• Use Case 3: Multi-source aggregation Validation and Testing
• Group Exercise 3
38
Methodology
• Define the goal of the integration
– How will the integrated data be used?
– This defines the level of integration from syntactic through semantic
• Take stock of current resources
– This defines your starting point
– Database sources, programmers, lab access, collaborators
• Scope the work to get from B to A
– Data Profiling
– Resource Profiling
39
3 Case Studies
• Case study I: Semantic Inference of metabolic pathway data from an annotated genome.
• Case study II: Semantic Integration of a metabolic flux model from two sources.
• Case study III: Semantic Aggregation of pathway data from multiple sources
40
Case Study I: Inference of a Metabolic Flux Model from an Annotated Genome
• Objective: To apply biological knowledge to constrain the possible behaviors of a metabolic network.
• Resources: Annotated Genome, Transport DB, Pathway databases, experimental community, published literature
41
Genes make RNA make Protein
[Figure: Gene1 to Gene9 are transcribed into RNA 1 to RNA 9, which are translated into proteins P1 to P9. Legend: Gene, RNA, Protein; Transcription, Translation; Enzyme, Transporter]
42
Proteins catalyze biochemical reactions
[Figure: proteins P1 to P9 catalyze reactions among metabolites A to F across the periplasm and cytoplasm. Legend: Metabolites: A-F; Reaction; Enzyme; Transporter; Catalyzes]
43
Biochemical reactions comprise a metabolic network
[Figure: network of metabolites A to F connected by reactions R1 to R9. Uptake: R1, R5. Waste: R9. Biomass: R8. Legend: Exchange, Intracellular, Objective]
44
Metabolic Inference Subgoals
1. Infer genes from sequence and homology
2. Infer enzymatic reactions from Enzyme Commission (EC) numbers
3. Infer metabolic reaction network from enzymatic reactions and metabolites.
4. Infer pathway holes using network debugging algorithms
5. Propose candidate enzymes using pathway-hole filling algorithms
6. Add experimentally verified candidates to the annotated genome
7. Lather, rinse, repeat
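Subgoals 2 and 4 can be illustrated with a toy lookup. This is a sketch under stated assumptions: the gene names, the tiny EC table and the expected pathway list are hypothetical stand-ins for a genome annotation and a reference pathway database, and the lookup stands in for the real network debugging algorithms.

```python
# Hypothetical EC number -> reaction table (a stand-in for a reference database).
EC_TO_REACTION = {
    "5.3.1.9": "glucose-6-phosphate <=> fructose-6-phosphate",
    "2.7.1.11": "fructose-6-phosphate + ATP -> fructose-1,6-bisphosphate + ADP",
    "4.1.2.13": "fructose-1,6-bisphosphate <=> glycerone phosphate + glyceraldehyde-3-phosphate",
}

# Hypothetical annotated genome: gene -> EC number (None = no EC assignment).
ANNOTATED_GENES = {"gene1": "5.3.1.9", "gene2": "2.7.1.11", "gene3": None}

# EC numbers expected in the reference pathway.
PATHWAY_ECS = ["5.3.1.9", "2.7.1.11", "4.1.2.13"]


def infer_reactions(genes):
    """Subgoal 2: attach a reaction to every gene whose EC number is known."""
    return {gene: EC_TO_REACTION[ec] for gene, ec in genes.items() if ec in EC_TO_REACTION}


def pathway_holes(genes, pathway_ecs):
    """Subgoal 4: pathway steps for which no gene in the genome carries the EC number."""
    present = set(genes.values())
    return [ec for ec in pathway_ecs if ec not in present]


reactions = infer_reactions(ANNOTATED_GENES)
holes = pathway_holes(ANNOTATED_GENES, PATHWAY_ECS)
unassigned = [gene for gene, ec in ANNOTATED_GENES.items() if ec is None]

print(reactions)
print("pathway holes:", holes)
print("genes without EC number:", unassigned)
```

Genes without an EC number are natural candidates for filling the holes, which is where subgoal 5 picks up.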
45
Data Profiling of the Annotated Genome
• Orphaned genes
• Orphaned enzymes
• Misannotated genes
• Misannotated enzymes
• Sequencing errors
• BLAST Algorithm errors
46
Schema Level Errors
Gene that codes for the gene product (protein enzyme)
Enzyme (protein) that catalyzes the biochemical reaction
Biochemical reaction
Biochemical reaction
47
Semantic bugs revealed by chemical structure
EcoCyc 7.5 Pathway: Riboflavin and FMN and FAD biosynthesis
No place to go!
4-(1-D-ribitylamino)-5-amino-2,6-dihydroxypyrimidine:
48
Semantic bugs revealed by chemical structure
EcoCyc 8.0 Pathway: Riboflavin and FMN and FAD biosynthesis
Synonyms
4-(1-D-ribitylamino)-5-amino-2,6-dihydroxypyrimidine:
49
Data Profiling of Pathway/Genome Database
• Unbalanced Reactions
• Pathway holes
• Unproducible metabolites
• Generalized Metabolites
• Unconsumable metabolites (toxins)
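The first item, unbalanced reactions, is the easiest to profile automatically: count each element on both sides. A minimal sketch, assuming neutral-form elemental compositions supplied by hand; the `FORMULAS` table and the deliberately lumped second reaction are illustrative, not taken from any database.

```python
from collections import Counter

# Illustrative elemental compositions (neutral forms).
FORMULAS = {
    "glucose-6-phosphate": {"C": 6, "H": 13, "O": 9, "P": 1},
    "fructose-6-phosphate": {"C": 6, "H": 13, "O": 9, "P": 1},
    "glucose": {"C": 6, "H": 12, "O": 6},
    "pyruvate": {"C": 3, "H": 4, "O": 3},
}


def element_imbalance(reaction):
    """Return {element: products minus reactants}; an empty dict means balanced.

    `reaction` maps compound name -> stoichiometric coefficient,
    negative for reactants and positive for products.
    """
    net = Counter()
    for compound, coeff in reaction.items():
        for element, count in FORMULAS[compound].items():
            net[element] += coeff * count
    return {element: n for element, n in net.items() if n != 0}


isomerase = {"glucose-6-phosphate": -1, "fructose-6-phosphate": 1}
lumped = {"glucose": -1, "pyruvate": 2}  # cofactors deliberately left out

print(element_imbalance(isomerase))
print(element_imbalance(lumped))
```

The lumped reaction loses four hydrogens, which is the kind of signal that points to missing cofactors or a wrong formula.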
50
Bugs in Network structure revealed by Forward and Backward chaining
[Figure: network diagram with labels Known Nutrient set, Fired Reaction, Unfired Reaction, Essential compounds, Missing essential compound, Biomass]
51
Bugs in Network structure revealed by Forward and Backward chaining
[Figure: network diagram with labels Biomass, Essential compounds, Missing essential compound, Precursor metabolite, Unproduced metabolite]
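A sketch of the forward-chaining idea on the toy network from the earlier slides, ignoring stoichiometry. The choice of nutrient set ({A} only) and of essential compounds ({D, E}) is an assumption made so that the example produces one unfired reaction and one missing compound; the tutorial's own algorithms are not shown in this transcript.

```python
# Intracellular reactions of the toy network: name -> (substrates, products).
REACTIONS = {
    "R2": ({"A"}, {"B"}),
    "R3": ({"A"}, {"C"}),
    "R4": ({"B", "E"}, {"D"}),
    "R6": ({"B"}, {"C", "F"}),
    "R7": ({"C"}, {"D"}),
}


def forward_chain(nutrients, reactions):
    """Fire every reaction whose substrates are all available, until nothing changes."""
    available = set(nutrients)
    fired = set()
    changed = True
    while changed:
        changed = False
        for name, (substrates, products) in reactions.items():
            if name not in fired and substrates <= available:
                fired.add(name)
                available |= products
                changed = True
    return available, fired


def producers(compound, reactions):
    """Backward step: which reactions could produce a given compound?"""
    return [name for name, (_, products) in reactions.items() if compound in products]


nutrients = {"A"}        # assumed nutrient set (E is deliberately not supplied)
essential = {"D", "E"}   # assumed essential compounds

available, fired = forward_chain(nutrients, REACTIONS)
unfired = set(REACTIONS) - fired
missing = essential - available

print("unfired reactions:", sorted(unfired))
print("missing essential compounds:", sorted(missing))
for compound in sorted(missing):
    print(compound, "can be produced by:", producers(compound, REACTIONS))
```

Here E is missing and no reaction produces it, so the fix has to be a nutrient or transporter that was left out, not a change inside the network.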
52
Case study II: Integration of a metabolic flux model from two sources
• What is metabolic flux analysis?
• How does one build a metabolic flux model?
• What can go wrong in building a metabolic flux model?
53
What is Metabolic Flux Analysis?
• Starts with the metabolic network
• Assumes steady-state behavior
• Constrain with Thermodynamics
• Add Nutrient conditions
• Choose an objective: Biomass growth
• Predicts growth rate for mutant and wild-type organisms under different conditions.
54
Start with the metabolic network
[Figure: network of metabolites A to F with fluxes v1 to v9. Uptake: v1, v5. Waste: v9. Objective: v8. Flux legend: Exchange, Intracellular, Objective]
55
Stoichiometric Matrix: Representation of the metabolic network
     R1  R2  R3  R4  R5  R6  R7  R8  R9
A    +1  -1  -1   0   0   0   0   0   0
B     0  +1   0  -1   0  -2   0   0   0
C     0   0  +1   0   0  +1  -1   0   0
D     0   0   0  +2   0   0  +1  -1   0
E     0   0   0  -1  +1   0   0   0   0
F     0   0   0   0   0  +1   0   0  -1
R1: → A
R2: A → B
R3: A → C
R4: B + E → 2D
R5: → E
R6: 2B → C + F
R7: C → D
R8: D →
R9: F →
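The matrix is just the reaction list rewritten: one column per reaction, one row per metabolite, negative entries for consumption and positive for production. A minimal sketch in plain Python; the dictionary encoding of the reactions is a choice made for this example.

```python
# The nine reactions of the toy network: metabolite -> stoichiometric coefficient.
REACTIONS = {
    "R1": {"A": 1},
    "R2": {"A": -1, "B": 1},
    "R3": {"A": -1, "C": 1},
    "R4": {"B": -1, "E": -1, "D": 2},
    "R5": {"E": 1},
    "R6": {"B": -2, "C": 1, "F": 1},
    "R7": {"C": -1, "D": 1},
    "R8": {"D": -1},
    "R9": {"F": -1},
}
METABOLITES = ["A", "B", "C", "D", "E", "F"]


def stoichiometric_matrix(reactions, metabolites):
    """Rows are metabolites, columns are reactions (in dictionary order)."""
    return [[coeffs.get(m, 0) for coeffs in reactions.values()] for m in metabolites]


S = stoichiometric_matrix(REACTIONS, METABOLITES)

print("   " + " ".join(f"{name:>3}" for name in REACTIONS))
for metabolite, row in zip(METABOLITES, S):
    print(f"{metabolite}  " + " ".join(f"{x:>3}" for x in row))
```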
56
What is a metabolic flux?
Sink fluxes
Source fluxes
Metabolite Pool
57
What is a metabolic flux?
For a reaction of stoichiometry R2: A → B
the rate of reaction, or flux, is equal to:
$$v_2 = -\frac{d[A]}{dt} = \frac{d[B]}{dt}$$
For a reaction of stoichiometry R4: B + E → 2D
the flux is equal to:
$$v_4 = -\frac{d[B]}{dt} = -\frac{d[E]}{dt} = \frac{1}{2}\frac{d[D]}{dt}$$
58
What is a metabolic flux?
For a reaction of stoichiometry R4: B + E → 2D
The rate of reaction, or flux, is equal to:
$$v_4 = -\frac{d[B]}{dt} = -\frac{d[E]}{dt} = \frac{1}{2}\frac{d[D]}{dt}$$
59
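At steady-state, nonlinear dynamics simplify to linear fluxes.
[Figure: A_ext → A (enzyme P1, rate constant k1, flux v1); A → B (enzyme P2, rate constant k2, flux v2); A → C (enzyme P3, rate constant k3, flux v3)]
Each flux is a nonlinear rate law of the form
$$v_i = \frac{k_i\,[P_i]\,[X]}{K_{m,i} + [X]}$$
in a metabolite concentration [X], but at steady state the balance on A is linear in the fluxes:
$$\frac{d[A]}{dt} = v_1 - v_2 - v_3 = 0$$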
60
At steady-state, the sum of the fluxes that produce a metabolite is equal to the sum of the fluxes that consume it.
$$\frac{d[A]}{dt} = \sum_i c_i v_i = 0$$
[Figure: A_ext → A (v1); A → B (v2); A → C (v3)]
61
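Stoichiometric Matrix: more unknowns than equations
     R1  R2  R3  R4  R5  R6  R7  R8  R9
A    +1  -1  -1   0   0   0   0   0   0
B     0  +1   0  -1   0  -2   0   0   0
C     0   0  +1   0   0  +1  -1   0   0
D     0   0   0  +2   0   0  +1  -1   0
E     0   0   0  -1  +1   0   0   0   0
F     0   0   0   0   0  +1   0   0  -1
Multiplying by the flux vector (v1, v2, v3, v4, v5, v6, v7, v8, v9) gives one balance per metabolite:
$$\frac{d[A]}{dt} = v_1 - v_2 - v_3 = 0$$
$$\frac{d[B]}{dt} = v_2 - v_4 - 2 v_6 = 0$$
$$\frac{d[C]}{dt} = v_3 + v_6 - v_7 = 0$$
$$\frac{d[D]}{dt} = 2 v_4 + v_7 - v_8 = 0$$
$$\frac{d[E]}{dt} = v_5 - v_4 = 0$$
$$\frac{d[F]}{dt} = v_6 - v_9 = 0$$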
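"More unknowns than equations" can be checked directly: the toy network has 9 fluxes but only 6 balances, so steady state leaves 3 degrees of freedom and many different flux vectors satisfy it. A sketch in plain Python with exact fractions; the `rank` helper is ordinary Gaussian elimination written for this example, and the two flux vectors are hand-picked.

```python
from fractions import Fraction

# Stoichiometric matrix of the toy network (rows A-F, columns R1-R9).
S = [
    [1, -1, -1,  0, 0,  0,  0,  0,  0],  # A
    [0,  1,  0, -1, 0, -2,  0,  0,  0],  # B
    [0,  0,  1,  0, 0,  1, -1,  0,  0],  # C
    [0,  0,  0,  2, 0,  0,  1, -1,  0],  # D
    [0,  0,  0, -1, 1,  0,  0,  0,  0],  # E
    [0,  0,  0,  0, 0,  1,  0,  0, -1],  # F
]


def rank(matrix):
    """Rank by Gauss-Jordan elimination over exact fractions."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    r = 0
    for col in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(len(rows)):
            if i != r and rows[i][col] != 0:
                factor = rows[i][col] / rows[r][col]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
        if r == len(rows):
            break
    return r


def is_steady_state(v):
    """True when S . v = 0, i.e. every metabolite is produced as fast as it is consumed."""
    return all(sum(s * x for s, x in zip(row, v)) == 0 for row in S)


degrees_of_freedom = len(S[0]) - rank(S)
print("unknown fluxes:", len(S[0]), "independent balances:", rank(S),
      "degrees of freedom:", degrees_of_freedom)

# Two different flux vectors (v1..v9) that both satisfy the steady-state equations.
v_a = [10, 10, 0, 10, 10, 0, 0, 20, 0]
v_b = [10, 0, 10, 0, 0, 0, 10, 10, 0]
print(is_steady_state(v_a), is_steady_state(v_b))
```

Steady state alone therefore cannot pick a single answer, which is why the later slides add constraints and an objective.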
62
How to determine the metabolic capabilities of a network?
[Figure: network of metabolites A to F with fluxes v1 to v9. Uptake: v1, v5. Waste: v9. Biomass: v8. Flux legend: Exchange, Intracellular, Objective]
63
Using Elementary modes to study the steady-state behavior
[Figure: three copies of the network of metabolites A to F with fluxes v1 to v9]
     v1  v2  v3  v4  v5  v6  v7  v8  v9
A    +1  -1  -1   0   0   0   0   0   0
B     0  +1   0  -1   0  -2   0   0   0
C     0   0  +1   0   0  +1  -1   0   0
D     0   0   0  +2   0   0  +1  -1   0
E     0   0   0  -1  +1   0   0   0   0
F     0   0   0   0   0  +1   0   0  -1
64
How to make predictions about the behavior of the metabolic network?
[Figure: network of metabolites A to F with fluxes v1 to v9. Uptake: v1, v5. Waste: v9. Biomass: v8. Flux legend: Exchange, Intracellular, Objective]
65
Optimal wild-type flux distribution
[Figure: network of metabolites A to F with fluxes v1 to v9, annotated with flux values 10, 10, 10, 10 and 20. Label: Optimal Growth Flux]
Optimal mutant flux distribution
[Figure: network of metabolites A to F with fluxes v1 to v9, one reaction blocked (STOP), annotated with flux values 10, 10, 10 and 10]
67
Suboptimal mutant flux distribution
[Figure: network of metabolites A to F with fluxes v1 to v9, one reaction blocked (STOP), annotated with flux values 10, 3.3, 6.7, 6.7, 6.7, 3.3 and 3.3]
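The three flux distributions can be reproduced on the toy network with a brute-force sketch. Everything that is not on the slides is an assumption: uptake fluxes v1 and v5 are capped at 10, the blocked reaction is taken to be R4, and an exhaustive grid search stands in for the linear and quadratic programming solvers that real FBA and MOMA tools use. Under those assumptions the search returns the values shown in the figures (growth flux 20 for wild type, 10 for the optimal mutant, 6.7 for the minimally adjusted mutant).

```python
from fractions import Fraction
from itertools import product

UPTAKE_LIMIT = 10  # assumed upper bound on the uptake fluxes v1 and v5


def fluxes(v3, v4, v6):
    """Steady-state flux vector (v1..v9) implied by the free fluxes v3, v4, v6."""
    v2 = v4 + 2 * v6   # B balance: v2 - v4 - 2*v6 = 0
    v1 = v2 + v3       # A balance: v1 - v2 - v3 = 0
    v5 = v4            # E balance: v5 - v4 = 0
    v7 = v3 + v6       # C balance: v3 + v6 - v7 = 0
    v8 = 2 * v4 + v7   # D balance: 2*v4 + v7 - v8 = 0
    v9 = v6            # F balance: v6 - v9 = 0
    return (v1, v2, v3, v4, v5, v6, v7, v8, v9)


def candidates(steps_per_unit, knockout_r4=False):
    """Enumerate steady states on a grid that respect the uptake limits."""
    grid = [Fraction(i, steps_per_unit) for i in range(UPTAKE_LIMIT * steps_per_unit + 1)]
    v4_values = [Fraction(0)] if knockout_r4 else grid
    for v3, v4, v6 in product(grid, v4_values, grid):
        v = fluxes(v3, v4, v6)
        if v[0] <= UPTAKE_LIMIT and v[4] <= UPTAKE_LIMIT:
            yield v


def fba(knockout_r4=False):
    """Flux balance analysis: pick the steady state with maximal growth flux v8."""
    return max(candidates(1, knockout_r4), key=lambda v: v[7])


def moma(reference):
    """Minimization of metabolic adjustment: the knockout steady state closest
    (in squared Euclidean distance) to the reference flux distribution."""
    def distance(v):
        return sum((a - b) ** 2 for a, b in zip(v, reference))
    return min(candidates(3, knockout_r4=True), key=distance)


wild_type = fba()
optimal_mutant = fba(knockout_r4=True)
adjusted_mutant = moma(wild_type)

for label, v in [("wild type (FBA)", wild_type),
                 ("mutant (FBA)", optimal_mutant),
                 ("mutant (MOMA)", adjusted_mutant)]:
    print(f"{label:16}", [round(float(x), 1) for x in v])
```

The two mutant predictions differ (growth 10 versus 6.7), which is exactly what lets experiments discriminate between the "optimal" and "minimal adjustment" hypotheses.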
68
Case II: Palsson JR904
• good flux balance model
• implicit schema
• literature curated biochemical reactions
• 904 enzymatic reactions
• gene, enzyme-reaction associations
69
Case II: What sources of data are available to build a Metabolic Flux model?
• Annotated Genome
• Literature
• Pathway Databases
• Experimental measurements
70
Model vs. Exper., Glucose limited
[Figure: scatter plot of theoretical against experimental fluxes, v_i (theor) vs. v_i (exper), both axes 0 to 200, points numbered 1 to 17. WT (FBA), C 0.4. Corr.coeff.=0.97]
(fluxes in [mmol/gr DM h] normalized to glucose uptake flux)
(Segrè, Vitkup and Church, PNAS 2002)
71
[Figure: 3 x 3 grid of scatter plots, panels A to I, of theoretical against experimental fluxes, points numbered 1 to 17. Rows: WT (FBA), KO (FBA), KO (MPA). Columns: C 0.09, C 0.4, N 0.09. Horizontal axes: v_i (exper). Vertical axes: v_i (theor) for the WT row, v_i^pyk (theor) for the KO rows]
Low Glucose Limited: Corr.coeff.=0.91
High Glucose Limited: Corr.coeff.=0.97
Nitrogen Limited: Corr.coeff.=0.78
72
[Figure: scatter plots of v_i (theor) against v_i (exper), both axes -50 to 250, points numbered 1 to 17]
Max growth (optimal): Corr.coeff.= -0.064, P-value=0.6
Min Adjust. (suboptimal): Corr.coeff.=0.564, P-value=0.007
73
The power of a model lies in its ability to distinguish between competing hypotheses
74
Case II: EcoCyc
• good schema
• Flux balance model doesn’t work
75
What happens if the steady-state behavior of the model fails to reproduce the steady-state behavior of the organism?
[Figure: workflow diagram with boxes Genome, Pathologic, Transporter prediction, Pathway/Genome Database, BioCyc to SBML, Model Definition (SBML), Nutrients & Objective, FBA & MOMA, Flux prediction]
76
What happens if the steady-state behavior of the model fails to reproduce the steady-state behavior of the organism?
[Figure: the same workflow diagram with a Network Debugging box added: Genome, Pathologic, Transporter prediction, Pathway/Genome Database, BioCyc to SBML, Network Debugging, Model Definition (SBML), Nutrients & Objective, FBA & MOMA, Flux prediction]
77
Case II: EcoCyc/JR904
• Best of both worlds
• Biological Objective: From nutrients create all essential compounds required for growth
• True test of metabolic databases: Is the data good enough to predict growth rate under different nutrient conditions and effect of gene knockouts?
78
Case II: Schema level integration
• Translation from BioCyc ontology to BioPAX ontology
• Translation of implicit JR904 schema to BioPAX ontology
• Integration of JR904 concepts with BioPAX ontology (flux limits)
79
Case II: Instance level
• EcoCyc <-> JR904 Gene names
• EcoCyc <-> JR904 Enzyme names
• EcoCyc <-> JR904 Reaction names
• EcoCyc <-> JR904 Reversibility/flux limits
• EcoCyc <-> JR904 Gene->protein associations
• EcoCyc <-> JR904 protein->enzyme complex associations
• EcoCyc <-> JR904 enzyme->reaction associations
80
Data Profiling of Flux Model
• Incorrect constraints (reversibility)
• Incorrect Nutrient conditions
• Incorrect Biomass composition
• Incorrect protein function predictions
81
Data profiling of Flux Predictions
• Incorrect hypothesis (FBA vs MOMA vs ROOM)
• Incorrect network architecture (Gene knockouts)
• Incorrect modeling assumptions (steady state assumption, gene expression profiles)
82
Fixing the problems you find
Requires different amounts of time, money, and expertise
– Enzyme Genomics project
– Community annotation projects
– Adopt-a-Genome project
– High-throughput experiments
– Pathway hole filling algorithms
83
Case III: Semantic Aggregation Case study
Prochlorococcus marinus MED4
• Most abundant species in the ocean
• Responsible for a significant portion of photosynthetic carbon fixation.
• Iron hypothesis: Possible solution to global warming?
• Need to understand details of metabolic network
84
Case III: Multi-source aggregation
Public
– KEGG (metabolism)
– BioCyc (metabolism)
– WIT (metabolism)
– TransportDB (transport proteins)
Local
– RNA expression (microarrays)
– protein expression (mass spec)
85
Case III: Goal
Constrain metabolic flux model with experimental measurements:
• RNA expression
• Protein expression
• Metabolite concentrations
• Flux measurements
86
Case III: Aggregation Problems
• Higher Level: Orphan enzymes
• Schema Level: Bridge ontologies
• Instance Level: Object identity problem
• Simulation Level: Underdetermined system
87
Case III: Multi-source aggregation Validation and Testing
• Joint-learning from multiple sources
• Semantic test suite for data validation
• Network debugging algorithms
88
Time Out
(15 minutes)
89
Lessons Learned (30 minutes)
What did you learn?
Discussion
“A good representation is the key to good problem solving” –Patrick Winston
“Standard is better than best”—Gerald J Sussman
“The great thing about standards is that there are so many from which to choose” --Unknown
“Above all, one must develop a feeling for the organism.”—Barbara McClintock
“Someone does it once, everybody benefits.” Eric Miller, W3C Semantic Web Activity Lead
Remember: people, process, technology. Without people there isn’t any process or technology, so it’s all social engineering.
90
Lessons Not Yet Learned
(Take home exercise)
91
Feedback
Our goal is to have you walk away with a clear understanding of how to approach any database integration project
To provide
• A methodology to scope and plan the project
• An understanding of what to expect
• Some specific examples to illustrate what is common to all integration projects (data cleaning) and what is specific to a particular task (i.e. examples to give you a sense of it)
• Some first-hand experience at pedantic aggravation, irritation and interference
How did we do? Please let us know how we can improve this tutorial.
92
Thank You
Joanne & Jeremy