
1

Semantic Aggregation, Integration, and

Inference of Pathway Data

Co-Destructors:

Joanne Luciano, PhD [email protected]

Jeremy [email protected]

ISMB 2005 Tutorial, Detroit, Michigan, June 25th, 2005

http://www.biopathways.org/ismb2005tutorial-am6/

(Pedantic Aggravation, Irritation, and Interference)

2

Overview

Introduction (45 minutes)

Time Out (15 minutes)

Workshop Case Studies & Exercises (2 hrs 15 minutes)

Subdivide into groups of triads and dyads

• Case Study I (45 minutes)
• Case Study II (45 minutes)
• Case Study III (45 minutes)

Time Out (15 minutes)

Lessons Learned (30 minutes)

Lessons Not Yet Learned (take home)

3

Introduction (45 minutes)

Semantic Aggregation, Integration and Inference of Pathway Data

Pathway Data (domain)
– What is it?
– What does it look like?
– Why do we care? (motivation)

Definitions & Disclaimers

Strategies

4

What is it? Pathway Databases

So many pathway databases, so little time.

Pathway Data (domain)

Graphic from Mike Cary and Gary Bader

5

The Main Categories

• Metabolic Pathways (e.g. Glycolysis)
• Molecular Interaction Networks (e.g. Protein-Protein)
• Signaling Pathways (e.g. Apoptosis)
• Gene Regulation (e.g. Lac Operon)

Different types of pathways

(different strokes for different folks, it’s OK.)

6

Different representations of the same pathways

KEGG Reference Pathway GLYCOLYSIS

<!ELEMENT reaction (substrate*,product*)>

<!ATTLIST reaction name %keggid.type; #REQUIRED>

<!ATTLIST reaction type %reaction-type.type; #REQUIRED>

<!ELEMENT substrate EMPTY>

<!ATTLIST substrate name %keggid.type; #REQUIRED>

<!ELEMENT product EMPTY>

<!ATTLIST product name %keggid.type; #REQUIRED>

starts at -D-Glucose 1P

7


BioCYC Reference Pathway GLYCOLYSIS

reactions.dat This file lists all chemical reactions in the PGDB.

Attributes: UNIQUE-ID TYPES COMMON-NAME ACTIVATORS BASAL-TRANSCRIPTION-VALUE DBLINKS DELTAG0 DEPRESSORS EC-LIST EC-NUMBER ENZYMATIC-REACTION EQUILIBRIUM-CONSTANT IN-PATHWAY INHIBITORS LEFT MOVED-IN MOVED-OUT OFFICIAL-EC? REACTANTS REQUIREMENTS RIGHT SIGNAL SPECIES SPONTANEOUS? STIMULATORS SYNONYMS

starts at -D-glucose6-phosphate

8


Reactome Pathway GLYCOLYSIS

<reaction name="R_alpha_D_glucose_6_phosphate_D_fructose_6_phosphate" id="R_163457">

<listOfReactants>

<speciesReference species="R_30537_alpha_D_Glucose_6_phosphate" />

</listOfReactants>

<listOfProducts>

<speciesReference species="R_29512_D_Fructose_6_phosphate" />

</listOfProducts>

<listOfModifiers>

<modifierSpeciesReference species="R_163455_glucose_6_phosphate_isomerase_dimer_name_copied_from_complex_in_Homo_sapiens_" />

</listOfModifiers>

</reaction>

DatabaseObject [41245]

Event [8285]

Reaction [6598]

ConcreteReaction [4034]

GenericReaction [2564]

9


BioCarta Reference Pathway GLYCOLYSIS

Does not compute.

Pretty, but useless

Starts at Glucose (but it doesn’t matter)

Reactions clickable but...

10

Pathway Data Why do we care?

Pathway Research has Broad Impact

– Drug Discovery (pathway of target, safety)
– Basic Science (identify pathways)
– Disease Research (cancer pathways)
– Environmental Research (microbial research)

Combine knowledge from multiple sources
– Whole is greater than the sum of its parts
– Biological knowledge is fragmented
– Need database to manage resources

11

Definitions & Disclaimers

Aggregation: 2 (or more) data sources, different data models, common link between (among) them.

Integration: 2 (or more) data sources, same data model, semantic mapping and instance merging required.

Inference: 1 (or more) data sources, one data model, creating new instances or new relationships. (Evidence code type kind of “inference”)

Disclaimer: “Controlled” Vocabulary scope = this tutorial

12

Assembling Knowledge: Aggregation, Integration, Inference

Use Case I, Use Case II, Use Case III

“When it comes to data cleaning, there’s no such thing as a free lunch.” Tim Berners-Lee

Some tasks are specific to a use case, some are common to more than one and there’s no escaping others.

13

Bridging Chemistry and Molecular Biology

Uniprot:P49841

•Different Views have different semantics: Lenses

• When there is a correspondence between objects, a semantic binding is possible

Apply Correspondence Rule:
if ?target.xref.lsid == ?bpx:prot.xref.lsid
then ?target.correspondsTo.?bpx:prot

Source: Eric Neumann Haystack BioDASH Demo http://www.w3.org/2005/04/swls/BioDash/Demo/
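The correspondence rule on this slide can be sketched in a few lines of Python. This is only an illustration of the idea (two records correspond when their cross-references carry the same LSID), not the BioDASH implementation. The `Record` class, the record names, and the LSID string below are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """A minimal stand-in for an object that carries one LSID cross-reference."""
    name: str
    xref_lsid: str


def corresponds_to(targets, proteins):
    """Return (target, protein) pairs whose xref LSIDs are identical."""
    by_lsid = {}
    for protein in proteins:
        by_lsid.setdefault(protein.xref_lsid, []).append(protein)
    return [(t, p) for t in targets for p in by_lsid.get(t.xref_lsid, [])]


# Hypothetical example data; the LSID format shown is an assumption.
LSID = "urn:lsid:uniprot.org:uniprot:P49841"
targets = [Record("drug_target_1", LSID), Record("drug_target_2", "urn:lsid:example.org:x:0")]
proteins = [Record("bpx_protein_1", LSID)]

pairs = corresponds_to(targets, proteins)
```

Indexing one side by LSID keeps the match linear in the number of records, which matters once the two views hold thousands of objects.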

14

[Figure: RDF data sources linked through shared entity types.
Sources: GO2Keyword.rdf, UniProt.rdf, GO.rdf, Keywords.rdf, Taxonomy.rdf, PubMed.xml, IntAct.rdf, Enzymes.rdf, OMIM.rdf, GO2OMIM.rdf, GO2Enzyme.rdf, KEGG.rdf, GO2UniProt.rdf, ProbeSet.rdf.
Entity types: Citation, Organism, MIM Id, Keyword, Protein, Enzyme, Gene, Probe, Pathway, Compound.]

1. Differentiate different forms of disease

2. Identify patient subgroups

3. Identify top biomarkers

4. Identify function

5. Identify biological and chemical properties and disease associations of biomarker

6. Identify documents

7. Identify role in metabolic pathways

8. Identify compounds that interact

9. Identify and compare function in other organisms

10. Identify any prior art

Seamark Demonstration: Identification of new drug candidates

15

SBML integration using BioPAX

Use BioPAX to Address SBML’s data integration issues

• Different data types, same representation

• Same data, different representations

• External references…
• Synonyms…
• Provenance…

16

A problem: same representation different semantics (SBML)

Protein-Protein Interaction

<reaction id="pyruvate_dehydrogenase_cplx">
  <listOfReactants>
    <speciesRef species="PdhA"/>
    <speciesRef species="PdhB"/>
  </listOfReactants>
  <listOfProducts>
    <speciesRef species="pyruvate_dehydrogenase_E1"/>
  </listOfProducts>
</reaction>

Biochemical Reaction

<reaction id="pyruvate_dehydrogenase_rxn">
  <listOfReactants>
    <speciesRef species="NADP+"/>
    <speciesRef species="CoA"/>
    <speciesRef species="pyruvate"/>
  </listOfReactants>
  <listOfProducts>
    <speciesRef species="NADPH"/>
    <speciesRef species="acetyl-CoA"/>
    <speciesRef species="CO2"/>
  </listOfProducts>
  <listOfModifiers>
    <modifierSpeciesRef species="pyruvate_dehydrogenase_E1"/>
  </listOfModifiers>
</reaction>

17

SBML annotated with BioPAX

<sbml xmlns:bp="http://www.biopax.org/release1/biopax-release1.owl"
      xmlns:owl="http://www.w3.org/2002/07/owl#"
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <listOfSpecies>
    <species id="PdhA" metaid="PdhA">
      <annotation>
        <bp:protein rdf:ID="#PdhA"/>
      </annotation>
    </species>
    <species id="NADP+" metaid="NADP+">
      <annotation>
        <bp:smallMolecule rdf:ID="#NADP+"/>
      </annotation>
    </species>
  </listOfSpecies>
  <listOfReactions>
    <reaction id="pyruvate_dehydrogenase_cplx">
      <annotation>
        <bp:complexAssembly rdf:ID="#pyruvate_dehydrogenase_cplx"/>
      </annotation>
    </reaction>
  </listOfReactions>
</sbml>

species is protein

protein is PdhA

species is small molecule

small molecule is NADP+

18

BioPAX: External References

<species id="pyruvate" metaid="pyruvate">
  <annotation xmlns:bp="http://biopax.org/release1/biopax-release1.owl">
    <bp:smallMolecule rdf:ID="#pyruvate">
      <bp:Xref>
        <bp:unificationXref rdf:ID="#unificationXref119">
          <bp:DB>LIGAND</bp:DB>
          <bp:ID>c00022</bp:ID>
        </bp:unificationXref>
      </bp:Xref>
    </bp:smallMolecule>
  </annotation>
</species>

19

BioPAX: Synonyms

<species id="pyruvate" metaid="pyruvate">
  <annotation xmlns:bp="http://biopax.org/release1/biopax_release1.owl">
    <bp:smallMolecule rdf:ID="#pyruvate">
      <bp:SYNONYMS>2-oxo-propionic acid</bp:SYNONYMS>
      <bp:SYNONYMS>2-oxopropanoate</bp:SYNONYMS>
      <bp:SYNONYMS>BTS</bp:SYNONYMS>
      <bp:SYNONYMS>pyruvic acid</bp:SYNONYMS>
    </bp:smallMolecule>
  </annotation>
</species>
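To make the annotation slides concrete, here is a minimal sketch of how a program might pull the synonyms and the unification xref out of a BioPAX-annotated SBML species using Python's standard `xml.etree.ElementTree`. The function name `read_annotation` is hypothetical, and the sample XML combines the two preceding snippets with the `rdf` namespace declared on the species element so that the fragment parses on its own.

```python
import xml.etree.ElementTree as ET

BP = "http://biopax.org/release1/biopax-release1.owl"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Self-contained fragment; namespaces are declared here so it parses standalone.
SPECIES_XML = f"""
<species id="pyruvate" metaid="pyruvate" xmlns:bp="{BP}" xmlns:rdf="{RDF}">
  <annotation>
    <bp:smallMolecule rdf:ID="#pyruvate">
      <bp:SYNONYMS>pyruvic acid</bp:SYNONYMS>
      <bp:SYNONYMS>2-oxopropanoate</bp:SYNONYMS>
      <bp:Xref>
        <bp:unificationXref rdf:ID="#unificationXref119">
          <bp:DB>LIGAND</bp:DB>
          <bp:ID>c00022</bp:ID>
        </bp:unificationXref>
      </bp:Xref>
    </bp:smallMolecule>
  </annotation>
</species>
"""


def read_annotation(xml_text):
    """Collect synonyms and (database, id) unification xrefs for one species."""
    species = ET.fromstring(xml_text)
    ns = {"bp": BP}
    synonyms = [e.text for e in species.iterfind(".//bp:SYNONYMS", ns)]
    xrefs = [
        (x.findtext("bp:DB", namespaces=ns), x.findtext("bp:ID", namespaces=ns))
        for x in species.iterfind(".//bp:unificationXref", ns)
    ]
    return {"id": species.get("id"), "synonyms": synonyms, "xrefs": xrefs}


info = read_annotation(SPECIES_XML)
```

The (database, id) pairs are what an integration step would compare across sources; the synonyms are a fallback when no shared xref exists.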

20

Strategies

• Develop bridging technologies

• Develop pathway representation standard within the Life Science community (BioPAX) (Social Engineering!)

• Utilize Semantic Web Integration Technologies (LSID, RDF/OWL)

How do we get to a Standard Pathway Representation? (Game plan: Take over the world or have the world take over itself?)

21

Exchange Formats in Pathway Data Space

(Scope)

[Figure: scope of exchange formats in pathway data space.
Format groups: Database Exchange Formats; Simulation Model Exchange Formats.
Formats: BioPAX; PSI-MI 2; SBML, CellML.
Data types: Genetic Interactions; Molecular Interactions (Pro:Pro, All:All); Interaction Networks (Molecular, Non-molecular; Pro:Pro, TF:Gene, Genetic); Regulatory Pathways (Low Detail, High Detail); Metabolic Pathways (Low Detail, High Detail); Biochemical Reactions; Small Molecules (Low Detail, High Detail); Rate Formulas.]

Graphic from Mike Cary & Gary Bader

22

BioPAX Objectives

• Accommodate existing database representations

• Integration and exchange of pathway data

• Interchange through a common (standard) representation

• Provide a basis for future databases

• Enable development of tools for searching and reasoning over the data

23

BioPAX Motivation

Before BioPAX With BioPAX

Common format will make data more accessible, promoting data sharing and distributed curation efforts

>180 DBs and tools

Database

Application

User

24

BioPAX: Biological PAthway eXchange

A data exchange ontology and format for biological pathway integration, aggregation and inference

Initiative arose from the community

25

What is a Pathway?

Biological pathways of the Cell

• Metabolic Pathways (e.g. Glycolysis)
• Molecular Interaction Networks (e.g. Protein-Protein)
• Signaling Pathways (e.g. Apoptosis)
• Gene Regulation (e.g. Lac Operon)

BioPAX Level 1, BioPAX Level 2

26

Aggregation, Integration, Inference

1. Multiple kinds of pathway databases
– metabolic
– molecular interactions
– signal transduction

2. Constructs designed for integration
– DB References
– XRefs (Publication, Unification, Relationship)
– synonyms
– provenance

3. OWL DL – to enable reasoning

27

BioPAX Biochemical Reaction

[Figure: OWL (schema) and Instances (Individuals) (data) for phosphoglucose isomerase, EC 5.3.1.9]

28

BioPAX Ontology

• Conceptual framework based upon existing DB schemas: aMAZE, BIND, EcoCyc, WIT, KEGG, Reactome, etc.
• Allows wide range of detail, multiple levels of abstraction
• BioPAX ontology in OWL (XML)
• Designed for pathway database integration
  – Database ID
  – Unification X-REF
  – Relationship X-REF
  – Publication X-REF
  – Synonyms
  – Provenance

29

BioPAX uses other ontologies

• Use pointers to existing ontologies to provide supplemental annotation where appropriate
  – Cellular location: GO Component
  – Cell type: Cell.obo
  – Organism: NCBI taxon DB

• Incorporate other standards where appropriate
  – Chemical structure: SMILES, CML, InChI

30

BioPAX Ontology: Overview

Level 1 v1.0 (July 7th, 2004)

parts

how the parts are known to interact

a set of interactions

31

BioPAX Ontology: Top Level

• Pathway
  – A set of interactions
  – E.g. Glycolysis, MAPK, Apoptosis
• Interaction
  – A set of entities and some relationship between them
  – E.g. Reaction, Molecular Association, Catalysis
• Physical Entity
  – A building block of simple interactions
  – E.g. Small molecule, Protein, DNA, RNA

[Figure: class diagram of Entity, Pathway, Interaction and Physical Entity. Legend: Subclass (is a), Contains (has a).]

Graphic from Gary Bader

32

BioPAX Ontology: Root

• Root class: Entity
  – Any concept referred to as a discrete biological unit when describing pathways. This is the root class for all biological concepts in the ontology, which include pathways, interactions and physical entities.
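As a rough illustration of the top-level structure (Entity as root, with Pathway, Interaction and Physical Entity beneath it), the hierarchy can be mirrored with Python dataclasses. This is a sketch of the shape only, not the BioPAX OWL definition; the field names `participants` and `components` are assumptions chosen for readability.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Entity:
    """Root class: any discrete biological unit used to describe pathways."""
    name: str


@dataclass
class PhysicalEntity(Entity):
    """A building block of interactions, e.g. a small molecule or protein."""


@dataclass
class Interaction(Entity):
    """A set of entities and some relationship between them."""
    participants: List[Entity] = field(default_factory=list)


@dataclass
class Pathway(Entity):
    """A set of interactions."""
    components: List[Interaction] = field(default_factory=list)


g6p = PhysicalEntity("D-glucose 6-phosphate")
f6p = PhysicalEntity("D-fructose 6-phosphate")
isomerization = Interaction("G6P to F6P isomerization", participants=[g6p, f6p])
glycolysis = Pathway("Glycolysis", components=[isomerization])
```

Because every class derives from `Entity`, a pathway can in principle appear as a participant in an interaction, which matches the slide's example of a small molecule inhibiting a pathway.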

33

Metabolic Pathways: Interaction sub-classes

Definition: An entity that defines a single biochemical interaction between two or more entities.

An interaction cannot be defined without the entities it relates.

participants

34

Metabolic Pathways: Interaction sub-classes

Definition: Two terms exist under interaction: control and conversion. In future BioPAX levels, this list may be extended to include other classes, such as genetic interactions.

Examples: Enzyme catalysis controls a biochemical reaction, transport catalysis controls transport, a small molecule that inhibits a pathway by an unknown mechanism controls the pathway.

35

BioPAX as a solution to Aggregation, Integration, Inference

1. Multiple kinds of pathway databases
– metabolic
– molecular interactions
– signal transduction
– gene regulatory

2. Constructs designed for integration
– DB References
– XRefs (Publication, Unification, Relationship)
– Synonyms
– Provenance (not yet implemented)

3. OWL DL – to enable reasoning

36

Time Out

(15 minutes)

37

Workshop Case Studies & Exercises

(2 hrs 15 minutes)

Break into groups of triads and dyads

Case Study I (45 minutes)
• Use Case 1: Inference of a Metabolic Flux Model from an Annotated Genome
• Group Exercise 1

Case Study II (45 minutes)
• Use Case 2: Integration of a metabolic flux model from two sources

Case Study III (45 minutes)
• Use Case 3: Multi-source aggregation Validation and Testing


38

Methodology

• Define the goal of the integration
  – How will the integrated data be used?
  – This defines the level of integration from syntactic through semantic

• Take stock of current resources
  – This defines your starting point
  – Database sources, programmers, lab access, collaborators

• Scope the work to get from B to A
  – Data Profiling
  – Resource Profiling

39

3 Case Studies

• Case study I: Semantic Inference of metabolic pathway data from an annotated genome.

• Case study II: Semantic Integration of a metabolic flux model from two sources.

• Case study III: Semantic Aggregation of pathway data from multiple sources

40

Case Study I: Inference of a Metabolic Flux Model from an Annotated Genome

• Objective: To apply biological knowledge to constrain the possible behaviors of a metabolic network.

• Resources: Annotated Genome, Transport DB, Pathway databases, experimental community, published literature

41

Genes make RNA make Protein

[Figure: nine genes (Gene1 to Gene9), each transcribed to an RNA (RNA 1 to RNA 9) and translated to a protein (P1 to P9). Legend: Gene, RNA, Protein; Transcription, Translation; Enzyme, Transporter.]

42

Proteins catalyze biochemical reactions

[Figure: proteins P1 to P9 act as enzymes or transporters for reactions among metabolites A to F, across the periplasm and cytoplasm. Legend: Metabolites: A-F; Reaction; Enzyme; Transporter; Catalyzes.]

43

Biochemical reactions comprise a metabolic network

[Figure: network of metabolites A to F connected by reactions R1 to R9. Uptake: R1, R5. Biomass: R8. Waste: R9. Legend: Exchange, Intracellular, Objective.]

44

Metabolic Inference Subgoals

1. Infer genes from sequence and homology
2. Infer enzymatic reactions from Enzyme Commission (EC) numbers
3. Infer metabolic reaction network from enzymatic reactions and metabolites
4. Infer pathway holes using network debugging algorithms
5. Propose candidate enzymes using pathway-hole filling algorithms
6. Add experimentally verified candidates to the annotated genome
7. Lather, rinse, repeat
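Subgoal 2 (inferring enzymatic reactions from EC numbers) amounts to a table lookup, with the interesting cases being the genes that fall through. The sketch below assumes a hand-written EC-to-reaction table with two well-known glycolytic enzymes; the gene identifiers and the unknown EC number `0.0.0.0` are hypothetical.

```python
# Small illustrative lookup table (a real one would come from an enzyme database).
EC_TO_REACTION = {
    "5.3.1.9": "D-glucose 6-phosphate <=> D-fructose 6-phosphate",
    "2.7.1.11": "ATP + D-fructose 6-phosphate => ADP + D-fructose 1,6-bisphosphate",
}

# Hypothetical annotated genome: gene -> EC number (None = no EC assigned).
GENE_ANNOTATIONS = {
    "geneA": "5.3.1.9",
    "geneB": "2.7.1.11",
    "geneC": None,
    "geneD": "0.0.0.0",
}


def infer_reactions(gene_annotations, ec_to_reaction):
    """Map each gene to a reaction via its EC number; collect genes that cannot be mapped."""
    inferred, unresolved = {}, []
    for gene, ec in gene_annotations.items():
        reaction = ec_to_reaction.get(ec) if ec else None
        if reaction is None:
            unresolved.append(gene)
        else:
            inferred[gene] = reaction
    return inferred, unresolved


inferred, unresolved = infer_reactions(GENE_ANNOTATIONS, EC_TO_REACTION)
```

The `unresolved` list is the input to the later subgoals: genes without a usable EC number are candidates for orphaned or misannotated entries.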

45

Data Profiling of the Annotated Genome

• Orphaned genes
• Orphaned enzymes
• Misannotated genes
• Misannotated enzymes
• Sequencing errors
• BLAST Algorithm errors

46

Schema Level Errors

Gene that codes for the gene product (protein enzyme)

Enzyme (protein) that catalyzes the biochemical reaction

Biochemical reaction

Biochemical reaction

47

Semantic bugs revealed by chemical structure

EcoCyc 7.5 Pathway: Riboflavin and FMN and FAD biosynthesis

No place to go! 4-(1-D-ribitylamino)-5-amino-2,6-dihydroxypyrimidine

48

Semantic bugs revealed by chemical structure

EcoCyc 8.0 Pathway: Riboflavin and FMN and FAD biosynthesis

Synonyms: 4-(1-D-ribitylamino)-5-amino-2,6-dihydroxypyrimidine

49

Data Profiling of Pathway/Genome Database

• Unbalanced Reactions
• Pathway holes
• Unproducible metabolites
• Generalized Metabolites
• Unconsumable metabolites (toxins)
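The first item in this list, unbalanced reactions, is straightforward to test once each metabolite has a chemical formula. The sketch below is one possible check, assuming simple formulas without parentheses or charges; the function names are hypothetical.

```python
import re
from collections import Counter


def parse_formula(formula):
    """Count atoms in a simple formula such as 'C6H13O9P' (no parentheses or charges)."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts


def side_total(side):
    """Sum atoms over one side of a reaction given as (coefficient, formula) pairs."""
    total = Counter()
    for coefficient, formula in side:
        for element, n in parse_formula(formula).items():
            total[element] += coefficient * n
    return total


def imbalance(reactants, products):
    """Return {element: products minus reactants} for every element that does not balance."""
    left, right = side_total(reactants), side_total(products)
    return {e: right[e] - left[e] for e in set(left) | set(right) if right[e] != left[e]}


# Glucose 6-phosphate -> fructose 6-phosphate: an isomerization, so it balances.
balanced = imbalance([(1, "C6H13O9P")], [(1, "C6H13O9P")])
# Glucose -> one pyruvic acid, deliberately written with a missing coefficient.
unbalanced = imbalance([(1, "C6H12O6")], [(1, "C3H4O3")])
```

An empty result means the reaction is balanced; any other result names the elements to investigate.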

50

Bugs in Network structure revealed by Forward and Backward chaining

[Figure: chaining between the known nutrient set and the essential compounds of biomass. Labels: Fired Reaction, Unfired Reaction, Missing essential compound.]

51

Bugs in Network structure revealed by Forward and Backward chaining

[Figure: chaining between the essential compounds of biomass and their precursors. Labels: Missing essential compound, Precursor metabolite, Unproduced metabolite.]
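Forward chaining, as used on these slides, can be sketched as follows: start from the nutrient set, fire every reaction whose substrates are all available, and repeat until nothing changes. The code is a minimal illustration, not the tutorial's actual debugging tool. The toy network reuses the reaction names R2 to R7 from the earlier slides, while the choice of D and F as essential compounds is an assumption made for the example. Backward chaining (working from the essential compounds toward precursors) is not shown.

```python
def forward_chain(reactions, nutrients):
    """Fire every reaction whose substrates are all available, until nothing changes."""
    available = set(nutrients)
    fired = set()
    changed = True
    while changed:
        changed = False
        for name, (substrates, products) in reactions.items():
            if name not in fired and set(substrates) <= available:
                fired.add(name)
                available |= set(products)
                changed = True
    return available, fired


def debug_network(reactions, nutrients, essential):
    """Report essential compounds that are never produced and reactions that never fire."""
    available, fired = forward_chain(reactions, nutrients)
    return {
        "missing_essential": sorted(set(essential) - available),
        "unfired_reactions": sorted(set(reactions) - fired),
    }


# Intracellular reactions of the toy network: name -> (substrates, products).
TOY_NETWORK = {
    "R2": (["A"], ["B"]),
    "R3": (["A"], ["C"]),
    "R4": (["B", "E"], ["D"]),
    "R6": (["B"], ["C", "F"]),
    "R7": (["C"], ["D"]),
}

complete = debug_network(TOY_NETWORK, nutrients={"A", "E"}, essential={"D", "F"})

# Simulate a pathway hole by removing R2 (A -> B).
with_hole = {k: v for k, v in TOY_NETWORK.items() if k != "R2"}
broken = debug_network(with_hole, nutrients={"A", "E"}, essential={"D", "F"})
```

With R2 removed, B is never produced, so R4 and R6 stay unfired and F is reported as a missing essential compound, while D is still reachable through R3 and R7.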

52

Case study II: Integration of a metabolic flux model from two sources

• What is metabolic flux analysis?
• How does one build a metabolic flux model?
• What can go wrong in building a metabolic flux model?

53

What is Metabolic Flux Analysis?

• Starts with the metabolic network
• Assumes steady-state behavior
• Constrain with Thermodynamics
• Add Nutrient conditions
• Choose an objective: Biomass growth
• Predicts growth rate for mutant and wild-type organisms under different conditions.

54

Start with the metabolic network

[Figure: metabolic network of metabolites A to F with fluxes v1 to v9. Uptake: v1, v5. Objective: v8. Waste: v9. Flux legend: Exchange, Intracellular, Objective.]

55

Stoichiometric Matrix: Representation of the metabolic network

     R1   R2   R3   R4   R5   R6   R7   R8   R9
A    +1   -1   -1    0    0    0    0    0    0
B     0   +1    0   -1    0   -2    0    0    0
C     0    0   +1    0    0   +1   -1    0    0
D     0    0    0   +2    0    0   +1   -1    0
E     0    0    0   -1   +1    0    0    0    0
F     0    0    0    0    0   +1    0    0   -1

R1: → A
R2: A → B
R3: A → C
R4: B + E → 2D
R5: → E
R6: 2B → C + F
R7: C → D
R8: D →
R9: F →
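The matrix follows mechanically from the reaction list: each column is a reaction, substrates get negative coefficients and products positive ones. A minimal pure-Python sketch (the dictionary layout and function name are assumptions for illustration):

```python
# Reactions R1..R9 from the slide: name -> (substrates, products) with coefficients.
REACTIONS = {
    "R1": ({}, {"A": 1}),
    "R2": ({"A": 1}, {"B": 1}),
    "R3": ({"A": 1}, {"C": 1}),
    "R4": ({"B": 1, "E": 1}, {"D": 2}),
    "R5": ({}, {"E": 1}),
    "R6": ({"B": 2}, {"C": 1, "F": 1}),
    "R7": ({"C": 1}, {"D": 1}),
    "R8": ({"D": 1}, {}),
    "R9": ({"F": 1}, {}),
}
METABOLITES = ["A", "B", "C", "D", "E", "F"]


def stoichiometric_matrix(reactions, metabolites):
    """Build S with one row per metabolite and one column per reaction."""
    names = list(reactions)
    matrix = [[0] * len(names) for _ in metabolites]
    for j, name in enumerate(names):
        substrates, products = reactions[name]
        for metabolite, coefficient in substrates.items():
            matrix[metabolites.index(metabolite)][j] -= coefficient
        for metabolite, coefficient in products.items():
            matrix[metabolites.index(metabolite)][j] += coefficient
    return names, matrix


columns, S = stoichiometric_matrix(REACTIONS, METABOLITES)
```

Generating the matrix from the reaction list, rather than typing it by hand, avoids transcription slips in column labels and signs.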

56

What is a metabolic flux?

Sink fluxes

Source fluxes

Metabolite Pool

57


For a reaction of stoichiometry R2: A → B

the rate of reaction, or flux, is equal to:

$$v_2 = -\frac{d[A]}{dt} = \frac{d[B]}{dt}$$

For a reaction of stoichiometry R4: B + E → 2D

the flux is equal to:

$$v_4 = -\frac{d[B]}{dt} = -\frac{d[E]}{dt} = \frac{1}{2}\,\frac{d[D]}{dt}$$

58


For a reaction of stoichiometry R4: B+E → 2D

The rate of reaction, or flux, is equal to:

$$v_4 = -\frac{d[B]}{dt} = -\frac{d[E]}{dt} = \frac{1}{2}\,\frac{d[D]}{dt}$$

59

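At steady-state, nonlinear dynamics simplify to linear fluxes.

$$\frac{d[A]}{dt} = \frac{k_1 [P_1][A_{ext}]}{K_{m,1} + [A_{ext}]} - \frac{k_2 [P_2][B]}{K_{m,2} + [B]} - \frac{k_3 [P_3][C]}{K_{m,3} + [C]}$$

$$\frac{d[A]}{dt} = v_1 - v_2 - v_3 = 0$$

[Figure: A_ext → A via P1 (v1, k1); A → B via P2 (v2, k2); A → C via P3 (v3, k3)]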

60

At steady-state, the sum of the fluxes that produce a metabolite is equal to the sum of the fluxes that consume it.

$$\frac{d[A]}{dt} = \sum_i c_i v_i = 0$$

[Figure: A_ext → A (v1); A → B (v2); A → C (v3)]

61

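Stoichiometric Matrix: more unknowns than equations

$$\frac{d[A]}{dt} = v_1 - v_2 - v_3 = 0$$
$$\frac{d[B]}{dt} = v_2 - v_4 - 2 v_6 = 0$$
$$\frac{d[C]}{dt} = v_3 + v_6 - v_7 = 0$$
$$\frac{d[D]}{dt} = 2 v_4 + v_7 - v_8 = 0$$
$$\frac{d[E]}{dt} = v_5 - v_4 = 0$$
$$\frac{d[F]}{dt} = v_6 - v_9 = 0$$

     R1   R2   R3   R4   R5   R6   R7   R8   R9
A    +1   -1   -1    0    0    0    0    0    0
B     0   +1    0   -1    0   -2    0    0    0
C     0    0   +1    0    0   +1   -1    0    0
D     0    0    0   +2    0    0   +1   -1    0
E     0    0    0   -1   +1    0    0    0    0
F     0    0    0    0    0   +1    0    0   -1

Flux vector: v1, v2, v3, v4, v5, v6, v7, v8, v9 (six equations, nine unknown fluxes)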

62

How to determine the metabolic capabilities of a network?

[Figure: metabolic network of metabolites A to F with fluxes v1 to v9. Uptake: v1, v5. Biomass: v8. Waste: v9.]

63

Using Elementary modes to study the steady-state behavior

[Figure: three copies of the metabolic network (metabolites A to F, fluxes v1 to v9).]

     v1   v2   v3   v4   v5   v6   v7   v8   v9
A    +1   -1   -1    0    0    0    0    0    0
B     0   +1    0   -1    0   -2    0    0    0
C     0    0   +1    0    0   +1   -1    0    0
D     0    0    0   +2    0    0   +1   -1    0
E     0    0    0   -1   +1    0    0    0    0
F     0    0    0    0    0   +1    0    0   -1

64

How to make predictions about the behavior of the metabolic network?

[Figure: metabolic network of metabolites A to F with fluxes v1 to v9. Uptake: v1, v5. Biomass: v8. Waste: v9.]

65

Optimal wild-type flux distribution

[Figure: metabolic network (metabolites A to F, fluxes v1 to v9) annotated with flux values 10, 10, 10, 10 and 20. Label: Optimal Growth Flux.]

66

Optimal mutant flux distribution

[Figure: metabolic network with one reaction blocked (STOP), annotated with flux values 10, 10, 10 and 10.]

67

Suboptimal mutant flux distribution

[Figure: metabolic network with one reaction blocked (STOP), annotated with flux values 10, 6.7, 6.7, 6.7, 3.3, 3.3 and 3.3.]
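The three flux distributions can be checked against the steady-state condition S·v = 0. The extracted figures do not preserve which arrow carries which value, so the assignment below is an assumption: it is one assignment that reproduces the values shown on the slides (10, 10, 10, 10, 20 for the wild type; four 10s for the optimal mutant; 10, 6.7 and 3.3 for the suboptimal mutant), taking v4 as the blocked reaction. The sketch also computes the Euclidean distance to the wild-type distribution, the quantity that minimization of metabolic adjustment (MOMA) keeps small.

```python
# Stoichiometric matrix of the toy network (rows A..F, columns v1..v9).
S = [
    [1, -1, -1, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, -1, 0, -2, 0, 0, 0],
    [0, 0, 1, 0, 0, 1, -1, 0, 0],
    [0, 0, 0, 2, 0, 0, 1, -1, 0],
    [0, 0, 0, -1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 0, 0, -1],
]


def net_production(matrix, fluxes):
    """Rate of change of each metabolite: one entry of S*v per row."""
    return [sum(c * v for c, v in zip(row, fluxes)) for row in matrix]


def is_steady_state(matrix, fluxes, tol=1e-9):
    return all(abs(x) <= tol for x in net_production(matrix, fluxes))


def distance(a, b):
    """Euclidean distance between two flux distributions."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


# Assumed assignments of the slide values to v1..v9 (v8 is the biomass flux).
wild_type = [10, 10, 0, 10, 10, 0, 0, 20, 0]
mutant_optimal = [10, 0, 10, 0, 0, 0, 10, 10, 0]
mutant_suboptimal = [10, 20 / 3, 10 / 3, 0, 0, 10 / 3, 20 / 3, 20 / 3, 10 / 3]

BIOMASS = 7  # index of v8
```

Under these assumptions all three distributions satisfy steady state. The optimal mutant keeps a higher biomass flux (10 versus about 6.7), while the suboptimal mutant stays closer to the wild-type distribution, which is the trade-off the two mutant slides contrast.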

68

Case II: Palsson JR904

• good flux balance model
• implicit schema
• literature curated biochemical reactions
• 904 enzymatic reactions
• gene, enzyme-reaction associations

69

Case II: What sources of data are available to build a Metabolic Flux model?

• Annotated Genome
• Literature
• Pathway Databases
• Experimental measurements

70

Model vs. Exper., Glucose limited

[Figure: scatter plot of predicted fluxes v_i (theor) against measured fluxes v_i (exper) for 17 numbered fluxes; WT (FBA), C 0.4; both axes 0 to 200. Corr. coeff. = 0.97.]

(fluxes in [mmol/gr DM h] normalized to glucose uptake flux)

(Segrè, Vitkup and Church, PNAS 2002)

71

[Figure: nine scatter plots (panels A to I) of predicted against measured fluxes for 17 numbered fluxes.
Rows: WT (FBA), KO (FBA), KO (MOMA).
Columns: C 0.09 (Low Glucose Limited), C 0.4 (High Glucose Limited), N 0.09 (Nitrogen Limited).
Axes: v_i (exper) on the x-axis; v_i (theor) or v_i^pyk (theor) on the y-axis.
Corr. coeff. = 0.91, Corr. coeff. = 0.97, Corr. coeff. = 0.78.]

72

[Figure: scatter plots of predicted fluxes v_i (theor) against measured fluxes v_i (exper) for 17 numbered fluxes; both axes -50 to 250.
Max growth (optimal): Corr. coeff. = -0.064, P-value = 0.6
Min Adjust. (suboptimal): Corr. coeff. = 0.564, P-value = 0.007]

73

The power of a model lies in its ability to distinguish between competing hypotheses

74

Case II: EcoCyc

• good schema
• Flux balance model doesn’t work

75

What happens if the steady-state behavior of the model fails to reproduce the steady-state behavior of the organism?

[Figure: modeling pipeline. Components: Genome; PathoLogic; Transporter prediction; Pathway/Genome Database; BioCyc to SBML; Model Definition (SBML); Nutrients & Objective; FBA & MOMA; Flux prediction.]

76

What happens if the steady-state behavior of the model fails to reproduce the steady-state behavior of the organism?

[Figure: modeling pipeline with a debugging step. Components: Genome; PathoLogic; Transporter prediction; Pathway/Genome Database; BioCyc to SBML; Network Debugging; Model Definition (SBML); Nutrients & Objective; FBA & MOMA; Flux prediction.]

77

Case II: EcoCyc/JR904

• Best of both worlds

• Biological Objective: From nutrients create all essential compounds required for growth

• True test of metabolic databases: Is the data good enough to predict growth rate under different nutrient conditions and effect of gene knockouts?

78

Case II: Schema level integration

• Translation from BioCyc ontology to BioPAX ontology

• Translation of implicit JR904 schema to BioPAX ontology

• Integration of JR904 concepts with BioPAX ontology (flux limits)

79

Case II: Instance level

• EcoCyc <-> JR904 Gene names
• EcoCyc <-> JR904 Enzyme names
• EcoCyc <-> JR904 Reaction names
• EcoCyc <-> JR904 Reversibility/flux limits

• EcoCyc <-> JR904 Gene->protein associations

• EcoCyc <-> JR904 protein->enzyme complex associations

• EcoCyc <-> JR904 enzyme->reaction associations
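Instance-level mapping of reactions is often easier by participants than by name, since two sources rarely share reaction names. The sketch below is one possible approach under simplifying assumptions: metabolite names are normalized through a synonym table, and two reactions match when their substrate and product sets agree. All identifiers and the synonym table are hypothetical, and stoichiometric coefficients and reversibility are ignored.

```python
def canonical(name, synonyms):
    """Normalize a metabolite name through a synonym table (case-insensitive)."""
    key = name.lower()
    return synonyms.get(key, key)


def reaction_signature(substrates, products, synonyms):
    """Name-independent key: the sets of canonical substrates and products."""
    return (
        frozenset(canonical(s, synonyms) for s in substrates),
        frozenset(canonical(p, synonyms) for p in products),
    )


def match_reactions(source_a, source_b, synonyms):
    """Pair reactions of source_a with reactions of source_b that share a signature."""
    index = {
        reaction_signature(s, p, synonyms): name for name, (s, p) in source_b.items()
    }
    matched, unmatched = {}, []
    for name, (s, p) in source_a.items():
        other = index.get(reaction_signature(s, p, synonyms))
        if other is None:
            unmatched.append(name)
        else:
            matched[name] = other
    return matched, unmatched


# Hypothetical data standing in for the two sources.
SYNONYMS = {"g6p": "d-glucose 6-phosphate", "f6p": "d-fructose 6-phosphate"}
SOURCE_A = {
    "RXN-A1": (["D-glucose 6-phosphate"], ["D-fructose 6-phosphate"]),
    "RXN-A2": (["pyruvate"], ["acetyl-CoA"]),
}
SOURCE_B = {"rxn_b1": (["g6p"], ["f6p"])}

matched, unmatched = match_reactions(SOURCE_A, SOURCE_B, SYNONYMS)
```

The unmatched list is where the manual curation effort goes: each entry is either a genuine difference between the sources or a synonym the table does not yet cover.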

80

Data Profiling of Flux Model

• Incorrect constraints (reversibility)
• Incorrect Nutrient conditions
• Incorrect Biomass composition
• Incorrect protein function predictions

81

Data profiling of Flux Predictions

• Incorrect hypothesis (FBA vs MOMA vs ROOM)

• Incorrect network architecture (Gene knockouts)

• Incorrect modeling assumptions (steady state assumption, gene expression profiles)

82

Fixing the problems you find

Requires different amounts of time, money, and expertise

– Enzyme Genomics project
– Community annotation projects
– Adopt-a-Genome project
– High-throughput experiments
– Pathway hole filling algorithms

83

Case III: Semantic Aggregation Case study

Prochlorococcus marinus MED4
• Most abundant species in the ocean
• Responsible for a significant portion of photosynthetic carbon fixation.

• Iron hypothesis: Possible solution to global warming?

• Need to understand details of metabolic network

84

Case III: Multi-source aggregation

Public
– KEGG (metabolism)
– BioCyc (metabolism)
– WIT (metabolism)
– TransportDB (transport proteins)

Local
– RNA expression (microarrays)
– protein expression (mass spec)

85

Case III: Goal

Constrain metabolic flux model with experimental measurements:

• RNA expression
• Protein expression
• Metabolite concentrations
• Flux measurements

86

Case III: Aggregation Problems

• Higher Level: Orphan enzymes
• Schema Level: Bridge ontologies
• Instance Level: Object identity problem
• Simulation Level: Underdetermined system

87

Case III: Multi-source aggregation Validation and Testing

• Joint-learning from multiple sources
• Semantic test suite for data validation
• Network debugging algorithms

88

Time Out

(15 minutes)

89

Lessons Learned (30 minutes)

What did you learn?

Discussion

“A good representation is the key to good problem solving” – Patrick Winston

“Standard is better than best” – Gerald J Sussman

“The great thing about standards is that there are so many from which to choose” – Unknown

“Above all, one must develop a feeling for the organism.” – Barbara McClintock

“Someone does it once, everybody benefits.” – Eric Miller, W3C Semantic Web Activity Lead

Remember: people, process, technology. However, without people there isn’t any process or technology, so it’s all social engineering.

90

Lessons Not Yet Learned

(Take home exercise)

91

Feedback

Our goal is to have you walk away with a clear understanding of how to approach any database integration project.

To provide:

• A methodology to scope and plan the project
• An understanding of what to expect
• Some specific examples to illustrate what is common to all integration projects (data cleaning) and what is specific to a particular task (i.e. to provide you with examples to give a sense of it)

• Some first hand experience at pedantic aggravation, irritation and interference

How did we do? Please let us know how we can improve this tutorial.

92

Thank You

Joanne & Jeremy
