CSE-291: Ontologies in Data Integration
Spring 2004

Amarnath Gupta
Department of Computer Science & Engineering
University of California, San Diego

Ontologies and Biological Pathways

So, What is an Ontology Again?

• From previous classes
– [Sowa] The subject of ontology is the study of the categories of things that exist or may exist in some domain. The product of such a study, called an ontology, is a catalog of the types of things that are assumed to exist in a domain of interest D from the perspective of a person who uses a language L for the purpose of talking about D… A formal ontology is specified by a collection of names for concept and relation types organized in a partial ordering by the type-subtype relation.
– [Guarino] Theory of formal distinctions
  • among things
  • among relations
– Basic tools
  • Theory of parthood
  • Theory of integrity
  • Theory of identity
  • Theory of dependence

Is this good enough to characterize all concepts and relations?

Description Logics as Ontology Frameworks

• You have learnt about Description Logics
– DLs allow you to do the following:

Property Frames in DLs

• Some Description Logics like SHOQ(D)¹, a progenitor of OWL, allow:
– Roles or properties to be more powerful
– If R and S are roles, one can specify a role box that contains
  • role equivalence axioms: ∃component_of.⊤ ≐ ∃part_of.⊤
  • role inverses (not present in SHOQ, but present in SHIQ)
  • role inclusion axioms: R ⊑ S
  • role transitivity axioms: Trans(R)
– Thus one can construct role hierarchies in addition to concept lattices
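The effect of role inclusion and transitivity axioms can be illustrated with a small saturation procedure. This is only a sketch of the idea, not a DL reasoner: the role names follow the slide, but the inclusion component_of ⊑ part_of, the individuals (SH2_domain, GRB2, GRB2_SOS_complex) and the function names are assumptions made for illustration.

```python
from itertools import product

def super_roles(role, inclusions):
    """Return every role S with role ⊑* S (reflexive-transitive closure)."""
    seen, stack = {role}, [role]
    while stack:
        r = stack.pop()
        for sub, sup in inclusions:
            if sub == r and sup not in seen:
                seen.add(sup)
                stack.append(sup)
    return seen

def entailed_facts(assertions, inclusions, transitive):
    """Saturate role assertions (role, a, b) under R ⊑ S and Trans(R)."""
    facts = set(assertions)
    changed = True
    while changed:
        new = set()
        # Role inclusion: R(a, b) and R ⊑ S give S(a, b).
        for (r, a, b) in facts:
            for s in super_roles(r, inclusions):
                new.add((s, a, b))
        # Transitivity: R(a, b), R(b, c) and Trans(R) give R(a, c).
        for (r, a, b), (r2, c, d) in product(facts, repeat=2):
            if r == r2 and r in transitive and b == c:
                new.add((r, a, d))
        changed = not new <= facts
        facts |= new
    return facts

# Hypothetical role box and assertions.
inclusions = {("component_of", "part_of")}   # component_of ⊑ part_of
transitive = {"part_of"}                      # Trans(part_of)
assertions = {("component_of", "SH2_domain", "GRB2"),
              ("part_of", "GRB2", "GRB2_SOS_complex")}
facts = entailed_facts(assertions, inclusions, transitive)
```

Under these assumptions the saturated set contains part_of(SH2_domain, GRB2_SOS_complex), but not the corresponding component_of fact, because only part_of was declared transitive.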

¹ Ian Horrocks and U. Sattler. “Ontology Reasoning for the Semantic Web”. In B. Nebel, editor, Proc. of the 17th Int. Joint Conf. on Artificial Intelligence (IJCAI'01), Morgan Kaufmann, pages 199-204, 2001.

Thing-Centric Ontologies

• Now let’s try these:
  1. sky
  2. blue_sky ≡ sky ⊓ ∃has_color.blue
  3. cloudy_sky ≡ sky ⊓ ∃covered_by.cloud
  4. rain
  5. acid_rain ⊑ rain
  6. acid_rain_from_cloudy_sky ≡ acid_rain ⊓ ∃drops_from.cloudy_sky
• Is this reasonable?
• How about these?
  1. year
  2. quarter ⊑ ∃⁼4 part_of.year
  3. mid_term ⊑ exam ⊓ ¬final_test ⊓ ∃occurs_in.quarter
• Is it working? Why?

Not every concept and relation is thing-centric!

Ontologies for Processes, Events, Time

• Temporal Description Logic²
– Allen’s interval relations

² A. Artale and E. Franconi. “A temporal description logic for reasoning about actions and plans”. Journal of Artificial Intelligence Research, 9:463-506, 1998.

Temporal Description Logic

• Ingredients
– non-temporal concepts E
– temporal concepts C
  • things that change their state
– temporal qualifier C@X, where X is a temporal variable
– temporal constraints Tc
  • (X (R) Y) where
    – X is any temporal variable or the “NOW” interval #
    – R can be one of Allen’s interval relations or an expression composed from them
– existential quantifiers
  • ⋄(X) Tc. C
– selections p:E where p is
  • an atomic feature f
  • a parameterized feature *f

Applying Temporal DL

• Translocation of a protein
– translocation ≐ ⋄(x y)(x m #)(# m y) ((*Protein: InCytoplasm)@x ⊓ (*Protein: InNucleus)@y)
  • *Protein is the formal parameter of this action
  • States of the *Protein are treated as though they are different type assignments for the same variable
– The above is a definition of the term “translocation”
– Now we can have an assertion (meaning data) of the form
  • translocation(tp1, MAPK-translocation), i.e., of the form translocation(Interval, Action), to designate a specific case, thus implying
  • translocation(i, a) ⇒ ∃p. *Protein(a, p) ⋀ ∃j,l. (InCytoplasm(j,p) ⋀ InNucleus(l,p) ⋀ m(j,i) ⋀ m(i,l))

[Figure: timeline with the intervals x (in-cytoplasm(protein)), # (translocation) and y (in-nucleus(protein))]
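The implication for translocation(i, a) can be read as a check over interval-stamped state data. The following is a minimal sketch under assumed conventions: intervals are (start, end) pairs, Allen's m (meets) holds when one interval ends exactly where the next begins, and the function names and sample time points are hypothetical.

```python
from typing import NamedTuple

class Interval(NamedTuple):
    start: float
    end: float

def meets(a, b):
    """Allen's 'm': a ends exactly where b starts."""
    return a.end == b.start

def is_translocation(i, protein, in_cytoplasm, in_nucleus):
    """True if some InCytoplasm interval j meets i and i meets some InNucleus interval l."""
    cyto = [j for (j, p) in in_cytoplasm if p == protein]
    nuc = [l for (l, p) in in_nucleus if p == protein]
    return any(meets(j, i) for j in cyto) and any(meets(i, l) for l in nuc)

# Hypothetical state data: (interval, protein) pairs.
in_cytoplasm = [(Interval(0, 5), "MAPK")]
in_nucleus = [(Interval(7, 20), "MAPK")]

ok = is_translocation(Interval(5, 7), "MAPK", in_cytoplasm, in_nucleus)
too_short = is_translocation(Interval(5, 6), "MAPK", in_cytoplasm, in_nucleus)
```

Here the interval (5, 7) qualifies as a translocation of MAPK, while (5, 6) does not, because it fails to meet the interval in which the protein is in the nucleus.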

Applying Temporal DL

• Some identities
– ⋄x (x a #). C@x ≡ ⋄x y (y mi #)(x mi y). C@x
– ⋄x (x d #). C@x ≡ ⋄x y (y s #)(x f y). C@x
– ⋄x (x o #). C@x ≡ ⋄x y (y s #)(x fi y). C@x
• A little more complex case

[Figure: timeline of the intervals x, y, z, w and #, labelled tyrosine, autophosphorylation, tyrosine_phosphorylated, PTK_ligand_binding, GRB2_binding and GRB2_secondary_response]

tyrosine_p ≐ ⋄x (x o #). (tyrosine@x ⊓ autophosphorylation)

GRB2_s_r ≐ ⋄(y z w)(y b w)(z b w) (tyrosine_p@y ⊓ PTK_l_b@z ⊓ GRB2_b@w)

We only really need the relations s, f and mi
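The three identities can be sanity-checked by brute force on a small grid of integer intervals. This is a finite check, not a proof, and it assumes the usual point-based reading of Allen's relations (a = after, d = during, o = overlaps, s = starts, f = finishes, fi = finished-by, mi = met-by); the grid size and function names are arbitrary choices.

```python
from itertools import combinations

# All intervals (start, end) with integer endpoints 0..6 and start < end.
INTERVALS = list(combinations(range(7), 2))

def after(a, b):       return a[0] > b[1]
def met_by(a, b):      return a[0] == b[1]
def during(a, b):      return a[0] > b[0] and a[1] < b[1]
def starts(a, b):      return a[0] == b[0] and a[1] < b[1]
def finishes(a, b):    return a[1] == b[1] and a[0] > b[0]
def finished_by(a, b): return a[1] == b[1] and a[0] < b[0]
def overlaps(a, b):    return a[0] < b[0] < a[1] < b[1]

def identity_holds(direct, y_to_now, x_to_y):
    """Check: direct(x, #) iff there is a y with y_to_now(y, #) and x_to_y(x, y)."""
    return all(
        direct(x, now) == any(y_to_now(y, now) and x_to_y(x, y) for y in INTERVALS)
        for x in INTERVALS
        for now in INTERVALS
    )

after_ok = identity_holds(after, met_by, met_by)          # x a #  vs  (y mi #)(x mi y)
during_ok = identity_holds(during, starts, finishes)      # x d #  vs  (y s #)(x f y)
overlaps_ok = identity_holds(overlaps, starts, finished_by)  # x o #  vs  (y s #)(x fi y)
```

On this grid all three checks succeed, which is consistent with the slide's remark that a small set of relations suffices to express the others.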

Applying Temporal DL

• More features of the temporal DL
– Path p ○ q
  • *Protein ○ bound should be interpreted as
  • ∃ a,p,i,o1. Protein(a, p, i) ⋀ bound(i, p, o1)
– Agreement operator ↓
  • (*Protein ○ bound ↓ *Receptor)@y means that at the interval y the object to which Protein is bound is Receptor
– Substitution
  • Suppose A ≐ ⋄(x y z w)(…) is an axiom and B ≐ ⋄(x u v)(…) is another axiom whose body is a part of A
  • The temporal substitutive qualifier (B[x]@w) renames, within the defined B action, the variable x to w. It is a way of making two temporal variables corefer, while the temporal constraints peculiar to the renamed variable x are inherited by the substituting interval w. This will eliminate x from A.
  • This can be used to define one temporal concept in terms of another

And now on to Biological Pathways

The goals are:
1. to comprehend what we need to represent before we think about how to represent it
2. to understand what computations we can do with the representation

What are Pathways?

• A pathway is a set of linked biological components interacting with each other over time to generate a biological effect
• A component in a pathway can often be broken down into a finer level of interacting components that finally get to single biochemical reactions
• When people talk about pathways they refer to
– signal transduction networks
– metabolic pathways
– gene regulatory pathways
– protein-protein interaction networks

Signal Transduction Networks

What is Signal Transduction?

The process by which a cell converts one kind of signal or stimulus into another

The Big Picture

• How do organisms communicate with their environment?
• How do cells exchange information?
• What information needs to be exchanged?
• What is the currency of information?

Events

• Stimuli
– Synthesis of the signaling molecule by the signaling cell.
– Release of the signaling molecule by the signaling cell.
– Transport of the signal to the target cell.
– Detection of the signal by a specific receptor protein.
• Responses
– Reception: the first messenger, an extracellular molecule (the signal), binds to a receptor.
– Transduction
  • Amplification: binding activates the receptor protein, which then activates a relay protein.
  • Conversion: the relay protein stimulates another membrane protein, which acts as an effector (it effects changes in the cell).
– Induction/Response: the effector protein is an enzyme that produces a secondary messenger (a cytoplasmic molecule that triggers metabolic and/or structural responses within the cell).
– Removal of the signal, often terminating the cellular response.

Types of Signals

• Extracellular
– Signal molecules are specific to their receptors
– Receptors, usually proteins, have their N terminus facing outward and their C terminus inside the cell.
– When bound to a signal molecule, a receptor changes its conformation

Types of Signals

• Intracellular
– Mostly triggered by the extracellular signal
– Converts the extracellular signal into an intracellular signal
– E.g. G proteins, GTPases, cAMP, Ca++, kinases, phosphatases and many more
– Also called second messengers

Types of Signals

• Intercellular
– Extracellular signalling
– Endocrinology
– Types
  • Endocrine – Travel through blood
  • Paracrine – In the vicinity
  • Autocrine – Same cell type
  • Juxtacrine – Along cell membranes

Types of Signals

• Hormones
– Between cells or tissues within an individual
– Process
  • Synthesis → Storage and secretion → Transport → Recognition of hormone by its receptor → Change in receptor shape → Relay and amplification of signal → Response
  • The sending cell is a specialized cell, while the receiving cell can be of any type
  • A single hormone can have many receptors for different pathways, or many hormones can share the same receptor to invoke the same pathway
  • Two classes of hormone receptors
    – Membrane associated
    – Cytoplasmic

Cellular Response

– Depends on the particular signaling pathway; may involve changes in:

• cell cycle progression

• gene expression

• protein trafficking

• cell migration

• cytoskeleton architecture

• adhesion

• metabolism

• cell survival

Example: RAS-RAF-MEK-MAPK pathways

It should be noted that the RAS-RAF-MEK-MAPK pathway is only one example of the so-called “MAPK (Mitogen-Activated Protein Kinase) pathways”.

Two other mammalian MAPK pathways, involving JNK1 and p38, are involved in stress responses (they are also “MAPK pathways”).

RAS-RAF-MEK-MAPK

• Ligand binds receptor PTK
• Autophosphorylation on tyrosine
• GRB2 (an SH2- and SH3-containing protein) binds to the receptor phosphotyrosine motif Y-V/L-N-X via its SH2 domain
• The SH3 domain of GRB2 binds constitutively to the proline-rich sequence in the C-terminus of SOS (a guanine nucleotide exchange factor for RAS)
• Recruitment of SOS to the close proximity of RAS in the membrane
• RAS becomes activated by exchanging GDP for GTP
• The RAS-GTP effector domain interacts with the N-terminal regulatory region of RAF (a serine/threonine protein kinase), hence recruiting RAF to the membrane
• Activation of RAF (most likely by phosphorylation of RAF and binding to the scaffold protein 14-3-3)
• Activated RAF in turn activates MEK (also called MAPK kinase; a dual-specificity kinase) by phosphorylation on two conserved serine residues in MEK
• Activated MEK activates MAPK (a serine/threonine protein kinase) by phosphorylation of conserved threonine and tyrosine residues
• Activated MAPK phosphorylates a number of substrates in the plasma membrane and the cytoplasm
• It also translocates into the nucleus (within minutes), where it phosphorylates nuclear transcription factors
• Transcription of genes important for cell proliferation

[Figure, built up step by step across the slides: phosphorylated (P) receptor PTK; GRB2 with its SH2 and SH3 domains; SOS; RAS (GDP, then GTP); RAF; 14-3-3; MEK (P P); MAPK (P P); Substrates (P)]

Metabolic Pathways

What is metabolism?

The sum of all the chemical and physical changes that take place within the body and enable its continued growth and functioning.

Metabolism involves the breakdown of complex organic constituents of the body with the liberation of energy, which is required for other processes, and the building up of complex substances, which form the material of the tissues and organs.

Chemical reactions

• Reactants and products
– together called metabolites
• Free energy change (ΔG) of a reaction $A + B \rightleftharpoons C + D$

$$\Delta G = \Delta G^\circ + RT \ln \frac{[C][D]}{[A][B]}$$

– depends on concentrations and nature of metabolites
– ΔG < 0 for a spontaneous (exergonic) reaction
– ΔG > 0 for an endergonic reaction
• Chemical equilibrium
– Same rate of forward and backward reactions
– ΔG = 0; let $K_{eq} = \frac{[C][D]}{[A][B]}$, the ratio of products to reactants at equilibrium
– $\Delta G^\circ = -RT \ln K_{eq}$
– $K_{eq} = e^{-\Delta G^\circ / RT}$
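The relations between ΔG, ΔG° and Keq can be tried out numerically. The sketch below assumes energies in kJ/mol, a temperature of 298.15 K and made-up concentrations; the function names are illustrative only.

```python
import math

R = 8.314462618e-3  # gas constant in kJ/(mol*K)

def delta_g(delta_g0, conc_a, conc_b, conc_c, conc_d, temp=298.15):
    """Free energy change for A + B <-> C + D at the given concentrations."""
    quotient = (conc_c * conc_d) / (conc_a * conc_b)
    return delta_g0 + R * temp * math.log(quotient)

def k_eq(delta_g0, temp=298.15):
    """Equilibrium constant from the standard free energy change."""
    return math.exp(-delta_g0 / (R * temp))

# Hypothetical reaction with a standard free energy change of -5 kJ/mol.
dg0 = -5.0
keq = k_eq(dg0)

# With [C][D]/[A][B] equal to Keq, the free energy change vanishes.
dg_at_equilibrium = delta_g(dg0, 1.0, 1.0, keq, 1.0)
```

A negative ΔG° gives Keq > 1 (products favoured), and substituting the equilibrium ratio back into the first equation returns ΔG = 0 up to rounding.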

Rate Law

• Consider a reaction of overall stoichiometry

$$A \rightarrow P$$

The rate, or velocity, v of this reaction is the amount of P formed or the amount of A consumed per unit time. Thus:

$$v = \frac{d[P]}{dt} \quad \text{or} \quad v = -\frac{d[A]}{dt}$$

The rate law states that:

$$v = -\frac{d[A]}{dt} = k[A]$$

where k is the rate constant. v is a function of [A] to the first power, i.e. the reaction is first order, and k is called the first-order rate constant.
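For a first-order rate law, integrating -d[A]/dt = k[A] gives [A](t) = [A]₀ e^(-kt). The sketch below compares this closed form with a simple Euler integration of the rate law; the rate constant, initial concentration and step count are arbitrary illustrative values.

```python
import math

def first_order_concentration(a0, k, t):
    """Closed-form solution [A](t) = [A]0 * exp(-k t) of -d[A]/dt = k[A]."""
    return a0 * math.exp(-k * t)

def euler_first_order(a0, k, t_end, steps=10_000):
    """Integrate d[A]/dt = -k[A] with fixed-step forward Euler."""
    dt = t_end / steps
    a = a0
    for _ in range(steps):
        a += -k * a * dt
    return a

k = 0.5          # hypothetical first-order rate constant, 1/s
a0 = 1.0         # hypothetical initial concentration, mol/L
exact = first_order_concentration(a0, k, 2.0)
approx = euler_first_order(a0, k, 2.0)
half_life = math.log(2) / k
```

The half-life ln 2 / k does not depend on the starting concentration, which is characteristic of first-order kinetics.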

Equilibrium constant and equation rates

For a reversible reaction $A + B \rightleftharpoons C + D$, the rate will be the difference between the forward and reverse rates:

$$\frac{d[C]}{dt} = k_f [A][B] - k_r [C][D]$$

At equilibrium,

$$k_f [A][B] = k_r [C][D]$$

$$K_{eq} = \frac{k_f}{k_r} = \frac{[C][D]}{[A][B]}$$
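The statement that the concentration ratio settles at kf / kr can be checked with a small simulation. This is a sketch with made-up rate constants and starting concentrations, using plain forward Euler steps rather than a proper ODE solver.

```python
def simulate_reversible(a, b, c, d, kf, kr, dt=1e-3, steps=20_000):
    """Integrate d[C]/dt = kf[A][B] - kr[C][D] for A + B <-> C + D."""
    for _ in range(steps):
        rate = kf * a * b - kr * c * d   # net forward rate
        a -= rate * dt
        b -= rate * dt
        c += rate * dt
        d += rate * dt
    return a, b, c, d

kf, kr = 2.0, 1.0                      # hypothetical rate constants
a, b, c, d = simulate_reversible(1.0, 1.0, 0.0, 0.0, kf, kr)
ratio = (c * d) / (a * b)              # should approach Keq = kf / kr
```

Starting from pure reactants, the net rate decays to zero and the ratio [C][D]/[A][B] converges to kf / kr, independent of the starting amounts.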

Enzymes

• are usually proteins. A small number of enzymes are made of RNA (ribozymes).
• are usually quite big (compared to the portions of the reactants or substrates which are modified in the reaction to be catalyzed).

[Figure: an enzyme (hexokinase) and a ribozyme (self-splicing intron)]

Enzymes have a substrate binding site which binds the reaction substrates and brings them together in the orientations appropriate for the reaction.

This binding is usually highly specific. Often, one enzyme catalyses only one type of reaction between a specific set of substrates.

Enzymes have an active site: a specialized configuration of side-chain and main-chain atoms, located at the substrate binding site, which assists in the chemical steps of the reaction.

[Figure: triosephosphate isomerase with its active site marked]

Active sites

• 3-dimensional cleft
– can be formed by faraway residues
– Lysozyme’s active site includes residues at positions 35, 52, 62, 63, 101, 108 (out of a total of 129 residues)
• Small fraction of the total volume of an enzyme
• Substrates are bound to enzymes through multiple weak attractions

Regulation of enzymes

• Reversible and irreversible inhibition
• Competitive and allosteric regulation
– Allosteric regulation can be activation or inhibition
– Tense (T) and relaxed (R) states
– Activator binds to R state
– Inhibitor binds to T state
• Different kinetics for each

Rate of reactions

Regulatory control of enzymes

• Alteration of enzyme activity
– Enzyme modification
  • Covalent modification
  • Protein-protein interaction
– Substrate control
– Product control
– Allosteric control

Regulatory control of enzymes

• Alteration of number of enzyme molecules
– Transcription
– Translation
– Control of enzyme degradation
• Compartmentalization
– Example: hexokinase in brain and liver

Enzyme Nomenclature

• Oxidoreductases (EC Class 1)
– Transfer electrons (redox reactions)
• Transferases (EC Class 2)
– Transfer functional groups between molecules
• Hydrolases (EC Class 3)
– Break bonds by adding H2O
• Lyases (EC Class 4)
– Elimination reactions to form double bonds
• Isomerases (EC Class 5)
– Intramolecular rearrangements
• Ligases (EC Class 6)
– Join molecules with new bonds

ID   2.3.1.43
DE   Phosphatidylcholine--sterol O-acyltransferase.
AN   Lecithin--cholesterol acyltransferase.
AN   LCAT.
AN   Phospholipid--cholesterol acyltransferase.
CA   Phosphatidylcholine + sterol = sterol ester +
CA   1-acylglycerophosphocholine.
CC   -!- Palmitoyl, oleoyl, and linoleoyl can be transferred; a number of
CC       sterols, including cholesterol, can act as acceptor.
CC   -!- The bacterial enzyme also catalyses the reactions of EC 3.1.1.4 and
CC       EC 3.1.1.5.
DI   Norum disease; MIM:245900.
DI   Fish-eye disease; MIM:136120.
PR   PROSITE; PDOC00110;
DR   BRENDA; 2.3.1.43.
DR   EMP/PUMA; 2.3.1.43.
DR   WIT; 2.3.1.43.
DR   KYOTO UNIVERSITY LIGAND CHEMICAL DATABASE; 2.3.1.43.
DR   P10480, GCAT_AERHY; P53760, LCAT_CHICK; P04180, LCAT_HUMAN;
DR   P16301, LCAT_MOUSE; Q08758, LCAT_PAPAN; P30930, LCAT_PIG ;
DR   P53761, LCAT_RABIT; P18424, LCAT_RAT ;
//

Example entry from the Enzyme Database at http://www.expasy.ch/enzyme/
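An entry like this is line-oriented: each line starts with a two-letter code (ID, DE, AN, CA, CC, DI, PR, DR) and the record ends with "//". A minimal parsing sketch follows; it assumes the code is separated from the value by whitespace, uses an abbreviated copy of the entry above, and the function name is hypothetical.

```python
from collections import defaultdict

ENTRY = """\
ID   2.3.1.43
DE   Phosphatidylcholine--sterol O-acyltransferase.
AN   Lecithin--cholesterol acyltransferase.
AN   LCAT.
CA   Phosphatidylcholine + sterol = sterol ester +
CA   1-acylglycerophosphocholine.
DR   P10480, GCAT_AERHY; P53760, LCAT_CHICK; P04180, LCAT_HUMAN;
//
"""

def parse_enzyme_entry(text):
    """Group the lines of one entry by their two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):      # end-of-record marker
            break
        code, _, value = line.partition(" ")
        fields[code].append(value.strip())
    return dict(fields)

entry = parse_enzyme_entry(ENTRY)
ec_number = entry["ID"][0]
ec_class = int(ec_number.split(".")[0])    # first field of the EC number
reaction = " ".join(entry["CA"])           # CA lines continue one another
swissprot_ids = [pair.split(",")[0].strip()
                 for line in entry["DR"]
                 for pair in line.split(";") if pair.strip()]
```

The first field of the EC number (2 here) places the enzyme among the transferases, and the DR lines give the cross-references to protein sequence entries.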

Enzyme Catalytic Mechanisms

• Fundamentally familiar reactions from Organic Chemistry
– Acid-Base Catalysis - Donation or abstraction of protons
– Covalent Catalysis - Covalent (co)enzyme-substrate intermediate
– Metal Ion - Substrates and metals positioned for reaction
– Electrostatic - Charge complementarity to transition state
– Proximity and Orientation - Substrates aligned for reaction
– Transition state stabilization - ΔG‡ reduced

Metabolic networks

• Each enzyme/reaction can be a path between nodes
– Each node is an enzyme substrate (product or reactant)
• Converting individual reactions to paths and nodes
– Produces directed graphs
• Classification of biochemical reactions
– EC numbering system (Enzyme Commission)
– Hierarchical numerical system, e.g. 1.5.3.1
– Based on organic chemistry involved, not proteins
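The conversion of reactions into a directed graph can be sketched as follows. The three reactions are an illustrative fragment of glycolysis chosen for this example, the class names come from the Enzyme Nomenclature slide, and the function names are hypothetical.

```python
from collections import defaultdict, deque

# Top-level EC classes, keyed by the first field of the EC number.
EC_CLASSES = {1: "Oxidoreductases", 2: "Transferases", 3: "Hydrolases",
              4: "Lyases", 5: "Isomerases", 6: "Ligases"}

# (EC number, reactants, products)
REACTIONS = [
    ("2.7.1.1", ["glucose", "ATP"], ["glucose-6-phosphate", "ADP"]),
    ("5.3.1.9", ["glucose-6-phosphate"], ["fructose-6-phosphate"]),
    ("2.7.1.11", ["fructose-6-phosphate", "ATP"],
     ["fructose-1,6-bisphosphate", "ADP"]),
]

def build_graph(reactions):
    """Directed graph: metabolite -> list of (product, EC number) edges."""
    graph = defaultdict(list)
    for ec, reactants, products in reactions:
        for r in reactants:
            for p in products:
                graph[r].append((p, ec))
    return graph

def reachable(graph, start):
    """Breadth-first search for every metabolite reachable from start."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt, _ec in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen - {start}

def ec_class_name(ec_number):
    """Map an EC number such as '5.3.1.9' to its top-level class name."""
    return EC_CLASSES[int(ec_number.split(".")[0])]

graph = build_graph(REACTIONS)
downstream_of_glucose = reachable(graph, "glucose")
```

Because the edges are directed, glucose can reach fructose-1,6-bisphosphate in this fragment but not the other way round. Treating cofactors such as ATP and ADP as ordinary nodes is a simplification; real pathway graphs often handle them separately.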

Pain: the Boehringer-Mannheim wallcharts

More pain

A Pathway Example

Gene Regulatory Networks

What is gene regulation?

The primary role of a gene is transcription, which produces mRNA, a copy of a single strand of the gene. Different proteins can control the transcription process by activating, inhibiting, or competitively binding to the promoter region of genes.

Protein Synthesis

• Transcription
– Before the synthesis of a protein begins, the corresponding RNA molecule is produced by RNA transcription. One strand of the DNA double helix is used as a template by the RNA polymerase to synthesize a messenger RNA (mRNA).
– This mRNA migrates from the nucleus to the cytoplasm. During this step, mRNA goes through different types of maturation, including one called splicing, in which the non-coding sequences are eliminated. The coding mRNA sequence can be described in units of three nucleotides called codons.

Protein Synthesis

• Translation
– The ribosome binds to the mRNA at the start codon that is recognized only by the initiator tRNA.
– The ribosome proceeds to the elongation phase of protein synthesis. During this stage, complexes composed of an amino acid linked to tRNA sequentially bind to the appropriate codon in mRNA by forming complementary base pairs with the tRNA anticodon.
– The ribosome moves from codon to codon along the mRNA. Amino acids are added one by one, translated into polypeptide sequences dictated by DNA and represented by mRNA.
– At the end, a release factor binds to the stop codon, terminating translation and releasing the complete polypeptide from the ribosome.

Control of Gene Expression

• Gene Expression is a term indicating the act of protein synthesis by a gene
– not all genes produce proteins in all cells or in all phases of a cell’s life cycle
• Many control points
– transcription, mRNA processing, mRNA transport, translation, post-translational modifications
• Each gene has its own control regions
– all genes differ slightly in the exact locations of control and the exact set of transcription factors (proteins that control transcription)
• Different combinations of transcription factors, and the relative timing of their binding, create a large space of control signals
– some control signals may control the transcription of more than one gene

Transcription Regulation

Transcription-Initiation Complex

Events Leading to Transcription Initiation

Enhancers can be equally complex

A sense of the data: the molecular neighborhood of IME1

Types of Interactions

Ontologies and Databases for Biological Pathways

BioPAX

[Figure: where BioPAX sits among related exchange formats. Labels: BioPAX; PSI; SBML, CellML; Small Molecules (CML); Database Exchange Formats; Simulation Model Exchange Formats; Biochemical Reactions; Rate Formulas; Enzymes; Metabolic Pathways (Qualitative, Quantitative); Regulatory Pathways (Qualitative, Quantitative); Molecular Interactions (Pro:Pro, All:All); Genetic Interactions; Interaction Networks (Molecular, Non-molecular; Pro:Pro, TF:Gene, Genetic)]

Design Goals

• Encapsulation: an entire pathway in one record
• Compatible: use existing standards wherever possible
• Computable: from file reading to logical inference
• OWL (Web Ontology Language)
– Fast
– Complete: all conclusions are guaranteed to be computed
– Decidable: all computations will finish in finite time (with OWL Lite, in a short amount of time)

Requirements Specification

• Accommodate existing database representations: BioCyc, BIND, WIT, aMAZE, KEGG, etc.
– Compatible as a superset of representations
• Support different pathway types:
– Metabolic pathways
– Signaling pathways
– Protein-protein interactions
– Gene regulatory pathways
• OWL is used for encoding the ontology

Implementation of BioPAX

• Implemented using the OWL language
• OWL is
– the Web Ontology Language
– XML based
– a W3C standard (www.W3C.org)
• Example of a BioPAX Class and Instance in OWL

Example – Class def in OWL

<owl:Class rdf:ID="protein">
  <rdfs:subClassOf>
    <owl:Class rdf:about="#physicalEntity"/>
  </rdfs:subClassOf>
  <rdfs:comment rdf:datatype="http://www.w3.org/2001/XMLSchema#string">A protein (e.g. The EGFR protein sequence. See Swiss-Prot for more examples.)</rdfs:comment>
</owl:Class>

Example – Instance in OWL

<bpx:protein rdf:ID="biopax-L1v0.5_Instance_42">
  <bpx:NAMES>
    <bpx:namesType rdf:ID="biopax-L1v0.5_Instance_43">
      <bpx:SHORTLABEL>phosphoglucose isomerase</bpx:SHORTLABEL>
    </bpx:namesType>
  </bpx:NAMES>
</bpx:protein>
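Because the instance is ordinary RDF/XML, it can be read with a generic XML parser. The sketch below uses Python's standard xml.etree.ElementTree. The slide does not show the namespace declarations, so the wrapping rdf:RDF element and in particular the URI bound to the bpx prefix are assumptions made to obtain a well-formed document; the rdf URI is the standard RDF syntax namespace.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
BPX_NS = "http://www.biopax.org/release/biopax-level1.owl#"  # assumed URI

DOC = f"""<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:bpx="{BPX_NS}">
  <bpx:protein rdf:ID="biopax-L1v0.5_Instance_42">
    <bpx:NAMES>
      <bpx:namesType rdf:ID="biopax-L1v0.5_Instance_43">
        <bpx:SHORTLABEL>phosphoglucose isomerase</bpx:SHORTLABEL>
      </bpx:namesType>
    </bpx:NAMES>
  </bpx:protein>
</rdf:RDF>"""

root = ET.fromstring(DOC)
ns = {"rdf": RDF_NS, "bpx": BPX_NS}

# Map each protein's rdf:ID to its short label.
proteins = {
    p.get(f"{{{RDF_NS}}}ID"): p.findtext(".//bpx:SHORTLABEL", namespaces=ns)
    for p in root.findall("bpx:protein", ns)
}
```

A generic XML parser only sees the element tree; applying the class hierarchy (for example, that every protein is also a physicalEntity) would need an RDF/OWL-aware toolkit.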

BioPAX Ontology

Current structure of class hierarchy
Level 1 v0.9 (Dec. 2003)

Annotation with BioPAX

Metabolic Data in BioPAX

Biochemical Reaction
ID          1
Full Name   Glucose-6-p to fructose-6-p
Left        <cml>glucose-6-phosphate</cml>
Right       <cml>fructose-6-phosphate</cml>
Delta G     0.4 kcal/mole
EC          5.3.1.9

EcoCyc: Reaction; BioPAX: Biochemical Reaction

Metabolic Data in BioPAX

Catalysis
ID          2
Name        Catalysis of glucose-6-p to fructose-6-p
Enzyme      glucose-6-phosphate isomerase
Reaction    BioPAX ID=1
Inhibitors  Low pH

EcoCyc: Enzyme-Catalyzed Reaction; BioPAX: Catalysis

Metabolic Data in BioPAX

Pathway

ID: 10
Name: Glycolysis
Interactions:
1. BioPAX ID=2
2. BioPAX ID=4
3. BioPAX ID=6
etc.

EcoCyc: Pathway BioPAX Class: Pathway
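The three metabolic records above refer to one another by BioPAX ID: the pathway lists catalysis steps, and each catalysis points to a reaction. A rough Python sketch of that reference structure follows. The class and field names are assumptions chosen to mirror the slide tables, not the BioPAX schema itself, and only IDs 1 and 2 are populated because the slides do not show records 4 and 6.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class BiochemicalReaction:
    id: int
    full_name: str
    left: List[str]
    right: List[str]
    delta_g_kcal_per_mole: float
    ec: str


@dataclass
class Catalysis:
    id: int
    name: str
    enzyme: str
    reaction_id: int  # reference to a BiochemicalReaction by BioPAX ID
    inhibitors: List[str] = field(default_factory=list)


@dataclass
class Pathway:
    id: int
    name: str
    interaction_ids: List[int]  # references to Catalysis records by BioPAX ID


# Records taken from the slides.
store: Dict[int, object] = {}
store[1] = BiochemicalReaction(1, "Glucose-6-p to fructose-6-p",
                               ["glucose-6-phosphate"], ["fructose-6-phosphate"],
                               0.4, "5.3.1.9")
store[2] = Catalysis(2, "Catalysis of glucose-6-p to fructose-6-p",
                     "glucose-6-phosphate isomerase", 1, ["Low pH"])
store[10] = Pathway(10, "Glycolysis", [2, 4, 6])


def pathway_steps(pathway_id, records):
    """Return (enzyme, reaction name) pairs for the steps that are present.

    IDs that are listed in the pathway but missing from the store are skipped.
    """
    steps = []
    for step_id in records[pathway_id].interaction_ids:
        step = records.get(step_id)
        if isinstance(step, Catalysis):
            reaction = records[step.reaction_id]
            steps.append((step.enzyme, reaction.full_name))
    return steps


steps = pathway_steps(10, store)
print(steps)
```

Keeping the links as IDs rather than nested objects matches the slide layout, where a reaction can be shared by several catalysis records and pathways.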

Signal Transduction Data in BioPAX

Reaction

ID: 20
Name: Activation of NF-kB
Substrate: NF-kB (inactive)
Product: NF-kB (active)

Enzyme Catalysis

ID: 21
Name: MAP-kinase activates NF-kB
Enzyme: MAP-kinase
Reaction: BioPAX ID=20

CSNDB Signaling Pathway Step

Signal Transduction Data in BioPAX

Pathway

ID: 10
Name: MAPK
Interactions:
1. BioPAX ID=21
2. BioPAX ID=23
3. BioPAX ID=25
etc.

CSNDB Pathway

Descriptions of some databases

Name: KEGG (Kyoto Encyclopedia of Genes and Genomes)
Web: http://www.genome.ad.jp/kegg/
Owner: Institute for Chemical Research, Kyoto University
Description: KEGG is an effort to computerize current knowledge of molecular and cellular biology in terms of the information pathways that consist of interacting molecules or genes and to provide links from the gene catalogs produced by genome sequencing projects. The KEGG project is undertaken in the Bioinformatics Center, Institute for Chemical Research, Kyoto Univ.

Name: PathDB
Web: http://www.ncgr.org/pathdb/index.html
Owner: National Center for Genome Resources
Description: PathDB™ is a functional prototype research tool for biochemistry and functional genomics. One of the key underlying philosophies of the project is to capture discrete metabolic steps. This makes it possible to build tools that construct metabolic networks de novo from a set of defined steps. PathDB is not simply a data repository but a system around which tools can be created for building, visualizing, and comparing metabolic networks.

List of Pathway Database/Tools (cont.)

Name: GenMAPP (Gene MicroArray Pathway Profiler)
Owner: Gladstone Institute, UCSF
Description: GenMAPP is a computer application designed to visualize gene expression data on maps representing biological pathways and groupings of genes. The first release of GenMAPP 1.0 beta is available with over 50 mouse and human pathways. They also provide hundreds of functional groupings of genes derived from the Gene Ontology Project for the human, mouse, Drosophila, C. elegans, and yeast genomes. GenMAPP seeks collaborators in the biological community to assist in the development of a library of pathways that will encompass all known genes in the major model organisms.

Name: SPAD (Signaling PAthway Database)
Owner: Graduate School of Genetic Resources Technology, Kyushu University
Description: There are multiple signal transduction pathways: cascades of information from the plasma membrane to the nucleus in response to an extracellular stimulus in living organisms. An extracellular signal molecule binds a specific intracellular receptor and initiates the signaling pathway. A large amount of information is now available about the signaling pathways that control gene expression and cellular proliferation. The integrated database SPAD was developed to give an overview of signal transduction. SPAD is divided into four categories based on the extracellular signal molecules (including growth factor, cytokine, and hormone) that initiate the intracellular signaling pathway. SPAD is compiled to describe protein-protein and protein-DNA interactions as well as DNA and protein sequence information.

Specific Pathway Databases

• Cytokine Signaling Pathway DB. Dept. of Biochemistry, Kumamoto Univ.
– The database contains information on signaling pathways of cytokines. It is designed for researchers who work with cytokines and their receptors, and provides biochemical data and references about signaling molecules as well as ligand-receptor relationships.

• EcoCyc and MetaCyc. Stanford Research Institute
– The EcoCyc database describes the genome and the biochemical machinery of E. coli. The database contains up-to-date annotations of all E. coli genes. EcoCyc describes all known pathways of E. coli small-molecule metabolism. Each pathway and its component reactions and enzymes are annotated in rich detail, with extensive references to the biomedical literature. The Pathway Tools software provides query and visualization services.

• BIND (Biomolecular Interaction Network Database). UBC, Univ. of Toronto
– BIND is a database designed to store full descriptions of interactions, molecular complexes and pathways, including interactions between any two molecules composed of proteins, nucleic acids and small molecules. Chemical reactions, photochemical activation and conformational changes can also be described. Abstraction is made in such a way that graph theory methods may be applied for data mining. The database can be used to study networks of interactions, to map pathways across taxonomic branches and to generate information for kinetic simulations.

Objectives of the KEGG Project

• Pathway Database: Computerize current knowledge of molecular and cellular biology in terms of the pathway of interacting molecules or genes.
– generic metabolic pathways (143)
– inferred pathways for all sequenced genomes (2706)

• Genes Database: Maintain gene catalogs of all sequenced organisms and link each gene product to a pathway component

• Ligand Database: Organize a database of all chemical compounds in living cells and link each compound to a pathway component

• Pathway Tools: Develop new bioinformatics technologies for functional genomics, such as pathway comparison, pathway reconstruction, and pathway design

Data Representation in KEGG

• Entity: a molecule or a gene

• Binary relation: a relation between two entities

• Network: a graph formed from a set of related entities

• Pathway: metabolic pathway or regulatory pathway
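The four notions above compose naturally: binary relations between entities induce a graph, and a pathway is a named set of such relations. The Python sketch below is one possible reading, not KEGG's actual data model. The type names, the undirected treatment of relations, and the small glycolysis fragment used as data are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple

Entity = str                             # a molecule or a gene, identified by name
BinaryRelation = Tuple[Entity, Entity]   # a relation between two entities


@dataclass
class Pathway:
    name: str
    kind: str                            # "metabolic" or "regulatory"
    relations: List[BinaryRelation]


def build_network(relations: List[BinaryRelation]) -> Dict[Entity, Set[Entity]]:
    """Form an undirected graph (adjacency sets) from a set of binary relations."""
    network: Dict[Entity, Set[Entity]] = defaultdict(set)
    for a, b in relations:
        network[a].add(b)
        network[b].add(a)
    return dict(network)


# Illustrative data: two consecutive compound pairs from glycolysis.
glycolysis_fragment = Pathway(
    name="glycolysis (fragment)",
    kind="metabolic",
    relations=[
        ("glucose-6-phosphate", "fructose-6-phosphate"),
        ("fructose-6-phosphate", "fructose-1,6-bisphosphate"),
    ],
)

network = build_network(glycolysis_fragment.relations)
print(sorted(network["fructose-6-phosphate"]))
```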

KEGG Model

KEGG: query capabilities

• Searching and browsing
• Clickable maps
• Map coloring
– user provides a family of genes from gene expression data
– matching pathways are listed
– genes are colored on pathway maps
• Path finding between compounds
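Path finding between compounds can be viewed as a shortest-path search over a graph whose nodes are compounds and whose edges are reactions. The sketch below uses breadth-first search in Python. It is not KEGG's implementation, and the directed links are a small illustrative fragment of glycolysis rather than KEGG data.

```python
from collections import deque

# Illustrative directed links: compound -> compounds reachable in one reaction step.
REACTION_LINKS = {
    "glucose": ["glucose-6-phosphate"],
    "glucose-6-phosphate": ["fructose-6-phosphate"],
    "fructose-6-phosphate": ["fructose-1,6-bisphosphate"],
    "fructose-1,6-bisphosphate": [],
}


def find_path(links, source, target):
    """Breadth-first search; returns a shortest list of compounds or None."""
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        for nxt in links.get(node, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None


path = find_path(REACTION_LINKS, "glucose", "fructose-6-phosphate")
print(path)
```

Because the links here are directed, searching in the opposite direction returns None; treating reactions as reversible would mean adding the reverse edges.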

Pathway models

Concluding remarks

• We focused on what needs to be represented
• New kinds of queries
– Graph queries
– Comparison of models and traces
– Is flux q possible in steady state for network N?
– Similarity of networks based on the similarity of their flux cones
– Compare networks based on
  • Their structure
  • Their flux cone
  • Their dynamic behavior
– What-if queries
• We did not cover logics for simulation
– linear logic, computation tree logic
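As an illustration of the steady-state flux query listed above, the Python sketch below uses the usual stoichiometric formulation, which the slides do not spell out and is therefore an assumption here: a flux vector q is admissible when S·q = 0 for the stoichiometric matrix S of the network's internal metabolites and q is non-negative on irreversible reactions (that is, q lies in the flux cone). The three-reaction toy network and the function name are hypothetical.

```python
def is_steady_state_flux(S, q, irreversible=(), tol=1e-9):
    """Check S.q = 0 (no net production of any internal metabolite)
    and q[j] >= 0 for every irreversible reaction index j."""
    for row in S:
        net = sum(coeff * flux for coeff, flux in zip(row, q))
        if abs(net) > tol:
            return False
    return all(q[j] >= -tol for j in irreversible)


# Toy network N (hypothetical): R1: -> A, R2: A -> B, R3: B ->
# Rows are internal metabolites A and B; columns are R1, R2, R3.
S = [
    [1, -1, 0],   # A: produced by R1, consumed by R2
    [0, 1, -1],   # B: produced by R2, consumed by R3
]
ALL_IRREVERSIBLE = (0, 1, 2)

balanced = is_steady_state_flux(S, [2.0, 2.0, 2.0], ALL_IRREVERSIBLE)
unbalanced = is_steady_state_flux(S, [2.0, 1.0, 1.0], ALL_IRREVERSIBLE)
backwards = is_steady_state_flux(S, [-1.0, -1.0, -1.0], ALL_IRREVERSIBLE)
print(balanced, unbalanced, backwards)
```

The third case balances every metabolite but runs irreversible reactions in reverse, so it falls outside the flux cone.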