increasingly accurate representation of biochemistry (v2)

Increasingly Accurate Representation of Biochemistry (v2) Michel Dumontier, Ph.D. Assistant Professor of Bioinformatics Department of Biology, School of Computer Science Institute of Biochemistry, Ottawa Institute of Systems Biology Carleton University 1 SemWeb Group::Vancouver 21/05/2009

Upload: michel-dumontier

Post on 13-Jul-2015


TRANSCRIPT

Page 1: Increasingly Accurate Representation of Biochemistry (v2)

Increasingly Accurate Representation of Biochemistry (v2)

Michel Dumontier, Ph.D.
Assistant Professor of Bioinformatics

Department of Biology, School of Computer Science
Institute of Biochemistry, Ottawa Institute of Systems Biology

Carleton University

SemWeb Group::Vancouver 21/05/2009

Page 2: Increasingly Accurate Representation of Biochemistry (v2)

Representational Issues

Biochemical Identity

Accurate Descriptions

Precise Identifiers

Modeling Situations

Page 3: Increasingly Accurate Representation of Biochemistry (v2)

Which of these are different?

#   A, B                                       Difference?
1   α-D-Glucose, alpha-D-Glucose               None, multiple names
2   α-D-Glucose, β-D-Glucose
3   α-D-Glucose, β-D-Glucose, D-Glucose
4   α-D-Glucose, α-D-Glucose-6-phosphate
5   Hk, Hk(L529S)
6   Hk(human), Hk(mouse)
7   Hk(open), Hk(closed)
8   Hk (L529), Hk (L540)
9   Hk (L529), Hk (A530)

Page 4: Increasingly Accurate Representation of Biochemistry (v2)

α-D-Glucose vs β-D-Glucose

• Rearrangement (isomer)
• Related, but structurally different

Page 5: Increasingly Accurate Representation of Biochemistry (v2)

α-D-Glucose and β-D-Glucose are more specific types* of D-Glucose

* They resolve an ambiguity in stereochemistry

Page 6: Increasingly Accurate Representation of Biochemistry (v2)

α-D-Glucose vs α-D-Glucose-6-Phosphate

• Change (addition + removal) in atoms
• Structurally different
– one is not a type of the other!

Page 7: Increasingly Accurate Representation of Biochemistry (v2)

Post-Translational Modifications

• Structurally different
• Unable to capture the difference with single-letter AA sequence representation

Page 8: Increasingly Accurate Representation of Biochemistry (v2)

Hexokinase (mutation)

       500        510        520        530        540
RRFHKTLRRL VPDSDVRFLL SESGSGKGAA MVTAVAYRSA EQHRQIEETL

       500        510        520        530        540
RRFHKTLRRL VPDSDVRFLL SESGSGKGAA MVTAVAYRLA EQHRQIEETL

Leads to hemolytic anemia

different sequence = different entity

related by some mutation process
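The comparison on this slide can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, the start position 491 assumes the ruler above marks the last residue of each ten-residue block, and treating the leucine-containing fragment as the reference is an assumption based on the Hk(L529S) label.

```python
def point_substitutions(reference, variant, first_position=1):
    """List substitutions such as 'L529S' between two equal-length fragments."""
    if len(reference) != len(variant):
        raise ValueError("fragments must have the same length")
    return [
        f"{ref}{first_position + offset}{var}"
        for offset, (ref, var) in enumerate(zip(reference, variant))
        if ref != var
    ]

# Fragments copied from the slide, with the spaces between blocks removed.
reference = "RRFHKTLRRLVPDSDVRFLLSESGSGKGAAMVTAVAYRLAEQHRQIEETL"
variant = "RRFHKTLRRLVPDSDVRFLLSESGSGKGAAMVTAVAYRSAEQHRQIEETL"

# Assumption: the first block ends at 500, so the fragment starts at 491.
differences = point_substitutions(reference, variant, first_position=491)
print(differences)
```

Any non-empty result means the two descriptions differ, so under the argument of this slide they would need two identifiers.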

Page 9: Increasingly Accurate Representation of Biochemistry (v2)

Hexokinase (human vs mouse)

>sp|P19367|HXK1_HUMAN Hexokinase-1 OS=Homo sapiens GN=HK1 PE=1 SV=3
MIAAQLLAYYFTELKDDQVKKIDKYLYAMRLSDETLIDIMTRFRKEMKNGLSRDFNPTAT
VKMLPTFVRSIPDGSEKGDFIALDLGGSSFRILRVQVNHEKNQNVHMESEVYDTPENIVH
GSGSQLFDHVAECLGDFMEKRKIKDKKLPVGFTFSFPCQQSKIDEAILITWTKRFKASGV
EGADVVKLLNKAIKKRGDYDANIVAVVNDTVGTMMTCGYDDQHCEVGLIIGTGTNACYME
ELRHIDLVEGDEGRMCINTEWGAFGDDGSLEDIRTEFDREIDRGSLNPGKQLFEKMVSGM
YLGELVRLILVKMAKEGLLFEGRITPELLTRGKFNTSDVSAIEKNKEGLHNAKEILTRLG
VEPSDDDCVSVQHVCTIVSFRSANLVAATLGAILNRLRDNKGTPRLRTTVGVDGSLYKTH
PQYSRRFHKTLRRLVPDSDVRFLLSESGSGKGAAMVTAVAYRLAEQHRQIEETLAHFHLT
KDMLLEVKKRMRAEMELGLRKQTHNNAVVKMLPSFVRRTPDGTENGDFLALDLGGTNFRV
LLVKIRSGKKRTVEMHNKIYAIPIEIMQGTGEELFDHIVSCISDFLDYMGIKGPRMPLGF
TFSFPCQQTSLDAGILITWTKGFKATDCVGHDVVTLLRDAIKRREEFDLDVVAVVNDTVG
TMMTCAYEEPTCEVGLIVGTGSNACYMEEMKNVEMVEGDQGQMCINMEWGAFGDNGCLDD
IRTHYDRLVDEYSLNAGKQRYEKMISGMYLGEIVRNILIDFTKKGFLFRGQISETLKTRG
IFETKFLSQIESDRLALLQVRAILQQLGLNSTCDDSILVKTVCGVVSRRAAQLCGAGMAA
VVDKIRENRGLDRLNVTVGVDGTLYKLHPHFSRIMHQTVKELSPKCNVSFLLSEDGSGKG
AALITAVGVRLRTEASS

>sp|P17710|HXK1_MOUSE Hexokinase-1 OS=Mus musculus GN=Hk1 PE=1 SV=2
MGWGAPLLSRMLHGPGQAGETSPVPERQSGSENPASEDRRPLEKQCSHHLYTMGQNCQRG
QAVDVEPKIRPPLTEEKIDKYLYAMRLSDEILIDILTRFKKEMKNGLSRDYNPTASVKML
PTFVRSIPDGSEKGDFIALDLGGSSFRILRVQVNHEKSQNVSMESEVYDTPENIVHGSGS
QLFDHVAECLGDFMEKRKIKDKKLPVGFTFSFPCRQSKIDEAVLITWTKRFKASGVEGAD
VVKLLNKAIKKRGDYDANIVAVVNDTVGTMMTCGYDDQQCEVGLIIGTGTNACYMEELRH
IDLVEGDEGRMCINTEWGAFGDDGSLEDIRTEFDRELDRGSLNPGKQLFEKMVSGMYMGE
LVRLILVKMAKESLLFEGRITPELLTRGKFTTSDVAAIETDKEGVQNAKEILTRLGVEPS
HDDCVSVQHVCTIVSFRSANLVAATLGAILNRLRDNKGTPRLRTTVGVDGSLYKMHPQYS
RRFHKTLRRLVPDSDVRFLLSESGSGKGAAMVTAVAYRLAEQHRQIEETLSHFRLSKQAL
MEVKKKLRSEMEMGLRKETNSRATVKMLPSYVRSIPDGTEHGDFLALDLGGTNFRVLLVK
IRSGKKRTVEMHNKIYSIPLEIMQGTGDELFDHIVSCISDFLDYMGIKGPRMPLGFTFSF
PCKQTSLDCGILITWTKGFKATDCVGHDVATLLRDAVKRREEFDLDVVAVVNDTVGTMMT
CAYEEPSCEIGLIVGTGSNACYMEEMKNVEMVEGNQGQMCINMEWGAFGDNGCLDDIRTD
FDKVVDEYSLNSGKQRFEKMISGMYLGEIVRNILIDFTKKGFLFRGQISEPLKTRGIFET
KFLSQIESDRLALLQVRAILQQLGLNSTCDDSILVKTVCGVVSKRAAQLCGAGMAAVVQK
IRENRGLDHLNVTVGVDGTLYKLHPHFSRIMHQTVKELSPKCTVSFLLSEDGSGKGAALI
TAVGVRLRGDPTNA

Page 10: Increasingly Accurate Representation of Biochemistry (v2)

Hexokinase

Open vs Closed

Structurally identical, but conformationally different

Page 11: Increasingly Accurate Representation of Biochemistry (v2)

Parts need to be identifiable and describable

#   A, B                                       Difference?
1   α-D-Glucose, alpha-D-Glucose               None, multiple names
2   α-D-Glucose, β-D-Glucose                   Structural (rearrangement)
3   α-D-Glucose, β-D-Glucose, D-Glucose        More specific type
4   α-D-Glucose, α-D-Glucose-6-phosphate       Structural (modification)
5   Hk, Hk(L529S)                              Structural (mutation)
6   Hk(human), Hk(mouse)                       Structural (sequence)
7   Hk(open), Hk(closed)                       Conformational
8   Hk (L529), Hk (L540)                       Positional
9   Hk (L529), Hk (A530)                       Structural, positional

Page 12: Increasingly Accurate Representation of Biochemistry (v2)

Biochemical identity

is necessarily based on

a description of structure

Page 13: Increasingly Accurate Representation of Biochemistry (v2)

To determine identity, we have to compare descriptions

Page 14: Increasingly Accurate Representation of Biochemistry (v2)

Given A and B

How would you know that they are different?

Page 15: Increasingly Accurate Representation of Biochemistry (v2)

Given two descriptions about a protein, but where their names differ, how do you know they are the same or different?

– Structure (sequence)
– PTMs
– Organism
– Function, Process, Localization
– Conformation

Page 16: Increasingly Accurate Representation of Biochemistry (v2)

Biochemical identity

is necessarily based on

having accurate descriptions

Page 17: Increasingly Accurate Representation of Biochemistry (v2)

Yet, current approaches add *annotations* rather than create new records with their respective descriptions

Page 18: Increasingly Accurate Representation of Biochemistry (v2)
Page 19: Increasingly Accurate Representation of Biochemistry (v2)

The current approach to assigning biochemical identifiers is erroneous, misleading or underspecified

• Information gathered from multiple structural variants is attributed to the unmodified form (Uniprot/Genbank).

• This conflates functionality arising from similar, but different, structural forms: an inaccurate specification of knowledge.

• Incomplete descriptions are just as bad
– Reactome has an internal identifier for referring to different forms, but links to Uniprot entries
– Obfuscates identity between databases

Page 20: Increasingly Accurate Representation of Biochemistry (v2)

Biochemical relationship

is necessarily based on

a comparison of accurate descriptions

Page 21: Increasingly Accurate Representation of Biochemistry (v2)

For each description, we must

assign a unique name or identifier

Page 22: Increasingly Accurate Representation of Biochemistry (v2)

If the description changes

we need a new identifier!

Page 23: Increasingly Accurate Representation of Biochemistry (v2)

1. Precise Biochemical Identifiers

• Identifiers and their exact descriptions are required for these kinds of entities:
– atom: atomic interactions, catalytic mechanism
– collection of atoms: binding/catalytic site, interaction
– residue: post-translational modification
– collection of residues: motif/domain/interaction site
– molecule: metabolism, signalling
– complex: metabolism, signalling, scaffolds, containers

• We need a reproducible methodology for naming and providing descriptions

Page 24: Increasingly Accurate Representation of Biochemistry (v2)

Different molecules must have different identifiers

• IUPAC International Chemical Identifier (InChI)
• A data string that provides
– the structure of a chemical compound
– the convention for drawing the structure
• It can be made by anyone, anywhere, at any time
– a deterministic algorithm ensures that it is always written in the same way (syntactic identity) and fully specifies the molecular description (semantic identity)
– It is a data identifier

Page 25: Increasingly Accurate Representation of Biochemistry (v2)

(S)-Glutamic Acid

InChI={version}1/{formula}C5H9NO4/c{connections}6-3(5(9)10)1-2-4(7)8/h{H_atoms}3H,1-2,6H2,(H,7,8)(H,9,10)/p{protons}+1/t{stereo:sp3}3-/m{stereo:sp3:inverted}0/s{stereo:type (1=abs, 2=rel, 3=rac)}1/i{isotopic:atoms}4+1
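The layered structure annotated above can be made concrete with a small parser. This is a minimal sketch, not a full InChI implementation: the function name is hypothetical, and it assumes every layer after the formula begins with a one-letter prefix, as in the example on this slide (with the curly-brace annotations removed).

```python
def split_inchi_layers(inchi):
    """Split an InChI string into version, formula and one-letter-prefixed layers."""
    prefix, _, body = inchi.partition("=")
    if prefix != "InChI" or not body:
        raise ValueError("not an InChI string")
    parts = body.split("/")
    if len(parts) < 2:
        raise ValueError("expected at least a version and a formula")
    layers = {"version": parts[0], "formula": parts[1]}
    for part in parts[2:]:
        # c = connections, h = H atoms, p = protons, t/m/s = stereo, i = isotopic
        layers[part[0]] = part[1:]
    return layers

# (S)-Glutamic acid string from the slide, annotations stripped.
glutamic_acid = (
    "InChI=1/C5H9NO4/c6-3(5(9)10)1-2-4(7)8"
    "/h3H,1-2,6H2,(H,7,8)(H,9,10)/p+1/t3-/m0/s1/i4+1"
)
layers = split_inchi_layers(glutamic_acid)
for name, value in layers.items():
    print(name, value)
```

Because each layer carries one aspect of the description, two strings can be compared layer by layer to see whether they differ in connectivity, stereochemistry or isotopes.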

Page 26: Increasingly Accurate Representation of Biochemistry (v2)

α-D-Glucose

79025

CML, SDF

SMILES: O1[C@@H]([C@@H](O)([C@H](O)([C@@H](O)([C@@H]1(O)))))(CO)

InChI: InChI=1/C6H12O6/c7-1-2-3(8)4(9)5(10)6(11)12-2/h2-11H,1H2/t2-,3-,4+,5-,6+/m1/s1

IUPAC: 6-(hydroxymethyl)oxane-2,3,4,5-tetrol OR (2R,3R,4S,5R,6R)-6-(hydroxymethyl)tetrahydro-2H-pyran-2,3,4,5-tetraol

2. Accurate Descriptions

Page 27: Increasingly Accurate Representation of Biochemistry (v2)

OWL Has Explicit Semantics

Can therefore be used to capture knowledge in a machine understandable way

Page 28: Increasingly Accurate Representation of Biochemistry (v2)

Chemical Ontology

Chemical Knowledge for the Semantic Web. Mykola Konyk, Alexander De Leon, and Michel Dumontier. LNBI. 2008. 5109:169-176. Data Integration in the Life Sciences (DILS2008). Evry, France.

Page 29: Increasingly Accurate Representation of Biochemistry (v2)

http://code.google.com/p/semanticwebopenbabel/

RDF/OWL descriptions of molecules

Page 30: Increasingly Accurate Representation of Biochemistry (v2)

Ethanol (figure labels: hydroxyl group, methyl group)

Knowledge of functional groups is important in chemical synthesis, pharmaceutical design and lead optimization.

Functional groups describe chemical reactivity in terms of atoms and their connectivity, and exhibit characteristic chemical behavior when present in a compound.

Describing chemical functional groups in OWL-DL for the classification of chemical compounds. N Villanueva-Rosales, M Dumontier. 2007. OWLED, Innsbruck, Austria.

Page 31: Increasingly Accurate Representation of Biochemistry (v2)

Describing Functional Groups in DL

HydroxylGroup: CarbonGroup that (hasSingleBondWith some (OxygenAtom that hasSingleBondWith some HydrogenAtom))

OHR

R group

Page 32: Increasingly Accurate Representation of Biochemistry (v2)

Fully Classified Ontology

35 FG

Page 33: Increasingly Accurate Representation of Biochemistry (v2)

And, we define certain compounds

Alcohol: OrganicCompound that (hasPart some HydroxylGroup)
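To show what these two definitions ask a reasoner to check, here is a toy Python sketch over a hand-written molecular graph. It is not the OWL-DL classification used in the cited work: the data layout and function names are hypothetical, only single bonds are modelled, and "OrganicCompound" is approximated as "contains at least one carbon atom".

```python
def neighbours(bonds, atom):
    """Atoms joined to `atom` by a single bond (bonds are unordered pairs)."""
    return {b if a == atom else a for a, b in bonds if atom in (a, b)}

def has_hydroxyl_group(elements, bonds):
    """A carbon single-bonded to an oxygen that is single-bonded to a hydrogen."""
    for atom, element in elements.items():
        if element != "C":
            continue
        for oxygen in neighbours(bonds, atom):
            if elements[oxygen] != "O":
                continue
            if any(elements[h] == "H" for h in neighbours(bonds, oxygen)):
                return True
    return False

def is_alcohol(elements, bonds):
    """Alcohol: an organic compound that has some hydroxyl group as a part."""
    is_organic = "C" in elements.values()  # simplifying assumption
    return is_organic and has_hydroxyl_group(elements, bonds)

hydrogens = {f"h{n}": "H" for n in range(1, 7)}
ethanol = {"c1": "C", "c2": "C", "o1": "O", **hydrogens}
ethanol_bonds = [("c1", "c2"), ("c2", "o1"), ("o1", "h6"),
                 ("c1", "h1"), ("c1", "h2"), ("c1", "h3"),
                 ("c2", "h4"), ("c2", "h5")]

ethane = {"c1": "C", "c2": "C", **hydrogens}
ethane_bonds = [("c1", "c2"), ("c1", "h1"), ("c1", "h2"), ("c1", "h3"),
                ("c2", "h4"), ("c2", "h5"), ("c2", "h6")]

print(is_alcohol(ethanol, ethanol_bonds), is_alcohol(ethane, ethane_bonds))
```

As with the definition on the slide, anything containing the C-O-H pattern matches, so narrower classes would need extra restrictions.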

Page 34: Increasingly Accurate Representation of Biochemistry (v2)

Organic Compound Ontology

28 OC

Page 35: Increasingly Accurate Representation of Biochemistry (v2)

Question Answering

• Query all attributes

• Query PubChem, DrugBank and dbPedia

Page 36: Increasingly Accurate Representation of Biochemistry (v2)

We also need Identifiers/Descriptions for Atoms

• Atom identifiers need to be consistently assigned
– OpenBabel plugin component naming was first come, first served, along with the assigned mol identifier from PubChem SDF files
  e.g. id#aN, where a is the “atom” label and N is the position
– Canonical numbering (InChI) is required

• Atom descriptions need only specify the mereological relation:
  id#aN :isProperPartOf :id

Page 37: Increasingly Accurate Representation of Biochemistry (v2)

What about identifiers for collection of atoms?

• Potentially useful in describing residues, PTMs, binding sites, etc.
– Is the lack of connectivity sufficient?

• Contiguous:
– ranges (id#aN-aN)
– enumerations (id#aN,aN,aN)

• Non-contiguous:
– Combination of ranges, enumerations?
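One way to read the range and enumeration shorthand is sketched below. The parser and its name are hypothetical, and allowing ranges and enumerations to be mixed in one identifier (for non-contiguous collections) is an assumption that the slide only raises as a question.

```python
def expand_part_identifier(identifier):
    """Expand 'id#a1-a3,a7' into the whole's id and the individual part ids."""
    whole, _, fragment = identifier.partition("#")
    if not whole or not fragment:
        raise ValueError("expected the form id#parts")
    parts = []
    for token in fragment.split(","):
        start, _, end = token.partition("-")
        label, first = start[0], int(start[1:])
        last = first
        if end:
            if end[0] != label:
                raise ValueError("a range must use one label, e.g. a1-a3")
            last = int(end[1:])
        parts.extend(f"{whole}#{label}{n}" for n in range(first, last + 1))
    return whole, parts

whole, parts = expand_part_identifier("mol#a1-a3,a7")

# Each part is described only by its mereological relation to the whole.
triples = [(part, ":isProperPartOf", whole) for part in parts]
for triple in triples:
    print(*triple)
```

The same function handles a residue label such as `r`, since it treats the first character of each token as the label.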

Page 38: Increasingly Accurate Representation of Biochemistry (v2)

Can we reuse our positional nomenclature for residues?

• Residues are generally referred to by their absolute position in the biopolymer sequence.

Global atom numbering:
id#a50-a65 owl:sameAs id#r5

Residue-specific atom numbering:
id#r5_a1-r5_a15 owl:sameAs id#r5

• Collections of residues might follow the same rules as collections of atoms.
– Useful for defining domains, motifs, etc.

Page 39: Increasingly Accurate Representation of Biochemistry (v2)

While we’re at it, we could extend our expressive capability to create broader descriptions:

• Specification:
– Exactly mod1@posX
– Only mod1@posX

• Minimum:
– At least mod1@posX

• Combination:
– mod1@posX AND mod2@posY, X != Y

• Possibilities/Uncertainty:
– (mod1 OR mod2)@posX

• Exclusion:
– not mod1@posX
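These description patterns can be prototyped as set operations before committing to an OWL encoding. The sketch below is an illustration only: a structural form is modelled as a set of (modification, position) pairs, all function names are hypothetical, and "Exactly"/"Only" are both read as "these modifications and no others", which is an assumption.

```python
def exactly(form, required):
    """Specification: the form carries these modifications and no others."""
    return form == frozenset(required)

def at_least(form, required):
    """Minimum / Combination: every listed modification is present."""
    return frozenset(required) <= form

def one_of(form, alternatives):
    """Possibilities: at least one of the alternatives is present."""
    return any(alternative in form for alternative in alternatives)

def excludes(form, forbidden):
    """Exclusion: none of the listed modifications is present."""
    return form.isdisjoint(forbidden)

# A form hydroxylated at two positions (positions borrowed from the HIF1α case study).
doubly_hydroxylated = frozenset({("hydroxyl", 402), ("hydroxyl", 564)})

checks = {
    "at least hydroxyl@402": at_least(doubly_hydroxylated, [("hydroxyl", 402)]),
    "exactly hydroxyl@402": exactly(doubly_hydroxylated, [("hydroxyl", 402)]),
    "(hydroxyl OR ubiquitin)@564": one_of(
        doubly_hydroxylated, [("hydroxyl", 564), ("ubiquitin", 564)]),
    "not ubiquitin@532": excludes(doubly_hydroxylated, [("ubiquitin", 532)]),
}
for pattern, result in checks.items():
    print(pattern, result)
```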

Page 40: Increasingly Accurate Representation of Biochemistry (v2)

So what if we describe the structural features of the molecule with OWL (sequence + PTMs), and generate an identifier from one of its serializations?

That way we get a unique identifier with a description that is extensible and compatible with the semantic web.
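A rough sketch of the idea in Python follows. It does not reproduce any existing service: JSON with sorted keys stands in for a canonical OWL serialization, SHA-256 is an arbitrary choice of digest, and the function names and the four-residue sequence are hypothetical.

```python
import hashlib
import json

def canonical_serialization(sequence, modifications=()):
    """Serialize a description so that equal descriptions give equal strings."""
    record = {
        "sequence": sequence.upper(),
        # Sort so the order in which PTMs were listed does not matter.
        "modifications": sorted([position, name] for name, position in modifications),
    }
    return json.dumps(record, sort_keys=True, separators=(",", ":"))

def description_identifier(sequence, modifications=()):
    """Derive an identifier from the canonical serialization of a description."""
    serialized = canonical_serialization(sequence, modifications)
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()

unmodified = description_identifier("MPRK")
modified = description_identifier("MPRK", [("hydroxyl", 2), ("ubiquitin", 4)])
reordered = description_identifier("MPRK", [("ubiquitin", 4), ("hydroxyl", 2)])

print(unmodified != modified)  # a changed description gets a new identifier
print(modified == reordered)   # the same description always gets the same one
```

Anyone holding the same description can recompute the identifier independently, which is why a canonical serialization matters more than the particular hash.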

Page 41: Increasingly Accurate Representation of Biochemistry (v2)

Biological Identifier Service

Page 42: Increasingly Accurate Representation of Biochemistry (v2)

Description to Identifier

Page 43: Increasingly Accurate Representation of Biochemistry (v2)

What does this mean?

• Identifier exactly matches the description
– Great as a primary key for databases
– Can be used for citation purposes (no more fuzzy diagrams!)
– Exact description can be obtained for a given identifier

• Description is extensible, and new identifiers can be autogenerated independently
– Needs canonical serialization / central service
– Histories can be made and published

Page 44: Increasingly Accurate Representation of Biochemistry (v2)

Case Study: HIF1α
Hypoxia-Inducible Factor 1, alpha chain (uniprot:Q16665)
Master transcriptional regulator of the adaptive response to hypoxia

• Under normoxic conditions, HIF1α is hydroxylated on Pro-402 and Pro-564 in the oxygen-dependent degradation domain (ODD) by EGLN1/PHD1 and EGLN2/PHD2. EGLN3/PHD3 has also been shown to hydroxylate Pro-564. The hydroxylated prolines promote interaction with VHL, initiating rapid ubiquitination and subsequent proteasomal degradation.

Situation
b) Normoxic
c) Hypoxic
d) Other/Unspecified

Multiple structural forms

Part, named/ unnamed regions

The part is the agent in the process

Selective interaction with parts

Page 45: Increasingly Accurate Representation of Biochemistry (v2)

Structure-based biochemical identity: Differences between apples and oranges

• HIF1α
– au naturel

• HIF1α
– hydroxylated @P402

• HIF1α
– hydroxylated @P564

• HIF1α
– hydroxylated @P402 & @P564

• HIF1α
– hydroxylated @P402 & (@P564)
– ubiquitinated @K532

• HIF1α
– L400A & L397A

Page 46: Increasingly Accurate Representation of Biochemistry (v2)

Uniprot example revisited

Under normoxic conditions, HIF1α is hydroxylated on Pro-402 and Pro-564 in the oxygen-dependent degradation domain (ODD) by EGLN1/PHD1 and EGLN2/PHD2. The hydroxylated prolines promote interaction with VHL, initiating rapid ubiquitination and subsequent proteasomal degradation.

:A rdfs:subClassOf :Hydroxylation
:A hasParticipant (:0#r402 and :Substrate)
:A hasParticipant (:1#r402 and :Product)
:A hasParticipant (:5 and :Enzyme)

:B rdfs:subClassOf :Interaction
:B :hasParticipant (:2#r402 or :3#r564 or :4#r402,r564)
:B :hasParticipant (:6)

:1 (HIF1α)
:2 (HIF1α + P402hyd)
:3 (HIF1α + P564hyd)
:4 (HIF1α + P402hyd + P564hyd)
:5 (EGLN1)
:6 (VHL)

Please ignore the made-up shorthand syntax!

Page 47: Increasingly Accurate Representation of Biochemistry (v2)

Situational Modeling

Page 48: Increasingly Accurate Representation of Biochemistry (v2)

Inferring Protein Participation

• OWL Role Chain
hasParticipant o isPartOf -> hasParticipant

If a process has a part as a participant, then the whole is also a participant.

:0#r402 :isPartOf :0
:1#r402 :isPartOf :1

:A rdfs:subClassOf :Hydroxylation
:A hasParticipant (:0#r402 and :Substrate)
:A hasParticipant (:1#r402 and :Product)

Inferred:
:A hasParticipant :0
:A hasParticipant :1
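The effect of the role chain can be imitated with a small fixed-point loop over pairs. This is a sketch of the inference pattern only, not of how an OWL reasoner works internally; the function name is hypothetical and the pairs reuse the shorthand identifiers from this slide.

```python
def infer_participants(has_participant, is_part_of):
    """Apply hasParticipant o isPartOf -> hasParticipant until nothing new is added."""
    inferred = set(has_participant)
    changed = True
    while changed:
        changed = False
        for process, participant in list(inferred):
            for part, whole in is_part_of:
                if part == participant and (process, whole) not in inferred:
                    inferred.add((process, whole))
                    changed = True
    return inferred

has_participant = {(":A", ":0#r402"), (":A", ":1#r402")}
is_part_of = {(":0#r402", ":0"), (":1#r402", ":1")}

participants = infer_participants(has_participant, is_part_of)
for process, participant in sorted(participants):
    print(process, "hasParticipant", participant)
```

The loop also follows longer part-of paths, for example an atom that is part of a residue that is part of a protein.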

Page 49: Increasingly Accurate Representation of Biochemistry (v2)

We will add new knowledge about biochemicals and their parts into the linked data web through Bio2RDF!

Page 50: Increasingly Accurate Representation of Biochemistry (v2)

Query descriptions to find matching biochemicals

• Chemical
– Structural
– Conformation (e.g. open vs closed form)
– Collections (alpha vs beta forms of D-glucose)

• Biological
– Species
– mRNA/Gene from which it was transcribed/encoded
– Reactions / post-translational modifications
– Mutations

Page 51: Increasingly Accurate Representation of Biochemistry (v2)

Summary

• Biochemical identity is tightly linked to accurate descriptions.

• Automatic and consistent identifier generation will allow anybody to specify findings according to the biopolymers for which they were observed
– No curation required!
– Will be discovered automatically
– Links biochemical knowledge at various levels of granularity

• Situational modeling enables the careful separation of what is known under a particular circumstance.

Page 52: Increasingly Accurate Representation of Biochemistry (v2)

dumontierlab.com

[email protected]

Special thanks to PhD Student Leonid Chepelev for insightful discussions

semanticscience.org