biomedical ontologies
DESCRIPTION
Biomedical Ontologies. Bio-Trac 40 (Protein Bioinformatics) October 8, 2009 Zhang-Zhi Hu, M.D. Associate Professor Department of Oncology Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical Center. Overview. What is ontology ? - PowerPoint PPT PresentationTRANSCRIPT
1
Biomedical Ontologies
Bio-Trac 40 (Protein Bioinformatics)Bio-Trac 40 (Protein Bioinformatics)
October 8, 2009October 8, 2009
Zhang-Zhi Hu, M.D. Zhang-Zhi Hu, M.D. Associate ProfessorAssociate ProfessorDepartment of OncologyDepartment of OncologyDepartment of Biochemistry and Molecular & Cellular BiologyDepartment of Biochemistry and Molecular & Cellular BiologyGeorgetown University Medical CenterGeorgetown University Medical Center
2
Overview• What is ontology?
– What is biomedical ontology?• What is gene ontology?
– How is it generated?– How is it used for annotation?
• What is protein ontology? – Why is it necessary?– How to use it?
• ……
3
Tree of Porphyry with Aristotle’s Categories
Aristotle, 384 BC – 322 BC
4
Ontology:
• In philosophy, it seeks to describe basic categories and relationships of being or existence to define entities and types of entities within its framework: – What do you know? How do you know it?– What is existence? What is a physical object? – What constitutes the identity of an object? ……
• Central goal is to have a definitive and exhaustive classification of all entities.
onto-, of being or existence; -logy, study. Greek origin; Latin, ontologia,1606
“The science of what is, of the kinds and structures of objects, properties, events, processes and relations in every area of reality”
– Barry Smith, U Buffalo
5
In computer and information science• Ontology is a data model that represents a set of
concepts within a domain and the relationships between those concepts. It is used to reason about the objects within that domain.
is_a
Individuals (instances)
Classes (concepts)
Attributes
Relations
Most ontologies describe individuals (instances), classes (concepts), attributes, and relations
your Ford, my Ford, his Ford…
e.g. color, engine, door…
Classes
6
What are ontology useful for?
• Terminology management• Integration, interoperability, and sharing of data
– promote precise communication between scientists– enable information retrieval across multiple resources
• Knowledge reuse and decision support– extend the power of computational approaches to perform
data exploration, inference, and mining
Ontology is a form of knowledge representation about the world or some part of it.
Biomedical Terminology vs. Biomedical Ontology
• UMLS (unified medical language system)
• MeSH (medical subject heading)
• NCI Thesaurus
• SNOMED / SNODENT
• Medical WordNet
7
Ontology Enables Large-Scale Biomedical Science
• Structured representation of biomedicine: – For different types of entities and relations to
describe biomedicine (ontology content curation).
• Annotation: using ontologies to summarize and describe biomedical experimental results to enable: – Integration of their data with other researchers’
results– Cross-species analyses
The center of two major activities currently in biomedical research:
8
what makes it so wildly
successful ?
Gene Ontology (GO)
9
GO Consortium
• The Gene Ontology was originally constructed in 1998 by a consortium of researchers studying the genome of three model organisms:
– Drosophila melanogaster (fruit fly) (FlyBase)
– Mus musculus (mouse) (MGD)
– Saccharomyces cerevisiae (yeast) (SGD)
• Many other model organism databases have joined the GO consortium, contributing:
– development of the ontologies
– annotations for the genes of one or more organisms
http://www.geneontology.org/
10
Three key concepts: [Currently total 27399 GO terms] (May 2009)
• Biological process: series of events accomplished by one or more ordered assemblies of molecular functions, e.g. signal transduction, or pyrimidine metabolism, and alpha-glucoside transport. [ [total: 16468]
• Molecular function: describes activities, such as catalytic or binding activities, that occur at the molecular level. Activities that can be performed by individual gene products, or by assembled complexes of gene products; e.g. catalytic activity, transporter activity. [total: 8585]
• Cellular component: a component of a cell that it is part of some larger object, maybe an anatomical structure (e.g. ER or nucleus) or a gene product group (e.g. ribosome, or a protein dimer). [total: 2346]
• What is Gene Ontology? GO provides controlled vocabulary to describe gene and gene product attributes in any organism – how gene products behave in a cellular context
Need for annotation of genome sequences
• GO annotation - Characterization of gene products using GO terms - Members submit their data which are available at GO website.
11
GO Representation: Tree or Network?
Node, a concept or a term
root
Leaf node
C has two parents, A and B
A
B
C
C
GO is a network structure
Relations: is_a, or part_of
12http://www.geneontology.org/
13
GO term (GO:0006366): mRNA transcription from RNA polymerase II promoter
Leaf node
GO search and display tool
14
Human p53 – GO annotation (UniProtKB:P04637)
GO:0006289:nucleotide-excision repair [PMID:7663514; evidence:IMP]
15
• Science basis of the GO: trained experts use the experimental observations from literature to associate GO terms with gene products (to annotate the entities represented in the gene/protein databases)
• Enabling data integration across databases and making them available to semantic search
GO annotation of gene products
Human, mouse, plant, worm, yeast …
http://www.geneontology.org/GO.current.annotations.shtml
16
What GO is NOT……• Ontology of gene products: e.g. cytochrome c is not in
GO, but attributes of cytochrome c are, e.g. oxidoreductase activity.
• Processes, functions and component unique to mutants or diseases: e.g. oncogenesis is not a valid GO.
• Protein domains or structural features. • Protein-protein interactions. • Environment, evolution and expression. • Anatomical or histological features above the level of
cellular components, including cell types.
Neither GO is Ontology of Genes!! – a misnomer
17
Missing GO nodes…
not deep enough…not broad enough…
18
Lack of connections among GOs
Estrogen receptor
19
what cellular component?
what molecular function?
what biological process?
GO: A Common Standard for Omics Data Analysis
20
need more…• need to improve the quality of GO to support more
rigorous logic-based reasoning across the data annotated in its terms
• need to extend the GO by engaging ever broader community support for addition of new terms and for correction of errors
• need to extend the methodology to other domains, including clinical domains, such as: – disease ontology– immunology ontology– symptom (phenotype) ontology– clinical trial ontology– ...
21
• Establish common rules governing best practices for creating ontologies and for using these in annotations
• Apply these rules to create a complete suite of orthogonal interoperable biomedical reference ontologies
National Center for Biomedical
Ontology (NCBO)
http://www.obofoundry.org/
http://bioontology.org/
22
http://www.obofoundry.org/index.cgi?sort=domain&show=ontologies
23
• tight connection to the biomedical basic sciences• compatibility, interoperability, common relations• support for logic-based reasoning
OBO Foundry = a subset of OBO ontologies, whose developers have agreed in advance to accept a common set of principles reflecting best practice in ontology development designed to ensure:
The OBO Foundry
• A family of interoperable gold standard biomedical reference ontologies to serve annotation of:
– scientific literature
– model organism databases
– clinical trial data …
OBO Foundry Principles: http://www.obofoundry.org/crit.shtml
24
CONTINUANT OCCURRENT
INDEPENDENT DEPENDENT
ORGAN ANDORGANISM
Organism(NCBI
Taxonomy)
Anatomical Entity(FMA, CARO)
OrganFunction
(FMP, CPRO) Phenotypic
Quality(PaTO)
Biological Process
(GO)CELL AND CELLULAR
COMPONENT
Cell(CL)
Cellular Componen
t(FMA, GO)
Cellular Function
(GO)
MOLECULEMolecule
(ChEBI, SO,RNAO, PRO)
Molecular Function(GO)
Molecular Process
(GO)
Rationale of OBO Foundry coverage
GRANULARITY
RELATION TO TIME
25
Foundational is_apart_of
Spatial located_incontained_inadjacent_to
Temporal transformation_ofderives_frompreceded_by
Participation has_participanthas_agent
OBO Relation Ontology
e.g.: A is_a B =def. every instance of A is an instance of B
“rose is_a plant all instances of rose is_a plant”
26
What is Protein Ontology? Why?
PRO
http://pir.georgetown.edu/pro/
27
27
PRotein Ontology (PRO) PRO in OBO Foundry to represent protein entities
Ontology for Protein Evolution (ProEvo) for evolutionary classes of proteins: captures the protein classes reflecting evolutionary relationships at full-length protein levels. It is meant to indicate the relatedness of proteins, not their evolution. Use is_a relationship.
Ontology for Protein Forms (ProForm) for multiple protein forms of a gene: captures the different protein forms of a specific gene from genetic variations, alternative splicing, cleavage, post-translational modifications. Relations used
is_a or derives_from
28
Why PRO?– Allows specification of relationships between PRO and other ontologies, such as
GO and Disease Ontology– Provides a structure to support formal, computer-based inferences based on shared
attributes among homologous proteins– Provides a stable unique identifier to any protein type
PRO:000000650 Smad2 isoform 1 phosphorylated 1 (phosphorylated SSxS motif) NOT has_function GO:0003677:DNA binding
PRO:000000656 Smad2 isoform 2 phosphorylated 1 (phosphorylated SSxS motif) has_function GO:0003677 DNA binding
Provides formalization and precise annotation of specific protein forms/ classes, allowing accurate and consistent data mapping, integration and analysis: for dendritic cell ontology or pathway[Term] http://www.biomedcentral.com/1471-2105/10/70id: DC_CL:0000003name: conventional dendritic celldef: "conventional dendritic cell is_a leukocyte that has_high_plasma_membrane_amount_relative_to_leukocyte CD11c and lacks_plasma_membrane_part CD19, CD3, C34, and CD56." [AMM:amm]comment: Immunological Reviews 2007 219: 118-142intersection_of: CL:0000738 ! leukocyteintersection_of: has_high_plasma_membrane_amount_relative_to_leukocyte PRO:000001013 ! CD11cintersection_of: lacks_plasma_membrane_part DC_CL:0000072 ! CD3intersection_of: lacks_plasma_membrane_part PRO:000001002 ! CD19intersection_of: lacks_plasma_membrane_part PRO:000001003 ! CD34intersection_of: lacks_plasma_membrane_part PRO:000001024 ! CD56
29
II ITGF-beta receptor
Smad 4
Nucleus
DNA binding and transcription regulation
MAPKKK
Shc S PY P
S PY P
Y PS P
Y P
S PS PT P
XIAP
P38 MAPKpathway
JNKcascade
II IS P
Y PY P
S PY P
S PS PT P
Smad 7
STRAP
TAK1K U
Degradation
TAK1
Smad 2
S PS P
Ski
Shc
Smad 2
Smad 4
Smad 2
S PS P
Smad 4
Smad 2
S PS P
Smad 4
Smad 2
S PS P
X
ERK1/2
CaM
Smad 2
S PS P
X P
Smad 2
TPSPSPSPSP
TP
Furin
TGF- TGF-
TGF-LAP
Cytoplasm
Ca2+Growth signal
s
S P
T P
Y P
K U
Phosphorylation (P) at Serine (S),
Threonine (T)
Tyrosine (Y)Ubiquitination (U) at Lysine (K)
TGF-beta signaling – comparison between PID and Reactome
Growth signal
s
Only included in Reactome
Common in both Reactome & PID
* All others are in PID. Not all components in the pathway from both databases are listed
Stress signal
s
MEKK1
T PT P
Smad 2
S PS P
X P
X
PRO:000000523
PRO:000000650
PRO:000000651
PRO:000000410
PRO:000000616
PRO:000000652
PRO:000000366
PRO:000000481
PRO:000000618
PRO:000000397
PRO:000000650PRO:000000366
PRO:000000650PRO:000000366
PRO:000000468
30
The Need for Representation of Various Proteins Forms
Human PRLR
and PTMs…
Glucocorticoid receptor (GR)
31
Sphingomyelin phosphodiesterase (SMPD1) (ASM_HUMAN)
• Cleavage sites:– lysosomal: the enzyme is transported from the Golgi apparatus
to the lysosome after additions of mannose-6-phosphate moieties (M6P) and binding to M6P receptor.
– secreted: the shorter cleaved form is not modified with M6P and is targeted for secretion to the extracellular space, with different functions such as LDL binding and oxidized LDL catabolism.
M6P
lysosome
Extracelluar, e.g. LDL binding
32
GOA for Transcription factor Ovo-like 2Form 1 - long: GO:0045892 IDA - negative regulation of transcription, DNA-dependentForm 2 – short: GO:0045893 IDA - positive regulation of transcription, DNA-dependent
274 aa
OVOL2_MOUSE (Q8CIV7)
- Gene. 2004 336:47-58. PMID:15225875
33
Human
Chimp
Mou
se
E.coli
Fly Yeast
Wor
mRat B.su
btilis
Implications of Protein Evolution
Common ancestor
• Conclusions from experiments performed on proteins from one organism are often applicable to the homologous protein from another organism.
• Information learned about existing proteins allows us to infer the properties of ancestral proteins.
34
Functional convergence
• Protein classes of the same function derived from different evolutionary origins, e.g. carbonate dehydratase (or carbonic anhydrase EC 4.2.1.1), which has three independent gene families with functional convergence.
Animal and prokaryotic type
Plant and prokaryotic type
Archaea type
35
Functional divergence
TGM3 (Human)
TGM3 (Mouse)
EPB42 (Human)
EPB42 (Mouse)
Gene Duplication (TGM3/EPB42 split) Speciation (Human/mouse split)
TGM3 (Human)
TGM3 (Mouse)
EPB42 (Human)
EPB42 (Mouse)
TGM3 branch
EPB42 branch
Human
Mouse
Human
Mouse
TGM3 = Protein-glutamine gamma-glutamyltransferase(Transglutaminase; involved in protein modification)
EBP42 = Erythrocyte membrane protein band 4.2(Constituent of cytoskeleton; involved in cell shape)
36
ProEvo
ProForm
GO Gene Ontology
molecular function
cellular component
biological process
participates_in
part_of (for complexes) located_in (for compartments)
has_function
PROprotein
Root Levelis_a
translation product of an evolutionarily-related gene
translation product of a specific mRNA
Family-Level Distinction• Derivation: common ancestor• Source: PIRSF family
Modification-Level Distinction• Derived from post-translational modification• Source: UniProtKB
Sequence-Level Distinction• Derivation: specific allele or splice variant• Source: UniProtKB
cleaved/modified translation product disease
OMIM Disease
agent_in
is_a
protein modification
has_modification
PSI-MOD Modification
SO Sequence Ontology
sequence change
has_agent (sequence change) agent_of (effect on function)
derives_from
protein domain
has_part
Pfam Domain
Example:
TGF-beta receptor phosphorylated smad2 isoform1
is a phosphorylated smad2 isoform1
derives_from smad2 isoform 1
is a smad2
is a TGF- receptor-regulated smad
is a smad
is a protein
Modification Level
Sequence Level
Family Level
Root Level
translation product of a specific geneGene-Level Distinction• Derivation: specific gene• Sources: PIRSF subfamily, Panther subfamily
is_a
Gene Level
Protein Ontology (PRO) http://pir.georgetown.edu/pro/
37
Cellular Component:- nucleus- cytoplasm
Molecular Function:- protein binding
Biological Process:- signal transduction- regulation of transcription, DNA-dependent
Mothers against decapentaplegic homolog 2
Smad 2
GO annotation of SMAD2_HUMAN:
38
II I
TGF-TGF-beta receptor
PP Smad 4
4 DNA binding
1 phosphorylation
2 complex formation
Nucleus
Cytoplasm
Smad 2
Smad 2
PPSmad 2
Smad 4
Transcription Regulation
PPSmad 2
Smad 4
3 nuclear translocation
PPSmad 2PP
P
++
ERK1
CAMK2
PP
39
“normal” •Cytoplasmic PRO:00000011
TGF- receptor phosphorylated
•Forms complex•Nuclear•Txn upregulation
PRO:00000013
ERK1 phosphorylated •Forms complex•Nuclear•Txn upregulation++
PRO:00000014
CAMK2 phosphorylated
•Forms complex•Cytoplasmic•No Txn upregulation
PRO:00000015
alternatively spliced short form
•Cytoplasmic
PRO:00000016
phosphorylated short form
•Nuclear•Txn upregulation PRO:00000018
point mutation (causative agent: large intestine carcinoma)
•Doesn’t form complex•Cytoplasmic•No Txn upregulation
PRO:00000019
Smad 2
PPSmad 2
PPSmad 2P
PPSmad 2 P
Smad 2 x
Smad 2
Smad 2 PP
SMAD2_HUMAN
SMAD2_HUMAN
SMAD2_HUMAN
SMAD2_HUMAN
SMAD2_HUMAN
SMAD2_HUMAN
SMAD2_HUMAN
Smad2 gene products Forms Location ID
40
Search:smad2 -> 21 protein terms
http://pir.georgetown.edu/cgi-bin/pro/textsearch_pro
41
Root:Protein
PRO hierarchy for smad2 isoform 2
42
43
PRO entry report shows the ontology and the annotation
44
45
Summary• The vision of the biomedical ontology community is that all
biomedical knowledge and data are disseminated on the Internet using principled ontologies, such that they are semantically interoperable and useful for improving biomedical science and clinical care.
• The scope extends to all knowledge and data that is relevant to the understanding or improvement of human biology and health.
• Knowledge and data are semantically interoperable when they enable predictable, meaningful, computation across knowledge sources developed independently to meet diverse needs.
• Principled ontologies are ones that follow NCBO-recommended formats and methodologies for ontology development, maintenance, and use.