biomedical ontologies

45
1 Biomedical Ontologies Bio-Trac 40 (Protein Bioinformatics) Bio-Trac 40 (Protein Bioinformatics) October 8, 2009 October 8, 2009 Zhang-Zhi Hu, M.D. Zhang-Zhi Hu, M.D. Associate Professor Associate Professor Department of Oncology Department of Oncology Department of Biochemistry and Molecular & Department of Biochemistry and Molecular & Cellular Biology Cellular Biology Georgetown University Medical Center Georgetown University Medical Center

Upload: lawrence-ashton

Post on 03-Jan-2016

31 views

Category:

Documents


0 download

DESCRIPTION

Biomedical Ontologies. Bio-Trac 40 (Protein Bioinformatics) October 8, 2009 Zhang-Zhi Hu, M.D. Associate Professor Department of Oncology Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical Center. Overview. What is ontology ? - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Biomedical Ontologies

1

Biomedical Ontologies

Bio-Trac 40 (Protein Bioinformatics)Bio-Trac 40 (Protein Bioinformatics)

October 8, 2009October 8, 2009

Zhang-Zhi Hu, M.D. Zhang-Zhi Hu, M.D. Associate ProfessorAssociate ProfessorDepartment of OncologyDepartment of OncologyDepartment of Biochemistry and Molecular & Cellular BiologyDepartment of Biochemistry and Molecular & Cellular BiologyGeorgetown University Medical CenterGeorgetown University Medical Center

Page 2: Biomedical Ontologies

2

Overview• What is ontology?

– What is biomedical ontology?• What is gene ontology?

– How is it generated?– How is it used for annotation?

• What is protein ontology? – Why is it necessary?– How to use it?

• ……

Page 3: Biomedical Ontologies

3

Tree of Porphyry with Aristotle’s Categories

Aristotle, 384 BC – 322 BC

Page 4: Biomedical Ontologies

4

Ontology:

• In philosophy, it seeks to describe basic categories and relationships of being or existence to define entities and types of entities within its framework: – What do you know? How do you know it?– What is existence? What is a physical object? – What constitutes the identity of an object? ……

• Central goal is to have a definitive and exhaustive classification of all entities.

onto-, of being or existence; -logy, study. Greek origin; Latin, ontologia,1606

“The science of what is, of the kinds and structures of objects, properties, events, processes and relations in every area of reality”

– Barry Smith, U Buffalo

Page 5: Biomedical Ontologies

5

In computer and information science• Ontology is a data model that represents a set of

concepts within a domain and the relationships between those concepts. It is used to reason about the objects within that domain.

is_a

Individuals (instances)

Classes (concepts)

Attributes

Relations

Most ontologies describe individuals (instances), classes (concepts), attributes, and relations

your Ford, my Ford, his Ford…

e.g. color, engine, door…

Classes

Page 6: Biomedical Ontologies

6

What are ontology useful for?

• Terminology management• Integration, interoperability, and sharing of data

– promote precise communication between scientists– enable information retrieval across multiple resources

• Knowledge reuse and decision support– extend the power of computational approaches to perform

data exploration, inference, and mining

Ontology is a form of knowledge representation about the world or some part of it.

Biomedical Terminology vs. Biomedical Ontology

• UMLS (unified medical language system)

• MeSH (medical subject heading)

• NCI Thesaurus

• SNOMED / SNODENT

• Medical WordNet

Page 7: Biomedical Ontologies

7

Ontology Enables Large-Scale Biomedical Science

• Structured representation of biomedicine: – For different types of entities and relations to

describe biomedicine (ontology content curation).

• Annotation: using ontologies to summarize and describe biomedical experimental results to enable: – Integration of their data with other researchers’

results– Cross-species analyses

The center of two major activities currently in biomedical research:

Page 8: Biomedical Ontologies

8

what makes it so wildly

successful ?

Gene Ontology (GO)

Page 9: Biomedical Ontologies

9

GO Consortium

• The Gene Ontology was originally constructed in 1998 by a consortium of researchers studying the genome of three model organisms:

– Drosophila melanogaster (fruit fly) (FlyBase)

– Mus musculus (mouse) (MGD)

– Saccharomyces cerevisiae (yeast) (SGD)

• Many other model organism databases have joined the GO consortium, contributing:

– development of the ontologies

– annotations for the genes of one or more organisms

http://www.geneontology.org/

Page 10: Biomedical Ontologies

10

Three key concepts: [Currently total 27399 GO terms] (May 2009)

• Biological process: series of events accomplished by one or more ordered assemblies of molecular functions, e.g. signal transduction, or pyrimidine metabolism, and alpha-glucoside transport. [ [total: 16468]

• Molecular function: describes activities, such as catalytic or binding activities, that occur at the molecular level. Activities that can be performed by individual gene products, or by assembled complexes of gene products; e.g. catalytic activity, transporter activity. [total: 8585]

• Cellular component: a component of a cell that it is part of some larger object, maybe an anatomical structure (e.g. ER or nucleus) or a gene product group (e.g. ribosome, or a protein dimer). [total: 2346]

• What is Gene Ontology? GO provides controlled vocabulary to describe gene and gene product attributes in any organism – how gene products behave in a cellular context

Need for annotation of genome sequences

• GO annotation - Characterization of gene products using GO terms - Members submit their data which are available at GO website.

Page 11: Biomedical Ontologies

11

GO Representation: Tree or Network?

Node, a concept or a term

root

Leaf node

C has two parents, A and B

A

B

C

C

GO is a network structure

Relations: is_a, or part_of

Page 12: Biomedical Ontologies

12http://www.geneontology.org/

Page 13: Biomedical Ontologies

13

GO term (GO:0006366): mRNA transcription from RNA polymerase II promoter

Leaf node

GO search and display tool

Page 14: Biomedical Ontologies

14

Human p53 – GO annotation (UniProtKB:P04637)

GO:0006289:nucleotide-excision repair [PMID:7663514; evidence:IMP]

Page 15: Biomedical Ontologies

15

• Science basis of the GO: trained experts use the experimental observations from literature to associate GO terms with gene products (to annotate the entities represented in the gene/protein databases)

• Enabling data integration across databases and making them available to semantic search

GO annotation of gene products

Human, mouse, plant, worm, yeast …

http://www.geneontology.org/GO.current.annotations.shtml

Page 16: Biomedical Ontologies

16

What GO is NOT……• Ontology of gene products: e.g. cytochrome c is not in

GO, but attributes of cytochrome c are, e.g. oxidoreductase activity.

• Processes, functions and component unique to mutants or diseases: e.g. oncogenesis is not a valid GO.

• Protein domains or structural features. • Protein-protein interactions. • Environment, evolution and expression. • Anatomical or histological features above the level of

cellular components, including cell types.

Neither GO is Ontology of Genes!! – a misnomer

Page 17: Biomedical Ontologies

17

Missing GO nodes…

not deep enough…not broad enough…

Page 18: Biomedical Ontologies

18

Lack of connections among GOs

Estrogen receptor

Page 19: Biomedical Ontologies

19

what cellular component?

what molecular function?

what biological process?

GO: A Common Standard for Omics Data Analysis

Page 20: Biomedical Ontologies

20

need more…• need to improve the quality of GO to support more

rigorous logic-based reasoning across the data annotated in its terms

• need to extend the GO by engaging ever broader community support for addition of new terms and for correction of errors

• need to extend the methodology to other domains, including clinical domains, such as: – disease ontology– immunology ontology– symptom (phenotype) ontology– clinical trial ontology– ...

Page 21: Biomedical Ontologies

21

• Establish common rules governing best practices for creating ontologies and for using these in annotations

• Apply these rules to create a complete suite of orthogonal interoperable biomedical reference ontologies

National Center for Biomedical

Ontology (NCBO)

http://www.obofoundry.org/

http://bioontology.org/

Page 22: Biomedical Ontologies

22

http://www.obofoundry.org/index.cgi?sort=domain&show=ontologies

Page 23: Biomedical Ontologies

23

• tight connection to the biomedical basic sciences• compatibility, interoperability, common relations• support for logic-based reasoning

OBO Foundry = a subset of OBO ontologies, whose developers have agreed in advance to accept a common set of principles reflecting best practice in ontology development designed to ensure:

The OBO Foundry

• A family of interoperable gold standard biomedical reference ontologies to serve annotation of:

– scientific literature

– model organism databases

– clinical trial data …

OBO Foundry Principles: http://www.obofoundry.org/crit.shtml

Page 24: Biomedical Ontologies

24

CONTINUANT OCCURRENT

INDEPENDENT DEPENDENT

ORGAN ANDORGANISM

Organism(NCBI

Taxonomy)

Anatomical Entity(FMA, CARO)

OrganFunction

(FMP, CPRO) Phenotypic

Quality(PaTO)

Biological Process

(GO)CELL AND CELLULAR

COMPONENT

Cell(CL)

Cellular Componen

t(FMA, GO)

Cellular Function

(GO)

MOLECULEMolecule

(ChEBI, SO,RNAO, PRO)

Molecular Function(GO)

Molecular Process

(GO)

Rationale of OBO Foundry coverage

GRANULARITY

RELATION TO TIME

Page 25: Biomedical Ontologies

25

Foundational is_apart_of

Spatial located_incontained_inadjacent_to

Temporal transformation_ofderives_frompreceded_by

Participation has_participanthas_agent

OBO Relation Ontology

e.g.: A is_a B =def. every instance of A is an instance of B

“rose is_a plant all instances of rose is_a plant”

Page 26: Biomedical Ontologies

26

What is Protein Ontology? Why?

PRO

http://pir.georgetown.edu/pro/

Page 27: Biomedical Ontologies

27

27

PRotein Ontology (PRO) PRO in OBO Foundry to represent protein entities

Ontology for Protein Evolution (ProEvo) for evolutionary classes of proteins: captures the protein classes reflecting evolutionary relationships at full-length protein levels. It is meant to indicate the relatedness of proteins, not their evolution. Use is_a relationship.

Ontology for Protein Forms (ProForm) for multiple protein forms of a gene: captures the different protein forms of a specific gene from genetic variations, alternative splicing, cleavage, post-translational modifications. Relations used

is_a or derives_from

Page 28: Biomedical Ontologies

28

Why PRO?– Allows specification of relationships between PRO and other ontologies, such as

GO and Disease Ontology– Provides a structure to support formal, computer-based inferences based on shared

attributes among homologous proteins– Provides a stable unique identifier to any protein type

PRO:000000650 Smad2 isoform 1 phosphorylated 1 (phosphorylated SSxS motif) NOT has_function GO:0003677:DNA binding

PRO:000000656 Smad2 isoform 2 phosphorylated 1 (phosphorylated SSxS motif) has_function GO:0003677 DNA binding

Provides formalization and precise annotation of specific protein forms/ classes, allowing accurate and consistent data mapping, integration and analysis: for dendritic cell ontology or pathway[Term] http://www.biomedcentral.com/1471-2105/10/70id: DC_CL:0000003name: conventional dendritic celldef: "conventional dendritic cell is_a leukocyte that has_high_plasma_membrane_amount_relative_to_leukocyte CD11c and lacks_plasma_membrane_part CD19, CD3, C34, and CD56." [AMM:amm]comment: Immunological Reviews 2007 219: 118-142intersection_of: CL:0000738 ! leukocyteintersection_of: has_high_plasma_membrane_amount_relative_to_leukocyte PRO:000001013 ! CD11cintersection_of: lacks_plasma_membrane_part DC_CL:0000072 ! CD3intersection_of: lacks_plasma_membrane_part PRO:000001002 ! CD19intersection_of: lacks_plasma_membrane_part PRO:000001003 ! CD34intersection_of: lacks_plasma_membrane_part PRO:000001024 ! CD56

Page 29: Biomedical Ontologies

29

II ITGF-beta receptor

Smad 4

Nucleus

DNA binding and transcription regulation

MAPKKK

Shc S PY P

S PY P

Y PS P

Y P

S PS PT P

XIAP

P38 MAPKpathway

JNKcascade

II IS P

Y PY P

S PY P

S PS PT P

Smad 7

STRAP

TAK1K U

Degradation

TAK1

Smad 2

S PS P

Ski

Shc

Smad 2

Smad 4

Smad 2

S PS P

Smad 4

Smad 2

S PS P

Smad 4

Smad 2

S PS P

X

ERK1/2

CaM

Smad 2

S PS P

X P

Smad 2

TPSPSPSPSP

TP

Furin

TGF- TGF-

TGF-LAP

Cytoplasm

Ca2+Growth signal

s

S P

T P

Y P

K U

Phosphorylation (P) at Serine (S),

Threonine (T)

Tyrosine (Y)Ubiquitination (U) at Lysine (K)

TGF-beta signaling – comparison between PID and Reactome

Growth signal

s

Only included in Reactome

Common in both Reactome & PID

* All others are in PID. Not all components in the pathway from both databases are listed

Stress signal

s

MEKK1

T PT P

Smad 2

S PS P

X P

X

PRO:000000523

PRO:000000650

PRO:000000651

PRO:000000410

PRO:000000616

PRO:000000652

PRO:000000366

PRO:000000481

PRO:000000618

PRO:000000397

PRO:000000650PRO:000000366

PRO:000000650PRO:000000366

PRO:000000468

Page 30: Biomedical Ontologies

30

The Need for Representation of Various Proteins Forms

Human PRLR

and PTMs…

Glucocorticoid receptor (GR)

Page 31: Biomedical Ontologies

31

Sphingomyelin phosphodiesterase (SMPD1) (ASM_HUMAN)

• Cleavage sites:– lysosomal: the enzyme is transported from the Golgi apparatus

to the lysosome after additions of mannose-6-phosphate moieties (M6P) and binding to M6P receptor.

– secreted: the shorter cleaved form is not modified with M6P and is targeted for secretion to the extracellular space, with different functions such as LDL binding and oxidized LDL catabolism.

M6P

lysosome

Extracelluar, e.g. LDL binding

Page 32: Biomedical Ontologies

32

GOA for Transcription factor Ovo-like 2Form 1 - long: GO:0045892 IDA - negative regulation of transcription, DNA-dependentForm 2 – short: GO:0045893 IDA - positive regulation of transcription, DNA-dependent

274 aa

OVOL2_MOUSE (Q8CIV7)

- Gene. 2004 336:47-58. PMID:15225875

Page 33: Biomedical Ontologies

33

Human

Chimp

Mou

se

E.coli

Fly Yeast

Wor

mRat B.su

btilis

Implications of Protein Evolution

Common ancestor

• Conclusions from experiments performed on proteins from one organism are often applicable to the homologous protein from another organism.

• Information learned about existing proteins allows us to infer the properties of ancestral proteins.

Page 34: Biomedical Ontologies

34

Functional convergence

• Protein classes of the same function derived from different evolutionary origins, e.g. carbonate dehydratase (or carbonic anhydrase EC 4.2.1.1), which has three independent gene families with functional convergence.

Animal and prokaryotic type

Plant and prokaryotic type

Archaea type

Page 35: Biomedical Ontologies

35

Functional divergence

TGM3 (Human)

TGM3 (Mouse)

EPB42 (Human)

EPB42 (Mouse)

Gene Duplication (TGM3/EPB42 split) Speciation (Human/mouse split)

TGM3 (Human)

TGM3 (Mouse)

EPB42 (Human)

EPB42 (Mouse)

TGM3 branch

EPB42 branch

Human

Mouse

Human

Mouse

TGM3 = Protein-glutamine gamma-glutamyltransferase(Transglutaminase; involved in protein modification)

EBP42 = Erythrocyte membrane protein band 4.2(Constituent of cytoskeleton; involved in cell shape)

Page 36: Biomedical Ontologies

36

ProEvo

ProForm

GO Gene Ontology

molecular function

cellular component

biological process

participates_in

part_of (for complexes) located_in (for compartments)

has_function

PROprotein

Root Levelis_a

translation product of an evolutionarily-related gene

translation product of a specific mRNA

Family-Level Distinction• Derivation: common ancestor• Source: PIRSF family

Modification-Level Distinction• Derived from post-translational modification• Source: UniProtKB

Sequence-Level Distinction• Derivation: specific allele or splice variant• Source: UniProtKB

cleaved/modified translation product disease

OMIM Disease

agent_in

is_a

protein modification

has_modification

PSI-MOD Modification

SO Sequence Ontology

sequence change

has_agent (sequence change) agent_of (effect on function)

derives_from

protein domain

has_part

Pfam Domain

Example:

TGF-beta receptor phosphorylated smad2 isoform1

is a phosphorylated smad2 isoform1

derives_from smad2 isoform 1

is a smad2

is a TGF- receptor-regulated smad

is a smad

is a protein

Modification Level

Sequence Level

Family Level

Root Level

translation product of a specific geneGene-Level Distinction• Derivation: specific gene• Sources: PIRSF subfamily, Panther subfamily

is_a

Gene Level

Protein Ontology (PRO) http://pir.georgetown.edu/pro/

Page 37: Biomedical Ontologies

37

Cellular Component:- nucleus- cytoplasm

Molecular Function:- protein binding

Biological Process:- signal transduction- regulation of transcription, DNA-dependent

Mothers against decapentaplegic homolog 2

Smad 2

GO annotation of SMAD2_HUMAN:

Page 38: Biomedical Ontologies

38

II I

TGF-TGF-beta receptor

PP Smad 4

4 DNA binding

1 phosphorylation

2 complex formation

Nucleus

Cytoplasm

Smad 2

Smad 2

PPSmad 2

Smad 4

Transcription Regulation

PPSmad 2

Smad 4

3 nuclear translocation

PPSmad 2PP

P

++

ERK1

CAMK2

PP

Page 39: Biomedical Ontologies

39

“normal” •Cytoplasmic PRO:00000011

TGF- receptor phosphorylated

•Forms complex•Nuclear•Txn upregulation

PRO:00000013

ERK1 phosphorylated •Forms complex•Nuclear•Txn upregulation++

PRO:00000014

CAMK2 phosphorylated

•Forms complex•Cytoplasmic•No Txn upregulation

PRO:00000015

alternatively spliced short form

•Cytoplasmic

PRO:00000016

phosphorylated short form

•Nuclear•Txn upregulation PRO:00000018

point mutation (causative agent: large intestine carcinoma)

•Doesn’t form complex•Cytoplasmic•No Txn upregulation

PRO:00000019

Smad 2

PPSmad 2

PPSmad 2P

PPSmad 2 P

Smad 2 x

Smad 2

Smad 2 PP

SMAD2_HUMAN

SMAD2_HUMAN

SMAD2_HUMAN

SMAD2_HUMAN

SMAD2_HUMAN

SMAD2_HUMAN

SMAD2_HUMAN

Smad2 gene products Forms Location ID

Page 40: Biomedical Ontologies

40

Search:smad2 -> 21 protein terms

http://pir.georgetown.edu/cgi-bin/pro/textsearch_pro

Page 41: Biomedical Ontologies

41

Root:Protein

PRO hierarchy for smad2 isoform 2

Page 42: Biomedical Ontologies

42

Page 43: Biomedical Ontologies

43

PRO entry report shows the ontology and the annotation

Page 44: Biomedical Ontologies

44

Page 45: Biomedical Ontologies

45

Summary• The vision of the biomedical ontology community is that all

biomedical knowledge and data are disseminated on the Internet using principled ontologies, such that they are semantically interoperable and useful for improving biomedical science and clinical care.

• The scope extends to all knowledge and data that is relevant to the understanding or improvement of human biology and health.

• Knowledge and data are semantically interoperable when they enable predictable, meaningful, computation across knowledge sources developed independently to meet diverse needs.

• Principled ontologies are ones that follow NCBO-recommended formats and methodologies for ontology development, maintenance, and use.