enhancing ontologies through annotations

Page 1: Enhancing Ontologies Through Annotations

Enhancing Ontologies Through Annotations

Training Course in Biomedical Ontology

Olivier Bodenreider

Lister Hill National Center for Biomedical Communications
Bethesda, Maryland - USA

May 22, 2006

Schloß Dagstuhl

Page 2: Enhancing Ontologies Through Annotations

Outline

Dependence relations in MeSH and co-occurrence in MEDLINE
Identifying associative relations in the Gene Ontology
Linking the Gene Ontology to other biological ontologies: GO-ChEBI

Page 3: Enhancing Ontologies Through Annotations

Using Dependence Relations in MeSH as a Framework for the Analysis of Disease Information in Medline

Page 4: Enhancing Ontologies Through Annotations

Acknowledgments

Lowell Vizenor
National Library of Medicine, USA

Page 5: Enhancing Ontologies Through Annotations

Relations among biomedical entities

Symbolic relations
  Represented in biomedical terminologies/ontologies
  Explicit semantics
    Hierarchical (isa, part of)
    Associative (location of, causes, …)
Statistical relations
  Represented in text
    Among lexical items (entity recognition)
    Annotations
  No explicit semantics

Page 6: Enhancing Ontologies Through Annotations

Example: Viral meningitis

Viral meningitis isa CNS disease
Viral meningitis isa Infectious disease
Viral meningitis located in Meninges
Viral meningitis caused by Virus
Viral meningitis treated by Antiviral agents

[Figure also shows: Herpesviridae Infections, T-Lymphocytes, with the values 18 and 9]

Page 7: Enhancing Ontologies Through Annotations

Statistical relations

Crucial for text mining applications
  Entity recognition
  Frequency of co-occurrence
No semantics
  Frequency of co-occurrence used as an indicator of the salience of the relation

Page 8: Enhancing Ontologies Through Annotations

An example from MEDLINE

Hurwitz JL, Korngold R, Doherty PC. Specific and nonspecific T-cell recruitment in viral meningitis: possible implications for autoimmunity. Cell Immunol. 1983 Mar;76(2):397-401.

Specific and nonspecific T-cell invasion into cerebrospinal fluid has been investigated in the nonfatal viral meningoencephalitis induced by intracerebral inoculation of mice with vaccinia virus. At the peak of the inflammatory process on Day 7 approximately 5 to 10% of the Lyt 2+ T cells present are apparently specific for vaccinia virus. Concurrently, in mice primed previously with influenza virus, 0.5 to 1.0% of the appropriate T-cell set located in cerebrospinal fluid is reactive to influenza-infected target cells. This vaccinia virus-induced inflammatory exudate may thus contain as many as 500 influenza-immune memory T cells. These findings are discussed from the aspect that such nonspecific T-cell invasion into the central nervous system during aseptic viral meningitis could result in exposure of potentially brain-reactive T cells to central nervous system components. PMID: 6601524

Brain/immunology
Cytotoxicity, Immunologic
Exudates and Transudates/cytology
Exudates and Transudates/immunology
Meningitis, Viral/immunology*
T-Lymphocytes/immunology*
Vaccinia virus

Animals
Humans
Mice
Research Support, Non-U.S. Gov't
Research Support, U.S. Gov't, P.H.S.

Page 9: Enhancing Ontologies Through Annotations

Ontological analysis

Formal ontological distinction
  Dependence relations
    Every instance of a class is related to some instances of another class
    A is ontologically dependent on B if and only if: if A exists, then B exists
  Contingent relations
    Only some instances of a class are related to some instances of another class

Examples
  CNS disease, CNS
  Viral meningitis, Virus
  aspirin, headache

Page 10: Enhancing Ontologies Through Annotations

Statistical vs. ontological

Can we use formal ontology to help analyze statistical relations?
What is the relation between ontological and statistical relations?
Hypothesis: Correspondence between
  Dependence relations (ontological)
  High frequencies of co-occurrence (statistical)
Dependence relations ↔ Systematically high frequencies of co-occurrence

Page 11: Enhancing Ontologies Through Annotations

More formal-ontological distinctions

Continuants: continue to exist through time
  Require the existence of any other entity in order to exist?
    No: Independent continuants
      Oxygen, B-lymphocyte, Mitral valve, Streptococcus
    Yes: Dependent continuants
      Oxygen transporter
Occurrents: unfold through time in successive phases
  Oxygen transport, Glucose metabolism, Echocardiography

Page 12: Enhancing Ontologies Through Annotations

Application to diseases

Diseases are (mostly) processes, i.e., occurrents
Diseases are dependent entities
Diseases depend on independent continuants
  Anatomical structures
    Classification by “location” (body system)
  Agents (pathogens)
    Classification by etiology

Page 13: Enhancing Ontologies Through Annotations

Participation relation

Participation relations are dependence relations
  Between processes and biomedical continuants
  Passive participation: has_participant
    Viral meningitis has_participant Meninges
  Active participation: has_agent
    Viral meningitis has_agent Virus
Defined at the instance level but can be adapted at the class level

Page 14: Enhancing Ontologies Through Annotations

Statistical relations

Independent events
  P(E1 ∩ E2) = P(E1) · P(E2)
Tests of independence
  χ² test
  G² test (likelihood ratio test)
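For reference, the two test statistics named on this slide can be written out for a contingency table. The notation below (observed counts O_ij, expected counts E_ij under independence, grand total N) is an assumption introduced here and does not appear on the slides.

```latex
E_{ij} = \frac{\bigl(\sum_{j'} O_{ij'}\bigr)\bigl(\sum_{i'} O_{i'j}\bigr)}{N},
\qquad
\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad
G^2 = 2 \sum_{i,j} O_{ij} \,\ln\frac{O_{ij}}{E_{ij}}
```

For a 2×2 table, both statistics are compared against a χ² distribution with one degree of freedom.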

Page 15: Enhancing Ontologies Through Annotations

Objectives

Analyze dependence relations in MeSH and compare them to statistical relations obtained from co-occurrence data
  Restricted to the relations between disease categories and other categories of biomedical interest
Hypothesis: Co-occurrence relations between diseases and other categories
  Highest proportion for the dependent category, systematically across diseases
  Smaller proportions for other non-dependent categories

Page 16: Enhancing Ontologies Through Annotations

Materials

Page 17: Enhancing Ontologies Through Annotations

Medical Subject Headings (MeSH)

Controlled vocabulary used to index MEDLINE
22,658 descriptors (2004 version)
16 tree-like hierarchies
  Anatomy
  Organisms
  Diseases
  …

Page 18: Enhancing Ontologies Through Annotations

MEDLINE

385,491 citations (year 2004)
  Indexed with 20,085 distinct MeSH descriptors
Restrictions
  Starred descriptors only (3.5 / citation, on average)
  Frequency of co-occurrence ≥ 10
  Associations between diseases and other categories

Page 19: Enhancing Ontologies Through Annotations

Methods and Results

Page 20: Enhancing Ontologies Through Annotations
Page 21: Enhancing Ontologies Through Annotations

Identifying dependence relations

Manual examination of the 23 top-level disease categories [C tree]
  Exceptions
    Pathological conditions, signs and symptoms (C23)
  Additions
    Mental disorders (F03)
Identify the categories in (active or passive) participation relation with the process

Page 22: Enhancing Ontologies Through Annotations

Identifying dependence relations: Results

has_participant

Page 23: Enhancing Ontologies Through Annotations

Identifying dependence relations: Results

has_agent

Page 24: Enhancing Ontologies Through Annotations

Identifying statistical relations

Aggregation at the level of top-level categories
  Viral meningitis (C02, C10)
  Meninges (A08)
  Resulting category pairs: (C02, A08) and (C10, A08)
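The aggregation step can be sketched as follows. This is a minimal illustration that assumes each descriptor comes with its MeSH tree numbers and that the top-level category is the part of a tree number before the first dot. The function names are hypothetical, and the tree numbers in the usage note are placeholders that were not checked against MeSH.

```python
from itertools import product


def top_level_categories(tree_numbers):
    """Reduce tree numbers such as 'C10.228' to their top-level category ('C10')."""
    return sorted({tn.split(".")[0] for tn in tree_numbers})


def category_pairs(disease_tree_numbers, other_tree_numbers):
    """All (disease category, other category) pairs for one co-occurring descriptor pair."""
    return list(product(top_level_categories(disease_tree_numbers),
                        top_level_categories(other_tree_numbers)))
```

Calling `category_pairs(["C02.182", "C10.228"], ["A08.186"])` yields the two pairs shown on the slide, so one co-occurrence of the two descriptors is counted once for each category pair.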

Page 25: Enhancing Ontologies Through Annotations

Identifying statistical relations

Contingency table

                            Indexed with term B
                            Yes      No
Indexed with term A   Yes   n_AB     n_Ab
                      No    n_aB     n_ab

Testing independence
  G² test (likelihood ratio test)
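As a rough sketch of how the four cells could be filled, the function below counts citations by whether they are indexed with term A, term B, both, or neither. The input shape (one set of descriptors per citation) and the function name are assumptions made for illustration.

```python
def contingency_table(citations, term_a, term_b):
    """Return (n_AB, n_Ab, n_aB, n_ab) for two descriptors.

    `citations` is an iterable of sets of descriptors, one set per citation
    (an assumed input shape, not taken from the slides).
    """
    n_AB = n_Ab = n_aB = n_ab = 0
    for descriptors in citations:
        has_a = term_a in descriptors
        has_b = term_b in descriptors
        if has_a and has_b:
            n_AB += 1      # indexed with both A and B
        elif has_a:
            n_Ab += 1      # indexed with A only
        elif has_b:
            n_aB += 1      # indexed with B only
        else:
            n_ab += 1      # indexed with neither
    return n_AB, n_Ab, n_aB, n_ab
```

The four counts always sum to the number of citations, which is the total used when computing expected frequencies for the independence test.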

Page 26: Enhancing Ontologies Through Annotations

Identifying statistical relations: Results

Quantitative results
  25,376 pairs of co-occurring descriptors
  All but 68 of these statistically significant (G² test)
  7,896 pairs with frequency of co-occurrence ≥ 10
  6,525 between diseases and other categories

Page 27: Enhancing Ontologies Through Annotations

Identifying statistical relations: Results

[Figure: 23 disease top-level categories against 88 other top-level categories; value shown: 45.75]

Page 28: Enhancing Ontologies Through Annotations

Identifying statistical relations: Results

Qualitative results (1)
  Generally one top-level category of the Anatomy and Organisms trees accounting for the highest frequency of co-occurrence for a given disease
    Cardiovascular Diseases → Cardiovascular System
  Exceptions
    Neoplasms [C04]
    Congenital, Hereditary, and Neonatal Diseases and Abnormalities [C16]
    Endocrine Diseases [C19]
    Immunologic Diseases [C20]

Page 29: Enhancing Ontologies Through Annotations

Identifying statistical relations: Results

Qualitative results (2)
  Most Anatomy and Organisms categories are preferentially associated with one disease category
    Cardiovascular System → Cardiovascular Diseases
  Categories other than Anatomy and Organisms tend not to be associated with one particular disease category (contingent rather than dependent relations)
    Pathological Conditions, Signs and Symptoms [C23]
    Amino Acids, Peptides, and Proteins [D12]
    Diagnosis [E01]
    Therapeutics [E02]
    Surgical Procedures, Operative [E04]

Page 30: Enhancing Ontologies Through Annotations

Discussion

Page 31: Enhancing Ontologies Through Annotations

Applications

To semantic mining
  Formal ontological analysis of relations provides a useful framework for elucidating statistical associations
To terminology creation and maintenance
  Most terminologies do not represent trans-ontological relations explicitly
  Concepts in dependence relation should not be modified independently of each other

Page 32: Enhancing Ontologies Through Annotations

Summary

We have studied statistical associations between MeSH terms co-occurring in MEDLINE citations
We have shown that the ontological relation of dependence is generally corroborated by a strong, systematic statistical association
These techniques
  Provide a framework for semantic mining of diseases
  Can help maintain terminologies

Page 33: Enhancing Ontologies Through Annotations

References

Vizenor L, Bodenreider O. Using dependence relations in MeSH as a framework for the analysis of disease information in Medline. Proceedings of the Second International Symposium on Semantic Mining in Biomedicine (SMBM-2006) 2006:76-83.
http://mor.nlm.nih.gov/pubs/pdf/2006-smbm-lv.pdf

Page 34: Enhancing Ontologies Through Annotations

Non-lexical Approaches to Identifying Associative Relations in the Gene Ontology

Page 35: Enhancing Ontologies Through Annotations

Acknowledgments

Marc Aubry
UMR 6061 CNRS, Rennes, France

Anita Burgun
University of Rennes, France

Page 36: Enhancing Ontologies Through Annotations

Gene Ontology

Annotate gene products
Coverage
  Molecular functions
  Cellular components
  Biological processes
Explicit relations to other terms within the same hierarchy
No (explicit) relations
  To terms across hierarchies
  To concepts from other biological ontologies

Page 37: Enhancing Ontologies Through Annotations

Gene Ontology

Molecular functions
Cellular components
Biological processes

BP: metal ion transport
MF: metal ion transporter activity

Page 38: Enhancing Ontologies Through Annotations

Motivation

Richer ontology
  Beyond hierarchies
Easier to maintain
  Explicit dependence relations
More consistent annotations
  Quality assurance
  Assisted curation

Page 39: Enhancing Ontologies Through Annotations

Related work

Ontologizing GO
  GONG [Wroe & al., PSB 2003]
Identifying relations among GO terms across hierarchies
  Lexical approach [Ogren & al., PSB 2004-2005]
  Non-lexical approaches [Bodenreider & al., PSB 2005]
Identifying relations between GO terms and OBO terms
  ChEBI [Burgun & al., SMBM 2005]
Representing relations among GO terms and between GO terms and OBO terms
  Obol [Mungall, CFG 2005]
See also: [Bada & al., 2004], [Kumar & al., 2004], [Dolan & al., 2005]

Page 40: Enhancing Ontologies Through Annotations

GO and annotation databases

5 model organisms
  FlyBase
  GOA-Human
  MGI
  SGD
  WormBase
270,000 gene-term associations

Gene     GO term       Evidence code
Brca1    GO:0000793    IDA
Brca1    GO:0003677    IEA
Brca1    GO:0003684    IDA
Brca1    GO:0003723    ISS
Brca1    GO:0004553    ISS
Brca1    GO:0005515    ISS, TAS
Brca1    GO:0005622    IEA
Brca1    GO:0005634    IDA, ISS
Brca1    …

Page 41: Enhancing Ontologies Through Annotations

Three non-lexical approaches

All based on annotation databases
  Similarity in the vector space model
  Statistical analysis of co-occurring GO terms
  Association rule mining

Page 42: Enhancing Ontologies Through Annotations

Similarity in the vector space model

[Figure: the annotation database shown as two matrices, one with GO terms (t1, t2, … tn) as rows and genes (g1, g2, … gn) as columns, the other with genes as rows and GO terms as columns]

Page 43: Enhancing Ontologies Through Annotations

Similarity in the vector space model

[Figure: rows ti and tj of the GO terms × genes matrix are compared to fill a GO terms × GO terms similarity matrix]

Sim(ti, tj) = ti · tj
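The dot-product similarity can be sketched in a few lines. The slide gives only Sim(ti, tj) = ti · tj; the use of binary (0/1) vectors, the input shape, and the function names below are assumptions made for illustration, and any weighting or normalization used in the actual study is not shown here.

```python
def term_vectors(annotations, genes):
    """Build one binary vector per GO term over a fixed gene order.

    `annotations` maps each gene to the set of GO terms annotating it
    (an assumed input shape).
    """
    terms = sorted({t for ts in annotations.values() for t in ts})
    return {t: [1 if t in annotations.get(g, set()) else 0 for g in genes]
            for t in terms}


def similarity_matrix(vectors):
    """Sim(ti, tj) = ti . tj for every ordered pair of terms."""
    return {(ti, tj): sum(x * y for x, y in zip(vi, vj))
            for ti, vi in vectors.items()
            for tj, vj in vectors.items()}
```

With binary vectors, the dot product of two terms is simply the number of genes annotated with both, so the diagonal of the matrix holds each term's own annotation count.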

Page 44: Enhancing Ontologies Through Annotations

Analysis of co-occurring GO terms

[Figure: annotation database as a matrix with genes (g1, g2, … gn) as rows and GO terms (t1, t2, … tn) as columns; the terms annotating each gene are listed, then pairs of terms and single terms are counted]

Terms per gene
  g2: t2 t7 t9
  …: t3 t7 t9
  g5: t5

Pair counts
  t2-t7: 1
  t2-t9: 1
  t7-t9: 2
  …

Term counts
  t5: 1
  t7: 2
  t9: 2
  …
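The counting step illustrated on this slide can be sketched as follows. The input shape (a mapping from gene to its GO terms) and the function name are assumptions; the gene label `g3` in the usage note is hypothetical, standing in for the unnamed row of the figure.

```python
from collections import Counter
from itertools import combinations


def count_terms_and_pairs(annotations):
    """Count how many genes carry each GO term and each unordered pair of GO terms."""
    term_counts = Counter()
    pair_counts = Counter()
    for terms in annotations.values():
        unique = sorted(set(terms))                  # ignore repeated annotations
        term_counts.update(unique)
        pair_counts.update(combinations(unique, 2))  # each pair once per gene
    return term_counts, pair_counts
```

For the toy input `{"g2": ["t2", "t7", "t9"], "g3": ["t3", "t7", "t9"], "g5": ["t5"]}`, the pair (t7, t9) is counted twice and the pairs (t2, t7) and (t2, t9) once each, in line with the counts on the slide. These frequencies are the raw material for the independence test.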

Page 45: Enhancing Ontologies Through Annotations

Analysis of co-occurring GO terms

Statistical analysis: test independence
  Likelihood ratio test (G²)
  Chi-square test (Pearson’s χ²)
Example from GOA (22,720 annotations)
  GO:0006955 [BP] immune response: Freq. = 588
  GO:0008009 [MF] chemokine activity: Freq. = 53
  Co-oc. = 46

                                  chemokine activity (GO:0008009)
                                  present    absent     total
immune response       present          46       542       588
(GO:0006955)          absent            7    22,125    22,132
                      total            53    22,667    22,720

G² = 298.7, p < 0.001
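The likelihood ratio statistic for this example can be recomputed with a short sketch. The four cells are derived here from the totals stated in the example (frequencies 588 and 53, 46 co-occurrences, 22,720 annotations); the function name is hypothetical, and the formula is the standard G² = 2 Σ O ln(O/E) rather than code from the study.

```python
import math


def g2_statistic(n_AB, n_Ab, n_aB, n_ab):
    """Likelihood ratio statistic G^2 = 2 * sum(O * ln(O / E)) for a 2x2 table."""
    observed = [[n_AB, n_Ab], [n_aB, n_ab]]
    total = n_AB + n_Ab + n_aB + n_ab
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # an empty cell contributes nothing to the sum
                e = row_totals[i] * col_totals[j] / total  # expected under independence
                g2 += o * math.log(o / e)
    return 2.0 * g2


# Cells derived from the stated totals (assumption: 46 annotations carry both terms).
n_AB = 46
n_Ab = 588 - 46                  # immune response without chemokine activity
n_aB = 53 - 46                   # chemokine activity without immune response
n_ab = 22720 - 588 - 53 + 46     # neither term
print(round(g2_statistic(n_AB, n_Ab, n_aB, n_ab), 1))
```

The result is close to the G² value reported on the slide, which supports reading the example as a 2×2 table over the 22,720 annotations.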

Page 46: Enhancing Ontologies Through Annotations

Association rule mining

[Figure: annotation database as a matrix with genes (g1, g2, … gn) as rows and GO terms (t1, t2, … tn) as columns; the set of terms annotating one gene, e.g. g2: t2 t7 t9, forms a transaction]

apriori
  Rules: t1 => t2
  Confidence: > .9
  Support: .05
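To make support and confidence concrete, here is a minimal sketch restricted to rules between two single terms. It is not the apriori algorithm itself, which also prunes candidate itemsets level by level. Treating .05 as a minimum support threshold is an assumption, and the function name is hypothetical.

```python
from itertools import permutations


def pairwise_rules(transactions, min_support=0.05, min_confidence=0.9):
    """Return rules (antecedent, consequent, support, confidence) between single items.

    support(a => b)    = fraction of transactions containing both a and b
    confidence(a => b) = fraction of transactions containing a that also contain b
    """
    sets = [set(t) for t in transactions]
    n = len(sets)
    items = sorted(set().union(*sets)) if sets else []
    count = {i: sum(1 for s in sets if i in s) for i in items}
    rules = []
    for a, b in permutations(items, 2):
        both = sum(1 for s in sets if a in s and b in s)
        support = both / n
        confidence = both / count[a]
        if support >= min_support and confidence > min_confidence:
            rules.append((a, b, support, confidence))
    return rules
```

Rules are directional: if every gene annotated with t1 also carries t2 but not the reverse, only t1 => t2 passes the confidence threshold.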

Page 47: Enhancing Ontologies Through Annotations

Examples of associations

Vector space model
  MF: ice binding
  BP: response to freezing
Co-occurring terms
  MF: chromatin binding
  CC: nuclear chromatin
Association rule mining
  MF: carboxypeptidase A activity
  BP: proteolysis and peptidolysis

Page 48: Enhancing Ontologies Through Annotations

Associations identified

          VSM     COC     ARM     LEX
MF-CC     499     893     362     917
MF-BP    3057    1628     577    2523
CC-BP     760    1047     329    2053
Total    4316    3568    1268    5493

VSM: vector space model; COC: co-occurring terms; ARM: association rule mining; LEX: lexical approach

7665 by at least one approach

Page 49: Enhancing Ontologies Through Annotations

Associations identified: VSM

MF-CC 499, MF-BP 3057, CC-BP 760, Total 4316

Example
  MF: ice binding
  BP: response to freezing

Page 50: Enhancing Ontologies Through Annotations

Associations identified: COC

MF-CC 893, MF-BP 1628, CC-BP 1047, Total 3568

Example
  MF: chromatin binding
  CC: nuclear chromatin

Page 51: Enhancing Ontologies Through Annotations

Associations identified: ARM

MF-CC 362, MF-BP 577, CC-BP 329, Total 1268

Example
  MF: carboxypeptidase A activity
  BP: proteolysis and peptidolysis

Page 52: Enhancing Ontologies Through Annotations

Limited overlap among approaches

Lexical vs. non-lexical
  [Figure: 7665 associations from non-lexical approaches, 5493 from the lexical approach, 230 in common]
Among non-lexical
  [Figure: overlap of VSM, COC and ARM; region counts 3689, 2587, 453, 305, 309, 121, 201]

Page 53: Enhancing Ontologies Through Annotations

References

Bodenreider O, Aubry M, Burgun A. Non-lexical approaches to identifying associative relations in the Gene Ontology. In: Altman RB, Dunker AK, Hunter L, Jung TA, Klein TE, editors. Pacific Symposium on Biocomputing 2005: World Scientific; 2005. p. 91-102.
http://mor.nlm.nih.gov/pubs/pdf/2005-psb-ob.pdf

Page 54: Enhancing Ontologies Through Annotations

Linking the Gene Ontology to other biological ontologies

Page 55: Enhancing Ontologies Through Annotations

Acknowledgments

Anita Burgun
University of Rennes, France

Page 56: Enhancing Ontologies Through Annotations

Related domains

Related domain, with an example GO term that refers to it
  Organisms: cytosolic ribosome (sensu Eukaryota)
  Cell types: T-cell activation
  Physical entities
    Gross anatomy: brain development
    Molecules: transferrin receptor activity
  Functions
    Organism functions: visual perception
    Cell functions: T-cell activation
  Pathology: regulation of blood pressure

Page 57: Enhancing Ontologies Through Annotations

GO and other domains

            Physical entity        Function               Process
Organism    Gross anatomy          Organism functions     Organism processes
Cell        Cellular components    Cellular functions     Cellular processes
Molecule    Molecules              Molecular functions    Molecular processes

[adapted from B. Smith]

Page 58: Enhancing Ontologies Through Annotations

GO and other domains (revisited)

Rows ordered by resolution

Physical whole    Physical part           Function               Process
Organism          Organism components     Organism functions     Organism processes
Cell              Cellular components     Cellular functions     Cellular processes
Molecule          Molecular components    Molecular functions    Molecular processes

[adapted from B. Smith]

Page 59: Enhancing Ontologies Through Annotations

Biological ontologies (OBO)

Page 60: Enhancing Ontologies Through Annotations

GO and other domains (revisited)

[Same table as on Page 58, adapted from B. Smith]

Page 61: Enhancing Ontologies Through Annotations

Integrating biological ontologies

[Figure: diagram with the nodes Cellular components, Molecular functions, Biological processes, Molecule, Cell, Organism, and Other OBO ontologies]

Page 62: Enhancing Ontologies Through Annotations

Linking GO to ChEBI

Page 63: Enhancing Ontologies Through Annotations

ChEBI

Member of the OBO family
Ontology of Chemical Entities of Biological Interest
  Atom
  Molecule
  Ion
  Radical
10,516 entities
27,097 terms
[Dec. 22, 2004]

Page 64: Enhancing Ontologies Through Annotations

Methods

Every ChEBI term searched in every GO term
Maximize precision
  Ignored ChEBI terms of 3 characters or less
  Proper substring
Maximize recall
  Case insensitive matches
  Normalized ChEBI names (generated singular forms from plurals)
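
The matching procedure on this slide can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names `singularize` and `match_terms` are hypothetical, the plural rule (strip one trailing "s") is a stand-in assumption for whatever normalization was actually used, and "proper substring" is read here as "contained in the GO name but not identical to it". The sample data are the ChEBI and GO entries shown on the Examples slide.

```python
def singularize(name: str) -> str:
    """Naive plural-to-singular rule (an assumption, not the original normalizer)."""
    if name.endswith("s") and not name.endswith("ss"):
        return name[:-1]
    return name


def match_terms(chebi: dict[str, str], go: dict[str, str]) -> list[tuple[str, str]]:
    """Return (ChEBI id, GO id) pairs where a ChEBI name occurs inside a GO name."""
    pairs = []
    for chebi_id, chebi_name in chebi.items():
        lowered = chebi_name.lower()  # recall: case-insensitive matching
        # Precision: ignore any form of 3 characters or less.
        forms = {f for f in (lowered, singularize(lowered)) if len(f) > 3}
        for go_id, go_name in go.items():
            target = go_name.lower()
            # Proper substring: contained in the GO name but not equal to it.
            if any(f in target and f != target for f in forms):
                pairs.append((chebi_id, go_id))
    return pairs


chebi = {
    "CHEBI:18248": "iron",
    "CHEBI:27252": "uronic acid",
    "CHEBI:27594": "carbon",
}
go = {
    "GO:0006826": "iron ion transport",
    "GO:0008382": "iron superoxide dismutase activity",
    "GO:0016613": "vanadium-iron nitrogenase complex",
    "GO:0010037": "response to carbon dioxide",
    "GO:0016830": "carbon-carbon lyase activity",
    "GO:0006063": "uronic acid metabolism",
    "GO:0015133": "uronic acid transporter activity",
}

for chebi_id, go_id in match_terms(chebi, go):
    print(chebi_id, chebi[chebi_id], "->", go_id, go[go_id])
```

Plain substring containment also matches inside longer words (for example, "iron" inside "environment"). The slide does not say whether word boundaries were enforced, so the sketch leaves them out.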

Page 65: Enhancing Ontologies Through Annotations

Examples

iron [CHEBI:18248]
  BP iron ion transport [GO:0006826]
  MF iron superoxide dismutase activity [GO:0008382]
  CC vanadium-iron nitrogenase complex [GO:0016613]

carbon [CHEBI:27594]
  BP response to carbon dioxide [GO:0010037]
  MF carbon-carbon lyase activity [GO:0016830]

uronic acid [CHEBI:27252]
  BP uronic acid metabolism [GO:0006063]
  MF uronic acid transporter activity [GO:0015133]

Page 66: Enhancing Ontologies Through Annotations

Quantitative results

2,700 ChEBI entities (27%) identified in some GO term
9,431 GO terms (55%) include some ChEBI entity in their names

[Diagram: ChEBI (10,516 entities), GO (17,250 terms), 20,497 associations]

Page 67: Enhancing Ontologies Through Annotations

Generalization

[Diagram. Ontologies: MF, CC, BP; CHEBI; FMA; Cell types; REX; FIX; Mouse Pathology. Example terms: cell membrane viscosity, enzymatic reaction, Leydig cell tumor, iron deposition, cerebellar aplasia]

Page 68: Enhancing Ontologies Through Annotations

Conclusions

Page 69: Enhancing Ontologies Through Annotations

Conclusions (1)

Links across OBO ontologies need to be made explicit
  Between GO terms across GO hierarchies
  Between GO terms and OBO terms
  Between terms across OBO ontologies
Automatic approaches
  Effective (GO-GO, GO-ChEBI)
  At least to bootstrap the process
  Needs to be refined

Page 70: Enhancing Ontologies Through Annotations

Conclusions (2)

Affordable relations
  Computer-intensive, not labor-intensive
Methods must be combined
  Cross-validation
  Redundancy as a surrogate for reliability
  Relations identified specifically by one approach
    False positives
    Specific strength of a particular method
Requires (some) manual curation
  Biologists must be involved
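
One way to picture "redundancy as a surrogate for reliability" is to count how many methods propose each relation. The Python sketch below is illustrative only. The function name `combine`, the method names, and the relation identifiers are all hypothetical placeholders, not data or code from the presentation.

```python
from collections import defaultdict


def combine(results: dict[str, set[tuple[str, str]]]):
    """Split relations into those found by several methods and those found by one."""
    support = defaultdict(set)
    for method, relations in results.items():
        for relation in relations:
            support[relation].add(method)
    # Redundant: proposed by at least two methods (treated as more reliable).
    redundant = {rel for rel, methods in support.items() if len(methods) >= 2}
    # Method-specific: proposed by exactly one method; candidates for manual
    # review, since they may be false positives or a real strength of that method.
    specific = {rel: next(iter(methods))
                for rel, methods in support.items() if len(methods) == 1}
    return redundant, specific


# Hypothetical output of two methods; "T1".."T4" are placeholder term ids.
results = {
    "lexical": {("T1", "T2"), ("T1", "T3")},
    "co-occurrence": {("T1", "T2"), ("T2", "T4")},
}
redundant, specific = combine(results)
print("found by several methods:", sorted(redundant))
print("found by one method only:", sorted(specific.items()))
```

Relations in the second group are the ones a curator would need to look at first.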

Page 71: Enhancing Ontologies Through Annotations

References

Burgun A, Bodenreider O. An ontology of chemical entities helps identify dependence relations among Gene Ontology terms. Proceedings of the First International Symposium on Semantic Mining in Biomedicine (SMBM-2005). Electronic proceedings: CEUR-WS/Vol-148. http://mor.nlm.nih.gov/pubs/pdf/2005-smbm-ab.pdf

Page 72: Enhancing Ontologies Through Annotations

Medical Ontology Research

Olivier Bodenreider

Lister Hill National Center for Biomedical Communications
Bethesda, Maryland - USA

Contact: [email protected]
Web: mor.nlm.nih.gov