Lexical and Statistical Approaches to Acquiring Ontological Relations: Formal Methods for Casual Ontology?
Olivier Bodenreider, Lister Hill National Center for Biomedical Communications, Bethesda, Maryland - USA. Ontology and Biomedical Informatics, Rome, Italy – May 1, 2005

Post on 19-Dec-2015


Page 1

Lexical and Statistical Approaches to Acquiring Ontological Relations
Formal Methods for Casual Ontology?

Olivier Bodenreider
Lister Hill National Center for Biomedical Communications
Bethesda, Maryland - USA

Ontology and Biomedical Informatics
Rome, Italy – May 1, 2005

Page 2

Introduction

• Biomedical ontologies
  • Precisely defined (e.g., formal ontology)
  • Limited size
  • Built manually
• Large amounts of knowledge
  • Not represented explicitly by symbolic relations
  • But expressed implicitly
    • By lexico-syntactic relations (i.e., embedded in terms)
    • By statistical relations (e.g., co-occurrence)
  • Can be extracted automatically

Page 3

Formal vs. casual ontology

• Formal ontology
  • Provides a framework for building sound ontologies
  • Too labor-intensive for building large ontologies
• Casual ontology
  • Usually unsuitable for reasoning
  • Tools for automatic acquisition available

Page 4

General framework

• Ontology learning [Maedche & Staab, Velardi]
  • ECAI, IJCAI
• Term variation [Jacquemin]
• Terminology / Knowledge
  • TKE, TIA
• Knowledge acquisition/capture
  • K-CAP
• Information extraction

Page 5

Sources of knowledge for casual ontology

• Long tradition of terminology building
  • Over 100 terminologies available in electronic format
• Large corpora available (e.g., MEDLINE)
  • Entity recognition tools available
    • E.g., MetaMap (UMLS-based)
    • Several for gene/protein names
  • Information extraction methods
• Large annotation databases available
  • MEDLINE citations indexed with MeSH
  • Model organism databases annotated with GO

Page 6

Formal methods for casual ontology

• Lexico-syntactic methods
  • Lexico-syntactic patterns
  • Nominal modification
  • Prepositional phrases
  • Reified relations
  • Semantic interpretation
• Statistical methods
  • Clustering
  • Statistical analysis of co-occurrence data
  • Association rule mining

Page 7

Lexico-syntactic methods

Page 8

Synonymy

• Source: terminology
• Lexical similarity
  • Lexical variant generation program (UMLS)
  • norm
• Limitations
  • Clinical synonymy vs. synonymy
  • Molecular biology

[McCray & al., SCAMC, 1994]

Page 9

Normalization

Input: Hodgkin’s diseases, NOS

1. Remove genitive: Hodgkin diseases, NOS
2. Remove stop words: Hodgkin diseases,
3. Lowercase: hodgkin diseases,
4. Strip punctuation: hodgkin diseases
5. Uninflect: hodgkin disease
6. Sort words: disease hodgkin

Page 10

Normalization: Example

The following variants all normalize to "disease hodgkin":

• Hodgkin Disease
• HODGKINS DISEASE
• Hodgkin's Disease
• Disease, Hodgkin's
• Hodgkin's, disease
• HODGKIN'S DISEASE
• Hodgkin's disease
• Hodgkins Disease
• Hodgkin's disease NOS
• Hodgkin's disease, NOS
• Disease, Hodgkins
• Diseases, Hodgkins
• Hodgkins Diseases
• Hodgkins disease
• hodgkin's disease
• Disease, Hodgkin
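The normalization steps above can be sketched in a few lines of Python. This is only an illustration, not the UMLS norm program: the stop-word list and the plural-stripping `uninflect` helper are simplifying assumptions, and lowercasing is done before stop-word removal for convenience.

```python
import re

# Illustrative stop-word subset (assumption, not the list used by norm).
STOP_WORDS = {"nos", "of", "and", "with", "the"}

def uninflect(word):
    """Naive uninflection: strip a trailing plural 's' (assumption)."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def normalize(term):
    term = re.sub(r"['’]s\b", "", term, flags=re.IGNORECASE)  # remove genitive
    term = term.lower()                                       # lowercase
    term = re.sub(r"[^\w\s]", " ", term)                      # strip punctuation
    words = [w for w in term.split() if w not in STOP_WORDS]  # remove stop words
    words = [uninflect(w) for w in words]                     # uninflect
    return " ".join(sorted(words))                            # sort words

variants = ["Hodgkin's diseases, NOS", "HODGKINS DISEASE", "Disease, Hodgkin's",
            "Hodgkin's, disease", "Diseases, Hodgkins", "hodgkin's disease"]
normalized = {normalize(v) for v in variants}
print(normalized)
```

Terms that share a normalized form become candidate synonyms; a production system would rely on a full lexicon for uninflection.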

Page 11

Taxonomic relations: Lexico-syntactic patterns

• Source: text corpus
• Example of patterns
  • Lamivudin is a nucleoside analogue with potent antiviral properties.
  • The treatment of schizophrenia with old typical antipsychotic drugs such as haloperidol can be problematic.

[Hearst, COLING, 1992] [Fiszman & al., AMIA, 2003]
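As a rough illustration of the two patterns in the examples ("X is a Y" and "Y such as X"), the sketch below uses plain regular expressions. The expressions are deliberately crude assumptions (the hypernym is capped at two words and no parsing is done); the cited systems rely on proper noun-phrase detection.

```python
import re

# Crude patterns (assumption): hypernym limited to at most two words.
IS_A = re.compile(r"\b(?P<hypo>[\w-]+) is an? (?P<hyper>[\w-]+(?: [\w-]+)?)")
SUCH_AS = re.compile(r"\b(?P<hyper>[\w-]+ [\w-]+) such as (?P<hypo>[\w-]+)")

def extract_isa(sentence):
    """Return (hyponym, 'is-a', hypernym) triples found by either pattern."""
    triples = []
    for pattern in (IS_A, SUCH_AS):
        for match in pattern.finditer(sentence):
            triples.append((match.group("hypo"), "is-a", match.group("hyper")))
    return triples

s1 = "Lamivudin is a nucleoside analogue with potent antiviral properties."
s2 = ("The treatment of schizophrenia with old typical antipsychotic drugs "
      "such as haloperidol can be problematic.")
print(extract_isa(s1))
print(extract_isa(s2))
```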

Page 12

Taxonomic relations: Nominal modification

• Source: text corpus / terminology
• Example of modifiers
  • Adjective
    • Tuberculous Addison’s disease
    • Acute hepatitis
  • Noun (noun-noun compounds)
    • Prostate cancer
    • Carbon monoxide poisoning

[Jacquemin, ACL, 1999] [Bodenreider & al., TIA, 2001]

Terminology: constrained environment (increased reliability)
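One way to exploit nominal modification inside a terminology is to strip leading modifiers and check whether the remaining head is itself a known term. The sketch below assumes English head-final terms and a small hypothetical term list; it is not the procedure of the cited papers.

```python
def modifier_relations(terms):
    """Link each term to its longest known suffix (the demodified head)."""
    known = {t.lower() for t in terms}
    relations = []
    for term in sorted(known):
        words = term.split()
        # Drop one leading modifier at a time until a known term remains.
        for i in range(1, len(words)):
            head = " ".join(words[i:])
            if head in known:
                relations.append((term, "is-a", head))
                break
    return relations

# Hypothetical mini-terminology built from the examples above.
terms = ["hepatitis", "acute hepatitis", "cancer", "prostate cancer",
         "poisoning", "carbon monoxide poisoning",
         "Addison's disease", "tuberculous Addison's disease"]
relations = modifier_relations(terms)
for relation in relations:
    print(relation)
```

Restricting both the term and its head to the same terminology is what provides the "constrained environment" mentioned above.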

Page 13

Reified relations

• Source: terminology
• Example: reification of part of
  • <X, is-a, Part of Y> corresponds to <X, part-of, Y>
• Augmented relations from reified part-of relations
  • Reified: <Cardiac chamber, is-a, Subdivision of heart>
  • Augmented: <Cardiac chamber, part-of, Heart>

[Zhang & al., ISWC/Sem. Int., 2003]
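A minimal sketch of the augmentation step: whenever the object of an is-a triple is a reified part-of concept, a direct part-of triple is generated. The list of reifying prefixes ("subdivision of", "part of", "component of") is an assumption made for the example.

```python
import re

# Assumed reifying prefixes; a real terminology would need a curated list.
REIFIED = re.compile(r"^(?:subdivision|part|component) of (?P<whole>.+)$",
                     re.IGNORECASE)

def augment(triples):
    """Derive <X, part-of, Y> from <X, is-a, 'Subdivision of Y'>."""
    augmented = []
    for subject, relation, obj in triples:
        match = REIFIED.match(obj)
        if relation == "is-a" and match:
            whole = match.group("whole")
            whole = whole[0].upper() + whole[1:]  # match concept-name casing
            augmented.append((subject, "part-of", whole))
    return augmented

reified = [("Cardiac chamber", "is-a", "Subdivision of heart")]
print(augment(reified))
```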

Page 14

Prepositional attachment

• Source: text corpus / terminology
• Example: of
  • Lobe of lung → part of Lung
  • Bone of femur → part of Femur
• Restrictions
  • Validity of preposition-to-relation correspondence may be limited to a subdomain (e.g., anatomy)
  • Not applicable to complex terms
    • Groove for arch of aorta: NOT part of Aorta

[Zhang & al., ISWC/Sem. Int., 2003]
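The "of" heuristic and its restriction to simple terms can be sketched as follows. The preposition list and the rule "exactly one preposition, and it must be of" are assumptions chosen to reproduce the examples above.

```python
# Assumed list of prepositions used to detect complex terms.
PREPOSITIONS = {"of", "for", "in", "to", "with", "by", "on"}

def part_of(term):
    """Return (term, 'part-of', whole) for simple 'X of Y' terms, else None."""
    words = term.lower().split()
    prepositions = [w for w in words if w in PREPOSITIONS]
    if prepositions != ["of"]:
        return None  # no "of", or a complex term with several prepositions
    i = words.index("of")
    if i == 0 or i == len(words) - 1:
        return None  # nothing before or after the preposition
    whole = " ".join(words[i + 1:]).capitalize()
    return (term, "part-of", whole)

print(part_of("Lobe of lung"))
print(part_of("Groove for arch of aorta"))
```

As the restrictions above suggest, such a rule should only be applied within a subdomain where "of" reliably signals partonomy.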

Page 15

Semantic interpretation

• Source: text corpus / terminology
• Correspondence between
  • Linguistic phenomena
  • Semantic relations
• Semantic constraints provided by ontologies

[Navigli & al., TKE, 2002] [Romacker, AIME, 2001] [Rindflesch & al., JBI, 2003]

Page 16

Semantic interpretation

[Figure: processing of the phrase "Hemofiltration in digoxin overdose"]

• Syntactic analysis
  • Hemofiltration: noun phrase
  • in: preposition
  • digoxin overdose: noun phrase
• Mapping to UMLS
  • Hemofiltration: Therapeutic or Preventive Procedure
  • Digoxin overdose: Disease or Syndrome
• Semantic rules
  • Candidate relation: treats
• Semantic network relationships
  • Antibiotic treats Disease or Syndrome
  • Medical Device treats Injury or Poisoning
  • Pharmacologic Substance treats Congenital Abnormality
  • Ther. or Prev. Procedure treats Disease or Syndrome
  • […]
• Select matching rule
• Semantic interpretation: Hemofiltration treats Digoxin overdose
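The pipeline in this example can be approximated with three lookup tables. Everything below is a toy sketch: `SEMANTIC_TYPES` is a hypothetical stand-in for the UMLS mapping step (a real system would use a tool such as MetaMap), and `PREPOSITION_RULES` is an assumed semantic rule linking the preposition "in" to the candidate relation treats.

```python
# Hypothetical stand-in for the "Mapping to UMLS" step.
SEMANTIC_TYPES = {
    "hemofiltration": "Therapeutic or Preventive Procedure",
    "digoxin overdose": "Disease or Syndrome",
}

# Semantic network relationships listed in the example.
SEMANTIC_NETWORK = {
    ("Antibiotic", "treats", "Disease or Syndrome"),
    ("Medical Device", "treats", "Injury or Poisoning"),
    ("Pharmacologic Substance", "treats", "Congenital Abnormality"),
    ("Therapeutic or Preventive Procedure", "treats", "Disease or Syndrome"),
}

# Assumed semantic rule: relations that a preposition may signal.
PREPOSITION_RULES = {"in": ["treats"]}

def interpret(phrase):
    """Return (left, relation, right) if the semantic network licenses it."""
    words = phrase.lower().split()
    for i, word in enumerate(words):
        if word not in PREPOSITION_RULES or i == 0 or i == len(words) - 1:
            continue
        left, right = " ".join(words[:i]), " ".join(words[i + 1:])
        left_type = SEMANTIC_TYPES.get(left)
        right_type = SEMANTIC_TYPES.get(right)
        for relation in PREPOSITION_RULES[word]:
            # Select the matching rule in the semantic network, if any.
            if (left_type, relation, right_type) in SEMANTIC_NETWORK:
                return (left, relation, right)
    return None

print(interpret("Hemofiltration in digoxin overdose"))
```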

Page 17

Compositional features of terms

• Lexical items
• Terms within a vocabulary
  • Clinical vocabularies
  • Gene Ontology
• Terms across vocabularies
  • SNOMED / LOINC
  • GO / ChEBI
• Lexicon / Terms
  • Semantic lexicon

[Baud & al., AMIA, 1998] [Ogren & al., PSB, 2004] [Mungall, CFG, 2004] [Johnson, JAMIA, 1999] [Verspoor, CFG, 2005] [Burgun, SMBM, 2005] [Dolin, JAMIA, 1998] [McDonald & al., AMIA, 1999]

Page 18

Statistical methods

Page 19

Taxonomic relations: Clustering

• Source: text corpus
• Principle: similarity between words reflected in their contexts
  • Co-occurring words (+ frequencies)
  • Hierarchical clustering algorithms
  • Similarity measure (cosine, Kullback-Leibler)
• Can be refined using classification techniques (e.g., k nearest neighbors)

[Faure & al., LREC, 1998] [Maedche & al., HoO, 2004]
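To make the principle concrete, the sketch below clusters words by the cosine similarity of their context-word counts, using a small average-linkage agglomerative loop. The context counts and the stopping threshold are invented for illustration, and the loop returns flat clusters rather than a full dendrogram.

```python
import math
from itertools import combinations

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (dicts)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def cluster(contexts, threshold):
    """Agglomerative clustering with average linkage, stopped at threshold."""
    clusters = [[word] for word in sorted(contexts)]

    def link(c1, c2):
        total = sum(cosine(contexts[x], contexts[y]) for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda p: link(clusters[p[0]], clusters[p[1]]))
        if link(clusters[i], clusters[j]) < threshold:
            break
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters

# Hypothetical context-word frequencies.
contexts = {
    "haloperidol": {"treat": 3, "dose": 2, "schizophrenia": 2},
    "clozapine": {"treat": 2, "dose": 1, "schizophrenia": 3},
    "hepatitis": {"acute": 3, "liver": 2, "virus": 1},
    "cirrhosis": {"liver": 3, "chronic": 2},
}
result = cluster(contexts, threshold=0.3)
print(result)
```

The resulting groups suggest sibling terms; naming the parent class of each cluster still requires another source of evidence.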

Page 20

Associative relations

• Source: text corpus / annotation databases
• Principle: dependence relations
  • Associations between terms
• Several methods
  • Vector space model
  • Co-occurring terms
  • Association rule mining
• Limitations: no semantics

[Bodenreider & al., PSB, 2005]

Page 21

Similarity in the vector space model

[Figure: the annotation database is represented as a GO terms × genes matrix (rows t1, t2, …, tn; columns g1, g2, …, gn) and as its transpose, a genes × GO terms matrix (rows g1, g2, …, gn; columns t1, t2, …, tn).]

Page 22

Similarity in the vector space model

[Figure: each GO term ti is a row vector over the genes g1, g2, …, gn in the GO terms × genes matrix. The similarity matrix (GO terms × GO terms) holds the pairwise similarities between term vectors.]

Sim(ti, tj) = ti · tj
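The dot-product similarity can be illustrated with a tiny binary annotation matrix. The gene and term identifiers below are hypothetical, and binary (unweighted) vectors are an assumption; the slides do not state which weighting was used.

```python
# Hypothetical annotation database: gene -> set of GO terms.
annotations = {
    "g1": {"t1", "t2"},
    "g2": {"t1", "t2", "t3"},
    "g3": {"t3"},
}

genes = sorted(annotations)
terms = sorted(set().union(*annotations.values()))

# Each GO term becomes a binary vector over the genes.
term_vectors = {
    t: [1 if t in annotations[g] else 0 for g in genes] for t in terms
}

def sim(ti, tj):
    """Sim(ti, tj) = ti . tj (dot product of the two term vectors)."""
    return sum(x * y for x, y in zip(term_vectors[ti], term_vectors[tj]))

# Full GO terms x GO terms similarity matrix.
similarity = {(ti, tj): sim(ti, tj) for ti in terms for tj in terms}
print(similarity[("t1", "t2")], similarity[("t1", "t3")])
```

With binary vectors the dot product is simply the number of genes annotated with both terms; normalizing by vector length would give the cosine instead.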

Page 23

Analysis of co-occurring GO terms

[Figure: the annotation database (genes g1, g2, …, gn × GO terms t1, t2, …, tn) is read gene by gene, and term and term-pair frequencies are counted.]

Annotations
• g2: t2, t7, t9
• …: t3, t7, t9
• g5: t5

Term pair frequencies
• t2-t7: 1
• t2-t9: 1
• t7-t9: 2
• …

Term frequencies
• t5: 1
• t7: 2
• t9: 2
• …
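The counting step in this example amounts to tallying terms and unordered term pairs per gene. In the sketch below the second gene is unnamed in the source, so the identifier "g3" is a placeholder.

```python
from collections import Counter
from itertools import combinations

# Annotations from the example; "g3" is a placeholder name for the unnamed gene.
annotations = {
    "g2": ["t2", "t7", "t9"],
    "g3": ["t3", "t7", "t9"],
    "g5": ["t5"],
}

term_counts = Counter()
pair_counts = Counter()
for terms in annotations.values():
    unique = sorted(set(terms))           # count each term once per gene
    term_counts.update(unique)
    pair_counts.update(combinations(unique, 2))  # unordered co-occurring pairs

print(pair_counts[("t7", "t9")], term_counts["t7"])
```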

Page 24

Analysis of co-occurring GO terms

• Statistical analysis: test independence
  • Likelihood ratio test (G²)
  • Chi-square test (Pearson’s χ²)
• Example from GOA (22,720 annotations)
  • GO:0006955 immune response [BP], Freq. = 588
  • GO:0008009 chemokine activity [MF], Freq. = 53
  • Co-oc. = 46

Contingency table (rows: GO:0006955 immune response; columns: GO:0008009 chemokine activity)

             present    absent     total
present           46       542       588
absent             7    22,125    22,132
total             53    22,667    22,720

G² = 298.7, p < 0.001
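The G² value can be checked from the figures stated in the example (46 co-occurrences, term frequencies of 588 and 53, and 22,720 annotations), from which the four cell counts follow by subtraction. The sketch below computes both statistics for a 2×2 table; the p-value uses the closed form for a chi-square distribution with one degree of freedom.

```python
import math

def expected_counts(a, b, c, d):
    """Expected cell counts under independence for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    return [rows[i] * cols[j] / n for i in range(2) for j in range(2)]

def g2_statistic(a, b, c, d):
    """Likelihood ratio statistic: 2 * sum(O * ln(O / E))."""
    observed = [a, b, c, d]
    expected = expected_counts(a, b, c, d)
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)

def chi2_statistic(a, b, c, d):
    """Pearson's chi-square statistic: sum((O - E)^2 / E)."""
    observed = [a, b, c, d]
    expected = expected_counts(a, b, c, d)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Cells derived from the stated totals: both present, first only, second only, neither.
both = 46
first_only = 588 - both
second_only = 53 - both
neither = 22720 - both - first_only - second_only

g2 = g2_statistic(both, first_only, second_only, neither)
p_value = math.erfc(math.sqrt(g2 / 2))  # chi-square survival function, 1 d.f.
print(round(g2, 1), p_value)
```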

Page 25

Association rule mining

[Figure: each gene in the annotation database (genes g1, g2, …, gn × GO terms t1, t2, …, tn) is treated as a transaction, e.g., g2: t2, t7, t9. The transactions are processed with the apriori algorithm.]

• Rules: t1 => t2
• Confidence: > .9
• Support: .05
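Support and confidence for rules of the form t1 => t2 can be computed directly, as sketched below. This is not the apriori algorithm itself (which prunes candidate itemsets of any size); it only enumerates single-term rules. The transactions are hypothetical, and treating .05 as a minimum support is an assumption.

```python
from itertools import permutations

def pairwise_rules(transactions, min_support=0.05, min_confidence=0.9):
    """Return (antecedent, consequent, support, confidence) for a => b rules."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in permutations(items, 2):
        n_a = sum(1 for t in transactions if a in t)
        n_ab = sum(1 for t in transactions if a in t and b in t)
        support = n_ab / n                       # fraction of genes with both terms
        confidence = n_ab / n_a if n_a else 0.0  # P(b | a)
        if support >= min_support and confidence > min_confidence:
            rules.append((a, b, support, confidence))
    return rules

# Hypothetical transactions: the set of GO terms annotating each gene.
transactions = [{"t1", "t2"}, {"t1", "t2", "t3"}, {"t2", "t3"}, {"t1", "t2"}]
rules = pairwise_rules(transactions)
for rule in rules:
    print(rule)
```

Rules are directional: here t1 => t2 passes the confidence threshold while t2 => t1 does not.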

Page 26

Example of associations (GO)

• Vector space model
  • MF: ice binding
  • BP: response to freezing
• Co-occurring terms
  • MF: chromatin binding
  • CC: nuclear chromatin
• Association rule mining
  • MF: carboxypeptidase A activity
  • BP: peptolysis and peptidolysis

Page 27

Discussion and Conclusions

Page 28

Combine methods

• Affordable relations
  • Computer-intensive, not labor-intensive
• Methods must be combined
  • Cross-validation
    • Redundancy as a surrogate for reliability
  • Relations identified specifically by one approach
    • False positives
    • Specific strength of a particular method
• Requires (some) manual curation
  • Biologists must be involved
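A simple way to operationalize "redundancy as a surrogate for reliability" is to record which methods propose each association. The sketch below uses invented term pairs and treats associations as undirected; the labels VSM, COC and ARM stand for the vector space model, co-occurrence analysis and association rule mining.

```python
def methods_per_relation(results):
    """Map each undirected term pair to the set of methods that found it."""
    found = {}
    for method, pairs in results.items():
        for a, b in pairs:
            found.setdefault(frozenset((a, b)), set()).add(method)
    return found

# Hypothetical output of the three statistical methods.
results = {
    "VSM": {("t1", "t2"), ("t3", "t4")},
    "COC": {("t2", "t1"), ("t5", "t6")},
    "ARM": {("t1", "t2")},
}

found = methods_per_relation(results)
redundant = [pair for pair, methods in found.items() if len(methods) >= 2]
specific = [pair for pair, methods in found.items() if len(methods) == 1]
print(len(redundant), len(specific))
```

Pairs found by several methods are the more reliable candidates, while method-specific pairs are the ones to prioritize for manual curation.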

Page 29

Limited overlap among approaches

[Figure: two Venn diagrams. Lexical vs. non-lexical: region counts 7665, 5493 and 230. Among non-lexical (VSM, COC, ARM): region counts 3689, 2587, 453, 305, 309, 121 and 201.]

[Bodenreider & al., PSB, 2005]

Page 30

Reusing thesauri

• First approximation for taxonomic relations
  • No need for creating taxonomies from scratch in biomedicine
• Beware of purpose-dependent relations
  • Addison’s disease isa Autoimmune disorder
  • Relations used to create hierarchies vs. hierarchical relations
• Requires (some) manual curation

[Wroe & al., PSB, 2003] [Hahn & al., PSB, 2003]

Page 31

Formal vs. Casual

• Formal ontology
  • Provides a framework for building sound ontologies
  • Too labor-intensive for building large ontologies
• Casual ontology
  • Usually unsuitable for reasoning
  • Tools for automatic acquisition available

What is not useful
• Formal ontology = righteous
• Casual ontology = sloppy

Page 32

Formal and Casual

• Formal ontology
  • Provides a framework which can be used as a reference
  • Helps us think clearly (?) about
    • Concepts
    • Relations (e.g., isa: is a kind of / is an instance of)
• Casual ontology
  • Supported by “cheap” (but formal) methods
  • Extracted from large amounts of data
  • Helps populate the framework from formal ontology

Page 33

Combining formal and casual

• Formal ontology
  • Provides a framework for building sound ontologies
  • Too labor-intensive for building large ontologies
  • Can benefit from loosely defined ontologies
• Casual ontology
  • Usually unsuitable for reasoning
  • Tools for automatic acquisition available
  • Can benefit from formal ontology
    • Organization
    • Validation

Page 34

Casual ontology as a bridge

• Casual ontology
  • Speaks the language of biologists
    • Extracted from text or terminologies
  • Passes (part of) the rigorous framework of formal ontology on to biologists
• Casual ontologist
  • Not a sloppy ontologist
    • Uses the formal methods of casual ontology
  • Mediator between formal ontology and biology

Page 35

Medical Ontology Research

Olivier Bodenreider
Lister Hill National Center for Biomedical Communications
Bethesda, Maryland - USA

Contact: [email protected]
Web: mor.nlm.nih.gov