bmi 201 - investigating term reuse and overlap in biomedical ontologies
TRANSCRIPT
Investigating Term Reuse and Overlap in Biomedical Ontologies
Maulik R. Kamdar Musen Lab
7th April, 2015
Motivation
• Uses of ontologies in biomedical research: decision support, knowledge management, semantic search, annotation, data integration, exchange and reasoning.
• Advantages of reuse
– Develop a unified theory of biomedicine
– Semantic interoperability
– Reduce engineering costs (reuse avoids rebuilding existing ontology structures).
– ICD-11 reuses content from other ontologies, such as SNOMED CT, to avoid recreating existing content and to support its use in EHRs.
– Enables federated search engines to query multiple, heterogeneous knowledge sources
Related Work
• The Open Biomedical Ontologies (OBO) Foundry aims to create a set of ‘orthogonal’ ontologies, using Internationalised Resource Identifiers (IRI) and xref mappings between similar terms.
• The Unified Medical Language System (UMLS) uses the notion of a Concept Unique Identifier (CUI) to map similar terms.
• Analysis of the OBO Foundry indicated progress towards ‘orthogonality’, yet the gap in term overlap has remained consistent.
• OntoFox and the BioPortal Import Plugin for the Protégé ontology editor allow import of terms, properties and annotations from source ontologies.
Goals
We classify reuse in biomedical ontologies into two categories:
– Reuse of an ontology, through an import mechanism
– Reuse of terms from one source ontology into another
Contributions:
– A set of descriptive statistics describing reuse in biomedical ontologies,
– An interactive visualization for displaying the reuse dependencies,
– A clustering method to help identify patterns of reuse,
– Discussion on the state of reuse and the need for a semi-automated tool
Results - Explicit and xref Reuse
• 175,347 terms (3.1%) are explicitly reused.
• A source ontology was identified for all but 37 terms (e.g., time#datetimedescription).
• After removing the ‘reused ontologies’, only 59,618 terms (1.1%) remain.
• 4,370,350 xref axioms were found (database and ontology xrefs).
• 171,069 ‘outlinking’ terms (3.9%) are xref-linked to 386,442 ‘inlinking’ terms (8.84%).
Results
Ontology    # explicit  % explicit    Ontology    # xref  % xref
NIFSTD      42          89.6          UBERON      37      72.17
HUPSON      32          55.79         CL          21      14
OBI_BCGO    25          97.97         TMO         21      17.28
IDOMAL      24          43.53         HPIO        16      53.74
IDODEN      23          29.1          DOID        13      90.81
OBI         22          19.14         TRAK        10      23.84
CCONT       22          98.77         GO          9       0.76
EFO         21          70.09         HP          8       11.83
CLO         19          7.21          DERMO       7       25.66
IDOBRU      19          43.27         EFO         6       0.76
Ontologies that reuse the maximum number of terms from other ontologies
Results
Ontology    # explicit  % reuse       Ontology    # explicit  % reuse
BFO (59)    81          258.97        GO          24          1.59
GO          74          95.15         CHEBI       16          3.2
IAO (9)     55          72.83         CARO (4)    16          478
OBI         51          43.07         MESH        11          1.59
PATO (10)   45          190.52        NCIT        10          6.66
CHEBI       37          54.23         FMA         10          13.99
CL          36          15.35         PATO        10          22.71
NCBITAXON   30          0.3           CL          9           18.84
STY (29)    29          100           NCBITAXON   8           19.89
UO (5)      27          135.65        SO          8           5.05
Ontologies whose terms are reused most by other ontologies
Results - UMLS CUI Construct 3
• 236,460 CUIs are mapped to more than two terms in UMLS terminologies.
• Some of the most mapped CUI terms are:
– Neoplasms (C0027651) and Diabetes mellitus (C0011849) - 18 terminologies
– Schizophrenia (C0036341) and Leukemia (C0023418) - 17 terminologies
Results - Term Overlap Statistics
• Executing normalised string matching, we found a term overlap of 823,621 shared labels (14.4%).
• Removing explicitly reused terms - 752,176 labels (13.2%).
• Removing terms sharing UMLS CUIs - 617,509 labels (10.8%).
• Removing almost similar term IRIs (same identifier and source, but a different representation) - 93,650 labels (1.6%).
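The normalised string matching step could look roughly like the sketch below. The slides do not say which normalisation rules were applied, so the rules here (lowercasing, replacing punctuation with spaces, trimming) are assumptions, and the ontology names and labels are toy data.

```python
import re
from collections import defaultdict

def normalise(label):
    """Lowercase, turn punctuation/underscores into spaces, trim (assumed rules)."""
    return re.sub(r"[^a-z0-9]+", " ", label.lower()).strip()

def shared_labels(ontologies):
    """Map each normalised label to the ontologies containing it,
    keeping only labels that occur in two or more ontologies."""
    seen = defaultdict(set)
    for name, labels in ontologies.items():
        for label in labels:
            seen[normalise(label)].add(name)
    return {label: onts for label, onts in seen.items() if len(onts) > 1}

# Toy input: hypothetical ontology names and labels.
toy = {
    "ONT_A": ["Cerebral Vein", "Heart"],
    "ONT_B": ["cerebral_vein", "Liver"],
    "ONT_C": ["heart ", "Kidney"],
}
overlap = shared_labels(toy)
print(overlap)
```

Labels that differ only in case, underscores or surrounding whitespace collapse to the same key, which is what lets them be counted as overlapping.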
Two-phase clustering approach
• Generate a term-ontology matrix.
• Sparse K-means algorithm with the Gap-Estimate method (K=6)
• For each pair of terms, compute similarity scores.
• Use a spectral clustering method with the term-term affinity matrix.
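As an illustration of the first and third steps, the sketch below builds a binary term-ontology matrix and a term-term affinity matrix. The slides do not name the similarity score, so Jaccard similarity is an assumption here. The sparse K-means and spectral clustering steps are omitted, and the input is toy data.

```python
def term_ontology_matrix(reuse):
    """reuse: ontology name -> set of term identifiers it contains.
    Returns (terms, ontologies, matrix) where matrix[i][j] is 1 when
    term i appears in ontology j, else 0."""
    ontologies = sorted(reuse)
    terms = sorted(set().union(*reuse.values()))
    matrix = [[1 if t in reuse[o] else 0 for o in ontologies] for t in terms]
    return terms, ontologies, matrix

def jaccard(row_a, row_b):
    """Jaccard similarity of two binary rows (assumed similarity score)."""
    both = sum(1 for a, b in zip(row_a, row_b) if a and b)
    either = sum(1 for a, b in zip(row_a, row_b) if a or b)
    return both / either if either else 0.0

def affinity_matrix(matrix):
    """Term-term affinity: pairwise similarity between term rows."""
    return [[jaccard(a, b) for b in matrix] for a in matrix]

# Toy input: hypothetical ontologies and term identifiers.
toy_reuse = {
    "ONT_A": {"T1", "T2"},
    "ONT_B": {"T1", "T2", "T3"},
    "ONT_C": {"T3"},
}
terms, ontologies, m = term_ontology_matrix(toy_reuse)
aff = affinity_matrix(m)
```

The resulting affinity matrix is the kind of input a spectral clustering routine expects, which is where the second phase would pick up.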
Clustering
Discussion: Intent for Reuse
Different versions:
– SAO and SOPHARM reuse terms from BFO version 1.0 instead of 1.1.
– CCO and HINO reuse terms from an older version of NCI Thesaurus. E.g. NCIT:Cerebral_Vein instead of the recent NCIT:C53037
Different notations:
– E.g., OBO:FMA_31396 is reused as OBO:owlapi/fma#FMA_31396, OBO:owl/FMA#FMA_31396, and even with the entire label OBO:fma#Cartilage_of_inferior_surface_of_posterolateral_part.
Different namespaces:
– E.g. RH-MESH uses http://phenomebrowser.net/ontologies/mesh/mesh.owl, while most other ontologies use http://purl.bioontology.org/ontology/MESH.
– SNOMED CT: http://ihtsdo.org/snomedct/ and http://purl.bioontology.org/ontology/SNOMEDCT
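One way to detect the notation variants above is to reduce each IRI to a (source, identifier) key, as in this minimal sketch. The helper `local_key` and its `PREFIX_digits` pattern are assumptions for illustration, not the method used in the study. Label-based IRIs are deliberately left unmatched.

```python
import re

def local_key(iri):
    """Return (SOURCE, identifier) for IRIs whose last segment looks like
    PREFIX_digits; return None for label-based or unrecognised forms."""
    # Take the last segment after any '#', '/' or ':' separator.
    fragment = re.split(r"[#/:]", iri.rstrip("/"))[-1]
    match = re.fullmatch(r"([A-Za-z]+)_(\d+)", fragment)
    if match is None:
        return None
    return match.group(1).upper(), match.group(2)

variants = [
    "OBO:FMA_31396",
    "OBO:owlapi/fma#FMA_31396",
    "OBO:owl/FMA#FMA_31396",
    "OBO:fma#Cartilage_of_inferior_surface_of_posterolateral_part",
]
keys = [local_key(v) for v in variants]
print(keys)
```

The first three variants map to the same key, while the label-based form returns None and would need a label lookup to be resolved.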
Discussion
• Most ontologies exhibit substantial term overlap but considerably less than 5% reuse, and from only a small set of ontologies.
• Lexically similar terms may represent different concepts (e.g., anatomical concepts in Zebrafish Anatomy (ZFA) and Xenopus Anatomy (XAO)).
• The same terms using different IRI representations, and without explicit CUI or xref mappings, are not considered term reuse.
• Our visualization of reuse dependencies could guide term reuse based on the structure of ontologies in related domains.
Future Work
• Analyze Web Protégé BioPortal Import Plugin logs
• Item-based Collaborative Filtering method (used by Amazon) to provide recommendations to users through a Web Protégé plugin.
• Two-fold evaluation
– a posteriori: check if the term-reuse recommendations match those actually reused by users, as analyzed from the logs
– user-centered: monitoring term reuse when developers build an ontology combining existing ontologies, and surveys
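To make the item-based collaborative filtering idea concrete, here is a generic sketch that treats ontologies as "users" and reused terms as "items". It is not the planned plugin. The cosine similarity over binary reuse data, the `recommend` helper, and the toy log are all assumptions.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity of two sets treated as binary vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))

def recommend(reuse, target, top_n=3):
    """reuse: ontology -> set of reused terms. Score every term the target
    has not reused by its summed similarity to the terms it already reuses."""
    # Invert the data: term -> set of ontologies that reuse it.
    users_of = {}
    for ont, terms in reuse.items():
        for term in terms:
            users_of.setdefault(term, set()).add(ont)
    owned = reuse[target]
    scores = {}
    for cand, cand_users in users_of.items():
        if cand in owned:
            continue
        score = sum(cosine(cand_users, users_of[t]) for t in owned)
        if score > 0:
            scores[cand] = score
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]

# Toy reuse log: hypothetical ontologies and term identifiers.
toy_logs = {
    "ONT_A": {"T1", "T2", "T3"},
    "ONT_B": {"T1", "T2"},
    "ONT_C": {"T1", "T4"},
    "ONT_NEW": {"T1"},
}
suggestions = recommend(toy_logs, "ONT_NEW")
print(suggestions)
```

Terms that are frequently reused together with what the target ontology already imports rank highest, which mirrors the a posteriori evaluation: the ranked list can be compared against terms users actually imported.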
Acknowledgements • Mark Musen • Tania Tudorache • Manuel Salvadores Olaizola • Musen Lab
• Steve Bagley • MaryJeanne Oliva • Nancy Leannartson • BMI Program
• US NIH Grants GM086587 and GM103316