

The UMLS and the Semantic Web

W3C Semantic Web Health Care and Life Sciences Interest Group

BioRDF Teleconference
September 22, 2008

Olivier Bodenreider

Lister Hill National Center for Biomedical Communications
Bethesda, Maryland - USA


Outline

- The UMLS (in a nutshell)
  - Lexical resources
  - Metathesaurus
  - Semantic Network
- Why is the UMLS relevant to the Semantic Web?
- Issues and challenges

Unified Medical Language System (UMLS)


UMLS: 3 components

- SPECIALIST Lexicon (lexical resources)
  - 200,000 lexical items
  - Part of speech and variant information
- Metathesaurus (terminological resources)
  - 5M names from over 100 terminologies
  - 1M concepts
  - 16M relations
- Semantic Network (ontological resources)
  - 135 high-level categories
  - 7000 relations among them


UMLS Characteristics (1)

- Current version: 2008AA (2-3 annual releases)
- Type: Terminology integration system
- Domain: Biomedicine
- Developer: NLM
- Funding: NLM (intramural)
- Availability
  - Publicly available: Yes* (cost-free license required)
  - Repositories: UMLS
- URL: http://umlsks.nlm.nih.gov/


UMLS Characteristics (2)

- Number of
  - Concepts: 1.5M (2008AA)
  - Terms: ~6M
- Major organizing principles (Metathesaurus):
  - Concept orientation
  - Source transparency
  - Multi-lingual through translation
- Formalism: Proprietary format (RRF)


UMLS: Integrating subdomains

[Figure: the UMLS at the center, linking source vocabularies from several subdomains]

- Biomedical literature: MeSH
- Genome annotations: GO
- Model organisms: NCBI Taxonomy
- Genetic knowledge bases: OMIM
- Clinical repositories: SNOMED CT
- Anatomy: FMA
- Other subdomains


Trans-namespace integration

[Figure: the subdomain diagram again (GO, NCBI Taxonomy, OMIM, FMA, MeSH, SNOMED CT around the UMLS), with one concept traced across two source vocabularies]

- Biomedical literature (MeSH): Addison Disease (D000224)
- Clinical repositories (SNOMED CT): Addison's disease (363732003)
- UMLS: C0001403
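The bridging shown on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the UMLS API: the table holds only the three identifiers from the slide, and the names `SOURCE_TO_CUI`, `to_cui` and `same_concept` are hypothetical.

```python
# Minimal sketch of trans-namespace integration through a shared concept
# identifier. Only the identifiers shown on the slide are used; the table
# layout and the function names are assumptions made for illustration.

# (source vocabulary, source code) -> UMLS Metathesaurus concept identifier
SOURCE_TO_CUI = {
    ("MeSH", "D000224"): "C0001403",         # Addison Disease
    ("SNOMED CT", "363732003"): "C0001403",  # Addison's disease
}


def to_cui(source, code):
    """Return the concept identifier for a source code, or None if unknown."""
    return SOURCE_TO_CUI.get((source, code))


def same_concept(a, b):
    """True when two (source, code) pairs map to the same known concept."""
    cui_a, cui_b = to_cui(*a), to_cui(*b)
    return cui_a is not None and cui_a == cui_b


if __name__ == "__main__":
    mesh = ("MeSH", "D000224")
    snomed = ("SNOMED CT", "363732003")
    print(same_concept(mesh, snomed))
```

The two source codes share no identifier of their own; they are linked only because both resolve to the same Metathesaurus concept, which is the sense of "integration beyond shared identifiers" used later in the deck.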

[Figure: the Metathesaurus concept "Heart" and related concepts (Esophagus, Left Phrenic Nerve, Heart Valves, Fetal Heart, Mediastinum, Saccular Viscus, Angina Pectoris, Cardiotonic Agents, Tissue Donors), each linked to Semantic Types in the Semantic Network (Anatomical Structure; Fully Formed Anatomical Structure; Embryonic Structure; Body Part, Organ or Organ Component; Pharmacologic Substance; Disease or Syndrome; Population Group). Numeric labels in the figure: 22, 225, 97, 4, 12, 9, 31.]

Why is the UMLS relevant to the Semantic Web?


Relevance to the SW: Metathesaurus

- Terminology integration system
  - Trans-namespace integration
  - Integration beyond shared identifiers
- Repository of biomedical terminologies/ontologies
  - Many UMLS vocabularies used for the annotation of datasets (including clinical records)


Relevance to the SW: Metathesaurus

- Broad coverage of biomedicine
- Large user base
- Tooling available
  - E.g., visualization, named entity recognition, etc.


Relevance to the SW: Semantic Network

- Top-level ontology of the biomedical domain
  - Broad biomedical categories
  - Helps partition biomedical concepts
  - Semantic relations
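As a rough illustration of how broad categories partition concepts, the sketch below groups a handful of concepts by semantic type. The concept and type names come from the "Heart" figure earlier in the deck, but the specific pairings are assumptions made for this example and should be checked against a UMLS release; `partition_by_semantic_type` is a hypothetical helper.

```python
from collections import defaultdict

# Concept -> semantic type pairs. The names appear in the "Heart" figure;
# the pairings themselves are an assumption made for illustration.
CONCEPT_TYPES = [
    ("Heart", "Body Part, Organ or Organ Component"),
    ("Fetal Heart", "Embryonic Structure"),
    ("Angina Pectoris", "Disease or Syndrome"),
    ("Cardiotonic Agents", "Pharmacologic Substance"),
    ("Tissue Donors", "Population Group"),
]


def partition_by_semantic_type(pairs):
    """Group concept names under the semantic type assigned to each."""
    groups = defaultdict(list)
    for concept, semantic_type in pairs:
        groups[semantic_type].append(concept)
    return dict(groups)


if __name__ == "__main__":
    for semantic_type, concepts in partition_by_semantic_type(CONCEPT_TYPES).items():
        print(f"{semantic_type}: {', '.join(concepts)}")
```

A concept can carry more than one semantic type in the UMLS; the list-of-pairs input above allows that, since the same concept name may simply appear in several pairs.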

Issues and Challenges


Issues and challenges

- Availability
  - Mandatory license agreement
- Discoverability
  - No metadata
- Formalism
  - No easy conversion to SKOS/RDF(S)/OWL
- Identifiers
- Steep learning curve


Availability

- Some source vocabularies have intellectual property restrictions
  - E.g., most drug vocabularies
  - Complex agreement for SNOMED CT: available at no cost for member countries of the IHTSDO
- Mandatory license agreement
  - No cost for research
  - May require negotiation with the vocabulary developer for production applications
- MetamorphoSys helps extract selected sources from the UMLS


Discoverability

- Discoverability of individual concepts
  - UMLSKS web services
  - Search all UMLS source vocabularies at the same time
  - Named entity recognition/normalization (e.g., MetaMap)
- Discoverability of terminologies/ontologies
  - No comprehensive registries
  - No rich registries
    - With rich metadata supporting the discoverability of terminologies/ontologies


Formalism

- UMLS: Proprietary format
  - Rich Release Format (RRF)
  - All terminologies/ontologies represented in the same format
- No easy conversion to SKOS/RDF(S)/OWL
  - Underspecified semantics
    - Child/parent vs. subClassOf
  - Complex semantics
    - Descriptors / concepts / terms
  - Rich attribute set
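To make the conversion problem concrete, here is a minimal sketch that reads two pipe-delimited rows laid out like the Metathesaurus names file (MRCONSO.RRF) and emits a flat SKOS view as N-Triples. Several things are assumptions: the sample rows are hand-written for illustration (their LUI, SUI and AUI values are placeholders, not real identifiers), the base URI `http://example.org/umls/` is hypothetical, and mapping every name to `skos:altLabel` is just one possible choice.

```python
# Sketch: parse MRCONSO-style RRF rows and emit a flat SKOS view (N-Triples).
# Sample rows, base URI and the SKOS mapping are illustrative assumptions.

MRCONSO_FIELDS = [
    "CUI", "LAT", "TS", "LUI", "STT", "SUI", "ISPREF", "AUI", "SAUI",
    "SCUI", "SDUI", "SAB", "TTY", "CODE", "STR", "SRL", "SUPPRESS", "CVF",
]

# Hand-written rows; LUI/SUI/AUI values are placeholders, not real identifiers.
SAMPLE_LINES = [
    "C0001403|ENG|P|L0000001|PF|S0000001|Y|A0000001|||D000224|MSH|MH|D000224|Addison Disease|0|N||",
    "C0001403|ENG|S|L0000002|PF|S0000002|Y|A0000002||363732003||SNOMEDCT|PT|363732003|Addison's disease|9|N||",
]

SKOS = "http://www.w3.org/2004/02/skos/core#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "http://example.org/umls/"  # hypothetical namespace, not an official one


def parse_mrconso_line(line):
    """Split one RRF row into a dict keyed by MRCONSO field name."""
    values = line.rstrip("\n").split("|")
    if values and values[-1] == "":
        values = values[:-1]  # each RRF row ends with a trailing pipe
    if len(values) != len(MRCONSO_FIELDS):
        raise ValueError(f"expected {len(MRCONSO_FIELDS)} fields, got {len(values)}")
    return dict(zip(MRCONSO_FIELDS, values))


def literal(text, lang="en"):
    """Format a language-tagged N-Triples literal with minimal escaping."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@{lang}'


def to_ntriples(rows):
    """Emit one skos:Concept per CUI and one skos:altLabel per English name."""
    triples, seen = [], set()
    for row in rows:
        if row["LAT"] != "ENG":
            continue  # this sketch handles English names only
        subject = f"<{BASE}{row['CUI']}>"
        if row["CUI"] not in seen:
            seen.add(row["CUI"])
            triples.append(f"{subject} <{RDF_TYPE}> <{SKOS}Concept> .")
        triples.append(f"{subject} <{SKOS}altLabel> {literal(row['STR'])} .")
    return triples


if __name__ == "__main__":
    rows = [parse_mrconso_line(line) for line in SAMPLE_LINES]
    print("\n".join(to_ntriples(rows)))
```

The flattening is lossy on purpose: the source vocabulary (SAB), term type (TTY) and source code (CODE) of each name are dropped, and nothing is said about hierarchy. Carrying that information over without overstating its semantics is where the conversion stops being easy.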


Identifiers for biomedical entities

- What is identified?
  - Entity vs. resource about the entity
- Which identifier to pick?
  - E.g., Addison’s disease
    - 363732003 (SNOMED CT)
    - D000224 (MeSH)
    - C0001403 (UMLS Metathesaurus)
- Which format?
  - URI vs. LSID
- Which authoritative source for minting URIs?
  - Ontology developers vs. (e.g.) Bio2RDF
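The format question can be illustrated with a short sketch that renders the three identifiers for Addison's disease as HTTP URIs and as LSIDs. The LSID layout (`urn:lsid:authority:namespace:object`) follows the LSID syntax; everything else is an assumption: `example.org` stands in for whichever authority ends up minting the identifiers, and the namespace labels and function names are hypothetical.

```python
# Sketch: the same local identifiers rendered as HTTP URIs and as LSIDs.
# "example.org" and the namespace labels are placeholders, not real authorities.

IDENTIFIERS = [
    ("snomedct", "363732003"),  # SNOMED CT
    ("mesh", "D000224"),        # MeSH
    ("umls", "C0001403"),       # UMLS Metathesaurus
]


def http_uri(base, namespace, local_id):
    """Build an HTTP URI of the form <base>/<namespace>/<local_id>."""
    return f"{base.rstrip('/')}/{namespace}/{local_id}"


def lsid(authority, namespace, local_id):
    """Build an LSID of the form urn:lsid:<authority>:<namespace>:<object>."""
    return f"urn:lsid:{authority}:{namespace}:{local_id}"


if __name__ == "__main__":
    for namespace, local_id in IDENTIFIERS:
        print(http_uri("http://example.org", namespace, local_id))
        print(lsid("example.org", namespace, local_id))
```

Either form still leaves the slide's other questions open: the strings say nothing about whether they denote the disease itself or a record about it, and a different authority would mint different strings for the same entity.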


Steep learning curve

- Large resource
  - 1.5M concepts
  - 6M terms
  - Over 20M relations
- Complex structure
  - Metathesaurus
  - Semantic Network
- Rich set of attributes
- Rich set of relations
  - Terminological
  - Semantic
  - Statistical
  - Mapping
- Multiple languages
- Complex domain

Conclusions


Conclusions

- UMLS as a terminology integration system
  - Helps bridge across namespaces
  - Helps integrate information sources
    - Beyond shared identifiers
- UMLS as a repository of terminologies/ontologies
  - Single source, single format for 143 vocabularies
- Issues with availability, discoverability and formalism
- Identifiers for biomedical entities


References

- UMLS: umlsinfo.nlm.nih.gov
- UMLS browsers (free, but UMLS license required)
  - Knowledge Source Server: umlsks.nlm.nih.gov
  - Semantic Navigator: http://mor.nlm.nih.gov/perl/semnav.pl
  - RRF browser (standalone application distributed with the UMLS)


References

- Recent overviews
  - Bodenreider O. (2004). The Unified Medical Language System (UMLS): Integrating biomedical terminology. Nucleic Acids Research; D267-D270.
  - Bodenreider O. From terminology integration to information integration: Unified Medical Language System (UMLS). BioRDF Teleconference, W3C Semantic Web Health Care and Life Sciences Interest Group, June 5, 2006. http://mor.nlm.nih.gov/pubs/pres/060605-BioRDF.pdf

Medical Ontology Research

Olivier Bodenreider

Lister Hill National Center for Biomedical Communications
Bethesda, Maryland - USA

Contact: [email protected]
Web: mor.nlm.nih.gov