The UMLS and the Semantic Web

W3C Semantic Web Health Care and Life Sciences Interest Group
BioRDF Teleconference, September 22, 2008

Olivier Bodenreider
Lister Hill National Center for Biomedical Communications
Bethesda, Maryland, USA

Outline
- The UMLS (in a nutshell)
  - Lexical resources
  - Metathesaurus
  - Semantic Network
- Why is the UMLS relevant to the Semantic Web?
- Issues and challenges

UMLS: 3 components
- SPECIALIST Lexicon (lexical resources)
  - 200,000 lexical items
  - Part of speech and variant information
- Metathesaurus (terminological resources)
  - 5M names from over 100 terminologies
  - 1M concepts
  - 16M relations
- Semantic Network (ontological resources)
  - 135 high-level categories
  - 7000 relations among them

UMLS: Characteristics (1)
- Current version: 2008AA (2-3 annual releases)
- Type: Terminology integration system
- Domain: Biomedicine
- Developer: NLM
- Funding: NLM (intramural)
- Availability
  - Publicly available: Yes* (cost-free license required)
  - Repositories: UMLS
- URL: http://umlsks.nlm.nih.gov/

UMLS: Characteristics (2)
- Number of
  - Concepts: 1.5M (2008AA)
  - Terms: ~6M
- Major organizing principles (Metathesaurus):
  - Concept orientation
  - Source transparency
  - Multi-lingual through translation
- Formalism: Proprietary format (RRF)

UMLS: Integrating subdomains
[Figure: the UMLS at the center, linked to source vocabularies from several subdomains]
- Biomedical literature: MeSH
- Genome annotations: GO
- Model organisms: NCBI Taxonomy
- Genetic knowledge bases: OMIM
- Clinical repositories: SNOMED CT
- Anatomy: FMA
- Other subdomains: …

Trans-namespace integration
[Figure: same diagram as the previous slide, with one concept traced across two source vocabularies through the UMLS]
- Biomedical literature (MeSH): Addison Disease (D000224)
- Clinical repositories (SNOMED CT): Addison's disease (363732003)
- UMLS: both map to concept C0001403

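The Addison's disease example can be sketched as a lookup from source-specific codes to a shared concept identifier. This is only an illustration: the three identifiers come from the slide, while the table layout, the function names, and the source labels `MSH` and `SNOMEDCT` are assumptions made here.

```python
# Illustrative only: the identifiers are taken from the slide; the table
# layout, function names, and source labels are assumptions.
SOURCE_TO_CUI = {
    ("MSH", "D000224"): "C0001403",         # MeSH: Addison Disease
    ("SNOMEDCT", "363732003"): "C0001403",  # SNOMED CT: Addison's disease
}


def to_cui(source, code):
    """Return the UMLS concept identifier for a source code, or None."""
    return SOURCE_TO_CUI.get((source, code))


def same_concept(a, b):
    """True when two (source, code) pairs resolve to the same known concept."""
    cui_a, cui_b = to_cui(*a), to_cui(*b)
    return cui_a is not None and cui_a == cui_b


if __name__ == "__main__":
    mesh = ("MSH", "D000224")
    snomed = ("SNOMEDCT", "363732003")
    print(to_cui(*mesh), same_concept(mesh, snomed))
```

Two records annotated with different vocabularies can then be joined on the concept identifier even though they share no source code.
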
[Figure: Metathesaurus concepts related to Heart (Esophagus, Left Phrenic Nerve, Heart Valves, Fetal Heart, Mediastinum, Saccular Viscus, Angina Pectoris, Cardiotonic Agents, Tissue Donors), each categorized by Semantic Network semantic types (Anatomical Structure; Fully Formed Anatomical Structure; Embryonic Structure; Body Part, Organ or Organ Component; Pharmacologic Substance; Disease or Syndrome; Population Group)]

Relevance to the SW: Metathesaurus
- Terminology integration system
  - Trans-namespace integration
  - Integration beyond shared identifiers
- Repository of biomedical terminologies/ontologies
  - Many UMLS vocabularies used for the annotation of datasets (including clinical records)

Relevance to the SW: Metathesaurus
- Broad coverage of biomedicine
- Large user base
- Tooling available
  - E.g., visualization, named entity recognition, etc.

Relevance to the SW: Semantic Network
- Top-level ontology of the biomedical domain
- Broad biomedical categories
- Helps partition biomedical concepts
- Semantic relations

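How semantic types partition concepts can be sketched by grouping concepts under their type. The concept and type names appear in the Heart figure earlier in the deck, but the extracted text does not preserve which concept links to which type, so the assignments below are assumptions for illustration and each concept is given a single type for simplicity.

```python
from collections import defaultdict

# Assumed concept-to-type assignments (not preserved in the slide text).
CONCEPT_TYPES = {
    "Heart": "Body Part, Organ, or Organ Component",
    "Esophagus": "Body Part, Organ, or Organ Component",
    "Fetal Heart": "Embryonic Structure",
    "Angina Pectoris": "Disease or Syndrome",
    "Cardiotonic Agents": "Pharmacologic Substance",
    "Tissue Donors": "Population Group",
}


def partition_by_type(concept_types):
    """Group concept names under their semantic type, sorted within a group."""
    groups = defaultdict(list)
    for concept, semantic_type in concept_types.items():
        groups[semantic_type].append(concept)
    return {t: sorted(names) for t, names in groups.items()}


if __name__ == "__main__":
    for semantic_type, names in partition_by_type(CONCEPT_TYPES).items():
        print(f"{semantic_type}: {', '.join(names)}")
```
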
Issues and challenges
- Availability
  - Mandatory license agreement
- Discoverability
  - No metadata
- Formalism
  - No easy conversion to SKOS/RDF(S)/OWL
- Identifiers
- Steep learning curve

Availability
- Some source vocabularies have intellectual property restrictions
  - E.g., most drug vocabularies
  - Complex agreement for SNOMED CT: available at no cost for member countries of the IHTSDO
- Mandatory license agreement
  - No cost for research
  - May require negotiation with the vocabulary developer for production applications
- MetamorphoSys helps extract selected sources from the UMLS

Discoverability
- Discoverability of individual concepts
  - UMLSKS web services
  - Search all UMLS source vocabularies at the same time
  - Named entity recognition/normalization (e.g., MetaMap)
- Discoverability of terminologies/ontologies
  - No comprehensive registries
  - No rich registries
    - With rich metadata supporting the discoverability of terminologies/ontologies

Formalism
- UMLS: Proprietary format
  - Rich Release Format (RRF)
  - All terminologies/ontologies represented in the same format
- No easy conversion to SKOS/RDF(S)/OWL
  - Underspecified semantics
    - Child/parent does not necessarily mean subClassOf
  - Complex semantics
    - Descriptors / concepts / terms
  - Rich attribute set

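To make the conversion problem concrete, here is a minimal sketch that reads one pipe-delimited row in the layout of the Metathesaurus names file (MRCONSO.RRF) and emits SKOS triples as N-Triples strings. The base URI is hypothetical, and the sample row is a simplified placeholder: its LUI, SUI, and AUI values are made up, and only the CUI, source code, and name come from the slides. Every name is emitted as `skos:altLabel`, because picking a preferred label or turning child/parent links into `rdfs:subClassOf` is exactly the kind of decision the slide flags as underspecified.

```python
# Column order of MRCONSO.RRF rows (each row ends with a trailing "|").
MRCONSO_FIELDS = [
    "CUI", "LAT", "TS", "LUI", "STT", "SUI", "ISPREF", "AUI", "SAUI",
    "SCUI", "SDUI", "SAB", "TTY", "CODE", "STR", "SRL", "SUPPRESS", "CVF",
]

BASE = "http://example.org/umls/"  # hypothetical namespace, not an NLM URI
SKOS = "http://www.w3.org/2004/02/skos/core#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Placeholder row: LUI/SUI/AUI values are invented for illustration.
SAMPLE = ("C0001403|ENG|P|L0000001|PF|S0000001|Y|A0000001||||"
          "MSH|MH|D000224|Addison Disease|0|N||")


def parse_row(line):
    """Split one RRF row into a field dictionary (extra trailing field dropped)."""
    values = line.rstrip("\n").split("|")
    return dict(zip(MRCONSO_FIELDS, values))


def row_to_triples(row):
    """Emit a concept typing triple and one altLabel triple for the row."""
    subject = f"<{BASE}{row['CUI']}>"
    label = row["STR"].replace("\\", "\\\\").replace('"', '\\"')
    lang = "@en" if row["LAT"] == "ENG" else ""  # other languages left untagged
    return [
        f"{subject} <{RDF_TYPE}> <{SKOS}Concept> .",
        f'{subject} <{SKOS}altLabel> "{label}"{lang} .',
    ]


if __name__ == "__main__":
    for triple in row_to_triples(parse_row(SAMPLE)):
        print(triple)
```
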
Identifiers for biomedical entities
- What is identified?
  - Entity vs. resource about the entity
- Which identifier to pick?
  - E.g., Addison’s disease
    - 363732003 (SNOMED CT)
    - D000224 (MeSH)
    - C0001403 (UMLS Metathesaurus)
- Which format?
  - URI vs. LSID
- Which authoritative source for minting URIs?
  - Ontology developers vs. (e.g.) Bio2RDF

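The format question can be illustrated by writing the same identifier both ways. This is a sketch under stated assumptions: the base URI, the authority `example.org`, and the namespace `umls` are hypothetical and do not correspond to any identifiers actually minted by NLM or Bio2RDF. The LSID follows the general `urn:lsid:authority:namespace:object` pattern.

```python
from urllib.parse import quote


def as_uri(base, local_id):
    """Build an HTTP URI by appending a percent-encoded local identifier."""
    return f"{base.rstrip('/')}/{quote(local_id, safe='')}"


def as_lsid(authority, namespace, object_id):
    """Build an LSID of the form urn:lsid:authority:namespace:object."""
    return f"urn:lsid:{authority}:{namespace}:{object_id}"


if __name__ == "__main__":
    # Hypothetical authority and namespace, chosen only for the example.
    print(as_uri("http://example.org/umls/", "C0001403"))
    print(as_lsid("example.org", "umls", "C0001403"))
```

Either form only settles the syntax; which authority mints the identifier, and whether it names the disease or a record about the disease, remain open questions.
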
Steep learning curve
- Large resource
  - 1.5M concepts
  - 6M terms
  - Over 20M relations
- Complex structure
  - Metathesaurus
  - Semantic Network
- Rich set of attributes
- Rich set of relations
  - Terminological
  - Semantic
  - Statistical
  - Mapping
- Multiple languages
- Complex domain

Conclusions
- UMLS as a terminology integration system
  - Helps bridge across namespaces
  - Helps integrate information sources
    - Beyond shared identifiers
- UMLS as a repository of terminologies/ontologies
  - Single source, single format for 143 vocabularies
- Issues with availability, discoverability and formalism
- Identifiers for biomedical entities

References
- UMLS: umlsinfo.nlm.nih.gov
- UMLS browsers (free, but UMLS license required)
  - Knowledge Source Server: umlsks.nlm.nih.gov
  - Semantic Navigator: http://mor.nlm.nih.gov/perl/semnav.pl
  - RRF browser (standalone application distributed with the UMLS)

References
- Recent overviews
  - Bodenreider O. (2004). The Unified Medical Language System (UMLS): Integrating biomedical terminology. Nucleic Acids Research; D267-D270.
  - Bodenreider O. From terminology integration to information integration: Unified Medical Language System (UMLS). BioRDF Teleconference, W3C Semantic Web Health Care and Life Sciences Interest Group, June 5, 2006. http://mor.nlm.nih.gov/pubs/pres/060605-BioRDF.pdf

Medical Ontology Research
Olivier Bodenreider
Lister Hill National Center for Biomedical Communications
Bethesda, Maryland, USA
Contact: [email protected]
Web: mor.nlm.nih.gov