Tools For Classification Integration. Brian A. Carlsen, Apelon, Inc. Networked Knowledge Organization Systems/Services Workshop, June 28, 2001. Presentation outline: State of the UMLS Metathesaurus; Life-cycle of a Source; Tools and Processes; Challenges; Further Approaches.

Brian A. Carlsen
Apelon, Inc.
Tools For Classification Integration
Networked Knowledge Organization Systems/Services Workshop
June 28, 2001

Presentation Outline
• State of the UMLS Metathesaurus
• Life-cycle of a Source
• Tools and Processes
• Challenges
• Further Approaches

State of the UMLS Metathesaurus
• Concept orientation, concept persistence
• Growth to over 800,000 concepts and over 60 vocabulary families
• Over 1000 users worldwide
• Uses of the Metathesaurus
  • Natural Language Processing
  • Knowledge Representation
  • Patient Record Systems
  • Linking Patient Data to Knowledge Sources
  • Automated Indexing/Retrieval

Concept and Name Counts By Release Year
[Chart: two series, Concepts and Names, plotted for release years 1990 to 2001; vertical axis from 0 to 1,800,000.]

English Word, String Counts by Release Year
[Chart: two series, Lowercase Words and Strings, plotted for release years 1990 to 2001; vertical axis from 0 to 1,600,000.]

Life-cycle of a Source: Inversion
• Source arrives in “machine readable” format*
  • Many formats are used, including PDF, Clipper dump files, WordPerfect files, unit-record formats, and relational flat files.
• Source undergoes “inversion”
  • Requires a human
  • Input is the machine-readable file
  • Process is source-specific
  • Output is a common relational flat-file format used internally.

Life-cycle of a Source: Insertion
• A “Recipe” is created
• Test insertion to validate recipe
• Insertion and matching
  • Load common format into database
  • Match to existing content algorithmically
    • Use string normalization
    • Determine SAFE vs. UNSAFE matches
  • Prepare data for editing
• Process is fully undoable
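The matching step on this slide can be sketched as follows. This is a minimal illustration, not the actual insertion code: `normalize` is a simplified stand-in for the string normalization the slide mentions, the concept ids are made up, and the rule used for SAFE vs. UNSAFE (a match is SAFE only when the normalized string points to exactly one existing concept) is an assumption made for the example.

```python
from __future__ import annotations

import re
from collections import defaultdict


def normalize(term: str) -> str:
    """Simplified normalization: lowercase, drop punctuation, sort the words."""
    words = re.findall(r"[a-z0-9]+", term.lower())
    return " ".join(sorted(words))


def build_index(existing: dict[str, list[str]]) -> dict[str, set[str]]:
    """Map each normalized name to the set of concept ids that carry it."""
    index: dict[str, set[str]] = defaultdict(set)
    for concept_id, names in existing.items():
        for name in names:
            index[normalize(name)].add(concept_id)
    return index


def match_term(term: str, index: dict[str, set[str]]) -> tuple[str, list[str]]:
    """Classify an incoming term as NEW, SAFE, or UNSAFE (hypothetical rule)."""
    candidates = sorted(index.get(normalize(term), set()))
    if not candidates:
        return "NEW", []
    if len(candidates) == 1:
        return "SAFE", candidates
    return "UNSAFE", candidates  # ambiguous: leave the decision to an editor


# Hypothetical existing content; "Cold" is deliberately ambiguous.
existing = {
    "C1": ["Common cold", "Cold, common"],
    "C2": ["Cold"],  # the illness
    "C3": ["Cold"],  # the temperature
}
index = build_index(existing)
common_cold = match_term("COMMON COLD", index)
cold = match_term("cold", index)
unseen = match_term("Head of pancreas", index)
```

Under these assumptions "COMMON COLD" matches a single concept and can be merged automatically, "cold" matches two concepts and is queued for human review, and "Head of pancreas" is treated as new content.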

Life-cycle of a Source: Editing
• Predicate-based partitioning
• Workflow management
  • Review ALL content for new sources
  • Review UNSAFE content for updates
• Human Review
• QA Driven Editing
  • Source-specific QA
  • Feedback QA
  • Conservation of Mass QA
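The slides do not define "Conservation of Mass QA". One plausible reading, assumed here, is that editing may move atoms between concepts but must never lose or duplicate them. The sketch below checks that invariant; the function names and the atom and concept ids are hypothetical.

```python
from __future__ import annotations

from collections import Counter


def atom_counts(concepts: dict[str, list[str]]) -> Counter:
    """Count every atom id across all concepts."""
    return Counter(atom for atoms in concepts.values() for atom in atoms)


def conservation_report(before: dict[str, list[str]],
                        after: dict[str, list[str]]) -> dict[str, list[str]]:
    """List atoms that disappeared or appeared between two snapshots."""
    b, a = atom_counts(before), atom_counts(after)
    return {
        "lost": sorted((b - a).elements()),
        "gained": sorted((a - b).elements()),
    }


before = {"C1": ["A1", "A2"], "C2": ["A3"]}
good_edit = {"C1": ["A1", "A2", "A3"]}            # C2 merged into C1
bad_edit = {"C1": ["A1"], "C2": ["A3", "A3"]}      # A2 dropped, A3 duplicated

good_report = conservation_report(before, good_edit)
bad_report = conservation_report(before, bad_edit)
```

A merge leaves both lists empty, while the bad edit reports `A2` as lost and the extra copy of `A3` as gained.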

Life-cycle of a Source: Release
• Synchronize editing changes
  • State-based model
• Release data in desired format
  • Full release/partial release
• Transform base release
  • “MetamorphoSys”
  • Remove unlicensed data
  • Create “Content Views”

Tools and Processes: Overview
• Humans vs. Computers
  • Humans are good at making content decisions
  • Computers are good at automating tasks
• Tools vs. Processes
  • Tools enable computers to automate tasks
  • Processes keep humans productive.

Tools and Processes: Pre-Editing
• No common data representation
• Source-by-source conversion to common format
  • Perl, Unix tools
• What would a common format need?
  • Represent terms and attributes
  • Represent within-source relationships
  • Represent hierarchies
  • Represent external-source relationships
  • Represent classifications (e.g. Concept)
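The requirements listed on this slide can be made concrete with a small data model. This is only a sketch under assumptions: the class and field names, the source names `SRC_A` and `SRC_B`, and the convention that an `isa` relationship encodes hierarchy are all invented for illustration and are not the internal flat-file format the slides refer to.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Term:
    term_id: str
    name: str
    attributes: dict[str, str] = field(default_factory=dict)


@dataclass
class Relationship:
    from_id: str
    rel_type: str   # "isa" is used for hierarchy in this sketch
    to_id: str
    to_source: str  # differs from the owning source for external links


@dataclass
class SourceRecordSet:
    source: str
    terms: dict[str, Term] = field(default_factory=dict)
    relationships: list[Relationship] = field(default_factory=list)
    # Classification id (e.g. a concept) -> ids of the terms grouped under it.
    classifications: dict[str, list[str]] = field(default_factory=dict)

    def children(self, term_id: str) -> list[str]:
        """Within-source hierarchy: terms that are 'isa' children of term_id."""
        return [r.from_id for r in self.relationships
                if r.rel_type == "isa" and r.to_id == term_id
                and r.to_source == self.source]

    def external_links(self) -> list[Relationship]:
        """Relationships that point into a different source."""
        return [r for r in self.relationships if r.to_source != self.source]


src = SourceRecordSet(source="SRC_A")
src.terms["T1"] = Term("T1", "Neoplasm of pancreas", {"code": "X1"})
src.terms["T2"] = Term("T2", "Malignant neoplasm of pancreas")
src.relationships.append(Relationship("T2", "isa", "T1", "SRC_A"))
src.relationships.append(Relationship("T2", "mapped_to", "Z9", "SRC_B"))
src.classifications["CONCEPT_1"] = ["T2"]
```

Keeping the target source on every relationship lets one structure cover both within-source and external-source relationships.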

Tools and Processes: Editing
• Workflow Management
  • Report Generation
• State Model vs. Action Model
  • Actions represented as new states vs.
  • Single state + actions as data
• Human Editing
  • Interface enabling “high level cognitive editing”
  • LVG: String Normalization
• Automated Editing
  • Safe vs. Unsafe, Integrities
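The contrast between a state model and an action model can be sketched as below. Both classes are hypothetical illustrations, not the tools described in the talk: the first stores every edit as a complete new state, the second keeps a single state and records each action as data so that it can be undone.

```python
import copy


class StateModelStore:
    """State model: every edit appends a complete new state."""

    def __init__(self, initial):
        self.states = [copy.deepcopy(initial)]

    @property
    def current(self):
        return self.states[-1]

    def merge(self, src, dst):
        new_state = copy.deepcopy(self.current)
        new_state[dst].extend(new_state.pop(src))
        self.states.append(new_state)

    def undo(self):
        if len(self.states) > 1:
            self.states.pop()


class ActionModelStore:
    """Action model: one mutable state plus a log of actions kept as data."""

    def __init__(self, initial):
        self.current = copy.deepcopy(initial)
        self.actions = []

    def merge(self, src, dst):
        moved = self.current.pop(src)
        self.current[dst].extend(moved)
        self.actions.append(("merge", src, dst, moved))

    def undo(self):
        if not self.actions:
            return
        _, src, dst, moved = self.actions.pop()
        keep = len(self.current[dst]) - len(moved)
        self.current[dst] = self.current[dst][:keep]
        self.current[src] = moved


initial = {"C1": ["A1"], "C2": ["A2", "A3"]}
by_state = StateModelStore(initial)
by_action = ActionModelStore(initial)
for store in (by_state, by_action):
    store.merge("C2", "C1")
merged = copy.deepcopy(by_action.current)
for store in (by_state, by_action):
    store.undo()
```

The state model makes undo trivial at the cost of storing full copies; the action model stores far less but each action type needs its own inverse.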

Tools and Processes: Release
• License Agreements
• Content Views
  • e.g. Indexing View
  • Filter by Semantic Type
  • Filter by Language
• Alternative Release Formats
• Updates
• MetamorphoSys
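A content view can be thought of as a filter over the base release. The sketch below assumes a flat list of atoms with a source, a language, and a semantic type; the `Atom` class, the `content_view` function, and the source names are hypothetical and do not reflect how MetamorphoSys is implemented.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Atom:
    concept_id: str
    name: str
    source: str
    language: str
    semantic_type: str


def content_view(atoms, *, languages=None, semantic_types=None,
                 licensed_sources=None):
    """Keep atoms that pass every filter; a filter set to None is skipped."""
    def keep(atom: Atom) -> bool:
        return (
            (languages is None or atom.language in languages)
            and (semantic_types is None or atom.semantic_type in semantic_types)
            and (licensed_sources is None or atom.source in licensed_sources)
        )
    return [atom for atom in atoms if keep(atom)]


atoms = [
    Atom("C1", "Common cold", "SRC_A", "ENG", "Disease or Syndrome"),
    Atom("C1", "Resfriado común", "SRC_B", "SPA", "Disease or Syndrome"),
    Atom("C2", "Pancreas", "SRC_A", "ENG", "Body Part, Organ, or Organ Component"),
    Atom("C2", "Pancreas", "SRC_C", "ENG", "Body Part, Organ, or Organ Component"),
]
english_diseases = content_view(
    atoms, languages={"ENG"}, semantic_types={"Disease or Syndrome"})
licensed_only = content_view(atoms, licensed_sources={"SRC_A", "SRC_B"})
```

Removing unlicensed data is expressed here as one more filter, so a license restriction and a view such as "English disease names" compose in a single call.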

Challenges: Ambiguity
• Ambiguous Strings
  • e.g. “Cold”
  • Solution: Disambiguating strings, Preferred Names with “face validity”, Integrity checks when merging.
• Not fully specified Strings
  • e.g. “Head of Pancreas” within “Malignant Neoplasm of Pancreas”
  • Solution: Fully specified preferred name.

Challenges: What is a Classification?
• A classification is any grouping of terms with a consistent semantics.
• Thesauri typically group terms by meaning into concepts (synonymy).
• Alternatives
  • Neighborhoods (e.g. Descriptors in MeSH).
  • Near-synonymy
  • No classification (identity or term classification).
  • Lexical
• Connecting relationships/attributes to classifiers

Challenges: Precedence
• Concepts (or other classifications) generally have a preferred name
• A thesaurus will have terms from different sources competing for precedence
• Source precedence should be a user-level choice
• Preferred name should not be used as a proxy for concept-ness
• Every level of classification should have a preferred term
• Preferred name exists primarily for “face validity”
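Treating source precedence as a user-level choice means the preferred name is computed from a ranking the user supplies rather than stored once for everyone. A minimal sketch, assuming each name is a (source, name) pair and using invented source names; the alphabetical tie-break is also an assumption:

```python
from __future__ import annotations


def preferred_name(names: list[tuple[str, str]],
                   source_precedence: list[str]) -> str:
    """Pick the name from the highest-ranked source.

    Sources missing from the ranking come last; ties are broken
    alphabetically so the result is deterministic.
    """
    rank = {source: i for i, source in enumerate(source_precedence)}
    best = min(names, key=lambda pair: (rank.get(pair[0], len(rank)), pair[1]))
    return best[1]


names = [
    ("SRC_A", "Cold"),
    ("SRC_B", "Common cold"),
    ("SRC_C", "Acute nasopharyngitis"),
]
clinician_view = preferred_name(names, ["SRC_C", "SRC_B"])
consumer_view = preferred_name(names, ["SRC_B", "SRC_A"])
```

The same concept yields a different preferred name for each ranking, which is why the preferred name should not be read as evidence of what the concept is.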

Challenges: Update Model
• Constituent sources of a thesaurus will be updated
• Editing cycle
  • Updated sources will require editing
  • Typically overlap is > 90%
  • Overlap can safely replace the old version’s content
  • Safe replacements should not be edited
  • Ideally, source providers would indicate replacement; otherwise it must be computed
• Release
  • Release changes
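When a source provider does not indicate which entries replace which, the overlap has to be computed. The sketch below assumes each version is a mapping from source code to name and treats an entry as a safe replacement only when both code and name are unchanged; the function names, codes, and that matching rule are assumptions for illustration.

```python
from __future__ import annotations


def diff_versions(old: dict[str, str], new: dict[str, str]) -> dict[str, set[str]]:
    """Split codes into safe replacements, items needing review, and retired items."""
    shared = old.keys() & new.keys()
    unchanged = {code for code in shared if old[code] == new[code]}
    return {
        "safe_replacement": unchanged,
        "needs_review": (shared - unchanged) | (new.keys() - old.keys()),
        "retired": old.keys() - new.keys(),
    }


def overlap_ratio(old: dict[str, str], new: dict[str, str]) -> float:
    """Fraction of the new version that safely replaces old content."""
    if not new:
        return 1.0
    return len(diff_versions(old, new)["safe_replacement"]) / len(new)


old = {"001": "Cold", "002": "Common cold",
       "003": "Head of pancreas", "004": "Pancreas"}
new = {"001": "Cold", "002": "Common cold",
       "003": "Head of the pancreas", "005": "Tail of pancreas"}
report = diff_versions(old, new)
ratio = overlap_ratio(old, new)
```

Only the `needs_review` set goes to editors; in this toy data the overlap is 50%, whereas the slide reports that real updates typically overlap by more than 90%.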

Further Approaches: Description Logic
• What is it?
  • Concepts (or other classifications) are axioms
  • Relationships (roles) are theorems
  • The transitive closure of the roles across the concepts is computed to ensure no violations.
    • e.g. A isa B, B isa C, C isa A (violation!)
• When is it useful?
  • In formalized, static domains like Anatomy
• When is it not useful?
  • Performance > formalism
  • In dynamic, loosely coupled domains like Genomics
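The violation in the example (A isa B, B isa C, C isa A) is a cycle in the isa hierarchy. The sketch below is not a description logic classifier; it only illustrates that one check, using a depth-first search over hypothetical (child, parent) pairs.

```python
from __future__ import annotations


def find_isa_cycle(isa_pairs: list[tuple[str, str]]) -> list[str] | None:
    """Return one cycle as a list of nodes (first node repeated), or None."""
    graph: dict[str, list[str]] = {}
    for child, parent in isa_pairs:
        graph.setdefault(child, []).append(parent)
        graph.setdefault(parent, [])

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = dict.fromkeys(graph, WHITE)

    def visit(node: str, path: list[str]) -> list[str] | None:
        color[node] = GREY
        path.append(node)
        for parent in graph[node]:
            if color[parent] == GREY:  # reached a node already on the path
                return path[path.index(parent):] + [parent]
            if color[parent] == WHITE:
                cycle = visit(parent, path)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for node in graph:
        if color[node] == WHITE:
            cycle = visit(node, [])
            if cycle:
                return cycle
    return None


violation = find_isa_cycle([("A", "B"), ("B", "C"), ("C", "A")])
clean = find_isa_cycle([("A", "B"), ("B", "C")])
```

The recursive search is adequate for a sketch; a hierarchy with very deep chains would need an iterative traversal to stay within Python's recursion limit.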

Further Approaches: Standards XML
• Standardized Terminology/Ontology Representation
  • XML is the most likely candidate
  • Ideally would support
    • Links to external sources
    • Relationships between different levels of classification
    • Update model
    • Description Logic Metadata
• Standardized Thesaurus Representation
  • XML Repository
  • Standard Object Representations
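As a rough idea of what an XML representation of one concept could look like, the sketch below serializes a concept with Python's standard `xml.etree.ElementTree`. The element and attribute names (`concept`, `preferredName`, `term`, `relationship`, `targetSource`) are invented for this example and do not correspond to any published standard.

```python
import xml.etree.ElementTree as ET


def concept_to_xml(concept_id, preferred, terms, relationships):
    """Serialize one concept; element names are illustrative only.

    terms: list of (source, name) pairs.
    relationships: list of (type, target id, target source) triples.
    """
    concept = ET.Element("concept", id=concept_id)
    ET.SubElement(concept, "preferredName").text = preferred
    for source, name in terms:
        term = ET.SubElement(concept, "term", source=source)
        term.text = name
    for rel_type, target, target_source in relationships:
        ET.SubElement(concept, "relationship", type=rel_type,
                      target=target, targetSource=target_source)
    return ET.tostring(concept, encoding="unicode")


xml_text = concept_to_xml(
    "C1",
    "Common cold",
    [("SRC_A", "Cold"), ("SRC_B", "Common cold")],
    [("isa", "C9", "SRC_A")],
)
parsed = ET.fromstring(xml_text)
```

Carrying the target source on each relationship is one way to cover the "links to external sources" requirement within the same element.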

Conclusion: Lessons Learned
• Use the Web
• Use current technology
• Use Description Logic where appropriate
• Make editing intuitive
• Automate tasks
  • “A well-understood, reproducible, automated process that succeeds 95% of the time is a vast improvement over a poorly-understood, labor-intensive process that is believed to succeed 100% of the time.”
  • Review UNSAFE automated tasks.
  • Stop automating when marginal utility falls below a threshold.