Page 1: Brian A. Carlsen Apelon, Inc

Brian A. Carlsen

Apelon, Inc.

Tools For Classification Integration

Networked Knowledge Organization Systems/Services Workshop

June 28, 2001

Presentation Outline

• State of the UMLS Metathesaurus

• Life-cycle of a Source

• Tools and Processes

• Challenges

• Further Approaches

State of the UMLS Metathesaurus

• Concept orientation, concept persistence

• Growth to over 800,000 concepts and over 60 vocabulary families

• Over 1000 users worldwide

• Uses of the Metathesaurus

  • Natural Language Processing

  • Knowledge Representation

  • Patient Record Systems

  • Linking Patient Data to Knowledge Sources

  • Automated Indexing/Retrieval

Concept and Name Counts By Release Year

[Chart: counts of Concepts and Names by release year, 1990 to 2001; vertical axis from 0 to 1,800,000]

English Word, String Counts by Release Year

[Chart: counts of Lowercase Words and Strings by release year, 1990 to 2001; vertical axis from 0 to 1,600,000]

Outline

• State of the UMLS Metathesaurus

• Life-cycle of a Source

• Tools and Processes

• Challenges

• Further Approaches

Life-cycle of a Source: Inversion

• Source arrives in “machine readable” format*

  • Many formats are used, including PDF, Clipper dump files, WordPerfect files, unit-record formats, and relational flat files.

• Source undergoes “inversion”

  • Requires a human

  • Input is this machine readable file

  • Process is source-specific

  • Output is a common relational flat-file format used internally.

Life-cycle of a Source: Insertion

• A “Recipe” is created

• Test insertion to validate recipe

• Insertion and matching.

  • Load common format into database

• Match to existing content algorithmically

  • Use string normalization

  • Determine SAFE vs. UNSAFE matches

• Prepare data for editing

• Process is fully undoable
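
The matching step above can be sketched roughly as follows. This is a minimal illustration, not the Metathesaurus algorithm: `normalize` is a simplified stand-in for real lexical normalization, and the SAFE/UNSAFE rule used here (a match is SAFE only when the normalized string points to exactly one existing concept) is an assumption made for the example, since the slides do not state the criteria. Concept ids and terms are hypothetical.

```python
import re
from collections import defaultdict


def normalize(term):
    """Simplified stand-in for lexical normalization:
    lowercase, drop punctuation, sort the words."""
    words = re.findall(r"[a-z0-9]+", term.lower())
    return " ".join(sorted(words))


def build_index(existing):
    """Map each normalized string to the concept ids that contain it."""
    index = defaultdict(set)
    for concept_id, terms in existing.items():
        for term in terms:
            index[normalize(term)].add(concept_id)
    return index


def match(term, index):
    """Classify a new term as NEW, SAFE, or UNSAFE (assumed rule)."""
    candidates = sorted(index.get(normalize(term), set()))
    if not candidates:
        return "NEW", []
    if len(candidates) == 1:
        return "SAFE", candidates
    return "UNSAFE", candidates


# Hypothetical concept ids and terms, for illustration only.
existing = {
    "C1": ["Common cold", "Cold, common"],
    "C2": ["Cold temperature", "Cold"],
    "C3": ["Chronic obstructive lung disease", "COLD"],
}
index = build_index(existing)
print(match("cold, Common", index))      # one candidate concept
print(match("Cold", index))              # several candidates, needs review
print(match("Head of pancreas", index))  # no match in existing content
```

Under this rule, only the UNSAFE matches would be routed to human review, which is the division of labor the later slides describe.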

Life-cycle of a Source: Editing

• Predicate-based partitioning

• Workflow management

  • Review ALL content for new sources

  • Review UNSAFE content for updates

• Human Review

• QA Driven Editing

  • Source-specific QA

  • Feedback QA

  • Conservation of Mass QA
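
The slides do not define "Conservation of Mass QA". One plausible reading, assumed here purely for illustration, is that every row loaded from a source must be accounted for after editing: it is either still present or was explicitly retired. The function name, parameters, and identifiers below are hypothetical.

```python
def conservation_of_mass(loaded_ids, released_ids, retired_ids):
    """Report rows that vanished or appeared between load and release.

    loaded_ids: identifiers of rows loaded from the source.
    released_ids: identifiers present after editing.
    retired_ids: identifiers deliberately removed, with a recorded reason.
    """
    loaded = set(loaded_ids)
    released = set(released_ids)
    accounted_for = released | set(retired_ids)
    missing = sorted(loaded - accounted_for)   # lost without explanation
    unexpected = sorted(released - loaded)     # appeared from nowhere
    return missing, unexpected


# Hypothetical identifiers.
missing, unexpected = conservation_of_mass(
    loaded_ids=["A1", "A2", "A3", "A4"],
    released_ids=["A1", "A2", "A9"],
    retired_ids=["A3"],
)
print("missing:", missing)
print("unexpected:", unexpected)
```

A check like this passes only when both lists are empty, so it can run unattended after every editing cycle.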

Life-cycle of a Source: Release

• Synchronize editing changes

  • State-based model

• Release data in desired format

  • Full release/partial release

• Transform base release

  • “MetamorphoSys”

  • Remove unlicensed data

  • Create “Content Views”

Outline

• State of the UMLS Metathesaurus

• Life-cycle of a Source

• Tools and Processes

• Challenges

• Further Approaches

Tools and Processes: Overview

• Humans vs. Computers

  • Humans are good at making content decisions

  • Computers are good at automating tasks

• Tools vs. Processes

  • Tools enable computers to automate tasks

  • Processes keep humans productive.

Tools and Processes: Pre-Editing

• No common data representation

• Source-by-source conversion to common format

  • Perl, Unix tools

• What would a common format need?

  • Represent terms and attributes

  • Represent within-source relationships

  • Represent hierarchies

  • Represent external-source relationships

  • Represent classifications (e.g. Concept)
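
To make the requirements above concrete, here is one possible shape for such a common format. It is a sketch only: the record types, field names, row prefixes, and sample data are all hypothetical and are not the format used internally.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Term:
    source: str              # source vocabulary abbreviation
    term_id: str             # identifier assigned by the source
    text: str
    attributes: tuple = ()   # (name, value) pairs


@dataclass(frozen=True)
class Relationship:
    from_id: str
    rel: str                          # e.g. "parent" for hierarchies
    to_id: str
    to_source: Optional[str] = None   # None means within-source


@dataclass(frozen=True)
class Classification:
    kind: str            # e.g. "Concept"
    class_id: str
    member_ids: tuple    # term ids grouped under this classification


def to_rows(terms, relationships, classifications):
    """Flatten the records into pipe-delimited rows, one prefix per record type."""
    rows = []
    for t in terms:
        rows.append("|".join(["TERM", t.source, t.term_id, t.text]))
        for name, value in t.attributes:
            rows.append("|".join(["ATTR", t.term_id, name, value]))
    for r in relationships:
        rows.append("|".join(["REL", r.from_id, r.rel, r.to_id, r.to_source or ""]))
    for c in classifications:
        for member in c.member_ids:
            rows.append("|".join(["CLASS", c.kind, c.class_id, member]))
    return rows


# Hypothetical sample data.
terms = [
    Term("SRC_A", "A1", "Pancreas", (("TERM_TYPE", "PT"),)),
    Term("SRC_A", "A2", "Head of pancreas"),
]
relationships = [
    Relationship("A2", "parent", "A1"),              # hierarchy within SRC_A
    Relationship("A1", "mapped_to", "B7", "SRC_B"),  # external-source link
]
classifications = [Classification("Concept", "K1", ("A1",))]

rows = to_rows(terms, relationships, classifications)
for row in rows:
    print(row)
```

Treating hierarchies as ordinary relationships with a reserved `rel` value keeps the number of record types small; a real format might well separate them.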

Tools and Processes: Editing

• Workflow Management

  • Report Generation

• State Model vs. Action Model

  • Actions represented as new states vs.

  • Single state + actions as data

• Human Editing

  • Interface enabling “high level cognitive editing”

  • LVG: String Normalization

• Automated Editing

  • Safe vs. Unsafe, Integrities
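
The contrast between a state model and an action model can be illustrated with a toy example. This is a sketch under assumptions: the single `move` action, the dictionary-of-sets state, and all identifiers are hypothetical and chosen only to show the two bookkeeping styles side by side.

```python
def apply(state, action):
    """Return a new state with one term moved between concepts."""
    _, term, src, dst = action
    new_state = {cid: set(terms) for cid, terms in state.items()}
    new_state[src].remove(term)
    new_state.setdefault(dst, set()).add(term)
    return new_state


def invert(action):
    """Build the action that undoes a move."""
    kind, term, src, dst = action
    return (kind, term, dst, src)


initial = {"C1": {"Cold", "Common cold"}, "C2": {"Cold temperature"}}
actions = [("move", "Cold", "C1", "C2")]

# State model: every action yields a stored snapshot.
snapshots = [initial]
for action in actions:
    snapshots.append(apply(snapshots[-1], action))

# Action model: one current state plus the actions kept as data.
current, log = initial, []
for action in actions:
    current = apply(current, action)
    log.append(action)

# Undo in the action model: replay the inverses in reverse order.
undone = current
for action in reversed(log):
    undone = apply(undone, invert(action))

print(snapshots[-1] == current)  # both models reach the same final state
print(undone == initial)         # the action log is enough to undo
```

The state model makes any earlier version directly available at the cost of storage; the action model stores less and keeps a record of what was done, but reconstructing history requires replaying or inverting actions.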

Tools and Processes: Release

• License Agreements

• Content Views

  • e.g. Indexing View

  • Filter by Semantic Type

  • Filter by Language

• Alternative Release Formats

• Updates

• MetamorphoSys
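
A content view of the kind listed above amounts to a filter over release rows. The sketch below is not MetamorphoSys; the `Row` fields, the `content_view` function, and the source names are hypothetical, and the sample values are only there to exercise the filters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    concept_id: str
    text: str
    language: str
    semantic_type: str
    source: str


def content_view(rows, languages=None, semantic_types=None, licensed_sources=None):
    """Keep rows that pass every filter that was supplied (None = no filter)."""
    def keep(row):
        return (
            (languages is None or row.language in languages)
            and (semantic_types is None or row.semantic_type in semantic_types)
            and (licensed_sources is None or row.source in licensed_sources)
        )
    return [row for row in rows if keep(row)]


# Hypothetical rows; identifiers and source names are made up.
rows = [
    Row("K1", "Common cold", "ENG", "Disease or Syndrome", "SRC_A"),
    Row("K1", "Rhume", "FRE", "Disease or Syndrome", "SRC_B"),
    Row("K2", "Pancreas", "ENG", "Body Part, Organ, or Organ Component", "SRC_C"),
]

english_diseases = content_view(
    rows, languages={"ENG"}, semantic_types={"Disease or Syndrome"}
)
licensed_only = content_view(rows, licensed_sources={"SRC_A", "SRC_C"})
print([r.text for r in english_diseases])
print([r.text for r in licensed_only])
```

Removing unlicensed data is expressed here as just another filter, which keeps license handling and view selection in one code path.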

Outline

• State of the UMLS Metathesaurus

• Life-cycle of a Source

• Tools and Processes

• Challenges

• Further Approaches

Challenges: Ambiguity

• Ambiguous Strings

  • e.g. “Cold”

  • Solution: Disambiguating strings, Preferred Names with “face validity”, Integrity checks when merging.

• Not fully specified Strings

  • e.g. “Head of Pancreas” within “Malignant Neoplasm of Pancreas”

  • Solution: Fully specified preferred name.

Challenges: What is a Classification?

• A classification is any grouping of terms with a consistent semantics.

• Thesauri typically group terms by meaning into concepts (synonymy).

• Alternatives

  • Neighborhoods (e.g. Descriptors in MeSH).

  • Near-synonymy

  • No classification (identity or term classification).

  • Lexical

• Connecting relationships/attributes to classifiers

Challenges: Precedence

• Concepts (or other classifications) generally have a preferred name

• A thesaurus will have terms from different sources competing for precedence

• Source precedence should be a user-level choice

• Preferred name should not be used as a proxy for concept-ness

• Every level of classification should have a preferred term

• Preferred name exists primarily for “face validity”
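
Making source precedence a user-level choice could look like the following sketch, in which the preferred name is computed from a precedence list supplied by each user rather than stored with the concept. The function name, source names, and terms are hypothetical.

```python
def preferred_name(terms, precedence):
    """Pick the term whose source ranks highest in the user's precedence list.

    terms: list of (source, text) pairs for one concept.
    precedence: source names, most preferred first; unlisted sources rank last.
    """
    rank = {source: i for i, source in enumerate(precedence)}
    return min(terms, key=lambda t: rank.get(t[0], len(rank)))[1]


# Hypothetical sources and terms for a single concept.
concept_terms = [
    ("SRC_A", "Myocardial infarction"),
    ("SRC_B", "Heart attack"),
    ("SRC_C", "MI"),
]

clinician_view = preferred_name(concept_terms, ["SRC_A", "SRC_B"])
consumer_view = preferred_name(concept_terms, ["SRC_B", "SRC_A"])
print(clinician_view)
print(consumer_view)
```

The same concept yields different preferred names for different users, while its membership is untouched, which is why the preferred name is a poor proxy for the concept itself.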

Challenges: Update Model

• Constituent sources of a thesaurus will be updated

• Editing cycle

  • Updated sources will require editing

  • Typically overlap is > 90%

  • Overlap can safely replace the old version’s content

  • Safe replacements should not be edited

  • Ideally, source providers would indicate replacement; otherwise it must be computed

• Release

  • Release changes
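
When a source provider does not indicate replacements, the overlap between two versions has to be computed. The sketch below assumes, for illustration only, that each version is a mapping from source code to term text and that an entry is a safe replacement when both code and text are unchanged; the real criteria are not given in the slides, and the codes and terms are hypothetical.

```python
def compare_versions(old, new):
    """Compare two versions of one source, each a dict of code -> term text."""
    safe = sorted(c for c in new if old.get(c) == new[c])
    changed = sorted(c for c in new if c in old and old[c] != new[c])
    added = sorted(set(new) - set(old))
    dropped = sorted(set(old) - set(new))
    overlap = len(safe) / len(new) if new else 0.0
    return {"safe": safe, "changed": changed, "added": added,
            "dropped": dropped, "overlap": overlap}


# Hypothetical codes and terms; a real update would be far larger.
old_version = {
    "001": "Common cold",
    "002": "Pancreas",
    "003": "Head of pancreas",
    "004": "Asthma",
}
new_version = {
    "001": "Common cold",
    "002": "Pancreas",
    "003": "Head of the pancreas",
    "005": "Influenza",
}

report = compare_versions(old_version, new_version)
print(report)
```

Only the `changed` and `added` entries would go to editors; the `safe` entries replace their predecessors without review. The overlap in this tiny sample is far lower than the figure the slide quotes for real updates.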

Outline

• State of the UMLS Metathesaurus

• Life-cycle of a Source

• Tools and Processes

• Challenges

• Further Approaches

Further Approaches: Description Logic

• What is it?

  • Concepts (or other classifications) are axioms

  • Relationships (roles) are theorems

  • The transitive closure of the roles across the concepts is computed to ensure no violations.

    • e.g. A isa B, B isa C, C isa A (!violation)

• When is it useful?

  • In formalized, static domains like Anatomy

• When is it not useful?

  • Performance > formalism

  • In dynamic, loosely coupled domains like Genomics
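
The isa violation in the example above can be detected by computing the transitive closure and looking for a concept that is its own ancestor. This is a minimal sketch of that one check, not a description logic classifier; the function names are hypothetical.

```python
from collections import defaultdict


def transitive_closure(edges):
    """Map every node to the set of all its ancestors.

    edges: iterable of (child, parent) pairs, read as "child isa parent".
    """
    graph = defaultdict(set)
    nodes = set()
    for child, parent in edges:
        graph[child].add(parent)
        nodes.update((child, parent))
    closure = {}
    for node in nodes:
        seen, stack = set(), list(graph[node])
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(graph[current])
        closure[node] = seen
    return closure


def cyclic_concepts(edges):
    """Concepts that are their own ancestor, i.e. isa violations."""
    closure = transitive_closure(edges)
    return sorted(node for node, ancestors in closure.items() if node in ancestors)


print(cyclic_concepts([("A", "B"), ("B", "C"), ("C", "A")]))  # the slide's example
print(cyclic_concepts([("A", "B"), ("B", "C")]))              # a valid hierarchy
```

Recomputing the closure from every node is quadratic in the worst case, which hints at why the slide lists performance as a reason not to use this kind of formalism everywhere.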

Further Approaches: Standards XML

• Standardized Terminology/Ontology Representation

  • XML is the most likely candidate

  • Ideally would support

    • Links to external sources

    • Relationships between different levels of classification

    • Update model

    • Description Logic Metadata

• Standardized Thesaurus Representation

  • XML Repository

  • Standard Object Representations
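
As a rough illustration of what an XML representation of a concept with links to external sources might look like, the sketch below serializes one concept with Python's standard `xml.etree.ElementTree`. The element and attribute names (`concept`, `term`, `link`, `preferred`, `ref`) are invented for the example and do not correspond to any standard; the identifiers and source names are hypothetical.

```python
import xml.etree.ElementTree as ET


def concept_to_xml(concept_id, terms, links):
    """Serialize one concept; element and attribute names are hypothetical."""
    concept = ET.Element("concept", id=concept_id)
    for source, text, preferred in terms:
        term = ET.SubElement(concept, "term", source=source,
                             preferred="true" if preferred else "false")
        term.text = text
    for rel, target_source, target_id in links:
        ET.SubElement(concept, "link", rel=rel,
                      source=target_source, ref=target_id)
    return ET.tostring(concept, encoding="unicode")


xml_text = concept_to_xml(
    "K1",
    terms=[("SRC_A", "Common cold", True),
           ("SRC_B", "Acute nasopharyngitis", False)],
    links=[("mapped_to", "SRC_C", "C42")],
)
print(xml_text)

# Round-trip check: the output parses back with the same content.
parsed = ET.fromstring(xml_text)
print([t.text for t in parsed.findall("term")])
```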

Conclusion: Lessons Learned

• Use the Web

• Use current technology

• Use Description Logic where appropriate

• Make editing intuitive

• Automate tasks

• “A well-understood, reproducible, automated process that succeeds 95% of the time is a vast improvement over a poorly-understood, labor-intensive process that is believed to succeed 100% of the time.”

• Review UNSAFE automated tasks.

• Stop automating when marginal utility falls below a threshold.