![Page 1: Mednlp@us.ibm.com MedKAT Medical Knowledge Analysis Tool December 2009](https://reader035.vdocuments.net/reader035/viewer/2022062804/5697bf911a28abf838c8e760/html5/thumbnails/1.jpg)
MedKAT: Medical Knowledge Analysis Tool
December 2009
Overview
✤ MedKAT and MedKAT/p
✤ Developed at IBM, donated to OHNLP with Apache license V2.0
✤ Goal:
✤ Identification of concepts and their attributes based on a standard or proprietary terminology/ontology
✤ “/p” adaptation to pathology reports – relation extraction
✤ UIMA-based, Modular, Generic, Expandable
✤ Terminology agnostic: able to plug in any terminology
✤ Easy adaptation to specific corpus and conventions
✤ Integration into institutional system
✤ Ongoing commitment to Research and Development
Core Components
✤ Document structure
✤ Syntactic tools (tokenization through shallow parsing)
✤ Negation
✤ Concept identification
✤ Relationship extraction
Document Structure
✤ Plain text or XML (e.g., CDA)
✤ Processes specific document section types (e.g., diagnosis)
✤ Detection of enumerated subsections (e.g., lists)
✤ Detection of formatting (e.g., bullets)
✤ Detection of relations between sections (e.g., coreference between corresponding lists appearing in different document sections)
✤ Making implicit conventions explicit (e.g., meaning of title)
Document Structure Annotators
Document Structure
Multiple document sections
Document Structure
Corresponding document subsections
Document Structure
Need to know document structure to be able to compute concept coreference during relation extraction
Syntactic Structure Annotators
Tokenization
Basic building block for subsequent annotators. The text:
poorly-differentiated/undifferentiated
could be tokenized as 1, 3, or 5 tokens.
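The three granularities can be illustrated with a minimal Python sketch. The splitting rules and function names here are assumptions chosen to reproduce the 1, 3, and 5 token counts; they are not MedKAT's actual tokenizer.

```python
import re

TEXT = "poorly-differentiated/undifferentiated"


def tokenize_whitespace(text):
    """Split on whitespace only, so the whole string stays one token."""
    return text.split()


def tokenize_on_slash(text):
    """Make '/' its own token but keep hyphenated words whole."""
    return [t for t in re.split(r"(/)", text) if t]


def tokenize_on_punctuation(text):
    """Make both '-' and '/' separate tokens."""
    return [t for t in re.split(r"([-/])", text) if t]


for tokenizer in (tokenize_whitespace, tokenize_on_slash, tokenize_on_punctuation):
    tokens = tokenizer(TEXT)
    print(len(tokens), tokens)
```

Whichever granularity is chosen, later annotators (POS tagging, lexicon lookup) see only those tokens, so the choice affects every downstream step.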
Part of Speech Tagger
✤ OpenNLP POS tagger with standard models
✤ Domain adaptation:
✤ Entries from lexicon are pre-tagged
✤ Rule-based overwriting of tags for specific cases
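The two adaptation steps can be sketched in Python as follows. Everything here is hypothetical: `base_tagger` is a stand-in for a statistical tagger such as the OpenNLP one (it is not the OpenNLP API), and the lexicon contents and the single override rule are invented for illustration.

```python
# Hypothetical lexicon of pre-tagged domain terms (lower-cased word -> tag).
LEXICON_TAGS = {"colonic": "JJ", "colic": "JJ"}


def base_tagger(tokens):
    """Stand-in for a statistical tagger; it naively tags every token NN."""
    return ["NN"] * len(tokens)


def override_rules(tokens, tags):
    """Example override rule (an assumption): all-digit tokens become CD."""
    return ["CD" if tok.isdigit() else tag for tok, tag in zip(tokens, tags)]


def tag(tokens):
    """Tag tokens, letting lexicon tags and then rules override the model."""
    tags = base_tagger(tokens)
    # Tokens found in the lexicon keep their lexicon tag.
    tags = [LEXICON_TAGS.get(tok.lower(), t) for tok, t in zip(tokens, tags)]
    # Rules run last so they can overwrite any earlier decision.
    return override_rules(tokens, tags)


print(tag(["Colonic", "mass", "3", "cm"]))
```

The ordering is a design choice in this sketch: rules run last so that a specific case can correct both the model and the lexicon.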
Shallow Parser
Merging NP Types
The shallow parser defines three types of noun phrase:
1. NP
2. NPP
3. NPList
Merging NP Types
The NPMerger module creates NPCombined annotations to cover all types of noun phrases.
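A rough Python sketch of the merging idea follows. The slides do not describe how NPMerger treats nested or overlapping phrases, so this version simply assumes one NPCombined annotation per distinct noun-phrase span. The `Annotation` class and `merge_noun_phrases` function are hypothetical and do not use the UIMA API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    type: str   # e.g. "NP", "NPP", "NPList"
    begin: int  # character offset where the span starts
    end: int    # character offset where the span ends


NP_TYPES = {"NP", "NPP", "NPList"}


def merge_noun_phrases(annotations):
    """Create one NPCombined annotation per distinct noun-phrase span."""
    spans = sorted({(a.begin, a.end) for a in annotations if a.type in NP_TYPES})
    return [Annotation("NPCombined", begin, end) for begin, end in spans]
```

Downstream annotators can then iterate over a single annotation type instead of handling NP, NPP, and NPList separately.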
Negation Annotators
Negation
✤ Keyword and syntactic analysis driven
✤ Set of keywords configurable via dictionary
✤ Type of syntactic phrase used to determine context is configurable
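One way to read these bullets is sketched below in Python: a concept counts as negated when it falls inside a phrase of the configured type that also contains a negation keyword. The keyword set, the data layout, and the function names are assumptions made for illustration, not MedKAT's implementation.

```python
import re

# Hypothetical keyword dictionary; in practice this would be loaded from a file.
NEGATION_KEYWORDS = {"no", "not", "without", "negative"}


def tokens_of(text):
    """Return (begin, end, text) triples for words and punctuation marks."""
    return [(m.start(), m.end(), m.group()) for m in re.finditer(r"\w+|[^\w\s]", text)]


def find_negated(tokens, phrases, concepts):
    """Return labels of concepts inside a phrase that contains a negation keyword.

    tokens:   list of (begin, end, text)
    phrases:  list of (begin, end) spans of the configured phrase type
    concepts: list of (begin, end, label)
    """
    negated = []
    for p_begin, p_end in phrases:
        has_keyword = any(
            p_begin <= b and e <= p_end and text.lower() in NEGATION_KEYWORDS
            for b, e, text in tokens
        )
        if has_keyword:
            negated.extend(
                label for b, e, label in concepts if p_begin <= b and e <= p_end
            )
    return negated
```

Changing which spans are passed as `phrases` (for example noun phrases versus whole sentences) changes the scope of negation, which corresponds to the configurable phrase type mentioned above.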
Concept Identification Annotators
Concept Identification
✤ Lexicon entries can be added, changed, deleted
✤ Lexicon entry attributes can be added, changed, deleted
✤ Search parameters can be modified
✤ Post processing filters
✤ Tokenization of text and lexicon should be the same
Lexicon Entries
✤ A sample lexicon entry. The variant elements define all of the synonyms that can be matched during lookup. Attributes on the “token” element apply to all variants but can be overridden within an individual variant (e.g., the “POS” attribute in some of the variant entries below).
<token canonical="colon, nos" CodeType="ICDO" CodeValue="C18.9" SemClass="Site" POS="NN">
  <variant base="colon, nos" />
  <variant base="colon" />
  <variant base="colonic" POS="JJ" />
  <variant base="colic" POS="JJ" />
  <variant base="large intestine" />
  <variant base="large bowel" />
</token>
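The inheritance-and-override behavior described above can be sketched in Python with the standard `xml.etree.ElementTree` module. The `expand_variants` function is a hypothetical helper for illustration; it is not how MedKAT loads its lexicon.

```python
import xml.etree.ElementTree as ET

ENTRY = """
<token canonical="colon, nos" CodeType="ICDO" CodeValue="C18.9" SemClass="Site" POS="NN">
  <variant base="colon, nos" />
  <variant base="colon" />
  <variant base="colonic" POS="JJ" />
  <variant base="colic" POS="JJ" />
  <variant base="large intestine" />
  <variant base="large bowel" />
</token>
"""


def expand_variants(token_xml):
    """Map each variant's surface form to its effective attributes."""
    token = ET.fromstring(token_xml.strip())
    shared = dict(token.attrib)  # attributes inherited by every variant
    variants = {}
    for variant in token.findall("variant"):
        attrs = {**shared, **variant.attrib}  # variant-level values win
        variants[attrs.pop("base")] = attrs
    return variants


for surface, attrs in expand_variants(ENTRY).items():
    print(surface, attrs["POS"], attrs["CodeValue"])
```

In this sketch "colonic" and "colic" end up with POS "JJ", while the other four variants inherit "NN" from the token element.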
Concept Identification Configuration
✤ Configured to find all matched entries, not just longest match, even if overlapping
✤ Case-insensitive
✤ Token order independent matching performed, e.g.: A B C = C A B
✤ Subsequent filtering used to remove unnecessary over-generated results
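A minimal Python sketch of this lookup configuration is shown below. It assumes matches are contiguous token windows and uses a toy lexicon; the function names and data layout are hypothetical, not the actual MedKAT lookup component.

```python
def build_index(lexicon):
    """Key each entry by its sorted, lower-cased tokens so word order is ignored."""
    index = {}
    for term, concept in lexicon.items():
        key = tuple(sorted(term.lower().split()))
        index.setdefault(key, []).append(concept)
    return index


def find_all_matches(tokens, index):
    """Return every (start, end, concept) match, including overlapping ones."""
    lengths = sorted({len(key) for key in index})
    matches = []
    for start in range(len(tokens)):
        for n in lengths:
            end = start + n
            if end > len(tokens):
                continue
            # Lower-casing gives case-insensitivity; sorting ignores token order.
            key = tuple(sorted(t.lower() for t in tokens[start:end]))
            for concept in index.get(key, []):
                matches.append((start, end, concept))
    return matches


# Toy lexicon (surface form -> concept label), invented for illustration.
LEXICON = {"colon": "colon, nos", "sigmoid colon": "sigmoid colon", "tumor": "tumor"}
print(find_all_matches(["Colon", "sigmoid", "tumor"], build_index(LEXICON)))
```

Because every match is kept, "Colon sigmoid" yields both the one-token and the two-token concept; deciding which to retain is left to the later filtering step.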
Concept Filters
✤ Remove:
✤ any duplicates over a single span
✤ generic terms like “tumor” if part of a longer term
✤ terms that contain other terms that were previously marked, such as a modifier
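The three removal rules might be sketched in Python as follows. The slides do not give exact definitions, so this sketch assumes that a "duplicate" is the same concept reported twice for the same span and that "previously marked" terms arrive as a list of spans. The generic-term set and all names are hypothetical.

```python
from typing import NamedTuple


class Concept(NamedTuple):
    begin: int
    end: int
    concept_id: str
    text: str


GENERIC_TERMS = {"tumor", "mass"}  # hypothetical list of generic terms


def strictly_inside(inner, outer):
    """True if inner's span lies within outer's span and the spans differ."""
    same = (inner.begin, inner.end) == (outer.begin, outer.end)
    return outer.begin <= inner.begin and inner.end <= outer.end and not same


def filter_concepts(concepts, marked_spans=()):
    """Apply the three removal rules to a list of Concept tuples."""
    # Rule 1: drop duplicates over a single span.
    seen, unique = set(), []
    for c in concepts:
        key = (c.begin, c.end, c.concept_id)
        if key not in seen:
            seen.add(key)
            unique.append(c)
    kept = []
    for c in unique:
        # Rule 2: drop a generic term that is part of a longer term.
        if c.text.lower() in GENERIC_TERMS and any(strictly_inside(c, o) for o in unique):
            continue
        # Rule 3: drop a term whose span contains a previously marked span.
        if any(c.begin <= b and e <= c.end for b, e in marked_spans):
            continue
        kept.append(c)
    return kept
```

A standalone generic term such as "tumor" survives rule 2 in this sketch, because it is only removed when a longer term covers it.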
Relationship Extraction Annotators
Relationship Extraction
✤ Find coreferences of both anatomical sites and histological diagnoses across document sections
✤ Discover relationships between named entities and build knowledge model:
✤ Tumors (primary, metastatic)
✤ Gross description
✤ Lymph nodes
Knowledge Model
✤ Benefits
✤ Summarization
✤ Comparison
✤ Change detection
✤ Temporal progression of disease
✤ Validation
✤ Manual annotation of pathology reports and clinical notes
The MedKAT/p Pipeline
MedKAT/p Annotator Pipeline
MedKAT/p Pipeline
✤ The full processing pipeline brings together all of the MedKAT components
✤ Used a manually annotated gold standard corpus of 302 documents: 201 documents for training, 101 for testing
✤ UIMA CAS can be output as database load file, XML, or other format using a UIMA CAS Consumer module
Concept Extraction Results
| | Training Instances | Test Instances | F-Score |
|---|---|---|---|
| Anatomical Site | 1,598 | 782 | 0.95 |
| Histology | 670 | 336 | 0.98 |
| Size | 942 | 471 | 1.00 |
| Date | 145 | 88 | 1.00 |
| Grade | 246 | 124 | 0.98 |
Model Extraction Results
| | Training Instances | Test Instances | F-Score |
|---|---|---|---|
| Gross Description | 277 | 137 | 0.80 |
| Lymph Nodes | 117 | 59 | 0.81 |
| Primary Tumor | 235 | 126 | 0.82 |
| Metastatic Tumor | 33 | 19 | 0.65 |
Summary
✤ MedKAT and MedKAT/p were developed at IBM, donated to OHNLP with Apache license V2.0
✤ Apache UIMA based solution for flexible, expandable system
✤ Concepts are identified, with their associated attributes, based on a standard or proprietary terminology/ontology
✤ The “/p” version has additional components for processing pathology reports