hisa ltd. biography proforma

26
ade et al. Corpus-based Error Detection in a Multilingual Medical Thesaurus HISA ltd. Biography proforma MEDINFO 2007 413 Lygon Street, Brunswick East 3057 Australia Presenter Name: Stefan Schulz Country: 1. Germany, 2. Brazil Qualification(s): MD (Doctor in Theoretical Medicine) Vocational Training in Medical Informatics Postdoctoral Habililtation degree in Medical Informatics Position: Associate Professor Department/ Organisation : 1. Medical Informatics, Freiburg University Medical Center, Freiburg, Germany 2. Master Program of Health Technology, Catholic University of Paraná, Curitiba, Brazil Major Achievement(s): Research in the fields of Medical Terminology, Biomedical Ontologies, Medical Language Processing, Text Mining, and Document Retrieval.

Upload: rachelle-roy

Post on 31-Dec-2015

42 views

Category:

Documents


3 download

DESCRIPTION

HISA ltd. Biography proforma. MEDINFO 2007 413 Lygon Street, Brunswick East 3057 Australia. Presenter Name: Stefan Schulz Country: 1. Germany, 2. Brazil Qualification(s) : MD (Doctor in Theoretical Medicine) Vocational Training in Medical Informatics - PowerPoint PPT Presentation

TRANSCRIPT

Andrade et al. Corpus-based Error Detection in a Multilingual Medical Thesaurus

HISA ltd. Biography proformaMEDINFO 2007

413 Lygon Street, Brunswick East 3057 Australia

Presenter Name: Stefan SchulzCountry: 1. Germany, 2. BrazilQualification(s):

MD (Doctor in Theoretical Medicine)Vocational Training in Medical InformaticsPostdoctoral Habililtation degree in Medical

InformaticsPosition:

Associate ProfessorDepartment/ Organisation :

1. Medical Informatics, Freiburg University Medical Center,

Freiburg, Germany2. Master Program of Health Technology, Catholic University

of Paraná, Curitiba, Brazil Major Achievement(s):

Research in the fields of Medical Terminology, Biomedical Ontologies, Medical Language Processing, Text Mining, and Document Retrieval.

Andrade et al. Corpus-based Error Detection in a Multilingual Medical Thesaurus

Roosewelt L. Andradea,b, Edson Pachecoa,b, Pindaro S. Canciana,b, Percy Nohamaa,b, Stefan Schulzb,c

a Paraná University of Technology (UTFPR), Curitiba, Brazil

Corpus-based Error Detection in a Multilingual Medical Thesaurus

b Pontificial Catholic University of Paraná, Master Program of Health Technology, Curitiba, Brazil

c University Medical Center Freiburg, Medical Informatics, Freiburg, Germany

Andrade et al. Corpus-based Error Detection in a Multilingual Medical Thesaurus

Introduction Methods Results Discussion Conclusion

Andrade et al. Corpus-based Error Detection in a Multilingual Medical Thesaurus

• Controlled Vocabulary for document indexing and retrieval

• Assigns semantic descriptors (concepts) to (quasi-)synonymous terms

• Contains additional semantic relations (e.g. hyperonym / hyponym)

• Examples: MeSH, UMLS, WordNet• Multilingual thesaurus: contains translations

(cross-language synonymy links)

Thesaurus

Introduction Methods Results Discussion Conclusion

Andrade et al. Corpus-based Error Detection in a Multilingual Medical Thesaurus

• International team of curators

• React to new terms and senses

• Decide which terms are synonymous / translations

• Decide which senses of a term have to be accounted for in the domain

• Requires quality assurance measures

Introduction Methods Results Discussion Conclusion

Multilingual Thesaurus Management

Andrade et al. Corpus-based Error Detection in a Multilingual Medical Thesaurus

Case study: Morphosaurus

• Medical subword thesaurus• Organizes subwords (meaningful word

fragments) in multilingual equivalence classes: – #derma = { derm, cutis, skin, haut, kutis, pele, cutis, piel, … } – #inflamm = { inflamm, -itic, -itis, phlog, entzuend, -itis, -itisch,

inflam, flog, inflam, flog, ... }

• Maintained at two locations:Freiburg (Germany), Curitiba (Brazil)

• Lexicon curators: frequently changing team of medical students

Introduction Methods Results Discussion Conclusion

Andrade et al. Corpus-based Error Detection in a Multilingual Medical Thesaurus

Segmentation:

Myo|kard|itis

Herz|muskel|entzünd|ung

Inflamm|ation of the heart muscle

muscle

myo

muskel

muscul

inflamm

-itis

inflam

entzünd

Eq Class

subword herzheart

card

corazon

card

INFLAMMMUSCLE

HEART

Morphosaurus Structure

• Thesaurus:~21.000 equivalence classes (MIDs)

• Lexicon entries:– English: ~23.000– German:

~24.000– Portuguese: ~15.000– Spanish :

~11.000– French: ~ 8.000– Swedish:

~10.000

Introduction Methods Results Discussion Conclusion

Indexation:

#muscle #heart #inflamm

#heart #muscle #inflamm

#inflamm #heart #muscle

Andrade et al. Corpus-based Error Detection in a Multilingual Medical Thesaurus

Morphosemantic Normalization

Introduction Methods Results Discussion Conclusion

Andrade et al. Corpus-based Error Detection in a Multilingual Medical Thesaurus

Additional Challenges for Morphosaurus

• Properly delimit subword entries so that they are correctly extracted from complex words

• Create consensus about the scope of synonymy classes, especially with regard to highly ambiguous lexicon entries

Introduction Methods Results Discussion Conclusion

Andrade et al. Corpus-based Error Detection in a Multilingual Medical Thesaurus

Introduction Methods Results Discussion Conclusion

Andrade et al. Corpus-based Error Detection in a Multilingual Medical Thesaurus

Morphosaurus Quality Assurance

• Content quality: Identify content errors in the thesaurus content – Implement a quality process to fix these errors– Show positive impact

• Process quality: Detect and prevent user action anomalies– actions that consume effort without any positive

impact : uncoordinated edit / update / delete “do undo” transactions done by different people)

– see Paper Bitencourt et al., Session 089

Introduction Methods Results Discussion Conclusion

Andrade et al. Corpus-based Error Detection in a Multilingual Medical Thesaurus

• Apply morphosemantic indexing to the Merck manual in English, Spanish, Portuguese, German

• Hypothesis: nearly identical frequency distribution of Morphosaurus identifiers (MIDs) for each language

• Problematic MIDs can be spotted by comparing MID frequencies for each language pair

Testbed: Parallel Medical Corpora

Introduction Methods Results Discussion Conclusion

Andrade et al. Corpus-based Error Detection in a Multilingual Medical Thesaurus

Testbed: Parallel Medical Corpora

Introduction Methods Results Discussion Conclusion

Andrade et al. Corpus-based Error Detection in a Multilingual Medical Thesaurus

21

21

ff

ff

dS

degree of imbalance

Scoring of MID Imbalance

2,1ff frequencies of MIDs in language 1 and 2

Score used for MID ranking per language pair

Introduction Methods Results Discussion Conclusion

3

2aS

dS

S

max21

21

xf

xf

ff

aS

mean relative frequency

Andrade et al. Corpus-based Error Detection in a Multilingual Medical Thesaurus

Introduction Methods Results Discussion Conclusion

Andrade et al. Corpus-based Error Detection in a Multilingual Medical Thesaurus

For each l1,l2: MIDs ranked by Score

Introduction Methods Results Discussion Conclusion

Andrade et al. Corpus-based Error Detection in a Multilingual Medical Thesaurus

Experimental Modification of Workflow

Introduction Methods Results Discussion Conclusion

• Discussion of 100 most highly ranked MID imbalances for each language pair

• Classification of Problems

• Documentation

• Correction in consensus

Andrade et al. Corpus-based Error Detection in a Multilingual Medical Thesaurus

Reason for MID

high score

Portuguese / English

German / English

Spanish / English

Ambiguous lexemes 0.23 0.38 0.14

Missing or dispensable MID 0.49 0.18 0.53

Same Sense in Different MIDs

0.06 0.12 0.19

One MID with Different Senses

0.04 0.05 0.06

No problem 0.11 0.10 0.04

Unclassified problem 0.07 0.17 0.04

Problems identified

Introduction Methods Results Discussion Conclusion

Andrade et al. Corpus-based Error Detection in a Multilingual Medical Thesaurus

Modified Workflow

Introduction Methods Results Discussion Conclusion

•Discussion of highly ranked MID imbalances

Andrade et al. Corpus-based Error Detection in a Multilingual Medical Thesaurus

• Did the targeted error spotting and resolution have an impact on the “in vivo” performance of the thesaurus ?

• Benchmark: OHSUMED document collection (user queries with relevant MEDLINE abstract assigned), queries translated to German, Spanish, Portuguese

• Target: monitoring of Timelines of Eleven Point Average measure (precision values at recall 0, 0.1, 0.2,…0.9, 1.0)

Summative Evaluation

Introduction Methods Results Discussion Conclusion

Andrade et al. Corpus-based Error Detection in a Multilingual Medical Thesaurus

EnglishPortugueseGermanSpanish

Weeks

Eleven-Point-AveragePrecision

OHSUMED benchmark

Summative Evaluation Results

Introduction Methods Results Discussion Conclusion

Andrade et al. Corpus-based Error Detection in a Multilingual Medical Thesaurus

Introduction Methods Results Discussion Conclusion

Andrade et al. Corpus-based Error Detection in a Multilingual Medical Thesaurus

Introduction Methods Results Discussion Conclusion

Discussion of Results

• Most problems detected corresponded to real errors (approx. 90%)

• Consensus could be found in most cases

• “In vivo” evaluation showed a clear increase in IR performance for Spanish, at the time the less consolidated language in Morphosaurus

Andrade et al. Corpus-based Error Detection in a Multilingual Medical Thesaurus

Introduction Methods Results Discussion Conclusion

Andrade et al. Corpus-based Error Detection in a Multilingual Medical Thesaurus

• Error detection by MID imbalance score proved useful

• New workflow productive for error elimination• Recommendations

– Record MID score over time • avg for each language pair• avg for each language• for each MID

– Generate alerts for every MIDs which exhibits an increase in imbalance above a tolerance interval

– Continue monitoring AvgP11 values– Include into Morphosaurus editing environment

Conclusions / Recommendations

Introduction Methods Results Discussion Conclusion

Andrade et al. Corpus-based Error Detection in a Multilingual Medical Thesaurus

Jeferson L. Bitencourt a,b, Pindaro S. Canciana,b, Edson Pachecoa,b, Percy Nohamaa,b, Stefan Schulzb,c

a Paraná University of Technology (UTFPR), Curitiba, Brazil

Medical Thesaurus Anomaly Detection by User Action Monitoring Session 089

b Pontificial Catholic University of Paraná, Master Program of Health Technology, Curitiba, Brazil

c University Medical Center Freiburg, Medical Informatics, Freiburg, Germany