
International Journal of Medical Informatics 58–59 (2000) 87–99

Morpheme-based, cross-lingual indexing for medical document retrieval

Stefan Schulz a,b,*, Udo Hahn a

a CLIF Computational Linguistics Laboratory, Freiburg University, Werthmannplatz 1, D-79085 Freiburg, Germany
b Department of Medical Informatics, Freiburg University Hospital, Stefan-Meier-Strasse 26, D-79104 Freiburg, Germany

Received 31 December 1999

Abstract

The increasing availability of machine-readable medical documents is not really matched with the sophistication of currently used retrieval facilities to deal with a variety of critical natural language phenomena. Still most popular are string-matching methods which encounter problems for the medical sublanguage, in particular, concerning the wide-spread use of complex word forms such as noun compounds. We introduce a methodology for the segmentation of complex compounds into medically motivated morphemes. Given the sublanguage patterns in our data these morphemes derive from German, Greek and Latin roots. For indexing and retrieval purposes, such a morpheme dictionary may be further structured by defining the semantic relations among morpheme sets in order to build up a multilingual morpheme thesaurus. We present a tool for thesaurus compilation and management, and outline a methodology for the proper construction and maintenance of a multilingual morpheme thesaurus. © 2000 Elsevier Science Ireland Ltd. All rights reserved.

Keywords: Medical language processing; Morphological analysis; Information retrieval

www.elsevier.com/locate/ijmedinf

1. Introduction

Modern clinical documentation systems, as well as an increasing number of health-related electronic publications and databases available on CD-ROMs, hospital intranets and the Internet have swamped the physician’s desktop with large amounts of computer-readable free text. The full utilization of these textual resources, however, is currently hampered by inadequate retrieval facilities.

* Corresponding author. Tel.: +49-761-2036702; fax: +49-761-2033251.

E-mail addresses: [email protected] (S. Schulz), [email protected] (U. Hahn).

1386-5056/00/$ - see front matter © 2000 Elsevier Science Ireland Ltd. All rights reserved.

PII: S1386-5056(00)00078-2

S. Schulz, U. Hahn / International Journal of Medical Informatics 58–59 (2000) 87–99

In a common free-text information retrieval environment, the search for a document relies on an (exact) pattern matching operation between the query term(s) and the document terms. So, a query term such as ‘Leukocyte’ retrieves all documents in which this query term occurs literally. Germanic and Romanic languages, however, are characterized by the morphological process of inflection. This introduces grammatically motivated string variants of document terms such that (usually) suffixes are concatenated with a basic word form, e.g. ‘Leukocyte⊕s’ (⊕ denotes the string concatenation operator). Furthermore, through the morphological process of derivation basic word forms are concatenated with derivational prefixes and suffixes, usually implying a change of the word class (e.g. the noun ‘Leukocyte’ changes into the adjective ‘leukocyt⊕ic’). While inflection preserves semantic invariance, derivation is usually accompanied with subtle changes of the core meaning of the stem undergoing derivation. Finally, by means of composition several basic word forms are combined to form a complex compound (e.g. the nouns ‘Leukocyte’ and ‘[H]em⊕o’ join into ‘Leuk⊕em⊕ia’, with a tricky omission of the starting character of ‘Hemo’, and the use of ‘-ia’ as suffix). In the case of transparent composition, this introduces a great deal of systematic semantic refinement such that the core meaning of the head of a compound (usually the rightmost stem) is semantically modified in a significant way by the semantic facets of the other fundamental units within the compound. All the three morphological processes, inflection, derivation and composition, may combine and interact, grammatically as well as semantically. In order to account for the morphological variation of these types, three possibilities arise in a free-text retrieval system:

1. Enumerate, either manually or automatically, all morphological variants of a query term and combine them in a disjunctive query such as in ‘Leukocyte’ ∨ ‘Leukocytes’ ∨ ‘Leukocyte’s’ ∨ ‘Leukocytic’ ∨ ‘Leukemia’ ∨ …, and let the system perform exact matches with the corresponding document terms.

2. Apply a masking (or truncation) operator (such as ‘$’) to the longest common substring of all morphological variants (a so-called formal stem), e.g. ‘Leuk$’. The system will then perform a partial string match between this truncated query term and all document terms whose leftmost substring is identical with ‘Leuk’, while the remainder can be any arbitrary string (including ε, the empty string)¹. Obviously, such a mechanism mimics linguistically based morphological computations by mere string processing approximation.

3. Submit a linguistically motivated basic word form as a query term, e.g. ‘Leukocyt’, and let the system automatically cope with morphological variants using a considerable stock of linguistic knowledge of the host language’s morphology, and let it perform a search for the documents based on the system-determined variant sets.
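As a rough illustration of the trade-off between options (1) and (2), the following Python sketch compares an enumerated disjunctive query with a truncated one. The term list and the helper names `match_disjunction` and `match_truncated` are invented for this example; they do not belong to any retrieval system discussed in this article.

```python
# Toy index of document terms (hypothetical data for illustration only).
DOCUMENT_TERMS = ["Leukocyte", "Leukocytes", "Leukemia",
                  "Leucocyte", "Leukoplakia", "Leucine"]


def match_disjunction(variants, terms):
    """Option (1): exact match against an enumerated set of variants."""
    wanted = {v.lower() for v in variants}
    return [t for t in terms if t.lower() in wanted]


def match_truncated(formal_stem, terms):
    """Option (2): 'stem$' matches every term that starts with the stem."""
    stem = formal_stem.lower()
    return [t for t in terms if t.lower().startswith(stem)]


enumerated = match_disjunction(
    ["Leukocyte", "Leukocytes", "Leukocytic", "Leukemia"], DOCUMENT_TERMS)
by_leuk = match_truncated("Leuk", DOCUMENT_TERMS)
by_leu = match_truncated("Leu", DOCUMENT_TERMS)

print(enumerated)  # misses the spelling variant 'Leucocyte'
print(by_leuk)     # adds 'Leukoplakia', still misses 'Leucocyte'
print(by_leu)      # finds 'Leucocyte', but also the unrelated 'Leucine'
```

On this toy data the enumerated query loses recall because one spelling variant was not listed, while shortening the formal stem recovers that variant only at the price of unrelated matches.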

Solution (1) often yields incomplete coverage due to missing variants even for linguistically well-trained human searchers (thus, lowering recall); when this is performed by a computational device, the procedure, if feasible at all, is just a generative reversal of analytic principles underlying the third approach. Solution (2) in many non-trivial cases has the tendency to overgenerate, i.e. produce many unintended matches (lowering precision), since the matching process is entirely underconstrained. Considering solution (3), one certainly has to distinguish different methodologies to automatic morphological analysis in order to assess the potential benefits or drawbacks for document retrieval systems. In the information retrieval community, the most common approach to morphological analysis is based on stemming, i.e. conflating different morphological variants to a single formal stem. Typically such algorithms (e.g. the Lovins stemmer [1] or the Porter stemmer [2]) refrain from using dictionary information and are solely based on simple string processing routines. Their principal way of operation consists of removing inflectional endings (e.g. plural or genitival or tense suffixes) or derivational suffixes, including some recoding transformations. Some of them, e.g. the Lovins stemmer, follow a one-pass strategy based on right-to-left longest matching plus recoding, others, e.g. the Porter stemmer, employ an iterative multi-pass approach. In fact, there has been some controversy about their contribution to improving the effectiveness of document retrieval systems [3–5]. The key issue for quality improvement seems to be rooted mainly in the presence or absence of some form of dictionary, i.e. a list of content words in some agreed-upon basic lexical format plus, possibly, additional linguistic information concerning parts of speech, gender, number, tense, mood, semantic relations, etc. Empirical evidence has been brought forward that inflectional and/or derivational stemmers augmented by (machine-readable) dictionaries indeed perform substantially better than those without access to lexical repositories [5].

¹ Note that in order to account for notoriously occurring ‘K–C–Z’ spelling variants, such as ‘Leuko…’ versus ‘Leuco…’, one even has to reduce the formal stem to ‘Leu’, which increases the number of false positives. In an Internet search, we determined a ratio of 5:1 for both forms (on a hit rate of 10 000 altogether) on English Web pages.

In addition, the above-mentioned stemming algorithms and their many variants benefit from the limited suffix set and rather simple formation rules underlying English inflection. When turning to other languages, e.g. French, Italian, Spanish, or German, no comparable algorithmic standard yet exists, not to mention major Asian languages such as Japanese, Korean and major dialects of Chinese. Many of these languages exhibit a much richer inventory of inflectional suffixes, and also their structural combination is more complex. Evidence for this statement and the implications on text retrieval performance comes from a large variety of highly inflectional and/or agglutinating languages such as Hebrew [6], Finnish [7], or Slovene [8]. Taking the perspective of different languages makes perfect sense here, since routinely generated narratives in medical practice (finding and admission reports, discharge summaries, etc.), no matter whether they appear in a clinical environment or at the practitioner’s work place, are written in the medical expert’s native language.

Morphological complexity further increases, in structural terms and independent of particular languages, when one looks at derivation and composition (for a survey of German language, cf. [9], for English composition, cf. [10]). There have already been observations on the crucial status of compounds for information retrieval and the problems they cause [7]. This becomes particularly pertinent for the medical domain where a large number of established terms of tremendous morphological complexity yet exist. In Section 3, we present the results of an experiment that elucidates the impact of derivation and composition on medical terminology. Also, medical terminology is characterized by a typical mix of Latin and Greek roots with the corresponding host language (e.g. German)² often referred to as Neo-Latin compounding. While this is often merely a side issue for general-purpose morphological analyzers, such an observation becomes important for any attempt to cope with medical free-texts in an information retrieval setting (cf. [11]).

² In contradistinction to other languages, equally characterized by medical terms of Greek/Latin origin (e.g. duodenal ulcer, ulcère duodénal), the German medical sublanguage simultaneously includes both varieties either obeying German (Duodenalulkus) or Latin (Ulcus duodeni) morphological rules.

In this article, we propose a methodology for morphological analysis that accounts for (a) all the three basic morphological processes, i.e. inflection, derivation, and composition, and (b) the combination of Greek, Latin, and the particular host language (in our case, German). Unlike approaches which are purely driven by considerations of general natural language processing (NLP), we embed our methodology in the framework of medical information retrieval. We will see that this has implications for (c) the choice of the fundamental unit of morphological analysis, as well as (d) the way in which we conceptually relate these basic units by making reference to well-known medical terminologies. It is this ‘light-weight’ conceptual foundation that sets our approach apart from all standard NLP methods, which typically do not explicitly link up to particular ontologies. Making full use of a large corpus and computational facilities we are able to provide a general automaton-based methodology that supersedes early pattern-matching-based or string-based proposals in the medical language processing (MLP) community in terms of linguistic generality.

2. Discussion of alternative approaches to morphological analysis of medical language

While considerable pessimism has been expressed with respect to a full semantic interpretation of medical compounds [12], several approximations have already been proposed to get around this intricate problem. The earliest approach to deal with medical terminology by way of morphological analysis is due to Pratt and Pacak [13]. Their approach transforms semantically equivalent adjectival and nominal forms by employing simple suffix trees and transformation rules for recoding morphologically reduced forms. Transformations succeed if a recoded form is matched with an entry in the SNOP nomenclature. Hence, this approach is fully dependent on the existence of a dictionary containing relevant lexical forms of the medical sublanguage.

Follow-up studies of this work by Pacak and Norton [14,15] not only determine a preferred normalized form for several morphological variants but also compute paraphrase and other semantic relations (such as locative, causative ones). These are implicitly denoted by complex medical compound nouns and can be made explicit by breaking compounds up into their constituent parts. The distributional patterns suggested by Pacak and Norton are based on four top-level conceptual categories, which are directly derived from SNOP/SNOMED codes (viz. topography, (medical) morphology, etiology, and function). A major limitation of this study, however, is due to the restriction of the decompositional analysis to compounds referring to inflammatory processes (indicated by the ‘-itis’ ending) or to surgical procedures (indicated, e.g. by ‘-ectomy’ or ‘-plasty’ endings) only. In a similar vein, Dujols et al. [16] treat ‘-osis’ forms only, though in a slightly more sophisticated manner than implied by the Norton/Pacak morphosemantic patterns. These restrictions are somewhat weakened in the study of Wolff [11] both in terms of a larger number of Greco-Latin suffixes being covered, as well as more general compositional patterns of Neo-Latin compounding. However, her approach (by design, a lexical knowledge engineering strategy for augmenting LSP-style lexicons rather than aiming at classical document retrieval applications) diverges from medical orthodoxy, insofar as the conceptual categories she employs refer to the subclass coding principles specifically holding within the Linguistic String Project (LSP) context (for an overview of the LSP, cf. [17]), rather than to the conventional SNOP/SNOMED-style nomenclature.

Also a part of this study is loosely characterized by a mixture of isolated data structures (e.g. suffix trees) and various procedural heuristics (right-to-left longest matching, floating ‘o’ insertion as in ‘cyst⊕o⊕lith⊕ectomy’ vs. ‘cyst⊕ectomy’, etc.). In an attempt to formulate the principles of medical word segmentation in a formally more rigid, almost language-independent framework, Wingert [18] chose an automaton-based specification for morphological analysis in terms of augmented transition networks [19]. He, finally, proposes a set of 255 cascading rules to capture the combinatorial regularities of different morpheme classes and, similar to Pratt and Pacak, refers to the entries of the SNOP nomenclature in order to exploit semantic information from the medical domain.

In the last decade, research in morphology has ceased in the medical language processing (MLP) community. Just recently, interest in this topic has revived employing much more sophisticated linguistic and conceptual knowledge. Baud et al. [20–22] use finite-state technology for the decomposition of complex terms into semantically non-decomposable segmentation units they refer to as morphosemantemes. A lot of power of their approach derives from the fact that the conceptual correlates of these morphosemantemes no longer refer to flat SNOMED-style categories but rather are formulated in GRAIL, a highly expressive deductive terminological knowledge representation language within the GALEN framework [23]. In order to isolate a morphosemanteme, composite concepts are dissected to their medically plausible conceptual core, using terminological knowledge derived from GRAIL. Baud et al.’s approach fully depends on and is, therefore, limited by the comparatively poor coverage of the medical domain by GRAIL, which, as any of these deep knowledge approaches, hardly scales up to reasonably sized, practically-to-use knowledge bases. Also, Baud et al.’s notion of a morphosemanteme is more promiscuous than what we call a medically relevant morpheme. More generally, we argue in favor of an automaton for word decomposition whose segmentation capability depends on a stricter though medically bounded sense of morphological atomicity. As we will see, this criterion also eases information retrieval on multilingual platforms.

It is interesting to observe that none of the above-mentioned proposals make use of the state-of-the-art methodologies for morphological analysis in NLP proper at the time of their writing, viz. chart-based approaches in the (early) eighties [24], or, currently, the model of two-level morphology as originally formulated by Koskenniemi [25] and lucidly described in [26]. The reason might be that these pure NLP methodologies still pose too strong requirements on their linguistic resources (e.g. two-level morphology requires elaborate and complete stem and suffix lexicons) and are also too rigid with respect to well-formedness of their input. Also, major efforts have so far been directed at inflection only, with only minor attention being paid to derivational (for exceptions to the rule, cf. [27,28]) or compositional morphology (for exceptions to the rule, cf. [29,30]). Even worse, some languages such as German pose particularly weird problems to a two-level machine because of contextual alteration dependencies within words such as umlauts (for a problem statement, cf. [31,32]). The problem of mixed-language input, as evidenced by medical Neo-Latin compounding, has to the best of our knowledge not been considered in this framework so far.


3. An empirical study of the distribution and coverage of complex compounds

In order to collect empirical evidence whether morphological analysis of complex word forms is really an urgent need, we conducted the following experiment. In a random selection of 100 pathology reports (average token count 147.9 per report) we found 895 occurrences of different domain-specific compounds. We then matched these 895 forms with all words contained in a machine-readable version of a comprehensive German-language medical dictionary, the ‘Pschyrembel’³. The retrieval process was based on exact string match. Surprisingly, 400 out of these 895 compounds did not occur in the dictionary. This reflects the enormous productivity of medical language leading to a large number of ad hoc compounds. A number of examples, both German and English ones, are given in Table 1. Analyzing the rubrics of the English-language coding system ICD-9-CM, we found, to a minor extent, a considerable number of nominal compounds (cf. the English terms in Table 1), thus indicating that this phenomenon is by no means restricted to the German language only. Generalizing from this study, we find confirmation for the hypothesis that accounting for complex morphological phenomena is highly rewarded in medical retrieval environments.
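The dictionary lookup in this experiment amounts to a set-membership test. The Python sketch below uses a handful of invented word forms (not the actual report or Pschyrembel data) and a hypothetical helper `uncovered_share`; ignoring letter case is an assumption of the sketch. With the counts reported above, 400 of 895 compounds corresponds to roughly 45% without a dictionary entry.

```python
def uncovered_share(compounds, dictionary_entries):
    """Fraction of compounds with no exact (case-insensitive) dictionary match."""
    entries = {e.lower() for e in dictionary_entries}
    missing = [c for c in compounds if c.lower() not in entries]
    return len(missing) / len(compounds), missing


# Invented toy data standing in for the report compounds and the dictionary.
compounds = ["Fibroblastenproliferation", "Nierentransplantatgewebe",
             "Leukozyt", "Gastritis"]
dictionary = ["Leukozyt", "Gastritis", "Fibroblast", "Niere"]

share, missing = uncovered_share(compounds, dictionary)
print(f"{share:.0%} uncovered: {missing}")

# Share implied by the counts reported in the text (400 of 895).
reported_share = 400 / 895
print(f"{reported_share:.1%}")
```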

At least two basic approaches seem to be reasonable. In the first one, the derivational and compositional forms have to be explicitly spelt out for each medical term in the terminology. This causes the size of those thesauri to grow dramatically by the sheer number of different terms. Also, given the speed of terminological growth, the goal of enumerating all morphological varieties can always only be approximated but never fully achieved.

Alternatively, one might want to avoid these scaling problems (and the associated maintenance load) and keep up with continuous terminological dynamics by exploiting basic linguistic regularities through a sophisticated morphological analyzer. We adhere to this second approach, and propose the concept of a morpheme-based dictionary/thesaurus. A repository of morphemes as the smallest meaningful morphological units is expected to be several orders of magnitude below the size of phrasal or fully lexicalized dictionaries, with quite a lower growth rate. Hence, it should also be much easier and cheaper to compile and to maintain. This parsimony must, however, be traded against the lack of semantic expressiveness of the entries and an increased computational complexity of the algorithms needed for decomposition and disambiguation. In the following, a tool and a methodology for a morpheme thesaurus will be described and preliminary results obtained by a pilot study are given.

³ Pschyrembel Klinisches Wörterbuch: for MS Windows 3.1, 3.11 and ’95, Walter de Gruyter. Its whole text corpus contains more than 100 000 different entries.

Table 1
Nominal compounds in medical German and English

Compound                       Segmentation
Kryostatschnittverfahren       Kryo⊕stat⊕schnitt⊕verfahr⊕en
Fibroblastenproliferation      Fibro⊕blast⊕en⊕prolifer⊕ation
Nierentransplantatgewebe       Niere⊕n⊕trans⊕plant⊕at⊕geweb⊕e
Transitionalzellkarzinom       Transi⊕tio⊕nal⊕zell⊕karzin⊕om
Schilddrüsenüberschreitung     Schild⊕drüse⊕n⊕über⊕schreit⊕ung
Neuroencephalomyelopathy       Neur⊕o⊕encephal⊕o⊕myel⊕o⊕path⊕y
Pseudohypoparathyroidism       Pseudo⊕hypo⊕para⊕thyroid⊕ism
Proctosigmoidoscopy            Proct⊕o⊕sigm⊕oid⊕o⊕scop⊕y
Hyperprebetalipoproteinemia    Hyper⊕pre⊕beta⊕lipo⊕protein⊕em⊕ia
Arterionephrosclerosis         Arteri⊕o⊕nephr⊕o⊕scler⊕os⊕is

4. Morphological segmentation model

We distinguish the following morpheme classes and define syntactic rules of well-formedness formally encoded in a finite state automaton (we only give some informal and incomplete examples here) for their use within composed words:

• Stems like {gastr, hepat, nier, leuk, diaphys, …} are the primary carrier of content in a word; they can be prefixed, linked by infixes, and suffixed.

• Prefixes like {a, de, in, ent, ver, anti, …} precede the word’s stem(s).

• Infixes (e.g. ‘o’ in ‘gastr⊕o⊕intestinal’, or ‘s’ in ‘Sektion⊕s⊕bericht’) are just used as a (phonologically motivated) ‘glue’ between morphemes, typically as a link between stems.

• Derivational suffixes such as {io, ion, ie, ung, itis, tomie, …} usually follow the word’s stem(s).

• Inflectional suffixes like {e, en, s, idis, ae, oris, …} appear at the very end of a word following the word’s stem(s) or derivational suffixes.

• Eponyms (mostly proper names), digits and acronyms {AIDS, ECG, …} are non-decomposable entities in morphological terms and also undergo no further morphological variation, e.g. by suffixing.

Fig. 1. Example for segmentation of complex words and morpheme classes.

A segmentation of two compounds that is based on these distinctions into morpheme classes is depicted in Fig. 1. In general, however, adequate morphological segmentation for use in an information retrieval environment places restrictions on morphological analysis, which differ from purely linguistic considerations. In effect, it turns out that a balanced notion of morpheme atomicity is crucial for success. While linguists tend to strive for the utmost degree of breaking up complex words into the smallest morphological units possible, from an information retrieval perspective this might be inappropriate. The reason for this lies in a priori meaning assignments to morphological complex forms in a particular domain.

Fig. 2. Task-specific atomicity: incomplete segmentation to avoid lexical ambiguities.

In the example depicted in Fig. 2, the morphologically complex word stem ‘diaphys’ has a unique conceptual correlate in the medical domain, viz. the shaft of a long bone. Further segmentation into ‘dia’ and ‘phys’, though linguistically entirely plausible, would produce ambiguous morphological entities with conceptual correlates, if any, of far lesser specificity, at least in medical terms. Hence, we argue for a conceptually motivated criterion for morphological atomicity, one that is rooted in medical plausibility (similar arguments were already raised in [15,33]).
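The automaton itself is not spelled out in this article. As a minimal sketch of how such well-formedness rules could be written down, the Python fragment below encodes one simplified reading of the morpheme class descriptions in this section as a regular expression over class labels. The grammar, the one-letter codes and the function name `is_well_formed` are assumptions made for illustration, not the authors' rule set.

```python
import re

# One-letter codes for the morpheme classes (an encoding chosen for this sketch).
CODES = {"prefix": "P", "stem": "S", "infix": "I",
         "deriv_suffix": "D", "infl_suffix": "F", "atom": "A"}

# Assumed simplified grammar:
#   word      := atom | component (infix? component)* infl_suffix*
#   component := prefix* stem deriv_suffix*
# "atom" stands for eponyms, digits and acronyms, which take no affixes.
WORD = re.compile(r"A|P*SD*(?:I?P*SD*)*F*")


def is_well_formed(classes):
    """Check a sequence of morpheme class names against the assumed grammar."""
    encoded = "".join(CODES[c] for c in classes)
    return WORD.fullmatch(encoded) is not None


# Class pattern of a stem-infix-stem word, e.g. two stems joined by 'o'.
print(is_well_formed(["stem", "infix", "stem"]))
# An inflectional suffix may not precede the stem under this grammar.
print(is_well_formed(["infl_suffix", "stem"]))
```

A regular expression is used here only because the class-ordering rules, as read above, form a regular language; any finite-state representation would serve equally well.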

The following heuristics were considered to be adequate for determining medically motivated morphemes: a lexical string undergoes no further segmentation if an apparently atomic synonymous or quasi-synonymous stem exists (SHAFT in our example) and/or if it points to a basic conceptual entity in the medical domain. Obviously, this view implies a trade-off between compactness (reduced number of morphemes) and expressiveness (reduced semantic specificity of morphemes). This is demonstrated in Fig. 3, where variants of word stems, so-called allomorphs (‘leuko’, ‘leuco’, ‘leuk’, ‘leuc’) can be generalized up to the longest common substring, viz. ‘leu’. While this certainly reduces the size of the morpheme dictionary, an increased number of morphological segmentation ambiguities and, even worse, false positives are likely to be encountered in an information retrieval environment. We refrain from such a solution and subscribe to a modest size growth of the dictionary by the intentional inclusion of allomorphs.

5. MEDSEARCH: an interactive tool for the construction of a morpheme dictionary

On the basis of these considerations, we developed a morphological segmentation engine, MEDSEARCH. The system consists of a morphological parser that builds, on the basis of existing morpheme lexicons, all possible parse trees for the input. Morphological segmentation ambiguities are ranked according to the following preference criteria:

1. longest match;

2. minimal number of stems per word;

3. minimal number of consecutive affixes (this criterion penalizes utterly formal segmentations); and

4. relative weight: more specifically, a semantic weight factor w=2 is assigned to all stems and some semantically important suffixes, such as ‘-tomie’, ‘-itis’; w=1 is assigned to prefixes and derivational suffixes, and w=0 is rendered to inflectional suffixes and infixes.
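How the four criteria are combined is not specified here, so the following Python sketch should be read as one possible interpretation rather than as the MEDSEARCH implementation. It assumes a lexicographic ordering of the criteria, approximates "longest match" by preferring fewer segments, and assumes that a higher total weight is better. The toy lexicon, the class label `key_suffix` and all function names are invented for the example.

```python
# Semantic weights as given in criterion 4; "key_suffix" is this sketch's
# label for the semantically important suffixes (e.g. '-tomie', '-itis').
WEIGHT = {"stem": 2, "key_suffix": 2, "prefix": 1,
          "deriv_suffix": 1, "infl_suffix": 0, "infix": 0}

# Invented toy lexicon: morpheme -> class.
LEXICON = {
    "post": "prefix", "ek": "prefix",
    "gastr": "stem", "symptom": "stem",
    "ektomie": "key_suffix", "tomie": "key_suffix",
    "at": "deriv_suffix", "ik": "deriv_suffix", "atik": "deriv_suffix",
}


def segmentations(word, lexicon):
    """Enumerate every split of `word` into lexicon entries (exhaustive search)."""
    if not word:
        return [[]]
    found = []
    for end in range(1, len(word) + 1):
        head = word[:end]
        if head in lexicon:
            for tail in segmentations(word[end:], lexicon):
                found.append([head] + tail)
    return found


def longest_affix_run(classes):
    """Length of the longest run of consecutive non-stem morphemes."""
    longest = run = 0
    for cls in classes:
        run = 0 if cls == "stem" else run + 1
        longest = max(longest, run)
    return longest


def rank_key(segments, lexicon):
    """Sort key; a smaller tuple means a more preferred segmentation."""
    classes = [lexicon[s] for s in segments]
    return (len(segments),                     # 1. stand-in for longest match
            classes.count("stem"),             # 2. fewer stems
            longest_affix_run(classes),        # 3. fewer consecutive affixes
            -sum(WEIGHT[c] for c in classes))  # 4. higher total weight


def rank(word, lexicon=LEXICON):
    """Return all segmentations of `word`, best first."""
    return sorted(segmentations(word.lower(), lexicon),
                  key=lambda segs: rank_key(segs, lexicon))


for candidate in rank("Postgastrektomiesymptomatik"):
    print(" + ".join(candidate))
```

The exhaustive recursion is adequate for single words and a small lexicon; a production parser would more plausibly use a chart or a finite-state lexicon to avoid recomputing suffixes.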

The MEDSEARCH tools provide a graphical interface for interactive management of the morpheme dictionary, the MedSearch Morphology Workbench. In each training cycle, it imports complex words from the source corpora under consideration, performs segmentation on the basis of the morphemes already specified, and displays the analysis results, weighted according to the above mentioned criteria. Fig. 4 gives a glimpse of the user interface and illustrates the result of the segmentation of the nominal compound ‘Postgastrektomiesymptomatik’. Note that many ambiguous analyses are displayed, since the stem ‘gastr’ has not yet been specified as a stem. Upper case characters mark morphemes with a weight factor w=2. Every relevant morpheme that is missing can be manually inserted in the appropriate morpheme lists such as it is shown in Fig. 4. By supplying new entries the segmentation capability of our tool continuously improves. Prior to storing a new morpheme, attributes concerning morpheme class and language (German, Latin, Greek) have to be supplied.

Fig. 3. Allomorphy: set of semantically identical morphemes (allomorphs).

Fig. 4. MEDSEARCH morphology workbench: tool for the incremental construction of the morpheme dictionary.

MEDSEARCH has been implemented using the programming environment of Visual Basic 6.0. Besides the graphical interface, it can be run as a Microsoft component object model (COM) server application, thus exporting its methods within a network.

6. Experience with running MEDSEARCH

In a pilot study, we proceeded in the following way in order to get started. Using the MEDSEARCH Workbench, we first manually supplied an a priori selected set of German and Greco-Latin prefixes, suffixes, and infixes. In particular, no word stems were provided. After this initialization phase, our test corpus was obtained by merging the following sources: the German translation of the 9th and the 10th release of the international classification of diseases (ICD), the German SNOMED, the international classification of procedures in medicine in its German version OPS301, and, finally, a list of more than 100 000 German language clinical diagnosis phrases. This raw material was automatically segmented by the morphological analyzer without human interaction in a fast batch mode. This led to a huge list of unknown substrings, since except the previously determined affixes no other morphemes were initially known to MEDSEARCH. These unknown substrings were then ranked by frequency. The first 1000 most frequent substrings were checked manually, and the missing word stems were included in the morpheme lexicon.

This bootstrapped morpheme list was then used for another cycle of automatic segmentation and manual workup in which the above-cited rules for domain-specific delimitation were applied once again. The following criteria for the exclusion of word stems were defined: (1) one-time occurrence only; (2) acronyms or proper names (e.g. drug names); (3) non-medical word stems. Applying these rules the analysis of the whole material yielded a core table of 7130 medicine-specific word stems, with 22 369 words or word fragments being excluded.

We also performed a second study in order to estimate the growth behavior of a medical word stem lexicon using 30 000 diagnosis phrases from a clinical documentation system (this constituted a more homogeneous text sample). Applying the above exclusion criteria we obtained a list of 4098 word stems. The growth curve can be approximated by a logarithmic function (cf. Fig. 5).

7. Steps towards a multilingual morpheme thesaurus

In the procedure just described, we already account for morphemes from different languages. In this section, we propose an extension of the emerging multilingual dictionary to a multilingual thesaurus by incorporating semantic relations between the morpheme items (an idea first articulated by Pacak et al. [14]). In order to map between synonymous or quasi-synonymous expressions within the same language, but also between different natural languages, we will discuss two alternative approaches (cf. Fig. 6).

The first approach, a ‘simple thesaurus’ (cf. Fig. 6, left side), is based on the definition of semantic links within the morpheme thesaurus proper, without any reference to an external ontology. As basic semantic relations between pairs of morphemes, we choose the similarity S and the equivalence E relations. Both S and E are symmetric, but only E is transitive. Hence, E defines sets of semantically equivalent morphemes, and S links those groups with one another that have a similar meaning. As an example, the E relation holds within the morpheme set M1 = {stomach, magen, estomac}, as well as between the elements of M2 = {heart, cor, cord, herz, coeur}. Regarding a third morpheme set M3 = {ventric, ventrik}, we state a similarity relation between M1 and M3 as well as between M2 and M3, because ‘ventricle’ means both the cavity of the heart and the cavity of the stomach.
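One way to represent the two relations is sketched below in Python. It is only an illustration under stated assumptions, not the data model of MEDSEARCH. The class name `SimpleThesaurus` and its methods are hypothetical. E is kept as a union-find structure, which makes it symmetric and transitive by construction. S is stored as explicit pairs and lifted to equivalence classes at query time, so it stays symmetric but is never closed transitively.

```python
class SimpleThesaurus:
    """Toy store for the equivalence (E) and similarity (S) relations."""

    def __init__(self):
        self._parent = {}    # union-find forest encoding E
        self._similar = []   # explicit S links between morphemes

    def _find(self, m):
        """Return the representative of the E-class containing m."""
        self._parent.setdefault(m, m)
        while self._parent[m] != m:
            self._parent[m] = self._parent[self._parent[m]]  # path halving
            m = self._parent[m]
        return m

    def add_equivalent(self, a, b):
        """Assert a E b, merging the two equivalence classes."""
        ra, rb = self._find(a), self._find(b)
        if ra != rb:
            self._parent[rb] = ra

    def add_similar(self, a, b):
        """Assert a S b (and thereby S between their whole classes)."""
        self._similar.append((a, b))

    def equivalent(self, a, b):
        return self._find(a) == self._find(b)

    def similar(self, a, b):
        """True if some stored S link connects the classes of a and b."""
        wanted = {self._find(a), self._find(b)}
        return any({self._find(x), self._find(y)} == wanted
                   for x, y in self._similar)


t = SimpleThesaurus()
for group in (["stomach", "magen", "estomac"],            # M1
              ["heart", "cor", "cord", "herz", "coeur"],  # M2
              ["ventric", "ventrik"]):                    # M3
    for morpheme in group[1:]:
        t.add_equivalent(group[0], morpheme)
t.add_similar("stomach", "ventric")   # M1 S M3
t.add_similar("heart", "ventric")     # M2 S M3

print(t.equivalent("magen", "estomac"))  # both in M1
print(t.similar("estomac", "ventrik"))   # M1 S M3 via class lifting
print(t.similar("magen", "herz"))        # M1 and M2 are not linked directly
```

Because S is not transitive, the links M1–M3 and M2–M3 do not make stomach morphemes similar to heart morphemes, which matches the intended reading of the example.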

The second approach, a ‘complex thesaurus’ (cf. Fig. 6, right side), maps morphemes to sets of elements of an existing terminological system, e.g. the UMLS. The UMLS metathesaurus provides, besides the synonymy relation, hierarchy-inducing relations such as broader/narrower or child/parent relationships, and, to a minor extent, semantically more specific relations such as is-a or part-of.

Fig. 5. Growth behavior of a word stem lexicon obtained from diagnosis phrases.

Fig. 6. Two approaches for semantic linkage in a multilingual morpheme thesaurus.

Fig. 7. Query obtained by morpheme-supported expansion of two synonymous surface expressions.

The advantage of the first approach is that all the knowledge necessary for morphological analysis and efficient retrieval is directly available at the repository level. However, the internal linkage has to be done from scratch. Even with a relatively small number of dictionary entries, and the restriction to two basic relations only, the decision as to whether two morphemes should be linked in terms of a real synonymy or only a similarity relation will in many cases not be easy to make and requires much experience in the use of medical language.

The second approach has, in contradistinction, the advantage that all semantic relations are already given. Its difficulties will arise from the task of the adequate mapping of a small set of semantically shallow morphemes to a huge collection (about two orders of magnitude larger in size) of semantically precise concepts. As an example, in the UMLS metathesaurus there is no concept for the generic notion of ‘ventricle’. A solution may be the subscription to the first approach, but using the UMLS as a support for the thesaurus construction process.

Considering the ‘simple thesaurus’ approach, a scenario for information retrieval is outlined in Fig. 7. Nearly identical queries are obtained by the morpheme-supported expansion of two synonyms, though lexically entirely different surface expressions were used in the original query.
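The expansion step of such a scenario can be sketched as follows. The sketch assumes that the query has already been segmented into morphemes. The equivalence classes, the two segmented queries and the function names (`build_index`, `expand_query`, `to_boolean`) are hypothetical illustrations and do not reproduce the example of Fig. 7.

```python
def build_index(classes):
    """Map every morpheme to the frozenset of its equivalence class."""
    index = {}
    for cls in classes:
        members = frozenset(cls)
        for morpheme in members:
            index[morpheme] = members
    return index


def expand_query(morphemes, index):
    """Replace each morpheme by its class; unknown morphemes stand alone."""
    return [index.get(m, frozenset([m])) for m in morphemes]


def to_boolean(expanded):
    """Render the expansion as an AND of OR-groups."""
    return " AND ".join(
        "(" + " OR ".join(sorted(group)) + ")" for group in expanded
    )


# Hypothetical thesaurus entries, invented for this sketch.
classes = [{"gastr", "magen", "stomach"},
           {"itis", "entzuend", "inflamm"}]
index = build_index(classes)

# Two lexically different surface forms, given here as assumed
# segmentation output.
q1 = expand_query(["gastr", "itis"], index)
q2 = expand_query(["magen", "entzuend"], index)

print(to_boolean(q1))
print(q1 == q2)  # both surface forms expand to the same query
```

In this toy case the two expansions coincide exactly. With a real thesaurus they may differ slightly, for instance when a segment of one expression has no entry or when only a similarity link, not an equivalence link, exists.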

8. Conclusion

Given the productivity and developmental dynamics of medical terminologies, it seems almost impossible to keep pace with their excessive lexical growth rates. As a way out of this dilemma, we opt for automatic morphological segmentation. In order to support effective free-text retrieval, in this article we proposed a pragmatic methodology that builds on a morpheme thesaurus, a repository of morphologically significant forms, that is used for automatic deflection, dederivation and decomposition. While we stressed the morphological segmentation aspect of our work, i.e. how to decompose complex word forms into simpler morphological units, we have also outlined the next step required, viz. how to link these basic morphological units by conceptual relationships. Finally, a retrieval-based evaluation of our framework is still pending. The latter part will be the focus of on-going work, while the operational segmentation tool has already proved to be useful.

References

[1] J.B. Lovins, Development of a stemming algorithm, Mech. Trans. Comput. Linguistics 11 (1/2) (1968) 22–31.

[2] M.F. Porter, An algorithm for suffix stripping, Program 14 (3) (1980) 130–137.

[3] D. Harman, How effective is suffixing?, J. Am. Soc. Inf. Sci. 42 (1) (1991) 7–15.

[4] D.A. Hull, Stemming algorithms: a case study for detailed evaluation, J. Am. Soc. Inf. Sci. 47 (1) (1996) 70–84.

[5] R. Krovetz, Viewing morphology as an inference process, in: R. Korfhage, E. Rasmussen, P. Willett (Eds.), SIGIR’93: Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Pittsburgh, PA, USA, June 27–July 1, 1993, ACM, New York, NY, pp. 191–203.

[6] Y. Choueka, Responsa: an operational full-text retrieval system with linguistic components for large corpora, in: A. Zampolli (Ed.), Computational Lexicology and Lexicography: A Volume in Honor of B. Quemada, Giardini Press, Pisa, 1992.

[7] H. Jappinen, J. Niemisto, Inflections and compounds: some linguistic problems for automatic indexing, in: Proceedings of the RIAO 88 Conference: ‘User-Oriented Content-Based Text and Image Handling’, vol. 1, Cambridge, MA, March 21–24, 1988, pp. 333–342.

[8] M. Popovic, P. Willett, The effectiveness of stemming for natural language access to Slovene textual data, J. Am. Soc. Inf. Sci. 43 (5) (1992) 384–390.

[9] J. Toman, Wortsyntax. Eine Diskussion ausgewählter Probleme deutscher Wortbildung, Max Niemeyer, Tübingen, 1987.

[10] J.N. Levi, The Syntax and Semantics of Complex Nominals, McGraw-Hill, New York, 1978.

[11] S. Wolff, The use of morphosemantic regularities in the medical vocabulary for automatic lexical coding, Methods Inf. Med. 23 (4) (1984) 195–203.

[12] A.T. McCray, A.C. Browne, D.L. Moore, The semantic structure of neo-classical compounds, in: R.A. Greenes (Ed.), SCAMC’88: Proceedings of the 12th Annual Symposium on Computer Applications in Medical Care, IEEE Computer Society Press, Washington, DC, 1988, pp. 165–168.

[13] A.W. Pratt, M.G. Pacak, Identification and transformation of terminal morphemes in medical English, Methods Inf. Med. 8 (2) (1969) 84–90.

[14] M.G. Pacak, L.M. Norton, G.S. Dunham, Morphosemantic analysis of ITIS forms in medical language, Methods Inf. Med. 19 (2) (1980) 99–105.

[15] L.M. Norton, M.G. Pacak, Morphosemantic analysis of compound word forms denoting surgical procedures, Methods Inf. Med. 22 (1) (1983) 29–36.

[16] P. Dujols, P. Aubas, C. Baylon, F. Gremy, Morphosemantic analysis and translation of medical compound terms, Methods Inf. Med. 30 (1) (1991) 30–35.

[17] N. Sager, M. Lyman, C.E. Bucknall, N.T. Nhan, L.J. Tick, Natural language processing and the representation of clinical data, J. Am. Med. Inf. Assoc. 1 (2) (1994) 142–160.

[18] F. Wingert, Morphosyntaktische Zerlegung von Komposita der medizinischen Sprache, Methods Inf. Med. 16 (4) (1977) 248–255.

[19] F. Wingert, Morphologic analysis of compound words, Methods Inf. Med. 24 (3) (1985) 155–162.

[20] C. Lovis, R. Baud, P.-A. Michel, J.-R. Scherrer, Morphosemantems decomposition and semantic representation to allow fast and efficient natural language recognition, in: R. Masys (Ed.), AMIA’97: Proceedings of the 1997 AMIA Annual Fall Symposium (formerly SCAMC). The Emergence of ‘Internetable’ Health Care: Systems that Really Work, Nashville, TN, October 25–29, 1997, Hanley & Belfus, Philadelphia, PA (long version on CD-ROM), pp. 873.

[21] R.H. Baud, C. Lovis, A.-M. Rassinoux, J.-R. Scherrer, Morpho-semantic parsing of medical expressions, in: C.G. Chute (Ed.), AMIA’98: Proceedings of the 1998 AMIA Annual Fall Symposium. A Paradigm Shift in Health Care Information Systems: Clinical Infrastructures for the 21st Century, Orlando, FL, November 7–11, 1998, Hanley & Belfus, Philadelphia, PA, pp. 760–764.

[22] R.H. Baud, A.-M. Rassinoux, P. Ruch, C. Lovis, J.-R. Scherrer, The power and limits of a rule-based morpho-syntactic parser, in: N.M. Lorenzi (Ed.), AMIA’99: Proceedings of the 1999 Annual Symposium of the American Medical Informatics Association. Transforming Health Care through Informatics, Washington, DC, November 6–10, 1999, Hanley & Belfus, Philadelphia, PA, pp. 22–26.

[23] A.L. Rector, S. Bechhofer, C.A. Goble, I. Horrocks, W.A. Nowlan, W.D. Solomon, The GRAIL concept modelling language for medical terminology, Artif. Intell. Med. 9 (1997) 139–171.

[24] M. Kay, Morphological analysis, in: A. Zampolli, N. Calzolari (Eds.), Computational and Mathematical Linguistics, Proceedings of the International Conference on Computational Linguistics, vol. 1, Pisa, August 27–September 1, 1973, L.S. Olschki, Firenze, pp. 205–223.

[25] K. Koskenniemi, A general computational model for word formation recognition and production, in: COLING’84: Proceedings of the 10th International Conference on Computational Linguistics and 22nd Annual Meeting of the Association for Computational Linguistics, Stanford, California, USA, July 2–6, 1984, pp. 178–181.

[26] R. Sproat, Morphology and Computation, MIT Press, Cambridge, MA, 1992.

[27] G.J. Russell, G.D. Ritchie, S.G. Pulman, A.W. Black, A dictionary and morphological analyzer for English, in: COLING’86: Proceedings of the 11th International Conference on Computational Linguistics, Bonn, August 25–29, 1986, Institut für angewandte Kommunikations- und Sprachforschung (IKS), Bonn, pp. 277–279.

[28] H. Trost, Coping with derivation in a morphological component, in: EACL’93: Proceedings of the 6th Conference of the European Chapter of the Association for Computational Linguistics, Utrecht, The Netherlands, April 21–23, 1993, Assoc. Comput. Linguistics, pp. 368–376.

[29] A.W. Black, J. van de Plassche, B. William, Analysis of unknown words through morphological decomposition, in: Proceedings of the 5th Conference of the European Chapter of the Association for Computational Linguistics, Berlin, Germany, April 9–11, 1991, Assoc. Comput. Linguistics, pp. 101–106.

[30] L. Karttunen, R.M. Kaplan, A. Zaenen, Two-level morphology with composition, in: COLING’92: Proceedings of the 15th International Conference on Computational Linguistics, vol. 1, topical papers, August 23–28, 1992, ICCL, Nantes, pp. 141–148.

[31] H. Trost, The application of two-level morphology to non-concatenative German morphology, in: COLING’90: Papers presented at the 13th International Conference on Computational Linguistics on the Occasion of the 25th Anniversary of COLING and the 350th Anniversary of Helsinki University, vol. 2, Helsinki, Finland, 1990, pp. 371–376.

[32] A. Schiller, P. Steffens, Morphological processing in the two-level paradigm, in: O. Herzog, C.-R. Rollinger (Eds.), Text Understanding in LILOG. Integrating Computational Linguistics and Artificial Intelligence, final report on the IBM Germany LILOG-Project, number 546 in Lecture Notes in Artificial Intelligence, Springer, Berlin, 1991, pp. 112–126.

[33] C. Lovis, P.-A. Michel, R.H. Baud, J.-R. Scherrer, Word segmentation processing: a way to exponentially extend medical dictionaries, in: R.A. Greenes, H.E. Peterson, D.J. Protti (Eds.), MEDINFO ’95: Proceedings of the 8th Conference on Medical Informatics, number 8 in IFIP World Conference Series on Medical Informatics, Vancouver, Canada, North-Holland, Amsterdam, 1995, pp. 28–32.