
Evaluating and reducing the effect of data corruption when applying bag of words approaches to medical records

P. Ruch a,b,*, R. Baud b, A. Geissbuhler b

a Theoretical Computer Science Laboratory, Swiss Federal Institute of Technology, Lausanne, Switzerland
b Medical Informatics Division, University Hospital of Geneva, Geneva, Switzerland

Abstract

Unlike journal corpora, which are supposed to be carefully reviewed before being published, the documents in a patient record are often corrupted by misspelled words and conventional graphies or abbreviations. After a survey of the domain, the paper focuses on evaluating the effect of such corruption on an information retrieval (IR) engine. The IR system uses a classical bag of words approach, with stems as representation items and term frequency-inverse document frequency (tf-idf) as the weighting scheme; we pay special attention to the normalization factor. First results show that even low corruption levels (3%) affect retrieval effectiveness (by 4-7%), whereas higher corruption levels can degrade retrieval effectiveness by 25%. We then show that the use of an improved automatic spelling correction system, applied to the corrupted collection, can almost restore the retrieval effectiveness of the engine.

© 2002 Elsevier Science Ireland Ltd. All rights reserved.

Keywords: Corruption; Information retrieval; Medical records; Spelling correction; Natural language processing

1. Introduction

Information retrieval (IR) is familiar to most of us due to the development of the Web and the availability of large medical electronic libraries. Although IR on textual material is probably the main application field of retrieval technologies, other domains, such as text mining in molecular biology [22,7] or knowledge extraction in medicine [4,3], often rely on the same vector space architecture, which transforms any linearly ordered set of words (like a sentence or a document) into an unordered set (hence the so-called bag of words (BoW) approach). While any word-based system (or one using a variant of words, such as stems) relying on the BoW transformation falls into the scope of this study¹, we believe that even more elaborate systems [11,5,1,6], which use linguistically-motivated approaches, are also likely to be affected by data corruption. In this paper, we argue that spelling errors are a

* Corresponding author. Tel.: +41-21-6936665.
E-mail address: [email protected] (P. Ruch).
¹ Including text categorization systems using advanced machine learning instruments [25,8].

International Journal of Medical Informatics 67 (2002) 75-83

www.elsevier.com/locate/ijmedinf

1386-5056/02/$ - see front matter © 2002 Elsevier Science Ireland Ltd. All rights reserved.
PII: S1386-5056(02)00057-6

major challenge for most IR systems, and evaluate the improvement brought by merging such a system with a fully-automatic spelling corrector. Indeed, while research experiments are usually conducted on well-spelled corpora (MedLine abstracts, or newswire collections), applying such research to real IR and text mining tasks conducted on patient records implies designing systems able to handle misspellings.

The corruption introduced by misspelled words plays a particular role when we attempt to build an IR application dedicated to retrieving information in an electronic patient record (EPR). Documents that constitute the patient record are not meant to be published², and therefore are especially rich in misspellings compared with most IR collections. In addition, documents of the patient file are often dictated; typos are therefore often introduced at the transcription level, from speech to text.

Due to the complexity of the task, which involves several heterogeneous representation levels (syntax, semantics, phonetics, keyboard configuration, user profile, etc.), automatic spell checkers usually perform this task in interaction with the user. However, in some settings, such as IR, interactive spelling correction is largely ruled out [12].

While some short queries entered manually can be corrected with the assistance of their authors, no such assistance is available when we consider the task of retrieving similar cases in a patient file warehouse using a given patient file as a query. In this case, the query ranges from a few sentences of a single document to hundreds of lexically rich documents, so that user-assisted spelling correction is clearly impossible. A fortiori, spelling errors occurring in the document collection remain uncorrected.

Moreover, misspellings not only create 'garbage strings', which increase silence, but also corrupt the general document view formed by an IR system, and can therefore substantially hinder the successful retrieval of documents relevant to user queries. Indeed, most modern IR systems use sophisticated term weighting functions to assign importance to the individual words (or any other chosen items) of a document for document-content representation [15,19-21], and these term weighting functions can be more or less dependent on the corruption of the collection.

The goal of our work is to show how spelling correction can be performed in a fully automatic manner in order to reach performances similar to retrieval applied to non-corrupted collections. Considering the speculative state of this study, we will focus on:

- modeling IR in a case-based database (CBD) or an EPR;
- measuring the effects of spelling errors on retrieval effectiveness, both at the query and document levels;
- providing evidence that IR can be improved by using fully-automatic spelling correction.

2. Background and design options

Assessing the effect of misspelled strings on an IR system applied to medical texts implies synthesizing knowledge from at least three different origins, which we briefly summarize:

- medical corpora;
- IR;
- spelling correction.

² A notable exception concerns discharge summaries, which are likely to be sent outside the institution and are therefore more carefully written than documents meant for internal use.


2.1. Choice of a collection

In order to evaluate the effects of misspellings on an IR system likely to be applied to an EPR, and considering the absence of a publicly available EPR collection, we decided to use the cystic fibrosis (CF) collection³. The CF collection contains 1239 documents and 100 queries. It was chosen for this study because of its manageable size, and because its queries (sometimes more than 30 tokens long) can be regarded as a set of short documents, and therefore better simulate an engine that has to accept document extracts as queries.

2.2. Information retrieval

Any IR system defines three basic elements:

- the document and query representation;
- the matching function;
- a ranking criterion.

For the representation, we decided to use stems, while the last two elements depend on the selected weighting features.

2.2.1. Weighting schemes

IR engines using the vector space approach are usually based on a variant of the term frequency-inverse document frequency (tf-idf) family [20]. This approach states that the weight of a given term is related to the frequency of this term in a given document (i.e. tf), and inversely proportional to the frequency of this term throughout the document collection (idf). In Table 1, we provide some commonly used tf-idf features, following the de facto SMART [19] standard representations.

A retrieval experiment can be characterized by a pair of triples, ddd-qqq, where the first triple corresponds to the term weighting used for the document collection, and the second triple corresponds to the query term weighting. Each triple refers to a tf, an idf and a normalization function (cf. Table 1).

Depending on the collection, it is possible to calculate a posteriori the best weighting scheme. In these experiments, we limit our exploration to a core parameter, which plays an important role when applying IR systems to corrupted collections: we evaluate the system with and without cosine normalization. Cosine normalization is strictly applied at the level of the document collection. Since normalization of query term weights merely acts as a scaling factor for all the query-document similarities, and has no effect on the relative ranking of the documents, there was no need to vary the normalization factor for the query term weights.

³ The original CFC is available at http://www.sims.berkeley.edu/~hearst/irbook/cfc.html. The study was conducted on the English language; however, the results reported here would be similar for most European languages, except for highly agglutinative ones (such as Finno-Hungarian languages).

Table 1
Term weights in the SMART system

Term frequency: first letter, f(tf)
n (natural)      tf
l (logarithmic)  1 + log(tf)
a (augmented)    a + b * (tf / max(tf)), with a + b = 1

Inverse document frequency: second letter, f(1/df)
n (no)    1
t (full)  log(N / df)

Normalization: third letter, f(length)
n (no)      1
c (cosine)  1 / sqrt(w1^2 + w2^2 + ... + wn^2)
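To make the notation concrete, the following minimal Python sketch implements the document-side weightings used below, 'atn' (augmented tf, full idf, no normalization) and its cosine-normalized counterpart 'atc'; the function names and data structures are ours for illustration, not the paper's engine.

import math
from collections import Counter

def atn_weights(doc_tokens, df, n_docs):
    # 'a': augmented tf with a = b = 0.5 (cf. footnote 7);
    # 't': full idf, log(N/df); 'n': no normalization.
    tf = Counter(doc_tokens)
    max_tf = max(tf.values())
    return {t: (0.5 + 0.5 * f / max_tf) * math.log(n_docs / df[t])
            for t, f in tf.items()}

def cosine_normalized(weights):
    # 'c': divide each weight by sqrt(w1^2 + ... + wn^2),
    # turning 'atn' into 'atc'.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

In the ddd-qqq notation, atn-ntn thus means that document terms are weighted with atn, while query terms use the natural tf with full idf and no normalization.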


2.3. Spelling correction

Spelling correction is processed by computing a string edit distance between a given token and the items of a lexical list (see [14,9] for a survey of the probabilistic models of spelling). This simple approach can be improved by additional modules using a word frequency list and/or some additional contextual information (word-language model, part-of-speech tagger, etc.), as in [16].

Spelling checkers are usually applied in interaction with a human; in this context, the correcting system returns a ranked list of candidates for every misspelled token. Applying a spelling corrector as a batch task may result in replacing a hypothetical misspelled token with a wrong candidate. Capitalizing on an improved spelling checker [16], which returns the right candidate at the top of the list with a probability of 96% (versus 90% for traditional systems), we expect results close to retrieval applied to well-spelled corpora.

In these experiments, we used a dictionary of 200 000 items, which provides good coverage of both general and medical English. We set up a confidence threshold in order to avoid replacing misspelled words with bad candidates: if the score of a candidate exceeds a certain edit distance⁴, the replacement step is skipped. The underlying idea is that we prefer to keep a misspelled word rather than replace it with a wrong word⁵.
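As an illustration of this batch policy, here is a minimal Python sketch assuming a plain Levenshtein distance and the length-ratio threshold of footnote 4; the names and the lexicon handling are ours, and the actual system of [16], which also exploits word frequencies and part-of-speech information, is considerably richer.

def edit_distance(a, b):
    # Dynamic-programming Levenshtein distance (insertion, deletion
    # and replacement all cost 1).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # replacement
        prev = cur
    return prev[-1]

def correct(token, lexicon, max_ratio=0.335):
    # Keep a misspelled word rather than replace it with a wrong one:
    # skip the replacement when no candidate is close enough.
    if token in lexicon:
        return token
    best = min(lexicon, key=lambda w: edit_distance(token, w))
    if edit_distance(token, best) / len(token) <= max_ratio:
        return best
    return token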

2.4. OCR retrieval

Since misspellings have rarely, if ever, been studied in the IR framework, investigating retrieval under misspellings obliges us to refer to the field of IR applied to optical character recognition (OCR), whose most interesting conclusions are summarized below (see [23] for an application, and [13] for a more theoretical foundation).

2.4.1. Representation items

The TREC-5 confusion track used a set of 49 known-item tasks to study the impact of data corruption (two corruption rates were applied: 5 and 20%) on retrieval system performance. A known-item search simulates a user seeking a unique, partially-remembered document in the collection. While there are obvious differences between known-item and ad hoc retrieval tasks, it is interesting to notice that retrieval methods that attempted a probabilistic reconstruction of the original text fared better than methods that simply accepted corrupted versions of the query text [10].

2.4.2. Cosine normalization sensitivity

Weighting functions use the occurrence statistics of words (or any other document representation items) in the documents to assign importance to different words. As the occurrence statistics of words can change substantially due to OCR errors, weighting schemes are especially sensitive to degradation in the quality of the input text. The authors observe that cosine normalization, which is commonly used to improve vector-space IR, must be handled carefully when working with corrupted collections. As a preliminary task, we will therefore have to decide whether cosine normalization can be used to advantage with our collection, i.e. whether it brings any improvement on the original CF collection as compared with a normalization-free weighting scheme.

⁴ We take the ratio between the number of edit operations and the length of the word. We performed some test runs and set the final value to 0.335, which means that about three edit operations are allowed for a ten-character word.
⁵ However, we could imagine a system that would keep the original word in the query or in the document in addition to the replacement candidate. The IR task as a whole is very resilient to improper expansion of terms; therefore, more than one candidate could have been added.



2.5. Corruption model

We investigate the effect of misspellings both at the document and at the query level. Recent studies report a high rate of misspellings in web user queries. Thus, [26] measures the rate of misspelled words in a large collection of queries issued to a medical web-based IR system and observes a rate of 7% for expert users (healthcare professionals), which reaches 15% for non-expert users. Our own investigation [18], carried out on patient records, showed a minimal error rate of 3%, which goes up to 10% for some documents.

Finally, we corrupted each query and each document using two different rates:

- 3% for the document collection;
- 15% for the query collection.

A 15% rate means that a spelling error was introduced, on average, every 6.66 words. To achieve this goal, we define a corruption model consistent with Damerau's seminal research. Damerau [5] showed that 80% of misspellings can be generated from a correct spelling by a few simple rules:

- transposition of two adjacent letters: heaptitis;
- insertion of a letter: heppatitis;
- deletion of a letter: hepattis;
- replacement of a letter by another one: hepatotis.

In addition to this first model⁶, which was applied in 80% of the cases, we introduced via a second process another 20% of errors, which could not be produced by such a model, mainly to approximate sound-alike errors (the character i is more likely to be replaced by the character y than by the character q).
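The four rules translate directly into a small corruption routine. The sketch below is ours and covers only the Damerau part of the model; the second, sound-alike process (the remaining 20% of errors) is omitted.

import random

def damerau_corrupt(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    # Apply one of Damerau's four single-error operations at a random
    # position (a transposition at the last position falls back to a
    # replacement).
    i = random.randrange(len(word))
    op = random.choice(["transpose", "insert", "delete", "replace"])
    if op == "transpose" and i < len(word) - 1:
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    if op == "insert":
        return word[:i] + random.choice(alphabet) + word[i:]
    if op == "delete" and len(word) > 1:
        return word[:i] + word[i + 1:]
    return word[:i] + random.choice(alphabet) + word[i + 1:]

def corrupt_text(words, rate):
    # A 15% rate introduces, on average, one error every 6.66 words.
    return [damerau_corrupt(w) if random.random() < rate else w
            for w in words]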

3. Method

First, we attempt to select a good weighting function for our collection. While a more systematic evaluation would have to be performed in order to select the best scheme, such a study would go far beyond the scope of this preliminary work; we therefore concentrate on the atn/atc parameters⁷, which are reported to perform well on heterogeneous collections [20].

3.1. Relevance scoring

In the CF collection, each query is provided with a ranked list of relevant documents. The ranking is provided by four experts along three relevance levels (0, not relevant; 1, moderately relevant; 2, very relevant), and results in a final relevance score ranging from 0 to 8. For this study, the fine-grained relevance score was mapped onto binary values (relevant or irrelevant) in order to be evaluated in a TREC-like style using the TRECEVAL⁸ evaluation program.
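The binarization step can be sketched as follows, producing judgement lines in the qrels format read by the TRECEVAL program; the threshold (any summed score of 1 or more counts as relevant) is our assumption, since the paper only states that the mapping is binary.

def binarize(score, threshold=1):
    # Map the CF 0-8 summed expert score onto a binary relevance value;
    # the threshold is illustrative, not taken from the paper.
    return 1 if score >= threshold else 0

def qrels_line(qid, docno, score):
    # One judgement line in the standard qrels layout: qid iter docno rel.
    return f"{qid} 0 {docno} {binarize(score)}"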

3.2. IR parameters

Every experiment is conducted with word conflation and a list of English stop-words.

⁶ The production of an automatic corruption model that would be consistent with human errors, in order to assess a correction system, is a complex task, and we do not claim to have a corruption model representative of human behaviour.

⁷ With a = b = 0.5 for the tf factor.
⁸ Available at: ftp://ftp.cs.cornell.edu/pub/smart/.


Table 2 provides the index size calculated on each document collection: it is interesting to see how a low corruption rate can severely affect the main matching instrument of the IR engine: the index for the corrupted collection is about twice as large (+93%) as the one for the original collection. This phenomenon is well documented in research conducted on OCR tools, and [24] notices a huge (4-fold) increase in dictionary size for a degraded text collection. This preliminary observation suggests that efficient index pruning strategies⁹ should take the effect of misspellings into account in order to reduce index size.
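The observation of Table 2 is easy to reproduce on any collection; in the sketch below, whitespace tokenization and lowercasing are crude stand-ins for the engine's actual stemming and stop-word steps.

def vocabulary(docs, stem=str.lower):
    # Set of index terms over a list of document strings.
    return {stem(w) for doc in docs for w in doc.split()}

def index_growth(original_docs, corrupted_docs):
    # Return the two vocabulary sizes and the relative growth in percent.
    v0 = vocabulary(original_docs)
    v1 = vocabulary(corrupted_docs)
    return len(v0), len(v1), 100.0 * (len(v1) - len(v0)) / len(v0)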

3.2.1. Weighting schemes selection

As explained in the introduction, weighting schemes are a major component of any IR engine and a central issue for retrieval in misspelled collections. We started by comparing retrieval effectiveness with and without cosine normalization on the original collection (well-spelled queries and well-spelled documents). We measure the interpolated average precision (11-pt, with N = 200). In Table 3, we observe that cosine normalization results in a moderate degradation at every recall point.

Therefore, atn-ntn will now serve as the baseline for assessing the effect of misspellings, as well as the effect of automatic spelling correction, on an ad hoc retrieval task. We must note that the difference between weighting models depends on the collection profile, and we do not claim that better weighting schemes cannot be selected for the CF collection¹⁰. However, we assume that IR effectiveness will be qualitatively affected in a similar fashion whatever weighting scheme is applied. This result allows us to restrict our study to normalization-free weighting functions, which are in any case reported to be less sensitive to 'garbage strings'.

4. Effects of misspellings

The effects of misspelled words are measured in three modes:

- only the document collection is corrupted (Table 4);
- only the query collection is corrupted (Table 5);
- the whole CF collection is corrupted (Table 6).

Table 2
Index size and number of relevant documents over all queries

#stems in the original collection       6035
#stems in the misspelled collection     11677 (+93%)
#relevant documents over all queries    4801

⁹ See [2] for some recent developments on the question.

Table 3
Comparison atn-ntn vs. atc-ntn

                     atn-ntn   atc-ntn
Relevant retrieved   2249      2205

Interpolated recall-precision averages
at 0.00   0.8679   0.8290
at 0.10   0.6411   0.6219
at 0.20   0.5113   0.5033
at 0.30   0.3779   0.3470
at 0.40   0.2606   0.2369
at 0.50   0.1742   0.1680
at 0.60   0.0924   0.0894
at 0.70   0.0406   0.0332
at 0.80   0.0145   0.0140
at 0.90   0.0017   0.0013
at 1.00   0.0005   0.0000

Average precision (non-interpolated) over all relevant docs
          0.2406 (100%)   0.2293 (95%)

¹⁰ More recent weighting schemes, such as SMART Lnu-ltc or Okapi BM25, could be applied as well.


As expected, the maximal degradation is observed when both documents and queries are corrupted (Table 6), with an average precision falling to 18%, i.e. 25% worse than for the original collection. Moreover, at every corruption level, the silence grows: from 2249 relevant documents retrieved originally (Table 3, atn-ntn) down to 2227 (Table 4), 2074 (Table 5) and finally 2045 (Table 6); i.e. misspellings cause between 22 and 204 relevant documents to disappear from the results.

Moreover, precision is almost uniformly affected, whatever interpolated recall point is considered. Even when only the document collection is corrupted (with a low corruption ratio, 3%), we observe that the overall average precision decreases by 4%. The degradation is even larger when comparing average precision between Tables 5 and 6, where precision drops by about 7% (the ratio between the two average precisions, 19.56 and 18.21%: 1 - 0.1821/0.1956 ≈ 7%).

5. Results

In Table 7, we measure the effect of applying the spelling corrector to the corrupted query and document collections, and observe the improvement in comparison with Table 6, using the common baseline (calculated on the original collection) provided in Table 3 (atn-ntn). It shows that the average precision (96%) achieved with the fully-automatic correction system is only 4% lower than precision on the original collection. Finally, the total number of relevant documents retrieved is the same (2227) as for a system

Table 4
Results with misspelled documents

Relevant retrieved   2227

Interpolated recall-precision averages
at 0.00   0.8479
at 0.10   0.6270
at 0.20   0.4927
at 0.30   0.3632
at 0.40   0.2452
at 0.50   0.1742
at 0.60   0.0932
at 0.70   0.0370
at 0.80   0.0154
at 0.90   0.0018
at 1.00   0.0005

Average precision (non-interpolated) over all relevant docs
          0.2325 (96%)

Table 5
Results with misspelled queries

Relevant retrieved   2074

Interpolated recall-precision averages
at 0.00   0.7537
at 0.10   0.5364
at 0.20   0.4127
at 0.30   0.3090
at 0.40   0.2069
at 0.50   0.1386
at 0.60   0.0691
at 0.70   0.0290
at 0.80   0.0077
at 0.90   0.0032
at 1.00   0.0005

Average precision (non-interpolated) over all relevant docs
          0.1956 (81%)

Table 6
Results with misspelled queries and documents

Relevant retrieved   2045

Interpolated recall-precision averages
at 0.00   0.7276
at 0.10   0.5050
at 0.20   0.3888
at 0.30   0.2901
at 0.40   0.1824
at 0.50   0.1278
at 0.60   0.0664
at 0.70   0.0246
at 0.80   0.0070
at 0.90   0.0019
at 1.00   0.0005

Average precision (non-interpolated) over all relevant docs
          0.1821 (76%)


applied to a document collection with 3% of misspelled words (Table 4).

6. Conclusion

We showed that misspellings do affect retrieval effectiveness, even at very low corruption levels: a 3% error rate reduces average precision by about 4-7%. In the worst case (when both queries and documents are corrupted), average precision decreases by 25%. We also showed that, with the help of an automatic spelling corrector, the retrieval degradation can be limited to a few percent only (less than 5%).

7. Future work

From a more global point of view, merging a fully-automatic spelling corrector with a retrieval engine requires partially rethinking both components. In this paper, we showed how both systems can interact in order to improve retrieval; however, we believe that a less additive and more synthetic interaction could bring substantial added value. Therefore, two future research directions deserve mention.

7.1. Weighting and string normalization

The spell-checker we capitalized on could be adapted to the collection the engine has to index. In this case, the correction could take advantage of the tf-idf information. The idea is the following: when, in the same document, a token A is found to be similar to a token B with respect to the string distance, and the frequency of A is high while the frequency of B is low, then A is more likely to be the well-spelled instance of B. This hypothesis could also be applied at the stemming level, since a misspelled instance of a given type cannot be normalized properly, i.e. mapped to the right stem.
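A sketch of this frequency heuristic, using difflib as a stand-in string-similarity measure; the factor-of-two frequency ratio and the similarity threshold are our illustrative choices, not values from the paper.

from collections import Counter
from difflib import SequenceMatcher

def pick_canonical(doc_tokens, min_similarity=0.8):
    # Among similar token pairs within one document, treat the rare form
    # as a probable misspelling of the frequent one.
    tf = Counter(doc_tokens)
    repl = {}
    for rare, f_rare in tf.items():
        for freq, f_freq in tf.items():
            if (f_freq > 2 * f_rare and
                    SequenceMatcher(None, rare, freq).ratio() >= min_similarity):
                repl[rare] = freq
    return repl

For instance, pick_canonical("hepatitis hepatitis hepatitis hepattis".split()) maps 'hepattis' to 'hepatitis'.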

7.2. Named-entity recognition

Another necessary step in order to build an IR system able to work on patient records concerns the handling of non-lexical items. Indeed, named entities (NEs) such as patient and doctor names are frequent in patient records (see [17]), and need to be recognized so that they can be ignored by the spelling corrector.

Acknowledgements

We would like to thank Melanie Hilario, Christian Pellegrini and Vincenzo Pallotta for their comments and suggestions on an earlier version of this paper. This study has been partially supported by the Swiss National Science Foundation (SNF 3200-065228.01).

Table 7
Results for corrected queries and documents

Relevant retrieved   2227

Interpolated recall-precision averages
at 0.00   0.8501
at 0.10   0.6217
at 0.20   0.4962
at 0.30   0.3649
at 0.40   0.2474
at 0.50   0.1645
at 0.60   0.0886
at 0.70   0.0384
at 0.80   0.0145
at 0.90   0.017
at 1.00   0.0005

Average precision (non-interpolated) over all relevant docs
          0.2319 (96%)


References

[1] A. Aizawa, Linguistic techniques to improve the performance of automatic text categorization, Proceedings of the Sixth Natural Language Processing Pacific Rim Symposium (NLPRS2001), 2001, pp. 307-314.
[2] D. Carmel, D. Cohen, R. Fagin, E. Farchi, M. Herscovici, Y. Maarek, A. Soffer, Static index pruning for information retrieval systems, Proceedings ACM-SIGIR, 2001, pp. 43-50.
[3] L. de Bruijn, A. Hasman, J. Arends, Automatic SNOMED classification: a corpus-based method, Yearbook of Medical Informatics, 1999.
[4] P. Franz, A. Zaiss, S. Schulz, U. Hahn, R. Klar, Automated coding of diagnoses: three methods compared, AMIA Symposium Proceedings, 2000.
[5] C. Friedman, P. Kra, M. Krauthammer, H. Yu, A. Rzhetsky, GENIES: a natural-language processing system for the extraction of molecular pathways from journal articles, Bioinformatics 17 (Suppl. 1) (2001) 74-82.
[6] U. Hahn, M. Romacker, S. Schulz, Creating knowledge repositories from biomedical reports: the MEDSYNDIKATE text mining system, Pacific Symposium on Biocomputing 7 (2002) 338-349.
[7] I. Iliopoulos, A. Enright, C. Ouzounis, TextQuest: document clustering of MedLine abstracts for concept discovery in molecular biology, Pacific Symposium on Biocomputing 6 (2001) 384-395.
[8] T. Joachims, Making large-scale support vector machine learning practical, in: B. Schölkopf, C. Burges, A. Smola (Eds.), Advances in Kernel Methods: Support Vector Machines, MIT Press, Cambridge, MA, 1998.
[9] D. Jurafsky, J. Martin, Speech and Language Processing, Prentice Hall, London, 2000.
[10] P. Kantor, E. Voorhees, The TREC-5 confusion track: comparing retrieval methods for scanned text, Information Retrieval (2000) 165-176.
[11] J. Klavans, S. Muresan, Evaluation of DEFINDER: a system to mine definitions from consumer-oriented medical text, JCDL'01, 2001.
[12] K. Kukich, Techniques for automatically correcting words in text, ACM Computing Surveys 24 (1992) 377-439.
[13] E. Mittendorf, P. Schauble, Measuring the effects of data corruption on information retrieval, SDAIR Proceedings, 1996.
[14] J. Peterson, Computer programs for detecting and correcting spelling errors, Communications of the ACM 23 (12) (1980).
[15] S. Robertson, K.S. Jones, Relevance weighting of search terms, Journal of the American Society for Information Science 27 (3) (1976) 129-146.
[16] P. Ruch, R. Baud, A. Geissbuhler, C. Lovis, A. Rassinoux, A. Riviere, Using part-of-speech and word-sense disambiguation for boosting string-edit distance spelling correction, Lecture Notes in Artificial Intelligence 2101 (2001) 249-257.
[17] P. Ruch, R. Baud, A. Rassinoux, P. Bouillon, G. Robert, Medical document anonymisation with a semantic lexicon, Journal of the American Medical Informatics Association (Symposium Suppl.) (2000) 729-733.
[18] P. Ruch, A. Gaudinat, Comparing corpora and lexical disambiguation, ACL Workshop on Comparing Corpora Proceedings, 2001.
[19] G. Salton, The SMART Retrieval System: Experiments in Automatic Document Processing, Prentice Hall, 1971.
[20] G. Salton, C. Buckley, Term-weighting approaches in automatic text retrieval, Information Processing and Management 24 (5) (1988) 513-523.
[21] G. Salton, M. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, 1983.
[22] H. Shatkay, S. Edwards, W. Wilbur, M. Boguski, Genes, themes and microarrays: using information retrieval for large-scale gene analysis, Proceedings of the International Conference on Intelligent Systems for Molecular Biology 8 (2000) 317-328.
[23] A. Singhal, G. Salton, C. Buckley, Length normalization in degraded text collections, Technical Report TR95-1507, Cornell University, 1995.
[24] K. Taghva, J. Borsack, A. Condit, Results of applying probabilistic IR to OCR text, ACM-SIGIR, 1994, pp. 202-211.
[25] Y. Yang, X. Liu, A re-examination of text categorization methods, ACM-SIGIR, 1999, pp. 42-49.
[26] Q. Zeng, Patient and clinician vocabulary: how different are they?, MedInfo 2001 Proceedings, 2001.
