Evaluating and reducing the effect of data corruption when applying bag of words approaches to medical records
P. Ruch a,b,*, R. Baud b, A. Geissbuhler b
a Theoretical Computer Science Laboratory, Swiss Federal Institute of Technology, Lausanne, Switzerland
b Medical Informatics Division, University Hospital of Geneva, Geneva, Switzerland
Abstract
Unlike journal corpora, which are supposed to be carefully reviewed before being published, the
documents in a patient record are often corrupted by misspelled words and by conventional graphies or abbreviations.
After a survey of the domain, the paper focuses on evaluating the effect of such corruption on an information retrieval
(IR) engine. The IR system uses a classical bag of words approach, with stems as representation items and term
frequency-inverse document frequency (tf-idf) as weighting scheme; we pay special attention to the normalization
factor. First results show that even low corruption levels (3%) affect retrieval effectiveness (by 4-7%), whereas higher
corruption levels can degrade retrieval effectiveness by 25%. We then show that an improved automatic
spelling correction system, applied to the corrupted collection, can almost restore the retrieval effectiveness of the
engine.
© 2002 Elsevier Science Ireland Ltd. All rights reserved.
Keywords: Corruption; Information retrieval; Medical records; Spelling correction; Natural language processing
1. Introduction
Information retrieval (IR) is familiar to most of us thanks to the development of the Web and the availability of large medical electronic libraries. Although IR over textual material is probably the main application field of retrieval technologies, other domains, such as text mining in molecular biology [22,7] or knowledge extraction in medicine [4,3], often rely on the same vector space architecture, which transforms any linearly ordered set of words (such as a sentence or a document) into an unordered set (hence the so-called bag of words (BoW) approach). While any word-based system (or one using a variant of words, such as stems) relying on the BoW transformation falls within the scope of this study¹, we believe that even more elaborate systems [11,5,1,6], which use linguistically motivated approaches, are also likely to be affected by data corruption. In this paper, we argue that spelling errors are a
* Corresponding author. Tel.: +41-21-6936665.
E-mail address: [email protected] (P. Ruch).
¹ Including text categorization systems using advanced machine learning instruments [25,8].
International Journal of Medical Informatics 67 (2002) 75�/83
www.elsevier.com/locate/ijmedinf
1386-5056/02/$ - see front matter © 2002 Elsevier Science Ireland Ltd. All rights reserved.
PII: S1386-5056(02)00057-6
major challenge for most IR systems, and evaluate the improvement brought by merging such a system with a fully-automatic spelling corrector. Indeed, while research experiments are usually conducted on well-spelled corpora (MedLine abstracts, or newswire collections), applying such research to real IR and text mining tasks conducted on patient records implies designing systems able to handle misspellings.

Indeed, the corruption introduced by misspelled words plays a particular role when we attempt to build an IR application dedicated to retrieving information in an electronic patient record (EPR). Documents that constitute the patient record are not meant to be published², and are therefore especially rich in misspellings as compared with most IR collections. In addition, documents of the patient file are often dictated; typos are therefore often introduced at the transcription level, from speech to text.

Due to the complexity of the task, which involves several heterogeneous representation levels (syntax, semantics, phonetics, keyboard configuration, user profile, etc.), automatic spell checkers usually perform it in interaction with the user. However, in some settings, such as IR, interactive spelling correction is simply not an option [12].

While some short queries entered manually can be corrected with the assistance of their authors, such assistance is unavailable when we consider the task of retrieving similar cases in patient file warehouses using a given patient file as a query! In this case, the query ranges from a few sentences of a single document to hundreds of lexically rich documents, so that user-assisted spelling correction is clearly impossible. A fortiori, spelling errors occurring in the document collection remain uncorrected.
Moreover, misspellings not only create 'garbage strings', which increase silence, but also corrupt the general document view formed by an IR system, and can therefore substantially hinder the successful retrieval of documents relevant to user queries. Indeed, most modern IR systems use sophisticated term weighting functions to assign importance to the individual words (or any other chosen items) of a document for document-content representation [15,19-21], and these term weighting functions can be more or less dependent on the corruption of the collection.
The goal of our work is to show how spelling correction can be performed in a fully automatic manner in order to reach performances similar to those of retrieval applied to non-corrupted collections. Considering the preliminary state of this study, we will focus on:

. modeling IR in a case-based database (CBD) or an EPR;
. measuring the effects of spelling errors on retrieval effectiveness, both at the query and document levels;
. providing evidence that IR can be improved by using fully-automatic spelling correction.
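As a minimal illustration of the BoW transformation discussed above (a hypothetical sketch, not the authors' implementation), a document is reduced to an unordered multiset of terms:

```python
from collections import Counter

# Illustrative stopword list; the experiments use a full English stopword list.
STOPWORDS = frozenset({"the", "a", "of", "in", "are", "is"})

def bag_of_words(text):
    """Transform a linearly ordered word sequence into an unordered bag."""
    tokens = (t.lower().strip(".,;:!?") for t in text.split())
    return Counter(t for t in tokens if t and t not in STOPWORDS)

doc = "Cystic fibrosis affects the lungs; the lungs are damaged."
bag = bag_of_words(doc)  # word order is discarded, only counts remain
```

Once a document is reduced to such a bag, a single misspelled token simply becomes a new, spurious index term, which is the corruption effect studied in this paper.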
2. Background and design options
Assessing the effect of misspelled strings on an IR system applied to medical texts implies synthesizing knowledge from at least three different origins, which we briefly summarize:
. medical corpora,
. IR,
. spelling correction.
² A notable exception concerns the discharge summaries, which are likely to be sent outside the institution, and therefore, are more carefully written than documents that are to be used internally.
2.1. Choice of a collection
In order to evaluate the effects of misspellings on an IR system likely to be applied to an EPR, and given the absence of any publicly available EPR collection, we decided to use the cystic fibrosis (CF) collection³. The CF collection contains 1239 documents and 100 queries. It was chosen for this study because of its manageable size, and because its query collection (with sometimes more than 30 tokens per query) can be regarded as a set of short documents, and therefore better simulates an engine that will have to accept document extracts as queries.
2.2. Information retrieval
Any IR system defines three basic elements:

. the document and query representation;
. the matching function;
. a ranking criterion.

For the representation, we decided to use stems, while the last two elements depend on the selected weighting features.
2.2.1. Weighting schemes
IR engines using the vector space approach are usually based on a variant of the term frequency-inverse document frequency (tf-idf) family [20]. This approach states that the weight of a given term grows with the frequency of this term in a given document (i.e. tf), and is inversely proportional to the frequency of this term throughout the document collection (idf). In Table 1, we provide some commonly used tf-idf features, following the de facto SMART [19] standard representations.

A retrieval experiment can be characterized by a pair of triples, ddd-qqq, where the first triple corresponds to the term weighting used for the document collection, and the second triple corresponds to the query term weighting. Each triple refers to a tf, an idf and a normalization function (cf. Table 1).

Depending on the collection, it is possible to calculate a posteriori the best weighting scheme. In these experiments, we limit our exploration to a core parameter, which plays an important role when applying IR systems to corrupted collections: we evaluate the system with and without cosine normalization. Cosine normalization is applied strictly at the level of the document collection. Since normalization of query term weights merely acts as a scaling factor for all the query-document similarities, and has no effect on the relative ranking of the documents, there was no need to vary the normalization factor for the query term weights.
³ The original CFC is available at http://www.sims.berkeley.edu/~hearst/irbook/cfc.html. The study was conducted on English; however, the results reported here would be similar for most European languages, except for highly agglutinative ones (such as Finno-Ugric languages).
Table 1
Term weights in the SMART system

Term frequency: first letter, f(tf)
n (natural): tf
l (logarithmic): 1 + log(tf)
a (augmented): a + b · (tf/max(tf)), with a + b = 1

Inverse document frequency: second letter, f(1/df)
n (no): 1
t (full): log(N/df)

Normalization: third letter, f(length)
n (no): 1
c (cosine): 1/sqrt(w1^2 + w2^2 + ... + wn^2)
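The letter codes of Table 1 can be sketched as follows (a hypothetical illustration; `a`, `b`, `tf`, `df` and `N` are as defined in the table, not taken from the authors' code):

```python
import math

def atn_weight(tf, max_tf, df, N, a=0.5, b=0.5):
    """'atn': augmented tf ('a'), full idf ('t'), no normalization ('n')."""
    return (a + b * tf / max_tf) * math.log(N / df)

def cosine_normalize(weights):
    """'c': scale a weight vector to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights] if norm else weights
```

For instance, a term occurring twice in a document whose most frequent term also occurs twice, and appearing in 1 of N = 10 documents, receives the weight (0.5 + 0.5 · 1) · log(10) = log(10).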
2.3. Spelling correction
Spelling correction is performed by computing a string edit distance between a given token and the items of a lexical list (see [14,9] for a survey of the probabilistic models of spelling). This simple approach can be improved by additional modules using a word frequency list and/or some additional contextual information (word-language model, part-of-speech tagger, etc.), as in [16].

Spelling checkers are usually applied in interaction with a human; in this context the correcting system returns a ranked list of candidates for every misspelled token. Applying a spelling corrector as a batch task may result in replacing a hypothetically misspelled token with a wrong candidate. Capitalizing on an improved spelling checker [16], which returns the correct candidate at the top of the list with a probability of 96%, versus 90% for traditional systems, we expect results close to those of retrieval applied to well-spelled corpora.

In these experiments, we used a 200 000-item dictionary that gives good coverage of both general and medical English. We set up a confidence threshold in order to avoid the replacement of misspelled words by bad candidates: if a candidate's edit-distance ratio exceeds a certain threshold⁴, the replacement step is skipped. The underlying idea is that we prefer to keep a misspelled word rather than replace it with a wrong word⁵.
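A minimal sketch of the edit-distance lookup with the confidence threshold of footnote 4 (the authors' actual corrector [16] additionally uses part-of-speech and word-sense information, which is not shown here):

```python
def edit_distance(s, t):
    """Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (cs != ct)))  # substitution
        prev = cur
    return prev[-1]

def correct(token, lexicon, max_ratio=0.335):
    """Replace a token by its closest lexicon entry, unless the
    edit-distance / word-length ratio exceeds the threshold."""
    if token in lexicon:
        return token
    best = min(lexicon, key=lambda w: edit_distance(token, w))
    if edit_distance(token, best) / len(token) <= max_ratio:
        return best
    return token  # keep the misspelling rather than pick a wrong word
```

With `lexicon = {"hepatitis", "fibrosis"}`, `correct("heppatitis", lexicon)` returns `hepatitis` (one deletion over ten characters, ratio 0.1), while a string far from every entry is left untouched.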
2.4. OCR retrieval
Since misspellings have rarely, if ever, been studied in the IR framework, investigating retrieval and misspellings obliges us to refer to the field of IR applied to optical character recognition (OCR), whose most interesting conclusions are summarized in the following (see [23] for an application, and [13] for a more theoretical foundation).
2.4.1. Representation items
The TREC-5 confusion track used a set of 49 known-item tasks to study the impact of data corruption (two corruption rates were applied: 5 and 20%) on retrieval system performance. A known-item search simulates a user seeking a unique, particular, partially-remembered document in the collection. Although there are obvious differences between known-item and ad hoc retrieval tasks, it is interesting to note that retrieval methods that attempted a probabilistic reconstruction of the original text fared better than methods that simply accepted corrupted versions of the query text [10].
2.4.2. Cosine normalization sensitivity
Weighting functions use the occurrence statistics of words (or any other document representation item) in the documents to assign importance to different words. As the occurrence statistics of words can change substantially due to OCR errors, weighting schemes are especially sensitive to degradation in the quality of the input text. The authors observe that cosine normalization, which is commonly used to improve vector-space IR, must be handled carefully when working with corrupted collections. As a preliminary task, we will have to decide whether cosine normalization can be used advantageously with our collection, i.e. whether it would bring any improvement on the original CF collection as
⁴ We take the ratio between the number of edit operations and the length of the word. We performed some test runs and set the final value to 0.335, which means about three edit operations are allowed for a ten-character word.
⁵ However, we could imagine a system which would keep the original word in the query or in the document in addition to the replacement candidate. The entire IR task is very resilient to improper expansion of terms; therefore, more than one candidate could have been added.
compared with a normalization-free weighting
scheme.
2.5. Corruption model
We investigate the effect of misspellings
both at the document and at the query level.
Recent studies report a high rate of misspellings in web user queries. Thus, [26] measures the rate of misspelled words in a large collection of queries issued to a medical web-based IR system and observes a rate of 7% for expert users (healthcare professionals), which can reach 15% for non-expert users, while our own investigation [18], carried out on patient records, showed a minimal error rate of 3%, which goes up to 10% for some documents.

Finally, we corrupted each query and each document using two different rates:

. three percent for the document collection;
. fifteen percent for the query collection.

A 15% rate means that a spelling error was introduced, on average, every 6.66 words. To achieve this, we defined a corruption model consistent with Damerau's seminal research. Damerau [5] showed that 80% of misspellings can be generated from a correct spelling by a few simple rules:

. transposition of two adjacent letters: heaptitis;
. insertion of a letter: heppatitis;
. deletion of a letter: hepattis;
. replacement of a letter by another one: hepatotis.
In addition to this first model⁶, which was applied in 80% of the cases, we introduced, via a second process, another 20% of errors which could not be produced with such a model, mainly to approximate sound-alike errors (the character i is more likely to be replaced by the character y than by the character q).
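A Damerau-style corruption process consistent with the four rules above can be sketched as follows (a hypothetical re-implementation; the sound-alike second process is omitted):

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def damerau_corrupt(word, rng):
    """Apply one of Damerau's four elementary error types to a word."""
    i = rng.randrange(len(word) - 1)
    op = rng.choice(("transpose", "insert", "delete", "replace"))
    if op == "transpose":
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    if op == "insert":
        return word[:i] + rng.choice(ALPHABET) + word[i:]
    if op == "delete":
        return word[:i] + word[i + 1:]
    return word[:i] + rng.choice(ALPHABET) + word[i + 1:]  # replace

def corrupt_collection(words, rate, seed=0):
    """Introduce an error in roughly `rate` of the (long enough) words."""
    rng = random.Random(seed)
    return [damerau_corrupt(w, rng) if len(w) > 3 and rng.random() < rate
            else w for w in words]
```

At a rate of 0.15, an error lands, on average, every 6.66 words, matching the query corruption level used in the experiments.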
3. Method
First, we attempt to select a good weighting function for our collection. While a more systematic evaluation would have to be performed in order to select the best scheme, such a study would go far beyond the scope of this preliminary work; we therefore concentrate on the atn/atc parameters⁷, which are supposed to perform well on heterogeneous collections [20].
3.1. Relevance scoring
In the CF collection, each query is provided with a ranked list of relevant documents. The ranking is provided by four experts along three relevance levels (0, not relevant; 1, moderately relevant; 2, very relevant), and results in a final relevance score ranging from 0 to 8. For this study, the fine-grained relevance score was mapped onto binary values (relevant or irrelevant) in order to be evaluated TREC-style using the TRECEVAL⁸ evaluation program.
3.2. IR parameters
Every experiment is conducted with wordconflation and using a list of English stop-
⁶ The production of an automatic corruption model which would be consistent with human errors, in order to assess a correction system, is a complex task, and we do not claim to have a corruption model representative of human errors.
⁷ With a = b = 0.5 for the augmented tf factor.
⁸ Available at: ftp://ftp.cs.cornell.edu/pub/smart/.
words. Table 2 provides the index size calculated on each document collection: it is interesting to see how a low corruption rate can strongly affect the main matching instrument of the IR engine: the index for the corrupted collection is about twice as large (+93%) as that for the original one. This phenomenon is well documented in the research conducted on OCR tools, and [24] notices a huge (4-fold) increase in the dictionary size for a degraded text collection. This preliminary observation suggests that efficient index pruning strategies⁹ should take the effect of misspellings into account in order to reduce index sizes.
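The index growth reported in Table 2 is easy to reproduce in miniature (a toy sketch, not the paper's data): every new misspelled form adds a distinct term to the index.

```python
def index_size(docs):
    """Number of distinct terms across a document collection."""
    vocab = set()
    for doc in docs:
        vocab.update(doc.lower().split())
    return len(vocab)

clean = ["cystic fibrosis lung", "cystic fibrosis sweat test"]
noisy = ["cystic fibrosis lnug", "cysitc fibrosis swaet test"]
# The corrupted collection indexes extra, spurious terms.
```

Here the corrupted toy collection already indexes more distinct terms than the clean one, mirroring, at small scale, the +93% growth of Table 2.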
3.2.1. Weighting schemes selection
As explained in the introduction, weighting schemes are a major component of any IR engine and a central issue when retrieving in misspelled collections. We started by comparing retrieval effectiveness with and without cosine normalization on the original collection (well-spelled queries and well-spelled documents). We measure the interpolated average precision (11-point, with N = 200). In Table 3, we observe that cosine normalization results in a moderate degradation at every recall point.

Therefore, atn-ntn will now serve as a baseline for assessing the effect of misspellings, as well as the effect of automatic spelling correction, on an ad hoc retrieval task. We must note that the difference between weighting models depends on the collection profile, and we do not claim that better weighting schemes could not be selected for the CF collection¹⁰. However, we assume that IR effectiveness will be qualitatively affected in a similar fashion whatever weighting scheme is applied. This result allows us to restrict our study to normalization-free weighting functions, which are in any case reported to be less sensitive to 'garbage strings'.
4. Misspellings effects
The effect of misspelled words is measured along three modes:

. only the document collection is corrupted (Table 4);
. only the query collection is corrupted (Table 5);
. the whole CF collection is corrupted (Table 6).
Table 2
Index size and number of relevant documents over all queries

#stems in the original collection: 6035
#stems in the misspelled collection: 11 677 (+93%)
#relevant documents over all queries: 4801
⁹ See [2] for some recent developments on the question.
Table 3
Comparison of atn-ntn vs. atc-ntn

atn-ntn | atc-ntn
Relevant retrieved 2249 2205
Interpolated recall-precision averages
at 0.00 0.8679 0.8290
at 0.10 0.6411 0.6219
at 0.20 0.5113 0.5033
at 0.30 0.3779 0.3470
at 0.40 0.2606 0.2369
at 0.50 0.1742 0.1680
at 0.60 0.0924 0.0894
at 0.70 0.0406 0.0332
at 0.80 0.0145 0.0140
at 0.90 0.0017 0.0013
at 1.00 0.0005 0.0000
Average precision (non-interpolated ) over all rel docs
0.2406 (100%) 0.2293 (95%)
¹⁰ More recent weighting schemes, such as SMART Lnu-ltc or Okapi BM25, could be applied as well.
As expected, the maximal degradation is observed when both documents and queries are corrupted (Table 6), with the average precision falling to 18%, i.e. 25% worse than for the original collection. Moreover, at every corruption level, the silence grows: from 2249 relevant documents retrieved originally (Table 3, atn-ntn) down to 2227 (Table 4), 2074 (Table 5) and finally 2045 (Table 6); i.e. misspellings make at least 22 and at most 204 relevant documents disappear.
Moreover, precision is almost uniformly affected, whatever interpolated recall point is considered. Even when only the document collection is corrupted (with a low corruption ratio, 3%), we observe that the overall average precision decreases by 4%. The degradation is even bigger when comparing the average precision of Tables 5 and 6, where it affects precision by about 7% (we calculate the ratio between the two average precisions, 19.56 and 18.21%).
5. Results
In Table 7, we measure the effect of applying the spelling corrector to the corrupted query and document collections, and observe the improvement in comparison with Table 6, using the common baseline (calculated on the original collection) provided in Table 3 (atn-ntn). It shows that the average precision (96%) obtained with the fully-automatic correction system is only 4% lower than the precision on the original collection. Finally, the total number of relevant documents retrieved is the same (2227) as for a system
Table 4
Results with misspelled documents
Relevant retrieved 2227
Interpolated recall-precision averages
at 0.00 0.8479
at 0.10 0.6270
at 0.20 0.4927
at 0.30 0.3632
at 0.40 0.2452
at 0.50 0.1742
at 0.60 0.0932
at 0.70 0.0370
at 0.80 0.0154
at 0.90 0.0018
at 1.00 0.0005
Average precision (non-interpolated ) over all rel docs
0.2325 (96%)
Table 5
Results with misspelled queries
Relevant retrieved 2074
Interpolated recall-precision averages
at 0.00 0.7537
at 0.10 0.5364
at 0.20 0.4127
at 0.30 0.3090
at 0.40 0.2069
at 0.50 0.1386
at 0.60 0.0691
at 0.70 0.0290
at 0.80 0.0077
at 0.90 0.0032
at 1.00 0.0005
Average precision (non-interpolated ) over all rel docs
0.1956 (81%)
Table 6
Results with misspelled queries and documents
Relevant retrieved 2045
Interpolated recall-precision averages
at 0.00 0.7276
at 0.10 0.5050
at 0.20 0.3888
at 0.30 0.2901
at 0.40 0.1824
at 0.50 0.1278
at 0.60 0.0664
at 0.70 0.0246
at 0.80 0.0070
at 0.90 0.0019
at 1.00 0.0005
Average precision (non-interpolated ) over all rel docs
0.1821 (76%)
applied to a document collection with 3% of misspelled words (Table 4).
6. Conclusion
We showed that misspellings do affect retrieval effectiveness, even at very low corruption levels: a 3% error rate degrades the average precision by about 4-7%. In the worst case (when both queries and documents are corrupted), the average precision decreases by 25%. We also showed that, with the help of an automatic spelling corrector, the retrieval degradation is reduced to only a few percent (less than 5%).
7. Future work
From a more global point of view, merging a fully-automatic spelling corrector with a retrieval engine requires a partial rethinking of both components. In this paper, we showed how the two systems can interact in order to improve retrieval; however, we believe that a less additive and more integrated interaction could bring substantial added value. Therefore, two future research directions deserve to be mentioned.
7.1. Weighting and string normalization
The spell-checker we capitalized on could be adapted to the collection the engine has to index. In this case, the correction could take advantage of the tf-idf information. The idea is: when, in the same document, a token A is found to be similar to a token B with regard to the string distance, and the frequency of A is high while the frequency of B is low, then A is more likely to be the well-spelled instance of B. Such a hypothesis could also be applied at the stemming level, since a misspelled instance of a given type cannot be normalized properly, i.e. mapped to the right stem.
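The frequency heuristic just described could be sketched as follows (hypothetical; `margin` is an illustrative parameter, not one from the paper):

```python
from collections import Counter

def frequency_vote(token_a, token_b, counts, margin=5):
    """Given two string-similar tokens from the same document, prefer the
    clearly more frequent one as the well-spelled form."""
    fa, fb = counts[token_a], counts[token_b]
    if fa >= margin * max(fb, 1):
        return token_a
    if fb >= margin * max(fa, 1):
        return token_b
    return None  # frequencies too close: no confident decision

counts = Counter({"hepatitis": 12, "heppatitis": 1})
```

With the toy counts above, the heuristic prefers "hepatitis" as the well-spelled instance, and abstains when the two frequencies are comparable.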
7.2. Named-entity recognition
Another necessary step towards building an IR system able to work on patient records concerns the handling of non-lexical items. Indeed, named entities (NEs) such as patient and doctor names are frequent in patient records (see [17]), and would need to be recognized in order to be ignored by the spelling corrector.
Acknowledgements
We would like to thank Melanie Hilario, Christian Pellegrini and Vincenzo Pallotta for their comments and suggestions on an earlier version of this paper. This study has been partially supported by the Swiss National Science Foundation (SNF 3200-065228.01).
Table 7
Results for corrected queries and documents
Relevant retrieved 2227
Interpolated recall-precision averages
at 0.00 0.8501
at 0.10 0.6217
at 0.20 0.4962
at 0.30 0.3649
at 0.40 0.2474
at 0.50 0.1645
at 0.60 0.0886
at 0.70 0.0384
at 0.80 0.0145
at 0.90 0.0017
at 1.00 0.0005
Average precision (non-interpolated) over all rel docs
0.2319 (96%)
References
[1] A. Aizawa, Linguistic techniques to improve the performance of automatic text categorization, Proceedings of the Sixth Natural Language Processing Pacific Rim Symposium (NLPRS2001), 2001, pp. 307-314.
[2] D. Carmel, D. Cohen, R. Fagin, E. Farchi, M. Herscovici, Y. Maarek, A. Soffer, Static index pruning for information retrieval systems, Proceedings ACM-SIGIR, 2001, pp. 43-50.
[3] L. de Bruijn, A. Hasman, J. Arends, Automatic SNOMED classification: a corpus-based method, Yearbook of Medical Informatics, 1999.
[4] P. Franz, A. Zaiss, S. Schulz, U. Hahn, R. Klar, Automated coding of diagnoses: three methods compared, AMIA Symposium Proceedings, 2000.
[5] C. Friedman, P. Kra, M. Krauthammer, H. Yu, A. Rzhetsky, Genies: a natural-language processing system for the extraction of molecular pathways from journal articles, Bioinformatics 17 (Suppl. 1) (2001) 74-82.
[6] U. Hahn, M. Romacker, S. Schulz, Creating knowledge repositories from biomedical reports: the MedSynDiKATe text mining system, Pacific Symposium on Biocomputing 7 (2002) 338-349.
[7] I. Iliopoulos, A. Enright, C. Ouzounis, TextQuest: document clustering of MedLine abstracts for concept discovery in molecular biology, Pacific Symposium on Biocomputing 6 (2001) 384-395.
[8] T. Joachims, Making large-scale support vector machine learning practical, in: B. Scholkopf, C. Burges, A. Smola (Eds.), Advances in Kernel Methods: Support Vector Machines, MIT Press, Cambridge, MA, December 1998.
[9] D. Jurafsky, J. Martin, Speech and Language Processing, Prentice Hall, London, 2000.
[10] P. Kantor, E. Voorhees, The TREC-5 confusion track: comparing retrieval methods for scanned text, Information Retrieval (2000) 165-176.
[11] J. Klavans, S. Muresan, Evaluation of DEFINDER: a system to mine definitions from consumer-oriented medical text, JCDL'01, 2002.
[12] K. Kukich, Techniques for automatically correcting words in text, ACM Computing Surveys 24 (1992) 377-439.
[13] E. Mittendorf, P. Schauble, Measuring the effects of data corruption on information retrieval, SDAIR Proceedings, 1996.
[14] J. Peterson, Computer programs for detecting and correcting spelling errors, Communications of the ACM 23 (1980) 12.
[15] S. Robertson, K.S. Jones, Relevance weighting of search terms, Journal of the American Society for Information Science 27 (3) (1976) 129-146.
[16] P. Ruch, R. Baud, A. Geissbuhler, C. Lovis, A. Rassinoux, A. Riviere, Using part-of-speech and word-sense disambiguation for boosting string-edit distance spelling correction, Lecture Notes in Artificial Intelligence 2101 (2001) 249-257.
[17] P. Ruch, R. Baud, A. Rassinoux, P. Bouillon, G. Robert, Medical document anonymisation with a semantic lexicon, Journal of the American Medical Informatics Association (Symposium Suppl.) (2000) 729-733.
[18] P. Ruch, A. Gaudinat, Comparing corpora and lexical disambiguation, ACL Workshop on Comparing Corpora Proceedings, 2001.
[19] G. Salton, The SMART Retrieval System: Experiments in Automatic Document Retrieval, Prentice Hall, 1971.
[20] G. Salton, C. Buckley, Term-weighting approaches in automatic text retrieval, Information Processing and Management 24 (5) (1988) 513-523.
[21] G. Salton, M. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, 1983.
[22] H. Shatkay, S. Edwards, W. Wilbur, M. Boguski, Genes, themes and microarrays: using information retrieval for large-scale gene analysis, Proceedings of the International Conference on Intelligent Systems for Molecular Biology 8 (2000) 317-328.
[23] A. Singhal, G. Salton, C. Buckley, Length normalization in degraded text collections, Technical Report TR95-1507, Cornell University, 1995.
[24] K. Taghva, J. Borsack, A. Condit, Results of applying probabilistic IR to OCR text, ACM-SIGIR, 1994, pp. 202-211.
[25] Y. Yang, X. Liu, A re-examination of text categorization methods, ACM SIGIR, 1999, pp. 42-49.
[26] Q. Zeng, Patient and clinician vocabulary: how different are they?, MedInfo 2001 Proceedings, 2001.