
Text Retrieval Using Self-Organized Document Maps

KRISTA LAGUS
Helsinki University of Technology, Neural Networks Research Centre, P.O. Box 5400, FIN-02015 HUT, Finland. E-mail: krista.lagus@hut.fi

Abstract. A map of text documents arranged using the Self-Organizing Map (SOM) algorithm (1) is organized in a meaningful manner, so that items with similar content appear at nearby locations of the 2-dimensional map display, and (2) clusters the data, resulting in an approximate model of the data distribution in the high-dimensional document space. This article describes how a document map that is automatically organized for browsing and visualization can be successfully utilized also in speeding up document retrieval. Furthermore, experiments on the well-known CISI collection [3] show significantly improved performance compared to Salton's vector space model, measured by average precision (AP) when retrieving a small, fixed number of best documents. Regarding the comparison with Latent Semantic Indexing the results are inconclusive.

Key words: document maps, information retrieval, LSI, SOM, text mining

1. Introduction

People approach large information spaces with at least the following different kinds of motives: (1) to search for a specific piece of information or for information on a specific topic, (2) to gain familiarity with or an overview of some general topic or domain, or (3) to locate something that might be of interest, without a clear prior notion of what 'interesting' should look like. The field of Information Retrieval (IR) [1, 15] develops methods that focus on the first situation, whereas the latter motives are mainly addressed in approaches that concentrate on exploration and on visualizing properties of data (for an overview, see [5]). Often the motives and information needs of users alternate within actual sessions. Consequently, a balanced combination of visualization, exploration, and search tools and facilities is likely to be more useful in real systems than any single approach alone.

It has been shown previously in the WEBSOM project [6, 7, 11, 13] that using the Self-Organizing Map (SOM) algorithm [9, 10], very large text collections can be automatically organized onto document maps that are suitable for visualization and intuitive exploration of the information space. In a user study reported in [2], similar maps were found especially useful in a situation in which the users were mainly browsing for something interesting without a clear information need.

The construction and use of exploration models and search indices consumes processing time, memory, and disk space. Especially with very large collections that push the limits of computer systems, it is an advantage if a single set of data structures or indices can be utilized for all of the identified tasks related to text mining.

Neural Processing Letters 15: 21–29, 2002. © 2002 Kluwer Academic Publishers. Printed in the Netherlands.


Furthermore, in real systems any search and exploration methods must be computationally efficient. In particular, the delay perceived by users is critical. It is therefore important to develop methods that can speed up the search process while maintaining high perceived quality, particularly in the range of high precision and low recall, which is most crucial in actual user settings.

The two aspects of the vocabulary problem are well known: a single term may have many different meanings, and the same idea may be expressed in various ways, e.g. by using different terms. This can be viewed as a problem of noisy data on the one hand, and of missing data values on the other hand. In general, such problems can be approached by utilizing an appropriate model of the data space, created using methods that reduce noise and cope gracefully with missing values and partial matches.

In this Letter a computationally efficient method is presented that is suitable for performing information retrieval using document maps that were organized for exploration of document collections. Furthermore, experiments on the well-known CISI reference collection show that comparable or improved retrieval performance is obtained when compared to two standard methods, Salton's vector space model [16] and the LSI [4].

The organization of the paper is as follows: the introduced method is presented in Section 2, the experimental setup in Section 3, and the results in Section 4. First, the standard algorithms used in the experiments are briefly described.

1.1. THE SELF-ORGANIZING MAP (SOM)

The SOM [9, 10] is a nonlinear projection method that maps an originally high-dimensional data space onto a usually two-dimensional map grid in an orderly fashion. The map grid locations are associated with so-called reference vectors that act as local models of the closest data items and, to a lesser degree, of the neighboring map regions. By virtue of the SOM algorithm nearby map locations contain similar data items, enabling intuitive visualization of the data space. Furthermore, the reference vectors partition the data into subsets of similar items, performing a clustering of the data.

Numerous applications of the SOM are described in [10]; for an extended list see [8].

1.2. SALTON’S VECTOR SPACE MODEL (VSM)

In the vector space model [16] documents are represented as points in a t-dimensional space, where t is the size of the vocabulary, and the component d_ij of the document vector reflects the frequency of occurrence of term i in document j. Furthermore, the terms can be weighted in various ways, e.g. using a tf·idf scheme [14].
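As an illustration of this encoding (not part of the original experiments), the following Python sketch builds tf·idf document vectors and ranks documents for a query with the cosine measure; it assumes scikit-learn is available and uses a toy corpus in place of the CISI documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus standing in for the CISI documents (hypothetical data).
docs = [
    "information retrieval with vector space models",
    "self organizing maps cluster similar documents",
    "retrieval of documents using term weighting",
]

# Build tf-idf weighted document vectors; each row is one document,
# each column one vocabulary term (the t dimensions of the model).
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)

# A query is encoded with the same vocabulary and weighting,
# and compared to every document with the cosine measure.
query_vec = vectorizer.transform(["document retrieval"])
scores = cosine_similarity(query_vec, doc_vectors).ravel()
print(scores.argsort()[::-1])   # document indices, best match first
```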



1.3. LATENT SEMANTIC INDEXING (LSI)

Global relationships between words can be deduced from their co-occurrence patterns across documents. A method called Latent Semantic Indexing (LSI) [4] utilizes this property by applying singular-value decomposition to the document-by-word matrix to obtain a projection of both documents and words into a space called the latent space. Dimensionality reduction is achieved by retaining only the latent variables (projection dimensions) with the largest variance (largest eigenvalues). Subsequent distance calculations between documents or terms are then performed in the reduced-dimensional latent space.
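A minimal sketch of this projection, assuming a dense NumPy document-by-term matrix; the helper names (lsi_project, fold_in_query) and the random data are illustrative, and the paper's actual dimensionality (140, see Section 3.4) is replaced here by a small k.

```python
import numpy as np

def lsi_project(doc_term_matrix: np.ndarray, k: int):
    """Project documents into a k-dimensional latent space via truncated SVD.

    doc_term_matrix: documents x terms (e.g. tf-idf weighted).
    Returns document coordinates in the latent space and the latent term
    directions, which can also be used to fold in new queries.
    """
    # X = U * diag(s) * Vt; keep only the k largest singular values.
    U, s, Vt = np.linalg.svd(doc_term_matrix, full_matrices=False)
    doc_latent = U[:, :k] * s[:k]        # document coordinates (rows of X V_k)
    term_directions = Vt[:k, :]          # latent axes expressed in term space
    return doc_latent, term_directions

def fold_in_query(query_vec: np.ndarray, term_directions: np.ndarray):
    # A query (a term vector) is mapped onto the same latent axes: V_k^T q.
    return term_directions @ query_vec

# Example with random data standing in for a real document-term matrix.
X = np.random.rand(20, 50)
docs_k, axes = lsi_project(X, k=5)
q = fold_in_query(np.random.rand(50), axes)
```

Distances between documents, or between a query and the documents, are then computed on these low-dimensional coordinates instead of the full term vectors.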

2. Document Retrieval using the SOM

If the object of the search is stated as locating a small number n of best documents in the order of goodness corresponding to a query, the following search strategy can be applied (a code sketch follows the list):

1. Indexing phase: Apply the SOM to partition a document collection of D documents into M subsets or clusters. Represent each subset by its centroid.

2. Search phase: For a given query,
a) Pre-select: Based on comparison with the centroids, select the best subsets and collect the documents in those subsets until K documents (K ≥ n) are obtained.

b) Refine: Perform an exhaustive search among the K prospective documents and return the n best ones in the order of goodness. The exhaustive search can be carried out with the identical document encoding used in forming the subsets.
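A minimal sketch of this two-phase strategy, assuming NumPy, L2-normalized (cosine-ready) vectors, and a precomputed assignment of documents to map units; all names are illustrative rather than the original WEBSOM implementation.

```python
import numpy as np

def preselect_and_refine(query, centroids, unit_docs, doc_vectors, K, n):
    """Two-phase document map search (sketch).

    query, centroids, doc_vectors: L2-normalized vectors (cosine = dot product).
    unit_docs: list of document-index lists, one per map unit / cluster.
    K: number of documents to pre-select; n: number of documents to return.
    """
    # Phase (a): rank clusters by centroid similarity and pool their documents
    # until at least K candidates have been collected.
    unit_order = np.argsort(centroids @ query)[::-1]
    candidates = []
    for u in unit_order:
        candidates.extend(unit_docs[u])
        if len(candidates) >= K:
            break

    # Phase (b): exhaustive search among the K candidates, same encoding.
    cand = np.array(candidates)
    scores = doc_vectors[cand] @ query
    best = cand[np.argsort(scores)[::-1][:n]]
    return best   # n document indices in decreasing order of goodness
```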

The hierarchical strategy has two effects: the complexity is reduced compared to a full search, and the returned set is different from that of an exhaustive search.

One can view the role of the parameter K as determining the balance between the clustering and the refining search: with a small K, a small number of clusters is trusted to suffice for pre-selecting the best documents. With a larger K, a greater number of clusters will be selected to be screened closely by the detailed search. Furthermore, when the SOM is used for obtaining the subsets, the clustering is implicit (i.e. a cluster may be formed by several neighboring map units) and the appropriate value of K is likely to depend on the stiffness of the map, which is affected by the SOM learning parameters.

A straightforward approach for utilizing clusters in searching would be to return for a query all documents in the best-matching clusters. However, if a topical clustering is used, this approach is likely to cause also false hits to appear high in the list (this was confirmed in a preliminary experiment not reported here), as will be explained in Section 2.2. Instead, in the described approach the document map is thought of as a domain model that is utilized for determining the most central topical clusters for a query. Subsequently the best clusters are scanned more closely to identify the best hits for the particular query.

2.1. EFFECTS ON COMPUTATIONAL COMPLEXITY

The partitioning of the document collection can be constructed offline. In general, obtaining a meaningful partitioning of a large data set is computationally hard. However, by utilizing the SOM the task can be performed fast [11].

In carrying out a single search, an upper limit of O(M + K + n log n) comparisons of document vectors is needed (M for the comparison with the centroids, K for the comparison with the result set of documents, and n log n for retaining the best units in a heap data structure). In contrast, the complexity of an exhaustive search would be O(D + n log n): D for comparing the query with each document, and n log n operations for saving the results in best-first order.
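To make the difference concrete, a back-of-the-envelope count with the values used later in the experiments (M = 150, K = 300, n = 100, D = 1460); the base-2 logarithm is an assumption made only for this illustration.

```python
import math

# Rough comparison counts for the CISI setup reported later in the paper.
D, M, K, n = 1460, 150, 300, 100
map_search = M + K + n * math.log2(n)    # ~ 1114 vector operations
full_search = D + n * math.log2(n)       # ~ 2124 vector operations
print(map_search, full_search)
```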

2.2. EFFECTS ON SEARCH QUALITY

If tight clusters of similar documents are obtained, the data is well characterized by the centroid of the subset, and thus the computation of similarities with respect to the centroids instead of to individual documents is likely to form a good approximation of the full search. Furthermore, a clustering may bring related documents together based on family resemblance, i.e. the cluster may not be defined in terms of a single set of common traits. Instead, as in a family, all members resemble at least some other members of the family, but the features in which resemblance occurs may differ. As a consequence, through clusters the query can reach documents that share few or no words with the query.

The pre-selection stage is likely to discard outlier (one-of-a-kind) hits, as well as a large number of documents that only weakly resemble the query (e.g. due to terms with small weight in common with the query). The further detailed search is thus made easier, and hence retrieval results may even improve.

3. Experiments

3.1. DATA SET

The CISI collection [3] was selected for the experiments in order to evaluate the method on a standard test collection where the relevant documents are known and the properties of the collection are well understood by practitioners of the field. The collection consists of 1460 documents and 76 information requests or queries, with an average of 41 known relevant documents per query. The average lengths of the documents and queries were 115 and 51 words, respectively.



3.2. EVALUATION MEASURE

The evaluation techniques used in IR are based on the notions of precision and recall, defined as follows:

precision = (number of relevant items retrieved) / (total number of items retrieved)    (1)

recall = (number of relevant items retrieved) / (number of relevant items in the collection)    (2)

The retrieval performance was compared using non-interpolated average precision over all relevant documents (AP) [17]. AP is calculated for a single query by recording the precision each time a relevant document is retrieved, and taking the average of these values (the precision of a relevant document that is not retrieved is counted as zero). The results are further averaged over all queries. The measure not only takes into account the number of relevant documents retrieved, but also their order in the result set, rewarding systems for ranking relevant documents high in the returned lists of results.
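A small sketch of this measure, assuming ranked document identifiers and a set of relevance judgements per query; the numbers in the example are invented.

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Non-interpolated average precision over all relevant documents.

    ranked_doc_ids: documents in the order they were retrieved (best first).
    relevant_ids:  the set of documents judged relevant for the query.
    Relevant documents that are never retrieved contribute zero precision.
    """
    relevant_ids = set(relevant_ids)
    hits, precisions = 0, []
    for rank, doc in enumerate(ranked_doc_ids, start=1):
        if doc in relevant_ids:
            hits += 1
            precisions.append(hits / rank)   # precision at this recall point
    # Divide by the total number of relevant documents, not only those found.
    return sum(precisions) / len(relevant_ids) if relevant_ids else 0.0

# Example: 3 of 4 relevant documents retrieved at ranks 1, 3 and 5.
print(average_precision([7, 2, 9, 4, 11], {7, 9, 11, 30}))  # ~ 0.57
```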

3.3. SEARCHES USING THE DOCUMENT MAP

In order to evaluate the retrieval performance of document maps constructed for exploration of a collection, the document map was created following the principles previously established in the WEBSOM project (for a state-of-the-art description, see [11]).

3.3.1. Preprocessing and document encoding

The texts were preprocessed by replacing words with their base forms using the TWOL morphological analyzer [12], and by removing a stoplist of common, uninformative words. Terms occurring fewer than five times were discarded as well. The final vocabulary consisted of 2136 terms. The documents were encoded as vectors using the vector space model with tf·idf weighting.

3.3.2. Construction of the map

The 1460 document vectors were organized on a SOM of 150 units (10 × 15) according to the principles established in the WEBSOM project. The map size was selected as appropriate for browsing, so that each map unit would contain an average of 10 documents. Furthermore, the so-called dot product SOM was utilized [10], in which distance computations are performed using dot products on normalized document vectors (the cosine measure).
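A rough NumPy sketch of a dot-product SOM of this kind; the learning-rate and neighborhood schedules below are illustrative assumptions, not the settings used for the WEBSOM maps.

```python
import numpy as np

def train_dot_product_som(data, grid_shape=(10, 15), epochs=20, seed=0):
    """Minimal dot-product SOM: the winner is the unit with the largest dot
    product on normalized vectors; reference vectors are renormalized after
    each update so that dot products remain cosine similarities."""
    rng = np.random.default_rng(seed)
    rows, cols = grid_shape
    n_units = rows * cols

    # Grid coordinates of each unit, used for the neighborhood function.
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], float)

    # Normalize the data and initialize reference vectors from random documents.
    X = data / np.linalg.norm(data, axis=1, keepdims=True)
    M = X[rng.choice(len(X), n_units, replace=True)].copy()

    for epoch in range(epochs):
        # Neighborhood radius and learning rate shrink over time (assumed schedule).
        sigma = max(0.5, 3.0 * (1 - epoch / epochs))
        alpha = 0.5 * (1 - epoch / epochs) + 0.01
        for x in X[rng.permutation(len(X))]:
            winner = int(np.argmax(M @ x))               # best-matching unit
            d2 = np.sum((coords - coords[winner]) ** 2, axis=1)
            h = np.exp(-d2 / (2 * sigma ** 2))           # Gaussian neighborhood
            M += alpha * h[:, None] * x                  # move units toward x
            M /= np.linalg.norm(M, axis=1, keepdims=True)
    return M  # one normalized reference vector (centroid-like model) per unit

if __name__ == "__main__":
    docs = np.abs(np.random.default_rng(1).normal(size=(200, 50)))  # toy tf-idf data
    reference_vectors = train_dot_product_som(docs)
```

After training, each document is assigned to the unit with the largest dot product, which yields the M subsets used in the retrieval experiments.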



3.3.3. Performing searches on the map

For each query, the documents in the best map units were pre-selected until a list of K documents was obtained. To return the final n documents a refinement search was performed: out of the K pre-selected documents the n most similar were chosen based on the cosine measure.

3.3.4. Estimation of the value for K

To choose a value for K, the test data (the set of 76 queries) was randomly split into two halves with equal probability. One of the halves was used to select a value for K, while the other was set aside and used for the comparison of the different methods. The different values of K that were tried and the respective performances are shown in Figure 1. The value K = 300 was chosen.

3.4. SEARCHES USING VSM AND LSI

In experiments with the VSM and the LSI the preprocessing, the vocabulary of terms, and the weighting of words were identical to those of the document map experiment, to avoid comparing different preprocessings. With both models the searches were carried out by comparing the query vector with each document vector (in the vector space or in the latent space), and by selecting the n best documents.

In the VSM experiment the distances were calculated as dot products on normalized document and query vectors.

The dimensionality of the latent space in LSI was estimated in the same manner as the parameter K in the document map experiment, i.e. based on results obtained with the first half of the test set. The dimensionalities 100, 120, 130, 140, 150, and 160 were tried, of which 140 performed best and was selected for the comparison of methods.

Figure 1. Selection of the value for the parameter K. Average precision (AP) was calculated using the suggested document map method with various values of K (the number of pre-selected documents), and varying the number of retrieved documents (n). The results are averaged over the queries in the first half of the test set. The AP for larger n is higher as a larger number of relevant documents fall into the result set.

4. Retrieval Results

Each method was asked to return the n best documents, for various values of n. The resulting average performances presented in Figure 2 seem to imply that the document map performs better than or as well as either the VSM or the LSI.

We then checked for statistical significance of the observed performance differences in a situation where n = 100 documents are retrieved. The improvement obtained with the document map compared to the VSM turned out to be significant, measured both by the pairwise t-test (p = 0.00017) and the sign test (p = 0.0376). On the other hand, the comparison between the document map and the LSI did not produce significant results on either test. This seems to be due to larger per-query variation between the performances of the methods. Further experiments with larger data sets are therefore needed for comparison with the LSI.

Figure 2. Comparison of methods. Average precision (AP) was calculated using the suggested document map method (Docmap), Salton's vector space model (VSM) and Latent Semantic Indexing (LSI) with various numbers of retrieved documents. The results are averages over the queries in the second half of the test set.
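A sketch of how such paired tests could be computed, assuming SciPy (binomtest requires SciPy 1.7 or later) and hypothetical per-query AP arrays in place of the real second-half results.

```python
import numpy as np
from scipy import stats

# Hypothetical per-query average precision scores for two methods
# (in the paper these would come from the second test-set half).
ap_docmap = np.array([0.42, 0.35, 0.51, 0.28, 0.47, 0.39])
ap_vsm    = np.array([0.37, 0.33, 0.44, 0.30, 0.41, 0.36])

# Pairwise t-test on the per-query differences.
t_stat, p_t = stats.ttest_rel(ap_docmap, ap_vsm)

# Sign test: count queries where one method wins, ignore ties,
# and test against a fair coin with a binomial test.
diff = ap_docmap - ap_vsm
wins = int(np.sum(diff > 0))
trials = int(np.sum(diff != 0))
p_sign = stats.binomtest(wins, trials, p=0.5, alternative='two-sided').pvalue

print(p_t, p_sign)
```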

5. Conclusions and Discussion

The study shows that a document map that was created for interactive exploration of a text collection can be successfully utilized as a clustering in speeding up document retrieval. Furthermore, the experiments suggest that, when document maps are used for retrieval of a relatively small number of the best documents, the retrieval performance can be improved compared to the standard VSM, with reduced computational complexity. Such improvement may be due to the fact that the document map creates a topical clustering and organization of the document space, which is then used to restrict the search to the most central areas of the space for a given query.

Experiments with other well-understood collections are needed to confirm the results, as well as to examine further the conditions in which the proposed method improves performance. Additionally, the approach should be studied with very large collections.

Still further improvement, perceived by a user but not readily measurable within the IR framework, may be achieved by visualizing the locations of the best hits on the map display and by enabling exploration of the map. In such an environment the user may visually confirm the results and interactively refine the search, as well as locate further interesting documents identified by resemblance with some of the retrieved ones.

References

1. Baeza-Yates, R. and Ribeiro-Neto, B. (eds): Modern Information Retrieval. Addison Wesley Longman, 1999.

2. Chen, H., Houston, A. L., Sewell, R. R. and Schatz, B. R.: Internet browsing and searching: User evaluations of category map and concept space techniques. Journal of the American Society for Information Science (JASIS), 49(7) (1998), 582–603.

3. CISI collection. The CISI reference collection for information retrieval. 1460 documents and 76 queries. http://local.dcs.gla.ac.uk/idom/ir_resources/test_collections/cisi/, 1981.

4. Deerwester, S., Dumais, S. T., Furnas, G. W. and Landauer, T. K.: Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41 (1990), 391–407.

5. Hearst, M. A.: Modern Information Retrieval, chapter 10: User Interfaces and Visualization, pp. 257–324. Addison Wesley Longman, 1999.

6. Honkela, T., Kaski, S., Lagus, K. and Kohonen, T.: Newsgroup exploration with WEBSOM method and browsing interface. Technical Report A32, Helsinki University of Technology, Laboratory of Computer and Information Science, Espoo, Finland, 1996.

7. Kaski, S., Honkela, T., Lagus, K. and Kohonen, T.: WEBSOM – self-organizing maps of document collections. Neurocomputing, 21 (1998), 101–117.

8. Kaski, S., Kangas, J. and Kohonen, T.: Bibliography of self-organizing map (SOM) papers: 1981–1997. Neural Computing Surveys, 1(3/4) (1998), 1–176. Available in electronic form at http://www.icsi.berkeley.edu/~jagota/NCS/: Vol. 1, pp. 102–350.

9. Kohonen, T.: Self-organizing formation of topologically correct feature maps. Biological Cybernetics, 43(1) (1982), 59–69.

10. Kohonen, T.: Self-Organizing Maps, volume 30 of Springer Series in Information Sciences. Springer, Berlin, 1995. Second, extended edition, 1997.

11. Kohonen, T., Kaski, S., Lagus, K., Salojärvi, J., Paatero, V. and Saarela, A.: Organization of a massive document collection. IEEE Transactions on Neural Networks, Special Issue on Neural Networks for Data Mining and Knowledge Discovery, 11(3) (2000), 574–585.

12. Koskenniemi, K.: Two-Level Morphology: A General Computational Model for Word-Form Recognition and Production. PhD thesis, University of Helsinki, Department of General Linguistics, 1983.

13. Lagus, K., Honkela, T., Kaski, S. and Kohonen, T.: WEBSOM for textual data mining. Artificial Intelligence Review, 13(5/6) (1999), 345–364.

14. Salton, G. and Buckley, C.: Term weighting approaches in automatic text retrieval. Technical Report 87-881, Cornell University, Department of Computer Science, Ithaca, NY, 1987.

15. Salton, G. and McGill, M. J.: Introduction to Modern Information Retrieval. McGraw-Hill, New York, 1983.

16. Salton, G., Wong, A. and Yang, C. S.: A vector space model for automatic indexing. Communications of the ACM, 18(11) (1975), 613–620.

17. Voorhees, E. M. and Harman, D. K.: Appendix: Evaluation techniques and measures. In: Proceedings of The Eighth Text REtrieval Conference (TREC 8). NIST, 2000. http://trec.nist.gov/pubs/trec8/t8_proceedings.html.
