Department of Informatics, Aristotle University of Thessaloniki
Document Organization using Self-Organizing Feature Maps (WEBSOFM)
Apostolos Georgakis
Artificial Intelligence and Information Analysis Lab
Self-Organizing Maps Algorithm
SOMs are neural networks with:
• Two-layer structure
• Feed-forward topology

They:
• form a non-linear projection from an arbitrary data manifold onto a low-dimensional discrete map
• perform competitive, unsupervised training
• map the probability density function (pdf) of the input space onto a 2D or 3D lattice

The topology of the lattice can be either hexagonal or orthogonal.
Let x_i(t) = [x_i1(t), x_i2(t), …, x_im(t)]^T denote an m×1 feature vector, and m_j(t) denote the weight vector of the j-th neuron.

[Figure: feature vectors x_i mapped onto the lattice of neurons with weight vectors m_j]
SOMs belong to the class of vector quantizers (k-means, VQ, LVQ, etc.).
Index of the winning neuron:

    c = arg min_j ‖x_i(t) − m_j(t)‖

Updating function:

    m_j(t+1) = m_j(t) + h_cj(t) [x_i(t) − m_j(t)],   j ∈ N_c
    m_j(t+1) = m_j(t),                               j ∉ N_c

where N_c is the neighbourhood centered on the winner, which is modified toward x_i(t).
    h_cj(t) = a(t) exp( − ‖r_c − r_j‖² / (2σ²(t)) )

a(t) is the learning rate, and σ(t) denotes the diameter of the updating kernel. r_c and r_j are the locations on the lattice of the winner neuron and the neuron being updated, respectively.
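The winner selection, Gaussian kernel, and weight update above can be sketched in a few lines of NumPy. This is a minimal illustration, not the WEBSOM implementation; the decay schedules for a(t) and σ(t) (exponential, with constants a0, sigma0, tau) are assumptions for the sketch, and the Gaussian kernel plays the role of the neighbourhood N_c.

```python
import numpy as np

def som_step(M, positions, x, t, a0=0.5, sigma0=3.0, tau=1000.0):
    """One SOM update with a Gaussian neighbourhood kernel (sketch).

    M         : (n_neurons, m) weight vectors m_j(t)
    positions : (n_neurons, 2) lattice coordinates r_j
    x         : (m,) input feature vector x_i(t)
    """
    # Winner: c = argmin_j ||x_i(t) - m_j(t)||
    c = np.argmin(np.linalg.norm(M - x, axis=1))

    # Assumed exponential decay of learning rate a(t) and width sigma(t).
    a = a0 * np.exp(-t / tau)
    sigma = sigma0 * np.exp(-t / tau)

    # h_cj(t) = a(t) * exp(-||r_c - r_j||^2 / (2 sigma(t)^2))
    d2 = np.sum((positions - positions[c]) ** 2, axis=1)
    h = a * np.exp(-d2 / (2.0 * sigma ** 2))

    # m_j(t+1) = m_j(t) + h_cj(t) (x_i(t) - m_j(t))
    return M + h[:, None] * (x - M), c
```

In practice the kernel weight h_cj is negligible outside a small neighbourhood of the winner, which is what the slide's restriction to j ∈ N_c expresses.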
Information Retrieval of Domain Specific Material
Information retrieval (IR) consists of:
• Organization: representation and storage of the available data
• Retrieval: exploration of the organized data

Prior to retrieval, the data repository (corpus) has to be organized according to the retrieval method to be applied. Without a retrieval-oriented organization method, the retrieval of even a fraction of the available documents becomes onerous.
Information Retrieval of Domain Specific Material
• Scenario: A user is interested in locating documents or web pages related to a certain topic.
• Query the IR system in natural language, or use available sources (documents and/or web pages).
Information Retrieval
• The above problem can be dealt with by either the:
  - Boolean model: index terms
  - Vector model: vectors in t-dimensional space
  - Probabilistic model: probability theory
Boolean Model
Documents are represented by index terms or keywords.
• Indexing techniques:
  - Suffix arrays (a text suffix is a string that runs from a text position to the end of the text)
  - Signature files (word-oriented index structures based on hashing)
  - Inverted files: a vocabulary file and an occurrences file
• Boolean operators are used in the retrieval.
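A toy inverted file makes the structure concrete: a dictionary of terms stands in for the vocabulary file and the posting sets for the occurrences file. The documents and function names below are purely illustrative.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Toy inverted file: term -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def boolean_and(index, *terms):
    """Boolean retrieval: documents containing ALL of the query terms."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

docs = ["czech restaurant in prague",
        "prague tourist guide",
        "czech beer and dumplings"]
index = build_inverted_index(docs)
print(sorted(boolean_and(index, "czech", "prague")))  # -> [0]
```

Retrieval for "czech AND prague" is just a set intersection of two posting lists, which is what makes the Boolean model fast but, as the next slide notes, hard to rank.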
• Drawbacks of the Boolean model:
  - It does not support significance weighting of the query terms.
  - Users are not always familiar with the Boolean operators.
  - No sufficient mechanism exists for ranking the retrieved documents.
  - It is capable of handling only keyword-based queries.
• Alternatives: fuzzy Boolean operators.
Vector Space Model
• The words and the documents are represented by vectors.
• Advantages:
  - Supports term weighting.
  - The retrieved documents can be sorted according to their similarity, with the help of the various vector norms.
  - It is a user-friendly model.
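A minimal sketch of vector-space ranking: documents and the query become term-count vectors, and retrieved documents are sorted by cosine similarity. Raw counts are used here for brevity; a weighting scheme such as tf-idf would normally replace them.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity of two sparse term-count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rank(docs, query):
    """Score every document against the query; best matches first."""
    qv = Counter(query.lower().split())
    dvs = [Counter(d.lower().split()) for d in docs]
    return sorted(((cosine(qv, dv), i) for i, dv in enumerate(dvs)),
                  reverse=True)

docs = ["prague restaurant guide", "czech beer", "restaurant in prague"]
for score, i in rank(docs, "prague restaurant"):
    print(f"doc {i}: {score:.3f}")
```

Unlike the Boolean model, every document receives a graded score, so partial matches are ranked rather than discarded.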
Preprocessing
• Text processing
  - HTML cleaning
  - Plain text cleaning (e-mail addresses, URLs, word separators, numbers, etc.)
• Stemming
  - Clustering of words by elimination of word suffixes
• Feature vector extraction
  - Calculation of the contextual statistics
  - Formation of the feature vectors (vector space model)
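The cleaning and stemming steps can be sketched as below. The regexes and the crude suffix list are assumptions for illustration only; in practice a full stemmer such as Porter's algorithm would be used.

```python
import re

# Illustrative suffix list; a stand-in for a real stemmer, not the
# Porter rules themselves.
SUFFIXES = ("ing", "edly", "ed", "ly", "es", "s")

def clean(text):
    """Strip HTML tags, e-mail addresses, URLs, and non-letters."""
    text = re.sub(r"<[^>]+>", " ", text)               # HTML tags
    text = re.sub(r"\S+@\S+|https?://\S+", " ", text)  # e-mails, URLs
    text = re.sub(r"[^a-zA-Z]+", " ", text)            # numbers, punctuation
    return text.lower().split()

def stem(word):
    """Cluster word forms by eliminating a matching suffix."""
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= 3:
            return word[: -len(s)]
    return word

tokens = [stem(w) for w in clean("<p>Visit http://example.com: 10 restaurants!</p>")]
print(tokens)  # -> ['visit', 'restaurant']
```

The output of this stage feeds the contextual-statistics computation on the following slides.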
Language Modeling (Contextual Statistics)
• For each word, consider both its preceding and following words.

Let c_i denote the code vector assigned to word w_i. Averaging over the N_i occurrences of w_i in the corpus:

    E{c_prev | w_i} = (1/N_i) Σ_{l : w_l = w_i} c_{l−1}
    E{c_next | w_i} = (1/N_i) Σ_{l : w_l = w_i} c_{l+1}

The feature vector of w_i is the concatenation

    x_i = [ E{c_prev | w_i} ; ε c_i ; E{c_next | w_i} ]

which triples the dimensionality of the code vectors.
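The contextual statistics can be sketched as follows, assuming (as in the word-category-map literature, refs. 7 and 9) that each vocabulary word is first assigned a random code vector. The function name, dimension, and scale factor eps are illustrative choices, not values from the slides.

```python
import numpy as np

def context_features(tokens, dim=5, eps=0.2, seed=0):
    """Feature vector per word: [avg preceding code; eps*own code;
    avg following code], giving dimension 3*dim."""
    rng = np.random.default_rng(seed)
    vocab = sorted(set(tokens))
    code = {w: rng.standard_normal(dim) for w in vocab}  # random codes

    feats = {}
    for w in vocab:
        before, after = [], []
        for i, t in enumerate(tokens):
            if t != w:
                continue
            if i > 0:
                before.append(code[tokens[i - 1]])
            if i + 1 < len(tokens):
                after.append(code[tokens[i + 1]])
        e_prev = np.mean(before, axis=0) if before else np.zeros(dim)
        e_next = np.mean(after, axis=0) if after else np.zeros(dim)
        feats[w] = np.concatenate([e_prev, eps * code[w], e_next])
    return feats

feats = context_features("the dog ate the bone".split())
print(feats["dog"].shape)  # -> (15,)
```

Words that occur in similar contexts end up with similar feature vectors, which is what lets the SOM group them on the word category map.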
The word category map for a 27×27 network
k-th document: A Prague Guide - Czech Restaurants - Prag - Praha - From Andel 3W Tourist Service English.htm
PRAGUE guid CZECH restaur PRAG PRAHA ANDEL tourist servic ENGLISH CZECH restaur HOSTINEC KALICHA NEBOZIZEK NOVOMESTSKY PIVOVAR POD KRIDLEM FLEKU KRKAVCU MALTEZSKYCH RYTIRU HOSTINEC KALICHA NA BOJISTI PRAGUE famou CZECH restaur PRAGUE popular . folklor musican realli get live mood drink glass beer gobbl traddit CZECH GULAS DUMPLINGS pick histori . reserv recommend . back list NEBOZIZEK NA BOJISTI PRAGUE famou CZECH restaur PRAGUE popular .
….
The word vectors are combined together and smoothing is applied.
Document Map
Document Map
Mean Square Error
The solid vertical line denotes the time instant when the simulated annealing process was used to overcome local minima in the clustering procedure.
Dimensionality Reduction
The dimensionality of the feature vectors is exceptionally high; e.g., for the HyperGeo corpus (393 web pages, 3084 stems) the dimension is 3084 × 3 = 9252.

Dimensionality reduction techniques:
• PCA, ICA, LDA
• Random projection
• Component elimination
Random Projection
In random projection we compute an m × 3N_w matrix R having the following properties:
• The components in each column are chosen to be independent, identically distributed Gaussian variables with zero mean and unit variance.
• Each column is normalized to unit norm.

    x̂_i = R x_i,   x_i ∈ R^{3N_w},  x̂_i ∈ R^m
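Both properties of R are easy to realize directly; the following sketch builds such a matrix and projects a feature vector from the full 3N_w = 9252 dimensions down to an illustrative m = 300 (the slides do not specify m).

```python
import numpy as np

def random_projection_matrix(m, n, seed=0):
    """m x n matrix with i.i.d. N(0, 1) entries and unit-norm columns."""
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((m, n))     # zero mean, unit variance
    R /= np.linalg.norm(R, axis=0)      # normalize each column
    return R

n, m = 9252, 300                        # 3*Nw -> reduced dimension (m assumed)
R = random_projection_matrix(m, n)
x = np.random.default_rng(1).standard_normal(n)
x_hat = R @ x                           # projected feature vector
print(x_hat.shape)  # -> (300,)
```

Unlike PCA, no eigendecomposition of the data is needed; random projection approximately preserves pairwise similarities at a fraction of the cost, which is why it scales to corpus-sized vocabularies.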
Component Elimination
To overcome the computational complexity in locating the winner, we order the components of the feature vectors according to the distance:

    u_j = Σ_{i=1}^{k} ( x_ij(t) − E{x_j(t)} )²

where

    E{x_j(t)} = (1/k) Σ_{i=1}^{k} x_ij(t)

and k is the number of feature vectors.
By omitting the second sum we can still get an estimate of the winning neuron. Rearrange the components in each feature vector using u_j so that the components with the strongest values appear first. The Euclidean distance is then computed by

    ‖x_i(t) − m_j(t)‖² = Σ_{n=1}^{d} ( x_in(t) − m_jn(t) )² + Σ_{n=d+1}^{m} ( x_in(t) − m_jn(t) )²
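The procedure above can be sketched directly: rank the components by their spread u_j across the k feature vectors, keep the d strongest, and search for the winner with the truncated distance. All sizes below are toy values for illustration.

```python
import numpy as np

def truncated_winner(X, M, d):
    """Approximate winner search using only the d strongest components.

    X : (k, m) feature vectors    M : (n_neurons, m) weight vectors
    """
    # u_j = sum_i (x_ij - E{x_j})^2, the spread of each component
    u = np.sum((X - X.mean(axis=0)) ** 2, axis=0)
    order = np.argsort(u)[::-1]                  # strongest components first
    Xr, Mr = X[:, order[:d]], M[:, order[:d]]    # truncated, re-ordered data
    # Euclidean distance over the first d components only
    dists = np.linalg.norm(Xr[:, None, :] - Mr[None, :, :], axis=2)
    return np.argmin(dists, axis=1)              # estimated winner per vector

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 50))
M = rng.standard_normal((9, 50))
approx = truncated_winner(X, M, d=30)
exact = np.argmin(np.linalg.norm(X[:, None] - M[None], axis=2), axis=1)
print((approx == exact).mean())  # fraction of correctly identified winners
```

With d = m the truncated search reduces to the exact one; the figure on the next slide measures how the hit rate degrades as d shrinks.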
Percentage of the correctly identified winners with respect to the dimensionality difference between the original space and the "truncated" sub-space.
References
1. R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval (ACM, 1999).
2. M. W. Berry and M. Browne, Understanding Search Engines: Mathematical Modelling and Text Retrieval (SIAM, 1999).
3. G. Salton and M. J. McGill, Introduction to Modern Information Retrieval (McGraw-Hill, 1983).
4. S. Kaski, K. Lagus, T. Honkela, and T. Kohonen, Statistical aspects of the WEBSOM System in Organizing Document Collections, Computing Science and Statistics, 29, 1998, 281-290.
5. T. Kohonen, Self-Organizing Maps (Springer-Verlag, 1997).
6. T. Kohonen, Self-organization of very large document collections: State of the art, Proc. of the 8th Int. Conf. on Artificial Neural Networks, 1, Springer, 1998, 65-74.
7. H. Ritter and T. Kohonen, Self-Organizing Semantic Maps, Biol. Cybernetics, 61, 1989, 241-254.
8. S. Haykin, Neural Networks: A Comprehensive Foundation (Prentice-Hall, 1999).
9. T. Kohonen, S. Kaski, K. Lagus, J. Salojarvi, J. Honkela, V. Paatero, and A. Saarela, Self-Organization of a Massive Document Collection, IEEE Trans. on Neural Networks, 11:3, 2000, 574-585.
10. S. Kaski, Data exploration using self-organizing maps, PhD Thesis, Helsinki University of Technology, 1997.
11. C. D. Manning and H. Schutze, Foundations of Statistical Natural Language Processing (MIT Press, 1999).
12. W. B. Frakes and R. Baeza-Yates, Information Retrieval: Data Structures and Algorithms (Prentice-Hall, 1992).
13. M. F. Porter, An algorithm for suffix stripping, Program, 14, 1980, 130-137.
14. C. Becchetti and L. P. Ricotti, Speech Recognition: Theory and C++ Implementation (Wiley, 1998).
15. M. P. Oakes, Statistics for Corpus Linguistics (University Press, 1998).
16. S. Kaski, Dimensionality reduction by random mapping: Fast similarity computation for clustering, Proc. of the 8th Int. Conf. on Neural Networks, IEEE, 1, 1998, 413-418.
17. S. Kaski, Fast winner search for SOM-based monitoring and retrieval of high-dimensional data, Proc. of the 9th Int. Conf. on Artificial Neural Networks, Edinburgh, U.K., September 1999.