Department of Informatics, Aristotle University of Thessaloniki
Document Organization using Self-Organizing Feature Maps (WEBSOFM)
Apostolos Georgakis
Artificial Intelligence and Information Analysis Lab
Self-Organizing Maps Algorithm
SOMs are neural networks with:
• Two-layer structure
• Feed-forward topology

They:
• form a non-linear projection from an arbitrary data manifold onto a low-dimensional discrete map
• perform competitive, unsupervised training
• map the probability density function (pdf) of the input space onto a 2D or 3D lattice

The topology of the lattice can be either hexagonal or orthogonal.
Let x_i(t) = [x_i1(t), x_i2(t), …, x_im(t)]^T denote an m×1 feature vector, and m_j(t) denote the weight vector of the j-th neuron.

[Figure: feature vectors x_i mapped onto the lattice of neurons with weight vectors m_j]
SOMs belong to the class of vector quantizers (k-means, VQ, LVQ, etc.).
Index of the winning neuron:

    c = arg min_j ‖x_i(t) − m_j(t)‖

Updating function:

    m_j(t+1) = m_j(t) + h_cj(t) [x_i(t) − m_j(t)],   j ∈ N_c
    m_j(t+1) = m_j(t),                               j ∉ N_c

where N_c is the neighbourhood centered on the winner, which is modified toward x_i(t).
    h_cj(t) = a(t) exp( − ‖r_c − r_j‖² / (2σ²(t)) )

a(t) is the learning rate, and σ(t) denotes the diameter of the updating kernel. r_c and r_j are the locations on the lattice of the winner neuron and the neuron being updated, respectively.
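The winner selection, Gaussian kernel, and weight update above can be sketched in a few lines of NumPy. This is a minimal illustration, not the WEBSOM implementation; the decay schedules for a(t) and σ(t) (exponential, with constants a0, sigma0, tau) are assumptions for the sketch, and the Gaussian kernel plays the role of the neighbourhood N_c.

```python
import numpy as np

def som_step(M, positions, x, t, a0=0.5, sigma0=3.0, tau=1000.0):
    """One SOM update with a Gaussian neighbourhood kernel (sketch).

    M         : (n_neurons, m) weight vectors m_j(t)
    positions : (n_neurons, 2) lattice coordinates r_j
    x         : (m,) input feature vector x_i(t)
    """
    # Winner: c = argmin_j ||x_i(t) - m_j(t)||
    c = np.argmin(np.linalg.norm(M - x, axis=1))

    # Assumed exponential decay of learning rate a(t) and width sigma(t).
    a = a0 * np.exp(-t / tau)
    sigma = sigma0 * np.exp(-t / tau)

    # h_cj(t) = a(t) * exp(-||r_c - r_j||^2 / (2 sigma(t)^2))
    d2 = np.sum((positions - positions[c]) ** 2, axis=1)
    h = a * np.exp(-d2 / (2.0 * sigma ** 2))

    # m_j(t+1) = m_j(t) + h_cj(t) (x_i(t) - m_j(t))
    return M + h[:, None] * (x - M), c
```

In practice the kernel weight h_cj is negligible outside a small neighbourhood of the winner, which is what the slide's restriction to j ∈ N_c expresses.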
Information Retrieval of Domain Specific Material
Information retrieval (IR) consists of:
• Organization: representation and storage of the available data
• Retrieval: exploration of the organized data

Prior to retrieval, the data repository (corpus) has to be organized according to the retrieval method to be applied. Without a retrieval-oriented organization method, the retrieval of even a fraction of the available documents becomes onerous.
Information Retrieval of Domain Specific Material
• Scenario: A user is interested in locating documents or web pages related to a certain topic.
• Query the IR system in natural language, or use available sources (documents and/or web pages).
Information Retrieval
• The above problem can be dealt with by either the:
  - Boolean model: index terms
  - Vector model: vectors in t-dimensional space
  - Probabilistic model: probability theory
Boolean Model
Documents are represented by index terms or keywords.
• Indexing techniques:
  - Suffix arrays (a text suffix is a string that runs from a text position to the end of the text)
  - Signature files (word-oriented index structures based on hashing)
  - Inverted files: a vocabulary file and an occurrences file
• Boolean operators are used in the retrieval.
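A toy inverted file makes the structure concrete: a dictionary of terms stands in for the vocabulary file and the posting sets for the occurrences file. The documents and function names below are purely illustrative.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Toy inverted file: term -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def boolean_and(index, *terms):
    """Boolean retrieval: documents containing ALL of the query terms."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

docs = ["czech restaurant in prague",
        "prague tourist guide",
        "czech beer and dumplings"]
index = build_inverted_index(docs)
print(sorted(boolean_and(index, "czech", "prague")))  # -> [0]
```

Retrieval for "czech AND prague" is just a set intersection of two posting lists, which is what makes the Boolean model fast but, as the next slide notes, hard to rank.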
• Drawbacks of the Boolean model:
  - It does not support significance weighting of the query terms.
  - Users are not always familiar with the Boolean operators.
  - No sufficient mechanism exists for ranking the retrieved documents.
  - It is capable of handling only keyword-based queries.
• Alternatives: fuzzy Boolean operators.
Vector Space Model
• The words and the documents are represented by vectors.
• Advantages:
  - Supports term weighting.
  - The retrieved documents can be sorted according to their similarity, with the help of the various vector norms.
  - It is a user-friendly model.
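A minimal sketch of vector-space ranking: documents and the query become term-count vectors, and retrieved documents are sorted by cosine similarity. Raw counts are used here for brevity; a weighting scheme such as tf-idf would normally replace them.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity of two sparse term-count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rank(docs, query):
    """Score every document against the query; best matches first."""
    qv = Counter(query.lower().split())
    dvs = [Counter(d.lower().split()) for d in docs]
    return sorted(((cosine(qv, dv), i) for i, dv in enumerate(dvs)),
                  reverse=True)

docs = ["prague restaurant guide", "czech beer", "restaurant in prague"]
for score, i in rank(docs, "prague restaurant"):
    print(f"doc {i}: {score:.3f}")
```

Unlike the Boolean model, every document receives a graded score, so partial matches are ranked rather than discarded.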
Preprocessing
• Text processing
  - HTML cleaning
  - Plain text cleaning (e-mail addresses, URLs, word separators, numbers, etc.)
• Stemming
  - Clustering of words by elimination of word suffixes
• Feature vector extraction
  - Calculation of the contextual statistics
  - Formation of the feature vectors (vector space model)
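The cleaning and stemming steps can be sketched as below. The regexes and the crude suffix list are assumptions for illustration only; in practice a full stemmer such as Porter's algorithm would be used.

```python
import re

# Illustrative suffix list; a stand-in for a real stemmer, not the
# Porter rules themselves.
SUFFIXES = ("ing", "edly", "ed", "ly", "es", "s")

def clean(text):
    """Strip HTML tags, e-mail addresses, URLs, and non-letters."""
    text = re.sub(r"<[^>]+>", " ", text)               # HTML tags
    text = re.sub(r"\S+@\S+|https?://\S+", " ", text)  # e-mails, URLs
    text = re.sub(r"[^a-zA-Z]+", " ", text)            # numbers, punctuation
    return text.lower().split()

def stem(word):
    """Cluster word forms by eliminating a matching suffix."""
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= 3:
            return word[: -len(s)]
    return word

tokens = [stem(w) for w in clean("<p>Visit http://example.com: 10 restaurants!</p>")]
print(tokens)  # -> ['visit', 'restaurant']
```

The output of this stage feeds the contextual-statistics computation on the following slides.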
Language Modeling (Contextual Statistics)
• For each word, consider both its preceding and following words.

Let c_i denote the code vector assigned to word w_i. Averaging over the N_i occurrences of w_i in the corpus:

    E{c_prev | w_i} = (1/N_i) Σ_{l : w_l = w_i} c_{l−1}
    E{c_next | w_i} = (1/N_i) Σ_{l : w_l = w_i} c_{l+1}

The feature vector of w_i is the concatenation

    x_i = [ E{c_prev | w_i} ; ε c_i ; E{c_next | w_i} ]

which triples the dimensionality of the code vectors.
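The contextual statistics can be sketched as follows, assuming (as in the word-category-map literature, refs. 7 and 9) that each vocabulary word is first assigned a random code vector. The function name, dimension, and scale factor eps are illustrative choices, not values from the slides.

```python
import numpy as np

def context_features(tokens, dim=5, eps=0.2, seed=0):
    """Feature vector per word: [avg preceding code; eps*own code;
    avg following code], giving dimension 3*dim."""
    rng = np.random.default_rng(seed)
    vocab = sorted(set(tokens))
    code = {w: rng.standard_normal(dim) for w in vocab}  # random codes

    feats = {}
    for w in vocab:
        before, after = [], []
        for i, t in enumerate(tokens):
            if t != w:
                continue
            if i > 0:
                before.append(code[tokens[i - 1]])
            if i + 1 < len(tokens):
                after.append(code[tokens[i + 1]])
        e_prev = np.mean(before, axis=0) if before else np.zeros(dim)
        e_next = np.mean(after, axis=0) if after else np.zeros(dim)
        feats[w] = np.concatenate([e_prev, eps * code[w], e_next])
    return feats

feats = context_features("the dog ate the bone".split())
print(feats["dog"].shape)  # -> (15,)
```

Words that occur in similar contexts end up with similar feature vectors, which is what lets the SOM group them on the word category map.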
The word category map for a 27×27 network
k-th document: A Prague Guide - Czech Restaurants - Prag - Praha - From Andel 3W Tourist Service English.htm
PRAGUE guid CZECH restaur PRAG PRAHA ANDEL tourist servic ENGLISH CZECH restaur HOSTINEC KALICHA NEBOZIZEK NOVOMESTSKY PIVOVAR POD KRIDLEM FLEKU KRKAVCU MALTEZSKYCH RYTIRU HOSTINEC KALICHA NA BOJISTI PRAGUE famou CZECH restaur PRAGUE popular . folklor musican realli get live mood drink glass beer gobbl traddit CZECH GULAS DUMPLINGS pick histori . reserv recommend . back list NEBOZIZEK NA BOJISTI PRAGUE famou CZECH restaur PRAGUE popular .
….
The word vectors are combined together and smoothing is applied.
Document Map
Document Map
Mean Square Error
The solid vertical line denotes the time instant when the simulated annealing process was used to overcome local minima in the clustering procedure.
Dimensionality Reduction
The dimensionality of the feature vectors is exceptionally high; e.g., for the HyperGeo corpus (393 web pages, 3084 stems) the dimension is 3084 × 3 = 9252.

Dimensionality reduction techniques:
• PCA, ICA, LDA
• Random projection
• Component elimination
Random Projection
In random projection we compute an m × 3N_w matrix R having the following properties:
• The components in each column are chosen to be independent, identically distributed Gaussian variables with zero mean and unit variance.
• Each column is normalized to unit norm.

    x̂_i = R x_i,   x_i ∈ R^{3N_w},  x̂_i ∈ R^m
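Both properties of R are easy to realize directly; the following sketch builds such a matrix and projects a feature vector from the full 3N_w = 9252 dimensions down to an illustrative m = 300 (the slides do not specify m).

```python
import numpy as np

def random_projection_matrix(m, n, seed=0):
    """m x n matrix with i.i.d. N(0, 1) entries and unit-norm columns."""
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((m, n))     # zero mean, unit variance
    R /= np.linalg.norm(R, axis=0)      # normalize each column
    return R

n, m = 9252, 300                        # 3*Nw -> reduced dimension (m assumed)
R = random_projection_matrix(m, n)
x = np.random.default_rng(1).standard_normal(n)
x_hat = R @ x                           # projected feature vector
print(x_hat.shape)  # -> (300,)
```

Unlike PCA, no eigendecomposition of the data is needed; random projection approximately preserves pairwise similarities at a fraction of the cost, which is why it scales to corpus-sized vocabularies.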
Component Elimination
To overcome the computational complexity in locating the winner, we order the components of the feature vectors according to the distance:

    u_j = Σ_{i=1}^{k} ( x_ij(t) − E{x_j(t)} )²

where

    E{x_j(t)} = (1/k) Σ_{i=1}^{k} x_ij(t)

and k is the number of feature vectors.
By omitting the second sum we can still get an estimate of the winning neuron. Rearrange the components in each feature vector using u_j so that the components with the strongest values appear first. The Euclidean distance is then computed by

    ‖x_i(t) − m_j(t)‖² = Σ_{n=1}^{d} ( x_in(t) − m_jn(t) )² + Σ_{n=d+1}^{m} ( x_in(t) − m_jn(t) )²
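The procedure above can be sketched directly: rank the components by their spread u_j across the k feature vectors, keep the d strongest, and search for the winner with the truncated distance. All sizes below are toy values for illustration.

```python
import numpy as np

def truncated_winner(X, M, d):
    """Approximate winner search using only the d strongest components.

    X : (k, m) feature vectors    M : (n_neurons, m) weight vectors
    """
    # u_j = sum_i (x_ij - E{x_j})^2, the spread of each component
    u = np.sum((X - X.mean(axis=0)) ** 2, axis=0)
    order = np.argsort(u)[::-1]                  # strongest components first
    Xr, Mr = X[:, order[:d]], M[:, order[:d]]    # truncated, re-ordered data
    # Euclidean distance over the first d components only
    dists = np.linalg.norm(Xr[:, None, :] - Mr[None, :, :], axis=2)
    return np.argmin(dists, axis=1)              # estimated winner per vector

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 50))
M = rng.standard_normal((9, 50))
approx = truncated_winner(X, M, d=30)
exact = np.argmin(np.linalg.norm(X[:, None] - M[None], axis=2), axis=1)
print((approx == exact).mean())  # fraction of correctly identified winners
```

With d = m the truncated search reduces to the exact one; the figure on the next slide measures how the hit rate degrades as d shrinks.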
Percentage of the correctly identified winners with respect to the dimensionality difference between the original space and the "truncated" sub-space.
References
1. R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval (ACM, 1999).
2. M. W. Berry and M. Browne, Understanding Search Engines: Mathematical Modelling and Text Retrieval (SIAM, 1999).
3. G. Salton and M. J. McGill, Introduction to Modern Information Retrieval (McGraw-Hill, 1983).
4. S. Kaski, K. Lagus, T. Honkela, and T. Kohonen, Statistical aspects of the WEBSOM System in Organizing Document Collections, Computing Science and Statistics, 29, 1998, 281-290.
5. T. Kohonen, Self-Organizing Maps (Springer-Verlag, 1997).
6. T. Kohonen, Self-organization of very large document collections: State of the art, Proc. of the 8th Int. Conf. on Artificial Neural Networks, 1, Springer, 1998, 65-74.
7. H. Ritter and T. Kohonen, Self-Organizing Semantic Maps, Biol. Cybernetics, 61, 1989, 241-254.
8. S. Haykin, Neural Networks: A Comprehensive Foundation (Prentice-Hall, 1999).
9. T. Kohonen, S. Kaski, K. Lagus, J. Salojarvi, J. Honkela, V. Paatero, and A. Saarela, Self-Organization of a Massive Document Collection, IEEE Trans. on Neural Networks, 11:3, 2000, 574-585.
10. S. Kaski, Data exploration using self-organizing maps, PhD Thesis, Helsinki University of Technology, 1997.
11. C. D. Manning and H. Schutze, Foundations of Statistical Natural Language Processing (MIT Press, 1999).
12. W. B. Frakes and R. Baeza-Yates, Information Retrieval: Data Structures and Algorithms (Prentice-Hall, 1992).
13. M. F. Porter, An algorithm for suffix stripping, Program, 14, 1980, 130-137.
14. C. Becchetti and L. P. Ricotti, Speech Recognition: Theory and C++ Implementation (Wiley, 1998).
15. M. P. Oakes, Statistics for Corpus Linguistics (University Press, 1998).
16. S. Kaski, Dimensionality reduction by random mapping: Fast similarity computation for clustering, Proc. of the 8th Int. Conf. on Neural Networks, IEEE, 1, 1998, 413-418.
17. S. Kaski, Fast winner search for SOM-based monitoring and retrieval of high-dimensional data, Proc. of the 9th Int. Conf. on Artificial Neural Networks, Edinburgh, U.K., September 1999.