
The Self-Organizing Map

Jenny Hamer

University of California, San Diego

Paper by Teuvo Kohonen, IEEE 1990

November 8, 2018


Overview

1 Competitive learning

2 Learning vector quantization (LVQ) methods: unsupervised vector quantization, supervised LVQ

3 Topological neighborhood

4 Self-organizing map algorithm

5 Applications and results


Competitive learning

Let x ∈ Rn be an input vector. Let d(a, b) be a similarity metric between two vectorial inputs, for example the Euclidean distance between a and b.

1. Initialize a set of “reference” vectors M = {mi | mi ∈ Rn, i = 1, ..., k} (e.g. randomly)

2. Simultaneously compare x with each mi; compute the similarity between x and mi: d(x, mi)

3. The best-matching mi is declared the “winner”; let mc be this reference vector

4. Tune mc such that d(x, mc) decreases while all other mi are kept unchanged.

Supposing that the x follow some distribution with probability density function p, then the tuned mi form an approximation of p.
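
A minimal sketch of one competitive-learning step (NumPy; the function name and parameter values are illustrative, not from the paper):

    import numpy as np

    def competitive_step(x, M, alpha=0.05):
        """One competitive-learning update: move only the winner toward x."""
        # steps 2-3: compare x with every m_i and pick the best match (Euclidean distance)
        dists = np.linalg.norm(M - x, axis=1)
        c = int(np.argmin(dists))
        # step 4: tune only m_c so that d(x, m_c) decreases; all other m_i stay unchanged
        M[c] += alpha * (x - M[c])
        return c

    # usage: k = 10 randomly initialized reference vectors in R^3,
    # inputs drawn from some density p
    rng = np.random.default_rng(0)
    M = rng.normal(size=(10, 3))
    for x in rng.normal(size=(1000, 3)):
        competitive_step(x, M)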


Competitive learning visualization

The unit uj in the layer F2 is the “winner” for input pattern I: input I’s cluster is determined by this unit.


Unsupervised Vector Quantization: k-means

Goal: learn a finite, discrete approximation M of the continuous probability density function of vectorial data X = {x | x ∈ Rn}.

As in competitive learning, randomly initialize the codebook M. For each input vector x:

1. Assign x to the cluster of its “winner” mc, satisfying ||x − mc|| = mini{||x − mi||} (in general w.r.t. an Lp norm; we’ll use p = 2, i.e. Euclidean distance)

2. Tune the codebook vectors: mc(t + 1) = mc(t) + α(t)[x(t) − mc(t)] where mc was a “winner”; mi(t + 1) = mi(t) for all i ≠ c

Originally used in signal processing and approximation:

often more economical to use a batched method: match and assign numerous x(t), then update in a single step.


Classical method used to approximate the continuous probability density function p(x) of the set of vectorial data x.
Find the mc most similar to x - that which is closest in Euclidean distance to x.
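
The batched variant mentioned on the slide can be sketched as follows (a hedged NumPy illustration, not the paper's code): assign a whole batch of inputs to their winners first, then move each codebook vector to the mean of its assigned inputs in one step.

    import numpy as np

    def batch_vq_update(X, M):
        """One batched VQ / k-means step: assign a whole batch, then update once."""
        # match: nearest codebook vector (Euclidean) for every x in the batch
        d = np.linalg.norm(X[:, None, :] - M[None, :, :], axis=2)  # shape (N, k)
        winners = d.argmin(axis=1)
        # update: each codebook vector moves to the mean of the inputs assigned to it
        for i in range(len(M)):
            assigned = X[winners == i]
            if len(assigned):
                M[i] = assigned.mean(axis=0)
        return winners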

Supervised LVQ

Suppose some set of the input data x have known class labels: assign several mi ∈ M as representatives for each class (e.g. randomly). The mi closest to x determines x’s class.

1. Assign class of mc to x

2. Update the mi as follows:

If x is correctly classified:

mc(t + 1) = mc(t) + α(t)[x − mc(t)]

If x is incorrectly classified:

mc(t + 1) = mc(t) − α(t)[x − mc(t)]

leaving all weights i ≠ c unchanged: mi(t + 1) = mi(t). Here α(t), 0 < α(t) < 1, is a scalar gain which should decrease monotonically over time.


Suppose some set of the input data x have known class labels: assign several mi ∈ M as representatives for each class (e.g. randomly). The mi closest to x determines x’s class in a simple “nearest neighbor” fashion.
Suppose mc is the best-matching or “winner” codebook vector of x: that with the minimal Euclidean distance to x.
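
A minimal sketch of one supervised LVQ (LVQ1) update following the rules above (NumPy; names and the gain value are illustrative):

    import numpy as np

    def lvq1_step(x, y, M, labels, alpha):
        """One supervised LVQ (LVQ1) update for input x with known class y."""
        c = int(np.argmin(np.linalg.norm(M - x, axis=1)))  # winner m_c
        if labels[c] == y:
            M[c] += alpha * (x - M[c])   # correctly classified: move winner toward x
        else:
            M[c] -= alpha * (x - M[c])   # incorrectly classified: move winner away from x
        # all other m_i, i != c, are left unchanged
        return c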

Topological neighborhood

Goal: partition the cells of the map such that each responds to a discrete set of features of the data. Like competitive learning, but with neighborhood-dependent updates of the cells in Nc to learn a spatial ordering.

This algorithm allows high-dimensional data to be visualized in a lower-dimensional (often 2D) space.


Goal: partition the cells of the map such that each responds to a discrete set of “patterns” or features of the data.
Like competitive learning, but with neighborhood-dependent updates over Nc, which we consider to be a topologically related subset of the cells in the map, to learn a spatial ordering of the input data.

SOM architecture

Network: (most commonly) n × m cells positioned on a grid or lattice, where each cell mi ∈ Rn (the red nodes)

x1, x2, x3 are the input vectors or patterns ∈ Rn

(the yellow nodes)

Each input xi is “connected” to every mi in the network (so that each xi can be compared with all mi)

Figure: An example of a 3 × 3 network

(Snapshot from https://youtu.be/ Euwc9fWBJw)


SOM Learning algorithm

Forming the topological order of the map: begin with a large Nc, roughly 1/2 the number of neurons in the network. Iteratively:

1. Find cell c s.t. ||x − mc|| = mini{||x − mi||}

2. Update the mi:

If cell i ∈ Nc(t):

mi(t + 1) = mi(t) + α(t)[x(t) − mi(t)]

The adaptation gain scalar α, 0 < α < 1, is often replaced with a scalar “kernel function” hci(t), most commonly a Gaussian neighborhood:

hci(t) = h0(t) exp(−||ri − rc||² / σ(t)²)

where h0(t), σ(t) decrease over time, and ri, rc denote the lattice locations of cells i and c.


As before, if i ∉ Nc(t), mi(t + 1) = mi(t) stays unchanged.
Using the Gaussian neighborhood mirrors the biological lateral interaction within regions of neurons in the brain.
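
Putting the pieces together, here is a hedged sketch of one SOM update with a Gaussian neighborhood on a 2D lattice (NumPy; the grid layout, names, and parameter schedules are illustrative):

    import numpy as np

    def som_step(x, W, grid, alpha, sigma):
        """One SOM update on a lattice: winner search + Gaussian-neighborhood update."""
        # 1. find the winner c: the cell whose weight vector is closest to x
        c = int(np.argmin(np.linalg.norm(W - x, axis=1)))
        # 2. Gaussian neighborhood h_ci = alpha * exp(-||r_i - r_c||^2 / sigma^2)
        d2 = np.sum((grid - grid[c]) ** 2, axis=1)
        h = alpha * np.exp(-d2 / sigma ** 2)
        # 3. move every cell toward x, weighted by its neighborhood value
        W += h[:, None] * (x - W)
        return c

    # usage: a 10 x 10 lattice of cells with weights in R^3
    rng = np.random.default_rng(0)
    rows, cols = 10, 10
    grid = np.array([(i, j) for i in range(rows) for j in range(cols)], dtype=float)
    W = rng.random((rows * cols, 3))
    T = 5000
    for t, x in enumerate(rng.random((T, 3))):
        alpha = 0.9 * (1 - t / T)              # decreasing gain
        sigma = max(0.5, 5.0 * (1 - t / T))    # shrinking neighborhood width
        som_step(x, W, grid, alpha, sigma)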

Simple demonstration with colors

Using 15 samples of RGB values, this 20 × 30 cell network learns a classification of the colors.


Each region of the map is effectively a feature classifier, so the 2D visualized graphical output can be thought of as a type of feature map of the input space.
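
A rough sketch of how a color map like this could be reproduced (reusing the som_step sketch from the learning-algorithm slide; the sample count and 20 × 30 grid follow the slide, everything else is illustrative):

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(1)
    colors = rng.random((15, 3))                 # 15 RGB samples in [0, 1]
    rows, cols = 20, 30
    grid = np.array([(i, j) for i in range(rows) for j in range(cols)], dtype=float)
    W = rng.random((rows * cols, 3))

    T = 500 * rows * cols                        # "rule of thumb" iteration count
    for t in range(T):
        x = colors[rng.integers(len(colors))]    # present a random color sample
        alpha = max(0.01, 0.9 * (1 - t / T))
        sigma = max(0.5, (rows / 2) * (1 - t / T))
        som_step(x, W, grid, alpha, sigma)       # from the earlier sketch

    plt.imshow(W.reshape(rows, cols, 3))         # each cell rendered as its learned color
    plt.axis("off")
    plt.show()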

Effect of decrease in network size

Alternative clustering of the same 15 RGB vectors with a 10 × 15 network:


Effect of increase in network size

Now, with a 50 × 60 network:


Practical tips in applications

Initialization of mi(0): may be arbitrary or random; the only restriction is that the vectors are distinct.

Number of iterations: “rule of thumb” is 500 × the number of network cells.

Learning rate α: use a high learning rate (close to 1) for the first 1000 iterations, then monotonically decrease α during fine-tuning.

Neighborhood size Nc:

If Nc is too small → the map will not learn a global ordering. If Nc is too large → the map will be very coarse-grained in its clustering. Choose a generous Nc at the start of training: Nc(0) can be greater than half the size of the network; reduce the size during training.

Note that the updates to each cell’s weights tend to smooth out over learning, and the weight vectors tend to become ordered in value along the axes of the network.


This reduction can be done linearly, e.g. as in the schedules sketched below.
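
A hedged sketch of such schedules (the exact numbers are illustrative; the slide only fixes the rough shape, not the values):

    def linear_schedule(t, T, start, end):
        """Linearly interpolate a parameter from start to end over T steps."""
        return start + min(t / T, 1.0) * (end - start)

    # illustrative values for a 20 x 30 map
    n_cells, map_width = 20 * 30, 30
    T = 500 * n_cells                      # "rule of thumb" iteration count
    for t in range(T):
        if t < 1000:
            alpha = 0.9                    # high learning rate for the ordering phase
        else:
            alpha = linear_schedule(t - 1000, T - 1000, 0.9, 0.01)   # fine-tuning decay
        radius = linear_schedule(t, T, 0.6 * map_width, 1.0)         # Nc starts generous, shrinks
        # ... run one SOM update with (alpha, radius) here ...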

Theory behind the SOM

Extremely difficult to capture the dynamics of the SOM in mathematical theorems and analyses

Many strict analyses have been attempted under simplifying constraints, though these are beyond the scope of this paper

Cottrell and Fort (1987) have proven that the SOM will converge to a globally ordered state in some constrained low-dimensional cases

Ritter, Martinetz, and Schulten (1992) have shown that, under the assumption that the x are discrete-valued, local order can be achieved for more general dimensionalities of the vectors

Several meaningful applications have been simulated using the SOM to demonstrate its utility and efficacy


The convergence of the original SOM algorithm to the globally ordered state has been proven mathematically in some simple low-dimensional cases: see Cottrell and Fort (1987). On the other hand, Ritter, Martinetz, and Schulten (1992) have presented a proof that a local order can be reached for more general dimensionalities of the vectors, if the distribution of the input vectors is discrete-valued.

Application: speech recognition

Building a “phonetic typewriter” to identify and recognize phonemes: represent continuous Finnish speech as spectral vectors x

x ∈ R15: spectral vectors computed every 9.83 ms using a 256-pt FFT

No segmentation of speech and fully unsupervised: naturally occurring features contributed to self-organization

Figure: 8 × 12 Finnish phoneme map. The 21 phonemes of Finnish have been learned: most cells respond to a unique phoneme, but some cells respond to two.


This “phonetic typewriter” identifies and recognizes phonemes in continuous speech: here, in Finnish.
Its main purpose is to identify phonemes in real time for the transcription of speech into text.
Spectra are evaluated every 9.83 ms using a 256-pt FFT, yielding a 15-dim. spectral vector (after grouping the channels).
The spectra were not labeled or segmented during learning.
The naturally occurring features in the original waveform contributed to self-organization.
After the map was adapted to the spectral data, calibration was performed using known stationary phonemes, resulting in the 21 phonemes we see represented in the map.
Converted speech into short-time spectra to form the x(t) input vectors.
All spectral vectors were presented to the algorithm in their naturally occurring order.
Spectra were not segmented or labeled in any way; the features present in the waveform contributed to the self-organization in this map.
Interestingly, most cells respond to a single phoneme; however, some cells respond to two. These seem to appear on the borders of phoneme groups; perhaps this is indicative of a similarity between the two phonemes.
LVQ was used to fine-tune the map to obtain optimal decision accuracy.
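
One common way to calibrate an unsupervised map like this (a sketch of the general idea, not the paper's exact procedure) is to pass labeled samples, e.g. known stationary phonemes, through the trained map and label each cell by the majority class among the samples it wins:

    import numpy as np
    from collections import Counter, defaultdict

    def calibrate(W, X_cal, y_cal):
        """Label each trained map cell by majority vote over labeled calibration data."""
        votes = defaultdict(Counter)
        for x, y in zip(X_cal, y_cal):
            c = int(np.argmin(np.linalg.norm(W - x, axis=1)))  # winning cell for x
            votes[c][y] += 1
        # cells that never win a calibration sample remain unlabeled (None)
        return {c: (votes[c].most_common(1)[0][0] if c in votes else None)
                for c in range(len(W))}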

Application: Semantic mapping

Goal: extract abstract, logically similar semantic information from symbolic data.

The meaning of a symbolic encoding requires consideration of the conditional probabilities of its occurrences (contexts) with other encodings

Represent a symbolic encoding as a vector with a symbol part xs and its context xc:

x = [xs; xc] = [xs; 0] + [0; xc]

Figure: Vocabulary used in the experiment. Semantic meaning should be inferred only from the context in which words occur: each word is encoded as a random, 7-dim. unit vector.


Using the vocabulary shown on the left, comprised of nouns, verbs, and adverbs, along with sentence patterns and some examples of 3-word sentences.
The input vector x is represented as the vectorial sum of two orthogonal components: the symbolic and contextual components.
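
A hedged sketch of how such inputs might be assembled (the vocabulary subset, the context window, and the averaging are illustrative choices, not the paper's exact construction): each word gets a fixed random 7-dim unit vector xs, and the context part xc averages the codes of the other words appearing with it.

    import numpy as np

    rng = np.random.default_rng(0)
    vocab = ["Mary", "Jim", "Bob", "dog", "horse", "eats", "runs", "fast"]  # illustrative subset
    codes = {}
    for w in vocab:
        v = rng.normal(size=7)
        codes[w] = v / np.linalg.norm(v)      # random 7-dim unit vector per word

    def word_context_vector(sentence, i):
        """x = [xs; 0] + [0; xc]: the word's own code stacked with its averaged context."""
        x_s = codes[sentence[i]]
        x_c = np.mean([codes[w] for j, w in enumerate(sentence) if j != i], axis=0)
        return np.concatenate([x_s, x_c])     # a vector in R^14

    x = word_context_vector(["Mary", "eats", "fast"], 0)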

Application: semantic maps

Presented 2,000 word-context pairs derived from 10,000 random sentences

Figure: 15 × 10-cell learned semantic map


The map has segregated the different vocabulary types (nouns, verbs, adverbs) into different regions of the map.
Within each domain, there is additional grouping: notice that Mary, Jim, and Bob are in one area, near the other animals, and the comestibles are clustered together.

Concluding remarks

The SOM is a powerful neural model capable of creating localized, structured, clustered representations of input data

On its own, the SOM is not well suited for classification tasks unless fine-tuning methods are used to increase decision accuracy

Initialization of the map affects the locality of its learned responses and ordering

Particularly useful for high-dimensional data: reduces the dimensionality of the data, and the topological mapping allows underlying properties of the data to be visualized


References

Teuvo Kohonen

The Self-Organizing Map

Proceedings of the IEEE, Vol. 78, No. 9, September 1990.

Teuvo Kohonen

Essentials of the self-organizing map

Neural Networks, Vol. 37, January 2013, pp. 52-65.


Any questions?

1 Competitive learning

2 Learning vector quantization (LVQ) methods: unsupervised vector quantization, supervised LVQ

3 Topological neighborhood

4 Self-organizing map algorithm

5 Applications and results
