
The Self-Organizing Map

Jenny Hamer

University of California, San Diego

Paper by Teuvo Kohonen, IEEE 1990

November 8, 2018


Overview

1 Competitive learning

2 Learning vector quantization (LVQ) methods: unsupervised vector quantization, supervised LVQ

3 Topological neighborhood

4 Self-organizing map algorithm

5 Applications and results


Competitive learning

Let x ∈ Rn be an input vector. Let d(a, b) be a similarity metric between two vectorial inputs, for example the Euclidean distance between a and b.

1. Initialize a set of “reference” vectors M = {mi | mi ∈ Rn, i = 1, ..., k} (e.g. randomly)

2. Simultaneously compare x with each mi; compute the similarity between x and mi: d(x, mi)

3. The best-matching mi is declared the “winner”; let mc be this reference vector

4. Tune mc such that d(x, mc) decreases while all other mi are kept unchanged.

Supposing that the x follow some distribution with probability density function p, then the tuned mi form an approximation of p.
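
A minimal sketch of one competitive-learning step (NumPy; the function name and parameter values are illustrative, not from the paper):

    import numpy as np

    def competitive_step(x, M, alpha=0.05):
        """One competitive-learning update: move only the winner toward x."""
        # steps 2-3: compare x with every m_i and pick the best match (Euclidean distance)
        dists = np.linalg.norm(M - x, axis=1)
        c = int(np.argmin(dists))
        # step 4: tune only m_c so that d(x, m_c) decreases; all other m_i stay unchanged
        M[c] += alpha * (x - M[c])
        return c

    # usage: k = 10 randomly initialized reference vectors in R^3,
    # inputs drawn from some density p
    rng = np.random.default_rng(0)
    M = rng.normal(size=(10, 3))
    for x in rng.normal(size=(1000, 3)):
        competitive_step(x, M)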


Competitive learning visualization

The unit uj in the layer F2 is the “winner” for input pattern I: input I’s cluster is determined by this unit.


Unsupervised Vector Quantization: k-means

Goal: learn a finite, discrete approximation M of the continuous probability density function of vectorial data X = {x | x ∈ Rn}.

As in competitive learning, randomly initialize the codebook M. For each input vector x:

1. Assign x to the cluster of its “winner” mc, satisfying ||x − mc|| = mini{||x − mi||} (in general w.r.t. an Lp norm; we’ll use p = 2, i.e. Euclidean distance)

2. Tune the codebook vectors: mc(t + 1) = mc(t) + α(t)[x(t) − mc(t)] where mc was a “winner”; mi(t + 1) = mi(t) for all i ≠ c

Originally used in signal processing and approximation:

often more economical to use a batched method: match and assign numerous x(t), then update in a single step.


Classical method used to approximate the continuous probability density function p(x) of the set of vectorial data x.
Find the mc most similar to x - that which is closest in Euclidean distance to x.
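
The batched variant mentioned on the slide can be sketched as follows (a hedged NumPy illustration, not the paper's code): assign a whole batch of inputs to their winners first, then move each codebook vector to the mean of its assigned inputs in one step.

    import numpy as np

    def batch_vq_update(X, M):
        """One batched VQ / k-means step: assign a whole batch, then update once."""
        # match: nearest codebook vector (Euclidean) for every x in the batch
        d = np.linalg.norm(X[:, None, :] - M[None, :, :], axis=2)  # shape (N, k)
        winners = d.argmin(axis=1)
        # update: each codebook vector moves to the mean of the inputs assigned to it
        for i in range(len(M)):
            assigned = X[winners == i]
            if len(assigned):
                M[i] = assigned.mean(axis=0)
        return winners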

Supervised LVQ

Suppose some set of the input data x have known class labels: assign several mi ∈ M as representatives for each class (e.g. randomly). The mi closest to x determines x’s class.

1. Assign class of mc to x

2. Update the mi as follows:

If x is correctly classified:

mc(t + 1) = mc(t) + α(t)[x − mc(t)]

If x is incorrectly classified:

mc(t + 1) = mc(t) − α(t)[x − mc(t)]

leaving all weights i ≠ c unchanged: mi(t + 1) = mi(t). Here α(t), 0 < α(t) < 1, is a scalar gain which should decrease monotonically over time.


Suppose some set of the input data x have known class labels: assign several mi ∈ M as representatives for each class (e.g. randomly). The mi closest to x determines x’s class in a simple “nearest neighbor” fashion.
Suppose mc is the best-matching or “winner” codebook vector of x: that with the minimal Euclidean distance to x.
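
A minimal sketch of one supervised LVQ (LVQ1) update following the rules above (NumPy; names and the gain value are illustrative):

    import numpy as np

    def lvq1_step(x, y, M, labels, alpha):
        """One supervised LVQ (LVQ1) update for input x with known class y."""
        c = int(np.argmin(np.linalg.norm(M - x, axis=1)))  # winner m_c
        if labels[c] == y:
            M[c] += alpha * (x - M[c])   # correctly classified: move winner toward x
        else:
            M[c] -= alpha * (x - M[c])   # incorrectly classified: move winner away from x
        # all other m_i, i != c, are left unchanged
        return c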

Topological neighborhood

Goal: partition the cells of the map such that each responds to a discrete set of features of the data. Like competitive learning, but with neighborhood-dependent updates of the cells in Nc to learn a spatial ordering.

This algorithm allows high-dimensional data to be visualized in a lower-dimensional (often 2D) space.


Goal: partition the cells of the map such that each responds to a discrete set of “patterns” or features of the data.
Like competitive learning, but with neighborhood-dependent updates over Nc, which we consider to be a topologically related subset of the cells in the map, to learn a spatial ordering of the input data.

SOM architecture

Network: (most commonly) n × m cells positioned on a grid or lattice, where each cell mi ∈ Rn (the red nodes)

x1, x2, x3 are the input vectors or patterns ∈ Rn

(the yellow nodes)

Each input xi is “connected” to every mi in the network (so that each xi can be compared with all mi)

Figure: An example of a 3 × 3 network

(Snapshot from https://youtu.be/ Euwc9fWBJw)


SOM Learning algorithm

Forming the topological order of the map: begin with a large Nc, roughly 1/2 the number of neurons in the network. Iteratively:

1. Find cell c s.t. ||x − mc|| = mini{||x − mi||}

2. Update the mi:

If cell i ∈ Nc(t):

mi(t + 1) = mi(t) + α(t)[x(t) − mi(t)]

The adaptation gain scalar α, 0 < α < 1, is often replaced with a scalar “kernel function” hci(t), most commonly a Gaussian neighborhood:

hci(t) = h0(t) exp(−||ri − rc||² / σ(t)²)

where h0(t), σ(t) decrease over time, and ri, rc denote the lattice locations of cells i and c.


As before, if i ∉ Nc(t), mi(t + 1) = mi(t) stays unchanged.
Using the Gaussian neighborhood mirrors the biological lateral interaction within regions of neurons in the brain.
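
Putting the pieces together, here is a hedged sketch of one SOM update with a Gaussian neighborhood on a 2D lattice (NumPy; the grid layout, names, and parameter schedules are illustrative):

    import numpy as np

    def som_step(x, W, grid, alpha, sigma):
        """One SOM update on a lattice: winner search + Gaussian-neighborhood update."""
        # 1. find the winner c: the cell whose weight vector is closest to x
        c = int(np.argmin(np.linalg.norm(W - x, axis=1)))
        # 2. Gaussian neighborhood h_ci = alpha * exp(-||r_i - r_c||^2 / sigma^2)
        d2 = np.sum((grid - grid[c]) ** 2, axis=1)
        h = alpha * np.exp(-d2 / sigma ** 2)
        # 3. move every cell toward x, weighted by its neighborhood value
        W += h[:, None] * (x - W)
        return c

    # usage: a 10 x 10 lattice of cells with weights in R^3
    rng = np.random.default_rng(0)
    rows, cols = 10, 10
    grid = np.array([(i, j) for i in range(rows) for j in range(cols)], dtype=float)
    W = rng.random((rows * cols, 3))
    T = 5000
    for t, x in enumerate(rng.random((T, 3))):
        alpha = 0.9 * (1 - t / T)              # decreasing gain
        sigma = max(0.5, 5.0 * (1 - t / T))    # shrinking neighborhood width
        som_step(x, W, grid, alpha, sigma)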

Simple demonstration with colors

Using 15 samples of RGB values, this 20 × 30 cell network learns a classification of the colors.


Each region of the map is effectively a feature classifier, so the 2D visualized graphical output can be thought of as a type of feature map of the input space.
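
A rough sketch of how a color map like this could be reproduced (reusing the som_step sketch from the learning-algorithm slide; the sample count and 20 × 30 grid follow the slide, everything else is illustrative):

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(1)
    colors = rng.random((15, 3))                 # 15 RGB samples in [0, 1]
    rows, cols = 20, 30
    grid = np.array([(i, j) for i in range(rows) for j in range(cols)], dtype=float)
    W = rng.random((rows * cols, 3))

    T = 500 * rows * cols                        # "rule of thumb" iteration count
    for t in range(T):
        x = colors[rng.integers(len(colors))]    # present a random color sample
        alpha = max(0.01, 0.9 * (1 - t / T))
        sigma = max(0.5, (rows / 2) * (1 - t / T))
        som_step(x, W, grid, alpha, sigma)       # from the earlier sketch

    plt.imshow(W.reshape(rows, cols, 3))         # each cell rendered as its learned color
    plt.axis("off")
    plt.show()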

Effect of decrease in network size

Alternative clustering of the same 15 RGB vectors with a 10 × 15 network:


Effect of increase in network size

Now, with a 50 × 60 network:


Practical tips in applications

Initialization of mi(0): may be arbitrary or random; the only restriction is that the vectors are distinct.

Number of iterations: “rule of thumb” is 500 × the number of network cells.

Learning rate α: use a high learning rate (close to 1) for the first 1000 iterations, then monotonically decrease α during fine-tuning.

Neighborhood size Nc:

If Nc is too small → the map will not learn a global ordering. If Nc is too large → the map will be very coarse-grained in its clustering. Choose a generous Nc at the start of training: Nc(0) can be greater than half the size of the network; reduce the size during training.

Note that the updates to each cell’s weights tend to smooth out over learning, and the weight vectors tend to become ordered in value along the axes of the network.


This reduction can be done linearly, e.g. as in the schedules sketched below.
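
A hedged sketch of such schedules (the exact numbers are illustrative; the slide only fixes the rough shape, not the values):

    def linear_schedule(t, T, start, end):
        """Linearly interpolate a parameter from start to end over T steps."""
        return start + min(t / T, 1.0) * (end - start)

    # illustrative values for a 20 x 30 map
    n_cells, map_width = 20 * 30, 30
    T = 500 * n_cells                      # "rule of thumb" iteration count
    for t in range(T):
        if t < 1000:
            alpha = 0.9                    # high learning rate for the ordering phase
        else:
            alpha = linear_schedule(t - 1000, T - 1000, 0.9, 0.01)   # fine-tuning decay
        radius = linear_schedule(t, T, 0.6 * map_width, 1.0)         # Nc starts generous, shrinks
        # ... run one SOM update with (alpha, radius) here ...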

Theory behind the SOM

Extremely difficult to capture the dynamics of the SOM in mathematical theorems and analyses

Many strict analyses have been attempted under simplifying constraints, though these are beyond the scope of this paper

Cottrell and Fort (1987) have proven that the SOM will converge to a globally ordered state in some constrained low-dimensional cases

Ritter, Martinetz, and Schulten (1992) have shown that, under the assumption that the x are discrete-valued, local order can be achieved for more general dimensionalities of the vectors

Several meaningful applications have been simulated using the SOM to demonstrate its utility and efficacy


The convergence of the original SOM algorithm to the globally ordered state has been proven mathematically in some simple low-dimensional cases: see Cottrell and Fort (1987). On the other hand, Ritter, Martinetz, and Schulten (1992) have presented a proof that a local order can be reached for more general dimensionalities of the vectors, if the distribution of the input vectors is discrete-valued.

Application: speech recognition

Building a “phonetic typewriter” to identify and recognize phonemes: represent continuous Finnish speech as spectral vectors x

x ∈ R15: spectral vectors computed every 9.83 ms using a 256-pt FFT

No segmentation of speech and fully unsupervised: naturally occurring features contributed to self-organization

Figure: 8 × 12 Finnish phoneme map. The 21 phonemes of Finnish have been learned: most cells respond to a unique phoneme, but some cells respond to two.


This “phonetic typewriter” identifies and recognizes phonemes in continuous speech: here, in Finnish.
Its main purpose is to identify phonemes in real time for the transcription of speech into text.
Spectra are evaluated every 9.83 ms using a 256-pt FFT, yielding a 15-dim. spectral vector (after grouping the channels).
The spectra were not labeled or segmented during learning.
The naturally occurring features in the original waveform contributed to self-organization.
After the map was adapted to the spectral data, calibration was performed using known stationary phonemes, resulting in the 21 phonemes we see represented in the map.
Converted speech into short-time spectra to form the x(t) input vectors.
All spectral vectors were presented to the algorithm in their naturally occurring order.
Spectra were not segmented or labeled in any way; the features present in the waveform contributed to the self-organization in this map.
Interestingly, most cells respond to a single phoneme; however, some cells respond to two. These seem to appear on the borders of phoneme groups; perhaps this is indicative of a similarity between the two phonemes.
LVQ was used to fine-tune the map to obtain optimal decision accuracy.
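
One common way to calibrate an unsupervised map like this (a sketch of the general idea, not the paper's exact procedure) is to pass labeled samples, e.g. known stationary phonemes, through the trained map and label each cell by the majority class among the samples it wins:

    import numpy as np
    from collections import Counter, defaultdict

    def calibrate(W, X_cal, y_cal):
        """Label each trained map cell by majority vote over labeled calibration data."""
        votes = defaultdict(Counter)
        for x, y in zip(X_cal, y_cal):
            c = int(np.argmin(np.linalg.norm(W - x, axis=1)))  # winning cell for x
            votes[c][y] += 1
        # cells that never win a calibration sample remain unlabeled (None)
        return {c: (votes[c].most_common(1)[0][0] if c in votes else None)
                for c in range(len(W))}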

Application: Semantic mapping

Goal: extract abstract, logically similar semantic information from symbolic data.

The meaning of a symbolic encoding requires consideration of the conditional probabilities of its occurrences (contexts) with other encodings

Represent a symbolic encoding as a vector with a symbol part xs and its context xc:

x = [xs; xc] = [xs; 0] + [0; xc]

Figure: Vocabulary used in the experiment. Semantic meaning should be inferred only from the context in which words occur: each word is encoded as a random, 7-dim. unit vector.


Using the vocabulary shown on the left, comprised of nouns, verbs, and adverbs, along with sentence patterns and some examples of 3-word sentences.
The input vector x is represented as the vectorial sum of two orthogonal components: the symbolic and contextual components.
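
A hedged sketch of how such inputs might be assembled (the vocabulary subset, the context window, and the averaging are illustrative choices, not the paper's exact construction): each word gets a fixed random 7-dim unit vector xs, and the context part xc averages the codes of the other words appearing with it.

    import numpy as np

    rng = np.random.default_rng(0)
    vocab = ["Mary", "Jim", "Bob", "dog", "horse", "eats", "runs", "fast"]  # illustrative subset
    codes = {}
    for w in vocab:
        v = rng.normal(size=7)
        codes[w] = v / np.linalg.norm(v)      # random 7-dim unit vector per word

    def word_context_vector(sentence, i):
        """x = [xs; 0] + [0; xc]: the word's own code stacked with its averaged context."""
        x_s = codes[sentence[i]]
        x_c = np.mean([codes[w] for j, w in enumerate(sentence) if j != i], axis=0)
        return np.concatenate([x_s, x_c])     # a vector in R^14

    x = word_context_vector(["Mary", "eats", "fast"], 0)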

Application: semantic maps

Presented 2,000 word-context pairs derived from 10,000 random sentences

Figure: 15 × 10-cell learned semantic map


The map has segregated the different vocabulary types (nouns, verbs, adverbs) into different regions of the map.
Within each domain, there is additional grouping: notice that Mary, Jim, and Bob are in one area, near the other animals, and the comestibles are clustered together.

Concluding remarks

The SOM is a powerful neural model capable of creating localized, structured, clustered representations of input data

On its own, the SOM is not well suited for classification tasks unless fine-tuning methods are used to increase decision accuracy

Initialization of the map affects the locality of its learned responses and ordering

Particularly useful for high-dimensional data: reduces the dimensionality of the data, and the topological mapping allows underlying properties of the data to be visualized


References

Teuvo Kohonen

The Self-Organizing Map

Proceedings of the IEEE, Vol. 78, No. 9, September 1990.

Teuvo Kohonen

Essentials of the self-organizing map

Neural Networks, Vol. 37, January 2013, pp. 52-65.


Any questions?

1 Competitive learning

2 Learning vector quantization (LVQ) methods: unsupervised vector quantization, supervised LVQ

3 Topological neighborhood

4 Self-organizing map algorithm

5 Applications and results
