Systems and Computers in Japan, Vol. 24, No. 5, 1993. Translated from Denshi Joho Tsushin Gakkai Ronbunshi, Vol. J75-D-II, No. 6, June 1992, pp. 1085-1092.

Quantitative Properties of Kohonen’s Self-Organizing Maps as Adaptive Vector Quantizers

Toshiyuki Tanaka, Member

Faculty of Engineering, The University of Tokyo, Tokyo, Japan 113

Masao Saito, Member

The Institute of Medical Electronics, Faculty of Medicine, The University of Tokyo, Tokyo, Japan 113

SUMMARY

Kohonen's self-organizing model for neural networks can be considered a kind of adaptive vector quantization algorithm. Numerous reports have been presented on the application of the model to practical problems. Although some results have been presented on the theoretical properties of Kohonen's model, many properties remain to be clarified.

Among various properties of Kohonen’s model as an adaptive vector quantization algorithm, this paper considers the problem of how the reference vectors are placed according to the probability distribution of the input signal. Considering the limit where the number of reference vectors is increased to infinity, this problem can be discussed theoretically as the distribution of the reference vectors.

Due to the effect of the "learning by neighborhood," which is the characteristic feature of Kohonen's model, the property of Kohonen's model differs quantitatively from that of the ordinary vector quantization algorithm. This paper discusses quantitatively the properties of Kohonen's model using the average learning equation.


Key Words: Neural network; self-organizing maps; vector quantization; coding; cooperative learning.

1. Introduction

Vector quantization is a method to encode a set of vectors based on a finite number of reference vectors. Given a set of reference vectors {w_i}, a typical procedure to encode the input signal x (nearest coding) is as follows: for the input signal x, the nearest reference vector w_i is selected, and the input signal is encoded into the codeword corresponding to that reference vector w_i.
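As an illustration, the nearest-coding rule can be written in a few lines. The following is a minimal sketch, assuming Euclidean distance and NumPy arrays; the function and variable names are illustrative and are not taken from the paper.

```python
import numpy as np

def nearest_code(x, W):
    """Return the index i of the reference vector w_i closest to x.

    x : input signal, shape (n,)
    W : reference vectors, shape (N, n); row i is w_i
    """
    # Squared Euclidean distances from x to every reference vector.
    d2 = np.sum((W - x) ** 2, axis=1)
    return int(np.argmin(d2))

# Example: encode one 2-D input with N = 4 reference vectors.
W = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
x = np.array([0.8, 0.2])
print(nearest_code(x, W))  # -> 1, the codeword for w_1 = (1, 0)
```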

The properties of vector quantization as an encoding process are determined by the placement of the reference vectors for the given set of input signals. The evaluation criterion for the placement of the reference vectors depends on the purpose of the vector quantization. Typical criteria are the minimization of the expected quantization error and the maximization of the entropy of the resulting code.

In determining the optimal placement of the reference vectors for a given set of input signals and a given criterion, an adaptive algorithm is often employed. The reason for this is that the optimal placement of the reference vectors cannot, in general, be determined explicitly from the given set of input signals. Consequently, it is impossible to know beforehand the exact placement of the reference vectors obtained by the adaptive algorithm.

However, considering the limit where the number of reference vectors is increased to infinity, asymptotic results can be derived. Among those properties, this paper considers the asymptotic probability distribution q(x) of the reference vectors, which is formed adaptively in the input signal space. The following property is known [1, 2]. When the input signal x occurs independently in the n-dimensional space, following the probability density distribution p(x), the distribution of the reference vectors that minimizes the expectation of the quantization error measured by the r-th power of the Euclidean norm, |x - w_i|^r, asymptotically approaches q(x) ∝ p(x)^{n/(n+r)}, where r is an arbitrary positive real number. For example, for a scalar signal (n = 1) and the squared-error criterion (r = 2), this gives the classical q(x) ∝ p(x)^{1/3}.

The algorithm for adaptive vector quantization can be realized as a competitive learning algorithm in a certain kind of neural network. From such a viewpoint, the adaptive vector quantization algorithm has also been investigated as learning in a neural network. Kohonen introduced cooperative learning based on the concept of neighborhood into competitive learning (called neighborhood learning in the following). He proposed a learning algorithm and applied the idea to phoneme recognition and other problems [3].

Following Kohonen, a large number of improvements have been proposed concerning the quantitative properties and the convergence characteristics of Kohonen's learning algorithm [4-7]. Such learning algorithms were proposed from the standpoint of learning in neural networks. Their effectiveness is verified experimentally through computer simulation and application to practical problems. On the other hand, many theoretical properties of those learning algorithms remain unknown.

This paper discusses the problem of determining the asymptotic distribution of the reference vectors formed by the model proposed by Kohonen. If the model is restricted to the one-dimensional case, detailed results, including analysis of the relation of the formed distribution of the reference vectors to the size of the neighborhood used in the learning, have been obtained [8-10]. This paper presents an extension of the foregoing result to the case where the model executes n-dimensional neighborhood learning.

When the model is n-dimensional, it is in general difficult to provide a detailed analysis, such as the relation of the distribution of the reference vectors to the size of the neighborhood used in the learning. Here, it is noted that when the neighborhood used in the learning is somewhat large, the shape of the distribution of the reference vectors depends little on the exact size of the neighborhood. Based on this observation, the approximate distribution of the reference vectors is determined for the case where the neighborhood is sufficiently large.

2. Algorithm for Adaptive Vector Quantization

2.1. Fundamental competitive learning

It is assumed that the input signal x occurs independently in the n-dimensional space S, following the probability density function p(x). Apart from this, a set F composed of N elements (which correspond to neurons) is prepared. The i-th element has the reference vector w_i, i = 1, 2, ..., N, which takes values in S.

The procedure in the fundamental competitive learning is as follows (Fig. 1). When an input signal x occurs, the reference vector w_{i(x)} with the value closest to x is sought. Let the unit vector directed from w_{i(x)} toward x be e_{i(x)}. Let the distance between w_{i(x)} and x be written as D(x - w_{i(x)}), where D is an appropriate norm in S. Then the learning is executed by the following equation for the reference vector w_{i(x)}:

$$\Delta w_{i(x)} = \varepsilon\, D(x - w_{i(x)})\, e_{i(x)} \qquad (1)$$

where ε is a small positive number. The "norm" D in this paper is not necessarily a norm in the strict mathematical sense; it is only required that D(x - y) indicate the "distance" between x and y.
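The update of Eq. (1) is easy to state in code. The following is a minimal sketch of one learning step, assuming the power-of-(r-1) norm D_r(z) = |z|^{r-1} introduced shortly below; epsilon and r are free parameters and the function name is illustrative.

```python
import numpy as np

def competitive_step(x, W, epsilon=0.05, r=2.0):
    """One step of fundamental competitive learning, Eq. (1),
    using the power-of-(r-1) norm D_r(z) = |z|**(r-1)."""
    i = int(np.argmin(np.sum((W - x) ** 2, axis=1)))   # winning element i(x)
    diff = x - W[i]                                    # points from w_i(x) to x
    dist = np.linalg.norm(diff)
    if dist > 0.0:
        W[i] += epsilon * dist ** (r - 1) * (diff / dist)  # Eq. (1)
    return W
```

For r = 2 the step reduces to the familiar rule Δw_{i(x)} = ε (x - w_{i(x)}).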

It is seen that, when ε is sufficiently small, there exists the following potential function for the fundamental competitive learning [11]:

$$E = \int L(x - w_{i(x)})\, p(x)\, dx \qquad (2)$$

where L is a function satisfying |∇L(z)| = D(z).

Fig. 1. Fundamental competitive learning.

Fig. 2. Two-dimensional array of neuronal units and examples of topological neighborhood.

Consider the power-of-(r-1) norm D_r(z) = |z|^{r-1} as a typical measure of the distance. Then L_r(z) = |z|^r / r (r > 0). In this case, the fundamental competitive learning is the same as probabilistic steepest descent on the potential E of Eq. (2).

When the distance measure is defined as the power-of-(r-1) norm D_r(z) = |z|^{r-1}, the asymptotic distribution q(x) of the reference vectors in S, for a sufficiently large number of elements N, can be determined by a variational method. The result is [12]

$$q(x) \propto p(x)^{n/(n+r)} \qquad (3)$$

Furthermore, by calculating the second-order variation, it is verified that the foregoing solution is locally stable at least for r > 0.

In other words, to minimize the expectation of the power-of-r error, it suffices to execute the fundamental competitive learning using the power-of-(r-1) norm. As a result of the learning, the distribution of the reference vectors approximates the distribution of the input signals raised to the power n/(n+r). When N is sufficiently large, the aforementioned result also applies asymptotically even if a general norm D(z) is used as the distance measure, provided that D(z) ≈ |z|^{r-1} as |z| → 0.

2.2. Kohonen's model

Kohonen proposed a learning algorithm in which cooperative learning using a "neighborhood" is introduced into the fundamental competitive learning algorithm discussed in the previous subsection. This learning algorithm is realized on Kohonen's model, which is described in the following.

In Kohonen's model, a regular n-dimensional grid array formed by the elements is considered as the set F of elements, with the same dimension as that of the input signals. The neighborhood N_i of element i is the set of elements placed close to element i in this grid array. The "size" of the neighborhood is considered and is denoted by m. If m = 0, N_i = {i}. When the size m of the neighborhood is large, elements far from element i are included in N_i. Figure 2 shows an example of the grid array of elements and a neighborhood for the case n = 2.
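To make the neighborhood concrete, the following sketch builds the square (hypercubic for general n) neighborhood N_i of size m on a grid of elements; the function name and the use of index tuples are illustrative only.

```python
from itertools import product

def neighborhood(i, m, grid_shape):
    """Return the set N_i of grid indices in the size-m (hyper)cubic
    neighborhood of element i; clipped at the grid boundary."""
    ranges = [range(max(0, c - m), min(s, c + m + 1))
              for c, s in zip(i, grid_shape)]
    return set(product(*ranges))

# Example on a 7 x 7 grid (n = 2): the m = 1 neighborhood of element (3, 3)
# is the 3 x 3 block of elements around it, as in Fig. 2; m = 0 gives {(3, 3)}.
print(sorted(neighborhood((3, 3), 1, (7, 7))))
```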

When an input signal x occurs, only the element i(x) with the reference vector closest to the input executes the learning in the fundamental competitive learning discussed in the previous section. In Kohonen's model, all elements contained in the neighborhood N_{i(x)} of element i(x) execute the learning (Fig. 3):

$$\Delta w_j = \varepsilon\, D(x - w_j)\, e_j, \qquad j \in N_{i(x)} \qquad (4)$$

where e_j is the unit vector directed from w_j toward x.

Fig. 3. Learning in Kohonen's model.

When ε is sufficiently small, the following average learning equation is derived from Eq. (4):

$$\langle \Delta w_i \rangle = \varepsilon \int_{C_i} D(x - w_i)\, e_i\, p(x)\, dx \qquad (5)$$

where C_i = ∪_{j∈N_i} V_j, and V_j is the region of input signals x such that w_j is the closest reference vector (more precisely, the symmetry of the neighborhood, j ∈ N_i ⇔ i ∈ N_j, is assumed). In other words, C_i is the region such that "when the input occurs in C_i, element i executes the learning."

In Kohonen's model, the following two processes are assumed to progress with the learning [3]:

(a) The reference vectors in S are rearranged so that their placement follows the element array in F.

(b) The placement of the reference vectors converges to the equilibrium state of the learning.

Property (a) is important from the viewpoint of the self-organization of the topological mapping. However, the rearrangement is ensured theoretically only for the case of a one-dimensional uniform input [13], and is verified only experimentally in the general case. In this paper, the discussion assumes that the rearrangement of the reference vectors is already completed.

To examine the distribution of the reference vectors formed by Kohonen's model, the dynamics of the system given by Eq. (5) should be analyzed. In the case of the fundamental competitive learning, there exists the potential function of the form of Eq. (2), by which the distribution of the reference vectors can be examined. In the case of Kohonen's model, on the other hand, a potential function corresponding exactly to Eq. (2) has not been found, although some discussions examine the properties of the model using a set of functions similar to a potential function [14]. This paper derives the asymptotic distribution of the reference vectors formed by the learning based on the equilibrium condition ⟨Δw_i⟩ = 0 for the average learning equation (5).

3. Asymptotic Distribution of Reference Vectors

3.1. One-dimensional model

When the model is one-dimensional, i.e., n = 1, the input signal is assumed to be one-dimensional, and a one-dimensional array of elements is considered as F. The neighborhood N_i of element i in F is defined as

$$N_i = \{\, j : |i - j| \le m \,\} \qquad (6)$$

Then the asymptotic distribution of the reference vectors formed by the learning is determined by the size m of the neighborhood.

More precisely, consider the case where the norm D_2 (i.e., r = 2) is used as the distance measure, and let

$$q(x) \propto p(x)^{s} \qquad (7)$$

Then it is verified by theoretical investigation and by computer simulation that [9, 10]

$$s = \frac{2}{3} - \frac{1}{3\,[\,m^2 + (m+1)^2\,]} \qquad (8)$$

In general, when the norm D_r is used as the distance measure, the following result is easily derived in the same way as Ritter derived the result of Eq. (8) in [9]:

$$s = \frac{1}{1+r}\left[\,2 - \frac{1}{m^2 + (m+1)^2}\,\right] \qquad (9)$$

It is seen from Eq. (9) that s = 1/(1+r) for m = 0, and that s rapidly approaches 2/(1+r) as m increases. When m ≥ 1, one can consider that s ≈ 2/(1+r) independently of the value of m. The foregoing result indicates that the shape of the asymptotic distribution of the reference vectors is affected greatly by whether or not the neighborhood learning is executed, but is affected little by the size m of the neighborhood.
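A quick numerical check of Eq. (9) illustrates how weakly s depends on m once neighborhood learning is switched on. The snippet below simply evaluates the formula and is a sketch for illustration only.

```python
def exponent_s(m, r):
    """Exponent s of Eq. (9) for neighborhood size m and norm parameter r."""
    return (2.0 - 1.0 / (m * m + (m + 1) ** 2)) / (1.0 + r)

for r in (1.0, 2.0):
    print(f"r = {r}:", [round(exponent_s(m, r), 4) for m in range(6)])
# For r = 2 this prints 0.3333, 0.6000, 0.6410, ..., rapidly approaching 2/3,
# while m = 0 (no neighborhood learning) gives 1/(1+r).
```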

3.2. n-dimensional model

In contrast to the previous case, in the general n-dimensional model it is very difficult to handle the size of the neighborhood explicitly. It is anticipated from the result of the previous one-dimensional case that the result of the learning is affected greatly by whether or not the neighborhood learning is executed, but is affected little by the size of the neighborhood, provided that the neighborhood learning is executed. The result when the neighborhood learning is not executed, as discussed in Section 2.1, is already known. From this viewpoint, an approximate result is derived in the following, assuming that the neighborhood is somewhat large.

The n-dimensional grid array is considered as the element array F. An element is specified by a tuple of n suffixes, i = (i_1, i_2, ..., i_n). The following (hyper)cube is considered as the neighborhood N_i of element i:

$$N_i = \{\, j : |j_k - i_k| \le m,\ k = 1, \ldots, n \,\} \qquad (10)$$

Consider the situation where the learning has progressed sufficiently, and assume that the rearrangement of the reference vectors is already completed. If the input probability density distribution p(x) is sufficiently smooth, the density distribution of the reference vectors will also be sufficiently smooth. Consider a certain reference vector w_i. Near x = w_i, the reference vectors contained in the neighborhood N_i of element i will be placed retaining the placement of the elements in F. At x = w_i, the density of the reference vectors is q(w_i) and the density gradient is ∇q(w_i). The foregoing situation is modeled as follows.

As the zeroth-order approximation, the model is constructed without considering the density gradient ∇q(w_i). The neighborhood N_i of element i contains (2m + 1)^n elements, and one can consider that their reference vectors are placed in the form of a (hyper)cube in S with x = w_i as the center. Letting the distance between the closest reference vectors be d, the density satisfies q(w_i) = d^{-n}. Define appropriately an orthonormal basis {e_1, e_2, ..., e_n} with x = w_i as the origin. The position w^{(0)}_{i+j} of the reference vector of element (i + j) in the zeroth-order approximation is represented, using the relative position with regard to w_i, Δw^{(0)}_j = w^{(0)}_{i+j} - w_i, as

$$\Delta w^{(0)}_j = d \sum_{k=1}^{n} j_k\, e_k \qquad (11)$$

Next, the first-order approximation is considered, including the density gradient ∇q(w_i). The model is constructed by considering the perturbation by the density gradient applied to the (hyper)cubic array represented by Eq. (11). Let the unit vector directed in the direction of ∇q(w_i) be e_g:

$$e_g = \frac{\nabla q(w_i)}{|\nabla q(w_i)|} \qquad (12)$$

Δw^{(0)}_j in the zeroth-order approximation is corrected to the first-order approximation Δw_j by the following equation, in proportion to the component e_g · Δw^{(0)}_j of Δw^{(0)}_j in the direction of e_g:

$$\Delta w_j = \left[\,1 + c\,(e_g \cdot \Delta w^{(0)}_j)\,\right] \Delta w^{(0)}_j \qquad (13)$$

Fig. 4. Adjustment of reference vectors to fit their density gradient.

Figure 4 shows an example of correction applied to the placement of the reference vectors by Eq. (13).

The value of the constant multiplier c can be determined from the condition that the density gradient of the placement given by Eq. (13) should agree with ∇q(w_i). It is given as follows:

$$c\, e_g = -\frac{1}{n+1}\, \frac{\nabla q(w_i)}{q(w_i)} \qquad (14)$$
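The constant in Eq. (14), as reconstructed here, can be sanity-checked numerically: for the perturbed placement x = w_i + f(z) with f(z) = z + c(e_g · z)z, the Jacobian determinant is 1 + (n+1)c(e_g · z) to first order, which is what ties c to the density gradient. The following sketch verifies that expansion by finite differences; all names are illustrative and the check assumes the reconstructed form of Eq. (13).

```python
import numpy as np

def f(z, c, e_g):
    """Perturbed placement in continuous form: f(z) = z + c (e_g . z) z."""
    return z + c * np.dot(e_g, z) * z

def jacobian_det(z, c, e_g, h=1e-6):
    """Determinant of the Jacobian of f at z, by central finite differences."""
    n = len(z)
    J = np.zeros((n, n))
    for k in range(n):
        dz = np.zeros(n); dz[k] = h
        J[:, k] = (f(z + dz, c, e_g) - f(z - dz, c, e_g)) / (2 * h)
    return np.linalg.det(J)

rng = np.random.default_rng(0)
n, c = 3, -0.05
e_g = rng.normal(size=n); e_g /= np.linalg.norm(e_g)   # unit gradient direction
z = 0.1 * rng.normal(size=n)                           # small displacement

exact = jacobian_det(z, c, e_g)
first_order = 1.0 + (n + 1) * c * np.dot(e_g, z)
print(exact, first_order)   # the two values agree to first order in |z|
```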

The power-of-(r-1) norm D_r is used as the distance measure. When the size m of the neighborhood is sufficiently large, the equilibrium condition ⟨Δw_i⟩ = 0 given by the average learning equation is written approximately as

$$\int_{C_i} |x - w_i|^{r-2}\, (x - w_i)\, p(x)\, dx = 0 \qquad (15)$$

where the region C_i of integration is approximated by the (hyper)cubic region covered by the reference vectors of the neighborhood N_i. The independent variable is converted to z by the variable transformation x = g(z) = w_i + f(z), with

$$f(z) = z + c\,(e_g \cdot z)\, z \qquad (16)$$

Then

$$\int_{C'} |f(z)|^{r-2}\, f(z)\, p(w_i + f(z))\, J[f]\, dz = 0 \qquad (17)$$

where J[f] is the Jacobian of f and C' is the region corresponding to C_i in the z coordinates. Assuming that p(x) and f(z) are sufficiently smooth, the integrand on the left-hand side of Eq. (17) is expanded in a Taylor series around z = 0. Ignoring the higher-order terms in z, the result is obtained as

$$c\, e_g = -\frac{1}{n+r}\, \frac{\nabla p(w_i)}{p(w_i)} \qquad (18)$$

The detailed derivation of Eq. (18) is given in the Appendix.

Using the foregoing result and Eq. (14), the equilibrium condition for the average learning equation is given as

$$\frac{1}{n+1}\, \frac{\nabla q(w_i)}{q(w_i)} = \frac{1}{n+r}\, \frac{\nabla p(w_i)}{p(w_i)} \qquad (19)$$

This is the condition which should hold at x = w_i. Assuming that the number N of elements is sufficiently large, one can assume that the aforementioned condition applies everywhere in the input signal space S. Then the following differential equation is obtained:

$$\frac{\nabla q(x)}{q(x)} = \frac{n+1}{n+r}\, \frac{\nabla p(x)}{p(x)} \qquad (20)$$

By solving this equation,

$$q(x) \propto p(x)^{(n+1)/(n+r)} \qquad (21)$$
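For completeness, the step from Eq. (20) to Eq. (21) is a single integration; writing α = (n+1)/(n+r):

```latex
\nabla \log q(x) = \alpha\, \nabla \log p(x), \quad \alpha = \frac{n+1}{n+r}
\;\Longrightarrow\;
\log q(x) = \alpha \log p(x) + \text{const}
\;\Longrightarrow\;
q(x) \propto p(x)^{\alpha}.
```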

For n = 1, the foregoing result coincides with the m → ∞ limit of Eq. (9), and is thus an extension of that result. It may seem that the aforementioned result depends on the detailed modeling of the placement of the reference vectors considering the density gradient (Eq. (13) or (16)), but this is not actually the case. As is seen in the derivation in the Appendix, the result of Eq. (21) is derived essentially from the lower-order terms in the expansion of f(z) with regard to z, together with the symmetry properties of the neighborhood shape. Consequently, it is expected that the aforementioned result is valid under more general conditions.

Fig. 5. Snapshots of reference vectors in the learning process (training steps: (a) 0; (b) 10^3; (c) 10^4; (d) 10^5; (e) 10^6; (f) 5×10^6).

In the preceding reasoning, the form of the distribution q(x) of the reference vectors is derived from a local equilibrium condition. However, it may happen when n ≥ 2 that the local placement of the reference vectors is affected by the global placement of the reference vectors. It should be noted that such an effect is ignored in this paper.


4. Result of Simulation

The results of computer simulation are presented in this section to verify the theoretical result derived in the previous section. A two-dimensional model is considered. In other words, it is assumed that the input signal is two-dimensional and that the reference vectors are arranged in two dimensions.

It is assumed that the input signal x = (x_1, x_2) ∈ [0, 1)^2 occurs independently, following the probability density distribution p(x_1, x_2) = 3x_1x_2^2; 2304 reference vectors are prepared, forming a 48 × 48 array. The initial placement of the reference vectors is set as q(x) ∝ p(x), so that maximum-entropy coding is realized in the initial state. As the distance measure, r = 2 is used, i.e., D_2(z) = |z|.

To speed up the convergence of the learning, the learning parameter (ε in Eq. (1)) and the size of the neighborhood are set large at the beginning and are reduced gradually as the learning progresses. In the simulation, the size of the neighborhood is eventually set to m = 3 to observe the effect of the neighborhood learning. At each stage of the learning, the marginal distributions of the reference vectors are determined for x_1 and x_2. To eliminate the edge effect, the central one-third is extracted, and the value of the exponent s is determined by regression analysis for the representation q(x) ∝ p(x)^s.
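The simulation just described can be condensed into a short program. The following sketch is offered under stated assumptions rather than as the authors' original code: it uses a smaller 24 × 24 map, an illustrative separable input density p ∝ x_1x_2 in place of the density above, a plain squared-error (r = 2) Kohonen update with a square neighborhood whose size and learning rate shrink during training, and an exponent estimate obtained by a log-log fit on the marginal distribution of the x_1 coordinates. All function names and schedule constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
G = 24                                   # map is a G x G grid (paper uses 48 x 48)
W = rng.uniform(0, 1, size=(G, G, 2))    # reference vectors w_i in [0,1]^2

def sample_input():
    # p(x1, x2) proportional to x1 * x2 on [0,1]^2: inverse-CDF sampling of density 2t
    return np.sqrt(rng.uniform(size=2))

steps = 200_000
for t in range(steps):
    x = sample_input()
    d2 = np.sum((W - x) ** 2, axis=2)                    # distances to all w_i
    wi = np.unravel_index(np.argmin(d2), d2.shape)       # winning grid index i(x)
    eps = 0.5 * (0.01 / 0.5) ** (t / steps)              # learning rate 0.5 -> 0.01
    m = max(3, int(round(G / 2 * (1 - t / steps))))      # neighborhood size -> 3
    # square neighborhood update (r = 2): w_j += eps * (x - w_j) for j in N_i(x)
    r0, r1 = max(0, wi[0] - m), min(G, wi[0] + m + 1)
    c0, c1 = max(0, wi[1] - m), min(G, wi[1] + m + 1)
    W[r0:r1, c0:c1] += eps * (x - W[r0:r1, c0:c1])

# Exponent estimate from the marginal of the x1 coordinates:
# for p proportional to x1, q proportional to p^s implies a marginal density ~ x1^s.
w1 = W[..., 0].ravel()
hist, edges = np.histogram(w1, bins=12, range=(0, 1), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
keep = slice(4, 8)                                       # central bins, to reduce edge effects
s_hat, _ = np.polyfit(np.log(centers[keep]), np.log(hist[keep]), 1)
print("estimated exponent s ~", round(s_hat, 2))
# Theoretical values: 3/4 with neighborhood learning, 1/2 without (n = 2, r = 2).
```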

Figure 5 shows the result of computer simulation under the aforementioned conditions. The figure shows the placement of the reference vectors at each stage of the learning. Although there exist some edge effects, it is seen that a stable placement is formed in the central part of the reference vector array.

Figure 6 shows the value of the exponent determined at each stage of the learning as a function of the number of training steps. Since n = 2 and r = 2 in this case, it is expected that s = 1/2 for the fundamental competitive learning and s = 3/4 for the neighborhood learning considered in the previous section. In other words, it is expected in this case, where the neighborhood learning is executed, that the value of the exponent s approaches 3/4 with the progress of the learning.

It is seen from the results for x_1 and x_2 that the tendency is to approach s = 3/4 with the progress of the learning, although the situations for the two coordinates are a little different.

Fig. 6. A plot of the power-law dependence of the reference vector density q on the input signal distribution p vs. the number of training steps.

Thus, the validity of the theoretical result derived in the previous section is shown.

5. Conclusions

This paper considered the asymptotic properties of adaptive vector quantization algorithms and discussed the problem of how the distribution q(x) of the reference vectors is formed when the probability distribution p(x) of the input signal is given. The following property is shown for Kohonen's model, which has been studied from the viewpoint of learning in neural networks. When the dimension of the model is n, the distance measure is given by the power-of-(r-1) norm D_r, and the effect of the neighborhood learning is sufficiently large, q(x) ∝ p(x)^{(n+1)/(n+r)} holds approximately.

The foregoing result corresponds to the already known result q(x) ∝ p(x)^{n/(n+r)} for the case of the fundamental competitive learning, i.e., the case without the neighborhood learning, and indicates that there is a quantitative difference between learning by the fundamental competitive learning and learning by Kohonen's model with the neighborhood learning. The result in this paper is also an extension of Ritter's result (Eq. (9)) for the one-dimensional case to the n-dimensional case, under the condition that the effect of the neighborhood learning is sufficiently large. The presented result is considered useful as a guide in such cases.

The stability of the learning is sometimes a problem in Kohonen's model [15]. The instability exhibited by Kohonen's model is one of the important examples in the discussion of the self-organization of neural networks. There already exist a considerable number of studies on this point for the one-dimensional model [16-18]. Combining those results with the result of the simulation in this paper, it seems that Kohonen's model is stable (in the sense of this paper) if the learning parameter (ε in Eq. (1)) is sufficiently small. However, this point has not been settled at this stage, and we plan to investigate this problem further from the theoretical viewpoint.

Acknowledgement. A part of this study was funded by a Science Grant from the Ministry of Education (General B, No. 03452183).

REFERENCES

1. Y. Yamada, S. Tazaki and R. M. Gray. Asymptotic performance of block quantizers with difference distortion measures. IEEE Trans. Inform. Theory, IT-26, 1, pp. 6-14 (Jan. 1980).

2. P. L. Zador. Asymptotic quantization error of continuous signals and the quantization dimension. IEEE Trans. Inform. Theory, IT-28, 2, pp. 139-149 (March 1982).

3. T. Kohonen. Self-Organization and Associative Memory. 3rd ed., Springer-Verlag (1989).

4. D. DeSieno. Adding a conscience to competitive learning. IEEE Int. Conf. Neural Networks, 1, pp. 117-124 (1988).

5. S. C. Ahalt, A. K. Krishnamurthy, P. Chen and D. E. Melton. Competitive learning algorithms for vector quantization. Neural Networks, 3, 3, pp. 277-290 (1990).

6. S.-G. Kong and B. Kosko. Differential competitive learning for centroid estimation and phoneme recognition. IEEE Trans. Neural Networks, 2, 1, pp. 118-124 (Jan. 1991).

7. S. P. Luttrell. Code vector density in topographic mappings: Scalar case. IEEE Trans. Neural Networks, 2, 4, pp. 427-436 (July 1991).

8. H. Ritter and K. Schulten. On the stationary state of Kohonen's self-organizing sensory mapping. Biol. Cybern., 54, pp. 99-106 (1986).

9. H. Ritter. Asymptotic level density for a class of vector quantization processes. IEEE Trans. Neural Networks, 2, 1, pp. 173-175 (Jan. 1991).

10. T. Tanaka and M. Saito. On the property of Kohonen's self-organizing maps. Papers of Technical Group on Neurocomputing, I.E.I.C.E., Japan, NC90-107 (March 1991).

11. M. Iri, K. Murota and T. Ohya. A fast Voronoi-diagram algorithm with applications to geographical optimization problems. Proc. 11th IFIP Conf. on System Modelling and Optimization, Copenhagen, 1983, Lecture Notes in Control and Information Sciences, 59 (P. Thoft-Christensen, ed.), Springer-Verlag, pp. 273-288 (1984).

12. K. Kurata. Learning of neural network with competitive hidden units. Mem. Inst. Math. Anal., Kyoto Univ., 678, pp. 52-67 (1989).

13. M. Cottrell and J.-C. Fort. Étude d'un processus d'auto-organisation. Ann. Inst. Henri Poincaré, Probabilités et Statistiques, 23, 1, pp. 1-20 (1987).

14. V. V. Tolat. An analysis of Kohonen's self-organizing maps using a system of energy functions. Biol. Cybern., 64, 2, pp. 155-164 (Dec. 1990).

15. K. Kurata. On the formation of columnar and hypercolumnar structures in self-organizing models of topographic mappings. Papers of Technical Group on Neurocomputing, I.E.I.C.E., Japan, NC89-72 (March 1990).

16. H. Ritter and K. Schulten. Convergence properties of Kohonen's topology conserving maps: Fluctuations, stability and dimension selection. Biol. Cybern., 60, pp. 59-71 (1988).

17. K. Kurata and T. Yamada. On the formation of discrete microstructures in a self-organizing model of topographic mappings with variable excitation intensity. Papers of Technical Group on Neurocomputing, I.E.I.C.E., Japan, NC90-108 (March 1991).

18. K. Kurata and T. Yamada. On the formation of discrete microstructures in Kohonen's self-organizing models of topographic mapping. Ibid., NC90-109 (March 1991).


APPENDIX

The derivation of Eq. (18) from Eq. (17) in the main text is shown in the following. Let the left-hand side of Eq. (17) be

$$I = \int_{C'} |f(z)|^{r-2}\, f(z)\, p(w_i + f(z))\, J[f]\, dz \qquad (A1)$$

|z| is assumed to be sufficiently small, and an approximation is applied in which the higher-order terms in the Taylor series expansion of the integrand are ignored. Since

$$f(z) = z + c\,(e_g \cdot z)\, z \qquad (A2)$$

there follows

$$|f(z)|^{r-2} \approx |z|^{r-2}\,[\,1 + (r-2)\,c\,(e_g \cdot z)\,] \qquad (A3)$$

$$p(w_i + f(z)) \approx p(w_i) + \nabla p(w_i) \cdot z \qquad (A4)$$

$$J[f] \approx 1 + (n+1)\,c\,(e_g \cdot z) \qquad (A5)$$

Consequently,

$$I \approx \int_{C'} |z|^{r-2}\, z\,\left[\, p(w_i) + (n+r)\,c\,p(w_i)\,(e_g \cdot z) + \nabla p(w_i) \cdot z + (n+r)\,c\,(e_g \cdot z)(\nabla p(w_i) \cdot z) \,\right] dz \qquad (A6)$$

The region C' of integration is a hypercube with z = 0 as the center. Considering the symmetry, the following relations are obtained, where a and b are constant vectors:

$$\int_{C'} |z|^{r-2}\, z\, dz = 0, \qquad \int_{C'} |z|^{r-2}\, (a \cdot z)\, z\, dz = k\, a, \qquad \int_{C'} |z|^{r-2}\, (a \cdot z)(b \cdot z)\, z\, dz = 0 \qquad (A7)$$

k is a scalar, i.e.,

$$k = \int_{C'} |z|^{r-2}\, z_1^2\, dz \qquad (A8)$$

Using those relations, Eq. (A6) is calculated. The result is

$$I \approx k\,\left[\,(n+r)\,c\,p(w_i)\,e_g + \nabla p(w_i)\,\right] \qquad (A9)$$

Since I = 0 by Eq. (17), it follows from Eq. (A9) that

$$(n+r)\,c\,p(w_i)\,e_g + \nabla p(w_i) = 0 \qquad (A10)$$

Using the original notation, Eq. (18), i.e.,

$$c\, e_g = -\frac{1}{n+r}\, \frac{\nabla p(w_i)}{p(w_i)}$$

is derived.

Here, f(z) represents the deviation of the placement of the reference vectors from the uniform placement. In the aforementioned reasoning, it is assumed that f(z) takes the form of Eq. (A2). In general, f(z) contains higher-order terms of z. If, however, the foregoing approximation of ignoring the higher-order terms is valid, the result is not affected by the higher-order terms of f(z). Another point is the region of integration C'. It is assumed in the foregoing to be a hypercube with z = 0 as the center. However, since Eq. (A7) is derived considering only the symmetry of C', the result remains the same even if C' is another kind of region, such as a hyperpolyhedron or a hypersphere.


AUTHORS (from left to right)

Toshiyuki Tanaka graduated in 1988 from the Dept. Electronic Eng., Fac. Eng., Univ. Tokyo, where he obtained a Master's degree in 1990 and is presently in the doctoral program. He is engaged in research on the learning theory of neural networks.

Masao Saito graduated in 1956 from the Dept. Electrical Eng., Fac. Eng., Univ. Tokyo, where he obtained a Dr. of Eng. degree in 1962. He was a Lecturer and Assoc. Prof. on the Fac. Eng., Univ. Tokyo. Since 1974, he has been a Prof. on the Fac. Medicine, Univ. Tokyo. He is engaged in research and education on network system theory and medical/biological engineering. He was President of JSMEBE, Vice-president of the Jap. Soc. Hyperthermic Oncol., and President of the Int. Fed. Med. Biol. Eng., and is an Honorary member of the Int. Fed. Med. Biol. Eng. He received the Inada Prize and Paper Award of IEICEJ and the Paper Award of JSMEBE. He is the author of Biological Engineering (IEICEJ), Book of Medical Engineering (Shokodo), and other books.
