
Journal of VLSI Signal Processing 32, 105–118, 2002. © 2002 Kluwer Academic Publishers. Manufactured in The Netherlands.

A Clustering Method Using Hierarchical Self-Organizing Maps

MASAHIRO ENDO, MASAHIRO UENO AND TAKAYA TANABE
NTT Cyber Space Laboratories, 3-9-11, Midori-cho, Musashino, Tokyo 180-8585, Japan

Received April 25, 2001; Revised October 30, 2001; Accepted November 19, 2001

Abstract. We describe a method of clustering that uses self-organizing maps (SOMs) in a method of image classification. To ensure that this clustering method is fast, we defined a hierarchical SOM and used it to construct the clustering method (M. Endo, M. Ueno, T. Tanabe, and M. Yamamoto, in Proc. of the IEEE Int. Workshop on Neural Networks for Signal Processing X, 2000, pp. 261–270). We define the clustering method in detail and outline its behavior as determined on the basis of both theory and experiment. We also propose a cooperative learning algorithm for the hierarchical SOM. Experiments on artificial image data confirmed the basic performance and adaptability of the SOM in clustering images. We also confirmed, both experimentally and theoretically, that our method is faster, for the objects used in these experiments, than a method based on a non-hierarchical SOM.

Keywords: image retrieval system, clustering, self-organizing maps, hierarchical SOM, cooperative learning algorithm, neighborhood function

1. Introduction

Improvements in computer performance and the rapid spread and development of the Internet and digital devices create more opportunities to use multimedia data, such as images and voice, and increase the demand for data of this type. Images are an especially effective way to communicate information—it is said that we get 80% of our information through the sense of sight—and using images can greatly increase the effectiveness of a human-machine interface. Thus, a storage system technology that enables us to store large quantities of images, and to quickly and easily find stored images, promises to be of great value.

We have developed a network storage device that uses several DVD-RAM libraries to store a large quantity of data (up to 2 TB). Unfortunately, no available retrieval system is capable of handling the quantity of data that our system is able to store. In existing storage systems, an index entry is made when information is stored. When users want to retrieve some item of information, they control a retrieval system that uses the index to find the required item.

Existing image-retrieval systems include Illustra and VIR [1], VisualSEEK [2], and ExSight [3]; these systems are based on, respectively, the tone and design of images, the relative positions of many color regions in images, and objects extracted from images.

However, with today’s databases and data-retrieval systems, it is difficult to find something useful accurately and quickly among the many items in a large database because of the huge index space for the data. ExSight applies C-Tree [4], a form of similarity indexing in a high-dimensional space, to obtain high-speed searching; however, the index is made by simply comparing the distance between the feature values of the stored images with those of a key-image. It is thus difficult to find data based on conceptual relationships.

Our goal is to create a mechanism for storage and searching that overcomes these problems by making an image-retrieval system which is based on an information-processing mechanism that has some similarities with the brain. In other words, we are looking for a mechanism that has a massively parallel processing structure which enables the extremely fast processing of information.


The mechanism should also, through learning, be able to modify its own information-processing structure and to extract ‘meanings’ from information by identifying relationships, in terms of various aspects of the data, among its records.

We think of an image-retrieval system as being divided into several parts according to its processing functions. These parts include an extractor to extract objects that express features from the images that will be the targets of searches, a calculator to calculate feature vectors (constructed values that express features of the objects), a clustering module to classify feature vectors according to their similarity, an indexing module to make an index of the objects being searched for, and searching modules.

In this paper, we describe a method of clustering that can be used as part of an image-retrieval system. Some systems that use the k-means method, the Isodata method, and so on, use the minimum-distance clustering algorithm [5] as the clustering system. In these methods, the input data is treated as a set of multi-dimensional vectors, and the degree of similarity between items of input data is expressed as a distance (e.g., the Euclidean distance). The classification algorithms use these distances to classify the data that is input. In the k-means method, a cluster number k must be determined before cluster processing. It is not possible to use this method to classify data when the value of k is inadequate. If input data comes from a set with an unknown probability distribution, it is difficult to choose a suitable value for k. The Isodata method allows the classification of data without deciding the number of clusters. However, some parameters must be provided before cluster processing, and the result of clustering is strongly affected by these parameters. If the input data is from an unknown probability distribution, these parameters are difficult to determine.

We focused attention on the cluster processing parameters and speed. We applied a self-organizing map (SOM) [6], which is a kind of neural network, to avoid having to provide parameters like those mentioned above. We also developed a two-layered hierarchical SOM for fast clustering. Although other hierarchical SOMs have been developed [7], our hierarchical SOM has a new type of learning algorithm. Fast clustering is implemented on the basis of this algorithm, and the results of clustering in the lower layer are integrated with a cooperative learning algorithm in the upper layer.

2. Clustering

2.1. Self-Organizing Maps

The SOM, introduced by Kohonen [6, 8, 9], is among the best-known unsupervised-learning neural networks. It belongs to the class of vector-coding algorithms. The SOM uses a learning algorithm to define a mapping from the input data space onto an output layer. The SOM model used here is defined as follows.

Let x_i ∈ Ω (i = 1, 2, . . . , d) be an n-dimensional input feature vector and Ω be the input space. The output layer of an SOM consists of a two-dimensional array of nodes; the number of nodes is k. An n-dimensional parametric reference vector m_j (j = 1, 2, . . . , k) is associated with every node j of the output layer of the SOM in the lattice. The array is defined as a rectangular lattice, so o × p = k. Let O ∈ {1, . . . , o} and P ∈ {1, . . . , p} be the coordinates of the rectangular output layer, and let m_j^O and m_j^P be the lattice coordinates of the reference vector m_j. In the following description, the distance is the Euclidean distance, but generalization to other distance functions would be a straightforward matter.

Let i be the index of the input feature vector x_i. Assume that m_b is the reference vector closest to x_i, such that
$$\|x_i - m_b\| = \min_j \|x_i - m_j\|, \qquad (1)$$
where ‖x_i − m_j‖ is the Euclidean distance between x_i and m_j. The node b that has the reference vector m_b is then called the best-matching node for input feature vector x_i, and b = b(i) can be considered to be the output of the SOM. Note that for a fixed x_i, Eq. (1) defines the index b of the best-matching node; for a fixed b, it defines the Voronoi compartment of b as the set of points that satisfy (1). The above relation maps the input space Ω to a discrete set of nodes, that is, the output layer of the SOM in this paper.

The following topological mapping is then defined: if an arbitrary point x_i ∈ Ω is mapped to node j, then all points in the neighborhood of x_i are mapped either to j itself or to one of the nodes in the neighborhood of j in the lattice. This implies that if j and j′ are two neighboring nodes, then their Voronoi compartments have a common boundary.

The basic SOM learning algorithm is as follows:

(i) Choose random initial values for all reference vectors.


(ii) Repeat steps (iii), (iv), and (v) for discrete time steps t = 0, 1, . . . , T.

(iii) For each input feature vector x_i, perform steps (iv) and (v).

(iv) Find the best-matching node b according to Eq. (1).

(v) For each node of the output layer, adjust the reference vectors of all the nodes by applying
$$m_j(t+1) = m_j(t) + \alpha(t) \times N_{b,j} \times (x_i - m_j(t)), \qquad (2)$$
where α(t) is a gain factor and N_{b,j}(t) is a neighborhood function (usually a function of the distance between nodes b and j as measured along the lattice).

We used a neighborhood function that takes the value 1 within a certain proximity of b and takes 0 elsewhere:
$$N_{b,j} = \begin{cases} 1, & \mathrm{dis}(m_b, m_j) \le r(t), \\ 0, & \text{otherwise}, \end{cases} \qquad (3)$$
$$\mathrm{dis}(m_b, m_j) = \max\{|m^O_b - m^O_j|,\ |m^P_b - m^P_j|\}. \qquad (4)$$
The neighborhood radius r(t) and the gain α(t) should slowly become smaller over time. Equation (2) is derived by adopting an energy function and a steepest-descent minimization method (see the section below on the behavior of the first-layer SOM).
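To make the learning rule concrete, the following is a minimal sketch (ours, not the authors' implementation) of steps (i)–(v) with the rectangular neighborhood of Eqs. (3) and (4). The linear decay of α(t) and r(t) anticipates Eqs. (47)–(49); the function name train_som, the initialization range, and the decay schedules are illustrative choices.

```python
import numpy as np

def train_som(X, o, p, T, alpha0=0.5, r0=None, seed=0):
    """X: (d, n) array of input feature vectors. Returns reference vectors of shape (o, p, n)."""
    rng = np.random.default_rng(seed)
    d, n = X.shape
    r0 = r0 if r0 is not None else max(o, p) // 2
    m = rng.uniform(X.min(), X.max(), size=(o, p, n))      # step (i): random initial values
    rows, cols = np.indices((o, p))
    for t in range(T):                                      # step (ii)
        alpha = alpha0 * (1 - t / T)                        # gain factor, shrinking over time
        r = r0 * (1 - t / T)                                # neighborhood radius, shrinking over time
        for x in X:                                         # step (iii)
            d_map = np.linalg.norm(m - x, axis=2)           # step (iv): Euclidean distance to every node
            bo, bp = np.unravel_index(np.argmin(d_map), d_map.shape)
            # Eqs. (3)-(4): 1 within Chebyshev radius r of the best-matching node, 0 elsewhere
            N = (np.maximum(np.abs(rows - bo), np.abs(cols - bp)) <= r).astype(float)
            m += alpha * N[..., None] * (x - m)             # step (v): Eq. (2)
    return m
```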

2.2. Using an SOM for Clustering

We now describe the clustering algorithm that enables an SOM to handle clustering tasks without requiring human judgment. The input data is classified into clusters by the clustering algorithm in the following way:

(i) Use the above learning algorithm to place sets of input feature vectors on the output layer of the SOM, which is a two-dimensional space. Input feature vectors with similar feature values are then mapped onto nodes of the output layer that are close to one another, so that those vectors that have strongly similar feature values are mapped onto the same region of the output layer.

(ii) Use an image-processing method to identify clusters on the output layer and extract the cluster information in the following way.

(a) Calculate distances between the reference vectors of neighboring nodes of the output layer. Using the Euclidean distance between the reference vectors, let O ∈ {1, . . . , o} and P ∈ {1, . . . , p} be the coordinates of the output layer and, for all O and P, calculate the distance dm_{O,P} according to
$$dm_{O,P} = \|m_{O,P} - m_{O-1,P}\| + \|m_{O,P} - m_{O,P-1}\|. \qquad (5)$$

(b) Make a two-dimensional space measuring o × p that consists of nodes that have the location value dm_{O,P}. Let this space be the distance map. Each node in the distance map corresponds to a node in the output layer. When the reference vectors of neighboring nodes in the output layer of the SOM have almost the same values, the corresponding nodes in the distance map have a low dm_{O,P}. When reference vectors have a large difference in values, the nodes of the distance map that correspond to these nodes have a high dm_{O,P}. Thus, nodes of the distance map that have large dm_{O,P} values correspond to the borders of cluster regions, so these borders appear on the distance map.

(c) To make a clear border between the cluster regions, make a further two-dimensional space by filtering the distance map. Let this space be the clustering map. Let θ be a threshold. Calculate the value dm*_{O,P} for all nodes of the distance map according to
$$dm^*_{O,P} = \begin{cases} 1, & \text{if } dm_{O,P} \ge \theta, \\ 0, & \text{if } dm_{O,P} < \theta. \end{cases} \qquad (6)$$
Create a clustering map which consists of nodes that have the values dm*_{O,P}. The value 1 corresponds to a border.

(d) Close the borders to extract the cluster information (the clustering map created in step (c) does not have complete borders; thus, those parts where the borders are incomplete must be filled in to obtain borders that are made up of contiguous sets of nodes).

(e) Extract the cluster information. Identify feature vectors that belong to particular regions of the clustering map as information that belongs to the corresponding clusters. (A code sketch of steps (a)–(c) follows this list.)
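A minimal sketch (ours) of steps (a)–(c): build the distance map of Eq. (5) from the trained reference vectors and threshold it into the clustering map of Eq. (6). Border closing (step (d)) and region extraction (step (e)) are omitted; any standard morphological closing and connected-component labelling could be used there.

```python
import numpy as np

def distance_map(m):
    """m: (o, p, n) trained reference vectors. Returns dm of shape (o, p) per Eq. (5);
    the first row and column only receive the terms that exist for them."""
    dm = np.zeros(m.shape[:2])
    dm[1:, :] += np.linalg.norm(m[1:, :] - m[:-1, :], axis=2)   # ||m_{O,P} - m_{O-1,P}||
    dm[:, 1:] += np.linalg.norm(m[:, 1:] - m[:, :-1], axis=2)   # ||m_{O,P} - m_{O,P-1}||
    return dm

def clustering_map(dm, theta):
    """Eq. (6): 1 marks a border node (dm >= theta), 0 a cluster interior."""
    return (dm >= theta).astype(int)
```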


Figure 1. Hierarchical SOM.

2.3. Using a Hierarchical SOM for Clustering

Next, we describe a fast method of clustering by a hierarchical SOM, which is defined as shown in Fig. 1. For a large quantity of input data, the convergence of a small-scale SOM is unlikely to be effective in most cases, but applying a large-scale SOM greatly lengthens the learning times for the SOM. This is a serious problem. Our method of clustering avoids this by using a two-layered hierarchical SOM. The SOM in the first layer divides the input data into rough groups, while the SOM in the second layer classifies each group into finer clusters. The approximate classification by the first-layer SOM allows us to reduce the size and learning times of the system. Having less data input to the second-layer SOMs makes highly reliable classification with a small-scale SOM and short learning times possible. The clustering algorithm applied for the hierarchical SOM is as follows (a code sketch of this data flow is given after the list):

(i) Provide input feature vectors to the first-layer SOM.

(ii) Apply the basic SOM learning algorithm to the first-layer SOM using a special neighborhood function N_{b,j}(t). The output of the first-layer SOM is the resulting assignment of the input data to fixed positions on the layer (see the next section for more details on the behavior of the first-layer SOM).

(iii) For each fixed position in the first-layer SOM, extract the unique set of data that will be input to the corresponding second-layer SOM.

(iv) Apply the learning algorithm of the basic SOM to all second-layer SOMs.

(v) For each second-layer SOM, create a distance map and a clustering map; then use the clustering algorithm to extract the cluster information.
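To make the data flow of steps (i)–(v) concrete, the following is a minimal sketch (ours, not the authors' implementation). It assumes a first-layer SOM m_first that has already been trained so that its input data gather at the fixed positions, and it reuses the train_som, distance_map, and clustering_map sketches given earlier; all names are illustrative.

```python
import numpy as np

def hierarchical_clustering(X, m_first, fixed_positions, o2, p2, T2, theta):
    """X: (d, n) inputs; m_first: (o, p, n) trained first-layer reference vectors;
    fixed_positions: list of (row, col) lattice coordinates of the fixed nodes."""
    # Steps (i)-(iii): route every input to the fixed position whose reference vector is nearest.
    groups = {f: [] for f in fixed_positions}
    for x in X:
        dists = [np.linalg.norm(x - m_first[r, c]) for (r, c) in fixed_positions]
        groups[fixed_positions[int(np.argmin(dists))]].append(x)
    # Steps (iv)-(v): cluster each group with its own small second-layer SOM.
    results = {}
    for f, xs in groups.items():
        if xs:
            m2 = train_som(np.array(xs), o2, p2, T2)
            results[f] = clustering_map(distance_map(m2), theta)
    return results
```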

3. Analysis and Improvement of the Hierarchical SOM

3.1. Behavior of the First-Layer SOM

The first-layer SOM in the clustering method using a hierarchical SOM roughly distributes input data at fixed positions (that depend on the neighborhood function) on the SOM's output layer. This approximate classification allows us to reduce the size and learning time of the SOM. The learning of the first-layer SOM uses a slightly modified version of the basic SOM learning algorithm. We analytically proved the dividing principle and obtained the fixed positions, which are the best-matching nodes decided uniquely in the learning process [10]. For the sake of simplicity, we analyzed the case for an SOM that has a one-dimensional output layer (shown below). This analysis was applied to a two-dimensional output layer with a rectangular neighborhood function, as demonstrated empirically later based on experimental results.

3.1.1. Definition of the First-Layer SOM. The first-layer SOM was defined in simplified form in Section 2. Some of its distinctive functions and variables are now defined for the analysis of the behavior of the first-layer SOM. Let x_i ∈ Ω (i = 1, 2, . . . , d) be an n-dimensional input feature vector and i be the index of the input data. The output layer of an SOM consists of nodes, and an n-dimensional parametric reference vector m_j (j = 1, 2, . . . , k) is associated with every node j. m_b is the reference vector closest to x_i in the sense that it satisfies Eq. (1), where b = b(i) is the index of the closest reference vector. The following neighborhood function is defined for the first-layer SOM:
$$N_{b,j} = \begin{cases} 1, & \mathrm{dis}(m_b, m_j) \le r, \\ 0, & \text{otherwise}, \end{cases} \qquad (7)$$
$$\mathrm{dis}(m_b, m_j) = |b - j|, \qquad (8)$$
where r is a constant.

The learning algorithm of the SOM is related to an energy function [11–13]. Let V_b denote the set in the input space that Eq. (1) defines for node b (its Voronoi compartment), and let p(x) denote the probability density of the input x. Define the energy function as
$$E(m) = \sum_q \int_{V_q} \sum_j N_{q,j}\, \|x - m_j\|^2\, p(x)\, dx, \qquad (9)$$

where m = {m_1, m_2, . . . , m_k}. Equation (9) is piecewise differentiable. Let us write it in an equivalent form:
$$E = E(m) = \int \sum_j N_{b,j}\, \|x - m_j\|^2\, p(x)\, dx. \qquad (10)$$
Here, b is defined appropriately as the index of the best-matching node. This moves the discontinuity at the boundaries of the V_q into the function
$$b = b(x, m_1, \ldots, m_k). \qquad (11)$$

The usual way to minimize a function such as E, when the density p(x) is unknown, is to resort to sample functions: for each x_i define
$$E_i(x_i, m) = \sum_j N_{b,j}\, \|x_i - m_j\|^2. \qquad (12)$$
Since ∂E_i/∂m_j = −2 N_{b,j}(x_i − m_j), the steepest-descent minimization of E_i leads directly to the usual SOM learning rule:
$$m_j(t+1) = m_j(t) - \alpha'(t)\, \frac{\partial E_i}{\partial m_j} = m_j(t) + \alpha(t) \times N_{b,j} \times (x_i - m_j(t)), \qquad (13)$$
which agrees with Eq. (2) when the factor 2 from the gradient is absorbed into α(t).

3.1.2. Definition of Convergence and Stability. The learning algorithm of the SOM uses steepest-descent minimization to reduce the energy function (12). The learning process is said to have reached convergence when the energy function reaches the global minimum or a local minimum. If a unique set of best-matching nodes is not determined for some set of input feature vectors, some reference vectors of the output layer of an SOM will not converge to unique values. Unique determination of the best-matching nodes is therefore a necessary condition for convergence, and the criterion used in determining convergence is the stability of the best-matching nodes.

Definition (Stability of the best-matching nodes). The stability of the best-matching nodes is defined as follows:
$$\text{The best-matching nodes are stable} \iff \forall i,\ \exists!\, b \ \text{such that}\ \|x_i - m_b(t)\| = \min_j \|x_i - m_j(t)\|. \qquad (14)$$

We now define the convergence of an SOM. The renewal equation of the reference vectors (13) makes the energy function (12) approach a minimum at the best-matching nodes chosen by Eq. (1). Therefore, an SOM converges to a minimum (either the global one or a local one) according to the stability of the best-matching nodes during learning.

Definition (Convergence). An SOM converges if and only if the best-matching nodes are uniquely determined after some discrete time t. Namely, when an SOM converges, the best-matching nodes are the nodes that are stable for discrete times t, t + 1, t + 2, . . . .

3.1.3. Minimization of E. The minimum of the energy function is derived under the following conditions. Assume that the best-matching nodes are determined uniquely for all x_i as b(i), where b(i1) = b(i2) is allowed for distinct i1 and i2. Let B_{i,r} = {j : |b(i) − j| ≤ r} be the set of nodes within the neighborhood radius r of b(i). According to the learning algorithm, when x_i is presented, Eq. (2) is applied with N_{b,j} = 1 to all reference vectors m_j with j ∈ B_{i,r}. Therefore, every m_j with j ∈ B_{i,r} tends toward x_i. In fact, the minimization of the energy function can be proved analytically. Rewriting the energy function (Eq. (10)) gives us
$$E(x, m) = \sum_i \sum_{j \in B_{i,r}} \|x_i - m_j\|^2. \qquad (15)$$

Applying I_j = {i | j ∈ B_{i,r}}, this energy function is
$$E(x, m) = \sum_j \sum_{i \in I_j} \|x_i - m_j\|^2 = \sum_j E_{m_j}. \qquad (16)$$

Therefore, the minimization of (16) reduces to the minimization of each E_{m_j}. The function E_{m_j} is expanded in the following way:
$$E_{m_j} = \sum_{i \in I_j} \|x_i - m_j\|^2 = \|x_{i_1} - m_j\|^2 + \|x_{i_2} - m_j\|^2 + \cdots + \|x_{i_v} - m_j\|^2, \qquad (17)$$

where I_j = {i_1, i_2, . . . , i_v}. The condition for the minimization of E_{m_j} is
$$\frac{\partial E_{m_j}}{\partial m_j} = 2(m_j - x_{i_1}) + 2(m_j - x_{i_2}) + \cdots + 2(m_j - x_{i_v}) = 0. \qquad (18)$$


Therefore, the only solution that satisfies the above equation is
$$m_j = (x_{i_1} + x_{i_2} + \cdots + x_{i_v})/v. \qquad (19)$$

3.1.4. Convergence of the First-Layer SOM at the Fixed Positions. We now show that the best-matching nodes are decided uniquely, because of stability, under the condition that the best-matching nodes are the fixed positions at which the first-layer SOM has distributed the input data. The following theorem shows how the best-matching nodes are uniquely determined.

Theorem (Uniqueness of the best-matching nodes). Assume that the number of nodes is k = s + r(s − 1), that the numbers of input feature vectors and best-matching nodes satisfy s ≤ d, and that A is a positive value such that ‖x_{i1} − x_{i2}‖ = A for any i1, i2. Then the necessary and sufficient condition for stability is as follows: for every a ∈ {0, 1, . . . , s − 1} there exists an i such that
$$b(i) = a(1 + r) + 1, \quad \text{and for all } x_i, \quad b(i) \in \{1,\ (1 + r) + 1,\ \ldots,\ a(1 + r) + 1,\ \ldots,\ k\}. \qquad (20)$$

Therefore, the first-layer SOM converges under condition (20), which shows the positions of the best-matching nodes. In actual fact, convergence of the SOM was confirmed by the experimental results, and the best-matching nodes in condition (20) were chosen as the best-matching nodes in the experiment. Furthermore, the SOM converges to some extent even when the condition that ‖x_{i1} − x_{i2}‖ = A for any i1, i2 does not hold (see the section on the experimental results).
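A small sketch (ours) that enumerates the node indices named in condition (20): with k = s + r(s − 1) nodes and a constant neighborhood radius r, the stable best-matching nodes are λ(a) = a(1 + r) + 1 for a = 0, 1, . . . , s − 1 (1-based indexing), the last of which is always node k.

```python
def fixed_positions_1d(s, r):
    """Fixed positions of condition (20) for a one-dimensional output layer (1-based indices)."""
    k = s + r * (s - 1)                          # number of nodes in the output layer
    return k, [a * (1 + r) + 1 for a in range(s)]

# Example: s = 3, r = 4 gives k = 11 nodes with fixed positions 1, 6, 11,
# matching the row (s = 3, r = 4, o = 11) of Table 1.
print(fixed_positions_1d(3, 4))                  # -> (11, [1, 6, 11])
```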

The proof of the above theorem is given in the following passages. Assume that the indices of the best-matching nodes are derived according to condition (20), and let λ(a) = a(1 + r) + 1. For all a, where
$$\lambda(a - 1) + r < \lambda(a) < \lambda(a + 1) - r, \qquad (21)$$
$$I_{\lambda(a)} = \{i \mid b(i) = \lambda(a)\}. \qquad (22)$$
For all j such that, for some a, λ(a − 1) < j < λ(a),
$$I_j = \{i \mid b(i) = \lambda(a - 1) \ \text{or}\ b(i) = \lambda(a)\}. \qquad (23)$$
In the same way, for all j such that, for some a, λ(a) < j < λ(a + 1),
$$I_j = \{i \mid b(i) = \lambda(a) \ \text{or}\ b(i) = \lambda(a + 1)\}. \qquad (24)$$
Therefore, letting NUM_j = |I_j| denote the number of input vectors i with j ∈ B_{i,r}, for any a and j such that λ(a) − r ≤ j ≤ λ(a) + r and j ≠ λ(a),
$$\mathrm{NUM}_{\lambda(a)} < \mathrm{NUM}_j. \qquad (25)$$

Hence, these b(i) are stable. On the other hand, if we assume that the best-matching nodes are stable, we now give two propositions to prove the remaining part of the theorem.

Proposition 1. If the best-matching nodes are stable and A is a positive value such that ‖x_{i1} − x_{i2}‖ = A for any i1, i2, then for any index of feature vectors i ∈ {1, 2, . . . , d},
$$\mathrm{NUM}_{b(i)} < \mathrm{NUM}_j \quad (j \in B_{i,r},\ j \ne b(i)). \qquad (26)$$

Let NUM_{b(i)} = n_i and NUM_j = n_j. Then, according to Eq. (19),
$$m_{b(i)} = \sum_{v \in I_{b(i)}} x_v / n_i, \qquad (27)$$
$$m_j = \sum_{v \in I_j} x_v / n_j. \qquad (28)$$
Because ‖x_{i1} − x_{i2}‖ = A for any i1, i2, for the opposite condition of (26),
$$n_i = n_j \Rightarrow \|x_i - m_{b(i)}\| = \|x_i - m_j\|, \qquad (29)$$
$$n_i > n_j \Rightarrow \|x_i - m_{b(i)}\| > \|x_i - m_j\|. \qquad (30)$$
Thus these conditions contradict our stability assumption. For condition (26),
$$n_i < n_j \Rightarrow \|x_i - m_{b(i)}\| < \|x_i - m_j\|. \qquad (31)$$
Hence, for any i, NUM_{b(i)} < NUM_j is the only condition for stability.

Proposition 2. If the best-matching nodes are stable, then for any index i of the input feature vectors, the neighboring best-matching nodes b(i1) and b(i2) of b(i) satisfy
$$b(i_1) + r = b(i) - 1, \qquad (32)$$
$$b(i_2) - r = b(i) + 1. \qquad (33)$$


If Proposition 1 holds, the following is true: for any i, NUM_{b(i)} < NUM_j (j ∈ B_{i,r}, j ≠ b(i)). Then, when conditions (32) and (33) are not true, the following logic holds.

(a) If b(i_1) + r ≠ b(i) − 1, then
$$\text{if } b(i_2) - r \ne b(i), \ \text{then } \mathrm{NUM}_{b(i)} = \mathrm{NUM}_{b(i)-1}, \qquad (34)$$
$$\text{if } b(i_2) - r = b(i), \ \text{then } \mathrm{NUM}_{b(i)} > \mathrm{NUM}_{b(i)-1}. \qquad (35)$$

(b) If b(i_2) − r ≠ b(i) + 1, then
$$\text{if } b(i_1) + r \ne b(i), \ \text{then } \mathrm{NUM}_{b(i)} = \mathrm{NUM}_{b(i)+1}, \qquad (36)$$
$$\text{if } b(i_1) + r = b(i), \ \text{then } \mathrm{NUM}_{b(i)} > \mathrm{NUM}_{b(i)+1}. \qquad (37)$$

Because (a) and (b) contradict the assumed stability of the best-matching nodes, these best-matching nodes cannot be stable. Under conditions (32) and (33), b(i) does not belong to B_{i_1,r} or B_{i_2,r}, and
$$b(i_1) + 1, \ldots, b(i) - 1 \in B_{i,r} \cap B_{i_1,r}, \qquad (38)$$
$$b(i) + 1, \ldots, b(i_2) - 1 \in B_{i,r} \cap B_{i_2,r}. \qquad (39)$$
Thus, NUM_{b(i)} < NUM_j (j ∈ B_{i,r}, j ≠ b(i)). Therefore, conditions (32) and (33) are the only conditions for stability.

The proof of the remaining part of the theorem is as follows. Let the best-matching nodes satisfy the following: for some i, there exists i′ such that b(i′) + r ≥ b(i) > b(i′). In other words, this condition is the opposite of condition (20). On the assumption that the best-matching nodes are stable, Proposition 2 states that i1, i2 exist such that
$$b(i_1) + r = b(i) - 1, \qquad (40)$$
$$b(i_2) - r = b(i) + 1. \qquad (41)$$
Then,
$$b(i_1) < b(i') < b(i), \qquad (42)$$
$$b(i_1) > b(i') - r, \qquad (43)$$
$$b(i) < b(i') + r. \qquad (44)$$
Therefore b(i_1), b(i'), and b(i) belong to B_{i',r}. And, because of the assumption that the best-matching nodes are stable, b(i') and b(i) belong to B_{i,r}. From Proposition 1, we derive the following equations:
$$\mathrm{NUM}_{b(i')} < \mathrm{NUM}_{b(i)}, \qquad (45)$$
$$\mathrm{NUM}_{b(i')} > \mathrm{NUM}_{b(i)}. \qquad (46)$$
These equations contradict each other, so the best-matching nodes are not stable. Hence, condition (20) is the only condition that satisfies stability.

3.1.5. Two-Dimensional Output Layer. We have experimentally confirmed that the best-matching nodes for the case of a two-dimensional output layer are located at fixed positions under the following conditions. That is, the first-layer SOM distributes the input data to fixed positions.

Let s be the number of best-matching nodes in the case of a two-dimensional output layer. The dimensions of the layer then satisfy o = p = √s + r(√s − 1). The neighborhood function and the gain factor were defined in Eqs. (3) and (4). Also,
$$\alpha(t) = \alpha_0 \times (1 - t/T). \qquad (47)$$
Also let r(t) be a constant value r. These equations define the shape of the neighborhood function as rectangular and the size of the rectangle as being the same at all discrete time steps t. The gain factor α(t) was reduced according to the discrete time t.

Figure 2 shows the output layer of the first-layer SOM and the fixed positions (gray nodes). In this case, the number s is 9, the constant r is 3, and the dimensions of the output layer o and p are both 9.

Figure 2. Fixed position for a two-dimensional output layer.
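A sketch (ours) of the corresponding fixed positions for a two-dimensional output layer, read off from the one-dimensional result and Fig. 2: with o = p = √s + r(√s − 1), the fixed nodes form a regular sub-grid with spacing r + 1 along each axis.

```python
import math

def fixed_positions_2d(s, r):
    """Fixed positions (0-based lattice coordinates) for a square two-dimensional output layer."""
    side = math.isqrt(s)                         # sqrt(s) fixed nodes per row and per column
    o = side + r * (side - 1)                    # the layer is o x o
    return o, [(a * (1 + r), b * (1 + r)) for a in range(side) for b in range(side)]

# Fig. 2's example: s = 9, r = 3 gives a 9 x 9 layer with fixed nodes at
# coordinates 0, 4 and 8 along each axis.
print(fixed_positions_2d(9, 3))
```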


3.2. A Cooperative Learning Algorithm for the Hierarchical SOM

The method of using a hierarchical SOM for clustering succeeds in reducing learning times and achieving accuracy in classification. However, the results of learning by the individual second-layer SOMs are not related to one another, and when mis-clustering caused by mis-learning or overlearning occurs, the method of clustering lacks the means to correct this. Therefore, some similar input vectors will probably end up being assigned to different second-layer SOMs.

For this reason, we improved the learning algorithm of the hierarchical SOM to make it able to obtain highly reliable clustering. Specifically, learning by each of the second-layer SOMs is advanced simultaneously, and the domain of definition of the neighborhood function is expanded, so that learning by each of the second-layer SOMs affects learning by the other SOMs. Learning thus takes place in a cooperative way.

The cooperative learning algorithm for the hierarchical SOM is as follows:

(i) Provide input feature vectors to the first-layer SOM.

(ii) Apply the learning algorithm of the first-layer SOM. In this case, let the number of fixed positions be s.

(iii) Create the second-layer SOMs and choose initial values from the first-layer SOM for all of the reference vectors of the second-layer SOMs. In this case, the second-layer SOMs are placed in order at the fixed positions, such that their output layers seem to form a single output layer. Next, each fixed position is made equivalent to a point in a second-layer SOM. The other nodes of the second-layer SOMs are made to correspond to the nodes of the first-layer SOM so that the positions of the nodes of the first-layer SOM are held. The next step is to use the values from the corresponding nodes of the first-layer SOM to set initial values for the reference vectors of the second-layer SOMs.

(iv) Extract a unique set of data that will be input to each second-layer SOM from each of the fixed positions on the first-layer SOM.

(v) Repeat steps (vi), (vii), and (viii) at discrete time steps t = 0, 1, . . . , T.

(vi) For each input feature vector x^f_i of each second-layer SOM f = 1, . . . , s, perform steps (vii) and (viii).

(vii) Find the best-matching node b according to Eq. (1), where the range of the search for a best-matching node is taken to be the output layer of the second-layer SOM to which the input feature vector x^f_i belongs.

(viii) Adjust the reference vectors of all of the nodes in the output layer, node by node. This process is based on Eq. (2), with the domain of the neighborhood function being the output layers of all of the second-layer SOMs. More specifically, when the best-matching node of a second-layer SOM is close to the edge of its output layer, the output layers of the adjoining second-layer SOMs are included within the domain of the node's neighborhood function. As a consequence, learning by each of the second-layer SOMs affects the other SOMs, and learning proceeds in a cooperative manner.

Equations (3) and (4) provide the neighborhood function, and r(t) should slowly decrease over time.

Figure 3 shows the relationship between the first-layer SOM and the second-layer SOMs. The lower part of the figure shows the first-layer SOM; the gray nodes are the fixed positions, which were explained earlier in this section. The arrows express the transfer of feature vectors from the first-layer SOM to the second-layer SOMs. The upper part of the figure shows the four second-layer SOMs, which are arranged in a lattice. The gray areas of each second-layer SOM indicate the respective ranges of searching in step (vii) of the cooperative learning algorithm. On the other hand, the domain of the neighborhood function in step (viii) includes all of the second-layer SOMs. (A sketch of one cooperative learning step in code follows Fig. 3.)

Figure 3. Relationship between the first-layer SOM and the second-layer SOMs.
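The following is a minimal sketch (our reading of steps (vii) and (viii), not the authors' code) of one cooperative learning step: the second-layer SOMs are stored as tiles of a single global lattice, the best-matching search is restricted to the tile that owns the input vector, and the neighborhood update of Eq. (2) is applied over the whole lattice, so it can spill into adjoining SOMs. The tiling bookkeeping (tiles, cooperative_step) is our own.

```python
import numpy as np

def cooperative_step(M, tiles, f, x, alpha, r):
    """M: (H, W, n) reference vectors of all second-layer SOMs tiled into one lattice.
    tiles[f] = (r0, r1, c0, c1): the slice occupied by second-layer SOM f."""
    r0, r1, c0, c1 = tiles[f]
    sub = M[r0:r1, c0:c1]
    d = np.linalg.norm(sub - x, axis=2)
    bo, bp = np.unravel_index(np.argmin(d), d.shape)         # step (vii): search only inside SOM f
    bo, bp = bo + r0, bp + c0                                # global coordinates of node b
    rows, cols = np.indices(M.shape[:2])
    N = (np.maximum(np.abs(rows - bo), np.abs(cols - bp)) <= r).astype(float)
    M += alpha * N[..., None] * (x - M)                      # step (viii): update may cross SOM borders
    return M
```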


4. Experimental Results

We have carried out experiments to confirm the behavior of these algorithms and evaluated their effectiveness in the clustering of images of simple objects. All of the images we used in the experiments were 50 × 50 pixels. Images of the same size were used so that the scale of the images did not affect clustering. The input feature vector was created for each image by taking, at a set of 36 angles (every 10°), the sampling distance within the image from the edge of the object to its center of gravity. Feature vectors extracted from images are usually in terms of several image attributes: shape, color, brightness, orientation, and so on. However, because these were basic experiments, we only used the object's shape in extracting feature vectors.
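As an illustration of the kind of feature vector described above, here is a rough sketch (ours, under the stated assumptions) that measures, for a binary object mask, the distance from the object's center of gravity to its edge along 36 directions; the function name radial_feature and the 0.5-pixel marching step are illustrative choices.

```python
import numpy as np

def radial_feature(mask, n_angles=36):
    """mask: (50, 50) boolean array, True inside the object. Returns (n_angles,) distances
    from the center of gravity to the object edge, one per 10-degree step."""
    ys, xs = np.nonzero(mask)
    cy, cx = ys.mean(), xs.mean()                        # center of gravity of the object
    feats = np.zeros(n_angles)
    for k in range(n_angles):
        a = 2 * np.pi * k / n_angles
        d = 0.0
        while True:                                      # march outward until leaving the object
            y, x = cy + d * np.sin(a), cx + d * np.cos(a)
            iy, ix = int(round(y)), int(round(x))
            if not (0 <= iy < mask.shape[0] and 0 <= ix < mask.shape[1]) or not mask[iy, ix]:
                break
            d += 0.5
        feats[k] = d
    return feats
```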

Five experiments were carried out. Experiments 1 and 2 tested the basic performance and adaptability of the SOM-based system. Experiment 1 tested whether or not objects were correctly located in the output layer of the SOM, and whether or not an appropriate distance map was created. Experiment 2 tested whether or not an appropriate clustering map was created. Experiment 3 tested the behavior of the first-layer SOM. Experiment 4 tested the speed of the clustering method using the hierarchical SOM in classifying the input data. Experiment 5 tested whether or not the cooperative learning algorithm of the hierarchical SOM was effective in quickly classifying the input data.

4.1. Experiment 1: Creating a Distance Map

To check whether or not the distance map correctly reflected the degrees of similarity and difference among the input images, we used sets of shapes in which there were slight differences from member to member. We used a set of 36 sampling distances (with 10° angular steps) as the feature vector obtained from each object. The output layer of the SOM consisted of 100 × 100 nodes, and the initial value of the reference vector for each node of the output layer was randomly selected from the range between 20 and 30. The learning number T (the number of learning iterations) was 300. The neighborhood function and the gain factor were as defined in (3) and (4), and
$$r(t) = r_0 \times (1 - t/T), \qquad (48)$$
$$\alpha(t) = \alpha_0 \times (1 - t/T), \qquad (49)$$
where r_0 = 50 and α_0 = 0.5.

Figure 4 shows the experimental results. The left-hand part of Fig. 4 is the input object data, the middle part is the distance map produced by learning, and the right-hand part shows the distance map with the objects in the regions to which they correspond. The shaded lines indicate the borders between objects or clusters. Neighboring objects tended to be similar in shape.

4.2. Experiment 2: Creating a Clustering Map

In creating a clustering map, we used objects with three completely different shapes. The input feature vectors and parameters of the SOM were the same as in Experiment 1. Figure 5 shows the experimental results. The left-hand part of Fig. 5 is the input object data, the middle part is the clustering map produced by learning, and the right-hand part is the clustering map with the objects in the regions to which they correspond. Three distinct clusters were obtained.

4.3. Experiment 3: Behavior of the First-Layer SOM

To test the behavior of the first-layer SOM, we carried out experiments under a variety of conditions, as explained below. We confirmed that the state of the first-layer SOM converged, that it provided input data at fixed positions, and that these fixed positions were the best-matching nodes defined in the above theorem. In this section, we show experimental results for the cases of both one- and two-dimensional output layers.

Let s be the number of best-matching nodes in the case of the two-dimensional output layer. The number of coordinates thus satisfies o = p = √s + r(√s − 1). The neighborhood function and the gain factor were defined in Eqs. (3), (4), and (47). Also, let r(t) be a constant value r. These equations defined the neighborhood function as rectangular and the size of the rectangle as being the same at any discrete time step t. The gain factor α was reduced according to the discrete time step t. The process of learning was finished when the best-matching nodes had not changed in the last 10 steps of t, assuming that the last t was T′. The error of the best-matching nodes was taken as the indicator of whether or not the best-matching nodes had been determined, and this error was defined as

$$\mathrm{Err} = \sum_i \|\mathrm{Best}(x_i) - \mathrm{Best}^*(x_i)\| / d, \qquad (50)$$
$$\mathrm{Best}(x_i) = \left(m^O_{b(x_i)},\ m^P_{b(x_i)}\right), \qquad (51)$$
$$\mathrm{Best}^*(x_i) = \left(m^O_{b^*(x_i)},\ m^P_{b^*(x_i)}\right), \qquad (52)$$
where b*(x_i) is the index of the fixed node for x_i according to condition (20), and b(x_i) is the index of the best-matching node according to the experimental result. When p = 1, Eqs. (50) to (52) also hold for the one-dimensional output layer. (A small code sketch of this error measure is given after the list below.)

Figure 4. Creating the distance map.

Figure 5. Creating the clustering map.

The experiments were done using two sets of input data:

1. 36 objects, such that ‖x_{i1} − x_{i2}‖ = A for any i1, i2.
2. 46 objects that were made in the same way as those used in Experiment 1.
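For completeness, a small sketch (ours) of the error measure of Eqs. (50)–(52): the mean Euclidean distance, over all d inputs, between each input's observed best-matching node and its fixed node, both given as lattice coordinates.

```python
import numpy as np

def best_match_error(best_observed, best_fixed):
    """Eq. (50): both arguments are (d, 2) arrays of lattice coordinates (O, P), one row per input."""
    best_observed = np.asarray(best_observed, dtype=float)
    best_fixed = np.asarray(best_fixed, dtype=float)
    return np.linalg.norm(best_observed - best_fixed, axis=1).mean()

# Example: one of three inputs lands one node away from its fixed position.
print(best_match_error([[0, 0], [4, 4], [8, 7]], [[0, 0], [4, 4], [8, 8]]))   # -> 0.333...
```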

We set the parameter s at 2, 3, and 4 for the one-dimensional output layer (and at the corresponding values 4, 9, and 16 for the two-dimensional layer), and determined values for r such that the parameters o and p were about 10, 30, 50, and 100.

Tables 1 and 2 show the experimental results for the one-dimensional output layer, and Tables 3 and 4 show those for the two-dimensional output layer. Tables 1 and 3 are for input data set 1, and Tables 2 and 4 are for input data set 2. Overall, the learning number T′ and the error of the best-matching nodes Err were small when there were few best-matching nodes. The convergence of the state of an SOM with a two-dimensional output layer was thus verified. We also showed that if the condition ‖x_{i1} − x_{i2}‖ = A for any i1, i2 was not satisfied, the state of the first-layer SOM still converged, and the best-matching nodes were in the neighborhood of the fixed nodes according to condition (20).

Table 1. Results of learning for a one-dimensional output layer and input data set 1.

s    r    o    T′   Err
2    8    10   14   0.00
2   28    30   14   0.00
2   48    50   14   0.03
2   98   100   14   0.00
3    4    11   15   0.08
3   14    31   15   0.08
3   24    51   15   0.06
3   49   101   23   0.00
4    2    10   15   0.22
4    9    31   35   0.00
4   15    49   36   0.11
4   32   100   36   0.00

4.4. Experiment 4: Clustering by the Hierarchical SOM

We used 46 objects to test the creation of the clustering map by the hierarchical SOM. The parameters of the SOM were the same as in Experiment 1.


Table 2. Results of learning for a one-dimensional output layer and input data set 2.

s    r    o    T′   Err
2    8    10   25   0.17
2   28    30   25   0.15
2   48    50   25   0.15
2   98   100   25   0.15
3    4    11   16   0.24
3   14    31   44   0.83
3   24    51   16   1.59
3   49   101   39   1.00
4    2    10   24   0.40
4    9    31   21   0.41
4   15    49  368   1.43
4   32   100   17   0.98

Table 3. Results of learning for a two-dimensional output layer and input data set 1.

s    r    o    T′   Err
4    8    10   14   0.00
4   28    30   14   0.00
4   48    50   15   0.00
4   98   100   14   0.03
9    4    11   16   0.03
9   14    31   34   1.67
9   24    51   43   0.08
9   49   101   58   0.11
16   2    10   20   0.19
16   9    31  138   1.11
16  15    49  127   1.03
16  32   100  104   0.03

Figure 6 shows the clustering map obtained by the hierarchical SOM. The lower part of Fig. 6 shows the results from the first-layer SOM; that is, the roughly segregated objects. Input objects were positioned within four general groups by the first-layer SOM. The upper part of Fig. 6 shows the results of the second-layer SOMs, where the input objects from each of the four general groups were classified in more detail than by the first-layer SOM.

4.5. Experiment 5: The Cooperative Learning Algorithm

To test the efficacy of the cooperative learning algorithm, we used the same input feature vectors and parameters for the hierarchical SOM as in Experiment 4.

Table 4. Results of learning for a two-dimensional output layer and input data set 2.

s    r    o    T′   Err
4    8    10   15   0.07
4   28    30   15   0.07
4   48    50   15   0.07
4   98   100   15   0.07
9    4    11   94   0.55
9   14    31  368   0.60
9   24    51  368   0.69
9   49   101  368   0.87
16   2    10   41   0.46
16   9    31  108   0.68
16  15    49  368   0.69
16  32   100  368   1.06

Figure 6. Clustering by the hierarchical SOM.

Figure 7 shows a result obtained from the second-layer SOMs with the cooperative learning algorithm. The four second-layer SOMs were arranged in a lattice with no gaps between SOMs. The result for the first-layer SOM was the same as in Experiment 4. The feature vectors near the edges of the SOMs were not as similar in terms of shape as in Experiment 4, but were located in a similar order of shape to that seen in Experiment 4.


Figure 7. Clustering using the cooperative learning algorithm.

This is not surprising when we consider that the computational quantities used in searching for the best-matching node were the same in Experiments 4 and 5, and that the computational quantities used to adjust the reference vectors were almost the same.

5. Discussion

Our experiments demonstrated the basic effectiveness of the SOM and its adaptability to the clustering of objects. In Experiments 1 and 2, feature vectors that had similar values were located close to each other in the output layer of the SOM. The distance map and clustering map were formed correctly, and clusters of nodes that had similar feature values were created. We thus consider the SOM to be an effective means of clustering objects.

Experiment 3 showed that the state of the first-layer SOM converged rapidly, and in most cases provided input data for the second layer in the neighborhood of the fixed positions. For some of the 46 objects, however, it did not. We consider that this is because there are input feature vectors which take values intermediate between those at the fixed positions. In this case, if the neighborhood function is made a little smaller as the learning time progresses, the input data will be arranged in the neighborhood of the fixed positions.

Experiment 4 showed that the first-layer SOM was able to roughly divide the objects up, and that the second-layer SOMs were then able to classify these roughly classified objects in detail. We have also experimentally and theoretically confirmed that a clustering method using the hierarchical SOM is faster than one using a non-hierarchical SOM for the 46 objects used in these experiments. We think that the division of objects into rough groups in the first-layer SOM allows a reduced learning time. Furthermore, because fewer input objects are classified by each of the second-layer SOMs, a shorter learning time was possible for that stage of the classification. We obtained the parameter that expresses the reduction in learning time, calculated the computational quantity on the basis of theory, and also measured it by computer simulation.

If the number of input data items is d, the learning number for the SOM is T, and the size of the SOM output layer is o × o, then the computational quantity with the non-hierarchical SOM is
$$d \times o^2 \left\{ T + \frac{1}{4}\left\{ 1 + \frac{1}{6}(1 + T)(1 + 2T) \right\} \right\}. \qquad (53)$$

On the other hand, if we assume that the input data are divided into s parts by the first-layer SOM and that the size of the output layer of every second-layer SOM is o/√s by o/√s, then the computational quantity with the hierarchical SOM is
$$\frac{d \times o^2 \times T}{\beta}\left\{ 1 + \frac{1}{s} \right\} + \frac{d \times o^2}{s}\left\{ T + \frac{1}{4}\left\{ 1 + \frac{1}{6}(1 + T)(1 + 2T) \right\} \right\}, \qquad (54)$$
where β is a parameter that defines the learning time of the first-layer SOM. The first term of (54) is for the first-layer SOM, and the second term is for the second-layer SOMs.

Figure 8 compares the experimental and theoretical computational quantities for a non-hierarchical SOM and the hierarchical SOM. The ratio of computational quantities (cq) was defined as
$$cq = \frac{\text{computational quantity with the non-hierarchical SOM}}{\text{computational quantity with the hierarchical SOM}}. \qquad (55)$$
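A direct transcription (ours) of Eqs. (53)–(55), useful for reproducing curves like those in Fig. 8; the parameter values in the example are illustrative, not taken from the paper.

```python
def cq_ratio(d, o, T, s, beta):
    """Ratio of Eq. (53) to Eq. (54), i.e., Eq. (55)."""
    inner = T + 0.25 * (1 + (1 + T) * (1 + 2 * T) / 6.0)
    cq_flat = d * o**2 * inner                                 # Eq. (53): non-hierarchical SOM
    cq_hier = (d * o**2 * T / beta) * (1 + 1.0 / s) \
              + (d * o**2 / s) * inner                         # Eq. (54): hierarchical SOM
    return cq_flat / cq_hier                                   # Eq. (55)

# Illustrative parameter values (not taken from the paper):
print(cq_ratio(d=46, o=100, T=300, s=4, beta=10))
```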

Figure 8(a) shows the ratio of computational quantities as a function of the size of the SOMs (the length of one side of the output layer). The computational quantity obtained with the hierarchical SOM was less than that with the non-hierarchical SOM, and this shows that the hierarchical SOM is the more effective method of clustering.

Figure 8(b) shows the ratio of computational quantities as a function of the number of groups into which the data input to the first-layer SOM was divided. As the number of groups rose, the clustering method using the hierarchical SOM became increasingly superior to the method using the non-hierarchical SOM. The hierarchical SOM is clearly the more useful way of classifying a large number of objects.


Figure 8. Ratios of computational quantities.

The most suitable value of β has not been determined. However, if β is greater than or equal to 1, cq is greater than 1. Then, the computational quantity produced by using the hierarchical SOM is lower than the quantity produced by using the non-hierarchical SOM, so the clustering method using the hierarchical SOM is the more effective of the two. In fact, the learning number of the first-layer SOM is less than the learning number of the typical SOM, and the computational quantity of the first-layer SOM is less than that of the typical SOM.

6. Conclusion

We have described a system for clustering, along with its theoretical and experimental evaluation. The system uses self-organizing maps to classify images. For speed, the system is made up of a hierarchical SOM that we developed, along with new learning algorithms.

Our experiments demonstrated the basic effectiveness and adaptability of the SOM in the clustering of objects. The SOM formed appropriate distance and clustering maps, and created clusters of nodes that had feature vectors with similar values. Thus, the SOM appears to be a useful means of clustering objects.

When the hierarchical SOM was used in clustering, the objects were roughly grouped by the first-layer SOM. The objects in each group were then classified in more detail by the second-layer SOMs. We mathematically demonstrated the appropriate behavior of the first-layer SOM of the hierarchical SOM. The convergence of the first-layer SOM, with its fixed positions, was proved for the case of a one-dimensional output layer, and convergence of state was experimentally confirmed for both the one- and two-dimensional output layers. We also showed, experimentally and theoretically, that clustering of the objects used in our experiments is faster with the hierarchical SOM than with a non-hierarchical SOM.

Furthermore, a cooperative learning algorithm was proposed for the hierarchical SOM. This is more effective than the previously proposed hierarchical SOM, because the accuracy of clustering by the second-layer SOMs was improved without any degradation of the speed of clustering.

References

1. A. Gupta and R. Jain, “Visual Information Retrieval,” Comm. ACM, vol. 40, no. 5, 1997.

2. J.R. Smith and S.F. Chang, “VisualSEEK: A Fully Automated Content-Based Image Query System,” in Proc. ACM International Conference on Multimedia, 1997, pp. 87–93.

3. K. Kushima, H. Akama, S. Konya, H. Kimoto, and M. Yamamuro, “ExSight: An Object-Based High Performance Image Retrieval System,” Trans. of IPSJ, vol. 40, no. 2, 1999, pp. 732–741 (in Japanese).

4. K. Curtis, J. Nakagawa, N. Taniguchi, and M. Yamamuro, “Similarity Indexing in High Dimensional Image Space,” IPSJ SIG Notes, 97-MPS-82-18, 1997, pp. 99–104.

5. M.R. Anderberg, Cluster Analysis for Applications, Academic Press, 1973.

6. T. Kohonen, “The Self-Organizing Map,” in Proc. of the IEEE, vol. 78, no. 9, 1990, pp. 1464–1480.

7. P.N. Suganthan, “Hierarchical Overlapped SOM’s for Pattern Classification,” IEEE Trans. on Neural Networks, vol. 10, no. 1, 1999, pp. 193–196.

8. T. Kohonen, Self-Organizing Maps, Berlin: Springer-Verlag, 1995.

9. T. Kohonen, “Self-Organized Formation of Topologically Correct Feature Maps,” Biol. Cybernet., vol. 43, 1982, pp. 59–69.

10. M. Endo, M. Ueno, T. Tanabe, and M. Yamamoto, “Clustering Method for Object Images using a Hierarchical Self-Organizing Maps,” IEEE Trans. on Neural Networks, submitted.

11. H. Ritter and K. Schulten, “Kohonen’s Self-Organizing Maps: Exploring Their Computational Capabilities,” in Proc. IEEE Int. Joint Conf. on Neural Networks, vol. 1, 1988, pp. 109–116.

12. T. Kohonen, “Self-Organizing Maps: Optimization Approaches,” Artificial Neural Networks, vol. 2, 1991, pp. 981–990.

13. J. Lampinen and E. Oja, “Clustering Properties of Hierarchical Self-Organizing Maps,” Journal of Mathematical Imaging and Vision, vol. 2, no. 3, 1992, pp. 261–272.

14. C.M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, 1995.

15. E. Cervera and A.P. del Pobil, “Multiple Self-Organizing Maps: A Hybrid Learning Scheme,” Neurocomputing, vol. 16, no. 4, 1997, pp. 309–318.

16. C. Versino and L.M. Gambardella, “Learning Fine Motion by Using the Hierarchical Extended Kohonen Map,” in Proc. Int. Conf. on Artificial Neural Networks, 1996.

17. S.M. Bhandarkar, J. Koh, and M. Suk, “Multiscale Image Segmentation Using a Hierarchical Self-Organizing Map,” Neurocomput., vol. 14, 1997, pp. 241–272.

18. M. Dittenbach, D. Merkl, and A. Rauber, “The Growing Hierarchical Self-Organizing Map,” in Proc. Int. Joint Conf. on Neural Networks, vol. 6, 2000, pp. 15–19.

19. O.A.S. Carpinteiro, “A Hierarchical Self-Organizing Map Model for Sequence Recognition,” in Proc. Int. Conf. on Artificial Neural Networks, 1998.

20. O.A.S. Carpinteiro, “A Hierarchical Self-Organizing Map Model for Pattern Recognition,” in Proc. of the Brazilian Congress on Artificial Neural Networks, 1997, pp. 484–488.

21. M. Endo, M. Ueno, T. Tanabe, and M. Yamamoto, “Clustering Method Using Self-Organizing Map,” in Proc. of the IEEE Int. Workshop on Neural Networks for Signal Processing X, 2000, pp. 261–270.

Masahiro Endo received the M.E. degree from Toyohashi University of Technology, Japan, in 1998. He is presently a research engineer at Nippon Telegraph and Telephone Corporation in Japan. He has been studying signal processing for high-density and high-speed storage systems. He is a member of the Institute of Electronics, Information and Communication Engineers of Japan.

Masahiro Ueno received the M.E. degree from Chiba University in Japan. He is presently a senior research engineer at Nippon Telegraph and Telephone Corporation in Japan. He has been engaged in research on image coding methods. He is currently developing holographic data storage systems. He is a member of the Institute of Electronics, Information and Communication Engineers of Japan.

Takaya Tanabe received the M.E. degree from Ibaraki University, Japan, in 1979 and the Ph.D. degree from Tokyo Institute of Technology, Japan, in 1996. He is currently a senior research engineer at Nippon Telegraph and Telephone Corporation in Japan. He has been studying signal processing for optical storage systems and mass storage systems.