A supervised training algorithm for self-organizing maps for structures



Pattern Recognition Letters 26 (2005) 1874–1884


A supervised training algorithm for self-organizing maps for structures

Markus Hagenbuchner a,*, Ah Chung Tsoi b

a Faculty of Informatics, University of Wollongong, Wollongong NSW 2522, Australia
b Australian Research Council, GPO Box 2702, Canberra ACT 2601, Australia

* Corresponding author. Fax: +61 2 4221 4218. E-mail address: [email protected] (M. Hagenbuchner).

Available online 24 May 2005

Abstract

Recent developments with self-organizing maps allow their application to graph structured data. This paper proposes a supervised learning technique for self-organizing maps for structured data. The ideas presented in this paper differ from Kohonen's approach in that a rejection term is introduced. This approach is superior because it is more robust to the variation of the number of different classes in a dataset. It is also more flexible because it is able to efficiently process data with missing or incomplete class information, and hence includes the unsupervised version as a special case. We demonstrate the capabilities of the proposed model through an application to a relatively large practical data set from the area of image recognition, viz., logo recognition. It is shown that by adding supervised learning to the learning process the discrimination between pattern classes is enhanced, while the computational complexity is similar to that of the unsupervised version.

© 2005 Elsevier B.V. All rights reserved.

Keywords: Self-organizing maps; Supervised training algorithm; Structured data

1. Introduction

The self-organizing map (SOM) neural network model is a direct outcome of work in the area of "vector quantization" and was first introduced in (Kohonen, 1986). Vector quantization (VQ) and SOMs are popular and well studied methods for quantizing sets of input vectors (Kohonen, 1995).


In practice, the SOM model has become a widely applied mechanism which is used either to detect and visualize properties of high dimensional input data, or as a pre-processing step to filter out essential properties of a data set, and hence reduce the complexity of data processing by reducing its dimensionality. The SOM, known to be a topology-preserving map, was developed to help identify clusters in multidimensional datasets. This is performed by projecting data from the high dimensional input space onto a two-dimensional display plane


whilst preserving the topological relationship between two vectors in the transform mapping operation.1 The result is that data points that were "close" to one another in the original multidimensional input data space are mapped onto nearby areas in the two-dimensional display space. SOMs combine competitive learning with dimensionality reduction by smoothing the clusters with respect to an a priori grid of neurons. The SOM algorithm provides a trade-off between the accuracy of the quantization and the smoothness of the topological mapping.

The two-dimensional display space of a SOM is represented by a regular two-dimensional array of neurons. Every neuron i is associated with an n-dimensional codebook vector $m_i = (m_{i1}, \ldots, m_{in})^T$, where T denotes the transpose of a vector or matrix. The neurons of the map are connected to adjacent neurons by a neighbourhood relation, which defines the relationship among nearby neurons, and hence the converged structure of the map. The most common topologies in use are rectangular and hexagonal (Kohonen, 1995). Adjacent neurons belong to the neighbourhood $N_i$ of the neuron i. Neurons belonging to $N_i$ are updated according to a neighbourhood function f(·). Most often, f(·) is a Gaussian shaped function or a Mexican-hat function (Kohonen, 1995). In the basic SOM algorithm, the topology and the number of neurons remain fixed. The number of neurons determines the granularity of the mapping, which has an effect on the accuracy and generalization capability of the resulting SOM (Kohonen, 1995).

The network is trained by finding the codebook vector which is most similar to an input vector. This codebook vector and its neighbours are then updated so as to render them more similar to the input vector. The result is that, during the training phase, the SOM forms an elastic cover that is shaped by the input data. The learning algorithm controls the cover so that it strives to approximate the probability density of the input data. The reference vectors in the codebook drift to areas where the density of the input data is high. Eventually, only a few codebook vectors lie in areas where the input data is sparse.

Footnote 1: In theory the input data can be mapped onto a q-dimensional map, where $q \in \mathbb{N}^+$. For simplicity, we restrict ourselves to two-dimensional Kohonen maps.
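To make the basic mechanism concrete, the following sketch (ours, not from the paper) illustrates a single training step of a standard SOM in Python/NumPy: the winner is the codebook vector closest to the input, and the winner and its lattice neighbours are pulled towards the input. All array names and parameter values are illustrative only.

import numpy as np

def som_step(codebook, grid, x, alpha, sigma):
    """One update step of a plain (unsupervised) SOM.

    codebook : (k, n) array of codebook vectors m_i
    grid     : (k, 2) array of lattice coordinates of the k neurons
    x        : (n,) input vector
    alpha    : learning rate, decreasing over the iterations
    sigma    : neighbourhood spread, decreasing over the iterations
    """
    # Winning neuron: the codebook vector most similar to the input.
    r = int(np.argmin(np.linalg.norm(codebook - x, axis=1)))
    # Gaussian neighbourhood function centred on the winner's lattice position.
    f = np.exp(-np.sum((grid - grid[r]) ** 2, axis=1) / (2.0 * sigma ** 2))
    # Move the winner and its neighbours towards the input vector.
    codebook += alpha * f[:, None] * (x - codebook)
    return r

# Usage on a small 10 x 10 map with 3-dimensional inputs.
rng = np.random.default_rng(0)
grid = np.array([(i, j) for i in range(10) for j in range(10)], dtype=float)
codebook = rng.random((100, 3))
for t in range(200):
    decay = 1.0 - t / 200
    som_step(codebook, grid, rng.random(3), alpha=0.5 * decay, sigma=0.5 + 3.0 * decay)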

This method is commonly trained in an unsupervised fashion, though some supervised ones do exist (Fessant et al., 2001; Goren-Bar et al., 2000; Kohonen, 1988). Typically, supervised SOMs concatenate class information to the input vector, an approach which has a number of disadvantages:

(1) The influence of the attached data label on the error measure is unbalanced. For example, when training a SOM on inputs with a small data label and a large class label, the mapping of the input is biased towards the class label, resulting in a poor representation of the actual data.

(2) The complexity of the training algorithm grows linearly with the size of the input vectors. Attaching a class label to the network input increases the computational demand, particularly when the dimension of the class label is large.

(3) The approach is restricted to learning problems featuring numerical class labels.

These problems are typically observed when dealing with datasets that have a large number of pattern classes. A supervised approach to the training of SOM which overcomes these problems is presented in this paper.

A limitation of SOM is that its application is restricted to the processing of fixed sized data vectors. Extensions to allow the processing of data sequences are suggested in (Kohonen, 1995; Varsta et al., 1997; Voegtlin, 2000). More recent developments include a general framework for unsupervised processing of structured data (Hammer et al., 2002), and a specific extension of SOMs to the domain of graph structures (Hagenbuchner et al., 2003). Supervised versions of the extended SOM are limited to (Hagenbuchner et al., 2001), where Kohonen's idea of attaching class labels to input data is adapted. However, that model performs satisfactorily only within a limited range of application domains.

This paper extends the SOM for structured data (SOM-SD) described in (Hagenbuchner et al., 2003) by adding a component to allow for efficient and flexible supervised learning. The idea is to assign a class label to the neurons of the map. The training algorithm then rejects an input pattern if it belongs to a different class from the best matching codebook vector; an approach inspired by the learning process deployed in learning vector quantization algorithms (Kohonen, 1986). The class membership of the neurons is not static, as it is determined online by the class membership of the input data which matched a codebook vector most often. Hence, this method assists the SOM learning process in finding clusters according to some pre-defined classes. The proposed method is very general in nature in that it allows the handling of missing or incomplete class information, and hence includes the unsupervised SOM-SD as a special case.

The paper is organized as follows: the SOM-SD model described in Section 2 forms the basis for the supervised learning approach which is proposed in Section 3. In Section 4, the supervised SOM-SD is applied to a relatively large real world learning problem, viz., the logo recognition problem. Finally, the conclusions drawn in Section 5 list advantages and potential problems of the supervised SOM-SD approach.


2. SOM for structures

An extension of SOM which can process structured data (SOM-SD) is introduced in (Hagenbuchner et al., 2003). A distinct feature of the SOM-SD model is that data structures, e.g., labeled directed acyclic graphs (DAGs),2 can be processed in an unsupervised fashion. As is demonstrated in (Hagenbuchner et al., 2003), a SOM-SD model defines a general mechanism which includes standard SOMs as a special case. The ideas presented in (Hagenbuchner et al., 2003) include a method to transform DAGs into a list of fixed sized vectors, and an extension to the learning algorithm.

Graphs can be easily transformed into a set of data vectors by rewriting their representation in a tabular form. An example of such a procedure is given in Fig. 1. Each node in the graph shown in Fig. 1 is identified by a unique symbol. Associated with each node is a 2-dimensional real valued data label which represents the data in the node. The node identified as a is called the root, and node c is the leaf node. All other nodes are intermediate nodes. The maximum out-degree (the maximum number of children at any node) of this graph is 2. In comparison, each row in the tabular representation is a vector which represents a node in the graph. These vectors can be made constant in size if the maximum out-degree and the maximum data label dimension are known. For nodes with missing children or a smaller data label size, padding with a suitably chosen constant value, e.g., 0, is deployed. Fixed sized vectors are more convenient to serve as inputs for most artificial neural network architectures, including SOM.

Footnote 2: The SOM-SD described in (Hagenbuchner et al., 2003) allows the processing of some special classes of directed cyclic graphs as well.

Traditionally, neural network models use fixed sized data labels as input without considering relationships among the data. In the case of graphs, relationships between the data labels are well defined, and hence can be used to assist in the training process. Thus, the aim is for the network to utilize not only the data label of a node as input but also to include some information about the best matching codebook vector for each child.

Fig. 1. Top: A labelled directed acyclic graph with three nodes. Bottom: The same graph represented in a tabular form:

Node-id   Data-label     Children
a         (0.07, 0.51)   b, c
b         (3.14, 0.27)   c
c         (2.27, 1.30)   -
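As a rough illustration of this tabular encoding, the following sketch (ours; the dictionary layout and padding convention are assumptions, not code from the paper) turns the three-node graph of Fig. 1 into fixed-size rows, padding missing child slots with 0. During SOM-SD training each child slot is later replaced by the two map coordinates of that child's winning neuron.

# Encode the labelled DAG of Fig. 1 as fixed-size node vectors:
# [data label | one slot per possible child], padded with 0.
graph = {                         # node -> (data label, list of children)
    "a": ((0.07, 0.51), ["b", "c"]),
    "b": ((3.14, 0.27), ["c"]),
    "c": ((2.27, 1.30), []),
}
max_outdegree = 2                 # maximum number of children at any node

def node_row(name):
    label, children = graph[name]
    ids = {n: i + 1 for i, n in enumerate(graph)}      # symbolic child references
    slots = [ids[children[k]] if k < len(children) else 0
             for k in range(max_outdegree)]
    return list(label) + slots

for name in ("a", "b", "c"):
    print(name, node_row(name))
# a [0.07, 0.51, 2, 3]
# b [3.14, 0.27, 3, 0]
# c [2.27, 1.3, 0, 0]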


A SOM-SD is unable to receive feedback from other neurons in a conventional sense since there are no weighted links between the neurons. The solution is to add spatial location information, i.e., the location of the winning neuron of the children, to the set of input features of the parent node. As a result, the network needs to know where the winning neurons for all its children are located on the map when processing the parent node. In practical terms, graphs are processed in a bottom-up fashion (from the leaf nodes towards the root). This is necessary in order to make information about the children available when the parent node is processed.

In practice, for each node in the graph, the network input will be a set of vectors which consists of (a) the p-dimensional data label l and (b) a set of coordinates c of the winning neuron for each child. The vector c is qo-dimensional, where o is the maximum out-degree of any graph in the data set, and q is the dimension of the display map. Without loss of generality of the model described, q is set to 2. Hence, c consists of o tuples, each tuple being a two-dimensional vector representing the x-y coordinates of the winning neuron of a child node. Children, which essentially represent sub-graphs, are mapped at a particular position within the map. Hence, the tuples in c contain the coordinates of the codebook vectors which were associated with the children of the current node. Once it is known where the children are represented on the map, the vector component c of the parent node is updated accordingly. The training algorithm is hierarchical in nature, processing from the leaf nodes to the root node.

Training is performed in a similar fashion as with a traditional unsupervised SOM. The difference in the training algorithms arises from the fact that with SOM-SD the network input vector x is built through the concatenation of l and c, so that $x = [l^T, c^T]^T$. As a result, x is an n = p + 2o dimensional vector. The codebook vectors m are of the same dimension. But since we have two components in the input data, the vectors l and c, it is required to modify the similarity measure (e.g., the Euclidean distance) so as to weigh the input elements l and c. This weighting operation is necessary to balance the influence of the elements on the training algorithm. For example, l may be high dimensional with elements larger than those in c; the network would then be unable to learn relevant information provided through c. The approach of weighting the network input components will become clearer in Section 3.2, where a training algorithm is presented.
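The construction of x for a single node can be sketched as follows (our illustration; the function and variable names are assumptions). The label part l is taken from the node, the coordinate part c is filled with the winning-neuron coordinates of the already-processed children, and unused child slots remain zero-padded.

import numpy as np

def build_input(label, child_coords, max_outdegree):
    """Build the SOM-SD network input x = [l^T, c^T]^T for one node."""
    c = np.zeros(2 * max_outdegree)            # 2o entries, padded with 0
    for k, (cx, cy) in enumerate(child_coords):
        c[2 * k:2 * k + 2] = (cx, cy)          # x-y map coordinates of child k's winner
    return np.concatenate([np.asarray(label, dtype=float), c])

# A node with a 2-dimensional label and two children whose winning neurons sit
# at map positions (3, 7) and (12, 1); with o = 3 this gives an n = p + 2o = 8
# dimensional input vector.
x = build_input([0.07, 0.51], [(3, 7), (12, 1)], max_outdegree=3)
print(x)        # [ 0.07  0.51  3.  7.  12.  1.  0.  0. ]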

3. Supervised SOM for structured data

This section presents an extension of the unsupervised SOM-SD incorporating a teacher signal when processing nodes for which a (symbolic) target label exists. The idea is to assign codebook vectors to the same class as the node that was mapped at this location. If several nodes from different classes are mapped at the same location, then the class of the codebook entry is obtained through majority voting. Training proceeds in a similar manner to the unsupervised case, with the difference that codebook entries are rejected if they belong to a different class from the input vector. The advantage of this method is that the formation of clusters during the learning process can be controlled through some pre-defined target values. In addition, the proposed method can handle problems which feature incomplete or missing class labels.

3.1. Theoretical background

Kohonen (1995) described a mechanism for training a SOM in a supervised fashion. The idea was to produce input vectors through the concatenation of the (numeric) target vector with the data label, and then to proceed to train in a similar manner to that of an unsupervised SOM training algorithm. However, while the extension of this idea to the class of SOM-SD produced good results for some artificial learning tasks which feature only a small number of classes (Hagenbuchner et al., 2001), it often does not produce acceptable results in practical applications. Also, it appears that little work by other researchers has been performed on supervised SOM training.


Footnote 3: Generally, the neighbourhood radius in SOMs never decreases to zero since otherwise the algorithm reduces to vector quantization and no longer has topological ordering properties (Kohonen, 1995, p. 111).


Consider a self-organizing map M with k neurons. Each neuron is associated with a codebook entry $m \in \mathbb{R}^n$. The best matching neuron $m_r$ for an input node x is obtained, and the i-th element of the j-th codebook vector is updated as follows:

\[
\Delta m_{ij} = \begin{cases} -\epsilon\,\alpha(t)\, f(\Delta_{jr})\, h(x_i, m_{ij}) & \text{if } x \text{ and } m_j \text{ are in different classes,} \\ \alpha(t)\, f(\Delta_{jr})\,(x_i - m_{ij}) & \text{otherwise,} \end{cases} \qquad (1)
\]

where $f(\Delta_{jr})$ is a neighbourhood function, $\alpha$ is the learning rate which decreases to zero with rate 1/t, where t is the number of iterations, and $\epsilon$ is a rejection factor which weighs the influence of the rejection term h(·). The purpose of the rejection term is to move $m_j$ and its nearby neighbours away from x if $m_j$ is found to be in a different class. The effect is a reduction of the likelihood that an input node activates a codebook vector which is assigned to a "foreign" class in subsequent iterations. This modification is inspired by a similar approach used in LVQ (Kohonen, 1986).

In order to improve efficiency, the rejection term h(·) should provide stronger actions if a codebook entry is very similar to the input node, and have a lesser influence on codebook vectors which are already very different from x. We define the rejection term as follows:

\[
h(x_i, m_{ij}) = \mathrm{sgn}(x_i - m_{ij}) \left( \frac{\sigma_i}{\sigma_i + |x_i - m_{ij}|} \right) \qquad (2)
\]

where the function sgn(·) returns the sign of its argument, and $\sigma_i$ is the standard deviation defined as follows:

\[
\sigma_i = \sqrt{\frac{\sum_{l=1}^{N} (x_{li} - \bar{x}_i)^2}{N}} \qquad (3)
\]

where N is the total number of nodes in the training set, and $\bar{x}_i = \frac{1}{N}\sum_{l=1}^{N} x_{li}$. In other words, the term $\mathrm{sgn}(x_i - m_{ij})$ gives the direction for the weight change, ensuring that the weights in m are moved away from x, and the term $\sigma_i/(\sigma_i + |x_i - m_{ij}|)$ influences the size of the weight change. The weight step remains small if the input vectors are not very different from each other, and hence only small weight adjustments are necessary to achieve the desired effect. In addition, the size of the weight change depends on how strongly the input vector differs from the codebook vector. The weight change is maximal if $x_i$ and $m_{ij}$ are identical, and is closer to zero the more $x_i$ and $m_{ij}$ differ. Note that h(·) produces results within the range [-1, 1]. This eliminates the harmful influence of the magnitude of the elements of the vector on the rejection term, and allows more sophisticated control over the rejection term as it is independent of the size of the vector elements. Experiments have shown that the rejection term does not require the normalization of input vectors.

The neighbourhood function f(·) controls the amount by which the weights of the neighbouring neurons are updated. The neighbourhood function f(·) can take the form of a Gaussian function:

\[
f(\Delta_{ir}) = \exp\left( -\frac{\| l_i - l_r \|^2}{2\sigma(t)^2} \right) \qquad (4)
\]

where $\sigma(t)$ is the spread, decreasing with the number of iterations,3 $l_r$ is the location of the winning neuron, and $l_i$ is the location of the i-th neuron in the lattice. Other neighbourhood functions are also possible (Kohonen, 1995).
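The update rule of Eqs. (1)-(4) can be sketched as follows (our Python/NumPy rendering, not the authors' code; sigma_data denotes the per-element standard deviations of Eq. (3) computed once over the training set, and a class value of None marks the class "unknown").

import numpy as np

def rejection_term(x, m, sigma_data):
    # h(x_i, m_ij) of Eq. (2): sgn(x_i - m_ij) * sigma_i / (sigma_i + |x_i - m_ij|)
    diff = x - m
    return np.sign(diff) * sigma_data / (sigma_data + np.abs(diff))

def supervised_update(codebook, classes, grid, x, x_class, r,
                      alpha, sigma_t, eps, sigma_data):
    """Apply Eq. (1) to all codebook vectors, given the winning neuron r."""
    # Gaussian neighbourhood of Eq. (4), centred on the winner's lattice position.
    f = np.exp(-np.sum((grid - grid[r]) ** 2, axis=1) / (2.0 * sigma_t ** 2))
    for j in range(codebook.shape[0]):
        if classes[j] is not None and x_class is not None and classes[j] != x_class:
            # Different class: push the codebook vector away from x (rejection).
            codebook[j] -= eps * alpha * f[j] * rejection_term(x, codebook[j], sigma_data)
        else:
            # Same or unknown class: standard SOM move towards x.
            codebook[j] += alpha * f[j] * (x - codebook[j])

# sigma_data can be obtained from the (N, n) matrix X of all training node vectors:
#   sigma_data = X.std(axis=0)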

3.2. The training algorithm

Training a SOM-SD in a supervised fashion extends the training algorithm of the unsupervised version. The difference is that for computing the similarity, the similarity measure needs to be weighted. Also, some components (the vector c) of the input vector need to be updated at every training step. Another difference is that codebook vectors are assigned to a class during training, and that a rejection term is used for codebook entries that do not belong to the same class as the input vector. The following training algorithm summarizes the development proposed in this paper:

Step 1. A node j is chosen from the data set. When choosing a node, special care has to be taken that the children of that node have already been processed. Hence, at the beginning the leaf nodes of a graph are processed first. Then, vector $x_j$ is presented to the network. The winning neuron r is obtained by finding the most similar codebook entry $m_r$, e.g., by using the Euclidean distance as follows:

\[
r = \arg\min_i \| (x_j - m_i)\Lambda \| \qquad (5)
\]

where $\Lambda$ is an $n \times n$ dimensional diagonal matrix. Its diagonal elements $\lambda_{11} \ldots \lambda_{pp}$ are set to the weight constant $\mu_1$; all remaining diagonal elements are set to the weight constant $\mu_2$. The winning neuron is assigned to the same class as the node. Step 1 is repeated until all nodes in the training set have been considered exactly once. Codebook vectors that were activated by nodes belonging to different classes are assigned to the class which activated this neuron most frequently. Note that this step is to initialize neurons with class labels; it does not involve any training. Neurons not activated by the training set are assigned to the class unknown.

Step 2. A node j is chosen from the data set in the same way as in Step 1. The vector $x_j$ is presented to the network, and the winning neuron r is obtained by finding the most similar codebook entry $m_r$ by using Eq. (5). The class membership of $m_r$ is updated using majority voting. Hence, a codebook vector needs to recall (store) the classes of the input vectors by which it was activated within an iteration. Then, the winning codebook vector and its neighbours are updated according to Eq. (1).

Step 3. The coordinates of the winning neuron are passed on to the parent node, which in turn updates its vector c accordingly.4

Footnote 4: This step neglects the potential changes in the state of the descendants due to the weight change in Step 2. This approximation is required to speed up the training process. It works best when processing all the nodes from one graph before moving to the next graph.

The algorithm repeats Steps 2 and 3 until a given number of training iterations has been performed, or until the mapping precision has reached a given threshold.
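Putting the steps together, a rough sketch of the overall loop might look as follows. This is our illustration only: it reuses supervised_update from the sketch above, and assumes node objects carrying an input vector x (label part filled in, coordinate part updated online), a class label cls (None if unknown), and a reference parent_slot telling where to write the node's winning coordinates in its parent's input vector. Graphs are assumed to list their nodes bottom-up (children before parents).

import numpy as np
from collections import Counter

def train_supervised_somsd(graphs, codebook, grid, Lambda, n_iter,
                           alpha0, sigma0, eps, sigma_data):
    """Sketch of the supervised SOM-SD training algorithm (Steps 1-3)."""
    def winner(x):
        # Weighted Euclidean distance of Eq. (5).
        return int(np.argmin(np.linalg.norm((x - codebook) @ Lambda, axis=1)))

    # Step 1: initialise neuron class labels by majority voting; no training yet.
    votes = [Counter() for _ in range(len(codebook))]
    for g in graphs:
        for node in g:
            if node.cls is not None:
                votes[winner(node.x)][node.cls] += 1
    classes = [v.most_common(1)[0][0] if v else None for v in votes]  # None = unknown

    # Steps 2 and 3, repeated for a fixed number of iterations.
    for t in range(n_iter):
        alpha = alpha0 * (1.0 - t / n_iter)              # decreases towards zero
        sigma_t = max(sigma0 * (1.0 - t / n_iter), 1.0)  # never shrinks to zero
        for g in graphs:
            for node in g:                               # leaves first (bottom-up)
                r = winner(node.x)
                if node.cls is not None:                 # keep class labels current
                    votes[r][node.cls] += 1
                    classes[r] = votes[r].most_common(1)[0][0]
                supervised_update(codebook, classes, grid, node.x, node.cls, r,
                                  alpha, sigma_t, eps, sigma_data)
                if node.parent_slot is not None:         # Step 3: pass coordinates up
                    parent, k = node.parent_slot
                    parent.x[k:k + 2] = grid[r]
    return codebook, classes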

Gonzalez et al. (1997) describe problems when applying SOM to non-stationary data. In our case, the weight change in training Step 2 may produce non-stationary input data, as the descendant vector component c may change at successive iterations. In practice, however, this issue was not found to cause any particular problem. This was attributed to (a) the fact that c contains cardinal values and hence small weight changes do not have an immediate effect on the values in c, and (b) choosing a small initial learning rate $\alpha(t)$, decreasing linearly with the number of iterations t to a very small value, which ensures the convergence of the learning algorithm.

The optimal choice of the weight values $\mu_1$ and $\mu_2$ depends on the dimension of the data label l, the magnitude of its elements, the dimension of the coordinate vector c, and the magnitude of its elements. The Euclidean distance used in Eq. (5) is computed as follows:

\[
d = \sqrt{ \mu_1 \sum_{i=1}^{p} (l_i - m_i)^2 + \mu_2 \sum_{j=1}^{2o} (c_j - m_{p+j})^2 } \qquad (6)
\]

Hence, it becomes clear that the sole role of $\mu_1$ and $\mu_2$ is to balance the influence of the two terms on the behaviour of the learning algorithm. Ideally, the influence of the data label and the coordinate vector on the final result is equal. A way of obtaining the pair of weight values is through the following equation:

\[
\frac{\mu_1}{\mu_2} = \frac{n}{m} \, \frac{\sum_{j=1}^{m} \phi(|l_j|)\,\sigma(|l_j|)}{\sum_{i=1}^{n} \phi(c_i)\,\sigma(c_i)} \qquad (7)
\]

where $\phi(|l_i|)$ is the average absolute value of the i-th element of all data labels in the data set. Similarly, $\phi(c_i)$ is the average of the i-th element of all coordinates. The data label l is available for all nodes in the data set; however, the coordinate vector becomes available only once training has started. Hence, $\phi(c_i)$ needs to be approximated by assuming that the mapping of nodes is at random. As a result, the values in $\phi(c_i)$ are simply half the horizontal or vertical extension of the map. $\sigma(v_i)$ is the standard deviation of the i-th vector element of a vector v. It is claimed in (Hagenbuchner et al., 2003) that Eq. (7) underestimates $\mu_2$ because it neglects the importance of structural information being passed on to parent nodes accurately. Nevertheless, Eq. (7) can give a good first indication of appropriate values for $\mu_1$ and $\mu_2$. In order to obtain unique value pairs, we make the assumption that $\mu_1 + \mu_2 = 1$.
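A possible reading of Eq. (7), together with the normalisation $\mu_1 + \mu_2 = 1$, is sketched below. This is our interpretation only: we take m as the label dimension, n as the two map directions, and approximate the coordinate statistics under the random-mapping assumption described above (mean equal to half the map extension, standard deviation that of a uniform distribution).

import numpy as np

def balance_weights(labels, map_width, map_height):
    """Approximate mu1 and mu2 via Eq. (7), assuming mu1 + mu2 = 1."""
    labels = np.asarray(labels, dtype=float)        # (N, p) data labels of all nodes
    phi_l = np.abs(labels).mean(axis=0)             # phi(|l_j|), average absolute value
    sig_l = np.abs(labels).std(axis=0)              # sigma(|l_j|), standard deviation
    ext = np.array([map_width, map_height], dtype=float)
    phi_c = ext / 2.0                               # random mapping: mean = half extension
    sig_c = ext / np.sqrt(12.0)                     # std of a uniform distribution on [0, ext]
    m, n = labels.shape[1], len(ext)
    ratio = (n / m) * np.sum(phi_l * sig_l) / np.sum(phi_c * sig_c)   # mu1 / mu2
    mu2 = 1.0 / (1.0 + ratio)
    return 1.0 - mu2, mu2

mu1, mu2 = balance_weights(np.random.rand(1000, 12), map_width=156, map_height=119)
print(mu1, mu2)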

Fig. 2. The original data set of logos. Thirty-nine different instances of logos define 39 classes for the learning problem. In this figure, the images are scaled to feature the same horizontal extension.

4. Experimental results

This section investigates the capabilities of the proposed supervised SOM-SD model. To do this, we chose a real world learning problem from the area of pattern recognition where the task is to classify company logos.

The task defined by the logo recognition problem is to recognize and classify company logos such as those shown in Fig. 2. This data set has already been applied to investigate other neural network models capable of dealing with structured information, such as the recursive cascade correlation architecture and the recursive multilayer perceptron (Frasconi et al., 1997). The data set consists of 39 classes of logos which are available in the form of digital images (the logo data set was provided by the Document Processing Group, Center for Automation Research, University of Maryland). There are 300 different samples available for each of the 39 classes, produced by simulating random noise contamination on the original logos, giving a total set of 11,700 images.

A graph representation is extracted from each of the images by following procedures based on a contour-tree algorithm described in (Frasconi et al., 1997). The result is a training set consisting of 5850 graphs featuring a total of 55,547 sub-tree structures or nodes, and a validation data set with 5850 graphs featuring 55,654 sub-trees in total. Each node in the graph has a 12-dimensional numeric label attached which consists of:

(1) The area consumed by the contour (the number of pixels surrounded by the contour, normalized with respect to the maximum value among all contours).

(2) The outer boundary of the contour, normalized with respect to the largest boundary found in the picture.

(3) The number of pixels found inside the area enclosed by a contour. The value is normalized with respect to the maximum number of black pixels found in any of the contours of the picture.

(4) The minimum distance in pixels between the image barycenter and the contour barycenter (normalized with respect to half of the diagonal of the image).

(5) The angle between a horizontal line and the line drawn through the image barycenter and the contour barycenter. The angle is quantized in order to reduce the sensitivity to rotation in the image. Eight possible values are coded as real numbers in [0,1] with a uniform sampling.

(6), (7) The maximum curvature angle for convex sides, and the maximum curvature angle for concave sides.

(8), (9) The number of points for which the curvature angle exceeds a given threshold, for convex (first value) and concave (second value) regions.

(10) The smallest distance in pixels between the contour and other contours.

(11), (12) The smallest distance between the contour and the two contours with the largest outer boundary.

The root node represents an image as a whole, whereas its direct descendants represent outer contours. Recursively, a descendant represents a contour which is located inside the contour represented by its parent node. As a consequence, it is the intermediate and leaf nodes that hold information about the actual contents of a logo, and the structure of a graph represents the relationship between the components of a logo. The data label assigned to the root node holds little information.

The pre-processing step of the contour-tree algorithm uses filtering techniques to remove noise and small features. As a result, the graphs produced have a maximum out-degree not greater than 7. Hence, the dimension of the input vectors and codebook vectors is 12 + 2 × 7 = 26. This data set provides a difficult learning task considering that some of the images in the set contained noise covering up to 50% of a prototype image.

Table 1 gives a detailed overview of the properties of the data for each class. The columns in Table 1 give (a) the class label, (b) the symbol used in plots to represent a class, (c) the number of nodes, (d) the number of intermediate nodes, (e) the number of leaf nodes, (f) the out-degree, and (g) the depth of a graph. Columns (c)-(g) state triplets of values in the form x-y-z, where x represents the minimum value, y the maximum value, and z the average value. Hence, the triplet 1-2-1.3 in row 1, column (g) states that the average depth of a graph belonging to class '0' is 1.3, where the shallowest of these is of depth 1 and the deepest of depth 2.

From Table 1 we find that the properties of the graphs can vary considerably. For example, the graphs belonging to class 'W' feature many nodes and are shallow and wide, which produces many leaf nodes. In contrast, graphs from class '3' also feature many nodes but are narrow and deep, which produces a larger number of intermediate nodes. There are also very small graphs, such as those belonging to class 'U'. Overall, the data set provides graphs with a good variety of properties.

We trained a variety of networks using the training set as the basis to determine a good set of learning parameters. We found that a good choice of parameters for a network of size 156 × 119 (the number of neurons is about 1/3 of the number of nodes in the training set) is $\alpha(0) = 1$, $\epsilon = 0.01$, $\mu_1 = \frac{2}{3}\mu_2$, $\sigma(0) = 10$, and t = 250 iterations. A network trained with this set of parameters produced 97.26% classification performance on the training set, and 86.34% generalization performance. In comparison, the best SOM-SD network trained unsupervised achieved a respectable peak performance of 95.16% for classification and 83.44% for generalization when using $\mu_1 = 0.75\mu_2$, $\sigma(0) = 6$, and other parameters as before.
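For reference, the reported parameter settings can be collected in a small configuration snippet (ours; the key names are arbitrary):

# Parameter settings reported in the text above.
supervised_params = {
    "map_size": (156, 119),     # roughly 1/3 of the 55,547 training nodes
    "alpha0": 1.0,              # initial learning rate alpha(0)
    "epsilon": 0.01,            # rejection factor
    "mu1_over_mu2": 2.0 / 3.0,  # label/coordinate weighting, with mu1 + mu2 = 1
    "sigma0": 10,               # initial neighbourhood spread sigma(0)
    "iterations": 250,
}
unsupervised_params = dict(supervised_params, mu1_over_mu2=0.75, sigma0=6)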

The final mapping of all nodes in the data set is shown in Fig. 3. The plot shows that leaf nodes are mapped well separated from nodes of other types, and that there are very few root nodes mapped at the same location as intermediate nodes. The observation of intermediate nodes being mapped in the same area of the map as root nodes is not of a general nature but rather a result of a particular property of the given learning task. As stated earlier, the data labels attached to the root nodes do not hold much useful information and are usually identical over many classes. Hence, the differentiating feature of root nodes is structural information. Similarly, this is true for intermediate nodes, since the important details leading to the differentiation of the logos are encoded in the leaf nodes. This property causes intermediate nodes to be mapped very close to a corresponding root node. Experiments conducted with other data sets (not shown in this paper) often showed a clear separation of leaf, intermediate, and root nodes.


Table 1. Properties of the graphs for each class. The table states triplets of values representing min, max, and average values.

Class  Symbol  #Nodes  #Interm  #Leafs  Out-degree  Depth

0 + 3–10–6.5 0–2–0.3 2–8–5.2 2–7–5.2 1–2–1.3

1 · 6–17–14.9 1–2–1.0 4–15–12.8 3–7–6.9 2–2–2.0

2 18–23–20.8 1–3–1.7 16–20–18.5 7–7–7.0 2–4–2.7

3 h 10–16–13.2 4–7–6.4 4–8–5.8 3–5–4.7 4–6–5.7

4 j 9–19–14.3 1–2–1.2 7–17–11.8 5–7–6.9 2–3–2.0

5 s 3–5–3.6 1–2–1.0 1–3–1.5 1–3–1.5 2–2–2.0

6 d 16–22–20.1 3–6–5.7 11–15–13.3 7–7–7.0 2–3–3.0

7 n 2–5–3.4 0–1–1.0 1–3–1.4 1–3–1.4 1–2–2.0

8 m 16–25–22.3 2–3–3.0 12–21–18.4 7–7–7.0 3–4–4.0

9 , 3–6–4.2 1–3–1.8 1–2–1.4 1–2–1.4 2–4–2.8

A . 3–8–6.1 0–1–0.1 2–7–5.0 2–7–5.0 1–2–1.1

B � 2–6–4.8 0–1–0.1 1–5–2.9 1–5–2.9 1–2–1.1

C � 6–11–8.5 0–1–0.0 4–10–7.5 4–7–6.7 1–2–1.0

D } 8–15–12.6 4–7–5.6 3–8–5.9 3–7–5.1 4–5–4.8

E 6–11–7.5 2–4–3.7 2–6–2.8 2–6–2.2 3–5–4.7

F 4–7–4.6 1–2–1.0 2–5–2.6 2–5–2.3 2–3–2.0

G 4–9–6.2 0–2–1.0 2–7–4.2 2–7–4.2 1–2–2.0

H 8–11–8.4 1–2–1.0 6–9–6.4 6–7–6.2 2–2–2.0

I g 5–10–7.8 0–2–0.9 3–8–5.8 3–7–5.8 1–2–1.9

J 2–8–5.7 0–1–0.0 1–7–4.6 1–7–4.6 1–2–1.0

K 22–31–28.1 1–3–2.0 19–28–25.1 7–7–7.0 2–2–3.0

L f 8–10–8.4 1–2–1.0 6–8–6.4 6–7–6.1 2–3–2.0

M 3–8–6 1–3–2.0 1–5–3 1–4–2.7 2–2–3.0

N 3–8–6.5 0–1–0.0 2–7–5.5 2–7–5.5 1–4–1.0

O g 2–5–3.4 0–1–0.2 1–3–2.2 1–3–2.2 1–2–1.2

P 8–12–10.1 1–4–2.1 5–9–7.0 3–7–5.4 2–2–2.0

Q 10–15–12.6 1–3–1.8 8–12–9.8 4–7–6.3 2–2–2.0

R f 9–15–12.4 1–2–1.0 7–13–9.9 7–7–7.0 2–2–2.0

S 2–6–4.4 0–3–1.8 1–2–1.6 1–2–1.6 1–4–2.8

T 8–12–10.2 3–5–3.9 4–7–5.3 4–6–4.1 3–3–3.0

U d 2–4–2.4 0–1–0.1 1–3–1.3 1–3–1.3 1–2–1.1

V 3–8–6.1 1–3–2.0 1–5–3.1 1–4–2.7 2–3–3.0

W 21–33–29.2 0–1–0.0 20–32–28.2 7–7–7.0 1–2–1.0

X 4–6–4.4 1–1–1.0 2–4–2.4 2–4–2.2 2–2–2.0

Y l 6–9–7.4 1–2–1.0 4–7–5.4 3–5–3.5 2–2–2.0

Z 4–7–4.8 1–1–1.0 2–5–2.8 2–4–2.2 2–2–2.0

a 3–9–5.8 0–3–1.7 1–6–3.1 1–5–2.9 1–2–2.0

b k 7–10–8.4 2–3–2.8 3–6–4.5 3–5–3.1 3–3–3.0

c 4–10–6.7 1–2–1.1 2–7–4.6 2–6–3.3 2–3–2.1


A closer examination of the mapping of root nodes depending on class membership, as shown in Fig. 4, provides a curious observation: root nodes belonging to a particular class do not form clear clusters. In addition, in many cases, nodes mapped close to nodes belonging to a different class do not seem to share common structural properties. Take class 'K' as an example, which features wide but shallow graphs. The graphs with the most similar properties are those from class 'W'. However, we cannot find these classes mapped close to each other. Instead, we find that classes 'L' and 'H' are often mapped close to class 'K'. Curiously, these classes represent logos which feature text inside a single black block. Similarly, we often find patterns from class 'C' mapped closely to patterns from class 'W'. Interestingly, these classes are obtained from logos featuring numerous black blocks without any further structure inside the blocks. Apparently, the supervised SOM-SD has mapped patterns according to information provided through the data labels of all nodes in the graph. This indicates that the supervised SOM-SD uses structural information primarily to pass vital information about the nodes in a graph to the root node, rather than influencing the mapping of nodes directly. This is a good property since it allows all the data labels in the graph to contribute to the final mapping of the root node.


Fig. 3. The mapping of nodes after training a SOM-SD supervised with the parameters stated in the text. Plotted are the individual mappings of root, intermediate, and leaf nodes.

Fig. 4. The portion of the map which mapped root nodes. Plotted are the mappings of root nodes depending on class membership. The symbols used in this plot correspond to those defined in Table 1.


Also, we found that the supervised SOM-SD model behaves in a robust fashion with respect to most of the learning parameters. Common negative effects observed with other neural network models, such as overtraining, local minima, or oscillation, were not observed. Nevertheless, choosing the right combination of parameters can help to obtain a better mapping of the data.

5. Conclusions

This paper introduced a new method that allows the training of self-organizing maps on graphs in a supervised learning fashion. It was demonstrated that through the incorporation of class membership information (which can be numeric or symbolic) into the learning process, the performance of the network is increased considerably. The proposed method appears to be very stable and is not very sensitive to the rejection rate. In addition, the supervised trained SOM-SD model has demonstrated better robustness to the initial conditions for $\mu_2$, $\alpha(0)$ and $\sigma(0)$, thus rendering this model considerably more robust to learning parameters when compared with its unsupervised counterpart. This improvement comes at almost no additional computational cost: a simple if statement in the neuron update algorithm is all that differs between the training algorithms of the supervised and the unsupervised SOM-SD models.

A further benefit of this method is that the algorithm is capable of handling missing class information. In the case where data have no class label available, the training algorithm reduces to the unsupervised mechanism. Hence, the supervised learning mechanism generalizes the method introduced in the section on the unsupervised SOM-SD.

The rejection term in Eq. (2) has been determined heuristically. At present, there is no formal proof that the rejection term will produce the desired effect. However, our experience indicates that adding this term indeed improves the mapping precision.

Note also that there is no convergence theorem for the training algorithm introduced in this paper. This deficiency also afflicts the general SOM model.

References

Fessant, F., Aknin, P., Oukhellou, L., Midenet, S., 2001. Comparison of supervised self-organizing maps using Euclidean or Mahalanobis distance in classification context. In: 6th Int. Work-Conf. on Artificial and Natural Neural Networks, Granada, p. 637 ff.

Frasconi, P., Francesconi, E., Gori, M., Marinai, S., Sheng, J.Q., Soda, G., Sperduti, A., 1997. Logo recognition by recursive neural networks. In: Kasturi, R., Tombre, K. (Eds.), Second International Workshop on Graphics Recognition, GREC'97. Springer, pp. 104-117.

Gonzalez, A.I., Grana, M., D'Anjou, A., Albizuri, F.X., Cottrell, M., 1997. A sensitivity analysis of the self-organizing map as an adaptive one-pass non-stationary clustering algorithm: the case of color quantization of image sequences. Neural Process. Lett. 6.

Goren-Bar, D., Kuflik, T., Lev, D., 2000. Supervised learning for automatic classification of documents using self-organizing maps. In: DELOS: Information Seeking, Searching and Querying in Digital Libraries.

Hagenbuchner, M., Sperduti, A., Tsoi, A.C., 2003. A self-organizing map for adaptive processing of structured data. IEEE Trans. Neural Networks 14 (3), 491-505.

Hagenbuchner, M., Tsoi, A.C., Sperduti, A., 2001. A supervised self-organizing map for structured data. In: Allinson, N., Yin, H., Allinson, L., Slack, J. (Eds.), Advances in Self-Organising Maps, pp. 21-28.

Hammer, B., Micheli, A., Sperduti, A., 2002. A general framework for unsupervised processing of structured data. In: Verleysen, M. (Ed.), ESANN'2002, 10th European Symposium on Artificial Neural Networks, pp. 389-394.

Kohonen, T., 1986. Learning vector quantization for pattern recognition. Technical Report TKK-F-A601, Helsinki University of Technology.

Kohonen, T., 1988. The neural phonetic typewriter. IEEE Computer 21 (3), 11-22.

Kohonen, T., 1995. Self-Organizing Maps. Springer Series in Information Sciences, vol. 30. Springer.

Varsta, M., Del R. Millan, J., Heikkonen, J., 1997. A recurrent self-organizing map for temporal sequence processing. In: Proc. 7th Int. Conf. on Artificial Neural Networks, ICANN'97, pp. 421-426.

Voegtlin, T., 2000. Context quantization and contextual self-organizing maps. In: Proc. Int. Joint Conf. on Neural Networks, vol. VI, pp. 20-25.