a supervised training algorithm for self-organizing maps for structures
TRANSCRIPT
Pattern Recognition Letters 26 (2005) 1874–1884
www.elsevier.com/locate/patrec
A supervised training algorithm for self-organizing maps for structures
Markus Hagenbuchner a,*, Ah Chung Tsoi b
a Faculty of Informatics, University of Wollongong, Wollongong NSW 2522, Australia
b Australian Research Council, GPO Box 2702, Canberra ACT 2601, Australia
Available online 24 May 2005
Abstract
Recent developments with self-organizing maps allow the application to graph structured data. This paper proposes
a supervised learning technique for self-organizing maps for structured data. The ideas presented in this paper differ
from Kohonen's approach in that a rejection term is introduced. This approach is superior because it is more robust
to the variation of the number of different classes in a dataset. It is also more flexible because it is able to efficiently
process data with missing or incomplete class information, and hence, includes the unsupervised version as a special
case. We demonstrate the capabilities of the proposed model through an application to a relatively large practical data
set from the area of image recognition, viz., logo recognition. It is shown that by adding supervised learning to the
learning process the discrimination between pattern classes is enhanced, while the computational complexity is similar
to that of the unsupervised version.
© 2005 Elsevier B.V. All rights reserved.
Keywords: Self-organizing maps; Supervised training algorithm; Structured data
1. Introduction
The self-organizing map (SOM) neural network
model is a direct outcome of work in the area of
‘‘vector quantization’’ and was first introduced in
(Kohonen, 1986). Vector quantization (VQ) and
SOMs are popular and well studied methods for quantizing sets of input vectors (Kohonen, 1995).
doi:10.1016/j.patrec.2005.03.009
* Corresponding author. Fax: +61 2 4221 4218.
E-mail address: [email protected] (M. Hagenbuchner).
In practice, the SOM model has become a widely
applied mechanism which is used to either detect
and visualize properties of high dimensional input
data, or as a pre-processing step to filter out essen-
tial properties of a data set, and hence reduce the
complexity of data processing by reducing its
dimensionality. The SOM, known to be a topology-preserving map, was developed to help identify
clusters in multidimensional datasets. This is per-
formed by projecting data from high dimensional
input space onto a two-dimensional display plane
whilst preserving the topological relationship be-
tween two vectors in the transform mapping oper-
ation.1 The result is that data points that were
‘‘close’’ to one another in the original multidimen-
sional input data space are mapped onto nearby areas in the two-dimensional display space. SOMs
combine competitive learning with dimensionality
reduction by smoothing the clusters with respect
to an a priori grid of neurons. The SOM algorithm
provides a trade-off between the accuracy of the
quantization and the smoothness of the topologi-
cal mapping.
The two-dimensional display space of a SOM is
represented by a regular two-dimensional array of
neurons. Every neuron i is associated with an n-
dimensional codebook vector m_i = (m_{i1}, ..., m_{in})^T,
where T denotes the transpose of a vector or ma-
trix. The neurons of the map are connected to
adjacent neurons by a neighbourhood relation,
which defines the relationship among nearby neu-
rons, and hence the converged structure of the
map. The most common topologies in use are rectangular and hexagonal (Kohonen, 1995). Adjacent
neurons belong to the neighbourhood Ni of the
neuron i. Neurons belonging to Ni are updated
according to a neighbourhood function f(·). Most
often, f(·) is a Gaussian-shaped function or a Mexican-hat function (Kohonen, 1995). In the basic
SOM algorithm, the topology and the number of
neurons remain fixed. The number of neurons
determines the granularity of the mapping, which
has an effect on the accuracy and generalization
capability of the resulting SOM (Kohonen, 1995).
The network is trained by finding the codebook
vector which is most similar to an input vector.
This codebook vector, and its neighbours are then
updated so as to render them more similar to the input vector. The result is that, during the training
phase, the SOM forms an elastic cover that is
shaped by the input data. The learning algorithm
controls the cover so that it strives to approximate
the probability density of the input data. The ref-
erence vectors in the codebook drift to areas where
the density of the input data is high. Eventually,
only a few codebook vectors lie in areas where the
input data is sparse.
1 In theory, the input data can be mapped onto a q-dimensional map, where q ∈ N+. For simplicity, we restrict ourselves to two-dimensional Kohonen maps.
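To make the procedure just described concrete, the following Python sketch implements one update step of a standard unsupervised SOM. It is a minimal illustration under our own names and conventions, not code from the paper.

```python
import numpy as np

def som_update_step(codebook, x, lr, sigma):
    """One update step of a classic unsupervised SOM (illustrative sketch).
    codebook has shape (height, width, n); x is an n-dimensional input."""
    h, w, _ = codebook.shape
    # Competitive step: the winner is the codebook entry closest to x.
    dist = np.linalg.norm(codebook - x, axis=2)
    r = np.unravel_index(np.argmin(dist), (h, w))
    # Cooperative step: a Gaussian neighbourhood centred on the winner
    # determines how strongly each neuron is pulled towards x.
    yy, xx = np.mgrid[0:h, 0:w]
    f = np.exp(-((yy - r[0]) ** 2 + (xx - r[1]) ** 2) / (2.0 * sigma ** 2))
    codebook += lr * f[..., None] * (x - codebook)
    return r
```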
This method is commonly trained in an unsu-
pervised fashion though some supervised ones do
exist (Fessant et al., 2001; Goren-Bar et al., 2000;
Kohonen, 1988). Typically, supervised SOMs con-
catenate class information to the input vector, an
approach which has a number of disadvantages:
(1) The influence of the attached class label on
the error measure is unbalanced. For example, when training a SOM on inputs with a
low-dimensional data label and a high-dimensional class label, the mapping of the input
is biased towards the class label, resulting in
a poor representation of the actual data (see
the sketch after this list).
(2) The complexity of the training algorithm grows linearly with the size of the input vec-
tors. Attaching a class label to the network input
increases the computational demand particu-
larly when the dimension of the class label is
large.
(3) The approach is restricted to learning prob-
lems featuring numerical class labels.
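A small numeric sketch (the sizes are hypothetical, chosen only for illustration) makes the imbalance described in point (1) visible:

```python
import numpy as np

# Kohonen-style supervised input: concatenate a one-hot class label to
# the data label. With a 2-dimensional data label and 39 classes, the
# class part dominates any unweighted Euclidean distance, biasing the
# mapping towards the class label.
data_label = np.array([0.07, 0.51])            # small data label
class_label = np.zeros(39)
class_label[7] = 1.0                           # large one-hot class label
x = np.concatenate([data_label, class_label])  # 41-dimensional SOM input
```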
These problems are typically observed when
dealing with datasets that have a large number of
pattern classes. A supervised approach to the
training of SOM which overcomes these problems
is presented in this paper.
A limitation of SOM is that its application is re-
stricted to the processing of fixed sized data vec-
tors. Extensions to allow the processing of data sequences are suggested in (Kohonen, 1995; Varsta
et al., 1997; Voegtlin, 2000). More recent develop-
ments include a general framework for unsuper-
vised processing of structured data (Hammer
et al., 2002), and a specific extension of SOMs to
the domain of graph structures (Hagenbuchner
et al., 2003). Supervised versions of the extended
SOM are limited to (Hagenbuchner et al., 2001), where Kohonen's idea of attaching class labels to
input data is adapted. However, the model per-
forms satisfactorily only within a limited range of
application domains.
This paper extends the SOM for structured data
(SOM-SD) described in (Hagenbuchner et al., 2003)
by adding a component to allow for efficient and
flexible supervised learning. The idea is to assign a
class label to the neurons of the map. The training
algorithm then rejects an input pattern if it belongs
to a different class from the best matching codebook
vector, an approach inspired by the learning process deployed in learning vector quantization algo-
rithms (Kohonen, 1986). The class membership of
the neurons is not static as it is determined online
by the class membership of the input data which
matched a codebook vector most often. Hence,
this method assists the SOM learning process in
finding clusters according to some pre-defined clas-
ses. The proposed method is very general in nature in that it allows the handling of missing or incom-
plete class information, and hence, includes the
unsupervised SOM-SD as a special case.
The paper is organized as follows: the SOM-SD
model described in Section 2 forms the basis for
the supervised learning approach which is pro-
posed in Section 3. In Section 4, the supervised
SOM-SD is applied to a relatively large real-world learning problem, viz., the logo recognition prob-
lem. Finally, conclusions drawn in Section 5 list
advantages and potential problems of the super-
vised SOM-SD approach.
2. SOM for structures
An extension of SOM which can process struc-
tured data (SOM-SD) is introduced in (Hagen-
buchner et al., 2003). A distinct feature of the
SOM-SD model is that data structures, e.g.,
labeled directed acyclic graphs (DAGs),2 can be
processed in an unsupervised fashion. As is dem-
onstrated in (Hagenbuchner et al., 2003), a
SOM-SD model defines a general mechanism which includes standard SOMs as a special case.
The ideas presented in (Hagenbuchner et al., 2003)
include a method to transform DAGs into a list of
fixed sized vectors, and an extension to the learn-
ing algorithm.
Graphs can be easily transformed into a set of
data vectors by rewriting a representation in a tabular form.2 An example of such a procedure is given
in Fig. 1. Each node in the graph shown in Fig. 1 is
identified by a unique symbol. Associated with
each node is a 2-dimensional real valued data label
which represents the data in the node. The node identified as a is called the root; node c is the leaf
node. All other nodes are intermediate nodes.
The maximum out-degree (the maximum number
of children at any node) of this graph is 2. In com-
parison, each row in the tabular representation is a
vector which represents a node in the graph. These
vectors can be made constant in size if the maxi-
mum out-degree and the maximum data label dimension are known. For nodes with missing chil-
dren or smaller data label size, padding with a suit-
ably chosen constant value, e.g., 0 is deployed.
Fixed sized vectors are more convenient to serve
as inputs for most artificial neural network architectures, including SOM.
2 The SOM-SD described in (Hagenbuchner et al., 2003) also allows the processing of some special classes of directed cyclic graphs.
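The tabular transformation of Fig. 1 can be sketched in a few lines of Python; the data structure and helper names below are our own, and 0 is used as the padding constant.

```python
# Each node becomes a fixed-size row: its data label followed by one slot
# per child up to the maximum out-degree, padded for missing children.
graph = {
    "a": {"label": [0.07, 0.51], "children": ["b", "c"]},
    "b": {"label": [3.14, 0.27], "children": ["c"]},
    "c": {"label": [2.27, 1.30], "children": []},
}
MAX_OUTDEGREE = 2

def node_row(node_id):
    node = graph[node_id]
    pad = [0] * (MAX_OUTDEGREE - len(node["children"]))
    return node["label"] + node["children"] + pad

for nid in "abc":
    print(nid, node_row(nid))
# a [0.07, 0.51, 'b', 'c']
# b [3.14, 0.27, 'c', 0]
# c [2.27, 1.3, 0, 0]
```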
Traditionally, neural network models use fixed
sized data labels as input without considering relationships among the data. In the case of graphs,
relationships between the data labels are well de-
fined, and hence can be used to assist in the train-
ing process. Thus, the aim is for the network to
utilize not only the data label of a node as input
but also to include some information about the best
matching codebook vector for each child.
Fig. 1. Top: A labelled directed acyclic graph with three nodes.
Bottom: The same graph represented in a tabular form.
Node-id   Data-label     Children
a         (0.07, 0.51)   b, c
b         (3.14, 0.27)   c
c         (2.27, 1.30)   -
A SOM-SD is unable to receive feedback from
other neurons in a conventional sense since there
are no weighted links between the neurons. The
solution is to add spatial location information,
i.e., the location of the winning neurons of the children, to the set of input features of the parent
node. As a result, the network needs to know
where the winning neurons for all its children are
located on the map when processing the parent
node. In practical terms, graphs are processed in
a bottom-up fashion (from the leaf nodes towards
the root). This is necessary in order to make infor-
mation about the children available when the parent node is processed.
In practice, for each node in the graph, the
network input will be a vector which con-
sists of (a) the p-dimensional data label l and (b)
a set of coordinates c of the winning neuron for
each child. The vector c is qo-dimensional, where
o is the maximum out-degree of any graph in the
data set, and q is the dimension of the display
map. Without loss of generality of the model
described, q is set to 2. Hence, c consists of o
tuples, each tuple being a two-dimensional vector
representing the x–y coordinates of the winning
neuron of a child node. Children, which essen-
tially represent sub-graphs, are mapped at a par-
ticular position within the map. Hence, the tuples
in c contain the coordinates of codebook vectors which were associated with the children of the
current node. Once it is known where the chil-
dren are represented on the map, the vector com-
ponent c of the parent node is updated
accordingly. The training algorithm is hierarchi-
cal in nature, processing from the leaf nodes to
the root node.
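A leaves-first order of this kind can be obtained with a topological sort. The sketch below (our own illustration, using Python's standard graphlib module) orders the example graph of Fig. 1 so that every child precedes its parent:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Feeding each node's child list to TopologicalSorter as "predecessors"
# yields an order in which children come before their parents.
children = {"a": ["b", "c"], "b": ["c"], "c": []}
order = list(TopologicalSorter(children).static_order())
print(order)  # ['c', 'b', 'a'] -- from the leaf towards the root
```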
Training is performed in a similar fashion as
with a traditional unsupervised SOM. The difference of the training algorithms arises from the fact
that with SOM-SD the network input vector x is
built through the concatenation of l and c so that
x = [l^T, c^T]^T. As a result, x is an n = p + 2o dimensional vector. The codebook vectors m are of the
same dimension. But since we have two components in the input data, the vectors l and c, it is
required to modify the similarity measure (e.g., the
Euclidean distance) so as to weigh the input ele-
ments l, and c. This weighting operation is neces-
sary to balance the influence of the elements to
the training algorithm. For example: l may be high
dimensional with elements larger than those in c. The network would be unable to learn relevant
information provided through c. The approach
of weighting the network input components will
become clearer in Section 3.2 where a training
algorithm is presented.
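As a small illustration of the input construction described above (the values and the padding convention are made up for the example):

```python
import numpy as np

# SOM-SD input x = [l^T, c^T]^T for one node. With a p = 2 dimensional
# data label and maximum out-degree o = 2, x is n = p + 2o = 6 dimensional.
l = np.array([0.07, 0.51])             # data label of the node
c = np.array([12.0, 5.0, 40.0, 33.0])  # x-y winner coordinates of 2 children
x = np.concatenate([l, c])             # missing children would be padded,
                                       # e.g. with a suitable constant
print(x.shape)  # (6,)
```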
3. Supervised SOM for structured data
This section presents an extension of the unsupervised SOM-SD incorporating a teacher signal
when processing nodes for which a (symbolic) tar-
get label exists. The idea is to assign codebook vec-
tors to the same class as the node that was mapped
at this location. If several nodes from different
classes are mapped at the same location, then the
class of the codebook entry is obtained through
majority voting. Training proceeds in a similar
manner as in the unsupervised case, with the difference that codebook entries are rejected if they belong to a different class from the input vector. The
advantage of this method is that the formation
of clusters during the learning process can be con-
trolled through some pre-defined target values. In
addition, the proposed method can handle prob-
lems which feature incomplete or missing class labels.
3.1. Theoretical background
Kohonen (1995) described a mechanism for
training a SOM in a supervised fashion. The idea
was to produce input vectors through the concate-
nation of the (numeric) target vector with the data label, and then to proceed to train in a similar
manner as that of an unsupervised SOM training
algorithm. However, while the extension of this
idea to the class of SOM-SD produced good
results for some artificial learning tasks which
feature only a small number of classes (Hagen-
buchner et al., 2001), it often does not produce
acceptable results in practical applications. Also, it appears that little work by other research-
ers has been performed on supervised SOM
training.
Consider a self-organizing map M with k neurons.
Each neuron is associated with a codebook entry
m ∈ R^n. The best matching neuron m_r for an input
node x is obtained, and the i-th element of the j-th
codebook vector is updated as follows:
\Delta m_{ij} =
\begin{cases}
-\epsilon\, \alpha(t)\, f(\Delta_{jr})\, h(x_i, m_{ij}) & \text{if } x \text{ and } m_j \text{ are in different classes,} \\
\alpha(t)\, f(\Delta_{jr})\, (x_i - m_{ij}) & \text{otherwise}
\end{cases} \qquad (1)
where f(Δ_jr) is a neighbourhood function, α(t) is the
learning rate which decreases to zero with rate 1/t,
where t is the number of iterations, and ε is a rejection factor which weighs the influence of the rejection term h(·). The purpose of the rejection term is
to move m_j and its nearby neighbours away from x
if m_j is found to be in a different class. The effect is
a reduction of the likelihood that an input node
activates a codebook vector which is assigned to
a "foreign" class in subsequent iterations. This
modification is inspired by a similar approach used
in LVQ (Kohonen, 1986).
In order to improve efficiency, the rejection
term h(·) should provide stronger actions if a
codebook entry is very similar to the input node
and has a lesser influence on codebook vectors
which are already very different to x. We define
the rejection term as follows:
h(x_i, m_{ij}) = \operatorname{sgn}(x_i - m_{ij}) \left( \frac{\sigma_i}{\sigma_i + |x_i - m_{ij}|} \right) \qquad (2)
where the function sgn(·) returns the sign of its
argument, and σ_i is the standard deviation defined
as follows:
\sigma_i = \sqrt{ \frac{ \sum_{l=1}^{N} (x_{li} - \bar{x}_i)^2 }{ N } } \qquad (3)
where N is the total number of nodes in the training set, and \bar{x}_i = \frac{1}{N} \sum_{l=1}^{N} x_{li}. In other words, the
term sgn(x_i - m_{ij}) gives the direction for the weight
change, ensuring that the weights in m are moved
away from x, and the term σ_i/(σ_i + |x_i - m_{ij}|)
influences the size of the weight change. The
weight step remains small if the input vectors are
not very different from each other, and hence, only small weight adjustments are necessary to achieve
the desired effect. In addition, the size of the
weight change depends on how strongly the input
vector differs from the codebook vector. The
weight change is maximal if x_i and m_{ij} are identical, and is closer to zero the more x_i and m_{ij} differ.
Note that h(·) produces results within the range
[-1, 1]. This eliminates the harmful influence of
the magnitude of the elements of the vector to
the rejection term, and allows more sophisticated
control over the rejection term as it is independent
of the size of the vector elements. Experiments
have shown that the rejection term does not re-
quire the normalization of input vectors.
The neighbourhood function f(·) controls the
amount by which the weights of the neighbouring
neurons are updated. The neighbourhood function
f(·) can take the form of a Gaussian function:
f(\Delta_{ir}) = \exp\left( - \frac{ \| l_i - l_r \|^2 }{ 2 \sigma(t)^2 } \right) \qquad (4)
where σ(t) is the spread, decreasing with the number of iterations,3 l_r is the location of the winning
neuron, and l_i is the location of the i-th neuron
in the lattice. Other neighbourhood functions are also possible (Kohonen, 1995).
3 Generally, the neighbourhood radius in SOMs never decreases to zero as otherwise the algorithm reduces to vector quantization and no longer has topological ordering properties (Kohonen, 1995, p. 111).
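The interplay of Eqs. (1)-(4) can be summarized in a short sketch. The code below is our own illustration under assumed names, not the authors' implementation; for simplicity it uses the plain Euclidean distance to find the winner (the weighted distance of Eq. (5) appears in Section 3.2).

```python
import numpy as np

def rejection(x, m, sigma):
    """Rejection term h(x_i, m_ij) of Eq. (2), evaluated elementwise;
    sigma holds the per-element standard deviations of Eq. (3)."""
    return np.sign(x - m) * sigma / (sigma + np.abs(x - m))

def supervised_update(codebook, classes, x, x_class, lr, eps, spread, sigma):
    """One supervised update following Eq. (1). codebook: (h, w, n) array;
    classes: (h, w) array of class ids currently assigned to the neurons."""
    h, w, _ = codebook.shape
    d = np.linalg.norm(codebook - x, axis=2)
    r = np.unravel_index(np.argmin(d), (h, w))
    # Gaussian neighbourhood f of Eq. (4), centred on the winner r.
    yy, xx = np.mgrid[0:h, 0:w]
    f = np.exp(-((yy - r[0]) ** 2 + (xx - r[1]) ** 2) / (2.0 * spread ** 2))
    # Eq. (1): attract neurons of the same class as x, reject the others.
    same = (classes == x_class)[..., None]
    attract = lr * f[..., None] * (x - codebook)
    reject = -eps * lr * f[..., None] * rejection(x, codebook, sigma)
    codebook += np.where(same, attract, reject)
    return r
```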
3.2. The training algorithm
Training a SOM-SD in a supervised fashion ex-
tends the training algorithm of the unsupervised
version. The difference is that for computing the
similarity, the similarity measure needs to be
weighted. Also, some components (the vector c) of the input vector need to be updated at every
training step. Another difference is that codebook
vectors are assigned to a class during training,
and that a rejection term is used for codebook entries
that do not belong to the same class as the input
vector. The following training algorithm summa-
rizes the development proposed in this paper:
Step 1. A node j is chosen from the data set. When
choosing a node, special care has to be
taken that the children of that node have
already been processed. Hence, at the
beginning the leaf nodes of a graph are
processed first. Then, vector x_j is presented
to the network. The winning neuron r is
obtained by finding the most similar codebook entry m_r, e.g., by using the Euclidean
distance as follows:
r = \arg\min_i \| (x_j - m_i) \Lambda \| \qquad (5)
where Λ is an n × n dimensional diagonal
matrix. Its diagonal elements λ_11, ..., λ_pp
are set to the weight constant μ1; all
remaining diagonal elements are set to
the weight constant μ2. The winning neuron is assigned to the same class as the
node. Step 1 is repeated until all nodes in
the training set have been considered
exactly once. Codebook vectors that were
activated by nodes belonging to different
classes are assigned to the class which acti-
vated this neuron most frequently. Note
that this step is to initialize neurons with class labels; it does not involve any train-
ing. Neurons not activated in the training
set are assigned to the class unknown.
Step 2. A node j is chosen from the data set in the
same way as in Step 1. The vector x_j is presented to the network, and the winning
neuron r is obtained by finding the most
similar codebook entry m_r by using Eq. (5). The class membership of m_r is updated
using majority voting. Hence, a codebook
vector needs to recall (store) the classes
of input vectors by which it was activated
within an iteration. Then, the winning
codebook vector and its neighbours are
updated according to Eq. (1).
Step 3. The coordinates of the winning neuron are passed on to the parent node which in turn
updates its vector c accordingly.4
4 This step neglects the potential changes in the state of the
descendants due to the weight change in Step 2. This approx-
imation is required to speed up the training process. It works
best when processing all the nodes from one graph before
moving to the next graph.
The algorithm repeats Steps 2 and 3 until a
given number of training iterations has been performed,
or until the mapping precision has reached a given
threshold.
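Step 1 and the weighted winner selection of Eq. (5) can be sketched as follows; the names are our own, and the weighting follows the form of Eq. (6) below.

```python
import numpy as np
from collections import Counter

def winner(codebook, x, mu1, mu2, p):
    """Winning neuron under the weighted distance of Eqs. (5)/(6): the
    first p elements (data label) are weighted by mu1, the remaining
    2o elements (child coordinates) by mu2."""
    n = codebook.shape[-1]
    lam = np.concatenate([np.full(p, mu1), np.full(n - p, mu2)])
    d = np.sqrt(np.sum(lam * (codebook - x) ** 2, axis=2))
    return np.unravel_index(np.argmin(d), codebook.shape[:2])

def assign_classes(codebook, inputs, labels, mu1, mu2, p):
    """Step 1 (sketch): present every node once and give each neuron the
    class that activated it most often; neurons never activated keep the
    class 'unknown'."""
    votes = {}
    for x, cls in zip(inputs, labels):
        votes.setdefault(winner(codebook, x, mu1, mu2, p), []).append(cls)
    classes = np.full(codebook.shape[:2], "unknown", dtype=object)
    for r, v in votes.items():
        classes[r] = Counter(v).most_common(1)[0][0]
    return classes
```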
Gonzalez et al. (1997) describe problems when applying SOM to non-stationary data. In our case,
the weight change in training Step 2 may produce
non-stationary input data as the descendant vector
component c may change at successive iterations.
In practice, however, this issue was not found to
cause any particular problem. This was attributed
to (a) the fact that c contains cardinal values and
hence, small weight changes do not have an immediate effect on the values in c, and (b) choosing a
small initial learning rate a(t) decreasing linearly
with the number of iterations t to a very small
value ensures the convergence of the learning
algorithm.
The optimal choice of the weight values μ1 and
μ2 depends on the dimension of the data label l, the
magnitude of its elements, the dimension of the
coordinate vector c, and the magnitude of its ele-
ments. The Euclidean distance used in Eq. (5) is
computed as follows:
d = \sqrt{ \mu_1 \sum_{i=1}^{p} (l_i - m_i)^2 + \mu_2 \sum_{j=1}^{2o} (c_j - m_{p+j})^2 } \qquad (6)
Hence, it becomes clear that the sole role of μ1 and μ2 is to balance the influence of the two terms on
the behaviour of the learning algorithm. Ideally,
the influence of the data label and the coordinate
vector on the final result is equal. A way of obtain-
ing the pair of weight values is through the follow-
ing equation:
\frac{\mu_1}{\mu_2} = \frac{n}{m} \cdot \frac{ \sum_{j=1}^{m} \phi(|l_j|)\, \sigma(|l_j|) }{ \sum_{i=1}^{n} \phi(c_i)\, \sigma(c_i) } \qquad (7)
where φ(|l_i|) is the average absolute value of the
i-th element of all data labels in the data set. Similarly, φ(c_i) is the average of the i-th element of all
coordinates; here m and n denote the dimensions of
the data label and of the coordinate vector, respectively. The data label l is available for all
nodes in the data set; however, the coordinate vector becomes available only when training has
started. Hence, φ(c_i) needs to be approximated
by assuming that the mapping of nodes is at random. As a result, the values in φ(c_i) are simply half
the horizontal or vertical extension of the map.
σ(v_i) is the standard deviation of the i-th element of a vector v. It is claimed in (Hagenbuchner
et al., 2003) that Eq. (7) underestimates μ2 because
it neglects the importance of structural information being passed on to parent nodes accurately.
Nevertheless, Eq. (7) can give a good first indication of appropriate values for μ1 and μ2. In order
to obtain unique value pairs, we make the assumption that μ1 + μ2 = 1.
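Combining Eq. (7) with the constraint μ1 + μ2 = 1 yields a unique pair of weights. The sketch below is our own; φ(c_i) uses the half-extension approximation from the text, while the substitution of a uniform-distribution standard deviation for σ(c_i) is an additional assumption of ours.

```python
import numpy as np

def balance_weights(labels, n_coord, map_w, map_h):
    """Sketch of Eq. (7) under mu1 + mu2 = 1. labels: (N, m) array of all
    data labels; n_coord = 2o coordinate elements."""
    m = labels.shape[1]
    num = np.sum(np.mean(np.abs(labels), axis=0) * np.std(np.abs(labels), axis=0))
    half = np.tile([map_w / 2.0, map_h / 2.0], n_coord // 2)     # phi(c_i)
    std = np.tile([map_w, map_h], n_coord // 2) / np.sqrt(12.0)  # sigma(c_i), assumed uniform
    ratio = (n_coord / m) * (num / np.sum(half * std))           # mu1 / mu2, Eq. (7)
    mu2 = 1.0 / (1.0 + ratio)
    return 1.0 - mu2, mu2

# e.g. for the logo task: 12-dim labels, out-degree 7 -> 14 coordinates
# mu1, mu2 = balance_weights(all_labels, 14, 156, 119)
```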
4. Experimental results
This section investigates the capabilities of the
proposed supervised SOM-SD model. To do this,
we chose a real world learning problem from the
area of pattern recognition where the task is to
classify company logos.
The task defined by the logo recognition problem is to recognize and classify company logos
such as those shown in Fig. 2.
Fig. 2. The original data set of logos. Thirty-nine different instances of logos define 39 classes for the learning problem. In this figure, the images are scaled to feature the same horizontal extension.
This data set has al-
ready been applied to investigate other neural net-
work models capable of dealing with structured
information such as recursive cascade correlation
architecture and recursive multilayer perceptron
(Frasconi et al., 1997). The data set consists of
39 classes of logos which are available in the form
of digital images.5 There are 300 different samples
available for each of the 39 classes, produced by
simulating random noise contamination on the
original logos, producing a total set of 11,700
images.
5 The logo data set was provided by the Document Processing Group, Center for Automation Research, University of Maryland.
A graph representation is extracted from each
of the images by following procedures based on
a contour-tree algorithm described in (Frasconi
et al., 1997). The result is a training set consisting of 5850 graphs featuring a total of 55,547 sub-tree
structures or nodes, and a validation data set with
5850 graphs featuring 55,654 sub-trees in total.
Each node in the graph has a 12-dimensional
numeric label attached which consists of:
(1) The area consumed by the contour (the
number of pixels surrounded by the con-
tour normalized with respect to the
maximum value among all contours).
(2) The outer boundary of the contour
normalized with respect to the largest
boundary found in the picture.
(3) The number of pixels found inside the area enclosed by a contour. The value
is normalized with respect to the maxi-
mum number of black pixels found in
any of the contours of the picture.
(4) The minimum distance in pixels
between the image barycenter and the
contour barycenter (normalized with
respect to half of the diagonal of the image).
(5) The angle between a horizontal line and
the line drawn through the image bary-
center and the contour barycenter. The
angle is quantized in order to reduce
the sensitivity to rotation in the image.
Eight possible values are coded as real
numbers in [0,1] with a uniform sampling.
(6), (7) The maximum curvature angle for con-
vex sides, and the maximum curvature
angle for concave sides.
(8), (9) The number of points for which the cur-
vature angle exceeds a given threshold for convex (first value) and concave
regions (second value).
(10) The smallest distance in pixels between
the contour and other contours.
(11), (12) The smallest distance between the con-
tour and the two contours with the larg-
est outer boundary.
The root node represents an image as a whole,
whereas its direct descendants represent outer
contours. Recursively, a descendant represents a contour which is located inside the contour repre-
sented by its parent node. As a consequence, it is
the intermediate and leaf nodes that hold informa-
tion about the actual contents of a logo, and the
structure of a graph represents the relationship be-
tween the components of a logo. The data label as-
signed to the root node holds little information.
The pre-processing step of the contour-tree algorithm uses filtering techniques to remove
noise and small features. As a result, the graphs
produced have a maximum out-degree not
greater than 7. Hence, the dimension of the input
vectors and codebook vectors is 12 + 2 * 7 = 26.
This data set provides a difficult learning task
considering that some of the images in the set
contained noise covering up to 50% of a prototype image.
Table 1 gives a detailed overview of the proper-
ties of the data for each class. The columns in
Table 1 give (a) the class label, (b) the symbol used
in plots to represent a class, (c) the number of
nodes, (d) number of intermediate nodes, (e) num-
ber of leaf nodes, (f) out-degree, and (g) the depth
of a graph. Columns (c)–(g) state triplets of values in the form x–y–z, where x represents the mini-
mum value, y the maximum value, and z the aver-
age value. Hence, the triplet 1–2–1.3 in row 1,
column (g) states that the average depth of a graph
belonging to class '0' is 1.3, where the shallowest of
these is of depth 1 and the deepest of depth 2.
From Table 1 we find that the property of the
graphs can vary considerably. For example, the
graphs belonging to class 'W' feature many nodes,
are shallow and wide, which produces many leaf
nodes. In contrast, graphs from class '3' also feature many nodes but are narrow and deep, which
produces a larger number of intermediate nodes.
There are also very small graphs such as those
belonging to class 'U'. Overall, the data set pro-
vides graphs with a good variety of properties.
We trained a variety of networks using the
training set as the basis to determine a good set
of learning parameters. We found that a good
choice of parameters for a network of size
156 × 119 (the number of neurons is about 1/3 of the
number of nodes in the training set) is α(0) = 1,
ε = 0.01, μ1 = (2/3)μ2, σ(0) = 10, t = 250 iterations.
A network trained with this set of parameters pro-
duced 97.26% classification performance on the
training set, and 86.34% generalization perfor-
mance. In comparison, the best SOM-SD network
trained unsupervised achieved a respectable peak
performance of 95.16% for classification and
83.44% for generalization when using μ1 = 0.75μ2,
σ(0) = 6, and other parameters as before.
The final mapping of all nodes in the data set is
shown in Fig. 3. The plot shows that leaf nodes are
mapped well separated from nodes of other types,
and that there are very few root nodes mapped at
the same location as intermediate nodes. The
observation of intermediate nodes being mapped in the same area of the map as root nodes is not
of a general nature but rather a result from a par-
ticular property of the given learning task. As sta-
ted earlier, the data label attached to the root
nodes do not hold much useful information and
are usually identical over many classes. Hence,
the differentiating feature of root nodes is struc-
tural information. Similarly, this is true for inter-mediate nodes since important details leading to
the differentiation of the logos is encoded in leaf
nodes. This property causes intermediate nodes
to be mapped very close to a corresponding root
node. Experiments conducted with other data sets
(not shown in this paper) often showed a clear sep-
aration of leaf, intermediate, and root nodes.
A closer examination of the mapping of root nodes depending on class membership as shown
in Fig. 4 provides a curious observation: root
nodes belonging to a particular class do not form
Table 1
Properties of the graphs for each class. The table states triplets of values representing min, max, and average values
Class Symbol #Nodes #Interm #Leafs Outdegree Depth
0 + 3–10–6.5 0–2–0.3 2–8–5.2 2–7–5.2 1–2–1.3
1 · 6–17–14.9 1–2–1.0 4–15–12.8 3–7–6.9 2–2–2.0
2 18–23–20.8 1–3–1.7 16–20–18.5 7–7–7.0 2–4–2.7
3 h 10–16–13.2 4–7–6.4 4–8–5.8 3–5–4.7 4–6–5.7
4 j 9–19–14.3 1–2–1.2 7–17–11.8 5–7–6.9 2–3–2.0
5 s 3–5–3.6 1–2–1.0 1–3–1.5 1–3–1.5 2–2–2.0
6 d 16–22–20.1 3–6–5.7 11–15–13.3 7–7–7.0 2–3–3.0
7 n 2–5–3.4 0–1–1.0 1–3–1.4 1–3–1.4 1–2–2.0
8 m 16–25–22.3 2–3–3.0 12–21–18.4 7–7–7.0 3–4–4.0
9 , 3–6–4.2 1–3–1.8 1–2–1.4 1–2–1.4 2–4–2.8
A . 3–8–6.1 0–1–0.1 2–7–5.0 2–7–5.0 1–2–1.1
B � 2–6–4.8 0–1–0.1 1–5–2.9 1–5–2.9 1–2–1.1
C � 6–11–8.5 0–1–0.0 4–10–7.5 4–7–6.7 1–2–1.0
D } 8–15–12.6 4–7–5.6 3–8–5.9 3–7–5.1 4–5–4.8
E 6–11–7.5 2–4–3.7 2–6–2.8 2–6–2.2 3–5–4.7
F 4–7–4.6 1–2–1.0 2–5–2.6 2–5–2.3 2–3–2.0
G 4–9–6.2 0–2–1.0 2–7–4.2 2–7–4.2 1–2–2.0
H 8–11–8.4 1–2–1.0 6–9–6.4 6–7–6.2 2–2–2.0
I g 5–10–7.8 0–2–0.9 3–8–5.8 3–7–5.8 1–2–1.9
J 2–8–5.7 0–1–0.0 1–7–4.6 1–7–4.6 1–2–1.0
K 22–31–28.1 1–3–2.0 19–28–25.1 7–7–7.0 2–2–3.0
L f 8–10–8.4 1–2–1.0 6–8–6.4 6–7–6.1 2–3–2.0
M 3–8–6 1–3–2.0 1–5–3 1–4–2.7 2–2–3.0
N 3–8–6.5 0–1–0.0 2–7–5.5 2–7–5.5 1–4–1.0
O g 2–5–3.4 0–1–0.2 1–3–2.2 1–3–2.2 1–2–1.2
P 8–12–10.1 1–4–2.1 5–9–7.0 3–7–5.4 2–2–2.0
Q 10–15–12.6 1–3–1.8 8–12–9.8 4–7–6.3 2–2–2.0
R f 9–15–12.4 1–2–1.0 7–13–9.9 7–7–7.0 2–2–2.0
S 2–6–4.4 0–3–1.8 1–2–1.6 1–2–1.6 1–4–2.8
T 8–12–10.2 3–5–3.9 4–7–5.3 4–6–4.1 3–3–3.0
U d 2–4–2.4 0–1–0.1 1–3–1.3 1–3–1.3 1–2–1.1
V 3–8–6.1 1–3–2.0 1–5–3.1 1–4–2.7 2–3–3.0
W 21–33–29.2 0–1–0.0 20–32–28.2 7–7–7.0 1–2–1.0
X 4–6–4.4 1–1–1.0 2–4–2.4 2–4–2.2 2–2–2.0
Y l 6–9–7.4 1–2–1.0 4–7–5.4 3–5–3.5 2–2–2.0
Z 4–7–4.8 1–1–1.0 2–5–2.8 2–4–2.2 2–2–2.0
a 3–9–5.8 0–3–1.7 1–6–3.1 1–5–2.9 1–2–2.0
b k 7–10–8.4 2–3–2.8 3–6–4.5 3–5–3.1 3–3–3.0
c 4–10–6.7 1–2–1.1 2–7–4.6 2–6–3.3 2–3–2.1
clear clusters. In addition, in many cases, nodes
mapped close to nodes belonging to a different
class do not seem to share common structural
properties. Take the class 'K' as an example, which
features wide but shallow graphs. Graphs with the
most similar property are those from class 'W'.
However, we cannot find these classes mapped
close to it. Instead, we find that classes 'L' and
'H' are often mapped close to class 'K'. Curiously,
these classes represent logos which feature text inside a single black block. Similarly, we often find
patterns from class 'C' mapped closely to patterns
from class 'W'. Interestingly, these classes are
obtained from logos featuring numerous black
blocks without any further structure inside the
blocks. Apparently, the supervised SOM-SD has
mapped patterns according to information pro-
vided through the data label of all nodes in the
graph. This indicates that the supervised SOM-SD uses structural information primarily to pass
vital information about the nodes in a graph to
the root node rather than influencing the mapping
of nodes directly. This is a good property since this
allows for all data labels in the graph to contribute
to the final mapping of the root node.
Fig. 3. The mapping of nodes after training a SOM-SD supervised with the parameters stated in the text. Plotted are the individual mappings of root, intermediate, and leaf nodes.
Fig. 4. The portion of the map which mapped root nodes. Plotted are the mappings of root nodes depending on class membership. The symbols used in this plot correspond to those defined in Table 1.
Also, we found that the supervised SOM-SD
model behaves in a robust fashion to most of the
learning parameters. Common negative effects
observed with other neural network models such
as overtraining, local minima, or oscillation were
not observed. Nevertheless, choosing the right
combination of parameters can help to obtain a
better mapping of the data.
5. Conclusions
This paper introduced a new method that al-
lows the training of self-organizing maps on
graphs in a supervised learning fashion. It was
demonstrated that through the incorporation of class membership information (which can be
numeric or symbolic) into the learning process,
the performance of the network is increased con-
siderably. The proposed method appears to be
very stable and is not very sensitive to the rejection
rate. In addition, the supervised trained SOM-SD
model has demonstrated better robustness to ini-
tial conditions for μ2, α(0) and σ(0), thus rendering this model considerably more robust to learning
parameters when compared with the unsupervised
counterpart. This improvement comes at almost
no additional computational cost. A simple if
statement in the neuron update algorithm is all
that is different between the training algorithms
of supervised and the unsupervised SOM-SD
models.
A further benefit of this method is that the algo-
rithm is capable of handling missing class informa-
tion. In the case where data have no class label
available, the training algorithm reduces to the
unsupervised mechanism. Hence, the supervised
learning mechanism generalizes the unsupervised
SOM-SD method described in Section 2.
The rejection term in Eq. (2) has been determined heuristically. At present, there is no formal
proof that the rejection term will produce the de-
sired effect. However, our experience indicates that
adding this term indeed improves the map-
ping precision.
Note also that there is no convergence theorem
for the training algorithm introduced in this paper.
This deficiency also afflicts the general SOM
model.
References
Fessant, F., Aknin, P., Oukhellou, L., Midenet, S. 2001.
Comparison of supervised self-organizing maps using
Euclidean or Mahalanobis distance in classification context.
In: 6th Int. Work Conf. Artificial Natural Neural Networks,
p. 637 ff, Granada.
Frasconi, P., Francesconi, E., Gori, M., Marinai, S., Sheng,
J.Q., Soda, G., Sperduti, A., 1997. Logo recognition by
recursive neural networks. In: Kasturi, R., Tombre, K.
(Eds.), Second International Workshop on Graphics Rec-
ognition, GREC'97. Springer, pp. 104–117.
Gonzalez, A.I., Grana, M., D'Anjou, A., Albizuri, F.X.,
Cottrell, M., 1997. A sensitivity analysis of the self
organizing map as an adaptive one-pass non-stationary
clustering algorithm: the case of color quantization of image
sequences. Neural Process. Lett. 6.
Goren-Bar, D., Kuflik, T., Lev, D. 2000. Supervised learning
for automatic classification of documents using self-orga-
nizing maps. In: DELOS: Information Seeking, Searching
and Querying in Digital Libraries.
Hagenbuchner, M., Sperduti, A., Tsoi, A.C., 2003. A self-
organizing map for adaptive processing of structured data.
IEEE Trans. Neural Networks 14 (3), 491–505, May.
Hagenbuchner, M., Tsoi, A.C., Sperduti, A. 2001. A supervised
self-organizing map for structured data. In: Allinson, N.,
Yin, H., Allinson, L., Slack, J. (Eds.), Advances in Self-
Organising Maps, pp. 21–28.
Hammer, B., Micheli, A., Sperduti, A. 2002. A general
framework for unsupervised processing of structured data.
In: Verleysen, M. (Ed.), ESANN'2002, 10th European
Symposium on Artificial Neural Networks, pp. 389–394,
April 2002.
Kohonen, T. 1986. Learning vector quantization for pattern
recognition. Technical Report TKK-F-A601, Helsinki Uni-
versity of Technology.
Kohonen, T. 1988. The neural phonetic typewriter. In: IEEE-
CS Computer, vol. 21 of 3, pp. 11–22.
Kohonen, T., 1995. Self-Organizing Maps. Springer Series in
Information Sciences, vol. 30. Springer.
Varsta, M., del R. Millán, J., Heikkonen, J., 1997. A recurrent
self-organizing map for temporal sequence processing. In:
Proc. 7th Int. Conf. Artificial Neural Networks, ICANN'97, pp. 421–426.
Voegtlin, T. 2000. Context quantization and contextual self-
organizing maps. In: Proc. Int. Joint Conf. Neural Net-
works, vol. VI, pp. 20–25.