

    Gene Expression Analysis with a Dynamically extended Self-

    Organized Map that exploits class information

    Seferina Mavroudi, Stergios Papadimitriou, Liviu Vladutu, Anastasios Bezerianos

    Department of Medical Physics, School of Medicine, University of Patras,

    26500 Patras, Greece, tel: +30-61-996115,

email: [email protected], [email protected]

    ABSTRACT

Motivation: Currently the most popular approach to analyse

    genome-wide expression data is clustering. One of the major

    drawbacks of most of the existing clustering methods is that

    the number of clusters has to be specified a priori.

Furthermore, with pure unsupervised algorithms prior
biological knowledge is totally ignored; e.g., there is no

    simple means to handle genes of known similar function

    being allocated to different clusters based on their

    expression profiles. Moreover, most current tools lack an

    effective framework for tight integration of unsupervised

    and supervised learning for the analysis of high-dimensional

    expression data.

    Results: The paper adapts a novel Self-Organizing map

    called supervised Network Self-Organized Map (sNet-SOM)

    to the peculiarities of gene expression data. The sNet-SOM

    determines adaptively the number of clusters with a

    dynamic extension process which is able to exploit class

information whenever it exists. Specifically, the sNet-SOM
accepts available class information to control a dynamical
extension process with an entropy criterion. This process extracts information about the structure of the decision

    boundaries. A supervised network can be connected

    additionally in order to resolve better at the difficult parts of

    the state space. In the case that there is no classification

    available, a similar dynamical extension is controlled with

    criteria based on the computation of local variances or

    resource counts.

    The sNet-SOM grows within a rectangular grid that

    provides effective visualization while at the same time it

    allows the implementation of efficient training algorithms.

    The expansion of the sNet-SOM is based on an adaptive

    process. This process grows nodes at the boundary nodes,

    ripples weights from the internal nodes towards the outer

    nodes of the grid, and inserts whole columns within the

    map. The growing process determines automatically the

    appropriate level of expansion with criteria dependent upon

    whether unsupervised or supervised training is used. For the

unsupervised training, the criterion is that the similarity between
the gene expression patterns of the same cluster fulfills a
designer-definable statistical confidence level of not being a
random event. The supervised mode of training grows the

    map until criteria defined on approximation/generalization

    performance are fulfilled. The voting schemes for the

    winner node have been designed in order to amplify the

    representation of rare gene expression patterns.

    The results indicate that sNet-SOM yields competitive

    performance to other recently proposed approaches for

    supervised classification at a significantly reduced

    computational cost and it provides extensive exploratory

    analysis potentiality within the unsupervised analysis

framework. Furthermore, it explores simple design decisions that are easy to comprehend and computationally efficient.

    Availability: The source code of the algorithms presented in

    the paper can be downloaded from

    http://heart.med.upatras.gr. The implementation is in

    Borland C++ Builder 4.0.

    Contact: [email protected],

    [email protected]


    1. Introduction

    The recent development of DNA microarray technology

    provides the ability to measure the expression levels of

    thousands of genes in a single experiment [ , , ]. The

    interpretation of such massive expression data is a new

    challenge for bioinformatics and opens new perspectives for

    functional genomics. A key question within this context is if

    given some expression data for a gene, this gene does

    belong to a particular functional class (i.e. it encodes for a

    protein of interest).


Currently, the most popular analysis of gene expression data,
in order to provide insight into the structure of the data and to
aid in the discovery of functional classes, is clustering, i.e.
the grouping of genes with similar expression patterns into
clusters [ , ]. Such approaches unravel relations between

    genes and help to deduce their biological role, since genes of

    similar function tend to display similar expression patterns.

    Most of the so far developed algorithms perform the

    clustering of the expression patterns in an unsupervised

    manner [ , , ]. However, frequently genes of similar

    function become allocated to different clusters. In this case,

    a pure unsupervised approach is unable to deduce the correct

    "rule" for the characterization of the gene class. On the other

    hand, there already exists valuable biological knowledge,

    which is manifested in the form of collections of genes

known to encode proteins of similar biological function,

    e.g. genes that code for ribosomal proteins [ ].


    Some of the clustering algorithms used so far for the

    clustering of gene expression data include hierarchical

    clustering [ ], K-means clustering, Bayesian clustering [ ]

and the Self-Organizing Map (SOM) [13].

Nevertheless, besides ignoring existing class information,
most of the widely adopted clustering methods, such as
K-means and the SOM, have another major drawback: they require an a priori decision on the
number and structure of distinct clusters. Moreover, most of
the proposed models do not incorporate flexible means for
coupling effectively the unsupervised phase with a
supervised complementary phase, in order to benefit the
most from both of these approaches.

    A major drawback of hierarchical clustering is that although

    the data points are organized into a strict hierarchy of nested

    subsets there is no reason to believe that expression data

    actually follows a true hierarchical descent, like for

    example, the evolution of the species [ , ]. Furthermore,

    decisions made early about grouping points to specific

    clusters cannot be reevaluated and often adversely affect the

result. This latter disadvantage is shared also by the dynamic
non-fuzzy hierarchical schemes proposed recently [ , ].

    Also, the traditional hierarchical clustering schemes suffer

    from lack of robustness, and from nonuniqueness and

    inversion problems.


    Bayesian clustering is a highly structured approach, which

imposes a strong prior hypothesis on the data [8]. However,
such prior hypotheses on expression data are usually not
available.

    K-means clustering on the other hand imposes no structure

    at all on the data, proceeds in a local fashion and produces

    an unorganized collection of clusters that is not conducive to

    interpretation [ ].

In contrast, the standard SOM algorithm has a number of
properties which render it a candidate of particular

    interest. SOMs can be implemented easily, are fast, robust

    and scale well to large data sets. They allow one to impose

    partial structure on the clusters and facilitate visualization

    and interpretation. In the case hierarchical information is

    required, it can be implemented on top of SOM, as in [ ].

    However, there is still an inherent requirement of the

    standard SOM algorithm, which constitutes a major

    drawback. The number of distinct clusters has to be

specified a priori, although there is no means to objectively

    predetermine the optimum number in the case of gene

    expression data.



    Recently, several dynamically extended schemes have been

    proposed that overcome the limitation of the fixed non-

    adaptable architecture of the SOM. Some examples are the

    Dynamic Topology Representing structures [ ], the

    Growing Cell Structures [ , ], Self-Organized Tree

    Algorithms [ , ] and the Adaptive Resonance Theory [ ].

    The presented approach has many similarities to these

    dynamically extended schemes. However, in contrast to the

    complexity of these schemes, we built simple algorithms

    that through the restriction of growing on a rectangular grid,

    can be implemented easily and the training of the models is

    very efficient. Also, the benefits of the more complex

    alternatives to the dynamical extension are still retained.


We call the proposed model sNet-SOM, from supervised
Network SOM, since although it is SOM-based it
incorporates many provisions for supervised

    complementation of learning. These provisions start with the

    supervised versions of the map growing process and run

    through the possibility of integrating a pure supervised

    model.

    Specifically, our clustering algorithm modifies the original

    SOM algorithm with a dynamic expansion process

    controlled by an entropy-based measure whenever gene

functional class information exists. The latter measure
quantifies to what extent the available information for the
biological function (i.e. class) of the gene is represented
accurately by the cluster (i.e. the SOM node) to which the
gene is allocated. Accordingly, the model is adapted

    dynamically in order to minimize the entropy within the

    generated clusters. This approach detects effectively the

    regions where the decision boundaries between different

    classes lie. At these regions, the classification task becomes

    difficult and a special supervised network can be connected

    with the sNet-SOM in order to resolve better at the class

    boundaries. Usually, only in the case of lack of class

    information the dynamic expansion is controlled by local

    variance or resource counts criteria. The entropy criterion

    concentrates on the resolution of the regions characterized

    by class ambiguity and therefore it is more effective.

    The sNet-SOM has been designed in order to automatically

detect the appropriate level of expansion. In the
unsupervised case, the distance threshold between patterns

    below which two genes can be considered as co-expressed is

    estimated. Then the map is grown automatically until its

    nodes correspond to gene clusters with distances that adhere

    to this limit. In the supervised case the criteria for stopping

    the network expansion can be expressed either in terms of

    the approximation or in terms of the classification

    performance.

    Furthermore, the sNet-SOM overcomes the problem of

irrelevant (flat) profiles that can populate many more
clusters than necessary in the traditional SOM. The solution

    we adopted is the careful redesign of the voting mechanism.

    The paper is outlined as follows: Initially, Section 2

    summarizes the microarray expression experiments and the

    associated data used to evaluate the presented computational

    learning schemes. Section 3 describes the extensions to the

    SOM that lead to the sNet-SOM and the overall architecture

of the latter. Section 4 deals with the learning algorithms that

    adapt both the structure and the parameters of the sNet-

    SOM. The expansion phase of the sNet-SOM learning is

    described in separate sections since it is rather complicated

    and depends on whether the learning is supervised or

    unsupervised. Specifically, Section 5 elaborates on the

    details of the expansion phase for the unsupervised case and

    Section 6 for the supervised one. Section 7 discusses results

    obtained from an application to yeast expression microarray

data. Finally, Section 8 presents the conclusions

    along with some directions onto which further research can

    proceed for improvements.

2. Microarray expression experiments

    Recently, new approaches have been developed for

    accessing large scale gene expression data. One of the most

    effective ones is by using the DNA microarray technology

    [ ]. In this method, thousands of distinct DNA probes are

    attached to a microarray. These probes can be Polymerase

    Chain Reaction (PCR) products or oligonucleotides whose

    sequences correspond to target genes or Expressed Sequence


Tags (ESTs) of the genome being studied. RNA is extracted

from the sample tissue or cells, reverse transcribed into
cDNA labeled with fluorescent dyes, which is then allowed

    to hybridize with the probes on the microarray. The cDNA

    corresponds to transcripts produced by genes in the samples,

    and the amount of a particular cDNA sequence present will

    be in proportion to the level of expression of its

    corresponding gene. The microarray is washed to remove

    non-specific hybridization, and the level of hybridization for

    each probe is calculated. An expression level for genes

    corresponding to the probes is derived from these

    measurements. This level represents a ratio between the

expression of the gene under some experimental condition
relative to the reference condition.

    Gene expression data obtained in this way are usually

    arranged in tables whose rows correspond to the genes and

    columns to the individual expression values of each gene in

    a particular experimental condition represented by the

column. These raw data are characterized by highly
asymmetrical distributions that make it difficult to apply
any distance metric for assessing the
differences among them. Therefore, a logarithmic
transformation is used as a preprocessing step that expands

    An additional desirable effect of the logarithmic

    transformation is that it provides a symmetrical scale around

    0.
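As a concrete illustration of this preprocessing step, a minimal Python sketch follows (the paper's implementation is in Borland C++ Builder; the function name and the use of base-2 logarithms here are assumptions of the sketch):

    import numpy as np

    def preprocess_expression(ratios):
        # ratios: 2-D array, rows = genes, columns = experimental
        # conditions; each entry is a test/reference fluorescence ratio.
        # log2 expands the scale for small values, compresses it for
        # large ones, and is symmetric around 0: a 2-fold induction
        # maps to +1 and a 2-fold repression to -1.
        return np.log2(np.asarray(ratios, dtype=float))

    # e.g. preprocess_expression([[2.0, 0.5], [1.0, 4.0]])
    # -> [[ 1., -1.], [ 0.,  2.]]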

    The gene expression patterns reflect a cell's internal state

and microenvironment, creating a molecular "picture" of the

    cell's state. Thus DNA microarrays can be used to capture

    these molecular pictures and deduce the condition of the

    cells. Furthermore, since the expression profile of a gene is

    correlated with its biological role, systematic microarray

    studies of global gene expression can provide remarkably

    detailed clues to the functions of specific genes. This is

    important, since currently fewer than 5% of the functions of

    the genes in the human genome are known.

    3. The sNet-SOM model

    The sNet-SOM is based on the standard SOM algorithm, but

    is dynamically extendable, so that the number of clusters is

    controlled by a properly defined measure of the algorithm

    itself, with no need for any a priori specification. Because

    all the previously mentioned clustering algorithms are

    purely unsupervised, they ignore any available a priori

    biological information. This means that not only existing

    information is not explored in order to deduce the correct

    expression characteristics of genes that make them part of

    functional groups, but also that genes known to be

    erroneously grouped to a cluster cannot be handled.

    Following the basic design principle to include existing

    prior knowledge, we manage to simultaneously consider

    both gene expression data and class information (whenever

    available) at the sNet-SOM training algorithms.

    However, so far class annotation for gene expression data is

    limited and not always available. In order to account also for

    this case we additionally developed a second similar

    algorithm so that for the two cases the algorithms differ only

    in the criteria that control the dynamic expansion of the

    map. Specifically, depending on the availability of class

    information we design two variants of sNet-SOM.

The first variant, the unsupervised sNet-SOM, performs node

    expansion in the absence of class labels by exploiting either

    a local variance measure that depends on the SOM

    quantization performance or on node resource counts. These

    criteria are used also at the Growing Cell Structures (GCS)

    algorithms for growing cells [ , ]. The convergence criteria

    are defined by a statistical assessment of the randomness of

    the distance between gene expression patterns.


The second variant, the supervised sNet-SOM, performs the

    growing by exploiting the class information with an entropy

    measure. The dynamic growth is based on the criterion of

    neuron ambiguity (i.e. uncertainty about class assignment),

    which is quantified with the entropy measure that is defined

    over the sNet-SOM nodes. This approach differs from the

local quantization error approach of [ ] and the resource
value of [ ], which grow the map at the nodes accumulating
the largest local variances and resource counts, as in the
unsupervised sNet-SOM. In the absence of class information


    these are reasonable and well performing criteria. However,

    these measures can be large even with no class ambiguity

    while the entropy measure directly and objectively

    quantifies the ambiguity. For that reason for the supervised

    sNet-SOM the entropy based growing technique is

    preferable.

    We have developed the supervised sNet-SOM initially

    within the context of an ischemia detection application

[22, 3]. In this application, it is used in combination with

    capable supervised models in order to maximize the

    performance of the detection of ischemic episodes.

    However, the peculiarities of the gene expression data made

    mandatory significant redesign of the algorithms. Below we

    discuss the sNet-SOM learning algorithms in detail.


    4. Learning algorithms

    The sNet-SOM is initialized with four nodes arranged in a

2×2 rectangular grid and grows nodes to represent the input

    data. Weight values of the nodes are self-organized

    according to a new method inspired by the SOM algorithm.

    The self-organization process maps properties of the original

high-dimensional data space onto the lattice consisting of

    sNet-SOM nodes. The map is expanded to represent the

    input space by creating new nodes, either from the boundary

    nodes performing boundary extension, or by inserting whole

    columns (or rows) of new units with a column extension (or

    row extension).

    The decision to grow either with the boundary or with the

    column (row) extension does not limit the potentiality for

    dimensionality reduction of the model and its modeling

    effectiveness, while its implementation is easier and the

training becomes more efficient. The latter advantage is

    important for the large data sets produced by the microarray

    experiments. Usually, new nodes are created by expanding

    the map at its boundaries. However, when the expansion

focus becomes a node placed deep in the interior of a large

    map, far from the boundary nodes, the adaptive expansion

    process inserts a whole column of nodes directly adjacent to

    this node. Therefore, the node becomes directly a boundary

    node and the expansion process can generate new nodes in

    the neighborhood. The implementation of this exception to

    the general grow from boundary rule, has accelerated

    significantly the training of large maps (2 to 4 times faster

    computation for maps of size of about 100 nodes).

    The growing structure takes the form of a nonuniform

rectangular grid. It develops within a large $N \times M$ grid that
provides slots for the new dynamically created nodes.
Generally, we require $N \le M$, e.g. $M \approx 2N$, since the
insertion of whole columns results in a faster expansion rate
along columns (note that the opposite is true when we
implement the alternative of row insertion, instead of
column insertion).

    A training epoch consists of the presentation of all the

    training patterns to the sNet-SOM. A training run is defined

    as the training of the sNet-SOM with a fixed number of

    neurons at its lattice i.e. the training between successive

    node insertions/deletions.

    After the preliminary discussion we can now proceed to

    describe the sNet-SOM learning algorithms in more detail.

    The top level sNet-SOM learning algorithm is the same for

both the unsupervised and the supervised case. In
algorithmic form it can be described as:

Top-level sNet-SOM learning algorithm

1. Initialization phase
While the criteria for terminating the expansion are not fulfilled do
2.   Training run adaptation phase
3.   Expansion phase
End While
4. Fine tuning adaptation phase

    The details of the algorithm, i.e. the initialization,

    adaptation, expansion and fine tuning phases and the

    convergence criteria are described in detail below.

A. Initialization phase
The weight vectors of the four starting nodes that are
arranged in a 2×2 grid are initialized with random numbers


    within the domain of feature values (i.e. of the normalized

fluorescence ratio coefficients).

B. Training Run Adaptation phase
The purpose of this phase is to stabilize the current map

    configuration in order to be able to evaluate its effectiveness

    and the requirements for further expansion. During this

    phase, the input patterns are repeatedly presented and the

    corresponding self-organization actions are performed until

    the map converges sufficiently. The training run adaptation

    phase takes the following algorithmic form.

MapConverged := false;
while MapConverged = false do
    for all input patterns x_k do
        present x_k and adapt the map by applying the map
        adaptation rules
    end for
    Evaluate the map training run convergence condition and set
    MapConverged accordingly
end while

    Map adaptation rules

The map adaptation rules that govern the processing of each
input pattern $x_k$ are as follows:

1. Determination of the weight vector $w_i$ that is closest to
the input vector $x_k$ (i.e. of the winner node $i$).

2. Adaptation of the weight vectors $w_j$ only for the four
nodes $j$ in the direct neighborhood of the winner $i$ and for
the winner itself, according to the following formula:

$$ w_j(k+1) = \begin{cases} w_j(k) + \eta(k)\, h_k(d(i,j))\, \big(x_k - w_j(k)\big), & j \in N_k \\ w_j(k), & j \notin N_k \end{cases} $$

where the learning rate $\eta(k)$, $k \in \mathbb{N}$, is a monotonically
decreasing sequence of positive parameters, $N_k$ is the
neighborhood at the $k$th learning step and $h_k(d(i,j))$ is
the neighborhood function implementing different
adaptation rates even within the same neighborhood.

    The learning rate starts from a value of 0.1 and decreases

    down to 0.02. These values are specified with the empirical

    criterion of having relatively fast convergence, without

    however sacrificing the stability of the map.

The neighborhood function $h_k(d(i,j))$ depends on the
distance $d(i,j)$ between node $j$ and the winning node $i$. It
decreases monotonically with increasing distance from the
winning neuron (i.e. nodes closer to the winner are adapted
more), as in the standard SOM algorithm. The initial
neighborhood, $N_0$, includes the entire map.

Unlike the standard SOM, these parameters (i.e. $N_k$,
$h_k(d(i,j))$) do not need to shrink with time and can be
kept constant, i.e. $N_k = N_0$, $h_k(d(i,j)) = h_0(d(i,j))$. This

    is explained by the following: Initially, the neighborhood is

    large enough to include the whole map. The sNet-SOM

    starts with a much smaller size than a usual SOM: thus a

    large neighborhood is not required to train the whole map at

    the first learning steps (e.g. with 4 nodes initially at the map,

    a neighborhood of 1 only is required). As training proceeds,

    during subsequent training epochs, the area defined by the

neighborhood becomes localized near the winning neuron, not by shrinking the vicinity radius (as in the standard SOM)

    but by enlarging the SOM with the dynamic growing.

    Usually, we use the following simple and efficiently

computed formula for the neighborhood function (where
$r_i$, $c_i$ denote the row and column of node $i$ respectively):

$$ h_k(d(i,j)) = \begin{cases} 1, & \text{if } i = j \\ a, \ 0 < a < 1, & \text{if } |r_i - r_j| + |c_i - c_j| = 1 \\ 0, & \text{otherwise} \end{cases} $$

An alternative rectangular neighborhood that also updates
the diagonal nodes with a smaller learning rate yields
appropriate results as well:

$$ h_k(d(i,j)) = \begin{cases} 1, & \text{if } i = j \\ a, \ 0 < a < 1, & \text{if } |r_i - r_j| + |c_i - c_j| = 1 \\ b, \ 0 < b < a, & \text{if } |r_i - r_j| + |c_i - c_j| = 2 \\ 0, & \text{otherwise} \end{cases} $$
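To make the adaptation rules concrete, here is a minimal Python sketch of one learning step under the assumptions above (rectangular grid stored as an array with an allocation mask; the names and the constant neighborhood parameters are illustrative, not taken from the paper's C++ implementation):

    import numpy as np

    def adapt(weights, alive, x, eta=0.1, a=0.5):
        # weights: (N, M, d) array of node weight vectors w_j
        # alive:   (N, M) boolean mask of slots allocated to the map
        # Find the winner i: the allocated node closest to pattern x.
        d2 = np.sum((weights - x) ** 2, axis=2)
        d2[~alive] = np.inf
        i, j = np.unravel_index(np.argmin(d2), d2.shape)
        # Move the winner towards x with learning rate eta ...
        weights[i, j] += eta * (x - weights[i, j])
        # ... and its four direct neighbours with the reduced rate
        # a * eta, following the neighborhood function h_k above.
        for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            r, c = i + di, j + dj
            if 0 <= r < alive.shape[0] and 0 <= c < alive.shape[1] and alive[r, c]:
                weights[r, c] += eta * a * (x - weights[r, c])
        return i, j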

    Evaluation of the map training run convergence condition


    The map training run convergence condition is tested by

    evaluating the reduction of the total quantization error for

    the unsupervised case and of the total entropy for the

    supervised one, before and after the presentation of all the

input patterns (i.e. one training epoch). Specifically, denote
by $E_b$ and $E_a$ the errors before and after the presentation of the
patterns (the formulation for the entropies is similar). Then
the map converges when the relative change of the error
between successive epochs drops below a threshold value,
i.e.

$$ \text{MapConverged} := \frac{|E_b - E_a|}{E_a} < \text{ConvergenceErrorThreshold}. $$

The setting of the ConvergenceErrorThreshold is somewhat

    empirical but a value in the range 0.01 - 0.02 performs well

    in assuming sufficient convergence without excessive

    computation.
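The test translates directly into code (a sketch; the default threshold is the empirical 0.01-0.02 range quoted above):

    def map_converged(e_before, e_after, threshold=0.02):
        # Relative change of the total quantization error (or entropy)
        # between successive epochs.
        return abs(e_before - e_after) / e_after < threshold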

    C. Fine Tuning Adaptation Phase

    The fine tuning phase aims to optimize the final sNet-SOM

    configuration. This phase is similar to the training run

    adaptation phase described previously with two differences:

a. The criterion for map convergence is more
elaborate. We require a much smaller change of the
total quantization error (unsupervised case) or of
the total entropy (supervised case) for accepting the
condition for map convergence.

b. The learning rate decreases to a smaller value in
order to allow fine adjustments to the final
structure of the map.

    Typically, the ConvergenceErrorThreshold for the fine

    tuning phase is about 0.00001 and the learning rate is set to

    0.01 (or to an even smaller value).

D. Expansion Phase

The dynamic expansion of the sNet-SOM depends on the
availability of class labels and therefore is referred to as

    supervised expansion when class labels are available and

    unsupervised expansion if not. These processes are

    described separately below since they explore different

    expansion criteria. Moreover, the objective underlying their

    development is different for each case. The unsupervised

    expansion has the task of revealing insight onto the groups

    of genes with correlated expression patterns while the

    supervised entropy based expansion has the objective to

    reduce the computational requirements for a "pure"

supervised solution. The latter objective follows a design

    principle of the sNet-SOM: partitioning of a complex

    learning problem to domains that can be learned effectively

    with simple and computationally effective unsupervised

    models and to ones that require the utilization of a capable

    supervised model since they are characterized by complex

decision boundaries [22].

    To avoid misconception we should note that in our scheme

    the term supervised refers mainly to the fact that class

    information is a decisive factor in determining the expansion

    criterion. As we shall describe in the next section though,

class information can be exploited even in
what we term the unsupervised expansion process. The reason

for not always using the supervised expansion mode when

    class information is available, is simply explained by the

    two different objectives outlined above, i.e. if the insight to

    the structure of the gene expression data is more important

    than the classification task itself, the unsupervised approach

    is used although class information is available.

    Each of the two approaches to map expansion, the

    unsupervised and the supervised one, is described below in

    its own section.

5. The Unsupervised Expansion Process

    The unsupervised expansion is based on the detection of

    the neurons with large local error, referred to as the

unresolved neurons. A neuron is considered unresolved if
its local error $LE_i$ exceeds a threshold value, denoted by
the parameter NodeErrorForConsideringUnresolved.
Denote by $S_i$ the set of gene expression profiles $p$ mapped
to node $i$. Also, let $w_i$ be the weight vector of node $i$ that
corresponds to the average expression profile of $S_i$. Then
the local error $LE_i$ is defined as:

$$ LE_i = \sum_{p \in S_i} \lVert p - w_i \rVert^2 . $$

The local error is commonly used for implementing
dynamically growing schemes [1, 17]. However, the
peculiarities of the gene expression data motivated two
significant modifications to the classic local error measure.
Specifically:

1. Instead of the local error measure we use the
average local error $AV_i$ per pattern, i.e.

$$ AV_i = \frac{LE_i}{|S_i|} \qquad (1) $$

    This measure does not increase when many similar

    patterns are mapped to the same node. Therefore,

    the objective of assigning all the functionally similar

    genes to the same node is more easily achievable,

    even when there are many of such genes. In contrast,

    the accumulated local error increases monotonically,

    as more genes are mapped to the same node. This in

    turn can cause an undesired spreading of

    functionally similar genes to different nodes.

2. The second provision applies when we have class
information available (either complete or partial)
and we want to exploit it in order to improve the
expansion. The local error that accumulates to a
winner node is amplified by a factor that is inversely
proportional to the square root of the frequency ratio
$r_c$ of its corresponding class $c$. Specifically, let

$$ r_c = \frac{\#\text{patterns of class } c}{\#\text{total patterns}} $$

be the frequency ratio of class $c$. Then the amplification
factor is $r_c^{-1/2}$.
Therefore, the errors on the low frequency classes
account for more. As a consequence the representation
of these low frequency classes is improved. We
should note that these classes are usually of most
biological significance. The utilization of the square
root prevents the overrepresentation of the very low
frequency classes (e.g. if class A is 100 times less
frequent than B it is amplified only 10 times
more). The error measure computed after this
additional class frequency dependent weighting is
called the Class Frequency Average Local Error
(CFALE). In the absence of class information the
CFALE denotes the same quantity as the average
local error $AV_i$ defined above with
equation (1).

    This provision also confronts to some extent the

    serious problem of the creation of false positives for

    the low frequency classes by noise. Probabilistically,

    most of these noisy patterns will belong to the high

    frequency classes. However, the effect of these

    erroneously classified patterns will be attenuated

    significantly, because they are derived from the high

    frequency classes. The final result is an enhanced

    robustness to noisy patterns.
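Combining the two modifications, a minimal Python sketch of the CFALE measure for one node might look as follows (names are illustrative; without class information it reduces to the average local error of equation (1)):

    import numpy as np

    def cfale(profiles, weight, classes=None, class_freq=None):
        # profiles:   (n, d) expression profiles mapped to the node (S_i)
        # weight:     (d,) node weight vector w_i
        # classes:    optional per-profile class labels
        # class_freq: optional dict class -> frequency ratio r_c
        errors = np.sum((profiles - weight) ** 2, axis=1)
        if classes is not None and class_freq is not None:
            # amplify each error by r_c ** -(1/2): errors on the low
            # frequency classes account for more
            amp = np.array([class_freq[c] ** -0.5 for c in classes])
            errors = errors * amp
        return errors.sum() / len(profiles)   # average per pattern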

    Nodes that are selected as winners for very few (usually one

    or two) training patterns, termed uncolonized nodes, are not

    deleted by our scheme although they probably correspond to

    noisy outliers. The gene expression patterns that consistently

    (three times or more) are mapped to uncolonized nodes are

very unique: they can either be artifacts or, if not, they have the potential to provide biological knowledge. Therefore they

    are amenable to further consideration. These patterns

    therefore are marked and isolated for further study.

    Nodes that are not selected as winners for any pattern are

    removed from the map in order to keep it compact.

    The steps of the unsupervised expansion process are as

    follows:

U.1. Computation of the CFALE measures for every node i.
repeat
U.2.   let i = the node with the maximum CFALE measure
U.3.   if IsBoundaryNode(i) then
           // expand at the neighbouring boundary nodes
U.4.       JoinSmoothlyNeighbours(i)
U.5.   else if IsNearBoundaryNode(i) then
U.6.       RippleWeightsToNeighbours(i)
U.7.   else InsertWholeColumn(i);
       end if
U.8.   Reset the local error measures.
U.9.   Re-execute the Training Run Adaptation
       Phase for the expanded map by presenting all the training
       patterns.
until not RandomLikeClustersRemain();

Figure 1 Illustration of the weight initialization of a
new node with the function JoinSmoothlyNeighbours(). The
new node is allocated to join the map from the
boundary nodes. The weight of the new node is initialized by
computing the average weight $W_{av}$ near the node $(r, c)$
which initiates the expansion, according to the empirical
formula:

$$ W_{av} = \frac{h_f\, A_N(r, c-1)\, W_{r,c-1} + v_f\, A_N(r-1, c)\, W_{r-1,c} + v_f\, A_N(r+1, c)\, W_{r+1,c}}{h_f\, A_N(r, c-1) + v_f\, A_N(r-1, c) + v_f\, A_N(r+1, c)} $$

where $A_N(i, j)$ is a boolean flag that denotes that the node
$(i, j)$ has been allocated to the growing structure, and $h_f$ and
$v_f$ are the horizontal and vertical factors of weighting
across the corresponding dimensions. Then the weight of the
new node $N$ is estimated as:

$$ W_N \equiv W_{r,c+1} = W_{r,c} + (W_{r,c} - W_{av}) . $$

The concept of direction of weight growing is maintained
by considering more the node in the direction of growth
(horizontal in the case illustrated), i.e. $h_f > v_f$; typically
$h_f = 1$, $v_f = 0.5$.

We briefly describe below the main issues involved in these
steps. The repeat loop controls the sNet-SOM expansion.

    The criteria for the establishment of the proper level of

    expansion are described in the section that follows. The

    function IsBoundaryNode() checks whether a node is a

    boundary node. Training efficiency and implementation

    simplicity were the motivations for the decision to expand

    mostly from the boundary nodes. The expansion of the map

    at the boundary nodes is straightforward: One to three nodes

    are created and the weights of the new nodes are adjusted

    heuristically to retain the weight flow with the function

JoinSmoothlyNeighbours(), whose operation is illustrated in
Fig. 1.

    The map configuration is slightly disturbed when the winner

    node is not a boundary node but is a near boundary node. A

    node is considered near boundary (declared by the function

    IsNearBoundaryNode()) when the boundary of the map can

    be reached from this node by traversing in any direction at

    most two nodes.
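One plausible reading of these two predicates on the N×M slot grid, sketched in Python (the allocation mask follows the earlier sketches; the paper does not give this code):

    def is_boundary_node(alive, i, j):
        # A node is on the boundary if some direct neighbour slot is
        # outside the grid or not yet allocated.
        n, m = alive.shape
        for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            r, c = i + di, j + dj
            if not (0 <= r < n and 0 <= c < m) or not alive[r, c]:
                return True
        return False

    def is_near_boundary_node(alive, i, j):
        # The boundary can be reached by traversing at most two nodes
        # in some direction.
        n, m = alive.shape
        for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            for step in (1, 2):
                r, c = i + di * step, j + dj * step
                if 0 <= r < n and 0 <= c < m and alive[r, c] \
                        and is_boundary_node(alive, r, c):
                    return True
        return False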

    For a near boundary node a percentage (usually 20-50%) of

    the weight of the winner node is shifted towards the outer

    nodes (with the function RippleWeightsToNeighbours()).

    This operation alters locally the Voronoi regions and usually

    with a few weight rippling operations the winner node is

propagated to a boundary node (which is located nearby).

    Finally, if the winner is a node that is neither a boundary nor

    a near boundary the alternative of inserting a whole empty

    column is used. The rippling of weights is avoided in these

    cases, because usually excessive computation times are

    required before the winner propagates from a node placed

    deep in the map to a boundary node. Instead of inserting

    whole new columns we can insert alternatively whole new

    rows, or we can perform a combination of row and column

    insertion. The operation of the corresponding

    InsertWholeColumn() function is illustrated by Figure 2.


Figure 2 Grow by grid insertion in the direction of largest
total error, i.e.
if $E_{i-1,j-1} + E_{i,j-1} + E_{i+1,j-1} > E_{i-1,j+1} + E_{i,j+1} + E_{i+1,j+1}$,
where $E_{i,j}$ is the error measure of node $(i,j)$ (i.e. CFALE),
then insert the new column at the left of column j, else insert
the new column at the right of column j.

    Criteria for controlling the sNet-SOM dynamic growing

    One of the most critical tasks for the effective analysis of

    gene expression data with the sNet-SOM is the proper

    definition of the parameters that control the growing

    process. The objective is to automatically reach the

    appropriate level of expansion and then to allow some fine

    tuning of the resolution level by the molecular biologists.

    Systematically, the design of the criteria for stopping the

    growing process can be approached by evaluating a

    statistical distance threshold for gene expression patterns,

    below which two genes can be considered as functionally

    similar. When the average distance between patterns in a

    cluster drops below this value, the clustering together of

    these particular gene expression patterns corresponds to

nonrandom behavior and therefore interesting information can be extracted by analyzing them.

To this end, we define a confidence level $a$ from which we
derive a threshold $D_{thr}$ for the distance between gene
expression patterns. The confidence level has the
meaning that the probability of taking two random unrelated
expression profiles as functionally similar (i.e. of allocating
them to the same cluster) is lower than $a$ if the distance
between them is smaller than the threshold $D_{thr}$.

Obviously, the definition of a statistical confidence level

    would be only possible if the distribution of the distance

    between random expression vectors were known.

    Practically, although the distribution is unknown, it is easy

to approximate it. Specifically, we randomly shuffle the
experiment points of every expression profile,
both by gene and by experiment point. This randomization

    destroys the correlation between the different profiles, while

    it retains the other characteristics of the data set (e.g. ranges

    and histogram distribution of values). In this way, we

    compute an approximation of the distribution of the distance

    between random patterns.
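A sketch of this randomization, assuming the Manhattan distance chosen below (function and parameter names are illustrative):

    import numpy as np

    def random_distance_distribution(data, n_pairs=100000, rng=None):
        # data: (genes, experiments) matrix of log ratios.
        # Shuffling each sampled profile destroys the correlation
        # between profiles while retaining the ranges and histogram
        # of the values.
        rng = np.random.default_rng() if rng is None else rng
        n = len(data)
        dists = np.empty(n_pairs)
        for k in range(n_pairs):
            a = rng.permutation(data[rng.integers(n)])
            b = rng.permutation(data[rng.integers(n)])
            dists[k] = np.abs(a - b).sum()   # Manhattan distance
        return dists

    # The threshold D_thr for confidence level a is then a low
    # quantile of this distribution, e.g.
    # D_thr = np.quantile(random_distance_distribution(X), 0.05)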

It is evident that larger (smaller) correlation between genes
corresponds to smaller (larger) Euclidean or Manhattan
distance. Assume that we have chosen the Manhattan
distance measure. Then, as Figure 3 illustrates, the distances
lying in the interval $[v_l, v_h]$ are considered as random.
Also, for distances smaller than $v_l$ a positive correlation
between genes is implied, while it is reasonable to assume
that the converse holds for distances larger than $v_h$.

Figure 3 The results of the data shuffling illustrate that the
distances between the randomized data occupy a distinct
distribution. For the gene expression data positive
correlation is favored, while for the randomized data the
distribution has a normal form.

The distributions of randomized and original gene
expression patterns displayed in Figure 3 are used to
implement the criteria on which the function

    RandomLikeClustersRemain() is based. This function

    evaluates the randomness of the genes allocated to one

    cluster by computing all the pairwise distances between

    them. If a considerable number of these distances are

random according to the specified significance level, then the
cluster is considered to contain unrelated genes and therefore
further decomposition is required. The percentage of

    random pairwise distances above which we consider the

    cluster as random, is specified empirically to a value of 5%.

    Clearly, the smaller the required percentage parameter, the

    larger the decomposition level becomes. The

    aforementioned value (i.e. 5%) produces well behaved, from

    a biological perspective, extensions of the map.
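A sketch of this per-cluster test (the 5% percentage parameter is the empirical value given above; v_l is the lower bound of the random interval from Figure 3):

    import numpy as np

    def cluster_is_random(profiles, v_l, random_fraction=0.05):
        # True if more than random_fraction of the pairwise Manhattan
        # distances within the cluster are not below v_l, i.e. look
        # random under the shuffled-data distribution.
        n = len(profiles)
        if n < 2:
            return False
        dists = [np.abs(profiles[i] - profiles[j]).sum()
                 for i in range(n) for j in range(i + 1, n)]
        return np.mean(np.asarray(dists) >= v_l) > random_fraction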

    6. The Supervised Expansion Process

The supervised expansion is based on the computation of the
class assignment for each node i, and of a parameter $HN_i$
characterizing the entropy of this assignment. This
parameter is derived according to Equation 2 that is
discussed below. An advantage of the entropy is that it is
relatively insensitive to the overrepresentation of classes, i.e.
independently of how many patterns of a class are mapped
to the same node, if the node does not represent other
classes, its entropy is zero.

The expansion phase consists of the following steps:

S.1. Computation of the class labels and entropies $HN_i$ for
     the map nodes. The ambiguity of class assignment for
     the genes of node i is quantified by $HN_i$.
repeat
S.2.   Evaluation of the map over the whole training set in
       order to compute the approximation performance
       CurrentApproximationPerformance.
S.3.   if CurrentApproximationPerformance <
       ThresholdOnApproximationPerformance then
       // resolve better the difficult regions of the state space in
       // which classification decisions cannot be deduced easily
S.3.1.     let i = the node of highest ambiguity (i.e. largest
           entropy parameter).
S.3.2.     if IsBoundaryNode(i) then
               join smoothly the neighbours to node i
           else if node i is near the boundary then
               RippleWeightsToNeighbours(i)
           else InsertWholeColumn(i);
           end if
           Reset the entropy measures.
           Apply the Map Adaptation phase to the expanded
           map.
       end if
until CurrentApproximationPerformance >=
      ThresholdOnApproximationPerformance
S.4. Generate training and testing sets for the supervised
expert. Further supervised training will be performed with
these sets by the supervised learning algorithm in order to
better resolve the ambiguous parts of the state space.

As already mentioned, the objectives of the supervised
expansion differ from those of the unsupervised one. While

    the unsupervised aims to insert nodes in order to detect

    interesting clusters of genes, the supervised extension is

    concentrating at the task of revealing the class decision

    boundaries.

The supervised expansion exploits well the topological

    ordering that the basic SOM provides and increases the

    resolution of the representation over the regions of the state

    space that lie near class boundaries. At this point, it should

    be emphasized that simply increasing the SOM size with the

    adaptive extension algorithm until each neuron represents

unambiguously a class (i.e., zero $HN_i$ for all the nodes)


yields a SOM configuration that, although it fits the training
set, fails to generalize well.

The ambiguous neurons, i.e. those neurons for which the
uncertainty of class assignment is significant, are identified

    with the entropy criterion. The dynamic expansion phase of

    the sNet-SOM is executed until the approximation

    performance reaches the required level. Afterwards, training

    and testing sets are created for the supervised expert. These

    sets consist only of the patterns that are represented by the

    ambiguous neurons. These neurons correspond to state

    space regions on which classification decisions cannot be

    deduced easily.

The classification task proceeds by feeding the pattern to the
sNet-SOM. If the winning neuron is one that is not

    ambiguous, the sNet-SOM classifies by using the class of

    the winning neuron. In the opposite case, the supervised

    expert is used to perform the classification decision.

The assignment of a class label to each neuron of the sNet-

    SOM is performed according to a majority-voting scheme

    [20]. This scheme acts as a local averaging operator defined

    over the class labels of all the patterns that activate that

    neuron as the winner (and accordingly are located at the

    neighborhood of that neuron). The typical majority-voting

    scheme considers one vote for each winning occurrence. An

    alternative more "analog" weighted majority voting scheme

    weights the votes each by a factor that decays with the

distance of the voting pattern from the winner (i.e. the
larger the distance, the weaker the vote). The averaging

    operation of the majority and weighted majority voting

    schemes effectively attenuates the artifacts of the training

    set patterns. As noted, in order to enhance the representation

    of rare gene expression patterns we amplify the vote of each

    pattern with a coefficient that is proportional to the inverse

    of the frequency of appearance of that class.

In the context of the sNet-SOM the utilization of either majority

    or weighted majority voting is essential. These schemes

    allow the degree of class discrepancy for a particular neuron

    to be readily estimated. Indeed, by counting the votes at

    each SOM neuron for every class, an entropy criterion that

    quantifies the uncertainty of the class label of neuron can

    be directly evaluated, as [ ]:

    m

    16

    cN

    k

    kk ppmHN

    1

    log)( , (2)

    where denotes the number of classes andcNtotal

    kk

    V

    Vp ,

    is the ratio of votes for class to the total number of

    votes to neuron m.

    kV k

    totalV

    Clearly, the entropy is zero for unambiguous neurons and

    increases as the uncertainty about the class label of the

neuron increases. The upper bound of $HN(m)$ is $\log(N_c)$,

    and corresponds to the situation where all classes are

    equiprobable (i.e. the voting mechanism does not favor a

    particular class). Consequently, within the framework posed

    by these voting schemes, the regions of the SOM that are

    placed at ambiguous regions of the state space can be easily

    identified. For these regions, the supervised expert is

    designed and optimized for obtaining adequate

    generalization performance.

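A small sketch of Equation 2 over a neuron's vote counts (names are illustrative; natural logarithms shown, so the upper bound is log(N_c)):

    import math

    def node_entropy(votes):
        # votes: dict mapping class label -> (possibly weighted) vote
        # count for this neuron. Zero for an unambiguous neuron.
        total = sum(votes.values())
        return -sum((v / total) * math.log(v / total)
                    for v in votes.values() if v > 0)

    # e.g. node_entropy({"TCA": 7, "Resp": 1}) is small, while
    # node_entropy({"TCA": 4, "Resp": 4}) == math.log(2)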

    7. Results and Discussion

We have applied the sNet-SOM to analyze publicly available

    microarray expression data from the budding yeast

    Saccharomyces cerevisiae. This fully sequenced organism

    was studied during the diauxic shift, the mitotic cell division

    cycle, sporulation, and temperature and reducing shocks by

using microarrays containing essentially every Open Reading Frame (ORF). The data set on which we performed

    extensive experiments consists of 2467 genes for which

    there exists currently functional annotation in the

    Saccharomyces Genome Database. The weighted K-nearest

neighbors imputation method presented in [25] is applied in
order to fill up systematically the missing values.

    Microarray gene expression data sets are large, complex,

    contain many attributes and have an unknown internal

structure. For that reason, gaining insight into the structure of the
data is the initial objective of most analysis methods, rather than

    the classification of the data itself. The sNet-SOM meets

    this objective by:

- Achieving high quality and computationally efficient
clustering of the gene expression profiles
with the exploitation of either supervised or
unsupervised clustering criteria.

- Offering extensive visualization capabilities with
the irregular two-dimensional growable grid that
the basic structure provides, which can be
complemented with Sammon's nonlinear
mapping [21].

The sNet-SOM is not only a clustering tool but additionally a
classification tool, although in this case the sNet-SOM
does not claim to directly compete with capable supervised
models like the Radial Basis Support Vector Machine
[16, 26, 6]. The sNet-SOM rather aims to complement them by
reducing the complexity of the problem that remains for the
pure supervised solution. Taking into account the size and
the complexity of the gene expression data set, this
reduction proves essential. The results of the last two rows
of Table 1 demonstrate this computational benefit.

Learning Model                                   Training Time
SOM 5×5                                          8 min
SOM 10×10                                        65 min
Unsupervised sNet-SOM                            25 min
Supervised sNet-SOM                              28 min
Unsupervised sNet-SOM with column insertion      19 min
Supervised sNet-SOM with column insertion        23 min
SVM                                              3 hours 15 min
Supervised sNet-SOM with SVM for the
ambiguous patterns                               50 min

Table 1 A comparison of the execution times for the
learning of the gene expression data set. The results were
obtained with a 450 MHz Pentium III PC.

Table 1 illustrates that the sNet-SOM trains faster than the

conventional SOM. The first two rows are the execution
times for a SOM with a grid of size 5×5 and 10×10
respectively. Also, the unsupervised sNet-SOM trains
slightly faster than the supervised one (3rd and 4th rows).
Furthermore, the utilization of column (row) insertion
provides further performance advantages (5th and 6th rows).

The Support Vector Machine takes the longest time for the
training on the whole data set (7th row). The implementation
approach of [19], as implemented with the SVMLight
software package, was used for the SVM solution. Finally,
the supervised sNet-SOM combined with the SVM
resolution of the difficult parts of the state space obtains
significantly better learning times without sacrificing the
quality of the results. The SVM classification results are
similar to those published in [6] and are therefore not
repeated here.

The supervised phase was trained with the same functional
classes as in [6]. This allows us to perform some comparisons
relating the performance of the sNet-SOM to other
methods. These classes are summarized in Table 2. The
functional classifications were obtained from the Munich
Information Center for Protein Sequences (MIPS) yeast genome
database (http://www.mips.biochem.mpg.de/proj/yeast).

1. Tricarboxylic-acid pathway (TCA)
2. Respiration-chain complexes (Resp)
3. Cytoplasmic ribosomal proteins (Cyto)
4. Proteasome (Proteas)
5. Histones (Hist)
6. Helix-turn-helix (HTH)

    Table 2 Functional classes used for supervised sNet-SOM

training. The tricarboxylic-acid pathway, also known as the
Krebs cycle, consists of genes that encode enzymes that

    break down pyruvate (produced from glucose) by oxidation.

    The respiration chain complexes perform oxidation-

    reduction reactions that capture the energy present in

    NADH through electron transport and the chemiosmotic

    synthesis of ATP. The cytoplasmic ribosomal proteins are a


    class of proteins required to make the ribosome. The

    proteasome consists of proteins that perform the

    degradation of proteins. Histones interact with the DNA

    backbone to form nucleosomes. These nucleosomes are an

essential part of the chromatin of the cell. Finally, the helix-
turn-helix class is not a functional class. It consists of genes

    that code for proteins containing the helix-turn-helix

    structural motif. This class is included as a control class.

In the presented supervised sNet-SOM training experiment

    we used six functional classes from the MIPS Yeast

    Genome Database: tricarboxylic acid (TCA) cycle,

    respiration, cytoplasmic ribosomes, proteasome, histones

    and helix-turn-helix (HTH) proteins. The first five classes

    represent categories of genes that on biological grounds is

    expected to induce similar expression characteristics. The

    sixth class, i.e. the helix-turn-helix proteins is used as a

    control group. Since there is not any biological justification

    for a mechanism that enforces the genes of this class to the

    same patterns of expression, we expect these genes to be

    spread to diverse clusters by the sNet-SOM.

    The measure of Entropy of Class Representation is

    evaluated over the sNet-SOM nodes in order to quantify the

    dispersion of class representation. We expect this measure to

    be large in the case of HTH, expressing the diversity of the

HTH gene expression patterns. Indeed, the results of Table 3

    support this intuitive expectation. The high entropy of the

    Unassigned class is due to the fact that this class

    accumulates hundreds of other known functional classes and

    all the unknown ones.

Table 3 The entropies of class representations at the sNet-
SOM configuration of Figure 7.

Class                                            Entropy
1. Tricarboxylic-acid pathway (TCA)              1.96
2. Respiration-chain complexes (Resp)            1.82
3. Cytoplasmic ribosomal proteins (Ribo)         1.21
4. Proteasome (Proteas)                          0.51
5. Histones                                      0.60
6. Helix-turn-helix                              2.78
7. Rest and functionally unknown
   classes (Unassigned)                          4.13

    Many functional classes of genes present strong similarity of

    their gene expression patterns. This is evident in Figure 4,

    where we can observe a high similarity of the gene

    expression patterns of the class Ribo. The identities

    (identifier, description and functional class) of the genes of

Figure 4 are displayed in Figure 6.

    Figure 7 illustrates a snapshot of the progress of the learning

    process. Each sNet-SOM node is colored according to the

predominant class. In addition, three numbers are continuously updated for each node. The first is the numeric identifier of the prevailing class. The second depends on the type of training: for supervised training it is the entropy of the node. Nodes with high entropy lie near class separation boundaries, and their patterns can be used to train efficient supervised models, such as Support Vector Machines, for the effective discrimination of these parts of the state space.

    For the unsupervised expansion this number is a resource

    count (usually the local quantization error) that controls the

    positions of the dynamic expansion. Finally, the third

    number is the number of patterns mapped to the node.
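The following batch sketch shows how these three per-node quantities could be recomputed from the current codebook; the actual sNet-SOM maintains them incrementally during training, and the array shapes and names below are our assumptions.

    import numpy as np

    def node_statistics(weights, patterns, labels=None):
        """For every node return (prevailing class, entropy or resource count,
        number of mapped patterns). weights: (n_nodes, dim) codebook vectors;
        patterns: (n_patterns, dim); labels: optional integer class labels."""
        # distance of every pattern to every node; the winner is the closest node
        d = np.linalg.norm(patterns[:, None, :] - weights[None, :, :], axis=-1)
        winner = d.argmin(axis=1)
        stats = []
        for n in range(weights.shape[0]):
            idx = np.flatnonzero(winner == n)
            if labels is not None and idx.size > 0:
                counts = np.bincount(labels[idx])
                p = counts[counts > 0] / idx.size
                second = float(-(p * np.log2(p)).sum())   # node entropy (bits)
                prevailing = int(counts.argmax())
            else:
                second = float(d[idx, n].sum())           # resource count (local error)
                prevailing = -1                           # no class information
            stats.append((prevailing, second, idx.size))
        return stats, winner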

    Figure 8 displays a listbox with the characteristics of all the

    nodes of the sNetSOM. The first two columns are the grid

    coordinates of the node. The third column is the entropy of

    the node and the fourth is the number of genes mapped to

    the node. Finally, the last column is the name of the class

    that the node represents. The biologist can obtain further

    information about the genes mapped to the node by selecting

    the corresponding element of the listbox. The main

    parameters that control the operation of the sNet-SOM are

user-defined through a parameter setting screen, illustrated in Figure 9.


    8. Conclusions and future work

    This work has presented a new self-growing adaptive neural

    network model for the analysis of genome-wide expression

data. This model, called sNet-SOM, elegantly overcomes the main drawback of most existing clustering methods, which impose an a priori specification of the number of

    clusters. The sNet-SOM determines adaptively the number

    of clusters with a dynamic extension process which is able

    to exploit class information whenever available.

    The sNet-SOM grows within a rectangular grid that

    provides the potential for the implementation of efficient

    training algorithms. The expansion of the sNet-SOM is

    based on an adaptive process. This process grows nodes at

    the boundary nodes, ripples weights from the internal nodes

    towards the outer nodes of the grid, and inserts whole

    columns within the map. The growing algorithm is simple

    and computationally effective. It prefers to grow from the

    boundary nodes in order to minimize the map readjustment

    operations. However, a mechanism for whole column (row)

insertion is implemented in order to deal with the case in which a large map must be expanded around a point deep within its interior. The growing process determines automatically the appropriate level of expansion, so that the similarity between the gene expression patterns of the same cluster fulfills a designer-definable statistical confidence level of not being a random event. The voting schemes for

    the winner node have been designed in order to amplify the

    representation of rare gene expression patterns.
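One way such a confidence criterion might be realized is sketched below; this is our own illustrative reading, assuming Pearson correlation as the similarity measure and the standard t-test for the significance of a correlation coefficient over m experimental conditions. The text does not commit to this particular test.

    import numpy as np
    from scipy import stats

    def cluster_is_coherent(profiles, alpha=0.05):
        """Test whether the mean pairwise correlation of the expression
        profiles mapped to a node could plausibly be a random event.
        profiles: (n_genes, m_conditions) array. Grow further while False."""
        m = profiles.shape[1]
        r = np.corrcoef(profiles)                    # pairwise correlations
        r_mean = r[np.triu_indices_from(r, k=1)].mean()
        # t statistic of a correlation coefficient, m - 2 degrees of freedom
        t = r_mean * np.sqrt((m - 2) / max(1e-12, 1.0 - r_mean ** 2))
        p = 2 * stats.t.sf(abs(t), df=m - 2)
        return p < alpha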

    A novel feature of the sNet-SOM compared with other

related approaches is its potential for the effective exploitation of the available class information with an entropy-based measure that controls the dynamical extension

    process. This process extracts information about the

    structure of the decision boundaries. A supervised network

can additionally be connected in order to better resolve the difficult parts of the state space. This hybrid approach

    (i.e. unsupervised competitive learning for the simple parts

    of the state space and supervised for the difficult ones) can

compete in performance with advanced supervised learning models at a much lower computational cost. In essence, the

    sNet-SOM can utilize the pure supervised machinery only

    where it is needed, i.e. for the construction of complex

    decision boundaries over regions of the state space where

    patterns cannot be separated easily.
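A sketch of this division of labor follows, reusing the hypothetical node_statistics routine sketched earlier: patterns falling on low-entropy nodes keep the prevailing class of their node, while the patterns of high-entropy nodes are gathered to train the supervised stage. The 1.0-bit threshold is an arbitrary illustrative choice.

    import numpy as np

    def split_for_hybrid_training(stats, winner, patterns, labels, h_threshold=1.0):
        """Route the patterns of high-entropy (boundary) nodes to the
        supervised stage; stats and winner come from node_statistics above."""
        ambiguous = [n for n, (_, h, cnt) in enumerate(stats)
                     if cnt > 0 and h > h_threshold]
        mask = np.isin(winner, ambiguous)
        # patterns[mask] form the training set of the supervised (e.g. SVM)
        # stage; the rest are classified directly by their winner node
        return patterns[mask], labels[mask]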

Another way to incorporate supervised learning into the sNet-

    SOM is to use the nodes as Radial Basis Function centers

    and to model the classification of a gene as a nonlinear

    function of the gene expression templates represented by

    the adjacent nodes. This approach resembles qualitatively

the supervised harvesting approach of [15]. The node

    average profiles can be used as inputs to a supervised phase.

This reduces the redundancy of information and prevents overfitting of the training set. Proper parameters of these centers can be estimated by heuristic criteria such as signal counters, local errors, and node entropies, which provide valuable local information.
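The sketch below illustrates this RBF variant under our own assumptions: Gaussian basis functions centered on the codebook vectors, per-center widths derived from local node statistics, and a ridge-regularized least-squares output layer; none of these choices is prescribed here.

    import numpy as np

    def rbf_design_matrix(patterns, centers, widths):
        """Gaussian RBF activations with the sNet-SOM codebooks as centers;
        widths may be set, e.g., from each node's local quantization error."""
        d2 = ((patterns[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * widths[None, :] ** 2))

    def fit_rbf_classifier(patterns, one_hot_labels, centers, widths, ridge=1e-3):
        """Ridge-regularized least-squares output layer on the fixed centers."""
        Phi = rbf_design_matrix(patterns, centers, widths)
        A = Phi.T @ Phi + ridge * np.eye(Phi.shape[1])
        W = np.linalg.solve(A, Phi.T @ one_hot_labels)
        return W   # predict with rbf_design_matrix(x, centers, widths) @ W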


The sNet-SOM dynamical extension algorithm remains similar in the more usual case, in the context of gene expression analysis, where no classification information is available. In this case, criteria based on the computation of local variances or resource counts are implemented.
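A possible form of the unsupervised growth decision, consistent with the boundary-growth and column-insertion mechanisms described above, is sketched below; the threshold policy and data layout are our assumptions.

    def expansion_candidates(resource, grid_shape, threshold):
        """Grid positions whose resource count (accumulated local quantization
        error) exceeds the growth threshold; resource is a 2-D numpy array."""
        rows, cols = grid_shape
        hot = [(i, j) for i in range(rows) for j in range(cols)
               if resource[i, j] > threshold]
        # growing from boundary nodes minimizes map readjustments; a hit deep
        # in the interior instead triggers whole column (or row) insertion
        boundary = [(i, j) for (i, j) in hot
                    if i in (0, rows - 1) or j in (0, cols - 1)]
        return boundary if boundary else hot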

    Moreover, in order to enhance the exploratory potential of

    the sNet-SOM for the analysis of the gene expression data,

    we have adapted the Sammon distance preserving nonlinear

    mapping. The Sammon mapping allows an effective

visualization of the intrinsic structure of the sNet-SOM codebook vectors even in the unsupervised case. We will provide an extensive discussion of the application of the Sammon mapping in the context of the sNetSOM for the effective visualization of gene expression data in a forthcoming work. Also, another main direction for the

    improvement of the sNetSOM performance is the

    incorporation of more advanced distance metrics to its

algorithms, such as the Bayesian one proposed in [18]. The incorporation of the presented sNet-SOM dynamic growing algorithms as a front-end processing stage within Bayesian network structure learning algorithms [13] is also an open area for future work.
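For reference, a minimal gradient-descent sketch of the Sammon projection of the codebook vectors follows; the random initialization, learning rate, and iteration count are our assumptions, and the rows of X are assumed distinct so that all input-space distances are nonzero.

    import numpy as np

    def sammon(X, n_iter=500, lr=0.3, seed=0):
        """Project the rows of X to 2-D by gradient descent on Sammon's stress
        E = (1/c) * sum_{i<j} (D_ij - d_ij)^2 / D_ij with c = sum_{i<j} D_ij."""
        rng = np.random.default_rng(seed)
        n = X.shape[0]
        D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        c = D[np.triu_indices(n, 1)].sum()
        Dn = D.copy()
        np.fill_diagonal(Dn, 1.0)                  # avoid division by zero
        Y = rng.normal(scale=1e-2, size=(n, 2))
        for _ in range(n_iter):
            diff = Y[:, None, :] - Y[None, :, :]
            d = np.linalg.norm(diff, axis=-1)
            np.fill_diagonal(d, 1.0)
            ratio = (Dn - d) / (d * Dn)            # zero on the diagonal below
            np.fill_diagonal(ratio, 0.0)
            grad = -(2.0 / c) * (ratio[:, :, None] * diff).sum(axis=1)
            Y -= lr * grad
        return Y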


    ACKNOWLEDGEMENTS

    The authors wish to thank the Research Committee of the

    University of Patras for the partial financial support of this

research with the contract Karatheodoris 2454.

    References

[1] Alahakoon Damminda, Halgamuge Saman K., Srinivasan Bala, "Dynamic Self-Organizing Maps with Controlled Growth for Knowledge Discovery", IEEE Transactions on Neural Networks, Vol. 11, No. 3, pp. 601-614, May 2000.
[2] Azuaje Francisco, "A Computational Neural Approach to Support the Discovery of Gene Function and Classes of Cancer", IEEE Transactions on Biomedical Engineering, Vol. 48, No. 3, pp. 332-339, March 2001.
[3] Bezerianos A., Vladutu L., Papadimitriou S., "Hierarchical State Space Partitioning with the Network Self-Organizing Map for the effective recognition of the ST-T Segment Change", Medical & Biological Engineering & Computing, Vol. 38, pp. 406-415, 2000.
[4] Bishop C. M., Neural Networks for Pattern Recognition, Clarendon Press, Oxford, 1996.
[5] Brazma Alvis, Vilo Jaak, "Gene expression data analysis", FEBS Letters, 480, pp. 17-24, 2000.
[6] Brown Michael P. S., Grundy William Noble, Lin David, Cristianini Nello, Sugnet Charles Walsh, Furey Terrence S., Ares Manuel Jr., Haussler David, "Knowledge-based Analysis of Microarray Gene Expression Data By Using Support Vector Machines", Proceedings of the National Academy of Sciences, Vol. 97, No. 1, pp. 262-267, 2000.
[7] Campos Marcos M., Carpenter Gail A., "S-TREE: self-organizing trees for data clustering and online vector quantization", Neural Networks, 14, pp. 505-525, 2001.
[8] Cheeseman P., Stutz J., "Bayesian Classification (AutoClass): Theory and results", in Fayyad U., Piatetsky-Shapiro G., Smyth P., Uthurusamy R. (eds), Advances in Knowledge Discovery and Data Mining, pp. 153-180, AAAI Press, Menlo Park, CA, 1995.
[9] Cheng Guojian, Zell Andreas, "Externally Growing Cell Structures for Data Evaluation of Chemical Gas Sensors", Neural Computing & Applications, 10, pp. 89-97, 2001.
[10] Cheung Vivian G., Morley Michael, Aguilar Francisco, Massimi Aldo, Kucherlapati Raju, Childs Geoffrey, "Making and reading microarrays", Nature Genetics Supplement, Vol. 21, January 1999.
[11] Durbin R., Eddy S., Krogh A., Mitchison G., Biological Sequence Analysis, Cambridge University Press, 1998.
[12] Eisen Michael B., Spellman Paul T., Brown Patrick O., Botstein David, "Cluster analysis and display of genome-wide expression patterns", Proc. Natl. Acad. Sci. USA, Vol. 95, pp. 14863-14868, December 1998.
[13] Friedman N., Linial M., Nachman I., Pe'er D., "Using Bayesian networks to analyze expression data", Journal of Computational Biology, 7, pp. 601-620, 2000.
[14] Fritzke Bernd, "Growing Grid - a self-organizing network with constant neighborhood range and adaptation strength", Neural Processing Letters, Vol. 2, No. 5, pp. 9-13, 1995.
[15] Hastie Trevor, Tibshirani Robert, Botstein David, Brown Patrick, "Supervised harvesting of expression trees", Genome Biology, 2 (1), 2001, http://genomebiology.com/2001/2/I
[16] Haykin S., Neural Networks, Prentice Hall International, Second Edition, 1999.
[17] Herrero Javier, Valencia Alfonso, Dopazo Joaquin, "A hierarchical unsupervised growing neural network for clustering gene expression patterns", Bioinformatics, Vol. 17, No. 2, pp. 126-136, 2001.
[18] Hunter Lawrence, Taylor Ronald C., Leach Sonia M., Simon Richard, "GEST: a gene expression search tool based on a novel Bayesian similarity metric", Bioinformatics, Vol. 17, Suppl. 1, pp. 115-122, 2001.
[19] Joachims Thorsten, "Making Large-Scale SVM Learning Practical", in Scholkopf Bernhard, Burges Christopher J. C., Smola Alexander J. (eds), Advances in Kernel Methods - Support Vector Learning, MIT Press, Cambridge, USA, 1998.
[20] Kohonen T., Self-Organizing Maps, Springer-Verlag, Second Edition, 1997.
[21] Pal Nikhil R., Eluri Vijay Kumar, "Two Efficient Connectionist Schemes for Structure Preserving Dimensionality Reduction", IEEE Transactions on Neural Networks, Vol. 9, No. 6, pp. 1142-1154, November 1998.
[22] Papadimitriou S., Mavroudi S., Vladutu L., Bezerianos A., "Ischemia Detection with a Self-Organizing Map Supplemented by Supervised Learning", IEEE Transactions on Neural Networks, Vol. 12, No. 3, pp. 503-515, May 2001.
[23] Si J., Lin S., Vuong M. A., "Dynamic topology representing networks", Neural Networks, 13, pp. 617-627, 2000.
[24] Tamayo P., Slonim D., Mesirov J., Zhu Q., Kitareewan S., Dmitrovsky E., Lander E. S., Golub T. R., "Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation", Proc. Natl. Acad. Sci. USA, 96, pp. 2907-2912, 1999.
[25] Troyanskaya Olga, Cantor Michael, Sherlock Gavin, Brown Pat, Hastie Trevor, Tibshirani Robert, Botstein David, Altman Russ B., "Missing value estimation methods for DNA microarrays", Bioinformatics, Vol. 17, No. 6, 2001.
[26] Vapnik V. N., Statistical Learning Theory, John Wiley & Sons, New York, 1998.
[27] Vesanto Juha, Alhoniemi Esa, "Clustering of the Self-Organizing Map", IEEE Transactions on Neural Networks, Vol. 11, No. 3, pp. 586-600, May 2000.


Figure 4 The expression profiles of the genes clustered at an sNet-SOM node of class Ribo. A few patterns of the remaining classes that present very similar expression profiles also map to this node.


Figure 5 The average expression profile for the genes plotted in Figure 4.


Figure 6 The identities of the genes as plotted in Figure 4, from the back of the figure towards its front (in the 3D view). The biologist can easily extract useful information about which of the genes of the unassigned class present expression profiles similar to those of the genes of class Ribo.


Figure 7 The outline of the configuration of the growing sNetSOM is displayed graphically, illustrating the progress of the learning process to the user. The nodes that represent the Helix-Turn-Helix class are colored blue. It is visually evident that these nodes are much more dispersed than the differently colored nodes that represent the other classes.


Figure 8 The listbox that displays the characteristics of the nodes of the sNetSOM. The first two columns are the grid coordinates of the node. The third column is the entropy of the node and the fourth is the number of genes mapped to the node. Finally, the last column is the name of the class that the node represents.


Figure 9 The parameter configuration screen allows the user to directly control the main parameters of the sNetSOM.