

    Gene Expression Analysis with a Dynamically extended Self-

    Organized Map that exploits class information

    Seferina Mavroudi, Stergios Papadimitriou, Liviu Vladutu, Anastasios Bezerianos

    Department of Medical Physics, School of Medicine, University of Patras,

    26500 Patras, Greece, tel: +30-61-996115,

email: [email protected], [email protected]

    ABSTRACT

Motivation: Currently the most popular approach to analyse

    genome-wide expression data is clustering. One of the major

    drawbacks of most of the existing clustering methods is that

    the number of clusters has to be specified a priori.

Furthermore, with pure unsupervised algorithms prior
biological knowledge is totally ignored; e.g., there is no

    simple means to handle genes of known similar function

    being allocated to different clusters based on their

    expression profiles. Moreover, most current tools lack an

    effective framework for tight integration of unsupervised

    and supervised learning for the analysis of high-dimensional

    expression data.

    Results: The paper adapts a novel Self-Organizing map

    called supervised Network Self-Organized Map (sNet-SOM)

    to the peculiarities of gene expression data. The sNet-SOM

    determines adaptively the number of clusters with a

    dynamic extension process which is able to exploit class

information whenever it exists. Specifically, the sNet-SOM
accepts available class information to control a dynamical
extension process with an entropy criterion. This process extracts information about the structure of the decision

    boundaries. A supervised network can be connected

    additionally in order to resolve better at the difficult parts of

    the state space. In the case that there is no classification

    available, a similar dynamical extension is controlled with

    criteria based on the computation of local variances or

    resource counts.

    The sNet-SOM grows within a rectangular grid that

    provides effective visualization while at the same time it

    allows the implementation of efficient training algorithms.

    The expansion of the sNet-SOM is based on an adaptive

    process. This process grows nodes at the boundary nodes,

    ripples weights from the internal nodes towards the outer

    nodes of the grid, and inserts whole columns within the

    map. The growing process determines automatically the

    appropriate level of expansion with criteria dependent upon

    whether unsupervised or supervised training is used. For the

unsupervised training, the criterion is that the similarity between
the gene expression patterns of the same cluster fulfills a
designer-definable statistical confidence level of not being a
random event. The supervised mode of training grows the

    map until criteria defined on approximation/generalization

    performance are fulfilled. The voting schemes for the

    winner node have been designed in order to amplify the

    representation of rare gene expression patterns.

    The results indicate that sNet-SOM yields competitive

    performance to other recently proposed approaches for

    supervised classification at a significantly reduced

    computational cost and it provides extensive exploratory

    analysis potentiality within the unsupervised analysis

framework. Furthermore, it explores simple design decisions that are easy to comprehend and computationally efficient.

    Availability: The source code of the algorithms presented in

    the paper can be downloaded from

    http://heart.med.upatras.gr. The implementation is in

    Borland C++ Builder 4.0.

    Contact: [email protected],

    [email protected]


    1. Introduction

    The recent development of DNA microarray technology

    provides the ability to measure the expression levels of

    thousands of genes in a single experiment [ , , ]. The

    interpretation of such massive expression data is a new

    challenge for bioinformatics and opens new perspectives for

    functional genomics. A key question within this context is if

    given some expression data for a gene, this gene does

    belong to a particular functional class (i.e. it encodes for a

    protein of interest).


Currently, the most popular analysis of gene expression data,
in order to provide insight into the structure of the data and to
aid in the discovery of functional classes, is clustering, i.e.
the grouping of genes with similar expression patterns into
clusters [ , ]. Such approaches unravel relations between

    genes and help to deduce their biological role, since genes of

    similar function tend to display similar expression patterns.

    Most of the so far developed algorithms perform the

    clustering of the expression patterns in an unsupervised

    manner [ , , ]. However, frequently genes of similar

    function become allocated to different clusters. In this case,

    a pure unsupervised approach is unable to deduce the correct

    "rule" for the characterization of the gene class. On the other

    hand, there already exists valuable biological knowledge,

    which is manifested in the form of collections of genes

known to encode proteins of similar biological function,

    e.g. genes that code for ribosomal proteins [ ].


    Some of the clustering algorithms used so far for the

    clustering of gene expression data include hierarchical

    clustering [ ], K-means clustering, Bayesian clustering [ ]

and the Self-Organizing Map (SOM) [13].

Nevertheless, besides ignoring existing class information,
most of the widely adopted clustering methods, such as
K-means and the SOM, have another major drawback: they require an a priori decision on the
number and structure of distinct clusters. Moreover, most of
the proposed models do not incorporate flexible means for
coupling effectively the unsupervised phase with a
supervised complementary phase, in order to benefit the
most from both of these approaches.

    A major drawback of hierarchical clustering is that although

    the data points are organized into a strict hierarchy of nested

    subsets there is no reason to believe that expression data

    actually follows a true hierarchical descent, like for

    example, the evolution of the species [ , ]. Furthermore,

    decisions made early about grouping points to specific

    clusters cannot be reevaluated and often adversely affect the

result. This latter disadvantage is shared also by the dynamic
non-fuzzy hierarchical schemes proposed recently [ , ].

    Also, the traditional hierarchical clustering schemes suffer

    from lack of robustness, and from nonuniqueness and

    inversion problems.


    Bayesian clustering is a highly structured approach, which

imposes a strong prior hypothesis on the data [8]. However,
such prior hypotheses on expression data are usually not
available.

    K-means clustering on the other hand imposes no structure

    at all on the data, proceeds in a local fashion and produces

    an unorganized collection of clusters that is not conducive to

    interpretation [ ].

In contrast, the standard SOM algorithm has a number of
properties which render it a candidate of particular

    interest. SOMs can be implemented easily, are fast, robust

    and scale well to large data sets. They allow one to impose

    partial structure on the clusters and facilitate visualization

    and interpretation. In the case hierarchical information is

    required, it can be implemented on top of SOM, as in [ ].

    However, there is still an inherent requirement of the

    standard SOM algorithm, which constitutes a major

    drawback. The number of distinct clusters has to be

specified a priori, although there is no means to objectively

    predetermine the optimum number in the case of gene

    expression data.



    Recently, several dynamically extended schemes have been

    proposed that overcome the limitation of the fixed non-

    adaptable architecture of the SOM. Some examples are the

    Dynamic Topology Representing structures [ ], the

    Growing Cell Structures [ , ], Self-Organized Tree

    Algorithms [ , ] and the Adaptive Resonance Theory [ ].

    The presented approach has many similarities to these

    dynamically extended schemes. However, in contrast to the

    complexity of these schemes, we built simple algorithms

    that through the restriction of growing on a rectangular grid,

    can be implemented easily and the training of the models is

    very efficient. Also, the benefits of the more complex

    alternatives to the dynamical extension are still retained.


We call the proposed model sNet-SOM, from supervised
Network SOM, since although it is SOM-based it
incorporates many provisions for supervised

    complementation of learning. These provisions start with the

    supervised versions of the map growing process and run

    through the possibility of integrating a pure supervised

    model.

    Specifically, our clustering algorithm modifies the original

    SOM algorithm with a dynamic expansion process

    controlled by an entropy-based measure whenever gene

functional class information exists. The latter measure
quantifies to what extent the available information for the
biological function (i.e. class) of the gene is represented
accurately by the cluster (i.e. the SOM node) to which the
gene is allocated. Accordingly, the model is adapted

    dynamically in order to minimize the entropy within the

    generated clusters. This approach detects effectively the

    regions where the decision boundaries between different

    classes lie. At these regions, the classification task becomes

    difficult and a special supervised network can be connected

    with the sNet-SOM in order to resolve better at the class

    boundaries. Usually, only in the case of lack of class

    information the dynamic expansion is controlled by local

    variance or resource counts criteria. The entropy criterion

    concentrates on the resolution of the regions characterized

    by class ambiguity and therefore it is more effective.

    The sNet-SOM has been designed in order to automatically

detect the appropriate level of expansion. In the
unsupervised case, the distance threshold between patterns

    below which two genes can be considered as co-expressed is

    estimated. Then the map is grown automatically until its

    nodes correspond to gene clusters with distances that adhere

    to this limit. In the supervised case the criteria for stopping

    the network expansion can be expressed either in terms of

    the approximation or in terms of the classification

    performance.

    Furthermore, the sNet-SOM overcomes the problem of

irrelevant (flat) profiles that can populate many more
clusters than necessary in the traditional SOM. The solution

    we adopted is the careful redesign of the voting mechanism.

    The paper is outlined as follows: Initially, Section 2

    summarizes the microarray expression experiments and the

    associated data used to evaluate the presented computational

    learning schemes. Section 3 describes the extensions to the

    SOM that lead to the sNet-SOM and the overall architecture

of the latter. Section 4 deals with the learning algorithms that

    adapt both the structure and the parameters of the sNet-

    SOM. The expansion phase of the sNet-SOM learning is

    described in separate sections since it is rather complicated

    and depends on whether the learning is supervised or

    unsupervised. Specifically, Section 5 elaborates on the

    details of the expansion phase for the unsupervised case and

    Section 6 for the supervised one. Section 7 discusses results

    obtained from an application to yeast expression microarray

data. Finally, Section 8 presents the conclusions

    along with some directions onto which further research can

    proceed for improvements.

2. Microarray expression experiments

    Recently, new approaches have been developed for

    accessing large scale gene expression data. One of the most

    effective ones is by using the DNA microarray technology

    [ ]. In this method, thousands of distinct DNA probes are

    attached to a microarray. These probes can be Polymerase

    Chain Reaction (PCR) products or oligonucleotides whose

    sequences correspond to target genes or Expressed Sequence


Tags (ESTs) of the genome being studied. RNA is extracted

from the sample tissue or cells, reverse transcribed into
cDNA labeled with fluorescent dyes, which is then allowed

    to hybridize with the probes on the microarray. The cDNA

    corresponds to transcripts produced by genes in the samples,

    and the amount of a particular cDNA sequence present will

    be in proportion to the level of expression of its

    corresponding gene. The microarray is washed to remove

    non-specific hybridization, and the level of hybridization for

    each probe is calculated. An expression level for genes

    corresponding to the probes is derived from these

    measurements. This level represents a ratio between the

expression of the gene under some experimental condition
relative to the reference condition.

    Gene expression data obtained in this way are usually

    arranged in tables whose rows correspond to the genes and

    columns to the individual expression values of each gene in

    a particular experimental condition represented by the

column. These raw data are characterized by highly
asymmetrical distributions that make it difficult to apply
any distance metric for assessing the
differences among them. Therefore, a logarithmic
transformation is used as a preprocessing step that expands

    An additional desirable effect of the logarithmic

    transformation is that it provides a symmetrical scale around

    0.
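As a concrete illustration of this preprocessing step, a minimal Python sketch follows (the paper's implementation is in Borland C++ Builder; the function name and the use of base-2 logarithms here are assumptions of the sketch):

    import numpy as np

    def preprocess_expression(ratios):
        # ratios: 2-D array, rows = genes, columns = experimental
        # conditions; each entry is a test/reference fluorescence ratio.
        # log2 expands the scale for small values, compresses it for
        # large ones, and is symmetric around 0: a 2-fold induction
        # maps to +1 and a 2-fold repression to -1.
        return np.log2(np.asarray(ratios, dtype=float))

    # e.g. preprocess_expression([[2.0, 0.5], [1.0, 4.0]])
    # -> [[ 1., -1.], [ 0.,  2.]]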

    The gene expression patterns reflect a cell's internal state

and microenvironment, creating a molecular "picture" of the

    cell's state. Thus DNA microarrays can be used to capture

    these molecular pictures and deduce the condition of the

    cells. Furthermore, since the expression profile of a gene is

    correlated with its biological role, systematic microarray

    studies of global gene expression can provide remarkably

    detailed clues to the functions of specific genes. This is

    important, since currently fewer than 5% of the functions of

    the genes in the human genome are known.

    3. The sNet-SOM model

    The sNet-SOM is based on the standard SOM algorithm, but

    is dynamically extendable, so that the number of clusters is

    controlled by a properly defined measure of the algorithm

    itself, with no need for any a priori specification. Because

    all the previously mentioned clustering algorithms are

    purely unsupervised, they ignore any available a priori

    biological information. This means that not only existing

    information is not explored in order to deduce the correct

    expression characteristics of genes that make them part of

    functional groups, but also that genes known to be

    erroneously grouped to a cluster cannot be handled.

    Following the basic design principle to include existing

    prior knowledge, we manage to simultaneously consider

    both gene expression data and class information (whenever

    available) at the sNet-SOM training algorithms.

    However, so far class annotation for gene expression data is

    limited and not always available. In order to account also for

    this case we additionally developed a second similar

    algorithm so that for the two cases the algorithms differ only

    in the criteria that control the dynamic expansion of the

    map. Specifically, depending on the availability of class

    information we design two variants of sNet-SOM.

The first variant, the unsupervised sNet-SOM, performs node

    expansion in the absence of class labels by exploiting either

    a local variance measure that depends on the SOM

    quantization performance or on node resource counts. These

    criteria are used also at the Growing Cell Structures (GCS)

    algorithms for growing cells [ , ]. The convergence criteria

    are defined by a statistical assessment of the randomness of

    the distance between gene expression patterns.


The second variant, the supervised sNet-SOM, performs the

    growing by exploiting the class information with an entropy

    measure. The dynamic growth is based on the criterion of

    neuron ambiguity (i.e. uncertainty about class assignment),

    which is quantified with the entropy measure that is defined

    over the sNet-SOM nodes. This approach differs from the

local quantization error approach of [ ] and the resource
value of [ ], which grow the map at the nodes accumulating
the largest local variances and resource counts, as in the
unsupervised sNet-SOM. In the absence of class information


    these are reasonable and well performing criteria. However,

    these measures can be large even with no class ambiguity

    while the entropy measure directly and objectively

    quantifies the ambiguity. For that reason for the supervised

    sNet-SOM the entropy based growing technique is

    preferable.

    We have developed the supervised sNet-SOM initially

    within the context of an ischemia detection application

[22, 3]. In this application, it is used in combination with

    capable supervised models in order to maximize the

    performance of the detection of ischemic episodes.

    However, the peculiarities of the gene expression data made

    mandatory significant redesign of the algorithms. Below we

    discuss the sNet-SOM learning algorithms in detail.


    4. Learning algorithms

    The sNet-SOM is initialized with four nodes arranged in a

2×2 rectangular grid and grows nodes to represent the input

    data. Weight values of the nodes are self-organized

    according to a new method inspired by the SOM algorithm.

    The self-organization process maps properties of the original

high-dimensional data space onto the lattice consisting of

    sNet-SOM nodes. The map is expanded to represent the

    input space by creating new nodes, either from the boundary

    nodes performing boundary extension, or by inserting whole

    columns (or rows) of new units with a column extension (or

    row extension).

    The decision to grow either with the boundary or with the

    column (row) extension does not limit the potentiality for

    dimensionality reduction of the model and its modeling

    effectiveness, while its implementation is easier and the

training becomes more efficient. The latter advantage is

    important for the large data sets produced by the microarray

    experiments. Usually, new nodes are created by expanding

    the map at its boundaries. However, when the expansion

focus becomes a node placed deep in the interior of a large

    map, far from the boundary nodes, the adaptive expansion

    process inserts a whole column of nodes directly adjacent to

    this node. Therefore, the node becomes directly a boundary

    node and the expansion process can generate new nodes in

    the neighborhood. The implementation of this exception to

    the general grow from boundary rule, has accelerated

    significantly the training of large maps (2 to 4 times faster

    computation for maps of size of about 100 nodes).

    The growing structure takes the form of a nonuniform

rectangular grid. It develops within a large $N \times M$ grid that
provides slots for the new dynamically created nodes.
Generally, we require $N \le M$, e.g. $M \approx 2N$, since the
insertion of whole columns results in a faster expansion rate
along columns (note that the opposite is true when we
implement the alternative of row insertion, instead of
column insertion).

    A training epoch consists of the presentation of all the

    training patterns to the sNet-SOM. A training run is defined

    as the training of the sNet-SOM with a fixed number of

    neurons at its lattice i.e. the training between successive

    node insertions/deletions.

    After the preliminary discussion we can now proceed to

    describe the sNet-SOM learning algorithms in more detail.

    The top level sNet-SOM learning algorithm is the same for

both the unsupervised and the supervised case. In
algorithmic form it can be described as:

Top-level sNet-SOM learning algorithm

1. Initialization phase
While the criteria for terminating the expansion are not fulfilled do
2.   Training run adaptation phase
3.   Expansion phase
End While
4. Fine tuning adaptation phase

    The details of the algorithm, i.e. the initialization,

    adaptation, expansion and fine tuning phases and the

    convergence criteria are described in detail below.

A. Initialization phase
The weight vectors of the four starting nodes that are
arranged in a 2×2 grid are initialized with random numbers


    within the domain of feature values (i.e. of the normalized

fluorescence ratio coefficients).

B. Training Run Adaptation phase
The purpose of this phase is to stabilize the current map

    configuration in order to be able to evaluate its effectiveness

    and the requirements for further expansion. During this

    phase, the input patterns are repeatedly presented and the

    corresponding self-organization actions are performed until

    the map converges sufficiently. The training run adaptation

    phase takes the following algorithmic form.

MapConverged := false;
while MapConverged = false do
    for all input patterns x_k do
        present x_k and adapt the map by applying the map
        adaptation rules
    end for
    Evaluate the map training run convergence condition and set
    MapConverged accordingly
end while

    Map adaptation rules

The map adaptation rules that govern the processing of each
input pattern $x_k$ are as follows:

1. Determination of the weight vector $w_i$ that is closest to
the input vector $x_k$ (i.e. of the winner node $i$).

2. Adaptation of the weight vectors $w_j$ only for the four
nodes $j$ in the direct neighborhood of the winner $i$ and for
the winner itself, according to the following formula:

$$ w_j(k+1) = \begin{cases} w_j(k) + \eta(k)\, h_k(d(i,j))\, \big(x_k - w_j(k)\big), & j \in N_k \\ w_j(k), & j \notin N_k \end{cases} $$

where the learning rate $\eta(k)$, $k \in \mathbb{N}$, is a monotonically
decreasing sequence of positive parameters, $N_k$ is the
neighborhood at the $k$th learning step and $h_k(d(i,j))$ is
the neighborhood function implementing different
adaptation rates even within the same neighborhood.

    The learning rate starts from a value of 0.1 and decreases

    down to 0.02. These values are specified with the empirical

    criterion of having relatively fast convergence, without

    however sacrificing the stability of the map.

The neighborhood function $h_k(d(i,j))$ depends on the
distance $d(i,j)$ between node $j$ and the winning node $i$. It
decreases monotonically with increasing distance from the
winning neuron (i.e. nodes closer to the winner are adapted
more), as in the standard SOM algorithm. The initial
neighborhood, $N_0$, includes the entire map.

Unlike the standard SOM, these parameters (i.e. $N_k$,
$h_k(d(i,j))$) do not need to shrink with time and can be
kept constant, i.e. $N_k = N_0$, $h_k(d(i,j)) = h_0(d(i,j))$. This

    is explained by the following: Initially, the neighborhood is

    large enough to include the whole map. The sNet-SOM

    starts with a much smaller size than a usual SOM: thus a

    large neighborhood is not required to train the whole map at

    the first learning steps (e.g. with 4 nodes initially at the map,

    a neighborhood of 1 only is required). As training proceeds,

    during subsequent training epochs, the area defined by the

neighborhood becomes localized near the winning neuron, not by shrinking the vicinity radius (as in the standard SOM)

    but by enlarging the SOM with the dynamic growing.

    Usually, we use the following simple and efficiently

computed formula for the neighborhood function (where
$r_i$, $c_i$ denote the row and column of node $i$ respectively):

$$ h_k(d(i,j)) = \begin{cases} 1, & \text{if } i = j \\ a, \ 0 < a < 1, & \text{if } |r_i - r_j| + |c_i - c_j| = 1 \\ 0, & \text{otherwise} \end{cases} $$

An alternative rectangular neighborhood that also updates
the diagonal nodes with a smaller learning rate yields
appropriate results as well:

$$ h_k(d(i,j)) = \begin{cases} 1, & \text{if } i = j \\ a, \ 0 < a < 1, & \text{if } |r_i - r_j| + |c_i - c_j| = 1 \\ b, \ 0 < b < a, & \text{if } |r_i - r_j| + |c_i - c_j| = 2 \\ 0, & \text{otherwise} \end{cases} $$
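To make the adaptation rules concrete, here is a minimal Python sketch of one learning step under the assumptions above (rectangular grid stored as an array with an allocation mask; the names and the constant neighborhood parameters are illustrative, not taken from the paper's C++ implementation):

    import numpy as np

    def adapt(weights, alive, x, eta=0.1, a=0.5):
        # weights: (N, M, d) array of node weight vectors w_j
        # alive:   (N, M) boolean mask of slots allocated to the map
        # Find the winner i: the allocated node closest to pattern x.
        d2 = np.sum((weights - x) ** 2, axis=2)
        d2[~alive] = np.inf
        i, j = np.unravel_index(np.argmin(d2), d2.shape)
        # Move the winner towards x with learning rate eta ...
        weights[i, j] += eta * (x - weights[i, j])
        # ... and its four direct neighbours with the reduced rate
        # a * eta, following the neighborhood function h_k above.
        for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            r, c = i + di, j + dj
            if 0 <= r < alive.shape[0] and 0 <= c < alive.shape[1] and alive[r, c]:
                weights[r, c] += eta * a * (x - weights[r, c])
        return i, j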

    Evaluation of the map training run convergence condition


    The map training run convergence condition is tested by

    evaluating the reduction of the total quantization error for

    the unsupervised case and of the total entropy for the

    supervised one, before and after the presentation of all the

input patterns (i.e. one training epoch). Specifically, denote
by $E_b$ and $E_a$ the errors before and after the presentation of the
patterns (the formulation for the entropies is similar). Then
the map converges when the relative change of the error
between successive epochs drops below a threshold value,
i.e.

$$ \text{MapConverged} := \frac{|E_b - E_a|}{E_a} < \text{ConvergenceErrorThreshold}. $$

The setting of the ConvergenceErrorThreshold is somewhat

    empirical but a value in the range 0.01 - 0.02 performs well

    in assuming sufficient convergence without excessive

    computation.
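The test translates directly into code (a sketch; the default threshold is the empirical 0.01-0.02 range quoted above):

    def map_converged(e_before, e_after, threshold=0.02):
        # Relative change of the total quantization error (or entropy)
        # between successive epochs.
        return abs(e_before - e_after) / e_after < threshold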

    C. Fine Tuning Adaptation Phase

    The fine tuning phase aims to optimize the final sNet-SOM

    configuration. This phase is similar to the training run

    adaptation phase described previously with two differences:

a. The criterion for map convergence is more
elaborate. We require a much smaller change of the
total quantization error (unsupervised case) or of
the total entropy (supervised case) for accepting the
condition for map convergence.

b. The learning rate decreases to a smaller value in
order to allow fine adjustments to the final
structure of the map.

    Typically, the ConvergenceErrorThreshold for the fine

    tuning phase is about 0.00001 and the learning rate is set to

    0.01 (or to an even smaller value).

D. Expansion Phase

The dynamic expansion of the sNet-SOM depends on the
availability of class labels and therefore is referred to as

    supervised expansion when class labels are available and

    unsupervised expansion if not. These processes are

    described separately below since they explore different

    expansion criteria. Moreover, the objective underlying their

    development is different for each case. The unsupervised

    expansion has the task of revealing insight onto the groups

    of genes with correlated expression patterns while the

    supervised entropy based expansion has the objective to

    reduce the computational requirements for a "pure"

supervised solution. The latter objective follows a design

    principle of the sNet-SOM: partitioning of a complex

    learning problem to domains that can be learned effectively

    with simple and computationally effective unsupervised

    models and to ones that require the utilization of a capable

    supervised model since they are characterized by complex

decision boundaries [22].

    To avoid misconception we should note that in our scheme

    the term supervised refers mainly to the fact that class

    information is a decisive factor in determining the expansion

    criterion. As we shall describe in the next section though,

class information can be exploited even in
what we term the unsupervised expansion process. The reason

for not always using the supervised expansion mode when

    class information is available, is simply explained by the

    two different objectives outlined above, i.e. if the insight to

    the structure of the gene expression data is more important

    than the classification task itself, the unsupervised approach

    is used although class information is available.

    Each of the two approaches to map expansion, the

    unsupervised and the supervised one, is described below in

    its own section.

5. The Unsupervised Expansion Process

    The unsupervised expansion is based on the detection of

    the neurons with large local error, referred to as the

unresolved neurons. A neuron is considered unresolved if
its local error $LE_i$ exceeds a threshold value, denoted by
the parameter NodeErrorForConsideringUnresolved.
Denote by $S_i$ the set of gene expression profiles $p$ mapped
to node $i$. Also, let $w_i$ be the weight vector of node $i$ that
corresponds to the average expression profile of $S_i$. Then
the local error $LE_i$ is defined as:

$$ LE_i = \sum_{p \in S_i} \lVert p - w_i \rVert^2 . $$

The local error is commonly used for implementing
dynamically growing schemes [1, 17]. However, the
peculiarities of the gene expression data motivated two
significant modifications to the classic local error measure.
Specifically:

1. Instead of the local error measure we use the
average local error $AV_i$ per pattern, i.e.

$$ AV_i = \frac{LE_i}{|S_i|} \qquad (1) $$

    This measure does not increase when many similar

    patterns are mapped to the same node. Therefore,

    the objective of assigning all the functionally similar

    genes to the same node is more easily achievable,

    even when there are many of such genes. In contrast,

    the accumulated local error increases monotonically,

    as more genes are mapped to the same node. This in

    turn can cause an undesired spreading of

    functionally similar genes to different nodes.

2. The second provision applies when we have class
information available (either complete or partial)
and we want to exploit it in order to improve the
expansion. The local error that accumulates to a
winner node is amplified by a factor that is inversely
proportional to the square root of the frequency ratio
$r_c$ of its corresponding class $c$. Specifically, let

$$ r_c = \frac{\#\text{patterns of class } c}{\#\text{total patterns}} $$

be the frequency ratio of class $c$. Then the amplification
factor is $r_c^{-1/2}$.
Therefore, the errors on the low frequency classes
account for more. As a consequence the representation
of these low frequency classes is improved. We
should note that these classes are usually of most
biological significance. The utilization of the square
root prevents the overrepresentation of the very low
frequency classes (e.g. if class A is 100 times less
frequent than B it is amplified only 10 times
more). The error measure computed after this
additional class frequency dependent weighting is
called the Class Frequency Average Local Error
(CFALE). In the absence of class information the
CFALE denotes the same quantity as the average
local error $AV_i$ defined above with
equation (1).

    This provision also confronts to some extent the

    serious problem of the creation of false positives for

    the low frequency classes by noise. Probabilistically,

    most of these noisy patterns will belong to the high

    frequency classes. However, the effect of these

    erroneously classified patterns will be attenuated

    significantly, because they are derived from the high

    frequency classes. The final result is an enhanced

    robustness to noisy patterns.
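Combining the two modifications, a minimal Python sketch of the CFALE measure for one node might look as follows (names are illustrative; without class information it reduces to the average local error of equation (1)):

    import numpy as np

    def cfale(profiles, weight, classes=None, class_freq=None):
        # profiles:   (n, d) expression profiles mapped to the node (S_i)
        # weight:     (d,) node weight vector w_i
        # classes:    optional per-profile class labels
        # class_freq: optional dict class -> frequency ratio r_c
        errors = np.sum((profiles - weight) ** 2, axis=1)
        if classes is not None and class_freq is not None:
            # amplify each error by r_c ** -(1/2): errors on the low
            # frequency classes account for more
            amp = np.array([class_freq[c] ** -0.5 for c in classes])
            errors = errors * amp
        return errors.sum() / len(profiles)   # average per pattern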

    Nodes that are selected as winners for very few (usually one

    or two) training patterns, termed uncolonized nodes, are not

    deleted by our scheme although they probably correspond to

    noisy outliers. The gene expression patterns that consistently

    (three times or more) are mapped to uncolonized nodes are

very unique: they can either be artifacts or, if not, they have the potential to provide biological knowledge. Therefore they

    are amenable to further consideration. These patterns

    therefore are marked and isolated for further study.

    Nodes that are not selected as winners for any pattern are

    removed from the map in order to keep it compact.

    The steps of the unsupervised expansion process are as

    follows:

U.1. Computation of the CFALE measures for every node i.
repeat
U.2.   let i = the node with the maximum CFALE measure
U.3.   if IsBoundaryNode(i) then
           // expand at the neighbouring boundary nodes
U.4.       JoinSmoothlyNeighbours(i)
U.5.   else if IsNearBoundaryNode(i) then
U.6.       RippleWeightsToNeighbours(i)
U.7.   else InsertWholeColumn(i);
       end if
U.8.   Reset the local error measures.
U.9.   Re-execute the Training Run Adaptation
       Phase for the expanded map by presenting all the training
       patterns.
until not RandomLikeClustersRemain();

Figure 1 Illustration of the weight initialization of a
new node with the function JoinSmoothlyNeighbours(). The
new node is allocated to join the map from the
boundary nodes. The weight of the new node is initialized by
computing the average weight $W_{av}$ near the node $(r, c)$
which initiates the expansion, according to the empirical
formula:

$$ W_{av} = \frac{h_f\, A_N(r, c-1)\, W_{r,c-1} + v_f\, A_N(r-1, c)\, W_{r-1,c} + v_f\, A_N(r+1, c)\, W_{r+1,c}}{h_f\, A_N(r, c-1) + v_f\, A_N(r-1, c) + v_f\, A_N(r+1, c)} $$

where $A_N(i, j)$ is a boolean flag that denotes that the node
$(i, j)$ has been allocated to the growing structure, and $h_f$ and
$v_f$ are the horizontal and vertical factors of weighting
across the corresponding dimensions. Then the weight of the
new node $N$ is estimated as:

$$ W_N \equiv W_{r,c+1} = W_{r,c} + (W_{r,c} - W_{av}) . $$

The concept of direction of weight growing is maintained
by considering more the node in the direction of growth
(horizontal in the case illustrated), i.e. $h_f > v_f$; typically
$h_f = 1$, $v_f = 0.5$.

We briefly describe below the main issues involved in these
steps. The repeat loop controls the sNet-SOM expansion.

    The criteria for the establishment of the proper level of

    expansion are described in the section that follows. The

    function IsBoundaryNode() checks whether a node is a

    boundary node. Training efficiency and implementation

    simplicity were the motivations for the decision to expand

    mostly from the boundary nodes. The expansion of the map

    at the boundary nodes is straightforward: One to three nodes

    are created and the weights of the new nodes are adjusted

    heuristically to retain the weight flow with the function

JoinSmoothlyNeighbours(), whose operation is illustrated in
Fig. 1.

    The map configuration is slightly disturbed when the winner

    node is not a boundary node but is a near boundary node. A

    node is considered near boundary (declared by the function

    IsNearBoundaryNode()) when the boundary of the map can

    be reached from this node by traversing in any direction at

    most two nodes.
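One plausible reading of these two predicates on the N×M slot grid, sketched in Python (the allocation mask follows the earlier sketches; the paper does not give this code):

    def is_boundary_node(alive, i, j):
        # A node is on the boundary if some direct neighbour slot is
        # outside the grid or not yet allocated.
        n, m = alive.shape
        for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            r, c = i + di, j + dj
            if not (0 <= r < n and 0 <= c < m) or not alive[r, c]:
                return True
        return False

    def is_near_boundary_node(alive, i, j):
        # The boundary can be reached by traversing at most two nodes
        # in some direction.
        n, m = alive.shape
        for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            for step in (1, 2):
                r, c = i + di * step, j + dj * step
                if 0 <= r < n and 0 <= c < m and alive[r, c] \
                        and is_boundary_node(alive, r, c):
                    return True
        return False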

    For a near boundary node a percentage (usually 20-50%) of

    the weight of the winner node is shifted towards the outer

    nodes (with the function RippleWeightsToNeighbours()).

    This operation alters locally the Voronoi regions and usually

    with a few weight rippling operations the winner node is

propagated to a boundary node (which is located nearby).

    Finally, if the winner is a node that is neither a boundary nor

    a near boundary the alternative of inserting a whole empty

    column is used. The rippling of weights is avoided in these

    cases, because usually excessive computation times are

    required before the winner propagates from a node placed

    deep in the map to a boundary node. Instead of inserting

    whole new columns we can insert alternatively whole new

    rows, or we can perform a combination of row and column

    insertion. The operation of the corresponding

    InsertWholeColumn() function is illustrated by Figure 2.


Figure 2 Grow by grid insertion in the direction of largest
total error, i.e.
if $E_{i-1,j-1} + E_{i,j-1} + E_{i+1,j-1} > E_{i-1,j+1} + E_{i,j+1} + E_{i+1,j+1}$,
where $E_{i,j}$ is the error measure of node $(i,j)$ (i.e. CFALE),
then insert the new column at the left of column j, else insert
the new column at the right of column j.

    Criteria for controlling the sNet-SOM dynamic growing

    One of the most critical tasks for the effective analysis of

    gene expression data with the sNet-SOM is the proper

    definition of the parameters that control the growing

    process. The objective is to automatically reach the

    appropriate level of expansion and then to allow some fine

    tuning of the resolution level by the molecular biologists.

    Systematically, the design of the criteria for stopping the

    growing process can be approached by evaluating a

    statistical distance threshold for gene expression patterns,

    below which two genes can be considered as functionally

    similar. When the average distance between patterns in a

    cluster drops below this value, the clustering together of

    these particular gene expression patterns corresponds to

nonrandom behavior and therefore interesting information can be extracted by analyzing them.

To this end, we define a confidence level $a$ from which we
derive a threshold $D_{thr}$ for the distance between gene
expression patterns. The confidence level has the
meaning that the probability of taking two random unrelated
expression profiles as functionally similar (i.e. of allocating
them to the same cluster) is lower than $a$ if the distance
between them is smaller than the threshold $D_{thr}$.

Obviously, the definition of a statistical confidence level

    would be only possible if the distribution of the distance

    between random expression vectors were known.

    Practically, although the distribution is unknown, it is easy

to approximate it. Specifically, we randomly shuffle the
experiment points of every expression profile,
both by gene and by experiment point. This randomization

    destroys the correlation between the different profiles, while

    it retains the other characteristics of the data set (e.g. ranges

    and histogram distribution of values). In this way, we

    compute an approximation of the distribution of the distance

    between random patterns.
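A sketch of this randomization, assuming the Manhattan distance chosen below (function and parameter names are illustrative):

    import numpy as np

    def random_distance_distribution(data, n_pairs=100000, rng=None):
        # data: (genes, experiments) matrix of log ratios.
        # Shuffling each sampled profile destroys the correlation
        # between profiles while retaining the ranges and histogram
        # of the values.
        rng = np.random.default_rng() if rng is None else rng
        n = len(data)
        dists = np.empty(n_pairs)
        for k in range(n_pairs):
            a = rng.permutation(data[rng.integers(n)])
            b = rng.permutation(data[rng.integers(n)])
            dists[k] = np.abs(a - b).sum()   # Manhattan distance
        return dists

    # The threshold D_thr for confidence level a is then a low
    # quantile of this distribution, e.g.
    # D_thr = np.quantile(random_distance_distribution(X), 0.05)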

It is evident that larger (smaller) correlation between genes
corresponds to smaller (larger) Euclidean or Manhattan
distance. Assume that we have chosen the Manhattan
distance measure. Then, as Figure 3 illustrates, the distances
lying in the interval $[v_l, v_h]$ are considered as random.
Also, for distances smaller than $v_l$ a positive correlation
between genes is implied, while it is reasonable to assume
that the converse holds for distances larger than $v_h$.

Figure 3 The results of the data shuffling illustrate that the
distances between the randomized data occupy a distinct
distribution. For the gene expression data positive
correlation is favored, while for the randomized data the
distribution has a normal form.

The distributions of randomized and original gene
expression patterns displayed in Figure 3 are used to
implement the criteria on which the function

    RandomLikeClustersRemain() is based. This function

    evaluates the randomness of the genes allocated to one

    cluster by computing all the pairwise distances between

    them. If a considerable number of these distances are

random according to the specified significance level, then the
cluster is considered to contain unrelated genes and therefore
further decomposition is required. The percentage of

    random pairwise distances above which we consider the

    cluster as random, is specified empirically to a value of 5%.

    Clearly, the smaller the required percentage parameter, the

    larger the decomposition level becomes. The

    aforementioned value (i.e. 5%) produces well behaved, from

    a biological perspective, extensions of the map.
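A sketch of this per-cluster test (the 5% percentage parameter is the empirical value given above; v_l is the lower bound of the random interval from Figure 3):

    import numpy as np

    def cluster_is_random(profiles, v_l, random_fraction=0.05):
        # True if more than random_fraction of the pairwise Manhattan
        # distances within the cluster are not below v_l, i.e. look
        # random under the shuffled-data distribution.
        n = len(profiles)
        if n < 2:
            return False
        dists = [np.abs(profiles[i] - profiles[j]).sum()
                 for i in range(n) for j in range(i + 1, n)]
        return np.mean(np.asarray(dists) >= v_l) > random_fraction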

    6. The Supervised Expansion Process

The supervised expansion is based on the computation of the
class assignment for each node i, and of a parameter $HN_i$
characterizing the entropy of this assignment. This
parameter is derived according to Equation 2 that is
discussed below. An advantage of the entropy is that it is
relatively insensitive to the overrepresentation of classes, i.e.
independently of how many patterns of a class are mapped
to the same node, if the node does not represent other
classes, its entropy is zero.

The expansion phase consists of the following steps:

S.1. Computation of the class labels and entropies $HN_i$ for
     the map nodes. The ambiguity of class assignment for
     the genes of node i is quantified by $HN_i$.
repeat
S.2.   Evaluation of the map over the whole training set in
       order to compute the approximation performance
       CurrentApproximationPerformance.
S.3.   if CurrentApproximationPerformance <
       ThresholdOnApproximationPerformance then
       // resolve better the difficult regions of the state space in
       // which classification decisions cannot be deduced easily
S.3.1.     let i = the node of highest ambiguity (i.e. largest
           entropy parameter).
S.3.2.     if IsBoundaryNode(i) then
               join smoothly the neighbours to node i
           else if node i is near the boundary then
               RippleWeightsToNeighbours(i)
           else InsertWholeColumn(i);
           end if
           Reset the entropy measures.
           Apply the Map Adaptation phase to the expanded
           map.
       end if
until CurrentApproximationPerformance >=
      ThresholdOnApproximationPerformance
S.4. Generate training and testing sets for the supervised
expert. Further supervised training will be performed with
these sets by the supervised learning algorithm in order to
better resolve the ambiguous parts of the state space.

As already mentioned, the objectives of the supervised
expansion differ from those of the unsupervised one. While

    the unsupervised aims to insert nodes in order to detect

    interesting clusters of genes, the supervised extension is

    concentrating at the task of revealing the class decision

    boundaries.

The supervised expansion exploits well the topological

    ordering that the basic SOM provides and increases the

    resolution of the representation over the regions of the state

    space that lie near class boundaries. At this point, it should

    be emphasized that simply increasing the SOM size with the

    adaptive extension algorithm until each neuron represents

unambiguously a class (i.e., zero $HN_i$ for all the nodes)


yields a SOM configuration that, although it fits the training
set, fails to generalize well.

The ambiguous neurons, i.e. those neurons for which the
uncertainty of class assignment is significant, are identified

    with the entropy criterion. The dynamic expansion phase of

    the sNet-SOM is executed until the approximation

    performance reaches the required level. Afterwards, training

    and testing sets are created for the supervised expert. These

    sets consist only of the patterns that are represented by the

    ambiguous neurons. These neurons correspond to state

    space regions on which classification decisions cannot be

    deduced easily.

The classification task proceeds by feeding the pattern to the
sNet-SOM. If the winning neuron is one that is not

    ambiguous, the sNet-SOM classifies by using the class of

    the winning neuron. In the opposite case, the supervised

    expert is used to perform the classification decision.

The assignment of a class label to each neuron of the sNet-

    SOM is performed according to a majority-voting scheme

    [20]. This scheme acts as a local averaging operator defined

    over the class labels of all the patterns that activate that

    neuron as the winner (and accordingly are located at the

    neighborhood of that neuron). The typical majority-voting

    scheme considers one vote for each winning occurrence. An

    alternative more "analog" weighted majority voting scheme

    weights the votes each by a factor that decays with the

distance of the voting pattern from the winner (i.e. the
larger the distance, the weaker the vote). The averaging

    operation of the majority and weighted majority voting

    schemes effectively attenuates the artifacts of the training

    set patterns. As noted, in order to enhance the representation

    of rare gene expression patterns we amplify the vote of each

    pattern with a coefficient that is proportional to the inverse

    of the frequency of appearance of that class.

In the context of the sNet-SOM the utilization of either majority

    or weighted majority voting is essential. These schemes

    allow the degree of class discrepancy for a particular neuron

    to be readily estimated. Indeed, by counting the votes at

    each SOM neuron for every class, an entropy criterion that

    quantifies the uncertainty of the class label of neuron can

    be directly evaluated, as [ ]:

    m

    16

    cN

    k

    kk ppmHN

    1

    log)( , (2)

    where denotes the number of classes andcNtotal

    kk

    V

    Vp ,

    is the ratio of votes for class to the total number of

    votes to neuron m.

    kV k

    totalV

    Clearly, the entropy is zero for unambiguous neurons and

    increases as the uncertainty about the class label of the

neuron increases. The upper bound of $HN(m)$ is $\log(N_c)$,

    and corresponds to the situation where all classes are

    equiprobable (i.e. the voting mechanism does not favor a

    particular class). Consequently, within the framework posed

    by these voting schemes, the regions of the SOM that are

    placed at ambiguous regions of the state space can be easily

    identified. For these regions, the supervised expert is

    designed and optimized for obtaining adequate

    generalization performance.

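A small sketch of Equation 2 over a neuron's vote counts (names are illustrative; natural logarithms shown, so the upper bound is log(N_c)):

    import math

    def node_entropy(votes):
        # votes: dict mapping class label -> (possibly weighted) vote
        # count for this neuron. Zero for an unambiguous neuron.
        total = sum(votes.values())
        return -sum((v / total) * math.log(v / total)
                    for v in votes.values() if v > 0)

    # e.g. node_entropy({"TCA": 7, "Resp": 1}) is small, while
    # node_entropy({"TCA": 4, "Resp": 4}) == math.log(2)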

    7. Results and Discussion

We have applied the sNet-SOM to analyze publicly available

    microarray expression data from the budding yeast

    Saccharomyces cerevisiae. This fully sequenced organism

    was studied during the diauxic shift, the mitotic cell division

    cycle, sporulation, and temperature and reducing shocks by

using microarrays containing essentially every Open Reading Frame (ORF). The data set on which we performed

    extensive experiments consists of 2467 genes for which

    there exists currently functional annotation in the

    Saccharomyces Genome Database. The weighted K-nearest

neighbors imputation method presented in [25] is applied in
order to fill up systematically the missing values.

    Microarray gene expression data sets are large, complex,

    contain many attributes and have an unknown internal

structure. For that reason, gaining insight into the structure of the
data is the initial objective of most analysis methods, rather than

    the classification of the data itself. The sNet-SOM meets

    this objective by:

- Achieving high quality and computationally efficient
clustering of the gene expression profiles
with the exploitation of either supervised or
unsupervised clustering criteria.

- Offering extensive visualization capabilities with
the irregular two-dimensional growable grid that
the basic structure provides, which can be
complemented with Sammon's nonlinear
mapping [21].

The sNet-SOM is not only a clustering tool but additionally a
classification tool, although in this case the sNet-SOM
does not claim to directly compete with capable supervised
models like the Radial Basis Support Vector Machine
[16, 26, 6]. The sNet-SOM rather aims to complement them by
reducing the complexity of the problem that remains for the
pure supervised solution. Taking into account the size and
the complexity of the gene expression data set, this
reduction proves essential. The results of the last two rows
of Table 1 demonstrate this computational benefit.

Learning Model                                   Training Time
SOM 5×5                                          8 min
SOM 10×10                                        65 min
Unsupervised sNet-SOM                            25 min
Supervised sNet-SOM                              28 min
Unsupervised sNet-SOM with column insertion      19 min
Supervised sNet-SOM with column insertion        23 min
SVM                                              3 hours 15 min
Supervised sNet-SOM with SVM for the
ambiguous patterns                               50 min

Table 1 A comparison of the execution times for the
learning of the gene expression data set. The results were
obtained with a 450 MHz Pentium III PC.

Table 1 illustrates that the sNet-SOM trains faster than the

conventional SOM. The first two rows are the execution
times for a SOM with a grid of size 5×5 and 10×10
respectively. Also, the unsupervised sNet-SOM trains
slightly faster than the supervised one (3rd and 4th rows).
Furthermore, the utilization of column (row) insertion
provides further performance advantages (5th and 6th rows).

The Support Vector Machine takes the longest time for the
training on the whole data set (7th row). The implementation
approach of [19], as implemented with the SVMLight
software package, was used for the SVM solution. Finally,
the supervised sNet-SOM combined with the SVM
resolution of the difficult parts of the state space obtains
significantly better learning times without sacrificing the
quality of the results. The SVM classification results are
similar to those published in [6] and are therefore not
repeated here.

The supervised phase was trained with the same functional
classes as in [6]. This allows us to perform some comparisons
relating the performance of the sNet-SOM to other
methods. These classes are summarized in Table 2. The
functional classifications were obtained from the Munich
Information Center for Protein Sequences (MIPS) yeast genome
database (http://www.mips.biochem.mpg.de/proj/yeast).

1. Tricarboxylic-acid pathway (TCA)
2. Respiration-chain complexes (Resp)
3. Cytoplasmic ribosomal proteins (Cyto)
4. Proteasome (Proteas)
5. Histones (Hist)
6. Helix-turn-helix (HTH)

    Table 2 Functional classes used for supervised sNet-SOM

training. The tricarboxylic-acid pathway, also known as the
Krebs cycle, consists of genes that encode enzymes that

    break down pyruvate (produced from glucose) by oxidation.

    The respiration chain complexes perform oxidation-

    reduction reactions that capture the energy present in

    NADH through electron transport and the chemiosmotic

    synthesis of ATP. The cytoplasmic ribosomal proteins are a


    class of proteins required to make the ribosome. The

    proteasome consists of proteins that perform the

    degradation of proteins. Histones interact with the DNA

    backbone to form nucleosomes. These nucleosomes are an

essential part of the chromatin of the cell. Finally, the helix-
turn-helix class is not a functional class. It consists of genes

    that code for proteins containing the helix-turn-helix

    structural motif. This class is included as a control class.

In the presented supervised sNet-SOM training experiment

    we used six functional classes from the MIPS Yeast

    Genome Database: tricarboxylic acid (TCA) cycle,

    respiration, cytoplasmic ribosomes, proteasome, histones

    and helix-turn-helix (HTH) proteins. The first five classes

    represent categories of genes that on biological grounds is

    expected to induce similar expression characteristics. The

    sixth class, i.e. the helix-turn-helix proteins is used as a

    control group. Since there is not any biological justification

    for a mechanism that enforces the genes of this class to the

    same patterns of expression, we expect these genes to be

    spread to diverse clusters by the sNet-SOM.

    The measure of Entropy of Class Representation is

    evaluated over the sNet-SOM nodes in order to quantify the

    dispersion of class representation. We expect this measure to

    be large in the case of HTH, expressing the diversity of the

HTH gene expression patterns. Indeed, the results of Table 3

    support this intuitive expectation. The high entropy of the

    Unassigned class is due to the fact that this class

    accumulates hundreds of other known functional classes and

    all the unknown ones.

Table 3 The entropies of class representations at the sNet-
SOM configuration of Figure 7.

Class                                            Entropy
1. Tricarboxylic-acid pathway (TCA)              1.96
2. Respiration-chain complexes (Resp)            1.82
3. Cytoplasmic ribosomal proteins (Ribo)         1.21
4. Proteasome (Proteas)                          0.51
5. Histones                                      0.60
6. Helix-turn-helix                              2.78
7. Rest and functionally unknown
   classes (Unassigned)                          4.13

    Many functional classes of genes present strong similarity of

    their gene expression patterns. This is evident in Figure 4,

    where we can observe a high similarity of the gene

    expression patterns of the class Ribo. The identities

    (identifier, description and functional class) of the genes of

Figure 4 are displayed in Figure 6.

    Figure 7 illustrates a snapshot of the progress of the learning

    process. Each sNet-SOM node is colored according to the

predominant class. In addition, three numbers are continuously updated for each node. The first is the numeric identifier of the prevailing class. The second depends on the type of training: for supervised training it is the entropy of the node. Nodes with high entropy lie near class separation boundaries, and their patterns can be used to train efficient supervised models, such as Support Vector Machines, for the effective discrimination of these parts of the state space.

    For the unsupervised expansion this number is a resource

    count (usually the local quantization error) that controls the

    positions of the dynamic expansion. Finally, the third

    number is the number of patterns mapped to the node.
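The following batch sketch shows how these three per-node quantities could be recomputed from the current codebook; the actual sNet-SOM maintains them incrementally during training, and the array shapes and names below are our assumptions.

    import numpy as np

    def node_statistics(weights, patterns, labels=None):
        """For every node return (prevailing class, entropy or resource count,
        number of mapped patterns). weights: (n_nodes, dim) codebook vectors;
        patterns: (n_patterns, dim); labels: optional integer class labels."""
        # distance of every pattern to every node; the winner is the closest node
        d = np.linalg.norm(patterns[:, None, :] - weights[None, :, :], axis=-1)
        winner = d.argmin(axis=1)
        stats = []
        for n in range(weights.shape[0]):
            idx = np.flatnonzero(winner == n)
            if labels is not None and idx.size > 0:
                counts = np.bincount(labels[idx])
                p = counts[counts > 0] / idx.size
                second = float(-(p * np.log2(p)).sum())   # node entropy (bits)
                prevailing = int(counts.argmax())
            else:
                second = float(d[idx, n].sum())           # resource count (local error)
                prevailing = -1                           # no class information
            stats.append((prevailing, second, idx.size))
        return stats, winner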

    Figure 8 displays a listbox with the characteristics of all the

    nodes of the sNetSOM. The first two columns are the grid

    coordinates of the node. The third column is the entropy of

    the node and the fourth is the number of genes mapped to

    the node. Finally, the last column is the name of the class

    that the node represents. The biologist can obtain further

    information about the genes mapped to the node by selecting

    the corresponding element of the listbox. The main

    parameters that control the operation of the sNet-SOM are

user-defined through a parameter setting screen, illustrated in Figure 9.


    8. Conclusions and future work

    This work has presented a new self-growing adaptive neural

    network model for the analysis of genome-wide expression

data. This model, called sNet-SOM, elegantly overcomes the main drawback of most existing clustering methods, which impose an a priori specification of the number of

    clusters. The sNet-SOM determines adaptively the number

    of clusters with a dynamic extension process which is able

    to exploit class information whenever available.

    The sNet-SOM grows within a rectangular grid that

    provides the potential for the implementation of efficient

    training algorithms. The expansion of the sNet-SOM is

    based on an adaptive process. This process grows nodes at

    the boundary nodes, ripples weights from the internal nodes

    towards the outer nodes of the grid, and inserts whole

    columns within the map. The growing algorithm is simple

    and computationally effective. It prefers to grow from the

    boundary nodes in order to minimize the map readjustment

    operations. However, a mechanism for whole column (row)

insertion is implemented in order to deal with the case in which a large map must be expanded around a point deep within its interior. The growing process determines automatically the appropriate level of expansion, so that the similarity between the gene expression patterns of the same cluster fulfills a designer-definable statistical confidence level of not being a random event. The voting schemes for

    the winner node have been designed in order to amplify the

    representation of rare gene expression patterns.
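One way such a confidence criterion might be realized is sketched below; this is our own illustrative reading, assuming Pearson correlation as the similarity measure and the standard t-test for the significance of a correlation coefficient over m experimental conditions. The text does not commit to this particular test.

    import numpy as np
    from scipy import stats

    def cluster_is_coherent(profiles, alpha=0.05):
        """Test whether the mean pairwise correlation of the expression
        profiles mapped to a node could plausibly be a random event.
        profiles: (n_genes, m_conditions) array. Grow further while False."""
        m = profiles.shape[1]
        r = np.corrcoef(profiles)                    # pairwise correlations
        r_mean = r[np.triu_indices_from(r, k=1)].mean()
        # t statistic of a correlation coefficient, m - 2 degrees of freedom
        t = r_mean * np.sqrt((m - 2) / max(1e-12, 1.0 - r_mean ** 2))
        p = 2 * stats.t.sf(abs(t), df=m - 2)
        return p < alpha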

    A novel feature of the sNet-SOM compared with other

related approaches is its potential for the effective exploitation of the available class information with an entropy-based measure that controls the dynamical extension

    process. This process extracts information about the

    structure of the decision boundaries. A supervised network

can additionally be connected in order to better resolve the difficult parts of the state space. This hybrid approach

    (i.e. unsupervised competitive learning for the simple parts

    of the state space and supervised for the difficult ones) can

compete in performance with advanced supervised learning models at a much lower computational cost. In essence, the

    sNet-SOM can utilize the pure supervised machinery only

    where it is needed, i.e. for the construction of complex

    decision boundaries over regions of the state space where

    patterns cannot be separated easily.
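A sketch of this division of labor follows, reusing the hypothetical node_statistics routine sketched earlier: patterns falling on low-entropy nodes keep the prevailing class of their node, while the patterns of high-entropy nodes are gathered to train the supervised stage. The 1.0-bit threshold is an arbitrary illustrative choice.

    import numpy as np

    def split_for_hybrid_training(stats, winner, patterns, labels, h_threshold=1.0):
        """Route the patterns of high-entropy (boundary) nodes to the
        supervised stage; stats and winner come from node_statistics above."""
        ambiguous = [n for n, (_, h, cnt) in enumerate(stats)
                     if cnt > 0 and h > h_threshold]
        mask = np.isin(winner, ambiguous)
        # patterns[mask] form the training set of the supervised (e.g. SVM)
        # stage; the rest are classified directly by their winner node
        return patterns[mask], labels[mask]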

Another way to incorporate supervised learning into the sNet-

    SOM is to use the nodes as Radial Basis Function centers

    and to model the classification of a gene as a nonlinear

    function of the gene expression templates represented by

    the adjacent nodes. This approach resembles qualitatively

the supervised harvesting approach of [15]. The node

    average profiles can be used as inputs to a supervised phase.

This reduces the redundancy of information and prevents overfitting of the training set. Proper parameters of these centers can be estimated by heuristic criteria such as signal counters, local errors, and node entropies, which provide valuable local information.
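The sketch below illustrates this RBF variant under our own assumptions: Gaussian basis functions centered on the codebook vectors, per-center widths derived from local node statistics, and a ridge-regularized least-squares output layer; none of these choices is prescribed here.

    import numpy as np

    def rbf_design_matrix(patterns, centers, widths):
        """Gaussian RBF activations with the sNet-SOM codebooks as centers;
        widths may be set, e.g., from each node's local quantization error."""
        d2 = ((patterns[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * widths[None, :] ** 2))

    def fit_rbf_classifier(patterns, one_hot_labels, centers, widths, ridge=1e-3):
        """Ridge-regularized least-squares output layer on the fixed centers."""
        Phi = rbf_design_matrix(patterns, centers, widths)
        A = Phi.T @ Phi + ridge * np.eye(Phi.shape[1])
        W = np.linalg.solve(A, Phi.T @ one_hot_labels)
        return W   # predict with rbf_design_matrix(x, centers, widths) @ W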


The sNet-SOM dynamical extension algorithm remains similar in the more usual case, in the context of gene expression analysis, where no classification information is available. In this case, criteria based on the computation of local variances or resource counts are implemented.
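A possible form of the unsupervised growth decision, consistent with the boundary-growth and column-insertion mechanisms described above, is sketched below; the threshold policy and data layout are our assumptions.

    def expansion_candidates(resource, grid_shape, threshold):
        """Grid positions whose resource count (accumulated local quantization
        error) exceeds the growth threshold; resource is a 2-D numpy array."""
        rows, cols = grid_shape
        hot = [(i, j) for i in range(rows) for j in range(cols)
               if resource[i, j] > threshold]
        # growing from boundary nodes minimizes map readjustments; a hit deep
        # in the interior instead triggers whole column (or row) insertion
        boundary = [(i, j) for (i, j) in hot
                    if i in (0, rows - 1) or j in (0, cols - 1)]
        return boundary if boundary else hot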

    Moreover, in order to enhance the exploratory potential of

    the sNet-SOM for the analysis of the gene expression data,

    we have adapted the Sammon distance preserving nonlinear

    mapping. The Sammon mapping allows an effective

visualization of the intrinsic structure of the sNet-SOM codebook vectors even in the unsupervised case. We will provide an extensive discussion of the application of the Sammon mapping in the context of the sNetSOM for the effective visualization of gene expression data in a forthcoming work. Also, another main direction for the

    improvement of the sNetSOM performance is the

    incorporation of more advanced distance metrics to its

algorithms, such as the Bayesian one proposed in [18]. The incorporation of the presented sNet-SOM dynamic growing algorithms as a front-end processing stage within Bayesian network structure learning algorithms [13] is also an open area for future work.
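For reference, a minimal gradient-descent sketch of the Sammon projection of the codebook vectors follows; the random initialization, learning rate, and iteration count are our assumptions, and the rows of X are assumed distinct so that all input-space distances are nonzero.

    import numpy as np

    def sammon(X, n_iter=500, lr=0.3, seed=0):
        """Project the rows of X to 2-D by gradient descent on Sammon's stress
        E = (1/c) * sum_{i<j} (D_ij - d_ij)^2 / D_ij with c = sum_{i<j} D_ij."""
        rng = np.random.default_rng(seed)
        n = X.shape[0]
        D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        c = D[np.triu_indices(n, 1)].sum()
        Dn = D.copy()
        np.fill_diagonal(Dn, 1.0)                  # avoid division by zero
        Y = rng.normal(scale=1e-2, size=(n, 2))
        for _ in range(n_iter):
            diff = Y[:, None, :] - Y[None, :, :]
            d = np.linalg.norm(diff, axis=-1)
            np.fill_diagonal(d, 1.0)
            ratio = (Dn - d) / (d * Dn)            # zero on the diagonal below
            np.fill_diagonal(ratio, 0.0)
            grad = -(2.0 / c) * (ratio[:, :, None] * diff).sum(axis=1)
            Y -= lr * grad
        return Y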


    ACKNOWLEDGEMENTS

    The authors wish to thank the Research Committee of the

    University of Patras for the partial financial support of this

research with the contract Karatheodoris 2454.

    References

[1] Alahakoon Damminda, Halgamuge Saman K., Srinivasan Bala, "Dynamic Self-Organizing Maps with Controlled Growth for Knowledge Discovery", IEEE Transactions on Neural Networks, Vol. 11, No. 3, pp. 601-614, May 2000.
[2] Azuaje Francisco, "A Computational Neural Approach to Support the Discovery of Gene Function and Classes of Cancer", IEEE Transactions on Biomedical Engineering, Vol. 48, No. 3, pp. 332-339, March 2001.
[3] Bezerianos A., Vladutu L., Papadimitriou S., "Hierarchical State Space Partitioning with the Network Self-Organizing Map for the effective recognition of the ST-T Segment Change", Medical & Biological Engineering & Computing, Vol. 38, pp. 406-415, 2000.
[4] Bishop C. M., Neural Networks for Pattern Recognition, Clarendon Press, Oxford, 1996.
[5] Brazma Alvis, Vilo Jaak, "Gene expression data analysis", FEBS Letters, 480, pp. 17-24, 2000.
[6] Brown Michael P. S., Grundy William Noble, Lin David, Cristianini Nello, Sugnet Charles Walsh, Furey Terrence S., Ares Manuel Jr., Haussler David, "Knowledge-based Analysis of Microarray Gene Expression Data By Using Support Vector Machines", Proceedings of the National Academy of Sciences, Vol. 97, No. 1, pp. 262-267, 2000.
[7] Campos Marcos M., Carpenter Gail A., "S-TREE: self-organizing trees for data clustering and online vector quantization", Neural Networks, 14, pp. 505-525, 2001.
[8] Cheeseman P., Stutz J., "Bayesian Classification (AutoClass): Theory and results", in Fayyad U., Piatetsky-Shapiro G., Smyth P., Uthurusamy R. (eds), Advances in Knowledge Discovery and Data Mining, pp. 153-180, AAAI Press, Menlo Park, CA, 1995.
[9] Cheng Guojian, Zell Andreas, "Externally Growing Cell Structures for Data Evaluation of Chemical Gas Sensors", Neural Computing & Applications, 10, pp. 89-97, 2001.
[10] Cheung Vivian G., Morley Michael, Aguilar Francisco, Massimi Aldo, Kucherlapati Raju, Childs Geoffrey, "Making and reading microarrays", Nature Genetics Supplement, Vol. 21, January 1999.
[11] Durbin R., Eddy S., Krogh A., Mitchison G., Biological Sequence Analysis, Cambridge University Press, 1998.
[12] Eisen Michael B., Spellman Paul T., Brown Patrick O., Botstein David, "Cluster analysis and display of genome-wide expression patterns", Proc. Natl. Acad. Sci. USA, Vol. 95, pp. 14863-14868, December 1998.
[13] Friedman N., Linial M., Nachman I., Pe'er D., "Using Bayesian networks to analyze expression data", Journal of Computational Biology, 7, pp. 601-620, 2000.
[14] Fritzke Bernd, "Growing Grid - a self-organizing network with constant neighborhood range and adaptation strength", Neural Processing Letters, Vol. 2, No. 5, pp. 9-13, 1995.
[15] Hastie Trevor, Tibshirani Robert, Botstein David, Brown Patrick, "Supervised harvesting of expression trees", Genome Biology, 2 (1), 2001, http://genomebiology.com/2001/2/I
[16] Haykin S., Neural Networks, Prentice Hall International, Second Edition, 1999.
[17] Herrero Javier, Valencia Alfonso, Dopazo Joaquin, "A hierarchical unsupervised growing neural network for clustering gene expression patterns", Bioinformatics, Vol. 17, No. 2, pp. 126-136, 2001.
[18] Hunter Lawrence, Taylor Ronald C., Leach Sonia M., Simon Richard, "GEST: a gene expression search tool based on a novel Bayesian similarity metric", Bioinformatics, Vol. 17, Suppl. 1, pp. 115-122, 2001.
[19] Joachims Thorsten, "Making Large-Scale SVM Learning Practical", in Scholkopf Bernhard, Burges Christopher J. C., Smola Alexander J. (eds), Advances in Kernel Methods - Support Vector Learning, MIT Press, Cambridge, USA, 1998.
[20] Kohonen T., Self-Organizing Maps, Springer-Verlag, Second Edition, 1997.
[21] Pal Nikhil R., Eluri Vijay Kumar, "Two Efficient Connectionist Schemes for Structure Preserving Dimensionality Reduction", IEEE Transactions on Neural Networks, Vol. 9, No. 6, pp. 1142-1154, November 1998.
[22] Papadimitriou S., Mavroudi S., Vladutu L., Bezerianos A., "Ischemia Detection with a Self-Organizing Map Supplemented by Supervised Learning", IEEE Transactions on Neural Networks, Vol. 12, No. 3, pp. 503-515, May 2001.
[23] Si J., Lin S., Vuong M. A., "Dynamic topology representing networks", Neural Networks, 13, pp. 617-627, 2000.
[24] Tamayo P., Slonim D., Mesirov J., Zhu Q., Kitareewan S., Dmitrovsky E., Lander E. S., Golub T. R., "Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation", Proc. Natl. Acad. Sci. USA, 96, pp. 2907-2912, 1999.
[25] Troyanskaya Olga, Cantor Michael, Sherlock Gavin, Brown Pat, Hastie Trevor, Tibshirani Robert, Botstein David, Altman Russ B., "Missing value estimation methods for DNA microarrays", Bioinformatics, Vol. 17, No. 6, 2001.
[26] Vapnik V. N., Statistical Learning Theory, John Wiley & Sons, New York, 1998.
[27] Vesanto Juha, Alhoniemi Esa, "Clustering of the Self-Organizing Map", IEEE Transactions on Neural Networks, Vol. 11, No. 3, pp. 586-600, May 2000.


Figure 4 The expression profiles of the genes clustered at an sNet-SOM node of class Ribo. A few patterns of the remaining classes that present very similar expression profiles also map to this node.


Figure 5 The average expression profile for the genes plotted in Figure 4.


Figure 6 The identities of the genes as plotted in Figure 4, from the back of the figure towards its front (in the 3D view). The biologist can easily extract useful information about which of the genes of the unassigned class present expression profiles similar to those of the genes of class Ribo.


Figure 7 The outline of the configuration of the growing sNetSOM is displayed graphically, illustrating the progress of the learning process to the user. The nodes that represent the Helix-Turn-Helix class are colored blue. It is visually evident that these nodes are much more dispersed than the differently colored nodes that represent the other classes.


Figure 8 The listbox that displays the characteristics of the nodes of the sNetSOM. The first two columns are the grid coordinates of the node. The third column is the entropy of the node and the fourth is the number of genes mapped to the node. Finally, the last column is the name of the class that the node represents.


Figure 9 The parameter configuration screen allows the user to directly control the main parameters of the sNetSOM.