F.-M. Schleif, B. Hammer and T. Villmann: Margin based Active Learning for LVQ Networks


Margin based Active Learning for LVQ Networks

F.-M. Schleif a

a Univ. of Leipzig, Dept. of Mathematics and Computer Science, Germany

    B. Hammer b

b Clausthal Univ. of Technology, Dept. of Computer Science, Clausthal, Germany

    T. Villmann c

c Univ. of Leipzig, Dept. of Medicine, Clinic for Psychotherapy, Leipzig, Germany

    Abstract

In this article, we extend a local prototype-based learning model by active learning, which gives the learner the capability to select training samples during the model adaptation procedure. The proposed active learning strategy aims at an improved generalization ability of the final model. This is achieved by use of an adaptive query strategy which is more adequate for supervised learning than a simple random approach. Besides an improved generalization ability, the method also improves the speed of the learning procedure, which is especially beneficial for large data sets with multiple similar items. The algorithm is based on the idea of selecting a query on the borderline of the actual classification. This can be done by considering margins in an extension of learning vector quantization based on an appropriate cost function. The proposed active learning approach is analyzed for two kinds of learning vector quantizers, the supervised relevance neural gas and the supervised nearest prototype classifier, but is applicable to a broader set of prototype-based learning approaches as well. The performance of the query algorithm is demonstrated on synthetic and real-life data taken from clinical proteomic studies. From the proteomic studies, high-dimensional mass spectrometry measurements were calculated which are believed to contain features discriminating the different classes. Using the proposed active learning strategies, the generalization ability of the models could be kept or improved, accompanied by a significantly improved learning speed. Both of these characteristics are important for the generation of predictive clinical models and were used in an initial biomarker discovery study.

Key words: active learning, learning vector quantization, generalization, classification, proteomic profiling

    Preprint submitted to Elsevier Science 28 September 2006


    1 Introduction

In supervised learning, we are frequently interested in training a classifier such that the underlying (unknown) target distribution is well estimated. Whereas traditional approaches usually adapt the model according to all available and randomly sampled training data, the field of active learning restricts itself to only a few actively selected samples. This method avoids the shortcoming of traditional approaches that the average amount of new information per sample decreases during learning and that additional data from some regions are basically redundant. Further, it accounts for the phenomenon, increasingly common e.g. in bioinformatics or web search, that unlabeled data are abundant whereas reliable labeling is costly. Variants of active and query based learning have been proposed quite early for neural models (1; 2). Basically, two different kinds of active learning exist: active learning based on an oracle, where one can demand arbitrary points of the input space (1), and methods where one can choose different points from a given training set (3). We focus on the second strategy.

In query algorithms proposed so far, samples are chosen according to some heuristic, e.g. (1), or in a principled way by optimizing an objective function such as the expected information gain of a query, e.g. (2), or the model uncertainty, e.g. (4). A common feature of these query algorithms, however, is that they have been applied to global learning algorithms. Only a few approaches incorporate active strategies into local learning, such as (5), where a heuristic query strategy for simple vector quantization is proposed. In this paper we include active learning in two recently proposed margin-based, potentially kernelized learning vector quantization approaches, which combine the good generalization ability of margin optimization with the intuitiveness of prototype-based local learners where subunits compete for predominance in a region of influence (6; 7).

We now briefly review the basics of this kernel-extension of LVQ and its accompanying learning theoretical generalization bounds, from which we derive a margin based active learning strategy. We demonstrate the benefit of this mode by comparing the classification performance of the algorithm on randomly selected training data and active strategies for several data sets stemming from clinical proteomics as well as synthetic data.

Frank-Michael Schleif: Bruker Daltonik GmbH, Permoserstrasse 15, D-04318 Leipzig, Germany, Tel: +49 341 24 31-408, Fax: +49 341 24 31-404, [email protected]


    1.1 Generalized relevance learning vector quantization

Standard LVQ and variants as proposed by Kohonen constitute popular, simple and intuitive prototype based methods, but they are purely heuristically motivated local learners (8). They suffer from instabilities for overlapping classes, strongly depend on the initialization of the prototypes, and are restricted to classification scenarios in Euclidean space. Generalized relevance learning vector quantization (GRLVQ) has been introduced by the authors to cope with these problems (6). It is based on a cost function such that neighborhood incorporation, integration of relevance learning, and kernelization of the approach become possible, which gives Supervised Neural Gas (SNG) and Supervised Relevance Neural Gas, denoted as SRNG (6). The method can be accompanied by a large margin generalization bound (9), which is directly connected to the cost function of the algorithm and which opens the way towards active learning strategies, as we will discuss in this article.

We first introduce the basic algorithm. Input vectors are denoted by v and their corresponding class labels by c_v; L is the set of labels (classes). Let V ⊆ R^{D_V} be a set of inputs v. The model uses a fixed number of representative prototypes (weight vectors, codebook vectors) for each class. Let W = {w_r} be the set of all codebook vectors and c_r be the class label of w_r. Furthermore, let W_c = {w_r | c_r = c} be the subset of prototypes assigned to class c ∈ L. The task of vector quantization is realized by the map Ψ as a winner-takes-all rule, i.e. a stimulus vector v ∈ V is mapped onto that neuron s ∈ A the pointer w_s of which is closest to the presented stimulus vector v,

Ψ_{V→A} : v ↦ s(v) = argmin_{r ∈ A} d^λ(v, w_r)    (1)

with d^λ(v, w) being an arbitrary differentiable similarity measure, which may depend on a parameter vector λ. The subset of the input space

Ω_r = {v ∈ V : r = Ψ_{V→A}(v)}    (2)

which is mapped to a particular neuron r according to (1), forms the (masked) receptive field of that neuron. If the class information of the weight vector is used, the boundaries ∂Ω_r generate the decision boundaries for classes. A training algorithm should adapt the prototypes such that, for each class c ∈ L, the corresponding codebook vectors W_c represent the class as accurately as possible.
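As an illustration, the winner-takes-all map (1) with the weighted squared Euclidean measure can be sketched in a few lines. This is a minimal sketch with our own function and variable names, not the authors' implementation:

```python
import numpy as np

def winner(v, W, lam):
    """Winner-takes-all rule (1): index of the prototype w_r minimizing
    the weighted squared Euclidean measure sum_i lam_i * (v_i - w_i)^2."""
    d = (lam * (v - W) ** 2).sum(axis=1)  # one distance per prototype row
    return int(np.argmin(d))

# toy example: two prototypes in R^2 with uniform relevance factors
W = np.array([[0.0, 0.0], [1.0, 1.0]])
lam = np.array([0.5, 0.5])               # normalized: sum_i lam_i = 1
s = winner(np.array([0.1, 0.2]), W, lam)
print(s)  # prototype 0 is closest
```

The receptive field Ω_r of (2) is then simply the set of all inputs for which `winner` returns r.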

To achieve this goal, GRLVQ optimizes the following cost function, which is related to the number of misclassifications of the prototypes, via a stochastic gradient descent:

Cost_GRLVQ = Σ_v f(μ(v))   with   μ(v) = (d_{r+} − d_{r−}) / (d_{r+} + d_{r−})    (3)

where f(x) = (1 + exp(−x))^{−1} is the standard logistic function, d_{r+} is the similarity of the input vector v to the nearest codebook vector labeled with c_{r+} = c_v, say w_{r+}, and d_{r−} is the similarity measure to the best matching prototype labeled with c_{r−} ≠ c_v, say w_{r−}. Note that the term μ(v) scales the difference of the closest two competing prototypes to (−1, 1); negative values correspond to correct classifications. The learning rule can be derived from the cost function by using the derivative. As shown in (10), this cost function shows robust behavior whereas the original LVQ2.1 yields divergence. SRNG combines this method with neighborhood cooperation as derived in (6) and thus avoids that the algorithm gets trapped in local optima of the cost function. In subsequent experiments, we choose the weighted Euclidean metric d^λ(v, w) = Σ_i λ_i (v_i − w_i)² as metric.
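The quantities entering the cost function (3) can be computed directly from the definitions above. The following sketch (our own naming, weighted squared Euclidean metric as in the experiments) evaluates μ(v) and the resulting cost:

```python
import numpy as np

def mu(v, c_v, W, labels, lam):
    """Relative margin mu(v) = (d_{r+} - d_{r-}) / (d_{r+} + d_{r-}) of eq. (3)."""
    d = (lam * (v - W) ** 2).sum(axis=1)
    d_plus = d[labels == c_v].min()    # closest prototype with correct label
    d_minus = d[labels != c_v].min()   # closest prototype with a wrong label
    return (d_plus - d_minus) / (d_plus + d_minus)

def grlvq_cost(V, C, W, labels, lam):
    """Cost (3): logistic-squashed mu summed over all training samples."""
    f = lambda x: 1.0 / (1.0 + np.exp(-x))  # standard logistic function
    return sum(f(mu(v, c, W, labels, lam)) for v, c in zip(V, C))
```

A correctly classified sample yields μ(v) < 0 and therefore contributes a summand below 1/2.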

An alternative approach is the Soft Nearest Prototype Classification and its kernelized variants incorporating metric adaptation (7), which are referred to as SNPC. The SNPC is based on an alternative cost function which can be interpreted as a Gaussian mixture model (GMM) approach aiming at empirical risk minimization. In the following, the SNPC is reviewed very briefly. Thereby we restrict ourselves to the SNPC cost function and some further important parts which are affected by the introduction of an active learning strategy. For details on the derivation of SNPC and the ordinary learning dynamic we refer to (7).

    2 Soft nearest prototype classification

We keep the generic notations from the former section and subsequently review the Soft Nearest Prototype Classification (SNPC). SNPC has been proposed as an alternative stable NPC learning scheme. It introduces soft assignments of data vectors to the prototypes, which have a statistical interpretation as normalized Gaussians. In the original SNPC as provided in (11) one considers the cost

function

E(S) = (1/N_S) Σ_{k=1}^{N_S} Σ_r u_τ(r|v_k) (1 − α_{r,c_{v_k}})    (4)

with S = {(v, c_v)} the set of all input pairs and N_S = #S. The value α_{r,c_{v_k}} equals one if c_{v_k} = c_r and 0 otherwise. u_τ(r|v_k) is the probability that the input vector v_k is assigned to the prototype r. A crisp winner-takes-all mapping (1) would yield u_τ(r|v_k) = δ(r = s(v_k)).

In order to minimize (4), in SNPC the variables u_τ(r|v_k) are taken as soft assignment probabilities. This allows a gradient descent on the cost function (4). As proposed in (11), the probabilities are chosen as normalized Gaussians

u_τ(r|v_k) = exp(−d(v_k, w_r)/(2τ²)) / Σ_{r'} exp(−d(v_k, w_{r'})/(2τ²))    (5)

whereby d is the distance measure used in (1) and τ is the bandwidth, which has to be chosen adequately. Then the cost function (4) can be rewritten as

E_soft(S) = (1/N_S) Σ_{k=1}^{N_S} lc((v_k, c_{v_k}))    (6)

    with local costs

lc((v_k, c_{v_k})) = Σ_r u_τ(r|v_k) (1 − α_{r,c_{v_k}})    (7)

i.e., the local error is the sum of the assignment probabilities to all prototypes of an incorrect class and, hence, lc((v_k, c_{v_k})) ≤ 1, with local costs depending on the whole set W. Because the local costs lc((v_k, c_{v_k})) are continuous and bounded, the cost function (6) can be minimized by stochastic gradient descent using the derivative of the local costs as shown in (11). All prototypes are adapted in this scheme according to the soft assignments. Note that for small bandwidth τ, the learning rule is similar to LVQ2.1.
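The soft assignments (5) and local costs (7) translate directly into code. The sketch below (our own naming, plain squared Euclidean d) illustrates both:

```python
import numpy as np

def soft_assignments(v, W, tau):
    """Normalized Gaussians u_tau(r|v) of eq. (5)."""
    d = ((v - W) ** 2).sum(axis=1)          # squared Euclidean distances
    g = np.exp(-d / (2.0 * tau ** 2))
    return g / g.sum()                      # probabilities, sum to 1

def local_cost(v, c_v, W, labels, tau):
    """Local cost (7): total assignment probability to wrong-class prototypes."""
    u = soft_assignments(v, W, tau)
    return u[labels != c_v].sum()           # bounded by 1
```

For a sample lying on a prototype of its own class and a small bandwidth τ, the local cost approaches 0; it approaches 1 when only wrong-class prototypes are nearby.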

    2.1 Relevance learning for SNPC

Like all NPC algorithms, SNPC heavily relies on the metric d, usually the standard Euclidean metric. For high-dimensional data as they occur in proteomic patterns, this choice is not adequate, since noise present in the data set accumulates and likely disrupts the classification. Thus, a focus on the (a priori unknown) relevant parts of the inputs would be much more suited. Relevance learning as introduced in (12) offers the opportunity to learn metric parameters which account for the different relevance of the input dimensions during training. In analogy to the above learning approaches, this relevance learning idea is included into SNPC, leading to SNPC-R. Instead of the metric d(v_k, w_r), the metric can now be parameterized by incorporating adaptive relevance factors, giving d^λ(v_k, w_r), which is included into the soft assignments (5), whereby the component λ_k of λ is usually chosen as weighting parameter for input dimension k. The relevance parameters λ_k can be adjusted according to the given

training data, taking the derivative of the cost function, i.e. ∂lc((v_k, c_{v_k}))/∂λ_j, using the local cost (7):


∂lc((v_k, c_{v_k}))/∂λ_j = −(1/(2τ²)) Σ_r u_τ(r|v_k) · (∂d_r^λ/∂λ_j) · ((1 − α_{r,c_{v_k}}) − lc((v_k, c_{v_k})))    (8)

with subsequent normalization of the λ_k.

It is worth emphasizing that SNPC-R can also be used with individual metric parameters λ_r for each prototype w_r, or with a class-wise metric shared between prototypes with the same class label c_r, as it is done here, referred to as localized SNPC-R (LSNPC-R). If the metric is shared by all prototypes, LSNPC-R reduces to SNPC-R. The respective adaptation of the relevance parameters λ can easily be determined in complete analogy to (8).
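A single relevance step then consists of a gradient step on the local cost followed by the normalization of the λ_k. A hedged sketch (the clipping to non-negative factors is our own choice for illustration, not prescribed in the text):

```python
import numpy as np

def relevance_update(lam, grad, eta=0.01):
    """One relevance-learning step: descend along the gradient (8) of the
    local cost, clip to non-negative factors, and renormalize so that the
    relevance profile satisfies sum_k lam_k = 1."""
    lam = lam - eta * grad
    lam = np.clip(lam, 0.0, None)   # keep weighting factors non-negative
    return lam / lam.sum()
```

The same update applies per prototype or per class for the localized variants λ_r of LSNPC-R.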

It has been pointed out in (13) that NPC classification schemes which are based on the Euclidean metric can be interpreted as large margin algorithms for which dimensionality independent generalization bounds can be derived. Instead of the dimensionality of data, the so-called hypothesis margin, i.e. the distance by which the hypothesis can be altered without changing the classification on the training set, serves as a parameter of the generalization bound. This result has been extended to NPC schemes with adaptive diagonal metric in (6). This fact is quite remarkable, since D_V new parameters, D_V being the input dimension, are added this way; still, the bound is independent of D_V. This result can even be transferred to the setting of individual metric parameters λ_r for each prototype or class, as we will see below, such that a generally good generalization ability of this method can be expected. Apart from the fact that (possibly local) relevance factors allow a larger flexibility of the approach without decreasing the generalization ability, they are of particular interest for proteomic pattern analysis because they indicate potentially semantically meaningful positions.

    Our active learning approach holds for each kind of such LVQ-type learning.

    3 Margin based active learning

The first dimensionality independent large margin generalization bound for LVQ classifiers has been provided in (13). For GRLVQ-type learning, a further analysis is possible which accounts for the fact that the similarity measure is adaptive during training (9). Here, we sketch the argumentation as provided in (9) to derive a bound for a slightly more general situation where different local adaptive relevance terms are attached to the prototypes.


    Theoretical generalization bound for fixed margin

Assume, for the moment, that a two-class problem with labels {−1, 1} is given 1. We assume that an NPC classification scheme is used whereby the locally weighted squared Euclidean metric determines the receptive fields:

v ↦ argmin_{r ∈ A} Σ_l λ_{rl} (v_l − (w_r)_l)²

where l denotes the components of the vectors and Σ_l λ_{rl} = 1. We further assume that data are chosen i.i.d. according to the data distribution P(V), whose support is limited by a ball of radius B, and that the class labels are determined by an unknown function. Generalization bounds limit the error, i.e. the probability that the learned classifier does not classify given data correctly:

E_P(Ψ) = P(c_v ≠ Ψ_{V→A}(v))    (9)

Note that this error captures the performance for GRLVQ/SRNG networks as well as SNPC-R learning with local adaptive diagonal metric. Given a classifier Ψ and a sample (v, c_v), we define the margin as

M(v, c_v) = −d_{r+}^{λ_{r+}} + d_{r−}^{λ_{r−}}    (10)

i.e. the difference of the distances of the data point from the closest correct and the closest wrong prototype. (To be precise, we refer to the absolute value as the margin.) For a fixed parameter ρ ∈ (0, 1), the loss function is defined as

L : R → R,  t ↦  1          if t ≤ 0
                 1 − t/ρ    if 0 < t ≤ ρ
                 0          otherwise        (11)

The term

E^L_m(Ψ) = Σ_{v ∈ V} L(M(v, c_v)) / |V|    (12)

denotes the empirical error on the training data. It counts the data points which are classified incorrectly and, in addition, punishes all data points with too small a margin.
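The margin (10), the loss (11) and the empirical error (12) can be stated compactly. The following sketch uses our own function names and the locally weighted squared Euclidean metric from above:

```python
import numpy as np

def margin(v, c_v, W, labels, lam):
    """Margin (10): distance to the closest wrong prototype minus the
    distance to the closest correct one (positive = correct classification)."""
    d = (lam * (v - W) ** 2).sum(axis=1)
    return d[labels != c_v].min() - d[labels == c_v].min()

def loss(t, rho):
    """Piecewise linear loss (11) for margin parameter rho in (0, 1)."""
    if t <= 0:
        return 1.0                   # misclassified
    if t <= rho:
        return 1.0 - t / rho         # correct, but margin too small
    return 0.0                       # safely classified

def empirical_error(V, C, W, labels, lam, rho):
    """Empirical error (12): average loss over the training set."""
    return float(np.mean([loss(margin(v, c, W, labels, lam), rho)
                          for v, c in zip(V, C)]))
```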

We can now use techniques from (14) in analogy to the argumentation in (9) to relate the error (9) and the empirical error (12). As shown in (14) (Theorem 7), one can express a bound for the deviation of the empirical error E^L_m(Ψ)

1 These constraints are technical to derive the generalization bounds, which have already been derived by two of the authors in (9); the active learning strategies also work for more than two classes and alternative metrics.


and the error E_P(Ψ) by the Gaussian complexity of the function class defined by the model:

E_P(Ψ) ≤ E^L_m(Ψ) + (2K/ρ) · G_m + √( ln(4/δ) / (2m) )

with probability at least 1 − δ/2. K is a universal constant. G_m denotes the Gaussian complexity, i.e. it is the expectation

G_m = E_{v_1,...,v_m} E_{g_1,...,g_m} sup_F | (2/m) Σ_{i=1}^m g_i F(v_i) |

where the expectation is taken w.r.t. independent Gaussian variables g_1, ..., g_m with zero mean and unit variance and i.i.d. points v_i sampled according to the marginal distribution induced by P. The supremum is taken over all NPC classifiers F with prototypes W of length at most B.

The Gaussian complexity of NPC networks with local adaptive diagonal metric can easily be estimated using techniques from (14): the classifier can be expressed as a Boolean formula over the results of classifiers with only two prototypes. Thereby, at most |W|(|W| − 1) such terms exist. For two prototypes i and j, the corresponding output can be described by the sum of a simple quadratic form and a linear term:

d_i^{λ_i} − d_j^{λ_j} ≥ 0
⟺ (v − w_i)ᵗ Λ_i (v − w_i) − (v − w_j)ᵗ Λ_j (v − w_j) ≥ 0
⟺ vᵗ Λ_i v − vᵗ Λ_j v − 2 (Λ_i w_i − Λ_j w_j)ᵗ v + (w_i)ᵗ Λ_i w_i − (w_j)ᵗ Λ_j w_j ≥ 0

where Λ_i denotes the diagonal matrix of the relevance factors λ_i.

Since the size of the prototypes and the size of the inputs are restricted by B, and since λ is normalized to 1, we can estimate the empirical Gaussian complexity by the sum of

4 · B(B + 1)(B + 2) · √m / m

for the linear term (including the bias) and

2 · B² · √m / m

for the quadratic term, using (14) (Lemma 22). The Gaussian complexity differs from the empirical Gaussian complexity by at most ε with probability at


least 1 − 2 exp(−ε²m/8). Putting these bounds together, the overall bound

P( E_P(Ψ) > E^L_m(Ψ) + (K/ρ) · √( (ln |V| + ln(1/δ)) / |V| ) · |W|² B³ ) ≤ δ    (13)

results for every confidence δ ∈ (0, 1), K being a universal constant. This bound holds for every prototype based learning algorithm with diagonal Euclidean metric and adaptive relevance parameters, as long as the absolute sum of the relevance parameters is restricted by 1, whereby the parameters may even be adapted locally for each prototype vector.

This formula provides an estimation of the difference between the real error of a trained classifier and the empirical error measured on the training set. Its form is very similar to generalization bounds of learning algorithms as derived in the framework of VC-theory (15). Since these bounds are worst case bounds, they usually differ strongly from the exact test error and can give only an indication of the relevance of certain parameters. However, this information is sufficient to derive useful strategies to guide active learning.

Note that the bound (13) does not include the data dimensionality, but the margin ρ. This bound holds for the scaled squared Euclidean metric with local adaptive relevance terms. It can directly be generalized to kernelized versions of this similarity measure. A sufficient condition for this fact is, e.g., that the measure is symmetric, that it vanishes for two identical arguments, and that its negative is conditionally positive definite. It should be mentioned that the margin (10) occurs as numerator in the cost function of GRLVQ. Hence GRLVQ and SRNG maximize the margin during training according to this cost function. However, the final generalization bound holds for all training algorithms of NPC classifiers, including SNPC and SNPC-R.

    Active learning strategy

The generalization bound in terms of the margin proposes an elegant scheme to transfer margin based active learning to local learners. Margin based sample selection has been presented e.g. in the context of SVM in (16; 3). Obviously, the generalization ability of the GRLVQ algorithm only depends on the points with too small a margin (10). Thus, only the extremal margin values need to be limited, and a restriction of the respective update to extremal pairs of prototypes would suffice. This argument proposes schemes for active data selection if a fixed and static pattern set is available: We fix a monotonically decreasing non-negative function L^c : R → R and actively select training points from a given sample, in analogy to e.g. (3), based on the probability L^c(M(v, c_v)) for sample v. Thereby, different realizations are relevant:


(1) L^c(t) = 1 for t < 0 and L^c(t) ∼ |t| otherwise, i.e. the size of the margin determines the probability of v being chosen, annealed by a parameter (probabilistic strategy).

(2) L^c(t) = 1 if t ≤ ρ, otherwise it is 0. That means, all samples with margin smaller than ρ are selected (threshold strategy).

Both strategies focus on the samples which are not yet sufficiently represented in the model. Therefore, they directly aim at an improvement of the generalization bound (13). Strategy (2) allows an adaptation of the margin parameter ρ during training in accordance with the confidence of the model, in analogy to the recent proposal (3) for SVM. For each codebook vector w_r ∈ W we introduce a new parameter σ_r measuring the mean distance of data points in its receptive field (2) to the current prototype w_r. This parameter can easily be computed during training as a moving average with no extra costs 2. We choose ρ locally as ρ_r = 2σ_r. Thus, points whose margin compares favorably to the size of the receptive fields are already represented with sufficient security and, hence, they are abandoned. For strategy (1), a confidence depending on the distance to the closest correct prototype and the overall classification accuracy can be introduced in a similar way. In doing so, the normalized margin is taken as a probability measure for data selection.
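Both selection schemes can be sketched as filters over the current margins. This is a sketch under our own naming; the exponential decay used in the probabilistic variant is an illustrative choice, as the text only requires a monotonically decreasing L^c:

```python
import numpy as np

rng = np.random.default_rng(0)

def select_threshold(margins, rho):
    """Threshold strategy: query every sample whose margin is at most rho."""
    return [i for i, m in enumerate(margins) if m <= rho]

def select_probabilistic(margins, gamma=1.0):
    """Probabilistic strategy: misclassified samples (margin < 0) are always
    queried; otherwise the query probability decays with the margin size."""
    chosen = []
    for i, m in enumerate(margins):
        p = 1.0 if m < 0 else float(np.exp(-gamma * m))
        if rng.random() < p:
            chosen.append(i)
    return chosen
```

With the local choice of ρ_r described above, `rho` would simply be replaced per sample by the value attached to its winning prototype.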

    Validity of generalization bounds for active strategies

There are a few aspects of the generalization bound (13) which have to be reconsidered in the framework of active learning. The bound (13) is valid for a priorly fixed margin ρ; it is not applicable to the empirical margin observed during training, which is optimized by active learning. However, it is possible to generalize the bound (13) to this setting in the spirit of the luckiness framework of machine learning (17) by fixing prior probabilities of achieving a certain margin. Here, we derive a bound depending on a universal upper bound on the empirical margin and fixed prior probabilities.

Assume the empirical margin can be upper bounded by C > 0. A reasonable upper bound C can e.g. be derived from the data as half the average of the distances of points to their respective closest neighbor with different labeling. Define ρ_i = C/i for i ≥ 1. Choose prior probabilities p_i ≥ 0 with Σ_i p_i = 1 which indicate the confidence in achieving an empirical margin whose size is

2 The extra computational time to determine the active learning control variables is negligible.


at least ρ_i. Set

L_i : R → R,  t ↦  1            if t ≤ 0
                   1 − t/ρ_i    if 0 < t ≤ ρ_i
                   0            otherwise        (14)

in analogy to (11). E^{L_i}_m(Ψ) denotes the empirical error (12) using the loss function L_i. We are interested in the probability

P( ∃i : E_P(Ψ) > E^{L_i}_m(Ψ) + ε(ρ_i) )    (15)

i.e. the probability that the empirical error measured with respect to a loss L_i for any i and the real error deviate by more than ε(ρ_i), where the bound

ε(ρ_i) = (K/ρ_i) · √( (ln |V| + ln(1/(p_i δ))) / |V| ) · |W|² B³

depends on the empirical margin ρ_i. Note that the value ε(ρ_i) is chosen as the bound derived in (13) for the margin ρ_i and confidence p_i δ. The probability (15) upper bounds the probability of a large deviation of the real error and the empirical error with respect to a loss function which is associated to the empirical margin observed a posteriori on the given training set. For every observed margin ρ < C, some i can be found with ρ_i ≤ ρ, such that the generalization bound ε(ρ_i) results in this setting. The size of ε(ρ_i) depends on whether the observed empirical margin corresponds to the prior confidence in reaching this margin, i.e. a large prior p_i.

    We can limit (15) as follows:

P( ∃i : E_P(Ψ) > E^{L_i}_m(Ψ) + ε(ρ_i) ) ≤ Σ_i P( E_P(Ψ) > E^{L_i}_m(Ψ) + ε(ρ_i) ) ≤ Σ_i p_i δ = δ

because of the choice of ε(ρ_i) such that it corresponds to the bound (13). This argument allows us to derive bounds of a form similar to (13) for the empirical margin.

Another point worth discussing concerns the assumption that the training data are i.i.d. with respect to an unknown underlying probability. For active learning, the training points are chosen depending on the observed margin, that means, they are dependent. However, this argument does not apply to the scenario considered in our case: we assume a priorly fixed training set with i.i.d. data and choose the next training pattern depending on the margin such that


the convergence speed of training and the generalization ability of the trained classifier are improved. This affects the learning algorithm on the given data set; however, the empirical error can nevertheless be evaluated on the whole training set independently of the algorithm, such that the bound as derived above is valid for the priorly fixed training set. For dependent data (e.g. points created online for learning), an alternative technique such as the statistical theory of online learning must be used to derive bounds (18; 19).

    Adaptation to noisy data or unknown classes

We will apply the strategy of active learning to data sets for which a small classification error can be achieved with a reasonable margin. We will not test the method on very corrupted or noisy data sets for which a large margin resp. a small training error cannot be achieved. It can be expected that the active strategies as proposed above have to be adapted for very noisy learning scenarios: the active strategy as proposed above will focus on points which are not yet classified correctly resp. which lie close to the border, trying to enforce a separation which is not possible for the data set at hand. In such cases, it would be valuable to restrict the active strategy to points which are still promising, e.g. choosing points only from a limited band parallel to the training hyperplane.

We would like to mention that, so far, we have restricted the active selection strategies to samples whose labels are all known beforehand, because the closest correct and the closest wrong prototype have to be determined in (10). This setting allows us to improve the training speed and performance of batch training. If data are initially unlabeled and queries can be asked for a subset of the data, we can extend these strategies in an obvious way towards this setting: in this case, the margin (10) is given by the closest two prototypes which possess different class labels, whereby the (unknown) class label of the sample point has no influence. L^c(t) is substituted by L^c(|t|).
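For this unlabeled setting, the margin reduces to the distance difference of the two closest prototypes carrying different labels. A small sketch with our own naming:

```python
import numpy as np

def unlabeled_margin(v, W, labels):
    """Margin for an unlabeled sample: absolute distance difference between
    the closest prototype and the closest prototype of another class."""
    d = ((v - W) ** 2).sum(axis=1)
    order = np.argsort(d)                 # prototypes by increasing distance
    first = order[0]
    for j in order[1:]:
        if labels[j] != labels[first]:
            return abs(d[j] - d[first])
    raise ValueError("all prototypes carry the same label")
```

The selection rules above can then be applied with L^c(|t|) in place of L^c(t).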

    4 Synthetic and Clinical data sets

The first data set is the Wisconsin Breast Cancer data set (WDBC) as given by UCI (20). It consists of 569 measurements with 30 features in 2 classes. The other two data sets are taken from proteomic studies, named Proteom1 and Proteom2, and originate from our own mass spectrometry measurements on


clinical proteomic data 3. The Proteom1 data set consists of 97 samples in two classes with 59 dimensions. The data set Proteom2 consists of 97 measurements with two classes and 94 dimensions. MALDI-TOF MS combined with magnetic-bead based sample preparation was used to generate proteomic patterns from human EDTA (ethylenediaminetetraacetic acid) plasma samples.

The MALDI-TOF mass spectra were obtained using magnetic bead based weak cation exchange chromatography (WCX) (21). The sample eluates from the bead preparation were applied to an AnchorChip-Target(TM) using HCCA matrix. The material was randomly applied on the target by use of a ClinProt-Robot(TM) and subsequently measured using an AutoFlex II in linear mode within 1-10 kDa (Bruker Daltonik GmbH, Bremen, Germany).

Thereby each spectrum has been accumulated by use of 450 laser shots with 15 shot positions per spot on the target. Each eluted sample was fourfold spotted and subsequently averaged. Individual attributes such as gender, age, cancer related preconditions and some others have been controlled during sample collection to avoid bias effects. All data have been collected and prepared in accordance with best clinical practices. Spectra preparation has been done by use of the Bruker ClinProt-System (Bruker Daltonik GmbH, Bremen, Germany).

The well separable checkerboard data (Checker) as given in (12) are used as a synthetic evaluation set. It consists of 3700 data points. Further, to illustrate the differences between the different strategies during learning of the codebook vector positions, a simple spiral data set has been created and applied using the SRNG algorithm. The spiral data are generated in a similar way as the data shown in (5), and the set consists of 5238 data points. Checker as well as the spiral data are given in a two-dimensional space.

    5 Experiments and Results

For classification, we use 6 prototypes for the WDBC data, 100 prototypes for the well separable checkerboard data set as given in (12), 9 prototypes for the Proteom1 data set and 10 for Proteom2. The parameter settings for SRNG can be summarized as follows: learning rate for the correct prototype: 0.01, learning rate for the incorrect prototype: 0.001, and learning rate for λ: 0.01. The neighborhood range is given by #W/2. For SNPC the same settings as for SRNG are used, with the additional parameters window threshold: 0.05 and width τ = 2.5 for the Gaussian kernel. Learning rates are annealed by an exponential decay. All data have been processed using a 10-fold cross-validation procedure. Results are calculated using SNPC, SNPC-R and SNG, SRNG. SNG and SRNG

3 These data were measured on Bruker systems and are subject to confidentiality agreements with clinical partners, so only some details are listed here.


are used instead of GLVQ and GRLVQ as improved versions of the former using neighborhood cooperation.

We now compare both prototype classifiers using randomly selected samples with their counterparts using the proposed query strategies. The classification

results are given in Tab. 1 and Tab. 3 without metric adaptation and in Tab. 2 and Tab. 3 with relevance learning, respectively. Features of all data sets have been normalized. First we upper bounded the data by 1.0 and subsequently the data are transformed such that we end with zero mean and variance 1.0.
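The two-step normalization described above can be sketched as follows; applying both steps per feature (rather than over the whole data set at once) is an assumption, and the `normalize` helper is only illustrative.

```python
import numpy as np

def normalize(X):
    """Upper bound each feature by 1.0, then transform each feature to
    zero mean and unit variance (sketch of the described two-step
    normalization; per-feature application is assumed)."""
    X = np.asarray(X, dtype=float)
    X = X / np.abs(X).max(axis=0)                # step 1: bound by 1.0
    return (X - X.mean(axis=0)) / X.std(axis=0)  # step 2: standardize

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])
Xn = normalize(X)
print(Xn.mean(axis=0))  # ~[0, 0]
print(Xn.std(axis=0))   # ~[1, 1]
```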

We applied the training algorithms using the different query strategies as introduced above. The results for recognition and prediction rates using SRNG are shown in Tab. 2 4 and for SNPC in Tab. 3, respectively. Thereby the recognition rate is a performance measure of the model indicating the relative number of training data points whose class label could be correctly recognized using the model. The prediction rate is a measure of the generalization ability of the model, accounting for the relative number of correctly predicted class labels of test data points which were not used in the former training of the classifier. As explained before, each prediction rate is obtained as an average over a 10-fold cross-validation procedure.

            SNG            SNG active strategy 1      SNG active strategy 2
            Rec.   Pred.   Rec.   Pred.   Rel. #Q     Rec.   Pred.   Rel. #Q
WDBC        95%    95%     94%    94%     38%         93%    92%      9%
Proteom1    76%    83%     76%    77%     48%         76%    85%     15%
Proteom2    73%    67%     73%    65%     49%         73%    62%     27%
Checker     72%    67%     98%    97%     31%         99%    96%      5%

Table 1
Classification accuracies for cancer and checkerboard data sets using SNG. All data sets consist of two classes, whereby the Proteom2 data set is quite complex. The prediction accuracies are taken from a 10-fold cross-validation and show a reliably good prediction for the WDBC data as well as for the Proteom1 data set. The Checker data are not modeled as well; this is due to the fixed upper limit of cycles (1000). Longer runtimes lead to a nearly perfect separation for this data set.

For the WDBC data set and the Proteom2 data set we found small improvements in the prediction accuracy using the active strategy 2. In parts a small over-fitting behavior using the new query strategies can be observed. Both new query strategies were capable to significantly decrease the necessary number of queries while keeping at least reliable prediction accuracies with respect to

4 The relative number of queries is calculated with respect to the maximal number of queries possible up to convergence of SRNG using the corresponding query strategy. The upper limit of cycles has been fixed to 1000.



Fig. 1. Number of queries in % using active strategy 1 (Threshold) and 2 (Probabilistic) executed by the SRNG algorithm on the Proteom1 data set. (x-axis: cycle; y-axis: relative number of queries.)

a random query approach. This is depicted for different data sets using the SRNG algorithm in Figures 1, 2, 3 and 4, and in Figures 5 and 6 using the SNPC algorithm. Although the number of queries differs for SNPC and SRNG on the same data set, the generic trend is similar but clearly reflects the classification performance of the individual classifier. Especially the threshold approach led to a significant decrease in the number of queries in each experiment.
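The two margin-based query strategies compared here can be sketched as follows. The GLVQ relative difference mu(x) serves as the margin criterion; the concrete threshold (0.3) and the exponential acceptance probability are illustrative assumptions, not the exact definitions used in the experiments.

```python
import numpy as np

def relative_margin(x, protos, proto_labels, y):
    """GLVQ-style relative difference mu(x) in [-1, 1]: d_plus is the
    squared distance to the closest prototype with label y, d_minus to
    the closest prototype of any other class. Values near 0 indicate a
    sample close to the current decision border."""
    d = ((protos - x) ** 2).sum(axis=1)
    d_plus = d[proto_labels == y].min()
    d_minus = d[proto_labels != y].min()
    return (d_plus - d_minus) / (d_plus + d_minus)

def query_threshold(mu, thresh=0.3):
    """Threshold strategy (sketch): use a sample for training only if it
    lies close to the decision border (threshold value assumed)."""
    return abs(mu) < thresh

def query_probabilistic(mu, rng):
    """Probabilistic strategy (sketch): accept a sample with a probability
    that decays with its distance from the border (decay rate assumed)."""
    return rng.random() < np.exp(-4.0 * abs(mu))
```

A sample deep inside its class region receives |mu| close to 1 and is rarely queried under either rule, which is consistent with the reduced query counts reported for both strategies.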

In Figure 7 the behaviour of the proposed strategies is shown for a synthetic spiral data set of around 5000 data points. Both active learning strategies showed a significant decrease in the number of necessary queries; typically only 10% to 30% of the queries were executed with respect to a random approach. Using the active learning strategies the data could be learned nearly perfectly. Random querying required a much larger number of cycles to obtain acceptable results. Considering the prototype distribution one clearly observes the good positioning of the prototypes using the active learning strategies, whereas the

            SRNG           SRNG active strategy 1     SRNG active strategy 2
            Rec.   Pred.   Rec.   Pred.   Rel. #Q     Rec.   Pred.   Rel. #Q
WDBC        95%    94%     95%    94%     29%         97%    97%      7%
Proteom1    82%    88%     92%    87%     31%         97%    93%      5%
Proteom2    87%    81%     96%    87%     33%         96%    76%     10%

Table 2
Classification accuracies for cancer data sets using SRNG. Data characteristics are as given before. A reliably good prediction for the WDBC data as well as for the Proteom1 data set can be seen. One clearly observes an improved modeling capability by use of relevance learning and a related additional decrease in the number of queries.



Fig. 2. Number of queries in % using active strategy 1 (Threshold) and 2 (Probabilistic) executed by the SRNG algorithm on the Proteom2 data set. (x-axis: cycle; y-axis: relative number of queries.)

Fig. 3. Number of queries in % using active strategy 1 (Threshold) and 2 (Probabilistic) executed by the SRNG algorithm on the WDBC data set. (x-axis: cycle; y-axis: relative number of queries.)

random approach suffers from over-representing the core of the spiral, which has a larger data density.

    6 Conclusion

Margin based active learning strategies for GLVQ based networks have been studied. We compared two alternative query strategies incorporating the margin criterion of the GLVQ networks with a random query selection. Both active learning strategies show reliable or partially better results in their generalization ability with respect to the random approach. Thereby we found a sig-



Fig. 5. Number of queries in % using active strategy 1 (Threshold) and 2 (Probabilistic) executed by the SNPC-R algorithm on the Proteom1 data set. (x-axis: cycle; y-axis: relative number of queries.)

Fig. 6. Number of queries in % using active strategy 1 (Threshold) and 2 (Probabilistic) executed by the SNPC-R algorithm on the WDBC data set. (x-axis: cycle; y-axis: relative number of queries.)

constraints showed a faster convergence but over-fitting on smaller training data sets. Especially for larger data sets the proposed active learning strategies show great benefits in speed and prediction. For the considered mass spectrometric cancer data sets an overall performance improvement has been observed. This is interesting from a practical point of view, since the technical equipment for measuring e.g. a large number of mass spectrometric data becomes more and more available. In mass spectrometry it is easy to measure a sample multiple times. The replicates which are taken from such multiple measurements are in general very similar and differ only by random but not by systematic new information. In clinical proteomics based on mass spectrometry, replicates are measured very often to decrease the measurement variance (e.g. by averaging) or to reduce the loss of missed samples



Fig. 7. Plot of the synthetic spiral data with prototype positions as obtained using the different query strategies (no active learning, probabilistic strategy, threshold strategy). (x-axis: 1st dim; y-axis: 2nd dim.)

in case of an error during a measurement of a sample. Typically 4, 8 or even 16 multiple measurements of the same sample are generated, and hence also for moderate sample sizes (e.g. 50 samples per class) the amount of training data becomes huge 5. The presented approach is optimally suited to deal with replicate measurements, which may drastically increase the number of samples and hence typically lead to very long runtimes for ordinary training using the considered classification algorithms.

    References

[1] E. Baum, Neural net algorithms that learn in polynomial time from examples and queries, IEEE Transactions on Neural Networks 2 (1991) 5-19.

[2] Y. Freund, H. S. Seung, E. Shamir, N. Tishby, Information, prediction and query by committee, in: Advances in Neural Information Processing Systems, 1993, pp. 483-490.

[3] P. Mitra, C. Murthy, S. Pal, A probabilistic active support vector learning algorithm, IEEE Transactions on Pattern Analysis and Machine Intelligence 28 (3) (2004) 412-418.

[4] L. M. Belue, K. W. Bauer Jr., D. W. Ruck, Selecting optimal experiments for multiple output multilayer perceptrons, Neural Computation 9 (1997) 161-183.

[5] M. Hasenjäger, H. Ritter, Active learning with local models, Neural Processing Letters 7 (1998) 107-117.

5 Considering 50 samples per class and 16 replicates per sample one would be confronted with 1600 highly redundant training items.



[6] B. Hammer, M. Strickert, T. Villmann, Supervised neural gas with general similarity measure, Neural Processing Letters 21 (1) (2005) 21-44.

[7] F.-M. Schleif, T. Villmann, B. Hammer, Local metric adaptation for soft nearest prototype classification to classify proteomic data, in: Fuzzy Logic and Applications: 6th Int. Workshop, WILF 2005, Lecture Notes in Computer Science (LNCS), Springer, 2006, pp. 290-296.

[8] T. Kohonen, Self-Organizing Maps, Vol. 30 of Springer Series in Information Sciences, Springer, Berlin, Heidelberg, 1995, (2nd Ext. Ed. 1997).

[9] B. Hammer, M. Strickert, T. Villmann, On the generalization ability of GRLVQ networks, Neural Processing Letters 21 (2) (2005) 109-120.

[10] A. Sato, K. Yamada, A formulation of learning vector quantization using a new misclassification measure, in: A. K. Jain, S. Venkatesh, B. C. Lovell (Eds.), Proceedings, Fourteenth International Conference on Pattern Recognition, Vol. 1, IEEE Computer Society, Los Alamitos, CA, USA, 1998, pp. 322-325.

[11] S. Seo, M. Bode, K. Obermayer, Soft nearest prototype classification, IEEE Transactions on Neural Networks 14 (2003) 390-398.

[12] B. Hammer, T. Villmann, Generalized relevance learning vector quantization, Neural Networks 15 (8-9) (2002) 1059-1068.

[13] K. Crammer, R. Gilad-Bachrach, A. Navot, N. Tishby, Margin analysis of the LVQ algorithm, in: Proc. NIPS 2002, http://www-2.cs.cmu.edu/Groups/NIPS/NIPS2002/NIPS2002preproceedings/index.html, 2002.

[14] P. Bartlett, S. Mendelson, Rademacher and Gaussian complexities: risk bounds and structural results, Journal of Machine Learning Research 3 (2002) 463-482.

[15] V. Vapnik, Statistical Learning Theory, Wiley, 1998.

[16] C. Campbell, N. Cristianini, A. Smola, Query learning with large margin classifiers, in: International Conference on Machine Learning, 2000, pp. 111-118.

[17] J. Shawe-Taylor, P. Bartlett, R. Williamson, M. Anthony, Structural risk minimization over data-dependent hierarchies, IEEE Transactions on Information Theory 44 (5) (1998) 1926-1940.

[18] M. Biehl, A. Ghosh, B. Hammer, Learning vector quantization: The dynamics of winner-takes-all algorithms, Neurocomputing 69 (7-9) (2006) 660-670.

[19] M. Opper, Statistical mechanics of generalization, in: M. Arbib (Ed.), The Handbook of Brain Theory and Neural Networks, MIT, 2003, pp. 1087-1090.

[20] C. Blake, C. Merz, UCI repository of machine learning databases, available at: http://www.ics.uci.edu/mlearn/MLRepository.html (1998).

[21] E. Schaffeler, U. Zanger, M. Schwab, M. Eichelbaum, Magnetic bead based human plasma profiling discriminate acute lymphatic leukaemia from non-diseased samples, in: 52nd ASMS Conference (ASMS) 2004, 2004, p. TPV 420.
