
Feature

Robi Polikar

Robi Polikar is with Electrical and Computer Engineering, Rowan University, Glassboro, NJ 08028 USA.

Abstract

In matters of great importance that have financial, medical, social, or other implications, we often seek a second opinion before making a decision, sometimes a third, and sometimes many more. In doing so, we weigh the individual opinions, and combine them through some thought process to reach a final decision that is presumably the most informed one. The process of consulting "several experts" before making a final decision is perhaps second nature to us; yet, the extensive benefits of such a process in automated decision making applications have only recently been discovered by the computational intelligence community.

Also known under various other names, such as multiple classifier systems, committee of classifiers, or mixture of experts, ensemble based systems have been shown to produce favorable results compared to those of single-expert systems for a broad range of applications and under a variety of scenarios. Design, implementation and application of such systems are the main topics of this article. Specifically, this paper reviews conditions under which ensemble based systems may be more beneficial than their single classifier counterparts, algorithms for generating individual components of the ensemble systems, and various procedures through which the individual classifiers can be combined. We discuss popular ensemble based algorithms, such as bagging, boosting, AdaBoost, stacked generalization, and hierarchical mixture of experts, as well as commonly used combination rules, including algebraic combination of outputs, voting based techniques, behavior knowledge space, and decision templates. Finally, we look at current and future research directions for novel applications of ensemble systems. Such applications include incremental learning, data fusion, feature selection, learning with missing features, confidence estimation, and error correcting output codes; all areas in which ensemble systems have shown great promise.

Keywords: Multiple classifier systems, classifier combination, classifier fusion, classifier selection, classifier diversity, incremental learning, data fusion

Page 2: Feature - 東京大学...Robi Polikar is with Electrical and Computer Engineering, Rowan University, Glassboro, NJ 08028 USA. an intelligent combination rule often proves to be a more

1. Introduction

1.1. I'd Like to Ask the Audience, Please...

A popular game show, aired in over 70 countries since 1999, drew the attention of hundreds of millions of TV viewers around the world. The novelty of the show was not so much in the talent it sought from the contestants; just like many of its predecessors, the show quizzed the knowledge of its contestants on a series of increasingly non-trivial topics. What made the show so popular was not only its previously unheard-of large payout, but also its rather unorthodox use of so-called "lifelines." If the contestant did not know the answer to a question, s/he could elect to have two incorrect answers removed from the four possible choices, telephone a friend whom s/he thought to be knowledgeable on the subject matter, or simply poll the audience. The idea of giving such choices to the contestants was so intriguing that the question "which of the three choices is the best one for any given condition?" became a topic of light-hearted conversation among many statistics and computational intelligence researchers.

The reader undoubtedly realizes where we are going with this: this article is about ensemble systems, and one of those three choices involves the use of an ensemble of decision makers: the audience. We will show that under relatively mild assumptions, asking the audience is in fact the best solution, and these assumptions provide clues to the specific scenarios in which ensemble systems may provide a clear advantage over single-expert systems.

As discussed below, there are several mathematically sound reasons for considering ensemble systems, but the intrinsic connection to our daily life experiences provides an undeniably strong psychological pretext: we use them all the time! Seeking additional opinions before making a decision is an innate behavior for most of us, particularly if the decision has important financial, medical or social consequences. Asking different doctors' opinions before undergoing major surgery, reading user reviews before purchasing a product, or requesting references before hiring someone are just a few of the countless examples in which we consider the decisions of multiple experts in our daily lives. Our goal in doing so, of course, is to improve our confidence that we are making the right decision, by weighing various opinions and combining them through some thought process to reach a final decision. Why not, then, follow the same process, and ask the opinion of additional experts in data analysis and automated decision making applications? Ensemble systems follow exactly such an approach to data analysis, and the design, implementation and applications of such systems constitute the main focus of this paper.

In this paper, the terms expert, classifier and hypothesis are used interchangeably: the goal of an expert is to make a decision by choosing one option from a previously defined set of options. This process can be cast as a classification problem: the expert (classifier) makes a hypothesis about the classification of a given data instance into one of predefined categories that represent different decisions. The decision is based on prior training of the classifier, using a set of representative training data for which the correct decisions are known a priori.

1.2. Reasons for Using Ensemble Based Systems

There are several theoretical and practical reasons why we may prefer an ensemble system:

Statistical Reasons: Readers familiar with neural networks or other automated classifiers are painfully aware that good performance on training data does not predict good generalization performance, defined as the performance of the classifier on data not seen during training. A set of classifiers with similar training performances may have different generalization performances. In fact, even classifiers with similar generalization performances may perform differently in the field, particularly if the test dataset used to determine the generalization performance is not sufficiently representative of the future field data. In such cases, combining the outputs of several classifiers by averaging may reduce the risk of an unfortunate selection of a poorly performing classifier. The averaging may or may not beat the performance of the best classifier in the ensemble, but it certainly reduces the overall risk of making a particularly poor selection. This is precisely the reason we consult the opinions of other experts: having several doctors agree on a diagnosis, or several former users agree on the quality (or lack thereof) of a product, reduces the risk of following the advice of a single doctor (or a single user) whose specific experience may be significantly different from those of others.

Large Volumes of Data: In certain applications, the amount of data to be analyzed can be too large to be effectively handled by a single classifier. For example, inspection of gas transmission pipelines using magnetic flux leakage techniques generates 10 GB of data for every 100 km of pipeline, of which there are over 2 million km. Training a classifier with such a vast amount of data is usually not practical; partitioning the data into smaller subsets, training different classifiers with different partitions of the data, and combining their outputs using an intelligent combination rule often proves to be a more efficient approach.

Too Little Data: Ensemble systems can also be used to address the exact opposite problem of having too little data. Availability of an adequate and representative set of training data is of paramount importance for a classification algorithm to successfully learn the underlying data distribution. In the absence of adequate training data, resampling techniques can be used for drawing overlapping random subsets of the available data, each of which can be used to train a different classifier, creating the ensemble. Such approaches have also proven to be very effective.

Divide and Conquer: Regardless of the amount of available data, certain problems are just too difficult for a given classifier to solve. More specifically, the decision boundary that separates data from different classes may be too complex, or may lie outside the space of functions that can be implemented by the chosen classifier model. Consider the two-dimensional, two-class problem with a complex decision boundary depicted in Figure 1. A linear classifier, one that is capable of learning linear boundaries, cannot learn this complex non-linear boundary. However, an appropriate combination of an ensemble of such linear classifiers can learn this (or any other, for that matter) non-linear boundary.

As an example, let us assume that we have access to a classifier model that can generate elliptic/circular shaped boundaries. Such a classifier cannot learn the boundary shown in Figure 1. Now consider a collection of circular decision boundaries generated by an ensemble of such classifiers, as shown in Figure 2, where each classifier labels the data as class 1 (O) or class 2 (X) based on whether the instances fall within or outside of its boundary. A decision based on the majority voting of a sufficient number of such classifiers can easily learn this complex non-circular boundary. In a sense, the classification system follows a divide-and-conquer approach by dividing the data space into smaller and easier-to-learn partitions, where each classifier learns only one of the simpler partitions. The underlying complex decision boundary can then be approximated by an appropriate combination of different classifiers.

Data Fusion: If we have several sets of data obtained from various sources, where the nature of the features is different (heterogeneous features), a single classifier cannot be used to learn the information contained in all of the data. In diagnosing a neurological disorder, for example, the neurologist may order several tests, such as an MRI scan, EEG recording, blood tests, etc. Each test generates data with a different number and type of features, which cannot be used collectively to train a single classifier. In such cases, data from each testing modality can be used to train a different classifier, whose outputs can then be combined. Applications in which data from different sources are combined to make a more informed decision are referred to as data fusion applications, and ensemble based approaches have successfully been used for such applications.

There are many other scenarios in which ensemble based systems can be very beneficial; however, a discussion of these more specialized scenarios requires a deeper understanding of how, why and when ensemble systems work. The rest of this paper is therefore organized as follows. In Section 2, we discuss some of the seminal work that has paved the way for today's active research area in ensemble systems, followed by a discussion on diversity, a keystone and a fundamental strategy shared by all ensemble systems. We close Section 2 by pointing out that all ensemble systems must have two key components: an algorithm to generate the individual classifiers of the ensemble, and a method for combining the outputs of these classifiers.


Figure 1. Complex decision boundary that cannot be learned by linear or circular classifiers. (Scatter plot of training data examples for class 1 (O) and class 2 (X) over Observation/Measurement/Feature 1 and Observation/Measurement/Feature 2, separated by a complex decision boundary to be learned.)

Figure 2. Ensemble of classifiers spanning the decision space. (The same class 1 (O) and class 2 (X) data as in Figure 1, overlaid with the circular decision boundaries generated by the individual classifiers.)


Various algorithms for the former are discussed in Section 3, whereas different procedures for the latter are discussed in Section 4. In Section 5, we look at current and emerging directions in ensemble system research, including some novel applications for which ensemble systems are naturally suited. Section 6 completes our overview of ensemble systems with some concluding remarks.

2. Ensemble Based Systems

2.1. A Brief History of Ensemble Systems

Perhaps one of the earliest works on ensemble systems is Dasarathy and Sheela's 1979 paper [1], which discusses partitioning the feature space using two or more classifiers. In 1990, Hansen and Salamon showed that the generalization performance of a neural network can be improved using an ensemble of similarly configured neural networks [2], while Schapire proved that a strong classifier in the probably approximately correct (PAC) sense can be generated by combining weak classifiers through boosting [3], the predecessor of the suite of AdaBoost algorithms. Since these seminal works, research in ensemble systems has expanded rapidly, appearing often in the literature under many creative names and ideas. The long list includes composite classifier systems [1], mixture of experts [4], [5], stacked generalization [6], consensus aggregation [7], combination of multiple classifiers [8]–[12], the change-glasses approach to classifier selection [13], dynamic classifier selection [12], classifier fusion [14]–[16], committees of neural networks [17], voting pool of classifiers [18], classifier ensembles [17], [19], and pandemonium system of reflective agents [20], among many others.

The paradigms of these approaches usually differ from each other with respect to the specific procedure used for generating individual classifiers, and/or the strategy employed for combining the classifiers. There are generally two types of combination: classifier selection and classifier fusion [12], [21]. In classifier selection, each classifier is trained to become an expert in some local area of the total feature space (as in Figure 2). The combination of the classifiers is then based on the given feature vector: the classifier trained with data closest to the vicinity of the feature vector, in some distance metric sense, is given the highest credit. One or more local experts can be nominated to make the decision [4], [12], [22]–[24]. In classifier fusion, however, all classifiers are trained over the entire feature space. In this case, the classifier combination process involves merging the individual (weaker) classifier designs to obtain a single (stronger) expert of superior performance. Examples of this approach include bagging predictors [25], boosting [3], [26] and its many variations. The combination may apply to classification labels only, or to the class-specific continuous valued outputs of the individual experts [16], [27], [28]. In the latter case, classifier outputs are often normalized to the [0, 1] interval, and these values are interpreted as the support given by the classifier to each class, or even as class-conditional posterior probabilities [16], [29]. Such an interpretation allows forming an ensemble through algebraic combination rules (majority voting, maximum/minimum/sum/product or other combinations of posterior probabilities) [9], [28], [30], [31], fuzzy integral combination [15], [32], [33], Dempster-Shafer based classifier fusion [10], [34], and more recently, decision templates [16], [27], [30], [35]–[37].

In other related efforts, input decimation has been proposed for naturally grouping different modalities for independent classifiers in [38], [39], and the conditions under which such combinations of modalities work best are described in [40]. A number of authors have provided theoretical analyses of various strategies commonly used in multiple expert fusion: for example, theoretical models were developed in [41], [42] for combining discriminant functions; six commonly used combination rules were compared for their ability to predict posterior probabilities in [31]; and a Bayesian theoretical framework for multiple expert fusion was developed in [28], where the sensitivity of various combination schemes to estimation errors was analyzed to propose a plausible model that explains such behavior.

A sample of the immense literature on classifier combination can be found in Kuncheva's recent book [21], the first text devoted to the theory and implementation of ensemble based classifiers, and the references therein. The field has been developing so rapidly that an international workshop on multiple classifier systems (MCS) has recently been established, and the most current developments can be found in its proceedings [43].

2.2. Diversity: Cornerstone of Ensemble Systems

If we had access to a classifier with perfect generalization performance, there would be no need to resort to ensemble techniques. The realities of noise, outliers and overlapping data distributions, however, make such a classifier an impossible proposition. At best, we can hope for classifiers that correctly classify the field data most of the time. The strategy in ensemble systems is therefore to create many classifiers, and combine their outputs such that the combination improves upon the performance of a single classifier. This requires, however, that individual classifiers make errors on different instances. The intuition is that if each classifier makes different errors, then a strategic combination of these classifiers can reduce the total error, a concept not too dissimilar to low pass filtering of the noise. The overarching principle in ensemble systems is therefore to make each classifier as unique as possible, particularly with respect to misclassified instances. Specifically, we need classifiers whose decision boundaries are adequately different from those of others. Such a set of classifiers is said to be diverse.

Classifier diversity can be achieved in several ways. The most popular method is to use different training datasets to train individual classifiers. Such datasets are often obtained through resampling techniques, such as bootstrapping or bagging, where training data subsets are drawn randomly, usually with replacement, from the entire training data. This is illustrated in Figure 3, where random and overlapping training data subsets are selected to train three classifiers, which then form three different decision boundaries. These boundaries are combined to obtain a more accurate classification.

To ensure that individual boundaries are adequately different, despite using substantially similar training data, unstable classifiers are used as base models, since they can generate sufficiently different decision boundaries even for small perturbations in their training parameters. If the training data subsets are drawn without replacement, the procedure is also called jackknife or k-fold data split: the entire dataset is split into k blocks, and each classifier is trained on only k−1 of them. A different subset of the k blocks is selected for each classifier, as shown in Figure 4.
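To make the two resampling schemes concrete, the following is a minimal numpy sketch (an illustrative addition, not from the original article) that draws bootstrapped subsets with replacement and jackknife/k-fold subsets without replacement; the function names and the default 75% bootstrap fraction are arbitrary choices.

import numpy as np

def bootstrap_subsets(n_samples, n_classifiers, fraction=0.75, seed=0):
    """Overlapping training subsets drawn with replacement (bootstrapping/bagging)."""
    rng = np.random.default_rng(seed)
    size = int(fraction * n_samples)
    return [rng.choice(n_samples, size=size, replace=True)
            for _ in range(n_classifiers)]

def kfold_subsets(n_samples, k, seed=0):
    """Jackknife / k-fold split: classifier t is trained on all blocks except block t."""
    rng = np.random.default_rng(seed)
    blocks = np.array_split(rng.permutation(n_samples), k)
    return [np.concatenate(blocks[:t] + blocks[t + 1:]) for t in range(k)]

# Indices used to train each member of a small ensemble (illustrative sizes)
print(bootstrap_subsets(n_samples=20, n_classifiers=3)[0])
print(kfold_subsets(n_samples=20, k=4)[0])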

Another approach to achieve diversity is to use different training parameters for different classifiers. For example, a series of multilayer perceptron (MLP) neural networks can be trained by using different weight initializations, numbers of layers/nodes, error goals, etc. Adjusting such parameters allows one to control the instability of the individual classifiers, and hence contributes to their diversity. The ability to control the instability of neural network and decision tree type classifiers makes them suitable candidates to be used in an ensemble setting.


Figure 3. Combining classifiers that are trained on different subsets of the training data. (Classifier 1, Classifier 2 and Classifier 3, each trained on a different data subset, produce Decision Boundary 1, 2 and 3 in the Feature 1 / Feature 2 space; their combination yields the Ensemble Decision Boundary.)


Alternatively, entirely different types of classifiers, such as MLPs, decision trees, nearest neighbor classifiers, and support vector machines, can also be combined for added diversity. However, combining different models, or even different architectures of the same model, is used only for specific applications that warrant them. Diversity is typically obtained through resampling of the training data, as this procedure is theoretically more tractable. Finally, diversity can also be achieved by using different features. In fact, generating different classifiers using random feature subsets is known as the random subspace method [44], and it has found widespread use in certain applications, which are discussed later under future research areas.

2.3. Measures of Diversity

Several measures have been defined for the quantitative assessment of diversity. The simplest ones are pair-wise measures, defined between two classifiers. For T classifiers, we can calculate T(T−1)/2 pair-wise diversity measures, and an overall diversity of the ensemble can be obtained by averaging these pair-wise measures. Given two hypotheses hi and hj, we use the notations

                    hj is correct    hj is incorrect
  hi is correct          a                 b
  hi is incorrect        c                 d

where a is the fraction of instances that are correctly classified by both classifiers, b is the fraction of instances correctly classified by hi but incorrectly classified by hj, and so on. Of course, a + b + c + d = 1. Then, the following pair-wise diversity measures can be defined.

Correlation: Diversity is measured as the correlation between the two classifier outputs, defined as

\rho_{i,j} = \frac{ad - bc}{\sqrt{(a+b)(c+d)(a+c)(b+d)}}, \quad -1 \le \rho \le 1.   (1)

Maximum diversity is obtained for ρ = 0, indicating that the classifiers are uncorrelated.

Q-Statistic: Defined as

Q_{i,j} = \frac{ad - bc}{ad + bc},   (2)

Q assumes positive values if the same instances are correctly classified by both classifiers, and negative values otherwise. Maximum diversity is, once again, obtained for Q = 0.

Disagreement and Double Fault Measures: The disagreement is the probability that the two classifiers will disagree, whereas the double fault measure is the probability that both classifiers are incorrect:

D_{i,j} = b + c,   (3)

DF_{i,j} = d.   (4)

Diversity increases with the disagreement value, and decreases as the double fault value increases.
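As an illustration, the pair-wise quantities a, b, c, d and the measures in Equations (1)-(4) can be computed from the correctness indicators of two classifiers roughly as in the following Python sketch (an illustrative addition, not part of the original article); it assumes boolean arrays marking where each classifier is correct on a common evaluation set.

import numpy as np

def pairwise_diversity(correct_i, correct_j):
    """correct_i, correct_j: boolean arrays marking where h_i and h_j are correct."""
    ci, cj = np.asarray(correct_i, bool), np.asarray(correct_j, bool)
    a = np.mean(ci & cj)      # both correct
    b = np.mean(ci & ~cj)     # only h_i correct
    c = np.mean(~ci & cj)     # only h_j correct
    d = np.mean(~ci & ~cj)    # both incorrect
    rho = (a * d - b * c) / np.sqrt((a + b) * (c + d) * (a + c) * (b + d))  # Eq. (1)
    Q = (a * d - b * c) / (a * d + b * c)                                   # Eq. (2)
    return {"correlation": rho, "Q": Q,
            "disagreement": b + c,                                          # Eq. (3)
            "double_fault": d}                                              # Eq. (4)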


Figure 4. k-fold data splitting for generating different, but overlapping, training datasets. (The entire original training data is divided into blocks 1 through k; for each of classifiers 1 through k, k−1 blocks, shown in a dark background, are selected for training, and one block, shown in a light background, is left out.)



There are also non pair-wise measures that take the ensemble's decision into consideration.

Entropy Measure: This measure makes the assumption that diversity is highest if half of the classifiers are correct and the remaining ones are incorrect. Based on this assumption, and defining ζi as the number of classifiers, out of T, that misclassify instance xi, the entropy measure is defined as

E = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{T - \lceil T/2 \rceil} \min\{\zeta_i, (T - \zeta_i)\}   (5)

where ⌈·⌉ is the ceiling operator and N is the dataset cardinality. Entropy varies between 0 and 1: 0 indicates that all classifiers are practically the same, and 1 indicates the highest diversity [45].

Kohavi–Wolpert Variance: Defined as

KW = \frac{1}{N T^2} \sum_{i=1}^{N} \zeta_i (T - \zeta_i),   (6)

it follows a similar approach to the disagreement measure, and can be shown to be related to the disagreement measure averaged over all classifiers by a constant factor [45].

Measure of Difficulty: Originally proposed in [2], this measure uses the random variable Z(xi) ∈ {0, 1/T, 2/T, ..., 1}, defined as the fraction of classifiers that misclassify xi. The measure of difficulty, θ, is then the variance of the random variable Z:

\theta = \frac{1}{T} \sum_{t=0}^{T} (z_t - \bar{z})^2   (7)

where z̄ is the mean of z; hence it is the average fraction of classifiers that misclassify any given input. Hansen and Salamon argue that if classifiers are positively correlated, that is, they all tend to correctly classify the same instances and misclassify the same other instances, then θ is large and there is little diversity. In the extreme case, where all classifiers are identical, the distribution of Z will be two singletons: one at 0, with a mass proportional to the number of correctly classified instances, and the other at 1, with a mass proportional to the number of missed instances. Then, θ attains its maximum value. However, if classifiers are negatively correlated, that is, they tend to misclassify different instances, then θ is relatively small, and there is more diversity.
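The non pair-wise measures can be computed from a T-by-N matrix of correctness indicators; the sketch below (an illustrative addition) follows Equations (5) and (6) directly, and computes θ as the variance of Z over the instances.

import numpy as np

def nonpairwise_diversity(correct):
    """correct: (T, N) boolean array; correct[t, i] is True if classifier t
    classifies instance x_i correctly."""
    correct = np.asarray(correct, dtype=bool)
    T, N = correct.shape
    zeta = (~correct).sum(axis=0)                  # number of classifiers wrong on x_i
    entropy = np.mean(np.minimum(zeta, T - zeta)) / (T - np.ceil(T / 2))   # Eq. (5)
    kw = np.sum(zeta * (T - zeta)) / (N * T ** 2)                          # Eq. (6)
    theta = np.var(zeta / T)     # measure of difficulty: variance of Z over the data
    return entropy, kw, theta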

Several other diversity measures have also been proposed, such as the measurement of interrater agreement or the generalized diversity, whose comparisons can be found in [45], [46]. Kuncheva's conclusion, reminiscent of the No Free Lunch theorem (which states that no classification algorithm is universally superior to others [47]), is that there is no diversity measure that consistently correlates with higher accuracy. In the absence of additional information, she suggests the Q statistic because of its intuitive meaning and simple implementation.

2.4. Two Key Components of an Ensemble System

All ensemble systems consist of two key components. First, a strategy is needed to build an ensemble that is as diverse as possible. Some of the more popular ones, such as bagging, boosting, AdaBoost, stacked generalization, and mixture of experts, are discussed in Section 3. A second strategy is needed to combine the outputs of the individual classifiers that make up the ensemble in such a way that the correct decisions are amplified, and incorrect ones are cancelled out. Several choices are available for this purpose as well, which are discussed in Section 4.

3. Creating an Ensemble

Two interrelated questions need to be answered in designing an ensemble system: i) how will the individual classifiers (base classifiers) be generated? and ii) how will they differ from each other? The answers ultimately determine the diversity of the classifiers, and hence affect the performance of the overall system. Therefore, any strategy for generating the ensemble members must seek to improve the ensemble's diversity. In general, however, ensemble algorithms do not attempt to maximize a specific diversity measure (see [46], [48] for exceptions). Rather, increased diversity is usually sought, somewhat heuristically, through various resampling procedures or through the selection of different training parameters. The diversity measures defined above can then be used to compare the diversities of the ensembles generated by different algorithms.

3.1. Bagging

Breiman's bagging, short for bootstrap aggregating, is one of the earliest ensemble based algorithms. It is also one of the most intuitive and simplest to implement, with a surprisingly good performance [25]. Diversity in bagging is obtained by using bootstrapped replicas of the training data: different training data subsets are randomly drawn, with replacement, from the entire training data. Each training data subset is used to train a different classifier of the same type. Individual classifiers are then combined by taking a majority vote of their decisions. For any given instance, the class chosen by most classifiers is the ensemble decision.

Bagging is particularly appealing when the available data is of limited size. To ensure that there are sufficient training samples in each subset, relatively large portions of the samples (75% to 100%) are drawn into each subset. This causes individual training subsets to overlap significantly, with many of the same instances appearing in most subsets, and some instances appearing multiple times in a given subset. In order to ensure diversity under this scenario, a relatively unstable model is used, so that sufficiently different decision boundaries can be obtained for small perturbations in different training datasets. As mentioned above, neural networks and decision trees are good candidates for this purpose, as their instability can be controlled by the selection of their free parameters. The pseudocode for the bagging algorithm is given in Figure 5.

3.2. Variations of Bagging

Random Forests: A variation of the bagging algorithm is the random forest, so called because it is constructed from decision trees [49]. A random forest can be created from individual decision trees in which certain training parameters vary randomly. Such parameters can be bootstrapped replicas of the training data, as in bagging, but they can also be different feature subsets, as in random subspace methods.

Pasting Small Votes: Unlike bagging, pasting small votes is designed to be used with large datasets [50]. A large dataset is partitioned into smaller subsets, called bites, each of which is used to train a different classifier. Two variations of pasting small votes have emerged: one that creates the data subsets at random, called Rvotes, and one that creates consecutive datasets based on the importance of the instances, called Ivotes.

The latter approach is known to provide better results [51], and is similar to the approach followed by boosting based algorithms, where each classifier focuses on the most important (or most informative) instances for the current ensemble.

Basically, the important instances are those that improve diversity, and one way to do so is to train each classifier on a dataset that consists of a balanced distribution of easy and difficult instances. This is achieved by evaluating the current ensemble Et, consisting of t classifiers, on instances that have not yet been used for training. For any given instance x, those classifiers that did not use x in their training are called out-of-bag classifiers. If x is misclassified by a simple majority vote of the current ensemble, then it is automatically placed in the training subset of the next classifier. Otherwise, it is still placed in the training set, but only with probability εt/(1−εt), where 0 < εt < 1/2 is the error of the t-th classifier. Individual classifiers are expected to perform with at least 50% accuracy, to ensure that a meaningful performance can be provided by each classifier. In Section 4, within the context of voting algorithms, we will see that the error threshold of 0.5 is in fact a strategically chosen value. The complete algorithm for pasting small votes is given in Figure 6.

3.3. Boosting

In 1990, Schapire proved that a weak learner, an algorithm that generates classifiers that can merely do better than random guessing, can be turned into a strong learner that generates a classifier that can correctly classify all but an arbitrarily small fraction of the instances [3]. Formal definitions of weak and strong learners, as defined in the PAC learning framework, can be found in [3], where Schapire also provides an elegant algorithm for boosting the performance of a weak learner to the level of a strong one. Hence called boosting, the algorithm is now considered one of the most important developments in the recent history of machine learning.

Similar to bagging, boosting also creates an ensemble of classifiers by resampling the data, which are then combined by majority voting. However, the similarities end there. In boosting, resampling is strategically geared to provide the most informative training data for each consecutive classifier.


Algorithm: Bagging
Input:
■ Training data S with correct labels ωi ∈ Ω = {ω1, ..., ωC} representing C classes;
■ Weak learning algorithm WeakLearn;
■ Integer T specifying number of iterations;
■ Percent (or fraction) F to create bootstrapped training data.
Do t = 1, ..., T
  1. Take a bootstrapped replica St by randomly drawing F percent of S.
  2. Call WeakLearn with St and receive the hypothesis (classifier) ht.
  3. Add ht to the ensemble, E.
End
Test: Simple Majority Voting -- Given unlabeled instance x
  1. Evaluate the ensemble E = {h1, ..., hT} on x.
  2. Let

     v_{t,j} = \begin{cases} 1, & \text{if } h_t \text{ picks class } \omega_j \\ 0, & \text{otherwise} \end{cases}   (8)

     be the vote given to class ωj by classifier ht.
  3. Obtain the total vote received by each class:

     V_j = \sum_{t=1}^{T} v_{t,j}, \quad j = 1, \ldots, C.   (9)

  4. Choose the class that receives the highest total vote as the final classification.

Figure 5. The bagging algorithm.
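For readers who prefer runnable code, the following Python sketch mirrors Figure 5 under the assumption that scikit-learn decision trees serve as the unstable WeakLearn, that the inputs are numpy arrays, and that class labels are integers 0, ..., C−1; it is an illustrative sketch, not the reference implementation.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_train(X, y, T=25, F=1.0, seed=0):
    """Train T base classifiers on bootstrapped replicas of (X, y), as in Figure 5."""
    rng = np.random.default_rng(seed)
    ensemble = []
    for _ in range(T):
        idx = rng.choice(len(X), size=int(F * len(X)), replace=True)   # replica S_t
        ensemble.append(DecisionTreeClassifier().fit(X[idx], y[idx]))  # WeakLearn
    return ensemble

def bagging_predict(ensemble, X):
    """Simple majority (plurality) voting, Equations (8)-(9); labels are 0..C-1."""
    votes = np.array([h.predict(X) for h in ensemble]).astype(int)  # shape (T, n_samples)
    return np.array([np.bincount(col).argmax() for col in votes.T])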


In essence, boosting creates three weak classifiers: the first classifier, C1, is trained with a random subset of the available training data. The training data subset for the second classifier, C2, is chosen as the most informative subset, given C1; that is, C2 is trained on a training dataset only half of which is correctly classified by C1, while the other half is misclassified. The third classifier, C3, is trained with instances on which C1 and C2 disagree. The three classifiers are combined through a three-way majority vote. The algorithm is shown in detail in Figure 7.

Schapire has shown that the error of this three-classifier ensemble is bounded above, and it is less than the error of the best classifier in the ensemble, provided that each classifier has an error rate that is less than 0.5. For a two-class problem, an error rate of 0.5 is the least we can expect from a classifier, as an error of 0.5 amounts to random guessing. Hence, a stronger classifier is generated from three weaker classifiers. A strong classifier in the strict PAC learning sense can then be created by recursive applications of boosting.

3.4. AdaBoost

In 1997, Freund and Schapire introduced AdaBoost [26], which has since enjoyed remarkable attention, one that is rarely matched in computational intelligence. AdaBoost is a more general version of the original boosting algorithm. Among its many variations, AdaBoost.M1 and AdaBoost.R are the most commonly used, as they are capable of handling multiclass and regression problems, respectively. In this paper, we discuss AdaBoost.M1 in detail, and provide a brief survey and references for other algorithms that are based on AdaBoost.


Algorithm: Boosting
Input:
■ Training data S of size N with correct labels ωi ∈ Ω = {ω1, ω2};
■ Weak learning algorithm WeakLearn.
Training
  1. Select N1 < N patterns without replacement from S to create data subset S1.
  2. Call WeakLearn and train with S1 to create classifier C1.
  3. Create dataset S2 as the most informative dataset, given C1, such that half of S2 is correctly classified by C1, and the other half is misclassified. To do so:
     a. Flip a fair coin. If Head, select samples from S, and present them to C1 until the first instance is misclassified. Add this instance to S2.
     b. If Tail, select samples from S, and present them to C1 until the first one is correctly classified. Add this instance to S2.
     c. Continue flipping coins until no more patterns can be added to S2.
  4. Train the second classifier C2 with S2.
  5. Create S3 by selecting those instances for which C1 and C2 disagree. Train the third classifier C3 with S3.
Test -- Given a test instance x
  1. Classify x by C1 and C2. If they agree on the class, this class is the final classification.
  2. If they disagree, choose the class predicted by C3 as the final classification.

Figure 7. The boosting algorithm.
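A rough Python sketch of this three-classifier scheme is given below. It is an illustrative simplification of Figure 7, not the original procedure: S2 is balanced deterministically rather than through the coin-flip filtering, scikit-learn decision stumps stand in for the weak learners, and it assumes that C1 makes at least one error and that C1 and C2 disagree on at least one instance.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boosting_three(X, y, n1, seed=0):
    """Three-classifier boosting, simplified from Figure 7."""
    rng = np.random.default_rng(seed)
    idx1 = rng.choice(len(X), size=n1, replace=False)
    C1 = DecisionTreeClassifier(max_depth=1).fit(X[idx1], y[idx1])

    # S2: roughly half correctly and half incorrectly classified by C1
    pred1 = C1.predict(X)
    right, wrong = np.where(pred1 == y)[0], np.where(pred1 != y)[0]
    m = min(len(right), len(wrong))
    idx2 = np.concatenate([rng.choice(right, m, replace=False),
                           rng.choice(wrong, m, replace=False)])
    C2 = DecisionTreeClassifier(max_depth=1).fit(X[idx2], y[idx2])

    # S3: instances on which C1 and C2 disagree (assumed non-empty in this sketch)
    idx3 = np.where(C1.predict(X) != C2.predict(X))[0]
    C3 = DecisionTreeClassifier(max_depth=1).fit(X[idx3], y[idx3])
    return C1, C2, C3

def boosting_predict(C1, C2, C3, X):
    """If C1 and C2 agree, use their class; otherwise defer to C3."""
    p1, p2, p3 = C1.predict(X), C2.predict(X), C3.predict(X)
    return np.where(p1 == p2, p1, p3)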

Algorithm: Pasting Small Votes (Ivotes)
Input:
■ Training data S with correct labels ωi ∈ Ω = {ω1, ..., ωC} representing C classes;
■ Weak learning algorithm WeakLearn;
■ Integer T specifying number of iterations;
■ Bitesize M, indicating the size of the individual training subsets to be created.
Initialize
  1. Choose a random subset S0 of size M from S.
  2. Call WeakLearn with S0, and receive the hypothesis (classifier) h0.
  3. Evaluate h0 on a validation dataset, and obtain the error ε0 of h0.
  4. If ε0 > 1/2, return to step 1.
Do t = 1, ..., T
  1. Randomly draw an instance x from S according to a uniform distribution.
  2. Evaluate x using a majority vote of the out-of-bag classifiers in the current ensemble Et.
  3. If x is misclassified, place x in St. Otherwise, place x in St with probability

     p = \varepsilon_{t-1} / (1 - \varepsilon_{t-1}).   (10)

     Repeat steps 1-3 until St has M such instances.
  4. Call WeakLearn with St and receive the hypothesis ht.
  5. Evaluate ht on a validation dataset, and obtain the error εt of ht. If εt > 1/2, return to step 4.
  6. Add ht to the ensemble to obtain Et.
End
Test -- Use simple majority voting on test data.

Figure 6. Pasting small votes (Ivotes) algorithm.


AdaBoost generates a set of hypotheses, and combines them through weighted majority voting of the classes predicted by the individual hypotheses. The hypotheses are generated by training a weak classifier, using instances drawn from an iteratively updated distribution of the training data. This distribution update ensures that instances misclassified by the previous classifier are more likely to be included in the training data of the next classifier. Hence, consecutive classifiers' training data are geared towards increasingly hard-to-classify instances.

The pseudocode of the algorithm is provided in Figure 8. Several interesting features of the algorithm are worth noting. The algorithm maintains a weight distribution Dt(i) on the training instances xi, i = 1, ..., N, from which training data subsets St are chosen for each consecutive classifier (hypothesis) ht. The distribution is initialized to be uniform, so that all instances have equal likelihood to be selected into the first training dataset. The training error εt of classifier ht is also weighted by this distribution, such that εt is the sum of the distribution weights of the instances misclassified by ht (Equation 12). As before, we require that this error be less than 1/2. A normalized error is then obtained as βt, such that for 0 < εt < 1/2, we have 0 < βt < 1.

Equation 14 describes the distribution update rule: the distribution weights of those instances that are correctly classified by the current hypothesis are reduced by a factor of βt, whereas the weights of the misclassified instances are unchanged. When the updated weights are renormalized, so that Dt+1 is a proper distribution, the weights of the misclassified instances are effectively increased. Hence, iteration by iteration, AdaBoost focuses on increasingly difficult instances. Note that AdaBoost raises the weights of instances misclassified by ht so that they add up to 1/2, and lowers the weights of correctly classified instances, so that they too add up to 1/2. Since the base model learning algorithm WeakLearn is required to have an error less than 1/2, it is guaranteed to correctly classify at least one previously misclassified training example. Once a preset number T of classifiers are generated, AdaBoost is ready for classifying unlabeled test instances. Unlike bagging or boosting, AdaBoost uses a rather undemocratic voting scheme, called weighted majority voting. The idea is an intuitive one: those classifiers that have shown good performance during training are rewarded with higher voting weights than the others. Recall that a normalized error βt was calculated in Equation 13. The reciprocal of this quantity, 1/βt, is therefore a measure of performance, and can be used to weight the classifiers. Furthermore, since βt is a training error, it is often close to zero, and 1/βt can therefore be a very large number. To avoid the potential instability that can be caused by asymptotically large numbers, the logarithm of 1/βt is usually used as the voting weight of ht. At the end, the class that receives the highest total vote from all classifiers is the ensemble decision.

A conceptual block diagram of the algorithm is provided in Figure 9. The diagram should be interpreted with the understanding that the algorithm is sequential: classifier CK is created before classifier CK+1, which in turn requires that βK and the current distribution DK be available.

Freund and Schapire also showed that the training error of AdaBoost.M1 is bounded above:

E < 2^T \prod_{t=1}^{T} \sqrt{\varepsilon_t (1 - \varepsilon_t)}   (16)

where E is the ensemble error [26].


Algorithm: AdaBoost.M1
Input:
■ Sequence of N examples S = [(xi, yi)], i = 1, ..., N, with labels yi ∈ Ω, Ω = {ω1, ..., ωC};
■ Weak learning algorithm WeakLearn;
■ Integer T specifying number of iterations.
Initialize D_1(i) = 1/N, i = 1, ..., N.   (11)
Do for t = 1, 2, ..., T:
  1. Select a training data subset St, drawn from the distribution Dt.
  2. Train WeakLearn with St, receive hypothesis ht.
  3. Calculate the error of ht:

     \varepsilon_t = \sum_{i: h_t(x_i) \ne y_i} D_t(i).   (12)

     If εt > 1/2, abort.
  4. Set

     \beta_t = \varepsilon_t / (1 - \varepsilon_t).   (13)

  5. Update the distribution Dt:

     D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \begin{cases} \beta_t, & \text{if } h_t(x_i) = y_i \\ 1, & \text{otherwise} \end{cases}   (14)

     where Zt is a normalization constant chosen so that Dt+1 becomes a proper distribution function.
End
Test -- Weighted Majority Voting: Given an unlabeled instance x,
  1. Obtain the total vote received by each class:

     V_j = \sum_{t: h_t(x) = \omega_j} \log\frac{1}{\beta_t}, \quad j = 1, \ldots, C.   (15)

  2. Choose the class that receives the highest total vote as the final classification.

Figure 8. The AdaBoost.M1 algorithm.
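The following Python sketch (an illustrative addition, not the authors' code) implements Figure 8 with scikit-learn decision stumps as WeakLearn, numpy arrays as inputs, and arbitrary hashable class labels; the small epsilon guard against a zero training error is an implementation convenience, not part of the algorithm.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_m1(X, y, T=25, seed=0):
    """AdaBoost.M1 as in Figure 8, with decision stumps as WeakLearn."""
    rng = np.random.default_rng(seed)
    N = len(X)
    D = np.full(N, 1.0 / N)                      # Eq. (11): uniform initial distribution
    hypotheses, betas = [], []
    for _ in range(T):
        idx = rng.choice(N, size=N, replace=True, p=D)        # S_t drawn from D_t
        h = DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx])
        miss = h.predict(X) != y
        eps = D[miss].sum()                                   # Eq. (12)
        if eps > 0.5:
            break                                             # abort, as in Figure 8
        beta = max(eps, 1e-10) / (1.0 - eps)                  # Eq. (13), guarded at eps = 0
        D = D * np.where(miss, 1.0, beta)                     # Eq. (14) ...
        D = D / D.sum()                                       # ... then renormalize by Z_t
        hypotheses.append(h)
        betas.append(beta)
    return hypotheses, betas

def adaboost_predict(hypotheses, betas, X, classes):
    """Weighted majority voting with weights log(1/beta_t), Eq. (15)."""
    votes = np.zeros((len(X), len(classes)))
    for h, beta in zip(hypotheses, betas):
        pred = h.predict(X)
        for j, c in enumerate(classes):
            votes[pred == c, j] += np.log(1.0 / beta)
    return np.array(classes)[votes.argmax(axis=1)]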


Since εt < 1/2, E is guaranteed to decrease with each new classifier. In most practical cases, the error decreases very rapidly in the first few iterations, and approaches zero as new classifiers are added. While this is remarkable on its own account, the surprising resistance of AdaBoost to overfitting is particularly noteworthy. Overfitting is a commonly observed phenomenon in which the classifier performs poorly on test data, despite achieving a very small training error. Overfitting is usually attributed to memorizing the data, or to learning the noise in the data. As a classifier's capacity increases (for example, with the complexity of its architecture), so does its tendency to memorize the training data and/or learn the noise in the data. Since the capacity of an ensemble increases with each added classifier, one would expect AdaBoost to suffer from overfitting, particularly if its complexity exceeds what is necessary to learn the underlying data distribution. Yet, AdaBoost performance usually levels off with an increasing number of classifiers, with no indication of overfitting.

Schapire et al. later provided an explanation of this phenomenon based on the so-called margin theory [52]. The details of margin theory are beyond the scope of this paper; however, the margin of an instance x, in loose terms, is its distance from the decision boundary. The further away an instance is from the boundary, the larger its margin, and hence the higher the confidence of the classifier in correctly classifying this instance. Margins are usually described within the context of support vector machine (SVM) type classifiers, which are based on the idea of maximizing the margins between the instances and the decision boundary. In fact, SVMs find those instances, called support vectors, that lie closest to the decision boundary. The support vectors are said to define the margin that separates the classes. The concept of a margin is illustrated in Figure 10.

By using a slightly different definition of a margin, suitably modified for AdaBoost, Schapire et al. show that AdaBoost also boosts the margins; that is, it finds the decision boundary that is further away from the instances of all classes.


Figure 9. The AdaBoost.M1 algorithm. (Conceptual block diagram: the training data distributions D1, ..., DT supply the training subsets S1, ..., ST to classifiers C1, ..., CT, which produce hypotheses h1, ..., hT; each classifier's normalized error βt is used to update the distribution, and the hypotheses are combined by weighted majority voting with voting weights log(1/βt) to produce the ensemble decision.)


In the context of AdaBoost, the margin of an instance is simply the difference between the total vote it receives from correctly identifying classifiers and the maximum vote received by any incorrect class. They show that the ensemble error is bounded with respect to the margin, but is independent of the number of classifiers [52]. While the bound is loose, the generalization error decreases with increasing margin, and AdaBoost maximizes these margins by increasing the confidence of the ensemble's decision.

Several variations of AdaBoost are available, a couple of which were proposed by Freund and Schapire themselves: AdaBoost.M2 removes the requirement that each individual hypothesis maintain a weighted error less than 1/2 (for multiclass problems, εt < 1/2 is sufficient, but not required), whereas AdaBoost.R extends the boosting approach to regression type problems [26].

There are also more heuristic variations that modify either the distribution update rule or the combination rule of the classifiers. For example, AveBoost averages the distribution weights to make the errors of each hypothesis as uncorrelated as possible with those of the previous ones (a key goal for achieving diversity) [53], [54], whereas Learn++ makes the distribution update rule contingent on the ensemble error (instead of the previous hypothesis' error) to allow for efficient incremental learning of new data that may introduce new classes [55].

3.5. Stacked Generalization

Certain instances may have a high likelihood of being misclassified because, for example, they are very close to the decision boundary, and hence often fall on the wrong side of the boundary approximated by the classifier. Conversely, certain instances may have a high likelihood of being correctly classified, because they are primarily far away from, and on the correct side of, their respective decision boundaries. A natural question arises: can we learn that certain classifiers consistently correctly classify, or consistently misclassify, certain instances? Or, more specifically, given an ensemble of classifiers operating on a set of data drawn from an unknown but fixed distribution, can we map the outputs of these classifiers to their true classes?

In Wolpert's stacked generalization, an ensemble of classifiers is first created, whose outputs are used as inputs to a second level meta-classifier that learns the mapping between the ensemble outputs and the actual correct classes [6]. Figure 11 illustrates the stacked generalization approach, where classifiers C1, ..., CT are trained using training parameters θ1 through θT (where θt may include different training datasets, classifier architectural parameters, etc.) to output hypotheses h1 through hT. The outputs of these classifiers and the corresponding true classes are then used as input/output training pairs for the second level classifier, CT+1.


Figure 10. Both SVMs and AdaBoost maximize the margin between instances of different classes. (Two panels over Feature 1 and Feature 2: various potential decision boundaries separating class 1 and class 2 instances, and the optimal decision boundary with the maximum margin defined by the support vectors of each class.)

Figure 11. Stacked generalization. (Input x is fed to the first level base classifiers C1, ..., CT, each trained with parameters θ1, ..., θT and producing hypotheses h1(x, θ1), ..., hT(x, θT); these outputs feed the second level meta-classifier CT+1, with parameters θT+1, which produces the final decision.)


Traditionally, the k-fold selection process described in Section 2 is used to obtain the training data for classifier CT+1. Specifically, the entire training dataset is divided into T blocks, and each first-level classifier C1, ..., CT is first trained on (a different set of) T−1 blocks of the training data. Therefore, there is one block of data not seen by each of the classifiers C1 through CT. The outputs of each classifier for the block of instances on which it was not trained, along with the correct labels of those instances, constitute the training data for the second level meta-classifier CT+1. Once CT+1 is trained, all data are pooled, and the individual classifiers C1, ..., CT are retrained on the entire database, using a suitable resampling method.
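A compact way to realize this scheme in Python is sketched below (an illustrative addition): scikit-learn's cross_val_predict supplies the out-of-fold first-level outputs that train the meta-classifier, after which the base classifiers are retrained on all data. The particular base and meta models are arbitrary choices, and the predicted class labels are used directly as meta-features for brevity.

import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

def train_stacked(X, y, cv=5):
    """First level classifiers C_1..C_T plus a meta-classifier C_{T+1} that maps
    their out-of-fold outputs to the true class (Section 3.5)."""
    base = [DecisionTreeClassifier(max_depth=3), KNeighborsClassifier(), GaussianNB()]
    # Each base classifier predicts only the data it was not trained on (block scheme)
    meta_inputs = np.column_stack([cross_val_predict(clf, X, y, cv=cv) for clf in base])
    meta = LogisticRegression(max_iter=1000).fit(meta_inputs, y)
    base = [clf.fit(X, y) for clf in base]   # retrain the first level on the pooled data
    return base, meta

def stacked_predict(base, meta, X):
    return meta.predict(np.column_stack([clf.predict(X) for clf in base]))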

3.6. Mixture-of-Experts

A conceptually similar technique is the mixture-of-experts model [4], where a set of classifiers C1, ..., CT constitutes the ensemble, followed by a second level classifier CT+1 used for assigning weights to the subsequent combiner. Figure 12 illustrates this model, where the combiner itself is usually not a classifier, but rather a simple combination rule, such as random selection (from a weight distribution), weighted majority, or weighted winner-takes-all. However, the weight distribution used by the combiner is determined by a second level classifier, usually a neural network, called the gating network. The gating network is trained either through standard gradient descent based backpropagation or, more commonly, through the expectation maximization (EM) algorithm [5], [56]. In either case, the inputs to the gating network are the actual training data instances themselves (unlike the outputs of the first level classifiers in stacked generalization); hence the weights used in the combination rule are instance specific, creating a dynamic combination rule.

Mixture-of-experts can therefore be seen as a classifier selection algorithm. Individual classifiers are experts in some portion of the feature space, and the combination rule selects the most appropriate classifier, or classifiers weighted with respect to their expertise, for each instance x. The pooling system may use the weights in several different ways: it may choose a single classifier with the highest weight, or calculate a weighted sum of the classifier outputs for each class, and pick the class that receives the highest weighted sum. The latter approach requires that the classifier outputs be continuous-valued for each class.
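The sketch below is a deliberately simplified, heuristic stand-in for the mixture-of-experts idea, not the EM-trained model described above: the experts are trained on bootstrap replicas, and the "gate" is a logistic regression trained to predict which expert tends to be correct on a given x; its instance-specific output probabilities then weight the experts' votes. All names and modeling choices are illustrative.

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

def train_moe(X, y, n_experts=3, seed=0):
    rng = np.random.default_rng(seed)
    classes = np.unique(y)
    # Experts: trained here on bootstrap replicas (the model described in the text
    # would localize the experts and train them jointly with the gate, e.g. via EM).
    experts = []
    for _ in range(n_experts):
        idx = rng.choice(len(X), size=len(X), replace=True)
        experts.append(DecisionTreeClassifier(max_depth=3).fit(X[idx], y[idx]))
    # Gate: trained to predict which expert to trust for a given x, using the index
    # of the first expert that classifies x correctly (0 if none does) as its target.
    correct = np.array([e.predict(X) == y for e in experts])
    gate_target = np.where(correct.any(axis=0), correct.argmax(axis=0), 0)
    gate = LogisticRegression(max_iter=1000).fit(X, gate_target)
    return experts, gate, classes

def moe_predict(experts, gate, classes, X):
    # Instance-specific gating weights over the experts (columns follow gate.classes_)
    w = np.zeros((len(X), len(experts)))
    w[:, gate.classes_] = gate.predict_proba(X)
    # Weighted support for each class: each expert votes with its gating weight
    support = np.zeros((len(X), len(classes)))
    for e_idx, e in enumerate(experts):
        pred = e.predict(X)
        for c_idx, c in enumerate(classes):
            support[pred == c, c_idx] += w[pred == c, e_idx]
    return classes[support.argmax(axis=1)]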

4. Combining Classifiers

The second key component of any ensemble system is the strategy employed in combining classifiers. Combination rules are often grouped as (i) trainable vs. non-trainable combination rules, or (ii) combination rules that apply to class labels vs. to class-specific continuous outputs.

In trainable combination rules, the parameters of the combiner, usually called weights, are determined through a separate training algorithm. The EM algorithm used by the mixture-of-experts model is one such example. The combination parameters created by trainable rules are usually instance specific, and hence are also called dynamic combination rules. Conversely, there is no separate training involved in non-trainable rules beyond that used for generating the ensembles. Weighted majority voting, discussed below, is an example of such non-trainable rules, since the parameters become immediately available as the classifiers are generated.

In the second taxonomy, combination rules that apply to class labels need only the classification decision (that is, one of ω_j, j = 1, ..., C), whereas others need the continuous-valued outputs of individual classifiers. These values often represent the degrees of support the classifiers give to each class.


Figure 12. Mixture-of-experts: the outputs h_1(x, θ_1) ... h_T(x, θ_T) of classifiers C_1 ... C_T are combined by a pooling/combining system whose weights w_1 ... w_T are assigned by a gating network C_{T+1}, usually trained with expectation maximization.


Such supports can estimate the class-conditional posterior probabilities P(ω_j|x) if (i) they are appropriately normalized to add up to 1 over all classes, and (ii) the classifiers are trained with sufficiently dense data. Many classifier models, such as the MLP and RBF networks, provide continuous-valued outputs, which are often interpreted as posterior probabilities, despite the sufficiently dense training data requirement rarely being met in practice.

In this paper, we follow the second grouping: we first review combination rules that apply to class labels, followed by those that combine class-specific continuous outputs.

4.1. Combining Class Labels
In the following discussion, we assume that only the class labels are available from the classifier outputs. Let us define the decision of the t-th classifier as d_{t,j} ∈ {0, 1}, t = 1, ..., T and j = 1, ..., C, where T is the number of classifiers and C is the number of classes. If the t-th classifier chooses class ω_j, then d_{t,j} = 1, and 0 otherwise.

4.1.1. Majority Voting
There are three versions of majority voting, where the ensemble chooses the class (i) on which all classifiers agree (unanimous voting); (ii) predicted by at least one more than half the number of classifiers (simple majority); or (iii) that receives the highest number of votes, whether or not the sum of those votes exceeds 50% (plurality voting, or simply majority voting). The ensemble decision for plurality voting can be described as follows: choose class ω_J if

\sum_{t=1}^{T} d_{t,J} = \max_{j=1}^{C} \sum_{t=1}^{T} d_{t,j}   (17)

This is precisely the voting mechanism used by the audience polling mentioned in the Introduction. It is not difficult to demonstrate that majority voting is an optimal combination rule under the minor assumptions that (1) we have an odd number of classifiers for a two-class problem; (2) the probability of each classifier choosing the correct class is p for any instance x; and (3) the classifier outputs are independent. Then, plurality voting and simple majority voting are identical, and the ensemble makes the correct decision if at least ⌊T/2⌋ + 1 classifiers choose the correct label, where the floor function ⌊·⌋ returns the largest integer less than or equal to its argument. The accuracy of the ensemble can then be represented by the binomial distribution as the total probability of obtaining k ≥ ⌊T/2⌋ + 1 successes out of T classifiers, where each classifier has a success rate of p. Hence P_ens, the probability of ensemble success, is

P_ens = \sum_{k=\lfloor T/2 \rfloor + 1}^{T} \binom{T}{k} p^{k} (1 − p)^{T−k}   (18)

This sum approaches 1 as T → ∞ if p > 0.5, and it approaches 0 if p < 0.5. This result is known as the Condorcet Jury Theorem (1786), whose details can be found in [57], [58]. Herein lies the power of audience polling: if we can assume that the probability of each audience member giving the correct answer is better than 1/2, then for a large enough audience (say, over 100), the probability of audience success approaches 1. Now, recall that we need 50% + 1 of the audience members to answer correctly for majority voting to choose the correct answer in a two-class problem; but the contestant has to choose one of four answers (a four-class problem). It can be shown that, with more than two options, fewer correct classifiers can in fact be sufficient to obtain the correct class, since the incorrect votes will be distributed over a larger number of classes. Therefore, insisting that each classifier have a probability of success of 1/2 or better is sufficient, but not necessary; it is actually a rather conservative requirement. An extensive and excellent analysis of the majority voting approach can be found in [21].
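A quick numerical check of Equation 18 illustrates this behavior; the helper below is only a sketch, not code from the article.

```python
# Numerical check of Equation 18: probability that a majority of T independent
# classifiers, each correct with probability p, makes the correct decision.
from math import comb, floor

def p_ensemble(T, p):
    return sum(comb(T, k) * p**k * (1 - p)**(T - k)
               for k in range(floor(T / 2) + 1, T + 1))

print(p_ensemble(101, 0.6))   # approx 0.98: a 101-member audience with p > 0.5
print(p_ensemble(101, 0.4))   # approx 0.02: the same audience with p < 0.5
```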

4.1.2. Weighted Majority Voting
If we have evidence that certain experts are more qualified than others, weighting the decisions of those qualified experts more heavily may further improve the overall performance beyond that obtainable by plurality voting. Again, let us denote the decision of hypothesis h_t

on class ω_j as d_{t,j}, such that d_{t,j} is 1 if h_t selects ω_j, and 0 otherwise. Further assume that we have a way of estimating the future performance of each classifier, and that we assign a weight w_t to classifier h_t in proportion to its estimated performance. According to this notation, the classifiers whose decisions are combined through weighted majority voting will choose class ω_J if

\sum_{t=1}^{T} w_t d_{t,J} = \max_{j=1}^{C} \sum_{t=1}^{T} w_t d_{t,j}   (19)

that is, if the total weighted vote received by ω_J is higher than the total weighted vote received by any other class. For easier interpretation, we can normalize these weights so that they sum to 1; however, normalization does not change the outcome of weighted majority voting.

A natural question is then “how do we assign the weights?” Clearly, had we known which classifiers would work better, we would give the highest weights to those classifiers, or perhaps use only those classifiers. In the absence of this knowledge, a plausible strategy is to use the performance of a classifier on a separate validation dataset, or even its performance on the training dataset, as an estimate of that classifier’s future performance. As indicated in Equations 13 and 15, AdaBoost follows the latter approach: AdaBoost assigns a voting weight of


log(1/β_t) to h_t, where β_t = ε_t/(1 − ε_t), and ε_t is the weighted training error of hypothesis h_t. We have mentioned that this normalization of the error allows us to use an error measure β that is between 0 and 1 (as opposed to ε, which is between 0 and 1/2); however, there is another strategic reason for this specific normalization. It can be shown that if we have T class-conditionally independent classifiers with individual accuracies p_1, ..., p_T, the accuracy of the ensemble combined by weighted majority voting is maximized if the voting weights satisfy the proportionality

w_t ∝ \log \frac{p_t}{1 − p_t}   (20)

whose proof can be found in [21]. Representing the accuracies of individual classifiers as p_t = 1 − ε_t, we see that the weight assignment used by AdaBoost is identical to the one in Equation 20. A detailed discussion of weighted majority voting can also be found in [59].
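A small sketch of weighted majority voting with the log-odds weights of Equation 20 might look as follows; the function name and example accuracies are hypothetical.

```python
# Weighted majority voting (Equations 19-20) with weights proportional to
# log(p_t / (1 - p_t)); function name and example values are hypothetical.
import numpy as np

def weighted_majority_vote(labels, accuracies, n_classes):
    # labels[t] is the class index chosen by classifier t for this instance.
    acc = np.asarray(accuracies, dtype=float)
    weights = np.log(acc / (1.0 - acc))
    totals = np.zeros(n_classes)
    for lbl, w in zip(labels, weights):
        totals[lbl] += w
    return int(np.argmax(totals))

# One strong classifier (p = 0.9) outvotes two weaker ones (p = 0.6) here.
print(weighted_majority_vote([0, 1, 1], [0.9, 0.6, 0.6], n_classes=2))  # -> 0
```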

4.1.3. Behavior Knowledge Space (BKS)
Originally proposed by Huang and Suen, the behavior knowledge space (BKS) approach uses a lookup table, constructed from the classification of the training data, that keeps track of how often each labeling combination is produced by the classifiers [60], [61]. Then, the true class for which a particular labeling combination was observed most often during training is chosen every time that combination of class labels occurs during testing. The procedure is best described by an example, illustrated in Figure 13.

Assume that we have three classifiers C_1 through C_3 for a three-class problem. There are 27 possible combinations of labeling, from {ω_1 ω_1 ω_1} to {ω_3 ω_3 ω_3}, that can be chosen by the three classifiers. During training, we keep track of how often each combination occurs, and of the corresponding correct class for each such occurrence. Sample numbers are provided for each combination and for each class, and the maxima are circled in Figure 13. For example, let us assume that the combination {ω_1 ω_2 ω_1} occurs a total of 28 times, and that in 10 of these the true class is ω_1, in 15 the true class is ω_2, and in 3 the true class is ω_3. The winner for this combination is therefore ω_2, the most frequently observed true class for this combination of labels. Therefore, during testing, every time the combination ω_1 ω_2 ω_1 occurs, the ensemble chooses ω_2. Note that the choice of ω_2 is different from the class ω_1 that would have been chosen by the simple majority rule for this combination of labels. If trained with sufficiently dense training data, and if the table is appropriately normalized, the maximum vote obtained from the BKS table can estimate the class posterior probabilities with reasonable accuracy.
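A BKS table is essentially a lookup table indexed by label combinations; the following sketch (with hypothetical helper names) shows one way to build and query it, falling back to another rule for combinations never seen during training.

```python
# Hypothetical sketch of a BKS lookup table: count how often each labeling
# combination occurs for each true class, then pick the most frequent class.
from collections import defaultdict

def build_bks_table(train_combos, true_labels, n_classes):
    table = defaultdict(lambda: [0] * n_classes)
    for combo, y in zip(train_combos, true_labels):
        table[tuple(combo)][y] += 1       # e.g. combo (0, 1, 0) for {w1 w2 w1}
    return table

def bks_decision(table, combo, fallback):
    counts = table.get(tuple(combo))
    # Fall back (e.g., to plain majority voting) for combinations never seen.
    return fallback(combo) if counts is None else counts.index(max(counts))
```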

4.1.4. Borda Count
Borda count differs from the rules listed previously in one important respect: it does not throw away the support given to non-winning classes. Borda count is typically used when the classifiers can rank-order the classes. This is easily done if the classifiers provide continuous outputs, since the classes can then be rank-ordered with respect to the support they receive from the


Figure 13. Behavior knowledge space illustration: the BKS table lists, for each of the 27 possible label combinations from classifiers C_1–C_3 (from {ω_1 ω_1 ω_1} to {ω_3 ω_3 ω_3}), the number of times that combination occurs in the training data for each true class ω_1, ω_2, ω_3.


classifier. However, Borda count does not need the values of these continuous outputs, only the rankings; hence it qualifies as a combination rule that applies to labels.

In standard Borda count, each voter (classifier) rank-orders the candidates (classes). If there are N candidates, the first-place candidate receives N − 1 votes, the second-place candidate receives N − 2, and the candidate in i-th place receives N − i votes. The candidate ranked last receives 0 votes. The votes are added up across all classifiers, and the class with the most votes is chosen as the ensemble decision.
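As a brief sketch (not from the article), Borda count can be computed directly from the per-class supports by converting each classifier's ranking into votes:

```python
# Borda count sketch: convert each classifier's ranking of the classes into
# votes (first place gets N - 1, last place gets 0) and sum the votes.
import numpy as np

def borda_count(supports):
    # supports: (T, N) array; supports[t, j] = support of classifier t for class j.
    T, N = supports.shape
    totals = np.zeros(N)
    for t in range(T):
        order = np.argsort(supports[t])        # ascending; ties broken arbitrarily
        for votes, cls in enumerate(order):    # last-ranked class gets 0 votes
            totals[cls] += votes
    return totals, int(np.argmax(totals))
```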

Originally devised in 1770 by Jean Charles de Borda, Borda count is often used in practice: selecting the most valuable player in U.S. baseball, ranking universities in college sports, electing officers in certain university senate elections, and choosing the winning song in the annual Europe-wide song contest Eurovision all use some suitable variation of Borda count.

4.2. Combining Continuous Outputs
The continuous output provided by a classifier for a given class is interpreted as the degree of support given to that class, and it is usually accepted as an estimate of the posterior probability for that class. The posterior probability interpretation requires access to sufficiently dense training data, and that the outputs be appropriately normalized to add up to 1 over all classes. Usually, the softmax normalization is used for this purpose [63]. Representing the actual classifier output corresponding to the j-th class as g_j(x), and the normalized value as g'_j(x), the approximated posterior probability P(ω_j|x) can be obtained as

P(ω_j|x) ≈ g'_j(x) = \frac{e^{g_j(x)}}{\sum_{c=1}^{C} e^{g_c(x)}} \;\Rightarrow\; \sum_{j=1}^{C} g'_j(x) = 1   (21)

Kuncheva et al. define the so-called decision profile matrix in [16], which allows us to present all of the following combination rules from a unified perspective. The decision profile matrix DP(x) for an instance x consists of elements d_{t,j} ∈ [0, 1], which represent the support given by the t-th classifier to class ω_j. The rows of DP(x) therefore represent the support given by individual classifiers to each of the classes, whereas the columns represent the support received by a particular class from all classifiers.

The decision profile matrix is illustrated in Figure 14.

4.2.1. Algebraic Combiners
Simple algebraic combiners are, in general, non-trainable combiners of continuous outputs. The total support for each class is obtained as a simple function of the supports received by individual classifiers. Following the same notation as in [16], we represent the total support received by class ω_j, the j-th column of the decision profile DP(x), as

µ_j(x) = \mathcal{F}[d_{1,j}(x), \ldots, d_{T,j}(x)]   (22)

where \mathcal{F}(\cdot) is the combination function, such as one of those listed below.
Mean Rule: The support for ω_j is obtained as the average of all classifiers’ j-th outputs,

µ_j(x) = \frac{1}{T} \sum_{t=1}^{T} d_{t,j}(x)   (23)

that is, the function \mathcal{F}(\cdot) is the averaging function. The mean rule is equivalent to the sum rule (within a normalization factor of 1/T), which also appears often in the literature. In either case, the ensemble decision is taken as the class ω_j for which the total support µ_j(x) is largest.
Weighted Average: This rule is a hybrid of the mean rule and weighted majority voting, where the weights are applied not to class labels but to the actual continuous outputs. This particular combination rule can qualify either as a trainable or a non-trainable combination rule, depending on how the weights are obtained. If the weights are obtained during ensemble generation as part of the regular training, as in AdaBoost, then it is a non-trainable combination rule; if a separate training is used to obtain the weights, as in the mixture-of-experts model, then it is a trainable combination rule. Usually, we have one weight per classifier, or sometimes one per class per classifier.


Figure 14. Decision profile for a given instance x: DP(x) is a T × C matrix whose (t, j) entry d_{t,j}(x) is the support given by classifier C_t (one of T classifiers) to class ω_j (one of C classes).


In the former case, we have T weights, w_1, ..., w_T, usually obtained as estimated accuracies from training performances, and the total support for class ω_j is

µ_j(x) = \sum_{t=1}^{T} w_t d_{t,j}(x)   (24)

In the latter case, we have a total of T × C weights that are class and classifier specific, hence a class-conscious combination [16]. The total support for class ω_j is then given as

µ_j(x) = \sum_{t=1}^{T} w_{t,j} d_{t,j}(x)   (25)

where w_{t,j} is the weight of the t-th classifier for classifying class ω_j instances.
Trimmed Mean: What if some of the classifiers give unwarranted and unusually low, or unusually high, support to a particular class? This would adversely affect the mean combiner. To avoid this problem, the most optimistic and pessimistic classifiers are removed from the ensemble before calculating the mean, a procedure known as the trimmed mean. For a P% trimmed mean, we simply remove P% of the supports from each end and calculate the mean of the remaining supports, avoiding extreme values of support.

Minimum/Maximum/Median Rule: As the names imply, these functions simply take the minimum, maximum, or median of the classifiers’ individual outputs:

µ_j(x) = \max_{t=1,\ldots,T} \{d_{t,j}(x)\}, \quad µ_j(x) = \min_{t=1,\ldots,T} \{d_{t,j}(x)\}, \quad µ_j(x) = \mathrm{median}_{t=1,\ldots,T} \{d_{t,j}(x)\}   (26)

In any of these cases, the ensemble decision is again chosen as the class for which the total support is largest. The minimum rule is the most conservative combination rule, as it chooses the class for which the minimum support among the classifiers is largest. Also worth mentioning is that the trimmed mean at its 50% limit is equivalent to the median rule.
Product Rule: In the product rule, the supports provided by the classifiers are multiplied. This rule is very sensitive to the most pessimistic classifiers: a low support (close to 0) for a class from any one of the classifiers can effectively remove any chance of that class being selected. However, if the individual posterior probabilities are estimated correctly at the classifier outputs, then this rule provides the best estimate of the overall posterior probability of the class selected by the ensemble.


Figure 15. Example of various combination rules for a five-classifier, three-class problem. Classifier outputs (the rows of DP(x)) are Classifier 1: [0.85 0.01 0.14], Classifier 2: [0.3 0.5 0.2], Classifier 3: [0.2 0.6 0.2], Classifier 4: [0.1 0.7 0.2], Classifier 5: [0.1 0.1 0.8], with classifier weights 0.30, 0.25, 0.20, 0.10, 0.15, respectively. The resulting supports (or votes) for class 1 / class 2 / class 3 are: product rule 0.0179 / 0.00021 / 0.0009; sum rule 1.55 / 1.91 / 1.54; mean rule 0.310 / 0.382 / 0.308; max rule 0.85 / 0.7 / 0.8; median rule 0.2 / 0.5 / 0.2; minimum rule 0.1 / 0.01 / 0.14; weighted average 0.395 / 0.333 / 0.272; majority voting 1 / 3 / 1 votes; weighted majority voting 0.3 / 0.55 / 0.15 votes; Borda count 5 / 6 / 4 votes.


µ_j(x) = \frac{1}{T} \prod_{t=1}^{T} d_{t,j}(x)   (27)

Generalized Mean: Many of the above rules are in fact special cases of the generalized mean,

µ_j(x, α) = \left( \frac{1}{T} \sum_{t=1}^{T} d_{t,j}(x)^{α} \right)^{1/α}   (28)

where specific choices of α result in the different combination rules discussed above. For example, for α → −∞ we obtain the minimum rule, µ_j(x, α) = min_t {d_{t,j}(x)}; for α → 0, we obtain

µ_j(x, α) = \left( \prod_{t=1}^{T} d_{t,j}(x) \right)^{1/T}   (29)

which is the geometric mean, a modified version of the product rule. For α → 1 we obtain the mean rule, and α → ∞ gives us the maximum rule.
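A short numerical sketch (hypothetical helper) confirms this limiting behavior of the generalized mean on a set of supports:

```python
# Generalized mean of one class's supports (Equation 28); the alpha -> 0 case
# is handled separately as the geometric mean.
import numpy as np

def generalized_mean(d, alpha):
    d = np.asarray(d, dtype=float)
    if np.isclose(alpha, 0.0):
        return float(np.exp(np.mean(np.log(d))))       # geometric mean limit
    return float(np.mean(d ** alpha) ** (1.0 / alpha))

d = [0.85, 0.3, 0.2, 0.1, 0.1]
print(generalized_mean(d, -50))   # close to min(d) = 0.1
print(generalized_mean(d, 1))     # the mean, 0.31
print(generalized_mean(d, 50))    # close to max(d) = 0.85
```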

As an example, consider the case in Figure 15, where we have a five-classifier ensemble for a three-class problem. Assume that we have obtained the outputs and weights shown in Figure 15 for a given input x from the ensemble. The corresponding decision profile for x is then

DP(x) = \begin{bmatrix} 0.85 & 0.01 & 0.14 \\ 0.30 & 0.50 & 0.20 \\ 0.20 & 0.60 & 0.20 \\ 0.10 & 0.70 & 0.20 \\ 0.10 & 0.10 & 0.80 \end{bmatrix}   (30)

We make several observations from Figure 15, where the winning class for each combination rule is shown in bold. (i) Different classes may win under different combination rules. In this example, class 2 wins most of the combination rules, class 1 wins three, and class 3 wins only one rule (the minimum rule). Had the output of Classifier 5 been [0.05 0.05 0.9] instead of [0.1 0.1 0.8], however, class 3 would also have won the maximum rule. (ii) The product rule and the minimum rule severely punish class 2, as it received very little support from Classifier 1. (iii) The sum rule and the mean rule provide the same outcome, as expected. (iv) The weighted average chooses class 1, primarily

because of the high support given to class 1 by the highly weighted Classifier 1. The three voting-based rules for combining label outputs are also provided, all of which vote in favor of class 2 (for Borda count, ties in the rankings were broken randomly in favor of class 1, and the classifier weights given in Figure 15 were used for weighted majority voting).
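Several of these numbers are easy to verify directly from the decision profile of Equation 30; the following sketch recomputes the sum, mean, max, median, minimum and weighted-average supports (the expected values in the comments are those reported in Figure 15).

```python
# Recomputing several Figure 15 combiners on the decision profile of
# Equation 30 (rows = classifiers, columns = classes).
import numpy as np

DP = np.array([[0.85, 0.01, 0.14],
               [0.30, 0.50, 0.20],
               [0.20, 0.60, 0.20],
               [0.10, 0.70, 0.20],
               [0.10, 0.10, 0.80]])
w = np.array([0.30, 0.25, 0.20, 0.10, 0.15])   # classifier weights from Figure 15

print("sum     :", DP.sum(axis=0))         # [1.55 1.91 1.54]    -> class 2
print("mean    :", DP.mean(axis=0))        # [0.310 0.382 0.308] -> class 2
print("max     :", DP.max(axis=0))         # [0.85 0.70 0.80]    -> class 1
print("median  :", np.median(DP, axis=0))  # [0.20 0.50 0.20]    -> class 2
print("minimum :", DP.min(axis=0))         # [0.10 0.01 0.14]    -> class 3
print("weighted:", w @ DP)                 # [0.395 0.333 0.272] -> class 1
```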

4.2.2. Decision Templates
Kuncheva takes the idea of decision profiles one step further by defining decision templates as the average decision profile observed for each class throughout training [16]. Given a test instance x, its decision profile is compared to the decision templates of each class, and the class whose decision template is closest, in some similarity measure, is chosen as the ensemble decision. More specifically, the decision template for ω_j, illustrated in Figure 16, is calculated as

DT_j = \frac{1}{N_j} \sum_{X_j ∈ ω_j} DP(X_j)   (31)

which is the average decision profile obtained from X_j, the set (with cardinality N_j) of training instances that belong to true class ω_j. Given an unlabeled test instance x, we first construct its DP(x) from the ensemble outputs (as shown in Figure 15), and then calculate the similarity S between DP(x) and the decision template DT_j of each class ω_j as the degree of support given to class ω_j:

µ_j(x) = S(DP(x), DT_j), \quad j = 1, \ldots, C   (32)

The similarity measure S is usually a squared Euclidean distance, obtained as

µ_j(x) = 1 − \frac{1}{T × C} \sum_{t=1}^{T} \sum_{k=1}^{C} \left[ DT_j(t, k) − d_{t,k}(x) \right]^2   (33)


Figure 16. Decision template DT_j for class ω_j, obtained by averaging the decision profiles of the training instances whose true class is ω_j; its (t, k) entry is the support given by classifier C_t to class ω_k, averaged over class ω_j instances.


where DT_j(t, k) is the support given by the t-th classifier to class ω_k in the decision template DT_j; in other words, DT_j(t, k) is the support given by the t-th classifier to class ω_k, averaged over class ω_j instances. This support should ideally be high when k = j, and low otherwise. The second term, d_{t,k}(x), is the support given by the t-th classifier to class ω_k for the given instance x. As usual, the class with the highest total support is finally chosen as the ensemble decision.
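A compact sketch of the decision-template combiner (Equations 31-33) is given below; the array shapes and function names are illustrative assumptions.

```python
# Hypothetical sketch of the decision-template combiner (Equations 31-33).
import numpy as np

def decision_templates(profiles, labels, n_classes):
    # profiles: (N, T, C) training decision profiles; labels: (N,) true classes.
    return np.stack([profiles[labels == j].mean(axis=0) for j in range(n_classes)])

def dt_decision(templates, dp):
    # dp: (T, C) decision profile of the test instance.
    T, C = dp.shape
    mu = 1.0 - ((templates - dp) ** 2).sum(axis=(1, 2)) / (T * C)   # Equation 33
    return int(np.argmax(mu)), mu
```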

4.2.3. Dempster-Shafer Based Combination
This combination rule is borrowed from data fusion, a field of data analysis primarily concerned with combining elements of evidence provided by different sources of data. Many data fusion techniques are based on the Dempster-Shafer (DS) theory of evidence, which uses belief functions (instead of probability) to quantify the evidence available from each data source; the beliefs are then combined using Dempster’s rule of combination [64], [65]. DS theory has traditionally been used in military applications, such as sensor fusion for target tracking, or friend-or-foe identification. It can easily be applied, however, to any decision making problem by interpreting the output of a classifier as a measure of evidence provided by the source that generated the training data [10], [34]. In such a setting, DS theory is not used for data fusion, that is, to combine data from different sources, but rather to combine the evidence provided by ensemble classifiers trained on data coming from the same source. We describe DS theory as an ensemble combination rule here, and discuss the use of ensemble systems for combining data from different sources later under Current and Emerging Areas.

The decision template formulation becomes useful in describing DS theory as an ensemble combination rule: let DT_j^t denote the t-th row of the decision template DT_j, and let C_t(x) denote the output of the t-th classifier, that is, the t-th row of the decision profile DP(x): C_t(x) = [d_{t,1}(x), ..., d_{t,C}(x)]. Instead of a similarity, we now calculate the proximity Φ_{j,t}(x) of the class-j decision template row DT_j^t to this classifier’s decision on instance x, C_t(x) [10], [21], [35]:

Φ_{j,t}(x) = \frac{\left( 1 + \| DT_j^t − C_t(x) \|^2 \right)^{−1}}{\sum_{k=1}^{C} \left( 1 + \| DT_k^t − C_t(x) \|^2 \right)^{−1}}   (34)

where the differences (calculated as distances in the Euclidean norm) in both the numerator and denominator are converted to similarities, representing proximities, through the reciprocal operation. The denominator is really a normalization term, representing the total proximity of

the t-th classifier’s decision to the decision templates of all classes. Based on these proximities, we compute our belief, or evidence, that the t-th classifier C_t is correctly identifying instance x into class ω_j:

b_j(C_t(x)) = \frac{Φ_{j,t}(x) \prod_{k ≠ j} \left( 1 − Φ_{k,t}(x) \right)}{1 − Φ_{j,t}(x) \left[ 1 − \prod_{k ≠ j} \left( 1 − Φ_{k,t}(x) \right) \right]}   (35)

Once the belief values are obtained for each source (classifier), they can be combined by Dempster’s rule of combination, which simply states that the evidences (belief values) from each source should be multiplied to obtain the final support for each class:

µ_j(x) = K \prod_{t=1}^{T} b_j(C_t(x))   (36)

where K is a normalization constant ensuring that the total support for ω_j from all classifiers is 1.
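The following sketch strings Equations 34-36 together for a single test instance; it is an illustrative reading of the formulas above, not the implementation used in the cited works.

```python
# Illustrative reading of Equations 34-36 for one test instance.
import numpy as np

def ds_combine(templates, dp):
    # templates: (n_classes, T, C) decision templates; dp: (T, C) profile of x.
    n_classes, T, _ = templates.shape
    support = np.ones(n_classes)
    for t in range(T):
        dist = np.array([np.sum((templates[j, t] - dp[t]) ** 2)
                         for j in range(n_classes)])
        prox = 1.0 / (1.0 + dist)
        prox /= prox.sum()                                   # Equation 34
        belief = np.empty(n_classes)
        for j in range(n_classes):
            rest = np.prod(np.delete(1.0 - prox, j))
            belief[j] = prox[j] * rest / (1.0 - prox[j] * (1.0 - rest))  # Eq. 35
        support *= belief                                    # Dempster's rule (Eq. 36)
    return support / support.sum()                           # normalization constant K
```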

4.3. But, Which One Is Better?
When there are many competing approaches to a problem, an effort to determine a winner is inevitable; hence the question “which ensemble generation or combination rule is the best?” The no-free-lunch theorem has proven that there is in fact no such best classifier for all classification problems [47], and that the best algorithm depends on the structure of the available data and on prior knowledge. Yet many studies have compared various ensemble generation and combination rules under various scenarios. For example, Dietterich [66], [67], Breiman [68], Bauer et al. [69], Drucker [17] and Quinlan [70] all separately compared bagging, boosting and other ensemble based approaches. The typical consensus is that boosting usually achieves better generalization performance, but it is also more sensitive to noise and outliers.

Does the no-free-lunch theorem hold for combination rules as well? Many studies have been conducted to answer this question, including [18], [27], [28], [31], [36], [37], [42], [71], [72], among others. Their verdict? In general, the no-free-lunch theorem holds: the best combination method, just as the best ensemble generation method, depends largely on the particular problem. However, there is a growing consensus on using the mean rule, due to its simplicity and consistent performance over a broad spectrum of applications. If the accuracies of the classifiers can be reliably estimated, then the weighted average and weighted majority approaches may be considered. If the classifier outputs correctly estimate the posterior probabilities, then the product rule should be considered.


5. Current & Emerging Areas

Despite two decades of intense research, and the widely held belief that our current understanding of ensemble based systems has matured [73], the field seems to be enjoying growing attention. This may, in part, be due to emerging application areas that benefit from ensemble systems. The primary thrust of using ensemble systems has been to reduce the risk of choosing a single classifier with poor performance, or to improve upon the performance of a single classifier by using an intelligently combined ensemble of classifiers. Many additional areas and applications have recently emerged, however, for which ensemble systems are inherently appropriate. In this section, we discuss some of the more prominent examples of such emerging areas.

5.1. Incremental Learning
In certain applications, it is not uncommon for the entire dataset to gradually become available in small batches over a period of time. Furthermore, datasets acquired in subsequent batches may introduce instances of new classes that were not present in previous datasets. In such settings, it is necessary to learn the novel information content in the new data without forgetting the previously acquired knowledge, and without requiring access to previously seen data. The ability of a classifier to learn under these circumstances is usually referred to as incremental learning.

A practical approach for learning from new data involves discarding the existing classifier and retraining a new one using all data that have been accumulated thus far. This approach, however, results in the loss of all previously acquired information, a phenomenon known as catastrophic forgetting (or interference) [74]. Quite apart from not conforming to the definition of incremental learning stated above, this approach is undesirable if retraining is computationally or financially costly and, more importantly, it is infeasible if the original dataset is lost, corrupted, discarded, or otherwise unavailable. Such scenarios are not uncommon in databases with restricted or confidential access, such as those in medical and military applications.

Ensemble systems have been successfully used to address this problem. The underlying idea is to generate additional ensembles of classifiers with each subsequent database that becomes available, and to combine their outputs using one of the combination methods discussed above. The algorithm Learn++ and its recent variations have been shown to achieve incremental learning on a broad spectrum of applications, even when new data introduce instances of previously unseen classes [55], [75], [76]. Learn++ is inspired in part by AdaBoost; however, it differs from AdaBoost in one important respect: recall that AdaBoost updates its weight distribution

based on the performance of the hypothesis h_t generated in the previous iteration [26]. In contrast, Learn++ introduces the notion of a composite hypothesis (the combined ensemble generated thus far at any given iteration) and updates its distribution based on the performance of the current ensemble through the use of this composite hypothesis. Learn++ then focuses on instances that have not been properly learned by the entire ensemble. During incremental learning, previously unseen instances, particularly those coming from a new class, are bound to be misclassified by the ensemble, forcing the algorithm to focus on learning such instances introduced by the new data. Furthermore, Learn++ creates an ensemble of ensembles, one ensemble for each database. The ensembles are then combined through a modified weighted majority voting algorithm. More recently, it was shown that the algorithm also works well when the data distribution is unbalanced, that is, when the number of instances in each class or database varies significantly [77].

5.2. Data Fusion
The goal of data fusion is to extract complementary pieces of information from different data sources, allowing a more informed decision about the phenomenon generating the underlying data distributions. While ensemble systems are often used for fusing classifier outputs, such applications have traditionally used classifiers that are trained with different samplings of data that come from the same distribution; therefore, all classifiers are trained on the same feature set. In data fusion applications, however, data obtained from different sources may use different feature sets that may be quite heterogeneous in their composition. An example would be combining medical test results including an MRI scan (an image), an electroencephalogram (a time series), and several blood tests (scalar numbers) for the diagnosis of a neurological disorder.

Even with heterogeneous feature sets, data fusion is a natural fit for ensemble systems, since different classifiers can be generated using data obtained from different sources and subsequently combined to achieve the desired fusion. Several classifier fusion approaches have been proposed for this purpose, including combining classifiers using Dempster-Shafer based combination [78]–[80], ARTMAP [81], Learn++ [82], genetic algorithms [83], and other combinations of boosting/voting methods [84]–[86].

5.3. Feature Selection
One way to improve diversity in the ensemble is to train the classifiers with data that consist of different feature subsets. Several approaches based on this idea have been proposed.


Selecting the feature subsets at random is known as the random subspace method, a term coined by Ho [44], who used it to construct decision tree ensembles. Subset selection need not be random: Oza and Tumer propose input decimation, where the features are selected based on their correlation with the class labels [39]. In general, feature subsets should promote diversity; that is, identical subsets should not be repeated. While most approaches allow overlapping subsets of features, Gunter and Bunke [87] use mutually exclusive sets.
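In the spirit of the random subspace method, a minimal sketch (hypothetical function name, with scikit-learn decision trees as an arbitrary choice of base classifier) is:

```python
# Minimal random-subspace sketch: each ensemble member is trained on a random
# subset of the features (decision trees are an arbitrary choice here).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_subspace_ensemble(X, y, n_classifiers=10, subset_size=None, seed=0):
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    subset_size = subset_size or max(1, d // 2)
    ensemble = []
    for _ in range(n_classifiers):
        feats = rng.choice(d, size=subset_size, replace=False)
        clf = DecisionTreeClassifier().fit(X[:, feats], y)
        ensemble.append((feats, clf))   # each member votes using only its features
    return ensemble
```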

An interesting application of feature subsets has been proposed by Krause and Polikar, who used an ensemble of classifiers trained with an iteratively updated feature distribution to address the missing feature problem [88]. The idea is to generate a large enough ensemble of classifiers using random subsets of the features. Then, if certain features are missing from a given data instance, that instance can still be classified by the pool of classifiers that did not use the missing features during their training. Another interesting application of an ensemble of classifiers with different feature subsets was proposed by Long and Vega, who used the ensemble performance to determine the most informative feature subsets in gene expression data [89].

All of the above mentioned approaches implicitly assume that there is redundancy in the overall feature set, and that the features are not consecutive elements of time-series data.

5.4. Error Correcting Output Codes
Error correcting output codes (ECOC) were originally used in information theory for correcting bit reversals caused by noisy communication channels. More recently, they have also been used to convert binary classifiers, such as support vector machines, into multiclass classifiers by decomposing a multiclass problem into several two-class problems [90]. Inspired by these works, Dietterich and Bakiri introduced ECOC to be used within the ensemble setting [91]. The idea is to use a different class encoding for each member of the ensemble.

The encodings constitute a binary C × T code matrix, where C is the number of classes and T is the number of classifiers, and the classifier outputs are combined by the minimum Hamming distance rule. Table 1 shows a particular code matrix for a five-class problem that uses 15 encodings. This encoding, suggested in [91], is known as the exhaustive code because it includes all possible non-trivial and non-repeating codes. In this formulation, the individual classifiers are trained on several meta two-class problems, where the individual meta-classes include some combination of the original classes. For example, C5 recognizes two meta-classes: original classes ω_1 and ω_3 constitute one class, and the others constitute the second class. During testing, each classifier outputs a “0” or a “1,” creating an output code vector of length 2^{C−1} − 1. This vector is compared to each code word in the code matrix, and the class whose code word has the shortest Hamming distance to the output vector is chosen as the ensemble decision. More formally, the support for class ω_j is given as

µ_j(x) = − \sum_{t=1}^{T} | o_t − M_{j,t} |   (37)

where o_t ∈ {0, 1} is the output of the t-th binary classifier and M is the code matrix. The negative sign converts the distance metric into a support value, whose largest possible value is zero, obtained in the case of a perfect match. For example, the output [0 1 1 1 0 1 0 1 0 1 0 1 0 1 0] is closest to the ω_5 code word, with a Hamming distance of 1 (a support of −1), and hence ω_5 would be chosen for this output. Note that this output does not match any of the code words exactly, and therein lies the error correcting ability of ECOC: in fact, the larger the minimum Hamming distance between code words, the more resistant the ECOC ensemble becomes to misclassifications of individual classifiers. More efficient and error-resistant coding (choosing appropriate code words) as well as decoding (combination methods other than the Hamming distance) are topics of current research [92], [93].


Classifiers

Class C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11 C12 C13 C14 C15

ω1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

ω2 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1

ω3 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1

ω4 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1

ω5 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0

Table 1. Exhaustive ECOC for a five-class problem.
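Using the code matrix of Table 1, the minimum-Hamming-distance decoding of Equation 37 can be sketched as follows; the example output vector is the one discussed in the text, and it is indeed decoded as ω_5 (class index 4).

```python
# Minimum-Hamming-distance decoding (Equation 37) with the Table 1 code matrix.
import numpy as np

M = np.array([[1]*15,                                        # w1
              [0]*8 + [1]*7,                                 # w2
              [0,0,0,0,1,1,1,1,0,0,0,0,1,1,1],               # w3
              [0,0,1,1,0,0,1,1,0,0,1,1,0,0,1],               # w4
              [0,1,0,1,0,1,0,1,0,1,0,1,0,1,0]])              # w5

def ecoc_decode(output):
    support = -np.abs(np.asarray(output) - M).sum(axis=1)    # Equation 37
    return int(np.argmax(support)), support

out = [0,1,1,1,0,1,0,1,0,1,0,1,0,1,0]      # example output vector from the text
print(ecoc_decode(out))                     # class index 4 (w5), support -1
```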


5.5. Confidence Estimation
Using an ensemble based system also allows us to quantitatively assess the confidence of a decision, even on an instance-by-instance basis. Loosely speaking, if a vast majority of the classifiers in the ensemble agree on the decision for a given instance, we can interpret this outcome as the ensemble having high confidence in its decision. Conversely, if the final decision is made based on a small difference in classifier opinions, then the ensemble has less confidence in its decision. Muhlbaier et al. showed that the continuous-valued outputs of ensemble classifiers can also be used as an estimate of the posterior probability, which in turn can be interpreted as the confidence in the decision [29]. Specifically, they showed that the posterior probability of ω_j can be estimated through a softmax-type normalization of the classifier outputs:

P(ω_j|x) ≈ X_j(x) = \frac{e^{F_j(x)}}{\sum_{c=1}^{C} e^{F_c(x)}}   (38)

where X_j(x) is the confidence of the ensemble in assigning instance x to class ω_j, and

F_j(x) = \sum_{t=1}^{N} \begin{cases} \log(1/β_t), & h_t(x) = ω_j \\ 0, & \text{otherwise} \end{cases}   (39)

is the total vote, that is, the sum of the voting weights, received by ω_j. More interestingly, on several benchmark datasets with known distributions, they showed that the posterior probability estimate of the ensemble approaches the true posterior probability as the number of classifiers in the ensemble increases.
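A sketch of this confidence estimate, combining the AdaBoost-style voting weights of Equation 39 with the softmax of Equation 38, might look as follows; the β values in the example are hypothetical.

```python
# Sketch of the ensemble confidence estimate (Equations 38-39): sum the
# log(1/beta_t) weights of the classifiers choosing each class, then softmax.
import numpy as np

def ensemble_confidence(chosen_classes, betas, n_classes):
    F = np.zeros(n_classes)
    for cls, beta in zip(chosen_classes, betas):
        F[cls] += np.log(1.0 / beta)          # Equation 39
    e = np.exp(F - F.max())                   # Equation 38 (softmax)
    return e / e.sum()

# Three classifiers with (hypothetical) beta values voting for classes 0, 0, 1.
print(ensemble_confidence([0, 0, 1], betas=[0.2, 0.3, 0.4], n_classes=2))
```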

5.6. Other Areas
There are other areas in which ensemble systems have been used or proposed, and the list is continuously growing at what appears to be a healthy rate. Some of the more promising directions include using ensemble systems in non-stationary environments [19] and in clustering applications [94]–[96], along with a wealth of theoretical analyses of such systems, for example on incremental learning, improving diversity, and bias-variance analysis. Specific practical applications of ensemble systems, whether used for biomedical, financial, remote sensing, genomic, oceanographic, chemical, or other types of data analysis problems, are also rapidly growing.

6. Conclusions

Over the last decade, ensemble based systems have enjoyed growing attention and popularity due to their many desirable properties and the broad spectrum of applications that can benefit from them. In this paper, we discussed the fundamental aspects of these ensemble systems, including the need to ensure, and the ways to

measure, diversity in the ensemble; ensemble generation techniques such as bagging, AdaBoost, and mixture of experts; and classifier combination strategies, such as algebraic combiners, voting methods, and decision templates. We have also reviewed some of the more popular areas where ensemble systems are naturally useful, such as incremental learning, data fusion, feature selection, and error correcting output codes.

Whereas there is no single ensemble generation algorithm or combination rule that is universally better than the others, all of the approaches discussed above have been shown to be effective on a wide range of real world and benchmark datasets, provided that the classifiers can be made as diverse as possible. In the absence of any other prior information, the best approaches are usually, as William of Ockham observed about eight centuries ago, the simplest and least complicated ones that can learn the underlying data distribution.

As for the quiz show with the large payout, where you had to choose from the three lifelines, let us focus our attention once again on Equation 18:

■ if the individual audience members vote independently (a reasonable assumption); and
■ there is a reasonably large audience (another reasonable assumption); and
■ each member has a probability of 0.5 or higher of getting the correct answer (sufficient, but not necessary, for more than two choices),

then the probability that the majority vote of the audience (the ensemble) gives the correct answer approaches 1. So, unless you know for a fact that your friend (for the “call a friend” option) has a very high probability of knowing the correct answer (prior information that warrants a change in your course of action), you are better off with the audience. Just in case you find yourself in that fortunate situation . . . .

Software

The following software tools include Matlab based functions that can be used for ensemble learning. Many other tools can also be found on the Internet.

■ C. Merkwinh and J. Wichard, “ENTOOL—A Matlab Toolbox for Ensemble Modeling,” 2002. Available online at: http://chopin.zet.agh.edu.pl/~wichtel

■ R.P.W. Duin, D. Ridder, P. Juszczak, P. Paclik, E. Pekalska, and D. Tax, “PRTools,” 2005. Available online at: http://www.prtools.org

■ D. Stork and E. Yom-Tov, “Computer Manual in Matlab to Accompany Pattern Classification,” 2nd Edition, New York: Wiley, 2004.

Acknowledgment

This material is in part based upon work supported by the National Science Foundation under Grant No. ECS 0239090.


References[1] B.V. Dasarathy and B.V. Sheela, “Composite classifier system design:Concepts and methodology,” Proceedings of the IEEE, vol. 67, no. 5, pp. 708–713, 1979.[2] L.K. Hansen and P. Salamon, “Neural network ensembles,” IEEE Trans-actions on Pattern Analysis and Machine Intelligence, vol. 12, no. 10, pp.993–1001, 1990.[3] R.E. Schapire, “The strength of weak learnability,” Machine Learning,vol. 5, no. 2, pp. 197–227, 1990.[4] R.A. Jacobs, M.I. Jordan, S.J. Nowlan, and G.E. Hinton, “Adaptive mix-tures of local experts,” Neural Computation, vol. 3, pp. 79–87, 1991.[5] M.J. Jordan and R.A. Jacobs, “Hierarchical mixtures of experts andthe EM algorithm,” Neural Computation, vol. 6, no. 2, pp. 181–214, 1994.[6] D.H. Wolpert, “Stacked generalization,” Neural Networks, vol. 5, no. 2,pp. 241–259, 1992.[7] J.A. Benediktsson and P.H. Swain, “Consensus theoretic classificationmethods,” IEEE Trans. on Systems, Man and Cybernetics, vol. 22, no. 4, pp.688–704, 1992.[8] L. Xu, A. Krzyzak, and C.Y. Suen, “Methods of combining multipleclassifiers and their applications to handwriting recognition,” IEEETransactions on Systems, Man and Cybernetics, vol. 22, no. 3, pp. 418–435,1992.[9] T.K. Ho, J.J. Hull, and S.N. Srihari, “Decision combination in multipleclassifier systems,” IEEE Trans. on Pattern Analy. Machine Intel., vol. 16, no. 1, pp. 66–75, 1994.[10] G. Rogova, “Combining the results of several neural network classi-fiers,” Neural Networks, vol. 7, no. 5, pp. 777–781, 1994.[11] L. Lam and C.Y. Suen, “Optimal combinations of pattern classifiers,”Pattern Recognition Letters, vol. 16, no. 9, pp. 945–954, 1995.[12] K. Woods, W.P.J. Kegelmeyer, and K. Bowyer, “Combination of multi-ple classifiers using local accuracy estimates,” IEEE Transactions on Pat-tern Analysis and Machine Intelligence, vol. 19, no. 4, pp. 405–410, 1997.[13] L.I. Kuncheva, “Change-glasses’ approach in pattern recognition,”Pattern Recognition Letters, vol. 14, no. 8, pp. 619–623, 1993.[14] I. Bloch, “Information combination operators for data fusion: A com-parative review with classification,” IEEE Transactions on Systems, Man,and Cybernetics Part A: Systems and Humans., vol. 26, no. 1, pp. 52–67, 1996.[15] S.B. Cho and J.H. Kim, “Combining multiple neural networks byfuzzy integral for robust classification,” IEEE Transactions on Systems,Man and Cybernetics, vol. 25, no. 2, pp. 380–384, 1995.[16] L.I. Kuncheva, J.C. Bezdek, and R. Duin, “Decision templates for mul-tiple classifier fusion: An experimental comparison,” Pattern Recognition,vol. 34, no. 2, pp. 299–314, 2001.[17] H. Drucker, C. Cortes, L.D. Jackel, Y. LeCun, and V. Vapnik, “Boostingand other ensemble methods,” Neural Computation, vol. 6, no. 6, pp. 1289–1301, 1994.[18] R. Battiti and A.M. Colla, “Democracy in neural nets: Voting schemesfor classification,” Neural Networks, vol. 7, no. 4, pp. 691–707, 1994.[19] L.I. Kuncheva, “Classifier ensembles for changing environments,”5th Int. Workshop on Multiple Classifier Systems, Lecture Notes in Com-puter Science, F. Roli, J. Kittler, and T. Windeatt, Eds., vol. 3077, pp. 1–15, 2004.[20] F. Smieja, “Pandemonium system of reflective agents,” IEEE Trans-actions on Neural Networks, vol. 7, no. 1, pp. 97–106, 1996.[21] L.I. Kuncheva, Combining Pattern Classifiers, Methods and Algorithms.New York, NY: Wiley Interscience, 2005.[22] E. Alpaydin and M.I. Jordan, “Local linear perceptrons for classifi-cation,” IEEE Transactions on Neural Networks, vol. 7, no. 3, pp. 788–792,1996.[23] F. Roli, G. Giacinto, and G. 
Vernazza, “Methods for designing multi-ple classifier systems,” 2nd Int. Workshop on Multiple Classifier Sys-tems, in Lecture Notes in Computer Science, J. Kittler and F. Roli, Eds., vol.2096, pp. 78–87, 2001.[24] G. Giacinto and F. Roli, “Approach to the automatic design of multi-ple classifier systems,” Pattern Recognition Letters, vol. 22, no. 1, pp. 25–33, 2001.[25] L. Breiman, “Bagging predictors,” Machine Learning, vol. 24, no. 2, pp. 123–140, 1996.[26] Y. Freund and R.E. Schapire, “Decision-theoretic generalization ofon-line learning and an application to boosting,” Journal of Computer andSystem Sciences, vol. 55, no. 1, pp. 119–139, 1997.

[27] F.M. Alkoot and J. Kittler, “Experimental evaluation of expert fusionstrategies,” Pattern Recognition Letters, vol. 20, no. 11–13, pp. 1361–1369, Nov. 1999.[28] J. Kittler, M. Hatef, R.P.W. Duin, and J. Mates, “On combining classi-fiers,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 20, no. 3, pp. 226–239, 1998.[29] M. Muhlbaier, A. Topalis, and R. Polikar, “Ensemble confidence esti-mates posterior probability,” 6th Int. Workshop on Multiple ClassifierSys., in Lecture Notes in Comp. Science, N. Oza, R. Polikar, J. Kittler, and F.Roli, Eds., vol. 3541, pp. 326–335, 2005.[30] J. Grim, J. Kittler, P. Pudil, and P. Somol, “Information analysis of mul-tiple classifier fusion,” 2nd Int. Workshop on Multiple Classifier Systems, inLecture Notes in Computer Science, J. Kittler and F. Roli, Eds., vol. 2096,pp. 168–177, 2001.[31] L.I. Kuncheva, “A theoretical study on six classifier fusion strate-gies,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.24, no. 2, pp. 281–286, 2002.[32] S.B. Cho and J.H. Kim, “Multiple network fusion using fuzzy logic,”IEEE Transactions on Neural Networks, vol. 6, no. 2, pp. 497–501, 1995.[33] M. Grabisch and J.M. Nicolas, “Classification by fuzzy integral: per-formance and tests,” Fuzzy Sets and Systems, vol. 65, no. 2–3, pp. 255–271, 1994.[34] Y. Lu, “Knowledge integration in a multiple classifier system,”Applied Intelligence, vol. 6, no. 2, pp. 75–86, 1996.[35] L.I. Kuncheva, “Using measures of similarity and inclusion for mul-tiple classifier fusion by decision templates,” Fuzzy Sets and Systems, vol.122, no. 3, pp. 401–407, 2001.[36] L.I. Kuncheva, “Switching between selection and fusion in combin-ing classifiers: An experiment,” IEEE Transactions on Systems, Man, andCybernetics, Part B: Cybernetics, vol. 32, no. 2, pp. 146–156, 2002.[37] D. Tax, M. van Breukelen, R.P.W. Duin, and J. Kittler, “Combining mul-tiple classifiers by averaging or by multiplying?,” Pattern Rec., vol. 33, no.9, pp. 1475–1485, 2000.[38] L.I. Kuncheva and C.J. Whitaker, “Feature Subsets for Classifier Com-bination: An Enumerative Experiment,” 2nd Int. Workshop on MultipleClassifier Systems, in Lecture Notes in Comp. Science, J. Kittler and F. Roli, Eds., vol. 2096, pp. 228–237, 2001.[39] N.C. Oza and K. Tumer, “Input decimation ensembles: Decorrelationthrough dimensionality reduction,” 2nd Int. Workshop on Multiple Clas-sifier Systems, in Lecture Notes in Computer Science, J. Kittler and F. Roli,Eds., vol. 2096, pp. 238–247, 2001.[40] N.S.V. Rao, “On fusers that perform better than best sensor,” IEEETransactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 8,pp. 904–909, 2001.[41] K. Tumer and J. Ghosh, “Analysis of decision boundaries in linearlycombined neural classifiers,” Pattern Recognition, vol. 29, no. 2, pp.341–348, 1996.[42] G. Fumera and F. Roli, “A Theoretical and Experimental Analysis ofLinear Combiners for Multiple Classifier Systems,” IEEE Transactions onPattern Analysis and Machine Intelligence, vol. 27, no. 6, pp. 942–956, 2005.[43] Various Authors, “Proceedings of International Workshop on Multi-ple Classifier Systems (2000–2005),” F. Roli, J. Kittler, T. Windeatt, N. C.Oza, and R. Polikar, Eds. Berlin, Germany: Springer, 2005.[44] T.K. Ho, “Random subspace method for constructing decisionforests,” IEEE Transactions on Pattern Analysis and Machine Intelligence,vol. 20, no. 8, pp. 832–844, 1998.[45] L. Kuncheva and C. 
Whitaker, “Measures of diversity in classifierensembles and their relationship with the ensemble accuracy,” MachineLearning, vol. 51, no. 2, pp. 181–207, 2003.[46] G. Brown, “Diversity in neural network ensembles,” Ph.D. disserta-tion, University of Birmingham, 2004.[47] D.H. Wolpert and W.G. Macready, “No free lunch theorems for opti-mization,” IEEE Transactions on Evolutionary Computation, vol. 1, no. 1, pp. 67–82, 1997.[48] B. Rosen , “Ensemble learning using decorrelated neural networks,”Connection Science, vol. 8, no. 3, pp. 373–384, 1996.[49] L. Breiman, “Random Forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, Oct. 2001.[50] L. Breiman, “Pasting small votes for classifcation in large databasesand on-line,” Machine Learning, vol. 36, pp. 85–103, 1999.


[51] N.V. Chawla, L.O. Hall, K.W. Bowyer, J. Moore, and W.P. Kegelmeyer,“Distributed pasting of small votes,” 3rd Int. Workshop on Multiple Clas-sifier Systems, in Lecture Notes in Computer Science, F. Roli and J. Kittler, Eds., vol. 2364, p. 52–61, Springer, 2002.[52] R.E. Schapire, Y. Freund, P. Bartlett, and W.S. Lee, “Boosting theMargin: A New Explanation for the Effectiveness of Voting Methods,”Annals of Statistics, vol. 26, no. 5, pp. 1651–1686, 1998.[53] N.C. Oza, “Boosting with Averaged Weight Vectors,” 4th Int. Work-shop on Multiple Classifier Systems, in Lecture Notes in Computer Sci-ence, T. Windeatt and F. Roli, Eds., vol. 2709, pp. 15–24, Springer, 2003.[54] N.C. Oza, “AveBoost2: Boosting for Noisy Data,” 5th Int. Work-shop on Multiple Classifier Systems, in Lecture Notes in Computer Sci-ence, vol. 3077, pp. 31–40, F. Roli, J. Kittler, and T. Windeatt, Eds.Springer, 2004.[55] R.Polikar, L. Udpa, S.S. Udpa, and V. Honavar, “Learn++: An incre-mental learning algorithm for supervised neural networks,” IEEE Trans-actions on Systems, Man and Cybernetics Part C: Applications and Reviews,vol. 31, no. 4, pp. 497–508, 2001.[56] M.I. Jordan and L. Xu, “Convergence results for the EM approach tomixtures of experts architectures,” Neural Networks, vol. 8, no. 9, pp. 1409–1431, 1995.[57] P.J. Boland, “Majority system and the Condorcet jury theorem,” Statistician, vol. 38, no. 3, pp. 181–189, 1989.[58] L. Shapley and B. Grofman, “Optimizing group judgmental accuracyin the presence of interdependencies,” Public Choice (Historical Archive),vol. 43, no. 3, pp. 329–343, 1984.[59] N. Littlestone and M. Warmuth, “Weighted majority algorithm,” Infor-mation and Computation, vol. 108, pp. 212–261, 1994.[60] Y.S. Huang and C.Y. Suen, “Behavior-knowledge space method forcombination of multiple classifiers,” Proc. of IEEE Computer Vision andPattern Recog., pp. 347–352, 1993.[61] Y.S. Huang and C.Y. Suen, “A method of combining multiple expertsfor the recognition of unconstrained handwritten numerals,” IEEE Trans-actions on Pattern Analysis and Machine Intelligence, vol. 17, no. 1, pp. 90–94, 1995.[62] H. Ichihashi, T. Shirai, K. Nagasaka, and T. Miyoshi, “Neuro-fuzzy ID3:a method of inducing fuzzy decision trees with linear programming formaximizing entropy and an algebraic method for incremental learning,”Fuzzy Sets and Systems, vol. 81, no. 1, pp. 157–167, 1996.[63] R.O. Duda, P.E. Hart, and D. Stork, “Algorithm Independent Tech-niques,” in Pattern Classification, 2 ed New York: Wiley, pp. 453–516,2001.[64] A.P. Dempster, “Upper and lower probabilities induced by multival-ued mappings,” Annals of Mathematical Statistics, vol. 38, no. 2, pp. 325–339, 1967.[65] G. Shafer, A Mathematical Theory of Evidence. Princeton, NJ: Prince-ton Univ. Press, 1976.[66] T.G. Dietterich, “Ensemble methods in machine learning,” 1st Int.Workshop on Multiple Classifier Systems, in Lecture Notes in ComputerSceince, J. Kitler and F. Roli, Eds., vol. 1857, pp. 1–15, 2000.[67] T.G. Dietterich, “Experimental comparison of three methods for con-structing ensembles of decision trees: bagging, boosting, and random-ization,” Machine Learning, vol. 40, no. 2, pp. 139–157, 2000.[68] L. Breiman, “Arcing classifiers,” Annals of Statistics, vol. 26, no. 3, pp. 801–849, 1998.[69] E. Bauer and R. Kohavi, “An empiricial comparison of voting classi-fication algorithms: bagging, boosting and variants,” Machine Learning,vol. 36, pp. 105–142, 1999.[70] J.R. Quinlan, “Bagging, boosting and C4.5, 13th Int. Conf. on ArtificialIntelligence, pp. 
725–730, 1996.[71] R.A. Jacobs, “Methods for combining experts’ probability assess-ments,” Neural Computation, vol. 7, no. 5, p. 867, 1995.[72] R. Duin and D. Tax, “Experiments with Classifier Combining Rules,”1st Int. Workshop on Multiple Classifier Systems, in Lecture Notes in Com-puter Science, J. Kittler and F. Roli, Eds., vol. 1857, pp. 16–29, Springer,2000.[73] J. Ghosh, “Multiclassifier Systems: Back to the Future,” 3rd Int.Workshop on Multiple Classifier Systems, in Lecture Notes in ComputerScience, F. Roli and J. Kittler, Eds., vol. 2364, pp. 1–15, 2002.[74] R. French, “Catastrophic forgetting in connectionist networks: caus-es, consequences and solutions,” Trends in Cognitive Sciences, vol. 3, no. 4, pp. 128–135, 1999.[75] M. Muhlbaier, A. Topalis, and R. Polikar, “Learn++.MT: A NewApproach to Incremental Learning,” 5th Int. Workshop on Multiple Clas-

sifier Systems, in Lecture Notes in Computer Sceince, F. Roli, J. Kittler, andT. Windeatt, Eds., vol. 3077, pp. 52–61, Springer, 2004.[76] R. Polikar, L. Udpa, S. Udpa, and V. Honavar, “An incremental learn-ing algorithm with confidence estimation for automated identification ofNDE signals,” IEEE Transactions on Ultrasonics, Ferroelectrics and Fre-quency Control, vol. 51, no. 8, pp. 990–1001, 2004.[77] M. Muhlbaier, A. Topalis, and R. Polikar, “Incremental learning fromunbalanced data,” IEEE Int. Joint Conf. on Neural Networks vol. 2, pp. 1057–1062, Budapest, Hungary, 2004.[78] H. Altincay and M. Demirekler, “Speaker identification by combiningmultiple classifiers using Dempster-Shafer theory of evidence,” SpeechCommunication, vol. 41, no. 4, pp. 531–547, 2003.[79] Y. Bi, D. Bell, H. Wang, G. Guo, and K. Greer, “Combining multipleclassifiers using dempster’s rule of combination for text categorization,”in Lecture Notes in Artificial Intelligence, vol. 3131, pp. 127–138,Barcelona, Spain: Springer, 2004.[80] T. Denoeux, “Neural network classifier based on Dempster-Shafertheory,” IEEE Trans. on Sys, Man, and Cyber. A: Sys. and Humans., vol. 30,no. 2, pp. 131–150, 2000.[81] G.A. Carpenter, S. Martens, and O.J. Ogas, “Self-organizing informa-tion fusion and hierarchical knowledge discovery: a new frameworkusing ARTMAP neural networks,” Neural Networks, vol. 18, no. 3, pp. 287–295, Apr. 2005.[82] D. Parikh, M.T. Kim, J. Oagaro, S. Mandayam, and R. Polikar, “Com-bining classifiers for multisensor data fusion,” IEEE Int. Conf. on Systems,Man and Cybernetics, vol. 2, pp. 1232–1237, The Hague, Netherlands,2004.[83] B.F. Buxton, W.B. Langdon, and S.J. Barrett, “Data fusion by intelli-gent classifier combination,” Measurement and Control, vol. 34, no. 8, pp. 229–234, 2001.[84] G.J. Briem, J.A. Benediktsson, and J.R. Sveinsson, “Use of multipleclassifiers in classification of data from multiple data sources,” in International Geoscience and Remote Sensing Symposium (IGARSS),vol. 2, pp. 882–884, Sydney, 2001.[85] W. Fan, M. Gordon, and P. Pathak, “On linear mixture of expertapproaches to information retrieval,” Decision Support Systems, vol. InPress, Corrected Proof.[86] S. Jianbo, W. Jun, and X. Yugeng, “Incremental learning with bal-anced update on receptive fields for multi-sensor data fusion,” IEEETransactions on Systems, Man and Cybernetics, Part B, vol. 34, no. 1, pp. 659–665, 2004.[87] S. Gunter and H. Bunke, “An evaluation of ensemble methods inhandwritten word recognition based on feature selection,” Int. Conf. onPattern Recog., vol. 1, pp. 388–392, 2004.[88] S. Krause and R. Polikar, “An Ensemble of Classifiers Approach forthe Missing Feature Problem,” in Int. Joint Conf on Neural Networks,vol. 1, pp. 553–556, Portland, OR, 2003.[89] P.M. Long and V.B. Vega, “Boosting and Microarray Data,” MachineLearning, vol. 52, no. 1–2, pp. 31–44, July 2003.[90] E. Allwein, R.E. Schapire, and Y. Singer, “Reducing Multiclass to Bina-ry: A Unifying Approach for Margin Classifiers,” Journal of MachineLearning Research, vol. 1, pp. 113–141, 2000.[91] T.G. Dietterich and G. Bakiri, “Solving Multiclass Learning Problemsvia Error-Correcting Output Codes,” Journal of Artificial Intel. Research,vol. 2, pp. 263–286, 1995.[92] S. Rajan and J. Ghosh, “An empirical comparison of hierarchical vs.two-level approaches to multiclass problems,” in Lecture Notes in Com-puter Science, vol. 3077, pp. 283–292, F. Roli, J. Kittler, and T. Windeatt,Eds. Springer, 2004.[93] R.S. Smith and T. 
Windeatt, “Decoding Rules for Error CorrectingOutput Code Ensembles,” 6th Int. Workshop on Multiple Classifier Sys.,Lecture Notes in Computer Science, N. C. Oza, R. Polikar, F. Roli, and J. Kit-tler, Eds., vol. 3541, pp. 53–63, Springer, 2005.[94] H.G. Ayad and M.S. Kamel, “Cluster-Based Cumulative Ensembles,”6th Int. Workshop on Multiple Classifier Sys., Lecture Notes in ComputerScience, N. C. Oza, R. Polikar, F. Roli, and J. Kittler, Eds., vol. 3541, pp.236–245, Springer, 2005.[95] S. Monti, P. Tamayo, J. Mesirov, and T. Golub, “Consensus Cluster-ing: A Resampling-Based Method for Class Discovery and Visualizationof Gene Expression Microarray Data,” Machine Learning, vol. 52, no. 1–2, pp. 91–118, July 2003.[96] A. Strehl and J. Ghosh, “Cluster Ensembles-A Knowledge ReuseFramework for Combining Multiple Partitions,” Journal of Machine Learn-ing Research, vol. 3, pp. 583–617, 2002.


Robi Polikar received his B.S. degree in electronics and communications engineering from Istanbul Technical University, Istanbul, Turkey, in 1993, and his M.S. and Ph.D. degrees, both co-majors in biomedical engineering and electrical engineering, from Iowa State University, Ames, Iowa, in 1995 and in 2000, respectively. He is currently an Assistant Professor with the Department of Electrical and Computer Engineering at Rowan University, Glassboro, NJ. His current research interests include signal processing, pattern recognition, neural systems, machine learning and computational models of learning, with applications to biomedical engineering and imaging; chemical sensing; and nondestructive evaluation and testing. He also teaches upper-level undergraduate and graduate courses in wavelet theory, pattern recognition, neural networks, and biomedical systems and devices at Rowan University. He is a member of IEEE, ASEE, Tau Beta Pi, and Eta Kappa Nu. His current work is funded primarily through NSF’s CAREER program and NIH’s Collaborative Research in Computational Neuroscience program.
