


An extended one-versus-rest support vector machine for multi-label classification

Jianhua Xu

School of Computer Science and Technology, Nanjing Normal University, Nanjing 210097, China

Article info

Article history:

Received 14 December 2010

Received in revised form 15 March 2011

Accepted 23 April 2011

Communicated by J. Kwok

Available online 23 May 2011

Keywords:

Support vector machine

Multi-label classification

Data decomposition

Algorithm extension

Hybrid strategy

One-versus-rest


Abstract

Hybrid strategy, which generalizes a specific single-label algorithm while one or two data decomposition tricks are applied implicitly or explicitly, has become an effective and efficient tool to design and implement various multi-label classification algorithms. In this paper, we extend the traditional binary support vector machine by introducing an approximate ranking loss as its empirical loss term to build a novel support vector machine for multi-label classification, resulting in a quadratic programming problem with different upper bounds of variables to characterize the label correlation of individual instances. Further, our optimization problem can be solved by combining the one-versus-rest data decomposition trick with a modified binary support vector machine, which dramatically reduces the computational cost. An experimental study on ten multi-label data sets illustrates that our method is a powerful candidate for multi-label classification, compared with four state-of-the-art multi-label classification approaches.

© 2011 Elsevier B.V. All rights reserved.

1. Introduction

Multi-label classification is a particular learning task where a single instance can belong to several classes at the same time, and thus the classes are not mutually exclusive. Recently, it has received more attention than before because of many real-world applications, e.g., text categorization [16,26,27,40], scene classification [2,15,41], bioinformatics [1,21], and music and speech categorization [17,29]. Nowadays, there are three strategies to design and implement various discriminative multi-label classification methods: data decomposition, algorithm extension, and hybrid strategies. Further, label correlation, i.e., label co-occurrence information, has been exploited at three levels: individual instance, partial instances, and different labels.

Data decomposition strategy divides a multi-label data set into either one or more single-label (single, binary, or multi-class) subsets, constructs a sub-classifier for each subset using an existing classification technique, and then assembles all sub-classifiers into an entire multi-label classifier. There are four widely used decomposition tricks: one-versus-rest (OVR), one-versus-one (OVO), one-by-one (OBO), and label powerset (LP) [3,30,32,39]. It is convenient and fast to implement a data decomposition multi-label method since lots of existing classification techniques and their software can be utilized. This strategy reflects label correlation of individual instance through exploiting multi-label instances repeatedly in OVR, OVO, and OBO methods implicitly, and label correlations of partial instances via considering possible label combinations in LP methods directly.

Algorithm extension strategy generalizes a specific multi-class classification algorithm to consider all training instances and all classes of a multi-label training data set at once. This strategy could induce some complicated optimization problems, e.g., large-scale quadratic programming in multi-label support vector machine (Rank-SVM) [10] and unconstrained optimization in multi-label neural networks (BP-MLL) [40]. But such two methods explicitly characterize label correlation of individual instance using an approximate expression of ranking loss, and further reflect label correlation of different labels using a threshold function from linear regression.

Hybrid strategy aims to integrate the merits of the above two strategies. It needs to modify or extend an existing single-label method while a multi-label data set is divided into a series of subsets implicitly or explicitly. This strategy has been used to design and implement several efficient and effective multi-label classifiers, e.g., two kNN-based multi-label approaches (ML-kNN and IBLR-ML) [4,41], which introduce posterior probability estimation for each label independently to extend kNN, after the OVR decomposition trick is applied implicitly. Furthermore, IBLR-ML captures label correlation of different labels via linking its posterior probability of each label with distance-weighted sums of k neighbor instance labels from all classes. But how to find out a proper way to characterize label correlation of individual instance, partial instances, and even different labels in extending a specific method is still a challenging issue for such a hybrid strategy.

Binary support vector machine [36] is one of the most powerful machine learning algorithms of the past 15 years. For multi-label classification, one-versus-rest support vector machine has been successfully used in many real-world applications [2,16], which indirectly reflects label correlation of individual instance through reusing multi-label instances. In this paper, our focus is on incorporating label correlation of individual instance into one-versus-rest multi-label support vector machine explicitly. We define a new empirical loss term through approximating ranking loss from above and then generalize traditional binary support vector machine to design a novel support vector machine for multi-label classification. In our quadratic programming problem, the upper bounds of variables are associated with the number of relevant or irrelevant labels of training instances, which characterizes label correlation of individual instance directly. Particularly, our optimization problem can be solved via combining the OVR decomposition trick and modified binary support vector machine, which reduces computational complexity greatly. Experimental results demonstrate that our method is a competitive candidate for multi-label classification, compared with four existing techniques.

The rest of this paper is organized as follows. Multi-label classification setting is introduced in Section 2 and previous work is reviewed in Section 3. Then our novel method is described in Section 4. Section 5 is devoted to experiments with ten benchmark data sets. This paper ends with some conclusions in Section 6.

2. Multi-label classification setting

Let $X \subseteq \mathbb{R}^d$ be a d-dimensional input space and $Q = \{1, 2, \ldots, q\}$ a finite set of class labels, where q is the number of class labels. Further, assume that each instance $\mathbf{x} \in X$ can be associated with a subset of labels $L \in 2^Q$, which is referred to as the relevant set of labels for $\mathbf{x}$. At the same time, the complement of L, i.e., $\bar{L} = Q \setminus L$, is called the irrelevant set of labels. Given a training data set of size l drawn identically and independently from an unknown probability distribution on $X \times 2^Q$, i.e.

$$ \{(\mathbf{x}_1, L_1), \ldots, (\mathbf{x}_i, L_i), \ldots, (\mathbf{x}_l, L_l)\}, \qquad (1) $$

the multi-label classification problem is to learn a classifier $f(\mathbf{x}): X \to 2^Q$ that generalizes well on both these training instances and unseen ones in the sense of optimizing some expected risk functional with respect to a specific empirical loss function [4,10,40].

In many traditional q-class single-label classification methods, a widely used trick is to learn q decision functions $f_i(\mathbf{x}): X \to \mathbb{R}$, $i = 1, \ldots, q$, such that $f_k(\mathbf{x}) > f_i(\mathbf{x})$, $i \ne k$, if $\mathbf{x} \in$ class k [9]. For multi-label classification, this idea is adapted as $f_k(\mathbf{x}) > f_i(\mathbf{x})$, $k \in L$, $i \in \bar{L}$, which means that the relevant labels should be ranked higher than the irrelevant labels [10]. In this case, multi-label prediction can be done through a proper threshold t:

$$ f(\mathbf{x}) = \{k : f_k(\mathbf{x}) \ge t,\; k = 1, \ldots, q\}. \qquad (2) $$

There are now mainly two kinds of thresholds: a constant threshold (e.g., 0.0 for OVR methods) and a threshold function determined by an additional learning procedure (e.g., linear regression) [10,39,40].
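As a concrete illustration of the thresholding rule in Eq. (2), the following minimal sketch (not part of the original paper) turns q decision function values into a predicted label set; the decision values and the threshold below are hypothetical.

```python
import numpy as np

def predict_labels(decision_values, t=0.0):
    """Return the set of labels whose decision value reaches the threshold t (Eq. (2))."""
    # decision_values: array of shape (q,) holding f_1(x), ..., f_q(x)
    return {k + 1 for k, f_k in enumerate(decision_values) if f_k >= t}

# Hypothetical decision values for a 5-class problem: labels 1 and 4 are predicted.
f = np.array([0.8, -1.2, -0.3, 0.5, -2.0])
print(predict_labels(f, t=0.0))   # {1, 4}
```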

3. Previous work

In the past several years, since multi-label classification has received a lot of attention in machine learning, pattern recognition and statistics, a variety of methods have been proposed. In this paper, according to the three strategies mentioned in the introduction, we categorize existing discriminative multi-label methods into three groups: data decomposition, algorithm extension and hybrid methods. Note that in Refs. [3,30,32,33], our first group is referred to as problem transformation or algorithm independent methods, while the last two roughly as algorithm adaptation or algorithm dependent methods.

3.1. Data decomposition methods

Data decomposition methods combine data decomposition tricks with existing single-label classification methods, whose two core techniques are decomposition and integration. Here we briefly review four data decomposition tricks: one-versus-rest (OVR) or binary relevance (BR), one-versus-one (OVO), one-by-one (OBO) and label powerset (LP), and their sub-classifier integrating ways.

The OVR or BR decomposition trick divides a q-class multi-label data set into q binary subsets [2,16], where the ith subset consists of positive instances with the ith label and negative ones with all the other labels. Usually, q sub-classifiers are assembled into an entire multi-label algorithm using the above Eq. (2) with a constant threshold (e.g., 0.0 or 0.5). So far, many OVR multi-label methods have been verified to work well using various binary classifiers, e.g., SVM [2,10,16,30], C4.5 [4,30], naïve Bayes [30], and kNN [4,30].
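To make the OVR/BR trick concrete, here is a minimal sketch in Python. It uses scikit-learn's SVC as a stand-in for the binary classifiers cited above, so the class and parameter names are illustrative assumptions rather than the implementations used in the cited works.

```python
import numpy as np
from sklearn.svm import SVC

def train_ovr(X, Y, **svc_params):
    """One-versus-rest (binary relevance): one binary classifier per label.

    X: (l, d) feature matrix; Y: (l, q) label matrix with entries in {-1, +1}.
    """
    return [SVC(**svc_params).fit(X, Y[:, k]) for k in range(Y.shape[1])]

def predict_ovr(models, X, t=0.0):
    """Assemble the q sub-classifiers with a constant threshold t, as in Eq. (2)."""
    F = np.column_stack([m.decision_function(X) for m in models])  # (n, q)
    return F >= t  # boolean label-indicator matrix

# Toy usage with random data (20 instances, 4 features, 3 labels).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))
Y = np.where(rng.random((20, 3)) > 0.5, 1, -1)
Y[0, :], Y[1, :] = 1, -1   # make sure both classes appear in every column
models = train_ovr(X, Y, kernel="rbf", gamma=0.5, C=1.0)
print(predict_ovr(models, X[:2]))
```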

The OVO decomposition trick splits a q-class multi-label data set into q(q−1)/2 binary subsets in a pairwise way, where the subset ij only involves instances with the ith and jth labels (i < j). Note that for some subsets there is a mixed class, whose instances belong to both positive and negative classes simultaneously. In Refs. [2,22], this mixed class is merged into the positive or negative one twice respectively, and then two binary SVM classifiers are trained. For a test instance, it is needed to count its votes for each class from q(q−1) binary classifiers and detect its relevant labels using a proper constant threshold or threshold function. For label ranking with pairwise comparison (RPC) as a variant of the OVO trick, the mixed class is discarded [14,20]. Further, a calibrated label is estimated by OVR methods, whose votes are used as a threshold to isolate relevant labels [12,20].

The OBO decomposition trick parses a q-class multi-label data set into q single-class subsets in which the ith subset consists of instances with the ith label only [39]. In this case, support vector data description (SVDD) [28] is used to build single-class classifiers. To characterize label correlation of different labels, linear ridge regression is utilized to integrate a multi-label classifier. This OBO multi-label algorithm has the lowest computational complexity among all existing kernel-based methods.

The LP decomposition trick considers each possible label combination of more than one class in a multi-label training set as a new single class, and then constructs a standard multi-class training set [2,30]. This trick could produce a large number of new classes, many of which consist of very few instances. In Ref. [23], some label combinations including very few instances are pruned and replaced by disjoint label combinations occurring more times. In Refs. [24,31,34], two LP-based ensemble classifiers were proposed through utilizing a small subset of labels and a large subset of instances.

Data decomposition methods are popular since they are easy to implement using various existing methods and their corresponding software, and they perform well on a number of benchmark data sets in the literature.

3.2. Algorithm extension methods

Algorithm extension methods generalize existing multi-class classifiers to consider all instances and all classes in a multi-label data set at once. Here we mainly review decision tree and boosting methods, support vector machine, and neural networks, according to their basic principles.

Fig. 1. One-versus-rest multi-label support vector machine.

Through modifying the formula of entropy calculation and permitting multiple labels at the leaves of the tree, a C4.5-like multi-label classification algorithm was proposed in Ref. [5], in which it is possible to generate a large number of leaves for all combinations of different labels, just like LP methods. BoosTexter [26] was derived from the well-known AdaBoost algorithm, and includes two slightly different versions: AdaBoost.MH and AdaBoost.MR. The former is to predict the relevant set of labels of an instance, while the latter is to rank the labels of an instance in descending order. Adtboost.MH was constructed by integrating alternating decision trees with AdaBoost.MH in Ref. [6].

Multi-label support vector machine (Rank-SVM) was proposed [10] via generalizing multi-class SVM [36] and accepting an approximate ranking loss as its empirical loss term. In Ref. [15], a virtual label was added to find a natural zero to determine relevant labels. Although Rank-SVM and its improved version are solved by the linearized Frank–Wolfe method, their training procedures are still computationally expensive. In essence, traditional back-propagation (BP) neural networks can deal with multi-label problems directly by using multiple output nodes. In BP-MLL [40], to improve BP's performance, a new empirical loss function was induced from ranking loss to reflect label correlation of individual instance. Learning a threshold within the BP training procedure simultaneously was considered by doubling the number of nodes of the output layer in Ref. [13].

These algorithm extension methods explicitly exploit as many label correlations as possible. But due to their high computational costs, this group of methods is mainly applied to small-scale data sets now.

3.3. Hybrid methods

Hybrid methods modify or extend specific single-label classification algorithms while one or two data decomposition tricks are utilized implicitly or explicitly, in order to integrate the merits of both data decomposition methods and algorithm extension ones.

After the OVR trick is utilized implicitly, the traditional k nearest neighbor (kNN) method has been generalized to construct two different multi-label methods: ML-kNN [41] and IBLR-ML [4]. Both of them include a training procedure combining a leave-one-out (LOO) procedure with kNN, in which ML-kNN estimates the class prior and conditional probabilities of a discrete binary Bayesian rule for each label independently, while IBLR-ML estimates the posterior probability associated with weighted sums of the k nearest neighbor labels of all classes using logistic regression, to capture label correlation of different labels.

The multi-label naive Bayesian method (ML-NB) [43] conducts dimension reduction by cascading principal component analysis (PCA) with genetic algorithm (GA) based feature selection, to search for an optimal subset of transformed features. Subsequently, multi-label classification is solved using the OVR decomposition trick and binary naive Bayesian classifiers with Gaussian density estimation.

For OVR methods, the q sub-classifiers are usually assembled in a parallel way. However, in classifier chain methods [7,25], the q sub-classifiers are linked in a cascade way, and further the outputs of previous sub-classifiers are added to the inputs of the current sub-classifier. To relieve the effect of classifier order, an ensemble method is used to create different random chain orderings.

The OVO decomposition trick could result in a mixed class for some subsets. It seems natural and reasonable to locate this mixed class between the positive and negative classes. Through extending traditional binary SVM, two parallel hyper-planes are used to deal with three-class subsets in Refs. [37,38], while the mixed class is forced to reside in the marginal region of binary SVM in Ref. [18].

Multi-label RBF neural networks (ML-RBF) [42] first cluster the training instances of each label independently, and then detect q weight vectors via solving q sets of linear equations with the same coefficient matrix and q different right-hand side vectors. This means that ML-RBF utilizes the OBO and OVR tricks at the same time.

In short, these hybrid methods not only incorporate as many label correlations as possible in multi-label data sets into the induced approaches but also reduce their computational costs. Currently, hybrid multi-label methods have become a widely used group in real-world applications because of their effectiveness and efficiency.

4. Extended one-versus-rest multi-label support vector machine

In this section, we briefly review the traditional one-versus-rest multi-label support vector machine (OVR-SVM) [2,16,30], and then propose its extended version (simply OVR-ESVM). For convenience, for a training instance $\mathbf{x}_i$, we define a binary label vector $\mathbf{y}_i = [y_{i1}, y_{i2}, \ldots, y_{iq}]^T$, where $y_{ik} = +1$ if $k \in L_i$, otherwise $y_{ik} = -1$.

4.1. Traditional one-versus-rest multi-label support vector machine

The first support vector machine for multi-label classification was implemented using both the OVR decomposition and the binary SVM in Ref. [16]. As mentioned above, the OVR decomposition trick is to divide a q-class multi-label data set into q binary subsets, in which those multi-label instances, as shown by two symbols in Fig. 1, could be involved in different subsets.

For the kth subset, its training instances can be denoted by $\{(\mathbf{x}_1, y_{1k}), \ldots, (\mathbf{x}_i, y_{ik}), \ldots, (\mathbf{x}_l, y_{lk})\}$, $k = 1, \ldots, q$, which is used to train a binary SVM model

$$
\begin{aligned}
\min \quad & \frac{1}{2}\mathbf{w}_k^T\mathbf{w}_k + C\sum_{i=1}^{l}\xi_{ik},\\
\text{s.t.} \quad & y_{ik}\left[\mathbf{w}_k^T\mathbf{x}_i + b_k\right] \ge 1 - \xi_{ik},\\
& \xi_{ik} \ge 0,\; i = 1, \ldots, l \qquad (3)
\end{aligned}
$$

where $\mathbf{w}_k$ and $b_k$ ($k = 1, \ldots, q$) represent the weight vector and bias of the linear decision function $f_k(\mathbf{x}) = \mathbf{w}_k^T\mathbf{x} + b_k$, and C is a pre-defined regularization constant which can control the trade-off between the number of errors and the complexity of the classifier.


The slack variables $\xi_{ik} > 0$ imply misclassified instances in the training procedure. Instead of solving Eq. (3) directly, the following dual form is optimized:

$$
\begin{aligned}
\max \quad & -\frac{1}{2}\sum_{i,j=1}^{l}\alpha_{ik}\alpha_{jk}y_{ik}y_{jk}(\mathbf{x}_i^T\mathbf{x}_j) + \sum_{i=1}^{l}\alpha_{ik},\\
\text{s.t.} \quad & \sum_{i=1}^{l}y_{ik}\alpha_{ik} = 0,\\
& 0 \le \alpha_{ik} \le C. \qquad (4)
\end{aligned}
$$

After solving Eq. (4) for all q subsets, there exist q decision functions

$$ f_k(\mathbf{x}) = \mathbf{w}_k^T\mathbf{x} + b_k, \quad k = 1, \ldots, q \qquad (5) $$

where $\mathbf{w}_k = \sum_{i=1}^{l}\alpha_{ik}y_{ik}\mathbf{x}_i$, and $b_k$ ($k = 1, \ldots, q$) are calculated using the Karush–Kuhn–Tucker (KKT) conditions: $y_{ik}[\mathbf{w}_k^T\mathbf{x}_i + b_k] = 1$ and $0 < \alpha_{ik} < C$. Now we say that some instance $\mathbf{x}$ has several relevant labels whose decision function values are greater than 0:

$$ \text{labels of } \mathbf{x} = \{k : f_k(\mathbf{x}) \ge 0,\; k = 1, \ldots, q\}. \qquad (6) $$

In order to avoid an empty relevant set of labels, the following rule is sometimes added in Ref. [2]:

$$ \text{label of } \mathbf{x} = \{k : f_k(\mathbf{x}) = \max_{j} f_j(\mathbf{x}),\ \text{if } f_j(\mathbf{x}) < 0 \text{ for all } j = 1, \ldots, q\}. \qquad (7) $$

The above two Eqs. (6) and (7) are used to assemble the q sub-classifiers into an entire multi-label classifier in OVR-SVM and in our novel method OVR-ESVM in the next section. Theoretically, it is possible to create an empty set for a test instance. Experimentally, detecting at least one label indeed improves some instance-based and label-based measures for some benchmark data sets with small average labels. In this paper, OVR-SVM is implemented using LibSVM [11,19], in which the binary SVM is iteratively optimized by sequential minimal optimization (SMO), whose worst-case time complexity is O(l^2.3) [19]. Since OVR-SVM deals with each class independently, the label correlation of individual instance is implicitly characterized via reusing each multi-label instance many times, as shown in Fig. 1.
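A small sketch (not from the paper) of the assembling rules (6) and (7): threshold the q decision values at zero and, if the predicted set comes out empty, fall back to the single highest-scoring label. The numeric decision values below are hypothetical.

```python
import numpy as np

def assemble_labels(F, t=0.0):
    """Decision rules (6) and (7): threshold the q decision values, and if no label
    survives, fall back to the single highest-scoring (least negative) label."""
    F = np.asarray(F, dtype=float)
    labels = np.flatnonzero(F >= t) + 1          # Eq. (6), labels numbered 1..q
    if labels.size == 0:                         # Eq. (7): avoid an empty label set
        labels = np.array([int(np.argmax(F)) + 1])
    return set(labels.tolist())

print(assemble_labels([0.3, -0.7, 0.1]))    # {1, 3}
print(assemble_labels([-0.9, -0.2, -0.4]))  # {2}
```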

4.2. Extended one-versus-rest multi-label support vector machine

According to the decision rule (6), it is desirable that, for a training instance, its decision function values of relevant labels and irrelevant ones should be greater and less than 0, respectively. This means that any relevant label should be ranked higher than any irrelevant one. In case such an ideal situation does not hold, the number of mis-ranked label pairs can be counted for each training instance, and then the following ranking loss measure can be defined over the training set (1) [10,33]:

$$ \text{Ranking loss} = \frac{1}{l}\sum_{i=1}^{l}\left(\frac{1}{|L_i|\,|\bar{L}_i|}\,\bigl|\{(m,n) \in L_i \times \bar{L}_i : f_m(\mathbf{x}_i) \le f_n(\mathbf{x}_i)\}\bigr|\right). \qquad (8) $$

Unfortunately, it is difficult to minimize such a loss function directly for lack of differentiability. Now we consider a pair consisting of a relevant label and an irrelevant one of some instance $\mathbf{x}_i$, i.e., $(m,n) \in L_i \times \bar{L}_i$, $y_{im} = +1$ and $y_{in} = -1$. When the constraint form in Eq. (3) is accepted, we have

$$ f_m(\mathbf{x}_i) = \mathbf{w}_m^T\mathbf{x}_i + b_m \ge +1 - \xi_{im}, \qquad f_n(\mathbf{x}_i) = \mathbf{w}_n^T\mathbf{x}_i + b_n \le -1 + \xi_{in}. \qquad (9) $$

According to Eq. (9), there are four possible cases to characterize the relationship between such two labels, as shown in Fig. 2, i.e.

Case 1: $f_m(\mathbf{x}_i) \ge +1$ and $f_n(\mathbf{x}_i) \le -1$; then $\xi_{im} = 0$ and $\xi_{in} = 0$. This is a perfect situation.
Case 2: $f_m(\mathbf{x}_i) \ge +1$ and $f_n(\mathbf{x}_i) > -1$; then $\xi_{im} = 0$ and $\xi_{in} > 0$.
Case 3: $f_m(\mathbf{x}_i) < +1$ and $f_n(\mathbf{x}_i) \le -1$; then $\xi_{im} > 0$ and $\xi_{in} = 0$.
Case 4: $f_m(\mathbf{x}_i) < +1$ and $f_n(\mathbf{x}_i) > -1$; then $\xi_{im} > 0$ and $\xi_{in} > 0$.

Fig. 2. Four relationships of two decision function values for a pair of labels.


In Fig. 2, the vertical axis denotes the decision function value, and the circle and square represent relevant and irrelevant labels, respectively. For case 1, $f_m(\mathbf{x}_i) - f_n(\mathbf{x}_i) \ge 2$ and $\xi_{im} + \xi_{in} = 0$. For cases 2–4, two sub-cases occur: (a) $f_m(\mathbf{x}_i) > f_n(\mathbf{x}_i)$ and $\xi_{im} + \xi_{in} > 0$, and (b) $f_m(\mathbf{x}_i) \le f_n(\mathbf{x}_i)$ and $\xi_{im} + \xi_{in} \ge 2$. In terms of Eq. (8), no ranking loss value is induced by the perfect case 1 and the three sub-cases (a), while a ranking loss value of "1" is indeed incurred by the three sub-cases (b). Therefore, regardless of the case, the ranking loss value is bounded from above by $(\xi_{im} + \xi_{in})/2$, i.e., the former is never greater than the latter. Now we utilize $(\xi_{im} + \xi_{in})/2$ to define an approximate ranking loss over Eq. (1):

$$ \text{Approximate ranking loss} = \frac{1}{2l}\sum_{i=1}^{l}\left(\frac{1}{|L_i|\,|\bar{L}_i|}\sum_{(m,n)\in L_i\times\bar{L}_i}(\xi_{im} + \xi_{in})\right), \qquad (10) $$

which is an upper bound of the actual ranking loss (8). For each term in the bracket of Eq. (10), we have

$$
\begin{aligned}
\frac{1}{|L_i|\,|\bar{L}_i|}\sum_{(m,n)\in L_i\times\bar{L}_i}(\xi_{im}+\xi_{in})
&= \frac{1}{|L_i|\,|\bar{L}_i|}\left(\sum_{(m,n)\in L_i\times\bar{L}_i}\xi_{im} + \sum_{(m,n)\in L_i\times\bar{L}_i}\xi_{in}\right)\\
&= \frac{1}{|L_i|\,|\bar{L}_i|}\left(|\bar{L}_i|\sum_{m\in L_i}\xi_{im} + |L_i|\sum_{n\in\bar{L}_i}\xi_{in}\right)\\
&= \frac{1}{|L_i|}\sum_{m\in L_i}\xi_{im} + \frac{1}{|\bar{L}_i|}\sum_{n\in\bar{L}_i}\xi_{in}. \qquad (11)
\end{aligned}
$$
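The bound can also be checked numerically. The sketch below (not from the paper) evaluates the actual ranking loss (8) and the approximate one (10)–(11), taking the slacks as hinge losses $\xi_{ik} = \max(0, 1 - y_{ik}f_k(\mathbf{x}_i))$, which is consistent with the constraints in Eqs. (3) and (12).

```python
import numpy as np

def ranking_loss(F, Y):
    """Actual ranking loss, Eq. (8). F: (l, q) decision values, Y: (l, q) labels in {-1, +1}."""
    losses = []
    for f, y in zip(F, Y):
        rel, irr = np.flatnonzero(y == 1), np.flatnonzero(y == -1)
        pairs = sum(1 for m in rel for n in irr if f[m] <= f[n])
        losses.append(pairs / (len(rel) * len(irr)))
    return float(np.mean(losses))

def approximate_ranking_loss(F, Y):
    """Upper bound of Eq. (10), with slacks taken as hinge losses xi_ik = max(0, 1 - y_ik f_k(x_i))."""
    Xi = np.maximum(0.0, 1.0 - Y * F)
    losses = []
    for xi, y in zip(Xi, Y):
        rel, irr = y == 1, y == -1
        # Eq. (11): average slack over relevant labels plus average slack over irrelevant ones.
        losses.append(xi[rel].mean() + xi[irr].mean())
    return float(np.mean(losses)) / 2.0

# Tiny check that the bound holds on random data.
rng = np.random.default_rng(1)
F = rng.normal(size=(5, 4))
Y = np.where(rng.random((5, 4)) > 0.5, 1, -1)
Y[:, 0], Y[:, 1] = 1, -1   # every instance has relevant and irrelevant labels
assert ranking_loss(F, Y) <= approximate_ranking_loss(F, Y) + 1e-12
```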

After the factor $1/(2l)$ in Eq. (10) is neglected, Eq. (10) with Eq. (11) is utilized as the empirical loss term in our new method. Now we consider all training instances and all classes at once to construct the original form of our novel support vector machine for multi-label classification as follows:

$$
\begin{aligned}
\min \quad & \frac{1}{2}\sum_{k=1}^{q}\mathbf{w}_k^T\mathbf{w}_k + qC\sum_{i=1}^{l}\left[\frac{1}{|L_i|}\sum_{m\in L_i}\xi_{im} + \frac{1}{|\bar{L}_i|}\sum_{n\in\bar{L}_i}\xi_{in}\right]\\
\text{s.t.} \quad & y_{ik}(\mathbf{w}_k^T\mathbf{x}_i + b_k) \ge 1 - \xi_{ik},\\
& \xi_{ik} \ge 0,\; i = 1, \ldots, l;\; k = 1, \ldots, q, \qquad (12)
\end{aligned}
$$

where C again represents a regularization constant. The dual problem can be derived using the standard Lagrangian technique. Let $\alpha_{ik} \ge 0$ and $\beta_{ik} \ge 0$ be the Lagrangian multipliers for the inequality constraints in Eq. (12). The Lagrangian for the primal form (12) is

$$
\begin{aligned}
L = \;& \frac{1}{2}\sum_{k=1}^{q}\mathbf{w}_k^T\mathbf{w}_k + qC\sum_{i=1}^{l}\left[\frac{1}{|L_i|}\sum_{m\in L_i}\xi_{im} + \frac{1}{|\bar{L}_i|}\sum_{n\in\bar{L}_i}\xi_{in}\right]\\
& - \sum_{k=1}^{q}\sum_{i=1}^{l}\alpha_{ik}\bigl(y_{ik}(\mathbf{w}_k^T\mathbf{x}_i + b_k) - 1 + \xi_{ik}\bigr) - \sum_{k=1}^{q}\sum_{i=1}^{l}\beta_{ik}\xi_{ik}. \qquad (13)
\end{aligned}
$$

The KKT conditions for this primal problem require the following relations to be true:

$$ \frac{\partial L}{\partial \mathbf{w}_k} = \mathbf{w}_k - \sum_{i=1}^{l}\alpha_{ik}y_{ik}\mathbf{x}_i = 0 \;\Rightarrow\; \mathbf{w}_k = \sum_{i=1}^{l}\alpha_{ik}y_{ik}\mathbf{x}_i, \qquad (14) $$

$$ \frac{\partial L}{\partial b_k} = \sum_{i=1}^{l}\alpha_{ik}y_{ik} = 0, \qquad (15) $$

$$ \frac{\partial L}{\partial \xi_{ik}} = \begin{cases} \dfrac{qC}{|L_i|} - \alpha_{ik} - \beta_{ik} = 0, & \text{if } y_{ik} = +1,\\[2mm] \dfrac{qC}{|\bar{L}_i|} - \alpha_{ik} - \beta_{ik} = 0, & \text{if } y_{ik} = -1. \end{cases} \qquad (16) $$

By introducing the above KKT conditions (14)–(16) into the Lagrangian (13), the dual form becomes

$$
\begin{aligned}
\max \quad & \sum_{k=1}^{q}\left(-\frac{1}{2}\sum_{i,j=1}^{l}\alpha_{ik}y_{ik}\alpha_{jk}y_{jk}(\mathbf{x}_i^T\mathbf{x}_j) + \sum_{i=1}^{l}\alpha_{ik}\right),\\
\text{s.t.} \quad & \sum_{i=1}^{l}\alpha_{ik}y_{ik} = 0,\\
& 0 \le \alpha_{ik} \le C_{ik} = \begin{cases} \dfrac{qC}{|L_i|}, & \text{if } y_{ik} = +1,\\[2mm] \dfrac{qC}{|\bar{L}_i|}, & \text{if } y_{ik} = -1, \end{cases} \quad i = 1, \ldots, l;\; k = 1, \ldots, q. \qquad (17)
\end{aligned}
$$

Here the upper bounds for $\alpha_{ik}$ are derived from Eq. (16). We find out that this optimization problem is decoupled among different classes. Therefore, it can be divided into q sub-problems using the OVR decomposition trick, and then each sub-problem can be solved using binary SVM. Additionally, the upper bounds of the variables to be solved are associated with the number of relevant and irrelevant labels of training instances, which reflects label correlation of individual instance explicitly.
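As an illustration of how the decoupled sub-problems with per-variable bounds $C_{ik}$ might be approached with an off-the-shelf solver, the sketch below uses scikit-learn's SVC, whose per-instance sample weights rescale each instance's box constraint. This is only a stand-in for the author's modified LibSVM solver; the function and parameter names are assumptions.

```python
import numpy as np
from sklearn.svm import SVC

def train_ovr_esvm(X, Y, C=1.0, gamma=0.5):
    """Train the q decoupled sub-problems of Eq. (17).

    The per-variable bound C_ik = qC/|L_i| (relevant) or qC/|L_bar_i| (irrelevant)
    is realized via sample_weight, which scales each instance's box constraint.
    Assumes every instance has at least one relevant and one irrelevant label.
    """
    l, q = Y.shape
    n_rel = (Y == 1).sum(axis=1)        # |L_i|
    n_irr = q - n_rel                   # |L_bar_i|
    models = []
    for k in range(q):
        w = np.where(Y[:, k] == 1, q / n_rel, q / n_irr)   # C_ik / C per instance
        models.append(SVC(C=C, kernel="rbf", gamma=gamma).fit(X, Y[:, k], sample_weight=w))
    return models
```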

As in binary SVM [36], the dot product between two vectors in Eq. (17) can be replaced by various kernels, and the corresponding q non-linear decision functions become

$$ f_k(\mathbf{x}) = \sum_{i=1}^{l}\alpha_{ik}y_{ik}K(\mathbf{x}, \mathbf{x}_i) + b_k, \qquad (18) $$

where $K(\mathbf{x}, \mathbf{x}_i)$ represents some kernel function [36]. In this paper, this novel method is referred to as the extended one-versus-rest support vector machine for multi-label classification, or simply OVR-ESVM. We extended the binary SVM training method in LibSVM [11,19] to implement our OVR-ESVM, whose time complexity is the same as that of OVR-SVM.
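For completeness, Eq. (18) can be evaluated directly from the dual coefficients; a minimal sketch with an RBF kernel follows (variable names are illustrative, and the coefficients are assumed to come from a trained sub-problem).

```python
import numpy as np

def rbf_kernel(x, X, gamma):
    """K(x, x_i) = exp(-gamma * ||x - x_i||^2) for every training instance x_i."""
    return np.exp(-gamma * np.sum((X - x) ** 2, axis=1))

def decision_value(x, X_train, alpha_k, y_k, b_k, gamma):
    """Eq. (18): f_k(x) = sum_i alpha_ik * y_ik * K(x, x_i) + b_k."""
    return float(np.dot(alpha_k * y_k, rbf_kernel(x, X_train, gamma)) + b_k)
```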

5. Experiments

In this section, we compare our OVR-ESVM with four existing multi-label classification approaches experimentally. Before presenting our experimental results, we briefly describe the four existing methods, four evaluation measures for multi-label classification, ten benchmark data sets, and parameter settings for the five multi-label methods.

5.1. Four state-of-the-art multi-label classification methods

In this paper, we selected four existing multi-label classifiers: OVR-SVM, RCP-SVM, OBO-SVDD, and ML-kNN. OVR-SVM integrates the OVR data decomposition trick with traditional binary SVM, as introduced in Section 4. For RCP-SVM, binary SVM is used to train OVO sub-classifiers after the mixed class has been discarded, and linear ridge regression is used to build a threshold function associated with the number of votes of each class. Therefore, our RCP-SVM can be regarded as a simplified version of the method in Ref. [12], since one linear ridge regression model substitutes for q OVR ones. OBO-SVDD combines the OBO data decomposition trick and support vector data description (SVDD) with linear ridge regression [39]. ML-kNN is an extension of the traditional k nearest neighbor method through adding a discrete Bayesian rule [41].

All four kernel-type multi-label methods have been implemented via modifying the well-known SVM software package LibSVM [19]. To deal with medium and even large-scale data sets, we recoded ML-kNN using the C/C++ language. In this case, we compare our OVR-ESVM with these four multi-label classification algorithms, i.e., OVR-SVM, RCP-SVM, OBO-SVDD, and ML-kNN.


5.2. Evaluation measures

It is more complicated to evaluate a multi-label classification algorithm than a single-label one, which has induced more than ten performance evaluation measures [33]. In this paper, besides the ranking loss (8), we choose four popular and indicative instance-based measures: accuracy, precision, recall, and F1. Assume an evaluation data set of size m to be $\{(\mathbf{x}_1, L_1), \ldots, (\mathbf{x}_i, L_i), \ldots, (\mathbf{x}_m, L_m)\}$. Given some instance $\mathbf{x}_i$, its predicted set of labels from some multi-label algorithm is denoted by $L_i^p \in 2^Q$.

The accuracy evaluates the percentage of relevant labels that are predicted correctly, out of the union of predicted and true labels. The precision calculates the percentage of predicted labels that are relevant, while recall computes the percentage of relevant labels that are predicted. Since these two measures conflict with each other, the F1 expresses the harmonic mean of the two. The four instance-based measures are defined as follows:

$$
\begin{aligned}
\text{Accuracy} &= \frac{1}{m}\sum_{i=1}^{m}\frac{|L_i \cap L_i^p|}{|L_i \cup L_i^p|}, \qquad
\text{Precision} = \frac{1}{m}\sum_{i=1}^{m}\frac{|L_i \cap L_i^p|}{|L_i^p|},\\
\text{Recall} &= \frac{1}{m}\sum_{i=1}^{m}\frac{|L_i \cap L_i^p|}{|L_i|}, \qquad\;\;
\text{F1} = \frac{1}{m}\sum_{i=1}^{m}\frac{2\,|L_i \cap L_i^p|}{|L_i^p| + |L_i|}. \qquad (19)
\end{aligned}
$$

It is worth noting that the above four measures (19) and the ranking loss (8) all range between 0 and 1. It is desirable that a multi-label classification algorithm achieve instance-based measure values as high as possible and a ranking loss value as small as possible.
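A small sketch (not from the paper) of the four instance-based measures in Eq. (19), operating on true and predicted label sets; it assumes every true label set is non-empty and treats an empty predicted set as contributing zero precision.

```python
def instance_based_measures(true_sets, pred_sets):
    """Instance-based accuracy, precision, recall, and F1 of Eq. (19).

    true_sets, pred_sets: lists of Python sets of label indices (same length m).
    """
    m = len(true_sets)
    acc = prec = rec = f1 = 0.0
    for L, Lp in zip(true_sets, pred_sets):
        inter = len(L & Lp)
        acc += inter / len(L | Lp)
        prec += inter / len(Lp) if Lp else 0.0
        rec += inter / len(L)
        f1 += 2 * inter / (len(Lp) + len(L))
    return acc / m, prec / m, rec / m, f1 / m

# Hypothetical example with two evaluation instances.
print(instance_based_measures([{1, 3}, {2}], [{1}, {2, 4}]))
```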

Table 1. Statistics for the ten benchmark data sets used in our experiments.

| Data set | Domain | Train instances | Test instances | Features | Labels | Average labels |
|---|---|---|---|---|---|---|
| Image | Scene | 1200 | 800 | 294 | 5 | 1.24 |
| Scene | Scene | 1211 | 1196 | 294 | 6 | 1.07 |
| Emotions | Music | 391 | 202 | 72 | 6 | 1.87 |
| Yeast | Biology | 1500 | 917 | 103 | 14 | 4.24 |
| Genbase | Biology | 463 | 199 | 1185 | 27 | 1.35 |
| Medical | Text | 645 | 333 | 1449 | 45 | 1.25 |
| Enron | Text | 1123 | 579 | 1001 | 53 | 3.38 |
| RCV1-subset1 | Text | 3000 | 3000 | 47,236 | 101 | 2.88 |
| Corel16K001 | Scene | 5188 | 1744 | 500 | 153 | 2.87 |
| Bibtex | Text | 4480 | 2515 | 1836 | 159 | 2.40 |

Fig. 3. Approximate ranking loss versus the actual one for the ten training sets.

5.3. Ten data sets

To compare our OVR-ESVM with the four existing methods, we collected ten benchmark data sets: Image, Scene, Emotions, Yeast, Genbase, Medical, Enron, RCV1-subset1, Corel16K001, and Bibtex from Refs. [19,35,44], as listed in Table 1 according to their number of class labels. Table 1 shows some useful statistics of these data sets, such as the number of instances in the training and test sets, the number of features, the number of labels, and the average labels. These data sets cover four distinct domains: text, scene, music, and biology. In this paper, we adopt a train-test mode rather than k-fold cross validation, in which the training sets are used to evaluate the approximate ranking loss function (10), tune key parameters and learn multi-label classifiers, while the test sets are used to verify the performance of the algorithms. The Image data set [44] was partitioned into training and test sets by us. The other nine training and test sets were downloaded from Refs. [19,35] directly. For more detailed information and descriptions of these data sets, please refer to their web sites and the references therein [19,35,44].

5.4. Parameter settings for five multi-label methods

In this paper, the RBF kernel $K(\mathbf{x}, \mathbf{y}) = \exp(-\gamma\|\mathbf{x} - \mathbf{y}\|^2)$ is tested for the four kernel-type methods, where $\gamma$ denotes the kernel scale parameter and $\|\cdot\|$ the Euclidean distance between $\mathbf{x}$ and $\mathbf{y}$. Besides $\gamma$, OVR-ESVM, OVR-SVM, and RCP-SVM involve a regularization constant C, while OBO-SVDD has a constant $\nu \in (0,1)$ to control the fraction of instances outside the hyper-sphere. To reduce the search space of parameter combinations, after fixing C = 1.0 and $\nu$ = 0.1 (default and recommended settings in LibSVM [19]), the $\gamma$ value is varied from $2^7$ to $2^{-7}$ for the first five data sets and from $2^2$ to $2^{-12}$ for the last five in Table 1, for evaluating the approximate ranking loss (10) and tuning the key kernel parameter $\gamma$. For ML-kNN, although k = 10 was recommended and has been widely used in Refs. [4,41], we still vary the k value between 1 and 15 in the parameter tuning procedure. Lastly, the optimal parameters from our tuning procedure are used to train the five multi-label classifiers on the ten training sets in Table 1.
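A hedged sketch of this tuning procedure: three-fold cross validation repeated three times over a grid of $\gamma$ values, scoring each candidate by the average instance-based F1. It uses scikit-learn's OneVsRestClassifier and f1_score(average='samples') as stand-ins for the paper's own implementation, assumes a 0/1 label-indicator matrix, and assumes every label occurs in every training fold.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
from sklearn.metrics import f1_score

def tune_gamma(X, Y01, grid=2.0 ** np.arange(7, -8, -1), C=1.0, repeats=3):
    """Pick the RBF gamma maximizing the average instance-based F1 over
    three-fold cross validation repeated three times (Y01 is a 0/1 indicator matrix)."""
    best_gamma, best_f1 = None, -1.0
    for gamma in grid:
        scores = []
        for r in range(repeats):
            for tr, te in KFold(n_splits=3, shuffle=True, random_state=r).split(X):
                clf = OneVsRestClassifier(SVC(C=C, kernel="rbf", gamma=gamma))
                pred = clf.fit(X[tr], Y01[tr]).predict(X[te])
                scores.append(f1_score(Y01[te], pred, average="samples", zero_division=0))
        if np.mean(scores) > best_f1:
            best_gamma, best_f1 = float(gamma), float(np.mean(scores))
    return best_gamma, best_f1
```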

5.5. Evaluation of approximate ranking loss

In our OVR-ESVM, the approximate ranking loss (10), i.e., the upper bound of the actual one (8), is minimized as the empirical loss term. Here we validate whether these two ranking loss function values coincide. According to the previous parameter setting for our OVR-ESVM, we calculate the two function values with 15 different $\gamma$ values on the ten training sets and show them using a cross plot in Fig. 3. It has to be pointed out that the small approximate ranking loss values correspond to the large $\gamma$ values for each training data set.

Fig. 4. The average F1 measure as a function of the $\gamma$ or k value: (a) Image, (b) Scene, (c) Emotions, (d) Yeast, (e) Genbase, (f) Medical, (g) Enron, (h) RCV1-subset1, (i) Corel16K001, (j) Bibtex.


From Fig. 3, it can be observed that: (i) a general tendency is that as the approximate ranking loss value decreases, the actual one decreases accordingly; (ii) for the five data sets in Fig. 3(a), after the approximate function is less than some fixed value, the actual one equals zero; (iii) in Fig. 3(b), when the approximate function is very small, some actual values slightly oscillate, which is possibly induced by numerical computation; (iv) for RCV1-subset1, there are four special points, as shown in the top right corner of Fig. 3(b), where the actual function remains unchanged although the approximate one decreases. Overall, it is experimentally demonstrated that the actual ranking loss (8) can be minimized by minimizing its approximate one (10).

Table 3. Experimental results based on ranking loss.

| Data set | OVR-ESVM | OVR-SVM | RCP-SVM | OBO-SVDD | ML-kNN |
|---|---|---|---|---|---|
| Image | 0.12760 | 0.13313 | 0.13646 | 0.19562 | 0.19312 |
| Scene | 0.06421 | 0.06871 | 0.06879 | 0.10738 | 0.09772 |
| Emotions | 0.16221 | 0.14773 | 0.16463 | 0.20015 | 0.15281 |
| Yeast | 0.16870 | 0.16955 | 0.17179 | 0.18114 | 0.17026 |
| Genbase | 0.00419 | 0.00419 | 0.01408 | 0.01147 | 0.00499 |
| Medical | 0.02282 | 0.02685 | 0.07263 | 0.06339 | 0.04117 |
| Enron | 0.09131 | 0.10541 | 0.09135 | 0.10967 | 0.09356 |
| RCV1-subset1 | 0.01472 | 0.02019 | 0.14098 | 0.05449 | 0.02944 |
| Corel16K001 | 0.17708 | 0.16614 | 0.15904 | 0.20222 | 0.19849 |
| Bibtex | 0.07097 | 0.07869 | 0.10165 | 0.21686 | 0.22324 |
| Test score | 3.5 | 3 | 1.5 | 0.5 | 1.5 |

Fig. 5. Percentage of the F1 improvement of OVR-ESVM over OVR-SVM.

5.6. Tuning key parameters for five multi-label classifiers

In this sub-section, we tune the $\gamma$ value for the four kernel-type multi-label classifiers and the k value for ML-kNN. Based on three-fold cross validation and three repeats, we investigate the average F1 as a function of the $\gamma$ or k value, as shown in Fig. 4, and then detect their optimal values, as listed in Table 2. It is worth noting that our tuning procedure is only conducted on the ten training sets.

According to Fig. 4, it is observed that only the average F1 curve of RCP-SVM on Genbase oscillates, while all other average F1 curves of the four kernel-type approaches have only a single peak F1 value. It is interesting that the average F1 curves of ML-kNN behave diversely, e.g., an almost monotonically increasing mode for Corel16K001 and Bibtex, a monotonically decreasing one for Genbase, and an oscillating one for the other seven data sets.

It is also found that the optimal F1 values from our OVR-ESVM are consistently superior to those from the other four methods for all ten data sets, as listed in Table 2. For the other four methods, we could not identify which one performs the best. It is surprising that for Corel16K001 and Bibtex, ML-kNN performs extremely poorly.

Since our OVR-ESVM is an extension of OVR-SVM, we also investigate the improvement of OVR-ESVM over OVR-SVM, as shown in Fig. 5, according to their optimal F1 values in Table 2. Interestingly, for all ten data sets, the F1 values of our OVR-ESVM are improved over OVR-SVM, with more than 5% improvement achieved for Emotions, Yeast, and Medical, and more than 10% for the last four data sets with more labels. It is also observed that the $\gamma$ values for our OVR-ESVM are smaller than those for OVR-SVM on all data sets but Genbase, which implies that OVR-ESVM models are simpler than OVR-SVM ones.

The above key parameter tuning procedure not only searches for the optimal $\gamma$ or k values for the five methods, but also provides a primary evaluation and comparison between our OVR-ESVM and the other four existing approaches.

Table 2. Optimal $\gamma$ and k values according to the maximal F1 measure (the optimal average F1 values are also listed in brackets; the last column gives the k value for ML-kNN).

| Data set | OVR-ESVM | OVR-SVM | RCP-SVM | OBO-SVDD | ML-kNN |
|---|---|---|---|---|---|
| Image | 2^-2 (0.6658) | 2^-2 (0.6512) | 2^-2 (0.6158) | 2^-1 (0.5619) | 4 (0.4716) |
| Scene | 2^-3 (0.7856) | 2^-3 (0.7780) | 2^-3 (0.7697) | 2^-3 (0.6604) | 7 (0.6297) |
| Emotions | 2^-3 (0.6657) | 2^-1 (0.6249) | 2^-2 (0.6487) | 2^0 (0.6023) | 12 (0.5806) |
| Yeast | 2^-1 (0.6554) | 2^0 (0.6157) | 2^1 (0.6514) | 2^0 (0.5974) | 6 (0.6256) |
| Genbase | 2^-2 (0.9903) | 2^-3 (0.9828) | 2^-3 (0.8728) | 2^-4 (0.9549) | 1 (0.9786) |
| Medical | 2^-5 (0.8131) | 2^-4 (0.7561) | 2^-5 (0.4952) | 2^-4 (0.5161) | 5 (0.5525) |
| Enron | 2^-6 (0.5800) | 2^-6 (0.5224) | 2^-5 (0.5041) | 2^-4 (0.4319) | 8 (0.3962) |
| RCV1-subset1 | 2^-3 (0.7391) | 2^-1 (0.6345) | 2^-1 (0.4337) | 2^-4 (0.5872) | 15 (0.6906) |
| Corel16K001 | 2^-5 (0.2430) | 2^-2 (0.1700) | 2^-3 (0.2284) | 2^-2 (0.1734) | 13 (0.0134) |
| Bibtex | 2^-8 (0.4380) | 2^-7 (0.3982) | 2^-6 (0.3152) | 2^-7 (0.1941) | 8 (0.1505) |

5.7. Comparison with four existing methods on test sets

According to the optimal $\gamma$ and k values listed in Table 2, we retrained the five multi-label classification methods using the ten entire training sets and verified their performance using the ten corresponding test sets listed in Table 1. The experimental results are shown in Tables 3–7 in terms of the different measures.

For each measure, to build a relative performance comparison of any two classifiers (A and B), we statistically compare them using the paired Wilcoxon signed rank test at the 5% significance level [8]. First, we conduct a two-tailed test, where the null hypothesis is that the two classifiers have equal performance; if it is not rejected, classifiers A and B are both assigned a score of 0.5. In case this null hypothesis is rejected, we conduct a left- or right-tailed test to detect whether classifier A performs statistically significantly better than B. If so, classifiers A and B are scored 1 and 0, respectively. Lastly, we calculate the test score of each classifier according to all its possible pairwise tests, as listed in the last rows of Tables 3–7.
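A minimal sketch of this two-stage scoring rule using scipy.stats.wilcoxon; it assumes larger measure values are better (for ranking loss the direction would be flipped) and that the paired differences are not all zero.

```python
from scipy.stats import wilcoxon

def pairwise_score(a, b, alpha=0.05):
    """Score one (A, B) comparison from paired per-data-set measure values:
    0.5 each if the two-sided test keeps the null hypothesis of equal performance,
    otherwise 1 for the significantly better classifier and 0 for the other."""
    if wilcoxon(a, b, alternative="two-sided").pvalue >= alpha:
        return 0.5, 0.5
    if wilcoxon(a, b, alternative="greater").pvalue < alpha:
        return 1.0, 0.0
    return 0.0, 1.0
```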


Table 4. Experimental results based on accuracy.

| Data set | OVR-ESVM | OVR-SVM | RCP-SVM | OBO-SVDD | ML-kNN |
|---|---|---|---|---|---|
| Image | 0.64333 | 0.65646 | 0.53417 | 0.50667 | 0.44958 |
| Scene | 0.76394 | 0.76742 | 0.76129 | 0.58187 | 0.63531 |
| Emotions | 0.59092 | 0.57137 | 0.54703 | 0.50083 | 0.53630 |
| Yeast | 0.55330 | 0.52324 | 0.53918 | 0.49214 | 0.50369 |
| Genbase | 0.98576 | 0.98869 | 0.86315 | 0.93057 | 0.98660 |
| Medical | 0.78003 | 0.77177 | 0.47472 | 0.51491 | 0.54880 |
| Enron | 0.47090 | 0.43509 | 0.44479 | 0.30617 | 0.31244 |
| RCV1-subset1 | 0.66030 | 0.53814 | 0.31779 | 0.45988 | 0.59856 |
| Corel16K001 | 0.16421 | 0.12485 | 0.17008 | 0.12511 | 0.01026 |
| Bibtex | 0.37616 | 0.34923 | 0.26855 | 0.16401 | 0.12053 |
| Test score | 4.0 | 2.5 | 1.5 | 1.0 | 1.0 |

Table 5. Experimental results based on precision.

| Data set | OVR-ESVM | OVR-SVM | RCP-SVM | OBO-SVDD | ML-kNN |
|---|---|---|---|---|---|
| Image | 0.70896 | 0.75313 | 0.56687 | 0.55958 | 0.51563 |
| Scene | 0.79515 | 0.80351 | 0.80184 | 0.60075 | 0.66430 |
| Emotions | 0.64521 | 0.73680 | 0.64604 | 0.61716 | 0.66997 |
| Yeast | 0.65438 | 0.73604 | 0.65287 | 0.64278 | 0.72545 |
| Genbase | 0.99497 | 1.00000 | 0.86759 | 0.94229 | 0.99497 |
| Medical | 0.82307 | 0.84384 | 0.54505 | 0.57297 | 0.60561 |
| Enron | 0.61188 | 0.73440 | 0.58152 | 0.50212 | 0.55866 |
| RCV1-subset1 | 0.83839 | 0.93503 | 0.58063 | 0.71771 | 0.84176 |
| Corel16K001 | 0.20147 | 0.35837 | 0.29693 | 0.26732 | 0.01844 |
| Bibtex | 0.50957 | 0.63817 | 0.36810 | 0.27053 | 0.24434 |
| Test score | 2.0 | 4.0 | 1.5 | 1.0 | 1.5 |

Table 6. Experimental results based on recall.

| Data set | OVR-ESVM | OVR-SVM | RCP-SVM | OBO-SVDD | ML-kNN |
|---|---|---|---|---|---|
| Image | 0.73938 | 0.66375 | 0.78417 | 0.65088 | 0.46229 |
| Scene | 0.81438 | 0.77550 | 0.76129 | 0.78595 | 0.64465 |
| Emotions | 0.79043 | 0.63779 | 0.70050 | 0.63036 | 0.61304 |
| Yeast | 0.72414 | 0.58853 | 0.70659 | 0.63750 | 0.57030 |
| Genbase | 0.99079 | 0.98869 | 0.98911 | 0.97320 | 0.99162 |
| Medical | 0.86436 | 0.78378 | 0.60210 | 0.63313 | 0.56456 |
| Enron | 0.65170 | 0.47533 | 0.56755 | 0.39712 | 0.34949 |
| RCV1-subset1 | 0.74236 | 0.54153 | 0.36994 | 0.54384 | 0.63014 |
| Corel16K001 | 0.39998 | 0.12485 | 0.22987 | 0.16242 | 0.01037 |
| Bibtex | 0.51232 | 0.34994 | 0.37704 | 0.21842 | 0.12290 |
| Test score | 4.0 | 2.0 | 1.5 | 2.0 | 0.5 |

Table 7. Experimental results based on F1.

| Data set | OVR-ESVM | OVR-SVM | RCP-SVM | OBO-SVDD | ML-kNN |
|---|---|---|---|---|---|
| Image | 0.69842 | 0.69083 | 0.63271 | 0.57217 | 0.47583 |
| Scene | 0.79125 | 0.78219 | 0.77480 | 0.65368 | 0.64813 |
| Emotions | 0.68642 | 0.65248 | 0.65033 | 0.59302 | 0.61337 |
| Yeast | 0.66402 | 0.62492 | 0.65630 | 0.60705 | 0.60977 |
| Genbase | 0.98995 | 0.99241 | 0.90739 | 0.94909 | 0.99079 |
| Medical | 0.82302 | 0.79870 | 0.54234 | 0.57230 | 0.57337 |
| Enron | 0.59857 | 0.54373 | 0.55334 | 0.41785 | 0.40523 |
| RCV1-subset1 | 0.75165 | 0.63988 | 0.43208 | 0.57653 | 0.68423 |
| Corel16K001 | 0.25552 | 0.17979 | 0.24562 | 0.18636 | 0.01262 |
| Bibtex | 0.45953 | 0.41834 | 0.34440 | 0.21016 | 0.15131 |
| Test score | 4.0 | 2.5 | 1.5 | 1.0 | 1.0 |

Fig. 6. Ranking five multi-label classifiers according to their overall scores.



Table 3 shows the comparison of our OVR-ESVM with the four state-of-the-art multi-label classifiers on the ranking loss measure. Our OVR-ESVM performs the best on eight data sets, while OVR-SVM and RCP-SVM achieve the lowest values on Emotions and Corel16K001, respectively. According to the Wilcoxon test score, the five methods are sorted as OVR-ESVM, OVR-SVM, RCP-SVM, ML-kNN, and OBO-SVDD.

Table 4 reports the comparison of OVR-ESVM with the four existing multi-label approaches on the accuracy measure. OVR-ESVM and OVR-SVM work the best on six and three data sets, respectively. It is interesting that RCP-SVM obtains the highest accuracy value on Corel16K001. In terms of the statistical test, the order of the five approaches is OVR-ESVM, OVR-SVM, RCP-SVM, OBO-SVDD, and ML-kNN.

Table 5 provides the comparison of OVR-ESVM with the four existing classification techniques on the precision measure. Here, OVR-SVM works the best on all ten data sets. In terms of the Wilcoxon test, the five methods can be sorted as OVR-SVM, OVR-ESVM, RCP-SVM, ML-kNN, and OBO-SVDD. This is the only exception where OVR-SVM outperforms our OVR-ESVM, which suggests that OVR-SVM detects fewer relevant labels than our OVR-ESVM.

Table 6 lists the comparison of OVR-ESVM with the four high-performing classifiers on the recall measure. Our OVR-ESVM achieves the best results on all ten data sets. On the basis of the statistical test, we rank the five methods as OVR-ESVM, OVR-SVM, OBO-SVDD, RCP-SVM, and ML-kNN.

Table 7 illustrates the comparison of OVR-ESVM with the four state-of-the-art classification methods on the F1 measure. Our OVR-ESVM achieves the best results on nine data sets, while OVR-SVM performs the best on Genbase. Although our OVR-ESVM works worse than OVR-SVM on Genbase, the difference between the two methods is only 0.246%. Additionally, to the best of our knowledge, most methods can obtain a satisfactory performance on this data set [4,33]. On the basis of the statistical test, we grade the five methods as OVR-ESVM, OVR-SVM, RCP-SVM, OBO-SVDD, and ML-kNN.

It is well known that the precision and recall measures conflict with each other, so it is not surprising that OVR-ESVM and OVR-SVM end in a tie statistically in terms of Tables 5 and 6. We also find that the differences of the average precision and recall values over the ten data sets between OVR-ESVM and OVR-SVM are 0.6783 − 0.7539 = −0.0756 and 0.723 − 0.593 = 0.130, respectively.


Therefore, OVR-ESVM sacrifices the precision measure slightly and then improves the recall measure greatly. Further, on the two comprehensive measures, accuracy and F1, our OVR-ESVM performs statistically significantly better than OVR-SVM. To give a clearer view of the five methods, we sum up the overall scores of all methods over the five measures and show them in Fig. 6 in descending order. It can be observed that OVR-ESVM is the best classifier and OVR-SVM is a good candidate.

According to the above experiments and discussions, it can be concluded that our OVR-ESVM is the strongest method, compared with the four state-of-the-art multi-label methods, i.e., OVR-SVM, RCP-SVM, OBO-SVDD, and ML-kNN.

6. Conclusions

For multi-label classification, almost all researchers aim at both low computational cost and good performance. But these two targets usually conflict with each other in practice; they are mainly addressed by the data decomposition and algorithm extension strategies, respectively. The hybrid strategy considers the trade-off between the two targets, resulting in some effective and efficient multi-label techniques. In this paper, we have applied the hybrid strategy to design and implement a novel support vector machine for multi-label classification.

We define a new empirical loss term by approximating the ranking loss from above, and then generalize the traditional binary support vector machine to construct a new multi-label support vector machine. Our quadratic programming problem has different upper bounds for the variables to be solved, which characterizes label correlation of individual instance explicitly. Furthermore, our optimization problem is solved by the one-versus-rest data decomposition trick and a modified binary support vector machine, which reduces the computational cost dramatically. An experimental study on ten benchmark data sets demonstrates that our OVR-ESVM achieves rather competitive performance, compared with four typical multi-label classification methods, including OVR-SVM, RCP-SVM, OBO-SVDD, and ML-kNN. Our primary success will inspire more researchers to develop new multi-label classification approaches using the hybrid strategy.

In our future work, we will conduct model selection to search for optimal parameters more efficiently than k-fold cross validation with a parameter grid search, and test more large-scale data sets.

Acknowledgments

This work is supported by the Natural Science Foundation of China grant 60875001 and partially by the Jiangsu Province Scholarship for Overseas Studying (Sep. 2008–Sep. 2009).

References

[1] Z. Barutcuoglu, R.E. Schapire, O.G. Troyanskaya, Hierarchical multi-label prediction of gene function, Bioinformatics 22 (7) (2006) 830–836.
[2] M.R. Boutell, J. Luo, X. Shen, C.M. Brown, Learning multi-label scene classification, Pattern Recognition 37 (9) (2004) 1757–1771.
[3] A.C.P.L.F. de Carvalho, A.A. Freitas, A tutorial on multi-label classification techniques, in: A. Abraham, A.E. Hassanien, V. Snasel (Eds.), Function Approximation and Classification, Foundations of Computational Intelligence, vol. 5, Studies in Computational Intelligence, Springer, Berlin/Heidelberg, 2009, pp. 177–195.
[4] W. Cheng, E. Hüllermeier, Combining instance-based learning and logistic regression for multi-label classification, Machine Learning 76 (2/3) (2009) 211–225.
[5] A. Clare, R.D. King, Knowledge discovery in multi-label phenotype data, in: Proceedings of the 5th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD2001), LNCS 2168, Freiburg, Baden-Württemberg, Germany, 2001, pp. 42–53.
[6] F.D. Comite, R. Gilleron, M. Tommasi, Learning multi-label alternative decision tree from texts and data, in: Proceedings of the 3rd International Conference on Machine Learning and Data Mining in Pattern Recognition (MLDM2003), LNCS 2734, Leipzig, Germany, 2003, pp. 35–49.

[7] K. Dembczynski, W. Cheng, E. Hüllermeier, Bayes optimal multilabel classification via probabilistic classifier chains, in: Proceedings of the 27th International Conference on Machine Learning (ICML2010), Haifa, Israel, 2010, pp. 279–286.
[8] J. Demšar, Statistical comparison of classifiers over multiple data sets, Journal of Machine Learning Research 7 (2006) 1–30.
[9] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, second ed., John Wiley and Sons, New York, 2001.
[10] A. Elisseeff, J. Weston, A kernel method for multi-labelled classification, in: Proceedings of the 14th Conference on Neural Information Processing Systems (NIPS2001), Vancouver, British Columbia, Canada, 2001, pp. 681–687.
[11] R.E. Fan, P.H. Chen, C.J. Lin, Working set selection using second order information for training support vector machines, Journal of Machine Learning Research 6 (2005) 1889–1918.
[12] J. Fürnkranz, E. Hüllermeier, E.L. Mencía, K. Brinker, Multi-label classification via calibrated label ranking, Machine Learning 73 (2) (2008) 133–153.
[13] R. Grodzicki, J. Mandziuk, L. Wang, Improved multi-label classification with neural networks, in: Proceedings of the 10th International Conference on Parallel Problem Solving from Nature (PPSN2008), LNCS 5199, Dortmund, Germany, 2008, pp. 409–416.
[14] E. Hüllermeier, J. Fürnkranz, W. Cheng, K. Brinker, Label ranking by learning pairwise preferences, Artificial Intelligence 172 (16/17) (2008) 1897–1916.
[15] A. Jiang, C. Wang, Y. Zhu, Calibrated Rank-SVM for multi-label image categorization, in: Proceedings of the 2008 IEEE International Joint Conference on Neural Networks (IJCNN2008), Hong Kong, China, 2008, pp. 1450–1455.
[16] T. Joachims, Text categorization with support vector machines: learning with many relevant features, in: Proceedings of the 10th European Conference on Machine Learning (ECML1998), LNCS 1398, Chemnitz, Germany, 1998, pp. 137–142.
[17] B. Lauser, A. Hotho, Automatic multi-label subject indexing in a multilingual environment, in: Proceedings of the 7th European Conference on Research and Advanced Technology for Digital Libraries (ECDL2003), LNCS 2769, Trondheim, Norway, 2003, pp. 140–151.
[18] J.Y. Li, J.H. Xu, A fast multi-label classification algorithm based on double label support vector machine, in: Proceedings of the 2009 International Conference on Computational Intelligence and Security (CIS2009), vol. 2, Beijing, China, 2009, pp. 30–35.
[19] C.J. Lin, LibSVM software and its implementation details, and multi-label data sets, http://www.csie.ntu.edu.tw/~cjlin/libsvm, 2010.

[20] E.L. Mencía, J. Fürnkranz, Pairwise learning of multilabel classifications with perceptrons, in: Proceedings of the 2008 IEEE International Joint Conference on Neural Networks (IJCNN2008), Hong Kong, China, 2008, pp. 2900–2907.
[21] P. Pavlidis, J. Weston, J. Cai, W.N. Grundy, Combining microarray expression data and phylogenetic profiles to learn functional categories using support vector machines, in: Proceedings of the 5th Annual International Conference on Computational Molecular Biology (RECOMB2001), Montreal, Canada, 2001, pp. 242–248.
[22] M. Petrovskiy, Paired comparisons method for solving multi-label learning problem, in: Proceedings of the 6th International Conference on Hybrid Intelligent Systems (HIS2006), Auckland, New Zealand, 2006, p. 42.
[23] J. Read, A pruned problem transformation method for multi-label classification, in: Proceedings of the 2008 New Zealand Computer Science Research Student Conference (NZCSRS 2008), Christchurch, New Zealand, 2008, pp. 143–150.
[24] J. Read, B. Pfahringer, G. Holmes, Multi-label classification using ensembles of pruned sets, in: Proceedings of the 8th IEEE International Conference on Data Mining (ICDM2008), Pisa, Italy, 2008, pp. 995–1000.
[25] J. Read, B. Pfahringer, G. Holmes, E. Frank, Classifier chains for multi-label classification, in: Proceedings of the 2009 European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD 2009, Part II), LNAI 5782, Bled, Slovenia, 2009, pp. 254–269.
[26] R.E. Schapire, Y. Singer, BoosTexter: a boosting-based system for text categorization, Machine Learning 39 (2/3) (2000) 135–168.
[27] A.F.D. Souza, F. Pedroni, E. Oliveira, P.M. Ciarelli, W.F. Henrique, L. Veronese, C. Badue, Automated multi-label text categorization with VG-RAM weightless neural networks, Neurocomputing 72 (10/12) (2009) 2209–2217.
[28] D. Tax, R.P.W. Duin, Support vector data description, Machine Learning 54 (1) (2004) 45–66.
[29] K. Trohidis, G. Tsoumakas, G. Kalliris, I. Vlahavas, Multi-label classification of music into emotions, in: Proceedings of the 9th International Conference on Music Information Retrieval (ISMIR2008), Philadelphia, PA, USA, 2008, pp. 325–330.
[30] G. Tsoumakas, I. Katakis, Multi-label classification: an overview, International Journal of Data Warehousing and Mining 3 (3) (2007) 1–13.
[31] G. Tsoumakas, I. Vlahavas, Random k-labelsets: an ensemble method for multi-label classification, in: Proceedings of the 18th European Conference on Machine Learning (ECML2007), LNCS 4701, Warsaw, Poland, 2007, pp. 406–417.
[32] G. Tsoumakas, M.L. Zhang, Z.H. Zhou, Learning from multi-label data, ECML/PKDD 2009 Tutorial, http://www.ecmlpkdd2009.net/program/toturials/learning-from-multi-label-data.pdf.
[33] G. Tsoumakas, I. Katakis, I. Vlahavas, Mining multi-label data, in: O. Maimon, L. Rokach (Eds.), Data Mining and Knowledge Discovery Handbook, second ed., Springer, New York, 2010, pp. 667–685.
[34] G. Tsoumakas, I. Vlahavas, I. Katakis, Random k-labelsets for multi-label classification, IEEE Transactions on Knowledge and Data Engineering 23 (7) (2011) 1079–1089.


[35] G. Tsoumakas, Multi-label datasets, http://mulan.sourceforge.net/datasets.html, 2010.
[36] V.N. Vapnik, Statistical Learning Theory, John Wiley and Sons, New York, 1998.
[37] L. Wang, M. Chang, J. Feng, Parallel and sequential support vector machines for multi-label classification, International Journal of Information Technology 11 (9) (2005) 11–18.
[38] S.P. Wan, J.H. Xu, A multi-label classification algorithm based on triple class support vector machine, in: Proceedings of the 2007 IEEE International Conference on Wavelet Analysis and Pattern Recognition (ICWAPR2007), Beijing, China, 2007, pp. 1447–1452.
[39] J.H. Xu, Constructing a fast algorithm for multi-label classification with support vector data description, in: Proceedings of the 2010 IEEE Conference on Granular Computing (GrC2010), San Jose, CA, USA, 2010, pp. 817–821.
[40] M.L. Zhang, Z.H. Zhou, Multilabel neural networks with application to functional genomics and text categorization, IEEE Transactions on Knowledge and Data Engineering 18 (10) (2006) 1338–1351.
[41] M.L. Zhang, Z.H. Zhou, ML-kNN: a lazy learning approach to multi-label learning, Pattern Recognition 40 (5) (2007) 2038–2048.
[42] M.L. Zhang, ML-RBF: RBF neural networks for multi-label learning, Neural Processing Letters 29 (2) (2009) 61–74.
[43] M.L. Zhang, J.M. Peña, V. Robles, Feature selection for multi-label naïve Bayes classification, Information Sciences 179 (19) (2009) 3218–3229.
[44] M.L. Zhang, Image data set, http://cse.seu.edu.cn/people/zhangml, 2010.

Jianhua Xu received his Ph.D. in Pattern Recognition and Intelligent Systems in 2002 (Department of Automation, Tsinghua University, Beijing, China), M.S. in Geophysics in 1987 (Department of Earth and Space Sciences, University of Science and Technology of China, Hefei, China), and B.E. in Seismology in 1985 (Department of Applied Geophysics, Chengdu College of Geology, Chengdu, China). Since 2005, he has been a professor in Computer Science at the School of Computer Science and Technology, Nanjing Normal University, Nanjing, China. Between Sep. 2008 and Sep. 2009, he was a visiting scholar at the Department of Statistics, Harvard University, Cambridge, MA, USA. His research interests are focused on pattern recognition, machine learning, and their applications to bioinformatics.