Expert Systems with Applications 39 (2012) 4796–4804


An efficient multi-label support vector machine with a zero label

Jianhua Xu
School of Computer Science and Technology, Nanjing Normal University, Nanjing, Jiangsu 210097, China

Article info

Keywords: Support vector machine; Multi-label classification; Frank–Wolfe method; Linear programming; Quadratic programming


Abstract

The existing multi-label support vector machine (Rank-SVM) has an extremely high computational complexity and lacks an intrinsic zero point to determine relevant labels. In this paper, we propose a novel support vector machine for multi-label classification by both simplifying Rank-SVM and adding a zero label, resulting in a quadratic programming problem in which each class has an independent equality constraint. When the Frank–Wolfe method is used to solve our quadratic programming problem iteratively, the entire linear programming problem of each step is divided into a series of sub-problems, which dramatically reduces the computational cost. It is illustrated that for the well-known Yeast data set our training procedure runs about 12 times faster than Rank-SVM does under a C++ environment. Experiments on five benchmark data sets show that our method is a powerful candidate for multi-label classification, compared with five state-of-the-art multi-label classification techniques.


1. Introduction

Multi-label classification is a special learning issue where a single instance is possibly associated with several classes at the same time, and thus the classes are not mutually exclusive. Recently, it has received more attention than before because of many real-world applications, such as text categorization (Chang, Chen, & Liau, 2008; Joachims, 1998; Schapire & Singer, 2000; Zhang & Zhou, 2006), scene classification (Boutell, Luo, Shen, & Brown, 2004; Jiang, Wang, & Zhu, 2008; Li, Wang, & Sung, 2004), bioinformatics (Barutcuoglu, Schapire, & Troyanskaya, 2006; Pavlidis, Weston, Cai, & Grundy, 2001), and music categorization (Trohidis, Tsoumakas, Kalliris, & Vlahavas, 2008). Nowadays there are mainly two distinct strategies for designing discriminative multi-label classification methods: problem transformation and algorithm adaptation (Tsoumakas & Katakis, 2007; Tsoumakas, Katakis, & Vlahavas, 2010).

Problem transformation converts a multi-label problem into either one or more single-label (single, binary or multi-class) sub-problems, constructs a sub-classifier for each sub-problem using an existing technique, and then assembles all sub-classifiers into an entire multi-label classifier. It is convenient and fast to implement a problem transformation method thanks to the many existing techniques and their free software. Currently, four useful transformation tricks have been used in application domains, i.e., the label powerset (LP), one-by-one (OBO), one-versus-one (OVO), and one-versus-rest (OVR) approaches. In principle, the LP trick considers correlations between labels via possible label combinations, while the others do so by reusing multi-label instances many times.

To characterize as many correlations between labels as possible, algorithm adaptation extends some specific multi-class classification algorithms to handle an entire multi-label training data set directly. It is interesting that this strategy can induce two types of multi-label classification methods. One considers all classes and all instances simultaneously, usually leading to some complicated optimization problems, such as the multi-label support vector machine (Rank-SVM) (Elisseeff & Weston, 2001) and multi-label BP neural networks (BP-MLL) (Zhang & Zhou, 2006). The other still deals with each class independently after using some problem transformation tricks (e.g., OVR), such as the multi-label kNN method (ML-kNN) (Zhang & Zhou, 2007) and multi-label RBF neural networks (ML-RBF) (Zhang, 2009). The latter group of methods has a much lower computational complexity than the former one. This inspires us to introduce problem transformation tricks into algorithm adaptation methods to reduce the computational cost.

In this paper, our focus is on the well-known Rank-SVM. Besides its high computational complexity, Rank-SVM lacks a natural zero point to detect relevant labels. Through adding a zero label, we define a new form of ranking loss and then simplify the original form of Rank-SVM, which leads to a novel quadratic programming problem where each label has an independent linear equality constraint. When the iterative linearization Frank–Wolfe method is utilized to solve our optimization problem, the entire linear programming problem of each step can be divided into a series of sub-problems using the OVR transformation trick. This means that the computational cost can be reduced dramatically. It is illustrated experimentally that our method runs about 12 times faster than Rank-SVM in the training phase under a C++ environment for the well-known Yeast data set. Further, our method exhibits competitive performance against five existing high-performing multi-label classification methods, including ML-kNN, ML-RBF, ML-NB, BP-MLL and Rank-SVM (Elisseeff & Weston, 2001; Zhang & Zhou, 2006, 2007; Zhang, 2009; Zhang, Pena, & Robles, 2009).

The rest of this paper is organized as follows. The multi-label classification setting is introduced in Section 2 and related work is reviewed in Section 3. Our novel support vector machine is described in Section 4 and its corresponding training algorithm is presented in Section 5. Section 6 is devoted to experiments with five benchmark data sets. This paper ends with some conclusions in Section 7.

2. Multi-label classification setting

Let X ⊆ R^d be a d-dimensional real input space and Q = {1, 2, ..., q} a finite set of class labels, where q is the number of class labels. Further, assume that an instance x ∈ X can be associated with a subset of labels L ⊆ Q, which is referred to as the relevant set of labels for x. At the same time, the complement of L, i.e., L̄ = Q \ L, is called the irrelevant set of labels. Given a training data set of size l drawn identically and independently from an unknown probability distribution on X × 2^Q, i.e.,

\[ \{(\mathbf{x}_1, L_1), (\mathbf{x}_2, L_2), \ldots, (\mathbf{x}_l, L_l)\}, \tag{1} \]

the multi-label classification problem is to learn a classifier f(x): X → 2^Q that generalizes well beyond these instances in the sense of minimizing the risk with respect to a specific loss function (Cheng & Hullermeier, 2009).

In many traditional q-class single-label classification methods, a widely used trick is to learn q discriminative functions f_i(x): X → R, i = 1, ..., q, such that f_k(x) > f_i(x), i ≠ k, if x belongs to the class k (Duda, Hart, & Stork, 2001, chap. 5). For multi-label classification, such an idea is extended as f_k(x) > f_i(x), k ∈ L, i ∈ L̄, which means that the relevant labels should be ranked higher than the irrelevant ones (Elisseeff & Weston, 2001). In this case, the multi-label prediction can be done through a proper threshold t(x):

\[ f(\mathbf{x}) = \{k \mid f_k(\mathbf{x}) \ge t(\mathbf{x}),\ k = 1, \ldots, q\}. \tag{2} \]

There are mainly three types of threshold functions t(x): a constant (e.g., zero) (Boutell et al., 2004), a linear regression model associated with the q discriminative function values (Elisseeff & Weston, 2001; Zhang & Zhou, 2006; Xu, 2010), and an additional discriminative function for a so-called virtual, zero or calibrated label (Brinker, Furnkranz, & Hullermeier, 2006; Furnkranz, Hullermeier, Mencia, & Brinker, 2008; Jiang et al., 2008). In the last two cases, t(x) depends on x.
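As an illustration of rule (2), the following minimal NumPy sketch (our own, not from the paper; the function name is ours) predicts the relevant set of labels by thresholding the q discriminative values, here with the constant threshold t(x) = 0:

```python
import numpy as np

def predict_labels(scores, threshold=0.0):
    """Return the relevant label indices given q discriminative values.

    scores: array of shape (q,), one value f_k(x) per label.
    threshold: t(x); a constant zero here, but it may also be a
    per-instance value produced by a learned threshold function.
    """
    return [k for k, s in enumerate(scores) if s >= threshold]

# Example: labels 0 and 2 are predicted as relevant.
print(predict_labels(np.array([0.7, -0.3, 1.2, -1.5])))  # [0, 2]
```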

3. Related work

In the past several years, a variety of multi-label methods have been proposed in pattern recognition, machine learning and statistics. In this section, according to the two aforementioned strategies, we focus on the two corresponding kinds of discriminative multi-label methods: problem transformation methods and algorithm adaptation methods, as in Tsoumakas and Katakis (2007) and Tsoumakas et al. (2010).

3.1. Problem transformation methods

In problem transformation methods, there exist two core problems: how to divide a multi-label classification problem into either one or more single-label sub-problems, and then, accordingly, how to assemble all sub-classifiers into an entire multi-label classifier. For most of such multi-label methods, sub-classifiers can be built

using a number of existing methods and their free software. Here, we briefly review four transformation tricks: one-versus-rest (OVR), one-versus-one (OVO), one-by-one (OBO) and label powerset (LP).

The OVR transformation, or binary relevance (BR), method divides a q-label classification problem into q binary sub-problems (Boutell et al., 2004; Joachims, 1998), where the ith sub-problem consists of positive instances from the ith label and negative ones from all the other labels, and usually assembles the q sub-classifiers into an entire multi-label algorithm using a proper threshold function. Therefore the size of each sub-problem is the same, i.e., the number of total training instances. This method has been demonstrated to work well on a number of real-world applications using many binary classifiers, for example, support vector machines (Boutell et al., 2004; Elisseeff & Weston, 2001; Joachims, 1998), naive Bayes (Zhang et al., 2009), and kNN (Cheng & Hullermeier, 2009).
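As a rough illustration of the OVR/BR trick (our own sketch, not part of the paper, which uses its own C++ implementation), one binary classifier is trained per label and their decisions are stacked; the helper names below are ours:

```python
import numpy as np
from sklearn.svm import SVC

def train_ovr(X, Y, **svc_kwargs):
    """Train one binary SVM per label (binary relevance / OVR).

    X: (l, d) feature matrix; Y: (l, q) binary label matrix in {0, 1}.
    Assumes every label column contains both positive and negative instances.
    Returns a list of q fitted binary classifiers.
    """
    return [SVC(**svc_kwargs).fit(X, Y[:, k]) for k in range(Y.shape[1])]

def predict_ovr(models, X):
    """Stack the q binary decisions into an (n, q) prediction matrix."""
    return np.column_stack([m.predict(X) for m in models])
```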

The OVO transformation method splits a q-label problem into q(q − 1)/2 binary sub-problems in a pairwise way and combines all sub-classifiers using a voting scheme and a proper threshold function. Note that for some sub-problems, a few instances still belong to both the positive and negative classes at the same time, which can be assigned to a mixed class. In Boutell et al. (2004) and Petrovskiy (2006), this mixed class is regarded as the positive (or negative) class twice, and then two binary SVM classifiers are trained. It is reasonable to locate this mixed class between the positive and negative classes. Under this setting, two parallel hyper-planes are used to separate these three classes in Wang, Chang, and Feng (2005) and Wan and Xu (2007), while in Li and Xu (2009) such a mixed class is forced to reside in the marginal region of a binary SVM. When the number of classes is small, these OVO-based methods run fast due to their small-scale optimization sub-problems (Li & Xu, 2009). A variant of the OVO transformation is ranking by pairwise comparison (RPC) (Hullermeier, Furnkranz, Cheng, & Brinker, 2008; Mencia & Furnkranz, 2008), in which the mixed class is discarded. To detect the relevant set of labels, a calibrated label is added and then estimated by the above OVR transformation method (Brinker et al., 2006; Furnkranz et al., 2008). In essence, this method integrates the OVO transformation with the OVR one.

The OBO transformation technique parses a q-label problem into q single-class sub-problems, in which the ith sub-problem consists of instances from the ith class only (Xu, 2010), builds q single-class classifiers by support vector data description (Tax & Duin, 2004), and assembles a multi-label classifier using linear ridge regression. This algorithm has the lowest computational complexity among all existing kernel-based multi-label methods (Xu, 2010).

The LP transformation method considers each possible label combination of more than one class in a multi-label training data set as a new single class, and then converts a multi-label problem into a standard multi-class one (Boutell et al., 2004). This method may produce a large number of new classes, many of which consist of very few instances. On the other hand, it is impossible to create any new label combination in the prediction procedure. In Read (2008), label combinations with very few instances are pruned using a pre-determined threshold and replaced by disjoint label combinations that occur more often. In Tsoumakas and Vlahavas (2007), Tsoumakas, Katakis, and Vlahavas (2011) and Read, Pfahringer, and Holmes (2008), two LP ensemble classifiers are proposed by utilizing a small random subset of the set of labels, and a slightly larger random subset of the training set, respectively.
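A minimal sketch of the LP trick (our own illustration, not the paper's code): each distinct label subset is mapped to a new single-class id, after which any standard multi-class learner can be applied.

```python
import numpy as np

def label_powerset_encode(Y):
    """Map each row of an (l, q) binary label matrix to a single class id.

    Returns (codes, classes): codes is an (l,) integer vector and classes
    stores the distinct label subsets so predictions can be decoded back.
    """
    classes, codes = np.unique(Y, axis=0, return_inverse=True)
    return codes.ravel(), classes

def label_powerset_decode(codes, classes):
    """Recover the (n, q) binary label matrix from predicted class ids."""
    return classes[codes]
```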

Problem transformation methods are popular since they are easy to implement using various existing methods and their corresponding free software, and they work well on a number of benchmark data sets in the literature. In principle, the LP-type methods consider correlations between labels through possible label combinations, while the others do so by reusing multi-label instances many times.

The OVR method usually assembles many sub-classifiers in a parallel way; alternatively, a series of sub-classifiers can be linked in a chain, in which the outputs of previous sub-classifiers are added to the inputs of the current sub-classifier (Read, Pfahringer, Holmes, & Frank, 2009; Dembczynski, Cheng, & Hullermeier, 2010). According to Dembczynski, Cheng, et al. (2010) and Dembczynski, Waegeman, Cheng, and Hullermeier (2010), the LP method and its ensemble forms, and the classifier chain methods, can characterize conditional label dependence. However, it is noted that these improved approaches increase the computational cost greatly.

3.2. Algorithm adaptation methods

In order to characterize as many correlations between labels as possible, many existing multi-class classification methods have been extended to solve multi-label classification problems. Here we further group them into two types according to their implementation ways. One is implemented by solving complicated optimization problems and the other by some problem transformation tricks.

Through modifying the formula of entropy calculation and permitting multiple labels at the leaves of the tree, a C4.5-like multi-label classification algorithm is proposed in Clare and King (2001), in which it is possible to generate a large number of leaves for all combinations of different labels, just like the original LP method. BoosTexter (Schapire & Singer, 2000) is derived from the well-known AdaBoost algorithm and includes two slightly different versions: AdaBoost.MH and AdaBoost.MR. The former predicts the relevant set of labels of an instance, while the latter ranks the labels of an instance in descending order. Adtboost.MH (Comite, Gilleron, & Tommasi, 2003) is constructed by integrating alternating decision trees with AdaBoost.MH. It is hard to control the complexity of boosting-based multi-label algorithms, as pointed out in Elisseeff and Weston (2001).

In principle, traditional back-propagation (BP) neural networks can deal with multi-label classification directly through assigning multiple ones at the output layer. In BP-MLL (Zhang & Zhou, 2006), a new empirical loss function is induced from the ranking loss to characterize correlations between the labels of an instance. But an additional threshold function still needs to be found using linear regression. It has been shown that this BP method runs very slowly. Learning a threshold during BP training at the same time is considered by doubling the number of nodes of the output layer in Grodzicki, Mandziuk, and Wang (2008).

A support vector machine for multi-label classification (Rank-SVM) is proposed in Elisseeff and Weston (2001) through both generalizing multi-class SVM (Vapnik, 1998) and accepting the ranking loss as its empirical loss function. But its corresponding quadratic programming problem has an extremely high computational complexity due to the huge number of variables to be solved, for example, 58,248 variables in the training procedure for the Yeast data set with 1500 instances and 14 labels. Additionally, just as in BP-MLL, a proper threshold function has to be estimated using linear regression. In Jiang et al. (2008), a virtual label is simply added to find a natural zero to determine the relevant labels, which further increases the difficulty of the optimization work. Although Rank-SVM is solved by the linearization Frank–Wolfe algorithm rather than a traditional quadratic programming solver, its training procedure is still time consuming.

The main limitation of the above algorithm adaptation methods is their high computational complexity. Therefore how to reduce their computational cost is still a challenging issue. Interestingly,

several algorithm adaptation methods are formally derived from corresponding multi-class techniques, but they are still implemented by some problem transformation tricks.

In classical multi-class RBF neural networks, the C-means method is used to cluster all training instances, and then q discriminative functions are built by solving q linear systems of equations independently. For its multi-label version (ML-RBF) (Zhang, 2009), during the clustering procedure a q-label problem is divided into q sub-problems using the OBO method, and then the instances of each class are clustered independently. Therefore ML-RBF is internally realized by the OBO and OVR transformation methods.

Nearest neighbor (kNN) or instance-based (IB) algorithms have been extended to construct slightly different multi-label methods: ML-kNN (Zhang & Zhou, 2007) and IBLR-ML (Cheng & Hullermeier, 2009). Both of them need a training phase. Combining a leave-one-out procedure with the traditional kNN method, ML-kNN estimates the prior and conditional probabilities of Bayesian rules for each label independently, while IBLR-ML estimates posterior probabilities associated with the weighted sum of the nearest instances of different labels using logistic regression for each label. Therefore IBLR-ML takes more correlations between labels into account than ML-kNN does. In essence, ML-kNN and IBLR-ML are implemented by the OVR transformation method.

It has been illustrated that these algorithm adaptation methods embedding problem transformation tricks run much faster than pure algorithm adaptation methods and some problem transformation ones. This indicates that integrating problem transformation tricks into algorithm adaptation methods would become an effective and efficient way to speed up the training procedure of some algorithm adaptation methods.

4. A novel support vector machine for multi-label classification

In this section, we briefly review an existing multi-label support vector machine first and then propose our novel support vector machine for multi-label classification.

4.1. Existing multi-label support vector machine

The existing multi-label support vector machine (Rank-SVM) (Elisseeff & Weston, 2001) is extended from its corresponding multi-class support vector machine (Vapnik, 1998). In the original input space, q linear discriminative functions are defined as,

\[ f_k(\mathbf{x}) = \mathbf{w}_k^T \mathbf{x} + b_k, \quad k = 1, \ldots, q, \tag{3} \]

where w_k and b_k denote the weight vectors and bias terms. For some training instance, it is desirable that any relevant label should be ranked at least one higher than any irrelevant one. When this ideal situation does not hold, a slack variable is introduced, i.e.,

\[ f_m(\mathbf{x}_i) - f_n(\mathbf{x}_i) = (\mathbf{w}_m - \mathbf{w}_n)^T \mathbf{x}_i + (b_m - b_n) \ge 1 - \xi_{imn}, \quad (m, n) \in L_i \times \bar{L}_i. \tag{4} \]

In fact the slack variable ξ_imn ≥ 0 reflects a ranking loss value between a relevant label m and an irrelevant one n. Accordingly, an overall ranking loss can be approximated using the following formula,

\[ \mathrm{RankingLoss} \approx \sum_{i=1}^{l} \frac{1}{|L_i||\bar{L}_i|} \sum_{(m,n) \in L_i \times \bar{L}_i} \xi_{imn}. \tag{5} \]

This term is utilized as the empirical loss function in Rank-SVM. The original optimization problem of Rank-SVM is constructed as follows,


\[
\begin{aligned}
\min \quad & \frac{1}{2} \sum_{k=1}^{q} \mathbf{w}_k^T \mathbf{w}_k + C \sum_{i=1}^{l} \frac{1}{|L_i||\bar{L}_i|} \sum_{(m,n) \in L_i \times \bar{L}_i} \xi_{imn}, \\
\mathrm{s.t.} \quad & (\mathbf{w}_m - \mathbf{w}_n)^T \mathbf{x}_i + (b_m - b_n) \ge 1 - \xi_{imn}, \\
& \xi_{imn} \ge 0, \quad (m,n) \in L_i \times \bar{L}_i, \quad i = 1, \ldots, l,
\end{aligned}
\tag{6}
\]

where C represents the regularization constant. Using the standard Lagrangian technique, the dual version of (6) is derived as,

\[
\begin{aligned}
\min \quad & \frac{1}{2} \sum_{k=1}^{q} \sum_{i,j=1}^{l} \beta_{ki}\beta_{kj}\,(\mathbf{x}_i^T \mathbf{x}_j) - \sum_{i=1}^{l} \sum_{(m,n) \in L_i \times \bar{L}_i} \alpha_{imn}, \\
\mathrm{s.t.} \quad & \sum_{i=1}^{l} \sum_{(m,n) \in L_i \times \bar{L}_i} c_{imn}^{k}\alpha_{imn} = 0, \quad k = 1, \ldots, q, \\
& 0 \le \alpha_{imn} \le C_i = \frac{C}{|L_i||\bar{L}_i|},
\end{aligned}
\tag{7}
\]

with

\[
c_{imn}^{k} =
\begin{cases}
0 & \text{if } m \ne k \text{ and } n \ne k, \\
+1 & \text{if } m = k, \\
-1 & \text{if } n = k,
\end{cases}
\tag{8}
\]

and

\[ \beta_{ki} = \sum_{(m,n) \in L_i \times \bar{L}_i} c_{imn}^{k}\alpha_{imn}. \tag{9} \]

It is worth noting that the variables β_ki are defined only to keep the formula as concise as possible. Now there are Σ_{i=1}^{l} |L_i||L̄_i| variables, determined by the number of constraints in (6), so this quadratic programming problem has an extremely high computational complexity. Additionally, the threshold function t(x) in (2) needs to be estimated by linear regression. In Jiang et al. (2008), a virtual or calibrated label is simply added to (6), resulting in ql + Σ_{i=1}^{l} |L_i||L̄_i| variables, which greatly increases the difficulty of the optimization work.
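To make the size difference concrete, the following sketch (our own illustration; the 58,248 figure is quoted from the paper) counts the dual variables of Rank-SVM and of the method proposed below from a binary label matrix:

```python
import numpy as np

def dual_variable_counts(Y):
    """Compare dual problem sizes given an (l, q) binary label matrix Y.

    Rank-SVM needs sum_i |L_i| * |Lbar_i| variables (one per constraint in (6)),
    whereas the proposed SVM-ML needs only q * l variables.
    """
    l, q = Y.shape
    rel = Y.sum(axis=1)                      # |L_i| for each instance
    rank_svm_vars = int(np.sum(rel * (q - rel)))
    svm_ml_vars = q * l
    return rank_svm_vars, svm_ml_vars

# For Yeast (l = 1500, q = 14) the paper reports 58,248 Rank-SVM variables,
# while SVM-ML would need only 14 * 1500 = 21,000.
```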

4.2. A novel support vector machine for multi-label classification

Focusing on both reducing the computational cost and adding a zero label for Rank-SVM, we propose a novel support vector machine for multi-label classification in this sub-section. For convenience, we define a binary vector y_i = [y_i1, y_i2, ..., y_iq] for some training instance x_i, where y_ik = +1 if k ∈ L_i, and y_ik = −1 otherwise. To obtain a natural zero point to detect relevant labels, we add a zero label and define its discriminative function as,

\[ f_0(\mathbf{x}) = \mathbf{w}_0^T \mathbf{x} + b_0. \tag{10} \]

Now it is desirable that any relevant label should be ranked one higher than this zero label and any irrelevant label one lower than this zero one. In case such an ideal case does not happen, two slack variables are considered. Therefore, for a pair of relevant and irrelevant labels of x_i, i.e., (m, n) ∈ L_i × L̄_i, we have the following two constraints,

\[
\begin{aligned}
f_m(\mathbf{x}_i) - f_0(\mathbf{x}_i) &= (\mathbf{w}_m - \mathbf{w}_0)^T \mathbf{x}_i + (b_m - b_0) \ge 1 - \xi_{im}, \\
f_n(\mathbf{x}_i) - f_0(\mathbf{x}_i) &= (\mathbf{w}_n - \mathbf{w}_0)^T \mathbf{x}_i + (b_n - b_0) \le -1 + \xi_{in}.
\end{aligned}
\tag{11}
\]

Here the slack variables ξ_im, ξ_in ≥ 0. According to the binary label vectors, the above constraints can be rewritten in a unified form:

\[ y_{ik}\big((\mathbf{w}_k - \mathbf{w}_0)^T \mathbf{x}_i + (b_k - b_0)\big) \ge 1 - \xi_{ik}, \quad k = 1, \ldots, q. \tag{12} \]

From (11), the difference between f_m(x_i) and f_n(x_i) becomes,

\[ f_m(\mathbf{x}_i) - f_n(\mathbf{x}_i) = (\mathbf{w}_m - \mathbf{w}_n)^T \mathbf{x}_i + (b_m - b_n) \ge 2 - (\xi_{im} + \xi_{in}). \tag{13} \]

It is observed that ξ_im + ξ_in plays a similar role to ξ_imn in Rank-SVM. Therefore, we define an overall ranking loss as follows,

\[
\begin{aligned}
\mathrm{RankingLoss} &\approx \sum_{i=1}^{l} \frac{1}{|L_i||\bar{L}_i|} \sum_{(m,n) \in L_i \times \bar{L}_i} (\xi_{im} + \xi_{in}) \\
&= \sum_{i=1}^{l} \left( \frac{1}{|L_i|} \sum_{m \in L_i} \xi_{im} + \frac{1}{|\bar{L}_i|} \sum_{n \in \bar{L}_i} \xi_{in} \right),
\end{aligned}
\tag{14}
\]

which is used as the empirical loss function in our novel method. Now we construct a new optimization problem as follows,

\[
\begin{aligned}
\min \quad & \frac{1}{2}\sum_{k=1}^{q}\mathbf{w}_k^T\mathbf{w}_k + \frac{1}{2}\mathbf{w}_0^T\mathbf{w}_0 + C\sum_{i=1}^{l}\left(\frac{1}{|L_i|}\sum_{m \in L_i}\xi_{im} + \frac{1}{|\bar{L}_i|}\sum_{n \in \bar{L}_i}\xi_{in}\right), \\
\mathrm{s.t.} \quad & y_{ik}\big((\mathbf{w}_k - \mathbf{w}_0)^T\mathbf{x}_i + (b_k - b_0)\big) \ge 1 - \xi_{ik}, \\
& \xi_{ik} \ge 0, \quad k = 1, \ldots, q, \quad i = 1, \ldots, l,
\end{aligned}
\tag{15}
\]

where C denotes the regularization constant. The dual problem can be derived using the standard Lagrangian technique. Let α_ik and β_ik be the Lagrangian multipliers for the inequality constraints in (15). The Lagrangian for the primal form (15) becomes,

\[
\begin{aligned}
L ={} & \frac{1}{2}\sum_{k=1}^{q}\mathbf{w}_k^T\mathbf{w}_k + \frac{1}{2}\mathbf{w}_0^T\mathbf{w}_0 + C\sum_{i=1}^{l}\left(\frac{1}{|L_i|}\sum_{m \in L_i}\xi_{im} + \frac{1}{|\bar{L}_i|}\sum_{n \in \bar{L}_i}\xi_{in}\right) \\
& - \sum_{k=1}^{q}\sum_{i=1}^{l}\alpha_{ik}\Big(y_{ik}\big((\mathbf{w}_k - \mathbf{w}_0)^T\mathbf{x}_i + (b_k - b_0)\big) - 1 + \xi_{ik}\Big) - \sum_{k=1}^{q}\sum_{i=1}^{l}\beta_{ik}\xi_{ik}.
\end{aligned}
\tag{16}
\]

The Karush–Kuhn–Tucker (KKT) conditions for this primal problem require the following relations to be true,

\[ \frac{\partial L}{\partial \mathbf{w}_k} = \mathbf{w}_k - \sum_{i=1}^{l}\alpha_{ik}y_{ik}\mathbf{x}_i = 0 \;\Rightarrow\; \mathbf{w}_k = \sum_{i=1}^{l}\alpha_{ik}y_{ik}\mathbf{x}_i, \quad k = 1, \ldots, q, \tag{17} \]

\[ \frac{\partial L}{\partial \mathbf{w}_0} = \mathbf{w}_0 + \sum_{k=1}^{q}\sum_{i=1}^{l}\alpha_{ik}y_{ik}\mathbf{x}_i = 0 \;\Rightarrow\; \mathbf{w}_0 = -\sum_{k=1}^{q}\sum_{i=1}^{l}\alpha_{ik}y_{ik}\mathbf{x}_i = -\sum_{k=1}^{q}\mathbf{w}_k, \tag{18} \]

\[ \frac{\partial L}{\partial b_k} = \sum_{i=1}^{l}\alpha_{ik}y_{ik} = 0, \quad k = 1, \ldots, q, \tag{19} \]

\[ \frac{\partial L}{\partial b_0} = \sum_{k=1}^{q}\sum_{i=1}^{l}\alpha_{ik}y_{ik} = 0, \tag{20} \]

\[
\frac{\partial L}{\partial \xi_{ik}} =
\begin{cases}
\dfrac{C}{|L_i|} - \alpha_{ik} - \beta_{ik} = 0, & \text{if } y_{ik} = +1, \\[2mm]
\dfrac{C}{|\bar{L}_i|} - \alpha_{ik} - \beta_{ik} = 0, & \text{if } y_{ik} = -1,
\end{cases}
\quad i = 1, \ldots, l, \quad k = 1, \ldots, q.
\tag{21}
\]

By introducing the above KKT conditions (17)–(21) into the Lagrangian (16), the dual form is,

\[
\begin{aligned}
\min \quad W ={} & \frac{1}{2}\sum_{k,k'=1}^{q}(1 + \delta_{kk'})\sum_{i,j=1}^{l}y_{ik}\alpha_{ik}\,y_{jk'}\alpha_{jk'}\,(\mathbf{x}_i^T\mathbf{x}_j) - \sum_{k=1}^{q}\sum_{i=1}^{l}\alpha_{ik}, \\
\mathrm{s.t.} \quad & \sum_{i=1}^{l}\alpha_{ik}y_{ik} = 0, \\
& 0 \le \alpha_{ik} \le C_{ik} =
\begin{cases}
\dfrac{C}{|L_i|}, & \text{if } y_{ik} = +1, \\[2mm]
\dfrac{C}{|\bar{L}_i|}, & \text{if } y_{ik} = -1,
\end{cases}
\quad k = 1, \ldots, q, \quad i = 1, \ldots, l,
\end{aligned}
\tag{22}
\]

where δ_kk' = 1 if k = k' and 0 otherwise. Note that if (19) holds, (20) is naturally true. Therefore (20) is not involved in (22). After solving (22), the bias terms can be calculated using the following KKT conditions:

\[ y_{ik}\big((\mathbf{w}_k - \mathbf{w}_0)^T\mathbf{x}_i + (b_k - b_0)\big) = 1, \quad \text{if } 0 < \alpha_{ik} < C_{ik}, \quad k = 1, \ldots, q, \quad i = 1, \ldots, l. \tag{23} \]

In fact, we only need to calculate the difference between b_k and b_0. In this case, the new discriminative functions can be rewritten as,

\[ f'_k(\mathbf{x}) = f_k(\mathbf{x}) - f_0(\mathbf{x}) = (\mathbf{w}_k - \mathbf{w}_0)^T\mathbf{x} + (b_k - b_0) = \sum_{i=1}^{l}\left(\sum_{k'=1}^{q}(1 + \delta_{kk'})\alpha_{ik'}y_{ik'}\right)(\mathbf{x}_i^T\mathbf{x}) + (b_k - b_0). \tag{24} \]

The corresponding multi-label classification rule becomes,

\[ \text{labels of } \mathbf{x} = \{k \mid f'_k(\mathbf{x}) \ge 0,\ k = 1, \ldots, q\}. \tag{25} \]

As in binary SVM and Rank-SVM (Elisseeff & Weston, 2001; Vapnik, 1998), the dot product between two vectors in (22) and (24) can be replaced by various kernels satisfying the Mercer theorem (Vapnik, 1998). It is attractive that there are only ql variables in (22), which is much fewer than in Rank-SVM. In this paper, our novel method is referred to as a support vector machine for multi-label classification with a zero label, or simply SVM-ML.
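As a rough illustration of the decision rule (24)–(25) with an RBF kernel (our own NumPy sketch, not the paper's C++ implementation; alpha, Y and the bias differences are assumed to come from a solver of (22)):

```python
import numpy as np

def rbf_kernel(X1, X2, gamma):
    """k(x, y) = exp(-gamma * ||x - y||^2) for all pairs of rows."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def decision_values(Xtrain, Y, alpha, bias, Xtest, gamma):
    """Evaluate f'_k(x) of (24) for all test instances and labels.

    Y, alpha: (l, q) label signs and dual variables; bias: (q,) values b_k - b_0.
    Returns an (n, q) matrix; labels with non-negative values are relevant (25).
    """
    K = rbf_kernel(Xtest, Xtrain, gamma)              # (n, l)
    coeff = alpha * Y                                 # (l, q): alpha_ik * y_ik
    S = coeff.sum(axis=1, keepdims=True)              # sum over k' of alpha_ik' y_ik'
    return K @ (coeff + S) + bias                     # the (1 + delta_kk') expansion

def predict(Xtrain, Y, alpha, bias, Xtest, gamma):
    return decision_values(Xtrain, Y, alpha, bias, Xtest, gamma) >= 0
```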

5. An efficient training procedure based on the Frank–Wolfe method

In this section, an efficient training algorithm for our SVM-ML is constructed. Before listing its detailed steps, we briefly review the Frank–Wolfe method for convex optimization problems.

5.1. Frank–Wolfe algorithm

The Frank–Wolfe method (FW) is a simple and classical first order feasible direction optimization method (Frank & Wolfe, 1956), which was originally proposed to solve quadratic programming problems and then extended to solve convex problems with a continuously differentiable objective function and linear (and box) constraints:

\[ \min f(\mathbf{x}) \quad \mathrm{s.t.} \quad \mathbf{x} \in S. \tag{26} \]

The set S is a nonempty and bounded polyhedron of the form:

\[ S = \{\mathbf{x} \mid A\mathbf{x} = \mathbf{b},\ \mathbf{l} \le \mathbf{x} \le \mathbf{u}\}, \tag{27} \]

where A is an m × n matrix, b represents an m-dimensional column vector, and l and u denote the lower and upper bounds of x. The FW method generates a sequence of feasible vectors {x^(p)} using a line search x^(p+1) = x^(p) + λ^(p)d^(p), where λ^(p) ∈ [0, 1] is a step length and d^(p) = x̄^(p) − x^(p) is a feasible descent direction satisfying x̄^(p) ∈ S and (d^(p))^T ∇f(x^(p)) < 0. To find the best feasible direction, i.e., the best x̄^(p), the FW method utilizes the first order Taylor series expansion of f(x) around the vector x^(p) and then solves a linear programming problem with the linear constraints (27):

\[ \bar{\mathbf{x}}^{(p)} = \arg\min_{\mathbf{x} \in S}\ \big(f(\mathbf{x}^{(p)}) + (\mathbf{x} - \mathbf{x}^{(p)})^T\nabla f(\mathbf{x}^{(p)})\big) = \arg\min_{\mathbf{x} \in S}\ \mathbf{x}^T\nabla f(\mathbf{x}^{(p)}). \tag{28} \]

Such a problem is easily optimized by the widely used simplex method or an interior point one. The basic FW algorithm for (26) can be stated as follows:

Step 1 (initialization): Select an initial feasible vector x^(1) ∈ S, set p = 1 and a stop criterion ε.

Step 2 (solving a linear programming problem): Let x̄^(p) = argmin_{x∈S} x^T ∇f(x^(p)). If |(x̄^(p) − x^(p))^T ∇f(x^(p))| ≤ ε, then stop; else continue.

Step 3 (executing a line search): Let λ^(p) = argmin_{λ∈[0,1]} f(x^(p) + λ(x̄^(p) − x^(p))).

Step 4 (update): Set x^(p+1) = x^(p) + λ^(p)(x̄^(p) − x^(p)) and p = p + 1, go to Step 2.

It was proved that the above FW algorithm has a sub-linear convergence rate (Frank & Wolfe, 1956; Guelat & Marcotte, 1986). This iterative linearization procedure has been applied to Rank-SVM directly, whose time complexity per iteration is O(qm²) with m = Σ_{i=1}^{l} |L_i||L̄_i| (Elisseeff & Weston, 2001; their detailed technical report: http://www.kyb.tuebingen.mpg.de/bs/people/weston/publications, 2001). But for SVM-ML, we only need to put m = Σ_{i=1}^{l} (|L_i| + |L̄_i|) = ql into O(·). Further, if f(x) is a strictly convex function, we have an analytical step length:

\[ \lambda^{(p)} = \min\left\{ -\frac{(\mathbf{d}^{(p)})^T\nabla f(\mathbf{x}^{(p)})}{(\mathbf{d}^{(p)})^T\nabla^2 f(\mathbf{x}^{(p)})\,\mathbf{d}^{(p)}},\ 1 \right\}, \tag{29} \]

which will be used to improve the training procedures of Rank-SVM and SVM-ML in our C++ implementation.
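For concreteness, a generic Frank–Wolfe loop with the analytical step length (29) might look like the following Python sketch (our own illustration; `solve_linear_subproblem` and `hess_vec` are assumed helpers standing for any LP solver over S and a Hessian-vector product, and f is assumed strictly convex):

```python
import numpy as np

def frank_wolfe(x0, grad, hess_vec, solve_linear_subproblem, eps=1e-4, max_iter=200):
    """Generic Frank-Wolfe loop with the analytical step length (29).

    grad(x): gradient of f at x; hess_vec(x, d): Hessian-vector product
    (so d @ hess_vec(x, d) is the quadratic form d^T H d);
    solve_linear_subproblem(g): returns argmin_{x in S} x^T g.
    """
    x = x0.copy()
    for _ in range(max_iter):
        g = grad(x)
        x_bar = solve_linear_subproblem(g)          # endpoint of the best feasible direction
        d = x_bar - x
        if abs(d @ g) <= eps:                       # duality-gap style stopping rule
            break
        lam = min(-(d @ g) / (d @ hess_vec(x, d)), 1.0)   # analytical step (29)
        x = x + lam * d
    return x
```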

5.2. An efficient training algorithm for SVM-ML

In order to apply the FW method to the dual problem of our SVM-ML (22), we have to derive the gradient of the objective function in (22),

\[ g_{ik} = \frac{\partial W}{\partial \alpha_{ik}} = y_{ik}\left(\sum_{k'=1}^{q}(1 + \delta_{kk'})\sum_{j=1}^{l}y_{jk'}\alpha_{jk'}\,(\mathbf{x}_i^T\mathbf{x}_j)\right) - 1. \tag{30} \]

Accordingly, we build a linear programming problem as follows,

\[
\begin{aligned}
\min \quad & \sum_{k=1}^{q}\sum_{i=1}^{l}g_{ik}\bar{\alpha}_{ik}, \\
\mathrm{s.t.} \quad & \sum_{i=1}^{l}y_{ik}\bar{\alpha}_{ik} = 0, \\
& 0 \le \bar{\alpha}_{ik} \le C_{ik}, \quad i = 1, \ldots, l, \quad k = 1, \ldots, q.
\end{aligned}
\tag{31}
\]

It is surprising that, due to the special constraints, this problem can be divided into q independent sub-problems using the OVR transformation trick, i.e.,

\[
\begin{aligned}
\min \quad & \sum_{i=1}^{l}g_{ik}\bar{\alpha}_{ik}, \\
\mathrm{s.t.} \quad & \sum_{i=1}^{l}y_{ik}\bar{\alpha}_{ik} = 0, \\
& 0 \le \bar{\alpha}_{ik} \le C_{ik}, \quad i = 1, \ldots, l.
\end{aligned}
\tag{32}
\]

This means that each class has its own linear programming sub-problem, which dramatically reduces the computational cost of each step of the FW method. Let d_ik = ᾱ_ik − α_ik. Accordingly, an analytical step length can be represented as,

\[ \lambda = \min\left\{ -\frac{\sum_{k=1}^{q}\sum_{i=1}^{l}g_{ik}d_{ik}}{\sum_{k,k'=1}^{q}(1 + \delta_{kk'})\sum_{i,j=1}^{l}y_{ik}d_{ik}\,y_{jk'}d_{jk'}\,(\mathbf{x}_i^T\mathbf{x}_j)},\ 1 \right\}. \tag{33} \]

Now we list our efficient training algorithm for SVM-ML as follows,

Step 1: Set p = 1, α_ik^(p) = 0, g_ik^(p) = −1, i = 1, ..., l, k = 1, ..., q; set the stop criterion ε and the maximal number of iterations M.


Table 1. Statistics for the five benchmark data sets used in our experiments.

Data set    Domain    Train   Test   Features   Classes   Average labels
Emotions    Music       391    202         72         6             1.87
Medical     Text        645    333       1449        45             1.25
Image       Scene      1200    800        294         5             1.24
Scene       Scene      1211   1196        294         6             1.07
Yeast       Biology    1500    917        103        14             4.24


Step 2: Solve the q linear programming sub-problems according to (32) to obtain ᾱ_ik^(p), and then compute d_ik^(p) = ᾱ_ik^(p) − α_ik^(p). If |Σ_{k=1}^{q} Σ_{i=1}^{l} g_ik^(p) d_ik^(p)| ≤ ε, then stop; else continue.
Step 3: Estimate the step length λ^(p) using (33).
Step 4: If p reaches M, then stop; else, let α_ik^(p+1) = α_ik^(p) + λ^(p) d_ik^(p) and p = p + 1.
Step 5: Update the gradient g_ik^(p) according to (30), and then go to Step 2.

Note that the maximal number of iterations M is set to stop our iterative procedure early, just as in BP-MLL (Zhang & Zhou, 2006). In the above training method, we have improved the basic FW method in two aspects: dividing the entire linear programming problem of each step into a series of sub-problems using the OVR transformation trick, and deriving an analytical step length, both of which speed up the training procedure of SVM-ML dramatically.
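Putting the pieces together, a compact and purely illustrative Python sketch of the above training loop is given below; it is not the paper's C++ implementation, it assumes a precomputed kernel matrix K and that every instance has at least one relevant and one irrelevant label, and it uses scipy.optimize.linprog as a stand-in for the paper's LPSOL solver of the per-class sub-problem (32):

```python
import numpy as np
from scipy.optimize import linprog

def train_svm_ml(K, Y, C, eps=1e-4, max_iter=200):
    """Frank-Wolfe training sketch for the SVM-ML dual (22).

    K: (l, l) kernel matrix; Y: (l, q) label signs in {-1, +1}; C: regularization.
    Returns the dual variables alpha of shape (l, q).
    """
    l, q = Y.shape
    rel = (Y == 1).sum(axis=1)                      # |L_i|; assumed in 1..q-1
    Cik = np.where(Y == 1, C / rel[:, None], C / (q - rel)[:, None])  # box bounds of (22)
    alpha = np.zeros((l, q))

    def gradient(a):
        coeff = a * Y                                # alpha_ik * y_ik
        S = K @ (coeff + coeff.sum(axis=1, keepdims=True))   # (1 + delta_kk') expansion
        return Y * S - 1.0                           # g_ik of (30)

    for _ in range(max_iter):
        g = gradient(alpha)
        a_bar = np.empty_like(alpha)
        for k in range(q):                           # one LP sub-problem per label, as in (32)
            res = linprog(g[:, k], A_eq=Y[:, [k]].T, b_eq=[0.0],
                          bounds=list(zip(np.zeros(l), Cik[:, k])), method="highs")
            a_bar[:, k] = res.x
        d = a_bar - alpha
        if abs(np.sum(g * d)) <= eps:                # stopping rule of Step 2
            break
        dY = d * Y
        denom = np.sum((K @ (dY + dY.sum(axis=1, keepdims=True))) * dY)  # quadratic form of (33)
        lam = min(-np.sum(g * d) / denom, 1.0) if denom > 0 else 1.0     # analytical step length
        alpha += lam * d
    return alpha
```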

6. Experiments

In this section, we compare our SVM-ML with five existing multi-label classification approaches experimentally. Before presenting our experimental results, we briefly introduce four evaluation measures for multi-label classification, the five existing methods, and the five benchmark data sets.

6.1. Evaluation measures

It is more complicated to evaluate a multi-label classification algorithm than a single-label one, which has induced more than 10 performance evaluation measures (Tsoumakas & Katakis, 2007; Tsoumakas et al., 2010). In this paper, since our SVM-ML can determine the relevant labels of a test instance directly, we choose four popular and indicative instance-based measures: accuracy, precision, recall and F1. Assume an evaluation data set of size m to be {(x_1, L_1), ..., (x_i, L_i), ..., (x_m, L_m)}. Given some instance x_i, its predicted set of labels from some multi-label classification algorithm is denoted by L_i^P ∈ 2^Q. The accuracy evaluates the percentage of relevant labels that are predicted correctly, out of the union of the predicted and true labels. The precision calculates the percentage of predicted labels that are relevant, while the recall computes the percentage of relevant labels that are predicted. Since precision and recall conflict with each other, the F1 expresses the harmonic mean of the two. The four measures are defined as follows,

\[
\begin{aligned}
\mathrm{Accuracy} &= \frac{1}{m}\sum_{i=1}^{m}\frac{|L_i \cap L_i^P|}{|L_i \cup L_i^P|}, &
\mathrm{Precision} &= \frac{1}{m}\sum_{i=1}^{m}\frac{|L_i \cap L_i^P|}{|L_i^P|}, \\
\mathrm{Recall} &= \frac{1}{m}\sum_{i=1}^{m}\frac{|L_i \cap L_i^P|}{|L_i|}, &
\mathrm{F1} &= \frac{1}{m}\sum_{i=1}^{m}\frac{2|L_i \cap L_i^P|}{|L_i| + |L_i^P|}.
\end{aligned}
\tag{34}
\]

The above four measures range between 0 and 1. It is desirable that a multi-label classification algorithm achieves larger values for all four measures.
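The following NumPy sketch (ours, not the paper's code) computes the four instance-based measures of (34) from true and predicted label sets represented as binary indicator matrices:

```python
import numpy as np

def instance_based_measures(Y_true, Y_pred):
    """Accuracy, precision, recall and F1 of (34) for binary (m, q) matrices.

    Assumes each instance has at least one true and one predicted label,
    so no denominator is zero.
    """
    inter = np.logical_and(Y_true, Y_pred).sum(axis=1).astype(float)
    union = np.logical_or(Y_true, Y_pred).sum(axis=1)
    true_sz = Y_true.sum(axis=1)
    pred_sz = Y_pred.sum(axis=1)
    accuracy = np.mean(inter / union)
    precision = np.mean(inter / pred_sz)
    recall = np.mean(inter / true_sz)
    f1 = np.mean(2 * inter / (true_sz + pred_sz))
    return accuracy, precision, recall, f1
```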

6.2. Five existing multi-label methods and five data sets

In this paper, we choose four multi-label classification methods from Zhang et al.'s work, i.e., BP-MLL, ML-kNN, ML-RBF, and ML-NB (Zhang & Zhou, 2006, 2007; Zhang, 2009; Zhang et al., 2009), whose free Matlab software is available at Zhang's homepage http://cse.seu.edu.cn/people/zhangml. But we recode ML-kNN using C++. On the other hand, we accept their recommended parameter settings. For BP-MLL, the number of hidden neurons is set to be 20% of the number of input neurons, the learning rate is 0.05, the number of training epochs is fixed to be 100, and the regularization constant is 0.1. For ML-kNN, the smoothing factor is s = 1 and the number of nearest instances is k = 10 with Euclidean distance. For ML-RBF, the fraction factor is α = 0.01 and the scaling factor is μ = 1.0. For ML-NB, the fraction of remaining features after PCA is set to be 0.3.

In order to improve the training efficiency of Rank-SVM, we use an analytical step length for Rank-SVM and recode its implementation using C++. This new version is still referred to as Rank-SVM in this paper. Our linear programming solver is the free LPSOL 5.5 based on an improved simplex method (http://www.cs.sunysb.edu/algorithm/implement/lpsolve/implement.shtml).

To compare our SVM-ML with the above mentioned five classification methods, we collect five benchmark data sets: Emotions, Medical, Scene, and Yeast from http://mulan.sourceforge.net/datasets.html, and Image from http://cse.seu.edu.cn/people/zhangml. Table 1 shows some useful statistics of these data sets, such as the number of instances in the training and test sets, the number of features, the number of labels, and the average number of labels. These data sets cover four distinct domains: text, scene, music and biology. Note that the Image data set is partitioned into training and test sets by us, while the other four training and test sets are downloaded directly. For more detailed information and descriptions of these data sets, please refer to their web sites and the references therein.

6.3. Tuning key parameters for SVM-ML and Rank-SVM on training sets

In this work, the RBF kernel k(x, y) = exp(−γ‖x − y‖²) is tested, where γ denotes the scale factor. In this case there are four tunable parameters in SVM-ML and Rank-SVM: the scale factor γ, the regularization constant C, the maximal number of iterations M and the stop criterion ε. To reduce the search space of possible parameter combinations, given ε = 0.0001, we construct a reduced procedure to tune the three key parameters (i.e., γ, C and M) for SVM-ML and Rank-SVM on the training sets: fixing two parameters by turns, we investigate the average accuracy as a function of the third parameter using threefold cross validation with two repeats, and then detect an optimal value of that parameter, as shown in Fig. 1.
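As a rough outline of this coordinate-wise tuning procedure (our own sketch; `cross_val_accuracy` is an assumed helper that trains the classifier with a given (gamma, C, M) combination and returns the average accuracy over threefold cross validation with two repeats):

```python
import numpy as np

def tune_one_parameter(grid, fixed, which, cross_val_accuracy):
    """Vary one parameter over `grid` while the other two stay fixed.

    fixed: dict with current values of 'gamma', 'C' and 'M';
    which: the name of the parameter being tuned.
    Returns the value with the highest cross-validated accuracy.
    """
    scores = [cross_val_accuracy(**dict(fixed, **{which: v})) for v in grid]
    return grid[int(np.argmax(scores))]

# Coordinate-wise search as in Section 6.3: gamma first, then C, then M.
# params = {'gamma': 1.0, 'C': 1.0, 'M': 50}
# params['gamma'] = tune_one_parameter([2.0**e for e in range(7, -8, -1)], params, 'gamma', cva)
# params['C']     = tune_one_parameter([2.0**e for e in range(7, -4, -1)], params, 'C', cva)
# params['M']     = tune_one_parameter([50, 100, 150, 200], params, 'M', cva)
```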

Given C = 1 and M = 50 (a default setting for Rank-SVM), the γ value is set to 2^7, 2^6, ..., and 2^−7, respectively. Fig. 1(a) (i.e., the left column) shows the average accuracy with respect to the 15 different γ values. It is observed that each accuracy curve has a maximum. On the Emotions, Image, Scene and Yeast data sets, SVM-ML achieves better accuracies than Rank-SVM does for most γ values and for the optimal one, but on the Medical data set Rank-SVM is consistently superior to our SVM-ML. The optimal γ values for the five data sets and two methods are listed in Table 2.

According to the optimal γ values from Fig. 1(a) and M = 50, Fig. 1(b) (i.e., the middle column) presents the average accuracy as a function of different C values from 2^7, 2^6, ..., 2^−2, to 2^−3. For the Emotions, Image, Scene and Yeast data sets, the accuracy curves fluctuate slightly, but our SVM-ML still outperforms Rank-SVM at the peak accuracies, which are higher than those in Fig. 1(a). The detected optimal C values are also shown in Table 2.


Fig. 1. Key parameter tuning procedure on the training sets, where, given two parameters, the average accuracy is a function of the third parameter. (a) The gamma (γ) varies from 2^7, 2^6, ..., 2^−7; (b) the C value from 2^7, 2^6, ..., 2^−3; and (c) the maximal iterations M are set to 50, 100, 150 and 200.


At last, we investigate whether the average accuracy can be boosted by increasing the number of iterations. Fig. 1(c) (i.e., the right column) indicates the average accuracy with respect to 50, 100, 150 and 200 iterations, given the optimal combination of γ and C from Fig. 1(a) and (b), respectively. We find that the accuracies on all five data sets can be improved. Especially, on the Medical data set, the obvious difference between SVM-ML and Rank-SVM in Fig. 1(a) and (b) vanishes at 200 iterations. Therefore we choose 200 iterations for the five data sets and two methods, as shown in Table 2.

In Table 2, we also list the average accuracies from the above experiments for the five data sets and two classifiers, which can also be regarded as a preliminary comparison between our SVM-ML and Rank-SVM. As shown in Table 2, our SVM-ML performs better than Rank-SVM on all five training data sets according to threefold cross validation with two repeats.

6.4. Comparison with five existing methods on test sets

In this sub-section, we compare SVM-ML with the five existing methods BP-MLL, ML-RBF, ML-kNN, ML-NB and Rank-SVM using

Table 2. The optimal parameter combinations and accuracy from 3-fold cross validation on the training sets.

            SVM-ML                              Rank-SVM
Data set    γ      C      M     Accuracy        γ      C      M     Accuracy
Emotions    2^-1   2      200   0.573           2^-1   2^-1   200   0.515
Image       2^-2   1      200   0.575           2^-2   2      200   0.563
Medical     2^-4   2^7    200   0.706           2^-5   2^6    200   0.705
Scene       2^-3   2      200   0.697           2^-4   2^2    200   0.655
Yeast       1      2      200   0.536           1      2^2    200   0.529

a train-test mode. According to the optimal parameter combinations in Table 2 for SVM-ML and Rank-SVM obtained from the training sets, and the recommended parameter settings for the other four methods in Section 6.2, we train all six multi-label classification methods on the five training sets, and then verify their performance on the five independent test sets in Table 1, respectively. The experimental results are shown in Tables 3–7 for the different data sets. To compare these methods, we rank them on each single measure, with the rank denoted by the integer in brackets (from the best, 1, to the worst, 6), and then calculate the average rank for each method, which is listed in the last row of each table. Such a comparison method was recommended in Brazdil and Soares (2000) and used in Tsoumakas et al. (2011) for comparing different methods on multiple data sets using a single measure. Here this method is used to compare different methods on a data set using multiple measures.

For the Emotions data set in Table 3, our SVM-ML achieves the best performance on the accuracy, recall and F1 measures, while Rank-SVM works the best on the precision measure. ML-NB is inferior to the other five methods on all four measures. According to the average rank, the order of the six methods is SVM-ML, Rank-SVM, BP-MLL, ML-kNN, ML-RBF and ML-NB.

On the Image data set in Table 4, our SVM-ML performs the best on all four measures and Rank-SVM also achieves a competitive performance. It is surprising that BP-MLL fails because of its unacceptable measure values. In terms of the average rank, these six methods are sorted as SVM-ML, Rank-SVM, ML-RBF, ML-kNN, ML-NB and BP-MLL.

In the above sub-section, SVM-ML behaves almost as well as Rank-SVM on the accuracy measure for the Medical data set. But in


Table 3. The performance of six methods on Emotions.

Measure        SVM-ML     Rank-SVM   ML-kNN     ML-RBF     ML-NB      BP-MLL
Accuracy       0.582(1)   0.540(2)   0.501(5)   0.510(4)   0.448(6)   0.528(3)
Precision      0.652(4)   0.682(1)   0.658(3)   0.633(5)   0.601(6)   0.661(2)
Recall         0.765(1)   0.611(3)   0.573(5)   0.594(4)   0.512(6)   0.646(2)
F1             0.675(1)   0.615(2)   0.585(4)   0.584(5)   0.525(6)   0.619(3)
Average rank   1.75       2.00       4.25       4.50       6.00       2.50

Table 4. The performance of six methods on Image.

Measure        SVM-ML     Rank-SVM   ML-kNN     ML-RBF     ML-NB      BP-MLL
Accuracy       0.603(1)   0.589(2)   0.475(4)   0.505(3)   0.461(5)   0.254(6)
Precision      0.670(1)   0.662(2)   0.543(4)   0.571(3)   0.516(5)   0.293(6)
Recall         0.743(1)   0.632(2)   0.486(3)   0.526(4)   0.518(5)   0.255(6)
F1             0.674(1)   0.628(2)   0.501(4)   0.535(3)   0.499(5)   0.267(6)
Average rank   1.00       2.00       3.75       3.25       5.00       6.00

Table 5. The performance of six methods on Medical.

Measure        SVM-ML     Rank-SVM   ML-kNN     ML-RBF     ML-NB      BP-MLL
Accuracy       0.721(1)   0.698(2)   0.496(5)   0.558(4)   0.341(6)   0.592(3)
Precision      0.750(2)   0.759(1)   0.544(5)   0.593(4)   0.369(6)   0.629(3)
Recall         0.875(1)   0.727(2)   0.503(6)   0.580(4)   0.549(5)   0.720(3)
F1             0.781(1)   0.727(2)   0.514(5)   0.576(4)   0.408(6)   0.649(3)
Average rank   1.25       1.75       5.25       4.00       5.75       3.00

Table 6. The performance of six methods on Scene.

Measure        SVM-ML     Rank-SVM   ML-kNN     ML-RBF     ML-NB      BP-MLL
Accuracy       0.689(1)   0.645(2)   0.629(3)   0.606(4)   0.562(5)   0.168(6)
Precision      0.711(1)   0.671(2)   0.661(3)   0.635(4)   0.584(5)   0.173(6)
Recall         0.849(1)   0.689(2)   0.655(3)   0.615(5)   0.621(4)   0.171(6)
F1             0.750(1)   0.668(2)   0.648(3)   0.619(4)   0.589(5)   0.170(6)
Average rank   1.00       2.00       3.00       4.25       4.75       6.00

Table 7. The performance of six methods on Yeast.

Measure        SVM-ML     Rank-SVM   ML-kNN     ML-RBF     ML-NB      BP-MLL
Accuracy       0.539(1)   0.533(2)   0.492(5)   0.498(4)   0.488(6)   0.518(3)
Precision      0.687(5)   0.710(2)   0.732(1)   0.700(3)   0.691(4)   0.666(6)
Recall         0.655(1)   0.628(3)   0.549(6)   0.577(4)   0.572(5)   0.644(2)
F1             0.645(1)   0.638(2)   0.599(5)   0.604(4)   0.599(6)   0.630(3)
Average rank   2.00       2.25       4.25       3.75       5.25       3.50

Fig. 2. The overall average rank of six multi-label methods.

Table 8. Comparison of training and test time (seconds) on Yeast.

Method     Language   Training time   Test time
SVM-ML     C++                 1214           4
Rank-SVM   C++               15,138           4
ML-kNN     C++                    6           4
ML-RBF     Matlab                16           3
ML-NB      Matlab              5098           2
BP-MLL     Matlab              4375           8


Table 5, SVM-ML is obviously superior to Rank-SVM on the accuracy measure and further achieves the best results on both recall and F1. However, Rank-SVM works the best on the precision measure. From the average rank, the six methods are ordered as SVM-ML, Rank-SVM, BP-MLL, ML-RBF, ML-kNN and ML-NB.

As for the Scene data set, our SVM-ML obtains the best performance on all four measures. Again, BP-MLL fails, as it does for Image. Although Rank-SVM is ranked second, its performance is much worse than that of our SVM-ML. According to the average rank, the six methods are ranked as SVM-ML, Rank-SVM, ML-kNN, ML-RBF, ML-NB and BP-MLL.

For the Yeast data set, our SVM-ML performs the best on all measures but precision, on which ML-kNN works the best. It is worth noting that the difference in average rank between the best method (SVM-ML) and the worst one (ML-NB) is the smallest among the five data sets. This is because the Yeast data set has been investigated elaborately in (Zhang & Zhou, 2006, 2007; Zhang, 2009; Zhang et al., 2009). From the average rank, the order of the six methods is SVM-ML, Rank-SVM, BP-MLL, ML-RBF, ML-kNN and ML-NB.

In order to compare the six methods comprehensively, we calculate an overall average rank over the five data sets for each method, and show them in Fig. 2 in increasing order. It is observed that SVM-ML is the best among the six methods, and Rank-SVM obtains the second position. It is interesting that ML-RBF, ML-kNN and BP-MLL perform almost equally well. But ML-NB is the worst candidate among the six compared methods.

In Section 4, we briefly pointed out the time complexity of SVM-ML and Rank-SVM. Here we evaluate the training and test time of the six methods for the typical data set Yeast, as shown in Table 8. The first three methods are coded using C++, while the last three use Matlab. BP-MLL and Rank-SVM are extremely time consuming because of their complicated optimization problems. In essence, ML-kNN, ML-RBF and ML-NB are implemented using the OVR transformation method. ML-NB runs the slowest among them because it spends a lot of time conducting feature selection using a genetic algorithm. ML-kNN and ML-RBF run very fast, which reflects the main advantage of transformation methods. Our SVM-ML runs about 12 times faster than its rival Rank-SVM. This benefit comes from the OVR transformation trick used in each iterative step. Our computational platform is a laptop computer with dual 1.8 GHz CPUs, 1 GB RAM, MFC 6.0 and Matlab 7.0.

According to the above experiments and analysis, it can be concluded that our SVM-ML is the strongest method, compared with the five state-of-the-art multi-label methods ML-kNN, ML-RBF, ML-NB, BP-MLL and Rank-SVM.


7. Conclusions

The existing support vector machine for multi-label classification (i.e., Rank-SVM) is a typical algorithm adaptation method, which considers all instances and all labels in a complicated quadratic programming problem. Additionally, Rank-SVM needs to learn a linear threshold function by regression to detect relevant labels. In this paper, a novel support vector machine for multi-label classification is presented, which is derived through simplifying Rank-SVM and adding a zero label. Our quadratic programming problem has special linear constraints, which are decoupled for different labels. When the Frank–Wolfe method is applied to our optimization problem, the entire linear programming problem of each step can be divided into a series of sub-problems by using the OVR transformation trick, which speeds up our training procedure greatly. For the Yeast data set, our novel method runs about 12 times faster than Rank-SVM under a C++ environment. An experimental study on five benchmark data sets demonstrates that our SVM-ML achieves rather competitive performance, compared with five typical multi-label classification methods, including ML-kNN, ML-RBF, ML-NB, BP-MLL and Rank-SVM.

Our study indicates that integrating problem transformation tricks into algorithm adaptation methods is an effective and efficient way to design and implement novel multi-label classification approaches. In future work we will conduct a model selection study to search for optimal key parameters more efficiently for our SVM-ML.

Acknowledgements

This work is supported by the Natural Science Foundation of China under Grant 60875001. The author thanks Prof. Zhi-Hua Zhou and Min-Ling Zhang, whose three Matlab software packages have been used in our experiments.

References

Barutcuoglu, Z., Schapire, R. E., & Troyanskaya, O. G. (2006). Hierarchical multi-label prediction of gene function. Bioinformatics, 22(7), 830–836.
Boutell, M. R., Luo, J., Shen, X., & Brown, C. M. (2004). Learning multi-label scene classification. Pattern Recognition, 37(9), 1757–1771.
Brazdil, P. B., & Soares, G. (2000). A comparison of ranking methods for classification algorithm selection. In Proceedings of the 11th European conference on machine learning (ECML 2000), LNCS, Barcelona, Spain (Vol. 1810, pp. 63–74).
Brinker, K., Furnkranz, J., & Hullermeier, E. (2006). A unified model for multilabel classification and ranking. In Proceedings of the 17th European conference on artificial intelligence (ECAI 2006), Riva del Garda, Italy (pp. 489–493).
Chang, Y. C., Chen, S. M., & Liau, C. J. (2008). Multilabel text categorization based on a new linear classifier learning method and a category-sensitive refinement method. Expert Systems with Applications, 34, 1948–1953.
Cheng, W., & Hullermeier, E. (2009). Combining instance-based learning and logistic regression for multi-label classification. Machine Learning, 76, 211–225.
Clare, A., & King, R. D. (2001). Knowledge discovery in multi-label phenotype data. In Proceedings of the 5th European conference on principles of data mining and knowledge discovery (PKDD 2001), LNCS, Freiburg, Baden-Württemberg, Germany (Vol. 2168, pp. 42–53).
Comite, F. D., Gilleron, R., & Tommasi, M. (2003). Learning multi-label alternating decision trees from texts and data. In Proceedings of the 3rd international conference on machine learning and data mining in pattern recognition (MLDM 2003), LNCS, Leipzig, Germany (Vol. 2734, pp. 35–49).
Dembczynski, K., Cheng, W., & Hullermeier, E. (2010). Bayes optimal multilabel classification via probabilistic classifier chains. In Proceedings of the 27th international conference on machine learning (ICML 2010), Haifa, Israel (pp. 279–286).
Dembczynski, K., Waegeman, W., Cheng, W., & Hullermeier, E. (2010). On label dependence in multi-label classification. In Working notes of the 2nd international conference on learning from multi-label data (MLD 2010), Haifa, Israel (pp. 5–12).
Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Pattern classification (2nd ed.). New York: John Wiley and Sons.
Elisseeff, A., & Weston, J. (2001). A kernel method for multi-labelled classification. In Proceedings of the 14th conference on neural information processing systems (NIPS 2001), Vancouver, British Columbia, Canada (pp. 681–687).
Frank, M., & Wolfe, P. (1956). An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1–2), 95–110.
Furnkranz, J., Hullermeier, E., Mencia, E. L., & Brinker, K. (2008). Multi-label classification via calibrated label ranking. Machine Learning, 73, 133–153.
Grodzicki, R., Mandziuk, J., & Wang, L. (2008). Improved multi-label classification with neural networks. In Proceedings of the 10th international conference on parallel problem solving from nature (PPSN 2008), LNCS, Dortmund, Germany (Vol. 5199, pp. 409–416).
Guelat, J., & Marcotte, P. (1986). Some comments on Wolfe's 'away step'. Mathematical Programming, 35, 110–119.
Hullermeier, E., Furnkranz, J., Cheng, W., & Brinker, K. (2008). Label ranking by learning pairwise preferences. Artificial Intelligence, 172, 1897–1916.
Jiang, A., Wang, C., & Zhu, Y. (2008). Calibrated Rank-SVM for multi-label image categorization. In Proceedings of the 2008 international joint conference on neural networks (IJCNN 2008), Hong Kong, China (pp. 1450–1455).
Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. In Proceedings of the 10th European conference on machine learning (ECML 1998), LNCS, Chemnitz, Germany (Vol. 1398, pp. 137–142).
Li, J., & Xu, J. (2009). A fast multi-label classification algorithm based on double label support vector machine. In Proceedings of the 2009 international conference on computational intelligence and security (CIS 2009), Beijing, China (Vol. 2, pp. 30–35).
Li, X., Wang, L., & Sung, E. (2004). Multi-label SVM active learning for image classification. In Proceedings of the 2004 international conference on image processing (ICIP 2004), Singapore (Vol. 4, pp. 2207–2210).
Mencia, E. L., & Furnkranz, J. (2008). Pairwise learning of multilabel classifications with perceptrons. In Proceedings of the 2008 IEEE international joint conference on neural networks (IJCNN 2008), Hong Kong, China (pp. 2900–2907).
Pavlidis, P., Weston, J., Cai, J., & Grundy, W. N. (2001). Combining microarray expression data and phylogenetic profiles to learn functional categories using support vector machines. In Proceedings of the fifth annual international conference on computational molecular biology (RECOMB 2001), Montreal, Canada (pp. 242–248).
Petrovskiy, M. (2006). Paired comparisons method for solving multi-label learning problem. In Proceedings of the 6th international conference on hybrid intelligent systems (HIS 2006), Auckland, New Zealand (p. 42).
Read, J. (2008). A pruned problem transformation method for multi-label classification. In Proceedings of the 2008 New Zealand computer science research student conference (NZCSRS 2008), Christchurch, New Zealand (pp. 143–150).
Read, J., Pfahringer, B., & Holmes, G. (2008). Multi-label classification using ensembles of pruned sets. In Proceedings of the 8th IEEE international conference on data mining (ICDM 2008), Pisa, Italy (pp. 995–1000).
Read, J., Pfahringer, B., Holmes, G., & Frank, E. (2009). Classifier chains for multi-label classification. In Proceedings of the European conference on machine learning and knowledge discovery in databases (ECML PKDD 2009, Part II), LNAI, Bled, Slovenia (Vol. 5782, pp. 254–269).
Schapire, R. E., & Singer, Y. (2000). BoosTexter: A boosting-based system for text categorization. Machine Learning, 39, 135–168.
Tax, D., & Duin, R. P. W. (2004). Support vector data description. Machine Learning, 54(1), 45–66.
Trohidis, K., Tsoumakas, G., Kalliris, G., & Vlahavas, I. (2008). Multi-label classification of music into emotions. In Proceedings of the 9th international conference on music information retrieval (ISMIR 2008), Philadelphia, PA, USA (pp. 325–330).
Tsoumakas, G., & Katakis, I. (2007). Multi-label classification: An overview. International Journal of Data Warehousing and Mining, 3(3), 1–13.
Tsoumakas, G., Katakis, I., & Vlahavas, I. (2010). Mining multi-label data. In Maimon, O., & Rokach, L. (Eds.), Data mining and knowledge discovery handbook (2nd ed., pp. 667–685). New York: Springer.
Tsoumakas, G., Katakis, I., & Vlahavas, I. (2011). Random k-labelsets for multi-label classification. IEEE Transactions on Knowledge and Data Engineering, 23(7), 1079–1089.
Tsoumakas, G., & Vlahavas, I. (2007). Random k-labelsets: An ensemble method for multi-label classification. In Proceedings of the 18th European conference on machine learning (ECML 2007), LNCS, Warsaw, Poland (Vol. 4701, pp. 406–417).
Vapnik, V. N. (1998). Statistical learning theory. New York: John Wiley and Sons.
Wan, S., & Xu, J. (2007). A multi-label classification algorithm based on triple class support vector machine. In Proceedings of the 2007 IEEE international conference on wavelet analysis and pattern recognition (ICWAPR 2007), Beijing, China (pp. 1447–1452).
Wang, L., Chang, M., & Feng, J. (2005). Parallel and sequential support vector machines for multi-label classification. International Journal of Information Technology, 11(9), 11–18.
Xu, J. (2010). Constructing a fast algorithm for multi-label classification with support vector data description. In Proceedings of the 2010 IEEE international conference on granular computing (GrC 2010), Silicon Valley, CA, USA (pp. 817–821).
Zhang, M. L. (2009). ML-RBF: RBF neural networks for multi-label learning. Neural Processing Letters, 29, 61–74.
Zhang, M. L., Pena, J. M., & Robles, V. (2009). Feature selection for multi-label naive Bayes classification. Information Sciences, 179, 3218–3229.
Zhang, M. L., & Zhou, Z. H. (2006). Multilabel neural networks with applications to functional genomics and text categorization. IEEE Transactions on Knowledge and Data Engineering, 18(10), 1338–1351.
Zhang, M. L., & Zhou, Z. H. (2007). ML-kNN: A lazy learning approach to multi-label learning. Pattern Recognition, 40(5), 2038–2048.