
Lecture 18: Multiclass Support Vector Machines

Hao Helen Zhang


Outlines

Traditional Methods for Multiclass Problems
- One-vs-rest approaches
- Pairwise approaches

Recent Developments for Multiclass Problems
- Simultaneous classification
- Various loss functions

Extensions of SVM


Multiclass Classification Setup

Labels: the binary label set $\{-1,+1\}$ is replaced by $\{1, 2, \ldots, K\}$.

Classification decision rule:
$$f : \mathbb{R}^d \rightarrow \{1, 2, \ldots, K\}.$$

Classification accuracy is measured by
- Equal-cost: the generalization error (GE),
$$\mathrm{Err}(f) = P(Y \neq f(X)).$$
- Unequal-cost: the risk,
$$R(f) = E_{Y,X}\, C(Y, f(X)),$$
where $C(y, k)$ is the cost of predicting class $k$ when the true class is $y$.
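For concreteness, a minimal numpy sketch (my own, not from the slides) of estimating both quantities from labeled data; the toy labels and the cost matrix `cost` (with `cost[y, k]` the cost of predicting class k when the truth is y, here 0-indexed) are illustrative assumptions:

```python
import numpy as np

def empirical_error(y_true, y_pred):
    """Empirical estimate of the generalization error P(Y != f(X))."""
    return np.mean(y_true != y_pred)

def empirical_risk(y_true, y_pred, cost):
    """Empirical estimate of E C(Y, f(X)) for a K x K cost matrix."""
    return np.mean(cost[y_true, y_pred])

# Toy illustration with K = 3 classes labeled 0, 1, 2.
y_true = np.array([0, 1, 2, 2, 1])
y_pred = np.array([0, 2, 2, 1, 1])
cost = np.ones((3, 3)) - np.eye(3)           # equal costs recover the 0-1 error
print(empirical_error(y_true, y_pred))        # 0.4
print(empirical_risk(y_true, y_pred, cost))   # 0.4 under equal costs
```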


Traditional Methods

Main ideas:

(i) Decompose the multiclass classification problem into multiple binary classification problems.

(ii) Use the majority voting principle (a combined decision from the committee) to predict the label.

Common approaches: simple but effective

One-vs-rest (one-vs-all) approaches

Pairwise (one-vs-one, all-vs-all) approaches


One-vs-rest Approach

One of the simplest multiclass classifiers; commonly used with SVMs; also known as the one-vs-all (OVA) approach.

(i) Solve $K$ different binary problems: classify "class $k$" versus "the rest" for $k = 1, \cdots, K$.

(ii) Assign a test sample to the class giving the largest (most positive) value $f_k(x)$, where $f_k(x)$ is the solution of the $k$th problem.

Properties:
- Very simple to implement; performs well in practice.
- Not optimal (asymptotically): the decision rule is not Fisher consistent if there is no dominating class (i.e. $\max_k p_k(x) < 1/2$).
- Read: Rifkin and Klautau (2004), "In Defense of One-vs-All Classification".
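As an aside (not part of the lecture), the OVA recipe can be sketched with scikit-learn's LinearSVC; step (i) fits K binary classifiers and step (ii) is the argmax over their decision values:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)          # K = 3 classes
K = np.unique(y).size

# Step (i): K binary problems, "class k" vs "the rest".
clfs = [LinearSVC(C=1.0, max_iter=10000).fit(X, (y == k).astype(int))
        for k in range(K)]

# Step (ii): assign x to the class with the largest (most positive) f_k(x).
scores = np.column_stack([clf.decision_function(X) for clf in clfs])
y_pred = np.argmax(scores, axis=1)
print("training accuracy:", np.mean(y_pred == y))
```

scikit-learn also ships OneVsRestClassifier, and LinearSVC itself defaults to an OVA-style multiclass scheme.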


Pairwise Approach

Also known as all-vs-all (AVA) approach

(i) Solve $\binom{K}{2}$ different binary problems: classify "class $k$" versus "class $j$" for all $j \neq k$. Each classifier is called $g_{kj}$.

(ii) For prediction at a point, each classifier is queried once and issues a vote. The class with the maximum number of (weighted) votes is the winner.

Properties:
- The training process is efficient, since it deals with small binary problems.
- If $K$ is big, there are too many problems to solve. If $K = 10$, we need to train 45 binary classifiers.
- Simple to implement; performs competitively in practice.
- Read: Park and Fürnkranz (2007), "Efficient Pairwise Classification".
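A matching sketch (again illustrative, not from the slides) using scikit-learn's OneVsOneClassifier, which fits the K(K−1)/2 pairwise classifiers and predicts by voting:

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)        # K = 3, so K(K-1)/2 = 3 pairwise problems

# Each base classifier g_kj sees only the samples from classes k and j;
# at prediction time every classifier votes and the majority wins.
ovo = OneVsOneClassifier(LinearSVC(C=1.0, max_iter=10000)).fit(X, y)
print("number of pairwise classifiers:", len(ovo.estimators_))   # 3
print("training accuracy:", ovo.score(X, y))
```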


One Single SVM approach: Simultaneous Classification

Labels: $\{-1,+1\} \rightarrow \{1, 2, \ldots, K\}$. Use one single SVM to construct a decision function vector
$$\mathbf{f} = (f_1, \ldots, f_K).$$

Classifier (decision rule):
$$f(x) = \arg\max_{k=1,\cdots,K} f_k(x).$$

If $K = 2$, there is effectively one $f_k$ and the decision rule reduces to $\mathrm{sign}(f_k)$.

In some sense, multiple logistic regression is a simultaneous classification procedure.
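A tiny numpy illustration of the argmax rule; the decision values in F are made up for the example:

```python
import numpy as np

# Hypothetical decision values f_k(x_i) for n = 4 points and K = 3 classes.
F = np.array([[ 0.9, -0.3, -0.6],
              [-0.2,  0.7, -0.5],
              [-0.4, -0.4,  0.8],
              [ 0.1,  0.2, -0.3]])

labels = np.argmax(F, axis=1) + 1   # +1 to match the class labels 1, ..., K
print(labels)                       # [1 2 3 2]
```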


[Figure: a toy three-class example in the $(x_1, x_2)$ plane; the regions labeled Class 1, Class 2, and Class 3 correspond to $f_1 > \max(f_2, f_3)$, $f_2 > \max(f_1, f_3)$, and $f_3 > \max(f_1, f_2)$, respectively.]


[Figure: the three decision regions (polyhedron one, two, and three) are separated by the boundaries $f_1 - f_2 = 0$, $f_1 - f_3 = 0$, and $f_2 - f_3 = 0$, with margin lines $f_1 - f_2 = 1$, $f_2 - f_1 = 1$, $f_1 - f_3 = 1$, $f_3 - f_1 = 1$, $f_2 - f_3 = 1$, and $f_3 - f_2 = 1$.]


SVM for Multiclass Problem

Multiclass SVM: solve one single regularization problem by imposing a penalty on the values of $f_y(x) - f_l(x)$, $l \neq y$.

- Weston and Watkins (1999)
- Crammer and Singer (2002)
- Lee et al. (2004)
- Liu and Shen (2006); multiclass $\psi$-learning: Shen et al. (2003)


Various Multiclass SVMs

Weston and Watkins (1999): a penalty is imposed only if $f_y(x) < f_k(x) + 2$ for $k \neq y$.
- Even if $f_y(x) < 1$, a penalty is not imposed as long as $f_k(x)$ is sufficiently small for $k \neq y$.
- Similarly, if $f_k(x) > 1$ for some $k \neq y$, we do not pay a penalty if $f_y(x)$ is sufficiently large.
$$L(y, \mathbf{f}(x)) = \sum_{k \neq y} \left[2 - (f_y(x) - f_k(x))\right]_+.$$

Lee et al. (2004):
$$L(y, \mathbf{f}(x)) = \sum_{k \neq y} \left[f_k(x) + 1\right]_+.$$

Crammer and Singer (2002), Liu and Shen (2006):
$$L(y, \mathbf{f}(x)) = \left[1 - \min_{k \neq y}\{f_y(x) - f_k(x)\}\right]_+.$$

To avoid the redundancy, a sum-to-zero constraint $\sum_{k=1}^K f_k = 0$ is sometimes enforced.
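For reference, a small numpy sketch (mine, not from the lecture) of the three losses at a single point, with 0-indexed class labels and f the vector of decision values:

```python
import numpy as np

def weston_watkins(f, y):
    """Weston and Watkins (1999): sum_{k != y} [2 - (f_y - f_k)]_+."""
    diffs = f[y] - np.delete(f, y)
    return np.sum(np.maximum(0.0, 2.0 - diffs))

def lee_lin_wahba(f, y):
    """Lee et al. (2004): sum_{k != y} [f_k + 1]_+."""
    return np.sum(np.maximum(0.0, np.delete(f, y) + 1.0))

def crammer_singer(f, y):
    """Crammer and Singer (2002), Liu and Shen (2006): [1 - min_{k != y} (f_y - f_k)]_+."""
    diffs = f[y] - np.delete(f, y)
    return max(0.0, 1.0 - np.min(diffs))

f = np.array([1.2, -0.4, -0.8])   # satisfies the sum-to-zero constraint
print(weston_watkins(f, 0), lee_lin_wahba(f, 0), crammer_singer(f, 0))
```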


Linear Multiclass SVMs

For linear classification problems, we have

$$f_k(x) = \boldsymbol{\beta}_k^T x + \beta_{0k}, \quad k = 1, \cdots, K.$$

The sum-to-zero constraint can be replaced by
$$\sum_{k=1}^K \beta_{0k} = 0, \qquad \sum_{k=1}^K \boldsymbol{\beta}_k = \mathbf{0}.$$

The optimization problem becomes
$$\min_{\mathbf{f}} \sum_{i=1}^n L(y_i, \mathbf{f}(x_i)) + \lambda \sum_{k=1}^K \|\boldsymbol{\beta}_k\|^2$$
subject to the sum-to-zero constraint.
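As a usage note rather than lecture material, scikit-learn's LinearSVC exposes a Crammer-Singer style simultaneous linear multiclass SVM through multi_class='crammer_singer'; its exact regularization details differ slightly from the formulation above, so treat this as an illustration only:

```python
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

# Joint (simultaneous) linear multiclass SVM; coef_ holds one beta_k per class.
msvm = LinearSVC(multi_class="crammer_singer", C=1.0, max_iter=10000).fit(X, y)
print(msvm.coef_.shape)     # (3, 4): K rows, one beta_k per class
print(msvm.score(X, y))
```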


Nonlinear Multiclass SVMs

To achieve nonlinear classification, we assume
$$f_k(x) = \boldsymbol{\beta}_k' \phi(x) + \beta_{k0}, \quad k = 1, \cdots, K,$$
where $\phi(x)$ represents the basis functions in the feature space $\mathcal{F}$.

As in the binary case, the nonlinear MSVM can be conveniently solved using a kernel function.


Regularization Problems for Nonlinear MSVMs

We can represent the MSVM as the solution to a regularization problem in an RKHS.

Assume that
$$\mathbf{f}(x) = (f_1(x), \cdots, f_K(x)) \in \prod_{k=1}^K \left(\{1\} + \mathcal{H}_k\right)$$
under the sum-to-zero constraint.

Then an MSVM classifier can be derived by solving
$$\min_{\mathbf{f}} \sum_{i=1}^n L(y_i, \mathbf{f}(x_i)) + \lambda \sum_{k=1}^K \|g_k\|^2_{\mathcal{H}_k},$$
where $f_k(x) = g_k(x) + \beta_{0k}$, $g_k \in \mathcal{H}_k$, $\beta_{0k} \in \mathbb{R}$.


Generalized Functional Margin

Given $(x, y)$, a reasonable decision vector $\mathbf{f}(x)$ should
- encourage a large value for $f_y(x)$;
- have small values for $f_k(x)$, $k \neq y$.

Define the $(K-1)$-vector of relative differences as
$$\mathbf{g} = \left(f_y(x) - f_1(x), \cdots, f_y(x) - f_{y-1}(x),\; f_y(x) - f_{y+1}(x), \cdots, f_y(x) - f_K(x)\right).$$

Liu et al. (2004) called the vector $\mathbf{g}$ the generalized functional margin of $\mathbf{f}$.

- $\mathbf{g}$ characterizes the correctness and strength of the classification of $x$ by $\mathbf{f}$.
- $\mathbf{f}$ indicates a correct classification of $(x, y)$ if $\mathbf{g}(\mathbf{f}(x), y) > \mathbf{0}_{K-1}$ (elementwise).
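A short sketch (mine) of the generalized functional margin for one point; the decision vector f is made up:

```python
import numpy as np

def functional_margin(f, y):
    """Generalized functional margin g(f(x), y): the (K-1)-vector f_y - f_k, k != y."""
    return f[y] - np.delete(f, y)

f = np.array([0.2, 0.9, -1.1])      # decision vector at x, K = 3
g = functional_margin(f, y=1)       # true class is the second one
print(g)                            # [0.7 2. ]
print(bool(np.all(g > 0)))          # True: (x, y) is classified correctly
```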


0-1 Loss with Functional Margin

A point $(x, y)$ is misclassified if $y \neq \arg\max_k f_k(x)$.

Define the multivariate sign function as
$$\mathrm{sign}(\mathbf{u}) = \begin{cases} 1 & \text{if } u_{\min} = \min(u_1, \cdots, u_m) > 0, \\ -1 & \text{if } u_{\min} \le 0, \end{cases}$$
where $\mathbf{u} = (u_1, \cdots, u_m)$. Using the functional margin,
- the 0-1 loss becomes
$$I\left(\min \mathbf{g}(\mathbf{f}(x), y) < 0\right) = \frac{1}{2}\left[1 - \mathrm{sign}(\mathbf{g}(\mathbf{f}(x), y))\right];$$
- the GE becomes
$$R[\mathbf{f}] = \frac{1}{2} E\left[1 - \mathrm{sign}(\mathbf{g}(\mathbf{f}(X), Y))\right].$$
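A correspondingly small sketch (mine) of the multivariate sign and the margin form of the 0-1 loss, using the same made-up decision vector as above:

```python
import numpy as np

def multivariate_sign(u):
    """sign(u) = 1 if min(u) > 0, and -1 otherwise."""
    return 1.0 if np.min(u) > 0 else -1.0

def zero_one_loss(f, y):
    """0-1 loss written through the functional margin: (1 - sign(g))/2."""
    g = f[y] - np.delete(f, y)
    return 0.5 * (1.0 - multivariate_sign(g))

f = np.array([0.2, 0.9, -1.1])
print(zero_one_loss(f, y=1))   # 0.0, correctly classified
print(zero_one_loss(f, y=0))   # 1.0, misclassified
```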


Generalized Loss Functions Using Functional Margin

A natural way to generalize the binary loss is
$$\sum_{i=1}^n \ell\left(\min \mathbf{g}(\mathbf{f}(x_i), y_i)\right).$$

In particular, the loss function $L(y, \mathbf{f}(x))$ can be expressed as $V(\mathbf{g}(\mathbf{f}(x), y))$ with
- Weston and Watkins (1999): $V(\mathbf{u}) = \sum_{j=1}^{K-1} [2 - u_j]_+$.
- Lee et al. (2004): $V(\mathbf{u}) = \sum_{j=1}^{K-1} \left[\frac{1}{K}\sum_{c=1}^{K-1} u_c - u_j + 1\right]_+$.
- Liu and Shen (2006): $V(\mathbf{u}) = [1 - \min_j u_j]_+$.

All of these loss functions are upper bounds of the 0-1 loss.
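The upper-bound claim is easy to check numerically; in the sketch below (my own), u plays the role of the margin vector g and the 0-1 loss equals 1 whenever min(u) ≤ 0:

```python
import numpy as np

def V_ww(u):      # Weston and Watkins
    return np.sum(np.maximum(0.0, 2.0 - u))

def V_lee(u):     # Lee et al.; here K = len(u) + 1
    K = len(u) + 1
    return np.sum(np.maximum(0.0, np.sum(u) / K - u + 1.0))

def V_ls(u):      # Liu and Shen / Crammer and Singer
    return max(0.0, 1.0 - np.min(u))

rng = np.random.default_rng(0)
for _ in range(1000):
    u = rng.normal(size=3)                   # K = 4 classes
    zero_one = 1.0 if np.min(u) <= 0 else 0.0
    assert min(V_ww(u), V_lee(u), V_ls(u)) >= zero_one   # all are upper bounds
```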


Small Round Blue Cell Tumors of Childhood
- Khan et al. (2001) in Nature Medicine
- Tumor types: neuroblastoma (NB), rhabdomyosarcoma (RMS), non-Hodgkin lymphoma (NHL), and the Ewing family of tumors (EWS)
- Number of genes: 2308
- Class distribution of the data set:

Data set        EWS   BL(NHL)   NB   RMS   Total
Training set     23         8   12    20      63
Test set          6         3    6     5      20
Total            29        11   18    25      83


[Figure 3: Three principal components (PC1, PC2, PC3) of 100 gene expression levels (circles: training samples, squares: test samples including non-SRBCT samples). The tumor types (EWS, BL, NB, RMS) are distinguished by colors.]


Characteristics of Support Vector Machines

High accuracy, high flexibility

Naturally handle high-dimensional data

Sparse representation of the solutions (via support vectors): fast for making future predictions

No probability estimates (hard classifiers)


Other Active Problems in SVM

Variable/Feature Selection

- Linear SVM: Bradley and Mangasarian (1998), Guyon et al. (2000), Rakotomamonjy (2003), Jebara and Jaakkola (2000)
- Nonlinear SVM: Weston et al. (2002), Grandvalet (2003), basis pursuit (Zhang, 2003), COSSO selection (Lin and Zhang, 2003)

Proximal SVM – faster computation

Robust SVM – get rid of outliers

Choice of kernels


The L1 SVM

Replace the L2 penalty by the L1 penalty.

The L1 penalty tends to give sparse solutions.

For $f(x) = h(x)^T\boldsymbol{\beta} + \beta_0$, the L1 SVM solves
$$\min_{\beta_0, \boldsymbol{\beta}} \sum_{i=1}^n \left[1 - y_i f(x_i)\right]_+ + \lambda \sum_{j=1}^d |\beta_j|. \qquad (1)$$

The solution will have at most n nonzero coefficients βj .
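scikit-learn does not solve (1) exactly, but its LinearSVC with penalty='l1' fits a closely related L1-penalized linear SVM (it uses the squared hinge), which is enough to illustrate the sparsity; the synthetic data below is an assumption of this sketch:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=100, n_features=50, n_informative=5,
                           random_state=0)

# L1-penalized linear SVM (squared hinge); a close relative of problem (1),
# shown here only to illustrate the sparsity of the solution.
l1_svm = LinearSVC(penalty="l1", loss="squared_hinge", dual=False, C=0.1).fit(X, y)
print("nonzero coefficients:", np.sum(l1_svm.coef_ != 0), "out of", l1_svm.coef_.size)
```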


L1 Penalty versus L2 Penalty

[Figure: the L1 penalty $J(w) = |w|$ and the L2 penalty $J(w) = w^2$ plotted against $w$ over $[-2, 2]$; the L1 penalty has a kink at $w = 0$, which encourages sparse solutions.]


Robust Support Vector Machines

Hinge loss is unbounded, hence sensitive to outliers (e.g., wrong labels).

Support vectors: $y_i f(x_i) \le 1$.

Truncated hinge loss: $T_s(u) = H_1(u) - H_s(u)$, where
$$H_s(u) = [s - u]_+.$$

Remove some “bad” SVs (Wu and Liu, 2006).
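A minimal numpy sketch (mine) of the truncated hinge built from these definitions, with the truncation point s = −1 chosen arbitrarily:

```python
import numpy as np

def hinge(u, s=1.0):
    """H_s(u) = [s - u]_+."""
    return np.maximum(0.0, s - u)

def truncated_hinge(u, s=-1.0):
    """T_s(u) = H_1(u) - H_s(u): the usual hinge, capped at 1 - s for u <= s."""
    return hinge(u, 1.0) - hinge(u, s)

u = np.array([-3.0, -1.0, 0.0, 0.5, 1.0, 2.0])   # u = y * f(x)
print(hinge(u))             # [4.  2.  1.  0.5 0.  0. ]  unbounded as u -> -inf
print(truncated_hinge(u))   # [2.  2.  1.  0.5 0.  0. ]  bounded by 1 - s = 2
```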


Decomposition: Difference of Convex Functions

Key: the D.C. decomposition (difference of convex functions): $T_s(u) = H_1(u) - H_s(u)$.

[Figure: three panels plotting $H_1(z)$, $H_s(z)$, and $T_s(z)$ against $z$; the hinge $H_1$ grows without bound as $z \to -\infty$, while the truncated hinge $T_s$ levels off at $1 - s$ for $z \le s$.]


D.C. Algorithm

D.C. Algorithm: the difference-of-convex algorithm for minimizing $J(\Theta) = J_{vex}(\Theta) + J_{cav}(\Theta)$:

1. Initialize $\Theta^0$.

2. Repeat $\Theta^{t+1} = \arg\min_{\Theta} \left(J_{vex}(\Theta) + \left\langle J'_{cav}(\Theta^t),\, \Theta - \Theta^t \right\rangle\right)$ until convergence of $\Theta^t$.

The algorithm converges in finitely many steps (Liu et al., 2005).

Choice of initial values: use the standard SVM solution.

RSVM: the set of SVs is only a SUBSET of the original one!

Nonlinear learning can be achieved by the kernel trick.
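A generic toy sketch (my own, not the RSVM implementation) of the D.C. iteration; the convex and concave parts J_vex and J_cav are hypothetical stand-ins, chosen so that the convexified subproblem has a closed-form minimizer:

```python
import numpy as np

# Toy d.c. objective J(theta) = J_vex(theta) + J_cav(theta).
J_vex  = lambda t: t**2                        # convex part
J_cav  = lambda t: -np.sqrt(1.0 + t**2)        # concave part
dJ_cav = lambda t: -t / np.sqrt(1.0 + t**2)    # derivative of the concave part

theta = 5.0                                    # 1. initialize Theta_0
for _ in range(100):
    # 2. minimize J_vex(t) + <dJ_cav(theta), t - theta>; here the convex
    #    subproblem has the closed form 2 t + dJ_cav(theta) = 0.
    new_theta = -dJ_cav(theta) / 2.0
    if abs(new_theta - theta) < 1e-10:         # until convergence of Theta_t
        break
    theta = new_theta

print(round(theta, 6))   # ~0, the global minimizer of J for this toy objective
```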
