
Lecture 18: Multiclass Support Vector Machines

Hao Helen Zhang


Outlines

Traditional Methods for Multiclass Problems
- One-vs-rest approaches
- Pairwise approaches

Recent Developments for Multiclass Problems
- Simultaneous classification
- Various loss functions

Extensions of SVM


Multiclass Classification Setup

Labels: the binary label set $\{-1,+1\}$ is replaced by $\{1, 2, \ldots, K\}$.

Classification decision rule:
$$f : \mathbb{R}^d \rightarrow \{1, 2, \ldots, K\}.$$

Classification accuracy is measured by
- Equal-cost: the generalization error (GE),
$$\mathrm{Err}(f) = P(Y \neq f(X)).$$
- Unequal-cost: the risk,
$$R(f) = E_{Y,X}\, C(Y, f(X)),$$
where $C(y, k)$ is the cost of predicting class $k$ when the true class is $y$.
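For concreteness, a minimal numpy sketch (my own, not from the slides) of estimating both quantities from labeled data; the toy labels and the cost matrix `cost` (with `cost[y, k]` the cost of predicting class k when the truth is y, here 0-indexed) are illustrative assumptions:

```python
import numpy as np

def empirical_error(y_true, y_pred):
    """Empirical estimate of the generalization error P(Y != f(X))."""
    return np.mean(y_true != y_pred)

def empirical_risk(y_true, y_pred, cost):
    """Empirical estimate of E C(Y, f(X)) for a K x K cost matrix."""
    return np.mean(cost[y_true, y_pred])

# Toy illustration with K = 3 classes labeled 0, 1, 2.
y_true = np.array([0, 1, 2, 2, 1])
y_pred = np.array([0, 2, 2, 1, 1])
cost = np.ones((3, 3)) - np.eye(3)           # equal costs recover the 0-1 error
print(empirical_error(y_true, y_pred))        # 0.4
print(empirical_risk(y_true, y_pred, cost))   # 0.4 under equal costs
```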


Traditional Methods

Main ideas:

(i) Decompose the multiclass classification problem into multiple binary classification problems.

(ii) Use the majority voting principle (a combined decision from the committee) to predict the label.

Common approaches: simple but effective

One-vs-rest (one-vs-all) approaches

Pairwise (one-vs-one, all-vs-all) approaches


One-vs-rest Approach

One of the simplest multiclass classifiers; commonly used with SVMs; also known as the one-vs-all (OVA) approach.

(i) Solve $K$ different binary problems: classify "class $k$" versus "the rest" for $k = 1, \cdots, K$.

(ii) Assign a test sample to the class giving the largest (most positive) value $f_k(x)$, where $f_k(x)$ is the solution of the $k$th problem.

Properties:
- Very simple to implement; performs well in practice.
- Not optimal (asymptotically): the decision rule is not Fisher consistent if there is no dominating class (i.e. $\max_k p_k(x) < 1/2$).
- Read: Rifkin and Klautau (2004), "In Defense of One-vs-All Classification".
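As an aside (not part of the lecture), the OVA recipe can be sketched with scikit-learn's LinearSVC; step (i) fits K binary classifiers and step (ii) is the argmax over their decision values:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)          # K = 3 classes
K = np.unique(y).size

# Step (i): K binary problems, "class k" vs "the rest".
clfs = [LinearSVC(C=1.0, max_iter=10000).fit(X, (y == k).astype(int))
        for k in range(K)]

# Step (ii): assign x to the class with the largest (most positive) f_k(x).
scores = np.column_stack([clf.decision_function(X) for clf in clfs])
y_pred = np.argmax(scores, axis=1)
print("training accuracy:", np.mean(y_pred == y))
```

scikit-learn also ships OneVsRestClassifier, and LinearSVC itself defaults to an OVA-style multiclass scheme.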


Pairwise Approach

Also known as all-vs-all (AVA) approach

(i) Solve $\binom{K}{2}$ different binary problems: classify "class $k$" versus "class $j$" for all $j \neq k$. Each classifier is called $g_{kj}$.

(ii) For prediction at a point, each classifier is queried once and issues a vote. The class with the maximum number of (weighted) votes is the winner.

Properties:
- The training process is efficient, since it deals with small binary problems.
- If $K$ is big, there are too many problems to solve. If $K = 10$, we need to train 45 binary classifiers.
- Simple to implement; performs competitively in practice.
- Read: Park and Fürnkranz (2007), "Efficient Pairwise Classification".
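A matching sketch (again illustrative, not from the slides) using scikit-learn's OneVsOneClassifier, which fits the K(K−1)/2 pairwise classifiers and predicts by voting:

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)        # K = 3, so K(K-1)/2 = 3 pairwise problems

# Each base classifier g_kj sees only the samples from classes k and j;
# at prediction time every classifier votes and the majority wins.
ovo = OneVsOneClassifier(LinearSVC(C=1.0, max_iter=10000)).fit(X, y)
print("number of pairwise classifiers:", len(ovo.estimators_))   # 3
print("training accuracy:", ovo.score(X, y))
```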


One Single SVM approach: Simultaneous Classification

Labels: $\{-1,+1\} \rightarrow \{1, 2, \ldots, K\}$. Use one single SVM to construct a decision function vector
$$\mathbf{f} = (f_1, \ldots, f_K).$$

Classifier (decision rule):
$$f(x) = \arg\max_{k=1,\cdots,K} f_k(x).$$

If $K = 2$, there is effectively one $f_k$ and the decision rule reduces to $\mathrm{sign}(f_k)$.

In some sense, multiple logistic regression is a simultaneous classification procedure.
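A tiny numpy illustration of the argmax rule; the decision values in F are made up for the example:

```python
import numpy as np

# Hypothetical decision values f_k(x_i) for n = 4 points and K = 3 classes.
F = np.array([[ 0.9, -0.3, -0.6],
              [-0.2,  0.7, -0.5],
              [-0.4, -0.4,  0.8],
              [ 0.1,  0.2, -0.3]])

labels = np.argmax(F, axis=1) + 1   # +1 to match the class labels 1, ..., K
print(labels)                       # [1 2 3 2]
```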


[Figure: a toy three-class example in the $(x_1, x_2)$ plane; the regions labeled Class 1, Class 2, and Class 3 correspond to $f_1 > \max(f_2, f_3)$, $f_2 > \max(f_1, f_3)$, and $f_3 > \max(f_1, f_2)$, respectively.]


[Figure: the three decision regions (polyhedron one, two, and three) are separated by the boundaries $f_1 - f_2 = 0$, $f_1 - f_3 = 0$, and $f_2 - f_3 = 0$, with margin lines $f_1 - f_2 = 1$, $f_2 - f_1 = 1$, $f_1 - f_3 = 1$, $f_3 - f_1 = 1$, $f_2 - f_3 = 1$, and $f_3 - f_2 = 1$.]


SVM for Multiclass Problem

Multiclass SVM: solve one single regularization problem by imposing a penalty on the values of $f_y(x) - f_l(x)$, $l \neq y$.

- Weston and Watkins (1999)
- Crammer and Singer (2002)
- Lee et al. (2004)
- Liu and Shen (2006); multiclass $\psi$-learning: Shen et al. (2003)


Various Multiclass SVMs

Weston and Watkins (1999): a penalty is imposed only if $f_y(x) < f_k(x) + 2$ for $k \neq y$.
- Even if $f_y(x) < 1$, a penalty is not imposed as long as $f_k(x)$ is sufficiently small for $k \neq y$.
- Similarly, if $f_k(x) > 1$ for some $k \neq y$, we do not pay a penalty if $f_y(x)$ is sufficiently large.
$$L(y, \mathbf{f}(x)) = \sum_{k \neq y} \left[2 - (f_y(x) - f_k(x))\right]_+.$$

Lee et al. (2004):
$$L(y, \mathbf{f}(x)) = \sum_{k \neq y} \left[f_k(x) + 1\right]_+.$$

Crammer and Singer (2002), Liu and Shen (2006):
$$L(y, \mathbf{f}(x)) = \left[1 - \min_{k \neq y}\{f_y(x) - f_k(x)\}\right]_+.$$

To avoid the redundancy, a sum-to-zero constraint $\sum_{k=1}^K f_k = 0$ is sometimes enforced.
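For reference, a small numpy sketch (mine, not from the lecture) of the three losses at a single point, with 0-indexed class labels and f the vector of decision values:

```python
import numpy as np

def weston_watkins(f, y):
    """Weston and Watkins (1999): sum_{k != y} [2 - (f_y - f_k)]_+."""
    diffs = f[y] - np.delete(f, y)
    return np.sum(np.maximum(0.0, 2.0 - diffs))

def lee_lin_wahba(f, y):
    """Lee et al. (2004): sum_{k != y} [f_k + 1]_+."""
    return np.sum(np.maximum(0.0, np.delete(f, y) + 1.0))

def crammer_singer(f, y):
    """Crammer and Singer (2002), Liu and Shen (2006): [1 - min_{k != y} (f_y - f_k)]_+."""
    diffs = f[y] - np.delete(f, y)
    return max(0.0, 1.0 - np.min(diffs))

f = np.array([1.2, -0.4, -0.8])   # satisfies the sum-to-zero constraint
print(weston_watkins(f, 0), lee_lin_wahba(f, 0), crammer_singer(f, 0))
```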


Linear Multiclass SVMs

For linear classification problems, we have

$$f_k(x) = \boldsymbol{\beta}_k^T x + \beta_{0k}, \quad k = 1, \cdots, K.$$

The sum-to-zero constraint can be replaced by
$$\sum_{k=1}^K \beta_{0k} = 0, \qquad \sum_{k=1}^K \boldsymbol{\beta}_k = \mathbf{0}.$$

The optimization problem becomes
$$\min_{\mathbf{f}} \sum_{i=1}^n L(y_i, \mathbf{f}(x_i)) + \lambda \sum_{k=1}^K \|\boldsymbol{\beta}_k\|^2$$
subject to the sum-to-zero constraint.
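As a usage note rather than lecture material, scikit-learn's LinearSVC exposes a Crammer-Singer style simultaneous linear multiclass SVM through multi_class='crammer_singer'; its exact regularization details differ slightly from the formulation above, so treat this as an illustration only:

```python
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

# Joint (simultaneous) linear multiclass SVM; coef_ holds one beta_k per class.
msvm = LinearSVC(multi_class="crammer_singer", C=1.0, max_iter=10000).fit(X, y)
print(msvm.coef_.shape)     # (3, 4): K rows, one beta_k per class
print(msvm.score(X, y))
```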


Nonlinear Multiclass SVMs

To achieve nonlinear classification, we assume
$$f_k(x) = \boldsymbol{\beta}_k' \phi(x) + \beta_{k0}, \quad k = 1, \cdots, K,$$
where $\phi(x)$ represents the basis functions in the feature space $\mathcal{F}$.

As in the binary case, the nonlinear MSVM can be conveniently solved using a kernel function.


Regularization Problems for Nonlinear MSVMs

We can represent the MSVM as the solution to a regularization problem in an RKHS.

Assume that
$$\mathbf{f}(x) = (f_1(x), \cdots, f_K(x)) \in \prod_{k=1}^K \left(\{1\} + \mathcal{H}_k\right)$$
under the sum-to-zero constraint.

Then an MSVM classifier can be derived by solving
$$\min_{\mathbf{f}} \sum_{i=1}^n L(y_i, \mathbf{f}(x_i)) + \lambda \sum_{k=1}^K \|g_k\|^2_{\mathcal{H}_k},$$
where $f_k(x) = g_k(x) + \beta_{0k}$, $g_k \in \mathcal{H}_k$, $\beta_{0k} \in \mathbb{R}$.


Generalized Functional Margin

Given $(x, y)$, a reasonable decision vector $\mathbf{f}(x)$ should
- encourage a large value for $f_y(x)$;
- have small values for $f_k(x)$, $k \neq y$.

Define the $(K-1)$-vector of relative differences as
$$\mathbf{g} = \left(f_y(x) - f_1(x), \cdots, f_y(x) - f_{y-1}(x),\; f_y(x) - f_{y+1}(x), \cdots, f_y(x) - f_K(x)\right).$$

Liu et al. (2004) called the vector $\mathbf{g}$ the generalized functional margin of $\mathbf{f}$.

- $\mathbf{g}$ characterizes the correctness and strength of the classification of $x$ by $\mathbf{f}$.
- $\mathbf{f}$ indicates a correct classification of $(x, y)$ if $\mathbf{g}(\mathbf{f}(x), y) > \mathbf{0}_{K-1}$ (elementwise).
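A short sketch (mine) of the generalized functional margin for one point; the decision vector f is made up:

```python
import numpy as np

def functional_margin(f, y):
    """Generalized functional margin g(f(x), y): the (K-1)-vector f_y - f_k, k != y."""
    return f[y] - np.delete(f, y)

f = np.array([0.2, 0.9, -1.1])      # decision vector at x, K = 3
g = functional_margin(f, y=1)       # true class is the second one
print(g)                            # [0.7 2. ]
print(bool(np.all(g > 0)))          # True: (x, y) is classified correctly
```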


0-1 Loss with Functional Margin

A point $(x, y)$ is misclassified if $y \neq \arg\max_k f_k(x)$.

Define the multivariate sign function as
$$\mathrm{sign}(\mathbf{u}) = \begin{cases} 1 & \text{if } u_{\min} = \min(u_1, \cdots, u_m) > 0, \\ -1 & \text{if } u_{\min} \le 0, \end{cases}$$
where $\mathbf{u} = (u_1, \cdots, u_m)$. Using the functional margin,
- the 0-1 loss becomes
$$I\left(\min \mathbf{g}(\mathbf{f}(x), y) < 0\right) = \frac{1}{2}\left[1 - \mathrm{sign}(\mathbf{g}(\mathbf{f}(x), y))\right];$$
- the GE becomes
$$R[\mathbf{f}] = \frac{1}{2} E\left[1 - \mathrm{sign}(\mathbf{g}(\mathbf{f}(X), Y))\right].$$
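A correspondingly small sketch (mine) of the multivariate sign and the margin form of the 0-1 loss, using the same made-up decision vector as above:

```python
import numpy as np

def multivariate_sign(u):
    """sign(u) = 1 if min(u) > 0, and -1 otherwise."""
    return 1.0 if np.min(u) > 0 else -1.0

def zero_one_loss(f, y):
    """0-1 loss written through the functional margin: (1 - sign(g))/2."""
    g = f[y] - np.delete(f, y)
    return 0.5 * (1.0 - multivariate_sign(g))

f = np.array([0.2, 0.9, -1.1])
print(zero_one_loss(f, y=1))   # 0.0, correctly classified
print(zero_one_loss(f, y=0))   # 1.0, misclassified
```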


Generalized Loss Functions Using Functional Margin

A natural way to generalize the binary loss is
$$\sum_{i=1}^n \ell\left(\min \mathbf{g}(\mathbf{f}(x_i), y_i)\right).$$

In particular, the loss function $L(y, \mathbf{f}(x))$ can be expressed as $V(\mathbf{g}(\mathbf{f}(x), y))$ with
- Weston and Watkins (1999): $V(\mathbf{u}) = \sum_{j=1}^{K-1} [2 - u_j]_+$.
- Lee et al. (2004): $V(\mathbf{u}) = \sum_{j=1}^{K-1} \left[\frac{1}{K}\sum_{c=1}^{K-1} u_c - u_j + 1\right]_+$.
- Liu and Shen (2006): $V(\mathbf{u}) = [1 - \min_j u_j]_+$.

All of these loss functions are upper bounds of the 0-1 loss.
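The upper-bound claim is easy to check numerically; in the sketch below (my own), u plays the role of the margin vector g and the 0-1 loss equals 1 whenever min(u) ≤ 0:

```python
import numpy as np

def V_ww(u):      # Weston and Watkins
    return np.sum(np.maximum(0.0, 2.0 - u))

def V_lee(u):     # Lee et al.; here K = len(u) + 1
    K = len(u) + 1
    return np.sum(np.maximum(0.0, np.sum(u) / K - u + 1.0))

def V_ls(u):      # Liu and Shen / Crammer and Singer
    return max(0.0, 1.0 - np.min(u))

rng = np.random.default_rng(0)
for _ in range(1000):
    u = rng.normal(size=3)                   # K = 4 classes
    zero_one = 1.0 if np.min(u) <= 0 else 0.0
    assert min(V_ww(u), V_lee(u), V_ls(u)) >= zero_one   # all are upper bounds
```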


Small Round Blue Cell Tumors of Childhood
- Khan et al. (2001) in Nature Medicine
- Tumor types: neuroblastoma (NB), rhabdomyosarcoma (RMS), non-Hodgkin lymphoma (NHL), and the Ewing family of tumors (EWS)
- Number of genes: 2308
- Class distribution of the data set:

Data set        EWS   BL(NHL)   NB   RMS   Total
Training set     23         8   12    20      63
Test set          6         3    6     5      20
Total            29        11   18    25      83


[Figure 3: Three principal components (PC1, PC2, PC3) of 100 gene expression levels (circles: training samples, squares: test samples including non-SRBCT samples). The tumor types (EWS, BL, NB, RMS) are distinguished by colors.]


Characteristics of Support Vector Machines

High accuracy, high flexibility

Naturally handle high-dimensional data

Sparse representation of the solutions (via support vectors): fast for making future predictions

No probability estimates (hard classifiers)


Other Active Problems in SVM

Variable/Feature Selection

- Linear SVM: Bradley and Mangasarian (1998), Guyon et al. (2000), Rakotomamonjy (2003), Jebara and Jaakkola (2000)
- Nonlinear SVM: Weston et al. (2002), Grandvalet (2003), basis pursuit (Zhang, 2003), COSSO selection (Lin and Zhang, 2003)

Proximal SVM – faster computation

Robust SVM – get rid of outliers

Choice of kernels


The L1 SVM

Replace the L2 penalty by the L1 penalty.

The L1 penalty tends to give sparse solutions.

For $f(x) = h(x)^T\boldsymbol{\beta} + \beta_0$, the L1 SVM solves
$$\min_{\beta_0, \boldsymbol{\beta}} \sum_{i=1}^n \left[1 - y_i f(x_i)\right]_+ + \lambda \sum_{j=1}^d |\beta_j|. \qquad (1)$$

The solution will have at most n nonzero coefficients βj .
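scikit-learn does not solve (1) exactly, but its LinearSVC with penalty='l1' fits a closely related L1-penalized linear SVM (it uses the squared hinge), which is enough to illustrate the sparsity; the synthetic data below is an assumption of this sketch:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=100, n_features=50, n_informative=5,
                           random_state=0)

# L1-penalized linear SVM (squared hinge); a close relative of problem (1),
# shown here only to illustrate the sparsity of the solution.
l1_svm = LinearSVC(penalty="l1", loss="squared_hinge", dual=False, C=0.1).fit(X, y)
print("nonzero coefficients:", np.sum(l1_svm.coef_ != 0), "out of", l1_svm.coef_.size)
```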


L1 Penalty versus L2 Penalty

[Figure: the L1 penalty $J(w) = |w|$ and the L2 penalty $J(w) = w^2$ plotted against $w$ over $[-2, 2]$; the L1 penalty has a kink at $w = 0$, which encourages sparse solutions.]


Robust Support Vector Machines

Hinge loss is unbounded, hence sensitive to outliers (e.g., wrong labels).

Support vectors: $y_i f(x_i) \le 1$.

Truncated hinge loss: $T_s(u) = H_1(u) - H_s(u)$, where
$$H_s(u) = [s - u]_+.$$

Remove some “bad” SVs (Wu and Liu, 2006).
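A minimal numpy sketch (mine) of the truncated hinge built from these definitions, with the truncation point s = −1 chosen arbitrarily:

```python
import numpy as np

def hinge(u, s=1.0):
    """H_s(u) = [s - u]_+."""
    return np.maximum(0.0, s - u)

def truncated_hinge(u, s=-1.0):
    """T_s(u) = H_1(u) - H_s(u): the usual hinge, capped at 1 - s for u <= s."""
    return hinge(u, 1.0) - hinge(u, s)

u = np.array([-3.0, -1.0, 0.0, 0.5, 1.0, 2.0])   # u = y * f(x)
print(hinge(u))             # [4.  2.  1.  0.5 0.  0. ]  unbounded as u -> -inf
print(truncated_hinge(u))   # [2.  2.  1.  0.5 0.  0. ]  bounded by 1 - s = 2
```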


Decomposition: Difference of Convex Functions

Key: the D.C. decomposition (difference of convex functions): $T_s(u) = H_1(u) - H_s(u)$.

[Figure: three panels plotting $H_1(z)$, $H_s(z)$, and $T_s(z)$ against $z$; the hinge $H_1$ grows without bound as $z \to -\infty$, while the truncated hinge $T_s$ levels off at $1 - s$ for $z \le s$.]


D.C. Algorithm

D.C. Algorithm: the difference-of-convex algorithm for minimizing $J(\Theta) = J_{vex}(\Theta) + J_{cav}(\Theta)$:

1. Initialize $\Theta^0$.

2. Repeat $\Theta^{t+1} = \arg\min_{\Theta} \left(J_{vex}(\Theta) + \left\langle J'_{cav}(\Theta^t),\, \Theta - \Theta^t \right\rangle\right)$ until convergence of $\Theta^t$.

The algorithm converges in finitely many steps (Liu et al., 2005).

Choice of initial values: use the standard SVM solution.

RSVM: the set of SVs is only a SUBSET of the original one!

Nonlinear learning can be achieved by the kernel trick.
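A generic toy sketch (my own, not the RSVM implementation) of the D.C. iteration; the convex and concave parts J_vex and J_cav are hypothetical stand-ins, chosen so that the convexified subproblem has a closed-form minimizer:

```python
import numpy as np

# Toy d.c. objective J(theta) = J_vex(theta) + J_cav(theta).
J_vex  = lambda t: t**2                        # convex part
J_cav  = lambda t: -np.sqrt(1.0 + t**2)        # concave part
dJ_cav = lambda t: -t / np.sqrt(1.0 + t**2)    # derivative of the concave part

theta = 5.0                                    # 1. initialize Theta_0
for _ in range(100):
    # 2. minimize J_vex(t) + <dJ_cav(theta), t - theta>; here the convex
    #    subproblem has the closed form 2 t + dJ_cav(theta) = 0.
    new_theta = -dJ_cav(theta) / 2.0
    if abs(new_theta - theta) < 1e-10:         # until convergence of Theta_t
        break
    theta = new_theta

print(round(theta, 6))   # ~0, the global minimizer of J for this toy objective
```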
