Support vector machines
CS 446 / ECE 449
mjt.cs.illinois.edu/ml/lec6.pdf
2021-02-16 14:33:55 -0600 (d94860f)


Another algorithm for linear prediction. Why?!

1 / 37

“pytorch meta-algorithm”.

...

5. Tweak 1-4 until training error is small.

6. Tweak 1-5, possibly reducing model complexity, until testing error is small.

Recall: step 5 is easy, step 6 is complicated.
Nice way to reduce complexity: favorable inductive bias.

▶ Support vector machines (SVMs) popularized a very influential inductive bias: maximum margin predictors.

This concept has consequences beyond linear predictors and beyond supervised learning.

▶ Nonlinear SVMs, via kernels, are also highly influential.

E.g., we will revisit them in the deep network lectures.

2 / 37

Plan for SVM

▶ Hard-margin SVM.

▶ Soft-margin SVM.

▶ SVM duality.

▶ Nonlinear SVM: kernels.

3 / 37

Linearly separable data

Linearly separable data means there exists u ∈ R^d which perfectly classifies all training points:

min_i y_i x_i^T u > 0.

We have many ways to solve for u (a quick numerical check of the separability condition appears below):

▶ Logistic regression.

▶ Perceptron.

▶ Convex programming (convex function subject to a convex set constraint).
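To make the condition concrete, here is a minimal NumPy sketch (not from the slides) that checks whether a candidate direction u attains min_i y_i x_i^T u > 0; the arrays X, y, and u are illustrative.

```python
import numpy as np

# Toy data: rows of X are examples, y holds labels in {-1, +1}, u is a candidate predictor.
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])
u = np.array([1.0, 1.0])

# Linear separability by u: every y_i * x_i^T u must be strictly positive.
margins = y * (X @ u)
print("per-example margins:", margins)
print("u separates the data:", bool(margins.min() > 0))
```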

4 / 37

Maximum margin solution

[Figure: three panels showing the best linear classifier on the population, an arbitrary linear separator on the training data S, and the maximum margin solution on the training data S.]

Why use the maximum margin solution?

(i) Uniquely determined by S, even if there are many perfect linear classifiers.

(ii) It is a particular inductive bias (i.e., an assumption about the problem) that seems to be commonly useful.

▶ We’ve seen inductive bias: least squares and logistic regression choose different predictors.

▶ This particular bias (margin maximization) is common in machine learning and has many nice properties.

Key insight: can express this as another convex program.

5 / 37

Distance to decision boundary

Suppose w ∈ R^d satisfies min_{(x,y)∈S} y x^T w > 0.

▶ “Maximum margin” shouldn’t care about scaling; w and 10w should be equally good.

▶ Thus for each direction w/‖w‖, we can fix a scaling.

Let (x, y) be any example in S that achieves the minimum min_i y_i x_i^T w.

[Figure: the hyperplane H, the vector w, the point yx, and its projection (y x^T w / ‖w‖²) w onto w.]

▶ Rescale w so that y x^T w = 1. (Now the scaling is fixed.)

▶ The distance from yx to H is y x^T w / ‖w‖ = 1/‖w‖. This is the (normalized minimum) margin.

▶ This gives the optimization problem

max 1/‖w‖   subj. to   min_{(x,y)∈S} y x^T w = 1.

Refinements: (a) can make the constraint ∀i: y_i x_i^T w ≥ 1; (b) can replace max 1/‖w‖ with min ‖w‖².
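As a quick sanity check of the rescaling argument, here is a small NumPy sketch (not from the slides) that rescales a separating w so that min_i y_i x_i^T w = 1 and reports the margin 1/‖w‖; X, y, and w are illustrative.

```python
import numpy as np

X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = np.array([1.0, 1.0])              # any direction that separates the data

margins = y * (X @ w)
assert margins.min() > 0, "w must separate the data"

w_scaled = w / margins.min()          # now min_i y_i x_i^T w_scaled = 1
print("normalized minimum margin 1/||w||:", 1.0 / np.linalg.norm(w_scaled))
```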

6 / 37

Maximum margin linear classifier

The solution w to the following mathematical optimization problem:

min_{w∈R^d}  ½‖w‖₂²
s.t.  y x^T w ≥ 1  for all (x, y) ∈ S

gives the linear classifier with the largest minimum margin on S — i.e., the maximum margin linear classifier or support vector machine (SVM) classifier.

▶ This is a convex optimization problem: minimization of a convex function, subject to a convex constraint.

▶ There are many solvers for this problem; it is an area of active research. You will code up an easy one in homework 2! (A generic convex-programming sketch appears below.)

▶ The feasible set is nonempty when S is linearly separable; in this case, the solution is unique.

▶ Note: some presentations explicitly include and encode the appended “1” feature on x; we will not.
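For illustration, here is a minimal sketch (not necessarily the homework's approach) of the hard-margin program using the cvxpy modeling library; the toy arrays X and y are assumptions.

```python
import cvxpy as cp
import numpy as np

# Assumed linearly separable toy data.
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(X.shape[1])
objective = cp.Minimize(0.5 * cp.sum_squares(w))      # (1/2) ||w||^2
constraints = [cp.multiply(y, X @ w) >= 1]            # y_i x_i^T w >= 1 for all i
cp.Problem(objective, constraints).solve()

print("maximum margin w:", w.value)
print("margin 1/||w||:", 1.0 / np.linalg.norm(w.value))
```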

7 / 37

Soft-margin SVM

The convex program made no sense if the data is not linearly separable.
It is sometimes called the hard-margin SVM.
Now we develop a soft-margin SVM for non-separable data.

8 / 37

Soft-margin SVMs (Cortes and Vapnik, 1995)

Start from the hard-margin program:

min_{w∈R^d}  ½‖w‖₂²
s.t.  y_i x_i^T w ≥ 1  for all i = 1, 2, . . . , n.

Suppose it is infeasible (no linear separators).

Introduce slack variables ξ_1, . . . , ξ_n ≥ 0, and a trade-off parameter C > 0:

min_{w∈R^d, ξ_1,...,ξ_n∈R}  ½‖w‖₂² + C ∑_{i=1}^n ξ_i
s.t.  y_i x_i^T w ≥ 1 − ξ_i  for all i = 1, 2, . . . , n,
      ξ_i ≥ 0  for all i = 1, 2, . . . , n,

which is always feasible. This is called the soft-margin SVM.

(Slack variables are auxiliary variables; not needed to form the linear classifier.)

9 / 37

Interpretation of slack variables

min_{w∈R^d, ξ_1,...,ξ_n∈R}  ½‖w‖₂² + C ∑_{i=1}^n ξ_i
s.t.  y_i x_i^T w ≥ 1 − ξ_i  for all i = 1, 2, . . . , n,
      ξ_i ≥ 0  for all i = 1, 2, . . . , n.

[Figure: the hyperplane H and its margin, illustrating the slack of violating examples.]

For a given w, ξ_i/‖w‖₂ is the distance that x_i would have to move to satisfy y_i x_i^T w ≥ 1.

10 / 37

Another interpretation of slack variables

Constraints with non-negative slack variables:

min_{w∈R^d, ξ_1,...,ξ_n∈R}  ½‖w‖₂² + C ∑_{i=1}^n ξ_i
s.t.  y_i x_i^T w ≥ 1 − ξ_i  for all i = 1, 2, . . . , n,
      ξ_i ≥ 0  for all i = 1, 2, . . . , n.

Equivalent unconstrained form: given w, the optimal ξ_i is max{0, 1 − y_i x_i^T w}, thus

min_{w∈R^d}  ½‖w‖₂² + C ∑_{i=1}^n [1 − y_i x_i^T w]_+.

Notation: [a]_+ := max{0, a} (ReLU!). [1 − y x^T w]_+ is the hinge loss of w on example (x, y).

11 / 37

Unconstrained soft-margin SVM

This form is a regularized ERM:

min_{w∈R^d}  ½‖w‖₂² + C ∑_{i=1}^n ℓ_hinge(y_i x_i^T w)   where ℓ_hinge(z) = max{0, 1 − z}.

(A minimal gradient-descent sketch of this objective appears below.)

▶ We can equivalently substitute C = 1 and write (λ/2)‖w‖².

▶ C is a hyper-parameter, with no good search procedure.
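Since the unconstrained form is just a regularized ERM, it can be minimized directly with (sub)gradient descent; here is a minimal PyTorch sketch (illustrative only, not the homework solver), where X, y, C, the learning rate, and the step count are arbitrary choices.

```python
import torch

# Toy data: X is n x d, y in {-1, +1}.
X = torch.tensor([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-3.0, -1.0]])
y = torch.tensor([1.0, 1.0, -1.0, -1.0])
C = 1.0

w = torch.zeros(X.shape[1], requires_grad=True)
opt = torch.optim.SGD([w], lr=0.1)

for _ in range(500):
    opt.zero_grad()
    hinge = torch.clamp(1 - y * (X @ w), min=0).sum()   # sum_i [1 - y_i x_i^T w]_+
    loss = 0.5 * w.dot(w) + C * hinge                   # (1/2)||w||^2 + C * hinge
    loss.backward()
    opt.step()

print("soft-margin SVM w:", w.detach())
```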

12 / 37

SVM duality

A convex program is an optimization problem (minimization or maximization) where a convex objective is minimized over a convex constraint (feasible) set.

Every convex program has a corresponding dual program.
There is a rich theory about this correspondence.
For the SVM, the dual has many nice properties:

▶ Clarifies the role of support vectors.

▶ Leads to a nice nonlinear approach via kernels.

▶ Gives another choice for optimization algorithms. (Used in homework 2!)

13 / 37

SVM hard-margin duality

Define the two optimization problems

min { ½‖w‖² : w ∈ R^d, ∀i: 1 − y_i x_i^T w ≤ 0 }   (primal),

max { ∑_{i=1}^n α_i − ½ ∑_{i,j=1}^n α_i α_j y_i y_j x_i^T x_j : α ∈ R^n, α ≥ 0 }   (dual).

If the primal is feasible, then primal optimal value = dual optimal value.
Given a primal optimum w and a dual optimum α, then

w = ∑_{i=1}^n α_i y_i x_i.

▶ The dual variables α have dimension n, the same as the number of examples.

▶ We can write the primal optimum as a linear combination of examples.

▶ The dual objective is a concave quadratic.

▶ We will derive this duality using Lagrange multipliers.

14 / 37

Lagrange multipliers

Move constraints to objective using Lagrange multipliers.

Original problem:

min_{w∈R^d}  ½‖w‖₂²
s.t.  1 − y_i x_i^T w ≤ 0  for all i = 1, . . . , n.

▶ For each constraint 1 − y_i x_i^T w ≤ 0, associate a dual variable (Lagrange multiplier) α_i ≥ 0.

▶ Move constraints to objective by adding ∑_{i=1}^n α_i(1 − y_i x_i^T w) and maximizing over α = (α_1, . . . , α_n) ∈ R^n s.t. α ≥ 0 (i.e., α_i ≥ 0 for all i).

Lagrangian L(w, α):

L(w, α) := ½‖w‖₂² + ∑_{i=1}^n α_i(1 − y_i x_i^T w).

Maximizing over α ≥ 0 recovers primal problem: for any w ∈ R^d,

P(w) := sup_{α≥0} L(w, α) = ½‖w‖₂² if min_i y_i x_i^T w ≥ 1, and ∞ otherwise.

What if we leave α fixed, and minimize w?

15 / 37

Dual problem

Lagrangian

L(w, α) := ½‖w‖₂² + ∑_{i=1}^n α_i(1 − y_i x_i^T w).

Primal hard-margin SVM:

P(w) = sup_{α≥0} L(w, α) = sup_{α≥0} [ ½‖w‖₂² + ∑_{i=1}^n α_i(1 − y_i x_i^T w) ].

Dual problem D(α) = min_w L(w, α): given α ≥ 0, then w ↦ L(w, α) is a convex quadratic with minimum w = ∑_{i=1}^n α_i y_i x_i, giving

D(α) = min_{w∈R^d} L(w, α) = L( ∑_{i=1}^n α_i y_i x_i, α )
     = ∑_{i=1}^n α_i − ½ ‖∑_{i=1}^n α_i y_i x_i‖₂²
     = ∑_{i=1}^n α_i − ½ ∑_{i,j=1}^n α_i α_j y_i y_j x_i^T x_j.

16 / 37

Summarizing,

L(w, α) = ½‖w‖₂² + ∑_{i=1}^n α_i(1 − y_i x_i^T w)   (Lagrangian),

P(w) = max_{α≥0} L(w, α)   (primal problem),

D(α) = min_w L(w, α)   (dual problem).

▶ For general Lagrangians, have weak duality P(w) ≥ D(α), since

P(w) = max_{α′≥0} L(w, α′) ≥ L(w, α) ≥ min_{w′} L(w′, α) = D(α).

▶ By convexity, have strong duality min_w P(w) = max_{α≥0} D(α), and an optimum α for D gives an optimum w for P via

w = ∑_{i=1}^n α_i y_i x_i = arg min_w L(w, α).

17 / 37

Support vectors

Optimal solutions w and α = (α_1, . . . , α_n) satisfy

▶ w = ∑_{i=1}^n α_i y_i x_i = ∑_{i: α_i > 0} α_i y_i x_i,

▶ α_i > 0 ⇒ y_i x_i^T w = 1 for all i = 1, . . . , n (complementary slackness).

The y_i x_i where α_i > 0 are called support vectors.

[Figure: the hyperplane H with the support vectors lying on the margin.]

▶ Support vector examples satisfy the “margin” constraints with equality.

▶ Get the same solution if non-support vectors are deleted.

▶ The primal optimum is a linear combination of support vectors (a small sketch of recovering it from α follows).
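To connect the dual back to the primal, here is a small NumPy sketch (illustrative, not the course solver) that, given some dual vector alpha for training data X, y, recovers w = ∑_i α_i y_i x_i and lists the support vectors; the alpha shown is hypothetical, not an actual optimum.

```python
import numpy as np

def primal_from_dual(alpha, X, y, tol=1e-8):
    """Recover w = sum_i alpha_i y_i x_i and the indices of support vectors."""
    w = (alpha * y) @ X                      # linear combination of examples
    support = np.flatnonzero(alpha > tol)    # examples with alpha_i > 0
    return w, support

# Hypothetical dual variables for a 4-point dataset (for illustration only).
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha = np.array([0.2, 0.0, 0.2, 0.0])

w, support = primal_from_dual(alpha, X, y)
print("w =", w, " support vector indices:", support)
```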

18 / 37

Proof of complementary slackness

For the optimal (feasible) solutions w and α, we have

P(w) = D(α) = min_{w′∈R^d} L(w′, α)   (by strong duality)
     ≤ L(w, α)
     = ½‖w‖₂² + ∑_{i=1}^n α_i(1 − y_i x_i^T w)
     ≤ ½‖w‖₂²   (constraints are satisfied)
     = P(w).

Therefore, every term in the sum ∑_{i=1}^n α_i(1 − y_i x_i^T w) must be zero:

α_i(1 − y_i x_i^T w) = 0 for all i = 1, . . . , n.

If α_i > 0, then we must have 1 − y_i x_i^T w = 0. (Not iff!)

19 / 37

SVM (hard-margin) duality summary

Lagrangian

L(w, α) = ½‖w‖₂² + ∑_{i=1}^n α_i(1 − y_i x_i^T w).

Primal maximum margin problem was

P(w) = sup_{α≥0} L(w, α) = sup_{α≥0} [ ½‖w‖₂² + ∑_{i=1}^n α_i(1 − y_i x_i^T w) ].

Dual problem

D(α) = min_{w∈R^d} L(w, α) = ∑_{i=1}^n α_i − ½ ‖∑_{i=1}^n α_i y_i x_i‖₂².

Given dual optimum α,

▶ Corresponding primal optimum w = ∑_{i=1}^n α_i y_i x_i;

▶ Strong duality P(w) = D(α);

▶ α_i > 0 implies y_i x_i^T w = 1, and these y_i x_i are support vectors.

20 / 37

SVM soft-margin dual

Similarly,

L(w, ξ, α) = ½‖w‖₂² + C ∑_{i=1}^n ξ_i + ∑_{i=1}^n α_i(1 − ξ_i − y_i x_i^T w)   (Lagrangian),

P(w, ξ) = sup_{α≥0} L(w, ξ, α)   (primal)
        = ½‖w‖₂² + C ∑_{i=1}^n ξ_i  if ∀i: 1 − ξ_i − y_i x_i^T w ≤ 0,  and ∞ otherwise,

D(α) = min_{w∈R^d, ξ∈R^n, ξ≥0} L(w, ξ, α)   (dual),

and the dual problem is

max_{α∈R^n, 0≤α_i≤C}  ∑_{i=1}^n α_i − ½ ‖∑_{i=1}^n α_i y_i x_i‖².

Remarks.

▶ Dual solution α still gives primal solution w = ∑_{i=1}^n α_i y_i x_i.

▶ Can take C → ∞ to recover the hard-margin case.

▶ Dual is still a constrained concave quadratic (used in many solvers); a small projected-gradient sketch appears below.

▶ Some presentations include a bias in the primal (x_i^T w + b); this introduces a constraint ∑_{i=1}^n y_i α_i = 0 in the dual.
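As a deliberately simple instance of the third remark, here is a NumPy sketch of projected gradient ascent on the soft-margin dual: ascend the concave quadratic, then clip each α_i back into [0, C]. It is illustrative only (no bias term, fixed step size and iteration count) and is not claimed to be the homework's method.

```python
import numpy as np

def svm_dual_ascent(X, y, C=1.0, lr=0.01, steps=2000):
    """Projected gradient ascent on D(alpha) = sum(alpha) - 0.5 ||sum_i alpha_i y_i x_i||^2."""
    n = X.shape[0]
    Q = (y[:, None] * X) @ (y[:, None] * X).T        # Q_ij = y_i y_j x_i^T x_j
    alpha = np.zeros(n)
    for _ in range(steps):
        grad = 1.0 - Q @ alpha                       # gradient of the dual objective
        alpha = np.clip(alpha + lr * grad, 0.0, C)   # ascend, then project onto [0, C]^n
    w = (alpha * y) @ X                              # recover the primal solution
    return alpha, w

X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha, w = svm_dual_ascent(X, y)
print("alpha:", np.round(alpha, 3), " w:", np.round(w, 3))
```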

21 / 37

Nonlinear SVM: feature mapping annoying in the primal?

SVM hard-margin primal, with a feature mapping φ : R^d → R^p:

min { ½‖w‖² : w ∈ R^p, ∀i: y_i φ(x_i)^T w ≥ 1 }.

Now the search space has p dimensions; potentially p ≫ d.

Can we do better?

22 / 37

Feature mapping in the dual

SVM hard-margin dual, with a feature mapping φ : R^d → R^p:

max_{α_1, α_2, ..., α_n ≥ 0}  ∑_{i=1}^n α_i − ½ ∑_{i,j=1}^n α_i α_j y_i y_j φ(x_i)^T φ(x_j).

Given a dual optimum α, since w = ∑_{i=1}^n α_i y_i φ(x_i), we can predict on a future x with

x ↦ φ(x)^T w = ∑_{i=1}^n α_i y_i φ(x)^T φ(x_i).

▶ The dual form never needs φ(x) ∈ R^p, only φ(x)^T φ(x_i) ∈ R.

▶ Kernel trick: replace every φ(x)^T φ(x′) with a kernel evaluation k(x, x′). Sometimes k(·, ·) is much cheaper than φ(x)^T φ(x′).

▶ This idea started with SVM, but appears in many other places.

▶ Downside: implementations usually store the Gram matrix G ∈ R^{n×n} where G_ij := k(x_i, x_j). (A small kernelized-prediction sketch follows.)
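Here is a minimal NumPy sketch of that kernelized predictor (names are illustrative): given dual variables alpha, training data, and a kernel function, it evaluates x ↦ ∑_i α_i y_i k(x, x_i); the alpha values shown are hypothetical.

```python
import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def kernel_predict(x, alpha, X_train, y_train, kernel):
    """Decision value sum_i alpha_i y_i k(x, x_i); its sign gives the predicted label."""
    scores = np.array([kernel(x, xi) for xi in X_train])
    return float(np.dot(alpha * y_train, scores))

X_train = np.array([[0.0, 1.0], [1.0, 0.0], [-1.0, 0.0], [0.0, -1.0]])
y_train = np.array([1.0, 1.0, -1.0, -1.0])
alpha = np.array([0.5, 0.5, 0.5, 0.5])
print(kernel_predict(np.array([0.2, 0.8]), alpha, X_train, y_train, rbf_kernel))
```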

23 / 37

Kernel example: affine features

Affine features: φ : R^d → R^{1+d}, where

φ(x) = (1, x_1, . . . , x_d).

Kernel form: φ(x)^T φ(x′) = 1 + x^T x′.

24 / 37

Kernel example: quadratic features

Consider re-normalized quadratic features

φ : R^d → R^{1 + 2d + (d choose 2)}, where

φ(x) = ( 1, √2 x_1, . . . , √2 x_d, x_1², . . . , x_d², √2 x_1 x_2, . . . , √2 x_1 x_d, . . . , √2 x_{d−1} x_d ).

Just writing this down takes time O(d²). Meanwhile,

φ(x)^T φ(x′) = (1 + x^T x′)²,   time O(d). (This identity is checked numerically below.)

Tweaks:

▶ What if we change the exponent “2”?

▶ What if we replace the additive “1” with 0?
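The following NumPy sketch (illustrative only) builds the explicit quadratic feature map above and checks that φ(x)^T φ(x′) matches (1 + x^T x′)² on random vectors.

```python
import numpy as np

def phi_quadratic(x):
    """Explicit features whose inner product equals (1 + x.z)^2."""
    d = len(x)
    cross = [np.sqrt(2) * x[i] * x[j] for i in range(d) for j in range(i + 1, d)]
    return np.concatenate([[1.0], np.sqrt(2) * x, x ** 2, cross])

rng = np.random.default_rng(0)
x, z = rng.normal(size=5), rng.normal(size=5)

explicit = phi_quadratic(x) @ phi_quadratic(z)   # O(d^2) feature construction
kernel = (1.0 + x @ z) ** 2                      # O(d) kernel evaluation
print(np.isclose(explicit, kernel))              # True
```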

25 / 37

RBF kernel

For any σ > 0, there is an infinite feature expansion φ : R^d → R^∞ such that

φ(x)^T φ(x′) = exp( −‖x − x′‖₂² / (2σ²) ),

which can be computed in O(d) time.

This is called a Gaussian kernel or RBF kernel.
It has some similarities to nearest neighbor methods (later lecture).
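For reference, here is a short NumPy sketch (illustrative) computing an RBF Gram matrix G_ij = exp(−‖x_i − x_j‖²/(2σ²)) without ever forming φ.

```python
import numpy as np

def rbf_gram(X, sigma=1.0):
    """Gram matrix G_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma ** 2))

X = np.random.default_rng(0).normal(size=(5, 3))
G = rbf_gram(X, sigma=1.0)
print(G.shape, np.allclose(G, G.T))   # (5, 5) True
```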

φ maps to an infinite-dimensional space, but there’s no reason to know that.

26 / 37

Defining kernels without φ

A (positive definite) kernel function k : X × X → R is a symmetric function so that for any n and any data examples (x_i)_{i=1}^n, the corresponding Gram matrix G ∈ R^{n×n} with G_ij := k(x_i, x_j) is positive semi-definite.

▶ There is a ton of theory about this formalism; e.g., keywords RKHS, representer theorem, Mercer’s theorem.

▶ Given any such k, there always exists a corresponding φ.

▶ This definition ensures the SVM dual is still concave. (A quick numerical PSD check appears below.)
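A hedged NumPy sketch of that check: build a Gram matrix for a polynomial kernel on random points and verify its eigenvalues are numerically nonnegative; the kernel, data, and tolerance are arbitrary choices.

```python
import numpy as np

def poly_kernel(x, z, degree=2):
    return (1.0 + x @ z) ** degree

X = np.random.default_rng(1).normal(size=(6, 3))
n = X.shape[0]
G = np.array([[poly_kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

eigs = np.linalg.eigvalsh(G)            # G is symmetric, so use eigvalsh
print("min eigenvalue:", eigs.min())    # nonnegative up to round-off: G is PSD
```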

27 / 37

[Figure: four panels comparing decision boundaries on the same 2-D dataset: source data, quadratic SVM, RBF SVM (σ = 1), and RBF SVM (σ = 0.1).]

28 / 37

Summary for SVM

▶ Hard-margin SVM.

▶ Soft-margin SVM.

▶ SVM duality.

▶ Nonlinear SVM: kernels.

29 / 37

(Appendix.)

30 / 37

Soft-margin dual derivation

Let’s derive the final dual form:

L(w, ξ, α) = ½‖w‖₂² + C ∑_{i=1}^n ξ_i + ∑_{i=1}^n α_i(1 − ξ_i − y_i x_i^T w),

D(α) = min_{w∈R^d, ξ∈R^n, ξ≥0} L(w, ξ, α),   and the claim is that the dual problem is   max_{0≤α_i≤C} [ ∑_{i=1}^n α_i − ½ ‖∑_{i=1}^n α_i y_i x_i‖² ].

Given α and ξ, the minimizing w is still w = ∑_{i=1}^n α_i y_i x_i; plugging in,

D(α) = min_{ξ≥0} L( ∑_{i=1}^n α_i y_i x_i, ξ, α ) = min_{ξ≥0} [ ∑_{i=1}^n α_i − ½ ‖∑_{i=1}^n α_i y_i x_i‖² + ∑_{i=1}^n ξ_i (C − α_i) ].

The goal is to maximize D; if α_i > C, then ξ_i ↑ ∞ gives D(α) = −∞. Otherwise, the minimum is at ξ_i = 0. Therefore the dual problem is

max_{α∈R^n, 0≤α_i≤C}  ∑_{i=1}^n α_i − ½ ‖∑_{i=1}^n α_i y_i x_i‖².

31 / 37

Gaussian kernel feature expansion

First consider d = 1, meaning φ : R → R^∞.

What φ has φ(x)φ(y) = e^{−(x−y)²/(2σ²)}?

Reverse engineer using Taylor expansion:

e^{−(x−y)²/(2σ²)} = e^{−x²/(2σ²)} · e^{−y²/(2σ²)} · e^{xy/σ²}
                  = e^{−x²/(2σ²)} · e^{−y²/(2σ²)} · ∑_{k=0}^∞ (1/k!) (xy/σ²)^k.

So let

φ(x) := e^{−x²/(2σ²)} ( 1, x/σ, (1/√2!)(x/σ)², (1/√3!)(x/σ)³, . . . ).

How to handle d > 1?

e^{−‖x−y‖²/(2σ²)} = e^{−‖x‖²/(2σ²)} · e^{−‖y‖²/(2σ²)} · e^{x^T y/σ²}
                  = e^{−‖x‖²/(2σ²)} · e^{−‖y‖²/(2σ²)} · ∑_{k=0}^∞ (1/k!) (x^T y/σ²)^k.
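As an illustrative check (not part of the lecture), the following NumPy sketch truncates the d = 1 feature expansion above at K coordinates and compares φ(x)·φ(y) with the exact Gaussian kernel value; K, σ, and the sample points are arbitrary choices.

```python
import numpy as np
from math import factorial

def phi_truncated(x, sigma=1.0, K=20):
    """First K coordinates of the d = 1 Gaussian-kernel feature expansion."""
    ks = np.arange(K)
    scales = np.sqrt([factorial(int(k)) for k in ks])
    return np.exp(-x**2 / (2 * sigma**2)) * (x / sigma) ** ks / scales

x, y, sigma = 0.7, -0.3, 1.0
approx = phi_truncated(x, sigma) @ phi_truncated(y, sigma)
exact = np.exp(-(x - y) ** 2 / (2 * sigma ** 2))
print(approx, exact)   # nearly identical for K = 20
```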

32 / 37

Kernel example: products of all subsets of coordinates

Consider φ : R^d → R^{2^d}, where

φ(x) = ( ∏_{i∈S} x_i )_{S ⊆ {1,2,...,d}}.

Time O(2^d) just to write down. Kernel evaluation takes time O(d):

φ(x)^T φ(x′) = ∏_{i=1}^d (1 + x_i x′_i).

33 / 37

Other kernels

Suppose k_1 and k_2 are positive definite kernel functions.

1. Does k(x, y) := k_1(x, y) + k_2(x, y) define a positive definite kernel?

2. Does k(x, y) := c · k_1(x, y) (for c ≥ 0) define a positive definite kernel?

3. Does k(x, y) := k_1(x, y) · k_2(x, y) define a positive definite kernel?

Another approach: random features k(x, x′) = E_w F(w, x)^T F(w, x′) for some F; we will revisit this with deep networks and the neural tangent kernel (NTK).

34 / 37

Kernel ridge regression

Kernel ridge regression:

min_w  (1/(2n)) ‖Xw − y‖² + (λ/2) ‖w‖².

Solution:

w = (X^T X + λnI)^{−1} X^T y.

Linear algebra fact: this equals

w = X^T (XX^T + λnI)^{−1} y.

Therefore predict with

x ↦ x^T w = (Xx)^T (XX^T + λnI)^{−1} y = ∑_{i=1}^n (x_i^T x) [ (XX^T + λnI)^{−1} y ]_i.

Kernel approach (a short sketch follows):

▶ Compute α := (G + λnI)^{−1} y, where G ∈ R^{n×n} is the Gram matrix: G_ij = k(x_i, x_j).

▶ Predict with x ↦ ∑_{i=1}^n α_i x_i^T x (with a kernel, x_i^T x becomes k(x_i, x)).
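Here is a minimal NumPy sketch of that kernel ridge regression recipe (illustrative; the RBF kernel and the values of λ and σ are arbitrary choices, and the data is synthetic).

```python
import numpy as np

def rbf_gram(A, B, sigma=1.0):
    """Gram matrix K_ij = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def fit_krr(X, y, lam=0.1, sigma=1.0):
    n = X.shape[0]
    G = rbf_gram(X, X, sigma)
    return np.linalg.solve(G + lam * n * np.eye(n), y)   # alpha = (G + lambda n I)^{-1} y

def predict_krr(X_test, X_train, alpha, sigma=1.0):
    return rbf_gram(X_test, X_train, sigma) @ alpha      # x -> sum_i alpha_i k(x_i, x)

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(30, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=30)
alpha = fit_krr(X, y)
print(predict_krr(np.array([[0.0], [0.5]]), X, alpha))
```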

35 / 37

Multiclass SVM

There are a few ways; one is to use one-against-all as in lecture.

Many researchers have proposed various multiclass SVM methods, but some of them can be shown to fail in trivial cases.

36 / 37

Supplemental reading

▶ Shalev-Shwartz/Ben-David: chapter 15.

▶ Murphy: chapter 14.

37 / 37