Support Vector Machines (SVMs) Part 3: Kernels

TRANSCRIPT

Page 1:

CSE 5526: Introduction to Neural Networks

Support Vector Machines (SVMs), Part 3: Kernels

Page 2:

Kernels are generalizations of inner products

• A kernel is a function of two data points such that k(x, x') = φ(x)^T φ(x') for some function φ(x)
• It is therefore symmetric: k(x, x') = k(x', x)
• Can compute k(x, x') from an explicit φ(x)
• Or prove that k(x, x') corresponds to some φ(x)
• Never need to actually compute φ(x)
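As a minimal numerical sketch of the definition (the feature map here is a hypothetical choice, picked only for illustration): any kernel built as an inner product of feature vectors is automatically symmetric.

```python
import numpy as np

def phi(x):
    # Hypothetical explicit feature map: the coordinates plus their squares
    return np.concatenate([x, x**2])

def k(x, xp):
    # Kernel induced by phi: k(x, x') = phi(x)^T phi(x')
    return phi(x) @ phi(xp)

x, xp = np.array([1.0, 2.0]), np.array([-1.0, 0.5])
# Symmetry follows directly from the symmetry of the inner product
assert np.isclose(k(x, xp), k(xp, x))
```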

Page 3:

SVM as a kernel machine

• Cover's theorem: a complex classification problem, cast nonlinearly into a high-dimensional space, is more likely to be linearly separable than in the low-dimensional input space

• SVM for pattern classification:
  1. Nonlinearly map the input space onto a high-dimensional feature space
  2. Construct the optimal hyperplane in that feature space

Page 4:

Kernel machine illustration

Page 5:

Kernelized SVM looks a lot like an RBF net

Page 6:

Kernel matrix

• The matrix

  K = [ k(x_1, x_1)  ...  k(x_1, x_N) ]
      [     ...      k(x_i, x_j)  ... ]
      [ k(x_N, x_1)  ...  k(x_N, x_N) ]

  is called the kernel matrix, or the Gram matrix
• K is positive semidefinite
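A quick numerical check of these properties (a sketch, not part of the slides): build the Gram matrix for some random points under a Gaussian kernel and confirm it is symmetric with nonnegative eigenvalues.

```python
import numpy as np

def gram_matrix(X, kernel):
    # K[i, j] = k(x_i, x_j) for all pairs of rows of X
    N = X.shape[0]
    return np.array([[kernel(X[i], X[j]) for j in range(N)] for i in range(N)])

def rbf(x, xp, sigma=1.0):
    # Gaussian / RBF kernel
    return np.exp(-np.sum((x - xp)**2) / (2 * sigma**2))

X = np.random.default_rng(0).normal(size=(10, 3))
K = gram_matrix(X, rbf)

# Positive semidefinite: symmetric, with no negative eigenvalues
assert np.allclose(K, K.T)
assert np.linalg.eigvalsh(K).min() >= -1e-10
```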

Page 7:

Mercer's theorem relates kernel functions and inner product spaces

• Suppose that for all finite sets of points {x_i}, i = 1..N, and all real numbers {a_i},

  Σ_{i,j} a_i a_j k(x_i, x_j) ≥ 0

• Then k is called a positive semidefinite kernel
• And it can be written as

  k(x, x') = φ(x)^T φ(x')

  for some vector-valued function φ(x)

Page 8:

Kernels can be applied in many situations

• Kernel trick: when predictions are based on inner products of data points, replace the inner products with a kernel function

• Some algorithms where this is possible:
  • Linear / ridge regression
  • Principal components analysis
  • Canonical correlation analysis
  • Perceptron classifier
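The ridge-regression case can be sketched in a few lines. In the dual form, predictions depend on the training data only through the Gram matrix, so the kernel trick applies directly: solve (K + λI)α = y and predict with f(x) = Σ_i α_i k(x_i, x). This is an illustrative sketch (function names and data are made up), not material from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(40, 1))
y_train = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)

def rbf(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b)**2) / (2 * sigma**2))

# Gram matrix over the training inputs
K = np.array([[rbf(xi, xj) for xj in X] for xi in X])

lam = 1e-2
# Dual weights: alpha = (K + lam I)^{-1} y
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y_train)

def predict(x):
    # Prediction uses only kernel evaluations, never an explicit phi
    return sum(a * rbf(xi, x) for a, xi in zip(alpha, X))

preds = np.array([predict(xi) for xi in X])
assert np.mean((preds - y_train)**2) < 0.1  # fits the training data closely
```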

Page 9:

Some popular kernels

• Polynomial kernel, parameters p and c:

  k(x, x') = (x^T x' + c)^p

  • Finite-dimensional φ(x) can be explicitly computed

• Gaussian or RBF kernel, parameter σ:

  k(x, x') = exp(−||x − x'||^2 / (2σ^2))

  • Infinite-dimensional φ(x)
  • Equivalent to an RBF network, but a more principled way of finding centers
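The claim that the RBF kernel has an infinite-dimensional φ(x) can be made concrete in 1-D: expanding exp(x x'/σ²) as a power series gives features φ_n(x) = exp(−x²/2σ²) x^n / (σ^n √(n!)), n = 0, 1, 2, .... The sketch below (my own illustration, not from the slides) truncates this series and checks that the inner product of truncated feature vectors converges to the kernel value.

```python
import numpy as np
from math import factorial

sigma = 1.0

def rbf(x, xp):
    # 1-D Gaussian kernel
    return np.exp(-(x - xp)**2 / (2 * sigma**2))

def phi_truncated(x, n_terms=20):
    # First n_terms coordinates of the infinite-dimensional feature map:
    # phi_n(x) = exp(-x^2 / 2 sigma^2) * x^n / (sigma^n * sqrt(n!))
    return np.array([np.exp(-x**2 / (2 * sigma**2)) * x**n /
                     (sigma**n * np.sqrt(factorial(n)))
                     for n in range(n_terms)])

x, xp = 0.7, -0.4
approx = phi_truncated(x) @ phi_truncated(xp)
# The truncated series converges rapidly to the exact kernel value
assert abs(approx - rbf(x, xp)) < 1e-10
```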

Page 10:

Some popular kernels

• Hyperbolic tangent kernel, parameters β1 and β2:

  k(x, x') = tanh(β1 x^T x' + β2)

  • Only positive semidefinite for some values of β1 and β2
  • Inspired by neural networks, but a more principled way of selecting the number of hidden units

• String kernels and other structure kernels:
  • Can be proven positive definite
  • Computed between non-numeric items
  • Avoid converting to fixed-length feature vectors
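The caveat about the tanh kernel is easy to verify numerically: for some parameter settings its Gram matrix has negative eigenvalues, so it is not positive semidefinite there. A small sketch (the points and parameters are arbitrary choices for the demo):

```python
import numpy as np

def tanh_kernel(x, xp, beta1=1.0, beta2=-1.0):
    # Hyperbolic tangent kernel; PSD only for some (beta1, beta2)
    return np.tanh(beta1 * np.dot(x, xp) + beta2)

X = np.array([[0.0], [1.0], [2.0], [3.0]])
K = np.array([[tanh_kernel(a, b) for b in X] for a in X])

# K[0, 0] = tanh(-1) < 0, so the Gram matrix cannot be PSD:
# it must have at least one negative eigenvalue
assert np.linalg.eigvalsh(K).min() < 0
```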

Page 11:

Example: polynomial kernel

• Polynomial kernel in 2D, c = 1, p = 2:

  k(x, x') = (x^T x' + 1)^2 = (x_1 x_1' + x_2 x_2' + 1)^2
           = x_1^2 x_1'^2 + x_2^2 x_2'^2 + 2 x_1 x_1' x_2 x_2' + 2 x_1 x_1' + 2 x_2 x_2' + 1

• If we define

  φ(x) = (x_1^2, x_2^2, √2 x_1 x_2, √2 x_1, √2 x_2, 1)^T

• Then k(x, x') = φ(x)^T φ(x')
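This identity can be checked numerically; the sketch below evaluates the kernel both ways, directly and through the explicit six-dimensional feature map, at random points.

```python
import numpy as np

def phi(x):
    # Explicit feature map for the 2-D polynomial kernel with c = 1, p = 2
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1,
                     np.sqrt(2) * x2,
                     1.0])

def k(x, xp):
    # Same kernel, computed directly without forming phi
    return (np.dot(x, xp) + 1.0) ** 2

rng = np.random.default_rng(2)
for _ in range(5):
    x, xp = rng.normal(size=2), rng.normal(size=2)
    assert np.isclose(k(x, xp), phi(x) @ phi(xp))
```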

Page 12:

Example: XOR problem again

โ€ข Consider (once again) the XOR problem โ€ข The SVM can solve it using a polynomial kernel

โ€ข With ๐‘๐‘ = 2 and ๐‘๐‘ = 1

Page 13:

XOR: first compute the kernel matrix

• In general, K_ij = k(x_i, x_j) = (1 + x_i^T x_j)^2
• For example,

  K_11 = k((−1, −1), (−1, −1)) = (1 + 2)^2 = 9
  K_12 = k((−1, −1), (−1, +1)) = (1 + 0)^2 = 1

• So

  K = [ 9 1 1 1 ]
      [ 1 9 1 1 ]
      [ 1 1 9 1 ]
      [ 1 1 1 9 ]
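The whole matrix can be computed in one vectorized line (a sketch in NumPy):

```python
import numpy as np

# XOR inputs, in the same order as the slides
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)

K = (1.0 + X @ X.T) ** 2  # K_ij = (1 + x_i^T x_j)^2

expected = np.array([[9, 1, 1, 1],
                     [1, 9, 1, 1],
                     [1, 1, 9, 1],
                     [1, 1, 1, 9]], dtype=float)
assert np.array_equal(K, expected)
```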

Page 14:

XOR: first compute the kernel matrix

• Or compute the φ(x_i) and their inner products, e.g.,
• Remember, φ(x) = (x_1^2, x_2^2, √2 x_1 x_2, √2 x_1, √2 x_2, 1)^T
• Since φ(x) includes the constant 1, there is no need for a separate bias b later

  φ(x_1) = φ((−1, −1)) = (1, 1, √2, −√2, −√2, 1)^T
  φ(x_2) = φ((−1, +1)) = (1, 1, −√2, −√2, √2, 1)^T

• Then

  K_11 = φ(x_1)^T φ(x_1) = 1 + 1 + 2 + 2 + 2 + 1 = 9
  K_12 = φ(x_1)^T φ(x_2) = 1 + 1 − 2 + 2 − 2 + 1 = 1

• Results in the same K matrix, but with more computation

Page 15:

XOR: Combine class labels into K̂

• Define the matrix K̂ such that K̂_ij = K_ij d_i d_j
• Recall d = (−1, +1, +1, −1)^T

  K̂ = [ +9 −1 −1 +1 ]
      [ −1 +9 +1 −1 ]
      [ −1 +1 +9 −1 ]
      [ +1 −1 −1 +9 ]
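In NumPy, the elementwise relabeling K̂_ij = K_ij d_i d_j is an outer product (a sketch):

```python
import numpy as np

X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
d = np.array([-1.0, 1.0, 1.0, -1.0])  # XOR class labels

K = (1.0 + X @ X.T) ** 2
K_hat = K * np.outer(d, d)  # K_hat_ij = K_ij * d_i * d_j

expected = np.array([[ 9, -1, -1,  1],
                     [-1,  9,  1, -1],
                     [-1,  1,  9, -1],
                     [ 1, -1, -1,  9]], dtype=float)
assert np.array_equal(K_hat, expected)
```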

Page 16:

XOR: Solve the dual Lagrangian for a

• Find fixed points of

  L̂(a) = 1^T a − (1/2) a^T K̂ a

• Set the gradient to 0:

  ∇L̂ = 1 − K̂ a = 0
  ⇒ a = K̂^{−1} 1 = (1/8, 1/8, 1/8, 1/8)^T

• Satisfies all conditions: a_i ≥ 0 for all i, and Σ_i a_i d_i = 0
• So this is the solution
• All points are support vectors
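The stationarity condition K̂ a = 1 is a small linear system, so the dual solution and both feasibility conditions can be checked directly (a sketch):

```python
import numpy as np

X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
d = np.array([-1.0, 1.0, 1.0, -1.0])
K_hat = ((1.0 + X @ X.T) ** 2) * np.outer(d, d)

# Stationary point of the dual Lagrangian: K_hat a = 1
a = np.linalg.solve(K_hat, np.ones(4))

assert np.allclose(a, 1 / 8)       # a = (1/8, 1/8, 1/8, 1/8)^T
assert np.all(a >= 0)              # a_i >= 0 for all i
assert np.isclose(a @ d, 0.0)      # sum_i a_i d_i = 0
```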

Page 17:

XOR: Compute w (including b) from a

  w = Σ_i a_i d_i φ(x_i)
    = −(1/8) φ(x_1) + (1/8) φ(x_2) + (1/8) φ(x_3) − (1/8) φ(x_4)

    = (1/8) [ − (1, 1, √2, −√2, −√2, 1)^T
              + (1, 1, −√2, −√2, √2, 1)^T
              + (1, 1, −√2, √2, −√2, 1)^T
              − (1, 1, √2, √2, √2, 1)^T ]

    = (0, 0, −1/√2, 0, 0, 0)^T

• The last component of w plays the role of the bias b, which is 0 here

Page 18:

XOR: Examine prediction function

• Prediction function:

  y(x) = w^T φ(x)
       = (0, 0, −1/√2, 0, 0, 0) (x_1^2, x_2^2, √2 x_1 x_2, √2 x_1, √2 x_2, 1)^T
       = −x_1 x_2

• Predictions are based on the product of the two input dimensions:

  y(x_1) = −(−1)(−1) = −1
  y(x_2) = −(−1)(+1) = +1
  y(x_3) = −(+1)(−1) = +1
  y(x_4) = −(+1)(+1) = −1

Page 19:

XOR: Decision boundaries

• Decision boundary at y(x) = −x_1 x_2 = 0
• Support vectors lie on the margins, where y(x) = −x_1 x_2 = ±1

[Figure: decision boundary and margins, with the region where y(x) > 0 indicated]