SVM ICML'01 Tutorial
TRANSCRIPT
-
Support Vector and Kernel Machines
Nello Cristianini, BIOwulf Technologies
http://www.support-vector.net/tutorial.html
ICML 2001
-
A Little History
• SVMs introduced in COLT-92 by Boser, Guyon, Vapnik. Greatly developed ever since.
• Initially popularized in the NIPS community, now an important and active field of all Machine Learning research.
• Special issues of Machine Learning Journal, and Journal of Machine Learning Research.
• Kernel Machines: large class of learning algorithms, SVMs a particular instance.
-
A Little History
• Annual workshop at NIPS
• Centralized website: www.kernel-machines.org
• Textbook (2000): see www.support-vector.net
• Now: a large and diverse community: from machine learning, optimization, statistics, neural networks, functional analysis, etc.
• Successful applications in many fields (bioinformatics, text, handwriting recognition, etc.)
• Fast expanding field, EVERYBODY WELCOME!
-
Preliminaries
• Task of this class of algorithms: detect and exploit complex patterns in data (e.g. by clustering, classifying, ranking, cleaning, etc. the data)
• Typical problems: how to represent complex patterns; and how to exclude spurious (unstable) patterns (= overfitting)
• The first is a computational problem; the second a statistical problem.
-
Very Informal Reasoning
• The class of kernel methods implicitly defines the class of possible patterns by introducing a notion of similarity between data
• Example: similarity between documents
  • By length
  • By topic
  • By language
• Choice of similarity → Choice of relevant features
-
More formal reasoning
• Kernel methods exploit information about the inner products between data items
• Many standard algorithms can be rewritten so that they only require inner products between data (inputs)
• Kernel functions = inner products in some feature space (potentially very complex)
• If the kernel is given, no need to specify what features of the data are being used
-
Just in case
• Inner product between vectors: $\langle x, z \rangle = \sum_i x_i z_i$
• Hyperplane: $\langle w, x \rangle + b = 0$
[Figure: points of two classes (x, o) separated by a hyperplane with normal w and offset b]
-
Overview of the Tutorial
• Introduce basic concepts with extended example of Kernel Perceptron
• Derive Support Vector Machines
• Other kernel based algorithms
• Properties and Limitations of Kernels
• On Kernel Alignment
• On Optimizing Kernel Alignment
-
Parts I and II: overview
• Linear Learning Machines (LLM)
• Kernel Induced Feature Spaces
• Generalization Theory
• Optimization Theory
• Support Vector Machines (SVM)
-
Modularity
• Any kernel-based learning algorithm is composed of two modules:
  • A general purpose learning machine
  • A problem specific kernel function
• Any kernel-based algorithm can be fitted with any kernel
• Kernels themselves can be constructed in a modular way
• Great for software engineering (and for analysis)
IMPORTANT CONCEPT
-
1 - Linear Learning Machines
• Simplest case: classification. Decision function is a hyperplane in input space
• The Perceptron Algorithm (Rosenblatt, 57)
• Useful to analyze the Perceptron algorithm, before looking at SVMs and Kernel Methods in general
-
Basic Notation
• Input space: $x \in X$
• Output space: $y \in Y = \{-1, +1\}$
• Hypothesis: $h \in H$
• Real-valued: $f : X \to \mathbb{R}$
• Training Set: $S = \{(x_1, y_1), \ldots, (x_i, y_i), \ldots\}$
• Test error: $\epsilon$
• Dot product: $\langle x, z \rangle$
-
Perceptron
• Linear separation of the input space:
$f(x) = \langle w, x \rangle + b$
$h(x) = \mathrm{sign}(f(x))$
[Figure: points of two classes (x, o) separated by the hyperplane defined by w and b]
-
Perceptron Algorithm
Update rule (ignoring threshold):
• if $y_i \langle w_k, x_i \rangle \le 0$ then
$w_{k+1} \leftarrow w_k + \eta\, y_i x_i$
$k \leftarrow k + 1$
[Figure: separating hyperplane being updated on a mistake]
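A minimal Python (NumPy) sketch of this update loop; the data X, y, the learning rate eta, and the epoch cap are illustrative assumptions, and the threshold b is kept here rather than ignored:

    import numpy as np

    def perceptron(X, y, eta=1.0, epochs=100):
        # X: (m, n) array of inputs; y: labels in {-1, +1}
        w, b = np.zeros(X.shape[1]), 0.0
        for _ in range(epochs):
            mistakes = 0
            for xi, yi in zip(X, y):
                if yi * (np.dot(w, xi) + b) <= 0:  # mistake-driven: update only on errors
                    w += eta * yi * xi             # w_{k+1} <- w_k + eta * y_i * x_i
                    b += eta * yi
                    mistakes += 1
            if mistakes == 0:                      # converged on linearly separable data
                break
        return w, b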
-
Observations
• Solution is a linear combination of training points: $w = \sum_i \alpha_i y_i x_i$, with $\alpha_i \ge 0$
• Only used informative points (mistake driven)
• The coefficient of a point in the combination reflects its difficulty
-
Observations - 2
• Mistake bound: $M \le \left(\frac{R}{\gamma}\right)^2$
• coefficients are non-negative
• possible to rewrite the algorithm using this alternative representation
[Figure: separating hyperplane with margin γ; R is the radius of the ball containing the data]
-
Dual Representation
The decision function can be re-written as
follows:
$f(x) = \langle w, x \rangle + b = \sum_i \alpha_i y_i \langle x_i, x \rangle + b$
$w = \sum_i \alpha_i y_i x_i$
IMPORTANT CONCEPT
-
Dual Representation
• The update rule can also be rewritten as follows:
• if $y_i \left( \sum_j \alpha_j y_j \langle x_j, x_i \rangle + b \right) \le 0$ then $\alpha_i \leftarrow \alpha_i + \eta$
• Note: in dual representation, data appears only inside dot products
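A sketch of the same algorithm in dual form (illustrative names; the threshold update is a common variant, added here as an assumption). Note the data enters only through the matrix of dot products, which is what makes the kernel substitution possible later:

    import numpy as np

    def dual_perceptron(X, y, eta=1.0, epochs=100):
        m = len(y)
        alpha, b = np.zeros(m), 0.0
        G = X @ X.T                                  # all pairwise dot products <x_j, x_i>
        for _ in range(epochs):
            for i in range(m):
                if y[i] * (np.sum(alpha * y * G[:, i]) + b) <= 0:
                    alpha[i] += eta                  # alpha_i <- alpha_i + eta
                    b += eta * y[i]                  # threshold update (common variant)
        return alpha, b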
-
Duality: First Property of SVMs
• DUALITY is the first feature of Support Vector Machines
• SVMs are Linear Learning Machines represented in a dual fashion
• Data appear only within dot products (in decision function and in training algorithm):
$f(x) = \langle w, x \rangle + b = \sum_i \alpha_i y_i \langle x_i, x \rangle + b$
-
Limitations of LLMs
Linear classifiers cannot deal with
• Non-linearly separable data
• Noisy data
• + this formulation only deals with vectorial data
-
Non-Linear Classifiers
• One solution: creating a net of simple linear classifiers (neurons): a Neural Network (problems: local minima; many parameters; heuristics needed to train; etc.)
• Other solution: map data into a richer feature space including non-linear features, then use a linear classifier
-
Learning in the Feature Space
• Map data into a feature space where they are linearly separable: $x \mapsto \phi(x)$
[Figure: non-separable points in input space X mapped by φ into feature space F, where the images φ(x), φ(o) are linearly separable]
-
Problems with Feature Space
• Working in high dimensional feature spaces solves the problem of expressing complex functions
BUT:
• There is a computational problem (working with very large vectors)
• And a generalization theory problem (curse of dimensionality)
-
Implicit Mapping to Feature Space
We will introduce Kernels:
• Solve the computational problem of working with many dimensions
• Can make it possible to use infinite dimensions efficiently in time / space
• Other advantages, both practical and conceptual
-
Kernel-Induced Feature Spaces
• In the dual representation, the data points only appear inside dot products:
$f(x) = \sum_i \alpha_i y_i \langle \phi(x_i), \phi(x) \rangle + b$
• The dimensionality of the space F is not necessarily important. We may not even know the map φ.
-
Kernels
• A function that returns the value of the dot product between the images of the two arguments:
$K(x_1, x_2) = \langle \phi(x_1), \phi(x_2) \rangle$
• Given a function K, it is possible to verify that it is a kernel
IMPORTANT CONCEPT
-
Kernels
• One can use LLMs in a feature space by simply rewriting them in dual representation and replacing dot products with kernels:
$\langle x_1, x_2 \rangle \rightarrow K(x_1, x_2) = \langle \phi(x_1), \phi(x_2) \rangle$
-
The Kernel Matrix
• (aka the Gram matrix):

        | K(1,1)  K(1,2)  K(1,3)  ...  K(1,m) |
        | K(2,1)  K(2,2)  K(2,3)  ...  K(2,m) |
    K = |  ...     ...     ...    ...   ...   |
        | K(m,1)  K(m,2)  K(m,3)  ...  K(m,m) |

IMPORTANT CONCEPT
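A small helper, as a sketch, for building this matrix from data and any kernel function (function names are illustrative):

    import numpy as np

    def gram_matrix(X, kernel):
        # K[i, j] = kernel(x_i, x_j); symmetric whenever the kernel is
        m = len(X)
        K = np.zeros((m, m))
        for i in range(m):
            for j in range(i, m):
                K[i, j] = K[j, i] = kernel(X[i], X[j])
        return K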
-
The Kernel Matrix
• The central structure in kernel machines
• Information bottleneck: contains all the necessary information for the learning algorithm
• Fuses information about the data AND the kernel
• Many interesting properties
-
Mercer's Theorem
• The kernel matrix is Symmetric Positive Definite
• Any symmetric positive definite matrix can be regarded as a kernel matrix, that is, as an inner product matrix in some space
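This gives a practical test: a candidate Gram matrix can be checked numerically via symmetry plus non-negative eigenvalues. A sketch (the tolerance is an assumption to absorb floating-point error):

    import numpy as np

    def is_valid_gram(K, tol=1e-10):
        # Mercer: symmetric and all eigenvalues >= 0
        if not np.allclose(K, K.T):
            return False
        return np.linalg.eigvalsh(K).min() >= -tol  # eigvalsh: for symmetric matrices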
-
More Formally: Mercer's Theorem
• Every (semi) positive definite, symmetric function is a kernel: i.e. there exists a mapping φ such that it is possible to write:
$K(x_1, x_2) = \langle \phi(x_1), \phi(x_2) \rangle$
• Pos. Def.: $\int K(x, z)\, f(x)\, f(z)\, dx\, dz \ge 0 \quad \forall f \in L_2$
-
Mercer's Theorem
• Eigenvalue expansion of Mercer's kernels:
$K(x_1, x_2) = \sum_i \lambda_i\, \phi_i(x_1)\, \phi_i(x_2)$
• That is: the eigenfunctions act as features!
-
Examples of Kernels
• Simple examples of kernels are:
$K(x, z) = \langle x, z \rangle^d$
$K(x, z) = e^{-\|x - z\|^2 / (2\sigma^2)}$
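These two kernels as short Python functions (a sketch; d and sigma are hyperparameters to be chosen):

    import numpy as np

    def polynomial_kernel(x, z, d=2):
        return np.dot(x, z) ** d                    # K(x, z) = <x, z>^d

    def gaussian_kernel(x, z, sigma=1.0):
        # K(x, z) = exp(-||x - z||^2 / (2 sigma^2))
        return np.exp(-np.linalg.norm(x - z) ** 2 / (2 * sigma ** 2))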
-
Example: Polynomial Kernels
$x = (x_1, x_2); \quad z = (z_1, z_2);$
$\langle x, z \rangle^2 = (x_1 z_1 + x_2 z_2)^2 =$
$= x_1^2 z_1^2 + x_2^2 z_2^2 + 2 x_1 z_1 x_2 z_2 =$
$= \langle (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2),\; (z_1^2, z_2^2, \sqrt{2}\, z_1 z_2) \rangle =$
$= \langle \phi(x), \phi(z) \rangle$
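The identity can be checked numerically; a sketch with the explicit degree-2 feature map on R^2 (the sample points are arbitrary):

    import numpy as np

    def phi(x):
        # explicit feature map: (x1^2, x2^2, sqrt(2) x1 x2)
        return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

    x, z = np.array([1.0, 2.0]), np.array([3.0, 0.5])
    assert np.isclose(np.dot(x, z) ** 2, np.dot(phi(x), phi(z)))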
-
Example: Polynomial Kernels
-
Example: the two spirals
• Separated by a hyperplane in feature space (Gaussian kernels)
-
Making Kernels
• The set of kernels is closed under some operations. If K, K′ are kernels, then:
  • K + K′ is a kernel
  • cK is a kernel, if c > 0
  • aK + bK′ is a kernel, for a, b > 0
  • etc.
• We can make complex kernels from simple ones: modularity! (see the sketch below)
IMPORTANT CONCEPT
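A sketch of how these closure rules look in code (illustrative helper; it simply returns a new kernel function built from two existing ones, e.g. the polynomial and Gaussian kernels above):

    def conic_combination(k1, k2, a=1.0, b=1.0):
        # a*K + b*K' is a kernel whenever K, K' are kernels and a, b > 0
        assert a > 0 and b > 0
        return lambda x, z: a * k1(x, z) + b * k2(x, z)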
-
Second Property of SVMs:
SVMs are Linear Learning Machines, that
• Use a dual representation
AND
• Operate in a kernel induced feature space (that is:
$f(x) = \sum_i \alpha_i y_i \langle \phi(x_i), \phi(x) \rangle + b$
is a linear function in the feature space implicitly defined by K)
-
Kernels over General Structures
• Haussler, Watkins, etc.: kernels over sets, over sequences, over trees, etc.
• Applied in text categorization, bioinformatics, etc.
-
A bad kernel
• would be a kernel whose kernel matrix is mostly diagonal: all points orthogonal to each other, no clusters, no structure

    | 1  0  0  ...  0 |
    | 0  1  0  ...  0 |
    | 0  0  1  ...  0 |
    | 0  0  0  ...  1 |
-
No Free Kernel
• If mapping in a space with too many irrelevant features, the kernel matrix becomes diagonal
• Need some prior knowledge of the target to choose a good kernel
IMPORTANT CONCEPT
-
BREAK
-
The Generalization Problem
• The curse of dimensionality: it is easy to overfit in high dimensional spaces (= regularities could be found in the training set that are accidental, that is, that would not be found again in a test set)
• The SVM problem is ill-posed (finding one hyperplane that separates the data: many such hyperplanes exist)
• Need a principled way to choose the best possible hyperplane
NEW TOPIC
-
The Generalization Problem
• Many methods exist to choose a good hyperplane (inductive principles)
• Bayes, statistical learning theory / pac, MDL, …
• Each can be used. We will focus on a simple case motivated by statistical learning theory (will give the basic SVM)
-
Statistical (Computational) Learning Theory
• Generalization bounds on the risk of overfitting (in a p.a.c. setting: assumption of i.i.d. data; etc.)
• Standard bounds from VC theory give upper and lower bounds proportional to the VC dimension
• VC dimension of LLMs proportional to dimension of space (can be huge)
-
Assumptions and Definitions
• distribution D over input space X
• train and test points drawn randomly (i.i.d.) from D
• training error of h: fraction of points in S misclassified by h
• test error of h: probability under D to misclassify a point x
• VC dimension: size of largest subset of X shattered by H (every dichotomy implemented)
-
VC Bounds
$\epsilon = \tilde{O}\left(\frac{VC}{m}\right)$
VC = (number of dimensions of X) + 1
Typically VC >> m, so not useful
Does not tell us which hyperplane to choose
-
Margin Based Bounds
$\epsilon = \tilde{O}\left(\frac{(R/\gamma)^2}{m}\right)$
$\gamma = \min_i y_i f(x_i), \qquad \|f\| = 1$
Note: also compression bounds exist; and online bounds.
-
Margin Based Bounds
• The worst case bound still holds but, if we are lucky (the margin is large), the other bound can be applied and better generalization can be achieved:
$\epsilon = \tilde{O}\left(\frac{(R/\gamma)^2}{m}\right)$
• Best hyperplane: the maximal margin one
• The margin is large if the kernel is chosen well
IMPORTANT CONCEPT
-
Maximal Margin Classifier
• Minimize the risk of overfitting by choosing the maximal margin hyperplane in feature space
• Third feature of SVMs: maximize the margin
• SVMs control capacity by increasing the margin, not by reducing the number of degrees of freedom (dimension free capacity control).
-
Two kinds of margin
• Functional and geometric margin:
$\gamma_{\mathrm{funct}} = \min_i y_i f(x_i)$
$\gamma_{\mathrm{geom}} = \min_i \frac{y_i f(x_i)}{\|f\|}$
[Figure: separating hyperplane with margin γ]
-
Two kinds of margin
-
Max Margin = Minimal Norm
• If we fix the functional margin to 1, the geometric margin equals 1/||w||
• Hence, maximize the margin by minimizing the norm
-
Max Margin = Minimal Norm
Distance between the two convex hulls:
$\langle w, x^+ \rangle + b = +1$
$\langle w, x^- \rangle + b = -1$
$\langle w, (x^+ - x^-) \rangle = 2$
$\left\langle \frac{w}{\|w\|}, (x^+ - x^-) \right\rangle = \frac{2}{\|w\|}$
[Figure: margin γ as the distance between the convex hulls of the two classes]
-
The primal problem
• Minimize: $\langle w, w \rangle$
subject to: $y_i \left( \langle w, x_i \rangle + b \right) \ge 1$
IMPORTANT STEP
-
Optimization Theory
• The problem of finding the maximal margin hyperplane: constrained optimization (quadratic programming)
• Use Lagrange theory (or Kuhn-Tucker Theory)
• Lagrangian:
$L = \frac{1}{2} \langle w, w \rangle - \sum_i \alpha_i \left[ y_i \left( \langle w, x_i \rangle + b \right) - 1 \right], \qquad \alpha_i \ge 0$
-
From Primal to Dual
$L(w, b, \alpha) = \frac{1}{2} \langle w, w \rangle - \sum_i \alpha_i \left[ y_i \left( \langle w, x_i \rangle + b \right) - 1 \right], \qquad \alpha_i \ge 0$
Differentiate and substitute:
$\frac{\partial L}{\partial b} = 0, \qquad \frac{\partial L}{\partial w} = 0$
-
The Dual Problem
• Maximize:
$W(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle$
• Subject to:
$\alpha_i \ge 0, \qquad \sum_i \alpha_i y_i = 0$
• The duality again! Can use kernels!
IMPORTANT STEP
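In practice this dual can be handed to an off-the-shelf QP solver. A sketch using the cvxopt package (an assumption; any QP solver would do), minimizing -W(α) in cvxopt's standard form, given a precomputed Gram matrix K and labels y in {-1, +1}:

    import numpy as np
    from cvxopt import matrix, solvers

    def svm_dual(K, y):
        # standard form: minimize (1/2) a'Pa + q'a  s.t.  Ga <= h,  Aa = b
        m = len(y)
        P = matrix(np.outer(y, y) * K)                  # P_ij = y_i y_j K(x_i, x_j)
        q = matrix(-np.ones(m))                         # linear term: -sum_i alpha_i
        G = matrix(-np.eye(m))                          # -alpha_i <= 0, i.e. alpha_i >= 0
        h = matrix(np.zeros(m))
        A = matrix(y.reshape(1, -1).astype(np.double))  # sum_i alpha_i y_i = 0
        b = matrix(0.0)
        return np.ravel(solvers.qp(P, q, G, h, A, b)['x'])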
-
Convexity
• This is a Quadratic Optimization problem: convex, no local minima (second effect of Mercer's conditions)
• Solvable in polynomial time
• (convexity is another fundamental property of SVMs)
IMPORTANT CONCEPT
-
Kuhn-Tucker Theorem
Properties of the solution:
• Duality: can use kernels
$w = \sum_i \alpha_i y_i x_i$
• KKT conditions:
$\alpha_i \left[ y_i \left( \langle w, x_i \rangle + b \right) - 1 \right] = 0, \qquad \forall i$
• Sparseness: only the points nearest to the hyperplane (margin = 1) have positive weight
• They are called support vectors
-
KKT Conditions Imply Sparseness
[Figure: only the points on the margin (the support vectors) carry positive weight]
Sparseness: another fundamental property of SVMs
-
Properties of SVMs - Summary
✓ Duality
✓ Kernels
✓ Margin
✓ Convexity
✓ Sparseness
-
Dealing with noise
In the case of non-separable data in feature space, the margin distribution can be optimized:
$\epsilon = \tilde{O}\left(\frac{1}{m}\,\frac{R^2 + \|\xi\|_2^2}{\gamma^2}\right)$
$y_i \left( \langle w, x_i \rangle + b \right) \ge 1 - \xi_i$
-
The Soft-Margin Classifier
Minimize:
$\frac{1}{2} \langle w, w \rangle + C \sum_i \xi_i$
Or:
$\frac{1}{2} \langle w, w \rangle + C \sum_i \xi_i^2$
Subject to:
$y_i \left( \langle w, x_i \rangle + b \right) \ge 1 - \xi_i, \qquad \xi_i \ge 0$
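This soft-margin problem is what standard libraries solve. A usage sketch with scikit-learn's SVC (an assumption that the library is available; X_train, y_train are hypothetical data, and the C and gamma values are illustrative):

    from sklearn.svm import SVC

    # Small C: wide margin, many violations tolerated;
    # large C: approaches the hard-margin classifier.
    clf = SVC(kernel='rbf', C=10.0, gamma=0.5)
    clf.fit(X_train, y_train)      # X_train, y_train: hypothetical data
    print(clf.support_)            # indices of the support vectors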
-
Slack Variables
$\epsilon = \tilde{O}\left(\frac{1}{m}\,\frac{R^2 + \|\xi\|_2^2}{\gamma^2}\right)$
$y_i \left( \langle w, x_i \rangle + b \right) \ge 1 - \xi_i$
[Figure: slack variables ξ measure by how much each point fails to reach the margin]
-
Soft Margin - Dual Lagrangian
• Box constraints (1-norm):
$W(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle$
$0 \le \alpha_i \le C, \qquad \sum_i \alpha_i y_i = 0$
• Diagonal (2-norm):
$W(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \left( \langle x_i, x_j \rangle + \frac{1}{C}\,\delta_{ij} \right)$
$\alpha_i \ge 0, \qquad \sum_i \alpha_i y_i = 0$
-
The regression case
• For regression, all the above properties are retained, introducing the ε-insensitive loss:
[Figure: ε-insensitive loss L as a function of y - f(x): zero inside the ε-tube, linear outside]
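The ε-insensitive loss itself is one line (a sketch; ε is a user-chosen tube width):

    import numpy as np

    def eps_insensitive_loss(y_true, y_pred, eps=0.1):
        # zero inside the eps-tube, linear outside it
        return np.maximum(0.0, np.abs(y_true - y_pred) - eps)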
-
Regression: the ε-tube
-
Implementation Techniques
• Maximizing a quadratic function, subject to a linear equality constraint (and inequalities as well):
$W(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$
$\alpha_i \ge 0, \qquad \sum_i \alpha_i y_i = 0$
-
Simple Approximation
• Initially, complex QP packages were used.
• Stochastic Gradient Ascent (sequentially update one weight at a time) gives an excellent approximation in most cases:
$\alpha_i \leftarrow \alpha_i + \frac{1}{K(x_i, x_i)} \left( 1 - y_i \sum_j \alpha_j y_j K(x_i, x_j) \right)$
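A sketch of this sequential update in Python (kernel-adatron style; the non-negativity clip is the standard projection onto the constraint α_i ≥ 0, added here as an assumption):

    import numpy as np

    def sga_dual(K, y, epochs=100):
        m = len(y)
        alpha = np.zeros(m)
        for _ in range(epochs):
            for i in range(m):                      # one weight at a time
                g = 1.0 - y[i] * np.sum(alpha * y * K[i])
                alpha[i] = max(0.0, alpha[i] + g / K[i, i])  # keep alpha_i >= 0
        return alpha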
-
Full Solution: S.M.O.
• SMO: update two weights simultaneously
• Realizes gradient ascent without leaving the linear constraint (J. Platt).
• Online versions exist (Li-Long; Gentile)
-
Other kernelized Algorithms
• Adatron, nearest neighbour, Fisher discriminant, Bayes classifier, ridge regression, etc.
• Much work in past years went into designing kernel based algorithms
• Now: more work on designing good kernels (for any algorithm)
-
On Combining Kernels
• When is it advantageous to combine kernels?
• Too many features leads to overfitting also in kernel methods
• Kernel combination needs to be based on principles
• Alignment
-
Kernel Alignment
• Notion of similarity between kernels: Alignment (= similarity between Gram matrices):
$A(K_1, K_2) = \frac{\langle K_1, K_2 \rangle}{\sqrt{\langle K_1, K_1 \rangle \langle K_2, K_2 \rangle}}$
IMPORTANT CONCEPT
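A sketch of the alignment computation, using Frobenius inner products between Gram matrices:

    import numpy as np

    def alignment(K1, K2):
        inner = lambda A, B: np.sum(A * B)          # Frobenius inner product <K1, K2>
        return inner(K1, K2) / np.sqrt(inner(K1, K1) * inner(K2, K2))

    # alignment to the ideal kernel of labels y: alignment(K, np.outer(y, y))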
-
Many interpretations
• As a measure of clustering in data
• As a correlation coefficient between oracles
• Basic idea: the ultimate kernel should be YY′, that is, it should be given by the labels vector (after all: the target is the only relevant feature!)
-
The ideal kernel
          |  1   1  -1  -1 |
    YY′ = |  1   1  -1  -1 |
          | -1  -1   1   1 |
          | -1  -1   1   1 |
-
Combining Kernels
• Alignment is increased by combining kernels that are aligned to the target and not aligned to each other:
$A(K_1, YY') = \frac{\langle K_1, YY' \rangle}{\sqrt{\langle K_1, K_1 \rangle \langle YY', YY' \rangle}}$
-
Spectral Machines
• Can (approximately) maximize the alignment of a set of labels to a given kernel
• By solving this problem:
$y = \arg\max_{y \in \{-1, +1\}^m} \frac{y' K y}{y' y}$
• Approximated by the principal eigenvector (thresholded) (see Courant-Hilbert theorem)
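A sketch of this approximation: take the eigenvector of K with the largest eigenvalue and threshold it to ±1 labels:

    import numpy as np

    def spectral_labels(K):
        eigvals, eigvecs = np.linalg.eigh(K)        # eigenvalues in ascending order
        v = eigvecs[:, -1]                          # principal eigenvector
        return np.where(v >= 0, 1, -1)              # threshold to labels in {-1, +1}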
-
Courant-Hilbert theorem
• A: symmetric and positive definite
• Principal eigenvalue / eigenvector characterized by:
$\lambda = \max_v \frac{v' A v}{v' v}$
-
Optimizing Kernel Alignment
• One can either adapt the kernel to the labels or vice versa
• In the first case: model selection method
• Second case: clustering / transduction method
-
Applications of SVMs
• Bioinformatics
• Machine Vision
• Text Categorization
• Handwritten Character Recognition
• Time series analysis
-
Text Kernels
• Joachims (bag of words)
• Latent semantic kernels (ICML 2001)
• String matching kernels
• …
• See the KerMIT project
-
Bioinformatics
• Gene Expression
• Protein sequences
• Phylogenetic Information
• Promoters
• …
-
Conclusions:
• Much more than just a replacement for neural networks.
• General and rich class of pattern recognition methods
• Book on SVMs: www.support-vector.net
• Kernel machines website: www.kernel-machines.org
• www.NeuroCOLT.org