Basis Expansion and Nonlinear SVM
Kai Yu
Linear Classifiers
• Helpful for learning more general cases, e.g., nonlinear models
$$f(x) = w^\top x + b, \qquad z(x) = \mathrm{sign}(f(x))$$
Nonlinear Classifiers via Basis Expansion
• Nonlinear basis functions: $h(x) = [h_1(x), h_2(x), \ldots, h_m(x)]$
• $f(x) = w^\top x + b$ is the special case where $h(x) = x$
• This framework covers many classification models, including SVMs.
$$f(x) = w^\top h(x) + b, \qquad z(x) = \mathrm{sign}(f(x))$$
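To make this concrete, here is a minimal numpy sketch (my addition, not from the lecture) showing how a fixed basis expansion $h(x) = [x, x^2]$ lets an ordinary linear training procedure separate 1-D data that no single threshold on $x$ can handle. The data and the regularized least-squares fit are illustrative choices.

```python
import numpy as np

# 1-D data that no single threshold on x can separate:
# positives in the middle, negatives at both extremes.
x = np.array([-3.0, -2.5, -2.0, -0.5, 0.0, 0.5, 2.0, 2.5, 3.0])
y = np.array([-1, -1, -1, 1, 1, 1, -1, -1, -1])

# Basis expansion h(x) = [x, x^2]: a linear function of h(x)
# is a quadratic function of x, so it can carve out the middle.
H = np.column_stack([x, x ** 2])
A = np.column_stack([H, np.ones_like(x)])    # append a bias column

# Regularized least squares stands in for any linear trainer.
lam = 1e-3
w = np.linalg.solve(A.T @ A + lam * np.eye(3), A.T @ y)
print(np.sign(A @ w) == y)                   # all True after expansion
```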
Outline
§ Representation theorem
§ Kernel trick
§ Understand regularization
§ Nonlinear logistic regression
§ General basis expansion functions
§ Summary
Review the QP for linear SVMs
§ After a lot of “stuff”, we obtain the Lagrange dual (slide: Trevor Hastie, Stanford University, January 2003)
$$L_D = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{i'=1}^{N} \alpha_i \alpha_{i'} y_i y_{i'} x_i^\top x_{i'}$$
which we maximize subject to constraints (involving B as well).
§ The solution is expressed in terms of the fitted Lagrange multipliers $\hat{\alpha}_i$:
$$w = \sum_{i=1}^{N} \hat{\alpha}_i y_i x_i$$
Some fraction of the $\hat{\alpha}_i$ are exactly zero (from the KKT conditions); the $x_i$ for which $\hat{\alpha}_i > 0$ are called the support points $S$, and
$$\hat{f}(x) = x^\top \hat{w} + \hat{b} = \sum_{i \in S} \hat{\alpha}_i y_i x^\top x_i + \hat{b}$$
§ In other words, the solution $w$ lies in $\mathrm{span}(x_1, x_2, \ldots, x_N)$.
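As a numeric sanity check (my addition, not from the slides), scikit-learn's SVC exposes exactly these quantities: dual_coef_ holds $\hat{\alpha}_i y_i$ for the support points, so $w = \sum_i \hat{\alpha}_i y_i x_i$ can be reassembled and compared with the primal coef_.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two well-separated Gaussian blobs in R^2.
X = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# dual_coef_ holds alpha_i * y_i for the support points only, so
# w = sum_i alpha_i y_i x_i is dual_coef_ @ support_vectors_.
w = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w, clf.coef_))   # True: w lies in span(x_1, ..., x_N)
```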
A more general result – the RKHS representation theorem (Wahba, 1971)
§ In its simplest form: if $L(w^\top x, y)$ is convex w.r.t. $w$, the solution of
$$\min_w \sum_{i=1}^{N} L(w^\top x_i, y_i) + \lambda \|w\|^2$$
has the form
$$w = \sum_{i=1}^{N} \alpha_i x_i$$
§ Proof sketch: write $w = w_\parallel + w_\perp$ with $w_\perp$ orthogonal to $\mathrm{span}(x_1, \ldots, x_N)$; $w_\perp$ leaves every $w^\top x_i$ unchanged but only increases $\|w\|^2$, so the optimum has $w_\perp = 0$.
§ Note: the conclusion is general, not only for SVMs.
For general basis expansion functions
The solution of
$$\min_w \sum_{i=1}^{N} L\big(w^\top h(x_i), y_i\big) + \lambda \|w\|^2$$
has the form
$$w = \sum_{i=1}^{N} \alpha_i h(x_i)$$
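A quick way to see this theorem in action (an illustrative sketch, using squared loss so the solution is closed-form): solve the regularized problem directly in the expanded feature space, then verify that the same $w$ is recovered as a combination of the $h(x_i)$. The basis expansion used here is a hypothetical choice.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, m = 50, 3, 6
X = rng.normal(size=(N, d))
y = rng.normal(size=N)

# A hypothetical basis expansion h(x) in R^6: raw coords plus squares.
H = np.hstack([X, X ** 2])                       # rows are h(x_i)

lam = 0.5
# Primal ridge solution: w = (H^T H + lam I)^{-1} H^T y
w_primal = np.linalg.solve(H.T @ H + lam * np.eye(m), H.T @ y)

# Representer form: w = sum_i alpha_i h(x_i), where for squared loss
# alpha = (H H^T + lam I)^{-1} y in closed form.
alpha = np.linalg.solve(H @ H.T + lam * np.eye(N), y)
w_span = H.T @ alpha
print(np.allclose(w_primal, w_span))             # True
```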
Outline
§ Representation theorem
§ Kernel trick
§ Understand regularization
§ Nonlinear logistic regression
§ General basis expansion functions
§ Summary
Kernel
§ Define the Mercer kernel as
$$k(x_i, x_j) = h(x_i)^\top h(x_j)$$
Kernel trick
§ Apply the representation theorem
$$w = \sum_{i=1}^{N} \alpha_i h(x_i)$$
§ Then
$$f(x) = \sum_{i=1}^{N} \alpha_i k(x_i, x), \qquad \|w\|^2 = \sum_{i,j=1}^{N} \alpha_i \alpha_j k(x_i, x_j) = \alpha^\top K \alpha$$
§ and we have
$$\min_\alpha \sum_{i=1}^{N} L\Big(\sum_{j=1}^{N} \alpha_j k(x_j, x_i),\; y_i\Big) + \lambda\, \alpha^\top K \alpha$$
Primal and Kernel formulations
§ Primal formulation:
$$\min_w \sum_{i=1}^{N} L\big(w^\top h(x_i),\; y_i\big) + \lambda \|w\|^2$$
§ Kernel formulation, with $k(x_i, x_j) = h(x_i)^\top h(x_j)$:
$$\min_\alpha \sum_{i=1}^{N} L\Big(\sum_{j=1}^{N} \alpha_j k(x_j, x_i),\; y_i\Big) + \lambda\, \alpha^\top K \alpha$$
§ Given a kernel, we don’t even need h(x)! …really?
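Here is a small sketch of that claim, with the loss swapped for squared loss to keep a closed form (kernel ridge regression); the rbf_kernel helper and the sin target are illustrative assumptions. The RBF feature map h(x) is infinite-dimensional, yet training and prediction only ever touch kernel values.

```python
import numpy as np

def rbf_kernel(A, B, c=1.0):
    # k(x, x') = exp(-||x - x'||^2 / c); h(x) here is infinite-dimensional,
    # but we never need it -- only the kernel values.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / c)

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, (40, 1))
y = np.sin(X[:, 0])                      # a nonlinear target

lam = 1e-2
K = rbf_kernel(X, X)
# Squared loss makes the kernelized objective solvable in closed form:
# alpha = (K + lam I)^{-1} y, and f(x) = sum_i alpha_i k(x_i, x).
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

X_test = np.array([[0.5], [1.5]])
f_test = rbf_kernel(X_test, X) @ alpha
print(f_test, np.sin(X_test[:, 0]))      # predictions track sin(x)
```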
Popular kernels
§ k(x, x′) is a symmetric, positive (semi-)definite function
§ Examples (slide: Trevor Hastie, Stanford University, January 2003):
dth-degree polynomial: $K(x, x') = (1 + \langle x, x' \rangle)^d$
Radial basis: $K(x, x') = \exp(-\|x - x'\|^2 / c)$
§ Example: 2nd-degree polynomial in $\mathbb{R}^2$:
$$K(x, x') = (1 + \langle x, x' \rangle)^2 = (1 + x_1 x'_1 + x_2 x'_2)^2 = 1 + 2x_1x'_1 + 2x_2x'_2 + (x_1x'_1)^2 + (x_2x'_2)^2 + 2x_1x'_1x_2x'_2$$
Then $M = 6$, and if we choose $h_1(x) = 1$, $h_2(x) = \sqrt{2}\,x_1$, $h_3(x) = \sqrt{2}\,x_2$, $h_4(x) = x_1^2$, $h_5(x) = x_2^2$, and $h_6(x) = \sqrt{2}\,x_1x_2$, then $K(x, x') = \langle h(x), h(x') \rangle$.
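The slide's expansion is easy to verify numerically. In the sketch below, h and K are hypothetical helpers implementing the six features and the 2nd-degree polynomial kernel above.

```python
import numpy as np

def h(x):
    # The M = 6 explicit features from the slide (hypothetical helper).
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

def K(x, xp):
    # 2nd-degree polynomial kernel in R^2.
    return (1.0 + x @ xp) ** 2

rng = np.random.default_rng(3)
x, xp = rng.normal(size=2), rng.normal(size=2)
print(np.isclose(K(x, xp), h(x) @ h(xp)))   # True
```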
Non-linear feature mapping
§ Some datasets are linearly separable as given
§ But what if the dataset is just too hard?
§ How about mapping the data to a higher-dimensional space, e.g., x → (x, x²)?
[Figure: 1-D data, inseparable on the x axis, becomes linearly separable after adding the x² coordinate.]
Nonlinear feature mapping
§ General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable:
h: x → h(x)
Outline
§ Representation theorem
§ Kernel trick
§ Understand regularization
§ Nonlinear logistic regression
§ General basis expansion functions
§ Summary
Various equivalent formulations
§ Parametric form
$$\min_w \sum_{i=1}^{N} L\big(w^\top h(x_i),\; y_i\big) + \lambda \|w\|^2$$
§ Dual form
$$\min_\alpha \sum_{i=1}^{N} L\Big(\sum_{j=1}^{N} \alpha_j k(x_j, x_i),\; y_i\Big) + \lambda\, \alpha^\top K \alpha$$
§ Nonparametric form
$$\min_f \sum_{i=1}^{N} L\big(f(x_i),\; y_i\big) + \lambda \|f\|^2_{\mathcal{H}_k}$$
Regularization induced by the kernel (or basis functions) – telling what kind of f(x) is preferred
§ A desired kernel is a smoothing operator: smoother eigenfunctions $\phi_i$ tend to have larger eigenvalues $\gamma_i$
§ What does this mean?
Aside: RKHS (slide: Trevor Hastie, Stanford University, January 2003)
Function space $\mathcal{H}_K$ generated by a positive (semi-)definite function $K(x, x')$. Eigen expansion:
$$K(x, y) = \sum_{i=1}^{\infty} \gamma_i \phi_i(x) \phi_i(y)$$
with $\gamma_i \ge 0$, $\sum_{i=1}^{\infty} \gamma_i^2 < \infty$. $f \in \mathcal{H}_K$ if
$$f(x) = \sum_{i=1}^{\infty} c_i \phi_i(x), \qquad c_i = \int \phi_i(t) f(t)\, dt$$
$$\|f\|^2_{\mathcal{H}_K} \stackrel{\mathrm{def}}{=} \sum_{i=1}^{\infty} c_i^2 / \gamma_i < \infty$$
The squared norm $J(f) = \|f\|^2_{\mathcal{H}_K}$ is viewed as a roughness penalty.
Understand regularization
§ If we push down this regularization term $\|f\|^2_{\mathcal{H}_K} = \sum_i c_i^2 / \gamma_i$ …
§ In f(x), minor components $\phi_i(x)$ with smaller $\gamma_i$ are penalized more heavily → principal components are preferred in f(x)!
§ A desired kernel is a smoothing operator, i.e., its principal components are smoother functions → the regularization encourages f(x) to be smooth!
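One way to see this concretely (an illustrative sketch, with the eigenvectors of an RBF kernel matrix on a grid standing in for the kernel's eigenfunctions): the leading eigenvectors are smooth, the trailing ones are wiggly, and the same weight on a small-$\gamma$ component pays a far larger penalty $c_i^2 / \gamma_i$.

```python
import numpy as np

# Kernel matrix of an RBF kernel on a 1-D grid (jitter for stability).
t = np.linspace(0, 1, 100)
K = np.exp(-(t[:, None] - t[None, :]) ** 2 / 0.05) + 1e-10 * np.eye(100)

# Eigenvectors phi_i (columns) stand in for the eigenfunctions,
# eigenvalues gamma_i for the kernel's spectrum; sort descending.
gamma, Phi = np.linalg.eigh(K)
gamma, Phi = gamma[::-1], Phi[:, ::-1]

# Roughness of each eigenvector: mean squared successive difference.
rough = (np.diff(Phi, axis=0) ** 2).mean(axis=0)
print(gamma[:3], rough[:3])        # large eigenvalues, smooth eigenvectors
print(gamma[20:23], rough[20:23])  # small eigenvalues, wiggly eigenvectors

# A function putting weight c on component phi_i pays c^2 / gamma_i:
# the same weight costs far more on a small-gamma (rough) component.
print(1.0 / gamma[0], 1.0 / gamma[20])
```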
Understanding regularization
§ Using what kernel?
§ Using what feature (for a linear model)?
§ Using what h(x)?
§ Using what functional norm $\|f\|^2_{\mathcal{H}_k}$?
All point to one thing – what kind of functions are preferred a priori.
Outline
§ Representation theorem
§ Kernel trick
§ Understand regularization
§ Nonlinear logistic regression
§ General basis expansion functions
§ Summary
Nonlinear Logistic Regression
So far, the things we have discussed, including
– the representation theorem,
– the kernel trick,
– regularization,
are not limited to SVMs. They all apply to logistic regression as well. The only difference is the loss function.
Nonlinear Logistic Regression
§ Parametric form
$$\min_w \sum_{i=1}^{N} \ln\Big(1 + e^{-y_i w^\top h(x_i)}\Big) + \lambda \|w\|^2$$
§ Nonparametric form
$$\min_f \sum_{i=1}^{N} \ln\Big(1 + e^{-y_i f(x_i)}\Big) + \lambda \|f\|^2_{\mathcal{H}_k}$$
Compare with SVM via Loss + Penalty (slide: Trevor Hastie, Stanford University, January 2003):
[Figure: binomial log-likelihood and support vector (hinge) losses plotted against the margin yf(x).]
With $f(x) = h(x)^\top \beta + \beta_0$ and $y_i \in \{-1, 1\}$, consider
$$\min_{\beta_0, \beta} \sum_{i=1}^{N} [1 - y_i f(x_i)]_+ + \lambda \|\beta\|^2$$
The solution is identical to the SVM solution, with $\lambda = \lambda(B)$. In general,
$$\min_{\beta_0, \beta} \sum_{i=1}^{N} L[y_i, f(x_i)] + \lambda \|\beta\|^2$$
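A minimal sketch of the nonparametric (kernel) form, assuming plain gradient descent on the objective above written via the representer form $f = K\alpha$; the rbf helper, the ring-shaped data, the step size, and the iteration count are all illustrative choices.

```python
import numpy as np

def rbf(A, B, c=1.0):
    return np.exp(-((A[:, None, :] - B[None, :, :]) ** 2).sum(-1) / c)

rng = np.random.default_rng(4)
# Two concentric rings: not linearly separable in R^2.
r = np.r_[rng.uniform(0, 1, 30), rng.uniform(2, 3, 30)]
th = rng.uniform(0, 2 * np.pi, 60)
X = np.column_stack([r * np.cos(th), r * np.sin(th)])
y = np.r_[-np.ones(30), np.ones(30)]

K, lam = rbf(X, X), 1e-3
alpha = np.zeros(60)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

# Gradient descent on sum_i ln(1 + exp(-y_i f_i)) + lam * alpha^T K alpha,
# where f = K alpha (representer form of f).
for _ in range(500):
    f = K @ alpha
    g = K @ (-y * sigmoid(-y * f)) + 2 * lam * K @ alpha
    alpha -= 0.01 * g

pred = np.sign(K @ alpha)
print((pred == y).mean())   # training accuracy close to 1.0
```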
Logistic Regression vs. SVM
§ Both can be linear or nonlinear, parametric or nonparametric; the main difference is the loss;
§ They are very similar in performance;
§ Logistic regression outputs probabilities, useful for scoring confidence;
§ Logistic regression is easier to extend to multiple classes.
§ Ten years ago, one was old and the other was new. Now, both are old.
Outline
§ Representation theorem
§ Kernel trick
§ Understand regularization
§ Nonlinear logistic regression
§ General basis expansion functions
§ Summary
Many known classification models follow a similar structure
§ Neural networks
§ RBF networks
§ Learning VQ (LVQ)
§ Boosting
These models all learn w and h(x) together …
Many known classification models follow a similar structure
§ Neural networks
§ RBF networks
§ Learning VQ (LVQ)
§ Boosting
§ SVMs
§ Linear classifiers
§ Logistic regression
§ …
Develop your own stuff!
By deciding
– Which loss function? hinge, least squares, …
– What form of h(x)? RBF, logistic, tree, …
– Finite or infinite h(x)?
– Learning h(x) or not?
– How to optimize? QP, L-BFGS, functional gradient, …
you can obtain various classification algorithms.
Parametric vs. nonparametric models
§ If h(x) is finite-dimensional (m dimensions), we have a parametric model $f(x) = w^\top h(x)$. Training complexity is $O(Nm^3)$.
§ If h(x) is nonlinear and infinite-dimensional, we have to use the kernel trick. This is a nonparametric model; the training complexity is around $O(N^3)$.
§ Nonparametric models, including kernel SVMs, Gaussian processes, Dirichlet processes, etc., are mathematically elegant, but nontrivial for large-scale computation.
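To illustrate the complexity gap (an illustrative sketch with random data and squared loss): the parametric route solves an m×m system while the dual/kernel route solves an N×N system, and both recover the same function when h(x) is finite-dimensional.

```python
import numpy as np
from time import perf_counter

rng = np.random.default_rng(5)
N, m = 3000, 50                      # many samples, few basis functions
H = rng.normal(size=(N, m))          # rows are h(x_i)
y = rng.normal(size=N)
lam = 1.0

# Parametric/primal solve: an m x m system, cheap when m << N.
t0 = perf_counter()
w = np.linalg.solve(H.T @ H + lam * np.eye(m), H.T @ y)
t_primal = perf_counter() - t0

# Nonparametric/dual solve: an N x N system, the O(N^3) bottleneck.
t0 = perf_counter()
alpha = np.linalg.solve(H @ H.T + lam * np.eye(N), y)
t_dual = perf_counter() - t0

print(t_primal, t_dual)              # dual is far slower here
print(np.allclose(w, H.T @ alpha))   # same solution either way
```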
Outline
§ Representation theorem
§ Kernel trick
§ Understand regularization
§ Nonlinear logistic regression
§ General basis expansion functions
§ Summary
Summary
§ Representation theorem and kernels
§ Regularization prefers the principal eigenfunctions of the kernel (induced by the basis functions)
§ Basis expansion - a general framework for classification models, e.g., nonlinear logistic regression, SVMs, …