Linear Methods for Classification
Lecture Notes for CMPUT 466/551
Nilanjan Ray
Linear Classification
• What is meant by linear classification?
– The decision boundaries in the feature (input) space are linear
• Should the regions be contiguous?
[Figure: piecewise linear decision boundaries in a 2D input space (axes X1, X2), partitioning it into regions R1–R4]
Linear Classification…
• There is a discriminant function $\delta_k(x)$ for each class k
• Classification rule: $R_k = \{x : \delta_k(x) = \max_j \delta_j(x)\}$
• In higher dimensional spaces the decision boundaries are piecewise hyperplanar
• Remember that the 0-1 loss function led to the classification rule: $R_k = \{x : \Pr(G = k \mid X = x) = \max_j \Pr(G = j \mid X = x)\}$
• So, $\Pr(G = k \mid X = x)$ can serve as $\delta_k(x)$
Linear Classification…
• All we require here is that the class boundaries $\{x : \delta_k(x) = \delta_j(x)\}$ be linear for every $(k, j)$ pair
• One can achieve this if the $\delta_k(x)$ themselves are linear, or if any monotone transform of $\delta_k(x)$ is linear
– An example:
$$\Pr(G = 1 \mid X = x) = \frac{\exp(\beta_0 + \beta^T x)}{1 + \exp(\beta_0 + \beta^T x)}, \qquad \Pr(G = 2 \mid X = x) = \frac{1}{1 + \exp(\beta_0 + \beta^T x)}$$
So that
$$\log\left[\frac{\Pr(G = 1 \mid X = x)}{\Pr(G = 2 \mid X = x)}\right] = \beta_0 + \beta^T x \qquad \text{(linear)}$$
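To see this linearity numerically, here is a minimal sketch of the two-class model above; the parameter values `beta0` and `beta` are arbitrary, chosen only for illustration:

```python
import numpy as np

beta0, beta = 0.5, np.array([1.0, -2.0])  # illustrative parameters, not fitted

def posteriors(x):
    """Two-class posteriors from the logit model above."""
    z = beta0 + beta @ x
    p1 = np.exp(z) / (1.0 + np.exp(z))   # Pr(G=1 | X=x)
    p2 = 1.0 / (1.0 + np.exp(z))          # Pr(G=2 | X=x)
    return p1, p2

x = np.array([0.3, -0.7])
p1, p2 = posteriors(x)
# The log-odds recovers the linear function beta0 + beta^T x exactly
assert np.isclose(np.log(p1 / p2), beta0 + beta @ x)
```

The posteriors themselves are non-linear in x, but their log-ratio (and hence the decision boundary) is linear.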
Linear Classification as a Linear Regression

2D input space: X = (X1, X2). Number of classes/categories K = 3, so the output is Y = (Y1, Y2, Y3). Training sample of size N = 5, with indicator matrix $\mathbf{Y}$ (each row has exactly one 1, indicating the category/class):
$$\mathbf{X} = \begin{pmatrix} 1 & x_{11} & x_{12} \\ 1 & x_{21} & x_{22} \\ 1 & x_{31} & x_{32} \\ 1 & x_{41} & x_{42} \\ 1 & x_{51} & x_{52} \end{pmatrix}, \qquad \mathbf{Y} = \begin{pmatrix} y_{11} & y_{12} & y_{13} \\ y_{21} & y_{22} & y_{23} \\ y_{31} & y_{32} & y_{33} \\ y_{41} & y_{42} & y_{43} \\ y_{51} & y_{52} & y_{53} \end{pmatrix}$$

Regression output:
$$\hat{Y}((x_1, x_2)) = (1 \; x_1 \; x_2)\,\hat{\mathbf{B}}, \qquad \hat{\mathbf{B}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}$$
Componentwise:
$$\hat{Y}_k((x_1, x_2)) = \hat{\beta}_{1k} + \hat{\beta}_{2k}\,x_1 + \hat{\beta}_{3k}\,x_2, \qquad k = 1, 2, 3$$

Classification rule:
$$\hat{G}((x_1, x_2)) = \arg\max_k \hat{Y}_k((x_1, x_2))$$
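The construction above can be sketched in a few lines of NumPy; the five training points and their labels below are made up purely for illustration:

```python
import numpy as np

# N = 5 illustrative training points in 2D, K = 3 classes
X = np.array([[1.0, 0.1, 0.2],
              [1.0, 0.9, 0.8],
              [1.0, 0.5, 2.0],
              [1.0, 0.2, 0.1],
              [1.0, 1.1, 0.9]])          # columns: 1, x1, x2
g = np.array([0, 1, 2, 0, 1])            # class labels
Y = np.eye(3)[g]                          # indicator matrix: one 1 per row

B_hat = np.linalg.solve(X.T @ X, X.T @ Y)  # (X^T X)^{-1} X^T Y

def classify(x1, x2):
    """argmax over the K fitted linear regressions."""
    y_hat = np.array([1.0, x1, x2]) @ B_hat
    return int(np.argmax(y_hat))
```

Note that because each row of the indicator matrix sums to 1 and X contains a column of ones, the fitted outputs also sum to 1 at every input (though individual components can be negative or exceed 1).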
The Masking

Linear regression of the indicator matrix can lead to masking: one class is never assigned because its fitted regression
$$\hat{Y}_1 = \hat{\beta}_{11} + \hat{\beta}_{21}x_1 + \hat{\beta}_{31}x_2, \qquad \hat{Y}_2 = \hat{\beta}_{12} + \hat{\beta}_{22}x_1 + \hat{\beta}_{32}x_2, \qquad \hat{Y}_3 = \hat{\beta}_{13} + \hat{\beta}_{23}x_1 + \hat{\beta}_{33}x_2$$
is dominated everywhere by the others. LDA can avoid this masking.

[Figure: 2D input space with three classes, shown along a viewing direction that illustrates masking]
Linear Discriminant Analysis

Essentially the minimum-error Bayes’ classifier. Assumes that the class-conditional densities are (multivariate) Gaussian, with equal covariance for every class.

Posterior probability (application of Bayes’ rule):
$$\Pr(G = k \mid X = x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}$$
where $\pi_k$ is the prior probability for class k and $f_k(x)$ is the class-conditional density (likelihood):
$$f_k(x) = \frac{1}{(2\pi)^{p/2}\,|\mathbf{\Sigma}|^{1/2}} \exp\!\left(-\frac{1}{2}(x - \mu_k)^T \mathbf{\Sigma}^{-1} (x - \mu_k)\right)$$
LDA…

$$\log\frac{\Pr(G = k \mid X = x)}{\Pr(G = l \mid X = x)} = \log\frac{f_k(x)}{f_l(x)} + \log\frac{\pi_k}{\pi_l} = \log\frac{\pi_k}{\pi_l} - \frac{1}{2}\mu_k^T \mathbf{\Sigma}^{-1}\mu_k + \frac{1}{2}\mu_l^T \mathbf{\Sigma}^{-1}\mu_l + x^T \mathbf{\Sigma}^{-1}(\mu_k - \mu_l) = \delta_k(x) - \delta_l(x)$$
with the linear discriminant function
$$\delta_k(x) = x^T \mathbf{\Sigma}^{-1}\mu_k - \frac{1}{2}\mu_k^T \mathbf{\Sigma}^{-1}\mu_k + \log\pi_k$$

Classification rule:
$$\hat{G}(x) = \arg\max_k \delta_k(x)$$
is equivalent to:
$$\hat{G}(x) = \arg\max_k \Pr(G = k \mid X = x)$$
The good old Bayes classifier!
LDA…

When are we going to use the training data? Right here. Training set: $(x_i, g_i),\ i = 1, \dots, N$ — N input–output pairs in total, $N_k$ pairs in class k, and K classes in total.

The training data are utilized to estimate:

Prior probabilities: $\hat{\pi}_k = N_k / N$

Means: $\hat{\mu}_k = \dfrac{1}{N_k}\displaystyle\sum_{g_i = k} x_i$

Covariance matrix: $\hat{\mathbf{\Sigma}} = \dfrac{1}{N - K}\displaystyle\sum_{k=1}^{K}\sum_{g_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T$
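The estimates above translate directly into code. This is a minimal sketch; `lda_fit` and `lda_predict` are hypothetical helper names, and integer labels 0, …, K−1 are assumed:

```python
import numpy as np

def lda_fit(X, g, K):
    """Estimate LDA parameters (priors, means, pooled covariance).
    X: (N, p) inputs; g: (N,) integer labels in {0, ..., K-1}."""
    N, p = X.shape
    priors = np.array([np.mean(g == k) for k in range(K)])
    means = np.array([X[g == k].mean(axis=0) for k in range(K)])
    Sigma = np.zeros((p, p))
    for k in range(K):
        D = X[g == k] - means[k]
        Sigma += D.T @ D
    Sigma /= (N - K)                       # pooled covariance estimate
    return priors, means, Sigma

def lda_predict(x, priors, means, Sigma):
    """argmax_k of the linear discriminant delta_k(x)."""
    Si = np.linalg.inv(Sigma)
    deltas = [x @ Si @ m - 0.5 * m @ Si @ m + np.log(pk)
              for pk, m in zip(priors, means)]
    return int(np.argmax(deltas))
```

On well-separated data, points near a class mean are assigned to that class, as the discriminant formula suggests.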
LDA: Example
LDA was able to avoid masking here
Quadratic Discriminant Analysis
• Relaxes the equal-covariance assumption
– class-conditional densities (still multivariate Gaussian) are allowed to have different covariance matrices
• The class decision boundaries are no longer linear; rather, they are quadratic
$$\log\frac{\Pr(G = k \mid X = x)}{\Pr(G = l \mid X = x)} = \log\frac{f_k(x)}{f_l(x)} + \log\frac{\pi_k}{\pi_l} = \log\frac{\pi_k}{\pi_l} + \left(-\frac{1}{2}\log|\mathbf{\Sigma}_k| - \frac{1}{2}(x - \mu_k)^T \mathbf{\Sigma}_k^{-1}(x - \mu_k)\right) - \left(-\frac{1}{2}\log|\mathbf{\Sigma}_l| - \frac{1}{2}(x - \mu_l)^T \mathbf{\Sigma}_l^{-1}(x - \mu_l)\right) = \delta_k(x) - \delta_l(x)$$
with the quadratic discriminant function
$$\delta_k(x) = -\frac{1}{2}\log|\mathbf{\Sigma}_k| - \frac{1}{2}(x - \mu_k)^T \mathbf{\Sigma}_k^{-1}(x - \mu_k) + \log\pi_k$$
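The quadratic discriminant is straightforward to evaluate; below is a minimal sketch (`qda_delta` is a hypothetical helper name, and the prior is dropped into the same expression as on the slide):

```python
import numpy as np

def qda_delta(x, mu, Sigma, prior):
    """Quadratic discriminant delta_k(x) for one class."""
    d = x - mu
    _, logdet = np.linalg.slogdet(Sigma)   # log |Sigma_k|, numerically stable
    return -0.5 * logdet - 0.5 * d @ np.linalg.inv(Sigma) @ d + np.log(prior)
```

Classification then takes the argmax of `qda_delta` over the classes, exactly as in LDA, but with a per-class covariance.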
QDA and Masking

Better than linear regression in terms of handling masking. Usually computationally more expensive than LDA.
Fisher’s Linear Discriminant [DHS]

From the training set we want to find a direction along which the separation between the class means is high and the overlap between the classes is small.
Fisher’s LD…

Geometric interpretation: the projection of a vector x on a unit vector w is $w^T x$.

[Figure: vector x projected onto unit vector w, with projection length $w^T x$]

From the training set we want to find a direction w along which the separation between the projections of the class means is high and the overlap between the projected classes is small.
Fisher’s LD…

Class means:
$$m_1 = \frac{1}{N_1}\sum_{x_i \in R_1} x_i, \qquad m_2 = \frac{1}{N_2}\sum_{x_i \in R_2} x_i$$

Projected class means:
$$\tilde{m}_1 = \frac{1}{N_1}\sum_{x_i \in R_1} w^T x_i = w^T m_1, \qquad \tilde{m}_2 = \frac{1}{N_2}\sum_{x_i \in R_2} w^T x_i = w^T m_2$$

Difference between projected class means:
$$\tilde{m}_2 - \tilde{m}_1 = w^T (m_2 - m_1)$$

Scatter of the projected data (this will indicate overlap between the classes):
$$\tilde{s}_i^2 = \sum_{x \in R_i} (w^T x - \tilde{m}_i)^2 = w^T S_i w, \qquad S_i = \sum_{x \in R_i} (x - m_i)(x - m_i)^T, \qquad i = 1, 2$$
Fisher’s LD…

Ratio of the squared difference of projected means over the total scatter (a Rayleigh quotient):
$$r(w) = \frac{(\tilde{m}_2 - \tilde{m}_1)^2}{\tilde{s}_1^2 + \tilde{s}_2^2} = \frac{w^T S_B w}{w^T S_W w}$$
where
$$S_B = (m_2 - m_1)(m_2 - m_1)^T, \qquad S_W = S_1 + S_2$$
We want to maximize $r(w)$. The solution is
$$w = S_W^{-1}(m_2 - m_1)$$
Fisher’s LD: Classifier

So far so good. However, how do we get the classifier? All we know at this point is that the direction $w = S_W^{-1}(m_2 - m_1)$ separates the projected data very well.

Since we know that the projected class means are well separated, we can choose the average of the two projected means as a threshold for classification.

Classification rule: $x \in R_2$ if $y(x) > 0$, else $x \in R_1$, where
$$y(x) = w^T x - \frac{1}{2}(\tilde{m}_1 + \tilde{m}_2) = (m_2 - m_1)^T S_W^{-1}\left(x - \frac{1}{2}(m_1 + m_2)\right)$$
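The two-class direction and threshold above can be sketched as follows; the two point clouds are made up for illustration:

```python
import numpy as np

# Illustrative two-class data
X1 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # class R1
X2 = np.array([[4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])   # class R2

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S1 = (X1 - m1).T @ (X1 - m1)                # within-class scatter of R1
S2 = (X2 - m2).T @ (X2 - m2)                # within-class scatter of R2
Sw = S1 + S2

w = np.linalg.solve(Sw, m2 - m1)            # w = Sw^{-1} (m2 - m1)
threshold = 0.5 * (w @ m1 + w @ m2)          # midpoint of projected means

def classify_fisher(x):
    """x -> 2 if y(x) = w^T x - threshold > 0, else 1."""
    return 2 if w @ x - threshold > 0 else 1
```

Using `np.linalg.solve` instead of explicitly inverting $S_W$ is the usual numerically preferred choice.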
Fisher’s LD: Multiple Classes

There are k classes $C_1, \dots, C_k$, with $n_i$ elements in the i-th class and $n = n_1 + \dots + n_k$ in total.

Compute means for the classes:
$$m_i = \frac{1}{n_i}\sum_{x \in C_i} x$$

Compute the grand mean:
$$m = \frac{1}{n}\left(n_1 m_1 + \dots + n_k m_k\right)$$

Compute the scatter matrices:
$$S_W = \sum_{x \in C_1}(x - m_1)(x - m_1)^T + \dots + \sum_{x \in C_k}(x - m_k)(x - m_k)^T$$
$$S_B = n_1 (m_1 - m)(m_1 - m)^T + \dots + n_k (m_k - m)(m_k - m)^T$$

Maximize the Rayleigh ratio:
$$r(w) = \frac{w^T S_B w}{w^T S_W w}$$
The solution is the largest eigenvector of $S_W^{-1} S_B$.

At most (k−1) eigenvalues will be non-zero, so this also performs dimensionality reduction.
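The multi-class recipe can be sketched as below; `fisher_directions` is a hypothetical helper name, and the three clusters in the demo are made up for illustration:

```python
import numpy as np

def fisher_directions(X, g, k):
    """Leading eigenvectors of Sw^{-1} Sb for multi-class Fisher's LD.
    X: (n, p) data; g: (n,) labels in {0, ..., k-1}; returns (p, k-1)."""
    n, p = X.shape
    m = X.mean(axis=0)                              # grand mean
    Sw = np.zeros((p, p))
    Sb = np.zeros((p, p))
    for i in range(k):
        Xi = X[g == i]
        mi = Xi.mean(axis=0)
        Sw += (Xi - mi).T @ (Xi - mi)               # within-class scatter
        Sb += len(Xi) * np.outer(mi - m, mi - m)    # between-class scatter
    # Eigenvectors of Sw^{-1} Sb, sorted by decreasing eigenvalue
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(evals.real)[::-1]
    return evecs.real[:, order[: k - 1]]

# Demo: k = 3 small clusters in 2D
offsets = np.array([[0.0, 0.0], [0.5, 0.0], [0.0, 0.5]])
centers = [np.array([0.0, 0.0]), np.array([5.0, 0.0]), np.array([0.0, 5.0])]
X = np.vstack([c + offsets for c in centers])
g = np.repeat(np.arange(3), 3)
W = fisher_directions(X, g, 3)
```

Keeping only the k−1 leading eigenvectors is exactly the dimensionality reduction noted on the slide.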
Fisher’s LD and LDA

They become the same when:
(1) the prior probabilities are the same,
(2) there is a common covariance matrix for the class-conditional densities, and
(3) both class-conditional densities are multivariate Gaussian.

Ex. Show that Fisher’s LD classifier and LDA produce the same classification rule given the above assumptions.

Note: (1) Fisher’s LD does not assume Gaussian densities. (2) Fisher’s LD can be used for dimension reduction in a multiple-class scenario.
Logistic Regression
• The output of the regression is the posterior probability, i.e., Pr(output | input)
• Always ensures that the outputs sum to 1 and that each output is non-negative
• A linear classification method
• We need to know about two concepts to understand logistic regression:
– Newton-Raphson method
– Maximum likelihood estimation
Newton-Raphson Method

A technique for solving a non-linear equation $f(x) = 0$.

Taylor series:
$$f(x_{n+1}) \approx f(x_n) + (x_{n+1} - x_n) f'(x_n)$$

If $x_{n+1}$ is a root, or very close to the root, then $f(x_{n+1}) \approx 0$, so:
$$0 = f(x_n) + (x_{n+1} - x_n) f'(x_n)$$

After rearrangement, the rule for iteration (we need an initial guess $x_0$):
$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}$$
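The iteration rule is a few lines of code. A minimal sketch, applied to the illustrative equation $x^2 - 2 = 0$:

```python
import math

def newton(f, fprime, x0, tol=1e-12, max_iter=50):
    """Newton-Raphson iteration x_{n+1} = x_n - f(x_n)/f'(x_n)."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / fprime(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Example: root of f(x) = x^2 - 2, starting from the initial guess x0 = 1
root = newton(lambda x: x * x - 2.0, lambda x: 2.0 * x, 1.0)
```

Near a simple root the convergence is quadratic, so a handful of iterations suffices here.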
Newton-Raphson in Multi-dimensions

We want to solve the system of equations:
$$f_i(x_1, x_2, \dots, x_N) = 0, \qquad i = 1, \dots, N$$

Taylor series:
$$f_j(\mathbf{x} + \delta\mathbf{x}) \approx f_j(\mathbf{x}) + \sum_{k=1}^{N} \frac{\partial f_j}{\partial x_k}\,\delta x_k, \qquad j = 1, \dots, N$$

After some rearrangement etc., the rule for iteration (we need an initial guess) is:
$$\begin{pmatrix} x_1^{n+1} \\ \vdots \\ x_N^{n+1} \end{pmatrix} = \begin{pmatrix} x_1^{n} \\ \vdots \\ x_N^{n} \end{pmatrix} - \mathbf{J}(\mathbf{x}^n)^{-1} \begin{pmatrix} f_1(x_1^n, \dots, x_N^n) \\ \vdots \\ f_N(x_1^n, \dots, x_N^n) \end{pmatrix}$$
where $\mathbf{J}$ is the Jacobian matrix with entries $J_{jk} = \partial f_j / \partial x_k$.
Newton-Raphson: Example

Solve a two-variable nonlinear system:
$$f_1(x_1, x_2) = 0, \qquad f_2(x_1, x_2) = 0$$

Iteration rule (we need an initial guess):
$$\begin{pmatrix} x_1^{n+1} \\ x_2^{n+1} \end{pmatrix} = \begin{pmatrix} x_1^{n} \\ x_2^{n} \end{pmatrix} - \begin{pmatrix} \dfrac{\partial f_1}{\partial x_1} & \dfrac{\partial f_1}{\partial x_2} \\[4pt] \dfrac{\partial f_2}{\partial x_1} & \dfrac{\partial f_2}{\partial x_2} \end{pmatrix}^{-1}_{(x_1^n,\,x_2^n)} \begin{pmatrix} f_1(x_1^n, x_2^n) \\ f_2(x_1^n, x_2^n) \end{pmatrix}$$
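The multi-dimensional rule can be sketched as follows; the system below ($x_1^2 + x_2^2 = 4$, $x_1 x_2 = 1$) is a hypothetical example chosen for illustration, not the slide's original one:

```python
import numpy as np

def newton2d(f, jac, x0, tol=1e-12, max_iter=50):
    """Multi-dimensional Newton-Raphson: x <- x - J(x)^{-1} f(x)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        step = np.linalg.solve(jac(x), f(x))   # solve J step = f instead of inverting
        x = x - step
        if np.linalg.norm(step) < tol:
            break
    return x

# Hypothetical example system:
#   f1(x1, x2) = x1^2 + x2^2 - 4 = 0
#   f2(x1, x2) = x1 * x2 - 1 = 0
f = lambda x: np.array([x[0]**2 + x[1]**2 - 4.0, x[0] * x[1] - 1.0])
jac = lambda x: np.array([[2.0 * x[0], 2.0 * x[1]],
                          [x[1],       x[0]]])

sol = newton2d(f, jac, [2.0, 0.5])   # initial guess near a root
```

As in one dimension, a reasonable initial guess matters: Newton-Raphson only converges from starting points close enough to a root.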
Maximum Likelihood Parameter Estimation

Let’s start with an example. We want to find the unknown parameters, the mean and standard deviation, of a Gaussian pdf, given N independent samples from it:
$$p(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

Samples: $x_1, \dots, x_N$.

Form the likelihood function:
$$L(\mu, \sigma) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$$

Estimate the parameters that maximize the likelihood function:
$$(\hat{\mu}, \hat{\sigma}) = \arg\max_{\mu, \sigma} L(\mu, \sigma)$$

Let’s find out $(\hat{\mu}, \hat{\sigma})$.
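Setting the derivatives of $\log L$ to zero gives the standard closed-form answer: $\hat{\mu}$ is the sample mean and $\hat{\sigma}^2$ is the mean squared deviation. A minimal sketch, with a made-up sample, that also lets us check the estimates against the log-likelihood:

```python
import math

def gaussian_mle(xs):
    """Closed-form MLEs for a Gaussian: sample mean and root mean
    squared deviation (standard result of maximizing log L)."""
    n = len(xs)
    mu = sum(xs) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / n)
    return mu, sigma

def log_likelihood(xs, mu, sigma):
    """log L(mu, sigma) for the Gaussian model above."""
    return sum(-0.5 * math.log(2 * math.pi) - math.log(sigma)
               - (x - mu) ** 2 / (2 * sigma ** 2) for x in xs)

xs = [1.0, 2.0, 3.0, 4.0]           # illustrative sample
mu_hat, sigma_hat = gaussian_mle(xs)
```

Perturbing either parameter away from $(\hat{\mu}, \hat{\sigma})$ lowers the log-likelihood, which is what "arg max" promises.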
Logistic Regression Model

The method directly models the posterior probabilities as the output of the regression:
$$\Pr(G = k \mid X = x) = \frac{\exp(\beta_{k0} + \beta_k^T x)}{1 + \sum_{l=1}^{K-1} \exp(\beta_{l0} + \beta_l^T x)}, \qquad k = 1, \dots, K-1$$
$$\Pr(G = K \mid X = x) = \frac{1}{1 + \sum_{l=1}^{K-1} \exp(\beta_{l0} + \beta_l^T x)}$$

Note that the class boundaries are linear.
How can we show this linear nature?
What is the discriminant function for every class in this model?
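The model above can be evaluated directly; a minimal sketch (`logistic_posteriors` is a hypothetical helper name, and the parameter matrix `B` in the demo is arbitrary):

```python
import numpy as np

def logistic_posteriors(x, B):
    """Posteriors of the K-class logistic model above.
    B: (K-1, p+1) array, row k holding (beta_k0, beta_k).
    Returns K probabilities, class K last."""
    z = B[:, 0] + B[:, 1:] @ x               # beta_k0 + beta_k^T x, k < K
    denom = 1.0 + np.sum(np.exp(z))
    return np.append(np.exp(z) / denom, 1.0 / denom)

# Demo: K = 3 classes, p = 2 inputs, arbitrary illustrative parameters
B = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
probs = logistic_posteriors(np.array([1.0, 2.0]), B)
```

By construction the outputs are positive and sum to 1, which is exactly the property the earlier slide advertised.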
Logistic Regression Computation

Let’s fit the logistic regression model for K = 2, i.e., the number of classes is 2.

Training set: $(x_i, g_i),\ i = 1, \dots, N$. Code the two classes as $y_i \in \{0, 1\}$, and let $x_i$ include a constant 1 as its first component so that $\beta$ absorbs the intercept.

Log-likelihood:
$$l(\beta) = \sum_{i=1}^{N} \log \Pr(G = y_i \mid X = x_i) = \sum_{i=1}^{N} \left( y_i \log\Pr(G = 1 \mid X = x_i) + (1 - y_i)\log\Pr(G = 0 \mid X = x_i) \right)$$
$$= \sum_{i=1}^{N} \left( y_i \beta^T x_i - y_i \log(1 + \exp(\beta^T x_i)) - (1 - y_i)\log(1 + \exp(\beta^T x_i)) \right) = \sum_{i=1}^{N} \left( y_i \beta^T x_i - \log(1 + \exp(\beta^T x_i)) \right)$$

We want to maximize the log-likelihood in order to estimate $\beta$.
Logistic Regression Computation…

Setting the gradient of the log-likelihood to zero:
$$\frac{\partial l(\beta)}{\partial \beta} = \sum_{i=1}^{N} x_i \left( y_i - \frac{\exp(\beta^T x_i)}{1 + \exp(\beta^T x_i)} \right) = 0$$

This is a system of (p+1) non-linear equations in $\beta$.

Solve by the Newton-Raphson method:
$$\beta^{\text{new}} = \beta^{\text{old}} - \left[\operatorname{Jacobian}\!\left(\frac{\partial l(\beta)}{\partial \beta}\right)\Bigg|_{\beta^{\text{old}}}\right]^{-1} \frac{\partial l(\beta^{\text{old}})}{\partial \beta}$$

Let’s work out the details hidden in the above equation. In the process we’ll learn a bit about vector differentiation, etc.
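Anticipating those details: the Jacobian of the gradient (the Hessian of $l$) works out to $-\mathbf{X}^T \mathbf{W} \mathbf{X}$ with $\mathbf{W} = \mathrm{diag}(p_i(1-p_i))$, a standard result. A minimal sketch of the resulting Newton iteration, on a small made-up training set:

```python
import numpy as np

def fit_logistic_newton(X, y, n_iter=25):
    """Sketch: maximize the two-class log-likelihood by Newton-Raphson.
    X: (N, p+1) inputs with a leading column of 1s; y: (N,) labels in {0, 1}.
    Gradient: X^T (y - p); Hessian: -X^T W X, W = diag(p * (1 - p))."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))   # Pr(G=1 | x_i)
        grad = X.T @ (y - p)
        W = p * (1.0 - p)
        hess = -(X.T * W) @ X
        beta = beta - np.linalg.solve(hess, grad)
    return beta

# Illustrative, non-separable 1D data (plus intercept column)
X = np.column_stack([np.ones(6), np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
beta_hat = fit_logistic_newton(X, y)
```

Note the data is chosen to be non-separable: with perfectly separable classes the maximum-likelihood $\beta$ diverges and the iteration never settles.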