Discriminative classifiers: Logistic Regression, SVMs
CISC 5800
Professor Daniel Leeds
10/25/2015
Maximum A Posteriori: a quick review
• Likelihood: $P(D|\theta) = \theta^{|H|}(1-\theta)^{|T|}$
• Prior: $P(\theta) = \dfrac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha,\beta)}$
• Posterior ∝ likelihood × prior $= P(D|\theta)\,P(\theta)$
• MAP estimate: $\operatorname{argmax}_{\theta} \left[ \log P(D|\theta) + \log P(\theta) \right]$

$\theta = \dfrac{|H| + \alpha - 1}{|H| + \alpha - 1 + |T| + \beta - 1}$

Choose $\alpha$ and $\beta$ to express the prior belief about the Heads bias $\theta \in [0,1]$:
• Higher $\alpha$: Heads more likely
• Higher $\beta$: Tails more likely
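As a quick check of the closed-form MAP estimate above, a minimal Python sketch (the counts and the Beta(5, 5) prior are made-up illustration values):

```python
# MAP estimate of a coin's Heads bias under a Beta(alpha, beta) prior:
# theta = (H + alpha - 1) / (H + alpha - 1 + T + beta - 1)
def map_coin_bias(num_heads, num_tails, alpha, beta):
    return (num_heads + alpha - 1) / (num_heads + alpha - 1 + num_tails + beta - 1)

# Illustrative numbers: 7 heads, 3 tails, prior Beta(5, 5)
print(map_coin_bias(7, 3, alpha=5, beta=5))  # 0.611..., pulled toward the 0.5 prior
```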
Estimate each P(Xi|Y) through MAP
Incorporating a prior $\beta_k$ for each class:

$P(X_i = x_i \mid Y = y_k) = \dfrac{\#D(X_i = x_i \wedge Y = y_k) + (\beta_k - 1)}{\#D(Y = y_k) + m(\beta_k - 1)}$

$P(Y = y_k) = \dfrac{\#D(Y = y_k) + (\beta_k - 1)}{|D| + m(\beta_k - 1)}$

Note: both X and Y can take on multiple values (binary and beyond)
• $(\beta_k - 1)$ ≈ "frequency" of class $k$
• $m(\beta_k - 1)$ ≈ "frequencies" of all classes
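A minimal sketch of these smoothed counts (NumPy; the function names, and reading m as the number of classes/feature values, are my assumptions):

```python
import numpy as np

def map_class_prior(y, k, beta, m):
    """P(Y = k): add pseudocount (beta - 1) per class, m classes total."""
    y = np.asarray(y)
    return (np.sum(y == k) + (beta - 1)) / (len(y) + m * (beta - 1))

def map_feature_given_class(x_i, y, value, k, beta, m):
    """P(X_i = value | Y = k), smoothed the same way over m feature values."""
    x_i, y = np.asarray(x_i), np.asarray(y)
    joint = np.sum((x_i == value) & (y == k))
    return (joint + (beta - 1)) / (np.sum(y == k) + m * (beta - 1))
```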
Classification strategy: generative vs. discriminative
• Generative, e.g., Bayes/Naïve Bayes:
  • Identify a probability distribution for each class
  • Determine the class with maximum probability for the data example
• Discriminative, e.g., Logistic Regression:
  • Identify the boundary between classes
  • Determine which side of the boundary a new data example falls on
Linear algebra: data features
• Vector – a list of numbers: each number describes a data feature
• Matrix – a list of lists of numbers: the features for each data point

Example – # of word occurrences across d words (rows) for each document (columns):

Word     | Document 1 | Document 2 | Document 3
---------|------------|------------|-----------
Wolf     | 12         | 8          | 0
Lion     | 16         | 10         | 2
Monkey   | 14         | 11         | 1
Broker   | 0          | 1          | 14
Analyst  | 1          | 0          | 10
Dividend | 1          | 1          | 12
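A minimal sketch of this word-by-document matrix as a NumPy array (values copied from the table above):

```python
import numpy as np

# Rows: Wolf, Lion, Monkey, Broker, Analyst, Dividend
# Columns: Document 1, Document 2, Document 3
counts = np.array([
    [12,  8,  0],
    [16, 10,  2],
    [14, 11,  1],
    [ 0,  1, 14],
    [ 1,  0, 10],
    [ 1,  1, 12],
])

doc1 = counts[:, 0]  # the d-dimensional feature vector for Document 1
print(doc1)          # [12 16 14  0  1  1]
```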
Feature space
• Each data feature defines a dimension in space; each document in the word-count table above is a point in d-dimensional space

[Figure: doc1, doc2, and doc3 from the table plotted in the (wolf, lion) plane; both axes run 0–20.]
The dot product
The dot product compares two vectors:

$a = \begin{pmatrix} a_1 \\ \vdots \\ a_d \end{pmatrix}, \quad b = \begin{pmatrix} b_1 \\ \vdots \\ b_d \end{pmatrix}, \qquad a \cdot b = \sum_{i=1}^{d} a_i b_i = a^T b$

[Figure: two vectors in the plane (axes 0–20) with a worked element-by-element dot-product computation.]
The dot product, continued
The magnitude of a vector is the square root of the sum of the squares of its elements:

$\|a\| = \sqrt{\textstyle\sum_i a_i^2}$

If $u$ has unit magnitude, $u \cdot a = \sum_{i=1}^{d} u_i a_i$ is the "projection" of $a$ onto $u$. For example, with $u = (0.71, 0.71)^T$:

$(0.71, 0.71)^T \cdot (1.5, 1)^T = 0.71 \times 1.5 + 0.71 \times 1 \approx 1.07 + 0.71 = 1.78$

$(0.71, 0.71)^T \cdot (0, 0.5)^T = 0.71 \times 0 + 0.71 \times 0.5 \approx 0 + 0.35 = 0.35$

[Figure: the two projections drawn in the plane; axes run 0–2.]
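A minimal NumPy sketch of the same computations:

```python
import numpy as np

a = np.array([1.5, 1.0])
u = np.array([0.71, 0.71])           # approximately unit magnitude

magnitude = np.sqrt(np.sum(a ** 2))  # ||a|| = sqrt(sum of squared elements)
projection = u @ a                   # u . a = projection of a onto u

print(magnitude)   # ~1.803
print(projection)  # ~1.775
```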
Separating boundary, defined by w
• The separating hyperplane splits class 0 and class 1
• The plane is defined by the vector w perpendicular to the plane
• Is data point x in class 0 or class 1?
  • $w^T x > 0$: class 1
  • $w^T x < 0$: class 0
From real-number projection to 0/1 label
• Binary classification: 0 is class A, 1 is class B
• The sigmoid function stands in for $p(y|x)$
• Sigmoid: $g(h) = \dfrac{1}{1+e^{-h}}$, applied to $h = w^T x = \sum_i w_i x_i$
• $p(y=0 \mid x; w) = 1 - g(w^T x) = \dfrac{e^{-w^T x}}{1+e^{-w^T x}}$
• $p(y=1 \mid x; w) = g(w^T x) = \dfrac{1}{1+e^{-w^T x}}$

[Figure: $g(h)$ plotted for $-10 \le h \le 10$, rising from 0 through 0.5 at $h = 0$ toward 1.]
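A minimal sketch of the sigmoid and the two class probabilities (NumPy; the weight and feature values are illustrative):

```python
import numpy as np

def g(h):
    """Sigmoid: g(h) = 1 / (1 + e^{-h})."""
    return 1.0 / (1.0 + np.exp(-h))

w = np.array([0.5, -1.0])   # illustrative weights
x = np.array([2.0, 1.0])

p_y1 = g(w @ x)             # p(y = 1 | x; w)
p_y0 = 1.0 - p_y1           # p(y = 0 | x; w)
print(p_y1, p_y0)           # 0.5 0.5, since w.x = 0 here
```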
Learning parameters for classification
• Similar to MLE for the Bayes classifier
• "Likelihood" for data points $y_1, \ldots, y_n$ (different from the Bayesian likelihood):
  • If $y_i$ is in class A, $y_i = 0$: multiply in $(1 - g(x_i; w))$
  • If $y_i$ is in class B, $y_i = 1$: multiply in $g(x_i; w)$

$\operatorname{argmax}_w L(y|x;w) = \prod_i (1 - g(x_i;w))^{(1-y_i)}\, g(x_i;w)^{y_i}$

$LL(y|x;w) = \sum_i (1-y_i) \log(1 - g(x_i;w)) + y_i \log g(x_i;w)$

$LL(y|x;w) = \sum_i y_i \log \dfrac{g(x_i;w)}{1 - g(x_i;w)} + \log(1 - g(x_i;w))$
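A minimal sketch of this log-likelihood (NumPy; the three data points are hypothetical):

```python
import numpy as np

def log_likelihood(X, y, w):
    """LL(y|x;w) = sum_i (1-y_i) log(1 - g(w.x_i)) + y_i log g(w.x_i)."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))   # g(w^T x_i) for every row x_i
    return np.sum((1 - y) * np.log(1 - p) + y * np.log(p))

# Illustrative data: 3 points, 2 features
X = np.array([[1.0, 2.0], [2.0, 0.5], [0.1, 0.3]])
y = np.array([1, 1, 0])
print(log_likelihood(X, y, np.zeros(2)))  # 3 * log(0.5) when w = 0
```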
Learning parameters for classification

$LL(y|x;w) = \sum_i y_i \log \dfrac{g(x_i;w)}{1 - g(x_i;w)} + \log(1 - g(x_i;w))$

$LL(y|x;w) = \sum_i y_i \log \dfrac{\frac{1}{1+e^{-w^T x_i}}}{1 - \frac{1}{1+e^{-w^T x_i}}} + \log \dfrac{e^{-w^T x_i}}{1+e^{-w^T x_i}}$

$LL(y|x;w) = \sum_i y_i \log e^{w^T x_i} + \log \dfrac{e^{-w^T x_i}}{1+e^{-w^T x_i}}$

$LL(y|x;w) = \sum_i y_i w^T x_i - w^T x_i - \log(1 + e^{-w^T x_i})$
Learning parameters for classification
(recall $g(h) = \frac{1}{1+e^{-h}}$, $w^T x = \sum_i w_i x_i$, and $g'(h) = \frac{e^{-h}}{(1+e^{-h})^2}$)

$LL(y|x;w) = \sum_i y_i w^T x_i - w^T x_i + \log g(w^T x_i)$

$\dfrac{\partial}{\partial w_j} LL(y|x;w) = \sum_i y_i x_i^j - x_i^j + x_i^j \dfrac{e^{-w^T x_i}}{1+e^{-w^T x_i}}$

$\dfrac{\partial}{\partial w_j} LL(y|x;w) = \sum_i x_i^j \left( y_i - \left( 1 - (1 - g(w^T x_i)) \right) \right)$

$\dfrac{\partial}{\partial w_j} LL(y|x;w) = \sum_i x_i^j \left( y_i - g(w^T x_i) \right)$
Iterative gradient ascent
• Begin with initial guessed weights w
• For each data point $(y_i, x_i)$, update each weight $w_j$:
  $w_j \leftarrow w_j + \varepsilon\, x_i^j (y_i - g(w^T x_i))$
• Choose $\varepsilon$ – the "step size" – so the change is neither too big nor too small
• Interpreting $x_i^j (y_i - g(w^T x_i))$ (see the sketch below):
  • If $y_i = 1$, $g(w^T x_i) = 0$, and $x_i^j > 0$: make $w_j$ larger, pushing $w^T x_i$ to be larger
  • If $y_i = 0$, $g(w^T x_i) = 1$, and $x_i^j > 0$: make $w_j$ smaller, pushing $w^T x_i$ to be smaller
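A minimal sketch of the whole loop (NumPy; the step size, epoch count, and toy data are illustrative choices, not from the lecture):

```python
import numpy as np

def train_logistic(X, y, step=0.1, epochs=100):
    """Gradient ascent on LL: w_j <- w_j + step * x_i^j * (y_i - g(w.x_i))."""
    w = np.zeros(X.shape[1])            # initial guessed weights
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):      # one update per data point
            g = 1.0 / (1.0 + np.exp(-(w @ x_i)))
            w += step * x_i * (y_i - g)
    return w

# Toy data: class 1 points have larger feature values than class 0 points
X = np.array([[3.0, 2.5], [2.5, 3.0], [-3.0, -2.0], [-2.0, -3.0]])
y = np.array([1, 1, 0, 0])
w = train_logistic(X, y)
print(1.0 / (1.0 + np.exp(-(X @ w))))   # probabilities near [1, 1, 0, 0]
```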
Intuition
• $y_i$ – the true data label
• $g(w^T x_i)$ – the computed data label
• Recall the separating boundary defined by w: $w^T x > 0$ gives class 1, $w^T x < 0$ gives class 0
But, where do we place the boundary?
Logistic regression:

$LL(y|x;w) = \sum_i (y_i - 1)\, w^T x_i - \log(1 + e^{-w^T x_i})$

• Every data point $x_i$ is considered when placing the boundary $w$
• Outlier data pulls the boundary towards it
Max margin classifiers
• Focus on the boundary points
• Find the largest margin between the boundary points on both sides
• Works well in practice
• We can call the boundary points "support vectors"
Classifying with additional dimensions

[Figure: data with no linear separator in the original space becomes linearly separable after applying the mapping $\varphi(x)$.]
Mapping function(s)
• Map from the low-dimensional space $x = (x_1, x_2)$ to the higher-dimensional space $\varphi(x) = (x_1, x_2, x_1^2, x_2^2, x_1 x_2)$
• N data points are guaranteed to be separable in a space of N−1 dimensions or more

Classifying $x_j$: $\sum_i \alpha_i y_i\, \varphi(x_i)^T \varphi(x_j) + b$, where $w = \sum_i \alpha_i y_i x_i$ (see the alternate SVM formulation below)
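A minimal sketch of this 2-D quadratic mapping (the helper name phi is mine):

```python
import numpy as np

def phi(x):
    """Map (x1, x2) to (x1, x2, x1^2, x2^2, x1*x2)."""
    x1, x2 = x
    return np.array([x1, x2, x1**2, x2**2, x1 * x2])

# Dot product in the mapped space, as used when classifying x_j
x_i, x_j = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(phi(x_i) @ phi(x_j))
```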
Discriminative classifiers
Find a separator to minimize classification error:
• Logistic Regression
• Support Vector Machines
Logistic Regression review
• $p(y=0 \mid x; w) = 1 - g(w^T x) = \dfrac{e^{-w^T x}}{1+e^{-w^T x}}$
• $p(y=1 \mid x; w) = g(w^T x) = \dfrac{1}{1+e^{-w^T x}}$, with the logistic function $g(h) = \dfrac{1}{1+e^{-h}}$
• Maximize the likelihood:
  $\operatorname{argmax}_w L(y|x;w) = \prod_i (1 - g(x_i;w))^{(1-y_i)}\, g(x_i;w)^{y_i}$
• The likelihood is $P(D|\theta)$, with $D = \{x_i, y_i\}$ and $\theta = w$
• Update w: $\dfrac{\partial}{\partial w_j} LL(y|x;w) = \sum_i x_i^j (y_i - g(w^T x_i))$
MAP for a discriminative classifier
• MLE: $P(y=1|x;w) \sim g(w^T x)$, $P(y=0|x;w) \sim 1 - g(w^T x)$
• MAP: $P(y=1,w|x) \propto P(y=1|x;w)\, P(w) \sim g(w^T x)$ ??? (different from the Bayesian posterior)
• $P(w)$ priors:
  • L2 regularization – minimize all weights
  • L1 regularization – minimize the number of non-zero weights
MAP – L2 regularization
• $P(y=1,w|x) \propto P(y=1|x;w)\, P(w)$:

$L(y,w|x) = \prod_i (1 - g(x_i;w))^{(1-y_i)}\, g(x_i;w)^{y_i} \prod_j e^{-\frac{w_j^2}{2\sigma}}$

$LL(y,w|x) = \sum_i \left[ y_i w^T x_i - w^T x_i + \log g(w^T x_i) \right] - \sum_j \frac{w_j^2}{2\sigma}$

$\dfrac{\partial}{\partial w_j} LL(y,w|x) = \sum_i x_i^j (y_i - g(w^T x_i)) - \dfrac{w_j}{\sigma}$

Prevents $w^T x$ from getting too large.

[Figure: the Gaussian prior $p(w)$ plotted against $w$.]
MAP – L1 regularization
• $P(y=1,w|x) \propto P(y=1|x;w)\, P(w)$:

$L(y,w|x) = \prod_i (1 - g(x_i;w))^{(1-y_i)}\, g(x_i;w)^{y_i} \prod_j e^{-\frac{|w_j|}{\sigma}}$

$LL(y,w|x) = \sum_i \left[ y_i w^T x_i - w^T x_i + \log g(w^T x_i) \right] - \sum_j \frac{|w_j|}{\sigma}$

$\dfrac{\partial}{\partial w_j} LL(y,w|x) = \sum_i x_i^j (y_i - g(w^T x_i)) - \dfrac{\operatorname{sign}(w_j)}{\sigma}$

Forces most dimensions to 0.

[Figure: the Laplacian prior $p(w)$ plotted against $w$.]
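A minimal sketch of the two regularized gradient updates (NumPy; the defaults and names are illustrative):

```python
import numpy as np

def grad_step(w, x_i, y_i, step=0.1, sigma=1.0, penalty="l2"):
    """One per-point gradient-ascent step with an L2 or L1 prior on w."""
    g = 1.0 / (1.0 + np.exp(-(w @ x_i)))
    grad = x_i * (y_i - g)
    if penalty == "l2":
        grad -= w / sigma            # Gaussian prior term: -w_j / sigma
    else:
        grad -= np.sign(w) / sigma   # Laplacian prior term: -sign(w_j) / sigma
    return w + step * grad
```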
Parameters for learning

$w_j \leftarrow w_j + \varepsilon \left( x_i^j (y_i - g(w^T x_i)) - \dfrac{w_j}{\sigma} \right)$

• Regularization: selecting $\sigma$ influences the strength of your bias
• Gradient ascent: selecting $\varepsilon$ influences the effect of individual data points in learning
• Bayesian: selecting $\beta_k$ indicates the strength of the class prior beliefs
• $\sigma$, $\varepsilon$, and $\beta_k$ are parameters controlling our learning
Multi-class logistic regression: class probability
Recall the binary class:
• $p(y=0 \mid x; \theta) = 1 - g(w^T x) = \dfrac{e^{-w^T x}}{1+e^{-w^T x}}$
• $p(y=1 \mid x; \theta) = g(w^T x) = \dfrac{1}{1+e^{-w^T x}}$

Multi-class – m classes:
• $p(y=k \mid x; \theta) = \dfrac{e^{-w_k^T x}}{1 + \sum_{j=1}^{m-1} e^{-w_j^T x}}$ for $k = 1, \ldots, m-1$
• $p(y=m \mid x; \theta) = 1 - \sum_{k=1}^{m-1} p(y=k \mid x; \theta)$
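A minimal sketch of these m-class probabilities (NumPy; it follows the slide's $e^{-w^T x}$ convention, with the m−1 weight vectors as rows of W):

```python
import numpy as np

def class_probs(W, x):
    """p(y=k|x) for k = 1..m-1, plus the leftover probability for class m."""
    scores = np.exp(-(W @ x))                # e^{-w_k . x} for each of m-1 rows
    p = scores / (1.0 + np.sum(scores))      # classes 1..m-1
    return np.append(p, 1.0 - np.sum(p))     # class m takes the remaining mass

W = np.array([[0.2, -0.5], [1.0, 0.3]])      # illustrative weights: m = 3 classes
x = np.array([1.0, 2.0])
print(class_probs(W, x))                     # three probabilities summing to 1
```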
Multi-class logistic regression: likelihood
Recall the binary class:
• $L(y|x;\theta) = \prod_i p(y_i=0 \mid x_i;\theta)^{(1-y_i)}\, p(y_i=1 \mid x_i;\theta)^{y_i}$
• $\dfrac{\partial}{\partial w_j} LL(y|x;w) = \sum_i x_i^j (y_i - g(w^T x_i))$

Multi-class:
• $L(y|x;\theta) = \prod_{i,k} p(y_i=k \mid x_i;\theta)^{\delta(y_i,k)}$
• $\dfrac{\partial}{\partial w_{k,j}} LL(y|x;w) = \sum_i x_i^j \left( \delta(y_i,k) - p(y_i=k \mid x_i;\theta) \right)$

$\delta(a,b) = 1$ if $a = b$, $0$ otherwise
Multi-class logistic regression: update rule
Recall the binary class:
• $w_j \leftarrow w_j + \varepsilon\, x_i^j (y_i - g(w^T x_i))$

Multi-class:
• $w_{k,j} \leftarrow w_{k,j} + \varepsilon\, x_i^j \left( \delta(y_i,k) - p(y_i=k \mid x_i;\theta) \right)$

$\delta(a,b) = 1$ if $a = b$, $0$ otherwise
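A minimal sketch of this per-point multi-class update (NumPy; reuses the hypothetical class_probs above, with classes indexed from 0):

```python
import numpy as np

def multiclass_step(W, x_i, y_i, step=0.1):
    """w_{k,j} <- w_{k,j} + step * x_i^j * (delta(y_i,k) - p(y_i=k|x_i))."""
    p = class_probs(W, x_i)                   # length-m probability vector
    for k in range(W.shape[0]):               # rows of W: classes 0..m-2
        delta = 1.0 if y_i == k else 0.0      # delta(y_i, k)
        W[k] += step * x_i * (delta - p[k])
    return W
```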
Logistic regression: how many parameters?
• N features
• M classes
• Learn $w_k$ for each of M−1 classes: N(M−1) parameters
• Actually, $w^T x = \sum_i w_i x_i$ passes through the origin; it would be better to allow an offset from the origin, $w^T x + b$: N+1 parameters per class
• Total: (N+1)(M−1) parameters
Max margin classifiers (review)
• Focus on the boundary points – the "support vectors" – and find the largest margin between the boundary points on both sides; works well in practice
Maximum margin definitions
• M is the margin width
• $x^+$ is a +1 point closest to the boundary; $x^-$ is a −1 point closest to the boundary
• $x^+ = \lambda w + x^-$
• $\|x^+ - x^-\| = M$

The boundary and margins are the parallel hyperplanes $w^T x + b = 1$, $w^T x + b = 0$, and $w^T x + b = -1$:
• Classify as +1 if $w^T x + b \ge 1$
• Classify as −1 if $w^T x + b \le -1$
• Undefined if $-1 \le w^T x + b \le 1$
$\lambda$ derivation
Given $w^T x^- + b = -1$, $w^T x^+ + b = +1$, and $x^+ = \lambda w + x^-$:
• $w^T x^+ + b = +1$
• $w^T (\lambda w + x^-) + b = +1$
• $\lambda w^T w + w^T x^- + b = +1$
• $\lambda w^T w - 1 - b + b = +1$
• $\lambda = \dfrac{2}{w^T w}$
M derivation
Given $w^T x^- + b = -1$, $w^T x^+ + b = +1$, $x^+ = \lambda w + x^-$, and $\|x^+ - x^-\| = M$:
• $M = \|\lambda w + x^- - x^-\| = \|\lambda w\| = \lambda \|w\|$
• $M = \lambda \sqrt{w^T w}$
• $M = \dfrac{2}{w^T w} \sqrt{w^T w} = \dfrac{2}{\sqrt{w^T w}}$

Maximizing $M$ ⇔ minimizing $w^T w$
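A quick numeric check of $M = 2/\sqrt{w^T w}$ (the weight vector is an illustrative value):

```python
import numpy as np

w = np.array([3.0, 4.0])      # ||w|| = 5
M = 2.0 / np.sqrt(w @ w)      # margin width
print(M)                      # 0.4 -- shrinking w^T w widens the margin
```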
Support vector machine (SVM) optimization
• $\operatorname{argmax}_{w,b} M = \dfrac{2}{\sqrt{w^T w}}$ ⇔ $\operatorname{argmin}_{w,b}\ w^T w$

subject to
• $w^T x + b \ge 1$ for x in class +1
• $w^T x + b \le -1$ for x in class −1

Optimization with constraints: solve $\dfrac{\partial}{\partial w_j} f(w_j) = 0$ with Lagrange multipliers, via
• Gradient descent
• Matrix calculus
Alternate SVM formulation

$w = \sum_i \alpha_i x_i y_i$

• Support vectors $x_i$ have $\alpha_i > 0$
• $y_i$ are the data labels, +1 or −1
• Constraints: $\alpha_i \ge 0\ \forall i$ and $\sum_i \alpha_i y_i = 0$

To classify a sample $x_j$, compute:

$w^T x_j + b = \sum_i \alpha_i y_i\, x_i^T x_j + b$
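A minimal sketch of classifying with this dual form (NumPy; the alphas, labels, and b are illustrative stand-ins for values an SVM solver would produce):

```python
import numpy as np

def dual_classify(X_train, y_train, alpha, b, x_j):
    """Sign of w.x_j + b, with w expanded as sum_i alpha_i y_i x_i."""
    score = np.sum(alpha * y_train * (X_train @ x_j)) + b
    return 1 if score >= 0 else -1

X_train = np.array([[1.0, 1.0], [-1.0, -1.0]])  # two support vectors
y_train = np.array([1, -1])                     # labels are +1 / -1
alpha = np.array([0.5, 0.5])                    # alpha_i > 0, sum(alpha*y) = 0
print(dual_classify(X_train, y_train, alpha, b=0.0, x_j=np.array([2.0, 0.5])))
```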
Support vector machine (SVM) optimization with slack variables
What if the data are not linearly separable?

$\operatorname{argmin}_{w,b}\ w^T w + C \sum_i \varepsilon_i$

subject to
• $w^T x + b \ge 1 - \varepsilon_i$ for x in class +1
• $w^T x + b \le -1 + \varepsilon_i$ for x in class −1
• $\varepsilon_i \ge 0\ \forall i$

Each error $\varepsilon_i$ is penalized based on its distance from the separator.

[Figure: points inside or beyond the margin marked with their slack values $\varepsilon_1, \ldots, \varepsilon_4$.]
Classifying with additional dimensions

[Figure: no linear separator in the original space; a linear separator exists after applying $\varphi(x)$.]

Note: more dimensions make it easier to separate N training points – training error is minimized, but this may risk over-fitting.
Quadratic mapping function
• $x_1, x_2, x_3, x_4 \rightarrow x_1, x_2, x_3, x_4, x_1^2, x_2^2, x_3^2, x_4^2, x_1 x_2, x_1 x_3, x_1 x_4, \ldots, x_2 x_4, x_3 x_4$
• N features $\rightarrow N + N + \frac{N(N-1)}{2} \approx N^2$ features
• $N^2$ values to learn for w in the higher-dimensional space
• Or, observe (up to constant factors on each term):
  $(v^T x + 1)^2 = v_1^2 x_1^2 + \cdots + v_N^2 x_N^2 + v_1 v_2 x_1 x_2 + \cdots + v_{N-1} v_N x_{N-1} x_N + v_1 x_1 + \cdots + v_N x_N$
• A vector v with only N elements operates in the quadratic space

(recall: $w^T x_j + b = \sum_i \alpha_i y_i\, x_i^T x_j + b$)
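A minimal check that the quadratic-space dot product can be computed in the low-dimensional space (the √2 scaling constants come from the exact expansion of $(x^T z + 1)^2$, which the slide writes up to constant factors; phi_quad is my name):

```python
import numpy as np

def phi_quad(x):
    """Explicit quadratic map satisfying (x.z + 1)^2 = phi_quad(x).phi_quad(z)."""
    cross = np.sqrt(2.0) * np.array([x[i] * x[j]
                                     for i in range(len(x))
                                     for j in range(i + 1, len(x))])
    return np.concatenate([x ** 2, cross, np.sqrt(2.0) * x, [1.0]])

x, z = np.array([1.0, 2.0, 3.0]), np.array([0.5, -1.0, 2.0])
print((x @ z + 1) ** 2)           # kernel computed in the low-dimensional space
print(phi_quad(x) @ phi_quad(z))  # same value via the explicit mapping
```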
Kernels
Classifying $x_j$: $\sum_i \alpha_i y_i\, \varphi(x_i)^T \varphi(x_j) + b$

Kernel trick:
• Estimate the high-dimensional dot product with a function
• $K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$
• E.g., $K(x_i, x_j) = \exp\left( -\dfrac{\|x_i - x_j\|^2}{2\sigma^2} \right)$

(recall: $w^T x_j + b = \sum_i \alpha_i y_i\, x_i^T x_j + b$)
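A minimal sketch of kernelized classification with the Gaussian kernel above (NumPy; sigma and the function names are illustrative):

```python
import numpy as np

def rbf_kernel(x_i, x_j, sigma=1.0):
    """K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    diff = x_i - x_j
    return np.exp(-(diff @ diff) / (2.0 * sigma ** 2))

def kernel_classify(X_train, y_train, alpha, b, x_j, sigma=1.0):
    """sum_i alpha_i y_i K(x_i, x_j) + b, replacing the mapped dot product."""
    k = np.array([rbf_kernel(x_i, x_j, sigma) for x_i in X_train])
    return np.sum(alpha * y_train * k) + b
```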
The power of SVM (+ kernels)
• The boundary is defined by a few support vectors
  • Caused by: maximizing the margin
  • Causes: less overfitting
  • Similar to: regularization
• Kernels keep the number of learned parameters in check
Multi-class SVMs
• Learn a boundary for class k vs. all other classes
• Find the boundary that gives the highest margin for data point $x_i$
Benefits of generative methods
• $P(D|\theta)$ and $P(\theta|D)$ can generate a non-linear boundary
• E.g.: Gaussians with multiple variances