


Discriminative classifiers: Logistic Regression, SVMs

CISC 5800

Professor Daniel Leeds

Maximum A Posteriori: a quick review
• Likelihood: $P(D|\theta) = P(D|p) = p^{|H|}(1-p)^{|T|}$
• Prior: $P(\theta) = \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha,\beta)}$, i.e., $P(p) = \frac{p^{\alpha-1}(1-p)^{\beta-1}}{B(\alpha,\beta)}$
• Posterior ∝ Likelihood × prior $= P(D|\theta)\,P(\theta)$
• MAP estimate: $\arg\max_\theta \left[\log P(D|\theta) + \log P(\theta)\right]$, i.e., $\arg\max_p \left[\log P(D|p) + \log P(p)\right]$, giving

$$p = \frac{|H| + \alpha - 1}{|H| + \alpha - 1 + |T| + \beta - 1}$$

Choose $\alpha$ and $\beta$ to give the prior belief of the Heads bias $p \in [0,1]$.
Higher $\alpha$: Heads more likely; higher $\beta$: Tails more likely.


Estimate each $P(X_i|Y)$ through MAP
Incorporating a prior $\beta_j$ for each class:

$$P(X_i = x_k \mid Y = y_j) = \frac{\#_D(X_i = x_k \wedge Y = y_j) + (\beta_j - 1)}{\#_D(Y = y_j) + \sum_m (\beta_m - 1)}$$

$$P(Y = y_j) = \frac{\#_D(Y = y_j) + (\beta_j - 1)}{|D| + \sum_m (\beta_m - 1)}$$

Note: both X and Y can take on multiple values (binary and beyond).
• $(\beta_j - 1)$ – "frequency" of class $j$
• $\sum_m (\beta_m - 1)$ – "frequencies" of all classes

Classification strategy: generative vs. discriminative
• Generative, e.g., Bayes/Naïve Bayes:
  • Identify a probability distribution for each class
  • Determine the class with maximum probability for a data example
• Discriminative, e.g., Logistic Regression:
  • Identify the boundary between classes
  • Determine which side of the boundary a new data example falls on

[Plot: two classes of points in a 2-D feature space, both axes 0–20]

Linear algebra: data features
• Vector – list of numbers: each number describes a data feature
• Matrix – list of lists of numbers: features for each data point

Number of word occurrences (d features per document):

  Word       Doc 1   Doc 2   Doc 3
  Wolf         12      8       0
  Lion         16     10       2
  Monkey       14     11       1
  Broker        0      1      14
  Analyst       1      0      10
  Dividend      1      1      12
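As a sketch, the word-count table above as a numpy matrix (numpy assumed available), one row per word feature and one column per document:

```python
import numpy as np

# Sketch: the slide's document-term matrix.
words = ["Wolf", "Lion", "Monkey", "Broker", "Analyst", "Dividend"]
X = np.array([[12,  8,  0],
              [16, 10,  2],
              [14, 11,  1],
              [ 0,  1, 14],
              [ 1,  0, 10],
              [ 1,  1, 12]])

doc1 = X[:, 0]                            # the feature vector for Document 1
print(dict(zip(words, doc1.tolist())))    # {'Wolf': 12, 'Lion': 16, ...}
```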

Feature space
• Each data feature defines a dimension in space

Using the word-count table above, each document becomes a point in feature space; e.g., plotting only the wolf and lion counts:

[Plot: documents in the (wolf, lion) plane, axes 0–20: doc1 at (12, 16), doc2 at (8, 10), doc3 at (0, 2)]


The dot product
The dot product compares two vectors:

$$\boldsymbol{a} = \begin{pmatrix} a_1 \\ \vdots \\ a_n \end{pmatrix},\quad \boldsymbol{b} = \begin{pmatrix} b_1 \\ \vdots \\ b_n \end{pmatrix},\qquad \boldsymbol{a} \cdot \boldsymbol{b} = \sum_{i=1}^{n} a_i b_i = \boldsymbol{a}^T\boldsymbol{b}$$

Example: $\begin{pmatrix} 5 \\ 10 \end{pmatrix} \cdot \begin{pmatrix} 10 \\ 10 \end{pmatrix} = 5 \times 10 + 10 \times 10 = 50 + 100 = 150$
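The slide's example, as a one-line numpy check:

```python
import numpy as np

# Sketch: the dot-product example from the slide.
a = np.array([5, 10])
b = np.array([10, 10])
print(a @ b)          # 5*10 + 10*10 = 150
```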

The dot product, continued
The magnitude of a vector is the square root of the sum of the squares of its elements:

$$\|\boldsymbol{a}\| = \sqrt{\sum_i a_i^2}$$

If $\boldsymbol{a}$ has unit magnitude, $\boldsymbol{a} \cdot \boldsymbol{b} = \sum_{i=1}^{n} a_i b_i$ is the "projection" of $\boldsymbol{b}$ onto $\boldsymbol{a}$:

$$\begin{pmatrix} 0.71 \\ 0.71 \end{pmatrix} \cdot \begin{pmatrix} 1.5 \\ 1 \end{pmatrix} = 0.71 \times 1.5 + 0.71 \times 1 \approx 1.07 + 0.71 = 1.78$$

$$\begin{pmatrix} 0.71 \\ 0.71 \end{pmatrix} \cdot \begin{pmatrix} 0 \\ 0.5 \end{pmatrix} = 0.71 \times 0 + 0.71 \times 0.5 \approx 0 + 0.35 = 0.35$$
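And the projection examples, again as a small numpy check:

```python
import numpy as np

# Sketch: projecting b onto the (approximately) unit vector a,
# as in the slide's two examples.
a = np.array([0.71, 0.71])            # sqrt(0.71^2 + 0.71^2) ~= 1
for b in (np.array([1.5, 1.0]), np.array([0.0, 0.5])):
    print(a @ b)                      # ~1.78 and ~0.35
```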

Separating boundary, defined by w
• A separating hyperplane splits class 0 and class 1
• The plane is defined by the vector $w$ perpendicular to it
• Is data point $x$ in class 0 or class 1?
  $w^T x > 0$: class 1
  $w^T x < 0$: class 0

From real-number projection to 0/1 label
• Binary classification: 0 is class A, 1 is class B
• The sigmoid function stands in for $p(y|x)$
• Sigmoid: $g(h) = \frac{1}{1+e^{-h}}$, where $w^T x = \sum_j w_j x_j$
• $p(y=0|x;\theta) = 1 - g(w^T x) = \frac{e^{-w^T x}}{1+e^{-w^T x}}$
• $p(y=1|x;\theta) = g(w^T x) = \frac{1}{1+e^{-w^T x}}$

[Plot: $g(h)$ rises from 0 to 1 over $h \in [-10, 10]$, with $g(0) = 0.5$]

Learning parameters for classification
• Similar to MLE for the Bayes classifier
• "Likelihood" for data points $y_1, \ldots, y_n$ (different from Bayesian likelihood):
  • If $y_i$ is in class A, $y_i = 0$: multiply in $(1 - g(x_i;w))$
  • If $y_i$ is in class B, $y_i = 1$: multiply in $g(x_i;w)$

$$\arg\max_w L(y|x;w) = \prod_i \left(1 - g(x_i;w)\right)^{(1-y_i)} g(x_i;w)^{y_i}$$

$$LL(y|x;w) = \sum_i (1-y_i) \log\left(1 - g(x_i;w)\right) + y_i \log g(x_i;w)$$

$$LL(y|x;w) = \sum_i y_i \log\frac{g(x_i;w)}{1 - g(x_i;w)} + \log\left(1 - g(x_i;w)\right)$$
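A sketch of this log-likelihood for 0/1 labels; X, y, and w are assumed to be numpy arrays:

```python
import numpy as np

# Sketch: the log-likelihood LL(y|x;w) above.
def g(h):
    return 1.0 / (1.0 + np.exp(-h))

def log_likelihood(X, y, w):
    # X: (n, d) data matrix; y: (n,) labels in {0,1}; w: (d,) weights
    p = g(X @ w)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
```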


Learning parameters for classification

$$LL(y|x;w) = \sum_i y_i \log\frac{g(x_i;w)}{1 - g(x_i;w)} + \log\left(1 - g(x_i;w)\right)$$

$$LL(y|x;w) = \sum_i y_i \log\frac{\frac{1}{1+e^{-w^T x_i}}}{1 - \frac{1}{1+e^{-w^T x_i}}} + \log\frac{e^{-w^T x_i}}{1 + e^{-w^T x_i}}$$

$$LL(y|x;w) = \sum_i y_i \log\frac{1}{e^{-w^T x_i}} + \log\frac{e^{-w^T x_i}}{1 + e^{-w^T x_i}}$$

$$LL(y|x;w) = \sum_i y_i w^T x_i - w^T x_i - \log\left(1 + e^{-w^T x_i}\right)$$

(using $g(h) = \frac{1}{1+e^{-h}}$)

Learning parameters for classification

$$LL(y|x;w) = \sum_i y_i w^T x_i - w^T x_i + \log g(w^T x_i)$$

$$\frac{\partial}{\partial w_j} LL(y|x;w) = \sum_i y_i x_j^i - x_j^i + \frac{x_j^i\, e^{-w^T x_i}}{1 + e^{-w^T x_i}}$$

$$\frac{\partial}{\partial w_j} LL(y|x;w) = \sum_i x_j^i \left( y_i - \left(1 - \left(1 - g(w^T x_i)\right)\right) \right)$$

$$\frac{\partial}{\partial w_j} LL(y|x;w) = \sum_i x_j^i \left( y_i - g(w^T x_i) \right)$$

(using $g'(h) = \frac{e^{-h}}{(1+e^{-h})^2}$ and $w^T x = \sum_j w_j x_j$)

Iterative gradient ascent
• Begin with initial guessed weights $w$
• For each data point $(y_i, x_i)$, update each weight $w_j$:

$$w_j \leftarrow w_j + \varepsilon\, x_j^i \left( y_i - g(w^T x_i) \right)$$

• Choose $\varepsilon$ so the change is not too big or too small – the "step size"
• Intuition for $x_j^i \left( y_i - g(w^T x_i) \right)$, where $y_i$ is the true data label and $g(w^T x_i)$ the computed label (a training sketch follows this slide):
  • If $y_i = 1$, $g(w^T x_i) = 0$, and $x_j^i > 0$: make $w_j$ larger, pushing $w^T x_i$ to be larger
  • If $y_i = 0$, $g(w^T x_i) = 1$, and $x_j^i > 0$: make $w_j$ smaller, pushing $w^T x_i$ to be smaller
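A sketch of the full update loop on toy data; the step size, epoch count, and data are arbitrary illustrative choices:

```python
import numpy as np

# Sketch: stochastic gradient ascent for logistic regression,
# following the update rule above.
def g(h):
    return 1.0 / (1.0 + np.exp(-h))

def train(X, y, eps=0.1, epochs=100):
    w = np.zeros(X.shape[1])                      # initial guess for weights
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            w += eps * x_i * (y_i - g(w @ x_i))   # w_j += eps*x_j^i*(y_i - g(w^T x_i))
    return w

# Toy usage: class 1 tends to have a larger first feature.
X = np.array([[2.0, 1.0], [1.5, 0.5], [-1.0, 0.2], [-2.0, -0.5]])
y = np.array([1, 1, 0, 0])
w = train(X, y)
print(g(X @ w))   # probabilities near 1 for the first two points, near 0 after
```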

Separating boundary, defined by w (revisited)
• A separating hyperplane splits class 0 and class 1
• The plane is defined by the vector $w$ perpendicular to it
• $w^T x > 0$: class 1; $w^T x < 0$: class 0

But, where do we place the boundary?

Logistic regression:

$$LL(y|x;w) = \sum_i (y_i - 1)\, w^T x_i - \log\left(1 + e^{-w^T x_i}\right)$$

• Each data point $x_i$ is considered for the boundary $w$
• Outlier data pulls the boundary towards it

Max margin classifiers
• Focus on the boundary points
• Find the largest margin between boundary points on both sides
• Works well in practice
• We can call the boundary points "support vectors"


Classifying with additional dimensions

[Figure: data with no linear separator in the original space becomes linearly separable after applying the mapping $\varphi(x)$]

Mapping function(s)
• Map from a low-dimensional space $x = (x_1, x_2)$ to a higher-dimensional space $\varphi(x) = (x_1, x_2, x_1^2, x_2^2, x_1x_2)$
• $N$ data points are guaranteed to be separable in a space of $N-1$ dimensions or more

Classifying $x_j$: $\sum_i \alpha_i y_i\, \varphi^T(x_i)\, \varphi(x_j) + b$, where $w = \sum_i \alpha_i\, \varphi(x_i)\, y_i$
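A sketch of this particular mapping; phi(a) @ phi(b) is the higher-dimensional dot product used in the classification sum:

```python
import numpy as np

# Sketch: the 2-D -> 5-D mapping phi(x) = (x1, x2, x1^2, x2^2, x1*x2).
def phi(x):
    x1, x2 = x
    return np.array([x1, x2, x1**2, x2**2, x1 * x2])

# XOR-like points that are not linearly separable in 2-D can become
# separable after this mapping.
a, b = np.array([1.0, -1.0]), np.array([0.5, 2.0])
print(phi(a) @ phi(b))    # dot product in the mapped space
```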

Discriminative classifiers
Find a separator to minimize classification error:
• Logistic Regression
• Support Vector Machines

Logistic Regression review
• $p(y=0|x;\theta) = 1 - g(w^T x) = \frac{e^{-w^T x}}{1+e^{-w^T x}}$
• $p(y=1|x;\theta) = g(w^T x) = \frac{1}{1+e^{-w^T x}}$, with logistic function $g(h) = \frac{1}{1+e^{-h}}$
• Maximize likelihood: $\arg\max_w L(y|x;w) = \prod_i \left(1 - g(x_i;w)\right)^{(1-y_i)} g(x_i;w)^{y_i}$
• The likelihood is $P(D|\theta)$: $D = \{x_i, y_i\}$, $\theta = w$
• Update $w$: $\frac{\partial}{\partial w_j} LL(y|x;w) = \sum_i x_j^i \left( y_i - g(w^T x_i) \right)$

MAP for discriminative classifier
• MLE: $P(y=1|x;w) \sim g(w^T x)$, $P(y=0|x;w) \sim 1 - g(w^T x)$
• MAP: $P(y=1, w|x) \propto P(y=1|x;w)\,P(w) \sim g(w^T x)$ ??? (different from the Bayesian posterior)
• $P(w)$ priors:
  • L2 regularization – minimize all weights
  • L1 regularization – minimize the number of non-zero weights

MAP – L2 regularization
• $P(y=1, w|x) \propto P(y=1|x;w)\,P(w)$:

$$L(y,w|x) = \prod_i \left(1 - g(x_i;w)\right)^{(1-y_i)} g(x_i;w)^{y_i} \prod_j e^{-\frac{w_j^2}{2\lambda}}$$

$$LL(y,w|x) = \sum_i y_i w^T x_i - w^T x_i + \log g(w^T x_i) - \sum_j \frac{w_j^2}{2\lambda}$$

$$\frac{\partial}{\partial w_j} LL(y,w|x) = \sum_i x_j^i \left( y_i - g(w^T x_i) \right) - \frac{w_j}{\lambda}$$

Prevents $w^T x$ from getting too large.

[Plot: the Gaussian prior density $p(w)$, peaked at $w = 0$]


MAP – L1 regularization
• $P(y=1, w|x) \propto P(y=1|x;w)\,P(w)$:

$$L(y,w|x) = \prod_i \left(1 - g(x_i;w)\right)^{(1-y_i)} g(x_i;w)^{y_i} \prod_j e^{-\frac{|w_j|}{\lambda}}$$

$$LL(y,w|x) = \sum_i y_i w^T x_i - w^T x_i + \log g(w^T x_i) - \sum_j \frac{|w_j|}{\lambda}$$

$$\frac{\partial}{\partial w_j} LL(y,w|x) = \sum_i x_j^i \left( y_i - g(w^T x_i) \right) - \frac{\mathrm{sign}(w_j)}{\lambda}$$

Forces most dimensions to 0. (A sketch of both regularized updates follows.)

[Plot: the Laplacian prior density $p(w)$, sharply peaked at $w = 0$]
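A sketch of one update step under either prior; lam and eps are hyperparameters, and the penalty gradients match the L2 and L1 slides above:

```python
import numpy as np

# Sketch: a single gradient-ascent update with an L2 or L1 penalty term.
def g(h):
    return 1.0 / (1.0 + np.exp(-h))

def update(w, x_i, y_i, eps=0.1, lam=10.0, penalty="l2"):
    grad = x_i * (y_i - g(w @ x_i))           # unregularized gradient
    if penalty == "l2":
        grad -= w / lam                       # from the Gaussian prior
    elif penalty == "l1":
        grad -= np.sign(w) / lam              # from the Laplacian prior
    return w + eps * grad
```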

Parameters for learning

$$w_j \leftarrow w_j + \varepsilon \left( x_j^i \left( y_i - g(w^T x_i) \right) - \frac{w_j}{\lambda N} \right)$$

• Regularization: selecting $\lambda$ influences the strength of your bias
• Gradient ascent: selecting $\varepsilon$ influences the effect of individual data points in learning
• Bayesian: selecting $\beta_j$ indicates the strength of the class prior beliefs
• $\lambda$, $\varepsilon$, $\beta_j$ are parameters controlling our learning

Multi-class logistic regression: class probability

Recall binary class:
• $p(y=0|x;\theta) = 1 - g(w^T x) = \frac{e^{-w^T x}}{1+e^{-w^T x}}$
• $p(y=1|x;\theta) = g(w^T x) = \frac{1}{1+e^{-w^T x}}$

Multi-class – $m$ classes, with one weight vector $w_k$ per class $k = 1, \ldots, m-1$:
• $p(y=j|x;\theta) = \frac{e^{w_j^T x}}{1 + \sum_{k=1}^{m-1} e^{w_k^T x}}$ for $j = 1, \ldots, m-1$
• $p(y=m|x;\theta) = 1 - \sum_{j=1}^{m-1} p(y=j|x;\theta)$
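A sketch of these class probabilities with m−1 weight vectors and class m as the reference class; the weights and data point are made up:

```python
import numpy as np

# Sketch: multi-class probabilities, matching the formulas above.
def class_probs(W, x):
    # W: (m-1, d) weight matrix; returns a length-m probability vector
    scores = np.exp(W @ x)
    denom = 1.0 + scores.sum()
    return np.append(scores / denom, 1.0 / denom)    # last entry: class m

W = np.array([[0.5, -1.0], [0.2, 0.3]])   # m = 3 classes, d = 2 features
print(class_probs(W, np.array([1.0, 2.0])).sum())    # probabilities sum to 1
```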

Multi-class logistic regression: likelihood

Recall binary class:
• $L(y|x;\theta) = \prod_i p(y_i=0|x_i;\theta)^{(1-y_i)}\, p(y_i=1|x_i;\theta)^{y_i}$
• $\frac{\partial}{\partial w_j} LL(y|x;w) = \sum_i x_j^i \left( y_i - g(w^T x_i) \right)$

Multi-class:
• $L(y|x;\theta) = \prod_{i,k} p(y_i=k|x_i;\theta)^{\delta(y_i-k)}$
• $\frac{\partial}{\partial w_j} LL(y|x;w) = \sum_{i,k} x_j^i \left( \delta(y_i-k) - p(y_i=k|x_i;\theta) \right)$

where $\delta(a) = 1$ if $a = 0$, and $0$ otherwise.

Multi-class logistic regression: update rule

Recall binary class:
• $w_j \leftarrow w_j + \varepsilon\, x_j^i \left( y_i - g(w^T x_i) \right)$

Multi-class:
• $w_{j,k} \leftarrow w_{j,k} + \varepsilon\, x_j^i \left( \delta(y_i-k) - p(y_i=k|x_i;\theta) \right)$

where $\delta(a) = 1$ if $a = 0$, and $0$ otherwise. A sketch of this update follows.

Logistic regression: how many parameters?
• $N$ features
• $M$ classes
• Learn $w_k$ for each of $M-1$ classes: $N(M-1)$ parameters
• Actually: $w^T x = \sum_j w_j x_j$; it would be better to allow an offset from the origin, $w^T x + b$: $N+1$ parameters per class
• Total: $(N+1)(M-1)$ parameters


Max margin classifiers (review)
• Focus on the boundary points
• Find the largest margin between boundary points on both sides
• Works well in practice
• We can call the boundary points "support vectors"

Maximum margin definitions
• $M$ is the margin width
• $x^+$ is a +1 point closest to the boundary; $x^-$ is a −1 point closest to the boundary
• $x^+ = \lambda w + x^-$
• $\|x^+ - x^-\| = M$

The three parallel lines: $w^T x + b = 1$, $w^T x + b = 0$, $w^T x + b = -1$

Classify as +1 if $w^T x + b \geq 1$; classify as −1 if $w^T x + b \leq -1$; undefined if $-1 \leq w^T x + b \leq 1$

λ derivation
• $w^T x^+ + b = +1$
• $w^T(\lambda w + x^-) + b = +1$
• $\lambda w^T w + w^T x^- + b = +1$
• $\lambda w^T w - 1 - b + b = +1$
• $\lambda = \frac{2}{w^T w}$

(using $w^T x^- + b = -1$, $w^T x^+ + b = +1$, $x^+ = \lambda w + x^-$)

M derivation
• $M = \|\lambda w + x^- - x^-\| = \|\lambda w\| = \lambda \|w\|$
• $M = \lambda\sqrt{w^T w}$
• $M = \frac{2}{w^T w}\sqrt{w^T w} = \frac{2}{\sqrt{w^T w}}$

Maximizing $M$ ⟺ minimizing $w^T w$

(using $w^T x^- + b = -1$, $w^T x^+ + b = +1$, $x^+ = \lambda w + x^-$, $\|x^+ - x^-\| = M$)

Support vector machine (SVM) optimization

$$\arg\max_{w,b} M = \frac{2}{\sqrt{w^T w}} \quad\Longleftrightarrow\quad \arg\min_{w,b} w^T w$$

subject to
$w^T x + b \geq 1$ for $x$ in class +1
$w^T x + b \leq -1$ for $x$ in class −1

Optimization with constraints: set $\frac{\partial}{\partial w_j} f(w_j) = 0$ with Lagrange multipliers
• Gradient descent
• Matrix calculus

Alternate SVM formulation

๐‘ค =

๐‘–

๐›ผ๐‘– ๐’™๐‘–๐‘ฆ๐‘–

Support vectors ๐‘ฅ๐‘– have ๐›ผ๐‘– > 0

๐‘ฆ๐‘– are the data labels +1 or -1

To classify sample xj, compute:

๐‘ค๐‘‡๐‘ฅ๐‘— + ๐‘ =

๐‘–

๐›ผ๐‘– ๐‘ฆ๐‘– ๐’™๐‘–๐‘‡๐’™๐‘— + ๐‘

36

๐œถ๐’Š โ‰ฅ ๐ŸŽ โˆ€๐’Š

๐’Š

๐œถ๐’Š ๐’š๐’Š = ๐ŸŽ


Support vector machine (SVM) optimization with slack variables

What if the data are not linearly separable?

$$\arg\min_{w,b} w^T w + C \sum_i \varepsilon_i$$

subject to
$w^T x + b \geq 1 - \varepsilon_i$ for $x$ in class +1
$w^T x + b \leq -1 + \varepsilon_i$ for $x$ in class −1
$\varepsilon_i \geq 0\;\forall i$

Each error $\varepsilon_i$ is penalized based on its distance from the separator.

[Figure: points violating the margin, each labeled with its slack value $\varepsilon_1, \varepsilon_2, \varepsilon_3, \varepsilon_4$]

Classifying with additional dimensions (revisited)

[Figure: no linear separator in the original space; a linear separator after the mapping $\varphi(x)$]

Note: more dimensions make it easier to separate $N$ training points: training error is minimized, but this may risk over-fit.

Quadratic mapping function
• $x_1, x_2, x_3, x_4 \rightarrow x_1, x_2, x_3, x_4, x_1^2, x_2^2, x_3^2, x_4^2, x_1x_2, x_1x_3, x_1x_4, \ldots, x_2x_4, x_3x_4$
• $N$ features → $N + N + \frac{N(N-1)}{2} \approx N^2$ features
• $N^2$ values to learn for $w$ in the higher-dimensional space
• Or, observe: $(v^T x + 1)^2 = v_1^2x_1^2 + \cdots + v_N^2x_N^2 + v_1v_2x_1x_2 + \cdots + v_{N-1}v_Nx_{N-1}x_N + v_1x_1 + \cdots + v_Nx_N$ – a $v$ with $N$ elements operating in the quadratic space
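A numeric check of this observation (not from the slides): with explicit quadratic features, here using the standard √2 scaling on the linear terms (slightly different from the slide's unscaled listing), the kernel value (vᵀx + 1)² equals a dot product in the mapped space:

```python
import numpy as np

# Sketch: (v.x + 1)^2 computed two ways.
def phi(z):
    # explicit quadratic features: all z_i*z_j, then sqrt(2)*z_i, then 1
    return np.concatenate([np.outer(z, z).ravel(), np.sqrt(2) * z, [1.0]])

v = np.array([1.0, 2.0, 3.0])
x = np.array([0.5, -1.0, 2.0])
print((v @ x + 1) ** 2)    # kernel evaluation in the original space
print(phi(v) @ phi(x))     # same value via the explicit mapping
```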

๐‘ค๐‘‡๐‘ฅ๐‘— + ๐‘ =

๐‘–

๐›ผ๐‘– ๐‘ฆ๐‘– ๐’™๐‘–๐‘‡๐’™๐‘— + ๐‘

Kernels

Classifying ๐’™๐’‹:

๐‘–

๐›ผ๐‘–๐‘ฆ๐‘–๐œ‘ ๐’™๐’Š๐‘‡๐œ‘ ๐’™๐’‹ + ๐‘

Kernel trick:

โ€ข Estimate high-dimensional dot product with function

โ€ข ๐พ ๐’™๐’Š, ๐’™๐’‹ = ๐œ‘ ๐’™๐’Š๐‘‡๐œ‘ ๐’™๐’‹

โ€ข E.g., ๐พ ๐’™๐’Š, ๐’™๐’‹ = exp โˆ’๐’™๐’Šโˆ’๐’™๐’‹

2

2๐œŽ2

40
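A sketch of this Gaussian (RBF) kernel; σ is a bandwidth hyperparameter:

```python
import numpy as np

# Sketch: the RBF kernel from the slide. Its implicit phi is
# infinite-dimensional, so the kernel is the only practical way
# to compute this dot product.
def rbf_kernel(x_i, x_j, sigma=1.0):
    d = x_i - x_j
    return np.exp(-(d @ d) / (2 * sigma**2))

print(rbf_kernel(np.array([1.0, 2.0]), np.array([2.0, 0.0])))
```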

๐‘ค๐‘‡๐‘ฅ๐‘— + ๐‘ =

๐‘–

๐›ผ๐‘– ๐‘ฆ๐‘– ๐’™๐‘–๐‘‡๐’™๐‘— + ๐‘

The power of SVM (+kernels)
• The boundary is defined by a few support vectors
  • Caused by: maximizing the margin
  • Causes: less overfitting
  • Similar to: regularization
• Kernels keep the number of learned parameters in check

Multi-class SVMs
• Learn a boundary for class $k$ vs. all other classes
• Find the boundary that gives the highest margin for data point $x_i$


Benefits of generative methods
• $P(D|\theta)$ and $P(\theta|D)$ can generate a non-linear boundary
• E.g.: Gaussians with multiple variances