
Page 1: 580.691  Learning Theory Reza Shadmehr

580.691 Learning Theory

Reza Shadmehr

Classification via regression

Fisher linear discriminant

Bayes classifier

Confidence and error rate of the Bayes classifier

Page 2: 580.691  Learning Theory Reza Shadmehr

Classification via regression

• Suppose we wish to classify vector x as belonging to either class C0 or C1.

• We can approach the problem as if it were a regression problem:

$$D = \left\{ \left(\mathbf{x}^{(1)}, y^{(1)}\right), \ldots, \left(\mathbf{x}^{(n)}, y^{(n)}\right) \right\}, \qquad y^{(i)} \in \{0, 1\}$$

$$\hat{y} = \mathbf{w}^T \mathbf{x} = w_0 + w_1 x_1 + w_2 x_2$$

$$X = \begin{bmatrix} 1 & x_1^{(1)} & x_2^{(1)} \\ \vdots & \vdots & \vdots \\ 1 & x_1^{(n)} & x_2^{(n)} \end{bmatrix}, \qquad \mathbf{w}_{ML} = \left(X^T X\right)^{-1} X^T \mathbf{y}$$

$$\mathbf{x} \in C_1 \text{ if } \hat{y}(\mathbf{x}) \ge 0.5, \text{ otherwise } \mathbf{x} \in C_0$$
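Below is a minimal sketch of this regression-based classifier in Python, assuming numpy and made-up synthetic 2-D data for the two classes (the cluster locations and sample sizes are not from the lecture):

import numpy as np

rng = np.random.default_rng(0)
x0 = rng.normal([0.0, 0.0], 1.0, size=(50, 2))    # assumed class C0 samples
x1 = rng.normal([3.0, 3.0], 1.0, size=(50, 2))    # assumed class C1 samples
x = np.vstack([x0, x1])
y = np.r_[np.zeros(50), np.ones(50)]              # regression targets are the labels 0/1

X = np.column_stack([np.ones(len(x)), x])         # design matrix with columns [1, x1, x2]
w_ml, *_ = np.linalg.lstsq(X, y, rcond=None)      # w_ML = (X^T X)^{-1} X^T y

y_hat = X @ w_ml
labels = (y_hat >= 0.5).astype(int)               # x in C1 if y_hat >= 0.5, else C0
print("training accuracy:", np.mean(labels == y))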

Page 3: 580.691  Learning Theory Reza Shadmehr


Classification via regression

• Model:

$$\hat{y} = \mathbf{w}^T \mathbf{x} = w_0 + w_1 x_1 + w_2 x_2$$

$$\mathbf{x} \in C_1 \text{ if } \hat{y} \ge 0.5, \text{ otherwise } \mathbf{x} \in C_0$$

[Figures: the labeled data in the $(x_1, x_2)$ plane, and a surface plot of $\hat{y}$ over $(x_1, x_2)$.]

The decision boundary is the line where the regression output equals 0.5:

$$0.5 = w_0 + w_1 x_1 + w_2 x_2 \qquad \Longrightarrow \qquad x_2 = \frac{1}{w_2}\left(0.5 - w_0 - w_1 x_1\right)$$
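As a small illustration (with hypothetical weights, not values computed in the lecture), the boundary line can be evaluated directly:

import numpy as np

w0, w1, w2 = 0.4, 0.12, 0.15                      # hypothetical weights
x1_grid = np.linspace(-4, 6, 50)
x2_boundary = (0.5 - w0 - w1 * x1_grid) / w2      # x2 along the line w0 + w1*x1 + w2*x2 = 0.5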

Page 4: 580.691  Learning Theory Reza Shadmehr

Classification via regression: concerns

• Model:

$$\hat{y} = \mathbf{w}^T \mathbf{x} = w_0 + w_1 x_1 + w_2 x_2$$

$$\mathbf{x} \in C_1 \text{ if } \hat{y} \ge 0.5, \text{ otherwise } \mathbf{x} \in C_0$$

[Figures: two example data sets in the $(x_1, x_2)$ plane with the fitted decision boundary $\mathbf{x}^T \mathbf{w}_{ML} = 0.5$ (the level set $\mathbf{x}^T \mathbf{w}_{ML} = 1$ is also marked).]

• Sometimes an x can give us a ŷ that is outside our range (outside 0-1).

The first classification looks good; the second, not so good.


Page 5: 580.691  Learning Theory Reza Shadmehr

Classification via regression: concerns

• Model:

$$\hat{y} = \mathbf{w}^T \mathbf{x} = w_0 + w_1 x_1 + w_2 x_2, \qquad \mathbf{x} \in C_1 \text{ if } \hat{y} \ge 0.5, \text{ otherwise } \mathbf{x} \in C_0$$

$$\text{error: } \epsilon^{(n)} = y^{(n)} - \hat{y}^{(n)}$$

• The variance of the error (which equals the variance of y given x) depends on x, unlike in ordinary regression, where it is assumed constant.

Since y is a random variable that can only take on values of 0 or 1, error in regression will not be normally distributed.

$$P(y = 1 \mid x) = p, \qquad P(y = 0 \mid x) = 1 - p$$

$$E[y \mid x] = 1 \cdot p + 0 \cdot (1 - p) = p, \qquad E[y^2 \mid x] = 1^2 \cdot p + 0^2 \cdot (1 - p) = p$$

$$\operatorname{var}(y \mid x) = E[y^2 \mid x] - E[y \mid x]^2 = p - p^2 = P(y = 1 \mid x)\left(1 - P(y = 1 \mid x)\right)$$

Page 6: 580.691  Learning Theory Reza Shadmehr

Regression as projection

• A linear regression function projects each data point:

Each data point $\mathbf{x}^{(n)} = [x_1, x_2]^T$ is projected onto the weight vector $\mathbf{w}_1$:

$$\hat{y} = w_0 + \mathbf{w}_1^T \mathbf{x} = w_0 + w_1 x_1 + w_2 x_2$$

$$z^{(n)} = \mathbf{w}_1^T \mathbf{x}^{(n)}$$

For a given w1, there will be a specific distribution of the projected points z={z(1),z(2),…,z(n)}. We can study how well the projected points are distributed into classes.

[Figures: the same data set projected onto different choices of $\mathbf{w}_1$; the distribution of the projected points differs across directions.]
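A minimal sketch of this projection, assuming synthetic 2-D data and an arbitrary (made-up) direction w1:

import numpy as np

rng = np.random.default_rng(1)
x = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(50, 2)),     # class 0 (assumed)
               rng.normal([3.0, 3.0], 1.0, size=(50, 2))])    # class 1 (assumed)
y = np.r_[np.zeros(50), np.ones(50)]

w1 = np.array([0.2, 0.3])           # a hypothetical projection direction
z = x @ w1                          # z^(n) = w1^T x^(n), one scalar per point
print("projected class means:", z[y == 0].mean(), z[y == 1].mean())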

Page 7: 580.691  Learning Theory Reza Shadmehr

Fisher discriminant analysis

• Suppose we wish to classify vector x as belonging to either class C0 or C1.

Class y = 0: n₀ points, mean μ₀, variance Σ₀

Class y = 1: n₁ points, mean μ₁, variance Σ₁

• Class descriptions in the classification (or projected) space (i.e., the mean and variance of ŷ for x's that belong to class 0 or class 1):

$$E[\hat{y} \mid \mathbf{x} \in C_0] = w_0 + \mathbf{w}^T \boldsymbol{\mu}_0, \qquad \operatorname{var}[\hat{y} \mid \mathbf{x} \in C_0] = \mathbf{w}^T \boldsymbol{\Sigma}_0 \mathbf{w}$$

$$E[\hat{y} \mid \mathbf{x} \in C_1] = w_0 + \mathbf{w}^T \boldsymbol{\mu}_1, \qquad \operatorname{var}[\hat{y} \mid \mathbf{x} \in C_1] = \mathbf{w}^T \boldsymbol{\Sigma}_1 \mathbf{w}$$

where the class statistics in the data space are

$$\boldsymbol{\mu}_0 = E[\mathbf{x} \mid C_0] = \frac{1}{n_0} \sum_{i \in C_0} \mathbf{x}^{(i)}, \qquad \boldsymbol{\mu}_1 = E[\mathbf{x} \mid C_1] = \frac{1}{n_1} \sum_{i \in C_1} \mathbf{x}^{(i)}$$

$$\boldsymbol{\Sigma}_0 = \operatorname{var}[\mathbf{x} \mid C_0] = \frac{1}{n_0} \sum_{i \in C_0} \left(\mathbf{x}^{(i)} - \boldsymbol{\mu}_0\right)\left(\mathbf{x}^{(i)} - \boldsymbol{\mu}_0\right)^T, \qquad \boldsymbol{\Sigma}_1 = \operatorname{var}[\mathbf{x} \mid C_1] = \frac{1}{n_1} \sum_{i \in C_1} \left(\mathbf{x}^{(i)} - \boldsymbol{\mu}_1\right)\left(\mathbf{x}^{(i)} - \boldsymbol{\mu}_1\right)^T$$

Page 8: 580.691  Learning Theory Reza Shadmehr

Fisher discriminant analysis

• Find w so that when each point is projected to the classification space, the classes are maximally separated.

$$J(\mathbf{w}) = \frac{\text{separation of projected means}}{\text{sum of within-class variances}} = \frac{\left(\mathbf{w}^T \boldsymbol{\mu}_0 - \mathbf{w}^T \boldsymbol{\mu}_1\right)^2}{n_0\, \mathbf{w}^T \boldsymbol{\Sigma}_0 \mathbf{w} + n_1\, \mathbf{w}^T \boldsymbol{\Sigma}_1 \mathbf{w}}$$

[Figures: two projections of the same data, one with large separation between the projected classes and one with small separation.]
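A minimal sketch of this criterion as a Python function, assuming data arrays x (one row per point) and binary labels y like those in the earlier sketches:

import numpy as np

def fisher_J(w, x, y):
    """Fisher criterion: separation of projected means over summed within-class variances."""
    x0, x1 = x[y == 0], x[y == 1]
    mu0, mu1 = x0.mean(axis=0), x1.mean(axis=0)
    S0, S1 = np.cov(x0.T, bias=True), np.cov(x1.T, bias=True)   # class covariances
    between = (w @ (mu0 - mu1)) ** 2
    within = len(x0) * (w @ S0 @ w) + len(x1) * (w @ S1 @ w)
    return between / within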

Page 9: 580.691  Learning Theory Reza Shadmehr

Fisher discriminant analysis

$$\hat{\mathbf{w}} = \arg\max_{\mathbf{w}} J(\mathbf{w})$$

$$J(\mathbf{w}) = \frac{\left(\mathbf{w}^T \boldsymbol{\mu}_0 - \mathbf{w}^T \boldsymbol{\mu}_1\right)^2}{n_0\, \mathbf{w}^T \boldsymbol{\Sigma}_0 \mathbf{w} + n_1\, \mathbf{w}^T \boldsymbol{\Sigma}_1 \mathbf{w}} = \frac{\left(\mathbf{w}^T \mathbf{m}\right)^2}{\mathbf{w}^T S \mathbf{w}}, \qquad \mathbf{m} \equiv \boldsymbol{\mu}_0 - \boldsymbol{\mu}_1, \quad S \equiv n_0 \boldsymbol{\Sigma}_0 + n_1 \boldsymbol{\Sigma}_1$$

$S$ is symmetric positive definite, so we can always write $S = R^T R$, where $R$ is a "square root" matrix.

Using $R$, change the coordinate system of $J$ from $\mathbf{w}$ to $\mathbf{v} = R\mathbf{w}$ (so $\mathbf{w} = R^{-1}\mathbf{v}$):

$$J(\mathbf{v}) = \frac{\left(\mathbf{v}^T R^{-T} \mathbf{m}\right)^2}{\mathbf{v}^T \mathbf{v}}$$

Page 10: 580.691  Learning Theory Reza Shadmehr

Fisher discriminant analysis

$$J(\mathbf{v}) = \frac{\left(\mathbf{v}^T R^{-T} \mathbf{m}\right)^2}{\mathbf{v}^T \mathbf{v}} \text{ is maximum when } \hat{\mathbf{v}} = a\, R^{-T} \mathbf{m} = a\, R^{-T} \left(\boldsymbol{\mu}_0 - \boldsymbol{\mu}_1\right)$$

$$\hat{\mathbf{w}} = R^{-1} \hat{\mathbf{v}} = a\, R^{-1} R^{-T} \left(\boldsymbol{\mu}_0 - \boldsymbol{\mu}_1\right) = a\, \left(R^T R\right)^{-1} \left(\boldsymbol{\mu}_0 - \boldsymbol{\mu}_1\right) = a\, \left(n_0 \boldsymbol{\Sigma}_0 + n_1 \boldsymbol{\Sigma}_1\right)^{-1} \left(\boldsymbol{\mu}_0 - \boldsymbol{\mu}_1\right)$$

$$\hat{\mathbf{w}} \propto S^{-1} \left(\boldsymbol{\mu}_0 - \boldsymbol{\mu}_1\right)$$

The dot product of a vector of norm 1 and another vector is maximum when the two have the same direction.

$$w_0 = E\!\left[y - \mathbf{w}^T \mathbf{x}\right] = \frac{1}{n} \sum_{i=1}^{n} \left(y^{(i)} - \mathbf{w}^T \mathbf{x}^{(i)}\right)$$

(The scale factor $a$ above is an arbitrary constant.)
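A minimal sketch of this closed-form solution (the overall scale of w is arbitrary), for data arrays x and labels y as in the earlier sketches:

import numpy as np

def fisher_direction(x, y):
    x0, x1 = x[y == 0], x[y == 1]
    mu0, mu1 = x0.mean(axis=0), x1.mean(axis=0)
    S = len(x0) * np.cov(x0.T, bias=True) + len(x1) * np.cov(x1.T, bias=True)
    return np.linalg.solve(S, mu0 - mu1)    # w proportional to S^{-1} (mu0 - mu1)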

Page 11: 580.691  Learning Theory Reza Shadmehr

Bayesian classification

• Suppose we wish to classify vector x as belonging to one of L classes, c ∈ {1, …, L}. We are given labeled data and need to form a classification function:

$$D = \left\{\left(\mathbf{x}^{(1)}, c^{(1)}\right), \ldots, \left(\mathbf{x}^{(n)}, c^{(n)}\right)\right\}, \qquad c^{(i)} \in \{1, \ldots, L\}$$

$$\hat{c}(\mathbf{x}): \quad \hat{c} \in \{1, \ldots, L\}$$

$$\hat{c}(\mathbf{x}) = \arg\max_{l = 1, \ldots, L} P(c = l \mid \mathbf{x})$$

Classify x into the class l that maximizes the posterior probability.

$$P(c = l \mid \mathbf{x}) = \frac{\overbrace{p(\mathbf{x} \mid c = l)}^{\text{likelihood}}\ \overbrace{P(c = l)}^{\text{prior}}}{\underbrace{p(\mathbf{x})}_{\text{marginal}}}, \qquad p(\mathbf{x}) = \sum_{l = 1}^{L} p(\mathbf{x} \mid c = l)\, P(c = l)$$
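A minimal sketch of this rule with assumed 1-D Gaussian class likelihoods; the means, standard deviations, and priors below are made-up values:

import numpy as np
from scipy.stats import norm

means = [162.0, 176.0, 185.0]      # assumed class means (L = 3 classes)
sds = [6.0, 7.0, 7.0]              # assumed class standard deviations
priors = [0.4, 0.4, 0.2]           # assumed priors P(c = l)

def classify(x):
    # posterior is proportional to likelihood * prior; the marginal p(x) cancels in the argmax
    scores = [norm.pdf(x, m, s) * p for m, s, p in zip(means, sds, priors)]
    return int(np.argmax(scores))

print(classify(170.0))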

Page 12: 580.691  Learning Theory Reza Shadmehr

Classification when distributions have equal variance

• Suppose we wish to classify a person as male or female based on height.

What we have:

$$p(x \mid c = 0) = N\!\left(\mu_1, \sigma^2\right), \qquad p(x \mid c = 1) = N\!\left(\mu_2, \sigma^2\right), \qquad P(c = 1) = q$$

What we want:

$$\hat{c}(x) = 1 \text{ if } P(c = 1 \mid x) \ge 0.5; \quad \hat{c}(x) = 0 \text{ otherwise}$$

[Figure: the class-conditional densities $p(x \mid c = 0)$ (female) and $p(x \mid c = 1)$ (male) as functions of height $x$ (cm).]

Note that the two densities have equal variance

Assume equal probability of being male or female: $P(c = 1) = 0.5$.

[Figures: the scaled densities $p(x \mid c = 0)\,P(c = 0)$ and $p(x \mid c = 1)\,P(c = 1)$, and the marginal density $p(x)$, as functions of height $x$ (cm).]

$$p(x) = \sum_{i = 0}^{1} p(x \mid c = i)\, P(c = i)$$
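A minimal sketch of the posterior for this example; the class means and shared standard deviation below are assumed (made-up) values:

from scipy.stats import norm

mu_f, mu_m, sigma, q = 163.0, 177.0, 7.0, 0.5     # assumed parameters, P(c=1) = q

def posterior_male(x):
    joint0 = norm.pdf(x, mu_f, sigma) * (1 - q)   # p(x | c=0) P(c=0)
    joint1 = norm.pdf(x, mu_m, sigma) * q         # p(x | c=1) P(c=1)
    return joint1 / (joint0 + joint1)             # P(c=1 | x) = joint1 / p(x)

print(posterior_male(175.0))                      # classify as male if this is >= 0.5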

Page 13: 580.691  Learning Theory Reza Shadmehr

Classification when distributions have equal variance

[Figures: the scaled densities $p(x \mid c = 0)\,P(c = 0)$ and $p(x \mid c = 1)\,P(c = 1)$, and the posteriors $P(c = 0 \mid x)$ and $P(c = 1 \mid x)$, as functions of height $x$ (cm).]

Decision boundary: the $x$ where $P(c = 0 \mid x) = P(c = 1 \mid x)$; equivalently, the $x$ where $p(x \mid c = 0)\, P(c = 0) = p(x \mid c = 1)\, P(c = 1)$.

The posterior is

$$P(c = 0 \mid x) = \frac{p(x \mid c = 0)\, P(c = 0)}{p(x)}$$

To classify, we really don't need to compute the posterior probability. All we need is the ratio:

$$\frac{p(x \mid c = 0)\, P(c = 0)}{p(x \mid c = 1)\, P(c = 1)}$$

If this ratio is greater than 1, then we choose class 0, otherwise class 1. The boundaries between classes occur where the ratio is 1; in other words, the boundary occurs where the log of the ratio is 0.
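A minimal sketch of classifying by the log of this ratio, reusing the assumed parameters from the previous sketch:

import numpy as np
from scipy.stats import norm

mu_f, mu_m, sigma, q = 163.0, 177.0, 7.0, 0.5     # assumed parameters

def log_ratio(x):
    # log of [p(x|c=0) P(c=0)] / [p(x|c=1) P(c=1)]; the boundary is where this equals 0
    return (norm.logpdf(x, mu_f, sigma) + np.log(1 - q)
            - norm.logpdf(x, mu_m, sigma) - np.log(q))

label = 0 if log_ratio(175.0) > 0 else 1
print(label)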

Page 14: 580.691  Learning Theory Reza Shadmehr

Uncertainty of the classifier

Starting with our likelihood and prior,

$$p(x \mid c = 0), \qquad p(x \mid c = 1), \qquad P(c = 1) = q,$$

we compute a posterior probability distribution as a function of $x$:

$$p(c \mid x): \qquad P(c = 1 \mid x), \qquad P(c = 0 \mid x) = 1 - P(c = 1 \mid x)$$

This is a Bernoulli distribution (a binomial with a single trial). We can compute the variance of this distribution:

$$E[c \mid x] = 1 \cdot P(c = 1 \mid x) + 0 \cdot P(c = 0 \mid x) = P(c = 1 \mid x)$$

$$E[c^2 \mid x] = 1^2 \cdot P(c = 1 \mid x) + 0^2 \cdot P(c = 0 \mid x) = P(c = 1 \mid x)$$

$$\operatorname{var}(c \mid x) = E[c^2 \mid x] - E[c \mid x]^2 = P(c = 1 \mid x)\left(1 - P(c = 1 \mid x)\right)$$

[Figures: the posteriors $P(c = 0 \mid x)$ and $P(c = 1 \mid x)$, and the posterior variance $\operatorname{var}(c \mid x)$, as functions of height $x$ (cm).]

Classification is most uncertain at the decision boundary
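A minimal sketch of this variance on a grid of heights, reusing the assumed parameters from the earlier sketches:

import numpy as np
from scipy.stats import norm

mu_f, mu_m, sigma, q = 163.0, 177.0, 7.0, 0.5     # assumed parameters
xs = np.linspace(140.0, 200.0, 601)
joint0 = norm.pdf(xs, mu_f, sigma) * (1 - q)
joint1 = norm.pdf(xs, mu_m, sigma) * q
p1 = joint1 / (joint0 + joint1)                   # P(c=1 | x)
var_c = p1 * (1 - p1)                             # var(c | x), at most 0.25
print("most uncertain at x =", xs[np.argmax(var_c)])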

Page 15: 580.691  Learning Theory Reza Shadmehr

Classification when distributions have unequal variance

What we have:

$$p(x \mid c = 0) = N\!\left(\mu_1, \sigma_1^2\right), \qquad p(x \mid c = 1) = N\!\left(\mu_2, \sigma_2^2\right), \qquad P(c = 1) = q$$

Classification:

$$\hat{c}(x) = 1 \text{ if } P(c = 1 \mid x) \ge 0.5; \quad \hat{c}(x) = 0 \text{ otherwise}$$

Assume: $P(c = 1) = 0.5$.

[Figures: the scaled densities $p(x \mid c = 0)\,P(c = 0)$ and $p(x \mid c = 1)\,P(c = 1)$, and the marginal density $p(x)$, as functions of height $x$ (cm).]

$$p(x) = \sum_{i = 0}^{1} p(x \mid c = i)\, P(c = i)$$

[Figures: the posteriors $P(c = 0 \mid x)$ and $P(c = 1 \mid x)$, and the posterior variance $\operatorname{var}(c \mid x)$, as functions of height $x$ (cm).]

Page 16: 580.691  Learning Theory Reza Shadmehr

Bayes error rate: Probability of misclassification

[Figure: the scaled densities $p(x \mid c = 0)\,P(c = 0)$ and $p(x \mid c = 1)\,P(c = 1)$, the decision boundary $x^*$, and the decision regions $R_0$ and $R_1$.]

$$P(\text{error}) = P(x \in R_0 \mid c = 1)\, P(c = 1) + P(x \in R_1 \mid c = 0)\, P(c = 0) = \int_{R_0} p(x \mid c = 1)\, P(c = 1)\, dx + \int_{R_1} p(x \mid c = 0)\, P(c = 0)\, dx$$

In general, it is actually quite hard to compute P(error) because we will need to integrate the posterior probabilities over decision regions that may be discontinuous (for example, when the distributions have unequal variances). To help with this, there is the Chernoff bound.

Probability of the data belonging to $c_1$ but being classified as $c_0$:

$$\int_{R_0} p(x \mid c = 1)\, P(c = 1)\, dx$$

Probability of the data belonging to $c_0$ but being classified as $c_1$:

$$\int_{R_1} p(x \mid c = 0)\, P(c = 0)\, dx$$
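A minimal sketch of this computation for the equal-variance height example with assumed parameters; with equal priors and equal variances the decision boundary sits midway between the two means:

from scipy.stats import norm

mu_f, mu_m, sigma, q = 163.0, 177.0, 7.0, 0.5     # assumed parameters
x_star = (mu_f + mu_m) / 2.0                      # decision boundary x*

p_error = ((1 - q) * norm.sf(x_star, mu_f, sigma)     # class 0 data falling in R1
           + q * norm.cdf(x_star, mu_m, sigma))       # class 1 data falling in R0
print("P(error):", p_error)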

Page 17: 580.691  Learning Theory Reza Shadmehr

Bayes error rate: Chernoff bound

In the two class classification problem, we note that the classification error depends on the area under the minimum of the two posterior probabilities.

$$P(\text{error} \mid x) = \min\left[P(c = 0 \mid x),\ P(c = 1 \mid x)\right]$$

$$P(\text{error}, x) = P(\text{error} \mid x)\, p(x) = \min\left[p(x \mid c = 0)\, P(c = 0),\ p(x \mid c = 1)\, P(c = 1)\right]$$

$$P(\text{error}) = \int P(\text{error}, x)\, dx$$

[Figure: the posteriors $P(c = 0 \mid x)$ and $P(c = 1 \mid x)$ as functions of $x$; $P(\text{error} \mid x)$ is the minimum of the two curves at each $x$.]

Page 18: 580.691  Learning Theory Reza Shadmehr

Bayes error rate: Chernoff bound

Recall that

$$P(\text{error}) = \int \min\left[p(x \mid c = 0)\, P(c = 0),\ p(x \mid c = 1)\, P(c = 1)\right] dx$$

To bound the minimum inside this integral, we will need the following inequality:

$$\min[a, b] \le a^{\beta}\, b^{1 - \beta}, \qquad a, b \ge 0 \text{ and } 0 \le \beta \le 1$$

To help figure out this inequality, we note that:

$$a^{\beta}\, b^{1 - \beta} = \left(\frac{a}{b}\right)^{\beta} b$$

Without loss of generality, suppose that $b$ is smaller than $a$. Then $a/b > 1$, and we have:

$$\left(\frac{a}{b}\right)^{\beta} b \ge b = \min[a, b]$$

So we can think of the term $a^{\beta} b^{1-\beta}$ (for all values of $\beta$) as an upper bound on $\min[a, b]$. Returning to our $P(\text{error})$ problem, we can replace the $\min[\cdot]$ function with our inequality:

$$P(\text{error}) \le P(c = 0)^{\beta}\, P(c = 1)^{1 - \beta} \int p(x \mid c = 0)^{\beta}\, p(x \mid c = 1)^{1 - \beta}\, dx$$

The bound is found by numerically finding the value of β that minimizes the above expression. The key benefit here is that our search is in the one-dimensional space of β, and we have also gotten rid of the discontinuous decision regions.

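A minimal sketch of evaluating the Chernoff bound numerically for a 1-D Gaussian example with assumed (made-up) parameters, minimizing over β with a bounded scalar search:

from scipy.stats import norm
from scipy.integrate import quad
from scipy.optimize import minimize_scalar

mu0, mu1, s0, s1, q = 163.0, 177.0, 7.0, 9.0, 0.5     # assumed parameters

def chernoff_bound(beta):
    integrand = lambda x: norm.pdf(x, mu0, s0) ** beta * norm.pdf(x, mu1, s1) ** (1 - beta)
    integral, _ = quad(integrand, 100.0, 260.0)        # integrate over a wide range of x
    return (1 - q) ** beta * q ** (1 - beta) * integral

res = minimize_scalar(chernoff_bound, bounds=(0.0, 1.0), method="bounded")
print("beta:", res.x, "  bound on P(error):", res.fun)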