The Perceptron Mistake Bound - svivek


Machine Learning

The Perceptron Mistake Bound

Some slides based on lectures from Dan Roth, Avrim Blum and others

Where are we?

• The Perceptron Algorithm

• Variants of Perceptron

• Perceptron Mistake Bound

Convergence

Convergence theorem – If there exists a set of weights that is consistent with the data (i.e., the data is linearly separable), the perceptron algorithm will converge.

Cycling theorem – If the training data is not linearly separable, then the learning algorithm will eventually repeat the same set of weights and enter an infinite loop.

Perceptron Learnability

• Obviously, the Perceptron cannot learn what it cannot represent
  – Only linearly separable functions

• Minsky and Papert (1969) wrote an influential book demonstrating the Perceptron's representational limitations
  – Parity functions can't be learned (XOR)
    • We have already seen that XOR is not linearly separable (see the short argument below)
  – In vision, if patterns are represented with local features, it can't represent symmetry or connectivity
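As a quick reminder of why XOR is not linearly separable (a standard argument, not reproduced from the slides): label $(0,0)$ and $(1,1)$ negative and $(0,1)$, $(1,0)$ positive, and suppose a separating function $w_1 x_1 + w_2 x_2 + b$ existed. Then

$$b \le 0, \qquad w_1 + b > 0, \qquad w_2 + b > 0, \qquad w_1 + w_2 + b \le 0.$$

Adding the two middle inequalities gives $w_1 + w_2 + 2b > 0$, i.e. $w_1 + w_2 + b > -b \ge 0$, contradicting the last inequality.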

Margin

The margin of a hyperplane for a dataset is the distance between the hyperplane and the data point nearest to it.

[Figure: positive (+) and negative (–) points on either side of a separating hyperplane; the margin with respect to this hyperplane is the distance from the hyperplane to the nearest point.]


The margin of a dataset ($\gamma$) is the maximum margin possible for that dataset using any weight vector.

[Figure: the same + and – points with the maximum-margin separating hyperplane; the marked distance is the margin of the data.]
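In symbols (a formalization of the two definitions above; the notation $\gamma(\mathbf{w}, D)$ is mine): for a dataset $D = \{(\mathbf{x}_i, y_i)\}$ and a weight vector $\mathbf{w}$ that separates it, the margin of the hyperplane $\mathbf{w}^T\mathbf{x} = 0$ and the margin of the dataset are

$$\gamma(\mathbf{w}, D) = \min_i \frac{y_i\,\mathbf{w}^T \mathbf{x}_i}{\|\mathbf{w}\|}, \qquad \gamma(D) = \max_{\mathbf{w} \ne \mathbf{0}} \; \min_i \frac{y_i\,\mathbf{w}^T \mathbf{x}_i}{\|\mathbf{w}\|}.$$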

Mistake Bound Theorem [Novikoff 1962, Block 1962]

Let $(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \cdots$ be a sequence of training examples such that every feature vector $\mathbf{x}_i \in \Re^n$ with $\|\mathbf{x}_i\| \le R$ and the label $y_i \in \{-1, 1\}$.

Suppose there is a unit vector $\mathbf{u} \in \Re^n$ (i.e., $\|\mathbf{u}\| = 1$) such that for some positive number $\gamma \in \Re$, $\gamma > 0$, we have $y_i \mathbf{u}^T \mathbf{x}_i \ge \gamma$ for every example $(\mathbf{x}_i, y_i)$.

Then, the perceptron algorithm will make no more than $R^2/\gamma^2$ mistakes on the training sequence.

Notes:

– We can always find such an $R$: just look for the farthest data point from the origin.

– The data has a margin $\gamma$. Importantly, the data is separable; $\gamma$ is the complexity parameter that defines the separability of the data.

– If $\mathbf{u}$ had not been a unit vector, then we could scale $\gamma$ in the mistake bound. This would change the final mistake bound to $(\|\mathbf{u}\| R / \gamma)^2$.

– In other words: suppose we have a binary classification dataset with $n$-dimensional inputs. If the data is separable, then the Perceptron algorithm will find a separating hyperplane after making a finite number of mistakes.
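As a quick sanity check on the bound (the numbers are mine, purely illustrative): if every example satisfies $\|\mathbf{x}_i\| \le 1$ (so $R = 1$) and the data is separable with margin $\gamma = 0.1$, then the perceptron makes at most $R^2/\gamma^2 = 1/0.01 = 100$ mistakes, no matter how many examples are in the sequence and no matter the dimensionality.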

Proof (preliminaries)

The setting
• Initial weight vector $\mathbf{w}_0$ is all zeros
• Learning rate = 1
  – Effectively scales the inputs, but does not change the behavior
• All training examples are contained in a ball of size $R$
  – That is, for every example $(\mathbf{x}_i, y_i)$, we have $\|\mathbf{x}_i\| \le R$
• The training data is separable by margin $\gamma$ using a unit vector $\mathbf{u}$
  – That is, for every example $(\mathbf{x}_i, y_i)$, we have $y_i \mathbf{u}^T \mathbf{x}_i \ge \gamma$

The algorithm, on each example:
• Receive an input $(\mathbf{x}_i, y_i)$
• If $\mathrm{sgn}(\mathbf{w}_t^T \mathbf{x}_i) \ne y_i$: update $\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + y_i \mathbf{x}_i$
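For concreteness, here is a minimal sketch of this mistake-driven loop in Python (not from the slides; the function name, the epoch loop, and the mistake counter are mine, added for illustration):

import numpy as np

def perceptron(examples, n_epochs=10):
    # examples: list of (x, y) pairs with x a numpy vector and y in {-1, +1}
    dim = len(examples[0][0])
    w = np.zeros(dim)                  # initial weight vector is all zeros
    mistakes = 0
    for _ in range(n_epochs):
        for x, y in examples:
            if np.sign(w @ x) != y:    # prediction disagrees with the label
                w = w + y * x          # additive, mistake-driven update
                mistakes += 1
    return w, mistakes

On separable data the mistake count returned here never exceeds $R^2/\gamma^2$, regardless of how the examples are ordered; for instance, perceptron([(np.array([1.0, 1.0]), +1), (np.array([-1.0, -1.0]), -1)]) makes exactly one mistake.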

Proof (1/3)

1. Claim: After $t$ mistakes, $\mathbf{u}^T \mathbf{w}_t \ge t\gamma$.

Each mistake increases $\mathbf{u}^T \mathbf{w}$ by at least $\gamma$, because the data is separable by a margin $\gamma$ (the worked step is shown below).

Because $\mathbf{w}_0 = \mathbf{0}$ (that is, $\mathbf{u}^T \mathbf{w}_0 = 0$), straightforward induction gives us $\mathbf{u}^T \mathbf{w}_t \ge t\gamma$.
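In detail (my notation, consistent with the setting above): if the $(t{+}1)$-th mistake is made on example $(\mathbf{x}_i, y_i)$, then

$$\mathbf{u}^T \mathbf{w}_{t+1} = \mathbf{u}^T(\mathbf{w}_t + y_i \mathbf{x}_i) = \mathbf{u}^T \mathbf{w}_t + y_i\, \mathbf{u}^T \mathbf{x}_i \;\ge\; \mathbf{u}^T \mathbf{w}_t + \gamma,$$

so starting from $\mathbf{u}^T \mathbf{w}_0 = 0$, after $t$ mistakes we have $\mathbf{u}^T \mathbf{w}_t \ge t\gamma$.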

Proof (2/3)

2. Claim: After $t$ mistakes, $\|\mathbf{w}_t\|^2 \le tR^2$.

The weight is updated only when there is a mistake, that is, when $y_i \mathbf{w}_t^T \mathbf{x}_i \le 0$. On each such update, $\|\mathbf{w}\|^2$ grows by at most $\|\mathbf{x}_i\|^2 \le R^2$, by the definition of $R$ (the worked step is shown below).

Because $\mathbf{w}_0 = \mathbf{0}$ (that is, $\|\mathbf{w}_0\|^2 = 0$), straightforward induction gives us $\|\mathbf{w}_t\|^2 \le tR^2$.
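Again in detail: if the $(t{+}1)$-th mistake is made on $(\mathbf{x}_i, y_i)$, so that $y_i \mathbf{w}_t^T \mathbf{x}_i \le 0$, then

$$\|\mathbf{w}_{t+1}\|^2 = \|\mathbf{w}_t + y_i \mathbf{x}_i\|^2 = \|\mathbf{w}_t\|^2 + 2 y_i\, \mathbf{w}_t^T \mathbf{x}_i + y_i^2 \|\mathbf{x}_i\|^2 \;\le\; \|\mathbf{w}_t\|^2 + R^2,$$

since the cross term is non-positive on a mistake, $y_i^2 = 1$, and $\|\mathbf{x}_i\| \le R$. Starting from $\|\mathbf{w}_0\|^2 = 0$, after $t$ mistakes $\|\mathbf{w}_t\|^2 \le tR^2$.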

Proof (3/3)

What we know:
1. After $t$ mistakes, $\mathbf{u}^T \mathbf{w}_t \ge t\gamma$.
2. After $t$ mistakes, $\|\mathbf{w}_t\|^2 \le tR^2$.

Also, $\mathbf{u}^T \mathbf{w}_t = \|\mathbf{u}\|\,\|\mathbf{w}_t\| \cos(\text{angle between them})$. But $\|\mathbf{u}\| = 1$ and the cosine is at most 1, so $\mathbf{u}^T \mathbf{w}_t \le \|\mathbf{w}_t\|$ (the Cauchy-Schwarz inequality).

Combining this with (1) and (2) bounds the total number of mistakes, as the chain of inequalities below shows.
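Written out as one display (my arrangement of the slides' chain):

$$t\gamma \;\overset{(1)}{\le}\; \mathbf{u}^T \mathbf{w}_t \;\le\; \|\mathbf{w}_t\| \;\overset{(2)}{\le}\; \sqrt{t}\,R \quad\Longrightarrow\quad \sqrt{t} \le \frac{R}{\gamma} \quad\Longrightarrow\quad t \le \frac{R^2}{\gamma^2}.$$

That is, the number of mistakes $t$ can never exceed $R^2/\gamma^2$.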

Mistake Bound Theorem [Novikoff 1962, Block 1962], restated: under the assumptions above ($\|\mathbf{x}_i\| \le R$ for every example, and a unit vector $\mathbf{u}$ with $y_i \mathbf{u}^T \mathbf{x}_i \ge \gamma > 0$), the perceptron algorithm will make no more than $R^2/\gamma^2$ mistakes on the training sequence.

The Perceptron Mistake Bound

• $R$ is a property of the dimensionality. How?
  – For Boolean functions with $n$ attributes, show that $R^2 = n$.

• $\gamma$ is a property of the data.

• Exercises:
  – How many mistakes will the Perceptron algorithm make for disjunctions with $n$ attributes?
    • What are $R$ and $\gamma$?
  – How many mistakes will the Perceptron algorithm make for $k$-disjunctions with $n$ attributes?
  – Find a sequence of examples that will force the Perceptron algorithm to make $O(n)$ mistakes for a concept that is a $k$-disjunction.


Beyond the separable case

• Good news
  – Perceptron makes no assumption about the data distribution; it could even be adversarial
  – After a fixed number of mistakes, you are done. You don't even need to see any more data

• Bad news: the real world is not linearly separable
  – You can't expect to never make mistakes again
  – What can we do: use more features, try to be linearly separable if you can, use averaging

What you need to know

• What is the perceptron mistake bound?

• How to prove it

Summary: Perceptron

• Online learning algorithm, very widely used, easy to implement

• Additive updates to weights

• Geometric interpretation

• Mistake bound

• Practical variants abound

• You should be able to implement the Perceptron algorithm and its variants, and also prove the mistake bound theorem