the perceptron mistake bound - svivek
TRANSCRIPT
Machine Learning
The Perceptron Mistake Bound
Some slides based on lectures from Dan Roth, Avrim Blum and others
Where are we?
• The Perceptron Algorithm
• Variants of Perceptron
• Perceptron Mistake Bound
Convergence

Convergence theorem: If there exists a set of weights consistent with the data (i.e., the data is linearly separable), the perceptron algorithm will converge.

Cycling theorem: If the training data is not linearly separable, then the learning algorithm will eventually repeat the same set of weights and enter an infinite loop.
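To see both theorems in action, here is a minimal sketch in Python (the function name, toy data, and epoch count are mine, not from the lecture). On XOR, the weight vector returns to the same value after every pass, illustrating the cycling theorem:

import numpy as np

def weights_per_epoch(X, y, epochs):
    """Run the perceptron update rule; yield the weight vector after each pass."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if np.sign(w @ xi) != yi:    # mistake: prediction disagrees with label
                w = w + yi * xi          # perceptron update
        yield tuple(w)

# XOR with a constant bias feature appended: not linearly separable.
X = np.array([[0., 0., 1.], [0., 1., 1.], [1., 0., 1.], [1., 1., 1.]])
y = np.array([-1, 1, 1, -1])

print(list(weights_per_epoch(X, y, epochs=5)))
# The same weight vector recurs after every pass: the cycling theorem in action.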
Perceptron Learnability
• Obviously, Perceptron cannot learn what it cannot represent: only linearly separable functions.
• Minsky and Papert (1969) wrote an influential book demonstrating Perceptron's representational limitations.
– Parity functions can't be learned (XOR). We have already seen that XOR is not linearly separable.
– In vision, if patterns are represented with local features, Perceptron can't represent symmetry or connectivity.
Margin
The margin of a hyperplane for a dataset is the distance between the hyperplane and the data point nearest to it.
[Figure: positive (+) and negative (−) points on either side of a hyperplane; the margin with respect to this hyperplane is the distance to the nearest point.]
The margin of a dataset (γ) is the maximum margin possible for that dataset using any weight vector.
[Figure: the same points with the maximum-margin separator; the margin of the data is marked.]
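To make the two definitions concrete, a small sketch (the toy data and names are mine, not from the slides): the margin of a hyperplane w is min_i y_i(wᵀx_i)/‖w‖, and the dataset margin γ is the best such value over all weight vectors:

import numpy as np

def margin_of_hyperplane(w, X, y):
    """Distance from the hyperplane w.x = 0 to the nearest (correctly signed) point."""
    return np.min(y * (X @ w)) / np.linalg.norm(w)

# Toy separable data: positives above the line x2 = x1, negatives below.
X = np.array([[1.0, 2.0], [2.0, 3.5], [2.0, 1.0], [3.0, 1.5]])
y = np.array([1, 1, -1, -1])

print(margin_of_hyperplane(np.array([-1.0, 1.0]), X, y))  # margin w.r.t. one hyperplane

# The margin of the dataset (gamma) maximizes this over all weight vectors;
# a crude estimate is a search over random directions.
rng = np.random.default_rng(0)
print(max(margin_of_hyperplane(c, X, y) for c in rng.normal(size=(10000, 2))))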
Mistake Bound Theorem [Novikoff 1962, Block 1962]

Let (x_1, y_1), (x_2, y_2), … be a sequence of training examples such that every feature vector x_i ∈ ℝⁿ satisfies ‖x_i‖ ≤ R, and every label y_i ∈ {−1, 1}.

Suppose there is a unit vector u ∈ ℝⁿ (i.e., ‖u‖ = 1) such that for some positive number γ ∈ ℝ, γ > 0, we have y_i uᵀx_i ≥ γ for every example (x_i, y_i).

Then, the perceptron algorithm will make no more than R²/γ² mistakes on the training sequence.

Remarks:
• We can always find such an R: just look for the farthest data point from the origin.
• The data has a margin γ; importantly, the data is separable. γ is the complexity parameter that defines the separability of the data.
• If u had not been a unit vector, we could scale γ in the mistake bound; this changes the final bound to (‖u‖R/γ)².
• In other words: given a binary classification dataset with n-dimensional inputs, if the data is separable, then the Perceptron algorithm will find a separating hyperplane after making a finite number of mistakes.
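A quick numerical sanity check of the theorem, as a sketch with synthetic data (all names and constants here are illustrative): generate separable data, compute R and γ, run the perceptron, and compare the mistake count to R²/γ²:

import numpy as np

rng = np.random.default_rng(1)
n = 5
u = rng.normal(size=n)
u /= np.linalg.norm(u)                   # the unit vector u from the theorem

X = rng.uniform(-1, 1, size=(500, n))
keep = np.abs(X @ u) >= 0.1              # discard points closer than 0.1 to the boundary
X = X[keep]
y = np.sign(X @ u)

R = np.max(np.linalg.norm(X, axis=1))    # radius of the smallest enclosing ball
gamma = np.min(y * (X @ u))              # margin achieved by u

w, mistakes = np.zeros(n), 0
for _ in range(200):                     # repeated passes; updates stop once separated
    for xi, yi in zip(X, y):
        if np.sign(w @ xi) != yi:
            w += yi * xi
            mistakes += 1

print(mistakes, "<=", (R / gamma) ** 2)  # mistakes never exceed R^2 / gamma^2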
Proof (preliminaries)

The setting:
• The initial weight vector w_0 is all zeros.
• The learning rate is 1. This effectively scales the inputs, but does not change the behavior.
• All training examples are contained in a ball of size R. That is, for every example (x_i, y_i), we have ‖x_i‖ ≤ R.
• The training data is separable by margin γ using a unit vector u. That is, for every example (x_i, y_i), we have y_i uᵀx_i ≥ γ.

The algorithm, at each step t:
• Receive an input (x_i, y_i).
• If sgn(w_tᵀx_i) ≠ y_i: update w_{t+1} ← w_t + y_i x_i.
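For reference, the algorithm being analyzed, as a short sketch (the function and argument names are mine):

import numpy as np

def perceptron(examples, dim):
    """Online perceptron: w_0 = 0, learning rate 1.

    `examples` is an iterable of (x, y) pairs with y in {-1, +1}.
    Returns the final weight vector and the number of mistakes made.
    """
    w, mistakes = np.zeros(dim), 0
    for x, y in examples:
        if np.sign(w @ x) != y:      # sgn(w_t . x_i) != y_i
            w = w + y * x            # w_{t+1} <- w_t + y_i x_i
            mistakes += 1
    return w, mistakes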
Proof (1/3)

1. Claim: After t mistakes, uᵀw_t ≥ tγ.

Each update on a mistake adds y_i x_i to the weights, and y_i uᵀx_i ≥ γ because the data is separable by a margin γ. Because w_0 = 0 (that is, uᵀw_0 = 0), straightforward induction gives us uᵀw_t ≥ tγ.
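Written out, the one-step calculation behind this induction (a standard expansion, consistent with the update rule above):

\[
\mathbf{u}^{\top}\mathbf{w}_{t+1}
  = \mathbf{u}^{\top}(\mathbf{w}_t + y_i \mathbf{x}_i)
  = \mathbf{u}^{\top}\mathbf{w}_t + y_i\,\mathbf{u}^{\top}\mathbf{x}_i
  \;\ge\; \mathbf{u}^{\top}\mathbf{w}_t + \gamma
\]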
Proof (2/3)

2. Claim: After t mistakes, ‖w_t‖² ≤ tR².

Each mistake increases the squared norm of the weights by at most R²: the weight vector is updated only when there is a mistake, that is, when y_i w_tᵀx_i < 0, and ‖x_i‖ ≤ R by the definition of R. Because w_0 = 0 (that is, ‖w_0‖² = 0), straightforward induction gives us ‖w_t‖² ≤ tR².
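The corresponding one-step expansion (standard, consistent with the update rule):

\[
\|\mathbf{w}_{t+1}\|^2
  = \|\mathbf{w}_t + y_i \mathbf{x}_i\|^2
  = \|\mathbf{w}_t\|^2 + 2\,y_i\,\mathbf{w}_t^{\top}\mathbf{x}_i + y_i^2\,\|\mathbf{x}_i\|^2
  \;\le\; \|\mathbf{w}_t\|^2 + R^2
\]
since $y_i \mathbf{w}_t^{\top}\mathbf{x}_i < 0$ on a mistake, $y_i^2 = 1$, and $\|\mathbf{x}_i\| \le R$.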
Proof (3/3)

What we know:
1. After t mistakes, uᵀw_t ≥ tγ
2. After t mistakes, ‖w_t‖² ≤ tR²

From (2), ‖w_t‖ ≤ √t · R.

Also, uᵀw_t = ‖u‖ ‖w_t‖ cos(angle between them). But ‖u‖ = 1 and the cosine is at most 1, so uᵀw_t ≤ ‖w_t‖ (the Cauchy–Schwarz inequality).

Combining these with (1): tγ ≤ uᵀw_t ≤ ‖w_t‖ ≤ √t · R.

So tγ ≤ √t · R, which gives t ≤ R²/γ² for the number of mistakes t. This bounds the total number of mistakes!
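A small empirical check of both claims and the final bound, as a sketch on synthetic data (names and constants are illustrative):

import numpy as np

rng = np.random.default_rng(2)
u = np.array([0.6, 0.8])                         # unit separator, ||u|| = 1
X = rng.uniform(-1, 1, size=(200, 2))
mask = np.abs(X @ u) > 0.2                       # enforce a margin
X = X[mask]
y = np.sign(X @ u)

R = np.max(np.linalg.norm(X, axis=1))
gamma = np.min(y * (X @ u))

w, t = np.zeros(2), 0
for _ in range(100):
    for xi, yi in zip(X, y):
        if np.sign(w @ xi) != yi:
            w += yi * xi
            t += 1
            assert u @ w >= t * gamma - 1e-9     # claim (1)
            assert w @ w <= t * R**2 + 1e-9      # claim (2)

print(t, "<=", (R / gamma) ** 2)                 # the mistake bound holds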
Mistake Bound Theorem [Novikoff 1962, Block 1962] (restated)

Let (x_1, y_1), (x_2, y_2), … be a sequence of training examples such that every feature vector x_i ∈ ℝⁿ satisfies ‖x_i‖ ≤ R, and every label y_i ∈ {−1, 1}. Suppose there is a unit vector u ∈ ℝⁿ (i.e., ‖u‖ = 1) such that for some positive number γ > 0, we have y_i uᵀx_i ≥ γ for every example (x_i, y_i). Then, the perceptron algorithm will make no more than R²/γ² mistakes on the training sequence.
The Perceptron Mistake Bound

• R is a property of the dimensionality. How? For Boolean functions with n attributes, show that R = √n (a worked step follows this list).
• γ is a property of the data.
• Exercises:
– How many mistakes will the Perceptron algorithm make for disjunctions with n attributes? What are R and γ?
– How many mistakes will the Perceptron algorithm make for k-disjunctions with n attributes?
– Find a sequence of examples that will force the Perceptron algorithm to make O(n) mistakes for a concept that is a k-disjunction.
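For the first bullet, a quick worked step, assuming Boolean attributes are encoded as ±1 (with a {0, 1} encoding the same quantity becomes an upper bound):

\[
\mathbf{x} \in \{-1, +1\}^n
  \;\Longrightarrow\;
\|\mathbf{x}\|^2 = \sum_{j=1}^{n} x_j^2 = n
  \;\Longrightarrow\;
R = \sqrt{n}
\]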
Beyond the separable case

• Good news
– Perceptron makes no assumption about the data distribution; the data could even be adversarial.
– After a fixed number of mistakes, you are done. You don't even need to see any more data.
• Bad news: the real world is not linearly separable.
– We can't expect to never make mistakes again.
– What can we do: add more features to try to make the data linearly separable if you can, or use averaging (see the sketch after this list).
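As a sketch of the averaging idea mentioned above (one of the perceptron variants from earlier in the course; names and structure are mine):

import numpy as np

def averaged_perceptron(examples, dim):
    """Perceptron that averages the weight vector over all steps.

    The averaged weights are typically more stable when the data
    is not linearly separable.
    """
    w, w_sum, steps = np.zeros(dim), np.zeros(dim), 0
    for x, y in examples:
        if np.sign(w @ x) != y:
            w = w + y * x            # the usual perceptron update
        w_sum += w                   # accumulate after every example
        steps += 1
    return w_sum / steps             # predict with sgn(w_avg . x)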
Summary: Perceptron
• Online learning algorithm; very widely used, easy to implement
• Additive updates to weights
• Geometric interpretation
• Mistake bound
• Practical variants abound
• You should be able to implement the Perceptron algorithm and its variants, and also prove the mistake bound theorem