
Page 1:

CS440/ECE448 Lecture 8: Perceptron

Mark Hasegawa-Johnson, 2/2021

License: CC-BY 4.0; redistribute at will, as long as you cite the source.

Page 2:

Outline

• A little history: the perceptron as a model of a biological neuron
• The perceptron learning algorithm
• Linear separability, $y\vec{x}$, and convergence
  • It converges to the right answer, even with $\eta = 1$, if the data are linearly separable
  • If the data are not linearly separable, it's necessary to use $\eta = 1/n$
• Multi-class perceptron

Page 3:

The Giant Squid Axon

• 1909: Williams describes the giant squid axon (1 mm thick).
• 1939: Young describes the synapse.
• 1952: Hodgkin & Huxley publish an electrical current model for the generation of binary action potentials from real-valued inputs.

Image released to the public domain by lkkisan, 2007. Modified from Llinás, Rodolfo R. (1999). The Squid Giant Synapse.

Page 4:

Perceptron

• 1959: Rosenblatt is granted a patent for the "perceptron," an electrical circuit model of a neuron.

Page 5:

Perceptron

[Figure: perceptron diagram. Inputs $1, x_1, x_2, \dots, x_D$ are multiplied by weights $b, w_1, w_2, \dots, w_D$ and summed; the output is $\mathrm{sgn}(w^T x)$.]

The bias can be incorporated as a component of the weight vector by always including a feature whose value is set to 1.

Perceptron model: action potential = signum(affine function of the features):

$$\hat{y} = \mathrm{sgn}(w_1 x_1 + \dots + w_D x_D + b) = \mathrm{sgn}(w^T x)$$

where $w = [w_1, \dots, w_D, b]^T$, $x = [x_1, \dots, x_D, 1]^T$, and

$$\mathrm{sgn}(x) = \begin{cases} 1 & x \ge 0 \\ -1 & x < 0 \end{cases}$$

Page 6:

Outline

• A little history: the perceptron as a model of a biological neuron
• The perceptron learning algorithm
• Linear separability, $y\vec{x}$, and convergence
  • It converges to the right answer, even with $\eta = 1$, if the data are linearly separable
  • If the data are not linearly separable, it's necessary to use $\eta = 1/n$
• Multi-class perceptron

Page 7:

Perceptron

Rosenblatt's big innovation: the perceptron learns from examples.
• Initialize the weights randomly.
• Cycle through the training examples in multiple passes (epochs).
• For each training example:
  • If it is classified correctly, do nothing.
  • If it is classified incorrectly, update the weights.

Image: By Elizabeth Goodspeed - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=40188333

Page 8:

Perceptron

For each training instance $x$ with ground-truth label $y \in \{-1, 1\}$:
• Classify with the current weights: $\hat{y} = \mathrm{sgn}(w^T x)$.
• Update the weights:
  • If $y = \hat{y}$, do nothing.
  • If $y \ne \hat{y}$, then $w = w + \eta y x$.
• $\eta$ (eta) is a "learning rate." For now, let's assume $\eta = 1$. (A code sketch of this loop follows below.)
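A minimal sketch of this training loop in NumPy, assuming labels in {−1, +1} and a trailing constant-1 feature (illustrative names, not the lecture's code):

```python
import numpy as np

def perceptron_train(X, y, epochs=10, eta=1.0):
    """X: (n, D+1) array whose last column is the constant 1 feature.
    y: labels in {-1, +1}. Returns the learned weight vector."""
    w = np.random.randn(X.shape[1])        # initialize weights randomly
    for _ in range(epochs):                # multiple passes (epochs)
        for x_i, y_i in zip(X, y):
            y_hat = 1 if np.dot(w, x_i) >= 0 else -1
            if y_hat != y_i:               # misclassified: update
                w = w + eta * y_i * x_i    # w = w + eta * y * x
    return w
```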

Page 9:

Perceptron example: dogs versus cats

Can you write a program that can tell which ones are dogs, and which ones are cats?

$x_1$ = number of times the animal comes when called (out of 40).
$x_2$ = weight of the animal, in pounds.
$x = [x_1, x_2, 1]^T$.
$y = 1$ means "dog"; $y = -1$ means "cat".

$$\hat{y} = \mathrm{sgn}(w^T x)$$

Page 10:

Perceptron training example: dogs vs. cats

• Let's start with the rule "if it comes when called (by at least 20 different people out of 40), it's a dog."
• Write that as an equation: $\hat{y} = \mathrm{sgn}(x_1 - 20)$.
• Write that as a vector equation: $\hat{y} = \mathrm{sgn}(w^T x)$, where $w^T = [1, 0, -20]$.

[Figure: scatter plot in the $(x_1, x_2)$ plane; the boundary $w^T = [1, 0, -20]$ separates the region where $\mathrm{sgn}(w^T x) = 1$ from the region where $\mathrm{sgn}(w^T x) = -1$.]

Page 11:

Perceptron training example: dogs vs. cats

• The Presa Canario gets misclassified as a cat ($y = 1$, but $\hat{y} = -1$) because it obeys only its trainer, and nobody else ($x_1 = 1$).
• Though it rarely comes when called, it is very large ($x_2 = 100$ pounds).
• $\vec{x}^T = [x_1, x_2, 1] = [1, 100, 1]$.

[Figure: the same scatter plot; the Presa Canario sits on the $\mathrm{sgn}(w^T x) = -1$ side of the boundary $w^T = [1, 0, -20]$.]

Page 12:

Perceptron training example: dogs vs. cats

• The Presa Canario gets misclassified. $x^T = [x_1, x_2, 1] = [1, 100, 1]$.
• Perceptron learning rule: update the weights as

$$w = w + yx = \begin{bmatrix} 1 \\ 0 \\ -20 \end{bmatrix} + 1 \times \begin{bmatrix} 1 \\ 100 \\ 1 \end{bmatrix} = \begin{bmatrix} 2 \\ 100 \\ -19 \end{bmatrix}$$

[Figure: the boundary moves. Old $w = [1, 0, -20]^T$; new $w = [2, 100, -19]^T$.]

Page 13:

Perceptron training example: dogs vs. cats

• The Maltese is small ($x_2 = 10$ pounds) and very tame ($x_1 = 40$): $x = [40, 10, 1]^T$.
• It's correctly classified! $\hat{y} = \mathrm{sgn}(w^T x) = \mathrm{sgn}(2 \times 40 + 100 \times 10 - 19) = +1$,
• so $w$ is unchanged.

[Figure: the Maltese falls on the $\mathrm{sgn}(w^T x) = 1$ side of the boundary $w^T = [2, 100, -19]$.]

Page 14:

Perceptron training example: dogs vs. cats

• The Maine Coon cat is big ($x_2 = 20$ pounds: $\vec{x} = [0, 20, 1]^T$), so it gets misclassified as a dog (the true label is $y = -1$, "cat," but the classifier thinks $\hat{y} = 1$, "dog").

[Figure: the Maine Coon falls on the $\mathrm{sgn}(w^T x) = 1$ side of the boundary $w^T = [2, 100, -19]$.]

Page 15:

Perceptron training example: dogs vs. cats

• The Maine Coon cat is big ($x_2 = 20$ pounds: $\vec{x} = [0, 20, 1]^T$), so it gets misclassified, so we update $w$:

$$w = w + yx = \begin{bmatrix} 2 \\ 100 \\ -19 \end{bmatrix} + (-1) \times \begin{bmatrix} 0 \\ 20 \\ 1 \end{bmatrix} = \begin{bmatrix} 2 \\ 80 \\ -20 \end{bmatrix}$$

[Figure: the boundary moves again; new $w^T = [2, 80, -20]$.]
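The whole dogs-vs-cats trace above can be replayed numerically. This sketch (illustrative code, using the feature values from the slides) reproduces each weight vector:

```python
import numpy as np

def update(w, x, y):
    # One perceptron step with eta = 1.
    y_hat = 1 if np.dot(w, x) >= 0 else -1
    return w + y * x if y_hat != y else w

w = np.array([1.0, 0.0, -20.0])           # initial rule: sgn(x1 - 20)
examples = [
    (np.array([1.0, 100.0, 1.0]),  1),    # Presa Canario (dog)
    (np.array([40.0, 10.0, 1.0]),  1),    # Maltese (dog)
    (np.array([0.0, 20.0, 1.0]),  -1),    # Maine Coon (cat)
]
for x, y in examples:
    w = update(w, x, y)
    print(w)
# [2. 100. -19.], then unchanged for the Maltese, then [2. 80. -20.]
```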

Page 16:

Outline

• A little history: the perceptron as a model of a biological neuron
• The perceptron learning algorithm
• Linear separability, $y\vec{x}$, and convergence
  • It converges to the right answer, even with $\eta = 1$, if the data are linearly separable
  • If the data are not linearly separable, it's necessary to use $\eta = 1/n$
• Multi-class perceptron

Page 17:

Does the perceptron find an answer? Does it find the correct answer?

Suppose we run the perceptron algorithm for a very long time. Will it converge to an answer? Will it converge to the correct answer?

Page 18:

Does the perceptron find an answer? Does it find the correct answer?

Suppose we run the perceptron algorithm for a very long time. Will it converge to an answer? Will it converge to the correct answer? That depends on whether or not the data are linearly separable.

Page 19:

Does the perceptron find an answer? Does it find the correct answer?

Suppose we run the perceptron algorithm for a very long time. Will it converge to an answer? Will it converge to the correct answer? That depends on whether or not the data are linearly separable. "Linearly separable" means it's possible to find a line (a hyperplane, in $D$-dimensional space) that separates the two classes, like this:

[Figure: linearly separable data in the $(x_1, x_2)$ plane.]

Page 20:

Does the perceptron find an answer? Does it find the correct answer?

Suppose that, instead of plotting $x$, we plot $y\vec{x}$:
• If $y = 1$, plot $x$.
• If $y = -1$, plot $-x$.

[Figure: the same data, now plotted as $y\vec{x}$ in the $(x_1, x_2)$ plane.]

Page 21:

Does the perceptron find an answer? Does it find the correct answer?

Notice that the original data ($x$) are linearly separable if and only if the signed data ($y\vec{x}$) are all in the same half-plane.

[Figure: the signed data in the $(yx_1, yx_2)$ plane, all lying in one half-plane.]

Page 22:

Does the perceptron find an answer? Does it find the correct answer?

Notice that the original data ($x$) are linearly separable if and only if the signed data ($y\vec{x}$) are all in the same half-plane. That means that there is some vector $w$ such that $\mathrm{sgn}(w^T(y\vec{x})) = 1$ for all of the data.

[Figure: the signed data with a vector $w$ pointing into the half-plane that contains them.]

Page 23:

Does the perceptron find an answer? Does it find the correct answer?

Suppose we start out with the wrong $w$, so that one token is misclassified.

[Figure: the signed data with a vector $w$ whose half-plane excludes one token.]

Page 24:

Does the perceptron find an answer? Does it find the correct answer?

Suppose we start out with the wrong $w$, so that one token is misclassified. Then we update $w$ as

$$w = w + y\vec{x}$$

[Figure: the misclassified token $y\vec{x}$ is added to the old $w$, producing the new $w$.]

Page 25:

Does the perceptron find an answer? Does it find the correct answer?

Suppose we start out with the wrong $w$, so that one token is misclassified. Then we update $w$ as

$$w = w + y\vec{x}$$

…and the boundary moves so that the misclassified token is on the right side.

[Figure: the new $w$, with all of the signed data now in its half-plane.]

Page 26:

What if the data are not linearly separable?

…well, in that case, the perceptron algorithm with $\eta = 1$ never converges. The only solution is to use a learning rate, $\eta$, that gradually decays over time, so that the update $\eta y\vec{x}$ also gradually decays toward zero.

[Figure: signed data that do not all fit in any one half-plane.]

Page 27:

What about non-separable data?

• If the data are NOT linearly separable, then the perceptron with $\eta = 1$ doesn't converge.
• In fact, that's what $\eta$ is for.
• Remember that $w = w + \eta y\vec{x}$.
• We can force the perceptron to stop wiggling around by forcing $\eta$ (and therefore $\eta y\vec{x}$) to get gradually smaller and smaller.
• This works: for the $n^{\text{th}}$ training token, set $\eta = \frac{1}{n}$.
• Notice: $\sum_{n=1}^{\infty} \frac{1}{n}$ is infinite. Nevertheless, $\eta = \frac{1}{n}$ works, because the $y\vec{x}$ tokens are not all in the same direction. (See the sketch below.)
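A sketch of the same training loop with the $\eta = 1/n$ schedule, where $n$ counts training tokens seen so far, as on the slide (illustrative names, not the lecture's code):

```python
import numpy as np

def perceptron_train_decaying(X, y, epochs=10):
    """Perceptron with eta = 1/n, where n counts training tokens seen so far.
    X: (n, D+1) with a trailing constant-1 column; y: labels in {-1, +1}."""
    w = np.random.randn(X.shape[1])
    n = 0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            n += 1                             # n-th training token
            y_hat = 1 if np.dot(w, x_i) >= 0 else -1
            if y_hat != y_i:
                w = w + (1.0 / n) * y_i * x_i  # update eta*y*x with eta = 1/n
    return w
```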

Page 28:

Outline

• A little history: the perceptron as a model of a biological neuron
• The perceptron learning algorithm
• Linear separability, $y\vec{x}$, and convergence
  • It converges to the right answer, even with $\eta = 1$, if the data are linearly separable
  • If the data are not linearly separable, it's necessary to use $\eta = 1/n$
• Multi-class perceptron

Page 29:

Multi-Class Perceptron

[Figure: network diagram. Inputs $1, x_1, x_2, \dots, x_D$ feed $V$ linear units with weights $w_{vd}$ and biases $b_0, \dots, b_{V-1}$; the output is $\operatorname{argmax}_{v=0}^{V-1} w_v^T \vec{x}$.]

The true class is $y \in \{0, 1, 2, \dots, V-1\}$ (i.e., $V$ = vocabulary size = number of distinct classes).

The classifier output is

$$\hat{y} = \operatorname{argmax}_{v=0}^{V-1} \left( w_{v1} x_1 + \dots + w_{vD} x_D + b_v \right) = \operatorname{argmax}_{v=0}^{V-1} w_v^T x \;\in\; \{0, 1, \dots, V-1\}$$

where $w_v = [w_{v1}, \dots, w_{vD}, b_v]^T$ and $x = [x_1, \dots, x_D, 1]^T$.
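As a sketch, the argmax rule is a single matrix-vector product. Here $W$ is assumed to stack the rows $w_v^T$ (illustrative, not from the slides):

```python
import numpy as np

def multiclass_predict(W, x):
    # W: (V, D+1) array; row v is w_v = [w_v1, ..., w_vD, b_v].
    # x: (D+1,) feature vector with a trailing 1 for the bias.
    scores = W @ x                  # one score w_v^T x per class
    return int(np.argmax(scores))   # y_hat in {0, ..., V-1}

# Tiny example with V = 3 hypothetical classes and D = 2 features:
W = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.5, 0.5, 1.0]])
print(multiclass_predict(W, np.array([2.0, 5.0, 1.0])))  # 1
```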

Page 30:

Multi-Class Perceptron

The true class is $y \in \{0, 1, 2, \dots, V-1\}$ (i.e., $V$ = vocabulary size = number of distinct classes).

The classifier output is

$$\hat{y} = \operatorname{argmax}_{v=0}^{V-1} \left( w_{v1} x_1 + \dots + w_{vD} x_D + b_v \right) = \operatorname{argmax}_{v=0}^{V-1} w_v^T x \;\in\; \{0, 1, \dots, V-1\}$$

where $w_v = [w_{v1}, \dots, w_{vD}, b_v]^T$ and $x = [x_1, \dots, x_D, 1]^T$.

[Figure: a Voronoi diagram in the $(x_1, x_2)$ plane, with regions labeled $\hat{y} = 0$ through $\hat{y} = 19$. By Balu Ertl - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=38534275]

Page 31:

Multi-Class Linear Classifiers

All multi-class linear classifiers have the form

$$\hat{y} = \operatorname{argmax}_{v=0}^{V-1} w_v^T x$$

The region of $x$-space associated with each class label is convex with piecewise-linear boundaries. Such regions are called "Voronoi regions."

[Figure: the same Voronoi diagram, regions labeled $\hat{y} = 0$ through $\hat{y} = 19$. By Balu Ertl - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=38534275]

Page 32:

Training a Multi-Class Perceptron

For each training instance $x$ with ground-truth label $y \in \{0, 1, \dots, V-1\}$:
• Classify with the current weights: $\hat{y} = \operatorname{argmax}_{v=0}^{V-1} w_v^T x$.
• Update the weights:
  • If $\hat{y}$ is correct ($y = \hat{y}$), do nothing.
  • If $\hat{y}$ is incorrect ($y \ne \hat{y}$), then:
    • Update the correct-class vector as $w_y = w_y + \eta x$.
    • Update the wrong-class vector as $w_{\hat{y}} = w_{\hat{y}} - \eta x$.
    • Don't change the vectors of any other class. (See the sketch below.)
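A minimal sketch of this multi-class update (illustrative names, assuming $W$ stacks the class weight vectors $w_v$ as rows):

```python
import numpy as np

def multiclass_perceptron_train(X, y, V, epochs=10, eta=1.0):
    """X: (n, D+1) with a trailing constant-1 column; y: labels in {0, ..., V-1}."""
    W = np.zeros((V, X.shape[1]))          # one weight row w_v per class
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            y_hat = int(np.argmax(W @ x_i))
            if y_hat != y_i:               # misclassified:
                W[y_i] += eta * x_i        # boost the correct class
                W[y_hat] -= eta * x_i      # penalize the wrongly predicted class
    return W                               # rows of all other classes untouched
```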

Page 33:

Conclusions

• Perceptron as a model of a biological neuron: $\hat{y} = \mathrm{sgn}(w^T x)$.
• The perceptron learning algorithm: if $y = \hat{y}$, do nothing; else $w = w + \eta y x$.
• Linear separability, $y\vec{x}$, and convergence:
  • It converges to the right answer, even with $\eta = 1$, if the data are linearly separable.
  • If the data are not linearly separable, it's necessary to use $\eta = 1/n$.
• Multi-class perceptron: if $y = \hat{y}$, do nothing; else $w_y = w_y + \eta x$ and $w_{\hat{y}} = w_{\hat{y}} - \eta x$.