14 Machine Learning: Single-Layer Perceptron

Neural Networks - Introduction: Single-Layer Perceptron
Andres Mendez-Vazquez
July 3, 2016



Outline

1 Introduction
  History
2 Adapting Filtering Problem
  Definition
  Description of the Behavior of the System
3 Unconstrained Optimization
  Introduction
  Method of Steepest Descent
  Newton's Method
  Gauss-Newton Method
4 Linear Least-Squares Filter
  Introduction
  Least-Mean-Square (LMS) Algorithm
  Convergence of the LMS
5 Perceptron
  Objective
  Perceptron: Local Field of a Neuron
  Perceptron: One Neuron Structure
  Deriving the Algorithm
  Under Linear Separability - Convergence happens!!!
  Proof
  Algorithm Using Error-Correcting
  Final Perceptron Algorithm (One Version)
  Other Algorithms for the Perceptron


History

At the beginning of Neural Networks (1943 - 1958)
- McCulloch and Pitts (1943) introduced the idea of neural networks as computing machines.
- Hebb (1949) postulated the first rule for self-organized learning.
- Rosenblatt (1958) proposed the perceptron as the first model for learning with a teacher (i.e., supervised learning).

In this chapter, we are interested in the perceptron
The perceptron is the simplest form of a neural network used for the classification of patterns said to be linearly separable (i.e., patterns that lie on opposite sides of a hyperplane).


In addition

Something Notable
- The single neuron also forms the basis of an adaptive filter.
- A functional block that is basic to the ever-expanding subject of signal processing.

Furthermore
The development of adaptive filtering owes much to the classic paper of Widrow and Hoff (1960), which pioneered the so-called least-mean-square (LMS) algorithm, also known as the delta rule.



Adapting Filtering Problem

Consider a dynamical system
[Figure: an unknown dynamical system with multiple inputs and a single output]

Signal-Flow Graph of Adaptive Model

We have the following equivalence
[Figure: signal-flow graph of the adaptive model built around a single linear neuron]


Description of the Behavior of the System

We have the data set

T = {(x(i), d(i)) | i = 1, 2, ..., n, ...}   (1)

Where

x(i) = (x_1(i), x_2(i), ..., x_m(i))^T   (2)


The Stimulus x(i)

The stimulus x(i) can arise in one of two ways:
- Spatial: the m elements of x(i) originate at different points in space.
- Temporal: the m elements of x(i) represent the set of present and (m − 1) past values of some excitation that are uniformly spaced in time.

Problem

Quite important
How do we design a multiple-input, single-output model of the unknown dynamical system?

It is more
We want to build this around a single neuron!!!


Thus, we have the following...

We need an algorithm to control the weight adjustment of the neuron
[Figure: the neuron together with a control algorithm that adjusts its synaptic weights]

Which steps do you need for the algorithm?

First
The algorithm starts from an arbitrary setting of the neuron's synaptic weights.

Second
Adjustments, with respect to changes in the environment, are made on a continuous basis. Time is incorporated into the algorithm.

Third
Computation of the adjustments to the synaptic weights is completed inside a time interval that is one sampling period long.



Thus, This Neural Model ≈ Adaptive Filter with Two Continuous Processes

Filtering process
1 An output, denoted by y(i), that is produced in response to the m elements of the stimulus vector x(i).
2 An error signal, e(i), that is obtained by comparing the output y(i) to the corresponding desired output d(i) produced by the unknown system.

Adaptive Process
It involves the automatic adjustment of the synaptic weights of the neuron in accordance with the error signal e(i).

Remark
The combination of these two processes working together constitutes a feedback loop acting around the neuron.


Thus

The output y(i) is exactly the same as the induced local field v(i)

y(i) = v(i) = Σ_{k=1}^{m} w_k(i) x_k(i)   (3)

In matrix form - remember we only have one neuron, so we do not need the neuron index k

y(i) = x^T(i) w(i)   (4)

Error

e(i) = d(i) − y(i)   (5)
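As a small illustration of equations (3)-(5), the following Python sketch (the function and variable names are illustrative, not from the slides) computes the neuron output and the error signal for one time step, assuming NumPy vectors for the stimulus and the weights.

import numpy as np

def filtering_step(x, w, d):
    """One filtering step of the adaptive model.

    x : stimulus vector x(i), shape (m,)
    w : current weight vector w(i), shape (m,)
    d : desired response d(i), a scalar
    """
    y = x @ w   # output y(i) = x^T(i) w(i), Eq. (4)
    e = d - y   # error e(i) = d(i) - y(i), Eq. (5)
    return y, e

# Example usage with made-up numbers
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.4, -0.2])
y, e = filtering_step(x, w, d=1.0)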



Consider

A continuously differentiable function J(w)
We want to find an optimal solution w* such that

J(w*) ≤ J(w), ∀w   (6)

We want to
Minimize the cost function J(w) with respect to the weight vector w.

For this, the necessary condition for optimality is

∇J(w*) = 0   (7)


Where

∇ is the gradient operator

∇ = [∂/∂w_1, ∂/∂w_2, ..., ∂/∂w_m]^T   (8)

Thus

∇J(w) = [∂J(w)/∂w_1, ∂J(w)/∂w_2, ..., ∂J(w)/∂w_m]^T   (9)


Thus

Starting with an initial guess denoted by w(0),
then generate a sequence of weight vectors w(1), w(2), ...

Such that you can reduce J(w) at each iteration

J(w(n + 1)) < J(w(n))   (10)

Where: w(n) is the old value and w(n + 1) is the new value.


The Three Main Methods for Unconstrained Optimization

We will look at
1 Steepest Descent
2 Newton's Method
3 Gauss-Newton Method


Steepest Descent

In the method of steepest descent, we have a cost function J(w) and the update

w(n + 1) = w(n) − η∇J(w(n))

How do we prove that J(w(n + 1)) < J(w(n))?
We use the first-order Taylor series expansion around w(n)

J(w(n + 1)) ≈ J(w(n)) + ∇J^T(w(n)) Δw(n)   (11)

Remark: This is a good approximation when the step size is small!!! In addition, Δw(n) = w(n + 1) − w(n).
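A minimal steepest-descent sketch in Python, assuming a quadratic cost J(w) = (1/2)‖Aw − b‖² so the gradient is available in closed form; the function name, the fixed iteration count, and the example matrices are illustrative, not from the slides.

import numpy as np

def steepest_descent(grad, w0, eta=0.1, n_iters=100):
    """Iterate w(n+1) = w(n) - eta * grad J(w(n))."""
    w = w0.copy()
    for _ in range(n_iters):
        w = w - eta * grad(w)
    return w

# Quadratic example: J(w) = 0.5 * ||A w - b||^2, so grad J(w) = A^T (A w - b)
A = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, 1.0])
grad = lambda w: A.T @ (A @ w - b)
w_min = steepest_descent(grad, w0=np.zeros(2), eta=0.1, n_iters=500)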


Why? Look at the case in R

The equation of the tangent line to the curve y = J(w(n)) is

L(w(n + 1)) = J′(w(n)) [w(n + 1) − w(n)] + J(w(n))   (12)

[Figure: example of the tangent line to J at w(n)]


Thus, we have that in R

Remember something quite classic (the slope of the secant)

tan θ = [J(w(n + 1)) − J(w(n))] / [w(n + 1) − w(n)]

tan θ · [w(n + 1) − w(n)] = J(w(n + 1)) − J(w(n))

Approximating tan θ by the derivative J′(w(n)):

J′(w(n)) [w(n + 1) − w(n)] ≈ J(w(n + 1)) − J(w(n))


Thus, we have that

Using the first-order Taylor expansion

J(w(n + 1)) ≈ J(w(n)) + J′(w(n)) [w(n + 1) − w(n)]   (13)

Now, for Many Variables

A hyperplane in R^n is a set of the form

H = {x | a^T x = b}   (14)

Given x ∈ H and x_0 ∈ H

b = a^T x = a^T x_0

Thus, we have that

H = {x | a^T (x − x_0) = 0}


Thus, we have the following definition

Definition (Differentiability)
Assume that J is defined in a disk D containing w(n). We say that J is differentiable at w(n) if:
1 ∂J(w(n))/∂w_i exists for all i = 1, ..., m.
2 J is locally linear at w(n).


Thus, given J(w(n))

We know that we have the following operator

∇ = (∂/∂w_1, ∂/∂w_2, ..., ∂/∂w_m)   (15)

Thus, we have

∇J(w(n)) = (∂J(w(n))/∂w_1, ∂J(w(n))/∂w_2, ..., ∂J(w(n))/∂w_m) = Σ_{i=1}^{m} ŵ_i ∂J(w(n))/∂w_i

Where ŵ_i denotes the i-th standard basis vector of R^m, e.g., ŵ_1^T = (1, 0, ..., 0).


Now

Given a curve r(t) that lies on the level set J(w(n)) = c (here illustrated in R³)
[Figure: a level surface of J with a curve r(t) lying on it]

Level Set

Definition

{(w_1, w_2, ..., w_m) ∈ R^m | J(w_1, w_2, ..., w_m) = c}   (16)

Remark: In a standard Calculus course we would use x and f instead of w and J.

Where

Any curve has the following parametrization

r : [a, b] → R^m
r(t) = (w_1(t), ..., w_m(t))

With r(n + 1) = (w_1(n + 1), ..., w_m(n + 1))

We can write the parametrized version of it

z(t) = J(w_1(t), w_2(t), ..., w_m(t)) = c   (17)

Differentiating with respect to t and using the chain rule for multiple variables

dz(t)/dt = Σ_{i=1}^{m} [∂J(w(t))/∂w_i] · [dw_i(t)/dt] = 0   (18)


Note

First
Given y = f(u) = (f_1(u), ..., f_l(u)) and u = g(x) = (g_1(x), ..., g_m(x)).

We have then that

∂(f_1, f_2, ..., f_l)/∂(x_1, x_2, ..., x_k) = [∂(f_1, f_2, ..., f_l)/∂(g_1, g_2, ..., g_m)] · [∂(g_1, g_2, ..., g_m)/∂(x_1, x_2, ..., x_k)]   (19)

Thus

∂(f_1, f_2, ..., f_l)/∂x_i = [∂(f_1, f_2, ..., f_l)/∂(g_1, g_2, ..., g_m)] · [∂(g_1, g_2, ..., g_m)/∂x_i] = Σ_{k=1}^{m} [∂(f_1, f_2, ..., f_l)/∂g_k] · [∂g_k/∂x_i]


Thus

Evaluating at t = n

Σ_{i=1}^{m} [∂J(w(n))/∂w_i] · [dw_i(n)/dt] = 0

We have that

∇J(w(n)) · r′(n) = 0   (20)

This proves that, for every level set, the gradient is perpendicular to the tangent of any curve that lies on the level set
In particular at the point w(n).
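A quick numerical illustration (not from the slides) of equation (20): for the cost J(w) = w_1² + w_2², whose level sets are circles, the gradient at a point of the level set is orthogonal to the tangent of the curve r(t) that parametrizes it.

import numpy as np

# J(w) = w1^2 + w2^2; the level set J = c is a circle of radius sqrt(c)
c = 4.0
r = lambda t: np.array([np.sqrt(c) * np.cos(t), np.sqrt(c) * np.sin(t)])          # curve on the level set
r_prime = lambda t: np.array([-np.sqrt(c) * np.sin(t), np.sqrt(c) * np.cos(t)])   # its tangent vector
grad_J = lambda w: 2.0 * w                                                        # gradient of J

t = 0.7
print(np.dot(grad_J(r(t)), r_prime(t)))   # ~ 0: the gradient is perpendicular to the tangent, Eq. (20)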


Now, the tangent plane to the surface can be described generally

Thus

L(w(n + 1)) = J(w(n)) + ∇J^T(w(n)) [w(n + 1) − w(n)]   (21)

[Figure: the tangent plane to the surface of J at w(n)]


Proving the Fact about Steepest Descent

We want the following

J(w(n + 1)) < J(w(n))

Using the first-order Taylor approximation

J(w(n + 1)) − J(w(n)) ≈ ∇J^T(w(n)) Δw(n)

So, we choose the update

Δw(n) = −η∇J(w(n)) with η > 0


Then

We have that

J(w(n + 1)) − J(w(n)) ≈ −η∇J^T(w(n)) ∇J(w(n)) = −η ‖∇J(w(n))‖²

Thus

J(w(n + 1)) − J(w(n)) < 0

Or

J(w(n + 1)) < J(w(n))



Newton's Method

Here
The basic idea of Newton's method is to minimize the quadratic approximation of the cost function J(w) around the current point w(n).

Using a second-order Taylor series expansion of the cost function around the point w(n)

ΔJ(w(n)) = J(w(n + 1)) − J(w(n)) ≈ ∇J^T(w(n)) Δw(n) + (1/2) Δw^T(n) H(n) Δw(n)

Where, given that w(n) is a vector with dimension m, H is the m × m Hessian matrix of second partial derivatives

H = ∇²J(w) =
[ ∂²J(w)/∂w_1²      ∂²J(w)/∂w_1∂w_2   ···   ∂²J(w)/∂w_1∂w_m ]
[ ∂²J(w)/∂w_2∂w_1   ∂²J(w)/∂w_2²      ···   ∂²J(w)/∂w_2∂w_m ]
[ ⋮                  ⋮                        ⋮              ]
[ ∂²J(w)/∂w_m∂w_1   ∂²J(w)/∂w_m∂w_2   ···   ∂²J(w)/∂w_m²    ]


Now, we want to minimize J(w(n + 1))

Do you have any idea? Look again

J(w(n)) + ∇J^T(w(n)) Δw(n) + (1/2) Δw^T(n) H(n) Δw(n)   (22)

Differentiate with respect to Δw(n) and set the result to zero

∇J(w(n)) + H(n) Δw(n) = 0   (23)

Thus

Δw(n) = −H^{−1}(n) ∇J(w(n))


The Final Method

Using the previous update for the weights

w(n + 1) − w(n) = Δw(n) = −H^{−1}(n) ∇J(w(n))

Then

w(n + 1) = w(n) − H^{−1}(n) ∇J(w(n))
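A hedged sketch of one Newton update Δw(n) = −H^{−1}(n)∇J(w(n)); solving the linear system is preferred over forming the explicit inverse. The function names and the quadratic example are illustrative, not from the slides.

import numpy as np

def newton_step(w, grad, hess):
    """One Newton update: w <- w - H^{-1}(w) grad J(w)."""
    g = grad(w)
    H = hess(w)
    return w - np.linalg.solve(H, g)   # avoids computing H^{-1} explicitly

# Quadratic example: J(w) = 0.5 w^T Q w - b^T w, minimized in a single Newton step
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad = lambda w: Q @ w - b
hess = lambda w: Q
w_min = newton_step(np.zeros(2), grad, hess)   # equals Q^{-1} b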



We have then an error

Something Notable

J(w) = (1/2) Σ_{i=1}^{n} e²(i)

Thus, using the first-order Taylor expansion, we linearize

e_l(i, w) = e(i) + [∂e(i)/∂w]^T [w − w(n)]

In matrix form

e_l(n, w) = e(n) + J(n) [w − w(n)]


Where

The error vector is equal to

e(n) = [e(1), e(2), ..., e(n)]^T   (24)

Thus, we get the famous Jacobian once we take the derivatives ∂e(i)/∂w_j

J(n) =
[ ∂e(1)/∂w_1   ∂e(1)/∂w_2   ···   ∂e(1)/∂w_m ]
[ ∂e(2)/∂w_1   ∂e(2)/∂w_2   ···   ∂e(2)/∂w_m ]
[ ⋮             ⋮                   ⋮         ]
[ ∂e(n)/∂w_1   ∂e(n)/∂w_2   ···   ∂e(n)/∂w_m ]


Where

We want the following

w(n + 1) = argmin_w { (1/2) ‖e_l(n, w)‖² }

Ideas
What if we expand out the equation?


Expanded Version

We get

(1/2) ‖e_l(n, w)‖² = (1/2) ‖e(n)‖² + e^T(n) J(n) (w − w(n)) + (1/2) (w − w(n))^T J^T(n) J(n) (w − w(n))

Now, Differentiating

Differentiating the expanded expression with respect to w and setting it equal to zero

J^T(n) e(n) + J^T(n) J(n) [w − w(n)] = 0

We finally get

w(n + 1) = w(n) − (J^T(n) J(n))^{−1} J^T(n) e(n)   (25)
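A minimal sketch of the Gauss-Newton step in equation (25), assuming you can supply the stacked error vector e(n) and its Jacobian J(n) as NumPy arrays; the helper name is illustrative, and J^T(n) J(n) is assumed nonsingular (see the remark that follows).

import numpy as np

def gauss_newton_step(w, e, J):
    """w(n+1) = w(n) - (J^T J)^{-1} J^T e(n), Eq. (25).

    e : error vector e(n), shape (n,)
    J : Jacobian of e(n) with respect to w, shape (n, m)
    """
    # Solve the normal equations rather than inverting J^T J explicitly.
    return w - np.linalg.solve(J.T @ J, J.T @ e)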


Remarks

We have that
- Newton's method requires knowledge of the Hessian matrix of the cost function.
- The Gauss-Newton method only requires the Jacobian matrix of the error vector e(n).

However
For the Gauss-Newton iteration to be computable, the matrix product J^T(n) J(n) must be nonsingular!!!



Introduction

A linear least-squares filter has two distinctive characteristics
- First, the single neuron around which it is built is linear.
- The cost function J(w) used to design the filter consists of the sum of error squares.

Thus, expressing the error

e(n) = d(n) − [x(1), ..., x(n)]^T w(n)

Short version - the error is linear in the weight vector w(n)

e(n) = d(n) − X(n) w(n)

Where d(n) is the n × 1 desired response vector and X(n) is the n × m data matrix.


Now, differentiate e(n) with respect to w(n)

Thus

∇e(n) = −X^T(n)

Correspondingly, the Jacobian of e(n) is

J(n) = −X(n)

Let us use the Gauss-Newton update

w(n + 1) = w(n) − (J^T(n) J(n))^{−1} J^T(n) e(n)


Thus

We have the following

w(n + 1) = w(n) − [(−X^T(n)) (−X(n))]^{−1} (−X^T(n)) [d(n) − X(n) w(n)]

We have then

w(n + 1) = w(n) + (X^T(n) X(n))^{−1} X^T(n) d(n) − (X^T(n) X(n))^{−1} X^T(n) X(n) w(n)

Thus, we have

w(n + 1) = w(n) + (X^T(n) X(n))^{−1} X^T(n) d(n) − w(n)
         = (X^T(n) X(n))^{−1} X^T(n) d(n)
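The closed-form result above is the normal-equations solution of the least-squares problem. A small sketch, assuming the data matrix X(n) and the desired vector d(n) are given as NumPy arrays (in practice np.linalg.lstsq is the numerically safer route); the synthetic data are made up for illustration.

import numpy as np

def linear_least_squares_filter(X, d):
    """w = (X^T X)^{-1} X^T d for the n x m data matrix X and desired responses d."""
    return np.linalg.solve(X.T @ X, X.T @ d)

# Example with synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
d = X @ w_true + 0.01 * rng.normal(size=50)
w_hat = linear_least_squares_filter(X, d)   # close to w_true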



Again Our Error Cost Function

We have

J(w) = (1/2) e²(n)

where e(n) is the error signal measured at time n.

Again, differentiating with respect to the vector w

∂J(w)/∂w = e(n) ∂e(n)/∂w

The LMS algorithm operates with a linear neuron, so we may express the error signal as

e(n) = d(n) − x^T(n) w(n)   (26)


We have

Something Notable

∂e(n)/∂w = −x(n)

Then

∂J(w)/∂w = −x(n) e(n)

Using this as an estimate for the gradient vector, we have for the gradient descent

w(n + 1) = w(n) + η x(n) e(n)   (27)
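A compact sketch of the LMS recursion in equations (26)-(27): for each sample, compute the error with the current weights and nudge the weights in the direction of x(n) e(n). The function and variable names are illustrative, not from the slides.

import numpy as np

def lms(X, d, eta=0.01, w0=None):
    """Least-mean-square adaptation over the rows of X."""
    n, m = X.shape
    w = np.zeros(m) if w0 is None else w0.copy()
    for i in range(n):
        e = d[i] - X[i] @ w      # e(n) = d(n) - x^T(n) w(n), Eq. (26)
        w = w + eta * X[i] * e   # w(n+1) = w(n) + eta x(n) e(n), Eq. (27)
    return w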


Remarks

The feedback loop around the weight vector behaves like a low-pass filter
- It passes the low-frequency components of the error signal.
- It attenuates the high-frequency components of the error signal.

Low-Pass Filter
[Figure: RC low-pass filter with input Vin and output Vout]


Actually

Thus
The average time constant of this filtering action is inversely proportional to the learning-rate parameter η.

Thus
Assigning a small value to η makes the adaptive process progress slowly.

Thus
More of the past data are remembered by the LMS algorithm; thus, the LMS becomes a more accurate filter.



Convergence of the LMS

This convergence depends on the following points
- The statistical characteristics of the input vector x(n).
- The learning-rate parameter η.

Something Notable
However, instead of using E[w(n)] as n → ∞, we use E[e²(n)] → constant as n → ∞.


To make this analysis practical

We take the following assumptions
1 The successive input vectors x(1), x(2), ... are statistically independent of each other.
2 At time step n, the input vector x(n) is statistically independent of all previous samples of the desired response, namely d(1), d(2), ..., d(n − 1).
3 At time step n, the desired response d(n) is dependent on x(n), but statistically independent of all previous values of the desired response.
4 The input vector x(n) and desired response d(n) are drawn from Gaussian-distributed populations.


We get the following

The LMS is convergent in the mean square provided that η satisfies

0 < η < 2/λ_max   (28)

where λ_max is the largest eigenvalue of the sample correlation matrix R_x.

This can be difficult to obtain in practice... then we use the trace instead

0 < η < 2/trace[R_x]   (29)

However, each diagonal element of R_x is equal to the mean-square value of the corresponding sensor input
We can re-state the previous condition as

0 < η < 2/(sum of the mean-square values of the sensor inputs)   (30)
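A small sketch (illustrative names, not from the slides) of how the step-size bounds (28)-(30) might be estimated from data: the sample correlation matrix R_x gives the eigenvalue bound, and its trace, the sum of the mean-square sensor inputs, gives the looser but cheaper bound.

import numpy as np

def lms_eta_bounds(X):
    """Estimate the LMS step-size bounds from the rows of the data matrix X."""
    Rx = (X.T @ X) / X.shape[0]                      # sample correlation matrix of the inputs
    eta_eig = 2.0 / np.max(np.linalg.eigvalsh(Rx))   # bound (28): 2 / lambda_max
    eta_trace = 2.0 / np.trace(Rx)                   # bounds (29)-(30): 2 / trace(R_x)
    return eta_eig, eta_trace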


Virtues and Limitations of the LMS Algorithm

Virtues
- An important virtue of the LMS algorithm is its simplicity.
- The LMS is model independent and therefore robust: small disturbances lead to small estimation errors.

Not only that, the LMS algorithm is optimal in accordance with the minimax criterion
If you do not know what you are up against, plan for the worst and optimize.

Primary Limitation
- The slow rate of convergence and sensitivity to variations in the eigenstructure of the input.
- The LMS algorithm typically requires a number of iterations equal to about 10 times the dimensionality of the input space to reach convergence.


More of this in...

Simon Haykin, Adaptive Filter Theory (3rd Edition).

Exercises

From Neural Networks by Haykin: 3.1, 3.2, 3.3, 3.4, 3.5, 3.7, and 3.8.


Objective

Goal
Correctly classify a series of samples (externally applied stimuli) x_1, x_2, x_3, ..., x_m into one of two classes, C1 and C2.

Output for each input
1 Class C1: output y = +1.
2 Class C2: output y = −1.


History

Frank Rosenblatt
The perceptron algorithm was invented in 1957 at the Cornell Aeronautical Laboratory by Frank Rosenblatt.

Something Notable
Frank Rosenblatt was a psychologist working at a military R&D laboratory!!!

Frank Rosenblatt
He helped to develop the Mark I Perceptron - a new machine based on the connectivity of neural networks!!!

Some problems with it
The most important is the impossibility of using a perceptron with a single neuron to solve the XOR problem.



Page 148: 14 Machine Learning Single Layer Perceptron

Perceptron: Local Field of a Neuron

Signal-flow graph: the inputs x_1, ..., x_m and the bias b feed a single summing neuron.

Induced local field of a neuron

v = ∑_{i=1}^{m} w_i x_i + b    (31)


Perceptron: One Neuron Structure

Based on the previous induced local field: in the simplest form of the perceptron there are two decision regions separated by a hyperplane:

∑_{i=1}^{m} w_i x_i + b = 0    (32)

Example with two signals x_1 and x_2: the decision boundary w_1 x_1 + w_2 x_2 + b = 0 is a line separating class C1 from class C2.
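As a minimal sketch (not from the slides; the weights, bias, and sample values below are made up for illustration), the induced local field and the resulting two-class decision can be written as:

```python
import numpy as np

def induced_local_field(w, x, b):
    """Compute v = sum_i w_i * x_i + b (Eq. 31)."""
    return np.dot(w, x) + b

def perceptron_output(w, x, b):
    """Hard-limiter decision: +1 (class C1) if v > 0, otherwise -1 (class C2)."""
    return 1 if induced_local_field(w, x, b) > 0 else -1

# Toy example with two signals (x1, x2); weights and bias are illustrative.
w = np.array([2.0, -1.0])
b = -0.5
print(perceptron_output(w, np.array([1.0, 0.5]), b))   # 1  -> class C1
print(perceptron_output(w, np.array([-1.0, 2.0]), b))  # -1 -> class C2
```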


Deriving the Algorithm

First, you put signals together

x(n) = [1, x_1(n), x_2(n), ..., x_m(n)]^T    (33)

Weighted sum, using the extended weight vector w(n) with w_0(n) = b:

v(n) = ∑_{i=0}^{m} w_i(n) x_i(n) = w^T(n) x(n)    (34)

IMPORTANT note: the perceptron works only if C1 and C2 are linearly separable.


Rule for Linearly Separable Classes

There must exist a vector w such that:
1. w^T x > 0 for every input vector x belonging to class C1.
2. w^T x ≤ 0 for every input vector x belonging to class C2.

What is the derivative dv(n)/dw?

dv(n)/dw = x(n)    (35)


Finally

No correction is necessary:
1. w(n+1) = w(n) if w^T(n) x(n) > 0 and x(n) belongs to class C1.
2. w(n+1) = w(n) if w^T(n) x(n) ≤ 0 and x(n) belongs to class C2.

Correction is necessary:
1. w(n+1) = w(n) − η(n) x(n) if w^T(n) x(n) > 0 and x(n) belongs to class C2.
2. w(n+1) = w(n) + η(n) x(n) if w^T(n) x(n) ≤ 0 and x(n) belongs to class C1.

where η(n) is a learning-rate parameter controlling the size of the adjustment.
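A minimal sketch of one such correction step, assuming the bias is folded in as x_0 = 1 (so w and x include it) and that η is a constant; the function name and the default value are assumptions, not from the slides:

```python
import numpy as np

def perceptron_step(w, x, label, eta=1.0):
    """One error-correction step of the rules above.

    x includes the bias input x_0 = 1; label is +1 for C1 and -1 for C2.
    The weights change only when the sample is misclassified.
    """
    v = np.dot(w, x)
    if label == +1 and v <= 0:   # C1 sample on the wrong side: add eta * x
        return w + eta * x
    if label == -1 and v > 0:    # C2 sample on the wrong side: subtract eta * x
        return w - eta * x
    return w                     # correctly classified: no correction
```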


A little bit on the geometry: for example, the correction w(n+1) = w(n) − η(n) x(n) pushes the weight vector away from a misclassified x(n) of class C2, tilting the decision boundary toward classifying it correctly.


Under Linear Separability - Convergence happens!!!

If we assume linear separability for the classes C1 and C2.

Rosenblatt (1962): Let the subsets of training vectors C1 and C2 be linearly separable, and let the inputs presented to the perceptron originate from these two subsets. Then the perceptron converges after some n_0 iterations, in the sense that

w(n_0) = w(n_0 + 1) = w(n_0 + 2) = ...    (36)

is a solution vector, with n_0 ≤ n_max.


Proof I - First, a Lower Bound for ‖w(n+1)‖^2

Initialization

w (0) = 0 (37)

Now assume for time n = 1, 2, 3, ...

w^T(n) x(n) < 0    (38)

with x(n) belonging to class C1.

THE PERCEPTRON INCORRECTLY CLASSIFIES THE VECTORS x(1), x(2), ...

Apply the correction formula

w(n+1) = w(n) + x(n)  (taking η(n) = 1)    (39)


Proof II

Apply the correction iteratively

w(n+1) = x(1) + x(2) + ... + x(n)    (40)

We know that there is a solution w_0 (by linear separability), so define

α = min_{x(n)∈C1} w_0^T x(n)    (41)

Then, we have

w_0^T w(n+1) = w_0^T x(1) + w_0^T x(2) + ... + w_0^T x(n)    (42)


Proof IV

Thus, using α:

w_0^T w(n+1) ≥ nα    (46)

Thus, using the Cauchy-Schwarz inequality,

‖w_0‖^2 ‖w(n+1)‖^2 ≥ [w_0^T w(n+1)]^2    (47)

where ‖·‖ is the Euclidean norm.

Thus

‖w_0‖^2 ‖w(n+1)‖^2 ≥ n^2 α^2

‖w(n+1)‖^2 ≥ n^2 α^2 / ‖w_0‖^2


Proof V - Now, an Upper Bound for ‖w(n+1)‖^2

Now rewrite equation 39

w (k + 1) = w (k) + x (k) (48)

for k = 1, 2, ..., n and x (k) ∈ C1

Squaring the Euclidean norm of both sides

‖w(k+1)‖^2 = ‖w(k)‖^2 + ‖x(k)‖^2 + 2 w^T(k) x(k)    (49)

Now, using the fact that w^T(k) x(k) < 0:

‖w(k+1)‖^2 ≤ ‖w(k)‖^2 + ‖x(k)‖^2

‖w(k+1)‖^2 − ‖w(k)‖^2 ≤ ‖x(k)‖^2


Proof VI

Use the telescoping sum

∑_{k=0}^{n} [‖w(k+1)‖^2 − ‖w(k)‖^2] ≤ ∑_{k=0}^{n} ‖x(k)‖^2    (50)

Assume

w(0) = 0 and x(0) = 0

Thus

‖w(n+1)‖^2 ≤ ∑_{k=1}^{n} ‖x(k)‖^2


Proof VII

Then, we can define a positive number

β = max_{x(k)∈C1} ‖x(k)‖^2    (51)

Thus

‖w(n+1)‖^2 ≤ ∑_{k=1}^{n} ‖x(k)‖^2 ≤ nβ

Thus, both bounds can hold simultaneously only while n does not exceed some n_max (a sandwich argument), where n_max satisfies

n_max^2 α^2 / ‖w_0‖^2 = n_max β    (52)


Proof VIII

Solving

n_max = β ‖w_0‖^2 / α^2    (53)

Thus, for η(n) = 1 for all n, w(0) = 0, and a solution vector w_0:

The rule for adapting the synaptic weights of the perceptron must terminate after at most n_max steps.

In addition: because any valid w_0 can be used, the solution found is not unique.
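As a hedged numeric illustration (the toy data, the chosen w_0, and the trick of negating the C2 samples so every correction looks like an addition are assumptions beyond the slides, which treat only the C1 case), one can compute α, β, and the bound n_max = β ‖w_0‖^2 / α^2 directly:

```python
import numpy as np

# Toy linearly separable data with a leading bias input of 1.
X_c1 = np.array([[1.0, 2.0, 1.0], [1.0, 1.5, 2.0]])
X_c2 = np.array([[1.0, -1.0, -1.0], [1.0, -2.0, -0.5]])

# Negate the C2 samples so the condition w0^T x > 0 must hold for every row.
samples = np.vstack([X_c1, -X_c2])

w0 = np.array([0.0, 1.0, 1.0])            # an assumed solution vector
alpha = (samples @ w0).min()              # alpha = min over samples of w0^T x
beta = (samples ** 2).sum(axis=1).max()   # beta  = max over samples of ||x||^2
n_max = beta * np.dot(w0, w0) / alpha ** 2
print(alpha, beta, n_max)                 # bound on the number of corrections
```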


Algorithm Using Error-Correcting

Now, if we use the error measure ½ e_k^2(n)

We can actually simplify the rules and the final algorithm!!!

Thus, we have the following Delta Value

Δw(n) = η (d_j − y_j(n)) x(n)    (54)
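A minimal sketch of this single update (the function name and the value of η are mine): since d_j and y_j(n) take values in {+1, −1}, the difference (d_j − y_j(n)) is 0 for a correct classification and ±2 otherwise, so one rule covers both correction cases above.

```python
import numpy as np

def delta_update(w, x, d, y, eta=0.5):
    """Error-correcting update w <- w + eta * (d - y) * x (Eq. 54).

    With d, y in {+1, -1}, (d - y) is 0 when the sample is already
    correctly classified and +2 or -2 when a correction is needed.
    """
    return w + eta * (d - y) * np.asarray(x, dtype=float)
```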


We could use the previous Rule

To generate an algorithm this way, however, you need classes that are linearly separable!!!

Therefore, we can use a more general gradient descent rule to obtain an algorithm that seeks the best separating hyperplane!!!


Gradient Descent Algorithm

Algorithm - Off-line/Batch Learning
1. Set n = 0.
2. Set d_j = +1 if x_j ∈ Class 1 and d_j = −1 if x_j ∈ Class 2, for all j = 1, 2, ..., m.
3. Initialize the weights w^T(n) = (w_0(n), w_1(n), ..., w_m(n)).
   - Weights may be initialized to 0 or to a small random value.
4. Initialize dummy outputs y^T = (y_1(n), y_2(n), ..., y_m(n)) so you can enter the loop.
5. Initialize the stopping error ε > 0.
6. Initialize the learning rate η.
7. While (1/m) ∑_{j=1}^{m} |d_j − y_j(n)| > ε:
   - For each sample (x_j, d_j), j = 1, ..., m:
     - Calculate the output y_j = φ(w^T(n) · x_j).
     - Update the weights: w_i(n+1) = w_i(n) + η (d_j − y_j(n)) x_{ij}.
   - n = n + 1.
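A runnable sketch of this batch loop, assuming a hard-limiter φ, desired outputs in {+1, −1}, and a bias column of ones already prepended to each sample; the max_epochs guard and all names are mine, not the slides':

```python
import numpy as np

def train_batch_perceptron(X, d, eta=0.1, eps=0.01, max_epochs=1000):
    """Batch error-correcting training following the listed steps.

    X: (m, p) array of samples with a leading bias column of ones.
    d: (m,) desired outputs in {+1, -1}.
    Stops when the mean absolute error falls to eps, or after max_epochs
    (an extra guard, since with eps > 0 the loop may otherwise not end).
    """
    w = np.zeros(X.shape[1])                   # weights initialized to 0
    phi = lambda v: np.where(v > 0, 1, -1)     # hard limiter
    for _ in range(max_epochs):
        y = phi(X @ w)                         # outputs for the whole batch
        if np.mean(np.abs(d - y)) <= eps:      # stopping criterion of step 7
            break
        for xj, dj in zip(X, d):               # per-sample error correction
            yj = phi(xj @ w)
            w = w + eta * (dj - yj) * xj
    return w
```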


Nevertheless

We have the following problem

ε > 0 (55)

Thus... convergence to the best linear separation becomes a matter of tuning ε (and η)!!!


However, if we limit our features!!!

The Winnow Algorithm!!! It converges even without linear separability.

Feature vector: Boolean-valued features, X = {0, 1}^d.

Weight vector w:
1. w^T = (w_1, w_2, ..., w_p) with each w_i ∈ R.
2. For all i, w_i ≥ 0.


Classification Scheme

We use a specific threshold θ:
1. w^T x ≥ θ ⇒ positive classification (x is assigned to Class 1).
2. w^T x < θ ⇒ negative classification (x is assigned to Class 2).

Rules: we use two possible rules for training!!! With a learning rate α > 1.

Rule 1: when misclassifying a positive training example x ∈ Class 1 (i.e., w^T x < θ),

∀ i with x_i = 1 : w_i ← α w_i    (56)


Rule 2: when misclassifying a negative training example x ∈ Class 2 (i.e., w^T x ≥ θ),

∀ i with x_i = 1 : w_i ← w_i / α    (57)

Rule 3: if a sample is correctly classified, do nothing!!!
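A minimal sketch of these three rules (the starting weights of 1 and the default threshold θ equal to the number of features are common choices I am assuming, not stated in the slides):

```python
import numpy as np

def winnow_step(w, x, in_class1, alpha=2.0, theta=None):
    """One Winnow update on a Boolean input x in {0, 1}^d.

    Rule 1: promote (w_i <- alpha * w_i) on a missed positive example.
    Rule 2: demote  (w_i <- w_i / alpha) on a missed negative example.
    Rule 3: leave w unchanged otherwise. Only features with x_i = 1 change.
    """
    x = np.asarray(x, dtype=float)
    if theta is None:
        theta = len(x)                        # assumed default threshold
    predicted_c1 = np.dot(w, x) >= theta
    if in_class1 and not predicted_c1:        # Rule 1
        w = np.where(x == 1, alpha * w, w)
    elif (not in_class1) and predicted_c1:    # Rule 2
        w = np.where(x == 1, w / alpha, w)
    return w                                  # Rule 3

# Usage: weights start at 1 and stay non-negative under both rules.
w = np.ones(4)
w = winnow_step(w, [1, 0, 1, 0], in_class1=True)
```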


Properties of Winnow

Property: if there are many irrelevant variables, Winnow performs better than the perceptron.

Drawback: it is sensitive to the learning rate α.


The Pocket Algorithm
A variant of the perceptron algorithm, suggested by Gallant in "Perceptron-based learning algorithms," IEEE Transactions on Neural Networks, Vol. 1(2), pp. 179-191, 1990. It converges to an optimal solution even if linear separability is not fulfilled.

It consists of the following steps:
1. Initialize the weight vector w(0) in a random way.
2. Define a storage (pocket) vector w_s and set its history counter h_s to zero.
3. At the nth iteration step, compute the update w(n+1) using the perceptron rule.
4. Use the updated weights to find the number h of samples correctly classified.
5. If at any moment h > h_s, replace w_s with w(n+1) and h_s with h.
6. Keep iterating from step 3 until convergence.
7. Return w_s.
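A runnable sketch of these steps, assuming labels in {+1, −1}, a bias column already included in X, and a fixed iteration budget in place of "until convergence" (all names are mine):

```python
import numpy as np

def pocket_algorithm(X, d, eta=1.0, max_iters=1000, seed=None):
    """Pocket variant of the perceptron rule.

    Keeps in w_pocket the weight vector that has classified the largest
    number of samples correctly so far, and returns it at the end.
    """
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])            # step 1: random initialization
    w_pocket, h_pocket = w.copy(), 0           # step 2: pocket vector, counter

    def correct_count(wv):
        return int(np.sum(np.where(X @ wv > 0, 1, -1) == d))

    for n in range(max_iters):                 # step 3: perceptron-rule update
        xj, dj = X[n % len(X)], d[n % len(X)]
        yj = 1 if xj @ w > 0 else -1
        w = w + eta * (dj - yj) / 2 * xj       # error-correction step
        h = correct_count(w)                   # step 4: correctly classified
        if h > h_pocket:                       # step 5: better than the pocket
            w_pocket, h_pocket = w.copy(), h
        if h_pocket == len(X):                 # fully correct: stop early
            break
    return w_pocket                            # step 7: return the pocket vector
```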
