Topic 3. Learning Rules of the Artificial Neural Networks.


Page 1: Topic 3

Topic 3.

Learning Rules of the Artificial Neural Networks.

Page 2: Topic 3

Multilayer Perceptron.

• The first layer is the input layer,

• and the last layer is the output layer.

• All other layers with no direct connections from or to the outside are called hidden layers.

Page 3: Topic 3

Multilayer Perceptron.

• The input is processed and relayed from one layer to the next, until the final result has been computed.

• This process represents the feedforward scheme.
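As a rough illustration of the feedforward scheme, here is a minimal Python/NumPy sketch; the layer sizes, the random weights, and the use of a sigmoid activation are illustrative assumptions, not something specified on this slide:

```python
import numpy as np

def sigmoid(s):
    # Generic sigmoid activation introduced later in the slides: f(S) = 1 / (1 + e^(-S))
    return 1.0 / (1.0 + np.exp(-s))

def feedforward(x, weights):
    """Relay the input through every layer until the final output is computed."""
    activation = x
    for W in weights:                 # one weight matrix per layer of connections
        s = W @ activation            # weighted sum of the previous layer's outputs
        activation = sigmoid(s)       # output of the current layer
    return activation

# Hypothetical 3-input / 2-hidden / 3-output perceptron with random weights
rng = np.random.default_rng(0)
weights = [rng.normal(size=(2, 3)), rng.normal(size=(3, 2))]
print(feedforward(np.array([0.5, -1.0, 0.2]), weights))
```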

Page 4: Topic 3

Multilayer Perceptron.

• Structural credit assignment problem: when an error is made at the output of a network, how is credit (or blame) to be assigned to neurons deep within the network?

• One of the most popular techniques to train the hidden neurons is error backpropagation, whereby the error of output units is propagated back to yield estimates of how much a given hidden unit contributed to the output error.

Page 5: Topic 3

Multilayer Perceptron.

•The error function of multilayer perceptron:

The best performance of the network corresponds to the minimum of the total squared error, and during the network training, we adjust the weights of connections in order to get to that minimum.

$$E = \frac{1}{2}\sum_{p}\sum_{j} e_{jp}^{2} = \frac{1}{2}\sum_{p}\sum_{j}\big(t_{jp} - F_j(w^{1}_{ji}, w^{2}_{im}, \ldots, w^{s}_{mq}, a)\big)^{2},$$

where e_{jp} = t_{jp} − X_{jp} is the error of output unit j for training pattern p, t_{jp} is the desired (target) output, and the actual network output X_{jp} = F_j(…) is a function of all the weights of connections and the inputs a.
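A minimal sketch of this total squared error in Python/NumPy, assuming the target values and the network outputs are already collected into arrays (patterns × output units); the numbers are made up for illustration:

```python
import numpy as np

def total_squared_error(targets, outputs):
    # E = 1/2 * sum over patterns p and output units j of (t_jp - X_jp)^2
    errors = targets - outputs
    return 0.5 * np.sum(errors ** 2)

# Hypothetical targets and outputs for two training patterns and three output units
t = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
X = np.array([[0.8, 0.1, 0.2], [0.3, 0.7, 0.1]])
print(total_squared_error(t, X))
```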

Page 6: Topic 3

Multilayer Perceptron.

• The combination of the weights, including those of the hidden neurons, which minimises the error function E is considered to be a solution of the multilayer perceptron learning problem.

$$E = \frac{1}{2}\sum_{p}\sum_{j} e_{jp}^{2} = \frac{1}{2}\sum_{p}\sum_{j}\big(t_{jp} - F_j(w^{1}_{ji}, w^{2}_{im}, \ldots, w^{s}_{mq}, a)\big)^{2}$$

Page 7: Topic 3

Multilayer Perceptron.

•The error function of multilayer perceptron:

•The backpropagation algorithm looks for the minimum of the multi-variable error function E in the space of weights of connections w using the method of gradient descent.

$$E = \frac{1}{2}\sum_{p}\sum_{j} e_{jp}^{2} = \frac{1}{2}\sum_{p}\sum_{j}\big(t_{jp} - F_j(w^{1}_{ji}, w^{2}_{im}, \ldots, w^{s}_{mq}, a)\big)^{2}$$

Page 8: Topic 3

Multilayer Perceptron.

• Following calculus, a local minimum of a function of two or more variables is defined by the equality of its gradient to zero:

$$\nabla E = \left(\frac{\partial E}{\partial w^{1}_{ji}},\ \frac{\partial E}{\partial w^{2}_{im}},\ \ldots,\ \frac{\partial E}{\partial w^{s}_{mq}}\right) = 0,$$

where $\partial E / \partial w^{k}_{ht}$ is the partial derivative of the error function E with respect to the weight of connection between the h-th unit in layer k and the t-th unit in the previous layer, number k-1.

Page 9: Topic 3

Multilayer Perceptron.

We would like to go in the direction opposite to ∇E in order to minimise E most rapidly. Therefore, during the iterative process of gradient descent, each weight of connection, including the hidden ones, is updated:

$$w^{k}_{ht,\text{new}} = w^{k}_{ht,\text{old}} + \Delta w^{k}_{ht}$$

using the increment

$$\Delta w^{k}_{ht} = -C\,\frac{\partial E}{\partial w^{k}_{ht}},$$

where C represents the learning rate.
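A tiny sketch of this gradient descent update for a single weight of connection; the learning rate, weight, and partial derivative values are assumed for illustration:

```python
# Gradient descent update for one weight of connection:
# w_new = w_old + delta_w, with delta_w = -C * dE/dw
C = 0.1            # learning rate (assumed value)
w_old = 0.35       # current weight of connection (assumed value)
dE_dw = -0.8       # partial derivative of E w.r.t. this weight (assumed value)

delta_w = -C * dE_dw
w_new = w_old + delta_w
print(w_new)       # the weight moves against the gradient, decreasing E
```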

Page 10: Topic 3

Multilayer Perceptron.

$$E = \frac{1}{2}\sum_{p}\sum_{j}\big(t_{jp} - F_j(w^{1}_{ji}, w^{2}_{im}, \ldots, w^{s}_{mq}, a)\big)^{2}$$

Since calculus-based methods of minimisation rest on the taking of derivatives, their application to network training requires the error function E to be a differentiable function.

$$w^{k}_{ht,\text{new}} = w^{k}_{ht,\text{old}} + \Delta w^{k}_{ht}, \qquad \text{where}\quad \Delta w^{k}_{ht} = -C\,\frac{\partial E}{\partial w^{k}_{ht}}$$

Page 11: Topic 3

Multilayer Perceptron.

Since calculus-based methods of minimisation rest on the taking of derivatives, their application to network training requires the error function E to be a differentiable function, which requires the network output X_{jp} to be differentiable, which requires the activation functions f(S) to be differentiable:

$$E = \frac{1}{2}\sum_{p}\sum_{j}\Big(t_{jp} - f\big(\sum_i w^{l}_{ji}\, X^{l-1}_{ip}\big)\Big)^{2} = \frac{1}{2}\sum_{p}\sum_{j}\Big(t_{jp} - f\big(\sum_i w^{l}_{ji}\, f(S^{l-1}_{ip})\big)\Big)^{2} = \frac{1}{2}\sum_{p}\sum_{j}\Big(t_{jp} - f\big(\sum_i w^{l}_{ji}\, f\big(\sum_m w^{l-1}_{im}\, f(\cdots f(\sum_q w^{1}_{sq}\, a_q)\cdots)\big)\big)\Big)^{2}$$

$$w^{k}_{ht,\text{new}} = w^{k}_{ht,\text{old}} + \Delta w^{k}_{ht}, \qquad \text{where}\quad \Delta w^{k}_{ht} = -C\,\frac{\partial E}{\partial w^{k}_{ht}}$$

Page 12: Topic 3

Multilayer Perceptron.

Since calculus-based methods of minimisation rest on the taking of derivatives, their application to network training requires the error function E to be a differentiable function, which requires the network output X_{jp} to be differentiable, which requires the activation functions f(S) to be differentiable:

$$E = \frac{1}{2}\sum_{p}\sum_{j}\Big(t_{jp} - f\big(\sum_i w^{l}_{ji}\, f\big(\sum_m w^{l-1}_{im}\, f(\cdots f(\sum_q w^{1}_{sq}\, a_q)\cdots)\big)\big)\Big)^{2}$$

This provides a powerful motivation for using continuous and differentiable activation functions f(w,a).

Page 13: Topic 3

Multilayer Perceptron.

Since calculus-based methods of minimisation rest on the taking of derivatives, their application to network training requires the activation functions f(S) to be differentiable.

• To make a multilayer perceptron "able to learn", here is a useful generic sigmoid activation function associated with a hidden or output neuron:

$$f(S) = \frac{1}{1 + e^{-S}}$$

Page 14: Topic 3

Multilayer Perceptron.

Since calculus-based methods of minimisation rest on the taking of derivatives, their application to network training requires the activation functions f(S) to be differentiable.

• The generic sigmoid activation function associated with a hidden or output neuron:

$$f(S) = \frac{1}{1 + e^{-S}}$$

An important thing about the generic sigmoid function is that it is differentiable, with a very simple and easy-to-compute derivative:

$$\frac{df(S)}{dS} = f(S)\,\big(1 - f(S)\big)$$
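A short Python/NumPy sketch of the generic sigmoid and its derivative, with a numerical check of the property f'(S) = f(S)(1 − f(S)); the test points are arbitrary:

```python
import numpy as np

def sigmoid(s):
    # f(S) = 1 / (1 + e^(-S))
    return 1.0 / (1.0 + np.exp(-s))

def sigmoid_derivative(s):
    # df/dS = f(S) * (1 - f(S)): no extra exponentials needed once f(S) is known
    f = sigmoid(s)
    return f * (1.0 - f)

# Compare with a central-difference numerical derivative at a few arbitrary points
s = np.array([-2.0, 0.0, 1.5])
h = 1e-6
numerical = (sigmoid(s + h) - sigmoid(s - h)) / (2 * h)
print(np.allclose(sigmoid_derivative(s), numerical))   # True
```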

Page 15: Topic 3

Multilayer Perceptron.

Since calculus-based methods of minimisation rest on the taking of derivatives, their application to network training requires the activation functions f(S) to be differentiable.

$$f(S) = \frac{1}{1 + e^{-S}}, \qquad \frac{df(S)}{dS} = f(S)\,\big(1 - f(S)\big)$$

If all activation functions f(S) in the network are differentiable, then, according to the chain rule of calculus, by differentiating the error function E with respect to the weight of connection in consideration we can express the corresponding partial derivative of the error function.

Page 16: Topic 3

Multilayer Perceptron.

$$w^{k}_{ht,\text{new}} = w^{k}_{ht,\text{old}} + \Delta w^{k}_{ht}, \qquad \text{where}\quad \Delta w^{k}_{ht} = -C\,\frac{\partial E}{\partial w^{k}_{ht}}$$

Then:

$$\frac{\partial E}{\partial w^{k}_{ht}} = \sum_{p}\sum_{j}\frac{\partial E}{\partial e_{jp}}\,\frac{\partial e_{jp}}{\partial f^{l}_{j}}\,\frac{\partial f^{l}_{j}}{\partial S^{l}_{jp}}\cdots\frac{\partial f^{k}_{h}}{\partial S^{k}_{hp}}\,\frac{\partial S^{k}_{hp}}{\partial w^{k}_{ht}} = -\sum_{p}\sum_{j} e_{jp}\, f'^{\,l}_{j}\Big(\sum_i w^{l}_{ji}\, f'^{\,l-1}_{i}\big(\cdots\sum_r w^{k+1}_{rh}\, f'^{\,k+1}_{r}\cdots\big)\Big)\, f'^{\,k}_{h}\, f^{\,k-1}_{t}$$

Page 17: Topic 3

Multilayer Perceptron.

$$w^{k}_{ht,\text{new}} = w^{k}_{ht,\text{old}} + \Delta w^{k}_{ht}, \qquad \text{where}\quad \Delta w^{k}_{ht} = -C\,\frac{\partial E}{\partial w^{k}_{ht}}$$

$$\frac{\partial E}{\partial w^{k}_{ht}} = -\sum_{p}\sum_{j} e_{jp}\, f'^{\,l}_{j}\Big(\sum_i w^{l}_{ji}\, f'^{\,l-1}_{i}\big(\cdots\sum_r w^{k+1}_{rh}\, f'^{\,k+1}_{r}\cdots\big)\Big)\, f'^{\,k}_{h}\, f^{\,k-1}_{t}$$

Page 18: Topic 3

Multilayer Perceptron.

$$w^{k}_{ht,\text{new}} = w^{k}_{ht,\text{old}} + \Delta w^{k}_{ht}, \qquad \text{where}\quad \Delta w^{k}_{ht} = -C\,\frac{\partial E}{\partial w^{k}_{ht}}$$

$$\frac{\partial E}{\partial w^{k}_{ht}} = -\sum_{p}\sum_{j} e_{jp}\, f'^{\,l}_{j}\Big(\sum_i w^{l}_{ji}\, f'^{\,l-1}_{i}\big(\cdots\sum_r w^{k+1}_{rh}\, f'^{\,k+1}_{r}\cdots\big)\Big)\, f'^{\,k}_{h}\, f^{\,k-1}_{t}$$

Thus, the correction to the hidden weight of connection between the h-th unit in the k-th layer and the t-th unit in the previous (k-1)-th layer can be found as

$$\Delta w^{k}_{ht} = -C\,\frac{\partial E}{\partial w^{k}_{ht}} = C\sum_{p}\sum_{j} e_{jp}\, f'^{\,l}_{j}\Big(\sum_i w^{l}_{ji}\, f'^{\,l-1}_{i}\big(\cdots\sum_r w^{k+1}_{rh}\, f'^{\,k+1}_{r}\cdots\big)\Big)\, f'^{\,k}_{h}\, f^{\,k-1}_{t}$$

Page 19: Topic 3

Multilayer Perceptron Learning rule!!!

$$w^{k}_{ht,\text{new}} = w^{k}_{ht,\text{old}} + \Delta w^{k}_{ht}, \qquad \text{where}\quad \Delta w^{k}_{ht} = C\sum_{p}\sum_{j} e_{jp}\, f'^{\,l}_{j}\Big(\sum_i w^{l}_{ji}\, f'^{\,l-1}_{i}\big(\cdots\sum_r w^{k+1}_{rh}\, f'^{\,k+1}_{r}\cdots\big)\Big)\, f'^{\,k}_{h}\, f^{\,k-1}_{t}$$

The correction Δw^k_{ht} is defined by

• the output layer errors e_{jp},

• the derivatives f'^{p}_{m} of the activation functions of all neurons in the upper layers with numbers p > k,

• the derivative f'^{k}_{h} of the activation function of the neuron h itself in the layer k,

• the activation function f(S^{k-1}_{t}) of the connected neuron t in the previous layer (k-1).

Page 20: Topic 3

Multilayer Perceptron Learning rule!!!

$$w^{k}_{ht,\text{new}} = w^{k}_{ht,\text{old}} + \Delta w^{k}_{ht}, \qquad \text{where}\quad \Delta w^{k}_{ht} = C\sum_{p}\sum_{j} e_{jp}\, f'^{\,l}_{j}\Big(\sum_i w^{l}_{ji}\, f'^{\,l-1}_{i}\big(\cdots\sum_r w^{k+1}_{rh}\, f'^{\,k+1}_{r}\cdots\big)\Big)\, f'^{\,k}_{h}\, f^{\,k-1}_{t}$$

We can easily measure the output errors of the network, and it is up to us to define all the activation functions.

If we also know the derivatives of the activation functions, then we can easily find all the corrections to weights of connections of all neurons in the network, including the hidden ones, during the second run back through the network.
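A rough Python/NumPy sketch of these two runs for a fully connected perceptron of sigmoid units: a forward run storing the outputs and derivatives, then a backward run accumulating the corrections Δw layer by layer. The layer sizes, learning rate, and variable names are assumptions for illustration, not the slides' notation:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def backprop_corrections(x, t, weights, C=0.5):
    """One feedforward run followed by one backward run; returns delta_w per layer."""
    # Forward run: store each layer's output and derivative f' = f * (1 - f)
    outputs = [x]
    derivs = []
    for W in weights:
        f = sigmoid(W @ outputs[-1])
        outputs.append(f)
        derivs.append(f * (1.0 - f))

    # Backward run: propagate the output error back through the layers
    corrections = [None] * len(weights)
    signal = (t - outputs[-1]) * derivs[-1]                  # e_j * f'_j at the output layer
    for k in reversed(range(len(weights))):
        corrections[k] = C * np.outer(signal, outputs[k])    # delta_w = C * signal_h * f_t
        if k > 0:
            signal = (weights[k].T @ signal) * derivs[k - 1] # pass the error to the layer below
    return corrections

# Hypothetical 3-input / 2-hidden / 3-output perceptron
rng = np.random.default_rng(1)
weights = [rng.normal(size=(2, 3)), rng.normal(size=(3, 2))]
deltas = backprop_corrections(np.array([0.5, -1.0, 0.2]), np.array([1.0, 0.0, 0.0]), weights)
print([d.shape for d in deltas])
```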

Page 21: Topic 3

Multilayer Perceptron Training.

The training process of multilayer perceptron consists of two phases.

Initial values of the weights of connections are set up randomly.

Then, during the first, feedforward phase, starting from the input layer and further layer-by-layer, outputs of every unit in the network are computed together with the corresponding derivatives.

Figure: Directions of two basic signal flows in multilayer perceptron: forward propagation of function signals and back-propagation of error signals.

Page 22: Topic 3

Multilayer Perceptron Training.

The training process of the multilayer perceptron consists of two phases. Initial values of the weights of connections are set up randomly. Then, during the first, feedforward phase, starting from the input layer and further layer-by-layer, the outputs of every unit in the network are computed together with the corresponding derivatives.

In the second, feedback phase corrections to all weights of connections of all units including the hidden ones are computed using the outputs and derivatives computed during the feedforward phase.

Figure: Directions of two basic signal flows in multilayer perceptron: forward propagation of function signals and back-propagation of error signals.
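A compact sketch of this two-phase iteration in Python/NumPy. The architecture (3 inputs, 2 hidden units, 3 outputs, chosen here to match the index ranges used in the later slides), the training pattern, target, learning rate, and number of iterations are all assumptions for illustration:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

rng = np.random.default_rng(2)
# Initial weights of connections are set up randomly (3 inputs, 2 hidden, 3 outputs)
W1, W2 = rng.normal(size=(2, 3)), rng.normal(size=(3, 2))
a = np.array([0.5, -1.0, 0.2])        # one training pattern (assumed)
t = np.array([1.0, 0.0, 0.0])         # its target output (assumed)
C = 0.5                               # learning rate (assumed)

for epoch in range(2000):
    # Feedforward phase: outputs and derivatives of every unit, layer by layer
    X1 = sigmoid(W1 @ a);  d1 = X1 * (1 - X1)
    X2 = sigmoid(W2 @ X1); d2 = X2 * (1 - X2)
    # Feedback phase: corrections to all weights from the stored outputs and derivatives
    out_signal = (t - X2) * d2
    hid_signal = (W2.T @ out_signal) * d1
    W2 += C * np.outer(out_signal, X1)
    W1 += C * np.outer(hid_signal, a)

print(np.round(sigmoid(W2 @ sigmoid(W1 @ a)), 3))   # close to the target after training
```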

Page 23: Topic 3

Multilayer Perceptron Training.

To understand the second, error back-propagation phase of computing corrections to the weights, let us follow an example of a small three-layer perceptron.

Page 24: Topic 3

Multilayer Perceptron Training.

To understand the second, error back-propagation phase of computing corrections to the weights, let us follow an example of a small three-layer perceptron.

Suppose that we have found all outputs and corresponding derivatives of activation functions of all computing units including the hidden ones in the network.

Page 25: Topic 3

Multilayer Perceptron Training.

We shall mark values of the layer in consideration, values of the layer previous to the one in consideration,

Page 26: Topic 3

Multilayer Perceptron Training.

Weight of connection between unit number 1 (first lower index) in the output layer (layer number 2 shown as the upper index) and unit number 0 (second lower index) in the previous layer (number 1 = 2-1) after presentation of a training pattern would have a correction

$$\Delta w^{2}_{10} = C\, e_1\, f'^{\,2}_{1}\, f(S^{1}_{0}) = C\, e_1\, f'^{\,2}_{1}\, X^{1}_{0}$$

Page 27: Topic 3

Multilayer Perceptron Training.

Analogously, corrections to all six weights of connections between the output layer and the hidden layer are obtained as

$$\Delta w^{2}_{ji} = C\, e_j\, f'^{\,2}_{j}\, f(S^{1}_{i}) = C\, e_j\, f'^{\,2}_{j}\, X^{1}_{i}, \qquad j = 0, 1, 2;\ i = 0, 1$$
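A small numeric sketch of these output-layer corrections; the learning rate, errors, derivatives, and hidden-layer outputs below are made-up values for illustration:

```python
import numpy as np

# Corrections to the six output-layer weights: delta_w2[j, i] = C * e[j] * f2_prime[j] * X1[i]
# (all numeric values are assumed)
C = 0.5
e = np.array([0.3, -0.1, 0.2])           # output errors e_j, j = 0, 1, 2
f2_prime = np.array([0.25, 0.20, 0.15])  # derivatives f'^2_j at the output units
X1 = np.array([0.6, 0.9])                # hidden-layer outputs X^1_i, i = 0, 1

delta_w2 = C * np.outer(e * f2_prime, X1)
print(delta_w2)          # 3 x 2 array: one correction per output-to-hidden connection
print(delta_w2[1, 0])    # e.g. the correction for w^2_10 from the previous slide
```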

Page 28: Topic 3

Multilayer Perceptron Training. Corrections to hidden units connections.

We shall mark values of the layer in consideration, values of the layer previous to the one in consideration, values of the layers above the one in consideration,

Page 29: Topic 3

Multilayer Perceptron Training. Corrections to hidden units connections.

$$\Delta w^{k}_{ht} = C\sum_{p}\sum_{j} e_{jp}\, f'^{\,l}_{j}\Big(\sum_i w^{l}_{ji}\, f'^{\,l-1}_{i}\big(\cdots\sum_r w^{k+1}_{rh}\, f'^{\,k+1}_{r}\cdots\big)\Big)\, f'^{\,k}_{h}\, f^{\,k-1}_{t}$$

Weight of connection between unit number 1 (first lower index) in the hidden layer (layer number 1 shown in the upper index) and unit number 0 in the previous input layer (second lower index) would have a correction

$$\Delta w^{1}_{10} = C\Big(\sum_{j=0}^{2} e_j\, f'^{\,2}_{j}\, w^{2}_{j1}\Big)\, f'^{\,1}_{1}\, f^{\,0}_{0} = C\Big(\sum_{j=0}^{2} e_j\, f'^{\,2}_{j}\, w^{2}_{j1}\Big)\, f'^{\,1}_{1}\, a_0$$

Page 30: Topic 3

Multilayer Perceptron Training. Corrections to hidden units connections.

Analogously, for all six weights of connections between the hidden layer and the input layer:

$$\Delta w^{1}_{ir} = C\Big(\sum_{j=0}^{2} e_j\, f'^{\,2}_{j}\, w^{2}_{ji}\Big)\, f'^{\,1}_{i}\, f^{\,0}_{r} = C\Big(\sum_{j=0}^{2} e_j\, f'^{\,2}_{j}\, w^{2}_{ji}\Big)\, f'^{\,1}_{i}\, a_r, \qquad i = 0, 1;\ r = 0, 1, 2$$
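A matching numeric sketch for these hidden-layer corrections; again, all the numbers (learning rate, errors, derivatives, weights, inputs) are made-up values:

```python
import numpy as np

# Corrections to the six hidden-layer weights:
# delta_w1[i, r] = C * (sum_j e[j] * f2_prime[j] * w2[j, i]) * f1_prime[i] * a[r]
# (all numeric values are assumed)
C = 0.5
e = np.array([0.3, -0.1, 0.2])           # output errors e_j
f2_prime = np.array([0.25, 0.20, 0.15])  # output-unit derivatives f'^2_j
w2 = np.array([[0.1, -0.4],              # weights w^2_ji between output and hidden layers
               [0.7,  0.2],
               [-0.3, 0.5]])
f1_prime = np.array([0.24, 0.09])        # hidden-unit derivatives f'^1_i
a = np.array([1.0, 0.5, -0.2])           # inputs a_r

back_signal = w2.T @ (e * f2_prime)      # sum_j e_j * f'^2_j * w^2_ji, for each hidden unit i
delta_w1 = C * np.outer(back_signal * f1_prime, a)
print(delta_w1)                          # 2 x 3 array: one correction per hidden-to-input connection
```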

Page 31: Topic 3

Multilayer Perceptron Training. Corrections to hidden units connections.

$$\Delta w^{k}_{ht} = C\sum_{p}\sum_{j} e_{jp}\, f'^{\,l}_{j}\Big(\sum_i w^{l}_{ji}\, f'^{\,l-1}_{i}\big(\cdots\sum_r w^{k+1}_{rh}\, f'^{\,k+1}_{r}\cdots\big)\Big)\, f'^{\,k}_{h}\, f^{\,k-1}_{t}$$

• In this way, going backwards through the network, one obtains the corrections to all weights …

Page 32: Topic 3

Multilayer Perceptron Training. Corrections to hidden units connections.

• In this way, going backwards through the network, one obtains the corrections to all weights …,

• then update the weights.

Page 33: Topic 3

Multilayer Perceptron Training. Corrections to hidden units connections.

• In this way, going backwards through the network, one obtains the corrections to all weights …,

• then update the weights.

• After that, with the new weights, go forward to get new outputs…

Page 34: Topic 3

Multilayer Perceptron Training. Corrections to hidden units connections.

• In this way, going backwards through the network, one obtains the corrections to all weights …,

• then update the weights.

• After that, with the new weights, go forward to get new outputs…

• Find new error, go backwards, and so on…

Page 35: Topic 3

Multilayer Perceptron Training.

• In this way, going backwards through the network, one obtains the corrections to all weights …, then update the weights.

• After that, with the new weights, go forward to get new outputs…

• Find new error, go backwards, and so on…

• Hopefully, sooner or later the iterative procedure will come to an output with the minimum error, i.e. the absolute minimum of the error function E.

Page 36: Topic 3

Multilayer Perceptron Training.

• In this way, going backwards through the network, one obtains the corrections to all weights …, then update the weights. After that, with the new weights, go forward to get new outputs… Find new error, go backwards, and so on…

• Hopefully, sooner or later the iterative procedure will come to an output with the minimum error, i.e. the absolute minimum of the error function E.

• Unfortunately, as a function of many variables, the error function might have more than one minimum, and one may end up not at the absolute minimum but at a relative one.

Page 37: Topic 3

Multilayer Perceptron Training.

• If that happens, the error function stops decreasing regardless of the number of iterations.

• Some measures must be taken to get out of the relative minimum of the function, for example, adding small random values, i.e. "noise", to one or more of the weights.

• Then the iterative procedure starts from that new point, to get to the absolute minimum eventually.

• Unfortunately, as a function of many variables, the error function might have more than one minimum, and one may end up not at the absolute minimum but at a relative one.
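One possible sketch of the "noise" measure described above: if the error has stopped decreasing, perturb the weights with small random values and let the iterations continue. The plateau test, noise scale, and weight shapes are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

def maybe_add_noise(weights, error_history, window=20, tolerance=1e-6, scale=0.05):
    """If the error has stopped decreasing over the last `window` iterations,
    add small random values ("noise") to the weights to escape a relative minimum."""
    if len(error_history) >= window and \
       error_history[-window] - error_history[-1] < tolerance:
        return [W + rng.normal(scale=scale, size=W.shape) for W in weights]
    return weights

# Example: a flat error history triggers the perturbation
weights = [np.zeros((2, 3)), np.zeros((3, 2))]
history = [0.42] * 25
print(maybe_add_noise(weights, history)[0])   # weights are no longer all zeros
```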

Page 38: Topic 3

Multilayer Perceptron Training.

• Finally, after successful training, the perceptron is able to produce the desired responses to all input patterns of the training set.

Page 39: Topic 3

Multilayer Perceptron Training.

• Finally, after successful training, the perceptron is able to produce the desired responses to all input patterns of the training set. • Then all the network weights of connections are fixed,

Page 40: Topic 3

Multilayer Perceptron Training.

• Finally, after successful training, the perceptron is able to produce the desired responses to all input patterns of the training set. • Then all the network weights of connections are fixed,

and the network is presented with inputs it must “recognise”, i.e. not the training set inputs.

Page 41: Topic 3

Multilayer Perceptron Training.

• Finally, after successful training, the perceptron is able to produce the desired responses to all input patterns of the training set. • Then all the network weights of connections are fixed,

and the network is presented with inputs it must “recognise”, i.e. not the training set inputs.

• If an input in consideration produces an output similar to one of the training set, such input is said to belong to the same type or cluster of inputs as the corresponding one of the training set.

Page 42: Topic 3

Multilayer Perceptron Training.

• Then all the network weights of connections are fixed,

and the network is presented with inputs it must “recognise”, i.e. not the training set inputs.

• If an input in consideration produces an output similar to one of the training set, such input is said to belong to the same type or cluster of inputs as the corresponding one of the training set.

• If the network produces an output not similar to any of the training set, then such an input is said not to have been recognised.

Page 43: Topic 3

Multilayer Perceptron Training. Conclusion.

• In 1969 Minsky and Papert not only found the solution to the XOR problem in the form of a multilayer perceptron, they also gave a very thorough mathematical analysis of the time it takes to train such networks.

• Minsky and Papert emphasized that training times increase very rapidly for certain problems as the number of input lines and weights of connections increases.

Page 44: Topic 3

Multilayer Perceptron Training. Conclusion.

• Minsky and Papert emphasized that training times increase very rapidly for certain problems as the number of input lines and weights of connections increases.

• The difficulties were seized upon by opponents of the subject. In particular, this was true of those working in the field of artificial intelligence (AI), who at that time did not want to concern themselves with the underlying "wetware" of the brain, but only with the functional aspects, regarded by them solely as logical processing.

• Due to the limitations of funding, competition between the AI and neural network communities could have only one victor.

Page 45: Topic 3

Multilayer Perceptron Training. Conclusion.

• Due to the limitations of funding, competition between the AI and neural network communities could have only one victor.

• Neural networks then went into a relative quietude for more than fifteen years, with only a few devotees still working on them.

Page 46: Topic 3

Multilayer Perceptron Training. Conclusion.

• Due to the limitations of funding, competition between the AI and neural network communities could have only one victor.

• Neural networks then went into a relative quietude for more than fifteen years, with only a few devotees still working on them.

• Then new vigour came from various sources. One was the increasing power of computers, allowing simulations of otherwise intractable problems.

Page 47: Topic 3

Multilayer Perceptron Training. Conclusion.

• New vigour came from various sources. One was the increasing power of computers, allowing simulations of otherwise intractable problems.

• Finally, the backpropagation algorithm, established by the mid-80s, solved the difficulty of training hidden neurons.

Page 48: Topic 3

Multilayer Perceptron Training. Conclusion.

• New vigour came from various sources. One was the increasing power of computers, allowing simulations of otherwise intractable problems.

• Finally, the backpropagation algorithm, established by the mid-80s, solved the difficulty of training hidden neurons.

• Nowadays, the perceptron is an effective tool for recognising protein and amino-acid sequences and processing other complex biological data.