

Lec4: Multilayer Neural Networks 1

Multilayer Neural Networks

Prof. Daniel Yeung

School of Computer Science and Engineering

South China University of Technology

Lecture 4, Pattern Recognition

Lec4: Multilayer Neural Networks 2

Outline

• Introduction (6.1)
• Artificial Neural Network (6.2)
• Multiple Layer Perceptron NN (6.2)
• Back-Propagation Algorithm (6.3)
• Regularization (6.11)
• Relating NN and Bayes Theory (6.6)
• Practical Techniques (6.8)


Lec4: Multilayer Neural Networks 3

Introduction

• Recall, Linear Discriminant Functions:
  • Limited generalization capability
  • Cannot handle non-linearly separable problems

[Figure: a linear discriminant network. Inputs x1…x5 connect through weights (w11, w12, …) to two summation units producing g1 and g2; there is only an input layer and an output layer.]

Lec4: Multilayer Neural Networks 4

Introduction

• Solution 1: Mapping Function φ(x)
  • Pro: Simple structure (still uses an LDF)
  • Con: Selection of φ(x) and its parameters
  • Already discussed in Lecture 03

[Figure: the same two-output linear structure, but the inputs are first passed through the mapping φ, i.e. φ(x1)…φ(x5) feed the summation units g1 and g2.]


Lec4: Multilayer Neural Networks 5

Introduction

• Solution 2: Multi-Layer Neural Network
  • No need to choose the nonlinear mapping φ(x), and no need to have any prior knowledge relevant to the classification problem.

[Figure: a multilayer network. Inputs x1…x5 connect through weights (w11, w12, w13, …) to hidden layers, whose outputs feed two summation units producing g1 and g2.]

Lec4: Multilayer Neural Networks 6

Multi-Layer Neural Network

(Multilayer Perceptrons)
• Has one or more hidden layers
• The hidden layers serve as a mapping function
• Will be introduced in this lecture


Lec4: Multilayer Neural Networks 7

Artificial Neural Network (ANN)

• A very simplified model of the brain
• Basically a function approximator
  • Transforms inputs into outputs to the best of its ability
[Figure: the human brain and an artificial NN, each drawn as a box mapping input to output.]

Lec4: Multilayer Neural Networks 8

Artificial Neural Network (ANN)

• Composed of neurons which cooperate together
[Figure: a single neuron. Inputs I1, I2, …, Id arrive over synapses with weights w1, w2, …, wd; the neuron applies threshold Ө and activation function f to produce its output.]


Lec4: Multilayer Neural Networks 9

Artificial Neural Network (ANN)

Lec4: Multilayer Neural Networks 10

Artificial Neural Network (ANN)

How does a neuron work?
• The output of a neuron is a function of the weighted sum of its inputs plus an optional bias unit, which always emits a value of 1 or −1 and acts as a threshold.

output = f(w1I1 + w2I2 + … + wdId + bias)
[Figure: the neuron of the previous slide with an extra bias input; f is the activation function.]
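To make the weighted-sum-plus-bias computation concrete, here is a minimal sketch of a single neuron in Python/NumPy. The input values and the choice of a sigmoid for f are illustrative assumptions, not part of the slides.

```python
import numpy as np

def sigmoid(x):
    # One possible activation function f
    return 1.0 / (1.0 + np.exp(-x))

def neuron_output(I, w, bias, f=sigmoid):
    """Compute f(w1*I1 + w2*I2 + ... + wd*Id + bias) for a single neuron."""
    return f(np.dot(w, I) + bias)

I = np.array([0.5, -1.2, 3.0])     # inputs I1..Id
w = np.array([0.8, 0.1, -0.4])     # synaptic weights w1..wd
print(neuron_output(I, w, bias=0.2))
```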

Page 6: Multilayer Neural Networksmlclab.org/PR/notes/Lecture04-MultilayerNeuralNetworks2.pdfLec4: Multilayer Neural Networks 5Introduction Solution 2: Multi-LayerNeural Network No need to

Lec4: Multilayer Neural Networks 11

Artificial Neural Network (ANN)

Activation Function

• The function (f) is the activation function
• Examples:
  • Linear Function
    • Output is the same as input
    • Differentiable
    • f(x) = x
  • Sign Function
    • Decision making
    • Not differentiable
    • f(x) = +1 if x > 0, −1 if x < 0
  • Sigmoid Function
    • Smooth, continuous, and monotonically increasing (differentiable)
    • f(x) = 1 / (1 + e^−x)
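The three activation functions above fit in a few lines of NumPy; this is a small illustrative sketch rather than code from the lecture (the sign function here maps 0 to −1, a boundary case the slide leaves unspecified).

```python
import numpy as np

def linear(x):
    # f(x) = x, differentiable
    return x

def sign(x):
    # f(x) = +1 if x > 0, -1 if x < 0; not differentiable
    return np.where(x > 0, 1.0, -1.0)

def sigmoid(x):
    # f(x) = 1 / (1 + e^-x), smooth and monotonically increasing
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, 0.5, 3.0])
print(linear(x), sign(x), sigmoid(x))
```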

Lec4: Multilayer Neural Networks 12

Artificial Neural Network (ANN)

XOR Example

y1 = sgn(x1 + x2 + 0.5)
y2 = sgn(x1 + x2 − 1.5)
y  = sgn(0.7·y1 − 0.4·y2 − 1)
[Figure: the two-input XOR network. Inputs x1, x2 and a bias feed the two hidden units y1 and y2, which feed the output unit y.]

Page 7: Multilayer Neural Networksmlclab.org/PR/notes/Lecture04-MultilayerNeuralNetworks2.pdfLec4: Multilayer Neural Networks 5Introduction Solution 2: Multi-LayerNeural Network No need to

Lec4: Multilayer Neural Networks 13

Artificial Neural Network (ANN)

XOR Example

y1 = sgn(x1 + x2 + 0.5)
y2 = sgn(x1 + x2 − 1.5)
y  = sgn(0.7·y1 − 0.4·y2 − 1)

x1 = 1 x2 = 1

y1 = sgn(1+1+0.5) = sgn(2.5) = 1

y2 = sgn(1+1-1.5) = sgn(0.5) = 1

y = sgn(0.7 − 0.4 − 1) = sgn(−0.7) = −1

x1 = -1 x2 = -1

y1 = sgn(-1-1+0.5 ) = sgn(-1.5) = -1

y2 = sgn(-1-1-1.5 ) = sgn(-3.5) = -1

y = sgn(-0.7 + 0.4 – 1) = sgn(-1.3) = -1

Lec4: Multilayer Neural Networks 14

Artificial Neural Network (ANN)

XOR Example
• The first hidden unit implements an OR gate
• The second hidden unit implements an AND gate
• The final output unit implements an AND NOT gate:
  • y = y1 AND NOT y2 = (x1 OR x2) AND NOT (x1 AND x2) = x1 XOR x2

y1 = sgn(x1 + x2 + 0.5)
y2 = sgn(x1 + x2 − 1.5)
y  = sgn(0.7·y1 − 0.4·y2 − 1)
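As a quick check of the XOR construction, the sketch below evaluates the two hidden units and the output unit on all four ±1 input pairs (illustrative code, not from the slides).

```python
def sgn(v):
    return 1 if v > 0 else -1

def xor_net(x1, x2):
    y1 = sgn(x1 + x2 + 0.5)              # OR gate
    y2 = sgn(x1 + x2 - 1.5)              # AND gate
    return sgn(0.7 * y1 - 0.4 * y2 - 1)  # y1 AND NOT y2  ->  XOR

for x1 in (-1, 1):
    for x2 in (-1, 1):
        print(x1, x2, "->", xor_net(x1, x2))   # -1, +1, +1, -1
```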


Lec4: Multilayer Neural Networks 15

Artificial Neural Network (ANN)

• Structure of ANN:
  • A simple three-layer neural network
    • Input Layer: 2 input units
    • Hidden Layer: 3 hidden units
    • Output Layer: 2 output units
[Figure: inputs x1, x2 connect through weights w11, w21, w31, … to the three hidden units, which feed the two output units g1 and g2.]

Lec4: Multilayer Neural Networks 16

[Figure: the same 2-3-2 network as above, annotated for the fish example.]
Illustrative example:
• x1 = length of salmon/sea bass
• x2 = lightness of salmon/sea bass
• wij = weights assigning the importance of each input to a neuron
• Top hidden neuron = length discriminant function
• Middle hidden neuron = combined length-and-lightness discriminant function
• Bottom hidden neuron = lightness discriminant function
• g1, g2 = final outputs


Lec4: Multilayer Neural Networks 17

Artificial Neural Network (ANN)

net_j = Σ_{m=1..d} w_jm · x_m        y_j = f(net_j)
net'_k = Σ_{j=1..n_H} w'_kj · y_j     g_k = f(net'_k)

g_k = f( Σ_{j=1..n_H} w'_kj · f( Σ_{m=1..d} w_jm · x_m ) )

[Figure: inputs x1, x2 feed the hidden units (weights w11, w21, w31, …, activations f, giving net_j and y_j); the hidden units feed the output units g1, g2 (weights w'11, w'21, …, giving net'_k).]
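A direct transcription of these feed-forward equations into NumPy might look like the sketch below; the weight-matrix shapes and the sigmoid choice for f are assumptions made only for illustration.

```python
import numpy as np

def f(x):
    # activation function (sigmoid assumed)
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, W_ih, W_ho):
    """x: (d,), W_ih: (n_H, d), W_ho: (c, n_H); returns g: (c,)."""
    net_hidden = W_ih @ x      # net_j = sum_m w_jm * x_m
    y = f(net_hidden)          # y_j = f(net_j)
    net_out = W_ho @ y         # net'_k = sum_j w'_kj * y_j
    return f(net_out)          # g_k = f(net'_k)

rng = np.random.default_rng(0)
x = np.array([0.3, -0.7])                    # d = 2 inputs
W_ih = rng.uniform(-1, 1, size=(3, 2))       # 3 hidden units
W_ho = rng.uniform(-1, 1, size=(2, 3))       # 2 outputs g1, g2
print(forward(x, W_ih, W_ho))
```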

Lec4: Multilayer Neural Networks 18

Artificial Neural Network (ANN)

• A two-layer network classifier can only implement a linear decision boundary
• Three-, four- and higher-layer networks can implement arbitrary decision boundaries
• The decision regions need not be convex, nor simply connected


Lec4: Multilayer Neural Networks 19

Multiple Layer Perceptron NN (MLPNN)

• The most common NN
• More than one layer
• Sigmoid is used as the activation function
• A general function approximator
  • Not limited to linear problems
[Figure: an MLPNN with an input layer, multiple sigmoid hidden layers, and an output layer.]

Lec4: Multilayer Neural Networks 20

Multiple Layer Perceptron NN (MLPNN)

• Example (figure only)


Lec4: Multilayer Neural Networks 21

Training: Weight Determination

• Weights can be determined by training
  • Reduce the error between the desired outputs and the NN outputs on the training samples
• The back-propagation algorithm is the most widely used method for determining the weights
  • A natural extension of the LMS algorithm
• Pros:
  • Simple and general method
• Cons:
  • Slow, and may become trapped at local minima

Lec4: Multilayer Neural Networks 22

Back-Propagation (BP) Algorithm

• The calculation of the derivative flows backwards through the network; hence, it is called back-propagation
• These derivatives point in the direction of the maximum increase of the error function (find where the largest error is being made and go back to try to decrease it)
• A small step (learning rate) in the opposite direction gives the maximum decrease of the (local) error function:
  w' = w − α ∂E/∂w
  where α is the learning rate and E is the error function


Lec4: Multilayer Neural Networks 23

Back-Propagation (BP) Algorithm
• The most common measure of error is the mean square error:
  J = (target − output)² / 2
• The update rule for a weight is
  w(k+1) = w(k) − η ∂J/∂w
  where η is the learning rate, which controls the size of each step
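Once the gradient is available, the update rule is a single line of code. The sketch below applies it to a one-output toy problem with J = (target − output)²/2 and a simple linear unit; the data and learning rate are illustrative assumptions.

```python
import numpy as np

eta = 0.1                          # learning rate
w = np.array([0.0, 0.0])           # weights of a single linear unit (illustration)
x = np.array([1.0, 2.0])
target = 1.0

for k in range(20):
    output = w @ x
    grad = -(target - output) * x  # dJ/dw for J = (target - output)^2 / 2
    w = w - eta * grad             # w(k+1) = w(k) - eta * dJ/dw
print(w, w @ x)                    # the output approaches the target
```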

Lec4: Multilayer Neural Networks 24

Back-Propagation (BP) Algorithm

• The next slides show BP for a 3-layer NN
• Two types of weights:
  • Hidden-to-Output: w_ho
  • Input-to-Hidden: w_ih
[Figure: a 3-layer network with inputs x1, x2, input-to-hidden weights w_ih, hidden-to-output weights w_ho, and outputs g1, g2.]


Lec4: Multilayer Neural Networks 25

Back-Propagation (BP) Algorithm

3-Layer NN
• The learning rule for the hidden-to-output units:
  ∂J/∂w_kj = (∂J/∂output_k) · (∂output_k/∂net_k) · (∂net_k/∂w_kj)
  ∂J/∂output_k = −(target_k − output_k)
  ∂output_k/∂net_k = f'(net_k)
  ∂net_k/∂w_kj = y_j
  where net_k = Σ_{j=1..n_H} w_kj · y_j and output_k = f(net_k)

Lec4: Multilayer Neural Networks 26

Back-Propagation (BP) Algorithm

3-Layer NN
• The learning rule for the input-to-hidden units:
  ∂J/∂w_ji = (∂J/∂y_j) · (∂y_j/∂net_j) · (∂net_j/∂w_ji)
  ∂y_j/∂net_j = f'(net_j)
  ∂net_j/∂w_ji = ∂(Σ_{m=1..d} w_jm · x_m)/∂w_ji = x_i
  ∂J/∂y_j = ∂/∂y_j [ ½ Σ_{k=1..c} (target_k − output_k)² ]
          = −Σ_{k=1..c} (target_k − output_k) · ∂output_k/∂y_j
          = −Σ_{k=1..c} (target_k − output_k) · (∂output_k/∂net_k) · (∂net_k/∂y_j)
          = −Σ_{k=1..c} (target_k − output_k) · f'(net_k) · w_kj
  where y_j = f(net_j) and net_j = Σ_{m=1..d} w_jm · x_m
  (the sum runs over all c output units output_1, …, output_c, since each of them depends on y_j)


Lec4: Multilayer Neural Networks 27

Back-Propagation (BP) Algorithm

3-Layer NN
• Summary
  • Hidden-to-Output Weight:
    ∂J/∂w_kj = −(target_k − output_k) · f'(net_k) · y_j
  • Input-to-Hidden Weight:
    ∂J/∂w_ji = −x_i · f'(net_j) · Σ_{k=1..c} (target_k − output_k) · f'(net_k) · w_kj
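The two summary gradients translate into one training step for a 3-layer network as sketched below; the sigmoid for f (so that f'(net) = f(net)(1 − f(net))) and the matrix shapes are assumptions for illustration, not code from the lecture.

```python
import numpy as np

def f(x):
    return 1.0 / (1.0 + np.exp(-x))

def f_prime(net):
    s = f(net)
    return s * (1.0 - s)

def bp_step(x, target, W_ih, W_ho, eta=0.1):
    # Forward pass
    net_h = W_ih @ x                       # net_j
    y = f(net_h)                           # y_j
    net_o = W_ho @ y                       # net_k
    out = f(net_o)                         # output_k

    # dJ/dw_kj = -(target_k - output_k) f'(net_k) y_j
    delta_o = (target - out) * f_prime(net_o)
    grad_ho = -np.outer(delta_o, y)

    # dJ/dw_ji = -x_i f'(net_j) sum_k (target_k - output_k) f'(net_k) w_kj
    delta_h = f_prime(net_h) * (W_ho.T @ delta_o)
    grad_ih = -np.outer(delta_h, x)

    # Gradient-descent update: w <- w - eta * dJ/dw
    return W_ih - eta * grad_ih, W_ho - eta * grad_ho

rng = np.random.default_rng(1)
W_ih = rng.uniform(-1, 1, (3, 2))
W_ho = rng.uniform(-1, 1, (2, 3))
W_ih, W_ho = bp_step(np.array([0.5, -0.5]), np.array([1.0, 0.0]), W_ih, W_ho)
```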

Lec4: Multilayer Neural Networks 28

Back-Propagation (BP) Algorithm

Training Algorithm

• For the same training set, the weights of the NN can be updated differently by presenting the training samples in different sequences
• There are two popular methods:
  • Stochastic training
  • Batch training


Lec4: Multilayer Neural Networks 29

Back-Propagation (BP) Algorithm

Training Algorithm
• Stochastic training
  • Patterns are chosen randomly from the training set
  • Network weights are updated after each (randomly chosen) pattern is presented

Lec4: Multilayer Neural Networks 30

Back-Propagation (BP) Algorithm

Training Algorithm
• Batch training
  • All patterns are presented to the network before learning (the weight update) takes place
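The two schemes differ only in when the update is applied. The skeleton below contrasts them on a toy linear unit with per-sample error J = (t − w·x)²/2; the data, unit, and learning rate are assumptions chosen for illustration.

```python
import numpy as np

def grad_sample(w, x, t):
    # Per-sample gradient of J = (t - w.x)^2 / 2 for a linear unit (illustration)
    return -(t - w @ x) * x

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # toy training set
T = np.array([1.0, -1.0, 0.0])
eta = 0.1

# Stochastic training: update after each randomly chosen pattern
w_sto = np.zeros(2)
rng = np.random.default_rng(0)
for _ in range(100):
    i = rng.integers(len(X))
    w_sto -= eta * grad_sample(w_sto, X[i], T[i])

# Batch training: present all patterns, then update once per epoch
w_bat = np.zeros(2)
for _ in range(100):
    total_grad = sum(grad_sample(w_bat, x, t) for x, t in zip(X, T))
    w_bat -= eta * total_grad

print(w_sto, w_bat)   # both approach the same least-squares weights
```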


Lec4: Multilayer Neural Networks 31

Back-Propagation (BP) Algorithm

Training Algorithm
• Is a classifier with a smaller training error always better?
  • In most cases, NO!
  • We discussed this issue in Lecture 01
• For example:
  [Figure: stop training when the testing error, measured on unseen samples, reaches a minimum.]

Lec4: Multilayer Neural Networks 32

Regularization

• In Lec03, we mentioned that in most cases the solution (discriminant function) is not unique (an ill-posed problem)
• Which one is the best?
  • Is it enough to minimize the training error?
  • Is the classifier too complex?
  • Does it have good generalization ability?
• Minimizing only the empirical risk R_emp (the training error) leads to the overfitting problem


Lec4: Multilayer Neural Networks 33

Regularization
• Regularization is one of the methods to handle this problem
  • Add a regularization term to the objective function:
    Minimize: R_emp + λ·ψ(f)
    (training error + tradeoff × regularization term)
  • ψ(f) measures the "smoothness" of the decision plane
  • The tradeoff parameter λ controls the relative importance of training accuracy and the regularization term
• Seek a smooth classifier with good performance on the training set
  • May sacrifice training error for the simplicity of the classifier if necessary

Lec4: Multilayer Neural Networks 34

Regularization

λ: regularization parameter;  ψ: regularization function
Minimize: R_emp + λ·ψ(f)
• λ = 0: Same as the traditional training objective function; the regularization term has no effect
• ∞ > λ > 0: If we can find a suitable λ, we may find an f with good generalization ability
• λ → ∞: Dominated by the regularization term; the smoothest classifier is found


Lec4: Multilayer Neural Networks 35

Regularization

Weight Decay
• A well-known regularization example
• The regularization term measures the magnitude of the weights:
  ψ(f) = ||w||₂²
  • Smaller weights → smoother
• The objective function becomes
  Minimize: R_emp + λ·||w||₂²
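With weight decay, the gradient of the objective simply gains a 2λw term. A minimal sketch on the same kind of toy linear-unit problem (my choice of example, with an arbitrary λ) follows.

```python
import numpy as np

lam, eta = 0.01, 0.1
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
T = np.array([1.0, -1.0, 0.0])

w = np.zeros(2)
for _ in range(200):
    # Gradient of R_emp = sum_i (t_i - w.x_i)^2 / 2
    grad_emp = sum(-(t - w @ x) * x for x, t in zip(X, T))
    # Gradient of the regularization term lam * ||w||^2 is 2 * lam * w
    w -= eta * (grad_emp + 2.0 * lam * w)
print(w)   # slightly shrunk toward zero compared with the unregularized solution
```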

Lec4: Multilayer Neural Networks 36

NN and Bayes Theory

• Recall, Bayes formula:
  P(ω_k | x) = p(x | ω_k) P(ω_k) / Σ_{i=1..c} p(x | ω_i) P(ω_i) = p(x, ω_k) / p(x)
• Suppose a network is trained using the following target output setting:
  target_k(x) = 1 if x ∈ ω_k, 0 otherwise


Lec4: Multilayer Neural Networks 37

NN and Bayes Theory
• When the number of training samples tends to infinity (see p. 304 in the text), using the mean square error for J:
  lim_{n→∞} (1/n) J(w) = lim_{n→∞} (1/n) Σ_x [g_k(x; w) − target_k]²
    = P(ω_k) ∫ [g_k(x; w) − 1]² p(x | ω_k) dx + P(ω_{i≠k}) ∫ [g_k(x; w)]² p(x | ω_{i≠k}) dx
    = ∫ g_k²(x; w) p(x) dx − 2 ∫ g_k(x; w) p(x, ω_k) dx + ∫ p(x, ω_k) dx
    = ∫ [g_k(x; w) − P(ω_k | x)]² p(x) dx + ∫ P(ω_k | x) P(ω_{i≠k} | x) p(x) dx
  (the first term depends on w; the second term is independent of w)
• When we minimize J(w) with respect to w, the following term is therefore minimized:
  ∫ [g_k(x; w) − P(ω_k | x)]² p(x) dx
• The trained network will approximate the posterior probability:
  g_k(x; w) ≅ P(ω_k | x)

Lec4: Multilayer Neural Networks 38

NN and Bayes Theory

• Thus, when MLPNNs are trained via back-propagation on a sum-squared error criterion, they provide a least-squares fit to the Bayes discriminant function, i.e.
  g_k(x; w) ≅ P(ω_k | x)


Lec4: Multilayer Neural Networks 39

Practical Techniques

• How to design an MLPNN to handle a given classification problem?
[Figure: a training set of input vectors (x1, …, xn) with desired outputs (y1, …, yn) feeding an MLP whose structure and parameters are still to be determined.]

Lec4: Multilayer Neural Networks 40

Practical Techniques

• Must consider the following issues:
  • Scaling input
  • Target values
  • Number of Hidden Layers
  • Number of Hidden Units
  • Initializing Weights
  • Learning Rates
  • Momentum
  • Weight Decay
  • Stochastic and Batch Training
  • Stopped Training


Lec4: Multilayer Neural Networks 41

Practical Techniques

Scaling input
• Features of different natures will have different properties (e.g. range, mean, …)
• For example: Fish
  • Mass (grams) and Length (meters)
  • Normally the value of the mass will be orders of magnitude larger than that of the length
  • During training, the network will adjust the weights from the "mass" input unit far more than those from the "length" input
  • The error will hardly depend upon the tiny length values
  • The situation is reversed when mass is in kilograms and length in millimeters

Lec4: Multilayer Neural Networks 42

Practical Techniques

Scaling input
• How can this influence be reduced?
  • Normalization (Standardization)
  • Standardize the training samples so that they have
    • The same range (e.g. 0 to 1 or −1 to 1)
    • The same variance (e.g. 1)
    • The same average (e.g. 0)
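Standardizing to zero mean and unit variance is one line per statistic; the sketch below (illustrative numbers for the fish example) computes the statistics on the training set and, as is common practice though not stated on the slide, reuses them for new samples.

```python
import numpy as np

X_train = np.array([[900.0, 0.32],     # mass in grams, length in meters
                    [1200.0, 0.45],
                    [700.0, 0.28]])

mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

X_train_std = (X_train - mean) / std   # zero mean, unit variance per feature

x_new = np.array([1000.0, 0.40])
x_new_std = (x_new - mean) / std       # scale new samples with the training statistics
print(X_train_std, x_new_std)
```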


Lec4: Multilayer Neural Networks 43

Practical Techniques

Target Value
• Usually a one-of-c representation for the target vector is used
  • For a four-class problem, four outputs will be used:
    • ω1 = (1, −1, −1, −1) or (1, 0, 0, 0)
    • ω2 = (−1, 1, −1, −1) or (0, 1, 0, 0)
    • ω3 = (−1, −1, 1, −1) or (0, 0, 1, 0)
    • ω4 = (−1, −1, −1, 1) or (0, 0, 0, 1)
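Generating one-of-c target vectors is straightforward; the sketch below uses the 0/1 encoding (the 0-based class index is my convention here).

```python
import numpy as np

def one_of_c(label, c):
    """Return the one-of-c target vector for class index `label` (0-based)."""
    t = np.zeros(c)
    t[label] = 1.0
    return t

print(one_of_c(2, 4))   # class omega_3 of a four-class problem -> [0. 0. 1. 0.]
```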

Lec4: Multilayer Neural Networks 44

Practical Techniques

Number of Hidden Layers
• The BP algorithm works well for NNs with many hidden layers, as long as the units are differentiable
• How many hidden layers are enough?
• NNs with more hidden layers:
  • Easier to learn translations
  • Some functions can be implemented more efficiently
  • However, more undesirable local minima and more complexity
• Since any arbitrary function can be approximated by an MLP with one hidden layer, a 3-layer NN is usually recommended. Special problem conditions or requirements may justify the use of more than 3 layers.


Lec4: Multilayer Neural Networks 45

Practical Techniques

Number of Hidden Units
• Governs the expressive power of the NN (e.g. for facial recognition, hidden neurons for mouth, nose, ear, eye, face shape, etc.)
  • Well-separated or linearly separable samples: few hidden units
  • Complicated problem: more hidden units
[Figure: error plotted against the number of hidden units (nH) and the number of weights (nw); one study shows the minimum error occurs for networks with 4-5 hidden units and 17-21 weights.]

Lec4: Multilayer Neural Networks 46

Practical Techniques

Number of Hidden Units
• How do we determine the number of hidden units (nH)?
  • nH determines the total number of weights in the net; thus we should not have more weights than the total number of training points (n)
  • Without further information, nH cannot be determined before training
• Experimentally:
  • Choose nH such that the total number of weights in the net is roughly n/10 (see the sketch below)
  • Adjust the complexity of the network in response to the training data, for example:
    • Start with a "large" value of nH
    • Prune or eliminate weights
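The n/10 rule of thumb can be solved for nH once the input and output dimensions are fixed. The sketch below does this for a single-hidden-layer net; counting bias weights in the total is my assumption.

```python
def hidden_units_for(n_train, d, c):
    """Pick n_H so that the total weight count is roughly n_train / 10
    for a d-input, c-output net with one hidden layer.
    Total weights = n_H*(d + 1) + c*(n_H + 1), biases included (assumption)."""
    target_weights = n_train / 10.0
    n_h = (target_weights - c) / (d + 1 + c)
    return max(1, round(n_h))

print(hidden_units_for(n_train=1000, d=10, c=2))   # about 8 hidden units
```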


Lec4: Multilayer Neural Networks 47

Practical Techniques

Initializing Weights
• In setting the weights in a given layer, we choose them randomly from a single distribution to help ensure uniform learning
• If w = 0 initially, learning can never start
• We work with standardized data, so we choose both positive and negative weights

Lec4: Multilayer Neural Networks 48

Practical Techniques
Initializing Weights
• If w is initially too small, the net activation of a hidden unit will be small and the linear model will be implemented
• If w is initially too large, the hidden unit may saturate (the sigmoid output is always 0 or 1) even before learning begins
  net_j = Σ_{m=1..d} w_jm · x_m,   y_j = f(net_j)
[Figure: the sigmoid f(x) = 1 / (1 + e^−x) with saturated regions at both ends and an approximately linear region in the middle.]


Lec4: Multilayer Neural Networks 49

Practical Techniques

Initializing Weights
• We set w such that the net activation at a hidden unit is in the range −1 < net_j < +1, since net_j ≈ ±1 are the limits of its linear range
• Input-to-Hidden (d inputs):
  −1/√d < w_ji < +1/√d
• Hidden-to-Output (the fan-in is n_H):
  −1/√n_H < w_kj < +1/√n_H
[Figure: the sigmoid f(x) = 1 / (1 + e^−x).]
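The fan-in-based ranges translate directly into uniform sampling; here is a sketch with assumed layer sizes.

```python
import numpy as np

d, n_H, c = 10, 5, 2                     # assumed layer sizes
rng = np.random.default_rng(42)

# Input-to-hidden weights: uniform in (-1/sqrt(d), +1/sqrt(d))
W_ih = rng.uniform(-1 / np.sqrt(d), 1 / np.sqrt(d), size=(n_H, d))

# Hidden-to-output weights: uniform in (-1/sqrt(n_H), +1/sqrt(n_H))
W_ho = rng.uniform(-1 / np.sqrt(n_H), 1 / np.sqrt(n_H), size=(c, n_H))

print(W_ih.min(), W_ih.max(), W_ho.min(), W_ho.max())
```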

Lec4: Multilayer Neural Networks 50

Practical Techniques

Learning Rates

• In principle, a small learning rate ensures convergence
  • Its value determines only the learning speed, not the final weight values themselves
• However, in practice, because networks are rarely fully trained to a training-error minimum, the learning rate can affect the quality of the final network


Lec4: Multilayer Neural Networks 51

Practical Techniques

Learning Rates
• The optimal learning rate is the one which leads to the local error minimum in one learning step
• It satisfies
  (∂²J/∂w²) · Δw = ∂J/∂w
  so the optimal rate is
  η_opt = (∂²J/∂w²)⁻¹

Lec4: Multilayer Neural Networks 52

Practical Techniques

Learning Rates

[Figure: error curves for different learning rates. A rate smaller than optimal converges slowly; the optimal rate converges in one step; a somewhat larger rate oscillates but slowly converges; a rate that is too large diverges.]


Lec4: Multilayer Neural Networks 53

Practical Techniques

Momentum
• What is momentum?
  • In physics, it means that moving objects tend to keep moving unless acted upon by outside forces
  • Example: two balls carry the same momentum
• In the BP algorithm, the approach is to alter the learning rule to include some fraction α of the previous weight update:
  w(m + 1) = w(m) + (1 − α)·Δw(m) + α·Δw(m − 1)
  where Δw(m) is the current weight change, Δw(m − 1) is the previous weight change, and α is the momentum fraction
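The momentum rule keeps one extra piece of state, the previous weight change. A minimal sketch on the earlier toy linear-unit problem (my choice of example and constants) follows.

```python
import numpy as np

eta, alpha = 0.1, 0.9                  # learning rate and momentum fraction
w = np.zeros(2)
prev_delta = np.zeros(2)               # Delta_w(m - 1)
x, target = np.array([1.0, 2.0]), 1.0

for m in range(200):
    grad = -(target - w @ x) * x       # dJ/dw for J = (target - w.x)^2 / 2
    bp_delta = -eta * grad             # current BP weight change Delta_w(m)
    delta = (1 - alpha) * bp_delta + alpha * prev_delta
    w = w + delta                      # w(m+1) = w(m) + (1-a)*Dw(m) + a*Dw(m-1)
    prev_delta = delta
print(w, w @ x)                        # the output approaches the target
```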

Lec4: Multilayer Neural Networks 54

Practical Techniques

Momentum
• Using momentum:
  • Reduces the variation in the overall gradient directions
  • Increases the speed of learning
[Figure: weight-space trajectories with and without momentum.]


Lec4: Multilayer Neural Networks 55

Practical Techniques

Stochastic and Batch Training

• Each training scheme has strengths and drawbacks:
  • Batch learning is typically slower than stochastic learning
  • Stochastic training is preferred for large, redundant training sets

Lec4: Multilayer Neural Networks 56

Practical Techniques

Stopped Training
• Stopping the training before gradient descent is complete can help avoid overfitting
• A far more effective method is to stop training when the error on a separate validation set reaches a minimum
[Figure: training error and validation (generalization) error versus training epochs.]
Algorithm
1. Separate the original training set into two sets:
   • New training set
   • Validation set
2. Use the new training set to train the classifier
3. Evaluate the classifier on the validation set at the end of each epoch
4. Keep the weights from the epoch where the validation error is at its minimum
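A sketch of the stopped-training loop on a small synthetic regression problem (the data, model, and learning rate are all illustrative assumptions): it trains on the new training set and keeps the weights from the epoch with the lowest validation error.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: noisy linear targets, split into a new training set and a validation set
X = rng.normal(size=(60, 2))
T = X @ np.array([1.0, -1.0]) + 0.3 * rng.normal(size=60)
X_tr, T_tr, X_val, T_val = X[:40], T[:40], X[40:], T[40:]

def mse(w, X, T):
    return np.mean((T - X @ w) ** 2)

w = np.zeros(2)
eta = 0.05
best_w, best_val = w.copy(), np.inf
for epoch in range(200):
    # One epoch of (batch) gradient descent on the new training set
    grad = -(T_tr - X_tr @ w) @ X_tr / len(X_tr)
    w -= eta * grad
    # Evaluate on the validation set at the end of each epoch
    val = mse(w, X_val, T_val)
    if val < best_val:
        best_val, best_w = val, w.copy()   # weights at the validation-error minimum
print(best_val, best_w)
```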