TRANSCRIPT
Lec4: Multilayer Neural Networks 1
Multilayer Neural Networks
Prof. Daniel Yeung
School of Computer Science and Engineering
South China University of Technology
Lecture 4, Pattern Recognition
Lec4: Multilayer Neural Networks 2
Outline
• Introduction (6.1)
• Artificial Neural Network (6.2)
• Multiple Layer Perceptron NN (6.2)
• Back-Propagation Algorithm (6.3)
• Regularization (6.11)
• Relating NN and Bayes Theory (6.6)
• Practical Techniques (6.8)
Lec4: Multilayer Neural Networks 3
Introduction
• Recall, Linear Discriminant Functions:
  • Limited generalization capability
  • Cannot handle non-linearly separable problems
[Figure: a single-layer network with inputs x1-x5 connected directly through weights wij to the output units g1 and g2 (input layer and output layer only)]
Lec4: Multilayer Neural Networks 4
Introduction
• Solution 1: Mapping Function φ(x)
  • Pro: Simple structure (still using LDF)
  • Con: Selection of φ(x) and its parameters
  • Already discussed in Lecture 03
[Figure: the same single-layer network, but each input is first passed through the mapping φ(x1)-φ(x5) before the weighted sums for g1 and g2]
Lec4: Multilayer Neural Networks 5
Introduction
• Solution 2: Multi-Layer Neural Network
  • No need to choose the nonlinear mapping φ(x), and no need for any prior knowledge relevant to the classification problem.
[Figure: a multilayer network with inputs x1-x5, one or more hidden layers, and output units g1, g2; weights wij connect successive layers]
Lec4: Multilayer Neural Networks 6
Multi-Layer Neural Network (Multilayer Perceptron)
• The network has more than one layer of weights, i.e. at least one hidden layer
• The hidden layers serve as a mapping function
• Will be introduced in this lecture
Lec4: Multilayer Neural Networks 7
Artificial Neural Network (ANN)
• A very simplified model of the brain
• Basically a function approximator
  • Transforms inputs into outputs to the best of its ability
[Figure: a human brain and an artificial NN, each shown as a box mapping an input to an output]
Lec4: Multilayer Neural Networks 8
Artificial Neural Network (ANN)
• Composed of neurons which cooperate together
[Figure: a single neuron with inputs I1, I2, …, Id arriving through synapses with weights w1, w2, …, wd; the neuron applies the threshold Ө and the activation f to produce the output]
Lec4: Multilayer Neural Networks 9
Artificial Neural Network (ANN)
Lec4: Multilayer Neural Networks 10
Artificial Neural Network (ANN)
How does a neuron work?
• The output of a neuron is a function of the weighted sum of the inputs, plus an optional bias unit which always emits a value of 1 or -1, acting as a threshold:
  output = f (w1·I1 + w2·I2 + … + wd·Id + bias)
[Figure: the neuron again, with inputs I1 … Id, weights w1 … wd, a bias input, threshold Ө, and activation function f]
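As a concrete illustration of this weighted-sum-plus-bias computation, here is a minimal sketch in Python (the input values, weights, and the choice of a sigmoid activation are illustrative assumptions, not values from the slides):

import math

def neuron_output(inputs, weights, bias, activation):
    # Weighted sum of the inputs plus the bias term
    net = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Pass the net activation through the activation function f
    return activation(net)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Illustrative values only
print(neuron_output([0.5, -1.2, 3.0], [0.4, 0.1, -0.2], bias=0.1, activation=sigmoid))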
Lec4: Multilayer Neural Networks 11
Artificial Neural Network (ANN)
Activation Function
• The function f is the activation function
• Examples:
  • Linear Function
    • Output is the same as input
    • Differentiable
    • f (x) = x
  • Sign Function
    • Decision making
    • Not differentiable
    • f (x) = +1 if x > 0, -1 if x < 0
  • Sigmoid Function
    • Smooth, continuous, and monotonically increasing (differentiable)
    • f (x) = 1 / (1 + e^(-x))
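A short sketch of the three example activation functions in Python (treating x = 0 as the negative side of the sign function is an assumption, since the slide only gives the x > 0 and x < 0 cases):

import math

def linear(x):
    return x                      # output equals input, differentiable

def sign(x):
    return 1 if x > 0 else -1     # hard decision, not differentiable

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))   # smooth, monotonic, differentiable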
Lec4: Multilayer Neural Networks 12
Artificial Neural Network (ANN)
XOR Example
y1 = sgn(x1 + x2 + 0.5)
y2 = sgn(x1 + x2 - 1.5)
y  = sgn(0.7·y1 - 0.4·y2 - 1)
[Figure: a network with inputs x1, x2 and a bias unit, two hidden units computing y1 and y2, and one output unit computing y; the x1-x2 plane shows the decision boundaries of the two hidden units]
Lec4: Multilayer Neural Networks 13
Artificial Neural Network (ANN)
XOR Example
y1 = sgn(x1 + x2 + 0.5)
y2 = sgn(x1 + x2 - 1.5)
y  = sgn(0.7·y1 - 0.4·y2 - 1)
x1 = 1, x2 = 1:
y1 = sgn(1 + 1 + 0.5) = sgn(2.5) = 1
y2 = sgn(1 + 1 - 1.5) = sgn(0.5) = 1
y  = sgn(0.7 - 0.4 - 1) = sgn(-0.7) = -1
x1 = -1, x2 = -1:
y1 = sgn(-1 - 1 + 0.5) = sgn(-1.5) = -1
y2 = sgn(-1 - 1 - 1.5) = sgn(-3.5) = -1
y  = sgn(-0.7 + 0.4 - 1) = sgn(-1.3) = -1
Lec4: Multilayer Neural Networks 14
Artificial Neural Network (ANN)
XOR Example
• The first hidden unit implements an OR gate
• The second hidden unit implements an AND gate
• The final output unit implements an AND NOT gate:
  • y = y1 AND NOT y2 = (x1 OR x2) AND NOT (x1 AND x2) = x1 XOR x2
y1 = sgn(x1 + x2 + 0.5)
y2 = sgn(x1 + x2 - 1.5)
y  = sgn(0.7·y1 - 0.4·y2 - 1)
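A quick sketch in Python that checks this XOR network over all four ±1 input combinations, using the weights from the slide:

def sgn(x):
    return 1 if x > 0 else -1

for x1 in (1, -1):
    for x2 in (1, -1):
        y1 = sgn(x1 + x2 + 0.5)           # OR gate
        y2 = sgn(x1 + x2 - 1.5)           # AND gate
        y = sgn(0.7 * y1 - 0.4 * y2 - 1)  # y1 AND NOT y2 -> XOR
        print(x1, x2, "->", y)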
Lec4: Multilayer Neural Networks 15
Artificial Neural Network (ANN)
• Structure of ANN:
  • A simple three-layer neural network
    • Input Layer: 2 input units
    • Hidden Layer: 3 hidden units
    • Output Layer: 2 output units
[Figure: the three-layer network with inputs x1, x2, three hidden units, and outputs g1, g2; weights such as w11, w21, w31 label the input-to-hidden connections]
Lec4: Multilayer Neural Networks 16
Illustrative example:
[Figure: the same three-layer network with inputs x1, x2, three hidden units, and outputs g1, g2]
x1 = length of salmon/sea bass
x2 = lightness of salmon/sea bass
wij = weights assigning the importance of each input to a neuron
Top hidden neuron = length discriminant function
Middle hidden neuron = combination of length and lightness discriminant function
Bottom hidden neuron = lightness discriminant function
g1 = final output
g2 = final output
Lec4: Multilayer Neural Networks 17
Artificial Neural Network (ANN)
[Figure: the three-layer network with inputs x1, x2, hidden units with net activations net_j and outputs y_j, and output units g1, g2 with net activations net'_k; w_ji are input-to-hidden weights and w'_ki are hidden-to-output weights, and every unit applies the activation function f]
net_j = Σ_{m=1}^{d} w_jm x_m,    y_j = f(net_j)
net'_k = Σ_{i=1}^{n} w'_ki y_i,    g_k = f(net'_k)
Putting the two layers together:
g_k = f( Σ_{i=1}^{n} w'_ki · f( Σ_{m=1}^{d} w_im x_m ) )
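A minimal sketch of this forward pass for a fully connected 3-layer network in Python/NumPy (bias terms are omitted and the weight matrices are random illustrative values):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W_ih, W_ho):
    # Hidden layer: net_j = sum_m w_jm * x_m, y_j = f(net_j)
    y = sigmoid(W_ih @ x)
    # Output layer: net'_k = sum_i w'_ki * y_i, g_k = f(net'_k)
    g = sigmoid(W_ho @ y)
    return y, g

# Illustrative sizes: 2 inputs, 3 hidden units, 2 outputs
rng = np.random.default_rng(0)
W_ih = rng.uniform(-1, 1, size=(3, 2))   # input-to-hidden weights w_jm
W_ho = rng.uniform(-1, 1, size=(2, 3))   # hidden-to-output weights w'_ki
y, g = forward(np.array([0.5, -0.2]), W_ih, W_ho)
print(g)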
Lec4: Multilayer Neural Networks 18
Artificial Neural Network (ANN)
• A two-layer network classifier can only implement a linear decision boundary
• Three-, four- and higher-layer networks can implement arbitrary decision boundaries
• The decision regions need not be convex, nor simply connected
Lec4: Multilayer Neural Networks 19
Multiple Layer Perceptron NN (MLPNN)
• The most common NN
• More than one layer
• The sigmoid is used as the activation function
• A general function approximator
  • Not limited to linear problems
[Figure: an MLP with an input layer, multiple hidden layers of sigmoid units, and an output layer]
Lec4: Multilayer Neural Networks 20
Multiple Layer Perceptron NN (MLPNN)
• Example
Lec4: Multilayer Neural Networks 21
Training: Weight Determination
• Weights can be determined by training
  • Reduce the error between the desired outputs and the NN outputs on the training samples
• The back-propagation algorithm is the most widely used method for determining the weights
  • A natural extension of the LMS algorithm
• Pros:
  • Simple and general method
• Cons:
  • Slow, and can be trapped in local minima
Lec4: Multilayer Neural Networks 22
Back-Propagation (BP) Algorithm
• Calculation of the derivative flows backwards through the network
  • Hence, it is called back-propagation
• These derivatives point in the direction of the maximum increase of the error function (find out where the maximum error is being made and go back to try to decrease it)
• A small step (learning rate) in the opposite direction will result in the maximum decrease of the (local) error function:
  w' = w − α (∂E/∂w),  where α is the learning rate and E is the error function
Lec4: Multilayer Neural Networks 23
Back-Propagation (BP) Algorithm
• The most common measure of error is the mean square error:
  J = (target − output)² / 2
• The update rule for a weight is
  w(k+1) = w(k) − η (∂J/∂w)
  where η is the learning rate, which controls the size of each step
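A tiny worked sketch of this update rule in Python for a single weight of a linear unit output = w·x, using the squared-error J above (all numbers are illustrative):

w, x, target, eta = 0.2, 2.0, 1.0, 0.1

output = w * x
dJ_dw = -(target - output) * x      # derivative of J = (target - output)^2 / 2
w = w - eta * dJ_dw                 # w(k+1) = w(k) - eta * dJ/dw
print(w)                            # 0.32: the weight moves toward the target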
Lec4: Multilayer Neural Networks 24
Back-Propagation (BP) Algorithm
• The next slides show BP for a 3-layer NN
  • Two types of weights:
    • Hidden-to-Output: who
    • Input-to-Hidden: wih
[Figure: a 3-layer network with inputs x1, x2, input-to-hidden weights wih, hidden-to-output weights who, and outputs g1, g2]
Lec4: Multilayer Neural Networks 25
Back-Propagation (BP) Algorithm
3-Layer NN
• The learning rule for the hidden-to-output units:
  ∂J/∂w_kj = (∂J/∂output_k) (∂output_k/∂net_k) (∂net_k/∂w_kj)
  where, with net_k = Σ_{i=1}^{n} w_ki y_i and output_k = f(net_k):
    ∂J/∂output_k = −(target_k − output_k)
    ∂output_k/∂net_k = f'(net_k)
    ∂net_k/∂w_kj = y_j
Lec4: Multilayer Neural Networks 26
Back-Propagation (BP) Algorithm
3-Layer NN
• The learning rule for the input-to-hidden units:
  ∂J/∂w_ji = (∂J/∂y_j) (∂y_j/∂net_j) (∂net_j/∂w_ji)
  where, with net_j = Σ_{m=1}^{d} w_jm x_m and y_j = f(net_j):
    ∂y_j/∂net_j = f'(net_j)
    ∂net_j/∂w_ji = ∂/∂w_ji [ Σ_{m=1}^{d} w_jm x_m ] = x_i
    ∂J/∂y_j = ∂/∂y_j [ (1/2) Σ_{k=1}^{c} (target_k − output_k)² ]
            = −Σ_{k=1}^{c} (target_k − output_k) (∂output_k/∂y_j)
            = −Σ_{k=1}^{c} (target_k − output_k) (∂output_k/∂net_k)(∂net_k/∂y_j)
            = −Σ_{k=1}^{c} (target_k − output_k) f'(net_k) w_kj
[Figure: the c output units output_1 … output_c, all connected to hidden unit j]
Lec4: Multilayer Neural Networks 27
Back-Propagation (BP) Algorithm
3-Layer NN
• Summary
  • Hidden-to-Output Weight:
    ∂J/∂w_kj = −(target_k − output_k) f'(net_k) y_j
  • Input-to-Hidden Weight:
    ∂J/∂w_ji = −[ Σ_{k=1}^{c} (target_k − output_k) f'(net_k) w_kj ] f'(net_j) x_i
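A minimal sketch of these two gradient formulas for one training sample, using sigmoid units so that f'(net) = f(net)(1 − f(net)) (Python/NumPy; the matrix shapes and the absence of bias terms are simplifying assumptions):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, target, W_ih, W_ho, eta=0.1):
    # Forward pass
    y = sigmoid(W_ih @ x)                 # hidden outputs y_j = f(net_j)
    output = sigmoid(W_ho @ y)            # outputs = f(net_k)

    # Hidden-to-output: dJ/dw_kj = -(target_k - output_k) f'(net_k) y_j
    delta_o = -(target - output) * output * (1 - output)
    grad_ho = np.outer(delta_o, y)

    # Input-to-hidden:
    # dJ/dw_ji = -[sum_k (target_k - output_k) f'(net_k) w_kj] f'(net_j) x_i
    delta_h = (W_ho.T @ delta_o) * y * (1 - y)
    grad_ih = np.outer(delta_h, x)

    # Gradient-descent update: w <- w - eta * dJ/dw
    return W_ih - eta * grad_ih, W_ho - eta * grad_ho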
Lec4: Multilayer Neural Networks 28
Back-Propagation (BP) Algorithm
Training Algorithm
• For the same training set, the weights of the NN can be updated differently by presenting the training samples in different sequences
• There are two popular methods:
  • Stochastic training
  • Batch training
Lec4: Multilayer Neural Networks 29
Back-Propagation (BP) Algorithm
Training Algorithm
• Stochastic training
  • Patterns are chosen randomly from the training set
  • Network weights are updated after each (randomly chosen) pattern presentation
Lec4: Multilayer Neural Networks 30
Back-Propagation (BP) Algorithm
Training Algorithm
• Batch training
  • All patterns are presented to the network before learning (a weight update) takes place
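A brief Python sketch of the difference, using a stand-in gradient for a single linear unit so it stays self-contained (the toy data and the single-weight model are illustrative assumptions; in practice the BP gradients above would be used):

import numpy as np

def gradient(w, x, target):
    # Gradient of J = (target - w*x)^2 / 2 for a single linear unit
    return -(target - w * x) * x

X = np.array([0.5, -1.0, 2.0]); T = np.array([1.0, -1.0, 1.0])
w, eta = 0.0, 0.1

# Stochastic training: update after each randomly chosen pattern
for i in np.random.permutation(len(X)):
    w -= eta * gradient(w, X[i], T[i])

# Batch training: present all patterns first, then update once with the summed gradient
w -= eta * sum(gradient(w, x, t) for x, t in zip(X, T))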
Lec4: Multilayer Neural Networks 31
Back-Propagation (BP) Algorithm
Training Algorithm
• Is a classifier with a smaller training error always better?
  • In most cases, NO!
  • We have discussed this issue in Lecture 01
• For example:
[Figure: training error keeps decreasing, while the error on unseen test samples reaches a minimum and then rises; stop training when the testing error reaches its minimum]
Lec4: Multilayer Neural Networks 32
Regularization
• In Lec03, we mentioned that in most cases the solution (discriminant function) is not unique (an ill-posed problem)
  • Which one is the best?
  • Is it enough to minimize the training error?
  • Is the classifier too complex?
  • Does it have good generalization ability?
[Figure: minimizing only the empirical risk R_emp (the training error) leads to the overfitting problem]
Lec4: Multilayer Neural Networks 33
Regularization
• Regularization is one of the methods to handle this problem
  • Add a regularization term to the objective function:
    Minimize: R_emp + λ·ψ(f)
    (training error + tradeoff × regularization term)
  • ψ(f) measures the "smoothness" of the decision plane
  • The tradeoff parameter λ controls the relative importance of training accuracy and the regularization term
  • Seek a smooth classifier with good performance on the training set
  • May sacrifice training error for the simplicity of the classifier if necessary
Lec4: Multilayer Neural Networks 34
Regularization
Minimize: R_emp + λ·ψ(f)
λ: regularization parameter;  ψ: regularization function
• λ = 0:
  • Same as the traditional training objective function
  • The regularization term has no effect
• ∞ > λ > 0:
  • If we can find a suitable λ, we may find an f with good generalization ability
• λ → ∞:
  • Dominated by the regularization term
  • The smoothest classifier is found
Lec4: Multilayer Neural Networks 35
Regularization
Weight Decay
• It is a well-known regularization example
• The regularization term measures the magnitude of the weights:
  ψ(f) = ||w||₂²  (smaller → smoother)
• The objective function becomes
  Minimize: R_emp + λ·||w||₂²
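A small Python sketch of weight decay inside a gradient step: the λ·||w||² term adds 2λw to the gradient, so every update also shrinks the weights toward zero (the gradient values and λ are illustrative):

import numpy as np

def weight_decay_update(w, grad_emp, eta=0.1, lam=0.01):
    # Objective: R_emp + lam * ||w||^2, so the full gradient is
    # grad_emp + 2 * lam * w; the extra term decays the weights.
    return w - eta * (grad_emp + 2 * lam * w)

w = np.array([0.8, -1.5, 0.3])
grad_emp = np.array([0.1, -0.2, 0.05])   # illustrative dR_emp/dw
print(weight_decay_update(w, grad_emp))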
Lec4: Multilayer Neural Networks 36
NN and Bayes Theory
• Recall, Bayes formula:
  P(ω_k | x) = p(x | ω_k) P(ω_k) / Σ_{i=1}^{c} p(x | ω_i) P(ω_i) = p(x, ω_k) / p(x)
• Suppose a network is trained using the following target output setting:
  target_k(x) = 1 if x ∈ ω_k, and 0 otherwise
Lec4: Multilayer Neural Networks 37
NN and Bayes Theory
• When the number of training samples tends to infinity (see p. 304 in the text), using the mean square error for J and the targets above (1 for x ∈ ω_k, 0 otherwise):
  lim_{n→∞} (1/n) J(w) = lim_{n→∞} (1/n) Σ_x [ g_k(x; w) − target_k ]²
  = P(ω_k) ∫ [g_k(x; w) − 1]² p(x | ω_k) dx + P(ω_{i≠k}) ∫ g_k(x; w)² p(x | ω_{i≠k}) dx
  = ∫ g_k(x; w)² p(x) dx − 2 ∫ g_k(x; w) p(x, ω_k) dx + ∫ p(x, ω_k) dx
  = ∫ [g_k(x; w) − P(ω_k | x)]² p(x) dx  +  ∫ P(ω_k | x) P(ω_{i≠k} | x) p(x) dx
    (dependent on w)                          (independent of w)
• Minimizing J(w) with respect to w therefore minimizes the term
  ∫ [g_k(x; w) − P(ω_k | x)]² p(x) dx
• The trained network will approximate the posterior probability:
  g_k(x; w) ≅ P(ω_k | x)
Lec4: Multilayer Neural Networks 38
NN and Bayes Theory
• Thus when MLPNNs are trained via back-propagation on a sum-squared error criterion, they provide a least-squares fit to the Bayes discriminant function, i.e.
  g_k(x; w) ≅ P(ω_k | x)
Lec4: Multilayer Neural Networks 39
Practical Techniques
• How to design an MLPNN to handle a given classification problem?
[Figure: a training set of input-output pairs (x1, …, xn ; y1, …, yn) feeding an MLP whose structure and weights are still unknown, marked with "?"]
Lec4: Multilayer Neural Networks 40
Practical Techniques
• Must consider the following issues:
  • Scaling input
  • Target values
  • Number of hidden layers
  • Number of hidden units
  • Initializing weights
  • Learning rates
  • Momentum
  • Weight decay
  • Stochastic and batch training
  • Stopped training
Lec4: Multilayer Neural Networks 41
Practical Techniques
Scaling Input
• Features of different natures will have different properties (e.g. range, mean, …)
  • For example, fish described by Mass (grams) and Length (meters):
    • Normally the value of the mass will be orders of magnitude larger than that of the length
    • During training, the network will adjust the weights from the "mass" input far more than those from the "length" input
    • The error will hardly depend on the tiny length values
    • The situation is reversed for Mass (kilograms) and Length (millimeters)
Lec4: Multilayer Neural Networks 42
Practical Techniques
Scaling Input
• How to reduce this influence?
  • Normalization (standardization)
  • Standardize the training samples to have:
    • The same range (e.g. 0 to 1, or -1 to 1)
    • The same variance (e.g. 1)
    • The same average (e.g. 0)
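A minimal Python/NumPy sketch of standardizing each feature to zero mean and unit variance, computed on the training samples (the toy mass/length matrix is an illustrative assumption):

import numpy as np

X = np.array([[900.0, 0.35],    # mass in grams, length in meters
              [1200.0, 0.42],
              [700.0, 0.30]])

mean = X.mean(axis=0)
std = X.std(axis=0)
X_standardized = (X - mean) / std   # each feature: zero mean, unit variance
print(X_standardized)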
Lec4: Multilayer Neural Networks 43
Practical Techniques
Target Values
• Usually a one-of-c representation for the target vector is used
  • For a four-class problem, four outputs are used:
    • ω1 = (1, -1, -1, -1) or (1, 0, 0, 0)
    • ω2 = (-1, 1, -1, -1) or (0, 1, 0, 0)
    • ω3 = (-1, -1, 1, -1) or (0, 0, 1, 0)
    • ω4 = (-1, -1, -1, 1) or (0, 0, 0, 1)
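A short Python/NumPy sketch that builds such one-of-c target vectors from integer class labels (0/1 variant; the label list is illustrative):

import numpy as np

def one_of_c(labels, c):
    # Each row is the target vector of one sample: 1 at the class index, 0 elsewhere
    targets = np.zeros((len(labels), c))
    targets[np.arange(len(labels)), labels] = 1
    return targets

print(one_of_c([0, 2, 3], c=4))
# Use 2 * one_of_c(...) - 1 for the (+1, -1) variant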
Lec4: Multilayer Neural Networks 44
Practical Techniques
Number of Hidden Layers
• The BP algorithm works well for NNs with many hidden layers, as long as the units are differentiable
• How many hidden layers are enough?
• NNs with more hidden layers:
  • Learn translations more easily
  • Can implement some functions more efficiently
  • However, have more undesirable local minima and are more complex
• Since any arbitrary function can be approximated by an MLP with one hidden layer, a 3-layer NN is usually recommended. Special problem conditions or requirements may justify the use of more than 3 layers.
Lec4: Multilayer Neural Networks 45
Practical Techniques
Number of Hidden Units
• Governs the expressive power of the NN (e.g. in facial recognition: neurons for mouth, nose, ear, eye, face shape, etc.)
  • Well separated or linearly separable samples: few hidden units
  • Complicated problems: more hidden units
[Figure: error plotted against the number of hidden units (nH) and the number of weights (nw); one study shows the minimum error occurs for NNs with 4-5 hidden units and 17-21 weights]
Lec4: Multilayer Neural Networks 46
Practical Techniques
Number of Hidden Units
• How to determine the number of hidden units (nH)?
  • nH determines the total number of weights in the net, thus we should not have more weights than the total number of training points (n)
  • Without further information, nH cannot be determined before training
  • Experimentally:
    • Choose nH such that the total number of weights in the net is roughly n/10
    • Adjust the complexity of the network in response to the training data, for example:
      • Start with a "large" value of nH
      • Prune or eliminate weights
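A tiny sketch of the n/10 rule of thumb for a d-input, c-output net with one hidden layer, assuming bias weights are counted so the total is (d + 1)·nH + (nH + 1)·c; setting this equal to n/10 and solving for nH:

def hidden_units_rule_of_thumb(n, d, c):
    # (d + 1) * nH + (nH + 1) * c  ≈  n / 10  ->  solve for nH
    nH = (n / 10 - c) / (d + c + 1)
    return max(1, round(nH))

print(hidden_units_rule_of_thumb(n=1000, d=2, c=2))   # illustrative sizes -> 20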
Lec4: Multilayer Neural Networks 47
Practical Techniques
Initializing Weights
• In setting the weights in a given layer, we choose weights randomly from a single distribution to help ensure uniform learning
• If we set w = 0 initially, learning can never start
• Since the data are standardized, choose both positive and negative weights
Lec4: Multilayer Neural Networks 48
Practical Techniques
Initializing Weights
• If w is initially too small, the net activation of a hidden unit will be small and only a linear model will be implemented
• If w is initially too large, the hidden unit may saturate (the sigmoid output is always 0 or 1) even before learning begins
[Figure: the sigmoid function f(x) = 1/(1 + e^(-x)), with saturated regions on both sides of its central linear range; net_j = Σ_{m=1}^{d} w_jm x_m and y_j = f(net_j)]
Lec4: Multilayer Neural Networks 49
Practical Techniques
Initializing Weights
• We set w such that the net activation at a hidden unit is in the range −1 < net_j < +1, since net_j ≈ ±1 are the limits of its linear range
• Input-to-Hidden (d inputs):
  −1/√d < w_ji < +1/√d
• Hidden-to-Output (the fan-in is nH):
  −1/√nH < w_kj < +1/√nH
[Figure: the sigmoid function f(x) = 1/(1 + e^(-x)) again, showing its central linear range]
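A minimal Python/NumPy sketch of this fan-in-based initialization, drawing weights uniformly from the stated ranges (the layer sizes are illustrative):

import numpy as np

def init_weights(fan_in, fan_out, rng):
    # Uniform in (-1/sqrt(fan_in), +1/sqrt(fan_in)) so that, for standardized
    # inputs, |net| tends to stay inside the sigmoid's linear range
    limit = 1.0 / np.sqrt(fan_in)
    return rng.uniform(-limit, limit, size=(fan_out, fan_in))

rng = np.random.default_rng(0)
d, nH, c = 2, 3, 2                   # illustrative layer sizes
W_ih = init_weights(d, nH, rng)      # input-to-hidden: fan-in d
W_ho = init_weights(nH, c, rng)      # hidden-to-output: fan-in nH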
Lec4: Multilayer Neural Networks 50
Practical Techniques
Learning Rates
• In principle, a small learning rate ensures convergence
  • Its value determines only the learning speed, not the final weight values themselves
• However, in practice, because networks are rarely fully trained to a training-error minimum, the learning rate can affect the quality of the final network
Lec4: Multilayer Neural Networks 51
Practical Techniques
Learning Rates
• The optimal learning rate is the one which leads to the local error minimum in one learning step
• Setting ∂J/∂w = (∂²J/∂w²) Δw gives the optimal rate:
  η_opt = (∂²J/∂w²)⁻¹
Lec4: Multilayer Neural Networks 52
Practical Techniques
Learning Rates
[Figure: the effect of the learning rate on gradient descent —
  too small: slower convergence;
  optimal: converges in one step;
  somewhat too large: oscillates but slowly converges;
  much too large: diverges]
Lec4: Multilayer Neural Networks 53
Practical Techniques
Momentum
• What is momentum?
  • In physics, it means that moving objects tend to keep moving unless acted upon by outside forces
  • Example: two balls carrying the same momentum
• In the BP algorithm, the approach is to alter the learning rule to include some fraction α of the previous weight update:
  w(m+1) = w(m) + (1 − α)·Δw(m) + α·Δw(m−1)
  where Δw(m) is the current weight change, Δw(m−1) is the previous one, and α is the momentum fraction
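A small Python/NumPy sketch of this momentum update, keeping the previous weight change between steps (the gradient values and the choice of α are illustrative):

import numpy as np

def momentum_step(w, grad, prev_delta, eta=0.1, alpha=0.9):
    # w(m+1) = w(m) + (1 - alpha) * delta_bp(m) + alpha * delta_w(m-1)
    delta_bp = -eta * grad                       # plain back-propagation step
    delta = (1 - alpha) * delta_bp + alpha * prev_delta
    return w + delta, delta                      # new weights and this step's change

w = np.array([0.5, -0.3])
prev_delta = np.zeros_like(w)
grad = np.array([0.2, -0.1])                     # illustrative dJ/dw
w, prev_delta = momentum_step(w, grad, prev_delta)
print(w)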
Lec4: Multilayer Neural Networks 54
Practical Techniques
Momentum
• Using momentum:
  • Reduces the variation in the overall gradient direction
  • Increases the speed of learning
[Figure: error-surface trajectories with and without momentum]
Lec4: Multilayer Neural Networks 55
Practical Techniques
Stochastic and Batch Training
• Each training method has strengths and drawbacks:
  • Batch learning is typically slower than stochastic learning
  • Stochastic training is preferred for large, redundant training sets
Lec4: Multilayer Neural Networks 56
Practical Techniques
Stopped Training
• Stopping the training before gradient descent is complete can help avoid overfitting
• A far more effective method is to stop training when the error on a separate validation set reaches a minimum
[Figure: training error keeps decreasing with the number of epochs, while the validation error (an estimate of the generalization error) reaches a minimum and then rises]
Algorithm:
1. Separate the original training set into two sets:
   • a new training set
   • a validation set
2. Use the new training set to train the classifier
3. Evaluate the classifier on the validation set at the end of each epoch
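A minimal Python sketch of this stopped-training loop; train_one_epoch and validation_error are hypothetical callables standing in for the BP update and the validation-set evaluation described above:

def stopped_training(train_one_epoch, validation_error, weights, max_epochs=1000):
    best_weights, best_error = weights, float("inf")
    for epoch in range(max_epochs):
        weights = train_one_epoch(weights)      # train on the new training set
        err = validation_error(weights)         # evaluate on the validation set
        if err < best_error:                    # keep the best weights seen so far
            best_weights, best_error = weights, err
        elif err > best_error:                  # validation error rising: stop early
            break
    return best_weights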