TRANSCRIPT
Lec4: Multilayer Neural Networks 1
Multilayer Neural Networks
Prof. Daniel Yeung
School of Computer Science and Engineering
South China University of Technology
Lecture 4, Pattern Recognition
Lec4: Multilayer Neural Networks 2
Outline
• Introduction (6.1)
• Artificial Neural Network (6.2)
• Multiple Layer Perceptron NN (6.2)
• Back-Propagation Algorithm (6.3)
• Regularization (6.11)
• Relating NN and Bayes Theory (6.6)
• Practical Techniques (6.8)
Lec4: Multilayer Neural Networks 3
Introduction
• Recall, Linear Discriminant Functions:
  • Limited generalization capability
  • Cannot handle non-linearly separable problems
[Figure: a single-layer network with inputs x1-x5 connected directly through weights wij to the output units g1 and g2 (input layer and output layer only)]
Lec4: Multilayer Neural Networks 4
Introduction
• Solution 1: Mapping Function φ(x)
  • Pro: Simple structure (still using LDF)
  • Con: Selection of φ(x) and its parameters
  • Already discussed in Lecture 03
[Figure: the same single-layer network, but each input is first passed through the mapping φ(x1)-φ(x5) before the weighted sums for g1 and g2]
Lec4: Multilayer Neural Networks 5
Introduction
• Solution 2: Multi-Layer Neural Network
  • No need to choose the nonlinear mapping φ(x), and no need for any prior knowledge relevant to the classification problem.
[Figure: a multilayer network with inputs x1-x5, one or more hidden layers, and output units g1, g2; weights wij connect successive layers]
Lec4: Multilayer Neural Networks 6
Multi-Layer Neural Network (Multilayer Perceptron)
• The network has more than one layer of weights, i.e. at least one hidden layer
• The hidden layers serve as a mapping function
• Will be introduced in this lecture
Lec4: Multilayer Neural Networks 7
Artificial Neural Network (ANN)
• A very simplified model of the brain
• Basically a function approximator
  • Transforms inputs into outputs to the best of its ability
[Figure: a human brain and an artificial NN, each shown as a box mapping an input to an output]
Lec4: Multilayer Neural Networks 8
Artificial Neural Network (ANN)
• Composed of neurons which cooperate together
[Figure: a single neuron with inputs I1, I2, …, Id arriving through synapses with weights w1, w2, …, wd; the neuron applies the threshold Ө and the activation f to produce the output]
Lec4: Multilayer Neural Networks 9
Artificial Neural Network (ANN)
Lec4: Multilayer Neural Networks 10
Artificial Neural Network (ANN)
How does a neuron work?
• The output of a neuron is a function of the weighted sum of the inputs, plus an optional bias unit which always emits a value of 1 or -1, acting as a threshold:
  output = f (w1·I1 + w2·I2 + … + wd·Id + bias)
[Figure: the neuron again, with inputs I1 … Id, weights w1 … wd, a bias input, threshold Ө, and activation function f]
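As a concrete illustration of this weighted-sum-plus-bias computation, here is a minimal sketch in Python (the input values, weights, and the choice of a sigmoid activation are illustrative assumptions, not values from the slides):

import math

def neuron_output(inputs, weights, bias, activation):
    # Weighted sum of the inputs plus the bias term
    net = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Pass the net activation through the activation function f
    return activation(net)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Illustrative values only
print(neuron_output([0.5, -1.2, 3.0], [0.4, 0.1, -0.2], bias=0.1, activation=sigmoid))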
Lec4: Multilayer Neural Networks 11
Artificial Neural Network (ANN)
Activation Function
• The function f is the activation function
• Examples:
  • Linear Function
    • Output is the same as input
    • Differentiable
    • f (x) = x
  • Sign Function
    • Decision making
    • Not differentiable
    • f (x) = +1 if x > 0, -1 if x < 0
  • Sigmoid Function
    • Smooth, continuous, and monotonically increasing (differentiable)
    • f (x) = 1 / (1 + e^(-x))
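A short sketch of the three example activation functions in Python (treating x = 0 as the negative side of the sign function is an assumption, since the slide only gives the x > 0 and x < 0 cases):

import math

def linear(x):
    return x                      # output equals input, differentiable

def sign(x):
    return 1 if x > 0 else -1     # hard decision, not differentiable

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))   # smooth, monotonic, differentiable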
Lec4: Multilayer Neural Networks 12
Artificial Neural Network (ANN)
XOR Example
y1 = sgn(x1 + x2 + 0.5)
y2 = sgn(x1 + x2 - 1.5)
y  = sgn(0.7·y1 - 0.4·y2 - 1)
[Figure: a network with inputs x1, x2 and a bias unit, two hidden units computing y1 and y2, and one output unit computing y; the x1-x2 plane shows the decision boundaries of the two hidden units]
Lec4: Multilayer Neural Networks 13
Artificial Neural Network (ANN)
XOR Example
y1 = sgn(x1 + x2 + 0.5)
y2 = sgn(x1 + x2 - 1.5)
y  = sgn(0.7·y1 - 0.4·y2 - 1)
x1 = 1, x2 = 1:
y1 = sgn(1 + 1 + 0.5) = sgn(2.5) = 1
y2 = sgn(1 + 1 - 1.5) = sgn(0.5) = 1
y  = sgn(0.7 - 0.4 - 1) = sgn(-0.7) = -1
x1 = -1, x2 = -1:
y1 = sgn(-1 - 1 + 0.5) = sgn(-1.5) = -1
y2 = sgn(-1 - 1 - 1.5) = sgn(-3.5) = -1
y  = sgn(-0.7 + 0.4 - 1) = sgn(-1.3) = -1
Lec4: Multilayer Neural Networks 14
Artificial Neural Network (ANN)
XOR Example
• The first hidden unit implements an OR gate
• The second hidden unit implements an AND gate
• The final output unit implements an AND NOT gate:
  • y = y1 AND NOT y2 = (x1 OR x2) AND NOT (x1 AND x2) = x1 XOR x2
y1 = sgn(x1 + x2 + 0.5)
y2 = sgn(x1 + x2 - 1.5)
y  = sgn(0.7·y1 - 0.4·y2 - 1)
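A quick sketch in Python that checks this XOR network over all four ±1 input combinations, using the weights from the slide:

def sgn(x):
    return 1 if x > 0 else -1

for x1 in (1, -1):
    for x2 in (1, -1):
        y1 = sgn(x1 + x2 + 0.5)           # OR gate
        y2 = sgn(x1 + x2 - 1.5)           # AND gate
        y = sgn(0.7 * y1 - 0.4 * y2 - 1)  # y1 AND NOT y2 -> XOR
        print(x1, x2, "->", y)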
Lec4: Multilayer Neural Networks 15
Artificial Neural Network (ANN)
• Structure of ANN:
  • A simple three-layer neural network
    • Input Layer: 2 input units
    • Hidden Layer: 3 hidden units
    • Output Layer: 2 output units
[Figure: the three-layer network with inputs x1, x2, three hidden units, and outputs g1, g2; weights such as w11, w21, w31 label the input-to-hidden connections]
Lec4: Multilayer Neural Networks 16
Illustrative example:
[Figure: the same three-layer network with inputs x1, x2, three hidden units, and outputs g1, g2]
x1 = length of salmon/sea bass
x2 = lightness of salmon/sea bass
wij = weights assigning the importance of each input to a neuron
Top hidden neuron = length discriminant function
Middle hidden neuron = combination of length and lightness discriminant function
Bottom hidden neuron = lightness discriminant function
g1 = final output
g2 = final output
Lec4: Multilayer Neural Networks 17
Artificial Neural Network (ANN)
[Figure: the three-layer network with inputs x1, x2, hidden units with net activations net_j and outputs y_j, and output units g1, g2 with net activations net'_k; w_ji are input-to-hidden weights and w'_ki are hidden-to-output weights, and every unit applies the activation function f]
net_j = Σ_{m=1}^{d} w_jm x_m,    y_j = f(net_j)
net'_k = Σ_{i=1}^{n} w'_ki y_i,    g_k = f(net'_k)
Putting the two layers together:
g_k = f( Σ_{i=1}^{n} w'_ki · f( Σ_{m=1}^{d} w_im x_m ) )
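A minimal sketch of this forward pass for a fully connected 3-layer network in Python/NumPy (bias terms are omitted and the weight matrices are random illustrative values):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W_ih, W_ho):
    # Hidden layer: net_j = sum_m w_jm * x_m, y_j = f(net_j)
    y = sigmoid(W_ih @ x)
    # Output layer: net'_k = sum_i w'_ki * y_i, g_k = f(net'_k)
    g = sigmoid(W_ho @ y)
    return y, g

# Illustrative sizes: 2 inputs, 3 hidden units, 2 outputs
rng = np.random.default_rng(0)
W_ih = rng.uniform(-1, 1, size=(3, 2))   # input-to-hidden weights w_jm
W_ho = rng.uniform(-1, 1, size=(2, 3))   # hidden-to-output weights w'_ki
y, g = forward(np.array([0.5, -0.2]), W_ih, W_ho)
print(g)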
Lec4: Multilayer Neural Networks 18
Artificial Neural Network (ANN)
• A two-layer network classifier can only implement a linear decision boundary
• Three-, four- and higher-layer networks can implement arbitrary decision boundaries
• The decision regions need not be convex, nor simply connected
Lec4: Multilayer Neural Networks 19
Multiple Layer Perceptron NN (MLPNN)
• The most common NN
• More than one layer
• The sigmoid is used as the activation function
• A general function approximator
  • Not limited to linear problems
[Figure: an MLP with an input layer, multiple hidden layers of sigmoid units, and an output layer]
Lec4: Multilayer Neural Networks 20
Multiple Layer Perceptron NN (MLPNN)
• Example
Lec4: Multilayer Neural Networks 21
Training: Weight Determination
• Weights can be determined by training
  • Reduce the error between the desired outputs and the NN outputs on the training samples
• The back-propagation algorithm is the most widely used method for determining the weights
  • A natural extension of the LMS algorithm
• Pros:
  • Simple and general method
• Cons:
  • Slow, and can be trapped in local minima
Lec4: Multilayer Neural Networks 22
Back-Propagation (BP) Algorithm
• Calculation of the derivative flows backwards through the network
  • Hence, it is called back-propagation
• These derivatives point in the direction of the maximum increase of the error function (find out where the maximum error is being made and go back to try to decrease it)
• A small step (learning rate) in the opposite direction will result in the maximum decrease of the (local) error function:
  w' = w − α (∂E/∂w),  where α is the learning rate and E is the error function
Lec4: Multilayer Neural Networks 23
Back-Propagation (BP) Algorithm
• The most common measure of error is the mean square error:
  J = (target − output)² / 2
• The update rule for a weight is
  w(k+1) = w(k) − η (∂J/∂w)
  where η is the learning rate, which controls the size of each step
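A tiny worked sketch of this update rule in Python for a single weight of a linear unit output = w·x, using the squared-error J above (all numbers are illustrative):

w, x, target, eta = 0.2, 2.0, 1.0, 0.1

output = w * x
dJ_dw = -(target - output) * x      # derivative of J = (target - output)^2 / 2
w = w - eta * dJ_dw                 # w(k+1) = w(k) - eta * dJ/dw
print(w)                            # 0.32: the weight moves toward the target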
Lec4: Multilayer Neural Networks 24
Back-Propagation (BP) Algorithm
• The next slides show BP for a 3-layer NN
  • Two types of weights:
    • Hidden-to-Output: who
    • Input-to-Hidden: wih
[Figure: a 3-layer network with inputs x1, x2, input-to-hidden weights wih, hidden-to-output weights who, and outputs g1, g2]
Lec4: Multilayer Neural Networks 25
Back-Propagation (BP) Algorithm
3-Layer NN
• The learning rule for the hidden-to-output units:
  ∂J/∂w_kj = (∂J/∂output_k) (∂output_k/∂net_k) (∂net_k/∂w_kj)
  where, with net_k = Σ_{i=1}^{n} w_ki y_i and output_k = f(net_k):
    ∂J/∂output_k = −(target_k − output_k)
    ∂output_k/∂net_k = f'(net_k)
    ∂net_k/∂w_kj = y_j
Lec4: Multilayer Neural Networks 26
Back-Propagation (BP) Algorithm
3-Layer NN
• The learning rule for the input-to-hidden units:
  ∂J/∂w_ji = (∂J/∂y_j) (∂y_j/∂net_j) (∂net_j/∂w_ji)
  where, with net_j = Σ_{m=1}^{d} w_jm x_m and y_j = f(net_j):
    ∂y_j/∂net_j = f'(net_j)
    ∂net_j/∂w_ji = ∂/∂w_ji [ Σ_{m=1}^{d} w_jm x_m ] = x_i
    ∂J/∂y_j = ∂/∂y_j [ (1/2) Σ_{k=1}^{c} (target_k − output_k)² ]
            = −Σ_{k=1}^{c} (target_k − output_k) (∂output_k/∂y_j)
            = −Σ_{k=1}^{c} (target_k − output_k) (∂output_k/∂net_k)(∂net_k/∂y_j)
            = −Σ_{k=1}^{c} (target_k − output_k) f'(net_k) w_kj
[Figure: the c output units output_1 … output_c, all connected to hidden unit j]
Lec4: Multilayer Neural Networks 27
Back-Propagation (BP) Algorithm
3-Layer NN
• Summary
  • Hidden-to-Output Weight:
    ∂J/∂w_kj = −(target_k − output_k) f'(net_k) y_j
  • Input-to-Hidden Weight:
    ∂J/∂w_ji = −[ Σ_{k=1}^{c} (target_k − output_k) f'(net_k) w_kj ] f'(net_j) x_i
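A minimal sketch of these two gradient formulas for one training sample, using sigmoid units so that f'(net) = f(net)(1 − f(net)) (Python/NumPy; the matrix shapes and the absence of bias terms are simplifying assumptions):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, target, W_ih, W_ho, eta=0.1):
    # Forward pass
    y = sigmoid(W_ih @ x)                 # hidden outputs y_j = f(net_j)
    output = sigmoid(W_ho @ y)            # outputs = f(net_k)

    # Hidden-to-output: dJ/dw_kj = -(target_k - output_k) f'(net_k) y_j
    delta_o = -(target - output) * output * (1 - output)
    grad_ho = np.outer(delta_o, y)

    # Input-to-hidden:
    # dJ/dw_ji = -[sum_k (target_k - output_k) f'(net_k) w_kj] f'(net_j) x_i
    delta_h = (W_ho.T @ delta_o) * y * (1 - y)
    grad_ih = np.outer(delta_h, x)

    # Gradient-descent update: w <- w - eta * dJ/dw
    return W_ih - eta * grad_ih, W_ho - eta * grad_ho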
Lec4: Multilayer Neural Networks 28
Back-Propagation (BP) Algorithm
Training Algorithm
• For the same training set, the weights of the NN can be updated differently by presenting the training samples in different sequences
• There are two popular methods:
  • Stochastic training
  • Batch training
Lec4: Multilayer Neural Networks 29
Back-Propagation (BP) Algorithm
Training Algorithm
• Stochastic training
  • Patterns are chosen randomly from the training set
  • Network weights are updated after each (randomly chosen) pattern presentation
Lec4: Multilayer Neural Networks 30
Back-Propagation (BP) Algorithm
Training Algorithm
• Batch training
  • All patterns are presented to the network before learning (a weight update) takes place
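A brief Python sketch of the difference, using a stand-in gradient for a single linear unit so it stays self-contained (the toy data and the single-weight model are illustrative assumptions; in practice the BP gradients above would be used):

import numpy as np

def gradient(w, x, target):
    # Gradient of J = (target - w*x)^2 / 2 for a single linear unit
    return -(target - w * x) * x

X = np.array([0.5, -1.0, 2.0]); T = np.array([1.0, -1.0, 1.0])
w, eta = 0.0, 0.1

# Stochastic training: update after each randomly chosen pattern
for i in np.random.permutation(len(X)):
    w -= eta * gradient(w, X[i], T[i])

# Batch training: present all patterns first, then update once with the summed gradient
w -= eta * sum(gradient(w, x, t) for x, t in zip(X, T))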
Lec4: Multilayer Neural Networks 31
Back-Propagation (BP) Algorithm
Training Algorithm
• Is a classifier with a smaller training error always better?
  • In most cases, NO!
  • We have discussed this issue in Lecture 01
• For example:
[Figure: training error keeps decreasing, while the error on unseen test samples reaches a minimum and then rises; stop training when the testing error reaches its minimum]
Lec4: Multilayer Neural Networks 32
Regularization
• In Lec03, we mentioned that in most cases the solution (discriminant function) is not unique (an ill-posed problem)
  • Which one is the best?
  • Is it enough to minimize the training error?
  • Is the classifier too complex?
  • Does it have good generalization ability?
[Figure: minimizing only the empirical risk R_emp (the training error) leads to the overfitting problem]
Lec4: Multilayer Neural Networks 33
Regularization
• Regularization is one of the methods to handle this problem
  • Add a regularization term to the objective function:
    Minimize: R_emp + λ·ψ(f)
    (training error + tradeoff × regularization term)
  • ψ(f) measures the "smoothness" of the decision plane
  • The tradeoff parameter λ controls the relative importance of training accuracy and the regularization term
  • Seek a smooth classifier with good performance on the training set
  • May sacrifice training error for the simplicity of the classifier if necessary
Lec4: Multilayer Neural Networks 34
Regularization
Minimize: R_emp + λ·ψ(f)
λ: regularization parameter;  ψ: regularization function
• λ = 0:
  • Same as the traditional training objective function
  • The regularization term has no effect
• ∞ > λ > 0:
  • If we can find a suitable λ, we may find an f with good generalization ability
• λ → ∞:
  • Dominated by the regularization term
  • The smoothest classifier is found
Lec4: Multilayer Neural Networks 35
Regularization
Weight Decay
• It is a well-known regularization example
• The regularization term measures the magnitude of the weights:
  ψ(f) = ||w||₂²  (smaller → smoother)
• The objective function becomes
  Minimize: R_emp + λ·||w||₂²
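A small Python sketch of weight decay inside a gradient step: the λ·||w||² term adds 2λw to the gradient, so every update also shrinks the weights toward zero (the gradient values and λ are illustrative):

import numpy as np

def weight_decay_update(w, grad_emp, eta=0.1, lam=0.01):
    # Objective: R_emp + lam * ||w||^2, so the full gradient is
    # grad_emp + 2 * lam * w; the extra term decays the weights.
    return w - eta * (grad_emp + 2 * lam * w)

w = np.array([0.8, -1.5, 0.3])
grad_emp = np.array([0.1, -0.2, 0.05])   # illustrative dR_emp/dw
print(weight_decay_update(w, grad_emp))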
Lec4: Multilayer Neural Networks 36
NN and Bayes Theory
• Recall, Bayes formula:
  P(ω_k | x) = p(x | ω_k) P(ω_k) / Σ_{i=1}^{c} p(x | ω_i) P(ω_i) = p(x, ω_k) / p(x)
• Suppose a network is trained using the following target output setting:
  target_k(x) = 1 if x ∈ ω_k, and 0 otherwise
Lec4: Multilayer Neural Networks 37
NN and Bayes Theory
• When the number of training samples tends to infinity (see p. 304 in the text), using the mean square error for J and the targets above (1 for x ∈ ω_k, 0 otherwise):
  lim_{n→∞} (1/n) J(w) = lim_{n→∞} (1/n) Σ_x [ g_k(x; w) − target_k ]²
  = P(ω_k) ∫ [g_k(x; w) − 1]² p(x | ω_k) dx + P(ω_{i≠k}) ∫ g_k(x; w)² p(x | ω_{i≠k}) dx
  = ∫ g_k(x; w)² p(x) dx − 2 ∫ g_k(x; w) p(x, ω_k) dx + ∫ p(x, ω_k) dx
  = ∫ [g_k(x; w) − P(ω_k | x)]² p(x) dx  +  ∫ P(ω_k | x) P(ω_{i≠k} | x) p(x) dx
    (dependent on w)                          (independent of w)
• Minimizing J(w) with respect to w therefore minimizes the term
  ∫ [g_k(x; w) − P(ω_k | x)]² p(x) dx
• The trained network will approximate the posterior probability:
  g_k(x; w) ≅ P(ω_k | x)
Lec4: Multilayer Neural Networks 38
NN and Bayes Theory
• Thus when MLPNNs are trained via back-propagation on a sum-squared error criterion, they provide a least-squares fit to the Bayes discriminant function, i.e.
  g_k(x; w) ≅ P(ω_k | x)
Lec4: Multilayer Neural Networks 39
Practical Techniques
• How to design an MLPNN to handle a given classification problem?
[Figure: a training set of input-output pairs (x1, …, xn ; y1, …, yn) feeding an MLP whose structure and weights are still unknown, marked with "?"]
Lec4: Multilayer Neural Networks 40
Practical Techniques
• Must consider the following issues:
  • Scaling input
  • Target values
  • Number of hidden layers
  • Number of hidden units
  • Initializing weights
  • Learning rates
  • Momentum
  • Weight decay
  • Stochastic and batch training
  • Stopped training
Lec4: Multilayer Neural Networks 41
Practical Techniques
Scaling Input
• Features of different natures will have different properties (e.g. range, mean, …)
  • For example, fish described by Mass (grams) and Length (meters):
    • Normally the value of the mass will be orders of magnitude larger than that of the length
    • During training, the network will adjust the weights from the "mass" input far more than those from the "length" input
    • The error will hardly depend on the tiny length values
    • The situation is reversed for Mass (kilograms) and Length (millimeters)
Lec4: Multilayer Neural Networks 42
Practical Techniques
Scaling Input
• How to reduce this influence?
  • Normalization (standardization)
  • Standardize the training samples to have:
    • The same range (e.g. 0 to 1, or -1 to 1)
    • The same variance (e.g. 1)
    • The same average (e.g. 0)
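A minimal Python/NumPy sketch of standardizing each feature to zero mean and unit variance, computed on the training samples (the toy mass/length matrix is an illustrative assumption):

import numpy as np

X = np.array([[900.0, 0.35],    # mass in grams, length in meters
              [1200.0, 0.42],
              [700.0, 0.30]])

mean = X.mean(axis=0)
std = X.std(axis=0)
X_standardized = (X - mean) / std   # each feature: zero mean, unit variance
print(X_standardized)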
Lec4: Multilayer Neural Networks 43
Practical Techniques
Target Values
• Usually a one-of-c representation for the target vector is used
  • For a four-class problem, four outputs are used:
    • ω1 = (1, -1, -1, -1) or (1, 0, 0, 0)
    • ω2 = (-1, 1, -1, -1) or (0, 1, 0, 0)
    • ω3 = (-1, -1, 1, -1) or (0, 0, 1, 0)
    • ω4 = (-1, -1, -1, 1) or (0, 0, 0, 1)
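A short Python/NumPy sketch that builds such one-of-c target vectors from integer class labels (0/1 variant; the label list is illustrative):

import numpy as np

def one_of_c(labels, c):
    # Each row is the target vector of one sample: 1 at the class index, 0 elsewhere
    targets = np.zeros((len(labels), c))
    targets[np.arange(len(labels)), labels] = 1
    return targets

print(one_of_c([0, 2, 3], c=4))
# Use 2 * one_of_c(...) - 1 for the (+1, -1) variant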
Lec4: Multilayer Neural Networks 44
Practical Techniques
Number of Hidden Layers
• The BP algorithm works well for NNs with many hidden layers, as long as the units are differentiable
• How many hidden layers are enough?
• NNs with more hidden layers:
  • Learn translations more easily
  • Can implement some functions more efficiently
  • However, have more undesirable local minima and are more complex
• Since any arbitrary function can be approximated by an MLP with one hidden layer, a 3-layer NN is usually recommended. Special problem conditions or requirements may justify the use of more than 3 layers.
Lec4: Multilayer Neural Networks 45
Practical Techniques
Number of Hidden Units
• Governs the expressive power of the NN (e.g. in facial recognition: neurons for mouth, nose, ear, eye, face shape, etc.)
  • Well separated or linearly separable samples: few hidden units
  • Complicated problems: more hidden units
[Figure: error plotted against the number of hidden units (nH) and the number of weights (nw); one study shows the minimum error occurs for NNs with 4-5 hidden units and 17-21 weights]
Lec4: Multilayer Neural Networks 46
Practical Techniques
Number of Hidden Units
• How to determine the number of hidden units (nH)?
  • nH determines the total number of weights in the net, thus we should not have more weights than the total number of training points (n)
  • Without further information, nH cannot be determined before training
  • Experimentally:
    • Choose nH such that the total number of weights in the net is roughly n/10
    • Adjust the complexity of the network in response to the training data, for example:
      • Start with a "large" value of nH
      • Prune or eliminate weights
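A tiny sketch of the n/10 rule of thumb for a d-input, c-output net with one hidden layer, assuming bias weights are counted so the total is (d + 1)·nH + (nH + 1)·c; setting this equal to n/10 and solving for nH:

def hidden_units_rule_of_thumb(n, d, c):
    # (d + 1) * nH + (nH + 1) * c  ≈  n / 10  ->  solve for nH
    nH = (n / 10 - c) / (d + c + 1)
    return max(1, round(nH))

print(hidden_units_rule_of_thumb(n=1000, d=2, c=2))   # illustrative sizes -> 20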
Lec4: Multilayer Neural Networks 47
Practical Techniques
Initializing Weights
• In setting the weights in a given layer, we choose weights randomly from a single distribution to help ensure uniform learning
• If we set w = 0 initially, learning can never start
• Since the data are standardized, choose both positive and negative weights
Lec4: Multilayer Neural Networks 48
Practical Techniques
Initializing Weights
• If w is initially too small, the net activation of a hidden unit will be small and only a linear model will be implemented
• If w is initially too large, the hidden unit may saturate (the sigmoid output is always 0 or 1) even before learning begins
[Figure: the sigmoid function f(x) = 1/(1 + e^(-x)), with saturated regions on both sides of its central linear range; net_j = Σ_{m=1}^{d} w_jm x_m and y_j = f(net_j)]
Lec4: Multilayer Neural Networks 49
Practical Techniques
Initializing Weights
• We set w such that the net activation at a hidden unit is in the range −1 < net_j < +1, since net_j ≈ ±1 are the limits of its linear range
• Input-to-Hidden (d inputs):
  −1/√d < w_ji < +1/√d
• Hidden-to-Output (the fan-in is nH):
  −1/√nH < w_kj < +1/√nH
[Figure: the sigmoid function f(x) = 1/(1 + e^(-x)) again, showing its central linear range]
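A minimal Python/NumPy sketch of this fan-in-based initialization, drawing weights uniformly from the stated ranges (the layer sizes are illustrative):

import numpy as np

def init_weights(fan_in, fan_out, rng):
    # Uniform in (-1/sqrt(fan_in), +1/sqrt(fan_in)) so that, for standardized
    # inputs, |net| tends to stay inside the sigmoid's linear range
    limit = 1.0 / np.sqrt(fan_in)
    return rng.uniform(-limit, limit, size=(fan_out, fan_in))

rng = np.random.default_rng(0)
d, nH, c = 2, 3, 2                   # illustrative layer sizes
W_ih = init_weights(d, nH, rng)      # input-to-hidden: fan-in d
W_ho = init_weights(nH, c, rng)      # hidden-to-output: fan-in nH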
Lec4: Multilayer Neural Networks 50
Practical Techniques
Learning Rates
• In principle, a small learning rate ensures convergence
  • Its value determines only the learning speed, not the final weight values themselves
• However, in practice, because networks are rarely fully trained to a training-error minimum, the learning rate can affect the quality of the final network
Lec4: Multilayer Neural Networks 51
Practical Techniques
Learning Rates
• The optimal learning rate is the one which leads to the local error minimum in one learning step
• Setting ∂J/∂w = (∂²J/∂w²) Δw gives the optimal rate:
  η_opt = (∂²J/∂w²)⁻¹
Lec4: Multilayer Neural Networks 52
Practical Techniques
Learning Rates
[Figure: the effect of the learning rate on gradient descent —
  too small: slower convergence;
  optimal: converges in one step;
  somewhat too large: oscillates but slowly converges;
  much too large: diverges]
Lec4: Multilayer Neural Networks 53
Practical Techniques
Momentum
• What is momentum?
  • In physics, it means that moving objects tend to keep moving unless acted upon by outside forces
  • Example: two balls carrying the same momentum
• In the BP algorithm, the approach is to alter the learning rule to include some fraction α of the previous weight update:
  w(m+1) = w(m) + (1 − α)·Δw(m) + α·Δw(m−1)
  where Δw(m) is the current weight change, Δw(m−1) is the previous one, and α is the momentum fraction
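A small Python/NumPy sketch of this momentum update, keeping the previous weight change between steps (the gradient values and the choice of α are illustrative):

import numpy as np

def momentum_step(w, grad, prev_delta, eta=0.1, alpha=0.9):
    # w(m+1) = w(m) + (1 - alpha) * delta_bp(m) + alpha * delta_w(m-1)
    delta_bp = -eta * grad                       # plain back-propagation step
    delta = (1 - alpha) * delta_bp + alpha * prev_delta
    return w + delta, delta                      # new weights and this step's change

w = np.array([0.5, -0.3])
prev_delta = np.zeros_like(w)
grad = np.array([0.2, -0.1])                     # illustrative dJ/dw
w, prev_delta = momentum_step(w, grad, prev_delta)
print(w)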
Lec4: Multilayer Neural Networks 54
Practical Techniques
Momentum
• Using momentum:
  • Reduces the variation in the overall gradient direction
  • Increases the speed of learning
[Figure: error-surface trajectories with and without momentum]
Lec4: Multilayer Neural Networks 55
Practical Techniques
Stochastic and Batch Training
• Each training method has strengths and drawbacks:
  • Batch learning is typically slower than stochastic learning
  • Stochastic training is preferred for large, redundant training sets
Lec4: Multilayer Neural Networks 56
Practical Techniques
Stopped Training
• Stopping the training before gradient descent is complete can help avoid overfitting
• A far more effective method is to stop training when the error on a separate validation set reaches a minimum
[Figure: training error keeps decreasing with the number of epochs, while the validation error (an estimate of the generalization error) reaches a minimum and then rises]
Algorithm:
1. Separate the original training set into two sets:
   • a new training set
   • a validation set
2. Use the new training set to train the classifier
3. Evaluate the classifier on the validation set at the end of each epoch
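A minimal Python sketch of this stopped-training loop; train_one_epoch and validation_error are hypothetical callables standing in for the BP update and the validation-set evaluation described above:

def stopped_training(train_one_epoch, validation_error, weights, max_epochs=1000):
    best_weights, best_error = weights, float("inf")
    for epoch in range(max_epochs):
        weights = train_one_epoch(weights)      # train on the new training set
        err = validation_error(weights)         # evaluate on the validation set
        if err < best_error:                    # keep the best weights seen so far
            best_weights, best_error = weights, err
        elif err > best_error:                  # validation error rising: stop early
            break
    return best_weights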