TRANSCRIPT
Multi-Layer Feedforward Neural Networks
CAP5615 Intro. to Neural Networks
Xingquan (Hill) Zhu
Outline
• Multi-layer Neural Networks
• Feedforward Neural Networks
  – FF NN model
  – Backpropagation (BP) Algorithm
  – BP rules derivation
  – Practical Issues of FFNN
• FFNN for Face Recognition
Multi-layer NN
[Figure: network with an input layer, a hidden layer, and an output layer]
• Between the input and output layers there are hidden layers, as illustrated above.
  – Hidden nodes do not directly send outputs to the external environment.
• Multi-layer NNs overcome the limitation of single-layer NNs: they can handle non-linearly separable learning tasks.
XOR problem
• Two classes, green and red, cannot be separated by one line, but they can be separated by two lines.
• The NN below, with two hidden nodes, realizes this non-linear separation: each hidden node represents one of the two blue lines.
[Figure: XOR points at (±1, ±1) in the (x1, x2) plane, and a network with inputs x1, x2, hidden nodes y1 and y2, and output node z (weights w0, w1, ..., w3)]
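A minimal Python sketch of this idea (the threshold weights below are illustrative choices, not the exact values in the figure): each hidden unit fires on one side of its line, and the output unit fires only for points between the two lines.

def step(v):
    # hard-threshold unit: fires when its weighted sum is non-negative
    return 1 if v >= 0 else 0

def xor_net(x1, x2):
    h1 = step(x1 + x2 - 0.5)      # hidden node 1: one separating line ("x1 OR x2")
    h2 = step(x1 + x2 - 1.5)      # hidden node 2: the other line ("x1 AND x2")
    return step(h1 - h2 - 0.5)    # output fires only between the two lines

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", xor_net(x1, x2))   # prints the XOR truth table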
Types of decision regions
• A network with a single node realizes a half-plane decision region:
  w0 + w1 x1 + w2 x2 ≥ 0  (and its complement w0 + w1 x1 + w2 x2 < 0)
• A one-hidden-layer network realizes a convex region: each hidden node realizes one of the lines bounding the convex region.
• A two-hidden-layer network realizes the union of three convex regions: each box represents a one-hidden-layer network realizing one convex region.
[Figures: a single node with inputs x1, x2 and weights w0, w1, w2; a one-hidden-layer network whose hidden nodes implement the lines L1-L4 bounding convex region P1; a two-hidden-layer network combining convex regions P1, P2, P3]
Different Non-Linearly Separable Problems

Structure    | Types of Decision Regions                          | Exclusive-OR Problem | Class Separation | Most General Region Shapes
Single-Layer | Half plane bounded by hyperplane                   | [figure]             | [figure]         | [figure]
Two-Layer    | Convex open or closed regions                      | [figure]             | [figure]         | [figure]
Three-Layer  | Arbitrary (complexity limited by number of nodes)  | [figure]             | [figure]         | [figure]

(The last three columns are illustrations of how regions containing classes A and B are separated.)
Outline
• Multi-layer Neural Networks
• Feedforward Neural Networks
  – FF NN model
  – Backpropagation (BP) Algorithm
  – BP rules derivation
  – Practical Issues of FFNN
• FFNN for Face Recognition
FFNN NEURON MODEL
• The classical learning algorithm of FFNN is based on the gradient descent method.
• The activation functions used in FFNN are continuous functions of the weights, differentiable everywhere.
  – A typical activation function is the Sigmoid Function.
FFNN NEURON MODEL
• A typical activation function is the Sigmoid Function:
  φ(vj) = 1 / (1 + e^(-a·vj)),  a > 0
• When a approaches 0, φ tends to a linear function.
• When a tends to infinity, φ tends to the step function.
[Figure: sigmoid curves φ(vj) rising from 0 to 1 over vj in [-10, 10], for increasing values of a]
where vj = Σ_i wij yi, with wij the weight of the link from node i to node j, and yi the output of node i.
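A small Python sketch of this neuron model (the slope a, the example weights, and the function names are illustrative):

import math

def sigmoid(v, a=1.0):
    # phi(v_j) = 1 / (1 + exp(-a * v_j)), a > 0
    return 1.0 / (1.0 + math.exp(-a * v))

def neuron_output(weights, inputs, a=1.0):
    # v_j = sum_i w_ij * y_i, then o_j = phi(v_j)
    v_j = sum(w_ij * y_i for w_ij, y_i in zip(weights, inputs))
    return sigmoid(v_j, a)

# e.g. a neuron with a bias input y_0 = 1 and two further inputs:
# neuron_output([0.34, 0.13, -0.92], [1.0, 0.0, 0.0]) -> about 0.58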
FFNN MODEL
• xij: the input from node i to node j
• wij: the weight from node i to node j
  – Δwij: the weight update amount from node i to node j
• ok: the output from node k
The objective of multi-layer NN
• The error of output neuron j after the activation of the network on the n-th training example (x(n), d(n)) is:
  ej(n) = dj(n) - oj(n)
• The network error is the sum of the squared errors of the output neurons:
  E(n) = 1/2 Σ_{j ∈ output nodes} ej(n)²
• The total mean squared error is the average of the network errors over the N training examples:
  E_AV(W) = (1/N) Σ_{n=1..N} E(n) = (1/(2N)) Σ_n Σ_j (dj(n) - oj(n))²
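A short Python sketch of these error measures (function and variable names are illustrative):

def network_error(d, o):
    # E(n) = 1/2 * sum over output nodes j of (d_j(n) - o_j(n))^2
    return 0.5 * sum((dj - oj) ** 2 for dj, oj in zip(d, o))

def total_mean_squared_error(targets, outputs):
    # E_AV = (1/N) * sum over the N training examples of E(n)
    return sum(network_error(d, o) for d, o in zip(targets, outputs)) / len(targets)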
Feedforward NN idea: the credit assignment problem
• The problem of assigning 'credit' or 'blame' to the individual elements (the hidden units) involved in forming the overall response of a learning system.
• In neural networks, the problem amounts to distributing the network error over the weights.
Outline
• Multi-layer Neural Networks
• Feedforward Neural Networks
  – FF NN model
  – Backpropagation (BP) Algorithm
  – BP rules derivation
  – Practical Issues of FFNN
• FFNN for Face Recognition
Training: Backprop algorithm
• Searches for weight values that minimize the total error of the network over the set of training examples.
• Repeats the following two passes:
  – Forward pass: compute the outputs of all units in the network, and the error of the output layer.
  – Backward pass: the network error is used for updating the weights (credit assignment problem).
• Starting at the output layer, the error is propagated backwards through the network, layer by layer. This is done by recursively computing the local gradient of each neuron.
Backprop
• Back-propagation training algorithm illustrated:
• Backprop adjusts the weights of the NN in order to minimize the network total mean squared error.
[Figure: forward step = network activation and error computation; backward step = error propagation]
BP Example
• XOR
  X0  X1  X2  |  Y
   1   0   0  |  0
   1   0   1  |  1
   1   1   0  |  1
   1   1   1  |  0
[Figure: network with inputs X0, X1, X2, hidden nodes a and b, and output node c; weights w0a, w1a, w2a, w0b, w1b, w2b, w0c, wac, wbc]
• η = 0.5; consider the training instance {(1, 0, 0), 0}.
Neuron a: w0a = 0.34, w1a = 0.13, w2a = -0.92;  va = 0.34,  oa = 0.58
Neuron b: w0b = -0.12, w1b = 0.57, w2b = -0.33;  vb = -0.12, ob = 0.47
Neuron c: w0c = -0.99, wac = 0.16, wbc = 0.75;   vc = -0.54, oc = 0.37

δc = oc(1 - oc)(tc - oc) = 0.37·(1 - 0.37)·(0 - 0.37) = -0.085
δa = oa(1 - oa) Σ_k wak δk = 0.58·(1 - 0.58)·0.16·(-0.085) = -0.003
δb = ob(1 - ob) Σ_k wbk δk = 0.47·(1 - 0.47)·0.75·(-0.085) = -0.016

Δw0a = η δa x0 = 0.5·(-0.003)·1 = -0.0015
Δw0b = η δb x0 = 0.5·(-0.016)·1 = -0.008
Δw0c = η δc·1 = 0.5·(-0.085)·1 = -0.043
Δw1a = η δa x1 = 0.5·(-0.003)·0 = 0    Δw1b = η δb x1 = 0.5·(-0.016)·0 = 0    Δwac = η δc oa = 0.5·(-0.085)·0.58 = -0.025
Δw2a = η δa x2 = 0.5·(-0.003)·0 = 0    Δw2b = η δb x2 = 0.5·(-0.016)·0 = 0    Δwbc = η δc ob = 0.5·(-0.085)·0.47 = -0.020
where ox = 1 / (1 + e^(-vx)) is the sigmoid output of node x.
• Weight updating

Neuron a: w0a = w0a + Δw0a = 0.34 - 0.0015 ≈ 0.339;  w1a = w1a + Δw1a = 0.13 + 0;  w2a = w2a + Δw2a = -0.92 + 0
Neuron b: w0b = w0b + Δw0b = -0.12 - 0.008 = -0.128;  w1b = w1b + Δw1b = 0.57 + 0;  w2b = w2b + Δw2b = -0.33 + 0
Neuron c: w0c = w0c + Δw0c = -0.99 - 0.043 = -1.033;  wac = wac + Δwac = 0.16 - 0.025 = 0.135;  wbc = wbc + Δwbc = 0.75 - 0.02 = 0.73

(using the updates computed above: Δw0a = -0.0015, Δw0b = -0.008, Δw0c = -0.043, Δw1a = Δw1b = Δw2a = Δw2b = 0, Δwac = -0.025, Δwbc = -0.020)
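A Python sketch that reproduces this worked step, assuming the logistic sigmoid with a = 1 and η = 0.5 as in the example (variable names are mine):

import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

eta = 0.5
x = [1.0, 0.0, 0.0]              # training instance (x0, x1, x2)
t_c = 0.0                        # target for output node c
w_a = [0.34, 0.13, -0.92]        # w0a, w1a, w2a
w_b = [-0.12, 0.57, -0.33]       # w0b, w1b, w2b
w_c = [-0.99, 0.16, 0.75]        # w0c, wac, wbc

# forward pass
v_a = sum(w * xi for w, xi in zip(w_a, x)); o_a = sigmoid(v_a)   # v_a = 0.34, o_a ~ 0.58
v_b = sum(w * xi for w, xi in zip(w_b, x)); o_b = sigmoid(v_b)   # v_b = -0.12, o_b ~ 0.47
v_c = w_c[0] * 1.0 + w_c[1] * o_a + w_c[2] * o_b                 # v_c ~ -0.54
o_c = sigmoid(v_c)                                               # o_c ~ 0.37

# backward pass: local gradients of the sigmoid units
d_c = o_c * (1 - o_c) * (t_c - o_c)       # ~ -0.085
d_a = o_a * (1 - o_a) * w_c[1] * d_c      # ~ -0.003
d_b = o_b * (1 - o_b) * w_c[2] * d_c      # ~ -0.016

# weight updates: Delta w = eta * delta * input feeding that weight
w_a = [w + eta * d_a * xi for w, xi in zip(w_a, x)]
w_b = [w + eta * d_b * xi for w, xi in zip(w_b, x)]
w_c = [w + eta * d_c * inp for w, inp in zip(w_c, (1.0, o_a, o_b))]
print(w_a, w_b, w_c)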
Outline
• Multi-layer Neural Networks
• Feedforward Neural Networks
  – FF NN model
  – Backpropagation (BP) Algorithm
  – BP rules derivation
  – Practical Issues of FFNN
• FFNN for Face Recognition
Weight Update Rule
The Backprop weight update rule is based on the gradient descent method: take a step in the direction yielding the maximum decrease of the network error E. This direction is the opposite of the gradient of E:
  Δwij = -η ∂E(W)/∂wij
  wij = wij + Δwij
Weight Update Rule
Using the chain rule we can write:
  ∂E(W)/∂wij = ∂E(W)/∂vj · ∂vj/∂wij
The input of a neuron j is
  vj = Σ_{i=0..m} wij xij
so ∂vj/∂wij = xij.
Moreover, if we define the local gradient of neuron j as
  δj = -∂E(W)/∂vj
then from Δwij = -η ∂E(W)/∂wij we get
  Δwij = η δj xij
[Figure: neuron j with inputs x0j, ..., xmj, weights wij, induced field vj, and output φ(vj)]
Weight update
So we have to compute the local gradient of each neuron,
  δj = -∂E(W)/∂vj.
There are two different rules for the two cases:
• j is an output neuron (the green nodes in the figure)
• j is a hidden neuron (the brown nodes in the figure)
[Figure: network with input layer, hidden layer, and output layer]
Weight update of output neuron
If j is an output neuron, then using the chain rule we obtain:
  δj = -∂E(W)/∂vj = -∂E(W)/∂ej · ∂ej/∂oj · ∂oj/∂vj = ej φ'(vj)
because ej = dj - oj and oj = φ(vj).
So for an output neuron j:
  δj = ej φ'(vj)
Substituting δj in Δwij = η δj xij, for the sigmoid (a = 1, so φ'(vj) = oj(1 - oj)) we get
  Δwij = η (dj - oj) oj(1 - oj) xij
Weight update of hidden neuron
For j a hidden node, the error is propagated back from the set C of neurons in the layer after j (the output layer, in a one-hidden-layer network):
  δj = -Σ_{k∈C} ∂E(W)/∂vk · ∂vk/∂vj = Σ_{k∈C} δk · ∂vk/∂vj
Using the chain rule:
  ∂vk/∂vj = ∂vk/∂oj · ∂oj/∂vj
Moreover, oj = φ(vj) and vk = Σ_j wjk oj, so
  ∂oj/∂vj = φ'(vj)  and  ∂vk/∂oj = wjk
Substituting in δj we get
  δj = φ'(vj) Σ_{k∈C} wjk δk = oj(1 - oj) Σ_{k∈C} wjk δk   (for the sigmoid)
Because vj = Σ_{i=0..m} wij xij, the weight update is
  Δwij = η δj xij = η xij oj(1 - oj) Σ_{k∈C} wjk δk
Error backpropagation
The flow-graph below illustrates how errors are back-propagated to the hidden neuron j:
[Figure: output neurons 1, ..., k, ..., m with errors e1, ..., ek, ..., em and derivatives φ'(v1), ..., φ'(vk), ..., φ'(vm) give local gradients δ1, ..., δk, ..., δm; these flow back through the weights wj1, ..., wjk, ..., wjm and combine with φ'(vj) to give δj]
Summary: Delta Rule
• Delta rule: Δwij = η δj xi, where
  δj = φ'(vj) (dj - oj)                      IF j is an output node
  δj = φ'(vj) Σ_{k of next layer} wjk δk     IF j is a hidden node
  and φ'(vj) = a oj(1 - oj) for sigmoid activation functions.
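A Python sketch of the two cases, assuming the sigmoid activation with a = 1, so that φ'(vj) = oj(1 - oj) (names are illustrative):

def delta_output(o_j, d_j):
    # output node: delta_j = phi'(v_j) * (d_j - o_j)
    return o_j * (1 - o_j) * (d_j - o_j)

def delta_hidden(o_j, w_next, delta_next):
    # hidden node: delta_j = phi'(v_j) * sum_k w_jk * delta_k over the next layer
    return o_j * (1 - o_j) * sum(w * d for w, d in zip(w_next, delta_next))

def weight_update(eta, delta_j, x_i):
    # Delta w_ij = eta * delta_j * x_i
    return eta * delta_j * x_i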
Outline
• Multi-layer Neural Networks
• Feedforward Neural Networks
  – FF NN model
  – Backpropagation (BP) Algorithm
  – BP rules derivation
  – Practical Issues of FFNN
• FFNN for Face Recognition
Network training:
Two types of network training:
• Incremental mode (on-line, stochastic, or per-observation): weights are updated after each instance is presented.
• Batch mode (off-line or per-epoch): weights are updated after all the patterns are presented.
Backprop algorithm (incremental mode)

n = 1; initialize w(n) randomly;
while (stopping criterion not satisfied and n < max_iterations)
    for each example (x, d)
        - run the network with input x and compute the output y
        - update the weights in backward order, starting from those of the output layer:
              wij = wij + Δwij
          with Δwij computed using the (generalized) Delta rule
    end-for
    n = n + 1;
end-while

Choose a randomized ordering for selecting the examples in the training set, in order to avoid poor performance.
Backprop algorithm (batch mode)
• In batch mode the weights are updated only after all examples have been processed, using the formula
  wij = wij + Σ_{x ∈ training set} Δwij(x)
• The learning process continues on an epoch-by-epoch basis until the stopping condition is satisfied.
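A Python sketch contrasting the two modes for a single weight; the per-example (δj, xij) pairs are assumed to be given by the backward pass (in a real incremental run they would be recomputed after every update):

def incremental_update(w_ij, eta, per_example):
    # incremental (on-line) mode: update after each example
    for delta_j, x_ij in per_example:
        w_ij += eta * delta_j * x_ij
    return w_ij

def batch_update(w_ij, eta, per_example):
    # batch (off-line) mode: one update per epoch, summing over all examples
    return w_ij + eta * sum(delta_j * x_ij for delta_j, x_ij in per_example)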
Stopping criteria
• Sensible stopping criteria:
  – Total mean squared error change: Backprop is considered to have converged when the absolute rate of change in the average squared error per epoch is sufficiently small (typically in the range [0.01, 0.1]).
  – Generalization-based criterion: after each epoch the NN is tested for generalization using a different set of examples (the validation set). If the generalization performance is adequate, then stop.
Use of Available Data Set for Training
The available data set is normally split into three sets, as follows:
• Training set – used to update the weights. Patterns in this set are presented repeatedly, in random order. The weight update equation is applied after a certain number of patterns.
• Validation set – used to decide when to stop training, by monitoring the error.
• Test set – used to test the performance of the neural network. It should not be used as part of the neural network development cycle.
Early Stopping - Good Generalization
• Running too many epochs may overtrain the network, resulting in overfitting and poor generalization.
• Keep a hold-out validation set and test the error after every epoch. Maintain the weights of the best-performing network on the validation set, and stop training when the validation error increases beyond this point.
[Figure: training-set error keeps decreasing with the number of epochs, while validation-set error eventually starts increasing]
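A Python sketch of this early-stopping scheme; train_one_epoch, validation_error, get_weights, set_weights, and the patience parameter are illustrative placeholders, not routines defined in the slides:

def train_with_early_stopping(train_one_epoch, validation_error,
                              get_weights, set_weights, max_epochs, patience=1):
    best_error, best_weights, bad_epochs = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_one_epoch()                      # one pass over the training set
        err = validation_error()               # error on the hold-out validation set
        if err < best_error:
            best_error, best_weights, bad_epochs = err, get_weights(), 0
        else:
            bad_epochs += 1
            if bad_epochs > patience:          # validation error keeps increasing
                break
    set_weights(best_weights)                  # restore the best-performing network
    return best_error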
Model Selection by Cross-validation
• Too few hidden units prevent the network from adequately fitting the data and learning the concept.
• Too many hidden units lead to overfitting.
• Similar cross-validation methods can be used to determine an appropriate number of hidden units: use the validation error to select the model with the optimal number of hidden layers and nodes.
[Figure: training-set and validation-set error versus number of epochs]
NN DESIGN
• Data representation
• Network Topology
• Network Parameters
• Training
Data Representation
• Data representation depends on the problem. In general, NNs work on continuous (real-valued) attributes. Therefore symbolic attributes are encoded into continuous ones.
• Attributes of different types may have different ranges of values, which affects the training process. Normalization may be used, like the following, which scales each attribute to values between 0 and 1:
  xi' = (xi - mini) / (maxi - mini)
  for each value xi of attribute i, where mini and maxi are the minimum and maximum values of that attribute over the training set.
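A one-function Python sketch of this normalization (names are illustrative):

def normalize_attribute(values):
    # x_i' = (x_i - min_i) / (max_i - min_i), with min/max taken over the training set
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]   # assumes hi > lo

# e.g. normalize_attribute([10.0, 20.0, 15.0]) -> [0.0, 1.0, 0.5]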
Network Topology
• The number of layers and of neurons depends on the specific task. In practice, this issue is solved by trial and error.
• Two types of adaptive algorithms can be used:
  – start from a large network and successively remove some neurons and links until network performance degrades;
  – begin with a small network and introduce new neurons until performance is satisfactory.
Network Parameters
• How are the weights initialized?
• How is the learning rate η chosen?
• How many hidden layers and how many neurons?
• How many examples in the training set?
Initialization of Weights
• In general, initial weights are randomly chosen, with typical values between -1.0 and 1.0 or -0.5 and 0.5.
• If some inputs are much larger than others, random initialization may bias the network to give much more importance to the larger inputs. In such a case, weights can be initialized as follows:
  wij = ± 1 / (2 m |xi|),  i = 1, ..., m          for weights from the input to the first layer
  wjk = ± 1 / (2 n |Σi wij xi|),  i = 1, ..., n   for weights from the first to the second layer
Choice of learning rate
• The right value of η depends on the application. Values between 0.1 and 0.9 have been used in many applications.
Size of Training Set
• Rule of thumb: the number of training examples should be at least five to ten times the number of weights in the network.
• Other rule:
  N ≥ |W| / (1 - a)
  where |W| = number of weights and a = expected accuracy.
Applications of FFNN
Classification, pattern recognition:
• FFNN can be applied to tackle non-linearly separable learning tasks:
  – Recognizing printed or handwritten characters
  – Face recognition
  – Classification of loan applications into credit-worthy and non-credit-worthy groups
  – Analysis of sonar and radar signals to determine the nature of the signal source
Regression and forecasting:
• FFNN can be applied to learn non-linear functions (regression), and in particular functions whose input is a sequence of measurements over time (time series).
Outline
• Multi-layer Neural Networks
• Feedforward Neural Networks
  – FF NN model
  – Backpropagation (BP) Algorithm
  – BP rules derivation
  – Practical Issues of FFNN
• FFNN for Face Recognition
Categorical attributes and multi-classes
• A categorical attribute is usually decomposed into a series of (0, 1) continuous attributes
  – one per value, indicating whether that attribute value is present or not.
• Each class corresponds to one output node; the desired output of the node is "1" for any instance belonging to this class (otherwise "0").
  – For each test instance, the final class label is determined by the output node with the maximum output value, as in the sketch below.
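A Python sketch of this encoding and of the output-node decision rule (names are illustrative):

def one_hot(value, categories):
    # decompose a categorical attribute into (0, 1) continuous attributes
    return [1.0 if value == c else 0.0 for c in categories]

def predicted_class(outputs, classes):
    # the final class label comes from the output node with the maximum value
    best = max(range(len(outputs)), key=lambda j: outputs[j])
    return classes[best]

# e.g. one_hot("red", ["red", "green", "blue"]) -> [1.0, 0.0, 0.0]
# e.g. predicted_class([0.1, 0.8, 0.3], ["cat", "dog", "car"]) -> "dog"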
A generalized delta rule
• If η is small, the algorithm learns the weights very slowly, while if η is large, the large changes of the weights may cause unstable behavior with oscillations of the weight values.
• A technique for tackling this problem is the introduction of a momentum term in the delta rule, which takes into account previous updates. We obtain the following generalized Delta rule:
  Δwij(n) = α Δwij(n-1) + η δj(n) xij(n),   0 ≤ α < 1 (momentum constant)
• The momentum term accelerates the descent in steady downhill directions.
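A Python sketch of this update; the defaults η = 0.5 and α = 0.9 are illustrative only:

def momentum_update(w_ij, prev_dw_ij, delta_j, x_ij, eta=0.5, alpha=0.9):
    # Delta w_ij(n) = alpha * Delta w_ij(n-1) + eta * delta_j(n) * x_ij(n), 0 <= alpha < 1
    dw_ij = alpha * prev_dw_ij + eta * delta_j * x_ij
    return w_ij + dw_ij, dw_ij   # new weight, plus the update to remember for step n+1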
Neural Net for object recognition from images
• Objective
  – Identify interesting objects from input images
• Face recognition
  – Locate faces, happy/sad faces, gender, face pose, orientation
  – Recognize specific faces: authorization
• Vehicle recognition (traffic control or safe-driving assistance)
  – Passenger car, van, pickup, bus, truck
• Traffic sign detection
• Challenges
  – Image size (100x100, 10240x10240)
  – Object size, pose, and orientation
  – Illumination
Example
Example: Face Detection Challenges
• Pose variation
• Lighting condition variation
• Facial expression variation
Normal procedures
• Training (identify your problem and build a specific model)
  – Build the training dataset
    • Isolate sample images (images containing faces)
    • Extract regions containing the objects (regions containing faces)
    • Normalization of size and illumination (e.g. 200x200)
    • Select counter-class examples (non-face regions)
  – Determine the Neural Net
    • The input layer is determined by the input images
      – E.g., a 200x200 image requires 40,000 input dimensions, each containing a value between 0 and 255
    • Neural net architecture
      – A three-layer FF NN (two hidden layers) is a common practice
    • The output layer is determined by the learning problem
      – Bi-class classification or multi-class classification
  – Train the Neural Net
Normal procedures
• Test
  – Given a test image:
    • Select a small region (considering all possibilities of object location and size)
      – Scanning from the top left to the bottom right
      – Sampling at different scale levels
    • Feed the region into the network and determine whether this region contains the object or not
    • Repeat the above process (which is time consuming); see the sketch below
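A Python sketch of this scanning procedure; the window size, step size, scale levels, the rescale helper, and the classify_region callback are illustrative placeholders:

def rescale(image, s):
    # crude nearest-neighbour down-scaling, for illustration only
    stride = max(1, int(round(1.0 / s)))
    return [row[::stride] for row in image[::stride]]

def scan_image(image, classify_region, window=200, step=20, scales=(1.0, 0.5, 0.25)):
    # slide a window over the image at several scales and feed each region to the FFNN
    detections = []
    for s in scales:
        img = rescale(image, s)
        h, w = len(img), len(img[0])
        for top in range(0, h - window + 1, step):
            for left in range(0, w - window + 1, step):
                region = [row[left:left + window] for row in img[top:top + window]]
                if classify_region(region):        # network says "object present"
                    detections.append((s, top, left))
    return detections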
CMU Neural Nets for Face Pose Recognition
Head pose (1-of-4): 90% accuracy
Face recognition (1-of-20): 90% accuracy
Neural Net Based Face Detection
• Large training set of faces and a small set of non-faces
• The training set of non-faces is built up automatically:
  – Start from a set of images containing no faces
  – Every 'face' detected in them is added to the non-face training set.
Traffic sign detection
• Demo
  – http://www.mathworks.com/products/demos/videoimage/traffic_sign/vipwarningsigns.html
• Intelligent traffic light control system
  – Instead of using loop detectors (like metal detectors), use surveillance video to detect vehicles and bicycles.
Vehicle Detection
• Intelligent vehicles aim at improving driving safety using machine vision techniques.
http://www.mobileye.com/visionRange.shtml
Term Project (1)
• Modify the CMU face recognition source code to train a classifier for one type of image classification problem
  – You identify your own objective (make your objective unique)
    • Gender recognition, kid/adult recognition, etc.
  – Available source code (C, Unix)
  – Team
    • Maximum team members: 3
  – Due date: April 30
  – A written report (3 pages minimum)
    • Your objective
    • System architecture
    • Experimental results
Alternative choice (2)
• Alternatively, you can propose your own term project as well.
• Requirements
  – Must relate to neural networks and classification
  – Must have a clear objective
  – Must involve programming work
  – Must have experimental assessment results
  – Must have a written report (3 pages minimum)
  – Send me your proposal by April 4.
CMU NN face recognition source code
• Dr. Tom Mitchell (Machine Learning)– http://www.cs.cmu.edu/~tom/faces.html
• What is available?
  – Image dataset
    • Different classes of images: pose, expression, glasses, etc., in pgm format
  – Complete C source code
    • pgm image read/write
    • 3-layer feed-forward neural network architecture
    • Backpropagation learning algorithm
    • Weight visualization
  – Documentation
    • A 13-page document listing the details of the datasets and the source code
Outline
• Multi-layer Neural Networks
• Feedforward Neural Networks
  – FF NN model
  – Backpropagation (BP) Algorithm
  – BP rules derivation
  – Practical Issues of FFNN
• FFNN for Face Recognition