TRANSCRIPT
Multi-Layer Feedforward Neural Networks
CAP5615 Intro. to Neural Networks
Xingquan (Hill) Zhu
Outline
• Multi-layer Neural Networks
• Feedforward Neural Networks
  – FF NN model
  – Backpropagation (BP) Algorithm
  – BP rules derivation
  – Practical Issues of FFNN
• FFNN for Face Recognition
Multi-layer NN
[Figure: network with an input layer, a hidden layer, and an output layer]
• Between the input and output layers there are hidden layers, as illustrated above.
  – Hidden nodes do not directly send outputs to the external environment.
• Multi-layer NNs overcome the limitation of single-layer NNs: they can handle non-linearly separable learning tasks.
XOR problem
• Two classes, green and red, cannot be separated by one line, but they can be separated by two lines.
• The NN below, with two hidden nodes, realizes this non-linear separation: each hidden node represents one of the two blue lines.
[Figure: XOR points at (±1, ±1) in the (x1, x2) plane, and a network with inputs x1, x2, hidden nodes y1 and y2, and output node z (weights w0, w1, ..., w3)]
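A minimal Python sketch of this idea (the threshold weights below are illustrative choices, not the exact values in the figure): each hidden unit fires on one side of its line, and the output unit fires only for points between the two lines.

def step(v):
    # hard-threshold unit: fires when its weighted sum is non-negative
    return 1 if v >= 0 else 0

def xor_net(x1, x2):
    h1 = step(x1 + x2 - 0.5)      # hidden node 1: one separating line ("x1 OR x2")
    h2 = step(x1 + x2 - 1.5)      # hidden node 2: the other line ("x1 AND x2")
    return step(h1 - h2 - 0.5)    # output fires only between the two lines

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", xor_net(x1, x2))   # prints the XOR truth table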
Types of decision regions
• A network with a single node realizes a half-plane decision region:
  w0 + w1 x1 + w2 x2 ≥ 0  (and its complement w0 + w1 x1 + w2 x2 < 0)
• A one-hidden-layer network realizes a convex region: each hidden node realizes one of the lines bounding the convex region.
• A two-hidden-layer network realizes the union of three convex regions: each box represents a one-hidden-layer network realizing one convex region.
[Figures: a single node with inputs x1, x2 and weights w0, w1, w2; a one-hidden-layer network whose hidden nodes implement the lines L1-L4 bounding convex region P1; a two-hidden-layer network combining convex regions P1, P2, P3]
Different Non-Linearly Separable Problems

Structure    | Types of Decision Regions                          | Exclusive-OR Problem | Class Separation | Most General Region Shapes
Single-Layer | Half plane bounded by hyperplane                   | [figure]             | [figure]         | [figure]
Two-Layer    | Convex open or closed regions                      | [figure]             | [figure]         | [figure]
Three-Layer  | Arbitrary (complexity limited by number of nodes)  | [figure]             | [figure]         | [figure]

(The last three columns are illustrations of how regions containing classes A and B are separated.)
Outline
• Multi-layer Neural Networks
• Feedforward Neural Networks
  – FF NN model
  – Backpropagation (BP) Algorithm
  – BP rules derivation
  – Practical Issues of FFNN
• FFNN for Face Recognition
FFNN NEURON MODEL
• The classical learning algorithm of FFNN is based on the gradient descent method.
• The activation functions used in FFNN are continuous functions of the weights, differentiable everywhere.
  – A typical activation function is the Sigmoid Function.
FFNN NEURON MODEL
• A typical activation function is the Sigmoid Function:
  φ(vj) = 1 / (1 + e^(-a·vj)),  a > 0
• When a approaches 0, φ tends to a linear function.
• When a tends to infinity, φ tends to the step function.
[Figure: sigmoid curves φ(vj) rising from 0 to 1 over vj in [-10, 10], for increasing values of a]
where vj = Σ_i wij yi, with wij the weight of the link from node i to node j, and yi the output of node i.
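A small Python sketch of this neuron model (the slope a, the example weights, and the function names are illustrative):

import math

def sigmoid(v, a=1.0):
    # phi(v_j) = 1 / (1 + exp(-a * v_j)), a > 0
    return 1.0 / (1.0 + math.exp(-a * v))

def neuron_output(weights, inputs, a=1.0):
    # v_j = sum_i w_ij * y_i, then o_j = phi(v_j)
    v_j = sum(w_ij * y_i for w_ij, y_i in zip(weights, inputs))
    return sigmoid(v_j, a)

# e.g. a neuron with a bias input y_0 = 1 and two further inputs:
# neuron_output([0.34, 0.13, -0.92], [1.0, 0.0, 0.0]) -> about 0.58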
FFNN MODEL
• xij: the input from node i to node j
• wij: the weight from node i to node j
  – Δwij: the weight update amount from node i to node j
• ok: the output from node k
The objective of multi-layer NN
• The error of output neuron j after the activation of the network on the n-th training example (x(n), d(n)) is:
  ej(n) = dj(n) - oj(n)
• The network error is the sum of the squared errors of the output neurons:
  E(n) = 1/2 Σ_{j ∈ output nodes} ej(n)²
• The total mean squared error is the average of the network errors over the N training examples:
  E_AV(W) = (1/N) Σ_{n=1..N} E(n) = (1/(2N)) Σ_n Σ_j (dj(n) - oj(n))²
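A short Python sketch of these error measures (function and variable names are illustrative):

def network_error(d, o):
    # E(n) = 1/2 * sum over output nodes j of (d_j(n) - o_j(n))^2
    return 0.5 * sum((dj - oj) ** 2 for dj, oj in zip(d, o))

def total_mean_squared_error(targets, outputs):
    # E_AV = (1/N) * sum over the N training examples of E(n)
    return sum(network_error(d, o) for d, o in zip(targets, outputs)) / len(targets)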
Feedforward NN idea: the credit assignment problem
• The problem of assigning 'credit' or 'blame' to the individual elements (the hidden units) involved in forming the overall response of a learning system.
• In neural networks, the problem amounts to distributing the network error over the weights.
Outline
• Multi-layer Neural Networks
• Feedforward Neural Networks
  – FF NN model
  – Backpropagation (BP) Algorithm
  – BP rules derivation
  – Practical Issues of FFNN
• FFNN for Face Recognition
Training: Backprop algorithm
• Searches for weight values that minimize the total error of the network over the set of training examples.
• Repeats the following two passes:
  – Forward pass: compute the outputs of all units in the network, and the error of the output layer.
  – Backward pass: the network error is used for updating the weights (credit assignment problem).
• Starting at the output layer, the error is propagated backwards through the network, layer by layer. This is done by recursively computing the local gradient of each neuron.
Backprop
• Back-propagation training algorithm illustrated:
• Backprop adjusts the weights of the NN in order to minimize the network total mean squared error.
[Figure: forward step = network activation and error computation; backward step = error propagation]
BP Example
• XOR
  X0  X1  X2  |  Y
   1   0   0  |  0
   1   0   1  |  1
   1   1   0  |  1
   1   1   1  |  0
[Figure: network with inputs X0, X1, X2, hidden nodes a and b, and output node c; weights w0a, w1a, w2a, w0b, w1b, w2b, w0c, wac, wbc]
• η = 0.5; consider the training instance {(1, 0, 0), 0}.
Neuron a: w0a = 0.34, w1a = 0.13, w2a = -0.92;  va = 0.34,  oa = 0.58
Neuron b: w0b = -0.12, w1b = 0.57, w2b = -0.33;  vb = -0.12, ob = 0.47
Neuron c: w0c = -0.99, wac = 0.16, wbc = 0.75;   vc = -0.54, oc = 0.37

δc = oc(1 - oc)(tc - oc) = 0.37·(1 - 0.37)·(0 - 0.37) = -0.085
δa = oa(1 - oa) Σ_k wak δk = 0.58·(1 - 0.58)·0.16·(-0.085) = -0.003
δb = ob(1 - ob) Σ_k wbk δk = 0.47·(1 - 0.47)·0.75·(-0.085) = -0.016

Δw0a = η δa x0 = 0.5·(-0.003)·1 = -0.0015
Δw0b = η δb x0 = 0.5·(-0.016)·1 = -0.008
Δw0c = η δc·1 = 0.5·(-0.085)·1 = -0.043
Δw1a = η δa x1 = 0.5·(-0.003)·0 = 0    Δw1b = η δb x1 = 0.5·(-0.016)·0 = 0    Δwac = η δc oa = 0.5·(-0.085)·0.58 = -0.025
Δw2a = η δa x2 = 0.5·(-0.003)·0 = 0    Δw2b = η δb x2 = 0.5·(-0.016)·0 = 0    Δwbc = η δc ob = 0.5·(-0.085)·0.47 = -0.020
where ox = 1 / (1 + e^(-vx)) is the sigmoid output of node x.
• Weight updating

Neuron a: w0a = w0a + Δw0a = 0.34 - 0.0015 ≈ 0.339;  w1a = w1a + Δw1a = 0.13 + 0;  w2a = w2a + Δw2a = -0.92 + 0
Neuron b: w0b = w0b + Δw0b = -0.12 - 0.008 = -0.128;  w1b = w1b + Δw1b = 0.57 + 0;  w2b = w2b + Δw2b = -0.33 + 0
Neuron c: w0c = w0c + Δw0c = -0.99 - 0.043 = -1.033;  wac = wac + Δwac = 0.16 - 0.025 = 0.135;  wbc = wbc + Δwbc = 0.75 - 0.02 = 0.73

(using the updates computed above: Δw0a = -0.0015, Δw0b = -0.008, Δw0c = -0.043, Δw1a = Δw1b = Δw2a = Δw2b = 0, Δwac = -0.025, Δwbc = -0.020)
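A Python sketch that reproduces this worked step, assuming the logistic sigmoid with a = 1 and η = 0.5 as in the example (variable names are mine):

import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

eta = 0.5
x = [1.0, 0.0, 0.0]              # training instance (x0, x1, x2)
t_c = 0.0                        # target for output node c
w_a = [0.34, 0.13, -0.92]        # w0a, w1a, w2a
w_b = [-0.12, 0.57, -0.33]       # w0b, w1b, w2b
w_c = [-0.99, 0.16, 0.75]        # w0c, wac, wbc

# forward pass
v_a = sum(w * xi for w, xi in zip(w_a, x)); o_a = sigmoid(v_a)   # v_a = 0.34, o_a ~ 0.58
v_b = sum(w * xi for w, xi in zip(w_b, x)); o_b = sigmoid(v_b)   # v_b = -0.12, o_b ~ 0.47
v_c = w_c[0] * 1.0 + w_c[1] * o_a + w_c[2] * o_b                 # v_c ~ -0.54
o_c = sigmoid(v_c)                                               # o_c ~ 0.37

# backward pass: local gradients of the sigmoid units
d_c = o_c * (1 - o_c) * (t_c - o_c)       # ~ -0.085
d_a = o_a * (1 - o_a) * w_c[1] * d_c      # ~ -0.003
d_b = o_b * (1 - o_b) * w_c[2] * d_c      # ~ -0.016

# weight updates: Delta w = eta * delta * input feeding that weight
w_a = [w + eta * d_a * xi for w, xi in zip(w_a, x)]
w_b = [w + eta * d_b * xi for w, xi in zip(w_b, x)]
w_c = [w + eta * d_c * inp for w, inp in zip(w_c, (1.0, o_a, o_b))]
print(w_a, w_b, w_c)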
Outline
• Multi-layer Neural Networks
• Feedforward Neural Networks
  – FF NN model
  – Backpropagation (BP) Algorithm
  – BP rules derivation
  – Practical Issues of FFNN
• FFNN for Face Recognition
Weight Update Rule
The Backprop weight update rule is based on the gradient descent method: take a step in the direction yielding the maximum decrease of the network error E. This direction is the opposite of the gradient of E:
  Δwij = -η ∂E(W)/∂wij
  wij = wij + Δwij
Weight Update Rule
Using the chain rule we can write:
  ∂E(W)/∂wij = ∂E(W)/∂vj · ∂vj/∂wij
The input of a neuron j is
  vj = Σ_{i=0..m} wij xij
so ∂vj/∂wij = xij.
Moreover, if we define the local gradient of neuron j as
  δj = -∂E(W)/∂vj
then from Δwij = -η ∂E(W)/∂wij we get
  Δwij = η δj xij
[Figure: neuron j with inputs x0j, ..., xmj, weights wij, induced field vj, and output φ(vj)]
Weight update
So we have to compute the local gradient of each neuron,
  δj = -∂E(W)/∂vj.
There are two different rules for the two cases:
• j is an output neuron (the green nodes in the figure)
• j is a hidden neuron (the brown nodes in the figure)
[Figure: network with input layer, hidden layer, and output layer]
Weight update of output neuron
If j is an output neuron, then using the chain rule we obtain:
  δj = -∂E(W)/∂vj = -∂E(W)/∂ej · ∂ej/∂oj · ∂oj/∂vj = ej φ'(vj)
because ej = dj - oj and oj = φ(vj).
So for an output neuron j:
  δj = ej φ'(vj)
Substituting δj in Δwij = η δj xij, for the sigmoid (a = 1, so φ'(vj) = oj(1 - oj)) we get
  Δwij = η (dj - oj) oj(1 - oj) xij
Weight update of hidden neuron
For j a hidden node, the error is propagated back from the set C of neurons in the layer after j (the output layer, in a one-hidden-layer network):
  δj = -Σ_{k∈C} ∂E(W)/∂vk · ∂vk/∂vj = Σ_{k∈C} δk · ∂vk/∂vj
Using the chain rule:
  ∂vk/∂vj = ∂vk/∂oj · ∂oj/∂vj
Moreover, oj = φ(vj) and vk = Σ_j wjk oj, so
  ∂oj/∂vj = φ'(vj)  and  ∂vk/∂oj = wjk
Substituting in δj we get
  δj = φ'(vj) Σ_{k∈C} wjk δk = oj(1 - oj) Σ_{k∈C} wjk δk   (for the sigmoid)
Because vj = Σ_{i=0..m} wij xij, the weight update is
  Δwij = η δj xij = η xij oj(1 - oj) Σ_{k∈C} wjk δk
Error backpropagation
The flow-graph below illustrates how errors are back-propagated to the hidden neuron j:
[Figure: output neurons 1, ..., k, ..., m with errors e1, ..., ek, ..., em and derivatives φ'(v1), ..., φ'(vk), ..., φ'(vm) give local gradients δ1, ..., δk, ..., δm; these flow back through the weights wj1, ..., wjk, ..., wjm and combine with φ'(vj) to give δj]
Summary: Delta Rule
• Delta rule: Δwij = η δj xi, where
  δj = φ'(vj) (dj - oj)                      IF j is an output node
  δj = φ'(vj) Σ_{k of next layer} wjk δk     IF j is a hidden node
  and φ'(vj) = a oj(1 - oj) for sigmoid activation functions.
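A Python sketch of the two cases, assuming the sigmoid activation with a = 1, so that φ'(vj) = oj(1 - oj) (names are illustrative):

def delta_output(o_j, d_j):
    # output node: delta_j = phi'(v_j) * (d_j - o_j)
    return o_j * (1 - o_j) * (d_j - o_j)

def delta_hidden(o_j, w_next, delta_next):
    # hidden node: delta_j = phi'(v_j) * sum_k w_jk * delta_k over the next layer
    return o_j * (1 - o_j) * sum(w * d for w, d in zip(w_next, delta_next))

def weight_update(eta, delta_j, x_i):
    # Delta w_ij = eta * delta_j * x_i
    return eta * delta_j * x_i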
Outline
• Multi-layer Neural Networks
• Feedforward Neural Networks
  – FF NN model
  – Backpropagation (BP) Algorithm
  – BP rules derivation
  – Practical Issues of FFNN
• FFNN for Face Recognition
Network training:
Two types of network training:
• Incremental mode (on-line, stochastic, or per-observation): weights are updated after each instance is presented.
• Batch mode (off-line or per-epoch): weights are updated after all the patterns are presented.
Backprop algorithm (incremental mode)

n = 1; initialize w(n) randomly;
while (stopping criterion not satisfied and n < max_iterations)
    for each example (x, d)
        - run the network with input x and compute the output y
        - update the weights in backward order, starting from those of the output layer:
              wij = wij + Δwij
          with Δwij computed using the (generalized) Delta rule
    end-for
    n = n + 1;
end-while

Choose a randomized ordering for selecting the examples in the training set, in order to avoid poor performance.
Backprop algorithm (batch mode)
• In batch mode the weights are updated only after all examples have been processed, using the formula
  wij = wij + Σ_{x ∈ training set} Δwij(x)
• The learning process continues on an epoch-by-epoch basis until the stopping condition is satisfied.
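A Python sketch contrasting the two modes for a single weight; the per-example (δj, xij) pairs are assumed to be given by the backward pass (in a real incremental run they would be recomputed after every update):

def incremental_update(w_ij, eta, per_example):
    # incremental (on-line) mode: update after each example
    for delta_j, x_ij in per_example:
        w_ij += eta * delta_j * x_ij
    return w_ij

def batch_update(w_ij, eta, per_example):
    # batch (off-line) mode: one update per epoch, summing over all examples
    return w_ij + eta * sum(delta_j * x_ij for delta_j, x_ij in per_example)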
Stopping criteria
• Sensible stopping criteria:
  – Total mean squared error change: Backprop is considered to have converged when the absolute rate of change in the average squared error per epoch is sufficiently small (typically in the range [0.01, 0.1]).
  – Generalization-based criterion: after each epoch the NN is tested for generalization using a different set of examples (the validation set). If the generalization performance is adequate, then stop.
Use of Available Data Set for Training
The available data set is normally split into three sets, as follows:
• Training set – used to update the weights. Patterns in this set are presented repeatedly, in random order. The weight update equation is applied after a certain number of patterns.
• Validation set – used to decide when to stop training, by monitoring the error.
• Test set – used to test the performance of the neural network. It should not be used as part of the neural network development cycle.
Early Stopping - Good Generalization
• Running too many epochs may overtrain the network, resulting in overfitting and poor generalization.
• Keep a hold-out validation set and test the error after every epoch. Maintain the weights of the best-performing network on the validation set, and stop training when the validation error increases beyond this point.
[Figure: training-set error keeps decreasing with the number of epochs, while validation-set error eventually starts increasing]
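A Python sketch of this early-stopping scheme; train_one_epoch, validation_error, get_weights, set_weights, and the patience parameter are illustrative placeholders, not routines defined in the slides:

def train_with_early_stopping(train_one_epoch, validation_error,
                              get_weights, set_weights, max_epochs, patience=1):
    best_error, best_weights, bad_epochs = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_one_epoch()                      # one pass over the training set
        err = validation_error()               # error on the hold-out validation set
        if err < best_error:
            best_error, best_weights, bad_epochs = err, get_weights(), 0
        else:
            bad_epochs += 1
            if bad_epochs > patience:          # validation error keeps increasing
                break
    set_weights(best_weights)                  # restore the best-performing network
    return best_error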
Model Selection by Cross-validation
• Too few hidden units prevent the network from adequately fitting the data and learning the concept.
• Too many hidden units lead to overfitting.
• Similar cross-validation methods can be used to determine an appropriate number of hidden units: use the validation error to select the model with the optimal number of hidden layers and nodes.
[Figure: training-set and validation-set error versus number of epochs]
NN DESIGN
• Data representation
• Network Topology
• Network Parameters
• Training
Data Representation
• Data representation depends on the problem. In general, NNs work on continuous (real-valued) attributes. Therefore symbolic attributes are encoded into continuous ones.
• Attributes of different types may have different ranges of values, which affects the training process. Normalization may be used, like the following, which scales each attribute to values between 0 and 1:
  xi' = (xi - mini) / (maxi - mini)
  for each value xi of attribute i, where mini and maxi are the minimum and maximum values of that attribute over the training set.
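A one-function Python sketch of this normalization (names are illustrative):

def normalize_attribute(values):
    # x_i' = (x_i - min_i) / (max_i - min_i), with min/max taken over the training set
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]   # assumes hi > lo

# e.g. normalize_attribute([10.0, 20.0, 15.0]) -> [0.0, 1.0, 0.5]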
Network Topology
• The number of layers and of neurons depends on the specific task. In practice, this issue is solved by trial and error.
• Two types of adaptive algorithms can be used:
  – start from a large network and successively remove some neurons and links until network performance degrades;
  – begin with a small network and introduce new neurons until performance is satisfactory.
Network Parameters
• How are the weights initialized?
• How is the learning rate η chosen?
• How many hidden layers and how many neurons?
• How many examples in the training set?
Initialization of Weights
• In general, initial weights are randomly chosen, with typical values between -1.0 and 1.0 or -0.5 and 0.5.
• If some inputs are much larger than others, random initialization may bias the network to give much more importance to the larger inputs. In such a case, weights can be initialized as follows:
  wij = ± 1 / (2 m |xi|),  i = 1, ..., m          for weights from the input to the first layer
  wjk = ± 1 / (2 n |Σi wij xi|),  i = 1, ..., n   for weights from the first to the second layer
Choice of learning rate
• The right value of η depends on the application. Values between 0.1 and 0.9 have been used in many applications.
Size of Training Set
• Rule of thumb: the number of training examples should be at least five to ten times the number of weights in the network.
• Other rule:
  N ≥ |W| / (1 - a)
  where |W| = number of weights and a = expected accuracy.
Applications of FFNN
Classification, pattern recognition:
• FFNN can be applied to tackle non-linearly separable learning tasks:
  – Recognizing printed or handwritten characters
  – Face recognition
  – Classification of loan applications into credit-worthy and non-credit-worthy groups
  – Analysis of sonar and radar signals to determine the nature of the signal source
Regression and forecasting:
• FFNN can be applied to learn non-linear functions (regression), and in particular functions whose input is a sequence of measurements over time (time series).
Outline
• Multi-layer Neural Networks
• Feedforward Neural Networks
  – FF NN model
  – Backpropagation (BP) Algorithm
  – BP rules derivation
  – Practical Issues of FFNN
• FFNN for Face Recognition
Categorical attributes and multi-classes
• A categorical attribute is usually decomposed into a series of (0, 1) continuous attributes
  – one per value, indicating whether that attribute value is present or not.
• Each class corresponds to one output node; the desired output of the node is "1" for any instance belonging to this class (otherwise "0").
  – For each test instance, the final class label is determined by the output node with the maximum output value, as in the sketch below.
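A Python sketch of this encoding and of the output-node decision rule (names are illustrative):

def one_hot(value, categories):
    # decompose a categorical attribute into (0, 1) continuous attributes
    return [1.0 if value == c else 0.0 for c in categories]

def predicted_class(outputs, classes):
    # the final class label comes from the output node with the maximum value
    best = max(range(len(outputs)), key=lambda j: outputs[j])
    return classes[best]

# e.g. one_hot("red", ["red", "green", "blue"]) -> [1.0, 0.0, 0.0]
# e.g. predicted_class([0.1, 0.8, 0.3], ["cat", "dog", "car"]) -> "dog"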
A generalized delta rule
• If η is small, the algorithm learns the weights very slowly, while if η is large, the large changes of the weights may cause unstable behavior with oscillations of the weight values.
• A technique for tackling this problem is the introduction of a momentum term in the delta rule, which takes into account previous updates. We obtain the following generalized Delta rule:
  Δwij(n) = α Δwij(n-1) + η δj(n) xij(n),   0 ≤ α < 1 (momentum constant)
• The momentum term accelerates the descent in steady downhill directions.
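A Python sketch of this update; the defaults η = 0.5 and α = 0.9 are illustrative only:

def momentum_update(w_ij, prev_dw_ij, delta_j, x_ij, eta=0.5, alpha=0.9):
    # Delta w_ij(n) = alpha * Delta w_ij(n-1) + eta * delta_j(n) * x_ij(n), 0 <= alpha < 1
    dw_ij = alpha * prev_dw_ij + eta * delta_j * x_ij
    return w_ij + dw_ij, dw_ij   # new weight, plus the update to remember for step n+1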
Neural Net for object recognition from images
• Objective
  – Identify interesting objects from input images
• Face recognition
  – Locate faces, happy/sad faces, gender, face pose, orientation
  – Recognize specific faces: authorization
• Vehicle recognition (traffic control or safe-driving assistance)
  – Passenger car, van, pickup, bus, truck
• Traffic sign detection
• Challenges
  – Image size (100x100, 10240x10240)
  – Object size, pose, and orientation
  – Illumination
Example
Example: Face Detection Challenges
• Pose variation
• Lighting condition variation
• Facial expression variation
Normal procedures
• Training (identify your problem and build a specific model)
  – Build the training dataset
    • Isolate sample images (images containing faces)
    • Extract regions containing the objects (regions containing faces)
    • Normalization of size and illumination (e.g. 200x200)
    • Select counter-class examples (non-face regions)
  – Determine the Neural Net
    • The input layer is determined by the input images
      – E.g., a 200x200 image requires 40,000 input dimensions, each containing a value between 0 and 255
    • Neural net architecture
      – A three-layer FF NN (two hidden layers) is a common practice
    • The output layer is determined by the learning problem
      – Bi-class classification or multi-class classification
  – Train the Neural Net
Normal procedures
• Test
  – Given a test image:
    • Select a small region (considering all possibilities of object location and size)
      – Scanning from the top left to the bottom right
      – Sampling at different scale levels
    • Feed the region into the network and determine whether this region contains the object or not
    • Repeat the above process (which is time consuming); see the sketch below
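A Python sketch of this scanning procedure; the window size, step size, scale levels, the rescale helper, and the classify_region callback are illustrative placeholders:

def rescale(image, s):
    # crude nearest-neighbour down-scaling, for illustration only
    stride = max(1, int(round(1.0 / s)))
    return [row[::stride] for row in image[::stride]]

def scan_image(image, classify_region, window=200, step=20, scales=(1.0, 0.5, 0.25)):
    # slide a window over the image at several scales and feed each region to the FFNN
    detections = []
    for s in scales:
        img = rescale(image, s)
        h, w = len(img), len(img[0])
        for top in range(0, h - window + 1, step):
            for left in range(0, w - window + 1, step):
                region = [row[left:left + window] for row in img[top:top + window]]
                if classify_region(region):        # network says "object present"
                    detections.append((s, top, left))
    return detections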
CMU Neural Nets for Face Pose Recognition
Head pose (1-of-4): 90% accuracy
Face recognition (1-of-20): 90% accuracy
Neural Net Based Face Detection
• Large training set of faces and a small set of non-faces
• The training set of non-faces is built up automatically:
  – Start from a set of images containing no faces
  – Every 'face' detected in them is added to the non-face training set.
Traffic sign detection
• Demo
  – http://www.mathworks.com/products/demos/videoimage/traffic_sign/vipwarningsigns.html
• Intelligent traffic light control system
  – Instead of using loop detectors (like metal detectors), use surveillance video to detect vehicles and bicycles.
Vehicle Detection
• Intelligent vehicles aim at improving driving safety using machine vision techniques.
http://www.mobileye.com/visionRange.shtml
Term Project (1)
• Modify the CMU face recognition source code to train a classifier for one type of image classification problem
  – You identify your own objective (make your objective unique)
    • Gender recognition, kid/adult recognition, etc.
  – Available source code (C, Unix)
  – Team
    • Maximum team members: 3
  – Due date: April 30
  – A written report (3 pages minimum)
    • Your objective
    • System architecture
    • Experimental results
Alternative choice (2)
• Alternatively, you can propose your own term project as well.
• Requirements
  – Must relate to neural networks and classification
  – Must have a clear objective
  – Must involve programming work
  – Must have experimental assessment results
  – Must have a written report (3 pages minimum)
  – Send me your proposal by April 4.
CMU NN face recognition source code
• Dr. Tom Mitchell (Machine Learning)– http://www.cs.cmu.edu/~tom/faces.html
• What is available?
  – Image dataset
    • Different classes of images: pose, expression, glasses, etc., in pgm format
  – Complete C source code
    • pgm image read/write
    • 3-layer feed-forward neural network architecture
    • Backpropagation learning algorithm
    • Weight visualization
  – Documentation
    • A 13-page document listing the details of the datasets and the source code
Outline
• Multi-layer Neural Networks
• Feedforward Neural Networks
  – FF NN model
  – Backpropagation (BP) Algorithm
  – BP rules derivation
  – Practical Issues of FFNN
• FFNN for Face Recognition