TRANSCRIPT
NEURAL NETWORKS
Neural networks
Motivation
Humans are able to process complex tasks efficiently (perception, pattern recognition, reasoning, etc.)
Ability to learn from examples
Adaptability and fault tolerance

Engineering applications
Nonlinear approximation and classification
Learning (adaptation) from data: black-box modeling
Very-Large-Scale Integration (VLSI) implementation
Biological neuron
Soma: body of the neuron.
Dendrites: receptors (inputs) of the neuron.
Axon: output of the neuron; connected to dendrites of other neurons via synapses.
Synapses: transfer of information between neurons (electrochemical signals).
Neural networks
Biological neural networks
Neuron switching time: 0.001 second
Number of neurons: 10^11 (100 billion)
Number of synapses (connections): 10^14 (100 trillion)
Recognition time: 10^-3 s (milliseconds)
⇒ parallel computation

Artificial neural networks
Weighted connections amongst units
Highly parallel, distributed processing
Emphasis on tuning weights automatically
Use of neural networks
Input is high-dimensional
Output is multidimensional
Mathematical form of system is unknown
Interpretability of identified model is unimportant

Applications
Pattern recognition
Classification
Prediction
Modeling
Biological neural network    Artificial neural network
Soma                         Neuron
Dendrite                     Input
Axon                         Output
Synapse                      Weight
ANN: history
1943 Warren McCulloch & Walter Pitts
Definition of a neuron:
The activity of a neuron is an all-or-none process
The structure of the net does not change with time

Too simple a structure; however, they:
Proved that networks of their neurons could represent any finite logical expression
Used a massively parallel architecture
Provided an important foundation for further development
ANN: history
1949 Donald Hebb
Major contributions:
Recognized that information is stored in the weights of the synapses
Postulated a learning rule in which the weight change is proportional to the product of neuron activation values
Postulated a cell-assembly theory: repeated simultaneous activation of a weakly-connected cell group results in a more strongly connected group.
ANN: history
1957 Frank Rosenblatt
Defined first computer implementation: the perceptron
Attracted attention of engineers and physicists, using a model of biological vision
Defined information storage as being "in connections or associations rather than topographic representations"
Defined both self-organizing and supervised learning mechanisms
ANN: history
1959 Bernard Widrow & Marcian Hoff
Engineers who simulated networks on computers and implemented designs in hardware (Adaline and Madaline).
Formulated the Least Mean Squares (LMS) algorithm, which minimizes sum-squared error.
LMS adapts weights even when the classifier output is correct.
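The LMS rule can be sketched in a few lines. This is an illustrative Python sketch (the lecture's own code examples are in MATLAB); the target map d = 2*x1 - x2 and all constants are arbitrary choices for the demo.

```python
import random

def lms_fit(samples, eta=0.05, epochs=200):
    """Widrow-Hoff LMS: w <- w + eta * e * x, with e = d - w.x.
    The update uses the raw linear output, so the weights keep adapting
    even when a thresholded classification would already be correct."""
    w = [0.0, 0.0]
    for _ in range(epochs):
        for x, d in samples:
            e = d - (w[0] * x[0] + w[1] * x[1])  # error on the linear output
            w[0] += eta * e * x[0]               # LMS weight update
            w[1] += eta * e * x[1]
    return w

# Learn the linear map d = 2*x1 - x2 from noiseless samples
random.seed(0)
samples = []
for _ in range(50):
    x = (random.uniform(-1, 1), random.uniform(-1, 1))
    samples.append((x, 2 * x[0] - x[1]))
w = lms_fit(samples)
```

For a noiseless linear target, the weights converge toward the true coefficients (2, -1).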
ANN: history
1977 David Rumelhart
Introduced computer implementation of backpropagation learning and the delta rule

1982 John Hopfield
Implemented recurrent network
Developed way to minimize "energy" of network, defined stable states
First NNs on silicon chips built by AT&T using the Hopfield net
ANN: history
1989 Cybenko (approximation theory)
1990 Jang et al. (neuro-fuzzy systems)
1993 Barron (complexity vs. accuracy)
ADAPTIVE NETWORKS
Adaptive (neural) networks
Massively connected computational units inspired by the working of the human brain
Provide a mathematical model for biological neural networks (brains)
Characteristics:
learning from examples
adaptive and fault tolerant
robust for fulfilling complex tasks
Network classification
Learning methods (supervised, unsupervised)
Architectures (feedforward, recurrent)
Output types (binary, continuous)
Node types (uniform, hybrid)
Implementations (software, hardware)
Connection weights (adjustable, hard-wired)
Inspirations (biological, psychological)
Adaptive network architecture
Nodes are static (no dynamics) and parametric
Network can consist of heterogeneous nodes
Links have no weights or parameters associated with them
Node functions are differentiable except at a finite number of points

[Figure: example adaptive network with inputs x1 and x2, nodes 3-9 arranged in Layer 1, Layer 2 and Layer 3 (output layer), and outputs x8 and x9; adaptive and fixed nodes are distinguished.]
Adaptive networks categories
Feedforward
Recurrent
[Figures: a feedforward network and a recurrent network over the same nodes (inputs x1, x2; nodes 3-9; outputs x8, x9); the recurrent version contains feedback links.]
Adap. network representations
Layered
Topological ordering
[Figures: the same adaptive network drawn in layered form and redrawn as a topologically ordered chain x1, x2, 3, 4, 5, 6, 7, 8, 9, with outputs x8 and x9.]
Feedforward adaptive network
Static mapping between input and output spaces
Aim: construct a network to obtain a nonlinear mapping regulated by a data set (training data set) of desired input-output pairs of a target system to be modeled
Procedures: learning rules or adaptation algorithms (parameter adjustment to improve network performance)
Network performance: measured as the discrepancy between the desired output and the network's output for the same input (error measure)
Examples of adaptive networks
Adaptive network with single linear node:

x_3 = f_3(x_1, x_2; a_1, a_2, a_3) = a_1 x_1 + a_2 x_2 + a_3

Perceptron network (linear classifier):

x_3 = f_3(x_1, x_2; a_1, a_2, a_3) = a_1 x_1 + a_2 x_2 + a_3
x_4 = f_4(x_3) = 1 if x_3 >= 0, 0 if x_3 < 0

[Figures: a single linear node f3 with inputs x1 and x2; the perceptron adds a threshold node f4 producing x4 from x3.]
Examples of adaptive networks
Multilayer perceptron (3-3-2 neural network)
Node 7 computes:

x_7 = 1 / (1 + exp[-(w_{4,7} x_4 + w_{5,7} x_5 + w_{6,7} x_6 - t_7)])

Parameter set of node 7: {w_{4,7}, w_{5,7}, w_{6,7}, t_7} (weights w and threshold t)

[Figure: 3-3-2 MLP with inputs x1, x2, x3 (Layer 0, input layer), hidden nodes 4, 5, 6 (Layer 1, hidden layer) and output nodes 7, 8 (Layer 2, output layer) producing x7 and x8.]
SUPERVISED LEARNING NEURAL NETWORKS
Perceptron
Early (and popular) attempt to build intelligent and self-learning systems using simple components
Derived from the McCulloch-Pitts (1943) model of the biological neuron
Models output by weighted combinations of selected features (feature classifier)
Essentially a linear classifier
Incremental learning roughly based on gradient descent
Perceptron
[Figure: perceptron with input pattern x1, ..., x4 feeding a fixed feature-detection layer g1, ..., g4, whose outputs are combined with adaptive weights w1, ..., w4 and threshold θ to form the output.]
Training algorithm
Perceptron (Rosenblatt, 1958). Can only learn linearly separable functions.

Training algorithm:
1. Select an input vector x from the training data set
2. If the perceptron gives an incorrect response, modify all connection weights w_i
Training algorithm
Weight training:

w_i(l+1) = w_i(l) + Δw_i(l)

Weight correction is given by the delta rule:

Δw_i(l) = η x_i(l) e(l)

η: learning rate
e(l) = y_d(l) - y(l)
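The delta rule above can be sketched in plain Python (a hedged illustration, not the lecture's MATLAB code); the AND data set and all constants are arbitrary choices for the demo.

```python
def train_perceptron(data, eta=0.5, epochs=20):
    """Delta-rule training: w_i(l+1) = w_i(l) + eta * x_i(l) * e(l),
    with e(l) = y_d(l) - y(l).  The last weight acts on a constant
    input of 1 and plays the role of the (negated) threshold."""
    w = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for (x1, x2), yd in data:
            x = [x1, x2, 1.0]
            net = sum(wi * xi for wi, xi in zip(w, x))
            y = 1 if net >= 0 else 0                     # threshold output
            e = yd - y                                   # e = y_d - y
            w = [wi + eta * xi * e for wi, xi in zip(w, x)]
    return w

def classify(w, x1, x2):
    return 1 if w[0] * x1 + w[1] * x2 + w[2] >= 0 else 0

# The linearly separable AND function is learned in a few epochs
and_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w_and = train_perceptron(and_data)
```

Running the same loop on the XOR targets never converges, which is the point of the question that follows.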
Question: Can we represent a simple exclusive-OR (XOR) function with a single-layer perceptron?
XOR problem
[Figure: XOR patterns in the input plane (input 1 vs. input 2); o: class 1, x: class 2.]

How to classify the patterns correctly?

X  Y  Class
0  0  0
0  1  1
1  0  1
1  1  0

Linear classification is not possible!
Example
Linearly separable classifications
If classification is linearly separable, we can have any number of classes with a perceptron.
For example, consider classifying furniture according to height and width:
Example
Each category can be separated from the other 2 by a straight line:
3 straight lines
Each output node fires if the point is on the right side of its straight line.

More than one output node could fire at the same time!
Artificial neuron
x_i: i-th input of the neuron
w_i: synaptic strength (weight) for x_i
y = f(Σ_i w_i x_i): output signal

[Figure: neuron with inputs x1, x2, ..., xn, weights w1, ..., wn and output y.]
Types of neurons
Threshold (McCulloch and Pitts, 1943):

y = sign( Σ_{i=1}^{n} w_i x_i )

Other types of activation functions (net = Σ_i w_i x_i):

step:    y = 1 if net >= 0, 0 if net < 0
sigmoid: y = 1 / (1 + e^{-net})
linear:  y = net
Activation functions
Logistic*: f(x) = 1 / (1 + e^{-x})

Hyperbolic tangent*: f(x) = (1 - e^{-x}) / (1 + e^{-x}) = tanh(x/2)

Identity (linear): f(x) = x

*sigmoidal or squashing functions

[Figure: (a) logistic function, (b) hyperbolic tangent function, (c) identity function.]
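The three activation functions above are one-liners; here is a minimal Python sketch (the lecture's own code is MATLAB) useful for checking the identity tanh(x/2) = (1 - e^-x)/(1 + e^-x):

```python
import math

def logistic(x):
    """Logistic (sigmoidal) squashing function: f(x) = 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

def hyperbolic_tangent(x):
    """f(x) = (1 - e^-x) / (1 + e^-x), which equals tanh(x/2)."""
    return (1.0 - math.exp(-x)) / (1.0 + math.exp(-x))

def identity(x):
    """Identity (linear) activation: f(x) = x."""
    return x
```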
Single-layer perceptron (SLP)
A single-layer perceptron can only classify linearly separable patterns, regardless of the activation function used.
How to cope with problems which are not linearly separable?

Using multilayer neural networks!
Multi-Layer Perceptron for XOR
[Figure: two-layer perceptron solving XOR. Inputs x1, x2 feed hidden threshold units x3 and x4 (weights +1, thresholds 1.5 and 0.5 respectively), which feed output unit x5 (weights -1 and +1, threshold 0.5). Panels: (a) x3 over (x1, x2), (b) x4 over (x1, x2), (c) x5 over (x3, x4), (d) x5 over (x1, x2).]
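A two-layer construction of this kind can be checked directly in code. This Python sketch assumes one standard weight assignment consistent with the thresholds in the figure (1.5, 0.5, 0.5): x3 acts as AND, x4 as OR, and x5 combines them.

```python
def step(net):
    """Hard-limit activation: 1 if net >= 0, else 0."""
    return 1 if net >= 0 else 0

def xor_mlp(x1, x2):
    """Two-layer perceptron for XOR (one possible weight assignment):
    x3 = AND(x1, x2), x4 = OR(x1, x2), x5 = x4 AND NOT x3."""
    x3 = step(x1 + x2 - 1.5)   # fires only for (1, 1)
    x4 = step(x1 + x2 - 0.5)   # fires for any nonzero input
    x5 = step(x4 - x3 - 0.5)   # "OR but not AND" = XOR
    return x5
```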
Backpropagation MLP
Most commonly used NN structure for applications in a wide range of areas:
Pattern recognition, signal processing, data compression and automatic control

Well-known applications:
NETtalk: trained an MLP to pronounce English text
Carnegie Mellon University's ALVINN (Autonomous Land Vehicle in a Neural Network) used an NN for steering an autonomous vehicle
Optical Character Recognition (OCR)
Multi-Layer Perceptron
Can learn functions that are not linearly separable.
[Figure: multilayer perceptron with input layer, 1st hidden layer, 2nd hidden layer and output layer; input signals propagate left to right to the output signals.]
Most common MLP
[Figure: the most common MLP: inputs x1, ..., xn, one hidden layer of m neurons with weights w^h_{ij} and biases b^h_j, and an output layer of l neurons with weights w^o_{jk} and biases b^o_k, producing outputs y1, ..., yl.]
Most common MLP
Output of neurons in the hidden layer, h_j (sigmoidal):

h_j = tanh( Σ_{i=1}^{n} w^h_{ij} x_i + b^h_j ) = tanh( Σ_{i=0}^{n} w^h_{ij} x_i )

Output of neurons in the output layer, y_k (linear):

y_k = Σ_{j=1}^{m} w^o_{jk} h_j + b^o_k = Σ_{j=0}^{m} w^o_{jk} h_j
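The two layer equations above amount to a short forward pass. A minimal Python sketch (the lecture's code is MATLAB; the 2-3-1 weights below are arbitrary illustrative values):

```python
import math

def mlp_forward(x, Wh, bh, Wo, bo):
    """Forward pass of the most common MLP: tanh hidden units, linear output.
    h_j = tanh(sum_i w^h_ij x_i + b^h_j);  y_k = sum_j w^o_jk h_j + b^o_k."""
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(Wh, bh)]
    return [sum(w * hj for w, hj in zip(row, h)) + b
            for row, b in zip(Wo, bo)]

# A tiny 2-3-1 network with arbitrary weights
Wh = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
bh = [0.0, 0.1, -0.1]
Wo = [[1.0, -1.0, 0.5]]
bo = [0.2]
y = mlp_forward([1.0, 2.0], Wh, bh, Wo, bo)
```

With all weights zero, tanh(0) = 0, so the output reduces to the output bias.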
Learning in NN
Biological neural networks:
Synaptic connections amongst neurons which simultaneously exhibit high activity are strengthened.

Artificial neural networks:
Mathematical approximation of biological learning.
Error minimization (nonlinear optimization problem):
Error backpropagation (first-order gradient)
Newton methods (second-order gradient)
Levenberg-Marquardt (second-order gradient)
Conjugate gradients
...
Supervised learning
Training data:

X = [x_1^T x_2^T ... x_N^T]^T,   Y = [y_1^T y_2^T ... y_N^T]^T

[Figure: the input x is fed to the network, whose output is compared with the desired output y to form the error e used for training.]
Error backpropagation
Initialize all weights and thresholds to small random numbers
Repeat
1. Input training examples and compute network and hidden-layer outputs
2. Adjust output weights using output error
3. Propagating output error backwards, adjust hidden-layer weights
Until satisfied with approximation
Backpropagation in MLP
Compute the output of the output layer, and compute the error:

e_k = y_{d,k} - y_k,   k = 1, ..., l

The cost function to be minimized is the following:

J(w) = (1/2) Σ_{k=1}^{l} Σ_{q=1}^{N} e_{kq}^2

N: number of data points
Learning using gradient
Output weight learning for output y_k:

w^o_{jk}(p+1) = w^o_{jk}(p) - η ∇J(w^o_{jk})

∇J(w^o_{jk}) = [ ∂J/∂w^o_{1k}, ∂J/∂w^o_{2k}, ..., ∂J/∂w^o_{mk} ]^T
Output-layer weights
y_k = Σ_{j=0}^{m} w^o_{jk} h_j,   e_k = y_{d,k} - y_k,   J(w) = (1/2) Σ_{k=1}^{l} e_k^2

[Figure: output neuron k with inputs h1, ..., hm, weights w^o_{1k}, ..., w^o_{mk}, bias w^o_{0k} and output y_k.]
Output-layer weights
Applying the chain rule:

∂J/∂w^o_{jk} = (∂J/∂e_k)(∂e_k/∂y_k)(∂y_k/∂w^o_{jk})

with

∂J/∂e_k = e_k,   ∂e_k/∂y_k = -1,   ∂y_k/∂w^o_{jk} = h_j

then

∂J/∂w^o_{jk} = -h_j e_k

Thus:

w^o_{jk}(p+1) = w^o_{jk}(p) - η ∇J(w^o_{jk}) = w^o_{jk}(p) + η h_j e_k

Recall that for the SLP: Δw_i = η x_i e
Hidden-layer weights
∂J/∂w^h_{ij} = (∂J/∂h_j)(∂h_j/∂net_j)(∂net_j/∂w^h_{ij})

net_j = Σ_{i=0}^{n} w^h_{ij} x_i,   h_j = tanh(net_j)

w^h_{ij}(p+1) = w^h_{ij}(p) - η ∇J(w^h_{ij})

[Figure: hidden neuron j with inputs x1, ..., xn, weights w^h_{1j}, ..., w^h_{nj}, bias w^h_{0j} and output h_j.]
Hidden-layer weights
Partial derivatives:

∂J/∂h_j = -Σ_{k=1}^{l} e_k w^o_{jk},   ∂h_j/∂net_j = 1 - h_j^2,   ∂net_j/∂w^h_{ij} = x_i

then

∂J/∂w^h_{ij} = -x_i (1 - h_j^2) Σ_{k=1}^{l} e_k w^o_{jk}

and

Δw^h_{ij}(p) = η x_i (1 - h_j^2) Σ_{k=1}^{l} e_k w^o_{jk}
Error backpropagation algorithm
Initialize all weights to small random numbers
Repeat:
1. Input training example and compute network outputs.
2. Adjust output weights using gradients:

w^o_{jk}(p+1) = w^o_{jk}(p) + η h_j e_k

3. Adjust hidden-layer weights:

w^h_{ij}(p+1) = w^h_{ij}(p) + η x_i (1 - h_j^2) Σ_{k=1}^{l} e_k w^o_{jk}

Until satisfied or a fixed number of epochs p
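The full algorithm can be sketched in Python with NumPy (an assumption; the lecture's examples are MATLAB). The sin(x) target, network size and learning constants below are arbitrary choices for the demo; the updates implement the two rules above with biases handled as weights on a constant input of 1.

```python
import numpy as np

def train_backprop(X, Yd, m=5, eta=0.05, epochs=3000, seed=0):
    """Online error backpropagation for a 1-hidden-layer MLP
    (tanh hidden units, linear outputs)."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    Wh = rng.uniform(-0.5, 0.5, (m, n + 1))            # hidden weights (+bias)
    Wo = rng.uniform(-0.5, 0.5, (Yd.shape[1], m + 1))  # output weights (+bias)
    for _ in range(epochs):
        for x, yd in zip(X, Yd):
            xb = np.append(x, 1.0)
            h = np.tanh(Wh @ xb)                    # hidden outputs h_j
            hb = np.append(h, 1.0)
            e = yd - Wo @ hb                        # output error e_k
            Wo += eta * np.outer(e, hb)             # w_o += eta * h_j * e_k
            delta = (1 - h**2) * (Wo[:, :m].T @ e)  # backpropagated error
            Wh += eta * np.outer(delta, xb)         # hidden-layer update
    return Wh, Wo

def mlp_out(x, Wh, Wo):
    return Wo @ np.append(np.tanh(Wh @ np.append(x, 1.0)), 1.0)

# Fit y = sin(x) on [-2, 2] (an arbitrary smooth target)
X = np.linspace(-2, 2, 21).reshape(-1, 1)
Yd = np.sin(X)
Wh, Wo = train_backprop(X, Yd)
mse_fit = np.mean([(mlp_out(x, Wh, Wo) - yd)**2 for x, yd in zip(X, Yd)])
```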
First-order gradient methods
[Figure: gradient descent on the cost surface J(w): from w_n, a step -η∇J(w_n) leads to w_{n+1}.]
Second-order gradient methods
Update rule for the weights:

w(p+1) = w(p) - H^{-1}(w(p)) ∇J(w(p))

w(p) = [..., w^h_{ij}, ..., w^o_{jk}, ...]^T

H(w) is the Hessian matrix of J with respect to w.

Learning does not depend on a learning coefficient.
Much more efficient in general.
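A minimal numeric sketch of the second-order update (Python/NumPy, an assumption; the quadratic cost and its coefficients are arbitrary for the demo). On a quadratic cost the Hessian is constant and a single Newton step lands exactly on the minimizer, which illustrates why no learning coefficient is needed.

```python
import numpy as np

# Quadratic cost J(w) = 0.5 w^T A w - b^T w, with gradient A w - b
# and constant Hessian A; the second-order update is w <- w - H^{-1} grad J.
A = np.array([[3.0, 1.0], [1.0, 2.0]])   # Hessian (positive definite)
b = np.array([1.0, -1.0])

w = np.array([5.0, -7.0])                # arbitrary starting point
grad = A @ w - b
w = w - np.linalg.solve(A, grad)         # one second-order step

w_star = np.linalg.solve(A, b)           # true minimizer of J
```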
Second-order gradient methods
[Figure: a second-order step on the cost surface J(w) jumps from w_n directly toward the minimum at w_{n+1}.]
Approximation power
General function approximators
"A feedforward neural network with one hidden layer and sigmoidal activation functions can approximate any continuous function arbitrarily well on a compact set" (Cybenko)
Intuitive relation to localized receptive fields
Few constructive results
Function approximation
y = w^o_1 tanh(w^h_1 x + b^h_1) + w^o_2 tanh(w^h_2 x + b^h_2)

[Figure: the two weighted summations z_1 = w^h_1 x + b^h_1 and z_2 = w^h_2 x + b^h_2 plotted as functions of x.]
Function approximation
[Figure: the signals transformed through tanh, v_1 = tanh(z_1) and v_2 = tanh(z_2), their weighted versions w^o_1 v_1 and w^o_2 v_2, and the summed network output y = w^o_1 v_1 + w^o_2 v_2.]
RADIAL BASIS FUNCTION NETWORKS
Radial Basis Function Networks (RBFN)
Feedforward neural networks where the hidden units do not implement an activation function; they represent a radial basis function.
Developed as an approach to improve accuracy and decrease training time complexity.
Radial Basis Function Networks
Activation functions are radial basis functions.
Activation level of the i-th receptive field (hidden unit):

R_i(x) = R_i( ‖x - u_i‖ / σ_i )

u_i: center of basis function
σ_i: spread of basis function

No connection weights between the input and hidden layers.

[Figure: RBFN with inputs x1, ..., xn, radial hidden units, output weights c_11, ..., c_ml and outputs y1, ..., yl.]
Radial Basis Function Networks
Localized activation functions. Gaussian and logistic:

R_i(x) = exp( -‖x - u_i‖^2 / (2 σ_i^2) )        R_i(x) = 1 / (1 + exp( ‖x - u_i‖^2 / σ_i^2 ))

Weighted sum or weighted average output:

y = Σ_{i=1}^{H} c_i R_i(x)        y = Σ_{i=1}^{H} c_i R_i(x) / Σ_{i=1}^{H} R_i(x)

c_i can be constants or functions of the inputs: c_i = a_i^T x + b_i
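The Gaussian receptive fields and both output types can be sketched in Python with NumPy (an assumption; the two centers, spreads and weights below are arbitrary demo values):

```python
import numpy as np

def rbfn_output(x, U, sigma, c, average=False):
    """RBFN with Gaussian receptive fields:
    R_i(x) = exp(-||x - u_i||^2 / (2 sigma_i^2)),
    output = weighted sum (or weighted average) of c_i R_i(x)."""
    d2 = np.sum((U - x)**2, axis=1)       # squared distances to the centers
    R = np.exp(-d2 / (2 * sigma**2))      # activation levels R_i(x)
    if average:
        return (c @ R) / np.sum(R)        # weighted average output
    return c @ R                          # weighted sum output

U = np.array([[0.0], [1.0]])   # two centers (arbitrary)
sigma = np.array([0.5, 0.5])
c = np.array([1.0, 3.0])
y0 = rbfn_output(np.array([0.0]), U, sigma, c)
```

At the midpoint between equal-spread centers both fields are equally active, so the weighted average returns the mean of the two weights.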
RBFN architecture
Weighted sum / weighted average
Localized activation functions in the hidden layer

[Figure: two RBFN architectures, one with a weighted-sum output and one with a weighted-average output.]
RBFN learning
Supervised learning to update all parameters (e.g. with Genetic Algorithms)
Sequential training: fix basis functions and then adjust output weights by:
orthogonal least squares
data clustering
soft competition based on "maximum likelihood estimate"
σ_i is sometimes estimated based on standard deviations
Many other schemes also exist
Least-squares estimate of weights
Given basis functions R_i and a set of input-output data [x_k, y_k], k = 1, ..., N, estimate the optimal weights c_ij.

1. Compute the output of the neurons:

R_{ki} = exp( -‖x_k - u_i‖^2 / (2 σ_i^2) )

The output is linear in the weights: y = R c.

2. Least-squares estimate:

c = [R^T R]^{-1} R^T y
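The two steps above can be sketched in Python with NumPy (an assumption; the centers, spread and x² target are arbitrary demo values). The pseudo-inverse formula is applied via a numerically safer least-squares solver.

```python
import numpy as np

# Fixed Gaussian basis functions, then solve y = R c for the weights:
# c = (R^T R)^{-1} R^T y
centers = np.linspace(0, 1, 5)   # u_i (arbitrary)
sigma = 0.3                      # common spread (arbitrary)

def design_matrix(x):
    """R[k, i] = exp(-(x_k - u_i)^2 / (2 sigma^2))."""
    return np.exp(-(x[:, None] - centers[None, :])**2 / (2 * sigma**2))

x = np.linspace(0, 1, 40)
y = x**2                                   # target function samples
R = design_matrix(x)
c = np.linalg.lstsq(R, y, rcond=None)[0]   # least-squares weights
mse_fit = np.mean((R @ c - y)**2)
```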
RBFN and Sugeno systems
Equivalent if the following hold:
Both RBFN and TS use the same aggregation method for the output (weighted sum or weighted average).
The number of basis functions in the RBFN equals the number of rules in the TS model.
TS uses Gaussian membership functions with the same spread (variance) as the basis functions, and rule firing strength is determined by the product operator.
The RBFN response functions (c_i) and the TS rule consequents are equal.
General function approximator
[Figure: illustration of the RBFN as a general function approximator.]
Approximation properties of NN
[Cybenko, 1989]: A feedforward NN with at least one hidden layer can approximate any continuous function R^p → R^n on a compact interval, if sufficient hidden neurons are available.
[Barron, 1993]: A feedforward NN with one hidden layer and sigmoidal activation functions can achieve an integrated squared error of the order J = O(1/h),
independently of the dimension of the input space p
h: number of hidden neurons (for smooth functions)
Approximation properties
For a basis function expansion (polynomial, trigonometric, singleton fuzzy model, etc.) with h terms, J = O(1/h^{2/p}), where p is the dimension of the input.
Examples:
1. p = 2: polynomial J = O(1/h^{2/2}) = O(1/h); neural net J = O(1/h)
2. p = 10, h = 21: polynomial J = O(1/21^{2/10}) = 0.54; neural net J = O(1/21) = 0.048
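The arithmetic behind example 2 is easy to verify (a trivial Python check; the values reproduce the 0.54 and 0.048 quoted above):

```python
# Error orders from the slide: a basis-function expansion with h terms
# gives J = O(1/h^(2/p)) in p input dimensions, while a one-hidden-layer
# sigmoidal NN gives J = O(1/h) independently of p (Barron, 1993).
p, h = 10, 21
basis_error = 1 / h ** (2 / p)   # O(1/h^(2/p)) for the basis expansion
nn_error = 1 / h                 # O(1/h) for the neural net
```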
Example of approximation

To achieve the same accuracy:

J = O(1/h_n) = O(1/h_b^{2/p})  ⇒  h_n = h_b^{2/p}

(h_n: hidden neurons of the neural net; h_b: terms of the basis expansion)

For p = 10 and h_n = 21: h_b = 21^{10/2} = 21^5 ≈ 4·10^6
Hopfield network
Recurrent ANN. Example (single-layer):

Learning capability is much higher.
Successive iterations may not necessarily converge; they may lead to chaotic behavior (unstable network).
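A minimal sketch of a single-layer recurrent net of this kind (Python/NumPy, an assumption; the stored pattern is arbitrary). With symmetric Hebbian outer-product weights and asynchronous updates, iterations settle into a stable minimum-energy state, recovering a stored pattern from a corrupted one.

```python
import numpy as np

def hopfield_recall(W, s, max_sweeps=10):
    """Asynchronous updates s_i <- sign(sum_j w_ij s_j) until stable."""
    s = s.copy()
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(s)):
            new = 1 if W[i] @ s >= 0 else -1
            if new != s[i]:
                s[i] = new
                changed = True
        if not changed:
            break                      # reached a stable state
    return s

# Store one bipolar pattern with the Hebbian outer-product rule
pattern = np.array([1, -1, 1, -1])
W = np.outer(pattern, pattern).astype(float)
np.fill_diagonal(W, 0)                 # no self-connections

noisy = np.array([1, 1, 1, -1])        # one flipped bit
recalled = hopfield_recall(W, noisy)
```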
NEURAL NETWORKS
MATLAB EXAMPLE (R2007b)
Feedforward backpropagation network
1. Input and target
P = [0 1 2 3 4 5 6 7 8 9 10]; % input
T = [0 1 2 3 4 3 2 1 2 3 4]; % target

2. Create net
help newff
net = newff(P,T,5);

3. Simulate and plot net
Yi = sim(net,P);
plot(P,T,'rs-',P,Yi,'-o')
legend('T','Yi',0), xlabel('P')
Feedforward backpropagation network
[Figure: plot of target T and initial network output Yi versus input P.]
Feedforward backpropagation network
4. Train the network for 50 epochs
net.trainParam.epochs = 50;
net = train(net,P,T);

5. Simulate net and plot the results
Y = sim(net,P);
figure,
plot(P,T,'rs-',P,Yi,'bo',P,Y,'g^')
legend('T','Yi','Y',0), xlabel('P')
Feedforward backpropagation network
[Figure: plot of target T, initial output Yi and trained output Y versus input P.]
Feedforward backpropagation network
Compute the mean absolute and squared errors:
ma_error = mae(T-Y)
ma_error = 0.1120

ms_error = mse(T-Y)
ms_error = 0.0169

Plot the network error:
figure, plot(P,T-Y,'o'), grid
ylabel('error'), xlabel('P')
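For readers without MATLAB, the two error measures are simple to reproduce; this Python sketch mirrors what mae(T-Y) and mse(T-Y) compute (the sample errors are arbitrary demo values):

```python
def mae(errors):
    """Mean absolute error, analogous to MATLAB's mae(T-Y)."""
    return sum(abs(e) for e in errors) / len(errors)

def mse(errors):
    """Mean squared error, analogous to MATLAB's mse(T-Y)."""
    return sum(e * e for e in errors) / len(errors)

err = [0.1, -0.2, 0.05]
```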
Feedforward backpropagation network
[Figure: plot of the network error T-Y versus input P, ranging roughly from -0.2 to 0.2.]
Feedforward backpropagation network
Check the parameters of the network:
net

Some important parameters:
inputs: {1x1 cell} of inputs
layers: {2x1 cell} of layers
outputs: {1x2 cell} containing 1 output
targets: {1x2 cell} containing 1 target
biases: {2x1 cell} containing 2 biases
inputWeights: {2x1 cell} containing 1 input weight
layerWeights: {2x2 cell} containing 1 layer weight
Feedforward backpropagation network
adaptFcn: 'trains'
initFcn: 'initlay'
performFcn: 'mse'
trainFcn: 'trainlm'
adaptParam: .passes
trainParam: .epochs, .goal, .show, .time
IW: {2x1 cell} containing 1 input weight matrix
LW: {2x2 cell} containing 1 layer weight matrix
b: {2x1 cell} containing 2 bias vectors
Feedforward backpropagation network
Note that every time a network is initialized, different random numbers are used for the weights. Example in the following:
Initialization and training of 10 networks
Computation of mean absolute error
Computation of mean squared error
Feedforward backpropagation network
MA_error = []; MS_error = [];
for i = 1:10
net = newff(P,T,5);
net.trainParam.epochs = 50;
net = train(net,P,T);
Y = sim(net,P);
MA_error = [MA_error mae(T-Y)];
MS_error = [MS_error mse(T-Y)];
end
Feedforward backpropagation network
figure,
subplot(2,1,1),plot(MA_error,'o'),grid,
title('Mean Absolute Error')
subplot(2,1,2),plot(MS_error,'o'),grid
title('Mean Squared Error')
Feedforward backpropagation network
[Figure: mean absolute error (top) and mean squared error (bottom) of the 10 trained networks.]