TRANSCRIPT
NEURAL NETWORKS
Neural networks
Motivation
Humans are able to process complex tasks efficiently (perception, pattern recognition, reasoning, etc.)
Ability to learn from examples
Adaptability and fault tolerance

Engineering applications
Nonlinear approximation and classification
Learning (adaptation) from data: black-box modeling
Very-Large-Scale Integration (VLSI) implementation
Biological neuron
Soma: body of the neuron.
Dendrites: receptors (inputs) of the neuron.
Axon: output of the neuron; connected to dendrites of other neurons via synapses.
Synapses: transfer of information between neurons (electrochemical signals).
Neural networks
Biological neural networks
Neuron switching time: 0.001 second
Number of neurons: 10^11 (100 billion)
Number of synapses (connections): 10^14 (100 trillion)
Recognition time: 10^-3 s (milliseconds)
⇒ parallel computation

Artificial neural networks
Weighted connections amongst units
Highly parallel, distributed processing
Emphasis on tuning weights automatically
Use of neural networks
Input is high-dimensional
Output is multidimensional
Mathematical form of system is unknown
Interpretability of identified model is unimportant

Applications
Pattern recognition
Classification
Prediction
Modeling
Biological neural network    Artificial neural network
Soma                         Neuron
Dendrite                     Input
Axon                         Output
Synapse                      Weight
ANN: history
1943 Warren McCulloch & Walter Pitts
Definition of a neuron:
The activity of a neuron is an all-or-none process
The structure of the net does not change with time

Too simple a structure; however, they:
Proved that networks of their neurons could represent any finite logical expression
Used a massively parallel architecture
Provided an important foundation for further development
ANN: history
1949 Donald Hebb
Major contributions:
Recognized that information is stored in the weights of the synapses
Postulated a learning rule in which the weight change is proportional to the product of neuron activation values
Postulated a cell-assembly theory: repeated simultaneous activation of a weakly-connected cell group results in a more strongly connected group.
ANN: history
1957 Frank Rosenblatt
Defined first computer implementation: the perceptron
Attracted attention of engineers and physicists, using a model of biological vision
Defined information storage as being "in connections or associations rather than topographic representations"
Defined both self-organizing and supervised learning mechanisms
ANN: history
1959 Bernard Widrow & Marcian Hoff
Engineers who simulated networks on computers and implemented designs in hardware (Adaline and Madaline).
Formulated the Least Mean Squares (LMS) algorithm, which minimizes sum-squared error.
LMS adapts weights even when the classifier output is correct.
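The LMS rule can be sketched in a few lines. This is an illustrative Python sketch (the lecture's own code examples are in MATLAB); the target map d = 2*x1 - x2 and all constants are arbitrary choices for the demo.

```python
import random

def lms_fit(samples, eta=0.05, epochs=200):
    """Widrow-Hoff LMS: w <- w + eta * e * x, with e = d - w.x.
    The update uses the raw linear output, so the weights keep adapting
    even when a thresholded classification would already be correct."""
    w = [0.0, 0.0]
    for _ in range(epochs):
        for x, d in samples:
            e = d - (w[0] * x[0] + w[1] * x[1])  # error on the linear output
            w[0] += eta * e * x[0]               # LMS weight update
            w[1] += eta * e * x[1]
    return w

# Learn the linear map d = 2*x1 - x2 from noiseless samples
random.seed(0)
samples = []
for _ in range(50):
    x = (random.uniform(-1, 1), random.uniform(-1, 1))
    samples.append((x, 2 * x[0] - x[1]))
w = lms_fit(samples)
```

For a noiseless linear target, the weights converge toward the true coefficients (2, -1).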
ANN: history
1977 David Rumelhart
Introduced computer implementation of backpropagation learning and the delta rule

1982 John Hopfield
Implemented recurrent network
Developed way to minimize "energy" of network, defined stable states
First NNs on silicon chips built by AT&T using the Hopfield net
ANN: history
1989 Cybenko (approximation theory)
1990 Jang et al. (neuro-fuzzy systems)
1993 Barron (complexity vs. accuracy)
ADAPTIVE NETWORKS
Adaptive (neural) networks
Massively connected computational units inspired by the working of the human brain
Provide a mathematical model for biological neural networks (brains)
Characteristics:
learning from examples
adaptive and fault tolerant
robust for fulfilling complex tasks
Network classification
Learning methods (supervised, unsupervised)
Architectures (feedforward, recurrent)
Output types (binary, continuous)
Node types (uniform, hybrid)
Implementations (software, hardware)
Connection weights (adjustable, hard-wired)
Inspirations (biological, psychological)
Adaptive network architecture
Nodes are static (no dynamics) and parametric
Network can consist of heterogeneous nodes
Links have no weights or parameters associated with them
Node functions are differentiable except at a finite number of points

[Figure: example adaptive network with inputs x1 and x2, nodes 3-9 arranged in Layer 1, Layer 2 and Layer 3 (output layer), and outputs x8 and x9; adaptive and fixed nodes are distinguished.]
Adaptive networks categories
Feedforward
Recurrent
[Figures: a feedforward network and a recurrent network over the same nodes (inputs x1, x2; nodes 3-9; outputs x8, x9); the recurrent version contains feedback links.]
Adap. network representations
Layered
Topological ordering
[Figures: the same adaptive network drawn in layered form and redrawn as a topologically ordered chain x1, x2, 3, 4, 5, 6, 7, 8, 9, with outputs x8 and x9.]
Feedforward adaptive network
Static mapping between input and output spaces
Aim: construct a network to obtain a nonlinear mapping regulated by a data set (training data set) of desired input-output pairs of a target system to be modeled
Procedures: learning rules or adaptation algorithms (parameter adjustment to improve network performance)
Network performance: measured as the discrepancy between the desired output and the network's output for the same input (error measure)
Examples of adaptive networks
Adaptive network with single linear node:

x_3 = f_3(x_1, x_2; a_1, a_2, a_3) = a_1 x_1 + a_2 x_2 + a_3

Perceptron network (linear classifier):

x_3 = f_3(x_1, x_2; a_1, a_2, a_3) = a_1 x_1 + a_2 x_2 + a_3
x_4 = f_4(x_3) = 1 if x_3 >= 0, 0 if x_3 < 0

[Figures: a single linear node f3 with inputs x1 and x2; the perceptron adds a threshold node f4 producing x4 from x3.]
Examples of adaptive networks
Multilayer perceptron (3-3-2 neural network)
Node 7 computes:

x_7 = 1 / (1 + exp[-(w_{4,7} x_4 + w_{5,7} x_5 + w_{6,7} x_6 - t_7)])

Parameter set of node 7: {w_{4,7}, w_{5,7}, w_{6,7}, t_7} (weights w and threshold t)

[Figure: 3-3-2 MLP with inputs x1, x2, x3 (Layer 0, input layer), hidden nodes 4, 5, 6 (Layer 1, hidden layer) and output nodes 7, 8 (Layer 2, output layer) producing x7 and x8.]
SUPERVISED LEARNING NEURAL NETWORKS
Perceptron
Early (and popular) attempt to build intelligent and self-learning systems using simple components
Derived from the McCulloch-Pitts (1943) model of the biological neuron
Models output by weighted combinations of selected features (feature classifier)
Essentially a linear classifier
Incremental learning roughly based on gradient descent
Perceptron
[Figure: perceptron with input pattern x1, ..., x4 feeding a fixed feature-detection layer g1, ..., g4, whose outputs are combined with adaptive weights w1, ..., w4 and threshold θ to form the output.]
Training algorithm
Perceptron (Rosenblatt, 1958). Can only learn linearly separable functions.

Training algorithm:
1. Select an input vector x from the training data set
2. If the perceptron gives an incorrect response, modify all connection weights w_i
Training algorithm
Weight training:

w_i(l+1) = w_i(l) + Δw_i(l)

Weight correction is given by the delta rule:

Δw_i(l) = η x_i(l) e(l)

η: learning rate
e(l) = y_d(l) - y(l)
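The delta rule above can be sketched in plain Python (a hedged illustration, not the lecture's MATLAB code); the AND data set and all constants are arbitrary choices for the demo.

```python
def train_perceptron(data, eta=0.5, epochs=20):
    """Delta-rule training: w_i(l+1) = w_i(l) + eta * x_i(l) * e(l),
    with e(l) = y_d(l) - y(l).  The last weight acts on a constant
    input of 1 and plays the role of the (negated) threshold."""
    w = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for (x1, x2), yd in data:
            x = [x1, x2, 1.0]
            net = sum(wi * xi for wi, xi in zip(w, x))
            y = 1 if net >= 0 else 0                     # threshold output
            e = yd - y                                   # e = y_d - y
            w = [wi + eta * xi * e for wi, xi in zip(w, x)]
    return w

def classify(w, x1, x2):
    return 1 if w[0] * x1 + w[1] * x2 + w[2] >= 0 else 0

# The linearly separable AND function is learned in a few epochs
and_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w_and = train_perceptron(and_data)
```

Running the same loop on the XOR targets never converges, which is the point of the question that follows.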
Question: Can we represent a simple exclusive-OR (XOR) function with a single-layer perceptron?
XOR problem
[Figure: XOR patterns in the input plane (input 1 vs. input 2); o: class 1, x: class 2.]

How to classify the patterns correctly?

X  Y  Class
0  0  0
0  1  1
1  0  1
1  1  0

Linear classification is not possible!
Example
Linearly separable classifications
If classification is linearly separable, we can have any number of classes with a perceptron.
For example, consider classifying furniture according to height and width:
Example
Each category can be separated from the other 2 by a straight line:
3 straight lines
Each output node fires if the point is on the right side of its straight line.

More than one output node could fire at the same time!
Artificial neuron
x_i: i-th input of the neuron
w_i: synaptic strength (weight) for x_i
y = f(Σ_i w_i x_i): output signal

[Figure: neuron with inputs x1, x2, ..., xn, weights w1, ..., wn and output y.]
Types of neurons
Threshold (McCulloch and Pitts, 1943):

y = sign( Σ_{i=1}^{n} w_i x_i )

Other types of activation functions (net = Σ_i w_i x_i):

step:    y = 1 if net >= 0, 0 if net < 0
sigmoid: y = 1 / (1 + e^{-net})
linear:  y = net
Activation functions
Logistic*: f(x) = 1 / (1 + e^{-x})

Hyperbolic tangent*: f(x) = (1 - e^{-x}) / (1 + e^{-x}) = tanh(x/2)

Identity (linear): f(x) = x

*sigmoidal or squashing functions

[Figure: (a) logistic function, (b) hyperbolic tangent function, (c) identity function.]
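The three activation functions above are one-liners; here is a minimal Python sketch (the lecture's own code is MATLAB) useful for checking the identity tanh(x/2) = (1 - e^-x)/(1 + e^-x):

```python
import math

def logistic(x):
    """Logistic (sigmoidal) squashing function: f(x) = 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

def hyperbolic_tangent(x):
    """f(x) = (1 - e^-x) / (1 + e^-x), which equals tanh(x/2)."""
    return (1.0 - math.exp(-x)) / (1.0 + math.exp(-x))

def identity(x):
    """Identity (linear) activation: f(x) = x."""
    return x
```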
Single-layer perceptron (SLP)
A single-layer perceptron can only classify linearly separable patterns, regardless of the activation function used.
How to cope with problems which are not linearly separable?

Using multilayer neural networks!
Multi-Layer Perceptron for XOR
[Figure: two-layer perceptron solving XOR. Inputs x1, x2 feed hidden threshold units x3 and x4 (weights +1, thresholds 1.5 and 0.5 respectively), which feed output unit x5 (weights -1 and +1, threshold 0.5). Panels: (a) x3 over (x1, x2), (b) x4 over (x1, x2), (c) x5 over (x3, x4), (d) x5 over (x1, x2).]
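A two-layer construction of this kind can be checked directly in code. This Python sketch assumes one standard weight assignment consistent with the thresholds in the figure (1.5, 0.5, 0.5): x3 acts as AND, x4 as OR, and x5 combines them.

```python
def step(net):
    """Hard-limit activation: 1 if net >= 0, else 0."""
    return 1 if net >= 0 else 0

def xor_mlp(x1, x2):
    """Two-layer perceptron for XOR (one possible weight assignment):
    x3 = AND(x1, x2), x4 = OR(x1, x2), x5 = x4 AND NOT x3."""
    x3 = step(x1 + x2 - 1.5)   # fires only for (1, 1)
    x4 = step(x1 + x2 - 0.5)   # fires for any nonzero input
    x5 = step(x4 - x3 - 0.5)   # "OR but not AND" = XOR
    return x5
```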
Backpropagation MLP
Most commonly used NN structure for applications in a wide range of areas:
Pattern recognition, signal processing, data compression and automatic control

Well-known applications:
NETtalk: trained an MLP to pronounce English text
Carnegie Mellon University's ALVINN (Autonomous Land Vehicle in a Neural Network) used an NN for steering an autonomous vehicle
Optical Character Recognition (OCR)
Multi-Layer Perceptron
Can learn functions that are not linearly separable.
[Figure: multilayer perceptron with input layer, 1st hidden layer, 2nd hidden layer and output layer; input signals propagate left to right to the output signals.]
Most common MLP
[Figure: the most common MLP: inputs x1, ..., xn, one hidden layer of m neurons with weights w^h_{ij} and biases b^h_j, and an output layer of l neurons with weights w^o_{jk} and biases b^o_k, producing outputs y1, ..., yl.]
Most common MLP
Output of neurons in the hidden layer, h_j (sigmoidal):

h_j = tanh( Σ_{i=1}^{n} w^h_{ij} x_i + b^h_j ) = tanh( Σ_{i=0}^{n} w^h_{ij} x_i )

Output of neurons in the output layer, y_k (linear):

y_k = Σ_{j=1}^{m} w^o_{jk} h_j + b^o_k = Σ_{j=0}^{m} w^o_{jk} h_j
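The two layer equations above amount to a short forward pass. A minimal Python sketch (the lecture's code is MATLAB; the 2-3-1 weights below are arbitrary illustrative values):

```python
import math

def mlp_forward(x, Wh, bh, Wo, bo):
    """Forward pass of the most common MLP: tanh hidden units, linear output.
    h_j = tanh(sum_i w^h_ij x_i + b^h_j);  y_k = sum_j w^o_jk h_j + b^o_k."""
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(Wh, bh)]
    return [sum(w * hj for w, hj in zip(row, h)) + b
            for row, b in zip(Wo, bo)]

# A tiny 2-3-1 network with arbitrary weights
Wh = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
bh = [0.0, 0.1, -0.1]
Wo = [[1.0, -1.0, 0.5]]
bo = [0.2]
y = mlp_forward([1.0, 2.0], Wh, bh, Wo, bo)
```

With all weights zero, tanh(0) = 0, so the output reduces to the output bias.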
Learning in NN
Biological neural networks:
Synaptic connections amongst neurons which simultaneously exhibit high activity are strengthened.

Artificial neural networks:
Mathematical approximation of biological learning.
Error minimization (nonlinear optimization problem):
Error backpropagation (first-order gradient)
Newton methods (second-order gradient)
Levenberg-Marquardt (second-order gradient)
Conjugate gradients
...
Supervised learning
Training data:

X = [x_1^T x_2^T ... x_N^T]^T,   Y = [y_1^T y_2^T ... y_N^T]^T

[Figure: the input x is fed to the network, whose output is compared with the desired output y to form the error e used for training.]
Error backpropagation
Initialize all weights and thresholds to small random numbers
Repeat
1. Input training examples and compute network and hidden-layer outputs
2. Adjust output weights using output error
3. Propagating output error backwards, adjust hidden-layer weights
Until satisfied with approximation
Backpropagation in MLP
Compute the output of the output layer, and compute the error:

e_k = y_{d,k} - y_k,   k = 1, ..., l

The cost function to be minimized is the following:

J(w) = (1/2) Σ_{k=1}^{l} Σ_{q=1}^{N} e_{kq}^2

N: number of data points
Learning using gradient
Output weight learning for output y_k:

w^o_{jk}(p+1) = w^o_{jk}(p) - η ∇J(w^o_{jk})

∇J(w^o_{jk}) = [ ∂J/∂w^o_{1k}, ∂J/∂w^o_{2k}, ..., ∂J/∂w^o_{mk} ]^T
Output-layer weights
y_k = Σ_{j=0}^{m} w^o_{jk} h_j,   e_k = y_{d,k} - y_k,   J(w) = (1/2) Σ_{k=1}^{l} e_k^2

[Figure: output neuron k with inputs h1, ..., hm, weights w^o_{1k}, ..., w^o_{mk}, bias w^o_{0k} and output y_k.]
Output-layer weights
Applying the chain rule:

∂J/∂w^o_{jk} = (∂J/∂e_k)(∂e_k/∂y_k)(∂y_k/∂w^o_{jk})

with

∂J/∂e_k = e_k,   ∂e_k/∂y_k = -1,   ∂y_k/∂w^o_{jk} = h_j

then

∂J/∂w^o_{jk} = -h_j e_k

Thus:

w^o_{jk}(p+1) = w^o_{jk}(p) - η ∇J(w^o_{jk}) = w^o_{jk}(p) + η h_j e_k

Recall that for the SLP: Δw_i = η x_i e
Hidden-layer weights
∂J/∂w^h_{ij} = (∂J/∂h_j)(∂h_j/∂net_j)(∂net_j/∂w^h_{ij})

net_j = Σ_{i=0}^{n} w^h_{ij} x_i,   h_j = tanh(net_j)

w^h_{ij}(p+1) = w^h_{ij}(p) - η ∇J(w^h_{ij})

[Figure: hidden neuron j with inputs x1, ..., xn, weights w^h_{1j}, ..., w^h_{nj}, bias w^h_{0j} and output h_j.]
Hidden-layer weights
Partial derivatives:

∂J/∂h_j = -Σ_{k=1}^{l} e_k w^o_{jk},   ∂h_j/∂net_j = 1 - h_j^2,   ∂net_j/∂w^h_{ij} = x_i

then

∂J/∂w^h_{ij} = -x_i (1 - h_j^2) Σ_{k=1}^{l} e_k w^o_{jk}

and

Δw^h_{ij}(p) = η x_i (1 - h_j^2) Σ_{k=1}^{l} e_k w^o_{jk}
Error backpropagation algorithm
Initialize all weights to small random numbers
Repeat:
1. Input training example and compute network outputs.
2. Adjust output weights using gradients:

w^o_{jk}(p+1) = w^o_{jk}(p) + η h_j e_k

3. Adjust hidden-layer weights:

w^h_{ij}(p+1) = w^h_{ij}(p) + η x_i (1 - h_j^2) Σ_{k=1}^{l} e_k w^o_{jk}

Until satisfied or a fixed number of epochs p
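The full algorithm can be sketched in Python with NumPy (an assumption; the lecture's examples are MATLAB). The sin(x) target, network size and learning constants below are arbitrary choices for the demo; the updates implement the two rules above with biases handled as weights on a constant input of 1.

```python
import numpy as np

def train_backprop(X, Yd, m=5, eta=0.05, epochs=3000, seed=0):
    """Online error backpropagation for a 1-hidden-layer MLP
    (tanh hidden units, linear outputs)."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    Wh = rng.uniform(-0.5, 0.5, (m, n + 1))            # hidden weights (+bias)
    Wo = rng.uniform(-0.5, 0.5, (Yd.shape[1], m + 1))  # output weights (+bias)
    for _ in range(epochs):
        for x, yd in zip(X, Yd):
            xb = np.append(x, 1.0)
            h = np.tanh(Wh @ xb)                    # hidden outputs h_j
            hb = np.append(h, 1.0)
            e = yd - Wo @ hb                        # output error e_k
            Wo += eta * np.outer(e, hb)             # w_o += eta * h_j * e_k
            delta = (1 - h**2) * (Wo[:, :m].T @ e)  # backpropagated error
            Wh += eta * np.outer(delta, xb)         # hidden-layer update
    return Wh, Wo

def mlp_out(x, Wh, Wo):
    return Wo @ np.append(np.tanh(Wh @ np.append(x, 1.0)), 1.0)

# Fit y = sin(x) on [-2, 2] (an arbitrary smooth target)
X = np.linspace(-2, 2, 21).reshape(-1, 1)
Yd = np.sin(X)
Wh, Wo = train_backprop(X, Yd)
mse_fit = np.mean([(mlp_out(x, Wh, Wo) - yd)**2 for x, yd in zip(X, Yd)])
```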
First-order gradient methods
[Figure: gradient descent on the cost surface J(w): from w_n, a step -η∇J(w_n) leads to w_{n+1}.]
Second-order gradient methods
Update rule for the weights:

w(p+1) = w(p) - H^{-1}(w(p)) ∇J(w(p))

w(p) = [..., w^h_{ij}, ..., w^o_{jk}, ...]^T

H(w) is the Hessian matrix of J with respect to w.

Learning does not depend on a learning coefficient.
Much more efficient in general.
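A minimal numeric sketch of the second-order update (Python/NumPy, an assumption; the quadratic cost and its coefficients are arbitrary for the demo). On a quadratic cost the Hessian is constant and a single Newton step lands exactly on the minimizer, which illustrates why no learning coefficient is needed.

```python
import numpy as np

# Quadratic cost J(w) = 0.5 w^T A w - b^T w, with gradient A w - b
# and constant Hessian A; the second-order update is w <- w - H^{-1} grad J.
A = np.array([[3.0, 1.0], [1.0, 2.0]])   # Hessian (positive definite)
b = np.array([1.0, -1.0])

w = np.array([5.0, -7.0])                # arbitrary starting point
grad = A @ w - b
w = w - np.linalg.solve(A, grad)         # one second-order step

w_star = np.linalg.solve(A, b)           # true minimizer of J
```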
Second-order gradient methods
[Figure: a second-order step on the cost surface J(w) jumps from w_n directly toward the minimum at w_{n+1}.]
Approximation power
General function approximators
"A feedforward neural network with one hidden layer and sigmoidal activation functions can approximate any continuous function arbitrarily well on a compact set" (Cybenko)
Intuitive relation to localized receptive fields
Few constructive results
Function approximation
y = w^o_1 tanh(w^h_1 x + b^h_1) + w^o_2 tanh(w^h_2 x + b^h_2)

[Figure: the two weighted summations z_1 = w^h_1 x + b^h_1 and z_2 = w^h_2 x + b^h_2 plotted as functions of x.]
Function approximation
[Figure: the signals transformed through tanh, v_1 = tanh(z_1) and v_2 = tanh(z_2), their weighted versions w^o_1 v_1 and w^o_2 v_2, and the summed network output y = w^o_1 v_1 + w^o_2 v_2.]
RADIAL BASIS FUNCTION NETWORKS
Radial Basis Function Networks (RBFN)
Feedforward neural networks where the hidden units do not implement an activation function; they represent a radial basis function.
Developed as an approach to improve accuracy and decrease training time complexity.
Radial Basis Function Networks
Activation functions are radial basis functions.
Activation level of the i-th receptive field (hidden unit):

R_i(x) = R_i( ‖x - u_i‖ / σ_i )

u_i: center of basis function
σ_i: spread of basis function

No connection weights between the input and hidden layers.

[Figure: RBFN with inputs x1, ..., xn, radial hidden units, output weights c_11, ..., c_ml and outputs y1, ..., yl.]
Radial Basis Function Networks
Localized activation functions. Gaussian and logistic:

R_i(x) = exp( -‖x - u_i‖^2 / (2 σ_i^2) )        R_i(x) = 1 / (1 + exp( ‖x - u_i‖^2 / σ_i^2 ))

Weighted sum or weighted average output:

y = Σ_{i=1}^{H} c_i R_i(x)        y = Σ_{i=1}^{H} c_i R_i(x) / Σ_{i=1}^{H} R_i(x)

c_i can be constants or functions of the inputs: c_i = a_i^T x + b_i
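The Gaussian receptive fields and both output types can be sketched in Python with NumPy (an assumption; the two centers, spreads and weights below are arbitrary demo values):

```python
import numpy as np

def rbfn_output(x, U, sigma, c, average=False):
    """RBFN with Gaussian receptive fields:
    R_i(x) = exp(-||x - u_i||^2 / (2 sigma_i^2)),
    output = weighted sum (or weighted average) of c_i R_i(x)."""
    d2 = np.sum((U - x)**2, axis=1)       # squared distances to the centers
    R = np.exp(-d2 / (2 * sigma**2))      # activation levels R_i(x)
    if average:
        return (c @ R) / np.sum(R)        # weighted average output
    return c @ R                          # weighted sum output

U = np.array([[0.0], [1.0]])   # two centers (arbitrary)
sigma = np.array([0.5, 0.5])
c = np.array([1.0, 3.0])
y0 = rbfn_output(np.array([0.0]), U, sigma, c)
```

At the midpoint between equal-spread centers both fields are equally active, so the weighted average returns the mean of the two weights.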
RBFN architecture
Weighted sum / weighted average
Localized activation functions in the hidden layer

[Figure: two RBFN architectures, one with a weighted-sum output and one with a weighted-average output.]
RBFN learning
Supervised learning to update all parameters (e.g. with Genetic Algorithms)
Sequential training: fix basis functions and then adjust output weights by:
orthogonal least squares
data clustering
soft competition based on "maximum likelihood estimate"
σ_i is sometimes estimated based on standard deviations
Many other schemes also exist
Least-squares estimate of weights
Given basis functions R_i and a set of input-output data [x_k, y_k], k = 1, ..., N, estimate the optimal weights c_ij.

1. Compute the output of the neurons:

R_{ki} = exp( -‖x_k - u_i‖^2 / (2 σ_i^2) )

The output is linear in the weights: y = R c.

2. Least-squares estimate:

c = [R^T R]^{-1} R^T y
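The two steps above can be sketched in Python with NumPy (an assumption; the centers, spread and x² target are arbitrary demo values). The pseudo-inverse formula is applied via a numerically safer least-squares solver.

```python
import numpy as np

# Fixed Gaussian basis functions, then solve y = R c for the weights:
# c = (R^T R)^{-1} R^T y
centers = np.linspace(0, 1, 5)   # u_i (arbitrary)
sigma = 0.3                      # common spread (arbitrary)

def design_matrix(x):
    """R[k, i] = exp(-(x_k - u_i)^2 / (2 sigma^2))."""
    return np.exp(-(x[:, None] - centers[None, :])**2 / (2 * sigma**2))

x = np.linspace(0, 1, 40)
y = x**2                                   # target function samples
R = design_matrix(x)
c = np.linalg.lstsq(R, y, rcond=None)[0]   # least-squares weights
mse_fit = np.mean((R @ c - y)**2)
```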
RBFN and Sugeno systems
Equivalent if the following hold:
Both RBFN and TS use the same aggregation method for the output (weighted sum or weighted average).
The number of basis functions in the RBFN equals the number of rules in the TS model.
TS uses Gaussian membership functions with the same spread (variance) as the basis functions, and rule firing strength is determined by the product operator.
The RBFN response functions (c_i) and the TS rule consequents are equal.
General function approximator
[Figure: illustration of the RBFN as a general function approximator.]
Approximation properties of NN
[Cybenko, 1989]: A feedforward NN with at least one hidden layer can approximate any continuous function R^p → R^n on a compact interval, if sufficient hidden neurons are available.
[Barron, 1993]: A feedforward NN with one hidden layer and sigmoidal activation functions can achieve an integrated squared error of the order J = O(1/h),
independently of the dimension of the input space p
h: number of hidden neurons (for smooth functions)
Approximation properties
For a basis function expansion (polynomial, trigonometric, singleton fuzzy model, etc.) with h terms, J = O(1/h^{2/p}), where p is the dimension of the input.
Examples:
1. p = 2: polynomial J = O(1/h^{2/2}) = O(1/h); neural net J = O(1/h)
2. p = 10, h = 21: polynomial J = O(1/21^{2/10}) = 0.54; neural net J = O(1/21) = 0.048
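The arithmetic behind example 2 is easy to verify (a trivial Python check; the values reproduce the 0.54 and 0.048 quoted above):

```python
# Error orders from the slide: a basis-function expansion with h terms
# gives J = O(1/h^(2/p)) in p input dimensions, while a one-hidden-layer
# sigmoidal NN gives J = O(1/h) independently of p (Barron, 1993).
p, h = 10, 21
basis_error = 1 / h ** (2 / p)   # O(1/h^(2/p)) for the basis expansion
nn_error = 1 / h                 # O(1/h) for the neural net
```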
Example of approximation

To achieve the same accuracy:

J = O(1/h_n) = O(1/h_b^{2/p})  ⇒  h_n = h_b^{2/p}

(h_n: hidden neurons of the neural net; h_b: terms of the basis expansion)

For p = 10 and h_n = 21: h_b = 21^{10/2} = 21^5 ≈ 4·10^6
Hopfield network
Recurrent ANN. Example (single-layer):

Learning capability is much higher.
Successive iterations may not necessarily converge; they may lead to chaotic behavior (unstable network).
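A minimal sketch of a single-layer recurrent net of this kind (Python/NumPy, an assumption; the stored pattern is arbitrary). With symmetric Hebbian outer-product weights and asynchronous updates, iterations settle into a stable minimum-energy state, recovering a stored pattern from a corrupted one.

```python
import numpy as np

def hopfield_recall(W, s, max_sweeps=10):
    """Asynchronous updates s_i <- sign(sum_j w_ij s_j) until stable."""
    s = s.copy()
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(s)):
            new = 1 if W[i] @ s >= 0 else -1
            if new != s[i]:
                s[i] = new
                changed = True
        if not changed:
            break                      # reached a stable state
    return s

# Store one bipolar pattern with the Hebbian outer-product rule
pattern = np.array([1, -1, 1, -1])
W = np.outer(pattern, pattern).astype(float)
np.fill_diagonal(W, 0)                 # no self-connections

noisy = np.array([1, 1, 1, -1])        # one flipped bit
recalled = hopfield_recall(W, noisy)
```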
NEURAL NETWORKS
MATLAB EXAMPLE (R2007b)
Feedforward backpropagation network
1. Input and target
P = [0 1 2 3 4 5 6 7 8 9 10]; % input
T = [0 1 2 3 4 3 2 1 2 3 4]; % target

2. Create net
help newff
net = newff(P,T,5);

3. Simulate and plot net
Yi = sim(net,P);
plot(P,T,'rs-',P,Yi,'-o')
legend('T','Yi',0), xlabel('P')
Feedforward backpropagation network
[Figure: plot of target T and initial network output Yi versus input P.]
Feedforward backpropagation network
4. Train the network for 50 epochs
net.trainParam.epochs = 50;
net = train(net,P,T);

5. Simulate net and plot the results
Y = sim(net,P);
figure,
plot(P,T,'rs-',P,Yi,'bo',P,Y,'g^')
legend('T','Yi','Y',0), xlabel('P')
Feedforward backpropagation network
[Figure: plot of target T, initial output Yi and trained output Y versus input P.]
Feedforward backpropagation network
Compute the mean absolute and squared errors:
ma_error = mae(T-Y)
ma_error = 0.1120

ms_error = mse(T-Y)
ms_error = 0.0169

Plot the network error:
figure, plot(P,T-Y,'o'), grid
ylabel('error'), xlabel('P')
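For readers without MATLAB, the two error measures are simple to reproduce; this Python sketch mirrors what mae(T-Y) and mse(T-Y) compute (the sample errors are arbitrary demo values):

```python
def mae(errors):
    """Mean absolute error, analogous to MATLAB's mae(T-Y)."""
    return sum(abs(e) for e in errors) / len(errors)

def mse(errors):
    """Mean squared error, analogous to MATLAB's mse(T-Y)."""
    return sum(e * e for e in errors) / len(errors)

err = [0.1, -0.2, 0.05]
```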
Feedforward backpropagation network
[Figure: plot of the network error T-Y versus input P, ranging roughly from -0.2 to 0.2.]
Feedforward backpropagation network
Check the parameters of the network:
net

Some important parameters:
inputs: {1x1 cell} of inputs
layers: {2x1 cell} of layers
outputs: {1x2 cell} containing 1 output
targets: {1x2 cell} containing 1 target
biases: {2x1 cell} containing 2 biases
inputWeights: {2x1 cell} containing 1 input weight
layerWeights: {2x2 cell} containing 1 layer weight
Feedforward backpropagation network
adaptFcn: 'trains'
initFcn: 'initlay'
performFcn: 'mse'
trainFcn: 'trainlm'
adaptParam: .passes
trainParam: .epochs, .goal, .show, .time
IW: {2x1 cell} containing 1 input weight matrix
LW: {2x2 cell} containing 1 layer weight matrix
b: {2x1 cell} containing 2 bias vectors
Feedforward backpropagation network
Note that every time a network is initialized, different random numbers are used for the weights. Example in the following:
Initialization and training of 10 networks
Computation of mean absolute error
Computation of mean squared error
Feedforward backpropagation network
MA_error = []; MS_error = [];
for i = 1:10
net = newff(P,T,5);
net.trainParam.epochs = 50;
net = train(net,P,T);
Y = sim(net,P);
MA_error = [MA_error mae(T-Y)];
MS_error = [MS_error mse(T-Y)];
end
Feedforward backpropagation network
figure,
subplot(2,1,1),plot(MA_error,'o'),grid,
title('Mean Absolute Error')
subplot(2,1,2),plot(MS_error,'o'),grid
title('Mean Squared Error')
Feedforward backpropagation network
[Figure: mean absolute error (top) and mean squared error (bottom) of the 10 trained networks.]