TRANSCRIPT
From Biological to Artificial Neural Networks
NPCDS/MITACS Spring School, Montreal, May 23-27, 2006
Helmut Kroger, Laval University
Outline
1. From biology to artificial NNs
2. Perceptron – supervised learning
3. Hopfield model – associative memory
4. Kohonen map – self-organization
References
1. Hertz J., Krogh A., and Palmer R.G. Introduction to the Theory of Neural Computation.
2. Duda R.O., Hart P.E., and Stork D.G. Pattern Classification. John Wiley, 2001.
3. Haykin S. Neural Networks. Prentice Hall International, 1999.
4. Bishop C. Neural Networks for Pattern Recognition. Oxford: Clarendon Press, 1995.
5. Ripley B.D. Pattern Recognition and Neural Networks. Cambridge University Press, Jan 1996.
6. Mueller B. and Reinhardt J. Neural Networks: An Introduction. Springer-Verlag, Berlin, 1991.
7. McCulloch W.S. and Pitts W. (1943). "A logical calculus of the ideas immanent in nervous activity", Bulletin of Mathematical Biophysics, 5, 115-137.
Use of NNs: Neural Networks Are For

Applications             Science
Character recognition    Neuroscience
Optimization             Physics, Mathematics
Data mining              Computer science
…                        …
Biological neural networks
Nerve cells are called neurons. Many different types exist. Neurons are extremely complex.
Approx. 10^11 neurons in the brain. Each neuron has about 10^3 connections.
Neural communication
Neurons transport information via electrical action potentials. At the synapse the transmission is mediated by chemical macromolecules (neurotransmitter proteins).
How two neurons communicate
Biology of neurons:
Single neurons are highly complex electrochemical devices.
Many forms of interneuron communication are now known, acting over many different spatial and temporal scales:
Local, gaseous, volume signaling, etc.
[Figure: schematic of a neuron with inputs, axon hillock, and output.]
The neuron is a computer:
(1) Information is distributed.
(2) The brain works in deterministic mode as well as in stochastic mode (generation of nerve signals, synaptic transmission).
(3) Associative memory: retrieval not by checking bit by bit, but by gradual reproduction of the whole.
From biology of the brain to information processing.
(4) Architecture of the brain:
- optimizes information transmission
- stable against errors
- physical constraints: oxygen, blood, energy, cooling
- Small-World Architecture
(5) 1/f frequency scaling in EEG: fractals, self-similarity.
Dynamical origin: model of self-organized criticality (sand-pile analogue).
Neural avalanches – do they transmit information? Does the brain work at some critical point?
(6) How is neural connectivity generated? Nerve growth: a random process, guided by chemical markers, enhanced by stochastic resonance.
Layered structure of neocortex
Organization of neocortex
Biological parameters of neocortex
Artificial Neural Networks
A network with interactions: an attempt to mimic the brain:
Unit elements are artificial neurons (linear or nonlinear input-output units).
Communications are encoded by weights, a measure of how strongly neurons affect each other.
Architectures can be feed-forward, feedback, or recurrent.
[Figure: artificial neuron with inputs x1, ..., xn and synaptic weights w1, ..., wn (w1: synaptic strength).]
y_i = f( Σ_j w_ij x_j − b_i )
Firing of a group of neurons: biology vs model
1. Multi-layer feed-forward network
Biological motivation for multi-layer feed-forward network:
Modeled after the visual cortex being multi-layered (6 layers). Dominantly feedforward. There is strong lateral inhibition: neurons in the same layer don't talk to each other.
High connectivity between neurons (10^4 per neuron) provides the basis for massive parallel computing.
High redundancy against errors.
The principal building blocks:
- Input vector x_k
- Weight matrix w_ik
- Activation function f
- Output vector y_i
- Dynamical update rule (given below)
Goal: find weights and an activation function such that for a given input the output is close to the desired output.
(Training supervised by a teacher)
y_i(t+1) = f( Σ_j w_ij x_j(t) )
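As an illustration of these building blocks, here is a minimal sketch (not from the slides) of one such update step in Python, using NumPy and a logistic activation as an assumed choice of f:

```python
import numpy as np

def logistic(a):
    # one possible choice of activation function f
    return 1.0 / (1.0 + np.exp(-a))

def feedforward_step(W, x, f=logistic):
    # y_i(t+1) = f( sum_j w_ij * x_j(t) )
    return f(W @ x)

# example: 3 inputs feeding 2 output units (arbitrary weights)
W = np.array([[0.5, -0.2, 0.1],
              [0.3,  0.8, -0.5]])
x = np.array([1.0, 0.0, 1.0])
print(feedforward_step(W, x))
```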
Examples of activation functions
The Perceptron (1962): a simple feed-forward network: 1 input layer, 1 output layer.
[Figure: perceptron computing logical AND with inputs x, y, weights 1, 1 and threshold 1.5; the unit outputs 1 when the sum x + y - 1.5 is positive.]

Truth Table for Logical AND
inputs x, y -> output x & y
1  1  ->  1
0  1  ->  0
1  0  ->  0
0  0  ->  0
• The simplest possible artificial neural network able to do a calculation consists of two inputs and one output. It can be used to classify patterns or to perform basic logical operations such as AND, OR and NOT.
Perceptron learning algorithm
Rule: If the sum of the weighted inputs exceeds a threshold, output 1, else output -1.
[Figure: perceptron unit forming the weighted sum Σ_i x_i w_i of its inputs and thresholding it to produce the output.]
Training any unit consists of adjusting the weight vector and threshold so that the desired classification is performed.
output = 1 if Σ_i input_i * weight_i > threshold
output = -1 if Σ_i input_i * weight_i < threshold
Algorithm (inspired by Hebb's rule): For each training vector x, evaluate the output y. Take the difference between the desired output t and the current output y. Update w_i = w_i + Δw_i with Δw_i = (t - y) x_i. Repeat until y = t.
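A minimal sketch of this learning rule in Python; the AND training data, the +/-1 coding, and the learning rate eta are illustrative assumptions rather than part of the original slides:

```python
import numpy as np

# logical AND with targets coded as -1 / +1
X = np.array([[1, 1], [0, 1], [1, 0], [0, 0]], dtype=float)
t = np.array([1, -1, -1, -1], dtype=float)

w = np.zeros(2)   # weights w_i
w0 = 0.0          # bias (negative threshold)
eta = 0.1         # learning rate (assumed value)

for epoch in range(100):
    errors = 0
    for x, target in zip(X, t):
        y = 1.0 if w @ x + w0 > 0 else -1.0    # output rule
        if y != target:
            w += eta * (target - y) * x        # delta w_i = eta (t - y) x_i
            w0 += eta * (target - y)
            errors += 1
    if errors == 0:                            # repeat until y = t for all patterns
        break

print(w, w0)
```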
Interpretation of weights
The Heaviside function has its threshold at x = 0. The decision boundary is given by:
a = w · x + w0 = w0 + w1 x1 + w2 x2 = 0
Thus: x2 = - (w0 + w1 x1) / w2.
Look-up table: OR fct, XOR fct
[Figure: look-up tables for the OR and XOR functions.]
Linear discriminant function: separation of 2 classes
[Figure: two-class data in a two-dimensional feature space (Feature 1 vs. Feature 2), with a linear decision boundary separating Decision Region 1 from Decision Region 2.]
A linear discriminant function is a mapping which partitions feature space using a linear function (a straight line, or hyperplane).
In D = 2 dimensions the decision boundary is a straight line.
Simple form of classifier:
“separate two classes using a straight line in feature space”
A perceptron can be used to discriminate between k classes by having k output nodes:
x is in class C_j if y_j(x) >= y_k for all k.
The resulting decision boundaries divide the feature space into convex decision regions.
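A short sketch of this k-output decision rule; the weight values below are arbitrary illustrative choices:

```python
import numpy as np

def classify(W, w0, x):
    # y_j(x) = w_j . x + w_j0; assign x to the class C_j with the largest y_j
    y = W @ x + w0
    return int(np.argmax(y))

# example: 3 classes, 2-dimensional feature vectors
W = np.array([[ 1.0,  0.0],
              [-1.0,  1.0],
              [ 0.0, -1.0]])
w0 = np.array([0.0, 0.5, -0.5])
print(classify(W, w0, np.array([2.0, 1.0])))   # prints the winning class index
```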
[Figure: single-layer network with inputs x1, ..., xd, one output y_j per class, and decision regions C1, C2, C3. The weight to output j from input k is w_jk.]
y_j = g( Σ_k w_jk x_k + w_j0 )
Separation of K classes
Network training via gradient descent
A set of training data from known classes is used in conjunction with an error function E(w) (e.g. the squared difference between target t and response y), which must be minimized.
Then: w_new = w_old - η ∇E(w)
where ∇E(w) is a vector representing the gradient and η is the learning rate (small, positive).
1. Move downhill in the direction -∇E(w) (steepest descent, since ∇E(w) is the direction of steepest increase).
2. Termination is controlled by the learning rate η.
Component-wise: w_jk <- w_jk - η ∂E/∂w_jk
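A sketch of this update, minimizing a squared-error function E(w) for a single linear output unit by steepest descent; the toy data and the value of the learning rate are assumptions for illustration:

```python
import numpy as np

# toy training set: targets generated by a noisy linear rule (assumed data)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
t = 1.5 * X[:, 0] - 0.7 * X[:, 1] + 0.1 * rng.normal(size=50)

w = np.zeros(2)
eta = 0.05                                  # learning rate (small, positive)

for step in range(200):
    y = X @ w                               # network response
    grad = -2.0 * X.T @ (t - y) / len(t)    # gradient of E(w) = mean (t - y)^2
    w = w - eta * grad                      # w_new = w_old - eta * grad E(w)

print(w)   # approaches [1.5, -0.7]
```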
Equivalent to climbing a hill up and down. Problem: when to stop? Local minima.
Possibility of multiple local minima. Note: for a single-layer perceptron, E(w) has only a single global minimum - no problem!
Gradient descent goes to the closest local minimum. General solution: random restarts from multiple places in weight space (simulated annealing).
Moving along the error function landscape
Training multi-layer NN via back-propagation algorithm
Back-Propagation
Back-propagation cont'd
Perceptron as Classifier
For d-dimensional data the perceptron consists of d weights, a bias, and a thresholding activation function. For 2D data we have:
[Figure: perceptron with inputs x1, x2, a constant input 1, and weights w1, w2, w0 producing the class decision.]
1. Weighted sum of the inputs: a = w0 + w1 x1 + w2 x2
2. Pass through the Heaviside function: T(a) = -1 if a < 0, T(a) = 1 if a >= 0
Output: y = g(a) in {-1, +1} = class decision
If we group the weights as a vector w, the net output y is given by: y = g(w · x + w0)
View the bias as another weight from an input which is constantly on.
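A tiny sketch of the last point: absorbing the bias w0 into the weight vector by adding an input that is constantly 1 (the particular weights are arbitrary):

```python
import numpy as np

def g(a):
    # Heaviside-style thresholding to {-1, +1}
    return 1 if a >= 0 else -1

w0, w = -1.5, np.array([1.0, 1.0])           # bias and weights
x = np.array([1.0, 1.0])

y1 = g(w @ x + w0)                           # ordinary form: y = g(w . x + w0)

w_aug = np.concatenate(([w0], w))            # fold w0 into the weight vector
x_aug = np.concatenate(([1.0], x))           # prepend a constant input 1
y2 = g(w_aug @ x_aug)

assert y1 == y2
print(y1)
```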
Failure of the Perceptron
[Figure: footballers and academics plotted on axes "Successful / Unsuccessful" vs. "Many / Few hours in the gym per week".]
In this example, a perceptron would not be able to discriminate between the footballers and the academics (XOR cannot be represented by a single threshold sigma node).
…despite the simplicity of their relationship:
Academics = Successful XOR Gym
This failure caused the majority of researchers to walk away.
Classification: Decision Trees
"Which factors determine if a cell is cancerous?"
[Figure: decision tree classifying cells as healthy or cancerous using the features color (dark / light), number of nuclei (#nuclei = 1 or 2), and number of tails (#tails = 1 or 2).]
Example of Classification from Neural Networks:
Inputs: Color = dark, # nuclei = 1, …, # tails = 2
Problem: Over-fitting
Feed-forward NN: letter recognition from writing on a PC screen. Letter scanner – transformation into ASCII code.
SWN (small-world network) topology: network of 5 layers by 8 neurons.
Simulation with 5 neurons per layer and 8 layers. The NN was trained with 40 patterns for 50 different runs.
Learning 40 patterns: the regular network almost fails to learn. With a few short-cuts the network learns well. The SWN architecture is better than both the regular and the random architecture.
2. Hopfield NN: Model of associative memory
Example: Restoring corrupted memory patterns
[Figure: original "T" pattern; the pattern with half of it corrupted; the pattern with 20% corrupted.]
Use in search machines (AltaVista): search from incomplete or corrupted items.
Approximate solution of a combinatorial optimization problem: Travelling Salesman Problem
Associative memory problem:
Store a set of patterns x.
When presenting a pattern z, the network finds the stored pattern x closest to pattern z.
Hopfield Model
Dynamical variables: neuron i (i = 1, ..., N) is represented by the variable S_i(t).
At any time t the network is determined by the state vector {S_i(t), i = 1, ..., N}.
Dynamical update rule: S_i(t+1) = sgn( Σ_j w_ij S_j(t) )
The system evolves from a given state to some stable network state (attractor state, fixed point).
Hopfield Nets
1. Every node is connected to every other node.
2. Weights are symmetric and w_ii = 0 (no self-connections).
3. The flow of information is not unidirectional.
4. The state of the system is given by the node outputs.
Energy function of the Hopfield net: a multidimensional landscape
H = - (1/2) Σ_{i,j} w_ij S_i S_j
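A small sketch of evaluating this energy for a given state; the symmetric weight matrix below is arbitrary and only for illustration:

```python
import numpy as np

def hopfield_energy(W, S):
    # H = -1/2 * sum_{i,j} w_ij * S_i * S_j
    return -0.5 * S @ W @ S

W = np.array([[ 0.0, 1.0, -1.0],
              [ 1.0, 0.0,  1.0],
              [-1.0, 1.0,  0.0]])   # symmetric, zero diagonal
S = np.array([1, -1, 1])
print(hopfield_energy(W, S))
```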
Training of Hopfield Nets: inspired by the biological Hebb's rule (unsupervised)
1. Present the components of the patterns to be stored at the outputs of the corresponding nodes of the net.
2. If two nodes have the same value, make a small positive increment to the internode weight. If they have opposite values, make a small negative decrement to the internode weight.
w_ij = Σ_p v_i^p v_j^p
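A minimal sketch of Hebbian storage and recall, restoring a corrupted pattern; the network size, the 1/N normalization, and the corruption level are illustrative assumptions:

```python
import numpy as np

def train_hebb(patterns):
    # w_ij = (1/N) * sum_p v_i^p v_j^p, with zero diagonal
    N = patterns.shape[1]
    W = patterns.T @ patterns / N
    np.fill_diagonal(W, 0.0)
    return W

def recall(W, S, steps=10):
    # update rule: S_i(t+1) = sgn( sum_j w_ij S_j(t) )
    for _ in range(steps):
        S = np.where(W @ S >= 0, 1, -1)
    return S

rng = np.random.default_rng(1)
patterns = rng.choice([-1, 1], size=(3, 100))    # three random +/-1 patterns
W = train_hebb(patterns)

corrupted = patterns[0].copy()
flip = rng.choice(100, size=20, replace=False)   # corrupt 20% of the bits
corrupted[flip] *= -1

restored = recall(W, corrupted)
print(np.mean(restored == patterns[0]))          # fraction of correctly restored bits
```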
Distance of the NN state from the 1st training pattern
Phase diagram of the attractor network
Load: alpha = (number of patterns) / (number of connections per node)
Character recognition
For a 256 x 256 character we have 65,536 pixels. One input for each pixel is not efficient. Reasons:
1. Poor generalisation: the data set would have to be vast to properly constrain all the parameters.
2. It takes a long time to train.
Answer: use averages of N^2 pixels (dimensionality reduction); each average could be a feature (see the sketch below).
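A sketch of this kind of dimensionality reduction, averaging non-overlapping blocks of pixels into features; the 256 x 256 image and the block size N = 16 are illustrative choices:

```python
import numpy as np

def block_average(image, N):
    # average each non-overlapping N x N block of pixels into one feature
    h, w = image.shape
    blocks = image.reshape(h // N, N, w // N, N)
    return blocks.mean(axis=(1, 3)).ravel()

image = np.random.rand(256, 256)       # stand-in for a scanned character
features = block_average(image, 16)    # 65,536 pixels -> 256 features
print(features.shape)
```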
Restoration of memory: regular network
Restoration of memory: SWN
Extend the ideas of competitive learning to incorporate the neighborhood around inputs and neurons
We want a nonlinear transformation of the input pattern space onto the output feature space which preserves the neighbourhood relationship between the inputs: a feature map where nearby neurons respond to similar inputs.
E.g. place cells, orientation columns, somatosensory cells, etc.
The idea is that neurons selectively tune to particular input patterns in such a way that the neurons become ordered with respect to each other, so that a meaningful coordinate system for different input features is created.
3. Self-organization: Topographic maps
Topographic map: spatial locations are indicative of the intrinsic statistical features of the input patterns, i.e. close in the input => close in the output.
E.g.: when the black input in the lower layer is active, the black neuron in the upper layer is the winner; so when the red input in the lower layer is active, we want the upper red neuron to be the winner.
2-layer network: each cortical unit is fully connected to visual space via Hebbian units.
Interconnections of cortical units are described by a 'Mexican-hat' function (Gabor function): short-range excitation and long-range inhibition.
[Figure: cortical units in an upper layer connected to the visual space below.]
Example: activity-based self-organization (von der Malsburg, 1973): incorporation of competitive and cooperative mechanisms to generate feature maps using unsupervised learning networks.
Biologically motivated: how can activity-based learning using highly interconnected circuits lead to an orderly mapping of visual stimulus space onto the cortical surface? (visual-tectum map)
Kohonen NN
Example of grid: mapping 2D to 2D
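A compact sketch of Kohonen learning for such a 2D-to-2D map; the grid size, learning-rate schedule, and neighborhood-width schedule are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

grid = 8                                          # 8 x 8 map of neurons
coords = np.array([(i, j) for i in range(grid) for j in range(grid)], dtype=float)
W = rng.random((grid * grid, 2))                  # one 2D weight vector per neuron

n_steps = 5000
for t in range(n_steps):
    x = rng.random(2)                             # input drawn from the unit square
    winner = np.argmin(np.sum((W - x) ** 2, axis=1))        # best-matching unit
    eta = 0.5 * (0.01 / 0.5) ** (t / n_steps)                # decaying learning rate
    sigma = 3.0 * (0.5 / 3.0) ** (t / n_steps)               # shrinking neighborhood width
    d2 = np.sum((coords - coords[winner]) ** 2, axis=1)      # grid distance to the winner
    h = np.exp(-d2 / (2 * sigma ** 2))                       # neighborhood function
    W += eta * h[:, None] * (x - W)               # move winner and neighbors toward x

print(W[:4])   # nearby neurons end up with nearby weight vectors
```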
Knowledge vs Topology: SWN during most of the organization phase.
Problem
The SOM tends to overrepresent regions of low density and underrepresent regions of high density: the resulting weight density scales as W_x ∝ p(x)^(2/3), where p(x) is the distribution density of the input data.
Also, there can be problems when there is no natural 2d ordering of the data.