TRANSCRIPT
From Biological to Artificial Neural Networks
NPCDS/MITACS Spring School, Montreal, May 23-27, 2006
Helmut Kroger, Laval University
Outline
1. From biology to artificial NNs
2. Perceptron – supervised learning
3. Hopfield model – associative memory
4. Kohonen map – self-organization
References
1. Hertz J., Krogh A., and Palmer R.G. Introduction to the Theory of Neural Computation.
2. Duda R.O., Hart P.E., and Stork D.G. Pattern Classification. John Wiley, 2001.
3. Haykin S. Neural Networks. Prentice Hall International, 1999.
4. Bishop C. Neural Networks for Pattern Recognition. Oxford: Clarendon Press, 1995.
5. Ripley B.D. Pattern Recognition and Neural Networks. Cambridge University Press, Jan 1996.
6. Mueller B. and Reinhardt J. Neural Networks: An Introduction. Springer-Verlag, Berlin, 1991.
7. McCulloch W.S. and Pitts W. (1943). "A logical calculus of the ideas immanent in nervous activity", Bulletin of Mathematical Biophysics, 5, 115-137.
Use of NNs: Neural Networks Are For

Applications             Science
Character recognition    Neuroscience
Optimization             Physics, Mathematics
Data mining              Computer science
…                        …
Biological neural networks
Nerve cells are called neurons. Many different types exist. Neurons are extremely complex.
Approx. 10^11 neurons in the brain. Each neuron has about 10^3 connections.
Neural communication
Neurons transport information via electrical action potentials. At the synapse the transmission is mediated by chemical macromolecules (neurotransmitter proteins).
How two neurons communicate
Biology of neurons:
Single neurons are highly complex electrochemical devices.
Many forms of interneuron communication are now known, acting over many different spatial and temporal scales:
Local, gaseous, volume signaling, etc.
[Figure: schematic of a neuron with inputs, axon hillock, and output.]
The neuron is a computer:
(1) Information is distributed.
(2) The brain works in deterministic mode as well as in stochastic mode (generation of nerve signals, synaptic transmission).
(3) Associative memory: retrieval not by checking bit by bit, but by gradual reproduction of the whole.
From biology of the brain to information processing.
(4) Architecture of the brain:
- optimizes information transmission
- stable against errors
- physical constraints: oxygen, blood, energy, cooling
- Small-World Architecture
(5) 1/f frequency scaling in EEG: fractals, self-similarity.
Dynamical origin: model of self-organized criticality (sand-pile analogue).
Neural avalanches – do they transmit information? Does the brain work at some critical point?
(6) How is neural connectivity generated? Nerve growth: a random process, guided by chemical markers, enhanced by stochastic resonance.
Layered structure of neocortex
Organization of neocortex
Biological parameters of neocortex
Artificial Neural Networks
A network with interactions: an attempt to mimic the brain:
Unit elements are artificial neurons (linear or nonlinear input-output units).
Communications are encoded by weights, a measure of how strongly neurons affect each other.
Architectures can be feed-forward, feedback, or recurrent.
[Figure: artificial neuron with inputs x1, ..., xn and synaptic weights w1, ..., wn (w1: synaptic strength).]
y_i = f( Σ_j w_ij x_j − b_i )
Firing of a group of neurons: biology vs model
1. Multi-layer feed-forward network
Biological motivation for multi-layer feed-forward network:
Modeled after the visual cortex being multi-layered (6 layers). Dominantly feedforward. There is strong lateral inhibition: neurons in the same layer don't talk to each other.
High connectivity between neurons (10^4 per neuron) provides the basis for massive parallel computing.
High redundancy against errors.
The principal building blocks:
- Input vector x_k
- Weight matrix w_ik
- Activation function f
- Output vector y_i
- Dynamical update rule (given below)
Goal: find weights and an activation function such that for a given input the output is close to the desired output.
(Training supervised by a teacher)
y_i(t+1) = f( Σ_j w_ij x_j(t) )
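As an illustration of these building blocks, here is a minimal sketch (not from the slides) of one such update step in Python, using NumPy and a logistic activation as an assumed choice of f:

```python
import numpy as np

def logistic(a):
    # one possible choice of activation function f
    return 1.0 / (1.0 + np.exp(-a))

def feedforward_step(W, x, f=logistic):
    # y_i(t+1) = f( sum_j w_ij * x_j(t) )
    return f(W @ x)

# example: 3 inputs feeding 2 output units (arbitrary weights)
W = np.array([[0.5, -0.2, 0.1],
              [0.3,  0.8, -0.5]])
x = np.array([1.0, 0.0, 1.0])
print(feedforward_step(W, x))
```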
Examples of activation functions
The Perceptron (1962): a simple feed-forward network: 1 input layer, 1 output layer.
[Figure: perceptron computing logical AND with inputs x, y, weights 1, 1 and threshold 1.5; the unit outputs 1 when the sum x + y - 1.5 is positive.]

Truth Table for Logical AND
inputs x, y -> output x & y
1  1  ->  1
0  1  ->  0
1  0  ->  0
0  0  ->  0
• The simplest possible artificial neural network able to do a calculation consists of two inputs and one output. It can be used to classify patterns or to perform basic logical operations such as AND, OR and NOT.
Perceptron learning algorithm
Rule: If the sum of the weighted inputs exceeds a threshold, output 1, else output -1.
[Figure: perceptron unit forming the weighted sum Σ_i x_i w_i of its inputs and thresholding it to produce the output.]
Training any unit consists of adjusting the weight vector and threshold so that the desired classification is performed.
output = 1 if Σ_i input_i * weight_i > threshold
output = -1 if Σ_i input_i * weight_i < threshold
Algorithm (inspired by Hebb's rule): For each training vector x, evaluate the output y. Take the difference between the desired output t and the current output y. Update w_i = w_i + Δw_i with Δw_i = (t - y) x_i. Repeat until y = t.
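A minimal sketch of this learning rule in Python; the AND training data, the +/-1 coding, and the learning rate eta are illustrative assumptions rather than part of the original slides:

```python
import numpy as np

# logical AND with targets coded as -1 / +1
X = np.array([[1, 1], [0, 1], [1, 0], [0, 0]], dtype=float)
t = np.array([1, -1, -1, -1], dtype=float)

w = np.zeros(2)   # weights w_i
w0 = 0.0          # bias (negative threshold)
eta = 0.1         # learning rate (assumed value)

for epoch in range(100):
    errors = 0
    for x, target in zip(X, t):
        y = 1.0 if w @ x + w0 > 0 else -1.0    # output rule
        if y != target:
            w += eta * (target - y) * x        # delta w_i = eta (t - y) x_i
            w0 += eta * (target - y)
            errors += 1
    if errors == 0:                            # repeat until y = t for all patterns
        break

print(w, w0)
```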
Interpretation of weights
The Heaviside function has its threshold at x = 0. The decision boundary is given by:
a = w · x + w0 = w0 + w1 x1 + w2 x2 = 0
Thus: x2 = - (w0 + w1 x1) / w2.
Look-up table: OR fct, XOR fct
[Figure: look-up tables for the OR and XOR functions.]
Linear discriminant function: separation of 2 classes
[Figure: two-class data in a two-dimensional feature space (Feature 1 vs. Feature 2), with a linear decision boundary separating Decision Region 1 from Decision Region 2.]
A linear discriminant function is a mapping which partitions feature space using a linear function (a straight line, or hyperplane).
In D = 2 dimensions the decision boundary is a straight line.
Simple form of classifier:
“separate two classes using a straight line in feature space”
A perceptron can be used to discriminate between k classes by having k output nodes:
x is in class C_j if y_j(x) >= y_k for all k.
The resulting decision boundaries divide the feature space into convex decision regions.
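A short sketch of this k-output decision rule; the weight values below are arbitrary illustrative choices:

```python
import numpy as np

def classify(W, w0, x):
    # y_j(x) = w_j . x + w_j0; assign x to the class C_j with the largest y_j
    y = W @ x + w0
    return int(np.argmax(y))

# example: 3 classes, 2-dimensional feature vectors
W = np.array([[ 1.0,  0.0],
              [-1.0,  1.0],
              [ 0.0, -1.0]])
w0 = np.array([0.0, 0.5, -0.5])
print(classify(W, w0, np.array([2.0, 1.0])))   # prints the winning class index
```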
[Figure: single-layer network with inputs x1, ..., xd, one output y_j per class, and decision regions C1, C2, C3. The weight to output j from input k is w_jk.]
y_j = g( Σ_k w_jk x_k + w_j0 )
Separation of K classes
Network training via gradient descent
A set of training data from known classes is used in conjunction with an error function E(w) (e.g. the squared difference between target t and response y), which must be minimized.
Then: w_new = w_old - η ∇E(w)
where ∇E(w) is a vector representing the gradient and η is the learning rate (small, positive).
1. Move downhill in the direction -∇E(w) (steepest descent, since ∇E(w) is the direction of steepest increase).
2. Termination is controlled by the learning rate η.
Component-wise: w_jk <- w_jk - η ∂E/∂w_jk
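A sketch of this update, minimizing a squared-error function E(w) for a single linear output unit by steepest descent; the toy data and the value of the learning rate are assumptions for illustration:

```python
import numpy as np

# toy training set: targets generated by a noisy linear rule (assumed data)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
t = 1.5 * X[:, 0] - 0.7 * X[:, 1] + 0.1 * rng.normal(size=50)

w = np.zeros(2)
eta = 0.05                                  # learning rate (small, positive)

for step in range(200):
    y = X @ w                               # network response
    grad = -2.0 * X.T @ (t - y) / len(t)    # gradient of E(w) = mean (t - y)^2
    w = w - eta * grad                      # w_new = w_old - eta * grad E(w)

print(w)   # approaches [1.5, -0.7]
```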
Equivalent to climbing a hill up and down. Problem: when to stop? Local minima.
Possibility of multiple local minima. Note: for a single-layer perceptron, E(w) has only a single global minimum - no problem!
Gradient descent goes to the closest local minimum. General solution: random restarts from multiple places in weight space (simulated annealing).
Moving along the error function landscape
Training multi-layer NN via back-propagation algorithm
Back-Propagation
Back-propagation cont'd
Perceptron as Classifier
For d-dimensional data the perceptron consists of d weights, a bias, and a thresholding activation function. For 2D data we have:
[Figure: perceptron with inputs x1, x2, a constant input 1, and weights w1, w2, w0 producing the class decision.]
1. Weighted sum of the inputs: a = w0 + w1 x1 + w2 x2
2. Pass through the Heaviside function: T(a) = -1 if a < 0, T(a) = 1 if a >= 0
Output: y = g(a) in {-1, +1} = class decision
If we group the weights as a vector w, the net output y is given by: y = g(w · x + w0)
View the bias as another weight from an input which is constantly on.
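A tiny sketch of the last point: absorbing the bias w0 into the weight vector by adding an input that is constantly 1 (the particular weights are arbitrary):

```python
import numpy as np

def g(a):
    # Heaviside-style thresholding to {-1, +1}
    return 1 if a >= 0 else -1

w0, w = -1.5, np.array([1.0, 1.0])           # bias and weights
x = np.array([1.0, 1.0])

y1 = g(w @ x + w0)                           # ordinary form: y = g(w . x + w0)

w_aug = np.concatenate(([w0], w))            # fold w0 into the weight vector
x_aug = np.concatenate(([1.0], x))           # prepend a constant input 1
y2 = g(w_aug @ x_aug)

assert y1 == y2
print(y1)
```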
Failure of the Perceptron
[Figure: footballers and academics plotted on axes "Successful / Unsuccessful" vs. "Many / Few hours in the gym per week".]
In this example, a perceptron would not be able to discriminate between the footballers and the academics (XOR cannot be represented by a single threshold sigma node).
…despite the simplicity of their relationship:
Academics = Successful XOR Gym
This failure caused the majority of researchers to walk away.
Classification: Decision Trees
"Which factors determine if a cell is cancerous?"
[Figure: decision tree classifying cells as healthy or cancerous using the features color (dark / light), number of nuclei (#nuclei = 1 or 2), and number of tails (#tails = 1 or 2).]
Example of Classification from Neural Networks:
Inputs: Color = dark, # nuclei = 1, …, # tails = 2
Problem: Over-fitting
Feed-forward NN: letter recognition from writing on a PC screen. Letter scanner – transformation into ASCII code.
SWN (small-world network) topology: network of 5 layers by 8 neurons.
Simulation with 5 neurons per layer and 8 layers. The NN was trained with 40 patterns for 50 different runs.
Learning 40 patterns: the regular network almost fails to learn. With a few short-cuts the network learns well. The SWN architecture is better than both the regular and the random architecture.
2. Hopfield NN: Model of associative memory
Example: Restoring corrupted memory patterns
[Figure: original "T" pattern; the pattern with half of it corrupted; the pattern with 20% corrupted.]
Use in search machines (AltaVista): search from incomplete or corrupted items.
Approximate solution of a combinatorial optimization problem: Travelling Salesman Problem
Associative memory problem:
Store a set of patterns x.
When presenting a pattern z, the network finds the stored pattern x closest to pattern z.
Hopfield Model
Dynamical variables: neuron i (i = 1, ..., N) is represented by the variable S_i(t).
At any time t the network is determined by the state vector {S_i(t), i = 1, ..., N}.
Dynamical update rule: S_i(t+1) = sgn( Σ_j w_ij S_j(t) )
The system evolves from a given state to some stable network state (attractor state, fixed point).
Hopfield Nets
1. Every node is connected to every other node.
2. Weights are symmetric and w_ii = 0 (no self-connections).
3. The flow of information is not unidirectional.
4. The state of the system is given by the node outputs.
Energy function of the Hopfield net: a multidimensional landscape
H = - (1/2) Σ_{i,j} w_ij S_i S_j
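A small sketch of evaluating this energy for a given state; the symmetric weight matrix below is arbitrary and only for illustration:

```python
import numpy as np

def hopfield_energy(W, S):
    # H = -1/2 * sum_{i,j} w_ij * S_i * S_j
    return -0.5 * S @ W @ S

W = np.array([[ 0.0, 1.0, -1.0],
              [ 1.0, 0.0,  1.0],
              [-1.0, 1.0,  0.0]])   # symmetric, zero diagonal
S = np.array([1, -1, 1])
print(hopfield_energy(W, S))
```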
Training of Hopfield Nets: inspired by the biological Hebb's rule (unsupervised)
1. Present the components of the patterns to be stored at the outputs of the corresponding nodes of the net.
2. If two nodes have the same value, make a small positive increment to the internode weight. If they have opposite values, make a small negative decrement to the internode weight.
w_ij = Σ_p v_i^p v_j^p
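A minimal sketch of Hebbian storage and recall, restoring a corrupted pattern; the network size, the 1/N normalization, and the corruption level are illustrative assumptions:

```python
import numpy as np

def train_hebb(patterns):
    # w_ij = (1/N) * sum_p v_i^p v_j^p, with zero diagonal
    N = patterns.shape[1]
    W = patterns.T @ patterns / N
    np.fill_diagonal(W, 0.0)
    return W

def recall(W, S, steps=10):
    # update rule: S_i(t+1) = sgn( sum_j w_ij S_j(t) )
    for _ in range(steps):
        S = np.where(W @ S >= 0, 1, -1)
    return S

rng = np.random.default_rng(1)
patterns = rng.choice([-1, 1], size=(3, 100))    # three random +/-1 patterns
W = train_hebb(patterns)

corrupted = patterns[0].copy()
flip = rng.choice(100, size=20, replace=False)   # corrupt 20% of the bits
corrupted[flip] *= -1

restored = recall(W, corrupted)
print(np.mean(restored == patterns[0]))          # fraction of correctly restored bits
```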
Distance of the NN state from the 1st training pattern
Phase diagram of the attractor network
Load: alpha = (number of patterns) / (number of connections per node)
Character recognition
For a 256 x 256 character we have 65,536 pixels. One input for each pixel is not efficient. Reasons:
1. Poor generalisation: the data set would have to be vast to properly constrain all the parameters.
2. It takes a long time to train.
Answer: use averages of N^2 pixels (dimensionality reduction); each average could be a feature (see the sketch below).
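A sketch of this kind of dimensionality reduction, averaging non-overlapping blocks of pixels into features; the 256 x 256 image and the block size N = 16 are illustrative choices:

```python
import numpy as np

def block_average(image, N):
    # average each non-overlapping N x N block of pixels into one feature
    h, w = image.shape
    blocks = image.reshape(h // N, N, w // N, N)
    return blocks.mean(axis=(1, 3)).ravel()

image = np.random.rand(256, 256)       # stand-in for a scanned character
features = block_average(image, 16)    # 65,536 pixels -> 256 features
print(features.shape)
```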
Restoration of memory: regular network
Restoration of memory: SWN
Extend the ideas of competitive learning to incorporate the neighborhood around inputs and neurons
We want a nonlinear transformation of the input pattern space onto the output feature space which preserves the neighbourhood relationship between the inputs: a feature map where nearby neurons respond to similar inputs.
E.g. place cells, orientation columns, somatosensory cells, etc.
The idea is that neurons selectively tune to particular input patterns in such a way that the neurons become ordered with respect to each other, so that a meaningful coordinate system for different input features is created.
3. Self-organization: Topographic maps
Topographic map: spatial locations are indicative of the intrinsic statistical features of the input patterns, i.e. close in the input => close in the output.
E.g.: when the black input in the lower layer is active, the black neuron in the upper layer is the winner; so when the red input in the lower layer is active, we want the upper red neuron to be the winner.
2-layer network: each cortical unit is fully connected to visual space via Hebbian units.
Interconnections of cortical units are described by a 'Mexican-hat' function (Gabor function): short-range excitation and long-range inhibition.
[Figure: cortical units in an upper layer connected to the visual space below.]
Example: activity-based self-organization (von der Malsburg, 1973): incorporation of competitive and cooperative mechanisms to generate feature maps using unsupervised learning networks.
Biologically motivated: how can activity-based learning using highly interconnected circuits lead to an orderly mapping of visual stimulus space onto the cortical surface? (visual-tectum map)
Kohonen NN
Example of grid: mapping 2D to 2D
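A compact sketch of Kohonen learning for such a 2D-to-2D map; the grid size, learning-rate schedule, and neighborhood-width schedule are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

grid = 8                                          # 8 x 8 map of neurons
coords = np.array([(i, j) for i in range(grid) for j in range(grid)], dtype=float)
W = rng.random((grid * grid, 2))                  # one 2D weight vector per neuron

n_steps = 5000
for t in range(n_steps):
    x = rng.random(2)                             # input drawn from the unit square
    winner = np.argmin(np.sum((W - x) ** 2, axis=1))        # best-matching unit
    eta = 0.5 * (0.01 / 0.5) ** (t / n_steps)                # decaying learning rate
    sigma = 3.0 * (0.5 / 3.0) ** (t / n_steps)               # shrinking neighborhood width
    d2 = np.sum((coords - coords[winner]) ** 2, axis=1)      # grid distance to the winner
    h = np.exp(-d2 / (2 * sigma ** 2))                       # neighborhood function
    W += eta * h[:, None] * (x - W)               # move winner and neighbors toward x

print(W[:4])   # nearby neurons end up with nearby weight vectors
```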
Knowledge vs Topology: SWN during most of the organization phase.
Problem
The SOM tends to overrepresent regions of low density and underrepresent regions of high density: the resulting weight density scales as W_x ∝ p(x)^(2/3), where p(x) is the distribution density of the input data.
Also, there can be problems when there is no natural 2d ordering of the data.