
Self-Organised Learning in the Chialvo-Bak Model

MSc Project

Marco Brigham

Master of Science

Artificial Intelligence

School of Informatics

University of Edinburgh

2009


Abstract

A review of the Chialvo-Bak model is presented for the two-layer neural network topology. A novel Markov chain representation is proposed that yields several important analytical quantities and supports a learning convergence argument. The power-law regime is re-examined under this new representation and is found to be limited to learning under small mapping changes. A parallel between the power-law regime and biological neural avalanches is proposed. A mechanism is proposed to avoid the permanent tagging of synaptic weights by the selective punishment rule.


Acknowledgements

I wish to thank Dr. Mark van Rossum for his tireless support and attentive guidance, and for having accepted to supervise me in the first place.

I wish to thank Dr. J. Michael Herrmann for the very creative and rewarding discussions on the holistic merits of the Chialvo-Bak model.

I wish to thank Dr. Wolfgang Maass and his team at the Institute for Theoretical Computer Science at T.U. Graz for the precious feedback and fruitful discussions following the first talk on this MSc project.


Declaration

I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified.

(Marco Brigham)


To the memory of Per Bak, whose ideas live on and inspire.


Contents

1 Introduction
  1.1 Brief literature review

2 The Two-Layer Topology
  2.1 Basic Principles and Learning
    2.1.1 Interference events
    2.1.2 Synaptic Landscape
    2.1.3 Neural avalanches
    2.1.4 Summary
  2.2 Storing Mappings
    2.2.1 Summary
    2.2.2 Appendix
  2.3 Advanced Learning
    2.3.1 Summary
    2.3.2 Appendix

3 Research Results
  3.1 δ-band Saturation
    3.1.1 Desaturation strategies
    3.1.2 Global tag threshold
    3.1.3 Summary
  3.2 Markov Chain Representation
    3.2.1 Statistical properties
    3.2.2 Markov chain representation: numerical evidence
    3.2.3 Analytical solution for Γ(2, n_m, n_o)
    3.2.4 Alternate formulation: graph transitions
    3.2.5 Analytical solution: numerical evidence
    3.2.6 Summary
    3.2.7 Appendix
  3.3 Learning Convergence
    3.3.1 Summary
  3.4 Power-Law Behaviour and Neural Avalanches
    3.4.1 Biological interpretation
    3.4.2 Summary

4 Conclusion


Chapter 1

Introduction

The Chialvo-Bak model was introduced by P. Bak and D. Chialvo [8] in 1999, with the stated goal of identifying "some 'universal' and 'simple' mechanism which allows a large number of neurons to connect autonomously in a way that helps the organism to survive" [8]. Their effort resulted in a "schematic 'brain' model of self-organised learning and adaptation that operates using the principle of satisficing" [1].

In common with other models authored by P. Bak is a patent minimalism of form: the models are succinctly defined by simple, local and stochastic interaction rules that reflect the most basic assumptions about the real-world system. However simple and minimalistic, these models manage to reproduce the complex and emergent behaviour observed in the real-world systems [2].

In the Chialvo-Bak model, the basic properties of neurons and neural networks are represented by simple, local and stochastic dynamical rules that support the processes of learning, memory and adaptation. The most basic operations in the model, the node activation and the synaptic plasticity rules, are regulated by "Winner-Take-All" (WTA) dynamics and learning by synaptic depression, respectively. These mechanisms may correspond to well-accepted physiological mechanisms [14][10][12], which suggests the biological feasibility of the model.

The present work focused on extending the analytical understanding of the model. A Markov chain representation for the simple two-layer topology is proposed, where the states of the chain correspond to the learning states of the network. This representation provides a good statistical description of the model and supports an argument for learning convergence.

A power law tail in the learning time distribution, corresponding to an order-disorder phase transition in the model, was proposed by J.R. Wakeling [20]. This result was specific to the slow change mode, where the network is made to learn a succession of small mapping changes. The power law behaviour was re-examined under the Markov chain representation for other mapping change modes and was only reproduced in the slow change mode.

An argument is provided for drawing a parallel between the above power law behaviour and the biological neural avalanches, evidenced experimentally by J. Beggs and D. Plenz [4][5] in 2003. These correspond to the propagation of spontaneous activity in neural networks with power law behaviour in the event size distribution.

The ability to store previously successful configurations is enabled by a selective punishment mechanism [8][1], where successful synaptic weights are depressed less severely when no longer leading to the correct mappings. The selective punishment mechanism has a known ageing effect [1] that is related to the permanent tagging of the successful synaptic weights. A mechanism to avoid the permanent tagging, in order to maintain the performance advantage of selective punishment, is proposed.

This document is organised as follows:

Chapter 1 presents a succinct introduction to the Chialvo-Bak model and the relevant literature.

Chapter 2 introduces the Chialvo-Bak model in the simple two-layer topology, covering the learning modes, the selective punishment rule and the power-law tail behaviour.

Chapter 3 presents the research results of this MSc project.

Chapter 4 presents the conclusion and future work.

1.1 Brief literature review

A brief review of the research papers related to the Chialvo-Bak model is presented below. The purpose of this review is to broadly describe the areas of the model that have already been investigated in considerable depth. Detailed descriptions of the model that support the present work are provided in Section 2.1.

The literature on the Chialvo-Bak model can be grouped into papers that follow the original formulation of the model and papers that extend the model to different working principles and dynamic rules. As the present work is closely aligned with the first approach, so is the focus of the literature review presented below.

Papers on the original Chialvo-Bak model

The Chialvo-Bak model was introduced by P. Bak and D. Chialvo [8] in 1999. In this paper, the motivations, biological constraints and ground rules are put forward, and a great emphasis is placed on the biological plausibility of the model, which leads to requirements of self-organisation at different levels and robustness to noise.

Self-organised learning and adaptation is required to reflect the ability to learn without external guidance. The apparent lack of information in the DNA to encode the physical properties of neurons and synapses and their connectivity [15] motivates the self-organisation at the connectivity level. Each neuron must learn, without genetic or external guidance, which other neurons to connect to, and this connectivity should remain flexible in order to adapt to external changes.

The ability to quickly recover from perturbations induced by biological noise is a constraint motivated by the biological reality of the organism.


Learning by synaptic depression (negative feedback) is proposed as the basis of biological learning and adaptation, supported by the following elements:

∙ Long-term synaptic depression (LTD) is as common in the mammalian brain as long-term synaptic potentiation (LTP) [8]. The LTD mechanism is the suggested physiological implementation of learning by synaptic depression.

∙ Learning by synaptic potentiation leads to very stable synaptic landscapes, from which adaptation to new configurations is difficult and slow.

∙ Learning new tasks or adapting to new environments is error prone; as such, a process that acts on errors rather than on what is correct leads to faster learning.

The other pillar of the model, the "Winner-Take-All" (WTA) rule, is inspired by models of Self-Organised Criticality [2], as a means to drive the system to an adaptive critical state, where small perturbations can cause very large changes in the synaptic landscape. The WTA rule also plays a key role in the solution to the credit assignment problem by keeping the activity of the network low, as detailed in Section 2.1.

The synaptic plasticity changes are driven by a global signal informing the network of the success of the latest synaptic changes. The ability of the organism to differentiate between outcomes is deemed innate to the system, possibly resulting from Darwinian selection.

A second paper by the same authors [1], published in 2000, expanded on several key aspects of the model, such as the network topologies, the memory mechanism and a new learning rule to tackle more complex problems. The performance scaling under the new learning rule was also analysed.

Several network topologies and their relevant learning rules were formally defined, and these are illustrated in Figure 1.1.

(a) The simple layered network topology, which is the one used for the present work.

(b) The lattice network topology, where nodes connect to a small number of nodes in the subsequent layer.

(c) The random network topology, where nodes connect randomly to n_c other neurons. Two subsets of nodes are selected as the input and the output nodes of the network.

Figure 1.1 – The various network topologies proposed in the original model. Figures reproduced from [1].


The ability to store and retrieve previously successful configurations is enabled by the selective punishment of synaptic weights, where previously successful weights are depressed less severely when no longer leading to the correct mappings.

A small modification to the basic learning rules enables the network to learn non-linear problems such as the XOR problem or, more generally, the generalised parity problem, where the parity of any number of input neurons must be correctly calculated.

The learning of multistep sequences is introduced, where the depression of weights related to bad sequences only occurs at the end of the last step. The generalisation and feature detection capability is also covered: the network is able to differentiate between classes of inputs requiring the same output by identifying useful features in the inputs.

In [20], J.R. Wakeling identified an order-disorder phase transition in the model that is regulated by the ratio of middle layer nodes to input and output nodes. At the phase transition, the network displays power-law behaviour with exponent −1.3 in the learning time distribution.

These order-disorder regimes are characterised by the frequency of path interference events, where already learnt mappings are accidentally destroyed while learning new mappings. The disordered phase is characterised by a high probability of interference.

In [21], J. Wakeling investigated the performance of synaptic potentiation and selective punishment under different mapping change modes. In the slow change mode the network is made to learn a succession of mapping sets that only differ by one input-output mapping. In the flip-flop mode two different mapping sets are presented alternately to the network.

Synaptic potentiation was introduced to the plasticity rules by rewarding successful weights by a small amount while still punishing unsuccessful weights. A quantitative analysis showed that any small amount of synaptic potentiation resulted in higher average learning times, especially in the slow change mode. This is illustrated in Figure 1.2a.

The performance of selective punishment was also investigated. While no visible improvement was detected in the slow change mode, in the flip-flop mode the mechanism was very effective, as illustrated in Figure 1.2b.

Chialvo-Bak model extensions

In [7], R.J.C. Bosman, W.A. van Leeuwen and B. Wemmenhove propose an extension to the Chialvo-Bak model by including the potentiation of successful weights. This enables faster single-mapping learning and multi-node firing in each layer, but at the cost of adaptation performance.

In [16], K. Klemm, S. Bornholdt and H.G. Schuster propose a stochastic approximation to the WTA rule and include a "forgiveness" parameter in order to only punish those weights that are consistently unsuccessful. This extension results in a more complex model that, according to the learning performance analysis in [1], does not improve on the original model.


(a) The learning performance of synaptic potentiation in the slow change mode, where only one mapping is changed at a time.

(b) The learning performance of synaptic potentiation in the flip-flop mode, where the network is presented alternately with two different mappings.

Figure 1.2 – The learning performance decreases with any amount of synaptic potentiation for the successful weights. Reproduced from [21].



Chapter 2

The Two-Layer Topology

This chapter describes the functioning and the properties of the Chialvo-Bak model that are most relevant to the research results presented in Chapter 3. As such, it does not comprise an extensive description of the model, for which the reader is best directed to the original papers by P. Bak and D. Chialvo [8][1].

The contents of this chapter are based on the above two papers and on the paper published by J.R. Wakeling [20] on the order-disorder phase transition of the model.

2.1 Basic Principles and Learning

The Chialvo-Bak model is characterised by the following principles:

∙ "Winner-Take-All" dynamics: Neural activity only propagates through the strongest synapses.

∙ Learning by synaptic depression (negative feedback): Synaptic plasticity is exclusively applied by the weakening of synaptic weights that participate in wrong decisions. These synapses are depressed.

These principles define the node activation and plasticity rules of the network, therefore determining the dynamics and properties of the model. To illustrate this, the functioning of a Chialvo-Bak network while learning an arbitrary input-output mapping is presented below.

Consider a neural network with one input layer, one middle layer and one output layer, as illustrated in Figure 2.1. The layers have n_i, n_m and n_o nodes respectively, and the network is noted Γ(n_i, n_m, n_o).

The nodes connect between layers with synaptic weights w, as follows:

∙ Input nodes i connect to all middle nodes m with weights w(m, i).

∙ Middle nodes m connect to all output nodes o with weights w(o,m).

The network is initialised with random weights in [0, 1].

Each node can be active or inactive, corresponding to node state 1 or 0. The activation of an input layer node results in the activation of one middle layer node and one output layer node, according to the following "Winner-Take-All" (WTA) rule:

7

Page 13: Self-Organised Learning in the Chialvo-Bak Model … Learning in the Chialvo-Bak Model MSc Project Marco Brigham T H E U NIVE R S I T Y O F E DINBU R G H Master of Science Artificial

Figure 2.1 – The two-layer network with three input nodes, four middle layer nodes and three output nodes, i.e. n_i = 3, n_m = 4 and n_o = 3. Each input node is connected to all nodes in the middle layer, and each middle layer node is connected to all nodes in the output layer.

∙ Input node i activates the middle layer node m with maximum w(m, i).

∙ Middle node m activates the output node o with maximum w(o,m).

In biological terms, the WTA rule could be implemented using lateral inhibitory connections within the same layer and excitatory connections between layers.

The activation sequence above ensures that no directed cycles are possible between nodes, qualifying the network as a feed-forward network.

For a given set of weights, the WTA rule determines the sequence of activation in the middle and output layers, which defines the active configuration C of the network. An active configuration example is shown in Figure 2.2, where the blue connections represent the winning weights according to the WTA rule.

Figure 2.2 – The "Winner-Take-All" (WTA) rule specifies the active configuration C of the network. In the above graph all input nodes are shown active, whereas in the network only one input node is active at a time. In this example C = {{1, 1, 1}, {2, 2, 3}, {3, 4, 2}} and corresponds to the input-output mapping set M = {1, 3, 2}. The blue connections represent the active weights of the configuration C.

An active configuration C associates input nodes to middle layer nodes, which in turn are associated to output nodes. As such, each C maps input nodes to output nodes: to each input node i corresponds a mapping to the output node o = M(i).

8

Page 14: Self-Organised Learning in the Chialvo-Bak Model … Learning in the Chialvo-Bak Model MSc Project Marco Brigham T H E U NIVE R S I T Y O F E DINBU R G H Master of Science Artificial

The mapping set M contains the mappings of all the input nodes of the network.

In such terms, learning an arbitrary mapping set M corresponds to the evolution of the synaptic weights from an initial active configuration to a final active configuration that yields the required mapping set M.

The network learns an arbitrary mapping set M by applying the following synaptic plasticity rules:

1. A random input node i is selected.

2. The input node i fires and activates a middle layer node m and an output layer node o according to the WTA rule.

3. If output node o is correct, i.e. o = M(i), return to step 1.

4. Otherwise depress the active weights w(m, i) and w(o,m) by a random amountin [0, 1] and return to step 2.
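To make the rules concrete, the sketch below implements them in Python/NumPy. This is a minimal illustration, not the original thesis code: the function names (init_network, wta_forward, learn_mapping_set) are hypothetical, and node indices are 0-based rather than the 1-based labels used in the figures.

```python
import numpy as np

def init_network(n_i, n_m, n_o, rng):
    """Random weights in [0, 1] between consecutive layers."""
    w_mi = rng.random((n_m, n_i))   # w(m, i): input -> middle
    w_om = rng.random((n_o, n_m))   # w(o, m): middle -> output
    return w_mi, w_om

def wta_forward(i, w_mi, w_om):
    """WTA rule: input i activates its strongest middle node,
    which in turn activates its strongest output node."""
    m = int(np.argmax(w_mi[:, i]))
    o = int(np.argmax(w_om[:, m]))
    return m, o

def learn_mapping_set(M, w_mi, w_om, rng):
    """Apply the plasticity rules until every input i maps to M[i].
    Returns the number of depressions (the learning time)."""
    n_i = len(M)
    rho = 0
    while any(wta_forward(i, w_mi, w_om)[1] != M[i] for i in range(n_i)):
        i = int(rng.integers(n_i))            # step 1: random input node
        m, o = wta_forward(i, w_mi, w_om)     # step 2: fire through WTA
        while o != M[i]:                      # step 3: stop when correct
            w_mi[m, i] -= rng.random()        # step 4: depress both active
            w_om[o, m] -= rng.random()        #         weights and re-fire
            rho += 1
            m, o = wta_forward(i, w_mi, w_om)
    return rho

# weights may drift below zero under repeated depression; argmax is
# unaffected, and the optional normalisation described below rescales them
rng = np.random.default_rng(0)
w_mi, w_om = init_network(3, 4, 3, rng)
print(learn_mapping_set([0, 1, 2], w_mi, w_om, rng))  # depressions used
```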

A sequence of learning steps from M = {1, 3, 2} to the identity mapping set M = {1, 2, 3} is illustrated in Figure 2.3.

Figure 2.3 – The learning of the identity mapping set M = {1, 2, 3} from the initial mapping set M = {1, 3, 2}. In this example, M is learnt in three depressions: one depression to learn input node 2 (upper row graph sequence) and two depressions to learn input node 3 (lower row graph sequence). The blue connections represent the active weights of the configuration and the orange connections represent the depression of active weights.

A weight normalisation can be applied at the end of step 3 of the plasticity rules, by raising the weights of input node i and middle layer node m such that the winning weights are equal to one.
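A sketch of this optional normalisation step, assuming the weight matrices of the earlier sketch:

```python
def normalise(i, m, w_mi, w_om):
    """Raise the weights of input node i and middle node m so that
    the winning weight in each layer equals one."""
    w_mi[:, i] += 1.0 - w_mi[:, i].max()
    w_om[:, m] += 1.0 - w_om[:, m].max()
```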


The plasticity rules require a feedback signal informing on the suitability of the most recent changes. This is provided in the form of a global feedback signal that is broadcast to the entire network whenever the latest changes are not satisfactory.

The synaptic plasticity rules could correspond to the following events at the biological level:

1. Depressing the currently active path of an input node results in a new active path that is tagged for recent changes. This tagging takes the form of a chemical or hormone release that is triggered by the latest synaptic activity.

2. No further plasticity changes take place until a global feedback signal is received. This signal is broadcast to the entire network, informing whether the latest changes were satisfactory.

3. Following an unsuccessful global feedback signal, step 1 is repeated. No further actions are taken otherwise.

The synaptic plasticity rules result in the following properties:

∙ For the global feedback mechanism to be efficient in directing synaptic learning, the rate of plasticity change has to be sparse. In such conditions, the credit assignment problem [18][3] is solved, i.e. the system can determine which elements are to be punished following bad performance.

∙ The network signalling is on the time scale of firing patterns (i.e. milliseconds), while the tagging and feedback mechanisms are on a timescale more adapted to the scale of events in the external world (i.e. seconds to hours).

∙ The global feedback signal represents an external critic rather than a teacher, as no specific instructions are provided to direct the plasticity activity.

For the network to learn a random mapping set M, the middle layer size must be at least as large as the input layer size, i.e. n_m ≥ n_i, so that each input node i can have a dedicated path to the corresponding output node o = M(i).


2.1.1 Interference events

In the process of learning a new mapping, an interference event ν may occur, where the network unlearns a previously learnt mapping.

This is the case whenever, while learning a new mapping, the network selects a middle layer node that was establishing a correct mapping for another input node (assuming the correct output nodes for these two input nodes differ). This is illustrated in Figure 2.4.

2.1.2 Synaptic Landscape

An interesting consequence of learning by synaptic depression is the resulting synaptic landscape, shown in Figure 2.5.


Figure 2.4 – While learning a mapping the network may unlearn a previously correct mapping. In the above sequence, the learning of input node 3 led from M = {1, 2, 2} to M = {1, 1, 3}. As such, the net number of learnt mappings remained unchanged: the output mapping of input node 3 was learnt and the output mapping of input node 2 was unlearnt.

[Plot: Node 1 to Middle Layer, Γ(6,108,6), runs: 1e+5, rand; weight w against node index]
(a) The synaptic weights from input node one to the middle layer, in a network Γ(6, 108, 6).

[Plot: Weight Distribution p(w), Γ(32,1024,32), runs: 1e+5, bins: 100, rand; p(w) against weight w]
(b) The synaptic weight distribution (100 bins) in a network Γ(6, 108, 6).

Figure 2.5 – The metastable synaptic landscape is a direct consequence of learning by synaptic depression and supports the fast adaptation property of the model.

In Figure 2.5a the metastable nature of the synaptic landscape is apparent, with the active configuration barely supporting the current mapping. This is to be contrasted with the synaptic landscape resulting from learning by synaptic potentiation, which often results in a small number of dominating synaptic weights.

In this model, learning a very different mapping set M is often just a few depressions away from the currently active weights.

The particular form of the weight distribution in Figure 2.5b is due to both the WTA rule and learning by synaptic depression. As the active synapses for a given input or middle layer node are depressed by a random amount, the WTA rule will select the synapse with the current highest weight1 for the new active configuration in each layer. This amounts to shifting all the weights by the difference between the previous highest weight and the new highest weight.

1The highest weight after depression may still be the previous winning weight, but the probability of re-selection is lower than for any other weight.


Starting from a uniform weight distribution and repeating the above process a sufficient number of times yields the distribution in Figure 2.5b. The intermediate steps of this process are illustrated in Figure 2.6.
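A caricature of this process for a single node's outgoing weights, assuming the normalisation variant in which the winning weight is raised back to one after each depression, reproduces the qualitative shape of this stationary distribution. This is a sketch, not the full network simulation:

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.random(1024)        # one node's outgoing weights, uniform at start

for _ in range(100_000):
    w[np.argmax(w)] -= rng.random()   # depress the current winning weight
    w += 1.0 - w.max()                # shift all weights so the new winner is one

# empirical weight distribution, to compare with Figure 2.5b
hist, edges = np.histogram(w, bins=100, range=(0.0, 1.0), density=True)
```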

[Plot: Weight Distribution p(w), Γ(32,1024,32), runs: 8, bins: 100, rand]
(a) The synaptic weight distribution after adapting to eight successive random mappings, in a network Γ(32, 1024, 32).

[Plot: Weight Distribution p(w), Γ(32,1024,32), runs: 15, bins: 100, rand]
(b) The synaptic weight distribution after adapting to 15 random mappings, in a network Γ(32, 1024, 32).

Figure 2.6 – The synaptic weight distribution evolves from a uniform distribution at the initialisation of the network to the distribution in Figure 2.5b.

2.1.3 Neural avalanches

The learning performance of the network can be measured by the number of depressions ρ required to completely learn a given mapping set M. This quantity will be loosely referred to as the learning time, although no particular timescale is thereby implied.

The learning performance is known [8] to improve with increasing middle layer sizes, as illustrated in Figure 2.7. This is an advantage over regular back-propagation learning, where in general the performance decreases with increasing middle layer size.

Let X be the random variable associated with the number of depressions ρ required to learn a mapping set M, and let Pr(X = ρ) ≡ Pr(ρ) be the probability of learning the mapping set M in ρ depressions.

The learning performance of the network Γ(n_i, n_m, n_o) is completely determined by the learning time distribution, characterised by the probability mass function p such that

p(ρ) = Pr(X = ρ),    (2.1)

∑_{ρ=0}^{∞} p(ρ) = 1.    (2.2)

The basic operation for measuring p(ρ) is to record the number of depressions ρ required to learn the current mapping set M, increment by one the count of mapping sets learnt in ρ depressions, present a new mapping set to the network, and so on. However, certain aspects of the simulation setup have a noticeable impact on the measured values, so these are discussed in greater detail below.
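A sketch of this measurement loop, reusing the hypothetical init_network and learn_mapping_set helpers from the sketch in Section 2.1 and assuming the random mapping change mode; the weights are deliberately not reset between mapping sets:

```python
import numpy as np
from collections import Counter

def measure_p_rho(n_i, n_m, n_o, runs, rng):
    """Empirical learning time distribution p(rho) over successive
    random mapping sets, without resetting the weights in between."""
    w_mi, w_om = init_network(n_i, n_m, n_o, rng)
    counts = Counter()
    for _ in range(runs):
        M = rng.integers(n_o, size=n_i)    # random mapping change mode
        counts[learn_mapping_set(M, w_mi, w_om, rng)] += 1
    return {rho: c / runs for rho, c in sorted(counts.items())}
```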


[Plot: Average Learning Time ⟨ρ⟩, Γ(6,*,7), runs: 50×1e+4, slow; ⟨ρ⟩ against middle layer nodes]
(a) The average number of depressions ⟨ρ⟩ required to learn a random mapping set M, for different sizes of the middle layer.

[Plot: Average Interference Events ⟨ν⟩, Γ(6,*,7), runs: 50×1e+4, slow; ⟨ν⟩ against middle layer nodes]
(b) The average number of interference events ⟨ν⟩ while learning a random mapping set M, for different sizes of the middle layer.

Figure 2.7 – Increasing the number of middle layer nodes decreases the average learning time ⟨ρ⟩ and the average number of interference events ⟨ν⟩. The plots show ⟨ρ⟩ and ⟨ν⟩ for networks with six input nodes, seven output nodes and a varying number of middle layer nodes.

One could require the weights of the network to be reset prior to learning the next mapping set, but this would amount to measuring first-mapping learning times. Instead, the weights of the network are not reset at each new mapping set, which yields a measure that is closer to the on-line learning performance.

A further distinction can be made on the degree of similarity between the new mapping set being presented to the network and the previous one. These can be completely random or differ in a small number of mappings only. Borrowing from J.R. Wakeling [20], the slow mapping change mode corresponds to a single mapping change2 in the new mapping set. The random mapping change mode corresponds to random mapping sets being presented to the network.
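The two mapping change modes can be sketched as generators of the next mapping set. The function names are illustrative, and the slow change generator below allows inputs to share an output node, matching the caveat in the footnote:

```python
import numpy as np

def next_mapping_random(M_prev, n_o, rng):
    """Random change mode: a completely new random mapping set."""
    return rng.integers(n_o, size=M_prev.size)

def next_mapping_slow(M_prev, n_o, rng):
    """Slow change mode: change the required output of one input node."""
    M = M_prev.copy()
    i = int(rng.integers(M.size))
    choices = [o for o in range(n_o) if o != M[i]]   # force an actual change
    M[i] = rng.choice(choices)
    return M
```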

The distribution p(ρ) for a network Γ(8, ∗, 9) is shown in Figure 2.8a. The tail of the distribution (i.e. long learning times) recedes for larger middle layer sizes, which is consistent with the plot in Figure 2.7.

An interesting aspect of the model is the power law tail of p(ρ) [20], as illustrated in Figure 2.8b. The power law tail is a telltale sign of scale-free behaviour, for which no single value of ρ is typical of the learning time in those networks. Shorter values of ρ occur more frequently than longer ones, but the latter occur frequently enough not to be singled out as exceptional.

The power-law tail of p(ρ) can be understood as avalanches of activation in the middle layer nodes. Borrowing terminology from statistical physics, three operating regimes are then identified:

∙ Sub-critical regime for n_m ≫ n_i n_o

2The slow change learning mode requires n_o ≥ n_i + 1, unless the input nodes can share the same output node.


[Plot: Learning Time Distribution p(ρ), runs: 1e+6, slow; log-log, curves Γ(8,36,9), Γ(8,72,9), Γ(8,144,9)]
(a) The learning time distributions for several middle layer sizes of a network with eight input and nine output nodes. Data from 1e+6 runs.

[Plot: Learning Time Distribution p(ρ), runs: 1e+6, slow; log-log, curves Γ(8,72,9), Γ(16,272,17), Γ(32,1056,33), Γ(64,4160,65)]
(b) The learning time distributions for several networks with critical middle layer size. Data from 1e+6 runs.

Figure 2.8 – The learning time distributions reveal three distinct regimes: sub-critical, critical and super-critical. The critical regime exhibits power law behaviour with p(ρ) ∼ ρ^{−1.3} according to [20].

∙ Critical regime for n_m ∼ n_i n_o

∙ Super-critical regime for n_m ≪ n_i n_o

In [20], J.R. Wakeling proposed that the power law tail of p(ρ) corresponds to an order-disorder phase transition in the model, and that the key difference in the learning dynamics of the three operating regimes is the interference probability Pr(ν):

∙ In sub-critical networks, there are enough middle layer nodes for interference events to be quite rare, and therefore learning is very quick.

∙ In super-critical networks, there are hardly enough middle layer nodes to learn without inducing interference events, and therefore learning is extremely slow.

∙ The learning dynamics for critical network sizes is in between the other two regimes, with just enough interference to occasionally cause large learning times, while most of the time the learning times are quite fast.

However, it should be noted that the model has not been proved to be critical in the proper statistical physics sense that would merit such terminology. Assessing criticality for the model in the two-layer topology is certainly challenging.

Furthermore, the approximately straight segments in the distributions of Figure 2.8b do not necessarily imply that p(ρ) has a proper power law tail, as very clearly explained in the paper by Clauset, Shalizi and Newman [9]. Straight segments in a log-log plot are a necessary but not sufficient condition for p(ρ) to be a power law tail distribution. Due to time constraints, however, no conclusive power law testing was completed for p(ρ), and in consequence the terminology proposed in [20] is adopted throughout the document.
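As a first diagnostic in that direction, the tail exponent can at least be estimated by the maximum likelihood method recommended in [9] rather than by a straight-line fit. The sketch below uses the continuous approximation for discrete data given there; the cut-off rho_min is an assumed, externally chosen parameter:

```python
import numpy as np

def tail_exponent_mle(rho_samples, rho_min):
    """Clauset-Shalizi-Newman MLE for p(rho) ~ rho^(-alpha) above rho_min,
    using their continuous approximation for discrete data."""
    x = np.asarray([r for r in rho_samples if r >= rho_min], dtype=float)
    n = x.size
    alpha = 1.0 + n / np.sum(np.log(x / (rho_min - 0.5)))
    sigma = (alpha - 1.0) / np.sqrt(n)   # standard error on alpha
    return alpha, sigma
```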

2.1.4 Summary

The key elements of this section are the following:


∙ Network dynamics: Defined by "Winner-Take-All" dynamics and learning by synaptic depression (negative feedback).

∙ Input-output learning: The network is able to learn an arbitrary mapping set M where to each input node i corresponds an output node o = M(i).

∙ Local flagging mechanism: Plasticity changes are locally marked for recent activity.

∙ Global feedback mechanism: Feedback is provided in the form of a global feedback signal specifying whether the most recent changes are unsatisfactory.

∙ Solution to the credit-assignment problem: Requires sparse network activity, so that no plasticity changes occur until a global feedback signal is received.

∙ Two typical timescales: The network signalling occurs on the time scale of the firing patterns, while the tagging and feedback mechanisms occur on a much longer timescale that is relevant to the scale of events in the external world.

∙ Interference event ν: The learning of input-output mappings can be disrupted by the unlearning of previously learnt mappings.

∙ Metastable synaptic landscape: The active configuration is barely supported by the winning weights.

∙ Neural avalanches: For middle layer sizes n_m = n_i n_o the network displays power-law behaviour in the learning time distribution p(ρ).

2.2 Storing Mappings

The plasticity rules introduced in Section 2.1 enable the network to learn a random mapping set M and quickly adapt to another mapping set whenever needed. Not much information is left [1] in the synaptic weights to reliably retrieve M at a later stage, since the active weights that supported M were depressed3 by a random amount in [0, 1] to support the new mapping set.

An additional mechanism is therefore required to store the information from previously learnt mapping sets for later recall. It turns out that such a mechanism exists and amounts to depressing less severely the weights that have been successful in the past; it is called the selective punishment rule [8][1].

The selective punishment rule requires small modifications to the plasticity rules, to enable the distinction between successful and unsuccessful weights:

1. A random input node i is selected.

2. Input node i fires and activates a middle layer node m and output node o according to the WTA rule.

3. If output node o is correct, i.e. o = M(i), tag the weights w(m, i) and w(o,m) as successful and return to step 1.

3More specifically, the active weights that are not shared by the previous and the new mapping sets.


4. Otherwise depress the active weights w(m, i) and w(o,m) by:

∙ A random amount in [0, 1] if w(m, i) and w(o,m) have never been successful.

∙ A random amount in [0, δ] otherwise.

Return to step 1.
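A sketch of the modified depression step, assuming boolean tag arrays stored alongside the weight matrices; the names and data layout are illustrative, not from the original papers:

```python
import numpy as np

def depress_selective(i, m, o, w_mi, w_om, tag_mi, tag_om, delta, rng):
    """Step 4 with selective punishment: weights tagged as previously
    successful are depressed by a random amount in [0, delta], with
    delta << 1, instead of a random amount in [0, 1]."""
    w_mi[m, i] -= rng.random() * (delta if tag_mi[m, i] else 1.0)
    w_om[o, m] -= rng.random() * (delta if tag_om[o, m] else 1.0)

def tag_successful(i, m, o, tag_mi, tag_om):
    """Step 3: tag the active weights as successful on a correct mapping."""
    tag_mi[m, i] = True
    tag_om[o, m] = True

# tags start all-False: tag_mi = np.zeros_like(w_mi, dtype=bool), etc.
```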

In the Chialvo-Bak model, recalling a mapping set M refers to a different operation than in other neural network models. Since synaptic plasticity is required to retrieve the information stored in the synaptic weights, the network is re-adapting to a previously seen mapping set rather than recalling it. Nevertheless, in order to distinguish this from the learning rules without selective punishment, the term recall will be used.

[Plot: Recall Time ρ, Γ(6,36,6), runs: 50, rand; curves M1–M4]
(a) Example of learn and recall performance without selective punishment.

[Plot: Recall Time ρ, Γ(6,36,6), runs: 50, select rand; curves M1–M4]
(b) Example of learn and recall performance with the selective punishment rule.

Figure 2.9 – The number of depressions ρ required to first learn and then recall four random mapping sets M1, …, M4. The network is presented with the mapping sets in random succession, and the value of ρ is recorded for each graph, i.e. at recall = 10 the network has seen each mapping set 10 times.

The selective punishment rule results in a dramatic performance increase (under the random mapping change mode), as shown in Figure 2.9. This performance increase results from the network establishing preferred paths from each input node to the output nodes required by the mappings being presented. These preferred paths are the first to be queried when the active configuration is no longer correct. A detailed example of the selective punishment dynamics is presented in the Appendix of this section.

The weights tagged by the selective punishment rule are constrained to a region4 within a distance δ from unity, as shown in Figure 2.10. This region is referred to as the δ-band.

2.2.1 Summary

The key elements of this section are the following:

4The uniform distribution of weights in the δ-band in Figure 2.10 results from depressing the tagged weights by a fixed amount δ rather than by a random amount in [0, δ]. In the latter case, the resulting distribution of weights in the δ-band would be similar to that of Figure 2.5b.


[Plot: Weight Distribution p(w), Γ(32,1024,32), runs: 1e+5, bins: 5e+4, select rand; weights in [0.998, 1]]
Figure 2.10 – The weights tagged by the selective punishment rule are kept within the δ-band located in [1 − δ, 1]. This is where the memory of previously successful mappings is stored. Discrete distribution of 500 bins.

∙ Selective punishment rule: Recalling previously learnt mapping sets is enabled by depressing weights that have been successful in the past by a smaller random amount in [0, δ] when no longer leading to the desired mapping.

∙ Selective punishment dynamics: The performance increase results from the network establishing preferred paths from input nodes to output nodes, as required by the learnt mapping sets. On average these preferred paths are queried much more often.

∙ δ-band: Contains the weights representing the memory of previously successful mappings and is located in [1 − δ, 1].

2.2.2 Appendix

The example from Figure 2.9 will be used to illustrate the dynamics of selective punishment. Suppose that M1, …, M4 are presented to the network in that order and require input one to activate output nodes {1, 3, 6, 6} respectively.

After learning the mapping from input one to the output node required by M1, the winning weights w(m1, 1) and w(1, m1) are tagged by the selective punishment rule.

When presented with M2, input node one should now activate output node three, and the weights w(m1, 1) and w(1, m1) are depressed accordingly. This results in a series of successive depressions that bring these weights slightly below the respective second highest weights.

This succession of depressions is necessary since the average weight spacing for this network is 1/n_m ∼ 0.03 for the weight set w(m, i) and 1/n_o ∼ 0.14 for the weight set w(o, m), whereas w(m1, 1) and w(1, m1) are now depressed by small random amounts in [0, δ], with δ ≤ 0.001. This accounts for the relatively higher adaptation values for learning mappings M2, M3 and M4 in Figure 2.9b, when compared to Figure 2.9a.

This also illustrates the negative performance impact that synaptic potentiation would have in this network, since it would lead to large weight differences (in units of depression amounts) between the active weights and the other weights. A metastable synaptic landscape, such as the one illustrated in Figure 2.5a, is a requirement for the network to quickly converge to new mapping sets.

Eventually, either w(m1, 1) or w(1, m1) will be depressed below the other weights. Supposing that w(m1, 1) is first, node one then switches from middle layer node m1 to another middle layer node m′, which has probability 1/n_o of activating output node three. If that is the case, input node one has learnt the correct mapping for M2 and the weights w(m′, 1) and w(3, m′) are also tagged as successful.

If middle layer node m′ does not lead to output node three, it is depressed accordingly. Node one is then very likely to switch back to middle layer node m1, which will still activate output node one, unless weight w(1, m1) is already the second highest weight of node m1, in which case it has a chance of activating a different output node.

If output node three is still not found after a few more depressions, the search sequence will then alternate between middle layer node m1 and other middle layer nodes, and the output node of middle layer node m1 will alternate between output node one and the other output nodes.

After learning the mapping sets M1, …, M4 for the first time, each input node has one or more preferred paths formed by the pairs of tagged weights that lead to the required output nodes. The network will poll these weights much more frequently.

The example above can easily be generalised to other input nodes, and to the case where weight w(1, m1) reaches the second highest weight before weight w(m1, 1) does.

A sample run in which the mapping sets M1, …, M4 required input one to activate output nodes {1, 3, 6, 6} resulted in the preferred paths for input node one shown in Table 2.1:

Preferred paths from input 1

To middle layer node m    From middle layer node m to output node o
3                         6
12                        2, 6
30                        6
31                        1
34                        3

Table 2.1 – The successful weights tagged by the selective punishment rule result in a set of preferred paths for input node 1.

The preferred path to output node two, which is not required for input node one, was added by input node two when learning mapping set M4.

2.3 Advanced Learning

The type of learning problem that the model can tackle so far is better described as a routing problem: given a mapping set M, the input nodes i have to find paths to the output nodes M(i).


A more advanced type of learning consists in considering mappings M between input node activation patterns I and the activation of specific output nodes o = M(I). As before, the state of each input node can be active (i = 1) or inactive (i = 0), and the entire configuration of input nodes is represented by a binary vector I. For example, I = {1, 0, 0, 1, 0, 1} is an activation pattern for a network with six input nodes.

Learning the basic Boolean functions F = {AND, OR, XOR, NAND, …} is a particular example of this type of learning. The logical values of the propositions p and q are represented by the states of two input nodes, and the logical value of the function F(p, q) is represented by the activation of one of two output nodes.

The changes to the plasticity rules that are required to learn this type of problem are surprisingly small and amount to a slightly modified "Winner-Take-All" rule:

∙ Input configuration I = {x_1, …, x_{n_i}} activates the middle layer node j with maximum h_j = ∑_{i=1}^{n_i} w(j, i) x_i.

∙ Middle node m activates the output node o with maximum w(o, m), as before.

A bias input node that is always active is necessary in order to compute the state wherethe remaining input nodes are inactive.

An example of the network solving the XOR problem under the above plasticity rules is shown in Figure 2.11. Example weights that implement a solution to the XOR problem are shown in Table 2.2 in the Appendix of this section.

Figure 2.11 – An example active configuration implementing the XOR truth table.

The network can learn the XOR problem with three middle layer nodes. In general, it can learn any mapping M with n_m = 2^{n_i} middle layer nodes, by discovering for each input pattern I the corresponding middle layer node pointing to the correct output M(I). As there are as many middle layer nodes as input configurations, learning convergence is guaranteed. As can be appreciated from the XOR example, the network can learn with fewer middle layer nodes, but the exact minimum depends on M.

2.3.1 Summary

The key elements of this section are the following:

∙ Advanced learning capability: The model can learn mappings M from input node configurations I, representing the activation state of the input nodes, to the respective output nodes o = M(I). In particular, the basic Boolean functions can be learned.


∙ Advanced learning plasticity rule: The middle layer node with the maximum weighted sum of the weights w(m, i) over the active input nodes is selected, and it activates the output node as before.

2.3.2 Appendix

Example weights that implement a solution to the XOR problem are shown in Table 2.2.

Input to middle w(m, i)    Middle to output w(o, m)

1 → 1  0.5                 1 → 1  0.1
1 → 2  0.4                 1 → 2  0.2
1 → 3  0.1

2 → 1  0.3                 2 → 1  0.4
2 → 2  0.7                 2 → 2  0.3
2 → 3  0.9

3 → 1  0.2                 3 → 1  0.5
3 → 2  0.6                 3 → 2  0.6
3 → 3  0.8

Table 2.2 – Example weights that solve the XOR problem under the advanced learning plasticity rules.
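The table can be checked directly against the modified WTA rule. The sketch below assumes, consistently with the table, that input node 1 is the always-active bias node and that output node 1 encodes TRUE; both conventions are inferred here rather than stated in the text:

```python
import numpy as np

# Table 2.2 weights; rows are target nodes, columns are source nodes.
W_mi = np.array([[0.5, 0.3, 0.2],    # w(m, i): input -> middle
                 [0.4, 0.7, 0.6],
                 [0.1, 0.9, 0.8]])
W_om = np.array([[0.1, 0.4, 0.5],    # w(o, m): middle -> output
                 [0.2, 0.3, 0.6]])

for p in (0, 1):
    for q in (0, 1):
        I = np.array([1, p, q])         # input node 1 taken as the bias
        m = int(np.argmax(W_mi @ I))    # modified WTA: max weighted input sum
        o = int(np.argmax(W_om[:, m]))
        assert (o == 0) == bool(p ^ q)  # output node 1 (index 0) taken as TRUE
print("Table 2.2 implements XOR under the modified WTA rule")
```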


Chapter 3

Research Results

This chapter presents the results of the research conducted during this MSc project.

3.1 δ-band Saturation

The selective punishment rule enables the network to quickly re-adapt to previously learnt mappings by depressing less severely the weights that were successful at least once in the past. It is also known that this mechanism has an ageing effect at large time scales [1], as illustrated in Figure 3.1.

[Plot: Recall Time ρ, Γ(6,36,6), runs: 50, select rand; curves M1–M4]
(a) At short time scales the selective punishment rule drastically improves the ability to quickly recall previously learnt mapping sets.

[Plot: Recall Time ρ, Γ(6,36,6), runs: 2000, select rand; curves M1–M4]
(b) At long time scales the recall performance degrades progressively.

Figure 3.1 – The ageing effect on the selective punishment performance at large time scales.

As before, the term recall will refer to the re-adaptation of the network to a previously learnt mapping set M.

The performance degradation of selective punishment is caused by the saturation of the δ-band, the region within a distance δ from unity to which the weights tagged as successful are constrained.

21

Page 27: Self-Organised Learning in the Chialvo-Bak Model … Learning in the Chialvo-Bak Model MSc Project Marco Brigham T H E U NIVE R S I T Y O F E DINBU R G H Master of Science Artificial

For the selective punishment rule to be effective, each input node should be able to quickly sort through the tagged weights to recover the preferred path to the correct output node. Ideally, each input node would have established one preferred path for each previously learnt output node. As such, it would take a number of depressions of the order of the number of preferred paths to find the correct output node.

On the other hand, if the number of preferred paths leading to the same output node grows further, the advantage of the tagging mechanism in identifying the preferred middle layer nodes ceases to be effective.

The increase in the number of paths leading to the same output node is a consequence of all weights eventually being given a chance to participate in correct mappings. Consider the continuous raising of the weights caused by the depressions of the plasticity rules. Each raising step is the difference between the highest weight and the second highest. Eventually, all weights end up in the δ-band and are soon able to compete with the tagged weights for participation in a correct configuration. Once that occurs, one additional path to the output node is created.

The monotonous increase of tagged weights leads to a saturation of the δ-band, as increasingly many weights are confined to that region of weight space. This effect can be appreciated in Figure 3.2a, and the corresponding increase of recall times is illustrated in Figure 3.2b.

[Plot: δ-band Saturation, Γ(6,36,6), runs: 100×250, map: 128, rand; tagged weights (%) against recall]
(a) The number of tagged weights increases monotonically and leads to a saturation of the δ-band.

[Plot: Average Recall Time ⟨ρ⟩, Γ(6,36,6), runs: 100×250, map: 128, rand; curves: not selective, selective]
(b) The performance of the selective punishment rule degrades with successive recalls.

Figure 3.2 – As the percentage of weights tagged as correct increases, the recall times approach the performance of the regime without selective punishment.

3.1.1 Desaturation strategies

The monotonous increase of tagged weights is a consequence of the permanent tagging of the selective punishment rule. As such, a mechanism is required to limit the tagging lifespan.

P. Bak and D. Chialvo proposed [1] a mechanism of neuron ageing to tackle this issue, where nodes are replaced at a fixed rate, their weights randomised and the tagging information removed. However, the neuron replacement rate may need to depend on the level of network activity in order to successfully counter the saturation rate of the δ-band.

Several strategies that result in non-permanent tagging were reviewed, and the first of them was selected for implementation:

∙ Global tag threshold: Weights are untagged after being unsuccessful for more than a global threshold number of depressions.

∙ Local tag threshold: As the previous, but the threshold depends on the past performance of each weight.

∙ Interference correction: Weights are untagged after a threshold number of interference events.

3.1.2 Global tag threshold

The global tag threshold has been investigated in greater detail, and numerical simulations suggest that an optimal threshold value exists for each network size.
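A sketch of the bookkeeping involved, under one reading of the rule (assumed here): each tagged weight keeps a miss counter that is reset whenever the weight participates in a correct mapping, and once the counter exceeds the global threshold the weight is untagged. The names are illustrative:

```python
import numpy as np

def punish(idx, w, tag, misses, threshold, delta, rng):
    """Depress weight w[idx]; tagged weights are punished gently but are
    untagged after more than `threshold` unsuccessful depressions."""
    if tag[idx]:
        w[idx] -= rng.random() * delta
        misses[idx] += 1
        if misses[idx] > threshold:   # global tag threshold reached:
            tag[idx] = False          # forget this preferred path
            misses[idx] = 0
    else:
        w[idx] -= rng.random()

def reward(idx, tag, misses):
    """On a correct mapping, (re)tag the weight and reset its miss count."""
    tag[idx] = True
    misses[idx] = 0
```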

In order to ensure that the optimal tag threshold value does not depend on the activity level of the network, an increasing number of mapping sets was presented to networks of the same size. The performance of the optimal tag threshold value was consistent across the number of presented mapping sets, as illustrated in Figure 3.3. For the network Γ(6, 36, 6) the optimal tag threshold value is close to 48.

[Plot: Average δ-band Saturation, Γ(6,36,6), runs: 50×250, rand; average tagged weights (%) against mappings, thresholds 16, 32, 48, 64, 80]
(a) The average saturation of the δ-band for several global tag threshold values is consistent across the number of presented mapping sets.

[Plot: Average Recall Time ⟨ρ⟩, Γ(6,36,6), runs: 50×250, rand; ⟨ρ⟩ against mappings, thresholds 16, 32, 48, 64, 80]
(b) The average recall time ⟨ρ⟩ for several global tag threshold values is consistent across the number of presented mapping sets.

Figure 3.3 – For the network Γ(6, 36, 6) the optimal global tag threshold value is close to 48.

For networks with a larger number of input nodes n_i the optimal tag threshold value is also higher. This was verified for the network Γ(12, 144, 12), where the optimal tag threshold value is around 64. This is illustrated in Figure 3.4.

The optimal tag threshold seems closely related to an optimal average number of tagged middle layer nodes and tagged output nodes behind them, as illustrated in Figure 3.5.


[Plot: Average δ-band Saturation, Γ(12,144,12), runs: 50×250, rand; thresholds 32, 48, 64, 80]
(a) The average saturation of the δ-band for several global tag threshold values.

[Plot: Average Recall Time ⟨ρ⟩, Γ(12,144,12), runs: 50×250, rand; thresholds 32, 48, 64, 80]
(b) The average recall time ⟨ρ⟩ for several global tag threshold values.

Figure 3.4 – For the network Γ(12, 144, 12) the optimal global tag threshold value is around 64.


Tag threshold values that are too low result in the network forgetting successful nodes too fast, as can be observed from the sharp decrease in the average recall time ⟨ρ⟩ in Figure 3.5. Tag threshold values that are too high fail to remove path redundancy fast enough; this results in an increase of the average recall time ⟨ρ⟩, as illustrated in Figure 3.4b for the tag threshold value of 80, for example. Somewhere in between lies the optimal value.

The blue line in Figure 3.5b represents the average number of tagged middle layer nodes per input node. From Table 3.1, the number of tagged middle layer nodes surpasses the number of output nodes ηo from tag threshold 32 and above.

The green line of the same graph represents the average number of tagged output nodes per tagged middle layer node. This value decreases until reaching tag threshold 32, where the network compensates for the lack of enough tagged middle layer nodes by tagging more output nodes per tagged middle layer node. Near tag threshold 32 a minimum is reached, and the value starts growing again as higher tag threshold values cannot get rid of path redundancy fast enough.

The best tag threshold found for this network produced an average of 7.9 tagged middle layer nodes, which is nearly two nodes more than the number of output nodes ηo. The full results are presented in Table 3.1 in the appendix of this section.

3.1.3 Summary

The key elements of this section are the following:

∙ δ-band saturation: The permanent tagging of successful weights results in a monotonic increase of weights being constrained to the δ-band region, therefore reducing the performance advantage of the selective punishment rule.

∙ Desaturation strategies: Several desaturation strategies are possible and amount to capping the tagging lifetime.


Global Threshold   Average Middle Layer   Average Output Layer
16                 4.8350                 1.5447
24                 5.3833                 1.5129
32                 6.0450                 1.4118
40                 6.9167                 1.4135
48                 7.9133                 1.5362
56                 9.3767                 1.6998
64                 10.9283                1.9372
72                 12.4167                2.1066
80                 14.4617                2.3799

Table 3.1 – Average number of tagged middle layer nodes per input node and average number of tagged output nodes per tagged middle layer node for different values of the global tag threshold in the network Γ(6, 36, 6).

(a) [Plot: average recall time ⟨ρ⟩ vs. global tag threshold, Γ(6,36,6), runs: 100x250, map: 128, rand.] The average recall time ⟨ρ⟩ decreases sharply until reaching the optimal global tag threshold value and slowly starts increasing again beyond that point.

(b) [Plot: average tagged nodes (middle; output x10) vs. global tag threshold, Γ(6,36,6), runs: 100x250, map: 128, rand.] The average number of tagged middle layer nodes steadily increases with increasing global tag threshold values. The average number of tagged output nodes per tagged middle layer node has a minimum when the average tagged middle layer nodes are equal to the number of output nodes ηo.

Figure 3.5 – The relation between the optimal global tag threshold and the average number of tagged middle layer nodes and average number of tagged output nodes per tagged middle layer node.


∙ Global tag threshold: Sets a global limit on the number of times a tagged weight can be wrong before becoming untagged. There is an optimal tag threshold for each network size that is independent of the level of network activity.

∙ Global tag threshold dynamics: The optimal value of the global threshold is related to the average number of tagged middle layer nodes and average number of tagged output nodes behind them.


3.2 Markov Chain Representation

A Chialvo-Bak network can be represented by a first-order Markov chain when considering the evolution of the network as a sequence of learnt-mappings states. Such a representation is useful for deriving several statistical properties of the model analytically, such as the average learning time ⟨ρ⟩, the learning time distribution p(ρ) and the average number of interference events ⟨ν⟩.

The Markov chain representation seems appropriate since the evolution of the network is to a large extent stochastic. The random amounts of depression from the plasticity rules result in changes to the active configuration C, which is the basic macroscopic state of the network. Assuming the transition from an active configuration C to a new active configuration C′ is stochastic and only depends on C, the evolution of the network can be described by a first-order Markov chain¹.

Rather than considering the evolution between active configurations C, a more meaningful basic state for the Markov chain representation is the learning state Sn of the network, i.e. the number n of currently learnt mappings. For each active configuration C the learning state Sn is determined by counting the number n of learnt mappings of C. As such, the correspondence between C and Sn maintains the Markov chain properties mentioned above.

The Markov chain has ηi + 1 states, denoted S0, S1, ..., Sηi, corresponding to learning from zero up to ηi mappings.

In this context, learning a new mapping set M corresponds to the Markov chain starting from an initial state Si and evolving towards the final state Sηi. The chain has a transition from Sn to Sn+1 when an additional output node is learnt, and from Sn to Sn−1 in the case of an interference event. This is illustrated in Figure 3.6.

Figure 3.6 – The Markov chain representation considers the evolution of the network in terms of the number of learnt mappings. In this example, the network started in the state S2 and successfully reached the final state S3 after two depressions. The evolution sequence was S2 ⇒ S2 ⇒ S3. The target mapping set is M = {1, 2, 3}.

The final state Sηi is special since the Markov chain stops when arriving at state Sηi, and no transition from Sηi to any other state is possible. This corresponds to the network having fully learnt the mapping set M, so no further depressions are necessary. The fact that all states can reach the state Sηi and that no transition is possible from Sηi to any other state qualifies the Markov chain as an absorbing Markov chain.

¹ A k-order Markov chain would depend on the k previous steps.



In general, there are four possible transitions from a given state Sn. Writing S(t) for the state of the system at evolution step t, if S(t) = Sn then S(t+1) ∈ {Sn−1, Sn, Sn+1, Sn+2}. Figure 3.7 shows an example of the transition Sn ⇒ Sn+2.

Figure 3.7 – The network can learn up to two mappings after one depression, giving the transitions Sn ⇒ Sn+1 and Sn ⇒ Sn+2, and it can unlearn one single mapping, giving Sn ⇒ Sn−1.

For networks with three input nodes, all possible transitions are shown in Figure 3.8, where the arrows indicate the direction of the transitions.

Figure 3.8 – All possible transitions for a network with three input nodes are shown above. The arrows indicate the direction of the transitions. No transition is possible from S3 to any of the other states.

A Markov chain is completely determined by the state transition matrix A, whose columns specify the transition probabilities between states Sn, and the initial state probability vector p, which specifies the initial state probabilities Pr(S(0)).

The element amn of the state transition matrix A is the probability Pr(Sm|Sn) of the transition Sn ⇒ Sm. The element pk of the column vector p is the probability Pr(S(0) = Sk−1) of starting the chain in state Sk−1.


For the elements of A and p to represent valid probabilities the following must hold:

\[
\sum_{m=0}^{\eta_i} a_{mn} = 1 \quad \text{for any column index } n \text{ of } A \tag{3.1}
\]

\[
\sum_{k=0}^{\eta_i} p_k = 1 \tag{3.2}
\]

For a network with three input nodes, A and p have the following form:

\[
A = \begin{pmatrix}
a_{00} & a_{01} & 0 & 0 \\
a_{10} & a_{11} & a_{12} & 0 \\
a_{20} & a_{21} & a_{22} & 0 \\
0 & a_{31} & a_{32} & 1
\end{pmatrix}
\quad\text{and}\quad p = (p_0\ p_1\ p_2\ p_3)^t,
\]

where a30 = a02 = 0 since the corresponding transitions are not possible (see Figure 3.8).

For example, running this network in the slow change mode, where only one mapping is changed each time, corresponds to the initial state probability vector p = (0 0 1 0)t.

For a general network Γ(ηi, ηm, ηo), A and p have the following form:

\[
A = \begin{pmatrix}
a_{00} & a_{01} & 0      & \cdots & 0 & 0 & 0 & 0 \\
a_{10} & a_{11} & a_{12} & \cdots & 0 & 0 & 0 & 0 \\
a_{20} & a_{21} & a_{22} & \cdots & 0 & 0 & 0 & 0 \\
0      & a_{31} & a_{32} & \cdots & 0 & 0 & 0 & 0 \\
0      & 0      & a_{42} & \cdots & 0 & 0 & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & a_{\eta_i-4\,\eta_i-3} & 0 & 0 & 0 \\
0 & 0 & 0 & \cdots & a_{\eta_i-3\,\eta_i-3} & a_{\eta_i-3\,\eta_i-2} & 0 & 0 \\
0 & 0 & 0 & \cdots & a_{\eta_i-2\,\eta_i-3} & a_{\eta_i-2\,\eta_i-2} & a_{\eta_i-2\,\eta_i-1} & 0 \\
0 & 0 & 0 & \cdots & a_{\eta_i-1\,\eta_i-3} & a_{\eta_i-1\,\eta_i-2} & a_{\eta_i-1\,\eta_i-1} & 0 \\
0 & 0 & 0 & \cdots & 0 & a_{\eta_i\,\eta_i-2} & a_{\eta_i\,\eta_i-1} & 1
\end{pmatrix}
\qquad p = (p_0\ p_1\ \cdots\ p_{\eta_i})^t,
\]

with a(n−2)n = a(n−3)n = ⋯ = a0n = 0 and a(n+3)n = ⋯ = aηi n = 0 since the corresponding transitions are impossible (for appropriate values of n).

3.2.1 Statistical properties

The state transition matrix A and the initial state probability vector p make it straightforward to compute the statistics of the Markov chain. For detailed analytical derivations see [13] and [17], for example.


For example, the elements of the n-th power of A, denoted Aⁿ, yield the transition probabilities to a given state in n steps, i.e. a^(n)_{mn} is the transition probability from state Sn to state Sm in n steps.

To motivate the above statement, consider the previous example of the network with three inputs. The probability Pr(2)(S2|S1) of going from state S1 to state S2 in two steps is computed as follows:

\[
\Pr{}^{(2)}(S_2|S_1) = \Pr(S_2|S_1)\Pr(S_1|S_1) + \Pr(S_2|S_2)\Pr(S_2|S_1) + \Pr(S_2|S_3)\Pr(S_3|S_1) \tag{3.3}
\]

which in terms of the transition matrix elements is written:

\[
\Pr{}^{(2)}(S_2|S_1) = a_{21}a_{11} + a_{22}a_{21} + a_{23}a_{31}. \tag{3.4}
\]

The last expression is the product of the row of index 2 of A with the column of index 1 of A, which is the element (2, 1) of the product AA ≡ A² of A with itself.

An important observation is that the above computation relied explicitly on the defining properties of the Markov chain: step S(t + 1) is completely determined from step S(t) and the state transition probability matrix A. This allowed Pr(2)(S2|S1) to be factored in terms of the accessible intermediate states in Eq. (3.3), and the resulting probabilities to be associated with the elements of A in Eq. (3.4). This will be useful when interpreting the numerical evidence for the Markov chain representation.

The powers of matrix A lead to a straightforward computation of the learning time distribution p(ρ). Since the last row of Aⁿ gives the transition probabilities to the absorbing state Sηi in n steps when starting from each of the ηi + 1 states, the product with p gives the probability of reaching the absorbing state in n steps when starting from the initial state distribution p:

\[
p(\rho \le n) = (0 \cdots 0\ 1)\, A^n\, p, \qquad n \in \mathbb{N}_0 \tag{3.5}
\]
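In code, Eq. (3.5) amounts to one matrix power; a minimal numpy sketch (the convention that the absorbing state is the last index is assumed throughout):

    import numpy as np

    def p_learning_time_within(A, p, n):
        """Eq. (3.5): probability of having reached the absorbing (last) state in n steps."""
        An = np.linalg.matrix_power(A, n)
        return An[-1, :] @ p  # last row of A^n dotted with the initial distribution

    # e.g. slow change mode for a three-input network: p = np.array([0, 0, 1, 0])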

The transient states matrix T contains the transition probabilities between non-absorbing states, i.e. all states excluding the absorbing state Sηi.

\[
A = \begin{pmatrix} T & 0 \\ F & 1 \end{pmatrix}
\]

where F is the row sub-matrix of transition probabilities to the final state Sηi.

The powers of matrix T are obtained from Aⁿ, and its elements t^(n)_{ij} correspond to the transition probability in n steps from the non-absorbing state Sj to the non-absorbing state Si.

\[
A^n = \begin{pmatrix} T^n & 0 \\ * & 1 \end{pmatrix}
\]

where ∗ represents the last row of Aⁿ except the element 1.

The learning time distribution p(ρ) can be expressed in terms of the sub-matrix Tⁿ, since the element j of the last row of Aⁿ is equal to one minus the sum of the j-th column of Tⁿ, i.e.

\[
a^{(n)}_{\eta_i j} = 1 - \sum_{i=0}^{\eta_i - 1} t^{(n)}_{ij}.
\]

Rewriting the first part of Eq. (3.5) in terms of Tⁿ yields:

\[
p(\rho \le n) = 1 - \mathbf{1}^t\, T^n\, \bar{p}, \qquad n \in \mathbb{N}_0 \tag{3.6}
\]

where p̄ = (p0 p1 ⋯ pηi−1)t contains the components of p except the probability of starting in the final state Sηi, and 1 is the ones column vector of size ηi.

An absorbing Markov chain has a number of interesting properties [13]:

∙ The matrix I − T has an inverse N = (I − T)⁻¹, which is called the Fundamental Matrix.

∙ N = I + T + T² + ⋯

∙ The element nij of matrix N is the expected number of times the chain is in state Si, when starting in state Sj, before reaching the absorbing state Sηi.

The last property yields the expectation of the learning time ⟨ρ⟩, since adding the entries of column j of N gives the expected number of times in all transient states before reaching the absorbing state Sηi, when starting in state Sj.

Therefore,

\[
\langle\rho\rangle = \mathbf{1}^t\, N\, \bar{p} \tag{3.7}
\]

Higher-order moments of the random variable ρ can be obtained from the expression for the factorial moments [17]:

\[
E[\rho(\rho-1)\cdots(\rho-n+1)] = n!\; \mathbf{1}^t\, T^{n-1}\, N^n\, \bar{p}
\]

where E[·] is the statistical expectation.

The expectation of the number of interference events ⟨ν⟩ is easily derived from N, by noting that the interference probability in a given state Sj is given by the element aj−1 j of A, i.e. Pr(ν|Sj) = aj−1 j.

Therefore:

\[
\langle\nu\rangle = v^t\, N\, \bar{p} \tag{3.8}
\]

where v = (0 a01 ⋯ aηi−2 ηi−1)t is the column vector with the elements of A corresponding to interference events.
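A minimal numpy sketch of Eqs. (3.6)-(3.8), again assuming a column-stochastic A with the absorbing state in the last position:

    import numpy as np

    def chain_statistics(A, p):
        T = A[:-1, :-1]                             # transient sub-matrix
        p_bar = p[:-1]                              # initial distribution over transient states
        N = np.linalg.inv(np.eye(T.shape[0]) - T)   # fundamental matrix N = (I - T)^-1
        mean_rho = N.sum(axis=0) @ p_bar            # Eq. (3.7): <rho> = 1^t N p_bar
        v = np.concatenate(([0.0], np.diag(A, 1)[:-1]))  # interference probabilities a_{j-1, j}
        mean_nu = v @ N @ p_bar                     # Eq. (3.8): <nu> = v^t N p_bar
        return mean_rho, mean_nu

    def p_leq(A, p, n):
        T, p_bar = A[:-1, :-1], p[:-1]
        return 1.0 - np.ones(len(p_bar)) @ np.linalg.matrix_power(T, n) @ p_bar  # Eq. (3.6)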

3.2.2 Markov chain representation: numerical evidence

To test the validity of the Markov chain representation the following experiment was performed:

∙ A test set of networks was selected, with input layer size ηi ranging from one to twelve input nodes.

∙ For each input layer size ηi:


– The output layer size was ηi + 1.

– Three middle layer sizes corresponding to the three regimes (sub-critical, critical, super-critical) were used: ηm^sub = 2 ηm^critical and ηm^super = ηm^critical / 2.

∙ Each network was simulated for a large number of runs and the following statistics were recorded:

– state transition counts

– initial state counts

– the learning time distribution p(ρ)

– the average learning time ⟨ρ⟩

– the average number of interference events ⟨ν⟩

The state transition and initial state counts can be extracted since the network evolves in discrete steps, and for each discrete step the learning state can be computed from the active configuration.

∙ The measured state transition and initial state counts are normalised according to Eqs. (3.1) and (3.2), yielding² the maximum likelihood estimators [6] AMLE and pMLE, respectively (see the sketch after this list).

∙ For each measured AMLE and pMLE, the quantities p(ρ)pred, ⟨ρ⟩pred and ⟨ν⟩pred are computed from Eqs. (3.6), (3.7) and (3.8) respectively.

∙ The measured values are compared to the computed values using the Normalised Root Mean Square Error (NRMSE) and the Kolmogorov-Smirnov D statistic:

– Normalised root mean square error:
\[
\mathrm{NRMSE}[X^{\mathrm{pred}}] = \frac{\sqrt{\sum_i \left( x_i^{\mathrm{pred}} - \langle x_i \rangle \right)^2}}{x_{\max} - x_{\min}}
\]

– Kolmogorov-Smirnov statistic:
\[
D = \max_{x_i} \left| p(X \le x_i)^{\mathrm{pred}} - p(X \le x_i) \right|
\]
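A sketch of the count normalisation and of the Kolmogorov-Smirnov comparison (array shapes as noted in the comments; this is an illustrative rendering, not the project code):

    import numpy as np

    def mle_estimates(transition_counts, initial_counts):
        """transition_counts[m, n]: observed transitions S_n -> S_m;
        initial_counts[k]: runs started in state S_k."""
        totals = transition_counts.sum(axis=0, keepdims=True)
        A_mle = transition_counts / np.where(totals == 0, 1, totals)  # column-normalised, Eq. (3.1)
        p_mle = initial_counts / initial_counts.sum()                 # Eq. (3.2)
        return A_mle, p_mle  # columns never visited stay all-zero in this sketch

    def ks_statistic(cdf_pred, cdf_measured):
        return float(np.max(np.abs(cdf_pred - cdf_measured)))  # Kolmogorov-Smirnov D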

In order to obtain enough state transition counts for each state to enable a good estimation of AMLE, the mappings were presented to the network with a uniform initial state probability (except for the state Sηi). Under this mapping change mode, the learning time distribution p(ρ) has a different shape than under the slow mapping change mode, for example. Figure 3.9 shows an example of measured learning time distributions p(ρ) for two sample networks, together with the predicted distributions p(ρ)pred obtained from Eq. (3.6).

The NRMSE for ⟨ρ⟩pred and ⟨ν⟩pred was zero across the tested networks. The actual value was below the numerical accuracy of the experiment, as estimated by the inverse of the number of samples in each simulation, 1/N. Such an NRMSE value is perhaps an indication that the underlying method is not applicable to validate the Markov chain representation.

The Kolmogorov-Smirnov D statistic for p(ρ)pred computed from Eq. (3.6) is shown in Figure 3.10.

² The derivation of the maximum likelihood estimator for A follows the usual scheme of maximising the data log-likelihood, with a Lagrange multiplier enforcing each column normalisation constraint of A.


(a) [Log-log plot of p(ρ) vs. ρ, runs: 4e+5, uniform; measured model vs. Markov prediction.] Super-critical network Γ(4, 10, 5).

(b) [Log-log plot of p(ρ) vs. ρ, runs: 1e+6, uniform; measured model vs. Markov prediction.] Sub-critical network Γ(12, 312, 13).

Figure 3.9 – The learning time distribution p(ρ) for a uniform initial state probability (except for the state Sηi). The black line represents the predicted distribution p(ρ)pred that is obtained from Eq. (3.6). The green line is the measured p(ρ).

[Plot: Kolmogorov-Smirnov statistic D vs. input layer size, for the super-critical, critical and sub-critical regimes.]

Figure 3.10 – The Kolmogorov-Smirnov statistic D for p(ρ)MLE for the test network set, with networks ranging from two input nodes to 12 input nodes. The statistic D is computed from Eq. (3.6) and decreases with increasing input layer size for the networks considered.

For the networks considered, the Kolmogorov-Smirnov D statistic was usually higher in the super-critical and critical regimes. With increasing input layer size, the statistic D decreases and converges for all three regimes. The abrupt reduction for the network with two input nodes led to additional measurements with larger output layers, but the results were very similar.

The value differences in the statistic D indicate that the Markov chain representation is more accurate for larger networks. Smaller networks may be better represented by other types of Markov chain, or may not allow a Markov chain representation at all by not respecting the Markov chain properties, for example.

A direct estimation of the Markov chain order might clarify the above question. Due to timing constraints this was not pursued. Several estimators exist for the Markov chain order, such as the BIC Markov order estimator [11] or the Peres-Shields order estimator [19].
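Short of a full BIC [11] or Peres-Shields [19] estimator, a crude order check can compare the empirical next-state statistics conditioned on one versus two past states; a sketch, assuming a long recorded state sequence from a simulation:

    from collections import Counter, defaultdict

    def order_gap(states):
        """Largest gap between Pr(next | current) and Pr(next | previous, current);
        large values hint that a first-order chain is insufficient."""
        first = defaultdict(Counter)
        second = defaultdict(Counter)
        for prev, cur, nxt in zip(states, states[1:], states[2:]):
            first[cur][nxt] += 1
            second[(prev, cur)][nxt] += 1
        gap = 0.0
        for (prev, cur), nxt_counts in second.items():
            tot2, tot1 = sum(nxt_counts.values()), sum(first[cur].values())
            for nxt, c in nxt_counts.items():
                gap = max(gap, abs(c / tot2 - first[cur][nxt] / tot1))
        return gap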


3.2.3 Analytical solution for Γ(2, ηm, ηo)

An analytical solution for A has been obtained for networks with two input nodes Γ(2, ηm, ηo), for the case where the input nodes cannot share output nodes³.

For Γ(2, ηm, ηo) the state transition matrix A and the initial state probability vector p have the form:

\[
A = \begin{pmatrix}
a_{00} & a_{01} & 0 \\
a_{10} & a_{11} & 0 \\
a_{20} & a_{21} & 1
\end{pmatrix}
\quad\text{and}\quad p = (p_0\ p_1\ p_2)^t, \tag{3.9}
\]

Each state Sn can be further separated into sub-states S^d_n corresponding to the graphs with d degrees of freedom in the middle layer. The basic graphs S^d_n for Γ(2, ηm, ηo) are illustrated in Figure 3.11. For simplicity, these sub-states are simply referred to as the graphs of state Sn.

Figure 3.11 – The basic graphs for the two-input networks Γ(2, ηm, ηo). The upper index specifies the number of degrees of freedom in the middle layer, i.e. S^1_1 is the graph for one learnt input node and one shared middle layer node.

The main steps in this computation are the following:

1. Compute the number of active configurations C for each state Sn in terms of the graphs S^d_n.

2. Equal graph accessibility approximation: assume that in a given state Sn each of the graphs S^d_n is equally accessible. This is strictly the case for S(0), where the distribution of graphs only depends on the relative proportion of active configurations C implementing each graph. However, it is not necessarily the case for transitions from another state Sn or from the same state Sn, where the graph-to-graph transitions may favour some particular graphs.

3. For each graph S^d_n compute the probability of transition to Sn−1, Sn, Sn+1 and Sn+2, corresponding to unlearning, no change, learning one and learning two mappings, respectively.

³ The case where the input nodes can share output nodes may be easily derived by simplification of the present results.


When starting the network from random weights, the active configuration distribution for each state Sn does not correspond to the relative frequency count of its graphs, as shown in Table 3.2.

                                   S^1_0    S^1_1    S^1_2    S^2_0    S^2_1    S^2_2
Configuration count                6        6        0        54       36       6
Configuration count distribution   0.0556   0.0556   0        0.5000   0.3333   0.0556
Random graph distribution          0.1666   0.1666   0        0.3749   0.2502   0.0417

Table 3.2 – The configuration count distribution and the initial graph distribution when starting from random weights do not match. The values above are for the network Γ(2, 3, 4). The configuration count distribution is obtained by counting the relative frequency of the configurations implementing a given graph. For example, Γ(2, 3, 4) has six possible configurations for graph S^1_0 out of 108 distinct configurations, which gives a relative frequency of 0.0556 for graph S^1_0.

The correct distribution is obtained by first counting the configurations generated from non-shared middle layer nodes and one of the shared middle layer nodes, and then multiplying by ηo for each additional shared middle layer node. This strategy is justified by considering all the combinations of triplets (i, m, o) and enforcing the WTA rule on the second and third values of the triplet. The total number of combinations is (ηm ηo)^ηi and the WTA rule assigns them to the corresponding graphs S^d_n. The resulting graph distribution is the correct one, as shown in Table 3.3. This approach has also been verified successfully for the network Γ(3, ηm, ηo).

                               S^1_0    S^1_1    S^1_2    S^2_0    S^2_1    S^2_2
Random graph distribution      0.1666   0.1666   0        0.3749   0.2502   0.0417
Predicted graph distribution   0.1667   0.1667   0        0.3750   0.2500   0.0417

Table 3.3 – The initial graph distribution when starting from random weights matches the predicted graph distribution, for the network Γ(2, 3, 4). The predicted graph distribution is detailed in the text.

For the network Γ(2, ηm, ηo) one obtains the following number of configurations per graph:

\[
\begin{aligned}
|S^1_0| &= \eta_m\, \eta_o\, (\eta_o - 2) \\
|S^1_1| &= \eta_i\, \eta_m\, \eta_o \\
|S^2_0| &= \eta_m\, (\eta_m - 1)\, (\eta_o - 1)^2 \\
|S^2_1| &= \eta_i\, \eta_m\, (\eta_m - 1)\, (\eta_o - 1) \\
|S^2_2| &= \eta_m\, (\eta_m - 1)
\end{aligned}
\]

where |S^d_n| is the order of the graph S^d_n, i.e. the number of distinct configurations that implement the graph.
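These closed forms can be checked by brute force. The sketch below enumerates all (ηm ηo)² combinations for Γ(2, ηm, ηo) under the identity target mapping, with the symmetry-preserving, illustrative convention that the first output value wins the WTA on a shared middle node:

    from itertools import product
    from collections import Counter

    def graph_counts(nm, no):
        """Count raw (m, o) combinations per graph S^d_n for Gamma(2, nm, no);
        target mapping (0-indexed): input 0 -> output 0, input 1 -> output 1."""
        counts = Counter()
        for m1, o1, m2, o2 in product(range(nm), range(no), range(nm), range(no)):
            if m1 == m2:          # shared middle node: WTA forces a single output (take o1)
                d, n = 1, (1 if o1 in (0, 1) else 0)
            else:                 # two distinct middle nodes, one output each
                d, n = 2, int(o1 == 0) + int(o2 == 1)
            counts[(d, n)] += 1
        return counts

    # Gamma(2, 3, 4): {(1,0): 24, (1,1): 24, (2,0): 54, (2,1): 36, (2,2): 6} out of 144,
    # which reproduces the predicted graph distribution of Table 3.3.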


This allows p_rand to be computed when starting from a random configuration:

\[
p_{\mathrm{rand}} = \frac{1}{|S|} \left( |S^1_0| + |S^2_0| \quad\ |S^1_1| + |S^2_1| \quad\ |S^2_2| \right)^t,
\quad\text{where}\quad |S| \equiv |S^1_0| + |S^2_0| + |S^1_1| + |S^2_1| + |S^2_2|.
\]

One final element before computing the transition probabilities is to determine the probability p0 of re-selecting the same node after a depression, and the probability 1 − p0 of selecting a different node. The re-selection probability is given by p0 = 1/(2ηn − 1), a value that can only be explained by the particular weight distribution of this model, as shown in Figure 2.5b. Numerical evidence for this value of p0 is shown in Figure 3.12.
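The experiment behind Figure 3.12 can be reproduced with a toy simulation of the depression dynamics in isolation; the sketch below assumes, as an illustrative choice, that the winning weight is depressed by a uniformly distributed random amount:

    import numpy as np

    def reselection_probability(n_nodes, n_depressions=10_000, seed=0):
        rng = np.random.default_rng(seed)
        w = rng.random(n_nodes)               # arbitrary initial weights
        hits = 0
        for _ in range(n_depressions):
            j = int(np.argmax(w))             # WTA: the strongest weight is active
            w[j] -= rng.random()              # depress the active weight
            hits += int(np.argmax(w) == j)    # same node selected again?
        return hits / n_depressions

    # reselection_probability(10) approaches 1/(2*10 - 1) = 0.0526...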

[Plot: average re-selection probability vs. middle layer size, Γ(2,*,10); measured curve, with 1/(2ηn − 1) and 1/ηn reference curves.]

Figure 3.12 – The average re-selection probability in the middle layer nodes, for the networks Γ(2, ηm, 10). The measured probability (in blue) follows the curve 1/(2ηn − 1) (in green). The curve 1/ηn is shown for comparison (in red). Data collected for 1e4 depressions in each network.

The element a01 of the transition matrix A for Γ(2, ηm, ηo) can be obtained as follows:

\[
a_{01} = \Pr(S_0|S_1) = \Pr(S^1_0|S^1_1)\Pr(S^1_1|S_1) + \Pr(S^2_0|S^1_1)\Pr(S^1_1|S_1)
\]

Using the equal graph accessibility approximation,

\[
a_{01} \simeq \left( \Pr(S^1_0|S^1_1) + \Pr(S^2_0|S^1_1) \right) \frac{|S^1_1|}{|S^1_1| + |S^2_1|}
\]

which with the graph transition probabilities gives:

\[
a_{01} \simeq \left( m_0 (1-o_0) \frac{\eta_o - 2}{\eta_o - 1} + (1-m_0)(1-o_0)\frac{\eta_o - 1}{\eta_o} \right) \frac{|S^1_1|}{|S^1_1| + |S^2_1|}
\]

where m0 and o0 represent the re-selection probabilities of middle and output nodes, respectively.

The main elements of the last computation are detailed below, where it is assumed that the network is learning to connect input two to output two while learning the identity mapping M = {1, 2} (the other possible mapping set requires an inversion of the roles).


∙ |S^1_1| / (|S^1_1| + |S^2_1|) is the equal graph accessibility approximation for graphs S^1_1 and S^2_1 in state S1.

∙ m0(1−o0)(ηo−2)/(ηo−1) is the transition probability from S^1_1 to S^1_0. This transition occurs whenever the same middle layer node is re-selected and the output node is not re-selected. There are ηo−2 out of ηo−1 such cases that enable the transition.

∙ (1−m0)(1−o0)(ηo−1)/ηo is the transition probability from S^1_1 to S^2_0, occurring whenever the same middle layer node is not re-selected and the output node is not re-selected; there are ηo−1 out of ηo cases that enable the transition.

Figure 3.13 – The graph transitions from S^1_1 to S^1_0 and S^2_0. The target mapping is M = {1, 2}.

3.2.4 Alternate formulation: graph transitions

There is an alternative to relying on the equal graph accessibility approximation. By considering the transitions between graphs S^d_n directly, rather than between states Sn, the need to compute the graph occupation within states is avoided altogether.

The resulting transition matrix has more elements, one column for each possible graph, but the predictions are potentially more accurate.

The elements of this new state transition matrix Ag and of the new initial probabilities vector pg are as follows:

\[
A^g = \begin{pmatrix}
a_{1_0 1_0} & a_{1_0 2_0} & a_{1_0 1_1} & 0 & 0 \\
a_{2_0 1_0} & a_{2_0 2_0} & a_{2_0 1_1} & 0 & 0 \\
a_{1_1 1_0} & a_{1_1 2_0} & a_{1_1 1_1} & a_{1_1 2_1} & 0 \\
a_{2_1 1_0} & a_{2_1 2_0} & a_{2_1 1_1} & a_{2_1 2_1} & 0 \\
a_{2_2 1_0} & 0 & a_{2_2 1_1} & a_{2_2 2_1} & 1
\end{pmatrix}
\quad\text{and}\quad p^g = (p_{1_0}\ p_{2_0}\ p_{1_1}\ p_{2_1}\ p_{2_2})^t,
\]

where a_{2_2 2_0} = a_{1_0 2_1} = a_{2_0 2_1} = 0, since no transitions are possible between those graphs.

The element a_{y_x d_n} represents Pr(S^y_x | S^d_n), the transition probability from graph S^d_n to graph S^y_x.


Figure 3.14 – The graph transitions for a network with two input nodes. The arrows indicate the direction of the transitions. No transition is possible from graph S^2_2 to any of the other graphs.

The analytical results previously obtained are directly applicable to compute the elements of the transition matrix Ag. For example, the element a01 of A yields the elements a_{1_0 1_1} and a_{2_0 1_1} of Ag, as follows:

\[
\begin{aligned}
a_{1_0 1_1} &= \Pr(S^1_0|S^1_1) = m_0 (1-o_0) \frac{\eta_o-2}{\eta_o-1} \\
a_{2_0 1_1} &= \Pr(S^2_0|S^1_1) = (1-m_0)(1-o_0) \frac{\eta_o-1}{\eta_o}
\end{aligned}
\]

These two elements of Ag correspond to the two graph transitions in Figure 3.13.

Computing p(ρ) and ⟨ρ⟩ proceeds according to Eqs. (3.6) and (3.7) respectively. However, Eq. (3.8) is no longer applicable for computing ⟨ν⟩, as the elements of Ag corresponding to interference events are no longer on the upper diagonal of matrix Ag.

Nevertheless, a similar computation is still possible:

\[
\langle\nu\rangle = \sum_{j\,:\,\text{column of } N}\ \ \sum_{\substack{i\,:\,\text{interference} \\ \text{graph of } j}} a^g_{ij}\, n_{ij}\, p_j \tag{3.10}
\]

where a^g_{ij} is an element of Ag corresponding to an interference transition from graph j to graph i, p̄^g is the initial state vector excluding the absorbing graph, and p_j is the element j of p̄^g.
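One way to organise this computation uniformly for A and Ag is to flag the interference transitions in a boolean mask and accumulate their probabilities over the expected state visits. For A this reduces exactly to Eq. (3.8), and it is one natural reading of Eq. (3.10); the sketch below is illustrative, not the project code:

    import numpy as np

    def expected_interference(A, p, interference):
        """A: column-stochastic, absorbing state last; interference[i, j] flags j -> i."""
        T = A[:-1, :-1]
        N = np.linalg.inv(np.eye(T.shape[0]) - T)               # fundamental matrix
        visits = N @ p[:-1]                                     # expected visits per transient state
        leave = (A[:, :-1] * interference[:, :-1]).sum(axis=0)  # interference prob. out of each state
        return float(leave @ visits)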

3.2.5 Analytical solution: numerical evidence

To test the validity of the analytical solution, a methodology similar to that of Sub-section 3.2.2 for testing the Markov chain representation was used. In the present case, the network test set was limited to networks with two input nodes, 10 output nodes and a range of middle layer sizes.

The analytical expressions for A and Ag were used to obtain p(ρ)pred, ⟨ρ⟩pred and ⟨ν⟩pred from Eqs. (3.6), (3.7) and (3.8) or (3.10), respectively, and the predicted values were then compared to the ones obtained from the simulations.


The slow mapping change mode was used this time, instead of the uniform initial state probability, as the number of states is quite small and the slow change mode is able to visit them a sufficient number of times. The resulting learning time distributions p(ρ) are more representative of the typical network dynamics.

(a) [Log-log plot of p(ρ) vs. ρ, runs: 1e+6, slow; model vs. predictions from A and Ag.] Super-critical network Γ(2, 10, 10).

(b) [Log-log plot of p(ρ) vs. ρ, runs: 1e+6, slow; model vs. predictions from A and Ag.] Sub-critical network Γ(2, 100, 10).

Figure 3.15 – The learning time distribution p(ρ) in the slow change mode. The green and red lines are the predictions from Eq. (3.6) using A and Ag respectively.

The predicted learning time distributions p(ρ)pred are quite similar to the measured ones. This can be appreciated from Figure 3.15, where the predictions are compared in the super-critical and the sub-critical regimes. As with the predictions from the Markov chain validation testing, the super-critical regime distributions are less accurate than the sub-critical distributions.

[Plot: Kolmogorov-Smirnov statistic D vs. middle layer size, Γ(2,*,10), runs: 1e+6, slow; predictions from A and Ag.]

Figure 3.16 – The Kolmogorov-Smirnov statistic for the learning time distribution p(ρ) in the slow change mode. The green and red lines are the statistic values obtained from Eq. (3.6) using A and Ag respectively.

The above is confirmed by the Kolmogorov-Smirnov D statistic, which is higher for the super-critical and critical regimes, as shown in Figure 3.16.

The statistic D decreases with increasing middle layer size and is comparable to the maximum likelihood estimations for the transition matrix obtained in Sub-section 3.2.2 for the critical and sub-critical regimes (compare Figure 3.16 with the D value for network size two from Figure 3.10).

Interestingly, the predictions obtained from Ag do not perform better than the ones from A, except in the sub-critical regime, where the D statistic is lower for the p(ρ)pred obtained from Ag. This may be particular to the two-input networks, which are somewhat singled out from the others in Figure 3.10.

In terms of ⟨ρ⟩pred and ⟨ν⟩pred, the predictions obtained from A are also more accurate than the ones obtained from Ag, as shown in Table 3.4 and Figure 3.17.

[Left plot: average learning time ⟨ρ⟩ vs. middle layer size, Γ(2,*,10), runs: 1e+6, slow. Right plot: average interference events ⟨ν⟩ vs. middle layer size, runs: 1e+6, slow. Each panel compares the model with the predictions from A and Ag.]

Figure 3.17 – Comparing ⟨ρ⟩pred and ⟨ν⟩pred to the measured values, in the slow change mode. The green and red lines are the predictions from Eqs. (3.7) and (3.10) for A and Ag respectively.

The improvement in the Kolmogorov-Smirnov D statistic for Ag in the sub-critical regime was not reflected in the NRMSE, where the predictions from A are consistently more accurate.

NRMSE ⟨ρ⟩pred
       super-critical    critical      sub-critical
A      0.0839 (0.0660)   0.0577 (0)    0.0149 (0.0203)
Ag     0.1042 (0.0758)   0.0817 (0)    0.0264 (0.0332)

NRMSE ⟨ν⟩pred
       super-critical    critical      sub-critical
A      0.1323 (0.1066)   0.1339 (0)    0.0452 (0.0616)
Ag     0.1704 (0.1269)   0.2000 (0)    0.0875 (0.1089)

Table 3.4 – The normalised root mean square error (NRMSE) for the predicted ⟨ρ⟩pred and ⟨ν⟩pred in the super-critical, critical and sub-critical regimes. The values in brackets are the standard deviations of the NRMSE.

3.2.6 Summary

The key elements of this section are the following:


∙ Markov chain representation: The Chialvo-Bak network in the two-layer topology is fairly accurately represented by a first-order Markov chain, where the chain states correspond to the number of learnt maps.

∙ Markov chain statistics: The state transition matrix A, obtained either analytically or by maximum likelihood estimation, allows p(ρ)pred, ⟨ρ⟩pred and ⟨ν⟩pred to be computed easily.

∙ Numerical evidence for the Markov chain: The Markov chain representation is accurate in all regimes (super-critical, critical and sub-critical) for all but very small networks.

∙ Analytical solution for A for Γ(2, ηm, ηo): An approximate analytical solution was obtained for the network with two input nodes.

∙ Analytical solution for Ag for Γ(2, ηm, ηo): An analytical solution was obtained for the network with two input nodes when considering the transitions between graphs S^d_n rather than between states. As such, this solution is in principle exact.

∙ Numerical evidence for the analytical solution: The performance of the analytical solutions is comparable to a maximum likelihood estimation of the transition matrix A in the critical and sub-critical regimes. The predictions from the approximate solution had better performance than the non-approximate solution.

3.2.7 Appendix

Analytical Solution with A for Γ(2, ηm, ηo)

The solution for the full transition matrix A for Γ(2, ηm, ηo) is as follows:

\[
\begin{aligned}
a_{00} \simeq{}& \frac{|S^1_0|}{|S^1_0|+|S^2_0|}\left(\Pr(S^1_0|S^1_0)+\Pr(S^2_0|S^1_0)\right)
 + \frac{|S^2_0|}{|S^1_0|+|S^2_0|}\left(\Pr(S^1_0|S^2_0)+\Pr(S^2_0|S^2_0)\right) \\
\simeq{}& \frac{|S^1_0|}{|S^1_0|+|S^2_0|}\left(m_0\left(o_0+(1-o_0)\frac{\eta_o-3}{\eta_o-1}\right)
 + (1-m_0)\frac{\eta_o-1}{\eta_o}\left(o_0+(1-o_0)\frac{\eta_o-2}{\eta_o-1}\right)\right) \\
&+ \frac{|S^2_0|}{|S^1_0|+|S^2_0|}\, m_0\left(o_0+(1-o_0)\frac{\eta_o-2}{\eta_o-1}\right) \\
&+ \frac{|S^2_0|}{|S^1_0|+|S^2_0|}\left((1-m_0)\frac{\eta_m-2}{\eta_m-1}\,\frac{\eta_o-1}{\eta_o}
 + (1-m_0)\frac{1}{\eta_m-1}\,\frac{\eta_o-2}{\eta_o-1}\right)
\end{aligned}
\]

\[
\begin{aligned}
a_{10} \simeq{}& \frac{|S^1_0|}{|S^1_0|+|S^2_0|}\left(\Pr(S^1_1|S^1_0)+\Pr(S^2_1|S^1_0)\right)
 + \frac{|S^2_0|}{|S^1_0|+|S^2_0|}\left(\Pr(S^1_1|S^2_0)+\Pr(S^2_1|S^2_0)\right) \\
\simeq{}& \frac{|S^1_0|}{|S^1_0|+|S^2_0|}\, m_0(1-o_0)\frac{2}{\eta_o-1} \\
&+ \frac{|S^1_0|}{|S^1_0|+|S^2_0|}\,(1-m_0)\left(\frac{1}{\eta_o}\left(o_0+(1-o_0)\frac{\eta_o-2}{\eta_o-1}\right)
 + \frac{\eta_o-1}{\eta_o}(1-o_0)\frac{1}{\eta_o-1}\right) \\
&+ \frac{|S^2_0|}{|S^1_0|+|S^2_0|}\left(m_0(1-o_0)\frac{1}{\eta_o-1}
 + (1-m_0)\frac{\eta_m-2}{\eta_m-1}\,\frac{1}{\eta_o}
 + (1-m_0)\frac{1}{\eta_m-1}\,\frac{1}{\eta_o-1}\right)
\end{aligned}
\]

\[
a_{20} \simeq \frac{|S^1_0|}{|S^1_0|+|S^2_0|}\,\Pr(S^2_2|S^1_0)
 \simeq \frac{|S^1_0|}{|S^1_0|+|S^2_0|}\,(1-m_0)\frac{1}{\eta_o}(1-o_0)\frac{1}{\eta_o-1}
\]

\[
a_{01} \simeq \frac{|S^1_1|}{|S^1_1|+|S^2_1|}\left(\Pr(S^1_0|S^1_1)+\Pr(S^2_0|S^1_1)\right)
 \simeq \frac{|S^1_1|}{|S^1_1|+|S^2_1|}\left(m_0(1-o_0)\frac{\eta_o-2}{\eta_o-1}
 + (1-m_0)(1-o_0)\frac{\eta_o-1}{\eta_o}\right)
\]

\[
\begin{aligned}
a_{11} \simeq{}& \frac{|S^1_1|}{|S^1_1|+|S^2_1|}\left(\Pr(S^1_1|S^1_1)+\Pr(S^2_1|S^1_1)\right)
 + \frac{|S^2_1|}{|S^1_1|+|S^2_1|}\left(\Pr(S^1_1|S^2_1)+\Pr(S^2_1|S^2_1)\right) \\
\simeq{}& \frac{|S^1_1|}{|S^1_1|+|S^2_1|}\left(m_0\left(o_0+(1-o_0)\frac{1}{\eta_o-1}\right)
 + (1-m_0)\left(\frac{1}{\eta_o}(1-o_0)+\frac{\eta_o-1}{\eta_o}o_0\right)\right) \\
&+ \frac{|S^2_1|}{|S^1_1|+|S^2_1|}\, m_0\left(o_0+(1-o_0)\frac{\eta_o-2}{\eta_o-1}\right) \\
&+ \frac{|S^2_1|}{|S^1_1|+|S^2_1|}\left((1-m_0)\frac{\eta_m-2}{\eta_m-1}\,\frac{\eta_o-1}{\eta_o}
 + (1-m_0)\frac{1}{\eta_m-1}\right)
\end{aligned}
\]

\[
\begin{aligned}
a_{21} \simeq{}& \frac{|S^1_1|}{|S^1_1|+|S^2_1|}\,\Pr(S^2_2|S^1_1)
 + \frac{|S^2_1|}{|S^1_1|+|S^2_1|}\,\Pr(S^2_2|S^2_1) \\
\simeq{}& \frac{|S^1_1|}{|S^1_1|+|S^2_1|}\,(1-m_0)\frac{1}{\eta_o}o_0
 + \frac{|S^2_1|}{|S^1_1|+|S^2_1|}\left(m_0(1-o_0)\frac{1}{\eta_o-1}
 + (1-m_0)\frac{\eta_m-2}{\eta_m-1}\,\frac{1}{\eta_o}\right)
\end{aligned}
\]

\[
a_{02} = 0, \qquad a_{12} = 0, \qquad a_{22} = 1,
\]

where m0 and o0 represent the re-selection probabilities of middle and output nodes, respectively.

Analytical Solution with Ag for Γ(2, ηm, ηo)

The solution for the full transition matrix Ag for Γ(2, ηm, ηo) is as follows, where m0 and o0 represent the re-selection probabilities of middle and output nodes, respectively:

\[
\begin{aligned}
a_{1_0 1_0} &= \Pr(S^1_0|S^1_0) = m_0\left(o_0+(1-o_0)\frac{\eta_o-3}{\eta_o-1}\right) \\
a_{2_0 1_0} &= \Pr(S^2_0|S^1_0) = (1-m_0)\frac{\eta_o-1}{\eta_o}\left(o_0+(1-o_0)\frac{\eta_o-2}{\eta_o-1}\right) \\
a_{1_1 1_0} &= \Pr(S^1_1|S^1_0) = m_0(1-o_0)\frac{2}{\eta_o-1} \\
a_{2_1 1_0} &= \Pr(S^2_1|S^1_0) = (1-m_0)\left(\frac{1}{\eta_o}\left(o_0+(1-o_0)\frac{\eta_o-2}{\eta_o-1}\right)
 + \frac{\eta_o-1}{\eta_o}(1-o_0)\frac{1}{\eta_o-1}\right) \\
a_{2_2 1_0} &= \Pr(S^2_2|S^1_0) = (1-m_0)\frac{1}{\eta_o}(1-o_0)\frac{1}{\eta_o-1}
\end{aligned}
\]

\[
\begin{aligned}
a_{1_0 2_0} &= \Pr(S^1_0|S^2_0) = (1-m_0)\frac{1}{\eta_m-1}\,\frac{\eta_o-2}{\eta_o-1} \\
a_{2_0 2_0} &= \Pr(S^2_0|S^2_0) = m_0\left(o_0+(1-o_0)\frac{\eta_o-2}{\eta_o-1}\right)
 + (1-m_0)\frac{\eta_m-2}{\eta_m-1}\,\frac{\eta_o-1}{\eta_o} \\
a_{1_1 2_0} &= \Pr(S^1_1|S^2_0) = (1-m_0)\frac{1}{\eta_m-1}\,\frac{1}{\eta_o-1} \\
a_{2_1 2_0} &= \Pr(S^2_1|S^2_0) = m_0(1-o_0)\frac{1}{\eta_o-1}
 + (1-m_0)\frac{\eta_m-2}{\eta_m-1}\,\frac{1}{\eta_o} \\
a_{2_2 2_0} &= \Pr(S^2_2|S^2_0) = 0
\end{aligned}
\]

\[
\begin{aligned}
a_{1_0 1_1} &= \Pr(S^1_0|S^1_1) = m_0(1-o_0)\frac{\eta_o-2}{\eta_o-1} \\
a_{2_0 1_1} &= \Pr(S^2_0|S^1_1) = (1-m_0)(1-o_0)\frac{\eta_o-1}{\eta_o} \\
a_{1_1 1_1} &= \Pr(S^1_1|S^1_1) = m_0\left(o_0+(1-o_0)\frac{1}{\eta_o-1}\right) \\
a_{2_1 1_1} &= \Pr(S^2_1|S^1_1) = (1-m_0)\left(\frac{1}{\eta_o}(1-o_0)+\frac{\eta_o-1}{\eta_o}o_0\right) \\
a_{2_2 1_1} &= \Pr(S^2_2|S^1_1) = (1-m_0)\frac{1}{\eta_o}o_0
\end{aligned}
\]

\[
\begin{aligned}
a_{1_0 2_1} &= \Pr(S^1_0|S^2_1) = 0 \\
a_{2_0 2_1} &= \Pr(S^2_0|S^2_1) = 0 \\
a_{1_1 2_1} &= \Pr(S^1_1|S^2_1) = (1-m_0)\frac{1}{\eta_m-1} \\
a_{2_1 2_1} &= \Pr(S^2_1|S^2_1) = m_0\left(o_0+(1-o_0)\frac{\eta_o-2}{\eta_o-1}\right)
 + (1-m_0)\frac{\eta_m-2}{\eta_m-1}\,\frac{\eta_o-1}{\eta_o} \\
a_{2_2 2_1} &= \Pr(S^2_2|S^2_1) = m_0(1-o_0)\frac{1}{\eta_o-1}
 + (1-m_0)\frac{\eta_m-2}{\eta_m-1}\,\frac{1}{\eta_o}
\end{aligned}
\]

\[
a_{1_0 2_2} = 0, \quad a_{2_0 2_2} = 0, \quad a_{1_1 2_2} = 0, \quad a_{2_1 2_2} = 0, \quad a_{2_2 2_2} = 1.
\]

3.3 Learning Convergence

In all but very small networks, the two-layer topology of the Chialvo-Bak model seems well represented by a first-order absorbing Markov chain. This is supported by the numerical testing results of the Markov chain representation.

An important property of absorbing Markov chains is the guaranteed convergence⁴ to the absorbing states. Therefore, as long as the absorbing Markov chain representation holds, the corresponding network can be expected to converge to the complete learning state.

More specifically, for a given learning rule, the network is expected to converge to the complete learning state if, under that learning rule:

⁴ This property is based on a probability conservation argument. For details see [13], for example.


∙ The Markov property is preserved, i.e.

\[
\Pr(S(t+1) \,|\, S(t), S(t-1), \cdots, S(0)) = \Pr(S(t+1) \,|\, S(t))
\]

∙ From every possible state Sn there is a path to the absorbing state Sηi, which does not need to be a direct path.

A necessary condition seems to be a requirement to have separate random depressions in the middle and output layers. Indeed, simulations showed that for the network Γ(2, 2, 2) complete learning is not guaranteed if the middle and output layers are depressed by the same random amount. This condition seems to be needed to maintain the Markov property, but further investigation is necessary to verify this.

For any state Sn to be able to reach the absorbing state Sηi it is sufficient that at least one path to the absorbing state exists. Such a path can be constructed if, for every state Sn, a transition is possible to a higher learning state Sn+1. This translates into a lower bound on the number of middle layer nodes. For the system to be in the learning state Sn, a total of n middle layer nodes are used to support this state. Moving to the state Sn+1 requires one additional free middle layer node, and so on.

For the simple learning rule, the number of middle layer nodes should therefore be as large as the number of input nodes, i.e. ηm ≥ ηi.

For the advanced learning rule, the number of middle layer nodes should be as large as the number of different input configurations, i.e. ηm ≥ |I|, where |I| represents the order of the input configurations I.

Proving the above assertions would be very interesting, as it would establish the learning capability of the advanced learning rule, which is the ability to learn arbitrary binary input-to-output maps and, in particular, any Boolean function.

3.3.1 Summary

The key elements of this section are the following:

∙ Guaranteed convergence of the Markov chain: An absorbing Markov chain is guaranteed to converge to the absorbing state. Should the Markov chain representation hold for the simple and advanced learning modes, the Chialvo-Bak model can be expected to converge to the complete learning state.

∙ Separate random depressions: A requirement for convergence is to have separate random depressions in the middle and output layers. This seems to be related to maintaining the Markov property.

∙ Lower bound on the number of middle layer nodes: Another requirement for learning convergence is to have sufficient middle layer nodes. For the simple learning rule ηm ≥ ηi, and for the advanced learning rule ηm ≥ |I|, where |I| represents the order of the input configurations I.


3.4 Power-Law Behaviour and Neural Avalanches

The power law tail⁵ in the learning time distribution p(ρ) of the critical regime is also dependent on the mapping change mode, i.e. the way new mapping sets are presented to the network.

The power law tails of the distributions in Figure 2.8a were generated under the slow change mode, where one single mapping is changed each time. It turns out that the slow change mode seems to be the only mapping presentation mode that results in power law behaviour. If the network is provided with random mapping sets, the power law disappears, as illustrated in Figure 3.18a.

(a) [Log-log plot of p(ρ) vs. ρ, Γ(8,72,9), runs: 1e+6; slow vs. random mapping changes.] Learning under slow change mode and random change mode. The blue line represents learning with one mapping change each time, whereas the green line represents learning random mapping sets.

(b) [Log-log plot of p(ρ) vs. ρ, Γ(8,72,9), synthetic data; starting states S1, S4, S6, S7.] Learning under ηi ≥ n ≥ 1 mapping changes, i.e. the slow change mode corresponds to n = 1. Learning under n mapping changes corresponds to starting the system in state Sηi−n. The plot was generated from synthetic data obtained from the AMLE.

Figure 3.18 – The power-law tail of the distribution p(ρ) disappears when presenting mappings to the network in ways other than the slow change mode.

In the Markov chain representation, the slow change mode is equivalent to starting the network in state S(0) = Sηi−1. As such, learning under ηi ≥ n ≥ 1 mapping changes amounts to starting the network in the state S(0) = Sηi−n. Again, only the slow change mode generated a power law tail, as illustrated in Figure 3.18b.
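The synthetic data of Figure 3.18b can be generated by running the estimated chain directly; a minimal sketch, assuming a column-stochastic AMLE with the absorbing state last:

    import numpy as np

    def sample_learning_time(A, start, rng=None):
        """Steps until absorption when starting in state `start` (last state absorbing)."""
        rng = rng or np.random.default_rng()
        k, s, steps = A.shape[0], start, 0
        while s != k - 1:
            s = rng.choice(k, p=A[:, s])   # columns of A hold Pr(S_m | S_n)
            steps += 1
        return steps

    # learning under n mapping changes: start = (k - 1) - n; slow change mode is n = 1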

An interesting characterisation of the critical regime is obtained by analysing the state occupation matrix Q that is obtained from the transition counts after running a simulation. Normalising the transition counts globally gives the frequency of each transition, and summing over the rows gives the state frequency. Below is an example for the network Γ(6, 42, 7) after 1e5 runs in the slow change mode.

⁵ As mentioned in Sub-section 2.1.3, the approximately straight segments in the distributions of Figure 2.8b do not necessarily imply that p(ρ) is a proper power law tail distribution. Due to timing constraints no conclusive power law testing was completed for p(ρ); as such, the terminology proposed in [20] is adopted here.


\[
Q = \begin{pmatrix}
0.0063 & 0.0011 & 0      & 0      & 0      & 0 \\
0.0011 & 0.0345 & 0.0059 & 0      & 0      & 0 \\
0.0000 & 0.0059 & 0.0950 & 0.0154 & 0      & 0 \\
0      & 0.0000 & 0.0153 & 0.1704 & 0.0262 & 0 \\
0      & 0      & 0.0001 & 0.0261 & 0.2255 & 0.0328 \\
0      & 0      & 0      & 0.0001 & 0.0327 & 0.2683 \\
0      & 0      & 0      & 0      & 0.0000 & 0.0373
\end{pmatrix}
\]

Summing over the columns of Q gives the relative occupation of each state. For the network Γ(6, 42, 7) the relative occupation of each state is shown in Table 3.5.
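A sketch of this bookkeeping, assuming counts[m, n] accumulates the observed transitions Sn ⇒ Sm over a simulation:

    import numpy as np

    def state_occupation(counts):
        Q = counts / counts.sum()   # global normalisation: frequency of each transition
        return Q, Q.sum(axis=0)     # column sums: relative occupation of each transient state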

Analysing the columns of Q for larger networks yields the following picture:

∙ In the super-critical regime the network spends most of the time in the states neighbouring Sηi/2.

∙ In the critical regime the network spends most of the time in the states Sηi/2, ⋯, Sηi−1.

∙ In the sub-critical regime the network spends most of the time in the last two states Sηi−1 and Sηi−2.

Relative State Occupation

                 S0       S1       S2       S3       S4       S5
super-critical   0.0648   0.1971   0.2800   0.2437   0.1449   0.0695
critical         0.0075   0.0416   0.1163   0.2119   0.2844   0.3384
sub-critical     0.0004   0.0038   0.0218   0.0849   0.2431   0.6460

Table 3.5 – The relative state occupation of the network Γ(6, 42, 7) after 1e5 runs in the slow change mode. In the super-critical regime the network spends most time in the middle states. In the critical regime the occupation spans from the middle state up to the state before absorption. In the sub-critical regime the network spends most of the time in the last two states before absorption.

3.4.1 Biological interpretation

The slow change mode may have an interesting biological interpretation. Assigning a slow temporal scale to the rate of mapping changes, slow enough that the changes become comparable to the perturbations in the synaptic weights resulting from biological background noise, would allow the slow change mode to be re-interpreted as the network recovering from noise.

Small, punctual perturbations to the synaptic weights might affect the result of the WTA rule for a given node of the active configuration. This is likely to result in the activation of a different output node, which would trigger the corrective depressions. This effect is comparable to re-assigning an output node, especially if the perturbation affects a synapse from an input node to a middle layer node. The ability of the network to recover perfectly from noise has been discussed in [8][1].


The neural avalanches, first evidenced experimentally by J. Beggs and D. Plenz [4][5], correspond to spontaneous neural activity displaying power-law distributions in the event sizes. Identifying both avalanche types with each other is not trivial, as the biological neural networks analysed by Beggs and Plenz were not known to be performing any particular activity during the experiments, whereas the Chialvo-Bak avalanches are due to an ongoing process of synaptic plasticity.

As such, a parallel may be established between the biological neural avalanches and the avalanches in the Chialvo-Bak model in the slow change mode, under a very slow temporal scale of changes.

3.4.2 Summary

The key elements of this section are the following:

∙ Power law tail in the slow change mode only: The power law tail in the learning time distribution p(ρ) only seems to occur in the slow change mode.

∙ Relative state occupation in the slow change mode: In the super-critical regime the network spends most time in the middle states, whereas in the critical regime the occupation spans from the middle state up to the state before absorption. In the sub-critical regime the network spends most of the time in the last two states before absorption.

∙ Parallel with neural avalanches: A parallel with the biological neural avalanches is obtained by re-interpreting the slow change mode, under a very slow temporal scale of changes, as the network recovering from noise.


Chapter 4

Conclusion

The present work focused on extending the analytical understanding of the Chialvo-Bak network in the two-layer topology.

The model seems to be accurately represented by a first-order Markov chain, where the chain states correspond to the number of learnt maps. This representation allows important statistics of the network, such as p(ρ)pred, ⟨ρ⟩pred and ⟨ν⟩pred, to be computed easily. Numerical testing found support for the Markov chain representation in all but very small networks.

A direct estimation of the Markov chain order would further contribute to the understanding of the Markov chain representation and its possible limitations.

An analytical solution was developed for the transition matrix A in the case of a network with two input nodes, Γ(2, ηm, ηo). The performance of the analytical solutions is comparable to the maximum likelihood estimation obtained from simulations of Γ(2, ηm, ηo) in the critical and sub-critical regimes.

Extending the analytical solution to networks with a higher number of inputs would consolidate the Markov chain representation and could lead to a better understanding of the critical regime.

Should the Markov chain representation hold in general, for the simple and advanced learning modes, then the model can be expected to converge to the complete learning state. A convergence requirement is to have separate random depressions in the middle and output layers, which seems to be related to maintaining the Markov property. A lower bound on the number of middle layer nodes is required for maintaining the absorbing property of the Markov chain.

A proper analytical proof of the learning convergence would be useful to establish the learning capability of the model.

The power law tail in the learning time distribution p(ρ) seems limited to the slow change mode. The Markov chain representation provides a characterisation of the critical regime, in which the network spends most of the time between the state where half the mappings are learnt and the state before absorption. Very rarely are all the mappings unlearnt in the critical regime.


A systematic power law testing of the learning time distribution p(ρ) in the critical regime is necessary to establish the power law tail nature of this distribution.

A parallel with the biological neural avalanches is proposed by re-interpreting the slow change mode, under a very slow temporal scale of changes, as the network recovering from noise rather than learning new mappings.

The permanent tagging of successful weights by the selective punishment rule results in reduced performance over large timescales. A global tag threshold, which places a global limit on the number of times a tagged weight can be wrong before becoming untagged, seems successful in eliminating the ageing effect and is independent of the level of network activity.

The local tag threshold was not developed in the present work; however, its local character potentially results in network size independence, which would be a clear advantage over the global tag threshold.


Bibliography

[1] P. Bak and D.R. Chialvo. Adaptive learning by extremal dynamics and negative feedback. Arxiv preprint cond-mat/0009211, 2000.

[2] Per Bak. How Nature Works: The Science of Self-Organized Criticality. Springer-Verlag, 1996.

[3] A. Barr, E.A. Feigenbaum, and P.R. Cohen. The Handbook of Artificial Intelligence. Addison-Wesley, Reading, MA, 1989.

[4] J.M. Beggs and D. Plenz. Neuronal avalanches in neocortical circuits. Journal of Neuroscience, 23(35):11167–11177, 2003.

[5] J.M. Beggs and D. Plenz. Neuronal avalanches are diverse and precise activity patterns that are stable for many hours in cortical slice cultures. Journal of Neuroscience, 24(22):5216–5229, 2004.

[6] P. Billingsley. Statistical Inference for Markov Processes. University of Chicago Press, 1961.

[7] R.J.C. Bosman, W.A. van Leeuwen, and B. Wemmenhove. Combining Hebbian and reinforcement learning in a minibrain model. Neural Networks, 17(1):29–36, 2004.

[8] D.R. Chialvo and P. Bak. Learning from mistakes. Neuroscience, 90(4):1137–1148, 1999.

[9] A. Clauset, C.R. Shalizi, and M.E.J. Newman. Power-law distributions in empirical data. Arxiv preprint, 706, 2007.

[10] F. Crepel, N. Hemart, D. Jaillard, and H. Daniel. Long-term depression in the cerebellum. Handbook of Brain Theory and Neural Networks, 1998.

[11] I. Csiszar and P.C. Shields. The consistency of the BIC Markov order estimator. Annals of Statistics, pages 1601–1619, 2000.

[12] P. Dayan and L.F. Abbott. Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems. MIT Press, 2001.

[13] C.M. Grinstead and J.L. Snell. Introduction to Probability. American Mathematical Society, 1997.

[14] M. Ito. Long-term depression. Annual Review of Neuroscience, 12(1):85–102, 1989.

[15] M.H. Johnson. Developmental Cognitive Neuroscience: An Introduction. Blackwell Publishing, 1997.

[16] K. Klemm, S. Bornholdt, and H.G. Schuster. Beyond Hebb: exclusive-OR and biological learning. Physical Review Letters, 84(13):3013–3016, 2000.

[17] G. Latouche and V. Ramaswami. Introduction to Matrix Analytic Methods in Stochastic Modeling. Society for Industrial Mathematics, 1999.

[18] M. Minsky. Steps toward artificial intelligence. Proceedings of the IRE, 49(1):8–30, 1961.

[19] Y. Peres and P. Shields. Two new Markov order estimators. Arxiv preprint math/0506080, 2005.

[20] J. Wakeling. Order–disorder transition in the Chialvo–Bak "minibrain" controlled by network geometry. Physica A: Statistical Mechanics and its Applications, 325(3-4):561–569, 2003.

[21] J.R. Wakeling. Adaptivity and "Per learning". Physica A: Statistical Mechanics and its Applications, 340(4):766–773, 2004.
