
Liquid State Machine Optimization

Stefan Kok

Utrecht University

December 17, 2007

Supervisors: Dr. Marco A. Wiering, Drs. Leo Pape, Dr. Ignace T.C. Hooge

Abstract

In this thesis several possibilities are investigated for improving the performance of Liquid State Machines. A Liquid State Machine is a relatively new Machine Learning system that is capable of coping with temporal dependencies. Basic Recurrent Neural Networks often have problems with this, one reason being that they take a long time to train. Liquid State Machines train much faster by using a temporal reservoir to map temporal input into a static output pattern. These output patterns can then be learned by a statistical learning method. In this thesis, two different subjects are addressed. The first subject is reducing the computation time needed to calculate the performance. Optimization algorithms for neural networks are often computationally heavy, because the performance of the system, here the Liquid State Machine, needs to be evaluated repeatedly. The computation time can be decreased by using other methods to evaluate the performance. The second subject is the optimization of the temporal reservoir in the Liquid State Machine. Two different algorithms are used here, namely Reinforcement Learning and a Genetic Algorithm. The goal is to find out whether these algorithms can improve the performance by improving the temporal reservoir and, if so, whether the improvement is large enough to be computationally worthwhile. The experiments using a different performance measure showed that it will probably not help in improving the classification performance of the Liquid State Machine. The experiments using the two optimization algorithms showed that Reinforcement Learning cannot find a better setting for the temporal reservoir given the settings of the experiments, but that the Genetic Algorithm is able to improve the temporal reservoir and thus the performance; this was tested on two different datasets. The first dataset is a movement classification task. The results showed an improvement, but compared to another system, Evolino, the Liquid State Machine is outperformed on both classification accuracy and computation time. The second dataset is a music classification task. The results here were more in favor of the Liquid State Machine, although it is unclear whether the parameter setting for Evolino is optimal.

Contents

1 Introduction
  1.1 Introduction to Artificial Intelligence
  1.2 Time Series and Time-dependencies
  1.3 Research Question
  1.4 Relevance to Artificial Intelligence
  1.5 Outline

2 Reservoir Computing
  2.1 Feed-Forward Neural Networks
  2.2 Time-series and Recurrent Neural Networks
  2.3 Long Short-Term Memory
  2.4 Liquid State Machines
    2.4.1 Theory
    2.4.2 Reservoir and Readout

3 Training Paradigms
  3.1 Performance Measurements
    3.1.1 Readout Performance
    3.1.2 Separation
  3.2 Reinforcement Learning
    3.2.1 Introduction
    3.2.2 Using Reinforcement Learning for Optimizing Liquids
  3.3 Evolutionary Computation
    3.3.1 Introduction
    3.3.2 Using Genetic Algorithms for Optimizing Liquids
  3.4 Evolino
    3.4.1 Introduction to Evolino
    3.4.2 Enforced SubPopulations
    3.4.3 Evolino Implemented

4 Data
  4.1 Movement Classification Task
  4.2 Music Classification Task

5 Experiments and Results
  5.1 Setup
    5.1.1 Setup Datasets
    5.1.2 Setup Liquid State Machine
    5.1.3 Setup Evolino
  5.2 Results
    5.2.1 Separation
    5.2.2 Training Paradigms

6 Conclusion
  6.1 Discussion
  6.2 Future Research

A Solution for Calculating Initial Q-values

Chapter 1

    Introduction

    1.1 Introduction to Artificial Intelligence

Artificial Intelligence (AI) is a broad subject. One purpose of AI is to explain phenomena. In brain studies, for example, neurons can be simulated to explain certain types of brain patterns. But probably most of the research in AI is done on finding solutions to problems. These can be all sorts of problems, like ordering data or classifying objects. For most problems there is not one single method to tackle them; often there are different approaches. For example, some approaches use learning algorithms to find a solution, while others use a pre-programmed set of rules. The learning algorithms fall under a subfield of AI called Machine Learning [13]. A number of well-known learning algorithms that fall under Machine Learning are Genetic Algorithms [4], Reinforcement Learning [25] and Neural Networks [13]. These algorithms overlap in the problems they can solve, but solve them in different ways. A problem with learning algorithms is that they are not always transparent. For example, an artificial neural network can basically be seen as a vector of values. This makes it difficult to explain why it behaves the way it does and what these values actually mean. If the neural network does not learn as it should, it is difficult to find the problem.

1.2 Time Series and Time-dependencies

One type of problem that gets a lot of attention in AI is the problem with a temporal aspect. This means that the input data contains a time-dependency. In other words, the input at one timestep does not contain enough information to produce the correct output; the output depends on a number of input patterns. An example of a temporal problem is speech recognition. Here the task is to classify spoken words, where classification means that the output is a written version of the spoken word. The temporal aspect forms a problem for Machine Learning algorithms because the time-dependency is often difficult to learn with commonly used learning algorithms. The system needs some method to store previous input patterns. For example, the words 'bike' and 'like' sound the same except for one letter. If the algorithm classifies using only the input pattern at one timestep, it will not be able to differentiate between the two words when the letter 'i' is spoken. Feed-Forward Neural Networks have trouble dealing with this time-dependency. Because samples differ in length, it is difficult for these networks to keep a complete memory of the input. Moreover, the length of a time-dependency in a sample is often not known in advance, which makes it hard to set the number of inputs of the network to the right size.

Hidden Markov models [19] are statistical methods that are applied to problems with a time-dependency. A Hidden Markov model consists of a set of states. The transitions from one state to another are decided by a probability distribution in that state. The probability distribution for each state is estimated from a large set of examples by calculating the chance of going to another state. Another method that is applied to problems with a time-dependency is the Time-Delay Neural Network [11]. This can be seen as a normal Feed-Forward Neural Network (see section 2.1), but instead of receiving the input of one timestep, the input is buffered for a number of timesteps and given to the network at once.

Another system based on Feed-Forward Neural Networks is the Recurrent Neural Network [5]. This system uses recurrent circuits in the network to cope with the time-dependency. Activation is kept in the network by sending the output of a neuron back into the network. Recurrent Neural Networks come in many forms. One is the continuous Recurrent Neural Network, where neurons use a continuous function to process the input. Another system is based on biological neurons and is called a Spiking Neural Network [26]. Here the neuron uses a threshold function over the input to decide what the output should be. Spiking neurons differ from neurons with a continuous function, because they only send out two types of signals, namely a zero for not spiking, or a one for spiking. Unlike in continuous Recurrent Neural Networks, the information is only partially in the signal itself; most of the information comes from the timing of the spike signals. If only one signal comes into a neuron, it will not fire, but if many signals come into a neuron together, the probability to spike increases. Another difference with continuous Recurrent Neural Networks is that the activation in spiking neurons is partially kept for future timesteps, which gives each neuron an internal memory.

A last example of a Recurrent Neural Network used for time-dependencies is Long Short-Term Memory [8]. This network is based on a continuous Recurrent Neural Network and uses special kinds of neurons called memory cells. These memory cells have an internal activation, like spiking neurons. The memory cells are special because they use gates: one blocks incoming activation from entering the cell, another lets activation within the cell leak away, and the last one lets activation from within the cell out into the network. Continuous Recurrent Neural Networks use some form of Gradient Descent algorithm [13] to train the weights. Spiking Neural Networks are different. Although there is a Gradient Descent algorithm, called SpikeProp [3], there are other forms of learning. One is Hebbian Learning [6]. This learning algorithm uses simple rules to change the weight between two neurons: if the two neurons spike at the same time, the weight increases; if only one spikes, the weight decreases. All the neural networks described above have problems dealing with time-dependencies: the Recurrent Neural Networks all have trouble learning the time-dependencies, and all the neural networks suffer from the problem that it takes long to optimize the weights.

Two relatively new systems that cope with time-dependencies in a different way are Echo State Networks [10] and Liquid State Machines [14]. These two systems are based on the same idea: they use a Recurrent Neural Network to create static patterns from the temporal data, and these static patterns are then learned by a statistical learning algorithm. The difference between the two systems is that Echo State Networks use continuous Recurrent Neural Networks, while Liquid State Machines use Spiking Neural Networks. The advantage is that both systems get around the difficulties of learning in the Recurrent Neural Network; instead, learning is done using a relatively simple algorithm that does not take much time to train.

    1.3 Research Question

Liquid State Machines (LSM) can be computationally demanding. This forms a problem when improving the LSM: to improve the liquid of the LSM, for example, the performance of the LSM is needed, which costs time. To decrease the computation time, other methods to evaluate the performance of a liquid should be investigated. One idea is to use a measure for the separation of a liquid. The separation describes how well a liquid can divide its output patterns given the input patterns. Using the separation property may decrease the computation time, since no readout network needs to be trained, and possibly fewer samples are needed to calculate the separation. One research question is thus:

Is the separation of a liquid a good and fast performance measure for liquid optimization?

One aspect of this question is how the separation should be represented. The second research question in this thesis is about the different algorithms to optimize the liquid. There are a number of methods to optimize the performance of a liquid; comparing them will give a view on which algorithms are capable of optimizing the liquids and how much they improve the performance. This part of the research may also show how complex the search space is in finding optimal solutions. Another comparison to make is whether these algorithms outperform already well-performing systems that can handle time-series problems, like Evolino [23].

How well do several different algorithms perform in optimizing a liquid in a Liquid State Machine?

For testing these research questions, it may be insightful to use a toy problem, where the environment can be better controlled. A toy problem is a problem that is artificially created. A toy problem allows a fair comparison between the different algorithms and performance measures that are used. In the experiments, the toy problem is implemented as a movement classification task. But an algorithm should not only be capable of handling simple toy problems, but also more complex real-world problems. This shows how robust these systems are, as they must handle noise and inconsistencies. Another aspect that is important for AI algorithms is how well they perform compared to humans. Real-world problems often already provide insight into how well humans can cope with the problem. One of these is music classification, a task that humans are capable of solving.

    1.4 Relevance to Artificial Intelligence

Cognitive Artificial Intelligence (CAI) is about creating new systems that show intelligent behavior and studying the mechanisms of human behavior to find new systems that are capable of performing new tasks. This thesis aims at the first part: creating new intelligent systems, or at least improving on them. All the systems inspected here are learning systems. Learning is an important part of being human, as it gives us the experience to cope with the uncertainties in life. Problems that humans have to solve are often problems with a time-dependency. For example, we are able to communicate with each other by behavior (such as getting a red face when we feel ashamed), with sign language, or by using sound (crying, but of course also speech). This is an aspect of AI where a lot of research is done, because it is difficult to find a system that can cope with time-dependencies easily and with low computation time. Liquid State Machines are such a system: they are able to deal with time-dependencies using less computation time than other known systems. The goal is to find out what improvements can be made to the performance of Liquid State Machines.

    1.5 Outline

This thesis concerns Machine Learning systems that can handle time-series data. In chapter 2 an introduction is given to the algorithms that are used in the experiments. The first are Feed-Forward Neural Networks, which are used as readout networks in Liquid State Machines. Next there is an introduction to Recurrent Neural Networks (RNN) and time-series data. After this, Long Short-Term Memory and Liquid State Machines are described. These two systems are then used in chapter 3 for optimization.

In chapter 3 the training paradigms that are used to optimize RNNs are further explained. These training paradigms are used to optimize the recurrent network of the Liquid State Machine and the Long Short-Term Memory in Evolino. In the first section two different performance measures are explained, namely the separation of the liquid and the performance of the readout network. These measures are used to evaluate the performance of an RNN for the learning algorithms. Two algorithms are used to optimize the RNN in the LSM: first a Reinforcement Learning algorithm is introduced, and the second training paradigm for LSMs is a Genetic Algorithm. The last section is about a system called Evolino, which combines a Long Short-Term Memory network with a regression method to learn time-series data.

In chapter 4 the datasets that are used in the experiments are explained. Two types of sets are used. First there is the movement classification set, which is a toy problem: here the algorithms need to classify in which direction a target moves on a grid. The other set is about music classification: here the task is to classify which composer has written a certain musical piece.

Chapter 5 is about the experiments. The first section explains which parameters are used. The second section is about the experiments that are done with the different learning algorithms. These concern the performance of the different training paradigms like Evolino and the Genetic Algorithm, but a number of experiments are also done on the performance measures.

In the last chapter (chapter 6), the results of the experiments of chapter 5 are discussed. Furthermore, some work that can be done in the future is discussed.

Chapter 2

    Reservoir Computing

Solving time-series problems is a major challenge in Machine Learning. Time-series problems are problems with a temporal aspect, which means that information from previous timesteps needs to be stored to recognize a sample correctly. An example of this time-dependency is speech recognition. This becomes a problem when, for example, a spoken word needs to be recognized: there should be a memory that holds the complete word when making a decision about which word it is. If the input at timestep t-1 is not used for the input at timestep t, the network will probably never find the correct word. The problem with time-series is that the neural networks that are used not only need a long-term memory, as in non-temporal problems, but also a short-term memory; otherwise the input at one timestep is independent of the other timesteps. A further problem with Feed-Forward Neural Networks is that time-series are not always of the same length. With speech recognition, for example, words differ in length, but a neural network has a fixed number of inputs, making it hard to define the size of the input layer. Liquid State Machines solve this by using a fading memory, which is then used as input for a readout network. This fading memory is a mapping of temporal data into a static representation, which should be easier to learn for the readout network.

In this chapter the systems that will be used in the chapter about the training paradigms (see chapter 3) are explained. First, Feed-Forward Neural Networks (FNN) are explained. FNNs are often used to solve problems in Artificial Intelligence, and they also give more insight into Recurrent Neural Networks (RNN), as many of these networks are based on FNNs. These RNNs are explained in section 2.2. Not only the system itself is discussed, but also some of its problems. Furthermore, this section discusses what time-series data exactly is. After this, an RNN called Long Short-Term Memory is discussed (LSTM; see section 2.3). LSTM is used further in section 3.4. Finally, Liquid State Machines are explained in more detail (see section 2.4).

    2.1 Feed-Forward Neural Networks

A Feed-Forward Neural Network (FNN; [13]) is a statistical learning method based on some basic principles of biological neurons. Although FNN neurons do not look like biological neurons at all, they share an important characteristic that makes FNNs powerful, namely parallel processing. As can be seen in figure 2.1, an FNN has a number of layers.

Figure 2.1: An FNN with three layers, from bottom to top: an input layer, a hidden layer and an output layer. The biases are on the right.

The earliest FNNs had only two layers [21], the input layer and the output layer. To add more computational power, a hidden layer is used, which means that the network can approximate many more functions. Function approximation is used because in most datasets the function that produces the correct output given the input patterns is not known. Function approximation does what the words say: it approximates the function that can deal with the dataset correctly. In other words, the approximation tries to get as close as possible to the real function that creates the output. The network is built up of nodes (also called neurons); these nodes are connected through weights (real-valued numbers) and the connections are directional. In other words, the input layer provides input for the hidden layer and the hidden layer provides input for the output layer. Both the input layer and the hidden layer also have a bias node, which is an input node with a constant value of one. Without it, every function would output zero when the inputs are zero, which decreases the number of functions the network can approximate. The input for a hidden node is calculated by multiplying the weight of the connection between two nodes with the output of the input node. After this, a function is applied to the summed input of the node. This can be the identity function (which means nothing is changed), but that restricts the solution space to linear functions. A well-known function used in FNNs is the sigmoid function (σ(x), see equation 2.1). This function squashes the input so that the outcome lies between a minimum and a maximum. The function can be seen in figure 2.2. The sigmoid shown in figure 2.2 is called a logistic sigmoid function.

    Figure 2.2: Graph of a Sigmoid function

Its minimum outcome is zero and its maximum outcome is one. There are other sorts of sigmoids, like the hyperbolic tangent function, which has its minimum at minus one instead of zero. The logistic sigmoid function is calculated as follows:

\sigma(x) = \frac{1}{1 + e^{-x}}.    (2.1)

The input x for σ(x) for neuron j is calculated as follows:

x_j = \sum_{i=1}^{n} x_i w_{ji},    (2.2)

where w_{ji} is the weight between node i and node j and x_i is the output of node i. This is done for both the hidden layer and the output layer (and other hidden layers if there are any). Often, however, the transfer function of the output layer is the identity function instead of a sigmoid function. A variety of problems can be solved using FNNs. One type is regression problems, where the function needs to predict an output value; for example, for the stock exchange it would predict the value of a certain stock. But other problems can be solved as well, for example classification problems. Here the output layer has a number of different output neurons, one for each class. The index of the output neuron with the highest output value is then the class that the FNN predicts.
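To illustrate how such a network computes its output, the sketch below (my own illustration, not part of the thesis; the layer sizes, weight scales and random seed are arbitrary assumptions) performs one forward pass: each hidden neuron takes the weighted sum of equation 2.2 and applies the logistic sigmoid of equation 2.1, the output layer uses the identity function, and the predicted class is the index of the highest output.

import numpy as np

def sigmoid(x):
    # Logistic sigmoid (equation 2.1): squashes input to the range (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, w_hidden, w_output):
    """One forward pass through a three-layer FNN.

    x         -- input pattern (1D array)
    w_hidden  -- weights from input (+ bias) to hidden layer
    w_output  -- weights from hidden (+ bias) to output layer
    """
    x = np.append(x, 1.0)              # bias node with constant value one
    hidden = sigmoid(w_hidden @ x)     # weighted sum (eq. 2.2) + sigmoid (eq. 2.1)
    hidden = np.append(hidden, 1.0)    # bias node for the hidden layer
    return w_output @ hidden           # identity transfer on the output layer

# Toy usage: 4 inputs, 3 hidden neurons, 2 output classes.
rng = np.random.default_rng(0)
w_h = rng.normal(scale=0.1, size=(3, 5))   # 4 inputs + bias
w_o = rng.normal(scale=0.1, size=(2, 4))   # 3 hidden + bias
y = forward(rng.random(4), w_h, w_o)
print("predicted class:", int(np.argmax(y)))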

To find the right outputs for the input patterns, given the target values, the weights need to be adapted to find the closest approximation to the target values. One possible method to search for the optimal weights is an algorithm called Gradient Descent [13]. Here the derivative of the error with respect to the network weights is taken. Then, using this derivative, a step towards the steepest decline in the search space is taken to find a better solution. A well-known implementation of the Gradient Descent algorithm is called Back-Propagation ([27], [22]). First the error is calculated. The error function used here is:

E(t, y) = \frac{1}{2} \sum_{j=1}^{n} (t_j - y_j)^2.    (2.3)

Here t_j is the target value (the value that should be output), y_j is the value that the network has output, j is the j-th output neuron and n is the number of output neurons. After this, the partial derivative with respect to output neuron j is calculated for each output neuron. The output layer here is assumed to use the identity function:

\delta_j = t_j - y_j.    (2.4)

Next, the derivatives of the hidden layer are calculated for every hidden neuron, using the derivative of the logistic sigmoid:

\delta_i = y_i (1 - y_i) \sum_{j=1}^{n} w_{ji} \delta_j.    (2.5)

Here y_i is the output of hidden neuron i. After this, the weights are updated using \delta_j for weights between the hidden layer and the output layer and \delta_i for weights between the input layer and the hidden layer:

w_{ji} = w_{ji} + \alpha \delta_j x_i.    (2.6)

Here α is the learning rate. This is needed because otherwise the network only learns to map the input to the output of that particular learning moment; in other words, it would constantly overfit. To solve this, α is set to a small value so that only a small step is taken in the steepest direction of the search space given the input and target values. Equation 2.6 is used for all connections, only with a different δ. FNNs are often used in the form of supervised learning. Supervised learning means that when an FNN predicts something, the target value that should be learned is given. This is in contrast with Reinforcement Learning, where the target value is not given, but based on rewards the FNN gets for predicting the correct or incorrect action.
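As a concrete illustration of equations 2.3 to 2.6, the following sketch performs one online Back-Propagation update for the three-layer network described above. It is my own example rather than the thesis implementation; the variable names and the learning rate value are assumptions.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop_step(x, target, w_hidden, w_output, alpha=0.01):
    """One online Back-Propagation update (equations 2.3-2.6).

    Linear output layer, logistic sigmoid hidden layer; the weight
    matrices include a bias column, learned like any other weight.
    """
    x = np.append(x, 1.0)                      # bias node on the input layer
    h = sigmoid(w_hidden @ x)
    h_b = np.append(h, 1.0)                    # bias node on the hidden layer
    y = w_output @ h_b                         # identity transfer on the output

    delta_out = target - y                     # equation 2.4
    # Equation 2.5: back-propagate through the sigmoid derivative
    # (drop the bias column of w_output, it has no incoming activation).
    delta_hid = h * (1.0 - h) * (w_output[:, :-1].T @ delta_out)

    # Equation 2.6: gradient-descent weight updates.
    w_output += alpha * np.outer(delta_out, h_b)
    w_hidden += alpha * np.outer(delta_hid, x)
    return 0.5 * np.sum(delta_out ** 2)        # equation 2.3, the sample error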

2.2 Time-series and Recurrent Neural Networks

Time-series data is data that is divided into discrete timesteps. This is because a computer splits time up into discrete steps; the data thus has to be adapted to the system that is used. Furthermore, the input is transformed into a numerical format that represents the real-world data. For example, sound waves cannot be used directly as input; they are first transformed into a format that computers can deal with, in this case the frequencies of the sound wave. The time-series data thus consists of a matrix where each column can be seen as the input pattern at one timestep.

Time-series data can form a problem for machine learning algorithms. Normally an input pattern does not have important relations with other input patterns; in other words, one input pattern is not dependent on other patterns. In time-series data, the temporal dependency between patterns can be important. For example, in recognizing sounds, the algorithm should store previous parts of the sound, otherwise it will only classify a sound based on the input at that timestep. To solve this, the algorithm needs a short-term memory. This forms a problem because it is not clear which part of the information should be stored and how long something should be remembered. In a simple problem this might not be too hard to solve, but in most problems, where there is a lot of noise and other information that needs to be filtered out, this reduces the transparency. Learning also forms a problem, because it is not necessarily known what needs to be learned and when this should happen. One solution to this problem is the Liquid State Machine: instead of learning the patterns with the short-term memory reservoir, the Liquid State Machine uses a readout network to learn to associate the states of the reservoir with the desired outputs.

Feed-Forward Neural Networks have problems with time-series, as they are static and do not have any short-term memory. One solution is to buffer the input for a number of timesteps and give this set as one input to the neural network. Such networks are called Time-Delay Neural Networks (TDNN; [11]). This shows exactly the problem of FNNs. First, the number of input neurons can be high (because of the long input patterns), which increases the chance of overfitting, and training takes much longer than normal. Another problem is that, for example with words, one word might be long and another short, which means that some input patterns do not use the complete network, making learning more difficult. Also, the time-dependencies are unknown and can differ in length. A better approach are Recurrent Neural Networks (RNN). These come in a number of forms. One is based on an FNN, where the activation is calculated by a sigmoid function. The difference between FNNs and RNNs is that RNNs have recurrent circuits. These circuits send output back into the network at the next timestep. This means that activation from the last timestep can be taken into account when calculating the next timestep, if there is a recurrent circuit in the network. An advantage in comparison with TDNNs is that the short-term memory of a TDNN only covers a certain time-frame. Input from earlier timesteps could influence the input of later timesteps, but TDNNs do not take this into account; they just cut the time-series into pieces. RNNs do not have this problem, as an input can theoretically be remembered forever, so input from the beginning of the time-series can be used at the end of the time-series. However, a drawback of RNNs is learning. There are learning rules based on the Gradient Descent algorithm; two well-known ones are Real-Time Recurrent Learning (RTRL; e.g. [20]) and Back-Propagation Through Time (BPTT; e.g. [28]). In BPTT and RTRL the error is propagated back into the network over a number of timesteps. However, using Back-Propagation has a drawback: when propagating back further in the network, the contribution of lower-lying connections to the error gets smaller. This makes it difficult to capture the importance of states that lie further back in time, it makes learning hard, and learning takes a lot longer in comparison with Feed-Forward Neural Networks. Generally, states that lie more than ten timesteps back in the past cannot reliably be used. This is also known as the problem of the vanishing gradient ([7]; [1]; [17]; [28]; [12]).

Another implementation of RNNs are Spiking Neural Networks (SNN; [26]). SNN neurons, in comparison with neurons using a continuous function, are more inspired by biological neurons than most RNN neurons, as they share the property to spike. Although biological neurons are far more complex, they have some similarities. This of course partially depends on the implementation of the spiking neuron, as there are many different types of neurons, both in biology and in computer science.

Most neurons in our brain only pass on a signal when the activation has passed a certain threshold, which causes an action potential. Because of this action potential, a signal is sent to the connected neurons. Translated to computational methods, this means that a neuron either sends a spike in a timestep or does not send anything at all. This differs from continuous networks, where there is almost always activation that is passed through. Another similarity between biological neurons and spiking neurons is that they both have an internal activation within the neuron that persists over time. This activation comes from input signals that were received some timesteps ago, and thus gives each neuron a short-term memory. The activation also 'leaks' away, which influences the time-frame from which input should still be used (with a higher decay, activation leaks out faster and the neuron has a shorter memory). If the activation is low, one spike will probably not get the neuron to fire, but if a number of different spikes come into the neuron at the same moment, the chance that the neuron fires increases.

The problem with SNNs is that it is even more difficult to find a good supervised learning method than it is for continuous RNNs. Although algorithms exist, like SpikeProp [3], learning with Back-Propagation methods is a computationally heavy task. Most learning methods in the brain seem to be self-organizing. In self-organization there is no feedback when a wrong value is output; the only data the self-organizing algorithm has is the input. One well-known self-organization method is Hebbian Learning [6]. This learning rule comes in many forms, but they are all based on the same idea: when two neurons are connected and both fire at the same time (or within the same time-frame), the connection between them gets stronger. If the neurons do not fire at the same time, the connection between them is weakened.
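As a simple illustration of such a Hebbian rule (my own sketch; the update amount and the use of a single-timestep time-frame are assumptions), the weight between two connected neurons could be adjusted like this:

def hebbian_update(weight, pre_spiked, post_spiked, amount=0.01):
    """Strengthen the weight when both neurons spike together, weaken it when only one spikes."""
    if pre_spiked and post_spiked:
        return weight + amount
    if pre_spiked or post_spiked:
        return weight - amount
    return weight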

2.3 Long Short-Term Memory

Long Short-Term Memory networks (LSTM; [8]) are used in Evolino, which is described in section 3.4; Evolino optimizes these networks. A Long Short-Term Memory network is a Recurrent Neural Network that uses special neurons to calculate the output. These neurons are called memory cells and have gates that decide whether activation flows through the core and out, or not, or only partially. There are three different types of gates: input gates, forget gates and output gates. The core has a recurrent circuit (in other words, the activation of one timestep back is part of the input for the core at the next timestep). This means that it can possibly remember an input infinitely long. The forget gate can 'leak' the activation out of the cell, thus removing information that is not needed anymore. The input gate is there to protect the cell from input that is not needed: if the gate is closed, no input comes into the cell. The output gate decides when information should be sent on to other memory cells. Figure 2.3 shows the setup of the memory cell. The memory cells are connected in four ways: all output connections of a memory cell go through the output gate, and the input connections (both from input neurons and from other memory cells in the network) are connected through the core and the three different gates. The internal unit holds the state of the cell. The cell state is calculated as follows:

s_i(t) = \mathrm{net}_i(t)\, g_i^{in}(t) + g_i^{forget}(t)\, s_i(t-1).    (2.7)

Here s_i(t) is the state of cell i at timestep t, g_i^{in} is the input gate of cell i, g_i^{forget} is the forget gate of cell i, s_i(t-1) is the state of cell i one timestep back and \mathrm{net}_i(t) is calculated as follows:

\mathrm{net}_i(t) = \sum_{j=1}^{n} w_{ij}^{cell}\, c_j(t-1) + \sum_{k=1}^{o} w_{ik}^{cell}\, u_k(t).    (2.8)

Here c_j is the output of memory cell j, n the number of hidden cells, u_k the output of input cell k, o the number of input cells and w the weight between two cells (where cells j and k are input for cell i). The output c_j of cell j is calculated as follows:

c_j(t) = \tanh(g_j^{out}(t)\, s_j(t)).    (2.9)

Figure 2.3: A memory cell (from [23]). S is the internal state of the cell, GF is the forget gate, GO is the output gate and GI is the input gate. Σ is the input from input cells and hidden cells. The blue nodes represent multiplication functions.

Here g_j^{out} is the output gate of cell j. The value of the gates at time t is calculated as follows:

g_i^{type} = \sigma\left(\sum_{j=1}^{n} w_{ij}^{type}\, c_j(t-1) + \sum_{k=1}^{o} w_{ik}^{type}\, u_k(t)\right),    (2.10)

where σ is the logistic sigmoid function and w^{type} is the weight for either the input, forget or output gate. The strength of this model is that it has both long-term and short-term capabilities. Long-term storage is done by learning the weights between the cells. The short-term memory is created by the architecture of the memory cell and the wiring between the different cells: first there is the recurrent circuit of the core, but the different gates also add dynamics to the memory. Normally a variant of Real-Time Recurrent Learning is used for learning (the implementation can be found in [8]), but this will not be used in Evolino to optimize the weights.
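To make the cell dynamics concrete, the sketch below steps a layer of memory cells through equations 2.7 to 2.10. It is my own illustration, not the thesis code; the class name, the weight initialization and the use of one stacked weight matrix per gate are assumptions.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class MemoryCellLayer:
    """One layer of LSTM memory cells following equations 2.7-2.10.

    n -- number of memory cells, o -- number of input cells.
    Each weight matrix covers [cell-to-cell | input-to-cell] connections.
    """
    def __init__(self, n, o, seed=0):
        rng = np.random.default_rng(seed)
        shape = (n, n + o)
        self.w_cell = rng.normal(scale=0.1, size=shape)
        self.w_in = rng.normal(scale=0.1, size=shape)
        self.w_forget = rng.normal(scale=0.1, size=shape)
        self.w_out = rng.normal(scale=0.1, size=shape)
        self.s = np.zeros(n)   # internal states s_i(t-1)
        self.c = np.zeros(n)   # cell outputs c_j(t-1)

    def step(self, u):
        z = np.concatenate([self.c, u])          # c_j(t-1) and u_k(t)
        net = self.w_cell @ z                    # equation 2.8
        g_in = sigmoid(self.w_in @ z)            # equation 2.10, input gates
        g_forget = sigmoid(self.w_forget @ z)    # forget gates
        g_out = sigmoid(self.w_out @ z)          # output gates
        self.s = net * g_in + g_forget * self.s  # equation 2.7
        self.c = np.tanh(g_out * self.s)         # equation 2.9
        return self.c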

2.4 Liquid State Machines

    2.4.1 Theory

Liquid State Machines (LSM; [14]) are a relatively new concept in machine learning. They solve time-series problems in a completely different way compared with most RNN systems and get around the problems that RNNs often have with learning them. LSMs use a dynamic reservoir or liquid (L^M) to handle time-series data, as can be seen in figure 2.4.

Figure 2.4: A Liquid State Machine (from [14]), where u(·) is the input for the liquid, L^M the liquid filter, x^M(t) the liquid state at time t, f^M the readout network and y(t) the output at time t.

After a certain time period, the state of the liquid x^M(t) is read out and used as input for a readout network f^M (for example an FNN). This readout network learns to map the states of the liquid to the target outputs. This means there is no need to train the weights of the RNN, which decreases the computation time and, more importantly, the complexity of learning time-series data. A liquid can be represented in different forms. It can be a real liquid, where the waves can be seen as a short-term memory: if someone throws a rock into a lake, the waves that this rock creates are the memory of the liquid. In other words, the waves tell that something has happened a short time ago. But a Spiking Neural Network (SNN) can store much more information than a real liquid, because a real liquid in principle only carries information in the three spatial dimensions, while an SNN can have far more neurons than three and is thus able to store more information. SNNs consist of neurons that are more biologically plausible than other RNN neurons. These neurons have an internal activation and only communicate with each other by either not sending anything (the output of a neuron is zero) or sending a spike (the output of a neuron is one). The internal activation leaks out over a time period; this is a short-term memory, and the amount of leakage determines how long it takes to forget an input. An important property of SNNs is that they also carry information in the spike code. This spike code is a sequence of spikes in a certain time-frame. Spike codes can be seen as a sort of Morse code, with pauses (when no spikes are sent) and 'bleeps' (when a spike is sent); the time at which a spike arrives is thus also a form of information. Another feature of the spiking neuron is that it keeps its internal activation, although it leaks away over time, which means that, together with recurrent circuits, the neuron itself also has a short-term memory. LSMs are a good tool for classification problems. In classification problems, the LSM should separate different inputs from each other and classify them. An example is classifying the composer of musical pieces (this is also used in the experiments, see section 4.2 for further details). The musical piece is input for the liquid and the readout network should then classify which composer has composed the piece. Classification by the readout network is also described in section 2.1. For classification, the readout network needs to separate the different states of the liquid. Given that there are enough units in the liquid, it can create different patterns for each time-series pattern (see [14] for further explanation).

Liquid State Machines have a Separation Property (SP) and an Approximation Property (AP) [14]. SP addresses the ability to separate two different input sequences from each other. This is important, because the readout network needs to be able to separate two input patterns to achieve a good performance. If two patterns that should not look alike do look too much alike, the readout network cannot differentiate between them and thus cannot tell which pattern belongs to which class. AP addresses the ability of the readout network to distinguish two different patterns and transform the states of the liquid into the given target output.

    2.4.2 Reservoir and Readout

For implementing the liquid an SNN is used. The neurons are placed in a grid and connected with a probability, as also used in [14]. The only exception is that the initialization for input neurons is different, because they are initialized using a fixed probability. The input layer is a two-dimensional layer which represents the spatial properties of the data. When the input is, for example, a chess board, the dimensions of the input layer are the same as the dimensions of that chess board. The spatial properties are thus kept in the liquid itself: the liquid is built up of layers whose width and height are the same as the input layer, and the depth defines the number of layers in the liquid. The probability of connecting two internal neurons is calculated by first computing the Euclidean distance D(i, j):

D(i, j) = \sqrt{(xcor_i - xcor_j)^2 + (ycor_i - ycor_j)^2 + (zcor_i - zcor_j)^2}.    (2.11)

In this equation i and j are two different neurons, with coordinates {xcor, ycor, zcor}. After this, the probability of connecting the two neurons is calculated using the following equation:

p_{connect}(i, j) = e^{-\frac{D(i,j)}{\lambda}}.    (2.12)

In equation 2.12 the λ parameter controls the probability distribution. The probability of not connecting the neurons is then 1 - p_{connect}. To calculate the probability of a neuron being inhibitory, the probability p_{inhibitory} is multiplied by p_{connect}; the probability of being excitatory is then 1 - p_{inhibitory}. This implementation is also used by Maass et al. [14]. In [14], equation 2.12 is multiplied by a constant c; here c is set to one. Maass et al. [14] explain that the distribution of connections is an important aspect of the liquid. Having too many connections between neurons that lie far apart decreases the performance; having many local connections and a few long connections seems to be the best solution. In this randomly initialized network there are no direct recurrent circuits, which means that a neuron cannot connect to itself. For simplicity, all weights are set to one single value.
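A minimal sketch of this wiring procedure is given below (my own illustration; the grid size, λ value and inhibitory probability are arbitrary example values). Two neurons are connected with the probability of equation 2.12, computed from the Euclidean distance of equation 2.11 with c = 1, and a connected source neuron is marked inhibitory with probability p_inhibitory.

import numpy as np
from itertools import product

def build_liquid_connections(width, height, depth, lam=2.0,
                             p_inhibitory=0.2, seed=0):
    """Wire a grid of spiking neurons with distance-dependent probabilities.

    Returns a matrix with entries +1 (excitatory), -1 (inhibitory) or 0
    (no connection), following equations 2.11 and 2.12 with c = 1.
    """
    rng = np.random.default_rng(seed)
    coords = np.array(list(product(range(width), range(height), range(depth))))
    n = len(coords)
    wiring = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue                                   # no direct self-recurrence
            dist = np.linalg.norm(coords[i] - coords[j])   # equation 2.11
            if rng.random() < np.exp(-dist / lam):         # equation 2.12
                # Connected from neuron i to neuron j; decide inhibitory vs. excitatory.
                wiring[j, i] = -1 if rng.random() < p_inhibitory else 1
    return wiring

wiring = build_liquid_connections(3, 3, 3)
print("connections:", np.count_nonzero(wiring))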

A leaky integrate-and-fire model [26] is used for the spiking neuron. The neuron has a certain threshold (η). To produce a spike, the activation a_j(t) of a neuron must become higher than this threshold. After a neuron fires, there is a refraction period (r_{period}). During this refraction period, the activation is at its resting state (r_{rest}), which is a constant; the neuron cannot fire, and when the refraction period is over, its activation restarts at the resting state. The activation inside the neuron is increased or decreased by incoming activation from other neurons, but it also 'leaks' away; this is done by multiplying the decay (d) with the activation of the neuron one timestep ago. This decay influences the capability of a neuron to have a short-term memory. With a decay of zero the neuron has no memory, and it also becomes harder to create spikes, because with a higher decay the activation is often already closer to the threshold. Setting the decay to one means that every input (after the refraction period, or the startup of the liquid) is kept in the activation of the neuron. The internal activation of the neuron is calculated every timestep as follows:

a_j(t) = \sum_{i=1}^{n} \left(x_i(t) \cdot w_{ji}\right) + d \cdot a_j(t-1),    (2.13)

where a_j(t) is the activation of neuron j at timestep t, x_i is the output of neuron i (the neurons here are both neurons inside the liquid and input neurons) and w_{ji} is the weight between neuron i and neuron j, where j is the receiving neuron. a_j(t-1) is the activation of neuron j one timestep back.

After the activation is calculated, the neuron will either spike or not, depending on the threshold, which gives an output of one or zero respectively. The spikes are summed per neuron over a period of time; at a certain timestep (called a readout moment, x^M(t), with readout period x^M_{period}) these summed spikes, together with the activation of the neurons at that moment, are passed on to the readout network. The number of readout moments can vary. First, samples can be of different lengths and thus provide more or fewer readout moments. Second, the number of readout moments depends on the number of timesteps over which spikes are summed, i.e. a shorter readout period means there are probably more readout moments per sample. The samples are the time-series data that form the input for the liquid; for example, one musical piece is one sample. The sample is broken up into discrete timesteps, and the sounds can, for example, be broken up into frequencies. The frequencies at one timestep are then an input pattern for the liquid. This input is passed on by the input neurons: in the liquid, each input neuron passes either a zero or a one on to the spiking neurons in the liquid.
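The following sketch implements such a leaky integrate-and-fire neuron with the activation update of equation 2.13 and a simple refraction period (my own illustration; the threshold, decay and refraction values are example settings, not the parameters used in the experiments).

import numpy as np

class LeakyIntegrateFireNeuron:
    """Leaky integrate-and-fire neuron as described by equation 2.13."""

    def __init__(self, threshold=1.0, decay=0.9, r_period=2, r_rest=0.0):
        self.threshold = threshold
        self.decay = decay
        self.r_period = r_period
        self.r_rest = r_rest
        self.activation = r_rest
        self.refractory_left = 0

    def step(self, inputs, weights):
        """Process one timestep; returns 1 for a spike, 0 otherwise."""
        if self.refractory_left > 0:
            # During refraction the neuron stays at rest and cannot fire.
            self.refractory_left -= 1
            self.activation = self.r_rest
            return 0
        # Equation 2.13: weighted input plus the decayed previous activation.
        self.activation = np.dot(inputs, weights) + self.decay * self.activation
        if self.activation > self.threshold:
            self.refractory_left = self.r_period
            return 1
        return 0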

To get readout moments the liquid must first be warmed up (warmup). During this warmup period no spikes are summed. This is done because there is no activation at the beginning, which can influence the performance, as it is hard to separate two different samples without much activation. After the warmup period there are, depending on the settings, a number of readout periods. Reading out the states of the liquid is repeated until the end of the sample. After this, the readout moments are stored and labeled with a class.

Once the liquid has processed every sample, the readout network is trained. An FNN (see section 2.1) is used to classify the samples. This FNN consists of an input layer, one hidden layer and an output layer. The hidden layer uses a logistic sigmoid function, while the output layer is linear. As the experiments are classification tasks, the number of output neurons equals the number of different classes. The target is one for the neuron whose index equals the index of the class, and zero otherwise. Every time the FNN is initialized, the weights are the same as in all other initializations. This is done to keep the experiments transparent: with random initialization the network may find a good starting point one time, which increases the performance, and a bad starting point the next time, which decreases it. To train the readout network, Back-Propagation is used; each sample is run through the network and learned immediately afterwards, which is called online gradient descent [2]. This is done in epochs, which means that the complete dataset is learned in one epoch, and this is repeated a number of times. To increase the learning speed of the readout network, the input patterns are scaled between zero and one. The relationship between all the readouts from the liquid stays the same, but the inputs now lie between zero and one. This is beneficial for the FNN as it increases the learning speed (smaller inputs mean smaller changes in the weights when learning).
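For instance, the liquid readouts could be scaled to the unit interval with a simple min-max normalization before training (my own sketch; the thesis does not specify whether scaling is done per feature or over the whole readout, so per-feature scaling is an assumption):

import numpy as np

def scale_unit_interval(readouts):
    """Min-max scale each readout feature to [0, 1], keeping the relations between samples."""
    readouts = np.asarray(readouts, dtype=float)
    lo = readouts.min(axis=0)
    span = readouts.max(axis=0) - lo
    span[span == 0] = 1.0          # avoid division by zero for constant features
    return (readouts - lo) / span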

Chapter 3

    Training Paradigms

In the next sections, some of the algorithms explained in the previous chapter are used for optimizing RNNs. First the performance functions used in the experiments are described in section 3.1. In subsection 3.1.2 a new method to measure the performance of a liquid is described, namely the separation. After this, a number of methods are described that are used to optimize liquids in LSMs. The liquid optimization is partly done on the wiring. The wiring consists of the connections between different neurons; there are three different types, namely an excitatory connection, an inhibitory connection and no connection. In liquids where the wiring is initialized based on distances between neurons, the wiring is important, because the weight is set to one default value for all connections; finding a better wiring could thus improve the performance. Reinforcement Learning (RL) and Evolutionary Algorithms (EA) are described in sections 3.2 and 3.3 respectively. These two algorithms can be compared, as they overlap in the problems they can solve. The EA is also used to optimize the weights in the liquid, because EAs are able to deal with real values. In the last section an existing system is explained further, namely Evolino [23]. Evolino uses Long Short-Term Memory networks as the liquid and Enforced SubPopulations (ESP) to optimize them.

3.1 Performance Measurements

    3.1.1 Readout Performance

The most straightforward method to analyze the performance of a liquid is to use the performance measure of the readout network. Here this measure is used both as a tool to analyze the data learned by the LSM and as a performance measure when optimizing the liquid. The performance measure implemented here represents the percentage of correctly classified samples. First, the index of the output neuron with the highest value is picked as the classification. If this index equals the target class, the outcome is one, otherwise it is zero. This outcome is called o_{ij}, where i is the i-th sample of class j. To calculate the performance P the following equation is used:

P = \frac{\sum_{j=1}^{c} \frac{\sum_{i=1}^{s} o_{ij}}{s}}{c}    (3.1)

Here c is the number of classes and s the number of samples in a class. The performance is first calculated per class and these values are then combined into the final performance measure. This is because the readout network can be biased when there are more samples from one class: it may look like the readout network performs well, while it is simply biased towards choosing the class with the most samples.

This measure seems to be a good measure for liquid optimization, as it is stable and its outcome is transparent. However, the readout network needs to be trained to obtain this performance, which means more computation time.
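A sketch of this class-balanced performance measure (equation 3.1) is given below; it is my own illustration and assumes the network outputs and target class indices are available as arrays.

import numpy as np

def readout_performance(outputs, targets, n_classes):
    """Class-balanced classification performance (equation 3.1).

    outputs -- network outputs per sample, shape (n_samples, n_classes)
    targets -- target class index of each sample, shape (n_samples,)
    """
    predicted = np.argmax(outputs, axis=1)     # highest output neuron wins
    per_class = []
    for c in range(n_classes):
        mask = (targets == c)
        # Fraction of correctly classified samples within this class.
        per_class.append(np.mean(predicted[mask] == c))
    # Average over classes so a frequent class cannot dominate the score.
    return float(np.mean(per_class))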

    3.1.2 Separation

The disadvantage of using the performance of the readout network for optimization is that it is not a fast method, as the readout network needs to be trained, which takes computation time. It is also an indirect method to calculate the performance of the liquid (indirect because the performance of the readout network is measured, not the performance of the liquid itself). It would be better to have a measure that calculates the performance of the liquid directly; in other words, no neural network or other statistical learning method should be necessary to determine the performance of the liquid. The separation of a liquid should be able to fill this role. Maass et al. [14] used the separation to get a good view of how well a liquid can divide two different input patterns. To calculate the separation S, the distance between two different samples is calculated. For this the Euclidean distance is used, which calculates the distance between different readout moments:

S(s_k, s_l) = \sqrt{\sum_{i=1}^{n} (N_{ki} - N_{li})^2}.    (3.2)

In equation 3.2, s is a readout sample from the liquid, N_{ki} and N_{li} are the readout moments from neuron i of samples k and l, and n is the number of neurons in the liquid. There are two different methods to use this distance to calculate the separation:

S(s_{ij}, \ldots, s_{nm}) = S_{different}(s_{ij}, \ldots, s_{nm}) - S_{same}(s_{ij}, \ldots, s_{nm})    (3.3)

S(s_{ij}, s_{nm}) = \frac{\sum_{h=1}^{v} S^{h}_{different}}{\sum_{h=1}^{v} S^{h}_{same}}    (3.4)

Here S is the separation, s_{ij} is the j-th sample of the i-th class, n is the number of different classes, m is the number of chosen samples per class (normally two), h is the h-th time that the separation is calculated (for the total separation), v is the number of separations that are calculated, and S_{same} and S_{different} are calculated as follows:

S_{same}(s_{ij}, s_{nm}) = \sum_{i=1}^{n} S(s_{io}, s_{ip})    (3.5)

S_{different}(s_{ij}, s_{nm}) = \sum_{i=1}^{n} \sum_{j=1}^{m} \sum_{k=i+1}^{n} \sum_{l=1}^{m} S(s_{ij}, s_{kl})    (3.6)

In S_{same}, o and p are two different samples chosen from the dataset. For one separation calculation, two samples are chosen from each class; these samples are used in both S_{same} and S_{different}. The idea behind equation 3.3 is that the distance between two samples from different classes should be as large as possible, giving a positive influence on the separation value, while the distance between samples of the same class should be as small as possible, so a larger distance there lowers the separation. Equation 3.3, however, is not a good representation of the dataset, because it uses only a few samples and not the complete set. Maass et al. [14] solve this by summing a number of separations, which is also applied in equation 3.4. In addition, equation 3.4 gives the relation between the distance of samples from different classes and samples from the same class. For example, if there are two different classes and there is no relation at all, the outcome is two. If the outcome is higher than two, there is a positive relation, which means that samples from the same class lie closer together than samples from different classes. If the outcome is below two, there is a negative relation and samples from different classes lie closer to each other than samples from the same class. A nice feature of the separation is that it can also be calculated for a single neuron, by only calculating the distance for that neuron rather than the whole network. This may give some insight into how different neurons perform, and it could be used to filter out neurons with a low separation. Some neurons may have a low separation value, which could indicate that these neurons make it hard for the readout network to separate different samples from each other.
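The sketch below computes the ratio-style separation of equation 3.4 for a single draw of two samples per class, using equations 3.5 and 3.6 for the same-class and different-class distances (my own illustration; the input format of two readout vectors per class is an assumption, and in the experiments several such draws are summed in numerator and denominator before dividing).

import numpy as np

def distance(a, b):
    # Euclidean distance between two readout vectors (equation 3.2).
    return np.linalg.norm(np.asarray(a) - np.asarray(b))

def separation(samples_per_class):
    """Separation ratio of equation 3.4 for one draw of two samples per class.

    samples_per_class -- list with, per class, a pair [sample_a, sample_b]
                         of readout vectors (an assumed input format).
    """
    n = len(samples_per_class)
    # Equation 3.5: distance between the two samples of the same class.
    s_same = sum(distance(pair[0], pair[1]) for pair in samples_per_class)
    # Equation 3.6: distances between samples from different classes.
    s_diff = 0.0
    for i in range(n):
        for j in range(2):
            for k in range(i + 1, n):
                for l in range(2):
                    s_diff += distance(samples_per_class[i][j],
                                       samples_per_class[k][l])
    return s_diff / s_same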

    3.2 Reinforcement Learning

    3.2.1 Introduction

Reinforcement Learning (RL; [25]) is an important learning method in machine learning. It is less demanding than supervised learning, where exact target data is needed in advance. That can be a problem for certain applications, such as robotics, where data is complex and hard to obtain in advance. RL solves this problem by using a reward system as the basis for learning the optimal solution. A reward is given at certain moments and can be negative, to penalize, or positive, to reward. The RL algorithm has an agent, which can for example be a robot that drives through an office. An agent can take actions, which can be all sorts of things; in robotics, for example, an action could be to go forward. Each action an agent takes receives some kind of reward afterwards. For choosing the next action, the agent picks the one with the highest expected reward, in other words the action that on average gets the highest reward.

The agent learns in the same environment in which it will be used in practice. This environment depends on the problem that is being solved. It can for example be a maze, where optimal routes are learned to find the fastest way out, but it can also be a real-world environment such as an office. Learning while also behaving in the environment is a big difference with supervised learning and with other learning algorithms. Here learning is more of an act-and-react method instead of learning everything in advance. This makes the system more dynamic: if the environment changes, it can use knowledge from the old environment to solve problems in the new one, and it gradually learns how to improve its behavior in the new environment, whereas other methods have to learn the new environment before the agent can be put back into use.

Basically, there are two types of reinforcement learning algorithms. One type is episodic, for example Monte Carlo sampling [25], where at the end of an episode the sum of rewards is returned to update the state value. An example is a game of chess, where it is hard to evaluate during the game whether the moves that are made are good or bad; after the agent finishes a game it gets the reward, and the game can be seen as an episode. In the other type of learning a reward is given directly when an action is taken; these are mainly algorithms based on TD-learning [25]. This means they can cope with problems that do not have a clear ending, for example in robotics, where a robot is supposed to be running around doing jobs all day. Both Monte Carlo sampling and TD-learning can be applied to problems where the environment is not completely known; these methods will thus try to predict the best action.

Both Monte Carlo sampling and TD-learning can be implemented in two ways, using V-values or Q-values. The difference is that with V-values, actions are chosen based on a model of the environment: the algorithm investigates the states and chooses the action that leads to the best state (in other words, the one with the highest average return). Q-values are defined over state-action pairs; here no model is needed, and the agent chooses the action that is expected to return the highest reward.

For choosing actions, an RL agent uses a policy. There are two types of actions, namely exploitation actions and exploration actions. Exploration means that the chosen action is not the most likely choice at that moment, but such actions are needed to search the search space so the agent will not get trapped in a sub-optimal local optimum; in other words, exploring means trying to find new local optima. Exploitation is choosing the best action, thus the action which will probably return the highest reward. It is important to have a good exploration policy, as this balances the time an agent invests in exploration and exploitation. Two well-known exploration policies are the ε-greedy policy and the soft-max policy. With ε-greedy the action with the highest value is normally chosen, but there is a chance to explore and thus to choose another action. This is controlled by the ε parameter, which sets the balance between exploring and exploiting. It is a straightforward policy but not the best, as the policy may need to change over time, for example to explore less and exploit more. The soft-max [25] policy is completely different: it uses the Boltzmann equation [25] to calculate the probability of each action, which means that the chance of being chosen depends on the V- or Q-value. In other words, the bigger the difference between two V- or Q-values, the higher the chance that the action with the highest value will be chosen. This is a fairer method of choosing a new action, as the relation between the values is better represented in the distribution of the probabilities. Learning can be done in two ways, either with on-policy learning or off-policy learning. The difference is that on-policy learning updates the chosen action using the value of the next action that is actually taken, while off-policy learning updates the chosen action using the value of the best next action.
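As an illustration of the two policies, a minimal Python sketch of ε-greedy and soft-max action selection over a small set of Q-values could look as follows; the function names and parameter values are hypothetical.

    import numpy as np

    def epsilon_greedy(q_values, epsilon=0.1):
        """With probability epsilon pick a random action (exploration),
        otherwise pick the action with the highest Q-value (exploitation)."""
        if np.random.rand() < epsilon:
            return np.random.randint(len(q_values))
        return int(np.argmax(q_values))

    def softmax_action(q_values, tau=1.0):
        """Boltzmann/soft-max policy: the probability of an action grows with its Q-value;
        a high temperature tau flattens the distribution (more exploration)."""
        prefs = np.exp(np.asarray(q_values, dtype=float) / tau)
        probs = prefs / prefs.sum()
        return int(np.random.choice(len(q_values), p=probs))

    q = [0.2, 0.5, 0.1]
    print(epsilon_greedy(q), softmax_action(q, tau=0.5))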

3.2.2 Using Reinforcement Learning for Optimizing Liquids

The implementation used for optimizing the wiring of a liquid is a multi-agent system. This means that the learning process is more complicated than normal, as the agents have to work together and strongly depend on each other. In other words, if one agent changes its state, it may influence the whole chain of connections that comes after this connection, including itself.

The algorithm is an on-policy Monte Carlo sampling method. As described in section 3.2.1, Monte Carlo sampling is episodic. This seems well suited for liquid optimization, because one run through all the samples can be seen as one episode. Furthermore, an on-policy method is used. The RL algorithm here uses Q-values: the agent is in a certain state and, given this state, chooses its action; these actions are explained later in this section. As policy, the soft-max algorithm is used. As stated in the previous section, the soft-max algorithm uses the Boltzmann equation to calculate the probability of an action being chosen:

$$p_a = \frac{e^{Q_t((n_i,n_j),a)/\tau}}{\sum_{b=1}^{n} e^{Q_t((n_i,n_j),b)/\tau}} \qquad (3.7)$$

Where τ is the temperature and Qt((ni, nj), a) and Qt((ni, nj), b) are the Q-values of the agent for a and b, between neuron i and neuron j, with neuron j receiving output from neuron i. An agent is a connection between two neurons. In this case the agent has three different states, namely the state to be on, to be inhibitory, and to be off. As described in section 2.4.2, the wiring of an LSM with a random liquid is chosen using the distance between neurons. The probabilities used to initialize the wiring are also used as initial probabilities for the RL algorithm, which may give a boost to initial learning. To find the right Q-values given these probabilities, the following equation is used:

$$Q_t((n_i,n_j),b) = \ln\!\left(\frac{e^{Q_t((n_i,n_j),a)/\tau}}{e^{-\mathrm{dist}_a(x,y,z)/\lambda}} - e^{Q_t((n_i,n_j),a)/\tau}\right)\cdot\tau \qquad (3.8)$$

Here Qt((ni, nj), a) is the default value, which is predefined. The complete explanation of how equation 3.8 is derived and how to apply it to find the correct Q-values can be found in appendix A. The soft-max algorithm also has a disadvantage: it introduces another parameter that needs to be tweaked. There is also a trade-off in how to set the τ parameter and the default Q-value. If the Q-value is high, the exponent will increase computation time (calculating the exponent of two takes longer than calculating the exponent of one). If the temperature (τ) is increased, the exponent becomes smaller, and thus computation time decreases, but the probabilities of the different Q-values lie closer to each other, which increases the time spent on exploration. The temperature needs to be tweaked according to how much exploitation and exploration there should be: too much exploration and the system will not converge to an optimum, too little exploration and the system converges to a sub-optimum.
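Applying equation 3.8 is straightforward. The sketch below is a simplified two-action version (the agent in this thesis has three states; the full derivation is in appendix A) and assumes the desired probability of the reference action equals e^(-dist/λ); the function name and example values are hypothetical.

    import numpy as np

    def initial_q(q_default, dist, tau=1.0, lam=2.0):
        """Equation 3.8 (two-action simplification): seed the Q-value of the competing
        action b so that the soft-max policy initially gives the reference action a
        the distance-based connection probability p = exp(-dist / lam)."""
        p = np.exp(-dist / lam)
        qa = np.exp(q_default / tau)
        return np.log(qa / p - qa) * tau

    # Sanity check: plugging the result back into a two-action soft-max
    # recovers the desired probability for action a.
    qa, dist, tau, lam = 1.0, 1.5, 1.0, 2.0
    qb = initial_q(qa, dist, tau, lam)
    p_a = np.exp(qa / tau) / (np.exp(qa / tau) + np.exp(qb / tau))
    print(p_a, np.exp(-dist / lam))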

The agent can be in three different states, namely {−1, 0, 1}. These states stand for the types of connections that can be made, namely a positive connection, no connection or a negative connection. To update the Q-value, the following formula is used:

$$Q_t((n_i,n_j),s) = \begin{cases} (1-\beta)\,Q_t((n_i,n_j),s) + \theta\,R(t) & \text{if } s = 0 \\ (1-\beta)\,Q_t((n_i,n_j),s) + \alpha\,R(t) & \text{otherwise} \end{cases}$$


Where β is the decay, α and θ are learning rates, s is the state the agent is in, t is the time-step and R is the reward. This split update is used because otherwise, if the number of connections is below fifty percent, the agents in state 0 would be updated too strongly; θ should therefore always be set lower than α. The reward given is the performance of the LSM, and it is only given if the performance at time-step t is higher than the performance at time-step t−1.
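A minimal sketch of this update rule, with hypothetical parameter values, could look as follows; note that the reward is only handed out when the performance improved over the previous time-step.

    def update_q(q, state, reward, beta=0.1, alpha=0.2, theta=0.05):
        """Decay the old Q-value and add the reward, scaled by theta for the
        'off' state (s == 0) and by alpha otherwise; theta is set lower than alpha."""
        rate = theta if state == 0 else alpha
        return (1.0 - beta) * q + rate * reward

    # The reward (the LSM performance) is only applied when performance improved
    # compared to the previous time-step.
    perf_prev, perf_now, q = 0.70, 0.74, 0.5
    if perf_now > perf_prev:
        q = update_q(q, state=1, reward=perf_now)
    print(q)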

    3.3 Evolutionary Computation

    3.3.1 Introduction

Genetic Algorithms (GA; [4]) are based on evolution. GAs do not take one solution and build further on that specific solution to improve the performance; instead they use a population of solutions and the properties of the individuals in the population to find the optimal solution. As with evolution, selection of individuals takes place. The difference is that with GAs selection happens under stricter rules: the population always has the same size, and in most cases the number of chromosomes selected for the next generation is also the same. The method for selecting individuals is different as well. Moreover, GAs are directed at solving a goal, using a predefined fitness function, while in biology such a goal seems to be missing.

There are a number of different methods to select individuals from the population. Two well-known selection methods are truncation selection [4] (also known as rank-based selection) and tournament selection [4]. The first is the more straightforward: it selects the percentage of individuals with the best fitness. Truncation selection is not only used in computational problems; farmers, for example, use it to get as much milk from cows as possible by looking at the milk production of each cow and taking the best cows to produce offspring. Tournament selection is a bit different. It uses tournaments to select individuals for the next generation: a fixed number of individuals is chosen randomly from the population, and the individual with the highest fitness is selected for reproduction. A major difference between the two selection methods is that tournament selection keeps the population more diverse. Truncation selection only selects the best individuals at that moment, but parts of solutions that are not selected could still be useful to improve solutions in the future. In tournament selection this is partially overcome, because a tournament can be held between a number of really bad solutions, which means one of these bad solutions is selected and thus creates more variety in the solution space. This is related to reinforcement learning policies, where it is important to find a good balance between exploration and exploitation. In GAs this problem also needs to be solved, so that an optimal solution is found and not a sub-optimal one. Another method to increase the exploring capabilities of the GA is to increase the population size, but this has the disadvantage of increasing the computation time per generation. This gives a trade-off for how large the population should be: the computation time should be as short as possible, but the population should be big enough to contain a wide variety of solutions.

For producing offspring, crossover and mutation are used. Crossover is a method to combine the gene-strings of two or more parents into one to create a child. Gene-strings are basically strings which contain the information of the system that is optimized. Two well-known crossover methods are one-point crossover [4] and uniform crossover [4]. One-point crossover (see figure 3.1 for an illustration) slices the gene-strings of the parents at the same point into two parts and uses the first part from the first parent and the second part from the second parent to make a new gene-string (this can be extended to use the second part of the first parent and the first part of the second parent to create another child). One-point crossover can be extended to two-point crossover (or three-point, etc.), where more points are picked to slice up the gene-string for recombination. An advantage of this type of crossover is that it keeps a certain ordering in the gene-string, which can sometimes be useful for solving the problem. Uniform crossover does not use this ordering; it picks either a gene from the first or the second parent for each gene in the gene-string (as illustrated in figure 3.2). This means that the ordering is less important, which can also be an advantage, as the spatial property of the gene-string can also negatively influence the solutions. Mutation is a method to create diversity. Mutation is done by randomly changing a value in the gene-string to another value (within the search space). The two methods differ in that mutation brings new diversity into the population, while crossover tries to use the best properties of the solutions that are already created.
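For illustration, a minimal Python sketch of the two crossover operators (illustrated in figures 3.1 and 3.2 below) could look like this; the gene values in the example are arbitrary.

    import random

    def one_point_crossover(parent1, parent2):
        """Cut both gene-strings at the same random point and glue the first part of
        parent one to the second part of parent two (figure 3.1)."""
        point = random.randrange(1, len(parent1))
        return parent1[:point] + parent2[point:]

    def uniform_crossover(parent1, parent2):
        """For every position, take the gene from either parent with equal probability (figure 3.2)."""
        return [random.choice(pair) for pair in zip(parent1, parent2)]

    p1 = [1, 0, 1, 1, 0, 0]
    p2 = [0, 1, 0, 0, 1, 1]
    print(one_point_crossover(p1, p2))
    print(uniform_crossover(p1, p2))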

Figure 3.1: Illustration of one-point crossover; the child inherits the first three genes from parent one and the rest from parent two.

Figure 3.2: Illustration of uniform crossover; here the child inherits genes one and four from parent one and the rest from parent two.

One of the difficulties in GAs is finding a good fitness function. This is important because the fitness function defines in which direction of the search space the GA will search. If the fitness function is flawed, it will not optimize towards the wanted result. There are all sorts of problems in finding a good fitness function. The function can be too complex, which results in difficulties in finding an optimal parameter setting for the fitness function. An example is when an EA is used to design the skeleton of a car: it should be strong, but also light. Both the sturdiness and the weight should be taken into account in the fitness function, but it is not clear how the two properties should be combined or which should be weighted more heavily. This increases the search space and can result in a fitness function that does not give the expected results. A fitness function can also be biased, in such a way that it ignores an important part of the search space, so that the optimal solution will not be found.

3.3.2 Using Genetic Algorithms for Optimizing Liquids

Evolutionary algorithms play an important role in machine learning, and a lot of the problems that are optimized using reinforcement learning are also optimized with evolutionary algorithms. It therefore seems a good idea to compare the performance of an evolutionary algorithm with that of a reinforcement learning algorithm. Moreover, optimizing the wiring of a liquid is much like the bit-string problems that are often optimized with genetic algorithms. Although evolutionary algorithms have already been used to optimize Liquid State Machines and Echo State Networks, for example in [9], wiring optimization is not very common. The genestring is represented by two layers (see figure 3.3 for an illustration): an input layer and a liquid layer. The input layer represents the connections from input neurons to liquid neurons, and the liquid layer represents the connections between pairs of liquid neurons. Unlike in random liquids, it is also possible to have neurons that are connected to themselves; if this has a negative influence on the fitness, the GA will remove these recurrent circuits after a number of generations.

    The algorithm that is implemented goes as follows:

1. The pool is initialized; the wiring is randomly initialized as described in section 2.4.2. After this, the layout of the wiring is put into a genestring, as can be seen in figure 3.3.

2. Each chromosome is evaluated to calculate its fitness, by using the liquid in an LSM.

    3. Parents are selected from the chromosome pool.

    4. Offspring is produced by using crossover with two parents.

    5. All chromosomes in the new pool have a chance to be mutated.


6. Steps two through five are repeated until the end condition is met (here, the end condition is a predefined number of generations).

Figure 3.3: Illustration of how a liquid is converted to a genestring; the purple connections are inhibitory connections. Note that the liquid shown here is two-dimensional; normally it would be three-dimensional.

For selecting the parents, tournament selection is used. In tournament selection a number of chromosomes are chosen to take part in a 'tournament', and the chromosome with the highest fitness wins and is selected for producing offspring. As already stated in section 3.3.1, the population size is important, as is the number of individuals per tournament. More individuals in a tournament makes it less likely that chromosomes with a lower fitness will be selected, so there is a trade-off between choosing the best chromosomes and keeping the population diverse.

Selection takes place by shuffling the chromosome pool and picking the first chromosomes in the pool for a tournament. After the tournament, the best individual is put in the parent pool. This is repeated until the parent pool is filled. The chromosomes selected for a tournament are not put back into the pool; this way every chromosome gets a chance to be selected as a parent and to be in the next generation pool (on the condition that it wins its tournament, of course). A small change to this implementation can be made by selecting a small number of the best chromosomes without using tournament selection; this way, the best chromosomes are always kept in the gene-pool.

After that, the parent pool is shuffled and the first two parents are selected to make a child; this is repeated until the chromosome pool is full again. These parents are not put back into the old parent pool, so every chromosome has an equal chance to produce offspring and most likely will (if the number of parents selected for the new pool is low enough, every parent will have offspring). If the parent pool is empty but there are still not enough children in the pool, the whole process is repeated until the pool is full.
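A rough sketch of this selection scheme, with hypothetical helper names and a toy fitness function, could look as follows; the actual fitness in the thesis is the LSM performance on a control set.

    import random

    def select_parents(pool, fitness, tournament_size, n_parents):
        """Shuffle the pool, run tournaments on consecutive groups and keep each winner.
        Chromosomes used in a tournament are not put back, so each competes at most once."""
        candidates = pool[:]
        random.shuffle(candidates)
        parents = []
        while len(parents) < n_parents and len(candidates) >= tournament_size:
            group = [candidates.pop() for _ in range(tournament_size)]
            parents.append(max(group, key=fitness))
        return parents

    def pair_parents(parents, n_children, make_child):
        """Shuffle the parent pool and pair parents off two at a time; if the pool
        runs dry before enough children exist, start over with all parents."""
        children = []
        while len(children) < n_children:
            queue = parents[:]
            random.shuffle(queue)
            while len(queue) >= 2 and len(children) < n_children:
                children.append(make_child(queue.pop(), queue.pop()))
        return children

    # Toy usage: wiring genestrings with a dummy fitness (number of connections).
    pool = [[random.choice((-1, 0, 1)) for _ in range(8)] for _ in range(12)]
    fit = lambda c: sum(1 for g in c if g != 0)
    parents = select_parents(pool, fit, tournament_size=3, n_parents=4)
    children = pair_parents(parents, n_children=12,
                            make_child=lambda a, b: [random.choice(p) for p in zip(a, b)])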

Children are made by using crossover. For the implementation used here, uniform crossover is chosen. This is because the ordering of the genestring differs from that of the liquid (the liquid is 3D and the genestring is 2D), and this should not affect the solutions that are generated. Also, making a one-point crossover (or multi-point crossover) that cuts a 3D slice out of the genestring to combine with another parent would be complex, so the choice is to not use the spatial property at all. After this, all chromosomes, including the parent chromosomes, have a chance to mutate. This is done by going through the genestring and mutating each gene with probability pmutate. If a gene is mutated, there is a fifty percent chance to mutate to one of the other two values and fifty percent to mutate to the other. For example, if the gene has the value one, it will mutate to minus one with a chance of one half, or otherwise to zero. The parent pool is also mutated, because this reduces the chances of elitism. Elitism means that only genes from the best chromosomes are kept in the population, but chromosomes with a lower fitness may still have some good properties that could be used in later generations. Mutation already counters this by keeping the population diverse, but if it is not applied to the parents, there is a chance that the diversity in the gene-pool stays low, which increases the chance of elitism. A disadvantage of this implementation, however, is that the mutated parents get a different fitness value, which increases the computation time. An exception are the parents that are selected without the use of tournament selection, as explained above; these chromosomes are not mutated, as that would defeat the purpose of keeping at least a small number of the best chromosomes in the gene-pool.
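The gene mutation just described can be sketched as follows; this is a minimal illustration, and the p_mutate value and the example genestring are arbitrary.

    import random

    def mutate_wiring(genes, p_mutate):
        """Walk over the genestring; with probability p_mutate replace a gene by one of
        the two other connection types (-1, 0 or 1), each with a fifty percent chance."""
        mutated = []
        for g in genes:
            if random.random() < p_mutate:
                g = random.choice([v for v in (-1, 0, 1) if v != g])
            mutated.append(g)
        return mutated

    print(mutate_wiring([1, 0, -1, 0, 1, 1], p_mutate=0.1))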

After mutation, the fitness is calculated for every chromosome in the pool (including the mutated parents). Fitness is calculated by training the readout network on the training dataset. After this, a control set is used to determine the performance of the LSM, and this value is used as the fitness.

As an extension, another layer is added to the chromosome. This layer consists of the weights between the neurons. The weights and wiring are optimized together. It is faster to optimize the weights in parallel with the wiring, because the computation time depends on how many connections there are, and the wiring thus reduces the computation time (given that connections between some neurons are missing). The representation is kept as in figure 3.3, but now there are four layers, namely two input layers and two hidden layers (one for the weights and one for the connections).

The process of optimizing the weights is the same as for the wiring, with one exception: mutation. To mutate a weight, a random value between −0.025 and 0.025 is picked and added to the original weight. In addition, a weight cannot be below zero or above one. Because the wiring already encodes negative values, a negative weight would make such a connection positive and could interfere with the learning process. In other words, if a weight is negative and a connection is negative, so that the product is positive and the fitness is better because of it, then when crossover is used this negative weight can be chosen in combination with a positive connection, which can have a negative influence on the fitness. A weight cannot be higher than one, because otherwise a weight could grow towards infinity, which can have a negative influence on learning. Furthermore, this could be more biologically plausible, as a neuron only has a certain amount of energy and cannot make the signal any stronger than that. The maximum of one is chosen because this is also the threshold and firing value of a neuron, so a signal cannot become higher than its spike value. Using wiring in combination with weights has some disadvantages. Weights are selected while the connection at that gene could be in the off state, so a weight that has no influence on the outcome is still carried over into the next gene-pool. This can be negative, but it can also be positive, as it keeps more diversity in the gene-pool.
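A minimal sketch of the weight mutation with the clamp to [0, 1] described above; the step size of ±0.025 is taken from the text, the function name is hypothetical.

    import random

    def mutate_weight(weight, step=0.025):
        """Add a random perturbation in [-0.025, 0.025] and clamp the result to [0, 1]:
        negative weights would interfere with the inhibitory wiring and weights above
        one would exceed the spike value of a neuron."""
        weight += random.uniform(-step, step)
        return min(1.0, max(0.0, weight))

    print(mutate_weight(0.5))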


3.4 Evolino

    3.4.1 Introduction to Evolino

Evolino stands for EVolution of recurrent systems with Optimal LINear Output. Evolino has the same architecture as LSMs and ESNs: it uses a liquid and a readout network. The liquid is an LSTM (as described in section 2.3), which is optimized using evolution as described in the next section (section 3.4.2). The readout network is a linear regression model, as the name Evolino (LINear Output) already implies. Section 3.4.3 describes the implementation used in the experiments.

    3.4.2 Enforced SubPopulations

Enforced SubPopulations (ESP) is an evolutionary approach that is an extension of Symbiotic, Adaptive NeuroEvolution (SANE; [15]). It is a co-evolutionary approach to solving neuro-evolutionary problems. In most EA approaches a chromosome represents the complete neural network, but in SANE individual neurons are optimized. In most implementations these neurons are basically a set of weights (real numbers) for that neuron. This is done by making a genepool of neuron chromosomes and combining a number of these to make a network. After this, the fitness of the network is calculated and each chromosome that took part in the network gets that fitness. The difference between SANE and ESP is that ESP has sub-populations, which means that for each position in the neural network there is a separate sub-population that is evolved (see figure 3.4 for an illustration). SANE does not have different sub-populations; there is only one chromosome pool with neurons in it, and each neuron can be chosen for each place in the network. This means that ESP evolves more specialized functions, which may increase the variety in the genepool. The algorithm runs as follows:

1. First the populations are initialized. Each sub-population is of the same size. Each chromosome consists of the weights between that neuron and both the input and the hidden layer. These weights are randomly initialized real values. In Evolino this is extended for the memory cells, which consist of four different types of weights, namely the three gate weights and the normal cell weight. The total number of weights is then 4 × (I + H): four times (for each gate and the cell weight) the weights between the input neurons and the memory cell, and four times the weights between the hidden memory cells and the memory cell that is initialized here.

Figure 3.4: Illustration from [23] showing how ESP works. From each sub-population in the gene pool (on the left) a chromosome is taken; in the middle is the LSTM network, in which the four chromosomes chosen from the sub-populations are implemented.

2. The neurons are evaluated. From each sub-population one neuron is chosen and these are combined into a network. The network is then evaluated and receives a fitness score. This score is added to the cumulative fitness, which is the actual fitness used to rate the chromosome. This process is done for each neuron in each sub-population and repeated until each neuron has participated in a predefined number of evaluations.

3. Children are made to form a new generation. Depending on the implementation, there can be crossover and mutation within a sub-population, and the selection methods can also differ. Most selection methods that are used in GAs can also be used in ESP.

4. Steps two and three are repeated until the stop condition is met. In Evolino, the condition is the number of generations it will train for.
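The co-evaluation step of ESP (step 2 above) can be sketched as follows; the network construction and evaluation are left abstract, and all names and sizes are hypothetical.

    import random

    def evaluate_subpopulations(subpops, build_network, evaluate, trials_per_neuron=10):
        """ESP-style co-evaluation: repeatedly draw one chromosome from every
        sub-population, build and score a network, and add the score to the
        cumulative fitness of every chromosome that took part."""
        fitness = {id(c): 0.0 for pop in subpops for c in pop}
        counts = {id(c): 0 for pop in subpops for c in pop}
        while min(counts.values()) < trials_per_neuron:
            team = [random.choice(pop) for pop in subpops]
            score = evaluate(build_network(team))
            for c in team:
                fitness[id(c)] += score
                counts[id(c)] += 1
        return fitness

    # Toy usage: stub chromosomes (weight vectors) and a random "evaluation".
    subpops = [[[random.random() for _ in range(6)] for _ in range(5)] for _ in range(4)]
    scores = evaluate_subpopulations(subpops, build_network=lambda team: team,
                                     evaluate=lambda net: random.random(),
                                     trials_per_neuron=5)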

    3.4.3 Evolino Implemented

As stated in the introduction, Evolino also uses a recurrent network to handle the time-series, but it uses a linear readout network to map the states of the recurrent network to the output layer. The readout network is a linear regression model, which basically is the output layer of Evolino. It is trained with the Moore-Penrose pseudo-inverse method [18]. This method is chosen because it is known to be a fast and powerful way to learn the linear outputs. The weights in the LSTM are optimized by evolution; in this case ESP is used to evolve the network. To evaluate a network (for the fitness), first all samples are presented to the LSTM, after which the output produced by the LSTM is stored in a matrix to be learned by the linear regression model. To get the readout moments from the LSTM, the LSTM is first warmed up, which means that its output is not yet used. After that, the output is summed for a number of time-steps and used as input for the linear regression model. For every vector in the matrix there is a target vector for training the regression model. After all samples have been passed through the LSTM, the output weights are trained. The training set is then presented to the LSTM again, now with the trained regression model, and the error calculated at the output units is added to the fitness of the chromosomes in the network.
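The linear readout itself takes only a few lines once the (summed) LSTM outputs have been collected into a matrix. A sketch using NumPy's pseudo-inverse, with hypothetical shapes and names:

    import numpy as np

    def train_readout(states, targets):
        """Least-squares readout weights via the Moore-Penrose pseudo-inverse:
        W = pinv(S) @ T, where every row of S is one (summed) state vector and
        every row of T is the corresponding target vector."""
        return np.linalg.pinv(states) @ targets

    def readout(states, weights):
        return states @ weights

    S = np.random.rand(100, 20)                    # 100 readout moments, 20 cell outputs
    T = np.eye(2)[np.random.randint(0, 2, 100)]    # one-hot targets for two classes
    W = train_readout(S, T)
    print(readout(S[:5], W))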

The memory cells that need to be optimized with ESP, as can be seen in figure 3.5, are represented by a string which contains all weight connections of the memory cell. These are the weights for both the input neurons and the hidden memory cells. The representation consists of the weights for the input, output and forget gates and the cell weights that connect two units (either input units or memory cells) with each other. First the best twenty-five percent of each sub-population is selected. After this, both mutation and crossover are used to create offspring. The crossover method used here is one-point crossover; an explanation of how this crossover works can be found in section 3.3.1. Mutation is done by adding noise drawn from the Cauchy distribution [23], which is calculated as follows:

$$f(x) = \frac{\alpha}{\pi(\alpha^2 + x^2)} \qquad (3.9)$$


Figure 3.5: Illustration from [23] shows how a memory cell is represented in a chromosome.

Here α is the interval (scale) of the distribution. After the offspring is created, the fifty percent worst chromosomes in the sub-populations are replaced by offspring. If the performance of ESP does not improve for a number of generations, burst mutation is used. Burst mutation keeps the best neurons in the population, and the rest of the populations are filled with mutations of those best neurons. After this the algorithm resumes, but from a different point in the search space. This method can be repeated a number of times when there is no increase in performance. After ESP finishes training, the best memory cells of each sub-population are used to build a network that is used for new data.
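Cauchy mutation can be sketched as follows. This simplified version assumes a flat weight vector and adds unbounded Cauchy noise with scale α to a random subset of weights; the Evolino implementation may additionally truncate the noise, and the parameter values here are hypothetical.

    import numpy as np

    def cauchy_mutation(weights, alpha=0.3, p_mutate=0.2):
        """Add Cauchy-distributed noise with scale alpha (equation 3.9) to a random
        subset of the weights; the heavy tails occasionally produce large jumps,
        which helps the search escape local optima."""
        weights = np.asarray(weights, dtype=float).copy()
        mask = np.random.rand(weights.size) < p_mutate
        weights[mask] += alpha * np.random.standard_cauchy(mask.sum())
        return weights

    print(cauchy_mutation(np.zeros(8)))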


Chapter 4

    Data

    4.1 Movement Classification Task

It is difficult to find a good classification task to solve. First of all, most datasets do not contain a lot of data, because it is often hard to get a good representation in the data if the data comes from a real-world problem. This means that algorithms will not always perform the way they should. If the set is too small, any algorithm will have a hard time finding the common patterns needed to divide the data into classes, which may give a poor view of how well an algorithm can solve the problem. The solution is to use a dummy dataset. The advantage is that there is more control over what is happening: the training set can be as large or small as required, and it is easy to adjust a few parameters to make the classification problem harder or a bit easier. Furthermore, it is easier to create multiple datasets (such as a test set and an evaluation set). Because most real-world datasets do not have that many samples, the problem only becomes worse when the data has to be divided into a training set, a test set and possibly an evaluation set for training a GA.

A remaining problem is finding a good problem set to optimize. It should not be too hard or complex, because then it does not give the transparency that is needed, but it should also not be too simple, because then there are simpler methods to find an answer to the problem. A nice choice is a movement classification task: a temporal dependency is needed to solve the problem, and it is simple to add some noise to make solving the problem more difficult.

The set-up is as follows: there is a round-world grid (see figure 4.1). In other words, the left border is connected with the right border and the top border is connected with the bottom border, so when a target moves off the grid, it appears again on the other side. This grid is built up of white squares and its size can be preset. Furthermore, there is a patch (or target) that can be of any size (as long as it is smaller than the grid). The patch is a black square and it moves either north, east, south or west. The task for the time-series algorithm is to classify in which direction the patch is moving. The input pattern is simple: every point in the grid is an input; if the patch is not on a point, that point gives a zero as input, and if the patch is at a point, it gives a one as input. To use this as input, the grid is transposed to a one-dimensional string, as in figure 4.2; this is what is used as input for the RNN. Furthermore, to make the problem more challenging, noise can be added: if the patch is not on a point, that point has a probability of giving a noisy input. A square that is noise gives an input of one in the input string. Another method to increase the complexity is changing the number of samples in a dataset; decreasing the number of samples makes the problem harder to solve, but gives more insight into how well the RNN generalizes in solving the problem.

Figure 4.1: An example of a sample of three steps with the patch moving east, including grid noise.


1. 0 0 1 0 1 0 0 0 0
2. 1 0 0 0 0 1 0 0 0
3. 0 0 0 1 0 0 0 1 0

Figure 4.2: Representation of figure 4.1 as input for the liquid.
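A generator for this dataset is easy to sketch. The version below uses hypothetical parameter names, places a patch of size one (wrapping of larger patches across the border is not handled), and flattens each time-step into the one-dimensional input string of figure 4.2.

    import numpy as np

    def make_sample(grid=3, patch=1, steps=3, direction="east", p_noise=0.1, rng=None):
        """Generate one movement-classification sample: a black patch drifts over a
        wrap-around grid; every grid cell is one input (1 = black or noise, 0 = white)
        and each time-step is flattened into a one-dimensional input string."""
        rng = rng or np.random.default_rng()
        dr, dc = {"north": (-1, 0), "east": (0, 1), "south": (1, 0), "west": (0, -1)}[direction]
        r, c = rng.integers(0, grid, size=2)
        frames = []
        for _ in range(steps):
            frame = (rng.random((grid, grid)) < p_noise).astype(int)   # background noise
            frame[r:r + patch, c:c + patch] = 1                        # the patch itself
            frames.append(frame.flatten())
            r, c = (r + dr) % grid, (c + dc) % grid                    # wrap-around movement
        return np.array(frames)

    print(make_sample())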

    4.2 Music Classification Task

Music classification can be a nice tool to assist a user in recognizing musical pieces. For example, there are databases containing music that needs to be ordered. Because these databases are so big, it may be difficult to let human users alone recognize the music; a machine can do this faster. In the experiments, the task is to classify the correct composer of a musical piece. The dataset is taken from the experiments done in [16]. The dataset contains pieces by Johann Sebastian Bach, namely the Well-Tempered Clavier, and by Ludwig van Beethoven, namely the Piano Sonatas. Although Bach is from an earlier time than Beethoven, there are similarities: Beethoven played and studied the music of Bach and was certainly inspired by it.

    There are two methods of storing music in digital form. The first is as waves, a digital recording of real (analog) instruments. The other method is to store information about the musical piece instead of the wave data generated by the instruments: information about notes, volume, pitch, onset and length. The sound of the instrument itself is not stored. A well-known digital format that uses this is called MIDI

    There are two methods of storing music in a digital form. The first isin waves, this is a digital recording of real (analog) instruments. The othermethod is to store information about the musical piece instead of storingthe wave data generated by musical instruments. This contains informationabout notes, about volume, pitch, onset and length. The sound of the instru-ment is not stored. A well known digital format that uses this is cal