April 9, 2015
Reservoir Computing in Forecasting Financial Markets
Jenny Su
Committee Members: Professor Daniel Gauthier, Adviser
Professor Kate Scholberg
Professor Joshua Socolar
Defense held on Wednesday, April 15, 2015 in Physics Building Room 298
Abstract
The ability of the echo state network to learn chaotic time series makes it an interesting tool for financial forecasting, where data are highly nonlinear and complex. In this study I initially examine the Mackey-Glass system to determine how different global parameters can optimize training in an echo state network. In order to simultaneously optimize multiple parameters I conduct a grid search to explore the mean squared error surface. In the grid search I find that error is relatively stable over certain ranges of the leaking rate and spectral radius. However, the ranges over which the Mackey-Glass system minimizes error do not correspond with an error surface minimum for financial data, as a result of intrinsic qualities such as step size and timescale of dynamics in the data. The study of chaos in financial time series data leads me to alternate understandings of the distribution of the relative stock price change over time. I find the Lorentzian distribution and the Voigt profile are good models for explaining the thick tails that characterize large returns and losses, which are not explained by the common Gaussian model. These distributions act as an untrained random model to benchmark the predictions of the echo state network trained on the historical price changes in the S&P 500. The global reservoir parameters, optimized in a grid search given financial input data, do not lead to significant predictive abilities. Committees of multiple reservoirs are shown to give forecasts similar to those of single reservoirs. Compared to a benchmark random sample from the fitted distribution of previous input, the echo state network is not able to make significantly better forecasts, suggesting the necessity of more sophisticated statistical techniques and the need to better understand chaotic dynamics in finance.
Contents
1 Introduction
1.1 Background
1.2 Approach
2 Network Concepts
2.1 Basic Concept
2.2 Input
2.3 The Reservoir and Echo State Property
2.4 Training and Output
2.5 Reservoir Optimization
2.6 Summary
3 Echo State Network and Mackey-Glass
3.1 Mackey-Glass System
3.2 Constant Bias
3.3 Leaking Rate, Spectral Radius, and Reservoir Size
3.4 Summary
4 Financial Forecasting and Neural Networks
4.1 Nonlinear Characteristics
4.2 History of Neural Networks in Finance
4.3 Neural Forecasting Competition
5 Network Predictions
5.1 S&P 500 Data
5.1.1 Data Processing
5.2 Benchmarking
5.3 Parameter Optimization
5.4 Reservoir
5.5 Results
5.6 Committee
6 Conclusion
6.1 Analysis of Results
6.2 Conclusion
Chapter 1
Introduction
1.1 Background
Artificial Neural Networks are trainable systems that have powerful learning capabilities with many
applications in forecasting and classification. Their learning process has many similarities to that of
the human brain because their design was inspired by research into biological nervous systems. Like
the central nervous system of animals, the artificial neural network is composed of many “neurons” or
artificial nodes, which are connected in a defined network. Both biological and artificial neural networks
send and receive feedback signals through their connections, which partially determine their expressed
state. Therefore these networks are constantly updating with information from external sources as well
as internally within the network from other neurons.
The development of artificial neural networks began in 1943 when McCulloch and Pitts [23] first
proposed a network in which the state of neurons was determined by a combination of all the signals
received from connected neurons in the network. In their model, simple artificial “neurons” i = 1,...,n
could only have binary neuron activations or states, ni = 0,1, where ni = 0 represents the resting state
and ni = 1 represents the activated state of the neuron. The binary state of each neuron i is dictated by a
linear combination of action potentials, ∑_j w_ij n_j(t), where w_ij is a matrix representation of the connection weights between neurons i and j, as seen in figure 1.1. If this value exceeds a certain threshold, then
neuron i becomes activated and transmits signal ni = 1.
Figure 1.1: Network model in which inputs to a neuron are summed as a linear combination that must exceed a certain threshold for the neuron to be activated.
Another important foundational step occurred in 1949 when Hebb published The Organization of Be-
havior in which he asserts that simultaneous activation of neurons leads to increased connection strengths
between those neurons [8]. This established the basis for Hebb's learning rule, which states that the synaptic connection between neurons is adaptive. The connection between neurons can be strengthened or increased as a result of repeated activations between the neurons. If neuron x1 activations consistently lead to neuron x2 activations, then the connection weight between them will increase. This idea was later incorporated in 1961 by Eduardo Caianiello into the learning algorithm used to determine the weight matrix w_ij connecting neurons [1], where the connection weight matrix has adaptive weights for neurons with similar activations: the connections between neurons which activate simultaneously are stronger.
All of these discoveries contributed to the development of the first simple feed-forward network
which Rosenblatt and his collaborators called a perceptron [29]. The perceptron consisted of two layers
of neurons: the input layer and the output layer. Signals travel from the input to the output in a
single direction as seen in figure 1.2. This first feed-forward network was used in a simple classification
problem between two classes. After each classification, if the model predicts the class incorrectly the
weights are adjusted and the network is run again until the weights converge to the correct values that
allow the model to predict the class correctly.
Figure 1.2: Single-layer perceptrons have adjustable weights which are trained to predict correct classi-
fication.
Years later in 1982, Hopfield published the basis for a recurrent network where neurons can be up-
dated sequentially based on information stored within the network [9]. As opposed to a feedforward
network, recurrent neural networks have neurons whose connections are multidirectional as shown in
figure 1.3. This allows the network to display dynamical behavior based on internal memory of signals
throughout the network. In Hopfield’s first network, there was no self-connection, the neuron did not re-
ceive input from itself, and the connections between neurons were symmetric so that w_ij = w_ji. Hopfield
showed that networks with these characteristics which iteratively adjusted the state of each neuron would
reach a local minimum and evolve to a final state.
Figure 1.3: Hopfield network neuron connections are multidirectional with symmetric weights.
Within the field of networks, many different learning rules that have been developed are classified
as supervised learning, where the network is adjusted by comparing the network output with the desired
output. One of the most common learning rules, error backpropagation, was first proposed in 1974
by Werbos [37]. The learning algorithm makes small iterative adjustments to network connections to
minimize the difference between the target and network output. Backpropagation has been especially
successful in feed forward networks but the method is only partially successful with recurrent neural
networks. Output of recurrent neural networks can bifurcate unlike the smooth continuous output of
feed-forward networks. This bifurcation can lead to discontinuous error surfaces [3].
An alternative to backpropagation was introduced with echo state networks (ESNs) and liquid state
machines [13, 19]. These two training models developed independently within the context of machine
learning and computational neuroscience respectively share the same basic idea of a randomly connected
“reservoir” of neurons with a trained output feedback. This developed into the current research field of
Reservoir Computing. The specific network concepts defined by the echo state network approach I will be using in this project are explained in detail in chapter 2.
Reservoir computing is useful in tasks such as function approximation, signal processing, classification, and data processing. It has been applied to system identification, natural language
processing, medical diagnosis, as well as e-mail spam filters. In this study, I am interested in the applica-
tion of these network models to the financial sector, which historically has great incentives to determine
models that are capable of predicting and forecasting changes in a complex, widely fluctuating market.
1.2 Approach
The goal of the project is to study the echo state network and its modeling capacity in forecasting financial data series. Within the scope of this project I first forecast a known dynamical system, the Mackey-Glass system, to better understand reservoir behavior given specified global parameters. Second, I apply this knowledge of ESNs to financial data to determine whether ESNs may provide a useful prediction
of the stock market trajectories.
The paper is organized as follows. The second chapter discusses network concepts and the mathemat-
ical description of the Echo State network. In order to determine best practices, in chapter three I study
the network using the Mackey-Glass system, a nonlinear delay differential equation that is commonly
used to test modelling of complex systems. The ESN can be trained using Mackey-Glass on varying
global parameters of the reservoir. These global parameters are tuned to optimize network performance.
The fourth chapter introduces the dynamics of the stock market and how previous forecasting measures
have dealt with these widely fluctuating time series. In the fifth chapter I apply the echo state network
approach to financial data. I study the impact of global parameters and data processing techniques. I
benchmark the reservoir predictions using random samples from distributions that best fit the historical
input data. I also study the impact of committee methods combining output from multiple reservoirs in
comparison to single reservoir predictions as well as random sample. The sixth and final chapter con-
cludes with a comparative discussion of the echo state network approach in financial forecasting as well
as limitations of the model and further studies.
Chapter 2
Network Concepts
2.1 Basic Concept
There are many supervised learning algorithms for training recurrent neural networks. In the echo state
network (ESN) approach used in this project, only the output weights from the reservoir are trained.
The connections of the neurons within the reservoir are randomly generated at the outset as in figure
2.1. Training all network connections is unnecessary, which makes this faster than previous learning
algorithms in which all connections are trained. Because the network has recurrent loops, it maintains a
memory of the past input and the output consists of “echoes” of the initial input time series.
Figure 2.1: The echo state network approach has a combination of trained and untrained connections between
neurons or nodes.
In implementing the echo state network, an input teacher data series u(n) is used to train a reservoir of
size Nx with neuron connections determined by a randomly generated matrix W ∈ R^(Nx×Nx). The output
node of the reservoir gives a readout of a linear combination of all or a portion of the neuron activations
Wout x(n). This matrix Wout is computed so that the output y(n) corresponds as closely as possible to
a defined target data series ytarget(n). This last step is the training portion of the learning algorithm.
Once the best weight matrix Wout is determined, new input data u(n) can be used in the reservoir to
generate output or reservoir predictions y(n) beyond the target data. The rest of this chapter describes the
mathematical properties of the different components involved in the ESN approach and also introduces
the tunable global parameters to optimize this learning architecture, as first developed by Jaeger in 2001
[13].
2.2 Input
The input data to the reservoir serves as the driving mechanism for the reservoir. Not only does the
reservoir exhibit nonlinear dynamics as a result of the input, but the reservoir will also retain a “memory”
of previous input. This ability to remember, which will later be defined as the echo state property in
section 2.3, occurs as a result of the recurrence of the network; the nodes form recurrent cycles through
their connections in W.
Typically, the teacher input series, u(n), consists of Nu series of data points at discrete time steps
(n, n + 1, n + 2, ...). In my studies, part of the input teacher series is used as the target data ytarget(n) that
the output is trained on. There are no limitations to the starting point of the data series, and therefore
shifting the initial starting input does not significantly impact the ability of the reservoir to learn the input
series. While in many contexts, Nu is one-dimensional, there is no limit to the number of arrays that can
be used as input to the reservoir. The input is fed into the reservoir using a randomly generated input
weight matrix Win ∈ R^(Nx×(1+Nu)). In addition to the specified input data u(n), there is also a constant bias
value input to the reservoir. This bias input is weighted by the randomly generated column of values
in Win as seen in figure 2.2. The bias constant serves to increase the variability of input
dynamics [13]. The impact of this bias constant is further studied in section 3.2. All these input values
directly impact the reservoir, which is studied in the next section.
Figure 2.2: Win determines the connection for both the input data and bias input.
2.3 The Reservoir and Echo State Property
The reservoir is defined by the neurons within it whose states all follow the same update equation as a
function of the input and feedback from other neurons. The state of each neuron or node depends on
previous states in the reservoir according to
x(n + 1) = (1 − α)x(n) + α tanh(Wx(n) + Win u(n + 1) + Wfb y(n) + v(n)), (2.1)
where α is the leaking rate of the network. Technically the state update can be governed by any sigmoidal
function but in the context of this project I use the tanh function because it is the standard sigmoid
function studied across all encountered literature in reservoir computing. According to Eq. 2.1, each
new neuron state x(n + 1) is determined by its current state x(n) as well as a nonlinear expression of other
current nodes Wx(n) and the input data Win u(n + 1). The state x(n + 1) can also depend on the output y(n) adjusted by the randomly generated feedback weight matrix Wfb, as well as on a noise function v(n).
However, both of these parameters are optional and are omitted in this project. Jaeger found that noise
was useful in maintaining stability of reservoir activations when driven by highly chaotic time series
such as the Mackey-Glass system [13]. He also concluded that noise insertion negatively impacted the
precision of predictions, which is undesirable in my project. In this project, I do not use a separate Wfb to feed the output into the reservoir but rather feed the output y(n) back as input u(n), and therefore Wfb = Win.
This allows the reservoir to exhibit dynamical behavior as a result of the output as well as the input.
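To make the update rule concrete, the following is a minimal sketch in Python (with NumPy) of the state update of Eq. 2.1 as used in this project, with Wfb and the noise term v(n) omitted. The variable names, reservoir size, and toy input are illustrative assumptions, not the code used in this study.

```python
import numpy as np

rng = np.random.default_rng(42)

N_x, N_u = 500, 1        # reservoir size and input dimension (illustrative values)
alpha = 0.3              # leaking rate
W = rng.uniform(-0.5, 0.5, (N_x, N_x))          # random recurrent connection matrix
W_in = rng.uniform(-1.0, 1.0, (N_x, 1 + N_u))   # input weights: bias column plus input column

def update_state(x, u, bias=1.0):
    """One step of Eq. 2.1 with Wfb and v(n) omitted."""
    pre = W @ x + W_in @ np.concatenate(([bias], np.atleast_1d(u)))
    return (1 - alpha) * x + alpha * np.tanh(pre)

# Drive the reservoir with a toy input series.
x = np.zeros(N_x)
for u in np.sin(np.linspace(0, 10, 200)):
    x = update_state(x, u)
```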
One of the most important parameters in determining reservoir behavior is the random weight matrix
W that describes neuron connections. The recurrent loops that occur due to these connections lead to
the echo state property of the network that gives the reservoir a finite “memory.” The echo state property
states that the reservoir retains a “memory” of previous states as the input data is stored in the recurrent loops of the reservoir connections. The echo state property can be ensured in practice if the spectral radius, denoted by ρ, is less than 1. The spectral radius is the maximum absolute eigenvalue of the weight matrix
W. However more recent researchers have determined that the echo state property occurs even for the
more relaxed condition when the maximum value of entry wi j in W is less than 1 [6]. Jaeger, more
recently, also noted that the echo state property can exist even when ρ > 1 but may not exist for all input
and never exists for the null input [22]. Increasing ρ may even increase network performance by
increasing memory of the input since the spectral radius defines how long input is remembered. He also
found that the echo state property might be defined with respect to the input u(n).
Since the reservoir exhibits memory, the initial n activations should not be used in training, which is
defined in the next section, due to any transients that may exist in the reservoir. Since ρ in most of the
studies in this project is close to unity there is a slow forgetting. Therefore, many initial states need to be
dismissed before actual predictions can take place. This makes this learning process input-intensive and
data-wasteful and other techniques have been developed that incorporate an auxiliary initiator network
that computes an appropriate starting state for the recurrent model network [13]. These techniques are
beyond the scope of my project because the data length is not so much a limitation in financial data.
2.4 Training and Output
Given all the reservoir activations resulting from the Nu input series, the goal is to take the output of this
reservoir y(n) to best approximate the target data ytarget(n). Ideally the goal is to minimize the difference
between the target data and the reservoir output. In this project, I use Mean Squared Error (MSE) defined
as
MSE = (1/n) ∑_{i=1}^{n} (y(i) − ytarget(i))². (2.2)
Because the reservoir output is a linear combination of the activations,
Y = WoutX, (2.3)
where Y and X are matrix representations of the reservoir output y(n) and reservoir states x(n) respec-
tively, minimizing MSE becomes a simple linear regression. After substituting the matrix form of Eq. 2.3 into Eq. 2.2, I find
MS E = (WoutX − Ytarget)2. (2.4)
Minimizing the MSE gives the solution to Wout as
Wout = Ytarget X^(−1). (2.5)
However, because the length of u(n) is often larger than the reservoir size, the system is overdetermined. Taking the inverse of an overdetermined matrix can give unstable solutions. Therefore, instead of an output matrix as
a simple function of the target and inverse of the activations, I implement Tikhonov regularization, also
known as ridge regression, in order to find a stable solution for Wout [17]. In Tikhonov regularization, a
regularization term is added to Eq. 2.2, resulting in
(Wout X − Ytarget)² + β(Wout)², (2.6)
where β is the regularization coefficient used to penalize larger norms. Minimizing Eq. 2.6, I obtain the
solution
Wout = Ytarget X^T (XX^T + βI)^(−1). (2.7)
Alternatively, as mentioned previously in section 2.2, the noise function v(n) is also used to stabilize
solutions in systems that are overdetermined. However the ridge regression method is a more compu-
tationally efficient solution that does not penalize the precision of reservoir predictions that noise may
affect.
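As a concrete sketch of this training step, the ridge-regression solution of Eq. 2.7 can be computed directly with NumPy. The matrix shapes and the toy check below are illustrative assumptions (states collected column-wise in X), not the exact code used in this study.

```python
import numpy as np

def train_output(X, Y_target, beta=1e-8):
    """Solve Wout = Ytarget X^T (X X^T + beta I)^(-1), Eq. 2.7.

    X        : (N_x, T) matrix of collected reservoir states
    Y_target : (N_y, T) matrix of target outputs
    """
    N_x = X.shape[0]
    # Solving the linear system is numerically safer than forming the inverse explicitly.
    return np.linalg.solve(X @ X.T + beta * np.eye(N_x), X @ Y_target.T).T

# Toy check: recover a known linear readout from random states.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2000))
W_true = rng.standard_normal((1, 100))
W_out = train_output(X, W_true @ X)
print(np.allclose(W_out, W_true, atol=1e-3))   # expected: True
```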
2.5 Reservoir Optimization
In determining the optimal reservoir, there are many adjustable global parameters that influence the
dynamical behavior. The goal in reservoir optimization is therefore to generate dynamical behavior that
is most similar to the system the reservoir is attempting to model. The reservoir dynamics, as seen in Eq.
2.1, is governed by W, Win, and α. These parameters are related to different quantities that characterize the network, such as the reservoir size Nx, spectral radius ρ, input scaling a, and leaking rate α. These global
parameters and their effect on the reservoir are the topic of this section.
The current standard practice involves intuitive manual adjustment of each of these parameters to
optimize the reservoir [17]. I will study parameter optimization in section 5.3.
Reservoir Size
The reservoir size Nx, or the number of neurons, dictates the model capacity of the network. The current
intuition states that in general, bigger reservoirs lead to better performance. The training method (Eq.
2.7) used for Echo State Networks is computationally efficient enough to generate large reservoir sizes
on the order of magnitude of 104. These have been found useful in automatic speech recognition [32].
A benchmark for the lower bound for a reservoir is given by Lukosevicius [17]: the reservoir size Nx
should be at least equal to the estimate of the independent real values that the reservoir needs to retain in
memory of the input. This is a result of the echo state property, described in section 2.3. The maximum
length memory of the reservoir is limited by its size.
Spectral Radius
The spectral radius ρ is one of the most central parameters in the echo state network because it plays
an important role in governing the echo state property of the reservoir. The radius ρ is defined as the
maximum absolute eigenvalue of the reservoir connection matrix W and it determines the length of the
reservoir memory. Larger spectral radius corresponds to a longer memory of the input history.
In this project, the spectral radius essentially scales the weight matrix W. After a random matrix W is generated, its spectral radius ρ(W) is calculated. Then W is divided by ρ(W) to normalize its spectral radius, so that ρ(W) becomes equal to 1. Finally, the connection matrix is multiplied by the selected spectral radius ρ such that ρ(W) = ρ. In other words, the connection weight matrix is first randomly generated and only afterwards is its spectral radius adjusted to the desired ρ value.
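In code, this rescaling amounts to computing the spectral radius of the random matrix and multiplying the matrix by the ratio of the desired value to the computed one; a minimal NumPy sketch (with an illustrative target radius) is:

```python
import numpy as np

def scale_spectral_radius(W, rho_desired):
    """Rescale W so that its spectral radius rho(W) equals rho_desired."""
    rho_W = np.max(np.abs(np.linalg.eigvals(W)))   # current spectral radius
    return W * (rho_desired / rho_W)

rng = np.random.default_rng(1)
W = rng.uniform(-0.5, 0.5, (300, 300))
W = scale_spectral_radius(W, 1.25)
print(np.max(np.abs(np.linalg.eigvals(W))))        # approximately 1.25
```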
In practice the condition, ρ < 1, ensures the echo state property in most applications [17]. However,
the theoretical limit established in Ref. [6] shows that the echo state property only exists for every input
under the tighter condition ρ < 1/2. In application, however, the echo state property often holds for even
ρ ≥ 1 given nonzero input. Typically large ρ values may push the reservoir into chaotic spontaneous
behavior which would violate the echo state property. This is shown by Jaeger [12] where the trivial zero
input will lead to linearly unstable solutions given ρ > 1.
Generally the spectral radius should be tuned to optimize reservoir performance with a starting bench-
mark value of ρ = 1 and exploring other ρ values close to 1. When modelling systems that depend on
more recent input history, ρ should be smaller compared to a larger ρ when a more extensive memory is
required.
Input Scaling
Input scaling is an important parameter as previously noted in section 2.3 because it also plays a role in
determining the echo state property [22]. The input scaling is typically determined by the input weight
matrix Win which is randomly sampled on a range [−a, a], where a is the scaling parameter. While
not necessarily required, the input scaling parameter determines the scale of the entire input series
by scaling all the columns of Win. This reduces the number of free parameters that need to be tuned
to optimize reservoir performance [17]. Theoretically it is possible to scale individual input units u(n).
Increasing the number of input scaling parameters could serve to expand components of the input u(n)
that favorably drive reservoir dynamics. However there are no algorithms that can easily scale individual
input components and exploration of this topic is beyond the scope of this project.
In this project, I determine the input scaling through Win, which is chosen randomly from the range
[−1, 1]. The input scaling parameter, a, is multiplied by the input data, u(n). This is analogous to
sampling on the range, [−a, a]. Win not only scales the input u(n) but it also scales the constant bias.
Input scaling determines the nonlinearity of the network dynamics. Lukosevicius [17] advises en-
suring that the inputs to the state update Eq. 2.1 are bounded. Because of the tanh function that defines
neuron activations, inputs very close to 0 will lead to linear behavior of neurons. Inputs close to 1 or -1
may cause binary switching behavior of neurons. Inputs of intermediate magnitude, between 0 and ±1, lead to nonlinear dynamical behavior of the neurons.
Leaking Rate
The leaking rate α approximates a discrete Euler integration of the state over time. The leaking rate
α, which is bounded between [0,1], determines the speed at which the input affects or leaks into the
reservoir. This discretization term comes from an Euler integration of the state equation in time
∆x/∆t = (x(n + 1) − x(n))/∆t ≈ −x(n) + f(n), (2.8)
so I can determine the solution for a new state x(n + 1)
x(n + 1) = (1 − ∆t)x(n) + ∆t f(n). (2.9)
In the reservoir the leaking rate α is the ∆t in the Euler integration of an ordinary differential equation.
This also functions as a form of exponential smoothing for the time series where previous states x(n) will
be weighted exponentially less over time because by substitution to the state update equation (Eq. 2.1) I
can see
x(n + 1) = (1 − α)²x(n − 1) + α(1 − α) f(x(n − 1)) + α f(x(n)). (2.10)
In general, the leaking rate is set to match the speed of the dynamics the reservoir is attempting
to model from ytarget. A recent study [12] has shown that leaking rate can also impact the short-term
memory of echo state networks. The study showed that in some reservoirs, given a small leaking rate,
α, the slow dynamics of the reservoir states, x(n), could increase the length of the short-term memory of
the reservoir. I study impact of the leaking rate along with other parameters more explicitly in section
5.3. Further studies of multiple time scales may be useful to determine components of the system that
may have multi-scale dynamics occurring on different timescales, but such studies are beyond the scope of this paper.
2.6 Summary
In this chapter, I outlined the specific components of the echo state network as well as specific parameters
that are useful in optimizing its performance. In the following chapter I apply this knowledge in practice
by studying the modelling capacity of the reservoir using a known time series.
Chapter 3
Echo State Network and Mackey-Glass
3.1 Mackey-Glass System
The Mackey-Glass equation is a nonlinear time delay differential equation. It was originally derived as
a model of chaotic dynamics in blood cell generation in a collaboration between Mackey and Glass at
McGill University [20]. The equation expands on a simple feedback system whose dynamics are given
by
dx/dt = λ − γx, (3.1)
which has a stable equilibrium at λ/γ in the limit t → ∞ given that γ is positive. In this simple feedback system, the rate of change of the control variable, dx/dt, is influenced by the value of the control variable.
The system increases at a constant rate of λ and decreases at rate γx.
To better model real physiological systems, modifications to this simple feedback system include
introducing a time delay component. The form of the Mackey-Glass equation studied today is
dx/dt = βx_τ/(1 + x_τ^n) − γx. (3.2)
The equation can generate periodic behavior, bifurcations, and chaotic dynamics given specified
parameters β, γ, τ, and n. The derivatives of a time delay differential equation are dependent on solutions
at previous times. x_τ represents the state x at a time delayed by the constant τ, that is, x(t − τ). There may
be a significant time lag between determining the control variable x and responding with an updated
change. Additionally, in real physiological models, the parameters γ and λ are not necessarily constant
for all time t, and may vary with x(t − τ), also denoted x_τ. This is true for Eq. 3.2, which also exhibits period doubling bifurcations leading up to the chaotic regime, and has been extensively studied in relation
to chaos theory [7].
Because the system has been so well characterized by previous studies, the Mackey-Glass system is
often used to benchmark time series prediction studies. Previous studies by Jaeger [14] have shown that
ESNs improve upon the best previous techniques for modeling the Mackey-Glass chaotic systems [36]
by a factor of 700, which he attributes to the ESN’s ability to store and remember previous states defined
as the echo state property in section 2.3. In Jaeger’s article, he uses Mackey-Glass parameters β = 0.2,
n = 10, γ = 0.1, and τ = 17, which I reproduced in figure 3.1 using the dde23 solver [5] in Python. For
τ > 16.8, the system has a chaotic attractor. Since I use τ = 17 the system exhibits a chaotic attractor
which can be observed in the time delay embedding in figure 3.2. I use the same above parameters to
start to train my echo state networks.
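For readers who want to generate a comparable series, Eq. 3.2 can be approximated with a fixed-step Euler integration and a history buffer for the delayed state. This is only a rough sketch of one possible approach; the step size and constant initial history are assumptions for illustration, and the study itself used the dde23-style solver cited above.

```python
import numpy as np

def mackey_glass(T=3000, dt=0.1, beta=0.2, gamma=0.1, tau=17, n=10, x0=1.2):
    """Euler integration of dx/dt = beta*x_tau/(1 + x_tau^n) - gamma*x."""
    steps = int(T / dt)
    delay = int(tau / dt)
    x = np.empty(steps + delay)
    x[:delay + 1] = x0                   # constant history on [-tau, 0] (assumed)
    for t in range(delay, steps + delay - 1):
        x_tau = x[t - delay]             # delayed state x(t - tau)
        dxdt = beta * x_tau / (1.0 + x_tau**n) - gamma * x[t]
        x[t + 1] = x[t] + dt * dxdt
    return x[delay:]                     # drop the initial history segment

series = mackey_glass()
print(series[:5])
```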
Figure 3.1: Mackey-Glass system over time t = 3000 given β = 0.2, n = 10, γ = 0.1, and τ = 17.
I begin with reservoir training tasks using the Mackey-Glass system since it has been well studied in the context of reservoir computing. The rest of this chapter details the studies
of the constant bias global parameter discussed in section 2.5 and its impact on the optimization of the echo state network, implemented using source code made publicly available by Lukosevicius [18]. These studies will give insight into the basic routine used in training networks, which is applied in subsequent chapters to financial data sets.

Figure 3.2: Phase space diagram of the Mackey-Glass attractor plotted by time delay embedding given β = 0.2, n = 10, γ = 0.1, and τ = 17.
3.2 Constant Bias
In determining states of the activations (see Eq. 2.1), there is an additional bias input, randomly generated
within the Win matrix that is multiplied by a constant scaling parameter. In order to better understand
the impact of this bias input on the reservoir dynamics I use Mackey-Glass data and study the behavior
of the reservoir as a result of changing bias. I generate input Mackey-Glass data for training and testing,
construct an echo state network, and train the network on the output matrix to make predictions for the
next step. The rest of this section illustrates the impact of bias on the mean squared error. As I study
MSE, I examine the impact that the output weight matrix Wout may have on the error as it indirectly
impacts MSE by determining y(n).
Input preparation
Using Mackey-Glass parameters β = 0.2, n = 10, γ = 0.1, and τ = 17, I generate a time series as shown
in figure 3.1. This data is split up into the training data as well as the testing data. In this numerical study,
I use a training length of 2000 and a testing length of 1000. The experimental data is shifted between
-1 and 1 to fall within the nonlinear regime of the tanh function. There is no additional input scaling
factor needed to adjust the input in this case because the range of the series is less than 1. The training
data is weighted according to the randomly generated input weight matrix used in the reservoir to drive
the activation states in Eq. 2.1. The first 100 inputs are used to initialize the reservoir and are not used
to train the output. The rest of the training data is used as target data to train on as Ytarget in Eq. 2.7.
The testing length is used to compare with the reservoir output to determine the accuracy of reservoir
predictions as in Eq. 2.2.
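A minimal sketch of this preparation step, with the series lengths quoted above and an illustrative min-max shift into [-1, 1], might look as follows; the exact preprocessing code of the study is not reproduced here.

```python
import numpy as np

def prepare_input(series, train_len=2000, test_len=1000, washout=100):
    """Shift a series into [-1, 1] and split it into training and test segments."""
    s = np.asarray(series, dtype=float)
    s = 2.0 * (s - s.min()) / (s.max() - s.min()) - 1.0   # shift into [-1, 1]
    train = s[:train_len]
    test = s[train_len:train_len + test_len]
    # The first `washout` points only initialize the reservoir and are not trained on;
    # training targets are the inputs one step ahead.
    target = train[washout + 1:]
    return train, target, test
```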
Reservoir Construction
A reservoir was constructed as described in section 2.3. In order to study the constant bias, I varied the scaling parameter for the bias over the range 0 to 10 in 0.1 increments, examining the impact this may have on the reservoir activations by isolating the parameter. I hold all other parameters constant:
reservoir size = 1000, leaking rate = .3, spectral radius = 1.25, input scaling = 1. Using the first 2000
data points in the training sequence to drive the reservoir according to the reservoir update Eq. 2.1, the
states of all the reservoir nodes are collected in X. As seen in figure 3.3 which shows some reservoir
activations, the initial activations resulting from the first few random states of the network are highly variable in the first few steps, t < 50. Therefore the first 100 activations are ignored to remove any initial
transience of the random reservoir. Since reservoir connections are randomly generated, I created 10
reservoirs given each scaling parameter to test the impact of bias on MSE.
Figure 3.3: The activations of a few nodes for t < 50 are more variable than the stable activations for
t > 50.
Training and Prediction
Training the output feedback of the reservoir requires using the target data Ytarget and reservoir activa-
tions X in Eq. 2.7. The regularization coefficient is set to 10−8 for all the reservoir training throughout
this project in order to minimize the number of adjusted parameters. Once the output matrix is deter-
mined, the matrix prediction y(n) is calculated by the output weight matrix and the reservoir states. This
output y(n) is then used as the subsequent input to continue to drive the reservoir and make the following
prediction. This series of predictions is compared to the test data and the MSE, given by Eq. 2.2, is
determined.
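The free-running prediction loop described here can be sketched as below, reusing an update function and trained output matrix like those in the earlier sketches; the structure is an illustrative assumption rather than the study's exact code.

```python
import numpy as np

def free_run(update_state, W_out, x, u_last, n_steps):
    """Generate n_steps predictions, feeding each output back as the next input."""
    preds = []
    u = u_last
    for _ in range(n_steps):
        x = update_state(x, u)        # advance the reservoir with the latest input
        y = float(W_out @ x)          # linear readout y(n) = Wout x(n)
        preds.append(y)
        u = y                         # the prediction becomes the next input
    return np.array(preds), x

def mse(y_pred, y_true):
    """Mean squared error, Eq. 2.2."""
    return float(np.mean((np.asarray(y_pred) - np.asarray(y_true)) ** 2))
```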
Results
After training a number of reservoirs over a range of scaling constants, I compare the prediction results
of the reservoirs to the test data. The following plot is an example of the actual data compared to the
reservoir predictions for one of the runs using a bias scaling constant of 1.
Figure 3.4: Reservoir Predictions after training compared to test data points.
To compare the results across the varying scaling constants, I calculate the MSE using Eq. 2.2 for
reservoir predictions over a length of 500 steps for each reservoir. The average MSE of the 10 reservoirs
for each scaling bias is shown in the figure below.
Figure 3.5: MSE across varying scaling constants for the bias given parameters reservoir size = 1000,
leaking rate = .3, spectral radius = 1.25, input scaling = 1.
As seen in figure 3.5, the smallest MSE occurred when the scaling constant was kept below 3.0 but
error increased when the scaling constant was zero or close to it. In the range from 0.8 to 2.2, the MSE
was on the order of magnitude of 10−5 or smaller. Typically in literature, this scaling constant is kept at
1 [17].
The purpose of the constant bias input as explained by Jaeger and in section 2.2 was to increase the
variability of the reservoir dynamics. In application, the bias may help the reservoir deal with offset or
non-centered data. The mean of the shifted input data is close to but not quite zero, at around -0.066.
While the bias input does seem to have an impact on MSE, there may be other indicators of large error in the echo state network model. Since Wout has a significant impact on the output y(n), I compare the different weight matrices as they affect MSE. Higher weights in the output matrix could possibly be a result of an unstable solution in Eq. 2.7. To study whether or not the output weight matrix would be
correlated to the MSE, figure 3.6 was produced to compare the mean of the output weight matrix to the
MSE.
Figure 3.6: MSE as a function of output weight matrix Wout
Figure 3.6 indicates no correlation between the mean of the weight matrix and the MSE. A correlation might have been expected because weight matrices with larger, more unstable values could indicate poor linear fits for Wout from Eq. 2.7. However, because Tikhonov regularization was intended to penalize large, unstable matrix solutions, these solutions may not manifest in the final output weight matrix. This already
acts as a threshold for any solution derived from Eq. 2.7 to dismiss highly unstable solutions. Other
studies of the output matrix could observe other statistical features. In the following section we study the
impact of a few other reservoir parameters.
3.3 Leaking Rate, Spectral Radius, and Reservoir Size
There are many other important parameters that can be optimized in the network as discussed in section
5.3. In this section we study the interrelated impact of multiple parameters. One of the flaws of opti-
mizing each parameter individually is that the MSE may have a local minimum as a function of multiple
parameters. In order to better optimize across multiple parameters, I study more advanced optimization
techniques.
Gradient Descent
Gradient descent is a method commonly used to minimize error functions. In gradient descent optimiza-
tion, also known as method of steepest descent, an iterative approach is taken to converge to a local
minimum. From an initial starting vector of parameters, the gradient for MSE is calculated across the
multiple parameters. The direction in which MSE will fall the fastest is the negative gradient at that
initial starting set of parameters. These parameters are then adjusted in the direction of steepest descent
so that MSE is reduced and the gradient is calculated again. This process is repeated to minimize the
MSE until the gradient converges to 0, at which point there is a local minimum. While gradient descent
algorithms are useful in finding the minimum of smooth convex functions, there is not much knowledge
about the error surface in reservoir computing. The error surface could be discontinuous and contain
multiple local minima where the gradient descent algorithm could fall and miss the absolute minimum.
In order to get a better sense of the kind of error surface that exists in reservoir optimization, I conduct a
grid search detailed in the next subsection.
Grid Search
The basic grid search explores the MSE outcome for multiple reservoirs across several ranges of global
parameters. Grid searches are a computationally expensive brute force method to finding optimal param-
eters since multiple calculations need to be made at each point along the grid. However, for the purposes
of this study, it provides the ability to visualize the error surface over many reservoir parameters.
In this grid search, I explore the error surface as a function of reservoir size, leaking rate and spectral
radius. Data processing follows the same procedures as described in section 3.2. However, instead of
simply adjusting one parameter, I am able to tune three different parameters. I study the slices of the
error surface at different reservoir sizes Nx = 100, 400, 700, and 1000. This surface plot shows the MSE
at different values of leaking rate 0.05 ≤ α ≤ 0.3 and spectral radius 0.8 ≤ ρ ≤ 1.2. At each point on the
grid defined by the global parameters, 10 different randomly connected reservoirs are created. Because
of the computational costs associated with grid search I take large step sizes in all of my parameters to
minimize the number of searches that need to be run.
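Schematically, the grid search loops over the chosen parameter values and averages the MSE of several randomly generated reservoirs at each grid point. In the sketch below, evaluate_mse is a hypothetical placeholder for the full train-and-predict routine of section 3.2, and the parameter ranges mirror those listed above.

```python
import itertools
import numpy as np

def grid_search(evaluate_mse, sizes, leak_rates, spectral_radii, n_trials=10):
    """Average MSE of n_trials random reservoirs at every (N_x, alpha, rho) grid point.

    evaluate_mse(N_x, alpha, rho, seed) is a placeholder supplied by the caller that
    builds, trains, and tests one reservoir and returns its MSE.
    """
    results = {}
    for N_x, alpha, rho in itertools.product(sizes, leak_rates, spectral_radii):
        errors = [evaluate_mse(N_x, alpha, rho, seed) for seed in range(n_trials)]
        results[(N_x, alpha, rho)] = float(np.mean(errors))
    return results

# Grid matching the ranges explored in this chapter.
sizes = [100, 400, 700, 1000]
leak_rates = np.arange(0.05, 0.31, 0.05).round(2)
spectral_radii = np.arange(0.8, 1.21, 0.1).round(2)
```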
Results
While I am not able to conduct an exhaustive grid search, the purpose of the grid search is to better
understand the error surface across reservoir parameters. The surface plots in figures 3.7, 3.8, 3.9, and
3.10 show the mean of the MSE outcome for the 10 reservoirs created at each point. Figure 3.7 excludes
values with leaking rate α < 0.2 because the MSE exploded beyond that point. I have already discussed
in section 2.5 that smaller reservoir sizes do not have the capacity for large memory of input. Therefore
it makes sense that many parameter values given in the grid search returned extremely high MSE values.
The MSE surface also tends to be lower for the reservoir Nx = 1000 compared to the other three plots.
Studying these surface plots across multiple reservoir sizes, the MSE seems to be minimized for higher
leaking rates α > 0.2, which correspond to the values of α I have been using in my studies of Mackey-Glass.
Figure 3.7: MSE surface for Nx = 100 reservoirs across α and ρ
Figure 3.8: MSE surface for Nx = 400 reservoirs across α and ρ
Figure 3.9: MSE surface for Nx = 700 reservoirs across α and ρ
Figure 3.10: MSE surface for Nx = 1000 reservoirs across α and ρ
3.4 Summary
In this chapter I examined the Mackey-Glass system and how the numerical solutions could be used
as input to the echo state network and as a demonstration of the network training process. I studied specifically the bias input to the reservoir and how the scaling constant for the bias input affected the reservoir
predictions and error. I determined a wide optimal range for the bias input. Studying the output weight
matrix, I determined there was no direct relationship between the mean of output weights and the predic-
tion ability of the reservoir. I was able to study the multivariate impact of reservoir parameters on MSE
in a grid search across multiple reservoir sizes, leaking rates, and spectral radii. The grid search suggests
using leaking rate values α > 0.2, and the leaking rate value used in the Mackey-Glass studies is α = 0.3,
which confirms I have been operating in an optimal range of reservoir parameters. As I move further,
these initial studies will guide my understanding of the reservoir predictions of financial data.
Chapter 4
Financial Forecasting and Neural
Networks
This chapter provides a cursory examination of the history of analysis in financial markets as it relates
to forecasting and nonlinear dynamics. In section 4.1, I describe the complex dynamics of the financial
input. As a result of the nonlinearity, neural networks have historically been used as forecasting models
with some success. The following section elaborates on some of the results that artificial neural networks have achieved in finance. Finally, in the last section, I discuss the specific echo state network approach
used in the project as a part of the Neural Forecasting Competition in 2007 in which different neural
networks were used to predict unknown financial data.
4.1 Nonlinear Characteristics
In finance, many qualitative and quantitative forecasting techniques have been applied to try to predict
fluctuations in share prices. However, according to the efficient market hypothesis, one cannot outper-
form the market to achieve returns in excess of the average market returns based on empirical informa-
tion. This is because the current market price reflects all known information about future value of the
stock and investors cannot purchase undervalued stock or sell at inflated rates. This hypothesis is highly
controversial and heavily debated. An entire body of study exists to produce methods and models that
hope to have substantial predictive abilities of stock prices.
Prediction is extremely complex and difficult as the stock market is the result of interactions of
many variables and players. Many previous techniques used various linear models to attempt to explain
the numerous complex interactions. Chaos theory offers an explanation of the underlying generating
process, suggesting the stock market may be a deterministic system. This section outlines the history of
chaos in the description of financial markets, the determination of chaos in a time series, and possible
causes for nonlinear behavior in the financial model.
As a result of the failure of linear models, models of complex systems were developed to explain
nonlinear processes in financial markets. The foundation for chaos theory in the realm of finance was
established by Mandelbrot in 1963 in a study of cotton spot price data [21] where he found price changes
did not fit a normal distribution. The theoretical justification for price changes fitting a Gaussian was
given by Osborne [27] using the central limit theorem. It states that, if transactions are a true random walk, they are randomly, independently, and identically distributed, and price changes should be normally distributed since the price change is the simple sum of these IID transactions.
Mandelbrot discovered that the price changes differed from a normal distribution [21]. Particularly
he found long tails in the distribution of price changes. There are more observations at the extreme ends
than a normal distribution would predict. Furthermore, his studies of cotton prices did not indicate finite variance. He expected the sample variance to converge as he increased the number of observed cotton price changes, as it would for a Gaussian distribution with identical variance for each observation, but it did not. He suggested instead that the distribution followed a stable Paretian distribution.
Fama confirmed a stable Paretian distribution was a better fit for price returns than a Gaussian in his
studies on thirty stocks in the Dow-Jones Industrial Index [4]. Other studies by Praetz in 1972 suggested
a scaled t-distribution as an alternative [28]. Later, in section 5.2, I fit the financial data to both a Gaussian and a Lorentzian distribution; the latter is a stable Paretian distribution with no skew and no defined variance, an important characteristic Mandelbrot observed in cotton prices.
Mandelbrot’s work established the basis of chaos theory in the world of finance. The rejection of the
random walk theory brought chaos and determinism into the study of economics [24].
Studies have found that there is evidence of nonlinearities in share price data by applying correlation
dimension analysis [25]. These studies sought to determine whether a time series is independently and
identically distributed as would be expected for a random walk. Studies using correlation dimension
have gone on to reject the Gaussian hypothesis offered by Osborne and argue for the existence of chaos in financial markets [10].
There have been many reasons offered that suggest chaotic dynamics within the stock market. Deter-
ministic processes could be the result of behavioral economics which dictate irrationality in investment
decisions [35]. Rather than making purely logical decisions based on the current market, the psychology
of fear and emotions plays a deterministic role in risk-taking and investment decisions. A Paretian distribution implies a Paretian market, which is inherently riskier because there are more abrupt changes, higher variability, and a greater probability of loss [4]. In general, the complex interactions of the many
variables and players in the stock market could behave nonlinearly. Many researchers claim there does
exist some degree of determinism in the market driven by high dimensional attractors. This debate on
chaos will be important as I study the use of neural networks in the financial domain.
4.2 History of Neural Networks in Finance
Throughout the years of research analysis on stock prices, studies have indicated a major roadblock is the
lack of mathematical and statistical techniques in the field [35]. Some of the benefits of neural networks
lie in the learning ability of networks to approximate functions. Given the vast amount of data available
in the financial world, neural networks have become invaluable tools in detecting complex processes and
modeling these relationships.
Networks have been introduced as models for time series forecasting as early as 1990, where most early studies focused on predicting the returns of indices. In the prediction of returns in the Tokyo
Stock Price Index (TOPIX) on data covering the period January 1985 to September 1989, Kimoto and
collaborators [16] compared the performance of modular neural networks trained through backpropagation
with multiple regression analysis, a traditional method of forecasting. The modular neural networks they
created were a series of independent feed-forward neural networks with three layers: an input layer,
a hidden layer of nodes, and an output layer. The output of these networks was combined to inform
buy or sell decisions for TOPIX. They were able to determine a higher correlation coefficient between
the target data and the system of networks (0.527) than for individual networks (0.414-0.458). In 1990,
Kamijo and Tanigawa [15] used a recurrent neural network approach to model candlestick charts, a chart type combining line charts and bar charts to illustrate the high, low, open, and close price for each day. They
were able to use backpropagation to train the RNN to recognize specific patterns in the price chart but
were unable to draw conclusions about the predictive abilities of the network.
Neural networks were used to predict the daily change in direction of the S&P 500 in both studies
conducted by Trippi and Desieno [33] and by Choi, Lee, and Lee [2]. Trippi and Desieno [33] developed
composite rules which are different ways to combine the trained output from multiple networks to deter-
mine Boolean (rise or fall) signals. Their results show that they are 99% confident their best composite
rule would outperform a randomly generated trading position. They estimated a potential annual return
of $60,000. Choi, Lee, and Lee [2] used neural networks to make rise or fall predictions in the stock
index. Using this method they were able to make higher annualized gains than previous methods.
To address limitations of backpropagation in dealing with noise and non-stationary data [30], other
research uses a hybrid method incorporating rules-based system from machine learning along with recur-
rent neural networks. The rules-based technique categorizes the stock data into cases based on empirical
rules derived from past observations. Studies by Tsaih, Hsu, and Lai, incorporating a hybrid method that uses a rules-based system to generate input data for neural networks, found better returns over a six-year period compared to a strategy where the stock was bought and held over the same period [34].
Studies are still working to continue to improve the training algorithms for neural networks. In the next
section I will study how the echo state network training algorithm, described in earlier chapters, can be implemented in handling financial data.
4.3 Neural Forecasting Competition
In spring of 2007, echo state networks outperformed many other neural network training algorithms in
the NN3 Artificial Neural Network and Computational Intelligence Forecasting Competition, where the objective was to forecast 111 monthly financial time series 18 months ahead. The submission using an echo
state network approach by Ilies, Jaeger, Kosuchinas, Rincon, Sakenas, and Vaskevicius was ranked first
in forecasting the competition time series [26].
The team used the same recurrent neural network architecture described in chapter 2 to train blocks of the 111 competition time series with high levels of success [11]. Based on their report, the 111 time series were divided into 6 temporal blocks. Each of these blocks was preprocessed using seasonal decomposi-
tion methods before being used to train a collective of 500 echo state networks. The reservoir parameters
were manually manipulated using part of the time series as a validation set [11]. The promising results
from the competition give rise to many more questions about the application of echo state networks in
finance, which I will address in the next chapter.
Chapter 5
Network Predictions
In this chapter, I apply all I have learned about echo state networks and finance to test the ability of the
echo state network to handle financial data. In the first section, I examine the data set I will be using,
the daily closing price of the S&P500 (GPSC) as well as the data processing measures implemented in
the project. In section 5.2 I examine how I benchmark the reservoir performance against a random
guess drawn from a distribution which accurately models the input data u(n). In order to optimize
parameters simultaneously, I conduct a series of grid search in section 5.3. I present the results of the
reservoir prediction in section 5.5. Then I study other methods which may improve prediction such as
the committee method in section 5.6.
5.1 S&P 500 Data
The data that I will be using as both input data and target data is the S&P 500, which is an American stock index of 500 large corporations listed on U.S. stock exchanges. Specifically, I used the daily
close price before dividend adjustments of each day retrieved from Yahoo Finance [38]. The time period
of data used ranges from March 2007 to March 2015. The figure below shows the raw input data from
Yahoo Finance.
Figure 5.1: S&P 500 daily close prices from March 2007 to March 2015.
5.1.1 Data Processing
In order to use stock prices as input, I need to apply data processing techniques that would allow the
reservoir to run efficiently. Raw input could lead to unstable reservoir dynamics because the raw input
data could be in a range that does not produce meaningful reservoir activations. Important considerations
in processing data include converting the non-stationary financial time series to a stationary set. Once
the data set is detrended, I also need to process the data to ensure the scale is within the correct range for
reservoir dynamics.
Using a stationary representation of the financial data is important in ensuring proper reservoir per-
formance [31]. Stationary processes are defined as having a distribution that does not change over time; in other words, the mean and the variance of the data remain constant over time.
There are several methods for detrending considered for this project including simple differencing,
logarithmic differencing, and relative differencing. For the original n-length time series y(t) = y1, y2...yn,
the simple difference is
ydiff(t) = y2 − y1, y3 − y2, ..., yn − yn−1. (5.1)
Another method to detrend the data is logarithmic differencing
ylog(t) = log(y2/y1), log(y3/y2), ..., log(yn/yn−1). (5.2)
The last method similar to the simple difference is the relative difference
yrel(t) = (y2 − y1)/y1, (y3 − y2)/y2, ..., (yn − yn−1)/yn−1. (5.3)
I chose to implement a relative difference between data points as it indicates the change in stock price
which has explicit implications on stock returns. The sign of the relative difference gives the direction of
price trajectory. The following is a plot of the detrended data used in the project.
Figure 5.2: Relative difference of S&P 500 data from figure 5.1.
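As a concrete illustration, the three detrending transforms of Eqs. 5.1-5.3 can be written as follows, assuming a NumPy array of daily closing prices; the relative difference is the form used as reservoir input in this project.

```python
import numpy as np

def simple_diff(y):
    """Eq. 5.1: y(t+1) - y(t)."""
    return np.diff(y)

def log_diff(y):
    """Eq. 5.2: log(y(t+1) / y(t))."""
    return np.diff(np.log(y))

def relative_diff(y):
    """Eq. 5.3: (y(t+1) - y(t)) / y(t), the daily relative price change."""
    y = np.asarray(y, dtype=float)
    return np.diff(y) / y[:-1]

# Toy usage on a short price series.
prices = np.array([100.0, 101.5, 99.8, 100.2])
print(relative_diff(prices))
```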
Another important consideration in data processing is in the scaling of the input data. As discussed
in section 2.5 the input should fall within the nonlinear range of the tanh function. Since many of the
relative differences were below 10%, an input scaling factor between 1 and 50 was considered. Figure 5.3 shows the MSE, averaged over 10 reservoirs, for scaling factors between 1 and 50. This is the result for a reservoir with spectral radius ρ = 0.8, Nx = 500 neurons, and leaking rate α = 0.1. This plot shows
input scale that minimizes MSE to be 1, which implies that the unscaled data may be a valid input to the
reservoir. The range of input scales which causes a peak in MSE should be avoided. The shape of the
MSE curve is very interesting but further studies are still needed to understand the dynamics underlying
the bell shaped MSE curve.
Figure 5.3: MSE of the reservoir as a function of the input scaling factor between 1 and 50, for a reservoir
with spectral radius ρ = 1.1, reservoir size of 500 neurons, and leaking rate α = 0.3.
5.2 Benchmarking
In order to test the reservoir performance, I compare the reservoir output to a random draw from a
distribution. As discussed in section 4.1, according to random walk theory, price changes are independent,
identically distributed variables and should therefore follow a normal distribution. I thus fit a histogram
of S&P 500 price changes to a Gaussian in figure 5.4, where the Gaussian distribution is defined by its
mean µ and standard deviation σ as
G(x, µ, σ) = [1/(σ√(2π))] exp[−(x − µ)²/(2σ²)].   (5.4)
The fitted parameters µ and σ are listed in table 5.1; the reduced chi-square value of the fit is 7.377.
µ  −1.4936e−05 ± 0.000216
σ  0.00756650 ± 0.000176
Table 5.1: Gaussian fit parameters for the distribution of S&P 500 data.
In figure 5.4, the Gaussian does not fit the distribution very well; the reduced chi-square is much
larger than 1, which means the distribution is not fully capturing the data. There are too many extreme
values in the long tails of the distribution for the Gaussian to be a good fit. Furthermore, because the
Gaussian drops off exponentially away from the mean, its tails are too steep to capture the extreme price
changes. In table 5.1, the fitted µ has very high relative uncertainty, suggesting that the center of the
distribution is variable. The Gaussian is not a good fit for the distribution of price changes, which leads
me to reject the Gaussian distribution as a way to model the input data.
Figure 5.4: Gaussian fit to histogram of relative price change in S&P 500.
In order to find a distribution with a better fit, I attempted fits using the Lorentzian distribution, also
known as the Cauchy distribution, and the Voigt profile, which is a convolution of the normal and
Lorentzian distributions. Both of these distributions have thicker tails than the Gaussian and are therefore
better able to capture the larger price changes in the distribution. The Lorentzian distribution is a special
case of the stable Paretian distribution discussed in section 4.1 with no skewness. The probability density
function of the Lorentzian is defined by the location x0 and width γ as
L(x, x0, γ) = 1/{πγ[1 + ((x − x0)/γ)²]}.   (5.5)
Figure 5.5 shows the fit to a Lorentzian with reduced chi-square = 1.518. The parameters for
the distribution are listed in table 5.2. The goodness of fit for the Lorentzian distribution is very close
to a reduced chi-square value of 1, which indicates a good fit between the data and the distribution. The
thicker tails of the Lorentzian distribution are better able to capture the larger price changes that occur
in the data. The uncertainty on the center location x0 is relatively high compared to the center location
value. However, since there was also trouble fitting µ in the Gaussian model, I assume that the center of
the historical price change distribution is variable and difficult to fit.
Figure 5.5: Lorentzian fit to histogram of relative price change in S&P 500.
x0  −4.7138e−06 ± 7.81e−05
γ  0.00537502 ± 7.81e−05
Table 5.2: Lorentzian fit parameters for the distribution of S&P 500 data.
Another view of the Lorentzian fit, focused on the tails, is shown in figure 5.6. This figure shows that
the tail of the Lorentzian distribution fits the data well. However, the Lorentzian does tend to have slightly
thicker tails than the data, so it would predict somewhat larger price fluctuations than actually occur.
Figure 5.6: A closer view of the tail of the relative price distribution and the Lorentzian fit shows fewer
large price fluctuations in the data than the Lorentzian would predict.
The next distribution that I attempt to fit to the data is the Voigt profile, a convolution of the Gaussian
distribution, given by Eq. 5.4, and the Lorentzian distribution, given by Eq. 5.5. The convolution is
defined by
V(x, σ, γ) = ∫ G(x′, σ) L(x − x′, γ) dx′, integrated over x′ from −∞ to ∞.   (5.6)
The Voigt profile fit in figure 5.7 has almost the same reduced chi-square value as the Lorentzian fit in
figure 5.5. The goodness of fit for the Voigt profile is 1.342, and the parameters for the Voigt profile are
shown in table 5.3. The Voigt profile has three parameters to fit, compared to two for both the Gaussian
and the Lorentzian. Even with this additional parameter, the goodness of fit is almost identical to the
Lorentzian fit. As with the Lorentzian distribution, the Voigt profile is better able to fit the thicker tail of
the price change distribution, where the large fluctuations lie. Similar to the previous two models,
the fitted parameter for the center x0 is uncertain in the Voigt model. The σ value is also quite uncertain;
the uncertainty of σ in the fit is 13.57% of the σ value.
Figure 5.7: Voigt profile fit to the histogram of relative price changes in the S&P 500.
x0  2.8723e−05 ± 7.70e−05
γ  0.00480562 ± 0.000166
σ  0.00211020 ± 0.000286
Table 5.3: Voigt profile fit parameters for the distribution of S&P 500 data.
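The sketch below indicates how fits of this kind might be produced with SciPy: the relative price changes are binned into a normalized histogram and each candidate density is fitted to the bin heights with curve_fit. The file name, bin count, and initial guesses are assumptions, not the project's actual settings; the Voigt profile is evaluated through the Faddeeva function wofz.

import numpy as np
from scipy.optimize import curve_fit
from scipy.special import wofz

def gaussian(x, mu, sigma):
    # Eq. 5.4
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def lorentzian(x, x0, gamma):
    # Eq. 5.5
    return 1.0 / (np.pi * gamma * (1 + ((x - x0) / gamma) ** 2))

def voigt(x, x0, sigma, gamma):
    # Eq. 5.6, evaluated with the Faddeeva function w(z)
    z = ((x - x0) + 1j * gamma) / (sigma * np.sqrt(2))
    return np.real(wofz(z)) / (sigma * np.sqrt(2 * np.pi))

# u: array of relative daily price changes (hypothetical file name)
u = np.loadtxt("sp500_relative_changes.txt")
density, edges = np.histogram(u, bins=100, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])

g_par, _ = curve_fit(gaussian, centers, density, p0=[0.0, u.std()])
l_par, _ = curve_fit(lorentzian, centers, density, p0=[0.0, 0.005])
v_par, _ = curve_fit(voigt, centers, density, p0=[0.0, 0.002, 0.005])
print("Gaussian (mu, sigma):", g_par)
print("Lorentzian (x0, gamma):", l_par)
print("Voigt (x0, sigma, gamma):", v_par)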
Ultimately, I use values drawn randomly from the Lorentzian distribution that best approximates the
input to measure the performance of the reservoir. The reduced chi-square for the Lorentzian distribution
is very similar to the goodness of fit for the Voigt profile, and implementing a random selection from the
Lorentzian distribution is more straightforward. A selection from this distribution simulates a random
guess for the price change each day, based on the distribution of all previous days. The random guess
establishes the baseline for a randomly determined forecasting model. Comparing this to an echo state
network, I can measure how well the trained network predicts price changes by studying the MSE of each
method.
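Such a benchmark could be generated with scipy.stats.cauchy, which implements the Lorentzian (Cauchy) distribution; the sketch below uses the fitted parameters from table 5.2 and a fixed seed chosen only for reproducibility.

import numpy as np
from scipy.stats import cauchy

# Fitted Lorentzian parameters from table 5.2 (location x0 and scale gamma).
x0, gamma = -4.7138e-06, 0.00537502

# 500 one-day-ahead "predictions" drawn blindly from the fitted distribution,
# one for each day in the test period.
random_guess = cauchy.rvs(loc=x0, scale=gamma, size=500, random_state=0)
print(np.mean(random_guess), np.median(random_guess))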
5.3 Parameter Optimization
In order to make the best reservoir predictions possible, there are a large number of global parameters
that can be tuned; I discussed these parameters in section 2.5. Initially, I optimized these parameters
individually to find the parameter values that reduce MSE, which is standard practice [17]. However,
many of these parameters may affect each other, so the best optimization strategy would determine the
parameters simultaneously. I considered a gradient descent algorithm that would calculate the gradient of
the MSE with respect to the different parameters and move in the direction of steepest descent to find a
local minimum. However, this method may not be effective where there are many local minima or where
the surface is discontinuous. Therefore, to better understand the landscape of the MSE as a function of
multiple parameters, I conduct a coarse grid search across the leaking rate, spectral radius, and reservoir
size, as sketched below. For each set of parameters in the grid search, I create 10 reservoirs and average
the MSE across them. Figures 5.8, 5.9, 5.10, and 5.11 show the surface of the MSE as a function of
spectral radius and leaking rate for reservoir sizes of 100, 400, 700, and 1000, respectively. These figures
highlight that, while the MSE does not vary greatly over much of the parameter space, certain choices of
parameters cause the reservoir to fluctuate strongly and return output values with extremely high errors.
Compared to the grid search in section 3.3, where α ≥ 0.3 gave the best MSE results, the MSE for
financial input data is minimized for leaking rate α ≤ 0.1 and spectral radius ρ ≤ 1.
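The grid search can be organized as in the sketch below: for every (ρ, α, Nx) triple, ten independently initialized reservoirs are evaluated and their MSEs averaged. The grid values shown are illustrative, and esn_mse is a stand-in for a routine that builds, trains, and scores one reservoir (a full reservoir sketch appears in section 5.4).

import numpy as np

def esn_mse(rho, alpha, n_x, seed):
    # Stand-in: build one reservoir with spectral radius rho, leaking rate alpha,
    # and n_x neurons, train it on the relative-difference series, and return its
    # test MSE. Replace with a real evaluation such as the walk-forward routine
    # sketched in section 5.4; here it only returns a placeholder value.
    return np.random.default_rng(seed).random()

spectral_radii = np.arange(0.1, 1.6, 0.1)
leaking_rates = np.arange(0.1, 1.1, 0.1)
reservoir_sizes = [100, 400, 700, 1000]

for n_x in reservoir_sizes:
    mse = np.zeros((len(spectral_radii), len(leaking_rates)))
    for i, rho in enumerate(spectral_radii):
        for j, alpha in enumerate(leaking_rates):
            # Average over 10 reservoirs with different random initializations.
            mse[i, j] = np.mean([esn_mse(rho, alpha, n_x, seed) for seed in range(10)])
    i_best, j_best = np.unravel_index(np.argmin(mse), mse.shape)
    print(f"Nx={n_x}: best rho={spectral_radii[i_best]:.1f}, "
          f"alpha={leaking_rates[j_best]:.1f}, MSE={mse[i_best, j_best]:.4g}")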
There are many differences between the Mackey-Glass data and the stock data that could affect the
optimal parameters for reservoir output. In the case of Mackey-Glass, the solution to Eq. 3.2 was taken
with very small time steps of ∆t = 0.1. This makes the Mackey-Glass input much smoother than the
financial data series, which, from figure 5.2, experiences many shocks. The smaller leaking rate α could
reduce the impact of these sudden shocks by integrating the input updates slowly over time. The smaller
optimal spectral radius implies that a shorter memory is required for the system; financial markets may
be less dependent on price changes in the distant past than on more recent price shifts. However, despite
the differences between the effect of spectral radius and leaking rate on Mackey-Glass input and financial
data input, the reservoir size functions similarly in both: increasing the reservoir size increases the stability
of the MSE. The slope of the MSE surface for reservoir size Nx = 1000 is lower than the surface for
Nx = 100 for both Mackey-Glass input and S&P 500 price change input. In preparing the reservoir model
to forecast S&P 500 price changes, I will maintain the optimal range of parameters, where α ≤ 0.1, ρ ≤ 1,
and the reservoir size Nx is as large as computationally feasible.
Figure 5.8: MSE surface across spectral radius and leaking rate with reservoir size = 100.
Figure 5.9: MSE surface across spectral radius and leaking rate with reservoir size = 400.
Figure 5.10: MSE surface across spectral radius and leaking rate with reservoir size = 700.
Figure 5.11: MSE surface across spectral radius and leaking rate with reservoir size = 1000.
5.4 Reservoir
The reservoir used in forecasting the S&P 500 data is very similar to the reservoir system used in section
3.2. From figure 5.3, the optimal scaling factor is 1, which I set for all reservoirs forecasting the market
index. Based on section 5.3, I use reservoir size Nx = 1000, leaking rate α = 0.1, and spectral radius
ρ = 0.8. For the input, I use the first 1000 data points u0, u1, ..., u1000 in the training sequence, computed
as the relative difference in Eq. 5.3, to drive the reservoir. The first 20 activations are disregarded and
not used in training the network, since they only serve to initialize the reservoir. The weighted output of
the nodes in the reservoir is then trained on the target data, which comes from the relative difference of
the input S&P 500 data, ytarget(n) = u(n). This gives the trained output weight matrix Wout, which is
used to predict the price change one day ahead. Because the market exhibits such complex behavior, it is
hard to expect even a trained reservoir to make highly accurate predictions many days in advance. The
more interesting and practical application is the ability to predict where the market will be one day into
the future, which would give investors time to make a buy or sell decision. Following the output prediction
y(1001), the reservoir is retrained using the 1000 S&P 500 data points shifted forward by one day,
u1, u2, ..., u1001. The retrained reservoir then makes a prediction y(1002) for the following day. This
training process is repeated for 500 runs to produce 500 one-day-ahead predictions from the reservoir,
which are compared to 500 random samples from the distribution. The next section discusses the results
from this reservoir.
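A minimal sketch of one standard way this walk-forward scheme could be implemented is given below, using a leaky-integrator reservoir and a ridge-regression readout. The weight initialization ranges, the regularization coefficient beta, and the input file name are assumptions, and the readout here is trained to map the state reached after input u(n) onto u(n+1); the project's actual code may differ in these details. Retraining a 1000-neuron reservoir 500 times is also computationally heavy.

import numpy as np

def train_and_predict(u, n_x=1000, rho=0.8, alpha=0.1, beta=1e-6, washout=20, seed=0):
    # Drive a leaky-integrator reservoir with the window u[0..T-1], train the linear
    # readout by ridge regression to predict u one step ahead, and return the
    # prediction for the day after the window ends.
    rng = np.random.default_rng(seed)
    T = len(u)

    # Random input and reservoir weights; rescale W to the desired spectral radius.
    w_in = rng.uniform(-0.5, 0.5, size=(n_x, 2))        # bias + one input channel
    w = rng.uniform(-0.5, 0.5, size=(n_x, n_x))
    w *= rho / np.max(np.abs(np.linalg.eigvals(w)))

    # Collect reservoir states while driving with the training window.
    x = np.zeros(n_x)
    states, targets = [], []
    for n in range(T - 1):
        x = (1 - alpha) * x + alpha * np.tanh(w_in @ np.array([1.0, u[n]]) + w @ x)
        if n >= washout:                                 # discard initial transient states
            states.append(np.concatenate(([1.0, u[n]], x)))
            targets.append(u[n + 1])                     # one-day-ahead target
    X = np.array(states).T
    Y = np.array(targets)[None, :]

    # Ridge regression for the readout weights Wout.
    w_out = Y @ X.T @ np.linalg.inv(X @ X.T + beta * np.eye(X.shape[0]))

    # One more update with the last input, then read out tomorrow's prediction.
    x = (1 - alpha) * x + alpha * np.tanh(w_in @ np.array([1.0, u[-1]]) + w @ x)
    return (w_out @ np.concatenate(([1.0, u[-1]], x))).item()

# Walk-forward forecasting: retrain on a 1000-point window shifted one day at a time
# and collect 500 single-day predictions. The series u must hold at least 1500 points.
u = np.loadtxt("sp500_relative_changes.txt")             # hypothetical file name
predictions = [train_and_predict(u[k:k + 1000]) for k in range(500)]
targets = u[1000:1500]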
5.5 Results
Using a random prediction drawn from the underlying distribution of the input data, I can measure the
incremental benefit of a trained reservoir and its predicted output. Figure 5.12 compares the 500 single-
day predictions of the reservoir y(n) to the target data ytarget(n), and figure 5.13 compares 500 random
selections from both the Gaussian and Lorentzian distributions to the same ytarget.
Figure 5.12: Target data ytarget(n) compared to the outcome y(n) from the reservoir.
Figure 5.13: Target data ytarget(n) compared to the predictions from the random guesses of a Lorentzian
distribution and Gaussian distribution.
The random selections from the Lorentzian distribution show very large fluctuations. However, plotting
the histogram of the 500 random selections against the distribution of relative price changes of the S&P
500 shows that the random selections do follow the same distribution as the input data (figure 5.14). In
contrast, the histogram of random selections from the Gaussian distribution does not match the distribution
of the input data in figure 5.15. The random selection from the Lorentzian is thus a good sample of random
values consistent with the distribution of the input, whereas the Gaussian does not approximate the input
distribution.
Figure 5.14: Histogram of the random selections from Lorentzian distribution corresponds with the
distribution of the input data.
Figure 5.15: Histogram of the random selections from the Gaussian distribution does not correspond with
the distribution of the input data.
To better understand whether the reservoir predictions are consistent with the target data, I calculate
the MSE at each output for the reservoir as well as for the random selections from both the Gaussian
and the Lorentzian distributions. Figure 5.16 shows the MSE over the 500 runs. All of the methods
have difficulty predicting large fluctuations in the data, which accounts for many of the apparently
correlated jumps in error. To directly compare the performance of the reservoir to the random selections,
figure 5.17 plots the difference in MSE between the random guess and the reservoir prediction. The bars
above zero mark the instances when the reservoir makes a better prediction than the random guess. Figure
5.17 shows that the reservoir does better than the random selection from the Lorentzian distribution in 274
out of 500 runs. Moreover, when the reservoir outperforms the random selection, it often does so by a
large margin: the reservoir outperforms the random selection by more than 0.001 in 36 runs, while the
random selection outperforms the reservoir by that margin only once.
Figure 5.16: MSE of the reservoir compared to random guessing from a Lorentzian/Cauchy and Gaus-
sian.
Figure 5.17: Difference in MSE between a random guess and a reservoir prediction.
Table 5.4 lists the mean and standard deviation of the MSE for the different methods of prediction.
A low average MSE is meaningful because it implies lower error and predictions closer to the target
values. A low variance means the model is less risky and predicts consistently close to the target value,
which is important for investors who need to understand and minimize the risk in their portfolios. In
table 5.4, the random selection from the Gaussian distribution has the lowest mean and standard deviation.
Since many of the random selections from the Gaussian fall near the center, and the distribution of input
price changes was most concentrated at the center, it makes sense that the Gaussian would have a low
error. The reservoir has nearly the same average MSE and standard deviation. The random selection
from the Lorentzian does the worst in terms of average MSE and standard deviation because large price
changes are predicted more often by the Lorentzian distribution, with its thicker tails.
Mean Std. Dev.
Gaussian 0.0001881 0.0003914
Lorentzian 0.00043667 0.0010615
Reservoir 0.0002125 0.0004747
Table 5.4: MSE values for benchmark and reservoir predictions
Another way to determine whether the output of the reservoir corresponds to the target output is
through the correlation coefficient. Rather than a measure of absolute error like the MSE, the correlation
coefficient takes into account direction as well as magnitude. Table 5.5 shows the correlation coefficient
between each random selection benchmark and the target data series, and between the reservoir output
and the target data series. The correlation coefficient is small for all of the models; the reservoir even has
a negative correlation with the target data set, which is discouraging, as it implies that the target output
and the reservoir output tend to move in opposite directions. However, the correlation coefficients for all
of the models are close to 0 and therefore imply essentially no correlation between the models' time series
and the target time series.
Correlation Coefficient
Gaussian -0.02380788
Lorentzian 0.01243823
Reservoir -0.05075724
Table 5.5: Correlation coefficient between each model's predictions and the target data.
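Given arrays holding the 500 predictions and the 500 targets, the per-day squared error, its mean and standard deviation (table 5.4), and the correlation coefficient (table 5.5) could be computed as in the sketch below; the placeholder arrays merely stand in for the series produced by the earlier sketches.

import numpy as np

rng = np.random.default_rng(1)
# Placeholders standing in for the 500 targets and the two prediction series
# produced earlier (reservoir output and Lorentzian random draws).
targets = rng.standard_cauchy(500) * 0.005
reservoir = rng.standard_cauchy(500) * 0.005
lorentzian = rng.standard_cauchy(500) * 0.005

def evaluate(pred, target):
    # Squared error per day, its mean and standard deviation, and the correlation
    # coefficient between the prediction and target series.
    se = (np.asarray(pred) - np.asarray(target)) ** 2
    return se.mean(), se.std(), np.corrcoef(pred, target)[0, 1]

for name, pred in [("Reservoir", reservoir), ("Lorentzian", lorentzian)]:
    mean_mse, std_mse, corr = evaluate(pred, targets)
    print(f"{name}: mean MSE={mean_mse:.3g}  std={std_mse:.3g}  corr={corr:+.3f}")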
While the reservoir output does not appear to correlate with the target output for stock data, further
studies may show that the reservoir is capable of learning some of the underlying dynamical behavior in
the financial time series. Although the echo state network is far from perfect, this project begins to show
how reservoirs can process financial data. In the last section of the project I focus on a common technique
used in many neural network applications, which combines the output from multiple reservoirs in order to
improve network predictions.
5.6 Committee
In the committee method, multiple reservoirs are trained on the same dataset and the combined output
of all the reservoirs, usually an average, is computed. Because the internal reservoir connections are
randomly generated each time, each reservoir has slightly different dynamics and performance. This
variance in reservoir dynamics may be useful in determining the output prediction that best fits the target.
There are other ways to combine output from multiple reservoirs: a ranking algorithm, which identifies
the reservoirs with the best performance, could weight those reservoirs more heavily while removing
outlier reservoirs that perform poorly. In this study, I focus on a simple committee formed by averaging
reservoirs and leave ranking algorithms for future studies.
In studying committees, I follow the same methodology for reservoir creation as in section 5.5, using
the same global parameters for the reservoir specification. I create a single reservoir given the 1000
previous data points, dismissing the first 20 states used to initialize the network. To create a committee,
I combine the output of 100 such reservoirs, each given the same specifications as the single reservoir.
The committee output is a simple average of the outputs of the 100 reservoirs within the committee. The
committee reservoirs each make one-day-ahead predictions as in section 5.5, and their predictions are then
combined to make the committee prediction
ycom(n) = (1/100) Σi yi(n).   (5.7)
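The committee average in Eq. 5.7 could be computed as in the short sketch below, which assumes a single-reservoir forecasting routine such as the hypothetical train_and_predict function from the sketch in section 5.4.

import numpy as np

def committee_prediction(window, predict_one, n_members=100):
    # Eq. 5.7: average the one-day-ahead forecasts of n_members independently
    # initialized reservoirs trained on the same input window. predict_one(window,
    # seed=...) should build, train, and query one reservoir, e.g. the
    # train_and_predict sketch from section 5.4.
    forecasts = [predict_one(window, seed=i) for i in range(n_members)]
    return float(np.mean(forecasts))

# Example usage with the section 5.4 sketch (window of 1000 relative price changes):
# y_committee = committee_prediction(u[k:k + 1000], train_and_predict)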
The predictions of the committee reservoir, compared to the random guess from the Lorentzian
distribution and the single reservoir, are shown in figure 5.18. The committee output from 100 reservoirs
is almost exactly the same as the output of the single reservoir. This demonstrates that there is no added
benefit to computing committees given the parameters and input I am using in this project. Moving
forward with the project, I reject the use of committees, since single reservoirs give very similar
performance at a much lower computational cost.
Figure 5.18: The predictions for the single reservoir, the committee, and the random guess from a
Lorentzian distribution.
Chapter 6
Conclusion
Studying the echo state network, I was able to understand the training process for the network with
regard to two different data sources: the Mackey-Glass system and financial systems.
6.1 Analysis of Results
The initial study of the Mackey-Glass system served to help me understand basic reservoir dynamics
and parameters. I studied the impact of the bias input, which has not been widely discussed in previous
literature, and concluded that the standard bias scaling constant of 1 used across most of the literature falls
within the range of values that reduce error in the reservoir. Studying the impact of the output weight
matrix on error, I saw no correlation between the mean of the output matrix and the error. This does not
conclusively imply that Wout has no effect on the MSE, but rather that the regularization coefficient
penalizes any output matrix with large weights, so that such solutions do not appear. Regardless, this is
important to note because it confirms that the training process removes output weight matrices that would
easily skew the MSE results. In optimizing the parameters to minimize the MSE, a grid search of the error
surface shows that error is minimized when the leaking rate α > 0.3. This error surface, however, differs
significantly across systems, as I saw in the second set of studies.
In the next study, I applied this knowledge about echo state networks to the S&P 500 daily closing
prices. In order to benchmark the results, I compared the distribution of price changes to a Gaussian, a
Lorentzian, and a Voigt profile. While many theorists and academics claim that price fluctuations follow
a random walk with independent, identically distributed steps, this was not the conclusion I reached in
this project: there were many more abrupt price changes in the tails than the Gaussian would predict.
Both the Lorentzian and the Voigt profile were better fits to the input data. Further studies could seek to
understand whether chaos determines the Lorentzian distribution of relative price changes and how it may
be useful in understanding determinism in finance. Studies of this nature across other stock indices would
be useful to bolster this research. A market described by a Lorentzian implies that stock prices may be
more variable and more risky than under the assumption of a market described by a Gaussian; price
fluctuations tend to be greater in both directions, implying both big profits and big losses.
In terms of forecasting the market, the reservoir was not able to significantly outperform random
selections from known distributions. In comparison to the random selection from the Lorentzian distribution,
the reservoir gave better results in a small majority of cases when comparing the MSE (figure 5.17).
Moreover, the reservoir made many predictions that beat the random model by a larger margin. Another
concern may be the quality of the Lorentzian fit to the distribution of the data. The difficulty in forecasting
may also lie in the financial time series itself: stock market data is incredibly complex, and any attempt at
modeling and forecasting the market will require much more sophisticated statistical techniques. However,
the study of the S&P 500 index has opened up many questions about the chaotic nature of stocks as well
as the specific capabilities of echo state networks and reservoir computing. In the grid search to optimize
reservoir parameters with S&P 500 data, the differences in the MSE surface plots imply distinct reservoir
behavior for the Mackey-Glass and financial systems. Further studies could examine how different
timescales and other features of stock data affect the network output.
The optimal range for the spectral radius has implications for understanding how much memory is
needed to characterize the dynamics of the system. Studying commonly used techniques to improve
network output, I found that committees of multiple reservoirs did not improve performance compared
to that of a single reservoir, and I concluded that the computational cost was not worth the marginal
difference in prediction. The relatively small effect on performance could be the result of the large
reservoir size: large reservoirs may have enough memory capacity to capture all of the system's dynamics.
Additional studies of committees would nevertheless be useful, particularly to examine other methods of
combining individual reservoir outputs. Further studies to calibrate the benchmark distribution would also
help create a more rigorous benchmark for the reservoir, and other studies could look at alternative ways
of benchmarking the data. Another interesting application would be to create a trading algorithm to test
and observe the market impact of the reservoir. It would also be useful to use the reservoir to make
Boolean predictions of simply whether the market will rise or fall, rather than the exact value of the
change; this would provide practical value to investors.
One of the key areas for future study involves finding ways to optimize the parameters of the reservoir.
In this study, the coarse grid search highlighted the often interconnected effects of the reservoir parameters
and therefore the need for more rigorous algorithms to select parameters.
6.2 Conclusion
In conclusion, the echo state network provides an interesting paradigm for understanding complex time
series. It shows promise for helping to understand complex systems, but considerable additional research
is needed before it becomes a viable option for forecasting financial markets.
Although the results from this project do not conclusively show that echo state networks have significant
forecasting ability for stock data, the project is an interesting exercise in learning about reservoir computing
and its application to complex systems.
Bibliography
[1] ER Caianiello. “Outline of a theory of thought-processes and thinking machines”. In: Journal of
Theoretical Biology 1.2 (1961), pp. 204–235.
[2] JH Choi, MK Lee, and MW Rhee. “Trading S&P 500 stock index futures using a neural network”.
In: Proceedings of the third annual international conference on artificial intelligence applications
on wall street. 1995, pp. 63–72.
[3] Kenji Doya. “Bifurcations in the learning of recurrent neural networks 3”. In: learning (RTRL) 3
(1992), p. 17.
[4] Eugene F Fama. “Mandelbrot and the stable Paretian hypothesis”. In: Journal of Business (1963),
pp. 420–429.
[5] V. Flunkert and E. Scholl. pydelay – a python tool for solving delay differential equations. arXiv:0911.1633 [nlin.CD].
2009.
[6] Mathieu Galtier and Gilles Wainrib. “A local Echo State Property through the largest Lyapunov
exponent”. In: arXiv preprint arXiv:1402.1619 (2014).
[7] L. Glass and M. Mackey. “Mackey-Glass equation”. In: Scholarpedia 5.3 (2010). revision 91447,
p. 6908.
[8] Donald Olding Hebb. The organization of behavior: A neuropsychological theory. Psychology
Press, 2005.
[9] John J Hopfield. “Neural networks and physical systems with emergent collective computational
abilities”. In: Proceedings of the national academy of sciences 79.8 (1982), pp. 2554–2558.
[10] David A. Hsieh. “Chaos and Nonlinear Dynamics: Application to Financial Markets”. In: The
Journal of Finance 46.5 (1991), pp. 1839–1877. issn: 00221082.
[11] Iulian Ilies et al. “Stepping forward through echoes of the past: forecasting with echo state net-
works”. In: URL: http://www.neural-forecasting-competition.com/downloads/methods/27-NN3 Herbert Jaeger report.pdf (2007).
[12] Herbert Jaeger. Long short-term memory in echo state networks: Details of a simulation study.
Tech. rep. Technical Report, 2012.
[13] Herbert Jaeger. “The echo state approach to analysing and training recurrent neural networks-
with an erratum note”. In: Bonn, Germany: German National Research Center for Information
Technology GMD Technical Report 148 (2001), p. 34.
[14] Herbert Jaeger and Harald Haas. “Harnessing nonlinearity: Predicting chaotic systems and saving
energy in wireless communication”. In: Science 304.5667 (2004), pp. 78–80.
[15] Ken-ichi Kamijo and Tetsuji Tanigawa. “Stock price pattern recognition-a recurrent neural net-
work approach”. In: Neural Networks, 1990., 1990 IJCNN International Joint Conference on.
IEEE. 1990, pp. 215–221.
[16] Takashi Kimoto et al. “Stock market prediction system with modular neural networks”. In: Neural
Networks, 1990., 1990 IJCNN International Joint Conference on. IEEE. 1990, pp. 1–6.
[17] Mantas Lukosevicius. “A practical guide to applying echo state networks”. In: Neural Networks:
Tricks of the Trade. Springer, 2012, pp. 659–686.
[18] Mantas Lukosevicius. Educational Echo State Network implementations. http://minds.jacobs-university.de/mantas/code.
[19] Wolfgang Maass, Thomas Natschlager, and Henry Markram. “Real-time computing without stable
states: A new framework for neural computation based on perturbations”. In: Neural computation
14.11 (2002), pp. 2531–2560.
[20] Michael C Mackey, Leon Glass, et al. “Oscillation and chaos in physiological control systems”.
In: Science 197.4300 (1977), pp. 287–289.
[21] Benoit B Mandelbrot. The variation of certain speculative prices. Springer, 1997.
[22] G. Manjunath and H. Jaeger. “Echo State Property Linked to an Input: Exploring a Fundamental
Characteristic of Recurrent Neural Networks”. In: Neural Comput. 25.3 (Mar. 2013), pp. 671–696.
issn: 0899-7667. doi: 10.1162/NECO_a_00411. url: http://dx.doi.org/10.1162/NECO_a_00411.
[23] Warren S McCulloch and Walter Pitts. “A logical calculus of the ideas immanent in nervous ac-
tivity”. In: The bulletin of mathematical biophysics 5.4 (1943), pp. 115–133.
[24] Philip Mirowski. “From Mandelbrot to Chaos in Economic Theory”. In: Southern Economic Jour-
nal 57.2 (Oct. 1990), p. 289. url: http://search.proquest.com/docview/212099581?
accountid=10598.
[25] Peter Mulieri. “Stock market price behavior: Random walks and nonlinear dynamics”. In: Journal
of Technical Analysis 37 (1991), pp. 28–34.
[26] NN3 Artificial Neural Network and Computational Intelligence Forecasting Challenge.
[27] M. F. M. Osborne. “Brownian Motion in the Stock Market”. In: Operations Research 7.2 (1959),
pp. 145–173. issn: 0030364X.
[28] Peter D. Praetz. “The Distribution of Share Price Changes”. In: The Journal of Business 45.1
(1972), pp. 49–55. issn: 00219398.
[29] Frank Rosenblatt. “The perceptron: a probabilistic model for information storage and organization
in the brain.” In: Psychological review 65.6 (1958), p. 386.
[30] Randall S Sexton, Robert E Dorsey, and John D Johnson. “Toward global optimization of neural
networks: a comparison of the genetic algorithm and backpropagation”. In: Decision Support
Systems 22.2 (1998), pp. 171–185.
[31] Konrad Stanek. “Reservoir computing in financial forecasting with committee methods”. MA the-
sis. Technical University of Denmark, 2011.
[32] Fabian Triefenbach et al. “Phoneme recognition with large hierarchical reservoirs”. In: Advances
in neural information processing systems. 2010, pp. 2307–2315.
[33] Robert R Trippi and Duane DeSieno. “Trading equity index futures with a neural network”. In:
The Journal of Portfolio Management 19.1 (1992), pp. 27–33.
[34] Ray Tsaih, Yenshan Hsu, and Charles C Lai. “Forecasting S&P 500 stock index futures with a
hybrid AI system”. In: Decision Support Systems 23.2 (1998), pp. 161–174.
[35] R.J. Van Eyden. The Application of Neural Networks in the Forecasting of Share Prices. Finance
& Technology Pub, 1996. isbn: 9780965133203. url: http://books.google.com/books?id=
wjXnAAAACAAJ.
[36] Juha Vesanto. “Using the SOM and local models in time-series prediction”. In: Proc. Workshop
on Self-Organizing Maps 1997. Citeseer. 1997, pp. 209–214.
[37] Paul Werbos. “Beyond regression: New tools for prediction and analysis in the behavioral sci-
ences”. In: (1974).
[38] Yahoo Finance.