April 9, 2015
Reservoir Computing in Forecasting Financial Markets
Jenny Su
Committee Members: Professor Daniel Gauthier, Adviser
Professor Kate Scholberg
Professor Joshua Socolar
Defense held on Wednesday, April 15, 2015 in Physics Building Room 298
Abstract
The ability of the echo state network to learn chaotic time series makes it an interesting tool for financial forecasting, where data are highly nonlinear and complex. In this study I initially examine the Mackey-Glass system to determine how different global parameters can optimize training in an echo state network. In order to simultaneously optimize multiple parameters I conduct a grid search to explore the mean squared error surface. In the grid search I find that error is relatively stable over certain ranges of the leaking rate and spectral radius. However, the ranges over which the Mackey-Glass system minimizes error do not correspond with an error surface minimum for financial data, as a result of intrinsic qualities such as step size and timescale of dynamics in the data. The study of chaos in financial time series data leads me to alternate understandings of the distribution of the relative stock price change over time. I find the Lorentzian distribution and the Voigt profile are good models for explaining the thick tails that characterize large returns and losses, which are not explained by the common Gaussian model. These distributions act as an untrained random model to benchmark the predictions of the echo state network trained on the historical price changes in the S&P 500. The global reservoir parameters, optimized in a grid search given financial input data, do not lead to significant predictive abilities. Committees of multiple reservoirs are shown to give forecasts similar to those of single reservoirs. Compared to a benchmark random sample from the fitted distribution of previous input, the echo state network is not able to make significantly better forecasts, suggesting the necessity of more sophisticated statistical techniques and the need to better understand chaotic dynamics in finance.
Contents
1 Introduction
1.1 Background
1.2 Approach
2 Network Concepts
2.1 Basic Concept
2.2 Input
2.3 The Reservoir and Echo State Property
2.4 Training and Output
2.5 Reservoir Optimization
2.6 Summary
3 Echo State Network and Mackey-Glass
3.1 Mackey-Glass System
3.2 Constant Bias
3.3 Leaking Rate, Spectral Radius, and Reservoir Size
3.4 Summary
4 Financial Forecasting and Neural Networks
4.1 Nonlinear Characteristics
4.2 History of Neural Networks in Finance
4.3 Neural Forecasting Competition
5 Network Predictions
5.1 S&P 500 Data
5.1.1 Data Processing
5.2 Benchmarking
5.3 Parameter Optimization
5.4 Reservoir
5.5 Results
5.6 Committee
6 Conclusion
6.1 Analysis of Results
6.2 Conclusion
Chapter 1
Introduction
1.1 Background
Artificial Neural Networks are trainable systems that have powerful learning capabilities with many
applications in forecasting and classification. Their learning process has many similarities to that of
the human brain because their design was inspired by research into biological nervous systems. Like
the central nervous system of animals, the artificial neural network is composed of many “neurons” or
artificial nodes, which are connected in a defined network. Both biological and artificial neural networks
send and receive feedback signals through their connections, which partially determine their expressed
state. Therefore these networks are constantly updating with information from external sources as well
as internally within the network from other neurons.
The development of artificial neural networks began in 1943 when McCulloch and Pitts [23] first
proposed a network in which the state of neurons was determined by a combination of all the signals
received from connected neurons in the network. In their model, simple artificial “neurons” i = 1,...,n
could only have binary neuron activations or states, ni = 0,1, where ni = 0 represents the resting state
and ni = 1 represents the activated state of the neuron. The binary state of each neuron i is dictated by a
linear combination of action potentials, ∑_j w_ij n_j(t), where w_ij is a matrix representation of the connection weights between neurons i and j, as seen in figure 1.1. If this value exceeds a certain threshold, then
neuron i becomes activated and transmits signal ni = 1.
Figure 1.1: Network model in which inputs to a neuron are summed as a linear combination that must exceed a certain threshold for the neuron to be activated.
Another important foundational step occurred in 1949 when Hebb published The Organization of Be-
havior in which he asserts that simultaneous activation of neurons leads to increased connection strengths
between those neurons [8]. This established the basis for Hebb's learning rule, which states that the synaptic connection between neurons is adaptive. The connection between neurons can be strengthened or increased as a result of repeated activations between the neurons. If neuron x1 activations consistently lead to neuron x2 activations, then the connection weight between them will increase. This idea was later incorporated in 1961 by Eduardo Caianiello into the learning algorithm used to determine the weight matrix w_ij connecting neurons [1], where the connection weight matrix has adaptive weights for neurons with similar activations: the connections between neurons which activate simultaneously are stronger.
All of these discoveries contributed to the development of the first simple feed-forward network
which Rosenblatt and his collaborators called a perceptron [29]. The perceptron consisted of two layers
of neurons: the input layer and the output layer. Signals travel from the input to the output in a
single direction as seen in figure 1.2. This first feed-forward network was used in a simple classification
problem between two classes. After each classification, if the model predicts the class incorrectly the
weights are adjusted and the network is run again until the weights converge to the correct values that
allow the model to predict the class correctly.
Figure 1.2: Single-layer perceptrons have adjustable weights which are trained to predict correct classi-
fication.
Years later in 1982, Hopfield published the basis for a recurrent network where neurons can be up-
dated sequentially based on information stored within the network [9]. As opposed to a feedforward
network, recurrent neural networks have neurons whose connections are multidirectional as shown in
figure 1.3. This allows the network to display dynamical behavior based on internal memory of signals
throughout the network. In Hopfield’s first network, there was no self-connection, the neuron did not re-
ceive input from itself, and the connections between neurons were symmetric so that w_ij = w_ji. Hopfield
showed that networks with these characteristics which iteratively adjusted the state of each neuron would
reach a local minimum and evolve to a final state.
Figure 1.3: Hopfield network neuron connections are multidirectional with symmetric weights.
Within the field of networks, many different learning rules that have been developed are classified
as supervised learning, where the network is adjusted by comparing the network output with the desired
output. One of the most common learning rules, error backpropagation, was first proposed in 1974
by Werbos [37]. The learning algorithm makes small iterative adjustments to network connections to
minimize the difference between the target and network output. Backpropagation has been especially
successful in feed forward networks but the method is only partially successful with recurrent neural
networks. Output of recurrent neural networks can bifurcate unlike the smooth continuous output of
feed-forward networks. This bifurcation can lead to discontinuous error surfaces [3].
An alternative to backpropagation was introduced with echo state networks (ESNs) and liquid state
machines [13, 19]. These two training models developed independently within the context of machine
learning and computational neuroscience respectively share the same basic idea of a randomly connected
“reservoir” of neurons with a trained output feedback. This developed into the current research field of
Reservoir Computing. The specific network concepts defined by the echo state network approach I will be using in this project are explained in detail in chapter 2.
Reservoir computing is useful in tasks such as function approximation, signal processing, classification, and data processing. It has been applied to system identification, natural language
processing, medical diagnosis, as well as e-mail spam filters. In this study, I am interested in the applica-
tion of these network models to the financial sector, which historically has great incentives to determine
models that are capable of predicting and forecasting changes in a complex, widely fluctuating market.
1.2 Approach
The goal of the project is to study the echo state network and its modeling capacity in forecasting financial data series. Within the scope of this project I first forecast a known dynamical system, the Mackey-Glass system, to better understand reservoir behavior given specified global parameters. Second, I apply this knowledge of ESNs to financial data to determine whether ESNs may provide a useful prediction
of the stock market trajectories.
The paper is organized as follows. The second chapter discusses network concepts and the mathemat-
ical description of the Echo State network. In order to determine best practices, in chapter three I study
the network using the Mackey-Glass system, a nonlinear delay differential equation that is commonly
used to test modelling of complex systems. The ESN can be trained using Mackey-Glass on varying
global parameters of the reservoir. These global parameters are tuned to optimize network performance.
The fourth chapter introduces the dynamics of the stock market and how previous forecasting measures
have dealt with these widely fluctuating time series. In the fifth chapter I apply the echo state network
approach to financial data. I study the impact of global parameters and data processing techniques. I
benchmark the reservoir predictions using random samples from distributions that best fit the historical
input data. I also study the impact of committee methods combining output from multiple reservoirs in
comparison to single reservoir predictions as well as random sample. The sixth and final chapter con-
cludes with a comparative discussion of the echo state network approach in financial forecasting as well
as limitations of the model and further studies.
Chapter 2
Network Concepts
2.1 Basic Concept
There are many supervised learning algorithms for training recurrent neural networks. In the echo state
network (ESN) approach used in this project, only the output weights from the reservoir are trained.
The connections of the neurons within the reservoir are randomly generated at the outset as in figure
2.1. Training all network connections is unnecessary, which makes this faster than previous learning
algorithms in which all connections are trained. Because the network has recurrent loops, it maintains a
memory of the past input and the output consists of “echoes” of the initial input time series.
Figure 2.1: The echo state network approach has a combination of trained and untrained connections between
neurons or nodes.
In implementing the echo state network, an input teacher data series u(n) is used to train a reservoir of
size Nx with neuron connections determined by a randomly generated matrix W ∈ R^(Nx×Nx). The output
node of the reservoir gives a readout of a linear combination of all or a portion of the neuron activations
Wout x(n). This matrix Wout is computed so that the output y(n) corresponds as closely as possible to
a defined target data series ytarget(n). This last step is the training portion of the learning algorithm.
Once the best weight matrix Wout is determined, new input data u(n) can be used in the reservoir to
generate output or reservoir predictions y(n) beyond the target data. The rest of this chapter describes the
mathematical properties of the different components involved in the ESN approach and also introduces
the tunable global parameters to optimize this learning architecture, as first developed by Jaeger in 2001
[13].
2.2 Input
The input data to the reservoir serves as the driving mechanism for the reservoir. Not only does the
reservoir exhibit nonlinear dynamics as a result of the input, but the reservoir will also retain a “memory”
of previous input. This ability to remember, which will later be defined as the echo state property in
section 2.3, occurs as a result of the recurrence of the network; the nodes form recurrent cycles through
their connections in W.
Typically, the teacher input series, u(n), consists of Nu series of data points at discrete time steps
(n, n + 1, n + 2, ...). In my studies, part of the input teacher series is used as the target data ytarget(n) that
the output is trained on. There are no limitations to the starting point of the data series, and therefore
shifting the initial starting input does not significantly impact the ability of the reservoir to learn the input
series. While in many contexts, Nu is one-dimensional, there is no limit to the number of arrays that can
be used as input to the reservoir. The input is fed into the reservoir using a randomly generated input
weight matrix Win ∈ R^(Nx×(1+Nu)). In addition to the specified input data u(n), there is also a constant bias
value input to the reservoir. This bias input is weighted by the randomly generated column of values
in Win as seen in figure 2.2. The bias constant serves to increase the variability of input
dynamics [13]. The impact of this bias constant is further studied in section 3.2. All these input values
directly impact the reservoir, which is studied in the next section.
Figure 2.2: Win determines the connection for both the input data and bias input.
2.3 The Reservoir and Echo State Property
The reservoir is defined by the neurons within it whose states all follow the same update equation as a
function of the input and feedback from other neurons. The state of each neuron or node depends on
previous states in the reservoir according to
x(n + 1) = (1 − α)x(n) + α tanh(Wx(n) + Win u(n + 1) + Wfb y(n) + v(n)), (2.1)
where α is the leaking rate of the network. Technically the state update can be governed by any sigmoidal
function but in the context of this project I use the tanh function because it is the standard sigmoid
function studied across all encountered literature in reservoir computing. According to Eq. 2.1, each
new neuron state x(n + 1) is determined by its current state x(n) as well as a nonlinear expression of other
current nodes Wx(n) and the input data Win u(n + 1). The state x(n + 1) can also depend on the output y(n) adjusted by the randomly generated feedback weight matrix Wfb, as well as on a noise function v(n).
However, both of these parameters are optional and are omitted in this project. Jaeger found that noise
was useful in maintaining stability of reservoir activations when driven by highly chaotic time series
such as the Mackey-Glass system [13]. He also concluded that noise insertion negatively impacted the
precision of predictions, which is undesirable in my project. In this project, I do not use a separate Wfb to feed the output into the reservoir but rather feed the output y(n) back as input u(n), and therefore Wfb = Win.
This allows the reservoir to exhibit dynamical behavior as a result of the output as well as the input.
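To make the update rule concrete, the following is a minimal sketch in Python (with NumPy) of the state update of Eq. 2.1 as used in this project, with Wfb and the noise term v(n) omitted. The variable names, reservoir size, and toy input are illustrative assumptions, not the code used in this study.

```python
import numpy as np

rng = np.random.default_rng(42)

N_x, N_u = 500, 1        # reservoir size and input dimension (illustrative values)
alpha = 0.3              # leaking rate
W = rng.uniform(-0.5, 0.5, (N_x, N_x))          # random recurrent connection matrix
W_in = rng.uniform(-1.0, 1.0, (N_x, 1 + N_u))   # input weights: bias column plus input column

def update_state(x, u, bias=1.0):
    """One step of Eq. 2.1 with Wfb and v(n) omitted."""
    pre = W @ x + W_in @ np.concatenate(([bias], np.atleast_1d(u)))
    return (1 - alpha) * x + alpha * np.tanh(pre)

# Drive the reservoir with a toy input series.
x = np.zeros(N_x)
for u in np.sin(np.linspace(0, 10, 200)):
    x = update_state(x, u)
```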
One of the most important parameters in determining reservoir behavior is the random weight matrix
W that describes neuron connections. The recurrent loops that occur due to these connections lead to
the echo state property of the network that gives the reservoir a finite “memory.” The echo state property
states that the reservoir retains a “memory” of previous states as the input data is stored in the recurrent loops of the reservoir connections. The echo state property can be ensured in practice if the spectral radius, denoted by ρ, is less than 1. The spectral radius is the maximum absolute eigenvalue of the weight matrix
W. However more recent researchers have determined that the echo state property occurs even for the
more relaxed condition when the maximum value of entry wi j in W is less than 1 [6]. Jaeger, more
recently, also noted that the echo state property can exist even when ρ > 1 but may not exist for all input
and never exists for the null input [22]. Increasing ρ may even increase network performance by
increasing memory of the input since the spectral radius defines how long input is remembered. He also
found that the echo state property might be defined with respect to the input u(n).
Since the reservoir exhibits memory, the initial n activations should not be used in training, which is
defined in the next section, due to any transients that may exist in the reservoir. Since ρ in most of the
studies in this project is close to unity there is a slow forgetting. Therefore, many initial states need to be
dismissed before actual predictions can take place. This makes this learning process input-intensive and
data-wasteful and other techniques have been developed that incorporate an auxiliary initiator network
that computes an appropriate starting state for the recurrent model network [13]. These techniques are
beyond the scope of my project because the data length is not so much a limitation in financial data.
2.4 Training and Output
Given all the reservoir activations resulting from the Nu input series, the goal is to take the output of this
reservoir y(n) to best approximate the target data ytarget(n). Ideally the goal is to minimize the difference
between the target data and the reservoir output. In this project, I use Mean Squared Error (MSE) defined
as
MSE = (1/n) ∑_{i=1}^{n} (y(i) − ytarget(i))². (2.2)
Because the reservoir output is a linear combination of the activations,
Y = WoutX, (2.3)
where Y and X are matrix representations of the reservoir output y(n) and reservoir states x(n) respec-
tively, minimizing MSE becomes a simple linear regression. After substituting the matrix form of Eq. 2.3 into Eq. 2.2, I find
MS E = (WoutX − Ytarget)2. (2.4)
Minimizing the MSE gives the solution to Wout as
Wout = Ytarget X^(−1). (2.5)
However, because the length of u(n) is often larger than the reservoir size, the system is overdetermined. Taking the inverse of an overdetermined matrix can give unstable solutions. Therefore, instead of an output matrix as
a simple function of the target and inverse of the activations, I implement Tikhonov regularization, also
known as ridge regression, in order to find a stable solution for Wout [17]. In Tikhonov regularization, a
regularization term is added to Eq. 2.2, resulting in
(Wout X − Ytarget)² + β(Wout)², (2.6)
where β is the regularization coefficient used to penalize larger norms. Minimizing Eq. 2.6, I obtain the
solution
Wout = Ytarget X^T (XX^T + βI)^(−1). (2.7)
Alternatively, as mentioned previously in section 2.2, the noise function v(n) is also used to stabilize
solutions in systems that are overdetermined. However the ridge regression method is a more compu-
tationally efficient solution that does not penalize the precision of reservoir predictions that noise may
affect.
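As a concrete sketch of this training step, the ridge-regression solution of Eq. 2.7 can be computed directly with NumPy. The matrix shapes and the toy check below are illustrative assumptions (states collected column-wise in X), not the exact code used in this study.

```python
import numpy as np

def train_output(X, Y_target, beta=1e-8):
    """Solve Wout = Ytarget X^T (X X^T + beta I)^(-1), Eq. 2.7.

    X        : (N_x, T) matrix of collected reservoir states
    Y_target : (N_y, T) matrix of target outputs
    """
    N_x = X.shape[0]
    # Solving the linear system is numerically safer than forming the inverse explicitly.
    return np.linalg.solve(X @ X.T + beta * np.eye(N_x), X @ Y_target.T).T

# Toy check: recover a known linear readout from random states.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2000))
W_true = rng.standard_normal((1, 100))
W_out = train_output(X, W_true @ X)
print(np.allclose(W_out, W_true, atol=1e-3))   # expected: True
```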
2.5 Reservoir Optimization
In determining the optimal reservoir, there are many adjustable global parameters that influence the
dynamical behavior. The goal in reservoir optimization is therefore to generate dynamical behavior that
is most similar to the system the reservoir is attempting to model. The reservoir dynamics, as seen in Eq.
2.1, is governed by W, Win, and α. These parameters are related to different quantities that characterize the network, such as the reservoir size Nx, spectral radius ρ, input scaling a, and leaking rate α. These global
parameters and their effect on the reservoir are the topic of this section.
The current standard practice involves intuitive manual adjustment of each of these parameters to
optimize the reservoir [17]. I will study parameter optimization in section 5.3.
Reservoir Size
The reservoir size Nx, or the number of neurons, dictates the model capacity of the network. The current
intuition states that in general, bigger reservoirs lead to better performance. The training method (Eq.
2.7) used for Echo State Networks is computationally efficient enough to generate large reservoir sizes
on the order of magnitude of 104. These have been found useful in automatic speech recognition [32].
A benchmark for the lower bound for a reservoir is given by Lukosevicius [17]: the reservoir size Nx
should be at least equal to the estimate of the independent real values that the reservoir needs to retain in
memory of the input. This is a result of the echo state property, described in section 2.3. The maximum
length memory of the reservoir is limited by its size.
Spectral Radius
The spectral radius ρ is one of the most central parameters in the echo state network because it plays
an important role in governing the echo state property of the reservoir. The radius ρ is defined as the
maximum absolute eigenvalue of the reservoir connection matrix W and it determines the length of the
reservoir memory. Larger spectral radius corresponds to a longer memory of the input history.
In this project, the spectral radius essentially scales the weight matrix W. After a random matrix W is generated, its spectral radius ρ(W) is calculated. Then W is divided by ρ(W) to normalize its spectral radius, so that ρ(W) becomes equal to 1. Finally, the connection matrix is multiplied by the selected spectral radius ρ such that ρ(W) = ρ. In other words, the connection weight matrix is first randomly generated and only afterwards is its spectral radius adjusted to the desired ρ value.
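In code, this rescaling amounts to computing the spectral radius of the random matrix and multiplying the matrix by the ratio of the desired value to the computed one; a minimal NumPy sketch (with an illustrative target radius) is:

```python
import numpy as np

def scale_spectral_radius(W, rho_desired):
    """Rescale W so that its spectral radius rho(W) equals rho_desired."""
    rho_W = np.max(np.abs(np.linalg.eigvals(W)))   # current spectral radius
    return W * (rho_desired / rho_W)

rng = np.random.default_rng(1)
W = rng.uniform(-0.5, 0.5, (300, 300))
W = scale_spectral_radius(W, 1.25)
print(np.max(np.abs(np.linalg.eigvals(W))))        # approximately 1.25
```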
In practice the condition, ρ < 1, ensures the echo state property in most applications [17]. However,
the theoretical limit established in Ref. [6] shows that the echo state property only exists for every input
under the tighter condition ρ < 1/2. In application, however, the echo state property often holds for even
ρ ≥ 1 given nonzero input. Typically large ρ values may push the reservoir into chaotic spontaneous
behavior which would violate the echo state property. This is shown by Jaeger [12] where the trivial zero
input will lead to linearly unstable solutions given ρ > 1.
Generally the spectral radius should be tuned to optimize reservoir performance with a starting bench-
mark value of ρ = 1 and exploring other ρ values close to 1. When modelling systems that depend on
more recent input history, ρ should be smaller compared to a larger ρ when a more extensive memory is
required.
Input Scaling
Input scaling is an important parameter as previously noted in section 2.3 because it also plays a role in
determining the echo state property [22]. The input scaling is typically determined by the input weight
matrix Win which is randomly sampled on a range [−a, a], where a is the scaling parameter. While
not necessarily required, the input scaling parameter determines the scale of the entire input series
by scaling all the columns of Win. This reduces the number of free parameters that need to be tuned
to optimize reservoir performance [17]. Theoretically it is possible to scale individual input units u(n).
Increasing the number of input scaling parameters could serve to expand components of the input u(n)
that favorably drive reservoir dynamics. However there are no algorithms that can easily scale individual
input components and exploration of this topic is beyond the scope of this project.
In this project, I determine the input scaling through Win, which is chosen randomly from the range
[−1, 1]. The input scaling parameter, a, is multiplied by the input data, u(n). This is analogous to
sampling on the range, [−a, a]. Win not only scales the input u(n) but it also scales the constant bias.
Input scaling determines the nonlinearity of the network dynamics. Lukosevicius [17] advises en-
suring that the inputs to the state update Eq. 2.1 are bounded. Because of the tanh function that defines
neuron activations, inputs very close to 0 will lead to linear behavior of neurons. Inputs close to 1 or -1
may cause binary switching behavior of neurons. Inputs of intermediate magnitude, between 0 and ±1, lead to nonlinear dynamical behavior of the neurons.
Leaking Rate
The leaking rate α approximates a discrete Euler integration of the state over time. The leaking rate
α, which is bounded between [0,1], determines the speed at which the input affects or leaks into the
reservoir. This discretization term comes from an Euler integration of the state equation in time
∆x/∆t = (x(n + 1) − x(n))/∆t ≈ −x(n) + f(n), (2.8)
so I can determine the solution for a new state x(n + 1)
x(n + 1) = (1 − ∆t)x(n) + ∆t f(n). (2.9)
In the reservoir the leaking rate α is the ∆t in the Euler integration of an ordinary differential equation.
This also functions as a form of exponential smoothing for the time series where previous states x(n) will
be weighted exponentially less over time because by substitution to the state update equation (Eq. 2.1) I
can see
x(n + 1) = (1 − α)²x(n − 1) + α(1 − α) f(x(n − 1)) + α f(x(n)). (2.10)
In general, the leaking rate is set to match the speed of the dynamics the reservoir is attempting
to model from ytarget. A recent study [12] has shown that leaking rate can also impact the short-term
memory of echo state networks. The study showed that in some reservoirs, given a small leaking rate,
α, the slow dynamics of the reservoir states, x(n), could increase the length of the short-term memory of
the reservoir. I study impact of the leaking rate along with other parameters more explicitly in section
5.3. Further studies of multiple time scales may be useful to determine components of the system that
may have multi-scale dynamics occurring on different timescales, but such studies are beyond the scope of this paper.
2.6 Summary
In this chapter, I outlined the specific components of the echo state network as well as specific parameters
that are useful in optimizing its performance. In the following chapter I apply this knowledge in practice
by studying the modelling capacity of the reservoir using a known time series.
Chapter 3
Echo State Network and Mackey-Glass
3.1 Mackey-Glass System
The Mackey-Glass equation is a nonlinear time delay differential equation. It was originally derived as
a model of chaotic dynamics in blood cell generation in a collaboration between Mackey and Glass at
McGill University [20]. The equation expands on a simple feedback system whose dynamics are given
by
dx/dt = λ − γx, (3.1)
which has a stable equilibrium at λ/γ in the limit t → ∞ given that γ is positive. In this simple feedback system, the rate of change of the control variable, dx/dt, is influenced by the value of the control variable.
The system increases at a constant rate of λ and decreases at rate γx.
To better model real physiological systems, modifications to this simple feedback system include
introducing a time delay component. The form of the Mackey-Glass equation studied today is
dx/dt = βx_τ/(1 + x_τ^n) − γx. (3.2)
The equation can generate periodic behavior, bifurcations, and chaotic dynamics given specified
parameters β, γ, τ, and n. The derivatives of a time delay differential equation are dependent on solutions
at previous times. x_τ represents the state x at a time delayed by the constant τ, that is, x(t − τ). There may
be a significant time lag between determining the control variable x and responding with an updated
change. Additionally, in real physiological models, the parameters γ and λ are not necessarily constant
for all time t, and may vary with x(t − τ), also denoted x_τ. This is true for Eq. 3.2, which also exhibits period doubling bifurcations leading up to the chaotic regime, and has been extensively studied in relation
to chaos theory [7].
Because the system has been so well characterized by previous studies, the Mackey-Glass system is
often used to benchmark time series prediction studies. Previous studies by Jaeger [14] have shown that
ESNs improve upon the best previous techniques for modeling the Mackey-Glass chaotic systems [36]
by a factor of 700, which he attributes to the ESN’s ability to store and remember previous states defined
as the echo state property in section 2.3. In Jaeger’s article, he uses Mackey-Glass parameters β = 0.2,
n = 10, γ = 0.1, and τ = 17, which I reproduced in figure 3.1 using the dde23 solver [5] in Python. For
τ > 16.8, the system has a chaotic attractor. Since I use τ = 17 the system exhibits a chaotic attractor
which can be observed in the time delay embedding in figure 3.2. I use the same above parameters to
start to train my echo state networks.
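For readers who want to generate a comparable series, Eq. 3.2 can be approximated with a fixed-step Euler integration and a history buffer for the delayed state. This is only a rough sketch of one possible approach; the step size and constant initial history are assumptions for illustration, and the study itself used the dde23-style solver cited above.

```python
import numpy as np

def mackey_glass(T=3000, dt=0.1, beta=0.2, gamma=0.1, tau=17, n=10, x0=1.2):
    """Euler integration of dx/dt = beta*x_tau/(1 + x_tau^n) - gamma*x."""
    steps = int(T / dt)
    delay = int(tau / dt)
    x = np.empty(steps + delay)
    x[:delay + 1] = x0                   # constant history on [-tau, 0] (assumed)
    for t in range(delay, steps + delay - 1):
        x_tau = x[t - delay]             # delayed state x(t - tau)
        dxdt = beta * x_tau / (1.0 + x_tau**n) - gamma * x[t]
        x[t + 1] = x[t] + dt * dxdt
    return x[delay:]                     # drop the initial history segment

series = mackey_glass()
print(series[:5])
```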
Figure 3.1: Mackey-Glass system over time t = 3000 given β = 0.2, n = 10, γ = 0.1, and τ = 17.
I begin with reservoir training tasks using the Mackey-Glass system since it has been well studied in the context of reservoir computing. The rest of this chapter details the studies
of the constant bias global parameter discussed in section 2.5 and its impact on the optimization of the echo state network, implemented using source code made publicly available by Lukosevicius [18]. These studies will give insight into the basic routine used in training networks, which is applied in subsequent chapters to financial data sets.

Figure 3.2: Phase space diagram of the Mackey-Glass attractor plotted by time delay embedding given β = 0.2, n = 10, γ = 0.1, and τ = 17.
3.2 Constant Bias
In determining states of the activations (see Eq. 2.1), there is an additional bias input, randomly generated
within the Win matrix that is multiplied by a constant scaling parameter. In order to better understand
the impact of this bias input on the reservoir dynamics I use Mackey-Glass data and study the behavior
of the reservoir as a result of changing bias. I generate input Mackey-Glass data for training and testing,
construct an echo state network, and train the network on the output matrix to make predictions for the
next step. The rest of this section illustrates the impact of bias on the mean squared error. As I study
MSE, I examine the impact that the output weight matrix Wout may have on the error as it indirectly
impacts MSE by determining y(n).
Input preparation
Using Mackey-Glass parameters β = 0.2, n = 10, γ = 0.1, and τ = 17, I generate a time series as shown
in figure 3.1. This data is split up into the training data as well as the testing data. In this numerical study,
I use a training length of 2000 and a testing length of 1000. The experimental data is shifted between
-1 and 1 to fall within the nonlinear regime of the tanh function. There is no additional input scaling
factor needed to adjust the input in this case because the range of the series is less than 1. The training
data is weighted according to the randomly generated input weight matrix used in the reservoir to drive
the activation states in Eq. 2.1. The first 100 inputs are used to initialize the reservoir and are not used
to train the output. The rest of the training data is used as target data to train on as Ytarget in Eq. 2.7.
The testing length is used to compare with the reservoir output to determine the accuracy of reservoir
predictions as in Eq. 2.2.
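A minimal sketch of this preparation step, with the series lengths quoted above and an illustrative min-max shift into [-1, 1], might look as follows; the exact preprocessing code of the study is not reproduced here.

```python
import numpy as np

def prepare_input(series, train_len=2000, test_len=1000, washout=100):
    """Shift a series into [-1, 1] and split it into training and test segments."""
    s = np.asarray(series, dtype=float)
    s = 2.0 * (s - s.min()) / (s.max() - s.min()) - 1.0   # shift into [-1, 1]
    train = s[:train_len]
    test = s[train_len:train_len + test_len]
    # The first `washout` points only initialize the reservoir and are not trained on;
    # training targets are the inputs one step ahead.
    target = train[washout + 1:]
    return train, target, test
```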
Reservoir Construction
A reservoir was constructed as described in section 2.3. In order to study the constant bias, I varied the scaling parameter for the bias over the range 0 to 10 in 0.1 increments, examining the impact this may have on the reservoir activations by isolating the parameter. I hold all other parameters constant:
reservoir size = 1000, leaking rate = .3, spectral radius = 1.25, input scaling = 1. Using the first 2000
data points in the training sequence to drive the reservoir according to the reservoir update Eq. 2.1, the
states of all the reservoir nodes are collected in X. As seen in figure 3.3 which shows some reservoir
activations, the initial activations resulting from the first few random states of the network are highly variable in the first few steps, t < 50. Therefore the first 100 activations are ignored to remove any initial
transience of the random reservoir. Since reservoir connections are randomly generated, I created 10
reservoirs given each scaling parameter to test the impact of bias on MSE.
Figure 3.3: The activations of a few nodes for t < 50 are more variable than the stable activations for
t > 50.
Training and Prediction
Training the output feedback of the reservoir requires using the target data Ytarget and reservoir activa-
tions X in Eq. 2.7. The regularization coefficient is set to 10−8 for all the reservoir training throughout
this project in order to minimize the number of adjusted parameters. Once the output matrix is deter-
mined, the matrix prediction y(n) is calculated by the output weight matrix and the reservoir states. This
output y(n) is then used as the subsequent input to continue to drive the reservoir and make the following
prediction. This series of predictions is compared to the test data and the MSE, given by Eq. 2.2, is
determined.
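The free-running prediction loop described here can be sketched as below, reusing an update function and trained output matrix like those in the earlier sketches; the structure is an illustrative assumption rather than the study's exact code.

```python
import numpy as np

def free_run(update_state, W_out, x, u_last, n_steps):
    """Generate n_steps predictions, feeding each output back as the next input."""
    preds = []
    u = u_last
    for _ in range(n_steps):
        x = update_state(x, u)        # advance the reservoir with the latest input
        y = float(W_out @ x)          # linear readout y(n) = Wout x(n)
        preds.append(y)
        u = y                         # the prediction becomes the next input
    return np.array(preds), x

def mse(y_pred, y_true):
    """Mean squared error, Eq. 2.2."""
    return float(np.mean((np.asarray(y_pred) - np.asarray(y_true)) ** 2))
```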
Results
After training a number of reservoirs over a range of scaling constants, I compare the prediction results
of the reservoirs to the test data. The following plot is an example of the actual data compared to the
reservoir predictions for one of the runs using a bias scaling constant of 1.
Figure 3.4: Reservoir Predictions after training compared to test data points.
To compare the results across the varying scaling constants, I calculate the MSE using Eq. 2.2 for
reservoir predictions over a length of 500 steps for each reservoir. The average MSE of the 10 reservoirs
for each scaling bias is shown in the figure below.
Figure 3.5: MSE across varying scaling constants for the bias given parameters reservoir size = 1000,
leaking rate = .3, spectral radius = 1.25, input scaling = 1.
As seen in figure 3.5, the smallest MSE occurred when the scaling constant was kept below 3.0 but
error increased when the scaling constant was zero or close to it. In the range from 0.8 to 2.2, the MSE
was on the order of magnitude of 10−5 or smaller. Typically in literature, this scaling constant is kept at
1 [17].
The purpose of the constant bias input as explained by Jaeger and in section 2.2 was to increase the
variability of the reservoir dynamics. In application, the bias may help the reservoir deal with offset or
non-centered data. The mean of the shifted input data is close to but not quite zero, at around -0.066.
While the bias input does seem to have an impact on MSE, there may be other indicators of large error in the echo state network model. Since Wout has a significant impact on the output y(n), I compare the different weight matrices as they affect MSE. Higher weights in the output matrix could possibly be a result of an unstable solution in Eq. 2.7. To study whether or not the output weight matrix would be
correlated to the MSE, figure 3.6 was produced to compare the mean of the output weight matrix to the
MSE.
Figure 3.6: MSE as a function of output weight matrix Wout
Figure 3.6 indicates no correlation between the mean of the weight matrix and the MSE. A correlation might have been expected because weight matrices with larger, more unstable values could indicate poor linear fits for Wout from Eq. 2.7. However, because Tikhonov regularization was intended to penalize large, unstable matrix solutions, these solutions may not manifest in the final output weight matrix. This already
acts as a threshold for any solution derived from Eq. 2.7 to dismiss highly unstable solutions. Other
studies of the output matrix could observe other statistical features. In the following section we study the
impact of a few other reservoir parameters.
3.3 Leaking Rate, Spectral Radius, and Reservoir Size
There are many other important parameters that can be optimized in the network as discussed in section
5.3. In this section we study the interrelated impact of multiple parameters. One of the flaws of opti-
mizing each parameter individually is that the MSE may have a local minimum as a function of multiple
parameters. In order to better optimize across multiple parameters, I study more advanced optimization
techniques.
Gradient Descent
Gradient descent is a method commonly used to minimize error functions. In gradient descent optimiza-
tion, also known as method of steepest descent, an iterative approach is taken to converge to a local
minimum. From an initial starting vector of parameters, the gradient for MSE is calculated across the
multiple parameters. The direction in which MSE will fall the fastest is the negative gradient at that
initial starting set of parameters. These parameters are then adjusted in the direction of steepest descent
so that MSE is reduced and the gradient is calculated again. This process is repeated to minimize the
MSE until the gradient converges to 0, at which point there is a local minimum. While gradient descent
algorithms are useful in finding the minimum of smooth convex functions, there is not much knowledge
about the error surface in reservoir computing. The error surface could be discontinuous and contain
multiple local minima where the gradient descent algorithm could fall and miss the absolute minimum.
In order to get a better sense of the kind of error surface that exists in reservoir optimization, I conduct a
grid search detailed in the next subsection.
Grid Search
The basic grid search explores the MSE outcome for multiple reservoirs across several ranges of global
parameters. Grid searches are a computationally expensive brute force method to finding optimal param-
eters since multiple calculations need to be made at each point along the grid. However, for the purposes
of this study, it provides the ability to visualize the error surface over many reservoir parameters.
In this grid search, I explore the error surface as a function of reservoir size, leaking rate and spectral
radius. Data processing follows the same procedures as described in section 3.2. However, instead of
simply adjusting one parameter, I am able to tune three different parameters. I study the slices of the
error surface at different reservoir sizes Nx = 100, 400, 700, and 1000. This surface plot shows the MSE
at different values of leaking rate 0.05 ≤ α ≤ 0.3 and spectral radius 0.8 ≤ ρ ≤ 1.2. At each point on the
grid defined by the global parameters, 10 different randomly connected reservoirs are created. Because
of the computational costs associated with grid search I take large step sizes in all of my parameters to
minimize the number of searches that need to be run.
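Schematically, the grid search loops over the chosen parameter values and averages the MSE of several randomly generated reservoirs at each grid point. In the sketch below, evaluate_mse is a hypothetical placeholder for the full train-and-predict routine of section 3.2, and the parameter ranges mirror those listed above.

```python
import itertools
import numpy as np

def grid_search(evaluate_mse, sizes, leak_rates, spectral_radii, n_trials=10):
    """Average MSE of n_trials random reservoirs at every (N_x, alpha, rho) grid point.

    evaluate_mse(N_x, alpha, rho, seed) is a placeholder supplied by the caller that
    builds, trains, and tests one reservoir and returns its MSE.
    """
    results = {}
    for N_x, alpha, rho in itertools.product(sizes, leak_rates, spectral_radii):
        errors = [evaluate_mse(N_x, alpha, rho, seed) for seed in range(n_trials)]
        results[(N_x, alpha, rho)] = float(np.mean(errors))
    return results

# Grid matching the ranges explored in this chapter.
sizes = [100, 400, 700, 1000]
leak_rates = np.arange(0.05, 0.31, 0.05).round(2)
spectral_radii = np.arange(0.8, 1.21, 0.1).round(2)
```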
Results
While I am not able to conduct an exhaustive grid search, the purpose of the grid search is to better
understand the error surface across reservoir parameters. The surface plots in figures 3.7, 3.8, 3.9, and
3.10 show the mean of the MSE outcome for the 10 reservoirs created at each point. Figure 3.7 excludes
values with leaking rate α < 0.2 because the MSE exploded beyond that point. I have already discussed
in section 2.5 that smaller reservoir sizes do not have the capacity for large memory of input. Therefore
it makes sense that many parameter values given in the grid search returned extremely high MSE values.
The MSE surface also tends to be lower for the reservoir Nx = 1000 compared to the other three plots.
Studying these surface plots across multiple reservoir sizes, the MSE seems to be minimized for higher
leaking rates α > 0.2, which correspond to the values of α I have been using in my studies of Mackey-Glass.
Figure 3.7: MSE surface for Nx = 100 reservoirs across α and ρ
Figure 3.8: MSE surface for Nx = 400 reservoirs across α and ρ
Figure 3.9: MSE surface for Nx = 700 reservoirs across α and ρ
Figure 3.10: MSE surface for Nx = 1000 reservoirs across α and ρ
3.4 Summary
In this chapter I examined the Mackey-Glass system and how the numerical solutions could be used
as input to the echo state network and as a demonstration of the network training process. I studied specifically the bias input to the reservoir and how the scaling constant for the bias input affected the reservoir
predictions and error. I determined a wide optimal range for the bias input. Studying the output weight
matrix, I determined there was no direct relationship between the mean of output weights and the predic-
tion ability of the reservoir. I was able to study the multivariate impact of reservoir parameters on MSE
in a grid search across multiple reservoir sizes, leaking rates, and spectral radii. The grid search suggests
using leaking rate values α > 0.2, and the leaking rate value used in the Mackey-Glass studies is α = 0.3,
which confirms I have been operating in an optimal range of reservoir parameters. As I move further,
these initial studies will guide my understanding of the reservoir predictions of financial data.
Chapter 4
Financial Forecasting and Neural
Networks
This chapter provides a cursory examination of the history of analysis in financial markets as it relates
to forecasting and nonlinear dynamics. In section 4.1, I describe the complex dynamics of the financial
input. As a result of the nonlinearity, neural networks have historically been used as forecasting models
with some success. The following section elaborates on some of the results that artificial neural networks have achieved in finance. Finally, in the last section, I discuss the specific echo state network approach
used in the project as a part of the Neural Forecasting Competition in 2007 in which different neural
networks were used to predict unknown financial data.
4.1 Nonlinear Characteristics
In finance, many qualitative and quantitative forecasting techniques have been applied to try to predict
fluctuations in share prices. However, according to the efficient market hypothesis, one cannot outper-
form the market to achieve returns in excess of the average market returns based on empirical informa-
tion. This is because the current market price reflects all known information about future value of the
stock and investors cannot purchase undervalued stock or sell at inflated rates. This hypothesis is highly
controversial and heavily debated. An entire body of study exists to produce methods and models that
hope to have substantial predictive abilities of stock prices.
Prediction is extremely complex and difficult as the stock market is the result of interactions of
many variables and players. Many previous techniques used various linear models to attempt to explain
the numerous complex interactions. Chaos theory offers an explanation of the underlying generating
process, suggesting the stock market may be a deterministic system. This section outlines the history of
chaos in the description of financial markets, the determination of chaos in a time series, and possible
causes for nonlinear behavior in the financial model.
As a result of the failure of linear models, models of complex systems were developed to explain
nonlinear processes in financial markets. The foundation for chaos theory in the realm of finance was
established by Mandelbrot in 1963 in a study of cotton spot price data [21] where he found price changes
did not fit a normal distribution. The theoretical justification for price changes fitting a Gaussian was
given by Osborne [27] using the central limit theorem. It states that, if transactions are a true random walk, they are randomly, independently, and identically distributed, and price changes should be normally distributed since the price change is the simple sum of these IID transactions.
Mandelbrot discovered that the price changes differed from a normal distribution [21]. Particularly
he found long tails in the distribution of price changes. There are more observations at the extreme ends
than a normal distribution would predict. Furthermore, his studies of cotton prices did not indicate finite variance. He expected the sample variance to converge as he increased the number of observed cotton price changes, as it would for a Gaussian distribution with identical variance for each observation, but it did not. He suggested instead that the distribution followed a stable Paretian distribution.
Fama confirmed a stable Paretian distribution was a better fit for price returns than a Gaussian in his
studies on thirty stocks in the Dow-Jones Industrial Index [4]. Other studies by Praetz in 1972 suggested
a scaled t-distribution as an alternative [28]. Later, in section 5.2, I fit the financial data to both a Gaussian and a Lorentzian distribution; the latter is a stable Paretian distribution with no skew and no defined variance, an important characteristic Mandelbrot observed in cotton prices.
Mandelbrot’s work established the basis of chaos theory in the world of finance. The rejection of the
random walk theory brought chaos and determinism into the study of economics [24].
Studies have found that there is evidence of nonlinearities in share price data by applying correlation
dimension analysis [25]. These studies sought to determine whether a time series is independently and
identically distributed as would be expected for a random walk. Studies using correlation dimension
have gone on to reject the Gaussian hypothesis offered by Osborne and argue for the existence of chaos in financial markets [10].
There have been many reasons offered that suggest chaotic dynamics within the stock market. Deter-
ministic processes could be the result of behavioral economics which dictate irrationality in investment
decisions [35]. Rather than making purely logical decisions based on the current market, the psychology
of fear and emotions plays a deterministic role in risk-taking and investment decisions. A Paretian distribution implies a Paretian market, which is inherently riskier because there are more abrupt changes, higher variability, and a greater probability of loss [4]. In general, the complex interactions of the many
variables and players in the stock market could behave nonlinearly. Many researchers claim there does
exist some degree of determinism in the market driven by high dimensional attractors. This debate on
chaos will be important as I study the use of neural networks in the financial domain.
4.2 History of Neural Networks in Finance
Throughout the years of research analysis on stock prices, studies have indicated a major roadblock is the
lack of mathematical and statistical techniques in the field [35]. Some of the benefits of neural networks
lie in the learning ability of networks to approximate functions. Given the vast amount of data available
in the financial world, neural networks have become invaluable tools in detecting complex processes and
modeling these relationships.
Networks have been introduced as models for time series forecasting as early as 1990, where most early studies focused on predicting the returns of indices. In the prediction of returns in the Tokyo
Stock Price Index (TOPIX) on data covering the period January 1985 to September 1989, Kimoto and
collaborators [16] compared the performance of modular neural networks trained through backpropagation
with multiple regression analysis, a traditional method of forecasting. The modular neural networks they
created were a series of independent feed-forward neural networks with three layers: an input layer,
a hidden layer of nodes, and an output layer. The output of these networks was combined to inform
buy or sell decisions for TOPIX. They were able to determine a higher correlation coefficient between
the target data and the system of networks (0.527) than for individual networks (0.414-0.458). In 1990,
Kamijo and Tanigawa [15] used a recurrent neural network approach to model candlestick charts, a chart type combining line charts and bar charts to illustrate the high, low, open, and close price for each day. They
were able to use backpropagation to train the RNN to recognize specific patterns in the price chart but
were unable to draw conclusions about the predictive abilities of the network.
Neural networks were used to predict the daily change in direction of the S&P 500 in both studies
conducted by Trippi and Desieno [33] and by Choi, Lee, and Lee [2]. Trippi and Desieno [33] developed
composite rules which are different ways to combine the trained output from multiple networks to deter-
mine Boolean (rise or fall) signals. Their results show that they are 99% confident their best composite
rule would outperform a randomly generated trading position. They estimated a potential annual return
of $60,000. Choi, Lee, and Lee [2] used neural networks to make rise or fall predictions in the stock
index. Using this method they were able to make higher annualized gains than previous methods.
To address limitations of backpropagation in dealing with noise and non-stationary data [30], other
research uses a hybrid method incorporating rules-based system from machine learning along with recur-
rent neural networks. The rules-based technique categorizes the stock data into cases based on empirical
rules derived from past observations. Studies by Tsaih, Hsu, and Lai, incorporating a hybrid method that uses a rules-based system to generate input data for neural networks, found better returns over a six-year period compared to a strategy where the stock was bought and held over the same period [34].
Studies are still working to continue to improve the training algorithms for neural networks. In the next
section I will study how the echo state network training algorithm, described in earlier chapters, can be implemented in handling financial data.
4.3 Neural Forecasting Competition
In spring of 2007, echo state networks outperformed many other neural network training algorithms in
the NN3 Artificial Neural Network and Computational Intelligence Forecasting Competition, where the objective was to forecast 111 monthly financial time series 18 months ahead. The submission using an echo
state network approach by Ilies, Jaeger, Kosuchinas, Rincon, Sakenas, and Vaskevicius was ranked first
in forecasting the competition time series [26].
The team used the same recurrent neural network architecture described in chapter 2 to train blocks of the 111 competition time series with high levels of success [11]. Based on their report, the 111 time series were divided into 6 temporal blocks. Each of these blocks was preprocessed using seasonal decomposi-
tion methods before being used to train a collective of 500 echo state networks. The reservoir parameters
were manually manipulated using part of the time series as a validation set [11]. The promising results
from the competition give rise to many more questions about the application of echo state networks in
finance, which I will address in the next chapter.
Chapter 5
Network Predictions
In this chapter, I apply all I have learned about echo state networks and finance to test the ability of the
echo state network to handle financial data. In the first section, I examine the data set I will be using,
the daily closing price of the S&P500 (GPSC) as well as the data processing measures implemented in
the project. In section 5.2 I examine how I benchmark the reservoir performance against a random
guess drawn from a distribution which accurately models the input data u(n). In order to optimize
parameters simultaneously, I conduct a series of grid search in section 5.3. I present the results of the
reservoir prediction in section 5.5. Then I study other methods which may improve prediction such as
the committee method in section 5.6.
5.1 S&P 500 Data
The data that I will be using as both input data and target data is the S&P 500, which is an American stock index of 500 large corporations listed on U.S. stock exchanges. Specifically, I used the daily
close price before dividend adjustments of each day retrieved from Yahoo Finance [38]. The time period
of data used ranges from March 2007 to March 2015. The figure below shows the raw input data from
Yahoo Finance.
Figure 5.1: S&P 500 daily close prices from March 2007 to March 2015.
5.1.1 Data Processing
In order to use stock prices as input, I need to apply data processing techniques that would allow the
reservoir to run efficiently. Raw input could lead to unstable reservoir dynamics because the raw input
data could be in a range that does not produce meaningful reservoir activations. Important considerations
in processing data include converting the non-stationary financial time series to a stationary set. Once
the data set is detrended, I also need to process the data to ensure the scale is within the correct range for
reservoir dynamics.
Using a stationary representation of the financial data is important in ensuring proper reservoir per-
formance [31]. Stationary processes are defined as having a distribution that does not change over time; in other words, the mean and the variance of the data remain constant over time.
There are several methods for detrending considered for this project including simple differencing,
logarithmic differencing, and relative differencing. For the original n-length time series y(t) = y1, y2...yn,
the simple difference is
ydiff(t) = y2 − y1, y3 − y2, ..., yn − yn−1. (5.1)
Another method to detrend the data is logarithmic differencing
ylog(t) = log(y2/y1), log(y3/y2), ..., log(yn/yn−1). (5.2)
The last method similar to the simple difference is the relative difference
yrel(t) = (y2 − y1)/y1, (y3 − y2)/y2, ..., (yn − yn−1)/yn−1. (5.3)
I chose to implement a relative difference between data points as it indicates the change in stock price
which has explicit implications on stock returns. The sign of the relative difference gives the direction of
price trajectory. The following is a plot of the detrended data used in the project.
Figure 5.2: Relative difference of S&P 500 data from figure 5.1.
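As a concrete illustration, the three detrending transforms of Eqs. 5.1-5.3 can be written as follows, assuming a NumPy array of daily closing prices; the relative difference is the form used as reservoir input in this project.

```python
import numpy as np

def simple_diff(y):
    """Eq. 5.1: y(t+1) - y(t)."""
    return np.diff(y)

def log_diff(y):
    """Eq. 5.2: log(y(t+1) / y(t))."""
    return np.diff(np.log(y))

def relative_diff(y):
    """Eq. 5.3: (y(t+1) - y(t)) / y(t), the daily relative price change."""
    y = np.asarray(y, dtype=float)
    return np.diff(y) / y[:-1]

# Toy usage on a short price series.
prices = np.array([100.0, 101.5, 99.8, 100.2])
print(relative_diff(prices))
```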
Another important consideration in data processing is in the scaling of the input data. As discussed
in section 2.5 the input should fall within the nonlinear range of the tanh function. Since many of the
relative differences were below 10%, an input scaling factor between 1 and 50 was considered. Figure 5.3 shows the MSE, averaged over 10 reservoirs, for scaling factors between 1 and 50. This is the result for a reservoir with spectral radius ρ = 0.8, Nx = 500 neurons, and leaking rate α = 0.1. This plot shows
input scale that minimizes MSE to be 1, which implies that the unscaled data may be a valid input to the
reservoir. The range of input scales which causes a peak in MSE should be avoided. The shape of the
MSE curve is very interesting but further studies are still needed to understand the dynamics underlying
the bell shaped MSE curve.
Figure 5.3: MSE of the reservoir as a function of the input scaling factor between 1 and 50, for a reservoir
with spectral radius ρ = 1.1, reservoir size of 500 neurons, and leaking rate α = 0.3.
5.2 Benchmarking
In order to test the reservoir performance, I compare the reservoir output to a random draw from a
distribution. As discussed in section 4.1, according to random walk theory, price changes are independent,
identically distributed variables and should therefore follow a normal distribution. I thus fit a histogram
of S&P 500 price changes to a Gaussian in figure 5.4, where the Gaussian distribution is defined by its
mean µ and standard deviation σ as
G(x, µ, σ) = [1/(σ√(2π))] exp[−(x − µ)²/(2σ²)].   (5.4)
The fitted parameters µ and σ are listed in table 5.1; the reduced chi-square value of the fit is 7.377.
µ  −1.4936e−05 ± 0.000216
σ  0.00756650 ± 0.000176
Table 5.1: Gaussian fit parameters for the distribution of S&P 500 data.
In figure 5.4, the Gaussian does not fit the distribution very well; the reduced chi-square is much
larger than 1, which means the distribution is not fully capturing the data. There are too many extreme
values in the long tails of the distribution for the Gaussian to be a good fit. Furthermore, because the
Gaussian drops off exponentially away from the mean, its tails are too steep to capture the extreme price
changes. In table 5.1, the fitted µ has very high relative uncertainty, suggesting that the center of the
distribution is variable. The Gaussian is not a good fit for the distribution of price changes, which leads
me to reject the Gaussian distribution as a way to model the input data.
Figure 5.4: Gaussian fit to histogram of relative price change in S&P 500.
In order to find a distribution with a better fit, I attempted fits using the Lorentzian distribution, also
known as the Cauchy distribution, and the Voigt profile, which is a convolution of the normal and
Lorentzian distributions. Both of these distributions have thicker tails than the Gaussian and are therefore
better able to capture the larger price changes in the distribution. The Lorentzian distribution is a special
case of the stable Paretian distribution discussed in section 4.1 with no skewness. The probability density
function of the Lorentzian is defined by the location x0 and width γ as
L(x, x0, γ) = 1/{πγ[1 + ((x − x0)/γ)²]}.   (5.5)
Figure 5.5 shows the fit to a Lorentzian with reduced chi-square = 1.518. The parameters for
the distribution are listed in table 5.2. The goodness of fit for the Lorentzian distribution is very close
to a reduced chi-square value of 1, which indicates a good fit between the data and the distribution. The
thicker tails of the Lorentzian distribution are better able to capture the larger price changes that occur
in the data. The uncertainty on the center location x0 is relatively high compared to the center location
value. However, since there was also trouble fitting µ in the Gaussian model, I assume that the center of
the historical price change distribution is variable and difficult to fit.
Figure 5.5: Lorentzian fit to histogram of relative price change in S&P 500.
x0  −4.7138e−06 ± 7.81e−05
γ  0.00537502 ± 7.81e−05
Table 5.2: Lorentzian fit parameters for the distribution of S&P 500 data.
Another view of the Lorentzian fit, focused on the tails, is shown in figure 5.6. This figure shows that
the tail of the Lorentzian distribution fits the data well. However, the Lorentzian does tend to have slightly
thicker tails than the data, so it would predict somewhat larger price fluctuations than actually occur.
Figure 5.6: A closer view of the tail of the relative price distribution and the Lorentzian fit shows fewer
large price fluctuations in the data than the Lorentzian would predict.
The next distribution that I attempt to fit to the data is the Voigt profile, a convolution of the Gaussian
distribution, given by Eq. 5.4, and the Lorentzian distribution, given by Eq. 5.5. The convolution is
defined by
V(x, σ, γ) = ∫ G(x′, σ) L(x − x′, γ) dx′, integrated over x′ from −∞ to ∞.   (5.6)
The Voigt profile fit in figure 5.7 has almost the same reduced chi-square value as the Lorentzian fit in
figure 5.5. The goodness of fit for the Voigt profile is 1.342, and the parameters for the Voigt profile are
shown in table 5.3. The Voigt profile has three parameters to fit, compared to two for both the Gaussian
and the Lorentzian. Even with this additional parameter, the goodness of fit is almost identical to the
Lorentzian fit. As with the Lorentzian distribution, the Voigt profile is better able to fit the thicker tail of
the price change distribution, where the large fluctuations lie. Similar to the previous two models,
the fitted parameter for the center x0 is uncertain in the Voigt model. The σ value is also quite uncertain;
the uncertainty of σ in the fit is 13.57% of the σ value.
Figure 5.7: Voigt profile fit to the histogram of relative price changes in the S&P 500.
x0  2.8723e−05 ± 7.70e−05
γ  0.00480562 ± 0.000166
σ  0.00211020 ± 0.000286
Table 5.3: Voigt profile fit parameters for the distribution of S&P 500 data.
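The sketch below indicates how fits of this kind might be produced with SciPy: the relative price changes are binned into a normalized histogram and each candidate density is fitted to the bin heights with curve_fit. The file name, bin count, and initial guesses are assumptions, not the project's actual settings; the Voigt profile is evaluated through the Faddeeva function wofz.

import numpy as np
from scipy.optimize import curve_fit
from scipy.special import wofz

def gaussian(x, mu, sigma):
    # Eq. 5.4
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def lorentzian(x, x0, gamma):
    # Eq. 5.5
    return 1.0 / (np.pi * gamma * (1 + ((x - x0) / gamma) ** 2))

def voigt(x, x0, sigma, gamma):
    # Eq. 5.6, evaluated with the Faddeeva function w(z)
    z = ((x - x0) + 1j * gamma) / (sigma * np.sqrt(2))
    return np.real(wofz(z)) / (sigma * np.sqrt(2 * np.pi))

# u: array of relative daily price changes (hypothetical file name)
u = np.loadtxt("sp500_relative_changes.txt")
density, edges = np.histogram(u, bins=100, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])

g_par, _ = curve_fit(gaussian, centers, density, p0=[0.0, u.std()])
l_par, _ = curve_fit(lorentzian, centers, density, p0=[0.0, 0.005])
v_par, _ = curve_fit(voigt, centers, density, p0=[0.0, 0.002, 0.005])
print("Gaussian (mu, sigma):", g_par)
print("Lorentzian (x0, gamma):", l_par)
print("Voigt (x0, sigma, gamma):", v_par)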
Ultimately, I use values drawn randomly from the Lorentzian distribution that best approximates the
input to measure the performance of the reservoir. The reduced chi-square for the Lorentzian distribution
is very similar to the goodness of fit for the Voigt profile, and implementing a random selection from the
Lorentzian distribution is more straightforward. A selection from this distribution simulates a random
guess for the price change each day, based on the distribution of all previous days. The random guess
establishes the baseline for a randomly determined forecasting model. Comparing this to an echo state
network, I can measure how well the trained network predicts price changes by studying the MSE of each
method.
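Such a benchmark could be generated with scipy.stats.cauchy, which implements the Lorentzian (Cauchy) distribution; the sketch below uses the fitted parameters from table 5.2 and a fixed seed chosen only for reproducibility.

import numpy as np
from scipy.stats import cauchy

# Fitted Lorentzian parameters from table 5.2 (location x0 and scale gamma).
x0, gamma = -4.7138e-06, 0.00537502

# 500 one-day-ahead "predictions" drawn blindly from the fitted distribution,
# one for each day in the test period.
random_guess = cauchy.rvs(loc=x0, scale=gamma, size=500, random_state=0)
print(np.mean(random_guess), np.median(random_guess))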
5.3 Parameter Optimization
In order to make the best reservoir predictions possible, there are a large number of global parameters
that can be tuned; I discussed these parameters in section 2.5. Initially, I optimized these parameters
individually to find the parameter values that reduce MSE, which is standard practice [17]. However,
many of these parameters may affect each other, so the best optimization strategy would determine the
parameters simultaneously. I considered a gradient descent algorithm that would calculate the gradient of
the MSE with respect to the different parameters and move in the direction of steepest descent to find a
local minimum. However, this method may not be effective where there are many local minima or where
the surface is discontinuous. Therefore, to better understand the landscape of the MSE as a function of
multiple parameters, I conduct a coarse grid search across the leaking rate, spectral radius, and reservoir
size, as sketched below. For each set of parameters in the grid search, I create 10 reservoirs and average
the MSE across them. Figures 5.8, 5.9, 5.10, and 5.11 show the surface of the MSE as a function of
spectral radius and leaking rate for reservoir sizes of 100, 400, 700, and 1000, respectively. These figures
highlight that, while the MSE does not vary greatly over much of the parameter space, certain choices of
parameters cause the reservoir to fluctuate strongly and return output values with extremely high errors.
Compared to the grid search in section 3.3, where α ≥ 0.3 gave the best MSE results, the MSE for
financial input data is minimized for leaking rate α ≤ 0.1 and spectral radius ρ ≤ 1.
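The grid search can be organized as in the sketch below: for every (ρ, α, Nx) triple, ten independently initialized reservoirs are evaluated and their MSEs averaged. The grid values shown are illustrative, and esn_mse is a stand-in for a routine that builds, trains, and scores one reservoir (a full reservoir sketch appears in section 5.4).

import numpy as np

def esn_mse(rho, alpha, n_x, seed):
    # Stand-in: build one reservoir with spectral radius rho, leaking rate alpha,
    # and n_x neurons, train it on the relative-difference series, and return its
    # test MSE. Replace with a real evaluation such as the walk-forward routine
    # sketched in section 5.4; here it only returns a placeholder value.
    return np.random.default_rng(seed).random()

spectral_radii = np.arange(0.1, 1.6, 0.1)
leaking_rates = np.arange(0.1, 1.1, 0.1)
reservoir_sizes = [100, 400, 700, 1000]

for n_x in reservoir_sizes:
    mse = np.zeros((len(spectral_radii), len(leaking_rates)))
    for i, rho in enumerate(spectral_radii):
        for j, alpha in enumerate(leaking_rates):
            # Average over 10 reservoirs with different random initializations.
            mse[i, j] = np.mean([esn_mse(rho, alpha, n_x, seed) for seed in range(10)])
    i_best, j_best = np.unravel_index(np.argmin(mse), mse.shape)
    print(f"Nx={n_x}: best rho={spectral_radii[i_best]:.1f}, "
          f"alpha={leaking_rates[j_best]:.1f}, MSE={mse[i_best, j_best]:.4g}")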
There are many differences between the Mackey-Glass data and the stock data that could affect the
optimal parameters for reservoir output. In the case of Mackey-Glass, the solution to Eq. 3.2 was taken
with very small time steps of ∆t = 0.1. This makes the Mackey-Glass input much smoother than the
financial data series, which, from figure 5.2, experiences many shocks. The smaller leaking rate α could
reduce the impact of these sudden shocks by integrating the input updates slowly over time. The smaller
optimal spectral radius implies that a shorter memory is required for the system; financial markets may
be less dependent on price changes in the distant past than on more recent price shifts. However, despite
the differences between the effect of spectral radius and leaking rate on Mackey-Glass input and financial
data input, the reservoir size functions similarly in both: increasing the reservoir size increases the stability
of the MSE. The slope of the MSE surface for reservoir size Nx = 1000 is lower than the surface for
Nx = 100 for both Mackey-Glass input and S&P 500 price change input. In preparing the reservoir model
to forecast S&P 500 price changes, I will maintain the optimal range of parameters, where α ≤ 0.1, ρ ≤ 1,
and the reservoir size Nx is as large as computationally feasible.
Figure 5.8: MSE surface across spectral radius and leaking rate with reservoir size = 100.
Figure 5.9: MSE surface across spectral radius and leaking rate with reservoir size = 400.
Figure 5.10: MSE surface across spectral radius and leaking rate with reservoir size = 700.
Figure 5.11: MSE surface across spectral radius and leaking rate with reservoir size = 1000.
5.4 Reservoir
The reservoir used in forecasting the S&P 500 data is very similar to the reservoir system used in section
3.2. From figure 5.3, the optimal scaling factor is 1, which I set for all reservoirs forecasting the market
index. Based on section 5.3, I use reservoir size Nx = 1000, leaking rate α = 0.1, and spectral radius
ρ = 0.8. For the input, I use the first 1000 data points u0, u1, ..., u1000 in the training sequence, computed
as the relative difference in Eq. 5.3, to drive the reservoir. The first 20 activations are disregarded and
not used in training the network, since they only serve to initialize the reservoir. The weighted output of
the nodes in the reservoir is then trained on the target data, which comes from the relative difference of
the input S&P 500 data, ytarget(n) = u(n). This gives the trained output weight matrix Wout, which is
used to predict the price change one day ahead. Because the market exhibits such complex behavior, it is
hard to expect even a trained reservoir to make highly accurate predictions many days in advance. The
more interesting and practical application is the ability to predict where the market will be one day into
the future, which would give investors time to make a buy or sell decision. Following the output prediction
y(1001), the reservoir is retrained using the 1000 S&P 500 data points shifted forward by one day,
u1, u2, ..., u1001. The retrained reservoir then makes a prediction y(1002) for the following day. This
training process is repeated for 500 runs to produce 500 one-day-ahead predictions from the reservoir,
which are compared to 500 random samples from the distribution. The next section discusses the results
from this reservoir.
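A minimal sketch of one standard way this walk-forward scheme could be implemented is given below, using a leaky-integrator reservoir and a ridge-regression readout. The weight initialization ranges, the regularization coefficient beta, and the input file name are assumptions, and the readout here is trained to map the state reached after input u(n) onto u(n+1); the project's actual code may differ in these details. Retraining a 1000-neuron reservoir 500 times is also computationally heavy.

import numpy as np

def train_and_predict(u, n_x=1000, rho=0.8, alpha=0.1, beta=1e-6, washout=20, seed=0):
    # Drive a leaky-integrator reservoir with the window u[0..T-1], train the linear
    # readout by ridge regression to predict u one step ahead, and return the
    # prediction for the day after the window ends.
    rng = np.random.default_rng(seed)
    T = len(u)

    # Random input and reservoir weights; rescale W to the desired spectral radius.
    w_in = rng.uniform(-0.5, 0.5, size=(n_x, 2))        # bias + one input channel
    w = rng.uniform(-0.5, 0.5, size=(n_x, n_x))
    w *= rho / np.max(np.abs(np.linalg.eigvals(w)))

    # Collect reservoir states while driving with the training window.
    x = np.zeros(n_x)
    states, targets = [], []
    for n in range(T - 1):
        x = (1 - alpha) * x + alpha * np.tanh(w_in @ np.array([1.0, u[n]]) + w @ x)
        if n >= washout:                                 # discard initial transient states
            states.append(np.concatenate(([1.0, u[n]], x)))
            targets.append(u[n + 1])                     # one-day-ahead target
    X = np.array(states).T
    Y = np.array(targets)[None, :]

    # Ridge regression for the readout weights Wout.
    w_out = Y @ X.T @ np.linalg.inv(X @ X.T + beta * np.eye(X.shape[0]))

    # One more update with the last input, then read out tomorrow's prediction.
    x = (1 - alpha) * x + alpha * np.tanh(w_in @ np.array([1.0, u[-1]]) + w @ x)
    return (w_out @ np.concatenate(([1.0, u[-1]], x))).item()

# Walk-forward forecasting: retrain on a 1000-point window shifted one day at a time
# and collect 500 single-day predictions. The series u must hold at least 1500 points.
u = np.loadtxt("sp500_relative_changes.txt")             # hypothetical file name
predictions = [train_and_predict(u[k:k + 1000]) for k in range(500)]
targets = u[1000:1500]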
5.5 Results
Using a random prediction drawn from the underlying distribution of the input data, I can measure the
incremental benefit of a trained reservoir and its predicted output. Figure 5.12 compares the 500 single-
day predictions of the reservoir y(n) to the target data ytarget(n), and figure 5.13 compares 500 random
selections from both the Gaussian and Lorentzian distributions to the same ytarget.
Figure 5.12: Target data ytarget(n) compared to the outcome y(n) from the reservoir.
Figure 5.13: Target data ytarget(n) compared to the predictions from the random guesses of a Lorentzian
distribution and Gaussian distribution.
The random selections from the Lorentzian distribution show very large fluctuations. However, plotting
the histogram of the 500 random selections against the distribution of relative price changes of the S&P
500 shows that the random selections do follow the same distribution as the input data (figure 5.14). In
contrast, the histogram of random selections from the Gaussian distribution does not match the distribution
of the input data in figure 5.15. The random selection from the Lorentzian is thus a good sample of random
values consistent with the distribution of the input, whereas the Gaussian does not approximate the input
distribution.
Figure 5.14: Histogram of the random selections from Lorentzian distribution corresponds with the
distribution of the input data.
Figure 5.15: Histogram of the random selections from the Gaussian distribution does not correspond with
the distribution of the input data.
To better understand whether the reservoir predictions are consistent with the target data, I calculate
the MSE at each output for the reservoir as well as for the random selections from both the Gaussian
and the Lorentzian distributions. Figure 5.16 shows the MSE over the 500 runs. All of the methods
have difficulty predicting large fluctuations in the data, which accounts for many of the apparently
correlated jumps in error. To directly compare the performance of the reservoir to the random selections,
figure 5.17 plots the difference in MSE between the random guess and the reservoir prediction. The bars
above zero mark the instances when the reservoir makes a better prediction than the random guess. Figure
5.17 shows that the reservoir does better than the random selection from the Lorentzian distribution in 274
out of 500 runs. Moreover, when the reservoir outperforms the random selection, it often does so by a
large margin: the reservoir outperforms the random selection by more than 0.001 in 36 runs, while the
random selection outperforms the reservoir by that margin only once.
Figure 5.16: MSE of the reservoir compared to random guessing from a Lorentzian/Cauchy and Gaus-
sian.
Figure 5.17: Difference in MSE between a random guess and a reservoir prediction.
Table 5.4 lists the mean and standard deviation of the MSE for the different methods of prediction.
A low average MSE is meaningful because it implies lower error and predictions closer to the target
values. A low variance means the model is less risky and predicts consistently close to the target value,
which is important for investors who need to understand and minimize the risk in their portfolios. In
table 5.4, the random selection from the Gaussian distribution has the lowest mean and standard deviation.
Since many of the random selections from the Gaussian fall near the center, and the distribution of input
price changes was most concentrated at the center, it makes sense that the Gaussian would have a low
error. The reservoir has nearly the same average MSE and standard deviation. The random selection
from the Lorentzian does the worst in terms of average MSE and standard deviation because large price
changes are predicted more often by the Lorentzian distribution, with its thicker tails.
Mean Std. Dev.
Gaussian 0.0001881 0.0003914
Lorentzian 0.00043667 0.0010615
Reservoir 0.0002125 0.0004747
Table 5.4: MSE values for benchmark and reservoir predictions
Another way to determine whether the output of the reservoir corresponds to the target output is
through the correlation coefficient. Rather than a measure of absolute error like the MSE, the correlation
coefficient takes into account direction as well as magnitude. Table 5.5 shows the correlation coefficient
between each random selection benchmark and the target data series, and between the reservoir output
and the target data series. The correlation coefficient is small for all of the models; the reservoir even has
a negative correlation with the target data set, which is discouraging, as it implies that the target output
and the reservoir output tend to move in opposite directions. However, the correlation coefficients for all
of the models are close to 0 and therefore imply essentially no correlation between the models' time series
and the target time series.
Correlation Coefficient
Gaussian -0.02380788
Lorentzian 0.01243823
Reservoir -0.05075724
Table 5.5: Correlation coefficient between each model's predictions and the target data.
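Given arrays holding the 500 predictions and the 500 targets, the per-day squared error, its mean and standard deviation (table 5.4), and the correlation coefficient (table 5.5) could be computed as in the sketch below; the placeholder arrays merely stand in for the series produced by the earlier sketches.

import numpy as np

rng = np.random.default_rng(1)
# Placeholders standing in for the 500 targets and the two prediction series
# produced earlier (reservoir output and Lorentzian random draws).
targets = rng.standard_cauchy(500) * 0.005
reservoir = rng.standard_cauchy(500) * 0.005
lorentzian = rng.standard_cauchy(500) * 0.005

def evaluate(pred, target):
    # Squared error per day, its mean and standard deviation, and the correlation
    # coefficient between the prediction and target series.
    se = (np.asarray(pred) - np.asarray(target)) ** 2
    return se.mean(), se.std(), np.corrcoef(pred, target)[0, 1]

for name, pred in [("Reservoir", reservoir), ("Lorentzian", lorentzian)]:
    mean_mse, std_mse, corr = evaluate(pred, targets)
    print(f"{name}: mean MSE={mean_mse:.3g}  std={std_mse:.3g}  corr={corr:+.3f}")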
While the reservoir output does not appear to correlate with the target output for stock data, further
studies may show that the reservoir is capable of learning some of the underlying dynamical behavior in
the financial time series. Although the echo state network is far from perfect, this project begins to show
how reservoirs can process financial data. In the last section of the project I focus on a common technique
used in many neural network applications, which combines the output from multiple reservoirs in order to
improve network predictions.
5.6 Committee
In the committee method, multiple reservoirs are trained on the same dataset and the combined output
of all the reservoirs, usually an average, is computed. Because the internal reservoir connections are
randomly generated each time, each reservoir has slightly different dynamics and performance. This
variance in reservoir dynamics may be useful in determining the output prediction that best fits the target.
There are other ways to combine output from multiple reservoirs: a ranking algorithm, which identifies
the reservoirs with the best performance, could weight those reservoirs more heavily while removing
outlier reservoirs that perform poorly. In this study, I focus on a simple committee formed by averaging
reservoirs and leave ranking algorithms for future studies.
In studying committees, I follow the same methodology for reservoir creation as in section 5.5, using
the same global parameters for the reservoir specification. I create a single reservoir given the 1000
previous data points, dismissing the first 20 states used to initialize the network. To create a committee,
I combine the output of 100 such reservoirs, each given the same specifications as the single reservoir.
The committee output is a simple average of the outputs of the 100 reservoirs within the committee. The
committee reservoirs each make one-day-ahead predictions as in section 5.5, and their predictions are then
combined to make the committee prediction
ycom(n) = (1/100) Σi yi(n).   (5.7)
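The committee average in Eq. 5.7 could be computed as in the short sketch below, which assumes a single-reservoir forecasting routine such as the hypothetical train_and_predict function from the sketch in section 5.4.

import numpy as np

def committee_prediction(window, predict_one, n_members=100):
    # Eq. 5.7: average the one-day-ahead forecasts of n_members independently
    # initialized reservoirs trained on the same input window. predict_one(window,
    # seed=...) should build, train, and query one reservoir, e.g. the
    # train_and_predict sketch from section 5.4.
    forecasts = [predict_one(window, seed=i) for i in range(n_members)]
    return float(np.mean(forecasts))

# Example usage with the section 5.4 sketch (window of 1000 relative price changes):
# y_committee = committee_prediction(u[k:k + 1000], train_and_predict)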
The predictions of the committee reservoir, compared to the random guess from the Lorentzian
distribution and the single reservoir, are shown in figure 5.18. The committee output from 100 reservoirs
is almost exactly the same as the output of the single reservoir. This demonstrates that there is no added
benefit to computing committees given the parameters and input I am using in this project. Moving
forward with the project, I reject the use of committees, since single reservoirs give very similar
performance at a much lower computational cost.
Figure 5.18: The predictions for the single reservoir, the committee, and the random guess from a
Lorentzian distribution.
Chapter 6
Conclusion
Studying the echo state network, I was able to understand the training process for the network with
regard to two different data sources: the Mackey-Glass system and financial systems.
6.1 Analysis of Results
The initial study of the Mackey-Glass system served to help me understand basic reservoir dynamics
and parameters. I studied the impact of the bias input, which has not been widely discussed in previous
literature, and concluded that the standard bias scaling constant of 1 used across most of the literature falls
within the range of values that reduce error in the reservoir. Studying the impact of the output weight
matrix on error, I saw no correlation between the mean of the output matrix and the error. This does not
conclusively imply that Wout has no effect on the MSE, but rather that the regularization coefficient
penalizes any output matrix with large weights, so that such solutions do not appear. Regardless, this is
important to note because it confirms that the training process removes output weight matrices that would
easily skew the MSE results. In optimizing the parameters to minimize the MSE, a grid search of the error
surface shows that error is minimized when the leaking rate α > 0.3. This error surface, however, differs
significantly across systems, as I saw in the second set of studies.
In the next study, I applied this knowledge about echo state networks to the S&P 500 daily closing
prices. In order to benchmark the results, I compared the distribution of price changes to a Gaussian, a
Lorentzian, and a Voigt profile. While many theorists and academics claim that price fluctuations follow
a random walk with independent, identically distributed steps, this was not the conclusion I reached in
this project: there were many more abrupt price changes in the tails than the Gaussian would predict.
Both the Lorentzian and the Voigt profile were better fits to the input data. Further studies could seek to
understand whether chaos determines the Lorentzian distribution of relative price changes and how it may
be useful in understanding determinism in finance. Studies of this nature across other stock indices would
be useful to bolster this research. A market described by a Lorentzian implies that stock prices may be
more variable and more risky than under the assumption of a market described by a Gaussian; price
fluctuations tend to be greater in both directions, implying both big profits and big losses.
In terms of forecasting the market, the reservoir was not able to significantly outperform random
selections from known distributions. In comparison to the random selection from the Lorentzian distribution,
the reservoir gave better results in a small majority of cases when comparing the MSE (figure 5.17).
Moreover, the reservoir made many predictions that beat the random model by a larger margin. Another
concern may be the quality of the Lorentzian fit to the distribution of the data. The difficulty in forecasting
may also lie in the financial time series itself: stock market data is incredibly complex, and any attempt at
modeling and forecasting the market will require much more sophisticated statistical techniques. However,
the study of the S&P 500 index has opened up many questions about the chaotic nature of stocks as well
as the specific capabilities of echo state networks and reservoir computing. In the grid search to optimize
reservoir parameters with S&P 500 data, the differences in the MSE surface plots imply distinct reservoir
behavior for the Mackey-Glass and financial systems. Further studies could examine how different
timescales and other features of stock data affect the network output.
The optimal range for the spectral radius has implications for understanding how much memory is
needed to characterize the dynamics of the system. Studying commonly used techniques to improve
network output, I found that committees of multiple reservoirs did not improve performance compared
to that of a single reservoir, and I concluded that the computational cost was not worth the marginal
difference in prediction. The relatively small effect on performance could be the result of the large
reservoir size: large reservoirs may have enough memory capacity to capture all of the system's dynamics.
Additional studies of committees would nevertheless be useful, particularly to examine other methods of
combining individual reservoir outputs. Further studies to calibrate the benchmark distribution would also
help create a more rigorous benchmark for the reservoir, and other studies could look at alternative ways
of benchmarking the data. Another interesting application would be to create a trading algorithm to test
and observe the market impact of the reservoir. It would also be useful to use the reservoir to make
Boolean predictions of simply whether the market will rise or fall, rather than the exact value of the
change; this would provide practical value to investors.
One of the key areas for future study involves finding ways to optimize the parameters of the reservoir.
In this study, the coarse grid search highlighted the often interconnected effects of the reservoir parameters
and therefore the need for more rigorous algorithms to select parameters.
6.2 Conclusion
In conclusion, the echo state network provides an interesting paradigm for understanding complex time
series. It shows promise for helping to understand complex systems, but considerable additional research
is needed before it becomes a viable option for forecasting financial markets.
Although the results from this project do not conclusively show that echo state networks have significant
forecasting ability for stock data, the project is an interesting exercise in learning about reservoir computing
and its application to complex systems.
Bibliography
[1] ER Caianiello. “Outline of a theory of thought-processes and thinking machines”. In: Journal of
Theoretical Biology 1.2 (1961), pp. 204–235.
[2] JH Choi, MK Lee, and MW Rhee. “Trading S&P 500 stock index futures using a neural network”.
In: Proceedings of the third annual international conference on artificial intelligence applications
on wall street. 1995, pp. 63–72.
[3] Kenji Doya. “Bifurcations in the learning of recurrent neural networks 3”. In: learning (RTRL) 3
(1992), p. 17.
[4] Eugene F Fama. “Mandelbrot and the stable Paretian hypothesis”. In: Journal of Business (1963),
pp. 420–429.
[5] V. Flunkert and E. Scholl. pydelay – a python tool for solving delay differential equations. arXiv:0911.1633 [nlin.CD].
2009.
[6] Mathieu Galtier and Gilles Wainrib. “A local Echo State Property through the largest Lyapunov
exponent”. In: arXiv preprint arXiv:1402.1619 (2014).
[7] L. Glass and M. Mackey. “Mackey-Glass equation”. In: Scholarpedia 5.3 (2010). revision 91447,
p. 6908.
[8] Donald Olding Hebb. The organization of behavior: A neuropsychological theory. Psychology
Press, 2005.
[9] John J Hopfield. “Neural networks and physical systems with emergent collective computational
abilities”. In: Proceedings of the national academy of sciences 79.8 (1982), pp. 2554–2558.
[10] David A. Hsieh. “Chaos and Nonlinear Dynamics: Application to Financial Markets”. In: The
Journal of Finance 46.5 (1991), pp. 1839–1877. issn: 00221082.
[11] Iulian Ilies et al. “Stepping forward through echoes of the past: forecasting with echo state net-
works”. In: URL: http://www.neural-forecasting-competition.com/downloads/methods/27-NN3 Herbert Jaeger report.pdf (2007).
[12] Herbert Jaeger. Long short-term memory in echo state networks: Details of a simulation study.
Tech. rep. Technical Report, 2012.
[13] Herbert Jaeger. “The echo state approach to analysing and training recurrent neural networks-
with an erratum note”. In: Bonn, Germany: German National Research Center for Information
Technology GMD Technical Report 148 (2001), p. 34.
[14] Herbert Jaeger and Harald Haas. “Harnessing nonlinearity: Predicting chaotic systems and saving
energy in wireless communication”. In: Science 304.5667 (2004), pp. 78–80.
[15] Ken-ichi Kamijo and Tetsuji Tanigawa. “Stock price pattern recognition-a recurrent neural net-
work approach”. In: Neural Networks, 1990., 1990 IJCNN International Joint Conference on.
IEEE. 1990, pp. 215–221.
[16] Takashi Kimoto et al. “Stock market prediction system with modular neural networks”. In: Neural
Networks, 1990., 1990 IJCNN International Joint Conference on. IEEE. 1990, pp. 1–6.
[17] Mantas Lukosevicius. “A practical guide to applying echo state networks”. In: Neural Networks:
Tricks of the Trade. Springer, 2012, pp. 659–686.
[18] Mantas Lukosevicius. Educational Echo State Network implementations. http://minds.jacobs-university.de/mantas/code.
[19] Wolfgang Maass, Thomas Natschlager, and Henry Markram. “Real-time computing without stable
states: A new framework for neural computation based on perturbations”. In: Neural computation
14.11 (2002), pp. 2531–2560.
[20] Michael C Mackey, Leon Glass, et al. “Oscillation and chaos in physiological control systems”.
In: Science 197.4300 (1977), pp. 287–289.
[21] Benoit B Mandelbrot. The variation of certain speculative prices. Springer, 1997.
[22] G. Manjunath and H. Jaeger. “Echo State Property Linked to an Input: Exploring a Fundamental
Characteristic of Recurrent Neural Networks”. In: Neural Comput. 25.3 (Mar. 2013), pp. 671–696.
issn: 0899-7667. doi: 10.1162/NECO_a_00411. url: http://dx.doi.org/10.1162/NECO_a_00411.
[23] Warren S McCulloch and Walter Pitts. “A logical calculus of the ideas immanent in nervous ac-
tivity”. In: The bulletin of mathematical biophysics 5.4 (1943), pp. 115–133.
[24] Philip Mirowski. “From Mandelbrot to Chaos in Economic Theory”. In: Southern Economic Jour-
nal 57.2 (Oct. 1990), p. 289. url: http://search.proquest.com/docview/212099581?
accountid=10598.
[25] Peter Mulieri. “Stock market price behavior: Random walks and nonlinear dynamics”. In: Journal
of Technical Analysis 37 (1991), pp. 28–34.
[26] NN3 Artificial Neural Network and Computational Intelligence Forecasting Challenge.
[27] M. F. M. Osborne. “Brownian Motion in the Stock Market”. In: Operations Research 7.2 (1959),
pp. 145–173. issn: 0030364X.
[28] Peter D. Praetz. “The Distribution of Share Price Changes”. In: The Journal of Business 45.1
(1972), pp. 49–55. issn: 00219398.
[29] Frank Rosenblatt. “The perceptron: a probabilistic model for information storage and organization
in the brain.” In: Psychological review 65.6 (1958), p. 386.
[30] Randall S Sexton, Robert E Dorsey, and John D Johnson. “Toward global optimization of neural
networks: a comparison of the genetic algorithm and backpropagation”. In: Decision Support
Systems 22.2 (1998), pp. 171–185.
[31] Konrad Stanek. “Reservoir computing in financial forecasting with committee methods”. MA the-
sis. Technical University of Denmark, 2011.
[32] Fabian Triefenbach et al. “Phoneme recognition with large hierarchical reservoirs”. In: Advances
in neural information processing systems. 2010, pp. 2307–2315.
[33] Robert R Trippi and Duane DeSieno. “Trading equity index futures with a neural network”. In:
The Journal of Portfolio Management 19.1 (1992), pp. 27–33.
[34] Ray Tsaih, Yenshan Hsu, and Charles C Lai. “Forecasting S&P 500 stock index futures with a
hybrid AI system”. In: Decision Support Systems 23.2 (1998), pp. 161–174.
[35] R.J. Van Eyden. The Application of Neural Networks in the Forecasting of Share Prices. Finance
& Technology Pub, 1996. isbn: 9780965133203. url: http://books.google.com/books?id=
wjXnAAAACAAJ.
[36] Juha Vesanto. “Using the SOM and local models in time-series prediction”. In: Proc. Workshop
on Self-Organizing Maps 1997. Citeseer. 1997, pp. 209–214.
[37] Paul Werbos. “Beyond regression: New tools for prediction and analysis in the behavioral sci-
ences”. In: (1974).
[38] Yahoo Finance.