Value-at-Risk Model Combination Using Artificial Neural
Networks ∗
Yan Liu
Emory University
August 2005
Abstract
Value at Risk (VaR) has become the industry standard for measuring market risk. However,
the selection of a VaR model is controversial. Simulation results indicate that Historical Simulation
(HS) has a significant positive bias, while GARCH(1,1) has a significant negative bias. Moreover,
HS adapts to structural change slowly but stably, while GARCH adapts to structural breaks rapidly
but less stably. Model selection is therefore often unstable and causes high variability in the final
estimates. This paper proposes VaR forecast combination using Artificial Neural Networks (ANNs)
instead of model selection. Based on mean loss comparison, the violation ratio and Christoffersen's
conditional coverage test, both the simulation and real-data results show that the ANN combinations
have superior forecast performance to the individual VaR models.
JEL classification: C22 C32 C45 C50
Key Words: Value-at-Risk, Artificial Neural Networks, Genetic Algorithm
1 Introduction
There are three major types of risk in financial markets: credit risk, liquidity risk and market
risk. For trading purposes, market risk is the most important to consider. Value at Risk
(VaR) was introduced by J.P. Morgan (1996) and has become the standard measure for quantifying
market risk.
VaR is generally defined as the maximum possible loss for a given position or portfolio, at
a known confidence level, over a specific time horizon. The measure can be used by financial
institutions to assess their risks or by a regulatory committee to set margin requirements. Both
purposes lead to the same VaR measure, even though the underlying concepts differ. In other
words, VaR is used to help financial institutions stay in business after a catastrophic event.
There is a variety of approaches to estimating VaR, ranging from parametric (RiskMetrics,
GARCH, etc.) to semi-parametric (Extreme Value Theory, CAViaR, etc.) and non-parametric
(Historical Simulation and its variants, etc.). In practice, one faces the important issue of
how to choose the "best" model among so many candidates. Since different methodologies
can yield different VaR measures for the same portfolio, some models must lead to significant
errors in risk measurement. The risk of choosing an inappropriate model is called "model
risk" and is an important question left to the risk manager. As a result, VaR model comparison
becomes a very important issue. An abundant literature addresses this problem, including
Christoffersen et al. (1998, 2001, 2004), Sarma et al. (2003) and Lopez (1998). With so many
model criteria, the selection of VaR models becomes even more complicated.
There are several difficulties in VaR model selection.
First, Aiolfi and Timmermann (2004) and Hendry and Clements (2002) indicate that individual
models may be differently affected by non-stationarities such as structural breaks and volatility
clustering. For example, if a structural change occurs today and volatility increases sharply,
Historical Simulation (HS) would not produce a VaR prediction for tomorrow much different from
today's, because HS is based on empirical quantiles. A GARCH model, by contrast, would produce
a more volatile VaR prediction, since it captures the volatility clustering property of financial
time series. However, GARCH is only affected by today's increased volatility temporarily, and
its VaR prediction returns to the previous level when volatility decreases. HS, on the other hand,
adapts to changes slowly, but it is more stable and its parameter estimation is more precise. In
general, one would expect GARCH to forecast better in the short run, while HS has a better
chance of winning the long-run game.
In the real world, structural breaks are very difficult to detect in "real time", so the choice
between HS and GARCH is difficult.
Second, as stated by Clemen (1989) and Stock and Watson (2001, 2004), all individual models
can be viewed as misspecified. Forecasting models are thus local approximations, and the same
model is unlikely to dominate all others at all points in time. The best model changes over time,
and we can hardly track it based on past performance. Chatfield (1995) and Hoeting et al. (1999)
raise the "model uncertainty" problem in selecting among multiple forecasting models. Here the
uncertainty concerns which model minimizes the appropriate loss function. Our task is often to
find the best VaR forecasting model, the one that minimizes the "tick" loss function, not to find
the unique correct conditional quantile model. In fact, the true model often has worse forecast
ability than the best misspecified model. Thus "model uncertainty" is an important issue in VaR
model selection. Bao, Lee and Saltoglu (2004) showed that the forecasting performance of the
VaR models they considered varies across the periods before, during and after a crisis.
Given the difficulty of selecting the best VaR model, in this work I look for the best VaR model
by combination rather than by selection. The theory of combining forecasts was originally
developed by Bates and Granger (1969). From a theoretical viewpoint, forecast combination can
be seen as a method to pool the information sets contained in the individual forecast models.
Combining forecasts can diversify selection risk, much like portfolio diversification. On average,
combination can absorb the different adaptability of VaR models and diversify the forecast error
uncertainty. Since our purpose is to find the best model, not the correct model, combining VaR
models is acceptable to practitioners. A large empirical literature supports forecast combination
in areas as diverse as forecasting GDP, inflation, stock prices and city populations. Recent
empirical work such as Stock and Watson (1999, 2001) has further confirmed the accuracy gains
from forecast combination, and Timmermann (2004) provides a survey of the field. However,
little empirical work has been done on conditional quantile forecasting. Giacomini and
Komunjer (2003) construct a conditional quantile forecast encompassing test for the evaluation
and combination of conditional quantile forecasts, but their work concentrates on the encompassing
test rather than on combining multiple forecasts. As they state, combination is beneficial for
VaR models because VaR is a small-coverage quantile model that is sensitive to the few
observations in the density tail. Combining forecasts from different information sets can make
the forecast more robust.
In this work, I propose to combine the VaR models using nonlinear Artificial Neural
Networks (ANNs). Linear combination and average combination are two special cases of the ANN
combination. Because of the asymmetric, non-differentiable 'tick' loss function, I apply a Genetic
Algorithm (GA) to train the ANNs. Applications to simulated and real data support the
VaR combining methodology, and a comparison between individual models and ANN combinations is
provided.
The remainder of the paper is organized as follows: Section 2 describes the existing VaR
methodologies and their properties via Monte Carlo simulation. Section 3 introduces the ANN
combination methods and the use of the GA to train the ANNs. Section 4 introduces three backtesting
criteria to compare the VaR models. Section 5 compares the performance of individual models
and the ANN combination model using simulated data. Section 6 applies the VaR combination to
real data and compares its performance with the individual models. Section 7 concludes the paper.
2 VaR Models
Under a probabilistic framework, at time index t, we are interested in the risk of a financial
position over the next l periods. Let ∆V(l) be the change in asset value from time t to t + l. This
quantity is measured in dollars and is a random variable at time t. Denote the CDF of ∆V(l)
by F_l(x). We define the VaR of a long position over the time horizon l with probability p as

p = Pr[∆V(l) ≤ VaR] = F_l(VaR)    (1)

For a long position, a loss occurs when ∆V(l) < 0, so the VaR defined in (1) takes a negative
value because p is usually very small. Eq. (1) can be interpreted as follows: the probability that
the holder encounters a loss greater than or equal to VaR over the time horizon l is p.
For a short position, the holder suffers a loss when ∆V(l) > 0. The VaR is defined as

p = Pr[∆V(l) ≥ VaR] = 1 − F_l(VaR)    (2)

The VaR of a short position is typically a positive value.
Since p is usually very small, VaR concerns the tail behavior of the CDF F_l(x). For a long
position the left tail of F_l(x) is important; for a short position the right tail is important.
If the CDF is known, the VaR is simply its pth quantile. In practice the CDF is unknown, so
studies of VaR are essentially concerned with estimating the CDF, especially its tail behavior.
In general, the calculation of VaR involves the following factors:
1. the probability p;
2. the time horizon l;
3. the CDF F_l(x) or its quantile.
It is easy to see that the CDF F_l(x) is the central part of VaR modeling: different VaR
methods are in fact different approaches to CDF estimation. In this paper, two popular methods
are considered, the GARCH approach and Historical Simulation.
2.1 GARCH Approach
Consider the log return r_t of an asset. The mean and volatility equations for r_t can be described
by the following ARMA-GARCH process:

r_t = \phi_0 + \sum_{i=1}^{p} \phi_i r_{t-i} - \sum_{j=1}^{q} \theta_j a_{t-j} + a_t,    a_t = \sigma_t \varepsilon_t    (3)

\sigma_t^2 = \alpha_0 + \sum_{i=1}^{u} \alpha_i a_{t-i}^2 + \sum_{j=1}^{v} \beta_j \sigma_{t-j}^2    (4)
The above GARCH process can be used to obtain one-period forecasts of the conditional mean
and conditional variance, denoted \hat r_t(1) and \hat\sigma_t^2(1). Using these two fitted values and
the pth quantile of the return's standardized conditional distribution, we can obtain the VaR. For
example, if \varepsilon_t is Gaussian, then

VaR_{0.05} = \hat r_t(1) - 1.65 \hat\sigma_t(1)    (5)

If \varepsilon_t follows a Student-t distribution with v degrees of freedom, then

VaR_p = \hat r_t(1) - \frac{t_v(p) \hat\sigma_t(1)}{\sqrt{v/(v-2)}}    (6)

where t_v(p) is the positive pth critical value of the t-distribution; the denominator rescales the
t quantile to unit variance.
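As a concrete illustration, formulas (5) and (6) map the two GARCH forecasts directly into a VaR number. The sketch below assumes the mean forecast, volatility forecast and critical value are supplied by the user; it is not tied to any particular GARCH estimation routine:

```python
import math

def garch_var(r_hat, sigma_hat, crit, df=None):
    """One-step-ahead VaR from GARCH forecasts (Eqs. 5-6).

    crit is the positive critical value: 1.65 for the Gaussian 5% case,
    or t_v(p) for a Student-t with df = v degrees of freedom, in which
    case the quantile is rescaled to unit variance by sqrt(v/(v-2)).
    """
    if df is None:
        return r_hat - crit * sigma_hat                          # Eq. (5)
    return r_hat - crit * sigma_hat / math.sqrt(df / (df - 2.0))  # Eq. (6)
```

For example, `garch_var(0.0, 1.2, 1.65)` gives roughly -1.98, the 5% Gaussian VaR for a zero mean forecast and volatility forecast 1.2.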
2.2 Historical Simulation
Historical Simulation (HS) is very simple. First define the portfolio return on day t + 1 as
R_{p,t+1}. The HS technique assumes that the distribution of tomorrow's portfolio return, R_{p,t+1}, is
identical to the empirical distribution of the past m periods' portfolio returns, {R_{p,t+1-\tau}}_{\tau=1}^{m}. The
VaR with coverage rate p is simply the 100pth percentile of this sequence. In practice, we sort
the returns {R_{p,t+1-\tau}}_{\tau=1}^{m} in ascending order; the 100pth percentile of the sorted sequence
is VaR^p_{p,t+1}. If the VaR falls between two observations, linear interpolation can be used.
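The sorting-and-interpolation step above can be sketched as follows. The percentile position rule used here is one common convention; implementations differ slightly in how they index the order statistics:

```python
def hs_var(returns, p=0.05):
    """Historical Simulation VaR: the 100p-th percentile of the past
    returns, with linear interpolation between order statistics."""
    s = sorted(returns)            # ascending order
    k = p * (len(s) - 1)           # fractional position of the percentile
    lo = int(k)
    frac = k - lo
    if lo + 1 < len(s):
        return s[lo] + frac * (s[lo + 1] - s[lo])  # linear interpolation
    return s[-1]
```

With a rolling window of m past returns, tomorrow's HS VaR is simply `hs_var` applied to the most recent m observations.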
2.3 VaR Properties of GARCH and HS
2.3.1 VaR Biases
To show the necessity of VaR model combination, I conduct a simple Monte Carlo simulation
demonstrating that GARCH and HS produce significantly biased VaR forecasts. In the following
numerical experiment, I use the t-distribution with degrees of freedom DF. DF determines the
tail behavior of the distribution: the smaller the DF, the heavier the tail. The normal
distribution corresponds to the t-distribution with DF tending to infinity.
I create 5,000 samples by simulating the t-distribution with DF of 1, 3, 5, 10, 50, 200 and
1,000. Each sample includes 1,000 observations. The VaR is then estimated for each sample by
Historical Simulation and by GARCH(1,1), so the VaR calculation is repeated 5,000 times. The
coverage rate is set to 5%. The estimation is conducted with 100, 200, 500 and 1,000
observations. After the calculations, the bias is computed as

Bias = \overline{VaR} - VaR^*    (7)

where VaR^* is the theoretical VaR, obtained from the inverse CDF of the t-distribution at
coverage rate 5%, and \overline{VaR} is the empirical mean of the simulated VaR estimates. Table 1
shows the bias for different sample sizes and different DF.
The results show that, in general, GARCH is subject to a significant negative bias, while
HS is subject to a significant positive bias, which accords with the findings of Inui, Kijima and
Kitano (2003). For both GARCH and HS, the bias increases as the distribution tail becomes heavier.
The bias of HS increases as the sample size becomes smaller; GARCH does not share this
characteristic. The opposite signs of the GARCH and HS biases imply that a combination of
GARCH and HS might decrease the bias of the VaR estimate.
2.3.2 VaR Prediction under Structural Breaks
As stated in the introduction, GARCH and HS are expected to be differently affected by structural
breaks. In this section I conduct a Monte Carlo simulation to show how they adapt to a structural
break.
In the following numerical experiment, the data come from two data generating processes (DGPs).
The first 1,000 observations are generated by the GARCH(1,1) process

r_t = a_t,    \sigma_t^2 = 0.0188 + 0.259 a_{t-1}^2 + 0.7217 \sigma_{t-1}^2,    a_t = \sigma_t \varepsilon_t,    \varepsilon_t \sim t(3)

The next 500 observations are generated by the GARCH(1,1) process

r_t = a_t,    \sigma_t^2 = 0.0188 + 0.059 a_{t-1}^2 + 0.9217 \sigma_{t-1}^2,    a_t = \sigma_t \varepsilon_t,    \varepsilon_t \sim t(10000)

The purpose of using two different DGPs is to create a structural break starting at the
1,001st observation. From the parameters of the two DGPs, the first 1,000 observations are
clearly less volatile, so their VaR forecasts should be higher than those of the next 500
observations.
The first 500 observations are used to estimate the parameters of the GARCH model; the
parameter estimates are then fixed and used to forecast the VaR of the next 1,000 observations,
giving 1,000 GARCH VaR forecasts. The first 500 of these forecasts fall under the first DGP and
the last 500 under the second DGP.
The window size for HS is also 500; since the whole sample size is 1,500, we likewise
obtain 1,000 HS VaR forecasts.
This process is repeated 5,000 times and the resulting VaRs are averaged. The average
sample standard deviation of the first 1,000 observations is 0.4787; that of the last 500
observations is 0.9257.
Figure 1 shows the behavior of GARCH and HS under the structural break. The break occurs at
time t = 501. GARCH reacts to the break rapidly, and its VaR forecast drops sharply at the
beginning of the break. However, since the GARCH VaR forecast is driven by the most recent
returns, it quickly returns to the previous level. HS, on the contrary, reacts smoothly and
steadily: it does not drop very fast, but it decreases steadily and reflects the structural
break in the long run.
The simulation results confirm that GARCH adapts to structural breaks rapidly but is less
stable, while HS adapts to structural change slowly but is more stable. Each has its advantages
and disadvantages, and intuitively they can compensate for each other. A combination of GARCH
and HS might therefore absorb their advantages, diversify their disadvantages, and ultimately
produce a better VaR forecast.
3 Artificial Neural Networks Combination
From the above discussion, we have seen that combining two or more VaR models is appealing.
Hornik, Stinchcombe and White (1989, 1990) demonstrated that ANNs can approximate a large class
of functions arbitrarily well. ANNs are therefore ideally suited to forecast combination when
the optimal combination of individual forecasts is potentially nonlinear. Since GARCH and HS are
different types of models, we have reason to expect that the underlying combination relationship
is nonlinear. In this section I present how to apply Artificial Neural Networks (ANNs) to combine
the VaR models nonlinearly.
Donaldson and Kamstra (1996) applied ANNs to combine time series forecasts of stock
market volatility. In this paper I use ANNs to combine VaR models, which is essentially
a nonlinear quantile regression model.
3.1 ANNs Combination Architecture
An artificial neural network is a mathematical model for information processing based on a
connectionist approach to computation. ANNs are a wide class of flexible nonlinear regression
models.
In this paper the Multilayer Perceptron (MLP) network is applied. An MLP consists of an input
layer, one or more hidden layers, and an output layer. White (1992) shows that MLPs are
general-purpose, flexible, nonlinear models that can approximate virtually any function to any
desired degree of accuracy, given enough hidden neurons and enough data. In other words, MLPs
are universal approximators.
To explain how the neural network works, I use a simple ANN structure with only
one hidden layer and one output. I will use the following notation,
xi = input values
αj = bias for hidden layer
hj = hidden neuron values
c = bias for output layer
y = predicted output value
t = target output value
wi = weight from input to hidden layer
g = activation function from input layer to hidden layer
dj = weight from hidden layer to output
f = activation function from hidden layer to output layer
9
The hidden neuron values and the output value are calculated by the following nonlinear equations,

y = f\left(c + \sum_j d_j h_j\right)    (8)

h_j = g\left(\alpha_j + \sum_i w_i x_i\right)    (9)
Given the independent variables and the dependent variable, an ANN designer must determine the
number of hidden layers, the number of hidden neurons and the activation functions. White (1990)
proved that, provided a sufficient number of nodes are placed in the first layer of the ANN,
higher layers are not needed to establish a satisfactory connection between the raw inputs and
the final output. Thus in this paper I adopt a single-hidden-layer ANN to combine the VaR models.
Figure 2 shows the structure of the ANN combination model. {f_{1t}, f_{2t}} denote the individual
forecasts; {z_{1t}, z_{2t}} denote the normalized individual VaR forecasts; {h_{1t}, h_{2t}} denote the hidden
neurons (the number of hidden neurons need not be two and is determined by validation);
1 denotes the bias unit; F_t denotes the nonlinear combined VaR forecast. The relationships among
these variables are,

F_t = \beta_0 + \sum_{j=1}^{2} \beta_j h_{jt} + \sum_{i=1}^{2} \delta_i f_{it}    (10)

h_{jt} = \tanh\left(\alpha_{0j} + \sum_{l=1}^{2} \alpha_{lj} z_{lt}\right)    (11)

z_{lt} = (f_{lt} - \bar f_{lt}) / S_{f_{lt}}    (12)

\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}    (13)

where \bar f_{lt} is the in-sample mean of f_{lt} and S_{f_{lt}} is its in-sample standard deviation.
Equation (12) normalizes the individual VaR forecasts to serve as inputs to the hidden neurons.
Equation (11) passes a linear combination of the normalized forecasts through the hyperbolic
tangent activation function tanh(x), defined in Equation (13), to obtain the hidden neurons.
Finally, Equation (10) maps the hidden neurons to the final output via a linear transfer function.
When all \beta_j = 0, this nonlinear function reduces to a linear quantile regression combination.
The ANN combination is thus a generalized form of these combination methods.
Theoretical results indicate that, given enough hidden units, a network like the one in Figure 2
can approximate any reasonable function to any required degree of accuracy. In other words, any
such function can be expressed as a linear combination of tanh functions: tanh is a universal
basis function.
What we need to estimate are the weights \alpha_{lj}, \beta_j, \delta_i and the number of hidden neurons n. They
are estimated by minimizing the 'tick' loss function defined in the next section.
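Equations (10)-(13) amount to a single forward pass through the network. The sketch below implements exactly those equations; the weight names mirror the paper's notation, and all values passed in are hypothetical:

```python
import math

def ann_combine(f, f_mean, f_std, alpha, beta, delta, beta0):
    """Forward pass of the combining network, Eqs. (10)-(13).

    f      : individual VaR forecasts [f_1t, f_2t, ...]
    alpha  : per-hidden-neuron pairs (alpha_0j, [alpha_1j, alpha_2j, ...])
    beta   : hidden-to-output weights; delta: direct linear weights
    """
    # Eq. (12): normalize forecasts by in-sample mean and std
    z = [(fi - m) / s for fi, m, s in zip(f, f_mean, f_std)]
    # Eq. (11): hidden neurons via the tanh activation of Eq. (13)
    h = [math.tanh(a0 + sum(a * zi for a, zi in zip(aw, z)))
         for a0, aw in alpha]
    # Eq. (10): linear output layer plus direct linear terms
    return (beta0 + sum(b * hj for b, hj in zip(beta, h))
                  + sum(d * fi for d, fi in zip(delta, f)))
```

Setting every beta_j to zero recovers the linear combination \beta_0 + \sum_i \delta_i f_{it}, as noted above.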
In this paper the candidate VaR forecasts are the inputs and the combined VaR forecast is the
output. The difference between the VaR-ANN model and a standard ANN is that the standard ANN
deals with mean forecasts and its cost function is symmetric and differentiable, while for
VaR models we are interested in the quantile forecast and the cost function is the asymmetric,
non-differentiable 'tick' loss. Gradient-based Standard Backpropagation (SBP) is therefore
not suitable for training the quantile ANN. Here I use a Genetic Algorithm to train the VaR-ANN
model.
3.2 Genetic Algorithm (GA)
A Genetic Algorithm is a method for solving optimization problems based on natural selection,
the process that drives biological evolution. The GA was developed by Holland (1962, 1975), and
Beasley et al. (1993) provide an excellent introduction.
A GA maintains an initial population of candidate solutions and evaluates the quality of each
candidate according to a specific cost function. The GA then repeatedly modifies the population
of individual solutions. At each step, it selects individuals from the current population to be
parents and uses them to produce the children of the next generation. Over successive
generations, the population "evolves" toward an optimal solution.
The following outline summarizes how the genetic algorithm works:
1. Create an initial population, usually at random, though it can be specified by the designer.
2. Create a sequence of new populations, or generations. The individuals in the current
generation are used to create the next generation through the following steps:
1) Score each member of the current population by computing its fitness value according to the
cost function.
2) Scale the raw fitness scores to convert them into a more usable range of values.
3) Select parents based on their fitness.
4) Produce children from the parents.
Reproduction: members of the population are selected for the new population with probabilities
proportional to their fitness.
Crossover: pairs of chromosomes in the new population are chosen at random to exchange genetic
material (their bits) in a mating operation called crossover, producing two new chromosomes
that replace the parents.
Mutation: randomly chosen bits in the offspring are flipped; this is called mutation.
5) Replace the current population with the children to form the next generation.
3. Repeat the above procedure until one of the stopping criteria is met.
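To make the outline concrete, here is a minimal real-coded GA sketch. It uses truncation selection, uniform crossover and Gaussian mutation, which is one of many possible variants, not the paper's exact implementation; all parameter values are illustrative:

```python
import random

def genetic_minimize(cost, dim, pop_size=30, generations=60,
                     cx_rate=0.5, mut_rate=0.8, sigma=0.3, seed=0):
    """Evolve a population of real-valued weight vectors to minimize `cost`."""
    rng = random.Random(seed)
    # step 1: random initial population
    pop = [[rng.uniform(-1.0, 1.0) for _ in range(dim)]
           for _ in range(pop_size)]
    best = min(pop, key=cost)
    for _ in range(generations):
        scored = sorted(pop, key=cost)            # step 2.1: score members
        best = min(best, scored[0], key=cost)     # keep the best-so-far
        parents = scored[:pop_size // 2]          # step 2.3: select fittest
        children = []
        while len(children) < pop_size:           # step 2.4: produce children
            a, b = rng.sample(parents, 2)
            child = [ai if rng.random() < cx_rate else bi
                     for ai, bi in zip(a, b)]     # uniform crossover
            if rng.random() < mut_rate:           # Gaussian mutation
                i = rng.randrange(dim)
                child[i] += rng.gauss(0.0, sigma)
            children.append(child)
        pop = children                            # step 2.5: replace population
    return best
```

For the VaR-ANN, `cost` would be the 'tick' loss of the combined forecast as a function of the network weights; here any objective works, e.g. minimizing `sum(x*x for x in w)` drives the population toward the origin.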
A GA can solve a variety of optimization problems that cannot be handled by standard
gradient-based algorithms, including problems in which the objective function is
discontinuous, non-differentiable, stochastic, or highly nonlinear. Given the properties of the
quantile ANN, a GA is used to train it.
4 Backtesting Criteria
To verify the reliability of the VaR combination, I choose the following backtesting criteria
to compare the forecast performance of the combining models and the individual models.
4.1 Loss Functions
The Basel Committee on Banking Supervision (1996a) indicates that the magnitudes as well as the
number of exceptions are a matter of regulatory concern. Lopez (1998) incorporated this concern
into a set of loss functions, whose general form is:

L = \frac{1}{n} \sum_{i=1}^{n} C_{t+i}    (14)

C_{t+1} = f(L_{t+1}, VaR_t)  if L_{t+1} < VaR_t
C_{t+1} = g(L_{t+1}, VaR_t)  if L_{t+1} ≥ VaR_t    (15)

where f(L_{t+1}, VaR_t) ≥ g(L_{t+1}, VaR_t). The following are the two common VaR loss functions
considered in this paper:
4.1.1 Binomial Loss
C_{t+1} = 1  if L_{t+1} < VaR_t
C_{t+1} = 0  if L_{t+1} ≥ VaR_t    (16)

This loss function considers only the number of exceptions, not their magnitude. The mean
loss value (14) calculated from this loss function is called the violation ratio.
4.1.2 ’tick’ Loss
C_{t+1} = (\alpha - 1)(L_{t+1} - VaR_t)  if L_{t+1} < VaR_t
C_{t+1} = \alpha(L_{t+1} - VaR_t)  if L_{t+1} ≥ VaR_t    (17)

In compact form, C_{t+1} = (\alpha - I(L_{t+1} - VaR_t < 0))(L_{t+1} - VaR_t),
where I(L_{t+1} - VaR_t < 0) is an indicator function. This loss function is often called the 'tick'
or 'check' loss function. In the quantile regression literature, the 'tick' loss is usually the
implicit objective function.
In this paper I apply loss functions (16) and (17) to compare the individual VaR models
and the combining models. The mean loss is calculated by equation (14).
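The mean 'tick' loss of Eqs. (14) and (17) can be sketched as:

```python
def mean_tick_loss(realized, var_forecasts, alpha=0.05):
    """Mean 'tick' loss, Eqs. (14) and (17): the asymmetric loss
    whose minimizer is the alpha-quantile forecast."""
    total = 0.0
    for L, v in zip(realized, var_forecasts):
        e = L - v
        # exceptions (e < 0) are penalized with weight (1 - alpha)
        total += (alpha - 1.0) * e if e < 0 else alpha * e
    return total / len(realized)
```

Note the asymmetry: for alpha = 0.05, a return falling below the VaR forecast is penalized 19 times more heavily per unit than one above it.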
4.2 Conditional Coverage Test
The conditional coverage test was proposed by Christoffersen (1998). The analysis is based on
the indicator sequence C_{t+1} defined in equation (16). An accurate VaR model should exhibit
correct conditional coverage, which implies that the indicator sequence C_{t+1} exhibits
both correct unconditional coverage and serial independence. Christoffersen (1998) develops a
three-step testing procedure.
4.2.1 Correct Unconditional Coverage Test
H0: correct violation ratio
Ha: incorrect violation ratio

LR_{uc} = -2 \log \left[ \frac{p^{n_1}(1-p)^{n_0}}{\pi^{n_1}(1-\pi)^{n_0}} \right] \sim \chi^2(1)    (18)

where
p = coverage rate of the VaR model
n_1 = number of exceptions
n_0 = number of non-exceptions
\pi = n_1 / (n_0 + n_1), the MLE of p
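A sketch of the LR_{uc} computation; a guard for the degenerate cases \pi \in \{0, 1\} is assumed, and the statistic is compared with the \chi^2(1) critical value (3.84 at the 5% level):

```python
import math

def lr_uc(hits, p=0.05):
    """Christoffersen's unconditional coverage statistic, Eq. (18).

    hits: sequence of 0/1 exception indicators; chi-square(1) under H0.
    """
    n1 = sum(hits)
    n0 = len(hits) - n1
    pi = n1 / (n0 + n1)                      # MLE of p
    if pi in (0.0, 1.0):                     # degenerate sample: reject
        return float("inf")
    null_ll = n1 * math.log(p) + n0 * math.log(1.0 - p)
    alt_ll = n1 * math.log(pi) + n0 * math.log(1.0 - pi)
    return -2.0 * (null_ll - alt_ll)
```

When the empirical violation ratio equals p exactly, the statistic is zero; it grows as the ratio drifts away from p.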
4.2.2 Independence Test
H0: the exception process is independent
Ha: the exception process is a first-order Markov process

LR_{ind} = -2 \log \left[ \frac{(1-\pi_2)^{n_{00}+n_{10}} \, \pi_2^{n_{01}+n_{11}}}{(1-\pi_{01})^{n_{00}} \pi_{01}^{n_{01}} (1-\pi_{11})^{n_{10}} \pi_{11}^{n_{11}}} \right] \sim \chi^2(1)    (19)

where
n_{ij} = number of times value i is followed by value j in the indicator sequence
\pi_{01} = n_{01} / (n_{00} + n_{01})
\pi_{11} = n_{11} / (n_{10} + n_{11})
\pi_2 = (n_{01} + n_{11}) / (n_{00} + n_{01} + n_{10} + n_{11})    (20)
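The independence statistic can be sketched the same way; zero counts are handled by the convention 0 · log 0 = 0, and the guard against an empty (n_{10} + n_{11}) cell is an assumption of this sketch:

```python
import math

def lr_ind(hits):
    """Christoffersen's independence statistic, Eq. (19)."""
    n = [[0, 0], [0, 0]]
    for prev, cur in zip(hits[:-1], hits[1:]):   # transition counts n_ij
        n[prev][cur] += 1
    n00, n01, n10, n11 = n[0][0], n[0][1], n[1][0], n[1][1]
    pi01 = n01 / (n00 + n01)
    pi11 = n11 / (n10 + n11) if (n10 + n11) else 0.0
    pi2 = (n01 + n11) / (n00 + n01 + n10 + n11)

    def term(count, prob):                       # convention: 0 * log(0) = 0
        return count * math.log(prob) if count else 0.0

    null_ll = term(n00 + n10, 1.0 - pi2) + term(n01 + n11, pi2)
    alt_ll = (term(n00, 1.0 - pi01) + term(n01, pi01)
              + term(n10, 1.0 - pi11) + term(n11, pi11))
    return -2.0 * (null_ll - alt_ll)
```

Because the Markov alternative nests the independent null, the statistic is always non-negative.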
4.2.3 Correct Conditional Coverage Test
H0: independent exception process with correct violation ratio p
Ha: first-order Markov exception process with a different transition probability matrix (TPM)

LR_{cc} = -2 \log \left[ \frac{p^{n_1}(1-p)^{n_0}}{(1-\pi_{01})^{n_{00}} \pi_{01}^{n_{01}} (1-\pi_{11})^{n_{10}} \pi_{11}^{n_{11}}} \right] \sim \chi^2(2)    (21)

Note that LR_{cc} = LR_{uc} + LR_{ind}.
5 Simulation Study
The purpose of this section is to design and perform a Monte Carlo simulation experiment to
compare the one-step-ahead VaR forecast performance of the ANN combination model and the
individual VaR models.
In the following numerical experiment, the data are generated by the GARCH(1,1) process

r_t = a_t,    \sigma_t^2 = 0.0188 + 0.059 a_{t-1}^2 + 0.9217 \sigma_{t-1}^2,    a_t = \sigma_t \varepsilon_t,    \varepsilon_t \sim t(d.f.)

The parameters used here are estimated from the S&P 500 index. The innovation process follows a
t-distribution with degrees of freedom (d.f.) varying from 3 to 1,000. The sample size varies from
100 to 1,000. The Monte Carlo simulation uses 1,000 repetitions.
The comparison procedure is as follows:
1. Let d.f. = 3 and sample size = 100; generate a GARCH(1,1) series from the above DGP with
101 observations; the first 100 observations are used to estimate the VaR, and the last
observation is reserved for out-of-sample comparison;
2. Estimate the VaR by GARCH and HS using the first 100 observations;
3. Repeat the above two steps 1,000 times, yielding 1,000 VaR forecasts from each of
GARCH and HS, along with 1,000 out-of-sample observations;
4. Train the ANN and estimate its weights with the 1,000 pairs of data from step 3;
this is the ANN combination model;
5. Repeat step 1, estimate the GARCH and HS VaR with the newly generated data, and compute the
ANN VaR using the weights from step 4;
6. Repeat step 5 1,000 times, yielding 1,000 VaR forecasts from GARCH, HS and the
ANN, along with 1,000 one-step-ahead out-of-sample observations;
7. Compare the VaR forecast performance by violation ratio and 'tick' loss;
8. Repeat all of the above steps for different d.f. and sample sizes.
Table 2 reports the violation ratio comparison between GARCH, HS and the ANN combination
model. The coverage rate is 5%, so the model whose violation ratio is closest to 5% is preferred.
From the table, we observe that GARCH always generates a violation ratio below 5%, which
means the GARCH VaR forecasts are too conservative. This conservativeness becomes more
severe as the degrees of freedom of the innovation's t-distribution increase. There is no
apparent pattern in the GARCH VaR forecasts with respect to sample size. The HS VaR shows
no obvious trend with either the tail behavior of the innovations or the sample size; however,
we observe that its performance is very unstable. In contrast, the ANN combination model
behaves much better than GARCH and HS: its performance is stable, and there is no trend
between the violation ratio and the innovation tail behavior or the sample size. Most of its
violation ratios lie between 4% and 6%, very close to 5%.
Figure 2 compares the 'tick' loss of the three models. Since the ANN model necessarily
has a smaller in-sample loss than the individual models, this figure reports the out-of-sample
loss comparison. The ANN has a smaller loss under nearly all of the DGPs, except when the d.f.
equals 50. This indicates that the in-sample training is acceptable.
This Monte Carlo experiment shows that, judged by violation ratio and 'tick' loss, the ANN
combination generates better VaR forecasts than GARCH and HS.
6 Empirical Application
Four daily stock return series are employed to test the VaR models: the S&P 500 index, the
Dow Jones Industrial Average (DJI), Ford Motor Co. (Ford) and International Business
Machines Corp. (IBM), for the period from 2-Jan-1990 to 4-Apr-2005, a total of 3,846 observations.
Returns are computed as 100 times the difference of the log prices. The price data come
from Yahoo Finance.
Table 3 reports the summary statistics of these return series. The S&P 500 has the highest mean
return and Ford the worst performance over these fifteen years. Because the S&P 500 and DJI
are composite indices, they have noticeably smaller variances than Ford and IBM. All four
return series have negative skewness, with Ford's being especially large. All four
series have heavy tails, which motivates the use of GARCH in this study.
6.1 Methodology
The individual VaR models are Historical Simulation (HS) and GARCH(1,1) with N(0,1)
innovations. The VaR estimation window is fixed at 1,000 days, roughly four years of
trading days.
I choose these individual VaR models because they employ partially non-overlapping
information sets: HS uses only the empirical distribution of historical returns, while
GARCH applies a conditional volatility forecasting model. There may therefore be an advantage
to combining the VaR models, since the combination employs more information.
The HS and GARCH models in this study produce one-step-ahead forecasts beginning from the
1,001st observation. Daily data from the 1st to the 1,000th observation are used to
estimate the parameters of the GARCH model; these parameters remain fixed
in the subsequent forecasting. We then produce the VaR forecast for the 1,001st day. We
update the data set with a rolling window that keeps the sample size constant at 1,000, by
adding the 1,001st observation and dropping the 1st. The new data set is then used
to produce the one-day-ahead out-of-sample forecast for the 1,002nd day. This rolling
updating procedure is repeated until the last day of the whole sample, giving 3,846 - 1,000
= 2,846 individual VaR forecasts.
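The rolling update above can be sketched as follows (only the HS leg is shown, using a simple percentile rule; the GARCH leg follows the same loop with its own forecaster):

```python
def rolling_hs_forecasts(returns, window=1000, p=0.05):
    """One-day-ahead HS VaR forecasts with a fixed-size rolling window,
    mirroring the updating scheme described above."""
    forecasts = []
    for t in range(window, len(returns)):
        history = sorted(returns[t - window:t])     # most recent `window` days
        forecasts.append(history[int(p * window)])  # 100p-th percentile
        # returns[t] is the realized out-of-sample return for this forecast
    return forecasts
```

Applied to the 3,846-day sample with a 1,000-day window, the loop yields exactly the 2,846 one-day-ahead forecasts described above.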
The above individual VaR forecasts form the basis for the combination. I divide the
individual out-of-sample forecasts into two subsamples: the 1st to the 1,000th and the 1,001st
to the 2,846th. The first subsample is used to select the optimal specification for the ANN
model; the optimal ANN architecture is then used to forecast the VaR in the second subsample.
I use the VaR forecasts in the second subsample to compare the performance of the ANN
combination and the individual VaR models, based on the backtesting criteria described in
Section 4. The ANN specification is fixed in the backtesting period.
A Genetic Algorithm is used to train the ANN because of the asymmetric, non-differentiable
objective function. Several important parameters must be chosen for the algorithm. The
population size (PS) is set to 10 times the number of neurons. The number of generations is set
to 200, which achieves good convergence in my training experience. The crossover parameter (CP)
is set to 0.5, a common setting for GA training. The mutation parameter (MP) is set
to 0.8. These parameters are chosen to balance the power of the optimization against the
computing time. GA training is widely regarded as something of an art.
6.2 Comparison Results
The optimal ANN combination is certain to perform best in the first (validation) subsample
described above, since the ANN combination nests all possible linear and nonlinear combi-
nations of the individual VaR models. Table 4 reports the in-sample 'tick' loss of the GARCH, HS and
ANNs combination models. The results accord with this expectation: the ANNs combination has the smallest
in-sample loss, so its in-sample training is valid. The important
comparison is the performance in the second (testing) subsample, which determines whether the ANN com-
bination is practically useful.
Table 5 reports the 5% out-of-sample VaR comparison results. Four return series are
tested at the 5% coverage rate. Three kinds of models are displayed in the table: the two individual
VaR models are listed first, followed by the ANNs combination model. I do not report the linear
combination and average combination separately. As stated in section 3.1, the linear combination
and average combination are special cases of the ANNs combination, so they compete with
other nonlinear specifications in the validation period. In fact, it is hard to believe that HS and GARCH
are related linearly in a way that could efficiently combine their information sets.
The three backtesting criteria described in section 4 are adopted. The first column reports the mean 'tick' loss in the
testing sample, and the second column reports the most common measure, the violation ratio. The
third through fifth columns report Christoffersen's unconditional coverage test, independence
test and conditional coverage test. In the table, the best model under the mean loss and violation
ratio criteria, and the models passing Christoffersen's tests, are marked
in bold font.
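For concreteness, the first two criteria can be computed as follows. This is a sketch using the standard definitions of the 'tick' (check) loss and the violation ratio, which I assume match those of section 4.

```python
import numpy as np

def tick_loss(returns, var, alpha=0.05):
    """Standard 'tick' (check) loss for a quantile forecast:
    rho_alpha(u) = u * (alpha - 1{u < 0}), with u = r_t - VaR_t,
    averaged over the evaluation sample."""
    u = returns - var
    return np.mean(u * (alpha - (u < 0)))

def violation_ratio(returns, var):
    """Share of days on which the realized return falls below the VaR
    forecast; should be close to alpha for a well-calibrated 5% VaR."""
    return np.mean(returns < var)

r = np.array([-2.0, 1.0, 0.5, -0.1])   # toy returns
q = np.full(4, -1.0)                   # constant VaR forecast
print(round(tick_loss(r, q), 4), violation_ratio(r, q))   # 0.2925 0.25
```

A smaller mean tick loss means the forecast tracks the true 5% quantile more closely, which is why it is the natural in-sample objective for the combination as well.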
The first column reports the mean 'tick' loss comparison. The performance of GARCH and
HS is mixed; however, the ANNs combination model has the smallest loss for all return series. This
indicates that the ANNs combination reduces the loss both in-sample and out-of-sample.
The second column shows the violation ratio comparison. Generally speaking, GARCH
generates conservative VaR forecasts for every series except Ford. On the contrary, the HS VaR forecasts are too
generous, consistently resulting in violation ratios larger than 5%. The ANNs combination performs
better than GARCH and HS: its deviation from the correct coverage is smaller than that of either
individual model. However, we note that for DJI and IBM the deviation is still not small.
Columns 3 to 5 report the results of the tests proposed by Christof-
fersen (1998). Column 3 gives the results of the unconditional coverage test. Both GARCH and HS
fail this test for all four return series. The ANNs combination passes for S&P 500 and Ford
but fails for DJI and IBM. These results accord with the violation ratio comparison, since
the two tests are essentially the same. For the independence test, column 4 shows that all of the models
pass; thus the violation series from all three models exhibit no serial correlation. The last
column gives the conditional coverage test, the joint test of unconditional coverage
and independence, whose statistic is the sum of the previous two statistics.
Because of the unsatisfactory unconditional coverage results, both GARCH and HS fail
this test. The ANNs combination model passes for S&P 500 and Ford and fails
for DJI and IBM. Note, however, that the 95% critical value of the chi-square distribution with 2 degrees of freedom is 5.99,
and the statistics of the ANNs combination model for DJI and IBM are very close to this critical
value and much smaller than those of GARCH and HS. Thus the ANNs combination
model improves on the individual models under the conditional coverage test.
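A sketch of these likelihood-ratio statistics, following the standard formulas in Christoffersen (1998); the helper names are my own, and the code assumes the 0/1 violation series contains at least one hit and one non-hit.

```python
import numpy as np

def christoffersen_tests(violations, alpha=0.05):
    """Christoffersen (1998) tests on a 0/1 violation series.
    LR_uc tests whether the hit rate equals alpha, LR_ind tests
    first-order independence of hits, and LR_cc = LR_uc + LR_ind is the
    joint test (chi-square, 2 d.f.; 95% critical value 5.99)."""
    v = np.asarray(violations, dtype=int)
    n, n1 = len(v), int(v.sum())
    n0, pi = n - n1, v.mean()
    # Unconditional coverage: LR of the null rate alpha vs. observed rate pi
    lr_uc = -2 * (n0 * np.log((1 - alpha) / (1 - pi)) + n1 * np.log(alpha / pi))
    # First-order transition counts for the independence test
    pairs = list(zip(v[:-1], v[1:]))
    n00, n01 = pairs.count((0, 0)), pairs.count((0, 1))
    n10, n11 = pairs.count((1, 0)), pairs.count((1, 1))
    pi01, pi11 = n01 / (n00 + n01), n11 / (n10 + n11)
    pi1 = (n01 + n11) / (n - 1)

    def ll(p, a, b):  # binomial log-likelihood with a "stays", b "hits"
        out = 0.0
        if a > 0:
            out += a * np.log(1 - p)
        if b > 0:
            out += b * np.log(p)
        return out

    lr_ind = -2 * (ll(pi1, n00 + n10, n01 + n11)
                   - ll(pi01, n00, n01) - ll(pi11, n10, n11))
    return lr_uc, lr_ind, lr_uc + lr_ind

v = [0] * 100
for i in (0, 20, 40, 60, 80):
    v[i] = 1              # 5 hits in 100 days: exactly 5% coverage
print(christoffersen_tests(v))
```

For this toy series the hit rate equals alpha exactly, so LR_uc is zero and the joint statistic stays well below the 5.99 critical value quoted in the text.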
7 Concluding Remarks
Two popular VaR models, GARCH and HS, have two drawbacks. First, HS has a
significant positive bias, while GARCH has a significant negative bias. Second, when a structural
break occurs, HS responds too slowly, while GARCH responds quickly but unstably.
To address these two problems, this paper proposes an ANNs combination method and
studies its performance. Based on the 'tick' loss function, the violation ratio
and Christoffersen's conditional coverage tests, both the simulation results and the empirical
results show that combining HS and GARCH with ANNs significantly improves VaR model
performance.
References
Aiolfi, M. and A. Timmermann (2004), Persistence in Forecasting Performance and Conditional
Combination Strategies. Forthcoming in Journal of Econometrics
Bao, Y., Lee, T.-H. and Saltoglu, B. (2004), A test for density forecast comparison with
applications to risk management. Department of Economics, UC Riverside.
Basel Committee on Banking Supervision (1996a), Amendment to the capital accord to incorpo-
rate market risks
Basel Committee on Banking Supervision (1996b), Supervisory framework for the use of "backtesting"
in conjunction with the internal models approach to market risk capital requirements
Basel Committee on Banking Supervision (1998), Amendment to the Basel Capital Accord of
July 1988
Basel Committee on Banking Supervision (1999), Performance of Models-Based Capital Charges
for Market Risk: 1 July-31 December 1998
Bates, J.M. and C.W.J. Granger (1969). The combination of forecasts. Operational Research
Quarterly, 20, 451-468
Beasley, D., Bull, D.R., and Martin, R.R. (1993), An Overview of Genetic Algorithms, University
Computing, 15(2) 58-69, 170-181
Blanco C. and G. Ihle (1999). How good is your VaR? Using backtesting to assess system perfor-
mance. Financial Engineering News
Caporin, M. (2003) Evaluating value-at-risk measures in presence of long memory conditional
volatility. GRETA, working paper n. 05.03.
Chatfield, C. (1995). Model uncertainty, data mining, and statistical inference, J. Roy. Statist.
Soc. Ser. A 158 419-466.
Christoffersen, P. (1998). Evaluating Interval Forecasts, International Economic Review, 1998,
Volume 39, 841-862.
Christoffersen, P., J. Hahn and A. Inoue (2001). Testing and Comparing Value-at-Risk Measures,
Journal of Empirical Finance, 2001, Volume 8, 325-342.
Christoffersen, P. and D. Pelletier (2004). Backtesting Value at Risk: A Duration-Based Approach,
Journal of Financial Econometrics, 2004, Volume 2, 84-108.
Clemen, R.T. (1989). Combining Forecasts: A Review and Annotated Bibliography, Interna-
tional Journal of Forecasting, 5, 559-583
Donaldson, R.G. and M. Kamstra (1996). Using Artificial Neural Networks to Combine Forecasts,
Journal of Forecasting, 15, 49-61.
Engle, R. and S. Manganelli (1999). CAViaR: Conditional Value at Risk by Quantile Regression,
Manuscript, NYU Stern
Giacomini, R. and I. Komunjer (2004). Evaluation and Combination of Conditional Quantile Fore-
casts, forthcoming in Journal of Business and Economic Statistics
Hendry, D.F. and M.P. Clements (2002). Pooling of Forecasts. Econometrics Journal 5, 1-26
Hoeting, J., D. Madigan, A. Raftery and C. Volinsky (1999). Bayesian Model Averaging, Statis-
tical Science 14, 382-417.
Holland, J. (1965), Universal spaces: A basis for studies of adaptation, In Automata Theory.
Caianiello, E. R. (ed.) Academic Press. 218-30.
Holland, J. (1975) Adaptation in Natural and Artificial Systems, Ann Arbor, The University of
Michigan Press
Hornik, K., M. Stinchcombe and H. White (1989), Multilayer Feedforward Networks are Universal
Approximators, Neural Networks, 2, 359-366
Hornik, K., M. Stinchcombe and H. White (1990), Universal Approximation of an Unknown
Mapping and Its Derivatives Using Multilayer Feedforward Networks, Neural Networks, 3,
551-560
Inui, K., M. Kijima, and A. Kitano (2003), VaR is subject to a significant positive bias. Mimeo,
Kyoto University and Financial Services Agency
Koenker, R. and B.J. Park (1996). An Interior Point Algorithm for Nonlinear Quantile Regression,
Journal of Econometrics, 71, 265-285
Lopez J. A. (1998), Methods for evaluating value-at-risk estimates. Federal Reserve Bank of New
York research paper n. 9802.
Riskmetrics (1996). Technical Document. Technical report. J.P. Morgan
Sarma, M., S. Thomas, and A. Shah (2003), Selection of Value at Risk models, Journal of Fore-
casting, 22(4), 337-358
Stock, J.H. and M. Watson (1999). Forecasting Inflation, Journal of Monetary Economics, 1999,
Vol. 44, no. 2.
Stock, J.H. and M. Watson (2001). A Comparison of Linear and Nonlinear Univariate Models for
Forecasting Macroeconomic Time Series, in Cointegration, Causality, and Forecasting: A
Festschrift in Honour of Clive W.J. Granger, R.F. Engle and H. White (eds),
Oxford University Press
Stock, J.H. and M. Watson (2004). Combination Forecasts Of Output Growth In A Seven-Country
Data Set, forthcoming Journal of Forecasting, 2004
Timmermann, A. (2004). Forecast Combinations. Forthcoming in Handbook of Economic Fore-
casting (Edited by Elliott, Granger and Timmermann (North Holland)
White, H. (1990), Connectionist Nonparametric Regression: Multilayer Feedforward Networks
Can Learn Arbitrary Mappings, Neural Networks, 3, 535-549 .
White, H. (1992), Nonparametric Estimation of Conditional Quantiles Using Neural Networks, in
Proceedings of the Symposium on the Interface. New York: Springer-Verlag.
Figure 2. ANNs Combination Structure
[Network diagram: inputs 1 (bias), z1t, z2t and the individual VaR forecasts f1t, f2t feed the hidden units h1t, h2t, which together with a bias input of 1 produce the combined forecast Ft.]
Figure 3. Loss Comparison
Table 1. GARCH and HS VaR Bias (5%)

GARCH Bias
df \ n      100       250       500       1000
3        -1.859    -1.829    -1.841    -1.800
10       -0.206    -0.204    -0.203    -0.213
50       -0.048    -0.054    -0.047    -0.041
1000     -0.034    -0.026    -0.023    -0.022

HS Bias
df \ n      100       250       500       1000
3        -0.449    -0.014    -0.033     0.020
10        0.220     0.057     0.019     0.017
50        0.046     0.028     0.037     0.012
1000      0.061     0.010     0.011     0.009

Notes: DGP is the t-distribution with 3, 10, 50 and 1000 degrees of freedom; 5,000 Monte Carlo repetitions. Bias = forecast VaR - theoretical VaR, where the forecast VaR is the mean of the VaR forecasts in the simulation and the theoretical VaR is the 5% inverse CDF of the t-distribution.
Table 2. Violation Ratio Comparison (5%)

Method \ n     100      250      500      1000
df = 3
GARCH        0.016    0        0        0.035
HS           0.056    0.049    0.023    0.11
ANN          0.03     0.066    0.058    0.046
df = 10
GARCH        0.019    0.032    0.067    0.04
HS           0.017    0.053    0.084    0.072
ANN          0.03     0.067    0.046    0.098
df = 50
GARCH        0.038    0.035    0.042    0.032
HS           0.066    0.038    0.023    0.098
ANN          0.038    0.054    0.023    0.065
df = 1000
GARCH        0.019    0.032    0.067    0.04
HS           0.017    0.053    0.084    0.072
ANN          0.058    0.05     0.034    0.065

Notes: DGP is a GARCH(1,1) process with constant K = 0.018, ARCH coefficient 0.059 and GARCH coefficient 0.9217; 1,000 Monte Carlo repetitions.
Table 3. Summary Statistics

         Mean     Median   Variance   Skewness   Kurtosis
S&P      0.052    0.040     1.047      -3.705     83.161
DJI      0.015    0.021     0.193      -0.222      7.524
Ford    -0.025    0.000     5.408     -10.148    270.260
IBM      0.025    0.000     2.705      -0.881     25.048
Table 4. In-Sample Loss (5%)

         S&P      DJI      IBM      Ford
GARCH   0.7674   0.8576   3.0632   2.5668
HS      0.7673   0.7949   2.6012   2.5481
ANN     0.7308   0.7575   2.5699   2.4960
Table 5. Out-of-Sample Performance (5%)

                Loss     Vio Ratio   LR_uc    LR_ind   LR_cc
c.v.                                 3.842    3.842    5.992
S&P   GARCH    1.4267    0.029      10.867    0.030   10.955
      HS       1.4542    0.096      35.511    0.007   35.719
      ANN      1.3887    0.057       0.989    0.182    1.288
DJI   GARCH    1.4036    0.031       8.739    0.002    8.804
      HS       1.4359    0.085      21.512    3.241   24.931
      ANN      1.3722    0.067       5.524    1.404    7.066
IBM   GARCH    3.7766    0.030       9.769    1.858   11.687
      HS       3.5180    0.078      14.204    0.154   14.521
      ANN      3.4460    0.068       6.161    0.433    6.735
Ford  GARCH    3.4888    0.074      10.634    0.051   10.838
      HS       3.4794    0.070       7.530    0.931    8.606
      ANN      3.4463    0.040       2.253    1.074    3.409

Notes: c.v. is the 95% critical value of each LR test.