MACHINE LEARNING AND FORECAST
COMBINATION IN INCOMPLETE PANELS
Kajal Lahiri, Huaming Peng, Yongchen Zhao1
Department of Economics, University at Albany, SUNY
Albany, New York, USA
This paper focuses on the newly proposed on-line forecast combination
algorithms in Sancetta (2010), Yang (2004), and Wei and Yang (2012). We first
establish the asymptotic relationship between these new algorithms and the Bates
and Granger (1969) method. Then, we show that when implemented on unbalanced
panels, different combination algorithms implicitly impute missing data differently,
making results not comparable across methods. Using forecasts of a number of
macroeconomic variables from the U.S. Survey of Professional Forecasters, we
evaluate the performance of the new algorithms and contrast their inner
mechanisms with that of Bates and Granger's method. Missing data in the SPF panels are specifically controlled for by explicit imputation. We find that even though the equally weighted average is hard to beat, the new algorithms deliver superior performance, especially during periods of volatility clustering and structural breaks.
Keywords On-line learning; Recursive algorithms; Unbalanced panel; SPF
forecasts.
JEL Classification C22; C53; C14.
1. INTRODUCTION
Since the seminal work of Bates and Granger (1969), the potential
benefits of combining multiple forecasts instead of simply choosing the
single best has long been recognized. The basic idea is that under certain
1 An earlier version of the paper was presented at the New York Camp Econometrics VI (Lake
Placid, April 2011) and the 17th International Panel Data Conference (Montreal, July 2011). We
thank Cheng Hsiao and Tom Wansbeek for helpful comments.
conditions, optimally combined forecast can be more accurate than
individual forecasts in the panel. Moreover, combining forecasts can be a
useful hedge against structural breaks and model instability, see
Timmermann (2006) for a survey. However, despite the development of
many new forecast combination methods during the past forty years,
empirical studies still find that a simple average (henceforth SA) of forecasts performs very well compared to more elaborate procedures. This "forecast combination puzzle", as dubbed originally by Stock and Watson (2004), is related to several issues. First is the "curse of dimensionality",
since performance-based combination methods require the estimation of a
large number of weight parameters. A number of well-designed Monte
Carlo studies and analytical results have illustrated this problem, see Kang
(1986), Smith and Wallis (2009), Capistrán and Timmermann (2009), and
Issler and Lima (2009). There are other equally important factors too. A
forecaster's past record may not be a good indicator of his/her future
performance due to structural breaks, outliers, new information,
uncertainty shocks, and other ex-ante unobservables. These factors can
make the relative rankings and combination weights unstable,
unpredictable, and generally misleading. Aiolfi and Timmermann (2006)
document such crossings in the context of a large number of forecasting
models.
Parallel to the aforementioned literature, another promising approach
to forecast combination has developed in recent years using aggregation
algorithms and on-line learning. Two fundamental components of
successful forecast combination are the choice of the combination rule and
the weights. In the latter approach, time varying combination weights are
naturally built into on-line recursive algorithms that do not require the
knowledge of the full covariance matrix of the forecast errors. Yang
(2004) distinguished between two broad approaches to combining: in the first, the combined forecast tries to be as good as the best in the group, called combining for adaptation; and in the second approach, the combined forecast tries to be better than each individual forecast, called combining for improvement. The Bates and Granger (BG) procedure falls in the latter
category. Yang (2004) suggested a new automated combining method, the
Aggregated Forecast Through Exponential Reweighting algorithm
(henceforth AFTER), which is a variant of the aggregating algorithm
proposed by Vovk (1990), and belongs to the first category. AFTER has
been found to be useful in many recent applications.2 It is now well
known that, under certain circumstances, the cost of combining for
improvement due to parameter estimation can be substantially higher than
that of combining for adaptation.
In addition to AFTER, we consider another on-line recursive
algorithm from the machine learning literature with shrinking (henceforth
MLS) due to Sancetta (2010). Unlike the BG approach, the on-line
algorithms tend to select the few top forecasters, thus requiring much
smaller number of estimated parameters. By simple recursive updates,
they allow for time varying optimal combination weights. There are subtle
differences in the assumptions built into AFTER and MLS algorithms:
AFTER requires the existence of a moment generating function of the
forecast errors, but the errors need not be bounded. MLS does not require
any assumption on the nature of the forecasts and the actuals, or the
stability of the system besides a tail condition on the error distribution. But
in the process it can only establish rather weak performance bounds. Wei
2 See Altavilla and De Grauwe (2010), Rapach and Strauss (2005, 2007), Leung and Barron (2006), Sanchez (2008), Fan, Chen and Lee (2008), and Inoue and Kilian (2008).
and Yang (2012) noted that with alternative forecasts being similar and
stable, the BG approach tends to be unnecessarily aggressive, and AFTER,
in these situations, can perform better. On the other hand, when the best
forecaster changes over time and is unstable, the gradient-based method for improvement as suggested by MLS can be advantageous. Instead of looking for the best combined forecast, it could be more profitable to look for the best available forecast in real time. Thus, these methods can be complementary, depending on the forecasting environment one may face in
real time.
Wei and Yang (2012) extended the AFTER algorithm that is designed
for squared loss (s-AFTER) to absolute error loss (L1-AFTER) and Huber
loss functions (h-AFTER), with the special objective of reducing the
influence of outliers. In the presence of structural breaks, outliers are more
likely to occur. The quadratic loss coupled with the exponential weighting scheme makes the AFTER weights very sensitive to
outliers. The BG combination weights can be very sensitive to structural
breaks too. In a real life situation, since the future scenario is seldom
predictable, one cannot determine a priori which combining strategy to
adopt. However, Wei and Yang (2012) have shown that while robust to
outliers, L1-AFTER and h-AFTER are only marginally inferior to s-
AFTER when errors are normally distributed with no outliers.
There are three main objectives of this paper: First, we derive the
asymptotic forms of s-AFTER, L1-AFTER and BG method and establish
their asymptotic relationships. These asymptotic relationships not only
provide us with fresh new insights into the mechanism of how the
AFTER-type algorithms operate relative to the BG and simple averaging
scheme, but also explain the rationale behind the distinct forecast
performance of these combination methods. More specifically, we show
that in the presence of heteroscedasticity, s-AFTER eventually behaves
similarly to a normalized power function of the BG weights. In
particular, the s-AFTER algorithm would magnify the BG weights if they
are sufficiently large while discounting the weights of BG dramatically if
they are below some threshold level. On the other hand, not unexpectedly,
we show that if the forecast errors are asymptotically stationary over time
and conditionally homoscedastic across forecasters or forecasting
procedures, then s-AFTER, L1-AFTER and BG methods all behave like
the SA scheme when both the training and the remaining sample sizes are
sufficiently large.
Second, we evaluate these newly developed on-line algorithms and
compare them with many existing combining procedures using a large
data set of expert forecasts for a number of macroeconomic variables. We
use the U.S. Survey of Professional Forecasters (SPF) from 1968:IV with
many missing data that are very typical of such surveys. The long time span over which these forecasts have been recorded will hopefully help us to
understand the relative strengths and weaknesses of these combining
schemes under widely different forecasting scenarios actually observed in
real life.
Finally, we examine the implications of missing data in a panel on the
comparison of alternative combining procedures. Schmidt (1977) spurred
early work on the estimation of panel data models with incomplete panels.
In our context, Capistrán and Timmermann (2009) were the first to
examine the implication of incomplete panels due to entry, exit and re-
entry of experts and demonstrated that due to reduced number of
overlapping forecasts, the performance of procedures like BG that require
estimating error covariances will deteriorate relative to methods like SA
that do not need them. The same reasoning would favor the newer on-line
automated procedures like AFTER and MLS because they do not require
such covariances. In addition to this problem, we also show that if the
missing observations are not explicitly treated uniformly across different
combining procedures, different combining schemes would yield different
imputed values. This effectively means that, when applied to unbalanced
panels, different combination methods are implemented on different data
sets, and naturally the results of different combinations will not be
comparable. In our exercises, we address the issue of incomplete panel by
explicit imputations, improvising on a procedure suggested by Genre et al.
(2013).
The organization of the paper is as follows. Section 2 contains
theoretical results on the asymptotic relationship between different
combination methods and the incomparability between combination
methods when applied to unbalanced panel data. Section 3 discusses the
dataset and data-related issues including imputations of missing data. We
evaluate the performance of the newly developed combination algorithms
and compare them with some of the extant methods in Section 4, and
study the behavior of the on-line algorithms more closely in Section 5.
Section 6 concludes.
2. COMBINATION METHODS AND IMPLIED
IMPUTATIONS
2.1. COMBINATION METHODS AND THEIR ASYMPTOTIC
RELATIONSHIPS
In this subsection, we study the asymptotic relationships between the AFTER algorithms, the BG algorithm, and the SA method. Suppose there are $n$ forecasters, $y_t$ is the variable of interest at time $t$ ($t = 1,\dots,T$), and $y_{j,t}$ is the forecast of $y_t$ made by the $j$th forecaster at time $t - h$, where $h$ is a positive integer indicating the forecast horizon.3 The forecast combination problem is how to assign weights to these $n$ forecasters at time $t+1$ after observing $y_\tau$, $y_{j,\tau}$ and the associated forecast errors $e_{j,\tau} = y_\tau - y_{j,\tau}$ for $\tau = 1,\dots,t$ and $j = 1,\dots,n$.
A popular solution to the forecast combination problem is Bates and Granger's (1969) approach, where the combining weights are proportional to the inverse of the mean squared errors (MSE) (see Stock and Watson, 2004). More specifically, at time $t$, the BG method computes and assigns the weight
$$\omega_{j,t+1}^{BG} = \frac{\sigma_{j,t}^{-2}}{\sum_{i=1}^{n}\sigma_{i,t}^{-2}}$$
to the $j$th forecaster for time $t+1$, assuming the forecasts are unbiased. In practice, the variances $\sigma_{j,t}^{2}$ are rarely known and are usually replaced by their estimates $\hat{\sigma}_{j,t}^{2} = \frac{1}{t-1}\sum_{\tau=1}^{t-1}e_{j,\tau}^{2}$. But then $\hat{\omega}_{j,t+1}^{BG} - \omega_{j,t+1}^{BG} \to_{p} 0$ whenever $\hat{\sigma}_{j,t}^{2} - \sigma_{j,t}^{2} \to_{p} 0$ for all $j$ as $t \to \infty$. If a weak stationarity condition such that $\sigma_{j,t}^{2} = \sigma_{j}^{2}$ for all $t$ is imposed, the BG weights can be further simplified to
$$\omega_{j,t+1}^{BG} = \frac{\sigma_{j}^{-2}}{\sum_{i=1}^{n}\sigma_{i}^{-2}}. \tag{2.1}$$
The newly developed s-AFTER algorithm proposed by Yang (2004) aims to pick the best few forecasters by minimizing the squared loss function. According to Yang (2004), when the errors are normal and the variances are estimated, the weights of s-AFTER are estimated by
$$\hat{\omega}_{j,t+1}^{s\text{-}AFTER} = \frac{\prod_{\tau=t_o+1}^{t}\hat{\sigma}_{j,\tau}^{-1}\exp\left(-\frac{1}{2}\sum_{\tau=t_o+1}^{t}\frac{e_{j,\tau}^{2}}{\hat{\sigma}_{j,\tau}^{2}}\right)}{\sum_{i=1}^{n}\prod_{\tau=t_o+1}^{t}\hat{\sigma}_{i,\tau}^{-1}\exp\left(-\frac{1}{2}\sum_{\tau=t_o+1}^{t}\frac{e_{i,\tau}^{2}}{\hat{\sigma}_{i,\tau}^{2}}\right)} \quad \text{for } t \geq t_o + 1, \tag{2.2}$$
where $t_o$ is the size of the training sample.
3 Without loss of generality, we do not identify the forecast horizon in this section.
In this scheme, the lower the value of $\hat{\sigma}_{j,t}^{2}$, the higher the weight. Also, the latest squared forecast error is evaluated relative to its estimated expected value $\hat{\sigma}_{j,t}^{2}$. Thus, a large squared forecast error relative to its estimated expected value is interpreted as a sign of potential deterioration in the particular forecaster's performance. As a result, the contribution of that forecast to the combination is exponentially reduced; see also Zou and Yang (2004).
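A minimal sketch of the weight formula in (2.2), with the variance estimates $\hat{\sigma}_{j,\tau}^{2}$ formed recursively from past squared errors; the function name, the indexing convention for the training sample, and the use of log-weights to avoid numerical underflow are our implementation choices, not prescriptions from Yang (2004).

```python
import numpy as np

def s_after_weights(errors, t0):
    """s-AFTER weights as in (2.2) from a (t x n) matrix of forecast errors.

    errors[tau, j] is the error of forecaster j in period tau; t0 is the size
    of the training sample used to initialize the variance estimates.
    """
    t, n = errors.shape
    log_w = np.zeros(n)
    for tau in range(t0, t):
        sigma2_hat = np.mean(errors[:tau] ** 2, axis=0)   # variance estimate from data before tau
        log_w += -0.5 * np.log(sigma2_hat) - 0.5 * errors[tau] ** 2 / sigma2_hat
    log_w -= log_w.max()                                  # stabilize before exponentiating
    w = np.exp(log_w)
    return w / w.sum()

rng = np.random.default_rng(1)
e = rng.normal(scale=[0.5, 1.0, 2.0], size=(40, 3))
print(s_after_weights(e, t0=10))    # the low-variance forecaster receives almost all the weight
```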
If $\hat{\sigma}_{j,t}^{2} - \sigma_{t}^{2} \to_{p} 0$ for all $j$,4 then as $t - t_o \to \infty$,
$$\left\{\prod_{\tau=t_o+1}^{t}\hat{\sigma}_{j,\tau}^{-1}\exp\left(-\frac{1}{2}\sum_{\tau=t_o+1}^{t}\frac{e_{j,\tau}^{2}}{\hat{\sigma}_{j,\tau}^{2}}\right)\right\}^{\frac{1}{t-t_o}} - \left\{\prod_{\tau=t_o+1}^{t}\sigma_{\tau}^{-1}\exp\left(-\frac{1}{2}\sum_{\tau=t_o+1}^{t}\frac{e_{j,\tau}^{2}}{\sigma_{\tau}^{2}}\right)\right\}^{\frac{1}{t-t_o}} \to_{p} 0.$$
As a result, we obtain
$$\frac{\left\{\hat{\omega}_{j,t+1}^{s\text{-}AFTER}\right\}^{\frac{1}{t-t_o}}}{\sum_{i=1}^{n}\left\{\hat{\omega}_{i,t+1}^{s\text{-}AFTER}\right\}^{\frac{1}{t-t_o}}} - \frac{\left\{\omega_{j,t+1}^{s\text{-}AFTER}\right\}^{\frac{1}{t-t_o}}}{\sum_{i=1}^{n}\left\{\omega_{i,t+1}^{s\text{-}AFTER}\right\}^{\frac{1}{t-t_o}}} \to_{p} 0$$
for all $j$. It follows that $\hat{\omega}_{j,t+1}^{s\text{-}AFTER} - \omega_{j,t+1}^{s\text{-}AFTER} \to_{p} 0$, where
4 This holds provided the following assumptions are satisfied: (i) $\sigma_{j,t}^{2} = \sigma_{t}^{2}$, (ii) the forecast errors $e_{j,t}$ have uniformly bounded fourth moments and $\hat{\sigma}_{j,t}\,\sigma_{t}^{-1}$ is bounded in probability, (iii) $\sigma_{t}^{2} - \sigma^{2} \to_{p} 0$ for some $\sigma^{2}$, and (iv) the $\hat{\sigma}_{t}$ estimation procedure is consistent (see Proposition 3 in Yang, 2004).
$$\omega_{j,t+1}^{s\text{-}AFTER} = \frac{\exp\left(-\frac{1}{2}\sum_{\tau=t_o+1}^{t}\frac{e_{j,\tau}^{2}}{\sigma_{\tau}^{2}}\right)}{\sum_{i=1}^{n}\exp\left(-\frac{1}{2}\sum_{\tau=t_o+1}^{t}\frac{e_{i,\tau}^{2}}{\sigma_{\tau}^{2}}\right)} \tag{2.3}$$
for all $j$. The above expression is almost identical to the s-AFTER algorithm defined by Yang (2004) for known homoscedastic conditional variance, except that in (2.3) the summation ranges from $t_o+1$ to $t$ rather than from 1 to $t$. The slight difference is due to the fact that the first $t_o$ observations have to be used to compute the initial estimate of $\sigma_{t}^{2}$.
When the forecast errors are asymptotically (conditionally) homoscedastic over time and across $j$, i.e., $\hat{\sigma}_{j,t}^{2} - \sigma^{2} \to_{p} 0$ for all $j$ as $t$ approaches infinity, it is trivial to see from (2.3) that s-AFTER acts like SA in sufficiently large samples. By contrast, the BG method requires only $\hat{\sigma}_{j,t}^{2} - \sigma_{t}^{2} \to_{p} 0$ for all $j$ to yield equal weights eventually.
The assumption of (asymptotically) conditional homoscedasticity over time and across $j$ may be too restrictive as it rules out many interesting cases. In general, the conditional variances $\sigma_{j,t}^{2}$ are expected to vary across $j$ (see, for example, Davies and Lahiri, 1995, and Lahiri and Sheng, 2010). Hence here we consider what is typical in the forecast combination literature and assume $\hat{\sigma}_{j,t}^{2} - \sigma_{j}^{2} \to_{p} 0$. This condition encompasses as a special case the assumption that the $e_{j,t}$ are weakly stationary and ergodic in second moments for all $j$. Under this condition, as $t \to \infty$, we have $\hat{\omega}_{j,t+1}^{s\text{-}AFTER} - \omega_{j,t+1}^{s\text{-}AFTER} \to_{p} 0$, where
$$\omega_{j,t+1}^{s\text{-}AFTER} = \frac{\sigma_{j}^{-(t-t_o)}\exp\left(-\frac{1}{2\sigma_{j}^{2}}\sum_{\tau=t_o+1}^{t}e_{j,\tau}^{2}\right)}{\sum_{i=1}^{n}\sigma_{i}^{-(t-t_o)}\exp\left(-\frac{1}{2\sigma_{i}^{2}}\sum_{\tau=t_o+1}^{t}e_{i,\tau}^{2}\right)} \tag{2.4}$$
which is the s-AFTER algorithm defined in Yang (2004) for known variances, in the sense that it allows for heteroskedasticity of the forecast errors. Since $\hat{\sigma}_{j,t}^{2} \to_{p} \sigma_{j}^{2}$ by assumption and $\frac{1}{t-t_o}\sum_{\tau=t_o+1}^{t}\left(e_{j,\tau}^{2} - \sigma_{j,\tau}^{2}\right) \to_{p} 0$ as $t - t_o \to \infty$, provided that the weak law of large numbers can be applied, for large $t_o$ and $t - t_o$, $\omega_{j,t+1}^{s\text{-}AFTER}$ may be well approximated by
$$\omega_{j,t+1}^{s\text{-}AFTER} = \frac{\sigma_{j}^{-(t-t_o)}}{\sum_{i=1}^{n}\sigma_{i}^{-(t-t_o)}}. \tag{2.5}$$
Combining (2.1) and (2.5) yields
$$\omega_{j,t+1}^{s\text{-}AFTER} = \frac{\left\{\omega_{j,t+1}^{BG}\right\}^{\frac{1}{2}(t-t_o)}}{\sum_{i=1}^{n}\left\{\omega_{i,t+1}^{BG}\right\}^{\frac{1}{2}(t-t_o)}}. \tag{2.6}$$
Therefore, under the assumption $\hat{\sigma}_{j,t}^{2} - \sigma_{j}^{2} \to_{p} 0$, $\omega_{j,t+1}^{s\text{-}AFTER}$ is a normalized power transform of $\omega_{j,t+1}^{BG}$. Since the exponent $\frac{1}{2}(t-t_o)$ exceeds one for $t > t_o + 2$, the s-AFTER algorithm would magnify the weight produced by BG when the forecaster can be distinguished from the rest of the forecasters. Conversely, it tends to discount the weight assigned by BG's method if that weight is below some data-dependent threshold level. This explains why s-AFTER would eventually pick the best forecaster if its past and most recent performance is well above the rest of the forecasters. However, we should note that the s-AFTER algorithm operates according to the dynamic and cross-sectional properties of the data as well.
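The power-transform relationship in (2.6) is easy to see numerically. The sketch below (our own illustration, with hypothetical BG weights) raises a weight vector to the power $\frac{1}{2}(t-t_o)$ and renormalizes; as $t - t_o$ grows, the largest BG weight is magnified toward one while the smaller weights are discounted toward zero.

```python
import numpy as np

def power_transform(bg_weights, t_minus_t0):
    """Normalized power transform of BG weights as in (2.6)."""
    w = np.asarray(bg_weights, dtype=float) ** (0.5 * t_minus_t0)
    return w / w.sum()

bg = np.array([0.40, 0.30, 0.20, 0.10])        # hypothetical BG weights
for horizon in (2, 10, 40):
    print(horizon, power_transform(bg, horizon).round(3))
# As t - t_o grows, nearly all weight concentrates on the forecaster with BG weight 0.40.
```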
Along the same line, Wei and Yang (2012) extended the AFTER algorithm by adopting the absolute error loss (L1 loss) and Huber loss functions and proposed two new algorithms (L1-AFTER and h-AFTER) for forecast combination. Since L1-AFTER and h-AFTER behave similarly, here we focus only on the L1-AFTER algorithm and examine its asymptotic relationship with the s-AFTER algorithm. According to Wei and Yang (2012), the weights of the L1-AFTER algorithm are estimated as
$$\hat{\omega}_{j,t+1}^{L_1\text{-}AFTER} = \frac{\prod_{\tau=t_o+1}^{t}\hat{s}_{j,\tau}^{-1}\exp\left(-\lambda\sum_{\tau=t_o+1}^{t}\frac{|e_{j,\tau}|}{\hat{s}_{j,\tau}}\right)}{\sum_{i=1}^{n}\prod_{\tau=t_o+1}^{t}\hat{s}_{i,\tau}^{-1}\exp\left(-\lambda\sum_{\tau=t_o+1}^{t}\frac{|e_{i,\tau}|}{\hat{s}_{i,\tau}}\right)} \quad \text{for } t \geq t_o + 1, \tag{2.7}$$
where $\lambda$ is a positive tuning parameter, and $\hat{s}_{j,\tau} = \frac{1}{\tau-1}\sum_{s=1}^{\tau-1}|e_{j,s}|$ or $\hat{s}_{j,\tau} = \hat{\sigma}_{j,\tau}$. Following Wei and Yang (2012), we set $\lambda = 1$ and let $\hat{s}_{j,\tau} = \frac{1}{\tau-1}\sum_{s=1}^{\tau-1}|e_{j,s}|$. Suppose $\hat{s}_{j,\tau} - E|e_{j}| \to_{p} 0$ for all $j$, which follows trivially if the $e_{j,t}$ are weakly stationary and ergodic in second moments for all $j$ (this assumption is sufficient but not necessary). Using an argument similar to that for the s-AFTER algorithm, we have $\hat{\omega}_{j,t+1}^{L_1\text{-}AFTER} - \omega_{j,t+1}^{L_1\text{-}AFTER} \to_{p} 0$, where
$$\omega_{j,t+1}^{L_1\text{-}AFTER} = \frac{\left\{E|e_{j}|\right\}^{-(t-t_o)}}{\sum_{i=1}^{n}\left\{E|e_{i}|\right\}^{-(t-t_o)}}. \tag{2.8}$$
By examining (2.5) and (2.8), we can see that $\omega_{j,t+1}^{L_1\text{-}AFTER}$ has an identical functional form to $\omega_{j,t+1}^{s\text{-}AFTER}$. These two AFTER-type algorithms differ only in their argument: $\sigma_{j}$ for s-AFTER and $E|e_{j}|$ for L1-AFTER, reflecting their respective loss functions (the squared loss and the absolute loss, since $\sigma_{j} = \sqrt{E e_{j}^{2}}$, provided $e_{j}$ is unbiased) used in building these algorithms. Some simple algebraic manipulation yields the following large-sample relationship between L1-AFTER and s-AFTER:
$$\omega_{j,t+1}^{L_1\text{-}AFTER} = \frac{\omega_{j,t+1}^{s\text{-}AFTER}\left\{1 + \sigma_{j}^{-1}\left(E|e_{j}| - \sigma_{j}\right)\right\}^{-(t-t_o)}}{\sum_{i=1}^{n}\omega_{i,t+1}^{s\text{-}AFTER}\left\{1 + \sigma_{i}^{-1}\left(E|e_{i}| - \sigma_{i}\right)\right\}^{-(t-t_o)}}. \tag{2.9}$$
It is obvious that the difference between $\omega_{j,t+1}^{L_1\text{-}AFTER}$ and $\omega_{j,t+1}^{s\text{-}AFTER}$ stems from the discrepancy between $E|e_{j}|$ and $\sigma_{j}$, or the difference between the absolute loss and the square root of the squared loss. Note that $E|e_{j}| - \sigma_{j} < 0$ by Jensen's inequality, provided that $e_{j}$ is unbiased and $\mathrm{Var}(|e_{j}|) \neq 0$. Hence $\left\{1 + \sigma_{j}^{-1}\left(E|e_{j}| - \sigma_{j}\right)\right\}^{-(t-t_o)} > 1$ works as a mediating factor to counteract the impact of a diminishing $\omega_{j,t+1}^{s\text{-}AFTER}$ due to an outlier, so that $\omega_{j,t+1}^{L_1\text{-}AFTER}$ is less sensitive in the presence of forecast outliers, ceteris paribus.
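For intuition, a companion sketch of the L1-AFTER weights in (2.7) with $\lambda = 1$ and the recursive scale estimate built from past absolute errors; again, the function name and the example data are ours.

```python
import numpy as np

def l1_after_weights(errors, t0, lam=1.0):
    """L1-AFTER weights as in (2.7) from a (t x n) matrix of forecast errors."""
    t, n = errors.shape
    log_w = np.zeros(n)
    for tau in range(t0, t):
        s_hat = np.mean(np.abs(errors[:tau]), axis=0)     # scale estimate from past absolute errors
        log_w += -np.log(s_hat) - lam * np.abs(errors[tau]) / s_hat
    log_w -= log_w.max()
    w = np.exp(log_w)
    return w / w.sum()

rng = np.random.default_rng(2)
e = rng.normal(scale=[0.5, 1.0, 2.0], size=(40, 3))
e[30, 0] = 6.0                       # a single outlier by the otherwise best forecaster
print(l1_after_weights(e, t0=10))    # forecaster 0 keeps most of the weight despite the outlier
```

Feeding the same error matrix to the s-AFTER sketch above illustrates the contrast discussed here: the squared-loss penalty for the single outlier is large enough to shift nearly all weight away from the best forecaster, whereas the absolute-loss penalty leaves that forecaster dominant.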
2.2. INCOMPARABILITY OF COMBINED FORECASTS IN
UNBALANCED PANELS
The discussions in the previous subsection pertaining to the
asymptotic relationships among SA, BG, s-AFTER and L1-AFTER are
presented in the context of balanced panels. But the majority of panel data
sets faced by empirical researchers are unbalanced in nature. The question
we explore in this subsection is what happens if empirical researchers
blindly apply various combination methodologies without properly
allowing for the unbalanced structure of forecast data at hand.
For simplicity, suppose that an analyst observes $n$ forecasters; the earliest time at which forecast data are available is $t = 1$, while the latest time at which forecast data are available is $t = T$. Due to entry and exit of forecast experts from time to time, the data are unbalanced in the sense that the forecast $y_{j,t}$ may not be available for some forecaster $j$ at some time $t$. Define $N_{t}^{A} = \{j: y_{j,t} \text{ is observed at time } t,\ j = 1,\dots,n\}$ and $N_{t}^{NA} = \{j: y_{j,t} \text{ is not available at } t,\ j = 1,\dots,n\}$. Assume also that there are $n_{t}$ observations at time $t$, which then implies that $n - n_{t}$ elements belong to $N_{t}^{NA}$. If the simple averaging method is applied only to observed data, then effectively the weights of SA in an unbalanced panel with $N_{t}^{A}$ and $N_{t}^{NA}$ ($t = 1,\dots,T$) are governed by
$$\hat{\omega}_{j,t+1}^{SA} = \begin{cases} 0, & y_{j,t+1} \text{ is not available}, \\ \dfrac{1}{n_{t+1}}, & \text{otherwise}, \end{cases}$$
from which we get the combined forecast by the SA method, $\hat{y}_{t+1}^{SA} = \sum_{j=1}^{n}\hat{\omega}_{j,t+1}^{SA}\,y_{j,t+1} = \frac{1}{n_{t+1}}\sum_{j\in N_{t+1}^{A}}y_{j,t+1}$. Now let $y_{j,t+1}^{*}$ denote the computed input for the unobserved $y_{j,t+1}$, and define $\hat{y}_{t+1}^{SA} = \frac{1}{n}\left(\sum_{j\in N_{t+1}^{A}}y_{j,t+1} + \sum_{j\in N_{t+1}^{NA}}y_{j,t+1}^{*}\right)$. Then, since $\sum_{j\in N_{t+1}^{A}}y_{j,t+1} = n_{t+1}\hat{y}_{t+1}^{SA}$, we can obtain $\frac{1}{n-n_{t+1}}\sum_{j\in N_{t+1}^{NA}}y_{j,t+1}^{*} = \hat{y}_{t+1}^{SA}$. This implies that blindly applying the simple average method without properly allowing for the unbalanced structure gives rise to imputed values $y_{j,t+1}^{*}$ ($j \in N_{t+1}^{NA}$) such that the simple average of the imputed values must be equal to the simple average of the observed data at time $t+1$. For example, if only one data point is not observed, then simple averaging simply fills in the blank cell with the simple average of the observed data from the other forecasters, without recognizing the idiosyncrasies of the particular forecaster.
Next, suppose Bates and Granger's (BG) method is applied to an unbalanced panel. Naturally, the weights of the BG method are defined by
$$\hat{\omega}_{j,t+1}^{BG} = \begin{cases} 0, & \text{if } y_{j,t+1} \text{ is not available}, \\ \hat{\sigma}_{j,t}^{-2}\Big/\sum_{i\in N_{t+1}^{A}}\hat{\sigma}_{i,t}^{-2}, & \text{otherwise}, \end{cases}$$
from which we have the combined forecast by the BG method, $\hat{y}_{t+1}^{BG} = \sum_{j=1}^{n}\hat{\omega}_{j,t+1}^{BG}\,y_{j,t+1} = \sum_{j\in N_{t+1}^{A}}\left(\hat{\sigma}_{j,t}^{-2}\Big/\sum_{i\in N_{t+1}^{A}}\hat{\sigma}_{i,t}^{-2}\right)y_{j,t+1}$. If we assume that $\hat{\sigma}_{j,t}^{-2}$ are available for all $j$ at time $t$ and that values $y_{j,t+1}^{*}$ are filled into the missing spots at time $t+1$, then $\hat{y}_{t+1}^{BG} = \sum_{j\in N_{t+1}^{A}}\left(\hat{\sigma}_{j,t}^{-2}\Big/\sum_{i=1}^{n}\hat{\sigma}_{i,t}^{-2}\right)y_{j,t+1} + \sum_{j\in N_{t+1}^{NA}}\left(\hat{\sigma}_{j,t}^{-2}\Big/\sum_{i=1}^{n}\hat{\sigma}_{i,t}^{-2}\right)y_{j,t+1}^{*}$. Hence, using the fact that $\hat{y}_{t+1}^{BG}\sum_{i\in N_{t+1}^{A}}\hat{\sigma}_{i,t}^{-2} = \sum_{j\in N_{t+1}^{A}}\hat{\sigma}_{j,t}^{-2}\,y_{j,t+1}$, we can obtain $\sum_{j\in N_{t+1}^{NA}}\left(\hat{\sigma}_{j,t}^{-2}\Big/\sum_{i\in N_{t+1}^{NA}}\hat{\sigma}_{i,t}^{-2}\right)y_{j,t+1}^{*} = \hat{y}_{t+1}^{BG}$, demonstrating that inadvertently applying the BG procedure to unbalanced panels would produce imputed values $y_{j,t+1}^{*}$ ($j\in N_{t+1}^{NA}$) for the missing data in such a way that the BG weighted average of the imputed values equals the BG weighted average of the observed data at time $t+1$. Note that for the imputed values, the weights are based either on prior knowledge or on estimates from past available data. Again, if only one data point is not observed, then the BG approach implicitly fills in the blank cell with the BG weighted average of the observed data.
Now suppose the s-AFTER algorithm is applied directly to an unbalanced data set, with its weights given by
$$\hat{\omega}_{j,t+1}^{s\text{-}AFTER} = \begin{cases} 0, & \text{if } y_{j,t+1} \text{ is not available}, \\ \dfrac{\prod_{\tau=t_o+1}^{t}\hat{\sigma}_{j,\tau}^{-1}\exp\left(-\frac{1}{2}\sum_{\tau=t_o+1}^{t}\frac{e_{j,\tau}^{2}}{\hat{\sigma}_{j,\tau}^{2}}\right)}{\sum_{i\in N_{t+1}^{A}}\prod_{\tau=t_o+1}^{t}\hat{\sigma}_{i,\tau}^{-1}\exp\left(-\frac{1}{2}\sum_{\tau=t_o+1}^{t}\frac{e_{i,\tau}^{2}}{\hat{\sigma}_{i,\tau}^{2}}\right)}, & \text{otherwise}. \end{cases}$$
Then a similar derivation shows that
$$\frac{\sum_{j\in N_{t+1}^{NA}}\prod_{\tau=t_o+1}^{t}\hat{\sigma}_{j,\tau}^{-1}\exp\left(-\frac{1}{2}\sum_{\tau=t_o+1}^{t}\frac{e_{j,\tau}^{2}}{\hat{\sigma}_{j,\tau}^{2}}\right)y_{j,t+1}^{*}}{\sum_{j\in N_{t+1}^{NA}}\prod_{\tau=t_o+1}^{t}\hat{\sigma}_{j,\tau}^{-1}\exp\left(-\frac{1}{2}\sum_{\tau=t_o+1}^{t}\frac{e_{j,\tau}^{2}}{\hat{\sigma}_{j,\tau}^{2}}\right)} = \hat{y}_{t+1}^{s\text{-}AFTER},$$
indicating that the s-AFTER algorithm applied to unbalanced panels generates implicit values $y_{j,t+1}^{*}$ ($j\in N_{t+1}^{NA}$) for the missing data in such a way that the s-AFTER weighted average of the imputed values equals its counterpart for the observed data at time $t+1$. Again, past data for individuals whose current forecasts are missing are assumed to be available. In particular, if only one data point is missing, then the s-AFTER algorithm imputes it with the s-AFTER weighted average of the observed data.
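The implicit-imputation argument can be verified with a small numerical example (ours, with made-up numbers): filling a single missing forecast with the SA or BG average of the observed forecasts leaves the corresponding combined forecast unchanged, so each method is effectively evaluated on its own filled-in panel.

```python
import numpy as np

# Hypothetical period-(t+1) forecasts from four experts; expert 3's forecast is missing.
y = np.array([2.1, 2.4, 1.9, np.nan])
sigma2_hat = np.array([0.4, 0.8, 1.2, 0.6])   # estimated error variances from past data
obs = ~np.isnan(y)

# Combined forecasts using the observed forecasts only
sa_obs = y[obs].mean()
w_obs = (1 / sigma2_hat[obs]) / (1 / sigma2_hat[obs]).sum()
bg_obs = w_obs @ y[obs]
print("SA implicitly imputes", round(sa_obs, 3), "; BG implicitly imputes", round(bg_obs, 3))

# Filling the gap with each method's implied value reproduces that method's combined forecast.
w_full = (1 / sigma2_hat) / (1 / sigma2_hat).sum()
print("SA on filled panel:", np.where(obs, y, sa_obs).mean())        # equals sa_obs
print("BG on filled panel:", w_full @ np.where(obs, y, bg_obs))      # equals bg_obs
```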
We can now safely conclude that SA, BG, and s-AFTER are not
comparable when directly applied to an unbalanced panel because they are
implicitly using different (balanced) data sets. This conclusion also holds
for other combination procedures in general. Consequently, existing
results on evaluating the performance of various combination methods
may be misleading when these results are applied directly to unbalanced
panels. At the very least, great care and caution must be taken to interpret
these empirical results. Finally, the results in this subsection suggest that
the issue of unbalanced panels must be addressed properly before
comparing combined forecasts by various procedures.
Facing the missing data problem, different authors have handled the
situation differently. Issler and Lima (2009) reduced a much larger
incomplete panel data set to a small (N=18, T=41) balanced set. Note that
this type of trimming also implies that implicitly the discarded values of
the original unbalanced panel are essentially replaced by the means of the
remaining observations for each time period while implementing SA, thus
reducing the heterogeneity of the individual forecasters and the scope of
performance-based combination methods. Capistrán and Timmermann (2009) used US SPF data over 1987-2006 that contained a huge amount of missing observations (see our Figure 1) and trimmed the data by requiring forecasters to have a minimum of 10 common contiguous observations.
They computed the root mean squared errors (RMSE) of 12 different
combining procedures relative to SA based on the trimmed unbalanced
panel. Our analysis suggests that strictly speaking these RMSE ratios are
not comparable. Poncela et al. (2011) use one-period-ahead forecasts from the US SPF over 1991-2008, and restrict the data to those individuals who have been on the panel for a minimum of 7 years and never missed more than four consecutive forecasts. The resulting incomplete panel is then filled in two ways: i) the missing 1-quarter-ahead forecasts are replaced by the 2- (or 3- or 4-, if the previous forecasts are missing) quarter-ahead forecasts, or ii) the missing forecast is replaced by that individual's historical mean forecast. Since the same imputation scheme is used for all procedures in their paper, the results are comparable. However, this scheme minimizes the commonality of individual forecasts and emphasizes the idiosyncratic component in the forecast data. The imputation scheme used by Genre et al. (2013) uses both an empirically determined fraction of forecaster j's previously observed deviation from the average forecast and the average forecast in period t, thus drawing information from both directions.
3. U.S. SPF DATA AND MISSING DATA ISSUES
3.1. SPF DATA AND VARIABLES
The data we use to evaluate the performance of a number of forecast
combination algorithms is the U.S. Survey of Professional Forecasters
(SPF). It is a high quality, long standing, and widely used quarterly survey
on macroeconomic forecasts in the United States. The survey was initially
conducted by the American Statistical Association (ASA) and the National
Bureau of Economic Research (NBER). Starting from 1990, the survey
was taken over by the Federal Reserve Bank of Philadelphia. This change in administration led to a unique missing data pattern and thus a challenge for empirical work.
From the 39 regularly surveyed variables in SPF, we select the growth
rate of real GDP (RGDP), seasonally adjusted annual rate of change for
GDP price deflator (PGDP), the CPI inflation rate (CPI), and the
seasonally adjusted quarterly average unemployment rate (UNEMP) as
our target variables. For each variable, we examine the forecasts made for
the current quarter and the following 3 quarters, starting from the fourth
quarter of 1968 (1968:IV) to the third quarter of 2011 (2011:III).
3.2. MISSING DATA AND IMPUTATIONS
As shown in the previous section, different forecast combination
methods, when applied to incomplete panels, implicitly impute the
missing forecasts differently. It is easy to see that the amount and pattern
of missing data directly determine the extent to which the comparison
results are affected. Figure 1 shows the missing data structure of the SPF
over its entire life5. A black square in the figure represents a data point;
and a blank spot represents a missing forecast. Strikingly, the amount of missing data far exceeds the amount of available data. Taking one-quarter-ahead PGDP forecasts as an example: a fully balanced panel with
425 forecasters from 1968:IV to 2011:III without missing data would have
73,100 data points. However, we have only 6,520 data points in this
period. This means that 91% of the data are missing! As for the pattern of
missing data, Figure 1 shows that before 1990, there were a large number
of forecasters whose forecasts started from the initial years around 1970.
Then, about half of the forecasters stopped forecasting mid-way while the
5 PGDP one-quarter-ahead forecasts are used to construct Figure 1. The amount and
pattern of missing data for other variables are similar.
rest kept forecasting until around 1990. Only six forecasters who joined
the survey in its early days remain in the sample until recently. On the
other hand, starting from 1990, many new forecasters joined the survey
every few years, and about half of them kept forecasting.
Based on these observations, we choose to construct two subsamples
instead of using the entire dataset as a whole. The first subsample includes
the initial years from 1968:IV to 1990:IV. The second subsample goes
from 2000:I to 2011:III. We can thus utilize the part of the sample where
data points are highly concentrated, and avoid the part that contains too
many missing data points.
We limit our attention to frequent forecasters β ones with sufficient
number of observed forecasts β to further reduce the amount of missing
data, such that in both subsamples, the amount of missing data is kept as
low as possible while still maintaining a reasonable number of
forecasters.6 Specifically, we require forecasters to have at least 45
forecasts in subsample 1 or at least 36 forecasts in subsample 2. As a
result, depending on variables and subsamples, around 15 forecasters
remain. On average, there are about 40% missing data for subsample 1 and
about 15% missing data for subsample 2.
To accurately measure the performance of different combination
methods in the incomplete panel, we impute the missing data explicitly.
Two imputation methods are considered. The first method gives imputed values as $y_{j,t}^{*} = \frac{1}{n_{t}}\sum_{i\in N_{t}^{A}}y_{i,t}$ if $j \in N_{t}^{NA}$, where $n_{t}$ is the total number of elements in $N_{t}^{A}$. This method replaces missing forecasts with the simple average of the non-missing forecasts for the same period, which is the
6 We observe no clear relationship between performance and participation. Capistrán and Timmermann (2009) and Genre et al. (2013) also reported a similar finding.
imputation implied when using simple average for combination. The
downside of this method is that it reduces the level of forecast dispersion,
which limits the combination algorithmsβ ability to distinguish good
performers from poor ones. Especially for the performance-based
methods, it would be more reasonable if imputed values reflect, at least
partially, the past performance of the forecaster.
Such concerns lead us to the second imputation method, based on Genre et al. (2013), where the imputed values are given by $y_{j,t}^{*} - \bar{y}_{t} = \hat{\beta}_{j}\left[\sum_{s=1}^{4}\left(y_{j,t-s} - \bar{y}_{t-s}\right)\right]$, where $\bar{y}_{t}$ is the mean forecast at time $t$.
Intuitively, a missing individual forecast is replaced by an adjusted mean
forecast for that period. The adjustment is made according to the recent
average deviation of the forecast made by that forecaster from the mean
forecasts. This method is superior, in principle, to the first method,
because the imputed value for a forecaster incorporates both the common
component and idiosyncrasy of that forecaster. In particular, if a forecaster
tends to produce forecasts that are far from the average, his or her imputed
forecasts would reflect that characteristic.
Note that both imputation methods have to be implemented in real
time just like the combination methods. This presents no problem for the
first imputation method. But for the second method, the excessive amount
of missing data even after imposing the participation requirement makes it
infeasible sometimes to estimate all the $\beta_{j}$'s in real time. When this is the case, we use the most recent estimate of $\beta_{j}$ when such an estimate is
available.
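A minimal sketch of this second imputation scheme as we read it: for each forecaster, $\beta_j$ is estimated by regressing the forecaster's deviation from the cross-section mean on the sum of its previous four deviations, and a missing forecast is then replaced by the period mean plus the fitted adjustment. For brevity the sketch estimates $\beta_j$ once on the full sample, whereas in the actual exercise the estimation is done in real time; the exact specification in Genre et al. (2013) may differ in details such as the window length.

```python
import numpy as np

def impute_deviation_based(panel):
    """Fill missing forecasts in a (T x n) panel; NaN marks a missing entry.

    For each forecaster j, beta_j is the no-intercept OLS slope of the current
    deviation from the cross-section mean on the sum of the previous four
    deviations; a missing y_{j,t} is replaced by mean_t + beta_j * (that sum).
    """
    T, n = panel.shape
    filled = panel.copy()
    mean_t = np.nanmean(panel, axis=1)            # cross-section mean forecast each period
    dev = panel - mean_t[:, None]                 # deviations from the mean (NaN where missing)
    for j in range(n):
        x, z = [], []
        for t in range(4, T):
            lagged = dev[t - 4:t, j]
            if not np.isnan(dev[t, j]) and not np.isnan(lagged).any():
                x.append(lagged.sum())
                z.append(dev[t, j])
        x, z = np.array(x), np.array(z)
        beta = (x @ z) / (x @ x) if x.size > 2 and (x @ x) > 0 else 0.0
        for t in range(4, T):
            lagged = dev[t - 4:t, j]
            if np.isnan(panel[t, j]) and not np.isnan(lagged).any():
                filled[t, j] = mean_t[t] + beta * lagged.sum()
    return filled

# Hypothetical usage: a small panel with a gap for forecaster 2 in period 10
rng = np.random.default_rng(3)
panel = rng.normal(2.0, 0.5, size=(20, 5))
panel[10, 2] = np.nan
print(impute_deviation_based(panel)[10])
```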
Note that missing data creates an additional challenge to estimate the
weights associated with the algorithms. However, it may be noted that the
on-line algorithms may have a relative advantage over BG in this regard.
First, as pointed out before, they do not need estimates of error covariances. In addition, the weights of s-AFTER defined in (2.2) can be written recursively as
$$\hat{\omega}_{j,t+1}^{s\text{-}AFTER} = \frac{\hat{\omega}_{j,t}^{s\text{-}AFTER}\,\hat{\sigma}_{j,t}^{-1}\exp\left(-\frac{e_{j,t}^{2}}{2\hat{\sigma}_{j,t}^{2}}\right)}{\sum_{i=1}^{n}\hat{\omega}_{i,t}^{s\text{-}AFTER}\,\hat{\sigma}_{i,t}^{-1}\exp\left(-\frac{e_{i,t}^{2}}{2\hat{\sigma}_{i,t}^{2}}\right)} \quad \text{for } t \geq t_o + 1, \tag{3.1}$$
from which it is evident that previous forecast errors, which affect $\hat{\omega}_{j,t+1}^{s\text{-}AFTER}$ through $\hat{\omega}_{j,t}^{s\text{-}AFTER}$, play an equally important role in determining $\hat{\omega}_{j,t+1}^{s\text{-}AFTER}$ as the latest forecast error. Indeed, the natural logarithm of $\hat{\omega}_{j,t+1}^{s\text{-}AFTER}$ behaves like a unit root process. Thus a large forecast error tends to have a permanent effect on the weights of s-AFTER. For our purposes, this may be an advantage, since the long-memory property of $\hat{\omega}_{j,t+1}^{s\text{-}AFTER}$ may help to alleviate the problem of missing data in an unbalanced panel, as the impact of past errors does not decay at all.
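The recursion in (3.1) amounts to a one-line multiplicative update; a minimal sketch follows (our own, with illustrative numbers). Because the new weight multiplies the old one, the log-weights accumulate every past error, which is the unit-root behavior noted above: a single large error depresses a forecaster's weight permanently.

```python
import numpy as np

def s_after_update(w_prev, e_t, sigma2_hat_t):
    """One step of the recursive s-AFTER update in (3.1).

    w_prev:       weights used at time t (length n, summing to one)
    e_t:          forecast errors observed at time t
    sigma2_hat_t: current variance estimates for each forecaster
    """
    w = w_prev * np.exp(-0.5 * e_t ** 2 / sigma2_hat_t) / np.sqrt(sigma2_hat_t)
    return w / w.sum()

# A single large error permanently depresses that forecaster's weight.
w = np.full(3, 1 / 3)
sigma2 = np.array([0.3, 0.3, 0.3])
for e_t in [np.array([0.1, 0.2, -0.1]),
            np.array([2.5, 0.1, 0.2]),     # forecaster 0 makes a big mistake
            np.array([0.1, 0.1, -0.2])]:
    w = s_after_update(w, e_t, sigma2)
    print(w.round(3))
```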
4. PERFORMANCE OF COMBINATION METHODS
4.1. MEASURING THE PERFORMANCES OF COMBINATION
METHODS
To thoroughly evaluate the performance of the new combination
methods like the AFTER algorithms and the machine learning algorithm
MLS, we conduct a real time forecast combination exercise using several
popular existing methods in addition to the new methods. In addition to
the simple average method (SA), we consider Bates and Granger's method
(BG), as well as median (ME), recent best (RB), and trimmed mean (TM)
methods. SA is such that the combined forecast is the simple (equally
weighted) average of individual forecasts. ME method uses the median of
individual forecasts as the combined forecast. RB is the combined forecast
that is set to be the forecast made by the individual forecaster who enjoys
the best past performance as measured by MSE as of last period. TM
selects the mean of the pool of individual forecasts after the maximum and
the minimum forecasts are removed.7 The BG method and the AFTER algorithms are implemented as detailed in Section 2.
The MLS method is implemented according to Algorithm 1 in Sancetta (2010). The core step in the algorithm is to compute the current-period weight (before shrinkage) $w_{j,t+1}^{MLS'}$ for each individual forecaster, based on this individual's previous-period weight $w_{j,t}^{MLS}$ and the current-period loss $L_{t}(w_{t}^{MLS})$. Let $\nabla L_{t}(w_{t}^{MLS})$ be the gradient of the loss function with respect to the (previous-period) weight $w_{t}^{MLS}$, and $\nabla_{j}L_{t}(w_{t}^{MLS})$ be its $j$th element. The current-period weight is calculated as
$$w_{j,t+1}^{MLS'} = \frac{w_{j,t}^{MLS}\exp\left[-\eta\,t^{-\alpha}\,\nabla_{j}L_{t}(w_{t}^{MLS})\right]}{\sum_{i=1}^{n}w_{i,t}^{MLS}\exp\left[-\eta\,t^{-\alpha}\,\nabla_{i}L_{t}(w_{t}^{MLS})\right]},$$
where $\eta$ is the learning rate parameter, and $\alpha$ is a parameter that controls the speed of learning. In the final shrinkage step, which gives the current-period weight used for combination, $w_{j,t+1}^{MLS}$, all the $w_{j,t+1}^{MLS'}$ that are lower than a predetermined small threshold ($\gamma/n$, which is controlled by the parameter $\gamma$) are replaced by the threshold value $\gamma/n$, and the remaining weights are rescaled such that all weights add up to 1.
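A minimal sketch of the two-step update just described, written for a squared loss on the combined forecast; the parameter names ($\eta$, $\alpha$, $\gamma$) follow the text, while the function name, the squared-loss gradient, and the exact form of the rescaling after the floor are our reading of Algorithm 1 in Sancetta (2010) and should be checked against the original.

```python
import numpy as np

def mls_update(w_prev, forecasts_t, actual_t, t, eta=0.5, alpha=0.5, gamma=0.05):
    """One period of the MLS weight update with shrinkage.

    Step 1: exponentiated-gradient update using the gradient of the squared
            loss of the combined forecast with respect to the weights.
    Step 2: shrinkage -- weights below gamma/n are raised to gamma/n and the
            remaining weights rescaled so that all weights sum to one.
    """
    n = w_prev.size
    combined = w_prev @ forecasts_t
    grad = -2.0 * forecasts_t * (actual_t - combined)       # d/dw_j of (y - w'f)^2
    w = w_prev * np.exp(-eta * t ** (-alpha) * grad)
    w = w / w.sum()

    floor = gamma / n
    low = w < floor
    w[low] = floor
    w[~low] *= (1.0 - low.sum() * floor) / w[~low].sum()    # rescale the rest to sum to one
    return w

w = np.full(4, 0.25)
print(mls_update(w, forecasts_t=np.array([2.0, 2.3, 1.8, 2.6]), actual_t=2.1, t=5))
```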
In the MLS method, the gradient of the loss function $\nabla L_{t}(w_{t}^{MLS})$, together with the learning rate $\eta$ controlled by the power parameter $\alpha$, is used in the first update to generate the ex post combination weights, which are then projected onto a pre-specified subset in the second update (shrinkage) to ensure that all the weights are bounded by some threshold constraint. The MLS, like the BG method, aims to achieve the best forecast combination.
7 We have also considered the Winsorized mean method, where the top and bottom 5% of individual forecasts are trimmed and replaced by the remaining forecasts closest to the trimmed ones at both ends. Note that winsorization maintains the variability of individual forecasts more than trimming. We do not report results associated with the Winsorized mean because they were similar to TM.
As the simple average often provides a very good benchmark, we compare the performance of other combination methods against that of the simple average method using the relative MSE measure, cf. Genre et al. (2013).
For any combination method, the relative MSE is the ratio between the
MSE of the combined forecasts produced by that method and the MSE of
the averages of the individual forecasts. Since the statistical significance
of Diebold-Mariano (1995)-type tests of equal forecast accuracy across
different methods is not directly comparable, we do not report them here, cf. Capistrán and Timmermann (2009). The relative MSE provides
information on the relative forecast accuracy that is independent of the
absolute accuracy (i.e., the actual MSEs). The latter often vary greatly
depending on the variable, horizon, and sample periods. Therefore, it is
entirely possible that in certain cases, even the method with relatively
better performance produces poor forecasts. Apparently, combining such
forecasts is of no practical use, and comparisons like this are completely
spurious. Therefore, while still reporting the comparisons for longer
horizon forecasts, we focus our analysis on current-quarter and one-
quarter-ahead forecasts. As we carry out the forecast combination
exercises in real time, following now a standard practice in forecast
evaluation, we use the first vintage (initial release) of a variable as actual
values when calculating the MSEs.
4.2. COMPARISON OF THE ACCURACY OF COMBINED
FORECASTS
Comparison of alternative combination methods with special
reference to the on-line algorithms is one of the main objectives of this
study. We implement the above-discussed algorithms in real time on the
(filled-in) balanced panel and compare their performances. The MLS
algorithm is implemented with the exponent in the learning rate $\alpha = 0.5$, and the parameter that specifies the amount of shrinkage $\gamma = 0.05$, as chosen by Sancetta (2010) in implementing the algorithm. The learning rate parameter $\eta$ is chosen ex post from values in the set {0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99, 1} based on the performance of the combined forecasts.8 Table 1 gives the MSEs of the
combined forecasts for CPI Inflation, PGDP, RGDP and UNEMP
produced by different combination methods.9 Figure 2 shows, in terms of
bar charts, the MSEs for PGDP and UNEMP relative to SA based on what
are reported in Table 1.Thus, a vertical bar less than one means the method
under consideration is superior to SA. We find that despite producing
combined forecasts that are inferior to SA at times, most of the
performance-based combination methods, especially the newly developed
AFTER algorithms and machine learning algorithm (MLS), provide
considerable improvements in generating PGDP forecasts in the first
period (1968:IV-1990:IV) and in generating UNEMP forecasts in the second
8 For choosing the optimum learning rate, Sancetta (2010) proposes two methods in addition to choosing the learning rate ex post. But in most of our cases, the performance of the MLS algorithm is found to be insensitive to the choice of learning rate, similar to Sancetta (2010, p. 613).
9 In what follows, we report only the results from using the second imputation method.
Also, we omit results related to the median, trimmed mean, and Winsorized mean
combination methods because they did not contribute any additional insight to our
findings in the paper.
period (2000:I-2011:III). For PGDP, these combination methods perform
better in subsample 1, presumably because the volatility and heterogeneity
in individual forecasters were more substantial in the early 1980s. For
current-quarter forecasts, both the MLS and the AFTER algorithms
produce combined forecasts with MSEs lower than 80% of that of the
simple average forecasts. Improvements in forecast accuracies are noted
for one-quarter and two-quarter horizons as well with decreasing efficacy.
MSE of the current-quarter forecasts produced by the MLS algorithm, the
best performer for this subsample, is 1.250, far less than the benchmark
MSE of 1.687.
For UNEMP, the performance-based algorithms contribute the most
to forecast accuracy in the later subsample 2, when the unemployment rate
was drastically and unexpectedly affected by the most recent recession
beginning in 2007:IV. The combined forecasts produced by the L1-AFTER algorithm have about 10% lower MSE than the simple average for all four
horizons. The MLS algorithm also provides slightly more accurate
forecasts for all four horizons. The h-AFTER and the s-AFTER algorithms
noticeably outperform the benchmark at one-quarter and two-quarter
horizons.
Now looking at Table 1, we find that the MSEs for CPI inflation
associated with MLS, s-AFTER, L1-AFTER and h-AFTER for the current
quarter forecasts during the second period (i.e., 2000:I-2011:III) are less
than those for SA by substantial margins. For RGDP, none of the
alternative combining procedures shows clear-cut superiority over SA. We
also note that for all variables and in both subsamples, BG and MLS
methods never produce combined forecasts that are much less accurate
than the simple average, while the AFTER algorithms sometimes show
inferior performance. Overall, as the horizon increases, the contribution of the combination methods relative to the simple average becomes smaller, and the combined forecasts are often inferior to SA.
Since L1-AFTER and h-AFTER are derived using loss functions other
than the squared loss, it is necessary to compare their performance with
our simple average benchmark using appropriate loss functions. Table 2
provides such a comparison in two panels corresponding to Huber and
absolute losses for PGDP and UNEMP. As expected, the losses, when
evaluated under the appropriate loss, are smaller than those reported in
Table 1 under squared loss.10
More appropriately, in Figure 3 we present the MAE and Huber losses for L1-AFTER and h-AFTER respectively, normalized by the corresponding SA losses. The lengths of these bars in Figure 3 are mostly smaller than those of the corresponding bars in Figure 2. However, the differences are not large and do not alter our conclusions.
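For completeness, the alternative losses underlying Table 2 can be computed as in the sketch below; the Huber threshold $\delta$ is a tuning constant whose value is not specified in the text, so the value in the sketch is only a placeholder.

```python
import numpy as np

def absolute_loss(actual, combined):
    """Mean absolute error of the combined forecasts."""
    return np.mean(np.abs(actual - combined))

def huber_loss(actual, combined, delta=1.0):
    """Mean Huber loss: quadratic for small errors, linear beyond delta."""
    e = np.abs(actual - combined)
    return np.mean(np.where(e <= delta, 0.5 * e ** 2, delta * (e - 0.5 * delta)))

actual = np.array([2.0, 1.5, 3.2, 0.8])
combined = np.array([1.8, 1.9, 2.0, 1.0])
print(absolute_loss(actual, combined), huber_loss(actual, combined))
```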
4.3. IMPUTATION METHOD AND MISSING DATA RE-EXAMINED
The second imputation method discussed above preserves an
individualβs tendency to make forecasts that deviate from the cross-section
mean. This benefit can only be realized when the forecasters in our sample
do indeed have such a tendency, i.e., idiosyncratic biases. By regressing
the deviation of individual forecasts from the contemporaneous mean on
lagged deviations, we can check to what extent such a tendency exists.
These are reported in Table 3. For each variable and each subsample, we
report the number of regressions we ran, as well as the average proportion
10 The only exceptions are a few L1-AFTER figures for UNEMP current-quarter and 1-quarter-ahead forecasts, where the errors are very small fractions, so that their squares become much smaller than their absolute values.
of these regressions in which the past deviation is statistically significant
across all forecasters.
For PGDP in subsample 1, past deviations are significant in explaining future deviations in only 9% of the regressions when current-quarter forecasts are examined, but for three-quarter-ahead forecasts, past deviations are significant in 34% of the regressions. For PGDP in subsample 2, past deviations are significant in a considerably larger proportion of the regressions at all horizons, increasing from 22% to as high as 65%. The situation is the opposite for UNEMP, where in subsample 2, past deviations are significant in a smaller number of regressions compared to subsample 1. Still, as the horizon increases, past deviations for UNEMP become significant in a larger percentage of cases.
However, note that the difference between the missing forecasts
imputed by the second imputation method and those imputed by simple
average is generally rather small. Even for some of the forecasters whose
forecasts consistently deviate from the mean, since the deviations or
$\hat{\beta}_{j}$ may be small, the imputed values are often close to the mean. A similar
result was reported in Genre et al. (2013).
5. BEHAVIOR OF SELECTED FORECAST
COMBINATION ALGORITHMS
5.1. A CLOSER LOOK AT S-AFTER, MLS, AND BG METHOD
The results in the previous section clearly show the advantage of the
newly developed AFTER and MLS methods in certain cases. It is
therefore particularly interesting and informative to compare their
behavior to that of the familiar BG method. This comparison is presented
in Figures 4, 5 and 6. In each figure, from top to bottom, we show
individual forecastersβ squared forecast errors, the evolution of individual
forecastersβ cumulative MSEs, the weights estimated using the BG
method, the weights estimated using the s-AFTER (or MLS) method, as
well as squared errors of the combined forecasts produced by the two
methods.
Figure 4 compares the MLS method with the BG method using
current-quarter forecasts of PGDP for subsample 1. In this case, the MLS
method performs better than the BG method with 25% lower MSE. As the
individual squared errors and MSEs show, individual performances are
rather stable and clearly heterogeneous, with the exception of a few
quarters in the beginning. The BG method produces stable weights after
the first year or two, essentially weighting most of the forecasters equally
around 8%. This is the so-called portfolio diversification logic of BG
emphasized by Timmermann (2006). On the contrary, the MLS method
puts extremely high weights on the best forecaster who shows persistently
good performance. In the beginning of 1978, the previously identified best forecaster showed a small uptick in MSE, which led to a drastic down-weighting of this forecaster, with the second best picking up most of the reallocated weight. The weight assigned by the MLS method dropped by nearly 50%, while the weight assigned to this forecaster by the BG method dropped very little. As shown in the
comparison of the squared errors of the combined forecasts, the squared
error of BG combined forecasts are significantly larger than that of the
MLS combined forecasts. A similar event happened again in early 1981,
where the MLS combined forecasts showed a smaller error. In addition,
we note that starting from 1978, after the deterioration in the performance
of the previous top forecaster, the MLS method gave almost equal weights
to the two best forecasters, until after 1981 when the previous top
forecaster re-established his/her edge. During this period, no significant
change happened to the weights assigned to these two forecasters by the
BG method.
Figure 5 compares the s-AFTER method with the BG method using
the one-quarter-ahead forecasts of UNEMP for subsample 2. In this case,
apart from the relatively large errors seen after 2007, for most of the
quarters, individual forecast errors are small and individual performances
are clearly heterogeneous, especially for the best and worst forecasters.
This persistence in ranking is similar to the evidence in Aiolfi and
Timmermann (2006). Both the BG and the s-AFTER methods successfully
identify the best forecaster and assign relatively high weight. Still, the
weights assigned by the BG method are around 10%, while the weights assigned by the s-AFTER method vary from 20% to 95%. In two
quarters around early 2008 and early 2009, the best forecaster made
relatively big mistakes that led to notable increase in the MSE. Similar to
the behavior of the MLS method in the previous case, the s-AFTER
method drastically decreased the weights as a result of the performance hit: about a 40% decrease in weight in early 2008 and about a 20% decrease
in early 2009. Yang (2004) has emphasized this property of the s-AFTER
algorithm wherein a small error by a very good established forecaster
produces drastic weight adjustment. Aggressively weighting the good
forecaster and penalizing poor performance, s-AFTER displayed superior
performance with more than 20% reduction in MSE compared to the
benchmark. This is consistent with the theoretical relationship between s-
AFTER and BG derived in Section 2, equation (2.6). Note that the superior
performance of the s-AFTER method for this period, compared to the BG
method, comes mostly because of smaller errors during the period from
late 2008 to early 2009.
From the above two cases, we see that s-AFTER and MLS methods
behave more aggressively in adjusting the weights than the familiar BG
method.
This makes the algorithms adapt to changes in individual
forecasters' performances and adjust their weights in a speedy manner, so that changes in performance are quickly reflected in the weights. However, if
the changes in performance do not persist into the future, the adjustments
made by these algorithms may even worsen the situation. For example,
during periods with high volatility, a poor forecaster may produce a highly
accurate forecast purely by chance rather than due to forecasting skill.
When a change in performance is less likely to persist or uncertain, the
weights should arguably be adjusted cautiously rather than aggressively. A
psychological support for this logic can be found in Denrell and Fang
(2010). From a diversification perspective, aggressively adjusting weights
creates an increased amount of risk. If a structural break happens or a top
forecaster happens to behave poorly in one period, the combined forecasts
may suffer a huge unexpected loss. Figure 6 provides such an example. In
forecasting current quarter RGDP, the best forecaster, forecaster 44, who
was receiving nearly 90% of the weight from s-AFTER, made a big
mistake in 1979:IV. Even though immediately after this mistake, weight
assigned to this forecaster by s-AFTER method dropped to below 1%,
there was no chance to avoid a big forecast error in that period. This
mistake alone made the s-AFTER inferior to BG on average over the whole sample, even though for all other quarters in the sample the two forecasts are very close. Interestingly, after the 1979:IV
mistake, forecaster 44 vanished completely from combining even though
his/her performance continued to be in the middle range. Unlike the
model-based forecasts that tend to be more stable and rank preserving, the
relatively large psychological component in these expert survey forecasts
makes s-AFTER-type algorithms susceptible to such outliers.
In order to see how outliers are accommodated in L1-AFTER, we reexamine Figure 5, where the dynamics of BG and s-AFTER were discussed in forecasting UNEMP during 2000:I-2011:III. In Figure 7 we present the individual MAEs, the L1-AFTER weights, and the squared errors of the L1-AFTER forecasts during the same episode. Despite the huge forecast errors during 2008-09, whereas s-AFTER hesitated with its prime forecaster by scaling down the weight temporarily (see Figure 5), L1-AFTER downplayed the importance of the initial error and steadily increased the weight of the original best forecaster. The result was that in the post-2008 period, L1-AFTER had significantly smaller forecast errors than both BG and s-AFTER. Thus, this example clearly illustrates how, compared to s-AFTER, L1-AFTER dealt with and minimized the importance of the forecast outliers generated by the latest recession.
5.2. REQUIREMENTS FOR SUCCESSFUL COMBINATIONS
Based on the results in the previous sections, we are able to identify
the following conditions, under which the performance-based weighting
algorithms are likely to outperform the simple average method.
Firstly, it is crucial that performances of individual forecasters are
relatively stable over time and rank preserving. A stricter requirement is
that past performance of a forecaster is a good predictor of this
forecasterβs future performance. This condition is necessary because the
performance based weighting methods rely on past performance to predict
future performance. In our experiments, we found the predictive power to
be very low, and the widespread incidence of missing forecasts in
subjective survey forecasts like the SPF data make the value of the
combination weights even more tenuous.
Secondly, in order to produce a series of optimally-combined
forecasts that outperforms simple benchmarks, the differences in
forecastersβ performances should be sufficiently large together with widely
different correlations in forecast errors between forecasters. Much of these
requirements have been discussed in Aiolfi and Timmermann (2006), and
have been corroborated in the burgeoning psychological literature.11 However, they assume special significance while analyzing the
performance of aggressive on-line combination algorithms facing many
missing forecasts.12
Thirdly, for weighting methods that generously weight the best
forecaster based on past performance, e.g., the AFTER methods, it is
necessary that the best forecaster does not make big mistakes. Otherwise,
such sparse combining would produce combined forecasts that suffer
greatly from such mistakes.
6. CONCLUSIONS
This study focuses on the performance and behavior of the newly
developed AFTER and the MLS methods for forecast combination in
unbalanced panels. For monitoring large surveys like the SPF or Blue Chip forecasts, these on-line algorithms can be automated such that learning and adaptation to the time-varying relative usefulness of forecasters, old and new, can take place without user intervention. Our aim here is
not to run a horse race among alternative combining schemes, but rather to
11 See, for example, Larrick and Soll (2006, 2009), Yaniv and Milyavsky (2007), Vul and Pashler (2008), and Herzog and Hertwig (2009).
12 Elliott (2011) finds that simple averaging will be optimal if the row sums of the covariance matrix of forecast errors are equal for each row.
understand the conditions under which alternative forecast combination
algorithms can work well when compared against the simple equally weighted average.
To have a better understanding of how these alternative algorithms
work, we first establish the asymptotic relationship between the new s-
AFTER algorithm and the familiar Bates and Granger procedure. We find
that under the assumption that the conditional variance of the forecast
errors for each forecaster converges to the same value for all forecasters,
both the s-AFTER and BG methods operate in a way similar to the simple
average scheme. However, when heterogeneity in variances is present,
there is a simple nonlinear relationship between the two methods. s-
AFTER algorithm magnifies the weight assigned to a forecaster by BG
method, if the forecaster can be distinguished from other forecasters by
good past performance. On the other hand, s-AFTER reduces the weight
drastically towards zero if the performance is sufficiently below some data
dependent threshold level. Our empirical findings using SPF data illustrate
this theoretical result. In many cases, when using the on-line algorithms,
only a few top forecasters are given nontrivially positive weight. As a
result, this approach has the advantage that it does not require the
estimation of a large number of weight parameters.
We then show that when implementing different forecast combination
methods on unbalanced panels, each method implicitly imputes the
missing forecasts differently. This makes the performance of the combined
forecasts produced by different combination algorithms incomparable. To
address this issue, we explicitly impute missing forecasts using a
regression method that incorporates individual idiosyncrasies as well as
the average forecast of others, and use the same data to evaluate
alternative combination methods.
Furthermore, we evaluate these newly developed forecast
combination algorithms and examine in detail the inner mechanics
characterizing the algorithms. The empirical evidence confirms our
analytical results on the behavior of the combination algorithms. Our
results suggest that these robust on-line algorithms help to reduce the MSE
of the combined forecasts, when persistent forecaster heterogeneity and
outliers are prevalent in the forecast data. This is achieved mostly because
the algorithms are very agile in adapting to recent changes in individual
performances and weighting good forecasters aggressively. We find that
the on-line algorithms tend to perform well at shorter horizons, especially
when the other algorithms fail due to volatility clustering, structural
breaks, and outliers. In particular, the performances of individual
forecasters need to be sufficiently persistent and heterogeneous for the
newly developed pattern recognition and machine learning algorithms to
deliver maximum improvement in forecast accuracy. Unfortunately,
situations in which these conditions will prevail are difficult to determine
a priori. Thus, on balance, our evidence suggests that the simple un-
weighted average continues to be a dependable combination method in
summarizing survey data of forecasts provided by large panels with
frequent entry and exit of experts.
REFERENCES
Aiolfi, M., Timmermann, A. (2006). Persistence in forecasting performance and
conditional combination strategies. Journal of Econometrics 135:31-53.
Altavilla, C., De Grauwe, P. (2010). Forecasting and combining competing models of
exchange rate determination. Applied Economics 42:3455-3480.
Bates, J., Granger, C. W. J. (1969). The combination of forecasts. Operational Research Quarterly 20:451-468.
Capistrán, C., Timmermann, A. (2009). Forecast combination with entry and exit of experts. Journal of Business and Economic Statistics 27(4):428-440.
Davies, A., Lahiri, K. (1995). A new framework for analyzing survey forecasts using
three-dimensional panel data. Journal of Econometrics 68:205-227.
Denrell, J., Fang, C. (2010). Predicting the next big thing: Success as a signal of poor
judgment. Management Science 56(10):1653-1667.
Diebold, F. X., Mariano, R. S. (1995). Comparing predictive accuracy. Journal of
Business and Economic Statistics 13:253-263.
Elliott, G. (2011). Averaging and the optimal combination of forecasts. Unpublished
manuscript.
Fan, S., Chen, L., Lee, W. J. (2008). Short-term load forecasting using comprehensive
combination based on multi-meteorological information. Industrial and
Commercial Power Systems Technical Conference. ICPS. IEEE/IAS.
Genre, V., Kenny, G., Meyler, A., Timmermann, A. (2013). Combining expert forecasts:
can anything beat the simple average? International Journal of Forecasting
29(1):108-121.
Herzog, S. M., Hertwig, R. (2009). The wisdom of many in one mind: Improving individual judgments with dialectical bootstrapping. Psychological Science 20:231-237.
Inoue, A., Kilian, L. (2008). How useful is bagging in forecasting economic time series?
A case study of U.S. consumer price inflation. Journal of the American Statistical
Association 103(482):511-522.
Issler, J., Lima, L. (2009). A panel data approach to economic forecasting: The bias-
corrected average forecast. Journal of Econometrics 152(2):153-164.
Kang, H. (1986). Unstable weights in the combination of forecasts. Management Science 32(6):683-695.
Lahiri, K., Sheng, X. (2010). Measuring forecast uncertainty by disagreement: The missing link. Journal of Applied Econometrics 25(4):514-538.
Larrick, R. P., Soll, J. B. (2006). Intuitions about combining opinions: Misappreciation of the averaging principle. Management Science 52:111-127.
Leung, G., Barron, A. R. (2006). Information theory and mixing least-squares regressions. IEEE Transactions on Information Theory 52(8):3396-3410.
Poncela, P., Rodriguez, J., Sanchez-Mangas, R., Senra, E. (2011). Forecast combination
through dimension reduction techniques. International Journal of Forecasting
27:224-237.
Rapach, D. E., Strauss, J. K. (2005). Forecasting employment growth in Missouri with many potentially relevant predictors: An analysis of forecast combination methods. Federal Reserve Bank of St. Louis Regional Economic Development 1(1):97-112.
Rapach, D. E., Strauss, J. K. (2007). Forecasting real housing price growth in the Eighth District states. Federal Reserve Bank of St. Louis Regional Economic Development 3(2):33-42.
Sancetta, A. (2010). Recursive forecast combination for dependent heterogeneous data.
Econometric Theory 26:598-631.
Sánchez, I. (2008). Adaptive combination of forecasts with application to wind energy. International Journal of Forecasting 24:679-693.
Schmidt, P. (1977). Estimation of seemingly unrelated regressions with unequal numbers
of observations. Journal of Econometrics 5:365-377.
Smith, J., Wallis, K.F. (2009). A Simple Explanation of the Forecast Combination Puzzle.
Oxford Bulletin of Economics and Statistics 71:331-355.
Soll, J. B., Larrick, R. P. (2009). Strategies for revising judgment: How (and how well) people use others' opinion. Journal of Experimental Psychology: Learning, Memory, and Cognition 35:780-805.
Stock, J.H., Watson, M.W. (2004). Combination forecasts of output growth in a seven-
country data set. Journal of Forecasting 23:405-430.
Timmermann, A. (2006). Forecast combinations. In: Elliott, G., Granger, C. W. J., Timmermann, A. (Eds.), Handbook of Economic Forecasting. Elsevier Press.
Vovk, V.G. (1990). Aggregating strategies. Proceedings of the third annual workshop on
computational learning theory. Morgan Kaufmann Publishers Inc., Rochester, New
York, United States, 371-386.
Vul, E., Pashler, H. (2008). Measuring the crowd within: Probabilistic representations within individuals. Psychological Science 19:645-647.
Wei, Y., Yang, Y. (2012). Robust forecast combinations. Journal of Econometrics 166(2):224-236.
Yang, Y. (2004). Combining forecasting procedures: some theoretical results.
Econometric Theory 20:176-222.
Yaniv, I., Milyavsky, M. (2007). Using advice from multiple sources to revise and
improve judgments. Organizational Behavior and Human Decision Processes
103:104-120.
Zou, H., Yang, Y. (2004). Combining time series models for forecasting. International
Journal of Forecasting 20:69-84.
Table 1. MSEs of forecasts made by different combination methods
Method      Current Quarter Forecasts      1-Quarter Ahead Forecasts      2-Quarter Ahead Forecasts      3-Quarter Ahead Forecasts
            1968:IV to   2000:I to         1968:IV to   2000:I to         1968:IV to   2000:I to         1968:IV to   2000:I to
            1990:IV      2011:III          1990:IV      2011:III          1990:IV      2011:III          1990:IV      2011:III
CPI Inflation (CPI)
BG 3.579 10.973 11.576 11.402
ME
3.831
10.867
11.819
11.420
RB
2.724
12.382
12.037
11.429
SA
3.870
10.994
11.513
11.313
TM
3.787
10.915
11.520
11.502
MLS
2.887
11.017
11.507
9.156
L1-AFTER
2.429
13.033
12.355
11.219
h-AFTER
2.617
12.207
12.269
11.879
s-AFTER 2.659 12.172 12.264 11.799
GDP Price Deflator Inflation (PGDP)
BG 1.605 1.013 3.403 1.018 4.212 1.061 4.840 1.103
ME 1.639 1.140 3.412 1.019 4.202 1.081 4.778 1.229
RB 1.494 1.047 3.140 1.821 4.706 1.616 6.255 1.434
SA 1.687 1.044 3.481 1.039 4.286 1.058 4.869 1.080
TM 1.650 1.079 3.430 1.020 4.254 1.056 4.821 1.114
MLS 1.250 1.042 3.358 1.036 4.251 0.978 4.872 1.083
L1-AFTER 1.291 0.993 3.129 1.263 4.483 1.323 7.521 1.474
h-AFTER 1.325 1.005 3.142 1.281 4.151 1.224 5.491 1.393
s-AFTER 1.326 1.016 3.135 1.288 4.149 1.213 5.460 1.386
Real GDP Growth (RGDP)
BG 5.890 1.534 11.657 4.025 14.696 6.687 18.320 8.498
ME 6.100 1.559 11.231 4.075 14.541 6.724 17.801 8.505
RB 9.335 1.715 14.222 4.319 16.067 7.262 22.689 7.514
SA 6.038 1.524 11.374 4.005 14.662 6.669 17.971 8.494
TM 5.982 1.517 11.457 3.973 14.471 6.716 17.503 8.482
MLS 6.096 1.526 11.416 4.019 14.701 6.683 18.170 8.508
L1-AFTER 8.300 1.580 12.520 4.591 17.370 6.993 21.615 8.656
h-AFTER 7.350 1.514 12.096 4.199 15.069 7.443 21.300 8.666
s-AFTER 7.297 1.516 12.173 4.189 15.022 7.630 21.543 8.738
Unemployment Rate (UNEMP)
BG 0.049 0.027 0.241 0.182 0.490 0.529 0.812 1.192
ME 0.047 0.028 0.245 0.188 0.494 0.544 0.810 1.204
RB 0.051 0.033 0.227 0.132 0.501 0.484 0.793 1.101
SA 0.050 0.029 0.242 0.188 0.496 0.546 0.822 1.222
TM 0.050 0.027 0.238 0.186 0.504 0.545 0.814 1.205
MLS 0.049 0.028 0.242 0.185 0.496 0.541 0.822 1.219
L1-AFTER 0.053 0.025 0.240 0.144 0.492 0.486 0.812 1.056
h-AFTER 0.051 0.028 0.248 0.150 0.486 0.497 0.798 1.279
s-AFTER 0.052 0.029 0.250 0.147 0.486 0.497 0.801 1.287
* Numbers in bold denote DM test rejection at 10%, i.e., the method performs significantly better than SA.
Table 2. Performance of h-AFTER and L1-AFTER using alternative loss functions
Panel I. Performance of h-AFTER and SA benchmark measured by Huber loss
Method      Current Quarter Forecasts      1-Quarter Ahead Forecasts      2-Quarter Ahead Forecasts      3-Quarter Ahead Forecasts
            1968:IV to   2000:I to         1968:IV to   2000:I to         1968:IV to   2000:I to         1968:IV to   2000:I to
            1990:IV      2011:III          1990:IV      2011:III          1990:IV      2011:III          1990:IV      2011:III
GDP Price Deflator Inflation (PGDP)
SA 1.272 0.872 1.959 0.888 2.277 0.916 2.527 0.945
h-AFTER 1.000 0.822 1.755 0.961 2.239 0.995 2.702 1.138
Unemployment Rate (UNEMP)
SA 0.050 0.029 0.235 0.188 0.435 0.454 0.665 0.829
h-AFTER 0.051 0.028 0.242 0.150 0.438 0.420 0.657 0.846
Panel II. Performance of L1-AFTER and SA benchmark measured by MAE
Method      Current Quarter Forecasts      1-Quarter Ahead Forecasts      2-Quarter Ahead Forecasts      3-Quarter Ahead Forecasts
            1968:IV to   2000:I to         1968:IV to   2000:I to         1968:IV to   2000:I to         1968:IV to   2000:I to
            1990:IV      2011:III          1990:IV      2011:III          1990:IV      2011:III          1990:IV      2011:III
GDP Price Deflator Inflation (PGDP)
SA 1.019 0.828 1.402 0.868 1.575 0.889 1.711 0.891
L1-AFTER 0.899 0.784 1.265 0.841 1.581 0.971 2.039 1.015
Unemployment Rate (UNEMP)
SA 0.176 0.127 0.381 0.298 0.523 0.504 0.675 0.716
L1-AFTER 0.164 0.124 0.377 0.277 0.533 0.494 0.692 0.711
* L1-AFTER implemented using dji.
Table 3. Percentage of significant imputation regressions at each horizon
Variable   Subsample   Number of Regressions   Current-Quarter Forecasts   One-Quarter Ahead Forecasts   Two-Quarter Ahead Forecasts   Three-Quarter Ahead Forecasts
PGDP 1 88 0.09 0.23 0.28 0.34
2 25 0.22 0.64 0.75 0.65
UNEMP 1 88 0.26 0.52 0.68 0.75
2 25 0.23 0.44 0.47 0.53
* Number of Regressions: total number of imputation regressions run (one for each forecaster) in each subsample. Remaining columns: percentage of significant regressions at each horizon.
Figure 1. Overview of data patterns
This figure shows the overall data patterns for all forecasters, using PGDP one-quarter ahead forecasts as an example since the patterns are similar for all other variables and horizons. A dot represents an available forecast; a blank represents a missing observation.
[Figure body: horizontal axis shows the survey date (1970q4 to 2010q4); vertical axis shows the forecaster ID (up to 300); Subsample 1 covers 1968:IV to 1990:IV and Subsample 2 covers 2000:I to 2011:III.]
Figure 2. Relative forecast performance of alternative combination methods
Performance reported in this set of figures is MSE of the method of interest relative to MSE of simple average
method. Each group of bars represents one method (denoted under the group). The four bars in each group represent
current quarter forecasts to 3-quarter ahead forecasts (left to right).
GDP Price Deflator Inflation (PGDP)
Subsample 1: 1968:IV to 1990:IV Subsample 2: 2000:I to 2011:III
Unemployment Rate (UNEMP)
Subsample 1: 1968:IV to 1990:IV Subsample 2: 2000:I to 2011:III
[Figure body: four bar-chart panels; the vertical axis, MSE relative to SA, runs from 0.6 to 1.2; each panel contains bars for BG, RB, MLS, L1-AFTER, h-AFTER, and s-AFTER.]
Figure 3. L1-AFTER and h-AFTER Evaluated Using Appropriate Loss Functions
Performance reported in this set of figures is MAE (for L-AFTER) or Huber loss (for h-AFTER) relative to the
respective loss of simple average method. The four bars in each group of bars represent current quarter forecasts to
3-quarter ahead forecasts (left to right). The name of the variable, method, and subsample are denoted under the
bars.
[Figure body: bar chart; the vertical axis, loss relative to SA, runs from 0.6 to 1.2; bar groups for PGDP and UNEMP appear under h-AFTER and L-AFTER for Subsample 1 and Subsample 2.]
Figure 4. Evolution of weights and performance of individual forecasters - PGDP
GDP Price Deflator Inflation (PGDP), subsample 1, current-quarter forecasts
[Figure body: panels show, from top to bottom, individual squared errors, individual MSEs, BG weights, MLS weights, and the squared errors of the BG and MLS combined forecasts, plotted from 1974 to 1990.]
Figure 5. Evolution of weights and performance of individual forecasters - UNEMP
Unemployment Rate (UNEMP), subsample 2, one-quarter ahead forecasts
[Figure body: panels show, from top to bottom, individual squared errors, individual MSEs, BG weights, s-AFTER weights, and the squared errors of the s-AFTER and BG combined forecasts, plotted quarterly from 2005 to 2011.]
Figure 6. Reacting to an Outlier - Behavior of the BG and s-AFTER methods
Real GDP (RGDP), subsample 1, current-quarter forecasts. In the top four panels, the solid line represents forecaster 44 and the dashed line represents forecaster 65. The outlier is forecaster 44's forecast for the fourth quarter of 1974.
[Figure body: panels show, from top to bottom, individual squared errors, individual MSEs, BG weights, s-AFTER weights, and the squared errors of the BG and s-AFTER combined forecasts, plotted from 1974 to 1990.]
Figure 7. Reacting to a Structural Break - Behavior of the L-AFTER method
Unemployment rate (UNEMP), subsample 2, one-quarter-ahead forecasts. Shown from top to bottom are individual
MAEs, weights assigned to individual forecasters by L-AFTER method, and the squared errors of the combined
forecasts produced by L-AFTER method. Behavior of s-AFTER and BG methods for this case can be seen in
subfigure 2 of Figure 3. The individual (forecaster 483) receiving the highest weight is the same in BG, s-AFTER,
and L-AFTER.
[Figure body: panels show, from top to bottom, individual MAEs, L-AFTER weights, and the squared errors of the combined forecasts produced by L-AFTER, plotted quarterly from 2005 to 2011.]