MACHINE LEARNING AND FORECAST
COMBINATION IN INCOMPLETE PANELS
Kajal Lahiri, Huaming Peng, Yongchen Zhao1
Department of Economics, University at Albany, SUNY
Albany, New York, USA
This paper focuses on the newly proposed on-line forecast combination
algorithms in Sancetta (2010), Yang (2004), and Wei and Yang (2012). We first
establish the asymptotic relationship between these new algorithms and the Bates
and Granger (1969) method. Then, we show that when implemented on unbalanced
panels, different combination algorithms implicitly impute missing data differently,
making results not comparable across methods. Using forecasts of a number of
macroeconomic variables from the U.S. Survey of Professional Forecasters, we
evaluate the performance of the new algorithms and contrast their inner
mechanisms with that of Bates and Granger's method. Missing data in the SPF panels are specifically controlled for by explicit imputation. We find that even though the equally weighted average is hard to beat, the new algorithms deliver superior performance, especially during periods of volatility clustering and structural breaks.
Keywords On-line learning; Recursive algorithms; Unbalanced panel; SPF
forecasts.
JEL Classification C22; C53; C14.
1. INTRODUCTION
Since the seminal work of Bates and Granger (1969), the potential
benefits of combining multiple forecasts instead of simply choosing the
single best has long been recognized. The basic idea is that under certain
1 An earlier version of the paper was presented at the New York Camp Econometrics VI (Lake
Placid, April 2011) and the 17th International Panel Data Conference (Montreal, July 2011). We
thank Cheng Hsiao and Tom Wansbeek for helpful comments.
conditions, optimally combined forecast can be more accurate than
individual forecasts in the panel. Moreover, combining forecasts can be a
useful hedge against structural breaks and model instability, see
Timmermann (2006) for a survey. However, despite the development of
many new forecast combination methods during the past forty years,
empirical studies still find that a simple average (henceforth SA) of forecasts performs very well compared to more elaborate procedures. This "forecast combination puzzle", as dubbed originally by Stock and Watson (2004), is related to several issues. First is the "curse of dimensionality",
since performance-based combination methods require the estimation of a
large number of weight parameters. A number of well-designed Monte
Carlo studies and analytical results have illustrated this problem, see Kang
(1986), Smith and Wallis (2009), Capistrán and Timmermann (2009), and
Issler and Lima (2009). There are other equally important factors too. A
forecaster's past record may not be a good indicator of his/her future
performance due to structural breaks, outliers, new information,
uncertainty shocks, and other ex-ante unobservables. These factors can
make the relative rankings and combination weights unstable,
unpredictable, and generally misleading. Aiolfi and Timmermann (2006)
document such crossings in the context of a large number of forecasting
models.
Parallel to the aforementioned literature, another promising approach
to forecast combination has developed in recent years using aggregation
algorithms and on-line learning. Two fundamental components of
successful forecast combination are the choice of the combination rule and
the weights. In the latter approach, time varying combination weights are
naturally built into on-line recursive algorithms that do not require the
knowledge of the full covariance matrix of the forecast errors. Yang
(2004) distinguished between two broad approaches to combining: in the first, the combined forecast tries to be as good as the best in the group, called combining for adaptation; and in the second approach, the combined forecast tries to be better than each individual forecast, called combining for improvement. The Bates and Granger (BG) procedure falls in the latter
category. Yang (2004) suggested a new automated combining method, the
Aggregated Forecast Through Exponential Reweighting algorithm
(henceforth AFTER), which is a variant of the aggregating algorithm
proposed by Vovk (1990), and belongs to the first category. AFTER has
been found to be useful in many recent applications.2 It is now well
known that, under certain circumstances, the cost of combining for
improvement due to parameter estimation can be substantially higher than
that of combining for adaptation.
In addition to AFTER, we consider another on-line recursive
algorithm from the machine learning literature with shrinking (henceforth
MLS) due to Sancetta (2010). Unlike the BG approach, the on-line
algorithms tend to select the few top forecasters, thus requiring much
smaller number of estimated parameters. By simple recursive updates,
they allow for time varying optimal combination weights. There are subtle
differences in the assumptions built into AFTER and MLS algorithms:
AFTER requires the existence of a moment generating function of the
forecast errors, but the errors need not be bounded. MLS does not require
any assumption on the nature of the forecasts and the actuals, or the
stability of the system besides a tail condition on the error distribution. But
in the process it can only establish rather weak performance bounds. Wei
2 See Altavilla and De Grauwe (2010), Rapach and Strauss (2005, 2007), Leung and Barron (2006), Sanchez (2008), Fan, Chen and Lee (2008), and Inoue and Kilian (2008).
and Yang (2012) noted that with alternative forecasts being similar and
stable, the BG approach tends to be unnecessarily aggressive, and AFTER,
in these situations, can perform better. On the other hand, when the best
forecaster changes over time and is unstable, the gradient-based method for improvement as suggested by MLS can be advantageous. Instead of looking for the best combined forecast, it could be more profitable to look for the best available forecast in real time. Thus, these methods can be complementary, depending on the forecasting environment one may face in
real time.
Wei and Yang (2012) extended the AFTER algorithm that is designed
for squared loss (s-AFTER) to absolute error loss (L1-AFTER) and Huber
loss functions (h-AFTER), with the special objective of reducing the
influence of outliers. In the presence of structural breaks, outliers are more
likely to occur. The quadratic loss coupled with the exponential weighting scheme makes the AFTER weights very sensitive to
outliers. The BG combination weights can be very sensitive to structural
breaks too. In a real life situation, since the future scenario is seldom
predictable, one cannot determine a priori which combining strategy to
adopt. However, Wei and Yang (2012) have shown that while robust to
outliers, L1-AFTER and h-AFTER are only marginally inferior to s-
AFTER when errors are normally distributed with no outliers.
There are three main objectives of this paper: First, we derive the
asymptotic forms of s-AFTER, L1-AFTER and BG method and establish
their asymptotic relationships. These asymptotic relationships not only
provide us with fresh new insights into the mechanism of how the
AFTER-type algorithms operate relative to the BG and simple averaging
scheme, but also explain the rationale behind the distinct forecast
performance of these combination methods. More specifically, we show
that in the presence of heteroscedasticity, s-AFTER eventually behaves
similarly to a normalized power function of the BG weights. In
particular, the s-AFTER algorithm would magnify the BG weights if they
are sufficiently large while discounting the weights of BG dramatically if
they are below some threshold level. On the other hand, not unexpectedly,
we show that if the forecast errors are asymptotically stationary over time
and conditionally homoscedastic across forecasters or forecasting
procedures, then s-AFTER, L1-AFTER and BG methods all behave like
the SA scheme when both the training and the remaining sample sizes are
sufficiently large.
Second, we evaluate these newly developed on-line algorithms and
compare them with many existing combining procedures using a large
data set of expert forecasts for a number of macroeconomic variables. We
use the U.S. Survey of Professional Forecasters (SPF) from 1968:IV with
many missing data that are very typical of such surveys. The long time span over which these forecasts have been recorded will hopefully help us to
understand the relative strengths and weaknesses of these combining
schemes under widely different forecasting scenarios actually observed in
real life.
Finally, we examine the implications of missing data in a panel on the
comparison of alternative combining procedures. Schmidt (1977) spurred
early work on the estimation of panel data models with incomplete panels.
In our context, Capistrán and Timmermann (2009) were the first to
examine the implication of incomplete panels due to entry, exit and re-
entry of experts and demonstrated that due to reduced number of
overlapping forecasts, the performance of procedures like BG that require
estimating error covariances will deteriorate relative to methods like SA
that do not need them. The same reasoning would favor the newer on-line
automated procedures like AFTER and MLS because they do not require
such covariances. In addition to this problem, we also show that if the
missing observations are not explicitly treated uniformly across different
combining procedures, different combining schemes would yield different
imputed values. This effectively means that, when applied to unbalanced
panels, different combination methods are implemented on different data
sets, and naturally the results of different combinations will not be
comparable. In our exercises, we address the issue of incomplete panel by
explicit imputations, improvising on a procedure suggested by Genre et al.
(2013).
The organization of the paper is as follows. Section 2 contains
theoretical results on the asymptotic relationship between different
combination methods and the incomparability between combination
methods when applied to unbalanced panel data. Section 3 discusses the
dataset and data-related issues including imputations of missing data. We
evaluate the performance of the newly developed combination algorithms
and compare them with some of the extant methods in Section 4, and
study the behavior of the on-line algorithms more closely in Section 5.
Section 6 concludes.
2. COMBINATION METHODS AND IMPLIED
IMPUTATIONS
2.1. COMBINATION METHODS AND THEIR ASYMPTOTIC
RELATIONSHIPS
In this subsection, we study the asymptotic relationships between the AFTER algorithms, the BG algorithm, and the SA method. Suppose there are $n$ forecasters, $y_t$ is the variable of interest at time $t$ ($t = 1,\dots,T$), and $y_{j,t}$ is the forecast of $y_t$ made by the $j$th forecaster at time $t - h$, where $h$ is a positive integer indicating the forecast horizon.3 The forecast combination problem is how to assign weights to these $n$ forecasters at time $t+1$ after observing $y_\tau$, $y_{j,\tau}$ and the associated forecast errors $e_{j,\tau} = y_\tau - y_{j,\tau}$ for $\tau = 1,\dots,t$ and $j = 1,\dots,n$.
A popular solution to the forecast combination problem is Bates and Granger's (1969) approach, where the combining weights are proportional to the inverse of the mean squared errors (MSE) (see Stock and Watson, 2004). More specifically, at time $t$, the BG method computes and assigns the weight
$$\omega_{j,t+1}^{BG} = \frac{\sigma_{j,t}^{-2}}{\sum_{i=1}^{n}\sigma_{i,t}^{-2}}$$
to the $j$th forecaster for time $t+1$, assuming the forecasts are unbiased. In practice, the variances $\sigma_{j,t}^{2}$ are rarely known and are usually replaced by their estimates $\hat{\sigma}_{j,t}^{2} = \frac{1}{t-1}\sum_{\tau=1}^{t-1}e_{j,\tau}^{2}$. But then $\hat{\omega}_{j,t+1}^{BG} - \omega_{j,t+1}^{BG} \to_{p} 0$ whenever $\hat{\sigma}_{j,t}^{2} - \sigma_{j,t}^{2} \to_{p} 0$ for all $j$ as $t \to \infty$. If a weak stationarity condition such that $\sigma_{j,t}^{2} = \sigma_{j}^{2}$ for all $t$ is imposed, the BG weights can be further simplified to
$$\omega_{j,t+1}^{BG} = \frac{\sigma_{j}^{-2}}{\sum_{i=1}^{n}\sigma_{i}^{-2}}. \tag{2.1}$$
The newly developed s-AFTER algorithm proposed by Yang (2004) aims to pick the best few forecasters by minimizing the squared loss function. According to Yang (2004), when the errors are normal and the variances are estimated, the weights of s-AFTER are estimated by
$$\hat{\omega}_{j,t+1}^{s\text{-}AFTER} = \frac{\prod_{\tau=t_o+1}^{t}\hat{\sigma}_{j,\tau}^{-1}\exp\left(-\frac{1}{2}\sum_{\tau=t_o+1}^{t}\frac{e_{j,\tau}^{2}}{\hat{\sigma}_{j,\tau}^{2}}\right)}{\sum_{i=1}^{n}\prod_{\tau=t_o+1}^{t}\hat{\sigma}_{i,\tau}^{-1}\exp\left(-\frac{1}{2}\sum_{\tau=t_o+1}^{t}\frac{e_{i,\tau}^{2}}{\hat{\sigma}_{i,\tau}^{2}}\right)} \quad \text{for } t \geq t_o + 1, \tag{2.2}$$
where $t_o$ is the size of the training sample.
3 Without loss of generality, we do not identify the forecast horizon in this section.
In this scheme, the lower the value of $\hat{\sigma}_{j,t}^{2}$, the higher the weight. Also, the latest squared forecast error is evaluated relative to its estimated expected value $\hat{\sigma}_{j,t}^{2}$. Thus, a large squared forecast error relative to its estimated expected value is interpreted as a sign of potential deterioration in the particular forecaster's performance. As a result, the contribution of that forecast to the combination is exponentially reduced; see also Zou and Yang (2004).
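A minimal sketch of the weight formula in (2.2), with the variance estimates $\hat{\sigma}_{j,\tau}^{2}$ formed recursively from past squared errors; the function name, the indexing convention for the training sample, and the use of log-weights to avoid numerical underflow are our implementation choices, not prescriptions from Yang (2004).

```python
import numpy as np

def s_after_weights(errors, t0):
    """s-AFTER weights as in (2.2) from a (t x n) matrix of forecast errors.

    errors[tau, j] is the error of forecaster j in period tau; t0 is the size
    of the training sample used to initialize the variance estimates.
    """
    t, n = errors.shape
    log_w = np.zeros(n)
    for tau in range(t0, t):
        sigma2_hat = np.mean(errors[:tau] ** 2, axis=0)   # variance estimate from data before tau
        log_w += -0.5 * np.log(sigma2_hat) - 0.5 * errors[tau] ** 2 / sigma2_hat
    log_w -= log_w.max()                                  # stabilize before exponentiating
    w = np.exp(log_w)
    return w / w.sum()

rng = np.random.default_rng(1)
e = rng.normal(scale=[0.5, 1.0, 2.0], size=(40, 3))
print(s_after_weights(e, t0=10))    # the low-variance forecaster receives almost all the weight
```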
If $\hat{\sigma}_{j,t}^{2} - \sigma_{t}^{2} \to_{p} 0$ for all $j$,4 then as $t - t_o \to \infty$,
$$\left\{\prod_{\tau=t_o+1}^{t}\hat{\sigma}_{j,\tau}^{-1}\exp\left(-\frac{1}{2}\sum_{\tau=t_o+1}^{t}\frac{e_{j,\tau}^{2}}{\hat{\sigma}_{j,\tau}^{2}}\right)\right\}^{\frac{1}{t-t_o}} - \left\{\prod_{\tau=t_o+1}^{t}\sigma_{\tau}^{-1}\exp\left(-\frac{1}{2}\sum_{\tau=t_o+1}^{t}\frac{e_{j,\tau}^{2}}{\sigma_{\tau}^{2}}\right)\right\}^{\frac{1}{t-t_o}} \to_{p} 0.$$
As a result, we obtain
$$\frac{\left\{\hat{\omega}_{j,t+1}^{s\text{-}AFTER}\right\}^{\frac{1}{t-t_o}}}{\sum_{i=1}^{n}\left\{\hat{\omega}_{i,t+1}^{s\text{-}AFTER}\right\}^{\frac{1}{t-t_o}}} - \frac{\left\{\omega_{j,t+1}^{s\text{-}AFTER}\right\}^{\frac{1}{t-t_o}}}{\sum_{i=1}^{n}\left\{\omega_{i,t+1}^{s\text{-}AFTER}\right\}^{\frac{1}{t-t_o}}} \to_{p} 0$$
for all $j$. It follows that $\hat{\omega}_{j,t+1}^{s\text{-}AFTER} - \omega_{j,t+1}^{s\text{-}AFTER} \to_{p} 0$, where
4 This holds provided the following assumptions are satisfied: (i) $\sigma_{j,t}^{2} = \sigma_{t}^{2}$, (ii) the forecast errors $e_{j,t}$ have uniformly bounded fourth moments and $\hat{\sigma}_{j,t}\,\sigma_{t}^{-1}$ is bounded in probability, (iii) $\sigma_{t}^{2} - \sigma^{2} \to_{p} 0$ for some $\sigma^{2}$, and (iv) the $\hat{\sigma}_{t}$ estimation procedure is consistent (see Proposition 3 in Yang, 2004).
$$\omega_{j,t+1}^{s\text{-}AFTER} = \frac{\exp\left(-\frac{1}{2}\sum_{\tau=t_o+1}^{t}\frac{e_{j,\tau}^{2}}{\sigma_{\tau}^{2}}\right)}{\sum_{i=1}^{n}\exp\left(-\frac{1}{2}\sum_{\tau=t_o+1}^{t}\frac{e_{i,\tau}^{2}}{\sigma_{\tau}^{2}}\right)} \tag{2.3}$$
for all $j$. The above expression is almost identical to the s-AFTER algorithm defined by Yang (2004) for known homoscedastic conditional variance, except that in (2.3) the summation ranges from $t_o+1$ to $t$ rather than from 1 to $t$. The slight difference is due to the fact that the first $t_o$ observations have to be used to compute the initial estimate of $\sigma_{t}^{2}$.
When the forecast errors are asymptotically (conditionally) homoscedastic over time and across $j$, i.e., $\hat{\sigma}_{j,t}^{2} - \sigma^{2} \to_{p} 0$ for all $j$ as $t$ approaches infinity, it is trivial to see from (2.3) that s-AFTER acts like SA in sufficiently large samples. By contrast, the BG method requires only $\hat{\sigma}_{j,t}^{2} - \sigma_{t}^{2} \to_{p} 0$ for all $j$ to yield equal weights eventually.
The assumption of (asymptotically) conditional homoscedasticity over time and across $j$ may be too restrictive as it rules out many interesting cases. In general, the conditional variances $\sigma_{j,t}^{2}$ are expected to vary across $j$ (see, for example, Davies and Lahiri, 1995, and Lahiri and Sheng, 2010). Hence here we consider what is typical in the forecast combination literature and assume $\hat{\sigma}_{j,t}^{2} - \sigma_{j}^{2} \to_{p} 0$. This condition encompasses as a special case the assumption that the $e_{j,t}$ are weakly stationary and ergodic in second moments for all $j$. Under this condition, as $t \to \infty$, we have $\hat{\omega}_{j,t+1}^{s\text{-}AFTER} - \omega_{j,t+1}^{s\text{-}AFTER} \to_{p} 0$, where
$$\omega_{j,t+1}^{s\text{-}AFTER} = \frac{\sigma_{j}^{-(t-t_o)}\exp\left(-\frac{1}{2\sigma_{j}^{2}}\sum_{\tau=t_o+1}^{t}e_{j,\tau}^{2}\right)}{\sum_{i=1}^{n}\sigma_{i}^{-(t-t_o)}\exp\left(-\frac{1}{2\sigma_{i}^{2}}\sum_{\tau=t_o+1}^{t}e_{i,\tau}^{2}\right)} \tag{2.4}$$
which is the s-AFTER algorithm defined in Yang (2004) for known variances, in the sense that it allows for heteroskedasticity of the forecast errors. Since $\hat{\sigma}_{j,t}^{2} \to_{p} \sigma_{j}^{2}$ by assumption and $\frac{1}{t-t_o}\sum_{\tau=t_o+1}^{t}\left(e_{j,\tau}^{2} - \sigma_{j,\tau}^{2}\right) \to_{p} 0$ as $t - t_o \to \infty$, provided that the weak law of large numbers can be applied, for large $t_o$ and $t - t_o$, $\omega_{j,t+1}^{s\text{-}AFTER}$ may be well approximated by
$$\omega_{j,t+1}^{s\text{-}AFTER} = \frac{\sigma_{j}^{-(t-t_o)}}{\sum_{i=1}^{n}\sigma_{i}^{-(t-t_o)}}. \tag{2.5}$$
Combining (2.1) and (2.5) yields
$$\omega_{j,t+1}^{s\text{-}AFTER} = \frac{\left\{\omega_{j,t+1}^{BG}\right\}^{\frac{1}{2}(t-t_o)}}{\sum_{i=1}^{n}\left\{\omega_{i,t+1}^{BG}\right\}^{\frac{1}{2}(t-t_o)}}. \tag{2.6}$$
Therefore, under the assumption $\hat{\sigma}_{j,t}^{2} - \sigma_{j}^{2} \to_{p} 0$, $\omega_{j,t+1}^{s\text{-}AFTER}$ is a normalized power transform of $\omega_{j,t+1}^{BG}$. Since the exponent $\frac{1}{2}(t-t_o)$ exceeds one for $t > t_o + 2$, the s-AFTER algorithm would magnify the weight produced by BG when the forecaster can be distinguished from the rest of the forecasters. Conversely, it tends to discount the weight assigned by BG's method if that weight is below some data-dependent threshold level. This explains why s-AFTER would eventually pick the best forecaster if its past and most recent performance is well above the rest of the forecasters. However, we should note that the s-AFTER algorithm operates according to the dynamic and cross-sectional properties of the data as well.
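The power-transform relationship in (2.6) is easy to see numerically. The sketch below (our own illustration, with hypothetical BG weights) raises a weight vector to the power $\frac{1}{2}(t-t_o)$ and renormalizes; as $t - t_o$ grows, the largest BG weight is magnified toward one while the smaller weights are discounted toward zero.

```python
import numpy as np

def power_transform(bg_weights, t_minus_t0):
    """Normalized power transform of BG weights as in (2.6)."""
    w = np.asarray(bg_weights, dtype=float) ** (0.5 * t_minus_t0)
    return w / w.sum()

bg = np.array([0.40, 0.30, 0.20, 0.10])        # hypothetical BG weights
for horizon in (2, 10, 40):
    print(horizon, power_transform(bg, horizon).round(3))
# As t - t_o grows, nearly all weight concentrates on the forecaster with BG weight 0.40.
```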
Along the same line, Wei and Yang (2012) extended the AFTER algorithm by adopting the absolute error loss (L1 loss) and Huber loss functions and proposed two new algorithms (L1-AFTER and h-AFTER) for forecast combination. Since L1-AFTER and h-AFTER behave similarly, here we focus only on the L1-AFTER algorithm and examine its asymptotic relationship with the s-AFTER algorithm. According to Wei and Yang (2012), the weights of the L1-AFTER algorithm are estimated as
$$\hat{\omega}_{j,t+1}^{L_1\text{-}AFTER} = \frac{\prod_{\tau=t_o+1}^{t}\hat{s}_{j,\tau}^{-1}\exp\left(-\lambda\sum_{\tau=t_o+1}^{t}\frac{|e_{j,\tau}|}{\hat{s}_{j,\tau}}\right)}{\sum_{i=1}^{n}\prod_{\tau=t_o+1}^{t}\hat{s}_{i,\tau}^{-1}\exp\left(-\lambda\sum_{\tau=t_o+1}^{t}\frac{|e_{i,\tau}|}{\hat{s}_{i,\tau}}\right)} \quad \text{for } t \geq t_o + 1, \tag{2.7}$$
where $\lambda$ is a positive tuning parameter, and $\hat{s}_{j,\tau} = \frac{1}{\tau-1}\sum_{s=1}^{\tau-1}|e_{j,s}|$ or $\hat{s}_{j,\tau} = \hat{\sigma}_{j,\tau}$. Following Wei and Yang (2012), we set $\lambda = 1$ and let $\hat{s}_{j,\tau} = \frac{1}{\tau-1}\sum_{s=1}^{\tau-1}|e_{j,s}|$. Suppose $\hat{s}_{j,\tau} - E|e_{j}| \to_{p} 0$ for all $j$, which follows trivially if the $e_{j,t}$ are weakly stationary and ergodic in second moments for all $j$ (this assumption is sufficient but not necessary). Using an argument similar to that for the s-AFTER algorithm, we have $\hat{\omega}_{j,t+1}^{L_1\text{-}AFTER} - \omega_{j,t+1}^{L_1\text{-}AFTER} \to_{p} 0$, where
$$\omega_{j,t+1}^{L_1\text{-}AFTER} = \frac{\left\{E|e_{j}|\right\}^{-(t-t_o)}}{\sum_{i=1}^{n}\left\{E|e_{i}|\right\}^{-(t-t_o)}}. \tag{2.8}$$
By examining (2.5) and (2.8), we can see that $\omega_{j,t+1}^{L_1\text{-}AFTER}$ has an identical functional form to $\omega_{j,t+1}^{s\text{-}AFTER}$. These two AFTER-type algorithms differ only in their argument: $\sigma_{j}$ for s-AFTER and $E|e_{j}|$ for L1-AFTER, reflecting their respective loss functions (the squared loss and the absolute loss, since $\sigma_{j} = \sqrt{E e_{j}^{2}}$, provided $e_{j}$ is unbiased) used in building these algorithms. Some simple algebraic manipulation yields the following large-sample relationship between L1-AFTER and s-AFTER:
$$\omega_{j,t+1}^{L_1\text{-}AFTER} = \frac{\omega_{j,t+1}^{s\text{-}AFTER}\left\{1 + \sigma_{j}^{-1}\left(E|e_{j}| - \sigma_{j}\right)\right\}^{-(t-t_o)}}{\sum_{i=1}^{n}\omega_{i,t+1}^{s\text{-}AFTER}\left\{1 + \sigma_{i}^{-1}\left(E|e_{i}| - \sigma_{i}\right)\right\}^{-(t-t_o)}}. \tag{2.9}$$
It is obvious that the difference between $\omega_{j,t+1}^{L_1\text{-}AFTER}$ and $\omega_{j,t+1}^{s\text{-}AFTER}$ stems from the discrepancy between $E|e_{j}|$ and $\sigma_{j}$, or the difference between the absolute loss and the square root of the squared loss. Note that $E|e_{j}| - \sigma_{j} < 0$ by Jensen's inequality, provided that $e_{j}$ is unbiased and $\mathrm{Var}(|e_{j}|) \neq 0$. Hence $\left\{1 + \sigma_{j}^{-1}\left(E|e_{j}| - \sigma_{j}\right)\right\}^{-(t-t_o)} > 1$ works as a mediating factor to counteract the impact of a diminishing $\omega_{j,t+1}^{s\text{-}AFTER}$ due to an outlier, so that $\omega_{j,t+1}^{L_1\text{-}AFTER}$ is less sensitive in the presence of forecast outliers, ceteris paribus.
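For intuition, a companion sketch of the L1-AFTER weights in (2.7) with $\lambda = 1$ and the recursive scale estimate built from past absolute errors; again, the function name and the example data are ours.

```python
import numpy as np

def l1_after_weights(errors, t0, lam=1.0):
    """L1-AFTER weights as in (2.7) from a (t x n) matrix of forecast errors."""
    t, n = errors.shape
    log_w = np.zeros(n)
    for tau in range(t0, t):
        s_hat = np.mean(np.abs(errors[:tau]), axis=0)     # scale estimate from past absolute errors
        log_w += -np.log(s_hat) - lam * np.abs(errors[tau]) / s_hat
    log_w -= log_w.max()
    w = np.exp(log_w)
    return w / w.sum()

rng = np.random.default_rng(2)
e = rng.normal(scale=[0.5, 1.0, 2.0], size=(40, 3))
e[30, 0] = 6.0                       # a single outlier by the otherwise best forecaster
print(l1_after_weights(e, t0=10))    # forecaster 0 keeps most of the weight despite the outlier
```

Feeding the same error matrix to the s-AFTER sketch above illustrates the contrast discussed here: the squared-loss penalty for the single outlier is large enough to shift nearly all weight away from the best forecaster, whereas the absolute-loss penalty leaves that forecaster dominant.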
2.2. INCOMPARABILITY OF COMBINED FORECASTS IN
UNBALANCED PANELS
The discussions in the previous subsection pertaining to the
asymptotic relationships among SA, BG, s-AFTER and L1-AFTER are
presented in the context of balanced panels. But the majority of panel data
sets faced by empirical researchers are unbalanced in nature. The question
we explore in this subsection is what happens if empirical researchers
blindly apply various combination methodologies without properly
allowing for the unbalanced structure of forecast data at hand.
For simplicity, suppose that an analyst observes $n$ forecasters; the earliest time at which forecast data are available is $t = 1$, while the latest time at which forecast data are available is $t = T$. Due to entry and exit of forecast experts from time to time, the data are unbalanced in the sense that the forecast $y_{j,t}$ may not be available for some forecaster $j$ at some time $t$. Define $N_{t}^{A} = \{j: y_{j,t} \text{ is observed at time } t,\ j = 1,\dots,n\}$ and $N_{t}^{NA} = \{j: y_{j,t} \text{ is not available at } t,\ j = 1,\dots,n\}$. Assume also that there are $n_{t}$ observations at time $t$, which then implies that $n - n_{t}$ elements belong to $N_{t}^{NA}$. If the simple averaging method is applied only to observed data, then effectively the weights of SA in an unbalanced panel with $N_{t}^{A}$ and $N_{t}^{NA}$ ($t = 1,\dots,T$) are governed by
$$\hat{\omega}_{j,t+1}^{SA} = \begin{cases} 0, & y_{j,t+1} \text{ is not available}, \\ \dfrac{1}{n_{t+1}}, & \text{otherwise}, \end{cases}$$
from which we get the combined forecast by the SA method, $\hat{y}_{t+1}^{SA} = \sum_{j=1}^{n}\hat{\omega}_{j,t+1}^{SA}\,y_{j,t+1} = \frac{1}{n_{t+1}}\sum_{j\in N_{t+1}^{A}}y_{j,t+1}$. Now let $y_{j,t+1}^{*}$ denote the computed input for the unobserved $y_{j,t+1}$, and define $\hat{y}_{t+1}^{SA} = \frac{1}{n}\left(\sum_{j\in N_{t+1}^{A}}y_{j,t+1} + \sum_{j\in N_{t+1}^{NA}}y_{j,t+1}^{*}\right)$. Then, since $\sum_{j\in N_{t+1}^{A}}y_{j,t+1} = n_{t+1}\hat{y}_{t+1}^{SA}$, we can obtain $\frac{1}{n-n_{t+1}}\sum_{j\in N_{t+1}^{NA}}y_{j,t+1}^{*} = \hat{y}_{t+1}^{SA}$. This implies that blindly applying the simple average method without properly allowing for the unbalanced structure gives rise to imputed values $y_{j,t+1}^{*}$ ($j \in N_{t+1}^{NA}$) such that the simple average of the imputed values must be equal to the simple average of the observed data at time $t+1$. For example, if only one data point is not observed, then simple averaging simply fills in the blank cell with the simple average of the observed data from the other forecasters, without recognizing the idiosyncrasies of the particular forecaster.
Next, suppose Bates and Granger's (BG) method is applied to an unbalanced panel. Naturally, the weights of the BG method are defined by
$$\hat{\omega}_{j,t+1}^{BG} = \begin{cases} 0, & \text{if } y_{j,t+1} \text{ is not available}, \\ \hat{\sigma}_{j,t}^{-2}\Big/\sum_{i\in N_{t+1}^{A}}\hat{\sigma}_{i,t}^{-2}, & \text{otherwise}, \end{cases}$$
from which we have the combined forecast by the BG method, $\hat{y}_{t+1}^{BG} = \sum_{j=1}^{n}\hat{\omega}_{j,t+1}^{BG}\,y_{j,t+1} = \sum_{j\in N_{t+1}^{A}}\left(\hat{\sigma}_{j,t}^{-2}\Big/\sum_{i\in N_{t+1}^{A}}\hat{\sigma}_{i,t}^{-2}\right)y_{j,t+1}$. If we assume that $\hat{\sigma}_{j,t}^{-2}$ are available for all $j$ at time $t$ and that values $y_{j,t+1}^{*}$ are filled into the missing spots at time $t+1$, then $\hat{y}_{t+1}^{BG} = \sum_{j\in N_{t+1}^{A}}\left(\hat{\sigma}_{j,t}^{-2}\Big/\sum_{i=1}^{n}\hat{\sigma}_{i,t}^{-2}\right)y_{j,t+1} + \sum_{j\in N_{t+1}^{NA}}\left(\hat{\sigma}_{j,t}^{-2}\Big/\sum_{i=1}^{n}\hat{\sigma}_{i,t}^{-2}\right)y_{j,t+1}^{*}$. Hence, using the fact that $\hat{y}_{t+1}^{BG}\sum_{i\in N_{t+1}^{A}}\hat{\sigma}_{i,t}^{-2} = \sum_{j\in N_{t+1}^{A}}\hat{\sigma}_{j,t}^{-2}\,y_{j,t+1}$, we can obtain $\sum_{j\in N_{t+1}^{NA}}\left(\hat{\sigma}_{j,t}^{-2}\Big/\sum_{i\in N_{t+1}^{NA}}\hat{\sigma}_{i,t}^{-2}\right)y_{j,t+1}^{*} = \hat{y}_{t+1}^{BG}$, demonstrating that inadvertently applying the BG procedure to unbalanced panels would produce imputed values $y_{j,t+1}^{*}$ ($j\in N_{t+1}^{NA}$) for the missing data in such a way that the BG weighted average of the imputed values equals the BG weighted average of the observed data at time $t+1$. Note that for the imputed values, the weights are based either on prior knowledge or on estimates from past available data. Again, if only one data point is not observed, then the BG approach implicitly fills in the blank cell with the BG weighted average of the observed data.
Now suppose the s-AFTER algorithm is applied directly to an unbalanced data set, with its weights given by
$$\hat{\omega}_{j,t+1}^{s\text{-}AFTER} = \begin{cases} 0, & \text{if } y_{j,t+1} \text{ is not available}, \\ \dfrac{\prod_{\tau=t_o+1}^{t}\hat{\sigma}_{j,\tau}^{-1}\exp\left(-\frac{1}{2}\sum_{\tau=t_o+1}^{t}\frac{e_{j,\tau}^{2}}{\hat{\sigma}_{j,\tau}^{2}}\right)}{\sum_{i\in N_{t+1}^{A}}\prod_{\tau=t_o+1}^{t}\hat{\sigma}_{i,\tau}^{-1}\exp\left(-\frac{1}{2}\sum_{\tau=t_o+1}^{t}\frac{e_{i,\tau}^{2}}{\hat{\sigma}_{i,\tau}^{2}}\right)}, & \text{otherwise}. \end{cases}$$
Then a similar derivation shows that
$$\frac{\sum_{j\in N_{t+1}^{NA}}\prod_{\tau=t_o+1}^{t}\hat{\sigma}_{j,\tau}^{-1}\exp\left(-\frac{1}{2}\sum_{\tau=t_o+1}^{t}\frac{e_{j,\tau}^{2}}{\hat{\sigma}_{j,\tau}^{2}}\right)y_{j,t+1}^{*}}{\sum_{j\in N_{t+1}^{NA}}\prod_{\tau=t_o+1}^{t}\hat{\sigma}_{j,\tau}^{-1}\exp\left(-\frac{1}{2}\sum_{\tau=t_o+1}^{t}\frac{e_{j,\tau}^{2}}{\hat{\sigma}_{j,\tau}^{2}}\right)} = \hat{y}_{t+1}^{s\text{-}AFTER},$$
indicating that the s-AFTER algorithm applied to unbalanced panels generates implicit values $y_{j,t+1}^{*}$ ($j\in N_{t+1}^{NA}$) for the missing data in such a way that the s-AFTER weighted average of the imputed values equals its counterpart for the observed data at time $t+1$. Again, past data for individuals whose current forecasts are missing are assumed to be available. In particular, if only one data point is missing, then the s-AFTER algorithm imputes it with the s-AFTER weighted average of the observed data.
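The implicit-imputation argument can be verified with a small numerical example (ours, with made-up numbers): filling a single missing forecast with the SA or BG average of the observed forecasts leaves the corresponding combined forecast unchanged, so each method is effectively evaluated on its own filled-in panel.

```python
import numpy as np

# Hypothetical period-(t+1) forecasts from four experts; expert 3's forecast is missing.
y = np.array([2.1, 2.4, 1.9, np.nan])
sigma2_hat = np.array([0.4, 0.8, 1.2, 0.6])   # estimated error variances from past data
obs = ~np.isnan(y)

# Combined forecasts using the observed forecasts only
sa_obs = y[obs].mean()
w_obs = (1 / sigma2_hat[obs]) / (1 / sigma2_hat[obs]).sum()
bg_obs = w_obs @ y[obs]
print("SA implicitly imputes", round(sa_obs, 3), "; BG implicitly imputes", round(bg_obs, 3))

# Filling the gap with each method's implied value reproduces that method's combined forecast.
w_full = (1 / sigma2_hat) / (1 / sigma2_hat).sum()
print("SA on filled panel:", np.where(obs, y, sa_obs).mean())        # equals sa_obs
print("BG on filled panel:", w_full @ np.where(obs, y, bg_obs))      # equals bg_obs
```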
We can now safely conclude that SA, BG, and s-AFTER are not
comparable when directly applied to an unbalanced panel because they are
implicitly using different (balanced) data sets. This conclusion also holds
for other combination procedures in general. Consequently, existing
results on evaluating the performance of various combination methods
may be misleading when these results are applied directly to unbalanced
panels. At the very least, great care and caution must be taken to interpret
these empirical results. Finally, the results in this subsection suggest that
the issue of unbalanced panels must be addressed properly before
comparing combined forecasts by various procedures.
Facing the missing data problem, different authors have handled the
situation differently. Issler and Lima (2009) reduced a much larger
incomplete panel data set to a small (N=18, T=41) balanced set. Note that
this type of trimming also implies that implicitly the discarded values of
the original unbalanced panel are essentially replaced by the means of the
remaining observations for each time period while implementing SA, thus
reducing the heterogeneity of the individual forecasters and the scope of
performance-based combination methods. Capistrán and Timmermann (2009) used US SPF data over 1987-2006 that contained a huge amount of missing observations (see our Figure 1) and trimmed the data by requiring forecasters to have a minimum of 10 common contiguous observations.
They computed the root mean squared errors (RMSE) of 12 different
combining procedures relative to SA based on the trimmed unbalanced
panel. Our analysis suggests that strictly speaking these RMSE ratios are
not comparable. Poncela et al. (2011) use one-period-ahead forecasts from the US SPF over 1991-2008, and restrict the data to those individuals who have been on the panel for a minimum of 7 years and never missed more than four consecutive forecasts. The resulting incomplete panel is then filled in two ways: i) the missing 1-quarter-ahead forecasts are replaced by the 2- (or 3- or 4-, if the previous forecasts are missing) quarter-ahead forecasts, or ii) the missing forecast is replaced by that individual's historical mean forecast. Since the same imputation scheme is used for all procedures in their paper, the results are comparable. However, this scheme minimizes the commonality of individual forecasts and emphasizes the idiosyncratic component in the forecast data. The imputation scheme used by Genre et al. (2013) uses both an empirically determined fraction of forecaster j's previously observed deviation from the average forecast and the average forecast in period t, thus drawing information from both directions.
3. U.S. SPF DATA AND MISSING DATA ISSUES
3.1. SPF DATA AND VARIABLES
The data we use to evaluate the performance of a number of forecast
combination algorithms is the U.S. Survey of Professional Forecasters
(SPF). It is a high quality, long standing, and widely used quarterly survey
on macroeconomic forecasts in the United States. The survey was initially
conducted by the American Statistical Association (ASA) and the National
Bureau of Economic Research (NBER). Starting from 1990, the survey
was taken over by the Federal Reserve Bank of Philadelphia. This change in administration led to a unique missing data pattern and thus a challenge for empirical work.
From the 39 regularly surveyed variables in SPF, we select the growth
rate of real GDP (RGDP), seasonally adjusted annual rate of change for
GDP price deflator (PGDP), the CPI inflation rate (CPI), and the
seasonally adjusted quarterly average unemployment rate (UNEMP) as
our target variables. For each variable, we examine the forecasts made for
the current quarter and the following 3 quarters, starting from the fourth
quarter of 1968 (1968:IV) to the third quarter of 2011 (2011:III).
3.2. MISSING DATA AND IMPUTATIONS
As shown in the previous section, different forecast combination
methods, when applied to incomplete panels, implicitly impute the
missing forecasts differently. It is easy to see that the amount and pattern
of missing data directly determine the extent to which the comparison
results are affected. Figure 1 shows the missing data structure of the SPF
over its entire life5. A black square in the figure represents a data point;
and a blank spot represents a missing forecast. Strikingly, the amount of missing data far exceeds the amount of available data. Taking one-quarter-ahead PGDP forecasts as an example: a fully balanced panel with
425 forecasters from 1968:IV to 2011:III without missing data would have
73,100 data points. However, we have only 6,520 data points in this
period. This means that 91% of the data are missing! As for the pattern of
missing data, Figure 1 shows that before 1990, there were a large number
of forecasters whose forecasts started from the initial years around 1970.
Then, about half of the forecasters stopped forecasting mid-way while the
5 PGDP one-quarter-ahead forecasts are used to construct Figure 1. The amount and
pattern of missing data for other variables are similar.
rest kept forecasting until around 1990. Only six forecasters who joined
the survey in its early days remain in the sample until recently. On the
other hand, starting from 1990, many new forecasters joined the survey
every few years, and about half of them kept forecasting.
Based on these observations, we choose to construct two subsamples
instead of using the entire dataset as a whole. The first subsample includes
the initial years from 1968:IV to 1990:IV. The second subsample goes
from 2000:I to 2011:III. We can thus utilize the part of the sample where
data points are highly concentrated, and avoid the part that contains too
many missing data points.
We limit our attention to frequent forecasters β ones with sufficient
number of observed forecasts β to further reduce the amount of missing
data, such that in both subsamples, the amount of missing data is kept as
low as possible while still maintaining a reasonable number of
forecasters.6 Specifically, we require forecasters to have at least 45
forecasts in subsample 1 or at least 36 forecasts in subsample 2. As a
result, depending on variables and subsamples, around 15 forecasters
remain. On average, there are about 40% missing data for subsample 1 and
about 15% missing data for subsample 2.
To accurately measure the performance of different combination
methods in the incomplete panel, we impute the missing data explicitly.
Two imputation methods are considered. The first method gives imputed values as $y_{j,t}^{*} = \frac{1}{n_{t}}\sum_{i\in N_{t}^{A}}y_{i,t}$ if $j \in N_{t}^{NA}$, where $n_{t}$ is the total number of elements in $N_{t}^{A}$. This method replaces missing forecasts with the simple average of the non-missing forecasts for the same period, which is the
6 We observe no clear relationship between performance and participation. Capistrán and Timmermann (2009) and Genre et al. (2013) also reported a similar finding.
imputation implied when using simple average for combination. The
downside of this method is that it reduces the level of forecast dispersion,
which limits the combination algorithmsβ ability to distinguish good
performers from poor ones. Especially for the performance-based
methods, it would be more reasonable if imputed values reflect, at least
partially, the past performance of the forecaster.
Such concerns lead us to the second imputation method, based on Genre et al. (2013), where the imputed values are given by $y_{j,t}^{*} - \bar{y}_{t} = \hat{\beta}_{j}\left[\sum_{s=1}^{4}\left(y_{j,t-s} - \bar{y}_{t-s}\right)\right]$, where $\bar{y}_{t}$ is the mean forecast at time $t$.
Intuitively, a missing individual forecast is replaced by an adjusted mean
forecast for that period. The adjustment is made according to the recent
average deviation of the forecast made by that forecaster from the mean
forecasts. This method is superior, in principle, to the first method,
because the imputed value for a forecaster incorporates both the common
component and idiosyncrasy of that forecaster. In particular, if a forecaster
tends to produce forecasts that are far from the average, his or her imputed
forecasts would reflect that characteristic.
Note that both imputation methods have to be implemented in real
time just like the combination methods. This presents no problem for the
first imputation method. But for the second method, the excessive amount
of missing data even after imposing the participation requirement makes it
infeasible sometimes to estimate all the $\beta_{j}$'s in real time. When this is the case, we use the most recent estimate of $\beta_{j}$ when such an estimate is
available.
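A minimal sketch of this second imputation scheme as we read it: for each forecaster, $\beta_j$ is estimated by regressing the forecaster's deviation from the cross-section mean on the sum of its previous four deviations, and a missing forecast is then replaced by the period mean plus the fitted adjustment. For brevity the sketch estimates $\beta_j$ once on the full sample, whereas in the actual exercise the estimation is done in real time; the exact specification in Genre et al. (2013) may differ in details such as the window length.

```python
import numpy as np

def impute_deviation_based(panel):
    """Fill missing forecasts in a (T x n) panel; NaN marks a missing entry.

    For each forecaster j, beta_j is the no-intercept OLS slope of the current
    deviation from the cross-section mean on the sum of the previous four
    deviations; a missing y_{j,t} is replaced by mean_t + beta_j * (that sum).
    """
    T, n = panel.shape
    filled = panel.copy()
    mean_t = np.nanmean(panel, axis=1)            # cross-section mean forecast each period
    dev = panel - mean_t[:, None]                 # deviations from the mean (NaN where missing)
    for j in range(n):
        x, z = [], []
        for t in range(4, T):
            lagged = dev[t - 4:t, j]
            if not np.isnan(dev[t, j]) and not np.isnan(lagged).any():
                x.append(lagged.sum())
                z.append(dev[t, j])
        x, z = np.array(x), np.array(z)
        beta = (x @ z) / (x @ x) if x.size > 2 and (x @ x) > 0 else 0.0
        for t in range(4, T):
            lagged = dev[t - 4:t, j]
            if np.isnan(panel[t, j]) and not np.isnan(lagged).any():
                filled[t, j] = mean_t[t] + beta * lagged.sum()
    return filled

# Hypothetical usage: a small panel with a gap for forecaster 2 in period 10
rng = np.random.default_rng(3)
panel = rng.normal(2.0, 0.5, size=(20, 5))
panel[10, 2] = np.nan
print(impute_deviation_based(panel)[10])
```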
Note that missing data creates an additional challenge to estimate the
weights associated with the algorithms. However, it may be noted that the
on-line algorithms may have a relative advantage over BG in this regard.
First, as pointed out before, they do not need estimates of error covariances. In addition, the weights of s-AFTER defined in (2.2) can be written recursively as
$$\hat{\omega}_{j,t+1}^{s\text{-}AFTER} = \frac{\hat{\omega}_{j,t}^{s\text{-}AFTER}\,\hat{\sigma}_{j,t}^{-1}\exp\left(-\frac{e_{j,t}^{2}}{2\hat{\sigma}_{j,t}^{2}}\right)}{\sum_{i=1}^{n}\hat{\omega}_{i,t}^{s\text{-}AFTER}\,\hat{\sigma}_{i,t}^{-1}\exp\left(-\frac{e_{i,t}^{2}}{2\hat{\sigma}_{i,t}^{2}}\right)} \quad \text{for } t \geq t_o + 1, \tag{3.1}$$
from which it is evident that previous forecast errors, which affect $\hat{\omega}_{j,t+1}^{s\text{-}AFTER}$ through $\hat{\omega}_{j,t}^{s\text{-}AFTER}$, play an equally important role in determining $\hat{\omega}_{j,t+1}^{s\text{-}AFTER}$ as the latest forecast error. Indeed, the natural logarithm of $\hat{\omega}_{j,t+1}^{s\text{-}AFTER}$ behaves like a unit root process. Thus a large forecast error tends to have a permanent effect on the weights of s-AFTER. For our purposes, this may be an advantage, since the long-memory property of $\hat{\omega}_{j,t+1}^{s\text{-}AFTER}$ may help to alleviate the problem of missing data in an unbalanced panel, as the impact of past errors does not decay at all.
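The recursion in (3.1) amounts to a one-line multiplicative update; a minimal sketch follows (our own, with illustrative numbers). Because the new weight multiplies the old one, the log-weights accumulate every past error, which is the unit-root behavior noted above: a single large error depresses a forecaster's weight permanently.

```python
import numpy as np

def s_after_update(w_prev, e_t, sigma2_hat_t):
    """One step of the recursive s-AFTER update in (3.1).

    w_prev:       weights used at time t (length n, summing to one)
    e_t:          forecast errors observed at time t
    sigma2_hat_t: current variance estimates for each forecaster
    """
    w = w_prev * np.exp(-0.5 * e_t ** 2 / sigma2_hat_t) / np.sqrt(sigma2_hat_t)
    return w / w.sum()

# A single large error permanently depresses that forecaster's weight.
w = np.full(3, 1 / 3)
sigma2 = np.array([0.3, 0.3, 0.3])
for e_t in [np.array([0.1, 0.2, -0.1]),
            np.array([2.5, 0.1, 0.2]),     # forecaster 0 makes a big mistake
            np.array([0.1, 0.1, -0.2])]:
    w = s_after_update(w, e_t, sigma2)
    print(w.round(3))
```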
4. PERFORMANCE OF COMBINATION METHODS
4.1. MEASURING THE PERFORMANCES OF COMBINATION
METHODS
To thoroughly evaluate the performance of the new combination
methods like the AFTER algorithms and the machine learning algorithm
MLS, we conduct a real time forecast combination exercise using several
popular existing methods in addition to the new methods. In addition to
the simple average method (SA), we consider Bates and Granger's method
(BG), as well as median (ME), recent best (RB), and trimmed mean (TM)
methods. SA is such that the combined forecast is the simple (equally
weighted) average of individual forecasts. ME method uses the median of
individual forecasts as the combined forecast. RB is the combined forecast
that is set to be the forecast made by the individual forecaster who enjoys
the best past performance as measured by MSE as of last period. TM
selects the mean of the pool of individual forecasts after the maximum and
the minimum forecasts are removed.7 The BG method and the AFTER algorithms are implemented as detailed in Section 2.
The MLS method is implemented according to Algorithm 1 in Sancetta (2010). The core step in the algorithm is to compute the current-period weight (before shrinkage) $w_{j,t+1}^{MLS'}$ for each individual forecaster, based on this individual's previous-period weight $w_{j,t}^{MLS}$ and the current-period loss $L_{t}(w_{t}^{MLS})$. Let $\nabla L_{t}(w_{t}^{MLS})$ be the gradient of the loss function with respect to the (previous-period) weight $w_{t}^{MLS}$, and $\nabla_{j}L_{t}(w_{t}^{MLS})$ be its $j$th element. The current-period weight is calculated as
$$w_{j,t+1}^{MLS'} = \frac{w_{j,t}^{MLS}\exp\left[-\eta\,t^{-\alpha}\,\nabla_{j}L_{t}(w_{t}^{MLS})\right]}{\sum_{i=1}^{n}w_{i,t}^{MLS}\exp\left[-\eta\,t^{-\alpha}\,\nabla_{i}L_{t}(w_{t}^{MLS})\right]},$$
where $\eta$ is the learning rate parameter, and $\alpha$ is a parameter that controls the speed of learning. In the final shrinkage step, which gives the current-period weight used for combination, $w_{j,t+1}^{MLS}$, all the $w_{j,t+1}^{MLS'}$ that are lower than a predetermined small threshold ($\gamma/n$, which is controlled by the parameter $\gamma$) are replaced by the threshold value $\gamma/n$, and the remaining weights are rescaled such that all weights add up to 1.
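A minimal sketch of the two-step update just described, written for a squared loss on the combined forecast; the parameter names ($\eta$, $\alpha$, $\gamma$) follow the text, while the function name, the squared-loss gradient, and the exact form of the rescaling after the floor are our reading of Algorithm 1 in Sancetta (2010) and should be checked against the original.

```python
import numpy as np

def mls_update(w_prev, forecasts_t, actual_t, t, eta=0.5, alpha=0.5, gamma=0.05):
    """One period of the MLS weight update with shrinkage.

    Step 1: exponentiated-gradient update using the gradient of the squared
            loss of the combined forecast with respect to the weights.
    Step 2: shrinkage -- weights below gamma/n are raised to gamma/n and the
            remaining weights rescaled so that all weights sum to one.
    """
    n = w_prev.size
    combined = w_prev @ forecasts_t
    grad = -2.0 * forecasts_t * (actual_t - combined)       # d/dw_j of (y - w'f)^2
    w = w_prev * np.exp(-eta * t ** (-alpha) * grad)
    w = w / w.sum()

    floor = gamma / n
    low = w < floor
    w[low] = floor
    w[~low] *= (1.0 - low.sum() * floor) / w[~low].sum()    # rescale the rest to sum to one
    return w

w = np.full(4, 0.25)
print(mls_update(w, forecasts_t=np.array([2.0, 2.3, 1.8, 2.6]), actual_t=2.1, t=5))
```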
In the MLS method, the gradient of the loss function $\nabla L_{t}(w_{t}^{MLS})$, together with the learning rate $\eta$ controlled by the power parameter $\alpha$, is used in the first update to generate the ex post combination weights, which are then projected onto a pre-specified subset in the second update (shrinkage) to ensure that all the weights are bounded by some threshold constraint. The MLS, like the BG method, aims to achieve the best forecast combination.
7 We have also considered the Winsorized mean method, where the top and bottom 5% of individual forecasts are trimmed and replaced by the remaining forecasts closest to the trimmed ones at both ends. Note that winsorization maintains the variability of individual forecasts more than trimming. We do not report results associated with the Winsorized mean because they were similar to TM.
As the simple average often provides a very good benchmark, we compare the performance of other combination methods against that of the simple average method using the relative MSE measure, cf. Genre et al. (2013).
For any combination method, the relative MSE is the ratio between the
MSE of the combined forecasts produced by that method and the MSE of
the averages of the individual forecasts. Since the statistical significance
of Diebold-Mariano (1995)-type tests of equal forecast accuracy across
different methods is not directly comparable, we do not report them here, cf. Capistrán and Timmermann (2009). The relative MSE provides
information on the relative forecast accuracy that is independent of the
absolute accuracy (i.e., the actual MSEs). The latter often vary greatly
depending on the variable, horizon, and sample periods. Therefore, it is
entirely possible that in certain cases, even the method with relatively
better performance produces poor forecasts. Apparently, combining such
forecasts is of no practical use, and comparisons like this are completely
spurious. Therefore, while still reporting the comparisons for longer
horizon forecasts, we focus our analysis on current-quarter and one-
quarter-ahead forecasts. As we carry out the forecast combination
exercises in real time, following now a standard practice in forecast
evaluation, we use the first vintage (initial release) of a variable as actual
values when calculating the MSEs.
4.2. COMPARISON OF THE ACCURACY OF COMBINED
FORECASTS
Comparison of alternative combination methods with special
reference to the on-line algorithms is one of the main objectives of this
study. We implement the above-discussed algorithms in real time on the
(filled-in) balanced panel and compare their performances. The MLS
algorithm is implemented with the exponent in the learning rate $\alpha = 0.5$, and the parameter that specifies the amount of shrinkage $\gamma = 0.05$, as chosen by Sancetta (2010) in implementing the algorithm. The learning rate parameter $\eta$ is chosen ex post from values in the set {0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99, 1} based on the performance of the combined forecasts.8 Table 1 gives the MSEs of the
combined forecasts for CPI Inflation, PGDP, RGDP and UNEMP
produced by different combination methods.9 Figure 2 shows, in terms of
bar charts, the MSEs for PGDP and UNEMP relative to SA based on what
are reported in Table 1.Thus, a vertical bar less than one means the method
under consideration is superior to SA. We find that despite producing
combined forecasts that are inferior to SA at times, most of the
performance-based combination methods, especially the newly developed
AFTER algorithms and machine learning algorithm (MLS), provide
considerable improvements in generating PGDP forecasts in the first
period (1968:IV-1990:IV) and in generating UNEMP forecasts in the second
8 For choosing the optimum learning rate, Sancetta (2010) proposes two methods in addition to choosing the learning rate ex post. But in most of our cases, the performance of the MLS algorithm is found to be insensitive to the choice of learning rate, similar to Sancetta (2010, p. 613).
9 In what follows, we report only the results from using the second imputation method.
Also, we omit results related to the median, trimmed mean, and Winsorized mean
combination methods because they did not contribute any additional insight to our
findings in the paper.
period (2000:I-2011:III). For PGDP, these combination methods perform
better in subsample 1, presumably because the volatility and heterogeneity
in individual forecasters were more substantial in the early 1980s. For
current-quarter forecasts, both the MLS and the AFTER algorithms
produce combined forecasts with MSEs lower than 80% of that of the
simple average forecasts. Improvements in forecast accuracies are noted
for one-quarter and two-quarter horizons as well with decreasing efficacy.
MSE of the current-quarter forecasts produced by the MLS algorithm, the
best performer for this subsample, is 1.250, far less than the benchmark
MSE of 1.687.
For UNEMP, the performance-based algorithms contribute the most
to forecast accuracy in the later subsample 2, when the unemployment rate
was drastically and unexpectedly affected by the most recent recession
beginning in 2007:IV. The combined forecasts produced by the L1-AFTER algorithm have about 10% lower MSE than the simple average for all four
horizons. The MLS algorithm also provides slightly more accurate
forecasts for all four horizons. The h-AFTER and the s-AFTER algorithms
noticeably outperform the benchmark at one-quarter and two-quarter
horizons.
Now looking at Table 1, we find that the MSEs for CPI inflation
associated with MLS, s-AFTER, L1-AFTER and h-AFTER for the current
quarter forecasts during the second period (i.e., 2000:I-2011:III) are less
than those for SA by substantial margins. For RGDP, none of the
alternative combining procedures shows clear-cut superiority over SA. We
also note that for all variables and in both subsamples, BG and MLS
methods never produce combined forecasts that are much less accurate
than the simple average, while the AFTER algorithms sometimes show
inferior performance. Overall, as the horizon increases, the contribution of the combination methods relative to the simple average becomes smaller, and the combined forecasts are often inferior to SA.
Since L1-AFTER and h-AFTER are derived using loss functions other
than the squared loss, it is necessary to compare their performance with
our simple average benchmark using appropriate loss functions. Table 2
provides such a comparison in two panels corresponding to Huber and
absolute losses for PGDP and UNEMP. As expected, the losses, when
evaluated under the appropriate loss, are smaller than those reported in
Table 1 under squared loss.10
More appropriately, in Figure 3 we present the MAE and Huber losses for L1-AFTER and h-AFTER respectively, normalized by the corresponding SA losses. The lengths of these bars in Figure 3 are mostly smaller than those of the corresponding bars in Figure 2. However, the differences are not large and do not alter our conclusions.
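For completeness, the alternative losses underlying Table 2 can be computed as in the sketch below; the Huber threshold $\delta$ is a tuning constant whose value is not specified in the text, so the value in the sketch is only a placeholder.

```python
import numpy as np

def absolute_loss(actual, combined):
    """Mean absolute error of the combined forecasts."""
    return np.mean(np.abs(actual - combined))

def huber_loss(actual, combined, delta=1.0):
    """Mean Huber loss: quadratic for small errors, linear beyond delta."""
    e = np.abs(actual - combined)
    return np.mean(np.where(e <= delta, 0.5 * e ** 2, delta * (e - 0.5 * delta)))

actual = np.array([2.0, 1.5, 3.2, 0.8])
combined = np.array([1.8, 1.9, 2.0, 1.0])
print(absolute_loss(actual, combined), huber_loss(actual, combined))
```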
4.3. IMPUTATION METHOD AND MISSING DATA RE-EXAMINED
The second imputation method discussed above preserves an
individualβs tendency to make forecasts that deviate from the cross-section
mean. This benefit can only be realized when the forecasters in our sample
do indeed have such a tendency, i.e., idiosyncratic biases. By regressing
the deviation of individual forecasts from the contemporaneous mean on
lagged deviations, we can check to what extent such a tendency exists.
These are reported in Table 3. For each variable and each subsample, we
report the number of regressions we ran, as well as the average proportion
10 The only exceptions are a few L1-AFTER figures for UNEMP current-quarter and 1-quarter-ahead forecasts, where the errors are very small fractions, so that their squares become much smaller than their absolute values.
of these regressions in which the past deviation is statistically significant
across all forecasters.
For PGDP in subsample 1, past deviations are significant in explaining future deviations in only 9% of the regressions when current-quarter forecasts are examined, but for three-quarter-ahead forecasts, past deviations are significant in 34% of the regressions. For PGDP in subsample 2, past deviations are significant in a considerably larger proportion of the regressions at all horizons, increasing from 22% to as high as 65%. The situation is the opposite for UNEMP, where in subsample 2, past deviations are significant in a smaller number of regressions compared to subsample 1. Still, as the horizon increases, past deviations for UNEMP become significant in a larger percentage of cases.
However, note that the difference between the missing forecasts
imputed by the second imputation method and those imputed by simple
average is generally rather small. Even for some of the forecasters whose
forecasts consistently deviate from the mean, since the deviations or
$\hat{\beta}_{j}$ may be small, the imputed values are often close to the mean. A similar
result was reported in Genre et al. (2013).
5. BEHAVIOR OF SELECTED FORECAST
COMBINATION ALGORITHMS
5.1. A CLOSER LOOK AT S-AFTER, MLS, AND BG METHOD
The results in the previous section clearly show the advantage of the
newly developed AFTER and MLS methods in certain cases. It is
therefore particularly interesting and informative to compare their
behavior to that of the familiar BG method. This comparison is presented
in Figures 4, 5 and 6. In each figure, from top to bottom, we show
individual forecastersβ squared forecast errors, the evolution of individual
forecastersβ cumulative MSEs, the weights estimated using the BG
method, the weights estimated using the s-AFTER (or MLS) method, as
well as squared errors of the combined forecasts produced by the two
methods.
Figure 4 compares the MLS method with the BG method using
current-quarter forecasts of PGDP for subsample 1. In this case, the MLS
method performs better than the BG method with 25% lower MSE. As the
individual squared errors and MSEs show, individual performances are
rather stable and clearly heterogeneous, with the exception of a few
quarters in the beginning. The BG method produces stable weights after
the first year or two, essentially weighting most of the forecasters equally
around 8%. This is the so-called portfolio diversification logic of BG
emphasized by Timmermann (2006). On the contrary, the MLS method
puts extremely high weights on the best forecaster who shows persistently
good performance. In the beginning of 1978, the previously identified best forecaster showed a small uptick in MSE, which led to a drastic down-weighting of this forecaster, with the second best picking up most of the reallocated weight. The weight assigned by the MLS method dropped by nearly 50%, while the weight assigned to this forecaster by the BG method dropped very little. As shown in the
comparison of the squared errors of the combined forecasts, the squared
error of BG combined forecasts are significantly larger than that of the
MLS combined forecasts. A similar event happened again in early 1981,
where the MLS combined forecasts showed a smaller error. In addition,
we note that starting from 1978, after the deterioration in the performance
of the previous top forecaster, the MLS method gave almost equal weights
to the two best forecasters, until after 1981 when the previous top
forecaster re-established his/her edge. During this period, no significant
change happened to the weights assigned to these two forecasters by the
BG method.
Figure 5 compares the s-AFTER method with the BG method using
the one-quarter-ahead forecasts of UNEMP for subsample 2. In this case,
apart from the relatively large errors seen after 2007, for most of the
quarters, individual forecast errors are small and individual performances
are clearly heterogeneous, especially for the best and worst forecasters.
This persistence in ranking is similar to the evidence in Aiolfi and
Timmermann (2006). Both the BG and the s-AFTER methods successfully
identify the best forecaster and assign relatively high weight. Still, the
weights assigned by the BG method are around 10%, while the weights assigned by the s-AFTER method vary from 20% to 95%. In two
quarters around early 2008 and early 2009, the best forecaster made
relatively big mistakes that led to notable increase in the MSE. Similar to
the behavior of the MLS method in the previous case, the s-AFTER
method drastically decreased the weights as a result of the performance hit: about a 40% decrease in weight in early 2008 and about a 20% decrease
in early 2009. Yang (2004) has emphasized this property of the s-AFTER
algorithm wherein a small error by a very good established forecaster
produces drastic weight adjustment. Aggressively weighting the good
forecaster and penalizing poor performance, s-AFTER displayed superior
performance with more than 20% reduction in MSE compared to the
benchmark. This is consistent with the theoretical relationship between s-
AFTER and BG derived in Section 2, equation (2.6). Note that the superior
performance of the s-AFTER method for this period, compared to the BG
method, comes mostly because of smaller errors during the period from
late 2008 to early 2009.
From the above two cases, we see that s-AFTER and MLS methods
behave more aggressively in adjusting the weights than the familiar BG
method.
This makes the algorithms adapt to changes in individual
forecasters' performances and adjust their weights in a speedy manner, so that changes in performance are quickly reflected in the weights. However, if
the changes in performance do not persist into the future, the adjustments
made by these algorithms may even worsen the situation. For example,
during periods with high volatility, a poor forecaster may produce a highly
accurate forecast purely by chance rather than due to forecasting skill.
When a change in performance is less likely to persist or uncertain, the
weights should arguably be adjusted cautiously rather than aggressively. A
psychological support for this logic can be found in Denrell and Fang
(2010). From a diversification perspective, aggressively adjusting weights
creates an increased amount of risk. If a structural break happens or a top
forecaster happens to behave poorly in one period, the combined forecasts
may suffer a huge unexpected loss. Figure 6 provides such an example. In
forecasting current quarter RGDP, the best forecaster, forecaster 44, who
was receiving nearly 90% of the weight from s-AFTER, made a big
mistake in 1979:IV. Even though immediately after this mistake, weight
assigned to this forecaster by s-AFTER method dropped to below 1%,
there was no chance to avoid a big forecast error in that period. This
mistake alone made the s-AFTER inferior to BG on average over the whole sample, even though for all other quarters in the sample the two forecasts are very close. Interestingly, after the 1979:IV
mistake, forecaster 44 vanished completely from combining even though
his/her performance continued to be in the middle range. Unlike the
model-based forecasts that tend to be more stable and rank preserving, the
relatively large psychological component in these expert survey forecasts
makes s-AFTER-type algorithms susceptible to such outliers.
In order to see how outliers are accommodated in L1-AFTER, we reexamine Figure 5, where the dynamics of BG and s-AFTER were discussed in forecasting UNEMP during 2000:I-2011:III. In Figure 7 we present the individual MAEs, the L1-AFTER weights, and the squared errors of the L1-AFTER forecasts during the same episode. Despite the huge forecast errors during 2008-09, whereas s-AFTER hesitated with its prime forecaster by scaling down the weight temporarily (see Figure 5), L1-AFTER downplayed the importance of the initial error and steadily increased the weight of the original best forecaster. The result was that in the post-2008 period, L1-AFTER had significantly smaller forecast errors than both BG and s-AFTER. Thus, this example clearly illustrates how, compared to s-AFTER, L1-AFTER dealt with and minimized the importance of the forecast outliers generated by the latest recession.
5.2. REQUIREMENTS FOR SUCCESSFUL COMBINATIONS
Based on the results in the previous sections, we are able to identify
the following conditions, under which the performance-based weighting
algorithms are likely to outperform the simple average method.
Firstly, it is crucial that performances of individual forecasters are
relatively stable over time and rank preserving. A stricter requirement is
that past performance of a forecaster is a good predictor of this
forecasterβs future performance. This condition is necessary because the
performance based weighting methods rely on past performance to predict
future performance. In our experiments, we found the predictive power to
be very low, and the widespread incidence of missing forecasts in
subjective survey forecasts like the SPF data make the value of the
combination weights even more tenuous.
Secondly, in order to produce a series of optimally-combined
forecasts that outperforms simple benchmarks, the differences in
forecastersβ performances should be sufficiently large together with widely
different correlations in forecast errors between forecasters. Much of these
requirements have been discussed in Aiolfi and Timmermann (2006), and
have been corroborated in the burgeoning psychological literature.11 However, they assume special significance while analyzing the
performance of aggressive on-line combination algorithms facing many
missing forecasts.12
Thirdly, for weighting methods that generously weight the best
forecaster based on past performance, e.g., the AFTER methods, it is
necessary that the best forecaster does not make big mistakes. Otherwise,
such sparse combining would produce combined forecasts that suffer
greatly from such mistakes.
6. CONCLUSIONS
This study focuses on the performance and behavior of the newly
developed AFTER and the MLS methods for forecast combination in
unbalanced panels. For monitoring large surveys like the SPF or Blue Chip forecasts, these on-line algorithms can be automated such that learning and adaptation to the time-varying relative usefulness of forecasters, old and new, can take place without user intervention. Our aim here is
not to run a horse race among alternative combining schemes, but rather to
11 See, for example, Larrick and Soll (2006, 2009), Yaniv and Milyavsky (2007), Vul and Pashler (2008), and Herzog and Hertwig (2009).
12 Elliott (2011) finds that simple averaging will be optimal if the row sums of the covariance matrix of forecast errors are equal for each row.
understand the conditions under which alternative forecast combination
algorithms can work well when compared against the simple equally weighted average.
To have a better understanding of how these alternative algorithms
work, we first establish the asymptotic relationship between the new s-
AFTER algorithm and the familiar Bates and Granger procedure. We find
that under the assumption that the conditional variance of the forecast
errors for each forecaster converges to the same value for all forecasters,
both the s-AFTER and BG methods operate in a way similar to the simple
average scheme. However, when heterogeneity in variances is present,
there is a simple nonlinear relationship between the two methods. s-
AFTER algorithm magnifies the weight assigned to a forecaster by BG
method, if the forecaster can be distinguished from other forecasters by
good past performance. On the other hand, s-AFTER reduces the weight
drastically towards zero if the performance is sufficiently below some data
dependent threshold level. Our empirical findings using SPF data illustrate
this theoretical result. In many cases, when using the on-line algorithms,
only a few top forecasters are given nontrivially positive weight. As a
result, this approach has the advantage that it does not require the
estimation of a large number of weight parameters.
We then show that when implementing different forecast combination
methods on unbalanced panels, each method implicitly imputes the
missing forecasts differently. This makes the performance of the combined
forecasts produced by different combination algorithms incomparable. To
address this issue, we explicitly impute missing forecasts using a
regression method that incorporates individual idiosyncrasies as well as
the average forecast of others, and use the same data to evaluate
alternative combination methods.
Furthermore, we evaluate these newly developed forecast
combination algorithms and examine in detail the inner mechanics
characterizing the algorithms. The empirical evidence confirms our
analytical results on the behavior of the combination algorithms. Our
results suggest that these robust on-line algorithms help to reduce the MSE
of the combined forecasts, when persistent forecaster heterogeneity and
outliers are prevalent in the forecast data. This is achieved mostly because
the algorithms are very agile in adapting to recent changes in individual
performances and weighting good forecasters aggressively. We find that
the on-line algorithms tend to perform well at shorter horizons, especially
when the other algorithms fail due to volatility clustering, structural
breaks, and outliers. In particular, the performances of individual
forecasters need to be sufficiently persistent and heterogeneous for the
newly developed pattern recognition and machine learning algorithms to
deliver maximum improvement in forecast accuracy. Unfortunately,
situations in which these conditions will prevail are difficult to determine
a priori. Thus, on balance, our evidence suggests that the simple un-
weighted average continues to be a dependable combination method in
summarizing survey data of forecasts provided by large panels with
frequent entry and exit of experts.
REFERENCES
Aiolfi, M., Timmermann, A. (2006). Persistence in forecasting performance and
conditional combination strategies. Journal of Econometrics 135:31-53.
Altavilla, C., De Grauwe, P. (2010). Forecasting and combining competing models of
exchange rate determination. Applied Economics 42:3455-3480.
Bates, J., Granger, C. W. J. (1969). The combination of forecasts. Operational Research Quarterly 20:451-468.
Capistrán, C., Timmermann, A. (2009). Forecast combination with entry and exit of experts. Journal of Business and Economic Statistics 27(4):428-440.
Davies, A., Lahiri, K. (1995). A new framework for analyzing survey forecasts using
three-dimensional panel data. Journal of Econometrics 68:205-227.
Denrell, J., Fang, C. (2010). Predicting the next big thing: Success as a signal of poor
judgment. Management Science 56(10):1653-1667.
Diebold, F. X., Mariano, R. S. (1995). Comparing predictive accuracy. Journal of
Business and Economic Statistics 13:253-263.
Elliott, G. (2011). Averaging and the optimal combination of forecasts. Unpublished
manuscript.
Fan, S., Chen, L., Lee, W. J. (2008). Short-term load forecasting using comprehensive
combination based on multi-meteorological information. Industrial and
Commercial Power Systems Technical Conference. ICPS. IEEE/IAS.
Genre, V., Kenny, G., Meyler, A., Timmermann, A. (2013). Combining expert forecasts:
can anything beat the simple average? International Journal of Forecasting
29(1):108-121.
Herzog, S. M., Hertwig, R. (2009). The wisdom of many in one mind: Improving individual judgments with dialectical bootstrapping. Psychological Science 20:231-237.
Inoue, A., Kilian, L. (2008). How useful is bagging in forecasting economic time series?
A case study of U.S. consumer price inflation. Journal of the American Statistical
Association 103(482):511-522.
Issler, J., Lima, L. (2009). A panel data approach to economic forecasting: The bias-
corrected average forecast. Journal of Econometrics 152(2):153-164.
Kang, H. (1986). Unstable weights in the combination of forecasts. Management Science 32(6):683-695.
Lahiri, K., Sheng, X. (2010). Measuring forecast uncertainty by disagreement: The missing link. Journal of Applied Econometrics 25(4):514-538.
Larrick, R. P., Soll, J. B. (2006). Intuitions about combining opinions: Misappreciation of the averaging principle. Management Science 52:111-127.
Leung, G., Barron, A. R. (2006). Information theory and mixing least-squares regressions. IEEE Transactions on Information Theory 52(8):3396-3410.
Poncela, P., Rodriguez, J., Sanchez-Mangas, R., Senra, E. (2011). Forecast combination
through dimension reduction techniques. International Journal of Forecasting
27:224-237.
Rapach, D. E., Strauss, J. K. (2005). Forecasting employment growth in Missouri with many potentially relevant predictors: An analysis of forecast combination methods. Federal Reserve Bank of St. Louis Regional Economic Development 1(1):97-112.
Rapach, D. E., Strauss, J. K. (2007). Forecasting real housing price growth in the Eighth District states. Federal Reserve Bank of St. Louis Regional Economic Development 3(2):33-42.
Sancetta, A. (2010). Recursive forecast combination for dependent heterogeneous data.
Econometric Theory 26:598-631.
Sánchez, I. (2008). Adaptive combination of forecasts with application to wind energy. International Journal of Forecasting 24:679-693.
Schmidt, P. (1977). Estimation of seemingly unrelated regressions with unequal numbers
of observations. Journal of Econometrics 5:365-377.
Smith, J., Wallis, K.F. (2009). A Simple Explanation of the Forecast Combination Puzzle.
Oxford Bulletin of Economics and Statistics 71:331-355.
Soll, J. B., Larrick, R. P. (2009). Strategies for revising judgment: How (and how well) people use others' opinion. Journal of Experimental Psychology: Learning, Memory, and Cognition 35:780-805.
Stock, J.H., Watson, M.W. (2004). Combination forecasts of output growth in a seven-
country data set. Journal of Forecasting 23:405-430.
Timmermann, A. (2006). Forecast combinations. In: Elliott, G., Granger, C. W. J., Timmermann, A. (Eds.), Handbook of Economic Forecasting. Elsevier Press.
Vovk, V.G. (1990). Aggregating strategies. Proceedings of the third annual workshop on
computational learning theory. Morgan Kaufmann Publishers Inc., Rochester, New
York, United States, 371-386.
Vul, E., Pashler, H. (2008). Measuring the crowd within: Probabilistic representations within individuals. Psychological Science 19:645-647.
Wei, Y., Yang, Y. (2012). Robust forecast combinations. Journal of Econometrics 166(2):224-236.
Yang, Y. (2004). Combining forecasting procedures: some theoretical results.
Econometric Theory 20:176-222.
Yaniv, I., Milyavsky, M. (2007). Using advice from multiple sources to revise and
improve judgments. Organizational Behavior and Human Decision Processes
103:104-120.
Zou, H., Yang, Y. (2004). Combining time series models for forecasting. International
Journal of Forecasting 20:69-84.
Table 1. MSEs of forecasts made by different combination methods
Method      Current Quarter Forecasts      1-Quarter Ahead Forecasts      2-Quarter Ahead Forecasts      3-Quarter Ahead Forecasts
            1968:IV to   2000:I to         1968:IV to   2000:I to         1968:IV to   2000:I to         1968:IV to   2000:I to
            1990:IV      2011:III          1990:IV      2011:III          1990:IV      2011:III          1990:IV      2011:III
CPI Inflation (CPI)
BG 3.579 10.973 11.576 11.402
ME
3.831
10.867
11.819
11.420
RB
2.724
12.382
12.037
11.429
SA
3.870
10.994
11.513
11.313
TM
3.787
10.915
11.520
11.502
MLS
2.887
11.017
11.507
9.156
L1-AFTER
2.429
13.033
12.355
11.219
h-AFTER
2.617
12.207
12.269
11.879
s-AFTER 2.659 12.172 12.264 11.799
GDP Price Deflator Inflation (PGDP)
BG 1.605 1.013 3.403 1.018 4.212 1.061 4.840 1.103
ME 1.639 1.140 3.412 1.019 4.202 1.081 4.778 1.229
RB 1.494 1.047 3.140 1.821 4.706 1.616 6.255 1.434
SA 1.687 1.044 3.481 1.039 4.286 1.058 4.869 1.080
TM 1.650 1.079 3.430 1.020 4.254 1.056 4.821 1.114
MLS 1.250 1.042 3.358 1.036 4.251 0.978 4.872 1.083
L1-AFTER 1.291 0.993 3.129 1.263 4.483 1.323 7.521 1.474
h-AFTER 1.325 1.005 3.142 1.281 4.151 1.224 5.491 1.393
s-AFTER 1.326 1.016 3.135 1.288 4.149 1.213 5.460 1.386
Real GDP Growth (RGDP)
BG 5.890 1.534 11.657 4.025 14.696 6.687 18.320 8.498
ME 6.100 1.559 11.231 4.075 14.541 6.724 17.801 8.505
RB 9.335 1.715 14.222 4.319 16.067 7.262 22.689 7.514
SA 6.038 1.524 11.374 4.005 14.662 6.669 17.971 8.494
TM 5.982 1.517 11.457 3.973 14.471 6.716 17.503 8.482
MLS 6.096 1.526 11.416 4.019 14.701 6.683 18.170 8.508
L1-AFTER 8.300 1.580 12.520 4.591 17.370 6.993 21.615 8.656
h-AFTER 7.350 1.514 12.096 4.199 15.069 7.443 21.300 8.666
s-AFTER 7.297 1.516 12.173 4.189 15.022 7.630 21.543 8.738
Unemployment Rate (UNEMP)
BG 0.049 0.027 0.241 0.182 0.490 0.529 0.812 1.192
ME 0.047 0.028 0.245 0.188 0.494 0.544 0.810 1.204
RB 0.051 0.033 0.227 0.132 0.501 0.484 0.793 1.101
SA 0.050 0.029 0.242 0.188 0.496 0.546 0.822 1.222
TM 0.050 0.027 0.238 0.186 0.504 0.545 0.814 1.205
MLS 0.049 0.028 0.242 0.185 0.496 0.541 0.822 1.219
L1-AFTER 0.053 0.025 0.240 0.144 0.492 0.486 0.812 1.056
h-AFTER 0.051 0.028 0.248 0.150 0.486 0.497 0.798 1.279
s-AFTER 0.052 0.029 0.250 0.147 0.486 0.497 0.801 1.287
* Numbers in bold denote DM test rejection at 10%, i.e., the method performs significantly better than SA.
Table 2. Performance of h-AFTER and L1-AFTER using alternative loss functions
Panel I. Performance of h-AFTER and SA benchmark measured by Huber loss
Method      Current Quarter Forecasts      1-Quarter Ahead Forecasts      2-Quarter Ahead Forecasts      3-Quarter Ahead Forecasts
            1968:IV to   2000:I to         1968:IV to   2000:I to         1968:IV to   2000:I to         1968:IV to   2000:I to
            1990:IV      2011:III          1990:IV      2011:III          1990:IV      2011:III          1990:IV      2011:III
GDP Price Deflator Inflation (PGDP)
SA 1.272 0.872 1.959 0.888 2.277 0.916 2.527 0.945
h-AFTER 1.000 0.822 1.755 0.961 2.239 0.995 2.702 1.138
Unemployment Rate (UNEMP)
SA 0.050 0.029 0.235 0.188 0.435 0.454 0.665 0.829
h-AFTER 0.051 0.028 0.242 0.150 0.438 0.420 0.657 0.846
Panel II. Performance of L1-AFTER and SA benchmark measured by MAE
Method      Current Quarter Forecasts      1-Quarter Ahead Forecasts      2-Quarter Ahead Forecasts      3-Quarter Ahead Forecasts
            1968:IV to   2000:I to         1968:IV to   2000:I to         1968:IV to   2000:I to         1968:IV to   2000:I to
            1990:IV      2011:III          1990:IV      2011:III          1990:IV      2011:III          1990:IV      2011:III
GDP Price Deflator Inflation (PGDP)
SA 1.019 0.828 1.402 0.868 1.575 0.889 1.711 0.891
L1-AFTER 0.899 0.784 1.265 0.841 1.581 0.971 2.039 1.015
Unemployment Rate (UNEMP)
SA 0.176 0.127 0.381 0.298 0.523 0.504 0.675 0.716
L1-AFTER 0.164 0.124 0.377 0.277 0.533 0.494 0.692 0.711
* L1-AFTER implemented using dji.
Table 3. Percentage of significant imputation regressions at each horizon
Variable   Subsample   Number of Regressions   Current-Quarter Forecasts   One-Quarter Ahead Forecasts   Two-Quarter Ahead Forecasts   Three-Quarter Ahead Forecasts
PGDP 1 88 0.09 0.23 0.28 0.34
2 25 0.22 0.64 0.75 0.65
UNEMP 1 88 0.26 0.52 0.68 0.75
2 25 0.23 0.44 0.47 0.53
* Number of Regressions: total number of imputation regressions run (one for each forecaster) in each subsample. Remaining columns: percentage of significant regressions at each horizon.
Figure 1. Overview of data patterns
This figure shows the overall data patterns for all forecasters, using PGDP one-quarter ahead forecasts as an example since the patterns are similar for all other variables and horizons. A dot represents an available forecast; a blank represents a missing observation.
[Figure body: horizontal axis shows the survey date (1970q4 to 2010q4); vertical axis shows the forecaster ID (up to 300); Subsample 1 covers 1968:IV to 1990:IV and Subsample 2 covers 2000:I to 2011:III.]
Figure 2. Relative forecast performance of alternative combination methods
Performance reported in this set of figures is MSE of the method of interest relative to MSE of simple average
method. Each group of bars represents one method (denoted under the group). The four bars in each group represent
current quarter forecasts to 3-quarter ahead forecasts (left to right).
GDP Price Deflator Inflation (PGDP)
Subsample 1: 1968:IV to 1990:IV Subsample 2: 2000:I to 2011:III
Unemployment Rate (UNEMP)
Subsample 1: 1968:IV to 1990:IV Subsample 2: 2000:I to 2011:III
[Figure body: four bar-chart panels; the vertical axis, MSE relative to SA, runs from 0.6 to 1.2; each panel contains bars for BG, RB, MLS, L1-AFTER, h-AFTER, and s-AFTER.]
Figure 3. L1-AFTER and h-AFTER Evaluated Using Appropriate Loss Functions
Performance reported in this set of figures is MAE (for L-AFTER) or Huber loss (for h-AFTER) relative to the
respective loss of simple average method. The four bars in each group of bars represent current quarter forecasts to
3-quarter ahead forecasts (left to right). The name of the variable, method, and subsample are denoted under the
bars.
[Figure body: bar chart; the vertical axis, loss relative to SA, runs from 0.6 to 1.2; bar groups for PGDP and UNEMP appear under h-AFTER and L-AFTER for Subsample 1 and Subsample 2.]
Figure 4. Evolution of weights and performance of individual forecasters - PGDP
GDP Price Deflator Inflation (PGDP), subsample 1, current-quarter forecasts
[Figure body: panels show, from top to bottom, individual squared errors, individual MSEs, BG weights, MLS weights, and the squared errors of the BG and MLS combined forecasts, plotted from 1974 to 1990.]
Figure 5. Evolution of weights and performance of individual forecasters - UNEMP
Unemployment Rate (UNEMP), subsample 2, one-quarter ahead forecasts
[Figure body: panels show, from top to bottom, individual squared errors, individual MSEs, BG weights, s-AFTER weights, and the squared errors of the s-AFTER and BG combined forecasts, plotted quarterly from 2005 to 2011.]
Figure 6. Reacting to an Outlier - Behavior of the BG and s-AFTER methods
Real GDP (RGDP), subsample 1, current-quarter forecasts. In the top four panels, the solid line represents forecaster 44 and the dashed line represents forecaster 65. The outlier is forecaster 44's forecast for the fourth quarter of 1974.
[Figure body: panels show, from top to bottom, individual squared errors, individual MSEs, BG weights, s-AFTER weights, and the squared errors of the BG and s-AFTER combined forecasts, plotted from 1974 to 1990.]
Figure 7. Reacting to a Structural Break - Behavior of the L-AFTER method
Unemployment rate (UNEMP), subsample 2, one-quarter-ahead forecasts. Shown from top to bottom are individual
MAEs, weights assigned to individual forecasters by L-AFTER method, and the squared errors of the combined
forecasts produced by L-AFTER method. Behavior of s-AFTER and BG methods for this case can be seen in
subfigure 2 of Figure 3. The individual (forecaster 483) receiving the highest weight is the same in BG, s-AFTER,
and L-AFTER.
[Figure body: panels show, from top to bottom, individual MAEs, L-AFTER weights, and the squared errors of the combined forecasts produced by L-AFTER, plotted quarterly from 2005 to 2011.]