MONTHLY WEATHER REVIEW, VOL. , NO. , PAGES 1–31,
Assessing the Ensemble Spread-Error
Relationship
T. M. Hopson
Research Applications Laboratory, National Center for Atmospheric
Research, Boulder, Colorado, USA
T. M. Hopson, RAL - NCAR, P. O. Box 3000, Boulder, Colorado 80307-3000, USA.
Abstract.
With the increased utilization of ensemble forecasts in weather and hydro-
logic applications, there is a need for verification tools to test their benefit
over less expensive deterministic forecasts. This paper examines the ensem-
ble spread-error relationship, beginning with the ability of the Pearson cor-
relation to verify a forecast system’s capacity to represent its own varying
forecast error. Considering only perfect model conditions, this work theoret-
ically extends the results from previous numerical studies showing the cor-
relation’s diagnostic limitations: it can never reach its maximum value of one;
its theoretical asymptotic value depends on the specific definition of spread
and error used, ranging from 0 up to either 1/√3 or √(2/π);
and, perhaps most fatal to its utility, its theoretical limits depend on the vary-
ing stability properties of the physical system being modeled.
Building from this, we argue there are two aspects of an ensemble’s dispersion
that should be assessed. First, and perhaps more fundamentally: is there
enough variability in the ensemble’s dispersion to justify the maintenance of
an expensive ensemble prediction system (EPS), irrespective of whether the
EPS is well-calibrated or not? To diagnose this, the factor that controls the
theoretical upper limit of the spread-error correlation can be useful. Secondly,
does the variable dispersion of an ensemble relate to variable expectation of
forecast error? Representing the spread-error correlation in relation to its the-
oretical limit can provide a simple diagnostic of this attribute. A context for
these concepts is provided by assessing two operational ensembles: Western
US temperature forecasts and Brahmaputra River flow.
1. Introduction
The development of ensemble weather, climate, and hydrologic forecasting has brought
new opportunities to provide significant economic and humanitarian benefit over a single
"best guess" forecast (Richardson 2000, Zhu et al. 2002, Palmer 2002, among others).
One potentially significant if not fundamental attribute of an ensemble prediction system
(EPS) is its ability to forecast its own expected forecast error. This is accomplished if
the EPS provides an accurate expectation of its temporally-varying errors through its
temporally-varying ensemble dispersion (Molteni et al. 1996, Toth and Kalnay 1997,
Houtekamer et al. 1996, Toth et al. 2003, Zhu et al. 2002, Hopson and Webster 2010).
Given that one would expect that larger ensemble dispersion implies more uncertainty
in the forecast ensemble mean or in any one ensemble member (likewise, smaller disper-
sion implying less uncertainty), many past approaches have used the Pearson correlation
coefficient as a diagnostic for this potential EPS property by linearly-correlating differing
measures of ensemble spread with differing measures of forecast error. However, the con-
clusions drawn from the use of this metric have often been ambiguous in many of these
studies (Barker 1991; Molteni et al. 1996; Buizza 1997; Scherrer et al. 2004).
Houtekamer (1993), Whitaker and Loughe (1998), and Grimit and Mass (2007) have
investigated why linear correlation may not be a conclusive metric, primarily in the context of a statistical model presented originally by Kruizinga and Kok (1988; hereafter
"KK"). The above authors’ analyses were done in the context of an EPS perfect forecast
assumption, one in which the underlying probability distribution function (PDF) of the
forecast error is known, and individual ensemble members represent random draws from
this distribution, with the ensemble spread providing a measure of the expected forecast
error. Note the distinction between "perfect forecast" and "EPS perfect forecast" assumptions: the former being when the forecast is identical to the future observation; the
latter being when the distribution of the EPS ensembles is statistically indistinguishable
from the forecast error PDF. In the context of the KK model, these authors showed that
even for a perfect EPS, the correlation between skill and spread need not be statistically
significant, with the magnitude of the linear correlation depending on the day-to-day vari-
ability of spread: for verification data where there is large temporal variation in ensemble
spreads, the correlation between spread and skill is at a maximum (but less than one),
and in regions where the ensemble spread is more temporally uniform, the correlation is
at a minimum. Grimit and Mass (2007) also numerically assessed the behavior of the
spread-error correlation with the same KK model in the context of differing continuous
and categorical spread and error metrics, and for ensemble systems of finite size, showing
additional dependencies of the spread-skill correlation on these additional factors.
Although conducted in the context of one particular statistical model (i.e. KK), the
general conclusion one could draw from these analyses is that the linear correlation is defi-
cient as a verification measure by virtue of its dependence on factors other than exclusive
properties of EPS forecast performance. One purpose of this current paper is to elaborate
on and generalize this last point further by presenting some of these dependencies from a
more theoretical framework for continuous spread and error measures. Among the dependencies that can affect the spread-error correlation, many studies assessing the forecast
spread-skill correlation used differing definitions and combinations of measures represent-
ing spread and skill. It is not clear how these different combinations of measures affect
the theoretical limits of the correlation, and therefore how these studies might interrelate.
Here we calculate some of the theoretical limits of the correlation for different spread and
error combinations, which we argue provide two generalizable metrics to test the utility
of an EPS’s ability to provide ensemble members with varying dispersion.
In section 2 we start by presenting some of the possible continuous error and spread
measures, arguing that only certain combinations of these spread and error metrics are
dimensionally well-matched and should be used in conjunction. Later in the section we
provide explicit calculations for theoretical simplifications on the linear correlation for four
different matched spread-skill metrics. For this we also utilize the EPS perfect forecast
assumption with no sampling limitations, but do not rely on a particular functional form
for the distribution of ensemble spread. In section 3 we discuss the results of section 2’s
calculations, showing how the theoretical asymptotic limits of the spread-skill correlation
can vary greatly depending on which spread-skill metrics are used, and providing the re-
sults for the KK model as one particular case study. In section 4, we discuss two metrics
for assessing the utility of an ensemble’s temporally varying dispersion, which themselves
were generalized from the analysis provided in section 2. In section 5, we place our analysis in
the context of two particular EPS examples of spread and error using ensemble temper-
ature forecasts for a region of southwest USA, and ensemble river discharge forecasts for
Bangladesh.
2. Calculations
In this section we present calculations to simplify the linear correlation for four pairings
of continuous error and spread metrics. The purpose of these calculations is to simplify
these theoretical correlations to a point where the mathematical form of the asymptotic
limits become clear, as well as the dependencies dictating these limits. It is assumed
there are no sampling limitations and that the EPS perfect forecast assumption holds,
such that for a given forecast, there is an underlying PDF from which both individual
ensemble members and the associated observable (verification) are randomly drawn. As a
result, the expected error of an ensemble forecast is completely determined by this PDF,
and the theoretical form of error-spread correlation reduces to only the PDF moments.
To make these simplifications, without loss of generality (WLOG) we can introduce in
the equation for the Pearson correlation coefficient a calculation to replace the forecast
error with its expected value; and in the case of an EPS perfect forecast, the domain of
this calculation over all errors is equivalent to the forecast ensemble member PDF. This
replaces the error with its expected value, proportional to a measure of ensemble spread.
Similarly, WLOG, expectation value operations over all possible ensemble members are
also introduced.
2.1. Notation
The population of members of an ensemble forecast is represented by Ψ, with an indi-
vidual member (realization) represented by ψ. Similarly, for some measure of spread s,
we represent the population of ensemble forecasts, each with a value of s, as Σ. Consider
that Ψ could be viewed as the underlying (implied) PDF of an ensemble forecast at a par-
ticular time from which the ensemble members are randomly drawn. Likewise, Σ could
be viewed as representing the whole set of ensemble forecasts, each with an identifiable
value of associated ensemble spread, over all the times forecasts are generated.
Bra-ket expectation value notation is used for the expectation value of some quantity
A = A(ψ) over an ensemble population Ψ, which could be in terms of discrete variables
with probability density function P(ψ)

〈A(ψ)〉Ψ ≡ ∑ψ A(ψ)P(ψ),   (1)

or in terms of continuous variables with associated probability density function f(ψ)

〈A(ψ)〉Ψ ≡ ∫Ψ A(ψ)f(ψ) dψ.   (2)
The subscript (Ψ) on the brackets (〈·〉) specifies the population domain over which the
expectation is calculated. Similarly, we define the expectation value of A = A(s) over a
population of forecasts, each with defined ensemble spread s, as 〈A〉Σ, and we represent
the double expectation value of A = A(ψ, s) over both populations Ψ and Σ as 〈A〉Ψ,Σ.
In terms of expectation values, the Pearson correlation coefficient between a generic
spread (s) and error (ε) measure is given by
r = 〈(s − 〈s〉Σ)(ε − 〈ε〉Σ)〉Σ / [〈(s − 〈s〉Σ)²〉Σ〈(ε − 〈ε〉Σ)²〉Σ]^(1/2),   (3)
where the population domain over which the expectation (average) is calculated is the set
of ensemble forecasts Σ (with associated spread measure s). For further simplifications
as we will show below, for a given ensemble forecast with some measure of spread s, an
average can also be made over the possible realizations of the observable 〈·〉Ψo ; or over the
population of ensemble members Ψ(s) given by 〈·〉Ψ. Note by our perfect model definition,
〈·〉Ψo ≡ 〈·〉Ψ.
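As a concrete aside (a sketch of ours, not part of the paper), the finite-sample analogue of (3) can be computed directly from per-forecast series of spread and error; the series below are synthetic:

```python
import numpy as np

def spread_error_correlation(spread, error):
    """Sample Pearson correlation between per-forecast spread and error,
    the finite-sample analogue of the population expression in (3)."""
    s = np.asarray(spread, dtype=float)
    e = np.asarray(error, dtype=float)
    s_anom = s - s.mean()
    e_anom = e - e.mean()
    return (s_anom * e_anom).mean() / np.sqrt(
        (s_anom**2).mean() * (e_anom**2).mean())

# Synthetic example: error tends to grow with spread, plus noise
rng = np.random.default_rng(0)
spread = rng.lognormal(size=500)
error = spread * np.abs(rng.normal(size=500))
r = spread_error_correlation(spread, error)  # positive, but below 1
```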
2.2. Spread-error measures
The forecast member spread is often defined as the variance, standard deviation, mean
absolute difference of the ensemble members about the ensemble mean, or less commonly,
mean absolute difference of the ensemble members about a chosen ensemble member. In
addition, we include the 4th moment of the ensemble members about the mean, which
arises in the calculations. The forecast error of an ensemble forecast is often defined in
terms of the squared or absolute difference between the verification (observation) and
either any one ensemble member or the ensemble mean forecast. Symbolic notation for
these measures are given in Tables 1 and 2, respectively.
Arguably only certain of these error and spread measures are appropriately matched if
one wants to directly relate expected error to a measure of ensemble spread. Measures
that are naturally paired have a direct functional relationship relating forecast error to
forecast spread, and have the same moments (physical units). Of the measures presented
here, these pairings are: 1) the set of squared error measures with the variance as spread
measure; and 2) the set of absolute difference error measures with either the standard
deviation or mean absolute difference as spread measure. Although other error and spread
measures could also be used (e.g. rank probability skill score) to assess the forecast
spread-error relationship, arguably the useful information in the ensemble spread is that
it should be a statement about the expected error in the forecast, and these error and
spread measures directly make this connection.
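For illustration only (these helper functions are ours, not the paper's), the spread measures of Table 1 and error measures of Table 2 could be computed for a single ensemble forecast as:

```python
import numpy as np

# Sketch implementations of the spread measures (Table 1) and error
# measures (Table 2) for one ensemble forecast; key names are ours.
def spread_measures(members):
    mu = members.mean()
    return {
        "variance": ((members - mu)**2).mean(),          # ensemble variance
        "std_dev": np.sqrt(((members - mu)**2).mean()),  # standard deviation
        "mean_abs_diff": np.abs(members - mu).mean(),    # sabs about the mean
    }

def error_measures(members, obs):
    mu = members.mean()
    return {
        "sq_err_mean": (mu - obs)**2,           # squared error of ensemble mean
        "abs_err_mean": abs(mu - obs),          # absolute error of ensemble mean
        "sq_err_member": (members[0] - obs)**2, # squared error of one member
    }

forecast = np.array([19.8, 21.1, 20.4, 22.0, 20.7, 21.5])  # hypothetical members
observation = 21.3
s = spread_measures(forecast)
e = error_measures(forecast, observation)
```

Note how each error measure shares units (or squared units) with exactly one class of spread measure, which is the pairing argument made above.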
For reference, Table 3 shows how the expected values of the error measures ε (column
1) can be given in terms of measures of forecast spread s (column 2) for an EPS per-
fect forecast (i.e. one in which the observation ψo is equivalent to a random draw from
the forecast ensemble member PDF). These relationships are used in the calculations
below. WLOG, these relationships were derived by introducing an expectation value op-
eration over all possible observational states, and in some cases, over all possible ensemble
members. Column 3 of this table shows how the expected value of error corresponds to
either the standard deviation σψ or variance σ²ψ when the forecast ensembles are normally
distributed.
Figure 1 provides a schematic of the correlation coefficient simplification calculation.
Shown are six-member ensemble forecasts of a continuous variable ψ for three different
forecast times. The ensemble members are represented by the six thin black vertical lines,
with the implied PDF p(ψ; si), from which the members are sampled, given by the bell-
shaped curves. The PDF represents the forecast in the asymptotic limit of no sampling
limitations. The observations corresponding to the forecasts are shown by the vertical red
lines, with the ensemble mean given by the dashed vertical lines. Some measure of error
ε (shown here as a distance the observation is from the ensemble mean) for each forecast
is also shown, as is some measure of ensemble member spread s. In our calculations to
simplify the correlation between spread s and error ε, we replace the error by its expected
value, which can be calculated by performing a weighted integration of the observation
over all possible values. The result is that the expected value is proportional to a measure
of ensemble member spread:
〈ε〉Ψi = ∫Ψi(s) ε p(ψ; si) dψ ∝ si.   (4)
In practice, p(ψ; si) does not have to be explicitly given, and the relationship of the
expected value of the error to a measure of ensemble member spread can be shown either
through algebraic manipulation or by inspection (see Table 3 for examples).
In this example, the expected value of the error over all forecasts then is proportional
to:
〈ε〉Σ ∝ (1/n) ∑i si.   (5)
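A quick Monte Carlo check of this replacement (a sketch of ours, under the EPS perfect forecast assumption): for a Gaussian forecast PDF, the expected absolute error of the ensemble mean is √(2/π)σ, consistent with the normal-distribution entries of Table 3:

```python
import numpy as np

# Monte Carlo illustration of the replacement in (4), Gaussian case: if the
# observation is a random draw from a normal forecast PDF with mean mu and
# standard deviation sigma, the expected absolute error of the ensemble
# mean, E|mu - psi_o|, is sqrt(2/pi) * sigma.
rng = np.random.default_rng(1)
mu, sigma = 2.0, 1.5
psi_o = rng.normal(mu, sigma, size=1_000_000)  # observations drawn from the PDF
mc_estimate = np.abs(mu - psi_o).mean()
theory = np.sqrt(2.0 / np.pi) * sigma
```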
2.3. Correlation of sabs with ε|µ| and the correlation of σ²ψ with εµ²
In this section we simplify the correlations for two specific cases: 1) (sabs, ε|µ|) and 2)
(σ²ψ, εµ²). As seen in Table 3, these pairings are especially well matched since (for an EPS
perfect forecast) the expectation value of the error measure is the spread measure itself
(〈ε〉Ψo = s).
Left in terms of a generic ε and s for these two sets of spread-error measures, WLOG we
can introduce into (3) an expectation value 〈·〉Ψo over all possible states of the observation
within each expectation value of error 〈ε〉Σ over the population of forecasts Σ

r = 〈(s − 〈s〉Σ,Ψo)(ε − 〈ε〉Σ,Ψo)〉Σ,Ψo / [〈(s − 〈s〉Σ,Ψo)²〉Σ,Ψo〈(ε − 〈ε〉Σ,Ψo)²〉Σ,Ψo]^(1/2).   (6)
Noting that 〈s〉Ψo = 〈s〉Ψ = s and expanding,

r = 〈(s − 〈s〉Σ)(〈ε〉Ψo − 〈ε〉Σ,Ψo)〉Σ / [〈(s − 〈s〉Σ)²〉Σ〈(ε − 〈ε〉Σ,Ψo)²〉Σ,Ψo]^(1/2),   (7)

and using 〈ε〉Ψo = 〈ε〉Ψ = s,

r = 〈(s − 〈s〉Σ)(s − 〈s〉Σ)〉Σ / [〈(s − 〈s〉Σ)²〉Σ〈(ε − 〈s〉Σ)²〉Σ,Ψ]^(1/2),   (8)

so the correlation coefficient further simplifies to

r = √[(〈s²〉Σ − 〈s〉²Σ) / (〈ε²〉Σ,Ψ − 〈s〉²Σ)].   (9)
To simplify things further, we return to the specific metrics of cases 1) and 2). Simplifying for case 1), we have 〈ε²|µ|〉Ψo ≡ 〈|〈ψ〉Ψ − ψo|²〉Ψo = 〈(〈ψ〉Ψ − ψ)²〉Ψ ≡ σ²ψ by definition.
And for case 2), we have 〈εµ²〉Ψo ≡ 〈(〈ψ〉Ψ − ψo)²〉Ψo = 〈(〈ψ〉Ψ − ψ)²〉Ψ ≡ σ²ψ, again by
definition. In addition, for case 2), 〈ε²µ²〉Ψo ≡ 〈(〈ψ〉Ψ − ψo)⁴〉Ψo = 〈(〈ψ〉Ψ − ψ)⁴〉Ψ ≡ m4,
where m4 is the 4th moment about the mean 〈ψ〉Ψ defined in Table 1. Substituting into
(9) for cases 1) and 2) we have

r = √[(〈s²abs〉Σ − 〈sabs〉²Σ) / (〈σ²ψ〉Σ − 〈sabs〉²Σ)],   (10)

and

r = √[(〈(σ²ψ)²〉Σ − 〈σ²ψ〉²Σ) / (〈m4〉Σ − 〈σ²ψ〉²Σ)],   (11)

respectively, which are now dependent only on the moments of the ensemble member
spread.
To simplify (10) and (11) further, we would need to impose a requirement on the
distribution of the ensemble members holding for all forecasts, and specific to each case.
These requirements are: for case 1), sabs = βσψ; for case 2), m4 = α(σ²ψ)², where α and β
are constants determined by the PDF of the ensemble distribution. Note that normally-
distributed ensemble members satisfy the requirements for both of these cases, where for
case 1) β = √(2/π), and for case 2) α = 3.
Imposing these requirements on sabs (case 1) and on m4 (case 2), (10) and (11) become

r = β√[(1 − 〈σψ〉²Σ/〈σ²ψ〉Σ) / (1 − β²〈σψ〉²Σ/〈σ²ψ〉Σ)]   (12)

and

r = √[(1 − 〈σ²ψ〉²Σ/〈(σ²ψ)²〉Σ) / (α − 〈σ²ψ〉²Σ/〈(σ²ψ)²〉Σ)],   (13)

respectively.
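The simplified form (12) can be checked by simulation (our own sketch, under the stated assumptions: Gaussian members in the large-ensemble limit, so sabs = βσψ exactly, with σψ lognormally distributed across forecasts):

```python
import numpy as np

# Monte Carlo check of (12) under the EPS perfect forecast assumption:
# Gaussian ensemble members in the large-ensemble limit (sabs = beta*sigma,
# beta = sqrt(2/pi)), with sigma lognormally distributed across forecasts.
rng = np.random.default_rng(2)
n_forecasts = 400_000
sigma = rng.lognormal(mean=0.0, sigma=0.5, size=n_forecasts)
beta = np.sqrt(2.0 / np.pi)

spread = beta * sigma                    # sabs for each forecast
error = np.abs(rng.normal(0.0, sigma))   # eps_|mu| = |<psi> - psi_o|, mean 0

r_empirical = np.corrcoef(spread, error)[0, 1]

g1 = sigma.mean()**2 / (sigma**2).mean()  # governing ratio <sigma>^2/<sigma^2>
r_theory = beta * np.sqrt((1.0 - g1) / (1.0 - beta**2 * g1))
```

The empirical correlation lands on the theoretical curve, and stays below β = √(2/π) no matter how the spread distribution is chosen.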
2.4. Correlation of σ²ψ with εd²
For the case of (σ²ψ, εd²), we have

r = 〈(σ²ψ − 〈σ²ψ〉Σ)(〈εd²〉Ψo,Ψ − 〈εd²〉Σ,Ψo,Ψ)〉Σ / [〈(σ²ψ − 〈σ²ψ〉Σ)²〉Σ〈(εd² − 〈εd²〉Σ,Ψo,Ψ)²〉Σ,Ψo,Ψ]^(1/2),   (14)
where, WLOG, we have introduced an additional expectation value operation (〈·〉Ψ) over
the population of ensemble members (Ψ) (performed for each forecast, with specific σ²ψ
value). This was done in addition to the expectation value operation (〈·〉Ψo) over the
observation population (Ψo) introduced in the previous calculation.
Under the EPS perfect forecast assumption, we have 〈εd²〉Ψo,Ψ = 2(〈ψ²〉Ψ − 〈ψ〉²Ψ) = 2σ²ψ,
and the numerator simplifies to 2[〈(σ²ψ)²〉Σ − 〈σ²ψ〉²Σ]. Similarly, the denominator simplifies
to [(〈(σ²ψ)²〉Σ − 〈σ²ψ〉²Σ)(〈ε²d²〉Σ,Ψo,Ψ − 4〈σ²ψ〉²Σ)]^(1/2). Again, using the EPS perfect forecast
assumption, 〈ε²d²〉Ψo,Ψ ≡ 〈(ψ − ψo)⁴〉Ψo,Ψ = 2〈(ψ − 〈ψ〉Ψ)⁴〉Ψ + 6〈(ψ − 〈ψ〉Ψ)²〉²Ψ = 2m4 +
6(σ²ψ)². Putting this together, (14) simplifies to
r = √[(〈(σ²ψ)²〉Σ − 〈σ²ψ〉²Σ) / (〈m4〉Σ/2 + 3〈(σ²ψ)²〉Σ/2 − 〈σ²ψ〉²Σ)],   (15)

and the correlation coefficient is now given only in terms of the moments of the ensemble
member spread.
To simplify the relationship further, we would need to impose a requirement on the
distribution of the ensemble members holding for all forecasts. As done in the previous
section, if we impose m4 = α(σ²ψ)², where α is a proportionality constant, then substituting
for m4 in the denominator, combining, and simplifying, we get

r = √[(〈(σ²ψ)²〉Σ − 〈σ²ψ〉²Σ) / ((α + 3)〈(σ²ψ)²〉Σ/2 − 〈σ²ψ〉²Σ)].   (16)
For normally distributed ensembles α = 3, and we derive the same result as given in the
previous section for (σ²ψ, εµ²) (case 2).
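The two identities used above can be verified numerically (an illustrative sketch, not part of the paper): for independent draws ψ and ψo from the same Gaussian PDF, E[(ψ − ψo)²] = 2σ² and E[(ψ − ψo)⁴] = 2m4 + 6σ⁴ = 12σ⁴:

```python
import numpy as np

# Numerical check of the identities used in simplifying (14): under the EPS
# perfect forecast assumption, psi and psi_o are independent draws from the
# same PDF, so E[(psi - psi_o)^2] = 2*sigma^2 and, for Gaussian members,
# E[(psi - psi_o)^4] = 2*m4 + 6*sigma^4 = 12*sigma^4 (since m4 = 3*sigma^4).
rng = np.random.default_rng(3)
sigma, n = 1.3, 1_000_000
psi = rng.normal(0.0, sigma, size=n)     # an ensemble member
psi_o = rng.normal(0.0, sigma, size=n)   # the observation, an independent draw
d2 = (psi - psi_o)**2
mean_d2 = d2.mean()        # ~ 2 * sigma**2
mean_d4 = (d2**2).mean()   # ~ 12 * sigma**4
```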
2.5. Correlation of σψ and ε|µ|
Finally, we consider the case of (σψ, ε|µ|), given by

r = 〈(σψ − 〈σψ〉Σ)(ε|µ| − 〈ε|µ|〉Σ)〉Σ / [〈(σψ − 〈σψ〉Σ)²〉Σ〈(ε|µ| − 〈ε|µ|〉Σ)²〉Σ]^(1/2).   (17)
To simplify this expression, we expand the denominator noting that ε|µ|·ε|µ| = εµ², WLOG
introduce an expectation value operation over the possible observational states (〈·〉Ψo), and
use 〈ε|µ|〉Ψo ≡ 〈|〈ψ〉Ψ − ψo|〉Ψo = sabs and 〈εµ²〉Ψo ≡ 〈(〈ψ〉Ψ − ψo)²〉Ψo = σ²ψ by the EPS
perfect forecast assumption. Doing so, (17) simplifies to

r = 〈(σψ − 〈σψ〉Σ)(sabs − 〈sabs〉Σ)〉Σ / [(〈σ²ψ〉Σ − 〈σψ〉²Σ)(〈σ²ψ〉Σ − 〈sabs〉²Σ)]^(1/2),   (18)

or

r = (〈σψsabs〉Σ − 〈σψ〉Σ〈sabs〉Σ) / [(〈σ²ψ〉Σ − 〈σψ〉²Σ)(〈σ²ψ〉Σ − 〈sabs〉²Σ)]^(1/2),   (19)

and again, the correlation coefficient is given only in terms of moments of the ensemble
member spread.
To simplify the relationship for the correlation coefficient further, we impose the same
requirement on the distribution of the ensemble members holding for all forecasts as was
done with (sabs, ε|µ|) above, namely sabs = βσψ (which applies for normally-distributed
ensemble members, with β = √(2/π)). Using this, we obtain

r = β√[(1 − 〈σψ〉²Σ/〈σ²ψ〉Σ) / (1 − β²〈σψ〉²Σ/〈σ²ψ〉Σ)],   (20)

which is identical to the result for (sabs, ε|µ|).
3. Results of correlation analysis
One focus of this paper has been to assess the limited utility of the linear spread-error
correlation as a verification measure from a theoretical perspective. In the process of doing
so, we have clarified the dependencies of the correlation through calculations performed
under the assumptions of an EPS perfect forecast (i.e. the observation is statistically
indistinguishable from any one ensemble member) for different combinations of continuous
spread and error measures and in the case of no sampling limitation (i.e. large ensemble
size). Tables 4 and 5 show results of these calculations, and from these we make the
following points:
(1) The spread-error correlation can be simplified to forms no longer explicitly dependent
on the error metric, but dependent only on different moments of the ensemble member
distribution, and on what the average value (i.e. expectation value) of these moments is
over the forecast verification set. This can be seen in column 2 of Table 4, for different
combinations of spread (s) and error (ε) measures. To clarify, none of these simplifica-
tions explicitly depend on either how the ensemble members are distributed, or how the
varying spread metric (moments) of these distributions are distributed themselves. The
dependence is instead implicit, by virtue of what the average value of these moments is
when averaged over the set of all forecasts used in the verification.
(2) Because, even for a "perfect" forecast, the correlation remains dependent on at-
tributes of the ensemble member distribution, these dependencies cloud the ability of
the spread-error correlation to provide a diagnostic of EPS performance for an imperfect
model. One would rather hope for a verification metric to at least be asymptotically-
constant (e.g. value of 1.0) when tested with perfect model results. Further dependence
of ensemble size on the correlation’s value further clouds this metric’s utility (see Grimit
and Mass 2007 and Kolczynski et al. 2011 for a numerical studies of this issue). Although
the variability of ensemble member spread over a verification set could be indicative of
EPS performance, such variability also could depend on the stability properties of the
environmental system being modeled. In particular, if the system being modeled is in a
very stable regime, then one may expect that the distribution of ensemble spreads would
be relatively narrow, and as we argue below, this would lead to a very different result for
r than if the system samples a variety of stable/unstable states (i.e. a large "spread" in
the ensemble spreads). More to the point, one would hope that for a perfect model, a
measure of forecast performance such as r would be a fixed value, and not depend on the
inherent properties of the system the forecast is trying to model.
(3) If further constraints are placed on the relationship between the moments of the
ensemble member distribution (column 3 of Table 4), then further simplifications can be
made on the form of the correlation (column 4, Table 4), reducing to only three forms for
the six combinations considered in Table 4. For the metrics with the same units as the
weather variable itself, with the constraint that sabs = βσψ and β is some constant, this
is given by
r = β√[(1 − 〈σψ〉²Σ/〈σ²ψ〉Σ) / (1 − β²〈σψ〉²Σ/〈σ²ψ〉Σ)].   (21)
For the two squared metrics in the table, with the constraint that m4 = α(σ²ψ)² and α is
some constant, the two correlation expressions are

r = √[(1 − 〈σ²ψ〉²Σ/〈(σ²ψ)²〉Σ) / (α − 〈σ²ψ〉²Σ/〈(σ²ψ)²〉Σ)]   (22)

and

r = √[(1 − 〈σ²ψ〉²Σ/〈(σ²ψ)²〉Σ) / ((α + 3)/2 − 〈σ²ψ〉²Σ/〈(σ²ψ)²〉Σ)].   (23)
More specifically, if the ensemble member distribution is normally-distributed (satisfying
β = √(2/π) and α = 3), the theoretical form of the correlation is given in column 2, Table
5, which reduces to two forms for the metrics considered. For the metrics with the same units
as the weather variable itself, this is given by

r = √(2/π) √[(1 − 〈σψ〉²Σ/〈σ²ψ〉Σ) / (1 − (2/π)〈σψ〉²Σ/〈σ²ψ〉Σ)].   (24)
For the squared metrics, the correlation is

r = √[(1 − 〈σ²ψ〉²Σ/〈(σ²ψ)²〉Σ) / (3 − 〈σ²ψ〉²Σ/〈(σ²ψ)²〉Σ)].   (25)
What can be seen, then, is that depending on what paired metric definitions are used,
one can get different correlations for the same EPS forecasts, and along with this, different
values for the correlations’ upper bounds, as shown below. This, then, would allow one
to artificially increase or decrease the spread-error correlation through optimal choice of
metric depending on the result desired.
(4) Examining the more general (21)-(23), and (24)-(25) specific to normally-distributed
ensembles, one can see there are two governing ratios (g) that determine the value of the
correlation. For the metrics with the same units as the weather variable itself (rows 1
through 4 of Table 5), the ratio is

g1 = 〈σψ〉²Σ/〈σ²ψ〉Σ = 〈σψ〉²Σ/[〈σψ〉²Σ + var(σψ)]   (26)
where var(·) represents the variance. For the squared metrics (rows 5 through 6 of Table
5), the governing ratio is

g2 = 〈σ²ψ〉²Σ/〈(σ²ψ)²〉Σ = 〈σ²ψ〉²Σ/[〈σ²ψ〉²Σ + var(σ²ψ)].   (27)
Consider the situation where the EPS consistently generates a probabilistic forecast
with similar ensemble member dispersion from one forecast to the next. In the limit as
the change in the dispersion vanishes, both var(σψ) → 0 and var(σ²ψ) → 0, and g → 1 in
both (26) and (27). As a result, r → 0 in (21)-(25).
In the other extreme limit, as the EPS generates an (infinitely-) wide range of ensemble
dispersion, both var(σψ) → ∞ and var(σ²ψ) → ∞, and g → 0 in both (26) and
(27). As a result, r → β in (21), r → √(1/α) in (22), and r → √(2/(α + 3)) in (23).
For normally-distributed ensemble members, r → √(2/π) in (24), and r → √(1/3) in (25).
Figure 2 provides a graphic illustration of how r varies as a function of 〈σψ〉²Σ/〈σ²ψ〉Σ and
〈σ²ψ〉²Σ/〈(σ²ψ)²〉Σ for normally-distributed ensemble members.
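The limiting behavior in point (4) can be made concrete with a small numeric sketch of (24) and (25) (the function names are ours):

```python
import numpy as np

def r_abs(g1):
    """Eq. (24): correlation for the metric pairs sharing the units of the
    forecast variable, as a function of g1 = <sigma>^2/<sigma^2>."""
    return np.sqrt((2.0 / np.pi) * (1.0 - g1) / (1.0 - (2.0 / np.pi) * g1))

def r_sq(g2):
    """Eq. (25): correlation for the squared metric pairs, as a function
    of g2 = <sigma^2>^2/<(sigma^2)^2>."""
    return np.sqrt((1.0 - g2) / (3.0 - g2))

# Upper limits (g -> 0, infinitely variable dispersion) and lower limits
# (g -> 1, uniform dispersion):
upper_abs, upper_sq = r_abs(0.0), r_sq(0.0)  # sqrt(2/pi), sqrt(1/3)
lower_abs, lower_sq = r_abs(1.0), r_sq(1.0)  # both 0
```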
(5) The more general results in Tables 4 and 5 compare well with past numeric results
in the literature. Barker (1991) examined the correlation between the ensemble variance
(s; row 1, Table 1) and the square error of any one ensemble member (ε; row 2, Table
2) using geopotential height anomalies from extended range forecasts. He numerically
generated a maximum correlation value of 0.58, which is the same result we derive in row
6, Table 5 (√(1/3) ≈ 0.58).
Also consider a specific distribution for the standard deviation σψ of the ensemble
member spread. If the possible values of σψ over the forecasts of interest are lognormally
distributed, then r takes on the specific form given in column 5 of Table 5. Modified ver-
sions of the lognormal distribution for σψ were presented earlier by KK. This distribution
is given by

f(σψ) = [1/(σψσΣ√(2π))] exp[−(ln σψ − ln σψM)² / (2σ²Σ)],   (28)
where σΣ is the standard deviation of the distribution of ln(σψ), and σψM is the median
value of σψ. (Note: for the lognormal distribution, the mean 〈σψ〉Σ and median σψM are
not identical but are related by 〈σψ〉Σ = σψM exp(σ²Σ/2).) For specified values of σψM and
σΣ, values of σψ can be derived from ln(σψ) = N(ln(σψM), σΣ), where N(γ, δ) represents
a random draw from a Normal distribution with mean γ and standard deviation δ. For
normally-distributed ensemble members, with spread metric σψ and error metric ε|µ|, with
σψ lognormally distributed, we then have the same case explored by Houtekamer (1993),
Whitaker and Loughe (1998), and Grimit and Mass (2007). For this case, the governing
ratio simplifies to

g = 〈σψ〉²Σ/〈σ²ψ〉Σ = exp(−σ²Σ),   (29)
and the correlation simplifies to the expression in column 5, row 2 of Table 5, which
itself duplicates (33) of Houtekamer (1993). Note, however, that defining the specific
distribution of the ensemble member spread is not important to determining the limiting
behavior of the correlation, which for this case is given by column 2, row 2 of Table 5,
with correlation limits of [0, √(2/π)] ≈ [0, 0.80]. This same limit was numerically estimated
by Houtekamer (1993), Whitaker and Loughe (1998), and Grimit and Mass (2007).
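The lognormal reduction (29) is easy to confirm by sampling (an illustrative check of ours; the parameter values are arbitrary):

```python
import numpy as np

# Sampling check of (29): when ln(sigma_psi) ~ N(ln(sigma_psi_M), sigma_Sigma),
# the governing ratio g = <sigma_psi>^2 / <sigma_psi^2> reduces to
# exp(-sigma_Sigma^2), independent of the median sigma_psi_M.
rng = np.random.default_rng(4)
sigma_psi_M, sigma_Sigma = 0.8, 0.6
sigma_psi = np.exp(rng.normal(np.log(sigma_psi_M), sigma_Sigma, size=1_000_000))

g_sampled = sigma_psi.mean()**2 / (sigma_psi**2).mean()
g_theory = np.exp(-sigma_Sigma**2)
# Mean/median relation quoted in the text: <sigma_psi> = sigma_psi_M * exp(sigma_Sigma^2/2)
mean_theory = sigma_psi_M * np.exp(sigma_Sigma**2 / 2.0)
```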
4. Two aspects of the variation of ensemble dispersion
In this section we argue that there are two aspects of an ensemble’s variation in dis-
persion that should be assessed. The first aspect is: do the day-to-day variations in the
dispersion of an ensemble forecast relate to day-to-day variations in the expected forecast
error? The second aspect is: is there enough variability in the EPS dispersion to
justify the expense of generating the ensemble? We respectively address each of these
aspects in turn below.
We have argued in the previous section that the Pearson correlation does not provide a
definitive tool to assess the reliability of the ensemble spread-error relationship due to the
fact that even for an EPS perfect forecast, the correlation can vary widely by virtue of
its dependence on factors other than exclusive properties of EPS forecast performance.
However, this does not necessarily mean that the correlation does not still have utility in
answering this question, to which we will return below.
Because of the correlation’s deficiencies, Wang and Bishop (2003) suggested creating bins
of the spread measure of choice (in their case, ensemble variance), and then averaging the
corresponding error metrics (e.g. square error of the ensemble mean) over these bins
to remove statistical noise. After this bin-averaging, properly matched spread and error
measures should then equate (with the removal of observation error), and a perfect EPS
forecast should therefore produce points lying along a 45 degree line. As the variations
in an ensemble’s dispersion become less informative, the slope of this curve (binned error
versus binned spread) becomes more horizontal. However, as visually informative as
this approach can be, ambiguities in the EPS’s error-spread reliability can arise due to
ambiguities in the sufficient number of bins and number of points in each bin required for
this test, especially for small verification data sets. Similarly, Wang and Bishop (2003)
also argued that the rate at which the binned error metric becomes noisier as bin size (thus
sample size) decreases, and the degree of kurtosis in the binned sample of errors, both
provide measures of the accuracy in the EPS error variation prediction. However, both of
these latter two approaches rely on an assumption of Gaussianity for proper interpretation.
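A minimal sketch of this binned comparison (our own construction; the equal-count binning on sorted spread is an assumption, not necessarily Wang and Bishop's exact procedure):

```python
import numpy as np

# Bin forecasts by ensemble variance, average the squared error of the
# ensemble mean within each bin, and compare against the bin-mean spread.
# For a well-calibrated EPS the points fall near the 1:1 line.
def binned_spread_error(spread, error, n_bins=10):
    """Return (bin-mean spread, bin-mean error), binned on sorted spread."""
    order = np.argsort(spread)
    s, e = np.asarray(spread)[order], np.asarray(error)[order]
    bins_s = np.array([chunk.mean() for chunk in np.array_split(s, n_bins)])
    bins_e = np.array([chunk.mean() for chunk in np.array_split(e, n_bins)])
    return bins_s, bins_e

# Synthetic perfect-EPS check: squared error (mu - psi_o)^2 with psi_o ~ N(mu, sigma)
rng = np.random.default_rng(5)
var = rng.lognormal(size=50_000)            # ensemble variances across forecasts
err = rng.normal(0.0, np.sqrt(var))**2      # squared errors of the ensemble mean
bs, be = binned_spread_error(var, err)      # be tracks bs along the 1:1 line
```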
An alternative to the Wang and Bishop approach that produces a single scalar measure of EPS
error-spread reliability, and requires no distributional assumptions, can be created from the
Pearson correlation r. Benefits of single scalar metrics are that they can better leverage
limited verification data sets, they can often provide a more objective metric for assessing
EPS performance as compared to, say, graphical assessments, and they can more easily
lend themselves to constructing confidence bounds. This alternative can be constructed by
reframing r relative to a perfect EPS forecast in the context of a skill score (Wilks, 1995).
Note that although skill scores need to be used with care since they can be improper
in certain contexts (Gneiting and Raftery 2007; Murphy 1973), they can still provide a
useful relative measure of forecast system improvement. A candidate for an error-spread
Pearson correlation skill score SSr is
SSr = (rforc − rref)/(rperf − rref), (30)
where rforc is the EPS spread-error correlation, rref is that of a reference forecast, and
rperf is that for a perfect EPS forecast. For the correlation’s spread and error metrics we use the standard deviation of the ensemble (σψ) and the absolute error of the ensemble mean (ε|µ|), respectively. If we take the no-skill forecast as the reference forecast, such that rref = 0, then SSr simplifies to
SSr = rforc/rperf, (31)
For simplicity, we could also assume the perfect EPS forecast has approximately normally-distributed ensemble members, such that rperf is given by (24) above.
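A minimal sketch of computing SSr under these choices (illustrative only; the function name and synthetic data are hypothetical, and rperf is taken from the normally-distributed perfect-EPS form of Table 5 for the (σψ; ε|µ|) pairing rather than from an empirical perfect-model resampling):

```python
import numpy as np

def spread_error_skill_score(spread, abs_error):
    """SSr = r_forc / r_perf of (31), taking r_perf from the Gaussian
    perfect-EPS form of Table 5 for the (sigma_psi; eps_|mu|) pairing:
        r_perf = sqrt(2/pi) * sqrt((1 - g) / (1 - (2/pi) * g)),
    with governing ratio g = <sigma>^2 / <sigma^2>."""
    r_forc = np.corrcoef(spread, abs_error)[0, 1]
    g = spread.mean() ** 2 / np.mean(spread ** 2)
    r_perf = np.sqrt(2.0 / np.pi) * np.sqrt((1.0 - g) / (1.0 - (2.0 / np.pi) * g))
    return r_forc / r_perf

# Synthetic well-calibrated case: absolute ensemble-mean errors drawn
# consistently with the (lognormally distributed) spread, so SSr ~ 1.
rng = np.random.default_rng(1)
sigma = rng.lognormal(0.0, 0.5, size=5000)
abs_err = np.abs(rng.normal(0.0, sigma))
ss_r = spread_error_skill_score(sigma, abs_err)
```

An SSr near one indicates the forecast spread discerns its own error variations about as well as a perfect EPS could; an SSr near zero indicates no usable error-spread relationship.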
A second, and perhaps more essential, aspect of an ensemble’s variation in dispersion
that should be assessed is whether there is enough variability in the dispersion to begin
with to justify the generation of an expensive ensemble, irrespective of whether the EPS
spread-error relationship is reliable or not. Implicitly, both Wang and Bishop (2003)
and Grimit and Mass (2007) also examined this issue in the context of the binned error
and spread metric comparison approach discussed above. Wang and Bishop used the y-
axis range as a metric (binned error metric variation); while after applying an analogue
calibration approach to each bin, Grimit and Mass used gains in the rank probability
(RPS) skill score as a gauge (where the RPS of a fixed ensemble-mean error climatology
was used as a reference). However, the former approach does not provide a normalized metric (and thus retains sensitivity to the scale of the units). Moreover, neither of these approaches isolates the degree of variability in the ensemble’s native dispersion, because both the EPS’s accuracy in discerning error variability and the choice of bin size cloud this issue.
One possible metric for measuring the degree of variability in the ensemble’s native dispersion is to utilize the "governing ratios" g presented above, but in the context of a skill score, as was done with the correlation coefficient for EPS error-spread reliability assessment. Because g is calculated using only the moments of the ensemble member set, it focuses on the EPS’s potential to produce dispersion variability. In terms of the "governing ratio" skill score SSg, we have
SSg = (gforc − gref)/(gperf − gref), (32)
where gforc is the EPS governing ratio, gref is that of a reference forecast, and gperf is
that for a perfect forecast. Considering only the governing ratio g1 of (26), taking gref = 1 (i.e. no dispersion variability) and gperf = 0 (i.e. extremely large dispersion variability), and simplifying, we then have
SSg = 1 − gforc = (〈σψ2〉Σ − 〈σψ〉Σ2)/〈σψ2〉Σ = var(σψ)/(〈σψ〉Σ2 + var(σψ)), (33)
where var(σψ) represents the variance of the ensemble member standard deviation over
the verification data set. SSg can be viewed as a normalized, or relative, measure of how
much variability there is in the ensemble’s day-to-day dispersion as compared to the mean, or average, amount of this dispersion.
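As a sketch, SSg of (33) reduces to a few lines of code (the function name and example arrays are illustrative):

```python
import numpy as np

def dispersion_variability_score(spread):
    """SSg = var(sigma_psi) / (<sigma_psi>^2 + var(sigma_psi))  (eq. 33):
    0 for constant spread, approaching 1 as the day-to-day dispersion
    variability dominates the mean dispersion."""
    m = spread.mean()
    v = spread.var()
    return v / (m ** 2 + v)

constant_spread = np.full(100, 2.0)          # no dispersion variability
variable_spread = np.array([0.1, 5.0] * 50)  # strongly varying dispersion
ss_constant = dispersion_variability_score(constant_spread)
ss_variable = dispersion_variability_score(variable_spread)
```

Because SSg uses only moments of the ensemble standard deviation, it can be computed from the forecasts alone, with no verifying observations required.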
5. EPS examples
In this section we show two examples of EPS forecasts to highlight some of the points
made above. The first example EPS produces ensembles from a mixture of WRF and
MM5 mesoscale models, using a variety of different initial conditions, outer boundary conditions, and physics packages (Liu et al. 2007), post-processed with a quantile regression approach (Hopson et al. 2010) to produce a calibrated 30-member ensemble,
although in this paper we use a 19-member subset. The ensemble generates gridded
temperature forecasts over the Dugway Proving Grounds of the Army Test and Evaluation
Command (ATEC) outside Salt Lake City, Utah. Figure 3 shows time-series and rank
histograms of this EPS out-of-sample verification set. Panel 3a shows a subset time-series of 3-hr lead-time sorted ensembles (colored lines) downscaled to a meteorological station over the ATEC range, along with the station observations (black line), while 3b shows the out-of-sample post-processed
results. Panels 3c and 3d show rank histograms of the same forecasts, respectively, with
the red dashed lines in the figure showing 95% confidence bounds on the histograms
(for which we could expect approximately one bin to lie outside of these bounds for a
perfectly-calibrated 19-member ensemble). From the rank histograms we see significant
under-dispersion (U-shaped) in the pre-processed forecasts, but near-perfect calibration
in the post-processed ensemble member set. Panels 3e - 3h show results for 36-hr forecasts
with similar conclusions concerning pre- and post-processed forecasts’ under- and near-
perfect dispersion, respectively, as for the 3-hr forecasts.
Figure 4 shows our second example of EPS forecasts. In this figure is shown ensem-
ble streamflow forecasts (colored lines) for the Brahmaputra river at the Bahadurabad
gauging station within Bangladesh of the Climate Forecast Applications in Bangladesh
(CFAB) project for years 2003 - 2007 (Hopson and Webster 2010), along with observed
streamflow from the Bangladesh Flood Forecasting and Warning Centre (FFWC; black
line). Panels a) and e) show time-series of sorted 51-member multi-model forecasts of river
flow at 1- and 10-day lead-times, respectively. These forecasts were generated by using
ensemble weather forecasts from the European Centre for Medium-Range Weather Forecasts (ECMWF) 51-member Ensemble Prediction System (EPS) (Molteni et al. 1996),
near-real-time satellite-derived precipitation products from the NASA Tropical Rainfall
Measuring Mission (TRMM; Huffman et al. 2005, 2007) and the NOAA CPC morph-
ing technique (CMORPH; Joyce et al. 2004), a GTS-NOAA rain gauge product (Xie
et al. 1996), and near-real-time river flow estimates from the FFWC. Panels b) and f)
show the respective post-processed results of these forecasts, where a k-nearest-neighbor
analogue approach (KNN) was used for this application. Panels c) and d) show the re-
spective pre- and post-processed rank histograms and 95% confidence bounds (for which
we could expect approximately three bins to lie outside of these bounds for a perfectly-
calibrated 51-member ensemble) for the 1-day lead-time forecasts, and panels g) and h)
show the same but for the 10-day forecasts. As with our first example, from the rank
histograms we see significant under-dispersion (U-shaped) in the pre-processed forecasts,
but near-perfect calibration in the post-processed ensemble member set.
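The rank histograms and flatness bounds used in both examples can be sketched as follows (illustrative; the 95% bounds here use a simple pointwise binomial normal approximation, which may differ in detail from the bounds used in the figures, and the synthetic data are our own):

```python
import numpy as np

def rank_histogram(ens, obs):
    """Counts of the observation's rank within each day's ensemble
    (ranks 0 .. n_members, i.e. n_members + 1 bins)."""
    n_days, n_members = ens.shape
    ranks = (ens < obs[:, None]).sum(axis=1)
    return np.bincount(ranks, minlength=n_members + 1)

def flatness_bounds(n_days, n_bins, z=1.96):
    """Pointwise ~95% bounds on each bin count under the flat
    (well-calibrated) hypothesis, via a normal approximation to the
    binomial(n_days, 1/n_bins) count."""
    p = 1.0 / n_bins
    centre = n_days * p
    half = z * np.sqrt(n_days * p * (1.0 - p))
    return centre - half, centre + half

# Synthetic calibrated 19-member ensemble: the observation is one more
# draw from the ensemble's own distribution, so the histogram is flat
# up to sampling noise, with roughly 5% of bins outside the bounds.
rng = np.random.default_rng(2)
n_days, n_members = 1000, 19
sigma = rng.lognormal(0.0, 0.3, n_days)
ens = rng.normal(0.0, sigma[:, None], (n_days, n_members))
obs = rng.normal(0.0, sigma)

counts = rank_histogram(ens, obs)
lo, hi = flatness_bounds(n_days, n_members + 1)
n_outside = int(np.sum((counts < lo) | (counts > hi)))
```

With 20 bins at the 95% level we expect about one bin outside the bounds for a calibrated 19-member ensemble, and about three of 52 bins for a calibrated 51-member ensemble, consistent with the expectations quoted above.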
Utilizing the CFAB EPS 10-day lead-time streamflow forecasts post-processed with a
KNN algorithm, we examine the concepts discussed in section 3. Figure 5 presents scatter
plots of ensemble error versus spread using the metric pairings shown in Tables 4 and 5.
The black dots are the actual error-spread data. The blue dots are calculated by treating
the CFAB forecasts as if they were derived from an EPS perfect forecast, which is practically
done here by each day randomly choosing one member to represent the verification from
the set of 51 ensemble forecast members plus the observation, with the remaining 51
unchosen "members" treated as the ensemble forecast. Linear fits to both actual and
perfect model data sets are included (black and blue lines, respectively). In the upper
right corner of each panel are the following correlation values for the error-spread data: "ensemble r", derived from the actual forecast metrics (black dots); "perf. model" r, derived from the EPS perfect forecast metrics (blue dots); "perf. gaussian" r, derived from the actual forecasts’ moments but using the theoretical form for normally-distributed EPS perfect forecast ensemble members (column 2, Table 5); and "theor. up. lim.", the theoretical maximum value the correlation can attain for normally-distributed ensembles (column 4, Table 5).
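The perfect-model construction used for the blue dots can be sketched as follows (illustrative; function and variable names, and the synthetic data, are our own):

```python
import numpy as np

def perfect_model_resample(ens, obs, rng):
    """For each forecast day, pool the ensemble members with the
    observation, draw one pooled value at random to act as the
    verification, and keep the remaining values as the ensemble."""
    n_days, n_members = ens.shape
    pool = np.concatenate([ens, obs[:, None]], axis=1)  # (n_days, n_members + 1)
    pick = rng.integers(0, n_members + 1, size=n_days)
    new_obs = pool[np.arange(n_days), pick]
    keep = np.ones_like(pool, dtype=bool)
    keep[np.arange(n_days), pick] = False
    new_ens = pool[keep].reshape(n_days, n_members)
    return new_ens, new_obs

rng = np.random.default_rng(3)
ens = rng.normal(size=(50, 51))   # 50 days of a 51-member ensemble
obs = rng.normal(size=50)
new_ens, new_obs = perfect_model_resample(ens, obs, rng)
```

Because the resampled verification is, by construction, statistically indistinguishable from the resampled ensemble members, the resulting spread-error statistics estimate what the forecast system would achieve if it were its own truth.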
In Figure 5 notice the positive slope to both actual and perfect model data in each panel,
such that as the spread increases, the error also is more likely to be larger. But also notice
that even for large spread values of either the perfect model (blue dots) or actual forecast
data (black dots), the error can be very small, and as such the correlation is not (cannot
be) perfect (i.e. 1.0), as shown by both the "ensemble r" and "perf. model" r values ranging
from [.21, .29] and [.22, .27], respectively. The similarity of the actual and perfect model
ranges also shows that the KNN post-processing algorithm appears to have produced
well-calibrated ensembles with respect to the error-spread relationship. Also notice that
the "perf. gaussian" r values are quite close to the "perf. model" r values, showing that the normally-distributed ensemble member assumption is a good approximation for this data set, and thus could provide a much simpler theoretical r value to calculate (column 2, Table 5) than the method used to generate "perf. model" r discussed above. But also note
that the actual and perfect model values are well below the theoretical maximum values they could attain of √(2/π) ≈ .80 (panels a - d) and √(1/3) ≈ .58 (panels e - f), respectively, showing that the data’s "governing ratios" (column 3, Table 5) are not at their minimum. Finally, and non-intuitively, notice the almost identical values of all the respective actual forecast correlations, even though the theoretical maximum value of panels a - d is very different from that of panels e - f.
6. Conclusions
There clearly is a need to verify the value of the 2nd moment of ensemble forecasts: if,
for a particular forecast, the forecast ensemble spread is large or small, does this mean
the forecast skill is diminished or increased, respectively? This paper has argued that the
Pearson correlation coefficient r of forecast spread and error is not a good verification
measure to directly test this relationship between ensemble spread and skill, since it
depends on factors other than just forecast model performance.
The important point here is that the forecast model’s correlation coefficient can take on a wide range of values even for a perfectly-calibrated model. What this correlation is
could depend on an inherent property of the EPS (such as its resolution), but it could
also depend on the variety of states available to the physical system being modeled,
completely irrespective of the forecast model’s performance. Given this latter dependence,
we argue that the spread-skill correlation is not an adequate verification gauge of how well
a variation in ensemble spread forecasts a change in forecast certainty.
These ideas were examined in the context of ensemble temperature forecasts for Utah
and for streamflow forecasts for the Brahmaputra River. It was shown that even for a
perfect model, r depends on how one defines forecast spread and forecast skill (error); and
in Tables 4 and 5 of the previous section we also showed how the spread-error correlation r
for a variety of different measures of spread and error was dependent on higher moments
of the distribution of the ensemble spreads, which themselves should be dependent on
the stability properties of the modeled system during the period the forecasts are being
verified (among other factors). In particular, we showed that under certain conditions,
the correlation depends on the ratio of how much the forecast spread varies from forecast
to forecast compared to its mean value of spread,
〈s〉2/〈s2〉 = 〈s〉2/[〈s〉2 + var(s)], (34)
where s is some measure of forecast ensemble spread, 〈s〉 its mean value, and var(s) =
〈(s − 〈s〉)2〉 represents its variance. As this ratio approaches zero, the skill-spread correlation asymptotes to its upper value of √(2/π) or √(1/3), depending on how the skill and spread measures are defined. These theoretical results validate and generalize some of the previous numerical and theoretical findings of Barker (1991) and Houtekamer (1993), in
particular (see section 2).
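A small Monte Carlo sketch of this limiting behavior (illustrative; the lognormal spread distribution matches column 5 of Table 5, but the sample size, seed, and names are our own): as the governing ratio 〈s〉2/〈s2〉 falls toward zero, the empirical spread-error correlation climbs toward its √(2/π) ceiling, while for nearly constant spread it falls toward zero.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000

def perfect_model_corr(sigma_log_std):
    """Empirical correlation between spread sigma_psi and the absolute
    ensemble-mean error for a perfect EPS whose spread is lognormally
    distributed with the given log standard deviation (sigma_Sigma)."""
    sigma = rng.lognormal(0.0, sigma_log_std, n)
    abs_err = np.abs(rng.normal(0.0, sigma))
    return np.corrcoef(sigma, abs_err)[0, 1]

r_small = perfect_model_corr(0.05)  # nearly constant spread: g -> 1
r_large = perfect_model_corr(1.0)   # strongly varying spread:  g -> 0
# r_small sits near zero; r_large approaches, but cannot exceed,
# the sqrt(2/pi) ceiling.
```

Both simulated systems are perfectly calibrated; only the variety of available spread states differs, which is exactly why r alone cannot gauge forecast system quality.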
Because r is strongly dependent on factors other than just the skill of the forecast
system, we argue that r is an unreliable verification measure of whether changes in forecast
skill can be associated with changes in ensemble forecast spread. To meet the clear need
of a measure that can objectively test the usefulness of the variability of the forecast
ensemble spread, we propose in the second part of this paper three alternatives to the
skill-spread correlation. In particular, if there is no usefulness in this "2nd moment" of an ensemble forecast, then one might lose little benefit (and possibly gain) by using hindcasts to calculate a much less expensive invariant "climatological" error distribution (Leith 1974; Atger 1999), or fit a simple heteroscedastic error model (i.e. error variance
that depends on the magnitude of the variable) to use in conjunction with the ensemble
mean or control member forecast instead of using the full suite of forecast ensembles
themselves.
References
Atger, F., 1999: The Skill of Ensemble Prediction Systems. Mon. Wea. Rev., 127, 1941–
1953.
Barker, T. W., 1991: The relationship between spread and forecast error in extended-range
forecasts. J. Climate, 4, 733–742.
Buizza, R., 1997: Potential forecast skill of ensemble prediction and spread and skill
distributions of the ECMWF ensemble prediction system. Mon. Wea. Rev., 125, 99–
119.
Gneiting, T., and A. E. Raftery, 2007: Strictly Proper Scoring Rules, Prediction, and Es-
timation, J. Amer. Stat. Assoc., 102(477), 359–378, doi:10.1198/016214506000001437.
Grimit, E. P., and C. F. Mass, 2007: Measuring the Ensemble Spread-Error Relationship
with a Probabilistic Approach: Stochastic Ensemble Results. Mon. Wea. Rev., 135,
203–221.
Hopson, T. M., and P. J. Webster, 2010: Operational flood forecasting for Bangladesh
using ECMWF ensemble weather forecasts. J. Hydrometeor., 11, 618–641.
Hopson, T., J. Hacker, Y. Liu, G. Roux, W. Wu, J. Knievel, T. Warner, S. Swerdlin, J.
Pace and S. Halvorson, 2010: Quantile regression as a means of calibrating and veri-
fying a mesoscale NWP ensemble. Prob. Fcst Symp., American Meteorological Society,
Atlanta, GA, 17-23 January 2010.
Houtekamer, P. L., 1993: Global and local skill forecasts. Mon. Wea. Rev., 121, 1834–
1846.
Houtekamer, P. L., L. Lefaivre, J. Derome, H. Ritchie, and H. L. Mitchell, 1996: A system
simulation approach to ensemble prediction. Mon. Wea. Rev., 124, 1225–1242.
Huffman, G. J., R. F. Adler, S. Curtis, D. T. Bolvin, and E. J. Nelkin, 2005: Global
rainfall analyses at monthly and 3-hr time scales. Measuring Precipitation from Space:
EURAINSAT and the Future, V. Levizzani, P. Bauer, and J. F. Turk, Eds., Springer,
722 pp.
Huffman, G. J., R. F. Adler, D. T. Bolvin, G. Gu, E. J. Nelkin, K. P. Bowman, Y. Hong,
E. F. Stocker, D. B. Wolff, 2007: The TRMM Multisatellite Precipitation Analysis
(TMPA): Quasi-global, multiyear, combined sensor precipitation estimates at fine scales.
J. Hydrometeor., 8, 38-55.
Joyce, R. J., J. E. Janowiak, P. A. Arkin, and P. P. Xie, 2004: CMORPH: A method
that produces global precipitation estimates from passive microwave and infrared data
at high spatial and temporal resolution. J. Hydrometeor., 5, 487–503.
Kolczynski, W. C., D. R. Stauffer, S. E. Haupt, N. S. Altman, and A. Deng, 2011:
Investigation of Ensemble Variance as a Measure of True Forecast Variance. Mon. Wea.
Rev., 139, 3954–3963.
Kruizinga, S., and C. J. Kok, 1988: Evaluation of the ECMWF experimental skill predic-
tion scheme and a statistical analysis of forecast errors. Proc. ECMWF Workshop on
Predictability in the Medium and Extended Range, Reading, United Kingdom, ECMWF,
403–415.
Leith, C. E., 1974: Theoretical Skill of Monte Carlo Forecasts. Mon. Wea. Rev., 102,
409–418.
Liu, Y., M. Xu, J. Hacker, T. Warner, and S. Swerdlin, 2007: A WRF and MM5-
based 4-D mesoscale ensemble data analysis and prediction system (E-RTFDDA) devel-
oped for ATEC operational applications. 18th Conf. on Numerical Weather Prediction,
Amer. Meteor. Soc., June 25-29, 2007. Park City, Utah.
Molteni, F., R. Buizza, T. N. Palmer, and T. Petroliagis, 1996: The ECMWF Ensemble
Prediction System: Methodology and validation. Q. J. R. Meteorol. Soc., 122, 73–119.
Murphy, A. H., 1973: Hedging and Skill Scores for Probability Forecasts. J. Appl. Meteor., 12, 215–223.
Palmer, T. N., 2002: The economic value of ensemble forecasts as a tool for risk assess-
ment: From days to decades. Q. J. R. Meteorol. Soc., 128, 747–774.
Richardson, D. S., 2000: Skill and relative economic value of the ECMWF ensemble
prediction system. Q. J. R. Meteorol. Soc., 126, 649–667.
Scherrer, S.C., C. Appenzeller, P. Eckert, D. Cattani, 2004: Analysis of the spread-skill
relations using the ECMWF ensemble prediction system over Europe. Wea. Forecasting,
19 (3), 552–565.
Toth, Z. and E. Kalnay, 1997: Ensemble forecasting at NCEP and the breeding method.
Mon. Wea. Rev., 125, 3297–3319.
Toth, Z. and O. Talagrand and G. Candille and Y. Zhu, 2003: Probability and Ensemble
Forecasts. Chapter 7 of Forecast Verification: A Practitioner’s Guide in Atmospheric
Science. John Wiley and Sons, 254pp.
Wang, X., and C. H. Bishop, 2003: A Comparison of Breeding and Ensemble Transform
Kalman Filter Ensemble Forecast Schemes. J. Atmos. Sci., 60, 1140–1158.
Whitaker, J. S., and A. F. Loughe, 1998: The Relationship between Ensemble Spread and
Ensemble Mean Skill. Mon. Wea. Rev., 126, 3292–3302.
Wilks, D. S., 1995: Statistical Methods in the Atmospheric Sciences. Academic Press,
467pp.
Xie, P. P., B. Rudolf, U. Schneider, and P. A. Arkin, 1996: Gauge-based monthly analysis
of global land precipitation from 1971 to 1994. J. Geophys. Res. - Atmos., 101 (D14),
19023–19034.
Zhu, Y., Z. Toth, R. Wobus, D. Richardson, and K. Mylne, 2002: The Economic Value
Of Ensemble-Based Weather Forecasts. Bull. Amer. Meteor. Soc., 83, 73–83.
Table 1. Measures of spread used and their symbolic representation, where 〈·〉Ψ represents the expectation operation over the population Ψ of forecast ensemble members ψ of a given forecast.

Spread measure | Symbol | Mathematical form
variance of the ensemble members about the ensemble mean | σψ2 | 〈(ψ − 〈ψ〉Ψ)2〉Ψ = 〈ψ2〉Ψ − 〈ψ〉Ψ2
root mean square difference, or standard deviation | σψ | √(σψ2)
mean absolute difference of the ensemble members about the ensemble mean | sabs | 〈|ψ − 〈ψ〉Ψ|〉Ψ
mean absolute difference of the ensemble members about any one chosen ensemble member | sd abs | 〈|ψ − ψ′|〉Ψ,Ψ′
4th moment about the ensemble mean | m4 | 〈(〈ψ〉Ψ − ψ)4〉Ψ
Table 2. Measures of error used and their symbolic representation, where ψo represents the observation or verification of the forecast.

Error measure | Symbol | Mathematical form
square error of the ensemble mean | εµ2 | (〈ψ〉Ψ − ψo)2
square error of one ensemble member | εd2 | (ψ − ψo)2
absolute error of the ensemble mean | ε|µ| | |〈ψ〉Ψ − ψo|
absolute error of any one ensemble member | ε|d| | |ψ − ψo|
Table 3. The measures of spread s (column 2) that correspond to given error measures ε (column 1) after an expectation value operation over the distribution of the observations (〈·〉Ψo) is performed under the perfect EPS assumption. In some cases, a double expectation value operation is performed over both the forecast ensemble distribution Ψ and the possible distribution of the observations Ψo (which are equivalent distributions for a perfect model). Column 3 shows the same results, but for when the forecast ensemble is normally-distributed.

ε | 〈ε〉Ψo or 〈ε〉Ψ,Ψo | Ensembles normally distributed
ε|µ| | sabs | √(2/π) σψ
ε|d| | sd abs | (2/√π) σψ
εµ2 | σψ2 | σψ2
εd2 | 2σψ2 | 2σψ2
Table 4. Reduced forms of EPS perfect forecast spread-error correlation coefficients for different combinations of spread (s) and error (ε) measures (column 1), where the correlation is dependent only on the moments of the ensemble member spread (column 2). All expectation value operations are evaluated over the population of forecasts (i.e. 〈·〉 = 〈·〉Σ). However, further simplifications can be made if constraints are placed on the ensemble member distribution (which are required to hold for all forecasts). These constraints are given in column 3, where α and β are constants (and noting that if sabs = βσψ, then it follows that sd abs = √2 βσψ). These lead to the further simplifications on the correlation shown in column 4. (Note that if the forecast ensemble members are normally-distributed, then α = 3 and β = √(2/π), with results shown in Table 5.)

s ; ε | r theoretical form | constraint | further simplified r
sabs ; ε|µ| | √[(〈sabs2〉 − 〈sabs〉2)/(〈σψ2〉 − 〈sabs〉2)] | sabs = βσψ | β √[(1 − 〈σψ〉2/〈σψ2〉)/(1 − β2〈σψ〉2/〈σψ2〉)]
σψ ; ε|µ| | (〈σψ sabs〉 − 〈σψ〉〈sabs〉)/[(〈σψ2〉 − 〈σψ〉2)(〈σψ2〉 − 〈sabs〉2)]1/2 | same | same
sabs ; ε|d| | (〈sabs sd abs〉 − 〈sabs〉〈sd abs〉)/[(〈sabs2〉 − 〈sabs〉2)(2〈σψ2〉 − 〈sd abs〉2)]1/2 | same | same
σψ ; ε|d| | (〈σψ sd abs〉 − 〈σψ〉〈sd abs〉)/[(〈σψ2〉 − 〈σψ〉2)(2〈σψ2〉 − 〈sd abs〉2)]1/2 | same | same
σψ2 ; εµ2 | √[(〈(σψ2)2〉 − 〈σψ2〉2)/(〈m4〉 − 〈σψ2〉2)] | m4 = α(σψ2)2 | √[(1 − 〈σψ2〉2/〈(σψ2)2〉)/(α − 〈σψ2〉2/〈(σψ2)2〉)]
σψ2 ; εd2 | √[(〈(σψ2)2〉 − 〈σψ2〉2)/(〈m4〉/2 + 3〈(σψ2)2〉/2 − 〈σψ2〉2)] | same | √[(1 − 〈σψ2〉2/〈(σψ2)2〉)/((α + 3)/2 − 〈σψ2〉2/〈(σψ2)2〉)]
Table 5. EPS perfect forecast spread-error correlation coefficient results (column 2) for different combinations of spread (s) and error (ε) measures (column 1). The results are the same as for Table 4, except the distribution of the ensemble members Ψ is constrained to be normally-distributed (with α = 3 and β = √(2/π)), simplifying the results. As with Table 4, all expectation value operations are evaluated over the population of forecasts (i.e. 〈·〉 = 〈·〉Σ). Also shown is the ratio of moments of the spread g that governs the value of r (column 3), the theoretical limiting values for r (column 4), and its form for one specific distribution for the possible ensemble member standard deviations σψ (column 5).

s ; ε | r theoretical form, Ψ normally distributed | governing ratio g | r limits: g → 1 ; g → 0 | r for σψ lognormally distributed
sabs ; ε|µ| | √(2/π) √[(1 − 〈σψ〉2/〈σψ2〉)/(1 − (2/π)〈σψ〉2/〈σψ2〉)] | 〈σψ〉2/〈σψ2〉 | 0 ; √(2/π) | √(2/π) √[(1 − exp(−σΣ2))/(1 − (2/π) exp(−σΣ2))]
σψ ; ε|µ| | same | same | same | same
sabs ; ε|d| | same | same | same | same
σψ ; ε|d| | same | same | same | same
σψ2 ; εµ2 | √[(1 − 〈σψ2〉2/〈(σψ2)2〉)/(3 − 〈σψ2〉2/〈(σψ2)2〉)] | 〈σψ2〉2/〈(σψ2)2〉 | 0 ; √(1/3) | √[(1 − exp(−4σΣ2))/(3 − exp(−4σΣ2))]
σψ2 ; εd2 | same | same | same | same
Figure 1. Schematic of the correlation coefficient simplification calculation. Thin solid
vertical lines represent six-member ensemble forecasts of variable ψ that are randomly-
drawn from the Gaussian-shaped grey PDF curve with mean value given by the vertical
dashed line, and some definition of spread s1, s2, and s3 for forecast times t1, t2, and
t3, respectively. The observation (verification) corresponding to the ensemble forecast is
given by the vertical red line, and the forecast errors ε1, ε2, and ε3 are defined here as the
distance of the ensemble mean to the observation. See text for further details.
Figure 2. Dependence of the correlation coefficient on two different ratios of moments
of ensemble member spread, or "governing ratios", for an EPS perfect forecast.
Figure 3. ATEC EPS pre- and post-processed 19-member 3-hr and 36-hr lead-time
temperature forecasts compared to weather station observations. Panel a) shows a time-series of a sorted subset of 3-hr ensemble forecasts (colored lines) and observations (black line), with panel b) showing the (out-of-sample) ensembles after post-processing. Panels
c) and d) show rank histograms of the pre- and post-processed 3-hr forecasts, respectively,
with red dashed lines providing 95% confidence bounds. Panels e) - h) show the same
results but for the 36-hr lead-time forecasts. See text for details.
Figure 4. CFAB EPS pre- and post-processed 51-member 1-day and 10-day lead-time streamflow forecasts compared to river gauging station observations. Panel a) shows a time-series of a sorted subset of 1-day ensemble forecasts (colored lines) and observations (black line), with panel b) showing the (out-of-sample) ensembles after post-processing. Panels c) and d) show rank histograms of the pre- and post-processed 1-day forecasts, respectively, with red dashed lines providing 95% confidence bounds. Panels e) - h) show
the same results but for the 10-day lead-time forecasts. See text for details.
Figure 5. Scatter plots of ensemble error versus spread using the CFAB EPS data
and the metric pairings of Tables 4 and 5 (rows 1 - 6 corresponding to panels a - f,
respectively). The black dots are the actual error-spread data, the blue dots are the EPS
perfect forecast-equivalent of the CFAB data. Linear fits to both actual and perfect model
data sets are included (black and blue lines, respectively). In the upper right corner of
each panel are the correlation values for the actual forecast, the EPS perfect forecast,
the theoretical normally-distributed perfect forecast, and the theoretical maximum value,
listed top-to-bottom, respectively. See text for further details.
Figure 6. Panels a) - f): correlation and skill-score values plotted as a function of forecast hour (3h - 36h) and forecast day (1 - 10).