MONTHLY WEATHER REVIEW, VOL. , NO. , PAGES 1–31,
Assessing the Ensemble Spread-Error
Relationship
T. M. Hopson
Research Applications Laboratory, National Center for Atmospheric
Research, Boulder, Colorado, USA
T. M. Hopson, RAL - NCAR, P. O. Box 3000, Boulder, Colorado 80307-3000, USA.
Abstract.
With the increased utilization of ensemble forecasts in weather and hydro-
logic applications, there is a need for verification tools to test their benefit
over less expensive deterministic forecasts. This paper examines the ensem-
ble spread-error relationship, beginning with the ability of the Pearson cor-
relation to verify a forecast system’s capacity to represent its own varying
forecast error. Considering only perfect model conditions, this work theoret-
ically extends the results from previous numerical studies showing the cor-
relation’s diagnostic limitations: it can never reach its maximum value of one;
its theoretical asymptotic value depends on the specific definition of spread
and error used, ranging from 0 up to either 1/√3 or √(2/π);
and, perhaps most fatal to its utility, its theoretical limits depend on the vary-
ing stability properties of the physical system being modeled.
Building from this, we argue there are two aspects of an ensemble’s dispersion
that should be assessed. First, and perhaps more fundamentally: is there
enough variability in the ensemble’s dispersion to justify the maintenance of
an expensive ensemble prediction system (EPS), irrespective of whether the
EPS is well-calibrated or not? To diagnose this, the factor that controls the
theoretical upper limit of the spread-error correlation can be useful. Secondly,
does the variable dispersion of an ensemble relate to variable expectation of
forecast error? Representing the spread-error correlation in relation to its the-
oretical limit can provide a simple diagnostic of this attribute. A context for
these concepts is provided by assessing two operational ensembles: Western
US temperature forecasts and Brahmaputra River flow.
1. Introduction
The development of ensemble weather, climate, and hydrologic forecasting has brought
new opportunities to provide significant economic and humanitarian benefit over a single
"best guess" forecast (Richardson 2000, Zhu et al. 2002, Palmer 2002, among others).
One potentially significant if not fundamental attribute of an ensemble prediction system
(EPS) is its ability to forecast its own expected forecast error. This is accomplished if
the EPS provides an accurate expectation of its temporally-varying errors through its
temporally-varying ensemble dispersion (Molteni et al. 1996, Toth and Kalnay 1997,
Houtekamer et al. 1996, Toth et al. 2003, Zhu et al. 2002, Hopson and Webster 2010).
Given that one would expect that larger ensemble dispersion implies more uncertainty
in the forecast ensemble mean or in any one ensemble member (likewise, smaller disper-
sion implying less uncertainty), many past approaches have used the Pearson correlation
coefficient as a diagnostic for this potential EPS property by linearly-correlating differing
measures of ensemble spread with differing measures of forecast error. However, the con-
clusions drawn from the use of this metric have often been ambiguous in many of these
studies (Barker 1991; Molteni et al. 1996; Buizza 1997; Scherrer et al. 2004).
Houtekamer (1993), Whitaker and Loughe (1998), and Grimit and Mass (2007) have
investigated why linear correlation may not be a conclusive metric, primarily in the context of a statistical model presented originally by Kruizinga and Kok (1988; hereafter
"KK"). The above authors’ analyses were done in the context of an EPS perfect forecast
assumption, one in which the underlying probability distribution function (PDF) of the
forecast error is known, and individual ensemble members represent random draws from
this distribution, with the ensemble spread providing a measure of the expected forecast
error. Note the distinction between "perfect forecast" and "EPS perfect forecast" assumptions: the former being when the forecast is identical to the future observation; the
latter being when the distribution of the EPS ensembles is statistically indistinguishable
from the forecast error PDF. In the context of the KK model, these authors showed that
even for a perfect EPS, the correlation between skill and spread need not be statistically
significant, with the magnitude of the linear correlation depending on the day-to-day vari-
ability of spread: for verification data where there is large temporal variation in ensemble
spreads, the correlation between spread and skill is at a maximum (but less than one),
and in regions where the ensemble spread is more temporally uniform, the correlation is
at a minimum. Grimit and Mass (2007) also numerically assessed the behavior of the
spread-error correlation with the same KK model in the context of differing continuous
and categorical spread and error metrics, and for ensemble systems of finite size, showing
additional dependencies of the spread-skill correlation on these additional factors.
Although conducted in the context of one particular statistical model (i.e. KK), the
general conclusion one could draw from these analyses is that the linear correlation is defi-
cient as a verification measure by virtue of its dependence on factors other than exclusive
properties of EPS forecast performance. One purpose of this current paper is to elaborate
on and generalize this last point further by presenting some of these dependencies from a
more theoretical framework for continuous spread and error measures. Among the dependencies that can affect the spread-error correlation, many studies assessing the forecast
spread-skill correlation used differing definitions and combinations of measures represent-
ing spread and skill. It is not clear how these different combinations of measures affect
the theoretical limits of the correlation, and therefore how these studies might interrelate.
Here we calculate some of the theoretical limits of the correlation for different spread and
error combinations, which we argue provide two generalizable metrics to test the utility
of an EPS’s ability to provide ensemble members with varying dispersion.
In section 2 we start by presenting some of the possible continuous error and spread
measures, arguing that only certain combinations of these spread and error metrics are
dimensionally well-matched and should be used in conjunction. Later in the section we
provide explicit calculations for theoretical simplifications on the linear correlation for four
different matched spread-skill metrics. For this we also utilize the EPS perfect forecast
assumption with no sampling limitations, but do not rely on a particular functional form
for the distribution of ensemble spread. In section 3 we discuss the results of section 2’s
calculations, showing how the theoretical asymptotic limits of the spread-skill correlation
can vary greatly depending on which spread-skill metrics are used, and providing the re-
sults for the KK model as one particular case study. In section 4, we discuss two metrics
for assessing the utility of an ensemble’s temporally varying dispersion, which themselves
were generalized from the analysis provided in section 2. In section 5, we place our analysis in
the context of two particular EPS examples of spread and error using ensemble temper-
ature forecasts for a region of southwest USA, and ensemble river discharge forecasts for
Bangladesh.
2. Calculations
In this section we present calculations to simplify the linear correlation for four pairings
of continuous error and spread metrics. The purpose of these calculations is to simplify
these theoretical correlations to a point where the mathematical form of the asymptotic
limits become clear, as well as the dependencies dictating these limits. It is assumed
there are no sampling limitations and that the EPS perfect forecast assumption holds,
such that for a given forecast, there is an underlying PDF from which both individual
ensemble members and the associated observable (verification) are randomly drawn. As a
result, the expected error of an ensemble forecast is completely determined by this PDF,
and the theoretical form of error-spread correlation reduces to only the PDF moments.
To make these simplifications, without loss of generality (WLOG) we can introduce in
the equation for the Pearson correlation coefficient a calculation to replace the forecast
error with its expected value; and in the case of an EPS perfect forecast, the domain of
this calculation over all errors is equivalent to the forecast ensemble member PDF. This
replaces the error with its expected value, proportional to a measure of ensemble spread.
Similarly, WLOG, expectation value operations over all possible ensemble members are
also introduced.
2.1. Notation
The population of members of an ensemble forecast is represented by Ψ, with an indi-
vidual member (realization) represented by ψ. Similarly, for some measure of spread s,
we represent the population of ensemble forecasts, each with a value of s, as Σ. Consider
that Ψ could be viewed as the underlying (implied) PDF of an ensemble forecast at a par-
ticular time from which the ensemble members are randomly drawn. Likewise, Σ could
be viewed as representing the whole set of ensemble forecasts, each with an identifiable
value of associated ensemble spread, over all the times forecasts are generated.
Bra-ket expectation value notation is used for the expectation value of some quantity
A = A(ψ) over an ensemble population Ψ, which could be in terms of discrete variables
with probability density function P(ψ)

〈A(ψ)〉Ψ ≡ ∑ψ A(ψ)P(ψ),   (1)

or in terms of continuous variables with associated probability density function f(ψ)

〈A(ψ)〉Ψ ≡ ∫Ψ A(ψ)f(ψ) dψ.   (2)
The subscript (Ψ) on the brackets (〈·〉) specifies the population domain over which the
expectation is calculated. Similarly, we define the expectation value of A = A(s) over a
population of forecasts, each with defined ensemble spread s, as 〈A〉Σ, and we represent
the double expectation value of A = A(ψ, s) over both populations Ψ and Σ as 〈A〉Ψ,Σ.
In terms of expectation values, the Pearson correlation coefficient between a generic
spread (s) and error (ε) measure is given by
r = 〈(s − 〈s〉Σ)(ε − 〈ε〉Σ)〉Σ / [〈(s − 〈s〉Σ)²〉Σ〈(ε − 〈ε〉Σ)²〉Σ]^(1/2),   (3)
where the population domain over which the expectation (average) is calculated is the set
of ensemble forecasts Σ (with associated spread measure s). For further simplifications
as we will show below, for a given ensemble forecast with some measure of spread s, an
average can also be made over the possible realizations of the observable 〈·〉Ψo ; or over the
population of ensemble members Ψ(s) given by 〈·〉Ψ. Note by our perfect model definition,
〈·〉Ψo ≡ 〈·〉Ψ.
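As a concrete aside (a sketch of ours, not part of the paper), the finite-sample analogue of (3) can be computed directly from per-forecast series of spread and error; the series below are synthetic:

```python
import numpy as np

def spread_error_correlation(spread, error):
    """Sample Pearson correlation between per-forecast spread and error,
    the finite-sample analogue of the population expression in (3)."""
    s = np.asarray(spread, dtype=float)
    e = np.asarray(error, dtype=float)
    s_anom = s - s.mean()
    e_anom = e - e.mean()
    return (s_anom * e_anom).mean() / np.sqrt(
        (s_anom**2).mean() * (e_anom**2).mean())

# Synthetic example: error tends to grow with spread, plus noise
rng = np.random.default_rng(0)
spread = rng.lognormal(size=500)
error = spread * np.abs(rng.normal(size=500))
r = spread_error_correlation(spread, error)  # positive, but below 1
```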
2.2. Spread-error measures
The forecast member spread is often defined as the variance, standard deviation, mean
absolute difference of the ensemble members about the ensemble mean, or less commonly,
mean absolute difference of the ensemble members about a chosen ensemble member. In
addition, we include the 4th moment of the ensemble members about the mean, which
arises in the calculations. The forecast error of an ensemble forecast is often defined in
terms of the squared or absolute difference between the verification (observation) and
either any one ensemble member or the ensemble mean forecast. Symbolic notation for
these measures are given in Tables 1 and 2, respectively.
Arguably only certain of these error and spread measures are appropriately matched if
one wants to directly relate expected error to a measure of ensemble spread. Measures
that are naturally paired have a direct functional relationship relating forecast error to
forecast spread, and have the same moments (physical units). Of the measures presented
here, these pairings are: 1) the set of squared error measures with the variance as spread
measure; and 2) the set of absolute difference error measures with either the standard
deviation or mean absolute difference as spread measure. Although other error and spread
measures could also be used (e.g. rank probability skill score) to assess the forecast
spread-error relationship, arguably the useful information in the ensemble spread is that
it should be a statement about the expected error in the forecast, and these error and
spread measures directly make this connection.
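For illustration only (these helper functions are ours, not the paper's), the spread measures of Table 1 and error measures of Table 2 could be computed for a single ensemble forecast as:

```python
import numpy as np

# Sketch implementations of the spread measures (Table 1) and error
# measures (Table 2) for one ensemble forecast; key names are ours.
def spread_measures(members):
    mu = members.mean()
    return {
        "variance": ((members - mu)**2).mean(),          # ensemble variance
        "std_dev": np.sqrt(((members - mu)**2).mean()),  # standard deviation
        "mean_abs_diff": np.abs(members - mu).mean(),    # sabs about the mean
    }

def error_measures(members, obs):
    mu = members.mean()
    return {
        "sq_err_mean": (mu - obs)**2,           # squared error of ensemble mean
        "abs_err_mean": abs(mu - obs),          # absolute error of ensemble mean
        "sq_err_member": (members[0] - obs)**2, # squared error of one member
    }

forecast = np.array([19.8, 21.1, 20.4, 22.0, 20.7, 21.5])  # hypothetical members
observation = 21.3
s = spread_measures(forecast)
e = error_measures(forecast, observation)
```

Note how each error measure shares units (or squared units) with exactly one class of spread measure, which is the pairing argument made above.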
For reference, Table 3 shows how the expected values of the error measures ε (column
1) can be given in terms of measures of forecast spread s (column 2) for an EPS per-
fect forecast (i.e. one in which the observation ψo is equivalent to a random draw from
the forecast ensemble member PDF). These relationships are used in the calculations
below. WLOG, these relationships were derived by introducing an expectation value op-
eration over all possible observational states, and in some cases, over all possible ensemble
members. Column 3 of this table shows how the expected value of error corresponds to
either the standard deviation σψ or variance σ²ψ when the forecast ensembles are normally
distributed.
Figure 1 provides a schematic of the correlation coefficient simplification calculation.
Shown are six-member ensemble forecasts of a continuous variable ψ for three different
forecast times. The ensemble members are represented by the six thin black vertical lines,
with the implied PDF p(ψ; si), from which the members are sampled, given by the bell-
shaped curves. The PDF represents the forecast in the asymptotic limit of no sampling
limitations. The observations corresponding to the forecasts are shown by the vertical red
lines, with the ensemble mean given by the dashed vertical lines. Some measure of error
ε (shown here as a distance the observation is from the ensemble mean) for each forecast
is also shown, as is some measure of ensemble member spread s. In our calculations to
simplify the correlation between spread s and error ε, we replace the error by its expected
value, which can be calculated by performing a weighted integration of the observation
over all possible values. The result is that the expected value is proportional to a measure
of ensemble member spread:
〈ε〉Ψi = ∫Ψi(s) ε p(ψ; si) dψ ∝ si.   (4)
In practice, p(ψ; si) does not have to be explicitly given, and the relationship of the
expected value of the error to a measure of ensemble member spread can be shown either
through algebraic manipulation or by inspection (see Table 3 for examples).
In this example, the expected value of the error over all forecasts then is proportional
to:
〈ε〉Σ ∝ (1/n) ∑i si.   (5)
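A quick Monte Carlo check of this replacement (a sketch of ours, under the EPS perfect forecast assumption): for a Gaussian forecast PDF, the expected absolute error of the ensemble mean is √(2/π)σ, consistent with the normal-distribution entries of Table 3:

```python
import numpy as np

# Monte Carlo illustration of the replacement in (4), Gaussian case: if the
# observation is a random draw from a normal forecast PDF with mean mu and
# standard deviation sigma, the expected absolute error of the ensemble
# mean, E|mu - psi_o|, is sqrt(2/pi) * sigma.
rng = np.random.default_rng(1)
mu, sigma = 2.0, 1.5
psi_o = rng.normal(mu, sigma, size=1_000_000)  # observations drawn from the PDF
mc_estimate = np.abs(mu - psi_o).mean()
theory = np.sqrt(2.0 / np.pi) * sigma
```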
2.3. Correlation of sabs with ε|µ| and the correlation of σ²ψ with εµ²
In this section we simplify the correlations for two specific cases: 1) (sabs, ε|µ|) and 2)
(σ²ψ, εµ²). As seen in Table 3, these pairings are especially well matched since (for an EPS
perfect forecast) the expectation value of the error measure is the spread measure itself
(〈ε〉Ψo = s).
Left in terms of a generic ε and s for these two sets of spread-error measures, WLOG we
can introduce into (3) an expectation value 〈·〉Ψo over all possible states of the observation
within each expectation value of error 〈ε〉Σ over the population of forecasts Σ

r = 〈(s − 〈s〉Σ,Ψo)(ε − 〈ε〉Σ,Ψo)〉Σ,Ψo / [〈(s − 〈s〉Σ,Ψo)²〉Σ,Ψo〈(ε − 〈ε〉Σ,Ψo)²〉Σ,Ψo]^(1/2).   (6)
Noting that 〈s〉Ψo = 〈s〉Ψ = s and expanding,

r = 〈(s − 〈s〉Σ)(〈ε〉Ψo − 〈ε〉Σ,Ψo)〉Σ / [〈(s − 〈s〉Σ)²〉Σ〈(ε − 〈ε〉Σ,Ψo)²〉Σ,Ψo]^(1/2),   (7)

and using 〈ε〉Ψo = 〈ε〉Ψ = s,

r = 〈(s − 〈s〉Σ)(s − 〈s〉Σ)〉Σ / [〈(s − 〈s〉Σ)²〉Σ〈(ε − 〈s〉Σ)²〉Σ,Ψ]^(1/2),   (8)

so the correlation coefficient further simplifies to

r = √[(〈s²〉Σ − 〈s〉²Σ) / (〈ε²〉Σ,Ψ − 〈s〉²Σ)].   (9)
To simplify things further, we return to the specific metrics of cases 1) and 2). Simplifying for case 1), we have 〈ε²|µ|〉Ψo ≡ 〈|〈ψ〉Ψ − ψo|²〉Ψo = 〈(〈ψ〉Ψ − ψ)²〉Ψ ≡ σ²ψ by definition.
And for case 2), we have 〈εµ²〉Ψo ≡ 〈(〈ψ〉Ψ − ψo)²〉Ψo = 〈(〈ψ〉Ψ − ψ)²〉Ψ ≡ σ²ψ, again by
definition. In addition, for case 2), 〈ε²µ²〉Ψo ≡ 〈(〈ψ〉Ψ − ψo)⁴〉Ψo = 〈(〈ψ〉Ψ − ψ)⁴〉Ψ ≡ m4,
where m4 is the 4th moment about the mean 〈ψ〉Ψ defined in Table 1. Substituting into
(9) for cases 1) and 2) we have

r = √[(〈s²abs〉Σ − 〈sabs〉²Σ) / (〈σ²ψ〉Σ − 〈sabs〉²Σ)],   (10)

and

r = √[(〈(σ²ψ)²〉Σ − 〈σ²ψ〉²Σ) / (〈m4〉Σ − 〈σ²ψ〉²Σ)],   (11)

respectively, which are now dependent only on the moments of the ensemble member
spread.
To simplify (10) and (11) further, we would need to impose a requirement on the
distribution of the ensemble members holding for all forecasts, and specific to each case.
These requirements are: for case 1), sabs = βσψ; for case 2), m4 = α(σ²ψ)², where α and β
are constants determined by the PDF of the ensemble distribution. Note that normally-
distributed ensemble members satisfy the requirements for both of these cases, where for
case 1) β = √(2/π), and for case 2) α = 3.
Imposing these requirements on sabs (case 1) and on m4 (case 2), (10) and (11) become

r = β√[(1 − 〈σψ〉²Σ/〈σ²ψ〉Σ) / (1 − β²〈σψ〉²Σ/〈σ²ψ〉Σ)]   (12)

and

r = √[(1 − 〈σ²ψ〉²Σ/〈(σ²ψ)²〉Σ) / (α − 〈σ²ψ〉²Σ/〈(σ²ψ)²〉Σ)],   (13)

respectively.
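The simplified form (12) can be checked by simulation (our own sketch, under the stated assumptions: Gaussian members in the large-ensemble limit, so sabs = βσψ exactly, with σψ lognormally distributed across forecasts):

```python
import numpy as np

# Monte Carlo check of (12) under the EPS perfect forecast assumption:
# Gaussian ensemble members in the large-ensemble limit (sabs = beta*sigma,
# beta = sqrt(2/pi)), with sigma lognormally distributed across forecasts.
rng = np.random.default_rng(2)
n_forecasts = 400_000
sigma = rng.lognormal(mean=0.0, sigma=0.5, size=n_forecasts)
beta = np.sqrt(2.0 / np.pi)

spread = beta * sigma                    # sabs for each forecast
error = np.abs(rng.normal(0.0, sigma))   # eps_|mu| = |<psi> - psi_o|, mean 0

r_empirical = np.corrcoef(spread, error)[0, 1]

g1 = sigma.mean()**2 / (sigma**2).mean()  # governing ratio <sigma>^2/<sigma^2>
r_theory = beta * np.sqrt((1.0 - g1) / (1.0 - beta**2 * g1))
```

The empirical correlation lands on the theoretical curve, and stays below β = √(2/π) no matter how the spread distribution is chosen.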
2.4. Correlation of σ²ψ with εd²
For the case of (σ²ψ, εd²), we have

r = 〈(σ²ψ − 〈σ²ψ〉Σ)(〈εd²〉Ψo,Ψ − 〈εd²〉Σ,Ψo,Ψ)〉Σ / [〈(σ²ψ − 〈σ²ψ〉Σ)²〉Σ〈(εd² − 〈εd²〉Σ,Ψo,Ψ)²〉Σ,Ψo,Ψ]^(1/2),   (14)
where, WLOG, we have introduced an additional expectation value operation (〈·〉Ψ) over
the population of ensemble members (Ψ) (performed for each forecast, with specific σ²ψ
value). This was done in addition to the expectation value operation (〈·〉Ψo) over the
observation population (Ψo) introduced in the previous calculation.
Under the EPS perfect forecast assumption, we have 〈εd²〉Ψo,Ψ = 2(〈ψ²〉Ψ − 〈ψ〉²Ψ) = 2σ²ψ,
and the numerator simplifies to 2[〈(σ²ψ)²〉Σ − 〈σ²ψ〉²Σ]. Similarly, the denominator simplifies
to [(〈(σ²ψ)²〉Σ − 〈σ²ψ〉²Σ)(〈ε²d²〉Σ,Ψo,Ψ − 4〈σ²ψ〉²Σ)]^(1/2). Again, using the EPS perfect forecast
assumption, 〈ε²d²〉Ψo,Ψ ≡ 〈(ψ − ψo)⁴〉Ψo,Ψ = 2〈(ψ − 〈ψ〉Ψ)⁴〉Ψ + 6〈(ψ − 〈ψ〉Ψ)²〉²Ψ = 2m4 +
6(σ²ψ)². Putting this together, (14) simplifies to
r = √[(〈(σ²ψ)²〉Σ − 〈σ²ψ〉²Σ) / (〈m4〉Σ/2 + 3〈(σ²ψ)²〉Σ/2 − 〈σ²ψ〉²Σ)],   (15)

and the correlation coefficient is now given only in terms of the moments of the ensemble
member spread.
To simplify the relationship further, we would need to impose a requirement on the
distribution of the ensemble members holding for all forecasts. As done in the previous
section, if we impose m4 = α(σ²ψ)², where α is a proportionality constant, then substituting
for m4 in the denominator, combining, and simplifying, we get

r = √[(〈(σ²ψ)²〉Σ − 〈σ²ψ〉²Σ) / ((α + 3)〈(σ²ψ)²〉Σ/2 − 〈σ²ψ〉²Σ)].   (16)
For normally distributed ensembles α = 3, and we derive the same result as given in the
previous section for (σ²ψ, εµ²) (case 2).
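The two identities used above can be verified numerically (an illustrative sketch, not part of the paper): for independent draws ψ and ψo from the same Gaussian PDF, E[(ψ − ψo)²] = 2σ² and E[(ψ − ψo)⁴] = 2m4 + 6σ⁴ = 12σ⁴:

```python
import numpy as np

# Numerical check of the identities used in simplifying (14): under the EPS
# perfect forecast assumption, psi and psi_o are independent draws from the
# same PDF, so E[(psi - psi_o)^2] = 2*sigma^2 and, for Gaussian members,
# E[(psi - psi_o)^4] = 2*m4 + 6*sigma^4 = 12*sigma^4 (since m4 = 3*sigma^4).
rng = np.random.default_rng(3)
sigma, n = 1.3, 1_000_000
psi = rng.normal(0.0, sigma, size=n)     # an ensemble member
psi_o = rng.normal(0.0, sigma, size=n)   # the observation, an independent draw
d2 = (psi - psi_o)**2
mean_d2 = d2.mean()        # ~ 2 * sigma**2
mean_d4 = (d2**2).mean()   # ~ 12 * sigma**4
```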
2.5. Correlation of σψ and ε|µ|
Finally, we consider the case of (σψ, ε|µ|), given by

r = 〈(σψ − 〈σψ〉Σ)(ε|µ| − 〈ε|µ|〉Σ)〉Σ / [〈(σψ − 〈σψ〉Σ)²〉Σ〈(ε|µ| − 〈ε|µ|〉Σ)²〉Σ]^(1/2).   (17)
To simplify this expression, we expand the denominator noting that ε|µ|·ε|µ| = εµ², WLOG
introduce an expectation value operation over the possible observational states (〈·〉Ψo), and
use 〈ε|µ|〉Ψo ≡ 〈|〈ψ〉Ψ − ψo|〉Ψo = sabs and 〈εµ²〉Ψo ≡ 〈(〈ψ〉Ψ − ψo)²〉Ψo = σ²ψ by the EPS
perfect forecast assumption. Doing so, (17) simplifies to

r = 〈(σψ − 〈σψ〉Σ)(sabs − 〈sabs〉Σ)〉Σ / [(〈σ²ψ〉Σ − 〈σψ〉²Σ)(〈σ²ψ〉Σ − 〈sabs〉²Σ)]^(1/2),   (18)

or

r = (〈σψsabs〉Σ − 〈σψ〉Σ〈sabs〉Σ) / [(〈σ²ψ〉Σ − 〈σψ〉²Σ)(〈σ²ψ〉Σ − 〈sabs〉²Σ)]^(1/2),   (19)

and again, the correlation coefficient is given only in terms of moments of the ensemble
member spread.
To simplify the relationship for the correlation coefficient further, we impose the same
requirement on the distribution of the ensemble members holding for all forecasts as was
done with (sabs, ε|µ|) above, namely sabs = βσψ (which applies for normally-distributed
ensemble members, with β = √(2/π)). Using this, we obtain

r = β√[(1 − 〈σψ〉²Σ/〈σ²ψ〉Σ) / (1 − β²〈σψ〉²Σ/〈σ²ψ〉Σ)],   (20)

which is identical to the result for (sabs, ε|µ|).
3. Results of correlation analysis
One focus of this paper has been to assess the limited utility of the linear spread-error
correlation as a verification measure from a theoretical perspective. In the process of doing
so, we have clarified the dependencies of the correlation through calculations performed
under the assumptions of an EPS perfect forecast (i.e. the observation is statistically
indistinguishable from any one ensemble member) for different combinations of continuous
spread and error measures and in the case of no sampling limitation (i.e. large ensemble
size). Tables 4 and 5 show results of these calculations, and from these we make the
following points:
(1) The spread-error correlation can be simplified to forms no longer explicitly dependent
on the error metric, but dependent only on different moments of the ensemble member
distribution, and on what the average value (i.e. expectation value) of these moments is
over the forecast verification set. This can be seen in column 2 of Table 4, for different
combinations of spread (s) and error (ε) measures. To clarify, none of these simplifica-
tions explicitly depend on either how the ensemble members are distributed, or how the
varying spread metric (moments) of these distributions are distributed themselves. The
dependence is instead implicit, by virtue of what the average value of these moments is
when averaged over the set of all forecasts used in the verification.
(2) Because, even for a "perfect" forecast, the correlation remains dependent on at-
tributes of the ensemble member distribution, these dependencies cloud the ability of
the spread-error correlation to provide a diagnostic of EPS performance for an imperfect
model. One would rather hope for a verification metric to at least be asymptotically-
constant (e.g. value of 1.0) when tested with perfect model results. Further dependence
of ensemble size on the correlation’s value further clouds this metric’s utility (see Grimit
and Mass 2007 and Kolczynski et al. 2011 for a numerical studies of this issue). Although
the variability of ensemble member spread over a verification set could be indicative of
EPS performance, such variability also could depend on the stability properties of the
environmental system being modeled. In particular, if the system being modeled is in a
very stable regime, then one may expect that the distribution of ensemble spreads would
be relatively narrow, and as we argue below, this would lead to a very different result for
r than if the system samples a variety of stable/unstable states (i.e. a large "spread" in
the ensemble spreads). More to the point, one would hope that for a perfect model, a
measure of forecast performance such as r would be a fixed value, and not depend on the
inherent properties of the system the forecast is trying to model.
(3) If further constraints are placed on the relationship between the moments of the
ensemble member distribution (column 3 of Table 4), then further simplifications can be
made on the form of the correlation (column 4, Table 4), reducing to only three forms for
the six combinations considered in Table 4. For the metrics with the same units as the
weather variable itself, with the constraint that sabs = βσψ and β is some constant, this
is given by
r = β√[(1 − 〈σψ〉²Σ/〈σ²ψ〉Σ) / (1 − β²〈σψ〉²Σ/〈σ²ψ〉Σ)].   (21)
For the two squared metrics in the table, with the constraint that m4 = α(σ²ψ)² and α is
some constant, the two correlation expressions are

r = √[(1 − 〈σ²ψ〉²Σ/〈(σ²ψ)²〉Σ) / (α − 〈σ²ψ〉²Σ/〈(σ²ψ)²〉Σ)]   (22)

and

r = √[(1 − 〈σ²ψ〉²Σ/〈(σ²ψ)²〉Σ) / ((α + 3)/2 − 〈σ²ψ〉²Σ/〈(σ²ψ)²〉Σ)].   (23)
More specifically, if the ensemble member distribution is normally-distributed (satisfying
β = √(2/π) and α = 3), the theoretical form of the correlation is given in column 2, Table
5, which reduces to two forms for the metrics considered. For the metrics with the same units
as the weather variable itself, this is given by

r = √(2/π) √[(1 − 〈σψ〉²Σ/〈σ²ψ〉Σ) / (1 − (2/π)〈σψ〉²Σ/〈σ²ψ〉Σ)].   (24)
For the squared metrics, the correlation is

r = √[(1 − 〈σ²ψ〉²Σ/〈(σ²ψ)²〉Σ) / (3 − 〈σ²ψ〉²Σ/〈(σ²ψ)²〉Σ)].   (25)
What can be seen, then, is that depending on what paired metric definitions are used,
one can get different correlations for the same EPS forecasts, and along with this, different
values for the correlations’ upper bounds, as shown below. This, then, would allow one
to artificially increase or decrease the spread-error correlation through optimal choice of
metric depending on the result desired.
(4) Examining the more general (21)-(23), and (24)-(25) specific to normally-distributed
ensembles, one can see there are two governing ratios (g) that determine the value of the
correlation. For the metrics with the same units as the weather variable itself (rows 1
through 4 of Table 5), the ratio is

g1 = 〈σψ〉²Σ/〈σ²ψ〉Σ = 〈σψ〉²Σ/[〈σψ〉²Σ + var(σψ)]   (26)
where var(·) represents the variance. For the squared metrics (rows 5 through 6 of Table
5), the governing ratio is

g2 = 〈σ²ψ〉²Σ/〈(σ²ψ)²〉Σ = 〈σ²ψ〉²Σ/[〈σ²ψ〉²Σ + var(σ²ψ)].   (27)
Consider the situation where the EPS consistently generates a probabilistic forecast
with similar ensemble member dispersion from one forecast to the next. In the limit as
the change in the dispersion vanishes, both var(σψ) → 0 and var(σ²ψ) → 0, and g → 1 in
both (26) and (27). As a result, r → 0 in (21)-(25).
In the other extreme limit, as the EPS generates an (infinitely-) wide range of ensemble
dispersion, both var(σψ) → ∞ and var(σ²ψ) → ∞, and g → 0 in both (26) and
(27). As a result, r → β in (21), r → √(1/α) in (22), and r → √(2/(α + 3)) in (23).
For normally-distributed ensemble members, r → √(2/π) in (24), and r → √(1/3) in (25).
Figure 2 provides a graphic illustration of how r varies as a function of 〈σψ〉²Σ/〈σ²ψ〉Σ and
〈σ²ψ〉²Σ/〈(σ²ψ)²〉Σ for normally-distributed ensemble members.
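The limiting behavior in point (4) can be made concrete with a small numeric sketch of (24) and (25) (the function names are ours):

```python
import numpy as np

def r_abs(g1):
    """Eq. (24): correlation for the metric pairs sharing the units of the
    forecast variable, as a function of g1 = <sigma>^2/<sigma^2>."""
    return np.sqrt((2.0 / np.pi) * (1.0 - g1) / (1.0 - (2.0 / np.pi) * g1))

def r_sq(g2):
    """Eq. (25): correlation for the squared metric pairs, as a function
    of g2 = <sigma^2>^2/<(sigma^2)^2>."""
    return np.sqrt((1.0 - g2) / (3.0 - g2))

# Upper limits (g -> 0, infinitely variable dispersion) and lower limits
# (g -> 1, uniform dispersion):
upper_abs, upper_sq = r_abs(0.0), r_sq(0.0)  # sqrt(2/pi), sqrt(1/3)
lower_abs, lower_sq = r_abs(1.0), r_sq(1.0)  # both 0
```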
(5) The more general results in Tables 4 and 5 compare well with past numeric results
in the literature. Barker (1991) examined the correlation between the ensemble variance
(s; row 1, Table 1) and the square error of any one ensemble member (ε; row 2, Table
2) using geopotential height anomalies from extended range forecasts. He numerically
generated a maximum correlation value of 0.58, which is the same result we derive in row
6, Table 5 (√(1/3) ≈ 0.58).
Also consider a specific distribution for the standard deviation σψ of the ensemble
member spread. If the possible values of σψ over the forecasts of interest are lognormally
distributed, then r takes on the specific form given in column 5 of Table 5. Modified ver-
sions of the lognormal distribution for σψ were presented earlier by KK. This distribution
is given by

f(σψ) = [1/(σψσΣ√(2π))] exp[−(ln σψ − ln σψM)² / (2σ²Σ)],   (28)
where σΣ is the standard deviation of the distribution of ln(σψ), and σψM is the median
value of σψ. (Note: for the lognormal distribution, the mean 〈σψ〉Σ and median σψM are
not identical but are related by 〈σψ〉Σ = σψM exp(σ²Σ/2).) For specified values of σψM and
σΣ, values of σψ can be derived from ln(σψ) = N(ln(σψM), σΣ), where N(γ, δ) represents
a random draw from a Normal distribution with mean γ and standard deviation δ. For
normally-distributed ensemble members, with spread metric σψ and error metric ε|µ|, with
σψ lognormally distributed, we then have the same case explored by Houtekamer (1993),
Whitaker and Loughe (1998), and Grimit and Mass (2007). For this case, the governing
ratio simplifies to

g = 〈σψ〉²Σ/〈σ²ψ〉Σ = exp(−σ²Σ),   (29)
and the correlation simplifies to the expression in column 5, row 2 of Table 5, which
itself duplicates (33) of Houtekamer (1993). Note, however, that defining the specific
distribution of the ensemble member spread is not important to determining the limiting
behavior of the correlation, which for this case is given by column 2, row 2 of Table 5,
with correlation limits of [0, √(2/π)] ≈ [0, 0.80]. This same limit was numerically estimated
by Houtekamer (1993), Whitaker and Loughe (1998), and Grimit and Mass (2007).
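The lognormal reduction (29) is easy to confirm by sampling (an illustrative check of ours; the parameter values are arbitrary):

```python
import numpy as np

# Sampling check of (29): when ln(sigma_psi) ~ N(ln(sigma_psi_M), sigma_Sigma),
# the governing ratio g = <sigma_psi>^2 / <sigma_psi^2> reduces to
# exp(-sigma_Sigma^2), independent of the median sigma_psi_M.
rng = np.random.default_rng(4)
sigma_psi_M, sigma_Sigma = 0.8, 0.6
sigma_psi = np.exp(rng.normal(np.log(sigma_psi_M), sigma_Sigma, size=1_000_000))

g_sampled = sigma_psi.mean()**2 / (sigma_psi**2).mean()
g_theory = np.exp(-sigma_Sigma**2)
# Mean/median relation quoted in the text: <sigma_psi> = sigma_psi_M * exp(sigma_Sigma^2/2)
mean_theory = sigma_psi_M * np.exp(sigma_Sigma**2 / 2.0)
```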
4. Two aspects of the variation of ensemble dispersion
In this section we argue that there are two aspects of an ensemble’s variation in dis-
persion that should be assessed. The first aspect is: do the day-to-day variations in the
dispersion of an ensemble forecast relate to day-to-day variations in the expected forecast
error? The second aspect is: is there enough variability in the EPS dispersion to
justify the expense of generating the ensemble? We respectively address each of these
aspects in turn below.
We have argued in the previous section that the Pearson correlation does not provide a
definitive tool to assess the reliability of the ensemble spread-error relationship due to the
fact that even for an EPS perfect forecast, the correlation can vary widely by virtue of
its dependence on factors other than exclusive properties of EPS forecast performance.
However, this does not necessarily mean that the correlation does not still have utility in
answering this question, to which we will return below.
Because of the correlation’s deficiencies, Wang and Bishop (2003) suggested creating bins
of the spread measure of choice (in their case, ensemble variance), and then averaging the
corresponding error metrics (e.g. square error of the ensemble mean) over these bins
to remove statistical noise. After this bin-averaging, properly matched spread and error
measures should then equate (with the removal of observation error), and a perfect EPS
forecast should therefore produce points lying along a 45 degree line. As the variations
in an ensemble’s dispersion become less informative, the slope of this curve (binned error
versus binned spread) becomes more horizontal. However, as visually informative as
this approach can be, ambiguities in the EPS’s error-spread reliability can arise due to
ambiguities in the sufficient number of bins and number of points in each bin required for
this test, especially for small verification data sets. Similarly, Wang and Bishop (2003)
also argued that the rate at which the binned error metric becomes noisier as bin size (thus
sample size) decreases, and the degree of kurtosis in the binned sample of errors, both
provide measures of the accuracy in the EPS error variation prediction. However, both of
these latter two approaches rely on an assumption of Gaussianity for proper interpretation.
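A minimal sketch of this binned comparison (our own construction; the equal-count binning on sorted spread is an assumption, not necessarily Wang and Bishop's exact procedure):

```python
import numpy as np

# Bin forecasts by ensemble variance, average the squared error of the
# ensemble mean within each bin, and compare against the bin-mean spread.
# For a well-calibrated EPS the points fall near the 1:1 line.
def binned_spread_error(spread, error, n_bins=10):
    """Return (bin-mean spread, bin-mean error), binned on sorted spread."""
    order = np.argsort(spread)
    s, e = np.asarray(spread)[order], np.asarray(error)[order]
    bins_s = np.array([chunk.mean() for chunk in np.array_split(s, n_bins)])
    bins_e = np.array([chunk.mean() for chunk in np.array_split(e, n_bins)])
    return bins_s, bins_e

# Synthetic perfect-EPS check: squared error (mu - psi_o)^2 with psi_o ~ N(mu, sigma)
rng = np.random.default_rng(5)
var = rng.lognormal(size=50_000)            # ensemble variances across forecasts
err = rng.normal(0.0, np.sqrt(var))**2      # squared errors of the ensemble mean
bs, be = binned_spread_error(var, err)      # be tracks bs along the 1:1 line
```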
An alternative to the Wang and Bishop approach that produces a single scalar measure of EPS
error-spread reliability, and requires no distributional assumptions, can be created from the
Pearson correlation r. Benefits of single scalar metrics are that they can better leverage
limited verification data sets, they can often provide a more objective metric for assessing
EPS performance as compared to, say, graphical assessments, and they can more easily
lend themselves to constructing confidence bounds. This alternative can be constructed by
reframing r relative to a perfect EPS forecast in the context of a skill score (Wilks, 1995).
Note that although skill scores need to be used with care since they can be improper
in certain contexts (Gneiting and Raftery 2007; Murphy 1973), they can still provide a
useful relative measure of forecast system improvement. A candidate for an error-spread
Pearson correlation skill score SSr is
SSr = (rforc − rref)/(rperf − rref), (30)
where rforc is the EPS spread-error correlation, rref is that of a reference forecast, and
rperf is that for a perfect EPS forecast. For the correlation’s spread and error metrics we use the standard deviation of the ensemble (σψ) and the absolute error of the ensemble mean (ε|µ|), respectively. If we take the no-skill forecast as the reference forecast, such that rref = 0, then SSr simplifies to
SSr = rforc/rperf, (31)
For simplicity, we could also assume the perfect EPS forecast has approximately normally-distributed ensemble members, such that rperf is given by (24) above.
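A minimal sketch of computing SSr under these choices (illustrative only; the function name and synthetic data are hypothetical, and rperf is taken from the normally-distributed perfect-EPS form of Table 5 for the (σψ; ε|µ|) pairing rather than from an empirical perfect-model resampling):

```python
import numpy as np

def spread_error_skill_score(spread, abs_error):
    """SSr = r_forc / r_perf of (31), taking r_perf from the Gaussian
    perfect-EPS form of Table 5 for the (sigma_psi; eps_|mu|) pairing:
        r_perf = sqrt(2/pi) * sqrt((1 - g) / (1 - (2/pi) * g)),
    with governing ratio g = <sigma>^2 / <sigma^2>."""
    r_forc = np.corrcoef(spread, abs_error)[0, 1]
    g = spread.mean() ** 2 / np.mean(spread ** 2)
    r_perf = np.sqrt(2.0 / np.pi) * np.sqrt((1.0 - g) / (1.0 - (2.0 / np.pi) * g))
    return r_forc / r_perf

# Synthetic well-calibrated case: absolute ensemble-mean errors drawn
# consistently with the (lognormally distributed) spread, so SSr ~ 1.
rng = np.random.default_rng(1)
sigma = rng.lognormal(0.0, 0.5, size=5000)
abs_err = np.abs(rng.normal(0.0, sigma))
ss_r = spread_error_skill_score(sigma, abs_err)
```

An SSr near one indicates the forecast spread discerns its own error variations about as well as a perfect EPS could; an SSr near zero indicates no usable error-spread relationship.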
A second, and perhaps more essential, aspect of an ensemble’s variation in dispersion
that should be assessed is whether there is enough variability in the dispersion to begin
with to justify the generation of an expensive ensemble, irrespective of whether the EPS
spread-error relationship is reliable or not. Implicitly, both Wang and Bishop (2003)
and Grimit and Mass (2007) also examined this issue in the context of the binned error
and spread metric comparison approach discussed above. Wang and Bishop used the y-
axis range as a metric (binned error metric variation); while after applying an analogue
calibration approach to each bin, Grimit and Mass used gains in the rank probability
(RPS) skill score as a gauge (where the RPS of a fixed ensemble-mean error climatology
was used as a reference). However, the former approach does not provide a normalized metric (and thus retains sensitivity to the scale of the units). Moreover, neither of these approaches isolates the degree of variability in the ensemble’s native dispersion, because both the EPS’s accuracy in discerning error variability and the choice of bin size cloud this issue.
One possible metric for measuring the degree of variability in the ensemble’s native dispersion is to utilize the "governing ratios" g presented above, but in the context of a skill score, as was done with the correlation coefficient for EPS error-spread reliability assessment. Because g is calculated using only the moments of the ensemble member set, it focuses on the EPS’s potential to produce dispersion variability. In terms of the "governing ratio" skill score SSg, we have
SSg = (gforc − gref)/(gperf − gref), (32)
where gforc is the EPS governing ratio, gref is that of a reference forecast, and gperf is
that for a perfect forecast. Considering only the governing ratio g1 of (26), taking gref = 1 (i.e. no dispersion variability) and gperf = 0 (i.e. extremely large dispersion variability), and simplifying, we then have
SSg = 1 − gforc = (〈σψ2〉Σ − 〈σψ〉Σ2)/〈σψ2〉Σ = var(σψ)/(〈σψ〉Σ2 + var(σψ)), (33)
where var(σψ) represents the variance of the ensemble member standard deviation over
the verification data set. SSg can be viewed as a normalized, or relative, measure of how
much variability there is in the ensemble’s day-to-day dispersion as compared to the mean, or average, amount of this dispersion.
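As a sketch, SSg of (33) reduces to a few lines of code (the function name and example arrays are illustrative):

```python
import numpy as np

def dispersion_variability_score(spread):
    """SSg = var(sigma_psi) / (<sigma_psi>^2 + var(sigma_psi))  (eq. 33):
    0 for constant spread, approaching 1 as the day-to-day dispersion
    variability dominates the mean dispersion."""
    m = spread.mean()
    v = spread.var()
    return v / (m ** 2 + v)

constant_spread = np.full(100, 2.0)          # no dispersion variability
variable_spread = np.array([0.1, 5.0] * 50)  # strongly varying dispersion
ss_constant = dispersion_variability_score(constant_spread)
ss_variable = dispersion_variability_score(variable_spread)
```

Because SSg uses only moments of the ensemble standard deviation, it can be computed from the forecasts alone, with no verifying observations required.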
5. EPS examples
In this section we show two examples of EPS forecasts to highlight some of the points
made above. The first example EPS produces ensembles from a mixture of WRF and
MM5 mesoscale models, using a variety of different initial conditions, outer boundary conditions, and physics packages (Liu et al. 2007), post-processed with a quantile regression approach (Hopson et al. 2010) to produce a calibrated 30-member ensemble,
although in this paper we use a 19-member subset. The ensemble generates gridded
temperature forecasts over the Dugway Proving Grounds of the Army Test and Evaluation
Command (ATEC) outside Salt Lake City, Utah. Figure 3 shows time-series and rank
histograms of this EPS out-of-sample verification set. Panel 3a shows a subset time-series of 3-hr lead-time sorted ensembles (colored lines) downscaled to a meteorological station over the ATEC range, along with the station observations (black line), while 3b shows the out-of-sample post-processed
results. Panels 3c and 3d show rank histograms of the same forecasts, respectively, with
the red dashed lines in the figure showing 95% confidence bounds on the histograms
(for which we could expect approximately one bin to lie outside of these bounds for a
perfectly-calibrated 19-member ensemble). From the rank histograms we see significant
under-dispersion (U-shaped) in the pre-processed forecasts, but near-perfect calibration
in the post-processed ensemble member set. Panels 3e - 3h show results for 36-hr forecasts
with similar conclusions concerning pre- and post-processed forecasts’ under- and near-
perfect dispersion, respectively, as for the 3-hr forecasts.
Figure 4 shows our second example of EPS forecasts. In this figure is shown ensem-
ble streamflow forecasts (colored lines) for the Brahmaputra river at the Bahadurabad
gauging station within Bangladesh of the Climate Forecast Applications in Bangladesh
(CFAB) project for years 2003 - 2007 (Hopson and Webster 2010), along with observed
streamflow from the Bangladesh Flood Forecasting and Warning Centre (FFWC; black
line). Panels a) and e) show time-series of sorted 51-member multi-model forecasts of river
flow at 1- and 10-day lead-times, respectively. These forecasts were generated by using
ensemble weather forecasts from the European Centre for Medium-Range Weather Forecasts (ECMWF) 51-member Ensemble Prediction System (EPS) (Molteni et al. 1996),
near-real-time satellite-derived precipitation products from the NASA Tropical Rainfall
Measuring Mission (TRMM; Huffman et al. 2005, 2007) and the NOAA CPC morph-
ing technique (CMORPH; Joyce et al. 2004), a GTS-NOAA rain gauge product (Xie
et al. 1996), and near-real-time river flow estimates from the FFWC. Panels b) and f)
show the respective post-processed results of these forecasts, where a k-nearest-neighbor
analogue approach (KNN) was used for this application. Panels c) and d) show the re-
spective pre- and post-processed rank histograms and 95% confidence bounds (for which
we could expect approximately three bins to lie outside of these bounds for a perfectly-
calibrated 51-member ensemble) for the 1-day lead-time forecasts, and panels g) and h)
show the same but for the 10-day forecasts. As with our first example, from the rank
histograms we see significant under-dispersion (U-shaped) in the pre-processed forecasts,
but near-perfect calibration in the post-processed ensemble member set.
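The rank histograms and flatness bounds used in both examples can be sketched as follows (illustrative; the 95% bounds here use a simple pointwise binomial normal approximation, which may differ in detail from the bounds used in the figures, and the synthetic data are our own):

```python
import numpy as np

def rank_histogram(ens, obs):
    """Counts of the observation's rank within each day's ensemble
    (ranks 0 .. n_members, i.e. n_members + 1 bins)."""
    n_days, n_members = ens.shape
    ranks = (ens < obs[:, None]).sum(axis=1)
    return np.bincount(ranks, minlength=n_members + 1)

def flatness_bounds(n_days, n_bins, z=1.96):
    """Pointwise ~95% bounds on each bin count under the flat
    (well-calibrated) hypothesis, via a normal approximation to the
    binomial(n_days, 1/n_bins) count."""
    p = 1.0 / n_bins
    centre = n_days * p
    half = z * np.sqrt(n_days * p * (1.0 - p))
    return centre - half, centre + half

# Synthetic calibrated 19-member ensemble: the observation is one more
# draw from the ensemble's own distribution, so the histogram is flat
# up to sampling noise, with roughly 5% of bins outside the bounds.
rng = np.random.default_rng(2)
n_days, n_members = 1000, 19
sigma = rng.lognormal(0.0, 0.3, n_days)
ens = rng.normal(0.0, sigma[:, None], (n_days, n_members))
obs = rng.normal(0.0, sigma)

counts = rank_histogram(ens, obs)
lo, hi = flatness_bounds(n_days, n_members + 1)
n_outside = int(np.sum((counts < lo) | (counts > hi)))
```

With 20 bins at the 95% level we expect about one bin outside the bounds for a calibrated 19-member ensemble, and about three of 52 bins for a calibrated 51-member ensemble, consistent with the expectations quoted above.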
Utilizing the CFAB EPS 10-day lead-time streamflow forecasts post-processed with a
KNN algorithm, we examine the concepts discussed in section 3. Figure 5 presents scatter
plots of ensemble error versus spread using the metric pairings shown in Tables 4 and 5.
The black dots are the actual error-spread data. The blue dots are calculated by treating
the CFAB forecasts as if they were derived from an EPS perfect forecast, which is practically
done here by each day randomly choosing one member to represent the verification from
the set of 51 ensemble forecast members plus the observation, with the remaining 51
unchosen "members" treated as the ensemble forecast. Linear fits to both actual and
perfect model data sets are included (black and blue lines, respectively). In the upper
right corner of each panel are the following correlation values for the error-spread data: "ensemble r", derived from the actual forecast metrics (black dots); "perf. model" r, derived from the EPS perfect forecast metrics (blue dots); "perf. gaussian" r, derived from the actual forecasts’ moments but using the theoretical form for normally-distributed EPS perfect forecast ensemble members (column 2, Table 5); and "theor. up. lim.", the theoretical maximum value the correlation can attain for normally-distributed ensembles (column 4, Table 5).
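The perfect-model construction used for the blue dots can be sketched as follows (illustrative; function and variable names, and the synthetic data, are our own):

```python
import numpy as np

def perfect_model_resample(ens, obs, rng):
    """For each forecast day, pool the ensemble members with the
    observation, draw one pooled value at random to act as the
    verification, and keep the remaining values as the ensemble."""
    n_days, n_members = ens.shape
    pool = np.concatenate([ens, obs[:, None]], axis=1)  # (n_days, n_members + 1)
    pick = rng.integers(0, n_members + 1, size=n_days)
    new_obs = pool[np.arange(n_days), pick]
    keep = np.ones_like(pool, dtype=bool)
    keep[np.arange(n_days), pick] = False
    new_ens = pool[keep].reshape(n_days, n_members)
    return new_ens, new_obs

rng = np.random.default_rng(3)
ens = rng.normal(size=(50, 51))   # 50 days of a 51-member ensemble
obs = rng.normal(size=50)
new_ens, new_obs = perfect_model_resample(ens, obs, rng)
```

Because the resampled verification is, by construction, statistically indistinguishable from the resampled ensemble members, the resulting spread-error statistics estimate what the forecast system would achieve if it were its own truth.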
In Figure 5 notice the positive slope to both actual and perfect model data in each panel,
such that as the spread increases, the error also is more likely to be larger. But also notice
that even for large spread values of either the perfect model (blue dots) or actual forecast
data (black dots), the error can be very small, and as such the correlation is not (cannot
be) perfect (i.e. 1.0), as shown by both the "ensemble r" and "perf. model" r values ranging
from [.21, .29] and [.22, .27], respectively. The similarity of the actual and perfect model
ranges also shows that the KNN post-processing algorithm appears to have produced
well-calibrated ensembles with respect to the error-spread relationship. Also notice that
the "perf. gaussian" r values are quite close to the "perf. model" r values, showing that the normally-distributed ensemble member assumption is a good approximation for this data set, and thus could provide a much simpler theoretical r value to calculate (column 2, Table 5) than the method used to generate "perf. model" r discussed above. But also note
that the actual and perfect model values are well below the theoretical maximum values they could attain of √(2/π) ≈ .80 (panels a - d) and √(1/3) ≈ .58 (panels e - f), respectively, showing that the data’s "governing ratios" (column 3, Table 5) are not at their minimum. Finally, and non-intuitively, notice the almost identical values of all the respective actual forecast correlations, even though the theoretical maximum value of panels a - d is very different from that of panels e - f.
6. Conclusions
There clearly is a need to verify the value of the 2nd moment of ensemble forecasts: if,
for a particular forecast, the forecast ensemble spread is large or small, does this mean
the forecast skill is diminished or increased, respectively? This paper has argued that the
Pearson correlation coefficient r of forecast spread and error is not a good verification
measure to directly test this relationship between ensemble spread and skill, since it
depends on factors other than just forecast model performance.
The important point here is that the forecast model’s correlation coefficient can take on a wide range of values even for a perfectly-calibrated model. What this correlation is
could depend on an inherent property of the EPS (such as its resolution), but it could
also depend on the variety of states available to the physical system being modeled,
completely irrespective of the forecast model’s performance. Given this latter dependence,
we argue that the spread-skill correlation is not an adequate verification gauge of how well
a variation in ensemble spread forecasts a change in forecast certainty.
These ideas were examined in the context of ensemble temperature forecasts for Utah
and for streamflow forecasts for the Brahmaputra River. It was shown that even for a
perfect model, r depends on how one defines forecast spread and forecast skill (error); and
in Tables 4 and 5 of the previous section we also showed how the spread-error correlation r
for a variety of different measures of spread and error was dependent on higher moments
of the distribution of the ensemble spreads, which themselves should be dependent on
the stability properties of the modeled system during the period the forecasts are being
verified (among other factors). In particular, we showed that under certain conditions,
the correlation depends on the ratio of how much the forecast spread varies from forecast
to forecast compared to its mean value of spread,
〈s〉2/〈s2〉 = 〈s〉2/[〈s〉2 + var(s)], (34)
where s is some measure of forecast ensemble spread, 〈s〉 its mean value, and var(s) =
〈(s − 〈s〉)2〉 represents its variance. As this ratio approaches zero, the skill-spread correlation asymptotes to its upper value of √(2/π) or √(1/3), depending on how the skill and spread measures are defined. These theoretical results validate and generalize some of the previous numerical and theoretical findings of Barker (1991) and Houtekamer (1993), in
particular (see section 2).
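A small Monte Carlo sketch of this limiting behavior (illustrative; the lognormal spread distribution matches column 5 of Table 5, but the sample size, seed, and names are our own): as the governing ratio 〈s〉2/〈s2〉 falls toward zero, the empirical spread-error correlation climbs toward its √(2/π) ceiling, while for nearly constant spread it falls toward zero.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000

def perfect_model_corr(sigma_log_std):
    """Empirical correlation between spread sigma_psi and the absolute
    ensemble-mean error for a perfect EPS whose spread is lognormally
    distributed with the given log standard deviation (sigma_Sigma)."""
    sigma = rng.lognormal(0.0, sigma_log_std, n)
    abs_err = np.abs(rng.normal(0.0, sigma))
    return np.corrcoef(sigma, abs_err)[0, 1]

r_small = perfect_model_corr(0.05)  # nearly constant spread: g -> 1
r_large = perfect_model_corr(1.0)   # strongly varying spread:  g -> 0
# r_small sits near zero; r_large approaches, but cannot exceed,
# the sqrt(2/pi) ceiling.
```

Both simulated systems are perfectly calibrated; only the variety of available spread states differs, which is exactly why r alone cannot gauge forecast system quality.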
Because r is strongly dependent on factors other than just the skill of the forecast
system, we argue that r is an unreliable verification measure of whether changes in forecast
skill can be associated with changes in ensemble forecast spread. To meet the clear need
of a measure that can objectively test the usefulness of the variability of the forecast
ensemble spread, we propose in the second part of this paper three alternatives to the
skill-spread correlation. In particular, if there is no usefulness in this "2nd moment" of an ensemble forecast, then one might lose little benefit (and possibly gain) by using hindcasts to calculate a much less expensive invariant "climatological" error distribution (Leith 1974; Atger 1999), or fit a simple heteroscedastic error model (i.e. error variance
that depends on the magnitude of the variable) to use in conjunction with the ensemble
mean or control member forecast instead of using the full suite of forecast ensembles
themselves.
References
Atger, F., 1999: The Skill of Ensemble Prediction Systems. Mon. Wea. Rev., 127, 1941–
1953.
Barker, T. W., 1991: The relationship between spread and forecast error in extended-range
forecasts. J. Climate, 4, 733–742.
Buizza, R., 1997: Potential forecast skill of ensemble prediction and spread and skill
distributions of the ECMWF ensemble prediction system. Mon. Wea. Rev., 125, 99–
119.
Gneiting, T., and A. E. Raftery, 2007: Strictly Proper Scoring Rules, Prediction, and Es-
timation, J. Amer. Stat. Assoc., 102(477), 359–378, doi:10.1198/016214506000001437.
Grimit, E. P., and C. F. Mass, 2007: Measuring the Ensemble Spread-Error Relationship
with a Probabilistic Approach: Stochastic Ensemble Results. Mon. Wea. Rev., 135,
203–221.
Hopson, T. M., and P. J. Webster, 2010: Operational flood forecasting for Bangladesh
using ECMWF ensemble weather forecasts. J. Hydrometeor., 11, 618–641.
Hopson, T., J. Hacker, Y. Liu, G. Roux, W. Wu, J. Knievel, T. Warner, S. Swerdlin, J.
Pace and S. Halvorson, 2010: Quantile regression as a means of calibrating and veri-
fying a mesoscale NWP ensemble. Prob. Fcst Symp., American Meteorological Society,
Atlanta, GA, 17-23 January 2010.
Houtekamer, P. L., 1993: Global and local skill forecasts. Mon. Wea. Rev., 121, 1834–
1846.
Houtekamer, P. L., L. Lefaivre, J. Derome, H. Ritchie, and H. L. Mitchell, 1996: A system
simulation approach to ensemble prediction. Mon. Wea. Rev., 124, 1225–1242.
Huffman, G. J., R. F. Adler, S. Curtis, D. T. Bolvin, and E. J. Nelkin, 2005: Global
rainfall analyses at monthly and 3-hr time scales. Measuring Precipitation from Space:
EURAINSAT and the Future, V. Levizzani, P. Bauer, and J. F. Turk, Eds., Springer,
722 pp.
Huffman, G. J., R. F. Adler, D. T. Bolvin, G. Gu, E. J. Nelkin, K. P. Bowman, Y. Hong,
E. F. Stocker, D. B. Wolff, 2007: The TRMM Multisatellite Precipitation Analysis
(TMPA): Quasi-global, multiyear, combined sensor precipitation estimates at fine scales.
J. Hydrometeor., 8, 38-55.
Joyce, R. J., J. E. Janowiak, P. A. Arkin, and P. P. Xie, 2004: CMORPH: A method
that produces global precipitation estimates from passive microwave and infrared data
at high spatial and temporal resolution. J. Hydrometeor., 5, 487–503.
Kolczynski, W. C., D. R. Stauffer, S. E. Haupt, N. S. Altman, and A. Deng, 2011:
Investigation of Ensemble Variance as a Measure of True Forecast Variance. Mon. Wea.
Rev., 139, 3954–3963.
Kruizinga, S., and C. J. Kok, 1988: Evaluation of the ECMWF experimental skill predic-
tion scheme and a statistical analysis of forecast errors. Proc. ECMWF Workshop on
Predictability in the Medium and Extended Range, Reading, United Kingdom, ECMWF,
403–415.
Leith, C. E., 1974: Theoretical Skill of Monte Carlo Forecasts. Mon. Wea. Rev., 102,
409–418.
Liu, Y., M. Xu, J. Hacker, T. Warner, and S. Swerdlin, 2007: A WRF and MM5-
based 4-D mesoscale ensemble data analysis and prediction system (E-RTFDDA) devel-
oped for ATEC operational applications. 18th Conf. on Numerical Weather Prediction,
Amer. Meteor. Soc., June 25-29, 2007. Park City, Utah.
Molteni, F., R. Buizza, T. N. Palmer, and T. Petroliagis, 1996: The ECMWF Ensemble
Prediction System: Methodology and validation. Q. J. R. Meteorol. Soc., 122, 73–119.
Murphy, A. H., 1973: Hedging and Skill Scores for Probability Forecasts. J. Appl. Meteor., 12, 215–223.
Palmer, T. N., 2002: The economic value of ensemble forecasts as a tool for risk assess-
ment: From days to decades. Q. J. R. Meteorol. Soc., 128, 747–774.
Richardson, D. S., 2000: Skill and relative economic value of the ECMWF ensemble
prediction system. Q. J. R. Meteorol. Soc., 126, 649–667.
Scherrer, S.C., C. Appenzeller, P. Eckert, D. Cattani, 2004: Analysis of the spread-skill
relations using the ECMWF ensemble prediction system over Europe. Wea. Forecasting,
19 (3), 552–565.
Toth, Z. and E. Kalnay, 1997: Ensemble forecasting at NCEP and the breeding method.
Mon. Wea. Rev., 125, 3297–3319.
Toth, Z. and O. Talagrand and G. Candille and Y. Zhu, 2003: Probability and Ensemble
Forecasts. Chapter 7 of Forecast Verification: A Practitioner’s Guide in Atmospheric
Science. John Wiley and Sons, 254pp.
Wang, X., and C. H. Bishop, 2003: A Comparison of Breeding and Ensemble Transform
Kalman Filter Ensemble Forecast Schemes. J. Atmos. Sci., 60, 1140–1158.
Whitaker, J. S., and A. F. Loughe, 1998: The Relationship between Ensemble Spread and
Ensemble Mean Skill. Mon. Wea. Rev., 126, 3292–3302.
Wilks, D. S., 1995: Statistical Methods in the Atmospheric Sciences. Academic Press,
467pp.
Xie, P. P., B. Rudolf, U. Schneider, and P. A. Arkin, 1996: Gauge-based monthly analysis
of global land precipitation from 1971 to 1994. J. Geophys. Res. - Atmos., 101 (D14),
19023–19034.
Zhu, Y., Z. Toth, R. Wobus, D. Richardson, and K. Mylne, 2002: The Economic Value
Of Ensemble-Based Weather Forecasts. Bull. Amer. Meteor. Soc., 83, 73–83.
Table 1. Measures of spread used and their symbolic representation, where 〈·〉Ψ represents the expectation operation over the population Ψ of forecast ensemble members ψ of a given forecast.

Spread measure | Symbol | Mathematical form
variance of the ensemble members about the ensemble mean | σψ2 | 〈(ψ − 〈ψ〉Ψ)2〉Ψ = 〈ψ2〉Ψ − 〈ψ〉Ψ2
root mean square difference, or standard deviation | σψ | √(σψ2)
mean absolute difference of the ensemble members about the ensemble mean | sabs | 〈|ψ − 〈ψ〉Ψ|〉Ψ
mean absolute difference of the ensemble members about any one chosen ensemble member | sd abs | 〈|ψ − ψ′|〉Ψ,Ψ′
4th moment about the ensemble mean | m4 | 〈(〈ψ〉Ψ − ψ)4〉Ψ
Table 2. Measures of error used and their symbolic representation, where ψo represents the observation or verification of the forecast.

Error measure | Symbol | Mathematical form
square error of the ensemble mean | εµ2 | (〈ψ〉Ψ − ψo)2
square error of one ensemble member | εd2 | (ψ − ψo)2
absolute error of the ensemble mean | ε|µ| | |〈ψ〉Ψ − ψo|
absolute error of any one ensemble member | ε|d| | |ψ − ψo|
Table 3. The measures of spread s (column 2) that correspond to given error measures ε (column 1) after an expectation value operation over the distribution of the observations (〈·〉Ψo) is performed under the perfect EPS assumption. In some cases, a double expectation value operation is performed over both the forecast ensemble distribution Ψ and the possible distribution of the observations Ψo (which are equivalent distributions for a perfect model). Column 3 shows the same results, but for when the forecast ensemble is normally-distributed.

ε | 〈ε〉Ψo or 〈ε〉Ψ,Ψo | Ensembles normally distributed
ε|µ| | sabs | √(2/π) σψ
ε|d| | sd abs | (2/√π) σψ
εµ2 | σψ2 | σψ2
εd2 | 2σψ2 | 2σψ2
Table 4. Reduced forms of EPS perfect forecast spread-error correlation coefficients for different combinations of spread (s) and error (ε) measures (column 1), where the correlation is dependent only on the moments of the ensemble member spread (column 2). All expectation value operations are evaluated over the population of forecasts (i.e. 〈·〉 = 〈·〉Σ). However, further simplifications can be made if constraints are placed on the ensemble member distribution (which are required to hold for all forecasts). These constraints are given in column 3, where α and β are constants (and noting that if sabs = βσψ, then it follows that sd abs = √2 βσψ). These lead to the further simplifications on the correlation shown in column 4. (Note that if the forecast ensemble members are normally-distributed, then α = 3 and β = √(2/π), with results shown in Table 5.)

s ; ε | r theoretical form | constraint | further simplified r
sabs ; ε|µ| | √[(〈sabs2〉 − 〈sabs〉2)/(〈σψ2〉 − 〈sabs〉2)] | sabs = βσψ | β √[(1 − 〈σψ〉2/〈σψ2〉)/(1 − β2〈σψ〉2/〈σψ2〉)]
σψ ; ε|µ| | (〈σψ sabs〉 − 〈σψ〉〈sabs〉)/[(〈σψ2〉 − 〈σψ〉2)(〈σψ2〉 − 〈sabs〉2)]1/2 | same | same
sabs ; ε|d| | (〈sabs sd abs〉 − 〈sabs〉〈sd abs〉)/[(〈sabs2〉 − 〈sabs〉2)(2〈σψ2〉 − 〈sd abs〉2)]1/2 | same | same
σψ ; ε|d| | (〈σψ sd abs〉 − 〈σψ〉〈sd abs〉)/[(〈σψ2〉 − 〈σψ〉2)(2〈σψ2〉 − 〈sd abs〉2)]1/2 | same | same
σψ2 ; εµ2 | √[(〈(σψ2)2〉 − 〈σψ2〉2)/(〈m4〉 − 〈σψ2〉2)] | m4 = α(σψ2)2 | √[(1 − 〈σψ2〉2/〈(σψ2)2〉)/(α − 〈σψ2〉2/〈(σψ2)2〉)]
σψ2 ; εd2 | √[(〈(σψ2)2〉 − 〈σψ2〉2)/(〈m4〉/2 + 3〈(σψ2)2〉/2 − 〈σψ2〉2)] | same | √[(1 − 〈σψ2〉2/〈(σψ2)2〉)/((α + 3)/2 − 〈σψ2〉2/〈(σψ2)2〉)]
Table 5. EPS perfect forecast spread-error correlation coefficient results (column 2) for different combinations of spread (s) and error (ε) measures (column 1). The results are the same as for Table 4, except the distribution of the ensemble members Ψ is constrained to be normally-distributed (with α = 3 and β = √(2/π)), simplifying the results. As with Table 4, all expectation value operations are evaluated over the population of forecasts (i.e. 〈·〉 = 〈·〉Σ). Also shown is the ratio of moments of the spread g that governs the value of r (column 3), the theoretical limiting values for r (column 4), and its form for one specific distribution for the possible ensemble member standard deviations σψ (column 5).

s ; ε | r theoretical form, Ψ normally distributed | governing ratio g | r limits: g → 1 ; g → 0 | r for σψ lognormally distributed
sabs ; ε|µ| | √(2/π) √[(1 − 〈σψ〉2/〈σψ2〉)/(1 − (2/π)〈σψ〉2/〈σψ2〉)] | 〈σψ〉2/〈σψ2〉 | 0 ; √(2/π) | √(2/π) √[(1 − exp(−σΣ2))/(1 − (2/π) exp(−σΣ2))]
σψ ; ε|µ| | same | same | same | same
sabs ; ε|d| | same | same | same | same
σψ ; ε|d| | same | same | same | same
σψ2 ; εµ2 | √[(1 − 〈σψ2〉2/〈(σψ2)2〉)/(3 − 〈σψ2〉2/〈(σψ2)2〉)] | 〈σψ2〉2/〈(σψ2)2〉 | 0 ; √(1/3) | √[(1 − exp(−4σΣ2))/(3 − exp(−4σΣ2))]
σψ2 ; εd2 | same | same | same | same
Figure 1. Schematic of the correlation coefficient simplification calculation. Thin solid
vertical lines represent six-member ensemble forecasts of variable ψ that are randomly-
drawn from the Gaussian-shaped grey PDF curve with mean value given by the vertical
dashed line, and some definition of spread s1, s2, and s3 for forecast times t1, t2, and
t3, respectively. The observation (verification) corresponding to the ensemble forecast is
given by the vertical red line, and the forecast errors ε1, ε2, and ε3 are defined here as the
distance of the ensemble mean to the observation. See text for further details.
Figure 2. Dependence of the correlation coefficient on two different ratios of moments
of ensemble member spread, or "governing ratios", for an EPS perfect forecast.
Figure 3. ATEC EPS pre- and post-processed 19-member 3-hr and 36-hr lead-time
temperature forecasts compared to weather station observations. Panel a) shows a time-series of a sorted subset of 3-hr ensemble forecasts (colored lines) and observations (black line), with panel b) showing the (out-of-sample) ensembles after post-processing. Panels
c) and d) show rank histograms of the pre- and post-processed 3-hr forecasts, respectively,
with red dashed lines providing 95% confidence bounds. Panels e) - h) show the same
results but for the 36-hr lead-time forecasts. See text for details.
Figure 4. CFAB EPS pre- and post-processed 51-member 1-day and 10-day lead-time streamflow forecasts compared to river gauging station observations. Panel a) shows a time-series of a sorted subset of 1-day ensemble forecasts (colored lines) and observations (black line), with panel b) showing the (out-of-sample) ensembles after post-processing. Panels c) and d) show rank histograms of the pre- and post-processed 1-day forecasts, respectively, with red dashed lines providing 95% confidence bounds. Panels e) - h) show
the same results but for the 10-day lead-time forecasts. See text for details.
Figure 5. Scatter plots of ensemble error versus spread using the CFAB EPS data
and the metric pairings of Tables 4 and 5 (rows 1 - 6 corresponding to panels a - f,
respectively). The black dots are the actual error-spread data, the blue dots are the EPS
perfect forecast-equivalent of the CFAB data. Linear fits to both actual and perfect model
data sets are included (black and blue lines, respectively). In the upper right corner of
each panel are the correlation values for the actual forecast, the EPS perfect forecast,
the theoretical normally-distributed perfect forecast, and the theoretical maximum value,
listed top-to-bottom, respectively. See text for further details.
Figure 6. Panels a) - f): correlation and skill-score values plotted as a function of forecast hour (3h - 36h) and forecast day (1 - 10).