VERIFICATION OF PROBABILISTIC STREAMFLOW
FORECASTS
by
Tempei Hashino, A. Allen Bradley, and Stuart S. Schwartz
Sponsored by
National Oceanic and Atmospheric Administration (NOAA) No. NA86GP0365 and No. NA16GP1569
IIHR Report No. 427
IIHR-Hydroscience & Engineering and Department of Civil and Environmental Engineering
The University of Iowa, Iowa City, IA 52242-1585
August 2002
ACKNOWLEDGMENTS
This report constitutes the master's thesis of Tempei Hashino. Funding
for the research was provided by the National Oceanic and Atmospheric Adminis-
tration (NOAA) under the following grants: #NA86GP0365 and #NA16GP1569.
This support is gratefully acknowledged.
EXECUTIVE SUMMARY
Long-range streamflow forecasts, such as the ensemble streamflow predictions
(ESP) produced by the National Weather Service (NWS) Advanced Hydrologic
Prediction Services (AHPS), are usually probabilistic forecasts. The format of the
forecast is essentially a continuous probability distribution function, which predicts
the likelihood of occurrence of a streamflow variable, conditioned on the current
hydroclimatic state. Although significant advances in forecast verification method-
ologies have been made in recent years, many of these approaches are not directly
applicable to probabilistic streamflow forecasts. The main purposes of this research
are (1) to extend the distributions-oriented (DO) approach to the verification of
probability distribution forecasts of streamflow, and (2) to demonstrate the useful-
ness of the DO approach in assessing the quality of streamflow forecasts. Techniques
for forecast verification using the DO approach are proposed and studied using prob-
ability distribution forecasts for an experimental forecasting system for the Upper
Des Moines River basin.
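The forecast format described above can be made concrete with a short sketch. Assuming only a set of ensemble traces (the numbers below are synthetic, not data from the Des Moines study), the forecast probability of nonexceedance for any event of interest is simply the empirical fraction of traces at or below the threshold:

```python
import numpy as np

# Hypothetical ensemble of 40 seasonal flow volumes (illustrative, not real data)
rng = np.random.default_rng(0)
traces = rng.lognormal(mean=10.0, sigma=0.5, size=40)

def nonexceedance_prob(ensemble, threshold):
    """Empirical forecast probability that the flow volume is at or below a threshold."""
    return float(np.mean(np.asarray(ensemble) <= threshold))

# Evaluate the forecast distribution at the ensemble's 0.25 quantile
p = nonexceedance_prob(traces, np.quantile(traces, 0.25))
```

Evaluating this empirical distribution at a sequence of thresholds yields the continuous probability distribution forecast discussed in the report.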
One significant obstacle in the verification of probabilistic streamflow fore-
casts is the small data sample available for verification. Verification sample sizes
for long-range hydrologic forecasts are typically much smaller than those available
for weather forecasts. Since verification with the DO approach is equivalent to es-
timation of the joint distribution of forecasts and observations, application of the
DO approach to streamflow forecasts with small samples results in large estimation
uncertainties. Three continuous statistical modeling approaches are considered that
deal with estimation uncertainties by reducing the dimensionality of the verification
problem. Based on Monte Carlo experiments, the continuous approach with
a logistic regression or kernel density estimation produces better estimates of fore-
cast quality, especially with small sample sizes (say 50 or 100), than the traditional
discrete approach with a contingency table. Moreover, the continuous approaches
work better regardless of whether the forecasts are issued as discrete or continuous values.
A significant concern when using the ESP technique for streamflow forecasting
is hydrologic model biases. The simulation biases of the hydrologic model propa-
gate to the probability distribution forecasts through the ensemble traces produced
by the hydrological model, and could degrade the quality of the forecasts. Bias
correction methods are often applied to try to reduce the effects of model biases.
The impacts of three bias correction methods on streamflow forecast quality are
examined using the DO techniques developed for streamflow forecast verification.
The three bias correction methods examined are the Event-Bias Correction method
(EBC), the Regression-Type method, and the Quantile-Mapping method (QM). The
results showed that all bias correction methods improve skill scores, mostly by re-
ducing the conditional bias (Reliability) and unconditional bias (Mean Error). It is
remarkable that in some cases the bias correction methods also improve the associa-
tion (potential skill) between forecasts and observations. The forecasts modified by
EBC tend to have the lowest sharpness and discrimination over all flow quantiles,
whereas QM tends to give the highest sharpness and discrimination. The regression-
type methods tend to fall between these two. This application shows a strength
of the proposed DO approach for probabilistic streamflow verification. Specifically,
the approach produces detailed information on many aspects of forecast quality,
which helps in determining the differences between alternate forecasting systems.
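The skill-score decomposition referred to above can be written, following Murphy's decomposition of the MSE skill score, as SS = ρ_fx^2 − (ρ_fx − σ_f/σ_x)^2 − ((µ_f − µ_x)/σ_x)^2: potential skill minus a reliability (conditional bias) term minus an unconditional bias term. A minimal numerical sketch with synthetic data and hypothetical names (the identity holds exactly for sample moments):

```python
import numpy as np

def mse_skill_score_decomposition(f, x):
    """Decompose SS = 1 - MSE/Var(x) into potential skill (rho^2),
    a reliability (conditional bias) term, and an unconditional bias term."""
    f, x = np.asarray(f, float), np.asarray(x, float)
    rho = np.corrcoef(f, x)[0, 1]           # correlation between forecasts and obs
    sf, sx = f.std(), x.std()               # standard deviations (ddof=0)
    ss = 1.0 - np.mean((f - x) ** 2) / x.var()
    potential = rho ** 2                    # association (potential skill)
    reliability = (rho - sf / sx) ** 2      # conditional bias term
    uncond_bias = ((f.mean() - x.mean()) / sx) ** 2  # mean-error term
    return ss, potential, reliability, uncond_bias
```

Improving reliability or mean error raises SS toward the potential-skill ceiling ρ², which is the behavior the bias correction comparison above describes.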
TABLE OF CONTENTS
Page
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
CHAPTER
1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
2 FORECASTING SYSTEM . . . . . . . . . . . . . . . . . . . . . . 3
2.1 Study Area and Data Resources . . . 3
2.2 Probabilistic Forecasting System . . . 4
2.3 Proposed Verification Approach . . . 5
2.3.1 Forecasts for a Discrete Event . . . 5
2.3.2 Verification Dataset . . . 7
2.4 Discussion . . . 8
2.5 Summary and Conclusions . . . 9
3 VERIFICATION APPROACH . . . . . . . . . . . . . . . . . . . . 12
3.1 Introduction . . . 12
3.2 Distributions-Oriented Measures . . . 14
3.2.1 Bias . . . 15
3.2.2 Accuracy . . . 15
3.2.3 Calibration-Refinement Measures . . . 15
3.2.4 Likelihood-Base Rate Measures . . . 16
3.3 Estimation of Measures . . . 17
3.3.1 Basic Statistics . . . 18
3.3.2 Other Derivative Estimators . . . 19
3.3.3 Estimation of CR Decompositions . . . 21
3.4 Example of Verification . . . 23
3.4.1 Absolute and Relative Measures . . . 23
3.4.2 Marginal and Conditional Distributions . . . 26
3.5 Discussion . . . 27
3.6 Summary and Conclusions . . . 30
4 DISTRIBUTIONS-ORIENTED METHODS FOR SMALL VERIFICATION DATASET . . . 32
4.1 Introduction . . . 32
4.2 Monte Carlo Simulation with Analytical Model for Joint Distribution . . . 34
4.2.1 Assumptions and Procedure . . . 34
4.2.2 Result and Discussion . . . 36
4.3 Monte Carlo Simulation with Stochastic Model of Streamflow Forecasting System . . . 50
4.3.1 Assumptions and Procedure . . . 50
4.3.2 Result and Discussion . . . 55
4.4 Monte Carlo Simulation with Discrete Joint Distribution Model . . . 60
4.4.1 Assumptions and Procedure . . . 60
4.4.2 Result and Discussion . . . 64
4.5 Summary and Conclusions . . . . . . . . . . . . . . . . . . . 68
5 ASSESSMENT OF BIAS CORRECTION METHODS FOR ENSEMBLE FORECASTS . . . 70
5.1 Introduction . . . 70
5.2 Biases in Historical Simulations . . . 72
5.3 Bias Correction Methods . . . 73
5.3.1 Event-Bias Correction Method . . . 75
5.3.2 Regression-Type Method . . . 75
5.3.3 Quantile-Mapping Method . . . 78
5.4 Result and Discussion . . . 78
5.4.1 Performance Measures . . . 78
5.4.2 CR Factorization and Decompositions . . . 85
5.4.3 LBR Factorization and Decompositions . . . 88
5.4.4 Results for All Months . . . 94
5.5 Summary and Conclusions . . . . . . . . . . . . . . . . . . . 101
6 SUMMARY AND CONCLUSIONS . . . . . . . . . . . . . . . . . 105
6.1 Distributions-Oriented Methods for Small Verification Dataset . . . 105
6.2 Assessment of Bias Correction Methods for Ensemble Forecasts . . . 106
6.3 Future Study and Remarks . . . . . . . . . . . . . . . . . . . 107
APPENDIX
A STATISTICAL METHODS . . . . . . . . . . . . . . . . . . . . . . 110
A.1 Logistic Regression Method . . . 110
A.2 Kernel Density Estimation Method . . . 111
A.3 Combination Method . . . 116
A.4 Contingency Table Approach . . . 116
B SELECTED FIGURES AND TABLES . . . . . . . . . . . . . . . 117
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
LIST OF TABLES
Table Page
2.1 Example of verification dataset for June-September volume forecasts. 8
4.1 Parameters of beta distributions for the analytical model and true forecast quality measures. . . . 36
4.2 Root Mean Squared Error (RMSE) in MSE/σ_x^2, ME/σ_x, TY2/σ_x^2, and DIS/σ_x^2 for the forecasts generated for nonexceedance probability p = 0.25 by the analytical model. . . . 40
4.3 Root Mean Squared Error (RMSE) in MSE/σ_x^2, ME/σ_x, TY2/σ_x^2, and DIS/σ_x^2 for the forecasts generated for nonexceedance probability p = 0.05 by the analytical model. . . . 40
4.4 Root Mean Squared Error (RMSE) in REL/σ_x^2 for the forecasts generated for nonexceedance probability p = 0.25 by the analytical model. . . . 46
4.5 Root Mean Squared Error (RMSE) in RES/σ_x^2 for the forecasts generated for nonexceedance probability p = 0.25 by the analytical model. . . . 46
4.6 Root Mean Squared Error (RMSE) in REL/σ_x^2 for the forecasts generated for nonexceedance probability p = 0.05 by the analytical model. . . . 49
4.7 Root Mean Squared Error (RMSE) in RES/σ_x^2 for the forecasts generated for nonexceedance probability p = 0.05 by the analytical model. . . . 49
4.8 Parameters used in fitting the distribution to observed monthly volume (U) and the first three L-moments of the ensemble volumes (X_ℓ1, X_ℓ2, and X_ℓ3). . . . 52
4.9 Summary statistics of the standardized random variables. . . . . . 53
4.10 Root Mean Squared Error (RMSE) in REL/σ_x^2 for the forecasts generated for nonexceedance probability p = 0.25 by the stochastic model. . . . 61
4.11 Root Mean Squared Error (RMSE) in RES/σ_x^2 for the forecasts generated for nonexceedance probability p = 0.25 by the stochastic model. . . . 61
4.12 Root Mean Squared Error (RMSE) in REL/σ_x^2 for the forecasts generated for nonexceedance probability p = 0.05 by the stochastic model. . . . 62
4.13 Root Mean Squared Error (RMSE) in RES/σ_x^2 for the forecasts generated for nonexceedance probability p = 0.05 by the stochastic model. . . . 62
4.14 Basic information and true forecast quality measures of Subjective 12-24-h Projection Probability-of-Precipitation Forecasts for the United States during October 1980-March 1981, from Wilks (1995). . . . 64
4.15 Root Mean Squared Error (RMSE) in REL/σ_x^2 for the forecasts generated by the discrete model. . . . 66
4.16 Root Mean Squared Error (RMSE) in RES/σ_x^2 for the forecasts generated by the discrete model. . . . 66
5.1 Mean, Standard Deviation (SD), and Coefficient of Variation (CV) of the observed monthly volume (cfsd) for the Des Moines River at Stratford. . . . 73
5.2 Mean Error (ME), Root Mean Square Error (RMSE), correlation coefficient (CC), and Mean Square Error (MSE) Skill Score (SSMSE) between the observed monthly volume and historical simulations. . . . 73
B.1 BIAS in REL/σ_x^2 for the forecasts generated for nonexceedance probability p = 0.25 by the analytical model of joint distribution. . . . 119
B.2 Standard Deviation in REL/σ_x^2 for the forecasts generated for nonexceedance probability p = 0.25 by the analytical model of joint distribution. . . . 119
B.3 BIAS in RES/σ_x^2 for the forecasts generated for nonexceedance probability p = 0.25 by the analytical model of joint distribution. . . . 120
B.4 Standard Deviation in RES/σ_x^2 for the forecasts generated for nonexceedance probability p = 0.25 by the analytical model of joint distribution. . . . 120
B.5 BIAS in REL/σ_x^2 for the forecasts generated for nonexceedance probability p = 0.05 by the analytical model of joint distribution. . . . 121
B.6 Standard Deviation in REL/σ_x^2 for the forecasts generated for nonexceedance probability p = 0.05 by the analytical model of joint distribution. . . . 121
B.7 BIAS in RES/σ_x^2 for the forecasts generated for nonexceedance probability p = 0.05 by the analytical model of joint distribution. . . . 122
B.8 Standard Deviation in RES/σ_x^2 for the forecasts generated for nonexceedance probability p = 0.05 by the analytical model of joint distribution. . . . 122
LIST OF FIGURES
Figure Page
2.1 Map of Des Moines River Basin. . . . . . . . . . . . . . . . . . . . 5
2.2 Ensemble traces simulated for forecast on 1 June 1965. . . . . . . . 6
2.3 Probability distribution forecast for June-September volume. The ensemble traces are simulated with the current conditions as of 1 June 1965. . . . 6
2.4 Schematic of the current Extended Streamflow Prediction System. 10
3.1 Mean Error (ME) and Mean Square Error (MSE) for June-September seasonal volume forecasts. . . . 24
3.2 CR (on left) and LBR (on right) decompositions of MSE for June-September seasonal volume forecasts. . . . . . . . . . . . . . . . . 25
3.3 Various decompositions of MSE Skill Score for June-September seasonal volume forecasts. The upper left indicates CR decompositions, Relative Resolution (RRES) and Relative Reliability (RREL). The upper right indicates LBR decompositions, Relative Discrimination (RDIS), Relative Sharpness (RS), and Relative Type 2 Conditional Bias (RTY2). The lower left shows Potential Skill, Reliability Measure, and Unconditional Bias Measure. . . . 26
3.4 Reliability diagram for June-September seasonal volume forecasts issued for the 0.25 quantile. . . . 28
3.5 Discrimination diagram for June-September seasonal volume forecasts issued for the 0.25 quantile. . . . 28
4.1 MSE/σ_x^2, ME/σ_x, TY2/σ_x^2, and DIS/σ_x^2 estimated by two approaches for nonexceedance probability p = 0.25; “D” is the discretized (11-binned) approach (DSC), “C” represents a continuous approach such as LRM, KDM, and CM. The maximum, upper quartile, median, lower quartile, and minimum are indicated from top to bottom. The forecasts are produced by the analytical model. . . . 37
4.2 MSE/σ_x^2, ME/σ_x, TY2/σ_x^2, and DIS/σ_x^2 estimated by two approaches for nonexceedance probability p = 0.05; “D” is the discretized (11-binned) approach (DSC), “C” represents a continuous approach such as LRM, KDM, and CM. The maximum, upper quartile, median, lower quartile, and minimum are indicated from top to bottom. The forecasts are produced by the analytical model. . . . 38
4.3 Conditional mean of the observations given the forecasts µx|f and marginal distribution of the forecasts s(f) estimated by three methods, DSC, LRM, and KDM, for nonexceedance probability p = 0.25 with a sample size of 50. The forecasts are produced by the analytical model. . . . 41
4.4 Conditional distribution of the forecasts given the observations r(f|x) estimated by three methods, DSC, LRM, and KDM, for nonexceedance probability p = 0.25 with a sample size of 50. The forecasts are produced by the analytical model. . . . 42
4.5 Conditional mean of the observations given the forecasts µx|f and marginal distribution of forecasts s(f) estimated by three methods, DSC, LRM, and KDM, for nonexceedance probability p = 0.05 with a sample size of 50. The forecasts are produced by the analytical model. . . . 43
4.6 Conditional distribution of the forecasts given the observations r(f|x) estimated by three methods, DSC, LRM, and KDM, for nonexceedance probability p = 0.05 with a sample size of 50. The forecasts are produced by the analytical model. . . . 44
4.7 CR decompositions estimated by four approaches for nonexceedance probability p = 0.25; “D” is the discretized (11-binned) approach (DSC), “L” is logistic regression (LRM), “K” is kernel density estimation directly applied to r(f|x) (KDM), and “C” is the combination of logistic regression and kernel density estimation (CM). The maximum, upper quartile, median, lower quartile, and minimum are indicated from top to bottom. The forecasts are produced by the analytical model. . . . 45
4.8 CR decompositions estimated by four approaches for nonexceedance probability p = 0.05; “D” is the discretized (11-binned) approach (DSC), “L” is logistic regression (LRM), “K” is kernel density estimation directly applied to r(f|x) (KDM), and “C” is the combination of logistic regression and kernel density estimation (CM). The maximum, upper quartile, median, lower quartile, and minimum are indicated from top to bottom. The forecasts are produced by the analytical model. . . . 48
4.9 Relations between observations and L-moments for September monthly volume. . . . 51
4.10 Scatterplot of transformed observed monthly volume and transformed L-moments of monthly volume ensembles. . . . 53
4.11 Conditional mean of the observations given the forecasts µx|f estimated by three methods, DSC, LRM, and KDM, for nonexceedance probability p = 0.25 with sample sizes 50 and 1000. The forecasts are produced by the stochastic model. . . . 56
4.12 Conditional mean of the observations given the forecasts µx|f estimated by three methods, DSC, LRM, and KDM, for nonexceedance probability p = 0.05 with sample sizes 50 and 1000. The forecasts are produced by the stochastic model. . . . 57
4.13 CR decompositions estimated by four approaches for nonexceedance probability p = 0.25; “D” is the discretized (11-binned) approach (DSC), “L” is logistic regression (LRM), “K” is kernel density estimation directly applied to r(f|x) (KDM), and “C” is the combination of logistic regression and kernel density estimation (CM). The maximum, upper quartile, median, lower quartile, and minimum are indicated from top to bottom. The forecasts are produced by the stochastic model. . . . 58
4.14 CR decompositions estimated by four approaches for nonexceedance probability p = 0.05; “D” is the discretized (11-binned) approach (DSC), “L” is logistic regression (LRM), “K” is kernel density estimation directly applied to r(f|x) (KDM), and “C” is the combination of logistic regression and kernel density estimation (CM). The maximum, upper quartile, median, lower quartile, and minimum are indicated from top to bottom. The forecasts are produced by the stochastic model. . . . 59
4.15 True marginal and conditional distributions of the discrete forecasts. 63
4.16 Conditional mean of the observations given the forecasts µx|f estimated by three methods, DSC, LRM, and KDM, with sample sizes 50 and 1000. The forecasts are produced by the discrete model. . . . 65
4.17 CR decompositions estimated by four approaches for the discrete forecasts; “D” is the discretized (12-binned) approach (DSC), “L” is logistic regression (LRM), “K” is kernel density estimation directly applied to r(f|x) (KDM), and “C” is the combination of logistic regression and kernel density estimation (CM). The maximum, upper quartile, median, lower quartile, and minimum are indicated from top to bottom. The forecasts are produced by the discrete model. . . . 67
5.1 Example of Bias Correction Method applied to ensemble traces. . 71
5.2 Comparison of observed monthly volume and historical simulation from January 1988 to December 1997. . . . 74
5.3 Example of the bias correction for a 1-month lead time forecast with initial condition of January 1949; EBC (Event-Bias Correction Method) is left and RLI (Linear Interpolation) right. . . . 76
5.4 Observed monthly volume versus simulated monthly volume with power function for May and September. . . . 77
5.5 Observed monthly volume versus simulated monthly volume with LOWESS regression for May and September. . . . 78
5.6 Example of the Quantile Mapping method (QM) for a 1-month lead time forecast with initial condition of January 1949. . . . 79
5.7 MSE Skill Score (left) and Skill Score for Bias Correction (right) versus forecasted month for 1, 2, and 3-month lead times, averaged over the quantiles. . . . 81
5.8 Skill Score for Bias Correction for May and September monthly volumes, averaged over the quantiles. . . . 82
5.9 Comparison of Mean Error (left) and Mean Square Error (right) by five Bias Correction methods, actual (non bias-corrected) streamflow simulation (NBC), and pseudoperfect streamflow simulation (PSS), for 1, 2, and 3-month lead time September monthly volume forecasts. . . . 83
5.10 Comparison of MSE Skill Score (left) and measure of association (right) by five Bias Correction methods, actual (non bias-corrected) streamflow simulation (NBC), and pseudoperfect streamflow simulation (PSS), for 1, 2, and 3-month lead time September monthly volume forecasts. . . . 84
5.11 Comparison of decompositions of Skill Score by five Bias Correction methods, actual (non bias-corrected) streamflow simulation (NBC), and pseudoperfect streamflow simulation (PSS), for 1, 2, and 3-month lead time September monthly volume forecasts. The measure of reliability is left, and the measure of unconditional bias is right. . . . 86
5.12 Performance measures and decompositions of MSE Skill Score by five Bias Correction methods, actual (non bias-corrected) streamflow simulation (NBC), and pseudoperfect streamflow simulation (PSS), for 1, 2, and 3-month lead time May monthly volume forecasts. . . . 87
5.13 Marginal distribution of the forecasts s(f) and the conditional mean of the forecasts µx|f by five Bias Correction methods, actual (non bias-corrected) streamflow simulation (NBC), and pseudoperfect streamflow simulation (PSS), for 1-month lead time September monthly volume forecasts. . . . 89
5.14 CR decompositions by five Bias Correction methods, actual (non bias-corrected) streamflow simulation (NBC), and pseudoperfect streamflow simulation (PSS), for 1, 2, and 3-month lead time September monthly volume forecasts. . . . 90
5.15 Conditional distributions of the forecasts r(f|x = 0) (left) and r(f|x = 1) (right) by five Bias Correction methods, actual (non bias-corrected) streamflow simulation (NBC), and pseudoperfect streamflow simulation (PSS), for 1-month lead time September monthly volume forecasts. . . . 92
5.16 Conditional mean of the forecasts given the observations µf|x for five Bias Correction methods, actual (non bias-corrected) streamflow simulation (NBC), and pseudoperfect streamflow simulation (PSS), for 1-month lead time September monthly volume forecasts. . . . 93
5.17 Conditional mean of the forecasts given the observations µf|x for the EBC and QM bias correction methods with NBC. The forecasts were issued for September monthly volume with 1-month lead time. The three curves for each colour in the bottom two figures show µf|x=1, µf, and µf|x=0 from top to bottom. . . . 93
5.18 Relative sharpness by five Bias Correction methods, actual (non bias-corrected) streamflow simulation (NBC), and pseudoperfect streamflow simulation (PSS), for 1, 2, and 3-month lead time September monthly volume forecasts. . . . 94
5.19 LBR decompositions by five Bias Correction methods, actual (non bias-corrected) streamflow simulation (NBC), and pseudoperfect streamflow simulation (PSS), for 1, 2, and 3-month lead time September monthly volume forecasts. . . . 95
5.20 Mean Error and unconditional bias from decomposition of Skill Score by five Bias Correction methods, actual (non bias-corrected) streamflow simulation (NBC), and pseudoperfect streamflow simulation (PSS), for all the months with 1, 3, and 6-month lead times. . . . 97
5.21 CR decompositions by five Bias Correction methods, actual (non bias-corrected) streamflow simulation (NBC), and pseudoperfect streamflow simulation (PSS), for all the months with 1, 3, and 6-month lead times. . . . 98
5.22 LBR decompositions by five Bias Correction methods, actual (non bias-corrected) streamflow simulation (NBC), and pseudoperfect streamflow simulation (PSS), for all the months with 1, 3, and 6-month lead times. . . . 99
5.23 Relative sharpness by five Bias Correction methods, actual (non bias-corrected) streamflow simulation (NBC), and pseudoperfect streamflow simulation (PSS), for all the months with 1, 3, and 6-month lead times. . . . 100
5.24 MSE Skill Score and potential skill by five Bias Correction methods, actual (non bias-corrected) streamflow simulation (NBC), and pseudoperfect streamflow simulation (PSS), for all the months with 1, 3, and 6-month lead times. . . . 102
A.1 Example of logistic regression applied to the pairs of forecasts and observations. . . . 111
A.2 Unbounded estimation with biweight kernel. . . . . . . . . . . . . 112
A.3 Bounded estimation with floating boundary kernel. . . . . . . . . . 114
A.4 Bounded estimation with biweight kernel and reflection boundary technique. . . . 115
A.5 Example of kernel density estimation method applied to forecasts to estimate the marginal distribution s(f). . . . 116
B.1 CR decompositions by five Bias Correction methods, actual (non bias-corrected) streamflow simulation (NBC), and pseudoperfect streamflow simulation (PSS), for 1-month lead time May monthly volume forecasts. . . . 117
B.2 LBR decompositions by five Bias Correction methods, actual (non bias-corrected) streamflow simulation (NBC), and pseudoperfect streamflow simulation (PSS), for 1-month lead time May monthly volume forecasts. . . . 118
CHAPTER 1
INTRODUCTION
After the devastating floods in 1993 in the Midwest, the National Weather
Service (NWS) proposed development of Advanced Hydrologic Prediction Services
(AHPS) for streamflow forecasting. The first demonstration of the AHPS system was
carried out for the Des Moines River basin. AHPS produces short-range forecasts
of the flood levels and the timing of flood crests. AHPS also produces long-range
probabilistic streamflow forecasts. The forecasts include the chance (or probability)
of minor, moderate, or major flooding, and the chance of exceeding certain
water levels, volumes, and flows on the river over the next 90 days. These
probabilistic forecasts are issued as probability distributions for streamflow, where
streamflow is treated as a continuous random variable. Hence, they are called prob-
ability distribution forecasts, as opposed to more traditional probabilistic forecasts
for discrete events. The probability distribution forecast AHPS produces has the
advantage that users can obtain probabilistic forecasts for the events they are
interested in. On the other hand, probability distribution forecasts are intuitively
more difficult to evaluate than categorized forecasts.
This research defines forecast verification as the procedure to assess the de-
gree of agreement between forecasts and observations, following Murphy and Daan
(1985). Forecast verification has traditionally been implemented using one or more
verification measures (Murphy, 1993). This approach fails to give a complete picture
of the forecast quality for many kinds of forecasts, let alone for probability
distribution forecasts. In the late 1980s, Murphy and Winkler (1987) proposed a
general framework of forecast verification called the Distributions-Oriented (DO)
approach. Based on the joint distribution of forecasts and observations, this ap-
proach unifies and imposes a structure on the verification methodology, provides
insight into the relationships among verification measures, and creates a sound sci-
entific basis to develop and/or choose particular verification measures in specific
contexts (Murphy and Winkler, 1987).
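To make the joint-distribution view concrete: for a probability forecast f of a binary event x, the basic DO measures and the calibration-refinement (CR) decomposition can be computed directly from the sample analogue of the joint distribution. The sketch below is illustrative, with hypothetical names rather than the report's own estimators; for discrete-valued forecasts the identity MSE = UNC + REL − RES holds exactly:

```python
import numpy as np

def do_summary(f, x):
    """Distributions-oriented summary for probability forecasts f of a
    binary event x (sketch of the calibration-refinement decomposition)."""
    f, x = np.asarray(f, float), np.asarray(x, float)
    me = np.mean(f - x)                  # unconditional bias (Mean Error)
    mse = np.mean((f - x) ** 2)          # accuracy
    unc = np.var(x)                      # uncertainty, sigma_x^2
    rel = res = 0.0
    for v in np.unique(f):               # stratify by distinct forecast value
        sel = f == v
        s = sel.mean()                   # s(f): marginal prob of this forecast
        mu = x[sel].mean()               # mu_{x|f}: conditional mean of obs
        rel += s * (v - mu) ** 2         # reliability (conditional bias)
        res += s * (mu - x.mean()) ** 2  # resolution
    return me, mse, unc, rel, res
```

This is the joint distribution of forecasts and observations in action: every measure above is a functional of the sample pairs, which is what makes the DO framework unifying.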
The original DO approach assumes the forecasts and observations are ex-
pressed as discrete variables. Thus, the DO approach is not directly applicable to
probability distribution forecasts of continuous variables. The objectives of this re-
search are (1) to extend the DO approach to the verification problem of probability
distribution forecasts (or ensemble forecasts) of streamflow, and (2) to demonstrate
its usefulness in assessing the quality of streamflow forecasts.
In the application of the DO approach to streamflow forecasts, the major prob-
lem stems from the small sample size. For instance, in the case of meteorological
forecasts, say, maximum daily temperature, 365 pairs of forecasts and observations
would be available per year. After 50 years, 18,250 pairs could be utilized for verifi-
cation. But if the forecast of interest is summer season flow volume, after 50 years,
just 50 pairs would be available for verification. The DO approach outlined by
Murphy (1997) requires the construction of the joint distribution of forecasts and
observations, where forecasts and observations are discrete random variables. With
such a small sample, categorizing continuous probabilistic forecasts into discrete bins
may not be appropriate to estimate the joint distribution, and the verification may
lead to a distorted impression of the forecasting system. In this research, an alternative
approach which does not categorize the probabilistic forecasts is investigated.
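One way to avoid the binning, sketched here with synthetic data rather than the report's forecasts, is to estimate the conditional mean µ_x|f with a smooth logistic curve fitted to the raw forecast-observation pairs (a minimal re-implementation of the logistic-regression idea; the function and parameter names are hypothetical):

```python
import numpy as np

def fit_logistic(f, x, lr=0.5, steps=5000):
    """Estimate mu_{x|f} = P(x=1 | forecast f) with a logistic curve,
    avoiding the bins a contingency table would need (illustrative sketch)."""
    f, x = np.asarray(f, float), np.asarray(x, float)
    a, b = 0.0, 0.0
    for _ in range(steps):                       # plain gradient ascent on
        p = 1.0 / (1.0 + np.exp(-(a + b * f)))  # the log-likelihood
        a += lr * np.mean(x - p)
        b += lr * np.mean((x - p) * f)
    return lambda fq: 1.0 / (1.0 + np.exp(-(a + b * np.asarray(fq))))
```

Because the fitted curve uses every pair directly, a sample of 50 forecasts is not spread thinly over 11 bins, which is the intuition behind the continuous approach.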
In order to demonstrate that the DO approach provides useful information
to assess forecast quality, this research addresses the assessment of bias correction
methods applied to ensemble forecasts. The forecasting system in this research
is based on Extended Streamflow Prediction (ESP), which produces probabilistic
forecasts through statistical analysis of ensemble traces. The ensemble traces are
simulated by a hydrological model. However, in most cases the hydrological model
may have some bias associated with its assumptions or input data. In practice, bias
correction methods are utilized to correct the bias in simulations. However, it is
not clear how the bias in ensemble traces propagates to the probabilistic forecasts,
and how these methods improve the forecasts. The probabilistic forecasts modified
with the bias correction methods are investigated by using the DO approach.
CHAPTER 2
FORECASTING SYSTEM
An experimental forecasting system for the Des Moines River basin (Bradley
and Schwartz, 2000) is used to develop and test approaches for verification of prob-
abilistic streamflow forecasts. Like the Advanced Hydrologic Prediction Services
(AHPS) forecasts from the National Weather Service (NWS), the experimental fore-
casts are made using an ensemble forecasting technique. This chapter explains the
study area and input datasets first, and then discusses how forecasts are made. An
overview of the approach used for verification is given, along with the development of a
verification dataset from the forecasts.
2.1 Study Area and Data Resources
The study area is the Upper Des Moines River basin stretching from the
southern part of Minnesota to central Iowa (Figure 2.1). This research uses the
discharge data obtained at Stratford, Iowa. The Des Moines River basin contains
two major reservoirs, and the Upper Des Moines River is a main source of inflow
into Saylorville Reservoir. Stratford was therefore chosen as the station for this
research, since long-term forecasts of reservoir inflow are important in reservoir
operations.
The drainage area of the Upper Des Moines River basin is about 14,120 km2,
and the elevation ranges from 290 to 518 m above mean sea level. The gently rolling
terrain, formed by continental glaciation and subsequent erosion, supports extensive
cultivated corn fields. The Upper Des Moines River has two main tributaries: the
West Fork and the East Fork Des Moines River. The West Fork River has its origins
in the glacial moraine area of Pipestone, Lyon, and Murray Counties, Minnesota, at
an elevation of about 580 m. Flowing southeastward, the West Fork meets the East Fork, which
flows southeasterly from Jackson County, Minnesota. The subbasins of the West and
East Forks have many lakes, especially in Minnesota. The Upper Des Moines River
passes through Fort Dodge, Iowa and joins the Boone River before Stratford. Ac-
cording to USGS NWISWeb Data for the Nation (http://waterdata.usgs.gov/nwis),
the daily mean streamflow, obtained from 82 years of records, varies from 500 to
6,000 cfs. The maximum peak streamflow of 42,300 cfs was recorded on 2 April 1993.
For more on the hydrological characteristics, see Bae and Georgakakos (1992).
The Hydrological Simulation Program-Fortran (HSPF) (Donigian et al., 1984,
and Bicknell et al., 1997) was applied to the Upper Des Moines River basin, and
the basin was modeled as a single lumped catchment. HSPF is a lumped hydrologic
model that can simulate both watershed hydrology and water quality continuously.
The time series of simulated streamflow is obtained by inputting a set of mean
areal meteorological time series for the land segment. The input time series data
consists of daily data of precipitation and potential evapotranspiration, and hourly
data of air temperature, dew point temperature, wind movement, cloud cover, and
solar radiation. For calibration, daily streamflow records were obtained at Strat-
ford, Iowa, from U.S. Geological Survey (USGS). Precipitation and air temperature
data obtained from the National Climatic Data Center were interpolated over the
basin. The dew point temperature, wind movement, and cloud cover were obtained
for three surface airways stations from National Center for Atmospheric Research
(NCAR). The solar radiation and potential evapotranspiration time series were es-
timated based on the air temperature, dew point temperature, wind movement, and
cloud cover data (see Shuttleworth, 1993).
The HSPF model was calibrated at Stratford with two objective functions: the
first is the root mean squared error of the simulated and observed flows, and
the second is the root mean squared error of the logarithms of these flows.
Both objective functions were evaluated using weekly time step flows. To automate
the calibration of HSPF model parameters, the Shuffled Complex Evolution global
optimization method (SCE-UA) was applied.
2.2 Probabilistic Forecasting System
The experimental forecasting system implemented in this research is based
on Extended Streamflow Prediction (ESP) (Day 1985). ESP produces probabilistic
forecasts by statistical analysis of an ensemble of possible future realizations. This is
the same concept the NWS uses in AHPS.
To explain the basic idea of ESP, an example of streamflow forecasts is shown.
Assume the present time is June 1st 1965, and a forecast will be made of June-
September flow volume. ESP assumes that historical meteorological time series
represent possible realizations in the future. One streamflow trace is simulated by
inputting each historical meteorological time series into HSPF, using the current
Figure 2.1: Map of Des Moines River Basin.
watershed conditions as the initial conditions. Since 48 years of historical record
are available (from 1948 to 1996, excluding the current year), 48 streamflow traces
are obtained (Figure 2.2). As June-September volume is of interest in this example,
flow volumes are computed from the streamflow traces. Then, the cumulative dis-
tribution function of the ensemble traces is estimated by weighting each trace, using
the method proposed by Smith et al. (1992). Finally, the probability distribution
forecasts are produced for June-September volume in terms of nonexceedance prob-
ability (Figure 2.3); for any value of the volume (threshold), the likelihood of the
event whose volume is less than or equal to the threshold is obtained.
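The construction of a nonexceedance-probability forecast from ensemble volumes can be sketched as below. The trace-weighting scheme of Smith et al. (1992) is not reproduced here; equal trace weights with the Weibull plotting position i/(n + 1) stand in for it, and the function names and volumes are hypothetical.

```python
def forecast_cdf(ensemble_volumes):
    """Empirical nonexceedance distribution G_t(y) from ensemble flow volumes.

    Equal trace weights with the Weibull plotting position i/(n+1) stand in
    for the trace-weighting scheme of Smith et al. (1992)."""
    vols = sorted(ensemble_volumes)
    n = len(vols)
    return [(v, i / (n + 1)) for i, v in enumerate(vols, start=1)]

def nonexceedance(ensemble_volumes, threshold):
    """G_t(threshold): weighted fraction of traces with volume <= threshold."""
    n = len(ensemble_volumes)
    count = sum(1 for v in ensemble_volumes if v <= threshold)
    return count / (n + 1)

# 48 traces would come from HSPF runs; here a few illustrative volumes (cfs-days).
volumes = [150e3, 220e3, 310e3, 390e3, 470e3, 600e3, 820e3]
print(nonexceedance(volumes, 376212.0))
```

In the actual system each of the 48 HSPF traces would first be aggregated to a June-September volume before the distribution is formed.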
2.3 Proposed Verification Approach
2.3.1 Forecasts for a Discrete Event
From the framework of ESP explained above, probability distribution forecasts
are obtained in terms of nonexceedance probability. The mathematical definition
Figure 2.2: Ensemble traces simulated for forecast on 1 June 1965 (streamflow in cfs-days versus days after June 1, Des Moines River at Stratford).
Figure 2.3: Probability distribution forecast for June-September volume (volume in cfs-days versus nonexceedance probability in percent), Des Moines River near Stratford, Iowa. The ensemble traces are simulated with the current conditions as of 1 June 1965.
of the forecasts is given as:
Gt(y) = P{Y ≤ y|αt}, (2.1)
where P{Y ≤ y|αt} is the probability that the forecast variable Y , for example
monthly streamflow volume, is less than or equal to some threshold y, conditioned
on the state of the hydroclimatic system αt at a certain forecast date t. Obviously,
it is not straightforward to verify the forecasts in the form of Gt(y). The follow-
ing discusses the approach taken for the verification of the probability distribution
forecasts.
First, consider a discrete event Y ≤ yp where yp has the climatological nonex-
ceedance probability p. The probabilistic forecast for the event Y ≤ yp is simply
given as
f(yp) = Gt(yp). (2.2)
Then, the corresponding observation can be discretized as:
x(yp) = 1, Y ≤ yp
= 0, Y > yp. (2.3)
Therefore, pairs of probabilistic forecasts f(yp) and discrete observations x(yp)
are obtained for the discrete event Y ≤ yp. The pairs that are used to esti-
mate the joint distribution of f(yp) and x(yp) are called a verification dataset.
From one verification dataset, one set of measures of forecast quality for a thresh-
old yp is computed. To evaluate the quality of probability distribution forecast
over the range of possible outcomes, this research uses nine thresholds yp with
p = 0.05, 0.10, 0.25, 0.33, 0.50, 0.66, 0.75, 0.90, 0.95. Hence, nine verification datasets
are obtained.
2.3.2 Verification Dataset
One verification dataset for threshold yp is made up of the N pairs of forecasts
f(yp) and observations x(yp). It is important to note that one verification dataset
contains only a portion of the information needed to obtain a complete picture of the
forecast quality. Table 2.1 shows an example of a verification dataset for June-September vol-
ume forecasts. The forecasts are issued on June 1st every year for the event Y ≤ yp
with the threshold yp = 376212 (p = 0.66). According to the probability distribution
forecast shown in Figure 2.3, the probabilistic forecast issued on June 1st, 1965 for
this event is f(yp) = 0.350422. The volume observed from June to September 1965 is
377897 cfs-days, which is slightly greater than yp. Therefore, the corresponding discrete
observation x(yp) is equal to 0, indicating that the event did not occur.
Table 2.1: Example of verification dataset for June-September volume forecasts.

Date of Forecast(a)   f(yp)      x(yp)(b)   Obs.(c) Y (cfs-d)
1949/06/01            0.846400   1           72315
1950/06/01            0.723477   1          198444
1951/06/01            0.518131   0          677080
1952/06/01            0.761163   1          259610
1953/06/01            0.766671   1          303072
:                     :          :               :
1964/06/01            0.847121   1          229902
1965/06/01            0.365253   0          377897
1966/06/01            0.762934   1          132601
1967/06/01            0.930124   1          301787
:                     :          :               :

(a) The forecasts were issued on June 1st every year.
(b) The threshold for the forecasts, yp = 376212, is the 0.66 quantile.
(c) Obs. is the observed June-September volume.
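Assembling a verification dataset like Table 2.1 can be sketched as follows. The representation of each probability distribution forecast as a callable CDF, and all numerical values, are illustrative assumptions.

```python
def verification_dataset(forecast_cdfs, observed_volumes, threshold):
    """Pairs (f(yp), x(yp)) for the discrete event Y <= yp.

    forecast_cdfs: one callable G_t per forecast date, mapping a volume y
    to the nonexceedance probability P{Y <= y}.
    observed_volumes: the observed volume Y for each forecast date."""
    pairs = []
    for G, y_obs in zip(forecast_cdfs, observed_volumes):
        f = G(threshold)                    # probabilistic forecast for Y <= yp
        x = 1 if y_obs <= threshold else 0  # discrete observation
        pairs.append((f, x))
    return pairs

# Illustrative: two forecast dates with hypothetical CDFs and observations.
yp = 376212.0  # 0.66-quantile threshold from Table 2.1
cdfs = [lambda y: min(y / 1.0e6, 1.0), lambda y: min(y / 5.0e5, 1.0)]
obs = [377897.0, 198444.0]  # first observation slightly exceeds yp, so x = 0
print(verification_dataset(cdfs, obs, yp))
```

Repeating this for the nine thresholds yields the nine verification datasets described above.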
2.4 Discussion
Research on ensemble forecasting is more extensive in the meteorological field,
where an ensemble forecasting system is often called an Ensemble Prediction
System (EPS); in the hydrological field it is called Extended Streamflow Prediction
(ESP). The main difference between current EPS and ESP is the way the ensemble
traces are produced. Figure 2.4 shows the current version of ESP used by the NWS.
NWS has extended the original idea to facilitate incorporation of climate outlooks
into the ESP (Perica, 1998). The NWS ESP program produces ensemble traces
by inputting historical meteorological events adjusted with meteorological and cli-
matological forecasts, and deterministic precipitation forecasts. Another way to
incorporate climate outlooks and probabilistic meteorological forecasts is to adjust
the weights of ensemble traces simulated with historical meteorological events (Croley,
2000). Further investigation of the incorporation of climate and meteorological
forecasts into ESP is needed.
On the other hand, in meteorological research, the ensembles of geopotential
heights, temperatures, or moisture are created from slightly different initial condi-
tions. The main methods to generate the initial conditions of the ensemble members
are (1) Monte Carlo methods, (2) methods which generate perturbations dynam-
ically constrained by the flow of the day, including breeding and singular vectors,
(3) the perturbed observations method which uses data assimilation cycles with
random errors, and (4) methods which make perturbations by varying the model
parameterizations of subgrid-scale physical processes (Hou et al. 1998). These en-
sembles of meteorological variables could be directly input into a hydrological model
to produce an ensemble of streamflow.
As mentioned in Chapter 1, AHPS provides probabilistic forecasts which in-
dicate the exceedance probability of certain levels over the next 90 days. In the
meteorological field, ensemble traces have been utilized mainly in the following four
ways (Anderson 1996): (1) use the ensemble mean forecast as a substitute for a sin-
gle discrete forecast; (2) produce a small, easily understood set of forecast states by
clustering algorithms; (3) make a priori predictions of forecast skill, that is, figure
out the relation between ensemble spread and skill of the control forecast; and (4)
examine the entire ensemble to extract as much information as possible. For ex-
ample, the quantitative precipitation forecasts are given by exceedance probability
for some continuous thresholds. In fact, this is the same method that AHPS utilizes.
Recently, most effort has been devoted to item (4).
As shown, many aspects of ensemble forecasting are common to the meteorological
and hydrological fields. Cooperation between researchers in these fields is
necessary to improve ensemble forecasts of streamflow.
2.5 Summary and Conclusions
This research utilizes an experimental forecasting system that has been devel-
oped for the Upper Des Moines River basin. The discharge at Stratford, Iowa, was
[Figure 2.4 is a schematic of the NWS Extended (Ensemble) Streamflow Prediction procedure: historical series of precipitation and temperature (1948-1995) are adjusted with meteorological forecasts and climate outlooks (1- to 5-day, 6- to 10-day, monthly, and 13 3-month outlooks) through the NWS adjustment procedure; the adjusted series, the current conditions (snowpack, soil moisture, streamflow, reservoir levels), and a 24(48)-hour deterministic precipitation forecast are input to the NWS hydrologic model, which produces streamflow traces grouped by probability range (0-25%, 25-50%, 50-75%).]

Figure 2.4: Schematic of the current Extended Streamflow Prediction System. Source: Perica, Sanja, Integration of Meteorological Forecasts/Climate Outlooks Into Extended Streamflow Prediction (ESP) System, http://www.nws.noaa.gov/oh/hrl/papers/ams/ams98-6.htm (accessed March 10, 1998).
chosen since long-term forecasts of inflow into the downstream reservoir are impor-
tant for operations. The Upper Des Moines River basin drains about 14,120 km2,
and the gently rolling terrain was formed by continental glaciation and subsequent
erosion. In 1993, the record-breaking peak streamflow of 42,300 cfs was observed.
The Hydrological Simulation Program-Fortran (HSPF), which is a lumped
hydrologic model, was applied to the Upper Des Moines River basin. Sets of mean
areal time series data, such as daily precipitation, potential evapotranspiration,
hourly data of air temperature, and so on, were produced from various sources.
The HSPF model was calibrated at Stratford with two objective functions, and the
optimum parameters were obtained automatically with the Shuffled Complex Evolution
global optimization method (SCE-UA).
The experimental forecasting system is based on the idea of Extended Stream-
flow Prediction (ESP). The historical meteorological time series were input into
the HSPF model, and streamflow was simulated using the current hydroclimatic
conditions on the forecast date as the initial conditions. The simulated streamflow
outputs, called ensemble traces, are assumed to represent different possible
realizations of the future. Finally, the probability distribution forecast for the
forecast date, expressed in nonexceedance probability, was produced by statistical
analysis of the ensemble traces.
The problem is to verify the forecast over a continuous range of streamflows, since
the probability distribution forecast gives a probabilistic forecast for any possible
outcome. One solution is to consider a discrete event in which the forecast variable
is less than or equal to a threshold. For this event, one probabilistic forecast is
derived from the probability distribution forecast in terms of nonexceedance
probability. The corresponding continuous observation is converted into 0 or 1;
0 indicates the event did not occur, and 1 means the event occurred. One pair of a
probabilistic forecast and a discrete observation is obtained for every forecast
date. Thus, one verification dataset for a threshold contains as many pairs of
forecasts and observations as there are forecast dates. Since nine quantiles of the
observations are used as thresholds covering the possible outcomes, nine verification
datasets are computed. Investigation of these nine verification datasets can be
considered equivalent to an examination of the forecast quality of the probability
distribution forecast.
CHAPTER 3
VERIFICATION APPROACH
The proposed approach for verification of ensemble streamflow predictions
involves selecting discrete events. The probabilistic forecast for an event – a forecast
variable is less than or equal to a threshold – is obtained from the probability
distribution forecast. The corresponding continuous observation is also converted
into a discrete number: 1 indicates that the event occurred, 0 means that the
event did not occur. The verification dataset for the event consists of the pairs
of probabilistic forecasts and discrete observations. Using the verification datasets
derived for discrete events, forecast quality of the probability distribution forecast
can be assessed over the range of possible outcomes.
In this chapter, a distributions-oriented (DO) approach for forecast verification
is described. The DO approach is extended to the case of continuous probabilistic
forecasts, with parametric and nonparametric techniques used to estimate the joint
distribution of forecasts and observations. The technical methods are then described
in detail, followed by a discussion of DO measures and other common measures for
forecast verification. The technical methods in the extended DO approach will be
assessed in the next chapter.
3.1 Introduction
Verification procedures can be classified into two categories (Murphy, 1997):
a measures-oriented (MO) approach and a distributions-oriented (DO) approach.
The MO approach is traditionally used in the verification process. As the name
implies, this approach emphasizes calculating quantitative measures of only one or
two aspects of forecast quality, such as bias, accuracy, or skill, and then draws
conclusions based on these measures. In most cases, the mean squared error (hereafter
referred to as MSE) and the correlation coefficient (CC) are used as accuracy
measures. However, CC was shown to be a measure of potential skill by Murphy et
al. (1989). Although many verification measures had been developed, until the
1980’s the investigation of the relationships between measures, examination of their
relative strengths and weaknesses, or general concepts about verification itself had
not been studied extensively (Murphy and Winkler, 1987). For example, Barnston
(1992) showed the nonlinear, one-to-one relationship between CC and RMSE for
standardized forecasts and observations, and the significant variation of the mean
correspondence between CC and Heidke score with the number of equally likely
Heidke categories. Murphy (1995) concluded that the coefficient of determination
is superior to the CC as a measure of linear association, and that neither is a
proper measure of skill.
The DO approach was developed in the 1980’s. Since then, the DO approach
has played an important role, especially in the verification of meteorological fore-
casts. For instance, the diagnostic verification of Climate Prediction Center Long-
Lead Outlooks has been done with this approach (Wilks 2000). The forecasts made
by human forecasters and guidance products from numerical weather prediction
models were investigated (Brooks and Doswell, 1996). Also, the verification of the
forecasts produced based on the Ensemble Prediction System (EPS) has been done
(e.g., Hamill and Colucci 1997, and Hou et al. 1998). The DO approach involves
the use of the joint distribution of forecasts and observations, from which all the
measures of forecast quality are derived systematically. The reason why the DO ap-
proach is preferable is that it gives insights on forecast quality from various aspects
and allows the user to identify the situations in which forecast performance may be
weak or strong, something the MO approach fails to do (Brooks and Doswell III,
1996).
The major difficulty in applying this approach to verification stems from the
estimation of the joint distribution. Two fundamental characteristics of the verifi-
cation problem are complexity and dimensionality, which are quantitatively defined
by Murphy (1991). Complexity is defined by the number of factorizations (CF ),
number of basic factors in each factorization (CBF ), or total number of basic factors
(CTBF ). For example, in Absolute Verification (AV), where one kind of observation
and one forecasting system are examined, the joint distribution can be factorized
into one conditional and one marginal distribution. Thus, CF = 2, CBF = 2, and
CTBF = 4. On the other hand, the general definition of dimensionality D is that
D is the number of degrees of freedom in order to estimate the joint distribution of
forecasts and observations. In the case where a forecasting system uses nx categories
for observations and nf for forecasts, the dimensionality D is defined as
D = nf × nx − 1. (3.1)
For example, when a forecast is issued in 11 categories from 0 to 1 at 0.1 intervals
for a dichotomous (two-category) observation, the verification problem has the
dimension D = 11 × 2 − 1 = 21. Then, given 50 pairs of forecasts and observations,
which is not unusual with hydrological variables, it is no wonder that some bins may
not have enough (or any) subsamples to estimate the joint distribution. Hence, most
verification problems suffer from the “curse of dimensionality” (Murphy, 1997). In
Chapter 4, techniques to reduce the dimensionality (but not the complexity) are
investigated. This chapter describes the measures of the DO approach tailored to
probabilistic forecasts and dichotomous observations, and their estimators, in such
a way that the dimensionality of the joint distribution is reduced. As a result, all
of the DO measures are derived from six basic variables and one integral.
3.2 Distributions-Oriented Measures
One can derive the distributions-oriented (DO) measures from the joint distribution
of forecasts and observations p(f, x), where f is the probabilistic forecast
issued for an event that forecast variable Y is equal to or less than a threshold yp
(f = f(yp)), and x is the corresponding discrete observation (x = x(yp)), which
takes on 1 for occurrence of the event and 0 for no occurrence. The measures de-
rived from the joint distribution can be examined over the range of thresholds for
which the forecasts are issued.
To cast light on understanding of the joint distribution from various aspects,
it can be factorized into one conditional and one marginal distribution in two ways
(Murphy and Winkler, 1987):
CR factorization: p(f, x) = q(x|f)s(f) (3.2)
LBR factorization: p(f, x) = r(f |x)t(x). (3.3)
The calibration-refinement (CR) factorization is used more often and is easier to
understand, partly because a forecast is issued first and the observation is then
compared with it (Murphy and Winkler 1987, Brooks and Doswell III 1996). On the
other hand, it is easier to reconstruct the marginal and conditional distributions
of the likelihood-base rate (LBR) factorization, since the observation random
variable x takes on only 1 or 0. This research mainly utilizes the following
measures described in Murphy
(1997).
3.2.1 Bias
The expected value µf for the marginal distribution of the forecasts s(f) and
the expected value µx for the marginal distribution of the observations t(x) are
utilized to characterize the unconditional bias defined as Mean Error (ME):
ME = µf − µx. (3.4)
3.2.2 Accuracy
A measure of the Accuracy of the forecasts is the mean squared error (MSE),
which is defined using the joint distribution p(f, x) as:

MSE(f, x) = Σ_f Σ_x p(f, x)(f − x)². (3.5)

One decomposition of the MSE can be written as (Murphy, 1988):

MSE(f, x) = (µf − µx)² + (σf − σx)² + 2(1 − ρf,x)σfσx. (3.6)
The last term, called dispersion error, may be considered the most important mea-
sure of the forecast error, because it cannot be calibrated out (Hou et al., 1998).
The Skill of the forecast is the accuracy relative to a reference forecast
methodology. The skill score using climatology as a reference (i.e., the forecast
f is the unconditional mean µx) is:

SSMSE(f, µx, x) = 1 − [MSE(f, x)/σ²x], (3.7)

where σ²x is the variance of the observations. A decomposition of SSMSE,

SSMSE(f, µx, x) = ρ²fx − [ρfx − (σf/σx)]² − [(µf − µx)/σx]², (3.8)

consists of a measure of potential skill (the first term), a relative measure of
reliability, also known as Type 1 conditional bias (the second term), and a relative
measure of unconditional bias (the third term). The third term is better than ME
when the unconditional bias is compared over the possible outcomes.
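The measures ME, MSE, and SSMSE of Equations (3.4), (3.5), and (3.7) can be computed directly from a verification dataset. A minimal sketch follows; the forecast-observation pairs are illustrative, not from the thesis dataset.

```python
def me(pairs):
    """Mean error (unconditional bias), Eq. (3.4)."""
    n = len(pairs)
    mu_f = sum(f for f, _ in pairs) / n
    mu_x = sum(x for _, x in pairs) / n
    return mu_f - mu_x

def mse(pairs):
    """Mean squared error, the sample analogue of Eq. (3.5)."""
    return sum((f - x) ** 2 for f, x in pairs) / len(pairs)

def skill_score(pairs):
    """MSE skill score against climatology, Eq. (3.7)."""
    n = len(pairs)
    mu_x = sum(x for _, x in pairs) / n
    var_x = sum((x - mu_x) ** 2 for _, x in pairs) / n
    return 1.0 - mse(pairs) / var_x

pairs = [(0.8464, 1), (0.7235, 1), (0.5181, 0), (0.7612, 1), (0.3653, 0)]
print(me(pairs), mse(pairs), skill_score(pairs))
```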
3.2.3 Calibration-Refinement Measures
Given a specific probability forecast f , certain aspects of the distribution
of observations x are desirable. The Calibration-refinement factorization, which is
conditional on the forecasts, can be used to explore these aspects of forecast quality.
Reliability (Type 1 conditional bias) describes the bias of the observations
given a forecast f . Forecasts that are conditionally unbiased are desirable. One
measure of this conditional bias is:
REL = Ef[(µx|f − f)²] (3.9)
where Ef denotes the expected value with respect to the distribution of the forecasts
and µx|f is the expected value of the observations conditional on the forecasts.
Resolution indicates the degree to which the mean observation for a specific
forecast f differs from the unconditional mean (or climatological probability). Fore-
casts with large differences (higher resolution) are more desirable. One measure of
the resolution is:
RES = Ef[(µx|f − µx)²] (3.10)
The connection between the reliability and resolution of the forecasts, and the
MSE (or skill) of the forecasts, can be seen through a decomposition of the MSE
into its components. Conditioning on the forecasts leads to the so-called
calibration-refinement (CR) decomposition:

MSECR(f, x) = σ²x + REL − RES, (3.11)

where σ²x, the variance of the observations, measures the inherent uncertainty. If
the p-quantile of the observations is used as the threshold for which forecasts are
issued, the uncertainty is obtained analytically as:

σ²x = p(1 − p). (3.12)

In the case of perfect forecasts, since MSE = 0 and REL = 0, RES = σ²x.
Substituting the CR decomposition into SS (Equation (3.7)) gives

SS = RES/σ²x − REL/σ²x. (3.13)
This research evaluates the measures of forecast quality over the range of possible
outcomes. It is more insightful to use measures of Resolution and Reliability
relative to the Uncertainty of the events. Thus, RES/σ²x is referred to as Relative
Resolution (RRES) and REL/σ²x is called Relative Reliability (RREL). Perfect
forecasts have RRES = 1 and RREL = 0.
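The CR decomposition of Equation (3.11) can be illustrated by conditioning on each distinct forecast value. This is a sketch only (the thesis deliberately avoids exactly this discretization for small samples, as discussed in Section 3.3); with these sample definitions the identity MSE = σ²x + REL − RES holds exactly.

```python
def cr_decomposition(pairs):
    """Calibration-refinement decomposition of the MSE, Eq. (3.11),
    conditioning on each distinct forecast value."""
    n = len(pairs)
    mu_x = sum(x for _, x in pairs) / n
    unc = sum((x - mu_x) ** 2 for _, x in pairs) / n   # uncertainty sigma_x^2
    rel = res = 0.0
    for f0 in set(f for f, _ in pairs):
        sub = [x for f, x in pairs if f == f0]
        s_f = len(sub) / n                 # relative frequency s(f)
        mu_x_f = sum(sub) / len(sub)       # conditional mean of observations
        rel += s_f * (mu_x_f - f0) ** 2    # reliability term
        res += s_f * (mu_x_f - mu_x) ** 2  # resolution term
    return unc, rel, res

pairs = [(0.2, 0), (0.2, 0), (0.2, 1), (0.8, 1), (0.8, 1), (0.8, 0)]
unc, rel, res = cr_decomposition(pairs)
mse_direct = sum((f - x) ** 2 for f, x in pairs) / len(pairs)
print(abs(mse_direct - (unc + rel - res)) < 1e-12)  # identity (3.11) holds
```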
3.2.4 Likelihood-Base Rate Measures
Given a specific discrete observation x (i.e., the event occurs or it does not),
certain aspects of the distribution of the probabilistic forecasts f are desirable. The
likelihood-base rate factorization, which conditions on the observations, can be used
to explore these aspects of forecast quality.
Discrimination describes the degree to which the forecasts differ for a specific
observation x (x = 1, or x = 0). Forecasts with larger differences (higher discrimi-
nation) are more desirable. One measure of the discrimination is:
DIS = Ex[(µf|x − µf)²] (3.14)
where Ex is the expected value with respect to the distribution of the observations
and µf |x is the expected value of the forecasts given an observation.
In the same way as CR decomposition, the connection between the discrimi-
nation of forecasts, and the MSE (or skill) can be seen through likelihood-base rate
(LBR) decomposition:

MSELBR(f, x) = σ²f + Ex[(µf|x − x)²] − DIS. (3.15)

The first term in the decomposition measures the sharpness of the forecasts. The
second term is a measure of the bias of the forecasts conditioned on the observations,
which is called the Type 2 conditional bias:

TY2 = Ex[(µf|x − x)²]. (3.16)

The Sharpness, σ²f, is a measure of the degree to which the probability forecasts
approach 0 and 1. Higher sharpness indicates more confidence in the forecast
outcome.
Substituting the LBR decomposition into SS (Equation (3.7)) gives

SS = 1 + DIS/σ²x − σ²f/σ²x − TY2/σ²x. (3.17)

Again, DIS/σ²x, σ²f/σ²x, and TY2/σ²x are referred to as Relative Discrimination
(RDIS), Relative Sharpness (RS), and Relative Type 2 Conditional Bias (RTY2). A
perfect forecast system has RDIS = 1, RS = 1, and RTY2 = 0.
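A corresponding sketch of the LBR decomposition, Equation (3.15), conditions on the two observation values; the pairs are again illustrative.

```python
def lbr_decomposition(pairs):
    """Likelihood-base rate decomposition of the MSE, Eq. (3.15):
    MSE = sigma_f^2 + TY2 - DIS."""
    n = len(pairs)
    mu_f = sum(f for f, _ in pairs) / n
    sharp = sum((f - mu_f) ** 2 for f, _ in pairs) / n   # sharpness sigma_f^2
    ty2 = dis = 0.0
    for x0 in (0, 1):
        sub = [f for f, x in pairs if x == x0]
        if not sub:
            continue                        # observation value never occurred
        t_x = len(sub) / n                  # base rate t(x)
        mu_f_x = sum(sub) / len(sub)        # conditional mean of forecasts
        ty2 += t_x * (mu_f_x - x0) ** 2     # Type 2 conditional bias, Eq. (3.16)
        dis += t_x * (mu_f_x - mu_f) ** 2   # discrimination, Eq. (3.14)
    return sharp, ty2, dis

pairs = [(0.2, 0), (0.3, 0), (0.7, 1), (0.9, 1)]
sharp, ty2, dis = lbr_decomposition(pairs)
mse_direct = sum((f - x) ** 2 for f, x in pairs) / len(pairs)
print(abs(mse_direct - (sharp + ty2 - dis)) < 1e-12)  # identity (3.15) holds
```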
3.3 Estimation of Measures
This research aims to verify the forecasting system using the observations, which
follow the Bernoulli distribution described in Chapter 2, and the forecasts, which
are expressed as probabilities. In practice, a forecast is often issued as one of a
discrete set of values. Discrete forecasts may be used because of the limited
accuracy of the physical model or measurement, as an effort to reduce the cost of
recording data (forecasts), or because of the difficulty in verification related to
dimensionality. However, discretizing a forecast that is issued as a continuous
number for the sake of verification raises questions. How
does the discretization, especially with a small sample size, affect the measures
of forecast quality? How much information about the forecasting system is lost or
distorted by discretization? To answer these questions, results obtained by
discretizing the forecasts will be compared in Chapter 4 with results obtained by
treating the forecasts as continuous numbers. In this section, estimators for the
DO measures above are presented so that they can be computed in a continuous manner.
3.3.1 Basic Statistics
Certain properties are desirable in estimators of forecast quality measures.
Estimators that are unbiased with low variance are best. Given the small
sample sizes, estimators that utilize the entire sample of forecasts and/or observa-
tions would be better than those based on conditional subsamples. If conditional
subsamples are used, estimators for low-order moments would be better than those
for higher-order moments. Based on these considerations, the basic moments chosen
to be estimated are the mean and variance of the observations, µx and σ²x, the
mean and variance of the forecasts, µf and σ²f, and the conditional means of the
forecasts given the observations, µf|x=0 and µf|x=1. The four quantities related to
the marginal distributions can be estimated with the entire sample:
µx = (1/N) Σ_{i=1}^{N} xi (3.18)

σ²x = [1/(N − 1)] Σ_{i=1}^{N} x²i − [N/(N − 1)] µ²x (3.19)

µf = (1/N) Σ_{i=1}^{N} fi (3.20)

σ²f = [1/(N − 1)] Σ_{i=1}^{N} f²i − [N/(N − 1)] µ²f (3.21)

Then, since the observations are Bernoulli random variates,

t(x = 1) = µx (3.22)

t(x = 0) = 1 − µx. (3.23)
Next, divide the pairs (fi, xi), i = 1, · · · , N into two sets; one set has x = 0
(A) and another has x = 1 (B). Denote NA and NB the numbers of pairs included
in A and B, respectively.
µf|x=0 = (1/NA) Σ fi, (fi, xi) ∈ A (3.24)

µf|x=1 = (1/NB) Σ fi, (fi, xi) ∈ B (3.25)
Note that these two estimators can have significant uncertainty when the subsample
sizes are small.
These statistics are called basic because most forecast quality measures (except
the CR decomposition measures) can be derived from them. This is beneficial in that
the propagation of uncertainty can be examined more easily with a limited set of
estimators.
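The six basic statistics of Equations (3.18)-(3.25) can be estimated as below; a sketch with illustrative pairs that are not from the thesis dataset.

```python
def basic_statistics(pairs):
    """Six basic statistics of Section 3.3.1 (Eqs. 3.18-3.25).

    Variances use the unbiased N-1 forms of Eqs. (3.19) and (3.21)."""
    n = len(pairs)
    fs = [f for f, _ in pairs]
    xs = [x for _, x in pairs]
    mu_x = sum(xs) / n
    mu_f = sum(fs) / n
    var_x = (sum(x * x for x in xs) - n * mu_x ** 2) / (n - 1)
    var_f = (sum(f * f for f in fs) - n * mu_f ** 2) / (n - 1)
    set_a = [f for f, x in pairs if x == 0]   # subsample A: event did not occur
    set_b = [f for f, x in pairs if x == 1]   # subsample B: event occurred
    mu_f_x0 = sum(set_a) / len(set_a)         # Eq. (3.24)
    mu_f_x1 = sum(set_b) / len(set_b)         # Eq. (3.25)
    return mu_x, var_x, mu_f, var_f, mu_f_x0, mu_f_x1

pairs = [(0.85, 1), (0.72, 1), (0.52, 0), (0.76, 1), (0.37, 0)]
print(basic_statistics(pairs))
```

As the text notes, the two conditional means are the only quantities that rely on conditional subsamples, so they carry the largest sampling uncertainty.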
3.3.2 Other Derivative Estimators
First, estimators for the basic performance measures of forecast quality (namely,
ME, MSE, and SS) are expressed with the basic estimators µx, µf, σ²x, σ²f, µf|x=0,
and µf|x=1 discussed in the previous section. ME (see Section 3.2) is simply:

ME = µf − µx. (3.26)
The measure of accuracy is also estimated with the basic estimators:

MSE(f, x) = Σ_{x∈{0,1}} ∫₀¹ (f − x)² p(f, x) df
          = Σ_{x∈{0,1}} ∫₀¹ (f² − 2fx + x²) p(f, x) df
          = ∫₀¹ f² q(x = 0|f)s(f) df + ∫₀¹ f² q(x = 1|f)s(f) df
            − 2 ∫₀¹ f · 1 · r(f|x = 1)t(x = 1) df + ∫₀¹ 1² · r(f|x = 1)t(x = 1) df
          = ∫₀¹ f² s(f) df − 2 t(x = 1) ∫₀¹ f r(f|x = 1) df + t(x = 1) ∫₀¹ r(f|x = 1) df
          = (σ²f + µ²f) − 2 µx µf|x=1 + µx. (3.27)
Then, the MSE Skill Score (SSMSE) is estimated by Equation (3.7). To derive
the estimator for the correlation coefficient (CC), also called the potential skill
or association, first note that:

E[fx] = Σ_{x∈{0,1}} ∫₀¹ (fx) p(f, x) df
      = ∫₀¹ f r(f|x = 1)t(1) df
      = µx µf|x=1. (3.28)
Then the CC can be written as:
ρfx =E[fx]− E[f ]E[x]√
σ2fσ
2x
=µxµf |x=1 − µfµx√
σ2fσ
2x
. (3.29)
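These estimators can be checked numerically against direct sample computations. The sketch below uses plug-in (population, ddof = 0) moments so that the identities hold exactly in-sample; with the unbiased forms (3.19) and (3.21) they hold up to O(1/N) terms. The function name is illustrative.

```python
import numpy as np

def derived_measures(f, x):
    """ME (3.26), MSE (3.27), and CC (3.29) from the basic statistics."""
    f = np.asarray(f, dtype=float)
    x = np.asarray(x, dtype=float)
    mu_x, mu_f = x.mean(), f.mean()
    var_x, var_f = x.var(), f.var()          # plug-in (ddof=0) moments
    mu_f_x1 = f[x == 1].mean()
    me  = mu_f - mu_x                                            # (3.26)
    mse = (var_f + mu_f**2) - 2.0*mu_x*mu_f_x1 + mu_x            # (3.27)
    cc  = (mu_x*mu_f_x1 - mu_f*mu_x) / np.sqrt(var_f * var_x)    # (3.29)
    return me, mse, cc
```

Note that the sample value of E[fx] equals µ_x µ_{f|x=1} exactly, since the product fx is nonzero only for the pairs with x = 1.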
Next, the Type 2 conditional bias and discrimination from the LBR decomposition can be derived as:

E_x[(\mu_{f|x} - x)^2] = \sum_{x \in \{0,1\}} (\mu_{f|x} - x)^2 t(x) = (1 - \mu_x)\,\mu_{f|x=0}^2 + \mu_x (\mu_{f|x=1} - 1)^2   (3.30)

E_x[(\mu_{f|x} - \mu_f)^2] = \sum_{x \in \{0,1\}} (\mu_{f|x} - \mu_f)^2 t(x) = (1 - \mu_x)(\mu_{f|x=0} - \mu_f)^2 + \mu_x (\mu_{f|x=1} - \mu_f)^2   (3.31)
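The two LBR terms follow the same pattern and can be sketched as follows (illustrative names; plug-in estimates of t(x) are the subsample frequencies):

```python
import numpy as np

def lbr_terms(f, x):
    """Type 2 conditional bias (3.30) and discrimination (3.31)."""
    f = np.asarray(f, dtype=float)
    x = np.asarray(x, dtype=float)
    mu_x, mu_f = x.mean(), f.mean()
    mu_f_x0 = f[x == 0].mean()
    mu_f_x1 = f[x == 1].mean()
    ty2 = (1 - mu_x)*mu_f_x0**2 + mu_x*(mu_f_x1 - 1.0)**2             # (3.30)
    dis = (1 - mu_x)*(mu_f_x0 - mu_f)**2 + mu_x*(mu_f_x1 - mu_f)**2   # (3.31)
    return ty2, dis
```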
On the other hand, as mentioned before, the Reliability (REL) and Resolution (RES) from the CR decomposition require information about the marginal distribution s(f), since:

E_f[(\mu_{x|f} - f)^2] = \int_0^1 (\mu_{x|f} - f)^2 s(f) \, df
= \int_0^1 (\mu_{x|f}^2 - 2 f \mu_{x|f} + f^2) s(f) \, df
= \int_0^1 \mu_{x|f}^2 s(f) \, df - 2 \int_0^1 f \, r(f|x=1) t(1) \, df + \int_0^1 f^2 s(f) \, df
= \int_0^1 \mu_{x|f}^2 s(f) \, df - 2 \mu_x \mu_{f|x=1} + \sigma_f^2 + \mu_f^2   (3.32)
E_f[(\mu_{x|f} - \mu_x)^2] = \int_0^1 (\mu_{x|f} - \mu_x)^2 s(f) \, df
= \int_0^1 (\mu_{x|f}^2 - 2 \mu_x \mu_{x|f} + \mu_x^2) s(f) \, df
= \int_0^1 \mu_{x|f}^2 s(f) \, df - 2 \mu_x \int_0^1 r(f|x=1) t(1) \, df + \mu_x^2 \int_0^1 s(f) \, df
= \int_0^1 \mu_{x|f}^2 s(f) \, df - \mu_x^2.   (3.33)

(The last step uses \mu_{x|f} s(f) = t(1) r(f|x=1), so that \int_0^1 \mu_{x|f} s(f)\, df = t(1) = \mu_x.)
Thus, the problem of estimating the CR decomposition measures boils down to one of estimating \int_0^1 \mu_{x|f}^2 s(f) \, df.
3.3.3 Estimation of CR Decompositions

This subsection investigates several ways to estimate the integral

\int_0^1 \mu_{x|f}^2 s(f) \, df.   (3.34)

In the case where the forecasts are discretized, the integral becomes

\sum_{j=1}^{M} \mu_{x|f_j}^2 s(f_j),   (3.35)

where M is the number of bins into which the forecasts are discretized. From this equation, the natural estimator for the integral is the sample average of the \mu_{x|f_i}^2, or:

\frac{1}{N} \sum_{i=1}^{N} \mu_{x|f_i}^2.   (3.36)
Therefore, the simplest estimation of the integral is to use regression to estimate
µx|f from the set of pairs (fi, xi).
This approach suffices for computing the CR decomposition measures. However, the marginal distribution s(f) is informative in its own right and, in the spirit of the DO approach, should not be discarded. The marginal distribution s(f) can be expressed through the marginal and conditional distributions of the LBR factorization as:
s(f) = t(0)r(f |x = 0) + t(1)r(f |x = 1). (3.37)
The conditional distribution of the observations given the forecasts q(x|f) can also
be written as:
q(x|f) = \frac{t(x) r(f|x)}{s(f)} = \frac{t(x) r(f|x)}{t(0) r(f|x=0) + t(1) r(f|x=1)},   (3.38)
and then the expected value of the observations conditional on the forecasts is:

\mu_{x|f} = \sum_{x \in \{0,1\}} x \, q(x|f) = q(x=1|f) = \frac{t(1) r(f|x=1)}{s(f)}.   (3.39)
Thus, estimating the conditional distribution r(f|x) is another way to estimate s(f), µ_{x|f}, and the integral.

From the above discussion, the following three methods to estimate the integral are considered:
1. The Logistic Regression Method (LRM) estimates the conditional mean µ_{x|f} by logistic regression and utilizes Equation (3.36). Logistic regression is a suitable model when the response variable is binary. This approach directly estimates µ_{x|f} by fitting the logistic regression to the pairs of observations and forecasts; in essence, what is estimated is the conditional distribution q(x = 1|f) (and q(x = 0|f) = 1 − q(x = 1|f)). The integral is then obtained by equal weighting of the µ_{x|f_i}, i.e., Equation (3.36).
2. The Kernel Density estimation Method (KDM) estimates the conditional distribution r(f|x) by the kernel density estimation method. Equations (3.37) and (3.39) are used for the numerical integration of Equation (3.34). Thus, this approach uses the LBR factorization to reconstruct the joint distribution.
3. The Combination Method (CM) estimates the conditional mean µx|f by logistic
regression, and the marginal distribution s(f) by kernel density method; this
approach rebuilds the joint distribution through the CR factorization. Then,
Equation (3.34) is numerically integrated.
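Method 1 (LRM) can be sketched as follows. To keep the example self-contained, the logistic fit is done with a small Newton-Raphson (IRLS) loop rather than a statistics package; the function names are illustrative, not from the report.

```python
import numpy as np

def fit_logistic(f, x, iters=25):
    """Fit mu_{x|f} = 1 / (1 + exp(-(b0 + b1 f))) by Newton-Raphson (IRLS)."""
    X = np.column_stack([np.ones_like(f), f])
    beta = np.zeros(2)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        W = p * (1.0 - p)
        H = X.T @ (X * W[:, None]) + 1e-9 * np.eye(2)   # Hessian, tiny ridge
        beta = beta + np.linalg.solve(H, X.T @ (x - p))
    return beta

def cr_integral_lrm(f, x):
    """LRM estimate of the integral (3.34) via the sample average (3.36)."""
    f = np.asarray(f, dtype=float)
    x = np.asarray(x, dtype=float)
    b0, b1 = fit_logistic(f, x)
    mu_xf = 1.0 / (1.0 + np.exp(-(b0 + b1 * f)))
    return np.mean(mu_xf**2)
```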
The kernel density method is adopted because (1) since the correct distributional model is not known a priori, the error introduced by specifying a parametric model may be greater than the error a nonparametric model produces from a small sample size; and (2) the kernel density method is motivated by the limiting case of the averaged shifted histogram, which is a computationally and statistically efficient density estimator (Scott 1992).
For the first approach, the marginal distribution s(f) is also estimated by the kernel density estimation method. Although the estimation of µ_{x|f} in Equation (3.36) is enough to obtain the CR decomposition estimates, the marginal distribution s(f) graphically indicates the important aspect of the sharpness of the forecasts. The conditional distribution r(f|x), related to a measure of discrimination, is also obtained by
r(f|x=1) = \frac{s(f)\, \mu_{x|f}}{t(x=1)}   (3.40)

r(f|x=0) = \frac{s(f)\, (1 - \mu_{x|f})}{t(x=0)}.   (3.41)
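A sketch of the KDM reconstruction (Equations (3.37), (3.39), and the integral (3.34)) using Gaussian kernels is given below. Note that `gaussian_kde` places some probability mass outside [0, 1]; this simple version ignores that and integrates over the unit interval only. Function names are illustrative.

```python
import numpy as np
from scipy.stats import gaussian_kde
from scipy.integrate import trapezoid

def cr_integral_kdm(f, x, ngrid=401):
    """KDM estimate of (3.34): reconstruct s(f) and mu_{x|f} from kernel
    density estimates of r(f|x) via the LBR factorization."""
    f = np.asarray(f, dtype=float)
    x = np.asarray(x, dtype=float)
    t1 = x.mean()
    r0 = gaussian_kde(f[x == 0])                   # r(f|x=0)
    r1 = gaussian_kde(f[x == 1])                   # r(f|x=1)
    grid = np.linspace(0.0, 1.0, ngrid)
    s = (1.0 - t1) * r0(grid) + t1 * r1(grid)      # (3.37)
    mu_xf = t1 * r1(grid) / np.maximum(s, 1e-12)   # (3.39)
    return trapezoid(mu_xf**2 * s, grid)           # (3.34)
```

The Combination Method (CM) follows the same skeleton, with µ_{x|f} taken from the logistic fit instead of the kernel ratio.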
The above methods for estimating the CR decompositions are compared in Chapter 4 with the traditional method, referred to as DSC, which discretizes (or bins) the probabilistic forecasts into 11 or 12 bins. The technical methods used for LRM, KDM, CM, and DSC are described in Appendix A.
3.4 Example of Verification
The verification of June-September seasonal volume forecasts is used as an
example to illustrate the proposed verification approach. The forecasts were pro-
duced by the experimental forecasting system based on the Extended Streamflow
Prediction concept. The forecasting system issues the probability distribution forecast at Stratford, Iowa, every June 1st. By considering an event that the seasonal volume Y is less than or equal to a threshold y_p, a probabilistic forecast is then derived from the probability distribution forecast. The corresponding continuous observation is discretized to 0 for no occurrence of the event, or 1 for its occurrence. Since forecasts were issued each year from 1949 to 1996, 48 pairs of probabilistic forecasts and discrete observations make up one verification dataset. Nine quantiles y_p
are used as thresholds, so the probability distribution forecasts are assessed overall
through nine verification datasets.
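The construction of the nine verification datasets can be sketched as follows. Here each probability distribution forecast is represented by an array of ensemble traces, which is an assumption made for illustration; the names are hypothetical.

```python
import numpy as np

def verification_datasets(ensembles, obs, thresholds):
    """Turn distribution forecasts into (f, x) pairs, one dataset per threshold.

    ensembles : (n_years, n_traces) array of forecast traces of seasonal volume
    obs       : (n_years,) observed seasonal volumes
    thresholds: event thresholds y_p (e.g., climatological quantiles)
    """
    datasets = {}
    for yp in thresholds:
        f = (ensembles <= yp).mean(axis=1)   # forecast prob. of the event Y <= y_p
        x = (obs <= yp).astype(float)        # 1 if the event occurred, else 0
        datasets[yp] = (f, x)
    return datasets
```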
The following examines the measures of forecast quality, and distributions
of forecasts and observations that were introduced in Section 3.2. The integral
necessary for computing the CR decompositions is estimated by LRM (see Appendix A for details). The measures of forecast quality are plotted against the nonexceedance probability p corresponding to the threshold y_p, which can simply be thought of as the magnitude of the flow event.
3.4.1 Absolute and Relative Measures
First, absolute measures are examined. Mean Error (ME) in the left of Figure 3.1 shows that the forecasting system tends to underestimate the occurrence of low flow events, and overestimate the occurrence of moderate and high flow
Figure 3.1: Mean Error (ME, left) and Mean Square Error (MSE, right) for June-September seasonal volume forecasts, plotted against nonexceedance probability p.
events. Mean Squared Error (MSE) in the right of Figure 3.1 shows a downward concave shape; the low and high flow events have better absolute accuracy than moderate flow events. Figure 3.2 shows the CR and LBR decompositions of MSE for June-September seasonal volume forecasts. The estimated Uncertainty almost matches the analytical solution σ²_x = p(1 − p). The Reliability is small and nearly constant, whereas moderate events have more Resolution. The LBR decompositions also have greater values for moderate flow events.
Since the magnitudes of the absolute measures strongly depend on p, one should not evaluate the forecasting system over the range of outcomes using these absolute measures alone. Relative measures compare the forecasts for each event with the climatology forecast µ_x. The CR decompositions of the MSE Skill Score are shown in the upper left of Figure 3.3. Examination of the MSE Skill Score indicates that the forecasts for moderate flow events have more skill than those for extreme flow events. Relative Reliability (RREL) shows that the reliability is almost the same over the range of outcomes, in the sense of its contribution to the Skill Score, while Relative Resolution (RRES) indicates that the forecasts for p = 0.25 have the highest resolution. The LBR decompositions of the MSE Skill Score are shown in the upper right of Figure 3.3. The low and high flow events have more Relative Type 2 Conditional Bias (RTY2), which cannot be seen from the absolute Type 2 Conditional Bias (TY2); its magnitude is much larger than that of the other terms. According to Relative Sharpness (RS) and Relative Discrimination (RDIS), the forecasts for moderate flow events have more sharpness and discrimination than those for high and low flow events. Examination of the lower left of Figure 3.3 illustrates that the potential
Figure 3.2: CR (left) and LBR (right) decompositions of MSE for June-September seasonal volume forecasts: Uncertainty, Reliability (Type 1 Conditional Bias), and Resolution on the left; Sharpness, Type 2 Conditional Bias, and Discrimination on the right, each plotted against nonexceedance probability p.
Figure 3.3: Various decompositions of the MSE Skill Score for June-September seasonal volume forecasts. The upper left shows the CR decomposition: Relative Resolution (RRES) and Relative Reliability (RREL). The upper right shows the LBR decomposition: Relative Discrimination (RDIS), Relative Sharpness (RS), and Relative Type 2 Conditional Bias (RTY2). The lower left shows Potential Skill, Reliability Measure, and Unconditional Bias Measure.
skill of extreme low flow events is lower than that of extreme high flow events.
3.4.2 Marginal and Conditional Distributions
The marginal and conditional distributions of forecasts and observations pro-
vide more details of forecast quality than scalar measures shown above. The main
diagrams to display these distributions are called Reliability diagrams and Discrimination diagrams (Murphy 1997; Wilks 1995).
Figure 3.4 shows the Reliability diagram consisting of the marginal distribu-
tion s(f) and conditional distribution q(x = 1|f) = µx|f . The marginal distribution
s(f) is estimated by the kernel density estimation method, and the conditional distribution µ_{x|f} is obtained by fitting a logistic regression. The marginal distribution s(f) indicates how sharp (or confident) this forecasting system is. No density lies at f = 1, and most of the density concentrates near f = 0. Since perfect forecasting systems have mass points at 0 and 1, this system's forecasts are not very sharp.
In the Reliability diagram, Resolution measures the distance between the sample points and the line µ_{x|f} = µ_x (no resolution), whereas Reliability measures the distance between the sample points and the line µ_{x|f} = f (perfect reliability). It can be seen that the forecasting system tends to overestimate the occurrence of events when it issues probabilistic forecasts below about 0.37; for probabilistic forecasts above that, it underestimates the occurrence of events. Comparison with the no-resolution line indicates that this system has good Resolution. From Equation (3.13), the straight line midway between the line µ_{x|f} = f and the line µ_{x|f} = µ_x is called the no-skill line. Subsamples of forecasts contribute positively to the Skill Score if the corresponding points (f, µ_{x|f}) lie to the right (left) of the vertical line f = µ_x and above (below) the no-skill line (see Murphy 1997). The negative contributions to the Skill Score occur around the intersection of the lines µ_{x|f} = f and µ_{x|f} = µ_x.
The Discrimination diagram comprises the conditional distributions of forecasts given observations r(f|x) and the marginal distribution of observations t(x) (Figure 3.5). The conditional distributions r(f|x = 0) and r(f|x = 1) are estimated from Equations (3.40) and (3.41). For a perfect forecasting system, the modes of r(f|x = 0) and r(f|x = 1) would be located at f = 0 and f = 1, respectively. Here, therefore, the forecasts when events occur (x = 1) are worse than those when events do not occur.
3.5 Discussion
Among the various aspects of forecast quality, measures of accuracy or skill have been studied most. In general, most measures of skill (or accuracy) fall into the following four groups (Zhang and Casey 2000):
1. Those that directly measure the differences between forecasts and observations (e.g., Root Mean Squared Error (RMSE) or the Brier score). Murphy (1995) identified these measures as the "squared-error approach", and linear measures of correspondence such as Mean Absolute Error as the "linear-distance approach".
2. Those that measure the differences between forecasts and observations in cumulative probability space (the ranked probability score (RPS), the Linear Error in Probability Space score (LEPS)). Murphy (1995) termed this the "linear-error-in-probability-space approach".
Figure 3.4: Reliability diagram for June-September seasonal volume forecasts issued for the 0.25 quantile. The upper panel shows the marginal distribution s(f); the lower panel plots the observed relative frequency µ_{x|f} against the forecast probability f, together with the no-resolution, no-skill, and perfect-reliability lines.
Figure 3.5: Discrimination diagram for June-September seasonal volume forecasts issued for the 0.25 quantile, showing the likelihoods r(f|x) against forecast probability f for x = 0 and x = 1, with t(x = 0) = 0.771 and t(x = 1) = 0.229.
3. Those based on concepts derived from signal detection theory (SDT), which produce various measures from the ratios of relative "signal" and "noise" expressed by the conditional distributions of forecasts given observations (e.g., Relative Operating Characteristics (ROC)).
4. Those based on converting probability forecasts to binary forecasts and the
generation of a contingency table from the hit and miss rates.
Murphy (1997) presented a similar classification. Most researchers use the ensemble spread and rank histograms to characterize the ensembles themselves, and use the CR decomposition, Brier score, RPS, or ROC for the probability forecasts (Hamill and Colucci 1997; Hou et al. 1998). It is desirable to apply a number of different scoring techniques, rather than just one scoring scheme, in order to obtain an objective assessment of any given forecast scheme (Murphy 1991; Zhang and Casey 2000). For example, a properly weighted combination of forecasts from different models gives better mean-square errors, although the LEPS, which gives higher scores and less penalty for forecasting rare events, does not always show better results (Zhang and Casey 2000). It is also important to determine which scoring methods are more robust to the effects of sampling variability and better suited to small samples.
On the other hand, new scores for ensemble prediction systems (EPS) are still under development. Wilson et al. (1999) introduced a score expressed as the probability of occurrence of the observation given the EPS distribution; that is, the measure assesses the ensemble outputs in terms of probability. This measure would be useful for seeing how the bias correction methods discussed in Chapter 5 change the distribution of ensemble traces. Hersbach (2000) showed that for an EPS the continuous ranked probability score (CRPS) can be decomposed into a reliability part and a resolution/uncertainty part, in a way similar to the decomposition of the Brier score. The reliability is closely related to the rank histogram, and is sensitive to the width of the ensemble bins. The resolution expresses the superiority of a forecast system with respect to a forecast system based on climatology. The
reliability part was found to be sensitive both to the average spread within the ensemble and to the behaviour of the outliers (Hersbach, 2000). Since the CRPS can be interpreted as the integral of the Brier score over all possible threshold values (Hersbach, 2000), the use of CRPS would be another approach to verifying the whole probability distribution forecast. In this case, another decomposition, corresponding to the LBR decomposition of the Brier score, would have to be derived in order to look at forecast quality given the observations. Still, since the CRPS is a single scalar, information on how forecast quality varies over the range of possible outcomes may not be obtained. Stensrud and Wandishin (2000) extended the critical success index to measure the agreement between two or more spatial distributions of ensemble forecasts and the observation, a measure called the correspondence ratio.
3.6 Summary and Conclusions
All the measures of the distributions-oriented (DO) approach are derived from the joint distribution of forecasts and observations. These measures can be examined over the range of thresholds for which the forecasts are issued. The joint distribution can be examined through the calibration refinement (CR) factorization and the likelihood-base rate (LBR) factorization.
Measures of unconditional bias and accuracy take well-known forms; uncondi-
tional bias is defined by the mean error (ME), and accuracy is Mean Square Error
(MSE) between forecasts and observations. As a relative measure of accuracy, the
MSE Skill Score is introduced. It uses the climatology forecast, or the mean of the observations, as the reference. Decomposition of the MSE reveals other aspects of forecast
quality. CR decompositions, which are Reliability and Resolution, are conditioned
on the forecasts. Reliability is a measure of conditional bias given a forecast. Res-
olution measures how much the outcomes given probabilistic forecasts are different
from the climatology (or the mean of observations). Thus, smaller Reliability and
larger Resolution are more desirable. LBR decompositions, which are Sharpness,
Type 2 Conditional Bias and Discrimination, are conditioned on the observations.
Sharpness is simply the variance of the forecasts, which is especially important for probabilistic forecasts. Sharpness measures how distinct the issued forecasts are, which reflects the confidence of the forecasters. Type 2 Conditional Bias and Discrimination
are based on the same concepts as Reliability and Resolution. Type 2 Conditional
Bias is a measure of conditional bias given an observation. Discrimination measures
how much the forecasts when events occurred are different from those when events
did not occur. Therefore, smaller Type 2 Conditional Bias, and larger Discrimina-
tion and Sharpness are better. In order to compare these five decompositions over
the magnitude of possible outcomes, they are normalized by Uncertainty, or variance
of observations. The normalized values are referred to as “Relative” measures.
In practice, forecasts are often issued as discrete numbers for various reasons. However, discretization, especially with a small sample size, may affect the measures of forecast quality: information about the forecasting system may be lost or distorted. This chapter derived estimators for the above DO measures so that they can be handled in a continuous manner. It turned out that all the measures except the CR decompositions can be expressed by six basic statistics, without any assumption on the mathematical form of the distributions. The problem of estimating the CR decomposition then boils down to estimating the integral \int_0^1 \mu_{x|f}^2 s(f)\, df, where \mu_{x|f} is the conditional mean of the observations given the forecasts, and s(f) denotes the marginal distribution of the forecasts.
Three statistical methods to estimate the integral were explained in detail. The logistic regression method (LRM) estimates the conditional mean \mu_{x|f} by logistic regression, and the integral is estimated by the sample average of \mu_{x|f}^2. The kernel density estimation method (KDM) estimates the conditional distribution of forecasts given observations r(f|x) by the kernel density estimation method; from these distributions, the marginal distribution of forecasts s(f) and the conditional distribution of observations given forecasts q(x = 1|f) are computed, and the integral is then evaluated numerically. The combination method (CM) estimates the conditional mean \mu_{x|f} by logistic regression and the marginal distribution s(f) by the kernel density method, and then evaluates the integral numerically. The traditional discrete approach with a contingency table (DSC) is also considered. In general, if forecasts and observations are divided into I and J bins, the dimensionality D is defined as D = I × J − 1. In the case of a contingency table that divides probabilistic forecasts into I = 11 bins from 0 to 1 with a 0.1 interval, the dimensionality is D = 11 × 2 − 1 = 21, since the observations are dichotomous (J = 2). The three continuous approaches, LRM, KDM, and CM, reduce the dimensionality to 9, 7, and 9, respectively.
CHAPTER 4
DISTRIBUTIONS-ORIENTED METHODS FOR SMALL VERIFICATION DATASETS
The distributions-oriented (DO) approach based on the joint distribution of
forecasts and observations is superior to the measures-oriented (MO) approach for
forecast verification. However, when applying the DO approach to hydrological
forecasts, which typically have a small verification dataset, it is difficult to estimate
the joint distribution and DO measures properly. This chapter examines three
statistical methods to reduce the estimation uncertainty of DO measures. Three
forecasting systems are developed to produce verification datasets, to which the
three statistical methods are applied. This chapter describes and discusses each
forecasting model and the verification results.
4.1 Introduction
The distributions-oriented (DO) approach gives structure to the verification process and indicates in what aspects the forecasts are good or bad, based on the joint distribution of the forecasts and corresponding observations. In essence,
applying the DO approach to real verification problems is equivalent to estimating
the joint distribution of forecasts and observations.
The dimensionality D, one of the characteristics of a verification problem, is defined as the number of degrees of freedom in estimating the joint distribution. Since available samples for hydrological variables are very limited, lower dimensionality is more desirable. In some applications, reduction of dimensionality has been carried out. For instance, Brooks et al. (1996) dealt with temperature forecasts produced by Model Output Statistics (MOS). In order to reduce the dimensionality, they chose to verify forecasts and observations in the context of day-to-day temperature change. The forecasts and observations were binned into 5°F intervals. As a result, they succeeded in reducing the dimensionality from D = 389016 to 120. However, reducing the number of categories for forecasts could result in losing or distorting information about the original forecasting system. Thus, not changing the original forecasts but applying a parsimonious statistical model to the conditional
and/or unconditional distributions may be more reasonable and effective (Murphy, 1991). Murphy and Wilks (1998) modeled the conditional distribution q(x|f), or µ_{x|f}, with a linear regression equation, and the marginal distribution s(f) with a beta distribution, to reduce the dimensionality of the underlying verification problem from D = 11 × 2 − 1 = 21 to 4. This research also makes use of statistical models to reduce dimensionality.
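The dimensionality bookkeeping used throughout this chapter is simple enough to state as code (a trivial sketch; the function name is illustrative):

```python
def dimensionality(n_forecast_bins, n_obs_categories):
    """Degrees of freedom in estimating the discrete joint distribution:
    D = I * J - 1 (one is lost because the probabilities sum to one)."""
    return n_forecast_bins * n_obs_categories - 1

# DSC contingency table: 11 forecast bins, dichotomous observations -> D = 21
assert dimensionality(11, 2) == 21
```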
As described in Section 3.3, the measures of forecast quality, except for the CR decompositions, can be estimated from six basic moment estimators, without any assumptions on the conditional or marginal distributions. The estimation of the CR decompositions, however, requires the estimation of the integral \int_0^1 \mu_{x|f}^2 s(f)\, df. In order to estimate the integral, three statistical methods are utilized (Subsection 3.3.3). The logistic regression method (LRM) estimates the conditional mean \mu_{x|f} by logistic regression; the integral is estimated by the arithmetic average of \mu_{x|f}^2 over the set of forecasts f. The kernel density estimation method (KDM) estimates the conditional distribution of forecasts given observations r(f|x) using kernel density estimation; from these distributions, the marginal distribution of forecasts s(f) and the conditional distribution of observations given forecasts q(x = 1|f) are computed, and the integral is then evaluated numerically. The combination method (CM) estimates the conditional mean \mu_{x|f} by logistic regression and the marginal distribution s(f) by the kernel density method, and then evaluates the integral numerically. This research calls these methods, in which statistical models are used to construct the joint distribution of forecasts and observations, the continuous approach, as opposed to the discrete approach based on the traditional contingency table.
The key questions from the above discussion include:
1. Are continuous approaches for the joint distribution better than the discrete
approach?
2. When is one of the continuous approaches better than the others?
3. How does the performance depend on the nature of the forecasts?
To answer these questions, this chapter examines the three continuous approaches by
evaluating the CR decompositions (or Reliability (REL) and Resolution (RES)) for
many verification datasets. In addition, four other measures, Mean Square Error
(MSE), Mean Error (ME), Type 2 Conditional Bias (TY2), and Discrimination
(DIS), are compared to evaluate the estimation error quantitatively. These aspects
of forecast quality are normalized by the theoretical uncertainty (the variance of the observations, σ²_x = µ_x(1 − µ_x)), so that the results can be compared across the different flow events for which forecasts are issued.
The verification datasets are generated by three different models. The first
two models produce continuous forecasts, and the last one issues discrete forecasts.
In the first investigation, beta distributions are assumed to represent the conditional distributions r(f|x) (analytical model for the joint distribution), so that the true measures of forecast quality are obtained analytically. Second, as a practical case, verification datasets are produced by a stochastic forecasting model that represents monthly streamflow volume (stochastic model of streamflow forecast); here, the true forecast quality measures are taken to be those obtained by the contingency-table method (DSC) with one million pairs of forecasts and observations. The third analysis uses a discrete forecasting model that issues forecasts as discrete numbers directly (discrete joint distribution model); similarly, the true values of the DO measures are calculated by DSC. This third analysis therefore investigates how effective the three continuous approaches are when the forecasts themselves are discrete. Each investigation looks at verification datasets with 50, 100, 200, 400, 600, 800, and 1,000 forecast-observation pairs. For each sample size, 1,000 verification datasets are generated by Monte Carlo methods to evaluate the estimation uncertainty of the forecast quality measures.
4.2 Monte Carlo Simulation with Analytical Model for Joint Distribution
Along with dichotomous observations, probabilistic forecasts are generated
in continuous numbers by beta distributions, which are fitted to the conditional
distribution of the forecasts r(f |x) given the observations. The fitting of a beta
distribution facilitates obtaining the true CR decompositions. The resultant CR
decompositions by the three different continuous approaches (LRM, KDM, and CM) and the 11-binned discrete approach (DSC) are compared and discussed.
4.2.1 Assumptions and Procedure
The random variable X, the discretized observation, has a Bernoulli distribution, while F, the continuous or discretized forecast, has an unknown distribution.
Thus, from the LBR factorization (Equation (3.3)), once the conditional distributions r(f|x = 0) and r(f|x = 1) are specified, any forecast can be generated given the marginal probability t(x). The procedure to generate the verification dataset using beta distributions for the conditional distributions r(f|x) is:

1. Generate the Bernoulli variate x.

2. If the generated observation x is 0, generate the corresponding forecast f based on the conditional distribution r(f|x = 0):

r(f|x=0) = \begin{cases} \frac{1}{B(\alpha_0, \beta_0)} f^{\alpha_0 - 1} (1-f)^{\beta_0 - 1} & \text{for } 0 < f < 1,\ \alpha_0 > 0,\ \beta_0 > 0, \\ 0 & \text{otherwise.} \end{cases}   (4.1)

Similarly, in the case of x = 1, use the conditional distribution r(f|x = 1) with \alpha_1 and \beta_1 in the above equation.
The four beta distribution parameters \alpha_i and \beta_i (i = 0, 1) are chosen, by repeated trial and error, so that the forecasts are unconditionally unbiased and have a positive MSE Skill Score (see Section 3.2 for these definitions). First, the conditional means of the forecasts given the observations, \mu_{f|x=0} and \mu_{f|x=1}, are chosen to satisfy:

\mu_{f|x=0}\, t(0) + \mu_{f|x=1}\, t(1) = \mu_f = \mu_x,

and then each \beta_i is obtained by substituting a chosen \alpha_i into:

\beta_i = \frac{\alpha_i (1 - \mu_{f|x=i})}{\mu_{f|x=i}}.   (4.2)
The true measures, except the CR decompositions, are calculated using the expressions derived in Subsection 3.3.2. To calculate the true CR decompositions, first the marginal distribution of forecasts s(f), the conditional distribution of observations given a forecast, and the mean of observations given a forecast are calculated from Equations (3.37), (3.38), and (3.39). Then, the integral of the CR decompositions, Equation (3.34), is evaluated numerically by Equation (A.10). Finally, the numerically integrated value is substituted into Equations (3.32) and (3.33) to obtain Reliability and Resolution.
Two cases are considered: a moderate case, where t(x = 1) = 0.25, and an extreme case, where t(x = 1) = 0.05. For example, the case t(x = 1) = 0.25 corresponds to forecasts for an event that the volume is less than or equal to the 0.25 quantile of the observations. The four parameters of the beta distributions and the true forecast quality aspects are listed in Table 4.1.
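Under the stated assumptions, the generation procedure and Equation (4.2) can be sketched as follows. Parameter values are taken from Table 4.1 for the moderate case t(x = 1) = 0.25; the function names are illustrative.

```python
import numpy as np

def beta_from_mean(alpha, mu):
    """Eq. (4.2): beta parameter giving a Beta(alpha, beta) mean of mu."""
    return alpha * (1.0 - mu) / mu

def generate_dataset(n, t1, alphas, betas, rng):
    """Steps 1-2: Bernoulli observation x, then f ~ r(f|x) = Beta(alpha_x, beta_x)."""
    x = (rng.random(n) < t1).astype(float)
    f = np.where(x == 1.0,
                 rng.beta(alphas[1], betas[1], n),
                 rng.beta(alphas[0], betas[0], n))
    return f, x

# Moderate case of Table 4.1: t(x=1) = 0.25, alpha = (1.0, 3.0)
rng = np.random.default_rng(0)
alphas = (1.0, 3.0)
betas = (beta_from_mean(1.0, 0.15), beta_from_mean(3.0, 0.55))  # 5.667, 2.455
f, x = generate_dataset(50, 0.25, alphas, betas, rng)
```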
4.2.2 Result and Discussion
The box plots of the four forecast quality aspects that can be estimated without any distributional assumption are shown in Figures 4.1 and 4.2. There are small differences between the two approaches caused by rounding error, because DSC estimates the aspects by reconstructing the joint distribution first, i.e., by calculating the 11 × 2 probabilities. The ranges between maximum and minimum for a sample size of 50 are fairly large for all four aspects. For example, in the case of p = 0.05, one may obtain 0 ≤ MSE/σ²_x ≤ 0.49 with 25% chance (equivalently 0.51 ≤ SS = 1 − MSE/σ²_x ≤ 1), or 1.11 ≤ MSE/σ²_x ≤ 2.46 with another 25% chance (equivalently −1.46 ≤ SS ≤ −0.11). Either range is very different from the true MSE/σ²_x = 0.80 (SS = 0.20), and may lead to a wrong perception of the skill of the forecasting system. The difference between the top (upper quartile, Q3) and bottom (lower quartile, Q1) of the box is referred to as the interquartile range (IQR): IQR = Q3 − Q1. The decrease in the IQR with sample size indicates that the estimation uncertainty decreases as the sample size increases.
In general, the Root Mean Squared Error (RMSE) of an estimator \hat{\theta} is defined as:

RMSE(\hat{\theta}) = \sqrt{SD(\hat{\theta})^2 + BIAS(\hat{\theta})^2}   (4.3)
Table 4.1: Parameters of beta distributions for the analytical model and true forecast quality measures.

            t(x=1)=0.25  t(x=1)=0.05                t(x=1)=0.25  t(x=1)=0.05
α0              1.0          0.25      MSE/σ²x          0.478        0.801
β0              5.667        6.0       ME/σx            0.000        0.000
α1              3.0          0.6       TY²/σ²x          0.360        0.640
β1              2.455        1.9       DIS/σ²x          0.160        0.040
µf|x=0          0.15         0.04      REL/σ²x          0.086        0.016
µf|x=1          0.55         0.24      RES/σ²x          0.608        0.215
σ²x             0.1875       0.0475
Figure 4.1: MSE/σ²x, ME/σx, TY²/σ²x, and DIS/σ²x versus sample size, estimated by two approaches for nonexceedance probability p = 0.25; “D” denotes the discretized (11-bin) approach (DSC), “C” a continuous approach (LRM, KDM, or CM). The maximum, upper quartile, median, lower quartile, and minimum are indicated from top to bottom. The forecasts are produced by the analytical model.
Figure 4.2: MSE/σ²x, ME/σx, TY²/σ²x, and DIS/σ²x versus sample size, estimated by two approaches for nonexceedance probability p = 0.05; “D” denotes the discretized (11-bin) approach (DSC), “C” a continuous approach (LRM, KDM, or CM). The maximum, upper quartile, median, lower quartile, and minimum are indicated from top to bottom. The forecasts are produced by the analytical model.
where θ̂ is an estimator of θ, SD denotes the standard deviation √E[(θ̂ − E[θ̂])²], and BIAS is the unconditional bias E[θ̂] − θ. Tables 4.2 and 4.3 show the RMSE for MSE/σ²x, ME/σx, TY²/σ²x, and DIS/σ²x. In a relative sense, the measures for the extreme case have more sampling uncertainty than those for the moderate case. Also, MSE/σ²x (and hence the MSE Skill Score) has the largest uncertainty among these measures.
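Equation (4.3) is an exact identity when SD, BIAS, and RMSE are computed over the same set of Monte Carlo replicates. A short Python sketch that checks it, using the sample mean of normal draws as a stand-in estimator (the distribution N(0.8, 0.3²) and the replicate counts are arbitrary choices for illustration):

```python
import math
import random

def rmse_decomposition(estimates, theta_true):
    """Return (RMSE, SD, BIAS) of Monte Carlo estimates of theta (Eq. 4.3)."""
    m = len(estimates)
    mean_est = sum(estimates) / m
    bias = mean_est - theta_true                              # E[est] - theta
    sd = math.sqrt(sum((e - mean_est) ** 2 for e in estimates) / m)
    rmse = math.sqrt(sum((e - theta_true) ** 2 for e in estimates) / m)
    return rmse, sd, bias

random.seed(1)
# 2000 replicates of the sample mean of 50 draws from N(0.8, 0.3^2)
reps = [sum(random.gauss(0.8, 0.3) for _ in range(50)) / 50 for _ in range(2000)]
rmse, sd, bias = rmse_decomposition(reps, 0.8)
assert abs(rmse ** 2 - (sd ** 2 + bias ** 2)) < 1e-12   # RMSE^2 = SD^2 + BIAS^2
```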
Figures 4.3 and 4.4 show the true values of the conditional mean of observations µx|f, the marginal distribution of forecasts s(f), and the conditional distributions of forecasts given observations r(f|x), for the moderate flow event p = 0.25. Error bars of these distributions for sample size 50 are also shown. It can be seen that DSC has more uncertainty in µx|f, and that the expected value of the estimated µx|f drops near the forecast f = 1. In contrast, the logistic regression of LRM and the kernel density estimation of KDM retain the proper structure. A more dramatic failure in estimation can be seen for the extreme flow event forecasts (Figures 4.5 and 4.6).
Next, the box plots of the CR decompositions, REL/σ²x and RES/σ²x, obtained by the three continuous approaches and the one discrete approach are discussed. Figure 4.7 shows REL/σ²x for the moderate flow event p = 0.25. KDM has the median closest to the true value for sample size 50, although it produces negative estimates. Note that for sample sizes 50 and 100, KDM and LRM give medians closer to the true value than DSC does; from sample size 400 onward, DSC produces the closest median. The median of RES/σ²x (Figure 4.7) shows almost the same behavior as that of REL/σ²x. The BIAS defined by Equation (4.3) for REL/σ²x and RES/σ²x (Tables B.1 and B.3) shows the same pattern. All methods indicate a similar reduction in dispersion (IQR) as the sample size increases, which can be seen in the SD defined by Equation (4.3) (see Tables B.2 and B.4). CM performs worst among them, with the largest range between maximum and minimum for small sample sizes (below 100 or 200).
The RMSE of REL/σ²x and RES/σ²x for the moderate flow event p = 0.25 is shown in Tables 4.4 and 4.5. KDM is the most efficient estimator of REL/σ²x until the sample size reaches 200, followed by LRM. The result for REL/σ²x at sample size 50 is remarkable: KDM gives about half the RMSE of DSC, and the RMSE of LRM is about 28 percent less than that of DSC. On the other hand, KDM and LRM offer only a minor improvement in RES/σ²x.
The CR decompositions for the extreme case are discussed. Extreme flow
Table 4.2: Root Mean Squared Error (RMSE) in MSE/σ²x, ME/σx, TY²/σ²x, and DIS/σ²x for the forecasts generated for nonexceedance probability p = 0.25 by the analytical model.

Size    MSE/σ²x     ME/σx       TY²/σ²x     DIS/σ²x
50      1.079e-001  1.007e-001  9.600e-002  5.273e-002
100     7.546e-002  6.879e-002  6.670e-002  3.733e-002
200     5.244e-002  4.973e-002  4.657e-002  2.674e-002
400     3.786e-002  3.398e-002  3.267e-002  1.821e-002
600     3.065e-002  2.883e-002  2.689e-002  1.584e-002
800     2.655e-002  2.431e-002  2.342e-002  1.339e-002
1000    2.240e-002  2.120e-002  1.975e-002  1.213e-002

Note: these measures were calculated with the continuous approach.
Table 4.3: Root Mean Squared Error (RMSE) in MSE/σ²x, ME/σx, TY²/σ²x, and DIS/σ²x for the forecasts generated for nonexceedance probability p = 0.05 by the analytical model.

Size    MSE/σ²x     ME/σx       TY²/σ²x     DIS/σ²x
50      4.557e-001  1.237e-001  4.271e-001  8.096e-002
100     3.277e-001  9.326e-002  3.160e-001  5.049e-002
200     2.400e-001  6.688e-002  2.315e-001  3.456e-002
400     1.628e-001  4.587e-002  1.575e-001  2.245e-002
600     1.309e-001  3.714e-002  1.273e-001  1.824e-002
800     1.140e-001  3.185e-002  1.093e-001  1.629e-002
1000    1.034e-001  2.816e-002  9.919e-002  1.451e-002

Note: these measures were calculated with the continuous approach.
Figure 4.3: Conditional mean of the observations given the forecasts µx|f and marginal distribution of the forecasts s(f) estimated by three methods, DSC, LRM, and KDM, for nonexceedance probability p = 0.25 with a sample size of 50. The forecasts are produced by the analytical model.
Figure 4.4: Conditional distribution of the forecasts given the observations r(f|x) estimated by three methods, DSC, LRM, and KDM, for nonexceedance probability p = 0.25 with a sample size of 50. The forecasts are produced by the analytical model.
Figure 4.5: Conditional mean of the observations given the forecasts µx|f and marginal distribution of forecasts s(f) estimated by three methods, DSC, LRM, and KDM, for nonexceedance probability p = 0.05 with a sample size of 50. The forecasts are produced by the analytical model.
Figure 4.6: Conditional distribution of the forecasts given the observations r(f|x) estimated by three methods, DSC, LRM, and KDM, for nonexceedance probability p = 0.05 with a sample size of 50. The forecasts are produced by the analytical model.
Figure 4.7: CR decompositions (REL/σ²x and RES/σ²x versus sample size) estimated by four approaches for nonexceedance probability p = 0.25; “D” is the discretized (11-bin) approach (DSC), “L” is logistic regression (LRM), “K” is kernel density estimation applied directly to r(f|x) (KDM), and “C” is the combination of logistic regression and kernel density estimation (CM). The maximum, upper quartile, median, lower quartile, and minimum are indicated from top to bottom. The forecasts are produced by the analytical model.
Table 4.4: Root Mean Squared Error (RMSE) in REL/σ²x for the forecasts generated for nonexceedance probability p = 0.25 by the analytical model.

Size    DSC         LRM         KDM         CM
50      1.122e-001  8.061e-002  5.554e-002  1.401e-001
100     6.088e-002  4.868e-002  4.377e-002  8.796e-002
200     3.801e-002  3.463e-002  3.422e-002  6.226e-002
400     2.476e-002  2.561e-002  2.825e-002  4.538e-002
600     1.813e-002  2.036e-002  2.450e-002  3.751e-002
800     1.514e-002  1.859e-002  2.138e-002  3.407e-002
1000    1.428e-002  1.786e-002  1.979e-002  3.211e-002

Note: the smallest value in each row indicates the best estimator.
Table 4.5: Root Mean Squared Error (RMSE) in RES/σ²x for the forecasts generated for nonexceedance probability p = 0.25 by the analytical model.

Size    DSC         LRM         KDM         CM
50      1.792e-001  1.762e-001  1.632e-001  2.093e-001
100     1.184e-001  1.185e-001  1.155e-001  1.386e-001
200     8.680e-002  8.662e-002  8.543e-002  1.005e-001
400     6.142e-002  6.154e-002  6.286e-002  7.163e-002
600     5.112e-002  5.116e-002  5.455e-002  5.885e-002
800     4.220e-002  4.318e-002  4.534e-002  5.108e-002
1000    3.989e-002  4.008e-002  4.228e-002  4.732e-002

Note: the smallest value in each row indicates the best estimator.
events are very important in water resources planning, because they can cause tremendous damage. LRM produces estimates closer to the true REL/σ²x than DSC for all sample sizes of 1000 or less (Figure 4.8). KDM, however, is outperformed by DSC at sample sizes of 200 or more. CM also has a median closer to the true value than DSC for all sample sizes of 1000 or less, although at sample sizes 50 and 100 it has extremely high maxima. The IQR in REL/σ²x by LRM and CM is also slightly smaller than that of DSC for sample sizes of 1000 or less, while KDM has a similar or larger IQR. Again, some negative estimates of Reliability by KDM are found, while LRM and CM give positive ones. In this respect, LRM is more suitable as the estimator of Reliability and Resolution. As for Resolution, LRM also performs better than DSC for sample sizes of 800 or less in terms of the median. The reason why KDM with p = 0.05 does not work well may be the small subsample available to estimate r(f|x = 1) after the sample is stratified by observation. For example, for a sample size of 100, only about five samples with x = 1 are generated.
Tables 4.6 and 4.7 show the RMSE of REL/σ²x and RES/σ²x for the extreme case p = 0.05. For Reliability, LRM is the best estimator for sample sizes of 1000 or less. KDM is better than DSC up to a sample size of 100, whereas CM surpasses DSC from a sample size of 200 onward. For Resolution as well, the estimates by LRM are the best for sample sizes of 600 or less.

From the above results, for the moderate flow event p = 0.25, the KDM estimator is better than the DSC estimator until the sample size reaches 200. For the extreme flow event p = 0.05, LRM is successful in reducing the uncertainty compared to DSC, and is the best estimator for sample sizes of 1000 or less. Note that the continuous approaches achieve better RMSE mostly by reducing the BIAS. For the analytical model of the joint distribution used in the Monte Carlo simulation, the true distribution of µx|f appears to be fitted quite well by logistic regression. In general, this may not be the case. The remaining cases give realistic examples where the fit may not be as good.
Figure 4.8: CR decompositions (REL/σ²x and RES/σ²x versus sample size) estimated by four approaches for nonexceedance probability p = 0.05; “D” is the discretized (11-bin) approach (DSC), “L” is logistic regression (LRM), “K” is kernel density estimation applied directly to r(f|x) (KDM), and “C” is the combination of logistic regression and kernel density estimation (CM). The maximum, upper quartile, median, lower quartile, and minimum are indicated from top to bottom. The forecasts are produced by the analytical model.
Table 4.6: Root Mean Squared Error (RMSE) in REL/σ²x for the forecasts generated for nonexceedance probability p = 0.05 by the analytical model.

Size    DSC         LRM         KDM         CM
50      2.613e-001  1.664e-001  2.435e-001  1.479e+000
100     1.701e-001  9.102e-002  1.683e-001  4.702e-001
200     1.089e-001  5.150e-002  1.142e-001  6.604e-002
400     6.566e-002  3.351e-002  7.775e-002  4.353e-002
600     4.812e-002  2.723e-002  6.199e-002  3.567e-002
800     3.916e-002  2.500e-002  5.451e-002  3.253e-002
1000    3.269e-002  2.299e-002  4.784e-002  2.968e-002

Note: the smallest value in each row indicates the best estimator.
Table 4.7: Root Mean Squared Error (RMSE) in RES/σ²x for the forecasts generated for nonexceedance probability p = 0.05 by the analytical model.

Size    DSC         LRM         KDM         CM
50      4.124e-001  3.523e-001  4.036e-001  1.439e+000
100     2.704e-001  2.256e-001  2.719e-001  4.926e-001
200     1.754e-001  1.580e-001  1.883e-001  1.687e-001
400     1.133e-001  1.079e-001  1.280e-001  1.141e-001
600     8.875e-002  8.806e-002  1.033e-001  9.276e-002
800     7.841e-002  8.128e-002  9.327e-002  8.551e-002
1000    6.723e-002  6.949e-002  7.931e-002  7.292e-002

Note: the smallest value in each row indicates the best estimator.
4.3 Monte Carlo Simulation with Stochastic Model of Streamflow Forecasting System
A stochastic model of monthly volume forecasts for the experimental Des Moines River system is used for the Monte Carlo simulations. This model produces dichotomous observations and corresponding continuous probabilistic forecasts. The forecast quality aspects calculated from one million pairs of forecasts and observations are taken as the true values for the forecasting system. In developing the stochastic model of the forecasting system, the historical simulation of the monthly volume is assumed to be the observed volume. This assumption eliminates the impacts of hydrologic model biases and errors from the ensemble predictions. The three continuous approaches (LRM, KDM, and CM) and the 11-binned discrete approach (DSC) are compared and discussed.
4.3.1 Assumptions and Procedure
The September monthly volume forecasts with a 1-month lead time are chosen as the forecasts to be modeled. An analysis was made of the September ESP (Extended Streamflow Prediction) forecasts from the experimental system for the Des Moines River. Using the chi-squared goodness-of-fit test and the L-moment ratio diagram, the following assumptions are made:

1. The observed monthly volume U for September has a lognormal distribution.

2. The ensembles of the September monthly volume Y given U have Generalized Pareto (GPA) distributions.

For ensembles with a GPA distribution, the parameters of the distribution can be related to the first three L-moments (x_ℓ1, x_ℓ2, and x_ℓ3). Hence, an ESP forecast and its corresponding observation can be represented by the four random variables {U, X_ℓ1, X_ℓ2, X_ℓ3}. Figure 4.9 shows scatter plots of these four variables for the September forecasts for the Des Moines River for 1948-1997. Note that there are fairly strong associations between the observations and the first L-moment, between the first and second L-moments, and between the second and third L-moments. A stochastic model of the relationships between these variables is used in the following Monte Carlo experiments.
Figure 4.9: Relations between observations and L-moments for September monthly volume (L-moment 1 of ensemble volume versus observation; L-moment 2 versus L-moment 1; L-moment 3 versus L-moment 2).
Table 4.8: Parameters used in fitting the distribution to observed monthly volume (U) and the first three L-moments of the ensemble volumes (X_ℓ1, X_ℓ2, and X_ℓ3).

r.v.    distribution  location (ξ)  scale (α)   shape (k)
X_ℓ1    GEV           32887.133     24156.669   -0.38524433
X_ℓ2    GEV           9068.2572     5657.7145   -0.18209885
X_ℓ3    GEV           4844.7753     2633.4439   -0.0019766099

r.v.    distribution  mean (µ)      s.d. (σ)
U       LN            10.582935     0.9640626
First, the four variables are transformed into standard normal variates using the following transformations:

z_u = Φ⁻¹(F_U(u))   (4.4)
z_ℓ1 = Φ⁻¹(F_1(x_ℓ1))   (4.5)
z_ℓ2 = Φ⁻¹(F_2(x_ℓ2))   (4.6)
z_ℓ3 = Φ⁻¹(F_3(x_ℓ3))   (4.7)

where Φ⁻¹ is the inverse of the standard normal cumulative distribution function (cdf), and F_i represents the cdf of the individual variable. As noted above, F_U is assumed to be a lognormal distribution. Based on an empirical analysis of the L-moments for the 49-year forecast period, each of the L-moments is assumed to have a generalized extreme-value (GEV) distribution. The estimated parameters for these distributions are shown in Table 4.8. Figure 4.10 shows the transformed observations and L-moments. Each scatterplot indicates a strong linear relation. Hence, the relationships between the variables are assumed to follow bivariate normal distributions. Table 4.9 lists the parameters necessary to model the system of observations and forecasts by bivariate normal distributions.
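Given the fitted distributions, the normal-score transform of Equations (4.4)-(4.7) can be sketched in a few lines of Python. The GEV cdf below uses the same convention as the quantile function in Equation (4.11), and the LN parameters are interpreted as log-space mean and standard deviation; the example input volumes are arbitrary:

```python
import math
from statistics import NormalDist

phi_inv = NormalDist().inv_cdf          # standard normal quantile function

def gev_cdf(x, xi, alpha, k):
    """GEV cdf in the convention of Eq. (4.11): F = exp(-(1 - k(x-xi)/alpha)^(1/k))."""
    if k != 0:
        return math.exp(-((1.0 - k * (x - xi) / alpha) ** (1.0 / k)))
    return math.exp(-math.exp(-(x - xi) / alpha))

def lognormal_cdf(u, mu, sigma):
    """Lognormal cdf with log-space mean mu and s.d. sigma (Table 4.8)."""
    return NormalDist(mu, sigma).cdf(math.log(u))

# Normal scores of an observed volume and a first L-moment (Eqs. 4.4-4.5),
# using the fitted parameters from Table 4.8
z_u  = phi_inv(lognormal_cdf(120000.0, 10.582935, 0.9640626))
z_l1 = phi_inv(gev_cdf(60000.0, 32887.133, 24156.669, -0.38524433))
```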
The following steps are carried out to generate forecast-observation pairs from this stochastic model:

1. Generate a lognormal variate, and then transform it into the standard normal variate z_u.

2. Generate a normal variate z_ℓ1 whose distribution has the mean µ_ℓ1 and
Figure 4.10: Scatterplot of transformed observed monthly volume and transformed L-moments of monthly volume ensembles.
Table 4.9: Summary statistics of the standardized random variables.

variable  mean      s.d.    correlation
z_u       9.38e-08  1.00    ρ_uℓ1 = 0.863
z_ℓ1      0.0188    0.982   ρ_ℓ1ℓ2 = 0.986
z_ℓ2      0.0139    0.978   ρ_ℓ2ℓ3 = 0.998
z_ℓ3      0.00924   0.970
variance σ²_ℓ1, given by

µ_ℓ1 = ρ_uℓ1 · z_u   (4.8)
σ²_ℓ1 = 1.0 − ρ²_uℓ1   (4.9)

which come from the conditional pdf of the bivariate normal distribution.

3. Untransform z_ℓ1 to the ensemble first L-moment x_ℓ1 through the GEV:

F_1 = Φ(z_ℓ1)   (4.10)

x_ℓ1(F_1) = ξ_ℓ1 + α_ℓ1{1 − (−log F_1)^k_ℓ1}/k_ℓ1,   k_ℓ1 ≠ 0
x_ℓ1(F_1) = ξ_ℓ1 − α_ℓ1 log(−log F_1),   k_ℓ1 = 0   (4.11)

4. Generate a normal variate z_ℓ2 in the same manner as in step 2; replace ρ_uℓ1 and z_u with ρ_ℓ1ℓ2 and z_ℓ1 in Equations (4.8) and (4.9) to get µ_ℓ2 and σ_ℓ2.

5. Untransform z_ℓ2 to the ensemble second L-moment x_ℓ2 through the GEV.

6. Once more, generate a normal variate z_ℓ3 in the same manner as in steps 2 and 4; substitute ρ_ℓ2ℓ3 and z_ℓ2 for ρ_uℓ1 and z_u in Equations (4.8) and (4.9) to obtain µ_ℓ3 and σ_ℓ3.

7. Untransform z_ℓ3 to the ensemble third L-moment x_ℓ3 through the GEV.

8. Let y_p be a critical threshold, for instance, a low flow during the summer. Then the forecast of the non-exceedance probability for y_p is calculated by

f = F_{Y|U}(y_p) = 1 − e^{−v}   (4.12)

where

v = −k_f⁻¹ ln{1 − k_f(y_p − ξ_f)/α_f},   k_f ≠ 0
v = (y_p − ξ_f)/α_f,   k_f = 0   (4.13)

k_f = (1 − 3 x_ℓ3/x_ℓ2)/(1 + x_ℓ3/x_ℓ2),   (4.14)
α_f = (1 + k_f)(2 + k_f) x_ℓ2,   (4.15)
ξ_f = x_ℓ1 − (2 + k_f) x_ℓ2.   (4.16)
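The eight steps above can be sketched end to end in Python. This is a minimal illustration under the stated assumptions, with parameters from Tables 4.8 and 4.9; the handling of thresholds outside the fitted GPA support (the `arg <= 0` branch and the final clamp) is an added guard, not part of the original procedure:

```python
import math
import random
from statistics import NormalDist

phi = NormalDist().cdf

def gev_inverse(F, xi, alpha, k):
    """Untransform a probability F to a GEV quantile (Eq. 4.11; k != 0 here)."""
    return xi + alpha * (1.0 - (-math.log(F)) ** k) / k

def conditional_normal(rho, z):
    """Draw from the conditional pdf of a bivariate standard normal (Eqs. 4.8-4.9)."""
    return random.gauss(rho * z, math.sqrt(1.0 - rho * rho))

# Fitted GEV parameters (Table 4.8) and correlations (Table 4.9)
GEV = {"l1": (32887.133, 24156.669, -0.38524433),
       "l2": (9068.2572, 5657.7145, -0.18209885),
       "l3": (4844.7753, 2633.4439, -0.0019766099)}
RHO = {"u_l1": 0.863, "l1_l2": 0.986, "l2_l3": 0.998}

def generate_pair(y_p):
    """Generate one (forecast, observed volume) pair for threshold y_p."""
    z_u = random.gauss(0.0, 1.0)                   # step 1: normal score of U
    u = math.exp(10.582935 + 0.9640626 * z_u)      # lognormal observation (Table 4.8)
    z_l1 = conditional_normal(RHO["u_l1"], z_u)    # step 2
    x_l1 = gev_inverse(phi(z_l1), *GEV["l1"])      # step 3
    z_l2 = conditional_normal(RHO["l1_l2"], z_l1)  # step 4
    x_l2 = gev_inverse(phi(z_l2), *GEV["l2"])      # step 5
    z_l3 = conditional_normal(RHO["l2_l3"], z_l2)  # step 6
    x_l3 = gev_inverse(phi(z_l3), *GEV["l3"])      # step 7
    # Step 8: GPA parameters from the L-moments (Eqs. 4.14-4.16)
    k_f = (1.0 - 3.0 * x_l3 / x_l2) / (1.0 + x_l3 / x_l2)
    a_f = (1.0 + k_f) * (2.0 + k_f) * x_l2
    xi_f = x_l1 - (2.0 + k_f) * x_l2
    arg = 1.0 - k_f * (y_p - xi_f) / a_f           # argument of the log in Eq. 4.13
    if arg <= 0.0:
        f = 1.0 if k_f > 0 else 0.0                # threshold outside the GPA support
    else:
        v = -math.log(arg) / k_f                   # Eq. 4.13 (k_f != 0 in practice)
        f = 1.0 - math.exp(-v)                     # Eq. 4.12
    return min(max(f, 0.0), 1.0), u
```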
The true CR decompositions are taken to be those obtained by DSC applied to one million pairs of forecasts and observations. The verification data set is produced for two y_p thresholds, the 0.25 and 0.05 quantiles of the observations (p = 0.25 and p = 0.05), so that the results can be compared with those of the analytical model described in the previous section. Again, the probabilistic forecasts are issued as continuous numbers.
4.3.2 Results and Discussion
The estimates of µx|f for both cases, p = 0.25 and p = 0.05, with sample sizes 50 and 1000, are shown in Figures 4.11 and 4.12. The logistic regression model cannot reproduce the two peaks at f = 0.1 and 0.9 that the true relationship shows. KDM also fails to represent them for a sample size of 50. However, for the case of p = 0.25 with sample size 50, these methods follow the true line with smaller uncertainty than DSC does. With a sample size of 1000, the logistic regression and kernel density estimation still show smaller uncertainties than DSC. For the extreme case p = 0.05 with 50 samples, the mean estimates by DSC are much lower than the true line, while those by LRM and KDM are closer. In the case with 1000 samples, the logistic regression is a highly biased estimator with low variability.
Figure 4.13 shows the box plots of REL/σ²x and RES/σ²x for the moderate flow event p = 0.25. As seen for the analytical model, LRM and CM produce medians closer to the true REL/σ²x than DSC does for sample sizes of 1000 or less. Those of KDM are also closer to the true value than DSC for sample sizes below 600. Surprisingly, the three continuous approaches show less dispersion (IQR) than DSC does. As for Resolution, each continuous approach produces medians closer to the true RES/σ²x than DSC does for sample sizes of 800 or less, while the IQRs are very similar to each other. Figure 4.14 shows the box plots of REL/σ²x and RES/σ²x for the extreme flow event p = 0.05. The median of the Reliability estimator by LRM is closer to the true value than that of DSC for sample sizes of 1000 or less. CM performs the worst for Reliability at sample sizes 50 through 200, and the Reliability estimates by KDM and CM are negative for the small sample sizes. LRM has the smallest uncertainty (IQR) over all the sample sizes, while KDM and CM have large uncertainty at sample sizes 50 and 100. For Resolution, the medians of the LRM estimators are closer to the true value for all the sample sizes studied.
Finally, the RMSE for REL/σ²x and RES/σ²x is discussed. For the moderate threshold, p = 0.25 (Table 4.10), KDM is the best estimator of Reliability for the small sample sizes of 50 and 100; from sample size 200 onward, LRM becomes the best. Table 4.11 shows that CM achieves the lowest error in Resolution for sample sizes of 1000 or less. Indeed, all the continuous approaches are better for Resolution
Figure 4.11: Conditional mean of the observations given the forecasts µx|f estimated by three methods, DSC, LRM, and KDM, for nonexceedance probability p = 0.25 with sample sizes 50 and 1000. The forecasts are produced by the stochastic model.
Figure 4.12: Conditional mean of the observations given the forecasts µx|f estimated by three methods, DSC, LRM, and KDM, for nonexceedance probability p = 0.05 with sample sizes 50 and 1000. The forecasts are produced by the stochastic model.
Figure 4.13: CR decompositions (REL/σ²x and RES/σ²x versus sample size) estimated by four approaches for nonexceedance probability p = 0.25; “D” is the discretized (11-bin) approach (DSC), “L” is logistic regression (LRM), “K” is kernel density estimation applied directly to r(f|x) (KDM), and “C” is the combination of logistic regression and kernel density estimation (CM). The maximum, upper quartile, median, lower quartile, and minimum are indicated from top to bottom. The forecasts are produced by the stochastic model.
Figure 4.14: CR decompositions (REL/σ²x and RES/σ²x versus sample size) estimated by four approaches for nonexceedance probability p = 0.05; “D” is the discretized (11-bin) approach (DSC), “L” is logistic regression (LRM), “K” is kernel density estimation applied directly to r(f|x) (KDM), and “C” is the combination of logistic regression and kernel density estimation (CM). The maximum, upper quartile, median, lower quartile, and minimum are indicated from top to bottom. The forecasts are produced by the stochastic model.
with small samples than the discrete approach. LRM is still the best estimator for p = 0.05 for sample sizes of 1000 or less, although the logistic regression produces a biased distribution of the conditional mean µx|f, as discussed before. Its lowest error appears to be achieved through its very low variability.

In conclusion, all the continuous approaches achieve less error in Reliability (REL) and Resolution (RES) than the discrete approach for both the moderate and extreme flow events. By imposing some structure on the distribution of µx|f, the continuous approaches reduce the variability and/or the bias. Note that kernel density estimation has more flexibility in the estimation of µx|f than logistic regression, whereas logistic regression shows lower variability. For the moderate flow event p = 0.25, each continuous approach produces a median of the REL (RES) estimator closer to the true value than the discrete approach for sample sizes 50 to 600 (1000). Moreover, KDM is the best estimator of REL for the small sample sizes of 50 and 100, while CM has the least error in RES for all the sample sizes studied. For the extreme flow event p = 0.05, LRM is the best estimator of REL and RES for small sample sizes.
4.4 Monte Carlo Simulation with Discrete Joint Distribution Model
The verification datasets generated by the analytical and stochastic models consist of forecasts originally issued as continuous numbers between 0 and 1. In those cases, the continuous approaches turned out to work better than the discrete approach for moderate and extreme thresholds with small sample sizes. What if the forecasts are originally issued in a discrete manner? In this section, the forecasts are generated from 12 discrete values, and then the three continuous methods, LRM, KDM, and CM, are applied to the verification datasets. Here, the discrete approach, again referred to as DSC, uses the 12 discrete forecast values from which the forecasts are generated.
4.4.1 Assumptions and Procedure
This example of a discrete forecast system is taken from Wilks (1995, p. 246),
Subjective 12-24-h Projection Probability-of-Precipitation Forecasts for United States
during October 1980-March 1981. Since the conditional distribution q(x|f) and
marginal distribution s(f) are given, the joint distribution is reconstructed from
Table 4.10: Root Mean Squared Error (RMSE) in REL/σ²x for the forecasts generated for nonexceedance probability p = 0.25 by the stochastic model.

Size    DSC         LRM         KDM         CM
50      1.944e-001  6.448e-002  5.833e-002  7.743e-002
100     1.133e-001  3.673e-002  3.536e-002  4.001e-002
200     5.953e-002  2.061e-002  2.587e-002  2.298e-002
400     3.256e-002  1.279e-002  2.049e-002  1.589e-002
600     2.310e-002  1.067e-002  1.793e-002  1.359e-002
800     1.758e-002  8.469e-003  1.661e-002  1.176e-002
1000    1.482e-002  7.713e-003  1.544e-002  1.082e-002

Note: the smallest value in each row indicates the best estimator.
Table 4.11: Root Mean Squared Error (RMSE) in RES/σ²x for the forecasts generated for nonexceedance probability p = 0.25 by the stochastic model.

Size    DSC         LRM         KDM         CM
50      2.272e-001  1.678e-001  1.648e-001  1.565e-001
100     1.456e-001  1.182e-001  1.190e-001  1.093e-001
200     9.322e-002  8.251e-002  8.333e-002  7.752e-002
400     6.368e-002  5.906e-002  5.973e-002  5.564e-002
600     5.327e-002  5.109e-002  5.176e-002  4.892e-002
800     4.381e-002  4.227e-002  4.271e-002  4.043e-002
1000    3.951e-002  3.870e-002  3.952e-002  3.739e-002

Note: the smallest value in each row indicates the best estimator.
Table 4.12: Root Mean Squared Error (RMSE) in REL/σ2x for the forecasts gener-
ated for nonexceedance probability p = 0.05 by the stochastic model.
n      DSC        LRM        KDM        CM
50 9.255e-001 7.682e-001 1.441e+001 3.290e+002
100 7.171e-001 5.617e-001 7.241e-001 6.152e+001
200 5.253e-001 3.972e-001 5.172e-001 3.259e+000
400 3.621e-001 2.720e-001 3.726e-001 5.435e-001
600 2.918e-001 2.283e-001 3.097e-001 2.304e-001
800 2.470e-001 1.982e-001 2.755e-001 2.013e-001
1000 2.067e-001 1.656e-001 2.465e-001 1.689e-001
Note: the underlined value is the smallest in the row.
Table 4.13: Root Mean Squared Error (RMSE) in RES/σ2x for the forecasts gener-
ated for nonexceedance probability p = 0.05 by the stochastic model.
n      DSC        LRM        KDM        CM
50 9.238e-001 7.230e-001 1.447e+001 3.289e+002
100 6.576e-001 4.247e-001 6.957e-001 6.135e+001
200 4.560e-001 2.403e-001 4.538e-001 2.733e+000
400 3.112e-001 1.599e-001 3.222e-001 4.790e-001
600 2.340e-001 1.323e-001 2.692e-001 1.295e-001
800 2.131e-001 1.228e-001 2.604e-001 1.195e-001
1000 1.657e-001 1.028e-001 2.230e-001 1.025e-001
Note: the underlined value is the smallest in the row.
[Figure 4.15 panels: r(f|x=0), r(f|x=1), s(f), and µx|f, each plotted against the probabilistic forecast.]
Figure 4.15: True marginal and conditional distributions of the discrete forecasts.
CR factorizations. The forecast takes on the following twelve values: 0.0, 0.05,
0.1, 0.2, · · · , 0.8, 0.9, 1.0. The verification set is generated using the CR
factorizations as follows:
1. Generate a forecast f from the cumulative distribution function of the forecasts
by drawing a uniform random number.
2. Generate a Bernoulli variate x with the probability π that the event x = 1 occurs
given the forecast f; π is obtained from the conditional distribution of the
observations given the forecasts, π = q(x = 1|f).
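The two steps can be sketched as follows; the marginal s(f) and conditional q(x = 1|f) used here are placeholder choices (a uniform marginal and a perfectly calibrated conditional) standing in for the Wilks (1995) values, which are not reproduced:

```python
import numpy as np

rng = np.random.default_rng(0)

# The 12 discrete forecast values used in this section.
f_values = np.array([0.0, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5,
                     0.6, 0.7, 0.8, 0.9, 1.0])
s_f = np.full(12, 1.0 / 12)    # placeholder marginal distribution s(f)
q_x1_f = f_values              # placeholder conditional q(x=1|f): calibrated

def generate_verification_set(n):
    """Step 1: draw forecasts from s(f) by inverse-CDF sampling.
       Step 2: draw Bernoulli observations with pi = q(x=1|f)."""
    cdf = np.cumsum(s_f)
    cdf[-1] = 1.0                        # guard against rounding in cumsum
    idx = np.searchsorted(cdf, rng.random(n))
    f = f_values[idx]
    x = (rng.random(n) < q_x1_f[idx]).astype(int)
    return f, x

f, x = generate_verification_set(500)
```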
Note that the unconditional probability of precipitation is t(x = 1) = 0.162,
which lies between the moderate (p = 0.25) and extreme (p = 0.05) cases considered
in the previous sections (Table 4.14).
Table 4.14: Basic information and true forecast quality measures of Subjective 12-24-h Projection Probability-of-Precipitation Forecasts for the United States during October 1980-March 1981 from Wilks (1995).
Sample size  12,402      MSE/σx²   5.39E-01
t(x=1)       0.162       ME/σx     3.96E-02
σx²          0.135718
TY2/σx²      2.86E-01    Dis/σx²   2.17E-01
REL/σx²      6.03E-03    RES/σx²   4.67E-01
4.4.2 Result and Discussion
First, the distribution of µx|f is discussed (Figure 4.16). Even though DSC
does not suffer any loss of information from binning, its estimator with 50 samples
is biased, with larger uncertainty than the continuous methods; the DSC estimator
becomes almost unbiased at a sample size of 200. The logistic regression in LRM
gives a smooth “S” curve; it is biased in both the upward and downward directions
even with 1,000 samples. On the other hand, KDM has smaller bias as a whole with
1,000 samples, which shows the high flexibility of the kernel density estimation
method.
Figure 4.17 shows the box plots for REL/σx² and RES/σx². In this verification
dataset, KDM's medians for REL/σx² and RES/σx² are closer to the true values than
those of any other method. The median of LRM is closer to the true REL/σx²
(RES/σx²) than DSC up to sample sizes of 400 (200), whereas CM also has a median
closer to the true REL/σx² (RES/σx²) than DSC for sample sizes of 200 (100) or
less. According to the IQR for Reliability, the estimators by LRM and KDM have
less uncertainty. For Resolution, the IQRs are almost the same.
Tables 4.15 and 4.16 show the RMSE in REL/σx² and RES/σx². KDM's estimator
for REL/σx² at a sample size of 50 has about one third the RMSE of DSC's. Again,
KDM can yield negative estimates for Reliability, while LRM approaches the true
value from the positive direction only (Figure 4.17). KDM gives the best estimator
for Resolution, which is positive.
The main findings of this analysis are the following. Even though DSC does
not suffer any loss of information from binning, its estimator for µx|f is biased
and has larger uncertainty than the continuous methods. On the other hand, KDM
Figure 4.16: Conditional mean of the observations given the forecasts, µx|f, estimated by three methods, DSC, LRM, and KDM, with sample sizes 50 and 1000. The forecasts are produced by the discrete model.
Table 4.15: Root Mean Squared Error (RMSE) in REL/σ2x for the forecasts gener-
ated by the discrete model.
n      DSC        LRM        KDM        CM
50 2.134e-001 8.891e-002 7.053e-002 1.572e-001
100 1.234e-001 5.120e-002 3.417e-002 8.092e-002
200 6.423e-002 3.345e-002 1.655e-002 5.341e-002
400 3.320e-002 2.640e-002 1.085e-002 4.068e-002
600 2.257e-002 2.369e-002 9.491e-003 3.519e-002
800 1.720e-002 2.226e-002 8.552e-003 3.248e-002
1000 1.378e-002 2.141e-002 7.739e-003 3.062e-002
Note: the underlined value is the smallest in the row.
Table 4.16: Root Mean Squared Error (RMSE) in RES/σ2x for the forecasts gener-
ated by the discrete model.
n      DSC        LRM        KDM        CM
50 2.769e-001 2.221e-001 2.032e-001 2.657e-001
100 1.824e-001 1.555e-001 1.448e-001 1.749e-001
200 1.228e-001 1.108e-001 1.041e-001 1.219e-001
400 8.320e-002 8.134e-002 7.512e-002 8.917e-002
600 6.875e-002 6.927e-002 6.479e-002 7.513e-002
800 5.958e-002 6.134e-002 5.596e-002 6.684e-002
1000 5.154e-002 5.417e-002 4.908e-002 5.918e-002
Note: the underlined value is the smallest in the row.
Figure 4.17: CR decompositions estimated by four approaches for the discrete forecast; “D” is the discretized (12-binned) approach (DSC), “L” is logistic regression (LRM), “K” is kernel density estimation applied directly to r(f|x) (KDM), and “C” is the combination of logistic regression and kernel density estimation (CM). The maximum, upper quartile, median, lower quartile, and minimum are indicated from top to bottom. The forecasts are produced by the discrete model.
shows more flexibility than LRM in the estimation of µx|f. In the estimation of
REL/σx² and RES/σx², the median by KDM was the closest to the true value. KDM's
estimator for Reliability with a sample size of 50 is about three times more
efficient than DSC's. Even in the case where the forecasts are originally issued
as discrete numbers, all the continuous approaches gave better estimators for
REL/σx² and RES/σx² at small sample sizes, such as 50 or 100, than the discrete
approach with a contingency table.
4.5 Summary and Conclusions
This chapter investigated three continuous approaches (LRM, KDM, and CM)
to reduce the estimation error in DO measures. Verification datasets were
generated by three forecasting systems: an analytical model for the joint
distribution, a stochastic model of an ESP streamflow forecasting system, and a
discrete joint distribution model. For these verification datasets, the
distributions of the DO measures, and the marginal and conditional distributions,
were calculated by the three continuous approaches and the discrete approach.
For small samples, the continuous approach with LRM gives the best estimator
of the CR decompositions. It is better than the traditional contingency table
approach, DSC, whether the forecasts are issued as discrete or continuous numbers.
LRM is the best estimator for forecasts issued for extreme events. KDM is also a
better estimator than DSC for small samples, and works better than LRM for
forecasts issued for moderate events. However, the LRM estimator of Reliability
appears to be always positive, whereas that of KDM can be negative; in this
respect, LRM is more desirable. Examination of box plots and decompositions of
RMSE indicated that the continuous approaches achieve better estimation by
reducing both the unconditional bias and the variance of the Reliability and
Resolution estimates.
As seen in the case of the analytical model, the logistic regression models the
conditional mean µx|f very well. However, the LRM approach has difficulty where
the true µx|f has peaks, as illustrated by the stochastic model of the ESP
streamflow forecast. The kernel density estimation has high flexibility, although
this research utilized a simple method to cope with the boundary effect. For
extreme events, the arithmetic average, Equation (3.36), may have a more desirable
feature in that it reflects the marginal distribution of forecasts s(f), rather
than relying on the indirect estimation of s(f) by the kernel density estimation
method in Equation (A.10). The case of the arithmetic average of µ²x|f estimated
by KDM was also examined (not shown here); it produced smaller RMSE for the
extreme flow event p = 0.05
than the original KDM. In both verification datasets generated for the extreme
threshold p = 0.05, only LRM surpasses DSC in terms of RMSE. KDM fails because
the samples available to estimate r(f|x = 1) are so limited, while the logistic
regression can utilize all the forecast samples. For example, KDM fails to
determine the smoothing parameter h for r(f|x = 1) if the standard deviation of
the forecasts given x = 1 is 0, which often happens at the small sample sizes of
50 and 100.
Finally, even if forecasts are originally issued as discrete numbers, the
continuous approaches may yield better estimators than the discrete approach,
especially for small sample sizes, say, less than 100. For forecasts given as
continuous numbers, the discrete approach introduces bias by converting the
continuous forecasts into discrete numbers. Since the continuous approaches
impose some structure on the distribution of µx|f without changing the original
forecasts, it is easy to see why they perform better than the discrete approach
for small sample sizes. However, even for forecasts originally issued as discrete
numbers, the continuous approaches gave better estimates of the measures than the
discrete approach. Moreover, the discrete approach with a contingency table
requires the selection of bin widths to obtain reasonable sample sizes in the
bins, whereas few parameters have to be estimated to use the continuous
approaches. Thus, in terms of implementation, the continuous approaches are
superior to the discrete approach when the verification dataset is small.
CHAPTER 5
ASSESSMENT OF BIAS CORRECTION METHODS FOR ENSEMBLE FORECASTS
This research extends the Distributions-Oriented (DO) approach to the
verification of probability distribution forecasts (or ensemble forecasts) of
streamflow. Using verification datasets derived for discrete events, the quality
of a probability distribution forecast can be assessed over the range of possible
outcomes. This chapter demonstrates the usefulness of the DO approach. Three
types of bias correction methods are applied to ensemble forecasts produced by an
experimental forecasting system for the Upper Des Moines River basin. The DO
approach is used to assess the probabilistic forecasts modified by the bias
correction methods, and the resulting DO measures and distributions of the
forecasts are discussed.
5.1 Introduction
All dynamic models contain some bias. A hydrological model for streamflow
simulation is no exception; it has conditional biases due to input data or model
assumptions. This fact naturally provokes two questions: what are the effects of
biases on the potential use of a hydrological model, and how can these biases be
removed? In recent years, Ensemble Prediction Systems (EPSs), which produce
forecasts based on many realizations from an initial condition, have been gaining
popularity in hydrological and meteorological forecasting. The set of realizations
is called the ensemble. Conditional biases from a hydrological model may propagate
to the ensemble, and then to the probabilistic forecasts obtained by frequency
analysis of the ensemble. However, since the true distribution of the ensemble
(e.g., the one produced by a hydrological model without biases) cannot be
obtained, direct comparison of modified ensembles and true ensembles cannot be
used to assess the bias correction methods. One indirect approach is to examine
the probabilistic forecasts produced with the bias-corrected ensemble. Hence, the
distributions-oriented (DO) approach can be a powerful tool for evaluating bias
correction methods.
To illustrate, we examine the probabilistic forecasts for monthly streamflow
volume observed at Stratford on the Des Moines River. The available historical
Figure 5.1: Example of Bias Correction Method applied to ensemble traces.
record consists of observations from N = 49 years. On each forecast date, the
Hydrological Simulation Program-Fortran (HSPF) produces different realizations
(an ensemble) of streamflow, using the initial hydroclimatological condition of
the basin at the current time t and year-i meteorological information
(i = 1, · · · , N, except the year including t). Thus, N − 1 = 48 traces of
streamflow, starting from the current (forecast) time and continuing for one
year, are obtained. By separating each trace by month, 12 monthly streamflow
volumes with lead times of 0 through 11 months are obtained for each trace.
Frequency analysis of the ensemble volumes for one monthly volume at a given
lead time then produces a probability distribution forecast (see Chapter 2).
Three types of bias correction methods are applied to the ensembles of monthly
streamflow volumes. Changing the ensemble volumes leads to different probability
distribution forecasts (Figure 5.1). As another way of implementing bias
correction, post-hoc recalibration, which minimizes the conditional bias
(Reliability) by a linear transformation, has been suggested by Wilks (2000).
Hou et al. (1998) also suggest a post-hoc recalibration to achieve calibration of
probabilistic forecasts, using rank distributions in conjunction with the
ensemble.
The probabilistic forecast for the event that monthly volume is less than or
equal to a threshold is obtained from the probability distribution forecast. The
corresponding continuous observation is converted into a discrete number: 1
indicates that the event occurred, and 0 that it did not. Hence, the verification
dataset for the event consists of pairs of probabilistic forecasts and
discrete observations. Using the verification datasets derived for discrete
events, the quality of the probability distribution forecast can be assessed over
the range of possible outcomes. The thresholds for which forecasts are issued are
nine quantiles of the observations, with nonexceedance probabilities p = 0.05,
0.10, 0.25, 0.33, 0.50, 0.66, 0.75, 0.90, and 0.95. The verification datasets
produced by the bias correction methods are assessed by the DO measures described
in Section 3.2. In calculating the DO measures, the Logistic Regression Method
(LRM) is used (Subsection 3.3.3). Various lead times are examined to investigate
the relation between the bias of the hydrological model and the measures of
forecast quality.
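The conversion described above can be sketched for a single forecast date and threshold; the observation record, ensemble values, and function name below are invented for illustration:

```python
import numpy as np

def verification_pair(ensemble, observation, threshold):
    """Probabilistic forecast f = fraction of ensemble traces at or below the
    threshold; binary observation x = 1 if the event {volume <= threshold}
    occurred."""
    f = float(np.mean(np.asarray(ensemble, dtype=float) <= threshold))
    x = int(observation <= threshold)
    return f, x

# Hypothetical monthly volumes (cfsd); the threshold is the p = 0.25
# quantile of a toy observation record.
obs_record = np.array([10000.0, 20000.0, 30000.0, 40000.0, 50000.0])
threshold = np.quantile(obs_record, 0.25)
f, x = verification_pair([15000.0, 25000.0, 35000.0, 45000.0],
                         observation=18000.0, threshold=threshold)
```

Repeating this over all forecast dates, one threshold at a time, yields the nine verification datasets assessed by the DO measures.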
5.2 Biases in Historical Simulations
Let Yi be the volume observed in a given month of year i, and Ŷi the
corresponding historical simulation of monthly volume (i = 1, · · · , 49). The main
idea in finding a “correction” for ensemble volumes is to use the relationship
between Yi and Ŷi. To shed light on the characteristics of the monthly streamflow
volume, the mean,
standard deviation (SD) and coefficient of variation (CV) of the observations, and
the Mean Error (ME), Root Mean Square Error (RMSE), and correlation coefficient
(CC) between the observations and historical simulations are calculated. The mean
of the observed monthly volume in Table 5.1 reveals that the wet season of this basin
is from March to July, and the dry season is from August to February. According
to Table 5.2, January, February, August, September, October, and November have
the positive values in Mean Error, which indicates that the monthly volumes tend
to be overestimated by the hydrological model. In contrast, the other months un-
derestimate the monthly volume. Since this basin has high flow events from March
to July and low flow events in other months, this model tends to underestimate
high flow events and overestimate low flow events. This can be also seen from the
time series of the observed monthly volume and historical simulation (Figure 5.2).
The MSE (Mean Square Error) Skill Score (SSMSE), a relative measure of
accuracy, is also calculated for each month. It compares the MSE (an absolute
accuracy measure) for the historical simulations with that for climatology (the
mean of the observations), as SSMSE = 1 − (RMSE/SD)². The SSMSE indicates that
the January historical simulation has the lowest accuracy of the twelve months.
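These per-month statistics can be sketched generically as below; the function name is illustrative, and the sign convention ME = mean(simulation − observation) matches the text's reading that positive ME means overestimation:

```python
import numpy as np

def simulation_scores(obs, sim):
    """Mean Error, RMSE, correlation coefficient, and the MSE skill score
    SSMSE = 1 - (RMSE/SD)^2, where climatology is the mean of the obs."""
    obs = np.asarray(obs, dtype=float)
    sim = np.asarray(sim, dtype=float)
    err = sim - obs
    me = err.mean()                          # positive -> overestimation
    rmse = np.sqrt((err ** 2).mean())
    cc = np.corrcoef(obs, sim)[0, 1]
    ss_mse = 1.0 - (rmse / obs.std()) ** 2   # 1 = perfect, 0 = no skill
    return me, rmse, cc, ss_mse

# A perfect simulation scores ME = 0, RMSE = 0, CC = 1, SSMSE = 1.
me, rmse, cc, ss_mse = simulation_scores([1.0, 2.0, 3.0, 4.0],
                                         [1.0, 2.0, 3.0, 4.0])
```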
Table 5.1: Mean, Standard Deviation (SD), and Coefficient of Variation (CV) of the observed monthly volume (cfsd) for the Des Moines River at Stratford.
Month Mean SD CV Month Mean SD CV
Jan 18022 21501 1.19 Jul 109626 137366 1.25
Feb 27432 40896 1.49 Aug 50769 74678 1.47
Mar 116214 101187 0.871 Sep 40627 53710 1.32
Apr 176127 168633 0.957 Oct 43520 55086 1.27
May 134802 111722 0.829 Nov 38202 44075 1.15
Jun 148183 137626 0.929 Dec 29157 35102 1.20
Table 5.2: Mean Error (ME), Root Mean Square Error (RMSE), correlation coefficient (CC), and Mean Square Error (MSE) Skill Score (SSMSE) between the observed monthly volumes and historical simulations.
Month MEa RMSEa CC SSMSE Month MEa RMSEa CC SSMSE
Jan 3610 15122 0.831 0.505 Jul -6588 35658 0.970 0.933
Feb 11817 22669 0.921 0.693 Aug 15666 26622 0.957 0.873
Mar -8301 50431 0.870 0.752 Sep 17509 25112 0.950 0.781
Apr -22705 50842 0.967 0.909 Oct 7348 25799 0.893 0.781
May -18143 42335 0.946 0.856 Nov 823 15810 0.935 0.871
Jun -29524 53399 0.952 0.849 Dec -2099 18583 0.866 0.720
a The unit is cfsd.
5.3 Bias Correction Methods
Let Ỹi be the ensemble monthly volume, conditional on the initial hydrological
conditions and the year-i meteorological conditions. It is natural to expect that
the bias in the ensemble volume depends on the month for which the ensemble
volume is issued and on its magnitude. Thus, the bias-corrected ensemble volumes
are obtained through some function fj() for a given month j:

Zi = fj(Ỹi).        (5.1)
Figure 5.2: Comparison of observed monthly volume and historical simulation from January 1988 to December 1997.
The function fj() is then estimated from the set of volumes observed in month j
and the corresponding historical simulations, that is, Yi and Ŷi
(i = 1, · · · , 49). This research investigates a multiplicative correction,
regression, and quantile mapping as the function.
5.3.1 Event-Bias Correction Method
The first method is the Event-Bias Correction method (EBC). This method
assumes the same bias exists for the same historical meteorological input. Smith
et al. (1992) defined a multiplicative corrector as

Zi = [Yi Ŷi⁻¹] Ỹi.        (5.2)

This method corrects the historical simulation perfectly; i.e., when Ỹi = Ŷi,
Zi = Yi. The unique feature of this method is that it assumes the same
multiplicative bias for a given historical meteorological event, regardless of
the magnitude of the ensemble volume. The left panel of Figure 5.3 shows an
example of EBC.
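A sketch of EBC under this assumption (array and function names are illustrative): each ensemble trace is scaled by the observed-to-simulated ratio of the historical year whose meteorology drives that trace.

```python
import numpy as np

def ebc(ensemble, obs_hist, sim_hist):
    """Event-Bias Correction (Eq. 5.2): multiply each ensemble trace by the
    multiplicative bias Y_i / Yhat_i of the matching historical year,
    regardless of the trace's magnitude."""
    ratio = np.asarray(obs_hist, dtype=float) / np.asarray(sim_hist, dtype=float)
    return ratio * np.asarray(ensemble, dtype=float)

# If an ensemble trace reproduces its year's historical simulation exactly,
# EBC returns that year's observation.
corrected = ebc(ensemble=[100.0, 200.0],
                obs_hist=[90.0, 260.0],
                sim_hist=[100.0, 200.0])
```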
5.3.2 Regression-Type Method
The second method uses the expected value of the observations given the
simulated volume, based on a regression between observed volumes and the
corresponding historical simulations. First, consider the simplest case, where
the bias-corrected simulation is given by linear interpolation between the
historical simulations and observations (hereafter RLI). With the order
statistics of the historical simulations denoted by

Ŷ(1) ≤ Ŷ(2) ≤ · · · ≤ Ŷ(n),        (5.3)

and the observation paired with Ŷ(j) denoted by Y′(j), the corrected ensemble
simulation is defined by

Zi = [(Y′(2) − Y′(1)) / (Ŷ(2) − Ŷ(1))] (Ỹi − Ŷ(1)) + Y′(1)        for Ỹi < Ŷ(2)

Zi = [(Y′(j+1) − Y′(j)) / (Ŷ(j+1) − Ŷ(j))] (Ỹi − Ŷ(j)) + Y′(j)        for Ŷ(j) ≤ Ỹi < Ŷ(j+1)        (5.4)

Zi = [(Y′(n) − Y′(n−1)) / (Ŷ(n) − Ŷ(n−1))] (Ỹi − Ŷ(n−1)) + Y′(n−1)        for Ŷ(n−1) ≤ Ỹi
Figure 5.3: Example of the bias correction for a 1-month lead time forecast with the initial condition of January 1949; EBC (Event-Bias Correction method) on the left and RLI (linear interpolation) on the right.
The right panel of Figure 5.3 shows the results of RLI for January 1949. Like
EBC, RLI returns the observed value for each historical simulation, and it also
takes the magnitude of the simulation into account.
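Equation (5.4) can be sketched as a piecewise-linear map from the sorted historical simulations to their paired observations, with the end segments extended for extrapolation (a generic implementation, not the report's code):

```python
import numpy as np

def rli(ensemble, obs_hist, sim_hist):
    """RLI (Eq. 5.4): map each ensemble volume through the piecewise-linear
    relation between sorted historical simulations Yhat_(j) and their paired
    observations Y'_(j), extrapolating with the first and last segments."""
    order = np.argsort(sim_hist)
    ys = np.asarray(sim_hist, dtype=float)[order]  # Yhat_(1) <= ... <= Yhat_(n)
    yo = np.asarray(obs_hist, dtype=float)[order]  # paired observations Y'_(j)
    e = np.asarray(ensemble, dtype=float)
    # Segment index j such that ys[j] <= e < ys[j+1], clipped so values
    # outside the range reuse the end segments (the extrapolation branches).
    j = np.clip(np.searchsorted(ys, e, side="right") - 1, 0, len(ys) - 2)
    slope = (yo[j + 1] - yo[j]) / (ys[j + 1] - ys[j])
    return yo[j] + slope * (e - ys[j])

# Toy pairs with obs = 2 * sim; interior points interpolate and the value
# beyond the largest simulation extrapolates along the last segment.
z = rli(ensemble=[0.5, 1.5, 4.0], obs_hist=[2.0, 4.0, 6.0],
        sim_hist=[1.0, 2.0, 3.0])
```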
Second, consider a power function (RPF), one of the most common regression
functions. The bias-corrected ensemble simulation is given by

Zi = b Ỹi^c.        (5.5)

The parameters b and c are optimized to minimize the sum of squared errors for
each month, using the historical simulations and observed volumes:

Σ_{i=1}^{N} (Yi − b Ŷi^c)² → min.        (5.6)
Figure 5.4 shows the power functions fitted to each set of observations and
corresponding historical simulations for the May and September monthly volumes.
Comparison with the broken 1:1 (no-bias) line reveals that each month contains
conditional bias.
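A sketch of the power-function fit, with one hedge: Equation (5.6) minimizes squared error in the original space, whereas this illustration uses the common log-log least-squares shortcut, which coincides with it only when the power law holds exactly:

```python
import numpy as np

def fit_power(obs, sim):
    """Fit Y ~ b * Yhat^c by linear least squares in log-log space
    (an approximation to the original-space criterion of Eq. 5.6)."""
    ln_sim = np.log(np.asarray(sim, dtype=float))
    ln_obs = np.log(np.asarray(obs, dtype=float))
    c, ln_b = np.polyfit(ln_sim, ln_obs, 1)   # slope = c, intercept = ln(b)
    return np.exp(ln_b), c

# Toy data generated by obs = 2 * sim^2, so the fit recovers b = 2, c = 2.
b, c = fit_power(obs=[2.0, 8.0, 18.0], sim=[1.0, 2.0, 3.0])
```

For the original-space criterion, the log-log estimates are a reasonable starting point for a nonlinear least-squares refinement.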
A third regression-type method, LOWESS (LOcally WEighted Scatterplot
Smoothing), is also investigated (called RLW). In essence, the expected value is
obtained by weighting the samples within a moving window according to their
vertical and horizontal distances. The LOWESS procedure is roughly stated here
(see Cleveland, 1979, for details). Consider a scatterplot of points (xi, yi) for
i = 1, · · · , n. For each xi, a weight function W is used to construct weights
ωk(xi) for all xk (k = 1, · · · , n). In this procedure, centering W at xi and scaling it are
[Figure 5.4 panels: fitted power functions 0.910 x^1.02 (May) and 0.101 x^1.17 (September), plotted against the raw data.]
Figure 5.4: Observed monthly volume versus simulated monthly volume with the fitted power function for May and September.
done so that the point at which W first becomes zero is at the rth nearest
neighbour of xi. Introducing a parameter f, 0 ≤ f ≤ 1, r is defined as the
integer nearest to fn. The initial smoothed value ŷi at each xi is obtained by a
linear regression to the data using weighted least squares with weights ωk(xi).
Then, another set of weights, δi, is constructed for each (xi, ŷi) based on the
size of the residual yi − ŷi, using the weight function W. Just as large
distances xk − xi lead to small weights ωk(xi), large residuals result in small
weights δi, and vice versa. New smoothed values are computed by linear regression
using weighted least squares with weights δiωk(xi).
Cleveland (1979) recommended using PRESS (PRediction Error Sum of Squares)
to estimate the smoothing parameter f. The PRESS statistic is a cross-validation-type
estimator of error, and applying PRESS chooses f so that the regression produces
the least error in making new predictions (Helsel and Hirsch, 1992). The weight
function W used here is the bisquare function:

B(x) = (1 − x²)²        if |x| < 1
B(x) = 0                otherwise        (5.7)
In addition, if any pair of adjacent estimated points gives a negative slope,
the smoothing parameter f is increased in 0.01 increments until a positive slope
is obtained; this yields a one-to-one function. Finally, the bias-corrected
conditional model estimator is obtained by linear interpolation between the
points estimated by LOWESS. Figure 5.5 shows the points estimated by LOWESS
[Figure 5.5 panels: LOWESS fits (f = 0.44) for May and September, plotted against the raw data.]
Figure 5.5: Observed monthly volume versus simulated monthly volume with LOWESS regression for May and September.
and the interpolated segments between them for the May and September monthly
volumes.
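The procedure above can be sketched as a single non-robust LOWESS pass; the residual-based weights δi and the PRESS-based choice of f are omitted, so this is an illustration rather than Cleveland's full algorithm:

```python
import numpy as np

def bisquare(u):
    """Bisquare weight function (Eq. 5.7)."""
    w = np.zeros_like(u)
    inside = np.abs(u) < 1.0
    w[inside] = (1.0 - u[inside] ** 2) ** 2
    return w

def lowess_pass(x, y, f=0.5):
    """One non-robust LOWESS pass: at each x_i, a weighted linear fit with
    bisquare weights scaled so the r-th nearest neighbour gets weight zero,
    where r is the integer nearest to f * n."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    r = max(2, int(round(f * n)))
    smoothed = np.empty(n)
    for i in range(n):
        d = np.abs(x - x[i])
        h = np.sort(d)[r - 1]                     # distance to r-th neighbour
        w = bisquare(d / h)
        # np.polyfit weights the unsquared residuals, so pass sqrt(w) to get
        # weighted least squares with weights w.
        beta = np.polyfit(x, y, 1, w=np.sqrt(w))
        smoothed[i] = np.polyval(beta, x[i])
    return smoothed

# A noiseless straight line is reproduced exactly by every local fit.
x = np.arange(10.0)
y = 2.0 * x + 1.0
smoothed = lowess_pass(x, y, f=0.5)
```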
5.3.3 Quantile-Mapping Method
The idea of the third method, the quantile-mapping method (QM), is that the
observed volumes and the historical simulations should have the same cumulative
relative frequency. As with the historical simulations in the linear
interpolation method (Equation 5.3), take the order statistics of the
observations:

Y(1) ≤ Y(2) ≤ · · · ≤ Y(n).        (5.8)

With this set of order statistics, (Ŷ(i), Y(i)), the corrected simulation is
obtained by interpolation; the equations simply replace Y′(i) with Y(i) in
Equation (5.4). This method is based on the one-to-one transformation between the
empirical cumulative distribution functions of the historical simulations and
observations. It therefore assumes that the conditional model estimators obey the
empirical cumulative distribution function (CDF) of the historical simulations.
An example of QM is shown in Figure 5.6.
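A sketch of QM: sort the simulations and observations independently so that equal ranks are paired, then interpolate exactly as in RLI (a generic implementation, not the report's code):

```python
import numpy as np

def qm(ensemble, obs_hist, sim_hist):
    """Quantile mapping: pair sorted historical simulations with sorted
    observations (matching empirical CDFs), then apply the piecewise-linear
    interpolation of Eq. (5.4) with Y'_(j) replaced by the sorted obs."""
    ys = np.sort(np.asarray(sim_hist, dtype=float))
    yo = np.sort(np.asarray(obs_hist, dtype=float))
    e = np.asarray(ensemble, dtype=float)
    j = np.clip(np.searchsorted(ys, e, side="right") - 1, 0, len(ys) - 2)
    slope = (yo[j + 1] - yo[j]) / (ys[j + 1] - ys[j])
    return yo[j] + slope * (e - ys[j])

# The median simulated value (2.0) maps to the median observation (3.0),
# regardless of which year each observation came from.
z = qm(ensemble=[2.0], obs_hist=[3.0, 1.0, 5.0], sim_hist=[1.0, 2.0, 3.0])
```

The contrast with RLI is the pairing: RLI keeps simulation-observation pairs from the same year, while QM pairs values of equal rank.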
5.4 Result and Discussion
5.4.1 Performance Measures
First, the MSE (Mean Square Error) Skill Score (SSMSE) is considered for
each month. The MSE for the probabilistic forecasts of monthly volume is
Figure 5.6: Example of the Quantile Mapping method (QM) for a 1-month lead time forecast with the initial condition of January 1949.
compared with the variance of the observations, which is the MSE for a
climatology forecast (the mean of the observations). As explained in Section 5.1,
the probability distribution forecasts issued by the forecasting system are
assessed at nine quantiles. Thus, nine MSEs are obtained from one probability
distribution forecast for each forecasted month and lead time. The SSMSE was
calculated with the MSE and the variance of the observations averaged over the
nine quantiles. The left-hand side of Figure 5.7 shows the SSMSE versus
forecasted month for 1- to 3-month lead times. Examination of the SSMSE for the
forecasts without bias correction (NBC) indicates a monthly variation. It is
interesting to note that NBC for 2- and 3-month lead times shows similar
patterns, although the pattern for a 1-month lead time is somewhat different.
Another important point is that the relative accuracy of these probabilistic
forecasts shows a different monthly characteristic from that of the historical
simulations (see Section 5.2). Thus, it is incorrect to assume that a month with
high relative accuracy in the historical simulation also has high relative
accuracy in ensemble forecasting. Comparison of the SSMSE for the bias correction
methods with 0 (the no-skill line) shows that all of the bias correction methods
have skill for all months.
In order to compare the SSMSE for the five bias correction methods with NBC,
the Skill Score for Bias Correction (SSBC) is introduced as

SSBC = 1 − MSE / MSE_NBC,        (5.9)
where MSE_NBC is the MSE for the simulations without bias correction. The
right-hand side of Figure 5.7 shows the SSBC versus forecasted month for 1-
through 3-month lead times. These values are also obtained by averaging the MSE
and MSE_NBC in Equation (5.9) over the nine quantiles. Clearly, some months, such
as August or September, benefit from the bias correction methods, but others,
such as May or July, do not. It is interesting to note that RPF failed to improve
on NBC in the winter season: November, December, January, and February.
Examination of the SSBC for a 1-month lead time, depicted at the top of Figure
5.7, reveals that RLI gives the largest improvement in accuracy. The SSBC for
2-month lead times indicates that QM is the best except for March, July, and
November.
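The two skill scores, with the averaging over the nine quantiles, can be sketched as follows; the function name and the toy numbers are illustrative:

```python
import numpy as np

def skill_scores(mse_q, mse_nbc_q, var_obs_q):
    """Average the per-quantile MSEs and observation variances first, then
    form SSMSE = 1 - MSE/Var(obs) against climatology and
    SSBC = 1 - MSE/MSE_NBC (Eq. 5.9) against the uncorrected forecast.
    Both scores are 1 for a perfect forecast and 0 for no skill."""
    mse = np.mean(mse_q)
    ss_mse = 1.0 - mse / np.mean(var_obs_q)
    ss_bc = 1.0 - mse / np.mean(mse_nbc_q)
    return ss_mse, ss_bc

# Toy numbers: the corrected forecast halves the uncorrected MSE and has a
# quarter of the climatological variance.
ss_mse, ss_bc = skill_scores(mse_q=[0.1, 0.1, 0.1],
                             mse_nbc_q=[0.2, 0.2, 0.2],
                             var_obs_q=[0.4, 0.4, 0.4])
```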
From the results of the SSBC, May and September are selected as examples of
small and large improvement for detailed examination. Figure 5.8 shows the SSBC
for the May and September volumes over lead time. None of the bias correction
methods improves the forecast skill much in May, even at a 1-month lead time. For
both months, the best score is obtained by QM at lead times greater than 1 month.
EBC is slightly more accurate at a 1-month lead time. It is speculated that the
multiplicative bias between the historical simulation and the corresponding
observation is well preserved in the ensemble simulation driven by that year's
meteorological input over short lead times.
Generally, the difference between a historical simulation and the
corresponding observation stems from model deficiency and input-data deficiency.
In reality, it is impossible to achieve a perfect simulation. However, in order
to approximate the measures of forecast quality for a perfect simulation model, a
forecasting case where the observations are replaced with the corresponding
historical simulations is considered. In the usual forecasting process, the
observations are discretized into 0 or 1 based on a threshold of monthly volume,
a quantile calculated from the observations. In this case, instead, the
historical simulations are discretized based on the quantiles of the historical
simulations. The DO measures are then calculated for the pairs of discretized
historical simulations and probabilistic forecasts. This forecast is called a
pseudoperfect streamflow simulation (PSS). The result for PSS should approximate
the maximum improvement that any bias correction method can achieve. However, the
set of observations for the PSS differs from the actual set of observations, so
other approaches do give better results in some cases. Still, the result for PSS
is depicted in the following figures as a reference.
September volume is investigated first. Figure 5.9 shows the Bias
Figure 5.7: MSE Skill Score (left) and Skill Score for Bias Correction (right) versus forecasted month for 1, 2, and 3-month lead times, averaged over the quantiles.
Figure 5.8: Skill Score for Bias Correction for May and September monthly volumes, averaged over the quantiles.
(left) and MSE (right) for 1 to 3-month lead times. Without bias correction, the
forecast system has negative values and a "U" shape in the Bias. That is to say,
in an absolute sense, the forecasting system without bias correction tends to under-
estimate the occurrence of the event, especially for moderate flow events. All
the bias correction methods improve the unconditional bias (Mean Error), although
RPF issues forecasts with relatively large bias in the middle of the range. The
Mean Error keeps roughly the same magnitude and shape as the lead time increases;
that is, the Mean Error depends not on the lead time but on the month for which
the forecasts are issued. Comparison of the MSE shown on the right of Figure 5.9
reveals that all the bias correction methods succeed in reducing the MSE, especially
for moderate flow events.
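For a given threshold, the Mean Error and MSE plotted in Figure 5.9 reduce to simple sample statistics of the (f, x) pairs. A minimal sketch, with illustrative data:

```python
import numpy as np

def mean_error(f, x):
    """Unconditional bias: mean forecast probability minus observed relative frequency."""
    f, x = np.asarray(f, float), np.asarray(x, float)
    return float(f.mean() - x.mean())

def mse(f, x):
    """Mean square error (Brier score) of the probability forecasts."""
    f, x = np.asarray(f, float), np.asarray(x, float)
    return float(np.mean((f - x) ** 2))

# Illustrative (f, x) pairs for one threshold
f = [0.1, 0.4, 0.8, 0.9]
x = [0, 0, 1, 1]
me = mean_error(f, x)   # 0.55 - 0.50 = 0.05
bs = mse(f, x)          # (0.01 + 0.16 + 0.04 + 0.01) / 4 = 0.055
```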
Next, the decompositions of the MSE Skill Score (Equation (3.8)) are dis-
cussed. Figure 5.10 shows the MSE Skill Score and the potential skill (the first term
in the decomposition). The MSE Skill Score for NBC shows negative scores for low
and moderate flow events, which means that the forecasts without bias correction
are worse than climatology forecasts. On the other hand, all the bias correction
methods but RLW improve the skill for moderate and low flow events. The
potential skill of NBC for low flow events is poorer than that of the other methods
at a 1-month lead time. It is surprising that all the bias correction methods improve
the potential skill, or squared association, for low flows. This may be related to the
improvement in the Resolution or Discrimination measures, since the other terms in
the Skill Score decompositions capture conditional and unconditional biases. Note
that by definition the potential skill for any method has to be equal to or greater
Figure 5.9: Comparison of Mean Error (left) and Mean Square Error (right) by five Bias Correction methods, actual (non bias-corrected) streamflow simulation (NBC), and pseudoperfect streamflow simulation (PSS), for 1, 2, and 3-month lead time September monthly volume forecasts.
Figure 5.10: Comparison of MSE Skill Score (left) and measure of association (right) by five Bias Correction methods, actual (non bias-corrected) streamflow simulation (NBC), and pseudoperfect streamflow simulation (PSS), for 1, 2, and 3-month lead time September monthly volume forecasts.
than 0 (no skill).
Figure 5.11 depicts the other terms in the decomposition of the MSE Skill
Score: a measure of conditional bias (reliability) and a relative measure of un-
conditional bias. These terms make the forecasts without bias correction less
accurate than a climatology forecast. The conditional biases (reliability) of some
bias correction methods are worse than those of the original forecasts for low and
moderate flow events, but their contributions to the Skill Score are relatively small.
As seen in the examination of the Mean Error, the relative unconditional bias of the
forecasts without bias correction is also high for moderate flow events, and all the
bias correction methods improve it dramatically. QM decreases the unconditional
bias the most, while RPF retains the most bias among the bias correction methods
for moderate flow events.
As for the May monthly volume forecasts, the performance measures and the
decompositions of the MSE Skill Score for a 1-month lead time are depicted in
Figure 5.12. Compared to the September forecasts, the May monthly volume
forecasts with no bias correction are less biased; they have almost the same
magnitude of bias as the bias-corrected forecasts. Similarly, the MSEs for NBC are
close to those of the bias correction methods. Why did the bias correction methods
make relatively small improvements in the Skill Score? The reasons are that (1) the
original forecasts are well calibrated, except for extreme low flows, and essentially
unbiased overall, and (2) the bias correction methods could not improve the
association over the full range of quantiles, although PSS suggests the possibility of
improvement for low and moderate flow events.
5.4.2 CR Factorization and Decompositions
The distributions s(f) and q(x = 1|f) = µx|f for threshold nonexceedance
probabilities p of 0.05, 0.25, and 0.5, with a 1-month lead time, are depicted in
Figure 5.13. The forecasts are issued for September monthly volume. The marginal
distribution of the forecasts s(f) and the conditional mean µx|f are estimated by
the kernel density estimation method and logistic regression, respectively (see
Subsection 3.3.3).
In the case of p = 0.05, 46 observations take on the value 0, and just 2 take on 1.
The marginal distribution of the forecasts with no bias correction concentrates
near f = 0. RLW issued only f = 0, which means the variance of the forecasts
σ²f = 0. Since the optimal kernel bandwidth h is calculated from σ²f, the kernel
density estimation method fails to estimate the marginal distribution of the
forecasts. As the magnitude of the threshold increases, more of the density of the
marginal distribution of the forecasts shifts toward f = 1. QM shows almost the
same distribution s(f) as PSS for p = 0.05, 0.25, and 0.50. EBC has a flatter
distribution of forecasts for p = 0.5, which leads to lower sharpness than the
others. The results clearly show
Figure 5.11: Comparison of decompositions of Skill Score by five Bias Correction methods, actual (non bias-corrected) streamflow simulation (NBC), and pseudoperfect streamflow simulation (PSS), for 1, 2, and 3-month lead time September monthly volume forecasts. The measure of reliability is on the left, and the measure of unconditional bias is on the right.
Figure 5.12: Performance measures and decompositions of MSE Skill Score by five Bias Correction methods, actual (non bias-corrected) streamflow simulation (NBC), and pseudoperfect streamflow simulation (PSS), for 1-month lead time May monthly volume forecasts.
that the bias correction methods improve sharpness over some ranges of quantiles.
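The kernel estimate of s(f) discussed above can be sketched with a Gaussian kernel. The Silverman bandwidth rule below is an assumption (the report's actual bandwidth formula is given in Subsection 3.3.3); the sketch still shows why the estimate degenerates when every forecast is identical, as for RLW at p = 0.05.

```python
import numpy as np

def kde_s_f(f_samples, grid, h=None):
    """Gaussian kernel density estimate of the marginal forecast
    distribution s(f). The default bandwidth (Silverman's rule, an
    assumption) is proportional to the forecast standard deviation, so
    it is undefined when all forecasts are identical."""
    f_samples = np.asarray(f_samples, float)
    n = f_samples.size
    if h is None:
        sigma_f = f_samples.std(ddof=1)
        if sigma_f == 0.0:
            raise ValueError("sigma_f = 0: bandwidth undefined, KDE fails")
        h = 1.06 * sigma_f * n ** (-0.2)
    # Evaluate the kernel sum on the grid
    u = (np.asarray(grid, float)[:, None] - f_samples[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (n * h * np.sqrt(2.0 * np.pi))

grid = np.linspace(0.0, 1.0, 101)
density = kde_s_f(np.array([0.10, 0.20, 0.20, 0.30, 0.60]), grid)
```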
As for the reliability diagrams shown on the right of Figure 5.13, the distribu-
tions of µx|f for RPF, EBC, and QM resemble a step function for p = 0.05, which
does not seem reasonable for µx|f; all methods appear to have significant con-
ditional biases. Only PSS shows a positive contribution to the MSE Skill Score over
all the forecasts for p = 0.05 (see Subsection 3.4 for how to read the diagram).
QM would be the second best (it has more area contributing positively to the Skill
Score). However, RPF achieves the best reliability measure for p = 0.05. Since the
measures of reliability and resolution weight area by the relative frequency s(f),
s(f) has to be considered to obtain proper insight into the measures of the CR
decomposition from the reliability diagram. For the moderate case, p = 0.25, RPF
and EBC have curves closer to y = x, which leads to the smallest Mean Error,
discussed later. The result for p = 0.50 shows that EBC and RLW have inflection
points at the intersection of f = µx and µx|f = µx. This implies that EBC and
RLW make a positive contribution to the MSE Skill Score over all the forecasts
and have small conditional bias.
The relative measures of Reliability and Resolution for the September monthly
volume forecasts, shown in Figure 5.14, are discussed next. For the 1-month
lead time, all the bias correction methods reduce the Relative Reliability (RREL),
a measure of conditional bias, except at low flows. Note that the RREL remains at
almost the same magnitude as the lead time increases. The Relative Resolution
(RRES) measures how much the expected observations given the forecasts differ
from the mean of the observations. The RRES for the bias correction methods
decreases with increasing lead time, while the RRES for the forecasts without bias
correction retains a downward-convex shape. Since subtracting RREL from RRES
determines the MSE Skill Score, the fact that the bias correction methods give
almost the same RRES as the forecasts without bias correction for the 1-month
lead time, along with much smaller RREL, leads to the improvement in accuracy.
However, in cases where the original forecasts are already well calibrated, bias
correction methods may not improve the Skill Score. The CR decompositions of the
May monthly volume forecasts illustrate the same conclusion as the Skill Score
decompositions (Figure B.1).
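The CR (calibration-refinement) decomposition behind these relative measures can be sketched for binned forecasts. This is a simplification: the report estimates µx|f by kernel density estimation and logistic regression rather than binning, and the identity below is exact only when forecasts are constant within each bin.

```python
import numpy as np

def cr_decomposition(f, x, n_bins=10):
    """Calibration-refinement decomposition of the Brier score:
    MSE = uncertainty + reliability - resolution
    (exact when forecasts are constant within each bin)."""
    f, x = np.asarray(f, float), np.asarray(x, float)
    n = f.size
    mu_x = x.mean()
    uncertainty = mu_x * (1.0 - mu_x)
    bins = np.clip((f * n_bins).astype(int), 0, n_bins - 1)
    reliability = resolution = 0.0
    for b in range(n_bins):
        mask = bins == b
        if not mask.any():
            continue
        w = mask.sum() / n             # relative frequency s(f) of the bin
        mu_x_f = x[mask].mean()        # conditional mean of x given f
        reliability += w * (f[mask].mean() - mu_x_f) ** 2
        resolution += w * (mu_x_f - mu_x) ** 2
    return reliability, resolution, uncertainty

# The relative measures divide by the uncertainty, and the MSE Skill Score
# is (resolution - reliability) / uncertainty, i.e. RRES - RREL.
f = np.array([0.05] * 4 + [0.85] * 4)
x = np.array([0, 0, 0, 1, 1, 1, 1, 0], float)
rel, res, unc = cr_decomposition(f, x)
skill = (res - rel) / unc
```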
5.4.3 LBR Factorization and Decompositions
The distributions r(f|x) and t(x) for nonexceedance probabilities p = 0.05,
0.25, and 0.5 with a 1-month lead time are depicted in Figure 5.15. The forecasts
Figure 5.13: Marginal distribution of the forecasts s(f) and the conditional mean µx|f by five Bias Correction methods, actual (non bias-corrected) streamflow simulation (NBC), and pseudoperfect streamflow simulation (PSS), for 1-month lead time September monthly volume forecasts.
Figure 5.14: CR decompositions by five Bias Correction methods, actual (non bias-corrected) streamflow simulation (NBC), and pseudoperfect streamflow simulation (PSS), for 1, 2, and 3-month lead time September monthly volume forecasts.
are issued for September monthly volume. The conditional distributions r(f|x = 0)
and r(f|x = 1) are estimated with a combination of the kernel density estimation
method and logistic regression, by Equations (3.40) and (3.41).
First, the distributions r(f|x) and t(x) are considered. The closer the
conditional mean µf|x=1 (µf|x=0) is to f = 1 (f = 0), the smaller the Type 2
conditional bias. Examination of r(f|x = 1) for p = 0.05 reveals that RLI, RPF, and
EBC issued forecasts around f = 0.5, whereas NBC, PSS, and QM have more
density near f = 0. As also seen for p = 0.25 and 0.50, NBC clearly issues more
forecasts below 0.5 when the observation is x = 1. On the other hand, NBC
produces forecasts closer to 0 for observations with x = 0. The improvement in
r(f|x = 1), rather than any degradation in r(f|x = 0), under bias correction results
in the smaller Type 2 conditional bias depicted in the upper left of Figure 5.19.
The measures of Type 2 Conditional Bias (TY2) and Discrimination (DIS) for
the probabilistic forecasts, defined in Section 3.2, can be written as

TY2 = t(x = 0)(µf|x=0 − 0)² + t(x = 1)(µf|x=1 − 1)²   (5.10)

DIS = t(x = 0)(µf|x=0 − µf)² + t(x = 1)(µf|x=1 − µf)²   (5.11)
These equations indicate that the Type 2 Conditional Bias measures the differences
between x = i and µf|x=i, averaged with the weights t(x = i) (i = 0, 1), and that
Discrimination measures the differences between µf and µf|x=i, averaged with the
same weights. Figure 5.16 shows the conditional means of the forecasts given the
observations, µf|x, for all the bias correction methods, NBC, and PSS. Comparison
of the distance between µf|x=i and x = i (i = 0 or 1) indicates that NBC has
more (Type 2) conditional bias given the observations x = 1 for low flows.
Figure 5.17 compares the conditional means of EBC and QM with those of NBC.
Larger Discrimination alone does not mean more accurate forecasts; the forecasts
must also be conditionally unbiased. The figure illustrates that the two methods,
QM and EBC, succeed in enlarging the difference between µf|x=1 and µf for low
flows while making µf close to µx.
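Equations (5.10) and (5.11) translate directly into code; the sample forecast and observation values below are hypothetical.

```python
import numpy as np

def ty2_and_dis(f, x):
    """Type 2 Conditional Bias (Eq. 5.10) and Discrimination (Eq. 5.11)
    for probability forecasts f of a binary observation x."""
    f, x = np.asarray(f, float), np.asarray(x, int)
    mu_f = f.mean()
    t1 = (x == 1).mean()            # t(x = 1)
    t0 = 1.0 - t1                   # t(x = 0)
    mu_f_x0 = f[x == 0].mean()      # conditional mean of f given non-event
    mu_f_x1 = f[x == 1].mean()      # conditional mean of f given event
    ty2 = t0 * (mu_f_x0 - 0.0) ** 2 + t1 * (mu_f_x1 - 1.0) ** 2
    dis = t0 * (mu_f_x0 - mu_f) ** 2 + t1 * (mu_f_x1 - mu_f) ** 2
    return ty2, dis

ty2, dis = ty2_and_dis([0.2, 0.4, 0.6, 0.8], [0, 0, 1, 1])
# ty2 = 0.5 * 0.3**2 + 0.5 * 0.3**2 = 0.09
# dis = 0.5 * 0.2**2 + 0.5 * 0.2**2 = 0.04
```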
Next, the relative measures of the LBR decompositions for the September and May
monthly volume forecasts are examined. Examination of Figure 5.18 indicates that
the original forecasting system has no sharpness for extreme low flow events. As
the magnitude of the event increases, the sharpness also increases. The bias correction
Figure 5.15: Conditional distributions of the forecasts r(f|x = 0) (left) and r(f|x = 1) (right) by five Bias Correction methods, actual (non bias-corrected) streamflow simulation (NBC), and pseudoperfect streamflow simulation (PSS), for 1-month lead time September monthly volume forecasts.
Figure 5.16: Conditional mean of the forecasts given the observations µf|x for five Bias Correction methods, actual (non bias-corrected) streamflow simulation (NBC), and pseudoperfect streamflow simulation (PSS), for 1-month lead time September monthly volume forecasts.
Figure 5.17: Conditional mean of the forecasts given the observations µf|x for the EBC and QM bias correction methods with NBC. The forecasts were issued for September monthly volume with 1-month lead time. The three curves for each colour in the two panels show µf|x=1, µf, and µf|x=0 from top to bottom.
Figure 5.18: Relative sharpness by five Bias Correction methods, actual (non bias-corrected) streamflow simulation (NBC), and pseudoperfect streamflow simulation (PSS), for 1, 2, and 3-month lead time September monthly volume forecasts.
methods make the shape of the sharpness curve more symmetrical. The Relative
Type 2 Conditional Bias (RTY2) shows the reverse of the shape of sharpness and
Discrimination (Figure 5.19). For example, the RTY2 of the original forecasts
decreases as the magnitude of the event increases. RTY2 also maintains its shape,
but not its magnitude, as the lead time increases. Increases in lead time result in
dramatic decreases in sharpness and discrimination. All the bias correction methods
improved RTY2 most for low and moderate flow events, which leads to the
improvement in the MSE Skill Score. As for the May monthly volume, the LBR
decompositions do not show much improvement by any bias correction method
(Figure B.2).
5.4.4 Results for All Months
To better characterize the bias correction methods, the performance measures
and the CR and LBR decompositions are calculated with verification datasets for
all months (N = 576). Since this sample size is much larger than the one
Figure 5.19: LBR decompositions by five Bias Correction methods, actual (non bias-corrected) streamflow simulation (NBC), and pseudoperfect streamflow simulation (PSS), for 1, 2, and 3-month lead time September monthly volume forecasts.
for each month (N = 48), the estimated DO measures have smaller sampling
variability. However, the assumption that all pairs of observations and forecasts
obey the same joint distribution is likely invalid. Lead times of 1, 3, and 6 months
are considered.
The right panels of Figure 5.20 indicate that the forecasts without bias cor-
rection have more unconditional bias for low and moderate flow events than for
high flow events, in terms of the contribution to the Skill Score. All the bias
correction methods improve the unconditional bias for low and moderate flow
events. According to the left panels of Figure 5.20, at lead times of 3 months and
longer, all the bias correction methods tend to underestimate the occurrence of low
flow events (p ≤ 0.5) and overestimate the occurrence of high flow events (p ≥ 0.5).
This tendency seems to stem from the bias in the hydrological model, which tends
to underestimate high streamflow volumes and overestimate low streamflow
volumes. Still, since PSS, which has no hydrological model bias, also shows this
tendency, the estimation of the forecast conditional distribution Gt(y) itself might
have a problem.
The Relative Reliability (RREL) is reduced, especially for moderate flows,
and the Relative Resolution (RRES) is maintained at the same or a higher level
than the original forecasts achieve (Figure 5.21). The ordering of the methods by
RREL is almost the same from the 1-month through the 6-month lead time. For
extreme low flow events, RPF gives the worst Reliability among the methods at a
1-month lead time. It is speculated from Figure 5.4 that the power function does
not fit low flows well. Thus, the failure to improve the RREL for moderate and low
flows is the main reason why RPF obtains the worst Skill Score in this range. As
the lead time increases, the RRES of all the methods decreases.
QM has the smallest Relative Type 2 Conditional Bias (RTY2) and the largest
Relative Discrimination (RDIS) overall, whereas EBC issues forecasts with rel-
atively large RTY2 and the smallest RDIS (Figure 5.22). Since RDIS and Relative
Sharpness (RS) are closely related, RS shows almost the same characteristics of the
bias correction methods as RDIS (Figure 5.23). The other regression-type methods
fall between QM and EBC, although the regression-type methods tend to have
poorer sharpness for low flow events than EBC. As for the change in the measures
with lead time, the LBR decompositions show more dramatic decreases than the
CR decompositions. Note that forecasts with large RS need large RDIS to achieve
a high Skill Score; in other words, a small RDIS is enough for forecasts with small
RS to achieve the same Skill Score. This is why RLI,
Figure 5.20: Mean Error and unconditional bias from the decomposition of Skill Score by five Bias Correction methods, actual (non bias-corrected) streamflow simulation (NBC), and pseudoperfect streamflow simulation (PSS), for all the months with 1, 3, and 6-month lead times.
Figure 5.21: CR decompositions by five Bias Correction methods, actual (non bias-corrected) streamflow simulation (NBC), and pseudoperfect streamflow simulation (PSS), for all the months with 1, 3, and 6-month lead times.
Figure 5.22: LBR decompositions by five Bias Correction methods, actual (non bias-corrected) streamflow simulation (NBC), and pseudoperfect streamflow simulation (PSS), for all the months with 1, 3, and 6-month lead times.
EBC, and QM obtain almost the same Skill Score (Figure 5.24).
For this experimental forecasting system, it is clear that using the bias cor-
rection methods is better than doing nothing (Figure 5.24). The exception is RPF,
which produces poorer skill than NBC for low flow events. For the 1-month lead
time, RLI gives the best accuracy, and the second best is EBC. Beyond the
2-month lead time, QM becomes the best correction method in terms of accuracy,
Figure 5.23: Relative sharpness by five Bias Correction methods, actual (non bias-corrected) streamflow simulation (NBC), and pseudoperfect streamflow simulation (PSS), for all the months with 1, 3, and 6-month lead times.
and RLW and RLI show about the same accuracy. On the other hand, RPF keeps
giving less accuracy than NBC for low flows. Among the regression-type methods,
RLI is the best. This might be because RLI has some favorable features that take
into account the multiplicative bias for the meteorological event and the magnitude
of the simulated volume. In addition, RLI is much easier to implement than the
other two methods, RLW and RPF. Note that the MSE Skill Score drops at the
extreme high and low quantiles. Another important point is that the forecasts for
low flow events tend to be more accurate than those for high flow events. The Skill
Score for the 1-month lead time with PSS indicates the room for further
improvement achievable by bias correction methods. It should be noted that the
potential skill is also improved by the bias correction methods.
5.5 Summary and Conclusions
Three types of bias correction methods were applied to the ensemble volumes produced by the experimental forecasting system. The first is the Event-Bias Correction method (EBC): the multiplicative bias between the observed volume and the simulation driven by the historical meteorological input is first obtained, and this bias is then used to correct the ensemble volume simulated with the same historical meteorological event. The second type is a regression method, comprising linear interpolation between corresponding values of the historical simulation and observation (RLI), and power function (RPF) and LOWESS (locally weighted scatterplot smoothing) (RLW) regressions fitted to the scatter plot of observed flow against historical simulation; each ensemble volume is replaced by the expected observed volume given the simulated volume. The third is the Quantile Mapping method (QM): each ensemble volume is corrected so that the observed-flow distribution assigns the corrected volume the same cumulative relative frequency that the historical-simulation distribution assigns the original volume.
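As a concrete illustration of the last of these, the quantile mapping step can be sketched with empirical CDFs. This is a hypothetical minimal implementation, not the report's code; Weibull plotting positions are assumed, since the report does not specify them here:

```python
import numpy as np

def quantile_map(ensemble, hist_sim, hist_obs):
    """Quantile Mapping (QM) sketch: each ensemble volume is replaced by the
    observed volume having the same nonexceedance probability that the
    ensemble volume has under the historical-simulation empirical CDF."""
    hist_sim = np.sort(np.asarray(hist_sim, float))
    hist_obs = np.sort(np.asarray(hist_obs, float))
    n = len(hist_sim)
    p = np.arange(1, n + 1) / (n + 1.0)       # assumed Weibull plotting positions
    p_ens = np.interp(ensemble, hist_sim, p)  # nonexceedance prob. of each member
    return np.interp(p_ens, p, hist_obs)      # invert the observed-flow CDF
```

For example, if the historical simulation is uniformly biased high by a factor of two, QM maps each ensemble member back to half its value.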
In the investigation, some problems were found with the continuous approaches LRM and KDM. In LRM, the kernel density estimation method is used to estimate the marginal distribution of forecasts s(f). However, in situations where σ_f = 0 the optimum bandwidth cannot be estimated, which leads to failure in estimating s(f). This occurs for extreme low or high quantiles with small sample sizes. KDM has the same problem in estimating the conditional distribution of forecasts given observations, r(f|x). Another problem is that the logistic regression may produce unreasonable estimates of the conditional mean µ_{x|f} for forecasts of extreme
[Figure 5.24 panels: MSE Skill Score and squared correlation vs. nonexceedance probability p for 1-, 3-, and 6-month lead times; legend: PSS, NBC, RLI, RPF, RLW, EBC, QM.]

Figure 5.24: MSE Skill Score and potential skill by five Bias Correction methods, actual (non bias-corrected) streamflow simulation (NBC), and pseudoperfect streamflow simulation (PSS), for all the months with 1, 3, and 6-month lead times.
low or high flow events. The pseudoperfect streamflow simulation (PSS) was used to examine forecast quality without hydrological model bias. It also showed a tendency to overestimate the occurrence of low flow events and underestimate the occurrence of high flow events. Hence, the unconditional biases may stem from G_t(y).
Examination of the MSE Skill Score for the forecasts without bias correction indicates monthly variation. However, the MSE Skill Score for the historical simulation showed a monthly variation different from that of the probabilistic forecasts. Therefore, it is dangerous to simply assume that months with high relative accuracy in the historical simulation also have high relative accuracy in the probabilistic forecasts.
The DO measures vary distinctly as lead time increases. The Type 2 conditional bias increases more rapidly than the Bias and the Reliability (Type 1 conditional bias), which remain fairly constant with lead time. The Discrimination decreases more dramatically than the Resolution as lead time increases. The association ρ²_{fx} for forecasts with longer lead times is less affected by the bias in the hydrological model.
The characteristics of the bias correction methods were investigated. The decompositions of the Skill Score reveal that all the bias correction methods achieve a better skill score mostly by reducing the conditional bias (Reliability) and the unconditional bias (Mean Error). Therefore, if the model is well calibrated, not much improvement can be obtained. Surprisingly, in some cases the bias correction methods also improve the association. Reducing the Reliability while maintaining the Resolution is one of the characteristics of the bias correction methods. As for the LBR decompositions, the improvement in Skill Score by the bias corrections is achieved mostly by reducing the Type 2 conditional bias. From the estimated distribution s(f) and the relative sharpness, it can be seen that the bias correction methods shift density from 0 toward the middle, although this does not appear at all thresholds. The clearest characteristic of the bias correction methods is that the ensembles corrected by EBC give the lowest sharpness over all the quantiles, whereas QM gives the highest sharpness and discrimination; the regression-type methods fall between these two. RLI gives the best improvement in accuracy for a 1-month lead time, while beyond a 2-month lead time the QM method has the best accuracy. RPF gives lower accuracy than the original ensembles for the low flow quantiles. RLW and RLI produce almost the same accuracy beyond a 2-month lead time. Hence, it is clear that the DO approach is a useful, sound framework for assessing bias correction methods for probabilistic streamflow forecasts.
CHAPTER 6
SUMMARY AND CONCLUSIONS
One objective of this research is to extend the Distributions-Oriented (DO) approach to the verification of probability distribution forecasts of streamflow. The Advanced Hydrologic Prediction Services (AHPS) forecasts from the National Weather Service (NWS) utilize the idea of Extended Streamflow Prediction (ESP). First, the hydrological model embedded in the forecasting system produces ensemble traces (different realizations of future streamflow) by inputting historical meteorological information. Statistical analysis of the ensemble volumes then produces probability distribution forecasts. Verifying probability distribution forecasts is a problem, since they contain a probabilistic forecast for every possible outcome. One solution was proposed: consider the discrete event that a forecast variable is less than or equal to a threshold. By setting up the threshold, one probabilistic forecast is derived from the probability distribution forecast as a nonexceedance probability, and the corresponding continuous observation is converted into 0 (no occurrence) or 1 (occurrence). This pair of probabilistic forecast and discrete observation becomes part of the verification dataset, and many verification datasets are obtained by setting up many thresholds. Investigation of this set of verification datasets was considered equivalent to examination of the forecast quality of the probability distribution forecast.
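The recoding step described above can be sketched as follows; ensemble traces stand in for the probability distribution forecast, and the function name is illustrative only:

```python
import numpy as np

def recode(ensembles, observations, threshold):
    """For one threshold, derive the probabilistic forecast of the event
    {volume <= threshold} from each ensemble, and convert the continuous
    observation into 0 (no occurrence) or 1 (occurrence)."""
    f = np.array([np.mean(np.asarray(ens) <= threshold) for ens in ensembles])
    x = (np.asarray(observations) <= threshold).astype(int)
    return f, x  # one verification dataset; repeat over many thresholds
```

Sweeping the threshold over many values yields the family of verification datasets used throughout the report.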
6.1 Distributions-Oriented Methods for Small Verification Datasets
The verification of streamflow forecasts suffers from small sample sizes. Since applying the DO approach is equivalent to estimating the joint distribution of forecasts and observations, actual implementation of the DO approach for streamflow forecasts faces serious estimation problems related to small samples. The difficulty in estimation is expressed by the dimensionality D, defined as the number of degrees of freedom needed to estimate the joint distribution of forecasts and observations. For instance, if probabilistic forecasts are issued from 0 to 1 at 0.1 intervals for dichotomous observations, D is equal to 11 × 2 − 1 = 21.
However, in the case of May-September seasonal volume forecasts, the sample size available to estimate the joint distribution could be around 50, because many gaging stations have a short period of record and events occur once a year. In order to reduce the dimensionality D, a continuous approach was introduced. All the measures except the CR decompositions were derived from six basic statistics; the CR decompositions required estimation of the integral

$$\int_0^1 \mu_{x|f}^2\, s(f)\, df,$$

where µ_{x|f} is the conditional mean of observations given forecasts and s(f) denotes the marginal distribution of forecasts. Three methods, LRM, KDM, and CM, were considered to estimate the integral: LRM uses the arithmetic mean with logistic regression, KDM uses numerical integration with kernel density estimation (a nonparametric method), and CM uses numerical integration with both logistic regression and kernel density estimation. Compared with an 11-bin contingency-table model (the discrete approach, DSC), the continuous approaches reduce the dimensionality by about one-third to two-thirds (D = 9 or 7, versus 21).
Three Monte Carlo experiments were carried out to investigate the three continuous approaches. One experiment used an analytical model for the joint distribution, another used a stochastic model of the streamflow forecasting system, and the third used a discrete joint distribution model. Verification datasets were generated to see how the estimated measures of forecast quality vary with the number of forecast-observation pairs, which was varied from 50 to 1000. It turned out that the continuous approach with LRM is the best estimator of the CR decompositions for small samples, whether the forecasts are issued as discrete or continuous numbers. LRM is also the best estimator for forecasts issued for extreme events with small sample sizes. KDM is likewise a better estimator than DSC for small samples, and works better than LRM for forecasts issued for moderate events. A reason for the improvement over DSC is that the continuous approaches impose some structure on the estimation of the marginal distribution of forecasts or the conditional distribution of forecasts given observations.
6.2 Assessment of Bias Correction Methods for Ensemble Forecasts
The second objective of this research is to demonstrate the usefulness of the DO approach in assessing the quality of streamflow forecasts. The forecast of interest is the probabilistic forecast of monthly streamflow volume observed at Stratford on the Des Moines River. Three different types of bias correction methods are applied to the ensemble volumes. The Event-Bias Correction method (EBC) expects an ensemble volume simulated with the i-th year's meteorological conditions to have the same bias as the i-th year's historical simulation, regardless of the magnitude of the ensemble volume. A regression method replaces an ensemble volume with the expected observation given that volume; the regression is obtained from observations and corresponding historical simulations, and Linear Interpolation (RLI), Power Function (RPF), and LOWESS (RLW) forms are considered. The Quantile Mapping method (QM) corrects an ensemble volume based on the cumulative distributions of observations and historical simulations, so that the ensemble volume and the corrected volume have the same nonexceedance probability.
In the investigation, the major characteristic common to the bias correction methods was shown by the decompositions of the Skill Score: all the bias correction methods achieve better skill scores mostly by reducing the conditional bias (Reliability) and the unconditional bias (Mean Error). Therefore, if the model is well calibrated, not much improvement may be obtained. It is remarkable that in some cases the bias correction methods also improve the association. Reducing the Reliability while maintaining the Resolution is another characteristic of the bias correction methods. As for the LBR decompositions, the improvement in Skill Score by the bias corrections is achieved mostly by reducing the Type 2 conditional bias. A distinct characteristic of the bias correction methods is that the forecasts modified by EBC tend to have the lowest sharpness and discrimination over all the quantiles, whereas QM tends to give the highest sharpness and discrimination; the regression-type methods fall between these two. Thus, the DO approach enabled us to obtain insights into the forecasts produced by the various bias correction methods. Some problems were found with the continuous approaches LRM and KDM when the forecasts were issued for extreme low or high quantiles with small sample sizes. For example, an event in which the monthly volume is at or below the 0.05 quantile will fail to occur in 47 or 48 out of 50 samples, and a forecasting system with moderate skill could then issue a probability of 0 for 49 or all of the events. In these cases, the logistic regression and the kernel density estimation cannot estimate µ_{x|f} properly.
6.3 Future Study and Remarks
The first objective of this research was to extend the DO approach to the verification of probability distribution forecasts. We suggested recoding the probability distribution forecasts by setting up many thresholds. Another possibility for assessing probability distribution forecasts is the continuous ranked probability score (CRPS), which is the integral of the Brier score over all possible threshold values (Hersbach, 2000). However, since the CRPS is a single scalar, it cannot convey how forecast quality varies over the range of possible outcomes. Although the CR decomposition of the CRPS has been derived by Hersbach (2000), a decomposition corresponding to the LBR decomposition of the Brier score should also be derived in order to examine forecast quality given the observations. The score proposed by Wilson et al. (1999), which takes the form of the probability of occurrence of the observation given the EPS distribution, would be useful for seeing how bias correction methods or the incorporation of climate forecasts change the distribution of ensemble volumes.
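To make the threshold-integral view of the CRPS concrete, the score for an ensemble forecast can be approximated by integrating the squared difference between the forecast CDF and the observation step function over a threshold grid. This is an illustrative sketch (trapezoidal quadrature on a user-supplied grid), not Hersbach's estimator:

```python
import numpy as np

def crps_ensemble(ensemble, obs, thresholds):
    """Approximate CRPS as the Brier score integrated over thresholds y:
    integral of {F(y) - H(y - obs)}^2 dy, with F the ensemble CDF and H
    the step function of the observation."""
    ensemble = np.asarray(ensemble, float)
    F = np.array([np.mean(ensemble <= y) for y in thresholds])  # forecast CDF
    H = (thresholds >= obs).astype(float)                        # observation step
    sq = (F - H) ** 2                                            # Brier score at y
    return np.sum(0.5 * (sq[1:] + sq[:-1]) * np.diff(thresholds))
```

For a forecast concentrated at a single value m, the score reduces to |m − obs|, which makes the CRPS's interpretation as a generalized absolute error visible.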
An important point in applying the verification framework with the DO approach is that it assumes stationarity of the streamflow and the forecasts, and no serial correlation in the streamflow or forecast time series. The joint distribution includes all of the non-time-dependent information relevant to forecast verification (Murphy, 1991). In other words, this framework cannot measure how aspects of forecast quality evolve over time. This limitation could motivate extensions of the DO approach. For example, where the observations are not stationary, they might be divided into stationary groups before applying the DO approach. If daily streamflow volume forecasts are considered, the serial correlation cannot be ignored. However, even if nonstationarity or serial correlation were successfully detected, another large problem would remain: a more complex model of the relationship between forecasts and observations would have even higher dimensionality than the joint distribution model studied here.
The problem of boundary effects with the kernel density estimation method is still under research, and this work did not utilize the best technique now available. However, the kernel density estimation method is clearly more flexible than logistic regression; moreover, it avoids the error introduced by choosing an improper distribution function. Innovations in kernel density estimation techniques should be followed and adapted into the forecast verification methodology.
This research used an experimental forecasting system that utilizes ensemble volumes produced with all the meteorological information, regardless of the forecast date. In reality, ESP can only utilize meteorological information recorded before the forecast date to produce the ensemble traces. The effect of the number of available ensemble traces on the skill of the forecasting system should be investigated.
In the assessment of bias correction methods, the MSE Skill Score for the forecasts before bias correction was examined over all the months. In addition, the historical simulation and the observations (used to develop the bias-correction functions) were compared to estimate the MSE Skill Score for the simulation period. Note that the MSE Skill Score for the probabilistic forecasts showed monthly variations different from those for the historical simulation. One possible area of investigation would be to use the DO approach to examine the joint distribution of historical simulations and observations more closely. Further comparisons may indicate how the quality of the model predictions is related to the forecast quality of the probabilistic forecasts from ESP.
How best to incorporate climate forecast information into ESP is another issue to be studied. In this research, equal weighting of the ensemble traces was used; no climate forecast information was utilized. Although the use of climate information has been investigated (e.g., Croley II, 2000; Perica, 1998), it is not clear that these methods for incorporating weather or climate forecasts are truly beneficial or effective in improving streamflow forecasts. As was demonstrated for the bias-correction methods, application of the extended DO approach would be useful for evaluating alternative approaches to using climate information in streamflow forecasting.
APPENDIX A
STATISTICAL METHODS
This appendix describes the three statistical methods (LRM, KDM, and CM) for estimating the integral

$$\int_0^1 \mu_{x|f}^2\, s(f)\, df.$$

These methods reduce the dimensionality of the verification problem. The traditional discrete approach with a contingency table (DSC) is also described.
A.1 Logistic Regression Method
The logistic regression for one explanatory variable, the probabilistic forecast f, is expressed with two parameters β0 and β1 as

$$\ln\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 f \qquad (A.1)$$
where π is the probability that the discretized observation X is equal to 1. Estimation of the two parameters is as follows. Since the probability that X is equal to 0 or 1 can be written as

$$\Pr(X = x) = \frac{\exp\{x(\beta_0 + \beta_1 f)\}}{1 + \exp(\beta_0 + \beta_1 f)}, \qquad x = 0 \text{ or } 1, \qquad (A.2)$$

the likelihood function can be formulated as

$$\Pr(X_1 = x_1, X_2 = x_2, \cdots, X_N = x_N) = \frac{\exp\left\{\sum_{i=1}^{N} x_i(\beta_0 + \beta_1 f_i)\right\}}{\prod_{i=1}^{N}\left\{1 + \exp(\beta_0 + \beta_1 f_i)\right\}}. \qquad (A.3)$$
Then the logarithm of the above equation is

$$F(\beta_0, \beta_1) = \ln \Pr = \sum_{i=1}^{N}\left[x_i(\beta_0 + \beta_1 f_i) - \ln\{1 + \exp(\beta_0 + \beta_1 f_i)\}\right]. \qquad (A.4)$$

To maximize this log-likelihood function, the Levenberg-Marquardt optimization method is used to minimize −F.
The estimator of the conditional mean µ_{x|f} for a sample i can be written with the two parameters:

$$\mu_{x|f_i} = \Pr(X_i = 1) = \frac{\exp(\beta_0 + \beta_1 f_i)}{1 + \exp(\beta_0 + \beta_1 f_i)}. \qquad (A.5)$$

Figure A.1 illustrates an example of logistic regression applied to pairs of forecasts and observations generated by the analytical model for the joint distribution developed in Section 4.2. The estimates of µ²_{x|f_i} corresponding to each forecast f_i are obtained, and the sample average (Equation (3.36)) is used to estimate the integral. Since the marginal distribution of the forecasts s(f) is graphically informative, it is estimated directly by the kernel density estimation method explained later. Therefore, the dimensionality of LRM is D = 9, since 6 basic statistics, 2 logistic regression parameters, and 1 kernel density estimation parameter have to be estimated.

[Figure A.1: logistic regression curve fitted through pairs of forecasts and 0/1 observations.]

Figure A.1: Example of logistic regression applied to the pairs of forecasts and observations.
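A minimal sketch of the LRM estimation follows. Newton-Raphson iteration stands in for the Levenberg-Marquardt routine used in the report, and the data-generating line in any usage is illustrative:

```python
import numpy as np

def fit_logistic(f, x, iters=25):
    """Maximize the log-likelihood F(beta0, beta1) of Equation (A.4) by
    Newton-Raphson (a stand-in for Levenberg-Marquardt)."""
    X = np.column_stack([np.ones_like(f), f])
    beta = np.zeros(2)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))            # current mu_{x|f_i}
        H = X.T @ (X * (p * (1 - p))[:, None])           # negative Hessian of F
        beta = beta + np.linalg.solve(H, X.T @ (x - p))  # Newton step
    return beta

def integral_lrm(f, x):
    """Estimate int_0^1 mu_{x|f}^2 s(f) df by the sample average of
    mu_{x|f_i}^2 over the observed forecasts (Equation (3.36))."""
    b0, b1 = fit_logistic(f, x)
    mu = 1.0 / (1.0 + np.exp(-(b0 + b1 * f)))
    return np.mean(mu ** 2)
```

For reliable forecasts (where µ_{x|f} ≈ f and f is uniform), the estimated integral should fall near E[f²] = 1/3.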
A.2 Kernel Density Estimation Method
The basic kernel density estimator is written as

$$\hat{f}(x) = \frac{1}{nh}\sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right) \qquad (A.6)$$

where h denotes the bandwidth, n is the number of samples, and K(·) is a kernel. The biweight kernel is utilized in this research:

$$K(t) = \frac{15}{16}(1 - t^2)^2, \qquad |t| \le 1. \qquad (A.7)$$
[Figure A.2 panels: PDF and CDF of r(f|x = 0) and r(f|x = 1) by kernel estimation, plotted over probabilistic forecast values from −1 to 2.]

Figure A.2: Unbounded estimation with the biweight kernel.
The optimum bandwidth is obtained as follows. Equivalent bandwidth scaling is often used to convert the bandwidth obtained from a Normal-kernel cross-validation rule into that for another kernel (Scott 1992, p. 142). Here, the bandwidth is determined through a Normal reference rule that minimizes the asymptotic mean integrated squared error (AMISE) between the normal distribution and the normal kernel estimate:

$$h^{*} = (4/3)^{1/5}\, \sigma\, n^{-1/5}. \qquad (A.8)$$

Finally, h* is multiplied by the equivalent bandwidth scaling factor for a biweight kernel, 2.623, to produce h.
As an example, fifty pairs of forecasts and observations, produced by the stochastic model of the streamflow forecasting system explained in Section 4.3, are used for kernel density estimation. Figure A.2 shows the kernel density estimate without any consideration of the boundaries f = 0 and f = 1. The difficulty with kernel estimation arises at these boundaries: the estimate at x = ch or x = 1 − ch, for 0 ≤ c < 1, is not necessarily a consistent estimate of f(x). This is known as the boundary effect. Some well-known methods to deal with this problem are the reflection method, the boundary kernel method, the boundary kernel method implicit in local linear fitting, the transformation method, and the pseudodata method (e.g., see Zhang et al. 1999). In the following, two of these simple methods are examined.
First, the boundary kernel method is applied. The kernel is designed from the biweight kernel so as to eliminate the O(h) term in the bias (Scott 1992, p. 146):

$$K_c(t) = \frac{3}{4}\left[(c+1) - \frac{5}{4}(1+2c)(t-c)^2\right]\left[t - (c+2)\right]^2 I_{[c,\,c+2]}(t). \qquad (A.9)$$

In the range x_i ∈ [0, h), the kernel K_c with c = (0 − x_i)/h is used instead of the ordinary biweight kernel. For the other boundary at x = 1, the distance from x = 1 is measured in the negative direction for the samples x_i ∈ (1 − h, 1], and the above kernel is applied with the remeasured distance. However, this kernel can produce negative values, as shown in Figure A.3, which is a known drawback of the method.
The second method is the reflection method. This method literally reflects the original samples about each boundary and then applies the ordinary kernel to the augmented sample. The final estimate is obtained by tripling the estimates on 0 ≤ x ≤ 1, since the original samples have been tripled (see Scott 1992). Figure A.4 illustrates the limitation of this method: it is specially designed for the case where the first derivative of f at the boundary is 0.
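A minimal sketch of the reflection estimator on [0, 1], using the biweight kernel and the Normal reference bandwidth of Equation (A.8) scaled by 2.623; the function name and the test data are illustrative:

```python
import numpy as np

def biweight_kde_reflect(samples, grid):
    """Biweight-kernel density estimate on [0, 1] with reflection: samples
    are mirrored about 0 and 1, the ordinary estimate is formed from the
    tripled sample, and the result on [0, 1] is tripled."""
    s = np.asarray(samples, float)
    n = len(s)
    # Normal reference bandwidth (A.8), scaled for the biweight kernel
    h = 2.623 * (4.0 / 3.0) ** 0.2 * np.std(s) * n ** (-0.2)
    data = np.concatenate([s, -s, 2.0 - s])       # reflect at both boundaries
    t = (np.asarray(grid)[:, None] - data[None, :]) / h
    K = np.where(np.abs(t) <= 1.0, (15.0 / 16.0) * (1.0 - t ** 2) ** 2, 0.0)
    # ordinary estimate from the 3n samples, then tripled on [0, 1]
    return 3.0 * K.sum(axis=1) / (3.0 * n * h)
```

For a density that is flat at the boundaries, such as the uniform, the reflected estimate remains consistent right up to f = 0 and f = 1, and the estimated density still integrates to one over [0, 1].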
Zhang et al. (1999) have proposed a new method for mitigating the boundary effect, combining the pseudodata, transformation, and reflection methods. Since kernel density estimation is still under development, it is reasonable for this research to use the simple reflection method to illustrate the usefulness of kernel density estimation in the verification problem, specifically the reduction of dimensionality.
The estimated conditional distribution r(f|x), the marginal distribution s(f), and the conditional mean µ_{x|f} are connected by Equations (3.37) and (3.39). The numerical integral form of Equation (3.34) can be given as

$$\int_0^1 \mu_{x|f}^2\, s(f)\, df \approx \sum_t \frac{1}{2}\left\{\mu_{x|f=t}^2\, s(f=t) + \mu_{x|f=t+\Delta t}^2\, s(f=t+\Delta t)\right\}\Delta t \qquad (A.10)$$
[Figure A.3 panels: PDF and CDF of r(f|x = 0) and r(f|x = 1) by kernel estimation with the boundary kernel; the PDFs dip below zero near the boundaries.]

Figure A.3: Bounded estimation with floating boundary kernel.
[Figure A.4 panels: PDF and CDF of r(f|x = 0) and r(f|x = 1) by kernel estimation with the reflection technique, before and after restriction to [0, 1].]

Figure A.4: Bounded estimation with biweight kernel and reflection boundary technique.
[Figure A.5 panel: estimated s(f) versus probabilistic forecast, sample size 50.]

Figure A.5: Example of the kernel density estimation method applied to forecasts to estimate the marginal distribution s(f).
where t takes values from 0.0 to 0.999 with increment ∆t = 0.001. This method has dimensionality D = 7, since the kernel density estimation method has one parameter, the bandwidth.
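The quadrature in Equation (A.10) is the ordinary trapezoidal rule; a small sketch with hypothetical callables standing in for the estimated µ_{x|f} and s(f):

```python
import numpy as np

def integral_cr(mu, s, dt=0.001):
    """Trapezoidal evaluation of int_0^1 mu_{x|f}^2 s(f) df on the grid
    t = 0, dt, 2*dt, ..., 1 (Equation (A.10))."""
    t = np.linspace(0.0, 1.0, int(round(1.0 / dt)) + 1)
    y = mu(t) ** 2 * s(t)
    return np.sum(0.5 * (y[1:] + y[:-1]) * dt)
```

With perfectly reliable forecasts (µ_{x|f} = f) and uniform s(f), the integral is E[f²] = 1/3, which the quadrature reproduces to high accuracy.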
A.3 Combination Method
The marginal distribution of the forecasts s(f) is estimated directly by the
kernel density estimation method in the same manner as the one for the condi-
tional distribution r(f |x). Figure A.5 shows an example of s(f) estimated with
the forecasts generated by the analytical model for joint distribution in Section 4.2.
With µx|f estimated by the logistic regression, the integral of CR decompositions
is numerically estimated by Equation (A.10). The dimensionality of this method is
D = 9, which is the same as LRM.
A.4 Contingency Table Approach
The forecasts originally issued as continuous numbers are converted into 11 discrete values, {f | 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0}, by rounding to the nearest discrete value. In the case of the analysis with the discrete joint distribution in Section 4.4, one more discrete value, 0.05, is added to correspond to the 12 original discrete forecast values. Thus, the joint distribution has 11 (or 12) × 2 probabilities to be estimated, and the dimensionality is D = 11 (or 12) × 2 − 1 = 21 (or 23). All the measures of forecast quality are calculated from the joint relative frequencies based on their definitions.
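The discretization step can be sketched as follows; the function name is illustrative, and the 12-value variant with 0.05 described above is omitted for brevity:

```python
import numpy as np

def joint_relative_frequency(f, x):
    """DSC sketch: round each forecast to the nearest of the 11 values
    0.0, 0.1, ..., 1.0 and tabulate the joint relative frequency with the
    dichotomous observation (an 11 x 2 contingency table, D = 21)."""
    bins = np.round(np.asarray(f, float) * 10.0).astype(int)  # bin index 0..10
    table = np.zeros((11, 2))
    for b, xi in zip(bins, x):
        table[b, int(xi)] += 1.0
    return table / len(bins)  # joint relative frequencies, summing to 1
```

All DO measures of forecast quality can then be computed from this table by their definitions.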
APPENDIX B
SELECTED FIGURES AND TABLES
[Figure B.1 panels: Reliability and Resolution vs. nonexceedance probability p, May flow with 1-month lead time (N=48); legend: PSS, NBC, RLI, RPF, RLW, EBC, QM.]

Figure B.1: CR decompositions by five Bias Correction methods, actual (non bias-corrected) streamflow simulation (NBC), and pseudoperfect streamflow simulation (PSS), for 1-month lead time May monthly volume forecasts.
[Figure B.2 panels: Type 2 Conditional Bias, Discrimination, and Relative Sharpness vs. nonexceedance probability p, May flow with 1-month lead time (N=48); legend: PSS, NBC, RLI, RPF, RLW, EBC, QM.]

Figure B.2: LBR decompositions by five Bias Correction methods, actual (non bias-corrected) streamflow simulation (NBC), and pseudoperfect streamflow simulation (PSS), for 1-month lead time May monthly volume forecasts.
Table B.1: BIAS in REL/σ_x^2 for the forecasts generated for nonexceedance probability p = 0.25 by the analytical model of the joint distribution.
Size DSC LRM KDM CM
50 9.172e-002 4.930e-002 −1.273e-002 1.122e-001
100 4.381e-002 2.699e-002 −2.235e-002 7.343e-002
200 2.203e-002 2.071e-002 -2.074e-002 5.426e-002
400 9.793e-003 1.552e-002 -1.892e-002 4.019e-002
600 4.612e-003 1.293e-002 -1.830e-002 3.371e-002
800 3.413e-003 1.289e-002 -1.622e-002 3.116e-002
1000 2.596e-003 1.269e-002 -1.476e-002 2.943e-002
Note: the underlined value is the closest to 0 in the row.
Table B.2: Standard deviation in REL/σ_x^2 for the forecasts generated for nonexceedance probability p = 0.25 by the analytical model of the joint distribution.
Size DSC LRM KDM CM
50 6.459e-002 6.378e-002 5.406e-002 8.395e-002
100 4.227e-002 4.051e-002 3.763e-002 4.843e-002
200 3.098e-002 2.776e-002 2.723e-002 3.055e-002
400 2.275e-002 2.037e-002 2.098e-002 2.109e-002
600 1.753e-002 1.573e-002 1.629e-002 1.643e-002
800 1.475e-002 1.340e-002 1.394e-002 1.377e-002
1000 1.405e-002 1.257e-002 1.318e-002 1.283e-002
Note: the underlined value is the smallest in the row.
Table B.3: BIAS in RES/σ_x^2 for the forecasts generated for nonexceedance probability p = 0.25 by the analytical model of the joint distribution.
Size DSC LRM KDM CM
50 6.917e-002 3.485e-002 −2.718e-002 9.772e-002
100 3.285e-002 2.171e-002 -2.764e-002 6.814e-002
200 1.585e-002 1.991e-002 -2.153e-002 5.346e-002
400 4.027e-003 1.448e-002 -1.996e-002 3.915e-002
600 -3.946e-003 9.323e-003 −2.191e-002 3.011e-002
800 -3.037e-003 1.106e-002 -1.805e-002 2.933e-002
1000 -4.020e-003 1.056e-002 -1.689e-002 2.731e-002
Note: the underlined value is the closest to 0 in the row.
Table B.4: Standard deviation in RES/σ_x^2 for the forecasts generated for nonexceedance probability p = 0.25 by the analytical model of the joint distribution.
Size DSC LRM KDM CM
50 1.653e-001 1.727e-001 1.609e-001 1.851e-001
100 1.138e-001 1.165e-001 1.121e-001 1.207e-001
200 8.534e-002 8.430e-002 8.267e-002 8.506e-002
400 6.129e-002 5.981e-002 5.961e-002 5.998e-002
600 5.097e-002 5.030e-002 4.996e-002 5.056e-002
800 4.209e-002 4.174e-002 4.159e-002 4.181e-002
1000 3.969e-002 3.866e-002 3.876e-002 3.865e-002
Note: the underlined value is the smallest in the row.
Table B.5: BIAS in REL/σ_x^2 for the forecasts generated for nonexceedance probability p = 0.05 by the analytical model of the joint distribution.
Size DSC LRM KDM CM
50 2.166e-001 1.260e-001 1.711e-001 5.536e-001
100 1.477e-001 7.242e-002 1.436e-001 1.303e-001
200 9.639e-002 4.265e-002 1.041e-001 5.722e-002
400 5.795e-002 2.948e-002 7.235e-002 3.979e-002
600 4.206e-002 2.473e-002 5.860e-002 3.341e-002
800 3.410e-002 2.314e-002 5.183e-002 3.084e-002
1000 2.793e-002 2.163e-002 4.540e-002 2.849e-002
Note: the underlined value is the closest to 0 in the row.
Table B.6: Standard deviation in REL/σ_x^2 for the forecasts generated for nonexceedance probability p = 0.05 by the analytical model of the joint distribution.
Size DSC LRM KDM CM
50 1.462e-001 1.087e-001 1.732e-001 1.371e+000
100 8.445e-002 5.513e-002 8.784e-002 4.518e-001
200 5.068e-002 2.886e-002 4.715e-002 3.296e-002
400 3.087e-002 1.593e-002 2.845e-002 1.765e-002
600 2.339e-002 1.141e-002 2.021e-002 1.248e-002
800 1.927e-002 9.468e-003 1.690e-002 1.036e-002
1000 1.698e-002 7.810e-003 1.507e-002 8.339e-003
Note: the underlined value is the smallest in the row.
Table B.7: BIAS in RES/σ_x^2 for the forecasts generated for nonexceedance probability p = 0.05 by the analytical model of the joint distribution.
Size DSC LRM KDM CM
50 1.944e-001 1.136e-001 1.587e-001 5.412e-001
100 1.240e-001 5.738e-002 1.286e-001 1.153e-001
200 8.001e-002 3.608e-002 9.749e-002 5.066e-002
400 4.559e-002 2.720e-002 7.008e-002 3.752e-002
600 2.871e-002 2.241e-002 5.628e-002 3.109e-002
800 2.338e-002 2.363e-002 5.232e-002 3.133e-002
1000 1.468e-002 1.885e-002 4.263e-002 2.572e-002
Note: the underlined value is the closest to 0 in the row.
Table B.8: Standard Deviation in RES/σx² for the forecasts generated for nonexceedance probability p = 0.05 by the analytical model of the joint distribution.
Size DSC LRM KDM CM
50 3.638e-001 3.334e-001 3.711e-001 1.334e+000
100 2.403e-001 2.182e-001 2.396e-001 4.789e-001
200 1.561e-001 1.538e-001 1.611e-001 1.609e-001
400 1.038e-001 1.044e-001 1.071e-001 1.078e-001
600 8.397e-002 8.516e-002 8.658e-002 8.740e-002
800 7.484e-002 7.777e-002 7.722e-002 7.957e-002
1000 6.561e-002 6.689e-002 6.688e-002 6.824e-002
Note: the underlined value is the smallest in the row.
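The BIAS and Standard Deviation entries in these tables come from repeated sampling experiments: for each sample size, many verification samples are drawn from a known model, the REL and RES estimators are applied to each sample, and the mean error and spread of the estimates are tabulated. The procedure can be sketched as follows. This is an illustrative example only: it uses a simple binning estimator and a synthetic, perfectly reliable forecasting system (f ~ U(0,1), x ~ Bernoulli(f)) with a known true RES/σx² = 1/3, not the report's analytical model or its DSC/LRM/KDM/CM estimators; the function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

def rel_res(f, x, bins=10):
    """Binning estimator of the reliability (REL) and resolution (RES)
    terms of the calibration-refinement decomposition of the MSE,
    for probability forecasts f and binary observations x."""
    edges = np.linspace(0.0, 1.0, bins + 1)
    idx = np.clip(np.digitize(f, edges) - 1, 0, bins - 1)
    xbar = x.mean()
    rel = res = 0.0
    for k in range(bins):
        m = idx == k
        if not m.any():
            continue
        # REL: squared miscalibration within the bin, weighted by bin count
        rel += m.sum() * (f[m].mean() - x[m].mean()) ** 2
        # RES: departure of the bin's observed frequency from climatology
        res += m.sum() * (x[m].mean() - xbar) ** 2
    return rel / f.size, res / f.size

def sampling_stats(n, reps=2000):
    """Monte Carlo BIAS and Standard Deviation of RES/sigma_x^2 for a
    perfectly reliable system: f ~ U(0,1), x ~ Bernoulli(f).
    True values: sigma_x^2 = 1/4, RES = Var(f) = 1/12, so RES/sigma_x^2 = 1/3."""
    true_res_norm = (1.0 / 12.0) / 0.25
    est = np.empty(reps)
    for i in range(reps):
        f = rng.uniform(size=n)
        x = (rng.uniform(size=n) < f).astype(float)
        est[i] = rel_res(f, x)[1] / 0.25  # normalize by the true sigma_x^2
    return est.mean() - true_res_norm, est.std(ddof=1)

for n in (50, 200, 1000):
    bias, sd = sampling_stats(n)
    print(f"n={n:5d}  BIAS={bias:+.3e}  SD={sd:.3e}")
```

As in the tables, the standard deviation of the estimates shrinks roughly as the square root of the sample size, while the binning estimator shows a positive small-sample bias in RES that decays as n grows.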
REFERENCES
Anderson, Jeffrey L., A Method for Producing and Evaluating Probabilistic Forecasts from Ensemble Model Integrations, Journal of Climate, 9 (7), 1518–1530, 1996.
Bae, Deg Hyo and Konstantine P. Georgakakos, Hydrologic Modeling for Flow Forecasting and Climate Studies in Large Drainage Basins, IIHR Report No. 360, The University of Iowa, Iowa City, Iowa, 1992.
Bicknell, B. R., J. C. Imhoff, J. L. Kittle, Jr., A. S. Donigian, Jr., and R. C. Johanson, Hydrological Simulation Program–Fortran: User's Manual for Version 11, U.S. Environmental Protection Agency, National Exposure Research Laboratory, Athens, Georgia, 1997.
Bradley, A. A. and S. S. Schwartz, Evaluating the Impact of Climate Forecast Information on Probabilistic Streamflow Forecast, EOS Transactions, American Geophysical Union, 81 Supplement, Washington, D.C., May 2000; Abstract H32B-09.
Buizza, R. and T. N. Palmer, Impact of ensemble size on ensemble prediction, Monthly Weather Review, 126 (9), 2503–2518, 1998.
Cleveland, W. S., Robust Locally Weighted Regression and Smoothing Scatterplots, Journal of the American Statistical Association, 74 (12), 829–836, 1979.
Croley II, Thomas E., Using Meteorology Probability Forecasts in Operational Hydrology; American Society of Civil Engineers: Reston, Virginia, 2000.
Day, G. N., Extended Streamflow Forecasting Using NWSRFS, Journal of Water Resources Planning and Management, 111 (2), 157–170, 1985.
Donigian, A. S., Jr., J. C. Imhoff, Brian Bicknell and J. L. Kittle, Jr., Application Guide for Hydrological Simulation Program–Fortran (HSPF), U.S. Environmental Protection Agency, Environmental Research Laboratory, Athens, Georgia, 1984.
Doswell III, Charles A., Robert Davies-Jones and David L. Keller, On Summary Measures of Skill in Rare Event Forecasting Based on Contingency Tables, Weather and Forecasting, 5 (12), 576–585, 1990.
Hamill, Thomas M. and Stephen J. Colucci, Verification of Eta-RSM Short-Range Ensemble Forecasts, Monthly Weather Review, 125 (6), 1312–1327, 1997.
Helsel, D. R. and R. M. Hirsch, Statistical Methods in Water Resources; Elsevier: New York, 1992.
Hersbach, Hans, Decomposition of the continuous ranked probability score for ensemble prediction systems, Weather and Forecasting, 15 (5), 559–570, 2000.
Hou, Dingchen, Eugenia Kalnay and Kelvin K. Droegemeier, Objective Verification of the SAMEX '98 Ensemble Forecasts, Monthly Weather Review, 129 (1), 73–91, 2001.
Kottegoda, N. T. and R. Rosso, Statistics, Probability, and Reliability for Civil and Environmental Engineers; McGraw-Hill, Inc.: New York, 1997.
Marzban, Caren, Scalar Measures of Performance in Rare-Event Situations, Weather and Forecasting, 13 (9), 753–763, 1998.
Murphy, Allan H., Forecast verification, Economic Value of Weather and Climate Forecasts, Katz, Richard W. and Allan H. Murphy, editors; Cambridge University Press: New York, 19–74, 1997.
Murphy, Allan H., Forecast verification: its complexity and dimensionality, Monthly Weather Review, 119 (7), 1590–1601, 1991.
Murphy, Allan H., Skill scores based on the mean square error and their relationships to the correlation coefficient, Monthly Weather Review, 116 (12), 2417–2424, 1988.
Murphy, Allan H., What is a good forecast? An essay on the nature of goodness in weather forecasting, Weather and Forecasting, 8 (2), 281–293, 1993.
Murphy, Allan H. and E. S. Epstein, Skill scores and correlation coefficients in model verification, Monthly Weather Review, 117 (3), 572–581, 1989.
Murphy, Allan H. and Daniel S. Wilks, A Case Study of the Use of Statistical Models in Forecast Verification: Precipitation Probability Forecasts, Weather and Forecasting, 13 (3), 795–810, 1998.
Murphy, Allan H. and Robert L. Winkler, Diagnostic verification of probability forecasts, International Journal of Forecasting, 7, 435–455, 1992.
Murphy, Allan H. and Robert L. Winkler, A General Framework for Forecast Verification, Monthly Weather Review, 115 (7), 1330–1338, 1987.
Perica, Sanja, Integration of Meteorological Forecasts/Climate Outlooks Into Extended Streamflow Prediction (ESP) System, http://www.nws.noaa.gov/oh/hrl/papers/ams/ams98-6.htm (accessed March 10, 1998).
Scott, David W., Multivariate Density Estimation; John Wiley & Sons, Inc.: New York, 1992.
Shuttleworth, W. James, Evaporation, Handbook of Hydrology, Maidment, David R., editor; McGraw-Hill, Inc.: New York, 4.1–4.53, 1993.
Smith, J. A., G. N. Day and M. D. Kane, Nonparametric Framework for Long-Range Streamflow Forecasting, Journal of Water Resources Planning and Management, 118 (1), 82–91, 1992.
Stensrud, David J. and Matthew S. Wandishin, The Correspondence Ratio in Forecast Evaluation, Weather and Forecasting, 15 (10), 593–602, 2000.
Stephenson, David B., Use of the "Odds Ratio" for Diagnosing Forecast Skill, Weather and Forecasting, 15 (4), 221–232, 2000.
Wilks, D. S., Diagnostic Verification of the Climate Prediction Center Long-Lead Outlooks, 1995–98, Journal of Climate, 13 (7), 2389–2403, 2000.
Wilks, D. S., Statistical Methods in the Atmospheric Sciences; Academic Press: New York, 1995.
Wilson, Laurence J., William R. Burrows and Andreas Lanzinger, A Strategy for Verification of Weather Element Forecasts from an Ensemble Prediction System, Monthly Weather Review, 127 (6), 956–970, 1999.
Zhang, H. and T. Casey, Verification of Categorical Probability Forecasts, Weather and Forecasting, 15 (1), 80–89, 2000.
Zhang, S., R. J. Karunamuni and M. C. Jones, An Improved Estimator of the Density Function at the Boundary, Journal of the American Statistical Association, 94 (448), 1231–1241, 1999.