A quick guide to competitive forecast verification testing

Eric Gilleland

4 October 2019

Summary

This write-up is intended to give a high-level overview of statistical confidence intervals and hypothesis testing, aimed at an atmospheric scientist and weather forecaster audience. Some background in weather forecast verification is assumed, along with familiarity with the Model Evaluation Tools (MET) software. The reader is pointed to the configuration option that returns matched-pair data (rather than only matched-pair summary measures) from this software, but it is otherwise assumed that the reader can obtain their verification sets by whatever means desired. Some instruction on how to create confidence intervals for the mean loss-differential series in the R software language is provided. A detailed description of how to diagnose (via graphical tools) the necessary assumptions is given, along with some techniques for handling deviations from these assumptions, such as temporal and contemporaneous dependence, as well as trends.

Introduction

Confidence intervals (CI's) provide information about the sampling, or aleatory, error for an experiment. They are often used in forecast verification to provide additional information about the resulting verification summaries beyond the point estimates; namely, whether the results could have been attained by random chance. That is, if the experiment could be re-run under identical conditions (think alternate realities rather than alternate time points), would the result still be favorable (or unfavorable), or would the experimenter's thinking change? It is important to keep in mind that other forms of uncertainty may still be unaccounted for in the experiment. For example, epistemic error results from uncertainty in the parameterizations of the physical weather forecast model; ensembles of forecasts can inform about this type of error. Moreover, the model itself could be a good model but nevertheless fundamentally wrong, so-called ontological error. For example, one could make a fairly accurate model for the positions of the planets in the solar system under the assumption that the Earth is the center of the system. The model might be fairly accurate even though it is clearly wrong. In short, the information contained in this write-up concerns aleatory error only, so the researcher should remember to consider other factors before judging one model to be superior or not.

This write-up illustrates methods using the R software language (R Core Team 2018). This language requires data to be in certain formats, but those possibilities are wide and varied (see, e.g., the tidyverse package, Wickham 2017). It is generally assumed that the reader can obtain their verification sets in R in a format that can be readily used. In this write-up, only a simple set of forecast and observation vectors, matched element by element, is assumed (specifically, two forecast vectors with one observation vector) for the temporal setting. Similar testing procedures are available for spatial fields, where the vectors are replaced with matrices (i.e., regularly gridded verification sets are assumed), but this situation is not considered in this document (cf. Hering and Genton 2011; Gilleland 2013). Here, a verification set is considered to be an observation series (or spatial field, or series of spatial fields) together with one or more forecast series whose time points all match. That is, the valid time for the forecast(s) is matched with the appropriate time from the observation at position i. So, if a 00Z forecast with a six-hour lead time is in the third element of the vector, then the third element of the observation series is the 06Z observation for that same date. All R code is given in the appendix at the end of this document.

For those who use the Model Evaluation Tools (MET) software, when running Point-Stat, request the MPR output line type in the config file as follows:

output_flag = {
   ...
   mpr = STAT; // to include them in the .stat file, or use BOTH to also write them to a separate output file.
}
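
For readers who wish to follow along without MET output, the following is a minimal sketch that simulates a toy verification set; it is not the data analyzed in this write-up, and all values are hypothetical. The object names y, x1, x2, and tiid are the same ones assumed by the appendix code.

# Hypothetical toy verification set: one observed series (y), two competing
# forecasts (x1, x2) matched element by element, and a time index (tiid).
set.seed( 42 )
n    <- 200
tiid <- 1:n
y    <- pmax( 0, 10 * sin( 2 * pi * tiid / 24 ) + rnorm( n, sd = 2 ) )  # toy "observations"
x1   <- pmax( 0, y + rnorm( n, sd = 1 ) )                               # toy forecast, model 1
x2   <- pmax( 0, y + rnorm( n, sd = 3 ) + 1 )                           # toy forecast, model 2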

Diagnostic Plots

Diagnostic plots are extremely useful tools that are often under-utilized in an operational setting, where there may not be time to consult them and where the number of cases to be analyzed is too large to inspect on a consistent basis, so that automated procedures are necessary. Nevertheless, they ought to be consulted at least periodically to judge what type of behavior might be occurring. Additionally, very poor behavior or unknown issues might come to light that would otherwise not turn up in summary measures and statistical tests. An example from one real random variable is given for a particular verification set. The particular variable and set of observations and forecasts are not important here; it is only the diagnostic methods that are of interest. The original time series, a diagnostic plot in and of itself, is displayed in Figure 1. The figure is important to look at because sometimes obvious errors can be seen. However, such plots do not show as much as is usually concluded because the human eye is easily tricked. In this case, it is difficult to discern much from the three series overlaid on top of each other, but the series are also difficult to differentiate when plotted separately, especially when considering which of the two models is more closely related to the observed series.


Figure 1: Time series plots for a verification set consisting of paired observations with two competing forecast models: model 1 and model 2.


Figure 2: The absolute-error (AE) series for Model 1 relative to the observed series (i.e., |Model 1 − Observed|), computed from the series shown in Figure 1.


Figure 3: Same as Figure 2, but for Model 2.

Figure 2 and Figure 3 show the absolute-error (AE) series between the two forecast models and the observed series. Note that the time axis has been converted to indices, but these correspond to the same time points as in Figure 1. It is a little easier to discern from these plots how good each model is, but it is perhaps still difficult to conclude whether one model is superior to the other.


Figure 4: Scatter plot of the observations against model 1 (blue squares) and model 2 (black plus signs).

Figure 4 shows a scatter plot of the observed series against each of the two models. This plot removes the temporal component from consideration, which means that time-specific information is lost, but insight into the closeness of each model's series to the observed series is gained. Indeed, a clearer picture of some potentially serious problems with model 1 is now readily seen. Although model 2 seems to demonstrate fewer systematic problems, some of the large errors are again readily observed. All of the diagnostic plots discussed to this point display an issue with timing for model 2. That is, it is clear that model 2 has positive values, in some cases relatively large positive values, where the observed series is zero. The figure also shows that there is a strong linear correlation between the observed series and model 1, and while some form of dependence exists between the observed series and model 2, it is perhaps not as linear, at least not at the low end. Dependence clearly exists between both models and
the observed series, and subsequently also between the two models. This dependence is referred to as contemporaneous dependence in the current verification setting.

Figure 5: A quantile-quantile (qq) plot of the observed series against model 1 (blue squares) and model 2 (black plus signs).

Figure 5 shows a qq-plot (quantile-quantile plot) of the observed series against each model. Now, not only is time removed, but the matched pairs are no longer matched. Instead, the quantiles of each series are matched. Such a diagnostic plot allows for observing how close the (probability) distributions for each series are to the observed one. A straight one-to-one line indicates perfect agreement in distribution. A straight line that is not one-to-one indicates a difference in means (if one is merely shifted rigidly from the one-to-one line) and/or variances (if the slope is not unity). Other departures from a straight line can indicate differences in skewness, kurtosis or other distributional properties. In this case, it is clear that model 1 has the same distribution of values as the observed series, while model 2 shows serious departures in distribution because of the S-shape; it is skewed to the right.


Figure 6: Box plots of each time series. The midline is the median, with notches indicating an approximate 95% CI for the median. The top and bottom of the box represent the upper and lower quartiles for each series, respectively. The whiskers (dashed lines) extend to the most extreme values within 1.5 times the interquartile range of the box, and any values beyond the whiskers are plotted as circles (deemed to be outliers).

Figure 6 shows box plots for each series. A box plot again summarizes the probability distribution for each series, but in a different way. From these plots, summary information including the median and upper and lower quartiles, as well as a sense of the variability and outlier values are all readily discerned. A comparison of distributions is generally not as easy as with the qq-plot, but clearly model 1’s distribution is very similar to the observed series’ distribution, while model 2’s distribution has a larger number of lower-valued outliers, and again has too few zero-valued time points.


Figure 7: Histogram of the AE loss for model 1.


Figure 8: Same as Figure 7, but for model 2.

Histograms are another way to look at the distribution of each series. Here, instead of showing the histogram for each series, only the AE loss series histograms are shown for each model, in Figure 7 and Figure 8, respectively. Clearly, the errors are far smaller for model 1 than for model 2.


Figure 9: The difference between the AE loss series displayed in Figure 2 and Figure 3. That is, the time series displays the so-called loss-differential series given by |Model 1 − Observed| − |Model 2 − Observed|.

Recall that Figure 2 and Figure 3 show the time series giving the AE loss series for models 1 and 2, respectively. Figure 9 shows the straight difference series of Figure 2 minus Figure 3, a series dubbed the loss-differential series. Values above zero indicate that model 2 has the smaller error and is therefore superior at those time points. Conversely, values below zero indicate that model 1 has the lower error and is superior to model 2 at those time points. It is difficult to judge by the human eye, in this instance, if one model is better than the other or not, but one can envisage cases where it could be much more obvious. A question to be discussed in a later section concerns which model is better on average in this sense. That is, if the average of the series in Figure 9 is taken, would that average be positive (model 2 is better, on average) or negative (model 1 is better, on average)? Moreover, the more important
question will concern the aleatory uncertainty, which provides evidence as to whether the observed average loss-differential value is larger or smaller than zero simply by random chance.

Figure 10: Autocorrelation-function (ACF) plot of the loss differential series shown in Figure 9.

Page 13: A quick guide to competitive forecast verification testing

Figure 11: Partial auto-correlation-function (PACF) plot of the loss differential series shown in Figure 9.

Figure 10 and Figure 11 show the autocorrelation-function (ACF) and partial ACF (PACF) plots for the loss differential series displayed in Figure 9. Details about these plots can be found in any book on time series modeling (e.g., Brockwell and Davis 1996). Briefly, the ACF is obtained by binning together pairs of time points by how far apart they are from each other in time and calculating their correlations. The resulting values are plotted according to the lag distances. The PACF is a conditional correlation of a time series with its own lags controlling for shorter lags. The numerous values on the ACF plot that are well outside the dashed blue horizontal lines suggest that there is strong temporal dependence in the loss differential series. The cyclic pattern indicates that there is also a cyclic trend in the loss differential series, which from Figure 9 appears to be the result of the numerous zero values that appear cyclically.
The large value at the first lag of the PACF, with the remaining values mostly staying inside the dashed blue lines, also indicates strong dependence, and more specifically suggests that the dependence is likely autoregressive of order one; that is, an AR(1) model may appropriately characterize the series.
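
As a quick check of this reading of the PACF, one could fit a low-order autoregressive model directly. The sketch below assumes d is the loss-differential vector, as constructed in the appendix; it is illustrative only.

# Fit a low-order AR model to the loss-differential series and see which
# order is selected; an order of one would support the AR(1) interpretation.
fit.ar <- ar( d, order.max = 5 )    # AIC-based order selection
fit.ar$order                        # selected order
arima( d, order = c( 1, 0, 0 ) )    # explicit AR(1) fit, with an intercept (mean) term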

Figure 12: ACF plot of the order-one differenced loss-differential series from Figure 9.


Figure 13: Same as Figure 12, but the PACF instead of the ACF.

Figure 12 shows the ACF plot for the order-1 differenced loss-differential series from Figure 9. That is, if x_t represents the time series in Figure 9, then Figure 12 is the ACF for the series y_t = x_t − x_{t−1}. The cyclic trend seems to be removed for the differenced series, but the PACF for this series (Figure 13) indicates that there is still strong dependence present.

Hypothesis Testing and CI's

Statistical hypothesis testing and CI's are highly related, and it is instructive to first discuss hypothesis testing. A hypothesis test consists of two hypotheses, called the null and alternative hypotheses, denoted H_0 and H_1, respectively. They correspond to two regions that partition all of the possible values for the summary measure. Two possible outcomes of the test can occur: the null hypothesis can be rejected, or it can be accepted. The alternative hypothesis is never accepted, which emphasizes that the outcome of the statistical test is
merely evidence about the truth that the experimenter hopes to reveal. These two hypotheses concern a fixed but unknown parameter that is estimated from the data using an estimator, which is a function of the observable random variables. For example, it might be the mean of daily rain amount, maximum wind speed, or median temperature. In classical parametric hypothesis testing, an assumption regarding the distribution of the estimator is made under the premise that the null hypothesis is true. For example, suppose interest is in the mean of a random variable, say X, where X could represent daily rain amount, wind speed, or any other random variable. Use µ to represent the true mean for X, so that µ = E[X], the expected value of X. A sample of size n of the random variable is to be observed, and an estimator for µ is a function of this random sample, X_1, ..., X_n. For example, one of the most well-known estimators for the mean is the sample mean given by

X̄_n = (X_1 + X_2 + ... + X_n) / n,

where the n subscript emphasizes the sample size, but sometimes the sample mean is written without this subscript so that X̄_n = X̄. In order to conduct a classical parametric hypothesis test, this sample estimator must have a distribution that is estimable, which means that its distribution must be free of any unknown parameters. Such an estimator is then called a test statistic. In the case of the sample mean, there is theoretical justification for assuming that the estimator

Z = (X̄_n − µ) / (σ / √n)

follows a standard normal distribution. Because the standard normal distribution does not have any unknown parameters, the estimator Z is a test statistic. Note, however, that Z contains two unknown parameters. The first, µ, is the parameter for which inference is to be made, so H_0 determines its value, and it is usually written as µ_0. The second parameter, σ, is problematic, however. It must, itself, be estimated from the random sample. It can be estimated using the sample standard deviation, S_n, as an estimator, where

S_n = √[ (X_1 − X̄_n)² / (n − 1) + (X_2 − X̄_n)² / (n − 1) + ... + (X_n − X̄_n)² / (n − 1) ].

In this case, however, the normal approximation is no longer valid. Instead, the estimator

T = (X̄_n − µ_0) / (S_n / √n)

can be shown to follow a Student-t distribution with n − 1 degrees of freedom, which for large sample sizes is approximately the same as the standard normal distribution. Additional assumptions about the sample, X_1, ..., X_n, in this case, however, are that the sample is independent and identically distributed (iid). If X represents rain amount and it does not rain
today, then it might be more likely that it also does not rain tomorrow, so such a sample is not likely to be independent. If the length of the series is long enough, then it is possible that the distribution of X is changing with climate change; in that case, the series is not likely to be identically distributed. The same concern arises if the series has a diurnal or seasonal cycle. Diagnostic plots such as the ones described in the previous section are useful for determining whether these types of assumption violations are present in the random sample.

One of the biggest challenges in classical parametric hypothesis testing is estimating the standard error (e.g., the standard error for Z above is given by se(X̄_n) = σ/√n). Violations of the assumptions about the random sample, as described above, are often the main obstacles to obtaining accurate estimates of the standard error. A solution to this problem in the realm of comparative forecast verification is discussed in the next section, where an additional challenge is introduced by the contemporaneous correlation of the two competing models.

Efron (1979) introduced the bootstrap method for statistical inference. In many situations, the underlying distribution for a random sample is not known, and no theoretical justification is available to suggest an appropriate approximation. The bootstrap can be used in these types of instances, and generally provides a good means for testing hypotheses and constructing valid CI's. In fact, Gilleland et al. (2018) found that some bootstrap procedures were fairly accurate, with high power, when testing which of two competing forecasts is superior relative to the same observation series, as described in the next section. Many texts describe the bootstrap in detail, such as Efron and Tibshirani (1998), Davison and Hinkley (1997), and Lahiri (2003), but this approach is not discussed further here. See Gilleland (2010) for an introduction to the bootstrap, and the assumptions involved, in the forecast verification setting. Also see Hamill (1999) for other nonparametric resampling procedures in this domain, such as the permutation test.

All statistical hypothesis tests are constructed to control against a type-I error a priori, before conducting the test. That is, rejecting H_0 when it is true (the type-I error) is considered to be the more critical error, so the test is designed to control this error at a low level, called the size of the test. This control is obtained through the assumed distribution for the estimator under the assumption that H_0 is true. In this scenario, it is possible to observe a value in the tail of the estimator's distribution even when H_0 is true. Therefore, a region of the distribution is determined before running the test such that the probability of observing an estimate in this region is less than or equal to the size of the test.

A significance test is similar to a hypothesis test, but strictly speaking, it does not reject or fail to reject H_0. Instead, a p-value is found, which is defined to be the probability of observing an estimate, based on the distribution assumed under H_0, that is "at least as large" as the value observed. It is a value that is computed a posteriori, after observing the sample. While it is true that if this value falls within the critical region for rejecting H_0 then the p-value will be less than the size of the test, treating a significance test as if it were a hypothesis test is a dangerous practice (cf. Goodman 1999), and one that many authors employ. The issue is that the size of the test is often not considered within this framework
before the test is conducted so that human bias can enter through the way the p-value is calculated (e.g., changing between a one- or two-sided test after the experiment is conducted).
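
To make the classical test above concrete, here is a minimal sketch in R using hypothetical iid data (none of the quantities below come from the verification example in this write-up); mu0 denotes the value of µ under H_0.

# One-sample test of the mean for a hypothetical iid sample.
set.seed( 1 )
xsamp <- rnorm( 40, mean = 0.3, sd = 1 )   # hypothetical iid sample
mu0   <- 0                                 # value of the mean under H_0
n     <- length( xsamp )
Tstat <- ( mean( xsamp ) - mu0 ) / ( sd( xsamp ) / sqrt( n ) )  # the T statistic above
pval  <- 2 * pt( -abs( Tstat ), df = n - 1 )                    # two-sided p-value
# Equivalently, with the built-in function:
t.test( xsamp, mu = mu0, alternative = "two.sided" )

Note that this sketch relies on the iid assumption; it is not appropriate for the temporally dependent loss-differential series considered next.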

Hypothesis Testing and CI's for Competing Forecast Verification

Competing forecast verification is often of interest in meteorology because new forecast models are always being introduced, and it is of interest to know if these new models are superior to the currently used ones. Various methods for comparing competing forecasts have been proposed (see, e.g., Hamill 1999). Here, the method proposed by Hering and Genton (2011), and tested in Gilleland et al. (2018) against commonly used procedures in the realm of weather forecast verification (e.g., procedures available in MET), is described. Henceforth, the test will be dubbed the HG test. The test is itself a modification of the Diebold and Mariano (DM) test introduced in the seminal paper Diebold and Mariano (1995). The situation follows the setting described in the diagnostic plots section of this manuscript for time series; a spatial version of the test is also available but is not described here. Denote the observation series by X_t and the two competing forecast models by Y_1t and Y_2t, respectively. For any loss function, g, let g_1t = g(Y_1t, X_t) and g_2t = g(Y_2t, X_t) denote the loss series for each respective forecast model. For example, for AE loss, g_1t = |Y_1t − X_t| and g_2t = |Y_2t − X_t|. Then, the loss-differential series, D_t, is given by D_t = g_1t − g_2t. Capital letters for Y_1t, Y_2t, X_t and D_t indicate that these series are random samples that have not yet been observed. Once they are observed, their associated lower-case letters are used. The hypothesis to be tested concerns the mean loss differential, µ_D. In particular, if µ_D = 0, then there is no difference, on average, between the performance of the two forecast models. Formally,

H_0: µ_D = 0 versus H_1: µ_D ≠ 0.

The test statistic is the same as for T or Z described previously, but the standard error is estimated by way of a weighted average of the lag terms derived from a parametric model fit to the auto-covariance function (ACVF); the ACVF is the same as the ACF except that covariances are calculated instead of correlations (the ACF is the ACVF divided by the ACVF's lag-zero term). Several parametric models may be used, but a simple choice is the exponential model. For details on this modeling approach, the reader is referred to Hering and Genton (2011). The sample mean D̄ of the loss-differential series is used as the estimator of µ_D. The HG test described above is robust to contemporaneous correlation and directly deals with the temporal dependence. No assumptions about the distributions of Y_1t, Y_2t, X_t and D_t are required. The only assumption is that D̄ (appropriately standardized) follows either a Student-t distribution with n − 1 degrees of freedom (for small sample sizes) or a standard normal distribution (for large sample sizes). Hering and Genton (2011) demonstrated these properties and showed that their test is an improvement over the DM test procedure using empirical size and power tests from simulated MA(1) series for varying means of the second error series, g_2t. Gilleland et al. (2018) further evaluated the test against commonly used testing procedures from weather forecast
verification and found it to be the most accurate, with bootstrap methods also shown to be highly accurate. Other procedures were found to be highly negatively impacted by the contemporaneous correlation. For the example shown in the diagnostic plots section, the estimated mean AE loss differential is found to be d̄ ≈ −10.96, which suggests that model 1 is the better model, on average, over model 2. Because, for a test of size 5%, this value is more extreme than the a priori critical value of −2.96, H_0 is rejected, meaning that the evidence suggests that model 1 is superior: the observed value of the estimator for the mean loss differential is less than zero, and the outcome is not likely to be a result of random chance.
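
For intuition, the sketch below shows how the loss-differential series and its sample mean might be computed, together with a naive large-sample test that ignores temporal dependence (and is therefore not the HG test). The object names y, x1, and x2 are as assumed in the appendix. The HG test itself, which replaces the naive standard error with one derived from a parametric fit to the ACVF, is available as predcomp.test in the verification package (see the appendix).

# AE loss-differential series and a NAIVE test of H_0: mu_D = 0 that assumes
# iid data; if d is autocorrelated, this standard error is too small.
d       <- abs( x1 - y ) - abs( x2 - y )
n       <- length( d )
dbar    <- mean( d )                 # estimate of mu_D
se.iid  <- sd( d ) / sqrt( n )       # iid standard error (ignores dependence)
z       <- dbar / se.iid
p.value <- 2 * pnorm( -abs( z ) )    # two-sided normal p-value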

Accounting for Trends

If trends are present in the loss-differential series, they can lead to erroneous results from a hypothesis test and misleading CI's. The issue primarily concerns the estimated standard error, as a trend can greatly impact such estimates. Trends can take many forms. For example, a series might increase linearly, quadratically, or more generally polynomially with time; it might have a diurnal cycle, a seasonal cycle, or some other nonlinear type of trend. If it is reasonable to assume a particular trend, then the trend can easily be removed before estimating the standard error. A linear trend is readily removed from a time series by differencing. For example, suppose a series follows the model below:

π‘₯2 = π‘Ž + 𝑏𝑑 + 𝑐π‘₯256 + πœ€2 where π‘₯6, … , π‘₯A is a time series, π‘Ž is a constant term, 𝑏 is a linear trend term, 𝑐 is an order-one auto-regressive (AR) coefficient and πœ€2 is white noise. The differenced series is given by

π‘₯2 βˆ’ π‘₯256 = 𝑏 + 𝑐(π‘₯256 βˆ’ π‘₯25D) + πœ‚2 which no longer has a trend term. If the trend is of a higher-order polynomial, say 𝑝, then the series would need to be differenced 𝑝 times. This process is most generally known as a fractionally integrated ARMA process (Brockwell and Davis 1996). For more general trends, differencing may or may not work depending on the trend function. In the case of the loss differential series, if two competing forecasts have the same trend, then this trend may be removed from the loss differential series, or at least greatly reduced. A good example would be in the case of solar radiation where there is a distinct diurnal trend and most forecasts should get the trend part correct as it is a simple process of the time of day; cloudiness notwithstanding. To determine whether the competing forecasts test of HG is affected by such a series, simulations are made with the assumption that nighttime is first removed from the analysis.

Two error series are first simulated such that they are dependent in time and contemporaneously correlated. A cyclic trend is added to both, and differencing is applied to the resulting (AE) loss-differential series in order to try to remove any additional trends. Empirical power is assessed by varying the variance of the second series (σ_2²) as well as by the more usual practice of varying the mean of the second series (µ_2). Varying the variance makes sense in this realm because forecasts are often calibrated to have the same mean as the observation series, yielding an average error of zero, so the only way one forecast could be superior to the other is if its error series has a lower variance. Figure 14 and Figure 15 show the results for large and small sample sizes, respectively, when varying σ_2². For a large sample size, the power curve looks very good, having a type I error rate close to the (empirical) size of the test and high power when σ_2² differs greatly from σ_1², with a reasonable sloping increase. On the other hand, when the sample size is small (n = 30 in this case), there is very little power under a change in variance alone.
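
A rough sketch of this kind of simulation setup is given below. It is illustrative only: the AR coefficient, correlation, cycle amplitude, and variances are hypothetical and are not the values used to produce Figures 14-17.

# One realization of two temporally dependent, contemporaneously correlated
# error series sharing a diurnal cycle, and the resulting loss differential.
set.seed( 99 )
n    <- 24 * 40
tt   <- 1:n
rho  <- 0.7                                          # contemporaneous correlation of innovations
z1   <- rnorm( n )
z2   <- rho * z1 + sqrt( 1 - rho^2 ) * rnorm( n )
e1   <- as.numeric( stats::filter( z1, 0.5, method = "recursive" ) )      # AR(1) error series 1
e2   <- as.numeric( stats::filter( 2 * z2, 0.5, method = "recursive" ) )  # AR(1) error series 2, larger variance
cyc  <- 5 * sin( 2 * pi * tt / 24 )                  # shared diurnal cycle
d    <- abs( e1 + cyc ) - abs( e2 + cyc )            # AE loss-differential series
dd   <- diff( d )                                    # differencing to reduce any remaining cycle
acf( dd )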


Figure 14: Empirical power curves for simulations of AE loss-differential series derived from contemporaneously correlated, temporally dependent series with a diurnal cycle. Differencing is used to remove any diurnal cycle that might remain after calculating the loss-differential series when estimating the standard error for the mean loss differential. Power, here, is based on error series with identical means, but where the second series has a variance term that varies while the first series always has a variance of unity. Circles are tests performed for size α = 5%, plus signs for α = 10%, and stars for α = 20%. Horizontal lines are drawn through the three size levels. When σ_2² is close to unity, the empirical power represents the empirical type I error rate, so that values close to unity should be close to their respective horizontal lines. When σ_2² is far from unity, higher power is desired. The sample size is large.


Figure 15: Same as Figure 14 but with a small sample size (𝑛 = 30).

Figure 16 and Figure 17 show the empirical power curves for the (statistically) more usual case of differing means. In this case, the empirical type-I error rate is good and the empirical power is high, but the power increases far too rapidly, especially for large samples. Therefore, the HG test in this case may reject the null hypothesis for very small deviations between the means, so that one might question the practical significance of the results.


Figure 16: Same as Figure 14, but now the variance is held at unity for both series and the mean of the second error series is varied. Thus, values of µ_2 that are close to zero represent the empirical type I error rate. The sample size is large here, and the power increases much too quickly.


Figure 17: Same as Figure 16 but with a small sample size (𝑛 = 30).

Summary

To summarize the main points in this document:

• When conducting a hypothesis test or making confidence intervals (CI's), there are always assumptions about the distribution of the estimator for the population parameter (or function thereof) for which inferences are to be made. These assumptions should be checked, and the diagnostic plots described in this document can be helpful in that regard. Typically, assumptions are made about:


o The specific parametric probability distribution for the estimator, which may have been derived from the assumed distribution of the underlying random sample used to calculate the estimator. Bootstrap methods do not make such an assumption (but they are not void of other assumptions).

o The independence, or dependence, of the underlying random sample. Specific bootstrap methods assume one or the other and are not accurate if the wrong choice is made.

o Independence between forecast models (no contemporaneous dependence) in the competing forecast verification framework. This assumption is often violated, resulting in poor performance for many standard hypothesis test procedures, even some that account for temporal dependence.

o The stationarity (or lack thereof) of an estimator's distribution. That is, does the distribution change over time, or is it stable through time?

o The absence of a trend.

• The Hering-Genton (HG) test is an accurate and powerful procedure that accounts for temporal dependence (or independence) and is robust to contemporaneous correlation. No assumptions are made about the underlying random sample, except that the estimator (a mean) follows the Student-t or normal distribution (depending on sample size).

• Bootstrap methods provide fairly accurate and robust hypothesis tests and CI's, but still require assumptions that should be checked, and the correct bootstrap procedure should be applied for a specific setting.

• Statistical tests and CI's need to be interpreted in the correct context, and researchers often draw false conclusions from them. The American Statistical Association (ASA) put out a statement on p-values (Wasserstein and Lazar 2016). Perhaps the most important consideration is to remember that the result of a statistical test is just one piece of evidence that should be considered when conducting an experiment, and that a hard yes-or-no answer to an experiment's hypothesized question is not usually useful.

Appendix (R Code)

Bootstrap code is available in R via the boot package (Canty and Ripley 2017), as well as (less tested at the time of this writing) in the distillery package (Gilleland 2017). See Gilleland (2010) for some guidance on using R to perform bootstrap (and other) procedures to obtain CI's in the forecast verification domain. For the remainder of this appendix, it is assumed that the observed series is a numeric vector named y and that the two competing forecast models are in vectors named x1 and x2, respectively. Finally, the time points are contained in a vector (as numerical indices, for simplicity) called tiid. To make the various plots in the diagnostic section of this document (anything on the same line after # is a comment):


# Time series plot. See Figure 1.
plot( tiid, y, type = "l" )
lines( tiid, x1, lty = 2 )
lines( tiid, x2, lty = 3 )

# Time series of AE loss functions. See Figure 2 and Figure 3.
plot( tiid, abs( x1 - y ), type = "l" )
plot( tiid, abs( x2 - y ), type = "l" )

# Time series of the AE loss differential series. See Figure 9.
d <- abs( x1 - y ) - abs( x2 - y )
plot( tiid, d, type = "l" )

# Scatter plot. See Figure 4.
# Use ?par to see options for point symbols, etc.
plot( y, x1 )
points( y, x2, pch = "+" )

# QQ-plots. See Figure 5.
qq1 <- qqplot( y, x1, plot.it = FALSE )
qq2 <- qqplot( y, x2, plot.it = FALSE )
# Common axis ranges so that both models appear on the same scale.
rng <- apply( cbind( c( qq1$x, qq2$x ), c( qq1$y, qq2$y ) ), 2, range, finite = TRUE )
plot( qq1, xlab = "Model", ylab = "Observed", xlim = rng[ , 1 ], ylim = rng[ , 2 ] )
points( qq2$x, qq2$y )

# Box plots. See Figure 6.
boxplot( data.frame( observed = y, model1 = x1, model2 = x2 ),
         col = "darkblue", notch = TRUE )

# Histograms of AE loss series. See Figure 7 and Figure 8.
hist( abs( x1 - y ), col = "darkblue", freq = FALSE )
hist( abs( x2 - y ), col = "darkblue", freq = FALSE )

# ACF/PACF of loss differential series (d calculated above).
# See Figure 10 and Figure 11.
acf( d )
pacf( d )

# ACF/PACF of differenced loss differential series.
# See Figure 12 and Figure 13.
N <- length( d )
acf( d[ 2:N ] - d[ 1:(N - 1) ] )
pacf( d[ 2:N ] - d[ 1:(N - 1) ] )

# HG test (predcomp.test from the verification package).
# Currently not available with differencing,
# but trend removal is possible if the trend is known.
predcomp.test( x = y, xhat1 = x1, xhat2 = x2,
               alternative = "two.sided" )
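
As a complement (not part of the original appendix), a minimal sketch of a block-bootstrap CI for the mean loss differential using the boot package is given below; the block length l = 20 is an arbitrary tuning choice, and d is the loss-differential vector computed above.

# Block bootstrap CI for the mean loss differential, respecting temporal
# dependence by resampling contiguous blocks rather than individual points.
library( boot )
bt <- tsboot( d, statistic = mean, R = 2000, l = 20, sim = "fixed" )
boot.ci( bt, type = "perc" )   # percentile interval for the mean loss differential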

Acknowledgments

Support for this write-up was provided by the Developmental Testbed Center (DTC). The DTC Visitor Program is funded by the National Oceanic and Atmospheric Administration, the National Center for Atmospheric Research, and the National Science Foundation.

References

Brockwell, Peter J. and Richard A. Davis, 1996: Introduction to Time Series and Forecasting. Springer, New York, NY, 420 pp.

Canty, Angelo and Brian Ripley, 2017: boot: Bootstrap R (S-Plus) Functions. R package version 1.3-20.

Davison, Anthony C. and D. V. Hinkley, 1997: Bootstrap Methods and Their Application. Cambridge University Press, Cambridge, U.K., ISBN 0-521-57391-2, 592 pp.

Diebold, F. X. and R. S. Mariano, 1995: Comparing predictive accuracy. J. Bus. Econ. Stat., 13, 253–263, https://doi.org/10.1080/07350015.1995.10524599.

Efron, Bradley, 1979: Bootstrap methods: Another look at the jackknife. Annals of Statistics, 7(1), 1–26.

Gilleland, E., 2010: Confidence intervals for forecast verification. NCAR Technical Note NCAR/TN-479+STR, 71 pp.

Gilleland, E., 2013: Testing competing precipitation forecasts accurately and efficiently: The spatial prediction comparison test. Mon. Wea. Rev., 141, 340–355, https://doi.org/10.1175/MWR-D-12-00155.1.

Gilleland, E., 2017: distillery: Method Functions for Confidence Intervals and to Distill Information from an Object. R package version 1.0-4. URL https://CRAN.R-project.org/package=distillery.

Gilleland, E., A. S. Hering, T. L. Fowler, and B. G. Brown, 2018: Testing the tests: What are the impacts of incorrect assumptions when applying confidence intervals or hypothesis tests to compare competing forecasts? Mon. Wea. Rev., 146(6), 1685–1703, https://doi.org/10.1175/MWR-D-17-0295.1.

Goodman, Steven N., 1999: Toward evidence-based medical statistics. 1: The p-value fallacy. Ann. Intern. Med., 130, 995–1004.

Hamill, Thomas M., 1999: Hypothesis tests for evaluating numerical precipitation forecasts. Wea. Forecasting, 14, 155–167, https://doi.org/10.1175/1520-0434(1999)014<0155:HTFENP>2.0.CO;2.

Hering, A. S. and M. G. Genton, 2011: Comparing spatial predictions. Technometrics, 53, 414–425, https://doi.org/10.1198/TECH.2011.10136.

Lahiri, S., 2003: Resampling Methods for Dependent Data. Springer, New York, NY, 382 pp.

NCAR - Research Applications Laboratory, 2015: verification: Weather Forecast Verification Utilities. R package version 1.42. URL https://CRAN.R-project.org/package=verification.

R Core Team, 2018: R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.

Wasserstein, Ronald L. and Nicole A. Lazar, 2016: The ASA Statement on p-Values: Context, Process, and Purpose. The American Statistician, 70(2), 129–133, https://doi.org/10.1080/00031305.2016.1154108.

Wickham, Hadley, 2017: tidyverse: Easily Install and Load the 'Tidyverse'. R package version 1.2.1. URL https://CRAN.R-project.org/package=tidyverse.