nachiketa acharya - india meteorological...
TRANSCRIPT
Seasonal Forecasting Using the Climate Predictability Tool
Validation & Verification in CPT
Nachiketa Acharya ([email protected])
Big Thanks to Dr. Simon Mason
Validation vs Verification
• “Validation” vs. “verification”: we validate a model, but verify forecasts.
• In CPT, “validation” relates to the assessment of a model for deterministic (“best guess”) cross-validated and retroactive predictions; “verification” relates to the assessment of probabilistic forecasts.
Cross-validation

Leave-one-out cross-validation: each year in turn is withheld, the model is trained on all remaining years, and a prediction is made for the withheld year.

1971: predict 1971; train on all other years
1972: predict 1972; train on all other years
1973: predict 1973; train on all other years
1974: predict 1974; train on all other years
1975: predict 1975; train on all other years
… repeat to 2010.
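The leave-one-out scheme above can be sketched as a simple loop. This is a minimal pure-Python illustration, not CPT's implementation: `train_and_predict` is a hypothetical stand-in for whatever model CPT fits, and the `climatology` "model" used here simply predicts the training mean.

```python
def leave_one_out(years, values, train_and_predict):
    """For each year, train on all other years and predict the held-out one.

    `train_and_predict` is a hypothetical callable: it receives the training
    (year, value) pairs and the target year, and returns a prediction.
    """
    predictions = {}
    for held_out in years:
        training = [(y, v) for y, v in zip(years, values) if y != held_out]
        predictions[held_out] = train_and_predict(training, held_out)
    return predictions

def climatology(training, target_year):
    """Trivial stand-in model: predict the mean of the training values."""
    vals = [v for _, v in training]
    return sum(vals) / len(vals)

years = [1971, 1972, 1973]
values = [10.0, 20.0, 30.0]
preds = leave_one_out(years, values, climatology)
# Each prediction uses only the other two years, e.g. 1971 -> (20 + 30) / 2
```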
Leave-k-out cross-validation: a window of k years centred on the target year is withheld (here k = 5); the model is trained on the remaining years and a prediction is made for the centre year.

1971: predict 1971; omit 1972–1973; train on the rest
1972: omit 1971; predict 1972; omit 1973–1974; train on the rest
1973: omit 1971–1972; predict 1973; omit 1974–1975; train on the rest
1974: omit 1972–1973; predict 1974; omit 1975–1976; train on the rest
1975: omit 1973–1974; predict 1975; omit 1976–1977; train on the rest

Example: a 24-year data period (1982–2005) cross-validated in a leave-one-out manner.
Retroactive forecasting
Given data for 1951-2000, it is possible to calculate a retroactive set of probabilistic forecasts. CPT will use an initial training period to cross-validate a model and make predictions for the subsequent year(s), then update the training period and predict additional years, repeating until all possible years have been predicted.
1981: train on 1951–1980; predict 1981; omit 1982 onwards
1982: train on 1951–1981; predict 1982; omit 1983 onwards
1983: train on 1951–1982; predict 1983; omit 1984 onwards
1984: train on 1951–1983; predict 1984; omit 1985 onwards
1985: train on 1951–1984; predict 1985
…
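The retroactive scheme (an expanding training window, predicting only years after the window) can be sketched in the same style. Again this is an illustrative pure-Python sketch, not CPT's code; `train_and_predict` and `climatology` are hypothetical stand-ins.

```python
def retroactive_forecasts(years, values, train_and_predict, initial_len):
    """Train on an expanding window and predict each subsequent year.

    `initial_len` is the length of the initial training period;
    `train_and_predict` is a hypothetical stand-in for the model fit.
    """
    forecasts = {}
    for i in range(initial_len, len(years)):
        training = list(zip(years[:i], values[:i]))  # all years before target
        forecasts[years[i]] = train_and_predict(training, years[i])
    return forecasts

def climatology(training, target_year):
    """Trivial stand-in model: predict the mean of the training values."""
    vals = [v for _, v in training]
    return sum(vals) / len(vals)

years = list(range(1951, 1956))            # 1951-1955
values = [10.0, 20.0, 30.0, 40.0, 50.0]
fcst = retroactive_forecasts(years, values, climatology, initial_len=3)
# 1954 is predicted from 1951-1953 only; 1955 from 1951-1954
```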
Forecasts and observations

Deterministic, discrete: “It will rain tomorrow.”
Deterministic, continuous: “There will be 10 mm of rain tomorrow.”
Probabilistic, discrete: “There is a 50% chance of rain tomorrow.”
Probabilistic, continuous: “There is a p% chance of more than k mm of rain tomorrow.”
Continuous measures compare the best-guess forecasts with the observed values without regard to the categories: forecasts in mm or °C are compared against observations in mm or °C.
Tools ~ Validation ~ Cross-validated ~ Performance measures
Pearson’s correlation
Pearson’s correlation measures association (are increases and decreases in the forecasts associated with increases and decreases in the observations?).
It does not measure accuracy.
When squared, it tells us how much of the variance of the observations is correctly forecast.
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
Correlation: measuring the strength of the linear relationship between two variables (the Pearson product-moment correlation).

Correlation is a systematic relationship between x and y: when one goes up, the other tends to go up, or may tend to go down. It requires corresponding pairs of cases of x and y. “Perfect” positive correlation is +1; “perfect” negative correlation is –1; no correlation (x and y completely unrelated) is 0. Correlation can be anywhere between –1 and +1. A relationship between x and y may or may not be causal; if not, x and y may both be under the control of some third variable. Correlation can be estimated visually from a scatterplot of x against y.
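The Pearson formula can be computed directly from its definition. A minimal pure-Python sketch (the `fcst`/`obs` names are illustrative, not CPT's):

```python
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation between paired samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = sqrt(sum((xi - mx) ** 2 for xi in x)) * \
          sqrt(sum((yi - my) ** 2 for yi in y))
    return num / den

fcst = [1.0, 2.0, 3.0, 4.0]
obs  = [2.0, 4.0, 6.0, 8.0]
r = pearson(fcst, obs)   # perfectly linear relationship, so r is 1 (to rounding)
```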
[Scatterplot examples: correlation = 0.8; correlation = 0.55; correlation = 0 despite a strong nonlinear relationship (the Pearson correlation only detects linear relationships); correlation = 0.87 driven by a single outlier in the upper right.]

If domination by one case is not desired, the Spearman rank correlation (the correlation among ranks instead of actual values) can be used.
Spearman’s correlation

Spearman’s correlation is Pearson’s correlation applied to the ranks of the forecasts and observations. When squared, it tells us how much of the variance of the ranks of the observations is correctly forecast. It does not have as obvious an interpretation as Pearson’s, but it is much less sensitive to extremes.
Spearman rank correlation

Rank correlation is the Pearson correlation between the ranks of X and the ranks of Y, treating the ranks as numbers. It measures the strength of the monotonic relationship between two variables, and it defuses outliers by not honouring the original intervals between adjacent ranks: adjacent ranks simply differ by 1.

A simpler formula for small samples: if the difference in rank for case i is D_i, then

r_s = 1 - \frac{6 \sum_{i=1}^{n} D_i^2}{n(n^2 - 1)}

If the ranks are identical for all cases, all D_i are zero and r_s = 1. An example of the use of this formula is given on the next slide.
Spearman rank correlation

Rank correlation is simply the correlation between the ranks of X and the ranks of Y, treating the ranks as numbers. When there are outliers, or when the X and/or Y data are very much non-normal, the Spearman rank correlation should be computed in addition to the standard correlation.

Example of conversion to ranks: the original numbers 2, 9, 189, 3, 21, 7 have corresponding ranks 6, 3, 1, 5, 2, 4 (or equivalently 1, 4, 6, 2, 5, 3). Note that the difference between 189 and 21 is treated the same as the difference between 9 and 7.
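The small-sample formula can be sketched directly, using the slide's example values. This is an illustrative pure-Python sketch that assumes no tied values (ties need average ranks, which CPT's implementation would handle):

```python
def ranks(values):
    """Rank 1 = smallest value (assumes no ties)."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

def spearman(x, y):
    """Spearman rank correlation via r_s = 1 - 6*sum(D_i^2) / (n*(n^2-1))."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

x = [2, 9, 189, 3, 21, 7]          # the slide's example values
# Any monotone transform preserves the ordering, so r_s = 1 despite the
# huge outlier (189) in the data.
r_s = spearman(x, [20, 90, 1890, 30, 210, 70])
```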
2AFC (Kendall’s tau)

Kendall’s correlation measures discrimination (do the forecasts increase and decrease as the observations increase and decrease?). The numerator is the difference between the numbers of concordant and discordant pairs; the denominator is the total number of pairs:

\tau = \frac{n_c - n_d}{n(n-1)/2}

It can be transformed to the probability that the forecasts successfully distinguish the wetter (or hotter) of two observations: 2\mathrm{AFC} = \frac{1}{2}(\tau + 1).
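The pair-counting definition above translates directly into code. A minimal pure-Python sketch, assuming no tied values (ties require one of the tau-b/tau-c adjustments):

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau: (concordant - discordant) / total pairs (no ties)."""
    nc = nd = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        s = (xi - xj) * (yi - yj)
        if s > 0:
            nc += 1     # pair ordered the same way in both series
        elif s < 0:
            nd += 1     # pair ordered oppositely
    n = len(x)
    return (nc - nd) / (n * (n - 1) / 2)

def two_afc(tau):
    """2AFC: probability of picking the wetter/warmer of two observations."""
    return (tau + 1) / 2

fcst = [1.0, 3.0, 2.0, 4.0]
obs  = [10.0, 30.0, 40.0, 20.0]
tau = kendall_tau(fcst, obs)   # 3 concordant, 3 discordant pairs -> tau = 0
```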
Error measures compare the best-guess forecasts with the observed values without regard to the categories. They compare forecasts in mm or °C against observations in mm or °C.
Biases

Mean bias:
• Always close to zero for cross-validated forecasts;
• Slightly negative if the predictand data are positively skewed;
• For retroactive forecasts, indicates the ability to forecast shifts in climate.

Variance (or amplitude) bias:
• Typically very small if skill is low, because the forecasts stay close to the mean.

If there is no mean or variance bias, the RMSE of the forecasts will exceed that of climatology whenever the correlation is less than 0.5.
Root-mean-square skill score (RMSSS) for continuous deterministic forecasts

RMSSS is defined as:

\mathrm{RMSSS} = 1 - \frac{\mathrm{RMSE}_f}{\mathrm{RMSE}_s}

where \mathrm{RMSE}_f is the root-mean-square error of the forecasts,

\mathrm{RMSE}_f = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (f_i - o_i)^2},

and \mathrm{RMSE}_s is the root-mean-square error of the standard used as the no-skill baseline. Both persistence and climatology can be used as the baseline. Persistence, for a given parameter, is the persisted anomaly from the period immediately prior to the long-range forecast (LRF) period being verified; for seasonal forecasts, persistence is the seasonal anomaly from the season prior to the season being verified. Climatology is equivalent to persisting an anomaly of zero.
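The RMSE and RMSSS definitions can be sketched as follows; this is an illustrative pure-Python example with made-up numbers, using a climatology (observed-mean) baseline:

```python
from math import sqrt

def rmse(forecasts, observations):
    """Root-mean-square error of forecasts against observations."""
    n = len(forecasts)
    return sqrt(sum((f - o) ** 2 for f, o in zip(forecasts, observations)) / n)

def rmsss(forecasts, observations, reference):
    """RMSSS = 1 - RMSE_f / RMSE_s; positive means beating the baseline."""
    return 1 - rmse(forecasts, observations) / rmse(reference, observations)

obs  = [1.0, 2.0, 3.0, 4.0]
fcst = [1.5, 2.5, 2.5, 3.5]      # errors of +/- 0.5
clim = [2.5] * 4                 # climatology baseline: the observed mean
score = rmsss(fcst, obs, clim)   # positive: forecasts beat climatology
```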
Categorical measures assess the skill of the deterministic forecasts with the observations treated as categories. Some compare forecasts in mm or °C with observations as categories; others compare categories with categories.
Hit scores convert the forecasts to categories and then compare these with the observed categories. But note that the category containing the best guess is not necessarily the most likely!
Hit scores
The contingency tables are based on cross-validated definitions of the categories and so may not perfectly match implied scores from the graph.
Some hits can be expected even with useless forecasts (e.g., guessing, or always forecasting the same outcome).
Tools ~ Contingency Tables ~ Cross-validated
Measures of discrimination: can the forecasts successfully distinguish different outcomes? The observations are categories, but the forecasts are continuous (except where indicated).
ROC diagrams
ROC areas: do we issue a higher probability when the category occurs?
Graph bottom left: when the probabilities are high, does the category occur?
Graph top right: when the probabilities are low, does the category not occur?
Retroactive forecasts of MAM 1986 – 2010 Thailand rainfall using February Pacific SSTs
Relative Operating Characteristics
Continuous scores

Correlations:
• Pearson’s: % variance
• Spearman’s: % variance of ranks
• Kendall’s: 2AFC, the probability of successfully identifying the warmer / wetter observation

Errors:
• Mean bias: unconditional error
• Variance bias: underestimation of variability
• RMSE: combines correlation, mean bias, and variance bias
• MAE: average error
Categorical scores

Hits:
• Hit score: % correct
• Hit skill: % correct, adjusted for guessing
• LEPS: adjusts for near-misses
• Gerrity: adjusts for near-misses

Discrimination:
• 2AFC: probability of successfully identifying the warmer / wetter category
• ROC: probability of successfully identifying an observation in the current category
Significance testing
Tools ~ Validation ~ Cross-validated ~ Bootstrap
Probabilistic Forecasts
Why do we issue forecasts probabilistically?
• We cannot be certain what is going to happen
• The probabilities try to give an indication of how confident we are that the specified outcome will occur.
Verification of probabilistic forecasts
Attributes diagrams: graphs showing reliability, resolution, and sharpness
ROC diagrams: graphs showing discrimination
Scores: a table of scores for probabilistic forecasts
Skill maps: maps of scores for probabilistic forecasts
Tendency diagram: graphs showing unconditional biases
Ranked hits diagram: graphs showing how frequently the observed category had the highest forecast probability
Weather roulette: graphs showing estimates of forecast value
What makes a “good” probabilistic forecast?
Reliability: the event occurs as frequently as implied by the forecast.
Sharpness: the forecasts frequently have probabilities that differ considerably from climatology.
Resolution: the outcome differs when the forecast differs.
Discrimination: the forecasts differ when the outcome differs.
Attributes diagrams
The histograms show the sharpness.
The vertical and horizontal lines show the observed climatology and indicate the forecast bias.
The diagonal lines show reliability and “skill”.
The coloured line shows the reliability and resolution of the forecasts.
The dashed line shows a smoothed fit.
Probabilistic scores
Scores per category
Brier score: mean squared error in probability (assuming that the probability should be 100% if the category occurs and 0% if it does not occur)
Brier skill score: % improvement over Brier score using climatology forecasts (often pessimistic because of strict requirement for reliability)
ROC area: probability of successfully discriminating the category (i.e., how frequently the forecast probability for that category is higher when it occurs than when it does not occur)
Resolution slope: % increase in frequency for each 1% increase in forecast probability
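The Brier score and skill score defined above can be sketched directly. An illustrative pure-Python example with made-up probabilities, using constant tercile climatology (1/3) as the reference:

```python
def brier_score(prob_forecasts, outcomes):
    """Mean squared error of probabilities; outcomes are 1 if the category
    occurred and 0 if it did not. Probabilities are fractions in [0, 1]."""
    n = len(prob_forecasts)
    return sum((p - o) ** 2 for p, o in zip(prob_forecasts, outcomes)) / n

def brier_skill_score(bs_forecast, bs_reference):
    """Fractional improvement of the forecasts over the reference."""
    return 1 - bs_forecast / bs_reference

probs    = [0.8, 0.4, 0.7, 0.2]   # forecast probabilities for one category
occurred = [1, 0, 1, 0]           # whether that category occurred
bs = brier_score(probs, occurred)
bs_clim = brier_score([1 / 3] * 4, occurred)  # constant tercile climatology
bss = brier_skill_score(bs, bs_clim)          # positive: beats climatology
```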
Probabilistic scores
Overall scores
Ranked prob score: mean squared error in cumulative probabilities
RPSS: % improvement over RPS using climatology forecasts (often pessimistic because of strict requirement for reliability)
2AFC score: probability of successfully discriminating the wetter or warmer category
Resolution slope: % increase in frequency for each 1% increase in forecast probability
Effective interest: % return given fair odds
Linear prob score: average probability on the category that occurs
Hit score (rank n): how often the category with the nth highest probability occurs
Verification of Probabilistic Categorical Forecasts: The Ranked Probability Skill Score (RPSS)
Epstein (1969), J. Appl. Meteor.
RPSS measures cumulative squared error between categorical forecast probabilities and the observed categorical probabilities relative to a reference (or standard baseline) forecast. The observed categorical probabilities are 100% in the observed category, and 0% in all other categories.
\mathrm{RPS} = \sum_{cat=1}^{N_{cat}} \left( P^{cum}_{F(cat)} - P^{cum}_{O(cat)} \right)^2

where N_{cat} = 3 for tercile forecasts. The “cum” implies that the summation is done over cumulative probabilities: cat 1, then cats 1 and 2, then cats 1, 2 and 3.

The higher the RPS, the poorer the forecast. RPS = 0 means that a probability of 100% was given to the category that was observed. The RPSS compares the RPS for the forecast with the RPS for a reference forecast, such as one that gives climatological probabilities:

\mathrm{RPSS} = 1 - \frac{\mathrm{RPS}_{forecast}}{\mathrm{RPS}_{reference}}

RPSS > 0 when the RPS for the actual forecast is smaller than the RPS for the reference forecast.
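The cumulative summation in the RPS definition can be sketched as follows; an illustrative pure-Python example for a single tercile forecast, with climatological probabilities as the reference:

```python
def rps(forecast_probs, observed_category):
    """Ranked probability score over cumulative probabilities.

    forecast_probs: probabilities per category (e.g. 3 terciles, summing to 1).
    observed_category: 0-based index of the category that occurred.
    """
    ncat = len(forecast_probs)
    obs_probs = [1.0 if k == observed_category else 0.0 for k in range(ncat)]
    score, cum_f, cum_o = 0.0, 0.0, 0.0
    for k in range(ncat):
        cum_f += forecast_probs[k]   # cat 1, then 1+2, then 1+2+3, ...
        cum_o += obs_probs[k]
        score += (cum_f - cum_o) ** 2
    return score

def rpss(forecast_rps, reference_rps):
    """RPSS = 1 - RPS_forecast / RPS_reference."""
    return 1 - forecast_rps / reference_rps

fcst = [0.5, 0.3, 0.2]           # below / normal / above
clim = [1 / 3, 1 / 3, 1 / 3]     # climatological reference
obs_cat = 0                      # "below" was observed
skill = rpss(rps(fcst, obs_cat), rps(clim, obs_cat))  # positive skill here
```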
What is “skill”?