
Chapter 4 Prediction and Bayesian Inference

• 4.1 Estimators versus predictors
• 4.2 Prediction for one-way ANOVA models
  – Shrinkage estimation, types of predictions
• 4.3 Best linear unbiased predictors (BLUPs)
• 4.4 Mixed model predictors
• 4.5 Bayesian inference
• 4.6 Case study: Forecasting lottery sales
• 4.7 Credibility theory
• Appendix 4A Linear unbiased predictors

Upload: jared-white

Post on 28-Dec-2015


Page 1: Chapter 4 Prediction and Bayesian Inference 4.1 Estimators versus predictors 4.2 Prediction for one-way ANOVA models –Shrinkage estimation, types of predictions

Chapter 4: Prediction and Bayesian Inference

• 4.1 Estimators versus predictors
• 4.2 Prediction for one-way ANOVA models
  – Shrinkage estimation, types of predictions
• 4.3 Best linear unbiased predictors (BLUPs)
• 4.4 Mixed model predictors
• 4.5 Bayesian inference
• 4.6 Case study: Forecasting lottery sales
• 4.7 Credibility theory
• Appendix 4A Linear unbiased predictors

Page 2:

4.1 Estimators versus predictors

• In the longitudinal data model, y_it = z_it′ α_i + x_it′ β + ε_it, the variables {α_i} describe subject-specific effects.
• Given the data {y_it, z_it, x_it}, in some problems it is of interest to "summarize" subject effects.
  – We have discussed how to estimate the fixed, unknown parameters β.
  – It is also of interest to summarize subject-specific effects, such as those described by the random variable α_i.
• Predictors are "estimators" of random variables.
  – Like estimators, predictors are said to be linear if they are formed from a linear combination of the responses y.

Page 3:

Applications of prediction

• In animal and plant breeding, one wishes to predict the production of milk for cows based on (1) their lineage (random) and (2) their herds (fixed).
• In credibility theory, one wishes to predict expected claims for a policyholder given exposure to several risk factors.
• In sample surveys, one wishes to predict the size of a specific age-sex-race cohort within a small geographical area (known as "small area estimation").
• In a survey article, Robinson (1991) also cites (1) ore reserve estimation in geological surveys, (2) measuring the quality of a production plan, and (3) ranking baseball players' abilities.

Page 4:

4.2 Prediction for one-way ANOVA models

• Consider the traditional one-way random effects ANOVA (analysis of variance) model:

  y_it = μ + α_i + ε_it

  – Suppose that we wish to summarize the subject-specific conditional mean, μ + α_i.
• For contrast, first consider using the fixed effects model, in which the α_i are fixed parameters.
  – Here, we have that the subject mean ȳ_i is the "best" (Gauss-Markov) estimator of μ + α_i.
  – This estimator is unbiased, that is, E ȳ_i = μ + α_i.
  – This estimator has minimum variance among all linear unbiased estimators (it is the BLUE).

Page 5:

Shrinkage estimator

• Using the one-way random effects model:
  – Consider an "estimator" of μ + α_i that is a linear combination of ȳ_i and ȳ, that is, c_1 ȳ_i + c_2 ȳ, for constants c_1 and c_2.
• Calculations show that the best values of c_1 and c_2, those that minimize E (c_1 ȳ_i + c_2 ȳ − (μ + α_i))², satisfy c_2 = 1 − c_1; the exact optimal c_1 depends on n, T_i, N = Σ_j T_j, and the variance components.
• For large n, the optimal c_1 reduces to ζ_i, and we have the shrinkage estimator, or predictor, of μ + α_i:

  ȳ_i,s = ζ_i ȳ_i + (1 − ζ_i) ȳ,   where   ζ_i = T_i σ_α² / (T_i σ_α² + σ_ε²).

Page 6:

Example of shrinkage estimator

Hypothetical Run Times for Three Machines

  Machine   Run Times        Average Run Time
  1         14, 12, 10, 12   ȳ_1 = 12
  2         9, 16, 15, 12    ȳ_2 = 13
  3         8, 10, 7, 7      ȳ_3 = 8

• Notation: y_ij means the jth run from the ith machine.
  – For example, y_21 = 9 and y_23 = 15.
• Are there real differences among machines?
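The shrinkage calculation for this example can be sketched in Python with numpy. The ANOVA moment estimators of σ_ε² and σ_α² used below are an assumption, since the slides do not state how the variance components were estimated:

```python
import numpy as np

# Run times for the three machines from the slide's example.
runs = np.array([[14, 12, 10, 12],
                 [9, 16, 15, 12],
                 [8, 10, 7, 7]], dtype=float)
n, T = runs.shape                     # n = 3 machines, T = 4 runs each

ybar_i = runs.mean(axis=1)            # subject means: 12, 13, 8
ybar = runs.mean()                    # grand mean: 11

# One-way ANOVA moment estimators of the variance components (an assumed choice).
sse_within = ((runs - ybar_i[:, None]) ** 2).sum()
sigma2 = sse_within / (n * (T - 1))               # within-machine variance
msb = T * ((ybar_i - ybar) ** 2).sum() / (n - 1)  # between-machine mean square
sigma2_alpha = (msb - sigma2) / T                 # variance of machine effects

# Shrinkage (credibility) weight and the shrinkage estimators.
zeta = T * sigma2_alpha / (T * sigma2_alpha + sigma2)
shrunk = zeta * ybar_i + (1 - zeta) * ybar

print(round(zeta, 3))                 # about 0.825
print(np.round(shrunk, 3))            # about [11.825, 12.651, 8.524]
```

With these estimates ζ ≈ 0.825, which reproduces the values 11.825, 12.650, 8.525 quoted on the following slide up to rounding of ζ.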

Page 7:

Example - Continued

• To see the "shrinkage" effect, compare the subject-specific means with the shrinkage estimators:

  Subject-specific means:   ȳ_3 = 8    ȳ_1 = 12    ȳ_2 = 13    (grand mean ȳ = 11)
  Shrinkage estimators:     8.525      11.825      12.650

• Figure 4.1 Comparison of Subject-Specific Means to Shrinkage Estimators. [The figure plots both sets of values on a number line; each shrinkage estimator lies between its subject-specific mean and the grand mean.]

Page 8:

More on shrinkage estimators

• Under the random effects model, ȳ_i is an unbiased predictor of μ + α_i in the sense that E ( ȳ_i − (μ + α_i) ) = 0.
  – However, ȳ_i is inefficient in the sense that ȳ_i,s has a smaller mean square error than ȳ_i.
  – Here, ȳ_i has been "shrunk" towards the stable estimator ȳ.
  – The "estimator" ȳ_i,s is said to "borrow strength" from the stable estimator ȳ.
• Recall ζ_i = T_i σ_α² / (T_i σ_α² + σ_ε²).
• Note that ζ_i → 1 as either (i) T_i → ∞ or (ii) σ_ε²/σ_α² → 0.

Page 9:

Best predictors

• From Section 3.1, it is easy to check that the generalized least squares estimator of μ is

  m_α,GLS = ( Σ_i ζ_i )⁻¹ Σ_i ζ_i ȳ_i ,

  a credibility-weighted average of the subject means, where ζ_i = T_i σ_α² / (T_i σ_α² + σ_ε²).
• The linear predictor of μ + α_i that has minimum variance is

  ȳ_i,BLUP = ζ_i ȳ_i + (1 − ζ_i) m_α,GLS .

  – Here, the acronym BLUP stands for best linear unbiased predictor.
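For unbalanced data, m_α,GLS weights each subject mean by its credibility factor ζ_i. A minimal sketch, using hypothetical data and assumed variance components (none of these numbers come from the slides):

```python
import numpy as np

# A hypothetical unbalanced panel: subjects observed for different T_i.
y = [np.array([14., 12., 10., 12.]),   # T_1 = 4
     np.array([9., 16., 15.]),         # T_2 = 3
     np.array([8., 10.])]              # T_3 = 2
sigma2, sigma2_alpha = 4.9, 5.8        # assumed variance components

# zeta_i = T_i sigma_alpha^2 / (T_i sigma_alpha^2 + sigma_eps^2)
zeta = np.array([len(yi) * sigma2_alpha / (len(yi) * sigma2_alpha + sigma2)
                 for yi in y])
ybar_i = np.array([yi.mean() for yi in y])

# GLS estimator of mu: a zeta-weighted average of the subject means.
m_gls = (zeta * ybar_i).sum() / zeta.sum()

# BLUP of mu + alpha_i shrinks each subject mean toward m_gls.
blup = zeta * ybar_i + (1 - zeta) * m_gls
```

Subjects observed longer (larger T_i) get larger ζ_i, so their BLUPs stay closer to their own means and they receive more weight in m_α,GLS.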

Page 10:

Types of Predictors

• We have now introduced the BLUP of μ + α_i. This quantity is a linear combination of global parameters and subject-specific effects.
• Two other types of predictors are of interest.
  – Residuals. Here, we wish to "predict" ε_it. The BLUP residual turns out to be e_it,BLUP = y_it − ȳ_i,BLUP.
  – Forecasts. Here, we wish to predict, for L lead time units into the future, y_i,Ti+L = μ + α_i + ε_i,Ti+L.
  – Without serial correlation, the forecast is the same as the predictor of μ + α_i. However, we will see that the mean square error turns out to be larger.

Page 11:

4.3 Best linear unbiased predictors

• This section develops best linear unbiased predictors in the context of mixed linear models, then specializes the consideration to longitudinal data mixed models.
• BLUPs are developed by examining the minimum mean square error predictor of a random variable, w.
  – We give a development due to Harville (1976).
  – The argument is originally due to Goldberger (1962), who coined the phrase best linear unbiased predictor.
  – The acronym was first used by Henderson (1973).
• BLUPs can also be developed as conditional expectations using multivariate normality.
• BLUPs can also be developed in a Bayesian context.

Page 12:

Mixed linear models

• Suppose that we observe an N × 1 random vector y with mean E y = X β and variance Var y = V.
  – We wish to predict a random variable w that has mean E w = λ′ β and Var w = σ_w².
  – Denote the covariance between w and y as Cov(w, y′) = cov_wy.
• Assuming known regression parameters (β), the best linear (in y) predictor of w is

  w* = E w + cov_wy V⁻¹ (y − E y) = λ′ β + cov_wy V⁻¹ (y − X β).

  – If w, y are multivariate normal, then w* equals E (w | y) and hence is a minimum mean square predictor of w.
  – The predictor w* is also a minimum mean square linear predictor of w without the assumption of normality. See Appendix 4A.1.

Page 13:

BLUPs as predictors

• To develop the BLUP,
  – define b_GLS = (X′ V⁻¹ X)⁻¹ X′ V⁻¹ y to be the generalized least squares (GLS) estimator of β.
  – This is the best linear unbiased estimator (BLUE) of β.
  – Replace β by b_GLS in the definition of w* to get the BLUP

    w_BLUP = λ′ b_GLS + cov_wy V⁻¹ (y − X b_GLS)
           = (λ′ − cov_wy V⁻¹ X) b_GLS + cov_wy V⁻¹ y.

  – See Appendix 4A.2 for a check, establishing w_BLUP as the best linear unbiased predictor of w.
• From Appendix 4A.3, we also have the form for the minimum mean square error:

  Var (w_BLUP − w) = (λ′ − cov_wy V⁻¹ X) (X′ V⁻¹ X)⁻¹ (λ′ − cov_wy V⁻¹ X)′ − cov_wy V⁻¹ cov_wy′ + σ_w².
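These formulas can be checked numerically. The following sketch simulates a small mixed linear model (all dimensions and parameter values are arbitrary illustrative choices) and verifies that the two displayed forms of w_BLUP agree:

```python
import numpy as np

rng = np.random.default_rng(0)

# A small mixed linear model: y = X beta + Z alpha + eps, predicting w = alpha_1.
N, p, q = 12, 2, 3
X = rng.normal(size=(N, p))
Z = np.kron(np.eye(q), np.ones((N // q, 1)))   # 3 subjects, 4 observations each
D = 0.8 * np.eye(q)
R = 0.5 * np.eye(N)
V = Z @ D @ Z.T + R

beta = np.array([1.0, -2.0])
alpha = rng.normal(scale=np.sqrt(0.8), size=q)
y = X @ beta + Z @ alpha + rng.normal(scale=np.sqrt(0.5), size=N)

# GLS estimator b_GLS = (X' V^{-1} X)^{-1} X' V^{-1} y.
Vinv = np.linalg.inv(V)
b_gls = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)

# BLUP of w = alpha_1: here lambda = 0 and Cov(w, y') is the first row of D Z'.
cov_wy = (D @ Z.T)[0]
w_blup = cov_wy @ Vinv @ (y - X @ b_gls)

# Equivalent second form: (lambda' - cov_wy V^{-1} X) b_GLS + cov_wy V^{-1} y.
w_blup_alt = (0 - cov_wy @ Vinv @ X) @ b_gls + cov_wy @ Vinv @ y
```

The two forms agree to machine precision, which is exactly the algebraic identity displayed above with λ = 0.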

Page 14:

Example: One-way model

• Recall y_it = μ + α_i + ε_it.
  – Thus, y_i = 1_i (μ + α_i) + ε_i, so that X_i = 1_i and V_i = σ_ε² I_i + σ_α² J_i (with I_i the T_i × T_i identity matrix and J_i the T_i × T_i matrix of ones).
  – With this, we note that

    V_i⁻¹ (y_i − X_i b_GLS) = (1/σ_ε²) ( y_i − 1_i ( ζ_i ȳ_i + (1 − ζ_i) m_α,GLS ) ),

    where b_GLS = m_α,GLS = ( Σ_i ζ_i )⁻¹ Σ_i ζ_i ȳ_i.
  – Thus, for predicting w = μ + α_i, we have λ = 1 and Cov(w, y_i′) = σ_α² 1_i′ for the ith subject, 0 otherwise. Thus,

    w_BLUP = m_α,GLS + Cov(w, y_i′) V_i⁻¹ (y_i − X_i b_GLS)
           = m_α,GLS + (σ_α²/σ_ε²) 1_i′ ( y_i − 1_i ( ζ_i ȳ_i + (1 − ζ_i) m_α,GLS ) )
           = ζ_i ȳ_i + (1 − ζ_i) m_α,GLS = ȳ_i,BLUP .

Page 15:

Random effects ANOVA model

• For predicting the residual w = ε_it, we have λ = 0 and Cov(w, y_i′) = σ_ε² 1_it′ for the ith subject, tth time period, 0 otherwise.
• Let 1_it be a T_i × 1 vector with a 1 in the tth position, 0 otherwise. Thus,

  w_BLUP = σ_ε² 1_it′ V_i⁻¹ (y_i − X_i b_GLS) = y_it − ȳ_i,BLUP

  is our BLUP residual.

Page 16:

4.4 Mixed model predictors

• Recall the longitudinal data mixed model

  y_i = Z_i α_i + X_i β + ε_i .

• As described in Section 3.3, this is a special case of the mixed linear model. We use

  V = block diagonal (V_1, ..., V_n), where V_i = Z_i D Z_i′ + R_i ,
  X = (X_1′, ..., X_n′)′.

• For BLUP calculations, note that cov_wy = ( Cov(w, y_1′), …, Cov(w, y_n′) ).

Page 17:

Longitudinal data mixed model BLUP

• Recall that the random variable w has mean E w = λ′ β and Var w = σ_w².
• The GLS estimator is b_GLS = ( Σ_i X_i′ V_i⁻¹ X_i )⁻¹ Σ_i X_i′ V_i⁻¹ y_i, and the BLUP is

  w_BLUP = λ′ b_GLS + Σ_i Cov(w, y_i′) V_i⁻¹ (y_i − X_i b_GLS).

• The mean square error is

  Var (w_BLUP − w) = ( λ′ − Σ_i Cov(w, y_i′) V_i⁻¹ X_i ) ( Σ_i X_i′ V_i⁻¹ X_i )⁻¹ ( λ′ − Σ_i Cov(w, y_i′) V_i⁻¹ X_i )′
                     − Σ_i Cov(w, y_i′) V_i⁻¹ Cov(w, y_i′)′ + σ_w².

Page 18:

BLUP special cases

• Global parameters and subject-specific effects.
  – Suppose that the interest is in predicting linear combinations of the global parameters β and the subject-specific effect α_i.
  – Consider linear combinations of the form w = c_1′ α_i + c_2′ β.
• Residuals. Here, w = ε_it.
• Forecasts. Suppose that the ith subject is included in the data set; predict

  y_i,Ti+L = z_i,Ti+L′ α_i + x_i,Ti+L′ β + ε_i,Ti+L

  for L lead time units in the future.

Page 19:

Predicting global parameters and subject-specific effects

• Consider linear combinations of the form w = c_1′ α_i + c_2′ β.
• Straightforward calculations show that
  – E w = c_2′ β, so that λ = c_2,
  – Cov (w, y_j′) = c_1′ D Z_i′ for j = i,
  – Cov (w, y_j′) = 0 for j ≠ i.
• Thus, w_BLUP = c_2′ b_GLS + c_1′ D Z_i′ V_i⁻¹ (y_i − X_i b_GLS).

Page 20:

Special case 1

• Take c_2 = 0. Because the mean and covariance expressions are true for all vectors c_1, we may write this in vector notation to get the BLUP of α_i, the vector

  a_i,BLUP = D Z_i′ V_i⁻¹ (y_i − X_i b_GLS).

• This is unbiased in the sense that E (a_i,BLUP − α_i) = 0.
• This predictor has minimum variance among all linear unbiased predictors (BLUP).
• In the case of the error components model (z_it = 1), this reduces to a_i,BLUP = ζ_i ( ȳ_i − x̄_i′ b_GLS ).
• For comparison, recall the fixed effects parameter estimate, a_i = ȳ_i − x̄_i′ b.
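The error components reduction can be verified numerically. In the sketch below the data, the b_GLS value, and the variance components are arbitrary illustrations; only the algebraic identity matters:

```python
import numpy as np

rng = np.random.default_rng(1)

# One subject of an error components model (z_it = 1): y_i = X_i beta + alpha_i 1 + eps_i.
T, p = 5, 2
sigma2, sigma2_alpha = 0.5, 0.8
Xi = rng.normal(size=(T, p))
b_gls = np.array([1.0, -2.0])          # a pretend GLS estimate, for illustration
yi = Xi @ b_gls + 0.7 + rng.normal(scale=np.sqrt(sigma2), size=T)

Vi = sigma2_alpha * np.ones((T, T)) + sigma2 * np.eye(T)

# Matrix form: a_i = D Z_i' V_i^{-1} (y_i - X_i b_GLS), with Z_i = 1_i, D = sigma_alpha^2.
a_matrix = sigma2_alpha * np.ones(T) @ np.linalg.inv(Vi) @ (yi - Xi @ b_gls)

# Reduced form: a_i = zeta_i (ybar_i - xbar_i' b_GLS).
zeta = T * sigma2_alpha / (T * sigma2_alpha + sigma2)
a_reduced = zeta * (yi.mean() - Xi.mean(axis=0) @ b_gls)
```

The two computations agree, because 1_i′ V_i⁻¹ collapses to a multiple of 1_i′ when V_i = σ_ε² I_i + σ_α² J_i.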

Page 21:

Motivating BLUPs

• We can also motivate BLUPs using normal theory:
  – Consider the case where α_i and ε are multivariate normally distributed.
  – Then, it can be shown that E (α_i | y_i) = D Z_i′ V_i⁻¹ (y_i − X_i β).
  – To motivate this, consider asking the question: what realization of α_i could be associated with y_i? The expectation!
  – The BLUP is the BLUE of E (α_i | y_i). (That is, replace β by b_GLS.)

Page 22:

Special case 2

• As another example, it is of interest to predict

  w = E ( y_i,Ti+1 | α_i ) = z_i,Ti+1′ α_i + x_i,Ti+1′ β.

• Choose c_1 = z_i,Ti+1 and c_2 = x_i,Ti+1.
• This yields

  w_BLUP = z_i,Ti+1′ a_i,BLUP + x_i,Ti+1′ b_GLS.

• This predictor is of interest in actuarial science, where it is known as the credibility estimator.

Page 23:

BLUP Residuals

• Here, w = ε_it. Because E w = 0, it follows that λ = 0.
• Straightforward calculations show that
  – Cov (w, y_j′) = σ_ε² 1_it′ for j = i and
  – Cov (w, y_j′) = 0 for j ≠ i.
  – Here, the symbol 1_it denotes a T_i × 1 vector that has a "one" in the tth position and is zero otherwise.
• Thus

  e_it,BLUP = σ_ε² 1_it′ V_i⁻¹ (y_i − X_i b_GLS).

• This can also be expressed as e_it,BLUP = y_it − ( z_it′ a_i,BLUP + x_it′ b_GLS ).
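The equivalence of the two expressions for the BLUP residual can be confirmed numerically when R_i = σ_ε² I_i. A sketch with arbitrary illustrative values (the design matrices, D, and b_GLS below are all hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)

# One subject with a random intercept and slope (q = 2) and R_i = sigma^2 I.
T, p, q = 6, 2, 2
sigma2 = 0.4
Zi = np.column_stack([np.ones(T), np.arange(T, dtype=float)])
Xi = rng.normal(size=(T, p))
D = np.array([[0.9, 0.2], [0.2, 0.5]])
Vi = Zi @ D @ Zi.T + sigma2 * np.eye(T)

b_gls = np.array([0.5, 1.5])           # a pretend GLS estimate, for illustration
yi = rng.normal(size=T)                # any response vector works for the identity

Vinv = np.linalg.inv(Vi)
resid = yi - Xi @ b_gls
a_blup = D @ Zi.T @ Vinv @ resid       # BLUP of alpha_i

t = 3                                  # pick the t-th observation
e_direct = sigma2 * Vinv[t] @ resid    # sigma^2 1_it' V_i^{-1} (y_i - X_i b_GLS)
e_decomp = yi[t] - Zi[t] @ a_blup - Xi[t] @ b_gls
```

The agreement follows from σ_ε² V_i⁻¹ = I − Z_i D Z_i′ V_i⁻¹, which holds whenever V_i = Z_i D Z_i′ + σ_ε² I.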

Page 24:

Predicting future observations

• Suppose that the ith subject is included in the data set; predict

  y_i,Ti+L = z_i,Ti+L′ α_i + x_i,Ti+L′ β + ε_i,Ti+L

  for L lead time units in the future.
• We will assume that z_i,Ti+L and x_i,Ti+L are known.
• It follows that λ = x_i,Ti+L.
• Straightforward calculations show that

  Cov (w, y_j′) = z_i,Ti+L′ D Z_i′ + Cov (ε_i,Ti+L, ε_i′)   for j = i,   and 0 for j ≠ i.

• Thus, the forecast of y_i,Ti+L is

  ŷ_i,Ti+L = x_i,Ti+L′ b_GLS + z_i,Ti+L′ a_i,BLUP + Cov (ε_i,Ti+L, ε_i′) R_i⁻¹ e_i,BLUP.

• Thus, the forecast is the estimate of the conditional mean plus the serial correlation correction factor Cov (ε_i,Ti+L, ε_i′) R_i⁻¹ e_i,BLUP.

Page 25:

Predicting future observations

• To illustrate, consider the special case where we have autoregressive of order 1 (AR(1)), serially correlated errors.
• Thus, we have

  R = ( σ² / (1 − ρ²) ) ×

    ( 1         ρ         ρ²        ⋯   ρ^(T−1) )
    ( ρ         1         ρ         ⋯   ρ^(T−2) )
    ( ρ²        ρ         1         ⋯   ρ^(T−3) )
    ( ⋮         ⋮         ⋮         ⋱   ⋮       )
    ( ρ^(T−1)   ρ^(T−2)   ρ^(T−3)   ⋯   1       )

  and

  R⁻¹ = (1 / σ²) ×

    ( 1    −ρ      0      ⋯   0      0  )
    ( −ρ   1+ρ²   −ρ      ⋯   0      0  )
    ( 0    −ρ     1+ρ²    ⋯   0      0  )
    ( ⋮    ⋮      ⋮       ⋱   ⋮      ⋮  )
    ( 0    0      0       ⋯   1+ρ²   −ρ )
    ( 0    0      0       ⋯   −ρ     1  )

• After some algebra, the L-step forecast is

  ŷ_i,Ti+L = x_i,Ti+L′ b_GLS + z_i,Ti+L′ a_i,BLUP + ρ^L e_i,Ti,BLUP.
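The "after some algebra" step can be confirmed numerically: with AR(1) errors, the serial correlation correction weights Cov(ε_i,Ti+L, ε_i′) R_i⁻¹ put weight ρ^L on the last residual and zero weight elsewhere. A sketch with arbitrary T, ρ, and σ²:

```python
import numpy as np

T, rho, sigma2, L = 8, 0.6, 1.0, 3

# Stationary AR(1) covariance: Cov(eps_t, eps_s) = sigma^2 rho^{|t-s|} / (1 - rho^2).
idx = np.arange(T)
R = sigma2 * rho ** np.abs(idx[:, None] - idx[None, :]) / (1 - rho ** 2)

# Covariance between the future error eps_{T+L} and the T observed errors.
cov_future = sigma2 * rho ** (T + L - 1 - idx) / (1 - rho ** 2)

# The serial-correlation correction weights: Cov(eps_{T+L}, eps') R^{-1}.
weights = cov_future @ np.linalg.inv(R)

# Every weight is zero except the last, which equals rho**L.
print(np.round(weights, 6))
```

The reason is that the future covariance vector is exactly ρ^L times the last row of R, so multiplying by R⁻¹ returns ρ^L times the last unit vector; only the most recent BLUP residual enters the forecast.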

Page 26:

4.5 Bayesian Inference

• With Bayesian statistical models, one views both the model parameters and the data as random variables.
  – We assume distributions for each type of random variable.
• Given the parameters α and β, the response model is

  y = Z α + X β + ε.

  – Specifically, we assume that the responses y conditional on α and β are normally distributed and that E (y | α, β) = Z α + X β and Var (y | α, β) = R.
• Assume that α is distributed normally with mean μ_α and variance D and that β is distributed normally with mean μ_β and variance Σ_β, each independent of the other.

Page 27:

Distributions

• The joint distribution of (α′, β′)′ is known as the prior distribution.
• To summarize, the joint distribution of (α′, β′, y′)′ is

  ( α )        ( ( μ_α           )   ( D      0       D Z′          ) )
  ( β )  ~  N ( ( μ_β           ) , ( 0      Σ_β     Σ_β X′        ) )
  ( y )        ( ( Z μ_α + X μ_β )   ( Z D    X Σ_β   V + X Σ_β X′ ) )

  where V = R + Z D Z′.

Page 28:

Posterior Distribution

• The distribution of the parameters given the data is known as the posterior distribution.
• The posterior distribution of (α′, β′)′ given y is normal.
• The conditional moments are

  E (α | y) = μ_α + D Z′ (V + X Σ_β X′)⁻¹ (y − Z μ_α − X μ_β),
  E (β | y) = μ_β + Σ_β X′ (V + X Σ_β X′)⁻¹ (y − Z μ_α − X μ_β),

  and

  Var ( (α′, β′)′ | y ) = ( D    0   )  −  ( D Z′   ) (V + X Σ_β X′)⁻¹ ( Z D   X Σ_β ).
                          ( 0    Σ_β )     ( Σ_β X′ )

Page 29:

Relation with BLUPs

• In longitudinal data applications, one typically has more information about the global parameters β than about the subject-specific parameters α.
• Consider first the case Σ_β = 0, so that β = μ_β with probability one.
  – Intuitively, this means that β is precisely known, generally from collateral information.
  – Assuming that μ_α = 0, it is easy to check that the best linear unbiased estimator (BLUE) of E (α | y) is

    a_BLUP = D Z′ V⁻¹ (y − X b_GLS).

  – Recall from equation (4.11) that a_BLUP is also the best linear unbiased predictor in the frequentist (non-Bayesian) model framework.

Page 30:

Relation with BLUPs

• Consider second the case where Σ_β⁻¹ = 0.
  – In this case, prior information about the parameter β is vague; this is known as using a diffuse prior.
  – Assuming μ_α = 0, one can show that E (α | y) = a_BLUP.
• It is interesting that in both extreme cases, we arrive at the statistic a_BLUP as a predictor of α.
  – This analysis assumes D and R are matrices of fixed parameters.
  – It is also possible to assume distributions for these parameters; typically, independent Wishart distributions are used for D⁻¹ and R⁻¹, as these are conjugate priors.
  – The general strategy of substituting point estimates for certain parameters in a posterior distribution is called empirical Bayes estimation.

Page 31:

Example – One-way random effects ANOVA model

• The posterior means turn out to be

  α̂_i = E (α_i | y) = ζ ( ȳ_i − ( δ ȳ + (1 − δ) μ_β ) )

  and

  β̂ = E (β | y) = δ ȳ + (1 − δ) μ_β ,

  where

  ζ = T σ_α² / (T σ_α² + σ_ε²)   and   δ = nT σ_β² / ( nT σ_β² + T σ_α² + σ_ε² ).

• Combining these, the predictor of the conditional mean β + α_i is

  β̂ + α̂_i = ζ ȳ_i + (1 − ζ) δ ȳ + (1 − ζ)(1 − δ) μ_β .

• Note that δ measures the precision of knowledge about β. Specifically, we see that δ approaches one as σ_β² → ∞, and δ approaches zero as σ_β² → 0.
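The posterior means for the one-way model can be explored numerically. The sketch below reuses the machine run-time data from Section 4.2 together with an assumed prior mean μ_β = 10 (a hypothetical choice) and the ANOVA variance estimates; with a diffuse prior (large σ_β²), the posterior mean recovers the BLUP:

```python
import numpy as np

# Balanced one-way data (the machine run-time example) and assumed variances.
runs = np.array([[14., 12., 10., 12.],
                 [9., 16., 15., 12.],
                 [8., 10., 7., 7.]])
n, T = runs.shape
sigma2, sigma2_alpha = 44 / 9, 52 / 9          # ANOVA estimates for this data
mu_beta = 10.0                                  # assumed prior mean for beta

ybar_i, ybar = runs.mean(axis=1), runs.mean()
zeta = T * sigma2_alpha / (T * sigma2_alpha + sigma2)

def posterior_mean(sigma2_beta):
    """Posterior mean of beta + alpha_i: zeta*ybar_i + (1-zeta)(delta*ybar + (1-delta)*mu_beta)."""
    delta = n * T * sigma2_beta / (n * T * sigma2_beta + T * sigma2_alpha + sigma2)
    return zeta * ybar_i + (1 - zeta) * (delta * ybar + (1 - delta) * mu_beta)

# With a diffuse prior (sigma2_beta large), delta -> 1 and the BLUP is recovered.
blup = zeta * ybar_i + (1 - zeta) * ybar
print(np.round(posterior_mean(1e9), 3))
```

At the other extreme (σ_β² → 0, so δ → 0), the subject means are shrunk toward the prior mean μ_β instead of toward ȳ.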

Page 32:

4.6 Wisconsin Lottery Sales

• T = 40 weeks of sales from n = 50 zip codes

Page 33:

Lottery Sales Data Analysis

• Cross-sectional analysis shows that population size heavily influences sales, with Kenosha as an outlier.
• Multiple time series plots
  – show the effect of jackpots that is common to all postal codes,
  – show the heterogeneity among postal codes (reaffirmed by a pooling test),
  – show the heteroscedasticity that is accommodated through a logarithmic transformation.

Page 34:

Lottery Sales Model Selection

• In-sample results show that
  – one-way error components dominates pooled cross-sectional models,
  – an AR(1) error specification significantly improves the fit,
  – the best model is probably the two-way error components model with an AR(1) error specification (not yet documented).
• Out-of-sample analysis suggests that logarithmic sales is the preferred choice of response; it outperforms sales and percentage change.

Page 35:

4.7 What is Credibility?

• Hickman's (1975) Analogy
  – In politics, leaders begin with a reservoir of credibility which decreases as executive experience is compiled.
  – Insurance behaves in a reverse fashion!
  – Here, credibility increases as experience increases.

Page 36:

Credibility Theory

• Credibility is a technique for predicting future expected claims for a risk class, given past claims of that and related risk classes.
• Importance
  – Credibility is widely used for pricing property and casualty, workers' compensation, and health care coverages.
  – According to Rodermund (1989), "the concept of credibility has been the casualty actuaries' most important and enduring contribution to casualty actuarial science."

Page 37:

History

• Mowbray (1914 - PCAS)
  – Asked the question, "how extensive is an exposure necessary to give a dependable pure premium?"
  – This approach is now known as "limited fluctuation" or "American" credibility.
• Question 1 – do we have enough exposure to give full weight to the risk class under consideration?
• Question 2 – if not, how can we combine information from this and related risk classes?

Page 38:

More History

• Whitney (1918 - PCAS)
  – introduced the idea of using a weighted average of the average claims of (1) a given risk class and (2) all risk classes.
  – The weight Z is known as the credibility factor.
  – The updated premium is of the form

    New Premium = Z × Claims Experience + (1 – Z) × Old Premium.

Page 39:

Example - Balanced Bühlmann

• Consider the model

  y_it = μ + α_i + ε_it.

• The credibility factor is

  ζ = T σ_α² / (T σ_α² + σ_ε²).

• The traditional credibility estimator is

  w_BLUP = ζ ȳ_i + (1 − ζ) ȳ.

Page 40:

Example: Hypothetical Claims for Three Towns

  Town   Claims           Average Claim
  1      14, 12, 10, 12   ȳ_1 = 12
  2      9, 16, 15, 12    ȳ_2 = 13
  3      8, 10, 7, 7      ȳ_3 = 8

• Are there real differences among towns?
• Mowbray - does Town 3 have enough data to support its own estimator of pure premiums?
• Whitney - how can I use the information in Towns 1 and 2 to help determine my rate for Town 3?

Page 41:

Response to Whitney

• Known as the "shrinkage" effect.
• Comparison of Subject-Specific Means to Credibility Estimators:

  Subject-specific means:   ȳ_3 = 8    ȳ_1 = 12    ȳ_2 = 13    (grand mean ȳ = 11)
  Credibility estimators:   8.525      11.825      12.650

Page 42:

Why study credibility theory?

• Long history of applications – "a business necessity"
  – More recently, many theoretical advances with fewer innovative applications.
• Credibility techniques are required in legal statutes and standards of practice
  – Standard of Practice 25 by the Actuarial Standards Board of the American Academy of Actuaries.
  – Wisconsin statutes on credibility insurance and disability income.
• Advanced techniques are critical for keeping up with competition (health insurance – health economists).
• Innovative techniques enhance the "credibility" of the profession.