the small area estimation problem (contd) · the small area estimation problem (contd) ray chambers...

1 The small area estimation problem (contd) Ray Chambers University of Wollongong Thanks to Hukum Chandra (IIASR, Delhi) Nicola Salvati (Pisa) Nikos Tzavidis (Southampton)

Post on 10-Feb-2020

The small area estimation problem (contd)

Ray Chambers University of Wollongong

Thanks to

Hukum Chandra (IIASR, Delhi) Nicola Salvati (Pisa)

Nikos Tzavidis (Southampton)

The Small Area Estimation Problem

• Estimates derived using only the area-specific sample data are known as direct estimates, e.g.

$$\hat{\bar{y}}_i = \Big(\sum_{j \in s_i} w_j\Big)^{-1} \Big(\sum_{j \in s_i} w_j y_j\Big) = \bar{y}_{wi}$$

• An area i is regarded as small if the sample drawn from the area is not large enough to support direct estimates of adequate precision
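
The direct estimate above is a weighted mean over the sampled units of a single area. A minimal Python sketch, assuming the weights and values for one area are held in plain lists (the function name and the numbers are hypothetical, chosen only for illustration):

```python
from typing import Sequence


def direct_estimate(weights: Sequence[float], values: Sequence[float]) -> float:
    """Weighted direct estimate of an area mean, using that area's sample only."""
    if len(weights) != len(values) or len(weights) == 0:
        raise ValueError("weights and values must be non-empty and of equal length")
    total_weight = sum(weights)
    weighted_sum = sum(w * y for w, y in zip(weights, values))
    return weighted_sum / total_weight


# Hypothetical area with three sampled units
area_weights = [10.0, 20.0, 10.0]
area_values = [4.0, 6.0, 8.0]
estimate = direct_estimate(area_weights, area_values)
```

With only a handful of sampled units, as here, the estimate depends heavily on each individual value, which is the precision problem the slide describes.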

There are actually two small area estimation problems:

• How should one produce reliable small area estimates of the characteristics of interest?
• How should one assess the reliability of these estimates?

Direct estimation is typically not an option. Because of budgetary and other operational constraints, the overall sample size is rarely large enough to support direct estimation for all areas of interest

Furthermore, in many cases small areas are defined only after the survey has been carried out

First Steps in SAE: Synthetic Estimation of Small Area Means

Many small area estimation problems can be defined in terms of allocating a set of national estimates over a set of small areas or domains

• Aim is to estimate average values $\bar{y}_i$ of a survey variable Y for m areas that partition the population
• Sample sizes in each area are too small to produce direct estimates of sufficient accuracy for all areas
• Accurate national estimates $\hat{\bar{y}}_g$ of class means $\bar{y}_g$ are available for G classes that also partition the population (G << m)
• Know Census area × class population cross-classification $\{M_{ig}\}$
• Gonzalez and Hoza (1978); Steinberg (1979)

• Use these counts to apportion national estimates for classes to area level, then aggregate

$$\hat{\bar{y}}_i^{SYN} = \sum_{g=1}^{G} \left(\frac{M_{ig}}{M_{i\bullet}}\right) \hat{\bar{y}}_g$$

• Strong assumptions:

o Distribution of Y is essentially the same in every area, so $E(\bar{y}_{ig}) = E(\bar{y}_g)$

o Population distributions have remained essentially unchanged since the Census - i.e. $M_{ig}/M_{i\bullet} = N_{ig}/N_{i\bullet}$

• Generally biased (sometimes substantially) but with low variance
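
The synthetic estimator is a weighted average of the national class means, with weights given by the area's Census class shares. A minimal Python sketch for a single area, assuming the counts and class means are supplied as lists indexed by class (the function name and figures are hypothetical):

```python
from typing import Sequence


def synthetic_estimate(census_counts: Sequence[float],
                       class_means: Sequence[float]) -> float:
    """Synthetic estimate for one area: class means weighted by M_ig / M_i."""
    if len(census_counts) != len(class_means):
        raise ValueError("one census count is needed per class")
    area_total = sum(census_counts)  # M_i. = sum over classes of M_ig
    return sum((m / area_total) * y for m, y in zip(census_counts, class_means))


# Hypothetical example with G = 2 classes
national_class_means = [10.0, 20.0]   # national estimates for each class
census_counts_area = [300.0, 100.0]   # Census counts M_ig for this area
area_estimate = synthetic_estimate(census_counts_area, national_class_means)
```

No survey data from the area itself enter the calculation, which is why the variance is low and why the estimate is biased whenever the area's class means differ from the national ones.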

Modern Approaches to the Small Area Problem

• Use model-based estimation methods to ‘borrow strength’ from the survey information for other, similar, areas

• The idea is to use statistical models to link the variable of interest with supplementary contextual (auxiliary) information, e.g. census and administrative data, for the small areas. The auxiliary information is assumed to explain part of the between area variability

• We shall usually assume that even if survey data are only available for a limited number of individuals/units in a small area, we have auxiliary information for all individuals/units in that area

- allows use of unit-level models in SAE

Modern SAE Methods Based On Indirect Estimators

An indirect estimator for a small area is one that uses all the sample data in order to construct the estimate for that area

e.g. Indirect estimator for the mean of Y in area i

$$\hat{\bar{y}}_i = N_i^{-1} \Big(\sum_{j=1}^{n} w_{ij} y_j\Big)$$

Note: Weights $w_{ij}$ are area i specific - i.e. they are computed using the data from area i only

A Regression-Synthetic Approach

Assumes between area variability in Y can be explained entirely in terms of variability in values of a set of auxiliary variables X

Linear Model: $y_U = X_U\beta + e_U \Rightarrow y_s = X_s\beta + e_s$

Note: Same value of β in all small areas

$$\hat{\bar{y}}_i^{REG-SYN} = N_i^{-1} \Big(\sum_{j \in s_i} y_j + \sum_{j \in r_i} x_j^T \hat{\beta}\Big)$$

• $\hat{\beta}$ = full sample estimate
• Prediction estimator under linear model
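
The regression-synthetic estimator can be sketched in two steps: fit the regression on the full sample, then add observed values for the sampled units of area i to predictions for its non-sampled units. The Python below is a minimal sketch assuming a single covariate plus an intercept and ordinary least squares; function names and data are hypothetical.

```python
from typing import Sequence, Tuple


def fit_ols(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = intercept + slope * x on the full sample."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def regression_synthetic_mean(y_sampled: Sequence[float],
                              x_nonsampled: Sequence[float],
                              intercept: float,
                              slope: float) -> float:
    """Area mean: observed y for sampled units plus predictions for the rest."""
    area_size = len(y_sampled) + len(x_nonsampled)  # N_i
    predicted_total = sum(intercept + slope * x for x in x_nonsampled)
    return (sum(y_sampled) + predicted_total) / area_size


# Hypothetical full sample (all areas pooled) and one small area
intercept, slope = fit_ols([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
area_mean = regression_synthetic_mean([5.0], [1.0, 3.0, 4.0], intercept, slope)
```

Because the same fitted coefficients are used in every area, any area-specific departure from the regression line is ignored.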

Allowing for Area-Specific Effects

Domain specific variability typically remains even after accounting for the auxiliary information

• Use random area effects to account for between area variation beyond that explained by the covariate information

• An area effect indicates how different one area is from another after allowing for differences in their auxiliary variable distributions

• Estimating the effect for a particular area requires using data from all areas and not just the data from the particular area. This increases the effective sample size for that area and is known as borrowing strength

Small Area Estimation Based on Area-Specific Effects

[Figure: Separate fitted lines for different areas. Plot of y against x showing the overall fit, the fit for Area 1 and the fit for Area 2, with the overall mean and the Area 1 and Area 2 means of x and y marked on the axes.]

• Few 'dissimilar' small areas = fixed effects
• Many 'similar' small areas = random effects

Mixed (Multilevel) Models for Small Area Estimation

In general, a mixed or multilevel model has the following form

Y = Fixed Part + Area Effect + Residual

Fixed Part: Contribution of the auxiliary information to explaining between area variability

Area Effect: Area specific value that accounts for between area variability beyond that explained by the auxiliary information

Residual: Remaining within area individual/unit level variability

The Linear Mixed Model

$$y_U = X_U\beta + G_U u + e_U$$

• $G_U$ = N × mq matrix of known ‘contextual’ covariates
- $G_U = \mathrm{diag}(G_i)$

• u = random vector of area effects of dimension mq
- $u = (u_i)$
- uncorrelated area specific effects, so $\mathrm{Var}(u) = \mathrm{diag}\{\mathrm{Var}(u_i)\} = \mathrm{diag}\{\sigma_u^2\Gamma(\phi_1)\} = \sigma_u^2\Omega(\phi_1)$

• $e_U$ = random vector of individual/unit effects of dimension N
- $\mathrm{Var}(e_U) = \sigma_e^2 I_N$
- uncorrelated with u

• Special case of the general linear model

Best Linear Unbiased Prediction (BLUP) under the General Linear Model

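$$E(y_U \mid X_U) = X_U\beta = \begin{bmatrix} X_s \\ X_r \end{bmatrix}\beta$$

$$\mathrm{Var}(y_U \mid X_U) = \Sigma_U = \begin{bmatrix} \Sigma_{ss} & \Sigma_{sr} \\ \Sigma_{rs} & \Sigma_{rr} \end{bmatrix}$$

Basic Result (Royall, 1976) - BLUP of $\theta = a_U^T y_U$ is

$$\hat{\theta} = a_s^T y_s + a_r^T\left[X_r\hat{\beta} + \Sigma_{rs}\Sigma_{ss}^{-1}(y_s - X_s\hat{\beta})\right]$$

$$\hat{\beta} = (X_s^T\Sigma_{ss}^{-1}X_s)^{-1}(X_s^T\Sigma_{ss}^{-1}y_s)$$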
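
As a quick check on this result, consider a special case that is an assumption made here for illustration rather than something stated on the slide: sampled and non-sampled units are uncorrelated with common variance, so $\Sigma_{rs} = 0$ and $\Sigma_{ss} = \sigma^2 I_n$. The correction term then vanishes and the generalised least squares estimate reduces to ordinary least squares:

```latex
\begin{aligned}
\hat{\theta} &= a_s^T y_s + a_r^T X_r \hat{\beta}, \\
\hat{\beta}  &= (X_s^T X_s)^{-1} X_s^T y_s .
\end{aligned}
```

Taking $a_U$ to have value $N_i^{-1}$ for units in area i and zero elsewhere gives the regression-synthetic estimator of the area mean. The term $\Sigma_{rs}\Sigma_{ss}^{-1}(y_s - X_s\hat{\beta})$ is therefore what distinguishes the general BLUP: it carries information from sample residuals over to correlated non-sampled units.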

BLUP for the Linear Mixed Model

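$$y_U = X_U\beta + G_U u + e_U$$

Assumes parameters $\sigma_e^2$, $\phi_0 = \sigma_u^2/\sigma_e^2$ and $\phi_1$ determining distribution of random effects (variance components) are known

$$\mathrm{Var}(u) = \sigma_u^2\Omega(\phi_1) = \sigma_e^2\phi_0\Omega(\phi_1)$$

$$\mathrm{Var}(e_U) = \sigma_e^2 I_N$$

$$\mathrm{Var}(y_U) = \Sigma_U = \sigma_e^2\left(I_N + \phi_0 G_U\Omega(\phi_1)G_U^T\right)$$

$$\mathrm{Var}(y_s) = \Sigma_{ss} = \sigma_e^2\left(I_n + \phi_0 G_s\Omega(\phi_1)G_s^T\right)$$

$$\mathrm{Cov}(y_r, y_s) = \Sigma_{rs} = \mathrm{Cov}(G_r u, G_s u) = \sigma_e^2\phi_0 G_r\Omega(\phi_1)G_s^T$$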

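Application to SAE Based on a Linear Mixed Model

Let $a_i$ denote the population vector with value $N_i^{-1}$ for each population unit in area i and value zero everywhere else. The BLUP of $\bar{y}_i = a_i^T y_U$ is then

$$\hat{\bar{y}}_i^{BLUP} = a_{is}^T y_s + a_{ir}^T\left[X_r\hat{\beta} + G_r\hat{u}\right] = f_i\bar{y}_{is} + (1 - f_i)\left[\bar{x}_{ir}^T\hat{\beta} + \bar{g}_{ir}^T\hat{u}_i\right]$$

where $\hat{u} = (\hat{u}_i) = \Delta(y_s - X_s\hat{\beta})$

$$\Delta = \phi_0\Omega(\phi_1)G_s^T\left[I_n + \phi_0 G_s\Omega(\phi_1)G_s^T\right]^{-1}$$

$$\hat{\beta} = \left\{X_s^T\left[I_n + \phi_0 G_s\Omega(\phi_1)G_s^T\right]^{-1}X_s\right\}^{-1}X_s^T\left[I_n + \phi_0 G_s\Omega(\phi_1)G_s^T\right]^{-1}y_s$$

Note: subscript is (ir) = sample (non-sample) units in area i; $f_i = n_i/N_i$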
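
For intuition, the formulas above simplify considerably in the random intercept special case. The sketch below assumes (these are assumptions for illustration, not part of the slide) that each $G_i$ is a column of ones, $\Omega = I_m$, there is a single covariate, and $\hat{\beta}$ and the variance components are already available. Under those assumptions $\hat{u}_i = \gamma_i(\bar{y}_{is} - \bar{x}_{is}\hat{\beta})$ with $\gamma_i = n_i\phi_0/(1 + n_i\phi_0)$. All names and numbers are hypothetical.

```python
def random_intercept_blup(y_sample_mean: float,
                          x_sample_mean: float,
                          x_nonsample_mean: float,
                          n_i: int,
                          N_i: int,
                          beta: float,
                          sigma2_u: float,
                          sigma2_e: float) -> float:
    """BLUP of one area mean under a random intercept model with known parameters."""
    phi0 = sigma2_u / sigma2_e
    # Shrinkage factor: the share of the mean sample residual attributed to the area effect
    gamma_i = n_i * phi0 / (1.0 + n_i * phi0)
    u_hat = gamma_i * (y_sample_mean - x_sample_mean * beta)
    f_i = n_i / N_i  # sampling fraction in area i
    return f_i * y_sample_mean + (1.0 - f_i) * (x_nonsample_mean * beta + u_hat)


# Hypothetical area: 4 sampled units out of 40
blup = random_intercept_blup(
    y_sample_mean=12.0, x_sample_mean=5.0, x_nonsample_mean=6.0,
    n_i=4, N_i=40, beta=2.0, sigma2_u=1.0, sigma2_e=4.0,
)
```

As $n_i$ grows, $\gamma_i$ approaches 1 and the area effect is estimated almost entirely from the area's own residuals; with small $n_i$ the estimate is shrunk towards zero, that is, towards the regression-synthetic prediction.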

Empirical Best Linear Unbiased Prediction

EBLUP – see Rao (2003)

• Replaces unknown variance components by sample estimates

EBLUP method usually implemented using maximum likelihood (ML) or residual (restricted) maximum likelihood (REML) estimates of variance components

Software now generally available

• SAS – PROC MIXED, PROC NLMIXED
• Stata – xtreg, gllamm
• R – libraries nlme, lme4, mlmmm and mgcv
• MLwiN

Generalized Linear Mixed Model (GLMM)

• Data underpinning small area estimates are often categorical
- GLMs are standard models for such data
- leads to synthetic GLM-based SAE

• Use Generalized Linear Mixed Model (GLMM) extension of GLM to allow for correlation within the small areas of interest
- $E(y_s \mid u) = h(\eta_s)$ for a specified function h
- $\eta_s = X_s\beta + G_s u$
- u = vector of random area effects, normally distributed with zero mean vector and covariance matrix $\Omega = \Omega(\phi)$
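
To make the role of h concrete, the sketch below assumes a logistic model, so that h is the inverse logit, and a scalar random area effect (a random intercept). Both choices are assumptions for illustration; the slide leaves h and the structure of $G_s$ general. Names and numbers are hypothetical.

```python
import math
from typing import Sequence


def inverse_logit(eta: float) -> float:
    """Inverse of the logit link: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def conditional_mean(x: Sequence[float],
                     beta: Sequence[float],
                     area_effect: float) -> float:
    """E(y | u) = h(x'beta + u) for one unit, with h taken as the inverse logit."""
    eta = sum(xk * bk for xk, bk in zip(x, beta)) + area_effect
    return inverse_logit(eta)


# Hypothetical unit with an intercept and one covariate
p_no_effect = conditional_mean([1.0, 2.0], [-1.0, 0.5], area_effect=0.0)
p_with_effect = conditional_mean([1.0, 2.0], [-1.0, 0.5], area_effect=0.4)
```

A positive area effect raises the conditional probability for every unit in that area, which is how the model induces within-area correlation.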

Binomial Response / Random Mean Model

• Small area counts $y_{s1}, y_{s2}, \ldots, y_{sm}$ are independent binomial values
• Estimate of the proportion of ‘successes’ in area i

$$\hat{\theta}_i = N_i^{-1}\left[n_i\bar{y}_{is} + (N_i - n_i)(\hat{\beta} + \hat{u}_i)\right]$$

- $\bar{y}_{is}$ = proportion of successes in the sample in small area i

- $\hat{\beta}$ and $\hat{u}_i$ obtained by fitting logistic mixed model to the sample data in the small areas of interest

- ONS estimates of ILO unemployment for UA/LADs based on logistic-linear mixed model (27 covariates, including number of unemployment benefit claimants)

Out of Sample Areas

A pervasive problem, especially as the small area ‘geography’ becomes finer
- NUTS 4 (UK LAD/UAs): All (~400) in the LFS sample
- NUTS 5 (UK Wards): ~3000 out of ~10000 in sample

Options

• Zero all area effects (synthetic EBLUPs)
- ignores information from in sample areas

• Zero area effects for out of sample areas (proper EBLUPs)
- in sample/out of sample areas treated differentially

• ‘Borrow strength’ from in sample area effects to estimate/guess out of sample area effects
- via a spatial model (e.g. SAR)
- via some form of local smoothing (GWR)
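
The first two options differ only in which area effects are set to zero. A minimal Python sketch on the linear scale, assuming the fixed-part prediction for each area and the estimated effects for in-sample areas are already available (the function, area labels and values are hypothetical):

```python
from typing import Dict


def predict_area_mean(fixed_part: float,
                      area: str,
                      estimated_effects: Dict[str, float],
                      zero_all_effects: bool = False) -> float:
    """Add the estimated area effect when available; otherwise use zero."""
    if zero_all_effects:
        # Option 1: synthetic prediction, ignoring every estimated area effect
        return fixed_part
    # Option 2: out of sample areas have no estimated effect, so theirs is zero
    return fixed_part + estimated_effects.get(area, 0.0)


effects = {"A": 0.5}  # area "A" is in sample; area "B" is not
in_sample = predict_area_mean(10.0, "A", effects)
out_of_sample = predict_area_mean(10.0, "B", effects)
synthetic = predict_area_mean(10.0, "A", effects, zero_all_effects=True)
```

The third option would replace the default of zero with a value predicted from neighbouring in-sample effects, which this sketch does not attempt.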

Level Calibration

Often the case that EBLUPs are computed at small area level, but direct estimates are preferred at higher (e.g. population) levels

Example: EBLUP estimates based on a logistic mixed model at NUTS 4 level (LAD/UAs) for UK LFS, but direct estimates, i.e. survey estimates calibrated to NUTS 2 (UK region) by age by gender population benchmarks, preferred at higher (NUTS 1 – 3) levels

- EBLUP estimates iteratively re-scaled to agree with direct estimates at ‘level of benchmarking’

- typically only small changes since logistic SAE model includes benchmarking variables

However, substantial impact on SAE at NUTS 5 (Wards), where most small areas are out of sample ...
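
The re-scaling idea can be illustrated with a single ratio adjustment. This is a deliberately simplified sketch: it assumes one benchmark total covering all the listed areas, whereas the procedure described above is iterative and works across several benchmark dimensions. The function name and figures are hypothetical.

```python
from typing import List, Sequence


def ratio_benchmark(estimates: Sequence[float],
                    populations: Sequence[float],
                    benchmark_total: float) -> List[float]:
    """Rescale area-level estimates so their implied total equals the benchmark."""
    implied_total = sum(n * e for n, e in zip(populations, estimates))
    factor = benchmark_total / implied_total
    return [factor * e for e in estimates]


# Hypothetical: two areas in one region, with estimated proportions
area_populations = [1000.0, 3000.0]
model_estimates = [0.10, 0.20]   # implied regional total: 700
direct_regional_total = 770.0    # direct estimate at the benchmark level
adjusted = ratio_benchmark(model_estimates, area_populations, direct_regional_total)
```

When the model already includes the benchmarking variables the factor is close to 1, which matches the remark that changes are typically small.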

Area Level Models

Unit level data for Y may not be available – instead, we have small area direct estimates $\bar{y}_{wi}$ + small area averages $\bar{x}_i$ for covariates

Fay & Herriot (1979) model

measurement model: $\bar{y}_{is} = \bar{y}_i + e_{is}$ with $e_{is} \sim N(0, n_i^{-1}s_i^2)$

process model: $\bar{y}_i = \bar{x}_i^T\beta + u_i$ with $u_i \sim N(0, \sigma_u^2)$

⇒ overall model: $\bar{y}_{is} = \bar{x}_i^T\beta + u_i + e_{is}$

• Special case of linear mixed model with one unit/area
• EBLUE of β and EBLUP of $u_i$ can be calculated given $s_i^2$ (derived from estimated sampling variance of $\bar{y}_{is}$)

• Also referred to as ‘Empirical Bayes’ (Cressie, 1991)
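
A minimal Python sketch of prediction under this model, assuming a single covariate with no separate intercept (set every x to 1 for an intercept-only model), sampling variances $\psi_i = s_i^2/n_i$ supplied directly, and $\sigma_u^2$ treated as known rather than estimated. These simplifications, the function name and the data are all assumptions for illustration.

```python
from typing import List, Sequence, Tuple


def fay_herriot_predict(direct: Sequence[float],
                        x: Sequence[float],
                        sampling_var: Sequence[float],
                        sigma2_u: float) -> Tuple[float, List[float]]:
    """Weighted least squares estimate of beta, then shrinkage predictions per area."""
    # Each area's total variance is sigma2_u + psi_i; weight by its inverse
    weights = [1.0 / (sigma2_u + psi) for psi in sampling_var]
    beta_hat = (sum(w * xi * yi for w, xi, yi in zip(weights, x, direct))
                / sum(w * xi * xi for w, xi in zip(weights, x)))
    predictions = []
    for yi, xi, psi in zip(direct, x, sampling_var):
        gamma = sigma2_u / (sigma2_u + psi)  # weight given to the direct estimate
        predictions.append(xi * beta_hat + gamma * (yi - xi * beta_hat))
    return beta_hat, predictions


# Hypothetical: two areas, intercept-only model
beta_hat, preds = fay_herriot_predict(
    direct=[10.0, 14.0], x=[1.0, 1.0], sampling_var=[1.0, 3.0], sigma2_u=1.0,
)
```

The second area has the larger sampling variance, so its prediction is pulled further from its direct estimate towards the regression fit. In an EBLUP, `sigma2_u` would be replaced by an estimate.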

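Empirical Bayes Methods (Measurement Error Models)

Suited to cases where unit level data are not available (Fay-Herriot)

θ = population target; $\hat{\theta}$ = sample estimate

$\hat{\theta} \mid \theta \sim N(\theta, \Sigma)$ and $\theta \sim N(X\beta, V)$

$$E(\hat{\theta}) = E\left[E(\hat{\theta} \mid \theta)\right] = X\beta$$

$$\mathrm{Var}(\hat{\theta}) = E\left[\mathrm{Var}(\hat{\theta} \mid \theta)\right] + \mathrm{Var}\left(E(\hat{\theta} \mid \theta)\right) = \Sigma + V$$

$$\mathrm{Cov}(\hat{\theta}, \theta) = E\left[\mathrm{Cov}(\hat{\theta}, \theta \mid \theta)\right] + \mathrm{Cov}\left(E(\hat{\theta} \mid \theta), \theta\right) = V$$

$$\begin{pmatrix} \hat{\theta} \\ \theta \end{pmatrix} \sim N\left\{ \begin{pmatrix} X\beta \\ X\beta \end{pmatrix}, \begin{bmatrix} \Sigma + V & V \\ V & V \end{bmatrix} \right\}$$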

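Empirical Bayes Predictor (also called the Empirical Best Predictor) is the 'plug-in' approximation to the minimum mean squared error predictor $E(\theta \mid \hat{\theta})$

$$E(\theta \mid \hat{\theta}) = E(\theta) + \mathrm{Cov}(\theta, \hat{\theta})\left\{\mathrm{Var}(\hat{\theta})\right\}^{-1}\left(\hat{\theta} - E(\hat{\theta})\right) = V[\Sigma + V]^{-1}\hat{\theta} + \left\{I - V[\Sigma + V]^{-1}\right\}X\beta$$

$$\hat{\beta} = \left[X^T\{\Sigma + V\}^{-1}X\right]^{-1}\left[X^T\{\Sigma + V\}^{-1}\hat{\theta}\right]$$

• Plug-in estimates $\hat{\Sigma}$ and $\hat{V}$ are necessary
• $\hat{\Sigma}$ based on prior information about measurement error distribution underpinning distribution of $\hat{\theta}$ given θ
• $\hat{V}$ calculated from model residuals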

Application to Fay-Herriot modelling

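$$\begin{pmatrix} \hat{\theta} \\ \theta \end{pmatrix} = \begin{pmatrix} \bar{y}_s \\ \bar{y} \end{pmatrix} \sim N\left\{ \begin{pmatrix} X\beta \\ X\beta \end{pmatrix}, \begin{bmatrix} \mathrm{diag}\left(n_i^{-1}s_i^2 + N_i^{-1}\sigma_u^2\right) & \sigma_u^2\,\mathrm{diag}\left(N_i^{-1}\right) \\ \sigma_u^2\,\mathrm{diag}\left(N_i^{-1}\right) & \sigma_u^2\,\mathrm{diag}\left(N_i^{-1}\right) \end{bmatrix} \right\}$$

• F-H assumes $s_i^2$ is known (or can be specified accurately)
o β and $\sigma_u^2$ estimated via WLS/ML

$$\hat{\bar{y}} = \hat{V}\left[\hat{\Sigma} + \hat{V}\right]^{-1}\bar{y}_s + \left\{I - \hat{V}\left[\hat{\Sigma} + \hat{V}\right]^{-1}\right\}X\hat{\beta} = X\hat{\beta} + \hat{V}\left[\hat{\Sigma} + \hat{V}\right]^{-1}\left(\bar{y}_s - X\hat{\beta}\right)$$

$$= X\hat{\beta} + \mathrm{diag}\left(\frac{n_i\hat{\sigma}_u^2}{n_i\hat{\sigma}_u^2 + N_i s_i^2}\right)\left(\bar{y}_s - X\hat{\beta}\right)$$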
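
The diagonal factor in the final line follows element by element from the covariance blocks given for this slide, $\hat{\Sigma} = \mathrm{diag}(n_i^{-1}s_i^2)$ and $\hat{V} = \hat{\sigma}_u^2\,\mathrm{diag}(N_i^{-1})$. The intermediate step, written out here as a worked check, is:

```latex
\hat{V}_i\left(\hat{\Sigma}_i + \hat{V}_i\right)^{-1}
  = \frac{\hat{\sigma}_u^2 / N_i}{s_i^2 / n_i + \hat{\sigma}_u^2 / N_i}
  = \frac{n_i \hat{\sigma}_u^2}{n_i \hat{\sigma}_u^2 + N_i s_i^2}
```

The second equality multiplies numerator and denominator by $n_i N_i$. The factor lies between 0 and 1: it tends to 1 as the sampling variance $s_i^2/n_i$ becomes small, so the predictor approaches the direct estimate, and it tends to 0 as the sampling variance grows, so the predictor approaches the synthetic value $X\hat{\beta}$.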