2021 SISCER Module 2 Small Area Estimation:
Lecture 2: Non-Spatial Area-Level Modeling
Jon Wakefield
Departments of Statistics and Biostatistics, University of Washington
Outline
Motivation for Mixed Models and Smoothing
Linear Mixed Effects Models
Generalized Linear Mixed Models
Bayesian Inference
Simulated Data
Fay-Herriot Modeling
Motivation for Mixed Models and Smoothing
Motivation
In an SAE context, we often face data that are sparse in space, which leads to great uncertainty in the calculated estimates.
Mixed models (also known as random effects or hierarchical models) are designed to alleviate this problem by modeling the totality of data from all areas, in order to leverage similarities in the data; the resulting estimates are examples of indirect estimates.
The key element is coupling the different areas, by assuming that the parameters in these areas are linked through a common probability distribution.
In this lecture we describe mixed models for data that are normal (via linear mixed models) and binomial (via generalized linear mixed models); spatial smoothing will be covered in the next lecture.
Model-based approaches
Indirect methods provide a link, often with an implicit model, between different areas; we describe explicit models that provide such a link.
The models we describe are mixed-effects models, which aim to accurately describe between-domain (area) differences.
Such models offer several advantages:
• Models can be tuned to the application, building on the existing theory and practical experience of mixed models, including non-linear models such as logistic mixed models.
• Domain (area)-specific measures of uncertainty are produced.
• Estimates for areas with no data can be formed.
• One can attempt to check assumptions using diagnostics.
• A variety of area-specific random effects, including spatial, are available.
Model-based approaches
Explicit models have not been widely used in survey sampling, where design-based inference is historically the norm.
The design consistency of model-based estimators is therefore of interest.
Disadvantages of mixed models:
• How to incorporate the design weights/acknowledge the design?
• It is often difficult to check modeling assumptions.
• Computation can be demanding, though this is improving.
Models can be specified at the level of the area or the unit.
Motivating Examples: Normal and Binomial Data
We consider simulated data: we have two outcomes, one continuous and one binary, again using the King County geography.
These data were simulated with non-constant prevalence across HRAs (areas).
Nominally, the variables will be labeled weight and diabetes.
Motivating Examples: Normal and Binomial Data
[Figure 1 image: four maps with panels "ave weight", "prob", "sd(ave weight)" and "sd(prob)".]
Figure 1: Sample mean weights (top left) and fractions with diabetes (top right). Standard errors of mean weights (bottom left) and fractions with diabetes (bottom right). Gray areas in the right map correspond to areas with zero counts, and hence an estimated standard error of zero.
Linear Mixed Effects Models
Smoothing Models
Instability of estimates has led to the development of methods that impose smoothness on the underlying parameters, using mixed effects models that draw on the data from all areas to provide more reliable estimates in each constituent area.
Overview of Models:
• Basic Linear Model: No smoothing.
• Linear Mixed Models:
  • Normal likelihood, random effects with no spatial structure (known as IID, independent and identically distributed).
  • Normal likelihood, two sets of random effects: one set with no spatial structure, one set with spatial structure.
• Covariates may be added to each of these in order to smooth over covariate space.
• Estimation in these models is a separate issue.
Linear Mixed Effects Model
Let $Y_{ik}$ be the weight of the $k$-th sampled individual in area $i$.
Three possible models:
1. No between-area variability:
\[ Y_{ik} = \underbrace{\beta_0}_{\text{Common Mean}} + \epsilon_{ik}, \]
with $\epsilon_{ik} \sim N(0, \sigma^2_\epsilon)$. (Basic synthetic estimates.)
2. Distinct between-area variability:
\[ Y_{ik} = \underbrace{\beta_i}_{\text{Mean of Area } i} + \epsilon_{ik}, \]
with $\epsilon_{ik} \sim N(0, \sigma^2_\epsilon)$. Known as a fixed effects model. Note: there is no way to link different areas. (Direct estimates.)
3. Linked between-area variability:
\[ Y_{ik} = \underbrace{\beta_0 + \delta_i}_{\text{Mean of Area } i} + \epsilon_{ik}, \]
with $\delta_i \sim N(0, \sigma^2_\delta)$ and $\epsilon_{ik} \sim N(0, \sigma^2_\epsilon)$. Known as a mixed effects model. (Indirect estimates.)
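To make the generative structure of these models concrete, here is a minimal Python sketch that simulates data from the linked (mixed effects) model. The function name, parameter values and sample sizes are hypothetical and are not the lecture's simulated data; setting the between-area standard deviation to zero recovers the common-mean model.

```python
import random

def simulate_lmem(beta0, sigma_delta, sigma_eps, n_per_area, seed=0):
    """Draw Y_ik = beta0 + delta_i + eps_ik for every area i."""
    rng = random.Random(seed)
    data = []
    for n_i in n_per_area:
        # One random effect per area, shared by all individuals in that area.
        delta_i = rng.gauss(0.0, sigma_delta)
        data.append([beta0 + delta_i + rng.gauss(0.0, sigma_eps)
                     for _ in range(n_i)])
    return data

# sigma_delta = 0 gives model 1 (no between-area variability).
common = simulate_lmem(180.0, 0.0, 10.0, [5, 2, 1])
# sigma_delta > 0 gives model 3 (linked between-area variability).
linked = simulate_lmem(180.0, 5.0, 10.0, [5, 2, 1])
```

Each inner list holds the sampled outcomes for one area, so areas with small `n_i` carry little information about their own mean.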
Basic Linear Mixed Model
We will concentrate on the linked between-area mixed effects model:
\[ Y_{ik} = \underbrace{\beta_0 + \delta_i}_{\text{Mean of Area } i} + \epsilon_{ik}, \]
with $\delta_i \sim N(0, \sigma^2_\delta)$, the area-specific deviations (the random effects) from the overall level $\beta_0$, and $\epsilon_{ik} \sim N(0, \sigma^2_\epsilon)$, the measurement error.
This model is also known as a Linear Mixed Effects Model (LMEM).
In this model, the totality of data is used to inform on the overall level $\beta_0$ and the between-area variability $\sigma^2_\delta$.
The unknown parameters are:
• Overall mean: $\beta_0$
• Between-area variance: $\sigma^2_\delta$
• Measurement error variance: $\sigma^2_\epsilon$
• Random effects: $\delta_1, \dots, \delta_n$
Maximum Likelihood Estimation
The MLEs are the values of the parameters that maximize the probability of the observed data, under an assumed probability model.
For an LMEM the likelihood to be maximized is
\[ L(\beta_0, \sigma^2_\delta, \sigma^2_\epsilon) = \prod_{i=1}^{n} \int p(\mathbf{y}_i \mid \beta_0, \delta_i, \sigma^2_\epsilon) \times p(\delta_i \mid \sigma^2_\delta) \, d\delta_i . \]
If a likelihood approach is taken, the random effect estimates $\hat{\delta}_i$ are obtained as the conditional means $E[\delta_i \mid \mathbf{y}]$.
These are known as the best linear unbiased predictors (BLUPs).
They are known as estimated BLUPs (EBLUPs) when $\beta_0$ and the variance components $\sigma^2_\epsilon, \sigma^2_\delta$ are estimated.
Technical Note: Restricted ML (REML) is preferred for estimation of the variances, since it accounts for the estimation of $\beta_0$.
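Because both densities in the integrand are normal, the integral over each random effect can be evaluated in closed form. As a worked step (a standard result for this random-intercept model, stated here rather than taken from the slides), the data vector for area $i$, with $n_i$ observations, is marginally multivariate normal:

\[
\mathbf{y}_i \mid \beta_0, \sigma^2_\delta, \sigma^2_\epsilon
\sim N_{n_i}\!\left( \beta_0 \mathbf{1}_{n_i},\;
\sigma^2_\epsilon \mathbf{I}_{n_i} + \sigma^2_\delta \mathbf{1}_{n_i} \mathbf{1}_{n_i}^{T} \right).
\]

The likelihood is then the product of these $n$ densities. Two observations in the same area have covariance $\sigma^2_\delta$, which is how the shared random effect induces within-area dependence.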
Estimation in the Linear Mixed Model
In general, there are no closed-form (i.e., explicit) expressions for the parameter estimates.
Suppose, for simplicity, that $\beta_0$, $\sigma^2_\delta$ and $\sigma^2_\epsilon$ are known; this allows some insight into inference.
Estimation in the Linear Mixed Model
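The posterior mean of the random effect (area-specific adjustment) is:
\[ \hat{\delta}_i = E[\delta_i \mid \mathbf{y}_i] = \frac{n_i \sigma^2_\delta}{\sigma^2_\epsilon + n_i \sigma^2_\delta} (\bar{y}_i - \beta_0) = q_i (\bar{y}_i - \beta_0), \]
where $\bar{y}_i$ is the sample mean in area $i$ and
\[ q_i = \frac{n_i \sigma^2_\delta}{\sigma^2_\epsilon + n_i \sigma^2_\delta} \le 1 \]
is small (so more shrinkage) if:
• $n_i$ is small (not much data in the area), or
• $\sigma^2_\delta$ is small (between-area variability is small), or
• $\sigma^2_\epsilon$ is large (within-area variability is large).
$\hat{\delta}_i$ is the BLUP.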
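The behavior of the shrinkage factor can be checked numerically. The sketch below is a minimal illustration with hypothetical variance components (between-area variance 25, within-area variance 100) and assumes, as on this slide, that the overall mean and both variances are known; the function names are illustrative.

```python
def shrinkage_factor(n_i, sigma2_delta, sigma2_eps):
    """q_i = n_i * sigma2_delta / (sigma2_eps + n_i * sigma2_delta)."""
    return n_i * sigma2_delta / (sigma2_eps + n_i * sigma2_delta)

def blup(ybar_i, n_i, beta0, sigma2_delta, sigma2_eps):
    """Predicted random effect: q_i * (ybar_i - beta0)."""
    q_i = shrinkage_factor(n_i, sigma2_delta, sigma2_eps)
    return q_i * (ybar_i - beta0)

# An area whose sample mean is 10 units above the overall mean of 180.
for n_i in (1, 4, 100):
    q_i = shrinkage_factor(n_i, 25.0, 100.0)
    delta_hat = blup(190.0, n_i, 180.0, 25.0, 100.0)
    print(n_i, round(q_i, 3), round(delta_hat, 2))
```

With one observation the area deviation is shrunk to a fifth of its raw value, while with 100 observations almost none of it is removed.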
Inference for Random Effects
From a frequentist mixed-model perspective:
• The BLUPs are unbiased in the sense that
\[ E[\hat{\delta}_i - \delta_i] = 0, \]
where the expectation is over $\hat{\delta}_i$ and $\delta_i$, since both are viewed as random.
• Uncertainty is measured by $\text{var}(\hat{\delta}_i - \delta_i)$, which is equal to the mean squared error (MSE), $\text{MSE}(\hat{\delta}_i)$.
• The MSE can also be used for construction of asymptotic confidence intervals:
\[ \hat{\delta}_i - \delta_i \sim N\big(0, \underbrace{\text{MSE}(\hat{\delta}_i)}_{\text{var}(\hat{\delta}_i - \delta_i)}\big). \]
• The MSE can be estimated analytically, or via the bootstrap (Rao and Molina, 2015).
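As a simplified illustration of the interval construction, suppose the overall mean and the variance components are known. In that case the MSE of the BLUP has the closed form $(1 - q_i)\sigma^2_\delta$, with $q_i$ the shrinkage factor. The sketch below uses this special case; the function names and inputs are hypothetical, and the extra uncertainty from estimating the parameters, which the analytic and bootstrap MSE estimators account for, is ignored.

```python
import math

def blup_mse(n_i, sigma2_delta, sigma2_eps):
    """var(delta_hat_i - delta_i) = (1 - q_i) * sigma2_delta, parameters known."""
    q_i = n_i * sigma2_delta / (sigma2_eps + n_i * sigma2_delta)
    return (1.0 - q_i) * sigma2_delta

def random_effect_interval(delta_hat_i, mse, z=1.96):
    """Approximate 95% interval: delta_hat_i +/- z * sqrt(MSE)."""
    half_width = z * math.sqrt(mse)
    return delta_hat_i - half_width, delta_hat_i + half_width

mse = blup_mse(4, 25.0, 100.0)   # q_i = 0.5, so the MSE is 12.5
lower, upper = random_effect_interval(5.0, mse)
```

The MSE is never larger than the between-area variance and falls toward zero as the area sample size grows.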
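Predicting the Population Total and Mean
Let $S_i$ and $R_i$ denote, respectively, the sets of indices of the sampled and unsampled individuals in area $i$.
Let $T_i = \sum_{k=1}^{N_i} y_{ik}$ be the total for the population in area $i$, where $N_i$ is the population size.
The average for the population in area $i$ is
\[
\begin{aligned}
\bar{Y}_i = \frac{T_i}{N_i} &= \frac{\sum_{k=1}^{N_i} y_{ik}}{N_i} = \frac{\sum_{k \in S_i} y_{ik} + \sum_{k \in R_i} y_{ik}}{N_i} \\
&= \underbrace{\frac{\sum_{k \in S_i} y_{ik}}{n_i}}_{\text{Mean of Sampled}} \times \frac{n_i}{N_i} + \underbrace{\frac{\sum_{k \in R_i} y_{ik}}{N_i - n_i}}_{\text{Mean of Unsampled}} \times \frac{N_i - n_i}{N_i} .
\end{aligned}
\]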
Predicting the Population Total and Mean
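Suppose now we have fitted the linear mixed effects model and obtained posterior medians $\hat{\beta}_0$ and $\hat{\delta}_i$.
The obvious estimate is:
\[ \hat{\bar{Y}}_i = \underbrace{\frac{\sum_{k \in S_i} y_{ik}}{n_i}}_{\text{Mean of Sampled}} \times \frac{n_i}{N_i} + \underbrace{(\hat{\beta}_0 + \hat{\delta}_i)}_{\text{Estimated Mean}} \times \frac{N_i - n_i}{N_i} . \]
If $N_i \gg n_i$, then the sampled data provide a small fraction of the total population in the area and we can estimate the area mean by
\[ \hat{\bar{Y}}_i = \hat{\beta}_0 + \hat{\delta}_i . \qquad (1) \]
If an area contains no data, then its mean is assumed to be $\hat{\beta}_0 + \delta^\star$, where $\delta^\star \sim N(0, \sigma^2_\delta)$.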
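The estimate above can be sketched in a few lines of Python. The function name and the numbers are hypothetical; the fitted overall level and area effect are passed in as if they had come from a fitted model. For an area with no data the sketch returns the model mean, and a caller would pass an area effect of zero (the mean of the random-effects distribution) as the point prediction.

```python
def predict_area_mean(sampled, N_i, beta0_hat, delta_hat_i):
    """Weight the sampled mean and the model mean by their population shares."""
    n_i = len(sampled)
    model_mean = beta0_hat + delta_hat_i
    if n_i == 0:
        # No sampled individuals: the prediction relies on the model alone.
        return model_mean
    sample_mean = sum(sampled) / n_i
    return sample_mean * n_i / N_i + model_mean * (N_i - n_i) / N_i

# Two sampled individuals out of a (small, hypothetical) population of 10.
small_pop = predict_area_mean([170.0, 190.0], 10, 182.0, -1.0)
# Same sample in a population of 100000: essentially beta0_hat + delta_hat_i.
large_pop = predict_area_mean([170.0, 190.0], 100000, 182.0, -1.0)
```

When the population is much larger than the sample, the weight on the sampled mean is negligible and the result is close to the simplified estimate (1).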
Motivating Example: Continuous Outcome
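Totals can be similarly estimated. The population total is
\[ T_i = \sum_{k \in S_i} y_{ik} + \sum_{k \in R_i} y_{ik}, \]
which is estimated by
\[ \hat{T}_i = \underbrace{\sum_{k \in S_i} y_{ik}}_{\text{Total of Sampled}} + \underbrace{(N_i - n_i) \times (\hat{\beta}_0 + \hat{\delta}_i)}_{\text{Estimated Total for Unsampled}} . \]
If $N_i \gg n_i$,
\[ \hat{T}_i \approx N_i \times (\hat{\beta}_0 + \hat{\delta}_i). \]
This assumes that the population size $N_i$ is known.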
Generalized Linear Mixed Models
Smoothing Models
The above considerations of instability led to the development of methods that smooth the prevalences, using mixed effects models that draw on the data from all areas to provide more reliable estimates in each constituent area.
Overview of Models:
• Basic Binomial Model: No smoothing.
• Mixed Effects Models:
  • Binomial IID GLMM (generalized linear mixed model): Non-spatial smoothing.
  • Binomial Spatial GLMM: Spatial and non-spatial smoothing.
• Covariates may be added to each of these in order to smooth over covariate space.
• Estimation in these models is a separate issue.
Overview of Models
The individual responses are the binary random variables $Y_{ik}$, for the sampled individuals $k \in S_i$.
If we take $Y_i = \sum_{k \in S_i} Y_{ik}$, the obvious sampling model (likelihood) is:
\[ Y_i \mid p_i \sim \text{Binomial}(n_i, p_i). \]
Leaving the $p_i$ unconstrained and taking the MLEs leads to $\hat{p}_i = Y_i / n_i$.
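A short Python sketch shows why the unsmoothed MLE is troublesome in sparse areas. It uses the usual binomial standard error, the square root of $\hat{p}_i(1 - \hat{p}_i)/n_i$, which is an assumption here since the slides do not spell out the formula; the counts are hypothetical.

```python
import math

def binomial_mle(y_i, n_i):
    """Return the MLE p_hat = y_i / n_i and its estimated standard error."""
    p_hat = y_i / n_i
    se = math.sqrt(p_hat * (1.0 - p_hat) / n_i)
    return p_hat, se

few_cases = binomial_mle(3, 20)   # a small area with a few cases
no_cases = binomial_mle(0, 8)     # zero cases: p_hat = 0 and estimated SE = 0
```

An area with zero observed cases gets an estimated prevalence of exactly zero and an estimated standard error of zero, which overstates the certainty of the estimate.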
Overview of Models
The simplest GLMM assumes the odds in area $i$ are of the form
\[ \frac{p_i}{1 - p_i} = \exp(\beta_0) \times \exp(\delta_i), \]
where the $\exp(\delta_i)$ are area-specific adjustments that multiply the overall odds $\exp(\beta_0)$.
This model is equivalent to a linear model on the logistic scale:
\[ \log\left( \frac{p_i}{1 - p_i} \right) = \beta_0 + \delta_i . \]
The random effects $\delta_i$ are assumed to follow a normal distribution, that is,
\[ \delta_i \sim_{iid} N(0, \sigma^2_\delta). \]
It is straightforward to add area-level covariates to this model via
\[ \log\left( \frac{p_i}{1 - p_i} \right) = \beta_0 + \mathbf{x}_i^T \boldsymbol{\beta}_1 + \delta_i . \]
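The link between the odds form and the logistic form can be verified numerically. In this minimal sketch the function names and parameter values are hypothetical; the optional covariate arguments correspond to the area-level covariate term in the last equation.

```python
import math

def area_prevalence(beta0, delta_i, x_i=(), beta1=()):
    """Invert the logit: p_i = expit(beta0 + x_i^T beta1 + delta_i)."""
    eta = beta0 + sum(x * b for x, b in zip(x_i, beta1)) + delta_i
    return 1.0 / (1.0 + math.exp(-eta))

def odds(p):
    """Odds p / (1 - p) corresponding to a probability p."""
    return p / (1.0 - p)

# Overall log-odds of -2 and an area effect of +0.5 (illustrative values).
p_i = area_prevalence(-2.0, 0.5)
# The area odds equal the overall odds exp(beta0) multiplied by exp(delta_i).
multiplier = odds(p_i) / math.exp(-2.0)
```

Here `multiplier` agrees with `math.exp(0.5)` up to floating-point error, which is the multiplicative adjustment described above.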
SAE Inference
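We may wish to estimate the total number of cases or the average (prevalence) in area $i$:
\[ T_i = \sum_{k \in S_i} y_{ik} + \sum_{k \in R_i} y_{ik}, \]
\[ \frac{T_i}{N_i} = \frac{\sum_{k \in S_i} y_{ik} + \sum_{k \in R_i} y_{ik}}{N_i} . \]
Estimates:
\[ \hat{T}_i = \sum_{k \in S_i} y_{ik} + (N_i - n_i) \times \hat{p}_i, \]
\[ \frac{\hat{T}_i}{N_i} = \frac{\sum_{k \in S_i} y_{ik}}{n_i} \times \frac{n_i}{N_i} + \hat{p}_i \times \frac{N_i - n_i}{N_i} . \]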
SAE Inference
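If $N_i \gg n_i$:
\[ \hat{T}_i = N_i \hat{p}_i, \qquad \frac{\hat{T}_i}{N_i} = \hat{p}_i . \]
In the Binomial GLMM:
\[ p_i = \frac{\exp(\beta_0 + \mathbf{x}_i^T \boldsymbol{\beta}_1 + \delta_i)}{1 + \exp(\beta_0 + \mathbf{x}_i^T \boldsymbol{\beta}_1 + \delta_i)} . \]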
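A minimal Python sketch of the estimated number of cases and prevalence follows. The function names and all numbers are hypothetical, and the model-based prevalence is passed in as if it had come from a fitted GLMM.

```python
def predict_cases(y_sampled, n_i, N_i, p_hat_i):
    """Observed cases plus expected cases among the N_i - n_i unsampled."""
    return y_sampled + (N_i - n_i) * p_hat_i

def predict_prevalence(y_sampled, n_i, N_i, p_hat_i):
    """Estimated area prevalence: the estimated total divided by N_i."""
    return predict_cases(y_sampled, n_i, N_i, p_hat_i) / N_i

# 3 cases among 20 sampled, population 5000, model-based prevalence 0.12.
total = predict_cases(3, 20, 5000, 0.12)
prevalence = predict_prevalence(3, 20, 5000, 0.12)
```

Because the population is far larger than the sample, the results stay close to the simplified forms: roughly 5000 × 0.12 cases and a prevalence near 0.12.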
Bayesian Inference
Bayesian Inference
Bayesian modeling is convenient for implementing notions of smoothing.
There are two key elements that must be specified:
• The sampling model (likelihood) describes the distribution of the data; this model depends on unknown parameters, which we will denote $\boldsymbol{p}$.
• The prior distribution expresses beliefs about the parameters $\boldsymbol{p}$ and provides a mechanism by which penalization/smoothing can be imposed.
These elements are probabilistically combined via Bayes' theorem:
\[ \underbrace{p(\boldsymbol{p} \mid \mathbf{y})}_{\text{Posterior}} \propto \underbrace{L(\boldsymbol{p})}_{\text{Likelihood}} \times \underbrace{\pi(\boldsymbol{p})}_{\text{Prior}} . \]
On the log scale, up to an additive constant:
\[ \underbrace{\log p(\boldsymbol{p} \mid \mathbf{y})}_{\text{Updated Beliefs}} = \underbrace{\log L(\boldsymbol{p})}_{\text{Sampling Model}} + \underbrace{\log \pi(\boldsymbol{p})}_{\text{Penalization}} . \]
Bayesian Inference
• In a Bayesian analysis the complete set of unknowns (parameters) is summarized via the multivariate posterior distribution; one- or two-dimensional marginal posterior distributions can be visualized.
• The marginal distribution for each parameter may be summarized via its mean, standard deviation, or quantiles.
• It is common to report the posterior median and a 90% or 95% posterior range for parameters of interest.
• The range that is reported is known as a credible interval.
Bayesian Computation
• The computations required for Bayesian inference (integrals) are often not trivial and may be carried out using a variety of analytical, numerical and simulation-based techniques.
• We use the integrated nested Laplace approximation (INLA), introduced by Rue et al. (2009).
• R-INLA is a package that implements the INLA approach.
• The SUMMER package uses INLA for all Bayesian computation.
• Book-length treatments on INLA:
  • Blangiardo and Cameletti (2015): space-time modeling.
  • Wang et al. (2018): general modeling.
  • Krainski et al. (2018): advanced space-time modeling.
Bayes Example
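• Imagine the data model is normal with an unknown mean $\mu$:
\[ y_i \mid \mu \sim N(\mu, \sigma^2), \quad i = 1, \dots, n, \]
with $\sigma^2$ assumed known.
• This is equivalent to:
\[ \bar{y} \mid \mu \sim N(\mu, \sigma^2 / n), \]
where $\sigma / \sqrt{n}$ is the standard error.
• Suppose a normal prior is appropriate:
\[ \mu \sim N(m, v), \]
so that values of the mean $\mu$ that are (relatively) far from $m$ are penalized; $v$ is the smoothing parameter, with more smoothing if $v$ is small and less if it is big.
• The log posterior is, up to an additive constant,
\[ \underbrace{\log p(\mu \mid \mathbf{y})}_{\text{Updated Beliefs}} = \underbrace{-\frac{n}{2\sigma^2} (\bar{y} - \mu)^2}_{\text{Data Model}} \; \underbrace{- \frac{1}{2v} (\mu - m)^2}_{\text{Penalization}} . \]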
[Figure 2 image: prior, likelihood and posterior density curves; x-axis "Parameter", y-axis "Density".]
Figure 2: Normal data model with $n = 10$, $\bar{y} = 19.3$ and standard error 1.41. The prior for $\mu$ has mean $m = 15$ and variance $v = 3^2$. The posterior for the parameter $\mu$ is a compromise between the two sources of information: the posterior mean is 18.5 and the posterior standard deviation is 1.28.
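The numbers in the Figure 2 caption can be reproduced with the standard normal-normal update, in which precisions (inverse variances) add and the posterior mean is a precision-weighted average. The sketch below reads the caption's prior variance as $v = 3^2 = 9$, an assumption that reproduces the reported posterior summaries; the function name is illustrative.

```python
import math

def normal_posterior(ybar, se, m, v):
    """Posterior of mu for ybar | mu ~ N(mu, se^2) and prior mu ~ N(m, v)."""
    prec_data = 1.0 / se ** 2    # precision of the data summary
    prec_prior = 1.0 / v         # precision of the prior
    post_var = 1.0 / (prec_data + prec_prior)
    post_mean = post_var * (prec_data * ybar + prec_prior * m)
    return post_mean, math.sqrt(post_var)

# Values from the Figure 2 caption: ybar = 19.3, se = 1.41, m = 15, v = 9.
post_mean, post_sd = normal_posterior(19.3, 1.41, 15.0, 9.0)
print(round(post_mean, 1), round(post_sd, 2))   # 18.5 1.28
```

The posterior mean sits between the prior mean 15 and the sample mean 19.3, closer to the data because the data precision is the larger of the two.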
The Bayesian Linear Mixed Model
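The LMEM with priors is:
\[
\begin{aligned}
Y_{ik} \mid \boldsymbol{\beta}, \delta_i, \sigma^2_\epsilon &\sim_{iid} N(\beta_0 + \mathbf{x}_i^T \boldsymbol{\beta}_1 + \delta_i, \sigma^2_\epsilon) \\
\delta_i \mid \sigma^2_\delta &\sim_{iid} N(0, \sigma^2_\delta) \\
\boldsymbol{\beta}, \sigma^2_\epsilon, \sigma^2_\delta &\sim \text{Priors}
\end{aligned}
\]
The posterior, given data $\mathbf{y}$, is obtained as
\[
\begin{aligned}
p(\boldsymbol{\beta}, \delta_1, \dots, \delta_n, \sigma^2_\epsilon, \sigma^2_\delta \mid \mathbf{y})
&= p(\mathbf{y} \mid \boldsymbol{\beta}, \delta_1, \dots, \delta_n, \sigma^2_\epsilon, \sigma^2_\delta) \times p(\boldsymbol{\beta}, \delta_1, \dots, \delta_n, \sigma^2_\epsilon, \sigma^2_\delta) / p(\mathbf{y}) \\
&= \prod_{i=1}^{n} p(\mathbf{y}_i \mid \boldsymbol{\beta}, \delta_i, \sigma^2_\epsilon) \times p(\delta_i \mid \sigma^2_\delta) \times p(\boldsymbol{\beta}, \sigma^2_\epsilon, \sigma^2_\delta) / p(\mathbf{y}) .
\end{aligned}
\]
Marginal distributions, such as $p(\beta_0 \mid \mathbf{y})$, are obtained by integration.
In the SUMMER package, the INLA approach is used (Rue et al., 2009).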
Simulated Data
Motivating Example: Linear Model
[Figure 3 image: four maps with panels "MLE", "Bayes", "MLE.sd" and "Bayes.sd".]
Figure 3: Top row: Estimates of area averages of weight via MLEs (left) and posterior medians (right). Bottom row: Uncertainty of estimates with standard errors (left) and posterior standard deviations (right).
Motivating Example: Linear Model
[Figure 4 image: two scatter plots. Left panel "Estimates": x-axis "MLE", y-axis "Posterior median". Right panel "Standard errors": x-axis "se(MLE)", y-axis "Posterior SD".]
Figure 4: Comparison of area averages: Posterior medians versus MLEs (left). Posterior standard deviations versus standard errors associated with the MLEs (right).
The posterior medians are shrunk from the MLEs towards the overall mean, with the extreme values undergoing the most shrinkage.
In general, the Bayes measures of uncertainty (the posterior standard deviations) are smaller than the standard errors of the MLEs, with the greatest difference occurring for those areas with the largest standard errors (which have the smallest sample sizes).
Motivating Example: Binary Outcome
[Figure 5 image: four maps with panels "mle", "median", "mle.sd" and "sd".]
Figure 5: Top row: Estimates of area prevalences of diabetes via MLEs (left) and posterior medians (right). Bottom row: Uncertainty of estimates with standard errors (left) and posterior standard deviations (right).
Motivating Example: Binary Outcome
[Figure 6 image: two scatter plots. Left panel "Logit Prevalence": x-axis "MLE", y-axis "Posterior median". Right panel "Uncertainty": x-axis "se(MLE)", y-axis "Posterior SD".]
Figure 6: Comparison of area averages: Posterior medians versus MLEs on the logistic scale (left). Posterior standard deviations versus standard errors associated with the MLEs on the logistic scale (right).
As in the normal model case, we see that the Bayesian estimates are shrunk relative to the MLEs, and the uncertainty of the Bayesian estimates is in general smaller.
Fay-Herriot Modeling
Fay-Herriot Model
What about accounting for complex sampling?
So far we have been (implicitly) assuming the data are gathered through simple random sampling (SRS).
In a ground-breaking paper, Fay and Herriot (1979) suggested modeling a transform of the weighted estimator via a linear mixed model.
Specifically, let $y_i$ denote a transformation of the weighted estimator:
\[ y_i = g\left( \frac{\sum_{k \in S_i} w_k y_k}{\sum_{k \in S_i} w_k} \right), \]
and assume
\[ y_i \mid \beta_0, \delta_i \sim N(\beta_0 + \mathbf{x}_i^T \boldsymbol{\beta}_1 + \delta_i, V_i), \]
with $V_i$ assumed known and derived from the variance of the weighted estimator, and
\[ \delta_i \mid \sigma^2_\delta \sim_{iid} N(0, \sigma^2_\delta). \]
Fay-Herriot Modeling
• The F-H approach is an extremely popular way of producing area-level estimates.
• This formulation avoids the need to model the design.
• For a continuous outcome (e.g., child growth measures) it may suffice to simply take the weighted estimator, with its associated variance.
• For prevalences, we can take $g(z) = \log(z / (1 - z))$, and then use the delta method to obtain the appropriate variance.
• For rates, we can take $g(z) = \log(z + c)$, where the constant $c$ is chosen to avoid taking the log of zero, and again use the delta method to obtain the appropriate variance.
• Area-level covariates are frequently used.
• Problems can occur when the estimate lies on its boundary, both for point estimation and variance estimation (e.g., for a prevalence when all the outcomes are 0).
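The prevalence case can be sketched as follows: compute the weighted estimator, transform it with the logit, and convert its variance with the delta method, which multiplies the variance by the squared derivative of $g$. Everything here is illustrative: the function names, outcomes, weights and the design-based variance of 0.004 are hypothetical, and in practice that variance would come from survey software that accounts for the design.

```python
import math

def weighted_prevalence(y, w):
    """Weighted estimator: sum(w_k * y_k) / sum(w_k)."""
    return sum(wk * yk for wk, yk in zip(w, y)) / sum(w)

def logit_with_delta_variance(p_hat, var_p_hat):
    """Return g(p_hat) = logit(p_hat) and its delta-method variance V_i."""
    if not 0.0 < p_hat < 1.0:
        raise ValueError("logit is undefined for an estimate on the boundary")
    y_i = math.log(p_hat / (1.0 - p_hat))
    # g'(z) = 1 / (z * (1 - z)), so V_i = var(p_hat) * g'(p_hat) ** 2.
    v_i = var_p_hat / (p_hat * (1.0 - p_hat)) ** 2
    return y_i, v_i

# Hypothetical binary outcomes and design weights for one area.
p_hat = weighted_prevalence([1, 0, 0, 1, 0], [10.0, 20.0, 30.0, 10.0, 30.0])
y_i, V_i = logit_with_delta_variance(p_hat, 0.004)
```

The explicit error for estimates of exactly 0 or 1 mirrors the boundary problem noted in the last bullet: neither the logit nor the delta-method variance is defined there.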
Discussion: Area-Level Modeling
• Mixed effects models are a common approach for SAE, as they acknowledge between-area differences in outcomes.
• But one must consider the design (e.g., stratification and clustering); Fay-Herriot modeling is a simple way to do this.
• Model-checking is important.
References
Blangiardo, M. and Cameletti, M. (2015). Spatial and Spatio-Temporal Bayesian Models with R-INLA. John Wiley and Sons.
Fay, R. and Herriot, R. (1979). Estimates of income for small places: an application of James–Stein procedures to census data. Journal of the American Statistical Association, 74, 269–277.
Krainski, E. T., Gomez-Rubio, V., Bakka, H., Lenzi, A., Castro-Camilo, D., Simpson, D., Lindgren, F., and Rue, H. (2018). Advanced Spatial Modeling with Stochastic Partial Differential Equations Using R and INLA. Chapman and Hall/CRC.
Rao, J. and Molina, I. (2015). Small Area Estimation, Second Edition. John Wiley, New York.
Rue, H., Martino, S., and Chopin, N. (2009). Approximate Bayesian inference for latent Gaussian models using integrated nested Laplace approximations (with discussion). Journal of the Royal Statistical Society, Series B, 71, 319–392.
Wang, X., Yue, Y., and Faraway, J. J. (2018). Bayesian Regression Modeling with INLA. Chapman and Hall/CRC.