
A measurement error model approach to survey data integration: combining information from two surveys

Jae Kwang Kim 1

Iowa State University

2017 SAE conference, Paris

July 11th, 2017

1 Joint work with Seho Park

Survey data integration

Want to combine information from multiple surveys

Three situations:

1 Multiple samples for one target population

2 One sample each from multiple populations

3 Multiple samples from multiple populations

Small area estimation is a special case of survey data integration, in that multiple sub-populations represent multiple domains.

Kim (ISU) Survey Data Integration 7/11/2017 2 / 25

Motivation

The USAID Bureau for Food Security (BFS) sponsors the Food and Nutrition Technical Assistance III project (FANTA).

Key technical areas of focus are food security, maternal and child health, agriculture, and livelihoods strengthening.


Motivation

FANTA has two projects: the Feed the Future (FTF) and Food for Peace (FFP) development projects.

The FFP project was conducted by ICF International, and the FTF project was conducted by UNC MEASURE.

Two surveys were conducted in 2013 in selected departments of Guatemala: San Marcos, Totonicapan, Quiche, Quetzaltenango, and Huehuetenango.


Map of Guatemala


FFP and FTF Projects in Guatemala

Figure: Selected Departments in Guatemala


Overlap Area

Figure: FFP ZOI and FFP Project Implementation Area for Guatemala


Overlap Area

Table: Overlap Area: Departments and Municipalities

Department: Municipalities

San Marcos: Sibinal, Tajumulco

Totonicapan: Momostenango, Santa Lucia La Reforma

Huehuetenango: Chiantla, Concepcion Huista, Jacaltenango, San Antonio Huista, Todos Santos

Quetzaltenango: San Juan Ostuncalco

Quiche: Chichicastenango, (Santa Maria) Nebaj, Uspantan, Cunen, San Juan Cotzal


Common Indicators

Each survey has its own indicators; 11 common indicators were chosen for study.

The common items concern women’s nutritional status, children’s well-being status, and the prevalence of poverty in households.


Common Indicators

Table: Common Indicators

Indicator: Description

Daily Per Capita Expenditures (PCE): Average daily per capita consumption, constant 2010 USD

Prevalence of Poverty (PP): Percentage of people living on less than $1.25 USD per capita per day

Mean Depth Poverty (MDP): Average of the differences between total daily

Prevalence of Households with Hunger (HHS): Prevalence of households with moderate or severe hunger

Prevalence of Underweight Women: Women who are eligible for BMI (not currently pregnant and not within 2 months of delivery) and who have a BMI less than 18.5

Women’s Dietary Diversity Score (WDDS): Mean number of food groups consumed by women of reproductive age (15-49 years)


Common Indicators

Table: Common Indicators (Cont’d)

Indicator: Description

Prevalence of Stunted Children: Prevalence of stunted children under five years of age (0-59 months)

Prevalence of Wasted Children: Prevalence of wasted children under five years of age (0-59 months)

Prevalence of Underweight Children: Prevalence of underweight children under five years of age (0-59 months)

Prevalence of Children Receiving a Minimum Acceptable Diet (MAD): Prevalence of children 6-23 months receiving a minimum acceptable diet

Prevalence of Exclusive Breastfeeding (EBF): Prevalence of exclusive breastfeeding of children under six months of age


Estimates from two surveys

Table: Daily Per Capita Expenditure

Department        FFP/ICF                   FTF/UNC                   T-statistic
                  N      Mean    S.E.       N      Mean    S.E.
San Marcos        1419   0.558   0.014      981    1.166   0.018      -23.376
Totonicapan       1654   0.388   0.015      181    0.896   0.039      -5.505
Huehuetenango     877    0.456   0.023      1535   1.140   0.018      -30.587
Quetzaltenango    628    0.695   0.022      60     1.325   0.112      -26.179
Quiche            1288   0.382   0.015      1350   1.045   0.015      -12.179


Estimates from two surveys

Table: Prevalence of Households with Hunger (%)

Department        FFP/ICF                   FTF/UNC                   T-statistic
                  N      Mean    S.E.       N      Mean    S.E.
San Marcos        1419   3.76    0.50       981    15.35   1.08       -9.733
Totonicapan       1654   11.79   0.87       181    15.01   2.72       -1.125
Huehuetenango     877    8.91    0.91       1535   15.58   0.87       -5.323
Quetzaltenango    628    6.84    0.91       60     9.94    3.96       -0.765
Quiche            1288   7.13    0.74       1350   9.73    0.77       -2.430
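The T-statistics in the HHS table look consistent with a difference in means divided by the combined standard error. The sketch below assumes that the two survey estimates are independent; the function name `two_sample_t` is hypothetical and is not taken from the slides.

```python
import math

def two_sample_t(mean_a, se_a, mean_b, se_b):
    """Difference in means over the standard error of the difference,
    assuming the two estimates are independent."""
    return (mean_a - mean_b) / math.sqrt(se_a**2 + se_b**2)

# San Marcos row of the HHS table: FFP/ICF 3.76 (0.50), FTF/UNC 15.35 (1.08).
t_san_marcos = two_sample_t(3.76, 0.50, 15.35, 1.08)
print(round(t_san_marcos, 2))
```

The result is close to the tabulated value of -9.733; the small gap is what one would expect from the rounding of the reported means and standard errors.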


Data Structure

Table: Data Structure

            X    Ya   Yb
Sample A    o    o
Sample B    o         o


Goal: Synthetic data imputation

Table: Data Structure

            X    Ya   Yb
Sample A    o    o    o
Sample B    o    o    o


Methodology

Steps

1 Specify a measurement error model.

2 Derive the prediction model using Bayes' theorem.

3 Estimate the parameters with the EM algorithm.

4 Generate imputed values from the prediction model.


Step 1: Model specification

Assume that Sample A is the gold standard; that is, Ya = Y.

Structural Equation model

Ya ∼ f1(ya | x ; θ1).

From the observations in Sample A, we can perform model diagnostics.

Measurement error model

Yb ∼ f2(yb | ya; θ2).

Assume that the measurement error is nondifferential:

f(yb | x, ya) = f(yb | ya)

For dichotomous y-variables, the measurement error model becomes a misclassification model.


Step 2: Prediction model

The prediction model is the model for the counterfactual outcome, conditional on the observed values.

Prediction model for Yb in sample A:

p(yb | x , ya) = f2(yb | ya).

Prediction model for Ya in sample B: using Bayes' formula, we can derive

\[
p(y_a \mid x, y_b) = \frac{f_1(y_a \mid x; \theta_1)\, f_2(y_b \mid y_a; \theta_2)}{\int f_1(y_a \mid x; \theta_1)\, f_2(y_b \mid y_a; \theta_2)\, dy_a}
\]

The prediction model can be used to obtain the best prediction of Yai for i ∈ Sb.
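As a rough illustration of the Bayes formula for one unit in sample B, the ratio can be evaluated numerically on a grid of ya values. This is only a sketch: the normal forms chosen for f1 and f2, the parameter values, and the names `normal_pdf` and `predict_ya` are all assumptions made for the example.

```python
import numpy as np

def normal_pdf(y, mean, sd):
    """Density of N(mean, sd^2) evaluated at y."""
    return np.exp(-0.5 * ((y - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def predict_ya(f1, f2, grid):
    """Posterior mean of Ya on a uniform grid.

    f1(grid) returns f1(ya | x; theta1) and f2(grid) returns f2(yb | ya; theta2)
    for the fixed, observed (x, yb) of one unit in sample B.
    """
    joint = f1(grid) * f2(grid)        # numerator of the Bayes formula
    posterior = joint / joint.sum()    # normalise; the grid spacing cancels
    return float((grid * posterior).sum())

# Hypothetical parameter values, for illustration only.
x_beta, sigma_e = 1.0, 0.5                # assumed: Ya | x ~ N(x_beta, sigma_e^2)
alpha0, alpha1, sigma_u = 0.5, 1.5, 0.5   # assumed: Yb | ya ~ N(alpha0 + alpha1*ya, sigma_u^2)
yb_obs = 2.6

grid = np.linspace(x_beta - 8 * sigma_e, x_beta + 8 * sigma_e, 4001)
ya_hat = predict_ya(
    lambda ya: normal_pdf(ya, x_beta, sigma_e),
    lambda ya: normal_pdf(yb_obs, alpha0 + alpha1 * ya, sigma_u),
    grid,
)

# Closed form for the normal-normal case, used here as a check.
v = 1.0 / (1.0 / sigma_e**2 + alpha1**2 / sigma_u**2)
ya_closed = v * (x_beta / sigma_e**2 + alpha1 * (yb_obs - alpha0) / sigma_u**2)
```

When both densities are normal the integral has a closed form, so the grid is not needed in that case. It is kept here because the same function applies to other choices of f1 and f2.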


Step 3: Parameter estimation - EM algorithm

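E-step: compute

\[
Q_1(\theta_1 \mid \text{data}; \theta^{(t)}) = \sum_{i \in S_a} w_{i,a} \log f_1(y_{ai} \mid x_i; \theta_1) + \sum_{i \in S_b} w_{i,b}\, E\{\log f_1(Y_a \mid x_i; \theta_1) \mid x_i, y_{bi}; \theta^{(t)}\}
\]

\[
Q_2(\theta_2 \mid \text{data}; \theta^{(t)}) = \sum_{i \in S_a} w_{i,a}\, E\{\log f_2(Y_b \mid y_{ai}; \theta_2) \mid x_i, y_{ai}; \theta^{(t)}\} + \sum_{i \in S_b} w_{i,b}\, E\{\log f_2(y_{bi} \mid Y_a; \theta_2) \mid x_i, y_{bi}; \theta^{(t)}\},
\]

where the conditional expectations are computed from the prediction model in Step 2.

M-step: update the parameters by maximizing Q1 and Q2 with respect to θ1 and θ2, respectively.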
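The E-step and M-step can be sketched for one concrete case. The sketch assumes a normal linear structural model and a normal linear measurement error model, so that Ya given (x, yb) is normal and both steps have closed forms. It also updates θ2 from sample B only: Yb is never observed in sample A, and the expected score of that term is zero at a fixed point, so the simplification should not change the limit. The function name `em_normal` and all simulated values are hypothetical.

```python
import numpy as np

def em_normal(xa, ya, wa, xb, yb, wb, n_iter=300):
    """EM sketch for Ya = x'beta + e and Yb = a0 + a1*Ya + u with normal errors.

    (xa, ya, wa): design matrix, response, and weights in sample A.
    (xb, yb, wb): design matrix, proxy response, and weights in sample B.
    """
    # Starting values: least squares on sample A, neutral measurement model.
    beta = np.linalg.lstsq(xa, ya, rcond=None)[0]
    s2e = np.mean((ya - xa @ beta) ** 2)
    a0, a1, s2u = 0.0, 1.0, np.var(yb)
    x_all = np.vstack([xa, xb])
    w_all = np.concatenate([wa, wb])

    for _ in range(n_iter):
        # E-step: in sample B, Ya | x, yb is normal with mean m and variance v.
        v = 1.0 / (1.0 / s2e + a1**2 / s2u)
        m = v * ((xb @ beta) / s2e + a1 * (yb - a0) / s2u)

        # M-step for theta1: weighted least squares on A and B, with the
        # unobserved Ya in B replaced by its conditional mean.
        y_all = np.concatenate([ya, m])
        xtw = x_all.T * w_all
        beta = np.linalg.solve(xtw @ x_all, xtw @ y_all)
        res2 = (y_all - x_all @ beta) ** 2
        res2[len(ya):] += v              # conditional variance term for B
        s2e = np.sum(w_all * res2) / np.sum(w_all)

        # M-step for theta2: expected regression of yb on Ya within sample B.
        m_bar = np.average(m, weights=wb)
        yb_bar = np.average(yb, weights=wb)
        cov = np.average((m - m_bar) * (yb - yb_bar), weights=wb)
        var = np.average((m - m_bar) ** 2, weights=wb) + v
        a1 = cov / var
        a0 = yb_bar - a1 * m_bar
        s2u = np.average((yb - a0 - a1 * m) ** 2, weights=wb) + a1**2 * v
    return beta, s2e, a0, a1, s2u

# Simulated data with hypothetical true values, equal weights in both samples.
rng = np.random.default_rng(0)
n = 2000
xa = np.column_stack([np.ones(n), rng.normal(size=n)])
xb = np.column_stack([np.ones(n), rng.normal(size=n)])
true_beta = np.array([1.0, 2.0])
ya = xa @ true_beta + rng.normal(scale=0.5, size=n)
ya_b = xb @ true_beta + rng.normal(scale=0.5, size=n)   # never observed
yb = 0.5 + 1.5 * ya_b + rng.normal(scale=0.5, size=n)

beta, s2e, a0, a1, s2u = em_normal(xa, ya, np.ones(n), xb, yb, np.ones(n))
```

In this setting θ2 is identified only because x predicts Ya; with a weak covariate the measurement model parameters would be poorly determined.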


Step 4: Best prediction

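Using the measurement error model, we can predict y_{ai} by

\[
\hat{y}_{ai} = E(Y_a \mid x_i, y_{bi}) \quad \text{for } i \in S_b.
\]

A prediction estimator of µ = E(Ya) can be obtained by

\[
\mu^* = \frac{\sum_{i \in S_a} w_{i,a}\, y_{ai} + \sum_{i \in S_b} w_{i,b}\, \hat{y}_{ai}}{\sum_{i \in S_a} w_{i,a} + \sum_{i \in S_b} w_{i,b}}
\]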
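The combined estimator is a weighted mean over both samples, using observed values in sample A and predicted values in sample B. A minimal sketch follows; the function name `combined_mean` and the toy numbers are hypothetical.

```python
import numpy as np

def combined_mean(w_a, y_a, w_b, y_a_hat_b):
    """Weighted mean of observed ya (sample A) and predicted ya (sample B)."""
    w_a, y_a = np.asarray(w_a, float), np.asarray(y_a, float)
    w_b, y_a_hat_b = np.asarray(w_b, float), np.asarray(y_a_hat_b, float)
    numerator = np.sum(w_a * y_a) + np.sum(w_b * y_a_hat_b)
    return numerator / (np.sum(w_a) + np.sum(w_b))

# Toy numbers, not taken from the surveys.
mu_star = combined_mean([10, 20], [0.5, 0.7], [30], [0.6])
```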

Reference: Kim, Berg, and Park (2016). Statistical Matching using fractional imputation. Survey Methodology, 42, 19–40.


Application to FANTA project

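1 Model for PCE

\[
y_{ai} = x_i \beta + e_i
\]

\[
y_{bi} = \alpha_0 + \alpha_1 y_{ai} + u_i
\]

where $e_i \sim N(0, \sigma_e^2)$ and $u_i \sim N(0, \sigma_u^2)$.

2 Model for HHS prevalence

\[
y_{ai} \sim \text{Bernoulli}(\pi_i)
\]

\[
y_{bi} \sim \text{Bernoulli}\{p\, y_{ai} + q\,(1 - y_{ai})\}
\]

where $\text{logit}(\pi_i) = x_i \beta$ and $p, q \in (0, 1)$.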
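For the HHS model, the prediction model from Step 2 reduces to a two-term Bayes ratio, because Ya takes only the values 0 and 1. The sketch below works that ratio out for given values of πi, p, and q; the function name `prob_ya_given_yb` and the numeric inputs are hypothetical.

```python
def prob_ya_given_yb(pi, p, q, yb):
    """P(Ya = 1 | x, Yb = yb) under the misclassification model, where
    pi = P(Ya = 1 | x), p = P(Yb = 1 | Ya = 1), and q = P(Yb = 1 | Ya = 0)."""
    like1 = p if yb == 1 else 1.0 - p    # f2(yb | ya = 1)
    like0 = q if yb == 1 else 1.0 - q    # f2(yb | ya = 0)
    return pi * like1 / (pi * like1 + (1.0 - pi) * like0)

# Hypothetical values, not estimates from the surveys.
post_if_1 = prob_ya_given_yb(0.1, 0.8, 0.1, 1)
post_if_0 = prob_ya_given_yb(0.1, 0.8, 0.1, 0)
```

With these inputs a positive report in sample B raises the probability of Ya = 1 well above πi, while a negative report lowers it.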


Model Diagnostics for PCE model

Figure: Diagnostic plots for the PCE model. Panel 1: Fitted Values vs Residuals (axes: Fitted Values, Residuals). Panel 2: Normal Q-Q Plot (axes: Theoretical Quantiles, Sample Quantiles).


Results for PCE indicator

Department        FFP              FTF              Combined
San Marcos        0.558 (0.030)    1.165 (0.038)    0.563 (0.026)
Totonicapan       0.388 (0.030)    0.895 (0.085)    0.331 (0.028)
Quiche            0.382 (0.030)    1.045 (0.031)    0.396 (0.026)
Huehuetenango     0.456 (0.044)    1.140 (0.036)    0.479 (0.027)
Quetzaltenango    0.695 (0.044)    1.325 (0.232)    0.795 (0.043)


Results for HHS indicator

Department        FFP              FTF              Combined
San Marcos        3.76 (1.01)      15.35 (2.22)     3.77 (1.00)
Totonicapan       11.79 (1.70)     15.01 (6.00)     12.08 (1.60)
Quiche            7.13 (1.50)      9.73 (1.57)      7.19 (1.42)
Huehuetenango     8.91 (1.90)      15.58 (2.00)     8.75 (1.90)
Quetzaltenango    6.84 (1.80)      9.94 (8.25)      6.85 (1.70)


Concluding remark

Survey data integration using a measurement error model is considered.

Prediction of the counterfactual outcome is obtained by Bayes' theorem.

Parameter estimation uses the EM algorithm.

A Bayesian approach can be developed (not discussed here).

Extension to a GLMM for the structural equation model is in progress.

