A measurement error model approach to survey data integration: combining information from two surveys
Jae Kwang Kim 1
Iowa State University
2017 SAE conference, Paris, July 11th, 2017
1 Joint work with Seho Park
Survey data integration
Want to combine information from multiple surveys
Three situations:
1 Multiple samples for one target population
2 One sample each from multiple populations
3 Multiple samples from multiple populations
Small area estimation is a special case of survey data integration, in that multiple sub-populations represent multiple domains.
Kim (ISU) Survey Data Integration 7/11/2017 2 / 25
Motivation
USAID Bureau for Food Security (BFS) sponsors the Food and Nutrition Technical Assistance III project (FANTA).
Key technical areas of focus are food security, maternal and child health, agriculture, and livelihoods strengthening.
Motivation
FANTA has two development projects: Feed the Future (FTF) and Food for Peace (FFP).
The FFP project was conducted by ICF International, and the FTF project was conducted by UNC MEASURE.
Two surveys were conducted in 2013 in selected departments of Guatemala: San Marcos, Totonicapan, Quiche, Quetzaltenango, and Huehuetenango.
Map of Guatemala
FFP and FTF Projects in Guatemala
Figure: Selected Departments in Guatemala
Overlap Area
Figure: FFP ZOI and FFP Project Implementation Area for Guatemala
Overlap Area
Table: Overlap Area: Departments and Municipalities
Department | Municipalities
San Marcos | Sibinal, Tajumulco
Totonicapan | Momostenango, Santa Lucia La Reforma
Huehuetenango | Chiantla, Concepcion Huista, Jacaltenango, San Antonio Huista, Todos Santos
Quetzaltenango | San Juan Ostuncalco
Quiche | Chichicastenango, (Santa Maria) Nebaj, Uspantan, Cunen, San Juan Cotzal
Common Indicators
Each survey has its own indicators; 11 indicators common to both were chosen for study.
The common items concern women's nutritional status, children's well-being, and the prevalence of poverty among households.
Common Indicators
Table: Common Indicators
Indicator | Description
Daily Per Capita Expenditures (PCE) | Average daily per capita consumption (constant 2010 USD)
Prevalence of Poverty (PP) | Percentage of people living on less than $1.25 USD per capita per day
Mean Depth Poverty (MDP) | Average of the differences between total daily
Prevalence of Households with Hunger (HHS) | Prevalence of households with moderate or severe hunger
Prevalence of Underweight Women | Women eligible for BMI measurement (not currently pregnant and not within 2 months of delivery) who have a BMI less than 18.5
Women's Dietary Diversity Score (WDDS) | Mean number of food groups consumed by women of reproductive age (15-49 years)
Common Indicators
Table: Common Indicators (Cont’d)
Indicator | Description
Prevalence of Stunted Children | Prevalence of stunted children under five years of age (0-59 months)
Prevalence of Wasted Children | Prevalence of wasted children under five years of age (0-59 months)
Prevalence of Underweight Children | Prevalence of underweight children under five years of age (0-59 months)
Prevalence of Children Receiving a Minimum Acceptable Diet (MAD) | Prevalence of children 6-23 months receiving a minimum acceptable diet
Prevalence of Exclusive Breastfeeding (EBF) | Prevalence of exclusive breastfeeding of children under six months of age
Estimates from two surveys
Table: Daily Per Capita Expenditure
Department | FFP/ICF N | FFP/ICF Mean | FFP/ICF S.E. | FTF/UNC N | FTF/UNC Mean | FTF/UNC S.E. | T-statistic
San Marcos | 1419 | 0.558 | 0.014 | 981 | 1.166 | 0.018 | -23.376
Totonicapan | 1654 | 0.388 | 0.015 | 181 | 0.896 | 0.039 | -5.505
Huehuetenango | 877 | 0.456 | 0.023 | 1535 | 1.140 | 0.018 | -30.587
Quetzaltenango | 628 | 0.695 | 0.022 | 60 | 1.325 | 0.112 | -26.179
Quiche | 1288 | 0.382 | 0.015 | 1350 | 1.045 | 0.015 | -12.179
Estimates from two surveys
Table: Prevalence of Households with Hunger (%)
Department | FFP/ICF N | FFP/ICF Mean | FFP/ICF S.E. | FTF/UNC N | FTF/UNC Mean | FTF/UNC S.E. | T-statistic
San Marcos | 1419 | 3.76 | 0.50 | 981 | 15.35 | 1.08 | -9.733
Totonicapan | 1654 | 11.79 | 0.87 | 181 | 15.01 | 2.72 | -1.125
Huehuetenango | 877 | 8.91 | 0.91 | 1535 | 15.58 | 0.87 | -5.323
Quetzaltenango | 628 | 6.84 | 0.91 | 60 | 9.94 | 3.96 | -0.765
Quiche | 1288 | 7.13 | 0.74 | 1350 | 9.73 | 0.77 | -2.430
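The slides do not state how the T-statistic column was computed. As a rough check, and assuming the two surveys are independent so that the standard error of a difference is the square root of the sum of the squared standard errors, the sketch below recomputes the statistic from the means and standard errors reported in the hunger table. The function name `two_sample_t` is made up for this illustration.

```python
import math


def two_sample_t(mean_a, se_a, mean_b, se_b):
    """t-type statistic for the difference of two independent estimates (assumed form)."""
    return (mean_a - mean_b) / math.sqrt(se_a**2 + se_b**2)


# (FFP mean, FFP S.E., FTF mean, FTF S.E., reported T) copied from the hunger table
hhs = {
    "San Marcos": (3.76, 0.50, 15.35, 1.08, -9.733),
    "Totonicapan": (11.79, 0.87, 15.01, 2.72, -1.125),
    "Huehuetenango": (8.91, 0.91, 15.58, 0.87, -5.323),
    "Quetzaltenango": (6.84, 0.91, 9.94, 3.96, -0.765),
    "Quiche": (7.13, 0.74, 9.73, 0.77, -2.430),
}

for dept, (ma, sa, mb, sb, reported) in hhs.items():
    t = two_sample_t(ma, sa, mb, sb)
    print(f"{dept:15s} recomputed {t:8.3f}   reported {reported:8.3f}")
```

Under this assumption the recomputed values agree with the reported column to within about 0.03, which is consistent with rounding of the published means and standard errors.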
Data Structure
Table: Data Structure
Sample | X | Ya | Yb
Sample A | o | o |
Sample B | o | | o
Goal: Synthetic data imputation
Table: Data Structure
Sample | X | Ya | Yb
Sample A | o | o | o
Sample B | o | o | o
Methodology
Steps
1 Specify a measurement error model.
2 Derive the prediction model using Bayes' theorem.
3 Estimate the parameters with the EM algorithm.
4 Generate imputed values from the prediction model.
Step 1: Model specification
Assume that Sample A is the gold standard; that is, Ya = Y.
Structural Equation model
Ya ∼ f1(ya | x ; θ1).
From the observations in Sample A, we can perform model diagnostics.
Measurement error model
Yb ∼ f2(yb | ya; θ2).
Assume that the measurement error is nondifferential, that is, Yb depends on x only through ya:
f(yb | x, ya) = f(yb | ya)
For dichotomous y-variables, the measurement error model becomes a misclassification model.
Step 2: Prediction model
The prediction model is the model for the counterfactual outcome, conditional on the observed values.
Prediction model for Yb in sample A:
p(yb | x , ya) = f2(yb | ya).
Prediction model for Ya in sample B: using Bayes' formula, we can derive
$$p(y_a \mid x, y_b) = \frac{f_1(y_a \mid x; \theta_1)\, f_2(y_b \mid y_a; \theta_2)}{\int f_1(y_a \mid x; \theta_1)\, f_2(y_b \mid y_a; \theta_2)\, dy_a}$$
The prediction model can be used to obtain the best prediction of Yai for i ∈ Sb.
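For a dichotomous Y, the integral in the Bayes formula becomes a two-term sum. The sketch below works this out for a misclassification model, assuming P(Ya = 1 | x) = pi, P(Yb = 1 | Ya = 1) = p, and P(Yb = 1 | Ya = 0) = q. The function name and the numeric values are illustrative only; they are not estimates from the surveys.

```python
def predict_ya_given_yb(pi, p, q, yb):
    """P(Ya = 1 | x, Yb = yb) by Bayes' formula for binary Ya and Yb."""
    # f2(yb | ya) evaluated at ya = 1 and ya = 0
    like1 = p if yb == 1 else 1.0 - p
    like0 = q if yb == 1 else 1.0 - q
    # Numerator terms f1(ya | x) * f2(yb | ya); their sum is the denominator
    num1 = pi * like1
    num0 = (1.0 - pi) * like0
    return num1 / (num1 + num0)


post_if_1 = predict_ya_given_yb(pi=0.10, p=0.80, q=0.15, yb=1)
post_if_0 = predict_ya_given_yb(pi=0.10, p=0.80, q=0.15, yb=0)
print(round(post_if_1, 3), round(post_if_0, 3))  # prints 0.372 0.025
```

With these made-up values, observing Yb = 1 raises the probability that Ya = 1 from 0.10 to about 0.37, while observing Yb = 0 lowers it to about 0.03.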
Step 3: Parameter estimation - EM algorithm
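E-step: compute
$$Q_1(\theta_1 \mid \text{data}; \theta^{(t)}) = \sum_{i \in S_a} w_{i,a} \log f_1(y_{ai} \mid x_i; \theta_1) + \sum_{i \in S_b} w_{i,b}\, E\{\log f_1(Y_a \mid x_i; \theta_1) \mid x_i, y_{bi}; \theta^{(t)}\}$$
$$Q_2(\theta_2 \mid \text{data}; \theta^{(t)}) = \sum_{i \in S_a} w_{i,a}\, E\{\log f_2(Y_b \mid y_{ai}; \theta_2) \mid x_i, y_{ai}; \theta^{(t)}\} + \sum_{i \in S_b} w_{i,b}\, E\{\log f_2(y_{bi} \mid Y_a; \theta_2) \mid x_i, y_{bi}; \theta^{(t)}\}$$
where the conditional expectations are computed from the prediction model in Step 2.
M-step: update the parameters by maximizing Q1 and Q2 with respect to θ1 and θ2, respectively.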
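The E- and M-steps can be sketched for the special case in which both models are normal and linear (ya = x'β + e and yb = α0 + α1 ya + u), where the conditional expectations are available in closed form. This is only one possible implementation, not the authors' code: the function names, the moment-based starting values, and the simulated data are all assumptions made for illustration.

```python
import numpy as np


def wls(X, y, w):
    """Weighted least-squares coefficients."""
    sw = np.sqrt(w)
    return np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]


def em_normal(xa, ya, wa, xb, yb, wb, n_iter=200):
    """EM sketch for y_a = x'beta + e and y_b = a0 + a1*y_a + u (normal errors)."""
    # Starting values (an assumption): fit Sample A, then match moments in Sample B.
    beta = wls(xa, ya, wa)
    s2e = np.average((ya - xa @ beta) ** 2, weights=wa)
    zb = np.column_stack([np.ones(len(yb)), xb @ beta])
    a0, a1 = wls(zb, yb, wb)
    resid_var = np.average((yb - a0 - a1 * (xb @ beta)) ** 2, weights=wb)
    s2u = max(resid_var - a1**2 * s2e, 0.1 * resid_var)

    X = np.vstack([xa, xb])
    w = np.concatenate([wa, wb])
    for _ in range(n_iter):
        # E-step, Sample B: y_a | x, y_b is normal with mean m and variance v.
        v = 1.0 / (1.0 / s2e + a1**2 / s2u)
        m = v * ((xb @ beta) / s2e + a1 * (yb - a0) / s2u)
        # E-step, Sample A: Y_b | y_a is normal with mean mu_b and variance s2u.
        mu_b = a0 + a1 * ya
        e_ya = np.concatenate([ya, m])  # conditional mean of Y_a
        v_ya = np.concatenate([np.zeros(len(ya)), np.full(len(yb), v)])
        e_yb = np.concatenate([mu_b, yb])  # conditional mean of Y_b
        v_yb = np.concatenate([np.full(len(ya), s2u), np.zeros(len(yb))])

        # M-step for theta1 = (beta, s2e): maximise Q1.
        beta = wls(X, e_ya, w)
        s2e = np.average((e_ya - X @ beta) ** 2 + v_ya, weights=w)

        # M-step for theta2 = (a0, a1, s2u): maximise Q2 via its normal equations.
        lhs = np.array([[w.sum(), w @ e_ya],
                        [w @ e_ya, w @ (e_ya**2 + v_ya)]])
        rhs = np.array([w @ e_yb, w @ (e_ya * e_yb)])
        a0, a1 = np.linalg.solve(lhs, rhs)
        s2u = np.average((e_yb - a0 - a1 * e_ya) ** 2 + a1**2 * v_ya + v_yb, weights=w)
    return beta, s2e, a0, a1, s2u


# Simulated data with made-up parameter values (not the survey data).
rng = np.random.default_rng(0)
n = 500
beta_true = np.array([1.0, 0.5, -0.3])
xa = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
xb = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
ya = xa @ beta_true + rng.normal(scale=0.3, size=n)
ya_in_b = xb @ beta_true + rng.normal(scale=0.3, size=n)  # never observed in Sample B
yb = 0.5 + 1.2 * ya_in_b + rng.normal(scale=0.4, size=n)
wa = np.ones(n)
wb = np.ones(n)

beta, s2e, a0, a1, s2u = em_normal(xa, ya, wa, xb, yb, wb)
print(beta, s2e, a0, a1, s2u)
```

Because ya and yb are never observed together, the measurement error parameters are identified only through the dependence of yb on x in Sample B, so the covariates need to have real predictive power for ya in a sketch like this.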
Step 4: Best prediction
Using the measurement error model, we can predict yai by ŷai = E(Ya | xi, ybi) for i ∈ Sb.
A prediction estimator of µ = E(Ya) can be obtained by
$$\mu^* = \frac{\sum_{i \in S_a} w_{i,a}\, y_{ai} + \sum_{i \in S_b} w_{i,b}\, \hat{y}_{ai}}{\sum_{i \in S_a} w_{i,a} + \sum_{i \in S_b} w_{i,b}}$$
Reference: Kim, Berg, and Park (2016). Statistical matching using fractional imputation. Survey Methodology, 42, 19–40.
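The prediction estimator is a weighted mean that pools the observed ya values from Sample A with the predicted values from Sample B. A minimal sketch follows; the function name and the toy numbers are hypothetical, and the predictions are assumed to have been computed already from the prediction model.

```python
import numpy as np


def prediction_estimate(ya_obs, wa, ya_pred, wb):
    """Weighted mean of observed y_a (Sample A) and predicted y_a (Sample B)."""
    ya_obs = np.asarray(ya_obs, dtype=float)
    wa = np.asarray(wa, dtype=float)
    ya_pred = np.asarray(ya_pred, dtype=float)
    wb = np.asarray(wb, dtype=float)
    return (wa @ ya_obs + wb @ ya_pred) / (wa.sum() + wb.sum())


# Toy inputs: two Sample A units and three Sample B units with their weights.
mu_star = prediction_estimate(
    ya_obs=[0.5, 0.7], wa=[10, 20],
    ya_pred=[0.6, 0.4, 0.8], wb=[5, 5, 10],
)
print(round(mu_star, 2))  # prints 0.64
```

Here the numerator is 19 + 13 = 32 and the total weight is 50, so the pooled estimate is 0.64.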
Application to FANTA project
1 Model for PCE
yai = xi β + ei
ybi = α0 + α1 yai + ui
where ei ∼ N(0, σe²) and ui ∼ N(0, σu²).
2 Model for HHS prevalence
yai ∼ Bernoulli(πi)
ybi ∼ Bernoulli{p yai + q (1 − yai)}
where logit(πi) = xi β and p, q ∈ (0, 1).
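For these two models the prediction model from Step 2 has a closed form. The expressions below are standard normal-normal and two-point Bayes calculations written in the notation of the slides; the symbols m_i and v for the conditional mean and variance are introduced here and do not appear in the original.

```latex
% Model 1 (PCE): both f_1 and f_2 are normal, so the conditional is normal
y_{ai} \mid x_i, y_{bi} \sim N(m_i, v), \qquad
v = \left( \frac{1}{\sigma_e^2} + \frac{\alpha_1^2}{\sigma_u^2} \right)^{-1}, \qquad
m_i = v \left\{ \frac{x_i \beta}{\sigma_e^2} + \frac{\alpha_1 (y_{bi} - \alpha_0)}{\sigma_u^2} \right\}

% Model 2 (HHS): the integral in the Bayes formula reduces to a two-term sum
P(y_{ai} = 1 \mid x_i, y_{bi} = 1) = \frac{\pi_i\, p}{\pi_i\, p + (1 - \pi_i)\, q}, \qquad
P(y_{ai} = 1 \mid x_i, y_{bi} = 0) = \frac{\pi_i (1 - p)}{\pi_i (1 - p) + (1 - \pi_i)(1 - q)}
```

In Model 1 the best prediction is a precision-weighted compromise between the regression prediction x_i β and the back-transformed measurement (y_bi − α0)/α1.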
Model Diagnostics for PCE model
Figure: Diagnostic plots for the PCE model. Panel 1: Fitted Values vs. Residuals (axes: Fitted Values, Residuals). Panel 2: Normal Q-Q Plot (axes: Theoretical Quantiles, Sample Quantiles).
Results for PCE indicator
Department | FFP | FTF | Combined
San Marcos | 0.558 (0.030) | 1.165 (0.038) | 0.563 (0.026)
Totonicapan | 0.388 (0.030) | 0.895 (0.085) | 0.331 (0.028)
Quiche | 0.382 (0.030) | 1.045 (0.031) | 0.396 (0.026)
Huehuetenango | 0.456 (0.044) | 1.140 (0.036) | 0.479 (0.027)
Quetzaltenango | 0.695 (0.044) | 1.325 (0.232) | 0.795 (0.043)
Results for HHS indicator
Department | FFP | FTF | Combined
San Marcos | 3.76 (1.01) | 15.35 (2.22) | 3.77 (1.00)
Totonicapan | 11.79 (1.70) | 15.01 (6.00) | 12.08 (1.60)
Quiche | 7.13 (1.50) | 9.73 (1.57) | 7.19 (1.42)
Huehuetenango | 8.91 (1.90) | 15.58 (2.00) | 8.75 (1.90)
Quetzaltenango | 6.84 (1.80) | 9.94 (8.25) | 6.85 (1.70)
Concluding remark
Survey data integration using a measurement error model is considered.
Prediction of the counterfactual outcome is obtained by Bayes' theorem.
Parameter estimation uses the EM algorithm.
A Bayesian approach can be developed (not discussed here).
Extension to a GLMM for the structural equation model is in progress.