Transferring variables between different data-sets
Using imputation of individual scores
Bojan Todosijevic, University of Twente
German Stata Users’ Meeting, April 2, 2007, Essen
2
The problem: Data are scattered across different data sets (surveys, census data, etc.).
Typical solution: Data aggregation (geographical, cohort).
The present task: To test a more general model for transferring data/variables between data-sets, based on the imputation of individual scores.
3
Advantages of the individually imputed scores:
Wider range of applications (e.g., variables of interest may be unrelated to geographic or cohort units)
Aggregation method tends to neglect variability within aggregation units; individual imputation method retains information about distribution.
4
The proposed approach:
A question not asked in one survey could be seen as a special case of the missing data problem (Gelman et al., 1998).
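Adopt the Bayesian multiple imputation (MI) approach (Rubin, 1987).
When data are missing because a question was not asked, the MAR (missing at random) assumption applies:
$P(R \mid Y_{\text{complete}}) = P(R \mid Y_{\text{observed}})$
where R indicates which values are missing.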
5
Assessing the feasibility of the approach
1. Two data-sets selected - SOCON 2000 and NKO 2002 - contain a number of equivalent variables
2. Target variable: Left-Right self-placement – from SOCON to NKO
3. Test and comparisons of the ‘real’ and imputed L-R variables
6
Structure of the merged file
                        |   data file record
      type of interview | SOCON2002   NKO 2002 |     Total
------------------------+----------------------+----------
 NKO 2002 1st wave only |         0        333 |       333
  NKO 1st and 2nd waves |         0        287 |       287
NKO 1st, 2nd and 3rd w. |         0      1,287 |     1,287
          NKO 2003 only |         0      1,271 |     1,271
                        |                      |
             SOCON 2002 |     1,008          0 |     1,008
------------------------+----------------------+----------
                  Total |     1,008      3,178 |     4,186
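The merged-file layout above can be sketched in a few lines of Python. This is only an illustration of the idea that the target variable is "missing by design" in the recipient survey; the function names and the toy records are hypothetical and are not taken from SOCON or NKO.

```python
# Hypothetical sketch: stack a donor survey and a recipient survey so that the
# target variable is "missing by design" in every recipient row.

def stack_surveys(donor_rows, recipient_rows, target="l_r"):
    """Return one list of records with a 'source' marker on each row."""
    stacked = []
    for row in donor_rows:
        stacked.append({**row, "source": "SOCON"})
    for row in recipient_rows:
        # The recipient survey never asked the target question: mark it missing.
        stacked.append({**row, target: None, "source": "NKO"})
    return stacked


def count_by_source(rows):
    """Tabulate the number of records per source file."""
    counts = {}
    for row in rows:
        counts[row["source"]] = counts.get(row["source"], 0) + 1
    return counts


# Toy records (invented values, for illustration only).
socon = [{"age": 40, "educatio": 3, "l_r": 6}, {"age": 25, "educatio": 5, "l_r": 4}]
nko = [{"age": 33, "educatio": 4}, {"age": 58, "educatio": 2}, {"age": 47, "educatio": 3}]

merged = stack_surveys(socon, nko)
print(count_by_source(merged))
```

Once the files are stacked this way, transferring the variable becomes an ordinary missing-data problem: the donor rows supply the relationship between the common variables and the target, and the recipient rows receive imputed values.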
7
Imputation procedure and software
ICE: MICE application for Stata (Royston, 2005)
UVIS: Univariate imputation sampling
ice imputes missing values by using switching regression, an iterative multivariable regression technique (Stata module written by Patrick Royston, 2005). The multivariate distribution is estimated from the incomplete data in a Gibbs sampling process (Van Buuren & Oudshoorn, 1999).
uvis imputes missing values in a single variable based on multiple regression on a list of predictors. uvis is called repeatedly by ice in a regression switching mode to perform multivariate imputation.
8
Common NKO and SOCON variables
Name        Variable
urb2        Urbanization
sex         Sex
age         Age
class       Class - self-description
zincome     Household income (standardized)
educatio    Education level
church_a    Religious service attendance
party       Party choice (hypothetical, vote intention, vote recollection)
employed    Employment status
pm          Post-materialism index
pol_int     Political interest
d_proud     Proud being Dutch
L-R         Left-right self-placement, SOCON 2000
L-R1        Left-Right self-placement, NKO 1st wave
L-R2        Left-Right self-placement, NKO 2nd wave
L-R3        Left-Right self-placement, NKO 3rd wave
9
Imputation – three steps
1. Imputation of the common variables in the SOCON file (using ice)
2. Imputation of the common variables in the NKO file (using ice)
3. Imputation of the L-R variable – from SOCON to NKO (‘the main thing’), using uvis.
10
Multiple Imputation - the SOCON variables
Stata command for imputation:
    ice l_r urb2 sex age class zincome educatio church_a party employed pm pol_int d_proud ///
        using SOCON_iced, m(5) ///
        match(l_r urb2 sex age class educatio church_a party employed pm pol_int d_proud) ///
        cmd(urb2 class pol_int d_proud pm:regress) ///
        cycles(50) seed(14) replace
11
Multiple Imputation - the NKO file
Stata command for imputation:
    ice urb2 sex age class zincome educatio church_a party employed pm pol_int d_proud ///
        using NKO_iced, m(5) ///
        match(urb2 sex age class educatio church_a party employed pm pol_int d_proud) ///
        cmd(urb2 class church_a pol_int d_proud pm:regress) ///
        cycles(50) seed(14) replace
L-R variables not included
12
Imputation of the L-R variable – from SOCON to NKO
Merged the imputed SOCON and NKO files (each containing the original and 5 imputed data-sets).
For each of the 5 imputed SOCON-NKO combinations, a univariate imputation of the L-R variable (from SOCON to NKO) was performed (using uvis).
Stata command for imputation:
    uvis regress l_r urb2 sex age class zincome educatio church_a party employed pm pol_int d_proud ///
        if _j==1, gen(l_r_uvis1) match seed(14) replace
Seed numbers for imputations 1 to 5: 14, 32, 432, 11, 55.
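The matching idea behind this step can be illustrated with a small Python sketch. It is a deliberately simplified stand-in, not a reimplementation of uvis: it uses a single hypothetical predictor, fits ordinary least squares on the donor file, and gives each recipient the observed value of the donor whose predicted score is closest. Any random draw of the regression parameters, which proper multiple imputation relies on, is left out, and all names and numbers are invented for illustration.

```python
# Simplified sketch of imputation by predictive mean matching with one predictor.
# Assumption: no random draw of regression parameters is made here, so this
# shows the matching step only, not a full multiple-imputation procedure.

def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def impute_by_matching(donor_x, donor_y, recipient_x):
    """Give each recipient the observed y of the donor with the closest prediction."""
    intercept, slope = fit_simple_ols(donor_x, donor_y)
    donor_pred = [intercept + slope * x for x in donor_x]
    imputed = []
    for x in recipient_x:
        pred = intercept + slope * x
        nearest = min(range(len(donor_pred)), key=lambda i: abs(donor_pred[i] - pred))
        imputed.append(donor_y[nearest])
    return imputed


# Toy data: donors have both the predictor and the target, recipients only the predictor.
donor_x = [1, 2, 3, 4, 5]
donor_y = [2, 3, 5, 6, 9]
recipient_x = [1.2, 4.9, 3.1]

print(impute_by_matching(donor_x, donor_y, recipient_x))
```

Because every imputed score is copied from an actual donor, the imputed variable can only take values that were observed in the donor file, which is one reason matching tends to preserve the shape of the original distribution.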
The imputation equation (DV: SOCON L-R)
      Source |       SS       df       MS         Number of obs =    1008
-------------+------------------------------      F( 12,   995) =   55.22
       Model |  1717.64517    12  143.137097      Prob > F      =  0.0000
    Residual |  2579.00563   995  2.59196546      R-squared     =  0.3998
-------------+------------------------------      Adj R-squared =  0.3925
       Total |  4296.65079  1007  4.26678331      Root MSE      =    1.61
R = .63
-----------------------------------------------------------------
   SOCON l_r |      Coef.  Std. Err.        t   P>|t|        Beta
-------------+---------------------------------------------------
        urb2 |   .0570569   .0379983     1.50   0.134    .0387282
         sex |   -.156749   .1075442    -1.46   0.145   -.0379609
         age |  -.0013591   .0043088    -0.32   0.752   -.0087341
       class |   .3938487   .0834685     4.72   0.000    .1461826
     zincome |   .0254967   .0627695     0.41   0.685    .0123456
    educatio |  -.1585475   .0315719    -5.02   0.000   -.1664914
    church_a |  -.2379957   .0522299    -4.56   0.000   -.1199417
       party |   .3608413   .0198037    18.22   0.000      .48643
    employed |    .030681   .1241836     0.25   0.805    .0068578
          pm |  -.2921879   .1009341    -2.89   0.004   -.0801642
     pol_int |    .173143   .0716642     2.42   0.016    .0685405
     d_proud |   -.195067   .0623608    -3.13   0.002   -.0822013
       _cons |   4.659597   .5426508     8.59   0.000           .
-------------+---------------------------------------------------
Descriptive statistics for the original and five imputed L-R variables
Variable               No. of Obs.    Mean    S.D.   Skewness   Kurtosis
Original SOCON variable
  l_r (SOCON)                 1008   5.329   2.066      0.069      2.451
Imputed variables
  l_r_uvis (NKO) - 1          3178   5.388   1.997      0.037      2.572
  l_r_uvis (NKO) - 2          3178   5.177   2.062      0.144      2.494
  l_r_uvis (NKO) - 3          3178   5.269   2.042      0.065      2.404
  l_r_uvis (NKO) - 4          3178   5.354   2.111      0.055      2.385
  l_r_uvis (NKO) - 5          3178   5.241   1.985      0.186      2.632
Original NKO variables
  l_r1                        1871   5.040   2.007      0.058      2.421
  l_r2                        1546   5.224   2.125      0.010      2.283
  l_r3                        2495   5.268   2.170      0.014      2.151
15
Correlation between the original NKO L-R variables
          l_r1    l_r2
l_r2      .760       1
l_r3      .711    .779
Correlation between the imputed and original NKO L-R variables
        l_r_uvis-1  l_r_uvis-2  l_r_uvis-3  l_r_uvis-4  l_r_uvis-5
l_r1          .332        .392        .353        .375        .377
l_r2          .403        .418        .394        .445        .419
l_r3          .402        .408        .395        .449        .416
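As a quick way to summarize the fifteen correlations between the imputed and the original NKO L-R variables, the following Python sketch averages them by row and overall. The values are copied from the table above; the simple arithmetic mean is an assumption made for brevity (averaging Fisher z-transformed values would be the more careful choice).

```python
from statistics import mean

# Correlations of each original NKO L-R variable with the five imputed versions,
# copied from the table above.
correlations = {
    "l_r1": [0.332, 0.392, 0.353, 0.375, 0.377],
    "l_r2": [0.403, 0.418, 0.394, 0.445, 0.419],
    "l_r3": [0.402, 0.408, 0.395, 0.449, 0.416],
}

# Plain arithmetic means, per original variable and across all 15 values.
row_means = {name: round(mean(values), 3) for name, values in correlations.items()}
all_values = [r for values in correlations.values() for r in values]
overall = round(mean(all_values), 3)

print(row_means)
print(overall)
```

The overall mean rounds to .40 at two decimals, and the first-wave variable (l_r1) correlates somewhat less with the imputed scores than the second- and third-wave variables do.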
The imputation equation (DV: imputed L-R)
Multiple imputation parameter estimates (5 imputations)
-------------------------------------------------------------------------------
 l_r_Imputed |      Coef.  Std. Err.        t   P>|t|   Comparison with SOCON
-------------+-----------------------------------------------------------------
        urb2 |   .0750759   .0254447     2.95   0.003   became significant
         sex |  -.1851431   .1260133    -1.47   0.142   almost identical
         age |   .0009682   .0037543     0.26   0.797   almost identical
       class |   .3853003   .0718502     5.36   0.000   almost identical
     zincome |   .0022461   .0421326     0.05   0.957   almost identical
    educatio |  -.1453613   .0223744    -6.50   0.000   almost identical
    church_a |  -.2166528   .0648066    -3.34   0.001   almost identical
       party |   .3906498   .0247788    15.77   0.000   almost identical
    employed |    .011624   .0950614     0.12   0.903   almost identical
          pm |  -.3594543   .0790244    -4.55   0.000   increased a bit
     pol_int |   .1972884    .101904     1.94   0.053   coef. increased, sig. dropped
     d_proud |   -.243719   .0534851    -4.56   0.000   increased a bit
       _cons |   4.610467   .5392112     8.55   0.000
-------------+-----------------------------------------------------------------
3178 observations (imputation 1).
R-squared across the 5 imputations ranges from .39 to .43.
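Combined multiple-imputation estimates of this kind are conventionally obtained with Rubin's (1987) rules: the pooled coefficient is the mean of the per-imputation coefficients, and its variance adds the average within-imputation variance to the between-imputation variance inflated by (1 + 1/m). The sketch below shows that calculation in Python. The five coefficients and standard errors are hypothetical placeholders, since the slides report only the pooled results, and whether the Stata output was produced in exactly this way is an assumption.

```python
from math import sqrt
from statistics import mean, variance


def pool_rubin(estimates, std_errors):
    """Pool m point estimates and standard errors with Rubin's rules."""
    m = len(estimates)
    pooled = mean(estimates)
    within = mean([se ** 2 for se in std_errors])  # average sampling variance
    between = variance(estimates)                  # variance across imputations (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, sqrt(total)


# Hypothetical per-imputation results for one coefficient (not from the slides).
coefs = [0.38, 0.40, 0.39, 0.41, 0.37]
ses = [0.025, 0.025, 0.025, 0.025, 0.025]

pooled_coef, pooled_se = pool_rubin(coefs, ses)
print(round(pooled_coef, 4), round(pooled_se, 4))
```

The pooled standard error is larger than any single-imputation standard error whenever the coefficients differ across imputations, which is how the extra uncertainty due to imputing the variable enters the inference.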
17
Comparison of the imputed with the ‘original’ NKO L-R variables
18
Relationships with variables NOT included in the imputation model
Correction for attenuation
$\rho_{\text{imputed L-R}} = .40 \times (1/.78) = .51$
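The arithmetic on this slide can be reproduced with a short Python sketch. It follows the slide's own calculation, dividing the observed correlation by a single coefficient of .78, which looks like the correlation of .779 between l_r2 and l_r3 used as a reliability estimate (an interpretation, not something the slides state). The textbook Spearman formula divides instead by the square root of the product of both variables' reliabilities. The function name is hypothetical.

```python
def correct_for_attenuation(observed_r, reliability):
    """Scale an observed correlation by 1/reliability, as done on the slide."""
    if not 0 < reliability <= 1:
        raise ValueError("reliability must be in (0, 1]")
    return observed_r / reliability


# Values from the slide: observed r = .40, reliability estimate = .78.
corrected = correct_for_attenuation(0.40, 0.78)
print(round(corrected, 2))
```

Under this reading, the observed correlation of about .40 between the imputed and the real L-R scores corresponds to about .51 once the unreliability of the real scores is taken into account.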
19
Correlations with attitudinal variables
Satisfaction with government and democracy
Item                                                      l_r_uvis    L-R uvis    l_r1     l_r2     l_r3
                                                                     corrected
I/141 General satisfaction with government                  .0983*       .138   .1678*   .1736*   .1719*
I/ Policy satisfaction score 2002                           -.0836*      -.117  -.1225*  -.1343*  -.1227*
I/142 Satisfaction with democracy                            .0634*       .089   .1026*   .1302*   .1214*
II/291 Satisfaction with democracy in the Netherlands        .0625*       .088   .1196*   .1199*   .1449*
II/292 Satisfaction with democracy in the European Union    -.0312       -.044   .0149   -.0057    .0209
II/299 Democracy is the best form of government              .0657*       .092   .1049*   .1006*   .1045*
III/1291 Satisfaction with democracy in the Netherlands      .0099        .014   .0644*   .0560*   .0595*
III/1299 Democracy is the best form of government            .0603*       .084   .0594*   .0446    .0528*
20
Correlations with attitudinal variables
Attitude toward political parties: 1st wave
Item                                      l_r_uvis    L-R uvis    l_r1     l_r2     l_r3
                                                     corrected
I/281 Sympathy score: CDA                    .1700*       .238   .3329*   .3826*   .3675*
I/281 Sympathy score: PvdA                  -.2373*      -.332  -.3630*  -.3962*  -.4112*
I/281 Sympathy score: VVD                    .2125*       .298   .3901*   .3787*   .3888*
I/281 Sympathy score: D66                   -.2293*      -.321  -.2575*  -.3165*  -.2959*
I/281 Sympathy score: GroenLinks            -.3143*      -.440  -.4679*  -.4928*  -.4933*
I/281 Sympathy score: Leefbaar Nederland     .1104*       .155   .2335*   .2352*   .2351*
I/281 Sympathy score: Lijst Pim Fortuyn      .2114*       .296   .4013*   .4114*   .3993*
I/281 Sympathy score: SGP                    .1074*       .150   .2095*   .2403*   .2548*
I/281 Sympathy score: ChristenUnie           .0988*       .138   .1616*   .1938*   .1832*
I/281 Sympathy score: SP                    -.2414*      -.338  -.4062*  -.4220*  -.4122*
21
Correlations with attitudinal variables
Attitude toward political parties: 3rd wave
Item                                       l_r_uvis    L-R uvis    l_r1     l_r2     l_r3
                                                      corrected
III/281 Sympathy score: CDA                   .2605*       .365   .4097*   .4566*   .4914*
III/281 Sympathy score: PvdA                  -.289*      -.405  -.4335*  -.4807*  -.5020*
III/281 Sympathy score: VVD                   .3132*       .439   .4707*   .5279*   .5475*
III/281 Sympathy score: D66                  -.2158*      -.302  -.2765*  -.3284*  -.3063*
III/281 Sympathy score: GroenLinks           -.3846*      -.539  -.5382*  -.5842*  -.5550*
III/281 Sympathy score: Lijst Pim Fortuyn     .2807*       .393   .4062*   .4473*   .4635*
III/281 Sympathy score: SGP                   .1326*       .186   .2191*   .2596*   .2471*
III/281 Sympathy score: ChristenUnie          .0597*       .084   .1519*   .1715*   .1581*
III/281 Sympathy score: SP                   -.3405*      -.477  -.4865*  -.5241*  -.5261*
22
Various political attitudes I
Item                                                                 l_r_uvis    L-R uvis    l_r1     l_r2     l_r3
                                                                                corrected
II/283 European unification - position of respondent                    .0881*       .123   .1320*   .1337*   .1365*
II/297 Attention to individual freedom and human rights                -.0860*      -.120  -.1731*  -.1977*  -.1640*
II/349 MP's do not care about opinions of people like me               -.1112*      -.156  -.1091*  -.0988*  -.0717*
II/349 Parties are only interested in my vote and not in my opinion    -.0998*      -.140  -.1485*  -.1130*  -.1121*
II/349 People like me have no influence on politics                    -.1131*      -.158  -.1144*  -.0909*  -.1024*
II/349 So many people vote, my vote does not matter                    -.0357       -.050  -.0890*  -.0649*  -.0482
II/ External political efficacy score 2002                             -.1326*      -.186  -.1621*  -.1406*  -.1239*
II/355 Consider myself qualified for politics                           .0627*       .088   .0770*   .0712*   .1120*
II/355 Good understanding of political problems                        -.0271       -.038  -.0115    .0159    .0158
II/355 Politics too complicated                                        -.0565*      -.079  -.0758*  -.0661*  -.0655*
II/ Internal political efficacy score 2002                              .0067        .009  -.0117    -.027   -.0427
II/355 Politicians promise more than they can deliver                  -.0569*      -.080  -.0611*  -.0562*  -.0415
II/355 Ministers and junior-ministers are primarily self-interested    -.0778*      -.109  -.1327*  -.1041*  -.1138*
II/355 Friends more important than abilities to become MP              -.0592*      -.083  -.1204*  -.0792*  -.0788*
II/ Political cynicism score 2002                                       .0696*       .097   .1409*   .1072*   .1216*
II/355 Most Dutch parties look alike                                    .0093        .013  -.0409   -.0206   -.0134
23
Various political attitudes II
Item                                                        l_r_uvis    L-R uvis    l_r1     l_r2     l_r3
                                                                       corrected
I/ Political knowledge 1                                      -.0145       -.020  -.0982*  -.1206*  -.1143*
I/ Political knowledge 2                                      -.0129       -.018  -.0799*  -.1004*  -.0899*
II/380 Trust in people                                         .1532*       .215   .1833*   .1924*   .1795*
III/1380 Trust in people                                       .1379*       .193   .1943*   .2032*   .1576*
II/298 Corruption in Dutch politics                           -.1146*      -.160  -.1262*  -.1237*  -.1265*
III/1293 Views MP's are good reflection of views voters        .0439*       .061   .0358    .0483    .0245
III/1295 Parties necessary for functioning of democracy        .0114        .016   .0676*   .0297    .0437*
III/1303 Which aspect should politicians emphasize?           -.0299       -.042  -.0765*  -.0854*  -.0942*
III/1310 Functions of elections                               -.0294       -.041  -.0023    .0083   -.0157
III/1350 MP's do not care about opinions of people like me    -.0876*      -.123  -.1177*  -.1395*  -.0851*
24
Various political attitudes III
Item                                                   l_r_uvis    L-R uvis    l_r1     l_r2     l_r3
                                                                  corrected
II/ Confessional attitude score 2002                      .1000*       .140   .1291*   .1513*   .1750*
III/132 Income differences - position of respondent      -.2275*      -.319  -.3445*  -.3808*  -.3765*
III/133 Asylum seekers - position of respondent           .2916*       .408   .4179*   .4500*   .4758*
III/134 European unification - position of respondent     .1258*       .176   .1374*   .1654*   .1876*
III/135 Ethnic minorities - position of respondent        .3269*       .458   .3978*   .4435*   .4400*
III/1150 Punishment of crimes                             .2221*       .311       .        .    .2172*
III/150 Punishment of crimes                              .2078*       .291   .2627*   .2886*   .2853*
III/160 Death penalty for certain crimes                 -.2135*      -.299  -.3592*  -.3607*  -.3501*
III/1160 Death penalty for certain crimes                -.2298*      -.322       .        .   -.2574*
III/2460 Religion is a good guide in politics            -.1689*      -.237       .        .   -.2169*
25
Summary of the conclusions that differ between the original and imputed variables
Item                                                     l_r_uvis    L-R uvis    l_r1     l_r2     l_r3
                                                                    corrected
I/ Political knowledge 1                                   -.0145       -.020  -.0982*  -.1206*  -.1143*
I/ Political knowledge 2                                   -.0129       -.018  -.0799*  -.1004*  -.0899*
III/1293 Views MP's are good reflection of views voters     .0439*       .061   .0358    .0483    .0245
III/1295 Parties necessary for functioning of democracy     .0114        .016   .0676*   .0297    .0437*
III/1303 Which aspect should politicians emphasize?        -.0299       -.042  -.0765*  -.0854*  -.0942*
II/349 So many people vote, my vote does not matter        -.0357       -.050  -.0890*  -.0649*  -.0482
III/1291 Satisfaction with democracy in the Netherlands     .0099        .014   .0644*   .0560*   .0595*
The highest ‘missed’ correlation is with Political knowledge 1: the average across the three ‘real’ L-R variables is r = -.11.
26
Summary of the comparison between the imputed and original L-R variables
                                                       N       %
Total variables                                       63   100.0
Identical conclusion (direction and significance)     56    88.9
  Significant - significant                           51    91.1
  Insignificant - insignificant                        5     8.9
Different conclusions                                  7    11.1
  Insignificant (imputed) - significant (original)     6    85.7
  Significant (imputed) - insignificant (original)     1    14.3
Percentages in the indented rows are computed within their category.
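The percentages in this summary follow directly from the counts, as the short Python check below shows. The counts are taken from the slide; the helper name `pct` is hypothetical. Sub-category percentages are computed within their parent category (for example, 5 of the 56 identical conclusions is 8.9%).

```python
def pct(part, whole):
    """Percentage of `part` in `whole`, rounded to one decimal place."""
    return round(100 * part / whole, 1)


total = 63                 # variables compared
identical = 56             # same conclusion with imputed and original L-R
both_significant = 51
both_insignificant = 5
different = 7              # conclusions differ
missed = 6                 # insignificant with imputed, significant with original
spurious = 1               # significant with imputed, insignificant with original

summary = {
    "identical": pct(identical, total),
    "both significant": pct(both_significant, identical),
    "both insignificant": pct(both_insignificant, identical),
    "different": pct(different, total),
    "missed (Type II risk)": pct(missed, different),
    "spurious (Type I risk)": pct(spurious, different),
}
for label, value in summary.items():
    print(f"{label}: {value}%")
```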
27
Summary
Coefficients associated with the imputed variables are lower in magnitude.
Correction for attenuation helps.
In a number of cases even quite low correlations were correctly predicted.
In a single case the imputed variable showed a significant relationship when the original variable showed an insignificant coefficient.
Using the imputed variable, one risks making a Type II error much more than a Type I error.
28
Problems to consider
Large proportion of missing values - use several ‘predictive’ data files for the imputation.
Small number of ‘predictive’ variables.
If the ‘imputationist’ and the analyst are not the same person, the analyst may be interested in relationships not accounted for by the imputation model.
Imputation is done between different data sets - the major departure from the usual practice of the MI procedures.
29
Conclusion
The imputed variable strongly correlates with the ‘real’ responses (r is around .40, without correction for attenuation).
The multivariate model, using the variables from the prediction model, gave very similar results whether the imputed or the original variable was used.
Univariate relationships with a broad set of attitudinal variables showed that, by using the imputed variable, one is in danger of wrongly supporting the null hypothesis and of underestimating the strength of the relationships.
The proposed method seems applicable especially in pilot studies, and in studies using multiple surveys where particular questions are omitted from some of the surveys.
Transfer of data between different sources through the MI approach seems to be a reasonable alternative to aggregation.
30
“With or without missing data, the goal of a statistical procedure should be to make valid and efficient inferences about a population of interest – not to estimate, predict, or recover missing observations nor to obtain the same results that we would have seen with complete data.”
Schafer & Graham 2002, p. 149.