Transferring variables between different data-sets

Using imputation of individual scores

Bojan Todosijevic, University of Twente

German Stata Users’ Meeting, April 2, 2007, Essen

The Problem: Data scattered across different data sets (surveys, census data, etc.).

Typical solution: Data aggregation (geographical, cohort).

The present task: To test a more general model for transferring data/variables between data-sets, based on the imputation of individual scores.

Advantages of the individually imputed scores:

Wider range of applications (e.g., variables of interest may be unrelated to geographic or cohort units)

Aggregation method tends to neglect variability within aggregation units; individual imputation method retains information about distribution.

The proposed approach:

A question not asked in one survey could be seen as a special case of the missing data problem (Gelman et al., 1998).

Adopt Bayesian multiple imputation (MI) (Rubin, 1987) approach.

When data are missing because a question was not asked, the MAR (missing at random) assumption applies:

P(R | Y_complete) = P(R | Y_observed)

where R indicates which values are missing.

Assessing the feasibility of the approach

1. Two data-sets selected - SOCON 2000 and NKO 2002 - contain a number of equivalent variables

2. Target variable: Left-Right self-placement – from SOCON to NKO

3. Test and comparisons of the ‘real’ and imputed L-R variables

Structure of the merged file

                        |   data file record
      type of interview |  SOCON2002   NKO 2002 |     Total
------------------------+-----------------------+----------
 NKO 2002 1st wave only |          0        333 |       333
  NKO 1st and 2nd waves |          0        287 |       287
NKO 1st, 2nd and 3rd w. |          0      1,287 |     1,287
          NKO 2003 only |          0      1,271 |     1,271
                        |                       |
             SOCON 2002 |      1,008          0 |     1,008
------------------------+-----------------------+----------
                  Total |      1,008      3,178 |     4,186

Imputation procedure and software

ICE – MICE application for Stata (Royston, 2005)

UVIS – Univariate imputation sampling

ice imputes missing values by using switching regression, an iterative multivariable regression technique (Stata module written by Patrick Royston, 2005). The multivariate distribution is estimated from the incomplete data in a Gibbs sampling process (Van Buuren & Oudshoorn 1999).

uvis imputes missing values in a single variable based on multiple regression on a list of predictors. ice calls uvis repeatedly, in regression switching mode, to perform multivariate imputation.
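The idea of regression switching can be illustrated with a deliberately small sketch. The Python code below is not Royston's implementation: it assumes only two variables, simple linear regression and normal residual noise, and it skips the step of drawing the regression parameters from their posterior distribution. The helper names (fit_line, switching_imputation) and the data are hypothetical.

```python
import random
import statistics

def fit_line(xs, ys):
    """Least squares fit of y = a + b*x; returns (a, b, residual_sd)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    resid = [y - (a + b * x) for x, y in zip(xs, ys)]
    sd = (sum(r * r for r in resid) / max(len(xs) - 2, 1)) ** 0.5
    return a, b, sd

def switching_imputation(x, y, cycles=10, seed=14):
    """Fill None entries in x and y by alternating the two regressions."""
    rng = random.Random(seed)
    miss_x = [i for i, v in enumerate(x) if v is None]
    miss_y = [i for i, v in enumerate(y) if v is None]
    miss_x_set, miss_y_set = set(miss_x), set(miss_y)
    obs_x = [v for v in x if v is not None]
    obs_y = [v for v in y if v is not None]
    # Starting values: random draws from the observed values of each variable.
    x = [v if v is not None else rng.choice(obs_x) for v in x]
    y = [v if v is not None else rng.choice(obs_y) for v in y]
    for _ in range(cycles):
        # Regress y on x where y was observed, then re-impute the missing y.
        rows = [i for i in range(len(x)) if i not in miss_y_set]
        a, b, sd = fit_line([x[i] for i in rows], [y[i] for i in rows])
        for i in miss_y:
            y[i] = a + b * x[i] + rng.gauss(0, sd)
        # Switch roles: regress x on y where x was observed, re-impute missing x.
        rows = [i for i in range(len(x)) if i not in miss_x_set]
        a, b, sd = fit_line([y[i] for i in rows], [x[i] for i in rows])
        for i in miss_x:
            x[i] = a + b * y[i] + rng.gauss(0, sd)
    return x, y

# Hypothetical data with gaps in both variables.
x_raw = [1.0, 2.0, None, 4.0, 5.0, 6.0, None, 8.0]
y_raw = [1.2, None, 3.1, 3.9, None, 6.2, 7.1, 8.3]
x_imp, y_imp = switching_imputation(x_raw, y_raw)
print(x_imp)
print(y_imp)
```

Observed values are never altered; only the gaps are refilled on each cycle, each time using the latest imputations of the other variable.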

Common NKO and SOCON variables

Name        Variable
urb2        Urbanization
sex         Sex
age         Age
class       Class - self-description
zincome     Household income (standardized)
educatio    Education level
church_a    Religious service attendance
party       Party choice (hypothetical, vote intention, vote recollection)
employed    Employment status
pm          Post-materialism index
pol_int     Political interest
d_proud     Proud being Dutch

L-R         Left-Right self-placement, SOCON 2000

L-R1        Left-Right self-placement, NKO 1st wave
L-R2        Left-Right self-placement, NKO 2nd wave
L-R3        Left-Right self-placement, NKO 3rd wave

Imputation – three steps

1. Imputation of the common variables in the SOCON file (using ice)

2. Imputation of the common variables in the NKO file (using ice)

3. Imputation of the L-R variable – from SOCON to NKO (‘the main thing’), using uvis.

Multiple Imputation - the SOCON variables

Stata command for imputation:

ice l_r urb2 sex age class zincome educatio church_a party employed pm pol_int d_proud ///
    using SOCON_iced, m(5) ///
    match(l_r urb2 sex age class educatio church_a party employed pm pol_int d_proud) ///
    cmd(urb2 class pol_int d_proud pm:regress) ///
    cycles(50) seed(14) replace

Multiple Imputation - the NKO file

Stata command for imputation:

ice urb2 sex age class zincome educatio church_a party employed pm pol_int d_proud ///
    using NKO_iced, m(5) ///
    match(urb2 sex age class educatio church_a party employed pm pol_int d_proud) ///
    cmd(urb2 class church_a pol_int d_proud pm:regress) ///
    cycles(50) seed(14) replace

L-R variables not included

Imputation of the L-R variable – from SOCON to NKO

Merged the imputed SOCON and NKO files (each containing the original and 5 imputed data-sets).

For each of the 5 imputed SOCON-NKO combinations, a univariate imputation of the L-R variable (from SOCON to NKO) was performed (using uvis).

Stata command for imputation:

uvis regress l_r urb2 sex age class zincome educatio church_a party employed pm pol_int d_proud ///
    if _j==1, gen(l_r_uvis1) match seed(14) replace

Seed numbers: 1: 14, 2: 32, 3: 432, 4: 11, 5: 55.
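The match option in the command above requests prediction matching (this reading follows the ice/uvis documentation as I understand it). A minimal Python sketch of the matching step is given below, assuming a single predictor; unlike uvis it does not draw the regression coefficients from their posterior distribution, and the function name pmm_impute and all data are hypothetical.

```python
import statistics

def pmm_impute(donor_x, donor_y, recipient_x):
    """Give each recipient the observed y of the donor with the nearest predicted y."""
    mx, my = statistics.fmean(donor_x), statistics.fmean(donor_y)
    b = (sum((x - mx) * (y - my) for x, y in zip(donor_x, donor_y))
         / sum((x - mx) ** 2 for x in donor_x))
    a = my - b * mx
    # Predicted scores for the donors (cases where y was actually observed).
    donor_pred = [a + b * x for x in donor_x]
    imputed = []
    for x in recipient_x:
        pred = a + b * x
        nearest = min(range(len(donor_x)), key=lambda i: abs(donor_pred[i] - pred))
        imputed.append(donor_y[nearest])
    return imputed

# Hypothetical data: donors play the role of SOCON respondents (y observed),
# recipients the role of NKO respondents (y not asked).
donor_x = [1, 2, 3, 4, 5, 6]
donor_y = [2, 3, 3, 6, 7, 8]
recipient_x = [1.4, 3.6, 5.9]
pmm_result = pmm_impute(donor_x, donor_y, recipient_x)
print(pmm_result)  # [2, 6, 8]
```

Because every imputed score is copied from a real respondent, the imputed variable can only take values that occur in the donor data.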

The Imputation equation – DV: SOCON L-R

      Source |       SS       df       MS              Number of obs =    1008
-------------+------------------------------           F( 12,   995) =   55.22
       Model |  1717.64517    12  143.137097           Prob > F      =  0.0000
    Residual |  2579.00563   995  2.59196546           R-squared     =  0.3998
-------------+------------------------------           Adj R-squared =  0.3925
       Total |  4296.65079  1007  4.26678331           Root MSE      =    1.61
                                                       R             =     .63
------------------------------------------------------------------------------
   SOCON l_r |      Coef.   Std. Err.      t    P>|t|                     Beta
-------------+----------------------------------------------------------------
        urb2 |   .0570569   .0379983     1.50   0.134                 .0387282
         sex |   -.156749   .1075442    -1.46   0.145                -.0379609
         age |  -.0013591   .0043088    -0.32   0.752                -.0087341
       class |   .3938487   .0834685     4.72   0.000                 .1461826
     zincome |   .0254967   .0627695     0.41   0.685                 .0123456
    educatio |  -.1585475   .0315719    -5.02   0.000                -.1664914
    church_a |  -.2379957   .0522299    -4.56   0.000                -.1199417
       party |   .3608413   .0198037    18.22   0.000                   .48643
    employed |    .030681   .1241836     0.25   0.805                 .0068578
          pm |  -.2921879   .1009341    -2.89   0.004                -.0801642
     pol_int |    .173143   .0716642     2.42   0.016                 .0685405
     d_proud |   -.195067   .0623608    -3.13   0.002                -.0822013
       _cons |   4.659597   .5426508     8.59   0.000                        .
-------------+----------------------------------------------------------------

Variable               No. of Obs.    Mean     S.D.   Skewness   Kurtosis

Original SOCON var.
l_r (SOCON)                  1008    5.329    2.066      0.069      2.451

Imputed variables
l_r_uvis (NKO) – 1           3178    5.388    1.997      0.037      2.572
l_r_uvis (NKO) – 2           3178    5.177    2.062      0.144      2.494
l_r_uvis (NKO) – 3           3178    5.269    2.042      0.065      2.404
l_r_uvis (NKO) – 4           3178    5.354    2.111      0.055      2.385
l_r_uvis (NKO) – 5           3178    5.241    1.985      0.186      2.632

Original NKO var's
l_r1                         1871    5.040    2.007      0.058      2.421
l_r2                         1546    5.224    2.125      0.010      2.283
l_r3                         2495    5.268    2.170      0.014      2.151

Descriptive statistics for the original and five imputed L-R variables

Correlation between the original NKO L-R variables

        l_r1    l_r2
l_r2    .760    1
l_r3    .711    .779

Correlation between the imputed and original NKO L-R variables

        l_r_uvis-1  l_r_uvis-2  l_r_uvis-3  l_r_uvis-4  l_r_uvis-5
l_r1       .332        .392        .353        .375        .377
l_r2       .403        .418        .394        .445        .419
l_r3       .402        .408        .395        .449        .416

R-squared across the 5 imputations ranges from .39 to .43.

Multiple imputation parameter estimates (5 imputations)
-----------------------------------------------------------------------------------
 l_r_Imputed |      Coef.   Std. Err.      t    P>|t|   comparison with SOCON
-------------+---------------------------------------------------------------------
        urb2 |   .0750759   .0254447     2.95   0.003   became sig.
         sex |  -.1851431   .1260133    -1.47   0.142   almost identical
         age |   .0009682   .0037543     0.26   0.797   almost identical
       class |   .3853003   .0718502     5.36   0.000   almost identical
     zincome |   .0022461   .0421326     0.05   0.957   almost identical
    educatio |  -.1453613   .0223744    -6.50   0.000   almost identical
    church_a |  -.2166528   .0648066    -3.34   0.001   almost identical
       party |   .3906498   .0247788    15.77   0.000   almost identical
    employed |    .011624   .0950614     0.12   0.903   almost identical
          pm |  -.3594543   .0790244    -4.55   0.000   increased a bit
     pol_int |   .1972884    .101904     1.94   0.053   coef. incr., sig. dropped
     d_proud |   -.243719   .0534851    -4.56   0.000   increased a bit
       _cons |   4.610467   .5392112     8.55   0.000
-----------------------------------------------------------------------------------
3178 observations (imputation 1).

The Imputation equation – DV: Imputed L-R
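Estimates from several imputed data sets are conventionally combined with Rubin's (1987) rules: the pooled coefficient is the mean of the per-imputation coefficients, and its variance adds the average within-imputation variance to the between-imputation variance inflated by (1 + 1/m). The Python sketch below shows that calculation. The slides report only the pooled results, so the five per-imputation values used here are hypothetical, as is the function name pool_rubin.

```python
import statistics

def pool_rubin(estimates, std_errors):
    """Pool m estimates and standard errors using Rubin's rules."""
    m = len(estimates)
    q_bar = statistics.fmean(estimates)                      # pooled point estimate
    within = statistics.fmean([se ** 2 for se in std_errors])  # mean within-imputation variance
    between = statistics.variance(estimates)                 # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return q_bar, total ** 0.5

# Hypothetical coefficients and standard errors from m = 5 imputations.
coefs = [0.36, 0.40, 0.38, 0.41, 0.40]
ses = [0.020, 0.020, 0.020, 0.020, 0.020]
pooled_coef, pooled_se = pool_rubin(coefs, ses)
print(round(pooled_coef, 3), round(pooled_se, 4))
```

The pooled standard error is larger than any single within-imputation standard error because it also carries the uncertainty due to the imputation itself.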

Comparison of the imputed with the ‘original’ NKO L-R variables

Relationships with variables NOT included in the imputation model

Correction for attenuation

ρ(imputed L-R) = .40 × (1/.78) = .51
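The arithmetic can be checked with a few lines of Python. The first part reproduces the .51 above from the two correlations given in the slides. The second part is an inference rather than something the slides state: the "corrected" values tabulated on the following slides appear consistent with dividing each observed correlation by the square root of .51, and three of them are recomputed here under that assumption. The function name correct is hypothetical.

```python
import math

r_imputed_real = 0.40  # correlation of imputed with real L-R (from the slides)
r_between_waves = 0.78  # correlation between real L-R measures (from the slides)

rho_imputed = r_imputed_real * (1 / r_between_waves)
print(round(rho_imputed, 2))  # 0.51

def correct(r, rho=0.51):
    """Assumed reading: divide the observed correlation by sqrt(rho)."""
    return r / math.sqrt(rho)

# Observed l_r_uvis correlations taken from the tables that follow.
observed = [0.0983, -0.2373, 0.3846]
corrected = [round(correct(r), 3) for r in observed]
print(corrected)  # [0.138, -0.332, 0.539]
```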

Correlations with attitudinal variables

Satisfaction with government and democracy

Columns: l_r_uvis, L-R uvis (corrected), l_r1, l_r2, l_r3

I/141 General satisfaction with government. .0983* .138 .1678* .1736* .1719*

I/ Policy satisfaction score 2002 -.0836* -.117 -.1225* -.1343* -.1227*

I/142 Satisfaction with democracy .0634* .089 .1026* .1302* .1214*

II/291 Satisfaction with democracy in the Netherlands .0625* .088 .1196* .1199* .1449*

II/292 Satisfaction with democracy in the European Union -.0312 -.044 .0149 -.0057 .0209

II/299 Democracy is the best form of government .0657* .092 .1049* .1006* .1045*

III/1291 Satisfaction with democracy in the Netherlands .0099 .014 .0644* .0560* .0595*

III/1299 Democracy is the best form of government .0603* .084 .0594* .0446 .0528*


Attitude toward political parties: 1st wave

Columns: l_r_uvis, L-R uvis (corrected), l_r1, l_r2, l_r3

I/281 Sympathy score: CDA .1700* .238 .3329* .3826* .3675*

I/281 Sympathy score: PvdA -.2373* -.332 -.3630* -.3962* -.4112*

I/281 Sympathy score: VVD .2125* .298 .3901* .3787* .3888*

I/281 Sympathy score: D66 -.2293* -.321 -.2575* -.3165* -.2959*

I/281 Sympathy score: GroenLinks -.3143* -.440 -.4679* -.4928* -.4933*

I/281 Sympathy score: Leefbaar Nederland .1104* .155 .2335* .2352* .2351*

I/281 Sympathy score: Lijst Pim Fortuyn .2114* .296 .4013* .4114* .3993*

I/281 Sympathy score: SGP .1074* .150 .2095* .2403* .2548*

I/281 Sympathy score: ChristenUnie .0988* .138 .1616* .1938* .1832*

I/281 Sympathy score: SP -.2414* -.338 -.4062* -.4220* -.4122*


Attitude toward political parties: 3rd wave

Columns: l_r_uvis, L-R uvis (corrected), l_r1, l_r2, l_r3

III/281 Sympathy score: CDA .2605* .365 .4097* .4566* .4914*

III/281 Sympathy score: PvdA -.289* -.405 -.4335* -.4807* -.5020*

III/281 Sympathy score: VVD .3132* .439 .4707* .5279* .5475*

III/281 Sympathy score: D66 -.2158* -.302 -.2765* -.3284* -.3063*

III/281 Sympathy score: GroenLinks -.3846* -.539 -.5382* -.5842* -.5550*

III/281 Sympathy score: Lijst Pim Fortuyn .2807* .393 .4062* .4473* .4635*

III/281 Sympathy score: SGP .1326* .186 .2191* .2596* .2471*

III/281 Sympathy score: ChristenUnie .0597* .084 .1519* .1715* .1581*

III/281 Sympathy score: SP -.3405* -.477 -.4865* -.5241* -.5261*

Various political attitudes I

Columns: l_r_uvis, L-R uvis (corrected), l_r1, l_r2, l_r3

II/283 European unification -position of respondent .0881* .123 .1320* .1337* .1365*

II/297 Attention to individual freedom and human rights -.0860* -.120 -.1731* -.1977* -.1640*

II/349 MP's do not care about opinions of people like me -.1112* -.156 -.1091* -.0988* -.0717*

II/349 Parties are only interested in my vote and not in my opinion -.0998* -.140 -.1485* -.1130* -.1121*

II/349 People like me have no influence on politics -.1131* -.158 -.1144* -.0909* -.1024*

II/349 So many people vote, my vote does not matter -.0357 -.050 -.0890* -.0649* -.0482

II/ External political efficacy score 2002 -.1326* -.186 -.1621* -.1406* -.1239*

II/355 Consider myself qualified for politics .0627* .088 .0770* .0712* .1120*

II/355 Good understanding of political problems -.0271 -.038 -.0115 .0159 .0158

II/355 Politics too complicated -.0565* -.079 -.0758* -.0661* -.0655*

II/ Internal political efficacy score 2002 .0067 .009 -.0117 -.027 -.0427

II/355 Politicians promise more than they can deliver -.0569* -.080 -.0611* -.0562* -.0415

II/355 Ministers and junior-ministers are primarily self-interested -.0778* -.109 -.1327* -.1041* -.1138*

II/355 Friends more important than abilities to become MP -.0592* -.083 -.1204* -.0792* -.0788*

II/ Political cynicism score 2002 .0696* .097 .1409* .1072* .1216*

II/355 Most Dutch parties look alike .0093 .013 -.0409 -.0206 -.0134

Various political attitudes II

Columns: l_r_uvis, L-R uvis (corrected), l_r1, l_r2, l_r3

I/ Political knowledge 1 -.0145 -.020 -.0982* -.1206* -.1143*


II/380 Trust in people .1532* .215 .1833* .1924* .1795*

III/1380 Trust in people .1379* .193 .1943* .2032* .1576*

II/298 Corruption in Dutch politics -.1146* -.160 -.1262* -.1237* -.1265*

III/1293 Views MP's are good reflection of views voters .0439* .061 .0358 .0483 .0245

III/1295 Parties necessary for functioning of democracy .0114 .016 .0676* .0297 .0437*

III/1303 Which aspect should politicians emphasize? -.0299 -.042 -.0765* -.0854* -.0942*

III/1310 Functions of elections -.0294 -.041 -.0023 .0083 -.0157

III/1350 MP's do not care about opinions of people like me -.0876* -.123 -.1177* -.1395* -.0851*

Various political attitudes III

Columns: l_r_uvis, L-R uvis (corrected), l_r1, l_r2, l_r3

II/ Confessional attitude score 2002 .1000* .140 .1291* .1513* .1750*

III/132 Income differences -position of respondent -.2275* -.319 -.3445* -.3808* -.3765*

III/133 Asylum seekers -position of respondent .2916* .408 .4179* .4500* .4758*

III/134 European unification -position of respondent .1258* .176 .1374* .1654* .1876*

III/135 Ethnic minorities -position of respondent .3269* .458 .3978* .4435* .4400*

III/1150 Punishment of crimes .2221* .311 . . .2172*

III/150 Punishment of crimes .2078* .291 .2627* .2886* .2853*

III/160 Death penalty for certain crimes -.2135* -.299 -.3592* -.3607* -.3501*

III/1160 Death penalty for certain crimes -.2298* -.322 . . -.2574*

III/2460 Religion is a good guide in politics -.1689* -.237 . . -.2169*

Columns: l_r_uvis, L-R uvis (corrected), l_r1, l_r2, l_r3

III/1293 Views MP's are good reflection of views voters .0439* .061 .0358 .0483 .0245

III/1295 Parties necessary for functioning of democracy .0114 .016 .0676* .0297 .0437*

III/1303 Which aspect should politicians emphasize? -.0299 -.042 -.0765* -.0854* -.0942*

II/349 So many people vote, my vote does not matter -.0357 -.050 -.0890* -.0649* -.0482

III/1291 Satisfaction with democracy in the Netherlands .0099 .014 .0644* .0560* .0595*

Summary of the conclusions that differ between the original and imputed variables

The highest ‘missed’ correlation: with Political knowledge 1 – average for the three ‘real’ L-R variables: r=-.11.

                                                                        N       %
Total variables                                                        63   100.0
Identical conclusion (direction and significance)                      56    88.9

Identical conclusions
  Identical significant - significant conclusion                       51    91.1
  Identical insig. - insig. conclusion                                  5     8.9

Different conclusions                                                   7    11.1
  Insignificant (imputed variable) - significant (original variable)    6    85.7
  Significant (imputed variable) - insignificant (original variable)    1    14.3

Summary of the comparison between the imputed and original L-R variables
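The percentages in this summary follow directly from the counts, with each sub-category expressed as a share of its parent row. A short Python check is given below; the dictionary keys and the helper pct are hypothetical labels, while the counts are those reported in the table.

```python
# Counts as reported in the summary table.
counts = {
    "total": 63,
    "identical": 56,
    "sig_sig": 51,
    "insig_insig": 5,
    "different": 7,
    "missed": 6,     # insignificant with imputed, significant with original
    "spurious": 1,   # significant with imputed, insignificant with original
}

def pct(part, whole):
    """Percentage of `part` in `whole`, rounded to one decimal."""
    return round(100 * part / whole, 1)

shares = {
    "identical": pct(counts["identical"], counts["total"]),
    "sig_sig": pct(counts["sig_sig"], counts["identical"]),
    "insig_insig": pct(counts["insig_insig"], counts["identical"]),
    "different": pct(counts["different"], counts["total"]),
    "missed": pct(counts["missed"], counts["different"]),
    "spurious": pct(counts["spurious"], counts["different"]),
}
for name, value in shares.items():
    print(f"{name:12s} {value:5.1f}")
```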

Summary

Coefficients associated with the imputed variables are lower in magnitude.

Correction for attenuation helps.

In a number of cases even quite low correlations were correctly predicted.

In a single case the imputed variable showed a significant relationship when the original variable showed an insignificant coefficient.

Using the imputed variable, one is in danger of making a Type II error, and much less so a Type I error.

Problems to consider

Large proportion of missing values - use several ‘predictive’ data files for the imputation.

Small number of ‘predictive’ variables.

If the ‘imputationist’ and the analyst are not the same person, the analyst may be interested in relationships not accounted for by the imputation model.

Imputation is done between different data sets - the major departure from the usual practice of the MI procedures.

Conclusion

The imputed variable strongly correlates with the ‘real’ responses (r is around .40, without correction for attenuation).

A multivariate model using the variables from the prediction model gave very similar results whether the imputed or the original variable was used.

Univariate relationships with a broad set of attitudinal variables showed that, by using the imputed variable, one is in danger of wrongly supporting the null hypothesis and of underestimating the strength of the relationships.

The proposed method seems applicable especially in pilot-studies, and in studies using multiple surveys where particular questions are omitted from some studies.

Transfer of data between different sources through MI approach seems to be a reasonable alternative to aggregation.

“With or without missing data, the goal of a statistical procedure should be to make valid and efficient inferences about a population of interest – not to estimate, predict, or recover missing observations nor to obtain the same results that we would have seen with complete data.”

Schafer & Graham 2002, p. 149.
