Getting the most out of missing data in educational research with Multiple
Imputation
Maria Pampaka*, Graeme Hutcheson, Julian Williams
School of Education, The University of Manchester, Manchester, UK
* Corresponding author
School of Education
Room B4.10 Ellen Wilkinson Building
University of Manchester
Oxford Road
Manchester M13 9PL
Email: [email protected]
The ubiquitous presence of missing data is endemic to much educational research,
especially longitudinal studies. However, practices such as step-wise regression,
common in the educational research literature, have long been regarded as unacceptable
by social statisticians, particularly when missing data is an issue, and multiple
imputation is generally recommended instead. In this paper we first review these
methodological advances and their implications for educational research. We then draw
on an educational longitudinal survey in which missing data was significant, but for
which we were subsequently able to collect much of the missing data through additional
data collection. We thus compare step-wise regression (which essentially ignores the
missing data) and multiple imputation models with the models from the actual enhanced
sample. The results show the value of multiple imputation and the risks involved in
ignoring missing data. Implications for best research practice are discussed.
Keywords: missing data, mathematics education, multiple imputation, longitudinal
surveys
1. Introduction
Missing data is certainly not a new issue for educational research, particularly given the
constraints of designing and performing research in schools and other educational
establishments. Consider the situation when a researcher gets permission to administer a
questionnaire to the students during class time. On the agreed day of administration various
scenarios could take place: (A) some pupils may have been absent due to essentially random
reasons (such as illness or car breakdown), (B) some pupils may have been absent because
they are representing their school in competitions (these pupils may be the keenest and most
engaged), (C) some of the pupils who were present stopped before the end of the
questionnaire (these pupils may be the ones who are easily distracted or who have difficulty
reading), (D) some pupils may have skipped a page or skipped individual items (randomly, or
maybe they are careless), and (E) some pupils did not respond to sensitive questions (e.g.
about special educational needs). All the above scenarios will lead to missing data, but with
different degrees of bias depending on the object of the analysis. For example, if data are
missing due to scenario B, the analysis will under-represent the more highly attaining or
engaged pupils. If data are missing due to scenario E, those pupils with special educational
needs will be under-represented. In the real world, data are likely to be missing for
multiple reasons, with the above scenarios occurring simultaneously in any single project.
Missing data is a particular issue for longitudinal studies, especially when the design
involves transitions between institutions and phases of education. This is a major issue in the
wider social science literature, which acknowledges that nearly all longitudinal studies suffer
from significant attrition, raising concerns about the characteristics of the dropouts compared
to the remaining subjects and the implications of this for the validity of inferences when
applied to the target population (Kim and Fuller 2004; Little 1988; Little and Rubin 1989;
Plewis 2007; Schafer and Graham 2002).
Even though the issues around missing data are well-documented, it is common
practice to ignore missing data and employ analytical techniques that simply delete all cases
that have missing data on any of the variables considered in the analysis (see, for example,
Horton and Kleinman (2007) for a review of medical research reports, and King et al. (2001,
p.49) who state that “approximately 94% (of analyses) use listwise deletion to eliminate entire
observations. […] The result is a loss of valuable information at best and severe selection bias
at worst”). The use of step-wise selection methods is particularly dangerous when used in the
presence of missing data as the loss of information can be severe and may not even be
obvious. A demonstration of this is provided in Table 1, where the variable `SCORE' (a
continuous variable modelled using OLS regression) is modelled using three candidate
explanatory variables. In order to compare successive models, a typical step-wise procedure
first deletes any missing data listwise, leaving only the complete cases. Any case that has a
missing data point in any of the candidate variables is removed from the analysis, even when
the data is missing on a variable that is not included in the final model. This can result in the
loss of substantial amounts of information and introduce bias into the selected models.
Table 1: An example dataset for a regression model of ‘SCORE’
Case  SCORE  GCSE  AGE  SES  Included in step-wise model  Could have been included
1     25     19    25   2    yes                          yes
2     32     23    NA   5    no                           yes
3     45     NA    17   7    no                           no
4     65     NA    16   3    no                           no
5     49     42    16   3    yes                          yes
6     71     28    15   NA   no                           yes
7     68     97    19   NA   no                           yes
8     67     72    18   5    yes                          yes
9     59     66    18   2    yes                          yes
10    69     65    17   7    yes                          yes
Note: NA denotes a missing data point
A regression model of `SCORE' is derived using a step-wise procedure with `GCSE', `AGE' and `SES' as potential explanatory variables (for simplicity, all variables are considered as continuous). The model `SCORE ~ GCSE' was selected using backward elimination on the basis of estimates of significance for the individual variables (based on t-values). This model was constructed from only 5 cases. There are, however, 8 cases which could have been used to construct 'the same model'.
Figure 1 shows the model `SCORE ~ GCSE' obtained using a stepwise procedure and
the 'same model' constructed using all available data. The stepwise model is estimated from a
smaller sample as cases 2, 6 and 7 are excluded. The exclusion of these cases has introduced
bias in the analysis as these three cases all have relatively low GCSE marks. The parameter
estimates from the stepwise procedure are based on a sub-sample that does not even reflect
the sample particularly accurately. The models provide very different impressions about the
relationship between SCORE and GCSE marks. It is important to note that very little warning
may be given about the amount of data discarded by the model selection procedure. The loss
of data is only evident in the following output by comparing the degrees of freedom statistics
for the two models. It is, therefore, worth checking how a statistical package deals with
model selection1 and also to make sure that the selection process does not needlessly exclude
data.
SCORE ~ GCSE derived from stepwise regression based on 5 cases:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.3135 6.6089 1.863 0.15933
GCSE 0.7857 0.1172 6.702 0.00678
Multiple R-squared: 0.9374, Adjusted R-squared: 0.9165
F-statistic: 44.92 on 1 and 3 DF, p-value: 0.006777
SCORE ~ GCSE based on all 8 available cases:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 33.8170 13.3409 2.535 0.0444
GCSE 0.4884 0.2795 1.747 0.1312
Multiple R-squared: 0.3372, Adjusted R-squared: 0.2268
F-statistic: 3.053 on 1 and 6 DF, p-value: 0.1312
Figure 1. The use of a stepwise regression procedure has resulted in some loss of data. The amount of data excluded is only evident in the difference between the DF statistics for the two models
(3 compared to 6 DF).
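The effect of listwise deletion on this example can be reproduced with a short script. The sketch below is in Python rather than R (in which the outputs above were produced), using the Table 1 data with None standing in for NA:

```python
# Table 1 data: (SCORE, GCSE, AGE, SES); None marks a missing value
data = [
    (25, 19, 25, 2), (32, 23, None, 5), (45, None, 17, 7),
    (65, None, 16, 3), (49, 42, 16, 3), (71, 28, 15, None),
    (68, 97, 19, None), (67, 72, 18, 5), (59, 66, 18, 2), (69, 65, 17, 7),
]

def complete_cases(rows, columns):
    """Listwise deletion: keep only rows with no missing value in the given columns."""
    return [r for r in rows if all(r[c] is not None for c in columns)]

def ols_slope(pairs):
    """Simple-regression slope for (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    return sxy / sxx

# A stepwise search screens all candidate variables, so it deletes on every column first.
stepwise_rows = complete_cases(data, columns=[0, 1, 2, 3])
print(len(stepwise_rows))   # 5 cases survive

# The selected model SCORE ~ GCSE only needs those two columns.
usable_rows = complete_cases(data, columns=[0, 1])
print(len(usable_rows))     # 8 cases could have been used

slope = ols_slope([(r[1], r[0]) for r in stepwise_rows])
print(round(slope, 4))      # 0.7857, the GCSE coefficient reported in Figure 1
```

The counts (5 versus 8 cases) and the complete-case slope match the output shown in Figure 1; the point is that the deletion happens silently, before any variable is selected.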
Even though missing data is an important issue, it is rarely dealt with or even
acknowledged in educational research (for an exception to this, see Wayman 2003). Whilst
data imputation is now generally accepted by statisticians (particularly multiple imputation),
non-specialist researchers have been slow to adopt it and there is a general reluctance from
reviewers to accept analyses where data have been imputed. Data imputation makes an `easy
target' for criticism, mainly because it involves adding simulated data to a raw data set, which
causes some suspicion that the data sample is not representative.
1 Some statistical packages automatically delete all cases list-wise (SPSS, for example), while others (for
example, the “step( )” procedure implemented in R (R Development Core team, 2012)) do not allow step-wise regression to be easily applied in the presence of missing data - when the sample size changes as a result of variables being added or removed, the procedure halts.
Our aim in this paper is to review some of the issues surrounding missing data and
imputation methods and demonstrate how missing data can be imputed using readily-available
software. Using a real (longitudinal) dataset which (i) had serious quantities of missing data,
and (ii) was supplemented by subsequently acquiring a substantial proportion of this missing
data, we are able to evaluate the effects of ignoring missing data, compared with multiple
imputation of that missing data.
A more detailed description of our methodology and the relevant results follows the
brief review of the issue presented next.
2. On Missing Data and Possible Resolutions – A review
The importance and timeliness of the issue is highlighted by Wainer (2010), who looks
forward into the 21st century with regard to methodological and measurement trends.
Dealing with missing data appears as one of the six ‘necessary tools’ researchers must
have to be successful in tackling the problems “that loom in the near and more distant
future” (p. 7). The topic is not new, however: work on missing data from the 1970s was
brought together by Little and Rubin (1987).
2.1. Definitions of some useful concepts
Theoretically, missing data can be classified according to the missing data pattern and
the assumptions underlying the ‘missingness’ mechanisms, i.e. the mechanisms that are
believed to cause the data to be missing.
A fundamental distinction is that between ‘unit’ and ‘item’ nonresponse. Unit
nonresponse refers to the complete absence of information from an individual (or case), e.g.
because they either refused to respond or were not contacted for survey completion. This is
usually considered ‘selective’, implying that the answers of non-respondents differ
from those of the respondents, and so suggestive of sample bias: famous examples include
wildly inaccurate predictions based on telephone polling, such as the failure of the 1936
Literary Digest Poll (Squire 1988). Item nonresponse, on the other hand, refers to the failure
of a respondent to give information on some variables of interest (i.e. some items in the
survey). In a longitudinal panel survey, ‘unit’ nonresponse may be manifested as a special
case of an ‘item’ nonresponse pattern: assuming the non-respondents at a certain wave had
completed earlier surveys, the responses at previous waves are still available (observed)
data even though the unit/case is completely missing from that wave.
Missing data mechanisms are described as falling into one of three categories, briefly
outlined below (Allison 2000), which are sometimes called ‘distributions of
missingness’ (Schafer and Graham 2002). This classification is important because the
possible resolutions of missing data problems depend on it.
Missing completely at random (MCAR): the missingness is independent of both the
observed and the missing responses; i.e. the probability of being missing is constant for
all units/cases. This could be manifested in Scenario (A) from the example in the
introduction, where pupils are missing from a sample because they happened to be
away from school for random reasons.
Missing at random (MAR): the missingness is conditionally independent of the
missing responses, given the observed responses; in other words, the probability of
missing data on a particular variable Y can depend on other observed variables (but
not on Y itself). An example of such missingness is Scenario (B) from our original
example, with some pupils missing because they have been away to represent their
school in competitions.
Missing not at random (MNAR): missingness depends on both observed and
unobserved (missing) data, such as the case of Scenario (E), with pupils not responding
to sensitive questions (e.g. about special educational needs).
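These three mechanisms can be made concrete with a small simulation. The Python sketch below (hypothetical data; the variable names are ours, not from the study) generates an observed covariate x and an outcome y, then deletes y under each mechanism; a complete-case mean of y stays near its true value (zero) only under MCAR:

```python
import random

random.seed(1)

def simulate(mechanism, n=10_000):
    """Generate (x, y) pairs and delete y according to a missingness mechanism."""
    observed = []
    for _ in range(n):
        x = random.gauss(0, 1)       # observed covariate (e.g. engagement)
        y = x + random.gauss(0, 1)   # outcome (e.g. a test score), true mean 0
        if mechanism == "MCAR":
            p_missing = 0.3                      # constant for every case
        elif mechanism == "MAR":
            p_missing = 0.6 if x > 0 else 0.1    # depends on observed x only
        else:                                    # MNAR
            p_missing = 0.6 if y > 0 else 0.1    # depends on y itself
        if random.random() >= p_missing:
            observed.append(y)
    return observed

results = {}
for mech in ("MCAR", "MAR", "MNAR"):
    ys = simulate(mech)
    results[mech] = sum(ys) / len(ys)
    print(mech, round(results[mech], 2))
# Under MCAR the observed mean stays near 0; under MAR and MNAR the
# complete cases are a biased subset and the observed mean is pulled down.
```

Note that even under MAR the complete-case mean of y is biased; MAR only guarantees that the bias can be removed by conditioning on the observed x, which is what the imputation methods discussed below exploit.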
Another concept inherent to nonresponse is the ‘response rate’, information about which
should be a first step in any critical appraisal of the missingness and its implications. This
step is even more complicated for designs that do not involve ‘simple random samples’, a
common situation in educational research where volunteer and convenience samples are
more prominent, making even the task of defining a response rate more challenging.
Finally, an essential term is ‘nonresponse bias’, which occurs when the distribution of the
missing values, were it known, differs from the distribution of the observed values
(Durrant 2009). This is what we should try to reduce, estimate or avoid in our surveys.
So, we have defined the problem, we have realised we have significant missing data,
and we may have the means to define the missingness pattern. What can we do next?
2.2. What can/should be done?
Depending on the missingness patterns and the particular practical problems, the technique
to be employed for handling missing data can now be chosen. General advice as to what
should be done with missing data comprises three points: (1) always report details of the
missing data; (2) adjust for the missing data and its implications, if possible; and (3) report
the sensitivity of the results to the possible distributions of the missing observations. The first
and the last of the above points are self-explanatory, and will be explored through our own
research example. Our emphasis here, therefore, is on the various available adjustment
methods.
Three general strategies for analysing incomplete data are summarised by Little and Rubin
(Little 1988; Little and Rubin 1987; 1989; Rubin 1987) and by others more recently (e.g.
Allison 2000; Durrant 2009; Ibrahim et al. 2005; Reiter and Raghunathan 2007; Zhang 2003):
(a) direct analysis of the incomplete data, (b) weighting, and (c) imputation.
The first, also referred to as the “complete case method”, is considered the simplest,
involving the analysis of only the observations without missing data (in effect, what happens
with step-wise regression methods and in the illustrative example we presented earlier with
SCORE). When missingness is completely random (MCAR), the estimates produced with
this method are unbiased. Since this is not a very frequent scenario in educational research,
and because of the limitations of these methods already detailed, the topic is not discussed
further.
Weighting, on the other hand, is considered a traditional remedy for dealing with missing
data, and unit nonresponse in particular (Little 1988): most nation-wide surveys already have
identifiable weights, and data can be considered ‘missing by design’ if sampling
fractions differ. Weights are generally calculated as the inverse of the probability of selection
for each case (Horton and Kleinman 2007). They are then adjusted for nonresponse
from available data within the existing sampling frame or, in the case of longitudinal surveys,
from another time point (wave). In effect, incomplete cases are discarded and a weight is
assigned to each complete case to compensate for the missing cases. An example of how
weights could be calculated is shown in Table 2, with a hypothetical sample of 100 students
selected from one school (which represents the population in this case) of 1000 students. As
can be seen, the fact that female students were oversampled (n=60) compared to male
students (n=40) is compensated for by assigning a higher weight to the male respondents.
Table 2: An example of weighting in a school survey
Gender   Population (School)   Sample           Population/Sample   Weight
         Proportion (N)        Proportion (N)
Female   0.5 (500)             0.6 (60)         0.5/0.6             0.8333
Male     0.5 (500)             0.4 (40)         0.5/0.4             1.25
Total    1 (1000)              1 (100)
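The calculation behind Table 2 can be sketched in a few lines of Python (a hypothetical illustration of the same school survey; the weight for each group is its population proportion divided by its sample proportion):

```python
# Hypothetical school survey from Table 2: counts by gender
population = {"Female": 500, "Male": 500}   # school of 1000 students
sample     = {"Female": 60,  "Male": 40}    # achieved sample of 100

N = sum(population.values())
n = sum(sample.values())

# Weight = population proportion / sample proportion for each group
weights = {g: (population[g] / N) / (sample[g] / n) for g in population}
print(weights)   # Female ~0.8333, Male 1.25, as in Table 2

# Applying the weights to the respondents recovers the population balance:
weighted_total = {g: weights[g] * sample[g] for g in sample}
print(weighted_total)   # 50 'effective' students in each group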
Limitations associated with this method include the need for complete data on
contributing factors, inefficiency of the method when there are extreme weights, the need for
many sets of weights and complications when the data pattern is not monotonic2. According
to Schafer & Graham (2002), weighting can eliminate bias due to differential responses
related to the variables used to model the response probabilities, but not for unused variables.
The same authors also review and recommend two (‘modern’) approaches for dealing
with missing data: maximum likelihood (ML) estimation based on all available data, and
(Bayesian) multiple imputation. Statistically, both are based on the distribution of the
observed data as a function of the population distribution (complete dataset) with respect to
the missing data (the statistically minded reader may look into the relevant functions for
P(Ycomplete; θ)3 in Schafer & Graham, 2002, e.g. p. 154).
Maximum Likelihood Estimation and Multiple Imputation methods
ML estimation methods for dealing with missing data are considered advantageous and more
useful by Schafer & Graham (2002) because they only assume MAR (compared to the
MCAR assumption for methods derived from repeated sampling arguments only). In sum,
the method is based on maximising the (log of the) likelihood function mentioned earlier
(i.e. the observed data as a function of the population distribution (complete dataset) with
respect to the missing data): for most problems this is computed iteratively with the
Expectation-Maximisation (EM) algorithm; the technique is very well established in
statistical handbooks and it is not our focus to detail it again here. These methods overall
summarise a likelihood function averaged over a predictive distribution for the missing
values (Ibrahim et al. 2005; Schafer 1999; Schafer and Graham 2002). According to Horton
& Kleinman (2007) “for each observation with missing data, multiple entries are created in
an augmented dataset for each possible value of the missing covariates, and a probability of
observing that value is estimated … the augmented complete-dataset is then used to fit the
regression model” (p. 82). Based on this definition, ML estimation is very similar to
Multiple Imputation.

2 Monotone missing data pattern: the variables can be arranged so that for j=1,…,K-1, X(j) is observed whenever X(j+1) is observed (for example, in a longitudinal study with 5 data points an observation at the fourth time point implies complete observations up to that time).

3 From a statistical point of view P(Ycomplete; θ) has two possible interpretations, which guide the choice of estimation methods for dealing with missing data:
- when regarded as the repeated-sampling distribution for Ycomplete, it describes the probability of obtaining any specific data set among all the possible datasets that could arise over hypothetical repetitions of the sampling procedure and data collection;
- when considered as a likelihood function for θ (the unknown parameter), the realised value of Ycomplete is substituted into P and the resulting function of θ summarises the data’s evidence about the parameters.
Imputation methods involve replacing missing values by suitable estimates and then
applying standard complete-data methods to the filled-in data. The main reason for doing this
is to reduce nonresponse bias: “rather than deleting cases that are subject to item-nonresponse,
the sample size is maintained resulting in a potentially higher efficiency than for case
deletion” (Durrant 2009, p. 295). Imputation methods are broadly distinguished as
deterministic or random (stochastic). The latter require the generation of random numbers,
whereas the former do not. Hence, deterministic approaches produce the same imputed
value for cases with the same characteristics (e.g. using a measure of central tendency to
represent the imputed distribution of the missing variable). Some derived methods, such as
regression and hot deck imputation, can be either deterministic or random (Durrant 2009;
Schulte Nordholt 1998).
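The deterministic/stochastic distinction can be illustrated with a minimal Python sketch (hypothetical scores; mean imputation as the deterministic case and mean-plus-random-draw as a simple stochastic variant):

```python
import random
import statistics

random.seed(42)

scores = [52, 61, None, 47, 70, None, 58]   # None marks item nonresponse

observed = [s for s in scores if s is not None]
mean, sd = statistics.mean(observed), statistics.stdev(observed)

def impute_deterministic(values):
    """Every missing value gets the same estimate (the observed mean)."""
    return [mean if v is None else v for v in values]

def impute_stochastic(values):
    """Each missing value gets a random draw around the mean, preserving spread."""
    return [random.gauss(mean, sd) if v is None else v for v in values]

det = impute_deterministic(scores)
sto = impute_stochastic(scores)
print(det)
print(sto)
# The deterministic fill gives both missing cases an identical value and
# shrinks the variance of the completed data; the stochastic fill keeps
# the spread closer to that of the observed values.
```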
Hot deck imputation procedures replace missing values with values/records that occur
in the sample, to compensate for item nonresponse. The record providing a value is called the
donor and the record with the missing value the recipient (Kim and Fuller 2004). In random
hot deck imputation, non-respondents are assigned values chosen at random from respondents
in the same imputation cell (Hox 2000). Nearest neighbour and predictive mean matching
imputations are also considered hot deck methods (Durrant 2009; Sartori et al. 2005; Zhang
2003).
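A random hot deck can be sketched as follows (a hypothetical Python illustration, with imputation cells defined here by gender):

```python
import random

random.seed(7)

# Each record: (imputation cell, score); None marks a recipient with a missing score
records = [
    ("F", 63), ("F", 58), ("F", None),
    ("M", 49), ("M", 71), ("M", None), ("M", 55),
]

# Pool of donors per imputation cell (respondents only)
donors = {}
for cell, score in records:
    if score is not None:
        donors.setdefault(cell, []).append(score)

# Random hot deck: each recipient takes the value of a random donor in its cell
imputed = [
    (cell, random.choice(donors[cell]) if score is None else score)
    for cell, score in records
]
print(imputed)
```

Because donated values are real responses from similar cases, the imputed data stay within the observed range; the quality of the result depends on how well the imputation cells capture the response mechanism.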
Repeated (Multiple) Imputation (MI) is becoming the most popular procedure for
handling missing data. It allows the analyst to use analyses designed for complete data, while
at the same time providing a method for estimating the uncertainty due to the missing data.
Some requirements need to be met for multiple imputation, such as: (a) data must be missing
at random (MAR), (b) the model used to generate the imputed values must be correct, and (c)
the model used for the analysis must match up with the model used in the imputation (Allison
2000, p. 302; Rubin 1987).
The basic idea of MI, as proposed by Rubin (1987), involves these steps:
Impute missing values using an appropriate model that incorporates random variation;
Do this M times4, producing M ‘complete’ data sets;
Perform the desired analysis on each of these data sets using standard complete-data
methods;
Average the values of the parameter estimates across the M samples to produce a
single-point estimate (i.e. θ̄ = (1/M) Σ_{m=1}^{M} θ̂_m);
Calculate the standard errors by (a) averaging the squared standard errors of the M
estimates, (b) calculating the variance of the M parameter estimates across samples,
and (c) combining the two quantities using an adjustment term (i.e. 1+1/M). This step
is necessary so as to incorporate the variability due to imputation (Allison 2000;
Durrant 2009).
4 It has been shown by Rubin (1987) that the relative efficiency of an estimate based on m imputations, relative to one based on an infinite number of them, approximates the inverse of (1+λ/m), with λ the rate of missing information. Based on this, it is also reported that there is no practical benefit to using more than 5 to 10 imputations (Schafer 1999; Schafer and Graham 2002).
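The averaging and standard-error steps above (often called Rubin's rules) can be sketched in Python with toy numbers (the estimates and standard errors below are illustrative, not from our study):

```python
import math

# Parameter estimates and standard errors from M = 5 imputed datasets (toy values)
estimates  = [0.52, 0.48, 0.55, 0.50, 0.45]
std_errors = [0.10, 0.11, 0.09, 0.10, 0.12]
M = len(estimates)

# Single-point estimate: the average across the M samples
pooled = sum(estimates) / M

# (a) within-imputation variance: average of the squared standard errors
within = sum(se ** 2 for se in std_errors) / M

# (b) between-imputation variance of the M parameter estimates
between = sum((q - pooled) ** 2 for q in estimates) / (M - 1)

# (c) total variance combines the two with the adjustment term (1 + 1/M)
total = within + (1 + 1 / M) * between
pooled_se = math.sqrt(total)

print(round(pooled, 3))      # 0.5
print(round(pooled_se, 4))   # larger than any naive average of the SEs alone
```

The pooled standard error exceeds the simple average of the per-dataset standard errors precisely because the between-imputation term carries the extra uncertainty due to the missing data.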
Fundamental to MI is the technique used for the imputation of values, and recently the
range of options has grown considerably. The three widely used MI techniques are the
propensity score method, the predictive model method and the Markov chain Monte Carlo
(MCMC) method (see Schafer 2000; Zhang 2003). The propensity score and predictive model
methods are appropriate when there is a monotone pattern of missing data, whereas the
MCMC method can be used with non-monotone patterns.
Bayesian MI methods are becoming increasingly popular: these are performed using a
Bayesian predictive distribution to generate the imputations (Nielsen 2003) and specifying
prior values for all the parameters (Ibrahim et al. 2005). According to Schafer & Graham
(2002), Bayesian methods bring together MI and likelihood methods: “[…]
the attractive properties of likelihood carry over to the Bayesian method of MI, because in the
Bayesian paradigm we combine a likelihood function with a prior distribution for the
parameters. As the sample size grows, the likelihood dominates the prior, and Bayesian and
likelihood answers become similar” (p. 154).
As a final note on analytical options, multilevel analysis of longitudinal data also has a
reported advantage in handling missing data (Little and Rubin 1989). In particular the
problem may be avoided in the case of multilevel growth curve models (e.g. Plewis 2007;
2009). These models use all the available data (rather than being restricted to cases with
complete data at each occasion) giving them an advantage for studies that span a long period
of time when sample attrition is almost inevitable. If the data are MAR, this implies “that
differences between the means for those measured and not measured at occasion t disappear,
having conditioned on the known values at occasions t-k,k>=1 – then estimates from the
growth curve models are efficient” (Plewis 2009, p. 300-301).
Tools for resolving the issue
Various tools are available for performing (multiple) imputation. As a guide for
the interested reader we list, non-exhaustively, some procedures available in R:
Amelia II, arrayImpute, cat (for categorical-variable datasets with missing
values), EMV (for the Estimation of Missing Values for a Data Matrix), impute, mi, mice
and Hmisc. Tools are also available within other statistical packages, such as ICE in STATA,
SAS PROC MI, the Missing Data Library and NORM for S-Plus, and SOLAS. Multiple
Imputation can also be performed with MLwiN and, very recently, with SPSS (version 20,
2012). Allison (2000) performed some imputations with various methods and identified
problems with earlier versions of the SOLAS commercial package. Horton & Kleinman
(2007) applied imputation with the Amelia II, Hmisc, ICE/Stata, IVEware, LogXact,
MICE, SAS and S-Plus packages and found similar parameter estimates for all analyses, as
well as a considerable reduction in the standard error estimates when compared to the
complete case estimators. A recent practical tutorial on imputing missing data with Amelia
II is also available (Hutcheson and Pampaka 2012).
In the remainder of the paper we will apply the previously presented theory in an
educational research context. To do this we draw on data (and missing data) from a recent
project which we present next.
3. Our research design - methodology
The context for the analysis and results presented next is situated within our recent work on
students’ transitions into post-compulsory settings. Our project has been concerned with the
widely known ‘mathematics problem’ in England, manifested by the relatively small number
of students being well prepared to continue their studies from schools and colleges into
courses in Higher Education (HE) Institutions in mathematically demanding subjects (see for
example ACME 2009). We have effectively joined the efforts of others (e.g. Roberts 2002;
Sainsbury 2007; Smith 2004) to inform policy and practice, and consequently support a
greater flow of students into HE to meet the demands of the emerging national STEM
(Science, Technology, Engineering, Mathematics) agenda in the UK.
The aim of the project we report here was to understand how cultures of learning and
teaching can support learners in ways that help them widen and extend participation in
mathematically demanding courses in Further and Higher Education. We focused on those
students for whom pre-university mathematics qualifications (like ‘AS’ and ‘A2’ in the UK
context) provide a barrier to progressing into mathematically demanding courses. The project
also contrasted the traditional mathematics programme with a new ‘Use of mathematics’
programme in the UK at ‘AS’ level, i.e. for students usually aged 16-19 who have completed
compulsory schooling and have chosen to study some mathematics further. The ‘Traditional’
(ASTrad) programme is designed to help students prepare for mathematically-demanding
courses in university (e.g. STEM). The new ‘Use of Mathematics’ (UoM) programme is
designed to widen participation to include those who may need to use mathematics in the
future generally, but who may or may not progress into very mathematical courses and who
may not have very strong previous mathematical backgrounds (Williams et al. 2008).
Methodologically, the project was a mixed method study. One part involved a survey
of students where responses to a questionnaire were recorded at three data points. This was
complemented with multiple case studies of students as they progress through a year of
further education. The longitudinal design is illustrated in Figure 2. In particular, it
indicates the timing of the three data collection points and the students’ educational status
at each point.
Data Point                  DP1                      DP2                   DP3
Time                        Sept-Nov 2006            April-June 2007       Sept-Dec 2007
Cohort’s Educational Level  Start of AS              End of AS             Start of A2
Variables Collected         Background;              Dispositions;         Dispositions;
                            Previous (GCSE) grades;  Self-Efficacy;        Self-Efficacy;
                            Dispositions;            Future Aspirations    Future Aspirations;
                            Self-Efficacy;                                 AS Grades
                            Future Aspirations
Sample                      1792                     1047                  598
Figure 2: The longitudinal design of our project and achieved student survey returns
The sample size achieved at each data point, in respect to questionnaires returned from
schools, is also shown in the figure. As can be seen, there was significant attrition from the
1792 students at DP1 to 1047 at DP2 and only 598 students at DP3 (leading to 42% and 67%
nonresponse rates, respectively). Descriptive analysis indicates that missing data at DP2 is
not random. We particularly found that the proportion of ASTrad Maths students completing
DP2 was significantly lower than that of UoM students (chi-square=21.53, df=1, p<0.001).
Missingness at DP3 also differentiates between the two courses, this time completion being in
favour of the ASTrad students (chi-square=7.5, df=1, p=0.006). Further modelling of missing
data is presented in a following section.
The different types of variables collected at each time point (DP) are also summarised
in Figure 2. More details about these and the corresponding constructed measures can be seen
elsewhere (Pampaka et al. 2011a; Pampaka et al. 2013; Pampaka et al. 2011b).
Our modelling strategy then involved exploring relationships between students’
background variables, process variables (i.e. course and pedagogy), dispositions and their
attainment as they progress through their post-16 (mathematics) education.
Of particular interest for this paper is the educational outcome ‘AS grade’, highlighted
in the final data point of Figure 2. As noted, this important outcome variable for our study
was only requested from students at the final data point (because this is when it would have
been available), which suffers most from missing data.
In order to resolve some of the analytical problems posed by the low response rate at
DP3, and the consequent loss of important information on the response variables of
interest (e.g. final grades, dropout from the course), we were able to access additional
information about AS-level grades at a later date by collecting data directly from the schools
and the students via additional telephone surveys. With this approach we were able to
fill in the grades of 879 additional pupils.
In the remainder of the paper, we present some analysis from this project in relation to
missing data, to show its potential implications and demonstrate the potential benefits of
multiple imputation.
4. Analysis / Results
4.1. Describing / Modelling nonresponse
As suggested in most relevant guidelines, a necessary first step in any analysis involving
missing data is to describe the problem. To this end, we included within our dataset
binary variables denoting whether a student had completed the survey at each data point.
What we present in this section are the results of logistic regression modelling for DP2 and
DP3 (i.e. whether student completed DP2 and DP3 respectively). Our modelling procedure
follows the same guidelines reported elsewhere (Fox and Weisburg 2011; Hutcheson et al.
2011; Pampaka et al. 2012a; Pampaka et al. 2012b) and in order to avoid further missing data
problems we use as explanatory variables background and other theoretically relevant
variables collected at the first data point of the study (DP1). Our presentation is limited to the
selected theoretical model (with the significant explanatory variables) for each data point and
the effects of the explanatory variables are illustrated via effect plots (Fox 1987; Fox and
Hong 2009).
Figure 3 summarises the models that best explain/describe response (and missingness)
in our study.
Model 1: DP2 ~ Course + UNIoption
Model 2: DP3 ~ GCSE grade + Gender + UNIoption
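As a purely schematic sketch (not the authors' code; the records below are invented), the logic of Model 1 can be illustrated in plain Python: build a binary response indicator and regress it on a background covariate. With a single binary predictor the logistic-regression slope has a closed form, namely the log odds ratio of the resulting 2x2 table.

```python
import math

# Hypothetical records: (responded_at_DP2, intends_university) -- illustrative only
students = [
    (1, 1), (1, 1), (1, 1), (0, 1), (1, 1),
    (0, 0), (0, 0), (1, 0), (0, 0), (1, 0),
]

# Build the 2x2 table of response status by the binary covariate
counts = {(r, u): 0 for r in (0, 1) for u in (0, 1)}
for responded, uni in students:
    counts[(responded, uni)] += 1

# For a single binary predictor, the logistic-regression slope equals the
# log odds ratio of the 2x2 table: log[(a*d)/(b*c)]
a = counts[(1, 1)]  # responded, intends university
b = counts[(0, 1)]  # missing, intends university
c = counts[(1, 0)]  # responded, no university intention
d = counts[(0, 0)]  # missing, no university intention
slope = math.log((a * d) / (b * c))

# Intercept: log odds of responding when the covariate is 0
intercept = math.log(c / d)

print(f"logit(P(respond)) = {intercept:.3f} + {slope:.3f} * uni")
```

In practice the models above would be fitted with standard logistic-regression software over several (possibly categorical) explanatory variables; the closed form here is only available in this degenerate one-predictor case.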
[Effect plots for Models 1 and 2; vertical axes show the probability of responding at the corresponding data point]
Figure 3: Logistic regression models of responding to Study 1 and their effect plots. The size of the effect of each explanatory variable on the response is given by the range of Y (response variable) values shown on the graph. It should also be noted that the Y range is not the same for all graphs, and this needs to be taken into account when interpreting the results.
UNIoption is a categorical variable describing students’ choices regarding university study (or
not) as reported at DP1. As shown in the effect plots of Figure 3 (top), students on the
ASTrad course and students who do not intend to go to university have the highest chances of
being non-respondents at DP2. In contrast, students aiming for a STEM course in HE have the
highest response probabilities. For DP3, course is no longer a significant effect; however,
previous mathematics attainment (GCSE grade) and gender are now significant. The trend
regarding University option is similar to that described for Model 1. In addition, male
students tend to be more prone to attrition, and the probability of response increases with
previous mathematics attainment.
4.2. A Multiple Imputation Illustration
One part of this project investigated the number of students who dropped-out (didn't
finish the course) and attempted to identify the factors involved in this. These analyses are
reported in Hutcheson, Williams and Pampaka (2011) and indicate that the type of course a
student was enrolled on, their previous GCSE results in mathematics, their disposition to
study mathematics at a higher level and their self-efficacy rating were all significantly related
to drop-out.
Descriptive analyses of the initial data (i.e. without the data acquired later from
schools) and of the cases that were missing do not show significant differences in the courses
enrolled on, the students’ disposition and self-efficacy, or the drop-out rate, but do show a
significant difference in GCSE grades (chi-square=43.01, df=5, p=3.7e-08). The observed
cases had significantly higher GCSE grades than those that were missing. From this
evidence, we may conclude that a model of the original 494 cases may over-estimate the
effect of high-achievers; we could incorporate this conclusion into our analyses by
employing weights, or use an imputation technique to replace the missing cases.
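The kind of comparison reported above rests on a Pearson chi-square computed from a contingency table of GCSE grade by response status. The table below is hypothetical (it does not reproduce the study's data) and serves only to illustrate the calculation:

```python
# Hypothetical 2 x 6 table: rows = (observed, initially missing),
# columns = six GCSE-grade bands -- invented counts for illustration
table = [
    [30, 45, 60, 80, 95, 70],   # observed cases
    [40, 50, 55, 45, 35, 20],   # initially missing cases
]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
grand = sum(row_totals)

# Pearson chi-square: sum over cells of (observed - expected)^2 / expected,
# where expected = row total * column total / grand total
chi2 = 0.0
for i, row in enumerate(table):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand
        chi2 += (obs - expected) ** 2 / expected

df = (len(table) - 1) * (len(table[0]) - 1)  # (rows-1)*(cols-1) = 5
print(f"chi-square = {chi2:.2f} on {df} df")
```

With six grade bands and two response groups the test has (2-1)(6-1) = 5 degrees of freedom, matching the df=5 reported above.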
The following compares complete-case analyses of drop-out (using binary logit
models) using the original sample (n = 494) with the same model computed on the more
complete sample (n=1373). Comparing these models allows an evaluation of the
consequences of analysing an incomplete dataset and also provides an opportunity for
evaluating the effectiveness of data imputation. This analysis compares models which have
the same selection of variables. Although this enables the models to be compared, it is not an
ideal situation as the chosen model was derived using the more complete data set. The model
is, therefore, optimised for these data, but can still provide some informative comparisons. A
binary logit model of drop-out was derived using a combination of theoretical and statistical
information (see Hutcheson et al., 2011, for a full description of the model selection
procedure). The model reported for the full data was:
Drop-out ~ Course + Disposition + GCSE-grade + Maths Self Efficacy
where Drop-out and Course are binary categorical variables, GCSE-grade is an ordered categorical variable and Disposition and Maths Self Efficacy are continuous.
Table 3 shows the fit statistics for the model using all 1373 cases and the same model using
only the 494 cases which were originally available. Both models identify the most significant
relationships (Course and GCSE-grade); however, Disposition and mathematics self-efficacy
(MSE) are only identified as being significant for the full dataset.
Table 3: Fit statistics for models with and without missing data. The LR-chisq statistics and p-values are shown for binary logit models “Drop-out ~ Course + Disposition + GCSE-grade + Maths Self Efficacy” computed for the initial dataset (N=494) and the full dataset (N=1373).

Explanatory Variables    Model with missing data [n=494]    Model using full data [n=1373]
GCSE-grade               85.897; p<2.2e-16 ***              176.07; p<2.2e-16 ***
Course                   21.139; p=4.271e-06 ***            72.54; p<2.2e-16 ***
Disposition              3.561; p=0.05914                   21.17; p=4.2e-06 ***
Maths Self Efficacy      0.458; p=0.49844                   10.60; p=1.1e-03 **
***: p<0.001, **: p<0.01, *: p<0.05.
The importance of collecting additional data has been demonstrated here with regard
to the identification of the variables Disposition and Maths Self Efficacy, as these would not
have been identified as being 'important' using the reduced dataset. The additional data
collection procedure has influenced the analysis and has had important implications with
respect to the conclusions drawn from the research. The data that were initially missing were
not missing at random, and it was therefore important that these data were collected. If it had
not been possible or feasible to collect the additional data, the missing data points could have
been imputed. The following analysis imputes the 879 missing cases (data were imputed for
the drop-out variable) and compares the resulting regression model with those in Table 3.
There are a number of data imputation methods available and we selected the Amelia
package (Honaker et al. 2012) which is implemented in the R statistical environment (R Core
Team, 2012). Amelia imputes missing data using multiple imputation and includes a number
of sophisticated options and diagnostics using a simple graphical interface. It is available for
all computing platforms and is freely available from the R software repository (see Hutcheson
and Pampaka, 2012, for a description of this software) and offers a simple, powerful and
accessible means for researchers to impute data. Amelia imputed the missing data
(information about whether the student had dropped out) and formed a number of 'complete'
datasets. The model of drop-out was then applied to each of these datasets. The fit statistics
for these are shown in Table 4.
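The logic of this workflow can be mimicked, in outline, with a pure-Python toy (this is not Amelia's algorithm, and the records and probabilities below are invented): each missing value of a binary drop-out indicator is drawn from a Bernoulli distribution whose probability is estimated from observed cases in the same GCSE-grade band, and the draw is repeated m = 5 times to give five 'complete' datasets.

```python
import random

random.seed(42)  # reproducible illustration

# Hypothetical records: (gcse_band, dropout), with None marking a missing value
data = [
    ("low", 1), ("low", 0), ("low", 1), ("low", None),
    ("high", 0), ("high", 0), ("high", 1), ("high", None), ("high", None),
]

# Estimate P(dropout) among observed cases within each GCSE band
p_hat = {}
for band in ("low", "high"):
    obs = [d for b, d in data if b == band and d is not None]
    p_hat[band] = sum(obs) / len(obs)

# Draw m completed datasets; each missing value is sampled independently,
# so the datasets differ and reflect the uncertainty of the imputation
m = 5
completed = []
for _ in range(m):
    filled = [(b, d if d is not None else int(random.random() < p_hat[b]))
              for b, d in data]
    completed.append(filled)

# The analysis model would then be fitted to each completed dataset and the
# m sets of estimates combined into a single pooled result
print(len(completed), "completed datasets; none missing:",
      all(d is not None for ds in completed for _, d in ds))
```

Amelia itself uses a far more sophisticated procedure (a bootstrapped EM algorithm over a multivariate model), but the essential point carries over: multiple stochastic completions, not a single deterministic fill-in.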
Table 4: Fit statistics for models with imputed data. The LR-chisq statistics and p-values are shown for binary logit models “Drop-out ~ Course + Disposition + GCSE-grade + Maths Self Efficacy” computed for five datasets containing imputed data (N=1373).

Dataset     Explanatory Variable    LR Chisq   Df   Pr(>Chisq)
Dataset 1   Course                  31.895     1    1.628e-08 ***
            Disposition             11.695     1    0.0006265 ***
            GCSE-grade              126.248    5    < 2.2e-16 ***
            Maths Self Efficacy     0.015      1    0.9018691
Dataset 2   Course                  46.860     1    7.623e-12 ***
            Disposition             7.872      1    0.00502 **
            GCSE-grade              112.747    5    < 2.2e-16 ***
            Maths Self Efficacy     2.269      1    0.13202
Dataset 3   Course                  24.842     1    6.222e-07 ***
            Disposition             18.711     1    1.521e-05 ***
            GCSE-grade              97.550     5    < 2.2e-16 ***
            Maths Self Efficacy     0.008      1    0.9278
Dataset 4   Course                  43.230     1    4.868e-11 ***
            Disposition             0.612      1    0.4340
            GCSE-grade              121.636    5    < 2.2e-16 ***
            Maths Self Efficacy     2.646      1    0.1038
Dataset 5   Course                  40.984     1    1.535e-10 ***
            Disposition             17.809     1    2.442e-05 ***
            GCSE-grade              143.106    5    < 2.2e-16 ***
            Maths Self Efficacy     0.002      1    0.9663
***: p<0.001, **: p<0.01, *: p<0.05.
Each of the models using the imputed data identifies Course and GCSE-grade as being
important. Four of the five models indicate that Disposition is also significant, and no model
indicates Maths Self Efficacy as being important. A 'final' model can be computed for the
imputed data by combining all five datasets to provide an 'average'. However, for simplicity,
in this analysis we select the model from dataset 1 for illustration, as it appears to be roughly
in the middle of the estimated significances for the variables and allows a simple comparison
to be made with the other models. Table 5 shows a comparison of all three models (one
computed with missing data, one with imputed data and one with the full data).
Table 5: Fit statistics for models with missing, imputed and full data. LR-chisq statistics and p-values are shown for binary logit models “Drop-out ~ Course + Disposition + GCSE-grade + Maths Self Efficacy” computed for the original data with missing values (N=494), for the original data with missing values imputed (N=1373) and for the full data set containing all 1373 cases.

Explanatory Variables   Model with missing data (N=494)   Model using imputed data (N=1373)   Model using all available data (N=1373)
GCSE-grade              85.90; p<2.2e-16 ***              126.25; p<2.2e-16 ***               176.07; p<2.2e-16 ***
Course                  21.14; p=4.27e-06 ***             31.90; p=1.63e-08 ***               72.54; p<2.2e-16 ***
Disposition             3.56; p=0.059                     11.70; p=0.001 **                   21.17; p=4.2e-06 ***
Maths Self Efficacy     0.46; p=0.498                     0.02; p=0.902                       10.60; p=1.1e-03 **
***: p<0.001, **: p<0.01, *: p<0.05.
The data imputation appears to have been at least partially successful, as it has identified
Disposition as being significant. It has not, however, provided a completely accurate picture
of the full data set, as the imputed data have not indicated MSE as being significant and have
also underestimated the contributions of the other variables in the model. It is, however,
worth noting that the model with the imputed data does appear to be 'closer' to the model
computed for the full dataset than the original model with the missing data.
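The 'averaging' of results across imputed datasets mentioned above is conventionally done with Rubin's rules: the pooled estimate is the mean of the m per-dataset estimates, and its total variance combines the average within-imputation variance W with the between-imputation variance B as T = W + (1 + 1/m)B. The numbers below are invented purely to illustrate the arithmetic:

```python
# Hypothetical per-dataset coefficient estimates and their squared standard
# errors from m = 5 imputed datasets -- illustrative values only
estimates = [0.52, 0.61, 0.48, 0.57, 0.55]
variances = [0.010, 0.012, 0.011, 0.009, 0.010]

m = len(estimates)
pooled = sum(estimates) / m                              # pooled point estimate
W = sum(variances) / m                                   # within-imputation variance
B = sum((e - pooled) ** 2 for e in estimates) / (m - 1)  # between-imputation variance
T = W + (1 + 1 / m) * B                                  # total variance (Rubin's rules)

print(f"pooled estimate = {pooled:.3f}, total SE = {T ** 0.5:.3f}")
```

Because B enters the total variance, the pooled standard error is larger than any naive single-dataset standard error, which is precisely how multiple imputation propagates the uncertainty of the imputation into the final inference.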
5. Discussion – Implications – Conclusions
Our aim with this paper was to revisit and review the topic of dealing with nonresponse
through the lens of educational research, and to show some implications of considering or
ignoring it, drawing on our recent work on widening participation in (mathematics)
education. In addition, we wanted to assess the accuracy of multiple imputation using an
example where models from imputed data could be compared with models derived from
actual data. We summarise and discuss some main concluding points here.
First, we have shown, as the literature in social statistics suggests, that ignoring
missing data can have serious consequences in educational research contexts. Missing data is
effectively ignored when using 'complete-case' analyses and the automated use of stepwise
selection procedures often exacerbates this as cases can easily be excluded on the basis of data
missing from variables not even part of a final model. Ignoring missing data causes problems
when the data are not missing at random, as is likely for most missing data in educational
research.
Our approach of collecting more data (i.e. from schools or via telephone interviews
with students) as a remedy for the problems caused by missing data is also aligned with
recent literature on the topic, linked with mixed-mode surveys. This type of survey was
regarded as the new trend in survey research in the late 20th century, defined as surveys
which combine the use of telephone, mail, and/or face-to-face interview procedures to
collect data (Dillman and Tarnai 1988). Developments in technology and the widespread
use of the internet, and consequently of online surveys, have made this type of survey the
norm for our times (Biemer and Lyberg 2003; Glover and Bush 2005). One of the reasons for
their increased use is their capacity to deal with response/nonresponse problems, in efforts to
increase response or investigate bias (Dillman 2009; Sikkel et al. 2009).
The results from our study demonstrated that the initial data sample which included a
substantial amount of missing data was likely to be biased, as a regression model using this
dataset was quite different to one based on a more complete dataset that included information
subsequently collected. Imputing the missing data proved to be a useful exercise, as it
'improved' the model (at least with respect to its comparability with the more complete
dataset) but did not entirely recreate the structure of the full dataset. The analysis of the imputed data resulted
in a model that 'correctly' identified the significance of Disposition in the full dataset, but did
not identify Maths Self Efficacy as being significant and also underestimated the significance
of the other variables in the model. A note should be made here about our modelling in
relation to the longitudinal nature of the study: the model we present is essentially
longitudinal since the outcome (dropout) variable is collected at the final data point of the
study, and this is modelled with explanatory variables collected at earlier stages (e.g.
dispositions and self efficacy). Since our emphasis was on imputation, however, we have
chosen to simplify the way the model appears.
The imputation procedure may be regarded as being partially successful and its use
justified for these data. It should be noted that the imputation was based on a limited
selection of variables that had been identified from a previous analysis. It is likely that the
imputation could have been improved if more variables had been used in the imputation
process. In addition, our data imputation was for a binary category and is likely to be
relatively weak. This, however, strengthens the case for imputation as imputing richer data
(multi-categories, ordered and continuous) may result in more accurate imputations. It is
encouraging that the multiple imputation worked even with such a simple imputation. Ideally
a 'model' of why data is missing (such as those presented in Figure 3) would be employed to
identify the variables that may be used to model the missing data points. Collecting such data
should be an important consideration for research design whenever missing data is likely to be
an issue. An important conclusion from this study was that even with quite limited available
data to model missing data points, multiple imputation still proved to be useful.
The most important conclusion from this paper is that missing data can have adverse
effects on analyses and imputation methods should be considered when this is an issue. This
study shows the value of multiple imputation even for imputing a binary outcome variable
using limited information. But, as in this case study, the imputation, while producing an
improvement (e.g. in modelling with Disposition), completely failed to detect the significance
of an important variable. It is encouraging that tools now exist to enable multiple imputation
to be applied relatively simply using easy to access software. Thus, multiple imputation
techniques are now within the grasp of most educational researchers and should be used
routinely in the analysis of educational data. However, it still behoves researchers to be
cautious when modelling data sets with large amounts of missing data, even if they use
multiple imputation.
As a final remark, our analysis demonstrates the advantages of the technique and
suggests that it should be used routinely in longitudinal studies where data attrition can be
substantial.
Acknowledgements
The authors would like to acknowledge the support of the ESRC project with grant number RES-062-23-1213, the support of the ESRC-TLRP programme of research into widening participation, funded by ESRC: RES-139-25-0241, and the continuing support of ESRC grant RES-061-025-0538.
References
ACME. 2009. The mathematics education landscape in 2009. A report of the Advisory Committee on Mathematics Education (ACME) for the DCSF/DIUS STEM High Level Strategy Group Meeting (12 June 2009). Retrieved from http://www.acme-uk.org/downloaddoc.asp?id=139 [March, 2010].
Allison, P.D. 2000. Multiple imputation for missing data: A cautionary tale. Sociological Methods and Research 28, no. 3: 301-09.
Biemer, P.P. and L.E. Lyberg. 2003. Introduction to survey quality. Wiley series in survey methodology. Hoboken, NJ: John Wiley & Sons, Inc.
Dillman, D. 2009. Some consequences of survey mode changes in longitudinal surveys. In Methodology of longitudinal surveys, ed. Lynn, P., 127-40. Chichester: Wiley.
Dillman, D. and J. Tarnai. 1988. Administrative issues in mixed mode surveys. In Telephone survey methodology, eds Groves, R.M., Biemer, P., Lyberg, L., Massey, J., Nicholls, W.L. and Waksberg, J. New York: Wiley.
Durrant, G.B. 2009. Imputation methods for handling item non-response in practice: Methodological issues and recent debates. International Journal of Social Research Methodology 12, no. 4: 293-304.
Fox, J. 1987. Effect displays for generalized linear models. Sociological Methodology 17: 347-61.
Fox, J. and J. Hong. 2009. The effects package: Effect displays for linear, generalized linear, multinomial-logit, and proportional-odds logit models. http://www.R-project.org.
Fox, J. and S. Weisburg. 2011. An R companion to applied regression (second edition). London: Sage.
Glover, D. and T. Bush. 2005. The online or e-survey: A research approach for the ICT age. International Journal of Research & Method in Education 28, no. 2: 135-46.
Honaker, J., G. King and M. Blackwell. 2012. Amelia II: A program for missing data (version 1.6.1). Available at: www.gking.harvard.edu/amelia (retrieved December 2012).
Horton, N.J. and K.P. Kleinman. 2007. Much ado about nothing: A comparison of missing data methods and software to fit incomplete data regression models. The American Statistician 61, no. 1: 79-90.
Hox, J. 2000. Multilevel analysis of grouped and longitudinal data. In Modeling longitudinal and multilevel data, eds Little, T.D. and Schnabel, K.U. London: Lawrence Erlbaum Associates.
Hutcheson, G. and M. Pampaka. 2012. Missing data: Data replacement and imputation. Journal of Modelling in Management (Tutorial Section) 7, no. 2: 221-33.
Hutcheson, G., M. Pampaka and J.S. Williams. 2011. Enrolment, achievement and retention on ‘traditional’ and ‘use of mathematics’ AS courses. Research in Mathematics Education 13, no. 2: special issue.
Ibrahim, J.G., M.-H. Chen, S.R. Lipsitz and A.H. Herring. 2005. Missing-data methods for generalized linear models. Journal of the American Statistical Association 100, no. 469: 332-46.
Kim, J.K. and W. Fuller. 2004. Fractional hot deck imputation. Biometrika 91, no. 3: 559-78.
King, G., J. Honaker, A. Joseph and K. Scheve. 2001. Analyzing incomplete political science data: An alternative algorithm for multiple imputation. American Political Science Review 95: 49-69.
Little, R.J.A. 1988. Missing-data adjustments in large surveys. Journal of Business & Economic Statistics 6, no. 3: 287-96.
Little, R.J.A. and D.B. Rubin. 1987. Statistical analysis with missing data. New York: John Wiley & Sons.
Little, R.J.A. and D.B. Rubin. 1989. The analysis of social science data with missing values. Sociological Methods & Research 18, no. 2-3: 292-326.
Nielsen, S.R.F. 2003. Proper and improper multiple imputation. International Statistical Review / Revue Internationale de Statistique 71, no. 3: 593-607.
Pampaka, M., I. Kleanthous, G.D. Hutcheson and G. Wake. 2011a. Measuring mathematics self-efficacy as a learning outcome. Research in Mathematics Education 13, no. 2: 169-90.
Pampaka, M., J. Williams and G. Hutcheson. 2012a. Measuring students’ transition into university and its association with learning outcomes. British Educational Research Journal 38, no. 6: 1041-71.
Pampaka, M., J.S. Williams, G. Hutcheson, L. Black, P. Davis, P. Hernandez-Martinez and G. Wake. 2013. Measuring alternative learning outcomes: Dispositions to study in higher education. Journal of Applied Measurement 14, no. 2.
Pampaka, M., J.S. Williams, G. Hutcheson, G. Wake, L. Black, P. Davis and P. Hernandez-Martinez. 2012b. The association between mathematics pedagogy and learners’ dispositions for university study. British Educational Research Journal 38, no. 3: 473-96.
Plewis, I. 2007. Non-response in a birth cohort study: The case of the Millennium Cohort Study. International Journal for Social Research Methodology 10, no. 5: 325-34.
Plewis, I. 2009. Statistical modelling for structured longitudinal designs. In Methodology of longitudinal surveys, ed. Lynn, P. Chichester: Wiley.
Reiter, J.P. and T.E. Raghunathan. 2007. The multiple adaptations of multiple imputation. Journal of the American Statistical Association 102, no. 480: 1462-71.
Roberts, G. 2002. SET for success: The supply of people with science, technology, engineering and mathematics skills. London: HM Stationery Office.
Rubin, D.B. 1987. Multiple imputation for nonresponse in surveys. New York: John Wiley.
Sainsbury, L. 2007. The race to the top: A review of government’s science and innovation policies. HM Stationery Office.
Sartori, N., A. Salvan and K. Thomaseth. 2005. Multiple imputation of missing values in a cancer mortality analysis with estimated exposure dose. Computational Statistics & Data Analysis 49, no. 3: 937-53.
Schafer, J.L. 1999. Multiple imputation: A primer. Statistical Methods in Medical Research 8: 3-15.
Schafer, J.L. 2000. Analysis of incomplete multivariate data. London: Chapman and Hall.
Schafer, J.L. and J.W. Graham. 2002. Missing data: Our view of the state of the art. Psychological Methods 7, no. 2: 147-77.
Schulte Nordholt, E. 1998. Imputation: Methods, simulation experiments and practical examples. International Statistical Review 66, no. 2: 157-80.
Sikkel, D., J. Hox and E. De Leeuw. 2009. Using auxiliary data for adjustment in longitudinal research. In Methodology of longitudinal surveys, ed. Lynn, P., 141-55. Chichester: Wiley.
Smith, A. 2004. Making mathematics count: The report of Professor Adrian Smith’s inquiry into post-14 mathematics education. London: DfES.
Squire, P. 1988. Why the 1936 Literary Digest poll failed. Public Opinion Quarterly 52: 125-33.
Wainer, H. 2010. 14 conversations about three things. Journal of Educational and Behavioral Statistics 35, no. 1: 5-25.
Wayman, J.C. 2003. Multiple imputation for missing data: What is it and how can I use it? Paper presented at the Annual Meeting of the American Educational Research Association, Chicago.
Williams, J.S., L. Black, P. Davis, P. Hernandez-Martinez, G. Hutcheson, S. Nicholson, M. Pampaka and G. Wake. 2008. TLRP research briefing no 38: Keeping open the door to mathematically demanding programmes in further and higher education. School of Education, University of Manchester.
Zhang, P. 2003. Multiple imputation: Theory and method. International Statistical Review 71, no. 3: 581-92.