Getting the most out of missing data in educational research with Multiple
Imputation
Maria Pampaka*, Graeme Hutcheson, Julian Williams
School of Education, The University of Manchester, Manchester, UK
* Corresponding author
School of Education
Room B4.10 Ellen Wilkinson Building
University of Manchester
Oxford Road
Manchester M13 9PL
Email: [email protected]
The ubiquitous presence of missing data is endemic to much educational research,
especially longitudinal studies. However, practices such as step-wise regression,
common in the educational research literature, have long been regarded as unacceptable
by social statisticians, particularly when missing data is an issue, and multiple
imputation is generally recommended instead. In this paper we first review these
methodological advances and their implications for educational research. We then draw
on an educational longitudinal survey in which missing data was significant, but for
which we were subsequently able to collect much of the missing data through additional
data collection. We thus compare step-wise regression (which essentially ignores the
missing data) and multiple imputation models with the models from the actual enhanced
sample. The results show the value of multiple imputation and the risks involved in
ignoring missing data. Implications for best research practice are discussed.
Keywords: missing data, mathematics education, multiple imputation, longitudinal
surveys
1. Introduction
Missing data is certainly not a new issue for educational research, particularly given the
constraints of designing and performing research in schools and other educational
establishments. Consider the situation when a researcher gets permission to administer a
questionnaire to the students during class time. On the agreed day of administration various
scenarios could take place: (A) some pupils may have been absent due to essentially random
reasons (such as illness or car breakdown), (B) some pupils may have been absent because
they are representing their school in competitions (these pupils may be the keenest and most
engaged), (C) some of the pupils who were present stopped before the end of the
questionnaire (these pupils may be the ones who are easily distracted or who have difficulty
reading), (D) some pupils may have skipped a page or skipped individual items (randomly, or
maybe they are careless), and (E) some pupils did not respond to sensitive questions (e.g.
about special educational needs). All the above scenarios will lead to missing data, but with
different degrees of bias depending on the object of the analysis. For example, if data are
missing due to scenario B, the analysis will under-represent the more highly attaining or
engaged pupils. If data are missing due to scenario E, those pupils with special educational
needs will be under-represented. In the real world, data are likely to be missing for
multiple reasons, with the above scenarios occurring simultaneously in any single project.
Missing data is a particular issue for longitudinal studies, especially when the design
involves transitions between institutions and phases of education. This is a major issue in the
wider social science literature, which acknowledges that nearly all longitudinal studies suffer
from significant attrition, raising concerns about the characteristics of the dropouts compared
to the remaining subjects and the implications of this for the validity of inferences when
applied to the target population (Kim and Fuller 2004; Little 1988; Little and Rubin 1989;
Plewis 2007; Schafer and Graham 2002).
Even though the issues around missing data are well-documented, it is common
practice to ignore missing data and employ analytical techniques that simply delete all cases
that have missing data on any of the variables considered in the analysis (see, for example,
Horton and Kleinman (2007) for a review of medical research reports, and King et al. (2001,
p.49) who state that “approximately 94% (of analyses) use listwise deletion to eliminate entire
observations. […] The result is a loss of valuable information at best and severe selection bias
at worst”). The use of step-wise selection methods is particularly dangerous when used in the
presence of missing data as the loss of information can be severe and may not even be
obvious. A demonstration of this is provided in Table 1, where the variable `SCORE' (a
continuous variable modelled using OLS regression) is modelled using three candidate
explanatory variables. In order to compare successive models, a typical step-wise procedure
first deletes any missing data listwise, leaving only the complete cases. Any case that has a
missing data point in any of the candidate variables is removed from the analysis, even when
the data is missing on a variable that is not included in the final model. This can result in the
loss of substantial amounts of information and introduce bias into the selected models.
Table 1: An example dataset for a regression model of ‘SCORE’
Case  SCORE  GCSE  AGE  SES  Included in step-wise model  Could have been included
1     25     19    25   2    yes                          yes
2     32     23    NA   5    no                           yes
3     45     NA    17   7    no                           no
4     65     NA    16   3    no                           no
5     49     42    16   3    yes                          yes
6     71     28    15   NA   no                           yes
7     68     97    19   NA   no                           yes
8     67     72    18   5    yes                          yes
9     59     66    18   2    yes                          yes
10    69     65    17   7    yes                          yes
Note: NA denotes a missing data point
A regression model of `SCORE' is derived using a step-wise procedure with `GCSE', `AGE' and `SES' as potential explanatory variables (for simplicity, all variables are considered as continuous). The model `SCORE ~ GCSE' was selected using backward elimination on the basis of estimates of significance for the individual variables (based on t-values). This model was constructed from only 5 cases. There are, however, 8 cases which could have been used to construct 'the same model'.
Figure 1 shows the model `SCORE ~ GCSE' obtained using a stepwise procedure and
the 'same model' constructed using all available data. The stepwise model is estimated from a
smaller sample as cases 2, 6 and 7 are excluded. The exclusion of these cases has introduced
bias in the analysis as these three cases all have relatively low GCSE marks. The parameter
estimates from the stepwise procedure are based on a sub-sample that does not even reflect
the sample particularly accurately. The models provide very different impressions about the
relationship between SCORE and GCSE marks. It is important to note that very little warning
may be given about the amount of data discarded by the model selection procedure. The loss
of data is only evident in the following output by comparing the degrees of freedom statistics
for the two models. It is, therefore, worth checking how a statistical package deals with
model selection1 and also to make sure that the selection process does not needlessly exclude
data.
SCORE ~ GCSE derived from stepwise regression based on 5 cases:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.3135 6.6089 1.863 0.15933
GCSE 0.7857 0.1172 6.702 0.00678
Multiple R-squared: 0.9374, Adjusted R-squared: 0.9165
F-statistic: 44.92 on 1 and 3 DF, p-value: 0.006777
SCORE ~ GCSE based on all 8 available cases:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 33.8170 13.3409 2.535 0.0444
GCSE 0.4884 0.2795 1.747 0.1312
Multiple R-squared: 0.3372, Adjusted R-squared: 0.2268
F-statistic: 3.053 on 1 and 6 DF, p-value: 0.1312
Figure 1. The use of a stepwise regression procedure has resulted in some loss of data. The amount of data excluded is only evident in the difference between the DF statistics for the two models
(3 compared to 6 DF).
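The effect of listwise deletion on this example can be reproduced with a short script. The sketch below is in Python rather than R (in which the outputs above were produced), using the Table 1 data with None standing in for NA:

```python
# Table 1 data: (SCORE, GCSE, AGE, SES); None marks a missing value
data = [
    (25, 19, 25, 2), (32, 23, None, 5), (45, None, 17, 7),
    (65, None, 16, 3), (49, 42, 16, 3), (71, 28, 15, None),
    (68, 97, 19, None), (67, 72, 18, 5), (59, 66, 18, 2), (69, 65, 17, 7),
]

def complete_cases(rows, columns):
    """Listwise deletion: keep only rows with no missing value in the given columns."""
    return [r for r in rows if all(r[c] is not None for c in columns)]

def ols_slope(pairs):
    """Simple-regression slope for (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    return sxy / sxx

# A stepwise search screens all candidate variables, so it deletes on every column first.
stepwise_rows = complete_cases(data, columns=[0, 1, 2, 3])
print(len(stepwise_rows))   # 5 cases survive

# The selected model SCORE ~ GCSE only needs those two columns.
usable_rows = complete_cases(data, columns=[0, 1])
print(len(usable_rows))     # 8 cases could have been used

slope = ols_slope([(r[1], r[0]) for r in stepwise_rows])
print(round(slope, 4))      # 0.7857, the GCSE coefficient reported in Figure 1
```

The counts (5 versus 8 cases) and the complete-case slope match the output shown in Figure 1; the point is that the deletion happens silently, before any variable is selected.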
Even though missing data is an important issue, it is rarely dealt with or even
acknowledged in educational research (for an exception to this, see Wayman 2003). Whilst
data imputation is now generally accepted by statisticians (particularly multiple imputation),
non-specialist researchers have been slow to adopt it and there is a general reluctance from
reviewers to accept analyses where data have been imputed. Data imputation makes an `easy
target' for criticism, mainly because it involves adding simulated data to a raw data set, which
causes some suspicion that the data sample is not representative.
1 Some statistical packages automatically delete all cases list-wise (SPSS, for example), while others (for
example, the “step( )” procedure implemented in R (R Development Core team, 2012)) do not allow step-wise regression to be easily applied in the presence of missing data - when the sample size changes as a result of variables being added or removed, the procedure halts.
Our aim in this paper is to review some of the issues surrounding missing data and
imputation methods and demonstrate how missing data can be imputed using readily-available
software. Using a real (longitudinal) dataset which (i) had serious quantities of missing data,
and (ii) was supplemented by subsequently acquiring a substantial proportion of this missing
data, we are able to evaluate the effects of ignoring missing data, compared with multiple
imputation of that missing data.
A more detailed description of our methodology and the relevant results follows the
brief review of the issue presented next.
2. On Missing Data and Possible Resolutions – A review
The importance and timeliness of the issue is highlighted by Wainer (2010), who looks
forward into the 21st century with regard to methodological and measurement trends.
Dealing with missing data appears as one of the six ‘necessary tools’ researchers must
have to be successful in tackling the problems “that loom in the near and more distant
future” (p. 7). The topic is not new, however: work on missing data from the 1970s was
brought together by Little and Rubin (1987).
2.1. Definitions of some useful concepts
Theoretically, missing data can be classified according to the missing data pattern and
the assumptions underlying the ‘missingness’ mechanisms, i.e. the mechanisms that are
believed to cause the data to be missing.
A fundamental distinction is that between ‘unit’ and ‘item’ nonresponse. Unit
nonresponse refers to the complete absence of information from an individual (or case), e.g.
because they either refused to respond or were not contacted for survey completion. This is
usually considered ‘selective’, implying that the answers of non-respondents differ
from those of the respondents, and so suggestive of sample bias: famous examples include
wildly inaccurate predictions based on telephone polling, such as the failure of the 1936
Literary Digest Poll (Squire 1988). Item nonresponse, on the other hand, refers to the failure
of a respondent to give information on some variables of interest (i.e. some items in the
survey). In a longitudinal panel survey, ‘unit’ nonresponse may be manifested as a special
case of an ‘item’ nonresponse pattern: assuming the non-respondents at a certain wave had
completed earlier surveys, the responses at previous waves are still available (observed)
data even though the unit/case is completely missing from that wave.
Missing data mechanisms are described as falling into one of three categories, briefly
outlined below (Allison 2000), which are sometimes called ‘distributions of
missingness’ (Schafer and Graham 2002). This classification is important because the
possible resolutions of missing data problems depend on it.
Missing completely at random (MCAR): the missingness is independent of both the
observed and the missing responses; i.e. the probability of being missing is constant for
all units/cases. This could be manifested in Scenario (A) from the example in the
introduction, where pupils are missing from a sample because they happened to be
away from school for random reasons.
Missing at random (MAR): the missingness is conditionally independent of the
missing responses, given the observed responses; in other words, the probability of
missing data on a particular variable Y can depend on other observed variables (but
not on Y itself). An example of such missingness is Scenario (B) from our original
example, with some pupils missing because they have been away to represent their
school in competitions.
Missing not at random (MNAR): missingness depends on both observed and
unobserved (missing) data, such as the case of Scenario (E), with pupils not responding
to sensitive questions (e.g. about special educational needs).
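These three mechanisms can be made concrete with a small simulation. The Python sketch below (hypothetical data; the variable names are ours, not from the study) generates an observed covariate x and an outcome y, then deletes y under each mechanism; a complete-case mean of y stays near its true value (zero) only under MCAR:

```python
import random

random.seed(1)

def simulate(mechanism, n=10_000):
    """Generate (x, y) pairs and delete y according to a missingness mechanism."""
    observed = []
    for _ in range(n):
        x = random.gauss(0, 1)       # observed covariate (e.g. engagement)
        y = x + random.gauss(0, 1)   # outcome (e.g. a test score), true mean 0
        if mechanism == "MCAR":
            p_missing = 0.3                      # constant for every case
        elif mechanism == "MAR":
            p_missing = 0.6 if x > 0 else 0.1    # depends on observed x only
        else:                                    # MNAR
            p_missing = 0.6 if y > 0 else 0.1    # depends on y itself
        if random.random() >= p_missing:
            observed.append(y)
    return observed

results = {}
for mech in ("MCAR", "MAR", "MNAR"):
    ys = simulate(mech)
    results[mech] = sum(ys) / len(ys)
    print(mech, round(results[mech], 2))
# Under MCAR the observed mean stays near 0; under MAR and MNAR the
# complete cases are a biased subset and the observed mean is pulled down.
```

Note that even under MAR the complete-case mean of y is biased; MAR only guarantees that the bias can be removed by conditioning on the observed x, which is what the imputation methods discussed below exploit.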
Another concept inherent to nonresponse is the ‘response rate’, information about which
should be a first step in any critical appraisal of the missingness and its implications. This
step is even more complicated for designs that do not involve ‘simple random samples’, a
common situation in educational research where volunteer and convenience samples are
more prominent, making even the task of defining a response rate more challenging.
Finally, an essential term is ‘nonresponse bias’, which occurs when the distribution of the
missing values, were it known, differs from the distribution of the observed values
(Durrant 2009). This is what we should try to reduce, estimate or avoid in our surveys.
So, we have defined the problem, we have realised we have significant missing data,
and we may have the means to define the missingness pattern. What can we do next?
2.2. What can/should be done?
Depending on the missingness patterns and the particular practical problems, the technique
to be employed for handling missing data can now be chosen. General advice as to what
should be done with missing data comprises three points: (1) always report details of the
missing data; (2) adjust for the missing data and its implications, if possible; and (3) report
the sensitivity of the results to the possible distributions of the missing observations. The first
and the last of the above points are self-explanatory, and will be explored through our own
research example. Our emphasis here, therefore, is on the various available adjustment
methods.
Three general strategies for analysing incomplete data are summarised by Little and Rubin
(Little 1988; Little and Rubin 1987; 1989; Rubin 1987) and by others more recently (e.g.
Allison 2000; Durrant 2009; Ibrahim et al. 2005; Reiter and Raghunathan 2007; Zhang 2003):
(a) direct analysis of the incomplete data, (b) weighting, and (c) imputation.
The first, also referred to as the “complete case method”, is considered the simplest,
involving the analysis of only the observations without missing data (in effect, what happens
with step-wise regression methods and in the illustrative example we presented earlier with
SCORE). When missingness is completely random (MCAR), the estimates produced with
this method are unbiased. Since this is not a very frequent scenario in educational research,
and because of the limitations of these methods already detailed, the topic is not discussed
further.
Weighting, on the other hand, is considered a traditional remedy for dealing with missing
data, and unit nonresponse in particular (Little 1988): most nation-wide surveys already have
identifiable weights, and data can be considered ‘missing by design’ if sampling
fractions differ. Weights are generally calculated as the inverse of the probability of selection
for each case (Horton and Kleinman 2007). They are then adjusted for nonresponse
from available data within the existing sampling frame or, in the case of longitudinal surveys,
from another time point (wave). In effect, incomplete cases are discarded and a weight is
assigned to each complete case to compensate for the missing cases. An example of how
weights could be calculated is shown in Table 2, with a hypothetical sample of 100 students
selected from one school (which represents the population in this case) of 1000 students. As
can be seen, the fact that female students were oversampled (n=60) compared to male
students (n=40) is compensated for by assigning a higher weight to the male respondents.
Table 2: An example of weighting in a school survey
Gender   Population (School)   Sample           Population/Sample   Weight
         Proportion (N)        Proportion (N)
Female   0.5 (500)             0.6 (60)         0.5/0.6             0.8333
Male     0.5 (500)             0.4 (40)         0.5/0.4             1.25
Total    1 (1000)              1 (100)
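The calculation behind Table 2 can be sketched in a few lines of Python (a hypothetical illustration of the same school survey; the weight for each group is its population proportion divided by its sample proportion):

```python
# Hypothetical school survey from Table 2: counts by gender
population = {"Female": 500, "Male": 500}   # school of 1000 students
sample     = {"Female": 60,  "Male": 40}    # achieved sample of 100

N = sum(population.values())
n = sum(sample.values())

# Weight = population proportion / sample proportion for each group
weights = {g: (population[g] / N) / (sample[g] / n) for g in population}
print(weights)   # Female ~0.8333, Male 1.25, as in Table 2

# Applying the weights to the respondents recovers the population balance:
weighted_total = {g: weights[g] * sample[g] for g in sample}
print(weighted_total)   # 50 'effective' students in each group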
Limitations associated with this method include the need for complete data on
contributing factors, inefficiency of the method when there are extreme weights, the need for
many sets of weights and complications when the data pattern is not monotonic2. According
to Schafer & Graham (2002), weighting can eliminate bias due to differential responses
related to the variables used to model the response probabilities, but not for unused variables.
The same authors also review and recommend two (‘modern’) approaches for dealing
with missing data: maximum likelihood (ML) estimation based on all available data, and
(Bayesian) multiple imputation. Statistically, both are based on the distribution of the
observed data as a function of the population distribution (complete dataset) with respect to
the missing data (the statistically minded reader may look into the relevant functions for
P(Ycomplete; θ)3 in Schafer & Graham, 2002, e.g. p. 154).
Maximum Likelihood Estimation and Multiple Imputation methods
ML estimation methods for dealing with missing data are considered advantageous and more
useful by Schafer & Graham (2002) because they only assume MAR (compared to the
MCAR assumption for methods derived from repeated sampling arguments only). In sum,
the method is based on maximising the (log of the) likelihood function mentioned earlier
(i.e. the observed data as a function of the population distribution (complete dataset) with
respect to the missing data): for most problems this is computed iteratively with the
Expectation-Maximisation (EM) algorithm; the technique is very well established in
statistical handbooks and it is not our focus to detail it again here. These methods overall
summarise a likelihood function averaged over a predictive distribution for the missing
values (Ibrahim et al. 2005; Schafer 1999; Schafer and Graham 2002). According to Horton
& Kleinman (2007) “for each observation with missing data, multiple entries are created in
an augmented dataset for each possible value of the missing covariates, and a probability of
observing that value is estimated … the augmented complete-dataset is then used to fit the
regression model” (p. 82). Based on this definition, ML estimation is very similar to
Multiple Imputation.

2 Monotone missing data pattern: the variables can be arranged so that for j=1,…,K-1, X(j) is observed whenever X(j+1) is observed (for example, in a longitudinal study with 5 data points an observation at the fourth time point implies complete observations up to that time).

3 From a statistical point of view P(Ycomplete; θ) has two possible interpretations, which guide the choice of estimation methods for dealing with missing data:
- when regarded as the repeated-sampling distribution for Ycomplete, it describes the probability of obtaining any specific data set among all the possible datasets that could arise over hypothetical repetitions of the sampling procedure and data collection;
- when considered as a likelihood function for θ (the unknown parameter), the realised value of Ycomplete is substituted into P and the resulting function of θ summarises the data’s evidence about the parameters.
Imputation methods involve replacing missing values by suitable estimates and then
applying standard complete-data methods to the filled-in data. The main reason for doing this
is to reduce nonresponse bias: “rather than deleting cases that are subject to item-nonresponse,
the sample size is maintained resulting in a potentially higher efficiency than for case
deletion” (Durrant 2009, p. 295). Imputation methods are broadly distinguished as
deterministic or random (stochastic). The latter require the generation of random numbers,
whereas the former do not. Hence, deterministic approaches produce the same imputed
value for cases with the same characteristics (e.g. using a measure of central tendency to
represent the imputed distribution of the missing variable). Some derived methods, such as
regression and hot deck imputation, can be either deterministic or random (Durrant 2009;
Schulte Nordholt 1998).
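The deterministic/stochastic distinction can be illustrated with a minimal Python sketch (hypothetical scores; mean imputation as the deterministic case and mean-plus-random-draw as a simple stochastic variant):

```python
import random
import statistics

random.seed(42)

scores = [52, 61, None, 47, 70, None, 58]   # None marks item nonresponse

observed = [s for s in scores if s is not None]
mean, sd = statistics.mean(observed), statistics.stdev(observed)

def impute_deterministic(values):
    """Every missing value gets the same estimate (the observed mean)."""
    return [mean if v is None else v for v in values]

def impute_stochastic(values):
    """Each missing value gets a random draw around the mean, preserving spread."""
    return [random.gauss(mean, sd) if v is None else v for v in values]

det = impute_deterministic(scores)
sto = impute_stochastic(scores)
print(det)
print(sto)
# The deterministic fill gives both missing cases an identical value and
# shrinks the variance of the completed data; the stochastic fill keeps
# the spread closer to that of the observed values.
```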
Hot deck imputation procedures replace missing values with values/records that occur
in the sample, to compensate for item nonresponse. The record providing a value is called the
donor and the record with the missing value the recipient (Kim and Fuller 2004). In random
hot deck imputation, non-respondents are assigned values chosen at random from respondents
in the same imputation cell (Hox 2000). Nearest neighbour and predictive mean matching
imputations are also considered hot deck methods (Durrant 2009; Sartori et al. 2005; Zhang
2003).
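A random hot deck can be sketched as follows (a hypothetical Python illustration, with imputation cells defined here by gender):

```python
import random

random.seed(7)

# Each record: (imputation cell, score); None marks a recipient with a missing score
records = [
    ("F", 63), ("F", 58), ("F", None),
    ("M", 49), ("M", 71), ("M", None), ("M", 55),
]

# Pool of donors per imputation cell (respondents only)
donors = {}
for cell, score in records:
    if score is not None:
        donors.setdefault(cell, []).append(score)

# Random hot deck: each recipient takes the value of a random donor in its cell
imputed = [
    (cell, random.choice(donors[cell]) if score is None else score)
    for cell, score in records
]
print(imputed)
```

Because donated values are real responses from similar cases, the imputed data stay within the observed range; the quality of the result depends on how well the imputation cells capture the response mechanism.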
Repeated (Multiple) Imputation (MI) is becoming the most popular procedure for
handling missing data. It allows the analyst to use analyses designed for complete data, while
at the same time providing a method for estimating the uncertainty due to the missing data.
Some requirements need to be met for multiple imputation, such as: (a) data must be missing
at random (MAR), (b) the model used to generate the imputed values must be correct, and (c)
the model used for the analysis must match up with the model used in the imputation (Allison
2000, p. 302; Rubin 1987).
The basic idea of MI, as proposed by Rubin (1987), involves these steps:
Impute missing values using an appropriate model that incorporates random variation;
Do this M times4, producing M ‘complete’ data sets;
Perform the desired analysis on each of these data sets using standard complete-data
methods;
Average the values of the parameter estimates across the M samples to produce a
single-point estimate (i.e. θ̄ = (1/M) Σ_{m=1}^{M} θ̂_m);
Calculate the standard errors by (a) averaging the squared standard errors of the M
estimates, (b) calculating the variance of the M parameter estimates across samples,
and (c) combining the two quantities using an adjustment term (i.e. 1+1/M). This step
is necessary so as to incorporate the variability due to imputation (Allison 2000;
Durrant 2009).
4 It has been shown by Rubin (1987) that the relative efficiency of an estimate based on m imputations, relative to one based on an infinite number of them, approximates the inverse of (1+λ/m), with λ the rate of missing information. Based on this, it is also reported that there is no practical benefit to using more than 5 to 10 imputations (Schafer 1999; Schafer and Graham 2002).
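The averaging and standard-error steps above (often called Rubin's rules) can be sketched in Python with toy numbers (the estimates and standard errors below are illustrative, not from our study):

```python
import math

# Parameter estimates and standard errors from M = 5 imputed datasets (toy values)
estimates  = [0.52, 0.48, 0.55, 0.50, 0.45]
std_errors = [0.10, 0.11, 0.09, 0.10, 0.12]
M = len(estimates)

# Single-point estimate: the average across the M samples
pooled = sum(estimates) / M

# (a) within-imputation variance: average of the squared standard errors
within = sum(se ** 2 for se in std_errors) / M

# (b) between-imputation variance of the M parameter estimates
between = sum((q - pooled) ** 2 for q in estimates) / (M - 1)

# (c) total variance combines the two with the adjustment term (1 + 1/M)
total = within + (1 + 1 / M) * between
pooled_se = math.sqrt(total)

print(round(pooled, 3))      # 0.5
print(round(pooled_se, 4))   # larger than any naive average of the SEs alone
```

The pooled standard error exceeds the simple average of the per-dataset standard errors precisely because the between-imputation term carries the extra uncertainty due to the missing data.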
Fundamental to MI is the technique used for the imputation of values, and recently the
range of options has grown considerably. The three widely used MI techniques are the
propensity score method, the predictive model method and the Markov chain Monte Carlo
(MCMC) method (see Schafer 2000; Zhang 2003). The propensity score and predictive model
methods are appropriate when there is a monotone pattern of missing data, whereas the
MCMC method can be used with non-monotone patterns.
Bayesian MI methods are becoming increasingly popular: these are performed using a
Bayesian predictive distribution to generate the imputations (Nielsen 2003) and specifying
prior values for all the parameters (Ibrahim et al. 2005). According to Schafer & Graham
(2002), Bayesian methods bring together MI and likelihood methods: “[…]
the attractive properties of likelihood carry over to the Bayesian method of MI, because in the
Bayesian paradigm we combine a likelihood function with a prior distribution for the
parameters. As the sample size grows, the likelihood dominates the prior, and Bayesian and
likelihood answers become similar” (p. 154).
As a final note on analytical options, multilevel analysis of longitudinal data also has a
reported advantage in handling missing data (Little and Rubin 1989). In particular the
problem may be avoided in the case of multilevel growth curve models (e.g. Plewis 2007;
2009). These models use all the available data (rather than being restricted to cases with
complete data at each occasion) giving them an advantage for studies that span a long period
of time when sample attrition is almost inevitable. If the data are MAR, this implies “that
differences between the means for those measured and not measured at occasion t disappear,
having conditioned on the known values at occasions t-k,k>=1 – then estimates from the
growth curve models are efficient” (Plewis 2009, p. 300-301).
Tools for resolving the issue
Various tools are available for performing (multiple) imputation. As a guide for
the interested reader we list, non-exhaustively, some procedures available in R:
Amelia II, arrayImpute, cat (for categorical-variable datasets with missing
values), EMV (for the Estimation of Missing Values for a Data Matrix), impute, mi, mice
and Hmisc. Tools are also available within other statistical packages, such as ICE in STATA,
SAS PROC MI, the Missing Data Library and NORM for S-Plus, and SOLAS. Multiple
Imputation can also be performed with MLwiN and, very recently, with SPSS (version 20,
2012). Allison (2000) performed some imputations with various methods and identified
problems with earlier versions of the SOLAS commercial package. Horton & Kleinman
(2007) applied imputation with the Amelia II, Hmisc, ICE/Stata, IVEware, LogXact,
MICE, SAS and S-Plus packages and found similar parameter estimates for all analyses, as
well as a considerable reduction in the standard error estimates when compared to the
complete case estimators. A recent practical tutorial on imputing missing data with Amelia
II is also available (Hutcheson and Pampaka 2012).
In the remainder of the paper we will apply the previously presented theory in an
educational research context. To do this we draw on data (and missing data) from a recent
project which we present next.
3. Our research design - methodology
The context for the analysis and results presented next is situated within our recent work on
students’ transitions into post-compulsory settings. Our project has been concerned with the
widely known ‘mathematics problem’ in England, manifested by the relatively small number
of students being well prepared to continue their studies from schools and colleges into
courses in Higher Education (HE) Institutions in mathematically demanding subjects (see for
example ACME 2009). We have effectively joined the efforts of others (e.g. Roberts 2002;
Sainsbury 2007; Smith 2004) to inform policy and practice, and consequently support a
greater flow of students into HE to meet the demands of the emerging national STEM
(Science, Technology, Engineering, Mathematics) agenda in the UK.
The aim of the project we report here was to understand how cultures of learning and
teaching can support learners in ways that help them widen and extend participation in
mathematically demanding courses in Further and Higher Education. We focused on those
students for whom pre-university mathematics qualifications (like ‘AS’ and ‘A2’ in the UK
context) provide a barrier to progressing into mathematically demanding courses. The project
also contrasted the traditional mathematics programme with a new ‘Use of mathematics’
programme in the UK at ‘AS’ level, i.e. for students usually aged 16-19 who have completed
compulsory schooling and have chosen to study some mathematics further. The ‘Traditional’
(ASTrad) programme is designed to help students prepare for mathematically-demanding
courses in university (e.g. STEM). The new ‘Use of Mathematics’ (UoM) programme is
designed to widen participation to include those who may need to use mathematics in the
future generally, but who may or may not progress into very mathematical courses and who
may not have very strong previous mathematical backgrounds (Williams et al. 2008).
Methodologically, the project was a mixed method study. One part involved a survey
of students where responses to a questionnaire were recorded at three data points. This was
complemented with multiple case studies of students as they progress through a year of
further education. The longitudinal design is illustrated in Figure 2. In particular, it
indicates the timing of the three data collection points and the students’ educational status
at each point.
Data Point                  DP1                      DP2                   DP3
Time                        Sept-Nov 2006            April-June 2007       Sept-Dec 2007
Cohort’s Educational Level  Start of AS              End of AS             Start of A2
Variables Collected         Background;              Dispositions;         Dispositions;
                            Previous (GCSE) grades;  Self-Efficacy;        Self-Efficacy;
                            Dispositions;            Future Aspirations    Future Aspirations;
                            Self-Efficacy;                                 AS Grades
                            Future Aspirations
Sample                      1792                     1047                  598
Figure 2: The longitudinal design of our project and achieved student survey returns
The sample size achieved at each data point, in respect to questionnaires returned from
schools, is also shown in the figure. As can be seen, there was significant attrition from the
1792 students at DP1 to 1047 at DP2 and only 598 students at DP3 (leading to 42% and 67%
nonresponse rates, respectively). Descriptive analysis indicates that missing data at DP2 is
not random. We particularly found that the proportion of ASTrad Maths students completing
DP2 was significantly lower than that of UoM students (chi-square=21.53, df=1, p<0.001).
Missingness at DP3 also differentiates between the two courses, this time completion being in
favour of the ASTrad students (chi-square=7.5, df=1, p=0.006). Further modelling of missing
data is presented in a following section.
The different types of variables collected at each time point (DP) are also summarised
in Figure 2. More details about these and the corresponding constructed measures can be seen
elsewhere (Pampaka et al. 2011a; Pampaka et al. 2013; Pampaka et al. 2011b).
Our modelling strategy then involved exploring relationships between students’
background variables, process variables (i.e. course and pedagogy), dispositions and their
attainment as they progress through their post-16 (mathematics) education.
Of particular interest for this paper is the educational outcome ‘AS grade’, highlighted
in the final data point of Figure 2. As noted, this important outcome variable for our study
was only requested from students at the final data point (because this is when it would have
been available), which suffers most from missing data.
In order to resolve some of the analytical problems posed by the low response rate at
DP3, and the consequent loss of important information on the response variables of
interest (e.g. final grades, dropout from the course), we were able to access additional
information about AS-level grades at a later date by collecting data directly from the schools
and the students via additional telephone surveys. With this approach we were able to
fill in the grades of 879 additional pupils.
In the remainder of the paper, we present some analysis from this project in relation to
missing data, to show its potential implications and demonstrate the potential benefits of
multiple imputation.
4. Analysis / Results
4.1. Describing / Modelling nonresponse
As suggested in most relevant guidelines, a necessary first step in any analysis involving
missing data is to describe the problem. To this end, we included within our dataset
binary variables denoting whether a student had completed the survey at each data point.
What we present in this section are the results of logistic regression modelling for DP2 and
DP3 (i.e. whether student completed DP2 and DP3 respectively). Our modelling procedure
follows the same guidelines reported elsewhere (Fox and Weisburg 2011; Hutcheson et al.
2011; Pampaka et al. 2012a; Pampaka et al. 2012b) and in order to avoid further missing data
problems we use as explanatory variables background and other theoretically relevant
variables collected at the first data point of the study (DP1). Our presentation is limited to the
selected theoretical model (with the significant explanatory variables) for each data point and
the effects of the explanatory variables are illustrated via effect plots (Fox 1987; Fox and
Hong 2009).
Figure 3 summarises the models that best explain/describe response (and missingness)
in our study.
Model 1: DP2 ~ Course + UNIoption
Model 2: DP3 ~ GCSE grade + Gender + UNIoption
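As a purely schematic sketch (not the authors' code; the records below are invented), the logic of Model 1 can be illustrated in plain Python: build a binary response indicator and regress it on a background covariate. With a single binary predictor the logistic-regression slope has a closed form, namely the log odds ratio of the resulting 2x2 table.

```python
import math

# Hypothetical records: (responded_at_DP2, intends_university) -- illustrative only
students = [
    (1, 1), (1, 1), (1, 1), (0, 1), (1, 1),
    (0, 0), (0, 0), (1, 0), (0, 0), (1, 0),
]

# Build the 2x2 table of response status by the binary covariate
counts = {(r, u): 0 for r in (0, 1) for u in (0, 1)}
for responded, uni in students:
    counts[(responded, uni)] += 1

# For a single binary predictor, the logistic-regression slope equals the
# log odds ratio of the 2x2 table: log[(a*d)/(b*c)]
a = counts[(1, 1)]  # responded, intends university
b = counts[(0, 1)]  # missing, intends university
c = counts[(1, 0)]  # responded, no university intention
d = counts[(0, 0)]  # missing, no university intention
slope = math.log((a * d) / (b * c))

# Intercept: log odds of responding when the covariate is 0
intercept = math.log(c / d)

print(f"logit(P(respond)) = {intercept:.3f} + {slope:.3f} * uni")
```

In practice the models above would be fitted with standard logistic-regression software over several (possibly categorical) explanatory variables; the closed form here is only available in this degenerate one-predictor case.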
[Effect plots for Models 1 and 2; vertical axes show the probability of responding at the corresponding data point]
Figure 3: Logistic regression models of responding to Study 1 and their effect plots. The size of the effect of each explanatory variable on the response is given by the range of Y (response variable) values shown on the graph. It should also be noted that the Y range is not the same for all graphs, and this needs to be taken into account when interpreting the results.
UNIoption is a categorical variable describing students’ choices regarding university study (or
not) as reported at DP1. As shown in the effect plots of Figure 3 (top), students on the
ASTrad course and students who do not intend to go to university have the highest chances of
being non-respondents at DP2. In contrast, students aiming for a STEM course in HE have the
highest response probabilities. For DP3, course is no longer a significant effect; however,
previous mathematics attainment (GCSE grade) and gender are now significant. The trend
regarding University option is similar to that described for Model 1. In addition, male
students tend to be more prone to attrition, and the probability of response increases with
previous mathematics attainment.
4.2. A Multiple Imputation Illustration
One part of this project investigated the number of students who dropped-out (didn't
finish the course) and attempted to identify the factors involved in this. These analyses are
reported in Hutcheson, Williams and Pampaka (2011) and indicate that the type of course a
student was enrolled on, their previous GCSE results in mathematics, their disposition to
study mathematics at a higher level and their self-efficacy rating were all significantly related
to drop-out.
Descriptive analyses of the initial data (i.e. without the data acquired later from
schools) and of the cases that were missing do not show significant differences in the courses
enrolled on, the students’ disposition and self-efficacy, or the drop-out rate, but do show a
significant difference in GCSE grades (chi-square=43.01, df=5, p=3.7e-08). The observed
cases had significantly higher GCSE grades than those that were missing. From this
evidence, we may conclude that a model of the original 494 cases may over-estimate the
effect of high-achievers; we could incorporate this conclusion into our analyses by
employing weights, or use an imputation technique to replace the missing cases.
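The kind of comparison reported above rests on a Pearson chi-square computed from a contingency table of GCSE grade by response status. The table below is hypothetical (it does not reproduce the study's data) and serves only to illustrate the calculation:

```python
# Hypothetical 2 x 6 table: rows = (observed, initially missing),
# columns = six GCSE-grade bands -- invented counts for illustration
table = [
    [30, 45, 60, 80, 95, 70],   # observed cases
    [40, 50, 55, 45, 35, 20],   # initially missing cases
]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
grand = sum(row_totals)

# Pearson chi-square: sum over cells of (observed - expected)^2 / expected,
# where expected = row total * column total / grand total
chi2 = 0.0
for i, row in enumerate(table):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand
        chi2 += (obs - expected) ** 2 / expected

df = (len(table) - 1) * (len(table[0]) - 1)  # (rows-1)*(cols-1) = 5
print(f"chi-square = {chi2:.2f} on {df} df")
```

With six grade bands and two response groups the test has (2-1)(6-1) = 5 degrees of freedom, matching the df=5 reported above.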
The following compares complete-case analyses of drop-out (using binary logit
models) using the original sample (n = 494) with the same model computed on the more
complete sample (n=1373). Comparing these models allows an evaluation of the
consequences of analysing an incomplete dataset and also provides an opportunity for
evaluating the effectiveness of data imputation. This analysis compares models which have
the same selection of variables. Although this enables the models to be compared, it is not an
ideal situation as the chosen model was derived using the more complete data set. The model
is, therefore, optimised for these data, but can still provide some informative comparisons. A
binary logit model of drop-out was derived using a combination of theoretical and statistical
information (see Hutcheson et al., 2011, for a full description of the model selection
procedure). The model reported for the full data was:
Drop-out ~ Course + Disposition + GCSE-grade + Maths Self Efficacy
where Drop-out and Course are binary categorical variables, GCSE-grade is an ordered categorical variable and Disposition and Maths Self Efficacy are continuous.
Table 3 shows the fit statistics for the model using all 1373 cases and the same model using
only the 494 cases which were originally available. Both models identify the most significant
relationships (Course and GCSE-grade); however, Disposition and mathematics self-efficacy
(MSE) are only identified as being significant for the full dataset.
Table 3: Fit statistics for models with and without missing data. The LR-chisq statistics and p-values are shown for binary logit models “Drop-out ~ Course + Disposition + GCSE-grade + Maths Self Efficacy” computed for the initial dataset (N=494) and the full dataset (N=1373).

Explanatory Variables    Model with missing data [n=494]    Model using full data [n=1373]
GCSE-grade               85.897; p<2.2e-16 ***              176.07; p<2.2e-16 ***
Course                   21.139; p=4.271e-06 ***            72.54; p<2.2e-16 ***
Disposition              3.561; p=0.05914                   21.17; p=4.2e-06 ***
Maths Self Efficacy      0.458; p=0.49844                   10.60; p=1.1e-03 **
***: p<0.001, **: p<0.01, *: p<0.05.
The importance of collecting additional data has been demonstrated here with regard
to the identification of the variables Disposition and Maths Self Efficacy, as these would not
have been identified as being 'important' using the reduced dataset. The additional data
collection procedure has influenced the analysis and has had important implications with
respect to the conclusions drawn from the research. The data that were initially missing were
not missing at random, and it was therefore important that these data were collected. If it had
not been possible or feasible to collect the additional data, the missing data points could have
been imputed. The following analysis imputes the 879 missing cases (data were imputed for
the drop-out variable) and compares the resulting regression model with those in Table 3.
There are a number of data imputation methods available and we selected the Amelia
package (Honaker et al. 2012) which is implemented in the R statistical environment (R Core
Team, 2012). Amelia imputes missing data using multiple imputation and includes a number
of sophisticated options and diagnostics using a simple graphical interface. It is available for
all computing platforms and is freely available from the R software repository (see Hutcheson
and Pampaka, 2012, for a description of this software) and offers a simple, powerful and
accessible means for researchers to impute data. Amelia imputed the missing data
(information about whether the student had dropped out) and formed a number of 'complete'
datasets. The model of drop-out was then applied to each of these datasets. The fit statistics
for these are shown in Table 4.
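The logic of this workflow can be mimicked, in outline, with a pure-Python toy (this is not Amelia's algorithm, and the records and probabilities below are invented): each missing value of a binary drop-out indicator is drawn from a Bernoulli distribution whose probability is estimated from observed cases in the same GCSE-grade band, and the draw is repeated m = 5 times to give five 'complete' datasets.

```python
import random

random.seed(42)  # reproducible illustration

# Hypothetical records: (gcse_band, dropout), with None marking a missing value
data = [
    ("low", 1), ("low", 0), ("low", 1), ("low", None),
    ("high", 0), ("high", 0), ("high", 1), ("high", None), ("high", None),
]

# Estimate P(dropout) among observed cases within each GCSE band
p_hat = {}
for band in ("low", "high"):
    obs = [d for b, d in data if b == band and d is not None]
    p_hat[band] = sum(obs) / len(obs)

# Draw m completed datasets; each missing value is sampled independently,
# so the datasets differ and reflect the uncertainty of the imputation
m = 5
completed = []
for _ in range(m):
    filled = [(b, d if d is not None else int(random.random() < p_hat[b]))
              for b, d in data]
    completed.append(filled)

# The analysis model would then be fitted to each completed dataset and the
# m sets of estimates combined into a single pooled result
print(len(completed), "completed datasets; none missing:",
      all(d is not None for ds in completed for _, d in ds))
```

Amelia itself uses a far more sophisticated procedure (a bootstrapped EM algorithm over a multivariate model), but the essential point carries over: multiple stochastic completions, not a single deterministic fill-in.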
Table 4: Fit statistics for models with imputed data. The LR-chisq statistics and p-values are shown for binary logit models “Drop-out ~ Course + Disposition + GCSE-grade + Maths Self Efficacy” computed for five datasets containing imputed data (N=1373).

Dataset     Explanatory Variable    LR Chisq   Df   Pr(>Chisq)
Dataset 1   Course                  31.895     1    1.628e-08 ***
            Disposition             11.695     1    0.0006265 ***
            GCSE-grade              126.248    5    < 2.2e-16 ***
            Maths Self Efficacy     0.015      1    0.9018691
Dataset 2   Course                  46.860     1    7.623e-12 ***
            Disposition             7.872      1    0.00502 **
            GCSE-grade              112.747    5    < 2.2e-16 ***
            Maths Self Efficacy     2.269      1    0.13202
Dataset 3   Course                  24.842     1    6.222e-07 ***
            Disposition             18.711     1    1.521e-05 ***
            GCSE-grade              97.550     5    < 2.2e-16 ***
            Maths Self Efficacy     0.008      1    0.9278
Dataset 4   Course                  43.230     1    4.868e-11 ***
            Disposition             0.612      1    0.4340
            GCSE-grade              121.636    5    < 2.2e-16 ***
            Maths Self Efficacy     2.646      1    0.1038
Dataset 5   Course                  40.984     1    1.535e-10 ***
            Disposition             17.809     1    2.442e-05 ***
            GCSE-grade              143.106    5    < 2.2e-16 ***
            Maths Self Efficacy     0.002      1    0.9663
***: p<0.001, **: p<0.01, *: p<0.05.
Each of the models using the imputed data identifies Course and GCSE-grade as being
important. Four of the five models indicate that Disposition is also significant, and no model
indicates Maths Self Efficacy as being important. A 'final' model can be computed for the
imputed data by combining all five datasets to provide an 'average'. However, for simplicity,
in this analysis we select the model from dataset 1 for illustration, as it appears to be roughly
in the middle of the estimated significances for the variables and allows a simple comparison
to be made with the other models. Table 5 shows a comparison of all three models (one
computed with missing data, one with imputed data and one with the full data).
Table 5: Fit statistics for models with missing, imputed and full data. LR-chisq statistics and p-values are shown for binary logit models “Drop-out ~ Course + Disposition + GCSE-grade + Maths Self Efficacy” computed for the original data with missing values (N=494), for the original data with missing values imputed (N=1373) and for the full data set containing all 1373 cases.

Explanatory Variables   Model with missing data (N=494)   Model using imputed data (N=1373)   Model using all available data (N=1373)
GCSE-grade              85.90; p<2.2e-16 ***              126.25; p<2.2e-16 ***               176.07; p<2.2e-16 ***
Course                  21.14; p=4.27e-06 ***             31.90; p=1.63e-08 ***               72.54; p<2.2e-16 ***
Disposition             3.56; p=0.059                     11.70; p=0.001 **                   21.17; p=4.2e-06 ***
Maths Self Efficacy     0.46; p=0.498                     0.02; p=0.902                       10.60; p=1.1e-03 **
***: p<0.001, **: p<0.01, *: p<0.05.
The data imputation appears to have been at least partially successful, as it has identified
Disposition as being significant. It has not, however, provided a completely accurate picture
of the full data set, as the imputed data have not indicated MSE as being significant and have
also underestimated the contributions of the other variables in the model. It is, however,
worth noting that the model with the imputed data does appear to be 'closer' to the model
computed for the full dataset than the original model with the missing data.
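The 'averaging' of results across imputed datasets mentioned above is conventionally done with Rubin's rules: the pooled estimate is the mean of the m per-dataset estimates, and its total variance combines the average within-imputation variance W with the between-imputation variance B as T = W + (1 + 1/m)B. The numbers below are invented purely to illustrate the arithmetic:

```python
# Hypothetical per-dataset coefficient estimates and their squared standard
# errors from m = 5 imputed datasets -- illustrative values only
estimates = [0.52, 0.61, 0.48, 0.57, 0.55]
variances = [0.010, 0.012, 0.011, 0.009, 0.010]

m = len(estimates)
pooled = sum(estimates) / m                              # pooled point estimate
W = sum(variances) / m                                   # within-imputation variance
B = sum((e - pooled) ** 2 for e in estimates) / (m - 1)  # between-imputation variance
T = W + (1 + 1 / m) * B                                  # total variance (Rubin's rules)

print(f"pooled estimate = {pooled:.3f}, total SE = {T ** 0.5:.3f}")
```

Because B enters the total variance, the pooled standard error is larger than any naive single-dataset standard error, which is precisely how multiple imputation propagates the uncertainty of the imputation into the final inference.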
5. Discussion – Implications – Conclusions
Our aim with this paper was to revisit and review the topic of dealing with nonresponse
through the lens of educational research, and to show some implications of considering or
ignoring it, drawing on our recent work on widening participation in (mathematics)
education. In addition, we wanted to assess the accuracy of multiple imputation using an
example where models from imputed data could be compared with models derived from
actual data. We summarise and discuss some main concluding points here.
First, we have shown, as the literature in social statistics suggests, that ignoring
missing data can have serious consequences in educational research contexts. Missing data is
effectively ignored when using 'complete-case' analyses and the automated use of stepwise
selection procedures often exacerbates this as cases can easily be excluded on the basis of data
missing from variables not even part of a final model. Ignoring missing data causes problems
when the data are not missing at random, as is likely for most missing data in educational
research.
Our approach of collecting more data (i.e. from schools or via telephone interviews
with students) as a remedy for the problems caused by missing data is also aligned with
recent literature on the topic, linked with mixed-mode surveys. This type of survey was
regarded as the new trend in survey research in the late 20th century, defined as surveys
which combine the use of telephone, mail, and/or face-to-face interview procedures to
collect data (Dillman and Tarnai 1988). Developments in technology and the widespread
use of the internet, and consequently of online surveys, have made this type of survey the
norm for our times (Biemer and Lyberg 2003; Glover and Bush 2005). One of the reasons for
their increased use is their capacity to deal with response/nonresponse problems, in efforts to
increase response or investigate bias (Dillman 2009; Sikkel et al. 2009).
The results from our study demonstrated that the initial data sample which included a
substantial amount of missing data was likely to be biased, as a regression model using this
dataset was quite different to one based on a more complete dataset that included information
subsequently collected. Imputing the missing data proved to be a useful exercise, as it
'improved' the model (at least with respect to its comparability with the more complete
dataset) but did not entirely recreate the structure of the full dataset. The analysis of the imputed data resulted
in a model that 'correctly' identified the significance of Disposition in the full dataset, but did
not identify Maths Self Efficacy as being significant and also underestimated the significance
of the other variables in the model. A note should be made here about our modelling in
relation to the longitudinal nature of the study: the model we present is essentially
longitudinal since the outcome (dropout) variable is collected at the final data point of the
study, and this is modelled with explanatory variables collected at earlier stages (e.g.
dispositions and self efficacy). Since our emphasis was on imputation, however, we have
chosen to simplify the way the model appears.
The imputation procedure may be regarded as being partially successful and its use
justified for these data. It should be noted that the imputation was based on a limited
selection of variables that had been identified from a previous analysis. It is likely that the
imputation could have been improved if more variables had been used in the imputation
process. In addition, our data imputation was for a binary category and is likely to be
relatively weak. This, however, strengthens the case for imputation as imputing richer data
(multi-categories, ordered and continuous) may result in more accurate imputations. It is
encouraging that the multiple imputation worked even with such a simple imputation. Ideally
a 'model' of why data is missing (such as those presented in Figure 3) would be employed to
identify the variables that may be used to model the missing data points. Collecting such data
should be an important consideration for research design whenever missing data is likely to be
an issue. An important conclusion from this study was that even with quite limited available
data to model missing data points, multiple imputation still proved to be useful.
The most important conclusion from this paper is that missing data can have adverse
effects on analyses and imputation methods should be considered when this is an issue. This
study shows the value of multiple imputation even for imputing a binary outcome variable
using limited information. But, as in this case study, the imputation, while producing an
improvement (e.g. in modelling with Disposition), completely failed to detect the significance
of an important variable. It is encouraging that tools now exist to enable multiple imputation
to be applied relatively simply using easy to access software. Thus, multiple imputation
techniques are now within the grasp of most educational researchers and should be used
routinely in the analysis of educational data. However, it still behoves researchers to be
cautious when modelling data sets with large amounts of missing data, even if they use
multiple imputation.
As a final remark, our analysis demonstrates the advantages of the technique and
suggests that it should be used routinely in longitudinal studies where data attrition can be
substantial.
Acknowledgements
The authors would like to acknowledge the support of the ESRC project with grant number RES-062-23-1213, the support of the ESRC-TLRP programme of research into widening participation, funded by ESRC: RES-139-25-0241, and the continuing support of ESRC grant RES-061-025-0538.
References
ACME. 2009. The mathematics education landscape in 2009. A report of the Advisory Committee on Mathematics Education (ACME) for the DCSF/DIUS STEM High Level Strategy Group Meeting (12 June 2009). Retrieved from http://www.acme-uk.org/downloaddoc.asp?id=139 [March, 2010].
Allison, P.D. 2000. Multiple imputation for missing data: A cautionary tale. Sociological Methods and Research 28, no. 3: 301-09.
Biemer, P.P. and L.E. Lyberg. 2003. Introduction to survey quality. Wiley series in survey methodology. Hoboken, NJ: John Wiley & Sons, Inc.
Dillman, D. 2009. Some consequences of survey mode changes in longitudinal surveys. In Methodology of longitudinal surveys, ed. Lynn, P., 127-40. Chichester: Wiley.
Dillman, D. and J. Tarnai. 1988. Administrative issues in mixed mode surveys. In Telephone survey methodology, eds Groves, R.M., Biemer, P., Lyberg, L., Massey, J., Nicholls, W.L. and Waksberg, J. New York: Wiley.
Durrant, G.B. 2009. Imputation methods for handling item non-response in practice: Methodological issues and recent debates. International Journal of Social Research Methodology 12, no. 4: 293-304.
Fox, J. 1987. Effect displays for generalized linear models. Sociological Methodology 17: 347-61.
Fox, J. and J. Hong. 2009. The effects package: Effect displays for linear, generalized linear, multinomial-logit, and proportional-odds logit models. http://www.R-project.org.
Fox, J. and S. Weisburg. 2011. An R companion to applied regression (second edition). London: Sage.
Glover, D. and T. Bush. 2005. The online or e-survey: A research approach for the ICT age. International Journal of Research & Method in Education 28, no. 2: 135-46.
Honaker, J., G. King and M. Blackwell. 2012. Amelia II: A program for missing data (version 1.6.1). Available at: www.gking.harvard.edu/amelia (retrieved December 2012).
Horton, N.J. and K.P. Kleinman. 2007. Much ado about nothing: A comparison of missing data methods and software to fit incomplete data regression models. The American Statistician 61, no. 1: 79-90.
Hox, J. 2000. Multilevel analysis of grouped and longitudinal data. In Modeling longitudinal and multilevel data, eds Little, T.D. and Schnabel, K.U. London: Lawrence Erlbaum Associates.
Hutcheson, G. and M. Pampaka. 2012. Missing data: Data replacement and imputation. Journal of Modelling in Management (Tutorial Section) 7, no. 2: 221-33.
Hutcheson, G., M. Pampaka and J.S. Williams. 2011. Enrolment, achievement and retention on ‘traditional’ and ‘use of mathematics’ AS courses. Research in Mathematics Education 13, no. 2: special issue.
Ibrahim, J.G., M.-H. Chen, S.R. Lipsitz and A.H. Herring. 2005. Missing-data methods for generalized linear models. Journal of the American Statistical Association 100, no. 469: 332-46.
Kim, J.K. and W. Fuller. 2004. Fractional hot deck imputation. Biometrika 91, no. 3: 559-78.
King, G., J. Honaker, A. Joseph and K. Scheve. 2001. Analyzing incomplete political science data: An alternative algorithm for multiple imputation. American Political Science Review 95: 49-69.
Little, R.J.A. 1988. Missing-data adjustments in large surveys. Journal of Business & Economic Statistics 6, no. 3: 287-96.
Little, R.J.A. and D.B. Rubin. 1987. Statistical analysis with missing data. New York: John Wiley & Sons.
Little, R.J.A. and D.B. Rubin. 1989. The analysis of social science data with missing values. Sociological Methods & Research 18, no. 2-3: 292-326.
Nielsen, S.R.F. 2003. Proper and improper multiple imputation. International Statistical Review / Revue Internationale de Statistique 71, no. 3: 593-607.
Pampaka, M., I. Kleanthous, G.D. Hutcheson and G. Wake. 2011a. Measuring mathematics self-efficacy as a learning outcome. Research in Mathematics Education 13, no. 2: 169-90.
Pampaka, M., J. Williams and G. Hutcheson. 2012a. Measuring students’ transition into university and its association with learning outcomes. British Educational Research Journal 38, no. 6: 1041-71.
Pampaka, M., J.S. Williams, G. Hutcheson, L. Black, P. Davis, P. Hernandez-Martinez and G. Wake. 2013. Measuring alternative learning outcomes: Dispositions to study in higher education. Journal of Applied Measurement 14, no. 2.
Pampaka, M., J.S. Williams, G. Hutcheson, G. Wake, L. Black, P. Davis and P. Hernandez-Martinez. 2012b. The association between mathematics pedagogy and learners’ dispositions for university study. British Educational Research Journal 38, no. 3: 473-96.
Plewis, I. 2007. Non-response in a birth cohort study: The case of the Millennium Cohort Study. International Journal for Social Research Methodology 10, no. 5: 325-34.
Plewis, I. 2009. Statistical modelling for structured longitudinal designs. In Methodology of longitudinal surveys, ed. Lynn, P. Chichester: Wiley.
Reiter, J.P. and T.E. Raghunathan. 2007. The multiple adaptations of multiple imputation. Journal of the American Statistical Association 102, no. 480: 1462-71.
Roberts, G. 2002. SET for success: The supply of people with science, technology, engineering and mathematics skills. London: HM Stationery Office.
Rubin, D.B. 1987. Multiple imputation for nonresponse in surveys. New York: John Wiley.
Sainsbury, L. 2007. The race to the top: A review of government’s science and innovation policies. HM Stationery Office.
Sartori, N., A. Salvan and K. Thomaseth. 2005. Multiple imputation of missing values in a cancer mortality analysis with estimated exposure dose. Computational Statistics & Data Analysis 49, no. 3: 937-53.
Schafer, J.L. 1999. Multiple imputation: A primer. Statistical Methods in Medical Research 8: 3-15.
Schafer, J.L. 2000. Analysis of incomplete multivariate data. London: Chapman and Hall.
Schafer, J.L. and J.W. Graham. 2002. Missing data: Our view of the state of the art. Psychological Methods 7, no. 2: 147-77.
Schulte Nordholt, E. 1998. Imputation: Methods, simulation experiments and practical examples. International Statistical Review 66, no. 2: 157-80.
Sikkel, D., J. Hox and E. De Leeuw. 2009. Using auxiliary data for adjustment in longitudinal research. In Methodology of longitudinal surveys, ed. Lynn, P., 141-55. Chichester: Wiley.
Smith, A. 2004. Making mathematics count: The report of Professor Adrian Smith’s inquiry into post-14 mathematics education. London: DfES.
Squire, P. 1988. Why the 1936 Literary Digest poll failed. Public Opinion Quarterly 52: 125-33.
Wainer, H. 2010. 14 conversations about three things. Journal of Educational and Behavioral Statistics 35, no. 1: 5-25.
Wayman, J.C. 2003. Multiple imputation for missing data: What is it and how can I use it? Paper presented at the Annual Meeting of the American Educational Research Association, Chicago.
Williams, J.S., L. Black, P. Davis, P. Hernandez-Martinez, G. Hutcheson, S. Nicholson, M. Pampaka and G. Wake. 2008. TLRP research briefing no 38: Keeping open the door to mathematically demanding programmes in further and higher education. School of Education, University of Manchester.
Zhang, P. 2003. Multiple imputation: Theory and method. International Statistical Review 71, no. 3: 581-92.