Hierarchical Linear Modeling:
A Review of Methodological Issues and Applications
John Ferron
Melinda R. Hess
Kris Y. Hogarty
Robert F. Dedrick
Jeffery D. Kromrey
Thomas R. Lang
John Niles
University of South Florida
Paper presented at the 2004 annual meeting of the American Educational Research Association,
San Diego, April 12-16.
INTRODUCTION
Researchers in education and many other fields (e.g., psychology, sociology) are
frequently confronted with data that are hierarchical or multilevel in nature. For example, in the
context of school organizations, students are nested in classes, classes are nested in schools,
schools are nested in school districts, etc. In longitudinal research, repeated observations are
nested within individuals (i.e., units) and these individuals may be nested within groups. The
pervasiveness of multilevel data has led to a proliferation of statistical methods, referred to under
a number of names including hierarchical linear modeling (HLM; Bryk & Raudenbush, 1992),
multi-level modeling, mixed linear modeling, or growth curve modeling, and a parallel increase
in the number of applications of these methods to educational problems.
The complexity of these multilevel methods provides potential for misuse and confusion,
which may act as barriers for applied researchers attempting to use these methods. Careful
consideration of the methodological nuances of multilevel analyses is critical, as misuses may
result in statistical artifacts that may potentially influence statistical inference and cloud
interpretation. A close examination of some of the more common methods employed across
various disciplines, together with an exploration of recent research trends, can both inform
the practice of research and broaden our understanding of the various methodologies
and critical issues facing practitioners. Given this context, it is appropriate to pause and critically
analyze the methodological issues inherent in multilevel modeling techniques.
Similar methodological reviews have been conducted with more commonly used
techniques such as analysis of variance, multivariate analysis of variance, analysis of covariance,
path analysis, and structural equation modeling. Keselman et al. (1998) note that one consistent
finding of methodological research reviews is that a substantial gap often exists between the
methods that are recommended in the statistical research literature and those techniques that are
actually practiced by applied researchers (Goodwin & Goodwin, 1985b; Ridgeway, Dunston &
Qian, 1993). Methodological reviews can serve to identify issues, controversies, and current
trends as well as provide direction to applied researchers. In addition, these reviews may help
bridge the gap between statistical theory and application.
Purpose & Intended Audience
The purpose of this review is to provide an overview of the methodological landscape
and the critical issues surrounding multilevel modeling and to report on the current application
and reporting of multilevel analyses in education and related fields. Because a single review
cannot include every methodological consideration or technical nuance, it is important to clarify
that our intent is to focus primarily on what might be termed ‘traditional’ hierarchical linear
modeling. That is, we consider linear models of continuous outcomes where the random effects
are assumed normally distributed. This allows consideration of applications where individuals
are nested in contexts (e.g., students nested in schools), and applications where observations are
nested within individuals (e.g., growth curve models). Models in which the outcome is
represented by binary, count, or ordinal data are not considered (see Raudenbush & Bryk, 2002,
for discussions of these types of applications). Nor did we venture beyond these modeling
techniques to explore related methods such as multilevel structural equation modeling (SEM),
hierarchical linear measurement modeling or applications of item response theory (IRT).
This review is intended to be useful to several distinct audiences in the research arena.
For example, practitioners whose inquiries in applied areas include multilevel models should
benefit from this treatment of the published literature in the field. A careful consideration of one’s
research methods, designs and habits of reporting in contrast to those evidenced in the field in
general will tend to suggest areas for refinement of techniques and serve as a source for
professional reflection. Similarly, manuscript reviewers and journal editors who serve the critical
roles of both gatekeepers and navigators for authors in the reporting of research results should
gain from this examination of multilevel models. A critical appraisal of the strengths and
limitations in the reporting of results from such models is intended to sharpen the eyes of these
scholars and lead to improvements in the corpus of research to be published. In addition,
research methodologists and others who serve as technical consultants in large-scale research
projects may glean additional distinctions and insights from the technical issues raised in this
paper. A tremendous amount of methodological work is needed to advance our understanding of
the limits and possibilities of hierarchical models, and this review of the published literature
highlights several areas needing focused attention. Further, university professors and teachers of
research methods and applied statistics may find that the issues raised here and the resources
cited will hone their craft and enhance the content of their curricula. Finally, students of
educational research (the scholars of tomorrow) will find a readable introduction to hierarchical
models, together with citations of sources that will provide logical entries into the growing body
of technical work in this area. Our treatment of hierarchical models as they are ‘practiced’ by
those publishing in education and related fields is intended to both guide and inspire researchers
at multiple levels of experience and interest.
Organization of the Review
To provide an overview of the methodological nuances and critical issues surrounding
multilevel modeling, literature in an array of disciplines (education, medicine, public health,
psychology, business, chemistry, physics, biology, statistics and math) was reviewed. Electronic
databases (e.g., ERIC, PsycINFO) were searched using a variety of keywords (e.g.,
hierarchical linear modeling, mixed linear models, nested designs, and multilevel modeling) and
additional articles were gleaned from the reference lists of select articles. Lastly, key issues were
identified from the reference list of pivotal articles, by exploring technical software manuals, and
by monitoring online conversations on certain Listservs.
Based on these reviews, four broad issues were identified. These issues are explored
during the initial phase of this paper and served as the framework for the coding protocol used to
analyze the hierarchical linear modeling applications in the literature. These issues include: (a)
model development and specification, which include issues of centering, selection of predictors,
specification of covariance structure, fit indices, generalizability and checks on specification; (b)
data considerations including distributional assumptions, outliers, measurement error for
predictors and outcomes, power, and missing data; (c) estimation procedures including
maximum likelihood, restricted maximum likelihood, Bayesian estimation, and alternative
procedures such as bootstrapping; and (d) hypothesis testing and statistical inference including
inferences about variance parameters and fixed effects.
Following the explication of these issues, we describe the development of the framework
and coding protocol used as a lens for our analysis of the literature and the search strategies
employed in this review. The results of our review of applications are then presented, along with
recommendations for improved reporting practice.
Critical Issues in Hierarchical Modeling
Model Development & Specification
Model development is a central component of inquiry in many disciplines, with notably
different approaches evidenced among researchers both within and across disciplines.
According to Dickmeyer (1989), a patchwork of styles and worldviews with respect to model
development exists in the educational research community. Models are typically built to allow
researchers to test theories or hypotheses, to manipulate and test changes in simplified systems,
and to allow for the exploration of relationships between variables that in some way characterize
a complex system. Support for various models may emerge from the extant literature and
research in a particular field. Other models may be the product of developing theories
employing exploratory analyses that use more data-driven approaches.
To aid discussion of the particular specification issues confronted by those using
hierarchical models we first review the basic notation and terminology in the context of an
application. Consider a team of educational researchers who wish to study the relative
effectiveness of two reading programs. These researchers randomly assign participating
classrooms to one of the two programs, and gather reading comprehension data both prior to and
following the use of the program. A level-1 model could be developed to model the reading
comprehension of students within a class as a function of their reading comprehension prior to
the study. More specifically,
\[ y_{ij} = \beta_{0j} + \beta_{1j}\,\text{prior reading}_{ij} + r_{ij} \tag{1} \]
where $y_{ij}$ is the reading comprehension of the $i$th child in the $j$th classroom, $\beta_{0j}$ is the intercept of
the regression equation predicting reading comprehension at the end of the study in the $j$th
classroom, $\beta_{1j}$ is the regression coefficient indexing the relationship between reading
comprehension at the end of the study and reading comprehension before the study in the $j$th
classroom, and $r_{ij}$ is the error, which is assumed to be normally distributed with a covariance of Σ.
A level-2 model could then be used to examine the relative effectiveness of the two
programs, and whether the relative effectiveness of the two programs depended on the prior
levels of reading comprehension. The level-2 model would use program to predict the intercepts
and slopes of the level-1 model.
\[ \beta_{0j} = \gamma_{00} + \gamma_{01}\,\text{program}_j + u_{0j} \tag{2} \]
\[ \beta_{1j} = \gamma_{10} + \gamma_{11}\,\text{program}_j + u_{1j}, \tag{3} \]
where $\text{program}_j$ is a dummy coded variable indicating whether the $j$th classroom received
Program A (coded 0) or Program B (coded 1), and $u_{0j}$ and $u_{1j}$ are the errors, which are assumed to
be normally distributed with a covariance of Γ.
Although models are frequently described by using equations for each level, it is possible
to combine all the equations into one. By substituting the second level model for β0j and β1j in the
first level model, a combined model is obtained,
\[ y_{ij} = \gamma_{00} + \gamma_{01}\,\text{program}_j + \gamma_{10}\,\text{prior reading}_{ij} + \gamma_{11}\,\text{program}_j \times \text{prior reading}_{ij} + u_{0j} + u_{1j}\,\text{prior reading}_{ij} + r_{ij} \tag{4} \]
In the combined model it becomes clear why $\gamma_{11}$ is referred to as a cross-level interaction. It
should also be noted that the combined model has the same form as the mixed linear model,
\[ \mathbf{y} = \mathbf{X}\boldsymbol{\gamma} + \mathbf{Z}\mathbf{u} + \boldsymbol{\varepsilon}, \tag{5} \]
where y is a vector of outcome data, γ is a vector of fixed effects, X and Z are known model
matrices, u is a vector of random effects, and ε is a vector of errors (Henderson, 1975).
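Returning to the combined model in Equation 4, the substitution of Equations 2 and 3 into Equation 1 can be written out step by step. The following is a sketch of that algebra in the notation already introduced; the labels "fixed part" and "random part" are added here for illustration and correspond to the Xγ and Zu + ε terms of Equation 5.

```latex
\begin{align*}
y_{ij} &= \beta_{0j} + \beta_{1j}\,\text{prior reading}_{ij} + r_{ij} \\
       &= \left(\gamma_{00} + \gamma_{01}\,\text{program}_j + u_{0j}\right)
        + \left(\gamma_{10} + \gamma_{11}\,\text{program}_j + u_{1j}\right)\text{prior reading}_{ij}
        + r_{ij} \\
       &= \underbrace{\gamma_{00} + \gamma_{01}\,\text{program}_j
        + \gamma_{10}\,\text{prior reading}_{ij}
        + \gamma_{11}\,\text{program}_j \times \text{prior reading}_{ij}}_{\text{fixed part}}
        + \underbrace{u_{0j} + u_{1j}\,\text{prior reading}_{ij} + r_{ij}}_{\text{random part}}
\end{align*}
```

The product term multiplying $\gamma_{11}$ arises only because the level-2 predictor enters the equation for the level-1 slope, which is why it crosses levels.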
Centering. Centering of the level-1 and level-2 predictors has important implications for
interpreting the results and therefore is an important consideration in specifying the statistical
model. In our example, suppose the level-1 predictor, prior reading comprehension, was
measured on a scale ranging from 200 to 800. If the predictor was kept in its natural metric, $\gamma_{00}$
would be the predicted reading comprehension for a student in Program A (coded 0) who had a
prior reading comprehension of zero. Since a prior reading comprehension of zero is not
possible, the coefficient is difficult to interpret. The effect of the program, $\gamma_{01}$, would be
interpreted as the difference in the effectiveness of the two programs when prior reading
comprehension was zero, which again is a value that is not particularly informative. Prior reading
comprehension would need to be scaled or centered to make the interpretation of the coefficients
more meaningful. Although centering is used outside the context of multilevel modeling, it is
particularly important in multilevel modeling because the level-1 coefficients become outcomes
to be explained in higher level models (e.g., level 2).
One approach to scaling the predictor variable is to subtract the grand mean of the
predictor variable from each score ($x_{ij} - \bar{x}_{\cdot\cdot}$). Using grand mean centering with our example, $\gamma_{00}$
would become the predicted reading comprehension for a student from Program A, who had a
mean level of prior reading comprehension. The effect of the program, $\gamma_{01}$, would become the
difference in the effectiveness of the two programs for students with a mean level of prior
reading comprehension.
A second approach to scaling the predictor variable is to subtract the level-2 unit mean of
the predictor variable from each score ($x_{ij} - \bar{x}_{\cdot j}$). Using group-mean centering with our example,
$\gamma_{00}$ would become the predicted reading comprehension for a student in Program A whose
prior reading comprehension was at the mean of her class. The effect of the program, $\gamma_{01}$, would
become the difference in effectiveness of the two programs for students who are at their
classroom’s mean level of prior reading comprehension.
A third approach to scaling a predictor variable is to subtract a theoretically meaningful
value from each score ($x_{ij} - \text{specific value}$). This approach is similar to grand-mean centering in
that a constant is subtracted from each score. The intercept $\beta_{0j}$ is interpreted as the expected outcome for
individuals who score at the specific value that has been set by the researcher. For example, in a
growth curve model examining change in achievement across grades 6, 7, and 8, a researcher may center
the grade predictor at grade 8. In this case, $\beta_{0j}$ is interpreted as the expected value of the
outcome for a student in 8th grade.
In the case of level-2 predictor variables, Raudenbush and Bryk (2002) have noted that the
choice of the scale metric (e.g., natural metric, grand mean centered) is less critical. However,
when interaction terms are included at level-2, Raudenbush and Bryk (2002) suggest that
grand mean centering has the advantage of reducing multicollinearity.
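The three centering approaches can be illustrated with a few lines of code. The sketch below is only an illustration: the function names, the six prior reading scores, and the two classroom labels are hypothetical and do not come from any study reviewed here.

```python
from collections import defaultdict
from statistics import mean


def grand_mean_center(scores):
    """Subtract the overall (grand) mean from every score."""
    grand = mean(scores)
    return [x - grand for x in scores]


def group_mean_center(scores, groups):
    """Subtract each level-2 unit's own mean from the scores in that unit."""
    by_group = defaultdict(list)
    for x, g in zip(scores, groups):
        by_group[g].append(x)
    group_means = {g: mean(xs) for g, xs in by_group.items()}
    return [x - group_means[g] for x, g in zip(scores, groups)]


def center_at(scores, value):
    """Subtract a researcher-chosen constant from every score."""
    return [x - value for x in scores]


# Hypothetical prior reading scores for two classrooms (assumed data).
prior = [400, 500, 600, 500, 600, 700]
classroom = ["a", "a", "a", "b", "b", "b"]

grand_centered = grand_mean_center(prior)             # zero = grand mean (550)
group_centered = group_mean_center(prior, classroom)  # zero = own class mean
value_centered = center_at(prior, 500)                # zero = a score of 500
```

Under grand-mean centering the two classrooms keep different average scores, whereas under group-mean centering every classroom averages zero, so between-classroom differences in prior reading are removed from the level-1 predictor.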
Selection of predictors. The selection of predictors is a critical aspect of the design of a
study. According to Little, Lindenberger and Nesselroade (1999) the issue of variable selection is
directly related to the quality of the research design and the value of the results. In hierarchical
modeling, variable selection can be complicated since predictors can be selected for each level of
the model, and interactions between predictors can be considered at either level or across levels.
In addition, the process of variable selection can take many forms. In some instances, the
selection of predictors is established prior to looking at the data, while in others the data help
guide selection decisions. Inclusion may be based partially on significance tests, effect sizes, or fit
indices.
In the reading comprehension example, one can imagine the researchers using the
research goals and a priori considerations to select prior reading achievement as a predictor at the
first level, and program as a predictor at the second level. Interpreting the regression coefficients
is influenced by the degree to which one believes the regression coefficients are unbiased
estimates of effects. In our illustrative example, $\gamma_{01}$ can be interpreted as a program effect. Since
classrooms were randomly assigned to programs, we would anticipate that program was not
related to any other classroom level variables, and consequently we would anticipate an
unbiased estimate of effect. If we had not randomly assigned classrooms to program, our ability
to argue $\gamma_{01}$ was an unbiased estimate of effect would depend on our ability to argue that all
relevant variables were included in the model, or that the set of predictors for the model was
correctly specified.
Researchers who wish to include all relevant variables, but who are unsure if particular
variables need to be included, may let the data help them decide which variables to include. For
example, a researcher may start with a model like we have previously described and then add
variables one at a time, keeping only those that are statistically significant. Other researchers
may start with a fuller set of variables, and eliminate ones that do not seem to be affecting the
results. When the data are used to guide the selection of predictors the researcher increases the
odds of capitalizing on chance, which heightens the need for replication.
Specification of covariance structure. The most notable difference between hierarchical
models and more common regression models is that hierarchical models have more error terms
and consequently more flexibility in defining the covariance structure. This greater flexibility
leads to two distinct advantages. First, it allows researchers to be more flexible in the questions
they ask about the covariance structure. In applications like our example, researchers can ask
questions about the degree to which the outcome varies within classrooms relative to the degree
to which it varies across classrooms. In growth curve modeling applications, researchers can ask
to what extent initial levels (intercepts) vary across participants and to what extent growth rates
(slopes) vary across participants. Second, the degree to which the standard errors for the
regression coefficients are unbiased depends on the degree to which the covariance structure is
correctly specified. Having the flexibility to model a more complex covariance structure improves
the chances of correct specification, which leads to better estimates of the standard errors of the
regression coefficients, which in turn leads to more accurate confidence intervals and/or more
valid statistical tests.
The covariance structure for the first-level model, Σ, is often assumed to be $\sigma^2 I$ in
applications where students are nested in contexts. In repeated measures contexts this
assumption is more questionable since errors that are close in time may be correlated. A wide
variety of alternative structures have been discussed including first-order autoregressive,
banded, unstructured, Toeplitz, banded Toeplitz, and first-order autoregressive plus a diagonal
(Wolfinger, 1993). With so many options available, researchers are left with questions about how
to best specify the covariance structure of the first-level model.
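To make the contrast between candidate level-1 structures concrete, the sketch below builds the independence structure and a first-order autoregressive structure for a short series and counts the parameters of an unstructured matrix. The function names and numeric values are hypothetical, and the autoregressive form used, in which the covariance between occasions i and j is σ² times ρ raised to |i − j|, is the usual textbook parameterization rather than one taken from a specific study.

```python
def identity_structure(n, sigma2):
    """sigma^2 * I: equal variances, no correlation between occasions."""
    return [[sigma2 if i == j else 0.0 for j in range(n)] for i in range(n)]


def ar1_structure(n, sigma2, rho):
    """First-order autoregressive: cov(i, j) = sigma^2 * rho ** |i - j|."""
    return [[sigma2 * rho ** abs(i - j) for j in range(n)] for i in range(n)]


def n_params_unstructured(n):
    """An unstructured n x n covariance matrix has n(n + 1)/2 free elements."""
    return n * (n + 1) // 2


# Hypothetical values: three occasions, variance 2.0, lag-1 correlation 0.5.
independent = identity_structure(3, 2.0)
autoregressive = ar1_structure(3, 2.0, 0.5)
```

The independence structure has one parameter and the autoregressive structure two regardless of series length, while the unstructured matrix grows quadratically, which is one reason the choice among structures matters more as the number of occasions increases.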
Questions also arise as to how to best specify the covariance structure of the second-level
model. In the previous example, which is relatively simple, there are alternative specifications
for Γ depending on whether one wanted to let both intercepts ($\beta_{0j}$) and slopes ($\beta_{1j}$) randomly
vary and whether or not one wanted to allow for covariance between the errors in predicting the
intercepts and slopes. As the number of predictors in the first-level model increases the potential
size of the Γ matrix also increases. With more elements there are more variance parameters that
could be estimated. The question becomes which coefficients should be allowed to randomly
vary, and if the answer is more than one, which of the possible covariances between errors
should be estimated.
One could generally divide the covariance parameters into three categories: those that are
assumed to be zero and not estimated, those that are assumed to be non-zero and thus estimated,
and those that the researcher is less sure about. If researchers routinely leave out all questionable
variance parameters, they run the risk of leaving out needed parameters, biasing their standard
errors, and jeopardizing their inferences. If researchers routinely add in all questionable
parameters, they may estimate a model that is overly complex, which increases the chance that
they will encounter estimation problems. For example, the estimation may not converge, or a
variance component may be inadmissible (e.g., a variance less than zero or a covariance that
implies the correlation would exceed 1.0). Even when estimation seems smooth, estimating
many parameters that are equal to zero will negatively affect the precision in estimating the other
parameters in the model. Consequently, one would ideally only estimate the needed parameters.
Fit Indices. With the growing recognition of the importance of the selection of an
appropriate covariance structure, several methods have been developed that allow researchers to
use the data to help make decisions about which covariance structure to estimate. As it is often
not possible to know the underlying structure in advance, researchers will often examine
multiple structures and rely on fit indices to select among possible covariance structures (Singer,
1998; Wolfinger, 1993). Among the indices commonly used are Akaike’s Information Criterion
(Akaike, 1974) and Schwarz’s Bayesian Criterion (Schwarz, 1978). Akaike’s Information
Criterion is given by:
\[ \text{AIC} = \log(L) - q \tag{6} \]
where q is the number of covariance parameters.
Schwarz’s Bayesian Criterion is given by:
\[ \text{SBC} = \log(L) - \frac{q \log(N - p)}{2} \tag{7} \]
Both AIC and SBC start with the log likelihood value and then penalize for the number of
covariance parameters estimated, with SBC employing a stiffer penalty. For each of these indices
values closer to zero represent better fit, so typically the model with the value closest to zero is
selected. This approach, however, does not always lead to identification of the correct covariance
model, especially when data are somewhat limited. For example, with repeated measures data, it
is difficult to correctly select the covariance structure when the series length is short (Ferron,
Dailey, & Yi, 2002; Keselman, Algina, Kowalchuk, & Wolfinger, 1998). Furthermore,
misspecification can affect estimation and inference (Ferron, Dailey, & Yi, 2002; Lange & Laird,
1989).
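Equations 6 and 7 translate directly into code. The sketch below is illustrative only: the log likelihood values and parameter counts are invented, and because the text does not define N and p, they are assumed here to be the number of observations and the number of fixed-effect parameters, respectively.

```python
import math


def aic(log_lik, q):
    """Equation 6: log likelihood minus the number of covariance parameters."""
    return log_lik - q


def sbc(log_lik, q, n_obs, n_fixed):
    """Equation 7, with N = n_obs and p = n_fixed (an assumed reading)."""
    return log_lik - q * math.log(n_obs - n_fixed) / 2.0


def select_structure(candidates, n_obs, n_fixed):
    """Return the structure preferred by AIC and by SBC.

    With negative log likelihoods, the value closest to zero is the largest.
    """
    best_aic = max(candidates, key=lambda name: aic(*candidates[name]))
    best_sbc = max(
        candidates, key=lambda name: sbc(*candidates[name], n_obs, n_fixed)
    )
    return best_aic, best_sbc


# Hypothetical (log likelihood, q) pairs for three covariance structures.
candidates = {
    "identity": (-520.4, 1),
    "ar1": (-512.9, 2),
    "unstructured": (-508.1, 10),
}
chosen = select_structure(candidates, n_obs=104, n_fixed=4)
```

In this made-up example the unstructured model has the highest log likelihood, but its ten covariance parameters are penalized heavily enough that both indices prefer the autoregressive structure.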
Other work in this area has focused not on a single estimate, but rather a ‘confidence set
of models’ (Shimodaira, 1998). Instead of using the minimum AIC, Shimodaira proposed the use
of an ‘interval’ estimate of the best model. This author was quick to note that the confidence set
approach is not intended to replace the use of an obtained point estimate of the minimum AIC,
but rather provides supplemental information on model selection. This approach employs a
series of pairwise analyses in which a standardized difference of AIC is calculated for every pair
of models. Potential models are compared to the model evidencing the best estimate of sample
fit, and those models that are not observed to differ by a statistically significant amount become
part of a set of models for consideration.
Generalizability and Sensitivity. The degree to which the findings of a particular analysis
are generalizable, as well as how sensitive the findings are to characteristics of the data, should be
a concern of all researchers. Limitations of a particular sample, the nature of the data, as well as
techniques employed, all impact the breadth and depth of the inferences made. These issues can,
at least in part, be examined to help ascertain the strength of findings using a variety of statistical
methods and techniques. Such techniques include cross-validation, sensitivity analysis,
replication and extension of previous research, and internal replication.
Although typically associated with more traditional statistical methods such as regression analysis,
cross-validation is also useful in HLM analyses, where it
provides further evidence of validity and model soundness. The primary purpose of cross-validation
is to provide a check of model integrity and generalizability. This model ‘check’ is
accomplished by using two sets of data: one sometimes referred to as the screening (or training)
set, and one that may be called the calibration (or test) set. The screening set is used to
estimate the model, and the estimated model is then applied to the calibration set to determine how well
it predicts those data relative to how well it fit the screening data set. This process
allows the magnitude of the generalization error to be calculated from how well the calibration
data set fits the model identified by the screening data. Depending on the data structure and
sample size, cross-validation may be conducted using various strategies. These include the
holdout method (also called the data splitting method), the dual-sample method, k-fold cross-validation,
and leave-one-out cross-validation (LOO). The procedure can be further
complicated when a researcher does not believe, for theoretical, conceptual, or data-based
reasons, that a single cross-validation process is sufficient and thus engages in double cross-validation.
Double cross-validation is nothing more than doing cross-validation twice and then
using a combined equation or model. Depending on the statistical analysis being employed (e.g.,
regression or HLM) this process may vary in complexity and applicability.
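One way a k-fold split might be set up for nested data is sketched below. The choice to assign whole level-2 units (here, classrooms) to folds, rather than individual students, is an assumption made for this illustration and is not prescribed by the sources reviewed; the function name and classroom identifiers are likewise hypothetical.

```python
import random


def kfold_by_unit(unit_ids, k, seed=0):
    """Split the distinct level-2 units into k folds.

    Returns a list of (training_units, test_units) pairs, one per fold.
    Every unit appears in exactly one test set.
    """
    units = sorted(set(unit_ids))
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(units)
    folds = [units[i::k] for i in range(k)]
    splits = []
    for fold in folds:
        test_units = set(fold)
        training_units = set(units) - test_units
        splits.append((training_units, test_units))
    return splits


# Hypothetical classroom identifier for each of 30 students (10 classrooms).
student_classroom = [f"class{c}" for c in range(10) for _ in range(3)]
splits = kfold_by_unit(student_classroom, k=5)
```

Each pair can then be used to estimate the model on students in the training classrooms and to evaluate prediction for students in the held-out classrooms.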
Another means of addressing generalizability and sensitivity issues is by conducting
a sensitivity analysis. This type of analysis examines the impact of data anomalies (e.g., extreme
data values, distribution irregularities) on model fit and parameter specifications. Bayesian
techniques such as Gibbs sampling, as well as other strategies and algorithms, can be
used to examine the impact of extreme observations at either level one or level two of the model
(Seltzer, Novak, Choi, & Lim, 2002). Other techniques, such as data transformations (e.g., log-
linear, square root) can be effective in addressing issues such as nonnormal distributions. The
degree to which model fit and parameter specifications remain constant when data issues such as
these are controlled for is critical to determine how ‘sensitive’ a given model is to fluctuations or
peculiarities in the data.
As with virtually all other techniques and methods of data analysis, HLM can be used for
both replication and extension of other studies as well as internal replication. The replication and
extension of other studies can be done in a multitude of ways. HLM can be a complementary
method used to examine a sample of a population previously analyzed using other statistical
techniques such as regression analysis. Because HLM is more robust in accounting
for such issues as lack of independence among observations, it is well suited to replicating
previous research with very similar populations, either to help strengthen the inferences made
from that research or to identify possible areas of concern that were not identifiable with more
traditional analyses. Furthermore, replication can be conducted within a given study by
analyzing subsets of the data independently and subsequently examining the degree to which the
results are similar. Depending on the method(s) used, as well as the data source(s), replication
efforts often serve to enhance either the external or internal validity of the findings reported and
conclusions reached.
These are just a few of the means that researchers might consider using when conducting
HLM analyses. These techniques and approaches have the potential to enhance the credibility,
validity, and generalizability of findings, depending on the focus, purpose, and resources considered in the
study. A careful and considerate selection of one of these analyses will enhance the integrity of
the findings of a research study.
Data Considerations
Distributional assumptions. All inferential statistical tools are based on a set of core
assumptions. Provided the assumptions are met, the method will function as planned or
intended. These underlying assumptions are often not satisfied, and it is common knowledge
that under some data-analytic conditions certain procedures will not produce the desired results.
According to Keselman et al. (1998), “the applied researcher who routinely adopts a traditional
procedure without giving thought to its associated assumptions may unwittingly be filling the
literatures with nonreplicable results” (p. 351).
For a hierarchical linear model, distributional assumptions are made about the errors at
each level in the model. The first-level errors, the $r_{ij}$ in Equation 1, are assumed to be
independently and normally distributed with a covariance of Σ. Lack of normality can lead to
biases in the standard errors at both levels, and thus introduces questions about the validity of
statistical tests and the accuracy of reported confidence intervals. The normality assumption is
not realistic for certain types of outcome variables (e.g., binary outcomes, multinomial outcomes,
and ordinal outcomes), and in these cases it is generally recognized that hierarchical generalized
linear models are more appropriate. When one has a continuous outcome variable, the
normality assumption of hierarchical linear models may be reasonable, but even here the
assumption may not hold. Researchers can assess normality by examining the distribution of the
level-1 residuals. The distributions can be examined separately for each level-2 unit, or by
pooling across the level-2 units. If evidence of nonnormality is found, the researcher may wish to
consider transforming the outcome variable.
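A simple numerical screen for nonnormality in pooled level-1 residuals is to compute their sample skewness and excess kurtosis, both of which are zero for a normal distribution. The sketch below uses moment-based estimators and invented residuals; it is a descriptive check under those assumptions, not a formal test, and the function name is hypothetical.

```python
from statistics import mean


def skewness_and_excess_kurtosis(residuals):
    """Moment-based sample skewness and excess kurtosis."""
    n = len(residuals)
    m = mean(residuals)
    m2 = sum((r - m) ** 2 for r in residuals) / n  # second central moment
    m3 = sum((r - m) ** 3 for r in residuals) / n  # third central moment
    m4 = sum((r - m) ** 4 for r in residuals) / n  # fourth central moment
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return skew, excess_kurtosis


# Hypothetical pooled residuals: one symmetric set, one right-skewed set.
symmetric = [-2.0, -1.0, 0.0, 1.0, 2.0]
right_skewed = [0.0, 0.0, 0.0, 0.0, 10.0]

sym_skew, sym_kurt = skewness_and_excess_kurtosis(symmetric)
pos_skew, _ = skewness_and_excess_kurtosis(right_skewed)
```

The same function could be applied separately to the residuals from each level-2 unit, although with small units the estimates will be unstable.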
Also implicit in the assumptions about the first level error, is that the variance of the
errors is the same for each level-2 unit. If the variances are not homogeneous, but vary randomly,
it does not appear that the fixed effects or standard errors are biased (Kasim & Raudenbush,
1998), but if the variances vary as a function of the level-1 or level-2 predictors there may be more
serious consequences (Raudenbush & Bryk, 2002). A researcher can examine the homogeneity
assumption by examining the variance of the level-1 residuals for each level-2 unit. The
researcher could then look for units with variances that were notably different from the others, or
test whether the differences among the variances were greater than what could be attributed to
sampling error. Researchers could also examine the correlations between the variance estimates
and the values of the level-2 predictors.
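The first step described above, computing the variance of the level-1 residuals within each level-2 unit, might look like the following. The residuals and classroom labels are invented, the function names are hypothetical, and the max-to-min ratio is offered only as a rough descriptive summary, not as the formal test of homogeneity mentioned in the text.

```python
from collections import defaultdict
from statistics import variance


def residual_variance_by_unit(residuals, units):
    """Sample variance of the level-1 residuals within each level-2 unit."""
    by_unit = defaultdict(list)
    for r, u in zip(residuals, units):
        by_unit[u].append(r)
    return {u: variance(rs) for u, rs in by_unit.items()}


def max_to_min_ratio(variances):
    """Ratio of the largest to the smallest unit variance (descriptive only)."""
    values = list(variances.values())
    return max(values) / min(values)


# Hypothetical level-1 residuals for two classrooms.
residuals = [-1, 0, 1, -3, 0, 3]
classroom = ["a", "a", "a", "b", "b", "b"]

unit_variances = residual_variance_by_unit(residuals, classroom)
ratio = max_to_min_ratio(unit_variances)
```

The resulting dictionary of unit variances can also be paired with level-2 predictor values to examine the correlations described above.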
Distributional assumptions are also made about the level-2 errors, the $u_{0j}$ and $u_{1j}$ from
Equations 2 and 3. These errors are assumed to be normally distributed with a covariance of Γ.
Checking normality of these errors is a bit more complicated since the outcomes of the level-2
model are not directly observed, but a procedure for estimating skewness and kurtosis of the
random effects has been presented (Teuscher, Herrendorfer, & Guiard, 1994).
It is well known that model-based statistical inference depends on scrupulous
attention to the assumed models, which necessarily includes the distributional assumptions
underlying a particular model. This is, of course, necessary if a researcher hopes to find a
suitable model or models that fit the data well. Although a number of researchers have
investigated these issues in the past, according to Ghosh and Rao (1994), the literature on
diagnostics for mixed linear models involving random effects is not as extensive as the literature
with respect to the treatment of standard regression diagnostics. Recently, however, Jiang (2001)
advanced a technique using goodness-of-fit tests to examine the distributional assumptions with
regard to mixed linear models.
Outliers. As with other statistical methods, researchers should screen their data for
outliers. These outlying observations may arise from data entry errors (e.g., a 27 that should have
been a 72), an inaccurate assessment of a student (e.g., a 0 used to indicate achievement for an
absent student), failure to identify a missing data code (e.g., a missing value entered as a 999),
failure to screen out participants who fall outside the inclusion parameters for the study (e.g., a
score from a student who was not part of the school during the focal time period for the study),
or simply from an individual who is different from the others in the sample. As with other
analyses, illegitimate outliers (e.g., data entry mistakes) can distort analyses and should be
corrected. With legitimate outliers (e.g., a score that is atypical but truly part of the population
being considered), the researcher needs to be aware of their presence and influence on the results.
When the influence is substantial, ameliorative strategies may need to be considered.
Initially the researcher may wish to look for univariate outliers by inspecting box plots, or
examining the distance from the mean in standard deviation units for the smallest and the largest
observations. Although these univariate checks are helpful, the researcher should also consider
examining the residuals at each level of the model (e.g., Raudenbush & Bryk, 2002). As an
example, consider a study that examined students nested within classrooms. At the first level,
one could look for outlying students, individual scores that were far from expectation given the
class’s regression equation. At the second level, one could look for outlying classes, where an
outlying class is one that has an atypical regression coefficient. In addition to the examination of
residuals, one may wish to examine simulation-based methods (Longford, 2001).
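The univariate screen described above, expressing the smallest and largest observations in standard deviation units, might look like the following sketch. The scores are hypothetical, and the value 999 stands in for an unrecoded missing-data code of the kind mentioned earlier.

```python
from statistics import mean, stdev

def extreme_z_scores(values):
    """Return the z-scores of the smallest and largest observations."""
    m, s = mean(values), stdev(values)
    return (min(values) - m) / s, (max(values) - m) / s

# Hypothetical test scores; 999 is an unrecoded missing-data code.
scores = [52, 61, 58, 47, 66, 55, 999]
low_z, high_z = extreme_z_scores(scores)
print(f"smallest: {low_z:.2f} SD, largest: {high_z:.2f} SD")
```

The same function could be applied to level-1 residuals within a class, or to the level-2 residuals for the classes, once those have been obtained from the fitted model.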
Measurement error for predictors and outcomes. Most measures in educational studies
contain error. Consequently, it is likely that the predictors and outcome variables used in
educational applications of hierarchical modeling will contain measurement error. These errors,
if not accounted for, can bias estimates of variance parameters, variance ratios – like the intraclass
correlation, fixed effects, and the standard errors of fixed effects (Woodhouse, Yang, Goldstein, &
Rasbash, 1996). Consequently, it is important for educational researchers to consider the
reliability of the data used in their applications of hierarchical linear models. In situations where
measurement error is anticipated, there are methods for specifying and adjusting for the
measurement error (Longford, 1993; Woodhouse et al., 1996).
Power. Considerable work has investigated the power of statistical tests of treatment
effects in multilevel data. Sample size formulas have been provided for obtaining given powers
in experiments where the 2nd-level units have been randomized (Donner, Birkett, & Buck, 1981;
Hsieh, 1988). Power calculations are also available through a website and through specialized
software. Optimal allocation of units among levels (e.g., fewer large groups versus more small
groups) has been considered (Raudenbush, 1997; Snijders & Bosker, 1993). Also the level of
randomization has been found to impact power, such that randomization of the 2nd-level units
leads to less power than randomization of the 1st-level units (Donner, Birkett, & Buck, 1981;
Hsieh, 1988).
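The loss of power from randomizing 2nd-level units is often summarized with the familiar design effect, 1 + (m - 1)ρ, where m is a common cluster size and ρ is the intraclass correlation. The sketch below uses that standard approximation; it is not claimed to reproduce the exact formulas of the cited papers, and the function names and example values are assumptions.

```python
import math

def design_effect(cluster_size, icc):
    """Variance inflation from randomizing intact clusters of equal size."""
    return 1.0 + (cluster_size - 1) * icc

def clusters_needed(n_individual, cluster_size, icc):
    """Clusters per arm needed to match a design that would require
    n_individual participants per arm under individual randomization."""
    total = n_individual * design_effect(cluster_size, icc)
    return math.ceil(total / cluster_size)

# Hypothetical planning values: 64 per arm if individuals were randomized,
# classes of 20 students, intraclass correlation of 0.10.
print(design_effect(20, 0.10))
print(clusters_needed(64, 20, 0.10))
```

With an intraclass correlation of zero the design effect is 1 and the two levels of randomization require the same number of participants.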
Missing Data. It is not uncommon for missing data to occur on one or more variables
within an empirical investigation. Missing data may adversely affect data analyses,
interpretations and conclusions. Collins, Schafer, and Kam (2001) indicate that missing data may
potentially bias parameter estimates, inflate Type I and Type II error rates and influence the
performance of confidence bands. Further, because a loss of data is almost always associated with
a loss of information, concerns arise with regard to reductions of statistical power. Unfortunately,
researchers’ recommendations for managing missing data are not in complete agreement
(Guertin, 1968; Beale & Little, 1975; Gleason & Staelin, 1975; Frane, 1976; Kim & Curry, 1977;
Santos, 1981; Basilevsky, Sabourin, Hum, & Anderson, 1985; Raymond & Roberts, 1987). Many
studies that have examined missing data treatments are not comparable due to the various
methods used, the stratification categories (number of variables, sample size, proportion of
missing data, and degree of multicollinearity), and the criteria that measure effectiveness
(Anderson, Basilevsky, & Hum, 1983). Contemporary discussion of missing data and their
treatment can often be confusing and at times may appear somewhat counterintuitive. For
example, the term ignorable, introduced by Little and Rubin (1987) was not intended to convey a
message that a particular aspect of missing data could be ignored, but rather under what
circumstances the missing data mechanism is ignorable. Additionally, when one speaks of data
missing at random, these words should not convey the notion that the missingness is derived from
a random process external or unrelated to other variables under study (Collins et al., 2001).
According to Heitjan and Rubin (1991) missing data can take many forms, and missing
values are part of a more general concept of coarsened data. This general category of missing
values results when data are grouped, aggregated, rounded, censored, or truncated, resulting in a
partial loss of information. The major classifications of missing data mechanisms can be best
explained by the relationship among the variables under investigation. Rubin (1987) identified
three general processes that can produce missing data. First, data that are missing purely due to
chance are considered to represent data that are missing completely at random (MCAR).
Specifically, data are missing completely at random if the probability of a missing response is
completely independent of all other measured or unmeasured characteristics under examination.
Accordingly, analyses of data of this nature will result in unbiased estimates of the population
parameters under investigation. Second, data that are classified as missing at random (MAR), do
not depend on the missing value itself, but may depend on other variables that are measured for
all participants under study. Lastly, and most problematic statistically, are data missing not at
random (MNAR). This type of missingness, also referred to as nonignorable missing data, is
directly related to the value that would have been observed for a particular variable. A
commonly encountered situation, in which data would be classified as MNAR, arises when
respondents in a certain income or age stratum fail to provide responses to questions of this nature.
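A small simulation can make the three mechanisms concrete. In the sketch below everything is assumed for illustration: the data-generating model, the retention probabilities, and the names. A fully observed covariate x predicts an outcome y, and y is then deleted by a rule that ignores the data (MCAR), depends only on x (MAR), or depends on y itself (MNAR).

```python
import random
from statistics import fmean

def simulate(n=20000, seed=1):
    """Mean of y over all cases and over the cases retained by each rule."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0, 1)                # fully observed covariate
        y = 0.8 * x + rng.gauss(0, 0.6)    # outcome subject to missingness
        rows.append((x, y))
    # Probability that y is observed under each hypothetical mechanism.
    keep = {
        "MCAR": lambda x, y: rng.random() < 0.5,
        "MAR": lambda x, y: rng.random() < (0.8 if x < 0 else 0.2),
        "MNAR": lambda x, y: rng.random() < (0.8 if y < 0 else 0.2),
    }
    result = {"complete": fmean(y for _, y in rows)}
    for name, rule in keep.items():
        result[name] = fmean(y for x, y in rows if rule(x, y))
    return result

results = simulate()
for name, value in results.items():
    print(f"{name:9s} mean of y = {value:.3f}")
```

In this setup the mean of the retained cases stays close to the complete-data mean only under MCAR. Under MAR the unadjusted mean is also shifted, but because missingness depends only on the observed x, an analysis that conditions on x can in principle recover it; under MNAR no observed variable carries that information.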
Given the nature of the data typically analyzed using hierarchical linear modeling, it is
not surprising that the issue of missing data becomes pertinent to inquiry of this nature (Roy &
Lin, 2002). Missing data may occur at the different levels of a model, or the loss of multiple data
points across time may be unavoidable or inevitable due to attrition or mortality. It is also not
uncommon to face a combination of these challenges when examining longitudinal outcomes.
The careful researcher must be concerned not only with nonignorable nonresponses but with
missing covariates as well (Roy & Lin, 2002).
Estimation
There is no single agreed-upon way to estimate the parameters in a hierarchical linear
model. Several methods of estimation can be employed, including maximum likelihood (ML),
restricted maximum likelihood (REML), and Bayesian (Raudenbush & Bryk, 2002; Kreft & De
Leeuw, 1998). These methods of estimation can be carried out using many different algorithms.
For example, ML estimation may be accomplished using the EM algorithm, the Newton-Raphson
algorithm, the Fisher scoring algorithm, or iterative generalized least squares (IGLS), while
Bayesian estimation may be accomplished using the Gibbs sampler. In addition, these algorithms
have been programmed into many different software packages. Thus one researcher may
accomplish REML estimation using the EM algorithm programmed into HLM, while another
may accomplish REML estimation using restricted iterative generalized least squares (RIGLS)
using MLn, while a third may accomplish REML using the Newton-Raphson algorithm
programmed in Proc MIXED within SAS.
Maximum likelihood estimation. The principle behind ML estimation is to select parameter
estimates that maximize the likelihood of the data. We consider how likely it is that we would
have obtained the data for each of many different values for the fixed effects (γs) and variance
parameters (elements in Σ and Γ), and then pick the values for which the likelihood is the
greatest. This involves an iterative algorithm that steps through possible values until the
likelihood reaches its maximum. When the maximum is reached the algorithm is said to have
converged. The goal for computational statisticians is to develop an algorithm that converges
fairly quickly across a wide range of applications. If the algorithm meanders through the
possibilities too slowly it may not converge given the time allocated, and if the algorithm moves
too quickly it may miss the maximum and fail to converge. Since the desirable properties of
maximum likelihood estimators are not realized when convergence fails, the objective for applied
researchers is to select an algorithm that will converge given their data and time constraints.
Maximum likelihood estimation is currently available through a variety of algorithms and
software packages. It can be accomplished using the EM algorithm (Dempster, Laird, & Rubin,
1977), which is implemented in the software package HLM (Raudenbush, Bryk, Cheong, &
Congdon, 2000), or by using the Newton-Raphson algorithm (Lindstrom & Bates, 1988) which is
implemented in Proc MIXED (SAS, 2000), or by using the Fisher scoring algorithm (Longford,
1987) which is implemented in VARCL, or by using iterative generalized least squares (IGLS;
Goldstein, 1986) which is implemented in MLn. The EM algorithm has the advantage that it will
always converge if given enough time, but the disadvantage is that it may take a relatively long
time to converge (Draper, 1995).
If convergence is met and the estimated variance/covariance matrices are positive definite
(i.e., the variances are positive and the absolute value of the implied correlations do not exceed
1.0), then the estimators have some desirable properties. The fixed effects (γs) are unbiased
(Kacker & Harville, 1981, 1984), and the estimates of the variance parameters (elements in Σ and
Γ) are asymptotically unbiased, that is the bias disappears as sample size gets large (Raudenbush
& Bryk, 2000). The estimates of the fixed effects and variance parameters also tend to be
asymptotically efficient, which implies that when the sample size is large the maximum
likelihood estimates show minimum variance from sample to sample (Raudenbush & Bryk,
2000). Finally, as sample size increases the sampling distributions of the estimates become
approximately normal, which facilitates construction of confidence intervals and statistical tests
(Raudenbush & Bryk, 2000). Note that these properties hold for relatively large sample sizes,
where what is considered large is heavily influenced by the number of upper level units. For
example, if one studies students who are nested in classes, then many classes must be sampled if
one wishes to obtain these desirable properties.
Restricted maximum likelihood estimation (REML). In REML, maximum likelihood estimates
are obtained for the variance parameters (elements in Σ and Γ). These values are then used in
obtaining generalized least squares estimates of the fixed effects (γs). The REML estimates of the
variance parameters may be considered preferable to ML estimates because REML takes into
account uncertainty in the fixed effects (γs) when the variance parameters are estimated. Since
the uncertainty in the fixed effects is more pronounced with smaller sample sizes, one may
suspect the difference in these methods would tend to be greater when sample sizes were
smaller. A couple of empirical studies have been done which have found differences between
ML and REML estimates under a variety of conditions (Kreft & de Leeuw, 1998), but these
studies do not lead to uniform recommendation of one method over the other.
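The difference between the two methods can be seen in the one case where both have simple closed forms: a balanced one-way random-effects model (a random intercept and no predictors). The sketch below uses the textbook expressions for that special case and is valid only when the between-group estimate is positive. The scalar names `sigma2` (level-1 variance) and `tau2` (level-2 variance) are naming assumptions standing in for the single elements of Σ and Γ, and the data are made up.

```python
from statistics import fmean

def one_way_estimates(groups):
    """ML and REML variance estimates for a balanced one-way random model."""
    J = len(groups)        # number of level-2 units
    n = len(groups[0])     # common number of level-1 units per group
    if any(len(g) != n for g in groups):
        raise ValueError("closed forms assume balanced data")
    grand = fmean(v for g in groups for v in g)
    group_means = [fmean(g) for g in groups]
    ssb = n * sum((m - grand) ** 2 for m in group_means)
    ssw = sum((v - m) ** 2 for g, m in zip(groups, group_means) for v in g)
    msw = ssw / (J * (n - 1))
    # REML divides the between-group sum of squares by J - 1, which
    # acknowledges that the grand mean was estimated; ML divides by J.
    tau2_reml = max(0.0, (ssb / (J - 1) - msw) / n)
    tau2_ml = max(0.0, (ssb / J - msw) / n)
    return {"sigma2": msw, "tau2_ml": tau2_ml, "tau2_reml": tau2_reml}

# Hypothetical scores for three groups of two.
estimates = one_way_estimates([[1, 3], [4, 6], [8, 10]])
print(estimates)
```

Because the two divisors are J and J - 1, the estimates converge as the number of level-2 units grows, which is consistent with the expectation that the methods differ most when sample sizes are small.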
As with ML estimates, REML estimates can be obtained from a variety of software
packages (e.g., HLM, SAS Proc MIXED, MLn, VARCL) and through a variety of algorithms (e.g.,
EM, Newton-Raphson, Fisher scoring, and RIGLS), and have been shown to have desirable
properties under many conditions. Again under general conditions, the fixed effects (γs) are
unbiased (Kacker & Harville, 1981, 1984), the estimates of the variance parameters (elements in Σ
and Γ) are asymptotically unbiased (Raudenbush & Bryk, 2002), the estimates of the fixed effects
and variance parameters are asymptotically efficient (Raudenbush & Bryk, 2002), and as sample
size increases the sampling distributions of the estimates become approximately normal
(Raudenbush & Bryk, 2002). Consequently, both ML and REML are often recommended for
large sample size conditions. When sample sizes are smaller, and particularly when the data are
unbalanced, the functioning of both ML and REML becomes questionable, which may lead
researchers to consider alternatives. In addition, inferences about the fixed effects (e.g.,
confidence intervals for the γs) assume the variance estimates have no error. This also becomes
exceedingly questionable when the sample sizes are not large.
Bayesian estimation. With Bayesian estimation (Lindley & Smith, 1972) one can
acknowledge the uncertainty in the estimates of the variance parameters when the fixed effects
are estimated. Consequently, Bayesian estimation provides an appealing option for researchers
working with smaller data sets. This form of estimation can be accomplished using Markov
Chain Monte Carlo algorithms like the Gibbs sampler, which is implemented in the software
BUGS. Although Bayesian estimation is appealing in some circumstances, it also has some
drawbacks. Prior distributions must be specified, but this specification may conflict with some
researchers’ desire to not let prior beliefs influence the results of their analyses (Raudenbush &
Bryk, 2002). In addition, the algorithms are not as readily available, as they have only been
implemented in a couple of software packages, and the algorithms are very computer intensive,
making them impractical for large data sets.
Alternative Estimation Methods. Since none of the estimation methods is entirely
satisfactory across all data conditions that may be encountered in research, statisticians continue
to work on the development of alternatives. Bootstrapping has been presented as one option to
deal with the bias in the variance estimates and standard errors that results from using ML or
REML estimation with samples that are not large and normal. Bootstrapping is available in
MLwiN, and both parametric (Meijer, van der Leeden, & Busing, 1995), and nonparametric
(Carpenter, Goldstein, & Rasbash, 1999) versions have been discussed. Another alternative
stems from the motivation to restrict the influence of outlying observations. Robust ML
estimation methods and robust REML estimation methods have been proposed and show
promise (Richardson & Welsh, 1995), but as far as we know they have not been programmed into
readily available hierarchical modeling software.
Hypothesis Testing and Statistical Inference
The estimation method will produce point estimates of each parameter in the hierarchical
model. These point estimates are often valuable in addressing particular research questions, but
additional information is often provided to aid the researcher in making inferences. This
additional information may take the form of confidence intervals for parameters of interest or
hypothesis tests of these parameters. When considering the options available it becomes
convenient to distinguish between inferences made about variance parameters (elements in Σ and
Γ), inferences made about fixed effects (γs), and inferences made about the random level-1
coefficients (e.g., β0j).
Inferences about variance parameters. A researcher may be interested in creating a
confidence interval (CI) for a variance parameter. The simplest approach would be to make use
of the standard error of the variance parameter estimate, which is computed from the inverse of
the information matrix. By adding and subtracting 1.96 times the standard error of the parameter
estimate, one can create a 95% CI, assuming a normal sampling distribution. This approach,
however, has limitations, especially when the sample size is small or the variance parameter is
near zero (e.g., Littell, Milliken, Stroup, & Wolfinger, 1996; Raudenbush & Bryk, 2002). Under
these conditions the variance parameter will tend to have a skewed sampling distribution,
making symmetric intervals based on the standard error unrealistic. Under these conditions
researchers should turn to other options including the Satterthwaite approach (Littell, Milliken,
Stroup, & Wolfinger, 1996), bootstrapping (Meijer, van der Leeden, & Busing, 1995; Carpenter,
Goldstein, & Rasbash, 1999), a method based on local asymptotic approximations (Stern &
Welsh, 2000), and if the data are balanced, the approach proposed by Yu and Burdick (1995).
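The limitation of the simple standard-error-based interval is easy to demonstrate. In the sketch below the estimate and standard error are hypothetical values chosen to represent a small variance component estimated from few level-2 units.

```python
def wald_interval(estimate, std_error, z=1.96):
    """Symmetric interval: estimate plus or minus z standard errors."""
    return estimate - z * std_error, estimate + z * std_error

# Hypothetical variance estimate of 0.8 with a standard error of 0.6.
lower, upper = wald_interval(0.8, 0.6)
print(f"95% CI: ({lower:.3f}, {upper:.3f})")
```

The lower limit falls below zero, a value a variance cannot take, which is one visible symptom of applying a symmetric interval to a skewed sampling distribution.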
For researchers wishing to test hypotheses regarding variance components, again a
variety of choices are available. The simplest would be to conduct a z-test by dividing the
estimate by its standard error. Although this approach is asymptotically valid, it, like the
standard error based CIs noted previously, becomes questionable when the sampling distribution
cannot be assumed normal. A somewhat more appealing option is to use a likelihood ratio χ² test
(e.g., Littell, Milliken, Stroup, & Wolfinger, 1996). This test requires the user to estimate two
models, one with and one without the variance parameter(s) in question. The difference in the
log likelihoods obtained in these analyses is then used to construct a statistic that in large samples
follows a χ² distribution. Note this method can be used for single parameter tests or multiple
parameter tests. Additional alternatives include an approximate χ² test described by
Raudenbush and Bryk (2002), bootstrapping (Meijer, van der Leeden, & Busing, 1995; Carpenter,
Goldstein, & Rasbash, 1999), a likelihood ratio test based on the local asymptotic approximation
(Stern & Welsh, 2000), and exact tests that have been established for some contexts (Christensen,
1996; Ofversten, 1993).
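The arithmetic of the likelihood ratio test can be sketched as follows. The log likelihoods are hypothetical values of the kind reported by multilevel software, and the tail probability is implemented only for 1 or 2 degrees of freedom, where it has a closed form; this is an illustration of the large-sample calculation, not a replacement for the refinements cited above.

```python
import math

def lr_statistic(loglik_reduced, loglik_full):
    """Twice the difference in log likelihoods between nested models."""
    return 2.0 * (loglik_full - loglik_reduced)

def chi2_upper_tail(x, df):
    """Upper-tail chi-square probability (closed forms for df 1 and 2 only)."""
    if x <= 0:
        return 1.0
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("this sketch supports df = 1 or df = 2 only")

# Hypothetical fits: dropping a random slope removes one variance and one
# covariance, so the models differ by two parameters.
stat = lr_statistic(loglik_reduced=-1204.3, loglik_full=-1201.1)
p_value = chi2_upper_tail(stat, df=2)
print(f"LR statistic = {stat:.2f}, p = {p_value:.4f}")
```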
Finally, it should be noted that in addition to point estimates, confidence intervals, and
statistical tests, researchers should consider whether combining variance estimates and/or
making variance ratios could help to answer the research questions. For example, one may be
interested in the explained variance (R²) at one or more levels of the model (Snijders & Bosker,
1994), the intraclass correlation (e.g., Raudenbush & Bryk, 2002), or the reliability of estimators
(Raudenbush & Bryk, 2002). Those interested in creating confidence intervals for variance ratios
are referred to the statistical literature (Lee & Seeley, 1996). As far as we know the methods
described there have not been implemented in the hierarchical linear modeling software
programs.
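As one example of such a ratio, the intraclass correlation for a random-intercept model is the level-2 variance divided by the total variance. The sketch below assumes scalar variance components named `tau2` (level 2) and `sigma2` (level 1); both the names and the values are illustrative assumptions.

```python
def intraclass_correlation(tau2, sigma2):
    """Share of the total variance that lies between level-2 units."""
    return tau2 / (tau2 + sigma2)

# Hypothetical estimates: between-class variance 4, within-class variance 36.
icc = intraclass_correlation(4.0, 36.0)
print(icc)
```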
Inferences about fixed effects. A researcher interested in making inferences about fixed
effects may wish to construct confidence intervals for the effects of interest. A 95% CI could be
constructed around the point estimate by adding and subtracting 1.96 times the standard error.
This of course assumes a normal sampling distribution, which can be demonstrated
asymptotically, but which becomes questionable for smaller samples. Consequently, one would
typically substitute a t-value with ν degrees of freedom for the 1.96. Several methods for defining
the degrees of freedom have been given (Giesbrecht & Burns, 1985; Kenward & Roger, 1997),
and some software packages (e.g., Proc MIXED) allow for different definitions to be specified. An
alternative to assuming an approximate t-distribution is to turn to bootstrapping to construct the
confidence intervals.
Hypothesis tests can also be conducted by using t- or F-tests with the approximate
degrees of freedom. Again, different definitions have been suggested, and thus researchers need
to be clear about the method used for obtaining the degrees of freedom for these tests. Several
alternatives to these approximate tests have been discussed. These include a test based on a
Bartlett corrected likelihood ratio statistic (Zucker, Lieberman, & Manor, 2000), a permutation test
(Reboussin & DeMets, 1996), and bootstrapping.
Inferences about random level-1 coefficients. Researchers may also be interested in estimating
the random level-1 coefficients and making inferences about these coefficients. For example, a
researcher who is interested in estimating the effects of prior reading achievement on end of
school year reading achievement, may wish to get a separate effect estimate for each classroom.
One approach would be to estimate the level-one model separately for each classroom using
standard ordinary least squares (OLS) estimation methods, in which case standard methods are
available for constructing confidence intervals and testing hypotheses about coefficients. The
drawback of the OLS approach is that each estimate is based on relatively few observations, only
those from the classroom of interest, thus leaving a lot of room for sampling error.
An alternative is to obtain empirical Bayes estimates, which consider all the available
information. Empirical Bayes estimates tend to pull the effect estimates toward the overall
average estimate by an amount that depends on the uncertainty in the effect estimate being
considered and the variability in the effect estimates. This process biases the estimates, but leaves
us with values that tend to be closer to the parameter values (i.e., a smaller expected mean square
error) than those based on OLS estimation (Raudenbush & Bryk, 2002). For empirical Bayes
estimates the standard errors can be computed and used for the creation of confidence intervals
or z-tests of statistical significance. These methods assume a normal sampling distribution, and
thus may be unrealistic unless there is a large number of level-2 units (Raudenbush & Bryk,
2002).
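The pulling of estimates toward the overall average can be illustrated for the simplest case, a classroom mean in a random-intercept model. The sketch below treats the variance components as known, whereas in practice they are estimates; the names `tau2`, `sigma2`, and `n_j`, and all numerical values, are assumptions made for illustration.

```python
def shrinkage_weight(tau2, sigma2, n_j):
    """Weight given to a classroom's own mean; grows with class size."""
    return tau2 / (tau2 + sigma2 / n_j)

def empirical_bayes_mean(ols_mean, grand_mean, tau2, sigma2, n_j):
    """Weighted compromise between the classroom mean and the grand mean."""
    weight = shrinkage_weight(tau2, sigma2, n_j)
    return weight * ols_mean + (1.0 - weight) * grand_mean

# Hypothetical values: classroom mean 60, grand mean 50,
# between-class variance 4, within-class variance 36.
for n_j in (4, 36):
    eb = empirical_bayes_mean(60.0, 50.0, tau2=4.0, sigma2=36.0, n_j=n_j)
    print(f"class size {n_j:2d}: empirical Bayes estimate = {eb:.2f}")
```

The small class, whose own mean is uncertain, is pulled most of the way toward the grand mean, while the large class keeps most of its own estimate.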
METHOD
Coding Protocol
To analyze the articles representing multilevel applications, we developed a coding
framework based in large part on the issues identified during the first phase of our review of the
literature. Within each area (i.e., model development and specification, data considerations,
estimation, and hypothesis testing and inference) specific questions were devised to guide our
review. The current issues and critical questions were organized into a checklist that was refined
using a series of pilot tests. In these pilot tests, members of the research team independently
analyzed the same application article using the checklist; members then came together as a group
to check the consistency of the responses, discuss coding decisions and possible alterations of the
checklist. A codebook, which facilitated coding efforts, was developed during these meetings to
capture in more detail the coding process. The final version of the checklist, which was used to
code each of the articles, is provided as an appendix.
Searching Strategies for Applications
To describe the current application and reporting of multilevel analyses in the field of
education, prominent educational and behavioral research journals were initially selected for
examination. We examined the same set of journals provided in the methodological research
review published in Review of Educational Research by Keselman et al. (1998). It was deemed
appropriate to begin with this set of journals, as these journals publish empirical research,
represent different subdisciplines in education and are highly regarded in the fields of education
and psychology. Additionally, we relied on the expertise of our research team to identify other
well known publications that might provide similar applications of multilevel modeling. For this
phase of our review, all of the issues of each volume of the chosen journals, published between
1998 – 2002, were hand searched for evidence of the employment of hierarchical linear modeling
techniques. That is, our research team did not rely solely on article titles and abstracts to make
our determination to include or exclude a particular study.
Description of the Sample
Of the identified articles, 20 have been reviewed at this time. The largest proportion
(40%) came from the most recent year considered for this study, 2002 (see Figure 1) and only one
study (5%) was from the earliest time point considered, 1998. The remainder of the sample (55%)
was distributed almost equally over the middle three years, 1999-2001. The sample was drawn
from 10 peer-reviewed journals that are fairly prominent in the social sciences (see Table 1).
Studies from four of the journals (The American Educational Research Journal, the Journal of
Educational Research, the Journal of Personality and Social Psychology, and the Journal of Applied
Psychology) accounted for the majority of the sample, 60%, with each supplying three studies
(15%) that were used in this analysis. Four of the remaining six journals were each a source for
one article in the analysis while two articles were retrieved from each of the remaining two journals.
Figure 1. Distribution of Sample Studies Based on Year Published
Table 1
Journals and Years of Sample Studies

                                                          Year
Journal                                         2002  2001  2000  1999  1998  Total
American Educational Research Journal              0     1     2     0     0      3
Child Development                                  0     0     0     1     0      1
Journal of Educational Psychology                  1     1     0     0     0      2
Sociology of Education                             0     0     1     1     0      2
Journal of Applied Psychology                      2     1     0     0     0      3
Journal of Educational Research                    3     0     0     0     0      3
Journal of Personality and Social Psychology       0     0     1     1     1      3
Reading Research Quarterly                         1     0     0     0     0      1
Cognition and Instruction                          0     0     0     1     0      1
Developmental Psychology                           1     0     0     0     0      1
Total                                              8     3     4     4     1     20
RESULTS
Study Characteristics
Before turning to the four central issues that were identified as important in the analysis
and presentation of multilevel models, we took care to examine a host of characteristics germane
to the set of articles examined. For this investigation, we thought that it would be prudent to
articulate the types of studies being examined (e.g., individuals nested in contexts versus
repeated measures), the rationale provided by authors for employing HLM methods, the study
design, sampling, the average number of units at varying levels of the model and the description
provided regarding the distribution of level 1 units across level 2 units.
The studies were typically nonexperimental (85%) and often did not use probability
sampling (65%). They covered a wide range of applications. Half the studies used two-level
models where individuals were nested in contexts. Two studies (10%) involved three-level
models where individuals were nested in contexts that were nested in contexts, while the
remaining eight studies (40%) involved repeated measures data. Almost all of the studies (90%)
explicitly stated a rationale for using hierarchical modeling, but the level of detail in the
rationales varied greatly. The studies also differed widely in the amount of data used in the
analysis, where the number of level-two units ranged from 19 to 1406, and the average number of
level-one units per level-two unit ranged from a low of just over 2 to a high of 160.
Model Development & Specification
Model development is a central component of inquiry in many disciplines, with notably
different approaches evidenced among researchers both within and across disciplines. In our
examination of this critical component, we considered a host of aspects related to divergent
approaches to the development and specification of multilevel statistical models. A considerable
amount of variability was evidenced in the number of models examined by researchers and the
clarity of how well the number of models was communicated. For example, in only 45% of the
articles reviewed were we able to determine with confidence the number of models analyzed.
For this subset of articles (n=9), the number of models examined ranged from 4 to 430 with the
median number equal to 9 (M = 51, SD = 126). For the set of published articles that we
scrutinized, baseline models (i.e., unconditional models) were frequently investigated as part of
data analysis (n=9, 45%). For 11 of the articles, we could not determine with confidence if
baseline models were examined (see Table 2). It was also common to encounter studies that
examined more than one set of predictor variables for each of the dependent variables under
investigation (n=15, 75%). For these studies, researchers employed between two and six sets of
predictors. In all of the studies that we examined, the predictors were selected based at least
partially on a priori considerations. In most of these cases, strong support was provided by the
literature base and empirical research. In six cases there was evidence that predictor variables
were selected, in part, on significance tests for the individual predictors. With respect to the
subset of researchers who explored multiple sets of predictors, the exact number of sets could not
accurately be determined for approximately 35% of the studies. Further, four of the studies
reported level two statistical interactions, while nine reported across level interactions. During
our examination of how researchers typically specify the covariance structure underlying the
data, we observed that for approximately two-thirds of the studies, there was no clear discussion
of this issue. For these instances, it appeared that software defaults were used in the analyses.
Although centering has important implications for interpreting the results from the
statistical modeling, 40% (n=8) of the studies provided no discussion of centering at level-1 and
60% (n=12) of the studies did not provide any discussion of centering at level-2. When centering
was used at level-1, researchers either used grand mean (30%), group mean (15%) or other
approaches (25%). Other types of centering for level-1 variables included centering at the last time
point, coding from a given point in time, centering at the time of loss, or some form of
standardization. Grand mean centering was reported for the eight studies that reported the use
of centering at level-2.
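The two most commonly reported level-1 choices differ only in which mean is subtracted, as the following sketch shows. The data and function names are hypothetical.

```python
from statistics import fmean

def grand_mean_center(values_by_group):
    """Subtract the overall mean from every level-1 value."""
    grand = fmean(v for vals in values_by_group.values() for v in vals)
    return {g: [v - grand for v in vals] for g, vals in values_by_group.items()}

def group_mean_center(values_by_group):
    """Subtract each group's own mean from that group's level-1 values."""
    return {
        g: [v - fmean(vals) for v in vals]
        for g, vals in values_by_group.items()
    }

# Hypothetical predictor values for two classrooms.
predictor = {"class_a": [1, 2, 3], "class_b": [7, 8, 9]}
print(grand_mean_center(predictor))
print(group_mean_center(predictor))
```

Under grand mean centering the two classrooms remain clearly separated on the predictor, whereas under group mean centering they become indistinguishable, which is why the choice affects the interpretation of the level-1 and level-2 coefficients.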
Table 2
Model Development and Specification

Characteristic                                                          N   Percent (%)
Examination of baseline models
    Yes                                                                10       50
    No                                                                  3       15
    Unable to determine                                                 7       35
Selection of predictors: based at least partially on
    A priori considerations                                            20      100
    Significance test for individual predictors                         6       30
    Effect size for individual predictors                               1        5
    Fit statistics (e.g., AIC or SBC)                                   0        0
More than one set of predictor variables for each DV
    Yes, but exact number could not be determined with confidence       7       35
    Yes, number of sets of predictors could be determined               8       40
    Could not be determined with confidence                             1        5
    No                                                                  4       20
Interactions examined
    Level 1                                                             1        5
    Level 2                                                             4       20
    Across level                                                        9       45
    No interactions                                                     6       30
Selection of covariance structure
    Not discussed, or unclear, and/or appears that defaults were used  13       65
    Established a priori, prior to looking at the data                  4       20
    Based partially on LRTs or significance tests for individual
        variance components                                             7       35
    Based partially on fit statistics (e.g., AIC or SBC)                0        0
Centering
  Level 1
    No discussion of centering                                          8       40
    Grand mean                                                          6       30
    Group mean                                                          3       15
    Other                                                               5       25
  Level 2
    No discussion of centering                                         12       60
    Grand mean                                                          8       40

Note: Counts may exceed 100% if multiple methods were applied (i.e., selection of predictors, centering, selection of variance structure).
When we critiqued the extent to which models were well communicated, we observed that
35% of the studies did not explicitly communicate the nature and number of the models, yet we
were able to glean this information through close scrutiny of the text, tables, and footnotes (Table
3). Only 10% of the studies (n=2) provided explicit statements of the number of models
examined, while for the other 55% we could not determine this information with any
degree of confidence.
Given the complexity and number of models run, researchers tended to use multiple
approaches to reporting the results. Fixed effects were most often communicated through verbal
descriptions (n=20, 100%), followed closely by lists of estimated effects (n=19, 95%) and a series
of regression equations (n=14, 70%). The estimated variance structure was most often
communicated through verbal description (n=15, 75%) and a list of parameter estimates (n=11, 55%).
Eight of the studies (40%) presented the variance parameters through equations. None of the
articles included matrix representations of these relationships.
As researchers we are keenly interested in the extent to which our results are
generalizable. To examine the critical issue of generalizability, we considered a broad range of
evidence with respect to this aspect of inquiry. For example, we looked for both sensitivity
analysis and traditional cross-validation methods as evidence of generalizability. Further, we
also included both the replication or extension of previous research as well as internal
replications (e.g., between group differences). None of the studies addressed the possibility of
capitalizing on chance in model development by employing cross-validation analyses. However,
six studies provided evidence of internal replication and three studies provided evidence of
sensitivity analysis, while a single study reported replication/extension of previous research.
Table 3
Model Communication

Characteristic                                                           N   Percent (%)
Communication of models presented
  Not explicitly stated, but could be determined from the
    information provided in the text, footnotes, etc.                    7      35
  Explicit statement of the number of models examined                    2      10
  Could not be determined with confidence                               11      55
Communication of fixed effects
  Equation representation                                               14      70
  List of estimated effects                                             19      95
  Verbal description                                                    20     100
Communication of variance structure
  Equation representation                                                8      40
  List of estimated effects                                             11      55
  Verbal description                                                    15      75
Generalizability
  Sensitivity analysis                                                   3      15
  Internal replication                                                   6      30
  Replication                                                            1       5
Data Considerations
Because inferences in multilevel models are based on an analysis of the covariances
between and within the nested units, the consideration of distributional assumptions, outliers,
statistical power, and missing data is critical to obtaining credible results. The results of the
analysis of the treatment of such data considerations in the 20 articles reviewed are presented in
Table 4.
Despite the recent advances in statistical power analysis in multilevel models, none of the
studies examined included an explicit discussion of statistical power in the study design or
interpretation of results. Similarly, only three of the articles (15%) provided evidence of outlier
screening and only one article described a consideration of the potential impact of measurement
error on the resulting models. Conversely, 90% of the studies (n = 18) provided some discussion
of missing data in the analysis and six of the 17 studies that acknowledged missing data (35%)
included a consideration of the randomness of such missing data. However, details on the
treatment of missing data in the analysis were less prevalent. Ten of the studies used listwise
deletion for missing data at level 1 and two studies used a simple imputation procedure. For
missing data at level 2, eight of the studies used listwise deletion, two studies used imputation,
and two studies used other procedures (i.e., selecting a proxy variable with less missing data and
incorporating a missingness indicator vector in the analysis). Even when the nature of the
missingness was discussed, the articles generally provided little insight as to the overall impact of
the missing data treatment on the resulting estimates.
Multilevel models require assumptions about the errors at each level of the analysis, and a
consideration of the tenability of these assumptions is important in assessing the credibility of the
results. Of the 20 studies we examined, only four (20%) discussed normality of the level-1
residuals and three (15%) discussed variance homogeneity of the residuals. Only two studies
provided details about the results of checking these residuals with corrective action taken.
Finally, four studies (20%) mentioned the assumption of residual normality at level 2, but only
one of these provided details on the extent to which this assumption was met.
Table 4
Data Considerations

Characteristic                                                           N   Percent (%)
Statistical power                                                        0       0
Discussion of missing data                                              18      90
Randomness of missing data                                               6      35 [1]
Treatment of missing data at level 1
  Listwise deletion                                                     10      67 [2]
  Imputation                                                             2      13 [2]
Treatment of missing data at level 2
  Listwise deletion                                                      8     100 [3]
  Imputation                                                             2      25 [3]
  Other                                                                  2      25 [3]
Discussion of outliers                                                   3      15
Screening for outliers                                                   1       5
Treatment of imperfect measurement                                       1       5
Assumptions
  Level 1 residual normality                                             4      20
  Level 1 residual homogeneity of variance                               3      15
  Level 2 residual normality                                             4      20

[1] Percent based on 17 papers that acknowledged missing data.
[2] Percent based on 15 papers that acknowledged level 1 missing data.
[3] Percent based on 8 papers that acknowledged level 2 missing data.
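Because the footnoted percentages in Table 4 use different denominators than the other rows, a short arithmetic check may help readers reproduce them. The helper name `percent` is hypothetical; the counts and bases are those given in the table and its footnotes.

```python
def percent(count: int, base: int) -> int:
    """Percentage of `count` out of `base`, rounded to the nearest whole number."""
    return round(100 * count / base)


overall_discussion = percent(18, 20)   # base: all 20 reviewed studies
randomness_discussed = percent(6, 17)  # base: studies acknowledging missing data
listwise_level1 = percent(10, 15)      # base: studies acknowledging level 1 missing data
imputation_level1 = percent(2, 15)
listwise_level2 = percent(8, 8)        # base: studies acknowledging level 2 missing data
imputation_level2 = percent(2, 8)
```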
Estimation and Testing
Each article was examined for details about the analysis performed. In particular, articles
were examined for information regarding the software utilized as well as general estimation
techniques, including the method used, the algorithm used, whether or not convergence
problems were encountered, and whether matrices were positive definite. Additionally, studies were
examined for details regarding the method employed for drawing inferences for variance
parameters and fixed effects. In general, limited information regarding methods of estimation
and testing was provided, including the type of software used in the analysis.
Only 40% of the studies explicitly stated the type of software used in the analysis (Table 5).
Six of the eight indicated the use of a version of Bryk and Raudenbush’s HLM software and two
indicated the use of the SAS software. Other available software for these types of analyses (e.g.,
MLwiN, Mplus) was not explicitly noted.
Table 5
Software Identified for Use in HLM Studies

                                                                         N   Percent (%)   Details
No information about software used                                      12      60
Information about program used, no information regarding date/version    2      10         SAS Proc Mixed
Information about program used as well as date or version used           6      30         HLM
General information about model estimation methods, algorithms, convergence issues,
and whether matrices were positive definite was virtually non-existent. As Table 6 illustrates,
few of the studies examined thus far provided any information on these issues. Since there are
multiple ways to estimate hierarchical models, and evidence that these different methods can
lead to different results and potentially to estimation problems, it is important that authors
provide detail about the estimation process if other researchers are to be able to critically evaluate
or replicate the analyses.
Table 6
Model Estimation Considerations

                                           N       %
Estimation method stated                   3     15%
Estimation algorithm stated                0      0%
Convergence addressed                      0      0%
Positive definite matrices addressed       0      0%
Estimates of variance and covariance of model parameters varied across the studies.
Variance estimates tended to be provided more often than covariance estimates between
intercept and slope errors (Table 7). In 75% of the studies, one could not determine whether the
covariance had been estimated. The large percentage of articles containing incomplete
information was somewhat surprising. For other types of statistical models (e.g., multiple
regression or structural equation modeling) it is expected that a complete listing of the estimated
parameters will be given. It seems reasonable to expect the same in hierarchical linear modeling,
at least for the models that are presented and interpreted.
Table 7
Frequency of Reporting Variance and Covariance Estimates

                                                                    Estimated       Insufficient    Not Applicable
                                                                    but Not         Information     Since Not
                                                    Provided        Provided        Given           Estimated
Error variance of intercepts                        10 (50%)        8 (40%)         2 (10%)         0 (0%)
Error variance of regression coefficient or slope   7 (35%)         6 (30%)         3 (15%)         4 (20%)
Covariance between the intercept and slope errors   1 (5%)          0 (0%)          15 (75%)        4 (20%)
First-level error variance                          8 (40%)         12 (60%)
The degree to which variance and fixed effect information was reported also varied across
studies, as did the manner in which it was reported. Table 8 summarizes the types of variance
and fixed effect information reported and the inferential methods used. Significance tests
(n = 12) were the most prevalent form of additional information about the variance estimates.
Only six of the studies reported the method used for these significance tests (a chi-square
test), and none of the articles specified the type of chi-square test used. Again we observe
the tendency to not provide
enough detail for thorough critique or replication. None of the studies used confidence intervals
to gauge precision of variance estimates although four (20%) studies provided such information
for fixed effect estimates. For fixed effects, significance tests and point estimates were widely
reported (95% and 100% of the time, respectively). However, although information on fixed
effects tended to be reported more often than variance estimates, the details of how inferences
were made were often not included. Only eight of the studies indicated the type of test used
(e.g., t-test) and of these none reported the method for determining the degrees of freedom.
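As an illustration of the kind of interval estimate that was rarely reported, the sketch below computes a confidence interval for a fixed effect from its point estimate and standard error. It assumes a large-sample normal reference distribution, which is only one possible choice. With few level-2 units a t distribution would be more appropriate, and the method used to determine its degrees of freedom would need to be stated. The function name and the numeric inputs are hypothetical.

```python
from statistics import NormalDist


def wald_interval(estimate: float, std_error: float, level: float = 0.95):
    """Normal-approximation confidence interval: estimate +/- z * SE."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z * std_error
    return estimate - half_width, estimate + half_width


# Hypothetical fixed-effect estimate and its standard error.
lower, upper = wald_interval(estimate=2.50, std_error=0.80)
```

Reporting the reference distribution alongside the interval lets readers judge whether the stated precision is plausible for the number of level-2 units in the study.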
Table 8
Additional Information on Variance and Fixed Effects Reported

                                                                         N   Percent (%)
Additional variance information provided [a]
  None                                                                   7      35
  Standard errors                                                        2      10
  Confidence intervals                                                   0       0
  Significance tests                                                    12      60
  Reliabilities                                                          2      10
  Intraclass correlations                                                6      30
  Explained variance                                                     5      25
Method used for CIs or significance tests for variance parameters
  Not applicable                                                         8      40
  Not stated                                                             6      30
  SE/z-estimate                                                          0       0
  Chi-square                                                             6      30
  Other                                                                  0       0
Fixed effect information provided [a]
  None                                                                   1       5
  Standard errors                                                       12      60
  Confidence intervals                                                   4      20
  Significance tests                                                    19      95
  Point estimates                                                       20     100
  Other                                                                  6      30
Method used for CIs or significance tests for fixed effects
  Not stated                                                            12      60
  Likelihood ratio                                                       0       0
  t or F test                                                            8      40
  Other                                                                  0       0
Level-1 parameter information provided
  None                                                                  20     100
  Estimates provided, method not stated                                  0       0
  OLS or EB estimates                                                    0       0
  Statistical tests for OLS or EB estimates                              0       0
  CIs for OLS or EB estimates                                            0       0
  Other                                                                  0       0

[a] Counts may exceed 100% if multiple methods were applied.
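Two of the quantities listed in Table 8, the intraclass correlation and explained variance, are simple functions of the estimated variance components. The sketch below uses their usual two-level definitions. The argument names (`tau00` for the between-unit intercept variance, `sigma2` for the level-1 variance) and the numeric values are assumptions for illustration, not estimates from any reviewed study.

```python
def intraclass_correlation(tau00: float, sigma2: float) -> float:
    """Proportion of total variance that lies between level-2 units."""
    return tau00 / (tau00 + sigma2)


def proportion_variance_explained(var_baseline: float, var_conditional: float) -> float:
    """Proportional reduction in a variance component after adding predictors."""
    return (var_baseline - var_conditional) / var_baseline


# Hypothetical variance components from a baseline (unconditional) model.
icc = intraclass_correlation(tau00=20.0, sigma2=80.0)

# Hypothetical intercept variance before and after adding level-2 predictors.
explained = proportion_variance_explained(var_baseline=20.0, var_conditional=15.0)
```

Both quantities depend on a baseline model having been estimated, which is one reason the reporting of baseline models matters.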
CONCLUSIONS AND RECOMMENDATIONS
The results presented from this study should be viewed as preliminary, and although we
will offer some conclusions and recommendations, it is important to note that these are being
offered tentatively at this point. We have only reviewed 20 of the articles published in the
selected journals between 1998 and 2002, which is about ¼ of the hierarchical modeling articles
published in these journals. After reviewing the remainder of the articles, we will be able to
make more precise statements about analysis and reporting practices. It should also be noted
that questions could be raised about the reliability of the coding. Each article was reviewed
independently by all members of the research team and then discussed at a team meeting, at
which time a master checklist was created. For many items 100% agreement was obtained (e.g.,
was there a statement of the statistical software used?). Other items, however, involved greater
levels of inference and sometimes led to disagreements (e.g., how many models were
estimated?). These disagreements were resolved through discussion,
and a codebook, which facilitated coding efforts, was developed to capture in more detail the
coding process. We anticipate estimating reliability for all coding decisions for a sample of the
remaining articles, and then using these estimates to guide the number of coders used to examine
the remaining articles. When the reliability has been estimated and more articles have been
coded, it will be possible to make less tentative conclusions and recommendations.
Even keeping in mind the preliminary nature of our results, there seem to be some
relatively clear problems in the reporting of HLM analyses. There is often not enough
information for a reader to technically critique the reported analyses, even when the writers have
done an admirable job in discussing the critical methodological issues of sampling, research
procedures, and measurement. With this in mind, we suggest the following reporting guidelines
for hierarchical modeling.
1. Provide a clear description of the process used to arrive at the model(s) presented. This
should include discussion of how the predictors were selected, how the covariance
structure was chosen, and a statement of how many models were examined. Readers can
more carefully consider the presented models if they understand the process from which
the models were generated.
2. Explicitly state whether centering was or was not used, and if it was, provide details on
which variables were centered and how they were centered. Without knowledge of
centering decisions, readers cannot easily interpret the regression coefficients.
3. Explicitly state whether there were specification checks, if distributional assumptions were
considered, and whether data were screened for outliers. If such checks were made, state
both the method used and what was found. Without this sort of information it is easier to
question the credibility of the results.
4. State whether the data were complete. If they were not complete, describe the missingness
and attempt to provide insight into its possible effects on the results.
5. Provide details on the analysis methods, including a statement of the software used, the
method of estimation, whether convergence was obtained, and whether all variance
estimates were admissible. It is important for authors to list the version of the software
used in case bugs in a specific version of the software are found at a later date that may
call into question the interpretation. The other details are helpful for interpreting the
parameter estimates.
6. For any interpreted model, provide a complete list of all parameter estimates. In addition
to providing critical information for interpreting the results, this helps to communicate
the precise model estimated.
7. Provide either standard errors or interval estimates of the parameters of interest. This
recommendation is consistent with the general reporting guidelines provided by the APA
Task Force on Statistical Inference (Wilkinson & Task Force on Statistical Inference, 1999).
Statistical significance tests provide limited inferential information, and can be difficult to
interpret when large numbers of tests have been conducted, which was typical in the
reviewed applications.
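The seven guidelines above could be applied as a simple checklist when preparing or reviewing a manuscript. The sketch below is one hypothetical way to encode them. The class name, field names, and example entries are illustrative assumptions and are not part of the coding instrument used in this review.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class HlmReport:
    """Hypothetical record of what a manuscript reports about its multilevel analysis."""
    model_building_process: Optional[str] = None   # guideline 1
    centering: Optional[str] = None                # guideline 2
    assumption_checks: Optional[str] = None        # guideline 3
    missing_data: Optional[str] = None             # guideline 4
    software_and_version: Optional[str] = None     # guideline 5
    estimation_method: Optional[str] = None        # guideline 5
    convergence: Optional[str] = None              # guideline 5
    all_parameter_estimates: Optional[str] = None  # guideline 6
    precision_of_estimates: Optional[str] = None   # guideline 7


def missing_items(report: HlmReport) -> List[str]:
    """Return the names of reporting elements that were left undocumented."""
    return [f.name for f in fields(report) if not getattr(report, f.name)]


# Example draft that documents only three of the elements.
draft = HlmReport(
    centering="level-1 predictors group mean centered",
    software_and_version="HLM 5",
    estimation_method="restricted maximum likelihood",
)
gaps = missing_items(draft)
```

In this example `gaps` lists the six elements the draft has not yet documented, in the order of the guidelines.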
We recognize that it would also appear helpful if we provided some concrete guidelines
regarding the conduct of hierarchical modeling. Unfortunately, the models are complex and the
methodological decisions regarding their implementation are best made only after careful
consideration of a particular application. For example, whether group mean centering makes for
a good recommendation depends on the application being considered. Whether restricted
maximum likelihood estimation is the best recommendation for estimation depends on the
application being considered. We hope that providing some reporting guidelines will heighten
awareness of some of the technical issues among researchers and reviewers involved in a
particular application. This in turn may lead to a careful examination and critical dialog about
the issues within the context of the application, which may facilitate improvements in applied
practice.
References
Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19, 716-723.
Anderson, A. B., Basilevsky, A., & Hum, D. P. (1983). Missing data. In P. H. Rossi, J. D. Wright, & A. B. Anderson (Eds.), Handbook of survey research (pp. 415-494). New York: Academic Press.
Basilevsky, A., Sabourin, D., Hum, D., & Anderson, A. (1985). Missing data estimators in the general linear model: An evaluation of simulated data as an experimental design. Communications in Statistics, 14, 371-394.
Beale, E. M. L., & Little, R. J. A. (1975). Missing values in multivariate analysis. Journal of the Royal Statistical Society, Series B, 37, 129-145.
Bremer, R. H. (1993). Choosing and modeling your mixed linear-model. Communications in Statistics - Theory and Methods, 22, 3491-3521.
Bryk, A. S., & Raudenbush, S. W. (1992). Hierarchical linear models: Applications and data analysis methods. Newbury Park: Sage Publications.
Burstein, Kim, & Delandshere (1989).
Carpenter, J., Goldstein, H., & Rasbash, J. (1999). A non-parametric bootstrap for multilevel models. Multilevel Modeling Newsletter, 11, 2-5.
Christensen, R. (1996). Exact tests for variance components. Biometrics, 52, 309-314.
Collins, L. M., Schafer, J. L., & Kam, C. M. (2001). A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychological Methods, 6, 330-351.
Dempster, A., Laird, N., & Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1-8.
Dickmeyer, N. (1989). Metaphor, model, and theory in education research. Teachers College Record, 91, 151-160.
Donner, A., Birkett, N., & Buck, C. (1981). Randomization by cluster: Sample size requirements and analysis. American Journal of Epidemiology, 114, 906-914.
Draper, D. (1995). Inference and hierarchical modeling in the social sciences. Journal of Educational and Behavioral Statistics, 20(2), 115-147.
Ferron, J., Dailey, R., & Yi, Q. (2002). Effects of misspecifying the first-level error structure in two-level models of change. Multivariate Behavioral Research, 37, 379-403.
Frane, J. W. (1976). Some simple procedures for handling missing data in multivariate analysis. Psychometrika, 41, 409-415.
Giesbrecht, F., & Burns, J. (1985). Two-stage analysis based on a mixed model: Large-sample asymptotic theory and small-sample simulation results. Biometrics, 41, 477-486.
Gleason, T. C., & Staelin, R. (1975). A proposal for handling missing observations. Psychometrika, 40, 229-252.
Goldstein, H. (1986). Multilevel mixed linear model analysis using iterative generalized least squares. Biometrika, 73, 43-56.
Goodwin, L. D., & Goodwin, W. L. (1985b). Statistical techniques in AERJ articles, 1979-1983: The preparation of graduate students to read the educational literature. Educational Researcher, 14, 5-11.
Ghosh, M., & Rao, J. (1994). Small area estimation: An appraisal. Statistical Science, 9, 55-93.
Guertin, W. H. (1968). Comparison of three methods of handling missing observations. Psychological Reports, 22, 896.
Heitjan, D. F., & Rubin, D. B. (1991). Ignorability and coarse data. Annals of Statistics, 12, 2244-2253.
Hopkins, K. D. (1982). The unit of analysis: Group means versus individual observations. American Educational Research Journal, 19, 5-18.
Hsieh, F. Y. (1988). Sample size formulae for intervention studies with the cluster as unit of randomization. Statistics in Medicine, 7, 1195-1201.
Jiang, J. M. (2001). Goodness-of-fit tests for mixed model diagnostics. The Annals of Statistics, 29, 1137-1164.
Kackar, R., & Harville, D. (1981). Unbiasedness of two-stage estimation and prediction procedures for mixed linear models. Communications in Statistics - Theory and Methods, A, 10, 1249-1261.
Kackar, R., & Harville, D. (1984). Approximations for standard errors of estimators of fixed and random effects in mixed linear models. Journal of the American Statistical Association, 79, 853-862.
Kasim, R., & Raudenbush, S. (1998). Application of Gibbs sampling to nested variance components models with heterogeneous within-group variance. Journal of Educational and Behavioral Statistics, 20, 93-116.
Kenward, M., & Roger, J. (1997). Small sample inference for fixed effects from restricted maximum likelihood. Biometrics, 53, 983-997.
Keselman, H. J., Algina, J., Kowalchuk, R. K., & Wolfinger, R. D. (1998). A comparison of two approaches for selecting covariance structures in the analysis of repeated measurements. Communications in Statistics: Simulation, 27, 591-604.
Keselman, H. J., Huberty, C. J., Lix, L. M., Olejnik, S., Cribbie, R. A., Donahue, B., Kowalchuk, R. K., Lowman, L. L., Petoskey, M. D., Keselman, J. C., & Levin, J. R. (1998). Statistical practices of educational researchers: An analysis of their ANOVA, MANOVA, and ANCOVA analyses. Review of Educational Research, 68, 350-386.
Kim, J., & Curry, J. (1977). The treatment of missing data in multivariate analysis. Sociological Methods and Research, 6, 215-240.
Kreft, I. G. G. (1995). Hierarchical linear models: Problems and prospects. Journal of Educational and Behavioral Statistics, 20(2), 109-133.
Kreft, I. G. G., & de Leeuw, J. (1998). Introducing multilevel models. London: Sage.
Kromrey, J. D., & Dickenson, W. B. (1996). Detecting unit of analysis problems in nested designs: Statistical power and Type I error rates of the F test for groups-within-treatments effects. Educational and Psychological Measurement, 56, 215-231.
Lange, L., & Laird, N. M. (1989). The effect of covariance structure on variance estimation in balanced growth-curve models with random parameters. Journal of the American Statistical Association, 84, 241-247.
Lee, Y., & Seely, J. (1996). Computing the Wald interval for a variance ratio. Biometrics, 52, 1486-1491.
Lindley, D., & Smith, A. (1972). Bayes estimates for the linear model. Journal of the Royal Statistical Society, Series B, 34, 1-41.
Lindstrom, M., & Bates, D. (1989). Newton-Raphson and EM algorithms for linear mixed-effects models for repeated-measures data. Journal of the American Statistical Association, 83, 1014-1022.
Little, T. D., Lindenberger, U., & Nesselroade, J. R. (1999). On selecting indicators for multivariate measurement and modeling with latent variables: When “good” indicators are bad and “bad” indicators are good. Psychological Methods, 4, 192-211.
Little, R. J., & Rubin, D. B. (1987). Statistical analysis with missing data. New York: John Wiley & Sons.
Littell, R., Milliken, G., Stroup, W., & Wolfinger, R. (1996). SAS system for mixed models. Cary, NC: SAS Institute Inc.
Longford, N. (1987). A fast scoring algorithm for maximum likelihood estimation in unbalanced mixed models with nested random effects. Biometrika, 74, 817-827.
Longford, N. (1993). Regression analysis of multilevel data with measurement error. British Journal of Mathematical and Statistical Psychology, 46, 301-311.
Longford, N. (2001). Simulation-based diagnostics in random-coefficient models. Journal of the Royal Statistical Society, Series A - Statistics in Society, 164, 259-273.
Meijer, E., van der Leeden, R., & Busing, F. (1995). Implementing the bootstrap multilevel model. Multilevel Modeling Newsletter, 7, 7-11.
Ofversten, J. (1993). Exact tests for variance components in unbalanced mixed linear models. Biometrics, 49, 45-57.
Raudenbush, S. W. (1997). Statistical analysis and optimal design for cluster randomized trials. Psychological Methods, 2, 173-185.
Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data analysis methods. Newbury Park: Sage Publications.
Raudenbush, S., Bryk, A., Cheong, Y., & Congdon, R. (2000). HLM 5: Hierarchical linear and nonlinear modeling. Chicago: Scientific Software International.
Raymond, M. R., & Roberts, D. M. (1987). A comparison of methods for treating incomplete data in selection research. Educational and Psychological Measurement, 47, 13-26.
Reboussin, D. M., & DeMets, D. L. (1996). Exact permutation inference for two sample repeated measures data. Communications in Statistics - Theory and Methods, 25, 2223-2238.
Richardson, A., & Welsh, A. (1995). Robust restricted maximum likelihood in mixed linear models. Biometrics, 51, 1429-1439.
Ridgeway, V. G., Dunston, P. J., & Quian, G. (1993). A methodological analysis of teaching and learning strategy research at the secondary school level. Reading Research Quarterly, 28, 335-349.
Roy, J., & Lin, X. (2002). Analysis of multivariate longitudinal outcomes with nonignorable dropouts and missing covariates: Changes in methadone treatment practices. Journal of the American Statistical Association, 97, 40-52.
Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. New York: John Wiley & Sons, Inc.
Santos, R. (1981). Effects of imputation on regression coefficients. Proceedings of the Section on Survey Research Methods, American Statistical Association, 141-145.
SAS Institute Inc. (2000). SAS/Proc Mixed (version 8) [computer program]. Cary, NC: SAS Institute Inc.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461-464.
Seltzer, M., Novak, J., Choi, K., & Lim, N. (2002). Sensitivity analysis for hierarchical models employing t level-1 assumptions. Journal of Educational and Behavioral Statistics, 27, 181-222.
Shimodaira, H. (1998). An application of multiple comparison techniques to model selection. Annals of the Institute of Statistical Mathematics, 50, 1-13.
Singer, J. D. (1998). Using SAS PROC MIXED to fit multilevel models, hierarchical models, and individual growth models. Journal of Educational and Behavioral Statistics, 24, 323-355.
Snijders, T., & Bosker, R. (1993). Standard errors and sample sizes for two-level research. Journal of Educational Statistics, 18, 237-259.
Snijders, T., & Bosker, R. (1994). Modeled variance in two-level models. Sociological Methods and Research, 22, 342-363.
Stern, S., & Welsh, A. (2000). Likelihood inference for small variance components. Canadian Journal of Statistics, 28, 517-532.
Teuscher, F., Herrendorfer, G., & Guiard, G. (1994). The estimation of skewness and kurtosis of random effects in the linear model. Biometrical Journal, 36, 661-672.
Wada, Y., & Kashiwagi, N. (1990). Selecting statistical models with information statistics. Journal of Dairy Science, 73, 3575-3582.
Wilkinson, L., & Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594-604.
Wolfinger, R. (1993). Covariance structure selection in general mixed models. Communications in Statistics - Simulation, 22, 1079-1106.
Woodhouse, G., Yang, M., Goldstein, H., & Rasbash, J. (1996). Adjusting for measurement error in multilevel analysis. Journal of the Royal Statistical Society, Series A, 159, 201-212.
Yu, Q., & Burdick, R. (1995). Confidence-intervals on variance components in regression-models with balanced (Q-1)-fold nested error structure. Communications in Statistics - Theory and Methods, 24, 1151-1167.
Zucker, D., Lieberman, O., & Manor, O. (2000). Improved small sample inference in the mixed linear model: Bartlett correction and adjusted likelihood. Journal of the Royal Statistical Society, Series B, 62, 827-838.
Title: ______________________________________________
Author(s): ______________________________________________
Journal, Year, Vol (Number), pgs: ______________________________________________

Study Characteristics (place holder)

Is there an appendix provided with technical details? ______ Yes ______ No

1. What best describes the study type?
   ____ a. individuals nested in contexts
   ____ b. growth curves
   ____ c. individuals nested in contexts within contexts
   ____ d. growth curves nested within contexts
   ____ e. other, describe: ______________________
   Page(s): ______   Comments/Notes: ______

2. Is a rationale (and/or advantage(s)) provided for using HLM methods in the study?
   ____ a. no
   ____ b. yes: ______________________
   Page(s): ______   Comments/Notes: ______

3. What is the study design?
   ____ a. nonexperimental
   ____ b. experimental
   Page(s): ______   Comments/Notes: ______

4. Thoroughly describe the data set, including scope (national, etc.) if known.
   Data set: ______________________
   Page(s): ______   Comments/Notes: ______

5. What type of sampling was used?
   ____ a. nonprobability
   ____ b. probability
   ____ c. mixed, describe: ______________________
   Page(s): ______   Comments/Notes: ______

6. How many level 1 units per level 2 unit? (e.g., average number of students per school in nested designs, number of observations per student in growth curve designs) ______
   Page(s): ______   Comments/Notes: ______

7. How many level 2 units? (e.g., number of schools in a nested design, number of students in a growth curve design, etc.) ______
   Page(s): ______   Comments/Notes: ______

8. How well was the distribution of level 1 units across level 2 units addressed?
   ____ a. not described
   ____ b. minimal description, e.g., only average number of students per school or only number of observations per student with no further information
   ____ c. described partially, e.g., one may know the maximum number of observations per student
   ____ d. described fully so that it is clear how many level 1 units there are for each level 2 unit
   Page(s): ______   Comments/Notes: ______

9. Fill out the following tables by listing the outcome(s) and predictors modeled.

   Outcome table columns:
     Outcome
     Type: 1 = achievement, 2 = other (specify)
     Nature: 1 = continuous, 2 = binary, 3 = count, 4 = ordinal, 5 = multinomial
     Normality: 0 = not discussed, 1 = normal, 2 = nonnormal
     Outliers: 0 = not discussed, 1 = no, 2 = yes
     Reliability: 0 = not discussed, 1 = estimated from data set, 2 = other
     Validity: 0 = not discussed, 1 = validity evidence gathered using this data, 2 = other

   Predictor table columns:
     Predictor
     Nature: 1 = continuous, 2 = binary, 3 = count, 4 = ordinal, 5 = multinomial
     Normality: 0 = not discussed, 1 = normal, 2 = nonnormal
     Outliers: 0 = not discussed, 1 = no, 2 = yes
     Reliability: 0 = not discussed, 1 = estimated from data set, 2 = other
     Validity: 0 = not discussed, 1 = validity evidence gathered using this data, 2 = other
MODEL SPECIFICATION
10. How many models are examined in the study?
____________________________________________
Page(s):
Comments/Notes
11. How well were the number of models presented in this communicated?
_____ a. not explicit but can be determined from information
provided in text, tables, footnotes, etc. _____ b. explicit statement(s) of number of models examined _____ c. cannot be determined with confidence
Page(s):
12. Were baseline models run?
_____ a. no _____ b. yes _____ c. cannot be determined with confidence
Page(s):
13. How were the predictors selected?
NOTE: If different methods were used for different models in the study, please list all methods used
_____ a. not discussed or unclear _____ b. based at least partially on apriori considerations _____ c. based at least partially on significance tests for individual
predictors _____ d. based at least partially on effect size for individual
predictors _____ e. based at least partially on fit statistics like AIC or SBC _____ f. other: ____________________________________
Page(s):
14. Was there more than one set of predictors for each dependent variable?
_____ a. no
_____ b. yes, but exact number of sets could not be determined
_____ c. yes, number of different sets of predictors: _________
_____ d. cannot be determined with confidence
Page(s):
15. Were interactions examined and communicated? Check ALL that apply
_____ a. no
_____ b. yes, level 1 interaction(s)
_____ c. yes, level 2 interaction(s)
_____ d. yes, across level interaction(s)
_____ e. cannot be determined with confidence
Page(s):
16. How was the covariance structure of the model(s) specified?
NOTE: If different methods were used for different models in the study, please list all methods used
_____ a. not discussed, unclear, and/or appears defaults were used
_____ b. established prior to looking at the data
_____ c. based partially on fit statistics like the AIC or SBC
_____ d. based partially on LRTs or significance tests for individual variance components
_____ e. other: ____________________________________
Page(s):
17. Was there centering of variables at level 1?
_____ a. no discussion of centering
_____ b. no centering
_____ c. grand mean centering
_____ d. group mean centering
_____ e. other (e.g., growth curve centered at last point): ___________________________________
Page(s):
18. Was there centering of variables at level 2?
_____ a. no discussion of centering
_____ b. no centering
_____ c. grand mean centering
_____ d. group mean centering
_____ e. other (e.g., growth curve centered at last point): ___________________________________
Page(s):
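As a reminder of the distinction coded in items 17 and 18, the following is a minimal sketch of grand mean versus group mean centering. The variable names (`ses`, `school`) and the toy values are hypothetical and are not taken from any reviewed study.

```python
from collections import defaultdict
from statistics import mean


def grand_mean_center(values):
    """Subtract the overall (grand) mean from every observation."""
    gm = mean(values)
    return [v - gm for v in values]


def group_mean_center(values, groups):
    """Subtract each observation's own group mean (e.g., its school mean)."""
    by_group = defaultdict(list)
    for v, g in zip(values, groups):
        by_group[g].append(v)
    group_means = {g: mean(vs) for g, vs in by_group.items()}
    return [v - group_means[g] for v, g in zip(values, groups)]


# Hypothetical level-1 predictor for six students nested in two schools.
ses = [1.0, 2.0, 3.0, 5.0, 7.0, 9.0]
school = ["A", "A", "A", "B", "B", "B"]

grand_centered = grand_mean_center(ses)          # deviations from 4.5
group_centered = group_mean_center(ses, school)  # deviations from 2.0 and 7.0
```

Under grand mean centering the between-school differences remain in the predictor, whereas group mean centering removes them, so the two choices generally lead to different interpretations of the intercept and slope.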
19. How were the fixed effects (regression coefficients) in the model communicated (check all that apply)?
_____ a. series of regression equations
_____ b. single mixed model equation
_____ c. list of estimated effects
_____ d. verbal description
Page(s):
20. How were variance structures in the model communicated (check all that apply)?
_____ a. not mentioned
_____ b. equation representation
_____ c. list of estimated variance parameters
_____ d. verbal description
_____ e. other: __________________________________
Page(s):
21. Is there evidence of generalizability?
_____ a. no
_____ b. sensitivity analysis
_____ c. cross-validation
_____ d. replication/extension of previous research (explicit)
_____ e. internal replication, e.g., between-group differences
_____ f. other: ______________________________________
Page(s):
DATA
22. Was power considered?
_____ a. no discussion of power
_____ b. general guidelines considered
_____ c. power analysis conducted
_____ d. other: __________________________________
Page(s):
Comments/Notes
23. Was there missing data?
_____ a. no missing data (skip to #27)
_____ b. no discussion of completeness of data (skip to #27)
_____ c. missing data noted at level 1, e.g., attrition, absence during testing, failure to complete instruments, etc.
_____ d. missing data noted at level 2, e.g., attrition, absence during testing, failure to complete instruments, etc.
_____ e. other: ___________________________________
Page(s):
24. If missing data were discussed, were relationships among missingness and other variables discussed?
_____ a. no
_____ b. yes __________________________________________
Page(s):
25. If there was missing data at level 1, how was it handled?
_____ a. not applicable
_____ b. not discussed
_____ c. listwise deletion
_____ d. imputation. Type: _______________________
_____ e. other: __________________________________
Page(s):
26. If there was missing data at level 2, how was it handled?
_____ a. not applicable
_____ b. not discussed
_____ c. listwise deletion
_____ d. imputation. Type: _______________________
_____ e. other: __________________________________
Page(s):
27. Were outliers present?
_____ a. not discussed
_____ b. no
_____ c. yes
Page(s):
28. What method was used to screen for outliers?
_____ a. not discussed
_____ b. can’t tell
_____ c. univariate methods
_____ d. simulation diagnostics
_____ e. residuals
_____ f. other: _______________________________________
Page(s):
29. How was imperfect measurement handled?
_____ a. not discussed
_____ b. consequences considered
_____ c. other: _______________
Page(s):
30. Using the chart, indicate if distributional assumptions of the model were considered.
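Codes for each assumption:
Considered: 0 = not discussed, 1 = considered
Evidence of Violation: 0 = not discussed, 1 = examined and no violation found, 2 = examined and violation found
Action Taken (if violated): 0 = ignored, 1 = consequences considered, 2 = corrective action taken
Assumption: Level-1 residuals ~ N. Considered: ____ Evidence of Violation: ____ Action Taken: ____
Assumption: Level-1 residuals have equal variance for each level-2 unit. Considered: ____ Evidence of Violation: ____ Action Taken: ____
Assumption: Level-2 residuals ~ N. Considered: ____ Evidence of Violation: ____ Action Taken: ____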
ESTIMATION AND TESTING
31. What software package/version was used? Please list name and version or year
_____ a. not given
_____ b. package stated
_____ c. package and version or year stated
Software Information: __________________
Page(s):
Comments/Notes
32. What method of estimation was used?
_____ a. not given
_____ b. given: _____________________________________
Page(s):
33. What estimation algorithm was used?
_____ a. not given
_____ b. given: _____________________________________
Page(s):
34. Were any convergence problems encountered?
_____ a. not mentioned
_____ b. no
_____ c. yes __________________________________________
Page(s):
35. Were any of the covariance matrices not positive definite?
_____ a. not mentioned
_____ b. no
_____ c. yes __________________________________________
Page(s):
36. For which variance/covariance parameters were estimates provided?
36-a. error variance of the intercepts (τ00)
_____ a. not applicable since not estimated
_____ b. provided
_____ c. estimated but not provided
_____ d. insufficient information provided
Page(s):
36-b. error variance of each regression coefficient or slope (e.g., τ11, τ22)
_____ a. not applicable since not estimated
_____ b. provided
_____ c. estimated but not provided
_____ d. insufficient information provided
Page(s):
36-c. covariance between the intercept and slope errors (e.g., τ12, τ23)
_____ a. not applicable since not estimated
_____ b. provided
_____ c. estimated but not provided
_____ d. insufficient information provided
Page(s):
36-d. first level error variance (typically σ², but could be both σ² and ρ)
_____ a. provided
_____ b. estimated but not provided
Page(s):
37. What additional variance parameter information is provided? Check all that apply
_____ a. none
_____ b. SEs
_____ c. confidence intervals
_____ d. significance tests
_____ e. reliabilities
_____ f. ICCs
_____ g. explained variance
Page(s):
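For reference when coding option f of item 37, the intraclass correlation for a two-level random-intercept model is conventionally defined from the variance components named in item 36. The expression below assumes an unconditional model with intercept variance τ00 and level-1 error variance σ²; individual studies may report variants of it.

```latex
\mathrm{ICC} = \frac{\tau_{00}}{\tau_{00} + \sigma^{2}}
```

For example, with τ00 = 2 and σ² = 8 the ICC is 2/(2 + 8) = 0.20, meaning 20% of the outcome variance lies between level-2 units.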
38. If CIs or significance tests were reported for variance parameters, what method was used?
_____ a. not applicable
_____ b. not stated
_____ c. SE/z-estimate
_____ d. χ², type (if given): ____________________
_____ e. exact, bootstrap, other: __________________
Page(s):
39. What fixed effect parameter information is provided? Check all that apply.
_____ a. none
_____ b. SEs
_____ c. confidence intervals
_____ d. significance tests
_____ e. point estimates __________________________
_____ f. other (e.g., effect size): _____________________
Page(s):
40. If CIs or significance tests were reported for fixed effects, what method was used?
_____ a. not stated
_____ b. likelihood ratio
_____ c. t or F test, degrees of freedom method NOT stated
_____ d. t or F test, degrees of freedom method IS stated
_____ e. permutation, bootstrap, other
Page(s):
41. What level-1 parameter information is provided? Please check all that apply.
_____ a. none
_____ b. estimates provided, but method not stated
_____ c. OLS or EB estimates
_____ d. statistical tests for OLS or EB estimates
_____ e. CIs for OLS or EB estimates
_____ f. other (e.g., just equations): ___________________
Page(s):
42. Is there anything else in this study that needs to be ‘captured’ that has not been addressed in other items?
____________________________________________ ____________________________________________ ____________________________________________ ____________________________________________ ____________________________________________