hierarchical linear modeling: a review of methodological ... · hierarchical linear measurement...

1

Hierarchical Linear Modeling:

A Review of Methodological Issues and Applications

John Ferron

Melinda R. Hess

Kris Y. Hogarty

Robert F. Dedrick

Jeffery D. Kromrey

Thomas R. Lang

John Niles

University of South Florida

Paper presented at the 2004 annual meeting of the American Educational Research Association,

San Diego, April 12-16.

2

Hierarchical Linear Modeling:

A Review of Methodological Issues and Applications

INTRODUCTION

Researchers in education and many other fields (e.g., psychology, sociology) are

frequently confronted with data that are hierarchical or multilevel in nature. For example, in the

context of school organizations, students are nested in classes, classes are nested in schools,

schools are nested in school districts, etc. In longitudinal research, repeated observations are

nested within individuals (i.e., units) and these individuals may be nested within groups. The

pervasiveness of multilevel data has led to a proliferation of statistical methods, referred to under

a number of names including hierarchical linear modeling (HLM, Bryk & Raudenbush, 1992),

multi-level modeling, mixed linear modeling, or growth curve modeling, and a parallel increase

in the number of applications of these methods to educational problems.

The complexity of these multilevel methods provides potential for misuse and confusion,

which may act as barriers for applied researchers attempting to use these methods. Careful

consideration of the methodological nuances of multilevel analyses is critical, as misuses may

result in statistical artifacts that may potentially influence statistical inference and cloud

interpretation. A close examination of some of the more common methods employed across

various disciplines, as well as an exploration of recent research trends can serve to both inform

the practice of research as well as to broaden our understanding of the various methodologies

and critical issues facing practitioners. Given this context, it is appropriate to pause and critically

analyze the methodological issues inherent in multilevel modeling techniques.

Similar methodological reviews have been conducted with more commonly used

techniques such as analysis of variance, multivariate analysis of variance, analysis of covariance,

path analysis, and structural equation modeling. Keselman et al. (1998) note that one consistent

finding of methodological research reviews is that a substantial gap often exists between the

methods that are recommended in the statistical research literature and those techniques that are

actually practiced by applied researchers (Goodwin & Goodwin, 1985b; Ridgeway, Dunston &

Qian, 1993). Methodological reviews can serve to identify issues, controversies, and current

trends as well as provide direction to applied researchers. In addition, these reviews may help

bridge the gap between statistical theory and application.

3

Purpose & Intended Audience

The purpose of this review is to provide an overview of the methodological landscape

and the critical issues surrounding multilevel modeling and to report on the current application

and reporting of multilevel analyses in education and related fields. Because a single review

cannot include every methodological consideration or technical nuance, it is important to clarify

that our intent is to focus primarily on what might be termed ‘traditional’ hierarchical linear

modeling. That is, we consider linear models of continuous outcomes where the random effects

are assumed normally distributed. This allows consideration of applications where individuals

are nested in contexts (e.g., students nested in schools), and applications where observations are

nested within individuals (e.g., growth curve models). Models in which the outcome is

represented by binary, count, or ordinal data are not considered (see Raudenbush & Bryk, 2002

for discussions of these types of applications). Nor did we venture beyond these modeling

techniques to explore related methods such as multilevel structural equation modeling (SEM),

hierarchical linear measurement modeling or applications of item response theory (IRT).

This review is intended to be useful to several distinct audiences in the research arena.

For example, practitioners whose inquiries in applied areas include multilevel models should

benefit from this treatment of the published literature in the field. A careful consideration of one’s

research methods, designs and habits of reporting in contrast to those evidenced in the field in

general will tend to suggest areas for refinement of techniques and serve as a source for

professional reflection. Similarly, manuscript reviewers and journal editors who serve the critical

roles of both gatekeepers and navigators for authors in the reporting of research results should

gain from this examination of multilevel models. A critical appraisal of the strengths and

limitations in the reporting of results from such models is intended to sharpen the eyes of these

scholars and lead to improvements in the corpus of research to be published. In addition,

research methodologists and others who serve as technical consultants in large-scale research

projects may glean additional distinctions and insights from the technical issues raised in this

paper. A tremendous amount of methodological work is needed to advance our understanding of

the limits and possibilities of hierarchical models, and this review of the published literature

highlights several areas needing focused attention. Further, university professors and teachers of

research methods and applied statistics may find that the issues raised here and the resources

cited will hone their craft and enhance the content of their curricula. Finally, students of

educational research (the scholars of tomorrow) will find a readable introduction to hierarchical

models, together with citations of sources that will provide logical entries into the growing body

4

of technical work in this area. Our treatment of hierarchical models as they are ‘practiced’ by

those publishing in education and related fields is intended to both guide and inspire researchers

at multiple levels of experience and interest.

Organization of the Review

To provide an overview of the methodological nuances and critical issues surrounding

multilevel modeling, literature in an array of disciplines (education, medicine, public health,

psychology, business, chemistry, physics, biology, statistics and math) was reviewed. Electronic

databases (e.g., ERIC, PsychInfo, etc.) were searched using a variety of keywords (e.g.,

hierarchical linear modeling, mixed linear models, nested designs, and multilevel modeling) and

additional articles were gleaned from the reference lists of select articles. Lastly, key issues were

identified from the reference list of pivotal articles, by exploring technical software manuals, and

by monitoring online conversations on certain Listservs.

Based on these reviews, four broad issues were identified. These issues are explored

during the initial phase of this paper and served as the framework for the coding protocol used to

analyze the hierarchical linear modeling applications in the literature. These issues include: (a)

model development and specification, which include issues of centering, selection of predictors,

specification of covariance structure, fit indices, generalizability and checks on specification; (b)

data considerations including distributional assumptions, outliers, measurement error for

predictors and outcomes, power, and missing data; (c) estimation procedures including

maximum likelihood, restricted maximum likelihood, Bayesian estimation, and alternative

procedures such as bootstrapping; and (d) hypothesis testing and statistical inference including

inferences about variance parameters and fixed effects.

Following the explication of these issues, we describe the development of the framework

and coding protocol used as a lens for our analysis of the literature and the search strategies

employed in this review. The results of our review of applications are then presented, along with

recommendations for improved reporting practice.

Critical Issues in Hierarchical Modeling

Model Development & Specification

Model development is a central component of inquiry in many disciplines, with notably

different approaches evidenced among researchers both within and across disciplines.

According to Dickmeyer (1989), a patchwork of styles and worldviews with respect to model

development exists in the educational research community. Models are typically built to allow

researchers to test theories or hypotheses, to manipulate and test changes in simplified systems,

and to allow for the exploration of relationships between variables that in some way characterize

5

a complex system. Support for various models may emerge from the extant literature and

research in a particular field. Other models may be the product of developing theories

employing exploratory types of analyses utilizing more data driven approaches.

To aid discussion of the particular specification issues confronted by those using

hierarchical models we first review the basic notation and terminology in the context of an

application. Consider a team of educational researchers who wish to study the relative

effectiveness of two reading programs. These researchers randomly assign participating

classrooms to one of the two programs, and gather reading comprehension data both prior to and

following the use of the program. A level-1 model could be developed to model the reading

comprehension of students within a class as a function of their reading comprehension prior to

the study. More specifically,

0 1ij j j ij ijy prior reading rβ β= + + (1)

where yij is the reading comprehension of the ith child in the jth classroom, β0j is the intercept of

the regression equation predicting reading comprehension at the end of the study in the jth

classroom, β1j is the regression coefficient indexing the relationship between reading

comprehension at the end of the study and reading comprehension before the study in the jth

classroom, and rij is the error, which is assumed to be normally distributed with a covariance of Σ.

A level-2 model could then be used to examine the relative effectiveness of the two

programs, and whether the relative effectiveness of the two programs depended on the prior

levels of reading comprehension. The level-2 model would use program to predict the intercepts

and slopes of the level-1 model.

0 00 01 0j j jprogram uβ γ γ= + + (2)

1 10 11 1j j jprogram uβ γ γ= + + , (3)

where programj is a dummy coded variable indicating whether the jth classroom received

Program A (coded 0) or Program B (coded 1), and u0j and u1j are the errors, which are assumed to

be normally distributed with a covariance of Γ.

Although models are frequently described by using equations for each level, it is possible

to combine all the equations into one. By substituting the second level model for β0j and β1j in the

first level model, a combined model is obtained,

00 01 10 11 0 1*ij j ij j ij j j ij ijy program prior read program prior read u u prior read rγ γ γ γ= + + + + + + (4)

In the combined model it becomes clear why 11γ is referred to as a cross-level interaction. It

should also be noted that the combined model has the same form as the mixed linear model,

6

uγ ε= + +y X Z , (5)

where y is a vector of outcome data, γ is a vector of fixed effects, X and Z are known model

matrices, u is a vector of random effects, and ε is a vector of errors (Henderson, 1975).

Centering. Centering of the level-1 and level-2 predictors has important implications for

interpreting the results and therefore is an important consideration in specifying the statistical

model. In our example, suppose the level-1 predictor, prior reading comprehension, was

measured on a scale ranging from 200 to 800. If the predictor was kept in its natural metric, 00γ

would be the predicted reading comprehension for a student in Program A (coded 0) who had a

prior reading comprehension of zero. Since a prior reading comprehension of zero is not

possible, the coefficient is difficult to interpret. The effect of the program, 01γ , would be

interpreted as the difference in the effectiveness of the two programs when prior reading

comprehension was zero. Again, a value that is not particularly informative. Prior reading

comprehension would need to be scaled or centered to make the interpretation of the coefficients

more meaningful. Although centering is used outside the context of multilevel modeling, it is

particularly important in multilevel modeling because the level-1 coefficients become outcomes

to be explained in higher level models (e.g., level 2).

One approach to scaling the predictor variable is to subtract the grand mean of the

predictor variable from each score ( ijx - x⋅⋅ ). Using grand mean centering with our example, 00γ

would become the predicted reading comprehension for a student from Program A, who had a

mean level of prior reading comprehension. The effect of the program, 01γ , would become the

difference in the effectiveness of the two programs for students with a mean level of prior

reading comprehension.

A second approach to scaling the predictor variable is to subtract the level-2 unit mean of

the predictor variable from each score ( ijx - jx⋅ ). Using group-mean centering with our example,

00γ , would become the predicted reading comprehension for a student in Program A, who’s

prior reading comprehension was at the mean of her class. The effect of the program, 01γ , would

become the difference in effectiveness of the two programs for students who are at their

classroom’s mean level of prior reading comprehension.

A third approach to scaling a predictor variable is to subtract a theoretically meaningful

value from each score ( ijx - Specific Value). This approach is similar to grand-mean centering in

that a constant is subtracted from each score. The 0 jβ is interpreted as the expected outcome for

7

individuals who score at the specific value that has been set by the researcher. For example, in a

growth curve model examining change in achievement from grades 6, 7, 8, a research may center

the grade predictor at grade 8. In this case, 0 jβ is interpreted as the expected value of the

outcome for a student in 8th grade.

In the case of level-2 predictor variables, Raudenbush and Bryk (2002) have noted that the

choice of the scale metric (e.g., natural metric, grand mean centered) is less critical. However,

when interaction terms are included at level-2, Raudenbush and Bryk (2002) have suggest that

grand mean centering has the advantage of reducing multicolinearity.

Selection of predictors. The selection of predictors is a critical aspect of the design of a

study. According to Little, Lindenberger and Nesselroade (1999) the issue of variable selection is

directly related to the quality of the research design and the value of the results. In hierarchical

modeling, variable selection can be complicated since predictors can be selected for each level of

the model, and interactions between predictors can be considered at either level or across levels.

In addition, the process of variable selection can take many forms. In some instances, the

selection of predictors is established prior to looking at the data, while in others the data help

guide selection decisions. Inclusion may be based partially on significance tests, effect sizes, or fit

indices.

In the reading comprehension example, one can imagine the researchers using the

research goals and a priori considerations to select prior reading achievement as a predictor at the

first level, and program as a predictor at the second level. Interpreting the regression coefficients

is influenced by the degree to which one believes the regression coefficients are unbiased

estimates of effects. In our illustrative example, 01γ can be interpreted as a program effect. Since

classrooms were randomly assigned to programs, we would anticipate that program was not

related to any other classroom level variables, and consequently we would anticipate an

unbiased estimate of effect. If we had not randomly assigned classrooms to program, our ability

to argue 01γ was an unbiased estimate of effect would depend on our ability to argue that all

relevant variables were included in the model, or that the set of predictors for the model was

correctly specified.

Researchers who wish to include all relevant variables, but who are unsure if particular

variables need to be included, may let the data help them decide which variables to include. For

example, a researcher may start with a model like we have previously described and then add

variables one at a time, keeping only those that are statistically significant. Other researchers

may start with a fuller set of variables, and eliminate ones that do not seem to be affecting the

8

results. When the data are used to guide the selection of predictors the researcher increases the

odds of capitalizing on chance, which heightens the need for replication.

Specification of covariance structure. The most notable difference between hierarchical

models and more common regression models is that hierarchical models have more error terms

and consequently more flexibility in defining the covariance structure. This greater flexibility

leads to two distinct advantages. First, it allows researchers to be more flexible in the questions

they ask about the covariance structure. In applications like our example, researchers can ask

questions about the degree to which the outcome varies within classrooms relative to the degree

to which it varies across classrooms. In growth curve modeling applications, researchers can ask

to what extent initial levels (intercepts) vary across participants and to what extent growth rates

(slopes) vary across participants. Second, the degree to which the standard errors for the

regression coefficients are unbiased depends on the degree to which the covariance structure is

correctly specified. Having the flexibility to model a more complex covariance structure improves

the chances of correct specification, which leads to better estimates of the standard errors of the

regression coefficients, which in turn leads to more accurate confidence intervals and/or more

valid statistical tests.

The covariance structure for the first-level model, Σ, is often assumed to be σ2I in

applications where students are nested in contexts. In repeated measures contexts this

assumption is more questionable since errors that are close in time may be correlated. A wide

variety of alternative structures have been discussed including first-order autoregressive,

banded, unstructured, toeplitz, banded toeplitz, and first-order autoregressive plus a diagonal

(Wolfinger, 1993). With so many options available, researchers are left with questions about how

to best specify the covariance structure of the first-level model.

Questions also arise as to how to best specify the covariance structure of the second-level

model. In the previous example, which is relatively simple, there are alternative specifications

for ΓΓΓΓ depending on whether one wanted to let both intercepts ( 0 jβ ) and slopes ( 1 jβ ) randomly

vary and whether or not one wanted to allow for covariance between the errors in predicting the

intercepts and slopes. As the number of predictors in the first-level model increases the potential

size of the ΓΓΓΓ matrix also increases. With more elements there are more variance parameters that

could be estimated. The question becomes which coefficients should be allowed to randomly

vary, and if the answer is more than one, which of the possible covariances between errors

should be estimated.

9

One could generally divide the covariance parameters into three categories: those that are

assumed to be zero and not estimated, those that are assumed to be non-zero and thus estimated,

and those that the researcher is less sure about. If researchers routinely leave out all questionable

variance parameters, they run the risk of leaving out needed parameters, biasing their standard

errors, and jeopardizing their inferences. If researchers routinely add in all questionable

parameters, they may estimate a model that is overly complex, which increases the chance that

they will encounter estimation problems. For example the estimation may not converge or a

variance component may be inadmissible (e.g., a variance less than zero or a covariance that

implies the correlation would exceed 1.0). Even when estimation seems smooth, estimating

many parameters that are equal to zero will negatively affect the precision in estimating the other

parameters in the model. Consequently, one would ideally only estimate the needed parameters.

Fit Indices. With the growing recognition of the importance of the selection of an

appropriate covariance structure, several methods have been developed that allow researchers to

use the data to help make decisions about which covariance structure to estimate. As it is often

not possible to know the underlying structure in advance, researchers will often examine

multiple structures and rely on fit indices to select among possible covariance structures (Singer,

1998; Wolfinger, 1993). Among the indices commonly used are Akaike’s Information Criterion

(Akaike, 1974) and Schwartz’s Bayesian Criterion (Schwartz, 1978). Akaike’s Information

Criterion is given by:

AIC = log(L) – q (6)

where q is the number of covariance parameters.

Schwartz’s Bayesian Criterion is given by:

SBC = log(L) – (qlog(N – p))/2 (7)

Both AIC and SBC start with the log likelihood value and then penalize for the number of

covariance parameters estimated, with SBC employing a stiffer penalty. For each of these indices

values closer to zero represent better fit, so typically the model with the value closest to zero is

selected. This approach, however, does not always lead to identification of the correct covariance

model, especially when data are somewhat limited. For example, with repeated measures data, it

is difficult to correctly select the covariance structure when the series length is short (Ferron,

Dailey, & Yi, 2002; Keselman, Algina, Kowalchuk, & Wolfinger, 1998). Furthermore,

10

misspecification can affect estimation and inference (Ferron, Dailey, & Yi, 2002; Lange & Laird,

1989).

Other work in this area has focused not on a single estimate, but rather a ‘confidence set

of models’ (Shimodaira, 1998). Instead of using the minimum AIC, Shimodaira proposed the use

of an ‘interval’ estimate of the best model. This author was quick to note that the confidence set

approach in not intended to replace the use of an obtained point estimate of the minimum AIC,

but rather provides supplemental information on model selection. This approach employs a

series of pairwise analyses in which a standardized difference of AIC is calculated for every pair

of models. Potential models are compared to the model evidencing the best estimate of sample

fit, and those models that are not observed to differ by a statistically significant amount become

part of a set of models for consideration.

Generalizability and Sensitivity. The degree to which the findings of a particular analysis

are generalizable, as well as how sensitive the findings are to characteristics of the data, should be

a concern of all researchers. Limitations of a particular sample, the nature of the data, as well as

techniques employed, all impact the breadth and depth of the inferences made. These issues can,

at least in part, be examined to help ascertain the strength of findings using a variety of statistical

methods and techniques. Such techniques include cross-validation, sensitivity analysis,

replication and extension of previous research, and internal replication.

Typically associated with more traditional statistical methods such as regression analysis,

the use of a technique such as cross-validation is a useful technique in HLM analyses that

provides further evidence of validity and model soundness. The primary purpose of cross-

validation is to provide a check of model integrity and generalizability. This model ‘check’ is

accomplished through using one set of data, sometimes referred to as the screening (or training)

set and one set of data that may be called the calibration (or test) set. The screening set is used to

estimate the model and then the calibration set is applied to the model to determine how well the

model was able to predict the degree of fitness relative to the screening data set. This process

allows the calculation of a magnitude of generalization error based on how well the calibration

data set fits the model identified by the training data. Depending on the data structure and

sample size, cross-validation may be conducted using various strategies. These include the

holdout method (also called the data splitting method), dual-sample method, k-fold cross

validation and the leave-one-out cross validation (LOO). The procedure can be further

complicated when a researcher might not believe, for either theoretical, conceptual, or data-based

reasons, that a single cross-validation process is sufficient and thus engages in double cross

validation. Double cross-validation is nothing more than doing cross-validation twice and then

11

using a combined equation or model. Depending on the statistical analysis being employed (e.g.,

regression or HLM) this process may vary in complexity and applicability.

Another means of addressing generalizability and sensitivity issues is through conduct of

a sensitivity analysis. This type of analysis examines the impact of data anomalies (e.g., extreme

data values, distribution irregularities) on model fit and parameter specifications. Bayesian

techniques such as the Gibbs sampling methods as well as other strategies and algorithms can be

used to examine impact of extreme observations at either level one or level two of the model

(Seltzer, Novak, Choi, & Lim, 2002). Other techniques, such as data transformations (e.g., log-

linear, square root) can be effective in addressing issues such as nonnormal distributions. The

degree to which model fit and parameter specifications remain constant when data issues such as

these are controlled for is critical to determine how ‘sensitive’ a given model is to fluctuations or

peculiarities in the data.

As with virtually all other techniques and methods of data analysis, HLM can be used for

both replication and extension of other studies as well as internal replication. The replication and

extension of other studies can be done in a multitude of ways. HLM can be a complementary

method used to examine a sample of a population previously analyzed using other statistical

techniques such as regression analysis. The degree to which HLM is more robust for accounting

for such issues as lack of independence among observations makes it well suited to replicate

previous research with very similar populations to either help strengthen the inferences made

from that research, or to identify possible areas of concern that were not identifiable with more

traditional analyses. Furthermore, replication can be conducted within a given study by

analyzing subsets of the data independently and subsequently examining the degree to which the

results are similar. Depending on the method(s) used, as well as the data source(s), replication

efforts often serve to enhance either the external or internal validity of the findings reported and

conclusions reached.

These are just a few of the means that researchers might consider using when conducting

HLM analyses. These techniques and approaches have the potential to enhance the credibility,

validity, and generalizability, depending on the focus, purpose, and resources considered in the

study. A careful and considerate selection of one of these analyses will enhance the integrity of

the findings of a research study.

Data Considerations

Distributional assumptions. All inferential statistical tools are based on a set of core

assumptions. Provided the assumptions are met, the method will function as planned or

intended. These underlying assumptions are often not satisfied, and it is common knowledge

12

that under some data-analytic conditions certain procedures will not produce the desired results.

According to Keselman et al. (1998), “the applied researcher who routinely adopts a traditional

procedure without giving thought to its associated assumptions may unwittingly be filling the

literatures with nonreplicable results” (p. 351).

For a hierarchical linear model, distributional assumptions are made about the errors at

each level in the model. The first level errors, the rij in equation 1, are assumed to be

independently and normally distributed with a covariance of Σ. Lack of normality can lead to

biases in the standard errors at both levels, and thus introduces questions about the validity of

statistical tests and the accuracy of reported confidence intervals. The normality assumption is

not realistic for certain types of outcome variables (e.g., binary outcomes, multinomial outcomes,

and ordinal outcomes), and in these cases it is generally recognized that hierarchical generalized

linear models are more appropriate. When one has a continuous outcome variable, the

normality assumption of hierarchical linear models may be reasonable, but even here the

assumption may not hold. Researchers can assess normality by examining the distribution of the

level-1 residuals. The distributions can be examined separately for each level-2 unit, or by

pooling across the level-2 units. If evidence of nonnormality is found, the researcher may wish to

consider transforming the outcome variable.

Also implicit in the assumptions about the first level error, is that the variance of the

errors is the same for each level-2 unit. If the variances are not homogeneous, but vary randomly,

it does not appear that the fixed effects or standard errors are biased (Kasim & Raudenbush,

1998), but if the variances vary as a function of the level-1 or level-2 predictors there may be more

serious consequences (Raudenbush & Bryk, 2002). A researcher can examine the homogeneity

assumption by examining the variance of the level-1 residuals for each level-2 unit. The

researcher could then look for units with variances that were notably different from the others, or

test whether the differences among the variances were greater than what could be attributed to

sampling error. Researchers could also examine the correlations between the variance estimates

and the values of the level-2 predictors.

Distributional assumptions are also made about the level-2 errors, the u0j and u1j from

equations 2 and 3. These errors are assumed to be normally distributed with a covariance of Γ.

Checking normality of these errors is a bit more complicated since the outcomes of the level-2

model are not directly observed but a procedure for estimating skewness and kurtosis of the

random effects has been presented (Teuscher, Herrendorfer, & Guiard, 1994).

13

It is well known that model-based statistical inference is dependent upon the scrupulous

attention to the assumed models, which necessarily includes the distributional assumptions

underlying a particular model. This is, of course, necessary if a researcher hopes to find a

suitable model or models that fit the data well. Although a number of researchers have

investigated these issues in the past, according to Ghosh and Rao (1994), the literature on

diagnostics for mixed linear models involving random effects is not as extensive as the literature

with respect to the treatment of standard regression diagnostics. Recently, however, Jiang (2001)

advanced a technique using goodness-of-fit tests to examine the distributional assumptions with

regard to mixed linear models.

Outliers. As with other statistical methods, researchers should screen their data for

outliers. These outlying observations may arise from data entry errors (e.g., a 27 that should have

been a 72), an inaccurate assessment of a student (e.g., a 0 used to indicate achievement for an

absent student), failure to identify a missing data code (e.g., a missing value entered as a 999),

failure to screen out participants who fall outside the inclusion parameters for the study (e.g., a

score from a student who was not part of the school during the focal time period for the study),

or simply from an individual who is different from the others in the sample. As with other

analyses, illegitimate outliers (e.g., data entry mistakes) can distort analyses and should be

corrected. With legitimate outliers (e.g., a score that is atypical but truly part of the population

being considered), the researcher needs to be aware of their presence and influence on the results.

When the influence is substantial, ameliorative strategies may need to be considered.

Initially the researcher may wish to look for univariate outliers by inspecting box plots, or

examining the distance from the mean in standard deviation units for the smallest and the largest

observations. Although these univariate checks are helpful, the researcher should also consider

examining the residuals at each level of the model (e.g., Raudenbush & Bryk, 2002). As an

example, consider a study that examined students nested within classrooms. At the first level,

one could look for outlying students, individual scores that were far from expectation given the

class’s regression equation. At the second level, one could look for outlying classes, where an

outlying class is one that has an atypical regression coefficient. In addition to the examination of

residuals, one may wish to examine simulation-based methods (Longford, 2001).

Measurement error for predictors and outcomes. Most measures in educational studies

contain error. Consequently, it is likely that the predictors and outcome variables used in

educational applications of hierarchical modeling will contain measurement error. These errors,

if not accounted for, can bias estimates of variance parameters, variance ratios – like the intraclass

correlation, fixed effects, and the standard errors of fixed effects (Woodhouse, Yang, Goldstein, &

14

Rashbash, 1996). Consequently, it is important for educational researchers to consider the

reliability of the data used in their applications of hierarchical linear models. In situations where

measurement error is anticipated, there are methods for specifying and adjusting for the

measurement error (Longford, 1993; Woodhouse et al., 1996).

Power. Considerable work has investigated the power of statistical tests of treatment

effects in multilevel data. Sample size formulas have been provided for obtaining given powers

in experiments where the 2nd-level units have been randomized (Donner, Birkett, & Buck, 1981;

Hsieh, 1988). Power calculations are also available through a website and through specialized

software. Optimal allocation of units among levels (e.g., fewer large groups versus more small

groups) has been considered (Raudenbush, 1997; Snijders & Boskers, 1993). Also the level of

randomization has been found to impact power, such that randomization of the 2nd-level units

leads to less power than randomization of the 1st-level units (Donner, Birkett, & Buck, 1981;

Hsieh, 1988).

Missing Data. It is not uncommon for missing data to occur on one or more variables

within an empirical investigation. Missing data may adversely affect data analyses,

interpretations and conclusions. Collins, Schafer, and Kam (2001) indicate that missing data may

potentially bias parameter estimates, inflate Type I and Type II error rates and influence the

performance of confidence bands. Further, because a loss of data is almost always associated with

a loss of information, concerns arise with regard to reductions of statistical power. Unfortunately,

researchers’ recommendations for managing missing data are not in complete agreement

(Guertin, 1968; Beale & Little, 1975; Gleason & Staelin, 1975; Frane, 1976; Kim & Curry, 1977;

Santos 1981; Basilevsky, Sabourin, Hum, & Anderson, 1985; Raymond & Roberts, 1987). Many

studies that have examined missing data treatments are not comparable due to the various

methods used, the stratification categories (number of variables, sample size, proportion of

missing data, and degree of multicollinearity), and the criteria that measure effectiveness

(Anderson, Basilevsky, & Hum, 1983). Contemporary discussion of missing data and their

treatment can often be confusing and at times may appear somewhat counterintuitive. For

example, the term ignorable, introduced by Little and Rubin (1987) was not intended to convey a

message that a particular aspect of missing data could be ignored, but rather under what

circumstances the missing data mechanism is ignorable. Additionally, when one speaks of data

missing at random, these words should not convey the notion that the missingness is derived from

a random process external or unrelated to other variables under study (Collins et al., 2001).

According to Heitjan and Rubin (1991) missing data can take many forms, and missing

values are part of a more general concept of coarsened data. This general category of missing

15

values results when data are grouped, aggregated, rounded, censored, or truncated, resulting in a

partial loss of information. The major classifications of missing data mechanisms can be best

explained by the relationship among the variables under investigation. Rubin (1987) identified

three general processes that can produce missing data. First, data that are missing purely due to

chance are considered to represent data that are missing completely at random (MCAR).

Specifically, data are missing completely at random if the probability of a missing response is

completely independent of all other measured or unmeasured characteristics under examination.

Accordingly, analyses of data of this nature will result in unbiased estimates of the population

parameters under investigation. Second, data that are classified as missing at random (MAR), do

not depend on the missing value itself, but may depend on other variables that are measured for

all participants under study. Lastly, and most problematic statistically, are data missing not at

random (MNAR). This type of missingness, also referred to as nonignorable missing data, is

directly related to the value that would have been observed for a particular variable. A

commonly encountered situation, in which data would be classified as MNAR, arises when

respondents in a certain income or age strata fail to provide responses to questions of this nature.

Given the nature of the data typically analyzed using hierarchical linear modeling, it is

not surprising that the issue of missing data becomes pertinent to inquiry of this nature (Roy &

Lin, 2002). Missing data may occur at the different levels of a model, or the loss of multiple data

points across time may be unavoidable or inevitable due to attrition of mortality. It is also not

uncommon to face a combination of these challenges when examining longitudinal outcomes.

The careful researcher must be concerned not only with nonignorable nonresponses but with

missing covariates as well (Roy & Lin, 2002).

Estimation

There is no a single agreed upon way to estimate the parameters in a hierarchical linear

model. Several methods of estimation can be employed, including maximum likelihood (ML),

restricted maximum likelihood (REML), and Bayesian (Raudenbush & Bryk, 2002; Kreft & De

Leeuw, 1998). These methods of estimation can be carried out using many different algorithms.

For example, ML estimation may be accomplished using the EM algorithm, the Newton-Raphson

algorithm, the Fisher scoring algorithm, or iterative generalized least squares (IGLS), while

Bayesian estimation may be accomplished using the Gibbs sampler. In addition, these algorithms

have been programmed into many different software packages. Thus one researcher may

accomplish REML estimation using the EM algorithm programmed into HLM, while another

may accomplish REML estimation using restricted iterative generalized least squares (RIGLS)

16

using MLn, while a third may accomplish REML using the Newton-Raphson algorithm

programmed in Proc MIXED within SAS.

Maximum likelihood estimation. The principle behind ML estimation is to select parameter

estimates that maximize the likelihood of the data. We consider how likely it is that we would

have obtained the data for each of many different values for the fixed effects (γs) and variance

parameters (elements in ΣΣΣΣ and ΓΓΓΓ), and then pick the values for which the likelihood is the

greatest. This involves an iterative algorithm that steps through possible values until the

likelihood reaches its maximum. When the maximum is reached the algorithm is said to have

converged. The goal for computational statisticians is to develop an algorithm that converges

fairly quickly across a wide range of applications. If the algorithm meanders through the

possibilities too slowly it may not converge given the time allocated, and if the algorithm moves

too quickly it may miss the maximum and fail to converge. Since the desirable properties of

maximum likelihood estimators are not realized when convergence fails, the objective for applied

researchers is to select an algorithm that will converge given their data and time constraints.

Maximum likelihood estimation is currently available through a variety of algorithms and

software packages. It can be accomplished using the EM algorithm (Dempster, Laird, & Rubin,

1977), which is implemented in the software package HLM (Raudenbush, Bryk, Cheong, &

Congdon, 2000), or by using the Newton-Raphson algorithm (Lindstom & Bates, 1988) which is

implemented in Proc MIXED (SAS, 2000), or by using the Fisher scoring algorithm (Longford,

1987) which is implemented in VARCL, or by using iterative generalized least squares (IGS;

Goldstein, 1986) which is implemented in MLn. The EM algorithm has the advantage that it will

always converge if given enough time, but the disadvantage is that it may take a relatively long

time to converge (Draper, 1995).

If convergence is met and the estimated variance/covariance matrices are positive definite

(i.e., the variances are positive and the absolute value of the implied correlations do not exceed

1.0), then the estimators have some desirable properties. The fixed effects (γs) are unbiased

(Kacker & Harville, 1981, 1984), and the estimates of the variance parameters (elements in ΣΣΣΣ and

ΓΓΓΓ) are asymptotically unbiased, that is the bias disappears as sample size gets large (Raudenbush

& Bryk, 2000). The estimates of the fixed effects and variance parameters also tend to be

asymptotically efficient, which implies that when the sample size is large the maximum

likelihood estimates show minimum variance from sample to sample (Raudenbush & Bryk,

2000). Finally, as sample size increases the sampling distributions of the estimates become

approximately normal, which facilitate construction of confidence intervals and statistical tests

17

(Raudenbush & Bryk, 2000). Note that these properties hold for relatively large sample sizes,

where what is considered large is heavily influenced by the number of upper level units. For

example, if one studies students who are nested in classes, then many classes must be sampled if

one wishes to obtain these desirable properties.

Restricted maximum likelihood estimation (REML). In REML, maximum likelihood estimates

are obtained for the variance parameters (elements in ΣΣΣΣ and ΓΓΓΓ). These values are then used in

obtaining generalized least squares estimates of the fixed effects (γs). The REML estimates of the

variance parameters may be considered preferable to ML estimates because REML takes into

account uncertainty in the fixed effects (γs) when the variance parameters are estimated. Since

the uncertainty in the fixed effects is more pronounced with smaller sample sizes, one may

suspect the difference in these methods would tend to be greater when sample sizes were

smaller. A couple of empirical studies have been done which have found differences between

ML and REML estimates under a variety of conditions (Kreft & de Leeuw, 1998), but these

studies do not lead to uniform recommendation of one method over the other.

As with ML estimates, REML estimates can be obtained from a variety of software

packages (e.g., HLM, SAS Proc MIXED, MLn, VARCL) and through a variety of algorithms (e.g.,

EM, Newton-Raphson, Fisher scoring, and RIGLS), and have been shown to have desirable

properties under many conditions. Again under general conditions, the fixed effects (γs) are

unbiased (Kacker & Harville, 1981, 1984), the estimates of the variance parameters (elements in ΣΣΣΣ

and ΓΓΓΓ) are asymptotically unbiased (Raudenbush & Bryk, 2002), the estimates of the fixed effects

and variance parameters are asymptotically efficient (Raudenbush & Bryk, 2002), and as sample

size increases the sampling distributions of the estimates become approximately normal

(Raudenbush & Bryk, 2002). Consequently, both ML and REML are often recommended for

large sample size conditions. When sample sizes are smaller, and particularly when the data are

unbalanced, the functioning of both ML and REML becomes questionable, which may lead

researchers to consider alternatives. In addition, inferences about the fixed effects (e.g.,

confidence intervals for the γs) assume the variance estimates have no error. This also becomes

exceedingly questionable when the sample sizes are not large.

Bayesian estimation. With Bayesian estimation (Lindley & Smith, 1972) one can

acknowledge the uncertainty in the estimates of the variance parameters when the fixed effects

are estimated. Consequently, Bayesian estimation provides an appealing option for researchers

working with smaller data sets. This form of estimation can be accomplished using Markov

Chain Monte Carlo algorithms like the Gibbs sampler, which is implemented in the software

18

BUGS. Although Bayesian estimation is appealing in some circumstances, it also has some

drawbacks. Prior distributions must be specified, but this specification may conflict with some

researchers’ desire to not let prior beliefs influence the results of their analyses (Raudenbush &

Bryk, 2002). In addition, the algorithms are not as readily available, as they have only been

implemented in a couple of software packages, and the algorithms are very computer intensive,

making them impractical for large data sets.

Alternative Estimation Methods. Since none of the estimation methods is entirely

satisfactory across all data conditions that may be encountered in research, statisticians continue

to work on the development of alternatives. Bootstrapping has been presented as one option to

deal with the bias in the variance estimates and standard errors that results from using ML or

REML estimation with samples that are not large and normal. Bootstrapping is available in

MlwiN, and both parametric (Meijer, van der Leeden, & Busing, 1995), and nonparametric

(Carpenter, Goldstein, & Rashbash, 1999) versions have been discussed. Another alternative

stems from the motivation to restrict the influence of outlying observations. Robust ML

estimation methods and robust REML estimation methods have been proposed and show

promise (Richardson & Welsh, 1995), but as far as we know they have not been programmed into

readily available hierarchical modeling software.

Hypothesis Testing and Statistical Inference

The estimation method will produce point estimates of each parameter in the hierarchical

model. These point estimates are often valuable in addressing particular research questions, but

additional information is often provided to aid the researcher in making inferences. This

additional information may take the form of confidence intervals for parameters of interest or

hypothesis tests of these parameters. When considering the options available it becomes

convenient to distinguish between inferences made about variance parameters (elements in ΣΣΣΣ and

ΓΓΓΓ), inferences made about fixed effects (γs), and inferences made about the random level-1

coefficients (e.g., β0j).

Inferences about variance parameters. A researcher may be interested in creating a

confidence interval (CI) for a variance parameter. The simplest approach would be to make use

of the standard error of the variance parameter estimate, which is computed from the inverse of

the information matrix. By adding and subtracting 1.96 times the standard error of the parameter

estimate, one can create a 95% CI, assuming a normal sampling distribution. This approach,

however, has limitations, especially when the sample size is small or the variance parameter is

near zero (e.g., Littell, Milliken, Stroup, & Wolfinger 1996; Raudenbush & Bryk, 2002). Under

these conditions the variance parameter will tend to have a skewed sampling distribution,

19

making symmetric intervals based on the standard error unrealistic. Under these conditions

researchers should turn to other options including the Satterthwaite approach (Littell, Milliken,

Stroup, & Wolfinger 1996), bootstrapping (Meijer, van der Leeden, & Busing, 1995; Carpenter,

Goldstein, & Rashbash; 1999), a method based on local asymptotic approximations (Stern &

Welsh, 2000), and if the data are balanced, the approach proposed by Yu and Burdick (1995).

For researchers wishing to test hypotheses regarding variance components, again a

variety of choices are available. The simplest would be to conduct a z-test by dividing the

estimate by its standard error. Although this approach is asymptotically valid, it, like the

standard error based CIs noted previously, becomes questionable when the sampling distribution

cannot be assumed normal. A somewhat more appealing option is to use a likelihood ratio χ2

(e.g., Little, Milliken, Stroup, & Wolfinger 1996). This test requires the user to estimate two

models, one with and one without the questionable variance parameter(s). The difference in the

log likelihoods obtained in these analyses is then used to construct a statistic that in large samples

follows a χ2 distribution. Note this method can be used for single parameter tests or multiple

parameter tests. Additional alternatives include an approximate χ2 test described by

Raudenbush and Bryk (2002), bootstrapping (Meijer, van der Leeden, & Busing, 1995; Carpenter,

Goldstein, & Rashbash; 1999), a likelihood ratio test based on the local asymptotic approximation

(Stern & Welsh, 2000), and exact tests that have been established for some contexts (Christensen,

1996; Ofversten, 1993).

Finally, it should be noted that in addition to point estimates, confidence intervals, and

statistical tests, researchers should consider whether combining variance estimates and or

making variance ratios could help to answer the research questions. For example, one may be

interested in the explained variance (R2) at one or more levels of the model (Snijders & Bosker,

1994), the intraclass correlation (e.g., Raudenbush & Bryk, 2002), or the reliability of estimators

(Raudenbush & Bryk, 2002). Those interested in creating confidence intervals for variance ratios

are referred to the statistical literature (Lee & Seeley, 1996). As far as we know the methods

described there have not been implemented in the hierarchical linear modeling software

programs.

Inferences about fixed effects. A researcher interested in making inferences about fixed

effects may wish to construct confidence intervals for the effects of interest. A 95% CI could be

constructed around the point estimate by adding and subtracting 1.96 times the standard error.

This of course assumes a normal sampling distribution, which can be demonstrated

asymptotically, but which becomes questionable for smaller samples. Consequently, one would

20

typically substitute a t-value with ν degrees of freedom for the 1.96. Several methods for defining

the degrees of freedom have been given (Giesbrecht & Burns, 1985; Kenward & Rogers, 1997),

and some software packages (e.g., Proc Mixed) allow for different definitions to be specified. An

alternative to assuming an approximate t-distribution is to turn to bootstrapping to construct the

confidence intervals.

Hypothesis tests can also be conducted by using t- or F-tests with the approximate

degrees of freedom. Again, different definitions have been suggested, and thus researchers need

to be clear about the method used for obtaining the degrees of freedom for these tests. Several

alternatives to these approximate tests have been discussed. These include a test based on a

Bartlett corrected likelihood ratio statistic (Zucker, Lieberman, & Manor, 2000), a permutation test

(Reboussin & DeMets (1996), and bootstrapping.

Inferences about random level-1 coefficients. Researchers may also be interested in estimating

the random level-1 coefficients and making inferences about these coefficients. For example, a

researcher who is interested in estimating the effects of prior reading achievement on end of

school year reading achievement, may wish to get a separate effect estimate for each classroom.

One approach would be estimate the level-one model separately for each classroom using

standard ordinary least square (OLS) estimation methods, in which case standard methods are

available for constructing confidence intervals and testing hypotheses about coefficients. The

drawback of the OLS approach is that each estimate is based on relatively few observations, only

those from the classroom of interest, thus leaving a lot of room for sampling error.

An alternative is to obtain empirical Bayes estimates, which consider all the available

information. Empirical Bayes estimates tend to pull the effect estimates toward the overall

average estimate by an amount that depends on the uncertainty in the effect estimate being

considered and the variability in the effect estimates. This process biases the estimates, but leaves

us with values that tend to be closer to the parameter values (i.e., a smaller expected mean square

error) than those based on OLS estimation (Raudenbush & Bryk, 2002). For empirical Bayes

estimates the standard errors can be computed and used for the creation of confidence intervals

or z-tests of statistical significance. These methods assume a normal sampling distribution, and

thus may be unrealistic unless there is a large number of level-2 units (Raudenbusch & Bryk,

2002).

METHOD

Coding Protocol

To analyze the articles representing multilevel applications, we developed a coding

framework based in large part on the issue identified during the first phase of our review of the

21

literature. Within each area (i.e., model development and specification, data considerations,

estimation, and hypothesis testing and inference) specific questions were devised to guide our

review. The current issues and critical questions were organized into a checklist that was refined

using a series of pilot tests. In these pilot tests, members of the research team independently

analyzed the same application article using the checklist; members then came together as a group

to check the consistency of the responses, discuss coding decisions and possible alterations of the

checklist. A codebook, which facilitated coding efforts, was developed during these meetings to

capture in more detail the coding process. The final version of the checklist, which was used to

code each of the articles, is provided as an appendix.

Searching Strategies for Applications

To describe the current application and reporting of multilevel analyses in the field of

education, prominent educational and behavioral research journals were initially selected for

examination. We examined the same set of journals provided in the methodological research

review published in Review of Educational Research by Keselman et al. (1998). It was deemed

appropriate to begin with this set of journals, as these journals publish empirical research,

represent different sub disciplines in education and are highly regarded in the fields of education

and psychology. Additionally, we relied on the expertise of our research team to identify other

well known publications that might provide similar applications of multilevel modeling. For this

phase of our review, all of the issues of each volume of the chosen journals, published between

1998 – 2002, were hand searched for evidence of the employment of hierarchical linear modeling

techniques. That is, our research team did not rely solely on article titles and abstracts to make

our determination to include or exclude a particular study.

Description of the Sample

Of the identified articles, 20 have been reviewed at this time. The largest proportion

(40%) came from the most recent year considered for this study, 2002 (see Figure 1) and only one

study (5%) was from the earliest time point considered, 1998. The remainder of the sample (55%)

was distributed almost equally over the middle three years, 1999-2001. The sample was drawn

from 10 peer-reviewed journals that are fairly prominent in the social sciences (see Table 1).

Studies from four of the journals (The American Educational Research Journal, the Journal of

Educational Research, the Journal of Personality and Social Psychology, and the Journal of Applied

Psychology) accounted for the majority of the sample, 60%, with each supplying three studies

(15%) that were used in this analysis. Four of the remaining six journals were each a source for

one article in the analysis while two articles were retrieved from the remaining two journals.

22

Figure 1. Distribution of Sample Studies Based on Year Published

Table 1 Journals and Years of Sample Studies

YEAR JOURNAL 2002 2001 2000 1999 1998 Total American Educational Research Journal 0 1 2 0 0 3 Child Development 0 0 0 1 0 1 Journal of Educational Psychology 1 1 0 0 0 2 Sociology of Education 0 0 1 1 0 2 Journal of Applied Psychology 2 1 0 0 0 3 Journal of Educational Research 3 0 0 0 0 3 Journal of Personality and Social Psychology 0 0 1 1 1 3 Reading Research Quarterly 1 0 0 0 0 1 Cognition and Instruction 0 0 0 1 0 1 Developmental Psychology 1 0 0 0 0 1 Total 8 3 4 4 1 20

RESULTS

Study Characteristics

Before turning to the four central issues that were identified as important in the analysis

and presentation of multilevel models, we took care to examine a host of characteristics germane

to the set of articles examined. For this investigation, we thought that it would be prudent to

articulate the types of studies being examined (e.g., individuals nested in contexts versus

23

repeated measures), the rationale provided by authors for employing HLM methods, the study

design, sampling, the average number of units at varying levels of the model and the description

provided regarding the distribution of level 1 units across level 2 units.

The studies were typically nonexperimental (85%) and often did not use probability

sampling (65%). They covered a wide range of applications. Half the studies used two-level

models where individuals were nested in contexts. Two studies (10%) involved thee-level

models where individuals were nested in contexts that were nested in contexts, while the

remaining eight studies (40%) involved repeated measures data. Almost all of the studies (90%)

explicitly stated a rationale for using hierarchical modeling, but the level of detail in the

rationales varied greatly. The studies also differed widely in the amount of data used in the

analysis, where the number of level-two units ranged from 19 to 1406, and the average number of

level one-units per level-two unit ranged from a low just over 2 to a high of 160.

Model Development & Specification

Model development is a central component of inquiry in many disciplines, with notably

different approaches evidenced among researchers both within and across disciplines. In our

examination of this critical component, we considered a host of aspects related to divergent

approaches to the development and specification of multilevel statistical models. A considerable

amount of variability was evidenced in the number of models examined by researchers and the

clarity of how well the number of models was communicated. For example, in only 45% of the

articles reviewed were we able to determine with confidence the number of models analyzed.

For this subset of articles (n=9), the number of models examined ranged from 4 to 430 with the

median number equal to 9 (M = 51, SD = 126). For the set of published articles that we

scrutinized, baseline models (i.e., unconditional models) were frequently investigated as part of

data analysis (n=9, 45%). For 11 of the articles, we could not determine with confidence if

baseline models were examined (see Table 2). It was also common to encounter studies that

examined more than one set of predictor variables for each of the dependent variables under

investigation (n=15, 75%). For these studies, researchers employed between two and six sets of

predictors. In all of the studies that we examined, the predictors were selected based at least

partially on apriori considerations. In most of these cases, strong support was provided by the

literature base and empirical research. In six cases there was evidence that predictor variables

were selected, in part, on significance tests for the individual predictors. With respect to the

subset of researchers who explored multiple sets of predictors, the exact number of sets could not

accurately be determined for approximately 35% of the studies. Further, four of the studies

reported level two statistical interactions, while nine reported across level interactions. During

24

our examination of how researchers typically specify the covariance structure underlying the

data, we observed that for approximately two-third of the studies, there was no clear discussion

of this issue. For these instances, it appeared that software defaults were used in the analyses.

Although centering has important implications for interpreting the results from the

statistical modeling, 40% (n=8) of the studies provided no discussion of centering at level-1 and

60% (n=12) of the studies did not provide any discussion of centering at level-2. When centering

was used at level-1, researchers either used grand mean (30%), group mean (15%) or other

approaches (25%). Other types of centering for level-1 variables included from the last time

point, coding from a given point in time, centered from time of loss, or some form of

standardization. Grand mean centering was reported for the eight studies that reported the use

of centering at level-2.

Table 2 Model Development and Specification Characteristic N Percent (%)

Examination of baseline models Yes 10 50 No 3 15 Unable to determine 7 35

Selection of predictors: Based at least partially on: Aprior considerations 20 100 Significance test for individual predictors 6 30 Effect size for individual predictors 1 5 Fit statistics (e.g. AIC or SBC) 0 0

More than one set of predictor variables for each DV Yes, but exact number could not be determined with confidence

7 35

Yes, number of sets of predictors could be determined

8 40

Could not be determined with confidence 1 5 No 4 20

Interactions examined Level 1 1 5 Level 2 4 20 Across level No interactions

9 6

45 30

Selection of covariance structure Not discussed, or unclear, and/or appears that defaults were used

13 65

Established apriori prior to looking at the data 4 20 Based partially on LRTS or significance tests for individual variance components

7 35

Based partially on fit statistics (e.g. AIC or SBC) 0 0

25

Centering Level 1

No discussion of centering 8 40 Grand mean 6 30 Group mean 3 15 Other 5 25

Level 2 No discussion of centering 12 60 Grand mean 8 40

Note: Counts may exceed 100% if multiple methods were applied (i.e., selection of predictors, centering, selection of variance structure).

When we critiqued the extent to which models were well communicated, we observed

35% of the studies did not explicitly communicate the nature and number of the models, yet we

were able to glean this information through close scrutiny of the text, tables, and footnotes (Table

2). For the remaining studies, only 10% (n=2) provided explicit statements of the number of

models examined, while for the other 55% we could not determine this information with any

degree of confidence.

Given the complexity and number of models run, researchers tended to use multiple

approaches to reporting the results. The most prominent method of communicating fixed effects

was through the use of verbal descriptions (n=20), followed closely by lists of estimated effects

(n=19), and communication through a series of regression equations (70%). The most common

methods of communicating the estimated variance structure was a list of parameters (55%) and

verbal description (75%). Eight of the studies examined provided evidence of variance

parameters through the use of equations. None of the articles included matrix representations of

these relationships.

As researchers we are keenly interested in the extent to which our results are

generalizable. To examine the critical issue of generalizability, we considered a broad range of

evidence with respect to this aspect of inquiry. For example, we looked for both sensitivity

analysis and traditional cross-validation methods as evidence of generalizability. Further, we

also included both the replication or extension of previous research as well as internal

replications (e.g., between group differences). None of the studies addressed the possibility of

capitalizing on chance in model development by employing cross-validation analyses. However,

six studies provide evidence of internal replication and three studies provided evidence of

sensitivity analysis, while a single study reported replication/extension of previous research.

Table 3 Model Communication

26

Characteristic N Percent (%)

Communication of models presented

Not explicitly stated, but could be determined from the information provided in the text, footnotes, etc.

7 35

Explicit statement of the number of models examined

2 10

Could not be determined with confidence 11 55 Communication of fixed effects

Equation representation 14 70 List of estimated effects 19 95 Verbal description 20 100

Communication of Variance structure Equation representation 8 40 List of estimated effects 11 55 Verbal description 15 75

Generalizability Sensitivity analysis 3 15 Internal replication 6 30 Replication 1 5

Data Considerations

Because inferences in multilevel models are based on an analysis of the covariances

between and within the nested units, the consideration of distributional assumptions, outliers,

statistical power, and missing data are critical to obtaining credible results. The results of the

analysis of the treatment of such data considerations in the 20 articles reviewed are presented in

Table 4.

Despite the recent advances in statistical power analysis in multilevel models, none of the

studies examined included an explicit discussion of statistical power in the study design or

interpretation of results. Similarly, only three of the articles (15%) provided evidence of outlier

screening and only one article described a consideration of the potential impact of measurement

error on the resulting models. Conversely, 90% of the studies (n = 18) provided some discussion

of missing data in the analysis and six of the 17 studies that acknowledged missing data (35%)

included a consideration of the randomness of such missing data. However, details on the

treatment of missing data in the analysis were less prevalent. Ten of the studies used listwise

deletion for missing data at level 1 and two studies used a simple imputation procedure. For

missing data at level 2, eight of the studies used listwise deletion, two studies used imputation,

and two studies used other procedures (i.e., selecting a proxy variable with less missing data and

27

incorporating a missingness indicator vector in the analysis). Even when the nature of the

missingness was discussed, the articles generally provided little insight as to the overall impact of

the missing data treatment on the resulting estimates.

Multilevel models require assumptions about the errors at each level of the analysis, and a

consideration of the tenability of these assumptions is important in assessing the credibility of the

results. In the 20 studies examined in our study, only four (20%) discussed normality of the level

1 residuals and three (15%) discussed variance homogeneity of the residuals. Only two studies

provided details about the results of checking these residuals with corrective action taken.

Finally, four studies (20%) mentioned the assumption of residual normality at level 2, but only

one of these provided details on the extent to which this assumption was met.

Table 4 Data Considerations Characteristic N Percent (%)

Statistical Power 0 0

Discussion of Missing Data 18 90 Randomness of Missing Data 6 351

Treatment of Missing Data at Level 1 Listwise Deletion 10 672

Imputation 2 132

Treatment of Missing Data at Level 2 Listwise Deletion 8 1003

Imputation 2 253

Other 2 253

Discussion of Outliers 3 15 Screening for Outliers 1 5

Treatment of Imperfect Measurement 1 5

Assumptions Level 1 Residual Normality 4 20 Level 1 Residual Homogeneity of Variance 3 15 Level 2 Residual Normality 4 20

1 Percent based on 17 papers that acknowledged missing data. 2 Percent based on 15 papers that acknowledged Level 1 missing data. 3 Percent based on 8 papers that acknowledged Level 2 missing data.

Estimation and Testing Each article was examined for details about the analysis performed. In particular, articles

were examined for information regarding the software utilized as well as general estimation

techniques, including the method used, the algorithm used, whether or not convergence

28

problems were encountered, or if matrices were positive definite. Additionally, studies were

examined for details regarding the method employed for drawing inferences for variance

parameters and fixed effects. In general, limited information regarding methods of estimation

and testing was provided, including the type of software used in the analysis.

Only 40% of the studies explicitly state the type of software used in the analysis (Table 5).

Six of the eight indicated the use of a variation of Byrk and Raudenbush’s HLM software and two

indicated the use of the SAS software. Other available software for these types of analyses (e.g.,

ML-WIN, M-PLUS ) were not explicitly noted.

Table 5 Software Identified for Use in HLM Studies

N Percent (%) Details

No information about software used 12 60

Information about program used, no information regarding date/version 2 10 SAS Proc Mixed

Information about program used as well as date or version used 6 30 HLM

General information about model estimation methods, algorithms, convergence issues

and whether matrices were positive definite, was virtually non-existent. As Table 6 illustrates,

few of the studies examined thus far provided any information on these issues. Since there are

multiple ways to estimate hierarchical models, and evidence that these different methods can

lead to different results and potentially to estimation problems, it is important that authors

provide detail about the estimation process if other researchers are to be able to critically evaluate

or replicate the analyses.

Table 6 Model Estimation Considerations

N %

Estimation Method Stated 3 15%

Estimation Algorithm Stated 0 0%

Convergence Addressed 0 0%

29

Positive Definite Matrices Addressed 0 0%

Estimates of variance and covariance of model parameters varied across the studies.

Variance estimates tended to be provided more often than covariance estimates between

intercept and slope errors (Table 7). In 75% of the studies, one could not determine whether the

covariance had been estimated. The large percentage of articles containing incomplete

information was somewhat surprising. For other types of statistical models (e.g., multiple

regression or structural equation modeling) it is expected that a complete listing of the estimated

parameters will be given. It seems reasonable to expect the same in hierarchical linear modeling,

at least for the models that are presented and interpreted.

Table 7 Frequency of Reporting Variance and Covariance Estimates

Provided Estimated

but Not Provided

Insufficient Information

Given

Not Applicable Since Not Estimated

Error Variance of Intercepts 10 (50%) 8 (40%) 2 (10%) 0 (0%)

Error Variance of Regression Coefficient or Slope 7 (35%) 6 (30%) 3 (15%) 4 (20%)

Covariance Between the Intercept and Slope Errors 1 (5%) 0 (0%) 15 (75%) 4(20%)

First Level Error Variance 8 (40%) 12 (60%)

The degree to which variance information and fixed effect information was reported also

fluctuated across studies, as well as how such information was reported. Table 8 provides a

summary of what type of variance information was reported in the studies as well as how that

information was reported. Significance tests (n = 12) were the most prevalent method of

reporting additional information regarding variance. In addition, only six of the studies reported

the method they used for these significance tests, a chi-square analysis, and none of the articles

specified the type of chi-square analysis used. Again we observe the tendency to not provide

enough detail for thorough critique or replication. None of the studies used confidence intervals

to gauge precision of variance estimates although four (20%) studies provided such information

for fixed effect estimates. For fixed effects, significance tests and point estimates were widely

reported (95% and 100% of the time, respectively). However, although information on fixed

30

effects tended to be reported more often than variance estimates, the details of how inferences

were made were often not included. Only eight of the studies indicated the type of test used

(e.g., t-test) and of these none reported the method for determining the degrees of freedom.

Table 8 Additional Information on Variance and Fixed Effects Reported

N Percent (%)

Additional variance information provideda

None 7 35 Standard Errors 2 10 Confidence Intervals 0 0 Significance Tests 12 60 Reliabilities 2 10 Inter Class Correlations 6 30 Explained Variance 5 25

Method used for CIs or Significance Tests for Variance Parameters

Not Applicable 8 40 Not Stated 6 30 SE/z-estimate 0 0 Chi-Square 6 30 Other 0 0

Fixed Effect Information Provideda

None 1 5 Standard Errors 12 60 Confidence Intervals 4 20 Signficance Tests 19 95 Point Estimates 20 100 Other 6 30

Method used for CIs or Signficance Tests for Fixed Effects

Not Stated 12 60 Likelihood Ratio 0 0 T or F test 8 40 Other 0 0 Level-1 Parameter Information Provided None 20 100 Extimates Provided, Method Not Stated 0 0 OLS or EB Estimates 0 0 Statistical Tests for OLS or EB Estimates 0 0 CIs for OLS or EB Estimates 0 0

31

Other 0 0 aCounts may exceed 100% if multiple methods were applied

CONCLUSIONS AND RECOMMENDATIONS

The results presented from this study should be viewed as preliminary, and although we

will offer some conclusions and recommendations, it is important to note that these are being

offered tentatively at this point. We have only reviewed 20 of the articles published in the

selected journals between 1998 and 2002, which is about ¼ of the hierarchical modeling articles

published in these journals. After reviewing the remainder of the articles, we will be able to

make more precise statements about analysis and reporting practices. It should also be noted

that questions could be raised about the reliability of the coding. Each article was reviewed

independently by all members of the research team and then discussed at a team meeting, at

which time a master checklist was created. There were many items for which 100% agreement

was obtained (e.g., was there a statement of the statistical software used?), there were other items,

however, that involved greater levels of inference and that sometimes led to disagreements (e.g.,

how many models were estimated?). These disagreements were resolved through discussion,

and a codebook, which facilitated coding efforts, was developed to capture in more detail the

coding process. We anticipate estimating reliability for all coding decisions for a sample of the

remaining articles, and then using these estimates to guide the number of coders used to examine

the remaining articles. When the reliability has been estimated and more articles have been

coded, it will be possible to make less tentative conclusions and recommendations.

Even keeping in mind the preliminary nature of our results, there seems to be some

relatively clear problems in the reporting of HLM analyses. There is often not enough

information for a reader to technically critique the reported analyses, even when the writers have

done an admirable job in discussing the critical methodological issues of sampling, research

procedures, and measurement. With this in mind, we suggest the following reporting guidelines

for hierarchical modeling.

1. Provide a clear description of the process used to arrive at the model(s) presented. This

should include discussion of how the predictors were selected, how the covariance

structure was chosen, and a statement of how many models were examined. Readers can

more carefully consider the presented models if they understand the process from which

the models were generated.

32

2. Explicitly state whether centering was or was not used, and if it was, provide details on

which variables were centered and how they were centered. Without knowledge of

centering decisions, readers cannot easily interpret the regression coefficients.

3. Explicitly state whether there were specification checks, if distributional assumptions were

considered, and whether data were screened for outliers. If such checks were made, state

both the method used and what was found. Without this sort of information it is easier to

question the creditability of the results.

4. State whether the data were complete. If they were not complete, describe the missingness

and attempt to provide insight into its possible effects on the results.

5. Provide details on the analysis methods, including a statement of the software used, the

method of estimation, whether convergence was obtained, and whether all variance

estimates were admissible. It is important for authors to list the version of the software

used in case bugs in a specific version of the software are found at a later date that may

call into question the interpretation. The other details are helpful for interpreting the

parameter estimates.

6. For any interpreted model, provide a complete list of all parameter estimates. In addition

to providing critical information for interpreting the results, this helps to communicate

the precise model estimated.

7. Provide either standard errors or interval estimates of the parameters of interest. This

recommendation is consistent with the general reporting guidelines provided by the APA

taskforce on statistical inference (Wilkinson & Task Force on Statistical Inference, 1999).

Statistical significance tests provide limited inferential information, and can be difficult to

interpret when large numbers of tests have been conducted, which was typical in the

reviewed applications.

We recognize that it would also appear helpful if we provided some concrete guidelines

regarding the conduct of hierarchical modeling. Unfortunately, the models are complex and the

methodological decisions regarding their implementation are best made only after careful

consideration of a particular application. For example, whether group mean centering makes for

a good recommendation depends on the application being considered. Whether restricted

maximum likelihood estimation is the best recommendation for estimation depends on the

application being considered. We hope that providing some reporting guidelines will heighten

awareness of some of the technical issues among researchers and reviewers involved in a

particular application. This in turn may lead to a careful examination and critical dialog about

33

the issues within the context of the application, which may facilitate improvements in applied

practice.

34

References

Akaike, H. (1974). A new look at the statistical model of identification. IEEE Transaction on

Automatic Control, 19, 716-723. Anderson, A. B., Basilevsky, A., & Hum, D. P. (1983). Missing data. In P. H. Rossi, J. D. Wright, &

A. B. Anderson (Eds.), Handbook of survey research (pp. 415-494). New York: Academic Press. Basilevsky, A., Sabourin, D., Hum, D., & Anderson, A. (1985). Missing data estimators in the

general linear model: An evaluation of simulated data as an experimental design. Communications in Statistics, 14, 371-394.

Beale, E. M. L., & Little R. J. A. (1975). Missing values in multivariate analysis. Journal of the Royal

Statistical Society, Series B, 37, 129-145. Bremer, R.H. (1993). Choosing and modeling your mixed linear-model. Commun Stat – Theory,

22, 3491-3521. Bryk, A. S., & Raudenbush, S. W. (1992). Hierarchical linear models: Applications and data analysis

methods. Newbury Park: Sage Publications. Burstein, Kim, & Delandshere, (1989). Carpenter, J. Goldstein, H., & Rasbash, J. (1999). A non-parametric bootstrap for multilevel

models. Multilevel Modeling Newsletter, 11, 2-5. Christensen, R. (1996). Exact tests for variance components. Biometrics, 52, 309-314. Collins, L. M., Schafer, J. L., & Kam, C. M (2001). A comparison of inclusive and restrictive

strategies in modern missing data procedures. Psychological Methods, 6, 330-351. Dempster, A., Laird, N., & Rubin, D. (1977). Maximum likelihood from incomplete data via the

EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1-8. Dickmeyer, N. (1989). Metaphor, model, and theory in education research. Teachers College Record,

91, 151-160. Donner A, Birkett N, Buck C. (1981). Randomization by cluster: Sample size requirements and

analysis. American Journal of Epidemiology, 114, 906-914. Draper, D. (1995). Inference and hierarchical modeling in the social sciences. Journal of

Educational and Behavioral Statistics, 20 (2), 115-147. Ferron, J., Dailey, R., & Yi, Q. (2002). Effects of misspecifying the first-level error structure in

two-level models of change. Multivariate Behavioral Research, 37, 379-403. Frane, J. W. (1976). Some simple procedures for handling missing data in multivariate analysis.

Psychometrika, 41, 409-415.

35

Giesbrecht, F., & Burns, J. (1985). Two-stage analysis based on a mixed model: Large-sample asymptotic theory and small-sample simulation results. Biometrics, 41, 477-486.

Gleason, T. C., & Staelin, R. (1975). A proposal for handling missing observations. Psychometrika,

40, 229-252. Goldstein, (1986). Multilevel mixed linear model analysis using iterative generalized least

squares. Biometrika, 73, 43-56. Goodwin, L. D. & Goodwin, W. L. (1985b). Statistical techniques in AERJ articles, 1979 – 1983: The

preparation of graduate students to read the educational literature. Educational Researcher, 14, 5 – 11.

Gosh, M., & Rao, J. (1994). Small area estimation: An appraisal. Statistical Science, 9, 55-93.

Guertin, W. H. (1968). Comparison of three methods of handling missing observations. Psychological Reports, 22, 896.

Heitjan, D. F. & Rubin, D.B. (1991). Ignorability and coarse data. Annals of Statistics, 12, 2244-

2253. Hopkins, K. D. (1982). The unit of analysis: Group means versus individual observations.

American Educational Research Journal, 19, 5-18. Hsieh, F. Y. (1988). Sample size formulae for intervention studies with the cluster as unit of

randomization. Statistics in Medicine, 7, 1195–1201. Jiang, JM (2001) Goodness-of-fit tests for mixed model diagnostics. The Annals of Statistics, 29,

1137-1164. Kackar, R., & Harville, D. (1981). Unbiasedness of two-stage estimation and prediction

procedures for mixed linear models. Communications in Statistics-Theory and Methods-A, 10, 1249-1261.

Kackar, R., & Harville, D. (1984). Approximations for standard errors of estimators of fixed and

random effects in mixed linear models. Journal of the American Statistical Association, 79, 853-862.

Kasim, R., & Raudenbush, S. (1998). Application of Gibbs sampling to nested variance

components models with heterogeneous with-in group variance. Journal of Educational and Behavioral Statistics, 20, 93-116.

Kenward, M., & Roger, J. (1997). Small sample inference for fixed effects from restricted

maximum likelihood. Biometrics, 53, 983-997. Keselman, H.J., Algina, J., Kowalchuk, R. K., & Wolfinger, R.D. (1998). A comparison of two

approaches for selecting covariance structures in the analysis of repeated measurements. Communications in Statistics: Simulation, 27, 591-604.

36

Keselman, H. J., Huberty, C. J., Lix, L.M., Olejnik, S., Cribbie, R.A., Donahue, B., Kowalchuk, Rhonda K., Lowman, L. L., Petoskey, M. D., Keselman, J. C., & Levin, J. R. (1998). Statistical Practices of Educational Researchers: An Analysis of Their ANOVA, MANOVA, and ANCOVA Analyses. Review of Educational, 68, 350-386.

Kim, J., & Curry J. (1977). The treatment of missing data in multivariate analysis. Sociological

Methods and Research, 6, 215-240. Kreft, I. G. G. (1995). Hierarchical linear models: Problems and prospects. Journal of Educational

and Behavioral Statistics, 20 (2), 109-133. Kreft, I.G.G., & de Leeuw, J. (1998). Introducing Multilevel Models. London: Sage. Kromrey, J. D., & Dickenson, W. B. (1996). Detecting unit of analysis problems in nested designs:

Statistical power and Type I error rates of the F test for groups-within-treatments effects. Educational and Psychological Measurement, 56, 215-231.

Lange, L., & Laird, N. M. (1989). The effect of covariance structure on variance estimation in

balanced growth-curve models with random parameters. Journal of the American Statistical Association, 84, 241-247.

Lee, Y., & Seely, J. (1996). Computing the Wald interval for a variance ratio. Biometrics, 52, 1486-

1491. Lindley, D., & Smith, A. (1972). Bayes estimates for the linear model. Journal of the Royal Statistical

Soceity, B, 34, 1-41. Linstrom, M., & Bates, D. (1989). Newton-Raphson and EM algorithms for linear mixed-effects

models for repeated-measures data. Journal of the American Statistical Association, 83, 1014-1022.

Little, T. D., Lindenberger, U., & Nesselroade, J. R. (1999). On selecting indicators for

multivariate measurement and modeling with latent variables: When “good” indicators are bad and “bad” indicators are good. Psychological Methods, 4, 192-211.

Little, R. J., & Rubin, D. B. (1987). Statistical analysis with missing data. New York: John Wiley &

Sons. Littell, R., Milliken, G., Stroup, W., & Wolfinger, R. (1996). SAS system for mixed models. Cary, NC:

SAS Institute Inc. Longford, N. (1987). A fast scoring algorithm for maximum likelihood estimation in unbalanced

mixed models with nested random effects. Biometrika, 74, 817-827. Longford, N. (1993). Regression analysis of multilevel data with measurement error. British

Journal of Mathematical and Statistical Psychology, 46, 301-311. Longford, N. (2001). Simulation-based diagnostics in random-coefficient models. Journal of the

Royal Statistical Society Serie A-Statistics in Society, 164, 259-273.

37

Meijer, E., van der Leeden, R., & Busing, F. (1995). Implementing the bootstrap multilevel model.

Multilevel Modeling Newsletter, 7, 7-11. Ofversten, J. (1993). Exact tests for variance components in unbalanced mixed linear models.

Biometrics, 49, 45-57. Raudenbush, S. W. (1997). Statistical analysis and optimal design for cluster randomized trials.

Psychological Methods, 2, 173-185. Raudenbush, S. W. & Bryk, A. S. (2002). Hierarchical linear models: Applications and data analysis

methods. Newbury Park: Sage Publications. Raudenbush, S., Bryk, A., Cheong, Y. & Congdon, R. (2000). HLM 5: Hierarchical Linear and

Nonlinear Modeling. Chicago: Scientific Software International. Raymond, M. R., & Roberts, D. M. (1987). A comparison of methods for treating incomplete data

in selection research. Educational and Psychological Measurement, 47, 13-26. Reboussin, D. M., & DeMets, D. L. (1996). Exact permutation inference for two sample repeated

measures data. Communications in Statistical Theory and Methods, 25, 2223-2238. Richardson, A., & Welsh, A. (1995). Robust restricted maximum likelihood in mixed linear

models. Biometrics, 51, 1429-1439. Ridgeway, V. G., Dunston, P. J., & Quian, G. (1993). A methodological analysis of teaching and

learning strategy research at the secondary school level. Reading Research Quarterly, 28, 335 – 349.

Roy, J., & Lin, X. (2002). Analysis of multivariate longitudinal outcomes with nonignorable

dropouts and missing covariates: Changes in methadone treatment practices. Journal of the American Statistical Association, 97, 40-52.

Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. New York: John Wiley & Sons,

Inc. Santos, R. (1981). Effects of imputation on regression coefficients. Proceedings of the Section on

Survey Research Methods, American Statistical Association, 141-145. SAS Institute Inc. (2000). SAS/Proc Mixed (version 8) [comuputer program], Carey, NC: SAS

Institute Inc. Schwartz, G. (1978). Estimating the dimensions of a model. Annals of Statistics, 6, 461-464. Seltzer, M. Novak, J., Choi, K., & Lim, N. (2002). Sensitivity analysis for hierarchical models

employing t level-1 assumptions. Journal of Educational and Behavioral Statistics, 27, 181-222. Shimodaira, H. (1998). An application of multiple comparison techniques to model selection.

Annals of the Institute of Statistical Mathematics, 50, 1-13.

38

Singer, J. D. (1998). Using SAS PROC MIXED to fit multilevel models, hierarchical models, and individual growth models. Journal of Educational and Behavioral Statistics, 24, 323-355.

Snijder, T., & Bosker, R. (1993). Standard errors and sample sizes for two-level research. Journal

of Educational Statistics, 18, 237-259. Snijder, T., & Bosker, R. (1994). Modeled variance in two-level models. Sociological Methods and

Research, 22, 342-363. Stern, S., & Welsh, A. (2000). Likelihood inference for small variance components. Canadian

Journal of Statistics, 28, 517-532. Teuscher, F., Herrendorfer, G., & Guiard, G. (1994). The estimation of skewness and kurtosis of

random effects in the linear model. Biometrical Journal, 36, 661-672. Wada, Y. & Kashiwagi, N. (1990). Selecting statistical models with information statistics. Journal of

Dairy Science, 73, 3575-3582. Wilkinson, L., & Task Force on Statistical Inference. (1999). Statistical methods in psychology

journals: Guidelines and explanations. American Psychologist, 54, 594-604. Wolfinger, R. (1993). Covariance structure selection in general mixed models. Communication

Statistics – Simulation, 22, 1079-1106. Woodhouse, G., Yang, M., Goldstein, H., & Rashbash, J. (1996). Adjusting for measurement error

in multilevel analysis. Journal of the Royal Statistical Society-A, 159, 201-212. Yu, Q., & Burdick, R. (1995). Confidence-intervals on variance components in regression-models

with balanced (Q-1)-Fold nested error structure. Communications in Statistics-Theory and Methods, 24, 1151-1167.

Zucker, D., Lieberman, O., & Manor, O. (2000). Improved small sample inference in the mixed

linear model: Bartlett correction and adjusted likelihood. Journal of the Royal Statistical Society-B, 62, 827-838.

39

Title:___________________________________________________________________________________ Author(s):_______________________________________________________________ Journal, Year, Vol (Number), pgs: ___________________________________________

Study Characteristics (place holder)

Is there an appendix provided with technical details? ______ Yes ________No

____ a. individuals nested in contexts ____ b. growth curves

Page(s): Comments/Notes 1. What best describes

the study type?

____ c. individuals nested in contexts within contexts ____ d. growth curves nested within contexts ____ e. other, describe: ___________________________

Page(s): 2. Is a rationale

(and/or advantage(s)) provided for using HLM methods in the study?

____ a. no ____ b. yes: _____________________________________ __________________________________________ __________________________________________ __________________________________________ __________________________________________

3. What is the study design?

_____ a. nonexperimental _____ b. experimental

Page(s):

4. Thoroughly describe the data set, including scope (national, etc.) if known.

Data set:__________________________________________ _________________________________________________ _________________________________________________ _________________________________________________ _________________________________________________

Page(s):

5. What type of sampling was used?

_____ a. nonprobability _____ b. probability _____ c. mixed—describe: _________________________ ________________________________________ ________________________________________

Page(s):

6. How many level 1 units per level 2 unit? (e.g., average number of students per school in nested designs, number of observations per student in growth curve designs) ________________

Page(s):

7. How many level 2 units? (e.g., number of schools in a nested design, number of students in a growth curve design, etc.) __________________

Page(s):

8. How well was the distribution of level 1 units across level 2 units addressed?

_____ a. not described _____ b. minimal description, e.g., only average number of

students per school or only number of observations per student with no further information

_____ c. described partially, e.g., one may know the maximum number of observations per student

_____ d. described fully so that it is clear how many level 1 units there are for each level 2 unit.

Page(s):

9. Fill out the following tables by listing the outcome(s) and predictors modeled

Outcome

Type 1 =

achievement 2 = other (specify)

Nature 1 = continuous

2 = binary 3 = count

4 = ordinal 5 = multinomial

Normality 0 = not discussed

1 = normal 2 = nonnormal

Outliers 0 = not

discussed 1 = no 2 = yes

Reliability 0 = not discussed

1 = estimated from data set 2 = other

Validity 0 = not discussed

1 = validity evidence gathered using this data 2 = other

Predictor

Nature

1 = continuous 2 = binary 3 = count

4 = ordinal 5 = multinomial

Normality

0 = not discussed 1 = normal

2 = nonnormal

Outliers

0 = not discussed

1 = no 2 = yes

Reliability

0 = not discussed 1 = estimated from data set

2 = other

Validity

0 = not discussed 1 = validity evidence gathered using

this data 2 = other

MODEL SPECIFICATION

10. How many models are examined in the study?

____________________________________________

Page(s):

Comments/Notes

11. How well were the number of models presented in this communicated?

_____ a. not explicit but can be determined from information

provided in text, tables, footnotes, etc. _____ b. explicit statement(s) of number of models examined _____ c. cannot be determined with confidence

Page(s):

12. Were baseline models run?

_____ a. no _____ b. yes _____ c. cannot be determined with confidence

Page(s):

13. How were the predictors selected?

NOTE: If different methods were used for different models in the study, please list all methods used

_____ a. not discussed or unclear _____ b. based at least partially on apriori considerations _____ c. based at least partially on significance tests for individual

predictors _____ d. based at least partially on effect size for individual

predictors _____ e. based at least partially on fit statistics like AIC or SBC _____ f. other: ____________________________________

Page(s):

14. Were there more than one set of predictors for each Dependent Variable?

_____ a. no _____ b. yes, but exact number of sets could not be determined _____ c. yes, number of different sets of predictors: _________ _____ d. cannot be determined with confidence

Page(s):

15. Were interactions examined and communicated? Check ALL that apply

_____ a. no _____ b. yes, level 1 interaction(s) _____ c. yes, level 2 interaction(s) _____ d. yes, across level interaction(s) _____ e. cannot be determined with confidence

Page(s):

16. How was the covariance structure of the model(s) specified?

NOTE: If different methods were used for different models in the study, please list all methods used

_____ a. not discussed, unclear, and/or appears defaults were used _____ b. established prior to looking at the data _____ c. based partially on fit statistics like the AIC or SBC _____ d. based partially on LRTs or significance tests for

individual variance components _____ e. other:____________________________________

Page(s):

17. Was there centering of variables at level 1?

_____ a. no discussion of centering _____ b. no centering _____ c. grand mean centering _____ d. group mean centering _____ e. other: ___________________________________ (e.g., growth curve centered at last point)

Page(s):

18. Was there centering of variables at level 2?

_____ a. no discussion of centering _____ b. no centering _____ c. grand mean centering _____ d. group mean centering _____ e. other: ___________________________________ (e.g., growth curve centered at last point)

Page(s):

19 How were the fixed effects (regression coefficients) in the model communicated (check all that apply)?

_____ a. series of regression equations _____ b. single mixed model equation _____ c. list of estimated effects _____ d. verbal description

Page(s):

20. How were variance structures in the model communicated (check all that apply)?

_____ a. not mentioned _____ b. equation representation _____ c. list of estimated variance parameters _____ d. verbal description _____ e. other: __________________________________

Page(s):

21. Is there evidence of generalizability?

_____ a. no _____ b. sensitivity analysis _____ c. cross-validation _____ d. replication/extension of previous research (explicit) _____ e. internal replication e.g., between group differences _____ f. other: ______________________________________

Page(s):

DATA

22. Was power considered?

_____ a. no discussion of power _____ b. general guidelines considered _____ c. power analysis conducted _____ d. other: __________________________________

Page(s):

Comments/Notes

23. Was there missing data?

_____ a. no missing data (skip to # 27) _____ b. no discussion of completeness of data (skip to #27) _____ c. missing data noted at level 1, e.g., attrition, absence

during testing, failure to complete instruments, etc. _____ d. missing data noted at level 2, e.g., attrition, absence during testing, failure to complete instruments, etc. _____ e. other: ___________________________________

Page(s):

24. If missing data were discussed, were relationships among missingness and other variables discussed?

_____ a. no _____ b. yes __________________________________________ __________________________________________

Page(s):

25. If there was missing data at level 1, how was it handled?

_____ a. not applicable _____ b. not discussed _____ c. listwise deletion _____ d. imputation. Type: _______________________ _____ e. other: __________________________________

Page(s):

26. If there was missing data at level 2, how was it handled?

_____ a. not applicable _____ b. not discussed _____ c. listwise deletion _____ d. imputation. Type: _______________________ _____ e. other: __________________________________

Page(s):

27. Were outliers present?

_____ a. not discussed _____ b. no _____ c. yes

Page(s):

28. What method was used to screen for outliers?

_____ a. not discussed _____ b. can’t tell _____ c. univariate methods _____ d. simulation diagnostics _____ e. residuals _____ f. other: _______________________________________ ____________________________________________

Page(s):

29. How was imperfect measurement handled?

_____ a. not discussed _____ b. consequences considered _____ c. other: _______________

Page(s):

30. Using the chart, indicate if distributional assumptions of the model were considered.

Assumption

Considered 0 = not discussed 1 = considered

Evidence of Violation 0 = not discussed 1 = examined and no violation found 2 = examined and violation found

Action Taken (if violated)

0 = ignored 1 = consequences considered 2 = corrective action taken

Level- 1 residuals~N

Lvl-1 residuals have equal variance for each lvl-2 unit

Level-2 residuals~N

ESTIMATION AND TESTING

31. What software package/ version was used? Please list name and version or year

_____ a. not given _____ b. package stated _____ c. package and version or year stated Software Information: __________________

Page(s): Comments/Notes

32. What method of estimation was used

_____ a. not given _____ b. given: _____________________________________

Page(s):

33. What estimation algorithm was used?

_____ a. not given _____ b. given: _____________________________________

Page(s):

34. Were any convergence problems encountered?

_____ a. not mentioned _____ b. no _____ c. yes __________________________________________ __________________________________________ __________________________________________

Page(s):

35. Were any of the covariance matrices not positive definite?

_____ a. not mentioned _____ b. no _____ c. yes __________________________________________ __________________________________________ __________________________________________

Page(s):

36. For which variance/covariance parameters were estimates provided?

36-a. error variance of the intercepts (ττττ00)

_____ a. not applicable since not estimated _____ b. provided _____ c. estimated but not provided _____ d. insufficient information provided

Page(s):

36-b. error variance of each regression coefficient or slope (e.g., ττττ11, ττττ22)

_____ a. not applicable since not estimated _____ b. provided _____ c. estimated but not provided _____ d. insufficient information provided

Page(s):

36-c. covariance between the intercept and slope errors (e.g., ττττ12, ττττ23)

_____ a. not applicable since not estimated _____ b. provided _____ c. estimated but not provided _____ d. Insufficient information provided

Page(s):

36-d. first level error variance (typically σσσσ2, but could be both σσσσ2

and ρρρρ)

_____ a. provided _____ b. estimated but not provided

Page(s):

37. What additional variance parameter information is provided? Check all that apply

_____ a. none _____ b. SEs _____ c. confidence intervals _____ d. significance tests _____ e. reliabilities _____ f. ICCs _____ g. explained variance

Page(s):

38. If CIs or significance tests were reported for variance parameters, what method was used

_____ a. not applicable _____ b. not stated _____ c. SE/z-estimate _____ d. χ2 , type (if given): ____________________ _____ e. exact, bootstrap, other__________________

Page(s):

39. What fixed effect parameter information is provided? Check all that apply.

_____ a. none _____ b. SEs _____ c. confidence intervals _____ d. significance tests _____ e. point estimates __________________________ _____ f. other (e.g., effect size): _____________________

Page(s):

40. If CIs or significance tests were reported for fixed effects, what method was used?

_____ a. not stated _____ b. likelihood ratio _____ c. t or F test, degree of freedom method NOT stated _____ d. t or F test, degree of freedom method IS stated _____ e. permutation, bootstrap, other

Page(s):

41. What level-1 parameter information is provided? Please check all that apply.

_____ a. none _____ b. estimates provided, but method not stated _____ c. OLS or EB estimates _____ d. statistical tests for OLS or EB estimates _____ e. CIs for OLS or EB estimates _____ f. other (e.g., just equations)___________________ ________________________________________

Page(s):

42. Is there something else in this study that needs to be ‘captured’ that hasn’t been addressed in other items??

____________________________________________ ____________________________________________ ____________________________________________ ____________________________________________ ____________________________________________

hierarchical linear modeling: a review of methodological ... · hierarchical linear measurement...

Documents