Methods of Psychological Research Online 2000, Vol.5, No.2 Institute for Science Education
Internet: http://www.mpr-online.de © 2000 IPN Kiel
What sample sizes are needed to get correct significance
levels for log-linear models? - A Monte Carlo Study using
the SPSS-procedure "Hiloglinear"
Ingeborg STELZL1
Abstract
Pearson's χ² and the likelihood-ratio statistic G² are the most common and widely
used test statistics for log-linear models. They are both asymptotically distributed as
chi-squared variables. The present article reports the results of a Monte Carlo study
which compares the two test statistics for two-, three- and four-dimensional contingency
tables, employing conditions which may be judged reasonable for psychological research
and using one of the most prominent computer programs (SPSS "Hiloglinear"). Our results
are consistent with previous research in that, on the whole, Pearson's χ² behaves
better than G². As a rule of thumb one may state that Pearson's χ² will not result in
severely inflated alpha values (empirical values of .075 or larger for a nominal level of
.05) if the total sample size equals five times the number of cells and the smallest
expected cell frequency is larger than 0.50. In contrast, the likelihood-ratio statistic G²
yields in some cases severely inflated empirical alpha values for the higher interactions
even if the total sample size equals ten times the number of cells and the smallest
expected cell frequency is larger than one. In those cases where sample size is large enough to
use Pearson's χ², it is preferable to G², as it is generally closer to the nominal
alpha. For cases not covered by this rule parametric bootstrapping is recommended.
Key words: contingency tables, log-linear models, significance tests, Pearson's χ²,
likelihood ratio G², Monte Carlo study, simulation, SPSS Hiloglinear
1 Author's address: Prof. Dr. Ingeborg Stelzl, Philipps-University, Fachbereich Psychologie, Gutenbergstr.
18, D-35032 Marburg, Germany Tel.:+49-6421-2823669, Fax: +49-6421-2828929, e-mail:
96 MPR-Online 2000, Vol. 5, No. 2
1. Introduction
Log-linear models may be used for the analysis of contingency tables to test hypotheses
about main effects, two-way and higher-order interactions. Although several other
procedures have been suggested (for an overview see Read & Cressie, 1988; for two-way
tables also Goodman, 1996), Pearson's χ² and the likelihood-ratio statistic G²
are still the best known and most widely used test statistics for significance testing. As
both procedures, Pearson's χ² and G², are based on asymptotic theory, i. e., derived for
large sample sizes only, we are left with the question which of the two statistics should
be preferred with moderate sample sizes. The present paper reports the results of
a Monte Carlo study, from which some guidelines are derived to answer this question.
However, in recent years the role of significance testing in the behavioral and
social sciences has been questioned on epistemological grounds, the reasons for
significance testing have become controversial, and it has been debated whether significance
testing should be abandoned altogether (see Cohen, 1994; Gigerenzer, 1993; Harlow,
Mulaik & Steiger, 1997; Sedlmeier, 1996; Iseler, 1997; Sedlmeier, 1998; Brandstätter,
1999). Therefore, we will first comment on how the specific topic of our
present study may be embedded in this debate.
Some of the main objections raised against the practice of significance
tests are: (1) that significance tests are often misunderstood and misinterpreted (for
an overview and critical discussion of common misunderstandings see Mulaik, Raj and
Marshman, 1977; for an attempt at a reformulation of the logical basis of significance
testing see Harris, 1997; Jones, 1999); (2) that significance tests are superfluous, as the
null hypothesis is always known to be false, and that they are prohibitive to scientific
progress if non-significant results are not published (see Schmidt & Hunter, 1997); (3)
that significance tests do not contain the relevant information concerning effect sizes
and accuracy of estimation. (For alternatives which may be used instead of significance
tests, or as supplements, see Brandstätter, 1999; Meehl, 1997; Reichardt & Gollob, 1997;
Sedlmeier, 1996.)
This debate may be expected to continue for a long time, as it reaches far into
the philosophy of science. The force of some arguments (e. g., that the null
hypothesis is always known to be false) will also depend on substantive grounds and
vary with the field of application. Nevertheless, presumably most researchers engaged in
this discussion would agree with the following statements:
I. Stelzl: Sample sizes needed for log-linear models 97
(1) Significance tests must not be the main criterion for judging the scientific impact of
an empirical study, e. g., in deciding whether or not it should be published. The results
of an empirical study are reported insufficiently and inappropriately if only test
statistics and p-values from significance tests are given. At least for the main hypotheses,
effect sizes (parameter estimates with confidence intervals and/or global effect size
measures, e. g., proportion of predicted variance) should be reported. Furthermore, there
should be a descriptive summary of the data including as many details as possible (e. g.,
all cell frequencies of a multidimensional contingency table if a log-linear model is
employed, the complete correlation matrix if a structural equation model is to be fitted to
the data). This will enable other authors to reanalyze the data with their own models.
(2) Significance testing requires sample sizes that are sufficiently large to provide
adequate statistical power (Cohen, 1988, suggests 0.8 as a minimum for standard cases,
but higher values, e. g. 0.95, may be required depending on the specific hypothesis
under question) for effect sizes which are judged to be relevant for the specific matter of
research (these may be small, medium or large effects as defined by Cohen, 1988, or
Erdfelder, Faul & Buchner, 1996).
(3) If the results of significance tests are given by tail probabilities (p-values), these
p-values should be computed correctly. If only asymptotic distributions are known for
some test statistic, this raises the question of what minimal sample sizes are required to
get correct p-values from those asymptotic distributions.
The present study contributes to this field with reference to a special class of statistical
models, i. e. log-linear models for the analysis of contingency tables, for which two
competing asymptotic test procedures, Pearson's χ² and G², are compared.
The test statistics in the study: Pearson's χ² and the likelihood-ratio statistic G²
Both test procedures require that one first estimates the expected cell frequencies
under the null hypothesis. E. g., if a 4-factor interaction is to be tested, the
model for the null hypothesis contains free parameters for all main effects, 2-factor and
3-factor interactions, whereas the parameters for the 4-factor interaction are fixed to
zero.
The two test statistics, Pearson's χ² and G², are defined as follows:

Pearson's χ²:

    χ² = Σ_{i=1..k} (x_i − m̂_i)² / m̂_i        (1)
Likelihood-ratio statistic G²:

    G² = 2 Σ_{i=1..k} x_i ln(x_i / m̂_i)        (2)

where
    i = 1 ... k   index for the cells, with
    k             total number of cells
    m̂_i           estimate of the expected cell frequency in cell i under the null hypothesis
    x_i           observed cell frequency in cell i
Both test statistics are asymptotically distributed as chi-squared variables, with the
number of degrees of freedom equal to the total number of cells minus the number of
estimated free parameters.
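In code, the two statistics are straightforward to evaluate once the expected frequencies have been estimated. The following Python sketch is our own illustration (the study itself used SPSS); it computes formulas (1) and (2) for a 2x2 table whose expected frequencies come from the closed-form independence model:

```python
import math

def pearson_chi2(x, m):
    """Pearson's chi-squared: sum of (x_i - m_i)^2 / m_i over all cells."""
    return sum((xi - mi) ** 2 / mi for xi, mi in zip(x, m))

def likelihood_ratio_g2(x, m):
    """Likelihood-ratio statistic: 2 * sum of x_i * ln(x_i / m_i);
    cells with x_i = 0 contribute 0 (the limit of x*ln x as x -> 0)."""
    return 2.0 * sum(xi * math.log(xi / mi) for xi, mi in zip(x, m) if xi > 0)

# Illustrative 2x2 table, flattened row by row; expected frequencies from
# the independence model: m_ij = (row total * column total) / N.
x = [20, 10, 15, 15]
N = sum(x)
rows = [x[0] + x[1], x[2] + x[3]]
cols = [x[0] + x[2], x[1] + x[3]]
m = [rows[i] * cols[j] / N for i in range(2) for j in range(2)]
# Degrees of freedom: number of cells minus number of estimated free
# parameters; for the 2x2 independence model, df = 4 - 3 = 1.
```

For this table both statistics are close to 1.7 on 1 df, illustrating that they usually agree well when no cell is sparse.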
Several Monte Carlo studies have been conducted to investigate the behavior of
Pearson's χ² and/or G² for small or moderate sample sizes (Agresti & Yang, 1987; Berry &
Mielke, 1988; Chapman, 1976; Haber, 1984; Hosmane, 1986, 1987; Koehler, 1986; Koehler
& Larntz, 1980; Larntz, 1978; Lawal, 1984; Milligan, 1980; Rudas, 1986; Upton, 1982).
Most of them concentrate upon two-way tables, whereas less is known about higher-order
interactions, which require iterative algorithms. When Pearson's χ² and G² are compared
within the same study, results are heterogeneous for one-way tables, whereas most
of the studies investigating two-way and higher-order tables find a tendency for
Pearson's χ² to perform somewhat better than G².
One-way tables were investigated by Chapman (1976) and Koehler and Larntz (1980)
with heterogeneous results. Lawal (1984), who compared four test statistics including
Pearson's χ² and G², found Pearson's χ² to be closest to the nominal alpha.
Two-way tables with 2x2 cells were investigated by Upton (1982) with sample sizes
ranging from n = 14 to 96. He compared twelve test statistics, including Pearson's χ²
and G². He got the best results when Pearson's χ² was modified by a multiplication factor
of (n - 1)/n. Yet Pearson's χ² without a correction factor was also found to perform
well.
Hosmane (1986) investigated two-way tables with 2x2 up to 9x9 cells and sample
sizes ranging from 10 to 190. He compared Pearson's χ² and G² to some modifications
and concluded that χ² without any modification yields the best results.
Agresti and Yang (1987) investigated contingency tables with 2x3 up to 10x10 cells
and sample sizes of N = 50 or 100. They found acceptable results for Pearson's χ² if the
expected average cell frequency was not smaller than one. For direct testing of a log-
linear model the distribution of Pearson's χ² was closer to the asymptotic chi-squared
distribution than the distribution of G². On the other hand, for comparing two
unsaturated models G² outperformed Pearson's χ² in many cases.
Berry and Mielke (1988) investigated 2x2 up to 3x4 tables with sample sizes ranging
from 20 to 80. They compared five test statistics and found a nonasymptotic chi-
squared test to be superior in overall performance to the other four tests, including
Pearson's χ² and G².
Three-way tables with 2x2x2 cells and sample sizes from 10 to 90 were studied by
Milligan (1980). Both Pearson's χ² and G² showed a tendency toward depressed α-levels
for main effects and 2-factor interactions, whereas considerably inflated α-levels
occurred in some cases for the 3-factor interactions with Pearson's χ². Three-way tables with
2x2x2 cells were also studied by Haber (1984) using larger sample sizes ranging from 40
to 400. He compared six test statistics, including Pearson's χ² and G². Among the tests
which do not inflate the nominal significance level, Pearson's χ² was found to be the
most powerful.
Besides several one- and two-way contingency tables, Larntz (1978) also investigated
a 3x3x3 design. He compared Pearson's χ² to G² and a further statistic and concluded
that Pearson's χ² should be preferred because its Type I error rates were closest to the
nominal level.
Rudas (1986) studied two- and three-way tables with 2x2 up to 3x3x5 cells and
sample sizes ranging from 15 to 150. He compared Pearson's χ², G² and a further
statistic suggested by Cressie and Read (1985). The results for Pearson's χ² and the Cressie
and Read statistic were very similar, and these two statistics were found more
appropriate than G² for small sample sizes. Koehler (1986) also studied two- and three-way
tables and 2^k tables. He found acceptable results for Pearson's χ² except in some cases
of sparse tables containing both very small and moderately large expected frequencies. The
accuracy of G², on the other hand, was judged generally unacceptable, producing greatly
inflated Type I error levels in some cases and deflated levels in others. For very large
and sparse tables other asymptotic procedures were preferable to Pearson's χ² and G².
Hosmane (1987), who investigated tables with 2x2x2 to 4x4x3 cells and sample sizes
ranging from 10 to 200, compared five test statistics including Pearson's χ² and G². He
recommends Pearson's χ² as the test statistic closest to the nominal level α.
The present paper contributes to this field by focusing on three-way and four-way
designs. Whereas many of the previous studies used very small sample sizes in order
to examine the behavior of the test statistics at the lower end, the present study
focuses on sample sizes which may be judged realistic and reasonable for psychological
research. That means that sample sizes should be sufficiently large to provide acceptable
statistical power at least for large effects.
As the results of an iterative estimation procedure may depend on the quality of the
mathematical algorithm, it was decided to employ for our simulation study a widely used
procedure from a well-known statistical package with the program options left at their
default values. The procedure "Hiloglinear" from SPSS was chosen, with the aim of
deriving guidelines for when the results from this procedure can be used without a risk of
severely inflated or deflated Type I error rates and which of the two test statistics
provided by the program should be preferred.
2. The Monte Carlo study
Our simulation study includes designs with two categories per variable (2x2, 2x2x2,
2x2x2x2) and designs with up to four categories per variable (4x3, 4x3x2, 4x3x2x2). The
first series of simulations was run with uniform marginals, sampling from a null model
with zero main effects and zero interactions. The sample sizes chosen were 2.5, 5 or 10
times the number of cells. Smaller sample sizes were not considered because of the lack
of statistical power that would result. According to Erdfelder (1992) and Erdfelder, Faul
and Buchner (1996), a sample size of N = 32 is required for a chi-squared test with one
degree of freedom to reach a power value of 0.8 at alpha = .05 for large effects, and N =
88 for medium effects; for a chi-squared test with six degrees of freedom the required
sample sizes are N = 55 and N = 152, respectively (though these values are also based
on asymptotic theory, they may nevertheless be used as rough estimates).
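These power figures can be checked with a short noncentral chi-squared computation. The sketch below is our own illustration (standard library only, not part of the original study); it assumes Cohen's (1988) effect-size conventions, w = .5 for a large and w = .3 for a medium effect, with noncentrality parameter λ = N·w²:

```python
import math

def chi2_cdf(x, k):
    """CDF of the central chi-squared distribution with k df, via the
    series expansion of the regularized lower incomplete gamma function."""
    if x <= 0:
        return 0.0
    a, z = k / 2.0, x / 2.0
    term = math.exp(a * math.log(z) - z - math.lgamma(a + 1.0))
    total, n = term, 1
    while term > 1e-16 * total:
        term *= z / (a + n)
        total += term
        n += 1
    return min(total, 1.0)

def chi2_crit(alpha, k):
    """Upper-alpha critical value of chi-squared(k), found by bisection."""
    lo, hi = 0.0, 10.0 * k + 50.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi2_cdf(mid, k) < 1.0 - alpha:
            lo = mid
        else:
            hi = mid
    return mid

def power(N, w, df, alpha=0.05):
    """Power of the chi-squared test for effect size w: the noncentral
    CDF is evaluated as a Poisson mixture of central chi-squared CDFs."""
    lam, crit = N * w * w, chi2_crit(alpha, df)
    cdf, weight, j = 0.0, math.exp(-lam / 2.0), 0
    while weight > 1e-14 or j < lam:
        cdf += weight * chi2_cdf(crit, df + 2 * j)
        j += 1
        weight *= (lam / 2.0) / j
    return 1.0 - cdf
```

With these definitions, power(32, .5, 1) and power(55, .5, 6) both come out near 0.8, reproducing the Erdfelder et al. figures cited above.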
A further series of simulations was run with the same designs and the same sample
sizes, but differing marginal distributions. For designs with two categories the marginals
were .7, .3 or .8, .2; for designs with three- and four-categorial variables the marginals
were .1, .2, .3, .4 for variable A, .1, .4, .5 for variable B, and .5, .5 or .8, .2 for
variables C and D. The population values for all interactions were zero. Due to the
differences in the marginal probabilities there was considerable variation in the expected
cell frequencies within the designs, and in many cases some of the expected frequencies
were smaller than one.
3. The simulation program2
The data were generated by an SPSS input program as follows: The SPSS procedure
"uniform" was used to generate for each person a value for each variable, drawn from a
uniform distribution on the interval [0, 1]; e. g., for a 4x3x2x2 design, four variables A,
B, C, D. Next the variables were divided into categories: E. g., when variable A was
designed to have four categories a1 to a4 with marginal probabilities .1, .2, .3, .4, a person
was assigned to category a1 if his/her value for A was in the range 0 to .1, to category
a2 if it was between .1 and .3, etc. The procedure continued with categorizing the
next variable until the person was categorized completely. Then the values for the next
person were drawn and categorized, until a sample of the required size was completed.
Then the input program was closed and the procedure "Hiloglinear" was started.
This input program, including the line calling the procedure "Hiloglinear", was written
into an SPSS macro which was executed 10 000 times. This resulted in a very long
SPSS output, which was filtered by a program written in UNIX awk to find the lines
with the results of the significance tests. The fields containing the p-values were
transferred to a new file, which was then used as an input file for an SPSS program to assess
the empirical distribution of the p-values. The p-values ≤ .05 were counted to get
empirical rejection rates.
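The whole pipeline (generate persons, categorize, test, count rejections) can be reproduced in miniature without SPSS. The following Python sketch is our own illustration for the simplest condition, a 2x2 table with uniform marginals, using Pearson's χ² with the closed-form independence estimates (note that, unlike Hiloglinear's default, no delta constant is added):

```python
import math
import random

def chi2_sf_df1(x):
    """Upper tail of chi-squared with 1 df: P(X > x) = 1 - erf(sqrt(x/2))."""
    return 1.0 - math.erf(math.sqrt(x / 2.0))

def simulate_rejection_rate(N, reps=10_000, alpha=0.05, seed=1):
    """Empirical alpha of Pearson's chi^2 for independence in a 2x2 table
    with uniform marginals (the complete null hypothesis)."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        counts = [0, 0, 0, 0]
        for _ in range(N):                 # draw one person, categorize A and B
            a = int(rng.random() < 0.5)
            b = int(rng.random() < 0.5)
            counts[2 * a + b] += 1
        rows = [counts[0] + counts[1], counts[2] + counts[3]]
        cols = [counts[0] + counts[2], counts[1] + counts[3]]
        if 0 in rows or 0 in cols:         # degenerate table, skip (rare)
            continue
        chi2 = sum((counts[2 * i + j] - rows[i] * cols[j] / N) ** 2
                   / (rows[i] * cols[j] / N)
                   for i in range(2) for j in range(2))
        if chi2_sf_df1(chi2) <= alpha:
            rejections += 1
    return rejections / reps
```

For N = 40 (10 per cell) the returned rate lands near the nominal .05, in line with the corresponding entry of Table 1.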
The procedure "Hiloglinear" from SPSS for Unix Release 5.0 is described in the
manual "SPSS Statistical Algorithms" as follows: First, maximum-likelihood estimates of
the expected cell frequencies under the specific model under investigation are computed
using the iterative proportional fitting algorithm as described by Fienberg (1977). The
following default values for the program options were left unchanged: To avoid problems
with zero frequencies the program adds a constant delta = .5 to all empirical cell
frequencies before the fitting algorithm is started. The fitting algorithm stops if the largest
change of an expected cell frequency in two consecutive iterations is less than .25 or if
20 iterations have been executed.
2 The students Mrs. Sigrid Kühl, Mrs. Verena Polz and Mrs. Karina Wahl were engaged in the
development and running of the simulation program.
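To make the fitting step concrete, here is a bare-bones version of iterative proportional fitting for one case, the all-two-way model in a 2x2x2 table. This is our own Python illustration, not Hiloglinear's implementation: it adds no delta constant and uses a much stricter convergence criterion than the defaults just described.

```python
import math

def ipf_no_threeway(x, tol=1e-8, max_iter=200):
    """Fit the model with all main effects and 2-factor interactions
    (3-factor interaction fixed to zero) to a 2x2x2 table x[i][j][k]
    of observed counts, by iterative proportional fitting."""
    m = [[[1.0, 1.0], [1.0, 1.0]] for _ in range(2)]
    # For each two-way margin, the pair of cells summed over the third variable:
    margin_cells = (
        lambda u, v: ((u, v, 0), (u, v, 1)),  # AB margin (sum over C)
        lambda u, v: ((u, 0, v), (u, 1, v)),  # AC margin (sum over B)
        lambda u, v: ((0, u, v), (1, u, v)),  # BC margin (sum over A)
    )
    for _ in range(max_iter):
        change = 0.0
        for cells_of in margin_cells:
            for u in (0, 1):
                for v in (0, 1):
                    cells = cells_of(u, v)
                    obs = sum(x[i][j][k] for i, j, k in cells)
                    cur = sum(m[i][j][k] for i, j, k in cells)
                    ratio = obs / cur     # scale fitted cells to match margin
                    for i, j, k in cells:
                        change = max(change, abs(m[i][j][k] * (ratio - 1.0)))
                        m[i][j][k] *= ratio
        if change < tol:
            break
    return m

def g2_threeway(x, m):
    """G^2 for the 3-factor interaction (1 df in a 2x2x2 table)."""
    return 2.0 * sum(x[i][j][k] * math.log(x[i][j][k] / m[i][j][k])
                     for i in (0, 1) for j in (0, 1) for k in (0, 1)
                     if x[i][j][k] > 0)

# Illustrative counts; after fitting, every two-way margin of m matches
# the corresponding margin of x, while the three-way term of m stays zero.
x = [[[10, 14], [8, 12]], [[9, 11], [13, 15]]]
m = ipf_no_threeway(x)
```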
Based on these expected frequencies as compared to the empirical cell counts, the two
test statistics Pearson's χ² and G² are computed, and significance tests for the fit of the
model are performed using either of the two statistics. The resulting p-values are given.
The output provides significance tests for the fit of the following models:
M0: Global null hypothesis. All main effects and interactions are zero.
M1: Model with free parameters for the main effects. 2-factor and higher interactions
are zero.
M2: Model with free parameters for main effects and 2-factor interactions. 3-factor
and higher interactions are zero.
M3: Model with free parameters for main effects, 2-factor and 3-factor interactions.
The 4-factor interaction is zero.
Next, hypotheses about main effects, 2-factor, 3-factor and 4-factor interactions are
tested separately by hierarchical model comparisons (chi-square difference tests for
nested models):
Test whether the marginals differ significantly from uniform distributions (M1
vs. M0)
Test of the 2-factor interactions (M2 vs. M1)
Test of the 3-factor interactions (M3 vs. M2)
Test of the 4-factor interaction (saturated model with main effects and all
interactions including the 4-factor interaction vs. M3)
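The logic of these comparisons is simply that the difference in G² between two nested models is itself referred to a chi-squared distribution, with df equal to the difference in df. A tiny sketch (the deviance numbers below are hypothetical, purely for illustration):

```python
def difference_test(g2_restricted, df_restricted, g2_general, df_general):
    """Chi-square difference test for nested log-linear models: the
    restricted model (more parameters fixed to zero) never fits better,
    so the G^2 difference is nonnegative and is referred to a chi-squared
    distribution with df_restricted - df_general degrees of freedom."""
    return g2_restricted - g2_general, df_restricted - df_general

# Hypothetical deviances: M1 (main effects only) vs. M2 (adds 2-factor
# interactions) -- the difference tests the 2-factor terms jointly.
stat, df = difference_test(10.5, 5, 4.5, 2)
```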
Furthermore, significance tests using G² are performed for each of the main effects
and each of the interactions separately ("partial associations"). These tests are also
based on hierarchical comparisons: E. g., when in a model with four variables the 3-factor
interaction ABC is to be tested, two models with the following higher-order marginals
fitted to the data are compared:
H0: model with all main effects, all 2-factor interactions and the 3-factor
interactions ABD, BCD and ACD (i. e., all 3-factor interactions except ABC)
H1: model with all main effects, all 2-factor interactions and all 3-factor
interactions including ABC
In the next section simulation results will be reported for the significance tests of the
global null hypothesis, for groups of hypotheses (all main effects, all 2-factor
interactions, etc.) and in some cases also for partial associations.
4. Results
Table 1 gives the empirical rejection rates corresponding to a nominal level of alpha
= .05 under the condition of the complete null hypothesis, i. e. with uniform marginal
distributions. As each entry in the table is based on 10 000 samples, 95 percent of the
values should fall into the interval .046 to .054 if the true alphas were .05. Even a
rough inspection of Table 1 shows that the majority of the values (64%) lie outside this
range and thus depart significantly from the nominal value. However, when applying
asymptotic results to finite sample sizes one cannot expect to get exact error rates. In what
follows we will call increased empirical values up to .075 "moderately" inflated; higher
values will be called "severely" inflated. Severely inflated values are underlined in the
tables. On the other hand, empirical error rates below the nominal level indicate a
conservative significance test, presumably with increased Type II error rates. Therefore,
values at or below .025 are printed in italics.
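The band .046 to .054 is just the normal-approximation 95 percent interval for a proportion of .05 estimated from 10 000 independent samples:

```python
import math

# Monte Carlo error of an empirical rejection rate: if the true alpha is
# .05 and each table entry is based on 10 000 simulated samples, the
# estimate has standard error sqrt(.05 * .95 / 10000) ~ .0022, so the
# 95 percent interval is .05 +/- 1.96 * .0022, i.e. roughly .046 to .054.
p, n_samples = 0.05, 10_000
half_width = 1.96 * math.sqrt(p * (1 - p) / n_samples)
interval = (p - half_width, p + half_width)
```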
Table 1 shows that there is a general tendency for G² to yield higher empirical
alphas than Pearson's χ² (in 60 cases out of 72 the value for G² is higher than that for
Pearson's χ², in 1 case equal, and in 11 cases lower). Most of the cases of severely
increased alpha values occur for G², whereas severely depressed alphas occur for neither
statistic. Giving equal weight to discrepancies in either direction (increased or depressed
values), Pearson's χ² is found to be closer to the nominal alpha than G² in 52 cases,
equal in 2 cases, and more discrepant than G² in 18 cases. Thus one may state that
Pearson's χ² is on the whole closer to the nominal level than G².
Next we will look at the results for the two statistics in more detail:
Pearson's χ²: There are only two cases of severely inflated alpha levels, both
occurring when 4-factor interactions are tested with the smallest sample sizes (2.5 times the
number of cells). When the sample sizes are increased, the empirical error rates improve
and differ only "moderately" from the nominal level.
Table 1: Empirical alpha-values for a nominal alpha = .05 when the complete null
hypothesis is true (uniform marginals, no interactions)

                                         significance tests
Design      N  n/cell   global        marginals     2-way int.    3-way int.    4-way int.
                        G²    χ²      G²    χ²      G²    χ²      G²    χ²      G²    χ²
2x2        10   2.5    .077  .036    .047  .059    .100  .052
2x2        20   5      .048  .038    .052  .046    .058  .051
2x2        40  10      .053  .042    .048  .048    .050  .050
4x3        30   2.5    .091  .042    .057  .053    .102  .045
4x3        60   5      .067  .047    .056  .053    .071  .046
4x3       120  10      .057  .048    .056  .057    .056  .047
2x2x2      20   2.5    .083  .043    .059  .049    .076  .050    .095  .069
2x2x2      40   5      .067  .045    .048  .050    .062  .047    .068  .060
2x2x2      80  10      .056  .047    .052  .055    .055  .049    .060  .058
4x3x2      60   2.5    .106  .056    .053  .058    .075  .045    .129  .063
4x3x2     120   5      .078  .054    .059  .060    .062  .050    .082  .057
4x3x2     240  10      .060  .053    .054  .055    .055  .050    .060  .053
2x2x2x2    40   2.5    .095  .045    .052  .055    .065  .043    .107  .065    .111  .078
2x2x2x2    80   5      .068  .047    .053  .059    .055  .043    .067  .050    .077  .069
2x2x2x2   160  10      .060  .053    .054  .053    .054  .047    .061  .054    .058  .057
4x3x2x2   120   2.5    .141  .054    .057  .068    .063  .044    .122  .052    .160  .091
4x3x2x2   240   5      .086  .056    .052  .061    .055  .047    .078  .048    .098  .073
4x3x2x2   480  10      .065  .054    .048  .055    .054  .048    .064  .051    .068  .063

N = total sample size; n/cell = N divided by the number of cells
Apart from a conservative tendency for 2x2 tables with small sample sizes, the tests
for the global null hypothesis lead to acceptable Type I error rates (.042 to .056). The
tests for the main effects, 2-factor and 3-factor interactions lead to somewhat larger but
still "moderate" departures from the nominal level (.043 to .069). The tests for the
4-factor interactions lead to the two severely inflated values mentioned above, occurring
with sample sizes of 2.5 times the cell number. For sample sizes equalling 5 or 10 times
the cell number, these error rates were only moderately inflated.
The likelihood-ratio G²: Whereas the results for the main effects are satisfactory
(.047 to .059), 2-factor and higher interactions show an increasing tendency to yield
inflated Type I error rates. Severely inflated empirical alpha levels occur for all designs
with sample sizes only 2.5 times the number of cells and for many cases with sample sizes
5 times the cell number. Only with sample sizes equal to 10 times the cell number do all
departures stay in the range defined as only "moderately" discrepant.
Tables 2a and 2b give the empirical rejection rates for all simulations with main
effects present, i. e. with marginals differing from the uniform distribution. The sample
sizes used are the same as in Table 1, equalling 2.5, 5, or 10 times the cell
number. Yet, due to the variation in the marginal probabilities, there was considerable
variation in the expected cell frequencies within a design. The smallest expected cell
frequency within a design is given in the column headed "min E(ni)" in Tables 2a and
2b.
The alternative hypothesis was true for the global tests and the tests of the
marginals. Except for the smallest sample sizes (N = 10, N = 20), statistical power exceeded
0.90 in all cases. As the results for the global tests and the tests for the marginal
distributions were very much alike, only the results for the marginal distributions are
reported.
Table 2a refers to designs with two categories per variable (2x2 to 2x2x2x2), whereas
Table 2b contains the results for designs with up to four categories per variable (4x3 to
4x3x2x2). Taking the two groups of designs together, the behavior of the two test
statistics may be summarized as follows:
Pearson's χ² leads to only moderate discrepancies from the nominal level (empirical
levels .043 to .071) when the total sample size equals at least 5 times the cell number
and the smallest expected cell frequency is larger than 0.5. With expected cell
frequencies smaller than 0.5, serious departures occur in either direction (severely inflated or
depressed empirical levels) without following a simple pattern.
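The two conditions can be bundled into a simple planning check (our own helper function; the thresholds are the ones found in this study):

```python
def pearson_rule_ok(total_n, n_cells, min_expected):
    """Rule of thumb from this study: Pearson's chi^2 avoids severely
    inflated alpha when the total sample size is at least 5 times the
    number of cells AND the smallest expected cell frequency exceeds .5."""
    return total_n >= 5 * n_cells and min_expected > 0.5

# e.g. the 4x3x2 condition with N = 120 (5 per cell) and min E(ni) = .6
# satisfies the rule, while the condition with min E(ni) = .24 does not.
```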
Table 2a: Empirical rejection rates for a nominal level of alpha = .05, when main effects
are present but all interactions are zero. Designs with only two-categorial variables.

                                            significance tests
Design      N    n   min     marginal    marginals#    2-way int.   3-way int.   4-way int.
                    E(ni)    probs       G²    χ²      G²    χ²     G²    χ²     G²    χ²
2x2        10   2.5   .9     .3 .7      .331  .361    .071  .046
2x2        20   5    1.8      "         .647  .607    .074  .045
2x2        40  10    3.6      "         .922  .913    .065  .052
2x2        10   2.5   .4     .2 .8      .697  .744    .034  .054
2x2        20   5     .8      "         .970  .957    .055  .049
2x2        40  10    1.6      "        1.00  1.00     .070  .046
2x2x2      20   2.5   .54    .3 .7      .768  .725    .088  .065   .070  .044
2x2x2      40   5    1.08     "         .977  .972    .069  .051   .072  .057
2x2x2      80  10    2.16     "        1.00  1.00     .053  .047   .068  .055
2x2x2      20   2.5   .16    .2 .8      .990  .990    .059  .086   .036  .016
2x2x2      40   5     .32     "        1.00  1.00     .068  .065   .056  .039
2x2x2      80  10     .64     "        1.00  1.00     .069  .058   .058  .052
2x2x2x2    80   5     .65    .3 .7     1.00  1.00     .062  .052   .096  .071   .081  .051
2x2x2x2   160  10    1.30     "        1.00  1.00     .053  .050   .069  .053   .079  .061
2x2x2x2    80   5     .13    .2 .8     1.00  1.00     .081  .101   .071  .090   .022  .011
2x2x2x2   160  10     .26     "        1.00  1.00     .058  .069   .078  .081   .050  .032

# As the alternative hypothesis holds for the main effects, the values in these columns
are empirical power values.
N = total sample size
n = N divided by the number of cells
min E(ni) = smallest expected cell frequency
Table 2b: Empirical rejection rates for a nominal level of alpha = .05, when main effects
are present but all interactions are zero. Designs with more than two categories per
variable.

                                                        significance tests
Design      N   n   min    marginals (A / B / C / D)               marginals#   2-way int.   3-way int.   4-way int.
                   E(ni)                                           G²    χ²     G²    χ²     G²    χ²     G²    χ²
4x3        60   5   .6    .1 .2 .3 .4 / .1 .4 .5                  1.00  1.00   .065  .046
4x3       120  10  1.2     "                                      1.00  1.00   .065  .048
4x3x2     120   5   .6    .1 .2 .3 .4 / .1 .4 .5 / .5 .5          1.00  1.00   .071  .060   .081  .043
4x3x2     240  10  1.2     "                                      1.00  1.00   .063  .056   .081  .052
4x3x2     120   5   .24   .1 .2 .3 .4 / .1 .4 .5 / .2 .8          1.00  1.00   .074  .076   .067  .042
4x3x2     240  10   .48    "                                      1.00  1.00   .066  .064   .072  .049
4x3x2x2   240   5   .6    .1 .2 .3 .4 / .1 .4 .5 / .5 .5 / .5 .5  1.00  1.00   .061  .060   .102  .064   .089  .056
4x3x2x2   480  10  1.2     "                                      1.00  1.00   .054  .052   .081  .053   .091  .065
4x3x2x2   240   5   .24   .1 .2 .3 .4 / .1 .4 .5 / .5 .5 / .2 .8  1.00  1.00   .062  .072   .112  .093   .060  .033
4x3x2x2   480  10   .48    "                                      1.00  1.00   .050  .056   .088  .064   .082  .053
4x3x2x2   240   5   .10   .1 .2 .3 .4 / .1 .4 .5 / .2 .8 / .2 .8  1.00  1.00   .067  .096   .117  .125   .028  .019
4x3x2x2   480  10   .19    "                                      1.00  1.00   .066  .071   .010  .085   .060  .034

# As the alternative hypothesis holds for the main effects, the values in these columns
are empirical power values.
N = total sample size
n = N divided by the number of cells
min E(ni) = smallest expected cell frequency
Table 3: Empirical alpha values for a nominal alpha = .05 when testing partial
associations using G². Two design conditions selected from Table 2b.

a) Design 4x3x2:

  N    n   min E(ni)  marginals                           AB     AC     BC     ABC
 120   5     .6       .1 .2 .3 .4 / .1 .4 .5 / .5 .5     .072   .061   .062   .081
 240  10    1.2        "                                 .063   .056   .054   .081

b) Design 4x3x2x2:

  N    n   min E(ni)  marginals                                   AB     AC     BC     AD     BD     CD
 240   5     .6       .1 .2 .3 .4 / .1 .4 .5 / .5 .5 / .5 .5     .063   .056   .054   .055   .051   .051
 480  10    1.2        "                                         .052   .052   .052   .056   .051   .050

  N    n   min E(ni)  marginals                                   ABC    ABD    ACD    BCD    ABCD
 240   5     .6       .1 .2 .3 .4 / .1 .4 .5 / .5 .5 / .5 .5     .097   .096   .076   .077   .089
 480  10    1.2        "                                         .078   .071   .059   .061   .091

N = total sample size
n = N divided by the number of cells
min E(ni) = smallest expected cell frequency
The likelihood-ratio G² yields for the 2-factor interactions only moderate discrepancies
from the nominal level (empirical levels .053 to .070) when the total sample size equals
10 times the cell number and the smallest expected cell frequency is larger than 1.
However, even when these conditions are satisfied, the significance tests of the 3- and
4-factor interactions lead in many cases to seriously increased alpha levels. Therefore, G²
cannot be recommended for these tests.
Comparing Pearson's χ² to G², we find that in all cases which satisfy the above rule
for Pearson's χ² (total sample size 5 times the cell number, smallest expected cell
frequency > .5), Pearson's χ² is closer to the nominal level than G².
Table 3 shows the results for the tests of the partial associations in some of the larger
designs (4x3x2 and 4x3x2x2) using G². These results also indicate that severely inflated
alpha levels occur for 3- and 4-factor interactions even under the conditions of large
sample sizes (10 times the cell number) and smallest expected cell frequencies larger
than 1.
4.1 Supplementary results
Since the goodness-of-fit values obtained for a model and the resulting p-values may
also depend to some extent on the quality of the numerical optimization procedure, we
wanted to check whether an increase in numerical accuracy would affect the obtained
error rates.
Two designs were chosen: the 2x2x2 design with unequal marginals (.8, .2) and
sample size 80, and the 2x2x2x2 design with unequal marginals (.8, .2) and sample size
160. The latter had produced severely inflated error rates for both Pearson's χ² and
G². The program options for numerical accuracy were changed from the default values
of a maximum of 20 iterations and a convergence criterion of .25 to a maximum of 50
iterations and a convergence criterion of .05. For each of the two designs a simulation
with 10 000 data sets was run under the improved accuracy conditions. The results were
very close to those under the default conditions. In particular, the severely increased
Type I error rates for the 3-factor interactions in the 2x2x2x2 design did not improve.
Next, the value of the constant delta, which is added to all observed cell frequencies
to avoid problems with empty cells, was changed from its default value .5 to .1. The
same two design conditions as before (2x2x2 with marginals .8, .2, N = 80; 2x2x2x2 with
marginals .8, .2, N = 160) and two further conditions (2x2x2x2 with marginals .7, .3, N =
160; 4x3x2x2 with uniform marginals, N = 480) were chosen. The results are heterogeneous:
there were substantial improvements in many cases, especially for G², but also
deteriorations.
For those cases which satisfy the conditions given above for Pearson's χ² (sample
size at least 5 times the number of cells, smallest expected cell frequency larger than .5),
no severe departures from the nominal level occurred for either Pearson's χ² or G²,
and Pearson's χ² was always closer to the nominal level than G².
Furthermore, we wanted to check whether the procedure "Hiloglinear" in "SPSS for Windows 8.0" differs in any respect from the procedure in "SPSS for UNIX Release 5.0" which was used in our simulations. We chose the last six design conditions from Table 2b (4x3x2x2 designs with unequal marginals) and generated one sample for each condition. These samples were analyzed using "Hiloglinear" from both "SPSS for Windows 8.0" and "SPSS for UNIX 5.0" with the program options left at their default values (delta = .5, convergence = .25, iterate = 20). The output from the two versions of SPSS was identical. Then the options were changed to delta = 0, convergence = .01 and iterate = 50. Again, the output was the same for both SPSS versions.
5. Discussion
The results of the present study are in accordance with the majority of previous findings in concluding that Pearson's χ² is generally closer to the nominal level than G². Pearson's χ² did not lead to seriously increased type I error rates when the smallest expected cell frequency was larger than .5 and the total sample size was at least 5 times the number of cells. This rule was found to hold for main effects and 2-factor interactions as well as for higher order interactions. For G², on the other hand, a rule was found only for main effects and 2-factor interactions (to avoid seriously increased type I error rates the smallest expected cell frequency should be larger than 1 and the total sample size at least 10 times the cell number), whereas in many cases significance tests of higher order interactions led to seriously increased alpha levels even when this rule was satisfied.
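These two rules of thumb lend themselves to a simple pre-analysis screen. The following sketch is only an illustration; the function name and interface are our own and not part of any statistical package:

```python
def screen_sample_size(n_total, n_cells, min_expected):
    """Rule-of-thumb screen based on the simulation results.

    Returns a pair of booleans: whether Pearson's chi-square, and whether
    G-squared (for main effects and 2-factor interactions only), can be
    expected to hold the nominal alpha level approximately.
    """
    pearson_ok = n_total >= 5 * n_cells and min_expected > 0.5
    g2_ok = n_total >= 10 * n_cells and min_expected > 1.0
    return pearson_ok, g2_ok
```

For example, a 2x2x2x2 table (16 cells) with n = 80 and a smallest expected cell frequency of 0.6 passes the screen for Pearson's χ² but not for G².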
However, a Monte Carlo study can only yield limited information on the behavior of a test statistic, as only a limited number of cases can be simulated, and the cases included will usually differ in some respects from the conditions that a researcher is faced with when analyzing his/her data. In the present study we aimed at choosing the conditions in a way that covers the most typical cases of psychological research, and as a result of our study we defined an area of conditions where one may feel on the safe side using SPSS standard procedures.
Nevertheless, important questions are left open: What should one do when the above rules for applying asymptotic procedures are not satisfied? For example, when higher order interactions are to be tested and the smallest expected cell frequencies are smaller than .5, i.e., too small for either G² or Pearson's χ².
In such situations it would be desirable to provide bootstrapping procedures which enable the researcher to run his/her own simulation study with his/her specific design conditions and parameter values and with the actual values for computational accuracy and the constant delta. Based on this simulation, the distribution of the test statistic could be assessed empirically and used instead of the asymptotic distribution to compute the tail probability needed for the significance test. For example, when a 4-factor interaction is to be tested one might proceed as follows: First, the empirical data are used to estimate the parameters under the model of the null hypothesis. As the 4-factor interaction is in question, this is a model containing main effects, 2-factor and 3-factor interactions, but no 4-factor interaction. Next, a large number of data sets with the actual sample size is generated from this model. To each data set the null hypothesis model, with all parameters free except the 4-factor interaction, and the alternative model, with all parameters free including the 4-factor interaction (i.e. the saturated model), are fitted. Comparing these two models, the test statistic for the 4-factor interaction is computed, and the distribution of the test statistic over the samples as replications is assessed. To perform the significance test for the 4-factor interaction one may use the empirical distribution of the test statistic to estimate the 95th percentile point and use it as the critical value, or, alternatively, one may estimate the tail probability (p-value) from the proportion of samples which have led to a value exceeding the one computed from the real data set.
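The logic of this procedure can be sketched in a deliberately simplified form. In the illustration below the null model is plain row-column independence in a two-way table, which can be fitted in closed form, rather than the no-4-factor-interaction model of the text, which would require iterative proportional fitting in each replication; all function names are our own:

```python
import random

def pearson_chi2(observed, expected):
    """Pearson's X^2 = sum over cells of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def independence_fit(table, rows, cols):
    """Expected cell frequencies under the independence model for a
    two-way table stored row-major as a flat list of counts."""
    n = sum(table)
    row_sums = [sum(table[r * cols:(r + 1) * cols]) for r in range(rows)]
    col_sums = [sum(table[r * cols + c] for r in range(rows)) for c in range(cols)]
    return [row_sums[r] * col_sums[c] / n for r in range(rows) for c in range(cols)]

def bootstrap_p_value(table, rows, cols, n_boot=2000, seed=1):
    """Parametric bootstrap: generate tables from the fitted null model,
    refit the null model to each one, and estimate the tail probability
    of the observed test statistic."""
    rng = random.Random(seed)
    n = sum(table)
    expected = independence_fit(table, rows, cols)
    observed_stat = pearson_chi2(table, expected)
    cell_probs = [e / n for e in expected]
    exceed = 0
    for _ in range(n_boot):
        # draw one bootstrap table of n observations from the null model
        draws = rng.choices(range(rows * cols), weights=cell_probs, k=n)
        boot = [0] * (rows * cols)
        for d in draws:
            boot[d] += 1
        boot_expected = independence_fit(boot, rows, cols)
        if min(boot_expected) <= 0:
            continue  # a margin came out empty; skip this replication
        if pearson_chi2(boot, boot_expected) >= observed_stat:
            exceed += 1
    return exceed / n_boot
```

For the problem described in the text one would replace independence_fit by a fit of the model with main effects and all 2- and 3-factor interactions and compute χ² or G² from that fit; the bootstrap logic itself is unchanged.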
Similarly, one may also obtain estimated power values: a further simulation study might be run, generating Monte Carlo samples from a model with parameter values specified according to the respective alternative hypothesis. The proportion of cases leading to a test statistic exceeding the 95th percentile point obtained under the null hypothesis yields an estimate of statistical power for this alternative hypothesis.
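This power estimation can likewise be sketched in simplified, self-contained form, again using the independence null in a 2x2 table instead of a higher order interaction; the function names and the example cell probabilities are our own illustrative choices:

```python
import random

def chi2_2x2(counts):
    """Pearson X^2 for a 2x2 table given as [a, b, c, d] (row-major).
    Returns None if a marginal is empty."""
    a, b, c, d = counts
    n = a + b + c + d
    r1, r2, c1, c2 = a + b, c + d, a + c, b + d
    if min(r1, r2, c1, c2) == 0:
        return None
    expected = [r1 * c1 / n, r1 * c2 / n, r2 * c1 / n, r2 * c2 / n]
    return sum((o - e) ** 2 / e for o, e in zip(counts, expected))

def sample_table(cell_probs, n, rng):
    """Draw one 2x2 table of n observations from the given cell probabilities."""
    counts = [0, 0, 0, 0]
    for cell in rng.choices(range(4), weights=cell_probs, k=n):
        counts[cell] += 1
    return counts

def estimate_power(null_probs, alt_probs, n, n_sim=1000, seed=2):
    """Monte Carlo power estimate: the 95th percentile of the statistic
    under the null model serves as critical value; power is the
    proportion of alternative-model samples exceeding it."""
    rng = random.Random(seed)
    null_stats = sorted(
        s for s in (chi2_2x2(sample_table(null_probs, n, rng)) for _ in range(n_sim))
        if s is not None)
    critical = null_stats[int(0.95 * len(null_stats))]
    alt_stats = (chi2_2x2(sample_table(alt_probs, n, rng)) for _ in range(n_sim))
    exceed = sum(1 for s in alt_stats if s is not None and s > critical)
    return critical, exceed / n_sim
```

Using the simulated 95th percentile rather than the asymptotic critical value ties the power estimate to the same finite-sample distribution that the bootstrap significance test itself uses.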
This approach, which has been implemented in the program PANMARK by Van de Pol, Langeheine and De Jong (1991), is called "sophisticated bootstrapping" or "parametric bootstrapping" and has been presented and demonstrated with a variety of log-linear models and latent class models also by Langeheine, Pannekoek and Van de Pol (1996) and Langeheine, Van de Pol and Pannekoek (1997). Von Davier (1997) developed parametric bootstrapping procedures for various item response models, as for these models the number of cells (= possible response patterns) becomes large even for moderate item numbers, and hence the sample sizes available are in practice nearly always too small to use asymptotic tests such as Pearson's χ² or G².
So far, parametric bootstrapping is the only alternative when the sample size is known to be too small for asymptotic procedures. Furthermore, it may also be helpful when the actual design conditions differ substantially from those covered by our tables and generalizations from our Monte Carlo study become hazardous.
As similar considerations apply in principle to all other asymptotically derived significance tests, it would be desirable if parametric bootstrapping procedures were included in all major statistical packages whenever asymptotic tests are applied and only rough knowledge is available about their behavior with finite sample sizes.
References
[1] Agresti, A. & Yang, M.C. (1987). An empirical investigation of some effects of sparseness in contingency tables. Computational Statistics and Data Analysis, 5, 9-21.
[2] Berry, K.J. & Mielke, P.W. Jr. (1988). Monte Carlo comparisons of the asymptotic
chi-square and Likelihood-ratio tests with the nonasymptotic chi-square test for
sparse r x c tables. Psychological Bulletin, 103, 256-264.
[3] Brandstätter, E. (1999). Confidence interval as an alternative to significance testing. Methods of Psychological Research-Online, Vol. 4, No. 2. http://www.mpr-online.de
[4] Chapman, J.W. (1976). A comparison of the X², - 2 log R, and multinomial probability criteria for significance tests when expected frequencies are small. Journal of the American Statistical Association, 71, 854-863.
[5] Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.).
Hillsdale, N.J.: Lawrence Erlbaum.
[6] Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003.
[7] von Davier, M. (1997). Methoden zur Prüfung probabilistischer Testmodelle. Kiel: IPN (Institut für die Pädagogik der Naturwissenschaften an der Universität Kiel), Olshausenstraße 62, D-24098 Kiel.
[8] Erdfelder, E., Faul, F. & Buchner, A. (1996). GPOWER: A general power analysis
program. Behavior Research Methods, Instruments, & Computers, 28, 1-11.
[9] Faul, F. & Erdfelder, E. (1992). GPOWER: A priori-, post hoc-, and compromise
power analysis for MS-DOS (Computer program). Bonn: Bonn University.
[10] Fienberg, S.E. (1977). The analysis of cross-classified categorical data. Cambridge, MA: The MIT Press.
[11] Gigerenzer, G. (1993). The superego, the ego, and the id in statistical reasoning. In Keren, G. & Lewis, C. (Eds.), A handbook for data analysis in the behavioral sciences: Methodological issues (pp. 311-339). Hillsdale: Lawrence Erlbaum.
[12] Goodman, L.A. (1996). A single general method for the analysis of cross-classified data: Reconciliation and synthesis of some methods of Pearson, Yule, and Fisher, and also some methods of correspondence analysis and association analysis. Journal of the American Statistical Association, 91, 408-428.
[13] Haber, M. (1984). A comparison of tests for the hypothesis of no three-factor interaction in 2 x 2 x 2 contingency tables. Journal of Statistical Computation and Simulation, 20, 205-215.
[14] Harlow, L.L., Mulaik, S.A. & Steiger, J.H. (1997). What if there were no significance tests? Mahwah, N.J.: Lawrence Erlbaum.
[15] Harris, R.J. (1997). Reforming significance testing via three-valued logic. In Harlow, L.L., Mulaik, St.A. & Steiger, J.H. (Eds.), What if there were no significance tests? Mahwah, N.J.: Lawrence Erlbaum.
[16] Hosmane, B. (1986). Improved likelihood ratio tests and Pearson chi-square tests for independence in two dimensional contingency tables. Communications in Statistics - Theory and Methods, 15, 1875-1888.
[17] Hosmane, B. (1987). An empirical investigation of chi-square tests for the hypothesis of no three-factor interaction in I x J x K contingency tables. Journal of Statistical Computation and Simulation, 28, 167-178.
[18] Iseler, A. (1997). Signifikanztests: Ritual, guter Brauch und gute Gründe. Methods of Psychological Research-Online, Diskussionsforum. URL http://www.pabst-publishers.de/impr/forum_e.html
[19] Jones, L.V. (1999). A sensible reformulation of the significance test. ViSta: The Visual Statistics System. http://forrest.psych.unc.edu/jones-tukey112399.html
[20] Koehler, K.J. (1986). Goodness-of-fit tests for log-linear models in sparse contingency tables. Journal of the American Statistical Association, 81, 483-492.
[21] Koehler, K.J. & Larntz, K. (1980). An empirical investigation of goodness-of-fit
statistics for sparse multinomials. Journal of the American Statistical Association,
75, 336-344.
[22] Langeheine, R., Pannekoek, J. & Van de Pol, F. (1996). Bootstrapping goodness-of-fit measures in categorical data analysis. Sociological Methods & Research, 24, 492-516.
[23] Langeheine, R., Van de Pol, F. & Pannekoek, J. (1997). Kontingenztabellen-Analyse bei kleinen Stichproben: Probleme bei der Prüfung der Modellgültigkeit mittels Chi-Quadrat Statistiken. Empirische Pädagogik, 11, 63-77.
[24] Larntz, K. (1978). Small-sample comparisons of exact levels for chi-squared
goodness-of-fit statistics. Journal of the American Statistical Association, 73, 253-
263.
[25] Lawal, H.B. (1984). Comparisons of X², Y², Freeman-Tukey and William's improved G² test statistics in small samples of one-way multinomials. Biometrika, 71, 415-458.
[26] Meehl, P.E. (1997). The problem is epistemology, not statistics: Replace significance tests by confidence intervals and quantify accuracy of risky numerical predictions. In Harlow, L.L., Mulaik, St.A. & Steiger, J.H. (Eds.), What if there were no significance tests? (pp. 393-426). Mahwah, N.J.: Lawrence Erlbaum.
[27] Milligan, G.W. (1980). Factors that affect Type I and Type II error rates in the
analysis of multidimensional contingency tables. Psychological Bulletin, 87, 238-244.
[28] Mulaik, St.A., Raju, N.S. & Harshman, R.A. (1997). There is a time and a place for significance testing. In Harlow, L.L., Mulaik, St.A. & Steiger, J.H. (Eds.), What if there were no significance tests? (pp. 65-115). Mahwah, N.J.: Lawrence Erlbaum.
[29] Read, T.R.C. & Cressie, N.A.C. (1988). Goodness-of-fit statistics for discrete multivariate data. New York: Springer.
[30] Reichardt, Ch.S. & Gollob, H.F. (1997). When confidence intervals should be used instead of statistical significance tests, and vice versa. In Harlow, L.L., Mulaik, St.A. & Steiger, J.H. (Eds.), What if there were no significance tests? (pp. 259-284). Mahwah, N.J.: Lawrence Erlbaum.
[31] Rudas, T. (1986). A Monte Carlo comparison of the small sample behaviour of the
Pearson, the likelihood ratio and the Cressie-Read statistics. Journal of Statistical
Computation and Simulation, 24, 107-120.
[32] Schmidt, F.L. & Hunter, J.E. (1997). Eight common but false objections to the discontinuation of significance testing in the analysis of research data. In Harlow, L.L., Mulaik, St.A. & Steiger, J.H. (Eds.), What if there were no significance tests? (pp. 37-64). Mahwah, N.J.: Lawrence Erlbaum.
[33] Sedlmeier, P. (1996). Jenseits des Signifikanztest-Rituals: Ergänzungen und Alternativen. Methods of Psychological Research Online, 1, 41-63.
[34] Sedlmeier, P. (1998). Was sind die guten Gründe für Signifikanztests? Diskussionsbeitrag zu Sedlmeier (1996) und Iseler (1997). Methods of Psychological Research-Online, 3, 39-42.
[35] SPSS Statistical Algorithms (n.d.). Chicago: SPSS Inc.
[36] Upton, G.J.G. (1982). A comparison of alternative tests for the 2 x 2 comparative
trial. Journal of the Royal Statistical Society Series A, 145, 86-105.
[37] Van de Pol, F., Langeheine, R. & De Jong, W. (1991). PANMARK user manual:
PANel analysis using MARKov chains. Voorburg: Netherlands Central Bureau of
Statistics.