Methods of Psychological Research Online 2000, Vol.5, No.2 Institute for Science Education
Internet: http://www.mpr-online.de © 2000 IPN Kiel
What sample sizes are needed to get correct significance
levels for log-linear models? - A Monte Carlo Study using
the SPSS-procedure "Hiloglinear"
Ingeborg STELZL1
Abstract
Pearson's χ² and the likelihood-ratio statistic G² are the most common and widely
used test statistics for log-linear models. They are both asymptotically distributed as
chi-squared variables. The present article reports the results of a Monte Carlo study
which compares the two test statistics for two-, three- and four-dimensional contingency
tables, employing conditions which may be judged reasonable for psychological research
and using one of the most prominent computer programs (SPSS "Hiloglinear"). Our results
are consistent with previous research in that, on the whole, Pearson's χ² behaves
better than G². As a rule of thumb one may state that Pearson's χ² will not result in
severely inflated alpha values (empirical values of .075 or larger for a nominal level of
.05) if the total sample size equals five times the number of cells and the smallest
expected cell frequency is larger than 0.50. In contrast, the likelihood-ratio statistic G²
yields in some cases severely inflated empirical alpha values for the higher interactions
even if the total sample size equals ten times the number of cells and the smallest
expected cell frequency is larger than one. In those cases where sample size is large enough to
use Pearson's χ², it is preferable to G², as it is generally closer to the nominal
alpha. For cases not covered by this rule parametric bootstrapping is recommended.
Key words: contingency tables, log-linear models, significance tests, Pearson's χ²,
likelihood ratio G², Monte Carlo study, simulation, SPSS Hiloglinear
1 Author's address: Prof. Dr. Ingeborg Stelzl, Philipps-University, Fachbereich Psychologie, Gutenbergstr.
18, D-35032 Marburg, Germany Tel.:+49-6421-2823669, Fax: +49-6421-2828929, e-mail:
96 MPR-Online 2000, Vol. 5, No. 2
1. Introduction
Log-linear models may be used for the analysis of contingency tables to test hypotheses
about main effects, two-way and higher-order interactions. Although several other
procedures have been suggested (for an overview see Read & Cressie, 1988; for two-way
tables also Goodman, 1996), Pearson's χ² and the likelihood-ratio statistic G²
are still the best known and most widely used test statistics for significance testing. As
both procedures, Pearson's χ² and G², are based on asymptotic theory, i. e., derived for
large sample sizes only, we are left with the question which of the two statistics should
be preferred with moderate sample sizes. The present paper reports the results of
a Monte Carlo study, from which some guidelines are derived to answer this question.
However, in recent years the role of significance testing in the behavioral and
social sciences has been questioned on epistemological grounds, the reasons for
significance testing have become controversial, and it has been debated whether significance
testing should be abandoned altogether (see Cohen, 1994; Gigerenzer, 1993; Harlow,
Mulaik & Steiger, 1997; Sedlmeier, 1996; Iseler, 1997; Sedlmeier, 1998; Brandstätter,
1999). Therefore, we will first comment on how the specific topic of our
present study may be embedded in this debate.
Some of the main objections raised against the practice of significance
tests are: (1) that significance tests are often misunderstood and misinterpreted (for
an overview and critical discussion of common misunderstandings see Mulaik, Raj and
Marshman, 1977; for an attempt at a reformulation of the logical basis of significance
testing see Harris, 1997; Jones, 1999); (2) that significance tests are superfluous, as the
null hypothesis is always known to be false, and that they are prohibitive to scientific
progress if non-significant results are not published (see Schmidt & Hunter, 1997); (3)
that significance tests do not contain the relevant information concerning effect sizes
and accuracy of estimation. (For alternatives which may be used instead of significance
tests, or as supplements, see Brandstätter, 1999; Meehl, 1997; Reichardt & Gollob, 1997;
Sedlmeier, 1996.)
This debate may be expected to continue for a long time, as it reaches far into
the philosophy of science. The force of some arguments (e. g., that the null
hypothesis is always known to be false) will also depend on substantive grounds and
vary with the field of application. Nevertheless, presumably most researchers engaged in
this discussion would agree with the following statements:
I. Stelzl: Sample sizes needed for log-linear models 97
(1) Significance tests must not be the main criterion for judging the scientific impact of
an empirical study, e. g., in deciding whether or not it should be published. The results
of an empirical study are reported insufficiently and inappropriately if only test
statistics and p-values from significance tests are given. At least for the main hypotheses,
effect sizes (parameter estimates with confidence intervals and/or global effect size
measures, e. g., proportion of predicted variance) should be reported. Furthermore, there
should be a descriptive summary of the data including as many details as possible (e. g.,
all cell frequencies of a multidimensional contingency table if a log-linear model is
employed, the complete correlation matrix if a structural equation model is to be fitted to
the data). This will enable other authors to reanalyze the data with their own models.
(2) Significance testing requires sample sizes that are sufficiently large to provide
adequate statistical power (Cohen, 1988, suggests 0.8 as a minimum for standard cases,
but higher values, e. g. 0.95, may be required depending on the specific hypothesis
under question) for effect sizes which are judged to be relevant for the specific matter of
research (these may be small, medium or large effects as defined by Cohen, 1988, or
Erdfelder, Faul & Buchner, 1996).
(3) If the results of significance tests are given by tail probabilities (p-values), these
p-values should be computed correctly. If only asymptotic distributions are known for
some test statistic, this raises the question of what minimal sample sizes are required to
get correct p-values from those asymptotic distributions.
The present study contributes to this field with reference to a special class of statistical
models, i. e. log-linear models for the analysis of contingency tables, for which two
competing asymptotic test procedures, Pearson's χ² and G², are compared.
The test statistics in the study: Pearson's χ² and the likelihood-ratio statistic G²
Both test procedures require that one first estimates the expected cell frequencies
under the null hypothesis. E. g., if a 4-factor interaction is to be tested, the
model for the null hypothesis contains free parameters for all main effects, 2-factor and
3-factor interactions, whereas the parameters for the 4-factor interaction are fixed to
zero.
The two test statistics, Pearson's χ² and G², are defined as follows:

Pearson's χ²:

    χ² = Σ_{i=1..k} (x_i − m̂_i)² / m̂_i        (1)
Likelihood-ratio statistic G²:

    G² = 2 Σ_{i=1..k} x_i ln(x_i / m̂_i)        (2)

where
    i = 1 ... k   index for the cells, with
    k             total number of cells
    m̂_i           estimate of the expected cell frequency in cell i under the null hypothesis
    x_i           observed cell frequency in cell i
Both test statistics are asymptotically distributed as chi-squared variables, with the
number of degrees of freedom equal to the total number of cells minus the number of
estimated free parameters.
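In code, the two statistics are straightforward to evaluate once the expected frequencies have been estimated. The following Python sketch is our own illustration (the study itself used SPSS); it computes formulas (1) and (2) for a 2x2 table whose expected frequencies come from the closed-form independence model:

```python
import math

def pearson_chi2(x, m):
    """Pearson's chi-squared: sum of (x_i - m_i)^2 / m_i over all cells."""
    return sum((xi - mi) ** 2 / mi for xi, mi in zip(x, m))

def likelihood_ratio_g2(x, m):
    """Likelihood-ratio statistic: 2 * sum of x_i * ln(x_i / m_i);
    cells with x_i = 0 contribute 0 (the limit of x*ln x as x -> 0)."""
    return 2.0 * sum(xi * math.log(xi / mi) for xi, mi in zip(x, m) if xi > 0)

# Illustrative 2x2 table, flattened row by row; expected frequencies from
# the independence model: m_ij = (row total * column total) / N.
x = [20, 10, 15, 15]
N = sum(x)
rows = [x[0] + x[1], x[2] + x[3]]
cols = [x[0] + x[2], x[1] + x[3]]
m = [rows[i] * cols[j] / N for i in range(2) for j in range(2)]
# Degrees of freedom: number of cells minus number of estimated free
# parameters; for the 2x2 independence model, df = 4 - 3 = 1.
```

For this table both statistics are close to 1.7 on 1 df, illustrating that they usually agree well when no cell is sparse.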
Several Monte Carlo studies have been conducted to investigate the behavior of
Pearson's χ² and/or G² for small or moderate sample sizes (Agresti & Yang, 1987; Berry &
Mielke, 1988; Chapman, 1976; Haber, 1984; Hosmane, 1986, 1987; Koehler, 1986; Koehler
& Larntz, 1980; Larntz, 1978; Lawal, 1984; Milligan, 1980; Rudas, 1986; Upton, 1982).
Most of them concentrate upon two-way tables, whereas less is known about higher-order
interactions, which require iterative algorithms. When Pearson's χ² and G² are compared
within the same study, results are heterogeneous for one-way tables, whereas most
of the studies investigating two-way and higher-order tables find a tendency for
Pearson's χ² to perform somewhat better than G².
One-way tables were investigated by Chapman (1976) and Koehler and Larntz (1980)
with heterogeneous results. Lawal (1984), who compared four test statistics including
Pearson's χ² and G², found Pearson's χ² to be closest to the nominal alpha.
Two-way tables with 2x2 cells were investigated by Upton (1982) with sample sizes
ranging from n = 14 to 96. He compared twelve test statistics, including Pearson's χ²
and G². He got the best results when Pearson's χ² was modified by a multiplication factor
of (n - 1)/n. Yet Pearson's χ² without a correction factor was also found to perform
well.
Hosmane (1986) investigated two-way tables with 2x2 up to 9x9 cells and sample
sizes ranging from 10 to 190. He compared Pearson's χ² and G² to some modifications
and concluded that χ² without any modification yields the best results.
Agresti and Yang (1987) investigated contingency tables with 2x3 up to 10x10 cells
and sample sizes of N = 50 or 100. They found acceptable results for Pearson's χ² if the
expected average cell frequency was not smaller than one. For direct testing of a log-
linear model the distribution of Pearson's χ² was closer to the asymptotic chi-squared
distribution than the distribution of G². On the other hand, for comparing two
unsaturated models G² outperformed Pearson's χ² in many cases.
Berry and Mielke (1988) investigated 2x2 up to 3x4 tables with sample sizes ranging
from 20 to 80. They compared five test statistics and found a nonasymptotic chi-
squared test to be superior in overall performance to the other four tests, including
Pearson's χ² and G².
Three-way tables with 2x2x2 cells and sample sizes from 10 to 90 were studied by
Milligan (1980). Both Pearson's χ² and G² showed a tendency toward depressed α-levels
for main effects and 2-factor interactions, whereas considerably inflated α-levels
occurred in some cases for the 3-factor interactions with Pearson's χ². Three-way tables with
2x2x2 cells were also studied by Haber (1984) using larger sample sizes ranging from 40
to 400. He compared six test statistics, including Pearson's χ² and G². Among the tests
which do not inflate the nominal significance level, Pearson's χ² was found to be the
most powerful.
Besides several one- and two-way contingency tables, Larntz (1978) also investigated
a 3x3x3 design. He compared Pearson's χ² to G² and a further statistic and concluded
that Pearson's χ² should be preferred because its Type I error rates were closest to the
nominal level.
Rudas (1986) studied two- and three-way tables with 2x2 up to 3x3x5 cells and
sample sizes ranging from 15 to 150. He compared Pearson's χ², G² and a further
statistic suggested by Cressie and Read (1985). The results for Pearson's χ² and the Cressie
and Read statistic were very similar, and these two statistics were found more
appropriate than G² for small sample sizes. Koehler (1986) also studied two- and three-way
tables and 2^k tables. He found acceptable results for Pearson's χ² except in some cases
of sparse tables containing both very small and moderately large expected frequencies. The
accuracy of G², on the other hand, was judged generally unacceptable, producing greatly
inflated Type I error levels in some cases and deflated levels in others. For very large
and sparse tables other asymptotic procedures were preferable to Pearson's χ² and G².
Hosmane (1987), who investigated tables with 2x2x2 to 4x4x3 cells and sample sizes
ranging from 10 to 200, compared five test statistics including Pearson's χ² and G². He
recommends Pearson's χ² as the test statistic closest to the nominal level α.
The present paper contributes to this field by focusing on three-way and four-way
designs. Whereas many of the previous studies used very small sample sizes in order
to examine the behavior of the test statistics at the lower end, the present study
focuses on sample sizes which may be judged realistic and reasonable for psychological
research. That means that sample sizes should be sufficiently large to provide acceptable
statistical power at least for large effects.
As the results of an iterative estimation procedure may depend on the quality of the
mathematical algorithm, it was decided to employ for our simulation study a widely used
procedure from a well-known statistical package with the program options left at their
default values. The procedure "Hiloglinear" from SPSS was chosen, with the aim of
deriving guidelines for when the results from this procedure can be used without a risk of
severely inflated or deflated Type I error rates and which of the two test statistics
provided by the program should be preferred.
2. The Monte Carlo study
Our simulation study includes designs with two categories per variable (2x2, 2x2x2,
2x2x2x2) and designs with up to four categories per variable (4x3, 4x3x2, 4x3x2x2). The
first series of simulations was run with uniform marginals, sampling from a null model
with zero main effects and zero interactions. The sample sizes chosen were 2.5, 5 or 10
times the number of cells. Smaller sample sizes were not considered because of the lack
of statistical power that would result. According to Erdfelder (1992) and Erdfelder, Faul
and Buchner (1996), a sample size of N = 32 is required for a chi-squared test with one
degree of freedom to reach a power value of 0.8 at alpha = .05 for large effects, and N =
88 for medium effects; for a chi-squared test with six degrees of freedom the required
sample sizes are N = 55 and N = 152, respectively (though these values are also based
on asymptotic theory, they may nevertheless be used as rough estimates).
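These power figures can be checked with a short noncentral chi-squared computation. The sketch below is our own illustration (standard library only, not part of the original study); it assumes Cohen's (1988) effect-size conventions, w = .5 for a large and w = .3 for a medium effect, with noncentrality parameter λ = N·w²:

```python
import math

def chi2_cdf(x, k):
    """CDF of the central chi-squared distribution with k df, via the
    series expansion of the regularized lower incomplete gamma function."""
    if x <= 0:
        return 0.0
    a, z = k / 2.0, x / 2.0
    term = math.exp(a * math.log(z) - z - math.lgamma(a + 1.0))
    total, n = term, 1
    while term > 1e-16 * total:
        term *= z / (a + n)
        total += term
        n += 1
    return min(total, 1.0)

def chi2_crit(alpha, k):
    """Upper-alpha critical value of chi-squared(k), found by bisection."""
    lo, hi = 0.0, 10.0 * k + 50.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi2_cdf(mid, k) < 1.0 - alpha:
            lo = mid
        else:
            hi = mid
    return mid

def power(N, w, df, alpha=0.05):
    """Power of the chi-squared test for effect size w: the noncentral
    CDF is evaluated as a Poisson mixture of central chi-squared CDFs."""
    lam, crit = N * w * w, chi2_crit(alpha, df)
    cdf, weight, j = 0.0, math.exp(-lam / 2.0), 0
    while weight > 1e-14 or j < lam:
        cdf += weight * chi2_cdf(crit, df + 2 * j)
        j += 1
        weight *= (lam / 2.0) / j
    return 1.0 - cdf
```

With these definitions, power(32, .5, 1) and power(55, .5, 6) both come out near 0.8, reproducing the Erdfelder et al. figures cited above.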
A further series of simulations was run with the same designs and the same sample
sizes, but differing marginal distributions. For designs with two categories the marginals
were .7, .3 or .8, .2; for designs with three- and four-categorial variables the marginals
were .1, .2, .3, .4 for variable A, .1, .4, .5 for variable B, and .5, .5 or .8, .2 for
variables C and D. The population values for all interactions were zero. Due to the
differences in the marginal probabilities there was considerable variation in the expected
cell frequencies within the designs, and in many cases some of the expected frequencies
were smaller than one.
3. The simulation program2
The data were generated by an SPSS input program as follows: The SPSS procedure
"uniform" was used to generate for each person a value for each variable, drawn from a
uniform distribution on the interval [0, 1]; e. g., for a 4x3x2x2 design, four variables A,
B, C, D. Next the variables were divided into categories: E. g., when variable A was
designed to have four categories a1 to a4 with marginal probabilities .1, .2, .3, .4, a person
was assigned to category a1 if his/her value for A was in the range 0 to .1, to category
a2 if it was between .1 and .3, etc. The procedure continued with categorizing the
next variable until the person was categorized completely. Then the values for the next
person were drawn and categorized, until a sample of the required size was completed.
Then the input program was closed and the procedure "Hiloglinear" was started.
This input program, including the line calling the procedure "Hiloglinear", was written
into an SPSS macro which was executed 10 000 times. This resulted in a very long
SPSS output, which was filtered by a program written in UNIX awk to find the lines
with the results of the significance tests. The fields containing the p-values were
transferred to a new file, which was then used as an input file for an SPSS program to assess
the empirical distribution of the p-values. The p-values ≤ .05 were counted to get
empirical rejection rates.
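The whole pipeline (generate persons, categorize, test, count rejections) can be reproduced in miniature without SPSS. The following Python sketch is our own illustration for the simplest condition, a 2x2 table with uniform marginals, using Pearson's χ² with the closed-form independence estimates (note that, unlike Hiloglinear's default, no delta constant is added):

```python
import math
import random

def chi2_sf_df1(x):
    """Upper tail of chi-squared with 1 df: P(X > x) = 1 - erf(sqrt(x/2))."""
    return 1.0 - math.erf(math.sqrt(x / 2.0))

def simulate_rejection_rate(N, reps=10_000, alpha=0.05, seed=1):
    """Empirical alpha of Pearson's chi^2 for independence in a 2x2 table
    with uniform marginals (the complete null hypothesis)."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        counts = [0, 0, 0, 0]
        for _ in range(N):                 # draw one person, categorize A and B
            a = int(rng.random() < 0.5)
            b = int(rng.random() < 0.5)
            counts[2 * a + b] += 1
        rows = [counts[0] + counts[1], counts[2] + counts[3]]
        cols = [counts[0] + counts[2], counts[1] + counts[3]]
        if 0 in rows or 0 in cols:         # degenerate table, skip (rare)
            continue
        chi2 = sum((counts[2 * i + j] - rows[i] * cols[j] / N) ** 2
                   / (rows[i] * cols[j] / N)
                   for i in range(2) for j in range(2))
        if chi2_sf_df1(chi2) <= alpha:
            rejections += 1
    return rejections / reps
```

For N = 40 (10 per cell) the returned rate lands near the nominal .05, in line with the corresponding entry of Table 1.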
The procedure "Hiloglinear" from SPSS for Unix Release 5.0 is described in the
manual "SPSS Statistical Algorithms" as follows: First, maximum-likelihood estimates of
the expected cell frequencies under the specific model under investigation are computed
using the iterative proportional fitting algorithm as described by Fienberg (1977). The
following default values for the program options were left unchanged: To avoid problems
with zero frequencies the program adds a constant delta = .5 to all empirical cell
frequencies before the fitting algorithm is started. The fitting algorithm stops if the largest
change of an expected cell frequency in two consecutive iterations is less than .25 or if
20 iterations have been executed.
2 The students Mrs. Sigrid Kühl, Mrs. Verena Polz and Mrs. Karina Wahl were engaged in the
development and running of the simulation program.
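To make the fitting step concrete, here is a bare-bones version of iterative proportional fitting for one case, the all-two-way model in a 2x2x2 table. This is our own Python illustration, not Hiloglinear's implementation: it adds no delta constant and uses a much stricter convergence criterion than the defaults just described.

```python
import math

def ipf_no_threeway(x, tol=1e-8, max_iter=200):
    """Fit the model with all main effects and 2-factor interactions
    (3-factor interaction fixed to zero) to a 2x2x2 table x[i][j][k]
    of observed counts, by iterative proportional fitting."""
    m = [[[1.0, 1.0], [1.0, 1.0]] for _ in range(2)]
    # For each two-way margin, the pair of cells summed over the third variable:
    margin_cells = (
        lambda u, v: ((u, v, 0), (u, v, 1)),  # AB margin (sum over C)
        lambda u, v: ((u, 0, v), (u, 1, v)),  # AC margin (sum over B)
        lambda u, v: ((0, u, v), (1, u, v)),  # BC margin (sum over A)
    )
    for _ in range(max_iter):
        change = 0.0
        for cells_of in margin_cells:
            for u in (0, 1):
                for v in (0, 1):
                    cells = cells_of(u, v)
                    obs = sum(x[i][j][k] for i, j, k in cells)
                    cur = sum(m[i][j][k] for i, j, k in cells)
                    ratio = obs / cur     # scale fitted cells to match margin
                    for i, j, k in cells:
                        change = max(change, abs(m[i][j][k] * (ratio - 1.0)))
                        m[i][j][k] *= ratio
        if change < tol:
            break
    return m

def g2_threeway(x, m):
    """G^2 for the 3-factor interaction (1 df in a 2x2x2 table)."""
    return 2.0 * sum(x[i][j][k] * math.log(x[i][j][k] / m[i][j][k])
                     for i in (0, 1) for j in (0, 1) for k in (0, 1)
                     if x[i][j][k] > 0)

# Illustrative counts; after fitting, every two-way margin of m matches
# the corresponding margin of x, while the three-way term of m stays zero.
x = [[[10, 14], [8, 12]], [[9, 11], [13, 15]]]
m = ipf_no_threeway(x)
```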
Based on these expected frequencies as compared to the empirical cell counts, the two
test statistics Pearson's χ² and G² are computed, and significance tests for the fit of the
model are performed using either of the two statistics. The resulting p-values are given.
The output provides significance tests for the fit of the following models:
M0: Global null hypothesis. All main effects and interactions are zero.
M1: Model with free parameters for the main effects. 2-factor and higher interactions
are zero.
M2: Model with free parameters for main effects and 2-factor interactions. 3-factor
and higher interactions are zero.
M3: Model with free parameters for main effects, 2-factor and 3-factor interactions.
The 4-factor interaction is zero.
Next, hypotheses about main effects, 2-factor, 3-factor and 4-factor interactions are
tested separately by hierarchical model comparisons (chi-square difference tests for
nested models):
Test whether the marginals differ significantly from uniform distributions (M1
vs. M0)
Test of the 2-factor interactions (M2 vs. M1)
Test of the 3-factor interactions (M3 vs. M2)
Test of the 4-factor interaction (saturated model with main effects and all
interactions including the 4-factor interaction vs. M3)
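The logic of these comparisons is simply that the difference in G² between two nested models is itself referred to a chi-squared distribution, with df equal to the difference in df. A tiny sketch (the deviance numbers below are hypothetical, purely for illustration):

```python
def difference_test(g2_restricted, df_restricted, g2_general, df_general):
    """Chi-square difference test for nested log-linear models: the
    restricted model (more parameters fixed to zero) never fits better,
    so the G^2 difference is nonnegative and is referred to a chi-squared
    distribution with df_restricted - df_general degrees of freedom."""
    return g2_restricted - g2_general, df_restricted - df_general

# Hypothetical deviances: M1 (main effects only) vs. M2 (adds 2-factor
# interactions) -- the difference tests the 2-factor terms jointly.
stat, df = difference_test(10.5, 5, 4.5, 2)
```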
Furthermore, significance tests using G² are performed for each of the main effects
and each of the interactions separately ("partial associations"). These tests are also
based on hierarchical comparisons: E. g., when in a model with four variables the 3-factor
interaction ABC is to be tested, two models with the following higher-order marginals
fitted to the data are compared:
H0: model with all main effects, all 2-factor interactions and the 3-factor
interactions ABD, BCD and ACD (i. e., all 3-factor interactions except ABC)
H1: model with all main effects, all 2-factor interactions and all 3-factor
interactions including ABC
In the next section simulation results will be reported for the significance tests of the
global null hypothesis, for groups of hypotheses (all main effects, all 2-factor
interactions, etc.) and in some cases also for partial associations.
4. Results
Table 1 gives the empirical rejection rates corresponding to a nominal level of alpha
= .05 under the condition of the complete null hypothesis, i. e. with uniform marginal
distributions. As each entry in the table is based on 10 000 samples, 95 percent of the
values should fall into the interval .046 to .054 if the true alphas were .05. Even a
rough inspection of Table 1 shows that the majority of the values (64%) lie outside this
range and thus depart significantly from the nominal value. However, when applying
asymptotic results to finite sample sizes one cannot expect to get exact error rates. In what
follows we will call increased empirical values up to .075 "moderately" inflated; higher
values will be called "severely" inflated. Severely inflated values are underlined in the
tables. On the other hand, empirical error rates below the nominal level indicate a
conservative significance test, presumably with increased Type II error rates. Therefore,
values at or below .025 are printed in italics.
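The band .046 to .054 is just the normal-approximation 95 percent interval for a proportion of .05 estimated from 10 000 independent samples:

```python
import math

# Monte Carlo error of an empirical rejection rate: if the true alpha is
# .05 and each table entry is based on 10 000 simulated samples, the
# estimate has standard error sqrt(.05 * .95 / 10000) ~ .0022, so the
# 95 percent interval is .05 +/- 1.96 * .0022, i.e. roughly .046 to .054.
p, n_samples = 0.05, 10_000
half_width = 1.96 * math.sqrt(p * (1 - p) / n_samples)
interval = (p - half_width, p + half_width)
```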
Table 1 shows that there is a general tendency for G² to yield higher empirical
alphas than Pearson's χ² (in 60 cases out of 72 the value for G² is higher than that for
Pearson's χ², in 1 case equal, and in 11 cases lower). Most of the cases of severely
increased alpha values occur for G², whereas severely depressed alphas occur for neither
statistic. Giving equal weight to discrepancies in either direction (increased or depressed
values), Pearson's χ² is found to be closer to the nominal alpha than G² in 52 cases,
equal in 2 cases, and more discrepant than G² in 18 cases. Thus one may state that
Pearson's χ² is on the whole closer to the nominal level than G².
Next we will look at the results for the two statistics in more detail:
Pearson's χ²: There are only two cases of severely inflated alpha levels, both
occurring when 4-factor interactions are tested with the smallest sample sizes (2.5 times the
number of cells). When the sample sizes are increased, the empirical error rates improve
and differ only "moderately" from the nominal level.
Table 1: Empirical alpha-values for a nominal alpha = .05 when the complete null
hypothesis is true (uniform marginals, no interactions)

                                         significance tests
Design      N  n/cell   global        marginals     2-way int.    3-way int.    4-way int.
                        G²    χ²      G²    χ²      G²    χ²      G²    χ²      G²    χ²
2x2        10   2.5    .077  .036    .047  .059    .100  .052
2x2        20   5      .048  .038    .052  .046    .058  .051
2x2        40  10      .053  .042    .048  .048    .050  .050
4x3        30   2.5    .091  .042    .057  .053    .102  .045
4x3        60   5      .067  .047    .056  .053    .071  .046
4x3       120  10      .057  .048    .056  .057    .056  .047
2x2x2      20   2.5    .083  .043    .059  .049    .076  .050    .095  .069
2x2x2      40   5      .067  .045    .048  .050    .062  .047    .068  .060
2x2x2      80  10      .056  .047    .052  .055    .055  .049    .060  .058
4x3x2      60   2.5    .106  .056    .053  .058    .075  .045    .129  .063
4x3x2     120   5      .078  .054    .059  .060    .062  .050    .082  .057
4x3x2     240  10      .060  .053    .054  .055    .055  .050    .060  .053
2x2x2x2    40   2.5    .095  .045    .052  .055    .065  .043    .107  .065    .111  .078
2x2x2x2    80   5      .068  .047    .053  .059    .055  .043    .067  .050    .077  .069
2x2x2x2   160  10      .060  .053    .054  .053    .054  .047    .061  .054    .058  .057
4x3x2x2   120   2.5    .141  .054    .057  .068    .063  .044    .122  .052    .160  .091
4x3x2x2   240   5      .086  .056    .052  .061    .055  .047    .078  .048    .098  .073
4x3x2x2   480  10      .065  .054    .048  .055    .054  .048    .064  .051    .068  .063

N = total sample size; n/cell = N divided by the number of cells
Apart from a conservative tendency for 2x2 tables with small sample sizes, the tests
for the global null hypothesis lead to acceptable Type I error rates (.042 to .056). The
tests for the main effects, 2-factor and 3-factor interactions lead to somewhat larger but
still "moderate" departures from the nominal level (.043 to .069). The tests for the
4-factor interactions lead to the two severely inflated values mentioned above, occurring
with sample sizes of 2.5 times the cell number. For sample sizes equalling 5 or 10 times
the cell number, these error rates were only moderately inflated.
The likelihood-ratio G²: Whereas the results for the main effects are satisfactory
(.047 to .059), 2-factor and higher interactions show an increasing tendency to yield
inflated Type I error rates. Severely inflated empirical alpha levels occur for all designs
with sample sizes only 2.5 times the number of cells and for many cases with sample sizes
5 times the cell number. Only with sample sizes equal to 10 times the cell number do all
departures stay in the range defined as only "moderately" discrepant.
Tables 2a and 2b give the empirical rejection rates for all simulations with main
effects present, i. e. with marginals differing from the uniform distribution. The sample
sizes used are the same as in Table 1, equalling 2.5, 5, or 10 times the cell
number. Yet, due to the variation in the marginal probabilities, there was considerable
variation in the expected cell frequencies within a design. The smallest expected cell
frequency within a design is given in the column headed "min E(ni)" in Tables 2a and
2b.
The alternative hypothesis was true for the global tests and the tests of the
marginals. Except for the smallest sample sizes (N = 10, N = 20), statistical power exceeded
0.90 in all cases. As the results for the global tests and the tests for the marginal
distributions were very much alike, only the results for the marginal distributions are
reported.
Table 2a refers to designs with two categories per variable (2x2 to 2x2x2x2), whereas
Table 2b contains the results for designs with up to four categories per variable (4x3 to
4x3x2x2). Taking the two groups of designs together, the behavior of the two test
statistics may be summarized as follows:
Pearson's χ² leads to only moderate discrepancies from the nominal level (empirical
levels .043 to .071) when the total sample size equals at least 5 times the cell number
and the smallest expected cell frequency is larger than 0.5. With expected cell
frequencies smaller than 0.5, serious departures occur in either direction (severely inflated or
depressed empirical levels) without following a simple pattern.
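The two conditions can be bundled into a simple planning check (our own helper function; the thresholds are the ones found in this study):

```python
def pearson_rule_ok(total_n, n_cells, min_expected):
    """Rule of thumb from this study: Pearson's chi^2 avoids severely
    inflated alpha when the total sample size is at least 5 times the
    number of cells AND the smallest expected cell frequency exceeds .5."""
    return total_n >= 5 * n_cells and min_expected > 0.5

# e.g. the 4x3x2 condition with N = 120 (5 per cell) and min E(ni) = .6
# satisfies the rule, while the condition with min E(ni) = .24 does not.
```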
Table 2a: Empirical rejection rates for a nominal level of alpha = .05, when main effects
are present but all interactions are zero. Designs with only two-categorial variables.

                                            significance tests
Design      N    n   min     marginal    marginals#    2-way int.   3-way int.   4-way int.
                    E(ni)    probs       G²    χ²      G²    χ²     G²    χ²     G²    χ²
2x2        10   2.5   .9     .3 .7      .331  .361    .071  .046
2x2        20   5    1.8      "         .647  .607    .074  .045
2x2        40  10    3.6      "         .922  .913    .065  .052
2x2        10   2.5   .4     .2 .8      .697  .744    .034  .054
2x2        20   5     .8      "         .970  .957    .055  .049
2x2        40  10    1.6      "        1.00  1.00     .070  .046
2x2x2      20   2.5   .54    .3 .7      .768  .725    .088  .065   .070  .044
2x2x2      40   5    1.08     "         .977  .972    .069  .051   .072  .057
2x2x2      80  10    2.16     "        1.00  1.00     .053  .047   .068  .055
2x2x2      20   2.5   .16    .2 .8      .990  .990    .059  .086   .036  .016
2x2x2      40   5     .32     "        1.00  1.00     .068  .065   .056  .039
2x2x2      80  10     .64     "        1.00  1.00     .069  .058   .058  .052
2x2x2x2    80   5     .65    .3 .7     1.00  1.00     .062  .052   .096  .071   .081  .051
2x2x2x2   160  10    1.30     "        1.00  1.00     .053  .050   .069  .053   .079  .061
2x2x2x2    80   5     .13    .2 .8     1.00  1.00     .081  .101   .071  .090   .022  .011
2x2x2x2   160  10     .26     "        1.00  1.00     .058  .069   .078  .081   .050  .032

# As the alternative hypothesis holds for the main effects, the values in these columns
are empirical power values.
N = total sample size
n = N divided by the number of cells
min E(ni) = smallest expected cell frequency
Table 2b: Empirical rejection rates for a nominal level of alpha = .05, when main effects
are present but all interactions are zero. Designs with more than two categories per
variable.

                                                        significance tests
Design      N   n   min    marginals (A / B / C / D)               marginals#   2-way int.   3-way int.   4-way int.
                   E(ni)                                           G²    χ²     G²    χ²     G²    χ²     G²    χ²
4x3        60   5   .6    .1 .2 .3 .4 / .1 .4 .5                  1.00  1.00   .065  .046
4x3       120  10  1.2     "                                      1.00  1.00   .065  .048
4x3x2     120   5   .6    .1 .2 .3 .4 / .1 .4 .5 / .5 .5          1.00  1.00   .071  .060   .081  .043
4x3x2     240  10  1.2     "                                      1.00  1.00   .063  .056   .081  .052
4x3x2     120   5   .24   .1 .2 .3 .4 / .1 .4 .5 / .2 .8          1.00  1.00   .074  .076   .067  .042
4x3x2     240  10   .48    "                                      1.00  1.00   .066  .064   .072  .049
4x3x2x2   240   5   .6    .1 .2 .3 .4 / .1 .4 .5 / .5 .5 / .5 .5  1.00  1.00   .061  .060   .102  .064   .089  .056
4x3x2x2   480  10  1.2     "                                      1.00  1.00   .054  .052   .081  .053   .091  .065
4x3x2x2   240   5   .24   .1 .2 .3 .4 / .1 .4 .5 / .5 .5 / .2 .8  1.00  1.00   .062  .072   .112  .093   .060  .033
4x3x2x2   480  10   .48    "                                      1.00  1.00   .050  .056   .088  .064   .082  .053
4x3x2x2   240   5   .10   .1 .2 .3 .4 / .1 .4 .5 / .2 .8 / .2 .8  1.00  1.00   .067  .096   .117  .125   .028  .019
4x3x2x2   480  10   .19    "                                      1.00  1.00   .066  .071   .010  .085   .060  .034

# As the alternative hypothesis holds for the main effects, the values in these columns
are empirical power values.
N = total sample size
n = N divided by the number of cells
min E(ni) = smallest expected cell frequency
Table 3: Empirical alpha values for a nominal alpha = .05 when testing partial
associations using G². Two design conditions selected from Table 2b.

a) Design 4x3x2:

  N    n   min E(ni)  marginals                           AB     AC     BC     ABC
 120   5     .6       .1 .2 .3 .4 / .1 .4 .5 / .5 .5     .072   .061   .062   .081
 240  10    1.2        "                                 .063   .056   .054   .081

b) Design 4x3x2x2:

  N    n   min E(ni)  marginals                                   AB     AC     BC     AD     BD     CD
 240   5     .6       .1 .2 .3 .4 / .1 .4 .5 / .5 .5 / .5 .5     .063   .056   .054   .055   .051   .051
 480  10    1.2        "                                         .052   .052   .052   .056   .051   .050

  N    n   min E(ni)  marginals                                   ABC    ABD    ACD    BCD    ABCD
 240   5     .6       .1 .2 .3 .4 / .1 .4 .5 / .5 .5 / .5 .5     .097   .096   .076   .077   .089
 480  10    1.2        "                                         .078   .071   .059   .061   .091

N = total sample size
n = N divided by the number of cells
min E(ni) = smallest expected cell frequency
The likelihood-ratio G² yields for the 2-factor interactions only moderate discrepancies
from the nominal level (empirical levels .053 to .070) when the total sample size equals
10 times the cell number and the smallest expected cell frequency is larger than 1.
However, even when these conditions are satisfied, the significance tests of the 3- and
4-factor interactions lead in many cases to seriously increased alpha levels. Therefore, G²
cannot be recommended for these tests.
Comparing Pearson's χ² to G², we find that in all cases which satisfy the above rule
for Pearson's χ² (total sample size 5 times the cell number, smallest expected cell
frequency > .5), Pearson's χ² is closer to the nominal level than G².
Table 3 shows the results for the tests of the partial associations in some of the larger
designs (4x3x2 and 4x3x2x2) using G². These results also indicate that severely inflated
alpha levels occur for 3- and 4-factor interactions even under the conditions of large
sample sizes (10 times the cell number) and smallest expected cell frequencies larger
than 1.
4.1 Supplementary results
Since the goodness-of-fit values obtained for a model and the resulting p-values may
also depend to some extent on the quality of the numerical optimization procedure, we
wanted to check whether an increase in numerical accuracy would affect the obtained
error rates.
Two designs were chosen: the 2x2x2 design with unequal marginals (.8, .2) and
sample size 80, and the 2x2x2x2 design with unequal marginals (.8, .2) and sample size
160. The latter had produced severely inflated error rates for both Pearson's χ² and
G². The program options for numerical accuracy were changed from the default values
of a maximum of 20 iterations and a convergence criterion of .25 to a maximum of 50
iterations and a convergence criterion of .05. For each of the two designs a simulation
with 10 000 data sets was run under the improved accuracy conditions. The results were
very close to those under the default conditions. In particular, the severely increased
Type I error rates for the 3-factor interactions in the 2x2x2x2 design did not improve.
Next, the value of the constant delta, which is added to all observed cell frequencies
to avoid problems with empty cells, was changed from its default value .5 to .1. The
same two design conditions as before (2x2x2 with marginals .8, .2, N = 80; 2x2x2x2 with
marginals .8, .2, N = 160) and two further conditions (2x2x2x2 with marginals .7, .3, N =
160; 4x3x2x2 with uniform marginals, N = 480) were chosen. The results are heterogeneous:
there were substantial improvements in many cases, especially for G², but also
deteriorations.
For those cases which satisfy the conditions given above for Pearson's χ² (sample
size at least 5 times the number of cells, smallest expected cell frequency larger than .5),
no severe departures from the nominal level occurred for either Pearson's χ² or G²,
and Pearson's χ² was always closer to the nominal level than G².
Furthermore, we wanted to check whether the procedure "Hiloglinear" in "SPSS for Windows 8.0" differs in any respect from the procedure in "SPSS for UNIX Release 5.0" which was used in our simulations. We chose the last six design conditions from Table 2b (4x3x2x2 designs with unequal marginals) and generated one sample for each condition. These samples were analyzed using "Hiloglinear" from both "SPSS for Windows 8.0" and "SPSS for UNIX 5.0" with the program options left at their default values (delta = .5, convergence = .25, iterate = 20). The output from the two versions of SPSS was identical. Then the options were changed to delta = 0, convergence = .01 and iterate = 50. Again, the output was the same for both SPSS versions.
5. Discussion
The results of the present study are in accordance with the majority of previous findings in concluding that Pearson's χ² is generally closer to the nominal level than G². Pearson's χ² did not lead to seriously increased type I error rates when the smallest expected cell frequency was larger than .5 and the total sample size was at least 5 times the number of cells. This rule was found to hold for main effects and 2-factor interactions as well as for higher order interactions. For G², on the other hand, a rule was found only for main effects and 2-factor interactions (to avoid seriously increased type I error rates the smallest expected cell frequency should be larger than 1 and the total sample size at least 10 times the cell number), whereas in many cases significance tests of higher order interactions led to seriously increased alpha levels even when this rule was satisfied.
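These two rules of thumb lend themselves to a simple pre-analysis screen. The following sketch is only an illustration; the function name and interface are our own and not part of any statistical package:

```python
def screen_sample_size(n_total, n_cells, min_expected):
    """Rule-of-thumb screen based on the simulation results.

    Returns a pair of booleans: whether Pearson's chi-square, and whether
    G-squared (for main effects and 2-factor interactions only), can be
    expected to hold the nominal alpha level approximately.
    """
    pearson_ok = n_total >= 5 * n_cells and min_expected > 0.5
    g2_ok = n_total >= 10 * n_cells and min_expected > 1.0
    return pearson_ok, g2_ok
```

For example, a 2x2x2x2 table (16 cells) with n = 80 and a smallest expected cell frequency of 0.6 passes the screen for Pearson's χ² but not for G².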
However, a Monte Carlo study can only yield limited information on the behavior of a test statistic, as only a limited number of cases can be simulated, and the cases included will usually differ in some respects from the conditions that a researcher is faced with when analyzing his/her data. In the present study we aimed at choosing the conditions in a way that covers the most typical cases of psychological research, and as a result of our study we defined an area of conditions where one may feel on the safe side using SPSS standard procedures.
Nevertheless, important questions are left open: What should one do when the above rules for applying asymptotic procedures are not satisfied? For example, when higher order interactions are to be tested and the smallest expected cell frequencies are smaller than .5, i.e., too small for either G² or Pearson's χ².
In such situations it would be desirable to provide bootstrapping procedures which enable the researcher to run his/her own simulation study with his/her specific design conditions and parameter values and with the actual values for computational accuracy and the constant delta. Based on this simulation, the distribution of the test statistic could be assessed empirically and used instead of the asymptotic distribution to compute the tail probability needed for the significance test. For example, when a 4-factor interaction is to be tested one might proceed as follows: First, the empirical data are used to estimate the parameters under the model of the null hypothesis. As the 4-factor interaction is in question, this is a model containing main effects, 2-factor and 3-factor interactions, but no 4-factor interaction. Next, a large number of data sets with the actual sample size is generated from this model. To each data set the null hypothesis model, with all parameters free except the 4-factor interaction, and the alternative model, with all parameters free including the 4-factor interaction (i.e. the saturated model), are fitted. Comparing these two models, the test statistic for the 4-factor interaction is computed, and the distribution of the test statistic over the samples as replications is assessed. To perform the significance test for the 4-factor interaction one may use the empirical distribution of the test statistic to estimate the 95th percentile point and use it as the critical value, or, alternatively, one may estimate the tail probability (p-value) from the proportion of samples which have led to a value exceeding the one computed from the real data set.
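The logic of this procedure can be sketched in a deliberately simplified form. In the illustration below the null model is plain row-column independence in a two-way table, which can be fitted in closed form, rather than the no-4-factor-interaction model of the text, which would require iterative proportional fitting in each replication; all function names are our own:

```python
import random

def pearson_chi2(observed, expected):
    """Pearson's X^2 = sum over cells of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def independence_fit(table, rows, cols):
    """Expected cell frequencies under the independence model for a
    two-way table stored row-major as a flat list of counts."""
    n = sum(table)
    row_sums = [sum(table[r * cols:(r + 1) * cols]) for r in range(rows)]
    col_sums = [sum(table[r * cols + c] for r in range(rows)) for c in range(cols)]
    return [row_sums[r] * col_sums[c] / n for r in range(rows) for c in range(cols)]

def bootstrap_p_value(table, rows, cols, n_boot=2000, seed=1):
    """Parametric bootstrap: generate tables from the fitted null model,
    refit the null model to each one, and estimate the tail probability
    of the observed test statistic."""
    rng = random.Random(seed)
    n = sum(table)
    expected = independence_fit(table, rows, cols)
    observed_stat = pearson_chi2(table, expected)
    cell_probs = [e / n for e in expected]
    exceed = 0
    for _ in range(n_boot):
        # draw one bootstrap table of n observations from the null model
        draws = rng.choices(range(rows * cols), weights=cell_probs, k=n)
        boot = [0] * (rows * cols)
        for d in draws:
            boot[d] += 1
        boot_expected = independence_fit(boot, rows, cols)
        if min(boot_expected) <= 0:
            continue  # a margin came out empty; skip this replication
        if pearson_chi2(boot, boot_expected) >= observed_stat:
            exceed += 1
    return exceed / n_boot
```

For the problem described in the text one would replace independence_fit by a fit of the model with main effects and all 2- and 3-factor interactions and compute χ² or G² from that fit; the bootstrap logic itself is unchanged.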
Similarly, one may also obtain estimated power values: a further simulation study might be run, generating Monte Carlo samples from a model with parameter values specified according to the respective alternative hypothesis. The proportion of cases leading to a test statistic exceeding the 95th percentile point obtained under the null hypothesis yields an estimate of statistical power for this alternative hypothesis.
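This power estimation can likewise be sketched in simplified, self-contained form, again using the independence null in a 2x2 table instead of a higher order interaction; the function names and the example cell probabilities are our own illustrative choices:

```python
import random

def chi2_2x2(counts):
    """Pearson X^2 for a 2x2 table given as [a, b, c, d] (row-major).
    Returns None if a marginal is empty."""
    a, b, c, d = counts
    n = a + b + c + d
    r1, r2, c1, c2 = a + b, c + d, a + c, b + d
    if min(r1, r2, c1, c2) == 0:
        return None
    expected = [r1 * c1 / n, r1 * c2 / n, r2 * c1 / n, r2 * c2 / n]
    return sum((o - e) ** 2 / e for o, e in zip(counts, expected))

def sample_table(cell_probs, n, rng):
    """Draw one 2x2 table of n observations from the given cell probabilities."""
    counts = [0, 0, 0, 0]
    for cell in rng.choices(range(4), weights=cell_probs, k=n):
        counts[cell] += 1
    return counts

def estimate_power(null_probs, alt_probs, n, n_sim=1000, seed=2):
    """Monte Carlo power estimate: the 95th percentile of the statistic
    under the null model serves as critical value; power is the
    proportion of alternative-model samples exceeding it."""
    rng = random.Random(seed)
    null_stats = sorted(
        s for s in (chi2_2x2(sample_table(null_probs, n, rng)) for _ in range(n_sim))
        if s is not None)
    critical = null_stats[int(0.95 * len(null_stats))]
    alt_stats = (chi2_2x2(sample_table(alt_probs, n, rng)) for _ in range(n_sim))
    exceed = sum(1 for s in alt_stats if s is not None and s > critical)
    return critical, exceed / n_sim
```

Using the simulated 95th percentile rather than the asymptotic critical value ties the power estimate to the same finite-sample distribution that the bootstrap significance test itself uses.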
This approach, which has been implemented in the program PANMARK by Van de Pol, Langeheine and De Jong (1991), is called "sophisticated bootstrapping" or "parametric bootstrapping" and has been presented and demonstrated with a variety of log-linear models and latent class models also by Langeheine, Pannekoek and Van de Pol (1996) and Langeheine, Van de Pol and Pannekoek (1997). Von Davier (1997) developed parametric bootstrapping procedures for various item response models, as for these models the number of cells (= possible response patterns) becomes large even for moderate item numbers, and hence the sample sizes available are in practice nearly always too small to use asymptotic tests such as Pearson's χ² or G².
So far, parametric bootstrapping is the only alternative when the sample size is known to be too small for asymptotic procedures. Furthermore, it may also be helpful when the actual design conditions differ substantially from those covered by our tables and generalizations from our Monte Carlo study become hazardous.
As similar considerations apply in principle to all other asymptotically derived significance tests, it would be desirable if parametric bootstrapping procedures were included in all major statistical packages whenever asymptotic tests are applied and only rough knowledge is available about their behavior with finite sample sizes.
References
[1] Agresti, A. & Yang, M.C. (1987). An empirical investigation of some effects of sparseness in contingency tables. Computational Statistics and Data Analysis, 5, 9-21.
[2] Berry, K.J. & Mielke, P.W. Jr. (1988). Monte Carlo comparisons of the asymptotic
chi-square and Likelihood-ratio tests with the nonasymptotic chi-square test for
sparse r x c tables. Psychological Bulletin, 103, 256-264.
[3] Brandstätter, E. (1999). Confidence interval as an alternative to significance testing. Methods of Psychological Research-Online, Vol. 4, No. 2. http://www.mpr-online.de
[4] Chapman, J.W. (1976). A comparison of the X², - 2 log R, and multinomial probability criteria for significance tests when expected frequencies are small. Journal of the American Statistical Association, 71, 854-863.
[5] Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.).
Hillsdale, N.J.: Lawrence Erlbaum.
[6] Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003.
[7] von Davier, M. (1997). Methoden zur Prüfung probabilistischer Testmodelle. Kiel: IPN (Institut für die Pädagogik der Naturwissenschaften an der Universität Kiel), Olshausenstraße 62, D-24098 Kiel.
[8] Erdfelder, E., Faul, F. & Buchner, A. (1996). GPOWER: A general power analysis
program. Behavior Research Methods, Instruments, & Computers, 28, 1-11.
[9] Faul, F. & Erdfelder, E. (1992). GPOWER: A priori-, post hoc-, and compromise
power analysis for MS-DOS (Computer program). Bonn: Bonn University.
[10] Fienberg, S.E. (1977). The analysis of cross-classified categorical data. Cambridge, MA: The MIT Press.
[11] Gigerenzer, G. (1993). The superego, the ego, and the id in statistical reasoning. In Keren, G. & Lewis, C. (Eds.), A handbook for data analysis in the behavioral sciences: Methodological issues (pp. 311-339). Hillsdale: Lawrence Erlbaum.
[12] Goodman, L.A. (1996). A single general method for the analysis of cross-classified data: Reconciliation and synthesis of some methods of Pearson, Yule, and Fisher, and also some methods of correspondence analysis and association analysis. Journal of the American Statistical Association, 91, 408-428.
[13] Haber, M. (1984). A comparison of tests for the hypothesis of no three-factor interaction in 2 x 2 x 2 contingency tables. Journal of Statistical Computation and Simulation, 20, 205-215.
[14] Harlow, L.L., Mulaik, S.A. & Steiger, J.H. (1997). What if there were no significance tests? Mahwah, N.J.: Lawrence Erlbaum.
[15] Harris, R.J. (1997). Reforming significance testing via three-valued logic. In Harlow, L.L., Mulaik, St.A. & Steiger, J.H. (Eds.), What if there were no significance tests? Mahwah, N.J.: Lawrence Erlbaum.
[16] Hosmane, B. (1986). Improved likelihood ratio tests and Pearson chi-square tests for independence in two dimensional contingency tables. Communications in Statistics - Theory and Methods, 15, 1875-1888.
[17] Hosmane, B. (1987). An empirical investigation of chi-square tests for the hypothesis of no three-factor interaction in I x J x K contingency tables. Journal of Statistical Computation and Simulation, 28, 167-178.
[18] Iseler, A. (1997). Signifikanztests: Ritual, guter Brauch und gute Gründe. Methods of Psychological Research-Online, Diskussionsforum. URL http://www.pabst-publishers.de/impr/forum_e.html
[19] Jones, L.V. (1999). A sensible reformulation of the significance test. ViSta: The Visual Statistics System. http://forrest.psych.unc.edu/jones-tukey112399.html
[20] Koehler, K.J. (1986). Goodness-of-fit tests for log-linear models in sparse contingency tables. Journal of the American Statistical Association, 81, 483-492.
[21] Koehler, K.J. & Larntz, K. (1980). An empirical investigation of goodness-of-fit
statistics for sparse multinomials. Journal of the American Statistical Association,
75, 336-344.
[22] Langeheine, R., Pannekoek, J. & Van de Pol, F. (1996). Bootstrapping goodness-of-fit measures in categorical data analysis. Sociological Methods & Research, 24, 492-516.
[23] Langeheine, R., Van de Pol, F. & Pannekoek, J. (1997). Kontingenztabellen-Analyse bei kleinen Stichproben: Probleme bei der Prüfung der Modellgültigkeit mittels Chi-Quadrat Statistiken. Empirische Pädagogik, 11, 63-77.
[24] Larntz, K. (1978). Small-sample comparisons of exact levels for chi-squared
goodness-of-fit statistics. Journal of the American Statistical Association, 73, 253-
263.
[25] Lawal, H.B. (1984). Comparisons of X², Y², Freeman-Tukey and William's improved G² test statistics in small samples of one-way multinomials. Biometrika, 71, 415-458.
[26] Meehl, P.E. (1997). The problem is epistemology, not statistics: Replace significance tests by confidence intervals and quantify accuracy of risky numerical predictions. In Harlow, L.L., Mulaik, St.A. & Steiger, J.H. (Eds.), What if there were no significance tests? (pp. 393-426). Mahwah, N.J.: Lawrence Erlbaum.
[27] Milligan, G.W. (1980). Factors that affect Type I and Type II error rates in the
analysis of multidimensional contingency tables. Psychological Bulletin, 87, 238-244.
[28] Mulaik, St.A., Raju, N.S. & Harshman, R.A. (1997). There is a time and a place for significance testing. In Harlow, L.L., Mulaik, St.A. & Steiger, J.H. (Eds.), What if there were no significance tests? (pp. 65-115). Mahwah, N.J.: Lawrence Erlbaum.
[29] Read, T.R.C. & Cressie, N.A.C. (1988). Goodness-of-fit statistics for discrete multivariate data. New York: Springer.
[30] Reichardt, Ch.S. & Gollob, H.F. (1997). When confidence intervals should be used instead of statistical significance tests, and vice versa. In Harlow, L.L., Mulaik, St.A. & Steiger, J.H. (Eds.), What if there were no significance tests? (pp. 259-284). Mahwah, N.J.: Lawrence Erlbaum.
[31] Rudas, T. (1986). A Monte Carlo comparison of the small sample behaviour of the
Pearson, the likelihood ratio and the Cressie-Read statistics. Journal of Statistical
Computation and Simulation, 24, 107-120.
[32] Schmidt, F.L. & Hunter, J.E. (1997). Eight common but false objections to the discontinuation of significance testing in the analysis of research data. In Harlow, L.L., Mulaik, St.A. & Steiger, J.H. (Eds.), What if there were no significance tests? (pp. 37-64). Mahwah, N.J.: Lawrence Erlbaum.
[33] Sedlmeier, P. (1996). Jenseits des Signifikanztest-Rituals: Ergänzungen und Alternativen. Methods of Psychological Research Online, 1, 41-63.
[34] Sedlmeier, P. (1998). Was sind die guten Gründe für Signifikanztests? Diskussionsbeitrag zu Sedlmeier (1996) und Iseler (1997). Methods of Psychological Research-Online, 3, 39-42.
[35] SPSS Statistical Algorithms (n.d.). Chicago: SPSS Inc.
[36] Upton, G.J.G. (1982). A comparison of alternative tests for the 2 x 2 comparative
trial. Journal of the Royal Statistical Society Series A, 145, 86-105.
[37] Van de Pol, F., Langeheine, R. & De Jong, W. (1991). PANMARK user manual:
PANel analysis using MARKov chains. Voorburg: Netherlands Central Bureau of
Statistics.