TRANSCRIPT
MODERN APPROACHES
Effect Size Key source: Kline, Beyond Significance Testing
Always a difference
Commonly define the null hypothesis as ‘no difference’
Differences between groups always exist (at some level of precision)
Obtaining statistical significance can be seen as just a matter of sample size
Furthermore, the importance and magnitude of an effect are not reflected (because of the role of sample size in probability value attained)
What should we be doing?
Want to make sure we have looked hard enough for the difference – power analysis
Figure out how big the thing we are looking for is – effect size
Calculating effect size
Different statistical tests have different effect sizes developed for them
However, the general principle is the same
Effect size refers to the magnitude of the impact of the independent variable(s) on the outcome variable
Types of effect size
Three basic classes of effect size are commonly used: the d family, the r family, and case-level measures.
d family: focused on standardized mean differences, which allows comparison across samples and variables with differing variance, e.g., Cohen’s d. Note that sometimes there is no need to standardize (when the units of the scale have inherent meaning).
r family (variance-accounted-for): the amount explained versus the total.
Case-level effect sizes.
However, one should note that in group comparison situations, each is directly translatable to the others.
Cohen’s d
Independent samples t-test
Cohen initially suggested one could use either sample’s standard deviation, since the two should be equal under our assumptions. In practice, people now use the pooled variance.
There is also Cohen’s f for the ANOVA counterpart. Others are available for particular cases, e.g., control groups and dependent samples.
$d = \frac{\bar{X}_1 - \bar{X}_2}{s_p}$
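As a quick illustration, a minimal R sketch of the pooled-standard-deviation version (function name and data are ours, not from the source):

```r
# Minimal sketch: Cohen's d for two independent groups,
# using the pooled standard deviation as the standardizer
cohens_d <- function(x1, x2) {
  n1 <- length(x1); n2 <- length(x2)
  sp <- sqrt(((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / (n1 + n2 - 2))
  (mean(x1) - mean(x2)) / sp
}

set.seed(1)
cohens_d(rnorm(30, 100, 15), rnorm(30, 92, 15))  # population d here is ~0.53
```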
Association
A measure of association describes the amount of the covariation between the independent and dependent variables
It is expressed in an unsquared metric or a squared metric—the former is usually a correlation, the latter a variance-accounted-for effect size
R, eta
Characterizing effect size
Cohen emphasized that the interpretation of effects requires the researcher to consider things narrowly in terms of the specific area of inquiry
Evaluation of effect sizes inherently requires a personal value judgment regarding the practical or clinical importance of the effects
Other effect size measures
Case-level effect sizes
Measures for non-continuous data:
Association: contingency coefficient, phi, Cramér’s phi
d family: odds ratios
Agreement: kappa
Case-level effect sizes
Indexes such as Cohen’s d and eta2 estimate effect size at the group or variable level only
However, it is often of interest to estimate differences at the case level
Case-level indices of group distinctiveness are proportions of scores from one group versus another that fall above or below a reference point
Reference points can be relative (e.g., a certain number of standard deviations above or below the mean in the combined frequency distribution) or more absolute (e.g., the cutting score on an admissions test)
Case-level effect sizes
U1: proportion of nonoverlap. If there is no overlap, U1 = 1; if the distributions overlap completely, U1 = 0.
U2: proportion of scores in the lower group exceeded by the same proportion in the upper group. If the means are the same, U2 = .5; if all of group 2 exceeds group 1, U2 = 1.0.
U3: proportion of scores in the lower group exceeded by the typical (median) score in the upper group. Same range as U2.
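Assuming normal distributions with equal variances, the U measures follow directly from d (Cohen, 1988); a small R sketch of those relations (variable names ours):

```r
# Sketch: Cohen's U measures from d, assuming normality and equal variances
d  <- 0.8
U3 <- pnorm(d)            # lower group's proportion below the upper group's median
U2 <- pnorm(d / 2)        # proportion exceeded by the same proportion in the upper group
U1 <- (2 * U2 - 1) / U2   # proportion of nonoverlap
round(c(U1 = U1, U2 = U2, U3 = U3), 2)  # .47, .66, .79
```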
Other case-level effect sizes
Tail ratios (Feingold, 1995): Relative proportion of scores from two different groups that fall in the upper extreme (i.e., either the left or right tail) of the combined frequency distribution
“Extreme” is usually defined relatively in terms of the number of standard deviations away from the grand mean
Tail ratio > 1.0 indicates one group has relatively more extreme scores
Here, tail ratio = p2/p1.
Other case-level effect sizes
Common language effect size (McGraw & Wong, 1992) is the predicted probability that a random score from the upper group exceeds a random score from the lower group
Compute
$z_{CL} = \frac{0 - (\bar{X}_1 - \bar{X}_2)}{\sqrt{s_1^2 + s_2^2}}$
and find the area to the right of that value. Range: .5 – 1.0.
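Because the area to the right of that z equals the area below $(\bar{X}_1 - \bar{X}_2)/\sqrt{s_1^2 + s_2^2}$, the computation is a one-liner; a sketch with hypothetical summary statistics:

```r
# Sketch: common language effect size (McGraw & Wong, 1992)
# P(random score from group 1 exceeds a random score from group 2)
m1 <- 105; m2 <- 100; s1 <- 15; s2 <- 15   # hypothetical summary statistics
CL <- pnorm((m1 - m2) / sqrt(s1^2 + s2^2))
CL  # ~ .59
```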
Confidence Intervals for Effect Size
Effect size statistics such as Hedges’ g and η² have complex distributions.
Traditional methods of interval estimation rely on approximate standard errors assuming large sample sizes.
The general form of a CI for d is the same as that of any CI:
$d \pm t_{cv}(s_d)$
Problem
However, CIs formulated in this manner are only approximate, and are based on the central (t) distribution centered on zero.
The true (exact) CI depends on a noncentral distribution and an additional parameter, the noncentrality parameter (ncp): what the alternative hypothesis distribution is centered on (the further from zero, the less belief in the null).
d is a function of this parameter, such that if ncp = 0 (i.e., the distribution is centered on the null-hypothesized value), then d = 0 (i.e., no effect).
$d_{pop} = ncp\sqrt{\frac{n_1 + n_2}{n_1 n_2}} \qquad\qquad ncp = d_{pop}\sqrt{\frac{n_1 n_2}{n_1 + n_2}}$
Confidence Intervals for Effect Size
A similar situation holds for the r and eta² effect size measures.
Gist: we’ll need a computer program to help us find the correct upper and lower bounds on the ncp to get exact confidence intervals for effect sizes, and then, based on its relation to d, transform those bounds.
Statistica and R (the MBESS package) have the functionality built in; several standalone packages are available (Steiger), while other programs allow for such intervals to be scripted (even SPSS scripts are available (Smithson)).
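For example, a sketch in R assuming the MBESS interfaces conf.limits.nct() (bounds on the ncp) and ci.smd() (bounds transformed to d); the inputs are hypothetical:

```r
# Sketch: exact (noncentral) CI for d from an observed t
library(MBESS)

t_obs <- 2.5; n1 <- 30; n2 <- 30

# bounds on the noncentrality parameter itself...
conf.limits.nct(ncp = t_obs, df = n1 + n2 - 2, conf.level = .95)

# ...or transformed directly to the standardized mean difference
ci.smd(ncp = t_obs, n.1 = n1, n.2 = n2, conf.level = .95)
```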
Limitations of effect size measures
Standardized mean differences: heterogeneity of within-conditions variances across studies can limit their usefulness; the unstandardized contrast may be better in this case.
Measures of association: correlations can be affected by sample variances and by whether the samples are independent or not, the design is balanced or not, or the factors are fixed or not.
They are also affected by artifacts such as missing observations, range restriction, categorization of continuous variables, and measurement error (see Hunter & Schmidt, 1994, for various corrections).
Variance-accounted-for indexes can make some effects look smaller than they really are in terms of their substantive significance.
Limitations of effect size measures
How to fool yourself with effect size estimation:
1. Measure effect size only at the group level
2. Apply generic definitions of effect size magnitude without first looking to the literature in your area
3. Believe that an effect size judged as “large” according to generic definitions must be an important result and that a “small” effect is unimportant (see Prentice & Miller, 1992)
4. Ignore the question of how theoretical or practical significance should be gauged in your research area
5. Estimate effect size only for statistically significant results
Limitations of effect size measures
6. Believe that finding large effects somehow lessens the need for replication
7. Forget that effect sizes are subject to sampling error
8. Forget that effect sizes for fixed factors are specific to the particular levels selected for study
9. Forget that standardized effect sizes encapsulate other quantities such as the unstandardized effect size, error variance, and experimental design
10. As a journal editor or reviewer, substitute effect size magnitude for statistical significance as a criterion for whether a work is published
11. Automatically equate effect size with ‘cause size’
Recommendations
APA task force suggestions: report effect sizes, report confidence intervals, use graphics.
Report and interpret effect sizes in the context of those seen in previous research rather than rules of thumb.
Report and interpret confidence intervals (for effect sizes too) within the context of prior research as well. In other words, don’t be overly concerned with whether a CI for a mean difference contains zero; attend instead to where it matches up with previous CIs.
Summarize prior and current research with the display of CIs in graphical form (e.g., with Tryon’s reduction).
Report effect sizes even for non-statistically significant results.
EFFECT SIZE ESTIMATION IN ONE-WAY ANOVA
Contrast Review
Concerns a design with a single factor A with at least two levels (conditions).
The omnibus comparison concerns all levels (i.e., dfA ≥ 2).
A focused comparison or contrast concerns just two levels (i.e., df = 1).
The omnibus effect is often relatively uninteresting compared with specific contrasts (e.g., treatment 1 vs. placebo control).
A large omnibus effect can also be misleading if due to a single discrepant mean that is not of substantive interest.
Comparing Groups
The traditional approach is to analyze the omnibus effect followed by analysis of all possible pairwise contrasts (i.e. compare each condition to every other condition)
However, this approach is typically incorrect (Wilkinson & TFSI,1999)—for example, it is rare that all such contrasts are interesting
Also, use of traditional methods for post hoc comparisons reduces power for every contrast, and power may already be low in a typical psych data setting
Contrast specification and tests
A contrast is a directional effect that corresponds to a particular facet of the omnibus effect:
often represented with the symbol ψ for a population or ψ̂ for a sample
a weighted sum of means
In a sample, a contrast is calculated as:
$\hat{\psi} = a_1\bar{X}_1 + a_2\bar{X}_2 + \cdots + a_k\bar{X}_k = \sum_j a_j\bar{X}_j$
where a1, a2, ..., ak is the set of weights that specifies the contrast.
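A minimal sketch of the computation, with hypothetical means and a weight set comparing group 1 with the average of groups 2 and 3:

```r
# Sketch: a sample contrast as a weighted sum of means
means   <- c(10, 12, 16)         # hypothetical group means
weights <- c(1, -0.5, -0.5)      # sum to zero; sum(|a_j|) = 2
psi_hat <- sum(weights * means)  # 10 - 14 = -4
psi_hat
```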
Contrast specification and tests
For effect size estimation with the d family, we generally want a standard set of contrast weights.
In a one-way design, you want the sum of the absolute values of the weights to equal two (i.e., ∑|aj| = 2.0).
Mean difference scaling permits the interpretation of a contrast as the difference between the averages of two subsets of means.
Contrast specification and tests
An exception to the need for mean difference scaling is testing for trends (polynomials) specified for a quantitative factor (e.g., drug dosage).
There are default sets of weights that define trend components (e.g., linear, quadratic, etc.) that are not typically based on mean difference scaling.
This is not usually a problem because effect size for trends is generally estimated with the r family (measures of association).
Measures of association for contrasts of any kind generally correct for the scale of the contrast weights.
Orthogonal Contrasts
Two contrasts are orthogonal if they each reflect an independent aspect of the omnibus effect
For balanced designs, the orthogonality condition is
$\sum_j a_{1j}\,a_{2j} = 0$
and for unbalanced designs it is
$\sum_j \frac{a_{1j}\,a_{2j}}{n_j} = 0$
(the latter covers both cases).
Orthogonal Contrasts
The maximum number of orthogonal contrasts is dfA = a − 1, where a = the number of levels of that factor.
For a full set of a − 1 orthogonal contrasts, the contrast sums of squares add up to SSA, and their eta-squareds sum to the omnibus eta-squared.
That is, the omnibus effect can be broken down into a − 1 independent directional effects
However, it is more important to analyze contrasts of substantive interest even if they are not orthogonal
Contrast specification and tests
The t-test for a contrast against the nil hypothesis is
$t(df_W) = \frac{\hat{\psi}}{s_{\hat{\psi}}}, \qquad s_{\hat{\psi}} = \sqrt{MS_{within}\sum_j \frac{a_j^2}{n_j}}$
(a weighted mean difference divided by its standard error).
The F is
$F(1, df_W) = t^2(df_W) = \frac{SS_{\hat{\psi}}}{MS_{within}}, \qquad SS_{\hat{\psi}} = \frac{\hat{\psi}^2}{\sum_j a_j^2 / n_j}$
Dependent Means
Test statistics for dependent mean contrasts usually have error terms based on only the two conditions compared—for example:
s2 here refers to the variance of the contrast difference scores
This error term does not assume sphericity, as the omnibus error term in a repeated measures design does.
$t(n - 1) = \frac{\hat{\psi}}{s_{\hat{\psi}}}, \qquad s_{\hat{\psi}} = \sqrt{\frac{s_D^2}{n}}$
Confidence Intervals
Approximate confidence intervals for contrasts are generally fine (i.e. good enough)
The general form of an individual confidence interval for Ψ is:
$\hat{\psi} \pm s_{\hat{\psi}}\,[t_{cv}(df_{error})]$
where dferror is specific to that contrast.
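Putting the pieces together, a sketch of an approximate individual CI for a contrast between independent means (all numbers hypothetical; MS within and its df come from the omnibus ANOVA):

```r
# Sketch: approximate 95% CI for a contrast, balanced one-way design
MS_w  <- 5.0; n <- 10                # MS within and per-group n (hypothetical)
a     <- c(1, -0.5, -0.5)            # mean difference scaling
means <- c(10, 12, 16)
df_w  <- length(means) * (n - 1)     # within-groups df

psi_hat <- sum(a * means)
se_psi  <- sqrt(MS_w * sum(a^2 / n))
psi_hat + c(-1, 1) * qt(.975, df_w) * se_psi
```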
Contrast specification and tests
There are also corrected confidence intervals for contrasts that adjust for multiple comparisons (i.e., inflated Type I error), known as simultaneous or joint confidence intervals.
Their widths are generally wider compared with individual confidence intervals because they are based on a more conservative critical value.
A program for computing the correction: http://www.psy.unsw.edu.au/research/resources/psyprogram.html
Standardized contrasts
The general form for a standardized mean difference (in terms of population parameters) is
$\delta = \frac{\psi}{\sigma_{pooled}}$
Standardized contrasts
There are three general ways to estimate σ (i.e., the standardizer) for contrasts between independent means:
1. Calculate d as Glass’s Δ, i.e., use the standard deviation of the control group.
2. Calculate d as Hedges’ g, i.e., use the square root of the pooled within-conditions variance for just the two groups being compared.
3. Calculate d as an extension of g, where the standardizer is the square root of MSW based on all groups (i.e., from the omnibus test). This is the generally recommended approach.
Standardized contrasts
To calculate d from a contrast t statistic, for a paper not reporting effect size as it should:
Recall that the absolute values of the weights should sum to 2.
$g \text{ (or } d) = t\sqrt{\sum_j \frac{a_j^2}{n_j}}$
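So, for a reported pairwise t with known group sizes, recovering d is immediate; a sketch with hypothetical numbers:

```r
# Sketch: d from a reported contrast t
t_rep <- 2.8; n <- 20        # per-group n, balanced (hypothetical)
a     <- c(1, -1)            # simple pairwise contrast, sum(|a_j|) = 2
d     <- t_rep * sqrt(sum(a^2 / n))
d                            # ~ 0.89
```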
CIs
Once d is calculated, one can easily obtain exact confidence intervals via the MBESS package in R or Steiger’s standalone program. The latter provides the interval for the noncentrality parameter, which must then be converted to d; since the MBESS package used Steiger’s work as a foundation, it is to be recommended.
Cohen’s f
Cohen’s f provides what can be interpreted as the average standardized mean difference across the groups in question.
It has a direct relation to measures of association (eta-squared).
As with Cohen’s d, there are guidelines regarding Cohen’s f: .10, .25, .40 for small, moderate, and large effect sizes.
These correspond to eta-squared values of .01, .06, .14.
Again though, one should consult the relevant literature when determining what constitutes ‘small’, ‘medium’, and ‘large’.
$f = \sqrt{\frac{\eta^2}{1 - \eta^2}} \qquad\qquad \eta^2 = \frac{f^2}{1 + f^2}$
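These conversions are easy to verify in R; a sketch (function names ours):

```r
# Sketch: converting between Cohen's f and eta-squared
f_from_eta2 <- function(eta2) sqrt(eta2 / (1 - eta2))
eta2_from_f <- function(f)    f^2 / (1 + f^2)

round(f_from_eta2(c(.01, .06, .14)), 2)  # .10, .25, .40
```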
Measures of Association
A measure of association describes the amount of the covariation between the predictor and outcome variables
It is expressed in an unsquared metric or a squared metric—the former is usually a correlation, the latter a variance-accounted-for effect size
A squared multiple correlation (R²) calculated in ANOVA is called the correlation ratio or estimated eta-squared (η̂²)
Eta-squared
A measure of the degree to which variability among observations can be attributed to conditions
Example: η² = .50 means 50% of the variability seen in the scores is due to the independent variable.
$\hat{\eta}^2 = R^2 = \frac{SS_{treat}}{SS_{total}}$
More than one factor
It is a fairly common practice to calculate eta2 (correlation ratio) for the omnibus effect but to calculate the partial correlation ratio for each contrast
As we have noted before
$\eta^2_{partial} = \frac{SS_{treat}}{SS_{treat} + SS_{error}}$
SPSS calls everything partial eta-squared in its output, but for a one-way design you’d report it as eta-squared.
Problem
Eta-squared (since it is an R-squared) is an upwardly biased measure of association (just as R-squared was)
As such it is better used descriptively than inferentially
Omega-squared
Another effect size measure that is less biased and interpreted in the same way as eta-squared
So why do we not see omega-squared so much?
People don’t like small values, and stat packages don’t provide it by default.
$\hat{\omega}^2 = \frac{SS_{effect} - (k - 1)MS_{error}}{SS_{total} + MS_{error}}$
Omega-squared
Put differently,
$\hat{\omega}^2 = \frac{df_{effect}(MS_{effect} - MS_{error})}{SS_{total} + MS_{error}}$
and, in partial form,
$\hat{\omega}^2_{partial} = \frac{df_{effect}(F - 1)}{df_{effect}(F - 1) + kn}$
where kn is the total number of observations.
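A sketch of the first form in R, with values taken from the Difficulty row of the example a few slides below (SS effect = 120, df = 1, SS total = 360, MS error = 5):

```r
# Sketch: omega-squared from ANOVA summary quantities
omega2 <- function(SS_effect, df_effect, SS_total, MS_error) {
  (SS_effect - df_effect * MS_error) / (SS_total + MS_error)
}
omega2(SS_effect = 120, df_effect = 1, SS_total = 360, MS_error = 5)  # ~ .32
```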
Omega-squared
Omega-squared assumes a balanced design; eta² does not.
Though the omega values are generally lower than those of the corresponding correlation ratios for the same data, their values converge as the sample size increases
Note that the values can be negative—if so, interpret as though the value were zero
Comparing effect size measures
Consider our previous example with item difficulty and arousal regarding performance.
Tests of Between-Subjects Effects (Dependent Variable: Score)

Source                 Type III SS   df   Mean Square       F    Sig.   Partial Eta Squared
B/t groups                240.000a    5        48.000   9.600    .000   .667
Difficulty                120.000     1       120.000  24.000    .000   .500
Arousal                    60.000     2        30.000   6.000    .008   .333
Difficulty * Arousal       60.000     2        30.000   6.000    .008   .333
Error                     120.000    24         5.000
Total                     360.000    29

a. R Squared = .667 (Adjusted R Squared = .597)
Comparing effect size measures
                 η²     ω²   Partial η²      f
B/t groups      .67    .59          .67   1.42
Difficulty      .33    .32          .50    .70
Arousal         .17    .14          .33    .45
Interaction     .17    .14          .33    .45

Slight differences are due to rounding; f is based on eta-squared.
No p-values
As before, programs are available to calculate confidence intervals for an effect size measure.
Example using the MBESS package for the overall effect, the 95% CI on ω² is .20 to .69.
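A sketch of that kind of call, assuming MBESS’s ci.pvaf() interface for CIs on the variance-accounted-for effect size (F and dfs from the table above):

```r
# Sketch: CI for the variance-accounted-for effect size of the overall effect
library(MBESS)
ci.pvaf(F.value = 9.6, df.1 = 5, df.2 = 24, N = 30, conf.level = .95)
```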
No p-values
Ask yourself, as we have before: if the null hypothesis is true, what would our effect size be (standardized mean difference or proportion of variance accounted for)? Zero.
Rather than do traditional hypothesis testing, one can simply see whether our CI for the effect size contains the value of zero (or, in the eta-squared case, gets really close to it).
If not, reject H0.
This is superior in that we can use the NHST approach, get a confidence interval reflecting the precision of our estimates, focus on effect size, and de-emphasize the p-value
FACTORIAL EFFECT SIZE
Multiple factors
There are some special considerations in designs with multiple factors
Ignoring these special issues can result in effect size variation across multiple-factor designs due more to artifacts than real differences in effect size magnitude
These issues arise in part out of the problem of what to do about ‘other’ factors when effects on a particular factor are estimated
Methods for effect size estimation in multiple-factor designs are also not as well developed as for single-factor designs
However, this is more true for standardized contrasts than for measures of association
Special Considerations
It is generally necessary to have a good understanding of ANOVA for multiple-factor designs (e.g., factorial ANOVA) in order to understand effect size estimation
However, this does not mean one needs to know only about the F test, which is just a small (and perhaps the least interesting) part of ANOVA
The most useful part of an ANOVA source table for the sake of effect size estimation is everything to the left of the usual columns for F and p
It is better to see ANOVA as a general tool for estimating variance components for different types of effects in different kinds of designs
Multiple factors
Multiple-factor designs arise out of a few basic distinctions, including whether the
1. factors are between-subjects vs. within-subjects
2. factors are experimental vs. nonexperimental
3. relation between the factors or subjects is crossed vs. nested
The most common type of multiple-factor design is the factorial design, in which every pair of factors is crossed (levels of each factor are studied in all combinations with levels of all other factors).
Multiple factors
Some common types of factorial designs:
Completely between-subjects factorial design: subjects are nested under all combinations of factor levels (i.e., all samples are independent).
Completely within-subjects factorial design: each case in a single sample is tested under every combination of two or more factors (i.e., subjects are crossed with all factors).
Mixed within-subjects factorial design (split-plot or mixed design): at least one factor is between-subjects and another is within-subjects (i.e., subjects are crossed with some factors but nested under others).
Multiple factors
While there are other distinctions, one that cannot be ignored is whether a factorial design is balanced or not
In a balanced (equal n among groups) design, the main and interaction effects are all independent, e.g., a true experiment with randomized assignment for all factors.
For this reason balanced factorials are referred to as orthogonal designs.
Multiple factors
However, factorial designs in applied research are often not balanced
We distinguish between unbalanced designs with
1. unequal but proportional cell sizes
2. unequal and disproportional cell sizes
Designs with proportional cell sizes can be analyzed as orthogonal designs, as equal n is a special case of proportional.
Multiple factors
In nonorthogonal designs, the main effects overlap (i.e., they are not independent)
This overlap can be corrected in different ways, which means that there may be no unique estimate of the sums of squares for a particular main effect
This ambiguity can affect both statistical tests and effect size estimates, especially measures of association
The choice among alternative sets of estimates for a particular nonorthogonal design is best based on rational considerations, not statistical ones
Factorial ANOVA
Just as in single-factor ANOVA, two basic sources of variability are estimated in factorial ANOVA, within-conditions and between conditions
The total within-conditions variance, MSW, is estimated the same basic way—as the weighted average of the within-conditions variances This is true regardless of whether the design is balanced or not
However, estimation of the numerator of the total between-conditions variability in a factorial design depends on whether the design is balanced or not
The “standard” equations for effect sums of squares presented in many introductory statistics books are for balanced designs only
It is only for such designs that the sums of squares for the main and interaction effects are both additive and unique, i.e., we could add up the eta² for the main effects and interaction to get the overall eta².
Factorial ANOVA
Contrasts can be specified for main, simple, or interaction effects
A single-factor contrast involves the levels of just one factor while we are controlling for the other factors. There are two kinds:
1. A main comparison involves contrasts between subsets of marginal means for the same factor (i.e., it is conducted within a main effect).
2. A simple comparison involves contrasts between subsets of cell means in the same row or column (i.e., it is conducted within a simple effect).
A single-factor contrast is specified with weights just as a contrast in a one-way design is; we assume here mean difference scaling (i.e., ∑|ai| = 2.0).
Factorial ANOVA
An interaction contrast specifies a single-df interaction effect.
It is specified with weights applied to the cells of the whole design (e.g., to all six cells of a 2×3 design).
These weights follow the same general rules as for one-way designs, but see the Kline text for specifics regarding effect size estimation.
The weights should also be doubly centered, which means that they sum to zero in any row or column.
If an interaction contrast in a two-way design is to be interpreted as the difference between a pair of simple comparisons (i.e., mean difference scaling), the sum of the absolute values of the weights must be 4.0.
As an example, such a set of weights might compare the simple effect of A at B1 with the simple effect of A at B3.
Factorial ANOVA
As is probably clear by now, there can be many possible effects to analyze in a factorial design.
This is especially true for designs with three or more factors, for which there are a three-way interaction effect, two-way interaction effects, main effects, and contrasts for any of those.
One can easily get lost by estimating every possible effect, which means it is important to have a plan that minimizes the number of analyses while still respecting essential hypotheses.
Some of the worst misuses of statistical tests are seen in factorial designs when this advice is ignored, e.g., all possible effects are tested and sorted into two categories: those statistically significant (and subsequently discussed at length) vs. those not statistically significant (and subsequently ignored).
This misuse is compounded when power is ignored, which can vary from effect to effect in factorial designs.
Standardized Contrasts
There is no definitive method at present for calculating standardized contrasts in factorial designs
However, some general principles discussed by Cortina and Nouri (2000), Olejnik and Algina (2000), and others are that:
1. Estimates for effects of each independent variable in a factorial design should be comparable with effect sizes for the same factor studied in a one-way design
2. Changing the number of factors in the design should not necessarily change the effect size estimates for any one of them
Standardized Contrasts
Standardized contrasts may be preferred over measures of association as effect size indexes if contrasts are the main focus of the analysis. This is most likely in designs with just two factors.
Because they are more efficient, measures of association may be preferred in larger designs.
Standardized Contrasts
Standardized contrasts in factorial designs have the same general form as in one-way designs: d = Ψ/σ*
The problem is figuring out which standardizer should go in the denominator
Here we’ll present what is described in both Kline and Howell in a general way, though Kline has more specifics
Standardized Contrasts
Basically it comes down to putting the variance due to other effects not being considered back into the error variance, which will become our standardizer.
So if we have sums of squares for Therapy, Race, T*R, and Error, and we used √MSerror as the standardizer here, it would be much smaller than in a one-way design comparing groups on the main effect of Therapy (for example).
Then we’d run around saying we had a much larger effect than those who just looked at a main effect of therapy, when in fact that may not be the case at all.
Standardized Contrasts
The solution then is to add those other effects back into the error term
What of simple effects? The simple effects would use the same standardizer as would be appropriate for the corresponding main effect.
$\hat{s} = \sqrt{\frac{SS_{race} + SS_{T \times R} + SS_{error}}{df_{race} + df_{T \times R} + df_{error}}}$
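For instance, a sketch with hypothetical sums of squares and dfs for a Therapy x Race design, pooling Race and T*R back into the error before standardizing a Therapy contrast:

```r
# Sketch: standardizer for Therapy contrasts in a Therapy x Race factorial
SS_race <- 40; SS_TxR <- 30; SS_err <- 200   # hypothetical values
df_race <- 2;  df_TxR <- 4;  df_err <- 54

s_hat <- sqrt((SS_race + SS_TxR + SS_err) / (df_race + df_TxR + df_err))
s_hat  # use as the denominator of d for the Therapy main effect
```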
Measures of Association
If using measures of association, the process is the same as in the one-way design
$\hat{\omega}^2 = \frac{SS_{effect} - (k - 1)MS_{error}}{SS_{total} + MS_{error}}$
Summary
As with the one-way design we have the approach of looking at standardized mean differences (d-family) or variance accounted for assessments of effect size (r-family)
There is no change as far as the r-family goes when dealing with factorial design
The goal for standardized contrasts is to come up with a measure that will reflect what is seen for main effects and be consistent across studies
EFFECT SIZE
Repeated Measures
Measures of association
Measures of association are conducted in the same way for repeated measures design
In general, partial η2 is SSeffect/(SSeffect + SSerror)
And this holds whether the samples are independent or not, and for a design of any complexity
Standardized contrasts
There are three approaches one could use with dependent samples
1. Treat them as you would contrasts between independent means.
2. Standardize the dependent mean change against the standard deviation of the contrast difference scores, sDΨ.
3. Standardize using the square root of MSerror.
The first method makes a standardized mean change from a correlated design more directly comparable with a standardized contrast from a design with unrelated samples, but the second may be more appropriate when the change itself is what we are concerned with.
The third is not recommended, as in that case the metric is not generally that of the original or change scores and so may be difficult to interpret.
Standardized contrasts
One thing we’d like to be able to do is compare situations that could have been either dependent or independent. Ex.: we could test work performance in the morning and at night via random assignment or repeated measures.
In that case we’d want a standardizer in the original metric, so the choice would be to use the d family measures we would use for simple pairwise contrasts between independent samples.
For actual interval repeated measures (i.e., time), it should be noted that we are typically more interested in testing for trends and the r family of measures.
Mixed design
r family measures of effect such as η2 will again be conducted in the same fashion
For standardized mean differences we’ll have some considerations given which differences we’re looking at
Mixed design
For the between-groups differences, we can calculate as normal for comparisons at each measure/interval and for simple effects, i.e., use √MSwithin. This assumes you’ve met your homogeneity of variance assumption.
This approach could be taken for all contrasts, but it would ignore the cross-condition correlations for repeated measures comparisons.
However, the benefit we’d get from being able to compare across different (non-repeated) designs suggests it is probably the best approach.
One could, if desired, look at differences in the metric of the difference scores and thus standardized mean changes.
For more info, consult the Kline text, Olejnik and Algina (2000) on our class webpage, Cortina and Nouri (2000)
ANCOVA
r family of effect sizes: same old story.
d family effect sizes: while one might use adjusted means, in an experimental design (i.e., no correlation between covariate and grouping variable) the difference should be pretty much the same as with the original means.
However, current thinking is that the standardizer should come from the original metric, so run just the ANOVA and use the square root of the MSerror from that analysis.
In other words, if thinking about standardized mean differences, it won’t matter whether you ran an ANOVA on the posttest scores or an ANCOVA, provided you meet your assumptions.
However, your (partial) r effect sizes will be larger with the ANCOVA, as the variance due to the covariate is taken out of the error term.