TRANSCRIPT
MODERN APPROACHES
Effect Size Key source: Kline, Beyond Significance Testing
Always a difference
Commonly define the null hypothesis as ‘no difference’
Differences between groups always exist (at some level of precision)
Obtaining statistical significance can be seen as just a matter of sample size
Furthermore, the importance and magnitude of an effect are not reflected (because of the role of sample size in probability value attained)
What should we be doing?
Want to make sure we have looked hard enough for the difference – power analysis
Figure out how big the thing we are looking for is – effect size
Calculating effect size
Different statistical tests have different effect sizes developed for them
However, the general principle is the same
Effect size refers to the magnitude of the impact of the independent variable(s) on the outcome variable
Types of effect size
Three basic classes of effect size are commonly used: the d family, the r family, and case-level measures.
d family: focused on standardized mean differences, which allows comparison across samples and variables with differing variance, e.g., Cohen’s d. Note that sometimes there is no need to standardize (when the units of the scale have inherent meaning).
r family (variance-accounted-for): the amount explained versus the total.
Case-level effect sizes.
However, one should note that in group comparison situations, each is directly translatable to the others.
Cohen’s d
Independent samples t-test
Cohen initially suggested one could use either sample’s standard deviation, since the two should be equal under our assumptions. In practice, people now use the pooled variance.
There is also Cohen’s f for the ANOVA counterpart. Others are available for particular cases, e.g., control groups and dependent samples.
$d = \frac{\bar{X}_1 - \bar{X}_2}{s_p}$
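As a quick illustration, a minimal R sketch of the pooled-standard-deviation version (function name and data are ours, not from the source):

```r
# Minimal sketch: Cohen's d for two independent groups,
# using the pooled standard deviation as the standardizer
cohens_d <- function(x1, x2) {
  n1 <- length(x1); n2 <- length(x2)
  sp <- sqrt(((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / (n1 + n2 - 2))
  (mean(x1) - mean(x2)) / sp
}

set.seed(1)
cohens_d(rnorm(30, 100, 15), rnorm(30, 92, 15))  # population d here is ~0.53
```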
Association
A measure of association describes the amount of the covariation between the independent and dependent variables
It is expressed in an unsquared metric or a squared metric—the former is usually a correlation, the latter a variance-accounted-for effect size
R, eta
Characterizing effect size
Cohen emphasized that the interpretation of effects requires the researcher to consider things narrowly in terms of the specific area of inquiry
Evaluation of effect sizes inherently requires a personal value judgment regarding the practical or clinical importance of the effects
Other effect size measures
Case-level effect sizes
Measures for non-continuous data:
Association: contingency coefficient, phi, Cramér’s phi
d family: odds ratios
Agreement: kappa
Case-level effect sizes
Indexes such as Cohen’s d and eta2 estimate effect size at the group or variable level only
However, it is often of interest to estimate differences at the case level
Case-level indices of group distinctiveness are proportions of scores from one group versus another that fall above or below a reference point
Reference points can be relative (e.g., a certain number of standard deviations above or below the mean in the combined frequency distribution) or more absolute (e.g., the cutting score on an admissions test)
Case-level effect sizes
U1: proportion of nonoverlap. If there is no overlap, U1 = 1; if the distributions overlap completely, U1 = 0.
U2: proportion of scores in the lower group exceeded by the same proportion in the upper group. If the means are the same, U2 = .5; if all of group 2 exceeds group 1, U2 = 1.0.
U3: proportion of scores in the lower group exceeded by the typical (median) score in the upper group. Same range as U2.
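Assuming normal distributions with equal variances, the U measures follow directly from d (Cohen, 1988); a small R sketch of those relations (variable names ours):

```r
# Sketch: Cohen's U measures from d, assuming normality and equal variances
d  <- 0.8
U3 <- pnorm(d)            # lower group's proportion below the upper group's median
U2 <- pnorm(d / 2)        # proportion exceeded by the same proportion in the upper group
U1 <- (2 * U2 - 1) / U2   # proportion of nonoverlap
round(c(U1 = U1, U2 = U2, U3 = U3), 2)  # .47, .66, .79
```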
Other case-level effect sizes
Tail ratios (Feingold, 1995): Relative proportion of scores from two different groups that fall in the upper extreme (i.e., either the left or right tail) of the combined frequency distribution
“Extreme” is usually defined relatively in terms of the number of standard deviations away from the grand mean
Tail ratio > 1.0 indicates one group has relatively more extreme scores
Here, tail ratio = p2/p1.
Other case-level effect sizes
Common language effect size (McGraw & Wong, 1992) is the predicted probability that a random score from the upper group exceeds a random score from the lower group
Compute
$z_{CL} = \frac{0 - (\bar{X}_1 - \bar{X}_2)}{\sqrt{s_1^2 + s_2^2}}$
and find the area to the right of that value. Range: .5 – 1.0.
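Because the area to the right of that z equals the area below $(\bar{X}_1 - \bar{X}_2)/\sqrt{s_1^2 + s_2^2}$, the computation is a one-liner; a sketch with hypothetical summary statistics:

```r
# Sketch: common language effect size (McGraw & Wong, 1992)
# P(random score from group 1 exceeds a random score from group 2)
m1 <- 105; m2 <- 100; s1 <- 15; s2 <- 15   # hypothetical summary statistics
CL <- pnorm((m1 - m2) / sqrt(s1^2 + s2^2))
CL  # ~ .59
```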
Confidence Intervals for Effect Size
Effect size statistics such as Hedges’ g and η² have complex distributions.
Traditional methods of interval estimation rely on approximate standard errors assuming large sample sizes.
The general form of a CI for d is the same as that of any CI:
$d \pm t_{cv}(s_d)$
Problem
However, CIs formulated in this manner are only approximate, and are based on the central (t) distribution centered on zero.
The true (exact) CI depends on a noncentral distribution and an additional parameter, the noncentrality parameter (ncp): what the alternative hypothesis distribution is centered on (the further from zero, the less belief in the null).
d is a function of this parameter, such that if ncp = 0 (i.e., the distribution is centered on the null-hypothesized value), then d = 0 (i.e., no effect).
$d_{pop} = ncp\sqrt{\frac{n_1 + n_2}{n_1 n_2}} \qquad\qquad ncp = d_{pop}\sqrt{\frac{n_1 n_2}{n_1 + n_2}}$
Confidence Intervals for Effect Size
A similar situation holds for the r and eta² effect size measures.
Gist: we’ll need a computer program to help us find the correct upper and lower bounds on the ncp to get exact confidence intervals for effect sizes, and then, based on its relation to d, transform those bounds.
Statistica and R (the MBESS package) have the functionality built in; several standalone packages are available (Steiger), while other programs allow for such intervals to be scripted (even SPSS scripts are available (Smithson)).
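For example, a sketch in R assuming the MBESS interfaces conf.limits.nct() (bounds on the ncp) and ci.smd() (bounds transformed to d); the inputs are hypothetical:

```r
# Sketch: exact (noncentral) CI for d from an observed t
library(MBESS)

t_obs <- 2.5; n1 <- 30; n2 <- 30

# bounds on the noncentrality parameter itself...
conf.limits.nct(ncp = t_obs, df = n1 + n2 - 2, conf.level = .95)

# ...or transformed directly to the standardized mean difference
ci.smd(ncp = t_obs, n.1 = n1, n.2 = n2, conf.level = .95)
```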
Limitations of effect size measures
Standardized mean differences: heterogeneity of within-conditions variances across studies can limit their usefulness; the unstandardized contrast may be better in this case.
Measures of association: correlations can be affected by sample variances and by whether the samples are independent or not, the design is balanced or not, or the factors are fixed or not.
They are also affected by artifacts such as missing observations, range restriction, categorization of continuous variables, and measurement error (see Hunter & Schmidt, 1994, for various corrections).
Variance-accounted-for indexes can make some effects look smaller than they really are in terms of their substantive significance.
Limitations of effect size measures
How to fool yourself with effect size estimation:
1. Measure effect size only at the group level
2. Apply generic definitions of effect size magnitude without first looking to the literature in your area
3. Believe that an effect size judged as “large” according to generic definitions must be an important result and that a “small” effect is unimportant (see Prentice & Miller, 1992)
4. Ignore the question of how theoretical or practical significance should be gauged in your research area
5. Estimate effect size only for statistically significant results
Limitations of effect size measures
6. Believe that finding large effects somehow lessens the need for replication
7. Forget that effect sizes are subject to sampling error
8. Forget that effect sizes for fixed factors are specific to the particular levels selected for study
9. Forget that standardized effect sizes encapsulate other quantities such as the unstandardized effect size, error variance, and experimental design
10. As a journal editor or reviewer, substitute effect size magnitude for statistical significance as a criterion for whether a work is published
11. Automatically equate effect size with ‘cause size’
Recommendations
APA task force suggestions: report effect sizes, report confidence intervals, use graphics.
Report and interpret effect sizes in the context of those seen in previous research rather than rules of thumb.
Report and interpret confidence intervals (for effect sizes too) within the context of prior research as well. In other words, don’t be overly concerned with whether a CI for a mean difference contains zero; attend instead to where it matches up with previous CIs.
Summarize prior and current research with the display of CIs in graphical form (e.g., with Tryon’s reduction).
Report effect sizes even for non-statistically significant results.
EFFECT SIZE ESTIMATION IN ONE-WAY ANOVA
Contrast Review
Concerns a design with a single factor A with at least two levels (conditions).
The omnibus comparison concerns all levels (i.e., dfA ≥ 2).
A focused comparison or contrast concerns just two levels (i.e., df = 1).
The omnibus effect is often relatively uninteresting compared with specific contrasts (e.g., treatment 1 vs. placebo control).
A large omnibus effect can also be misleading if due to a single discrepant mean that is not of substantive interest.
Comparing Groups
The traditional approach is to analyze the omnibus effect followed by analysis of all possible pairwise contrasts (i.e. compare each condition to every other condition)
However, this approach is typically incorrect (Wilkinson & TFSI,1999)—for example, it is rare that all such contrasts are interesting
Also, use of traditional methods for post hoc comparisons reduces power for every contrast, and power may already be low in a typical psych data setting
Contrast specification and tests
A contrast is a directional effect that corresponds to a particular facet of the omnibus effect:
often represented with the symbol ψ for a population or ψ̂ for a sample
a weighted sum of means
In a sample, a contrast is calculated as:
$\hat{\psi} = a_1\bar{X}_1 + a_2\bar{X}_2 + \cdots + a_k\bar{X}_k = \sum_j a_j\bar{X}_j$
where a1, a2, ..., ak is the set of weights that specifies the contrast.
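A minimal sketch of the computation, with hypothetical means and a weight set comparing group 1 with the average of groups 2 and 3:

```r
# Sketch: a sample contrast as a weighted sum of means
means   <- c(10, 12, 16)         # hypothetical group means
weights <- c(1, -0.5, -0.5)      # sum to zero; sum(|a_j|) = 2
psi_hat <- sum(weights * means)  # 10 - 14 = -4
psi_hat
```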
Contrast specification and tests
For effect size estimation with the d family, we generally want a standard set of contrast weights.
In a one-way design, you want the sum of the absolute values of the weights to equal two (i.e., ∑|aj| = 2.0).
Mean difference scaling permits the interpretation of a contrast as the difference between the averages of two subsets of means.
Contrast specification and tests
An exception to the need for mean difference scaling is testing for trends (polynomials) specified for a quantitative factor (e.g., drug dosage).
There are default sets of weights that define trend components (e.g., linear, quadratic, etc.) that are not typically based on mean difference scaling.
This is not usually a problem because effect size for trends is generally estimated with the r family (measures of association).
Measures of association for contrasts of any kind generally correct for the scale of the contrast weights.
Orthogonal Contrasts
Two contrasts are orthogonal if they each reflect an independent aspect of the omnibus effect
For balanced designs, the orthogonality condition is
$\sum_j a_{1j}\,a_{2j} = 0$
and for unbalanced designs it is
$\sum_j \frac{a_{1j}\,a_{2j}}{n_j} = 0$
(the latter covers both cases).
Orthogonal Contrasts
The maximum number of orthogonal contrasts is dfA = a − 1, where a = the number of levels of that factor.
For a full set of a − 1 orthogonal contrasts, the contrast sums of squares add up to SSA, and their eta-squareds sum to the omnibus eta-squared.
That is, the omnibus effect can be broken down into a − 1 independent directional effects
However, it is more important to analyze contrasts of substantive interest even if they are not orthogonal
Contrast specification and tests
The t-test for a contrast against the nil hypothesis is
$t(df_W) = \frac{\hat{\psi}}{s_{\hat{\psi}}}, \qquad s_{\hat{\psi}} = \sqrt{MS_{within}\sum_j \frac{a_j^2}{n_j}}$
(a weighted mean difference divided by its standard error).
The F is
$F(1, df_W) = t^2(df_W) = \frac{SS_{\hat{\psi}}}{MS_{within}}, \qquad SS_{\hat{\psi}} = \frac{\hat{\psi}^2}{\sum_j a_j^2 / n_j}$
Dependent Means
Test statistics for dependent mean contrasts usually have error terms based on only the two conditions compared—for example:
s2 here refers to the variance of the contrast difference scores
This error term does not assume sphericity, as the omnibus error term in a repeated measures design does.
$t(n - 1) = \frac{\hat{\psi}}{s_{\hat{\psi}}}, \qquad s_{\hat{\psi}} = \sqrt{\frac{s_D^2}{n}}$
Confidence Intervals
Approximate confidence intervals for contrasts are generally fine (i.e. good enough)
The general form of an individual confidence interval for Ψ is:
$\hat{\psi} \pm s_{\hat{\psi}}\,[t_{cv}(df_{error})]$
where dferror is specific to that contrast.
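Putting the pieces together, a sketch of an approximate individual CI for a contrast between independent means (all numbers hypothetical; MS within and its df come from the omnibus ANOVA):

```r
# Sketch: approximate 95% CI for a contrast, balanced one-way design
MS_w  <- 5.0; n <- 10                # MS within and per-group n (hypothetical)
a     <- c(1, -0.5, -0.5)            # mean difference scaling
means <- c(10, 12, 16)
df_w  <- length(means) * (n - 1)     # within-groups df

psi_hat <- sum(a * means)
se_psi  <- sqrt(MS_w * sum(a^2 / n))
psi_hat + c(-1, 1) * qt(.975, df_w) * se_psi
```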
Contrast specification and tests
There are also corrected confidence intervals for contrasts that adjust for multiple comparisons (i.e., inflated Type I error), known as simultaneous or joint confidence intervals.
Their widths are generally wider compared with individual confidence intervals because they are based on a more conservative critical value.
A program for computing the correction: http://www.psy.unsw.edu.au/research/resources/psyprogram.html
Standardized contrasts
The general form for a standardized mean difference (in terms of population parameters) is
$\delta = \frac{\psi}{\sigma_{pooled}}$
Standardized contrasts
There are three general ways to estimate σ (i.e., the standardizer) for contrasts between independent means:
1. Calculate d as Glass’s Δ, i.e., use the standard deviation of the control group.
2. Calculate d as Hedges’ g, i.e., use the square root of the pooled within-conditions variance for just the two groups being compared.
3. Calculate d as an extension of g, where the standardizer is the square root of MSW based on all groups (i.e., from the omnibus test). This is the generally recommended approach.
Standardized contrasts
To calculate d from a contrast t statistic, for a paper not reporting effect size as it should:
Recall that the absolute values of the weights should sum to 2.
$g \text{ (or } d) = t\sqrt{\sum_j \frac{a_j^2}{n_j}}$
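So, for a reported pairwise t with known group sizes, recovering d is immediate; a sketch with hypothetical numbers:

```r
# Sketch: d from a reported contrast t
t_rep <- 2.8; n <- 20        # per-group n, balanced (hypothetical)
a     <- c(1, -1)            # simple pairwise contrast, sum(|a_j|) = 2
d     <- t_rep * sqrt(sum(a^2 / n))
d                            # ~ 0.89
```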
CIs
Once d is calculated, one can easily obtain exact confidence intervals via the MBESS package in R or Steiger’s standalone program. The latter provides the interval for the noncentrality parameter, which must then be converted to d; since the MBESS package used Steiger’s work as a foundation, it is to be recommended.
Cohen’s f
Cohen’s f provides what can be interpreted as the average standardized mean difference across the groups in question.
It has a direct relation to measures of association (eta-squared).
As with Cohen’s d, there are guidelines regarding Cohen’s f: .10, .25, .40 for small, moderate, and large effect sizes.
These correspond to eta-squared values of .01, .06, .14.
Again though, one should consult the relevant literature when determining what constitutes ‘small’, ‘medium’, and ‘large’.
$f = \sqrt{\frac{\eta^2}{1 - \eta^2}} \qquad\qquad \eta^2 = \frac{f^2}{1 + f^2}$
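These conversions are easy to verify in R; a sketch (function names ours):

```r
# Sketch: converting between Cohen's f and eta-squared
f_from_eta2 <- function(eta2) sqrt(eta2 / (1 - eta2))
eta2_from_f <- function(f)    f^2 / (1 + f^2)

round(f_from_eta2(c(.01, .06, .14)), 2)  # .10, .25, .40
```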
Measures of Association
A measure of association describes the amount of the covariation between the predictor and outcome variables
It is expressed in an unsquared metric or a squared metric—the former is usually a correlation, the latter a variance-accounted-for effect size
A squared multiple correlation (R²) calculated in ANOVA is called the correlation ratio or estimated eta-squared (η̂²)
Eta-squared
A measure of the degree to which variability among observations can be attributed to conditions
Example: η² = .50 means 50% of the variability seen in the scores is due to the independent variable.
$\hat{\eta}^2 = R^2 = \frac{SS_{treat}}{SS_{total}}$
More than one factor
It is a fairly common practice to calculate eta2 (correlation ratio) for the omnibus effect but to calculate the partial correlation ratio for each contrast
As we have noted before
$\eta^2_{partial} = \frac{SS_{treat}}{SS_{treat} + SS_{error}}$
SPSS calls everything partial eta-squared in its output, but for a one-way design you’d report it as eta-squared.
Problem
Eta-squared (since it is an R-squared) is an upwardly biased measure of association (just as R-squared was)
As such it is better used descriptively than inferentially
Omega-squared
Another effect size measure that is less biased and interpreted in the same way as eta-squared
So why do we not see omega-squared so much?
People don’t like small values, and stat packages don’t provide it by default.
$\hat{\omega}^2 = \frac{SS_{effect} - (k - 1)MS_{error}}{SS_{total} + MS_{error}}$
Omega-squared
Put differently,
$\hat{\omega}^2 = \frac{df_{effect}(MS_{effect} - MS_{error})}{SS_{total} + MS_{error}}$
and, in partial form,
$\hat{\omega}^2_{partial} = \frac{df_{effect}(F - 1)}{df_{effect}(F - 1) + kn}$
where kn is the total number of observations.
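A sketch of the first form in R, with values taken from the Difficulty row of the example a few slides below (SS effect = 120, df = 1, SS total = 360, MS error = 5):

```r
# Sketch: omega-squared from ANOVA summary quantities
omega2 <- function(SS_effect, df_effect, SS_total, MS_error) {
  (SS_effect - df_effect * MS_error) / (SS_total + MS_error)
}
omega2(SS_effect = 120, df_effect = 1, SS_total = 360, MS_error = 5)  # ~ .32
```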
Omega-squared
Omega-squared assumes a balanced design; eta² does not.
Though the omega values are generally lower than those of the corresponding correlation ratios for the same data, their values converge as the sample size increases
Note that the values can be negative—if so, interpret as though the value were zero
Comparing effect size measures
Consider our previous example with item difficulty and arousal regarding performance.
Tests of Between-Subjects Effects (Dependent Variable: Score)

Source                 Type III SS   df   Mean Square       F    Sig.   Partial Eta Squared
B/t groups                240.000a    5        48.000   9.600    .000   .667
Difficulty                120.000     1       120.000  24.000    .000   .500
Arousal                    60.000     2        30.000   6.000    .008   .333
Difficulty * Arousal       60.000     2        30.000   6.000    .008   .333
Error                     120.000    24         5.000
Total                     360.000    29

a. R Squared = .667 (Adjusted R Squared = .597)
Comparing effect size measures
                 η²     ω²   Partial η²      f
B/t groups      .67    .59          .67   1.42
Difficulty      .33    .32          .50    .70
Arousal         .17    .14          .33    .45
Interaction     .17    .14          .33    .45

Slight differences are due to rounding; f is based on eta-squared.
No p-values
As before, programs are available to calculate confidence intervals for an effect size measure.
Example using the MBESS package for the overall effect, the 95% CI on ω² is .20 to .69.
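A sketch of that kind of call, assuming MBESS’s ci.pvaf() interface for CIs on the variance-accounted-for effect size (F and dfs from the table above):

```r
# Sketch: CI for the variance-accounted-for effect size of the overall effect
library(MBESS)
ci.pvaf(F.value = 9.6, df.1 = 5, df.2 = 24, N = 30, conf.level = .95)
```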
No p-values
Ask yourself, as we have before: if the null hypothesis is true, what would our effect size be (standardized mean difference or proportion of variance accounted for)? Zero.
Rather than do traditional hypothesis testing, one can simply see whether our CI for the effect size contains the value of zero (or, in the eta-squared case, gets really close to it).
If not, reject H0.
This is superior in that we can use the NHST approach, get a confidence interval reflecting the precision of our estimates, focus on effect size, and de-emphasize the p-value
FACTORIAL EFFECT SIZE
Multiple factors
There are some special considerations in designs with multiple factors
Ignoring these special issues can result in effect size variation across multiple-factor designs due more to artifacts than real differences in effect size magnitude
These issues arise in part out of the problem of what to do about ‘other’ factors when effects on a particular factor are estimated
Methods for effect size estimation in multiple-factor designs are also not as well developed as for single-factor designs
However, this is more true for standardized contrasts than for measures of association
Special Considerations
It is generally necessary to have a good understanding of ANOVA for multiple-factor designs (e.g., factorial ANOVA) in order to understand effect size estimation
However, this does not mean one needs to know only about the F test, which is just a small (and perhaps the least interesting) part of ANOVA
The most useful part of an ANOVA source table for the sake of effect size estimation is everything to the left of the usual columns for F and p
It is better to see ANOVA as a general tool for estimating variance components for different types of effects in different kinds of designs
Multiple factors
Multiple-factor designs arise out of a few basic distinctions, including whether the
1. factors are between-subjects vs. within-subjects
2. factors are experimental vs. nonexperimental
3. relation between the factors or subjects is crossed vs. nested
The most common type of multiple-factor design is the factorial design, in which every pair of factors is crossed (levels of each factor are studied in all combinations with levels of all other factors).
Multiple factors
Some common types of factorial designs:
Completely between-subjects factorial design: subjects are nested under all combinations of factor levels (i.e., all samples are independent).
Completely within-subjects factorial design: each case in a single sample is tested under every combination of two or more factors (i.e., subjects are crossed with all factors).
Mixed within-subjects factorial design (split-plot or mixed design): at least one factor is between-subjects and another is within-subjects (i.e., subjects are crossed with some factors but nested under others).
Multiple factors
While there are other distinctions, one that cannot be ignored is whether a factorial design is balanced or not
In a balanced (equal n among groups) design, the main and interaction effects are all independent, e.g., a true experiment with randomized assignment for all factors.
For this reason balanced factorials are referred to as orthogonal designs.
Multiple factors
However, factorial designs in applied research are often not balanced
We distinguish between unbalanced designs with
1. unequal but proportional cell sizes
2. unequal and disproportional cell sizes
Designs with proportional cell sizes can be analyzed as orthogonal designs, as equal n is a special case of proportional.
Multiple factors
In nonorthogonal designs, the main effects overlap (i.e., they are not independent)
This overlap can be corrected in different ways, which means that there may be no unique estimate of the sums of squares for a particular main effect
This ambiguity can affect both statistical tests and effect size estimates, especially measures of association
The choice among alternative sets of estimates for a particular nonorthogonal design is best based on rational considerations, not statistical ones
Factorial ANOVA
Just as in single-factor ANOVA, two basic sources of variability are estimated in factorial ANOVA, within-conditions and between conditions
The total within-conditions variance, MSW, is estimated the same basic way—as the weighted average of the within-conditions variances This is true regardless of whether the design is balanced or not
However, estimation of the numerator of the total between-conditions variability in a factorial design depends on whether the design is balanced or not
The “standard” equations for effect sums of squares presented in many introductory statistics books are for balanced designs only
It is only for such designs that the sums of squares for the main and interaction effects are both additive and unique, i.e., we could add up the eta² for the main effects and interaction to get the overall eta².
Factorial ANOVA
Contrasts can be specified for main, simple, or interaction effects
A single-factor contrast involves the levels of just one factor while we are controlling for the other factors. There are two kinds:
1. A main comparison involves contrasts between subsets of marginal means for the same factor (i.e., it is conducted within a main effect).
2. A simple comparison involves contrasts between subsets of cell means in the same row or column (i.e., it is conducted within a simple effect).
A single-factor contrast is specified with weights just as a contrast in a one-way design is; we assume here mean difference scaling (i.e., ∑|ai| = 2.0).
Factorial ANOVA
An interaction contrast specifies a single-df interaction effect.
It is specified with weights applied to the cells of the whole design (e.g., to all six cells of a 2×3 design).
These weights follow the same general rules as for one-way designs, but see the Kline text for specifics regarding effect size estimation.
The weights should also be doubly centered, which means that they sum to zero in any row or column.
If an interaction contrast in a two-way design is to be interpreted as the difference between a pair of simple comparisons (i.e., mean difference scaling), the sum of the absolute values of the weights must be 4.0.
As an example, such a set of weights might compare the simple effect of A at B1 with the simple effect of A at B3.
Factorial ANOVA
As is probably clear by now, there can be many possible effects to analyze in a factorial design.
This is especially true for designs with three or more factors, for which there are a three-way interaction effect, two-way interaction effects, main effects, and contrasts for any of those.
One can easily get lost by estimating every possible effect, which means it is important to have a plan that minimizes the number of analyses while still respecting essential hypotheses.
Some of the worst misuses of statistical tests are seen in factorial designs when this advice is ignored, e.g., all possible effects are tested and sorted into two categories: those statistically significant (and subsequently discussed at length) vs. those not statistically significant (and subsequently ignored).
This misuse is compounded when power is ignored, which can vary from effect to effect in factorial designs.
Standardized Contrasts
There is no definitive method at present for calculating standardized contrasts in factorial designs
However, some general principles discussed by Cortina and Nouri (2000), Olejnik and Algina (2000), and others are that:
1. Estimates for effects of each independent variable in a factorial design should be comparable with effect sizes for the same factor studied in a one-way design
2. Changing the number of factors in the design should not necessarily change the effect size estimates for any one of them
Standardized Contrasts
Standardized contrasts may be preferred over measures of association as effect size indexes if contrasts are the main focus of the analysis. This is most likely in designs with just two factors.
Because they are more efficient, measures of association may be preferred in larger designs.
Standardized Contrasts
Standardized contrasts in factorial designs have the same general form as in one-way designs: d = Ψ/σ*
The problem is figuring out which standardizer should go in the denominator
Here we’ll present what is described in both Kline and Howell in a general way, though Kline has more specifics
Standardized Contrasts
Basically it comes down to putting the variance due to other effects not being considered back into the error variance, which will become our standardizer.
So if we have sums of squares for Therapy, Race, T*R, and Error, and we used √MSerror as the standardizer here, it would be much smaller than in a one-way design comparing groups on the main effect of Therapy (for example).
Then we’d run around saying we had a much larger effect than those who just looked at a main effect of therapy, when in fact that may not be the case at all.
Standardized Contrasts
The solution then is to add those other effects back into the error term
What of simple effects? The simple effects would use the same standardizer as would be appropriate for the corresponding main effect.
$\hat{s} = \sqrt{\frac{SS_{race} + SS_{T \times R} + SS_{error}}{df_{race} + df_{T \times R} + df_{error}}}$
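For instance, a sketch with hypothetical sums of squares and dfs for a Therapy x Race design, pooling Race and T*R back into the error before standardizing a Therapy contrast:

```r
# Sketch: standardizer for Therapy contrasts in a Therapy x Race factorial
SS_race <- 40; SS_TxR <- 30; SS_err <- 200   # hypothetical values
df_race <- 2;  df_TxR <- 4;  df_err <- 54

s_hat <- sqrt((SS_race + SS_TxR + SS_err) / (df_race + df_TxR + df_err))
s_hat  # use as the denominator of d for the Therapy main effect
```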
Measures of Association
If using measures of association, the process is the same as in the one-way design
$\hat{\omega}^2 = \frac{SS_{effect} - (k - 1)MS_{error}}{SS_{total} + MS_{error}}$
Summary
As with the one-way design we have the approach of looking at standardized mean differences (d-family) or variance accounted for assessments of effect size (r-family)
There is no change as far as the r-family goes when dealing with factorial design
The goal for standardized contrasts is to come up with a measure that will reflect what is seen for main effects and be consistent across studies
EFFECT SIZE
Repeated Measures
Measures of association
Measures of association are conducted in the same way for repeated measures design
In general, partial η2 is SSeffect/(SSeffect + SSerror)
And this holds whether the samples are independent or not, and for a design of any complexity
Standardized contrasts
There are three approaches one could use with dependent samples
1. Treat them as you would contrasts between independent means.
2. Standardize the dependent mean change against the standard deviation of the contrast difference scores, sDΨ.
3. Standardize using the square root of MSerror.
The first method makes a standardized mean change from a correlated design more directly comparable with a standardized contrast from a design with unrelated samples, but the second may be more appropriate when the change itself is what we are concerned with.
The third is not recommended, as in that case the metric is not generally that of the original or change scores and so may be difficult to interpret.
Standardized contrasts
One thing we’d like to be able to do is compare situations that could have been either dependent or independent. Ex.: we could test work performance in the morning and at night via random assignment or repeated measures.
In that case we’d want a standardizer in the original metric, so the choice would be to use the d family measures we would use for simple pairwise contrasts between independent samples.
For actual interval repeated measures (i.e., time), it should be noted that we are typically more interested in testing for trends and the r family of measures.
Mixed design
r family measures of effect such as η2 will again be conducted in the same fashion
For standardized mean differences we’ll have some considerations given which differences we’re looking at
Mixed design
For the between-groups differences, we can calculate as normal for comparisons at each measure/interval and for simple effects, i.e., use √MSwithin. This assumes you’ve met your homogeneity of variance assumption.
This approach could be taken for all contrasts, but it would ignore the cross-condition correlations for repeated measures comparisons.
However, the benefit we’d get from being able to compare across different (non-repeated) designs suggests it is probably the best approach.
One could, if desired, look at differences in the metric of the difference scores and thus standardized mean changes.
For more info, consult the Kline text, Olejnik and Algina (2000) on our class webpage, Cortina and Nouri (2000)
ANCOVA
r family of effect sizes: same old story.
d family effect sizes: while one might use adjusted means, in an experimental design (i.e., no correlation between covariate and grouping variable) the difference should be pretty much the same as with the original means.
However, current thinking is that the standardizer should come from the original metric, so run just the ANOVA and use the square root of the MSerror from that analysis.
In other words, if thinking about standardized mean differences, it won’t matter whether you ran an ANOVA on the posttest scores or an ANCOVA, provided you meet your assumptions.
However, your (partial) r effect sizes will be larger with the ANCOVA, as the variance due to the covariate is taken out of the error term.