310 spring2012 chapter14 anova
Chapter Fourteen
Analysis of Variance
Analysis of Variance
• The analysis of variance is a procedure aimed at testing whether differences exist between two or more population means.
• It originated in the 1920s as a statistical technique to determine whether different treatments of fertilizer produced different crop yields.
• The terminology of “treatment” has stuck ever since. Thus, regardless of the specific application, the procedure is designed to determine whether there are significant differences between the means of different treatments.
Examples of “Treatments” in Economic Applications
• Determining whether there are significant differences between the means of different “treatments” has many potential applications in Economics. A few examples include:
• Determining whether the investment behavior of individuals varies across different age categories. Here, the “treatments” would be each of the age categories.
• Determining whether consumers’ perceptions of a particular good (e.g., personal computers) vary across brands (e.g., IBM, Dell, HP). Here, the “treatments” would be the different brands analyzed.
• Determining whether different market segments (e.g., across age groups) have different consumption habits. Here, the “treatments” would be the different market segments analyzed.
One‐Way Analysis of Variance
• In a one‐way Analysis of Variance, we simultaneously test whether differences exist between two or more population means. This differs from a “multiple comparisons” approach: in a one‐way test we evaluate whether all population means are equal or not, but we do not determine which ones differ and which ones are equal, so it is a “one‐shot” procedure.
• In this test we assume that we have independent samples from each population. The test then proceeds by analyzing the variance of each population to determine from there whether we can infer that the population means differ.
• We have j = 1, 2, …, k populations, each of which corresponds to a “treatment”.
• The mean and variance of the jth population are denoted (as usual) by µj and σj², respectively.
• In a one‐way analysis of variance we test:
H0: µ1 = µ2 = ⋯ = µk
against the alternative
H1: at least two means differ
• We have k independent samples X1, X2, …, Xk.
• The sample size we have from the jth population is denoted by nj.
• The sample means from each sample are:
x̄j = (1/nj)·Σi xij , for j = 1, 2, …, k
• We denote the “grand mean” of all observations as
x̿ = (1/n)·Σj Σi xij , where n = n1 + n2 + ⋯ + nk
• x̿ is the mean we would obtain if we pooled together all samples and treated them as if they came from a single population.
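The definitions above can be sketched directly in code. The following is a minimal illustration; the `samples` values are invented, not taken from the chapter:

```python
# Sample means and the grand mean for k independent samples.
# The data values are invented purely for illustration.
samples = [
    [4.0, 5.0, 6.0],        # treatment 1 (n1 = 3)
    [7.0, 8.0, 9.0, 10.0],  # treatment 2 (n2 = 4)
]

group_means = [sum(s) / len(s) for s in samples]      # x-bar_j for each j
n = sum(len(s) for s in samples)                      # n = n1 + ... + nk
grand_mean = sum(x for s in samples for x in s) / n   # pooled ("grand") mean

print(group_means)   # [5.0, 8.5]
print(grand_mean)    # 7.0
```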
• Suppose H0 is satisfied and all means are equal. Then µ1 = µ2 = ⋯ = µk = µ, where µ denotes the “grand population mean”.
• Intuitively, if H0 is true then:
a) The grand sample mean x̿ would be a consistent, valid estimator for µ.
b) Each x̄j (for j = 1, …, k) should be “close” to x̿.
• Thus, if H0 is true, we should expect that each of the following should be “small”:
(x̄j − x̿)² for each j = 1, …, k
• The formal measure of the proximity of each x̄j to x̿ is the between‐treatments variation, and the statistic we use for it is called the sum of squares for treatments (SST), given by:
SST = Σj nj·(x̄j − x̿)² = n1·(x̄1 − x̿)² + n2·(x̄2 − x̿)² + ⋯ + nk·(x̄k − x̿)²
• From our previous arguments, if H0 is true, then SST should be “small”.
• But what does “small” mean in this case? The answer is that SST should be small relative to the inherent variance of the variable X. That is, SST should be small relative to the variance in X that is not caused by the treatments.
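The SST formula takes only a few lines to compute. A sketch, using the same invented two‐treatment data as before:

```python
# Between-treatments variation: SST = sum_j n_j * (xbar_j - grand_mean)^2.
# Invented data, two treatments.
samples = [
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0, 10.0],
]
n = sum(len(s) for s in samples)
grand_mean = sum(x for s in samples for x in s) / n   # 7.0 here
sst = sum(len(s) * (sum(s) / len(s) - grand_mean) ** 2 for s in samples)
print(sst)  # 3*(5-7)^2 + 4*(8.5-7)^2 = 21.0
```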
• The formal measure of the variance in X that is not caused by the treatments is the within‐treatments variation, and the statistic we use for it is called the sum of squares for error (SSE), given by:
SSE = (n1 − 1)·s1² + (n2 − 1)·s2² + ⋯ + (nk − 1)·sk² = Σj (nj − 1)·sj²
where sj² is the sample variance for the jth sample.
• SSE is the combined, or pooled, variation of the k samples.
• SSE is analogous to sp², the pooled variance estimator we used when we tested H0: µ1 = µ2.
• Back then, we stated that sp² was a valid estimator only if σ1² = σ2².
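Pooling the k sample variances as in the SSE formula can be sketched as follows, again with invented data:

```python
# Within-treatments variation: SSE = sum_j (n_j - 1) * s_j^2,
# pooling the k sample variances. Invented data.
samples = [
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0, 10.0],
]

def sample_variance(s):
    # Unbiased sample variance, dividing by (n - 1).
    m = sum(s) / len(s)
    return sum((x - m) ** 2 for x in s) / (len(s) - 1)

sse = sum((len(s) - 1) * sample_variance(s) for s in samples)
print(sse)  # 2*1.0 + 3*(5/3) = 7.0
```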
• Analogously, SSE will be a valid measure of the within‐treatments variation only if all variances are the same. That is,
σ1² = σ2² = ⋯ = σk²
• This will be a maintained assumption in a one‐way Analysis of Variance.
• We now have the two main components of the test‐statistic: SST and SSE. How is the actual test‐statistic constructed?
• First, we normalize SST and SSE…
• We define the Mean Square for Treatments (MST) as
MST = SST/(k − 1)
• We define the Mean Square for Error (MSE) as
MSE = SSE/(n − k)
• The test‐statistic for H0: µ1 = µ2 = ⋯ = µk is:
F = MST/MSE
• Under H0, F is approximately distributed as an F random variable with k − 1 degrees of freedom in the numerator and n − k degrees of freedom in the denominator.
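Putting the pieces together, the F statistic can be assembled by hand and cross‐checked against SciPy's built‐in one‐way ANOVA routine. The data are invented:

```python
# One-way ANOVA: F = MST/MSE assembled from SST and SSE, then
# cross-checked against scipy.stats.f_oneway. Invented data.
from scipy import stats

samples = [
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0, 10.0],
]
k = len(samples)
n = sum(len(s) for s in samples)
grand_mean = sum(x for s in samples for x in s) / n

sst = sum(len(s) * (sum(s) / len(s) - grand_mean) ** 2 for s in samples)
sse = sum(sum((x - sum(s) / len(s)) ** 2 for x in s) for s in samples)

mst = sst / (k - 1)   # k - 1 numerator degrees of freedom
mse = sse / (n - k)   # n - k denominator degrees of freedom
f_stat = mst / mse

f_check, p_value = stats.f_oneway(*samples)   # library cross-check
print(f_stat)   # ~ 15.0, and f_check agrees
```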
• Rejection Region: Remember, intuitively, we should reject
H0: µ1 = µ2 = ⋯ = µk
in favor of the alternative
H1: at least two means differ
if MST is large relative to MSE. That is, we should reject H0 “if F is large”. Our rejection rule should therefore be to reject H0 if F > c, where ‘c’ is chosen to achieve the significance level α.
• Given the previous distribution result about F, our rejection rule should be to “reject H0 in favor of H1 if”:
F > Fα,k−1,n−k
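The critical value Fα,k−1,n−k can be read off the F inverse CDF. A sketch, plugging in α = 5% and the k = 4, n = 366 layout of the age‐group example discussed later in the chapter (2.78 is the F value reported there):

```python
# Rejection rule: reject H0 if F > F_{alpha, k-1, n-k}.
from scipy import stats

alpha, k, n = 0.05, 4, 366
f_crit = stats.f.ppf(1 - alpha, k - 1, n - k)   # inverse CDF ("percent-point")
reject = 2.78 > f_crit                           # example's F statistic
print(round(f_crit, 2), reject)
```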
• From here, the p‐value of the test is given by:
p‐value = Pr(H > F)
where H is a random variable with an F distribution with k − 1 degrees of freedom in the numerator and n − k degrees of freedom in the denominator, and ‘F’ is the value obtained for our test‐statistic.
Analysis of Variance (ANOVA) Table
• All the components of the ANOVA test are typically summarized in a table like the following.
• SS(Total) is the total variation in the data. SST represents the variation attributable to the treatments, and SSE is the variation left unexplained by the treatments.
• Under the maintained assumption that all variances are the same,
SS(Total)/(n − 1)
is an estimator of the sample variance in the data, s².
SOURCE OF VARIATION   DEGREES OF FREEDOM   SUMS OF SQUARES          MEAN SQUARES          F‐STATISTIC
Treatments            k−1                  SST                      MST = SST/(k−1)       F = MST/MSE
Error                 n−k                  SSE                      MSE = SSE/(n−k)
TOTAL                 n−1                  SS(Total) = SST + SSE
• Example: Investment behavior among different age groups. Suppose we want to investigate whether the average proportion of financial assets that is invested in the stock market differs across age categories. For the latter we use:
1) Young (less than 35)
2) Early middle age (35 to 49)
3) Late middle age (50 to 65)
4) Senior (older than 65)
• These are the four different “treatments” we look at.
• For i = 1, 2, 3, 4, let µi = average proportion of assets invested in the stock market for individuals in the ith age group.
• We want to test
H0: µ1 = µ2 = µ3 = µ4
against
H1: µi ≠ µj for some (i, j) in {1, 2, 3, 4}.
• We have k = 4 (four treatments). The basic ingredients we need to compute the ANOVA F‐test are the sample means
x̄1 = (1/n1)·Σi xi1
x̄2 = (1/n2)·Σi xi2
⋮
x̄4 = (1/n4)·Σi xi4
and the grand mean
x̿ = (1/n)·(n1·x̄1 + n2·x̄2 + ⋯ + n4·x̄4)
• And we need to compute the sample variances s1², s2², …, s4².
• From the data, we compute the four sample means x̄1, x̄2, x̄3, x̄4 and the four sample variances s1², s2², s3², s4².
• From here, we can compute
SST = Σj nj·(x̄j − x̿)²
SSE = (n1 − 1)·s1² + (n2 − 1)·s2² + (n3 − 1)·s3² + (n4 − 1)·s4²
• Now we can construct the ANOVA table.
• If we use a significance level of 5%, our critical value is Fα,k−1,n−k = F.05,3,362 ≈ 2.63.
• Since F = 2.78 > F.05,3,362, we reject H0. There is evidence to infer that at least two means differ. That is, the investment behavior of at least two age groups differs.
• The p‐value is given by Pr(H > 2.78), where H is F‐distributed with 3 and 362 degrees of freedom. We can compute this in Excel as “=1-F.DIST(2.78,3,362,1)”. This yields 0.04: very close to 5%, but slightly below. This is a “close call” rejection of H0.
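The Excel call above corresponds to the survival function of the F distribution, which SciPy exposes directly:

```python
# The Excel formula "=1-F.DIST(2.78,3,362,1)" is the F survival function.
from scipy import stats

p_value = stats.f.sf(2.78, 3, 362)   # Pr(H > 2.78), H ~ F(3, 362)
print(round(p_value, 3))             # about 0.04, matching the slide
```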
Multiple Comparisons
• In a one‐way ANOVA, we do not learn which pairs of population means differ; we only test, in one shot, whether they are all the same or not.
• In many real‐world cases, we may be interested in knowing which treatment means differ.
• To this end, as an alternative to the ANOVA method, we could do pairwise tests to determine whether µi = µj for each pair (i, j) in {1, …, k}.
• So, the problem is to test, for each pair of treatments (i, j) in {1, …, k}:
H0: µi = µj
against
H1: µi ≠ µj
• There are three different approaches to this problem:
a) Fisher’s Least Significant Difference (LSD) Method
b) Bonferroni’s Adjustment to the LSD Method
c) Tukey’s Multiple Comparison Method
Fisher’s Least Significant Difference (LSD) Method
• The task is to test, for each pair of treatments (i, j) in {1, …, k}:
H0: µi = µj against H1: µi ≠ µj.
• The test‐statistic proposed by Fisher to determine whether µi = µj is a mild variation of the one we used in Chapter 13 to test whether µ1 = µ2 when variances were equal. The test‐statistic is now:
t = (x̄i − x̄j) / √(MSE·(1/ni + 1/nj))
• MSE is (as defined above) the Mean Square for Error, given by:
MSE = SSE/(n − k), where SSE = (n1 − 1)·s1² + (n2 − 1)·s2² + ⋯ + (nk − 1)·sk²
• Under the maintained assumption that all variances are the same, MSE is a valid “pooled variance” estimator. Thus, this is a generalization of the statistic we used in Chapter 13 when variances are equal.
• Under H0, the approximate distribution of t is a Student‐t with n − k degrees of freedom.
• Therefore, we should reject H0 in favor of H1 if |t| > tα/2,n−k.
• Thus, if we define the least significant difference LSD as
LSD = tα/2,n−k · √(MSE·(1/ni + 1/nj))
• then we conclude that µi ≠ µj if |x̄i − x̄j| > LSD.
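A sketch of the LSD computation for a single pair. Only n = 366 and k = 4 match the running example; the MSE value and the pair sample sizes below are invented placeholders:

```python
# Fisher's LSD for one pair (i, j):
# LSD = t_{alpha/2, n-k} * sqrt(MSE * (1/n_i + 1/n_j)).
import math
from scipy import stats

n, k, alpha = 366, 4, 0.05
mse = 700.0            # invented MSE value, for illustration only
ni, nj = 100, 90       # invented pair sample sizes

t_crit = stats.t.ppf(1 - alpha / 2, n - k)          # ~ 1.97 with 362 df
lsd = t_crit * math.sqrt(mse * (1 / ni + 1 / nj))
print(round(t_crit, 2), round(lsd, 1))
```

A pair of means is then declared different whenever |x̄i − x̄j| exceeds `lsd`.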
• Example (continued): We go back to the investment‐behavior example across age groups. The pairwise comparisons we can make are SIX in total:
• µ1 and µ2 • µ1 and µ3 • µ1 and µ4 • µ2 and µ3 • µ2 and µ4 • µ3 and µ4
• We still use a target significance level of 5%.
• For each pair, the LSD critical values are summarized below.
• In each instance, we have
tα/2,n−k = t.025,362 ≈ 1.96
Pair        LSD   |x̄i − x̄j|   Reject H0: µi = µj for H1: µi ≠ µj?
µ1 and µ2   5.8   8.1          YES
µ1 and µ3   6.2   6.7          YES
µ1 and µ4   7.1   7.4          YES
µ2 and µ3   5.6   1.3          NO
µ2 and µ4   6.5   0.6          NO
µ3 and µ4   6.9   0.7          NO
Bonferroni’s Correction to LSD
• Fisher’s LSD Method has a drawback.
• To see why, note that even if we do pairwise comparisons, our overall hypotheses are:
H0: µ1 = µ2 = ⋯ = µk vs. H1: µi ≠ µj for some (i, j)
• Now, recall from our study of probability rules and properties that
Pr(A1 ∪ A2 ∪ ⋯ ∪ AC) ≤ Pr(A1) + Pr(A2) + ⋯ + Pr(AC)
so the probability of at least one Type I error across many comparisons can far exceed the error probability of any single comparison.
• Fisher’s approach is analogous to estimating (erroneously) this overall probability of a Type I error as the single‐comparison level α.
• As a result, Fisher’s approach typically overshoots the prespecified significance level α. This means that the probability of a Type I error in Fisher’s approach is typically larger than the significance level we aim to achieve.
• Bonferroni’s correction to LSD is simply a correction to the critical values used in Fisher’s test.
• Let
C = k·(k − 1)/2
• C denotes the number of all different pairwise comparisons (i, j) in {1, …, k}.
• If we want the probability of a Type I error to be no more than α, we should use, in our search for critical values:
αB = α/C
• Once we make the correction αB = α/C, we proceed as in the LSD method:
• we define the least significant difference LSD as
LSD = tαB/2,n−k · √(MSE·(1/ni + 1/nj))
• Then we conclude that µi ≠ µj if |x̄i − x̄j| > LSD.
• This “Bonferroni” correction guarantees that the probability of committing a Type I error is at most α.
• Example (continued): We go back to the investment‐behavior example across age groups.
• We still want no more than a 5% probability of Type I error.
• We have C = (4·3)/2 = 6 (total number of pairwise comparisons).
• We then use αB = .05/6 ≈ .008.
• Using Excel, we find tαB/2,n−k = t.004,362 = 2.65.
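The same corrected critical value can be sketched in Python, using the example's k = 4 and n = 366:

```python
# Bonferroni correction: with C = k(k-1)/2 pairwise comparisons,
# use alpha/C in place of alpha when finding the t critical value.
from scipy import stats

k, n, alpha = 4, 366, 0.05
C = k * (k - 1) // 2                       # 6 pairwise comparisons
alpha_b = alpha / C                        # ~ 0.0083
t_crit = stats.t.ppf(1 - alpha_b / 2, n - k)
print(C, round(t_crit, 2))                 # critical value ~ 2.65
```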
• We now have:

Pair        Bonferroni‐corrected LSD   |x̄i − x̄j|   Reject H0: µi = µj for H1: µi ≠ µj?
µ1 and µ2   7.8                        8.1          YES
µ1 and µ3   8.4                        6.7          NO
µ1 and µ4   9.5                        7.4          NO
µ2 and µ3   7.6                        1.3          NO
µ2 and µ4   8.8                        0.6          NO
µ3 and µ4   9.4                        0.7          NO

• Without the correction, there is a tendency to “over‐reject” H0. With it, there is evidence of a difference only between µ1 and µ2.
Tukey’s Multiple Comparison Method
• Like Fisher’s approach (and its Bonferroni correction), Tukey’s approach is a method to determine a critical value for deciding whether µi ≠ µj for each pair of treatments (i, j).
• The key statistic in Tukey’s method is the so‐called Studentized range ‘q’, given by:
q = (x̄max − x̄min) / √(MSE/ng)
where x̄max and x̄min denote the largest and smallest sample means, respectively.
• The Studentized range q has a very particular distribution, indexed by k and ν (degrees of freedom). Table 7 includes critical values for this distribution.
• The critical value used by Tukey’s method is given by:
ω = qα(k, ν) · √(MSE/ng), where ng = k / (1/n1 + 1/n2 + ⋯ + 1/nk)
• ng is the “harmonic mean” of the sample sizes. If all sample sizes are equal to a common value, we simply have ng equal to that common sample size.
• Once ω (“omega”) is obtained, we conclude that µi ≠ µj if |x̄i − x̄j| > ω.
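SciPy (version 1.7 and later) exposes the Studentized range distribution directly, so ω can be sketched without a printed table. The k and ν values match the running example; the MSE and ng values below are invented placeholders:

```python
# Tukey's omega = q_alpha(k, nu) * sqrt(MSE / n_g).
import math
from scipy import stats

k, nu = 4, 362
mse, ng = 480.0, 90.0     # invented MSE and harmonic-mean sample size

q_crit = stats.studentized_range.ppf(0.95, k, nu)   # q_{.05}(4, 362)
omega = q_crit * math.sqrt(mse / ng)
print(round(q_crit, 2), round(omega, 2))
```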
• Example (continued): In our example we have k = 4 and ν = n − k = 362.
• Using Table 7, we find the critical value q.05(k, ν) = q.05(4, 362).
• Therefore, the Tukey critical value is
ω = q.05(4, 362) · √(MSE/ng) = 8.36
• We now fail to reject the null hypothesis that all population means are the same. It was a “close call”, since the largest difference in sample means is 8.1, which is relatively close to the critical value 8.36.
• Recall, however, that our ANOVA F‐test was also a “close call” rejection of H0.
• Overall, the evidence against H0 is mixed.
Pair        Tukey Critical Value (ω)   |x̄i − x̄j|   Reject H0: µi = µj?
µ1 and µ2   8.36                       8.1          NO
µ1 and µ3   8.36                       6.7          NO
µ1 and µ4   8.36                       7.4          NO
µ2 and µ3   8.36                       1.3          NO
µ2 and µ4   8.36                       0.6          NO
µ3 and µ4   8.36                       0.7          NO
• Which approach should we use in practice?
• LSD should not be used without the Bonferroni correction.
• With this in mind, none of the procedures we studied universally dominates the others.
• As a “rule of thumb” suggested in the book:
a) If you want to test whether µi = µj for only a subset of pairs (i, j) of treatments, use Bonferroni’s approach. This makes sense because, if you are only interested in a subset of treatments, then C will be smaller and Bonferroni’s approach will be “more precise” than if we do all pairwise comparisons.
b) If you want to do all pairwise comparisons, use Tukey’s approach.
• Tukey’s method is an alternative to the ANOVA F‐test (by their design, LSD and Bonferroni’s LSD are not truly alternatives to the ANOVA F‐test). However, it is not clear whether Tukey’s method is better or worse than the ANOVA F‐test. In practice, you can implement both and verify whether they reach the same type of conclusion.
Randomized Block (Two‐Way) Analysis of Variance
• The goal of Analysis of Variance is to compare the between‐treatments variation (summarized by SST) to the within‐treatments variation (summarized by SSE).
• If we could refine how the data are partitioned, and consequently reduce the within‐treatments variation, we would have a more powerful way to test the relevance of the between‐treatments variation.
• Partitioning the data into blocks within each treatment is a way to achieve this refinement.
• Suppose we divide each of our k samples into b blocks each.
• Each block would be constructed by grouping observations according to a criterion that significantly affects the response variable ‘X’.
• If we follow this criterion in constructing our blocks, then the remaining within‐treatments variation (SSE) would become smaller.
• Constructing blocks like this is not always possible in practice; it depends on how rich our data is.
• The number of blocks is ‘b’. Denote:
x̄[T]j = mean of the observations in the jth treatment (j = 1, 2, …, k)
x̄[B]i = mean of the observations in the ith block (i = 1, 2, …, b)
• The total variation in X is now partitioned into three sources:
SS(Total) = SST + SSB + SSE
• where
SST = sum of squares for treatments
SSB = sum of squares for blocks
SSE = sum of squares for errors
• The formulas for SST, SSB and SSE are the following:
SST = b·Σj (x̄[T]j − x̿)²
SSB = k·Σi (x̄[B]i − x̿)²
SSE = Σi Σj (xij − x̄[T]j − x̄[B]i + x̿)²
• The corresponding Mean Square measures are:
MST = SST/(k − 1)
MSB = SSB/(b − 1)
MSE = SSE/(n − k − b + 1)
• The test‐statistic for the null hypothesis that all means are the same across treatments is, as before:
F = MST/MSE
• Under the null hypothesis that all means are the same across treatments, F has approximately an F‐distribution with k − 1 numerator degrees of freedom and n − k − b + 1 denominator degrees of freedom.
• We can also test whether means are the same across blocks. The test‐statistic for this would be:
F = MSB/MSE
• Under the null hypothesis that all means are the same across blocks, F has approximately an F‐distribution with b − 1 numerator degrees of freedom and n − k − b + 1 denominator degrees of freedom.
• Note: If we cannot reject that all means are the same across blocks, it would essentially imply that our blocks are not truly relevant, and we should go back to the original one‐way ANOVA approach.
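The randomized‐block sums of squares above can be sketched on a tiny invented layout (k = 3 treatments, b = 2 blocks, one observation per cell):

```python
# Randomized-block sums of squares; x[i][j] = observation in block i,
# treatment j. All data values are invented for illustration.
x = [
    [10.0, 12.0, 14.0],   # block 1
    [11.0, 15.0, 16.0],   # block 2
]
b, k = len(x), len(x[0])
n = b * k
grand = sum(sum(row) for row in x) / n

treat_means = [sum(x[i][j] for i in range(b)) / b for j in range(k)]
block_means = [sum(row) / k for row in x]

sst = b * sum((m - grand) ** 2 for m in treat_means)
ssb = k * sum((m - grand) ** 2 for m in block_means)
sse = sum((x[i][j] - treat_means[j] - block_means[i] + grand) ** 2
          for i in range(b) for j in range(k))

f_treat = (sst / (k - 1)) / (sse / (n - k - b + 1))
f_block = (ssb / (b - 1)) / (sse / (n - k - b + 1))
print(sst, ssb, sse)      # 21.0 6.0 1.0
print(f_treat, f_block)   # 21.0 12.0
```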
• The ANOVA table with blocks looks like this:

Source of Variation   Degrees of Freedom   Sums of Squares   Mean Squares             F‐Statistic
Treatments            k−1                  SST               MST = SST/(k−1)          F = MST/MSE
Blocks                b−1                  SSB               MSB = SSB/(b−1)          F = MSB/MSE
Error                 n−k−b+1              SSE               MSE = SSE/(n−k−b+1)
Total                 n−1                  SS(Total)
Two‐Factor Analysis of Variance
• So far, we have focused on “treatments” that involve only one factor. This can be extended to treatments that have multiple factors.
• For example, instead of simply trying to identify the effect of “fertilizer” (e.g., indexed by brands ‘A’, ‘B’ and ‘C’), we might also want to investigate whether this effect varies with “temperature” (e.g., ‘low’ and ‘high’).
• Or, we might want to measure the effect of “education” and “gender” on an individual’s employment situation.
• Experiments where treatments have two or more factors are called factorial experiments.
• The textbook focuses on two‐factor analysis (but this can be extended to three or more factors).
• We denote these factors as “A” and “B”. The goal of two‐factor analysis of variance is to investigate whether:
1) All the means for factor ‘A’ are equal, keeping factor ‘B’ constant.
2) All the means for factor ‘B’ are equal, keeping factor ‘A’ constant.
3) All the means for all possible combinations of ‘A’ and ‘B’ (interacted with each other) are the same.
• Example: Suppose we want to investigate the effect of (A) Education and (B) Gender on the number of jobs that an individual holds in their lifetime.
• The table to the right shows a two‐way classification table used to investigate this question.
• Let:
a = number of levels for factor ‘A’
b = number of levels for factor ‘B’
r = number of “replicates”
• Replicates are the number of observations for each possible combination of A and B. The book focuses on balanced designs, where ‘r’ is the same for all possible combinations of A and B.
• x̄[A]i = mean of the observations when the factor level of ‘A’ is equal to ‘i’.
• x̄[B]j = mean of the observations when the factor level of ‘B’ is equal to ‘j’.
• x̄[AB]ij = mean of the observations when the factor level of ‘A’ is equal to ‘i’ AND the factor level of ‘B’ is equal to ‘j’.
• This table illustrates the notation described above.
• Accordingly, we now have Sums of Squares for Treatments for A, for B and for AB (the interaction of A and B). We denote them SS(A), SS(B) and SS(AB).
• The formulas are given by:
SS(A) = r·b·Σi (x̄[A]i − x̿)²
SS(B) = r·a·Σj (x̄[B]j − x̿)²
SS(AB) = r·Σi Σj (x̄[AB]ij − x̄[A]i − x̄[B]j + x̿)²
• The Sum of Squares for Errors (SSE) is now
SSE = Σi Σj Σℓ (xijℓ − x̄[AB]ij)²
• The corresponding Mean Square measures are:
MS(A) = SS(A)/(a − 1)
MS(B) = SS(B)/(b − 1)
MS(AB) = SS(AB)/[(a − 1)·(b − 1)]
MSE = SSE/(n − a·b), where n = a·b·r
• The F‐tests in this two‐factor ANOVA are:
• Test for differences in population means between the levels of factor A, keeping factor B fixed:
F(A) = MS(A)/MSE
Under the null hypothesis that all the means of factor A are the same, F(A) has an F‐distribution with a − 1 and n − a·b degrees of freedom.
• Test for differences in population means between the levels of factor B, keeping factor A fixed:
F(B) = MS(B)/MSE
Under the null hypothesis that all the means of factor B are the same, F(B) has an F‐distribution with b − 1 and n − a·b degrees of freedom.
• Test for differences in population means between the interactions of factors A and B:
F(AB) = MS(AB)/MSE
Under the null hypothesis that all the means are the same, F(AB) has an F‐distribution with (a − 1)·(b − 1) and n − a·b degrees of freedom.
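The two‐factor sums of squares and F statistics can be sketched on an invented balanced design (a = 2 levels of A, b = 2 levels of B, r = 2 replicates per cell):

```python
# Two-factor sums of squares from the formulas above. All data invented.
x = {  # x[(i, j)] = the r observations at A-level i, B-level j
    (0, 0): [1.0, 3.0],
    (0, 1): [4.0, 6.0],
    (1, 0): [5.0, 7.0],
    (1, 1): [2.0, 4.0],
}
a, b, r = 2, 2, 2
n = a * b * r
grand = sum(sum(v) for v in x.values()) / n

cell = {ij: sum(v) / r for ij, v in x.items()}                       # xbar[AB]_ij
mean_a = [sum(cell[(i, j)] for j in range(b)) / b for i in range(a)]  # xbar[A]_i
mean_b = [sum(cell[(i, j)] for i in range(a)) / a for j in range(b)]  # xbar[B]_j

ss_a = r * b * sum((m - grand) ** 2 for m in mean_a)
ss_b = r * a * sum((m - grand) ** 2 for m in mean_b)
ss_ab = r * sum((cell[(i, j)] - mean_a[i] - mean_b[j] + grand) ** 2
                for i in range(a) for j in range(b))
sse = sum((obs - cell[ij]) ** 2 for ij, v in x.items() for obs in v)

mse = sse / (n - a * b)
f_a = (ss_a / (a - 1)) / mse
f_ab = (ss_ab / ((a - 1) * (b - 1))) / mse
print(ss_a, ss_b, ss_ab, sse)   # 2.0 0.0 18.0 8.0
print(f_a, f_ab)                # 1.0 9.0
```

In this invented data, the main effect of B is zero while the interaction is large, which is exactly the pattern the F(AB) test is designed to detect.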
ANOVA in Excel
• ANOVA tests can be computationally intensive, especially if we have blocks or multi‐factor models.
• Excel can perform ANOVA tests for single‐factor and two‐factor models.
• To do this, go to: Data → Data Analysis → and choose one of:
• Anova: Single Factor
• Anova: Two‐Factor With Replication
• Anova: Two‐Factor Without Replication
• “Anova: Two‐Factor Without Replication” corresponds to the “randomized blocks” ANOVA.
• Example (Exercise 14.63): The manager of a radio station wants to find out whether the amount of time that listeners spend listening to this station varies across the different days of the week. If the average time is the same across all days, the manager will schedule the most popular music evenly throughout the week. Otherwise, the top hits will be played mostly on the days that attract the biggest audience.
• So, here the “treatment” is “day of the week”, and we have 7 treatment levels.
• Using Excel (and a 5% significance level), we obtain:
critical value = 2.02, p‐value = 3.78E‐37
• We can infer, at a 5% significance level, that differences exist across days of the week in the average time spent listening to the station.
• Example (Exercise 14.64): To find out if medical specialists differ in the amount of time devoted to patient care, a survey was conducted where the number of hours of patient care per week were recorded for five specialists: “General Medicine”, “Internal Medicine”, “Surgery”, “Pediatrics” and “Obs/Gyn”. These are the five treatment levels.
• The data are divided into blocks according to age (ranging from 30 to 65 years of age).
• Using Excel (5% significance level), we obtain:
F‐statistic (blocks) = 6.36, critical value = 1.51, p‐value = 9.4E‐16
and
F‐statistic (treatments) = 10.72, critical value = 2.43, p‐value = 1.32E‐7
• Thus, we can infer that the blocks we used are relevant, and that there exist differences in the time devoted to patient care across these five specialties.
• Example: Comparing lifetime number of jobs by educational level and gender. To study whether there exist differences between genders and educational levels, a data set was compiled. This is a case of a two‐factor ANOVA.
• Denote the two treatment factors as:
A = “Educational Level”
B = “Gender”
• The “treatment levels” for factor A are:
1) Less than High School
2) High School
3) Less than Bachelor’s Degree
4) At least one Bachelor’s Degree
• Naturally, Gender is simply “male” and “female”.
• Using Excel (data set file “Xm‐14‐04a.xlsx”), the F‐tests give (significance level 5%):
• For treatment “A” (education level): critical value = 2.73, p‐value = 0.006.
There is sufficient evidence to infer, at a 5% significance level, that differences in the number of jobs exist between educational levels.
• For treatment “B” (gender): critical value = 3.97, p‐value = 0.294.
There is NOT sufficient evidence to infer, at a 5% significance level, that differences in the number of jobs exist between genders.
• For the interaction “AB”: critical value = 2.73, p‐value = 0.89.
There is no evidence to conclude that “gender” and “education” interact as determinants of the number of jobs.
• Example (Exercise 14.75): To determine the effectiveness of five detergent brands, 150 white sheets were equally soiled. Then, 30 sheets were washed in each brand: 10 with cold water, 10 with warm water and 10 with hot water. After washing, “whiteness” scores were measured for each sheet.
• The goal is to measure the average effect of the two‐factor treatments on whiteness. From the statement of the problem, the two factors are:
A = Detergent Brand, with five levels (the five brands)
B = Water Temperature, with three levels (cold, warm, hot)
• Using Excel, we obtain (with a 5% significance level):
• For treatment “A” (detergent brand): critical value = 2.43, p‐value = 1.4E‐7.
There is sufficient evidence to infer, at a 5% significance level, that differences in “whiteness” exist between the detergent brands.
• For treatment “B” (water temperature): critical value = 3.06, p‐value = 5.8E‐8.
There is sufficient evidence to infer, at a 5% significance level, that differences in “whiteness” arise from different water temperatures.
• For the interaction “AB”: critical value = 2.01, p‐value = 0.007.
There is sufficient evidence to infer, at a 5% significance level, that detergent brands interact with water temperature in ways that result in differences in “whiteness”.