Post on 20-Dec-2015
Regression and Correlation
GTECH 201 Lecture 18
ANOVA
Analysis of Variance
Continuation from matched-pair difference of means tests, but now for 3+ cases
We still check whether samples come from one or more distinct populations
Variance is a descriptive parameter
ANOVA compares group means and checks whether they differ sufficiently to reject H0
ANOVA H0 and HA
ANOVA Test Statistic
$$F = \frac{MS_B}{MS_W}$$
$MS_B$ = between-group mean squares
$MS_W$ = within-group mean squares
Between-group variability is calculated in three steps:
1. Calculate overall mean as weighted average of sample means
2. Calculate between-group sum of squares
3. Calculate between-group mean squares ($MS_B$)
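Between-group Variability
1. Total or overall mean
$$\bar{X}_T = \frac{\sum_{i=1}^{k} n_i \bar{X}_i}{N}$$
2. Between-group sum of squares
$$SS_B = \sum_{i=1}^{k} n_i (\bar{X}_i - \bar{X}_T)^2 = \sum_{i=1}^{k} n_i \bar{X}_i^2 - N \bar{X}_T^2$$
3. Between-group mean squares
$$MS_B = \frac{SS_B}{df_B} = \frac{SS_B}{k - 1}$$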
Within-group Variability
1. Within-group sum of squares
$$SS_W = \sum_{i=1}^{k} (n_i - 1) s_i^2$$
2. Within-group mean squares
$$MS_W = \frac{SS_W}{df_W} = \frac{SS_W}{N - k}$$
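The between-group and within-group steps can be sketched in Python as follows. This is a minimal illustration rather than part of the lecture: the function name `anova_from_summary`, its argument names, and the three-group input values are all assumptions made for the example.

```python
def anova_from_summary(ns, means, sds):
    """Return (SS_B, MS_B, SS_W, MS_W, F) from per-group n, mean and s."""
    k = len(ns)   # number of groups
    N = sum(ns)   # total number of observations

    # Overall mean as the weighted average of the sample means
    grand_mean = sum(n * m for n, m in zip(ns, means)) / N

    # Between-group sum of squares and mean squares (df_B = k - 1)
    ss_b = sum(n * (m - grand_mean) ** 2 for n, m in zip(ns, means))
    ms_b = ss_b / (k - 1)

    # Within-group sum of squares and mean squares (df_W = N - k)
    ss_w = sum((n - 1) * s ** 2 for n, s in zip(ns, sds))
    ms_w = ss_w / (N - k)

    return ss_b, ms_b, ss_w, ms_w, ms_b / ms_w


# Hypothetical summary statistics for three groups (illustrative values only)
ss_b, ms_b, ss_w, ms_w, f_stat = anova_from_summary(
    ns=[5, 5, 5], means=[10.0, 12.0, 17.0], sds=[2.0, 2.5, 3.0]
)
print(f"SS_B={ss_b:.3f} MS_B={ms_b:.3f} SS_W={ss_w:.3f} MS_W={ms_w:.3f} F={f_stat:.3f}")
```

The sketch uses the definitional form of SS_B (squared deviations of group means from the overall mean), which is less sensitive to rounding in the input means than the computational form.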
Kruskal-Wallis Test
Nonparametric equivalent of ANOVA
Extension of Wilcoxon rank sum W test to 3+ cases
Average rank is Ri / ni
Then the Kruskal-Wallis H test statistic is
$$H = \frac{12}{N(N+1)} \sum_{i=1}^{k} \frac{R_i^2}{n_i} - 3(N+1)$$
with N = n1 + n2 + … + nk = total number of observations, and
Ri = sum of ranks in sample i
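A minimal Python sketch of the H statistic is given below. The function name `kruskal_wallis_h` and the sample values are hypothetical. Tied values are assumed to receive the average of the ranks they span, and no further tie correction is applied, since the formula above does not include one.

```python
def kruskal_wallis_h(samples):
    """Return (H, rank_sums) for a list of samples, using pooled ranks."""
    # Pool all observations, remembering which sample each one came from
    pooled = sorted((value, gi) for gi, s in enumerate(samples) for value in s)
    N = len(pooled)
    rank_sums = [0.0] * len(samples)

    i = 0
    while i < N:
        # Find the run of tied values starting at position i
        j = i
        while j < N and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        for _, gi in pooled[i:j]:
            rank_sums[gi] += avg_rank
        i = j

    h = 12 / (N * (N + 1)) * sum(
        r ** 2 / len(s) for r, s in zip(rank_sums, samples)
    ) - 3 * (N + 1)
    return h, rank_sums


# Hypothetical, fully separated samples (illustrative values only)
h, rank_sums = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(rank_sums, round(h, 3))
```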
ANOVA Example
House prices by neighborhood, in thousands of dollars
  A    B    C    D
175  151  127  174
147  183  142  182
138  174  124  210
156  181  150  191
184  193  180
148  205
     196
ANOVA Example, continued
Sample statistics
         n    mean      s
A        6   158.00   17.83
B        7   183.29   17.61
C        5   144.60   22.49
D        4   189.25   15.48
Total   22   168.68   24.85
Now fill in the six steps of the ANOVA calculation
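The Six Steps
1. Overall mean
$$\bar{X}_T = \frac{\sum_{i=1}^{k} n_i \bar{X}_i}{N} = \frac{6(158.00) + 7(183.29) + 5(144.60) + 4(189.25)}{22} = 168.68$$
2. Between-group sum of squares
$$SS_B = \sum_{i=1}^{k} n_i \bar{X}_i^2 - N \bar{X}_T^2 = 6(158.00)^2 + 7(183.29)^2 + 5(144.60)^2 + 4(189.25)^2 - 22(168.68)^2 = 6769.394$$
3. Between-group mean squares
$$MS_B = \frac{SS_B}{df_B} = \frac{SS_B}{k - 1} = \frac{6769.394}{3} = 2256.465$$
4. Within-group sum of squares
$$SS_W = \sum_{i=1}^{k} (n_i - 1) s_i^2 = 5(17.83)^2 + 6(17.61)^2 + 4(22.49)^2 + 3(15.48)^2 = 6193.379$$
5. Within-group mean squares
$$MS_W = \frac{SS_W}{df_W} = \frac{SS_W}{N - k} = \frac{6193.379}{22 - 4} = 344.077$$
6. F statistic
$$F = \frac{MS_B}{MS_W} = \frac{2256.465}{344.077} = 6.558, \quad p = .003$$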
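The six steps can be checked against the raw house prices. The sketch below uses only the Python standard library. The variable names are illustrative, and the data are the four neighborhood samples from the example.

```python
from statistics import mean, variance

# House prices by neighborhood, in thousands of dollars (from the example)
groups = {
    "A": [175, 147, 138, 156, 184, 148],
    "B": [151, 183, 174, 181, 193, 205, 196],
    "C": [127, 142, 124, 150, 180],
    "D": [174, 182, 210, 191],
}

k = len(groups)
N = sum(len(g) for g in groups.values())

# Step 1: overall mean
grand_mean = sum(sum(g) for g in groups.values()) / N

# Steps 2-3: between-group sum of squares and mean squares
ss_b = sum(len(g) * mean(g) ** 2 for g in groups.values()) - N * grand_mean ** 2
ms_b = ss_b / (k - 1)

# Steps 4-5: within-group sum of squares and mean squares
# (statistics.variance is the sample variance, with divisor n - 1)
ss_w = sum((len(g) - 1) * variance(g) for g in groups.values())
ms_w = ss_w / (N - k)

# Step 6: F statistic
f_stat = ms_b / ms_w

print(f"mean={grand_mean:.2f} SS_B={ss_b:.3f} MS_B={ms_b:.3f}")
print(f"SS_W={ss_w:.3f} MS_W={ms_w:.3f} F={f_stat:.3f}")
```

If SciPy is installed, `scipy.stats.f_oneway(*groups.values())` should report the same F statistic together with its p-value.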
Correlation
Co-relatedness between 2+ variables
As the values of one variable go up, those of the other change proportionally
Two step approach:
1. Graphically - scatterplot
2. Numerically - correlation coefficients
Is There a Correlation?
Scatterplots Exploratory analysis
Pearson’s Correlation Index
Based on concept of covariance
$$CV_{XY} = \sum (X - \bar{X})(Y - \bar{Y})$$
$CV_{XY}$ = covariation between X and Y
$(X - \bar{X})$ = deviation of X from its mean
$(Y - \bar{Y})$ = deviation of Y from its mean
Pearson’s correlation coefficient
$$r = \frac{\sum (X - \bar{X})(Y - \bar{Y}) / N}{S_X S_Y}$$
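Pearson's r can be computed directly from the deviations, as in the sketch below. The function name `pearson_r` and the data are hypothetical. The sketch assumes that S_X and S_Y are standard deviations computed with divisor N, so that they match the N in the covariance term.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation: mean covariation divided by S_X * S_Y."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Covariation: sum of products of deviations from the means
    cv_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    # Standard deviations with divisor N (assumption, see lead-in)
    s_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    s_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / n)

    return (cv_xy / n) / (s_x * s_y)


# Hypothetical paired observations (illustrative values only)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = pearson_r(xs, ys)
print(round(r, 4))
```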
Sample and Population
r is the sample correlation coefficient
Applying the t distribution, we can infer the correlation for the whole population
Test statistic for Pearson’s r
$$t = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}}$$
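A short sketch of the test statistic follows, assuming the usual form t = r·sqrt(n − 2) / sqrt(1 − r²), which is compared against a t distribution with n − 2 degrees of freedom. The function name `t_for_r` and the input values are hypothetical.

```python
from math import sqrt


def t_for_r(r, n):
    """t statistic for a sample correlation r based on n paired observations."""
    if n <= 2:
        raise ValueError("need more than 2 observations")
    if abs(r) >= 1:
        raise ValueError("t is undefined for |r| = 1")
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)


# Hypothetical values: r = 0.6 observed on n = 18 pairs
t = t_for_r(0.6, 18)
df = 18 - 2
print(round(t, 3), df)
```

The resulting t is then looked up in a t table (or passed to a statistics package) with `df` degrees of freedom to decide whether the population correlation differs from zero.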
Correlation Example Lake effect snow
Spearman’s Rank Correlation
Non-parametric alternative to Pearson
Logic similar to Kruskal and Wilcoxon
Spearman’s rank correlation coefficient
$$r_s = 1 - \frac{6 \sum d^2}{N^3 - N}$$
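The rank-difference formula can be sketched in Python as below. The function name `spearman_rs` and the data are hypothetical. The sketch assumes there are no tied values, since the 6·Σd² shortcut is exact only in that case.

```python
def spearman_rs(xs, ys):
    """Spearman's r_s from rank differences d (no ties allowed)."""
    n = len(xs)
    if len(ys) != n:
        raise ValueError("xs and ys must have the same length")
    if len(set(xs)) != n or len(set(ys)) != n:
        raise ValueError("this sketch does not handle tied values")

    def ranks(values):
        # Rank 1 goes to the smallest value
        order = sorted(values)
        return [order.index(v) + 1 for v in values]

    # Sum of squared differences between paired ranks
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n ** 3 - n)


# Hypothetical paired observations (illustrative values only)
rs = spearman_rs([10, 20, 30, 40, 50], [1, 3, 2, 5, 4])
print(rs)
```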
Regression
In correlation we observe degrees of association but no causal or functional relationship
In regression analysis, we distinguish an independent from a dependent variable
Many forms of functional relationships
bivariate, multivariate
linear, non-linear (curvi-linear)
Graphical Representation
In correlation analysis either variable could be depicted on either axis
In regression analysis, the independent variable is always on the X axis
Bivariate relationship is described by a best-fitting line through the scatterplot
Least-Squares Regression
Objective: minimize $\sum d_i^2$
$$Y = a + bX$$
Regression Equation
Y = a + bX
$$b = \frac{n \sum XY - \sum X \sum Y}{n \sum X^2 - (\sum X)^2}$$
$$a = \frac{\sum Y - b \sum X}{n}$$
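The slope and intercept formulas translate directly into code. The sketch below is illustrative only: the function name `least_squares_line` and the data are hypothetical.

```python
def least_squares_line(xs, ys):
    """Return (a, b) of the least-squares line Y = a + bX."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    # Slope, then intercept, from the sums
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b


# Hypothetical paired observations (illustrative values only)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = least_squares_line(xs, ys)
print(f"Y = {a:.2f} + {b:.2f} X")
```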
Strength of Relationship
How much is explained by the regression equation?
Coefficient of Determination
Total variation of Y (all the bucket water)
$$\sum y^2 = \sum (Y - \bar{Y})^2$$
Large ‘Y’ = dependent variable
Small ‘y’ = deviation of each value of Y from its mean
e = explained; u = unexplained
$$\sum y^2 = \sum y_e^2 + \sum y_u^2$$
Explained Variation
Ratio of square of covariation between X and Y to the variation in X
$$\sum y_e^2 = \frac{(\sum xy)^2}{\sum x^2}$$
where $\sum xy$ = covariation between X and Y
$\sum x^2$ = total variation of X
Coefficient of determination
$$r^2 = \frac{\sum y_e^2}{\sum y^2}$$
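The split of the total variation into explained and unexplained parts can be sketched as follows. The function name `variation_split` and the data are hypothetical. Lower-case x and y are taken to be deviations from the respective means, as defined above.

```python
def variation_split(xs, ys):
    """Return (total, explained, unexplained, r_squared) for Y regressed on X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Deviations from the means (lower-case x and y)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]

    sum_xy = sum(p * q for p, q in zip(dx, dy))  # covariation of X and Y
    sum_x2 = sum(p * p for p in dx)              # total variation of X
    total = sum(q * q for q in dy)               # total variation of Y

    explained = sum_xy ** 2 / sum_x2
    unexplained = total - explained
    return total, explained, unexplained, explained / total


# Hypothetical paired observations (illustrative values only)
total, explained, unexplained, r2 = variation_split(
    [1, 2, 3, 4, 5], [2, 4, 5, 4, 5]
)
print(total, explained, unexplained, round(r2, 3))
```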
Error Analysis
$r^2$ tells us what percentage of the variation is accounted for by the independent variable
This then allows us to infer the standard error of our estimate
which tells us, on average, how far off our prediction would be in measurement units
$$SE = \sqrt{\frac{\sum y_u^2}{n - 2}}$$
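A minimal sketch of the standard error of the estimate is given below. It assumes the standard error is the square root of the unexplained (residual) variation divided by n − 2. The function name `standard_error_of_estimate` and the data are hypothetical.

```python
from math import sqrt


def standard_error_of_estimate(xs, ys):
    """Standard error of the estimate for the least-squares line Y = a + bX."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    # Least-squares slope and intercept
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n

    # Unexplained variation: squared distances of Y from the fitted line
    unexplained = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

    # Assumption: n - 2 degrees of freedom (two estimated coefficients)
    return sqrt(unexplained / (n - 2))


# Hypothetical paired observations (illustrative values only)
se = standard_error_of_estimate([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(se, 4))
```

The result is in the same units as Y, so it can be read as a typical size of the prediction error.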