Correlation and Covariance
![Page 1: Correlation and Covariance](https://reader033.vdocuments.net/reader033/viewer/2022061610/56815117550346895dbf363b/html5/thumbnails/1.jpg)
Correlation and Covariance
![Page 2: Correlation and Covariance](https://reader033.vdocuments.net/reader033/viewer/2022061610/56815117550346895dbf363b/html5/thumbnails/2.jpg)
Overview
Outcome, Dependent Variable (Y-Axis): Height (Continuous), shown as a Histogram
Predictor Variable (X-Axis), Continuous: Scatter
Predictor Variable (X-Axis), Categorical: Boxplot
![Page 3: Correlation and Covariance](https://reader033.vdocuments.net/reader033/viewer/2022061610/56815117550346895dbf363b/html5/thumbnails/3.jpg)
Variables
Dependent Variable: Y (Height)
Independent Variables: X’s (X1, X2, X3, X4)
![Page 4: Correlation and Covariance](https://reader033.vdocuments.net/reader033/viewer/2022061610/56815117550346895dbf363b/html5/thumbnails/4.jpg)
Correlation Matrix for Continuous Variables
chart.Correlation(num2) from the PerformanceAnalytics package
![Page 5: Correlation and Covariance](https://reader033.vdocuments.net/reader033/viewer/2022061610/56815117550346895dbf363b/html5/thumbnails/5.jpg)
Calculating ‘Error’
• A deviation is the difference between the mean and an actual data point.
• Deviations can be calculated by taking each score and subtracting the mean from it:
$$\text{deviation} = x_i - \bar{x}$$
![Page 6: Correlation and Covariance](https://reader033.vdocuments.net/reader033/viewer/2022061610/56815117550346895dbf363b/html5/thumbnails/6.jpg)
Calculating ‘Error’
![Page 7: Correlation and Covariance](https://reader033.vdocuments.net/reader033/viewer/2022061610/56815117550346895dbf363b/html5/thumbnails/7.jpg)
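Use the Total Error?
• Could we take the errors (deviations) between the data and the mean and simply add them up?
$$\sum (X - \bar{X}) = 0$$

| Score | Mean | Deviation |
|---|---|---|
| 1 | 2.6 | -1.6 |
| 2 | 2.6 | -0.6 |
| 3 | 2.6 | 0.4 |
| 3 | 2.6 | 0.4 |
| 4 | 2.6 | 1.4 |
| Total | | 0 |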
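The table above can be reproduced with a few lines of R. This is a minimal sketch using the five scores from the slide; the variable names are illustrative only.

```r
# Scores from the slide's table
scores <- c(1, 2, 3, 3, 4)

# Deviation of each score from the mean (mean is 2.6)
deviations <- scores - mean(scores)
print(deviations)

# Positive and negative deviations cancel, so the total is zero
# (rounded here to hide floating-point noise)
total_error <- sum(deviations)
print(round(total_error, 10))
```

Because the total is always zero, it cannot serve as a measure of spread.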
![Page 8: Correlation and Covariance](https://reader033.vdocuments.net/reader033/viewer/2022061610/56815117550346895dbf363b/html5/thumbnails/8.jpg)
Sum of Squared Errors
• We could add the deviations to find out the total error.
• Deviations cancel out because some are positive and others negative.
• Therefore, we square each deviation.
• If we add these squared deviations we get the sum of squared errors (SS).
![Page 9: Correlation and Covariance](https://reader033.vdocuments.net/reader033/viewer/2022061610/56815117550346895dbf363b/html5/thumbnails/9.jpg)
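Sum of Squared Errors
$$SS = \sum (X - \bar{X})^2 = 5.20$$

| Score | Mean | Deviation | Squared Deviation |
|---|---|---|---|
| 1 | 2.6 | -1.6 | 2.56 |
| 2 | 2.6 | -0.6 | 0.36 |
| 3 | 2.6 | 0.4 | 0.16 |
| 3 | 2.6 | 0.4 | 0.16 |
| 4 | 2.6 | 1.4 | 1.96 |
| Total | | | 5.20 |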
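As a quick check of the 5.20 total, the same calculation can be sketched in R (variable names are illustrative):

```r
# Scores from the slide's table
scores <- c(1, 2, 3, 3, 4)

# Square each deviation so negatives no longer cancel positives
squared_deviations <- (scores - mean(scores))^2
print(squared_deviations)

# Sum of squared errors (SS)
ss <- sum(squared_deviations)
print(ss)
```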
![Page 10: Correlation and Covariance](https://reader033.vdocuments.net/reader033/viewer/2022061610/56815117550346895dbf363b/html5/thumbnails/10.jpg)
Standard Deviation
• The variance is measured in units squared.
• This isn’t a very meaningful metric so we take the square root value.
• This is the standard deviation (s).
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}} = \sqrt{\frac{5.20}{5}} = 1.02$$
![Page 11: Correlation and Covariance](https://reader033.vdocuments.net/reader033/viewer/2022061610/56815117550346895dbf363b/html5/thumbnails/11.jpg)
Variance
• The sum of squares is a good measure of overall variability, but is dependent on the number of scores.
• We calculate the average variability by dividing by the number of scores (n).
• This value is called the variance ($s^2$).
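A short R sketch of the variance and the standard deviation as the slides define them, dividing by n. Note, as a caveat not stated on the slides, that base R's `var()` and `sd()` divide by n - 1 instead, so they give slightly larger values for the same scores.

```r
scores <- c(1, 2, 3, 3, 4)
n <- length(scores)

# Sum of squared errors, as on the earlier slide
ss <- sum((scores - mean(scores))^2)

# Variance and standard deviation dividing by n (the slides' definition)
variance_n <- ss / n
sd_n <- sqrt(variance_n)
print(variance_n)
print(round(sd_n, 2))

# Base R uses the n - 1 denominator, so these differ from the values above
print(var(scores))
print(sd(scores))
```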
![Page 12: Correlation and Covariance](https://reader033.vdocuments.net/reader033/viewer/2022061610/56815117550346895dbf363b/html5/thumbnails/12.jpg)
Same Mean, Different Standard Deviation
![Page 13: Correlation and Covariance](https://reader033.vdocuments.net/reader033/viewer/2022061610/56815117550346895dbf363b/html5/thumbnails/13.jpg)
Temperature Variation Across Cities
[Figure: count of hours for Austin, Las Vegas, San Diego, San Francisco, and Tampa Bay]
![Page 14: Correlation and Covariance](https://reader033.vdocuments.net/reader033/viewer/2022061610/56815117550346895dbf363b/html5/thumbnails/14.jpg)
Covariance
$$\operatorname{cov}(x, y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{N - 1}$$
[Figure: chart of X and Y]
Persons 2, 3, and 5 look to have similar magnitudes from their means
![Page 15: Correlation and Covariance](https://reader033.vdocuments.net/reader033/viewer/2022061610/56815117550346895dbf363b/html5/thumbnails/15.jpg)
Covariance
• Calculate the error [deviation] between the mean and each subject’s score for the first variable (x).
• Calculate the error [deviation] between the mean and their score for the second variable (y).
• Multiply these error values.
• Add these values and you get the cross product deviations.
• The covariance is the average cross-product deviations:
$$\operatorname{cov}(x, y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{N - 1} = \frac{(-0.4)(-3) + (-1.4)(-2) + (-1.4)(-1) + (0.6)(2) + (2.6)(4)}{4} = \frac{1.2 + 2.8 + 1.4 + 1.2 + 10.4}{4} = \frac{17}{4} = 4.25$$
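The steps above can be sketched in R using the five pairs of deviations from the worked example. The signs of the deviations are an assumption (the extracted slide text preserves only their magnitudes), and the raw scores `x_assumed` and `y_assumed` are hypothetical values chosen to be consistent with those deviations; they do not appear on the slide.

```r
# Deviations from the mean for the five subjects
# (signs assumed; the slide text shows only magnitudes)
dx <- c(-0.4, -1.4, -1.4, 0.6, 2.6)
dy <- c(-3, -2, -1, 2, 4)

# Multiply the paired deviations, add them, and average over N - 1
cross_products <- dx * dy
covariance <- sum(cross_products) / (length(dx) - 1)
print(cross_products)
print(covariance)

# Hypothetical raw scores that produce exactly these deviations
x_assumed <- c(5, 4, 4, 6, 8)
y_assumed <- c(8, 9, 10, 13, 15)

# Base R's cov() also divides by N - 1, so it should agree
print(cov(x_assumed, y_assumed))
```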
![Page 16: Correlation and Covariance](https://reader033.vdocuments.net/reader033/viewer/2022061610/56815117550346895dbf363b/html5/thumbnails/16.jpg)
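Covariance
$$\operatorname{cov}(x, y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{N - 1}$$

| Age | Income | Education |
|---|---|---|
| 7 | 4 | 3 |
| 4 | 1 | 8 |
| 6 | 3 | 5 |
| 8 | 6 | 1 |
| 8 | 5 | 7 |
| 7 | 2 | 9 |
| 5 | 3 | 3 |
| 9 | 5 | 8 |
| 7 | 4 | 5 |
| 8 | 2 | 2 |
| 9 | 5 | 2 |
| 8 | 4 | 2 |
| 9 | 2 | 3 |
| 8 | 4 | 7 |
| 3 | 1 | 4 |
| 3 | 1 | 3 |
| 8 | 2 | 6 |
| 1 | 2 | 5 |
| 3 | 1 | 7 |
| 6 | 3 | 3 |

[Figure: scatter plot, Age vs. Income]
[Figure: Delta A and Delta I for subjects 1 to 20]
Do they VARY the same way relative to their own means?
cov(Age, Income) = 2.47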
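The 2.47 figure can be checked in R from the Age and Income columns of this slide's data. This is a sketch; `delta_a` and `delta_i` are illustrative names mirroring the chart labels "Delta A" and "Delta I".

```r
# Age and Income columns from the slide's 20-row table
age    <- c(7, 4, 6, 8, 8, 7, 5, 9, 7, 8, 9, 8, 9, 8, 3, 3, 8, 1, 3, 6)
income <- c(4, 1, 3, 6, 5, 2, 3, 5, 4, 2, 5, 4, 2, 4, 1, 1, 2, 2, 1, 3)

# Deviations of each variable from its own mean
delta_a <- age - mean(age)
delta_i <- income - mean(income)

# Average cross-product of the deviations, using N - 1
cov_manual <- sum(delta_a * delta_i) / (length(age) - 1)
print(round(cov_manual, 2))

# Built-in equivalent
cov_builtin <- cov(age, income)
print(round(cov_builtin, 2))
```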
![Page 17: Correlation and Covariance](https://reader033.vdocuments.net/reader033/viewer/2022061610/56815117550346895dbf363b/html5/thumbnails/17.jpg)
Limitations of Covariance
• It depends upon the units of measurement.
  • E.g. the covariance of two variables measured in miles might be 4.25, but if the same scores are converted to kilometres, the covariance is 11.
• One solution: standardize it (normalize the data).
  • Divide by the standard deviations of both variables.
• The standardized version of covariance is known as the correlation coefficient.
  • It is relatively unaffected by units of measurement.
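A small R sketch of the unit problem. The scores `miles_x` and `miles_y` are hypothetical values (not from the slides) chosen so that their covariance is 4.25; the conversion factor of 1.609 km per mile is likewise an assumption of this example.

```r
# Hypothetical scores in miles with covariance 4.25
miles_x <- c(5, 4, 4, 6, 8)
miles_y <- c(8, 9, 10, 13, 15)
km_per_mile <- 1.609

# Covariance changes when both variables are converted to kilometres
cov_miles <- cov(miles_x, miles_y)
cov_km <- cov(miles_x * km_per_mile, miles_y * km_per_mile)
print(cov_miles)
print(round(cov_km, 2))

# Correlation is the same in either unit
cor_miles <- cor(miles_x, miles_y)
cor_km <- cor(miles_x * km_per_mile, miles_y * km_per_mile)
print(c(cor_miles, cor_km))
```

The covariance scales by the square of the conversion factor, while the correlation is unchanged by a positive rescaling of either variable.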
![Page 18: Correlation and Covariance](https://reader033.vdocuments.net/reader033/viewer/2022061610/56815117550346895dbf363b/html5/thumbnails/18.jpg)
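The Correlation Coefficient
$$r = \frac{\operatorname{cov}_{xy}}{s_x s_y} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{(N - 1)\, s_x s_y}$$
$$r = \frac{\operatorname{cov}_{xy}}{s_x s_y} = \frac{4.25}{1.67 \times 2.92} = .87$$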
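The calculation can be sketched in R. The raw scores below are hypothetical (they are not shown on the slide) and were chosen because they reproduce the slide's covariance of 4.25 and standard deviations of 1.67 and 2.92.

```r
# Hypothetical scores consistent with the slide's summary values
x <- c(5, 4, 4, 6, 8)
y <- c(8, 9, 10, 13, 15)

# Standard deviations (R's sd() uses the N - 1 denominator, matching cov())
s_x <- sd(x)
s_y <- sd(y)
print(round(c(s_x, s_y), 2))

# Correlation as standardized covariance
r_manual <- cov(x, y) / (s_x * s_y)
print(round(r_manual, 2))

# Built-in Pearson correlation for comparison
r_builtin <- cor(x, y)
print(round(r_builtin, 2))
```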
![Page 19: Correlation and Covariance](https://reader033.vdocuments.net/reader033/viewer/2022061610/56815117550346895dbf363b/html5/thumbnails/19.jpg)
Things to Know about the Correlation
• It varies between -1 and +1
  • 0 = no relationship
• It is an effect size
  • ±.1 = small effect
  • ±.3 = medium effect
  • ±.5 = large effect
• Coefficient of determination, $r^2$
  • By squaring the value of r you get the proportion of variance in one variable shared by the other.
![Page 20: Correlation and Covariance](https://reader033.vdocuments.net/reader033/viewer/2022061610/56815117550346895dbf363b/html5/thumbnails/20.jpg)
Correlation
Covariance is High: r ~1
Covariance is Low: r ~0
![Page 21: Correlation and Covariance](https://reader033.vdocuments.net/reader033/viewer/2022061610/56815117550346895dbf363b/html5/thumbnails/21.jpg)
Correlation
[Figure: scatter plot of Age vs. Income with fitted trend line]
f(x) = 0.4330x + 0.2506, R² = 0.4424
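The trend line and R² can be reproduced in R with `lm()`, assuming the chart was built from the Age and Income columns of the 20-row table on the earlier covariance slide (the fitted values agree with that data). The sketch also shows that R² equals the squared correlation.

```r
# Age and Income columns from the earlier 20-row table
age    <- c(7, 4, 6, 8, 8, 7, 5, 9, 7, 8, 9, 8, 9, 8, 3, 3, 8, 1, 3, 6)
income <- c(4, 1, 3, 6, 5, 2, 3, 5, 4, 2, 5, 4, 2, 4, 1, 1, 2, 2, 1, 3)

# Least-squares line: income = intercept + slope * age
fit <- lm(income ~ age)
print(round(coef(fit), 4))

# Coefficient of determination from the fit
r_squared <- summary(fit)$r.squared
print(round(r_squared, 4))

# For a single predictor, R-squared is the squared Pearson correlation
r <- cor(age, income)
print(round(r^2, 4))
```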
![Page 22: Correlation and Covariance](https://reader033.vdocuments.net/reader033/viewer/2022061610/56815117550346895dbf363b/html5/thumbnails/22.jpg)
Correlation
Need inter-item/variable correlations > .30
![Page 23: Correlation and Covariance](https://reader033.vdocuments.net/reader033/viewer/2022061610/56815117550346895dbf363b/html5/thumbnails/23.jpg)
Data Structures
Framework Source: Hadley Wickham
Numeric Vector: a <- c(1,2,5.3,6,-2,4)
Character Vector: b <- c("one","two","three")
Matrix: y <- matrix(1:20, nrow=5, ncol=4)
Dataframe:
d <- c(1,2,3,4)
e <- c("red", "white", "red", NA)
f <- c(TRUE,TRUE,TRUE,FALSE)
mydata <- data.frame(d,e,f)
names(mydata) <- c("ID","Color","Passed")
List: w <- list(name="Fred", age=5.3)
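For reference, the slide's snippets can be run together and inspected as below. This is a sketch; the `stringsAsFactors = FALSE` argument is an addition here so that the Color column stays a character vector regardless of R version.

```r
# Vectors: every element has the same type
a <- c(1, 2, 5.3, 6, -2, 4)       # numeric vector
b <- c("one", "two", "three")     # character vector

# Matrix: a two-dimensional structure of a single type
y <- matrix(1:20, nrow = 5, ncol = 4)

# Data frame: columns may have different types
d <- c(1, 2, 3, 4)
e <- c("red", "white", "red", NA)
f <- c(TRUE, TRUE, TRUE, FALSE)
mydata <- data.frame(d, e, f, stringsAsFactors = FALSE)
names(mydata) <- c("ID", "Color", "Passed")

# List: elements may be of any type and length
w <- list(name = "Fred", age = 5.3)

# Inspect the structure of each object
str(a)
str(b)
print(dim(y))
str(mydata)
str(w)
```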
![Page 24: Correlation and Covariance](https://reader033.vdocuments.net/reader033/viewer/2022061610/56815117550346895dbf363b/html5/thumbnails/24.jpg)
Correlation Matrix
![Page 25: Correlation and Covariance](https://reader033.vdocuments.net/reader033/viewer/2022061610/56815117550346895dbf363b/html5/thumbnails/25.jpg)
Correlation and Covariance
$$\operatorname{cov}(x, y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{N - 1}$$
![Page 26: Correlation and Covariance](https://reader033.vdocuments.net/reader033/viewer/2022061610/56815117550346895dbf363b/html5/thumbnails/26.jpg)
Revisiting the Height Dataset
![Page 27: Correlation and Covariance](https://reader033.vdocuments.net/reader033/viewer/2022061610/56815117550346895dbf363b/html5/thumbnails/27.jpg)
Galton: Height Dataset
cor(heights)
Error in cor(heights) : 'x' must be numeric
cor() function does not handle Factors
Excel correl() does not either
Initial workaround: Create data.frame without the Factors
h2 <- data.frame(h$father,h$mother,h$avgp,h$childNum,h$kids)
Later we will RECODE the variable into a 0, 1
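The error and the workaround can be sketched on a small made-up data frame. `heights_demo` and all of its values are hypothetical stand-ins, not the Galton data; selecting columns with `is.numeric` is one possible alternative to listing the numeric columns by hand.

```r
# Hypothetical stand-in for a height dataset with one factor column
heights_demo <- data.frame(
  father = c(78.5, 75.5, 75.0, 74.0, 73.0, 72.7),
  mother = c(67.0, 66.5, 64.0, 68.0, 67.0, 69.0),
  gender = factor(c("M", "F", "M", "F", "M", "F")),
  height = c(73.2, 69.2, 71.0, 68.0, 72.0, 65.5)
)

# cor() stops with an error when a column is a factor
result <- tryCatch(cor(heights_demo), error = function(e) conditionMessage(e))
print(result)

# Workaround: keep only the numeric columns, then correlate
numeric_only <- heights_demo[sapply(heights_demo, is.numeric)]
cor_matrix <- cor(numeric_only)
print(round(cor_matrix, 2))
```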
![Page 28: Correlation and Covariance](https://reader033.vdocuments.net/reader033/viewer/2022061610/56815117550346895dbf363b/html5/thumbnails/28.jpg)
Histogram of Correlation Coefficients
[Figure: histogram with x-axis from -1 to +1]
![Page 29: Correlation and Covariance](https://reader033.vdocuments.net/reader033/viewer/2022061610/56815117550346895dbf363b/html5/thumbnails/29.jpg)
Correlations Matrix: Both Types
library(car)
scatterplotMatrix(heights)
Zoom in on Gender
![Page 30: Correlation and Covariance](https://reader033.vdocuments.net/reader033/viewer/2022061610/56815117550346895dbf363b/html5/thumbnails/30.jpg)
Correlation Matrix for Continuous Variables
chart.Correlation(num2) from the PerformanceAnalytics package
![Page 31: Correlation and Covariance](https://reader033.vdocuments.net/reader033/viewer/2022061610/56815117550346895dbf363b/html5/thumbnails/31.jpg)
Categorical: Revisit Box Plot
Factors/Categorical work with Boxplots; however some functions are not set up to handle Factors
Note there is an equation here: Y = mx + b
Correlation will depend on spread of distributions
![Page 32: Correlation and Covariance](https://reader033.vdocuments.net/reader033/viewer/2022061610/56815117550346895dbf363b/html5/thumbnails/32.jpg)
Manual Calculation: Note Stdev is Lower
Note that with a 0/1 variable the deviations from the mean are small, so the standard deviation is lower, whereas the continuous variable has a lot of variation (spread).
![Page 33: Correlation and Covariance](https://reader033.vdocuments.net/reader033/viewer/2022061610/56815117550346895dbf363b/html5/thumbnails/33.jpg)
Categorical: Recode!
Gender recoded as 0 = Female, 1 = Male
@correl does not work with Factor Variables
Formula now works!
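In R, the same recoding might look like the sketch below. The `gender` and `height` values are made up for illustration and are not taken from the deck's dataset.

```r
# Hypothetical data: a factor and a continuous variable
gender <- factor(c("F", "M", "M", "F", "M", "F"))
height <- c(65.5, 73.2, 71.0, 68.0, 72.0, 69.2)

# Recode the factor as 0 = Female, 1 = Male
gender01 <- ifelse(gender == "M", 1, 0)
print(gender01)

# cor() now accepts the recoded variable
r_gender_height <- cor(gender01, height)
print(round(r_gender_height, 2))

# The 0/1 variable has a much smaller standard deviation than height
print(round(c(sd(gender01), sd(height)), 2))
```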
![Page 34: Correlation and Covariance](https://reader033.vdocuments.net/reader033/viewer/2022061610/56815117550346895dbf363b/html5/thumbnails/34.jpg)
Correlation: Continuous & Discrete
More examples of cor.test()
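A minimal `cor.test()` sketch, assuming the Age and Income columns from the 20-row table earlier in the deck. By default it runs a Pearson test and reports the estimate, a p-value, and a confidence interval.

```r
# Age and Income columns from the earlier 20-row table
age    <- c(7, 4, 6, 8, 8, 7, 5, 9, 7, 8, 9, 8, 9, 8, 3, 3, 8, 1, 3, 6)
income <- c(4, 1, 3, 6, 5, 2, 3, 5, 4, 2, 5, 4, 2, 4, 1, 1, 2, 2, 1, 3)

# Pearson correlation test (the default method)
test_result <- cor.test(age, income)

# Correlation estimate, p-value, and 95% confidence interval
r_estimate <- unname(test_result$estimate)
print(round(r_estimate, 3))
print(test_result$p.value)
print(test_result$conf.int)
```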
![Page 35: Correlation and Covariance](https://reader033.vdocuments.net/reader033/viewer/2022061610/56815117550346895dbf363b/html5/thumbnails/35.jpg)
Overview
• Too many variables are difficult to handle.
• Handling all that data takes considerable computing power.
• Principal components analysis seeks to identify and quantify a smaller set of underlying components by analyzing the original, observable variables.
• In many cases, we can wind up working with just a few principal components or factors (on the order of, say, three to ten) instead of tens or hundreds of conventionally measured variables.
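As a rough illustration of the idea, base R's `prcomp()` can be run on the Age, Income, and Education values from the covariance slide earlier in the deck. This is a sketch, not the analysis shown in the deck; scaling the variables first is a choice made here so that the components are based on correlations rather than raw covariances.

```r
# Age, Income, Education from the earlier 20-row table
survey <- data.frame(
  age       = c(7, 4, 6, 8, 8, 7, 5, 9, 7, 8, 9, 8, 9, 8, 3, 3, 8, 1, 3, 6),
  income    = c(4, 1, 3, 6, 5, 2, 3, 5, 4, 2, 5, 4, 2, 4, 1, 1, 2, 2, 1, 3),
  education = c(3, 8, 5, 1, 7, 9, 3, 8, 5, 2, 2, 2, 3, 7, 4, 3, 6, 5, 7, 3)
)

# Principal components on centered, standardized variables
pca <- prcomp(survey, center = TRUE, scale. = TRUE)

# Proportion of total variance explained by each component
variance_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(variance_explained, 3))

# Loadings: how each observable variable contributes to each component
print(round(pca$rotation, 3))
```

Components are returned in order of decreasing variance, so the first one always explains the most.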
![Page 36: Correlation and Covariance](https://reader033.vdocuments.net/reader033/viewer/2022061610/56815117550346895dbf363b/html5/thumbnails/36.jpg)
Principal Components Analysis
[Figure: observable variables X1, X2, X3 and component vectors Z1, Z2, Z3]
Which component explains the most variance?
![Page 37: Correlation and Covariance](https://reader033.vdocuments.net/reader033/viewer/2022061610/56815117550346895dbf363b/html5/thumbnails/37.jpg)
Principal Components Analysis
![Page 38: Correlation and Covariance](https://reader033.vdocuments.net/reader033/viewer/2022061610/56815117550346895dbf363b/html5/thumbnails/38.jpg)
Principal Components
![Page 39: Correlation and Covariance](https://reader033.vdocuments.net/reader033/viewer/2022061610/56815117550346895dbf363b/html5/thumbnails/39.jpg)
Correlation Regression