Correlation and Covariance
Overview
[Figure: chart of plot types by variable type. Labels: Continuous, Categorical; Histogram, Scatter, Boxplot; Predictor Variable (X-Axis); Outcome, Dependent Variable (Y-Axis); Height.]
Correlation
Covariance is High: r ~1
Covariance is Low: r ~0
Things to Know about the Correlation
• It varies between -1 and +1
  • 0 = no relationship
• It is an effect size
  • ±.1 = small effect
  • ±.3 = medium effect
  • ±.5 = large effect
• Coefficient of determination, r²
  • By squaring the value of r you get the proportion of variance in one variable shared by the other.
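As a small illustration of the coefficient of determination, the sketch below squares r and compares it with the R-squared of a one-predictor linear regression. The vectors x and y are made-up values, not data from the slides.

```r
# Made-up illustrative data (an assumption, not from the slides)
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(2, 1, 4, 3, 6, 5, 8, 7)

r  <- cor(x, y)   # Pearson correlation, always between -1 and +1
r2 <- r^2         # coefficient of determination

# For a linear model with a single predictor, R-squared equals r squared
fit   <- lm(y ~ x)
lm_r2 <- summary(fit)$r.squared

print(c(r = r, r2 = r2, lm_r2 = lm_r2))
```

The two squared values agree because, with one predictor, the regression explains exactly the share of variance that the two variables have in common.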
Variables
[Figure: a dependent variable Y (Height) and independent variables X1, X2, X3, X4 (the X's). Annotation: Little Correlation.]
Correlation is For Linear Relationships
Outliers Can Skew Correlation Values
Correlation and Regression Are Related
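The first two points above (linearity and outliers) can be checked with a rough sketch. All values below are made up for illustration and are not from the slides.

```r
# Made-up data: a perfect but non-linear (quadratic) relationship
x <- -5:5
y <- x^2
r_curve <- cor(x, y)   # symmetric parabola, so there is no linear trend

# Made-up data: a nearly perfect linear trend
x_lin <- 1:10
noise <- c(0.5, -0.5, 0.3, -0.3, 0.2, -0.2, 0.4, -0.4, 0.1, -0.1)
y_lin <- 2 * x_lin + noise
r_clean <- cor(x_lin, y_lin)

# The same data with one extreme point far from the trend
r_outlier <- cor(c(x_lin, 30), c(y_lin, 0))

print(c(r_curve = r_curve, r_clean = r_clean, r_outlier = r_outlier))
```

The curved data are perfectly related yet give a correlation of essentially zero, and a single extreme point is enough to pull a near-perfect positive correlation down sharply.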
$$\operatorname{cov}(x, y) = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{N - 1}$$
Covariance
[Scatterplot: X on the horizontal axis, Y on the vertical axis.]
Persons 2, 3, and 5 appear to deviate from the means of X and Y by similar magnitudes.
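$$\operatorname{cov}(x, y) = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{N - 1}$$
$$= \frac{(-0.4)(-3) + (-1.4)(-2) + (-1.4)(-1) + (0.6)(2) + (2.6)(4)}{4}$$
$$= \frac{1.2 + 2.8 + 1.4 + 1.2 + 10.4}{4} = \frac{17}{4} = 4.25$$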
Covariance
• Calculate the error [deviation] between the mean and each subject’s score for the first variable (x).
• Calculate the error [deviation] between the mean and their score for the second variable (y).
• Multiply these error values.
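• Add these values and you get the sum of the cross-product deviations.
• The covariance is the average of the cross-product deviations (dividing by N - 1):
$$\operatorname{cov}(x, y) = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{N - 1}$$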
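The steps above can be sketched in R as follows. The scores in x and y are illustrative values (an assumption, not copied from a slide), and the manual result is compared with the built-in cov(), which also divides by N - 1.

```r
# Illustrative scores (an assumption, not taken verbatim from the slides)
x <- c(5, 4, 4, 6, 8)
y <- c(8, 9, 10, 13, 15)

dev_x <- x - mean(x)       # deviation of each x score from the mean of x
dev_y <- y - mean(y)       # deviation of each y score from the mean of y
cross <- dev_x * dev_y     # cross-product deviations

# Average of the cross-product deviations, dividing by N - 1
cov_manual <- sum(cross) / (length(x) - 1)

print(cov_manual)          # manual calculation
print(cov(x, y))           # built-in function for comparison
```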
Covariance
Age  Income  Education
7    4       3
4    1       8
6    3       5
8    6       1
8    5       7
7    2       9
5    3       3
9    5       8
7    4       5
8    2       2
9    5       2
8    4       2
9    2       3
8    4       7
3    1       4
3    1       3
8    2       6
1    2       5
3    1       7
6    3       3
Do they VARY the same way relative to their own means?
Covariance of Age and Income: 2.47
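A quick way to check the covariance for this table is to enter the columns in R. The values below are read off the flattened table on the slide, assuming every entry is a single digit; the data frame name survey is made up for this sketch.

```r
# Columns read from the slide's table (assumes single-digit entries)
age       <- c(7, 4, 6, 8, 8, 7, 5, 9, 7, 8, 9, 8, 9, 8, 3, 3, 8, 1, 3, 6)
income    <- c(4, 1, 3, 6, 5, 2, 3, 5, 4, 2, 5, 4, 2, 4, 1, 1, 2, 2, 1, 3)
education <- c(3, 8, 5, 1, 7, 9, 3, 8, 5, 2, 2, 2, 3, 7, 4, 3, 6, 5, 7, 3)

survey <- data.frame(age, income, education)  # hypothetical name

# Covariance of one pair of variables
cov_age_income <- cov(survey$age, survey$income)
print(round(cov_age_income, 2))

# Covariance matrix for all pairs; the diagonal holds each variable's variance
print(round(cov(survey), 2))
```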
Limitations of Covariance
• It depends upon the units of measurement.
  • E.g. the covariance of two variables measured in miles might be 4.25, but if the same scores are converted to kilometres, the covariance is 11.
• One solution: standardize it (normalize the data).
  • Divide by the standard deviations of both variables.
• The standardized version of covariance is known as the correlation coefficient.
  • It is relatively unaffected by units of measurement.
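The unit dependence can be seen in a short sketch. The scores are illustrative values treated as miles (an assumption), and 1.609 is used as an approximate kilometres-per-mile factor.

```r
# Illustrative scores, treated here as distances in miles (an assumption)
x_miles <- c(5, 4, 4, 6, 8)
y_miles <- c(8, 9, 10, 13, 15)

km_per_mile <- 1.609            # approximate conversion factor
x_km <- x_miles * km_per_mile
y_km <- y_miles * km_per_mile

cov_miles <- cov(x_miles, y_miles)   # covariance in squared miles
cov_km    <- cov(x_km, y_km)         # scaled by km_per_mile^2

cor_miles <- cor(x_miles, y_miles)   # standardized, so unit-free
cor_km    <- cor(x_km, y_km)

print(round(c(cov_miles = cov_miles, cov_km = cov_km), 2))
print(round(c(cor_miles = cor_miles, cor_km = cor_km), 2))
```

Rescaling both variables multiplies the covariance by the square of the conversion factor, while the correlation stays the same.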
$$r = \frac{\operatorname{cov}_{xy}}{s_x s_y} = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{(N - 1)\, s_x s_y}$$
The Correlation Coefficient
$$r = \frac{\operatorname{cov}_{xy}}{s_x s_y} = \frac{4.25}{1.67 \times 2.92} = .87$$
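The same arithmetic can be reproduced in R as a rough check. The scores below are illustrative values (an assumption) whose covariance and standard deviations match the numbers in the calculation.

```r
# Illustrative scores (an assumption) with cov = 4.25
x <- c(5, 4, 4, 6, 8)
y <- c(8, 9, 10, 13, 15)

s_x <- sd(x)                          # standard deviation of x
s_y <- sd(y)                          # standard deviation of y
r_manual <- cov(x, y) / (s_x * s_y)   # covariance divided by both SDs

print(round(c(s_x = s_x, s_y = s_y, r = r_manual), 2))
print(all.equal(r_manual, cor(x, y))) # agrees with the built-in cor()
```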
Correlation
Covariance is High: r ~1
Covariance is Low: r ~0
Need inter-item/variable correlations > .30
Data Structures
Numeric Vector:
  a <- c(1,2,5.3,6,-2,4)
Character Vector:
  b <- c("one","two","three")
Matrix:
  y <- matrix(1:20, nrow=5, ncol=4)
Dataframe:
  d <- c(1,2,3,4)
  e <- c("red", "white", "red", NA)
  f <- c(TRUE,TRUE,TRUE,FALSE)
  mydata <- data.frame(d,e,f)
  names(mydata) <- c("ID","Color","Passed")
List:
  w <- list(name="Fred", age=5.3)
Framework Source: Hadley Wickham
Correlation Matrix
Correlation and Covariance
$$\operatorname{cov}(x, y) = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{N - 1}$$
Revisiting the Height Dataset
Galton: Height Dataset
cor(heights)
Error in cor(heights) : 'x' must be numeric
Initial workaround: Create data.frame without the Factors
h2 <- data.frame(h$father,h$mother,h$avgp,h$childNum,h$kids)
cor() function does not handle Factors
Later we will RECODE the variable into a 0, 1
Excel correl() does not either
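One way to apply this workaround without naming every column is to keep only the numeric columns. The sketch below uses a small made-up data frame (toy) as a stand-in for the height data; its values and the gender and height columns are assumptions for illustration.

```r
# Hypothetical stand-in for the height data; all values are made up
toy <- data.frame(
  father = c(70, 68, 72, 66, 69, 71),
  mother = c(64, 63, 66, 61, 65, 62),
  height = c(71, 66, 73, 63, 70, 68),
  gender = factor(c("M", "F", "M", "F", "M", "F"))
)

# cor(toy) would stop with "'x' must be numeric" because of the factor column
attempt <- try(cor(toy), silent = TRUE)

# Keep only the numeric columns, then correlate
numeric_cols <- sapply(toy, is.numeric)   # TRUE for numeric columns only
toy_num      <- toy[, numeric_cols]
cor_matrix   <- cor(toy_num)

print(round(cor_matrix, 2))
```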
Histogram of Correlation Coefficients
[Histogram: x-axis runs from -1 to +1.]
Correlations Matrix: Both Types
library(car)
scatterplotMatrix(heights)
Zoom in on Gender
Correlation Matrix for Continuous Variables
chart.Correlation(num2)
PerformanceAnalytics package
Categorical: Revisit Box Plot
Factors/Categorical work with Boxplots; however some functions are not set up to handle Factors
Note there is an equation here: Y = mx + b
Correlation will depend on spread of distributions
Manual Calculation: Note the Standard Deviation is Lower
Note that with values of 0 and 1, the deviations from the mean are small and the standard deviation is lower, whereas the continuous variable has much more variation (spread).
Categorical: Recode!
Gender recoded as 0 = Female, 1 = Male
@correl does not work with Factor Variables
Formula now works!
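A minimal sketch of the recoding step in R, using made-up values. The data frame toy and the new column male are hypothetical names, and the "M"/"F" labels are an assumption about how gender is stored.

```r
# Hypothetical data; values and column names are made up
toy <- data.frame(
  height = c(71, 66, 73, 63, 70, 68),
  gender = factor(c("M", "F", "M", "F", "M", "F"))
)

# Recode the factor as a numeric 0/1 variable: 0 = Female, 1 = Male
toy$male <- ifelse(toy$gender == "M", 1, 0)

# cor() now accepts the recoded variable
r_gender <- cor(toy$height, toy$male)

# The 0/1 variable has a much smaller spread than the continuous one
sd_male   <- sd(toy$male)
sd_height <- sd(toy$height)

print(round(c(r = r_gender, sd_male = sd_male, sd_height = sd_height), 2))
```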
Correlation: Continuous & Discrete
More examples of cor.test()
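As one possible example, cor.test() reports the correlation together with a significance test and a confidence interval. The Age and Income values below are read from the table shown earlier in the deck, assuming every entry is a single digit.

```r
# Age and Income columns from the earlier table (assumes single-digit entries)
age    <- c(7, 4, 6, 8, 8, 7, 5, 9, 7, 8, 9, 8, 9, 8, 3, 3, 8, 1, 3, 6)
income <- c(4, 1, 3, 6, 5, 2, 3, 5, 4, 2, 5, 4, 2, 4, 1, 1, 2, 2, 1, 3)

# Pearson correlation test (the default method)
res <- cor.test(age, income)

r_est   <- unname(res$estimate)  # the correlation coefficient
p_value <- res$p.value           # p-value for H0: true correlation is 0
ci      <- res$conf.int          # 95% confidence interval by default

print(res)
```

The estimate is the same value that cor(age, income) returns; the test adds the inferential information around it.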
Summary
[Figure: chart relating Correlation and Regression to variable types. Axes: Predictor Variable (X-Axis) and Outcome, Dependent Variable (Y-Axis), each Continuous or Categorical. Plots named: Histogram, Scatter, Boxplot, Bar, Pie, Mosaic, CrossTable. Models named: Linear Regression, Logistic Regression (Regression Model). Summary statistics named: Mean, Median, Standard Deviation; Proportions; Frequency. Example variables: Parents Height, Gender (0, 1).]