Correlation and Covariance

Upload: victor-haynes

Post on 26-Dec-2015

TRANSCRIPT

Page 1: Correlation and Covariance. Overview Continuous Categorical Histogram Scatter Boxplot Predictor Variable (X-Axis) Height Outcome, Dependent Variable (Y-Axis)

Correlation and Covariance

Page 2:

Overview

[Diagram: chart types by variable type. Predictor Variable (X-Axis): Continuous, Categorical. Outcome, Dependent Variable (Y-Axis): Continuous. Charts shown: Histogram, Scatter, Boxplot. Example variable: Height.]

Page 3:

Correlation

Covariance is High: r ~1

Covariance is Low: r ~0

Page 4:

Things to Know about the Correlation

• It varies between -1 and +1
  • 0 = no relationship

• It is an effect size
  • ±.1 = small effect
  • ±.3 = medium effect
  • ±.5 = large effect

• Coefficient of determination, r²
  • By squaring the value of r you get the proportion of variance in one variable shared by the other.
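The link between r and r² can be checked numerically. The sketch below is an illustration only: it uses R's built-in `cars` dataset, which is not the data from these slides, and a hypothetical helper `effect_label()` that applies the rule-of-thumb cut-offs. The "below small" label for |r| < .1 is an assumption, since the slide does not name that range.

```r
# Illustration only: `cars` ships with base R and is not the slides' data.
r  <- cor(cars$speed, cars$dist)   # Pearson correlation, always within [-1, 1]
r2 <- r^2                          # coefficient of determination

# Hypothetical helper applying the rule-of-thumb cut-offs (.1, .3, .5).
effect_label <- function(r) {
  a <- abs(r)
  if (a >= 0.5) {
    "large"
  } else if (a >= 0.3) {
    "medium"
  } else if (a >= 0.1) {
    "small"
  } else {
    "below small"   # assumption: the slide gives no label for this range
  }
}

# With a single predictor, r^2 equals the R-squared of the linear regression.
lm_r2 <- summary(lm(dist ~ speed, data = cars))$r.squared

print(c(r = r, r2 = r2, lm_r2 = lm_r2))
print(effect_label(r))
```

The sign of r is ignored when labelling the effect size, so -.5 and +.5 are both "large".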

Page 5:

Variables

[Diagram labels: Dependent Variables (Y); Independent Variables, the X's (X1, X2, X3, X4); Height.]

Page 6:

Page 7:

Little Correlation

Page 8:

Correlation is For Linear Relationships
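A small sketch of why this matters, using made-up values that are not from the slides: a perfect but curved relationship can produce a correlation of zero, while a straight-line relationship produces a correlation of one.

```r
# Made-up data: y depends on x exactly, but the relationship is a parabola.
x <- -10:10
y_curved <- x^2
y_linear <- 2 * x + 1

r_curved <- cor(x, y_curved)   # essentially 0: no linear trend to detect
r_linear <- cor(x, y_linear)   # 1: a perfect straight line

print(c(curved = r_curved, linear = r_linear))
```

A scatter plot of `x` against `y_curved` would show the pattern that r alone hides.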

Page 9:

Outliers Can Skew Correlation Values
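As a rough illustration with made-up numbers (not the data plotted on the slide), one extreme point can turn a near-zero correlation into a very strong one:

```r
# Made-up scores with almost no linear relationship.
x <- 1:10
y <- c(5, 3, 6, 4, 5, 6, 4, 5, 3, 6)
r_without <- cor(x, y)

# Add a single extreme point far away from the rest of the data.
x_out <- c(x, 50)
y_out <- c(y, 50)
r_with <- cor(x_out, y_out)

print(round(c(without_outlier = r_without, with_outlier = r_with), 2))
```

Here the correlation without the extra point is close to zero, and with it the value rises above .9, even though ten of the eleven points are unchanged.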

Page 10:

Correlation and Regression Are Related
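One way to see the connection, sketched here with R's built-in `cars` dataset (an assumption for illustration, not the slides' data): in simple linear regression the slope equals r multiplied by the ratio of the two standard deviations.

```r
# Illustration only: `cars` ships with base R.
r <- cor(cars$speed, cars$dist)

# Slope implied by the correlation: r * (sd of outcome / sd of predictor).
slope_from_r <- r * sd(cars$dist) / sd(cars$speed)

# Slope estimated directly by least-squares regression.
fit <- lm(dist ~ speed, data = cars)
slope_lm <- unname(coef(fit)["speed"])

print(c(from_r = slope_from_r, from_lm = slope_lm))
```

The two numbers agree, which is why a steeper fitted line does not by itself mean a stronger correlation: the slope also depends on the spread of each variable.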

Page 11:

Covariance

$$\operatorname{cov}(x, y) = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{N - 1}$$

[Plot: Y against X]

Persons 2, 3, and 5 look to have deviations of similar magnitude from their means.

Page 12:

Covariance

• Calculate the error [deviation] between the mean and each subject’s score for the first variable (x).

• Calculate the error [deviation] between the mean and their score for the second variable (y).

• Multiply these error values.

• Add these values and you get the cross product deviations.

• The covariance is the average cross-product deviations:

$$
\begin{aligned}
\operatorname{cov}(x, y) &= \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{N - 1} \\
&= \frac{(-0.4)(-3) + (-1.4)(-2) + (-1.4)(-1) + (0.6)(2) + (2.6)(4)}{4} \\
&= \frac{1.2 + 2.8 + 1.4 + 1.2 + 10.4}{4} \\
&= \frac{17}{4} = 4.25
\end{aligned}
$$
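The five steps can be followed line by line in R. This is a minimal sketch: the raw scores `x` and `y` are hypothetical values for five persons, chosen so that their deviations reproduce the slide's result of 4.25. The slide itself does not list raw scores.

```r
# Hypothetical scores for five persons (assumed, not given on the slide).
x <- c(5, 4, 4, 6, 8)
y <- c(8, 9, 10, 13, 15)

dev_x <- x - mean(x)                    # step 1: deviations for x
dev_y <- y - mean(y)                    # step 2: deviations for y
cross <- dev_x * dev_y                  # step 3: multiply the deviations
sum_cross <- sum(cross)                 # step 4: sum of cross-product deviations
cov_xy <- sum_cross / (length(x) - 1)   # step 5: divide by N - 1

print(cov_xy)
print(cov(x, y))   # base R's cov() uses the same N - 1 denominator
```

Dividing by N - 1 rather than N is what makes the manual result match `cov()`.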

Page 13:

Covariance

$$\operatorname{cov}(x, y) = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{N - 1}$$

Age  Income  Education
7    4       3
4    1       8
6    3       5
8    6       1
8    5       7
7    2       9
5    3       3
9    5       8
7    4       5
8    2       2
9    5       2
8    4       2
9    2       3
8    4       7
3    1       4
3    1       3
8    2       6
1    2       5
3    1       7
6    3       3

Do they VARY the same way relative to their own means?

2.47
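The 2.47 shown on the slide can be reproduced in R. The sketch below assumes the flattened table reads as 20 rows of three values each (Age, Income, Education); the data frame name `survey` is hypothetical. Under that reading, 2.47 is the covariance of Age and Income.

```r
# Table values as read from the slide (assumed row layout: Age, Income, Education).
survey <- data.frame(
  Age       = c(7, 4, 6, 8, 8, 7, 5, 9, 7, 8, 9, 8, 9, 8, 3, 3, 8, 1, 3, 6),
  Income    = c(4, 1, 3, 6, 5, 2, 3, 5, 4, 2, 5, 4, 2, 4, 1, 1, 2, 2, 1, 3),
  Education = c(3, 8, 5, 1, 7, 9, 3, 8, 5, 2, 2, 2, 3, 7, 4, 3, 6, 5, 7, 3)
)

# Covariance of one pair of variables.
cov_age_income <- cov(survey$Age, survey$Income)
print(round(cov_age_income, 2))

# Covariance matrix for all three variables at once.
print(round(cov(survey), 2))
```

The diagonal of the matrix holds each variable's variance, and the off-diagonal cells hold the pairwise covariances.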

Page 14:

Limitations of Covariance

• It depends upon the units of measurement.
  • E.g. the covariance of two variables measured in miles might be 4.25, but if the same scores are converted to kilometres, the covariance is 11.

• One solution: standardize it! Normalize the data.
  • Divide by the standard deviations of both variables.

• The standardized version of covariance is known as the correlation coefficient.
  • It is relatively unaffected by units of measurement.
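The miles-to-kilometres example can be checked directly. In this sketch the scores are hypothetical values chosen to give a covariance of 4.25, and the conversion factor 1.609344 km per mile is the standard one. Converting both variables multiplies the covariance by the factor squared, while the correlation does not change.

```r
# Hypothetical scores in miles (assumed values giving a covariance of 4.25).
x_miles <- c(5, 4, 4, 6, 8)
y_miles <- c(8, 9, 10, 13, 15)

km_per_mile <- 1.609344
x_km <- x_miles * km_per_mile
y_km <- y_miles * km_per_mile

cov_miles <- cov(x_miles, y_miles)   # 4.25
cov_km    <- cov(x_km, y_km)         # 4.25 * km_per_mile^2, roughly 11

cor_miles <- cor(x_miles, y_miles)
cor_km    <- cor(x_km, y_km)         # same value as cor_miles

print(round(c(cov_miles = cov_miles, cov_km = cov_km), 2))
print(round(c(cor_miles = cor_miles, cor_km = cor_km), 2))
```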

Page 15:

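The Correlation Coefficient

$$r = \frac{\operatorname{cov}_{xy}}{s_x s_y} = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{(N - 1)\, s_x s_y}$$

$$r = \frac{\operatorname{cov}_{xy}}{s_x s_y} = \frac{4.25}{1.67 \times 2.92} = .87$$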
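Two equivalent routes to r are sketched below: dividing the covariance by both standard deviations, and taking the covariance of z-scores. The raw scores are hypothetical values chosen so that the standard deviations and the result round to the slide's 1.67, 2.92, and .87.

```r
# Hypothetical scores (assumed; the slide shows only the summary numbers).
x <- c(5, 4, 4, 6, 8)
y <- c(8, 9, 10, 13, 15)

s_x <- sd(x)
s_y <- sd(y)

# Route 1: standardize the covariance.
r_formula <- cov(x, y) / (s_x * s_y)

# Route 2: standardize the scores first, then take their covariance.
z_x <- (x - mean(x)) / s_x
z_y <- (y - mean(y)) / s_y
r_z <- cov(z_x, z_y)

print(round(c(s_x = s_x, s_y = s_y, r_formula = r_formula, r_z = r_z), 2))
```

Both routes match base R's `cor(x, y)`.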

Page 16:

Correlation

Covariance is High: r ~1

Covariance is Low: r ~0

Page 17:

Correlation

Page 18:

Correlation

Need inter-item/variable correlations > .30

Page 19:

Data Structures

Numeric Vector:
    a <- c(1, 2, 5.3, 6, -2, 4)

Character Vector:
    b <- c("one", "two", "three")

Matrix:
    y <- matrix(1:20, nrow = 5, ncol = 4)

Dataframe:
    d <- c(1, 2, 3, 4)
    e <- c("red", "white", "red", NA)
    f <- c(TRUE, TRUE, TRUE, FALSE)
    mydata <- data.frame(d, e, f)
    names(mydata) <- c("ID", "Color", "Passed")

List:
    w <- list(name = "Fred", age = 5.3)

Framework Source: Hadley Wickham

Page 20:

Correlation Matrix

Page 21:

Correlation and Covariance

$$\operatorname{cov}(x, y) = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{N - 1}$$

Page 22:

Revisiting the Height Dataset

Page 23:

Galton: Height Dataset

cor() function does not handle Factors

    cor(heights)
    Error in cor(heights) : 'x' must be numeric

Excel correl() does not either

Initial workaround: Create data.frame without the Factors

    h2 <- data.frame(h$father, h$mother, h$avgp, h$childNum, h$kids)

Later we will RECODE the variable into a 0, 1
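An alternative to listing the numeric columns by hand is to select them programmatically. The sketch below uses a small made-up stand-in for the Galton data: the values and the `gender` and `height` column names are assumptions for illustration, not the actual dataset.

```r
# Made-up stand-in for the height data; values and some names are assumed.
h <- data.frame(
  father = c(78.5, 75.5, 75.0, 74.0, 73.0, 72.5),
  mother = c(67.0, 66.5, 64.0, 64.0, 68.0, 65.0),
  gender = factor(c("M", "F", "M", "F", "M", "F")),
  height = c(73.2, 69.2, 71.0, 68.0, 72.0, 66.5)
)

# Flag the numeric columns, then keep only those so cor() will accept the input.
numeric_cols <- sapply(h, is.numeric)
h_num <- h[, numeric_cols]

print(names(h_num))
print(round(cor(h_num), 2))
```

This drops the factor column instead of recoding it, so it gives the same kind of result as the manual workaround.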

Page 24:

Histogram of Correlation Coefficients

[Histogram: x-axis runs from -1 to +1]

Page 25:

Correlations Matrix: Both Types

    library(car)
    scatterplotMatrix(heights)

Zoom in on Gender

Page 26:

Correlation Matrix for Continuous Variables

    chart.Correlation(num2)

(PerformanceAnalytics package)

Page 27:

Categorical: Revisit Box Plot

Factors/Categorical work with Boxplots; however some functions are not set up to handle Factors

Note there is an equation here: Y = mx + b

Correlation will depend on spread of distributions

Page 28:

Manual Calculation: Note Stdev is Lower

Note that with 0 and 1 coding the deviations from the mean are small, so the standard deviation is lower, whereas the continuous variable has a lot of variation (spread).

Page 29:

Categorical: Recode!

Gender recoded as 0 = Female, 1 = Male

@correl does not work with Factor Variables

Formula now works!
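The same recode can be done in R. This is a minimal sketch with made-up values: the vectors `gender` and `height` are hypothetical, and `ifelse()` is one of several ways to build the 0/1 variable.

```r
# Made-up example data (not the Galton values).
gender <- factor(c("F", "M", "M", "F", "M", "F"))
height <- c(64, 70, 72, 63, 69, 66)

# Recode the factor: 0 = Female, 1 = Male.
gender01 <- ifelse(gender == "M", 1, 0)

# cor() now accepts the variable because it is numeric.
r_gender_height <- cor(gender01, height)

# The 0/1 variable has a much smaller standard deviation than height.
print(round(c(r = r_gender_height, sd_gender = sd(gender01), sd_height = sd(height)), 2))
```

A positive r here means the group coded 1 has the higher mean height in this made-up sample.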

Page 30:

Correlation: Continuous & Discrete

More examples of cor.test()
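For reference, a minimal `cor.test()` call is sketched below. It uses R's built-in `cars` dataset as a stand-in, since the slide's own examples were images and are not recoverable from the transcript.

```r
# Illustration only: `cars` ships with base R.
result <- cor.test(cars$speed, cars$dist)   # Pearson correlation by default

r_hat   <- unname(result$estimate)   # sample correlation
p_value <- result$p.value            # test of H0: true correlation is 0
ci      <- result$conf.int           # 95% confidence interval for r

print(r_hat)
print(p_value)
print(ci)
```

The estimate is the same number `cor()` returns; `cor.test()` adds the significance test and the confidence interval.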

Page 31:

Correlation Regression

Page 32:

Summary

[Diagram: grid of charts and models by variable type. Predictor Variable (X-Axis): Continuous, Categorical. Outcome, Dependent Variable (Y-Axis): Continuous, Categorical. Labels shown: Histogram, Scatter, Boxplot, Bar, Pie, Mosaic, CrossTable; Regression Model: Linear Regression, Logistic Regression; Parents Height, Gender; Frequency (0, 1); Mean, Median, Standard Deviation; Proportions.]