correlation


Upload: blaze-burris

Post on 31-Dec-2015


Correlation

We can often see the strength of the relationship between two quantitative variables in a scatterplot, but be careful. The two figures here are both scatterplots of the same data, on different scales. The second seems to be a stronger association…

So we need a measure of association independent of the graphics…

Use the correlation coefficient, r

The correlation coefficient is a measure of the direction and strength of a linear relationship.

It is calculated using the mean and the standard deviation of both the x and y variables.

Correlation can only be used to describe quantitative variables. Categorical variables don’t have means and standard deviations.

The correlation coefficient r

Time to swim: $\bar{x} = 35$, $s_x = 0.7$

Pulse rate: $\bar{y} = 140$, $s_y = 9.5$

$$ r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) $$

Part of the calculation involves finding z, the standardized score similar to the one we used when working with the normal distribution.

Standardization: Allows us to compare correlations between data sets where variables are measured in different units or where the variables themselves are different.

For instance, we might want to compare the correlation between [swim time and pulse], with the correlation between [swim time and breathing rate].

You DON'T want to do this by hand. Make sure you learn how to use your calculator or the computer to find r.
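As one way of letting the computer do the work, here is a minimal Python sketch of the z-score formula for r. The function name `correlation` and the data values are made up for illustration; they are not the swim-time data behind the slides.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Pearson r: sum of products of z-scores, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (divisor n - 1)
    z_x = [(x - x_bar) / s_x for x in xs]  # z-scores for x
    z_y = [(y - y_bar) / s_y for y in ys]  # z-scores for y
    return sum(zx * zy for zx, zy in zip(z_x, z_y)) / (n - 1)

# Hypothetical values, chosen only to illustrate the calculation.
swim_time = [34.1, 35.7, 34.7, 35.9, 34.5, 35.3]
pulse = [152, 124, 140, 130, 148, 136]

r = correlation(swim_time, pulse)
print(round(r, 3))
```

Statistical software and calculators report the same quantity directly; writing it out this way only makes the role of the z-scores visible.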

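$$ r = \frac{1}{n-1} \sum_{i=1}^{n} \underbrace{\left( \frac{x_i - \bar{x}}{s_x} \right)}_{z \text{ for time}} \underbrace{\left( \frac{y_i - \bar{y}}{s_y} \right)}_{z \text{ for pulse}} $$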

r does not distinguish between x & y

The correlation coefficient, r, treats x and y symmetrically

"Time to swim" is the explanatory variable here, and belongs on the x axis. However, in either plot r is the same (r=-0.75).

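[Figure: two scatterplots of the same data with the x and y axes swapped; r = -0.75 in both.]

$$ r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) $$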
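The symmetry of r in x and y can be checked numerically. The sketch below uses a hypothetical helper `correlation` and made-up data (not the data from the slides), and computes r with the two variables in either order.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Pearson r: sum of products of z-scores, divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

# Hypothetical values, for illustration only.
swim_time = [34.1, 35.7, 34.7, 35.9, 34.5, 35.3]
pulse = [152, 124, 140, 130, 148, 136]

r_xy = correlation(swim_time, pulse)  # swim time as x, pulse as y
r_yx = correlation(pulse, swim_time)  # roles swapped

print(abs(r_xy - r_yx) < 1e-12)
```

Swapping the arguments only swaps the two factors inside each product, so the sum cannot change.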

r has no unit of measure (unlike x and y)

Changing the units of measure of variables does not change the correlation coefficient r, because we "standardize out" the units when getting z-scores.

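[Figure: two scatterplots of the same data in different units of measure; r = -0.75 in both. The z-score plot is the same for both plots.]

$$ r = \frac{1}{n-1} \sum_{i=1}^{n} \underbrace{\left( \frac{x_i - \bar{x}}{s_x} \right)}_{z \text{ for time}} \underbrace{\left( \frac{y_i - \bar{y}}{s_y} \right)}_{z \text{ for pulse}} $$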
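A quick numerical check that r has no units: the sketch below rescales both variables and recomputes r. The data, the helper `correlation`, and the choice of units (minutes to seconds, beats per minute to beats per second) are assumptions made for illustration.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Pearson r: sum of products of z-scores, divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

# Hypothetical values; units assumed to be minutes and beats per minute.
time_minutes = [34.1, 35.7, 34.7, 35.9, 34.5, 35.3]
pulse_per_minute = [152, 124, 140, 130, 148, 136]

# The same measurements expressed in different units.
time_seconds = [t * 60 for t in time_minutes]
pulse_per_second = [p / 60 for p in pulse_per_minute]

r_original = correlation(time_minutes, pulse_per_minute)
r_rescaled = correlation(time_seconds, pulse_per_second)

# The two values agree up to floating-point rounding.
print(abs(r_original - r_rescaled) < 1e-9)
```

Multiplying a variable by a positive constant multiplies its deviations from the mean and its standard deviation by the same constant, so the z-scores are unchanged.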

r ranges from -1 to +1

r quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: r is positive when individuals with higher x values tend to have higher values of y, and negative when individuals with higher x values tend to have lower values of y.

When the variability of the points around the line decreases (less scatter in one or both variables), the correlation coefficient gets stronger (closer to +1 or -1).
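Reading the statement above as "less scatter about the line gives a stronger r", it can be illustrated with a small sketch. Everything here is made up for illustration: the helper `correlation`, the line y = 2x, and the fixed pattern of deviations, which is scaled by a small or a large factor.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Pearson r: sum of products of z-scores, divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

xs = list(range(1, 11))
# A fixed, made-up pattern of deviations from the line y = 2x.
deviations = [1, -1, 2, -2, 0.5, -0.5, 1.5, -1.5, 1, -1]

# Same trend, two different amounts of scatter around it.
y_tight = [2 * x + 0.5 * d for x, d in zip(xs, deviations)]
y_loose = [2 * x + 3.0 * d for x, d in zip(xs, deviations)]

r_tight = correlation(xs, y_tight)
r_loose = correlation(xs, y_loose)

print(round(r_tight, 3), round(r_loose, 3))
```

Both data sets follow the same upward trend; only the amount of scatter differs, and r is closer to +1 for the tighter one.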

Correlation coefficient r describes linear relationships

No matter how strong the association, r should not be used to describe non-linear relationships - we have other methods…

Note: You can sometimes transform a non-linear association to a linear form, for instance by taking the logarithm. You can then calculate a correlation using the transformed data.
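The logarithm transformation mentioned in the note can be sketched as follows. The data are made up: y grows exactly exponentially in x, so log(y) is exactly linear in x. The helper `correlation` is a hypothetical name.

```python
import math
from statistics import mean, stdev

def correlation(xs, ys):
    """Pearson r: sum of products of z-scores, divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

xs = list(range(1, 9))
ys = [math.exp(0.5 * x) for x in xs]  # made-up exponential growth
log_ys = [math.log(y) for y in ys]    # transformed response

r_raw = correlation(xs, ys)      # curved relationship: r understates it
r_log = correlation(xs, log_ys)  # linear after the transformation

print(round(r_raw, 3), round(r_log, 3))
```

The association between x and y is perfect in both cases, but only the transformed data follow a straight line, so only there does r reach 1 (up to rounding).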

Influential points

Correlations are calculated using means and standard deviations, and thus are NOT resistant to outliers - try the Statistical Applet under Resources in the eBook on the Stats Portal…

Just moving one point away from the general trend here weakens the correlation from -0.91 to -0.75.

In this example, adding two outliers decreases r from 0.95 to 0.61.

Go to the Stats Portal, under Resources, try Statistical Applets, and choose the Correlation and Regression one… put some points in the scatterplot, watch the value of r and see what happens when you put in an outlier or two…
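The same experiment can be tried in code rather than in the applet. This is only a sketch with made-up points and a hypothetical helper `correlation`; it is not the example shown in the slides, so the values of r will differ from those quoted above.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Pearson r: sum of products of z-scores, divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

# Made-up points that follow a clear upward linear trend.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [2.5, 3.5, 7, 7, 10.25, 11.75, 14.75, 15.25, 18.5, 19.5]

# The same points plus two made-up outliers far from the trend.
xs_out = xs + [2, 9]
ys_out = ys + [25, 3]

r_clean = correlation(xs, ys)
r_outliers = correlation(xs_out, ys_out)

print(round(r_clean, 3), round(r_outliers, 3))
```

Because the mean and the standard deviation are themselves pulled by extreme observations, two points are enough to weaken r substantially here.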

• Homework:

• Read section 2.2, pay careful attention to the properties of the correlation coefficient, r

• To explore how extreme outlying observations influence r, play around with the Statistical Applet on Correlation and Regression under Resources in the eBook on the Stats Portal…

– Then, using the computer to draw the scatterplots and do the computations as needed, do problems #2.42 - 2.44, 2.47, 2.53, 2.55, 2.56, 2.60