june 3, 2008stat 111 - lecture 5 - correlation1 exploring data numerical summaries of relationships...

28
June 3, 2008 Stat 111 - Lecture 5 - Correlation 1 Exploring Data Numerical Summaries of Relationships between Statistics 111 - Lecture 5

Upload: noah-bridges

Post on 28-Dec-2015

214 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: June 3, 2008Stat 111 - Lecture 5 - Correlation1 Exploring Data Numerical Summaries of Relationships between Statistics 111 - Lecture 5

June 3, 2008 Stat 111 - Lecture 5 - Correlation 1

Exploring Data

Numerical Summaries of Relationships between

Statistics 111 - Lecture 5

Page 2: June 3, 2008Stat 111 - Lecture 5 - Correlation1 Exploring Data Numerical Summaries of Relationships between Statistics 111 - Lecture 5

June 3, 2008 Stat 111 - Lecture 5 - Correlation 2

Administrative Notes

• Homework 2 due Monday, June 8th

• Start the homework tonight– It’s long– There are things that will confuse you

• You will need JMP for the homework– Download it (link on my website)– Use a Wharton computer (link to the

Wharton account information on website)

Page 3: June 3, 2008Stat 111 - Lecture 5 - Correlation1 Exploring Data Numerical Summaries of Relationships between Statistics 111 - Lecture 5

June 2, 2008 Stat 111 - Lecture 5 - Correlation 3

Outline

• Correlation• Linear Regression• Linear Prediction

Page 4: June 3, 2008Stat 111 - Lecture 5 - Correlation1 Exploring Data Numerical Summaries of Relationships between Statistics 111 - Lecture 5

June 2, 2008 Stat 111 - Lecture 5 - Introduction 4

Course Overview

Collecting Data

Exploring DataProbability Intro.

Inference

Comparing Variables Relationships between Variables

Means Proportions Regression Contingency Tables

Page 5: June 3, 2008Stat 111 - Lecture 5 - Correlation1 Exploring Data Numerical Summaries of Relationships between Statistics 111 - Lecture 5

June 2, 2008 Stat 111 - Lecture 5 - Correlation 5

Examples from Last Class

• 60 US Cities: Education vs. mortality

• Vietnam draft lottery: draft order vs. dateNew Orleans

Baltimore

San Jose

York

Lancaster

Denver

Page 6: June 3, 2008Stat 111 - Lecture 5 - Correlation1 Exploring Data Numerical Summaries of Relationships between Statistics 111 - Lecture 5

June 2, 2008 Stat 111 - Lecture 5 - Correlation 6

How two variables vary together

• We have already looked at single-variable measures of spread:

• We now have pairs of two variables:

• We need a statistic that summarizes how they are spread out “together”

variance s2 1

n 1(x i x )2

(x1,y1),(x2,y2),,(xn, yn )

Page 7: June 3, 2008Stat 111 - Lecture 5 - Correlation1 Exploring Data Numerical Summaries of Relationships between Statistics 111 - Lecture 5

June 2, 2008 Stat 111 - Lecture 5 - Correlation 7

Correlation

• Correlation of two variables:

• We divide by standard deviation of both X and Y, so correlation has no units

r 1

n 1

(x i x )(y i y )

sx sy

Page 8: June 3, 2008Stat 111 - Lecture 5 - Correlation1 Exploring Data Numerical Summaries of Relationships between Statistics 111 - Lecture 5

June 2, 2008 Stat 111 - Lecture 5 - Correlation 8

Correlation

• Correlation is a measure of the strength of linear relationship between variables X and Y

• Correlation has a range between -1 and 1 • r = 1 means the relationship between X and Y is

exactly positive linear • r = -1 means the relationship between X and Y is

exactly negative linear • r = 0 means that there is no linear relationship

between X and Y• r near 0 means there is a weak linear relationship

between X and Y

Page 9: June 3, 2008Stat 111 - Lecture 5 - Correlation1 Exploring Data Numerical Summaries of Relationships between Statistics 111 - Lecture 5

June 2, 2008 Stat 111 - Lecture 5 - Correlation 9

Measure of Strength• Correlation roughly measures how “tight” the

linear relationship is between two variables

Page 10: June 3, 2008Stat 111 - Lecture 5 - Correlation1 Exploring Data Numerical Summaries of Relationships between Statistics 111 - Lecture 5

June 2, 2008 Stat 111 - Lecture 5 - Correlation 10

Examples

Education and Mortalityr = -0.51

Draft Order and Birthdayr = -0.22

Page 11: June 3, 2008Stat 111 - Lecture 5 - Correlation1 Exploring Data Numerical Summaries of Relationships between Statistics 111 - Lecture 5

June 2, 2008 Stat 111 - Lecture 5 - Correlation 11

Cautions about Correlation

• Correlation is only a good statistic to use if the relationship is roughly linear

• Correlation can not be used to measure non-linear relationships

• Always plot your data to make sure that the relationship is roughly linear!

Page 12: June 3, 2008Stat 111 - Lecture 5 - Correlation1 Exploring Data Numerical Summaries of Relationships between Statistics 111 - Lecture 5

June 2, 2008 Stat 111 - Lecture 5 - Correlation 12

Correlation with Non-linear Data• A high correlation does not always mean a strong

linear relationship • A low correlation does not always mean no relationship

(just no linear relationship)

Page 13: June 3, 2008Stat 111 - Lecture 5 - Correlation1 Exploring Data Numerical Summaries of Relationships between Statistics 111 - Lecture 5

June 2, 2008 Stat 111 - Lecture 5 - Correlation 13

Linear Regression

• If our X and Y variables do show a linear relationship, we can calculate a best fit line in addition to the correlation

Y = a + b·X

• The values a and b together are called the regression coefficients• a is called the intercept• b is called the slope

Page 14: June 3, 2008Stat 111 - Lecture 5 - Correlation1 Exploring Data Numerical Summaries of Relationships between Statistics 111 - Lecture 5

June 2, 2008 Stat 111 - Lecture 5 - Correlation 14

Example: US Cities

Mortality = a + b·Education• How to determine our “best” line ?

ie. best regression coefficients a and b ?

Page 15: June 3, 2008Stat 111 - Lecture 5 - Correlation1 Exploring Data Numerical Summaries of Relationships between Statistics 111 - Lecture 5

June 2, 2008 Stat 111 - Lecture 5 - Correlation 15

Finding the “Best” Line

• Want line that has smallest total “Y-residuals”

• Must square “Y-residuals” and add them up:

Y-residuals

total residuals = (y i - (a bx i))2

Page 16: June 3, 2008Stat 111 - Lecture 5 - Correlation1 Exploring Data Numerical Summaries of Relationships between Statistics 111 - Lecture 5

June 2, 2008 Stat 111 - Lecture 5 - Correlation 16

Best values for slope and intercept

• Line with smallest total residuals is called the “least-squares” line• need calculus to find a and b that give

“least-squares” line

b rsy

sx

a y bx

Page 17: June 3, 2008Stat 111 - Lecture 5 - Correlation1 Exploring Data Numerical Summaries of Relationships between Statistics 111 - Lecture 5

June 2, 2008 Stat 111 - Lecture 5 - Correlation 17

Example: Education and Mortality

Mortality = 1353.16 - 37.62 · Education

• Negative association means negative slope b

Page 18: June 3, 2008Stat 111 - Lecture 5 - Correlation1 Exploring Data Numerical Summaries of Relationships between Statistics 111 - Lecture 5

June 2, 2008 Stat 111 - Lecture 5 - Correlation 18

JMP Output

Mortality = 1353.16 - 37.62 · Education

• Note that JMP gives correlation in terms of r2

• Need to take square root and give correct sign

r = -0.51

Page 19: June 3, 2008Stat 111 - Lecture 5 - Correlation1 Exploring Data Numerical Summaries of Relationships between Statistics 111 - Lecture 5

June 2, 2008 Stat 111 - Lecture 5 - Correlation 19

Interpreting the regression line

• The slope coefficient b is the average change you get in the Y variable if you increased the X variable by one• Eg. increasing education by one year means an

average decrease of ≈ 38 deaths per 100,000

• The intercept coefficient a is the average value of the Y variable when the X variable is equal to zero• Eg. an uneducated city (Education = 0) would have

an average mortality of 1353 deaths per 100,000

Page 20: June 3, 2008Stat 111 - Lecture 5 - Correlation1 Exploring Data Numerical Summaries of Relationships between Statistics 111 - Lecture 5

June 2, 2008 Stat 111 - Lecture 5 - Correlation 20

Example: Vietnam Draft Order

Draft Order = 224.9 - 0.226 · Birthday

• Slightly negative slope means later birthdays have a lower draft order

Page 21: June 3, 2008Stat 111 - Lecture 5 - Correlation1 Exploring Data Numerical Summaries of Relationships between Statistics 111 - Lecture 5

June 2, 2008 Stat 111 - Lecture 5 - Correlation 21

Example: Crack Cocaine Sentences

• Relationship seems to be non-linear

Page 22: June 3, 2008Stat 111 - Lecture 5 - Correlation1 Exploring Data Numerical Summaries of Relationships between Statistics 111 - Lecture 5

June 2, 2008 Stat 111 - Lecture 5 - Correlation 22

Sentence = 90.4 + 0.269 · Quantity• Extra gram of crack means extra 0.27 months

(about 8 days) sentence on average • Intercept is not always interpretable: having no

crack gets you an 90 month sentence!

Page 23: June 3, 2008Stat 111 - Lecture 5 - Correlation1 Exploring Data Numerical Summaries of Relationships between Statistics 111 - Lecture 5

June 2, 2008 Stat 111 - Lecture 5 - Correlation 23

Prediction

• The regression equation can be used to predict our Y variable for a specific value of our X variable

• Eg: Sentence = 90.4 + 0.269 · Quantity• Person caught with 175 grams has a predicted

sentence of 90.4 + 0.269 · 175 =137.5 months

• Eg: Mortality = 1353.16 - 37.62 · Education• City with only 6 median years of education has a

predicted mortality of 1353.16 - 37.62 · 6 = 1127 deaths per 100,000

Page 24: June 3, 2008Stat 111 - Lecture 5 - Correlation1 Exploring Data Numerical Summaries of Relationships between Statistics 111 - Lecture 5

June 2, 2008 Stat 111 - Lecture 5 - Correlation 24

Problems with Prediction• Quality of predictions depends on truth of

assumption of a linear relationship• Eg: crack sentences for large quantities

Page 25: June 3, 2008Stat 111 - Lecture 5 - Correlation1 Exploring Data Numerical Summaries of Relationships between Statistics 111 - Lecture 5

June 2, 2008 Stat 111 - Lecture 5 - Correlation 25

Problems with Prediction

• We shouldn’t extrapolate predictions beyond the range of data for our X variable

• Eg: Sentence = 90.4 + 0.269 · Quantity• Having no crack still has a predicted sentence of 90

months!

• Eg: Mortality = 1353.16 - 37.62 · Education• A city with median education of 35 years is predicted to

have no deaths at all!

Page 26: June 3, 2008Stat 111 - Lecture 5 - Correlation1 Exploring Data Numerical Summaries of Relationships between Statistics 111 - Lecture 5

Break!

• Take 5

June 3, 2008 Stat 111 - Lecture 5 - Correlation 26

Page 27: June 3, 2008Stat 111 - Lecture 5 - Correlation1 Exploring Data Numerical Summaries of Relationships between Statistics 111 - Lecture 5

How to Use JMP

• Scatterplots

• Correlation

• Regression – Hand calculating (using Distribution)– Have JMP calculate (using Fit Y by X)

June 3, 2008 Stat 111 - Lecture 5 - Correlation 27

Page 28: June 3, 2008Stat 111 - Lecture 5 - Correlation1 Exploring Data Numerical Summaries of Relationships between Statistics 111 - Lecture 5

June 3, 2008 Stat 111 - Lecture 5 - Correlation 28

Next Class

• Probability– This is the hardest part of the course.

• Moore, McCabe and Craig: Section 4.1-4.4