06. Linear Regression - Department of Statistics and … · 2017-05-31

STT 200

Ashoke Kumar Sinha

This lecture is based on Chapters 7, 8 and 9.

Acknowledgement: Author is indebted to Dr. Jennifer Kaplan and Dr. Parthanil Roy for allowing him to use/edit many of their slides.

Upload: truongliem

Post on 09-May-2018



What is the goal of Chapter 7?

Exploring relationships (or association) between two quantitative variables

a) by drawing a picture (known as scatterplot),

and

b) using a quantitative summary (known as correlation coefficient or simply correlation).

2

Example: Height and Weight

• How is the weight of an individual related to his/her height?

• Typically, one can expect a taller person to be heavier.

• Is it supported by the data?

• If yes, how to determine this “association”?

3

What is a scatterplot?

• A scatterplot is a diagram which is used to display values of two quantitative variables from a data-set.

• The data is displayed as a collection of points, each having the value of one variable determining the position on the horizontal axis and the value of the other variable determining the position on the vertical axis.

4

Example 1: Scatterplot of height and weight

5

Example 2: Scatterplot of hours watching TV and test scores

6

Looking at Scatterplots

We look at the following features of a scatterplot:-

• Direction (positive or negative)

• Form (linear, curved)

• Strength (of the relationship)

• Unusual Features.

When we describe histograms we mention:

• Shape

• Center

• Spread

• Outliers

7

Asking Questions on a Scatterplot

• Are test scores higher or lower when the TV watching is longer? Direction (positive or negative association).

• Does the cloud of points seem to show a linear pattern, a curved pattern, or no pattern at all? Form.

• If there is a pattern, how strong does the relationship look? Strength.

• Are there any unusual features? (2 or more groups or outliers).

8

Positive and Negative Associations

• Positive association means that, for most of the data points, a higher value of one variable corresponds to a higher value of the other variable and a lower value of one variable corresponds to a lower value of the other variable.

• Negative association means that, for most of the data points, a higher value of one variable corresponds to a lower value of the other variable and vice versa.

9

This association is:

A. positive

B. negative.

10

This association is:

A. positive

B. negative.

11

Curved Scatterplot

• When the plot shows a clear curved pattern, we shall call it a curved scatterplot.

12

Linear Scatterplot

• Unless we see a curve, we shall call the scatterplot linear.

• We shall soon learn how to quantify the strength of the linear form of a scatterplot.

13

Which one has stronger linear association?

A. left one,

B. right one.

Because in the right graph the points are closer to a straight line.

14

Which one has stronger linear association?

A. left one,

B. right one.

Hard to say; we need a measure of linear association.

15

Unusual Feature: Presence of Outlier

• This scatterplot clearly has an outlier.

16

Unusual Feature: Two Subgroups

• This scatterplot clearly has two subgroups.

17

Explanatory and Response Variables

• The main variable of interest (the one which we would like to predict) is called the response variable.

• The other variable is called the explanatory variable or the predictor variable.

• Typically we plot the explanatory variable along the horizontal axis (x-axis) and the response variable along the vertical axis (y-axis).

18

Example: Scatterplot of height and weight

In this case, we are trying to predict the weight based on the height of a person. Therefore

• weight is the response variable, and

• height is the explanatory variable.

19

How to measure linear association?

• Use correlation coefficient or simply correlation.

• Correlation is a value to describe the strength of the linear association between two quantitative variables.

• Suppose x and y are two variables. Let $z_x$ and $z_y$ denote the z-scores of x and y respectively. Then correlation is defined as:

$$r = \frac{1}{n-1}\sum z_x z_y .$$

• We shall use TI 83/84 Plus to compute correlation coefficient (to be discussed later).

20

21

Correlation is unit-free

Because correlation is calculated using standardized scores, it

• is free of units (i.e. does not have any unit);

• does not change if the data are rescaled.

In particular, this means that correlation does not depend on the unit of the two quantitative variables.

For example, if you are computing the correlation between the heights and weights of a bunch of individuals, it does not matter if the heights are measured in inches or cms and if the weights are measured in lbs or kgs.
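The unit-free property can be checked numerically. The sketch below computes r from z-scores, as in the definition of correlation, and compares the result in imperial and metric units. The heights and weights are made-up illustrative values, not data from the lecture, and the helper name `correlation` is an assumption.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """r = (1/(n-1)) * sum(z_x * z_y), using sample standard deviations."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    return sum(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical heights (inches) and weights (lbs), made up for illustration.
heights_in = [62, 65, 68, 70, 73]
weights_lb = [120, 140, 155, 165, 190]

# Same individuals, rescaled to centimetres and kilograms.
heights_cm = [h * 2.54 for h in heights_in]
weights_kg = [w * 0.45359237 for w in weights_lb]

r_imperial = correlation(heights_in, weights_lb)
r_metric = correlation(heights_cm, weights_kg)
print(round(r_imperial, 6), round(r_metric, 6))  # the two values agree
```

Rescaling multiplies both the deviations and the standard deviation by the same factor, so every z-score, and hence r, is unchanged.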

22

Properties of Correlation

• Correlation is unit-free.

• Correlation does not change if the data are rescaled.

• It is a number between -1 and 1.

• The sign of the correlation indicates the direction of the linear association (if the association is positive then so is the correlation and if the association is negative then so is the correlation).

• The closer the correlation is to -1 or 1, the stronger is the linear association.

• Correlations near 0 indicate weak linear association.

23

Words of Warning about Correlation

• Correlation measures linear association between two quantitative variables.

• Correlation measures only the strength of the linear association.

• If correlation between two variables is 0, it only means that they are not linearly associated. They may still be nonlinearly associated.

• To measure the strength of linear association, only the absolute value of the correlation matters.

A correlation of -0.8 indicates a stronger linear association than a correlation of 0.7.

The negative and positive signs of correlation only indicate the direction of association.

• Presence of outlier(s) may severely influence correlation.

• A high correlation value does not by itself imply causation.
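Two of these warnings are easy to see on small artificial data sets. The sketch below uses made-up numbers (not from the lecture): a perfect curved relationship whose correlation is 0, and a perfectly linear data set whose correlation changes sign after a single outlier is added.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation as the average product of z-scores (divisor n-1)."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    return sum(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)) / (n - 1)

# 1) Strong nonlinear association, yet no linear association: y = x^2.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x * x for x in xs]
r_curve = correlation(xs, ys)

# 2) A single outlier can distort the correlation dramatically.
base_x = [1, 2, 3, 4, 5, 6]
base_y = [2, 4, 6, 8, 10, 12]          # exactly on a line
r_clean = correlation(base_x, base_y)
r_outlier = correlation(base_x + [20], base_y + [0])  # one unusual point added

print(f"curved data:     r = {r_curve:.3f}")
print(f"without outlier: r = {r_clean:.3f}")
print(f"with outlier:    r = {r_outlier:.3f}")
```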

24

Check before calculation of correlation

• Are the variables quantitative?

• Is the form of the scatter plot straight enough (so that a linear relationship makes sense)?

• Have we removed the outliers? Otherwise, the value of the correlation can get distorted dramatically.

25

Regression line and prediction

26

Explanatory and Response Variables

The above scatter plot indicates a linear relationship between height and weight. Suppose an individual is 68 in tall. How can we predict his weight?

• The main variable of interest (the one which we would like to predict) is called the response variable (denoted by y).

• The other variable is called the explanatory variable or the predictor variable (denoted by x).

Here height is the predictor (or explanatory variable) and weight is the response variable.

27

What is Linear Regression?

• When the scatter plot looks roughly linear, we may model the relationship between the variables with a “best-fitted” line (known as the regression line): y = b0 + b1x.

• b1 (the coefficient of x) is called the slope of the regression line.

• b0 is called the intercept of the regression line.

• We estimate the slope (b1) and the intercept (b0).

• Next, given the value of x, we plug that value into the regression line equation to predict y.

This procedure is called linear regression.

28

Conditions for Linear Regression

• Quantitative Variables Condition: both variables have to be quantitative.

• Straight Enough Condition: the scatter plot must appear to have moderate linear association.

• Outlier Condition: there should not be any outliers.

29

Example of Linear Regression

Suppose

• x = amount of protein (in gm) in a burger (explanatory variable),

• y = amount of fat (in gm) in the burger (response variable).

Goal: Express the relationship of x and y using a line (the regression line): y = b0 + b1x.

Questions:

1. How to find b1 (slope) and b0 (intercept)?

2. How will it help in prediction?

30

Formulae of b0 and b1

• Although there are many lines that can describe the relationship, there is a way to find “the line that fits best”.

• For the best fitted line:

• Slope: b1 = (correlation)×(std.dev. of y)/(std.dev. of x), i.e.

$$b_1 = r\,\frac{s_y}{s_x}.$$

• Intercept: b0 = (mean of y) - b1×(mean of x), i.e.

$$b_0 = \bar{y} - b_1\bar{x}.$$

31

Computation of b0 and b1

• If we are given the summary statistics, i.e. the means and standard deviations of x and y and their correlation, then we plug those values into the formulae to find b0 and b1.

• If we are given the actual data (not the summary), then we need to compute all those summary values.

• However, given the data, TI 83/84 Plus can find the equation of the regression line.

But be careful, because TI 83/84 writes the regression equation as y = ax + b.

So a = slope (= b1), and b = intercept (= b0).

32

Example 1

If in a linear regression problem, the correlation between the variables is 0.9 and the standard deviations of the explanatory (x) and the response (y) variables are 0.75 and 0.25 respectively, and the means of the explanatory and the response variables are 10 and 5 respectively, calculate the regression line.

• Estimate of slope: b1 = 0.9 × (0.25/0.75) = 0.3.

• Estimate of intercept: b0 = 5 - 0.3 × 10 = 2.

• So the estimated regression line is: y = 0.3x + 2.
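The arithmetic of Example 1 can be reproduced with a few lines of code. This is only a sketch: the function name `regression_from_summary` and its parameter names are hypothetical, chosen for illustration.

```python
def regression_from_summary(r, s_x, s_y, mean_x, mean_y):
    """Return (b0, b1) for the regression line y = b0 + b1*x.

    b1 = r * s_y / s_x          (slope)
    b0 = mean_y - b1 * mean_x   (intercept)
    """
    b1 = r * s_y / s_x
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Summary statistics from Example 1.
b0, b1 = regression_from_summary(r=0.9, s_x=0.75, s_y=0.25, mean_x=10, mean_y=5)
print(f"y = {b1:.1f}x + {b0:.1f}")
```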

33

Example 2

Fat (g) Sodium (mg) Calories

19 920 410

31 1500 580

34 1310 590

35 860 570

39 1180 640

39 940 680

43 1260 660

Fat (in gm), Sodium (in mg) and Calorie content in 7 burgers are given above.

Using TI 83/84 Plus for regression

• Press [STAT].

• Choose 1: Edit.

• Type the Fat data under L1, Sodium under L2 and Calories under L3.

Suppose (L1) Fat is the predictor and (L2) Sodium is the response.

• Press [STAT] again and select CALC using right-arrow.

• Select 4: LinReg(ax+b) (LinReg(ax+b) appears on screen).

• Type [2nd] and [1] (that will put L1 on screen).

• Type [,] and then [2nd] and [2] (that will put ,L2 on screen).

• Press [ENTER].

• This will produce a (slope), b (intercept), r² and r (correlation coefficient).
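For readers without a calculator at hand, the same quantities can be computed from the Fat and Sodium columns of the table in Example 2. The sketch below is not the calculator's algorithm; it simply applies the slope and intercept formulae from this lecture, and the function name `lin_reg` is an assumption. It follows the TI convention y = ax + b.

```python
from statistics import mean, stdev

# Data from the Example 2 table (7 burgers).
fat = [19, 31, 34, 35, 39, 39, 43]                  # predictor, list L1
sodium = [920, 1500, 1310, 860, 1180, 940, 1260]    # response, list L2

def lin_reg(xs, ys):
    """Return (a, b, r) for y = a*x + b: a is the slope, b the intercept."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    # Correlation: sum of products of z-scores divided by n-1.
    r = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / ((n - 1) * sx * sy)
    a = r * sy / sx       # slope b1
    b = my - a * mx       # intercept b0
    return a, b, r

a, b, r = lin_reg(fat, sodium)
print(f"a = {a:.2f}, b = {b:.2f}, r = {r:.3f}, r^2 = {r * r:.4f}")
```

As on the calculator, the predictor list goes first and the response list second.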

34

Tips: Using TI 83/84 Plus

• Caution: After LinReg(ax+b) you must first put the predictor (explanatory) variable, and then the response variable.

• Note that the values of r and r² will not show up if the diagnostic is not switched on in the TI 83/84 Plus calculator.

• To get the diagnostic switched on:

1. Press [2nd] and [0] (that will choose CATALOG).

2. Select DiagnosticOn using the arrow keys.

3. Press [ENTER] and [ENTER] again.

4. This will switch the diagnostic on.

35

Tips: Using TI 83/84 Plus

• To delete one particular list variable (say L2):

1. Press [STAT].

2. Choose 1: Edit.

3. Select the variable L2 using the arrow keys.

4. Press [CLEAR] followed by [ENTER].

• To delete all stored data:

a) Press [2nd] and [+] (that will choose MEM).

b) Select 4: ClrAllLists.

c) Press [ENTER] and [ENTER] again.

d) This will clear all the stored data in the lists.

36

37

Fat vs. Sodium

[Figure: scatter plot “Fat vs Sodium”, with Fat (g) on the horizontal axis and Sodium (mg) on the vertical axis.]

Using TI 83/84 Plus we get:

• r = 0.199

• r² = 0.0396

• a = 6.08 (= b1)

• b = 930.02 (= b0)

• Correlation is approximately 0.2, which indicates that the (positive) linear association between fat and sodium is very weak.

• Scatter plot supports the small value of r.

• Regression line: y = 6.08x + 930.02.

38

Fat vs. Calories

[Figure: scatter plot “Fat vs Calories”, with Fat (g) on the horizontal axis and Calories on the vertical axis.]

Using TI 83/84 Plus we get:

• r = 0.961

• r² = 0.923

• a = 11.06 (= b1)

• b = 210.95 (= b0)

• Correlation 0.96 indicates that there is a very strong positive linear relation between fat and calories.

• Scatter plot supports the high positive value of r.

• Regression line: y = 11.06x + 210.95.

39

Example 3

Country        Percent with Cell Phone    Life Expectancy (years)
Turkey         85.7%                      71.96
France         92.5%                      80.98
Uzbekistan     46.1%                      71.96
China          47.4%                      73.47
Malawi         11.9%                      50.03
Brazil         75.8%                      71.99
Israel         123.1%                     80.73
Switzerland    115.5%                     80.85
Bolivia        49.4%                      66.89
Georgia        59.7%                      76.72
Cyprus         93.8%                      77.49
Spain          122.6%                     80.05
Indonesia      58.5%                      70.76
Botswana       74.6%                      61.85
U.S.           87.9%                      78.11

40

Example 3: Scatter plot with regression line

[Figure: scatter plot “% Cell Phone vs Life Expectancy” with the fitted regression line; % Cell Phone on the horizontal axis, Life Expectancy on the vertical axis; some points are labelled “Possible outliers”.]

Regression line: y = 0.21x + 56.91

R = 0.7848

R² = 0.6159

41

Example 3: Scatter plot with regression line

[Figure: scatter plot “% Cell Phones vs Life Expectancy (without outliers)” with the fitted regression line; % Cell Phones on the horizontal axis, Life Expectancy on the vertical axis.]

Regression line: y = 0.13x + 64.7

R = 0.802

R² = 0.6437

42

Predicted values and residuals

• Let the regression line be y = b0 + b1x.

• Suppose (x0, y0) is an observed data point.

• Then the predicted value of y given x = x0 is

$$\hat{y} = b_0 + b_1 x_0 .$$

• Residual (denoted by e) measures how much the predicted value deviates from the observed value of y.

• So if the observed value of y is y0 then

residual = (observed value - predicted value), i.e.

$$e = y_0 - \hat{y}.$$

• Residuals are the errors due to prediction using the regression line.

43

Example 1 revisited

Suppose that in a linear regression problem the correlation between the variables is 0.9, the standard deviations of x and y are 0.75 and 0.25 respectively, and the means of x and y are 10 and 5 respectively. Calculate the residual when x = 20 and y = 7.

• Estimate of slope: b1 = 0.9 × (0.25/0.75) = 0.3.

• Estimate of intercept: b0 = 5 - 0.3 × 10 = 2.

• So the estimated regression line is: y = 0.3x + 2.

• The predicted value of y when x = 20 is ŷ = 0.3 × 20 + 2 = 8.

• The corresponding residual is e = (observed y) - (predicted y) = 7 - 8 = -1.
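The same calculation, written as a short sketch. The helper names `predict` and `residual` are hypothetical and used only for illustration.

```python
def predict(b0, b1, x):
    """Predicted value y-hat = b0 + b1*x from the regression line."""
    return b0 + b1 * x

def residual(b0, b1, x0, y0):
    """Residual e = observed value - predicted value."""
    return y0 - predict(b0, b1, x0)

# Regression line from Example 1: r = 0.9, s_x = 0.75, s_y = 0.25, means 10 and 5.
b1 = 0.9 * (0.25 / 0.75)
b0 = 5 - b1 * 10

y_hat = predict(b0, b1, 20)      # predicted y at x = 20
e = residual(b0, b1, 20, 7)      # observed y is 7
print(f"predicted = {y_hat:.1f}, residual = {e:.1f}")
```

A negative residual means the line over-predicted: the observed value lies below the regression line.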

44

Evaluating regression

The fit of a regression line can be evaluated with

• R² (the coefficient of determination),

• s_e (standard deviation of residuals).

45

R² (the coefficient of determination)

R² is the fraction of total sample variation in y explained by the regression model.

Some properties of R²:

• R² = (correlation)² = r².

• 0 ≤ R² ≤ 1.

• R² close to 0 implies weak linear relationship (and also not a good fit of the regression line).

• R² close to 1 implies strong linear relationship (and also a very good fit of the regression line).
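The identity R² = r² can be checked numerically. The sketch below uses the Fat and Calories columns from the Example 2 table and computes R² in the usual sum-of-squares form, 1 - (variation of residuals)/(total variation of y), then compares it with the squared correlation. The function and variable names are assumptions made for illustration.

```python
from statistics import mean

# Fat and Calories columns from the Example 2 table.
fat = [19, 31, 34, 35, 39, 39, 43]
calories = [410, 580, 590, 570, 640, 680, 660]

def least_squares(xs, ys):
    """Return (b0, b1); sxy/sxx is algebraically the same as r * s_y / s_x."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b1 = sxy / sxx
    return my - b1 * mx, b1

b0, b1 = least_squares(fat, calories)
predicted = [b0 + b1 * x for x in fat]

mx, my = mean(fat), mean(calories)
sst = sum((y - my) ** 2 for y in calories)                      # total variation in y
sse = sum((y - p) ** 2 for y, p in zip(calories, predicted))    # unexplained variation
r_squared = 1 - sse / sst                                       # fraction explained

# Correlation computed directly, for comparison.
sxy = sum((x - mx) * (y - my) for x, y in zip(fat, calories))
sxx = sum((x - mx) ** 2 for x in fat)
r = sxy / (sxx * sst) ** 0.5

print(f"R^2 = {r_squared:.3f}, r^2 = {r * r:.3f}")
```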

46

R² (the coefficient of determination)

• For instance, if R² = 0.54, then 54% of the total sample variation in y is explained by the regression model. It indicates a moderate fit of the regression line. On the scatter plot the points will not be very close to the regression line.

• If R² = 0.96, then 96% of the total sample variation in y is explained by the regression model. It indicates a very good fit of the regression line. On the scatter plot the points will be very close to the regression line.

• On the other hand, if R² = 0.19, then only 19% of the total sample variation in y is explained by the regression model, which indicates a very bad fit of the regression line. The scatter plot will show either a curved pattern, or the points will be clustered showing no pattern.

47

s_e (standard deviation of residuals)

s_e is the standard deviation of the residuals.

In case there is no ambiguity, we often just write s instead of s_e.

• The smaller the s_e, the better the model fit.

• The larger the s_e, the worse the model fit.

Remember that residuals are the errors due to prediction using the regression line.

So a larger value of s_e implies that there is more spread in the residuals and, as a result, more error in the prediction. Hence the observations are not close to the regression line.

On the other hand, a smaller value of s_e indicates that the observations are closer to the regression line, implying a better fit.
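The slides do not state a formula for the standard deviation of the residuals. The sketch below assumes the common textbook convention of dividing the sum of squared residuals by n - 2 for a line with two estimated coefficients. It uses the Fat and Calories columns from the Example 2 table, and the function names are hypothetical.

```python
from math import sqrt
from statistics import mean

# Fat and Calories columns from the Example 2 table.
fat = [19, 31, 34, 35, 39, 39, 43]
calories = [410, 580, 590, 570, 640, 680, 660]

def least_squares(xs, ys):
    """Return (b0, b1) of the best-fitted line y = b0 + b1*x."""
    mx, my = mean(xs), mean(ys)
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b1 * mx, b1

def residual_std_dev(xs, ys):
    """s_e = sqrt(sum(e^2) / (n - 2)); the n - 2 divisor is an assumed convention."""
    b0, b1 = least_squares(xs, ys)
    residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    return sqrt(sum(e * e for e in residuals) / (len(xs) - 2))

s_e = residual_std_dev(fat, calories)
print(f"s_e = {s_e:.2f} calories")
```

s_e is expressed in the units of the response variable, so it is best compared between models for the same response.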

48

Summary: Direction of linear association

• The sign of correlation (r) indicates the direction of the linear association between x and y.

• Notice that the slope (b1) of the regression line has the same sign as that of correlation (r). Hence the sign of the slope also indicates the direction of the linear association between x and y.

• Positive sign of correlation (and slope) implies positive linear association.

• Negative sign of correlation (and slope) implies negative linear association.

49

Summary: Strength of linear association

• The value of correlation (r) [ignoring the sign] gives us the strength of the linear association.

• The value of R² also gives us the strength of the linear association.

• Values close to 1 imply strong linear association.

• Values close to 0 imply weak or no linear association.

• You cannot determine the strength of linear association from the value of the slope (b1) of the regression line.

50

Correlation (r) and R²

• If you know the value of correlation, you can compute R² = (correlation)².

• But knowing the value of R² alone is not sufficient to compute correlation, because we cannot get the information of sign of correlation.

• However, if we know the value of R² and also the sign of the slope (b1), we can compute correlation as follows:

r = (sign of b1) × √(R²).
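A minimal sketch of this rule; the function name `correlation_from_r2` is hypothetical. `math.copysign` attaches the sign of the slope to the square root of R².

```python
import math

def correlation_from_r2(r_squared, slope):
    """Recover r from R^2 and the sign of the slope b1."""
    if not 0 <= r_squared <= 1:
        raise ValueError("R^2 must lie between 0 and 1")
    return math.copysign(math.sqrt(r_squared), slope)

r_negative = correlation_from_r2(0.64, -2.5)   # negative slope, so r is negative
r_positive = correlation_from_r2(0.64, 2.5)    # positive slope, so r is positive
print(r_negative, r_positive)
```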

51

Lurking variable

• High correlation between x and y may not always mean causation.

• Sometimes there is a lurking variable which is highly correlated to both x and y and as a result we obtain a high correlation between x and y.
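A small simulation can illustrate the idea. Everything in the sketch below is artificial: z plays the role of a lurking variable, and x and y are each built as z plus independent noise, so neither causes the other. The sample size, noise level and seed are arbitrary assumptions.

```python
import random
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation as the sum of products of z-scores divided by n-1."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    return sum(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)) / (n - 1)

random.seed(0)  # fixed seed so the run is repeatable
n = 500
z = [random.gauss(0, 1) for _ in range(n)]        # lurking variable
x = [zi + random.gauss(0, 0.5) for zi in z]       # depends on z only
y = [zi + random.gauss(0, 0.5) for zi in z]       # depends on z only

r_xy = correlation(x, y)

# Remove the lurking variable's contribution from both x and y.
x_rest = [xi - zi for xi, zi in zip(x, z)]
y_rest = [yi - zi for yi, zi in zip(y, z)]
r_rest = correlation(x_rest, y_rest)

print(f"r(x, y) = {r_xy:.2f}")
print(f"r after removing z = {r_rest:.2f}")
```

x and y are strongly correlated only because both follow z; once z is removed, what remains is essentially uncorrelated.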

52

Example: Lurking variable

• A study at the University of Pennsylvania Medical Center (published in the May 13, 1999 issue of Nature) concluded:-

young children who sleep with the light on are much more likely to develop eye problems in later life.

• However, a later study at The Ohio State University did not find that infants sleeping with the light on caused the development of eye problems.

• The second study (done at OSU) did find a strong link between parental myopia and the development of child myopia, also noting that myopic parents were more likely to leave a light on in their children's bedroom. So parental myopia is a lurking variable here.

Reference:

http://en.wikipedia.org/wiki/Correlation_does_not_imply_causation

53

Choose the best description of the scatter plot

A. Moderate, negative, linear association

B. Strong, curved, association

C. Moderate, positive, linear association

D. Strong, negative, non-linear association

E. Weak, positive, linear association

54

Match the following values of correlation coefficients to the data shown in these scatter plots.

A. r = -0.67

B. r = -0.10

C. r = 0.71

D. r = 0.96

E. r = 1.00

Fig. 1

Fig. 3

Fig. 2