with thanks to my students in ams572 – data analysis simple linear regression 1

With Thanks to My Students in AMS572 – Data Analysis

Simple Linear Regression

1

1. Introduction

Response (out come or dependent) variable (Y): height of the wifePredictor (explanatory or independent) variable (X): height of the husband

Example:

George Bush :1.81mLaura Bush: ?

David Beckham: 1.83mVictoria Beckham: 1.68m

Brad Pitt: 1.83mAngelina Jolie: 1.70m

● To predict height of the wife in a couple, based on the husband’s height

2

Regression analysis:

● The earliest form of linear regression was the method of least squares, which was published by Legendre in 1805, and by Gauss in 1809.

● The method was extended by Francis Galton in the 19th century to describe a biological phenomenon.

● This work was extended by Karl Pearson and Udny Yule to a more general statistical context around 20th century.

●　 regression analysis is a statistical methodology to estimate the relationship of a response variable to a set of predictor variable.

History:

● 　 when there is just one predictor variable, we will use simple linear regression. When there are two or more predictor variables, we use multiple linear regression.

● when it is not clear which variable represents a response and which is a predictor, correlation analysis is used to study the strength of the relationship

3

A probabilistic model We denote the n observed values of the

predictor variable x as

nxxx ...,,, 21

We denote the corresponding observed values of the response variable Y as

nyyy ...,,, 21

4

Notations of the simple linear Regression

- Observed value of the random variable Yi depends on xi iy0 1 ( 1, 2, ..., )i i iY x i n

i 2)(

0)(

i

i

Var

E

0 1( )i i iE Y x

- random error with

unknown mean of Yi

Unknown SlopeTrue Regression Line

Unknown Intercept5

iY

2

Linear function of the predictor variable

Have a common variance,Same for all values of x.

i Normally distributed

4 BASIC ASSUMPTIONS – for statistical inference

Independent

7

Comments:1. Linear not in x

But in the parameters and

2. Predictor variable is not set as predetermined fixed values, is random along with Y. The model can be considered as a conditional model

xYE log)( 10

10

Example:

xxXYE 10)|(

linear, logx = x*

Example: Height and Weight of the children.Height (X) – given Weight (Y) – predict

Conditional expectation of Y given X = x

8

2. Fitting the Simple Linear Regression

Model

2.1 Least Squares (LS) Fit

9

Example 10.1 (Tires Tread Wear vs. Mileage: Scatter Plot. From: Statistics and Data Analysis; Tamhane and Dunlop; Prentice Hall. )

10

0 1

0 1( ) ( 1,2,..... )i i

y x

y x i n

2

0 11

[ ( )]n

i ii

Q y x

11

The “best” fitting straight line in the sense of minimizing Q: LS estimate

One way to find the LS estimate and

Setting these partial derivatives equal to zero and simplifying, we get

1

0 11 1

20 1

1 1 1

n n

i ii i

n n n

i i i ii i i

n x y

x x x y

0 110

0 111

2 [ ( )]

2 [ ( )]

n

i ii

n

i i ii

Qy x

Qx y x

0

12

Solve the equations and we get

2

1 1 1 10

2 2

1 1

1 1 11

2 2

1 1

( )( ) ( )( )

( )

( )( )

( )

n n n n

i i i i ii i i i

n n

i ii i

n n n

i i i ii i i

n n

i ii i

x y x x y

n x x

n x y x y

n x x

13

To simplify, we introduce

The resulting equation is known as the least squares line, which is an estimate of the true regression line.

1 1 1 1

2 2 2

1 1 1

2 2 2

1 1 1

1( )( ) ( )( )

1( ) ( )

1( ) ( )

n n n n

xy i i i i i ii i i i

n n n

xx i i ii i i

n n n

yy i i ii i i

S x x y y x y x yn

S x x x xn

S y y y yn

0 1 1xy

xx

Sy x

S

0 1y x

14

Example 10.2 (Tire Tread vs. Mileage: LS Line Fit)Find the equation of the line for the tire tread wear data from

Table10.1,we have

and n=9.From these we calculate

2 2144, 2197.32, 3264, 589,887.08, 28,167.72i i i i i ix y x y x y

16, 244.15,x y

1 1 1

1 1( )( ) 28,167.72 (144 2197.32) 6989.40

9

n n n

xy i i i i

i i i

S x y x yn

2 2 2

1 1

1 1( ) 3264 (144) 960

9

n n

xx i i

i i

S x xn

15

The slope and intercept estimates are

Therefore, the equation of the LS line is

Conclusion: there is a loss of 7.281 mils in the tire groove depth for every 1000 miles of driving.

Given a particular

We can find

Which means the mean groove depth for all tires driven for 25,000miles is estimated to be 178.62 miles.

1 0

6989.40ˆ ˆ7.281 244.15 7.281*16 360.64960

and

360.64 7.281*25 178.62y mils 25x

360.64 7.281 .y x

16

2.2 Goodness of Fit of the LS Line

Coefficient of Determination and Correlation

The residuals:

are used to evaluate the goodness of fit of the LS line.

( 1,2,..... )i n 0 1̂ˆi iy x

0 1ˆ ˆ( )i i ie y x

( 1,2,..... )i n

17

We define:

Note: total sum of squares (SST)

Regression sum of squares (SSR)

Error sum of squares (SSE)

is called the coefficient of determination

2 2 2

1 1 1 1

0

ˆ ˆ ˆ ˆ( ) ( ) ( ) 2 ( )( )n n n n

i i i i i i ii i i i

SSR SSE

SST y y y y y y y y y y

SST SSR SSE

The ratio

2 1SSR SSE

RSST SST

18

2R

Example 10.3 (Tire Tread Wear vs. Mileage: Coefficient of Determination and Correlation

For the tire tread wear data, calculate using the result s from example 10.2 We have

Next calculate Therefore

The Pearson correlation is

where the sign of r follows from the sign of since 95.3% of the variation in tread wear is accounted for by linear regression on mileage, the relationship between the two is strongly linear with a negative slope.

2 2 2

1 1

1 1( ) 589,887.08 (2197.32) 53,418.73

9

n n

yy i ii i

SST S y yn

53,418.73 2531.53 50,887.20SSR SST SSE

2 50,887.200.953

53,418.73R

1̂ 7.281

2R

19

0.953 0.976r

The Maximum Likelihood Estimators (MLE)

Consider the linear model:

where is drawn from a normal population with mean 0 and standard deviation σ, the likelihood function for Y is:

Thus, the log-likelihood for the data is:

i i iy a bx i

2

2

22 2

)(exp

)2(

1

ii

n

bxayL

.2

)()ln(

2)2ln(

2log

2

22

ii bxaynn

L

The MLE Estimators

Solving

We obtain the MLEs of the three unknown model parameters

The MLEs of the model parameters a and b are the same as the LSEs – both unbiased

The MLE of the error variance, however, is biased:

2, ,a b

2

log log log0, 0, 0

L L L

a b

2

2 1ˆ

n

ii

eSSE

n n

2.3 An Unbiased Estimator of An unbiased estimate of is given by

Example 10.4(Tire Tread Wear Vs. Mileage: Estimate of

Find the estimate of for the tread wear data using the results from Example 10.3 We have SSE=2351.3 and n-2=7,therefore

Which has 7 d.f. The estimate of is miles.

22

2 1

2 2

n

ii

eSSE

sn n

2

2 2351.53361.65

7S

361.65 19.02s

22

Under the normal error assumption

* Point estimators:

* Sampling distributions of and :

For mathematical derivations, please refer to the Tamhane and Dunlop text book, P331.

1

00 1,

2

0( ) i

xx

xSE s

nS

1( )xx

sSE

S

xx

i

nS

xN

22

00 ,~ˆ

xxS

N2

11 ,~ˆ

3. Statistical Inference on and

23

* Pivotal Quantities (P.Q.’s):

* Confidence Intervals (CI’s):

0 0 1 12, 2,

2 2

,n n

t SE t SE

2

0

00~

)ˆ(

ˆ

nt

SE

2

1

11~

)ˆ(

ˆ

nt

SE

24

Statistical Inference on and , Con’t

* Hypothesis tests:

-- Test statistics:

-- At the significance level , we reject in

favor of if and only if (iff)

-- The first test is used to show whether there is a linear relationship between x and y

aH

0H

00 1 1:H

01 1:aH

01 1

0

1( )t

SE

0 2, / 2nt t

0 1: 0H

1: 0aH

10

1( )t

SE

Statistical Inference on and , Con’t

25

Analysis of Variance (ANOVA), Con’t

Mean Square:

-- a sum of squares divided by its d.f.

SSR SSEMSR= ,MSE=

1 2n

2 22 01 1 2

1, 202 21

ˆ ˆ ˆ1~

ˆ/ ( )

xxn

xx

HMSR SSR St F

MSE s s s S SE

26

Analysis of Variance (ANOVA)

ANOVA Table

Example:

Source of Variation

(Source)

Sum of Squares

(SS)

Degrees of Freedom

(d.f.)

Mean Square

(MS)

F

Regression

Error

SSR

SSE

1

n - 2

Total SST n - 1

SSRMSR=

1SSE

MSE=2n

MSRF=

MSE

Source SS d.f. MS F

Regression

Error

50,887.20

2531.53

1

7

50,887.20

361.25

140.71

Total 53,418.73

8 27

4. Regression Diagnostics

4.1 Checking for Model Assumptions

Checking for Linearity Checking for Constant Variance Checking for Normality Checking for Independence

28

Checking for Linearity Xi =Mileage Y=β0 + β1 x Yi =Groove Depth ^ ^ ^

^ Y=β0 + β1 x Yi =fitted value ^

ei =residual Residual = ei = Yi- Yi

i Xi Yi^

Yi ei

1 0 394.33 360.64 33.69

2 4 329.50 331.51 -2.01

3 8 291.00 302.39 -11.39

4 12 255.17 273.27 -18.10

5 16 229.33 244.15 -14.82

6 20 204.83 215.02 -10.19

7 24 179.00 185.90 -6.90

8 28 163.83 156.78 7.05

9 32 150.33 127.66 22.67

Xi

ei

35302520151050

40

30

20

10

0

-10

-20

Scatterplot of ei vs Xi

29

Checking for Normality

C1

Perc

ent

50403020100-10-20-30-40

99

95

90

80

70

605040

30

20

10

5

1

Mean 3.947460E-16StDev 17.79N 9AD 0.514P-Value 0.138

Normal Probability Plot of residualsNormal　

30

Checking for Constant Variance

Var(Y) is not constant. A sample residual plots when

Var(Y) is constant.

Plot of Residuals vs Fitted Value

-30

-20

-10

0

10

20

30

40

0 100 200 300 400

Fitted Value

Res

idua

ls

31

Checking for Independence

Does not apply for Simple Linear Regression Model

Only apply for time series data

32

4.2 Checking for Outliers & Influential Observations

What is OUTLIER Why checking for outliers is important Mathematical definition How to deal with them

33

4.2-A. IntroRecall Box and Whiskers Plot (Chapter 4 of T&D) Where (mild) OUTLIER is defined as any observations that lies outside of

Q1-(1.5*IQR) and Q3+(1.5*IQR) (Interquartile range, IQR = Q3 − Q1) (Extreme) OUTLIER as that lies outside of Q1-(3*IQR) and Q3+(3*IQR) Observation "far away" from the rest of the data

34

4.2-B. Why are outliers a problem?

May indicate a sample peculiarity or a data entry error or other problem ;

Regression coefficients estimated that minimize the Sum of Squares for Error (SSE) are very sensitive to outliers >>Bias or distortion of estimates;

Any statistical test based on sample means and variances can be distorted In the presence of outliers >>Distortion of p-values;

Faulty conclusions.Example: ( Estimators not sensitive to outliers are said to be robust )

Sorted Data Median Mean Variance 95% CI for mean

Real Data

1 3 5 9 12 5 6.0 20.6 [0.45, 11.55]

Data with Error

1 3 5 9 120 5 27.6 2676.8 [-36.630,91.83]

35

4.2-C. Mathematical Definition

OutlierThe standardized residual is given by

If |ei*|>2, then the corresponding observation may be regarded an outlier.

Example: (Tire Tread Wear vs. Mileage)

• STUDENTIZED RESIDUAL: a type of standardized residual calculated with the current

observation deleted from the analysis.• The LS fit can be excessively influenced by observation that is not necessarily an outlier as

defined above.

i 1 2 3 4 5 6 7 8 9

ei* 2.25 -0.12 -0.66 -1.02 -0.83 -0.57 -0.40 0.43 1.51

36

4.2-C. Mathematical Definition Influential ObservationObservation with extreme x-value, y-value, or both.

• On average hii is (k+1)/n, regard any hii>2(k+1)/n as high leverage;

• If xi deviates greatly from mean x, then hii is large;

• Standardized residual will be large for a high leverage observation;

• Influence can be thought of as the product of leverage and outlierness.

Example: (Observation is influential/ high leverage, but not an outlier)

eg.1 with without eg.2 scatter plot residual plot

0 5 10 15

37

4.2-C. SAS code of the tire exampleSAS codeData tire;Input x y;Datalines;0 394.334 329.50…32 150.33;Run;

proc reg data=tire; model y=x; output out=resid rstudent=r h=lev cookd=cd dffits=dffit;Run;

proc print data=resid; where abs(r)>=2 or lev>(4/9) or cd>(4/9) or abs(dffit)>(2*sqrt(1/9)); run;

38

4.2-C. SAS output of the tire exampleSAS output

39

4.2-D. How to deal with Outliers & Influential Observations

Investigate (Data errors? Rare events? Can be corrected?)

Ways to accommodate outliers Non Parametric Methods (robust to outliers) Data Transformations Deletion (or report model results both with and

without the outliers or influential observations to see how much they change)

40

4.3 Data Transformations

Reason

To achieve linearity To achieve homogeneity of variance To achieve normality or symmetry about the

regression equation

41

Types of Transformation

Linearzing Transformation

transformation of a response variable, or predicted variable, or both, which produces an approximate linear relationship between variables.

Variance Stabilizing Transformation

make transformation if the constant variance assumption is violated

42

Linearizing Transformation

Use mathematical operation, e.g. square root, power, log, exponential, etc.

Only one variable needs to be transformed in the simple linear regression.

Which one? Predictor or Response? Why?

43

Xi Yi^

log Yi

^ Y =

^exp (logYi) Ei

0 394.33 5.926 374.64 19.69

4 329.50 5.807 332.58 -3.08

8 291.00 5.688 295.24 -4.24

12 255.17 5.569 262.09 -6.92

16 229.33 5.450 232.67 -3.34

20 204.83 5.331 206.54 -1.71

24 179.00 5.211 183.36 -4.36

28 163.83 5.092 162.77 1.06

32 150.33 4.973 144.50 5.83

e.g. We take a log transformation on Y = exp (-x) <=> log Y = log - x

xi

Resi

dual

35302520151050

40

30

20

10

0

-10

-20

ei　(original)ei　with　transformation

Variable

Plot of Residual vs xi & xi from the exponential fit

Data

Perc

ent

50403020100-10-20-30-40

99

95

90

80

70

60

50

40

30

20

10

5

1

3.947460E-16 17.79 9 0.514 0.1380.3256 8.142 9 0.912 0.011

Mean StDev N AD P

eiei　with　transformation

Variable

Normal Probability Plot of ei and ei with transformation

44

Variance Stabilizing Transformation

Delta method : Two terms Taylor-series approximations

Var( h(Y)) ≈ [h2 g2 (whereVar(Y) = g2Y) =

. set [h’(]2 g2 (

2. h’( =

3. h = h(y) =

e.g. Var(Y) = c22 , where c > 0, g = c↔ g(y) = cy

h(y) =

Therefore it is the logarithmic transformation

)(

g

d

)(yg

dy

)(1g

cydy y

dyc 1 )log(1 yc

45

5. Correlation Analysis

Pearson Product Moment Correlation: a measurement of how closely two variables share a linear relationship.

Useful when it is not possible to determine which variable is the predictor and which is the response.

Health vs wealth. Which is predictor? Which is response?

Y)Var(X)Var(

Y) Cov(X, Y) corr(X,

46

Statistical Inference on the Correlation Coefficient ρ We can derive a test on the correlation

coefficient in the same way that we have been doing in class. Assumptions

X, Y are from the bivariate normal distribution

Start with point estimator r: sample correlation coefficient: estimator

of the population correlation coefficient ρ

Get the pivotal quantity The distribution of r is quite complicated T0: test statistic for ρ = 0

Do we know everything about the p.q.? Yes: T ~ tn-2 under H0 : ρ=0

1

2 2

1 1

( )( )

( ) ( )

n

i ii

n n

i ii i

X X Y Yr

X X Y Y

0 2

2

1

r nT

r

47

Bivariate Normal Distribution

pdf:

Properties μ1, μ2 means

for X, Y σ1

2, σ22 variances

for X, Y ρ the correlation coeff

between X, Y48

Derivation of T0

Therefore, we can use t as a statistic for testing against the null hypothesis

H0: β1=0

Equivalently, we can test against

H0: ρ=0

?1

0 21

1 1 1

22

1 10 1 2

1

are these equivalent?

ˆ2ˆ( )1

substitute:

ˆ ˆ ˆ

( 2)1

then:

ˆ ˆ( 2)ˆT ˆ( 2) / ( )

yes, they are equivalent.

x xx xx

y yy

xx

xx

r nT

SEr

s S Sr

s S SST

SSE n sr

SST SST

S n SST

SST n s s S SE

49

Exact Statistical Inference on ρ

Test H0 : ρ=0 , Ha : ρ≠0 Test statistic:

Reject H0 iff

0 2

2

1

r nT

r

534.3

7.01

2157.020

t

Example A researcher wants to determine if two test

instruments give similar results. The two test instruments are administered to a sample of 15 students. The correlation coefficient between the two sets of scores is found to be 0.7. Is this correlation statistically significant at the .01 level?

H0 : ρ=0 , Ha : ρ≠0

for α = .01, 3.534 = t0 > t13, .005 = 3.012

▲ Reject H0

50

0 2, / 2nt t

Approximate Statistical Inference on ρ

There is no exact method of testing ρ vs an arbitrary ρ0 Distribution of R is very

complicated T0 ~ tn-2 only when ρ = 0

To test ρ vs an arbitrary ρ0 one can use Fisher’s transformation

Therefore, let

3

1,

1

1ln

2

1

1

1ln

2

1tanh 1

nN

R

RR

^ ^0

00

11 1 1 1ln , under H , ~ ln ,

2 1 2 1 3

rN

r n

51

Approximate Statistical Inference on ρ Test :

Sample estimate:

Z test statistic:

CI for ρ:

r

r

1

1ln

2

1^

0

^

0 3 nz

0 0 1 0

00 0 1 0

0

: vs. :

11: ln vs. :

2 1

H H

H H

^ ^

/ 2 / 2

2 2

2 2

1 1

3 3

1 1

1 1

l u

l u

z zn n

e e

e e

We reject H0 if |z0| > zα/2

52

Approximate Statistical Inference on ρusing SAS

Code:

Output:

53

Pitfalls of Regression and Correlation Analysis

Correlation and causation Ticks cause good health

Coincidental data Sun spots and republicans

Lurking variables Church, suicide, population

Restricted range Local, global linearity

54

Summary

Linear regression analysis

The Least squares (LS) estimates: and

Correlation Coefficient r

Probabilistic model for Linear regression:

Confidence Interval & Prediction intervalOutliers?

Influential Observations?

Data Transformations?

CorrelationAnalysis

Model Assumptions

)( 1010 2/,2 oror SEtn

55

2

2

1

r nt

r

1 1ln

2 1

2 1SSR SSE

rSST SST

20 1

1

[ ( )]n

i ii

Q y x

Least Squares (LS) Fit

Sample correlation coefficient r

Statistical inference on ß0 & ß1

Prediction Interval

Model Assumptions

Correlation Analysis

Linearity Constant Variance Normality Independence

2*

* *2, / 2

11n

xx

x xY Y t s

n S

xx

i

nS

xN

22

00 ,~ˆ

xxS

N2

11 ,~ˆ

56

Questions?

57

with thanks to my students in ams572 – data analysis simple linear regression 1

Documents

predictor variable x

regression analysis

height x

response variable y

independent variable

set of predictor variable

simple linear regression

multiple linear regression