regression analysis (2)1 pearson product moment coefficient of correlation: the variances and...


Upload: randolph-patrick

Post on 26-Dec-2015


Page 1: Regression Analysis (2)1 Pearson Product Moment Coefficient of Correlation: The variances and covariances are given by: In general, when a sample of n

Regression Analysis (2) 1

Correlation Analysis

In general, when a sample of n individuals or experimental units is selected and two variables are measured on each individual or unit so that both variables are random, the correlation coefficient r is the appropriate measure of linearity for use in this situation.

Pearson Product Moment Coefficient of Correlation:

$$ r = \frac{s_{xy}}{s_x s_y} = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} $$

The variances and covariances are given by:

$$ s_{xy} = \frac{S_{xy}}{n-1}, \qquad s_x^2 = \frac{S_{xx}}{n-1}, \qquad s_y^2 = \frac{S_{yy}}{n-1} $$


Example
The heights and weights of n = 10 offensive backfield football players are randomly selected from a county’s football all-stars. Calculate the correlation coefficient for the heights (in inches) and weights (in pounds) given in the table below.

Table: Heights and weights of n = 10 backfield all-stars

Player   Height x   Weight y
1        73         185
2        71         175
3        75         200
4        72         210
5        72         190
6        75         195
7        67         150
8        69         170
9        71         180
10       69         175


Solution
You should use the appropriate data entry method of your scientific calculator to verify the calculations for the sums of squares and cross-products:

$$ S_{xy} = 328, \qquad S_{xx} = 60.4, \qquad S_{yy} = 2610 $$

using the calculational formulas given earlier in this chapter. Then

$$ r = \frac{328}{\sqrt{(60.4)(2610)}} = .8261 $$

or r = .83. This value of r is fairly close to 1, the largest possible value of r, which indicates a fairly strong positive linear relationship between height and weight.
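The sums of squares and cross-products in this solution can also be checked in a few lines of code instead of on a calculator. The sketch below is only an illustration: the function names `sums_of_squares` and `pearson_r` are made up for this example, and the sums are computed as deviations from the sample means, which is assumed to be equivalent to the calculational formulas the slides refer to.

```python
import math

# Heights (inches) and weights (pounds) of the 10 players in the table.
heights = [73, 71, 75, 72, 72, 75, 67, 69, 71, 69]
weights = [185, 175, 200, 210, 190, 195, 150, 170, 180, 175]


def sums_of_squares(x, y):
    """Return (Sxx, Syy, Sxy) as sums of deviations from the sample means."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return s_xx, s_yy, s_xy


def pearson_r(x, y):
    """Pearson correlation r = Sxy / sqrt(Sxx * Syy)."""
    s_xx, s_yy, s_xy = sums_of_squares(x, y)
    return s_xy / math.sqrt(s_xx * s_yy)


s_xx, s_yy, s_xy = sums_of_squares(heights, weights)
r = pearson_r(heights, weights)
print(f"Sxx = {s_xx:.1f}, Syy = {s_yy:.1f}, Sxy = {s_xy:.1f}, r = {r:.4f}")
```

The printed values should agree with the sums of squares and the value of r quoted in the solution.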


There is a direct relationship between the calculation formulas for the correlation coefficient r and the slope of the regression line b.

Since the numerator of both quantities is Sxy, both r and b have the same sign.

Therefore, the correlation coefficient has these general properties:
- When r = 0, the slope is 0, and there is no linear relationship between x and y.
- When r is positive, so is b, and there is a positive relationship between x and y.
- When r is negative, so is b, and there is a negative relationship between x and y.


The relationship between r (correlation coefficient) and the regression model

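$$ b = \frac{s_{xy}}{s_x^2} = r\,\frac{s_y}{s_x} $$

Therefore

$$ \hat{Y} = \bar{Y} - r\,\frac{s_y}{s_x}\,\bar{X} + r\,\frac{s_y}{s_x}\,X \quad\Longrightarrow\quad \frac{\hat{Y} - \bar{Y}}{s_y} = r\,\frac{X - \bar{X}}{s_x} $$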
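The link between r and the fitted line can be checked numerically on the height and weight data. This is a minimal sketch under the assumption that the slope is b = s_xy / s_x^2 and that s_x and s_y are the sample standard deviations (divisor n - 1); the helper `predict` and the input height of 75 inches are chosen purely for illustration.

```python
import statistics

heights = [73, 71, 75, 72, 72, 75, 67, 69, 71, 69]
weights = [185, 175, 200, 210, 190, 195, 150, 170, 180, 175]

n = len(heights)
x_bar = statistics.mean(heights)
y_bar = statistics.mean(weights)
s_x = statistics.stdev(heights)  # sample standard deviation, divisor n - 1
s_y = statistics.stdev(weights)

# Sample covariance and correlation coefficient.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(heights, weights)) / (n - 1)
r = s_xy / (s_x * s_y)

# Two routes to the slope: directly, and through r.
b_direct = s_xy / s_x ** 2
b_from_r = r * s_y / s_x


def predict(x):
    """Fitted value written as Y-bar + r * (s_y / s_x) * (x - X-bar)."""
    return y_bar + b_from_r * (x - x_bar)


# In standardized units the fitted line has slope r.
z_pred = (predict(75) - y_bar) / s_y
z_x = (75 - x_bar) / s_x
print(b_direct, b_from_r, z_pred, r * z_x)
```

Both slope values should coincide, and the standardized prediction should equal r times the standardized x value.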


Figure Some typical scatter plots


The population correlation coefficient ρ is calculated and interpreted as it is in the sample.

The experimenter can test the hypothesis that there is no correlation between the variables x and y using a test statistic that is exactly equivalent to the test of the slope in the previous section.


Test of Hypothesis Concerning the Correlation Coefficient

1. Null hypothesis: H0: ρ = 0

2. Alternative hypothesis:
   One-Tailed Test: Ha: ρ > 0 (or Ha: ρ < 0)
   Two-Tailed Test: Ha: ρ ≠ 0

3. Test statistic:

$$ t = r\sqrt{\frac{n-2}{1-r^2}} $$

When the assumptions are satisfied, the test statistic will have a Student’s t distribution with (n - 2) degrees of freedom.


When comparing to a non-zero constant ρ0:

1. Null hypothesis: H0: ρ = ρ0

2. Alternative hypothesis:
   One-Tailed Test: Ha: ρ > ρ0 (or Ha: ρ < ρ0)
   Two-Tailed Test: Ha: ρ ≠ ρ0

3. Test statistic:

$$ t = \frac{(r-\rho_0)\sqrt{n-2}}{\sqrt{(1-r^2)(1-\rho_0^2)}} $$

When the assumptions are satisfied, the test statistic will have a Student’s t distribution with (n - 2) degrees of freedom.


4. Rejection region: Reject H0 when
   One-Tailed Test: $t > t_{\alpha,\,n-2}$ (or $t < -t_{\alpha,\,n-2}$ when the alternative hypothesis is Ha: ρ < 0 or Ha: ρ < ρ0)
   Two-Tailed Test: $t > t_{\alpha/2,\,n-2}$ or $t < -t_{\alpha/2,\,n-2}$
   or when p-value < α


Example
Refer to the height and weight data in the previous example. The correlation of height and weight was calculated to be r = .8261. Is this correlation significantly different from 0?

Solution
To test the hypotheses

$$ H_0: \rho = 0 \quad \text{versus} \quad H_a: \rho \neq 0 $$

the value of the test statistic is

$$ t = r\sqrt{\frac{n-2}{1-r^2}} = .8261\sqrt{\frac{10-2}{1-(.8261)^2}} = 4.15 $$

which for n = 10 has a t distribution with 8 degrees of freedom. Since this value is greater than $t_{.005} = 3.355$, the two-tailed p-value is less than 2(.005) = .01, and the correlation is declared significant at the 1% level (P < .01). The value $r^2 = (.8261)^2 = .6824$ means that about 68% of the variation in one of the variables is explained by the other. The Minitab printout in Figure 12.17 displays the correlation r and the exact p-value for testing its significance.
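The arithmetic of this test is easy to reproduce. The sketch below assumes the test statistic t = r * sqrt((n - 2) / (1 - r^2)) for H0: ρ = 0; the function name `correlation_t` is hypothetical, and the critical value 3.355 is simply copied from the example rather than computed.

```python
import math


def correlation_t(r, n):
    """t statistic for testing H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))


r = 0.8261  # sample correlation from the height and weight example
n = 10      # number of players

t = correlation_t(r, n)

# Upper .005 critical value of t with 8 d.f., as quoted in the example.
T_CRIT_005_DF8 = 3.355
reject_at_1_percent = abs(t) > T_CRIT_005_DF8

print(f"t = {t:.2f}, reject H0 at the 1% level: {reject_at_1_percent}")
print(f"r^2 = {r ** 2:.4f}")
```

A statistics package would normally report the exact p-value as well; here the comparison with the tabulated critical value mirrors the reasoning in the solution.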


r measures only linear correlation: x and y could be perfectly related by some curvilinear function even when the observed value of r is equal to 0.
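A small made-up data set illustrates the point. In the sketch below, y is determined exactly by x through y = x^2, yet the correlation coefficient is 0 because the x values are symmetric about 0. The data and the helper `pearson_r` are assumptions introduced for this illustration only.

```python
import math


def pearson_r(x, y):
    """Pearson correlation r = Sxy / sqrt(Sxx * Syy)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical data: y is an exact quadratic function of x.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]

r = pearson_r(x, y)
print(f"r = {r:.4f}")
```

A scatter plot of these points would show a perfect parabola, which is why r should always be read together with a plot of the data.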


Testing for Goodness of Fit

In general, we do not know the underlying distribution of the population, and we wish to test the hypothesis that a particular distribution will be satisfactory as a population model.

Probability plotting can only be used to examine whether a population is normally distributed.

Histogram plotting and similar methods can only be used to guess the possible underlying distribution type.


Goodness-of-Fit Test (I)

A random sample of size n is taken from a population whose probability distribution is unknown.

These n observations are arranged in a frequency histogram having k bins or class intervals.

Let $O_i$ be the observed frequency in the ith class interval, and $E_i$ be the expected frequency in the ith class interval from the hypothesized probability distribution. The test statistic is

$$ \chi_0^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} $$


Goodness-of-Fit Test (II)

If the population follows the hypothesized distribution, $\chi_0^2$ has approximately a chi-square distribution with k - p - 1 d.f., where p represents the number of parameters of the hypothesized distribution estimated by sample statistics.

That is,

$$ \chi_0^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} \sim \chi^2_{k-p-1} $$

Reject the hypothesis if

$$ \chi_0^2 > \chi^2_{\alpha,\,k-p-1} $$


Goodness-of-Fit Test (III)

Class intervals are not required to be of equal width.

The minimum expected frequency cannot be too small; 3, 4, and 5 are commonly used minimum values.

When the expected frequency of a class interval is too small, we can combine that class interval with a neighboring class interval. In this case, k is reduced by one.


Example 8-18
The number of defects in printed circuit boards is hypothesized to follow a Poisson distribution. A random sample of 60 printed boards has been collected, and the number of defects observed is shown in the table below:

Number of defects   Observed frequency
0                   32
1                   15
2                   9
3                   4

The only parameter in the Poisson distribution is λ, which can be estimated by the sample mean: {0(32) + 1(15) + 2(9) + 3(4)}/60 = 0.75. Therefore, the expected frequency of the first class is:

$$ p_1 = P(X = 0) = \frac{e^{-0.75}(0.75)^0}{0!} = 0.472, \qquad E_1 = 0.472 \times 60 = 28.32 $$
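The same calculation can be carried out for every class. The sketch below estimates λ from the observed counts and computes the Poisson expected frequencies; treating the last class as "3 or more defects" (so that the probabilities sum to 1) is an assumption made here, and the name `poisson_pmf` is hypothetical. Because the code does not round the probabilities to three decimals, its expected frequencies differ slightly from the hand-calculated ones.

```python
import math

# Observed number of boards with 0, 1, 2, and 3 defects.
observed = {0: 32, 1: 15, 2: 9, 3: 4}

n = sum(observed.values())
lam = sum(k * count for k, count in observed.items()) / n  # sample mean


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson random variable with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


# Probabilities for 0, 1, 2 defects; the last class takes the remaining tail.
probs = [poisson_pmf(k, lam) for k in range(3)]
probs.append(1 - sum(probs))  # assumed to represent "3 or more defects"

expected = [n * p for p in probs]

for k, (p, e) in enumerate(zip(probs, expected)):
    print(f"class {k}: p = {p:.3f}, E = {e:.2f}")
```

The last expected frequency falls below 3, which is the situation in which neighboring classes are combined.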


Example 8-18 (Cont.)
Since the expected frequency in the last cell is less than 3, we combine the last two cells:

Number of defects   Observed frequency   Expected frequency
0                   32                   28.32
1                   15                   21.24
2 (or more)         13                   10.44


Example 8-18 (Cont.)
1. The variable of interest is the form of the distribution of defects in printed circuit boards.
2. H0: The form of the distribution of defects is Poisson.
   H1: The form of the distribution of defects is not Poisson.
3. k = 3, p = 1, k - p - 1 = 1 d.f.
4. At α = 0.05, we reject H0 if $\chi_0^2 > \chi^2_{0.05,1} = 3.84$.
5. The test statistic is:

$$ \chi_0^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} = \frac{(32 - 28.32)^2}{28.32} + \frac{(15 - 21.24)^2}{21.24} + \frac{(13 - 10.44)^2}{10.44} = 2.94 $$

6. Since $\chi_0^2 = 2.94 < \chi^2_{0.05,1} = 3.84$, we are unable to reject the null hypothesis that the distribution of defects in printed circuit boards is Poisson.
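The six steps above can be sketched in code. The observed and expected frequencies are the ones that appear in the test statistic (after combining the last two cells), and the critical value 3.84 is taken from the example rather than computed; the function name `chi_square_statistic` is an assumption for this illustration.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O_i - E_i)^2 / E_i over all class intervals."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Frequencies after combining the last two cells, as used in the example.
observed = [32, 15, 13]
expected = [28.32, 21.24, 10.44]

k = len(observed)  # number of class intervals
p = 1              # one estimated parameter (the Poisson mean)
df = k - p - 1

chi2_0 = chi_square_statistic(observed, expected)

# Upper 5% point of chi-square with 1 d.f., as quoted in the example.
CHI2_CRIT_05_DF1 = 3.84
reject_h0 = chi2_0 > CHI2_CRIT_05_DF1

print(f"chi2_0 = {chi2_0:.2f}, d.f. = {df}, reject H0: {reject_h0}")
```

The function works for any number of classes, so the same sketch could be reused once cells with small expected frequencies have been merged.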


Contingency Table Tests

Example 8-20
A company has to choose among three pension plans. Management wishes to know whether the preference for plans is independent of job classification and wants to use α = 0.05. The opinions of a random sample of 500 employees are shown in Table 8-4.


Contingency Table Test - The Problem Formulation (I)
- There are two classifications; one has r levels and the other has c levels (3 pension plans and 2 types of workers).
- We want to know whether the two methods of classification are statistically independent (whether the preference for pension plans is independent of job classification).

The table:


Contingency Table Test - The Problem Formulation (II)

Let $p_{ij}$ be the probability that a randomly selected element falls in the ijth cell, given that the two classifications are independent. Then $p_{ij} = u_i v_j$, where the estimators for $u_i$ and $v_j$ are

$$ \hat{u}_i = \frac{1}{n}\sum_{j=1}^{c} O_{ij}, \qquad \hat{v}_j = \frac{1}{n}\sum_{i=1}^{r} O_{ij} $$

Therefore, the expected frequency of each cell is

$$ E_{ij} = n\,\hat{u}_i\,\hat{v}_j = \frac{1}{n}\sum_{j=1}^{c} O_{ij} \sum_{i=1}^{r} O_{ij} $$

Then, for large n, the statistic

$$ \chi_0^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} $$

has an approximate chi-square distribution with (r - 1)(c - 1) d.f.
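The formulation above translates directly into a short routine. The sketch below computes the expected frequencies as (row total)(column total)/n and then the chi-square statistic and its degrees of freedom. The counts are hypothetical and are NOT the data of Table 8-4, which is not reproduced in this transcript; the function name `contingency_chi_square` is likewise an assumption.

```python
def contingency_chi_square(table):
    """Return (chi-square statistic, degrees of freedom, expected frequencies)."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]

    # E_ij = n * u_i * v_j = (row total i) * (column total j) / n
    expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]

    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df, expected


# Hypothetical 2 x 3 table of counts (rows: job class, columns: plan).
hypothetical_counts = [
    [20, 30, 50],
    [30, 20, 50],
]

chi2_0, df, expected = contingency_chi_square(hypothetical_counts)
print(f"chi2_0 = {chi2_0:.2f} with {df} d.f.")
print("expected frequencies:", expected)
```

The resulting statistic would then be compared with the upper α point of the chi-square distribution with (r - 1)(c - 1) degrees of freedom.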


Example 8-20


Key Concepts and Formulas

I. A Linear Probabilistic Model
1. When the data exhibit a linear relationship, the appropriate model is y = α + βx + ε.
2. The random error ε has a normal distribution with mean 0 and variance σ².

II. Method of Least Squares
1. Estimates a and b, for α and β, are chosen to minimize SSE, the sum of the squared deviations about the regression line, $\hat{y} = a + bx$.


2. The least squares estimates are $b = S_{xy}/S_{xx}$ and $a = \bar{y} - b\bar{x}$.

III. Analysis of Variance
1. Total SS = SSR + SSE, where Total SS = $S_{yy}$ and SSR = $(S_{xy})^2/S_{xx}$.
2. The best estimate of σ² is MSE = SSE/(n - 2).

IV. Testing, Estimation, and Prediction
1. A test for the significance of the linear regression, H0: β = 0, can be implemented using one of the two test statistics:

$$ t = \frac{b}{\sqrt{\text{MSE}/S_{xx}}} \quad \text{or} \quad F = \frac{\text{MSR}}{\text{MSE}} $$


2. The strength of the relationship between x and y can be measured using

$$ R^2 = \frac{\text{MSR}}{\text{Total SS}} $$

which gets closer to 1 as the relationship gets stronger. (In simple linear regression the regression has 1 degree of freedom, so MSR = SSR.)

3. Use residual plots to check for nonnormality, inequality of variances, and an incorrectly fit model.

4. Confidence intervals can be constructed to estimate the intercept and slope of the regression line and to estimate the average value of y, E(y), for a given value of x.

5. Prediction intervals can be constructed to predict a particular observation, y, for a given value of x. For a given x, prediction intervals are always wider than confidence intervals.


V. Correlation Analysis
1. Use the correlation coefficient to measure the relationship between x and y when both variables are random:

$$ r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} $$

2. The sign of r indicates the direction of the relationship; r near 0 indicates no linear relationship, and r near 1 or -1 indicates a strong linear relationship.
3. A test of the significance of the correlation coefficient is identical to the test of the slope β.
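The formulas in this summary can be tied together on the height and weight data from the earlier example. The sketch below is an illustration only: it assumes a simple linear regression of weight on height, and all variable names are made up. It shows that the t test of the slope and the t test of the correlation coefficient give the same value, that F equals t squared, and that R^2 equals r^2.

```python
import math

heights = [73, 71, 75, 72, 72, 75, 67, 69, 71, 69]
weights = [185, 175, 200, 210, 190, 195, 150, 170, 180, 175]

n = len(heights)
x_bar = sum(heights) / n
y_bar = sum(weights) / n
s_xx = sum((x - x_bar) ** 2 for x in heights)
s_yy = sum((y - y_bar) ** 2 for y in weights)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(heights, weights))

# II. Least squares estimates.
b = s_xy / s_xx
a = y_bar - b * x_bar

# III. Analysis of variance.
total_ss = s_yy
ssr = s_xy ** 2 / s_xx
sse = total_ss - ssr
msr = ssr / 1          # one regression degree of freedom
mse = sse / (n - 2)

# IV. Test statistics for H0: beta = 0, and the strength of the relationship.
t_slope = b / math.sqrt(mse / s_xx)
f_stat = msr / mse
r_squared = msr / total_ss

# V. Correlation coefficient and its test statistic.
r = s_xy / math.sqrt(s_xx * s_yy)
t_corr = r * math.sqrt((n - 2) / (1 - r ** 2))

print(f"y-hat = {a:.2f} + {b:.4f} x")
print(f"t (slope) = {t_slope:.3f}, t (correlation) = {t_corr:.3f}, F = {f_stat:.3f}")
print(f"R^2 = {r_squared:.4f}, r^2 = {r ** 2:.4f}")
```

Seeing the two t values agree is a direct check of item V.3, which states that the two tests are identical.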


Cause and Effect
- X could cause Y
- Y could cause X
- X and Y could cause each other
- X and Y could be caused by a third variable Z
- X and Y could be related by chance
- Bad (or good) luck

A careful examination of the study is needed. Try to find previous evidence or academic explanations.