Elementary Statistics Correlation and Regression

Upload: meghan-park

Post on 02-Jan-2016

Elementary Statistics

Correlation and Regression

Correlation

A relationship between two variables

Explanatory (Independent) Variable x     Response (Dependent) Variable y
Hours of Training                        Number of Accidents
Shoe Size                                Height
Cigarettes smoked per day                Lung Capacity
Score on SAT                             Grade Point Average
Height                                   IQ

What type of relationship exists between the two variables and is the correlation significant?

Scatter Plots and Types of Correlation

Negative Correlation: as x increases, y decreases

x = hours of training, y = number of accidents

[Scatter plot: Accidents (0 to 60) versus Hours of Training (0 to 20)]

Scatter Plots and Types of Correlation

Positive Correlation: as x increases, y increases

x = SAT score, y = GPA

[Scatter plot: GPA (1.50 to 4.00) versus Math SAT (300 to 800)]

Scatter Plots and Types of Correlation

No linear correlation

x = height, y = IQ

[Scatter plot: IQ (80 to 160) versus Height (60 to 80)]

Correlation Coefficient

A measure of the strength and direction of a linear relationship between two variables

The range of r is from –1 to 1.

If r is close to 1 there is a strong positive correlation.

If r is close to –1 there is a strong negative correlation.

If r is close to 0 there is no linear correlation.

Application

x = Absences, y = Final Grade

x     y
8     78
2     92
5     90
12    58
15    43
9     74
6     81

[Scatter plot: Final Grade (40 to 95) versus Absences (0 to 16)]

Computation of r

n      x     y     xy     $x^2$   $y^2$
1      8     78    624    64      6084
2      2     92    184    4       8464
3      5     90    450    25      8100
4      12    58    696    144     3364
5      15    43    645    225     1849
6      9     74    666    81      5476
7      6     81    486    36      6561
Sum    57    516   3751   579     39898
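The column sums in the table are the inputs to the correlation coefficient. The sketch below assumes the standard computational formula, $r = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - (\sum x)^2}\,\sqrt{n\sum y^2 - (\sum y)^2}}$, which is not printed in this transcript; the function name `correlation_coefficient` is only illustrative.

```python
import math

# Paired data from the absences / final grade table.
x = [8, 2, 5, 12, 15, 9, 6]
y = [78, 92, 90, 58, 43, 74, 81]


def correlation_coefficient(xs, ys):
    """Sample correlation coefficient r computed from the column sums.

    Assumes the standard computational formula (not shown in the transcript).
    """
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(xi * yi for xi, yi in zip(xs, ys))
    sum_x2 = sum(xi * xi for xi in xs)
    sum_y2 = sum(yi * yi for yi in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator


r = correlation_coefficient(x, y)
print(round(r, 3))  # -0.975
```

The result agrees with the value r = –0.975 used in the later slides.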

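Hypothesis Test for Significance

r is the correlation coefficient for the sample. The correlation coefficient for the population is $\rho$ (rho).

The sampling distribution for r is a t-distribution with n – 2 d.f.

Standardized test statistic:

$$t = \frac{r}{\sigma_r} = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}}$$

For a two-tail test for significance:

$H_0: \rho = 0$ (The correlation is not significant)

$H_a: \rho \neq 0$ (The correlation is significant)

For left-tail and right-tail tests of negative or positive significance:

$H_0: \rho \geq 0$, $H_a: \rho < 0$ (left tail) and $H_0: \rho \leq 0$, $H_a: \rho > 0$ (right tail)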

Test of Significance

You found that the correlation between the number of times absent and the final grade is r = –0.975. There were seven pairs of data. Test the significance of this correlation. Use $\alpha = 0.01$.

1. Write the null and alternative hypotheses.

$H_0: \rho = 0$ (The correlation is not significant)

$H_a: \rho \neq 0$ (The correlation is significant)

2. State the level of significance.

$\alpha = 0.01$

3. Identify the sampling distribution.

A t-distribution with 5 degrees of freedom

4. Find the critical value.

Critical values $\pm t_0 = \pm 4.032$

5. Find the rejection region.

Rejection regions: $t < -4.032$ or $t > 4.032$

6. Find the test statistic.

$$t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} = \frac{-0.975}{\sqrt{\dfrac{1 - (-0.975)^2}{7 - 2}}} = -9.811$$

7. Make your decision.

t = –9.811 falls in the rejection region. Reject the null hypothesis.

8. Interpret your decision.

There is a significant correlation between the number of times absent and final grades.
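Steps 6 and 7 can be checked with a short sketch. It assumes the test statistic $t = r / \sqrt{(1 - r^2)/(n - 2)}$ and takes the critical value 4.032 from the example rather than computing it; `t_statistic` is an illustrative name.

```python
import math


def t_statistic(r, n):
    """Standardized test statistic for a sample correlation r with n pairs."""
    return r / math.sqrt((1 - r ** 2) / (n - 2))


r, n = -0.975, 7
t = t_statistic(r, n)

# Two-tailed critical value for alpha = 0.01 with n - 2 = 5 d.f.,
# copied from the example (a t-table lookup, not computed here).
critical_value = 4.032
reject_null = abs(t) > critical_value

print(round(t, 2), reject_null)
```

The computed statistic is close to –9.81; the last digit can differ slightly from the slide depending on how intermediate values are rounded.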

The Line of Regression

Once you know there is a significant linear correlation, you can write an equation describing the relationship between the x and y variables. This equation is called the line of regression or least squares line.

The equation of a line may be written as y = mx + b where m is the slope of the line and b is the y-intercept.

The line of regression is:

$$\hat{y} = mx + b$$

The slope m is:

$$m = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}$$

The y-intercept is:

$$b = \bar{y} - m\bar{x} = \frac{\sum y}{n} - m\,\frac{\sum x}{n}$$

[Figure: scatter plot of revenue (180 to 260) versus Ad $ (1.5 to 3.0) with the regression line]

$(x_i, y_i)$ = a data point

$(x_i, \hat{y}_i)$ = a point on the line with the same x-value

$y_i - \hat{y}_i$ = a residual

Calculate m and b.

Write the equation of the line of regression with x = number of absences and y = final grade.

From the table of sums: n = 7, $\sum x = 57$, $\sum y = 516$, $\sum xy = 3751$, $\sum x^2 = 579$.

$$m = \frac{7(3751) - (57)(516)}{7(579) - (57)^2} = \frac{-3155}{804} = -3.924$$

$$b = \frac{516}{7} - (-3.924)\,\frac{57}{7} = 105.667$$

The line of regression is: $\hat{y} = -3.924x + 105.667$

The Line of Regression

[Scatter plot: Final Grade (40 to 95) versus Absences (0 to 16), with the regression line]

m = –3.924 and b = 105.667

The line of regression is: $\hat{y} = -3.924x + 105.667$

Note that the point $(\bar{x}, \bar{y})$ = (8.143, 73.714) is on the line.
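The slope and intercept can be reproduced from the raw data. This is a minimal sketch assuming the usual least squares formulas, $m = (n\sum xy - \sum x \sum y) / (n\sum x^2 - (\sum x)^2)$ and $b = \bar{y} - m\bar{x}$; `regression_line` is an illustrative name.

```python
# Paired data from the absences / final grade table.
x = [8, 2, 5, 12, 15, 9, 6]
y = [78, 92, 90, 58, 43, 74, 81]


def regression_line(xs, ys):
    """Return (m, b) of the least squares line y_hat = m * x + b."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(xi * yi for xi, yi in zip(xs, ys))
    sum_x2 = sum(xi * xi for xi in xs)
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = sum_y / n - m * sum_x / n
    return m, b


m, b = regression_line(x, y)
mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)

print(round(m, 3))                                 # -3.924
print(round(b, 2))
print(abs((m * mean_x + b) - mean_y) < 1e-9)       # True: the mean point lies on the line
```

Because the slide rounds m to –3.924 before computing b, its intercept of 105.667 can differ from the unrounded result in the third decimal place.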

Predicting y Values

The regression line can be used to predict values of y for values of x falling within the range of the data.

The regression equation for number of times absent and final grade is:

$\hat{y} = -3.924x + 105.667$

Use this equation to predict the expected grade for a student with

(a) 3 absences (b) 12 absences

(a) $\hat{y} = -3.924(3) + 105.667 = 93.895$

(b) $\hat{y} = -3.924(12) + 105.667 = 58.579$
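The prediction step, including the restriction to x values within the range of the data, might be sketched as follows. The function name `predict_grade` and the choice to raise an error outside the observed range (2 to 15 absences) are assumptions made for illustration.

```python
# Rounded slope and intercept of the regression line from the example.
M, B = -3.924, 105.667

# Smallest and largest number of absences in the sample data.
X_MIN, X_MAX = 2, 15


def predict_grade(absences):
    """Predict the final grade, refusing to extrapolate beyond the data."""
    if not X_MIN <= absences <= X_MAX:
        raise ValueError("x is outside the range of the data")
    return M * absences + B


print(round(predict_grade(3), 3))   # 93.895
print(round(predict_grade(12), 3))  # 58.579
```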

The Coefficient of Determination

The coefficient of determination, $r^2$, is the ratio of explained variation in y to the total variation in y.

The correlation coefficient of number of times absent and final grade is r = –0.975. The coefficient of determination is $r^2 = (-0.975)^2 = 0.9506$.

Interpretation: About 95% of the variation in final grades can be explained by the number of times a student is absent. The other 5% is unexplained and can be due to sampling error or other variables such as intelligence, amount of time studied, etc.
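The "explained over total variation" definition can be checked directly against the data. The sketch below is an illustration under the assumption that explained variation means $\sum(\hat{y}_i - \bar{y})^2$ and total variation means $\sum(y_i - \bar{y})^2$, with the line fitted by least squares; the variable names are not from the slides.

```python
# Paired data from the absences / final grade table.
x = [8, 2, 5, 12, 15, 9, 6]
y = [78, 92, 90, 58, 43, 74, 81]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Least squares slope and intercept (deviation form).
slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
intercept = mean_y - slope * mean_x

predicted = [slope * xi + intercept for xi in x]

explained_variation = sum((p - mean_y) ** 2 for p in predicted)
total_variation = sum((yi - mean_y) ** 2 for yi in y)

r_squared = explained_variation / total_variation
print(round(r_squared, 2))  # 0.95
```

The ratio comes out at about 0.950, matching the slide's 0.9506 up to the rounding of r.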

The Standard Error of Estimate

The standard error of estimate, $s_e$, is the standard deviation of the observed $y_i$ values about the predicted value $\hat{y}_i$.

$$s_e = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n - 2}}$$

The Standard Error of Estimate

Calculate $(y_i - \hat{y}_i)^2$ for each x.

n      x     y     $\hat{y}$   $(y - \hat{y})^2$
1      8     78    74.275      13.8756
2      2     92    97.819      33.8608
3      5     90    86.047      15.6262
4      12    58    58.579      0.3352
5      15    43    46.807      14.4932
6      9     74    70.351      13.3152
7      6     81    82.123      1.2611
Sum                            92.767

$$s_e = \sqrt{\frac{92.767}{5}} = 4.307$$
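The table and the final value can be reproduced with a few lines. This sketch uses the rounded line $\hat{y} = -3.924x + 105.667$ from the example and assumes $s_e = \sqrt{\sum (y_i - \hat{y}_i)^2 / (n - 2)}$; the names are illustrative.

```python
import math

# Paired data from the absences / final grade table.
x = [8, 2, 5, 12, 15, 9, 6]
y = [78, 92, 90, 58, 43, 74, 81]

# Rounded slope and intercept of the regression line from the example.
M, B = -3.924, 105.667

predicted = [M * xi + B for xi in x]
squared_residuals = [(yi - pi) ** 2 for yi, pi in zip(y, predicted)]

sum_squared_residuals = sum(squared_residuals)
s_e = math.sqrt(sum_squared_residuals / (len(x) - 2))

print(round(sum_squared_residuals, 3))  # 92.767
print(round(s_e, 3))                    # 4.307
```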

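Prediction Intervals

Given a specific linear regression equation and $x_0$, a specific value of x, a c-prediction interval for y is:

$$\hat{y} - E < y < \hat{y} + E$$

where

$$E = t_c\, s_e \sqrt{1 + \frac{1}{n} + \frac{n(x_0 - \bar{x})^2}{n\sum x^2 - \left(\sum x\right)^2}}$$

Use a t-distribution with n – 2 degrees of freedom.

The point estimate is $\hat{y}$ and E is the maximum error of estimate.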

Application

Construct a 90% prediction interval for a final grade when a student has been absent 6 times.

1. Find the point estimate:

$\hat{y} = -3.924(6) + 105.667 = 82.123$

The point (6, 82.123) is the point on the regression line with x-coordinate of 6.

2. Find E, using $t_c = 2.015$ (90% level, 5 degrees of freedom) and $s_e = 4.307$:

$$E = 2.015(4.307)\sqrt{1 + \frac{1}{7} + \frac{7(6 - 8.143)^2}{7(579) - (57)^2}} = 9.438$$

At the 90% level of confidence, the maximum error of estimate is 9.438.

3. Find the endpoints.

$\hat{y} - E = 82.123 - 9.438 = 72.685$

$\hat{y} + E = 82.123 + 9.438 = 91.561$

$72.685 < y < 91.561$

When x = 6, the 90% prediction interval is from 72.685 to 91.561.
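The three steps can be combined into one function. This is a sketch under stated assumptions: the margin formula $E = t_c s_e \sqrt{1 + 1/n + n(x_0 - \bar{x})^2 / (n\sum x^2 - (\sum x)^2)}$, the critical value $t_c = 2.015$ read from a t-table for 5 degrees of freedom, and the rounded values $s_e = 4.307$, m = –3.924 and b = 105.667 from the example. The name `prediction_interval` is illustrative.

```python
import math

# Absences from the sample data.
x = [8, 2, 5, 12, 15, 9, 6]

M, B = -3.924, 105.667  # rounded regression line from the example
S_E = 4.307             # standard error of estimate from the example
T_C = 2.015             # t-table value: 90% level, 5 d.f. (assumed, not computed)


def prediction_interval(x0, xs, m, b, s_e, t_c):
    """Return (lower, point_estimate, upper) for y at x = x0."""
    n = len(xs)
    sum_x = sum(xs)
    sum_x2 = sum(v * v for v in xs)
    mean_x = sum_x / n
    y_hat = m * x0 + b
    margin = t_c * s_e * math.sqrt(
        1 + 1 / n + n * (x0 - mean_x) ** 2 / (n * sum_x2 - sum_x ** 2)
    )
    return y_hat - margin, y_hat, y_hat + margin


low, y_hat, high = prediction_interval(6, x, M, B, S_E, T_C)
print(round(low, 2), round(y_hat, 3), round(high, 2))
```

The endpoints agree with the slide's 72.685 and 91.561 to within rounding in the last digit.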