class 5: thurs., sep. 23 example of using regression to make predictions and understand the likely...
Post on 20-Dec-2015
Class 5: Thurs., Sep. 23
• Example of using regression to make predictions and understand the likely errors in the predictions: salaries of teachers and experience
• Normal distribution calculations
• R squared
• Checking the assumptions of the simple linear regression model: residual plots.
Teachers’ Salaries and Dating
• In U.S. culture, it is usually considered impolite to ask how much money a person makes.
• However, suppose that you are single and are interested in dating a particular person.
• Of course, salary isn’t the most important factor when considering whom to date, but it certainly is nice to know (especially if it is high!).
• In this case, the person you are interested in happens to be a high school teacher, so you know a high salary isn’t an issue.
• Still you would like to know how much she or he makes, so you take an informal survey of 11 high school teachers that you know.
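Distributions: Salary
[Figure: distribution plot of Salary, axis ticks at 35000, 50000, 60000]
Moments

| Statistic | Value |
| --- | --- |
| Mean | 50881.818 |
| Std Dev | 6491.1968 |
| Std Err Mean | 1957.1695 |
| upper 95% Mean | 55242.664 |
| lower 95% Mean | 46520.973 |
| N | 11 |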
Based on this data, what can you conclude? Absent any other information, the best guess for the teacher’s salary is the mean salary, $50,882. But it is likely that this estimate will not be correct. To get an idea of how far off you might be, you can calculate the standard deviation:

\[ s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{11}(y_i - \bar{y})^2} = \sqrt{\frac{421437378}{10}} = 6491.82 \]

The standard deviation is the “typical” amount by which an observation deviates from the mean. Thus, your best estimate for your potential date’s salary is $50,882, but a typical estimate will be off by about $6,500.
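The standard deviation formula can be sketched in a few lines of Python. The slides do not list the 11 surveyed salaries, so the `salaries` list below is hypothetical and serves only to illustrate the n - 1 divisor. The last lines redo the arithmetic shown on the slide.

```python
import math
import statistics

# Hypothetical salaries for illustration only (NOT the survey data from the slides).
salaries = [42000, 47500, 51000, 55250, 60000]

def sample_sd(values):
    """Sample standard deviation: sqrt of squared deviations divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((y - mean) ** 2 for y in values)
    return math.sqrt(squared_deviations / (n - 1))

s = sample_sd(salaries)
print(f"mean = {sum(salaries) / len(salaries):.2f}, sd = {s:.2f}")

# The standard library's stdev uses the same n - 1 definition.
print(math.isclose(s, statistics.stdev(salaries)))

# Arithmetic from the slide: sum of squared deviations 421437378 with n - 1 = 10.
slide_sd = math.sqrt(421437378 / 10)
print(f"slide arithmetic: {slide_sd:.2f}")
```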
• You happen to know that the person you are interested in has been teaching for 8 years.
• How can you use this information to better predict your potential date’s salary?
• Regression Analysis to the Rescue!
• You go back to each of the original 11 teachers you surveyed and ask them for their years of experience.
• Simple Linear Regression Model: \(E(Y|X) = \beta_0 + \beta_1 X\); the distribution of Y given X is normal with mean \(\beta_0 + \beta_1 X\) and standard deviation \(\sigma\).
[Figure: Bivariate Fit of Salary By Years of Experience (scatterplot; x axis: Years of Experience, 0 to 12.5; y axis: Salary, 35000 to 65000)]
[Figure: Bivariate Fit of Salary By Years of Experience, with the linear fit drawn (x axis: Years of Experience, 0 to 12.5; y axis: Salary, 35000 to 65000)]
Linear Fit
Salary = 40612.135 + 1686.0674 Years of Experience
Summary of Fit

| Statistic | Value |
| --- | --- |
| RSquare | 0.545881 |
| RSquare Adj | 0.495423 |
| Root Mean Square Error | 4610.93 |
| Mean of Response | 50881.82 |
| Observations (or Sum Wgts) | 11 |
Linear Fit
Salary = 40612.135 + 1686.0674 Years of Experience
Summary of Fit
RSquare 0.545881, RSquare Adj 0.495423, Root Mean Square Error 4610.93
• Predicted salary of your potential date who has been a teacher for 8 years = Estimated mean salary for teachers with 8 years of experience = 40612.135 + 1686.0674*8 = $54,100
• How far off will your estimate typically be? Root mean square error = Estimated standard deviation of Y|X = $4,610.93.
• Notice that the typical error of your estimate of teacher salary using experience, $4,610.93, is less than that of using only information on mean teacher salary, $6,491.20.
• Regression analysis enables you to better predict your potential date’s salary.
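The prediction and the comparison of typical errors can be reproduced with a short Python sketch. The coefficients, root mean square error, and standard deviation are the values reported in the JMP output on the slides. The function name `predict_salary` is just an illustrative choice.

```python
# Values as reported in the JMP output on the slides.
INTERCEPT = 40612.135
SLOPE = 1686.0674        # estimated dollars per additional year of experience
RMSE = 4610.93           # typical error when predicting from experience
SD_SALARY = 6491.1968    # typical error when predicting with the mean salary only

def predict_salary(years_of_experience):
    """Estimated mean salary for teachers with the given experience."""
    return INTERCEPT + SLOPE * years_of_experience

predicted = predict_salary(8)
print(f"Predicted salary at 8 years: {predicted:.2f}")

# Relative reduction in typical prediction error from using experience.
reduction = 1 - RMSE / SD_SALARY
print(f"Typical error is reduced by about {reduction:.0%}")
```

The slides round the prediction to $54,100.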
More Information About Your Potential Date’s Salary
• From the regression model, you predict that your potential date’s salary is $54,100 and the typical error you expect to make in your prediction is $4,611.
• Suppose you want an interval that will contain your date’s salary most of the time (say 95% of the time). What’s the chance that your date will make more than $60,000? What’s the chance that your date will make less than $50,000?
• We can answer these questions by using the fact that under the simple linear regression model, the distribution of Y|X is normal. Here, the subpopulation of teachers with 8 years of experience has a normal distribution with mean $54,100 and standard deviation $4,611.
• 95% interval: For the subpopulation of teachers with 8 years of experience, 95% of the salaries will be within two SDs of the mean. An interval that will contain the salary of a randomly chosen teacher with 8 years of experience 95% of the time is: $54,100 ± 2*$4,611 = ($44,878, $63,322).
• What’s the probability that your date will make more than $60,000? If you don’t have any additional information about your date other than his or her number of years of teaching, we can assume that your date is a random draw from the subpopulation of teachers with 8 years of teaching.
• According to the simple linear regression model, the subpopulation of teachers with 8 years of experience is estimated to have a normal distribution with mean $54,100 and standard deviation $4,611.
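As a rough check of the two-SD rule, the sketch below computes the interval and the normal probability it covers, using Python's standard-library `statistics.NormalDist`. Like the slides, it treats the estimated mean and standard deviation as if they were the true values, which is a simplifying assumption.

```python
from statistics import NormalDist

MEAN_AT_8_YEARS = 54100   # estimated mean salary at 8 years (from the slides)
RMSE = 4611               # estimated SD of Y|X (from the slides)

# Mean plus or minus two standard deviations.
lower = MEAN_AT_8_YEARS - 2 * RMSE
upper = MEAN_AT_8_YEARS + 2 * RMSE
print(f"Approximate 95% interval: ({lower}, {upper})")

# Probability that a normal variable falls inside the interval.
salary_dist = NormalDist(mu=MEAN_AT_8_YEARS, sigma=RMSE)
coverage = salary_dist.cdf(upper) - salary_dist.cdf(lower)
print(f"Normal probability inside the interval: {coverage:.4f}")
```

The coverage comes out slightly above 95% because exactly 95% corresponds to about 1.96 SDs rather than 2.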
Properties of the Normal Distribution (Section 1.3)
• Suppose a variable Y has a normal distribution with mean \(\mu\) and standard deviation \(\sigma\). Then
\[ Z = \frac{Y - \mu}{\sigma} \]
follows a standard normal distribution.
• Then the probability that Y is greater than a number c equals
\[ P(Y > c) = P\left(\frac{Y - \mu}{\sigma} > \frac{c - \mu}{\sigma}\right) = P\left(Z > \frac{c - \mu}{\sigma}\right), \]
where Z has a standard normal distribution with mean 0 and SD 1.
The probabilities for a standard normal distribution can be found in Table A.
Review Section 1.3 on using the normal tables.
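• Probability that a teacher with 8 years of experience has salary > $60,000:
\[ P(Y > 60{,}000) = P\left(\frac{Y - 54{,}100}{4{,}611} > \frac{60{,}000 - 54{,}100}{4{,}611}\right) = P(Z > 1.28) = 1 - P(Z < 1.28) = 1 - 0.8997 = 0.1003 \]
• Probability that a teacher with 8 years of experience has salary < $50,000:
\[ P(Y < 50{,}000) = P\left(\frac{Y - 54{,}100}{4{,}611} < \frac{50{,}000 - 54{,}100}{4{,}611}\right) = P(Z < -0.89) = 0.1867 \]
• Probability that a teacher with 8 years of experience has salary between $52,000 and $56,000:
\[ P(52{,}000 < Y < 56{,}000) = P\left(\frac{52{,}000 - 54{,}100}{4{,}611} < \frac{Y - 54{,}100}{4{,}611} < \frac{56{,}000 - 54{,}100}{4{,}611}\right) = P(-0.46 < Z < 0.41) = P(Z < 0.41) - P(Z < -0.46) = 0.6591 - 0.3228 = 0.3363 \]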
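The same three probabilities can be computed without a table. The sketch below uses Python's standard-library `statistics.NormalDist`, with the mean and standard deviation from the slides. Because the table lookups round z to two decimals, the computed values may differ slightly in the third or fourth decimal place.

```python
from statistics import NormalDist

# Estimated distribution of salary for teachers with 8 years of experience.
salary_dist = NormalDist(mu=54100, sigma=4611)

# P(Y > 60,000): complement of the cumulative probability at 60,000.
p_above_60k = 1 - salary_dist.cdf(60000)

# P(Y < 50,000): cumulative probability at 50,000.
p_below_50k = salary_dist.cdf(50000)

# P(52,000 < Y < 56,000): difference of two cumulative probabilities.
p_between = salary_dist.cdf(56000) - salary_dist.cdf(52000)

print(f"P(Y > 60,000)          = {p_above_60k:.4f}")
print(f"P(Y < 50,000)          = {p_below_50k:.4f}")
print(f"P(52,000 < Y < 56,000) = {p_between:.4f}")
```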
R Squared
• How much better predictions of your potential date’s salary does the simple linear regression model provide than just using the mean teacher’s salary?
• This is the question that R squared addresses.
• R squared: Number between 0 and 1 that measures how much of the variability in the response the regression model explains.
• R squared close to 0 means that using regression for predicting Y|X isn’t much better than using the mean of Y; R squared close to 1 means that regression is much better than the mean of Y for predicting Y|X.
Summary of Fit
RSquare 0.545881 RSquare Adj 0.495423 Root Mean Square Error 4610.93
R Squared Formula
• \[ R^2 = \frac{\text{Total sum of squares} - \text{Residual sum of squares}}{\text{Total sum of squares}} \]
• Total sum of squares = \(\sum_{i=1}^{n} (Y_i - \bar{Y})^2\) = the sum of squared prediction errors for using the sample mean of Y to predict Y
• Residual sum of squares = \(\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2\), where \(\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i\) is the prediction of \(Y_i\) from the least squares line.
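The formula can be checked against the JMP output without the raw data. The sketch below rebuilds both sums of squares from the reported summary statistics. It assumes that the reported standard deviation uses the n - 1 divisor and that the root mean square error uses the n - 2 divisor, which is the usual convention for simple linear regression but is not stated on the slides.

```python
# Summary statistics reported in the JMP output on the slides.
N = 11
SD_SALARY = 6491.1968   # sample standard deviation of salary
RMSE = 4610.93          # root mean square error of the linear fit

# Assumption: SD^2 = total SS / (n - 1), so total SS = (n - 1) * SD^2.
total_ss = (N - 1) * SD_SALARY ** 2

# Assumption: RMSE^2 = residual SS / (n - 2), so residual SS = (n - 2) * RMSE^2.
residual_ss = (N - 2) * RMSE ** 2

r_squared = (total_ss - residual_ss) / total_ss
print(f"Total SS    = {total_ss:.0f}")
print(f"Residual SS = {residual_ss:.0f}")
print(f"R squared   = {r_squared:.6f}")
```

Under these assumptions the result agrees with the reported RSquare of 0.545881 to about six decimal places.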
What’s a good R squared?
• As with correlation, a good \(R^2\) depends on the context. In precise laboratory work, \(R^2\) values under 90% might be too low, but in social science contexts, where a single variable rarely explains a great deal of the variation in the response, \(R^2\) values of 50% may be considered remarkably good.
• The best measure of whether the regression model is providing predictions of Y|X that are accurate enough to be useful is the root mean square error, which tells us the typical error in using the regression to predict Y from X.
Checking the model
• The simple linear regression model is a great tool, but its answers will only be useful if it is the right model for the data. We need to check the assumptions before using the model.
• Assumptions of the simple linear regression model:
1. Linearity: The mean of Y|X is a straight line.
2. Constant variance: The standard deviation of Y|X is constant.
3. Normality: The distribution of Y|X is normal.
4. Independence: The observations are independent.
Checking that the mean of Y|X is a straight line
1. Scatterplot: Look at whether the mean of Y given X appears to increase or decrease in a straight line.
[Figure: Bivariate Fit of Salary By Years of Experience (scatterplot; x axis: Years of Experience, 0 to 12.5; y axis: Salary, 35000 to 65000)]
[Figure: Bivariate Fit of Heart Disease Mortality By Wine Consumption (scatterplot; x axis: Wine Consumption, 0 to 80; y axis: Heart Disease Mortality, 2 to 12)]
Residual Plot
• Residuals: Prediction error of using regression to predict \(Y_i\) for observation i: \(res_i = Y_i - \hat{Y}_i\), where \(\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i\).
• Residual plot: Plot with residuals on the y axis and the explanatory variable (or some other variable) on the x axis.
[Figure: residual plot of Residual against Wine Consumption (x axis 0 to 80; y axis -3 to 3)]
[Figure: residual plot of Residual against Years of Experience (x axis 0 to 12.5; y axis -10000 to 5000)]
• Residual Plot in JMP: After doing Fit Line, click red triangle next to Linear Fit and then click Plot Residuals.
• What should the residual plot look like if the simple linear regression model holds? Under the simple linear regression model, the residuals \(res_i = Y_i - \hat{Y}_i = Y_i - (\hat{\beta}_0 + \hat{\beta}_1 X_i)\) should have approximately a normal distribution with mean zero and a standard deviation which is the same for all X.
• Simple linear regression model: Residuals should appear as a “swarm” of randomly scattered points about their mean (which is always zero).
• A pattern in the residual plot in which, for a certain range of X, the residuals tend to be greater than zero or tend to be less than zero indicates that the mean of Y|X is not a straight line.
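Outside JMP, residuals can be computed directly from their definition. The sketch below fits a least squares line by hand and confirms that the residuals average to zero. The data are hypothetical (the slides do not list the survey values), and `fit_line` and `residuals` are illustrative helper names.

```python
def fit_line(xs, ys):
    """Least squares intercept and slope for simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

def residuals(xs, ys):
    """Residual for each observation: observed Y minus fitted Y."""
    b0, b1 = fit_line(xs, ys)
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

# Hypothetical (years of experience, salary) pairs, NOT the slide data.
years = [1, 3, 4, 6, 8, 10, 12]
salary = [41000, 47500, 46000, 52000, 53500, 59000, 60500]

res = residuals(years, salary)
for x, r in zip(years, res):
    print(f"x = {x:>2}  residual = {r:>9.1f}")

# Least squares residuals sum to zero, up to floating point rounding.
print(f"sum of residuals is near zero: {abs(sum(res)) < 1e-6}")
```

Plotting `res` against `years` would give the residual plot described above.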
[Figure: Bivariate Fit of Mileage By Speed, with the linear fit drawn (x axis: Speed, 0 to 110; y axis: Mileage, 5 to 40)]
Linear Fit: Mileage = 23.266776 - 0.0012701 Speed
[Figure: residual plot of Residual against Speed (x axis 0 to 110; y axis -20 to 10)]
Data Simulated From A Simple Linear Regression Model (Idealreg.JMP)
[Figure: Bivariate Fit of Y By X (x axis: X, 0 to 110; y axis: Y, 0 to 110)]
[Figure: residual plot of Residual against X (x axis 0 to 110; y axis -2 to 2)]
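The contrast between the two residual plots can be illustrated with a small sketch. The speed and mileage values below are hypothetical, chosen so that mileage rises and then falls with speed. They are not the data behind the Mileage By Speed plot. A straight line fitted to such data has a slope near zero, and the residuals show the telltale pattern: negative at both ends of the X range and positive in the middle.

```python
def fit_line(xs, ys):
    """Least squares intercept and slope for simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical curved relationship: mileage peaks at a moderate speed.
speeds = list(range(10, 101, 10))
mileage = [35 - (s - 55) ** 2 / 100 for s in speeds]

b0, b1 = fit_line(speeds, mileage)
res = [y - (b0 + b1 * x) for x, y in zip(speeds, mileage)]

print(f"fitted line: mileage = {b0:.2f} + {b1:.4f} * speed")
for s, r in zip(speeds, res):
    sign = "+" if r > 0 else "-"
    print(f"speed = {s:>3}  residual = {r:>6.2f}  ({sign})")
```

Runs of same-signed residuals over a range of X are the pattern that signals a non-linear mean.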
Summary
• Normal distribution can be used to calculate probability that Y takes on certain values given X
• R squared: measure of how much regression improves on ignoring X when predicting Y.
• Assumptions of simple linear regression model must be checked in order for model to be used. Residual plots can be used to check the linearity assumption.
• Tuesday’s class: Section 2.4 (more on checking assumptions, outliers and influential points, lurking variables).