topic 12 simple linear regression and correlation
TRANSCRIPT
TOPIC 12
Simple Linear Regression and Correlation
• Representation of some phenomenon
• A mathematical model is a mathematical expression of some phenomenon
• Often describe relationships between variables
• Types
  – Deterministic models
  – Probabilistic models
Models
• Hypothesize exact relationships
• Suitable when prediction error is negligible
• Example: force is exactly mass times acceleration
  – F = m·a
Deterministic Models
• Hypothesize two components
  – Deterministic
  – Random error
• Example: sales volume (y) is 10 times advertising spending (x) plus random error
  – y = 10x + ε
  – Random error may be due to factors other than advertising
Probabilistic Models
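The sales example can be sketched in a few lines of Python. This is only an illustration: the normal distribution for the random error, its standard deviation (`noise_sd`), and the spending values are assumptions made for the sketch, not values from the slides.

```python
import random

def simulate_sales(ad_spend, noise_sd=2.0, seed=0):
    """Return sales from y = 10x + e, where e is drawn from Normal(0, noise_sd).

    noise_sd and seed are illustrative assumptions, not values from the slides.
    """
    rng = random.Random(seed)
    return [10 * x + rng.gauss(0, noise_sd) for x in ad_spend]

ad_spend = [1, 2, 3, 4, 5]                    # hypothetical spending levels
deterministic = [10 * x for x in ad_spend]    # deterministic component, 10x
observed = simulate_sales(ad_spend)           # deterministic component plus random error
errors = [y - d for y, d in zip(observed, deterministic)]
print(errors)
```

Setting `noise_sd=0` collapses the probabilistic model into the deterministic one, which is one way to see the difference between the two model types.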
[Diagram: Probabilistic Models branch into Regression Models and Correlation Models]
Regression Models
• Answers ‘What is the relationship between the variables?’
• Equation used
  – One numerical dependent (response) variable: what is to be predicted
  – One or more numerical or categorical independent (explanatory) variables
• Used mainly for prediction and estimation
Regression Models
1. Hypothesize deterministic component
2. Estimate unknown model parameters
3. Specify probability distribution of random error term
   • Estimate standard deviation of error
4. Evaluate model
5. Use model for prediction and estimation
Regression Modeling Steps
[Diagram: Regression Models split into Simple (1 explanatory variable) and Multiple (2+ explanatory variables); each of these is either Linear or Non-Linear]
Types of Regression Models
• Relationship between variables is a linear function
  $y = \beta_0 + \beta_1 x + \varepsilon$
  – $y$: dependent (response) variable
  – $x$: independent (explanatory) variable
  – $\beta_0$: population y-intercept
  – $\beta_1$: population slope
  – $\varepsilon$: random error
Linear Regression Model
[Diagram: a random sample is drawn from the population]
Population (unknown relationship): $y = \beta_0 + \beta_1 x + \varepsilon$
Random sample: $y = \hat{\beta}_0 + \hat{\beta}_1 x + \hat{\varepsilon}$
Population & Sample Regression Model
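[Figure: observed values scattered around the population regression line; axes x and y]
Observed value: $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$
Line of means: $E(y) = \beta_0 + \beta_1 x$
$\varepsilon_i$ = random error
Population Linear Regression Model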
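[Figure: observed values and an unsampled observation around the sample regression line; axes x and y]
Observed value: $y_i = \hat{\beta}_0 + \hat{\beta}_1 x_i + \hat{\varepsilon}_i$
Sample regression line: $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$
$\hat{\varepsilon}_i$ = random error
Sample Linear Regression Model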
1. Plot of all (xi, yi) pairs
2. Suggests how well model will fit
[Figure: scatter plot of y against x, both axes running from 0 to 60]
Scatter Diagram
[Figure: scatter plot of y against x, both axes running from 0 to 60]
• How would you draw a line through the points?
• How do you determine which line ‘fits best’?
Thinking Challenge
• ‘Best fit’ means the differences between actual y values and predicted y values are a minimum
  – But positive differences offset negative ones, so the differences are squared:
    $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \hat{\varepsilon}_i^2$
• Least squares minimizes the sum of the squared differences (SSE)
Least Squares Method
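A small sketch of why the differences are squared, using the advertising and sales data from the examples in these slides. The function names are hypothetical. Two candidate lines are compared: a horizontal line through the mean of y (an arbitrary comparison line chosen for this sketch) and the least squares line reported in the slides.

```python
def residuals(b0, b1, xs, ys):
    """Differences between actual y and the y predicted by the line b0 + b1*x."""
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

def sse(b0, b1, xs, ys):
    """Sum of squared differences for the line b0 + b1*x."""
    return sum(e ** 2 for e in residuals(b0, b1, xs, ys))

ad = [1, 2, 3, 4, 5]       # advertising data from the slides' example
sales = [1, 1, 2, 2, 4]    # sales data from the slides' example

flat = (2.0, 0.0)          # horizontal line through the mean of y (comparison only)
fitted = (-0.1, 0.7)       # least squares line reported in the slides

for b0, b1 in (flat, fitted):
    raw_sum = sum(residuals(b0, b1, ad, sales))
    print(b0, b1, round(raw_sum, 10), round(sse(b0, b1, ad, sales), 4))
```

Both lines have raw differences that sum to essentially zero, so the plain sum cannot tell them apart; only the sum of squares shows that the fitted line is closer to the points.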
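[Figure: four observations with vertical deviations $\hat{\varepsilon}_1$, $\hat{\varepsilon}_2$, $\hat{\varepsilon}_3$, $\hat{\varepsilon}_4$ from the fitted line; axes x and y]
$y_2 = \hat{\beta}_0 + \hat{\beta}_1 x_2 + \hat{\varepsilon}_2$
$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$
LS minimizes $\sum_{i=1}^{n} \hat{\varepsilon}_i^2 = \hat{\varepsilon}_1^2 + \hat{\varepsilon}_2^2 + \hat{\varepsilon}_3^2 + \hat{\varepsilon}_4^2$
Least Squares Graphically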
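Prediction equation: $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$
Slope: $\hat{\beta}_1 = \frac{SS_{xy}}{SS_{xx}} = \frac{\sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}}{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}$
y-intercept: $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
Coefficient Equations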
You’re a marketing analyst for Hasbro Toys. You gather the following data:
Ad $    Sales (Units)
1       1
2       1
3       2
4       2
5       4
Find the least squares line relating sales and advertising.
Least Squares Example
        xi    yi    xi²   yi²   xiyi
        1     1     1     1     1
        2     1     4     1     2
        3     2     9     4     6
        4     2     16    4     8
        5     4     25    16    20
Sum     15    10    55    26    37
Parameter Estimation Solution Table
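$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}}{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}} = \frac{37 - \frac{(15)(10)}{5}}{55 - \frac{(15)^2}{5}} = .70$
$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 2 - (.70)(3) = -.10$
$\hat{y} = -.1 + .7x$
Parameter Estimation Solution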
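The hand calculation can be checked with a short Python sketch of the coefficient equations. The function name `least_squares` is hypothetical; the data are the advertising and sales values from the example.

```python
def least_squares(xs, ys):
    """Return (b0, b1) using the shortcut sums-of-squares formulas."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    ss_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    ss_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    b1 = ss_xy / ss_xx                   # slope
    b0 = sum_y / n - b1 * sum_x / n      # y-intercept: y_bar - b1 * x_bar
    return b0, b1

b0, b1 = least_squares([1, 2, 3, 4, 5], [1, 1, 2, 2, 4])
print(f"y-hat = {b0:.2f} + {b1:.2f}x")   # prints: y-hat = -0.10 + 0.70x
```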
Parameter Estimates
                 Parameter   Standard   T for H0:
Variable   DF    Estimate    Error      Param=0     Prob>|T|
INTERCEP   1     -0.1000     0.6350     -0.157      0.8849
ADVERT     1      0.7000     0.1914      3.656      0.0354
Estimate column: INTERCEP is $\hat{\beta}_0$, ADVERT is $\hat{\beta}_1$
$\hat{y} = -.1 + .7x$
Parameter Estimation Computer Output
[Figure: Sales (0 to 4) plotted against Advertising (0 to 5) with the fitted line $\hat{y} = -.1 + .7x$]
Regression Line Fitted to the Data
You’re an economist for the county cooperative. You gather the following data:
Fertilizer (lb.)    Yield (lb.)
4                   3.0
6                   5.5
10                  6.5
12                  9.0
Find the least squares line relating crop yield and fertilizer.
Least Squares Thinking Challenge
        xi    yi     xi²    yi²      xiyi
        4     3.0    16     9.00     12
        6     5.5    36     30.25    33
        10    6.5    100    42.25    65
        12    9.0    144    81.00    108
Sum     32    24.0   296    162.50   218
Parameter Estimation Solution Table
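$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}}{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}} = \frac{218 - \frac{(32)(24)}{4}}{296 - \frac{(32)^2}{4}} = .65$
$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 6 - (.65)(8) = .80$
$\hat{y} = .8 + .65x$
Parameter Estimation Solution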
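The shortcut sums used in the solution are algebraically equal to sums of deviations from the means, for example $SS_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y})$. The sketch below computes the fertilizer example both ways as a cross-check; the function names are hypothetical.

```python
def ss_shortcut(us, vs):
    """Shortcut form: sum(u*v) - sum(u)*sum(v)/n."""
    n = len(us)
    return sum(u * v for u, v in zip(us, vs)) - sum(us) * sum(vs) / n

def ss_deviation(us, vs):
    """Definitional form: sum((u - u_bar) * (v - v_bar))."""
    n = len(us)
    u_bar, v_bar = sum(us) / n, sum(vs) / n
    return sum((u - u_bar) * (v - v_bar) for u, v in zip(us, vs))

fertilizer = [4, 6, 10, 12]          # data from the thinking challenge
crop_yield = [3.0, 5.5, 6.5, 9.0]

ss_xy = ss_deviation(fertilizer, crop_yield)
ss_xx = ss_deviation(fertilizer, fertilizer)
slope = ss_xy / ss_xx
intercept = sum(crop_yield) / 4 - slope * sum(fertilizer) / 4
print(ss_xy, ss_xx, slope, intercept)
```

The deviation form makes it easier to see that the slope depends only on how x and y vary around their means.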
[Figure: Yield (lb.) (0 to 10) plotted against Fertilizer (lb.) (0 to 15) with the fitted line $\hat{y} = .8 + .65x$]
Regression Line Fitted to the Data*
1. Mean of probability distribution of error, ε, is 0
2. Probability distribution of error has constant variance
3. Probability distribution of error, ε, is normal
4. Errors are independent
Linear Regression Assumptions
[Figure: probability distributions of the error at x1, x2, and x3 along the line E(y) = β0 + β1x; axes x and y]
Error Probability Distribution
• Variation of actual y from predicted y, $\hat{y}$
• Measured by standard error of regression model
  – Sample standard deviation of $\hat{\varepsilon}$: s
• Affects several factors
  – Parameter significance
  – Prediction accuracy
Random Error Variation
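[Figure: an observation $(x_i, y_i)$, the fitted line $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$, and the mean $\bar{y}$; axes x and y]
Unexplained sum of squares: $\sum (y_i - \hat{y}_i)^2$
Total sum of squares: $\sum (y_i - \bar{y})^2$
Explained sum of squares: $\sum (\hat{y}_i - \bar{y})^2$
Variation Measures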
$s^2 = \frac{SSE}{n - 2}$ where $SSE = \sum (y_i - \hat{y}_i)^2$
$s = \sqrt{s^2} = \sqrt{\frac{SSE}{n - 2}}$
Estimation of σ²
You’re a marketing analyst for Hasbro Toys. You gather the following data:
Ad $    Sales (Units)
1       1
2       1
3       2
4       2
5       4
Find SSE, s², and s.
Calculating SSE, s², s Example
xi    yi    ŷ = –.1 + .7x    y – ŷ    (y – ŷ)²
1     1     .6               .4       .16
2     1     1.3              –.3      .09
3     2     2                0        0
4     2     2.7              –.7      .49
5     4     3.4              .6       .36
SSE = 1.1
Calculating SSE Solution
Calculating s² and s Solution
$s^2 = \frac{SSE}{n - 2} = \frac{1.1}{5 - 2} = .36667$
$s = \sqrt{.36667} = .6055$
Estimation of σ²
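The SSE, s², and s calculations can be reproduced with a short Python sketch. The function name `error_variation` is hypothetical; the coefficients –.1 and .7 are the estimates found earlier in the slides.

```python
import math

def error_variation(xs, ys, b0, b1):
    """Return (SSE, s squared, s) for the fitted line y-hat = b0 + b1*x."""
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s2 = sse / (len(xs) - 2)          # divide by n - 2 degrees of freedom
    return sse, s2, math.sqrt(s2)

sse, s2, s = error_variation([1, 2, 3, 4, 5], [1, 1, 2, 2, 4], -0.1, 0.7)
print(f"SSE = {sse:.2f}, s^2 = {s2:.5f}, s = {s:.4f}")
# prints: SSE = 1.10, s^2 = 0.36667, s = 0.6055
```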
• Shows if there is a linear relationship between x and y
• Involves population slope β1
• Hypotheses
  – H0: β1 = 0 (no linear relationship)
  – Ha: β1 ≠ 0 (linear relationship)
• Theoretical basis is sampling distribution of slope
Test of Slope Coefficient
[Figure: the population line with the Sample 1 and Sample 2 lines; axes x and y]
All possible sample slopes: Sample 1: 2.5, Sample 2: 1.6, Sample 3: 1.8, Sample 4: 2.1, ... (very large number of sample slopes)
[Figure: sampling distribution of $\hat{\beta}_1$, centered at β1 with standard error $S_{\hat{\beta}_1}$]
Sampling Distribution of Sample Slopes
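$t = \frac{\hat{\beta}_1}{S_{\hat{\beta}_1}}$, df = n – 2
where $S_{\hat{\beta}_1} = \frac{s}{\sqrt{SS_{xx}}}$ and $SS_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}$
Slope Coefficient Test Statistic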
You’re a marketing analyst for Hasbro Toys.
You find $\hat{\beta}_0 = -.1$, $\hat{\beta}_1 = .7$ and s = .6055.
Ad $    Sales (Units)
1       1
2       1
3       2
4       2
5       4
Is the relationship significant at the .05 level of significance?
Test of Slope Coefficient Example
        xi    yi    xi²   yi²   xiyi
        1     1     1     1     1
        2     1     4     1     2
        3     2     9     4     6
        4     2     16    4     8
        5     4     25    16    20
Sum     15    10    55    26    37
Solution Table
$S_{\hat{\beta}_1} = \frac{s}{\sqrt{SS_{xx}}} = \frac{.6055}{\sqrt{55 - \frac{(15)^2}{5}}} = .1914$
$t = \frac{\hat{\beta}_1}{S_{\hat{\beta}_1}} = \frac{.70}{.1914} = 3.657$
Test Statistic Solution
• H0: β1 = 0
• Ha: β1 ≠ 0
• α = .05
• df = 5 – 2 = 3
• Critical value(s): t = ±3.182 (reject H0 in either tail, area .025 in each)
Test statistic: $t = \frac{\hat{\beta}_1}{S_{\hat{\beta}_1}} = \frac{.70}{.1914} = 3.657$
Decision: Reject at α = .05
Conclusion: There is evidence of a relationship
Test of Slope Coefficient Solution
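The test can be sketched in Python as below. The function name is hypothetical, and the critical value 3.182 is taken from the slide rather than computed. Working without intermediate rounding gives t of about 3.656, which agrees with the computer output; the hand calculation's 3.657 comes from dividing by the rounded .1914.

```python
import math

def slope_t_statistic(xs, ys):
    """Return (t, df) for testing H0: beta1 = 0 in simple linear regression."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = ss_xy / ss_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))
    se_b1 = s / math.sqrt(ss_xx)     # standard error of the slope
    return b1 / se_b1, n - 2

T_CRITICAL = 3.182  # two-tailed critical value for alpha = .05, df = 3, as quoted on the slide

t, df = slope_t_statistic([1, 2, 3, 4, 5], [1, 1, 2, 2, 4])
print(f"t = {t:.3f}, df = {df}, reject H0: {abs(t) > T_CRITICAL}")
```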
Parameter Estimates
                 Parameter   Standard   T for H0:
Variable   DF    Estimate    Error      Param=0     Prob>|T|
INTERCEP   1     -0.1000     0.6350     -0.157      0.8849
ADVERT     1      0.7000     0.1914      3.656      0.0354
In the ADVERT row: Estimate is $\hat{\beta}_1$, Standard Error is $S_{\hat{\beta}_1}$, T for H0 is $t = \hat{\beta}_1 / S_{\hat{\beta}_1}$, and Prob>|T| is the P-value
Test of Slope Coefficient Computer Output
[Diagram: Probabilistic Models branch into Regression Models and Correlation Models]
Correlation Model Map
• Answers ‘How strong is the linear relationship between two variables?’
• Coefficient of correlation
  – Sample correlation coefficient denoted r
  – Values range from –1 to +1
  – Measures degree of association
  – Does not indicate cause–effect relationship
Correlation Models
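$r = \frac{SS_{xy}}{\sqrt{SS_{xx}\, SS_{yy}}}$
where
$SS_{xx} = \sum x^2 - \frac{\left(\sum x\right)^2}{n}$
$SS_{yy} = \sum y^2 - \frac{\left(\sum y\right)^2}{n}$
$SS_{xy} = \sum xy - \frac{\left(\sum x\right)\left(\sum y\right)}{n}$
Coefficient of Correlation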
[Scale: r runs from –1.0 through –.5, 0, and +.5 to +1.0]
–1.0: perfect negative correlation
0: no linear correlation
+1.0: perfect positive correlation
Degree of negative correlation increases toward –1.0; degree of positive correlation increases toward +1.0
Coefficient of Correlation Values
You’re a marketing analyst for Hasbro Toys.
Ad $    Sales (Units)
1       1
2       1
3       2
4       2
5       4
Calculate the coefficient of correlation.
Coefficient of Correlation Example
        xi    yi    xi²   yi²   xiyi
        1     1     1     1     1
        2     1     4     1     2
        3     2     9     4     6
        4     2     16    4     8
        5     4     25    16    20
Sum     15    10    55    26    37
Solution Table
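$SS_{xx} = \sum x^2 - \frac{\left(\sum x\right)^2}{n} = 55 - \frac{(15)^2}{5} = 10$
$SS_{yy} = \sum y^2 - \frac{\left(\sum y\right)^2}{n} = 26 - \frac{(10)^2}{5} = 6$
$SS_{xy} = \sum xy - \frac{\left(\sum x\right)\left(\sum y\right)}{n} = 37 - \frac{(15)(10)}{5} = 7$
$r = \frac{SS_{xy}}{\sqrt{SS_{xx}\, SS_{yy}}} = \frac{7}{\sqrt{10 \cdot 6}} = .904$
Coefficient of Correlation Solution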
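A minimal Python sketch of the same calculation, using the shortcut sums from the solution. The function name `correlation` is hypothetical.

```python
import math

def correlation(xs, ys):
    """Return r = SSxy / sqrt(SSxx * SSyy) using the shortcut sums."""
    n = len(xs)
    ss_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    ss_yy = sum(y * y for y in ys) - sum(ys) ** 2 / n
    ss_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    return ss_xy / math.sqrt(ss_xx * ss_yy)

r = correlation([1, 2, 3, 4, 5], [1, 1, 2, 2, 4])
print(round(r, 3))   # prints: 0.904
```

Calling the same function on the fertilizer data, `correlation([4, 6, 10, 12], [3.0, 5.5, 6.5, 9.0])`, should give about .956.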
You’re an economist for the county cooperative. You gather the following data:
Fertilizer (lb.)    Yield (lb.)
4                   3.0
6                   5.5
10                  6.5
12                  9.0
Find the coefficient of correlation.
Coefficient of Correlation Thinking Challenge
        xi    yi     xi²    yi²      xiyi
        4     3.0    16     9.00     12
        6     5.5    36     30.25    33
        10    6.5    100    42.25    65
        12    9.0    144    81.00    108
Sum     32    24.0   296    162.50   218
Solution Table
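$SS_{xx} = \sum x^2 - \frac{\left(\sum x\right)^2}{n} = 296 - \frac{(32)^2}{4} = 40$
$SS_{yy} = \sum y^2 - \frac{\left(\sum y\right)^2}{n} = 162.5 - \frac{(24)^2}{4} = 18.5$
$SS_{xy} = \sum xy - \frac{\left(\sum x\right)\left(\sum y\right)}{n} = 218 - \frac{(32)(24)}{4} = 26$
$r = \frac{SS_{xy}}{\sqrt{SS_{xx}\, SS_{yy}}} = \frac{26}{\sqrt{40 \cdot 18.5}} = .956$
Coefficient of Correlation Solution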
Proportion of variation ‘explained’ by relationship between x and y
$r^2 = \frac{\text{Explained Variation}}{\text{Total Variation}} = \frac{SS_{yy} - SSE}{SS_{yy}}$
$0 \le r^2 \le 1$
r² = (coefficient of correlation)²
Coefficient of Determination
You’re a marketing analyst for Hasbro Toys.
You know r = .904.
Ad $    Sales (Units)
1       1
2       1
3       2
4       2
5       4
Calculate and interpret the coefficient of determination.
Coefficient of Determination Example
r² = (coefficient of correlation)²
r² = (.904)²
r² = .817
Interpretation: About 81.7% of the sample variation in Sales (y) can be explained by using Ad $ (x) to predict Sales (y) in the linear model.
Coefficient of Determination Solution
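As a cross-check, r² can be computed two ways: as the explained share of variation, (SSyy – SSE)/SSyy, and as the square of the correlation coefficient. The sketch below does both for the advertising data; the function name is hypothetical.

```python
def r_squared_two_ways(xs, ys):
    """Return ((SSyy - SSE) / SSyy, r squared) for a simple linear regression."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_yy = sum((y - y_bar) ** 2 for y in ys)
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = ss_xy / ss_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    explained_share = (ss_yy - sse) / ss_yy
    r_squared = ss_xy ** 2 / (ss_xx * ss_yy)
    return explained_share, r_squared

explained_share, r_squared = r_squared_two_ways([1, 2, 3, 4, 5], [1, 1, 2, 2, 4])
print(round(explained_share, 4), round(r_squared, 4))   # prints: 0.8167 0.8167
```

The unrounded value is 49/60, about .8167; squaring the rounded r = .904 gives the slide's .817.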
Root MSE    0.60553     R-square    0.8167
Dep Mean    2.00000     Adj R-sq    0.7556
C.V.        30.27650
R-square: r²
Adj R-sq: r² adjusted for number of explanatory variables & sample size
r² Computer Output
• Types of predictions
  – Point estimates
  – Interval estimates
• What is predicted
  – Population mean response E(y) for given x (point on population regression line)
  – Individual response (yi) for given x
Prediction With Regression Models
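[Figure: at x = xP, the mean of y, E(y), on the population line $E(y) = \beta_0 + \beta_1 x$; an individual y; and the prediction $\hat{y}$ on the sample line $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$; axes x and y]
What is Predicted?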
Confidence Interval Estimate for Mean Value of y at x = xp
$\hat{y} \pm t_{\alpha/2}\, s \sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}}$
df = n – 2
Confidence Interval Estimate
1. Level of confidence (1 – α)
   • Width increases as confidence increases
2. Data dispersion (s)
   • Width increases as variation increases
3. Sample size
   • Width decreases as sample size increases
4. Distance of xp from mean $\bar{x}$
   • Width increases as distance increases
Factors Affecting Interval Width
[Figure: the Sample 1 and Sample 2 lines plotted with $\bar{x}$, x1, and x2 marked; the y values given by the two lines show greater dispersion at x2 than at x1; axes x and y]
Why Distance from Mean
You’re a marketing analyst for Hasbro Toys.
You find $\hat{\beta}_0 = -.1$, $\hat{\beta}_1 = .7$ and s = .6055.
Ad $    Sales (Units)
1       1
2       1
3       2
4       2
5       4
Find a 95% confidence interval for the mean sales when advertising is $4.
Confidence Interval Estimate Example
        xi    yi    xi²   yi²   xiyi
        1     1     1     1     1
        2     1     4     1     2
        3     2     9     4     6
        4     2     16    4     8
        5     4     25    16    20
Sum     15    10    55    26    37
Solution Table
$\hat{y} \pm t_{\alpha/2}\, s \sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}}$
$\hat{y} = -.1 + (.7)(4) = 2.7$ (4 is the x to be predicted)
$2.7 \pm (3.182)(.6055) \sqrt{\frac{1}{5} + \frac{(4 - 3)^2}{10}}$
$1.645 \le E(Y) \le 3.755$
Confidence Interval Estimate Solution
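The interval can be reproduced with the Python sketch below. The function name is hypothetical, and the t value 3.182 (α/2 = .025, df = 3) is passed in from the slides rather than computed.

```python
import math

def mean_confidence_interval(xs, ys, x_p, t_crit):
    """Return (low, high) for the mean of y at x = x_p."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = ss_xy / ss_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))
    y_hat = b0 + b1 * x_p
    half_width = t_crit * s * math.sqrt(1 / n + (x_p - x_bar) ** 2 / ss_xx)
    return y_hat - half_width, y_hat + half_width

low, high = mean_confidence_interval([1, 2, 3, 4, 5], [1, 1, 2, 2, 4], x_p=4, t_crit=3.182)
print(f"{low:.3f} <= E(y) <= {high:.3f}")   # prints: 1.645 <= E(y) <= 3.755
```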
Prediction Interval of Individual Value of y at x = xp
$\hat{y} \pm t_{\alpha/2}\, s \sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}}$
Note the extra 1 under the square root!
df = n – 2
Prediction Interval of Individual Value
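[Figure: at x = xp, the expected (mean) y on the population line $E(y) = \beta_0 + \beta_1 x$, the prediction $\hat{y}$ on the sample line $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$, and the individual y we're trying to predict, which differs from the expected y by the random error ε; axes x and y]
Why the Extra ‘S’?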
You’re a marketing analyst for Hasbro Toys.
You find $\hat{\beta}_0 = -.1$, $\hat{\beta}_1 = .7$ and s = .6055.
Ad $    Sales (Units)
1       1
2       1
3       2
4       2
5       4
Predict the sales when advertising is $4. Use a 95% prediction interval.
Prediction Interval Example
        xi    yi    xi²   yi²   xiyi
        1     1     1     1     1
        2     1     4     1     2
        3     2     9     4     6
        4     2     16    4     8
        5     4     25    16    20
Sum     15    10    55    26    37
Solution Table
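$\hat{y} \pm t_{\alpha/2}\, s \sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}}$
$\hat{y} = -.1 + (.7)(4) = 2.7$ (4 is the x to be predicted)
$2.7 \pm (3.182)(.6055) \sqrt{1 + \frac{1}{5} + \frac{(4 - 3)^2}{10}}$
$.503 \le y_4 \le 4.897$
Prediction Interval Solution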
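A matching Python sketch for the prediction interval follows. It differs from the confidence interval calculation only in the extra 1 under the square root. The function name is hypothetical, and the t value 3.182 is again taken from the slides.

```python
import math

def prediction_interval(xs, ys, x_p, t_crit):
    """Return (low, high) for an individual y at x = x_p."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = ss_xy / ss_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))
    y_hat = b0 + b1 * x_p
    # The leading 1 accounts for the random error of the individual observation.
    half_width = t_crit * s * math.sqrt(1 + 1 / n + (x_p - x_bar) ** 2 / ss_xx)
    return y_hat - half_width, y_hat + half_width

low, high = prediction_interval([1, 2, 3, 4, 5], [1, 1, 2, 2, 4], x_p=4, t_crit=3.182)
print(f"{low:.3f} <= y <= {high:.3f}")   # prints: 0.503 <= y <= 4.897
```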
       Dep Var    Pred     Std Err    Low95%    Upp95%    Low95%     Upp95%
Obs    SALES      Value    Predict    Mean      Mean      Predict    Predict
1      1.000      0.600    0.469      -0.892    2.092     -1.837     3.037
2      1.000      1.300    0.332       0.244    2.355     -0.897     3.497
3      2.000      2.000    0.271       1.138    2.861     -0.111     4.111
4      2.000      2.700    0.332       1.644    3.755      0.502     4.897
5      4.000      3.400    0.469       1.907    4.892      0.962     5.837
Obs 4, Pred Value: predicted y when x = 4
Std Err Predict: $S_{\hat{y}}$
Low95% Mean and Upp95% Mean: confidence interval
Low95% Predict and Upp95% Predict: prediction interval
Interval Estimate Computer Output
[Figure: confidence interval and prediction interval bands around the fitted line $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$, with $\bar{x}$ marked; axes x and y]
Confidence Intervals vs. Prediction Intervals
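To compare the two kinds of interval numerically, the sketch below computes the half-width of each at every advertising level in the example. Names are hypothetical and the t value 3.182 is taken from the slides.

```python
import math

ad = [1, 2, 3, 4, 5]
sales = [1, 1, 2, 2, 4]
T_CRITICAL = 3.182   # alpha/2 = .025, df = 3, as quoted in the slides

n = len(ad)
x_bar, y_bar = sum(ad) / n, sum(sales) / n
ss_xx = sum((x - x_bar) ** 2 for x in ad)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(ad, sales))
b1 = ss_xy / ss_xx
b0 = y_bar - b1 * x_bar
s = math.sqrt(sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(ad, sales)) / (n - 2))

def half_widths(x_p):
    """Return (confidence, prediction) interval half-widths at x = x_p."""
    core = 1 / n + (x_p - x_bar) ** 2 / ss_xx
    return (T_CRITICAL * s * math.sqrt(core),
            T_CRITICAL * s * math.sqrt(1 + core))

for x_p in ad:
    ci, pi = half_widths(x_p)
    print(x_p, round(ci, 3), round(pi, 3))
```

In this data set both intervals are narrowest at the mean of x (here 3) and widen as xp moves away from it, and the prediction interval is wider than the confidence interval at every x.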