Modeling a Linear Relationship
Lecture 44
Secs. 13.1 – 13.3.1
Tue, Apr 24, 2007
Bivariate Data
Data is called bivariate if each observation consists of a pair of values (x, y).
x is the explanatory variable, also called the independent variable.
y is the response variable, also called the dependent variable.
Scatterplots
Scatterplot – A display in which each observation (x, y) is plotted as a point in the xy plane.
Example
Draw a scatterplot of the following data of calories vs. cholesterol in Subway sandwiches.
Calories (x) 350 290 330 290 320 370 280 290 310 230
Cholesterol (y) 50 20 45 15 35 50 20 25 20 0
Example
[Figure: scatterplot of cholesterol (vertical axis, 0 to 50) against calories (horizontal axis, 200 to 400).]
Example
Does there appear to be a relationship? How can we tell?
TI-83 - Scatterplots
To set up a scatterplot:
Enter the x values in L1.
Enter the y values in L2.
Press 2nd STAT PLOT.
Select Plot1 and press ENTER.
The Stat Plot display appears.
Select On and press ENTER.
Under Type, select the first icon (a small image of a scatterplot) and press ENTER.
For XList, enter L1.
For YList, enter L2.
For Mark, select the one you want and press ENTER.
To draw the scatterplot:
Press ZOOM. The Zoom menu appears.
Select ZoomStat (#9) and press ENTER. The scatterplot appears.
Press TRACE and use the arrow keys to inspect the individual points.
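For readers without a TI-83, the same plot can be roughed out in plain Python. This is only a minimal sketch: the function name text_scatter, the grid size, and the axis limits (200 to 400 calories, 0 to 50 cholesterol) are assumptions chosen for illustration, not part of the lecture.

```python
# Subway sandwich data from the example above.
calories = [350, 290, 330, 290, 320, 370, 280, 290, 310, 230]
cholesterol = [50, 20, 45, 15, 35, 50, 20, 25, 20, 0]


def text_scatter(xs, ys, x_min, x_max, y_min, y_max, cols=41, rows=11):
    """Return the plot as a list of strings; the first string is the top row."""
    grid = [[" "] * cols for _ in range(rows)]
    for x, y in zip(xs, ys):
        # Map each value to the nearest column/row of the character grid.
        col = round((x - x_min) * (cols - 1) / (x_max - x_min))
        row = round((y - y_min) * (rows - 1) / (y_max - y_min))
        grid[rows - 1 - row][col] = "*"
    return ["".join(line) for line in grid]


plot = text_scatter(calories, cholesterol, 200, 400, 0, 50)
for line in plot:
    print("|" + line)
print("+" + "-" * 41)
```

With these settings each column spans 5 calories and each row spans 5 units of cholesterol, so the ten observations land in ten different cells.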
Describing a Linear Relationship
How would we describe this relationship?
[Figure: scatterplot of cholesterol against calories.]
Linear Association
Draw (or imagine) an oval around the data set.
If the oval is tilted, then there is some linear association.
If the oval is tilted upwards from left to right, then there is positive association.
If the oval is tilted downwards from left to right, then there is negative association.
If the oval is not tilted at all, then there is no linear association.
[Figure: Positive Linear Association (axes x and y).]
[Figure: Negative Linear Association (axes x and y).]
[Figure: No Linear Association (axes x and y).]
Strong vs. Weak Association
The association is strong if the oval is narrow.
The association is weak if the oval is wide.
[Figure: Strong Positive Linear Association (axes x and y).]
[Figure: Weak Positive Linear Association (axes x and y).]
Example
[Figure: scatterplot of cholesterol against calories.]
Describing the Relationship
There appears to be a strong positive linear association between calories and cholesterol in Subway sandwiches.
Example
Draw a scatterplot of the following data.
x y
2 3
3 5
5 9
6 12
9 16
Simple Linear Regression
To quantify the linear relationship between x and y, we wish to find the equation of the line that “best” fits the data.
Typically, there will be many lines that all look pretty good.
How do we measure how well a line fits the data?
Measuring the Goodness of Fit
Which line better fits the data?
[Figure: scatterplot with a line drawn through it (axes x and y).]
Start with the scatterplot.
Draw any line through the scatterplot.
Measure the vertical distances from every point to the line.
Each of these represents a deviation, called a residual, from the line.
[Figure: the scatterplot and line, with one vertical distance labeled e.]
Residuals
The ith residual – The difference between the observed value of yi and the predicted, or expected, value of yi.
Use yi^ for the predicted yi.
The formula for the ith residual is
$e_i = y_i - \hat{y}_i$
Residuals
Notice that the residual is positive if the data point is above the line and it is negative if the data point is below the line.
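The sign rule can be checked with a short Python sketch. It assumes the five-point data set and the line y^ = -1 + 2x that this lecture uses in its worked example. The function name residuals is hypothetical.

```python
def residuals(xs, ys, a, b):
    """Return e_i = y_i - (a + b*x_i) for each observation."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]


xs = [2, 3, 5, 6, 9]
ys = [3, 5, 9, 12, 16]
e = residuals(xs, ys, a=-1, b=2)

for x, y, r in zip(xs, ys, e):
    # Positive residual: point above the line; negative: point below it.
    position = "above" if r > 0 else "below" if r < 0 else "on"
    print(f"({x}, {y}): residual {r}, point is {position} the line")
```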
Measuring the Goodness of Fit
[Figure: the ith residual ei, drawn at xi as the vertical distance between the observed yi and the predicted yi^ on the line (axes x and y).]
Find the sum of the squared residuals.
The smaller the sum of squared residuals, the better the fit.
Example
Consider the data points
x y
2 3
3 5
5 9
6 12
9 16
[Figure: scatterplot of the five data points (x axis from 2 to 9, y axis marked at 5, 10, and 15).]
Least Squares Line
Let’s see how good the fit is for the line
y^ = -1 + 2x,
where y^ represents the predicted value of y, not the observed value.
Sum of Squared Residuals
Begin with the data set.
Compute the predicted y, using y^ = -1 + 2x.
Compute the residuals, y – y^.
Square the residuals.
Find the sum of the squared residuals.
x    y     y^    y – y^    (y – y^)^2
2    3     3     0         0
3    5     5     0         0
5    9     9     0         0
6    12    11    1         1
9    16    17    -1        1
$\mathrm{SSE} = \sum (y - \hat{y})^2 = 2.00$
Least Squares Line
Least squares line – The line for which the sum of the squares of the residuals is as small as possible.
The least squares line is also called the line of best fit or the regression line.
Regression Line
We will write the regression line as
$\hat{y} = a + bx$
a is the y-intercept. b is the slope.
This is the usual slope-intercept form
$\hat{y} = mx + b$
with the two terms rearranged and relabeled.
TI-83 – Computing Residuals
It is not hard to compute the residuals and the sum of their squares on the TI-83. (Later, we will see a faster method.)
Enter the x-values in list L1 and the y-values in list L2.
Compute a + b*L1 and store in list L3 (y^ values).
Compute (L2 – L3)^2. This is a list of the squared residuals.
Compute sum(Ans). This is the sum of the squared residuals.
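The same list operations can be mirrored in Python. This is a sketch only: the variable names L1, L2, and L3 are borrowed from the calculator lists, and the line y^ = -1 + 2x from the earlier example is assumed for a and b.

```python
# Lists named after the TI-83 lists used in the steps above.
L1 = [2, 3, 5, 6, 9]       # x values
L2 = [3, 5, 9, 12, 16]     # observed y values
a, b = -1, 2               # intercept and slope of y^ = -1 + 2x

# Predicted values: a + b*L1, stored in L3.
L3 = [a + b * x for x in L1]

# Squared residuals: (L2 - L3)^2, element by element.
squared = [(y - y_hat) ** 2 for y, y_hat in zip(L2, L3)]

# Sum of the squared residuals.
sse = sum(squared)

print(L3)        # [3, 5, 9, 11, 17]
print(squared)   # [0, 0, 0, 1, 1]
print(sse)       # 2
```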
Sum of Squared Residuals
Now let’s see how good the fit is for the line
y^ = -0.5 + 1.9x.
We will compute the sum of squared residuals, SSE.
Sum of Squared Residuals
Begin with the data set.
Compute the predicted y, using y^ = -0.5 + 1.9x.
Compute the residuals, y – y^.
Compute the squared residuals.
Find the sum of the squared residuals.
x    y     y^      y – y^    (y – y^)^2
2    3     3.3     -0.3      0.09
3    5     5.2     -0.2      0.04
5    9     9.0     0.0       0.00
6    12    10.9    1.1       1.21
9    16    16.6    -0.6      0.36
$\mathrm{SSE} = \sum (y - \hat{y})^2 = 1.70$
Sum of Squared Residuals
We conclude that y^ = -0.5 + 1.9x is a better fit than y^ = -1 + 2x.
Is it the best fit?
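The comparison of the two candidate lines can be reproduced with a short Python sketch. It uses decimal.Decimal so that values such as 1.9 and 0.09 are handled exactly, as in the hand computation. The function name sse is an assumption made for illustration.

```python
from decimal import Decimal

xs = [2, 3, 5, 6, 9]
ys = [3, 5, 9, 12, 16]


def sse(a, b):
    """Sum of squared residuals for the line y^ = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


sse_first = sse(Decimal("-1"), Decimal("2"))        # y^ = -1 + 2x
sse_second = sse(Decimal("-0.5"), Decimal("1.9"))   # y^ = -0.5 + 1.9x

print("SSE for y^ = -1 + 2x:    ", sse_first)
print("SSE for y^ = -0.5 + 1.9x:", sse_second)
print("Second line fits better: ", sse_second < sse_first)
```

A smaller SSE only shows that one line beats the other. It does not by itself show that no other line does better.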
Sum of Squared Residuals
[Figure: the five data points with the line y^ = -1 + 2x.]
[Figure: the five data points with the line y^ = -0.5 + 1.9x.]
Example
For all the lines that one could draw through this data set,
it turns out that 1.70 is the smallest possible value for the sum of the squares of the residuals.
x y
2 3
3 5
5 9
6 12
9 16
Example
Therefore,
y^ = -0.5 + 1.9x
is the regression line for this data set.
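One informal way to check this claim is to try many candidate lines and keep the one with the smallest SSE. The sketch below is a brute-force search over an assumed grid: intercepts from -2.0 to 1.0 and slopes from 1.0 to 3.0, both in steps of 0.1. It is a sanity check on that grid, not a proof and not the faster method mentioned earlier.

```python
xs = [2, 3, 5, 6, 9]
ys = [3, 5, 9, 12, 16]


def sse(a, b):
    """Sum of squared residuals for the line y^ = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Candidate (intercept, slope) pairs on the assumed grid.
candidates = [(i / 10, j / 10)
              for i in range(-20, 11)    # intercepts -2.0, -1.9, ..., 1.0
              for j in range(10, 31)]    # slopes 1.0, 1.1, ..., 3.0

best_a, best_b = min(candidates, key=lambda line: sse(*line))
best_sse = sse(best_a, best_b)

print(f"best line on the grid: y^ = {best_a} + {best_b}x, SSE = {best_sse:.2f}")
```

On this grid the search settles on a = -0.5 and b = 1.9 with an SSE of about 1.70, which agrees with the regression line stated above.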
Prediction
Use the regression line to predict y when x = 4, x = 7, and x = 20.
Interpolation – Using an x value within the observed extremes of x values to predict y.
Extrapolation – Using an x value beyond the observed extremes of x values to predict y.
Interpolation vs. Extrapolation
Interpolated values are more reliable than extrapolated values.
The farther out the values are extrapolated, the less reliable they are.
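The three predictions can be sketched in Python, using the regression line y^ = -0.5 + 1.9x and the observed x values 2 through 9 from the example. The helper names predict and kind are hypothetical.

```python
xs = [2, 3, 5, 6, 9]     # observed x values
a, b = -0.5, 1.9         # regression line y^ = -0.5 + 1.9x


def predict(x):
    """Predicted y for a given x on the regression line."""
    return a + b * x


def kind(x):
    """Classify a prediction by whether x lies within the observed x range."""
    return "interpolation" if min(xs) <= x <= max(xs) else "extrapolation"


for x in (4, 7, 20):
    print(f"x = {x}: predicted y = {predict(x):.1f} ({kind(x)})")
```

Here x = 4 and x = 7 lie between the smallest and largest observed x, so those predictions are interpolations. x = 20 lies far beyond the largest observed x, so that prediction is an extrapolation and is the least reliable of the three.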