
Chapter 4: Analyzing Bivariate Data with Fathom

______________________________________________________________________________ Learning to Teach Mathematics with Technology: An Integrated Approach Page 67 Module One: Data Analysis and Probability Draft Modified 9/7/2007

Summary: Building from ideas introduced in Chapter 3, teachers continue to analyze automobile data using Fathom to look for relationships between two quantitative attributes. They use the concept of variation and deviations from the mean in each univariate attribute to help conceptualize correlation and least squares regression as ways of describing the relationship in the bivariate data and developing a linear model for making predictions. The teachers will consider pedagogical issues concerning the difficulties students may have in analyzing bivariate data and the benefits and drawbacks of using conceptual underpinnings from univariate analysis to develop bivariate analysis techniques.

Objectives:

Mathematical: Teachers will be able to
• use concepts and techniques used in univariate data analysis to understand how two attributes co-vary and the techniques used to analyze two quantitative variables in bivariate data analysis;
• describe how to create a linear model using the method of least squares;
• describe the strength of the relationship between two quantitative attributes using the correlation coefficient;
• analyze the appropriateness of a linear function as a model for bivariate data using a residual plot;
• use the coefficient of determination to explain the amount of variation for a predicted value that can be attributed to a domain value by a least squares line;
• determine whether a linear function is an appropriate model for a set of data.

Technological: Teachers will be able to use Fathom to
• create box plots and scatterplots;
• plot functions and values on graphs;
• create a movable line and show squares;
• create a least-squares line of best fit;
• create a residual plot;
• enter formulas;
• use the summary table to compute statistical measures.

Pedagogical: Teachers will
• discuss the benefits and drawbacks related to different representations of data, using dynamic files to conceptualize correlation and linear models;
• become familiar with some difficulties students have with creating and using a least squares linear model;
• consider benefits and drawbacks of using tasks to assist students in reasoning about how to use univariate data analysis techniques to develop a better understanding of bivariate data analysis techniques.

Prerequisites: univariate analysis techniques, understanding mean and standard deviation as measures of center and spread, dot plots and box plots

Vocabulary: covariation, bivariate data, scatterplot, predictor variable, response variable, correlation coefficient, residuals, Sum of squares, Least Squares regression line, coefficient of determination, interpolation, extrapolation, influential points



Technology Files: 2006Vehicles.ftm, Correlation.ftm, Outlier.ftm

Emergency Technology Files: 2006Vehicles_Ch4Sect3.ftm, 2006Vehicles_Ch4Sect4.ftm, 2006Vehicles_Ch4Sect5.ftm, 2006Vehicles_Ch4Sect6.ftm

Required Materials: Fathom v. 2



Chapter 4: Analyzing Bivariate Data

While measures of center and spread provide us with important information about the distribution of a single variable, often more than one variable is collected, and relationships between two or more variables are examined. In Chapter 3, we used two attributes in the 2006 Vehicle data to answer a question about which engine type seemed to have the best City mpg performance. To answer that question, we were using bivariate data with one quantitative attribute (City mpg) and one qualitative attribute (Engine type). Bivariate data is the term used to describe data that have two variables for each observation.

In this lesson, we will focus on ways to examine relationships between two quantitative attributes. When examining two quantitative measures in a data set, our attention is on how the measures co-vary. Thus for bivariate data, covariation involves correspondence of the variation possible in each variable. To help students conceptualize covariation, we are going to build on what they already know about partitioning data sets and using measures of center and spread to describe univariate distributions.



Section 1: Examining Relationships Between Two Quantitative Variables

Open the 2006Vehicle.ftm file and clear all previous graphs. There are several quantitative attributes in this data set. When considering a purchase of a new vehicle, a buyer is likely interested in the gas mileage for both city and highway. When looking at the 2006 vehicle data, we can ask the question:

Is there a relationship between City mpg and Hwy mpg for this set of vehicles?

We can begin by examining the distributions of the City mpg and the Hwy mpg on separate dot plots. Drag down two empty Graph objects and create dot plots for each attribute.

Figure 4.1

We can utilize Fathom’s linked representations to informally investigate the nature of the relationship between the two distributions by selecting a portion of the cases in the City distribution and noticing the corresponding position of those cases in the Hwy distribution. To select a portion of cases in a graph,

1. in the graph window, click and drag to draw a dashed rectangle around a subset of points.

2. When the mouse is released, the selected points will appear red in the graph. However, because the representations are all linked, the same cases will be highlighted in all other open representations of this data.

Tech Tip: Unless one of your goals is to investigate the effects of changing data values, remember to choose Prevent Changing Values in Graphs from the Collection menu.

Tech Tip: To deselect cases in a graph, click on any “white space” in the graph.

Figure 4.2



Figure 4.3

FOCUS ON MATHEMATICS
Q1. What do you anticipate might be a reasonable relationship between the City mpg and Hwy mpg?
Q2. Select the cases with the highest City mpg. Where are the corresponding values for the Hwy mpg located in its dot plot? Identify the specific vehicles represented by these points.
Q3. Change the graphs for the City and Hwy to be displayed as box plots. Click on the lower whisker in one of the box plots and notice the location of the highlighted cases in the other box plot. Repeat this several times by clicking on different parts of a box plot. What do you notice?
Q4. Compare what you noticed in your exploration in Q3 to the prediction you made in Q1.

FOCUS ON PEDAGOGY
Q5. Explain how it might be helpful to use the linked representations of two dot plots or box plots to assist students in examining the covariation between two quantitative attributes.

Having students look for the corresponding position of subsets of data in two univariate distributions (e.g., see Figure 4.3) can help students notice initial patterns and relationships between attributes and motivate the notion of how to display the data to examine the bivariate distribution. We want to help students transition to a representation in two dimensions, rather than representations like dot plots and box plots that are only in one dimension. The two-dimensional space commonly used to represent bivariate data is a scatterplot. Scatterplots are an efficient way to examine whether there is an association between two quantitative attributes.

We are interested in examining whether there is an association between City mpg and Hwy mpg. City and Hwy mpg are both outcome measures of a vehicle’s performance, where one is not dependent on the other. We are interested in whether there is a relationship between these attributes; however, unlike some situations, we are not assuming that there is a cause-effect, independent-dependent relationship between these two quantities. Thus, it does not matter which attribute we use as our input (x) or predictor variable, and which one we use as the output (y) or response variable. In creating a scatterplot, the attribute assigned as the predictor variable is represented on the x-axis while the response variable is on the y-axis. For our example, we will assign City mpg as the predictor variable and Hwy mpg as the response variable.

FOCUS ON PEDAGOGY
Q6. Create several pairs of variables that can help students understand the difference between an association between two variables that are independent-dependent or where both are outcome measures and would only have a predictor-response relationship. Explain why each pair is either independent-dependent or a predictor-response example.

To help students transition from examining the two attributes as distributions in one dimension to inscribing the data in two dimensions, we are going to re-orient the univariate distributions such that the distribution of the predictor variable (City) is horizontal and the response variable (Hwy) is vertical. To move an attribute from the x-axis to the y-axis,

1. in the graph window, drag the attribute label (Hwy) from the x-axis and drop it on the y-axis.

2. The graph will be redrawn with the distribution displayed along the y-axis.

Figure 4.4



Before continuing, change the window sizes and orient the two box plots as shown in Figure 4.5 so as to leave room to add another graph object to display the scatterplot.

Figure 4.5

The box plots for each attribute show how each distribution is partitioned by the quartiles; the second quartile, shown as the line inside the box, is the median. To analyze how the two attributes co-vary, we are going to inscribe the data as a scatterplot where each case icon in the graph will represent the ordered pair (City, Hwy) for that particular vehicle. To create a scatterplot,

1. drag down an empty Graph object to the workspace.

2. Click and drag the attribute representing the predictor variable to the x-axis.

3. Click and drag the attribute representing the response variable to the y-axis.

Figure 4.6



Adjust the scales of the three graphs until they are aligned (see Figure 4.7). It will also be helpful to display the location of the mean for the City and Hwy on each graph.

Figure 4.7

The display of the horizontal and vertical lines representing the mean values can help students examine how each data point varies from the means. The lines can also serve as a reference to notice the placement of data points in comparison to the general trend of data points. When describing relationships between two variables, we typically describe the form (linear, exponential, etc.), direction (positive or negative), and strength (weak, moderate, or strong) of the general trend and relationship.

FOCUS ON MATHEMATICS
Q7. Explain why you use the command Plot Value to display the mean City mpg and use the command Plot Function to display the mean Hwy mpg in a scatterplot.
Q8. Use form, direction and strength to describe the relationship between City and Hwy mpg.
Q9. Describe the location of the data points in relation to the mean City and mean Hwy mpg. What does this tell you about the general trend of the data?
Q10. Describe a typical City and Hwy mpg for this set of vehicles. Explain how you determined what you would consider as “typical”.

Tech Tip: To display the mean on a graph, select the graph window then choose Plot Value (or Plot Function for the attribute on the y-axis) under the Graph menu. Use the formula mean(attribute_name) (e.g., mean(City)).

Tech Tip: Remember that you can click on data points or select a cluster of data points in a graph and see those points highlighted in all graphs and in the table.



FOCUS ON PEDAGOGY
Q11. How can displaying the means in a scatterplot help or hinder students’ ability to think about variation in bivariate data?

In the next section, we will more closely examine how to quantify the strength of a linear relationship between two quantitative variables.



Section 2: Conceptualizing Correlation

Based on the scatterplot of City mpg versus Hwy mpg, there appears to be a relationship between the two attributes, and the data seem like they could be modeled using a linear function with positive slope. Thus, there seems to be a correlation between the two attributes, meaning that there seems to be a general trend and an association between the variation in each attribute.

A correlation coefficient, usually represented by the letter r, is a measure used to describe the strength and direction of a linear relationship between two variables. The most common formula used in textbooks and technology tools is the Pearson Product Moment correlation coefficient. There are several equivalent forms of the formula used to compute this measure, all of which are based on how the data points vary from the mean of the predictor (x) and response (y) variables. For example, consider the formula below as one expression of Pearson’s correlation coefficient.

$$ r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \, \sum (y_i - \bar{y})^2}} $$
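As a rough numerical companion to this formula, the sketch below computes r directly from the deviations about each mean. The function name `pearson_r` and the five (City, Hwy) pairs are invented for illustration; they are not taken from the 2006 Vehicle file, and Fathom's own computation is not shown here.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation computed from deviations about each mean."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator: sum of the products of paired deviations
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator: square root of the product of the summed squared deviations
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical (City, Hwy) mpg pairs, made up for this example only
city = [18, 20, 22, 24, 26]
hwy = [25, 27, 30, 31, 35]

r = pearson_r(city, hwy)
print(round(r, 3))
```

A point contributes positively to the numerator when it lies above both means or below both means, which is one way to connect r to the mean lines drawn on the scatterplot in Section 1.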

Notice that the expressions in both the numerator and denominator are based on how a data value deviates from the mean, including squared deviations.

Minimize the 2006Vehicle window. Open the file Correlation.ftm. To help visualize how the correlation coefficient is a measure of covariation and the spread of the bivariate data, we are going to use this interactive diagram as shown in Figure 4.8. In this diagram, we have a slider to control the value of r, and are able to view how changes in r affect the spread of the data points in the scatterplot. A visualization tool such as this can help students get a sense of how r is a measure of the covariation between the variables.

Figure 4.8

Tech Tip: Fathom software includes several pre-created documents that are useful for exploring various concepts. These can be found in the Sample Documents folder. The Correlation.ftm file is adapted from one of these sample documents.



FOCUS ON MATHEMATICS
Q12. Drag the slider for r and observe how the scatterplot changes. Based on your exploration, fill in the blank in each of the following statements.
a. A correlation coefficient value of ______ suggests a perfect increasing linear association between two variables.
b. A correlation coefficient value of ______ suggests a perfect decreasing linear association between two variables.
c. A correlation coefficient value of ______ suggests there is no linear association between the two variables.

Q13. Use the slider to create a scatterplot in the Correlation file that can help you estimate a value for the correlation coefficient for the relationship displayed in the 2006Vehicle file between City mpg and Hwy mpg.

FOCUS ON PEDAGOGY
Q14. Discuss the benefits and drawbacks of using this interactive diagram to introduce correlation as a useful measure for describing the strength of a relationship between two variables.

Generally, a correlation coefficient with absolute value greater than 0.8 is an indicator of a strong linear relationship, while a correlation coefficient with absolute value less than 0.5 is considered weak. The scale in Figure 4.9 can be helpful in interpreting the r value. However, it is important not to use the correlation coefficient as the sole indicator of the strength of a linear relationship. Remember that the correlation coefficient is computed using means and deviations from the mean, and we know how easily an outlier can affect a mean!

Figure 4.9

Pedagogy Tip: Students can test their ability to estimate correlation values for a given scatterplot using the Java applet at http://www.stat.uiuc.edu/courses/stat100/java/GCApplet/GCAppletFrame.html



Section 3: Using a Line to Describe a Relationship Between Two Quantitative Variables¹

Return to the 2006Vehicle.ftm file. The scatterplot of Hwy vs. City mpg suggests that there may be a linear relationship between the two variables. That is, if we know a vehicle’s City mpg, then we can use a linear function rule to predict an approximate value for that vehicle’s Hwy mpg. We can use Fathom to compute the correlation coefficient, r. To compute r,

1. drag an empty Summary object to the workspace.

2. Click and drag a quantitative attribute label onto the Summary table. Once the cursor is over the table, a down arrow and a right arrow will appear. Drop the quantitative attribute below the down arrow.

3. Click and drag a second quantitative attribute label onto the Summary table to the right of the right arrow. The default measure that will be displayed is the correlation between the two attributes.

Figure 4.10

Figure 4.11

FOCUS ON MATHEMATICS
Q15. Compare the calculated correlation coefficient with the one you estimated using the Correlation.ftm file.
Q16. What does the value of the correlation coefficient imply about the relationship between City and Hwy mpg?

1 The technology file “2006Vehicle_Ch4Sect3” is available for students to use for Section 3 if they were unable to complete Section 1 with the technology.

Since we have a high correlation value, it makes sense to try to use a linear function to model the vehicle data. The model represents an estimate of the response variable (often denoted as y) given a value for the predictor variable (often denoted as x). This model could be thought of as a measure of center for this bivariate distribution. To create a movable line to fit bivariate data,

1. click on the graph and under the Graph menu, select Add Movable Line. The graph of a line appears in the scatterplot with its equation at the bottom of the graph window.

2. Dragging the line by its middle changes the intercept (translates the line) while dragging by either end changes the slope (rotates the line).

FOCUS ON MATHEMATICS
Q17. Insert a movable line and adjust it so that you feel it best models the data. Describe the method you used for determining where to place the line to model the data.
Q18. Interpret the slope and y-intercept in the equation of your linear model.
Q19. We can use the equation of a line to estimate a value for the response variable for an input for the predictor variable. For example, a Jeep Liberty gets 22 mpg in the City. If we think of a linear model for this data with an equation, Hwy = Slope*City + Y-intercept, then we can use this equation to estimate a predicted Hwy mpg when the City mpg is 22. Based on where you placed the movable line, use your equation with the specific values for slope and y-intercept to predict the Hwy mpg for a vehicle with a City mpg of 31.
Q20. Is the slope of your line the same as the value of the correlation coefficient? Should they be the same? Why or why not?

FOCUS ON PEDAGOGY
Q21. How can the ability to overlay a movable line on a scatterplot help or hinder students’ understanding of the use of a linear equation to model a relationship between two variables?
Q22. One of the difficulties in using learning activities such as this is that students do not have confidence in their solution when their results differ from fellow students or from a teacher. Think of two strategies that you could employ to help students understand that differences in solutions are acceptable and expected in the context of trying to estimate a linear model. How could you capitalize on this difficulty should it arise?



When we try to create a linear model by visually inspecting a graph, it is unlikely that two different people will generate the exact same line. If we have two or more different lines, how do we determine which is really best? There are different methods that one might use for creating a linear model and analyzing how good of a fit it is. With each method, the goal is to minimize the distance each predicted value is from the actual data value. Similar to how we examined deviations from the mean with univariate data, we can examine deviations from a line with bivariate data. One common method that is used for finding the best linear model is to minimize the deviations of the actual data points from the predicted values. Visually, these are the vertical distances between the actual data points and the line (Figure 4.12).

Figure 4.12

We are trying to minimize the vertical distance between an actual data value and the predicted value (output) that would fall on the line for the same input value (x-coordinate) of the actual data point. Thus, we are comparing the y-coordinate for an actual data point in the collection to its predicted y-coordinate that would result from the linear function. The difference, y-coordinate of actual data point − y-coordinate of predicted value, is called a residual.

Recall that for univariate data, we described deviation from the mean using variance and standard deviation (Section 4 of Chapter 3), which are both based on summing the squared deviations. We can use a similar method with our bivariate data where we sum the square of each residual. For bivariate data, this sum is called the Sum of squares (or residual sum of squares), and represents a measure of variation. A linear model that minimizes the sum of squares is called the Least Squares regression line.

Figure 4.13
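The residual and Sum of squares described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data pairs and the two candidate lines are made up, and the helper names `residuals` and `sum_of_squares` are not part of Fathom.

```python
def residuals(xs, ys, slope, intercept):
    """Residual = actual y minus the y predicted by the line at the same x."""
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]

def sum_of_squares(xs, ys, slope, intercept):
    """Sum of the squared residuals for one candidate line."""
    return sum(e ** 2 for e in residuals(xs, ys, slope, intercept))

# Hypothetical (City, Hwy) mpg pairs, made up for this example only
city = [18, 20, 22, 24, 26]
hwy = [25, 27, 30, 31, 35]

# Two candidate lines, as two people might place a movable line by eye
ss_a = sum_of_squares(city, hwy, slope=1.0, intercept=8.0)
ss_b = sum_of_squares(city, hwy, slope=1.2, intercept=3.2)

print(ss_a, ss_b)  # the line with the smaller value fits better by this criterion
```

Comparing the two sums gives a concrete way to decide between two lines that both look reasonable on the graph.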



To visualize the Sum of squares,

1. click on the graph that has the movable line and select Show Squares from the Graph menu. Notice the sum of the squares is computed and displayed below the equation of the line.

2. Adjusting the movable line will change the squares and the value of the Sum of squares.

You should notice that there are tan and maroon squares shown in the scatterplot. Since mean(Hwy) was used with the Plot Function command, Fathom is also displaying the squared deviations from the horizontal line Hwy = mean(Hwy). Before continuing, it is important to remove the mean(Hwy) and mean(City) lines from the scatterplot. To remove a plotted value or function from a graph,

1. point to the plotted value or function you wish to remove and right-click with the mouse.

2. In the pop-up menu, choose Cut (or Clear) Formula. The graph of the value or function should be removed from the graph.

Figure 4.15 displays the moveable line that is being used to estimate a linear model for estimating Hwy mpg given City mpg. The squares represent the squared residuals for how much the line over or under predicts for each data point.

Figure 4.15

Figure 4.14



FOCUS ON MATHEMATICS
Q23. Manipulate the movable line to explore whether it is possible to create a line that is far from several points but still has a small sum of squares. Explain.
Q24. Compare and contrast the method of squaring residuals for bivariate data with the calculation of the standard deviation in univariate data.
Q25. Adjust the movable line so as to minimize the sum of squares. Record your new equation for the linear model. Use your new linear model to compare the predicted Hwy mpg and the actual Hwy mpg for the following two vehicles: a) Honda Accord Standard engine vehicle, and b) Toyota Prius Hybrid engine. Does your line under-predict or over-predict for each of these vehicles? By how much?

FOCUS ON PEDAGOGY Q26. Describe the benefits and drawbacks of building on what students already know about deviations from a mean with univariate data and standard deviation to find a linear model by minimizing the sum of squared residuals.



Section 4: Visualizing the Residuals²

In addition to finding the smallest Sum of squares, a plot of the residuals is helpful in deciding whether your line is a good model for the data. A residual plot displays the ordered pairs (x-value, y-value − predicted y-value). To view a plot of the residuals,

1. click on a scatterplot that has a linear model displayed.

2. Under the Graph menu, choose Make Residual Plot.

3. The residual plot will be displayed at the bottom of the graph window.

Figure 4.16

In general, we want the residuals to be near zero, and the plotted points should be randomly dispersed above and below the horizontal line y=0 and not reveal any trends or patterns.
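One crude numeric stand-in for eyeballing a residual plot is sketched below: it builds the ordered pairs (x, residual) and then checks whether the residuals still rise or fall with x. The data, the two lines, and the helper names `residual_points` and `leftover_trend` are assumptions made for illustration. The check only detects a leftover linear trend; a curved pattern would still need to be judged from the plot itself.

```python
def residual_points(xs, ys, slope, intercept):
    """Ordered pairs (x, actual y - predicted y), as in a residual plot."""
    return [(x, y - (slope * x + intercept)) for x, y in zip(xs, ys)]

def leftover_trend(points):
    """Sum of (x - mean x) * residual; near zero means no linear trend remains."""
    x_bar = sum(x for x, _ in points) / len(points)
    return sum((x - x_bar) * e for x, e in points)

# Hypothetical (City, Hwy) mpg pairs, made up for this example only
city = [18, 20, 22, 24, 26]
hwy = [25, 27, 30, 31, 35]

good = residual_points(city, hwy, slope=1.2, intercept=3.2)
tilted = residual_points(city, hwy, slope=0.8, intercept=12.0)

print(leftover_trend(good))    # close to zero: residuals scatter around y = 0
print(leftover_trend(tilted))  # clearly positive: residuals climb as City increases
```

A line that is too flat leaves residuals that move from negative to positive across the plot, which is the kind of pattern the paragraph above warns against.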

FOCUS ON MATHEMATICS
Q27. Consider the residual plot for your linear model. If you continue to adjust the movable line, you should notice the residual plot update accordingly. What does the residual plot reveal about the usefulness of your linear model for predicting Hwy mpg for various vehicles?

2 The technology file “2006Vehicle_Ch4Sect4” is available for students to use for Section 4 if they were unable to complete Section 3 with the technology.



Q28. A student placed their movable line in the scatterplot for the 2006 Vehicle data that resulted in the following residual plot.

Sketch the location of the predicted linear model based on the residual plot above in the following graph.

FOCUS ON PEDAGOGY Q29. Describe some of the conceptual difficulties students may have in interpreting and using the residual plot. How will you help them understand the residual plot and its usefulness in analyzing a linear model?



Section 5: The Least Squares Regression Line³

While we can use the techniques of minimizing the sum of squares and viewing the residual plot to help find a really good linear model, technologies like Fathom can easily compute a linear model using the Least Squares method. To find the least squares line,

1. click on a graph displaying a scatterplot.

2. Under the Graph menu, select Least-Squares Line. The Least-Squares regression line computed by Fathom will appear on the graph, along with the equation for this line and the value of r².

The square of the correlation coefficient, r², is called the coefficient of determination and can be interpreted as the proportion of variation in the response variable that can be attributed to the variation in the predictor variable by the least squares line. As the value gets closer to 1, the variation is better defined by the predictor variable, and increasingly accurate predictions for the response variable can be assumed. The difference between r² and 1 (1 − r²) indicates the proportion of the variation in the response variable that is attributed to other variables besides the predictor variable.
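The "proportion of variation" reading of r² can be checked numerically. The sketch below, which assumes made-up data and hypothetical helper names, fits the least squares line and then computes 1 minus the ratio of the residual sum of squares to the total sum of squared deviations of the response from its mean. For a least squares line this equals the square of the correlation coefficient.

```python
def least_squares(xs, ys):
    """Return (slope, intercept) of the least squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def r_squared(xs, ys):
    """Proportion of the variation in y accounted for by the least squares line."""
    slope, intercept = least_squares(xs, ys)
    y_bar = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Hypothetical (City, Hwy) mpg pairs, made up for this example only
city = [18, 20, 22, 24, 26]
hwy = [25, 27, 30, 31, 35]

r2 = r_squared(city, hwy)
print(round(r2, 3))
```

The quantity 1 − r² is then simply the residual sum of squares divided by the total sum of squares, the share of variation the line leaves unexplained.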

FOCUS ON MATHEMATICS
Q30. Compare the function rule for the least squares linear model with the function rule for your estimated linear model (your movable line).
Q31. What is the predicted Hwy mpg for the Honda Accord Standard and the Toyota Prius Hybrid using the least squares linear model? How do these compare to the predicted Hwy mpg for both vehicles using your estimated linear model found with the movable line (refer to your solution of Q25)?
Q32. Interpret the coefficient of determination for this least squares linear model.

Now that we have the least squares line, we can remove the Movable line and the Residual Plot from the graph. To remove a movable line from a graph,

1. with the graph selected, under the Graph menu, choose Remove Moveable Line.

2. The moveable line and its associated residual plot should be removed.

3 The technology file “2006Vehicles_Ch4Sect5.ftm” is available for students to use for Section 5 if they were unable to complete through Section 4 with the technology.

Tech Tip: If a moveable line and residual plot are currently displayed when the Least Squares line is added to a graph, the Residual plot will still be displaying the residuals for the moveable line, not the Least Squares line.



The least squares line is computed using means and standard deviations for each variable, and the correlation coefficient r. To help visualize the location of the least squares line in relation to the mean mpg for both City and Hwy, use Plot Value and Plot Function to display the mean(City) and mean(Hwy).

Figure 4.17

Algebraically, the slope and y-intercept of the least squares line are:

$$ \text{slope} = r \, \frac{s_y}{s_x} $$

[r = correlation coefficient, s_y = standard deviation of y, s_x = standard deviation of x]

$$ \text{y-intercept} = \bar{y} - \text{slope} \, (\bar{x}) $$

Thus the equation for the least squares line can be symbolically represented as

$$ \hat{y} = r \, \frac{s_y}{s_x} (x) + \left( \bar{y} - r \, \frac{s_y}{s_x} (\bar{x}) \right) $$

Or an alternative form of

$$ \hat{y} = r \, \frac{s_y}{s_x} (x - \bar{x}) + \bar{y} $$
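The formulas above can be tried out on a small data set. The following is a minimal sketch, assuming made-up (City, Hwy) pairs and a hypothetical `predict` helper; it uses the sample standard deviation from Python's `statistics` module and computes r from deviations about the means.

```python
import math
import statistics

# Hypothetical (City, Hwy) mpg pairs, made up for this example only
city = [18, 20, 22, 24, 26]
hwy = [25, 27, 30, 31, 35]

x_bar = statistics.mean(city)
y_bar = statistics.mean(hwy)
s_x = statistics.stdev(city)  # sample standard deviation of x
s_y = statistics.stdev(hwy)   # sample standard deviation of y

# Correlation coefficient from deviations about each mean
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(city, hwy))
sxx = sum((x - x_bar) ** 2 for x in city)
syy = sum((y - y_bar) ** 2 for y in hwy)
r = sxy / math.sqrt(sxx * syy)

slope = r * s_y / s_x
intercept = y_bar - slope * x_bar

def predict(city_mpg):
    """Alternative form: y-hat = r * (s_y / s_x) * (x - x_bar) + y_bar."""
    return slope * (city_mpg - x_bar) + y_bar

print(round(slope, 3), round(intercept, 3), round(predict(20), 3))
```

Both forms of the equation give the same predictions; the alternative form simply measures x as a deviation from its mean.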



When using a least squares line to make predictions, interpolation is the process of predicting a response based on a value within the range of predictor values in the original data. In answering previous questions, we used interpolation. Extrapolation is the process of predicting based on a value outside the range of the original data for which the least squares line was computed. For example, we would be extrapolating if we used the least squares line to predict the Hwy mpg for a vehicle with a City mpg of 5 or 80.

When finding a linear model we should take into account at least four different factors in considering whether the line is a good model for the data: (1) the value of the correlation coefficient and coefficient of determination, (2) how the data is positioned in the scatterplot in relation to the linear model, (3) the residual plot, and (4) the situation and whether the line makes sense for all data points. In many cases, the domain of the linear model needs to be specified in order for it to fit the situation and to avoid the potential dangers of extrapolation.

FOCUS ON MATHEMATICS
Q33. The least squares regression line passes through the intersection of the mean City mpg and the mean Hwy mpg (see Figure 4.17). Will this always happen? Justify your answer algebraically.
Q34. Do you believe the least squares line is a good model for the 2006 Vehicle City and Hwy data? Explain.

FOCUS ON PEDAGOGY
Q35. How could you assist students in thinking about the dangers of extrapolating using the 2006 Vehicle data?
Q36. If technologies like Fathom, as well as others such as Excel and graphing calculators, will compute and display the least squares line, would you choose to show students the algebraic form for computing the least squares line? Why or why not? Defend your position.


Section 6: Exploring Additional Attributes on a Scatterplot⁴

The analysis of the scatterplot and least squares line for the City and Hwy mpg raises several issues. First, there are several vehicles, all of which are Hybrids, for which the linear model grossly overestimates their Hwy mpg based on their City mpg. Second, the r² value suggests that there may be other variables besides City mpg that are contributing to the variation in the Hwy mpg. Fathom lets students add a third variable, or dimension, to the analysis of data in a scatterplot. This can help students visualize the relationship between three variables, rather than only considering two.

In the Graph menu, deselect Least-Squares Line and Show Squares before continuing. Remove the mean City and mean Hwy lines. To overlay a legend attribute on a scatterplot,

1. drag the attribute of interest to the center of the scatterplot.

2. If the attribute is quantitative, then each data point will be displayed along a color gradient continuum. If the attribute is qualitative (categorical) then each data point will be displayed using different shapes and colors. A key will appear at the bottom of the graph.

Tech Tip: To remove a legend attribute from a graph, choose Remove Legend Attribute under the Graph menu.


FOCUS ON MATHEMATICS
Q37. Which quantitative and qualitative attributes in the 2006 vehicle data set could be related to the City and Hwy mpg for a vehicle?
Q38. Explore overlaying the attribute Weight on the Hwy vs. City scatterplot. Explain whether a vehicle’s weight seems to be related to the City and Hwy mpg.

FOCUS ON PEDAGOGY Q39. Consider how the differences between the use of color to highlight different attributes in Fathom and in TinkerPlots could affect students’ reasoning about relationships between attributes.

Earlier we noticed that many of the data points that did not seem to fit the general trend between City and Hwy mpg were Hybrid vehicles. It may make sense, then, to compute a least squares line for the data in each of the subcategories of engine type: Hybrid, Diesel, and Standard. If we overlay a qualitative attribute on a scatterplot and display a Least-Squares Line on it, Fathom will use the qualitative attribute as a filter and compute a separate linear model for each subset of the data defined by that attribute.

4 The technology file “2006Vehicle_Ch4Sect6.ftm” is available for students to use for Section 6 if they were unable to complete through Section 5 with the technology.
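The idea of fitting a separate least squares line to each subset can also be sketched outside Fathom. The Python sketch below is only an illustration under stated assumptions: the rows are made-up values, not the 2006 Vehicle data, and the function names `least_squares` and `fit_by_group` are hypothetical helpers, not part of Fathom.

```python
from collections import defaultdict
from statistics import mean


def least_squares(points):
    """Return (slope, intercept) of the least squares line for (x, y) pairs.

    Assumes at least two distinct x-values; otherwise the slope is undefined.
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def fit_by_group(rows):
    """Fit one least squares line per category; rows are (category, x, y)."""
    groups = defaultdict(list)
    for category, x, y in rows:
        groups[category].append((x, y))
    return {category: least_squares(pts) for category, pts in groups.items()}


# Hypothetical (engine type, City mpg, Hwy mpg) rows, NOT the 2006 Vehicle data.
rows = [
    ("Standard", 18, 26), ("Standard", 20, 28), ("Standard", 24, 32),
    ("Hybrid", 32, 30), ("Hybrid", 48, 42), ("Hybrid", 60, 51),
]

for engine, (slope, intercept) in fit_by_group(rows).items():
    print(f"{engine}: Hwy = {slope:.2f} * City + {intercept:.2f}")
```

Comparing the slopes and intercepts across categories mirrors what students do when they interpret the separate equations Fathom displays for each engine type.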

FOCUS ON MATHEMATICS Q40. Overlay the Engine type attribute onto the Hwy vs. City scatterplot. Display the least squares line. Interpret the resulting least squares equations.

FOCUS ON PEDAGOGY Q41. What are the benefits and drawbacks of having the ability in Fathom to overlay a qualitative attribute as a filter for computing Least Squares linear models for subsets of data?

As we have seen, overlaying an attribute in a scatterplot can allow students to simultaneously consider three attributes and the relationships among them. While at first this may seem confusing to students, this feature can help them consider relationships among more than two variables and realize that linear models that consider only the relationship between two variables are often not sufficient to explain a phenomenon. In the case of the 2006 Vehicle data, we have learned that the type of engine in a car appears to be a significant factor that affects the relationship between a vehicle’s fuel economy when driving in a city and on the highway.



Section 7: Exploring the Effects of Outliers on Correlation and the Least Squares Line

In the exploration of the 2006 vehicle data, we were able to dynamically control the location of a movable line in the scatterplot. Now, we are going to reverse our locus of control and capitalize on the ability to move data points in a scatterplot and observe the effect on the measures of that data. Open the file Outliers.ftm. This file has 5 data points displayed in a table and a scatterplot. The correlation between the two variables is also computed and displayed.

Figure 4.18

FOCUS ON MATHEMATICS Q42. The correlation coefficient is currently about r = 0.025. Explain why this value makes sense for these 5 data points. Q43. Add the Least Squares line to the graph. Describe the slope of the least squares line with respect to the correlation. Why are they related in that way? Q44. Drag the “center” point located at (6.1, 7.5) to the upper right and then to the bottom left corner of the graph. Describe the effects on the correlation coefficient and Least Squares line. Repeat by dragging this point to the bottom right and top left corners. Q45. Describe two translations of the “center” point that have little effect on the slope of the least squares line.



Q46. By moving the five points, find at least three very different arrangements of points that result in a corresponding least squares line that has a positive slope and a correlation coefficient greater than 0.8. How are the three arrangements different, and what does this difference imply about the relationship between the location of the points on the scatterplot, the value of the correlation coefficient, and the slope of the least squares line?

FOCUS ON PEDAGOGY Q47. Often middle or high school students only consider the correlation coefficient when modeling data. Explain why this information alone is not sufficient when determining an appropriate mathematical model. Q48. What are the benefits and drawbacks of changing the data points in a graphical representation and observing the effects on the measures of correlation and the least squares line?

The exploration in the Outliers.ftm file is yet another example of how a technology tool can be used to create an interactive diagram. Such diagrams allow students to engage in dynamic manipulations, observe effects of their activities, and reflect on those effects to develop a more meaningful conception of a mathematical idea. These types of diagrams can be used in a variety of settings, such as: 1) individuals working alone at a computer, 2) small groups of students working together with one computer, 3) small groups of students working on individual computers but allowed to discuss their results as a group, and 4) whole group discussion with the interactive diagram displayed using a projector and students and teacher discussing the activities and the effects of the activities together. When considering how you use such technology files in your own classroom, you will need to balance your goals for students’ learning with the time allotted and computers available.
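The drag-a-point exploration in Outliers.ftm can be loosely imitated in code by recomputing the correlation and the least squares slope as one point is moved. The Python sketch below is an illustration only: the five points are hypothetical (four corners of a square plus a movable "center" point) and do not reproduce the data in Outliers.ftm, and `correlation_and_slope` is a made-up helper name.

```python
from math import sqrt
from statistics import mean


def correlation_and_slope(points):
    """Return (r, slope) for (x, y) pairs.

    Assumes both x and y vary; otherwise r and the slope are undefined.
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    r = sxy / sqrt(sxx * syy)
    slope = sxy / sxx
    return r, slope


# Hypothetical data: four fixed corner points, NOT the points in Outliers.ftm.
corners = [(4, 4), (4, 8), (8, 4), (8, 8)]

# Move the fifth point to several positions and recompute both measures.
for center in [(6, 6), (12, 12), (0, 0), (12, 0)]:
    r, slope = correlation_and_slope(corners + [center])
    print(f"center at {center}: r = {r:.3f}, slope = {slope:.3f}")
```

Running the loop shows how a single point far from the others can change the sign and size of both the correlation and the slope, which is the behavior students observe when dragging the point in Fathom.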



SUGGESTED ASSIGNMENTS H-Q1 (Mathematical) In this problem you will be investigating relationships between Weight and City mpg.

(a) First determine if there is a strong linear relationship between Weight and City mpg. Consider which attribute should be the predictor variable and which should be the response variable. Explain your results.
(b) Investigate the relationship between Weight and City mpg using a third attribute, engine type, and describe what additional information about the data this attribute provides.
(c) Remove the Engine type attribute from the scatterplot. Construct a plot of Residuals versus Weight. Describe this plot and what it tells you about the relationship between City mpg and Weight.

H-Q2 (Pedagogical) In many classrooms, students and teachers use graphing calculators to enter data, create scatterplots, and compute linear regression. Discuss the advantages and disadvantages of using graphing calculators versus Fathom for helping students understand these concepts and perform these procedures.

H-Q3 (Mathematical) [Question about computing residual on graphing calculator and displaying a residual plot. Give directions.]

H-Q4 (Pedagogical) Create a task where a linear relationship exists between two variables, but a linear model would be inappropriate, such as shoe size and scores on an achievement test. Create a series of questions that would help students to see that, though the scatterplot reveals a linear trend and the correlation coefficient is strong, it does not make sense to predict one from the other.

H-Q5 (Mathematical) The median-median line is another linear model that is more resistant to outliers than the least squares line. Below are directions for creating the median-median line without technology.

• To fit a median-median line to the points, divide the points into three groups: the one-third of the points with the smallest x-values, a middle group, and the one-third of the points with the largest x-values. For the SAT data set of n=51, each one-third will include 17 data points. If the number of points is not divisible by three, extra points need to be assigned symmetrically. Thus, if there is one extra point, it should be added to the middle group, and if there are two extra points, add one each to the two outer groups. Even if it makes equal allocation impossible, points with the same x-values must always be placed in the same group.

• Consider each group of data separately and order the values of both variables. Ignore the data pairings at this point.

• Now create a summary point (one for each group) for each portion of the data by using the median x-value and the median y-value, and combining them to create an ordered pair. This gives three summary points: one for the leftmost group, one for the middle group, and one for the right-hand group. Note: These summary points may, or may not, be actual data points.

• Now use the two outer summary points to determine the equation of the line between them. This, in essence, will determine your slope.

• Construct the line that is parallel to this line but one-third of the way toward the middle summary point (adjusting the y-intercept). This is the median-median line. Moving the line one-third of the way toward the middle summary point gives each summary point equal weight in determining the y-intercept. To do this:

i. Find the y-coordinate of the point on the line with the same x-coordinate as the middle summary point. (The predicted value)

ii. Find the vertical distance between the middle summary point and the line by subtracting y-values.

iii. Find the coordinates of the point P* one-third of the way from the line to the middle summary point.

Fathom will compute the median-median line. For the Hwy vs. City mpg graph, under the Graph menu, choose Median-Median Line. Record the equation and compare it with the least squares model. Record the similarities and differences that you notice.
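The by-hand directions above can also be sketched in code. The Python sketch below is a minimal illustration under stated assumptions, not Fathom's implementation: the function names are hypothetical, the data are made up, and the rule that points with the same x-value must stay in the same group is not handled.

```python
from statistics import median


def split_into_thirds(points):
    """Split points, sorted by x, into left, middle, and right groups.

    One extra point goes to the middle group; two extra points go one each
    to the outer groups. Ties in x are NOT kept together (simplification).
    """
    pts = sorted(points)
    n = len(pts)
    base, extra = divmod(n, 3)
    outer_size = base + (1 if extra == 2 else 0)
    return pts[:outer_size], pts[outer_size:n - outer_size], pts[n - outer_size:]


def median_median_line(points):
    """Return (slope, intercept) of the median-median line."""
    if len(points) < 3:
        raise ValueError("need at least three points")
    groups = split_into_thirds(points)
    # Summary point per group: (median of x-values, median of y-values).
    (x_l, y_l), (x_m, y_m), (x_r, y_r) = [
        (median(x for x, _ in g), median(y for _, y in g)) for g in groups
    ]
    # Slope comes from the two outer summary points (assumes x_l != x_r).
    slope = (y_r - y_l) / (x_r - x_l)
    intercept_outer = y_l - slope * x_l
    # Vertical gap between the middle summary point and the outer line.
    gap = y_m - (slope * x_m + intercept_outer)
    # Shift the line one-third of the way toward the middle summary point.
    return slope, intercept_outer + gap / 3


# Hypothetical data: nine points on y = 2x + 1, then one y-value made extreme.
clean = [(x, 2 * x + 1) for x in range(1, 10)]
with_outlier = clean[:-1] + [(9, 100)]

print(median_median_line(clean))
print(median_median_line(with_outlier))
```

Because each summary point uses medians, replacing one y-value with an extreme value in this made-up data set leaves the summary points, and therefore the line, unchanged. This is one way to see what "resistant to outliers" means.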