Agresti/Franklin Statistics, 1 of 52
Chapter 3 Association: Contingency, Correlation, and Regression
Learn:
How to examine links between two variables
Agresti/Franklin Statistics, 2 of 52
Variables
Response variable: the outcome variable
Explanatory variable: the variable that explains the outcome variable
Agresti/Franklin Statistics, 3 of 52
Association
An association exists between the two variables if a particular value for one variable is more likely to occur with certain values of the other variable
Agresti/Franklin Statistics, 4 of 52
Section 3.1
How Can We Explore the Association Between Two Categorical Variables?
Agresti/Franklin Statistics, 6 of 52
Example: Food Type and Pesticide Status
What is the response variable? What is the explanatory variable?
              Pesticides:
Food Type:      Yes     No
Organic          29     98
Conventional  19485   7086
Agresti/Franklin Statistics, 7 of 52
Example: Food Type and Pesticide Status
What proportion of organic foods contain pesticides?
What proportion of conventionally grown foods contain pesticides?
              Pesticides:
Food Type:      Yes     No
Organic          29     98
Conventional  19485   7086
Agresti/Franklin Statistics, 8 of 52
Example: Food Type and Pesticide Status
What proportion of all sampled items contain pesticide residues?
              Pesticides:
Food Type:      Yes     No
Organic          29     98
Conventional  19485   7086
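The overall proportion asked for here can be checked directly from the four cell counts. The following is a minimal Python sketch; the variable names are illustrative and do not come from the slides.

```python
# Cell counts from the Food Type and Pesticide Status table.
counts = {
    "Organic": {"Yes": 29, "No": 98},
    "Conventional": {"Yes": 19485, "No": 7086},
}

# Items with pesticide residues, summed over both food types.
total_yes = sum(row["Yes"] for row in counts.values())
# All sampled items, summed over every cell of the table.
total_items = sum(sum(row.values()) for row in counts.values())

overall_proportion = total_yes / total_items
print(f"{total_yes} of {total_items} items: {overall_proportion:.2f}")
```

Because this proportion ignores food type, it says nothing by itself about an association between the two variables.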
Agresti/Franklin Statistics, 9 of 52
Contingency Table
The Food Type and Pesticide Status Table is called a contingency table
A contingency table:
• Displays 2 categorical variables
• The rows list the categories of 1 variable
• The columns list the categories of the other variable
• Entries in the table are frequencies
Agresti/Franklin Statistics, 10 of 52
Example: Food Type and Pesticide Status
Contingency Table Showing Conditional Proportions
Agresti/Franklin Statistics, 11 of 52
Example: Food Type and Pesticide Status
What is the sum over each row?
What proportion of organic foods contained pesticide residues?
What proportion of conventional foods contained pesticide residues?
              Pesticides:
Food Type:     Yes     No
Organic        0.23    0.77
Conventional   0.73    0.27
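The conditional proportions in this table can be reproduced by dividing each count by its row total. The sketch below assumes the counts from the earlier Food Type and Pesticide Status slides; the function name conditional_proportions is illustrative only.

```python
# Cell counts from the Food Type and Pesticide Status contingency table.
counts = {
    "Organic": {"Yes": 29, "No": 98},
    "Conventional": {"Yes": 19485, "No": 7086},
}

def conditional_proportions(table):
    """Divide each cell by its row total, so that each row sums to 1."""
    result = {}
    for food_type, row in table.items():
        row_total = sum(row.values())
        result[food_type] = {status: n / row_total for status, n in row.items()}
    return result

proportions = conditional_proportions(counts)
for food_type, row in proportions.items():
    print(food_type, {status: round(p, 2) for status, p in row.items()})
```

Conditioning on the rows (food type) treats food type as the explanatory variable and pesticide status as the response.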
Agresti/Franklin Statistics, 13 of 52
Example: For the following pair of variables, which is the response variable and which is the explanatory variable?
College grade point average (GPA) and high school GPA
a. College GPA: response variable and High School GPA: explanatory variable
b. College GPA: explanatory variable and High School GPA: response variable
Agresti/Franklin Statistics, 14 of 52
Section 3.2
How Can We Explore the Association Between Two Quantitative Variables?
Agresti/Franklin Statistics, 15 of 52
Scatterplot
Graphical display of two quantitative variables:
• Horizontal Axis: Explanatory variable, x
• Vertical Axis: Response variable, y
Agresti/Franklin Statistics, 17 of 52
Positive Association
Two quantitative variables, x and y, are said to have a positive association when high values of x tend to occur with high values of y, and when low values of x tend to occur with low values of y
Agresti/Franklin Statistics, 18 of 52
Negative Association
Two quantitative variables, x and y, are said to have a negative association when high values of x tend to occur with low values of y, and when low values of x tend to occur with high values of y
Agresti/Franklin Statistics, 19 of 52
Example: Did the Butterfly Ballot Cost Al Gore the 2000 Presidential Election?
Agresti/Franklin Statistics, 20 of 52
Linear Correlation: r
Measures the strength of the linear association between x and y
• A positive r-value indicates a positive association
• A negative r-value indicates a negative association
• An r-value close to +1 or -1 indicates a strong linear association
• An r-value close to 0 indicates a weak linear association
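The slides do not give a formula for r. One standard definition is the sum of the products of the z-scores of x and y, divided by n - 1. The sketch below implements that definition in Python; the data are made up purely for illustration.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Pearson correlation r: sum of z-score products divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations
    z_products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                  for x, y in zip(xs, ys))
    return sum(z_products) / (n - 1)

# Hypothetical data, invented for this example only.
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 5, 4, 6]    # tends to rise with x
y_down = [6, 4, 5, 4, 2]  # tends to fall with x

print(round(correlation(x, y_up), 2))    # positive r
print(round(correlation(x, y_down), 2))  # negative r
```

Points that fall exactly on an upward-sloping straight line give r = 1, the strongest possible positive linear association.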
Agresti/Franklin Statistics, 22 of 52
Example: 100 cars on the lot of a used-car dealership
Would you expect a positive association, a negative association or no association between the age of the car and the mileage on the odometer?
Positive association
Negative association
No association
Agresti/Franklin Statistics, 24 of 52
Regression Line
Predicts the value for the response variable, y, as a straight-line function of the value of the explanatory variable, x
$\hat{y} = a + bx$
Agresti/Franklin Statistics, 25 of 52
Example: How Can Anthropologists Predict Height Using Human Remains?
Regression Equation:
$\hat{y} = 61.4 + 2.4x$
$\hat{y}$ is the predicted height and $x$ is the length of a femur (thighbone), measured in centimeters
Agresti/Franklin Statistics, 26 of 52
Example: How Can Anthropologists Predict Height Using Human Remains?
Use the regression equation to predict the height of a person whose femur length was 50 centimeters
$\hat{y} = 61.4 + 2.4(50) = 181.4$
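The prediction is a direct substitution into the regression equation. A minimal Python sketch, with the function name predict_height chosen here for illustration:

```python
def predict_height(femur_length_cm):
    """Predicted height (cm) from the regression equation y-hat = 61.4 + 2.4x."""
    return 61.4 + 2.4 * femur_length_cm

# Femur length of 50 cm, as in the example.
print(round(predict_height(50), 1))
```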
Agresti/Franklin Statistics, 27 of 52
Interpreting the y-Intercept
y-Intercept:
• the predicted value for y when x = 0
• helps in plotting the line
• May not have any interpretative value if no observations had x values near 0
Agresti/Franklin Statistics, 28 of 52
Interpreting the Slope
Slope: measures the change in the predicted value of the response variable for every one-unit increase in the explanatory variable
Example: A 1 cm increase in femur length corresponds to a 2.4 cm increase in predicted height
Agresti/Franklin Statistics, 30 of 52
Residuals
Measure the size of the prediction errors
Each observation has a residual
Calculation for each residual:
$y - \hat{y}$
Agresti/Franklin Statistics, 31 of 52
Residuals
A large residual indicates an unusual observation
Large residuals can easily be found by constructing a histogram of the residuals
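To make the residual calculation concrete, the sketch below applies it to the femur regression equation from the earlier example. The four observations are hypothetical, invented only to show how one unusually large residual stands out.

```python
def predict_height(femur_length_cm):
    """Predicted height (cm) from y-hat = 61.4 + 2.4x."""
    return 61.4 + 2.4 * femur_length_cm

# Hypothetical (femur length, observed height) pairs in cm; not real data.
observations = [(42.0, 160.0), (45.0, 171.0), (48.0, 176.0), (50.0, 195.0)]

# Residual = observed y minus predicted y.
residuals = [height - predict_height(femur) for femur, height in observations]

for (femur, height), residual in zip(observations, residuals):
    print(f"femur={femur} height={height} residual={residual:+.1f}")

# The observation whose residual is largest in absolute value.
largest = max(zip(observations, residuals), key=lambda pair: abs(pair[1]))
print("most unusual observation:", largest[0])
```

A positive residual means the observed value lies above the line; a negative residual means it lies below.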
Agresti/Franklin Statistics, 32 of 52
“Least Squares Method” Yields the Regression Line
Residual sum of squares:
$\sum (\text{residuals})^2 = \sum (y - \hat{y})^2$
The optimal line through the data is the line that minimizes the residual sum of squares
Agresti/Franklin Statistics, 33 of 52
Regression Formulas for y-Intercept and Slope
Slope:
$b = r \left( \frac{s_y}{s_x} \right)$
Y-Intercept:
$a = \bar{y} - b(\bar{x})$
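The two formulas can be checked numerically: the line they produce should have a smaller residual sum of squares than any nearby line. The sketch below does this in Python on a small made-up data set; the helper names and the data are assumptions for illustration, and r is computed from its standard z-score definition.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Pearson correlation r (sum of z-score products divided by n - 1)."""
    x_bar, y_bar = mean(xs), mean(ys)
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return cross / ((len(xs) - 1) * stdev(xs) * stdev(ys))

def least_squares_line(xs, ys):
    """Return (a, b) using b = r * (s_y / s_x) and a = y-bar - b * x-bar."""
    b = correlation(xs, ys) * stdev(ys) / stdev(xs)
    a = mean(ys) - b * mean(xs)
    return a, b

def residual_sum_of_squares(xs, ys, a, b):
    """Sum of squared residuals (y - y-hat)^2 for the line y-hat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, invented for this example only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]

a, b = least_squares_line(x, y)
best = residual_sum_of_squares(x, y, a, b)
print(f"a={a:.2f} b={b:.2f} RSS={best:.2f}")

# Nudging the intercept or slope in either direction increases the RSS.
for da, db in [(0.1, 0), (-0.1, 0), (0, 0.1), (0, -0.1)]:
    print(residual_sum_of_squares(x, y, a + da, b + db) > best)
```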
Agresti/Franklin Statistics, 34 of 52
The Slope and the Correlation
Correlation:
• Describes the strength of the association between 2 variables
• Does not change when the units of measurement change
• It is not necessary to identify which variable is the response and which is the explanatory
Agresti/Franklin Statistics, 35 of 52
The Slope and the Correlation
Slope:
• Numerical value depends on the units used to measure the variables
• Does not tell us whether the association is strong or weak
• The two variables must be identified as response and explanatory variables
• The regression equation can be used to predict the response variable
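The contrast between the correlation and the slope under a change of units can be seen in a small experiment. The Python sketch below converts femur length from centimeters to inches; the data values and helper names are hypothetical and chosen only for illustration.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Pearson correlation r (sum of z-score products divided by n - 1)."""
    x_bar, y_bar = mean(xs), mean(ys)
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return cross / ((len(xs) - 1) * stdev(xs) * stdev(ys))

def slope(xs, ys):
    """Least squares slope b = r * (s_y / s_x)."""
    return correlation(xs, ys) * stdev(ys) / stdev(xs)

# Hypothetical measurements in cm, invented for this example only.
femur_cm = [40.0, 43.0, 45.0, 48.0, 51.0]
height_cm = [158.0, 164.0, 170.0, 175.0, 185.0]

# The same femur lengths expressed in inches (1 inch = 2.54 cm).
femur_in = [v / 2.54 for v in femur_cm]

print("r, femur in cm:    ", round(correlation(femur_cm, height_cm), 4))
print("r, femur in inches:", round(correlation(femur_in, height_cm), 4))
print("slope, cm per cm:  ", round(slope(femur_cm, height_cm), 4))
print("slope, cm per inch:", round(slope(femur_in, height_cm), 4))
```

The correlation is identical in both cases, while the slope is multiplied by 2.54 because one inch of femur length spans 2.54 cm.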
Agresti/Franklin Statistics, 37 of 52
Extrapolation
Extrapolation: Using a regression line to predict y-values for x-values outside the observed range of the data
• Riskier the farther we move from the range of the given x-values
• There is no guarantee that the relationship will have the same trend outside the range of x-values
Agresti/Franklin Statistics, 38 of 52
Regression Outliers
Construct a scatterplot
Search for data points that are well removed from the trend that the rest of the data points follow
Agresti/Franklin Statistics, 39 of 52
Influential Observation
An observation that has a large effect on the regression analysis
Two conditions must hold for an observation to be influential:
• Its x-value is relatively low or high compared to the rest of the data
• It is a regression outlier, falling quite far from the trend that the rest of the data follow
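A small experiment shows why both conditions matter. In the Python sketch below, one extra point is added to a made-up data set in two ways: once far from the trend but at a typical x-value, and once far from the trend at an extreme x-value. All data values and helper names are hypothetical.

```python
from statistics import mean

def slope(xs, ys):
    """Least squares slope: sum of cross-deviations over sum of squared x-deviations."""
    x_bar, y_bar = mean(xs), mean(ys)
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return cross / sum((x - x_bar) ** 2 for x in xs)

# Hypothetical data, invented for this example only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]
base = slope(x, y)

# Regression outlier at a typical x-value (x = 3 is the mean of x).
typical_x = slope(x + [3], y + [9])

# Regression outlier at an extreme x-value.
extreme_x = slope(x + [15], y + [2])

print("original slope:          ", round(base, 3))
print("outlier at typical x:    ", round(typical_x, 3))
print("outlier at extreme x:    ", round(extreme_x, 3))
```

In this example the outlier at a typical x-value leaves the slope unchanged (although it shifts the intercept), while the outlier at an extreme x-value reverses the sign of the slope.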
Agresti/Franklin Statistics, 42 of 52
Correlation does not Imply Causation
A correlation between x and y means that there is a linear trend between the two variables
A correlation between x and y does not mean that x causes y
Agresti/Franklin Statistics, 43 of 52
Lurking Variable
A lurking variable is a variable, usually unobserved, that influences the association between the variables of primary interest
Agresti/Franklin Statistics, 44 of 52
Simpson’s Paradox
The direction of an association between two variables can change after we include a third variable and analyze the data at separate levels of that variable
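The reversal can be reproduced with a handful of counts. The Python sketch below uses entirely hypothetical numbers (two groups, A and B, split by two levels of a third variable); they are invented to show the arithmetic and are not taken from the textbook.

```python
# Hypothetical (successes, total) counts; invented for illustration only.
data = {
    "A": {"level 1": (18, 20), "level 2": (40, 80)},
    "B": {"level 1": (64, 80), "level 2": (8, 20)},
}

def rate(successes, total):
    """Proportion of successes."""
    return successes / total

def overall_rate(levels):
    """Success proportion after pooling all levels of the third variable."""
    successes = sum(s for s, _ in levels.values())
    total = sum(t for _, t in levels.values())
    return successes / total

# Within each level of the third variable, group A has the higher rate.
for level in ("level 1", "level 2"):
    print(level, "A:", rate(*data["A"][level]), "B:", rate(*data["B"][level]))

# After pooling the levels, group B has the higher rate.
print("overall", "A:", overall_rate(data["A"]), "B:", overall_rate(data["B"]))
```

The reversal arises because the two groups are distributed very differently across the levels of the third variable: most of group A falls in the level with the lower success rate, and most of group B in the level with the higher one.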
Agresti/Franklin Statistics, 49 of 52
Example: Is Smoking Actually Beneficial to Your Health?
An association can look quite different after adjusting for the effect of a third variable by grouping the data according to the values of the third variable
Agresti/Franklin Statistics, 50 of 52
Data are available for all fires in Chicago last year on x = number of firefighters at the fires and y = cost of damages due to fire
Would you expect the correlation to be negative, zero, or positive?
a. Negative
b. Zero
c. Positive
Agresti/Franklin Statistics, 51 of 52
If the correlation is positive, does this mean that having more firefighters at a fire causes the damages to be worse?
a. Yes
b. No
Agresti/Franklin Statistics, 52 of 52
Identify a third variable that could be considered a common cause of x and y:
a. Distance from the fire station
b. Intensity of the fire
c. Time of day that the fire was discovered