Agresti/Franklin Statistics, Chapter 3 Association: Contingency, Correlation, and Regression

Upload: martina-lucas

Post on 16-Dec-2015


Agresti/Franklin Statistics, 1 of 52

Chapter 3
Association: Contingency, Correlation, and Regression

Learn ….

How to examine links between two variables

Agresti/Franklin Statistics, 2 of 52

Variables

Response variable: the outcome variable

Explanatory variable: the variable that explains the outcome variable

Agresti/Franklin Statistics, 3 of 52

Association

An association exists between the two variables if a particular value for one variable is more likely to occur with certain values of the other variable

Agresti/Franklin Statistics, 4 of 52

Section 3.1

How Can We Explore the Association Between Two Categorical Variables?

Agresti/Franklin Statistics, 5 of 52

Example: Food Type and Pesticide Status

Agresti/Franklin Statistics, 6 of 52

Example: Food Type and Pesticide Status

What is the response variable? What is the explanatory variable?

                 Pesticides
Food Type        Yes       No
Organic           29       98
Conventional   19485     7086

Agresti/Franklin Statistics, 7 of 52

Example: Food Type and Pesticide Status

What proportion of organic foods contain pesticides?

What proportion of conventionally grown foods contain pesticides?

                 Pesticides
Food Type        Yes       No
Organic           29       98
Conventional   19485     7086

Agresti/Franklin Statistics, 8 of 52

Example: Food Type and Pesticide Status

What proportion of all sampled items contain pesticide residues?

                 Pesticides
Food Type        Yes       No
Organic           29       98
Conventional   19485     7086

Agresti/Franklin Statistics, 9 of 52

Contingency Table

The Food Type and Pesticide Status Table is called a contingency table

A contingency table:

• Displays 2 categorical variables

• The rows list the categories of 1 variable

• The columns list the categories of the other variable

• Entries in the table are frequencies

Agresti/Franklin Statistics, 10 of 52

Example: Food Type and Pesticide Status

Contingency Table Showing Conditional Proportions

Agresti/Franklin Statistics, 11 of 52

Example: Food Type and Pesticide Status

What is the sum over each row?

What proportion of organic foods contained pesticide residues?

What proportion of conventional foods contained pesticide residues?

                 Pesticides
Food Type        Yes       No
Organic         0.23     0.77
Conventional    0.73     0.27
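
The conditional proportions in this table can be reproduced from the counts given earlier. The following is a minimal Python sketch; the dictionary layout and the function name `conditional_proportions` are illustrative choices, not something taken from the slides.

```python
# Counts from the Food Type and Pesticide Status contingency table.
counts = {
    "Organic": {"Yes": 29, "No": 98},
    "Conventional": {"Yes": 19485, "No": 7086},
}


def conditional_proportions(table):
    """Return row-wise proportions: each count divided by its row total."""
    result = {}
    for row, cells in table.items():
        total = sum(cells.values())
        result[row] = {col: n / total for col, n in cells.items()}
    return result


props = conditional_proportions(counts)
for row, cells in props.items():
    print(row, {col: round(p, 2) for col, p in cells.items()})

# Overall proportion of sampled items with pesticide residues,
# ignoring food type.
total_yes = sum(cells["Yes"] for cells in counts.values())
total_all = sum(sum(cells.values()) for cells in counts.values())
overall_yes = total_yes / total_all
print("Overall:", round(overall_yes, 2))
```

Each row of conditional proportions sums to 1, which answers the first question on the slide.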

Agresti/Franklin Statistics, 12 of 52

Example: Food Type and Pesticide Status

Agresti/Franklin Statistics, 13 of 52

Example: For the following pair of variables, which is the response variable and which is the explanatory variable?

College grade point average (GPA) and high school GPA

a. College GPA: response variable; high school GPA: explanatory variable

b. College GPA: explanatory variable; high school GPA: response variable

Agresti/Franklin Statistics, 14 of 52

Section 3.2

How Can We Explore the Association Between Two Quantitative Variables?

Agresti/Franklin Statistics, 15 of 52

Scatterplot

Graphical display of two quantitative variables:

• Horizontal Axis: Explanatory variable, x

• Vertical Axis: Response variable, y

Agresti/Franklin Statistics, 16 of 52

Example: Internet Usage and Gross Domestic Product (GDP)

Agresti/Franklin Statistics, 17 of 52

Positive Association

Two quantitative variables, x and y, are said to have a positive association when high values of x tend to occur with high values of y, and when low values of x tend to occur with low values of y

Agresti/Franklin Statistics, 18 of 52

Negative Association

Two quantitative variables, x and y, are said to have a negative association when high values of x tend to occur with low values of y, and when low values of x tend to occur with high values of y

Agresti/Franklin Statistics, 19 of 52

Example: Did the Butterfly Ballot Cost Al Gore the 2000 Presidential Election?

Agresti/Franklin Statistics, 20 of 52

Linear Correlation: r

Measures the strength of the linear association between x and y

• A positive r-value indicates a positive association

• A negative r-value indicates a negative association

• An r-value close to +1 or -1 indicates a strong linear association

• An r-value close to 0 indicates a weak linear association

Agresti/Franklin Statistics, 21 of 52

Calculating the correlation, r

$$r = \frac{1}{n-1} \sum \left( \frac{x - \bar{x}}{s_x} \right) \left( \frac{y - \bar{y}}{s_y} \right)$$
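
The formula can be sketched in Python as an average (with divisor n - 1) of products of z-scores. This is only an illustration: the helper names and the five data points are made up for the example and do not come from the slides.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def sample_sd(values):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def correlation(x, y):
    """r = 1/(n - 1) times the sum of products of the z-scores of x and y."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = sample_sd(x), sample_sd(y)
    return sum(
        ((xi - x_bar) / s_x) * ((yi - y_bar) / s_y) for xi, yi in zip(x, y)
    ) / (n - 1)


# Made-up illustrative data, not from the slides.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = correlation(x, y)
print(round(r, 3))
```

Swapping `x` and `y` in the call gives the same value, since the formula treats the two variables symmetrically.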

Agresti/Franklin Statistics, 22 of 52

Example: 100 cars on the lot of a used-car dealership

Would you expect a positive association, a negative association or no association between the age of the car and the mileage on the odometer?

• Positive association

• Negative association

• No association

Agresti/Franklin Statistics, 23 of 52

Section 3.3

How Can We Predict the Outcome of a Variable?

Agresti/Franklin Statistics, 24 of 52

Regression Line

Predicts the value for the response variable, y, as a straight-line function of the value of the explanatory variable, x

$$\hat{y} = a + bx$$

Agresti/Franklin Statistics, 25 of 52

Example: How Can Anthropologists Predict Height Using Human Remains?

Regression Equation:

$$\hat{y} = 61.4 + 2.4x$$

$\hat{y}$ is the predicted height and $x$ is the length of a femur (thighbone), measured in centimeters

Agresti/Franklin Statistics, 26 of 52

Example: How Can Anthropologists Predict Height Using Human Remains?

Use the regression equation to predict the height of a person whose femur length was 50 centimeters

$$\hat{y} = 61.4 + 2.4(50)$$
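
As a quick check of the arithmetic, the prediction can be evaluated in Python. The function name `predict_height` is an illustrative choice; the coefficients 61.4 and 2.4 are the ones from the regression equation on the slide.

```python
def predict_height(femur_cm):
    """Predicted height in cm from femur length in cm: y-hat = 61.4 + 2.4x."""
    return 61.4 + 2.4 * femur_cm


predicted = predict_height(50)
# Rounded to one decimal to hide floating-point noise.
print(round(predicted, 1))
```

The result is a predicted height of about 181.4 cm for a femur length of 50 cm.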

Agresti/Franklin Statistics, 27 of 52

Interpreting the y-Intercept

y-Intercept:

• the predicted value for y when x = 0

• helps in plotting the line

• May not have any interpretative value if no observations had x values near 0

Agresti/Franklin Statistics, 28 of 52

Interpreting the Slope

Slope: measures the change in the predicted value of the response variable for every one-unit increase in the explanatory variable

Example: A 1 cm increase in femur length corresponds to a 2.4 cm increase in predicted height

Agresti/Franklin Statistics, 29 of 52

Slope Values: Positive, Negative, Equal to 0

Agresti/Franklin Statistics, 30 of 52

Residuals

Measure the size of the prediction errors

Each observation has a residual

Calculation for each residual:

$$y - \hat{y}$$

Agresti/Franklin Statistics, 31 of 52

Residuals

A large residual indicates an unusual observation

Large residuals can easily be found by constructing a histogram of the residuals

Agresti/Franklin Statistics, 32 of 52

“Least Squares Method” Yields the Regression Line

Residual sum of squares:

$$\sum (\text{residuals})^2 = \sum (y - \hat{y})^2$$

The optimal line through the data is the line that minimizes the residual sum of squares
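
To make the criterion concrete, the sketch below computes residuals and the residual sum of squares for two candidate lines. The data are made up for illustration. For these five points the least-squares line works out to ŷ = 2.2 + 0.6x; the second line (a = 2.0, b = 0.7) is an arbitrary alternative chosen for comparison.

```python
def residuals(x, y, a, b):
    """Residual y - y_hat for each observation, with y_hat = a + b * x."""
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]


def residual_sum_of_squares(x, y, a, b):
    """Sum of squared residuals for the line y_hat = a + b * x."""
    return sum(e ** 2 for e in residuals(x, y, a, b))


# Made-up illustrative data, not from the slides.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

# Least-squares line for these data versus an arbitrary alternative.
rss_least_squares = residual_sum_of_squares(x, y, a=2.2, b=0.6)
rss_other = residual_sum_of_squares(x, y, a=2.0, b=0.7)

print(round(rss_least_squares, 2), round(rss_other, 2))
```

Any other choice of intercept and slope gives a residual sum of squares at least as large as that of the least-squares line.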

Agresti/Franklin Statistics, 33 of 52

Regression Formulas for y-Intercept and Slope

Slope:

$$b = r \left( \frac{s_y}{s_x} \right)$$

Y-Intercept:

$$a = \bar{y} - b\,\bar{x}$$
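
These two formulas translate directly into code. The following is a minimal sketch, assuming sample standard deviations with divisor n - 1; the function name `least_squares_line` and the data are illustrative and not taken from the slides.

```python
from math import sqrt


def least_squares_line(x, y):
    """Return (a, b) using b = r * (s_y / s_x) and a = y_bar - b * x_bar."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_x = sqrt(sum((v - x_bar) ** 2 for v in x) / (n - 1))
    s_y = sqrt(sum((v - y_bar) ** 2 for v in y) / (n - 1))
    r = sum(
        ((xi - x_bar) / s_x) * ((yi - y_bar) / s_y) for xi, yi in zip(x, y)
    ) / (n - 1)
    b = r * (s_y / s_x)
    a = y_bar - b * x_bar
    return a, b


# Made-up illustrative data, not from the slides.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
a, b = least_squares_line(x, y)
print(round(a, 2), round(b, 2))
```

Because a = ȳ - b x̄, the fitted line always passes through the point (x̄, ȳ).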

Agresti/Franklin Statistics, 34 of 52

The Slope and the Correlation

Correlation:

• Describes the strength of the association between 2 variables

• Does not change when the units of measurement change

• It is not necessary to identify which variable is the response and which is the explanatory

Agresti/Franklin Statistics, 35 of 52

The Slope and the Correlation

Slope:

• Numerical value depends on the units used to measure the variables

• Does not tell us whether the association is strong or weak

• The two variables must be identified as response and explanatory variables

• The regression equation can be used to predict the response variable
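
The contrast between the correlation and the slope under a change of units can be checked numerically. In the sketch below, the femur lengths and heights are made-up values, and the function name `correlation_and_slope` is an illustrative choice. Converting x from centimeters to millimeters leaves r unchanged and divides the slope by 10.

```python
from math import sqrt


def correlation_and_slope(x, y):
    """Return (r, b) for the least-squares line of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((v - x_bar) ** 2 for v in x)
    s_yy = sum((v - y_bar) ** 2 for v in y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    r = s_xy / sqrt(s_xx * s_yy)
    b = s_xy / s_xx  # algebraically equal to r * (s_y / s_x)
    return r, b


# Made-up illustrative data, not from the slides.
femur_cm = [40, 45, 50, 55]
height_cm = [158, 169, 181, 194]
femur_mm = [10 * v for v in femur_cm]  # same lengths in millimeters

r_cm, b_cm = correlation_and_slope(femur_cm, height_cm)
r_mm, b_mm = correlation_and_slope(femur_mm, height_cm)

print(round(r_cm, 4), round(r_mm, 4))  # correlations agree
print(round(b_cm, 3), round(b_mm, 3))  # slopes differ by a factor of 10
```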

Agresti/Franklin Statistics, 36 of 52

Section 3.4

What Are Some Cautions in Analyzing Associations?

Agresti/Franklin Statistics, 37 of 52

Extrapolation

Extrapolation: Using a regression line to predict y-values for x-values outside the observed range of the data

• Riskier the farther we move from the range of the given x-values

• There is no guarantee that the relationship will have the same trend outside the range of x-values

Agresti/Franklin Statistics, 38 of 52

Regression Outliers

Construct a scatterplot

Search for data points that are well removed from the trend that the rest of the data points follow

Agresti/Franklin Statistics, 39 of 52

Influential Observation

An observation that has a large effect on the regression analysis

Two conditions must hold for an observation to be influential:

• Its x-value is relatively low or high compared to the rest of the data

• It is a regression outlier, falling quite far from the trend that the rest of the data follow
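
Why both conditions matter can be illustrated with a small made-up data set. The sketch below adds one regression outlier in the middle of the x range and, separately, one at an extreme x-value, then compares the fitted slopes. All data points and the function name `slope` are hypothetical.

```python
def slope(x, y):
    """Least-squares slope b = S_xy / S_xx."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    return s_xy / s_xx


# Made-up illustrative data, not from the slides.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b_base = slope(x, y)
# Regression outlier at an x-value in the middle of the data.
b_mid_outlier = slope(x + [3], y + [10])
# Regression outlier at an x-value far above the rest of the data.
b_extreme_outlier = slope(x + [15], y + [2])

print(round(b_base, 3), round(b_mid_outlier, 3), round(b_extreme_outlier, 3))
```

In this particular example the mid-range outlier leaves the slope unchanged, while the outlier at the extreme x-value is enough to turn the slope negative.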

Agresti/Franklin Statistics, 40 of 52

Which Regression Outlier is Influential?

Agresti/Franklin Statistics, 41 of 52

Example: Does More Education Cause More Crime?

Agresti/Franklin Statistics, 42 of 52

Correlation does not Imply Causation

A correlation between x and y means that there is a linear trend that exists between the two variables

A correlation between x and y does not mean that x causes y

Agresti/Franklin Statistics, 43 of 52

Lurking Variable

A lurking variable is a variable, usually unobserved, that influences the association between the variables of primary interest

Agresti/Franklin Statistics, 44 of 52

Simpson’s Paradox

The direction of an association between two variables can change after we include a third variable and analyze the data at separate levels of that variable

Agresti/Franklin Statistics, 45 of 52

Example: Is Smoking Actually Beneficial to Your Health?

Agresti/Franklin Statistics, 46 of 52

Example: Is Smoking Actually Beneficial to Your Health?

Agresti/Franklin Statistics, 47 of 52

Example: Is Smoking Actually Beneficial to Your Health?

Agresti/Franklin Statistics, 48 of 52

Example: Is Smoking Actually Beneficial to Your Health?

Agresti/Franklin Statistics, 49 of 52

Example: Is Smoking Actually Beneficial to Your Health?

An association can look quite different after adjusting for the effect of a third variable by grouping the data according to the values of the third variable
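
The reversal can be reproduced with a small set of counts. The numbers below are hypothetical, invented only to show the mechanism; they are not the data from the smoking example in the slides. Within each age group nonsmokers have the higher survival proportion, yet smokers have the higher proportion overall because most smokers in this made-up sample are young.

```python
# Hypothetical counts: (survived, total) by age group and smoking status.
data = {
    "young": {"smoker": (180, 200), "nonsmoker": (95, 100)},
    "old": {"smoker": (20, 50), "nonsmoker": (100, 200)},
}


def proportion(survived, total):
    return survived / total


# Survival proportion within each age group.
by_age = {
    age: {status: proportion(*cell) for status, cell in groups.items()}
    for age, groups in data.items()
}

# Survival proportion ignoring age (collapsing over the third variable).
overall = {}
for status in ("smoker", "nonsmoker"):
    survived = sum(data[age][status][0] for age in data)
    total = sum(data[age][status][1] for age in data)
    overall[status] = proportion(survived, total)

print(by_age)
print(overall)
```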

Agresti/Franklin Statistics, 50 of 52

Data are available for all fires in Chicago last year on x = number of firefighters at the fires and y = cost of damages due to fire

Would you expect the correlation to be negative, zero, or positive?

a. Negative

b. Zero

c. Positive

Agresti/Franklin Statistics, 51 of 52

Data are available for all fires in Chicago last year on x = number of firefighters at the fires and y = cost of damages due to fire

If the correlation is positive, does this mean that having more firefighters at a fire causes the damages to be worse?

a. Yes

b. No

Agresti/Franklin Statistics, 52 of 52

Data are available for all fires in Chicago last year on x = number of firefighters at the fires and y = cost of damages due to fire

Identify a third variable that could be considered a common cause of x and y:

a. Distance from the fire station

b. Intensity of the fire

c. Time of day that the fire was discovered