Copyright © 2014, 2011 Pearson Education, Inc.

Chapter 18: Inference for Counts


18.1 Chi-Squared Tests

Retailers can customize the online shopping experience by learning more about their customers. For example, Amazon wants to know whether income level affects what shoppers look for (camera or phone) when they visit the electronics department.

Use a chi-squared test for independence to answer this question.



Contingency Table: Purchase Category vs. Household Income (555 visitors to Amazon)



Observations from Contingency Table

An association is evident, suggesting that income and choice of product are dependent.

Households with lower incomes seem more likely to purchase a phone; those with higher incomes a camera.

Are these differences in purchase rates the result of sampling variation?


18.2 Test of Independence

Chi-Squared test of independence

Tests the independence of two categorical variables using counts in a contingency table.



Hypotheses for the chi-squared test

H0: Household Income and Purchase Category are independent.

Ha: Household Income and Purchase Category are not independent.

Or

H0: p25 = p50 = p75 = p100 = p100+

Ha: p25, p50, p75, p100, p100+ are not all equal



Hypotheses for the chi-squared test

Null hypothesis describes five segments of the population defined by household income.

Null assumes conditional probabilities of purchase type given income level are equal across the five segments.

Alternative hypothesis is vague; does not indicate why the null is false.



Calculating χ2

Measures the distance between the observed contingency table and a hypothetical contingency table.

The hypothetical contingency table obeys H0 while being consistent with observed marginal counts.



Calculating χ2

The null hypothesis determines expected cell counts in the hypothetical table.
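As a rough illustration of how the hypothetical table can be built, the sketch below uses the standard rule that, under independence, each expected count equals (row total × column total) / n. The function name `expected_counts` and the 2×3 table are hypothetical and chosen only for illustration; they are not the retail data.

```python
def expected_counts(observed):
    """Expected cell counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


# Hypothetical 2x3 table of counts (illustrative only, NOT the retail data).
observed = [[30, 20, 10],
            [20, 30, 40]]

expected = expected_counts(observed)
for row in expected:
    print(row)
```

The expected table keeps the same row and column totals as the observed table, which is what "consistent with observed marginal counts" means.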



Calculating χ2

Accumulates the deviations between the observed and expected counts (in the hypothetical table) across all cells.

$$\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$$
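The accumulation over cells can be sketched in a few lines. This is a minimal illustration, assuming the expected counts have already been computed as (row total × column total) / n; the function name `chi_squared` and the counts are hypothetical, not the retail data.

```python
def chi_squared(observed, expected):
    """Sum (observed - expected)^2 / expected over every cell of the table."""
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(observed, expected)
               for o, e in zip(obs_row, exp_row))


# Hypothetical counts (illustrative only). The expected values below are
# row total * column total / n for this particular observed table.
observed = [[30, 20, 10],
            [20, 30, 40]]
expected = [[20.0, 20.0, 20.0],
            [30.0, 30.0, 30.0]]

stat = chi_squared(observed, expected)
print(round(stat, 3))
```

Cells where the observed count equals the expected count contribute nothing; cells far from expectation, relative to the size of the expected count, contribute the most.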



Calculating χ2

For retail data on purchase category and household income, the chi-squared statistic is 33.925.

$$\chi^2 = \frac{(26 - 43.61)^2}{43.61} + \frac{(38 - 50.74)^2}{50.74} + \cdots + \frac{(53 - 57.16)^2}{57.16} = 33.925$$



Plots of the chi-squared test

Mosaic Plot for Retail Data



Plots of the chi-squared test

Mosaic Plot for Independent Variables



Conditions

No lurking explanation for association.

Data are random samples from indicated segments of the population.

Categories defining the table are mutually exclusive.

Expected cell counts are not too small.



The chi-squared distribution

Sampling distribution of the chi-squared statistic if the null hypothesis is true.

Right-skewed.

Assigns probabilities to positive values only.

Identified by degrees of freedom (df).

Approaches normal distribution as df increase.



The chi-squared distribution



Getting the p-value

df for χ2 test of independence = (r - 1)(c - 1)

df based on size of contingency table

r = number of rows

c = number of columns



Getting the p-value – Retail Example

Observed χ2 = 33.925 with 4 df.

From the χ2 table, P(χ2 > 9.4877) = 0.05; since 33.925 > 9.4877, we can reject H0.

The p-value is therefore < 0.05; the exact p-value is 0.0000008.
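The numbers above can be checked without a statistics package. The sketch below is one possible way to do it, not the method used in the text: it relies on the closed form P(χ2 > x) = exp(−x/2) · Σ (x/2)^i / i!, summed over i = 0 to df/2 − 1, which holds only for even df. The helper name `chi2_sf_even_df` is hypothetical, and the 2×5 shape assumes two purchase categories and five income segments.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability P(chi-squared > x); valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only holds for positive even df")
    half = x / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i)
                                 for i in range(df // 2))


# Assumed table shape: 2 purchase categories by 5 income segments.
rows, cols = 2, 5
df = (rows - 1) * (cols - 1)

p_value = chi2_sf_even_df(33.925, df)           # observed statistic
tail_at_critical = chi2_sf_even_df(9.4877, df)  # tabled 5% critical value

print(df, p_value, tail_at_critical)
```

For odd df this shortcut does not apply, and a library routine for the chi-squared distribution would be the natural choice.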



Getting the p-value – Retail Example



Summary: chi-squared test of independence



Chi-squared test of independence – Checklist

No obvious lurking variable.

SRS condition.

Contingency table condition.

Sample size condition. Expected cell frequencies at least 10; expected cell frequencies of 5 permitted with at least 4 df.



Connection to two-sample tests

For a 2×2 table, the chi-squared test reduces to the two-sided version of the two-sample test of the difference between proportions.

If the 95% confidence interval for p1 − p2 does not include zero, then the chi-squared test has a p-value less than 0.05 and H0 is rejected.
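A small numerical check can make this connection concrete. The sketch below assumes the two-sample z statistic uses the pooled standard error; under that assumption, z squared equals the chi-squared statistic of the corresponding 2×2 table. (A confidence interval built from the unpooled standard error usually agrees with the test but is not algebraically identical.) All counts are hypothetical.

```python
import math

# Hypothetical counts (illustrative only): successes out of n in two groups.
x1, n1 = 40, 100
x2, n2 = 25, 100

# Two-sample z statistic with the pooled standard error.
p1, p2 = x1 / n1, x2 / n2
pooled = (x1 + x2) / (n1 + n2)
z = (p1 - p2) / math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))

# Chi-squared statistic for the same data arranged as a 2x2 table.
observed = [[x1, n1 - x1],
            [x2, n2 - x2]]
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)
chi2 = sum((observed[i][j] - row_totals[i] * col_totals[j] / n) ** 2
           / (row_totals[i] * col_totals[j] / n)
           for i in range(2) for j in range(2))

print(round(z ** 2, 4), round(chi2, 4))
```

Because the square of a standard normal variable follows a chi-squared distribution with 1 df, the two-sided z test and the chi-squared test give the same p-value here.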


4M Example 18.1: RETAIL CREDIT

Motivation

Managers of a chain worry that some methods of recruiting customers for store credit, called channels, produce more problems than other channels. Is the channel used related to the status of the customer’s account a year later?



Method

Data collected for 630 accounts on variables Channel and Status after 12 months.



Method – Check Conditions

No obvious lurking variable. Difficult to check without knowing more about channels.

SRS condition reasonably met.

Contingency table condition satisfied.

Sample size condition must be checked after computing expected cell frequencies.



Mechanics – Mosaic Plot



Mechanics – Expected Counts

Sample size condition satisfied.

χ2 = 9.158 with 4 df; p-value = 0.057.

Cannot reject H0 at α = 0.05.



Message

Observed rates of late payments and early closure are not statistically significantly different among credit accounts opened a year ago through in-store, mailing and Web channels. Since the p-value is close to 0.05, it may be worthwhile to monitor accounts developed through mailings.


18.3 General Versus Specific Hypotheses

Chi-squared test cannot match the power of a more specific test.

A 95% confidence interval for the difference in proportions of late payments from accounts developed via the mailing channel versus the other two channels (combined into one) does not contain zero.


18.4 Tests of Goodness of Fit

Chi-Squared test of goodness of fit

A test of the distribution of a single categorical variable.



Testing for randomness

Do shoppers purchase big-ticket items more often on some days of the week than on others?

Are cars made on some days more likely to have defects than the cars made on other days?


4M Example 18.2: DETECTING ACCOUNTING FRAUD

Motivation

Managers would like to have a systematic method to audit purchase amounts on invoices to uncover fraud.



Method

Managers collected a sample of n = 135 invoices. Amounts ranged from $100 to $100,000, with an average of $42,000. Leading digits for the amounts should follow a distribution known as Benford’s law.



Method

Probabilities based on Benford’s law



Method

Counts of leading digits in sample of invoices




All conditions are satisfied. The smallest expected count is 6.2. Because there are more than 4 degrees of freedom, the relaxed sample size condition is used.
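The sample size check can be reproduced from the standard statement of Benford’s law, P(leading digit = d) = log10(1 + 1/d) for d = 1, ..., 9. The sketch below uses that formula together with n = 135 from the example; the variable names are hypothetical, and the observed digit counts are not reproduced here.

```python
import math

n = 135  # number of invoices in the sample

# Benford's law: probability that the leading digit equals d.
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# Expected count for each leading digit if the amounts follow Benford's law.
expected = {d: n * p for d, p in benford.items()}

smallest = min(expected.values())  # occurs at d = 9
df = len(benford) - 1              # nine categories, no estimated parameters

for d in range(1, 10):
    print(d, round(benford[d], 3), round(expected[d], 1))
print("smallest expected count:", round(smallest, 1), "df:", df)
```

The smallest expected count belongs to the digit 9, since Benford’s law assigns the lowest probability to the largest leading digit.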



Mechanics

Χ2 = 19.1 with 8 df. P-value = 0.014. Reject H0.



Message

The distribution of leading digits in these invoice amounts differs statistically significantly from the form predicted by Benford’s law. This supports the suspicion that the digits are atypical and may indicate fraud.



Testing the fit of a probability model

How do we know whether the observed counts match a particular distribution?


4M Example 18.3: WEB HITS

Motivation

Managers of the Web site plan to use a Poisson model to summarize how often users click on ads. If it fits well, they will use this model to summarize concisely the volume of traffic headed to advertisers and to measure the effects of changes in the Web site on traffic patterns.



Method

Data collected on a sample of 685 users who visited the Web site during a recent weekday evening.




SRS and the contingency table conditions are satisfied. However, the last three categories must be combined in order to meet the sample size condition.



Mechanics



Mechanics

Χ2 = 0.345 with 2 df. P-value = 0.84. Cannot reject H0.
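The mechanics of this kind of fit can be sketched as follows. Several things here are assumptions rather than facts from the example: the rate `lam` is a hypothetical value (the estimate is not reported above), the categories after pooling are taken to be 0, 1, 2, and 3 or more clicks, and the degrees of freedom follow the usual rule of categories − 1 − number of estimated parameters. Only n = 685, the statistic 0.345, and the 2 df come from the example.

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability of exactly k events when the mean is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


n = 685            # users in the sample
lam = 0.6          # HYPOTHETICAL rate; the estimate is not given in the text
max_separate = 2   # ASSUMPTION: keep 0, 1, 2 separate and pool 3 or more

# Probabilities for 0..max_separate, then one pooled tail category.
probs = [poisson_pmf(k, lam) for k in range(max_separate + 1)]
probs.append(1 - sum(probs))
expected = [n * p for p in probs]

# One degree of freedom is lost for the total, one more for estimating lam.
df = len(probs) - 1 - 1

# With 2 df, the upper-tail probability has the simple form exp(-x / 2).
p_value = math.exp(-0.345 / 2)

print([round(e, 1) for e in expected], df, round(p_value, 2))
```

Pooling the sparse upper categories raises the smallest expected count at the cost of one degree of freedom per category removed.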



Message

The distribution of the number of ads clicked by users is consistent with a Poisson distribution. Managers of the Web site can use this model to summarize user behavior.


Best Practices

Remember the importance of experiments.

State your hypotheses before looking at the data.

Plot the data.

Think when you interpret a p-value.


Pitfalls

Don’t confuse statistical significance with substantive significance.

Don’t use a chi-squared test when the expected frequencies are too small.

Don’t cherry pick comparisons.

Don’t use the number of observations to find the degrees of freedom of chi-squared.
