Business Statistics Stat 2160 2010 Edition Prepared by: Statistical Computation Lab (SCL) Department of Statistics Western Michigan University


Page 1: Business Statistics Stat 2160 - wmich.edu · 1.3 Summarizing Categorical Data ... is a course in Business Statistics’- the course will cover ... Student Sex Level GPA Hours

Business Statistics

Stat 2160

2010 Edition

Prepared by: Statistical Computation Lab (SCL)

Department of Statistics, Western Michigan University


Copyright © 2010 by Joshua D. Naranjo at Western Michigan University.

All rights reserved.

Reproduction or translation of any part of this work beyond that permitted by Sections 107 and 108 of the 1976 United States Copyright Act without the permission of the copyright owner is unlawful.


Contents

1 Data Presentation
    1.1 Statistics and Data
    1.2 Measurement Levels of Data
    1.3 Summarizing Categorical Data
        1.3.1 Relative Frequency Table
        1.3.2 Bar Chart
        1.3.3 Pie Chart
    1.4 Summarizing Numerical Data
        1.4.1 Stem-and-Leaf Plot
        1.4.2 Relative Frequency Table and Histogram
        1.4.3 Box-and-Whisker Plot
        1.4.4 Dotplot
        1.4.5 Symmetry and Skewness
    1.5 Using the TI-84:
        1.5.1 Entering Data into a List
        1.5.2 Calculating the five-number-summary
        1.5.3 Drawing a Boxplot or Histogram
    1.6 Using Excel
        1.6.1 Creating and saving an Excel format data file
        1.6.2 Creating an Excel data file from a text file (separated by spaces, comma, or tab)
        1.6.3 Retrieving a previously saved Excel data file
        1.6.4 How to get summary statistics
        1.6.5 How to create a histogram
    1.7 Exercises

2 Samples and Populations
    2.1 Sample Surveys
    2.2 Methods of Sample Selection
    2.3 Types of Surveys
    2.4 Some Issues in Question Construction
    2.5 Types of Survey Errors
    2.6 A Questionnaire Example
    2.7 Exercises

3 Location and Spread
    3.1 Measures of Location
    3.2 The Average and SD
        3.2.1 Interpreting the SD
        3.2.2 The empirical rule for the SD
    3.3 The Variance
    3.4 Other Measures of Location
    3.5 Other Measures of Spread
    3.6 Adding a Constant
    3.7 Multiplying by a constant
    3.8 Using the TI-84:
        3.8.1 Calculating X̄ and S
    3.9 Exercises

4 Binomial and Normal Distributions
    4.1 Computing Probabilities Using the Normal Curve
        4.1.1 Calculating Areas under the Normal Curve
        4.1.2 Calculating Percentiles Using the Normal Curve
        4.1.3 The Standard Normal or Z-Curve
    4.2 Binomial Probabilities
        4.2.1 Computing Binomial Probabilities
        4.2.2 Expected Value and SD of a Binomial Random Variable
    4.3 Computing Binomial Probabilities Using the Normal Curve
        4.3.1 Some Approximations Are Better Than Others
    4.4 Using the TI-84:
        4.4.1 Computing Normal Curve probability between a and b
        4.4.2 Computing Normal Curve percentile
        4.4.3 Computing Binomial PDF
        4.4.4 Computing Binomial CDF
    4.5 Using Excel
        4.5.1 Computing binomial probabilities
        4.5.2 Computing P(X ≤ x) from the normal curve
        4.5.3 Getting z critical values
        4.5.4 An alternative method
    4.6 Exercises

5 Sampling Distribution of the Proportion
    5.1 The Sample Proportion
        5.1.1 The Sampling Distribution of p̂ is Approximately Normal
    5.2 Estimating the Population Proportion p
    5.3 Exercises

6 Sampling Distribution of the Mean
    6.1 Behavioral Properties of the Sample Average
    6.2 Drawing from a Nonnormal Population
    6.3 Estimating the Population Mean Using a Random Sample
    6.4 Exercises

7 Confidence Intervals
    7.1 Estimating the Population Mean Using Intervals
    7.2 t-based Confidence Interval for the Mean
    7.3 Determining Sample Size for Estimating the Mean
    7.4 Comparing the Averages of Two Independent Samples
    7.5 Comparing Averages of Two Paired Samples
    7.6 Estimating the Population Proportion Using Intervals
    7.7 Sample Size for Estimating the Population Proportion
    7.8 Confidence Interval for the Difference Between Two Proportions
    7.9 Overview of Confidence Intervals
    7.10 Using the TI-84
        7.10.1 Computing a Confidence Interval for µ
        7.10.2 Computing a confidence interval for µ1 − µ2
        7.10.3 Computing a Confidence Interval for p
        7.10.4 Computing a Confidence Interval for p1 − p2
    7.11 Exercises

8 Testing Hypotheses
    8.1 Test of Significance involving the Sample Average
        8.1.1 Upper-tailed t-test
        8.1.2 Two-tailed t-test
        8.1.3 The t-test is not universally appropriate
        8.1.4 Generalizing Tests of Significance
        8.1.5 Examples
        8.1.6 Type I and Type II Errors
    8.2 Using the TI-84
        8.2.1 Calculating a t-test for µ
        8.2.2 Calculating a t-test for µ1 − µ2
        8.2.3 Calculating a Paired t-test for µ1 − µ2
        8.2.4 Calculating a z-test for p
        8.2.5 Calculating a z-test for p1 − p2
    8.3 Using Excel
        8.3.1 Comparing two means (independent samples)
        8.3.2 Comparing two means (paired samples)
    8.4 Exercises

9 Testing Equality of Frequencies
    9.1 Testing for Independence between Two Variables Using an r × c Table
        9.1.1 Calculating the Table of Expected Frequencies
        9.1.2 The Chi-Square Test Statistic
    9.2 Testing for Equality of Proportions Between 2 Samples
    9.3 Using the TI-84
        9.3.1 Entering Data into a Matrix
        9.3.2 Calculating a Chi-Square Test for an r by c Table
    9.4 Exercises

10 Correlation
    10.1 Correlation
    10.2 Computing the Pearson Correlation Coefficient
    10.3 Exercises

11 Linear Regression
    11.1 Simple Linear Regression
    11.2 Calculating the Least Squares Regression Line
    11.3 More on Simple Regression
        11.3.1 The Excel Printout
    11.4 Multiple Linear Regression
    11.5 Optional: The Multicollinearity Problem
    11.6 Using Excel
        11.6.1 Multiple Linear Regression
    11.7 Exercises


Chapter 1

Data Presentation


1.1 Statistics and Data

The term statistics is used in two ways. In the broad context, ‘statistics’ refers to a collection of techniques and procedures for analyzing data. This is what we mean when we say ‘Statistics 2160 is a course in Business Statistics’: the course will cover techniques and procedures for analyzing business-related data. In a narrower context, ‘statistics’ refers to the numbers themselves (like the average, median, etc.) that arise out of data analysis. ‘Baseball statistics’ and ‘financial market statistics’ are used in this context.

In this course, ‘data’ refers to any collection of measurements that are made on a number of subjects. These measurements are often stored in a row-and-column display called a spreadsheet. An example is given below of a spreadsheet containing data taken on ten students in a class.

CLASS DATA

Student  Sex  Level      GPA   Hours Taken  Transport  Sleep Hrs. Last Night
1        M    Sophomore  3.10  32           Car         7.0
2        M    Junior     3.20  66           Car         8.0
3        F    Senior     3.49  94           Bus         8.0
4        M    Senior     2.68  89           Walk       10.0
5        F    Junior     3.73  69           Bicycle     8.0
6        F    Junior     3.39  59           Car         8.0
7        F    Senior     3.80  86           Walk        8.0
8        M    Junior     3.11  75           Car         8.0
9        F    Sophomore  3.10  27           Car         7.0
10       M    Senior     3.10  96           Walk        3.0

Each row represents a subject (or student, in this case), and each column represents a measurement. The measurement columns are also called variables, because the values vary from one subject to another.

1.2 Measurement Levels of Data

It is useful to distinguish between four levels of measurements for data, from weakest to strongest:

1. Nominal (no ordering)
2. Ordinal (ordering exists, but not distance)
3. Interval (distance exists, but not ratios)
4. Ratio (ratios exist)


Sex is a nominal variable, since ‘Male’ and ‘Female’ are just names of categories. There is no intrinsic ordering between them.

A student’s level of standing (freshman, sophomore, junior, or senior) is ordinal; the levels are also names of categories but, unlike sex, they are rank-ordered. However, subtraction cannot be done and distances do not make sense.

GPA is an interval measurement; subtraction can be done and distances make sense. For example, the distance from 2.3 to 2.4 is the same as the distance from 3.7 to 3.8. However, ratios do not make sense; is 4.0 ‘twice as high’ as 2.0? The answer is no. The grading system would work just as well on the scale (A, B, C)=(5.0, 4.0, 3.0) instead of (4.0, 3.0, 2.0), and on that scale an A is no longer twice a C.

Finally, number of credit hours is a ratio measurement. A student who has completed 90 credit hours has twice as many as a student with 45 credit hours, and three times as many as a student with 30 credit hours.
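The GPA argument can be checked numerically. The short Python sketch below is only an illustration (the variable names are not part of the course material): it shifts the grade scale by one point and compares differences and ratios before and after the shift.

```python
# Two equally workable grade scales for (A, B, C), as described in the text.
scale_4 = {"A": 4.0, "B": 3.0, "C": 2.0}
scale_5 = {grade: points + 1.0 for grade, points in scale_4.items()}

# Interval property: the distance between A and C is the same on both scales.
diff_4 = scale_4["A"] - scale_4["C"]
diff_5 = scale_5["A"] - scale_5["C"]

# Ratios change with the scale, so 'twice as high' has no fixed meaning for GPA.
ratio_4 = scale_4["A"] / scale_4["C"]
ratio_5 = scale_5["A"] / scale_5["C"]

# Credit hours have a true zero, so this ratio is meaningful.
hours_ratio = 90 / 45

print(diff_4, diff_5)
print(ratio_4, ratio_5)
print(hours_ratio)
```

The differences agree while the ratios do not, which is what separates an interval measurement from a ratio measurement.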

It is useful to recognize a hierarchy of information in the sense that a measurement level contains an amount of information greater than or equal to the level below it. At lower levels of measurement, data analyses tend to be less sensitive and sophisticated. A statistical study should aim for the highest levels of measurement possible or affordable.

Interval and ratio variables together are often called numerical variables because they provide a number which measures ‘quantity’ (how much, how many) of something.

Nominal and ordinal variables together are often called categorical variables because they classify into categories rather than count or measure. It is tempting to think of categorical variables as ‘non-numerical’ but sometimes they do consist of numbers. For example, ‘social security number’ consists of digits, but these are used as labels rather than quantities.

1.3 Summarizing Categorical Data

A large business school conducted a study of graduation rate for their students. Out of 408 entering students in 1995, 55% had graduated within 6 years. Data for a random sample of 56 students are given below. ‘AGE’ was their age upon entering the program, ‘HS GPA’ was their cumulative high school GPA, ‘DORM’=1 means they lived in a dorm during their first year in college, ‘GPA2’ was their college GPA after two years of college, and ‘GRAD6’=1 means they graduated within 6 years of starting college.

The variables MAJOR, SEX, DORM, and GRAD6 are categorical. The variables AGE, HS GPA, and GPA2 are numerical.

GRADUATION RATE DATA

STUDENT MAJOR SEX AGE HS GPA DORM GPA2 GRAD6

1 GBS M 18 2.73 1 3.40 0

2 GBS F 18 3.20 1 1.35 0

3 MGT M 18 3.21 1 3.27 1


4 ACT M 18 3.50 1 2.87 1

5 ACT F 18 3.58 1 2.93 0

6 ACT F 18 3.60 1 2.94 1

7 MGT M 18 2.91 1 0.46 0

8 ACT F 18 3.14 1 2.70 0

9 ACT F 34 2.60 0 2.84 0

10 GBS M 18 2.93 1 2.22 0

11 GBS F 18 3.83 1 3.10 1

12 ACT M 18 3.86 1 3.06 0

13 GBS F 18 3.16 1 2.61 0

14 GBS F 18 3.44 1 3.02 0

15 GBS F 18 2.63 1 1.83 0

16 ACT F 18 3.77 1 2.99 0

17 GBS F 18 2.79 1 2.93 1

18 GBS M 18 4.00 1 3.78 0

19 GBS M 18 3.76 1 3.01 1

20 GBS M 18 3.57 1 3.08 1

21 GBS M 18 2.94 1 1.94 0

22 GBS M 18 3.32 1 3.59 1

23 ACT F 19 2.55 0 2.49 1

24 MGT M 19 2.43 1 2.38 1

25 ACT M 18 2.78 1 1.27 0

26 ACT M 18 3.72 1 3.47 1

27 ACT F 17 3.76 0 3.00 0

28 MGT M 17 3.36 1 3.63 1

29 GBS M 18 2.57 1 1.18 0

30 GBS M 18 2.90 1 2.16 0

31 ACT M 18 3.36 1 3.40 1

32 GBS M 18 3.54 1 2.85 0

33 GBS F 18 2.81 1 3.00 1

34 MGT M 18 2.78 1 2.63 0

35 GBS M 18 2.46 1 2.19 0

36 GBS M 18 2.60 0 2.54 0

37 MKT M 18 3.08 1 3.22 0

38 MGT M 18 2.78 1 2.75 1

39 ACT M 18 2.71 1 2.02 0

40 GBS M 18 2.61 1 2.42 1

41 GBS M 18 2.37 1 2.19 1

42 MKT M 18 3.33 1 3.48 1

43 ACT F 18 3.54 1 2.91 1


44 GBS M 17 3.36 1 3.22 1

45 ACT M 18 3.35 1 2.16 1

46 GBS M 17 2.82 1 2.35 1

47 GBS F 18 2.75 1 2.10 0

48 GBS M 19 2.59 1 1.77 0

49 MKT M 18 2.67 0 1.78 0

50 MKT M 18 3.19 1 2.71 1

51 MGT F 18 3.62 1 3.33 1

52 GBS F 18 2.81 1 1.82 0

53 ACT F 18 3.73 1 2.79 0

54 MGT M 18 2.60 1 2.87 0

55 GBS F 19 2.58 1 2.34 1

56 GBS M 18 3.29 1 2.99 1

1.3.1 Relative Frequency Table

A relative frequency table presents the total count for each category and the relative frequency or percentage of time in which each category occurs. A relative frequency table for MAJOR, which has four categories, is provided below.

Relative Frequency Table for Major

Major   Frequency   Rel. Freq.
ACT        16         0.29
GBS        28         0.50
MGT         8         0.14
MKT         4         0.07
-----      --         ----
Total      56         1.00
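As a rough illustration, the table can be rebuilt from the raw category labels with a few lines of Python. The list `majors` below is constructed from the counts in the table rather than typed in student by student, and the function name is hypothetical.

```python
from collections import Counter

# Major labels for the 56 students, built from the counts in the table above.
majors = ["ACT"] * 16 + ["GBS"] * 28 + ["MGT"] * 8 + ["MKT"] * 4

def relative_frequency_table(values):
    """Return {category: (count, relative frequency)}, sorted by category name."""
    counts = Counter(values)
    total = sum(counts.values())
    return {cat: (n, n / total) for cat, n in sorted(counts.items())}

table = relative_frequency_table(majors)
for cat, (n, rel) in table.items():
    print(f"{cat:<6}{n:>4}{rel:>7.2f}")
total_rel = sum(rel for _, rel in table.values())
print(f"{'Total':<6}{len(majors):>4}{total_rel:>7.2f}")
```

Each relative frequency is simply the category count divided by the total of 56.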

1.3.2 Bar Chart

A bar chart is a plot of the relative frequency table. The order in which the categories are presented in the bar chart is arbitrary. Sometimes they are presented in order of decreasing frequency for aesthetic purposes. Some computing packages present the categories alphabetically, unless otherwise specified (see the following bar chart from Excel).


1.3.3 Pie Chart

A pie chart gives the same information as a bar chart, except in a circular shape. Some people prefer the look of pie charts over bar charts. However, studies have shown that some people have difficulty comparing relative size of wedges. Look at the pie chart below. Does the size of ACT look twice the size of MGT, and four times the size of MKT?
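For reference when judging a pie chart by eye, the wedge angles implied by the Major counts can be computed directly. This is a small sketch in Python; the dictionary names are illustrative.

```python
# Counts from the relative frequency table for Major.
counts = {"ACT": 16, "GBS": 28, "MGT": 8, "MKT": 4}
total = sum(counts.values())

# Each wedge takes a share of the 360-degree circle proportional to its count.
angles = {major: 360 * n / total for major, n in counts.items()}

for major, angle in angles.items():
    print(f"{major}: {angle:.1f} degrees")
```

By construction the ACT wedge spans twice the angle of MGT and four times the angle of MKT, whether or not the picture conveys that clearly.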


1.4 Summarizing Numerical Data

The variables AGE, HS GPA, and GPA2 in the Graduation Rate Data are numerical. In this section we discuss some techniques for summarizing and comparing numerical data.

1.4.1 Stem-and-Leaf Plot

A stem-and-leaf plot for HS GPA is presented below. The numbers on the left are called stems, while the numbers on the right are called leaves. Note that the stems skip a number because the first stem includes GPA’s that start with 2.2 or 2.3 and the second stem includes GPA’s that start with 2.4 or 2.5 (the six leaves on this second stem correspond to 2.43, 2.46, 2.55, 2.57, 2.58, and 2.59, respectively). The stem width here is chosen so as to achieve a moderate number of rows, and produce a reasonably informative picture of how the observations are ‘distributed’.


Stem-and-Leaf Display of HS GPA (stem width=.2)

22 | 7
24 | 365789
26 | 0001371358889
28 | 1120134
30 | 8469
32 | 019235666
34 | 404478
36 | 0223667
38 | 36
40 | 0

If the stems were broken up, we would have the following stem-and-leaf plot. Which stem-and-leaf plot is better? The answer depends on the viewer’s taste, and is not a matter of right or wrong.

Stem-and-Leaf Display of HS GPA (stem width=.1)

22 |
23 | 7
24 | 36
25 | 5789
26 | 000137
27 | 1358889
28 | 112
29 | 0134
30 | 8
31 | 469
32 | 019
33 | 235666
34 | 4
35 | 04478
36 | 02
37 | 23667
38 | 36
39 |
40 | 0

The distribution of HS GPA seems to have a cluster around 2.7, and another cluster around 3.3. This “bimodal shape” (i.e. having two modes or peaks) is also seen in the histogram drawn in the next section.
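A stem-and-leaf display with stem width .1 can be generated mechanically. The Python sketch below is one possible way to do it, not the procedure used to produce the displays in these notes; it uses the 56 HS GPA values as listed in order in Section 1.4.3 and starts at the first non-empty stem.

```python
from collections import defaultdict

# The 56 HS GPA values, as listed in order in Section 1.4.3.
hs_gpa = [
    2.37, 2.43, 2.46, 2.55, 2.57, 2.58, 2.59, 2.60, 2.60, 2.60,
    2.61, 2.63, 2.67, 2.71, 2.73, 2.75, 2.78, 2.78, 2.78, 2.79,
    2.81, 2.81, 2.82, 2.90, 2.91, 2.93, 2.94, 3.08, 3.14, 3.16,
    3.19, 3.20, 3.21, 3.29, 3.32, 3.33, 3.35, 3.36, 3.36, 3.36,
    3.44, 3.50, 3.54, 3.54, 3.57, 3.58, 3.60, 3.62, 3.72, 3.73,
    3.76, 3.76, 3.77, 3.83, 3.86, 4.00,
]

def stem_and_leaf(values):
    """Return (stem, leaves) rows; stems are in tenths, leaves are the hundredths digit."""
    rows = defaultdict(list)
    for x in sorted(values):
        hundredths = round(x * 100)  # 2.37 -> 237; rounding guards against float error
        rows[hundredths // 10].append(hundredths % 10)
    stems = range(min(rows), max(rows) + 1)  # include empty stems between min and max
    return [(s, "".join(str(d) for d in rows.get(s, []))) for s in stems]

for stem, leaves in stem_and_leaf(hs_gpa):
    print(f"{stem:>3} | {leaves}")
```

Sorting first puts the leaves on each stem in increasing order, which makes the two clusters easier to see.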

1.4.2 Relative Frequency Table and Histogram

In Section 1.3.1, we presented a relative frequency table for categorical data. Relative frequency tables may also be used for numerical data. A relative frequency table for HS GPA is presented below. The class width is chosen so as to achieve a moderate number of class intervals.

Relative Frequency Table of High School GPA

HS GPA     Frequency   Rel. Freq.
2.2-2.39       1          .02
2.4-2.59       6          .11
2.6-2.79      13          .23
2.8-2.99       7          .13
3.0-3.19       4          .07
3.2-3.39       9          .16
3.4-3.59       6          .11
3.6-3.79       7          .13
3.8-4.00       3          .05
--------      --         ----
Total         56         1.01

(The relative frequencies sum to 1.01 rather than 1.00 because each entry is rounded to two decimals.)

There are three things to keep in mind when constructing a frequency table:

1. How many classes (i.e. intervals) do you want? This also determines the class width.
2. Where does the first interval start?
3. How do you avoid boundary disputes?

The last item requires you to choose a boundary convention. For instance, the intervals in the frequency table above could have been written as follows:

2.2-2.4
2.4-2.6
2.6-2.8
etc.

This is much simpler to read, isn’t it? However, does a grade of 2.4 count in the first or second interval? One remedy is to include a footnote to the table that says something like “The intervals include the left endpoint but not the right”. This way, the grade of 2.4 belongs to the second (not the first) interval. Alternatively, square brackets and parentheses may be used as follows: [2.2, 2.4), [2.4, 2.6), ..., [3.8, 4.0]. This also means that the class intervals include the left endpoint but not the right (except for the last class). The frequency table above avoids the boundary problem altogether by making each interval end before the start of the next one. This is also effective, but the frequency table can look more complicated. Which method do you prefer?
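The "left endpoint included, right endpoint excluded" convention is easy to state in code. The following Python sketch is an illustration under that convention (the function name and its default arguments are assumptions, chosen to match the HS GPA table); it works in whole hundredths so that a boundary value such as 2.40 cannot slip into the wrong class through floating-point error.

```python
# The 56 HS GPA values, as listed in order in Section 1.4.3.
hs_gpa = [
    2.37, 2.43, 2.46, 2.55, 2.57, 2.58, 2.59, 2.60, 2.60, 2.60,
    2.61, 2.63, 2.67, 2.71, 2.73, 2.75, 2.78, 2.78, 2.78, 2.79,
    2.81, 2.81, 2.82, 2.90, 2.91, 2.93, 2.94, 3.08, 3.14, 3.16,
    3.19, 3.20, 3.21, 3.29, 3.32, 3.33, 3.35, 3.36, 3.36, 3.36,
    3.44, 3.50, 3.54, 3.54, 3.57, 3.58, 3.60, 3.62, 3.72, 3.73,
    3.76, 3.76, 3.77, 3.83, 3.86, 4.00,
]

def frequency_table(values, start=2.2, width=0.2, n_classes=9):
    """Count values per class; classes are [left, right), except the last, which is closed."""
    lo = round(start * 100)   # class boundaries in whole hundredths
    w = round(width * 100)
    hi = lo + w * n_classes
    counts = [0] * n_classes
    for x in values:
        h = round(x * 100)
        if not lo <= h <= hi:
            raise ValueError(f"{x} lies outside the table range")
        k = min((h - lo) // w, n_classes - 1)  # the right endpoint joins the last class
        counts[k] += 1
    return counts

counts = frequency_table(hs_gpa)
n = len(hs_gpa)
for i, c in enumerate(counts):
    left = 2.2 + 0.2 * i
    closing = "]" if i == len(counts) - 1 else ")"
    print(f"[{left:.1f}, {left + 0.2:.1f}{closing}  {c:>3}  {c / n:.3f}")
```

The relative frequencies are printed to three decimals here to sidestep the question of how a value such as 7/56 = .125 should be rounded to two.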

The relative frequency table is a compact numerical way to present how the data is distributed. If the frequencies are plotted as columns, the resulting plot is called a histogram.

[Figure: Histogram of High School GPA; horizontal axis from 2.0 to 4.0]

The histogram and stem-and-leaf plots look alike, except that the stem-and-leaf plot has columns that go sideways instead of upwards. Stem-and-leafs are better if you want the data values themselves available from the plot. However, the stem-and-leaf is impractical for large sample sizes. The frequency table and histogram can handle large sample sizes easily. Furthermore, they allow more flexibility in choosing class widths. For example, you may choose class widths of .15, as follows: [2.35, 2.50), [2.50, 2.65), [2.65, 2.80), ..., [3.85, 4.00].

1.4.3 Box-and-Whisker Plot

Order the data from smallest to largest. What is the range of the first (or smaller) half of the data? The smallest quarter of the data? The next quarter? The box-and-whisker plot or boxplot is a graphical picture of the distribution of quarters of the data. Consider once more the HS GPA variable in the Graduation Rate Data of Section 1.3. Since there are n=56 observations, a quarter of the data comprises 56/4=14 observations. The left whisker of the boxplot goes from 2.37 to 2.72. This means that the lowest 14 GPA’s (one quarter of the data) lie within (2.37, 2.72). The next 14 larger observations lie within (2.72, 3.11). The third set of 14 observations lies within (3.11, 3.52). Finally, the largest 14 observations lie within (3.52, 4.00).

[Figure: Boxplot of High School GPA; horizontal axis from 2.5 to 4.0]

The five values (2.37, 2.72, 3.11, 3.52, 4.00) that divide the data into quarters and form the fences and whiskers of the boxplot are collectively called the five-number-summary of the data. They are often denoted as MIN, Q1, MED, Q3, and MAX respectively.

1. MIN is called the minimum, and is the smallest of the ordered observations.
2. Q1 is the upper boundary of the first quarter, and is called the first quartile.
3. MED is the upper boundary of the second quarter, and is called the second quartile. However, it also divides the data into lower and upper halves, and is more often called the median.
4. Q3 is the upper boundary of the third quarter, and is called the third quartile.
5. MAX is the largest of the ordered observations and is called the maximum.


Boxplots are quite useful for comparing two distributions side-by-side. Below, we present a boxplot of second year GPA alongside the boxplot of high school GPA. The boxplots are presented vertically this time, but the interpretation remains the same. Note that there is a slight difference in location as measured by the medians, but there is a radical difference in spread between the two distributions. The most noteworthy feature of 2nd year GPA is the long left tail, which is evidence that some students are not doing very well in college. (Note: Some computing packages use a special symbol to denote outlying values, or outliers. The boxplot for second year GPA has an extremely low outlier, denoted by a circle. The left (or bottom) whisker ends at the second smallest observation.)

[Figure: Boxplots of HS GPA and GPA2, drawn vertically side by side; vertical axis from 0.5 to 4.0]

Different statistical computing packages often have different ways of computing the quartiles. In this class, we compute the quartiles as follows. First, arrange the observations from smallest (1st ordered observation) to largest (nth ordered observation). Then

Q1 is the .25(n+1)st ordered observation.
MED is the .50(n+1)st ordered observation.
Q3 is the .75(n+1)st ordered observation.

If .25(n+1) is not an integer, take the average of the two adjacent ordered observations. Similarly for MED and Q3. Following are the 56 ordered observations of HS GPA used in the boxplot above.

2.37 2.43 2.46 2.55 2.57 2.58 2.59 2.60 2.60 2.60

2.61 2.63 2.67 2.71 2.73 2.75 2.78 2.78 2.78 2.79

2.81 2.81 2.82 2.90 2.91 2.93 2.94 3.08 3.14 3.16

3.19 3.20 3.21 3.29 3.32 3.33 3.35 3.36 3.36 3.36

3.44 3.50 3.54 3.54 3.57 3.58 3.60 3.62 3.72 3.73

3.76 3.76 3.77 3.83 3.86 4.00

Since .25(56+1)=14.25, Q1 is computed as the average of the 14th and 15th ordered observations: (2.71+2.73)/2=2.72. Similarly, .50(56+1)=28.5 gives MED=(3.08+3.14)/2=3.11, and .75(56+1)=42.75 gives Q3=(3.50+3.54)/2=3.52.
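The quartile rule used in this class can be written as a short Python sketch. The function names are illustrative, and the code assumes the sample is large enough that the positions .25(n+1) and .75(n+1) fall between 1 and n. Other software may use a different rule and give slightly different quartiles.

```python
import math

# The 56 ordered observations of HS GPA listed above.
hs_gpa = [
    2.37, 2.43, 2.46, 2.55, 2.57, 2.58, 2.59, 2.60, 2.60, 2.60,
    2.61, 2.63, 2.67, 2.71, 2.73, 2.75, 2.78, 2.78, 2.78, 2.79,
    2.81, 2.81, 2.82, 2.90, 2.91, 2.93, 2.94, 3.08, 3.14, 3.16,
    3.19, 3.20, 3.21, 3.29, 3.32, 3.33, 3.35, 3.36, 3.36, 3.36,
    3.44, 3.50, 3.54, 3.54, 3.57, 3.58, 3.60, 3.62, 3.72, 3.73,
    3.76, 3.76, 3.77, 3.83, 3.86, 4.00,
]

def ordered_value(sorted_values, position):
    """Value at a 1-based position; if the position is fractional,
    average the two adjacent ordered observations (the rule used in this class)."""
    k = math.floor(position)
    if position == k:
        return sorted_values[k - 1]
    return (sorted_values[k - 1] + sorted_values[k]) / 2

def five_number_summary(values):
    """Return (MIN, Q1, MED, Q3, MAX) using positions p(n+1) for p = .25, .50, .75."""
    xs = sorted(values)
    n = len(xs)
    q1 = ordered_value(xs, 0.25 * (n + 1))
    med = ordered_value(xs, 0.50 * (n + 1))
    q3 = ordered_value(xs, 0.75 * (n + 1))
    return xs[0], q1, med, q3, xs[-1]

summary = five_number_summary(hs_gpa)
print([round(v, 2) for v in summary])
```

Rounded to two decimals, the result agrees with the five-number-summary (2.37, 2.72, 3.11, 3.52, 4.00) given earlier.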

1.4.4 Dotplot

For moderately large samples, a dotplot diagram is also an effective method of examining the shape of the distribution. We present dotplots of HS GPA and 2nd year GPA below. Note the much longer left tail of 2nd year GPA which was apparent from the comparison boxplots shown earlier. Also note the extremely low outlier for 2nd year GPA.

[Figure: Comparison Dotplots: High School GPA versus 2nd Year College GPA; HS GPA and GPA2 on a common horizontal axis from 0.70 to 4.20]

1.4.5 Symmetry and Skewness

The shape of the data is often described by its symmetry or nonsymmetry (also called skewness). Here are stem-and-leaf plots for symmetric, right-skewed, and left-skewed data.


SYMMETRIC DATA      RIGHT-SKEWED DATA      LEFT-SKEWED DATA
 4 | 7               4 | 58                 4 | 0
 5 | 35              5 | 011245569          5 | 5
 6 | 005             6 | 2445               6 | 2
 7 | 11200           7 | 17                 7 |
 8 | 846799          8 |                    8 | 17
 9 | 0199            9 | 2                  9 | 2445
10 | 41             10 | 5                 10 | 011245569
11 | 1              11 | 0                 11 | 58

The term ‘skew’ refers to the direction of the longer tail if you stand the stem-and-leaf upright, or draw a histogram. Here is the histogram of the left-skewed stem plot above.

Histogram of left-skewed data

The long tail, of course, denotes outlying or extreme observations. The left-skewed histogram above contains outlying small observations, while a right-skewed histogram would have outlying large observations. A left skew also indicates that the smallest quarter of observations will be more spread out. See the boxplot below.

Boxplot of left-skewed data
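As a rough numerical check of the claim about the smallest quarter, the sketch below reads the 20 values off the left-skewed stem-and-leaf plot, assuming the stem is the tens digit and the leaf is the units digit (so 4 | 0 is read as 40). It then compares the distance from the minimum to Q1 with the distance from Q3 to the maximum, using the class quartile rule with n = 20.

```python
# Values read from the left-skewed stem-and-leaf plot
# (assumption: stem = tens, leaf = units).
left_skewed = [
    40, 55, 62, 81, 87,
    92, 94, 94, 95,
    100, 101, 101, 102, 104, 105, 105, 106, 109,
    115, 118,
]

xs = sorted(left_skewed)
n = len(xs)  # n = 20, so .25(n+1) = 5.25 and .75(n+1) = 15.75

q1 = (xs[4] + xs[5]) / 2    # average of the 5th and 6th ordered observations
q3 = (xs[14] + xs[15]) / 2  # average of the 15th and 16th ordered observations

lower_spread = q1 - xs[0]   # spread of the smallest quarter
upper_spread = xs[-1] - q3  # spread of the largest quarter

print(q1, q3)                      # 89.5 105.0
print(lower_spread, upper_spread)  # 49.5 13.0
```

The smallest quarter covers a much wider range than the largest quarter, which is what the long left whisker of the boxplot shows.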


1.5 Using the TI-84:

1.5.1 Entering Data into a List

1. [STAT]
2. [EDIT] should already be highlighted. Press [ENTER].
3. The first row of list [L1] should already be highlighted. (To erase data already stored in [L1], arrow up to the header and press [CLEAR] [ENTER]. You may also arrow right to other lists [L2]-[L6].)
4. Input the data, pressing [ENTER] after each observation.
5. [2nd][QUIT]

1.5.2 Calculating the five-number-summary

1. [STAT]
2. Arrow right to [CALC].
3. [1-Var Stats] should already be highlighted. Press [ENTER].
4. Input [2nd] [L1] (or whatever list contains the data).
5. Press [ENTER]. Then scroll down to minX, Q1, Med, Q3, and maxX.

1.5.3 Drawing a Boxplot or Histogram

1. [2nd][STAT PLOT] [ENTER]
2. [Plot1] should already be highlighted. Arrow down to [On] and press [ENTER].
3. Arrow down and right to the histogram (top right) or boxplot (bottom middle). Press [ENTER].
4. Arrow down to Xlist and input [2nd] [L1] (or whatever list contains the data). Leave [Freq] at 1.
5. Press [GRAPH].
6. If your graph does not show or looks too small, press [ZOOM], arrow down to [ZoomStat], then [ENTER].
7. You may get summary statistics from the plot by pressing [TRACE], then right or left arrow.

1.6 Using Excel

1.6.1 Creating and saving an Excel format data file

1. Open up Microsoft Excel. This loads the Excel program and causes an Excel application window to appear. If prompted, choose to open a Blank Workbook.

2. Type in your data (see figure below)


3. Click the Office Button (in the top left) and choose ’Save As’.

4. Choose the folder where you want to save your file. You may have to navigate to the proper folder / flash drive using the ’Save in’ bar along the top. Once you have selected the folder, type in the desired file name.

5. Underneath ’File name’ you will see ’Save as type’. It should default to ’Excel Workbook’, which will be the Excel 2007 format. If you wish to save the file in an alternate format, you can choose that from this menu (you can also choose alternate file types by hovering over the right arrow next to ’Save As’ from step 3).

1.6.2 Creating an Excel data file from a text file (separated by spaces, commas, or tabs)

1. Open up Microsoft Excel.

2. Click the Office Button and select ’Open’.

3. Select the folder where the file you wish to open is saved using the displayed folders or the ’Look in’ bar along the top.

4. On the ’Files of type’ drop-down list, choose ’Text files’.


5. Select the file you wish to open. If it does not appear, verify that you are in the correct folder.

6. Click ’Open’. An import wizard will appear; choose ’Delimited’ and click ’Next’.

7. Choose the appropriate delimiter(s). The delimiter is the character at which Excel will split each row into columns. For example, if you choose ’comma delimited’ and your text file reads A,B,C you will end up with three columns in Excel. The ’Data preview’ at the bottom of the window will display how your results will look in Excel with the currently selected delimiter(s).

8. When you are satisfied that you have selected the correct delimiter(s), select ’Next’.

9. If any columns require special formatting, specify it here (for Stat 2160 this will not be necessary). When you are finished, click ’Finish’.
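To see what choosing a delimiter does, here is a small Python sketch (not part of Excel) that splits the same kind of text into columns. The helper name `split_rows` and the sample text are made up for illustration; the splitting itself uses Python's standard `csv` module.

```python
import csv
import io

def split_rows(text, delimiter):
    """Split each line of text into columns at the given delimiter."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))

comma_text = "A,B,C\n1,2,3\n"
tab_text = "A\tB\tC\n1\t2\t3\n"

print(split_rows(comma_text, ","))   # [['A', 'B', 'C'], ['1', '2', '3']]
print(split_rows(tab_text, "\t"))    # [['A', 'B', 'C'], ['1', '2', '3']]

# With the wrong delimiter, each row stays in a single column.
print(split_rows(comma_text, "\t"))  # [['A,B,C'], ['1,2,3']]
```

The last line mirrors what the ’Data preview’ would show if the selected delimiter does not match the file: everything stays in one column.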

1.6.3 Retrieving a previously saved Excel data file

1. Open up Microsoft Excel.

2. Click the Office Button and select ’Open’.

3. Select the folder where the file you wish to open is saved using the displayed folders or the ’Look in’ bar along the top.

4. The ’Files of type’ bar along the bottom should read ’All Excel Files’. If it does not, change it to this.

5. Select the file you wish to open. If your file does not appear, verify that you are in the correct folder.

6. Click ’OK’.

1.6.4 How to get summary statistics

1. Open the Microsoft Excel data file. If your data is still in a text file, refer to the instructions on how to create an Excel file from a text file (Section 1.6.2).

2. Click the ’Data’ tab along the top of the Excel window and click ’Data Analysis’ at the far right. If ’Data Analysis’ does not appear, you have to install the Analysis ToolPak.

(a) Click the Office Button and select ’Excel Options’ from along the bottom.

(b) From the list along the left choose ’Add-ins’ and click ’Go’ from the bottom.

(c) Check the box next to ’Analysis ToolPak’ (you do not need the ’Analysis ToolPak - VBA’).


(d) Click ’OK’. Excel will install the Analysis ToolPak and an option ’Data Analysis’ should appear to the right under the ’Data’ tab.

3. Choose ’Descriptive Statistics’ and click ’OK’. A new window will appear.

4. Click the text box for the ’Input Range’ and then highlight the cells containing your data. If the first row of the cells you highlighted contains variable names, check ’Labels in First Row’.

5. In the output options, check ’Summary statistics’ and click ’OK’.

1.6.5 How to create a histogram

1. Open the Microsoft Excel data file. If your data is still in a text file, refer to the instructions on how to create an Excel file from a text file (Section 1.6.2).

2. Click on the ’Data’ tab along the top and select ’Data Analysis’. Choose ’Histogram’ and click ’OK’.

3. Click the text box for the ’Input Range’ and then highlight the cells containing your data.

4. If there is a bin range specified, select the appropriate cells; otherwise leave the bin range blank. If the first row of your data contains variable names, check the ’Labels’ box.

5. In the output options, check ’Chart Output’ and then click ’OK’.

6. To remove the gaps in the histogram, click on the histogram and then on the bars so that the bars are selected. Then right-click on the bars and choose ’Format Data Series’. Under ’Series Options’, set the gap width to 0 using the slider bar or the text field below. Click ’OK’.


1.7 Exercises

1. Consider the following eight variables in the Graduation Rate data of Section 1.3. Determine whether each variable is categorical or numerical. If the variable is categorical, state whether it is nominal or ordinal. If the variable is numerical, state whether it is an interval or ratio measurement.

(a) Student

(b) Major

(c) Sex

(d) Age upon entering the program

(e) High school GPA

(f) Did the student live in a dormitory during his/her first year in college?

(g) College GPA after 2 years of college

(h) Did the student graduate within 6 years of starting college?

2. Identify the following types of data as numerical or categorical. If numerical, further classify into interval or ratio.

(a) The scores on exam one for Stat 216 students this semester

(b) Marital status

(c) Annual income

(d) Social Security Number

(e) Cumulative GPA

3. The following data present the number of cases of tomato soup sold per week at a local supermarket over a period of 28 weeks:

Week  Sold    Week  Sold    Week  Sold
   1    81      11    86      21    91
   2    61      12   133      22    99
   3    77      13    91      23    89
   4    71      14   111      24    96
   5    69      15    86      25   108
   6    81      16    84      26    86
   7    66      17   131      27    84
   8   111      18    71      28    76
   9    56      19   118
  10    81      20    88


(a) Using the 28 values for the number of cases sold, construct a stem-and-leaf display of the number of cases of tomato soup sold per week. What values did you choose for your stems? How many stems do you have? Would the display look better if you had more stems? Fewer stems? How would you achieve this?

(b) From the stem-and-leaf display, how many cases of tomato soup are most likely to be sold in any given week?

(c) Construct a histogram of the number of cases of tomato soup sold per week. How many intervals would you choose? Write down the class boundaries for your chosen intervals.

4. Using the Class Data of Section 1.1, consider the variable LEVEL.

(a) Construct a bar chart for LEVEL.

(b) Construct a pie chart for LEVEL.

5. Consider the variable GPA in the Class Data of Section 1.1.

(a) Compute the five-number summary.

(b) Using the values from part (a), construct the box plot for GPA by hand.

(c) Enter the data into your TI-84 and obtain a boxplot.

6. The 6-year graduation rate for a random sample of 30 colleges and universities in the US is displayed in the following stem-and-leaf plot. Note that the stem unit is 10%, and the leaf unit is 1%. For example, the maximum value is 92%.

2 |
3 | 4
4 | 5 6 8 8 9
5 | 0 1 1 1 2 3 4 5 5 6 7 8 8 9
6 | 0 1 2 4 4 5 7
7 | 1 7
8 |
9 | 2

(a) Obtain the five-number summary using 1-Var Stats on your TI-84.

(b) Obtain a boxplot and histogram for 6-year graduation rate.

(c) In your opinion, which of the three plots (stem-and-leaf, boxplot, histogram) best illustrates the data? Why?

7. The following plots represent five different samples of data. For each, describe the shape.


(a)

(b)

(c)

 4 | 5 8
 5 | 0 1 1 2 4 5 5 6 9
 6 | 2 4 4 5
 7 | 1 7
 8 |
 9 | 2
10 | 5
11 | 0

(d)

(e)

0 | 4
1 | 8 8 9
2 | 1 1 1 4 6 7
3 | 4 5 7
4 | 7

8. Consider the following (partial) Excel worksheet. Which of the columns contain categorical data?


9. A manager of a car rental company took a random sample of 100 business days over the last fiscal year and recorded the number of cars rented per day. The frequency distribution for the data is given below.

Interval    Frequency    Relative Freq.
(20, 25]         4
(25, 30]        11
(30, 35]        23
(35, 40]        31
(40, 45]        15
(45, 50]        10
(50, 55]         6

(a) Fill in the Relative Frequencies above.

(b) Draw a histogram by hand, with either the frequencies or relative frequencies as the vertical axis.

(c) What interval does the median number of car rentals per day fall into?

(d) What percentage of business days had 30 or fewer car rentals?

(e) How many business days had more than 45 car rentals?


(f) Give an approximate value for the number of car rentals per day that would be the 95th percentile.


Chapter 2

Samples and Populations


2.1 Sample Surveys

Before making trend decisions, launching a new product, or starting a new business, you need information to minimize your risk and enhance your chance for success. Knowledge allows you to evaluate the potential success of long-term decisions or strategies, and evaluate trends and the future needs of your industry.

Knowledge often comes in the form of data. Sometimes, existing data is sufficient, but often data needs to be gathered. Knowledge of sample survey techniques increases your likelihood of getting accurate, usable information with the least expense of time and money. In this chapter, we discuss important issues about survey sampling for the purpose of collecting information.

We start with some terminology. Data is often collected on a random sample of subjects from a target population. Consider trying to estimate the percentage of WMU undergraduates whose permanent residence is in Southwest Michigan (either Allegan, Barry, Berrien, Branch, Calhoun, Cass, Kalamazoo, Kent, Ottawa, St. Joseph, or Van Buren counties). Suppose that we want to select approximately n = 200 students. There are several sampling methods available.

2.2 Methods of Sample Selection

A simple random sample is similar to putting the names of all students in a hat, and then drawing 200 tickets without replacement. A more practical procedure is to create a frame, or a list, of all WMU students and have a computer randomly select 200 names from the frame. On a smaller scale, one may use a table of random numbers to select subjects from the frame. The method of simple random sampling has ideal statistical properties in the sense that each element in the frame has an equal chance of being selected. However, many target populations have no convenient frames available. Examples are the populations of Kalamazoo residents, customers of Meijer Stores, potential credit card customers, and US registered voters. In these cases, more convenient frames that simply approximate the target population may be used, like phone books and mailing lists.

A k-in-1 systematic sample selects every kth subject on a list or a sequence. For example, every 10th customer entering the store may be approached for a survey, or every 25th number in the phone book may be called.

A stratified random sample partitions the population into subsets called strata, and selects a simple random sample from each stratum. In the survey of WMU undergraduates, we may choose to draw 50 students from each of the freshman, sophomore, junior, and senior strata, instead of 200 students from the combined student population.

A cluster sample would randomly select, say, 5 classes of 40 students each. Each class is called a cluster of students. Cluster sampling works best when heterogeneous clusters are available, i.e. each cluster is approximately representative of the population. This means that classes need to contain freshmen and seniors, men and women, etc. Therefore, it seems better to limit the frame to general education classes, because classes for specific majors tend to be less heterogeneous.
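The four selection methods above can be sketched with Python's standard `random` module. Everything about the frame here is hypothetical: 2,000 made-up student ID numbers, class levels assigned artificially, and 50 made-up classes of 40 students. The random starting point in the systematic sample is also an added assumption; the text only says to take every kth subject.

```python
import random

rng = random.Random(2160)  # fixed seed so the sketch is repeatable

# Hypothetical frame: 2,000 student ID numbers.
frame = list(range(1, 2001))

# Simple random sample: 200 names drawn without replacement.
srs = rng.sample(frame, 200)

# 10-in-1 systematic sample: every 10th student on the list,
# starting from a random position among the first 10 (an assumption).
k = 10
start = rng.randrange(k)
systematic = frame[start::k]

# Stratified random sample: 50 students from each class level.
# Levels are assigned artificially here, just to have four strata.
levels = ["freshman", "sophomore", "junior", "senior"]
strata = {lvl: [s for s in frame if s % 4 == i] for i, lvl in enumerate(levels)}
stratified = {lvl: rng.sample(members, 50) for lvl, members in strata.items()}

# Cluster sample: randomly select 5 whole classes of 40 students each.
classes = {c: frame[c * 40:(c + 1) * 40] for c in range(50)}
chosen_classes = rng.sample(sorted(classes), 5)
cluster = [s for c in chosen_classes for s in classes[c]]

print(len(srs), len(systematic), len(cluster))     # 200 200 200
print({lvl: len(v) for lvl, v in stratified.items()})
```

Each method ends up with 200 students, but the randomness enters at a different point: individual students, the starting position, students within each stratum, or whole classes.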


The sampling designs discussed above are examples of probability samples, because their use of random selection (of clusters, or within strata) gives them desirable statistical properties. Informal and less scientific surveys may conduct nonprobability samples, like randomly approaching people at the mall or WMU students at the Bernhard Center. Nonprobability samples tend to be less representative of the target population because interviewers tend to choose more friendly-looking people, or more attractive people. Furthermore, the shopping mall sample would tend to exclude segments of the population who don’t shop often. Approaching students at the Bernhard Center may tend to exclude dormitory residents who have meal plans.

When populations are small, data may be collected on the whole population instead of a sample. This is called a census survey instead of a sample survey.

2.3 Types of Surveys

Three frequently used types of surveys are face-to-face interview, telephone interview, and written questionnaire. They vary in terms of cost per respondent and types of questions you can ask.

Face-to-face interviews are the most costly in terms of time and money. However, the interviewers can explain complex issues and explore unanticipated responses. Consider the question “Do you have any suggestions for university administrators about how to improve the college experience of WMU students?” A response like “Improve dormitory facilities” may be too vague. The interviewer can then explore the issue further: “Do you mean better rooms, better study facilities, more computers, better food?” Another advantage is the interviewers’ ability to personally observe the respondents and the surrounding environment. This is particularly useful for, say, surveys about work conditions or workplace interaction.

Phone interviews are cheaper and quicker than face-to-face interviews. They also allow the interviewer to explain questions and react to responses. However, personal observation is not possible. Furthermore, some respondents may feel that the survey call, like a telemarketer’s call, is an invasion of personal time. This would inhibit honest, thoughtful answers.

A written questionnaire is the cheapest way to reach a lot of people. It also allows respondents to remain anonymous, probably the questionnaire’s strongest advantage. However, the response rate may be low (“another dumb survey”) and this can lead to biased results. Furthermore, poorly constructed questions can lead to inaccurate results.

Table 2.1 breaks each of the 3 categories above into smaller categories, and lists the primary advantages and disadvantages. Feel free to add more categories.

2.4 Some Issues in Question Construction

Question construction is pivotal to the success of a survey. This is particularly true of written questionnaires, where respondents cannot ask for clarification of the question.


Table 2.1: Types of Surveys

Type of Survey           Advantages                             Disadvantages

Face-to-face interview   Explain questions, explore issues,     Expensive, need
                         make observations, use visual aids     interviewer training
 - at home or work       Accuracy, better sampling              Expensive
 - in public places      Cheaper, more people in less time      Less representative sample

Telephone survey         Accurate, cheap                        No personal observations

Written questionnaire    Cheapest per respondent                Bias from low response rate
 - by mail               Allows anonymity                       Slow
 - by e-mail             Cheaper, quicker results               Less representative sample
 - web survey            Quicker data processing                Need computing expertise

Can the question be misunderstood? The question “What kind of car do you drive?” may be asking for the type of car (coupe, sedan, SUV, pick-up truck?) or for the model (Ford Taurus, Pontiac Firebird, Honda Civic?).

Do any terms need to be defined? Consider the more specific question “Do you drive a domestic or foreign car?” What do you mean by domestic or foreign? Is a Tennessee-made Toyota foreign? What about a Mercedes Benz, now co-owned by Chrysler?

What type of information do you want to gather: knowledge or attitude? Consider the following questions: “Are your class sizes satisfactory?” or “Do you feel your class sizes are satisfactory?” or “Are you personally satisfied with your class sizes?” All three questions ask for an opinion, but the first seems more objective while the last one is most personal.

Are the questions loaded? Some questions lead the respondent to an almost predetermined response, especially if the respondent does not have a strong opinion on the matter. Compare “Are your class sizes satisfactory?” with the more neutral “Are your class sizes satisfactory or not satisfactory?” A more loaded question would be “Do you agree that some class sizes are too large?”

Do the questions ask for sensitive information? When asked “Did you vote during the last Presidential election?” some respondents may not feel comfortable admitting they did not. A better question may be: “Did anything come up that prevented you from voting during the last Presidential election?”

2.5 Types of Survey Errors

Coverage errors occur when the sampling frame excludes some segments of the target population. Phone books are very convenient, but unlisted households are excluded. Election polls sample from the frame of registered voters, but the target population is the subset who are going to vote.

Nonresponse errors can cause serious bias in survey results. Phone interviewers find it easiest to reach families with young children, and hardest to reach young singles. Mail surveys tend to be answered by those who feel strongly about an issue, or by those who feel more civic responsibility, neither of which is a representative cross-section of the population.

Measurement errors occur when respondents answer ‘inaccurately’ because of question wording, question ordering, interviewer effect, or other external influences. For example, the answer to “Do you approve of affirmative action?” may be influenced by the gender of the interviewer. Also, the answer to “Do you like living in Davis Hall?” may be influenced by preceding it with the question “Do you have enough parking spots in Davis Hall?”

2.6 A Questionnaire Example

The intersection of Howard Street and West Michigan is a hard place for pedestrians to cross. Suppose that University planners are thinking about some changes in this area. Two options are:

A. Construct a pedestrian overpass or tunnel,
B. Reduce pedestrian crossing by moving the commercial establishments to the campus side (east side) of Howard Street.

Option B is more expensive than option A, but may be worth the investment if it creates a more appealing westgate approach to the University.

Table 2.2 shows a potential questionnaire that may be used for data-gathering. Which questions would you change, remove, add? How would you select the respondents?


Table 2.2: Campus Planning Survey

Please take a few minutes to answer the following questions. Your honest opinions will be very helpful in our mission to improve the University experience of WMU students.

1. How frequently do you cross the intersection of Howard and West Michigan on foot? This is the corner where Video Hits Plus is located. (Check One)
   ○ 8 or more times a week    ○ Occasionally
   ○ 3-7 times a week          ○ Never

2. How frequently do you cross only for a snack or a meal?
   ○ 8 or more times a week    ○ Occasionally
   ○ 3-7 times a week          ○ Never

3. If a footbridge were built to cross Howard Street, would you use it?
   ○ 8 or more times a week    ○ Occasionally
   ○ 3-7 times a week          ○ Never

4. Where would you put the footbridge?
   On the Jimmy Johns side (south)
   On the Video Hits Plus side (north)

5. How frequently do you drive off campus only for a snack or a meal?
   ○ 8 or more times a week    ○ Occasionally
   ○ 3-7 times a week          ○ Never

6. If the commercial establishments on the Video Hits side of Howard were located on the campus side, how frequently would you visit them for a snack or a meal?
   ○ 8 or more times a week    ○ Occasionally
   ○ 3-7 times a week          ○ Never

7. Where would you put the new location? (See map)
   ○ On Site A (parking lot facing the University Bookstore across W. Michigan)
   ○ On Site B (parking lot between the Haenicke Science Building and Rankin Street)

8. Potential students often get their first view of the University along this drive on W. Michigan. Do you think the new location would change visitors’ first impression of WMU?
   ○ Yes, for the better
   ○ Yes, for the worse
   ○ No


2.7 Exercises

1. Northwest Airlines plans to make changes in its frequent-flier Worldperks program. One of them is to discontinue discounts offered to passengers who choose to redeem award miles during off-peak periods. Before doing so, Northwest wants to verify whether this policy will discourage customers from flying during off-peak periods. The company made a list of all its flights on one day during an off-peak period, and randomly selected 25 flights. All of the Worldperks members on those flights were then asked whether they would change their travel dates if the discount were discontinued.

(a) State the target population.

(b) State the frame and the sampling method used. Is the frame the same as the target population? Do you think this will give a representative sample from the target population? How would you have selected a better sample?

(c) Construct a short questionnaire that will help Northwest assess the impact of the policy in discouraging customers from flying during off-peak periods.

2. The patient representative of Bronson Methodist Hospital wanted to make sure that the hospital is providing good service to its patients. He obtained a list of all people who have used the hospital’s services for the past year and conducted a phone interview with every 8th person in the list.

(a) State the target population.

(b) State the sampling method used. Do you think this will give a representative sample? How would you have selected a better sample?

(c) Construct a short questionnaire that you think would provide useful information about hospital services.

3. Wendy’s wants to find out whether moving from their current location on West Michigan to Bernhard Center would capture more of the student market. They plan to conduct an opinion survey of 50 students within the premises of the building.

(a) State the target population.

(b) How would you select the 50 students? Comment on any possible nonrepresentativeness in your sample. (For example: Would your selection method tend to include too many/few women from among the target population? How about undergraduates? Evening students? Foreign students? Faculty?)

(c) Construct a short questionnaire that would help Wendy’s make a decision.


4. Western Herald takes an interest in the opinion of WMU students regarding the 2002 ENRON controversy. From each college, 10 students were asked whether campaign contributions were a factor in the Enron failure.

(a) Describe the target population.

(b) State the sampling method used. How would you select the 10 students from each college? Do you think this will give a representative sample?

(c) Construct a short questionnaire for the Herald.

5. Gartner Research Company wants to compare the consumer satisfaction ratings of AOL (America Online) as an internet service provider (ISP) and as an e-mail provider. From a list of online consumers, the company randomly chose 6,700 people to participate in the survey.

(a) Describe the target population.

(b) State the sampling method used. Do you think this will give a representative sample?

(c) Construct a short questionnaire for the purpose.

6. In the year 2002, US News Magazine printed the following survey question:

Should the government ban human cloning?

Yes: 1-900-193-0025
No: 1-900-193-0026
*50 cents per call

Would you consider the results of this opinion poll reliable? Explain.

7. Radio stations often do audience research in order to estimate the number and type of listeners (for advertising purposes) and discover their preferences (for programming purposes). Design (i) a sampling plan, and (ii) a 10-15 item questionnaire to answer the following questions:

(a) How large is the audience? (There are two measures here: the “average” size for any point in time, and the “reach” or maximum number of people reached for a period of, say, 1 week.)

(b) What are the age, gender, education, and income profile of your audience?

(c) What types of radio programs interest your audience most and least?


Chapter 3

Location and Spread


3.1 Measures of Location

Table 3.1 gives monthly stock prices of three local companies for the years 2000 and 2001. The former Pharmacia Corporation (now Pfizer) has three missing values because the company did not exist before the merger of Pharmacia-Upjohn and Monsanto in March 2000.

Table 3.1: Monthly Stock Prices

OBS DATE KELLOGG PHARMACIA STRYKER

1 1/01/2000 24.25 . 31.63

2 2/01/2000 25.31 . 29.25

3 3/01/2000 25.75 . 34.88

4 4/01/2000 24.25 49.93 35.97

5 5/01/2000 30.37 51.93 37.81

6 6/01/2000 29.75 51.68 43.75

7 7/01/2000 25.93 54.75 42.93

8 8/01/2000 23.18 58.56 44.81

9 9/01/2000 24.18 60.18 42.87

10 10/01/2000 25.37 55.00 47.12

11 11/01/2000 24.62 61.00 53.31

12 12/01/2000 26.25 61.00 50.59

13 1/01/2001 26.20 56.02 45.20

14 2/01/2001 26.59 51.70 56.15

15 3/01/2001 27.03 50.37 52.25

16 4/01/2001 25.50 52.26 59.29

17 5/01/2001 26.72 48.56 57.45

18 6/01/2001 29.00 45.95 54.85

19 7/01/2001 30.07 44.62 59.98

20 8/01/2001 31.99 39.60 54.83

21 9/01/2001 30.00 40.56 52.90

22 10/01/2001 30.50 40.52 56.24

23 11/01/2001 29.49 44.40 54.91

24 12/01/2001 30.10 42.65 58.37

Figure 3.1: Dotplot of Monthly Stock Prices

[Figure: stacked dotplots of KELLOGG, PHARMACIA, and STRYKER monthly stock prices on a common horizontal axis running from 20 to 70.]

Examine the dotplot of monthly stock prices in Figure 3.1. The plot for Kellogg Co. is ‘located’ farther to the left (i.e. the values are smaller). Pharmacia and Stryker seem to be located at approximately the same place on the line, but Stryker has more ‘spread’ or ‘variability’. Now, suppose some friends bought shares of Kellogg stock last year. How much do you think they paid per share? A good answer would be “Around $27, give or take $3”. Would they have paid the same price for Pharmacia shares? No, Pharmacia stocks tended to fall “around $50, give or take $7”. More precisely, here are the average and standard deviation of stock prices for the three companies:

Company Average ± SD

KELLOGG 27.18 ± 2.54

PHARMACIA 50.54 ± 6.82

STRYKER 48.22 ± 9.20

3.2 The Average and SD

Let \(X_1, X_2, \ldots, X_n\) denote the data values. The average or arithmetic mean (denoted \(\bar{X}\) in formulas) is computed as

\[ \bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n}. \tag{3.1} \]

The average, which necessarily falls somewhere in the middle of the data points, is a commonly used statistic to indicate the center or location of the data. But the data points are all spread out, so most points tend to miss the average by some amount. In the Pharmacia example, since none of the points equal the average $50.54, every point in the data misses (or deviates from) the average by some amount. What is the average size of the deviation? To a large extent, this is what we call the standard deviation or SD (which will be denoted by \(S\) in formulas). Let \(|X_i - \bar{X}|\) denote the deviations from average (ignoring negative signs). Note that

\[ \operatorname{ave}\{|X_i - \bar{X}|\} \doteq \sqrt{\operatorname{ave}\{|X_i - \bar{X}|^2\}} \]

or in formulas,

\[ \frac{\sum |X_i - \bar{X}|}{n} \doteq \sqrt{\frac{\sum |X_i - \bar{X}|^2}{n}} \]

The SD is the second formula with the denominator n replaced by n − 1,

\[ S = \sqrt{\frac{\sum |X_i - \bar{X}|^2}{n - 1}} \tag{3.2} \]

Why do we use the square root of squares instead of just the average of absolute values? Why do we replace n by n − 1 in the denominator? The long answers are mathematically complicated; a short answer is “because both adjustments give the statistic better mathematical properties”.

Here are the monthly Pharmacia stock prices, sorted from smallest to largest.

39.60 40.52 40.56 42.65 44.40 44.62 45.95 48.56

49.93 50.37 51.68 51.70 51.93 52.26 54.75 55.00

56.02 58.56 60.18 61.00 61.00

Now, compute the deviations from average \(X_i - 50.54\).

-10.94 -10.02 -9.98 -7.89 -6.14 -5.92 -4.59 -1.98

-0.61 -0.17 1.14 1.16 1.39 1.72 4.21 4.46

5.48 8.02 9.64 10.46 10.46

The average size of the deviations (ignoring negative signs) is

(10.94 + 10.02 + 9.98 + · · · + 10.46)/21 = 5.54.

The square root of the average of squares is

\[ \sqrt{(10.94^2 + 10.02^2 + 9.98^2 + \cdots + 10.46^2)/21} = 6.66. \]

The standard deviation is

\[ \sqrt{(10.94^2 + 10.02^2 + 9.98^2 + \cdots + 10.46^2)/20} = 6.82. \]
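The three computations above can be reproduced with a short script. This is a minimal sketch in Python (the text itself uses the TI-84); the variable names are illustrative, and the deviations are taken from the unrounded average rather than from 50.54, so agreement with the text is to two decimals.

```python
import math
import statistics

# The 21 monthly Pharmacia prices from the sorted list in the text.
prices = [39.60, 40.52, 40.56, 42.65, 44.40, 44.62, 45.95, 48.56,
          49.93, 50.37, 51.68, 51.70, 51.93, 52.26, 54.75, 55.00,
          56.02, 58.56, 60.18, 61.00, 61.00]

n = len(prices)
average = sum(prices) / n
deviations = [x - average for x in prices]

# Average size of the deviations, ignoring negative signs.
mean_abs_dev = sum(abs(d) for d in deviations) / n
# Square root of the average of squared deviations (denominator n).
root_mean_sq = math.sqrt(sum(d * d for d in deviations) / n)
# Standard deviation as in Equation (3.2) (denominator n - 1).
sd = math.sqrt(sum(d * d for d in deviations) / (n - 1))

print(round(average, 2), round(mean_abs_dev, 2),
      round(root_mean_sq, 2), round(sd, 2))
```

The standard library offers the same two spreads directly: `statistics.pstdev` divides by n and `statistics.stdev` divides by n − 1.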

3.2.1 Interpreting the SD

Here is a closer look at the dotplot of Pharmacia stock prices.

[Dotplot of the 21 PHARMACIA prices on an axis labeled 40.0, 45.0, 50.0, 55.0, 60.0.]

The average is approximately $50. Most individual stock prices are different from $50; the more extreme points (on the outer edges of the plot) deviate by approximately $10, the points closer to center deviate by $2 or $3. The average deviation seems to be around $6 or $7. In fact, the SD is $6.82.

3.2.2 The empirical rule for the SD

Another interpretation of the SD involves the empirical rule for the SD, which we state as follows.

The Empirical Rule (for data sets whose histograms are bell-shaped):
Approximately 68% of the observations fall within 1 SD of the average.
Approximately 95% of the observations fall within 2 SD’s of the average.

When we say that Pharmacia stock prices averaged $50.54 the past two years give or take $6.82, we may expect approximately 68% of the observations (14 out of 21) to fall within $50.54 ± $6.82. In fact, 13 out of 21 observations do. We expect approximately 95% of the observations (20 out of 21) to fall within $50.54 ± 2($6.82). In fact, all 21 observations do.
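The counts quoted for the Pharmacia data can be checked directly. The sketch below is in Python and uses a hypothetical helper `count_within`; it simply counts how many prices lie within k SD's of the average.

```python
import statistics

# The 21 monthly Pharmacia prices from Section 3.2.
prices = [39.60, 40.52, 40.56, 42.65, 44.40, 44.62, 45.95, 48.56,
          49.93, 50.37, 51.68, 51.70, 51.93, 52.26, 54.75, 55.00,
          56.02, 58.56, 60.18, 61.00, 61.00]

average = statistics.mean(prices)
sd = statistics.stdev(prices)  # n - 1 denominator, as in Equation (3.2)


def count_within(data, center, spread, k):
    """Count the observations that fall within k SD's of the center."""
    return sum(1 for x in data if abs(x - center) <= k * spread)


within_1 = count_within(prices, average, sd, 1)
within_2 = count_within(prices, average, sd, 2)
print(within_1, "of", len(prices), "within 1 SD")
print(within_2, "of", len(prices), "within 2 SD's")
```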

The empirical rule is a good way to come up with an “educated guess” of the average and SD of a process. For example, suppose you were asked about the price of gas in Kalamazoo. It is easy to come up with a “typical” gas price, say, around $1.69 during the academic year 2002-2003. It is harder to come up with a measure of variability: $1.69 plus or minus ____? However, if you can come up with some estimate of an interval within which gas prices fell most of the time, you can deduce a standard deviation from this. Suppose that approximately 19 times out of 20 (or 95% of the time), gas prices during the year fell between $1.39 and $1.79. By the empirical rule this interval reaches 2 SD’s on either side of the average, so it implies an average of $1.59 (the midpoint) with an SD of 10 cents (one quarter of the 40-cent width).
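The “educated guess” reasoning can be written as a two-line calculation. This is a sketch in Python with a hypothetical function name; it assumes a bell-shaped histogram, so that a 95% interval is the average plus or minus 2 SD's.

```python
def guess_from_95_interval(low, high):
    """Back out (average, SD) from an interval holding about 95% of the data.

    Assumes the empirical rule applies: the interval is average +/- 2 SD's,
    so the average is the midpoint and the SD is one quarter of the width.
    """
    center = (low + high) / 2
    sd = (high - low) / 4
    return center, sd


center, sd = guess_from_95_interval(1.39, 1.79)
print(round(center, 2), round(sd, 2))
```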

3.3 The Variance

The variance (denoted \(S^2\)) is simply the square of the SD.

\[ S^2 = \frac{\sum |X_i - \bar{X}|^2}{n - 1} \tag{3.3} \]

Some statistical computing packages (Excel included) report both the variance and the SD when computing summary statistics. Some textbooks also emphasize the variance along with the SD. In this text, we will always refer to the SD when discussing methodology and formulas.

3.4 Other Measures of Location

The sample median is another popular measure of location besides the average. It is particularly useful for skewed distributions like household income, which occasionally have large outliers. Recall that the sample median is the middle value in an ordered array of the data, hence it is insensitive to extremely large observations. Consider the following data on household incomes:

$40T, $50T, $55T, $65T, $65T, $145T

The average household income is $70T, so 5 out of the 6 households have lower than average income. The median household income is ($55T+$65T)/2=$60T, so half of the sample has income below median, and half have income above median. This is a useful property of the median that the average does not have: it separates the lower half of the data from the upper half. For instance, if your Exam 1 score is higher than the class median, then you know that you did better than at least half of the class. If your Exam 1 score is higher than the class average, it is still possible that your score belonged to the lower half of the class.
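The household income comparison is easy to verify. A minimal sketch in Python, with incomes entered in thousands of dollars (an assumption about what the “T” suffix means):

```python
import statistics

# Household incomes in thousands of dollars.
incomes = [40, 50, 55, 65, 65, 145]

average = statistics.mean(incomes)
median = statistics.median(incomes)  # mean of the two middle values

below_average = sum(1 for x in incomes if x < average)
below_median = sum(1 for x in incomes if x < median)

print(average, median, below_average, below_median)
```

The single large income pulls the average up to 70, while the median stays between the third and fourth households.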

3.5 Other Measures of Spread

The range, calculated as R = Max − Min, is sometimes used as a measure of spread instead of the SD. It does not have mathematical properties like the empirical rule, but it is quick and easy to compute, and easier to understand. When monitoring manufacturing processes (like the amount of fill in a soda bottle), the range is often used to monitor whether the process needs adjustment. A relatively large range means that the filling process is inconsistent.

The coefficient of variation is the SD divided by the average (\(CV = S/\bar{X}\)). Suppose that the daily closing stock price for IBM during the past month averaged $120 with SD $12, while Dell Computers averaged $30 with SD $6. It may seem that the price for IBM is more variable because the SD is larger; however its SD is only 10% of its average stock price (CV = $12/$120 = .10) while Dell’s SD is 20% of its average stock price (CV = $6/$30 = .20). If you want to measure price volatility, do you prefer the SD or the CV?
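As a small illustration, the CV comparison can be computed as follows. The function name is hypothetical, and the inputs are the IBM and Dell figures quoted above.

```python
def coefficient_of_variation(sd, average):
    """CV = SD divided by the average (a unit-free measure of spread)."""
    return sd / average


cv_ibm = coefficient_of_variation(12, 120)
cv_dell = coefficient_of_variation(6, 30)

# IBM has the larger SD, but Dell has the larger CV.
print(round(cv_ibm, 2), round(cv_dell, 2))
```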

3.6 Adding a Constant

The list of numbers: 1, 3, 4, 4 has average equal to 3 and SD equal to 1.414. Suppose a constant, say c=10, is added to each number on the list, thus yielding 11, 13, 14, 14. What is the average and SD of this new list?

Compare the dotplots of the old and new list:

[Stacked dotplots on a common axis from 0 to 20. Old: dots at 1, 3, 4, 4. New: dots at 11, 13, 14, 14.]

The dotplots or histograms look the same, except that the new histogram has been shifted 10 units to the right. Therefore, the new average should be 10 units larger than the old average, but the new SD should be equal to the old SD. This may be verified by actual computations for the list 11, 13, 14, 14: i.e. new average equals 13, new SD equals 1.414.

Subtracting a constant will move the histogram to the left instead of to the right. Therefore the average will also move left, while the SD will remain unchanged.

Adding a positive constant c to a list will add the same constant to the average. The SD remains unchanged. Subtracting a positive constant c from a list will subtract the same constant from the average. The SD remains unchanged.
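The shifting rule can be confirmed numerically on the list 1, 3, 4, 4. A minimal sketch in Python; the helper `summarize` is illustrative only.

```python
import statistics

old = [1, 3, 4, 4]


def summarize(values):
    """Return (average, SD) with the n - 1 denominator for the SD."""
    return statistics.mean(values), statistics.stdev(values)


added = [x + 10 for x in old]       # add c = 10
subtracted = [x - 10 for x in old]  # subtract c = 10

for label, values in (("old", old), ("plus 10", added), ("minus 10", subtracted)):
    average, sd = summarize(values)
    print(label, average, round(sd, 3))
```

Only the average moves; the SD printed on each line is the same.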

3.7 Multiplying by a constant

If we multiply the list 1, 3, 4, 4 by a constant, say 5, how will the average and SD change? The new list is 5, 15, 20, 20. The stacked dotplots follow:

[Stacked dotplots on a common axis from 0 to 20. Old: dots at 1, 3, 4, 4. New: dots at 5, 15, 20, 20.]

This time, the old histogram looks like it has been stretched out (over an interval that is 5 times longer) instead of shifted. Phrased differently, all distances have been multiplied by 5. Therefore, the new average should be 5 times the old average, and the new SD should be 5 times the old SD. It should be easy to verify that the list 5, 15, 20, 20 has average 15 and SD equal to 7.07.

If we multiply a list by a positive constant c, the average and SD will be multiplied by the same constant.

If we multiplied by -5 instead of 5, the new histogram will still be stretched out by the same factor, but reflected over to the other side of zero. Thus, the average will change signs, but the SD remains positive.

If we multiply a list by a negative constant −c, the average will be multiplied by −c, the SD will be multiplied by the positive number c.
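The scaling rules for both a positive and a negative multiplier can be checked the same way. This is a sketch in Python with an illustrative helper name.

```python
import statistics

old = [1, 3, 4, 4]


def scaled_summary(values, c):
    """Multiply every value by c and return the new (average, SD)."""
    new = [c * x for x in values]
    return statistics.mean(new), statistics.stdev(new)


avg_pos, sd_pos = scaled_summary(old, 5)
avg_neg, sd_neg = scaled_summary(old, -5)

print(avg_pos, round(sd_pos, 2))  # average and SD both scale by 5
print(avg_neg, round(sd_neg, 2))  # average changes sign, SD stays positive
```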

3.8 Using the TI-84:

3.8.1 Calculating \(\bar{X}\) and \(S\)

1. [STAT]
2. Arrow right to [CALC].
3. [1-Var Stats] should already be highlighted. Press [ENTER].
4. Input [2nd] [L1] (or wherever your data is stored).
5. [ENTER]. The average and SD are in x̄ and Sx, respectively. The output σx is the SD formula in Equation (3.2) with the denominator n − 1 replaced by n.

3.9 Exercises

Exercises for Sections 3.1- 3.5

1. Consider the following data:

1.2, 2.6, 3.5, 2.2, 1.4, 1.9, 4.4, 1.8, 3.8

Calculate the

(a) average

(b) SD

(c) median

(d) range

2. The typical American generates about 3.8 pounds of solid trash per day with standard deviation of 1.5 pounds. Assuming that the histogram for trash produced per day is approximately bell-shaped, between what two values will 95% of the observations fall?

3. The following data are the monthly rental prices (in dollars) for a sample of 15 unfurnished 2-bedroom apartments off-campus.

675, 550, 650, 750, 575, 600, 560, 775, 850, 725, 625, 700, 740, 1100, 625

(a) Calculate the average, median, SD and range.

(b) Obtain a histogram and boxplot with your TI-84. What is the shape?

(c) Based on the information you’ve gathered so far, how would you answer the question “How much should I expect to pay for an unfurnished 2-bedroom apartment off-campus?”.

(d) If you are to choose between the average and median, which do you think is a better measure of location for this data set? Why?

4. Consider the following histogram. Suppose 95% of the data falls between 12.4 and 46.8. Using the empirical rule, give an estimate of the standard deviation for this sample of data.

5. Consider the following sample of data:

11.2, 28.4, 32.6, 17.0, 13.5, 2.2, 38.0, 41.4, 1.9, 14.4, 21.8, 23.8

(a) Calculate the quartiles.

(b) Calculate the average

(c) Calculate the standard deviation.

Exercises for Sections 3.6- 3.7

6. A random sample of 10 different unfurnished 1-bedroom apartments off-campus yields an average monthly rent of $500 with SD $100. A second sample of 10 2-bedroom apartments yields an average rent of $650 with SD $100.

(a) Suppose that WMU students get a $20 discount on the monthly rent for all 1-bedroom apartments. The discounted monthly rent for the sample of 1-bedroom apartments should yield an average of ______ with SD ______.

(b) Instead of a $20 discount, suppose that WMU students get a 5% discount on the monthly rent for all 1-bedroom apartments. The discounted monthly rent for the sample of 1-bedroom apartments should yield an average of ______ with SD ______.

(c) If you took the 1-bedroom and 2-bedroom samples together, would you guess that the SD of the combined sample is around $100, smaller than $100, or larger than $100?

7. Golden delicious apples were packed in plastic bags for delivery. The filled bags averaged 9.5 pounds in weight, and the SD was .8 pounds.

(a) The bags of apples are sold to a grocery chain, at a price of 30 cents per pound. (The heavier bags will, of course, cost more than the lighter bags.) The price per bag should average ______ dollars with SD ______ dollars. (Hint: Price = (Weight)($.30).)

(b) Suppose that the grocery chain uses their own trucks to pick up the bagged apples from the grower. For the trucking fee, they charge the grower 25 cents per bag. Therefore for each bag they pay a net price of (Weight)($.30)-$.25. The net price per bag should average ______ dollars with SD ______ dollars.


Chapter 4

Binomial and Normal Distributions


4.1 Computing Probabilities Using the Normal Curve

The electronics store TVs and More wants to know how many large screen television sets to order for the next monthly delivery. Past data show that this time of year, they sell an average of 36 large screen television sets per month, with a standard deviation of 8.

Problem 1. If they order 40 sets, what is the probability that they will run out of stock?

Problem 2. Given the cost of running out of stock and the storage cost of keeping too many, the manager decides to order enough TV sets to cover customer demands 90% of the time, i.e. there should be 10% or less likelihood that they run out of stock. How many should they order?

Knowing the average and SD of a process (36 ± 8) gives us some understanding of what to expect. However, the sales situation above requires the computation of probabilities, or chances, as in the “chance of demand exceeding 40”. This is where the normal curve is helpful. If the demand variable histogram is approximately bell-shaped, then the desired probability (area to the right of 40 under histogram) may be approximated using the normal curve (area to the right of 40 under normal curve). See the two graphs in Figure 4.1.

Figure 4.1: Probability of Demand Exceeding 40
[Two panels: the demand histogram (left) and the normal curve (right), each with the value 40 marked on the X axis.]

By “demand following the normal curve”, we mean that in repeated observations past, present and future, the histogram of demand for TV sets will look like the histogram on the left of Figure 4.1. Or that the stem and leaf plot of n = 100 future observations should look like Figure 4.2.

Figure 4.2: Stem-and-leaf of idealized normal curve demand data (mean=36, SD=8)

1 6

1 89

2 012

2 34455

2 66777888

2 9990001111

3 2222333344444

3 5555566666677777

3 8888899990000

4 1111222333

4 44455566

4 77889

5 012

5 34

5 6

Sometimes, we do have data-based evidence that the data histogram or stem-and-leaf will look like the normal curve. Height and GPA, for instance, are known to follow the normal curve quite closely (with documentation in published sources). At other times, we may need to rely on the normality assumption without the hard evidence, because it is often the easiest way to compute, at least approximately, desired probabilities. In any case, probabilities based on the normal curve should be treated as approximations of the true probabilities.

4.1.1 Calculating Areas under the Normal Curve

The TI-84 calculator may be used to compute P(a ≤ X ≤ b), or the area under the normal curve between a and b. You will need to input the two boundary numbers a and b, and the center and spread of the curve (called the mean and SD of the curve, respectively). See Figure 4.3.

Figure 4.3: Area under normal curve between a and b [P(a ≤ X ≤ b)]
[Normal curve with the mean, a and b marked on the X axis.]

Using the TI-84: Calculating P(a ≤ X ≤ b):
1. [2nd][DISTR]
2. Arrow down to [NORMALCDF]
3. Input (a, b, mean, SD). Don’t forget to type a comma between entries, and end with a parenthesis.
4. [ENTER]

Example: Returning to our sales problem, the mean is 36 and the SD is 8. We need the area to the right of 40, or the area under the curve between a = 40 and b = ∞. (In place of b = ∞, you may type any number that is 5 or more SD’s above the mean, since there is negligible area beyond 5 SD’s on either side.) Now, 5 SD’s above the mean is 36+5(8)=76, so using a convenient value like b = 9999 is appropriate. Typing the inputs (40, 9999, 36, 8) to [NORMALCDF] will give you the desired area P(a ≤ X ≤ b) = .3085, or approximately 31%.
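For readers without a TI-84, the same area can be obtained in Python. This sketch uses `statistics.NormalDist` from the standard library in place of the calculator's normal-curve function; `prob_between` is an illustrative helper, not part of any library.

```python
from statistics import NormalDist

# Demand curve for TV sales: mean 36, SD 8.
demand = NormalDist(mu=36, sigma=8)


def prob_between(dist, a, b):
    """Area under the normal curve between a and b, i.e. P(a <= X <= b)."""
    return dist.cdf(b) - dist.cdf(a)


# Same inputs as the calculator example: a = 40, b = 9999.
p_run_out = prob_between(demand, 40, 9999)
print(round(p_run_out, 4))
```

Because the area beyond 5 SD's is negligible, replacing 9999 by 76 changes the result only far beyond the fourth decimal place.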

Exercise: Examine the idealized stem-and-leaf in Figure 4.2.
a. What percentage of the data is larger than 40 (exclusive, i.e. excluding 40)?
b. What percentage of the data is larger than 40 (inclusive)?
c. If we split the difference between inclusive and exclusive, what percentage of the data is larger than 40? Is this close to 31%?

4.1.2 Calculating Percentiles Using the Normal Curve

The second part of the TV sales problem requires calculation of the 90th percentile of the demand histogram. In mathematical notation, we are looking for a such that P(X > a) = .10. See Figure 4.4.

Using the TI-84: Calculating normal percentiles
1. [2nd][DISTR]
2. Arrow down to [INVNORM]
3. Input (percentile, mean, SD). Don’t forget to type a comma between entries, and end with a parenthesis.
4. [ENTER]

Example: To find the 90th percentile of TV sales, type the inputs (.90, 36, 8). The answer is 46.25, or approximately 46.
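The percentile calculation has a direct counterpart in Python's standard library. A minimal sketch, assuming `statistics.NormalDist.inv_cdf` as the stand-in for the calculator's inverse-normal function:

```python
from statistics import NormalDist

demand = NormalDist(mu=36, sigma=8)

# 90th percentile: the value a with P(X <= a) = .90, i.e. P(X > a) = .10.
p90 = demand.inv_cdf(0.90)
print(round(p90, 2))

# Sanity check: the area to the right of the percentile is about 10%.
print(round(1 - demand.cdf(p90), 2))
```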

Exercise: Find the following percentiles for TV sales. Make a guess first, then use your calculator. Was your guess close?
a. 80th percentile
b. 20th percentile
c. 50th percentile

Figure 4.4: The 90th percentile: P(X ≥ a) = .10
[Normal curve for TV demand on an X axis from 12 to 60, with the upper 10% area to the right of a marked; on the matching Z axis from −3 to 3, a corresponds to Z = 1.28.]

4.1.3 The Standard Normal or Z-Curve

Figure 4.5 contains plots of the normal curve with mean 36 and SD 8 (TV sales), compared with plots of the normal curve with mean 70 and SD 3 (adult men’s heights, in inches).

The areas within one and two SD’s of the mean, respectively, are shaded. The area under the normal curve within one SD of the mean will always be .6827, or approximately 68%. The area within two SD’s of the mean will always be .9545, or approximately 95%.

All the plots in Figure 4.5 contain a second horizontal axis, labeled Z. Note that Z = 1 whenever X is one SD above the mean. Similarly, Z = −1 whenever X is one SD below the mean.

X is how many SD’s above the mean? This number is Z.

If we replace the original X-axis by the Z-axis, then all normal curves will look like Figure 4.6. This is the Z-curve, also called the standard normal curve.

The standard normal curve (or Z-curve) has the following properties:

Properties of the Z-curve:
1. The Z-curve has mean 0 and SD 1.
2. The Z-curve satisfies the following:
   i. P(−1.0 ≤ Z ≤ 1.0) = .68
   ii. P(−2.0 ≤ Z ≤ 2.0) = .95
   iii. P(−3.0 ≤ Z ≤ 3.0) = .997
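The Z-curve areas, and the conversion from X to Z described above, can be sketched in a few lines of Python. The helper names `z_score` and `area_within` are hypothetical; `NormalDist()` with no arguments is the standard normal curve (mean 0, SD 1).

```python
from statistics import NormalDist

z_curve = NormalDist()  # mean 0, SD 1


def z_score(x, mean, sd):
    """How many SD's x lies above (positive) or below (negative) the mean."""
    return (x - mean) / sd


def area_within(k):
    """Area under the Z-curve between -k and k."""
    return z_curve.cdf(k) - z_curve.cdf(-k)


for k in (1, 2, 3):
    print(k, round(area_within(k), 4))

# X = 44 on the TV sales curve and X = 73 on the heights curve both give Z = 1.
print(z_score(44, 36, 8), z_score(73, 70, 3))
```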

The empirical rule for the SD is really a special case of the normal curve applied to data. We restate the empirical rule here.

Empirical Rule:
If the data histogram is approximately normal-shaped, expect around
68% of the observations will fall within 1 SD of the mean
95% of the observations will fall within 2 SD’s of the mean
99.7% of the observations will fall within 3 SD’s of the mean

Exercise: Revisit Figure 4.2, and examine how closely the data satisfies the empirical rule.

Figure 4.5: Areas within One and Two SD’s of the Mean for N(36, 8) and N(70, 3)
[Four panels, each with an X axis and a matching Z axis from −3 to 3. For N(36, 8) the X axis runs from 12 to 60; for N(70, 3) it runs from 61 to 79. The area within one SD of the mean is 0.6827 and the area within two SD’s is 0.9545 on both curves.]

Figure 4.6: The Standard Normal Curve and Areas within 1, 2, 3 SD’s of the Mean
[Standard normal curve on a Z axis from −4 to 4, with areas of 68%, 95% and 99.7% within 1, 2 and 3 SD’s of the mean.]

4.2 Binomial Probabilities

A sequence of n observations is called a binomial process if
(i) each observation results in one of two possible outcomes (which we call success and failure)
(ii) the probability of success is p and the probability of failure is q = 1 − p for all observations
(iii) the observations are independent of each other.

Obs 1      Obs 2      Obs 3      · · ·      Obs n
p ↙ ↘ q    p ↙ ↘ q    p ↙ ↘ q               p ↙ ↘ q
S     F    S     F    S     F               S     F

Let X denote the total number of successes among the n observations. Then X is called a binomial random variable with parameters n and p.

The following are all binomial random variables.

Example 1 A Stat 216 multiple choice quiz has 5 questions, with each question having 4 choices. Let X be the number of correct answers (C) by someone who is guessing on all questions. Then X is a Binomial random variable with parameter values n = 5 and p = 1/4.

Ques 1         Ques 2         Ques 3         Ques 4         Ques 5
1/4 ↙ ↘ 3/4    1/4 ↙ ↘ 3/4    1/4 ↙ ↘ 3/4    1/4 ↙ ↘ 3/4    1/4 ↙ ↘ 3/4
C       W      C       W      C       W      C       W      C       W

Example 2 Available data shows that 40% of telephone respondents agree to be interviewed for market research surveys. Suppose that the polling organization Reliable Research randomly selects and dials telephone numbers until 50 respondents are reached. Let X be the number of respondents (out of the 50) who agree to be interviewed. Then X is a Binomial random variable with parameter values n = 50 and p = .40.

Example 3 Historically, 20% of television buyers at TV World purchase the store’s extended warranty. Suppose that 300 TV sets were sold during the previous quarter. Let X be the number of extended warranties that were sold along with the 300 sets. Then X is a Binomial random variable with parameter values n = 300 and p = .20.

4.2.1 Computing Binomial Probabilities

In Example 1, the number of correct guesses may be 0, 1, 2, 3, 4 or 5. How likely is it that a guesser gets all 5 questions right? The answer is .001, or about once in every 1000 tries. How about the likelihood of getting 2 out of 5 questions right? The answer is .264, approximately a quarter of the time. The following probability distribution table gives the likelihood or probability of each possible value of X.

P(X = 0)  P(X = 1)  P(X = 2)  P(X = 3)  P(X = 4)  P(X = 5)

.237      .395      .264      .088      .015      .001        (4.1)

You can compute these probabilities yourself by successively substituting j = 0, 1, 2, 3, 4 and 5 in the formula

\[ P(X = j) = \frac{5!}{j!\,(5 - j)!} (1/4)^j (3/4)^{5-j}, \tag{4.2} \]

where \(j! = (j)(j - 1)(j - 2) \cdots (1)\). (Remember that \(0! = 1\), and \((1/4)^0 = 1\).) This formula (4.2) is called the binomial probability distribution function (or pdf) for n = 5 and p = 1/4. To compute the probabilities for Examples 2 and 3, you will need the binomial pdf for general n and p:

\[ P(X = j) = \frac{n!}{j!\,(n - j)!} p^j (1 - p)^{n-j}, \quad j = 0, 1, 2, \ldots, n. \tag{4.3} \]

Example 3 (Cont’d): 10 TV sets were sold in one day. What is the probability that 3 extended warranties were sold? Using Equation 4.3 with n = 10, p = .20, and j = 3, we get

\[ P(X = 3) = \frac{10!}{3!\,7!} (.20)^3 (.80)^7 = .201 \]
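Equation (4.3) translates almost word for word into code. The sketch below is in Python; `binom_pmf` is a hypothetical name, and `math.comb(n, j)` supplies the factor n!/(j!(n − j)!).

```python
from math import comb


def binom_pmf(j, n, p):
    """P(X = j) for a binomial random variable, as in Equation (4.3)."""
    return comb(n, j) * p**j * (1 - p)**(n - j)


# Example 1: guessing on a 5-question quiz with 4 choices each.
quiz_table = [binom_pmf(j, 5, 1 / 4) for j in range(6)]
print([round(prob, 4) for prob in quiz_table])

# Example 3 (continued): 3 warranties among 10 TV sets, p = .20.
print(round(binom_pmf(3, 10, 0.20), 3))
```

The quiz probabilities agree with table (4.1) to within rounding in the third decimal place.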

Exercise: Five percent of videos rented at Campus Video are returned late. If 30 videos were rented during the last hour, what is the probability that
a. 2 will be returned late
b. none will be returned late

Binomial probabilities may also be presented in a cumulative form P(X ≤ j) = P(X = 0) + P(X = 1) + · · · + P(X = j). Here is the cumulative distribution function (cdf) for the binomial distribution with n = 5 and p = 1/4.

P(X ≤ 0)  P(X ≤ 1)  P(X ≤ 2)  P(X ≤ 3)  P(X ≤ 4)  P(X ≤ 5)

.237      .632      .896      .984      .999      1.00        (4.4)

Do you see how to compute (4.4) from (4.1)? The TI-84 calculator can compute either the binomial pdf or the binomial cdf (see Sections 4.4.3 and 4.4.4 at the end of this chapter).
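One way to see the link between the pdf and the cdf is to accumulate the pdf values. This is a sketch in Python with hypothetical function names; it works from unrounded probabilities, so a third decimal may differ by one unit from a table built by adding rounded entries.

```python
from itertools import accumulate
from math import comb


def binom_pmf(j, n, p):
    """P(X = j), Equation (4.3)."""
    return comb(n, j) * p**j * (1 - p)**(n - j)


def binom_cdf(j, n, p):
    """P(X <= j) = P(X = 0) + P(X = 1) + ... + P(X = j)."""
    return sum(binom_pmf(i, n, p) for i in range(j + 1))


# Running totals of the pdf give the whole cdf table at once.
pdf_table = [binom_pmf(j, 5, 1 / 4) for j in range(6)]
cdf_table = list(accumulate(pdf_table))
print([round(prob, 3) for prob in cdf_table])

# "At least" questions use the complement: P(X >= 3) = 1 - P(X <= 2).
print(round(1 - binom_cdf(2, 5, 1 / 4), 3))
```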

Exercise: Five percent of videos rented at Campus Video are returned late. If 30 videos were rented during the last hour, what is the probability that
a. 2 or fewer will be returned late
b. 5 or more will be returned late

4.2.2 Expected Value and SD of a Binomial Random Variable

Suppose that last quarter, TV World sold 300 television sets. If there is .20 likelihood of selling an extended warranty with each television sold, the number of extended warranties sold last quarter should be around 60, give or take 7 or so (we will compute this later). The first number is called the expected value of the number of warranties sold; the second number is the standard deviation. Recall that X, the number of warranties sold, is a Binomial random variable. The expected value, denoted E(X), of a Binomial random variable X with parameters n and p is computed as:

E(X) = np (4.5)

The expected value E(X) is also called the average or mean of X, and denoted µ. The standard deviation of a Binomial random variable is computed as

\[ SD(X) = \sigma_X = \sqrt{np(1 - p)} \tag{4.6} \]

Returning to the example, since n = 300 and p = .20, we have E(X) = 300(.20) = 60.0, and \(SD(X) = \sqrt{300(.20)(.80)} = 6.93\), or approximately 60 ± 7.

The SD for random variables is interpreted similarly to the SD for a sample. If the store sells 300 TV sets every quarter, they won’t sell exactly 60 extended warranties every time; sometimes they will sell more, sometimes they will sell less. By how much more, and how much less? The answer is, “By 7, on the average”. Similarly, a baseball player with .200 batting average won’t necessarily get 60 hits in 300 at bats. We expect him to get 60 hits, give or take 7 hits or so.
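Equations (4.5) and (4.6) amount to a one-line calculation each. A minimal sketch in Python, with an illustrative function name:

```python
from math import sqrt


def binom_mean_sd(n, p):
    """Expected value np and standard deviation sqrt(np(1 - p))."""
    return n * p, sqrt(n * p * (1 - p))


expected, sd = binom_mean_sd(300, 0.20)
print(round(expected, 1), round(sd, 2))  # "around 60, give or take 7 or so"
```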

Exercise: Suppose that 5% of videos rented at Campus Video incur a late rental fee. If 700 videos were rented last week, the number that will incur a late rental fee should be around ______ give or take ______ or so.

4.3 Computing Binomial Probabilities Using the Normal Curve

Beyond the empirical rule, we may apply the normal curve to approximating binomial probabilities. The key image is visualizing the binomial histogram as centered about np with SD = \(\sqrt{np(1 - p)}\). Here, for example, is the histogram of binomial probabilities for n = 30 and p = .4 (Figure 4.7).

The height of the rectangle over, say 10, is its probability P(X = 10). However, since the width of the rectangle is 1, then

P(X = 10) = height of rectangle over 10 = area of rectangle over 10.

What about P(X ≤ 10)? This probability corresponds to the total area of the rectangles over and to the left of 10 (shaded rectangles in Figure 4.8).

Using the binomial probability function to compute the area (probability) of each shaded rectangle, the total area equals .2915. However, this requires repeated applications of the binomial formula (11 times, in fact). We may calculate a quick approximation of the desired probability by replacing the rectangles with a curve! See shaded area under curve in Figure 4.8.

Figure 4.7: Histogram of Probabilities for Binomial n = 30 and p = .4
[Histogram of binomial probabilities, with the X axis labeled from 3 to 21.]

Figure 4.8: Approximating P(X ≤ 10) using Curve instead of Rectangles
[The same histogram with a normal curve overlaid and the boundary 10.5 marked on the X axis; the area to the left of 10.5 is shaded.]

There are many normal curves, which one do we use to replace the rectangles? Answer: the curve which has the same mean and SD as the rectangles! The (binomial) rectangles have mean \(\mu = np = (30)(.4) = 12\) and SD \(\sigma = \sqrt{np(1 - p)} = \sqrt{(30)(.4)(.6)} = 2.68\). Using the normal curve with the same mean and SD, the area to the left of 10.5 is .2878, which is a close estimate of the true area of .2915.

Similarly, P (X = 14) = .1101 is approximated by .1124, the area under the normal curvebetween 13.5 and 14.5.
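The comparison between the exact binomial areas and their normal-curve approximations can be sketched as follows. The code is Python with hypothetical names; it uses the unrounded SD, so the approximations may differ from the figures quoted in the text in the third or fourth decimal place.

```python
from math import comb, sqrt
from statistics import NormalDist


def binom_pmf(j, n, p):
    """Exact binomial probability P(X = j), Equation (4.3)."""
    return comb(n, j) * p**j * (1 - p)**(n - j)


n, p = 30, 0.4
# Normal curve with the same mean and SD as the binomial rectangles.
curve = NormalDist(mu=n * p, sigma=sqrt(n * p * (1 - p)))

# P(X <= 10): rectangles over 0..10 versus area to the left of 10.5.
exact_le_10 = sum(binom_pmf(j, n, p) for j in range(11))
approx_le_10 = curve.cdf(10.5)

# P(X = 14): one rectangle versus area between 13.5 and 14.5.
exact_eq_14 = binom_pmf(14, n, p)
approx_eq_14 = curve.cdf(14.5) - curve.cdf(13.5)

print(round(exact_le_10, 4), round(approx_le_10, 4))
print(round(exact_eq_14, 4), round(approx_eq_14, 4))
```

The half-unit boundaries (10.5, 13.5, 14.5) are what make the curve cover whole rectangles, since each rectangle is centered on an integer and has width 1.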

4.3.1 Some Approximations Are Better Than Others

Examine the shape of the binomial histogram for n = 20 and p = .10 in Figure 4.9.

Figure 4.9: Histogram of Probabilities for Binomial with n = 20 and p = .10

Since there is a noticeable skewness to the right, we will get a poor approximation of areas if we replace the rectangles by a normal curve. If the value of p were .90 instead of .10, the binomial histogram would be skewed left. This is typical behavior of binomial histograms whenever p is either too close to 0 or too close to 1. When is it ’safe’ to use the normal curve to approximate binomial probabilities? A convenient rule of thumb is as follows:

The normal curve gives reasonably good approximations of binomial probabilities whenever both np > 5 and n(1 − p) > 5.
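As a rough illustration of the rule of thumb, the following Python sketch checks the two conditions and compares the size of the approximation error for a case that passes (n = 30, p = .4) and one that fails (n = 20, p = .10). The function names and the choice of P(X ≤ 1) for the second case are assumptions made for this example only.

```python
from math import comb, erf, sqrt

def rule_of_thumb_ok(n, p):
    """True when both np > 5 and n(1 - p) > 5."""
    return n * p > 5 and n * (1 - p) > 5

def exact_cdf(n, p, j):
    """Exact binomial P(X <= j)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(j + 1))

def approx_cdf(n, p, j):
    """Normal-curve area to the left of j + 0.5."""
    mean, sd = n * p, sqrt(n * p * (1 - p))
    return 0.5 * (1 + erf((j + 0.5 - mean) / (sd * sqrt(2))))

def approx_error(n, p, j):
    """Absolute gap between the exact and approximate probabilities."""
    return abs(exact_cdf(n, p, j) - approx_cdf(n, p, j))

for n, p, j in [(30, 0.4, 10), (20, 0.10, 1)]:
    print(n, p, rule_of_thumb_ok(n, p), round(approx_error(n, p, j), 4))
```

In this sketch the skewed case shows a noticeably larger error than the roughly symmetric one, which is the behavior the rule of thumb is meant to guard against.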


The reader is reminded that normal curve approximations, no matter how close, are still approximations. The binomial formula (4.3) should be used to calculate exact probabilities whenever possible.

4.4 Using the TI-84:

4.4.1 Computing Normal Curve probability between a and b

1. [2nd][DISTR]

2. Arrow down to [NORMALCDF] [ENTER].

3. Input (a, b, mean, SD) [ENTER]. (Type a comma between inputs, end with a close parenthesis.)

4.4.2 Computing Normal Curve percentile

1. [2nd][DISTR]

2. Arrow down to (INVNORM) [ENTER].

3. Input (percentile, mean, SD) [ENTER]

4.4.3 Computing Binomial PDF

This computes the binomial probability distribution function P(X = j) = [n! / (j! (n − j)!)] p^j (1 − p)^(n−j).

1. [2nd][DISTR]

2. Arrow down to (BINOMPDF)

3. Input (n, p, j) [ENTER].

4.4.4 Computing Binomial CDF

This computes the binomial cumulative distribution function P(X ≤ j) = P(X = 0) + P(X = 1) + · · · + P(X = j).

1. [2nd][DISTR]

2. Arrow down to (BINOMCDF)

3. Input (n, p, j) [ENTER].
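For readers who want to cross-check calculator results on a computer, here is a minimal Python sketch of the same four computations. The function names simply mirror the calculator menu entries and are hypothetical helpers, not TI-84 code; `invnorm` assumes the percentile is entered as a proportion (area to the left), for example .90 for the 90th percentile. It relies on `statistics.NormalDist`, which needs Python 3.8 or later.

```python
from math import comb
from statistics import NormalDist

def normalcdf(a, b, mean, sd):
    """Normal curve area between a and b."""
    d = NormalDist(mean, sd)
    return d.cdf(b) - d.cdf(a)

def invnorm(area, mean, sd):
    """Value with the given area (as a proportion) to its left."""
    return NormalDist(mean, sd).inv_cdf(area)

def binompdf(n, p, j):
    """Binomial probability P(X = j)."""
    return comb(n, j) * p**j * (1 - p)**(n - j)

def binomcdf(n, p, j):
    """Binomial cumulative probability P(X <= j)."""
    return sum(binompdf(n, p, k) for k in range(j + 1))

print(round(normalcdf(9, 15, 12, 2.68), 4))
print(round(invnorm(0.90, 12, 2.68), 2))
print(round(binompdf(30, 0.4, 14), 4))
print(round(binomcdf(30, 0.4, 10), 4))
```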


4.5 Using Excel

4.5.1 Computing binomial probabilities

1. Select the cell where you wish to insert a formula. Click the ’Formulas’ tab and select ’Insert Function’. A window with a list of functions will appear.

2. From the ’Or select a category’ list, select ’Statistical’ and find ’BINOMDIST’ from the list of functions below. Click ’OK’.

3. Enter the following arguments for BINOMDIST:

’Number_s’ is the number of successes (x)

’Trials’ is the number of trials or the sample size (n)

’Probability_s’ is the probability of success (p)

’Cumulative’ is a logical value; you can type either ’true’ or ’false’. Typing ’true’ will give a cumulative probability P(X ≤ x); typing ’false’ will give an exact probability P(X = x).

4. Click ’OK’. The result will be printed in the cell you selected.

4.5.2 Computing P (X ≤ x) from the normal curve

1. Select the cell where you wish to insert a formula. Click the ’Formulas’ tab and select ’Insert Function’. A window with a list of functions will appear.

2. From the ’Or select a category’ list, select ’Statistical’ and find ’NORMDIST’ from the list of functions below. Click ’OK’.

3. Enter the following arguments for NORMDIST:

’X’ is the value you want the probability less than or equal to.

Enter the appropriate mean and standard deviation.

’Cumulative’ is a logical value. For this always type ’true’.

4. Click ’OK’.

NOTE: There is also a ’NORMSDIST’ function (the difference is an ’S’ in the middle). Do not use this function.


4.5.3 Getting z critical values

1. Select the cell where you wish to insert a formula. Click the ’Formulas’ tab and select ’Insert Function’. A window with a list of functions will appear.

2. From the ’Or select a category’ list, select ’Statistical’ and find ’NORMSINV’ from the list of functions below. Click ’OK’.

3. Enter the desired level of significance. Remember to use the area to the left of the critical value (1 − α).

4. Click ’OK’.

NOTE: There is also a ’NORMINV’ function (the difference is the missing ’S’ in the middle). Do not use this function.

4.5.4 An alternative method

1. Instead of using the ’Insert Function’ button, you can also type the function names directly into a cell. The format is shown below:

• =BINOMDIST(

• =NORMDIST(

• =NORMSINV(

2. If you spelled the function name properly, a small bubble will pop up with the names of the arguments to fill. Place a comma between arguments.


4.6 Exercises

Exercises for Section 4.1

1. A study was conducted in order to determine if any link existed between cellular phone usage and the development of brain cancer. (Don’t worry, no connection was found.) Data from this study indicate that the monthly phone usage for all users is approximately normally distributed with mean 2.4 hours and standard deviation 1.1 hours.

(a) What proportion of cell phone users are on their phones between 1 hour and 3 hours per month?

(b) Just to be safe, suppose you decide to be in the 5th percentile of cell phone users in terms of monthly usage. How much time can you spend on your phone this month?

2. On the average, a watch battery is known to last for 2 years (24 months), with a standard deviation of 9 months. Assuming that the population is normally distributed,

(a) What percentage of watch batteries last more than 6 months?

(b) What is the life span of a watch battery which lasts longer than 60% of all batteries?

(c) What proportion of watch batteries last less than 2 years or longer than 3 1/2 years (42 months)?

3. Data from previous years reveal that the distribution of first exam scores in an introductory statistics class is approximately normal with a mean of 72 and a standard deviation of 12.

(a) Given that the passing score is 65, approximately what percentage of students pass the first exam?

(b) If a student aims to be at least in the 98th percentile, what must he or she score?

(c) The middle 95% of the scores will tend to fall between ______ and ______.

4. Mastercard records show that credit card purchases for the month of December average $1200 with standard deviation $550. Assuming that purchases approximately follow the normal curve,

(a) What percentage of customers spend between $1000 and $2000 in December using their Mastercard?

(b) What percentage of customers will charge less than $2000 for the month of December?

(c) Suppose that Mastercard gives special offers to customers who spend more than $1500 in a month. What percentage of customers will “earn” a special offer for their December purchases?

(d) How much did a Mastercard customer spend if only 2% of all customers spent more?


5. Suppose 95% of data coming from a normally distributed population falls between 4 and 35. Based on the empirical rule, what is the standard deviation of this sample of data?

6. Employees of a company are given a test that measures competency. Scores from this test are known to be distributed normally with mean 200 and standard deviation 35. The people scoring in the bottom 3% will be fired. What score is necessary to retain a job?

7. Let X be a normal random variable with µ = 25 and σ = 10. Find the following probabilities.

(a) P (X > 22)

(b) P (24 < X < 39)

(c) P (X < 18)

Exercises for Section 4.2

8. Records in a Beanie Babies factory show that for every 100 newly produced toys, 20 are defective. If the quality control manager takes a sample of 10 newly manufactured stuffed toys,

(a) What is the chance that she will find 5 or more defectives?

(b) What is the likelihood that all stuffed toys are in perfect condition (no defectives)?

(c) How many non-defectives should she expect to find?

(d) Suppose 200 toys were sampled instead of 10. The number of defectives will be between what two values with 68% chance?

9. NFL place kicker Jan Stenerud has a lifetime record of 66.2% for making field goals. Assume that his field goal attempts are independent. Out of five attempts, what is the probability he makes at least four of them?

10. Suppose a student in Stat 216 has not been attending class this semester but decides to take the exam anyway. If he randomly guesses on each of the 25 questions, then he has a 1 out of 5 chance of getting a correct answer, since it is a multiple choice exam with choices a, b, c, d, or e.

(a) How many questions should the student expect to get correct on this exam?

(b) What is the probability that the student will get a “C” or better, which is 16 or more correct?

(c) What is the probability that the student will score lower than a “C” (15 or less correct)?

11. For each of the following situations, decide if the binomial distribution is appropriate. If yes, determine n and p. If not, explain why not.


(a) A random sample of 10 newborn babies is taken from the nursery of the city hospital. Let X be the number of girls in the sample.

(b) People getting off the Lovell street bus at the business college are asked if they own a dog. Let X be the number of people until the first dog owner gets off the bus.

(c) Suppose the pond you’re fishing in contains 25 bluegills and 10 bass. If you land 5 fish, let X be the number of bass among your catch.

12. The quality control section of a purchasing contract for valves specifies that the diameter must be between 2.53 and 2.57 centimeters. Assume that the production equipment is set so that the mean diameter is 2.56 centimeters and the standard deviation is 0.01 cm.

(a) What proportion of valves produced, over the long run, will be within these specifications, assuming a normal distribution?

(b) If 400 valves are produced, the number that fall within specifications is expected to be around ______, give or take ______ or so.

13. Millions of Americans are subjected to drug testing in the workplace. The tests are not always reliable. Foods and medicines such as Advil, Midol, Sudafed, poppy seeds, and some antibiotics can cause false-positive results. It has even been shown that the skin pigment melanin causes those with dark skin to be at a higher risk of falsely testing positive. A false positive rate of 5% is considered to be a conservative estimate. Suppose that there are 40 employees in your company, which tests its workers weekly.

(a) What is the probability of no false-positives for this week’s test?

(b) What is the probability of at least one false-positive for this week’s test?

(c) What is the likelihood that the test will give 5 to 10 false-positives?

(d) What is the chance of at most 3 false-positives?

(e) How many false-positives should the company expect?

Exercises for Section 4.3

14. Suppose that it is a known fact that 75% of all college students are single. A random sample of 120 college students is taken for observation. Note that np = 120(.75) = 90 > 5, and n(1 − p) = 120(.25) = 30 > 5, so the normal approximation to the binomial is valid.

(a) What is the approximate probability that at least 85 students in the sample will be single?

(b) What is the approximate probability that between 100 and 105 (inclusive) students in the sample will be single?


15. It is known that 30% of all customers of a major credit card pay their bills in full before any interest charges are incurred. Suppose a simple random sample of 150 credit card holders is selected. Note that np = 150(.3) = 45 > 5, and n(1 − p) = 150(.7) = 105 > 5, so the normal approximation to the binomial is quite valid.

(a) What is the probability that 30 or fewer customers pay their account balances in full before any interest charges are incurred?

(b) What is the probability that between 40 and 50 customers pay their account balances in full before any interest charges are incurred?

(c) What is the probability that more than 60 customers pay their account balances in full before any interest charges are incurred?

16. Based on past experience, 20% of all customers at a particular automotive service station pay for their purchases with cash or check. Suppose a random sample of 100 customers is selected for observation. Note that np = 100(.2) = 20 > 5, and n(1 − p) = 100(.8) = 80 > 5, so the normal approximation to the binomial is valid.

(a) What is the approximate probability that at least 32 customers pay with cash or check?

(b) What is the approximate probability that between 10 and 25 customers pay with cash or check?

(c) What is the approximate probability that 50 or more customers pay with cash or check?


Chapter 5

Sampling Distribution of the Proportion


5.1 The Sample Proportion

Suppose a student guesses at the answer on every question in a 300-question exam. If he gets 60 questions correct, then his proportion of correct guesses is 60/300 = .20. If he gets 75 questions correct, then his proportion of correct guesses is 75/300 = .25. The proportion of correct guesses is simply the number of correct guesses divided by the total number of questions.

Similarly, if TV World sells 60 extended warranties with 300 television sets sold, then its warranty sales rate is 60/300 = .20. If it sold 75 warranties, this is a sales rate of 75/300 = .25. The warranty sales rate, or proportion of TV sets sold with warranties, is simply the number of warranties divided by the total number of TV sets.

Now, let X denote the number of successes out of a sample of n observations. If each observation is a success with probability p independently of the other observations, then X is a binomial random variable with parameters n and p. Furthermore, the proportion of successes in the sample is also a random variable and is computed as

p̂ = X/n = (number of successes) / (total number of observations in the sample)    (5.1)

Since X is expected to be around np give or take √(np(1 − p)), then X/n is expected to be around np/n give or take √(np(1 − p))/n, or p give or take √(p(1 − p))/√n. (Make sure that you agree with the last statement before moving on. It may help to think of this analog: Suppose annual rainfall in Kalamazoo is expected to be around 24 inches give or take 6 inches. How do we change the measurement from inches to feet? We divide both numbers by 12! In feet, annual rainfall in Kalamazoo is expected to be around 24/12 give or take 6/12, or 2 feet give or take .5 feet. Now read the first sentence of this paragraph one more time.)

Going back to the TV World example, the number of warranties sold is expected to be around 60 ± 7. Thus the proportion of warranties sold is expected to be around 60/300 ± 7/300, or .20 ± .02.

We summarize the formulas for the mean and SD of X and p̂ in the following table.

Random Variable    Mean    SD
X                  np      √(np(1 − p))
p̂                  p       √(p(1 − p))/√n

(5.2)
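The two rows of the table can be checked against the TV World numbers (n = 300, p = .20). The snippet below is a small Python sketch, with function names chosen only for this illustration; it also confirms that the SD of p̂ is just the SD of X divided by n.

```python
from math import sqrt

def sd_count(n, p):
    """SD of X, the number of successes."""
    return sqrt(n * p * (1 - p))

def sd_proportion(n, p):
    """SD of p-hat, the proportion of successes."""
    return sqrt(p * (1 - p)) / sqrt(n)

n, p = 300, 0.20
print(n * p, round(sd_count(n, p), 2))       # mean and SD of X
print(p, round(sd_proportion(n, p), 3))      # mean and SD of p-hat
print(round(sd_count(n, p) / n, 3))          # same as the SD of p-hat
```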

Exercise 1. If TV World sold 1200 television sets last year,
a. the proportion of sets sold with extended warranties should be around .20, give or take ______.
b. the percentage of sets sold with extended warranties should be around 20%, give or take ______.

Data analysis sometimes involves percentages instead of proportions. The only difference, of course, is 2 decimal places, or multiplication by 100% (e.g. 20% instead of .20). In order to avoid


repetition, we present all statistical formulas in proportions. As Exercise 1 shows, the answers can always be converted to percentages in the end.

Exercise 2. Historically, 5% of rentals from Campus Video are returned late.
a. Campus Video rented out 100 videos yesterday. The percentage that will be returned late should be around 5%, give or take ______.
b. Campus Video rented out 700 videos last week. The percentage that will be returned late should be around 5%, give or take ______. (Should the answer be different or the same as (a)?)

Exercise 2 is an illustration of the law of large numbers. A simpler illustration involves a coin toss. If you toss a coin repeatedly, which tends to get closer to 50% heads: 100 tosses or 700 tosses? The correct answer is 700. The larger the number of tosses, the closer we expect to get to 50%. The reason for this, as the Exercise shows, is the smaller give or take value. The sample size n lies in the denominator of the SD of p̂. Therefore the larger the sample size, the smaller the SD, which happens to be the give-or-take value.

The Law of Large Numbers for Sample Percentages:
The sample percentage tends to get closer to the true percentage as sample size increases.
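The coin-toss illustration can be tried out by simulation. The following Python sketch is only illustrative: the number of repetitions (1000), the seed, and the function names are arbitrary choices made here, not part of the text. For each sample size it compares the give-or-take value from the formula with the spread actually observed in the simulated proportions of heads.

```python
import random
from math import sqrt
from statistics import pstdev

def formula_sd(n, p=0.5):
    """SD of the sample proportion from the formula."""
    return sqrt(p * (1 - p) / n)

def simulated_sd(n, p=0.5, reps=1000, seed=0):
    """Observed spread of the proportion of heads over many repetitions."""
    rng = random.Random(seed)
    props = [sum(rng.random() < p for _ in range(n)) / n for _ in range(reps)]
    return pstdev(props)

for n in (100, 700):
    print(n, round(formula_sd(n), 4), round(simulated_sd(n), 4))
```

Both columns should shrink as n grows from 100 to 700, which is the law of large numbers at work.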

5.1.1 The Sampling Distribution of p̂ is Approximately Normal

Since p̂ = X/n, the sampling distribution of p̂ looks the same as that of X except for different numbers on the horizontal axis. For n = 30 and p = .4, the probability histogram of X and p̂ is shown in Figure 5.1.

Therefore, like the binomial, the sampling distribution of p̂ may be approximated by a normal curve with the correct mean and SD.

Example: Toss a fair coin 50 times. What is the chance of getting 60% or more heads?
Solution: The question is equivalent to “What is the probability that p̂ exceeds .60?” Using the mean and SD given in (5.2) with n = 50 and p = .50, the sample proportion is expected to be around .50 give or take √((.5)(1 − .5))/√50, or .50 give or take .07. With mean .50 and SD .07, the area to the right of .60 (under the normal curve with mean .50 and SD .07) is .0766. Thus, the proportion of heads will exceed .60 fewer than 8 percent of the time.
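The coin example can be reproduced with a few lines of Python. This is a sketch using `statistics.NormalDist` from the standard library (Python 3.8 or later); the variable names are illustrative. It computes the area twice, once with the SD rounded to .07 as in the solution and once with the unrounded SD, to show how much the rounding matters.

```python
from math import sqrt
from statistics import NormalDist

n, p = 50, 0.50
sd = sqrt(p * (1 - p)) / sqrt(n)     # about .0707
sd_rounded = round(sd, 2)            # .07, the value used in the solution

# Area to the right of .60 under each normal curve
area_rounded = 1 - NormalDist(p, sd_rounded).cdf(0.60)
area_unrounded = 1 - NormalDist(p, sd).cdf(0.60)

print(round(area_rounded, 4), round(area_unrounded, 4))
```

Either way the chance stays just under 8 percent, so the conclusion of the example is not sensitive to the rounding.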

Exercise 3. If TV World sold 1200 television sets last year, the percentage of sets sold with extended warranties is expected to be around 20%, give or take ______. Estimate the likelihood that it sold warranties with more than 25% of those sets.


Figure 5.1: Probability Histogram of X and p̂


5.2 Estimating the Population Proportion p

The TV World computations in the previous section assume that we know the warranty rate is p = .20. In data analysis, population parameters like p are typically unknown and estimated from the data. Consider estimating the proportion p of the current WMU graduating class who plan to go to graduate school. Suppose we take a sample of 40 graduating students, and suppose that 6 out of the 40 are planning to go to graduate school. Then our estimate is p̂ = 6/40 = .15 of the graduating class plan to go to graduate school. Now p̂ is based on a sample, and unless we got really lucky, chances are the .15 estimate missed. By how much? On the average, a random variable misses the mean by one SD. From the previous section, the SD of p̂ equals √(p(1 − p))/√n. It follows that the expected size of the miss is √(p(1 − p))/√n. This last term is called the standard error of estimation of the sample proportion, or simply standard error (SE) of the proportion.

However, since we do not know p, we cannot calculate this SE. In a situation like this, statisticians replace p with p̂ when calculating the SE. The resulting quantity is called the estimated standard error of the sample proportion. In practice, however, the word “estimated” is dropped and the estimated SE is simply called the SE.

The population proportion p is estimated using the sample proportion p̂. This estimate tends to miss by an amount called the standard error (SE) of p̂.

The SE of p̂ is calculated as √(p̂(1 − p̂))/√n

Exercise 4.
a. If 6 out of 40 students plan to go to graduate school, the proportion of all students who plan to go to graduate school is estimated as ______. The standard error of this estimate is ______.
b. If 54 out of 360 students plan to go to graduate school, the proportion of all students who plan to go to graduate school is estimated as ______. The standard error of this estimate is ______.

Exercise 4 shows the effect of increasing the sample size on the SE of the sample proportion. Multiplying the sample size by a factor of 9 (from 40 to 360) makes the SE decrease by a factor of 3. In the formula for the SE of p̂, the sample size appears (i) in the denominator, and (ii) inside a square root. Therefore, multiplying the sample size by a certain factor divides the SE of p̂ by the square root of that factor.

As sample size increases, the SE of p̂ decreases in proportion to 1/√n, the reciprocal of the square root of the sample size.
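The square-root behavior of the SE can be seen with a short Python sketch. The counts used here (10 out of 50, then 90 out of 450) are hypothetical numbers chosen for illustration so that p̂ stays at .20 while the sample size grows by a factor of 9; the function name is also an assumption of this example.

```python
from math import sqrt

def se_proportion(successes, n):
    """Estimated standard error of the sample proportion."""
    p_hat = successes / n
    return sqrt(p_hat * (1 - p_hat)) / sqrt(n)

se_small = se_proportion(10, 50)     # hypothetical sample of 50
se_large = se_proportion(90, 450)    # same p-hat, 9 times the sample size

print(round(se_small, 4), round(se_large, 4), round(se_small / se_large, 2))
```

The ratio of the two standard errors comes out to 3, the square root of 9.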


5.3 Exercises

Exercises for Section 5.1

1. According to Horseman’s Quarterly, three out of five racehorses lose money. A racing partnership has 49 horses in its stables. What is the probability that more than two-thirds of the stable’s horses will lose money?

2. Suppose that 20% of students in a large university are graduate students. If a random sample of 125 students is selected, what is the probability that 25% or more of the sample are graduate students?

3. An opinion poll of 800 residents of a large suburb asked whether they favor a rise in taxes to help pay for a new shopping mall. If 60% or more support the tax, a referendum will be placed on the next election ballot. If the true (population) proportion of all residents who favor the mall tax is 55%, what is the probability that a referendum will be placed on the next ballot? (In other words, what is the probability that the sample proportion will reach or exceed 60%?)

4. Eighty percent of small retail businesses experience cash flow problems during their first year of operation. A consultant takes a random sample of 60 small businesses that have been in business for one year.

(a) What is the probability that 85% or more of the sampled businesses have experienced cash flow problems?

(b) What is the probability that less than 70% of the sampled businesses have experienced cash flow problems?

5. Historically, 92% of the overnight deliveries of a parcel service arrive the next morning before noon (considered “on time” delivery). Suppose 300 deliveries are randomly selected for the purposes of quality control.

(a) What is the probability that more than 90% of the deliveries were on time?

(b) What is the probability that less than 85% of the deliveries were on time?

Exercises for Section 5.2

6. A random sample of 200 adult Texas residents is taken, and they are asked whether they approve of the death penalty. Suppose that 80 out of the 200 say they favor the death penalty.

(a) What is the estimate of the proportion of adult Texans who favor the death penalty?

(b) Calculate a standard error for your estimate.


7. A recent poll of 400 Kalamazoo residents resulted in a 60% approval rating for President Bush. Calculate a standard error for this estimate.

8. A sample of 100 individuals showed that 20% experienced gastrointestinal problems after consuming 10 grams of sorbitol, a common artificial sweetener. Attach a standard error to this estimate.


Chapter 6

Sampling Distribution of the Mean


6.1 Behavioral Properties of the Sample Average

What percentage of adult men are between 5’6” and 6’ tall? Population surveys have shown that men’s heights are approximately normally distributed with mean 5’9” and SD 3”. Thus the percentage of men between 5’6” and 6’ is estimated as 68%, the percentage within 1 SD of the mean. See Figure 6.1.

Figure 6.1: Percentage of men’s heights between 66 and 72 inches (68%)

If the population of men is randomly assigned into groups of 9, and the average heights are computed for each group, what percentage of groups average between 5’6” and 6’ in height? Is the answer approximately 68%? No. In fact, more than 99% of the groups will average between 5’6” and 6’, even though only 68% of individuals do. Why? Because averages tend to include tall, short and medium heights - therefore averages tend to fall closer to the middle than individuals. (Think about this: We put in a hat the names of all the men in class. You will win $20 if the names you draw average over 6 feet tall. Would you rather draw 1 or 2 names? What are your chances of winning if you draw 9 names?)

Figure 6.2 shows the histogram of averages, superimposed over the histogram for individuals.

The main result of this chapter explains the different look of the two histograms. We say it in words first:


Figure 6.2: Histogram of individual heights (solid line) and 9-men averages (dashed line)


If a population has a normal shaped histogram with mean µ and standard deviation σ, then n-member averages will have a normal shaped histogram with mean µ and standard deviation σ/√n.

Now we say it in statistical notation.

If we have a random sample X₁, X₂, . . . , Xₙ ∼ N(µ, σ), then X̄ ∼ N(µ, σ/√n).

The symbol “∼ N( · , · )” is read “having normal shaped histogram with mean equal to ... (whatever is written before the comma) and SD equal to ... (whatever is written after the comma).” If we were so inclined to compute average height for as many n-member samples as we can draw from the population, and if we took this big long list of averages and printed a histogram in Excel, the Main Result says we will observe 3 things: (i) this histogram of averages will look like a bell curve, (ii) with mean equal to the mean for individuals, (iii) and much smaller SD than the SD for individuals.

When we first talked about the population mean and SD, we did not mention (because it seems obvious) that these are the mean and SD for individuals. If male heights average 5’9” with SD 3”, then approximately 68% of individuals fall between 5’9” ± 3”. At this point, it may be helpful to think of individual heights as a 1-member average. If we took 9-member averages instead, approximately 68% will fall between 5’9” ± 1.0”, because the SD of 9-member averages equals the SD of individuals (or 1-member averages) divided by √9. Furthermore, approximately 68% of 25-member averages will fall between 5’9” plus or minus how many inches? Answer: 3/√25 = .6 inches.
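The effect of group size on the spread of averages can be checked with a short Python sketch. It assumes heights measured in inches (mean 69, SD 3) and uses `statistics.NormalDist` from the standard library; the function name `share_between` is made up for this illustration. For groups of 1, 9 and 25 it prints the SD of the average and the share of averages falling between 66 and 72 inches.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma = 69, 3    # men's heights in inches: 5'9" and 3"

def share_between(lo, hi, n):
    """Share of n-member averages falling between lo and hi inches."""
    d = NormalDist(mu, sigma / sqrt(n))
    return d.cdf(hi) - d.cdf(lo)

for n in (1, 9, 25):
    print(n, round(sigma / sqrt(n), 2), round(share_between(66, 72, n), 4))
```

For individuals (n = 1) the share is about 68%, while for 9-member averages it is above 99%, in line with the discussion earlier in this section.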

Example 1
Recall that men’s heights are normally distributed with mean 5’9” and SD 3”.
a. What percentage of men are over 5’11” tall?
b. Select a man at random. What is the probability that he is over 5’11” tall?
c. If the average height were calculated for all possible samples of size 9 that can be taken, what percentage of averages will exceed 5’11”?
d. Given one randomly selected sample of size 9, what is the probability that the average height will exceed 5’11”?
e. Given a sample of size 25, what is the probability that the average height will exceed 5’11”?
f. Given a sample of size 9, the average height of the sample will exceed ______ with probability .90.
g. 90% of samples of size 9 will have average height exceeding ______.


6.2 Drawing from a Nonnormal Population

The histogram of n-member averages looks like the normal curve when the histogram of individuals (i.e. the population histogram) looks like the normal curve. What if the population histogram is not normal shaped? Answer: The histogram of averages may still look like the normal curve depending on the sample size. Consider rolling a regular six-sided die. Rolling a die is like drawing from a large population consisting of the numbers 1, 2, 3, 4, 5 and 6 in equal proportion. Then the histogram of individuals looks rectangular (Figure 6.3).

Figure 6.3: Histogram of the outcome of rolls of a die

Randomly draw groups of 9 members each, as before, and take the average of each group. This is the same as taking the average of 9 rolls of the die. What will the histogram of the averages look like? The averages will tend to include as many 6’s and 5’s as there are 1’s and 2’s, wouldn’t they? Hence, the averages will tend to fall close to center, with central values more likely than outlying values. Check out the histogram of averages of 9 rolls, superimposed over the histogram for individuals in Figure 6.4.

Compare the histogram for averages of 9 rolls versus 25 rolls in Figure 6.5. The histogram for averages looks more and more like the normal curve as the number of rolls increases.

The behavior of the sample average in Figures 6.4 and 6.5 may be summarized in one of the better known theorems in statistics.


Figure 6.4: individuals (n = 1) versus averages (n = 9). Note the ‘centering’ or ‘central tendency’ of averages. (Horizontal axes: X from 1 to 6 for individuals; averages from 2.0 to 5.0.)

Figure 6.5: Histogram for averages (n = 9 versus n = 25). Note the smaller range for n = 25. (Horizontal axes: averages from 2.0 to 5.0 for n = 9; from 2.9 to 4.1 for n = 25.)


The Central Limit Theorem:
The histogram of averages tends to look like the normal curve even when the histogram of individuals does not (and gets closer to the normal curve as the sample size gets larger).

The mean of averages is the same as the mean of individuals.

The SD of averages equals the SD of individuals divided by √n.

The Central Limit Theorem is one theorem with three components. It is helpful to see how each component is reflected in the histograms in Figure 6.4: (i) the shape of the histogram of averages is approximately normal (even if the histogram of individuals is rectangular), (ii) both histograms are centered over the same value (i.e. the population mean), and (iii) the histogram of averages gets narrower as sample size increases.
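The three components can be checked with a small simulation. The following is a minimal sketch, assuming Python's standard library; the function name, the number of simulated groups, and the seed are illustrative choices, not part of the text.

```python
import random
import statistics

def simulate_roll_averages(n_rolls, n_groups=20000, seed=0):
    """Return n_groups averages, each taken over n_rolls rolls of a fair die."""
    rng = random.Random(seed)
    return [
        sum(rng.randint(1, 6) for _ in range(n_rolls)) / n_rolls
        for _ in range(n_groups)
    ]

# Population of single rolls: the values 1..6 in equal proportion.
pop_mean = statistics.fmean([1, 2, 3, 4, 5, 6])
pop_sd = statistics.pstdev([1, 2, 3, 4, 5, 6])

for n in (1, 9, 25):
    averages = simulate_roll_averages(n)
    print(
        f"n={n:2d}: mean of averages={statistics.fmean(averages):.3f}, "
        f"SD of averages={statistics.pstdev(averages):.3f}, "
        f"SD of individuals / sqrt(n)={pop_sd / n ** 0.5:.3f}"
    )
```

The mean of the averages should stay near 3.5 for every n, while the SD of the averages should track the SD of individuals divided by √n. Plotting a histogram of `averages` would show the shape component as well.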

Example 2. Suppose that gross weekly rentals for Campus Video historically average $1200 with a standard deviation of $200.

a. There are 16 weeks this current semester. The weekly rental for this semester should average around $____ give or take $____ or so.

b. Estimate the chance that the average weekly rental for this semester will fall below $1000.

c. Estimate the chance that this week’s rental will fall below $1000.

d. With 90% chance, the average weekly rental for this semester will exceed $____.
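One way to work through Example 2 is sketched below, assuming Python's standard library. The variable names are illustrative. Part (c) additionally assumes that individual weekly rentals are roughly normal, which the example does not state.

```python
from statistics import NormalDist

mean_weekly, sd_weekly, n_weeks = 1200, 200, 16

# SD of a 16-week average: SD of individuals divided by sqrt(n).
se = sd_weekly / n_weeks ** 0.5

# Approximate histogram of 16-week averages (Central Limit Theorem).
averages = NormalDist(mean_weekly, se)
# Histogram of single weeks, ASSUMED normal for part (c).
single_week = NormalDist(mean_weekly, sd_weekly)

p_avg_below_1000 = averages.cdf(1000)       # part (b)
p_week_below_1000 = single_week.cdf(1000)   # part (c)
cutoff_90 = averages.inv_cdf(0.10)          # part (d): exceeded with 90% chance

print(f"a. around ${mean_weekly} give or take ${se:.0f} or so")
print(f"b. chance the semester average is below $1000: {p_avg_below_1000:.5f}")
print(f"c. chance a single week is below $1000: {p_week_below_1000:.3f}")
print(f"d. with 90% chance the average exceeds ${cutoff_90:.0f}")
```

The contrast between (b) and (c) is the point of the example: $1000 is only one SD below the mean for a single week, but four SEs below the mean for a 16-week average.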

6.3 Estimating the Population Mean Using a Random Sample

Consider estimating the average GPA (call this µ) of the approximately 23,000 WMU undergraduates. In the absence of the complete database, we may wish to estimate µ by taking a random sample of, say, n = 25 students and computing the sample average (call this X). Suppose X = 3.05. Now, unless we got really lucky with the random sample, chances are µ is not equal to 3.05; i.e. the estimate missed. By how much?

On the average, a random variable misses its mean by roughly one standard deviation. From the previous section, the standard deviation of X is σ/√n, where σ is the population SD for individuals. Therefore, σ/√n is the expected size of the miss when X is used to estimate µ. Typically, σ/√n is called the standard error of estimation of the mean, or simply standard error of the mean. ‘Standard error’ will often be denoted by the abbreviation SE throughout this manuscript.

In the absence of the population database for 23,000 WMU students, we know neither µ nor σ. Thus, we cannot calculate σ/√n. In this situation, σ is typically replaced by the standard deviation S of the sample. In most applications, standard error of the mean refers to S/√n instead of σ/√n.


Returning to the GPA example, suppose that the sample of n = 25 students yielded an average GPA of 3.05 and a standard deviation of .40. Then the WMU population average GPA is estimated as 3.05 with a standard error of S/√n = .40/√25 = .08.

The population mean µ is estimated using the sample mean X. This estimate tends to miss by an amount called the standard error (SE) of the mean, which is calculated as S/√n.

Example 3. What is the average length of stay for undergraduate students at WMU?

a. Suppose 25 graduating students were randomly selected and asked about their length of stay. Suppose that the sample averaged 5.3 years, with an SD of 1.5 years. Then the WMU average stay is estimated as ____ years give or take ____ years or so.

b. A second sample of 100 students was interviewed. The mean and SD for the second sample were also 5.3 years and 1.5 years, respectively. Calculate an estimate for the WMU average stay and provide a standard error for your estimate.

Similar to the SE for proportions, the formula for the SE of the mean has the sample size (i) in the denominator, and (ii) inside the square root sign. Therefore, increasing the sample size by a factor of 4 makes the standard error decrease by a factor of √4 = 2.

The standard error of the mean shrinks in proportion to one over the square root of the sample size.
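The square-root effect can be seen with the numbers from Example 3. This is a minimal sketch in Python; the helper name `standard_error` is illustrative.

```python
import math

def standard_error(sample_sd, n):
    """Estimated standard error of the mean: S / sqrt(n)."""
    return sample_sd / math.sqrt(n)

se_25 = standard_error(1.5, 25)     # first sample: n = 25, SD = 1.5 years
se_100 = standard_error(1.5, 100)   # second sample: n = 100, same SD

print(f"n = 25:  5.3 years give or take {se_25:.2f} or so")
print(f"n = 100: 5.3 years give or take {se_100:.2f} or so")
print(f"ratio of standard errors: {se_25 / se_100:.1f}")
```

Quadrupling the sample size from 25 to 100 halves the standard error; it does not cut it to a quarter.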


6.4 Exercises

Exercises for Section 6.1

1. The national norm of a science test for tenth graders has a mean of 210 and a standard deviation of 28.

(a) What percentage scored over 220? (You may assume that the histogram of scores looks approximately like the normal curve.)

(b) One tenth grader is randomly selected. What is the chance that he/she scored over 220?

(c) A random sample of 40 tenth graders is selected. What is the chance that this group will average over 220? (Should this be smaller, larger, or approximately equal to (b)?)

(d) A larger sample of 100 tenth graders is selected. What is the probability that this group will average over 220? (Should this be smaller, larger, or approximately equal to (c)?)

2. The number of customers per week at each store of a supermarket chain has a mean of 4400 and standard deviation 500. Also, 85% of the stores sell lottery tickets.

(a) What is the probability that a randomly selected store will have fewer than 4200 customers in a week? (Assume that the number of customers per week is approximately normal.)

(b) If a random sample of 40 stores is selected, what is the probability that the sample will average below 4200 customers for the week?

(c) If a random sample of 40 stores is selected, what is the probability that fewer than 75% of the stores sell lottery tickets?

Exercises for Section 6.2

3. Suppose that healthy human body temperatures average 98.6°F, with a standard deviation of 0.95°F.

(a) At any one time, what percentage of healthy people have temperatures over 99°F? (You may NOT assume that the histogram of body temperatures looks approximately like the normal curve.)

(b) 25 healthy people are selected at random. What is the probability that their temperatures average 99°F or higher?

(c) There is an 80% chance that their temperatures average below ____°F.

4. The time spent per e-mail “session” reading and writing personal messages averages 15 minutes, with a standard deviation of 5 minutes.


(a) What proportion of e-mail sessions last 10 minutes or longer? (You may NOT assume that the histogram of session times looks approximately like the normal curve. It is right skewed.)

(b) Suppose that 30 e-mail sessions are monitored. What is the probability that the average length will exceed 10 minutes? (Will this probability be the same as (a)?)

(c) Suppose that 60 e-mail sessions are monitored. What is the probability that the average length will exceed 10 minutes? (Will your answer be smaller, larger, or approximately equal to (b)?)

Exercises for Section 6.3

5. Safe Skies Airline took a random sample of 25 flights to estimate the average time that arriving passengers wait for luggage at the carousel. The sample average was found to be 16.2 minutes with a standard deviation of 4 minutes. The population average waiting time is estimated as ____ minutes give or take ____ minutes or so.

6. Brazil was Latin America’s leading economic power in the 1970s. In the past, this country has also had one of the highest inflation rates in Latin America. In an attempt to regain investor confidence, the Brazilian government has increased interest rates and has worked to establish a sound fiscal policy to keep inflation under control. Based on a random sample of 60 food items selected in 2001, the inflation rate was estimated as 6% (the sample SD was 1.75%). Find a standard error for this 6% estimate.

7. A manufacturing company’s profits depend on the cost of materials. One material of interest is carbon fiber, which is used to make golf shafts and fishing rods. The cost per pound (in dollars) was recorded for ten randomly selected days from the first six months of 2002. The data follow:

7.6, 7.8, 8.8, 7.3, 6.6, 7.5, 6.7, 8.6, 7.4, 7.7

(a) Calculate an estimate for the average cost per pound during the first six months of 2002.

(b) Calculate a standard error for your estimate in (a).

8. The Federal Reserve Board made eight interest-rate cuts on its benchmark short-term rates during 2001. However, credit cards have averaged around 14%, with the lowest credit cards charging around 8%. A financial analyst believes that the average rate on credit cards that charge monthly fees is lower than the national average of all credit cards. A random sample of 29 different credit cards that charged monthly fees revealed a sample mean of 13.2% with a sample standard deviation of 1.4%.

(a) Calculate an estimate for the average rate of cards that charge monthly fees.


(b) Calculate a standard error for your estimate in (a).

(c) Consider the possibility that the true population average rate for cards that charge monthly fees is 14%, and luck-of-the-draw caused the sample to be below average. What is the chance of getting a sample average as low as 13.2% if indeed the population average is 14%?


Chapter 7

Confidence Intervals


7.1 Estimating the Population Mean Using Intervals

Problem revisited: Estimate the average GPA µ of the population of approximately 23,000 WMU undergraduates.

Consider estimating the population average µ by the sample average of, say, n = 25 randomly selected students. Suppose this sample average, which we denote by X, equals 3.05. Now chances are the true average µ is not equal to 3.05. For sure, the true WMU average GPA is between 1.00 and 4.00, and with high confidence between 2.50 and 3.50; but what level of confidence do we have that it is between, say, 2.75 and 3.25, or between 2.95 and 3.15? Even better, can we find an interval (a, b) which will contain µ with, say, 95% certainty?

Recall that if σ is the standard deviation for individuals, then σ/√n is the standard deviation for averages (also called the SE of the sample average). More precisely, n-member averages form an approximate normal histogram with mean µ and standard error σ/√n. This means that there is 68% likelihood that the sample average falls within σ/√n of µ, and 95% likelihood that the sample average falls within 2(σ/√n) of µ; see figure below.

Figure: Selected areas under the histogram for X. 68% of the area lies between µ − σ/√n and µ + σ/√n; 95% of the area lies between µ − 2σ/√n and µ + 2σ/√n.

In statistical shorthand, we write

$$P\big(X \text{ is inside the interval } \mu \pm 2(\sigma/\sqrt{n})\big) = .95$$

where P( · ) is read as “the probability that ( · )”. Of course, the distance of X from µ is also the distance of µ from X, so it is equally true that

$$P\big(\mu \text{ is inside the interval } X \pm 2(\sigma/\sqrt{n})\big) = .95. \qquad (7.1)$$


Recall the question posed earlier: “Can we find an interval which we know contains µ with 95% confidence?” Answer: X ± 2(σ/√n).

Example 1: Suppose that the SD of student GPA is .30. If a random sample of n = 25 students has average GPA equal to 3.05, then we are 95% confident that µ is contained within 3.05 ± 2(.30/√25) or (2.93, 3.17).

Replacing 2(σ/√n) by 1(σ/√n) in Equation (7.1) results in a 68% confidence interval. In general, we can create an interval with any desired confidence level by replacing the multiplier with the appropriate standard normal percentile z (also called the z-critical value).

$$P\big(\mu \text{ is inside the interval } X \pm z(\sigma/\sqrt{n})\big) = 1 - \alpha \qquad (7.2)$$

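where z is determined from 1 − α as follows: the area under the standard normal curve between −z and z equals 1 − α, leaving α/2 in each tail. (Figure: standard normal curve with central area 1 − α between −Z_{α/2} and Z_{α/2}.)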

The following table gives the appropriate critical value z for typical values of 1 − α.

1 − α    .68    .80    .90    .950   .954   .99
z        1.00   1.28   1.64   1.96   2.00   2.58      (7.3)

z-Confidence Interval:
An interval estimate for the population mean µ with confidence level 1 − α for known σ is given by

$$X \pm z(\sigma/\sqrt{n}) \qquad (7.4)$$
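The critical values in table (7.3) and the interval (7.4) can be reproduced as follows. This is a minimal sketch, assuming Python's standard library (`statistics.NormalDist`); the function names `z_critical` and `z_interval` are illustrative and not from the text.

```python
from statistics import NormalDist

def z_critical(confidence):
    """Standard normal value z with area `confidence` between -z and z."""
    return NormalDist().inv_cdf(0.5 + confidence / 2)

def z_interval(sample_mean, sigma, n, confidence=0.95):
    """z-confidence interval for the mean when sigma is known."""
    margin = z_critical(confidence) * sigma / n ** 0.5
    return sample_mean - margin, sample_mean + margin

for level in (0.68, 0.80, 0.90, 0.95, 0.954, 0.99):
    print(f"1 - alpha = {level:.3f}: z = {z_critical(level):.2f}")

# Example 1: sigma = .30, n = 25, sample mean 3.05.
low, high = z_interval(3.05, 0.30, 25, confidence=0.95)
print(f"95% interval for the mean GPA: ({low:.2f}, {high:.2f})")
```

The computed value for 68% comes out slightly below 1, so the table's 1.00 is best read as the conventional rounded multiplier. Levels that are not in the table, such as 75%, can be obtained the same way.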


Example 1 (cont.):
a. Construct a 99% confidence interval for WMU GPA.
b. Construct a 75% confidence interval for WMU GPA.
c. Construct a 95.4% confidence interval for WMU GPA.
d. Construct a 95.0% confidence interval for WMU GPA.

Theoretically speaking, X ± 2(σ/√n) has a confidence level of 95.4%. Applied statisticians usually just round off and say 95%. We can, of course, choose the more precise critical value of 1.96 to get an exact 95% confidence level. However, normality of X is at best approximate anyway, so the precision seems spurious. Still, the distinction is important enough to merit a reminder.

z = 1.96 versus z = 2.0 in Equation (7.4):
Using z = 1.96 in (7.4) gives a 95% confidence interval.
Using z = 2.00 in (7.4) gives a 95.4% confidence interval.

7.2 t-based Confidence Interval for the Mean

The 95% confidence interval provided by (7.4) is simple, but not very useful. Why? Because the population standard deviation σ is usually unknown (if we knew it, we would likely also know the population average µ, and have no need for an interval estimate).

In practical applications, we replace the population standard deviation σ in (7.2) by S, the standard deviation of the sample. However, this substitution changes the coverage probability 1 − α. Fortunately, there is a simple adjustment that allows us to maintain the desired coverage level 1 − α: replace the normal distribution critical value z by the slightly larger t-distribution critical value t. The resulting confidence interval is the primary result of this section.

t-Confidence Interval:
When the population standard deviation σ is not known, an interval estimate for the population average µ with confidence level 1 − α is given by

$$X \pm t(S/\sqrt{n}) \qquad (7.5)$$

where t is a critical value determined from the t distribution with n − 1 degrees of freedom in such a way that there is area (1 − α) between −t and t.


Figure: t curve with n − 1 degrees of freedom, with central area 1 − α between −t_{α/2, n−1} and t_{α/2, n−1}.

The value n − 1 is called degrees of freedom, or df for short. It is a parameter of the t-curve in the sense that changing the value of n − 1 changes the shape of the t-curve, though usually not by much. Here are appropriate t critical values for selected 1 − α and n − 1.

1 − α           .80    .90    .95    .99
t: n−1 = 5      1.48   2.02   2.57   4.03
t: n−1 = 15     1.34   1.75   2.13   2.95
t: n−1 = 25     1.32   1.71   2.06   2.79
t: n−1 = 35     1.31   1.69   2.03   2.72
t: n−1 = 50     1.30   1.68   2.01   2.68
t: n−1 = 100    1.29   1.66   1.98   2.63
t: n−1 = 500    1.28   1.65   1.96   2.58

The t critical values are always larger than the corresponding z values, and get progressively closer to them as n − 1 gets larger (they are equal in the limit n = ∞). For a 95% confidence interval, the t values are 2.06, 2.03, 2.01, 1.98, and 1.96 for respective sample sizes n = 26, 36, 51, 101, and 501.

Recall that the term S/√n in equation (7.5) is the (estimated) standard error of the mean. With .68 chance, X misses µ by less than this amount. To generalize, X misses µ by less than t(S/√n) with 1 − α certainty. Thus, the term t(S/√n) is called the margin of error with confidence level 1 − α.

If 1 − α = .95, then t is close to 2.0. For this reason, the 95% margin of error is often written as 2(S/√n). When working with a random sample, the exact critical value t is read from a table or calculator, and depends on the sample size. However, for sample size calculations (see next section), the approximate critical value 2.0 is typically used.

Example: Given the following GPA for 6 students: 2.80, 3.20, 3.75, 3.10, 2.95, 3.40

a. Calculate a 95% confidence interval for the population mean GPA.

b. If the confidence level is increased from 95% to 99%, will the length of the confidence interval increase, decrease, or remain the same?

c. If the confidence level is kept at 95% but the sample size is quadrupled to n = 24,

(i) do you expect the sample mean X to increase, decrease, or remain approximately the same?

(ii) do you expect the sample SD S to increase, decrease, or remain approximately the same?

(iii) do you expect the length of the confidence interval to increase, decrease, or remain approximately the same?
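Part (a) of the example can be worked as sketched below, assuming Python's standard library. The critical value 2.57 is read from the table above (95% confidence, n − 1 = 5); the variable names are illustrative.

```python
import statistics

gpas = [2.80, 3.20, 3.75, 3.10, 2.95, 3.40]
t_crit = 2.57  # table value for 1 - alpha = .95 and n - 1 = 5

n = len(gpas)
mean = statistics.fmean(gpas)
sd = statistics.stdev(gpas)   # sample SD S (divides by n - 1)
se = sd / n ** 0.5            # estimated standard error S / sqrt(n)
margin = t_crit * se          # margin of error t * (S / sqrt(n))

print(f"mean = {mean:.2f}, S = {sd:.3f}, SE = {se:.3f}")
print(f"95% t-interval: ({mean - margin:.2f}, {mean + margin:.2f})")
```

For part (b), replacing `t_crit` with the 99% table value 4.03 shows directly how the interval length responds to the confidence level.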

We end this section with a reminder that the confidence interval (7.5) works well ONLY when at least one of the following conditions holds: (i) the population histogram looks like the normal curve, or (ii) the sample size is relatively large (n ≥ 30 is a frequently used rule of thumb). The following table summarizes the performance of the t-intervals under four situations.

                                Normal curve    Not normal curve
Small sample size [n < 30]      Good            Poor
Larger sample sizes [n ≥ 30]    Good            Fair

7.3 Determining Sample Size for Estimating the Mean

Suppose we want to estimate the average GPA µ of WMU undergraduates this school year. Historically, the SD of student GPA is known to be σ = .30. If a random sample of size n = 25 yields a sample mean of X = 3.05, then the population mean µ is estimated as lying within the interval X ± (1.96)(σ/√n) = 3.05 ± (1.96)(.30/√25) = 3.05 ± .12 with 95% confidence. The plus-or-minus quantity .12 is called the margin of error of the sample mean associated with a 95% confidence level. It is also correct to say “we are 95% confident that µ is within .12 of the sample mean 3.05”. However, .12 is a large number. The value 3.05 ± .12 can be as low as 2.93 or as high as 3.17. Is there some way to reduce the margin of error? Yes, and quite easily. Look at the formula for the margin of error: (1.96)(σ/√n). Observe that if sample size increases, the margin of error decreases. Let M denote the desired margin of error. Solving M = (1.96)(σ/√n) for n gives the following:


In order to be 95% confident that the sample mean is within a distance M of the true mean µ, choose a sample size equal to

$$n = (1.96)^2 \sigma^2 / M^2.$$

The quantity M is the desired size of the 95% margin of error.

Example: Suppose we want to reduce the margin of error for estimating mean GPA from .12 to .05. The sample size we need is n = (1.96)²(.30)²/(.05)² ≈ 138 (the formula gives 138.3; rounding up to 139 keeps the margin at or below .05). To verify that this is the correct sample size, the 95% confidence interval would be computed (if the sample mean remains at 3.05) as 3.05 ± (1.96)(.30/√138) = 3.05 ± .05.
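The sample size formula can be sketched in a few lines of Python. The function name `sample_size` is illustrative, and rounding up with `math.ceil` is a choice made here (an assumption, not something the text prescribes) so that the achieved margin does not exceed the target.

```python
import math

def sample_size(sigma, margin, z=1.96):
    """Smallest whole n for which z * sigma / sqrt(n) is at most `margin`."""
    return math.ceil((z * sigma / margin) ** 2)

raw_n = (1.96 * 0.30 / 0.05) ** 2   # value of the formula before rounding
n = sample_size(0.30, 0.05)

print(f"formula value: {raw_n:.1f}, rounded up: {n}")
for size in (138, n):
    achieved = 1.96 * 0.30 / math.sqrt(size)
    print(f"n = {size}: achieved 95% margin of error = {achieved:.4f}")
```

The formula value is a little over 138, so 138 and 139 both give a margin of about .05; only the rounded-up value stays strictly at or below it.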

Note that the population SD σ is needed in the sample size formula. This typically unknown value may be estimated (i) from historical data, or (ii) from a pilot sample (like the initial sample of size 25 mentioned in the example above). The SD from the pilot sample is then substituted into the sample size formula and the required sample size is calculated.

If a confidence level other than 95% is desired, we may simply replace z = 1.96 by the appropriate z critical value from the table

1 − α    .68    .80    .90    .950   .954   .99
z        1.00   1.28   1.64   1.96   2.00   2.58

In order to be (1 − α)100% confident that the sample mean is within a distance M of the true mean µ, choose a sample size equal to

$$n = z^2 \sigma^2 / M^2.$$

7.4 Comparing the Averages of Two Independent Samples

Is there “grade inflation” at WMU? How does the average GPA of WMU students today compare with, say, 10 years ago? Suppose a random sample of 100 student records from 10 years ago yields a sample average GPA of 2.90 with a standard deviation of .40. A random sample of 100 current students yields a sample average of 2.98 with a standard deviation of .45. The difference between the two sample means is 2.98 − 2.90 = .08. Is this proof that GPA’s are higher today than 10 years ago? Well... first we need to account for the fact that 2.98 and 2.90 are not the true averages, but are computed from random samples. Therefore, .08 is not the true difference, but simply an estimate of the true difference. Can this estimate miss by much? Fortunately, statistics has a way of measuring the expected size of the “miss” (or error of estimation). For our example, it is .06 (we show how to calculate this later). Therefore, we can state the bottom line of the study as follows: “The average GPA of WMU students today is .08 higher than 10 years ago, give or take .06 or so.”

We now show how to calculate the .06, the standard error of the estimate. But first, a note on terminology. The estimate .08 = 2.98 − 2.90 is a difference between averages (or means) of two independent random samples. “Independent” refers to the sampling luck-of-the-draw: the luck of the second sample is unaffected by the first sample. In other words, there were two independent chances to have gotten lucky or unlucky with the sampling. The likely size of the error of estimation in the .08 is called the standard error of the difference between independent means. We calculate it using the following formula:

$$\text{SE of } (X_1 - X_2) = \sqrt{SE_1^2 + SE_2^2} \qquad (7.6)$$

where SE1 = S1/√n1 and SE2 = S2/√n2. Note that SE1 and SE2 are the SE’s of X1 and X2, respectively.

The formula looks easier without the notation and the subscripts. 2.98 is a sample mean, and has standard error .45/√100 = .045 (since SE = S/√n). Similarly, 2.90 is a sample mean and has standard error .40/√100 = .040.

Summarizing, we write the two mean estimates and their respective SE’s as follows:

X1 = 2.98; SE = .045
X2 = 2.90; SE = .040

If two independent estimates are subtracted, the formula (7.6) shows how to compute the SE of the difference:

$$X_1 - X_2 = 2.98 - 2.90 = .08; \qquad SE = \sqrt{(.045)^2 + (.040)^2} \approx .06.$$

The difference in GPA’s is estimated as .08, give or take .06 or so, written in statistical shorthand as .08 ± .06.

Remember the Pythagorean Theorem in geometry? Think of the two SE’s as the lengths of the two legs of a right triangle (call them a and b). The SE of the difference then equals the length of the hypotenuse (SE of difference = √(a² + b²)).
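The Pythagorean picture translates directly into code. This is a minimal sketch in Python, with illustrative function names; `math.hypot` computes the hypotenuse √(a² + b²).

```python
import math

def se_mean(sd, n):
    """Standard error of one sample mean: S / sqrt(n)."""
    return sd / math.sqrt(n)

def se_difference(se1, se2):
    """SE of the difference between two independent means, formula (7.6)."""
    return math.hypot(se1, se2)

se_today = se_mean(0.45, 100)   # current students
se_past = se_mean(0.40, 100)    # students from 10 years ago
se_diff = se_difference(se_today, se_past)

print(f"SE1 = {se_today:.3f}, SE2 = {se_past:.3f}, SE of difference = {se_diff:.3f}")
```

Note that the SE of the difference is larger than either individual SE but smaller than their sum, just as a hypotenuse is longer than either leg but shorter than the two legs combined.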

We are now ready to state a confidence interval for the difference between two independent means.

z-Confidence Interval for µ1 − µ2 (Independent Samples)
An interval estimate for the difference µ1 − µ2 with confidence level 1 − α is given by

$$(X_1 - X_2) \pm z \sqrt{SE_1^2 + SE_2^2} \qquad (7.7)$$

where SE1 = S1/√n1 and SE2 = S2/√n2. The confidence interval works best when both n1 and n2 are large, preferably at least 30.


The correct z critical value for a 95% confidence interval is z = 1.96. Therefore a 95% z-confidence interval for µ1 − µ2 is

$$(2.98 - 2.90) \pm 1.96 \sqrt{(.045)^2 + (.040)^2}$$

or (-.04, .20).

There is a second procedure that is preferable when either n1 or n2 or both are small. However, this method needs additional requirements to be satisfied (at least approximately):

Requirement R1: Both samples follow a normal-shaped histogram.
Requirement R2: The population SD’s σ1 and σ2 are equal.

Let Sp denote a “pooled” estimate of the common SD, as follows:

$$S_p = \sqrt{\frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}}$$

The following confidence interval is called a “Pooled SD” or “Pooled Variance” confidence interval.

t-Confidence Interval for µ1 − µ2 with Pooled SD (for Independent Samples with Equal SD)
A pooled SD interval estimate for the difference in means µ1 − µ2 with confidence level 1 − α is given by

$$(X_1 - X_2) \pm t \sqrt{SE_1^2 + SE_2^2} \qquad (7.8)$$

where SE1 = Sp/√n1, SE2 = Sp/√n2, and t is chosen similar to equation (7.5) except df = n1 + n2 − 2. The confidence interval works best when both requirements R1 and R2 are satisfied.

Returning to the grade inflation example, the pooled SD is

$$S_p = \sqrt{\frac{(100 - 1)(.45^2) + (100 - 1)(.40^2)}{100 + 100 - 2}} = .426.$$

Therefore, SE1 = .426/√100 = .0426, SE2 = .426/√100 = .0426, and the difference between means is estimated as

$$(2.98 - 2.90) \pm t \sqrt{(.0426)^2 + (.0426)^2}$$

where the square root term is the standard error. For a 95% confidence interval, the appropriate value from the t curve with 198 degrees of freedom is approximately 1.97. Therefore a t-confidence interval for µ1 − µ2 with confidence level .95 is

$$(2.98 - 2.90) \pm 1.97 \sqrt{(.0426)^2 + (.0426)^2}$$

or (-.04, .20).

Note that the t-confidence interval (7.8) with pooled SD looks like the z-confidence interval (7.7), except that S1 and S2 are replaced by Sp, and z is replaced by t. We present a summary of the situations under which each method is recommended.

                             R1 and R2 are both satisfied    R1 or R2 or both not satisfied
Both samples are large       Use z or t                      Use z
One or both samples small    Use t                           Consult a statistician
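For the grade inflation data, the z-interval (7.7) and the pooled-SD t-interval (7.8) can be compared side by side. The following is a minimal sketch in Python with illustrative function names. The t critical value for 198 degrees of freedom is taken here as roughly 1.97, which lies between the table rows for n − 1 = 100 and n − 1 = 500; that value is an assumption of this sketch.

```python
import math

def pooled_sd(s1, n1, s2, n2):
    """Pooled estimate Sp of a common SD from two independent samples."""
    return math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))

def diff_interval(mean1, se1, mean2, se2, critical):
    """(mean1 - mean2) +/- critical * sqrt(se1^2 + se2^2)."""
    diff = mean1 - mean2
    margin = critical * math.hypot(se1, se2)
    return diff - margin, diff + margin

n1 = n2 = 100

# z-interval (7.7): each SE uses its own sample SD.
z_low, z_high = diff_interval(
    2.98, 0.45 / math.sqrt(n1), 2.90, 0.40 / math.sqrt(n2), critical=1.96
)

# Pooled-SD t-interval (7.8): both SEs use Sp, and df = n1 + n2 - 2 = 198.
sp = pooled_sd(0.45, n1, 0.40, n2)
t_low, t_high = diff_interval(
    2.98, sp / math.sqrt(n1), 2.90, sp / math.sqrt(n2), critical=1.97
)

print(f"z-interval:        ({z_low:.2f}, {z_high:.2f})")
print(f"pooled SD Sp:      {sp:.3f}")
print(f"pooled t-interval: ({t_low:.2f}, {t_high:.2f})")
```

With two large samples of equal size, the two procedures agree to two decimal places, which is consistent with the "Use z or t" entry in the table.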

7.5 Comparing Averages of Two Paired Samples

To assess the stock performance of various sectors of industry for the year 2002, we look at stock prices for selected companies. Table 7.1 gives stock prices for 6 food service companies and 5 computer industry companies in January of 2002 and 2003.

Table 7.1: Stock Prices of Selected Food and Computer Companies

                 Jan     Jan                          Jan     Jan
Company          2002    2003     Company             2002    2003
Coca Cola        42.79   40.22    Microsoft           31.75   23.65
McDonalds        26.81   14.24    Dell                27.49   23.86
Kraft            36.34   31.68    Oracle              17.26   12.03
Dole             27.32   32.47    Cisco Systems       31.45   12.95
Starbucks        23.77   22.72    Hewlett Packard     21.59   17.32
Wendy’s          29.94   27.09
-----------------------------     --------------------------------
Average          31.16   28.07    Average             25.91   17.96
SD                7.10    8.95    SD                   6.34    5.65

There are two types of mean comparisons allowed by this type of data:

Independent Samples: If we compare food company versus computer company prices (for example 31.16 versus 25.91), this is called a comparison of means between independent samples. The confidence interval procedure for µ1 − µ2 is given by either (7.7) or (7.8).

Paired Samples: If we compare Jan. 2002 prices versus Jan. 2003 prices (for example 31.16 versus 28.07), this is called a comparison of means between paired samples. The confidence interval for µ1 − µ2 is discussed below.


The best way to tell whether you have independent or paired samples is to ask: do I have “two samples” (independent) or “one sample measured twice” (i.e. paired)? The statistical treatment is different because paired samples involve sampling luck just once. This gives paired samples better statistical properties than independent samples. For independent samples, luck-of-the-draw occurs two times.

The correct procedure for estimating the difference between means of paired columns is as follows: subtract the two measurements, and use a one-sample confidence interval (7.4) or (7.5) on the differences.

For the food service companies, the calculations are given in Table 7.2:

Table 7.2: Stock Prices of Selected Food Companies

               Jan     Jan
Company        2002    2003    Difference
               (A)     (B)     (A-B)
Coca Cola      42.79   40.22    2.57
McDonalds      26.81   14.24   12.57
Kraft          36.34   31.68    4.66
Dole           27.32   32.47   -5.15
Starbucks      23.77   22.72    1.05
Wendy’s        29.94   27.09    2.85
-----------------------------------
Average        31.16   28.07    3.09
SD              7.10    8.95    5.74

Using (7.5), a 95% confidence interval for the difference in means is given by

$$3.09 \pm 2.57\,(5.74/\sqrt{6})$$

or (-2.93, 9.11).
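The paired calculation can be reproduced from the two price columns of Table 7.2. This is a minimal sketch, assuming Python's standard library; the critical value 2.57 is the 95% table value for n − 1 = 5, and the variable names are illustrative.

```python
import statistics

jan_2002 = [42.79, 26.81, 36.34, 27.32, 23.77, 29.94]
jan_2003 = [40.22, 14.24, 31.68, 32.47, 22.72, 27.09]
t_crit = 2.57  # 95% confidence, n - 1 = 5 degrees of freedom

# Paired samples: reduce the two columns to one column of differences (A - B).
diffs = [a - b for a, b in zip(jan_2002, jan_2003)]

n = len(diffs)
mean_diff = statistics.fmean(diffs)
sd_diff = statistics.stdev(diffs)
margin = t_crit * sd_diff / n ** 0.5

low, high = mean_diff - margin, mean_diff + margin
print(f"average difference = {mean_diff:.2f}, SD of differences = {sd_diff:.2f}")
print(f"95% paired t-interval: ({low:.2f}, {high:.2f})")
```

Once the differences are formed, the rest is the ordinary one-sample t-interval applied to a single column.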

t-Confidence Interval for µ1 − µ2 (for Paired Samples)
An interval estimate for the difference µ1 − µ2 with confidence level 1 − α is given by

$$\text{diff} \pm t\,(S_{\text{diff}}/\sqrt{n}) \qquad (7.9)$$

where diff and Sdiff are the average and SD of the difference column, n is the length of the difference column, and t is the 1 − α critical value for t with n − 1 degrees of freedom.


For the independent sample comparison of food versus computer companies (for 2002), the calculation is as follows:

$$S_p = \sqrt{\frac{(6 - 1)(7.10^2) + (5 - 1)(6.34^2)}{6 + 5 - 2}} = 6.77$$

Therefore, SE1 = 6.77/√6 = 2.76, SE2 = 6.77/√5 = 3.03, and the difference between means is estimated as

$$(31.16 - 25.91) \pm t \sqrt{(2.76)^2 + (3.03)^2}$$

where the square root term is the standard error. For a 95% confidence interval, the critical value from the t curve with 6 + 5 − 2 = 9 degrees of freedom is 2.26. Therefore a t-confidence interval for µ1 − µ2 with confidence level .95 is

$$(31.16 - 25.91) \pm 2.26 \sqrt{(2.76)^2 + (3.03)^2}$$

or (-4.01, 14.51).
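The same comparison can be run from the raw January 2002 prices in Table 7.1. This is a minimal sketch, assuming Python's standard library, with illustrative variable names and the critical value 2.26 taken from the text. Because it carries full precision instead of rounding the SD's and SE's to two decimals, the endpoints may differ from the hand calculation in the last digit.

```python
import math
import statistics

food_2002 = [42.79, 26.81, 36.34, 27.32, 23.77, 29.94]
computer_2002 = [31.75, 27.49, 17.26, 31.45, 21.59]
t_crit = 2.26  # 95% confidence, 6 + 5 - 2 = 9 degrees of freedom

n1, n2 = len(food_2002), len(computer_2002)
s1, s2 = statistics.stdev(food_2002), statistics.stdev(computer_2002)

# Pooled SD, then the SE of the difference with SE1 = Sp/sqrt(n1), SE2 = Sp/sqrt(n2).
sp = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
se_diff = math.hypot(sp / math.sqrt(n1), sp / math.sqrt(n2))

diff = statistics.fmean(food_2002) - statistics.fmean(computer_2002)
low, high = diff - t_crit * se_diff, diff + t_crit * se_diff

print(f"S1 = {s1:.2f}, S2 = {s2:.2f}, Sp = {sp:.2f}")
print(f"difference = {diff:.2f}, SE = {se_diff:.2f}")
print(f"95% pooled t-interval: ({low:.2f}, {high:.2f})")
```

The interval is much wider than the paired interval for the food companies, reflecting both the small samples and the two separate sources of sampling luck.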

7.6 Estimating the Population Proportion Using Intervals

Consider estimating the proportion of WMU undergraduates whose permanent residence is in Southwest Michigan. We may take a random sample of, say, n = 25 students and compute the sample proportion p̂. If 9 out of the 25 students come from Southwest Michigan, then p̂ = 9/25 = .36. Because of the “luck of the draw”, chances are the true proportion p is not exactly .36, but some number close to .36. Can we find an interval which we know contains p with, say, at least 95% confidence?

Recall that the sample proportion p̂ over different groups of 25 has a histogram that would look approximately normal with mean p and SD √(p(1 − p))/√n. This means that there is 95% likelihood that p̂ falls within (1.96)√(p(1 − p))/√n of p. Of course, this also means there is 95% likelihood that p falls within (1.96)√(p(1 − p))/√n of p̂. In statistical shorthand,

$$P\big(p \text{ is inside the interval } \hat{p} \pm (1.96)\sqrt{p(1 - p)}/\sqrt{n}\big) = .95$$

Recall the question posed earlier: “Can we find an interval which we know contains p with 95%confidence?” Answer: p̂ ± (1.96)

p(1 − p)/√

n. In applications, we replace p inside the squarerootsign by p̂. For our example, the 95% confidence interval for p is .36 ± (1.96)

.36(1 − .36)/√

25or .36 ± .19. Furthermore, we can create an interval with any desired confidence level by replacingthe critical value z=1.96 with the appropriate standard normal percentile z.

Confidence Interval for p:
An interval estimate for the population proportion p with confidence level 1 − α is given by

\[ \hat{p} \pm z\sqrt{\hat{p}(1-\hat{p})}/\sqrt{n} \qquad (7.10) \]

where the critical value z is determined from the normal curve.
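A minimal Python sketch of (7.10), applied to the 9-out-of-25 sample discussed above. The function name `proportion_interval` is an illustrative assumption; the default z = 1.96 corresponds to 95% confidence.

```python
import math

def proportion_interval(x, n, z=1.96):
    """Return (p_hat, margin of error) for x successes out of n."""
    p_hat = x / n
    # sqrt(p_hat(1 - p_hat)) / sqrt(n), written as a single square root
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, margin

p_hat, margin = proportion_interval(9, 25)
print(round(p_hat, 2), round(margin, 2))                    # estimate and margin
print(round(p_hat - margin, 2), round(p_hat + margin, 2))   # interval limits
```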


[Figure: standard normal curve with central area 1 − α between the critical values −Z(α/2) and Z(α/2).]

The following table gives the appropriate critical value z for typical values of 1 − α.

| 1 − α | .68 | .80 | .90 | .95 | .99 |
|---|---|---|---|---|---|
| z | 1.00 | 1.28 | 1.64 | 1.96 | 2.58 |
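For confidence levels not listed in the table, the critical value can be read from the standard normal curve directly. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper name `z_critical` is an illustrative assumption.

```python
from statistics import NormalDist

def z_critical(confidence):
    """Critical value z leaving (1 - confidence)/2 in each tail of the normal curve."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)

for level in (0.80, 0.90, 0.95, 0.99):
    print(level, round(z_critical(level), 2))
```

Values computed this way may differ slightly from a rounded table entry; for example, the .68 entry of 1.00 is a convenient approximation.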

Example:

1. Suppose that 9 out of 25 randomly selected students live in Southwest Michigan.

a. Construct a 99% confidence interval for the true proportion of WMU students who live in Southwest Michigan.

b. Construct a 75% confidence interval for the true proportion of WMU students who live in Southwest Michigan.

2. Suppose that 36 out of 100 randomly selected students live in Southwest Michigan.

a. Construct a 99% confidence interval for the true proportion of WMU students who live in Southwest Michigan.

b. Construct a 75% confidence interval for the true proportion of WMU students who live in Southwest Michigan.

7.7 Sample Size for Estimating the Population Proportion

If 9 out of 25 randomly selected WMU students live in Southwest Michigan, the 95% confidence interval for the true proportion is \(\hat{p} \pm (1.96)\sqrt{\hat{p}(1-\hat{p})}/\sqrt{n} = .36 \pm .19\). This result says that the true proportion can be as low as .17 or as high as .55. If we wanted to reduce the margin of error from .19 to some value M, then we set the formula for margin of error equal to M, i.e. \(M = (1.96)\sqrt{\hat{p}(1-\hat{p})}/\sqrt{n}\). Solving for n gives the result we need.

In order to be 95% confident that the sample proportion is within a distance M of the true proportion p, choose a sample size equal to

\[ n = (1.96)^2\,\hat{p}(1-\hat{p})/M^2 \]

where p̂ is an estimate based on historical data or a pilot sample. The quantity M is called the 95% margin of error for p.

Example: Suppose we want to reduce the margin of error for estimating the population proportion from .19 to .10. Using the estimate p̂ = .36 based on the initial sample, the sample size we need is \(n = (1.96)^2(.36)(.64)/(.10)^2 = 88.5\), which we round up to 89. To verify that this is the correct sample size, the 95% confidence interval would be computed (if the sample proportion remained at .36) as \(.36 \pm (1.96)\sqrt{(.36)(.64)}/\sqrt{89} = .36 \pm .10\).

If a confidence level other than 95% is desired, replace z = 1.96 by the appropriate z critical value.

In order to be (1 − α)100% confident that the sample proportion is within a distance M of the true proportion p, choose a sample size equal to

\[ n = z^2\,\hat{p}(1-\hat{p})/M^2 \]
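The sample size formula can be sketched in a few lines of Python. The function name is an illustrative assumption, and rounding up to the next whole number is used because a fractional sample size is not possible.

```python
import math

def sample_size_for_proportion(p_hat, margin, z=1.96):
    """Smallest whole n with z * sqrt(p_hat(1 - p_hat)/n) <= margin."""
    return math.ceil(z**2 * p_hat * (1 - p_hat) / margin**2)

# Pilot estimate .36 and target margin .10, as in the example above.
n_needed = sample_size_for_proportion(0.36, 0.10)
print(n_needed)
```

Passing a different `z` (for instance 1.64 for 90% confidence) gives the sample size for another confidence level.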

7.8 Confidence Interval for the Difference Between Two Proportions

Has retention rate at WMU been changing? Suppose that a random sample of 200 entering students in 1989 showed 74% were still enrolled 3 years later. Another random sample of 200 entering students in 1999 showed that 66% were still enrolled 3 years later. This constitutes an 8% change in 3-year retention rate. However, the 8% difference is based on random sampling, and is only an estimate of the true difference. What is the likely size of the error of estimation?

The calculation of the standard error for the difference in proportions parallels the calculation for a difference in means.

\[ \text{SE of } (\hat{p}_1 - \hat{p}_2) = \sqrt{SE_1^2 + SE_2^2} \qquad (7.11) \]


where SE1 and SE2 are the SE’s of p̂1 and p̂2, respectively. For the retention rates, let p̂1 = .74 with standard error \(SE_1 = \sqrt{(.74)(1-.74)}/\sqrt{200} = .031\) and p̂2 = .66 with standard error \(SE_2 = \sqrt{(.66)(1-.66)}/\sqrt{200} = .033\). Then the difference .74 − .66 = .08 will have standard error \(\sqrt{.031^2 + .033^2} = .045\).

We now state a confidence interval for the difference between two proportions.

Confidence Interval for the Difference Between Proportions (for Independent Samples)
An interval estimate for the difference in proportions p1 − p2 with confidence level 1 − α is given by

\[ (\hat{p}_1 - \hat{p}_2) \pm z\sqrt{SE_1^2 + SE_2^2} \qquad (7.12) \]

where SE1 and SE2 are the standard errors of p̂1 and p̂2, respectively.

The SE for the .08 change in retention rates is .045, so the .08 estimate is likely to be off by some amount close to .045. However, the 95% margin of error is approximately 1.96 SE’s, or .09. A 95% confidence interval for the difference in proportions p1 − p2 is .08 ± .09 or (−.01, .17).

Converting to percentages, the difference between retention rates for 1989 and 1999 is 8% with a 95% margin of error of 9%. A 95% confidence interval for the true difference is (−1%, 17%).
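The retention-rate calculation can be checked with the following Python sketch of (7.11) and (7.12). The function names are illustrative assumptions. Since the code keeps full precision in the two standard errors, its combined SE may differ from the hand value of .045 in the third decimal place.

```python
import math

def se_proportion(p_hat, n):
    """Standard error of a single sample proportion."""
    return math.sqrt(p_hat * (1 - p_hat) / n)

def diff_proportion_interval(p1, n1, p2, n2, z=1.96):
    """Return (difference, SE of difference, lower limit, upper limit)."""
    se = math.sqrt(se_proportion(p1, n1)**2 + se_proportion(p2, n2)**2)
    diff = p1 - p2
    return diff, se, diff - z * se, diff + z * se

# 1989 sample: 74% of 200; 1999 sample: 66% of 200.
diff, se, low, high = diff_proportion_interval(0.74, 200, 0.66, 200)
print(round(diff, 2), round(se, 3), round(low, 2), round(high, 2))
```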

7.9 Overview of Confidence Intervals

All the confidence intervals follow a pattern:

ESTIMATE ± (CRITICAL VALUE)*(SE of ESTIMATE)

The following summary table lists the ESTIMATES and corresponding SE’s. The CRITICAL VALUE is simply a multiplier that fine tunes the length of the confidence interval according to the desired level of confidence. The exact value of the critical value is read from the t or z curve (but roughly speaking, a value close to 1 gives a 68% level of confidence while a value close to 2 gives a 95% level of confidence).


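| ESTIMATE | SE OF ESTIMATE | CRITICAL VALUE |
|---|---|---|
| \(\bar{X}\) | \(S/\sqrt{n}\) | t (df = n − 1) |
| \(\bar{X}_1 - \bar{X}_2\) | \(\sqrt{SE_1^2 + SE_2^2}\) where \(SE_1 = S_1/\sqrt{n_1}\), \(SE_2 = S_2/\sqrt{n_2}\) | z |
| \(\bar{X}_1 - \bar{X}_2\) | \(\sqrt{SE_1^2 + SE_2^2}\) where \(SE_1 = S_p/\sqrt{n_1}\), \(SE_2 = S_p/\sqrt{n_2}\) | t (df = n1 + n2 − 2) |
| \(\hat{p}\) | \(\sqrt{\hat{p}(1-\hat{p})}/\sqrt{n}\) | z |
| \(\hat{p}_1 - \hat{p}_2\) | \(\sqrt{SE_1^2 + SE_2^2}\) where \(SE_1 = \sqrt{\hat{p}_1(1-\hat{p}_1)}/\sqrt{n_1}\), \(SE_2 = \sqrt{\hat{p}_2(1-\hat{p}_2)}/\sqrt{n_2}\) | z |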

7.10 Using the TI-84

7.10.1 Computing a Confidence Interval for µ

A. Computing (7.5) from a Data List:

1. [STAT]
2. Arrow right to [TESTS]
3. Arrow down to [TINTERVAL]. Press [ENTER].
4. Choose [DATA]. Press [ENTER].
5. Input the list name and the confidence level (must be a positive decimal). Leave [Freq] at 1.
6. Arrow down to [Calculate]. Then press [ENTER].

B. Computing (7.5) from X̄ and S

Follow Procedure A, but choose [STATS] in Step 4.

7.10.2 Computing a Confidence Interval for µ1 − µ2

A. Computing (7.8) from two data lists:

1. [STAT]
2. [TESTS]
3. [2-SampTInt]. [ENTER].
4. Choose [DATA]. [ENTER].
5. Input the list names and the confidence level (must be a positive decimal). Leave [Freq1] and [Freq2] at 1.
6. Choose [Pooled]: YES. [ENTER].
7. Arrow down to [Calculate]. [ENTER].


B. Computing (7.8) from X̄1, S1, X̄2 and S2

Follow Procedure A, but choose [STATS] in Step 4.

C. Computing the Paired-t confidence interval from two data lists

1. Suppose your data is in lists [L1] and [L2]. Take the differences and store in [L3] as follows: [STAT] [EDIT], scroll to header of [L3], [2nd][L1] - [2nd][L2] [ENTER].
2. Follow Section 7.10.1 using data in [L3].

7.10.3 Computing a Confidence Interval for p

1. [STAT]
2. [TESTS]
3. [1-PropZInt]. [ENTER].
4. Input x, n, confidence level
5. Arrow down to [Calculate]. [ENTER].

7.10.4 Computing a Confidence Interval for p1 − p2

1. [STAT]
2. [TESTS]
3. [2-PropZInt]. [ENTER].
4. Input x1, n1, x2, n2, confidence level
5. Arrow down to [Calculate]. [ENTER].


7.11 Exercises

Exercises for Sections 7.1 and 7.2

1. There are two basic forms of sleep: slow wave sleep (SWS) and rapid eye movement (REM) sleep. Infants spend about 50% of their sleep time in SWS and 50% in REM sleep. Adults below age 60 spend about 20% of their sleep time in REM and 80% in SWS sleep. In a study of sleep patterns, data was collected on 13 elderly males over age 60. The percentage of total sleep time spent in REM sleep is presented below.

21, 20, 22, 7, 9, 14, 23, 9, 10, 25, 15, 17, 11

(a) Calculate the sample average and standard deviation.

(b) Calculate the population average and standard deviation, if possible.

(c) Do you think the sample average equals the population average? If not, by how much do you expect it to miss?

(d) Calculate the standard error (SE) of the sample average.

(e) With 95% confidence, the estimate should miss the population average by less than ______.

(f) The population average should be between ______ and ______ with 95% confidence.

(g) If the level of confidence above is reduced to 90%, will the new interval be shorter or longer?

(h) Calculate a 90% confidence interval for the population average.

2. A total of twenty-five factory workers were asked how many vacation days they take a year. The average of the sample was 22.85 days, and the standard deviation of the sample was 5.80.

(a) Do you think the population average equals 22.85? If not, do you think the population average is higher or lower than 22.85 days? By how much do you expect it to miss?

(b) Calculate the standard error for the sample average.

(c) Construct a 95% confidence interval for the average number of vacation days per year taken by factory workers.

3. A new breakfast cereal Frosted Corn is test marketed for one month. The total sales for the first 9 quarters (3-month periods) indicate an average of $8350, with a standard deviation of $1840.


(a) Treating the first 9 quarters as a sample of size 9, do you think the sample average equals the ’population’ average of future sales? If not, can you calculate a value which estimates how much you expect it to miss?

(b) Is it reasonable to assume that the first 9 quarters is a random sample from, say, the first 40 quarters (a 10-year period)?

(c) Calculate a standard error for the sample average.

(d) Construct a 90% confidence interval for the ’population’ average of future quarterly sales.

Exercises for Section 7.3

4. A Michigan automobile insurance company would like to estimate the average size of an accident claim. A random sample of 20 accident claims yielded an average of $1600 with a standard deviation of $800.

(a) What is the standard error of the $1600 estimate?

(b) What is the 95% margin of error of the estimate?

(c) Estimate the true average accident claim with a 95% interval estimate.

(d) Suppose the confidence interval above is considered too wide. What sample size is necessary to reduce the width of the confidence interval to one-third of its original size?

(e) What sample size is necessary to reduce the 95% margin of error to $50?

5. In order to test a new production method, 15 employees were selected randomly to try the new method. The mean production rate for the sample was 80 parts per hour, and the standard deviation was calculated to be 10 parts per hour.

(a) Give a 95% confidence interval for the true mean production rate for the new method.

(b) Suppose the confidence interval in (a) is considered too wide. What sample size is necessary to reduce the width of the confidence interval to one half of its original size?

(c) What sample size is necessary to reduce the standard error of the sample mean to 1.5 parts per hour?

(d) What is the 95% margin of error of the estimate?

(e) What sample size is necessary to reduce the 95% margin of error to 1.5 parts per hour?

6. Safe Lumber Company hires an independent contractor to monitor the groundwater within a one mile radius of its factory to ensure that no toxic levels of arsenic are being released. The contractor takes a sample of groundwater from selected sites within the one mile radius. Safe Lumber will need to pay for an expensive clean-up if a confidence interval of the average arsenic level in groundwater lies completely above 50 parts per billion (for example, if the confidence interval is (75, 95)).


(a) If Safe Lumber wanted to avoid an expensive clean-up, should it prefer a 90% or 95% confidence interval?

(b) If Safe Lumber wanted to avoid an expensive clean-up, should it prefer a sample size of 9 or 25?

(c) Suppose that data from 9 sites had arsenic levels averaging 75 ppb with a sample SD of 30 ppb. Construct 90% and 95% confidence intervals for the average arsenic level.

(d) Suppose that data from 25 sites resulted in the same average of 75 ppb and SD of 30 ppb. Construct 90% and 95% confidence intervals for the average arsenic level.

(e) If Safe Lumber wanted to avoid an expensive clean-up, would it prefer to sample from high ground areas or low ground areas?

(f) If Safe Lumber wanted to avoid an expensive clean-up, would it prefer to sample closer or farther from the factory?

7. A public broadcasting station began a week-long on-air promotional telethon to get viewers to pledge money to help support the station. The first day yielded 55 callers with an average pledge of $48 and a standard deviation of $19.

(a) If these 55 callers are treated as a random sample from the population of callers yet to come, determine a 95% interval estimate for the average amount of all pledges.

(b) How many callers are needed before the population average pledge may be estimated to within $2 with 95% confidence?

8. In developing patient appointment schedules, a medical center desires to estimate the mean time that a staff member spends with each patient. How large a sample should be taken if the precision of the estimate is to be within ±2 minutes at a 95% level of confidence? Assume the population standard deviation is estimated to be around 8 minutes.

9. A “quarter barrel” keg of beer is supposed to contain 8 gallons. The quality inspector at a local brewing company wants to estimate the average volume in the kegs to within 1/16 gallon (i.e. 1/2 pint) with 95% confidence.

(a) Assuming the volume of beer in the kegs has standard deviation 1/4 gallons, what sample size is needed?

(b) If the standard deviation estimate is changed from 1/4 to 1/8 gallon, will the required sample size be bigger or smaller?

(c) If the margin of error is increased from 1/16 to 1/8 gallon, will the required sample size be bigger or smaller?

(d) If the confidence level is changed from 95% to 90%, will the required sample size be bigger or smaller?


Exercises for Section 7.4

10. A personnel agency is comparing salaries of uncertified accountants versus CPA’s. Among recent accountants that they placed, 4 were uncertified while 6 were CPA’s. The annual salaries (in dollars) are summarized in the following table:

| | Uncertified | CPA |
|---|---|---|
| Average | $48,500 | $56,600 |
| SD | $8,700 | $9,100 |
| Sample size | 4 | 6 |

(a) On the average, CPA certification is worth an estimated ______ dollars per year according to the data.

(b) Calculate a standard error for the estimate in (a).

(c) Calculate a 95% margin of error for the estimate in (a).

(d) Calculate a 95% confidence interval for the difference between average salaries of uncertified accountants and CPA’s.

(e) What assumptions did you need for your confidence interval?

11. Do credit cards with no annual fee charge higher interest rates than cards that have annual fees? Among 29 cards surveyed, 17 had no annual fees while 12 charged an annual fee. Among the cards with no annual fee, the average interest rate was 19% (SD=8%). Among cards with an annual fee, the average interest rate was 17% (SD=3%).

(a) On the average, cards with no annual fee charge an estimated ______ percent higher APR than cards with annual fees.

(b) Calculate a standard error for the estimate in (a).

(c) Calculate a 95% margin of error for the estimate in (a).

(d) Construct a 95% interval estimate for the difference in average interest rates.

(e) What assumptions did you need for your confidence interval?

Exercises for Section 7.5

12. A new gasoline additive is supposed to make gas burn more cleanly and increase gas mileage in the process. Consumer Protection Anonymous conducted a mileage test to confirm this. They took seven of their cars, filled each with regular gas, and drove it on I-94 until it was empty. They repeated the process using the same cars, but using the gas additive. The recorded gas mileage follows:


| Car | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| Without Additive | 22 | 15 | 18 | 28 | 12 | 25 | 18 |
| With Additive | 26 | 19 | 17 | 34 | 17 | 25 | 22 |

(a) For each of the 7 cars, calculate the mileage difference (With − Without) between the two fuel types.

(b) Calculate the average and SD for the differences in (a).

(c) Fill in the blanks: “On the average, the gas additive increases mileage by ______ mpg, give or take ______ mpg or so.” (Reminder: The SE, not the SD, goes in the second blank.)

(d) Calculate a 95% margin of error.

(e) Construct a 95% confidence interval for the ’population’ mileage effect of the additive.

13. Some stock market analysts have speculated that parts of West Michigan Telecom might be worth more than the whole. For example, the company’s communication systems in Ann Arbor and Detroit could be sold to other communications companies. Suppose that a stock market analyst chose 9 acquisition experts and asked each to predict the return (in percent) on investment in the company held to the year 2003 if (i) it does business as usual, or (ii) if it breaks up its communication systems and sells all its parts. Their predictions follow:

| Expert | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| Not Break Up | 12 | 21 | 8 | 20 | 16 | 5 | 18 | 21 | 10 |
| Break Up | 15 | 25 | 12 | 17 | 17 | 10 | 21 | 28 | 15 |

(a) On the average, experts believe breaking up the company will earn an estimated ______ percent higher return.

(b) Calculate a standard error for the estimate in (a).

(c) Calculate a 95% margin of error for the estimate in (a).

(d) Construct a 95% interval estimate for the difference in average predicted return.

(e) What assumptions did you need for your confidence interval above?

Exercises for Section 7.6

14. A survey conducted by USA Today showed that 118 of 250 investors owned some real estate.

(a) What proportion of the surveyed investors owned some real estate?

(b) Do you think the proportion for the population equals your estimate in (a)? If not, by how much do you expect it to miss?

(c) Calculate a standard error (SE) for your estimate in (a).


(d) Find two numbers that satisfy the following statement: “The population proportion should be between ______ and ______ with 95% confidence.”

(e) If the level of confidence above is reduced to 90%, will the new interval be shorter or longer?

(f) Calculate a 90% confidence interval for the population proportion.

15. A survey conducted by USA Today showed that 118 of 250 investors owned some real estate.

(a) What percentage of the surveyed investors owned some real estate?

(b) Do you think the percentage for the population equals your estimate in (a)? If not, by how much do you expect it to miss?

(c) Calculate a standard error (SE) for your estimate in (a).

(d) Find two numbers that satisfy the following statement: “The population percentage should be between ______ and ______ with 95% confidence.”

(e) If the level of confidence above is reduced to 90%, will the new interval be shorter or longer?

(f) Calculate a 90% confidence interval for the population percentage.

16. An appliance manufacturer offers maintenance contracts on its major appliances. A manager wants to know what fraction of buyers of the company’s convection ovens are also buying the maintenance contract with the oven. From a random sample of 120 sales slips, 31 of the oven buyers opted for the contract.

(a) The proportion of customers who buy the contract along with their oven is estimated as ______.

(b) Calculate a standard error for the estimate in (a).

(c) Calculate a 95% interval estimate for the true proportion of customers who buy the contract along with their oven.

Exercises for Section 7.7

17. Researchers who were concerned if doctors were consistently adjusting dosages for weight of elderly patients studied 2000 prescriptions. They found that for 600 of the prescriptions, the doctors failed to adjust the dosages.

(a) Doctors fail to adjust dosage for an estimated ______ percent of prescriptions.

(b) Calculate a standard error for the percentage in (a).

(c) Calculate a 95% margin of error for the percentage in (a).

(d) Calculate a 95% interval estimate for the true proportion.


(e) Calculate a 95% confidence interval for the true percentage of prescriptions where doctors fail to adjust dosages.

18. The Tourism Bureau of South Carolina plans to sample visitors at major beaches throughout the state to estimate the proportion of beach goers who are not South Carolina residents. Preliminary estimates reveal that 55% are not South Carolina residents. How large a sample must be taken to estimate the proportion of out-of-state visitors with a .03 margin of error? Use a 95% confidence level.

19. The production manager for a large city newspaper wants to determine the proportion of newspapers printed that have a defect, such as excessive ruboff, missing pages, duplicate pages, and so on. A random sample of 20 newspapers resulted in 2 containing some kind of defect.

(a) The percentage of defective newspapers is estimated as ______.

(b) What is the standard error of the above estimate? ______

(c) Construct a 95% confidence interval for the defect percentage.

(d) What is the 95% margin of error of the above estimate? ______

(e) How many newspapers need to be sampled to reduce the 95% margin of error to 5%?

(f) How many newspapers need to be sampled to have a 90% margin of error equal to 5%?

Exercises for Section 7.8

20. According to a survey by Town Plaza Hotels, 12% of 150 men who were travelling on business brought a friend or spouse. In contrast, 28% of 101 women in the survey who were travelling on business brought a friend or spouse.

(a) According to the data, there is an estimated difference of ______ between the percentage of men and women who brought a friend or spouse along on business travel.

(b) Calculate a standard error for the estimate in (a).

(c) Calculate a 95% margin of error for the estimate in (a).

(d) Construct a 95% interval estimate for the difference between the two percentages.

21. Borrowers Bank classifies credit cardholders as “prime” whenever they pay interest charges for at least 6 of the last 12 billing cycles. The table below cross classifies 2080 cardholders according to whether they are (i) prime or not, and (ii) below age 30 or not.

| | Below 30 | 30 and Over |
|---|---|---|
| Prime | 475 | 510 |
| Not Prime | 376 | 719 |


(a) Construct a 95% confidence interval for the percentage of prime cardholders among those who are less than 30 years old.

(b) Construct a 95% confidence interval for the percentage of prime cardholders among those who are 30 years or older.

(c) Construct a 95% confidence interval for the difference between the two percentages in (a) and (b).


Chapter 8

Testing Hypotheses


8.1 Test of Significance involving the Sample Average

A bowler is bragging that his average is “at least 180”. We observe him play three games; his scores are 125, 155, 140 (X̄ = 140, S = 15). Should we accept or reject his claim? We should reject it. Why? Because a sample average as low as 140 is unlikely from a 180 bowler. How unlikely? A 180 bowler will bowl a 3-game average of 140 or lower only 2 percent of the time. Is 2 percent of the time unlikely? In statistics, yes. 5 percent or less is called statistically significant.

The decision making process above is called a test of significance. Here is the way a statistical report would formally present the test, in numbered stages.

1. Hypotheses: H0 : µ ≥ 180 versus H1 : µ < 180

2. Test Statistic: \(t = \dfrac{\bar{X} - 180}{S/\sqrt{n}} = \dfrac{140 - 180}{15/\sqrt{3}} = -4.62\)

3. P-value: Presuming H0 is true, the likelihood of chance variation yielding a t-statistic as low as −4.62 is .02. (Calculation details later.)

4. Conclusion: Since P-value ≤ .05, the observed sample value X̄ = 140 is declared significantly unlikely under H0 : µ ≥ 180. Hence, we reject H0 and conclude H1 : µ < 180. The sample provides evidence to reject the bowler’s claim.

Here is a more detailed description of each component of the test of significance above.

1. The null and alternative hypotheses

H0 and H1 are called the null hypothesis and alternative hypothesis, respectively. The two hypotheses describe the two possibilities: the claim is true (µ ≥ 180), or the claim is false (µ < 180). Note that

(i) the two hypotheses are statements about the population
(ii) the two hypotheses are complementary; if one occurs the other does not
(iii) the hypothesis with the equal sign is the null hypothesis

A test of significance rejects (population statement) H0 and concludes H1 if the sample values are “significantly far from H0 and inside H1”. Hence, we reject H0 : µ ≥ 180 and conclude H1 : µ < 180 if X̄ is some significant distance below 180. How far below 180 is “significant”? The test statistic helps us determine where to draw the line in the sand.

2. The Test Statistic

For tests of hypotheses on µ, the t-test statistic is a ratio of the form

\[ t = \frac{\text{Observed Sample Mean} - H_0\ \text{Value}}{\text{SE of Sample Mean}} = \frac{\bar{X} - H_0\ \text{Value}}{S/\sqrt{n}}. \]


For the null hypothesis H0 : µ ≥ 180, the t-test statistic is

\[ t = \frac{\bar{X} - 180}{S/\sqrt{n}} \]

H0 will be rejected if and only if X̄ is some significant distance below 180, which happens if and only if t is some significant distance below 0. Based on the sample observed scores, the observed t value is

\[ t = \frac{\bar{X} - 180}{S/\sqrt{n}} = \frac{140 - 180}{15/\sqrt{3}} = -4.62 \]

“Is t = −4.62 significantly below 0?” To answer this, we will need the help of the t-curve with n − 1 degrees of freedom.

3. The P-value

Using the t curve with n − 1 = 2 degrees of freedom, the likelihood of chance variation resulting in a t-value as low as −4.62 is .02.

[Figure: t curve with the area to the left of t = −4.62 marked as the P-value = 0.02.]

Since this likelihood is less than .05 (the standard for statistical significance), we declare that “t = −4.62 is significantly below 0”, or that “X̄ = 140 is significantly below 180”, and reject H0 : µ ≥ 180.


In general, the P-value is the total area under the curve more extreme than t in support of H1. If t is “deep in H1 territory”, then the P-value is small. If P-value ≤ .05, we reject H0 with statistical significance. If P-value ≤ .01, we reject H0 with high statistical significance. If P-value is larger than .05, we accept H0.

4. Conclusion

If H0 is rejected, the conclusion is usually stated as ‘there is enough evidence to ...’ or ‘there are statistically significant differences...’. If H0 is accepted, the conclusion is usually stated as ‘there is not enough evidence to ...’, or ‘there are no statistically significant differences...’.

Since P-value = .02 in our example, we conclude that ‘the sample provides enough evidence to reject the bowler’s claim of a 180 average’. Or ‘his performance (X̄ = 140) was much lower than his claimed average (µ = 180), and the difference is statistically significant.’

Summary of the lower-tailed t-test for µ:

H0 : µ ≥ c versus H1 : µ < c

Test statistic: \(t = (\bar{X} - c)/(S/\sqrt{n})\)

P-value: Total area less than t (the direction of H1) under t-curve with n − 1 degrees of freedom. If t is significantly below 0 (the direction of H1), the P-value will be small.

Conclusion: If P-value ≤ .05, we reject H0 with statistical significance. If P-value ≤ .01, we reject H0 with high statistical significance. If P-value > .05, we do not reject H0.
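The bowler example can be retraced in Python as a check on the arithmetic. This sketch relies on a special case that is an addition here, not part of the text: with exactly 2 degrees of freedom, the area under the t curve to the left of t has the closed form 1/2 + t/(2√(2 + t²)). Other degrees of freedom need a table, a calculator, or numerical integration.

```python
import math
from statistics import mean, stdev

scores = [125, 155, 140]      # the three observed games
c = 180                       # H0 value claimed by the bowler
n = len(scores)

# t = (sample mean - c) / (S / sqrt(n))
t = (mean(scores) - c) / (stdev(scores) / math.sqrt(n))

# Lower-tail area for df = 2 only (closed form valid for 2 degrees of freedom).
p_value = 0.5 + t / (2 * math.sqrt(2 + t * t))

print(round(t, 2), round(p_value, 2))
print("reject H0" if p_value <= 0.05 else "do not reject H0")
```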

8.1.1 Upper-tailed t-test

Cheap N Fresh food stores advertise that their average checkout waiting time is 4 minutes or less. A random selection of shoppers at peak time yielded the following waiting times: 3.5, 5.0, 3.8, 4.5, 7.2, 5.0. Does this evidence disprove the store’s claim?

1. Hypotheses: H0 : µ ≤ 4.0 versus H1 : µ > 4.0

2. Test Statistic: \(t = \dfrac{\bar{X} - 4.0}{S/\sqrt{n}} = \dfrac{4.83 - 4.0}{1.31/\sqrt{6}} = 1.55\)

3. P-value: Presuming H0 is true, the likelihood of chance variation yielding a t-statistic higher than 1.55 (this is the H1 direction) is .09.


[Figure: t curve with the area to the right of the observed t marked as the P-value = 0.09.]

4. Conclusion: Since P-value > .05, we do not reject H0. The sample does not provide enough evidence to disprove the store’s claim.

Summary of the upper-tailed t-test for µ:

H0 : µ ≤ c versus H1 : µ > c

Test statistic: \(t = (\bar{X} - c)/(S/\sqrt{n})\)

P-value: Total area greater than t (the direction of H1) under t-curve with n − 1 degrees of freedom. If t is far enough above 0 (the direction of H1), the P-value will be small.

Conclusion: If P-value ≤ .05, we reject H0 with statistical significance. If P-value ≤ .01, we reject H0 with high statistical significance. If P-value > .05, we do not reject H0.

8.1.2 Two-tailed t-test

Past data have shown that if the payphones at the airport are emptied every 14 days, the coin collectors will be 70% full on the average. The phone company tries to schedule the collection visits at 70% full because money is lost if the phones get full and unusable, but visiting the phones too frequently is also an expense. The company keeps data on fill amounts during collection, in case they need to increase or decrease collection frequency. During the last visit, suppose that 5 phones were 50%, 40%, 70%, 75%, and 45% full, respectively. Do you think the frequency of visits needs to be changed, or is this just chance variation?


It is important to note that an action will be taken (i.e. increase or decrease visits) if the average fill shifts away from µ = 70% in either direction. Since we are looking for statistical evidence of an average shift in either direction, the appropriate hypotheses and test procedure are:

1. Hypotheses: H0 : µ = 70 versus H1 : µ ≠ 70

2. Test Statistic: t = (X̄ − 70)/(S/√n) = (56.0 − 70)/(15.57/√5) = −2.01

3. P-value: Presuming H0 is true, the likelihood of chance variation yielding a t-statistic more extreme than −2.01 on either side of 0 (since H1 direction is both high and low) is .11.

[Figure: t-curve with area 0.055 shaded below −2.01 and area 0.055 shaded above 2.01; P-value = 0.055 + 0.055 = 0.11]

4. Conclusion: Since P-value > .05, we do not reject H0. The sample does not provide enough evidence that the mean fill has shifted from 70%.
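For the two-tailed case the only change is that both tail areas are added. A minimal Python sketch of the payphone example follows; as before, t_upper_tail is a hypothetical helper (Simpson's rule on the t density), not a function from any calculator or package mentioned in this text.

```python
import math

def t_upper_tail(t, df, steps=2000):
    """Approximate area to the right of t under the t-curve with df d.f."""
    def density(x):
        log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
        c = math.exp(log_c) / math.sqrt(df * math.pi)
        return c * (1 + x * x / df) ** (-(df + 1) / 2)

    a = abs(t)
    h = a / steps
    total = density(0) + density(a)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * density(i * h)
    central = total * h / 3
    return 0.5 - central if t >= 0 else 0.5 + central

fills = [50, 40, 70, 75, 45]                  # percent full at the last visit
target = 70                                   # H0: mu = 70
n = len(fills)
xbar = sum(fills) / n
s = math.sqrt(sum((x - xbar) ** 2 for x in fills) / (n - 1))

t_stat = (xbar - target) / (s / math.sqrt(n))
one_tail = t_upper_tail(abs(t_stat), n - 1)   # area beyond |t| on one side
p_value = 2 * one_tail                        # both sides, since H1 is two-sided

print(f"t = {t_stat:.2f}, each tail = {one_tail:.3f}, P-value = {p_value:.2f}")
```

The two tail areas are equal because the t-curve is symmetric about 0, so doubling one tail gives the same result as adding the area below −|t| to the area above |t|.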

Summary of the two-tailed t-test for µ:

H0 : µ = c versus H1 : µ ≠ c

Test statistic: t = (X̄ − c)/(S/√n)

P-value: Total area greater than |t| and less than −|t| under the t-curve with n − 1 degrees of freedom. If t is far enough from 0 on either side (the direction of H1), the P-value will be small.

Conclusion: If P-value ≤ .05, we reject H0 with statistical significance. If P-value ≤ .01, we reject H0 with high statistical significance. If P-value > .05, we do not reject H0.


8.1.3 The t-test is not universally appropriate

There are situations where the t-test does not work well. The t-test tends to be appropriate whenever the sample size is reasonably large, say n ≥ 25. For sample sizes smaller than that, the t-test is appropriate only when the population histogram looks normal in the first place. When the sample size is small and the population histogram does not look normal, alternative tests called nonparametric tests are advised.

8.1.4 Generalizing Tests of Significance

Recall that confidence intervals follow a pattern

ESTIMATE ± CRITICAL VALUE * (SE of ESTIMATE) (8.1)

where ESTIMATE can be X̄, X̄1 − X̄2, p̂, or p̂1 − p̂2, depending on the type of data and problem at hand.

Test statistics also follow a general pattern

Test Statistic = (ESTIMATE − HYPOTHESIZED VALUE) / (SE of ESTIMATE). (8.2)

The ESTIMATEs that may be used in tests are listed in Table 8.1 along with their SEs and the correct curve from which to read P-values. In effect, we may replace X̄ in any of the previous tests by X̄1 − X̄2 or p̂ or p̂1 − p̂2. As long as we remember to use the corresponding SE and distribution curve, the test procedure is the same as before.

Table 8.1: Summary of Test Statistics

ESTIMATE      SE OF ESTIMATE                                          P-value CURVE
--------------------------------------------------------------------------------------
X̄             S/√n                                                    t (df = n − 1)

X̄1 − X̄2       √(SE1² + SE2²), where
                if SE1 = S1/√n1, SE2 = S2/√n2                         z
                if SE1 = Sp/√n1, SE2 = Sp/√n2                         t (df = n1 + n2 − 2)

p̂             √(p̂(1 − p̂))/√n                                          z

p̂1 − p̂2       √(SE1² + SE2²), where
                SE1 = √(p̂1(1 − p̂1))/√n1, SE2 = √(p̂2(1 − p̂2))/√n2      z


We describe once more the procedure for a lower-tailed test of hypotheses; but this time we use the generic term ESTIMATE instead of the sample mean X̄.

1. Hypotheses: H0 : Parameter ≥ Hypothesized Value versus H1 : Parameter < Hypothesized Value

2. Test Statistic = (ESTIMATE − HYPOTHESIZED VALUE) / (SE of ESTIMATE).

3. P-value: Presuming H0, the likelihood of chance producing a test statistic as low as or lower than observed in Step 2 is the area to its left under the P-value curve.

4. Conclusion: If P-value ≤ .05, the observed difference in Step 2 is statistically significant. Therefore, reject H0 and conclude that the sample provides enough evidence to prove H1.

For upper-tailed tests, we reject H0 if we observe values of the estimate larger than the hypothesized value. Therefore the P-value is the area to the right of the observed test statistic in Step 2.

For two-tailed tests, the P-value is the total area in both tails: the area beyond the observed test statistic in Step 2 plus the equal area beyond its mirror image on the other side of 0.
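The three tail rules differ only in which area is read off the curve. The sketch below illustrates this for the z curve using Python's standard library; the function names compute_statistic and z_p_value and the numbers plugged in are hypothetical, chosen only to show the pattern in (8.2).

```python
from statistics import NormalDist

def compute_statistic(estimate, hypothesized, se):
    """(ESTIMATE - HYPOTHESIZED VALUE) / (SE of ESTIMATE), as in (8.2)."""
    return (estimate - hypothesized) / se

def z_p_value(z, tail):
    """P-value read from the normal (z) curve for the given tail."""
    cdf = NormalDist().cdf
    if tail == "lower":                       # H1: parameter < hypothesized value
        return cdf(z)
    if tail == "upper":                       # H1: parameter > hypothesized value
        return 1 - cdf(z)
    if tail == "two":                         # H1: parameter != hypothesized value
        return 2 * (1 - cdf(abs(z)))
    raise ValueError("tail must be 'lower', 'upper' or 'two'")

# Hypothetical numbers, used only to show the three tail rules.
z = compute_statistic(estimate=47.0, hypothesized=50.0, se=2.0)

for tail in ("lower", "upper", "two"):
    print(f"{tail:>5}-tailed P-value: {z_p_value(z, tail):.4f}")
```

The same statistic gives a small P-value only when it falls in the direction of H1, which is why the hypotheses must be fixed before the data are examined.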

8.1.5 Examples

Example 1. Suppose that an ad for Western Michigan University claims that ‘a majority of WMU students come from outside the Southwest Michigan area’. A sample of 25 students revealed that only 9 out of 25, or 36%, come from outside Southwest Michigan. Can this just be chance error, or is this significant evidence that the advertisement is wrong?

Solution: The data consists of a sample percentage (or proportion); i.e. 36% of 25 students come from outside Southwest Michigan. Therefore, the ESTIMATE is either a sample percentage or a proportion (we will choose a proportion). At the end of the test, the conclusion will be one of two things: the data will provide evidence either for or against the claim ‘a majority of WMU students come from outside the Southwest Michigan area’. Therefore the competing hypotheses are p ≥ .50 versus p < .50. The hypothesis with the equal sign is the null hypothesis, so we have H0 : p ≥ .50.

1. Hypotheses: H0 : p ≥ .50 versus H1 : p < .50

2. Test Statistic: z = (ESTIMATE − HYPOTHESIZED VALUE)/(SE of ESTIMATE) = (p̂ − .50)/√(p̂(1 − p̂)/n) = (.36 − .50)/√(.36(1 − .36)/25) = −1.46

3. P-value: Presuming H0 is true, the likelihood of chance variation yielding a z-statistic as low as −1.46 is .07.


4. Conclusion: Since P-value > .05, the observed sample value p̂ = .36 is not significantly different from .50. Hence, we do not reject H0 : p ≥ .50. The sample does not provide enough evidence to prove the ad wrong.
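Example 1 can be verified in a few lines. This is a sketch using Python's standard library, with the SE computed from p̂ as in Table 8.1; the variable names are illustrative.

```python
import math
from statistics import NormalDist

x, n = 9, 25                                  # 9 of 25 students from outside SW Michigan
p_hat = x / n                                 # ESTIMATE
p0 = 0.50                                     # hypothesized value under H0: p >= .50

se = math.sqrt(p_hat * (1 - p_hat) / n)       # SE of ESTIMATE
z = (p_hat - p0) / se
p_value = NormalDist().cdf(z)                 # lower-tailed: area to the left of z

print(f"p-hat = {p_hat:.2f}, SE = {se:.3f}, z = {z:.2f}, P-value = {p_value:.2f}")
```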

Example 2. Is there ‘grade inflation’ at WMU? A random sample of 100 student records from 10 years ago yields a sample average GPA of 2.90 with a standard deviation of .40. A random sample of 100 current students today yields a sample average of 2.98 with a standard deviation of .45. Is the difference between 2.90 and 2.98 just chance variation, or a statistically significant increase?

Solution: The evidence consists of two sample averages. The ESTIMATE of interest is the difference: 2.98 − 2.90 = .08 (i.e. average today minus average 10 years ago). At the end of the test, the conclusion will be one of two things: the data either support or provide evidence against the hypothesis of ‘grade inflation’. Therefore the competing hypotheses are ‘average grade today is higher than 10 years ago’ or ‘not’. In terms of the population averages, we write these as µ2 − µ1 > 0, or µ2 − µ1 ≤ 0. The hypothesis with the equal sign is null, so we have H0 : µ2 − µ1 ≤ 0. The H0 value is 0.

1. Hypotheses: H0 : µ2 − µ1 ≤ 0 versus H1 : µ2 − µ1 > 0

2. Test Statistic: z = (ESTIMATE − HYPOTHESIZED VALUE)/(SE of ESTIMATE) = ((X̄2 − X̄1) − 0)/√(SE1² + SE2²) = ((2.98 − 2.90) − 0)/√((.040)² + (.045)²) = 1.33

3. P-value: Presuming H0 is true, a z-statistic as large as 1.33 results from chance variation 9% of the time (P-value = .09).

4. Conclusion: Since P-value > .05, the observed increase in average GPA is not statistically significant, and may be explained by chance variation. Do not reject H0.
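The unpooled z-test of Example 2 follows directly from the summary statistics. A minimal Python sketch, with sample 1 taken as the records from 10 years ago and sample 2 as the current students (the variable names are illustrative):

```python
import math
from statistics import NormalDist

n1, mean1, sd1 = 100, 2.90, 0.40              # sample 1: 10 years ago
n2, mean2, sd2 = 100, 2.98, 0.45              # sample 2: today

se1 = sd1 / math.sqrt(n1)                     # SE of each sample average
se2 = sd2 / math.sqrt(n2)
se_diff = math.sqrt(se1 ** 2 + se2 ** 2)      # SE of the difference

z = ((mean2 - mean1) - 0) / se_diff           # hypothesized difference is 0
p_value = 1 - NormalDist().cdf(z)             # upper-tailed: area to the right of z

print(f"SE1 = {se1:.3f}, SE2 = {se2:.3f}, z = {z:.2f}, P-value = {p_value:.2f}")
```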

If the sample sizes were small and the two variances were equal, a pooled-SD t-test would have been the more appropriate test. We illustrate the procedure here. The SDs for the two samples, S1 = .40 and S2 = .45, are pooled into one SD:

Sp = √[((n1 − 1)S1² + (n2 − 1)S2²)/(n1 + n2 − 2)] = √[((100 − 1)(.40²) + (100 − 1)(.45²))/(100 + 100 − 2)] = .426.

The SEs for the two averages are SE1 = .426/√100 = .0426 and SE2 = .426/√100 = .0426. Now we conduct the test.

1. Hypotheses: H0 : µ2 − µ1 ≤ 0 versus H1 : µ2 − µ1 > 0


2. Test Statistic: t = (ESTIMATE − HYPOTHESIZED VALUE)/(SE of ESTIMATE) = ((X̄2 − X̄1) − 0)/√(SE1² + SE2²) = ((2.98 − 2.90) − 0)/√((.0426)² + (.0426)²) = 1.33

3. P-value: Presuming H0 is true, a t-statistic with 198 degrees of freedom will by chance be as large as 1.33 around 9% of the time (P-value = .09).

4. Conclusion: Since P-value > .05, the observed increase in average GPA is not statistically significant, and may be explained by chance variation. Do not reject H0.
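The pooled-SD version can be sketched the same way. The code below is an illustration only: t_upper_tail is a hypothetical helper that approximates the t-curve area with Simpson's rule, and the inputs are the summary statistics from Example 2.

```python
import math

def t_upper_tail(t, df, steps=2000):
    """Approximate area to the right of t under the t-curve with df d.f."""
    def density(x):
        log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
        c = math.exp(log_c) / math.sqrt(df * math.pi)
        return c * (1 + x * x / df) ** (-(df + 1) / 2)

    a = abs(t)
    h = a / steps
    total = density(0) + density(a)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * density(i * h)
    central = total * h / 3
    return 0.5 - central if t >= 0 else 0.5 + central

n1, mean1, s1 = 100, 2.90, 0.40               # 10 years ago
n2, mean2, s2 = 100, 2.98, 0.45               # today

df = n1 + n2 - 2
sp = math.sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / df)   # pooled SD
se1 = sp / math.sqrt(n1)
se2 = sp / math.sqrt(n2)

t_stat = ((mean2 - mean1) - 0) / math.sqrt(se1 ** 2 + se2 ** 2)
p_value = t_upper_tail(t_stat, df)            # upper-tailed test

print(f"Sp = {sp:.3f}, t = {t_stat:.2f}, df = {df}, P-value = {p_value:.2f}")
```

With 198 degrees of freedom the t-curve is nearly identical to the z curve, so the pooled and unpooled P-values agree to two decimals here.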

Example 3. The paired t-test. Table 8.2 shows stock prices for six selected food service companies. Is there a significant difference between average Jan. 2002 and Jan. 2003 stock prices?

Table 8.2: Stock Prices of Selected Food Companies

               Jan      Jan
Company        2002     2003     Difference
               (A)      (B)      (A-B)
Coca Cola      42.79    40.22      2.57
McDonalds      26.81    14.24     12.57
Kraft          36.34    31.68      4.66
Dole           27.32    32.47     -5.15
Starbucks      23.77    22.72      1.05
Wendy’s        29.94    27.09      2.85
-------------------------------------------
Average        31.16    28.07      3.09
SD              7.10     8.95      5.74

Solution: At first glance, the evidence looks like it consists of two averages, 31.16 and 28.07. However, remember that these two averages are not independent, because the data consists of two measurements on only ONE sample, rather than two separate samples. (Remember the Paired-t confidence interval?) The correct analysis here takes the differences between the two measurements, as shown in Table 8.2. If 2002 and 2003 prices are significantly different, then the difference column will have an average that is significantly different from 0.

Ignoring the original two columns and focusing instead on the difference column, the evidence for a difference in prices lies in the value 3.09, which is a single average and may be represented as X̄. At the end of the test, the conclusion will be one of two things: the average difference is either significantly different from 0 or not (µ ≠ 0 or µ = 0). Suppose we want a test that can detect


either a positive or negative change in average stock prices, then we can choose to do a two-tailed test. Of course, a one-tailed test that detects change in only one direction can also be done.

1. Hypotheses: H0 : µ = 0 versus H1 : µ ≠ 0

2. Test Statistic: t = (ESTIMATE − HYPOTHESIZED VALUE)/(SE of ESTIMATE) = (X̄ − 0)/(S/√n) = (3.09 − 0)/(5.74/√6) = 1.32

3. P-value: By chance alone, a t-statistic with 5 d.f. may get as large as 1.32 on either side of 0 around 24% of the time (i.e. P-value = .244).

4. Conclusion: Since P-value > .05, the observed change in stock prices is not statistically significant, and may be explained by chance variation. Do not reject H0.
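The paired analysis amounts to a one-sample t-test on the difference column. A minimal Python sketch using the prices in Table 8.2 follows; t_upper_tail is again a hypothetical helper based on Simpson's rule.

```python
import math

def t_upper_tail(t, df, steps=2000):
    """Approximate area to the right of t under the t-curve with df d.f."""
    def density(x):
        log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
        c = math.exp(log_c) / math.sqrt(df * math.pi)
        return c * (1 + x * x / df) ** (-(df + 1) / 2)

    a = abs(t)
    h = a / steps
    total = density(0) + density(a)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * density(i * h)
    central = total * h / 3
    return 0.5 - central if t >= 0 else 0.5 + central

jan_2002 = [42.79, 26.81, 36.34, 27.32, 23.77, 29.94]   # column (A)
jan_2003 = [40.22, 14.24, 31.68, 32.47, 22.72, 27.09]   # column (B)

diffs = [a - b for a, b in zip(jan_2002, jan_2003)]     # column (A-B)
n = len(diffs)
dbar = sum(diffs) / n
s = math.sqrt(sum((d - dbar) ** 2 for d in diffs) / (n - 1))

t_stat = (dbar - 0) / (s / math.sqrt(n))
p_value = 2 * t_upper_tail(abs(t_stat), n - 1)          # two-tailed

print(f"mean diff = {dbar:.2f}, SD = {s:.2f}, t = {t_stat:.2f}, P-value = {p_value:.3f}")
```

Only the six differences enter the test; the SDs of the two original columns (7.10 and 8.95) play no role in the paired procedure.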

Example 4. Has the retention rate at WMU been changing? Suppose that a random sample of 200 entering students in 1989 showed 74% were still enrolled 3 years later. Another random sample of 200 entering students in 1999 showed that 66% were still enrolled 3 years later. Is this drop of 8% statistical evidence of a changing retention rate, or can it be just chance error (i.e. luck of the draw in the students selected for the samples)?

Solution:

The evidence consists of two sample proportions, or more precisely, the difference between two sample proportions (.74 − .66 = .08). At the end of the test, the conclusion will be one of two things: the data either support or provide evidence against ‘changing retention rates’. Since an increase or decrease will be interpreted as a change (this was the agreement at the start of the study), we will conduct a two-tailed test. The competing hypotheses are ‘change’ versus ‘no change’. In terms of the population, these hypotheses may be written as p1 − p2 ≠ 0 versus p1 − p2 = 0. The latter is assigned to H0 because it contains the equal sign.

1. Hypotheses: H0 : p1 − p2 = 0 versus H1 : p1 − p2 ≠ 0

2. Test Statistic: z = (ESTIMATE − HYPOTHESIZED VALUE)/(SE of ESTIMATE) = ((p̂1 − p̂2) − 0)/√(SE1² + SE2²) = ((.74 − .66) − 0)/√((.031)² + (.033)²) = 1.78

3. P-value: Presuming H0 is true, a z-statistic as extreme as 1.78 on either side of 0 results from chance variation 7.5% of the time (P-value = .075).

4. Conclusion: Since P-value > .05, the observed change in retention rate is not statistically significant, and may be explained by chance variation. Do not reject H0.


8.1.6 Type I and Type II Errors

A test of significance leads us to one of two conclusions: either reject H0 or accept H0.

Terminology:
If we reject H0 when it is correct, we are said to have committed a Type I error.
If we accept H0 when it is false, we are said to have committed a Type II error.

Recall the bowling example. Suppose that the bowler really does have an average that is 180 or higher, but ‘luck of the draw’ led us to observe unfortunately low scores 125, 155 and 140. Based on the data, we wrongly reject his claim H0 : µ ≥ 180. This wrong decision is called a Type I error. Since our test procedure rejects H0 only when the observed test statistic is deep in H1 territory (in the extreme 5% territory, in fact), then the test procedure commits Type I error with less than .05 probability.

On the other hand, if the bowler really does have an average that is lower than 180, but the scores we observe were not low enough for us to reject his claim H0 : µ ≥ 180, then we have committed a Type II error.

Since the likelihood of Type I error is required to be below .05, the likelihood of committing Type II error may be large. The bowler’s claim of H0 : µ ≥ 180 will be rejected only when the observed sample scores are extremely low (i.e. in the lowest 5% region). If we observed scores of 175, 178, 172, we cannot reject his claim, can we? Why not? Because a 180-bowler can very likely produce these scores! Thus we are forced to accept his claim, even though in truth he may only be a 175 bowler (in which case we have already committed Type II error).

Notes:
The likelihood of committing Type I error is less than .05.
The likelihood of committing Type II error may be quite large.

In scientific studies, the likelihood of Type II error is often reduced to desired levels by increasing the size of the observed sample. The required calculations are beyond the scope of this textbook.
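A small simulation can make the two error rates concrete. The sketch below rests on assumptions that are not in the text: game scores are taken to be normally distributed with a hypothetical SD of 15, and each test uses 3 games, as in the bowling example. It rejects H0 : µ ≥ 180 when t falls below −2.920, the lower 5% point of the t-curve with 2 degrees of freedom.

```python
import math
import random

random.seed(2160)                 # fixed seed so the run is repeatable
CRITICAL_T = -2.920               # lower 5% point of the t-curve with 2 d.f.
ASSUMED_SD = 15                   # hypothetical game-to-game SD (an assumption)

def rejection_rate(true_mean, trials=20000, n=3, claimed=180):
    """Fraction of simulated samples in which H0: mu >= claimed is rejected."""
    rejections = 0
    for _ in range(trials):
        scores = [random.gauss(true_mean, ASSUMED_SD) for _ in range(n)]
        xbar = sum(scores) / n
        s = math.sqrt(sum((x - xbar) ** 2 for x in scores) / (n - 1))
        t = (xbar - claimed) / (s / math.sqrt(n))
        if t < CRITICAL_T:
            rejections += 1
    return rejections / trials

# Type I error: the bowler really averages 180, yet H0 is rejected.
type1_rate = rejection_rate(true_mean=180)

# Type II error: the bowler really averages 175, yet H0 is not rejected.
type2_rate = 1 - rejection_rate(true_mean=175)

print(f"Type I error rate  ~ {type1_rate:.3f}")
print(f"Type II error rate ~ {type2_rate:.3f}")
```

Under these assumptions the Type I rate should sit near .05 while the Type II rate is far larger, which is the pattern described in the Notes above. Raising n in rejection_rate (with the matching critical value) is one way to see the Type II rate fall.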

8.2 Using the TI-84

8.2.1 Calculating a t-test for µ

1. [STAT]
2. Arrow right to [TESTS]
3. Arrow down to [T-TEST]. Press [ENTER].
4. Choose [DATA]. Press [ENTER].
5. Input the null hypothesized value µ0, list name or summary statistics, and the type of alternative hypothesis (two-tailed µ ≠ µ0, left-tailed µ < µ0, or right-tailed µ > µ0).
6. Arrow down to [Calculate] or [Draw]. Then press [ENTER].


8.2.2 Calculating a t-test for µ1 − µ2

1. [STAT]
2. [TESTS]
3. [2-SampTTest]. [ENTER].
4. Choose [DATA] or [STATS]. [ENTER].
5. Input the list names or summary statistics, and the type of alternative hypothesis (two-tailed µ1 ≠ µ2, left-tailed µ1 < µ2, or right-tailed µ1 > µ2).
6. Choose [Pooled]: YES or NO. [ENTER].
7. Arrow down to [Calculate] or [Draw]. [ENTER].

8.2.3 Calculating a Paired t-test for µ1 − µ2

1. Suppose your data is in lists [L1] and [L2]. Take the differences and store in [L3] as follows: [STAT] [EDIT], scroll to header of [L3], [2nd][L1] - [2nd][L2] [ENTER].
2. Follow Section 8.2.1 using data in [L3].

8.2.4 Calculating a z-test for p

1. [STAT]
2. [TESTS]
3. [1-PropZTest]. [ENTER].
4. Input the null hypothesized value p0, x, n, and the type of alternative hypothesis (two-tailed p ≠ p0, left-tailed p < p0, or right-tailed p > p0).
5. Arrow down to [Calculate] or [Draw]. [ENTER].

8.2.5 Calculating a z-test for p1 − p2

1. [STAT]
2. [TESTS]
3. [2-PropZTest]. [ENTER].
4. Input x1, n1, x2, n2 and the type of alternative hypothesis (two-tailed p1 ≠ p2, left-tailed p1 < p2, or right-tailed p1 > p2).
5. Arrow down to [Calculate] or [Draw]. [ENTER].


8.3 Using Excel

8.3.1 Comparing two means (independent samples)

1. Open the Microsoft Excel data file. If your data is still in a text file, refer to 1.6.2 on how to convert text files to Excel files. If Data Analysis is not already available for step 2, add it as follows: Click the Office button. Click Excel options. Select Add-ins. Choose Analysis ToolPak. Click ’OK.’

2. Select the ’Data’ tab, under Analysis click ’Data Analysis.’ Choose ’T-test: 2-sample assuming equal variances’ and click ’OK.’ A new window will appear.

3. Enter the appropriate cell ranges of your first variable on the Variable 1 range. Enter the appropriate cell ranges of your second variable on the Variable 2 range. If the first row contains variable names, click on ’labels in 1st row.’

4. Input the hypothesized mean difference (µ1 − µ2) and the desired alpha (α). Click ’OK.’

8.3.2 Comparing two means (paired samples)

1. Open the Microsoft Excel data file. If your data is still in a text file, refer to 1.6.2 on how to convert text files to Excel files.

2. If you do not have Data Analysis under the ’Data’ tab, refer to step 1 of 8.3.1. Select the ’Data’ tab, under Analysis click ’Data Analysis.’ Choose ’T-test: Paired two sample for means’ and click ’OK.’ A new window will appear.

3. Enter the appropriate cell ranges of your first variable on the Variable 1 range. Enter the appropriate cell ranges of your second variable on the Variable 2 range. If the first row contains variable names, click on ’labels in 1st row.’

4. Input the hypothesized mean difference (µ1 − µ2) and the desired alpha (α). Click ’OK.’


8.4 Exercises

1. There are two basic forms of sleep: slow wave sleep (SWS) and rapid eye movement (REM) sleep. Infants spend about 50% of their sleep time in SWS and 50% in REM sleep. Adults below age 60 spend about 20% of their sleep time in REM and 80% in SWS sleep. In a study of sleep patterns, data was collected on 13 elderly males over age 60. The percentage of total sleep time spent in REM sleep is presented below.

21, 20, 22, 7, 9, 14, 23, 9, 10, 25, 15, 17, 11

(a) Calculate the sample average and standard deviation.

(b) The sample average is how far below 20%?

(c) Calculate the standard error (SE) of the sample average.

(d) The sample average is how many SE’s below 20% ?

(e) Is the sample average significantly below 20%, or is it just chance variation?

(f) If you conduct a test of significance on the following hypothesis: ‘Does the data provide scientific evidence that elderly males spend less than 20% of their sleep time in REM?’, how would you write the null and alternative hypotheses?

(g) What is the test statistic you would use?

(h) What distribution curve would you use to compute the P-value?

(i) Calculate a (one-tailed) P-value for your test.

(j) What is the conclusion of your test?

2. Twenty-five factory workers were asked how many vacation days they take a year. The average of the sample was 22.85 days, and the standard deviation of the sample was 5.80.

(a) Calculate the standard error of the sample average.

(b) In the past, the company’s Human Resources Office has used an average of 18 vacation days per worker in order to make available man-hour and payroll predictions. Do you think this standard of 18 days needs to be changed? (Is the observed average of 22.85 significantly different from 18? Calculate a (one-tailed) P-value to measure statistical significance of the difference.)

3. A new breakfast cereal Frosted Corn is test marketed for one month. The total sales for the first 9 quarters (3-month periods) indicate an average of $8350, with a standard deviation of $1840.


(a) Treating the first 9 quarters as a sample of size 9, calculate a standard error for the average sales of $8350.

(b) Based on a cost-profit analysis, production of Frosted Corn will be discontinued if the sample average of the first 9 quarters falls below $9000 with statistical significance. Does the data indicate production will be stopped? (Calculate a one-tailed P-value to measure statistical significance.)

4. A Michigan automobile insurance company would like to estimate the average size of an accident claim for the fiscal year 2002. A random sample of 20 accident claims yielded an average of $1600 with a standard deviation of $800.

(a) What is the standard error of the $1600 estimate?

(b) The national average claim for the industry is $1500. Does the sample indicate that the Michigan average is different from the national average? (Conduct a two-tailed test of significance. What is the P-value of your test?)

5. In order to test a new production method, 15 employees were selected randomly to try the new method. The mean production rate for the sample was 80 parts per hour (with sample SD 10 parts per hour).

(a) Calculate a standard error for the sample mean production rate for the new method.

(b) The mean production rate under the old method is known to be 70 parts per hour. Is the new method rate significantly different, or is this just luck-of-the-draw on the employees sampled? (Calculate a two-tailed P-value for your test.)

6. Safe Lumber Company hires an independent contractor to monitor the groundwater within a one mile radius of its factory to ensure that no toxic levels of arsenic are being released. The contractor takes a sample of groundwater from selected sites within the one mile radius. Safe Lumber will need to pay for an expensive clean-up if the average arsenic level in groundwater is shown to be over 50 parts per billion with statistical significance (using a one-tailed test).

(a) Suppose that data from 9 sites had arsenic levels averaging 75 ppb with a sample SD of 30 ppb. Is this statistically significant proof that the average level is over 50 ppb?

(b) If Safe Lumber wanted to avoid an expensive clean-up, should it prefer a one-tailed or two-tailed test? (What is the P-value if a two-tailed test were done?)

(c) If Safe Lumber wanted to avoid an expensive clean-up, should it prefer a sample size of 9 or 25? (What is the P-value if the sample size was 25 instead of 9, with sample average and SD remaining the same?)

(d) If Safe Lumber wanted to avoid an expensive clean-up, would it prefer to sample from high ground areas or low ground areas?


(e) If Safe Lumber wanted to avoid an expensive clean-up, would it prefer to sample closer or farther from the factory?

7. A personnel agency is comparing salaries of uncertified accountants versus CPAs. Among recent accountants that they placed, 4 were uncertified while 6 were CPAs. The annual salaries (in dollars) are summarized in the following table:

               Uncertified    CPA
Average:       $48,500        $56,600
SD:            $8,700         $9,100
Sample size:   4              6

(a) Are CPA salaries significantly higher than uncertified, or can this be just luck-of-the-draw? (Conduct a one-tailed test of significance using a pooled-SD t-test. What is the calculated P-value of the test?)

(b) Suppose that the sample sizes were 40 and 60 instead of 4 and 6 (sample averages and SDs the same). Will your P-value in (a) get smaller or larger? (What is the size of the new P-value?)

(c) Suppose that you conduct a one-tailed unpooled-SD z-test for the sample sizes 4 and 6. Do you expect the P-value to be smaller, larger, or approximately equal to the P-value in (a)? Which P-value is more correct?

(d) Suppose that you conduct an unpooled-SD z-test for the sample sizes 40 and 60. Do you expect the P-value to be smaller, larger, or approximately equal to the P-value in (b)? Which P-value is more correct?

8. Do credit cards with no annual fee charge higher interest rates than cards that have annual fees? Among 29 cards surveyed, 17 had no annual fees while 12 charged an annual fee. Among the cards with no annual fee, the average interest rate was 19% (SD=8%). Among cards with an annual fee, the average interest rate was 17% (SD=3%).

(a) Are interest rates significantly higher for cards with no annual fees? (Conduct a one-tailed test of significance using a pooled-SD t-test. What is the calculated P-value of the test?)

(b) Conduct a one-tailed unpooled-SD z-test. Do you expect the P-value to be smaller, larger, or approximately equal to the P-value in (a)? Which P-value is more correct?

9. A new gasoline additive is supposed to make gas burn more cleanly and increase gas mileage in the process. Consumer Protection Anonymous conducted a mileage test to confirm this. They took seven of their cars, filled each with regular gas, and drove it on I-94 until it was empty. They repeated the process using the same cars, but using the gas additive. The recorded gas mileage follows:


Car                 1   2   3   4   5   6   7
Without Additive   22  15  18  28  12  25  18
With Additive      26  19  17  34  17  25  22

(a) For each of the 7 cars, calculate the mileage difference (With − Without) between the two fuel types.

(b) Calculate the average and SD for the differences in (a).

(c) Does the data support the claim of higher gas mileage? (Conduct a one-tailed test of significance. What is the P-value of your test?)

(d) Suppose that the data were quadrupled, but the data just repeats itself four times (i.e. (22, 26), . . ., (18, 22) each appears four times). Then the new sample size is 28. What is the new average and SD for the differences? What is the new P-value for the one-tailed test of significance?

(e) Suppose that a t-test for independent samples were used on the original data with 7 cars. Do you expect the P-value to be smaller, larger, or approximately equal to the paired-t P-value? Which P-value is more correct?

(f) Eyeball the following idealized data and state any pattern that you see. (Without any statistical testing, what is the obvious conclusion about the effectiveness of the gas additive?) Now compare the P-values for the paired-t test versus the t-test for independent samples. Which P-value is more correct?

Car                 1   2   3   4   5   6   7
Without Additive   22  15  18  28  12  25  18
With Additive      26  19  22  32  16  29  22

10. Some stock market analysts have speculated that parts of West Michigan Telecom might be worth more than the whole. For example, the company’s communication systems in Ann Arbor and Detroit could be sold to other communications companies. Suppose that a stock market analyst chose 9 acquisition experts and asked each to predict the return (in percent) on investment in the company held to the year 2003 if (i) it does business as usual, or (ii) it breaks up its communication systems and sells all its parts. Their predictions follow:

Expert          1   2   3   4   5   6   7   8   9
Not Break Up   12  21   8  20  16   5  18  21  10
Break Up       15  25  12  17  17  10  21  28  15

(a) On the average, do experts believe breaking up the company will earn a higher return?(Conduct a one-tailed test of significance. What is the P-value of your test?)


(b) Suppose that a test for independent samples is used. Do you expect the P-value to be smaller, larger, or approximately equal to the P-value in (a)? Which P-value is more correct?

11. A survey conducted by USA Today showed that 118 of 250 investors owned some real estate.

(a) What proportion of the surveyed investors owned some real estate?

(b) Calculate a standard error (SE) for your estimate in (a).

(c) Conduct a test of significance for H0 : p = .5 versus H1 : p ≠ .5. What is the P-value of your test?

(d) Suppose a larger survey showed 1180 out of 2500 investors owned some real estate. What is the new sample proportion and SE? What is the new P-value for the test of significance above?

12. A survey conducted by USA Today showed that 118 of 250 investors owned some real estate.

(a) What percentage of the surveyed investors owned some real estate?

(b) Calculate a standard error (SE) for your estimate in (a).

(c) Conduct a test of significance for H0 : p = .5 versus H1 : p ≠ .5 using the sample percentage in the test statistic. What is the P-value of your test?

13. An appliance manufacturer offers maintenance contracts on its major appliances. A manager wants to know what fraction of buyers of the company’s convection ovens are also buying the maintenance contract with the oven. From a random sample of 120 sales slips, 31 of the oven buyers opted for the contract.

(a) The proportion of customers who buy the contract along with their oven is estimated as ____.

(b) Calculate a standard error for the estimate in (a).

(c) Test the hypothesis H0 : p = .3 versus the alternative H1 : p ≠ .3. What is the P-value of your test?

14. Researchers concerned about whether doctors were consistently adjusting dosages for the weight of elderly patients studied 2000 prescriptions. They found that for 600 of the prescriptions, the doctors failed to adjust the dosages.

(a) Doctors fail to adjust dosage for an estimated ____ percent of prescriptions.

(b) Calculate a standard error for the percentage in (a).

(c) Test the hypothesis H0 : p ≥ 1/3 versus the alternative H1 : p < 1/3. What is the P-value of your test?


15. A believer in horoscopes is willing to put his beliefs to a test. You will present him each weekwith three horoscopes for that week, one of them being his own. At the end of each week heis to pick out the horoscope best describing his experiences.

(a) Suppose he made 12 correct identifications out of 26 during the first 6 months. Shouldyou concede that his faith has some justification, or can this be explained by guessing?(Use a one-tailed test.)

(b) During the next six months, he again got 12 out of 26 correct (he now has 24 out of 52 total). Now, should you concede that his faith has some justification?

16. According to a survey by Town Plaza Hotels, 12% of 150 men who were travelling on business brought a friend or spouse. In contrast, 28% of 101 women in the survey who were travelling on business brought a friend or spouse.

(a) According to the data, there is an estimated difference of ______ between the percentages of men and women who brought a friend or spouse along on business travel.

(b) Calculate a standard error for the estimate in (a).

(c) Conduct a test of significance for the difference between the two percentages. (Use a two-tailed test.)

17. Borrowers Bank classifies credit cardholders as “prime” whenever they pay interest charges for at least 6 of the last 12 billing cycles. The table below cross-classifies 2080 cardholders according to whether they are (i) prime or not, and (ii) below age 30 or not.

             Below 30   30 and Over
Prime           475         510
Not Prime       376         719

(a) The percentage of prime cardholders among those who are less than 30 years old is estimated as ______% give or take ______% or so.

(b) The percentage of prime cardholders among those who are 30 years or older is estimated as ______% give or take ______% or so.

(c) Are the two percentages in (a) and (b) significantly different? (Use a two-tailed test.)

Chapter 9

Testing Equality of Frequencies

9.1 Testing for Independence between Two Variables Using an r × c Table

Is there an association between gender and height? Yes, males tend to be taller than females. A more formal way of saying this is ‘height distribution for males tends to be different from females’. Is there an association between shoe size and height? Yes, ‘height distribution for men who wear size 12 is different from those who wear size 8’. Is there an association between GPA and height? No, ‘height distribution tends to be the same for 3.0 students as well as 3.5 students.’

Two variables A and B are said to be associated if the distribution of B tends to change with the level of the A variable. Otherwise, they are said to be independent.

Therefore height is associated with gender and shoe size, but independent of GPA.

Now consider the following 3 × 4 table. A total of 189 students entering a business school program were followed as part of an attrition (i.e., dropout or transfer) study. The students were cross-classified according to 4 categories of high school GPA (2.0-2.5, 2.5-3.0, 3.0-3.5, 3.5-4.0) and 3 categories of attrition outcomes (‘did not return for 2nd year’, ‘returned for 2nd but not for 3rd year’, ‘returned for 3rd year’). Is there an association between HS GPA and college attrition?

Table 9.1: Retention versus HS GPA

                      2.0-2.5   2.5-3.0   3.0-3.5   3.5-4.0
No 2nd Year              25         3         4         6
No 3rd Year              14         7         4         6
Return for 3rd Year      41         7        28        44

To analyze whether attrition and GPA are independent, we will analyze whether the attrition distribution remains the same regardless of GPA level. Here is how a statistical report would present the analysis, in numbered stages.

1. Hypotheses

H0 : The two variables are independent versus H1 : The two variables are associated.

2. Test Statistic

If the two variables are independent, the observed frequencies should be distributed like these expected frequencies (calculations later):

                      2.0-2.5   2.5-3.0   3.0-3.5   3.5-4.0
No 2nd Year            16.08      3.42      7.24     11.26
No 3rd Year            13.12      2.79      5.90      9.18
Return for 3rd Year    50.80     10.80     22.86     35.56

The chi-square test statistic measures whether the observed and expected frequencies are close:

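$$
\begin{aligned}
\chi^2 ={}& \frac{(25-16.08)^2}{16.08} + \frac{(3-3.42)^2}{3.42} + \frac{(4-7.24)^2}{7.24} + \frac{(6-11.26)^2}{11.26} \\
&+ \frac{(14-13.12)^2}{13.12} + \frac{(7-2.79)^2}{2.79} + \frac{(4-5.90)^2}{5.90} + \frac{(6-9.18)^2}{9.18} \\
&+ \frac{(41-50.80)^2}{50.80} + \frac{(7-10.80)^2}{10.80} + \frac{(28-22.86)^2}{22.86} + \frac{(44-35.56)^2}{35.56} \\
={}& 23.42
\end{aligned}
$$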

3. P-value:

The area greater than 23.42 under the chi-square curve with (3−1)(4−1) = 6 degrees of freedom is .0007.

[Figure: chi-square curve with 6 df; the area to the right of χ² = 23.42 is 0.0007.]

4. Conclusion: Since P-value < .05, we reject H0 and conclude with very strong evidence that GPA and attrition are associated. In particular, low HS GPAs are associated with higher attrition rates.

9.1.1 Calculating the Table of Expected Frequencies

The row and column totals of the observed frequencies are

                      2.0-2.5   2.5-3.0   3.0-3.5   3.5-4.0   Total
No 2nd Year                                                      38
No 3rd Year                                                      31
Return for 3rd Year                                             120
Total                    80        17        36        56       189

The numbers in the right margin are called the row totals while the numbers in the bottom margin are called the column totals. Now, calculate the relative frequency distribution of row totals. This is called the marginal distribution for the rows, or the marginal attrition distribution in this case.

                      2.0-2.5   2.5-3.0   3.0-3.5   3.5-4.0   Total
No 2nd Year                                                      38 (20.1%)
No 3rd Year                                                      31 (16.4%)
Return for 3rd Year                                             120 (63.5%)
Total                    80        17        36        56       189 (100%)

If the two variables are independent, then we expect the attrition distribution for each column to be the same. Extrapolating the marginal attrition distribution to each column, we get the following expected frequencies:

(.201)(80)=16.08   (.201)(17)=3.42    (.201)(36)=7.24    (.201)(56)=11.26
(.164)(80)=13.12   (.164)(17)=2.79    (.164)(36)=5.90    (.164)(56)=9.18
(.635)(80)=50.80   (.635)(17)=10.80   (.635)(36)=22.86   (.635)(56)=35.56

If the observed frequencies are close to these expected frequencies, then we conclude independence. If the observed frequencies are quite different from these expected frequencies, then we conclude that the two variables are associated.
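The expected-frequency calculation can also be sketched in a few lines of Python. This is only an illustration and not part of the course's TI-84 workflow; the variable names are made up for the example. It uses the unrounded row proportions (38/189 rather than .201), so a few entries may differ from the hand-computed table in the second decimal place.

```python
# Observed counts from Table 9.1 (rows: attrition outcome, columns: HS GPA band).
observed = [
    [25, 3, 4, 6],    # No 2nd Year
    [14, 7, 4, 6],    # No 3rd Year
    [41, 7, 28, 44],  # Return for 3rd Year
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count = (row share of the grand total) x (column total).
expected = [[r / n * c for c in col_totals] for r in row_totals]

for row in expected:
    print([round(e, 2) for e in row])
```

Each row of `expected` applies the same marginal attrition distribution to every GPA column, which is what independence means here.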

9.1.2 The Chi-Square Test Statistic

The chi-square test statistic is an overall measure of how close the observed frequencies are to the expected frequencies. It has the form

$$
\chi^2 = \sum \frac{(\text{observed frequency} - \text{expected frequency})^2}{\text{expected frequency}}.
$$

The null hypothesis of independence is rejected if χ² is large, because this means that observed frequencies and expected frequencies are far apart. The chi-square curve is used to judge whether the calculated test statistic is large enough. We reject H0 if the test statistic is large enough that the area beyond it (under the chi-square curve with (r−1)(c−1) degrees of freedom) is less than .05.

[Figure: chi-square curve with (r−1)(c−1) df; the P-value is the area to the right of χ².]

The P-value is the area greater than χ² under the chi-square curve with (r−1)(c−1) degrees of freedom.
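As a rough check on the Table 9.1 analysis, the statistic and its P-value can be computed as sketched below. The helper `upper_tail_even_df` is a hypothetical name introduced for this example; it relies on a standard closed form for the chi-square tail area that holds only when the degrees of freedom are even (as with 6 df here). It is not the method used in the text, which reads the area from the curve or from a calculator.

```python
import math

# Observed counts from Table 9.1.
observed = [
    [25, 3, 4, 6],
    [14, 7, 4, 6],
    [41, 7, 28, 44],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Chi-square statistic: sum of (observed - expected)^2 / expected over all cells.
chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

df = (len(observed) - 1) * (len(observed[0]) - 1)


def upper_tail_even_df(x, df):
    """Area to the right of x under the chi-square curve (even df only)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2
    terms = sum(half ** i / math.factorial(i) for i in range(df // 2))
    return math.exp(-half) * terms


p_value = upper_tail_even_df(chi2, df)
print(f"chi-square = {chi2:.2f}, df = {df}, P-value = {p_value:.4f}")
```

The result should land close to the reported 23.42 and .0007; small differences come from using unrounded expected frequencies.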

9.2 Testing for Equality of Proportions Between 2 Samples

Dormitory residency is sometimes advertised as being beneficial to students because it helps foster a connectedness to the university and may help students develop an academic lifestyle. In one study involving 408 entering students in 1995, students were classified according to (i) whether they lived in a campus dormitory during their first year, and (ii) whether they had graduated within 6 years of entering college. The data are presented in a 2 × 2 cross-classification table below.

Table 9.2: Dorm Residency versus Graduation Rate

                 Dorm   Not Dorm
Grad ≤ 6 Yrs.     204      20
Grad > 6 Yrs.     160      24

We are interested in comparing the 6-year graduation rate of dormitory residents versus non-dormitory residents. In terms of population parameters, we write the null and alternative hypotheses as

H0 : p1 = p2 versus H1 : p1 ≠ p2

where p1 and p2 are the population graduation rates for dorm and non-dorm residents, respectively.

Note that if we treat dorm residency as a variable, then a test for independence between graduation rate and residency also compares graduation rate between dorm and non-dorm students. Therefore, we may use the chi-square test of independence to test for equality of proportions between populations. In this case, it is called a test for equality of proportions rather than a test for independence. The null and alternative hypotheses are written differently, but the rest of the test procedure remains the same.

If the row variable has only two categories (Success-Failure), and the columns are population labels rather than levels of a second variable, the chi-square test of independence is called a test for equality of proportions.

First compute the graduation rate marginal distribution:

                 Dorm   Not Dorm   Total
Grad ≤ 6 Yrs.                       224 (This is 54.9% of 408)
Grad > 6 Yrs.                       184 (This is 45.1% of 408)
Total             364      44       408

Extrapolating the marginal distribution to each individual column, we get the following expected frequencies:

(.549)(364)=199.84   (.549)(44)=24.16
(.451)(364)=164.16   (.451)(44)=19.84

The chi-square test compares these expected frequencies to the observed frequencies. The null hypothesis is rejected if observed and expected frequencies are too far apart.

Here is the way a statistical report would formally present the test, in numbered stages.

1. Hypotheses: H0 : p1 = p2 versus H1 : p1 ≠ p2, where p1 = 6-yr. grad rate of dorm residents and p2 = 6-yr. grad rate of non-dorm residents.

2. Test statistic:

$$
\chi^2 = \frac{(204-199.84)^2}{199.84} + \frac{(20-24.16)^2}{24.16} + \frac{(160-164.16)^2}{164.16} + \frac{(24-19.84)^2}{19.84} = 1.78
$$

3. P-value: the area greater than 1.78 under the chi-square curve with 1 degree of freedom is .18.

[Figure: chi-square curve with 1 df; the area to the right of χ² = 1.78 is 0.18.]

4. Conclusion: Since P-value > .05, we do not reject H0 : p1 = p2. The sample does not provide enough evidence to declare a significant difference between graduation rates of dormitory and non-dormitory students.
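The dormitory example can be checked with a short Python sketch. The variable names are illustrative only. For 1 degree of freedom, the area to the right of χ² equals the two-tailed normal area beyond its square root, which the standard library exposes through `math.erfc`; this shortcut is an assumption of the sketch and is not how the text obtains the P-value.

```python
import math

# Observed counts from Table 9.2 (columns: Dorm, Not Dorm).
observed = [
    [204, 20],  # Grad <= 6 Yrs.
    [160, 24],  # Grad > 6 Yrs.
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Upper-tail area for 1 df: P(chi-square > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))

# Sample graduation rates being compared.
dorm_rate = observed[0][0] / col_totals[0]
non_dorm_rate = observed[0][1] / col_totals[1]

print(f"dorm rate = {dorm_rate:.3f}, non-dorm rate = {non_dorm_rate:.3f}")
print(f"chi-square = {chi2:.2f}, P-value = {p_value:.2f}")
```

The sample rates differ by roughly ten percentage points, yet with only 44 non-dorm students the statistic stays small and the P-value stays well above .05.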

9.3 Using the TI-84

9.3.1 Entering Data into a Matrix

1. [2ND] [MATRIX]
2. Arrow right to [EDIT]. Matrix [A] should already be highlighted. Press [ENTER].
3. Input the number of rows. [ENTER]. Input the number of columns. [ENTER].
4. Input the matrix entries.
5. [2ND] [QUIT]

9.3.2 Calculating a Chi-Square Test for an r by c Table

0. Enter the table of observed counts into matrix [A].
1. Press [STAT].
2. Arrow right to [TESTS].
3. Arrow down to [χ²-TEST]. Press [ENTER].
4. Choose [Calculate] or [Draw]. Press [ENTER].

(The expected counts will be output in matrix [B])

9.4 Exercises

1. In a study of drug usage by students at a large university, data were obtained regarding hard-liquor experience of smokers and nonsmokers.

Hard-liquor use   Nonsmokers   Smokers
Once or more          15          23
Never                 56          18

Conduct a test of significance to assess whether liquor experience is independent of smoking.

2. A study investigating the association between size of cars and country found the following frequency counts:

USA Japan UK France

Economy 21 24 33 55

Compact 27 35 37 40

Full Size 36 11 12 4

Luxury 15 3 7 8

Is there evidence of a significant relationship between size of car and country, or are the two variables independent?

3. To compare two diets (A and B), a study was done in which overall health was measured as either excellent, average, or poor. The frequencies are listed in the following table. Do the data suggest that health depends on diet, or are they independent?

Excellent Average Poor

Diet A 37 24 19

Diet B 17 33 20

4. Are hard-driven, intense people (said to possess Type A personality) more inclined to heart problems than those with low-key, easygoing Type B personalities? A study was made involving 510 heart attack patients. An activity survey test was given to determine Type A, Type B, and neutral personalities. Each person’s case was followed for three years, and the number who survived for three years after the heart attack was recorded. The data follow:

            Type A   Neutral   Type B
Died          20        15       15
Survived     160       150      150

(a) Do the data provide sufficient evidence to indicate that survival rates are different for different types of people?

(b) Suppose that a larger study resulted in the following table:

            Type A   Neutral   Type B
Died          40        30       30
Survived     320       300      300

Observe that all the frequencies in the larger study are twice those in the first study. Do you expect the P-value of the larger study to be smaller than, larger than, or approximately equal to that of the first study?

5. In April 2003, a researcher conducted a survey to test for association between opinions on gun control and war with Iraq. She asked 100 people two questions: (i) do you believe that private citizens should be allowed to own guns, and (ii) did you approve of the decision to send troops to Iraq? The marginal totals are given below.

               Yes on troops   No on troops   Total
Yes on guns                                     40
No on guns                                      60
Total               51              49         100

(a) Compute the table of Expected Frequencies.

(b) Suppose 25 respondents answered ‘Yes’ to both questions. Calculate the chi-square test statistic for independence/association between the two opinions.

(c) Use the chi-square curve to compute the P-value for your test statistic above. What is the correct number of degrees of freedom to use?

(d) What is the conclusion of your test?

6. Computer-controlled cameras are being used to ticket automobile drivers for speeding and running red lights. These devices are operated by private firms, which have an incentive to pull in as many drivers as they can. Although approximately 70% accept and pay these tickets, others resent this procedure and fight the ticket. A frequency table with marginal totals is given below. Assume that the actions of 200 motorists who received a ticket from a computer-controlled camera produced the following results. Is there evidence of a significant difference between the proportions of motorists who pay their ticket by the type of traffic violation?

                Run red light   Speeding   Total
Pay ticket                                  140
Fight ticket                                 60
Total                60            140      200

(a) Compute the table of Expected Frequencies.

(b) Suppose we know that 1/3 of those who were ticketed for running a red light fought the ticket. Is this enough information to conduct a test of association/independence between the two variables?

(c) Using the information in (b), compute the chi-square test statistic for independence/association between the two variables.

(d) Use the chi-square curve to compute the P-value for your test statistic above. What is the correct number of degrees of freedom to use?

(e) What is the conclusion of your test?

Chapter 10

Correlation

10.1 Correlation

The following data appeared in the Wall Street Journal in 1984. Advertisements were selected by an annual survey conducted by Video Board Tests, Inc., a New York ad-testing company, based on interviews with 20,000 adults who were asked to name the most outstanding TV commercial they had seen, noticed, and liked. The retained impressions were based on a survey of 4,000 adults, in which regular product users were asked to cite a commercial they had seen for that product category in the past week. ‘TV Ad Budget’ was the 1983 advertising budget in $ millions. ‘Impressions’ is the estimated number of million impressions per week.

Company   TV Ad Budget   Impressions

MILLER LITE 50.1 32.1

PEPSI 74.1 99.6

STROH’S 19.3 11.7

FED’L EXPRESS 22.9 21.9

BURGER KING 82.4 60.8

COCA-COLA 40.1 78.6

MC DONALD’S 185.9 92.4

MCI 26.9 50.7

DIET COLA 20.4 21.4

FORD 166.2 40.1

LEVI’S 27.0 40.8

BUD LITE 45.6 10.4

ATT/BELL 154.9 88.9

CALVIN KLEIN 5.0 12.0

WENDY’S 49.7 29.2

POLAROID 26.9 38.0

SHASTA 5.7 10.0

MEOW MIX 7.6 12.3

OSCAR MEYER 9.2 23.4

CREST 32.4 71.1

KIBBLES ’N BITS 6.1 4.4

Figure 10.1 shows a scatterplot of Impressions versus Ad Spending. Note that the points seem to fall loosely around a line sloped upwards. We say that there is a positive linear association, or linear relationship, between spending and the number of impressions made.

If the points fall around a straight line sloped downwards, we say that there is a negative association.

Figure 10.1: Scatterplot of Impressions Score (Y) versus Ad Spending (X)

The direction and strength of association is often expressed in a single number called the (Pearson) correlation coefficient. Typically denoted by r, the correlation coefficient is a number between −1 and +1, inclusive. A value of r = 0 means that no linear association exists; the points either look like a random scatter, or fall around a horizontal line. A value of r = +1 (or r = −1) indicates a perfectly linear relationship; all the points fall on a straight line sloped upwards (downwards).

The correlation between TV Ad Budget and Impressions in the data above is +.65.

Figure 10.2 shows various scatterplots with different correlations. Although .5 is halfway between 0 and 1, note that the plot corresponding to r = .5 barely shows a pattern of association. In practice, plots can show correlations up to .3 purely by accident (e.g. correlation between GPA and, say, shoe size). When correlation reaches +1 or −1, all points fall on a straight line.

10.2 Computing the Pearson Correlation Coefficient

One formula for the Pearson correlation coefficient r is as follows:

$$
r = \frac{\sum XY - \dfrac{(\sum X)(\sum Y)}{n}}{\sqrt{\left(\sum X^2 - \dfrac{(\sum X)^2}{n}\right)\left(\sum Y^2 - \dfrac{(\sum Y)^2}{n}\right)}} \tag{10.1}
$$

Figure 10.2: Correlation plots (Clockwise from top left: r= -.5, .7, .9, -.95)

[Figure panels, listed as horizontal axis and vertical axis: car weight and mpg; R/S and P/E; ave. temp. and kilowatt usage; square feet and annual sales.]

The following numerical example shows how formula (10.1) is used:

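X   Y
1   2
3   5
4   5
4   8

$$
\begin{aligned}
\sum XY &= (1)(2) + (3)(5) + (4)(5) + (4)(8) = 69 \\
\sum X &= 1 + 3 + 4 + 4 = 12 \\
\sum Y &= 2 + 5 + 5 + 8 = 20 \\
\sum X^2 &= 1^2 + 3^2 + 4^2 + 4^2 = 42 \\
\sum Y^2 &= 2^2 + 5^2 + 5^2 + 8^2 = 118
\end{aligned}
$$

$$
r = \frac{69 - \dfrac{(12)(20)}{4}}{\sqrt{\left(42 - \dfrac{(12)^2}{4}\right)\left(118 - \dfrac{(20)^2}{4}\right)}} = .866
$$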
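The same arithmetic can be written as a small Python function. This is a minimal sketch of formula (10.1); the function name `pearson_r` is chosen for this example and does not come from the text.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation using the computational formula (10.1)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = sum_xy - sum_x * sum_y / n
    denominator = math.sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
    return numerator / denominator


r = pearson_r([1, 3, 4, 4], [2, 5, 5, 8])
print(round(r, 3))  # 0.866
```

For the four points above, the numerator is 69 − 60 = 9 and the denominator is the square root of (6)(18) = 108, which gives the same .866.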

We present a second formula that is harder to compute but easier to interpret.

$$
r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2}\,\sqrt{\sum (Y - \bar{Y})^2}} \tag{10.2}
$$

Consider the Ad Spending example at the start of this chapter. Many of the (X, Y) points are simultaneously above average, since companies that have higher than average Advertising Spending also have higher than average Impressions. Both X − X̄ and Y − Ȳ are positive for these companies. Therefore, the product (X − X̄)(Y − Ȳ) is positive for these companies. Most of the remaining companies have lower than average Spending and lower than average Impressions. Both X − X̄ and Y − Ȳ are negative for these companies, but the product (X − X̄)(Y − Ȳ) is still positive! Hence the numerator in (10.2) tends to be a large positive number for the Ad Spending data.

If the points were sloped downwards, then high X-values would tend to go with low Y-values, and the product (X − X̄)(Y − Ȳ) would be negative for these points. This is partly how the correlation formula (10.2) works. The denominator terms have been put in to ensure that r does not go beyond −1 or +1.
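To see the sign argument at work, the sketch below applies formula (10.2) to the small four-point example used for formula (10.1) and also returns the individual products of deviations. The function name `pearson_r_deviations` is made up for this illustration.

```python
import math


def pearson_r_deviations(xs, ys):
    """Pearson correlation using the deviation formula (10.2).

    Returns r together with the per-point products (X - mean X)(Y - mean Y).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    products = [a * b for a, b in zip(dx, dy)]
    denominator = math.sqrt(sum(a * a for a in dx)) * math.sqrt(sum(b * b for b in dy))
    return sum(products) / denominator, products


r, products = pearson_r_deviations([1, 3, 4, 4], [2, 5, 5, 8])
print(products)
print(round(r, 3))
```

Here none of the products is negative: the point below both averages and the point above both averages each contribute a positive amount, so the numerator and r are positive. Both formulas give the same value of r.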

10.3 Exercises

1. Consider the following data:

X Y

-2 0

2 3

5 10

-1 1

6 15

(a) Compute the correlation between X and Y.

(b) Compute the correlation between Y and X.

(c) Add 5 to Y, so the new values are 5, 8, 15, 6, 20. Now compute the correlation between X and Y. Is the correlation smaller, larger, or the same as before?

(d) Multiply Y by 5, so the new values are 0, 15, 50, 5, 75. Now compute the correlation between X and Y. Is the correlation smaller, larger, or the same as before?

(e) Multiply Y by −1, so that the new values are 0, −3, −10, −1, −15. Now compute the correlation between X and Y. Is the correlation smaller, larger, or the same as before?

2. The following data present the per capita income of 20 European OECD countries for 1960, as well as the percentages of the labor force employed in agriculture, industry, and services for each country.

PCINC = Per capita income, 1960 ($)
AGR = Percent of labor force in agriculture, 1960
IND = Percent of labor force in industry, 1960
SER = Percent of labor force in service occupations, 1960

COUNTRY PCINC AGR IND SER

CANADA 1536 13 43 45

SWEDEN 1644 14 53 33

SWITZERLAND 1361 11 56 33

LUXEMBURG 1242 15 51 34

U. KINGDOM 1105 4 56 40

DENMARK 1049 18 45 37

W. GERMANY 1035 15 60 25

FRANCE 1013 20 44 36

BELGIUM 1005 6 52 42

NORWAY 977 20 49 32

ICELAND 839 25 47 29

NETHERLANDS 810 11 49 40

AUSTRIA 681 23 47 30

IRELAND 529 36 30 34

ITALY 504 27 46 28

JAPAN 344 33 35 32

GREECE 324 56 24 20

SPAIN 290 42 37 21

PORTUGAL 238 44 33 23

TURKEY 177 79 12 9

Using your TI-84 calculator, compute the correlation coefficient of the following pairs: PCINC vs. AGR, PCINC vs. IND, and PCINC vs. SER.

(a) Which of the three labor sectors has the strongest linear relationship with per capita income?

(b) If a majority of the labor force works in agriculture, would you expect a higher per capita income?

(c) Suppose PCINC (per capita income) is coded in thousands of dollars instead. What happens to the correlation coefficients?

3. Consider the first and second exam scores of 35 Stat 216 students:

Student First Second Student First Second

1 21 22 19 25 22

2 23 23 20 13 19

3 16 19 21 17 22

4 23 19 22 23 18

5 23 24 23 11 21

6 17 21 24 17 14

7 12 18 25 18 11

8 15 16 26 13 16

9 20 20 27 18 11

10 8 10 28 16 15

11 22 24 29 21 17

12 22 22 30 15 9

13 23 22 31 16 22

14 18 19 32 22 16

15 22 23 33 18 16

16 20 20 34 21 13

17 20 20 35 19 24

18 20 20

(a) Draw a scatterplot of the data. How are the two exam scores related based on the plot? Would you say that this relationship is strong?

(b) Compute the correlation coefficient between the first and second exam scores. Does this value support your judgment in the previous question?

(c) The Stat 216 director decided to curve the first exam scores by giving away 5 points.

i. Obtain a new scatterplot and compare this with the old plot.

ii. What happens to the correlation coefficient? Explain this behavior.

(d) The fifth and eleventh students were found cheating during the second exam. As a result, they were both given zeros on that exam. What will happen now to the correlation coefficient? Can you consider this new value reliable? Explain why.

Chapter 11

Linear Regression

11.1 Simple Linear Regression

The following table contains data on the winning bid price for 12 Saturn cars on eBay in July 2002. The car mileage is also given, and the cars have been arranged in increasing order of Miles.

Car Miles Price ($)

1 9300 7100

2 10565 15500

3 15000 4400

4 15000 4400

5 17764 5900

6 57000 4600

7 65940 8800

8 73676 2000

9 77006 2750

10 93739 2550

11 146088 960

12 153260 1025

Problem: Based on the data, how much do I expect to get for a Saturn car that has been driven 60000 miles?

An initial analysis would go like this: “Car 7 has about 66000 miles and has a bid of $8800. I should expect to get a little more for mine, maybe $9000(?). However, Car 6 only has 57000 miles, yet the high bid is only $4600. Based on this observation, I should expect to get a little less than $4600, maybe $4400(?).” This type of ad hoc data analysis looks at a few observations (Cars 6 and 7) without considering the rest of the data.

Simple linear regression is a data analysis technique that tries to find a linear pattern in the data. In linear regression, we use all of the data to calculate a straight line which may be used to predict Price based on Miles. Since Miles is used to predict Price, Miles is called an ‘Explanatory Variable’ while Price is called a ‘Response Variable’. Figure 11.1 shows a scatterplot of Price (on the Y-axis) versus Miles (on the X-axis):

Notice that the points seem to fall around a straight line sloping downwards. Can you draw this line? We will discuss one way to do this, called the least squares (LS) method. For now, suppose that the LS line has already been computed (we will do this later). The LS line overlaid on the scatterplot looks like Figure 11.2.

The formula for this line, in the form Y = a + bX, is

PREDICTED PRICE = $8136 − .05127(MILES)

Figure 11.1: Scatterplot of Price versus Miles

Figure 11.2: Least Squares (LS) regression line overlaid on scatterplot

[Figure: price (vertical axis) versus miles (horizontal axis), with the LS regression line.]

The slope of the line is −.05127, which means that predicted Price tends to drop about 5 cents for every additional mile driven, or about $512.70 for every 10,000 miles. The intercept (or Y-intercept) of the line is $8136; this should not be interpreted as the predicted price of a car with 0 mileage, because the data provide information only for Saturn cars between 9,300 miles and 153,260 miles.

We can now use the line to predict the selling price of a car with 60000 miles. What is the height or Y value of the line at X = 60000? The answer is

PREDICTED PRICE = $8136 − .05127(60000) = $5059.80,

or roughly $5000 to $5100.
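As a small illustration, the prediction is just the line evaluated at the mileage of interest. The function below is a sketch with a made-up name; its default coefficients are the rounded intercept and slope quoted in the text.

```python
def predicted_price(miles, intercept=8136.0, slope=-0.05127):
    """Height of the LS line at the given mileage (rounded coefficients)."""
    return intercept + slope * miles


print(round(predicted_price(60000), 2))
```

Because the data cover only 9,300 to 153,260 miles, calling this function far outside that range would be extrapolation rather than a supported prediction.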

11.2 Calculating the Least Squares Regression Line

One way to calculate the regression line is to use the five summary statistics X̄, SD_X, Ȳ, SD_Y, and r (i.e. the mean and SD of X, the mean and SD of Y, and the Pearson correlation between X and Y). The least squares regression line is represented by the equation

PREDICTED Y = a + bX

where the slope b and intercept a are calculated in the following order:

$$
b = r\,\frac{SD_Y}{SD_X}, \qquad a = \bar{Y} - b\bar{X}
$$

Consider the Saturn Price data. We let Y = Price and X = Miles since we want to estimate Price given a value of Miles (not vice versa). Calculate and present the summary statistics with the proper X and Y notation, as follows:

Car   Miles (X)   Price (Y)

1 9300 7100

2 10565 15500

3 15000 4400

4 15000 4400

5 17764 5900

6 57000 4600

7 65940 8800

8 73676 2000

9 77006 2750

10 93739 2550

11 146088 960

12 153260 1025

X̄ = 61195       Ȳ = 4999
SD_X = 50989    SD_Y = 4079
r = −.641

Using the formulas for slope and intercept of the LS line, we get

b = −.641 (4079/50989) = −.05127

a = 4999 − (−.05127)(61195) = 8136

so that the regression line is

PREDICTED Y = a + bX = 8136 + (−.05127)X

In terms of the original variable names, we may write the regression line as

PREDICTED PRICE = 8136 − .05127(MILES)
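The whole calculation, from raw data to slope and intercept, can be sketched in Python with the standard library. The variable names are illustrative. `statistics.stdev` uses the n − 1 divisor, which appears to be the SD convention behind the summary values above; treat that as an assumption of this sketch.

```python
import statistics

# Saturn data: Miles (X) and Price (Y) for the 12 cars.
miles = [9300, 10565, 15000, 15000, 17764, 57000,
         65940, 73676, 77006, 93739, 146088, 153260]
price = [7100, 15500, 4400, 4400, 5900, 4600,
         8800, 2000, 2750, 2550, 960, 1025]

n = len(miles)
mean_x, mean_y = statistics.mean(miles), statistics.mean(price)
sd_x, sd_y = statistics.stdev(miles), statistics.stdev(price)

# Pearson correlation from the sum of products of deviations.
sum_products = sum((x - mean_x) * (y - mean_y) for x, y in zip(miles, price))
r = sum_products / ((n - 1) * sd_x * sd_y)

# Slope first, then intercept, as in the formulas for the LS line.
b = r * sd_y / sd_x
a = mean_y - b * mean_x

print(f"mean X = {mean_x:.0f}, SD X = {sd_x:.0f}")
print(f"mean Y = {mean_y:.0f}, SD Y = {sd_y:.0f}")
print(f"r = {r:.3f}, b = {b:.5f}, a = {a:.0f}")
```

Working from unrounded summary statistics avoids the small drift that comes from plugging in rounded values such as −.641 and 4079.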

11.3 More on Simple Regression

How well does the prediction model work? We now discuss statistics that help us measure the predictive performance of the model.

Consider Car 7, which has Miles = 65940 and Price = $8800. In the scatterplot, it is plotted as (X, Y) = (65940, 8800). The regression line says that a car with X = 65940 will have predicted Y value

PREDICTED Y = $8136 − .05127(65940) = $4755.47

The observed value Y = $8800 is much higher than PREDICTED Y = $4755.47. How much higher? The difference is $8800 − $4755.47 = $4044.53. This is called the “Residual”.

Residual = Y - PREDICTED Y

Here is the complete table of values of the explanatory variable X, the observed values of Y, the predicted values PRED, and the residuals RES.

Car Miles(X) Price(Y) PRED RES=Y-PRED

1 9300 7100 7659.35 -559.35

2 10565 15500 7594.49 7905.51

3 15000 4400 7367.11 -2967.11

4 15000 4400 7367.11 -2967.11

5 17764 5900 7225.41 -1325.41

6 57000 4600 5213.81 -613.81

7 65940 8800 4755.47 4044.53

8 73676 2000 4358.85 -2358.85

9 77006 2750 4188.13 -1438.13

10 93739 2550 3330.24 -780.24

11 146088 960 646.36 313.64

12 153260 1025 278.66 746.34

In a scatterplot, Y is the height of the point. PREDICTED Y is the height of the line directly above or below that point (denoted by *’s in the scatterplot below). The residuals, which are computed by subtraction (RESIDUAL = Y − PREDICTED Y), tell us how far each point is above or below the line. Points above the regression line will have a positive residual. Points that are below the line will have a negative residual.
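The residual table can be regenerated with a few lines of code. The sketch below is illustrative only: it assumes the unrounded intercept and slope reported in the Excel printout, and the function name residual_table is our own.

```python
# Unrounded coefficients as reported in the Excel printout (assumed here).
INTERCEPT = 8136.150298
SLOPE = -0.051269039

# (miles, price) for Cars 1-12, in the order of the table above.
cars = [(9300, 7100), (10565, 15500), (15000, 4400), (15000, 4400),
        (17764, 5900), (57000, 4600), (65940, 8800), (73676, 2000),
        (77006, 2750), (93739, 2550), (146088, 960), (153260, 1025)]

def residual_table(data, a, b):
    """Return rows of (car, x, y, predicted y, residual)."""
    rows = []
    for car, (x, y) in enumerate(data, start=1):
        pred = a + b * x              # height of the line at this X
        rows.append((car, x, y, pred, y - pred))
    return rows

rows = residual_table(cars, INTERCEPT, SLOPE)
for car, x, y, pred, res in rows:
    print(f"{car:3d} {x:8d} {y:8d} {pred:10.2f} {res:10.2f}")
```

A useful check on any least-squares fit is that the residuals add up to zero, apart from rounding.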


[Figure: scatterplot of price (vertical axis) against miles (horizontal axis) with the fitted line pred = 8136.15029785 − 0.05126904(MILES); the predicted values on the line are marked by *.]

Large residuals (ignoring the signs) indicate observations where the regression line gives poor predictions. For example, Car 7 has a large positive residual while Car 8 has a large negative residual. Based on its mileage, Car 7 has a predicted price of about $4755; its actual price is about $4045 higher (a residual of +$4044.53). Car 8 has a predicted price of about $4359; its actual price is about $2359 lower (a residual of −$2358.85).

The sum of the squared residuals

SSE = (Y1 − PREDICTED Y1)² + · · · + (Yn − PREDICTED Yn)²

is a measure of the ‘overall size’ of the residuals. In the Saturn Price data, SSE = (−559.35)² + · · · + (746.34)² = 107805718. If a regression line is predicting poorly, there tend to be lots of large (positive and negative) residuals, and hence SSE tends to be large.


A large value of SSE means that the regression line is predicting poorly. A small value of SSE means that the regression line is predicting well.

The regression line predicts best when correlation is high. In the best case scenario where r = 1 or r = −1, all the points will fall on a straight line. In this extreme case, all the residuals are 0, and SSE=0.

When r = 1 or r = −1, SSE equals 0. This is the best case for regression prediction.

There is a more basic prediction method than regression: this method ignores the X-values altogether, and uses the sample average Ȳ to predict each Y. This is called one-sample prediction, in contrast to regression prediction. In one-sample prediction, the errors of prediction are the differences Y − Ȳ, and their sum of squares is called the Total Sum of Squares, denoted SSTo:

SSTo = (Y1 − Ȳ)² + · · · + (Yn − Ȳ)²

What happens in regression when the X and Y variables are uncorrelated? In this worst case scenario, r = 0, so the regression slope is computed as b = r(SDY/SDX) = 0. The computation for the intercept results in a = Ȳ − bX̄ = Ȳ. The resulting predicted values are Pred Y = a + bX = Ȳ. Therefore, when X and Y are uncorrelated, regression reduces to one-sample prediction ignoring the X-values, and SSE equals SSTo.

When r = 0, SSE equals SSTo. This is the worst case for regression prediction.
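The worst case can be checked numerically. The sketch below uses a small hypothetical data set (not from the text) chosen so that the correlation is exactly 0; the slope is computed as the sum of cross-products divided by the sum of squared X deviations, which is algebraically the same as r(SDY/SDX).

```python
# Hypothetical data chosen so that X and Y are uncorrelated (r = 0).
x = [-2, -1, 0, 1, 2]
y = [4, 1, 0, 1, 4]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)

b = sxy / sxx              # slope; equals r * (SDY / SDX)
a = y_bar - b * x_bar      # intercept

# Regression residuals versus one-sample (mean-only) prediction errors.
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
ssto = sum((yi - y_bar) ** 2 for yi in y)

print("slope:", b, "intercept:", a, "SSE:", sse, "SSTo:", ssto)
```

With these values the slope is 0, the intercept is the sample average of Y, and SSE and SSTo coincide, as the argument above predicts.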

In applications, the typical scenario falls somewhere in the middle: |r| is somewhere between 0 and 1, and SSE is somewhere between 0 and SSTo.

|__________________|___________________________________|
0                 SSE                                 SSTo
<------------------|----------------------------------->
BEST                                               WORST

Is SSE closer to 0 (best case) or to SSTo (worst case)? The distance between SSTo and SSE is a useful measure of the statistical performance of the prediction model. It is calculated by subtraction, and is called SSR:

SSR = SSTo − SSE


For the Saturn Price data, SSTo=182,977,206 and SSE=107,805,718. The difference between them is SSR=75,171,488.

x______________________________x_______________________x
0                         107,805,718             182,977,206
                             (SSE)                   (SSTo)
                               |_______________________|
                                       75,171,488
                                         (SSR)

Recall that large residuals indicate poor predictions by the straight line model which predicts Y (Price) given X (Miles). The residuals then represent a component of the car price that is not due to, or ‘not explained by’, the car’s mileage. Since SSE is a summary measure of residual size, and since SSTo=SSE+SSR, we may think of SSTo as having been partitioned into (i) a component not explained by mileage (SSE) and (ii) a component explained by mileage (SSR). The latter component is often expressed as a proportion or percentage:

R² = SSR / SSTo

For the Saturn Price data, SSTo=182,977,206. Since SSR/SSTo=.41, we say that 41% of total price variability is explained by the car’s mileage. The remaining 59% is due to something else - probably condition of car, power and luxury options, time remaining in auction, market luck-of-the-draw, etc.

SSTo=182,977,206
|______________________________________________________|
SSE=107,805,718                 SSR=75,171,488
|______________________________|_______________________|
(.59)                           (.41)
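The partition of SSTo into SSE and SSR can be verified directly from the data. The following Python sketch is an illustration under the assumption that the unrounded coefficients from the Excel printout are used for the fitted line; the variable names are our own.

```python
# Unrounded coefficients as reported in the Excel printout (assumed here).
INTERCEPT = 8136.150298
SLOPE = -0.051269039

miles = [9300, 10565, 15000, 15000, 17764, 57000,
         65940, 73676, 77006, 93739, 146088, 153260]
price = [7100, 15500, 4400, 4400, 5900, 4600,
         8800, 2000, 2750, 2550, 960, 1025]

y_bar = sum(price) / len(price)
pred = [INTERCEPT + SLOPE * x for x in miles]

ssto = sum((y - y_bar) ** 2 for y in price)            # one-sample errors
sse = sum((y - p) ** 2 for y, p in zip(price, pred))   # regression residuals
ssr = ssto - sse                                       # explained part
r_squared = ssr / ssto

print(f"SSTo = {ssto:,.0f}")
print(f"SSE  = {sse:,.0f}")
print(f"SSR  = {ssr:,.0f}")
print(f"R^2  = {r_squared:.2f}")
```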

11.3.1 The Excel Printout

The following is the printout from running the regression in Excel.


*********************************************************************************
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.640955792
R Square             0.410824328
Adjusted R Square    0.351906761
Standard Error       3283.378116
Observations         12

ANOVA
              df    SS             MS          F           Significance F
Regression     1    75171487.75    75171488    6.972866    0.024707725
Residual      10    107805718.5    10780572
Total         11    182977206.3

              Coefficients    Standard Error    t Stat      P-value     Lower 95%      Upper 95%
Intercept     8136.150298     1519.8802         5.353152    0.000322    4749.645588    11522.65501
miles         -0.051269039    0.0194155         -2.64062    0.024708    -0.09452957    -0.00800851
*********************************************************************************

In the previous section, we noted that 41% of the total price variability is explained by the car’s mileage (since R²=SSR/SSTo=.41). This ‘R Square’ is reported on the second line of the printout under ‘Regression Statistics’.

The correlation between Miles and Price (X and Y) is r = −.640955792; Excel reports its absolute value, .640955792, as ‘Multiple R’. The square of this value is r² = (−.640955792)² = .410824328. This number should look familiar, because it is ALSO the value of SSR/SSTo, which explains why SSR/SSTo is called ‘R Square’. Note the use of upper case ‘R’ instead of lower case ‘r’ in the Excel printout. Use of the upper case R is standard notation in regression studies, while lower case r tends to be used in correlation studies.

Now move down to the Analysis of Variance (ANOVA) table. It contains three rows: “Regression” (or R for short), “Residual” (or E), and “Total” (or To). The sum of squares column (denoted “SS”) gives SSR, SSE, and SSTo respectively. Mathematically, SSR and SSE are not comparable in size because they contain different amounts of information, also called “degrees of freedom” or simply “df”. This is where the degrees of freedom column comes in. We have SSE=107.8M, but it contains 10 pieces of information, which averages to 10.78M per piece. On the other hand, SSR=75.2M contains only 1 piece of information, so the average is 75.2M per piece. These “average sums of squares per df” are called “mean squares” and are reported under the MS column. Note that the MS for Regression is 6.97 times the MS for Residual. This may be interpreted as follows. “Prices of cars vary partly because they have different mileage, right? How much of the variability in price is due to (or explained by) mileage? Answer: the average variation EXPLAINED by Miles is 6.97 times larger than UNEXPLAINED.” Is 6.97 statistically significant? The answer is Yes, because there is only a 2.47% chance of getting a ratio that large from pure chance variation. Since this P-value is smaller than 5%, the result is statistically significant.

A generic representation of the ANOVA table for simple regression is given below. As usual, the sample size is denoted by n. (For the Saturn price data, n = 12.) The formulas for SSR, SSE and SSTo have already been discussed. MSR and MSE are obtained by division. F is the ratio between MSR and MSE. The P-value, or observed significance level of the F-ratio, is obtained from an F-table (which is beyond the scope of this class).

*****************************************************************
ANOVA
              df     SS      MS             F            Significance F
Regression    1      SSR     MSR=SSR/dfR    F=MSR/MSE    P-value
Residual      n-2    SSE     MSE=SSE/dfE
Total         n-1    SSTo
*****************************************************************
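The entries of the generic table can be filled in by simple division. The sketch below is illustrative only; it takes SSR and SSE from the Saturn printout as given and stops at the F ratio, since the P-value requires an F-table or statistical software.

```python
# Values taken from the Saturn price printout.
n = 12
ssr = 75171487.75
sse = 107805718.5

# Degrees of freedom for simple regression.
df_regression = 1
df_residual = n - 2
df_total = n - 1

# Mean squares are sums of squares divided by their degrees of freedom.
msr = ssr / df_regression
mse = sse / df_residual

# F is the ratio of the two mean squares.
f_ratio = msr / mse

print(f"MSR = {msr:.0f}, MSE = {mse:.0f}, F = {f_ratio:.2f}")
```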

Below the ANOVA table in the Excel regression printout are the estimates of intercept and slope (under the column “Coefficients”). Their standard errors are also reported, along with a t-ratio, a P-value for the t-ratio, and a 95% confidence interval. The standard errors are interpreted the usual way, as follows. If another random sample of 12 cars were selected, the computed slope of the regression line would probably change. By how much? Answer: by approximately .019. The t-ratio is the estimate divided by its standard error, and the P-value measures how likely it is from chance alone to get a ratio that large.
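The t-ratio and confidence interval for the slope can be recomputed from the printed estimate and standard error. In the sketch below, the critical value 2.228 (the t-table value for 95% confidence with 10 degrees of freedom) is an assumption supplied by us; it does not appear in the printout. Small differences from the Excel limits are due to rounding of the inputs.

```python
# Slope estimate and its standard error from the Excel printout.
slope = -0.051269039
se_slope = 0.0194155

# Assumption: t critical value for 95% confidence with n - 2 = 10 df,
# read from a t-table (not shown in the printout).
T_CRITICAL = 2.228

t_ratio = slope / se_slope                 # estimate / standard error
lower = slope - T_CRITICAL * se_slope      # lower 95% limit
upper = slope + T_CRITICAL * se_slope      # upper 95% limit

print(f"t = {t_ratio:.3f}")
print(f"95% CI for slope: ({lower:.5f}, {upper:.5f})")
```

In simple regression the square of this t-ratio equals the F ratio in the ANOVA table, which is why the two P-values in the printout agree.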

Consider the following two statements:

1. The price of a used Saturn car should be around $4999, give or take $4079 or so.

2. The price of a used Saturn car with 80,000 miles should be around $4034, give or take $3283 or so.

The first statement gives no information on the mileage of the car. In the absence of an X-value, our best guess is the sample mean Ȳ = $4999. The standard error for this prediction is the standard deviation from the mean, which is simply the regular sample standard deviation SDY = $4079. The second statement, on the other hand, tells us the car mileage. The predicted price for a car with 80,000 miles is obtained from the regression equation as Pred Y = 8136 − .05127(80000) = $4034. The standard error for this prediction is the standard deviation from predicted values, which turns out to be the square root of the MSE, √10780572 = $3283. This is reported in the Excel printout as “Standard Error” (4th line of printout under ‘Regression Statistics’). To summarize: if a prediction is based on the average, the appropriate give-or-take is the sample SD. If a prediction is based on a regression line, the appropriate give-or-take is the square root of the MSE.
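The two statements can be compared side by side in code. This is a minimal sketch using the rounded values quoted in the text; the variable names are our own.

```python
from math import sqrt

# Statement 1: prediction from the average, ignoring mileage.
mean_price = 4999
sd_price = 4079                      # give-or-take is the sample SD

# Statement 2: prediction from the regression line at 80,000 miles.
intercept, slope = 8136, -0.05127    # rounded coefficients from the text
mse = 10780572                       # MS for Residual in the ANOVA table

predicted = intercept + slope * 80000
give_or_take = sqrt(mse)             # square root of the MSE

print(f"No mileage known: ${mean_price} give or take ${sd_price}")
print(f"At 80,000 miles:  ${predicted:.0f} give or take ${give_or_take:.0f}")
```

The regression give-or-take is smaller than the sample SD because part of the price variability has been explained by mileage.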

11.4 Multiple Linear Regression

The term multiple attached to linear regression means that there are two or more X-variables used to predict Y. Recall the Saturn price data again, but this time with the model year included.

The mileage, model year, and price (highest bid) for 12 Saturn cars on eBay in July 2002:

Car     Miles   Year   Price
  1     73676   1996    2000
  2     77006   1998    2750
  3     10565   2000   15500
  4    146088   1995     960
  5     15000   2001    4400
  6     65940   2000    8800
  7      9300   2000    7100
  8     93739   1996    2550
  9    153260   1994    1025
 10     17764   2002    5900
 11     57000   1998    4600
 12     15000   2000    4400

This time, consider the prediction problem:

Problem: How much would I expect to get for a model 1996 Saturn car that has been driven 60000 miles?

Unlike the previous section, we now know the model year of the car, not just the mileage. Since we want to predict PRICE given YEAR and MILES, we want to fit a regression model that looks like this:

PREDICTED Y = a + b1X1 + b2X2

where Y=PRICE, X1=MILES, and X2=YEAR. The fitted regression line, in terms of variable names, is

PREDICTED PRICE = −800615 − 0.0327(MILES) + 404(YEAR) (11.1)

How do we interpret the coefficients in (11.1)? For Saturn cars within the same model year, every additional mile driven corresponds to a price drop of about 3 cents (or a drop of about $327 for every 10000 additional miles). Similarly, mileage being equal, a more recent model year will tend to have a higher price, on average about $404 per year of difference.

CAUTION: If the X variables are highly correlated, the coefficients may have a meaningless interpretation. This is called the multicollinearity problem (see optional section below).

We can now use the fitted model (11.1) to estimate the selling price of a model year 1996 Saturn car with 60000 miles. Plugging in the desired values for MILES and YEAR, we get

PREDICTED PRICE = −800615 − 0.0327(60000) + 404(1996) = 3807.

A car with the same mileage of 60000 miles from model year 1999 (i.e. 3 years more recent) would have what estimated price? Answer: Since it is 3 years more recent, its estimated price should be higher than $3807 by about 3(404)=$1212, which comes out to $3807+$1212=$5019.

This can be verified by using the fitted model (11.1) with the new numbers:

PREDICTED PRICE = −800615 − 0.0327(60000) + 404(1999) = 5019.
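The two predictions can be reproduced with a small helper function. The sketch below simply encodes the fitted equation (11.1) with its rounded coefficients; the function name predicted_price is our own.

```python
def predicted_price(miles, year):
    """Fitted model (11.1): PRICE = -800615 - 0.0327(MILES) + 404(YEAR)."""
    return -800615 - 0.0327 * miles + 404 * year

price_1996 = predicted_price(60000, 1996)
price_1999 = predicted_price(60000, 1999)

print(f"1996 model, 60000 miles: ${price_1996:.0f}")
print(f"1999 model, 60000 miles: ${price_1999:.0f}")
print(f"Difference for 3 model years: ${price_1999 - price_1996:.0f}")
```

Holding MILES fixed, the difference between the two predictions is exactly three times the YEAR coefficient.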

11.5 Optional: The Multicollinearity Problem

When the X-variables are highly correlated, the coefficients lose their meaning. Consider the following hypothetical data:

X1    X2     Y
10    20   140
20    40   180
30    60   220
40    80   260
50   100   300

The data is fit perfectly by the model

Model A: Y = 100 + 2(X1) + 1(X2).

However, note that X2 = 2(X1) (i.e. one is a linear function of the other). Thus, the following models, all with different coefficients, also fit perfectly.

Model B: Y = 100 + 4(X1) + 0(X2)
Model C: Y = 100 + 0(X1) + 2(X2)
Model D: Y = 100 − 6(X1) + 5(X2)
Model E: Y = 100 + 10(X1) − 3(X2)
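That all five models reproduce the data exactly can be confirmed by direct substitution. The following sketch is illustrative; the function name fits_perfectly is our own.

```python
# Hypothetical data from the table above: (X1, X2, Y).
data = [(10, 20, 140), (20, 40, 180), (30, 60, 220),
        (40, 80, 260), (50, 100, 300)]

# Each model is (intercept, coefficient of X1, coefficient of X2).
models = {
    "A": (100, 2, 1),
    "B": (100, 4, 0),
    "C": (100, 0, 2),
    "D": (100, -6, 5),
    "E": (100, 10, -3),
}

def fits_perfectly(coefs, rows):
    """True if the model predicts every Y exactly."""
    a, b1, b2 = coefs
    return all(a + b1 * x1 + b2 * x2 == y for x1, x2, y in rows)

results = {name: fits_perfectly(coefs, data) for name, coefs in models.items()}
print(results)
```

Because X2 = 2(X1) in every row, any pair of coefficients with b1 + 2(b2) = 4 gives the same predictions, so the individual coefficients cannot be pinned down.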


Since a model with β1 = 4 fits just as well as a model with β1 = 10, the beta coefficients have no real meaningful interpretation. In statistics, this coexistence of multiple correct models is called the nonidentifiability problem. It is a consequence of one of the X-variables being a linear combination of the others, also called the multicollinearity problem.

Moral: Avoid models with X-variables which are too strongly correlated with each other.

Of course, sometimes multicollinearity is unavoidable. In the Saturn price example, the two X-variables MILES and YEAR have correlation r = −.91, so the multicollinearity problem exists to some extent. Therefore, the coefficients have to be very carefully interpreted. One way to minimize this collinearity problem is to try to include cars which are new but have high mileage, or old but have low mileage. This will reduce the correlation between MILES and YEAR.
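The correlation between the two X-variables can be checked from the Saturn table in Section 11.4. The sketch below is illustrative only; the function name pearson_r is our own.

```python
from math import sqrt

# MILES and YEAR for Cars 1-12, in the order of the table in Section 11.4.
miles = [73676, 77006, 10565, 146088, 15000, 65940,
         9300, 93739, 153260, 17764, 57000, 15000]
year = [1996, 1998, 2000, 1995, 2001, 2000,
        2000, 1996, 1994, 2002, 1998, 2000]

def pearson_r(x, y):
    """Pearson correlation computed from deviations about the means."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)

r_miles_year = pearson_r(miles, year)
print(f"r(MILES, YEAR) = {r_miles_year:.2f}")
```

A correlation this close to −1 between two predictors is a warning sign that their individual coefficients should be interpreted with care.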

11.6 Using Excel

11.6.1 Multiple Linear Regression

1. Open the Microsoft Excel data file. If your data is still in a text file, refer to 1.6.2 on how to convert text files to Excel files.

2. If you do not have the Analysis tab, refer to 8.3.1. Select the ‘Data’ tab; under Analysis click ‘Data Analysis.’ Choose ‘Regression’ and click ‘OK.’ A new window will appear.

3. Enter the cell range of your dependent variable in the Y-range. Enter the cell range of your independent variables in the X-range. Note that you can do this by highlighting the cells that contain your data. If the first row contains variable names, click on ‘labels in 1st row.’

4. Click on confidence level, residuals and residual plots. Click ‘OK.’


11.7 Exercises

1. Consider the following data:

 X    Y
-2    0
 2    3
 5   10
-1    1
 6   15

(a) Compute the correlation between Y and X.

(b) Compute the mean and SD of X.

(c) Compute the mean and SD of Y.

(d) Compute the slope of the regression line for predicting Y from X.

(e) Compute the intercept of the regression line for predicting Y from X.

(f) Write down the regression equation for predicting Y from X.

(g) Add 5 to Y, so the new values are 5, 8, 15, 6, 20. What do you think will happen to the slope and intercept? (Hint: What happens to the scatterplot if you add 5 to Y?) Verify your answer by calculation.

(h) Multiply Y by 5, so the new values are 0, 15, 50, 5, 75. What do you think will happen to the slope and intercept? (Hint: What happens to the scatterplot if you multiply Y by 5?) Verify your answer by calculation.

2. Consider the data in Exercise 1.

(a) Write down the regression equation for predicting Y from X.

(b) If a sixth point were found, and this point had value X=4, what is the regression prediction of the Y-value?

(c) Compute the Predicted Value for each of the five data points. Do the Predicted Values have the same average as Y?

(d) Which of the five Y-values are higher than their Predicted Value?

(e) Compute the Residual for each of the five data points. Do the Residuals have 0 average?

(f) Compute SSE and SSTo. Is SSE smaller than SSTo?

(g) Compute SSR and R². How is R² related to the correlation between X and Y?

(h) What percentage of SSTo is ‘explained’ by X? What percentage of SSTo is not explained by X?

3. Consider the data in Exercise 1.

(a) Fill in the following ANOVA table (the P-value for F is optional):

ANOVA
              df     SS         MS         F          Significance F
Regression    ___    _______    _______    _______    _______
Residual      ___    _______    _______
Total         ___    _______

(b) If a sixth point were found, and this point had value X=4, the Y-value is predicted to be ______ give or take ______.

4. Consider the data of a random sample of records of resales of homes from Feb 15 to Apr 30, 1993 from the files maintained by the Albuquerque Board of Realtors. This type of data is collected by multiple listing agencies in many cities and is used by realtors as an information base.

PRICE = Selling price (in hundred dollars)
SQFT = Square feet of living space
AGE = Age of home (years)
FEATS = Number out of 11 features (dishwasher, refrigerator, microwave, disposer, washer, intercom, skylight(s), compactor, dryer, handicap fit, cable TV access)
NE = Located in northeast sector of city (1) or not (0)
COR = Corner location (1) or not (0)
TAX = Annual taxes (dollars)

PRICE   SQFT   AGE   FEATS   NE   CUST   COR    TAX
 2050   2650    13       7    1      1     0   1639
 2080   2600     *       4    1      1     0   1088
 2150   2664     6       5    1      1     0   1193
 2150   2921     3       6    1      1     0   1635
 1999   2580     4       4    1      1     0   1732
 1900   2580     4       4    1      0     0   1534
 1800   2774     2       4    1      0     0   1765
 1560   1920     1       5    1      1     0   1161
 1450   2150     *       4    1      0     0      *
 1449    710     1       3    1      1     0   1010
 1375   1837     4       5    1      0     0   1191
 1270   1880     8       6    1      0     0    930
 1250   2150     5       3    1      0     0    984
 1235   1894     4       5    1      1     0   1112
 1170   1928     8       8    1      1     0    600
 1180   1830     *       3    1      0     0    733
 1155   1767    16       4    1      0     0    794
 1110   1630    15       3    1      0     1    867
 1139   1680    17       4    1      0     1    750
  995   1725     *       3    1      0     0    923
  995   1500    15       4    1      0     0    743
  975   1430     *       3    1      0     0    752
  975   1360     *       4    1      0     0    696
  900   1400    16       2    1      0     1    731
  960   1573    17       6    1      0     0    768
  860   1385     *       2    1      0     0    653

(a) Set up a scatter diagram of PRICE (X) vs. TAX (Y)

(b) Assuming a linear relationship, use the least-squares method to find the regression coefficients.

(c) Interpret the meaning of the slope estimate.

(d) Compute R-square

(e) At the 0.05 level of significance, is there a significant linear relationship between PRICE and TAX?
