chi squared tests


Upload: shiu-ping-wong

Post on 20-Jan-2016

Page 1: Chi Squared Tests

Mathematics Term 3 STPM Chapter 6 Chi-squared Tests

6.1 The Chi-squared Distribution

Each hypothesis test discussed in the last chapter involves a null hypothesis stated in terms of a population parameter and a test statistic having a known probability distribution. Such tests are called parametric tests. However, not all hypotheses can be stated in terms of population parameters. In this chapter, we shall discuss a non-parametric test called the chi-squared test, which is performed using the chi-squared distribution.

Let $x_1, x_2, \ldots, x_n$ be a random sample from a normal distribution with mean $\mu$ and variance $\sigma^2$. Then the sampling distribution of the statistic

$$\chi^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{\sigma^2}$$

is called the chi-squared distribution with $n - 1$ degrees of freedom. The probability density function is given by

$$f(\chi^2_\nu) = c\,(\chi^2_\nu)^{\frac{\nu}{2} - 1}\, e^{-\frac{\chi^2_\nu}{2}}$$

where $c$ is a constant, $\chi^2_\nu$ is the chi-squared statistic with $\nu$ degrees of freedom and $e$ is the base of the natural logarithm. The constant $c$ is a normalising factor chosen so that the area under the chi-squared curve is equal to one.

Examples of chi-squared distributions with various degrees of freedom are shown in the figure below. The curve for degrees of freedom $\nu = n - 1 = 3 - 1 = 2$ represents the distribution of chi-squared values computed from all possible samples of size 3. Likewise, the curve for degrees of freedom equal to 10 corresponds to the distribution for samples of size 11.

[Figure: chi-squared density curves for various degrees of freedom]


The chi-squared distribution has the following properties:

• The values of $\chi^2$ cannot be negative.

• The curve is not symmetric.

• The curves are all positively skewed.

• As $\nu$ gets larger, the degree of skewness decreases.

• The mean of the distribution is equal to the number of degrees of freedom: $\mu = \nu$.

• The variance is equal to two times the number of degrees of freedom: $\sigma^2 = 2\nu$.

• When the degrees of freedom are greater than or equal to 2, the maximum value of the curve occurs at $\chi^2 = \nu - 2$.

• As the degrees of freedom increase, the chi-squared curve approaches a normal distribution.
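The properties listed above can be checked numerically. The following is a minimal sketch, assuming the standard value of the normalising constant, $c = 1 / (2^{\nu/2}\,\Gamma(\nu/2))$, which the text leaves unspecified. The function names `chi2_pdf` and `moments` are illustrative only.

```python
import math

def chi2_pdf(x: float, nu: int) -> float:
    """Density c * x**(nu/2 - 1) * exp(-x/2).

    Assumption: c = 1 / (2**(nu/2) * Gamma(nu/2)), the usual normalising constant.
    """
    c = 1.0 / (2 ** (nu / 2) * math.gamma(nu / 2))
    return c * x ** (nu / 2 - 1) * math.exp(-x / 2)

def moments(nu: int, upper: float = 200.0, steps: int = 200_000):
    """Midpoint-rule estimates of the area, mean and variance on (0, upper)."""
    h = upper / steps
    area = mean = second = 0.0
    for k in range(steps):
        x = (k + 0.5) * h
        w = chi2_pdf(x, nu) * h
        area += w
        mean += x * w
        second += x * x * w
    return area, mean, second - mean ** 2

nu = 6
area, mean, var = moments(nu)
# Locate the peak of the curve on a grid of step 0.01.
mode = max((k * 0.01 for k in range(1, 3000)), key=lambda x: chi2_pdf(x, nu))
print(round(area, 4), round(mean, 3), round(var, 3), round(mode, 2))
```

For $\nu = 6$ the estimates should be close to an area of 1, a mean of 6, a variance of 12 and a peak at 4, in line with the properties stated in the list.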

The area under the curve between 0 and a particular chi-squared value is the cumulative probability associated with that chi-squared value. For example, in the graph of the chi-squared distribution with 6 degrees of freedom below, the shaded area represents the cumulative probability associated with a chi-squared statistic equal to $x$; that is, it is the probability that the value of a chi-squared statistic will fall between 0 and $x$.

[Figure: chi-squared distribution with 6 degrees of freedom, area between 0 and x shaded]

The $\chi^2$-distribution table gives values of $\chi^2$ for various values of $\alpha$ and $\nu$, where $\alpha$ is the area under the curve to the left of the tabulated value (headed $p$ in the table) and $\nu$ is the number of degrees of freedom. The areas, $\alpha$, are the column headings; the degrees of freedom, $\nu$, are given in the left column, and the table entries are the $\chi^2$ values. Hence the $\chi^2$ value with 6 degrees of freedom, leaving an area of 0.05 to the left, is $\chi^2_{0.05} = 1.635$. Owing to the lack of symmetry, we must also use the table to find $\chi^2_{0.95} = 12.592$ for $\alpha = 0.95$.
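The two table values quoted above can be cross-checked without a table. The sketch below relies on a closed form for the cumulative probability that holds only when the number of degrees of freedom is even, $P(X \le x) = 1 - e^{-x/2} \sum_{k=0}^{\nu/2 - 1} (x/2)^k / k!$. This formula is not part of the text, and the function name `chi2_cdf_even` is illustrative.

```python
import math

def chi2_cdf_even(x: float, nu: int) -> float:
    """P(X <= x) for a chi-squared variable with an even number of degrees of freedom."""
    if nu <= 0 or nu % 2 != 0:
        raise ValueError("this closed form needs an even, positive nu")
    half = x / 2
    partial = sum(half ** k / math.factorial(k) for k in range(nu // 2))
    return 1.0 - math.exp(-half) * partial

lower = chi2_cdf_even(1.635, 6)    # should be close to 0.05
upper = chi2_cdf_even(12.592, 6)   # should be close to 0.95
print(round(lower, 3), round(upper, 3))
```

For odd degrees of freedom this shortcut does not apply, and a statistical table or a library routine would be needed instead.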


Critical values for the $\chi^2$-distribution

If $X$ has a $\chi^2$-distribution with $\nu$ degrees of freedom, then for each pair of values of $p$ and $\nu$, the tabulated value of $x$ is such that $P(X \le x) = p$.

p        0.01       0.025      0.05      0.9     0.95    0.975   0.99    0.995   0.999

ν = 1    0.0001571  0.0009821  0.003932  2.706   3.841   5.024   6.635   7.879   10.83
    2    0.02010    0.05064    0.1026    4.605   5.991   7.378   9.210   10.60   13.82
    3    0.1148     0.2158     0.3518    6.251   7.815   9.348   11.34   12.84   16.27
    4    0.2971     0.4844     0.7107    7.779   9.488   11.14   13.28   14.86   18.47
    5    0.5543     0.8312     1.145     9.236   11.07   12.83   15.09   16.75   20.51
    6    0.8721     1.237      1.635     10.64   12.59   14.45   16.81   18.55   22.46
    7    1.239      1.690      2.167     12.02   14.07   16.01   18.48   20.28   24.32
    8    1.647      2.180      2.733     13.36   15.51   17.53   20.09   21.95   26.12
    9    2.088      2.700      3.325     14.68   16.92   19.02   21.67   23.59   27.88
   10    2.558      3.247      3.940     15.99   18.31   20.48   23.21   25.19   29.59
   11    3.053      3.816      4.575     17.28   19.68   21.92   24.73   26.76   31.26
   12    3.571      4.404      5.226     18.55   21.03   23.34   26.22   28.30   32.91
   13    4.107      5.009      5.892     19.81   22.36   24.74   27.69   29.82   34.53
   14    4.660      5.629      6.571     21.06   23.68   26.12   29.14   31.32   36.12
   15    5.229      6.262      7.261     22.31   25.00   27.49   30.58   32.80   37.70
   16    5.812      6.908      7.962     23.54   26.30   28.85   32.00   34.27   39.25
   17    6.408      7.564      8.672     24.77   27.59   30.19   33.41   35.72   40.79
   18    7.015      8.231      9.390     25.99   28.87   31.53   34.81   37.16   42.31
   19    7.633      8.907      10.12     27.20   30.14   32.85   36.19   38.58   43.82
   20    8.260      9.591      10.85     28.41   31.41   34.17   37.57   40.00   45.31
   21    8.897      10.28      11.59     29.62   32.67   35.48   38.93   41.40   46.80
   22    9.542      10.98      12.34     30.81   33.92   36.78   40.29   42.80   48.27
   23    10.20      11.69      13.09     32.01   35.17   38.08   41.64   44.18   49.73
   24    10.86      12.40      13.85     33.20   36.42   39.36   42.98   45.56   51.18
   25    11.52      13.12      14.61     34.38   37.65   40.65   44.31   46.93   52.62
   26    12.20      13.84      15.38     35.56   38.89   41.92   45.64   48.29   54.05
   27    12.88      14.57      16.15     36.74   40.11   43.19   46.96   49.65   55.48
   28    13.56      15.31      16.93     37.92   41.34   44.46   48.28   50.99   56.89
   29    14.26      16.05      17.71     39.09   42.56   45.72   49.59   52.34   58.30
   30    14.95      16.79      18.49     40.26   43.77   46.98   50.89   53.67   59.70


Example 1

The curve of the chi-squared distribution with $\nu = 3$ degrees of freedom is shown below. Find the critical value of $\chi^2$ such that the area in the shaded region is 0.025.

[Figure: chi-squared curve with 3 degrees of freedom, upper tail shaded]

Solution

Look it up in the table by proceeding down the left column, headed $\nu$ (degrees of freedom), to $\nu = 3$. Then move to the right until the column labelled 0.975 is found. The result is 9.348. Thus we have $P(\chi^2 > 9.348) = 0.025$.

Example 2

A factory produces a particular type of drill. On average, the useful operating life is 5.5 hours, with a standard deviation of 0.47 hour. The quality control department runs a test by randomly selecting six drills. The standard deviation of the selected drills is 0.61 hour. Determine the chi-squared statistic represented by this test.

Solution

Given $\sigma = 0.47$ hour, $s = 0.61$ hour, and the number of sample observations $n = 6$, the chi-squared statistic is

$$\chi^2 = \frac{nS^2}{\sigma^2} = \frac{6(0.61^2)}{0.47^2} = 10.107$$
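The arithmetic in Example 2 can be reproduced in a few lines. This is a small sketch of the same formula, $nS^2/\sigma^2$; the function name `variance_chi_squared` is chosen here for illustration and does not come from the text.

```python
def variance_chi_squared(n: int, s: float, sigma: float) -> float:
    """Chi-squared statistic n * s**2 / sigma**2, the form used in the worked solution."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return n * s ** 2 / sigma ** 2

# Values from Example 2: six drills, s = 0.61 hour, sigma = 0.47 hour.
statistic = variance_chi_squared(6, 0.61, 0.47)
print(round(statistic, 3))
```

The same function applies to any exercise that supplies a sample size, a sample standard deviation and a population standard deviation.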

Exercise 6.1

1. Find the 95th percentile of the chi-squared distribution with 9 degrees of freedom.

2. Using the chi-squared distribution table, find

(a) P(x:, < 18.4s),

(b) P(X1, > 1e.81),

(c) P(X'r, ) 32.67).


3. Given $\nu$ and $\alpha$, find the critical value(s) for each case.

[Figures for parts (a), (b) and (c) not reproduced]

4. Using the chi-squared distribution table, find the value of $k$ such that

(a) P(X1, < k) = 0.0t

(b) P(x1, > k) = o.es

(c) P(k < x2s < 9.39) = o.o4

5. (a) Find the mean and the standard deviation of a chi-squared distribution with 8 degrees of freedom.

(b) Which one of the following chi-squared distributions looks the most like a normal distribution?

(i) A chi-squared distribution with 1 degree of freedom

(ii) A chi-squared distribution with 2 degrees of freedom

(iii) A chi-squared distribution with 5 degrees of freedom

(iv) A chi-squared distribution with 10 degrees of freedom

6. A random sample of 30 observations from a normal population with variance $\sigma^2 = 8.3$ is found to have a sample variance $s^2 = 11.72$. Determine the chi-squared statistic from this experiment.

The chi-squared test can be used to test how good the fit is between observed frequencies and expected frequencies. Observed frequencies are the actual frequencies observed from a random sample. Expected frequencies are theoretical frequencies based on a distribution under the null hypothesis, which is presumed to be true until statistical evidence indicates otherwise.

As an example, what would we expect from flipping a coin 12 times? By chance alone, we would expect six heads and six tails. If we observe one head and eleven tails in this experiment, would this outcome be attributable merely to chance, or would it be due to the coin being biased? The chi-squared test can help provide an answer.

Before discussing the chi-squared test, we have several assumptions to make. First, frequency data is used

to represent the actual number of elements in each category. Second, categories are mutually exclusive, that


is, whatever is being tallied can only be in one cell and cannot overlap. Third, categorical data is a grouping

of data according to similar characteristics in a way to show the frequencies of each category.

Let us look at an example to see how we use the chi-squared test to determine whether the frequencies observed across the categories differ significantly from what is expected theoretically. Consider the tossing of a six-sided dice. We have the null hypothesis that the dice is fair, which is equivalent to the hypothesis that the distribution of outcomes is uniform. Suppose that the dice is thrown 60 times and each outcome is recorded. The observed frequency $o_i$ for each face of the dice is shown in the table below:

Faces    1           2          3           4          5          6
         o_1 = 12    o_2 = 8    o_3 = 14    o_4 = 7    o_5 = 9    o_6 = 10

The chi-squared test will compare the observed frequencies $o_i$ with the corresponding expected frequencies $e_i$. The table above lists the observed frequencies, and the expected frequencies need to be determined. To calculate the expected frequency for each outcome, we make use of the hypothesis that the outcome of a fair dice is uniformly distributed. Since the probability of each outcome is one-sixth and there are a total of 60 rolls of the dice, we have

Expected frequency $e_i = \frac{1}{6} \times 60 = 10$

Note that the expected frequencies are anticipated only in a theoretical sense. It is not realistic to expect the observed frequencies to match the expected frequencies perfectly. The table below lists the observed and expected frequencies for each category:

Faces    1           2           3           4           5           6
         o_1 = 12    o_2 = 8     o_3 = 14    o_4 = 7     o_5 = 9     o_6 = 10
         e_1 = 10    e_2 = 10    e_3 = 10    e_4 = 10    e_5 = 10    e_6 = 10

Now, we need to decide whether the observed frequencies are reasonably close to the expected frequencies or really different from them. The hypothesis to be tested is how well the observed frequencies fit a given pattern or a theoretical distribution. The test is called a goodness-of-fit test.

A useful measure of the overall discrepancy between the observed and expected frequencies is the chi-squared test statistic

$$\chi^2 = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i}$$

where $\chi^2$ is a value of a random variable whose sampling distribution is approximated very closely by the chi-squared distribution with $k - 1$ degrees of freedom, and $k$ is the number of categories. The symbols $o_i$ and $e_i$ represent the observed and expected frequencies respectively for the $i$th category.

For the chi-squared goodness-of-fit test, the number of degrees of freedom is the number of independent free choices which can be made in allocating values to the expected frequencies. In this example of tossing a dice, there are six expected frequencies (one for each face, that is, 1 to 6). Only five of the expected frequencies can vary independently; the sixth must take whatever value is required to satisfy the constraint on the total frequency. Thus, the degrees of freedom $\nu$ = number of categories - number of constraints. Here there are six categories and one constraint, so $\nu = 6 - 1 = 5$.

To calculate the chi-squared test statistic, we first subtract the expected frequency $e_i$ from the observed frequency $o_i$. Then we square the difference and divide the squared difference by the expected frequency $e_i$, before finally adding the quotients. This is done in the table below:

Faces    o_i    e_i    (o_i - e_i)    (o_i - e_i)^2    (o_i - e_i)^2 / e_i
1        12     10      2              4               0.4
2         8     10     -2              4               0.4
3        14     10      4             16               1.6
4         7     10     -3              9               0.9
5         9     10     -1              1               0.1
6        10     10      0              0               0
                                                       χ² = 3.4

This means the value of $\chi^2$ with 5 degrees of freedom is 3.4.

In the goodness-of-fit test, if the observed frequencies are the same as the expected frequencies, then $\chi^2 = 0$. Thus, if the $\chi^2$ value is small, there is a high degree of compatibility between the expected and observed frequencies, indicating a good fit. If the $\chi^2$ value is large, there is a low degree of matching between the two sets of frequencies and the fit is poor. This also implies that the critical region falls in the right tail of the chi-squared distribution. At the 10% significance level, we find the critical value $\chi^2 = 9.236$ for 5 degrees of freedom using the $\chi^2$ table. Since the calculated value $\chi^2 = 3.4$ is less than 9.236, we do not reject the hypothesis that the outcomes of the dice are uniformly distributed. In other words, the data are consistent with a fair dice.

Note: To perform a chi-squared test, the expected frequency for each category should be at least 5. This restriction may require combining adjacent categories, resulting in a reduction of the number of degrees of freedom.
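The dice calculation can be sketched in code as follows. The helper name `chi_squared_statistic` is illustrative, and the critical value 9.236 is simply copied from the table (5 degrees of freedom, 10% significance level) rather than computed.

```python
def chi_squared_statistic(observed, expected):
    """Sum of (o - e)**2 / e over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [12, 8, 14, 7, 9, 10]           # frequencies for faces 1 to 6
total = sum(observed)                      # 60 throws
expected = [total / 6] * 6                 # fair dice: 10 per face
statistic = chi_squared_statistic(observed, expected)

degrees_of_freedom = len(observed) - 1     # one constraint: the total frequency
critical_value = 9.236                     # table value, nu = 5, p = 0.9
reject_null = statistic > critical_value   # critical region is the right tail
print(round(statistic, 2), degrees_of_freedom, reject_null)
```

Because the statistic falls below the tabulated value, `reject_null` is false, which matches the decision reached in the text.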


Example 3

A quality supervisor at a glass manufacturing factory inspects a random sample of 60 sheets of glass to check for any minor defects. The number of flaws in each glass sheet is recorded. The results are as follows:

Number of flaws       0     1     2     3
Observed frequency    32    15    9     4

Use a 5% significance level to test the hypothesis that these data follow a Poisson distribution.

Solution

A test procedure is as follows.

Step 1: State the hypotheses
$H_0$: The number of flaws follows a Poisson distribution
$H_1$: The number of flaws does not follow a Poisson distribution

Step 2: Specify the significance level
Here $\alpha = 0.05$.

Step 3: Select the appropriate test statistic and calculate its value
Use the chi-squared goodness-of-fit test to determine whether the observed sample frequencies differ significantly from the expected frequencies specified in the null hypothesis.

The mean of the presumed Poisson distribution is unknown, so it must be estimated from the data by the sample mean,

$$\hat{\lambda} = \frac{\sum o_i x_i}{\sum o_i} = \frac{32(0) + 15(1) + 9(2) + 4(3)}{32 + 15 + 9 + 4} = \frac{45}{60} = 0.75$$

Hence with $\lambda = 0.75$,

$$P(X = x_i) = \frac{e^{-0.75}\,(0.75)^{x_i}}{x_i!}, \quad x_i = 0, 1, 2, 3$$

which gives the probability associated with each class. The corresponding expected frequency is obtained by multiplying the appropriate Poisson probability by the sample size $n = 60$.

x_i          P(X = x_i)    e_i
0            0.472         28.32
1            0.354         21.24
2            0.133          7.98
3 or more    0.041          2.46

If an expected frequency is less than 5, two or more classes can be combined. In the above situation the expected frequency in the last class is less than 5, so we should combine the last two classes to get

Number of flaws    Observed frequency    Expected frequency
0                  32                    28.32
1                  15                    21.24
2 or more          13                    10.44

The chi-squared value can now be calculated:

$$\chi^2 = \sum \frac{(o - e)^2}{e} = \frac{(32 - 28.32)^2}{28.32} + \frac{(15 - 21.24)^2}{21.24} + \frac{(13 - 10.44)^2}{10.44} = 2.94$$

Step 4: Determine the critical region
Since both the total frequency and the mean of the Poisson distribution are taken from the observed data, the number of degrees of freedom is $k - 2$. Here, we have 3 classes, thus the chi-squared statistic has $3 - 2 = 1$ degree of freedom. Using a significance level of 0.05, from the chi-squared distribution table, the critical value $\chi^2_{0.95}$ with 1 degree of freedom is 3.841.

Step 5: Make a decision
As $\chi^2 = 2.94 < 3.841$, we conclude that there is no real evidence to suggest the data do not follow a Poisson distribution.
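The steps of Example 3 can be sketched in code as follows. The sketch keeps the Poisson probabilities unrounded, so the statistic comes out slightly different from the 2.94 obtained with three-decimal probabilities; the helper name `poisson_pmf` is illustrative, and the critical value is copied from the table.

```python
import math

def poisson_pmf(x: int, lam: float) -> float:
    """P(X = x) for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** x / math.factorial(x)

flaws = [0, 1, 2, 3]
observed = [32, 15, 9, 4]
n = sum(observed)                                       # 60 sheets

# Estimate the Poisson mean from the sample.
lam = sum(x * o for x, o in zip(flaws, observed)) / n

# Classes after pooling: 0, 1, and "2 or more".
probabilities = [poisson_pmf(0, lam), poisson_pmf(1, lam)]
probabilities.append(1.0 - sum(probabilities))
expected = [n * p for p in probabilities]
pooled_observed = [32, 15, 9 + 4]

statistic = sum((o - e) ** 2 / e for o, e in zip(pooled_observed, expected))
degrees_of_freedom = len(expected) - 2                  # total and mean both estimated
critical_value = 3.841                                  # table value, nu = 1, p = 0.95
print(round(lam, 2), [round(e, 2) for e in expected], round(statistic, 2))
```

The conclusion is unchanged by the rounding difference: the statistic stays below 3.841, so the Poisson hypothesis is not rejected.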

Example 4

The lengths of 50 telephone calls are summarised in the table below. The sample mean is $\bar{x} = 14$ minutes and the sample standard deviation is $s = 6.4$ minutes. Determine whether there is significant evidence, at the 5% significance level, to reject the null hypothesis that the call length has a normal distribution.

Call length (in minutes)    Frequency
0-5                          4
5-10                         9
10-15                       16
15-20                       13
20-25                        5
25-30                        3

Solution

We proceed with the steps of a test procedure as follows:

Step 1: State the hypotheses
$H_0$: The telephone call lengths follow a normal distribution
$H_1$: The telephone call lengths do not follow a normal distribution

Step 2: Specify the significance level
Here $\alpha = 0.05$.

Step 3: Select the appropriate test statistic and calculate its value
Use the chi-squared goodness-of-fit test to determine whether the observed sample frequencies differ significantly from the expected frequencies specified in the null hypothesis.

The distribution of call lengths may be approximated by the normal distribution. The sample mean and sample standard deviation will be used for $\mu$ and $\sigma$ in calculating the $z$ values corresponding to the class boundaries. The expected frequency for each class (category) listed in the given table can be obtained from a normal curve. The $z$ values corresponding to the boundaries of the second class are

$$z_1 = \frac{5 - 14}{6.4} = -1.406, \qquad z_2 = \frac{10 - 14}{6.4} = -0.625$$

From the normal table, the area between $z_1 = -1.406$ and $z_2 = -0.625$ is

$$P(-1.406 < Z < -0.625) = P(Z < -0.625) - P(Z < -1.406) = 0.266 - 0.08 = 0.186$$

Thus, the expected frequency for the second class is $e_2 = 0.186 \times 50 = 9.3$.

The expected frequency for the first class interval is obtained by using the total area under the normal curve to the left of the boundary 5. For the last class interval, we can use the total area to the right of the boundary 25. All other expected frequencies can be found by the method described above for the second class. The expected frequency in each class is summarised in the table below. Note that we have combined adjacent classes in the table where the expected frequencies are less than 5. As a result, the total number of classes is reduced from 6 to 4.

Class boundaries    o_i    e_i
Below 10            13     13.3
10-15               16     14.8
15-20               13     13.2
Above 20             8      8.8

The following table shows the detailed calculations for the chi-squared value.

Class boundaries    o_i    e_i     (o_i - e_i)    (o_i - e_i)^2    (o_i - e_i)^2 / e_i
Below 10            13     13.3    -0.3           0.09             0.0068
10-15               16     14.8     1.2           1.44             0.0973
15-20               13     13.2    -0.2           0.04             0.0030
Above 20             8      8.8    -0.8           0.64             0.0727
                                                                   χ² = 0.180

Step 4: Determine the critical region
Altogether three constraints (total frequency, sample mean and standard deviation) have been taken from the sample data, so the number of degrees of freedom is $k - 3 = 4 - 3 = 1$. Using a significance level of 0.05, the critical value of chi-squared with 1 degree of freedom is 3.841.

Step 5: Make a decision
As $\chi^2 = 0.180 < 3.841$, we have no reason to reject the null hypothesis and conclude that the normal distribution offers a good fit for the distribution of telephone call lengths.
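The expected frequencies in Example 4 can also be obtained directly from the normal cumulative distribution function instead of a printed normal table. The sketch below uses `statistics.NormalDist` from the Python standard library. Because it works with unrounded probabilities, the expected frequencies and the statistic differ slightly from the tabulated 8.8 and 0.180, although the decision is the same.

```python
from statistics import NormalDist

n = 50
call_lengths = NormalDist(mu=14.0, sigma=6.4)   # sample mean and sd used for mu and sigma

# Pooled classes: below 10, 10-15, 15-20, above 20.
boundaries = [10.0, 15.0, 20.0]
observed = [13, 16, 13, 8]

# Cumulative probabilities at the class boundaries, padded with 0 and 1 for the open tails.
cumulative = [0.0] + [call_lengths.cdf(b) for b in boundaries] + [1.0]
expected = [n * (hi - lo) for lo, hi in zip(cumulative, cumulative[1:])]

statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
degrees_of_freedom = len(observed) - 3          # total, mean and sd all taken from the data
critical_value = 3.841                          # table value, nu = 1, p = 0.95
print([round(e, 1) for e in expected], round(statistic, 3))
```

Padding the cumulative probabilities with 0 and 1 is what turns the first and last classes into the open-ended "below" and "above" classes described in the solution.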

Exercise 6.2

1. Assume that a chi-squared goodness-of-fit test is conducted. Determine the critical value of the chi-squared test statistic for each of the following cases.
(a) Number of categories = 7, $\alpha = 0.01$
(b) Number of categories = 10, $\alpha = 0.10$

2. A random sample of 500 observations is obtained and distributed into 4 categories as follows:

Category    1     2      3      4
x_i         49    263    146    42

Use $\alpha = 0.05$ to test the null hypothesis $H_0$: $p_1 = 0.10$, $p_2 = 0.50$, $p_3 = 0.30$, $p_4 = 0.10$.

3. Three coins are tossed 150 times, and the observed frequencies of 0, 1, 2 and 3 heads per toss are 14, 43, 67 and 26 times respectively. Use a 5% significance level to test whether the three coins are balanced.

4. An experiment is to draw a card from a regular deck of 52 cards that has been thoroughly shuffled, and to record whether it is a spade, heart, diamond or club. This process is repeated 40 times, each time replacing the card just drawn. After 40 trials, 9 spades, 13 hearts, 11 diamonds and 7 clubs are obtained. Test the hypothesis that the deck is honest at the 10% significance level.

5. Each package of beans sold in the supermarket is supposed to mix red beans, mung beans, black beans and black-eyed beans in the ratio 5:3:1:1. A random sample of 400 mixed beans selected from these packages is found to have 210 red beans, 124 mung beans, 30 black beans and 36 black-eyed beans. Test the hypothesis that the package contains the mixed beans in the ratio 5:3:1:1 at the 0.05 significance level.

6. A boy buys a bag of 100 jelly beans. This bag has 5 different colours of jelly beans in it. Assume all five colours are equally likely to be put in the bag. The boy is curious about the colour distribution and opens the bag. He finds that he has 17 brown, 24 yellow, 10 red, 31 green and 18 white. Test the hypothesis that the colours of the jelly beans occur with equal frequency at a significance level of 5%.

7. The number of road accidents per week at a junction is monitored by the public traffic department. The table below shows the frequency of accidents per week in 60 weeks.

Number of accidents    0     1     2     3
Observed frequency     28    15    12    5

(a) Determine the mean number of accidents per week.
(b) Test the hypothesis that the data follow a Poisson distribution at the 5% significance level.


8. The following frequency distribution table represents the number of days during a year that a total of 50 employees at a company are absent from work due to illness. It is thought that the data follow a normal distribution with population mean $\mu = 7$ and standard deviation $\sigma = 3$.

Number of days absent    Number of employees
0-3                       4
3-6                      13
6-9                      24
9-12                      7
12-15                     2

Test the goodness-of-fit between the observed class frequencies and the corresponding expected frequencies of a normal distribution at the 5% significance level.

9. A paper shop has several retail stores in a city. The following table shows the number of boxes shipped per day for the last 100 days.

Number of packages shipped    Number of days
0-5                            5
5-10                          13
10-15                         28
15-20                         23
20-25                         18
25-30                         10
30-35                          3

(a) Calculate the sample mean and sample standard deviation of the number of packages shipped per day.

(b) Use a 5% significance level to test the goodness of fit between the observed class frequencies and the corresponding expected frequencies of a normal distribution.

10. The table below shows the number of rain days in January for the years from 1953 to 2004.

Number of rain days    0    1    2     3     4    5
Observed frequency     9    7    14    15    6    1

(a) Find the mean number of rain days.

(b) Test the hypothesis that the recorded data may be fitted by the Poisson distribution at the 10% significance level.

11. A recent study reports the number of hours of personal computer usage per week for a sample of 60 persons. Excluded from the study are people who work in an office and use the computer as part of their work.

1.1  6.7  2.2  2.6  9.8  6.4  4.9  5.2  4.5   9.3  7.9  4.6
4.3  4.5  9.3  5.3  6.3  8.8  6.5  0.6  5.2   6.6  9.3  4.3
6.3  2.1  2.7  0.4  5.1  5.6  5.4  4.8  2.1  10.1  1.3  5.6
2.4  2.4  4.7  1.7  2.0  6.7  3.7  3.3  1.1   2.7  6.7  6.5
4.3  9.7  7.7  5.2  1.7  8.5  4.2  5.5  9.2   8.5  6.0  8.1

(a) Organise the data into a frequency distribution.

(b) Compute the sample mean and sample standard deviation of the number of hours of computer usage per week.

(c) It is thought that the data follow a normal distribution. Test the hypothesis at the 5% significance level.


When two attributes (variables) are observed for each element of a random sample, the data can be simultaneously classified with respect to these attributes in a two-way classification table called a contingency table. We can then determine whether there is a significant association between the two attributes.

Suppose we take a random sample of 200 persons and classify them based on gender as well as whether these persons own handphones. The observed frequencies are presented in the following 2 x 2 contingency table.

          Own handphone (yes)    Own handphone (no)    Total
Male       70                     60                   130
Female     30                     40                    70
Total     100                    100                   200

A contingency table can be of any size. In general, a contingency table with $r$ rows and $c$ columns is denoted as an $r \times c$ table. The row and column totals in the above table are called marginal frequencies. It is common practice to refer to each possible outcome of an experiment as a cell. Hence in our example we have four cells.

Let us test the hypothesis of independence between a person's gender and a person's possession of a handphone. To perform this test, we first calculate the expected frequencies for each of the four cells of the above 2 x 2 contingency table under the assumption that the hypothesis is true.

Let $M$ represent the event that an individual selected from the sample is male.
Let $Y$ represent the event that an individual selected owns a handphone.

If $M$ and $Y$ are independent events, $P(M \cap Y) = P(M)P(Y)$. But $P(M \cap Y) = \frac{e_{11}}{200}$, $P(M) = \frac{130}{200}$ and $P(Y) = \frac{100}{200}$. Thus, we have

$$\frac{e_{11}}{200} = \left(\frac{130}{200}\right)\left(\frac{100}{200}\right)$$

which we can rearrange as

$$e_{11} = \frac{130 \times 100}{200} = \frac{(\text{First row total})(\text{First column total})}{\text{Total sample size}}$$

where $e_{11}$ is the expected frequency for the cell in row 1 and column 1.

The general formula for obtaining the expected frequency of any cell is given by

$$\text{Expected frequency} = \frac{(\text{Row total})(\text{Column total})}{\text{Total sample size}}$$

The expected frequency for each cell is recorded in parentheses beside the actual observed value in the table shown below.

          Own handphone (yes)    Own handphone (no)    Total
Male       70 (65)                60 (65)              130
Female     30 (35)                40 (35)               70
Total     100                    100                   200

Note that the expected frequencies in any row or column add up to the appropriate marginal total. We need to calculate only one expected frequency in the top row of the table and can then find the others by subtraction. The number of degrees of freedom associated with the chi-squared test used here is equal to the number of cell frequencies that may be filled in freely when we are given the marginal totals and the grand total, and in this illustration that number is 1. A simple formula providing the correct number of degrees of freedom is

$$\nu = (r - 1)(c - 1)$$

Hence, for our example, $\nu = (2 - 1)(2 - 1) = 1$ degree of freedom.

We want to measure how much the observed frequencies differ collectively from their corresponding expected frequencies. We do this with the chi-squared test statistic

$$\chi^2 = \sum_{i} \frac{(o_i - e_i)^2}{e_i}$$

where the summation extends over all the cells in the $r \times c$ contingency table.

We have

$$\chi^2 = \frac{(70 - 65)^2}{65} + \frac{(60 - 65)^2}{65} + \frac{(30 - 35)^2}{35} + \frac{(40 - 35)^2}{35} = 2.1978$$

Using a chi-squared table, we can see that for $\nu = 1$, the critical value at the 5% significance level is $\chi^2 = 3.841$. Since the calculated value of $\chi^2$, 2.1978, does not fall within the critical region, we do not reject the hypothesis that there is no relationship between a person's gender and the person's possession of a handphone.
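The handphone calculation can be sketched in code as follows. The function name `expected_frequencies` is illustrative; it applies the (row total)(column total)/(total sample size) rule to every cell, and the critical value is again read from the table rather than computed.

```python
def expected_frequencies(table):
    """Expected cell counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in column_totals] for r in row_totals]

# Rows: male, female. Columns: owns a handphone (yes), (no).
observed = [[70, 60],
            [30, 40]]
expected = expected_frequencies(observed)

statistic = sum(
    (o - e) ** 2 / e
    for observed_row, expected_row in zip(observed, expected)
    for o, e in zip(observed_row, expected_row)
)
degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)
critical_value = 3.841                     # table value, nu = 1, p = 0.95
print(expected, round(statistic, 4), degrees_of_freedom)
```

The function works for a table of any size, so the same sketch can be reused for the larger contingency tables in the exercises by changing `observed` and the critical value.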

Example 5

The following data show the attitude of housewives in various parts of the country to a certain brand of detergent.

Attitude       North    Central    South
Like           46       21         31
Indifferent    25       58         35
Dislike        16       37         42

Test the hypothesis that the attitude to the newly introduced detergent is independent of geographical area of residence at the 1% significance level.

Solution

The given table is arranged to include the row and column totals.

Attitude       North    Central    South    Total
Like           46        21         31       98
Indifferent    25        58         35      118
Dislike        16        37         42       95
Total          87       116        108      311

Step 1: State the hypotheses
$H_0$: There is no association between attitude and location
$H_1$: There is an association between attitude and location

Step 2: Specify the significance level
Given $\alpha = 0.01$.

Step 3: Select the appropriate test statistic and calculate its value
Use the chi-squared test for independence to determine whether there is any significant association between the two categorical variables.

As with the goodness-of-fit test described earlier, the key idea of the chi-squared test for independence is a comparison of observed and expected frequencies. The expected frequency for each cell of the table can be generated using the following formula:

$$\text{Expected frequency} = \frac{(\text{Row total})(\text{Column total})}{\text{Total sample size}}$$

In fact, for a 3 x 3 contingency table, only four expected values in the top two rows of the table need to be calculated; the remaining five expected values are found by subtraction. For example, the expected frequency for attitude "like" and location "north" is $\frac{98 \times 87}{311} = 27.41$. In this way, the table of both observed and expected frequencies is as shown below.

Attitude       North         Central       South         Total
Like           46 (27.41)    21 (36.55)    31 (34.04)     98
Indifferent    25 (33.01)    58 (44.01)    35 (40.98)    118
Dislike        16 (26.58)    37 (35.44)    42 (32.98)     95
Total          87            116           108           311

The number of degrees of freedom is $\nu = (r - 1)(c - 1) = (3 - 1)(3 - 1) = 4$.

The chi-squared test statistic is

$$\chi^2 = \sum_{i} \frac{(o_i - e_i)^2}{e_i} = \frac{(46 - 27.41)^2}{27.41} + \frac{(21 - 36.55)^2}{36.55} + \frac{(31 - 34.04)^2}{34.04} + \frac{(25 - 33.01)^2}{33.01} + \frac{(58 - 44.01)^2}{44.01} + \frac{(35 - 40.98)^2}{40.98} + \frac{(16 - 26.58)^2}{26.58} + \frac{(37 - 35.44)^2}{35.44} + \frac{(42 - 32.98)^2}{32.98} = 33.5057$$

Step 4: Determine the critical region
From the chi-squared table, the critical value of $\chi^2$ for 4 degrees of freedom at the 1% level is 13.28.

Step 5: Make a decision
As the calculated value 33.51 is greater than the critical value 13.28, we conclude that there is evidence to reject $H_0$; that is, attitude to the new detergent and geographical area of residence are not independent.
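The remark that only four expected values need to be calculated for a 3 x 3 table can be illustrated with a short sketch. It computes the $(r - 1)(c - 1)$ free cells from the formula and fills the remaining cells by subtraction from the marginal totals. The variable names are illustrative, and because no intermediate rounding is done the statistic differs slightly from the 33.5057 obtained with two-decimal expected values.

```python
observed = [[46, 21, 31],    # Like
            [25, 58, 35],    # Indifferent
            [16, 37, 42]]    # Dislike

rows, cols = len(observed), len(observed[0])
row_totals = [sum(row) for row in observed]
column_totals = [sum(column) for column in zip(*observed)]
grand_total = sum(row_totals)

expected = [[0.0] * cols for _ in range(rows)]

# Step 1: the (r - 1)(c - 1) = 4 free cells come from the formula.
for i in range(rows - 1):
    for j in range(cols - 1):
        expected[i][j] = row_totals[i] * column_totals[j] / grand_total

# Step 2: the last column of those rows follows from the row totals.
for i in range(rows - 1):
    expected[i][cols - 1] = row_totals[i] - sum(expected[i][:cols - 1])

# Step 3: the last row follows from the column totals.
for j in range(cols):
    expected[rows - 1][j] = column_totals[j] - sum(expected[i][j] for i in range(rows - 1))

statistic = sum(
    (observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
    for i in range(rows)
    for j in range(cols)
)
degrees_of_freedom = (rows - 1) * (cols - 1)
critical_value = 13.28       # table value, nu = 4, p = 0.99
print([[round(e, 2) for e in row] for row in expected], round(statistic, 2))
```

The count of cells filled in step 1 equals the number of degrees of freedom, which is the point of the subtraction argument in the text.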

Exercise 6.3

1. An experiment has 500 observations and the data are classified into a 4 x 6 contingency table. Suppose we conduct a chi-squared test of independence at the 1% significance level. Assume the calculated value of the chi-squared test statistic is 39.2.

(a) Determine the number of degrees of freedom.

(b) Find the critical value for the chi-squared test of independence.

(c) Determine whether the chi-squared test value falls into the critical region.


2, The following3 x 2 contingency table contains observed values for a sample of size 250. Determine

whether the row and column variables are independent using the chi-squared test with a = 0.025.

X ,Y

AB

C

25

55

63

)/32

38

3. A research group performs a study on gender and handedness (right- or left-handed). 800 individuals

are randomly chosen from a very large population. The following contingency table displays the

distribution of the two categories.

Right-handed Left-handed

Male 344 72

Female 352 32

Test the hypothesis that gender is independent of handedness at the 5% significance level.

4. Consider a sample of 200 customers. For each customer, we have information on gender and preference

of food. A contingency table for these data is shown below.

          Indian    Japanese    Western
Male      50        20          40
Female    20        50          20

Carry out a test, at the 5% significance level, to determine whether there is any association between gender and preference of food.

5. In an experiment to study the association between diabetes and smoking habits, the following data are obtained.


               Non-smokers    Moderate smokers    Heavy smokers
Diabetes       25             30                  18
No diabetes    40             21                  16

Using a 1% significance level, test the hypothesis that there is no association between cigarette smoking and the risk of diabetes.

6. A camera manufacturer has four suppliers of lenses. The table below shows the numbers of good and defective lenses supplied by each supplier.

              Good    Defective
Supplier 1    95      5
Supplier 2    180     15
Supplier 3    134     16
Supplier 4    138     7

Test, at the 5% significance level, whether the supplier is associated with the lens quality. What is your advice to the purchasing department based on the test result?


7. The table shows the results of a taste test in which a random sample of 500 people in two age groups is asked which of four formulations of a chocolate drink they prefer.

Age group    Formulation A    Formulation B    Formulation C    Formulation D
7-25         30               69               116              78
26-50        28               36               70               73

Use a 0.01 significance level to test whether the preference for the different formulations changes with age.

8. Fruit trees are subject to a bacteria-caused disease. Several different treatments for this disease are adopted. Treatment A: no action taken; treatment B: careful removal of clearly affected branches; treatment C: frequent spraying of the leaves with an antibiotic in addition to careful removal of clearly affected branches. There are a few different outcomes from the disease. Outcome 1: the tree dies in the same year as the disease is noticed; outcome 2: the tree dies 2-4 years after the disease is noticed; outcome 3: the tree survives beyond 4 years. Each of a group of 200 trees is assigned to one of the treatments and over the next few years the outcome is recorded. The results are displayed in the following contingency table.

           Treatment
Outcome    A     B     C
1          37    24    17
2          16    20    32
3          3     15    36

Determine whether there is any substantial evidence to conclude that outcome is independent of treatment. Use a 5% significance level for this test.

9. The table below shows the observed distribution of blood types A, B, AB and O in three samples of Malays living in Kedah, Selangor and Johor.

Blood type    Kedah    Selangor    Johor
A             14       205         41
B             16       184         37
AB            3        51          11
O             17       232         51

Test, at the 5% significance level, whether the distribution of blood types differs across the three states.

10. A manufacturer operates four assembly machines on three separate shifts daily. The table below gives

the number of machine breakdowns recorded in the past year.

                Machine 1    Machine 2    Machine 3    Machine 4
First shift     75           89           43           28
Second shift    90           108          63           59
Third shift     141          175          121          141

Determine whether these data provide sufficient evidence, at the 2.5% significance level, to infer that machine breakdown is independent of shift.


Summary

1. The chi-squared distribution has one parameter, called the number of degrees of freedom.

2. The chi-squared distribution curve lies to the right of the vertical axis and is skewed to the right.

3. In a goodness-of-fit test, we test the null hypothesis that the observed frequencies follow a certain distribution.

4. In a test of independence, we test the null hypothesis that two attributes are independent.

5. General test procedure in a chi-squared test:
• State the hypotheses.
• Specify the significance level.
• Calculate the value of the chi-squared test statistic $\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$ (combine any adjacent classes where necessary).
• Determine the critical region based on the number of degrees of freedom and the significance level.
• Make a decision.
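The calculation and decision steps of this procedure can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not a definitive implementation: the function names are ours, the rule "merge adjacent classes until every expected frequency is at least 5" is a common rule of thumb assumed here for the instruction to combine classes, and the observed and expected frequencies are hypothetical numbers chosen only to show the merging. The critical value must be looked up by the user for the number of classes that remain after merging.

```python
def combine_small_classes(observed, expected, minimum=5.0):
    """Merge adjacent classes until each expected frequency is >= minimum.

    The threshold of 5 is an assumed rule of thumb, not taken from the text.
    """
    merged_obs, merged_exp = [], []
    acc_o = acc_e = 0.0
    for o, e in zip(observed, expected):
        acc_o += o
        acc_e += e
        if acc_e >= minimum:
            merged_obs.append(acc_o)
            merged_exp.append(acc_e)
            acc_o = acc_e = 0.0
    if acc_e > 0:
        # A small class is left over at the end: fold it into the last class.
        if merged_obs:
            merged_obs[-1] += acc_o
            merged_exp[-1] += acc_e
        else:
            merged_obs.append(acc_o)
            merged_exp.append(acc_e)
    return merged_obs, merged_exp


def chi_squared_test(observed, expected, critical_value):
    """Return (statistic, reject_h0) after combining small classes."""
    obs, exp = combine_small_classes(observed, expected)
    statistic = sum((o - e) ** 2 / e for o, e in zip(obs, exp))
    return statistic, statistic > critical_value


# Hypothetical data: six classes, the last with expected frequency below 5.
observed = [18, 22, 30, 20, 7, 3]
expected = [20, 25, 28, 18, 6, 3]

combined_obs, combined_exp = combine_small_classes(observed, expected)

# Five classes remain, so 4 degrees of freedom if no parameter is estimated;
# 9.488 is the tabulated 5% critical value for 4 degrees of freedom.
statistic, reject = chi_squared_test(observed, expected, critical_value=9.488)

print("combined observed:", combined_obs)
print("combined expected:", combined_exp)
print("test statistic:", round(statistic, 3))
print("reject H0:", reject)
```

Merging is done before the statistic is computed, so the number of degrees of freedom is counted from the combined classes rather than from the original ones.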

Revision Exercise

1. (a) Find P(0.83 < x1 < 12.8).

(b) Determine the value of k such that P(6.447 I X'r, < k) = O.Oag.

2. Three identical dice are thrown 150 times. The number of dice whose scores on the top faces are odd is recorded at each throw. The results are as follows:

Number of odd scores    0     1     2     3
Frequency               33    59    43    15

Using a 5% significance level, test the hypothesis that all three dice are unbiased.

3. A departmental store sells men's shirts and stocks these shirts in five different sizes: S, M, L, XL and XXL. The number of shirts sold each week is recorded.

Size    Number of shirts
S       21
M       24
L       39
XL      25
XXL     13

Test, at a 10% significance level, the hypothesis that the number of shirts sold is uniformly distributed.

4. Cars heading to a certain junction may go straight, turn left or turn right. A road transport department officer asserts that 60% of the cars will go straight at the intersection and that, of the remaining 40%, equal proportions will turn left and right. One hundred cars are randomly monitored and it is found that 51 cars go straight, 17 cars turn left and 32 cars turn right. Test, at the 5% significance level, the hypothesis that the proportions of cars going straight, turning left and turning right do not differ significantly from those asserted by the officer.


5. A pharmaceutical company conducts a trial on 200 patients to determine the effectiveness of a new cough remedy. Of these patients, 100 are randomly selected to be given the standard cough remedy and the remaining 100 are assigned the new cough remedy. The results are recorded as shown.

               Standard cough remedy    New cough remedy
No relief      53                       37
Some relief    34                       44
Full relief    13                       19

Carry out a test, at a significance level of 5%, to investigate whether the two cough remedies are equally effective.

6. A football fan keeps a record of the goals scored per match by his favourite team. The results are shown below.

Goals obtained per match
11 16 25 14
34 Number of matches

(a) Compute the mean number of goals scored per match.

(b) Using a 5% significance level, perform a test of the hypothesis that the number of goals per match has a Poisson distribution.

7. The following table gives the cumulative frequency distribution of the lives (in years) of 40 notebook batteries tested by a battery manufacturer.

Battery life not greater than    1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0
Cumulative frequency             0 2 J l 22 32 3t 40

Based on previous experience, it is believed that a normal distribution with mean 3.5 years and standard deviation 0.7 year provides a good approximation. Perform a chi-squared test, at the 5% significance level, to determine whether the normal distribution gives a good fit for these data.

8. The table below shows the frequency distribution of marks for a paper obtained by 178 candidates.

Mark, x        Number of candidates
50 < x < 60    5
40 < x < 50    19
30 < x < 40    34
20 < x < 30    63
10 < x < 20    47
0 < x < 10     10

The population mean and standard deviation of the distribution of marks for the paper are 26.0 and 11.5 respectively. Test, at the 10% significance level, the hypothesis that the distribution of marks for the paper is normal.

9. A botanist sows three seeds in each of 80 pots. The number of seeds which germinate in each pot is recorded. The results for all 80 pots are given in the following table.

Number of seeds that germinate    0     1     2     3
Number of pots                    25    20    29    6

(a) Estimate the probability that an individual seed germinates.

(b) Using a 1% significance level, test the hypothesis that the data may be fitted by a binomial distribution.

10. The distribution of marks for a paper in an examination has mean $\mu$ and standard deviation $\sigma$. Each candidate is assigned one of the five grades A, B, C, D, E as follows:

Mark, x                                                      Grade
$x \ge \mu + \frac{3\sigma}{2}$                              A
$\mu + \frac{\sigma}{2} \le x < \mu + \frac{3\sigma}{2}$     B
$\mu - \frac{\sigma}{2} \le x < \mu + \frac{\sigma}{2}$      C
$\mu - \frac{3\sigma}{2} \le x < \mu - \frac{\sigma}{2}$     D
$x < \mu - \frac{3\sigma}{2}$                                E

The table below summarises the grades of a random sample of 198 candidates.

Grade                   A     B     C     D     E
Number of candidates    17    55    81    33    12

Determine, at the 1% significance level, the adequacy of a normal distribution as a model for these data.

11. The lengths (in millimetres) in a random sample of 50 leaves of a certain plant are recorded as follows:

145 133 125 157 165 138 143 151 148 132
155 136 144 158 147 152 140 148 146 150
138 177 165 118 154 126 163 121 140 168
163 135 147 153 146 140 173 142 135 138
156 147 142 128 144 145 151 135 161 150

Test the hypothesis that the leaf length can be approximately modelled by a normal distribution. Use a 0.05 significance level.


12. The table below shows the number of individuals exposed to a certain virus and the number of individuals who develop the disease.

                          Development of disease
                          Yes    No
Exposure to virus   Yes   44     116
                    No    19     128

Conduct a test of hypothesis, at the 1% significance level, to determine whether there is an association between the exposure to the virus and the development of the disease.

13. The table below shows the number of males and females in each of three employment categories at a manufacturing company.

          Managerial    Support    Worker
Male      10            39         285
Female    6             52         624

Using a 1% significance level, test whether there is any association between gender and employment categories.

14. A researcher in a study of heart disease in males links subjects to socioeconomic status and smoking habits. The results are summarised in the contingency table below.

Socioeconomic status

High Middle Low

Current 66 29)T9Ktng Former t 19 27hablts Never gg lz

55

36

30

Perform a chi-squared test of association between smoking habits and socioeconomic status. Use a 2.5% significance level.

15. A hypermarket wants to study the relationship between the method of payment and the age group of its customers. A random sample of 250 customers is taken and the results are summarised in the table below.

                         Age group
                         18-25    26-35    36-45    Over 46
Payment method   Card    18       36       25       30
                 Cash    14       27       33       67

Carry out a test at the 5% significance level to find out whether the method of payment is independent of age group.


16. The School of Biological Sciences of a university records the level of exposure to a certain pollutant and the number of brain abnormalities for laboratory mice. The data are summarised in the table below.

                                  Number of brain abnormalities
Level of exposure to pollutant    0-2    3-4    5-6
High                              12     18     39
Medium                            8      7      13
Low                               7      8      8

Test, at the 5% significance level, whether there is an association between the level of exposure to the pollutant and the number of brain abnormalities found in the laboratory mice.

17. The table below summarises the number of hours of sleep at night for a random sample of adults of different age groups.

             Number of hours of sleep
Age group    Less than 6    6 to 8    More than 8
25-44        41             85        70
45-54        34             77        62
≥ 55         76             69        43

Carry out a test, at the 1% significance level, to determine whether the number of hours of sleep is independent of the age of an adult.

18. A plant expert collects samples of rice from a large field of 600 plots. One part of his investigation is based on the sterility observed and the genotype used for each plot.

              Genotype
Sterility     I      II     III    IV
No problem    30     21     19     16
Moderate      102    90     120    77
Severe        18     39     11     57

Test, at a 1% significance level, whether sterility is independent of genotype.