using statistics to make inferences 8

8.11

Using Statistics To Make Inferences 8

Summary

Contingency tables.

Goodness of fit test.

Saturday 22 April 2023 06:47 AM

8.22

Goals

To assess contingency tables for independence.

To perform and interpret a goodness of fit test.

Practical

Construct and analyse contingency tables.

8.33

Recall

To compare a population and sample variance we employed? χ2

8.44

Today

The probability approach from last week is employed to tell if “observed” data conforms to the pattern “expected” under a given model.

8.55

Categorical Data - Example

Assessed intelligence of athletic and non-athletic schoolboys.

            bright    stupid    Total
athletic       581       567     1148
lazy           209       351      560
Total          790       918     1708

K. Pearson “On The Relationship Of Intelligence To Size And Shape Of Head, And To Other Physical And Mental Characters”, Biometrika, 1906, 5, 105-146, data on page 144.

http://biomet.oxfordjournals.org/content/5/1-2/105.full.pdf


8.66

Procedure

1. Formulate a null hypothesis. Typically the null hypothesis is that there is no association between the factors.

2. Calculate expected frequencies for the cells in the table on the assumption that the null hypothesis is true.

3. Calculate the chi-squared statistic. This is for an r x c table with entries in row i and column j.

$$\chi^2_{calc} = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(\text{observed}_{i,j} - \text{expected}_{i,j})^2}{\text{expected}_{i,j}}$$

8.77

Procedure

4. Compare the calculated statistic with tabulated values of the chi-squared distribution with ν degrees of freedom.

ν = (rows ‑ 1)(columns ‑ 1) = (r – 1)(c – 1)
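The four steps of the procedure can be sketched in a few lines of Python. This is only an illustrative sketch, not the SPSS route used later in these slides; the function name `chi_squared` is an assumption made for this example, and the data are Pearson's schoolboy counts from the earlier slide.

```python
def chi_squared(observed):
    """Return (statistic, degrees of freedom, expected table) for an r x c table.

    `observed` is a list of rows of counts. The name and signature are
    hypothetical, chosen only for this sketch.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    # Step 2: expected frequency = row total x column total / grand total.
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

    # Step 3: sum (observed - expected)^2 / expected over every cell.
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )

    # Step 4: degrees of freedom for comparison with tabulated values.
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof, expected


# Rows: athletic, lazy. Columns: bright, stupid.
statistic, dof, expected = chi_squared([[581, 567], [209, 351]])
print(f"chi-squared = {statistic:.3f} with {dof} degree(s) of freedom")
```

The statistic is then compared with the tabulated value for the returned degrees of freedom, as in step 4.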

8.88

Key Assumptions

1. Independence of the observations. The data found in each cell of the contingency table used in the chi-squared test must be independent observations and non-correlated.

2. Large enough expected cell counts. As described by Yates et al., "No more than 20% of the expected counts are less than 5 and all individual expected counts are 1 or greater" (Yates, Moore & McCabe, 1999, The Practice of Statistics, New York: W.H. Freeman p. 734).

8.99

Key Assumptions

3. Randomness of data. The data in the table should be randomly selected.

4. Sufficient Sample Size. It is also generally assumed that the sample size for the entire contingency table is sufficiently large to prevent falsely accepting the null hypothesis when the null hypothesis is false.

8.1010

Example

Assessed intelligence of athletic and non-athletic schoolboys.

Observed

            bright    stupid    Total
athletic       581       567     1148
lazy           209       351      560
Total          790       918     1708

8.1111

Probabilities

            bright    stupid    Total
athletic       581       567     1148
lazy           209       351      560
Total          790       918     1708

The probability a random boy is athletic is

$$\frac{1148}{1708} = 0.6721$$

The probability a random boy is bright is

$$\frac{790}{1708} = 0.4625$$

Assuming independence, the probability a random boy is both athletic and bright is

$$\frac{1148}{1708} \times \frac{790}{1708} = 0.3109$$

For 1708 respondents the expected number of athletic bright boys is

$$1708 \times \frac{1148}{1708} \times \frac{790}{1708} = \frac{1148 \times 790}{1708} = 530.98$$

8.1212

Expected

            bright    stupid    Total
athletic    530.98               1148
lazy                              560
Total          790       918     1708

The expected number of athletic bright boys is

$$\frac{1148 \times 790}{1708} = 530.98$$

8.1313

Expected

            bright    stupid    Total
athletic    530.98         ?     1148
lazy                              560
Total          790       918     1708

The expected number of athletic stupid boys is

8.1414

Expected

            bright    stupid    Total
athletic    530.98    617.02     1148
lazy                              560
Total          790       918     1708

The expected number of athletic stupid boys is

1148 – 530.98 = 617.02

8.1515

Expected

            bright    stupid    Total
athletic    530.98    617.02     1148
lazy             ?                560
Total          790       918     1708

The expected number of lazy bright boys is

8.1616

Expected

            bright    stupid    Total
athletic    530.98    617.02     1148
lazy        259.02         ?      560
Total          790       918     1708

The expected number of stupid lazy boys is

8.1717

Expected

            bright    stupid    Total
athletic    530.98    617.02     1148
lazy        259.02    300.98      560
Total          790       918     1708

The expected number of stupid lazy boys is

918 – 617.02 = 300.98

8.1818

Expected

            bright    stupid    Total
athletic    530.98    617.02     1148
lazy        259.02    300.98      560
Total          790       918     1708
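As a cross-check on the expected table, the short sketch below rebuilds it from the row and column totals alone, and confirms that once one cell is known the other three follow by subtraction. The dictionary layout and variable names are assumptions made for illustration.

```python
# Marginal totals taken from the observed table.
row_totals = {"athletic": 1148, "lazy": 560}
col_totals = {"bright": 790, "stupid": 918}
grand_total = 1708

# Under independence: expected = row total x column total / grand total.
expected = {
    (row, col): r_total * c_total / grand_total
    for row, r_total in row_totals.items()
    for col, c_total in col_totals.items()
}

for (row, col), value in expected.items():
    print(f"{row:>8} {col:>6} {value:8.2f}")

# Only one cell is free: the rest follow from the margins by subtraction.
first_cell = expected[("athletic", "bright")]
athletic_stupid = row_totals["athletic"] - first_cell
lazy_bright = col_totals["bright"] - first_cell
lazy_stupid = col_totals["stupid"] - athletic_stupid
```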

8.1919

χ2

$$\chi^2_{calc} = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$$

$$\chi^2_{calc} = \frac{(581 - 530.98)^2}{530.98} + \frac{(567 - 617.02)^2}{617.02} + \frac{(209 - 259.02)^2}{259.02} + \frac{(351 - 300.98)^2}{300.98} = 26.73$$

$$\nu = (r - 1)(c - 1) = 1 \times 1 = 1$$

Observed

            bright    stupid    Total
athletic       581       567     1148
lazy           209       351      560
Total          790       918     1708

Expected

            bright    stupid    Total
athletic    530.98    617.02     1148
lazy        259.02    300.98      560
Total          790       918     1708

Only one cell is free.

8.2020

χ2

As a general rule, to employ this statistic all expected frequencies should exceed 5.

If this is not the case categories are pooled (merged) to achieve this goal. See the Prussian data later.

8.2121

Conclusion

$$\chi^2_{calc} = 26.73, \quad \nu = 1$$

ν    p=0.1    p=0.05   p=0.025   p=0.01   p=0.005   p=0.002
1    2.706    3.841    5.024     6.635    7.879     9.550

$$\chi^2_{1}(0.05) = 3.84$$

The result is significant (26.73 > 3.84) at the 5% level. So we reject the hypothesis of independence between athletic prowess and intelligence.
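For readers who prefer a p value to a table lookup, here is a minimal sketch that works for one degree of freedom only. It relies on the fact that a chi-squared variable with ν = 1 is the square of a standard normal variable, so the upper tail probability is erfc(sqrt(x/2)). The function name is hypothetical.

```python
from math import erfc, sqrt


def chi_squared_p_value_one_dof(statistic):
    """Upper tail probability of the chi-squared distribution, 1 d.f. only."""
    return erfc(sqrt(statistic / 2))


# The tabulated 5% critical value should give a tail probability near 0.05.
p_at_critical = chi_squared_p_value_one_dof(3.841)

# The calculated statistic for the schoolboy data.
p_value = chi_squared_p_value_one_dof(26.73)

print(f"p at 3.841 = {p_at_critical:.4f}")
print(f"p at 26.73 = {p_value:.2e}")
```

Because the p value is far below 0.001, it would be reported as p < .001 rather than as a rounded zero.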

8.2222

SPSS

Raw data

Note

v1 are the row labels

v2 are the column labels

v3 is the frequency for each cell

8.2323

SPSS

Data > Weight Cases

Since frequency data has been input, it is necessary to weight the cases. This is essential; do not use percentages.

8.2424

SPSS

Analyze > Descriptive Statistics > Crosstabs

Set row and column variables.

Frequencies already set.

8.2525

SPSS

Select chi-square

8.2626

SPSS

Select

Observed – input data

Expected – output data, under the model

8.2727

SPSS

V1 * V2 Crosstabulation

                                        V2
                                   bright    stupid     Total
V1      athletic  Count               581       567      1148
                  Expected Count    531.0     617.0    1148.0
        lazy      Count               209       351       560
                  Expected Count    259.0     301.0     560.0
Total             Count               790       918      1708
                  Expected Count    790.0     918.0    1708.0

Expected cell frequencies

Expected under the model.

8.2828

SPSS

Chi-Square Tests

                            Value      df   Asymp. Sig.   Exact Sig.   Exact Sig.
                                            (2-sided)     (2-sided)    (1-sided)
Pearson Chi-Square          26.736b    1    .000
Continuity Correction(a)    26.204     1    .000
Likelihood Ratio            26.973     1    .000
Fisher's Exact Test                                       .000         .000
N of Valid Cases            1708

a. Computed only for a 2x2 table

b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 259.02.

Pearson Chi Square is the required statistic

Do not report p = .000, rather p < .001

Note Fisher’s exact test, only available in SPSS for 2x2 tables (see next slide).


8.2929

What If We Have Small Cell Counts?

Fisher's exact test

The Fisher's exact test is used when you want to conduct a chi-square test but one or more of your cells has an expected frequency of five or less. Remember that the chi-square test assumes that each cell has an expected frequency of five or more, but the Fisher's exact test has no such assumption and can be used regardless of how small the expected frequency is. In SPSS, unless you have the SPSS Exact Test Module, you can only perform a Fisher's exact test on a 2x2 table, and these results are presented by default.
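To make the idea concrete, here is a minimal sketch of a two-sided Fisher's exact test for a 2x2 table. With the margins fixed, the top left count follows a hypergeometric distribution, and the p value sums the probabilities of all tables no more likely than the one observed. That is one common definition of the two-sided p value and is assumed here; it is not taken from the slides or from SPSS documentation. The function name and the small example table are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def probability(x):
        # Hypergeometric probability that the top left cell equals x.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = probability(a)

    # Sum over every table that is no more probable than the observed one.
    # The small tolerance guards against floating point ties.
    return sum(
        probability(x)
        for x in range(lowest, highest + 1)
        if probability(x) <= p_observed * (1 + 1e-9)
    )


# Hypothetical small table, invented for illustration only.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
print(f"p = {p_value:.4f}")
```

No expected frequency in this example reaches 5, so the chi-squared approximation would not be appropriate, whereas the exact calculation is still valid.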

8.3030

Aside

Two dials were compared. A subject was asked to read each dial many times, and the experimenter recorded his errors. Altogether 7 subjects were tested. The data shows how many errors each subject produced. Do the two conditions differ at the 0.05 significance level (give the appropriate p value)?

Observed data

Subject     1     2     3     4     5     6     7
           36    31    31    29    32    25    26
           29    35    34    35    34    35    30

What key word describes this data?

8.3131

Aside

What tests are available for paired data?

One sample t test

Sign test

Wilcoxon Signed Ranks Test

8.3232

Aside

What tests are available for paired data? What assumptions are made?

One sample t test: normality

Sign test: No assumption of normality

Wilcoxon Signed Ranks Test: Resembles the Sign Test in scope, but it is much more sensitive. In fact, for large numbers it is almost as sensitive as the Student t-test

8.3333

Aside

What tests are available for paired data?

One sample t test

Sign test

Wilcoxon Signed Ranks Test

Sign test answers the question "How often?", whereas other tests answer the question "How much?"

One sample t test – mean

Wilcoxon Signed Ranks Test – median
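As one illustration of the "How often?" idea, the sketch below applies a two-sided sign test to the dial readings from the aside. It assumes the first row of readings belongs to one dial and the second row to the other; the names `dial_a` and `dial_b` are hypothetical. The slides do not say which of the three tests is intended as the answer, so treat this as a sketch of the sign test only.

```python
from math import comb

# Errors per subject (subjects 1 to 7); row assignment is an assumption.
dial_a = [36, 31, 31, 29, 32, 25, 26]
dial_b = [29, 35, 34, 35, 34, 35, 30]

# Paired differences; ties would be discarded, though none occur here.
differences = [a - b for a, b in zip(dial_a, dial_b) if a != b]
n = len(differences)
positives = sum(d > 0 for d in differences)

# Under the null hypothesis each sign is equally likely, so the number of
# positive signs is Binomial(n, 0.5). Double the smaller tail for two sides.
k = min(positives, n - positives)
p_value = min(1.0, 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n)

print(f"{positives} positive differences out of {n}, p = {p_value}")
```

The test only counts the direction of each difference, which is why it is less sensitive than tests that also use the size of the differences.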

8.3434

Example

The table is based on case-records of women employees in Royal Ordnance factories during 1943-6. The same test was carried out on the left eye (columns) and right eye (rows).

Stuart “The estimation and comparison of strengths of association in contingency tables”, Biometrika, 1953, 40, 105-110.

http://www.jstor.org/stable/


8.3535

Observed

           Highest   Second    Third   Lowest    Total
Highest       1520      266      124       66     1976
Second         234     1512      432       78     2256
Third          117      362     1772      205     2456
Lowest          36       82      179      492      789
Total         1907     2222     2507      841     7477

Is there any obvious structure?

8.3636

Expected

In general to find the expected frequency in a particular cell the equation is

Row total x Column total / Grand total

8.3737

Expected

           Highest   Second    Third   Lowest    Total
Highest       1520      266      124       66     1976
Second         234     1512      432       78     2256
Third          117      362     1772      205     2456
Lowest          36       82      179      492      789
Total         1907     2222     2507      841     7477

In general to find the expected frequency in a particular cell the equation is

Row total x Column total / Grand total

So for highest right and highest left (the top left cell) the equation becomes

1976 x 1907 / 7477 = 503.98

8.3838

Expected

           Highest   Second    Third   Lowest    Total
Highest     503.98        ?        ?        ?     1976
Second           ?        ?        ?        ?     2256
Third            ?        ?        ?        ?     2456
Lowest           ?        ?        ?        ?      789
Total         1907     2222     2507      841     7477

Row total x Column total / Grand total

1976 x 1907 / 7477 = 503.98

8.3939


           Highest   Second    Third   Lowest    Total
Highest     503.98   587.22   662.54        ?     1976
Second      575.39   670.43   756.43        ?     2256
Third       626.40   729.87   823.48        ?     2456
Lowest           ?        ?        ?        ?      789
Total         1907     2222     2507      841     7477

Row total x Column total / Grand total

8.4040


           Highest   Second    Third   Lowest    Total
Highest     503.98   587.22   662.54        ?     1976
Second      575.39   670.43   756.43        ?     2256
Third       626.40   729.87   823.48        ?     2456
Lowest           ?        ?        ?        ?      789
Total         1907     2222     2507      841     7477

The missing values are simply found by subtraction

8.4141


           Highest   Second    Third   Lowest    Total
Highest     503.98   587.22   662.54        ?     1976
Second      575.39   670.43   756.43              2256
Third       626.40   729.87   823.48              2456
Lowest                                             789
Total         1907     2222     2507      841     7477

1976 – 503.98 – 587.22 – 662.54 = 222.26

8.4242


           Highest   Second    Third   Lowest    Total
Highest     503.98   587.22   662.54   222.26     1976
Second      575.39   670.43   756.43              2256
Third       626.40   729.87   823.48              2456
Lowest                                             789
Total         1907     2222     2507      841     7477

1976 – 503.98 – 587.22 – 662.54 = 222.26

8.4343


           Highest   Second    Third   Lowest    Total
Highest     503.98   587.22   662.54   222.26     1976
Second      575.39   670.43   756.43        ?     2256
Third       626.40   729.87   823.48        ?     2456
Lowest           ?        ?        ?        ?      789
Total         1907     2222     2507      841     7477

Similarly for the remaining cells

8.4444


           Highest   Second    Third   Lowest    Total
Highest     503.98   587.22   662.54   222.26     1976
Second      575.39   670.43   756.43   253.75     2256
Third       626.40   729.87   823.48   276.25     2456
Lowest      201.23   234.47   264.55    88.75      789
Total         1907     2222     2507      841     7477
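The completed expected table can be checked with a short sketch that applies "row total x column total / grand total" to every cell at once. Variable names are assumptions made for illustration; rows are the right eye grades and columns the left eye grades, both ordered Highest, Second, Third, Lowest.

```python
grades = ["Highest", "Second", "Third", "Lowest"]
row_totals = [1976, 2256, 2456, 789]   # right eye
col_totals = [1907, 2222, 2507, 841]   # left eye
grand_total = 7477

# Expected frequency for each cell under independence.
expected = [
    [row_total * col_total / grand_total for col_total in col_totals]
    for row_total in row_totals
]

for grade, row in zip(grades, expected):
    print(f"{grade:<8}" + "".join(f"{value:10.2f}" for value in row))

# Each row of expected values must add back up to its row total, which is
# why the last column (and last row) can be found by subtraction.
row_sums = [sum(row) for row in expected]
```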

8.4545

Short Cut

Contributions to the χ2 statistic,

$$\frac{(\text{observed} - \text{expected})^2}{\text{expected}}$$

for the top left cell the contribution is

$$\frac{(1520 - 503.98)^2}{503.98} = 2048.32$$

8.4646

Conclusion

$$\chi^2_{calc} = 2048.32, \quad \nu = (r - 1)(c - 1) = 3 \times 3 = 9$$

For the top left cell only.

ν    p=0.1     p=0.05    p=0.025   p=0.01    p=0.005   p=0.002
9    14.684    16.919    19.023    21.666    23.589    26.056

$$\chi^2_{9}(0.05) = 16.92$$

The above statistic makes it very clear that there is some relationship between the quality of the right and left eyes.

Nine cells are free.

8.4747

Contributions to χ2

           Highest    Second     Third    Lowest     Total
Highest    2048.32    175.72    437.75    109.86
Second      202.55   1056.38    139.14    121.73
Third       414.25    185.41   1092.53     18.38
Lowest      135.67     99.15     27.66   1832.37
Total                                                 8097

Total χ2

8.4848

Conclusion

$$\chi^2_{calc} = 8096.87, \quad \nu = (r - 1)(c - 1) = 3 \times 3 = 9$$

For all cells.

ν    p=0.1     p=0.05    p=0.025   p=0.01    p=0.005   p=0.002
9    14.684    16.919    19.023    21.666    23.589    26.056

$$\chi^2_{9}(0.05) = 16.92$$

The above statistic makes it very clear that there is some relationship between the quality of the right and left eyes.

Nine cells are free.
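The cell-by-cell contributions and their total can be reproduced with the sketch below. It is an illustrative check on the hand calculation rather than the method used in the practical; variable names are assumptions.

```python
# Observed counts: rows are right eye, columns are left eye,
# both ordered Highest, Second, Third, Lowest.
observed = [
    [1520, 266, 124, 66],
    [234, 1512, 432, 78],
    [117, 362, 1772, 205],
    [36, 82, 179, 492],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Contribution of each cell: (observed - expected)^2 / expected.
contributions = []
for obs_row, row_total in zip(observed, row_totals):
    contribution_row = []
    for obs, col_total in zip(obs_row, col_totals):
        exp = row_total * col_total / grand_total
        contribution_row.append((obs - exp) ** 2 / exp)
    contributions.append(contribution_row)

statistic = sum(sum(row) for row in contributions)
dof = (len(row_totals) - 1) * (len(col_totals) - 1)

for row in contributions:
    print("".join(f"{value:10.2f}" for value in row))
print(f"Total chi-squared = {statistic:.2f} with {dof} degrees of freedom")
```

The largest contributions sit on the diagonal, which matches the obvious structure in the observed table: the two eyes tend to receive the same grade.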

8.4949

SPSS

Raw data

8.5050

SPSS

Expected cell frequencies

V1 * V2 Crosstabulation

                                            V2
                             Highest   Lowest   Second    Third    Total
V1     Highest  Count           1520       36      234      117     1907
                Expected Count 504.0    201.2    575.4    626.4   1907.0
       Lowest   Count             66      492       78      205      841
                Expected Count 222.3     88.7    253.8    276.2    841.0
       Second   Count            266       82     1512      362     2222
                Expected Count 587.2    234.5    670.4    729.9   2222.0
       Third    Count            124      179      432     1772     2507
                Expected Count 662.5    264.5    756.4    823.5   2507.0
Total           Count           1976      789     2256     2456     7477
                Expected Count 1976.0   789.0   2256.0   2456.0   7477.0

8.5151

SPSS

Pearson Chi Square is the required statistic

Chi-Square Tests

                      Value        df   Asymp. Sig. (2-sided)
Pearson Chi-Square    8096.877a    9    .000
Likelihood Ratio      6671.512     9    .000
N of Valid Cases      7477

a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 88.75.

8.5252

Poisson Distribution

The Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time and/or space if these events occur with a known average rate and independently of the time since the last event. The Poisson distribution can also be used for the number of events in other specified intervals such as distance, area or volume.

Typical applications are to queues/arrivals. The number of phone calls received per day. The occurrence of accidents/industrial injuries. More exotically, birth defects and the number of genetic mutations. The occurrence of rare diseases.

8.5353

Poisson Distribution

1. Discrete events which are independent.

2. Events occur at a fixed rate λ per unit continuum. (λ lambda)

8.5454

Poisson Distribution

$$\text{Prob}(x; \lambda) = \frac{e^{-\lambda} \lambda^{x}}{x!}$$

for x successes

e is approximately equal to 2.718

λ is the rate per unit continuum

the mean is λ

the variance is λ
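A minimal sketch of the Poisson formula in Python follows, together with a numerical check that the mean and variance both equal λ. The function name and the rate of 2.0 are illustrative assumptions, not values from the slides.

```python
from math import exp, factorial


def poisson_probability(x, lam):
    """Probability of exactly x events when the rate per unit continuum is lam."""
    return exp(-lam) * lam ** x / factorial(x)


lam = 2.0  # illustrative rate only

# Probabilities beyond x = 59 are negligible for a rate this small.
probabilities = [poisson_probability(x, lam) for x in range(60)]

total = sum(probabilities)
mean = sum(x * p for x, p in enumerate(probabilities))
variance = sum((x - mean) ** 2 * p for x, p in enumerate(probabilities))

print(f"total = {total:.6f}, mean = {mean:.6f}, variance = {variance:.6f}")
```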

8.5555

Casio 83ES

exp or “e”

exp(1) = 2.7182818

exp(2) = 7.389056

Its inverse, on the same key is ln, so

ln(2.7182818) = 1

ln(7.389056) = 2

8.5656

Alternate applications

A similar approach may be employed to test if simple models are plausible.

8.5757

χ2 Goodness of Fit Test

$$\chi^2_{calc} = \sum_{i} \frac{(O_i - E_i)^2}{E_i}$$

The degrees of freedom are ν = m – n – 1, where there are m frequencies left in the problem, after pooling, and n parameters have been fitted from the raw data.

For example…

8.5858

Example

The number of Prussian army corps in which soldiers died from the kicks of a horse in a year.

Typical “industrial injury” data

8.5959

Which distribution is appropriate?

Is the data discrete or continuous?

Discrete, since a simple count

8.6060

Check list of distributions

Discrete     Continuous
Binomial     Normal
Poisson      Exponential

8.6161

Check list of distribution parameters

Discrete           Continuous
Binomial (n, p)    Normal (μ, σ2)
Poisson (λ)        Exponential (λ)

Discrete, no “n” implies Poisson

8.6262

Observed Data

Number deaths in a corps    Observed frequency (Oi)
0                           144
1                            91
2                            32
3                            11
4                             2
5 or more                     0
Total                       280

We need to estimate the Poisson parameter λ, which is the mean of the distribution.


8.6464

Mean

$$\lambda = \frac{0 \times 144 + 1 \times 91 + 2 \times 32 + 3 \times 11 + 4 \times 2}{144 + 91 + 32 + 11 + 2} = \frac{196}{280} = 0.7$$

Number deaths in a corps    Observed frequency (Oi)
0                           144
1                            91
2                            32
3                            11
4                             2
5 or more                     0
Total                       280


8.6565

Expected

Number deaths in a corps    Poisson model       Expected probability
0                           e^{-λ}              0.4966
1                           λ e^{-λ}            0.3476
2                           λ^2 e^{-λ} / 2!     0.1217
3                           λ^3 e^{-λ} / 3!     0.0284
4                           λ^4 e^{-λ} / 4!     0.0050
5 or more                   By subtraction      ?
Total                       1                   1

λ = 0.7 and “e” is a constant on your calculator

8.6666

Expected

Number deaths in a corps    Poisson model       Expected probability
0                           e^{-λ}              0.4966
1                           λ e^{-λ}            0.3476
2                           λ^2 e^{-λ} / 2!     0.1217
3                           λ^3 e^{-λ} / 3!     0.0284
4                           λ^4 e^{-λ} / 4!     0.0050
5 or more                   By subtraction      0.0008
Total                       1                   1

8.6767

Expected Frequency

Expected frequency for no deaths 280 x 0.4966 = 139.04

Number deaths in a corps    Expected probability    Expected frequency (Ei)
0                           0.4966                  139.04
1                           0.3476
2                           0.1217
3                           0.0284
4                           0.0050
5 or more                   0.0008
Total                       1

8.6868

Expected Frequency

Expected frequency for remaining rows

280 × probability = frequency

Number deaths in a corps    Expected probability    Expected frequency (Ei)
0                           0.4966                  139.04
1                           0.3476                   97.33
2                           0.1217                   34.07
3                           0.0284                    7.95
4                           0.0050                    1.39
5 or more                   0.0008                    0.22
Total                       1                       280

Note the two expected frequencies less than 5!
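The expected frequencies can be reproduced with a short sketch: estimate λ as the sample mean, evaluate the Poisson probabilities for 0 to 4 deaths, obtain "5 or more" by subtraction, and multiply by 280. Variable names are assumptions made for illustration.

```python
from math import exp, factorial

deaths = [0, 1, 2, 3, 4]
observed = [144, 91, 32, 11, 2]   # the "5 or more" class was observed 0 times
total = sum(observed)             # 280 corps-years

# Estimate the Poisson parameter as the mean number of deaths.
lam = sum(d * o for d, o in zip(deaths, observed)) / total

# Poisson probabilities for 0..4, then the open-ended class by subtraction.
probabilities = [exp(-lam) * lam ** d / factorial(d) for d in deaths]
probabilities.append(1 - sum(probabilities))

expected = [total * p for p in probabilities]

labels = [str(d) for d in deaths] + ["5 or more"]
for label, p, e in zip(labels, probabilities, expected):
    print(f"{label:>9}  {p:.4f}  {e:7.2f}")
```

The last two expected frequencies fall well below 5, which is what motivates pooling the upper classes before the statistic is calculated.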

8.6969

χ2 Calculation

Number deaths in a corps    Observed frequency (Oi)    Expected frequency (Ei)    (Oi - Ei)^2 / Ei
0                           144                        139.04                     0.18
1                            91                         97.33                     0.41
2                            32                         34.07                     0.13
3 or more                    13                          9.56                     1.24
Total                       280                        280                        1.95

Pool to ensure all expected frequencies exceed 5

8.7070

Conclusion

Here m (frequencies) = 4, n (fitted parameters) = 1 then ν = m – n – 1 = 4 – 1 – 1 = 2

ν    p=0.1    p=0.05   p=0.025   p=0.01   p=0.005   p=0.002
2    4.605    5.991    7.378     9.210    10.597    12.429

$$\chi^2_{2}(0.05) = 5.991, \quad \chi^2_{calc} = 1.95$$

The hypothesis that the data comes from a Poisson distribution would not be rejected (1.95 < 5.991), so the Poisson model is accepted as plausible.
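The whole goodness of fit calculation, including the pooling step, can be sketched as follows. The pooling rule used here (merge the last class into its neighbour until the final expected frequency is at least 5) is an assumption that happens to reproduce the "3 or more" class on the slide; other pooling choices are possible. The closed form exp(-x/2) for the tail probability holds only for ν = 2.

```python
from math import exp, factorial

observed = [144, 91, 32, 11, 2, 0]   # 0, 1, 2, 3, 4 and "5 or more" deaths
total = sum(observed)

# One parameter (lambda) is fitted from the data: the sample mean.
lam = sum(k * o for k, o in enumerate(observed)) / total
fitted_parameters = 1

probabilities = [exp(-lam) * lam ** k / factorial(k) for k in range(5)]
probabilities.append(1 - sum(probabilities))   # "5 or more" by subtraction
expected = [total * p for p in probabilities]

# Pool the upper classes until the last expected frequency reaches 5.
pooled_observed = list(observed)
pooled_expected = list(expected)
while len(pooled_expected) > 1 and pooled_expected[-1] < 5:
    last_observed = pooled_observed.pop()
    last_expected = pooled_expected.pop()
    pooled_observed[-1] += last_observed
    pooled_expected[-1] += last_expected

statistic = sum(
    (o - e) ** 2 / e for o, e in zip(pooled_observed, pooled_expected)
)
dof = len(pooled_expected) - fitted_parameters - 1

# Upper tail probability, valid only because dof == 2 in this example.
p_value = exp(-statistic / 2)

print(f"chi-squared = {statistic:.2f}, dof = {dof}, p = {p_value:.3f}")
```

Since the p value is well above 0.05, the sketch agrees with the table lookup: there is no evidence against the Poisson model.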

8.7171

Next Week

Bring your calculators next week

8.7272

Read

Read Howitt and Cramer pages 134-152

Read Howitt and Cramer (e-text) pages 125-134

Read Russo (e-text) pages 100-119

Read Davis and Smith pages 434-448

8.7373

Practical 8

This material is available from the module web page.

http://www.staff.ncl.ac.uk/mike.cox

Module Web Page

http://www.staff.ncl.ac.uk/mike.cox/psy1011.htm

8.7474

Practical 8

This material for the practical is available.

Instructions for the practical: Practical 8

Material for the practical: Practical 8

http://www.staff.ncl.ac.uk/mike.cox/PSY1011/story8.pdf

http://www.staff.ncl.ac.uk/mike.cox/PSY1011/8.MTW

8.7575

Assignment 2

You will find submission details on the module web site

Note the dialers lower down the page give access to your individual assignment. It is necessary to enter your student number exactly as it appears on your smart card.

http://www.staff.ncl.ac.uk/mike.cox/psy1011.htm

8.7676

Assignment 2

As a general rule make sure you can perform the calculations manually.

It does no harm to check your calculations using a software package.

Some software employs non-standard definitions and should be used with caution.

8.7777

Assignment 2

All submissions must be typed.

8.7878

Whoops!

Researchers at Cardiff University School of Social Science claim errors made by the Hawk-Eye line-calling technology can be greater than 3.6mm - the average error quoted by the manufacturers.

Teletext, p388

12 June 2008

8.7979

Whoops!

Kate Middleton 'marries Prince Harry' on souvenir mug

The Telegraph - Thursday 17 March 2011

8.8080

Whoops!

Poldark - BBC - 8 March 2015
