using statistics to make inferences 8
DESCRIPTION
Using Statistics To Make Inferences 8. Summary Contingency tables. Goodness of fit test. 1. Sunday, 10 August 2014 11:52 AM. Goals. To assess contingency tables for independence. To perform and interpret a goodness of fit test. Practical Construct and analyse contingency tables. 2.
8.11
Using Statistics To Make Inferences 8
Summary
Contingency tables.
Goodness of fit test.
Saturday 22 April 2023 06:47 AM
8.22
Goals
To assess contingency tables for independence.
To perform and interpret a goodness of fit test.
Practical
Construct and analyse contingency tables.
8.33
Recall
To compare a population and sample variance we employed?
χ2
8.44
Today
The probability approach from last week is employed to tell if “observed” data conforms to the pattern “expected” under a given model.
8.55
Categorical Data - Example
Assessed intelligence of athletic and non-athletic schoolboys.
          bright   stupid   Total
athletic  581      567      1148
lazy      209      351      560
Total     790      918      1708
K. Pearson “On The Relationship Of Intelligence To Size And Shape Of Head, And To Other Physical And Mental Characters”, Biometrika, 1906, 5, 105-146, data on page 144.
8.66
Procedure
1. Formulate a null hypothesis. Typically the null hypothesis is that there is no association between the factors.
2. Calculate expected frequencies for the cells in the table on the assumption that the null hypothesis is true.
3. Calculate the chi-squared statistic. For an r x c table with observed and expected entries in row i and column j this is
$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(\text{observed}_{i,j} - \text{expected}_{i,j})^2}{\text{expected}_{i,j}}$$
8.77
Procedure
4. Compare the calculated statistic with tabulated values of the chi-squared distribution with ν degrees of freedom.
ν = (rows ‑ 1)(columns ‑ 1) = (r – 1)(c – 1)
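The four steps above can be sketched in a few lines of Python. This is only an illustration, not part of the SPSS practical: the function name `chi_squared_independence` is made up for this sketch, and it uses the usual rule that an expected count is row total x column total / grand total. The data are the schoolboy counts from the earlier example.

```python
def chi_squared_independence(observed):
    """Return (statistic, degrees of freedom, expected) for an r x c table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    # Step 2: expected count under independence = row total x column total / grand total
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

    # Step 3: sum (observed - expected)^2 / expected over every cell
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )

    # Step 4 needs the degrees of freedom (r - 1)(c - 1)
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof, expected


# Rows: athletic, lazy. Columns: bright, stupid.
schoolboys = [[581, 567], [209, 351]]
statistic, dof, expected = chi_squared_independence(schoolboys)
print(round(statistic, 2), dof)
```

The statistic is then compared with the tabulated chi-squared value for `dof` degrees of freedom.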
8.88
Key Assumptions
1. Independence of the observations. The data found in each cell of the contingency table used in the chi-squared test must be independent observations and non-correlated.
2. Large enough expected cell counts. As described by Yates et al., "No more than 20% of the expected counts are less than 5 and all individual expected counts are 1 or greater" (Yates, Moore & McCabe, 1999, The Practice of Statistics, New York: W.H. Freeman p. 734).
8.99
Key Assumptions
3. Randomness of data. The data in the table should be randomly selected.
4. Sufficient Sample Size. It is also generally assumed that the sample size for the entire contingency table is sufficiently large to prevent falsely accepting the null hypothesis when the null hypothesis is in fact false.
8.1010
Example
Assessed intelligence of athletic and non athletic schoolboys.
Observed
          bright   stupid   Total
athletic  581      567      1148
lazy      209      351      560
Total     790      918      1708
8.1111
Probabilities
The probability a random boy is athletic is
$$\frac{1148}{1708} = 0.6721$$
The probability a random boy is bright is
$$\frac{790}{1708} = 0.4625$$
Assuming independence, the probability a random boy is both athletic and bright is
$$\frac{1148}{1708} \times \frac{790}{1708} = 0.3109$$
          bright   stupid   Total
athletic  581      567      1148
lazy      209      351      560
Total     790      918      1708
For 1708 respondents the expected number of athletic bright boys is
$$\frac{1148 \times 790}{1708} = 530.98$$
8.1212
Expected
          bright   stupid   Total
athletic  530.98   -        1148
lazy      -        -        560
Total     790      918      1708
The expected number of athletic bright boys is
$$\frac{1148 \times 790}{1708} = 530.98$$
8.1313
Expected
          bright   stupid   Total
athletic  530.98   ?        1148
lazy      -        -        560
Total     790      918      1708
The expected number of athletic stupid boys is
8.1414
Expected
          bright   stupid   Total
athletic  530.98   617.02   1148
lazy      -        -        560
Total     790      918      1708
The expected number of athletic stupid boys is
1148 – 530.98 = 617.02
8.1515
Expected
          bright   stupid   Total
athletic  530.98   617.02   1148
lazy      ?        -        560
Total     790      918      1708
The expected number of lazy bright boys is
8.1616
Expected
          bright   stupid   Total
athletic  530.98   617.02   1148
lazy      259.02   ?        560
Total     790      918      1708
The expected number of stupid lazy boys is
8.1717
Expected
          bright   stupid   Total
athletic  530.98   617.02   1148
lazy      259.02   300.98   560
Total     790      918      1708
The expected number of stupid lazy boys is
918 – 617.02 = 300.98
8.1818
Expected
          bright   stupid   Total
athletic  530.98   617.02   1148
lazy      259.02   300.98   560
Total     790      918      1708
8.1919
χ2
$$\chi^2_{calc} = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}} = \frac{(581 - 530.98)^2}{530.98} + \frac{(567 - 617.02)^2}{617.02} + \frac{(209 - 259.02)^2}{259.02} + \frac{(351 - 300.98)^2}{300.98} = 26.73$$
$$\nu = (r - 1)(c - 1) = 1 \times 1 = 1$$
Observed
          bright   stupid   Total
athletic  581      567      1148
lazy      209      351      560
Total     790      918      1708
Expected
          bright   stupid   Total
athletic  530.98   617.02   1148
lazy      259.02   300.98   560
Total     790      918      1708
Only one cell is free.
8.2020
χ2
As a general rule, to employ this statistic all expected frequencies should exceed 5.
If this is not the case categories are pooled (merged) to achieve this goal. See the Prussian data later.
8.2121
Conclusion
$$\chi^2_{calc} = 26.73 \qquad \nu = 1 \qquad \chi^2_{1}(0.05) = 3.84$$
ν    p=0.1    p=0.05   p=0.025  p=0.01   p=0.005  p=0.002
1    2.706    3.841    5.024    6.635    7.879    9.550
The result is significant (26.73 > 3.84) at the 5% level. So we reject the hypothesis of independence between athletic prowess and intelligence.
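The table lookup can be cross-checked numerically. The sketch below is an aside, not the method used in the slides: it relies on the fact that a chi-squared variable with 1 degree of freedom is the square of a standard normal variable, so its upper tail can be written with the complementary error function. The helper name `chi2_upper_tail_df1` is invented for this illustration, and the shortcut is only valid for ν = 1.

```python
import math


def chi2_upper_tail_df1(statistic):
    """P(X >= statistic) when X is chi-squared with exactly 1 degree of freedom."""
    # X = Z^2 with Z standard normal, so P(Z^2 >= x) = P(|Z| >= sqrt(x)) = erfc(sqrt(x / 2))
    return math.erfc(math.sqrt(statistic / 2))


# The tabulated 5% point should give a tail probability close to 0.05
p_at_critical = chi2_upper_tail_df1(3.841)

# Tail probability for the calculated statistic from the schoolboy table
p_value = chi2_upper_tail_df1(26.73)
significant = p_value < 0.05
print(round(p_at_critical, 3), p_value < 0.001, significant)
```

A p value this small would be reported as p < .001 rather than as an exact figure.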
8.2222
SPSS
Raw data
Note
v1 are the row labels
v2 are the column labels
v3 is the frequency for each cell
8.2323
SPSS
Data > Weight Cases
Since frequency data has been input, it is necessary to weight the cases.
This is essential, do not use percentages.
8.2424
SPSS
Analyze > Descriptive Statistics > Crosstabs
Set row and column variables.
Frequencies already set.
8.2525
SPSS
Select chi-square
8.2626
SPSS
Select
Observed – input data
Expected – output data, under the model
8.2727
SPSS
V1 * V2 Crosstabulation
                                     V2
                                bright   stupid   Total
V1     athletic  Count          581      567      1148
                 Expected Count 531.0    617.0    1148.0
       lazy      Count          209      351      560
                 Expected Count 259.0    301.0    560.0
Total            Count          790      918      1708
                 Expected Count 790.0    918.0    1708.0
Expected cell frequencies
Expected under the model.
8.2828
SPSS
Chi-Square Tests
                          Value      df   Asymp. Sig. (2-sided)   Exact Sig. (2-sided)   Exact Sig. (1-sided)
Pearson Chi-Square        26.736(b)  1    .000
Continuity Correction(a)  26.204     1    .000
Likelihood Ratio          26.973     1    .000
Fisher's Exact Test                                               .000                   .000
N of Valid Cases          1708
a. Computed only for a 2x2 table
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 259.02.
Pearson Chi Square is the required statistic
Do not report p = .000, rather p < .001
Note Fisher’s exact test, only available in SPSS for 2x2 tables (see next slide).
8.2929
What If We Have Small Cell Counts?
Fisher's exact test
The Fisher's exact test is used when you want to conduct a chi-square test but one or more of your cells has an expected frequency of five or less. Remember that the chi-square test assumes that each cell has an expected frequency of five or more, but the Fisher's exact test has no such assumption and can be used regardless of how small the expected frequency is. In SPSS, unless you have the SPSS Exact Test Module, you can only perform a Fisher's exact test on a 2x2 table, and these results are presented by default.
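For a 2x2 table the exact test can be sketched directly from the hypergeometric distribution. This is an illustration under stated assumptions, not a description of how SPSS computes its figures: the function name `fisher_exact_two_sided` is invented here, the small table is hypothetical, and the two-sided p value follows one common convention (sum the probabilities of all tables, with the same margins, that are no more probable than the observed one).

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def weight(k):
        # Number of ways to obtain top-left cell k with all margins fixed
        return comb(row1, k) * comb(row2, col1 - k)

    k_min, k_max = max(0, col1 - row2), min(row1, col1)
    observed_weight = weight(a)

    # Keep every table that is no more probable than the one observed
    extreme = sum(
        w for w in (weight(k) for k in range(k_min, k_max + 1))
        if w <= observed_weight
    )
    # Integer counts are exact; divide once by the total number of tables
    return extreme / comb(n, col1)


# Hypothetical small table, every expected count is 2 (well below 5)
p_small = fisher_exact_two_sided(3, 1, 1, 3)
print(round(p_small, 4))
```

Working with integer counts until the final division avoids rounding problems when the margins are large.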
8.3030
Aside
Two dials were compared. A subject was asked to read each dial many times, and the experimenter recorded his errors. Altogether 7 subjects were tested. The data shows how many errors each subject produced. Do the two conditions differ at the 0.05 significance level (give the appropriate p value)?
Observed data
Subject      1    2    3    4    5    6    7
First dial   36   31   31   29   32   25   26
Second dial  29   35   34   35   34   35   30
What key word describes this data?
8.3131
Aside
What tests are available for paired data?
One sample t test
Sign test
Wilcoxon Signed Ranks Test
8.3232
Aside
What tests are available for paired data? What assumptions are made?
One sample t test: normality
Sign test: no assumption of normality
Wilcoxon Signed Ranks Test: resembles the Sign Test in scope, but it is much more sensitive. In fact, for large numbers it is almost as sensitive as the Student t-test
8.3333
Aside
What tests are available for paired data?
One sample t test
Sign test
Wilcoxon Signed Ranks Test
Sign test answers the question How Often?, whereas other tests answer the question How Much?
One sample t test – mean
Wilcoxon Signed Ranks Test - median
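As an illustration of the "How Often?" idea, the sign test can be sketched for the two-dial data from the aside. The slides do not work this example, so treat it as a sketch only: the variable names are invented, the assignment of the two data rows to the first and second dial is an assumption, and the two-sided p value simply doubles the smaller binomial tail.

```python
from math import comb

# Errors per subject (7 subjects); which row belongs to which dial is assumed
first_dial = [36, 31, 31, 29, 32, 25, 26]
second_dial = [29, 35, 34, 35, 34, 35, 30]

# The sign test only asks how often one dial beats the other
differences = [x - y for x, y in zip(first_dial, second_dial)]
nonzero = [d for d in differences if d != 0]   # ties are dropped
n = len(nonzero)
positives = sum(d > 0 for d in nonzero)

# Under the null hypothesis each sign is + or - with probability 1/2
k = min(positives, n - positives)
p_value = min(1.0, 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n)
print(positives, n, p_value)
```

The t test and the Wilcoxon test would use the sizes of the differences as well as their signs.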
8.3434
Example
The table is based on case-records of women employees in Royal Ordnance factories during 1943-6, the same test being carried out on the left eye (columns) and right eye (rows).
Stuart “The estimation and comparison of strengths of association in contingency tables”, Biometrika, 1953, 40, 105-110.
8.3535
Observed
          Highest  Second   Third    Lowest   Total
Highest   1520     266      124      66       1976
Second    234      1512     432      78       2256
Third     117      362      1772     205      2456
Lowest    36       82       179      492      789
Total     1907     2222     2507     841      7477
Is there any obvious structure?
8.3636
Expected
In general to find the expected frequency in a particular cell the equation is
Row total x Column total / Grand total
8.3737
Expected
          Highest  Second   Third    Lowest   Total
Highest   1520     266      124      66       1976
Second    234      1512     432      78       2256
Third     117      362      1772     205      2456
Lowest    36       82       179      492      789
Total     1907     2222     2507     841      7477
In general to find the expected frequency in a particular cell the equation is
Row total x Column total / Grand total
So for highest right and highest left the equation becomes
1976 x 1907 / 7477 = 503.98
8.3838
Expected
          Highest  Second   Third    Lowest   Total
Highest   503.98   -        -        ?        1976
Second    -        -        -        ?        2256
Third     -        -        -        ?        2456
Lowest    ?        ?        ?        ?        789
Total     1907     2222     2507     841      7477
Row total x Column total / Grand total
1976 x 1907 / 7477 = 503.98
8.3939
Expected
          Highest  Second   Third    Lowest   Total
Highest   503.98   587.22   662.54   ?        1976
Second    575.39   670.43   756.43   ?        2256
Third     626.40   729.87   823.48   ?        2456
Lowest    ?        ?        ?        ?        789
Total     1907     2222     2507     841      7477
Row total x Column total / Grand total
8.4040
Expected
          Highest  Second   Third    Lowest   Total
Highest   503.98   587.22   662.54   ?        1976
Second    575.39   670.43   756.43   ?        2256
Third     626.40   729.87   823.48   ?        2456
Lowest    ?        ?        ?        ?        789
Total     1907     2222     2507     841      7477
The missing values are simply found by subtraction
8.4141
Expected
          Highest  Second   Third    Lowest   Total
Highest   503.98   587.22   662.54   ?        1976
Second    575.39   670.43   756.43   -        2256
Third     626.40   729.87   823.48   -        2456
Lowest    -        -        -        -        789
Total     1907     2222     2507     841      7477
1976 – 503.98 – 587.22 – 662.54 = 222.26
8.4242
Expected
          Highest  Second   Third    Lowest   Total
Highest   503.98   587.22   662.54   222.26   1976
Second    575.39   670.43   756.43   -        2256
Third     626.40   729.87   823.48   -        2456
Lowest    -        -        -        -        789
Total     1907     2222     2507     841      7477
1976 – 503.98 – 587.22 – 662.54 = 222.26
8.4343
Expected
          Highest  Second   Third    Lowest   Total
Highest   503.98   587.22   662.54   222.26   1976
Second    575.39   670.43   756.43   ?        2256
Third     626.40   729.87   823.48   ?        2456
Lowest    ?        ?        ?        ?        789
Total     1907     2222     2507     841      7477
Similarly for the remaining cells
8.4444
Expected
          Highest  Second   Third    Lowest   Total
Highest   503.98   587.22   662.54   222.26   1976
Second    575.39   670.43   756.43   253.75   2256
Third     626.40   729.87   823.48   276.25   2456
Lowest    201.23   234.47   264.55   88.75    789
Total     1907     2222     2507     841      7477
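The full expected table can also be generated in one pass from the margins, which is a useful check on the hand calculation. A minimal sketch, with variable names chosen for this illustration only:

```python
# Observed counts: rows are right eye grade, columns are left eye grade
observed = [
    [1520, 266, 124, 66],    # Highest
    [234, 1512, 432, 78],    # Second
    [117, 362, 1772, 205],   # Third
    [36, 82, 179, 492],      # Lowest
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Row total x Column total / Grand total, for every cell
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

for row in expected:
    print("  ".join(f"{value:8.2f}" for value in row))
```

Each row and column of `expected` adds up to the same total as the observed table, which is why the last row and column can be found by subtraction.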
8.4545
Short Cut
Contributions to the χ2 statistic, for the top left cell the contribution is
$$\frac{(\text{observed} - \text{expected})^2}{\text{expected}} = \frac{(1520 - 503.98)^2}{503.98} = 2048.32$$
8.4646
Conclusion
For the top left cell only.
$$\chi^2_{calc} = 2048.32 \qquad \nu = (r - 1)(c - 1) = 3 \times 3 = 9$$
Nine cells are free.
ν    p=0.1    p=0.05   p=0.025  p=0.01   p=0.005  p=0.002
9    14.684   16.919   19.023   21.666   23.589   26.056
$$\chi^2_{9}(0.05) = 16.92$$
The above statistic makes it very clear that there is some relationship between the quality of the right and left eyes.
8.4747
          Highest  Second   Third    Lowest   Total
Highest   2048.32  175.72   437.75   109.86
Second    202.55   1056.38  139.14   121.73
Third     414.25   185.41   1092.53  18.38
Lowest    135.67   99.15    27.66    1832.37
Total                                         8097
Total χ2
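The table of contributions shows which cells drive the result (here the diagonal). A short sketch of the same calculation, again with illustrative variable names:

```python
observed = [
    [1520, 266, 124, 66],    # Highest
    [234, 1512, 432, 78],    # Second
    [117, 362, 1772, 205],   # Third
    [36, 82, 179, 492],      # Lowest
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# One contribution (observed - expected)^2 / expected per cell
contributions = [
    [(o - r * c / grand_total) ** 2 / (r * c / grand_total)
     for o, c in zip(row, col_totals)]
    for row, r in zip(observed, row_totals)
]

total_chi_squared = sum(sum(row) for row in contributions)
dof = (len(row_totals) - 1) * (len(col_totals) - 1)
print(round(total_chi_squared, 2), dof)
```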
8.4848
Conclusion
For all cells.
$$\chi^2_{calc} = 8096.87 \qquad \nu = (r - 1)(c - 1) = 3 \times 3 = 9$$
Nine cells are free.
ν    p=0.1    p=0.05   p=0.025  p=0.01   p=0.005  p=0.002
9    14.684   16.919   19.023   21.666   23.589   26.056
$$\chi^2_{9}(0.05) = 16.92$$
The above statistic makes it very clear that there is some relationship between the quality of the right and left eyes.
8.4949
SPSS
Raw data
8.5050
SPSS
Expected cell frequencies
V1 * V2 Crosstabulation
                                      V2
                                 Highest  Lowest   Second   Third    Total
V1     Highest  Count            1520     36       234      117      1907
                Expected Count   504.0    201.2    575.4    626.4    1907.0
       Lowest   Count            66       492      78       205      841
                Expected Count   222.3    88.7     253.8    276.2    841.0
       Second   Count            266      82       1512     362      2222
                Expected Count   587.2    234.5    670.4    729.9    2222.0
       Third    Count            124      179      432      1772     2507
                Expected Count   662.5    264.5    756.4    823.5    2507.0
Total           Count            1976     789      2256     2456     7477
                Expected Count   1976.0   789.0    2256.0   2456.0   7477.0
8.5151
SPSS
Pearson Chi Square is the required statistic
Chi-Square Tests
                     Value        df   Asymp. Sig. (2-sided)
Pearson Chi-Square   8096.877(a)  9    .000
Likelihood Ratio     6671.512     9    .000
N of Valid Cases     7477
a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 88.75.
8.5252
Poisson Distribution
The Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time and/or space if these events occur with a known average rate and independently of the time since the last event. The Poisson distribution can also be used for the number of events in other specified intervals such as distance, area or volume.
Typical applications are to queues/arrivals.
The number of phone calls received per day.
The occurrence of accidents/industrial injuries.
More exotically, birth defects and the number of genetic mutations.
The occurrence of rare diseases.
8.5353
Poisson Distribution
1 discrete events which are independent.
2 events occur at a fixed rate λ per unit continuum. (λ lambda)
8.5454
Poisson Distribution
The probability of x successes is
$$\text{Prob}(x; \lambda) = \frac{e^{-\lambda} \lambda^{x}}{x!}$$
e is approximately equal to 2.718
λ is the rate per unit continuum
the mean is λ
the variance is λ
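The formula translates directly into code. In this minimal sketch the function name `poisson_pmf` and the rate λ = 2 are purely illustrative choices; the loop also checks numerically that the mean and variance both come out as λ.

```python
import math


def poisson_pmf(x, lam):
    """Probability of exactly x events when the rate per unit continuum is lam."""
    return math.exp(-lam) * lam ** x / math.factorial(x)


lam = 2.0  # hypothetical rate, for illustration only
probabilities = [poisson_pmf(x, lam) for x in range(60)]  # the tail beyond 60 is negligible

mean = sum(x * p for x, p in enumerate(probabilities))
variance = sum((x - mean) ** 2 * p for x, p in enumerate(probabilities))
print(round(sum(probabilities), 6), round(mean, 6), round(variance, 6))
```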
8.5555
Casio 83ES
exp or “e”
exp(1) = 2.7182818
exp(2) = 7.389056
Its inverse, on the same key is ln, so
ln(2.7182818) = 1
ln(7.389056) = 2
8.5656
Alternate applications
A similar approach may be employed to test if simple models are plausible.
8.5757
χ2 Goodness of Fit Test
$$\chi^2_{calc} = \sum_{i} \frac{(O_i - E_i)^2}{E_i}$$
The degrees of freedom are ν = m – n – 1, where there are m frequencies left in the problem, after pooling, and n parameters have been fitted from the raw data.
For example…
8.5858
Example
The number of Prussian army corps in which soldiers died from the kicks of a horse in a year.
Typical “industrial injury” data
8.5959
Which distribution is appropriate?
Is the data discrete or continuous?
Discrete, since a simple count
8.6060
Check list of distributions
Discrete   Continuous
Binomial   Normal
Poisson    Exponential
8.6161
Check list of distribution parameters
Discrete         Continuous
Binomial (n, p)  Normal (μ, σ2)
Poisson (λ)      Exponential (λ)
Discrete, no “n” implies Poisson
8.6262
Observed Data
Number deaths in a corps   Observed frequency (Oi)
0                          144
1                          91
2                          32
3                          11
4                          2
5 or more                  0
Total                      280
We need to estimate the Poisson parameter λ, which is the mean of the distribution.
8.6464
Mean
$$\text{mean} = \frac{0 \times 144 + 1 \times 91 + 2 \times 32 + 3 \times 11 + 4 \times 2}{144 + 91 + 32 + 11 + 2} = \frac{196}{280} = 0.7$$
Number deaths in a corps   Observed frequency (Oi)
0                          144
1                          91
2                          32
3                          11
4                          2
5 or more                  0
Total                      280
8.6565
Expected
Number deaths in a corps   Poisson model     Expected probability
0                          e^-λ              0.4966
1                          λ e^-λ            0.3476
2                          λ^2 e^-λ / 2!     0.1217
3                          λ^3 e^-λ / 3!     0.0284
4                          λ^4 e^-λ / 4!     0.0050
5 or more                  By subtraction    ?
Total                      1                 1
λ = 0.7 and “e” is a constant on your calculator
8.6666
Expected
Number deaths in a corps   Poisson model     Expected probability
0                          e^-λ              0.4966
1                          λ e^-λ            0.3476
2                          λ^2 e^-λ / 2!     0.1217
3                          λ^3 e^-λ / 3!     0.0284
4                          λ^4 e^-λ / 4!     0.0050
5 or more                  By subtraction    0.0008
Total                      1                 1
8.6767
Expected Frequency
Expected frequency for no deaths 280 x 0.4966 = 139.04
Number deaths in a corps   Expected probability   Expected frequency (Ei)
0                          0.4966                 139.04
1                          0.3476                 -
2                          0.1217                 -
3                          0.0284                 -
4                          0.0050                 -
5 or more                  0.0008                 -
Total                      1                      -
8.6868
Expected Frequency
Expected frequency for remaining rows
280 × probability = frequency
Number deaths in a corps   Expected probability   Expected frequency (Ei)
0                          0.4966                 139.04
1                          0.3476                 97.33
2                          0.1217                 34.07
3                          0.0284                 7.95
4                          0.0050                 1.39
5 or more                  0.0008                 0.22
Total                      1                      280
Note the two expected frequencies less than 5!
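The estimate of λ and the column of expected frequencies can be reproduced as follows. This is a sketch with illustrative names; the only inputs are the observed frequencies from the table, and the "5 or more" probability is found by subtraction as in the slides.

```python
import math

# Observed number of corps-years with 0, 1, 2, 3, 4 deaths ("5 or more" was never observed)
observed = {0: 144, 1: 91, 2: 32, 3: 11, 4: 2}
total = sum(observed.values())

# Estimate the Poisson rate by the sample mean
lam = sum(deaths * count for deaths, count in observed.items()) / total

# Poisson probabilities for 0..4 deaths, then "5 or more" by subtraction
probabilities = [math.exp(-lam) * lam ** k / math.factorial(k) for k in range(5)]
probabilities.append(1 - sum(probabilities))

expected = [total * p for p in probabilities]
small = [k for k, e in enumerate(expected) if e < 5]  # categories that need pooling

for k, (p, e) in enumerate(zip(probabilities, expected)):
    label = "5 or more" if k == 5 else str(k)
    print(f"{label:>9}  {p:.4f}  {e:7.2f}")
```

The categories flagged in `small` are the ones whose expected frequency falls below 5.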
8.6969
χ2 Calculation
Number deaths in a corps   Observed frequency (Oi)   Expected frequency (Ei)   (Oi – Ei)^2 / Ei
0                          144                       139.04                    0.18
1                          91                        97.33                     0.41
2                          32                        34.07                     0.13
3 or more                  13                        9.56                      1.24
Total                      280                       280                       1.95
Pool to ensure all expected frequencies exceed 5
8.7070
Conclusion
Here m (frequencies) = 4 and n (fitted parameters) = 1, then ν = m – n – 1 = 4 – 1 – 1 = 2
ν    p=0.1    p=0.05   p=0.025  p=0.01   p=0.005  p=0.002
2    4.605    5.991    7.378    9.210    10.597   12.429
$$\chi^2_{2}(0.05) = 5.991 \qquad \chi^2_{calc} = 1.95$$
The hypothesis that the data comes from a Poisson distribution would be accepted, in the sense that it is not rejected at the 5% level (1.95 < 5.991).
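The whole goodness of fit calculation, including the pooling step, can be sketched as below. Names are illustrative. The p value line uses the fact that, for exactly 2 degrees of freedom, the chi-squared upper tail is exp(-x/2); that shortcut is an addition for checking purposes and does not apply to other values of ν.

```python
import math

observed = [144, 91, 32, 11, 2, 0]   # 0, 1, 2, 3, 4, "5 or more" deaths
n_total = sum(observed)
lam = sum(k * o for k, o in enumerate(observed)) / n_total   # one fitted parameter

# Pool 3, 4 and "5 or more" so that every expected frequency exceeds 5
pooled_observed = observed[:3] + [sum(observed[3:])]
probabilities = [math.exp(-lam) * lam ** k / math.factorial(k) for k in range(3)]
probabilities.append(1 - sum(probabilities))                 # "3 or more" by subtraction
pooled_expected = [n_total * p for p in probabilities]

statistic = sum((o - e) ** 2 / e for o, e in zip(pooled_observed, pooled_expected))

m = len(pooled_observed)   # frequencies left after pooling
n = 1                      # parameters fitted from the data (lambda)
dof = m - n - 1

p_value = math.exp(-statistic / 2)   # valid only because dof == 2
print(round(statistic, 2), dof, round(p_value, 3))
```

Since the statistic is well below the 5% point of 5.991, the p value comes out comfortably above 0.05, in line with the conclusion above.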
8.7171
Next Week
Bring your calculators next week
8.7272
Read
Read Howitt and Cramer pages 134-152
Read Howitt and Cramer (e-text) pages 125-134
Read Russo (e-text) pages 100-119
Read Davis and Smith pages 434-448
8.7373
Practical 8
This material is available from the module web page.
http://www.staff.ncl.ac.uk/mike.cox
Module Web Page
8.7474
Practical 8
This material for the practical is available.
Instructions for the practical: Practical 8
Material for the practical: Practical 8
8.7575
Assignment 2
You will find submission details on the module web site
Note the dialers lower down the page give access to your individual assignment. It is necessary to enter your student number exactly as it appears on your smart card.
8.7676
Assignment 2
As a general rule make sure you can perform the calculations manually.
It does no harm to check your calculations using a software package.
Some software packages employ non-standard definitions and should be used with caution.
8.7777
Assignment 2
All submissions must be typed.
8.7878
Whoops!
Researchers at Cardiff University School of Social Science claim errors made by the Hawk-Eye line-calling technology can be greater than 3.6mm, the average error quoted by the manufacturers.
Teletext, p388
12 June 2008
8.7979
Whoops!
Kate Middleton 'marries Prince Harry' on souvenir mugThe Telegraph - Thursday 17 March 2011
8.8080
Whoops!
Poldark - BBC - 8 March 2015