Test of significance: Nonparametric Methods (PHL-541)
General Purpose:
The idea of significance testing. If we have a basic knowledge of the underlying
distribution of a variable, then we can make predictions about how, in repeated
samples of equal size, this particular statistic will "behave," that is, how it is
distributed. For example, if we draw 100 random samples of 100 adults each from
the general population, and compute the mean height in each sample, then the
distribution of the standardized means across samples will likely approximate the
normal distribution (Student's t distribution with 99 degrees of freedom). Now
imagine that we take an additional sample in a particular city ("X") where we
suspect that people are taller than the average population. If the mean height in
that sample falls in the upper 5% tail area of the t distribution (i.e., beyond
the 95th percentile), then we conclude that, indeed, the people of city X are
taller than the average population.
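The repeated-sampling idea can be simulated directly; the population mean and SD below (170 cm, 10 cm) are hypothetical values chosen for illustration, not taken from the text:

```python
import random
import statistics

random.seed(0)  # reproducible

# Draw 100 random samples of 100 "heights" each from a hypothetical
# normal population (mean 170 cm, SD 10 cm) and record each sample mean.
means = [statistics.mean(random.gauss(170, 10) for _ in range(100))
         for _ in range(100)]

# The sample means cluster around 170, with a spread close to
# 10 / sqrt(100) = 1, as the sampling-distribution argument predicts.
print(round(statistics.mean(means), 1))
print(round(statistics.stdev(means), 2))
```

Standardizing such means gives values that follow the t distribution described above.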
Are most variables normally distributed? In the above example we relied on
our knowledge that, in repeated samples of equal size, the standardized means (for
height) will be distributed following the t distribution (with a particular mean and
variance). However, this will only be true if in the population the variable of
interest (height in our example) is normally distributed, that is, if the distribution
of people of particular heights follows the normal distribution (the bell-shape
distribution). For many variables of interest, we simply do not know for sure that
this is the case. For example, is income distributed normally in the population? --
probably not. The incidence rates of rare diseases are not normally distributed in
the population, the number of car accidents is also not normally distributed, and
neither are very many other variables in which a researcher might be interested.
Sample size. Another factor that often limits the applicability of tests based on the
assumption that the sampling distribution is normal is the size of the sample of
data available for the analysis (sample size; n). We can assume that the sampling
distribution is normal even if we are not sure that the distribution of the variable in
the population is normal, as long as our sample is large enough (e.g., 100 or more
observations). However, if our sample is very small, then those tests can be used
only if we are sure that the variable is normally distributed, and there is no way to
test this assumption if the sample is small.
Measurement Scales: Variables differ in how well they can be measured, i.e., in
how much measurable information their measurement scale can provide. There is
obviously some measurement error involved in every measurement, which
determines the amount of information that we can obtain. Another factor that
determines the amount of information that can be provided by a variable is its type
of measurement scale. Specifically, variables are classified as (a) interval, (b)
ratio, (c) nominal, or (d) ordinal, as discussed below.
a) Interval variables allow us not only to rank order the items that are
measured, but also to quantify and compare the sizes of differences between
them. For example, temperature, as measured in degrees Fahrenheit or Celsius,
constitutes an interval scale. We can say that a temperature of 40 degrees is
higher than a temperature of 30 degrees, and that an increase from 20 to 40
degrees is twice as much as an increase from 30 to 40 degrees.
b) Ratio variables are very similar to interval variables; in addition to all the
properties of interval variables, they feature an identifiable absolute zero point,
thus, they allow for statements such as x is two times more than y. Typical
examples of ratio scales are measures of time or space. For example, as the
Kelvin temperature scale is a ratio scale, not only can we say that a
temperature of 200 degrees is higher than one of 100 degrees, we can correctly
state that it is twice as high. Interval scales do not have the ratio property.
Most statistical data analysis procedures do not distinguish between the
interval and ratio properties of the measurement scales.
c) Nominal variables allow for only qualitative classification. That is, they
can be measured only in terms of whether the individual items belong to some
distinctively different categories, but we cannot quantify or even rank order
those categories. For example, all we can say is that two individuals are
different in terms of variable A (e.g., they are of different race), but we cannot
say which one "has more" of the quality represented by the variable. Typical
examples of nominal variables are gender, race, color, city, etc.
d) Ordinal variables allow us to rank order the items we measure in terms of
which has less and which has more of the quality represented by the variable,
but still they do not allow us to say "how much more." A typical example of
an ordinal variable is the socioeconomic status of families. For example, we
know that upper-middle is higher than middle but we cannot say that it is, for
example, 18% higher. Also, this very distinction between nominal, ordinal,
and interval scales itself represents a good example of an ordinal variable. For
example, we can say that nominal measurement provides less information than
ordinal measurement, but we cannot say "how much less" or how this
difference compares to the difference between ordinal and interval scales.
Problems in measurement. Applications of tests that are based on the normality
assumptions are further limited by a lack of precise measurement. For example, let
us consider a study where grade point average is measured as the major variable
of interest. Is an A average twice as good as a C average? Is the difference
between a B and an A average comparable to the difference between a D and a C
average? Somehow, the grade point average is a crude measure of scholastic
activities that only allows us to establish a rank ordering of students from
"excellent" students to "poor" students. Most common statistical techniques such
as analysis of variance (and t- tests), regression, etc., assume that the underlying
3
Test of significance: Nonparametric Methods PHL-541…………………………………………………………………………………………..
measurements are at least of interval, meaning that equally spaced intervals on the
scale can be compared in a meaningful manner (e.g, B minus A is equal to D
minus C). However, as in our example, this assumption is very often not
acceptable, and the data rather represent a rank ordering of observations (ordinal)
rather than precise measurements.
Test of significance (Nonparametric methods) :
The test of significance is a test applied to two or more means to determine
whether the difference between them is a true difference or merely due to chance.
Parametric methods are statistical techniques that enable us to determine
whether there is a significant difference between two sample means, under
assumptions of normality, homogeneity of variances, and linearity. When these
assumptions cannot be met, the need is
evident for statistical procedures that enable us to process data of "low quality,"
from small samples, on variables about which nothing is known (concerning their
distribution). Specifically, nonparametric methods were developed to be used in
cases when the researcher knows nothing about the parameters of the variable of
interest in the population (hence the name nonparametric). In more technical
terms, nonparametric methods do not rely on the estimation of parameters (such as
the mean or the standard deviation) describing the distribution of the variable of
interest in the population. Therefore, these methods are also sometimes (and more
appropriately) called parameter-free methods or distribution-free methods.
Rationale for nonparametric methods
1. Nonparametric methods were developed for conditions in which assumptions
necessary for using parametric methods cannot be made.
2. Sometimes called distribution-free methods because it is not necessary to
assume that the observations are normally distributed.
3. Appropriate for dealing with data that are measured on a nominal or ordinal
scale and whose distribution is unknown.
NB. Nonparametric methods should be viewed as complementary statistical methods
rather than attractive alternatives. An inherent characteristic is that they deal with
ranks rather than values of observations.
What is meant by “nonparametric methods”?
1. Approaches for estimation and hypothesis testing when no underlying data
distribution is assumed.
2. Approaches where at least one of the following is satisfied:
Method can be used on ordinal-scaled data.
Method can be used on interval or ratio data, where distribution of the
random variable of interest is unspecified.
Method can be used on nominal-scaled data.
When do we use nonparametric methods?
1. When normality cannot be assumed.
2. When data cannot be transformed to normality.
3. When methods based on other non-normal distributions are not available or
appropriate.
4. When there is not sufficient sample size to assess the form of the distribution.
What are advantages of nonparametric methods?
1. Can be used with data on any measurement scale.
2. Easier to compute by hand for small samples (they were originally developed
before computers were in wide use).
3. Need not involve population parameters.
4. Results may be as exact as parametric procedures.
5. Make fewer assumptions; no underlying data distribution is assumed.
6. Can handle small samples without assuming something about the original data
distribution. (Recall normal distribution comes from averages of large
samples).
What are disadvantages of nonparametric methods?
1. May waste information if data permit using parametric procedures. Example:
converting data from ratio to ordinal scale.
2. Difficult to compute by hand for large samples.
3. They do not take advantage of all the special characteristics of a distribution.
Consequently, these methods do not fully utilize the information known about
the distribution.
4. Tables of critical values are not widely available.
What are the differences between parametric and nonparametric methods?
Parametric Test Procedures vs. Nonparametric Test Procedures:

1. Parametric: involve population parameters such as the mean and SD of the
   population distribution.
   Nonparametric: do not involve population parameters and do not assume the
   data are normally or t-distributed.

2. Parametric: the underlying measurements are at least on an interval scale,
   meaning that equally spaced intervals on the scale can be compared in a
   meaningful manner. For example, temperature, as measured in degrees F or C,
   constitutes an interval scale: a temperature of 40 degrees is higher than one
   of 30 degrees, and an increase from 20 to 40 degrees is twice as much as an
   increase from 30 to 40 degrees.
   Nonparametric: the dependent variable may be measured on any scale: interval
   or ratio data where the distribution of the random variable of interest is
   unspecified; ordinal (ranked) data such as good-better-best or
   upper-middle-lower; or nominal data such as gender, race, color, city.

3. Parametric: often require large sample sizes to appeal to normality.
   Nonparametric: sample sizes can be small.

4. Parametric: based on normality assumptions, i.e., have stringent (strict)
   assumptions of a normal (bell-shaped) distribution.
   Nonparametric: have few assumptions about the population distribution.

5. Parametric examples: t test, Z test, F test.
   Nonparametric example: Wilcoxon Rank Sum Test.
Parametric and Nonparametric Methods Equivalents:
Basically, there is at least one nonparametric equivalent for each parametric general
type of test. In general, these tests fall into the following categories:
1. Tests of differences between groups (independent samples).
Usually, when we have two samples that we want to compare concerning their
mean value for some variable of interest, we would use the t-test for
independent samples; nonparametric alternatives for this test are the Mann-
Whitney U test, and the Kolmogorov-Smirnov two-sample test.
If we have multiple groups, we would use analysis of variance (ANOVA); the
nonparametric equivalents to this method are the Kruskal-Wallis analysis of
ranks and the Median test.
2. Tests of differences between variables (dependent samples).
If we want to compare two variables measured in the same sample we would
customarily use the t-test for dependent samples. Nonparametric alternatives
to this test are the Wilcoxon's matched pairs test and Sign test. If the variables
of interest are dichotomous in nature (i.e., "pass" vs. "no pass") then
McNemar's Chi-square test is appropriate.
If there are more than two variables that were measured in the same sample,
then we would customarily use repeated measures ANOVA. Nonparametric
alternatives to this method are Friedman's two-way analysis of variance and
Cochran Q test (if the variable was measured in terms of categories, e.g.,
"passed" vs. "failed"). Cochran Q is particularly useful for measuring changes
in frequencies (proportions) across time.
3. Tests of relationships between variables.
To express a relationship between two variables one usually computes the
correlation coefficient. Nonparametric equivalents to the standard correlation
coefficient are Spearman R, Kendall Tau, and coefficient Gamma.
If the two variables of interest are categorical in nature (e.g., "passed" vs.
"failed" by "male" vs. "female") appropriate nonparametric statistics for
testing the relationship between the two variables are the Chi-square test, the
Phi coefficient, and the Fisher exact test. In addition, a simultaneous test for
relationships between multiple cases is available: Kendall coefficient of
concordance.
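As a rough mapping, and assuming SciPy is installed, many of the equivalents above correspond to functions in scipy.stats (the sample data here are made up purely for illustration):

```python
from scipy import stats  # assumes SciPy is available

a = [1.1, 2.3, 1.9, 3.0, 2.2]   # made-up group 1
b = [2.0, 2.8, 3.1, 2.7, 3.5]   # made-up group 2
c = [1.5, 2.1, 2.9, 2.4, 3.3]   # made-up group 3

u, p_u = stats.mannwhitneyu(a, b)        # vs. t-test for independent samples
h, p_h = stats.kruskal(a, b, c)          # vs. one-way ANOVA
w, p_w = stats.wilcoxon(a, b)            # vs. t-test for dependent samples
f, p_f = stats.friedmanchisquare(a, b, c)  # vs. repeated measures ANOVA
rho, p_r = stats.spearmanr(a, b)         # vs. Pearson correlation
tau, p_t = stats.kendalltau(a, b)        # vs. Pearson correlation
```

Each call returns the test statistic and a p-value; which test applies depends on the design (independent groups, paired observations, or correlation), as outlined above.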
When to Use Which Method?
It is not easy to give simple advice concerning the use of nonparametric procedures.
Each nonparametric procedure has its peculiar sensitivities and blind spots. For
example,
The Kolmogorov-Smirnov two-sample test is not only sensitive to differences in
the location of distributions (for example, differences in means) but is also greatly
affected by differences in their shapes.
The Wilcoxon matched pairs test assumes that one can rank order the degree of
differences in matched observations in a useful manner. If this is not the case, one
should rather use the Sign test.
In general, if the result of a study is important (e.g., does a very expensive and
painful drug therapy help people get better?), then it is always useful to run
different nonparametric tests; should discrepancies in the results occur depending
on which test is used, one should try to understand why some tests give different
results.
On the other hand, nonparametric statistics are less statistically powerful
(sensitive) than their parametric counterparts, and if it is important to detect even
small effects (e.g., is this food additive harmful to people?) one should be very
careful in the choice of a test statistic.
Large data sets and nonparametric methods.
Nonparametric methods are most appropriate when the sample sizes are small.
When the data set is large (e.g., n > 100) it often makes little sense to use
nonparametric statistics at all.
When the samples become very large, then the sample means will follow the
normal distribution even if the respective variable is not normally distributed in
the population, or is not measured very well. Thus, parametric methods, which are
usually much more sensitive (i.e., have more statistical power) are in most cases
appropriate for large samples.
However, the tests of significance of many of the nonparametric statistics
described here are based on asymptotic (large sample) theory; therefore, important
tests can often not be performed if the sample sizes become too small.
Some commonly used statistical tests (Table): each entry gives the example, the
parametric test, the nonparametric equivalent, and the purpose of the test.

1. To compare girls' heights with boys' heights.
   Parametric: Two-sample (unpaired) t test. Nonparametric: Mann-Whitney U test.
   Purpose: Compares two independent samples drawn from the same population.

2. To compare the weight of an infant before and after a feed.
   Parametric: One-sample (paired) t test. Nonparametric: Wilcoxon matched pairs test.
   Purpose: Compares two sets of observations on a single sample.

3. To determine whether plasma glucose is higher one, two, or three hours after a meal.
   Parametric: One-way analysis of variance (F test) using total sum of squares.
   Nonparametric: Kruskal-Wallis analysis of variance by ranks.
   Purpose: Effectively, a generalization of the paired t or Wilcoxon matched pairs
   test where three or more sets of observations are made on a single sample.

4. In the above example, to determine whether the results differ in male and female subjects.
   Parametric: Two-way analysis of variance. Nonparametric: Two-way analysis of
   variance by ranks.
   Purpose: As above, but tests the influence (and interaction) of two different covariates.

5. To determine whether acceptance into medical school is more likely if the
   applicant was born in the same country.
   Parametric: χ2 test. Nonparametric: Fisher's exact test.
   Purpose: Tests the null hypothesis that the distribution of a discontinuous
   variable is the same in two (or more) independent samples.

6. To assess whether and to what extent plasma HbA1 concentration is related to
   plasma triglyceride concentration in diabetic patients.
   Parametric: Product moment correlation coefficient (Pearson's r).
   Nonparametric: Spearman's rank correlation coefficient (rs).
   Purpose: Assesses the strength of the straight-line association between two
   continuous variables.

7. To see how peak expiratory flow rate varies with height.
   Parametric: Regression by least squares method. Nonparametric: Non-parametric
   regression (various tests).
   Purpose: Describes the numerical relation between two quantitative variables,
   allowing one value to be predicted from the other.

8. To determine whether and to what extent a person's age, body fat, and sodium
   intake determine their blood pressure.
   Parametric: Multiple regression by least squares method. Nonparametric:
   Non-parametric regression (various tests).
   Purpose: Describes the numerical relation between a dependent variable and
   several predictor variables (covariates).
Mann–Whitney U or Wilcoxon rank sum test (MWW)
Frank Wilcoxon (1892 - 1965) was an outstanding chemist, whose interest in statistics first started while studying fungicides, when he and colleagues studied Fisher’s newly published Statistical Methods for Research Workers. In 1945 he published his paper setting out the rank-sum and signed-rank tests which are still named after him. His background was colorful. A keen cyclist and motor cyclist, he and his twin sister were born in an Irish castle, to wealthy American parents. He grew up in the States, ran away to sea, worked as an oil worker and tree surgeon, and attended a military academy before finally entering college, aged 26, to read chemistry.
Mann-Whitney U test (also called the Mann-Whitney-Wilcoxon (MWW),
Wilcoxon rank-sum test, or Wilcoxon-Mann-Whitney test) is a non-parametric
test for assessing whether two independent samples of observations come from the
same distribution.
It is one of the best-known non-parametric significance tests.
It was proposed initially by Frank Wilcoxon in 1945, for equal sample sizes, and
extended to arbitrary sample sizes, and in other ways, by Mann and Whitney
(1947).
The Mann-Whitney-Wilcoxon test is practically identical to performing an ordinary
parametric two-sample t test on the data after ranking over the combined samples.
It is an excellent alternative to the t test if your data are significantly skewed.
NB: The Mann-Whitney-Wilcoxon test is equivalent to the unpaired t test, and the
Wilcoxon matched pairs test is equivalent to the paired t test.
The Mann-Whitney-Wilcoxon test tests for differences in medians and for the
chance of obtaining greater observations in one population versus the other.
The null hypothesis (H0) in the MWW test is that both populations have the same
probability of exceeding each other. i.e. no difference in the two population
distributions.
The alternative hypothesis (Ha) is that the variable in one population is
stochastically greater.
The test involves the calculation of a statistic, usually called U (derived from
the rank sums), whose distribution under the null hypothesis is known. For small
samples the distribution is tabulated, but for sample sizes above about 20 there
is a good approximation using the normal distribution.
The U test is included in most modern statistical packages. It is also easily
calculated by hand, especially for small samples.
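A minimal sketch of that normal approximation (the standard mean and variance of U under H0, with no tie or continuity correction):

```python
import math

def mww_z(u, m, n):
    """Normal approximation for the U statistic (no tie/continuity correction).
    Under H0, U has mean m*n/2 and variance m*n*(m+n+1)/12."""
    mean_u = m * n / 2
    sd_u = math.sqrt(m * n * (m + n + 1) / 12)
    return (u - mean_u) / sd_u

# e.g. the z-score for U = 17 with two samples of 8 observations each
print(round(mww_z(17, 8, 8), 2))   # -1.58
```

The resulting z is compared with the standard normal distribution; the approximation is usually considered adequate once both samples exceed about 20 observations.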
Assumptions for Mann-Whitney U Test:
The two samples under investigation are independent of each other, and the
observations within each sample are independent.
The observations are ordinal or continuous measurements (i.e., for any two
observations, one can at least say, whether they are equal or, if not, which one is
greater).
Data types that can be analysed with Mann-Whitney U-test:
Data points should be independent from each other.
Data do not have to be normal and variances do not have to be equal.
All individuals must be selected at random from the population.
All individuals must have equal chance of being selected.
Sample sizes should be as equal as possible but some differences are allowed.
Calculations: There are two ways of doing this.
Procedure # 1
Stage 1: Call one sample A and the other B.
Stage 2: Place all the values together in rank order (i.e. from lowest to
highest). If values from the two samples are equal, the 'A' sample is placed
first in the rank.
Stage 3: Inspect each 'B' sample in turn and count the number of 'A's which
precede (come before) it. Add up the total to get a U value.
Stage 4: Repeat stage 3, but this time inspect each A in turn and count the
number of B's which precede it. Add up the total to get a second U value.
Stage 5: Take the smaller of the two U values and look up the probability
value in the table below. This gives the percentage probability that the
difference between the two sets of data could have occurred by chance.
Example: The results of the cytogenetic analysis of abnormal cells after exposure to
the drug (Y) are shown below together with the concurrent control (X) data. Use the
Mann-Whitney U test to determine whether there is any evidence of a difference in
the median DISPERSION between the treated group and the control group.
Group (X) = 7; 3; 6; 2; 4; 3; 5; 5
Group (Y) = 3; 5; 6; 4; 6; 5; 7; 5
Solution: This example deals with two sets of sample data from two contrasting
groups, group X and group Y, with the aim of comparing them and demonstrating
differences. There are eight pairs of data in this example. Tests of significance are
used to tell us whether the differences between the two sets of sample data are truly
significant or whether these differences could have occurred by chance. Moreover,
tests of significance tell us the probability level that differences between the two
groups, X and Y are due to chance.
Stage 1:
Sample A = 7; 3; 6; 2; 4; 3; 5; 5
Sample B = 3; 5; 6; 4; 6; 5; 7; 5
Stage 2:
Group:  A   A   A   B   A   B   A   A   B   B   B   A   B   B   A   B
Value:  2   3   3   3   4   4   5   5   5   5   5   6   6   6   7   7
Rank:   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
Stage 3: For each 'B', count the number of 'A's that precede it; the total is
U = 3+4+6+6+6+7+7+8 = 47.
Stage 4: For each 'A', count the number of 'B's that precede it; the total is
U = 0+0+0+1+2+2+5+7 = 17.
Stage 5: U = 17 (the smaller of the two values).
The value from the table at U = 17 and n = 8 is 6.5; that is, the probability
that the difference between group Y and group X arose just by chance is 6.5 per
cent (see the next Table).
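The five stages can be sketched in a few lines of code, using the example data from this section; the tie rule follows Stage 2 (equal values from sample A are placed first):

```python
def u_by_counting(a, b):
    """Mann-Whitney U by the counting procedure (Stages 1-5).
    Ties are resolved by placing sample A first (Stage 2)."""
    # Stage 3: for each B value, count the A values that precede it
    # (A wins ties, so a <= y counts)
    u_b = sum(sum(1 for x in a if x <= y) for y in b)
    # Stage 4: for each A value, count the B values that precede it
    # (strictly smaller only, since A comes first on ties)
    u_a = sum(sum(1 for y in b if y < x) for x in a)
    assert u_a + u_b == len(a) * len(b)   # the two counts always sum to m*n
    return min(u_a, u_b)                  # Stage 5: take the smaller U

x = [7, 3, 6, 2, 4, 3, 5, 5]   # control group X (sample A)
y = [3, 5, 6, 4, 6, 5, 7, 5]   # treated group Y (sample B)
print(u_by_counting(x, y))     # 17
```

The two intermediate counts are 47 and 17, matching Stages 3 and 4 above.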
If you find that there is a significant probability that the differences could have
occurred by chance, this can mean:
1. Either the difference is not significant and there is little point in
looking further for explanations of it, OR
2. Your sample is too small. If you had taken a larger sample, you might
well find that the result of the test of significance changes: the difference
between the two groups becomes more certain.
Percentage probability associated with values of U (columns: sample size
n = 1 to 8):

 U      1     2     3     4     5     6     7     8
 0   11.1   2.2   0.6   0.2   0.1   0.0   0.0   0.0
 1   22.2   4.4   1.2   0.4   0.2   0.1   0.0   0.0
 2   33.3   8.9   2.4   0.8   0.3   0.1   0.1   0.0
 3   44.4  13.3   4.2   1.4   0.5   0.2   0.1   0.1
 4   55.6  20.0   6.7   2.4   0.9   0.4   0.2   0.1
 5         26.7   9.7   3.6   1.5   0.6   0.3   0.1
 6         35.6  13.9   5.5   2.3   1.0   0.5   0.2
 7         44.4  18.8   7.7   3.3   1.5   0.7   0.3
 8         55.6  24.8  10.7   4.7   2.1   1.0   0.5
 9               31.5  14.1   6.4   3.0   1.4   0.7
10               38.7  18.4   8.5   4.1   2.0   1.0
11               46.1  23.0  11.1   5.4   2.7   1.4
12               53.9  28.5  14.2   7.1   3.6   1.9
13                     34.1  17.7   9.1   4.7   2.5
14                     40.4  21.7  11.4   6.0   3.2
15                     46.7  26.2  14.1   7.6   4.1
16                     53.3  31.1  17.2   9.5   5.2
17                           36.2  20.7  11.6   6.5
18                           41.6  24.5  14.0   8.0
19                           47.2  28.6  16.8   9.7
Procedure # 2
Choose the sample for which the ranks seem to be smaller (The only reason to
do this is to make computation easier). Call this "sample 1," and call the other
sample "sample 2."
Taking each observation in sample 1, count the number of observations in
sample 2 that are smaller than it (count a half for any that are equal to it).
Calculate the rank sums R1 and R2, then use the following formulas:
U1 = m x n + m (m + 1)/2 – R1
U2 = m x n + n (n + 1)/2 – R2
U1 + U2 should equal m x n.
For significance, the smaller calculated U value should be less than the
tabulated critical value.
NB: If you have ties:
1. Rank them anyway, pretending they were slightly different.
2. Find the average of the ranks for the identical values, and give them all that
rank.
3. Carry on as if all the whole-number ranks have been used up.
These average ranks can now be used for the Mann-Whitney U test.
Example of handling ties:

Data:         14   2   5   4   2  14  18  14
Sorted data:   2   2   4   5  14  14  14  18
Rank:          1   2   3   4   5   6   7   8   (ranking them anyway, pretending
they were slightly different)

Find the average of the ranks for the identical values and give them all that
rank: the two 2's occupy ranks 1 and 2, so each receives the average rank 1.5;
the three 14's occupy ranks 5, 6 and 7, so each receives the average rank 6.

Rank (ties averaged):  1.5  1.5   3   4   6   6   6   8
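The three tie-handling steps can be sketched as a small function, applied here to the same eight data points:

```python
def rank_with_ties(data):
    """Rank data (1-based); identical values share the average of their ranks."""
    s = sorted(data)
    # each distinct value gets the average of the positions it occupies
    avg = {v: sum(i + 1 for i, x in enumerate(s) if x == v) / s.count(v)
           for v in set(s)}
    return [avg[v] for v in data]

print(rank_with_ties([2, 2, 4, 5, 14, 14, 14, 18]))
# [1.5, 1.5, 3.0, 4.0, 6.0, 6.0, 6.0, 8.0]
```

These average ranks are exactly what the rank-sum formulas of Procedure 2 expect.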
Solution of our example
Group (X) = 7; 3; 6; 2; 4; 3; 5; 5
Group (Y) = 3; 5; 6; 4; 6; 5; 7; 5
Try to rank them and find the average of the ranks
Sample 1 (Group Y):   3     5     6     4     6     5     7     5
Ranks (orders):       3     9    13    5.5   13     9   15.5    9    (m = 8)

Sample 2 (Group X):   7     3     6     2     4     3     5     5
Ranks (orders):     15.5    3    13     1    5.5    3     9     9    (n = 8)

(The ranks run from 1 for the lowest value, 2, up to 15.5 for the two highest
values, the 7's; tied values receive the average of their ranks.)

R1 (sum of the ranks of sample 1) = 77
R2 (sum of the ranks of sample 2) = 59
U1 = 8 x 8 + 8 (8 + 1)/2 – 77 = 23
U2 = 8 x 8 + 8 (8 + 1)/2 – 59 = 41    (check: U1 + U2 = 64 = m x n)
Look at the next Table at n = 8 and m = 8: the calculated value (23) is greater
than the tabulated value (16), so the difference is not significant and we fail
to reject H0.
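A minimal sketch of Procedure 2 (average ranks for ties, then the U1/U2 formulas); applied to the example data it reproduces R1 = 77, R2 = 59, U1 = 23, U2 = 41:

```python
def rank_sum_u(sample1, sample2):
    """Mann-Whitney U from rank sums; tied values receive average ranks."""
    pooled = sorted(sample1 + sample2)
    # average rank of each distinct value in the pooled, sorted data
    avg = {v: sum(i + 1 for i, x in enumerate(pooled) if x == v) / pooled.count(v)
           for v in set(pooled)}
    m, n = len(sample1), len(sample2)
    r1 = sum(avg[v] for v in sample1)
    r2 = sum(avg[v] for v in sample2)
    u1 = m * n + m * (m + 1) / 2 - r1
    u2 = m * n + n * (n + 1) / 2 - r2
    return r1, r2, u1, u2

y = [3, 5, 6, 4, 6, 5, 7, 5]   # sample 1 (treated group Y)
x = [7, 3, 6, 2, 4, 3, 5, 5]   # sample 2 (control group X)
print(rank_sum_u(y, x))        # (77.0, 59.0, 23.0, 41.0)
```

Note that U1 + U2 = 64 = m x n, as the check in Procedure 2 requires.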
Example: The results of the cytogenetic analysis of abnormal cells in Males (♂) and
Females (♀) are shown below. Test to see if there is a significant difference between
these two gender groups.
Group (♀) = 9; 4; 6; 8; 6
Group (♂) = 19; 16; 9; 19; 8
Example: In KKUH, the % of mothers bearing low-birth-weight babies over 7
monthly observations were 3, 0, 4, 0, 1, 2 and 3, while the % of mothers bearing
normal-birth-weight babies were 4, 3, 6, 11, 7, 8, and 9. Is there a significant
difference between these two groups?
Comparison to Student's t-test
The Mann-Whitney U test is useful in the same situations as the independent samples
Student's t-test, and the question arises of
which should be preferred?
Ordinal data: Mann-Whitney U remains the logical choice when the data are
ordinal but not interval scaled, so that the spacing between adjacent values
cannot be assumed to be constant.
Robustness: It is much less likely than the Student's t-test to give a spuriously
significant result because of one or two outliers - Mann-Whitney U is more
robust.
Efficiency: When normality holds, the Wilcoxon-Mann-Whitney test has an
efficiency of about 0.95 (95%) compared to the Student's t-test. For
distributions sufficiently far from normal and for sufficiently large sample
sizes, the Wilcoxon-Mann-Whitney test can be considerably more efficient
than the Student's t-test.
Overall, this robustness makes the Wilcoxon-Mann-Whitney test more widely
applicable than the Student's t-test, and for large samples from the normal
distribution, the efficiency loss compared to the Student's t-test is only
5%, so one can recommend the Wilcoxon-Mann-Whitney test as the default test
for comparing interval or ordinal measurements with similar distributions.
The relation between efficiency and power in concrete situations isn't trivial
though. For small sample sizes one should investigate the power of the
Wilcoxon-Mann-Whitney test versus Student's t-test.
Wilcoxon Matched Pairs Signed Ranks Test (Wilcoxon Test)
The Wilcoxon signed-rank test is a non-parametric statistical hypothesis test for the
case of two related samples or repeated measurements on a single sample. It can be
used as an alternative to the paired Student's t-test when the population cannot be
assumed to be normally distributed. The test is named for Frank Wilcoxon who, in a
single paper, proposed both it and the rank-sum test for two independent samples.
Assumptions
This test assumes that the data are at an ordinal-metric level, i.e., that the
original data can be validly ordered, that the data after the intervention can
be ordered, and that the differences between the two sets of data can be validly
ordered.
This assumption is slightly less critical than the interval level assumption
necessary for the t-test. The assumption of there being a normal distribution does
not have to be met, this is particularly practical if the maximum change is
somehow limited.
A positive aspect of the Wilcoxon test is that it is a very powerful test: if
all the assumptions for the t-test are met, the Wilcoxon test has about 95% of
the power of the t-test.
Data types that can be analyzed with Wilcoxon Matched pairs
the distributions do not have to be normal and variances do not have to be equal
all individuals must be selected at random from the population
all individuals must have equal chance of being selected
sample sizes should be equal
Setup
Suppose we collect 2n observations, two observations on each of n subjects.
Let i denote the particular subject being referred to, let the first observation
measured on subject i be denoted by Xi, and let the second observation be Yi;
for each i, Xi and Yi are paired together.
Let Zi = Yi – Xi for i = 1, ... , n. The differences Zi are assumed to be
independent; each Zi comes from the same continuous population and is symmetric
about a common median θ. Xi and Yi are ordinal variables, so that comparisons
such as greater than, less than, and equal to have meaning.
Test procedure
The null hypothesis tested is H0: median θ = 0.
The Wilcoxon signed rank statistic W+ is computed by ordering the absolute
values |Z1|, ..., |Zn|; the rank of each ordered |Zi| is denoted Ri. Let
φi = I(Zi > 0), where I(.) is the indicator function (φi = 1 if Zi > 0 and 0
otherwise). The Wilcoxon signed rank statistic W+ is then defined as
W+ = φ1R1 + φ2R2 + ... + φnRn = Σ φiRi  (sum over i = 1 to n).
It is often used to test the difference between scores of data collected before and
after an experimental manipulation, in which case the central point under the null
hypothesis would be expected to be zero.
Scores exactly equal to the central point are excluded and the absolute values of
the deviations from the central point of the remaining scores are ranked such that
the smallest deviation has a rank of 1.
Tied scores are assigned a mean rank.
The sums for the ranks of scores with positive and negative deviations from the
central point are then calculated separately.
A value S is defined as the smaller of these two rank sums. S is then compared to
a table of all possible distributions of ranks to calculate p, the statistical
probability of attaining S from a population of scores that is symmetrically
distributed around the central point.
As the number of scores used, n, increases, the distribution of all possible ranks S
tends towards the normal distribution. So although for n ≤ 20, exact probabilities
would usually be calculated, for n > 20, the normal approximation is used.
The Wilcoxon test was popularized by Siegel (1956) in his influential textbook
on non-parametric statistics. Siegel used the symbol T for the value defined here
as S. In consequence, the test is sometimes referred to as the Wilcoxon T test, and
the test statistic is reported as a value of T.
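The ranking steps above (exclude zero differences, rank the absolute deviations with mean ranks for ties, sum the positive and negative rank sums, take the smaller) can be sketched in Python. This helper and its name are illustrative, not from the original text; `scipy.stats.rankdata` applies the same mean-rank rule for ties described above.

```python
import numpy as np
from scipy.stats import rankdata

def wilcoxon_S(before, after):
    """Smaller of the positive and negative signed-rank sums (Siegel's T)."""
    z = np.asarray(after, float) - np.asarray(before, float)
    z = z[z != 0]                  # scores equal to the central point are excluded
    ranks = rankdata(np.abs(z))    # smallest deviation gets rank 1; ties get mean ranks
    w_plus = ranks[z > 0].sum()    # rank sum of positive deviations
    w_minus = ranks[z < 0].sum()   # rank sum of negative deviations
    return min(w_plus, w_minus)
```

For the 10-child data in the example that follows, this returns S = 18, matching the hand calculation.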
Example:
Scores of abnormal cells were taken at the start of treatment with a
chemotherapeutic drug and again 5 weeks later to determine the genotoxic effects of
the drug. Results for 10 children are presented in the next Table. Is there a
significant difference between these two observations?
Subject (i)   Xi (1st obser.)   Yi (2nd obser.)   Sign of Xi–Yi   Xi–Yi   Absolute Xi–Yi   Rank of Absolute   Signed Rank
1 125 110 + 15 15 7 7
2 115 122 - -7 7 3 -3
3 130 125 + 5 5 1.5 1.5
4 140 120 + 20 20 9 9
5 140 140 --- 0 0 --- ---
6 115 124 - -9 9 4 -4
7 140 123 + 17 17 8 8
8 125 137 - -12 12 6 -6
9 140 135 + 5 5 1.5 1.5
10 135 145 - -10 10 5 -5
1. The sign of Xi – Yi is denoted in the Sign column by either (+) or (-). If Xi
and Yi are equal, then the value is thrown out.
2. The values of Xi – Yi are given in the next two columns.
3. The last two columns are the ranks.
4. The absolute rank column has no signs, and the signed rank column gives the
ranks along with their signs.
5. The absolute differences are ranked from the smallest value to the largest value.
6. In the case of a tie, the ranks involved are added together and divided by the
number of tied values. For example, in these data there were two instances of the
value 5. The ranks corresponding to 5 are 1 and 2; their sum is 3. Dividing by
the number of ties gives a mean rank of 1.5, and this value is assigned to both
instances of 5.
7. The test statistic W+ is given by the sum of all of the positive values in the
Signed Rank column, and W- by the sum of the absolute values of the negative
entries in the Signed Rank column. For this example, W+ = 27 and W- = 18.
The minimum of these is 18.
8. Lastly, this test statistic is compared against the next table of critical values.
9. If the obtained test statistic is less than or equal to the critical value based on
the number of observations n, then the null hypothesis is rejected in favor of
the alternative hypothesis; otherwise the null is accepted. That is, the obtained
value is significant if it is ≤ the value in the table.
10. In this case the test statistic is W = 18 and the critical value is 8 for a two-
tailed p-value of 0.05. The test statistic must be less than or equal to this value
to be significant at this level, so the null hypothesis cannot be rejected (not significant).
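The same conclusion can be checked with SciPy (a sketch; `scipy.stats.wilcoxon` also discards zero differences by default, and because of the tied absolute differences it may use a normal approximation, so its p-value is approximate here):

```python
from scipy.stats import wilcoxon

x = [125, 115, 130, 140, 140, 115, 140, 125, 140, 135]  # 1st observation
y = [110, 122, 125, 120, 140, 124, 123, 137, 135, 145]  # 2nd observation

res = wilcoxon(x, y)   # paired test; the zero difference (subject 5) is dropped
print(res.pvalue)      # well above 0.05, so H0 is not rejected
```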
Table of critical values for the Wilcoxon test:
Example: The number of cigarettes smoked per day for a group of men was 8, 13, 24,
15, 7, 11, 20, 22, 6, 15, 20; after they decided to stop smoking, the numbers
were 5, 15, 11, 19, 0, 12, 15, 0, 0, 6, 20. Does the median of the differences
between the pre- and post-decision values differ significantly from zero?
Sum of all signed ranks (W) = 41.000, Sum of positive ranks (T+) = 48.000, Sum of
negative ranks (T-) = -7.000. The two-tailed P value is 0.0371, considered significant.
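This result can be reproduced with SciPy (a sketch; the pair with no change, 20 vs 20, is dropped automatically, leaving T+ = 48 and T- = 7 as reported above):

```python
from scipy.stats import wilcoxon

pre  = [8, 13, 24, 15, 7, 11, 20, 22, 6, 15, 20]
post = [5, 15, 11, 19, 0, 12, 15, 0, 0, 6, 20]

res = wilcoxon(pre, post)    # paired, two-sided by default
print(round(res.pvalue, 4))  # close to the reported two-tailed P of 0.0371
```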
Kruskal-Wallis one-way analysis of variance (KW)
In statistics, the Kruskal-Wallis one-way analysis of variance by ranks (named
after William Kruskal and W. Allen Wallis) is a non-parametric method for
testing equality of population medians among groups. It is identical to a one-way
analysis of variance with the data replaced by their ranks.
It is an extension of the Mann-Whitney U test (MWW) to 3 or more groups.
Since it is a non-parametric method, the Kruskal-Wallis test does not assume a
normal population, unlike the analogous one-way analysis of variance. However,
the test does assume an identically-shaped and scaled distribution for each group,
except for any difference in medians.
The test involves the calculation of a statistic, usually called KW (or H),
computed from the rank sums of the groups.
In Kruskal-Wallis test all data from all groups are ranked together; i.e., rank the
data from 1 to n ignoring group membership. Assign any tied values the average
of the ranks they would have received had they not been tied.
The null hypothesis (H0) will be rejected if the computed value of KW is so large
that the probability of obtaining a value that large or larger when H0 is true is
equal to or less than chosen significance level.
Post-hoc tests were proposed by Schaich and Hamerle (1984) as well as Conover
(1971, 1980) in order to decide which groups are significantly different from each
other, based upon the mean rank differences of the groups. These procedures are
detailed in Bortz, Lienert and Boehnke (2000, pp. 275); however, not all statistical
packages support Post-hoc analysis for Kruskal-Wallis test. But user contributed
code exists that provides these facilities (for example in SPSS).
Example: The effects of two drugs on reaction time to a certain stimulus were studied
in three samples of experimental animals. Sample 3 served as a control while the
animals in sample 1 were treated with drug A and those in sample 2 were treated with
drug B prior to the application of the stimulus. The data below shows the reaction
times in seconds of the 13 animals. Group 1: 17, 20, 40, 31, 35. Group 2: 8, 7, 9,
8. Group 3: 2, 5, 4, 3. Can we conclude that the three populations represented by
the three samples differ with respect to reaction time? We can so conclude if we can
reject the H0 that the three populations do not differ in their reaction times.
Solution
Firstly, if you used GraphPad and you selected One-way Analysis of Variance
(ANOVA) followed by Tukey-Kramer Multiple Comparisons Test you will get this:
Assumption test: Are the standard deviations of the groups equal? ANOVA
assumes that the data are sampled from populations with identical SDs. This
assumption is tested using the method of Bartlett. Bartlett statistic (corrected)
= 15.937. The P value is 0.0003. Bartlett's test suggests that the differences
among the SDs are extremely significant. Since ANOVA assumes populations
with equal SDs, you should consider transforming your data (reciprocal or log)
or selecting a nonparametric test.
Thus, at the moment select a nonparametric test i.e. Kruskal-Wallis Test
(Nonparametric ANOVA) followed by Dunn's Multiple Comparisons Test
because the data are independent.
Now you will get this result. The P value is 0.0047, considered very
significant and variation among column medians is significantly greater than
expected by chance.
Calculation details
Group     Number of Points   Sum of Ranks   Mean of Ranks
Drug A    5                  55.000         11.000
Drug B    4                  26.000         6.500
Control   4                  10.000         2.500
Kruskal-Wallis Statistic KW = 10.711 (corrected for ties)
Dunn's Multiple Comparisons Test
Comparison              Mean Rank Difference   P value
Drug A versus Drug B    4.500                  ns P>0.05
Drug A versus control   8.500                  ** P<0.01
Drug B versus control   4.000                  ns P>0.05
Data are presented as mean ± SD. **P < 0.01 versus control (Kruskal-Wallis test followed by Dunn’s multiple comparisons test).
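The KW statistic and p-value can be checked with SciPy (a sketch; `scipy.stats.kruskal` applies the same correction for ties, although Dunn's post test is not part of `scipy.stats`):

```python
from scipy.stats import kruskal

drug_a  = [17, 20, 40, 31, 35]
drug_b  = [8, 7, 9, 8]
control = [2, 5, 4, 3]

stat, p = kruskal(drug_a, drug_b, control)   # tie-corrected by default
print(round(stat, 3), round(p, 4))           # → 10.711 0.0047
```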
Friedman Test
The Friedman test is a non-parametric statistical test developed by the economist
Milton Friedman.
It is similar to the parametric one-way repeated measures ANOVA; however, the
data on the dependent variable are measured on an ordinal scale. In its use of
ranks it is similar to the Kruskal-Wallis one-way analysis of variance by
ranks.
The test assumes the study involves one independent variable, and that the same
participants are repeatedly observed under three or more conditions. Also, the
present test bears some resemblance to the Wilcoxon matched pairs signed rank
test. However, the Wilcoxon involves one independent variable with two and only
two levels and the participants in the study are in both conditions.
The null hypothesis of the test is that the treatments have identical effects.
The alternative hypothesis is that at least one of the treatments tends to yield
larger values than at least one of the other treatments.
The Friedman test is widely supported by many statistical packages such as SAS,
SPSS, R, MATLAB, WINKS, GraphPad, BioEstat and more.
Post-hoc tests were proposed by Schaich and Hamerle (1984) as well as Conover
(1971, 1980) in order to decide which groups are significantly different from each
other, based upon the mean rank differences of the groups. These procedures are
detailed in Bortz, Lienert and Boehnke (2000, pp. 275); however, not all statistical
packages support Post-hoc analysis for Friedman's test. But user contributed code
exists that provides these facilities (for example in SPSS).
Example
Suppose for example we want to find out if students have a preference for one
type of soda over others. They are blindfolded and given a taste test. They are
asked to take a sip of Brand X, Brand Y and Brand Z sodas and to rank order their
favourite for the three sodas where a 1 is the highest rank, a 2 the next highest
and a 3 the least preferred soda.
The data representing the rankings given by each participant to the three sodas
are:
Participants’ Rankings of the Three Brands of Soda
Participant   Brand X   Brand Y   Brand Z
Ahmed         2         1         3
Mohamed       1         3         2
Ibrahim       1         2         3
Anfal         1         3         2
Ali           1         3         2
Noha          1         2         3
Amal          1         3         2
Osman         1         2         3
Anas          1         3         2
Hany          2         1         3
Solution
Go to GraphPad, select compare means and raw data, click to enter data in 3
separate columns and mark them as Brand X, Brand Y, Brand Z. Look at the
normality test but do not worry about it here; click to select all columns, perform
repeated measures ANOVA, nonparametric methods, and mark "only perform post
test if P < 0.05". The Friedman test with post-test will now be active; click on it
and the results will appear as follows:
Friedman Test (Nonparametric Repeated Measures ANOVA)
The P value is 0.0063, considered very significant.
Variation among column medians is significantly greater than expected by chance.
Calculation detail:
Group     Sum of Ranks
Brand X   12.000
Brand Y   23.000
Brand Z   25.000
Number of Rows = 10
Number of Columns = 3
Friedman Statistic Fr = 9.800
Dunn's Multiple Comparisons Test
If the difference between rank sum means is greater than 10.710 then the P value is
less than 0.05.
Comparison               Rank Sum Difference   P value
Brand X versus Brand Y   -11.000               * P<0.05
Brand X versus Brand Z   -13.000               * P<0.05
Brand Y versus Brand Z   -2.000                ns P>0.05
Data are presented as mean ± SD. *P < 0.05 versus Brand X (Friedman test followed
by Dunn’s multiple comparisons test).
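The Friedman statistic can be checked with SciPy (a sketch; `scipy.stats.friedmanchisquare` uses the chi-square approximation, so its p-value is slightly larger than GraphPad's reported 0.0063):

```python
from scipy.stats import friedmanchisquare

# one list per brand, in participant order (Ahmed ... Hany)
brand_x = [2, 1, 1, 1, 1, 1, 1, 1, 1, 2]
brand_y = [1, 3, 2, 3, 3, 2, 3, 2, 3, 1]
brand_z = [3, 2, 3, 2, 2, 3, 2, 3, 2, 3]

stat, p = friedmanchisquare(brand_x, brand_y, brand_z)
print(round(stat, 1), round(p, 4))   # → 9.8 0.0074
```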
Data transformations
One advantage of using parametric statistics is that it makes it much easier to
describe your data.
If you have established that it follows a normal distribution you can be sure that a
particular set of measurements can be properly described by its mean and
standard deviation.
If your data are not normally distributed you cannot use any of the tests that
assume normality (e.g. t-test, ANOVA, regression analysis).
If your data are not normally distributed it is often possible to normalize them
by transforming them.
The need for transformations. We should always check the assumptions test (as in
GraphPad): Are the standard deviations equal? (i.e. that data follow a normal
distribution with uniform variance). If the data meet the assumptions i.e. the data
come from populations with equal SDs, we can analyze the raw data as described but
if they are not met, we have two possible strategies:
1. We can use a method which does not require these assumptions, such as a rank-
based method (nonparametric methods) or
2. We can transform the data mathematically to make them fit the assumptions more
closely before analysis.
Different transformations work for different data types:
Logarithms: Growth rates (or dynamics) are often
exponential and log transforms will often normalise them. Log transforms are
particularly appropriate if the variance increases with the mean.
Reciprocal: If a log transform does not normalise your data
you could try a reciprocal (1/x) transformation. This is often used for enzyme
reaction rate data, survival times.
Square root: This transform is often of value when the data
are counts, e.g. blood cells on a haemocytometer or woodlice in a garden.
Carrying out a square root transform will convert data with a Poisson distribution
to a normal distribution.
Arcsine: This transformation is also known as the angular
transformation and is especially useful for percentages and proportions.
Logit (logistic) transformation: This is the transformation we
apply most often to each proportion, p, in a set of proportions. We cannot take
the logit transformation if either p = 0 or p = 1, because the corresponding logit
values are -∞ and +∞. One solution is to take p as 1/(2n) instead of 0, and as
{1 - 1/(2n)} instead of 1. The logit linearizes a sigmoid curve.
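These transformations can be sketched in Python (an illustration, not from the original text; the function name is hypothetical, and `n` in the logit is the sample size behind each proportion, used for the 1/(2n) rule above):

```python
import numpy as np

def transform(x, kind, n=None):
    """Common normalizing transforms; choose `kind` by data type."""
    x = np.asarray(x, dtype=float)
    if kind == "log":          # growth rates; variance increasing with the mean
        return np.log(x)
    if kind == "reciprocal":   # enzyme reaction rates, survival times
        return 1.0 / x
    if kind == "sqrt":         # Poisson-distributed counts
        return np.sqrt(x)
    if kind == "arcsine":      # percentages/proportions (angular transform)
        return np.arcsin(np.sqrt(x))
    if kind == "logit":        # proportions; 1/(2n) rule keeps values finite
        x = np.where(x == 0, 1 / (2 * n), x)
        x = np.where(x == 1, 1 - 1 / (2 * n), x)
        return np.log(x / (1 - x))
    raise ValueError(kind)
```

For example, `transform([0.5], "logit", n=10)` gives 0, and p = 0 or p = 1 are mapped to finite logit values.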
Can all data be transformed?
Not all data can be transformed successfully.
1. Sometimes we have very long tails at both ends of the distribution, which
makes transformation ineffective.
2. Sometimes we have a bimodal distribution, which makes transformation
also ineffective.
3. Sometimes we have a large number of identical observations, which
will all transform to the same value whatever transformation we use.
These are often at one extreme of the distribution, usually at zero.
4. Sometimes transformation leads to variation in the p-value.
So, what can we do if we cannot transform the data to a suitable form? It is usually
safer to use methods that do not require such assumptions. These include the
non-parametric methods.
Correlation Coefficients and linear regression
Correlation is a concept that carries the common colloquial implication of
association, such as ‘‘height and weight are correlated’’.
The statistical procedure will give the word a technical meaning; we can actually
calculate a number that tells the strength of the association. The results of
correlation are expressed as a P-value (for the hypothesis test) and an r-value
(correlation coefficient) or r2 value (coefficient of determination).
Suppose that the data set consists of n pairs of observations expressing a possible
relationship between two continuous variables. We characterize the strength of
such a relationship by calculating the coefficient of correlation called Pearson’s
correlation coefficient. However, like other common parametric statistics, such
as the mean and the standard deviation, the correlation coefficient is very
sensitive to extreme observations. We may be interested in calculating a measure
of association that is more robust with respect to outlying values.
There are not one but two nonparametric procedures: Spearman’s and Kendall’s
tau rank correlations.
Generally: Values near +1 indicate a strong positive association. Values near -1
indicate a strong negative association. Values around 0 indicate a weak
association.
If they tend to go from lower left to upper right, we have a positive association; if
they tend to go from upper left to lower right, we have a negative association.
NB. The goal of linear regression is to find the equation (Y = intercept + slope × X)
of the line that best fits the points; this is then used as a visual summary of the
relationship between the variables, or for estimating unknown values of one
variable when given the value of the other.
Example: The below Table gives the values for the birth weight (x) and the
increase in weight between days 70 and 100 of life, expressed as a percentage of
the birth weight (y) for 12 infants. If we let each pair of numbers (x; y) be
represented by a dot in a diagram with the x’s on the horizontal axis, we have
Figure (see below). The dots do not fall perfectly on a straight line, but rather,
scatter around a line, very typical for statistical relationships. Because of this
scattering of dots, the diagram is called a scatter diagram. The positions of the
dots provide some information about the direction as well as the strength of the
association under investigation.
Birth weight, x (Oz)   Increase in weight (Oz)   Increase as % of birth weight, y
112                    70.56                     63
111                    73.26                     66
107                    77.04                     72
119                    61.88                     52
92                     69.00                     75
80                     94.40                     118
81                     97.20                     120
84                     95.76                     114
118                    49.56                     42
106                    76.32                     72
103                    92.70                     90
94                     85.54                     91
By using GraphPad, select Regression and correlation, X and Y (or two or more Y
replicates), then click next; put Birth weight in the X column and % of the birth
weight in Y1, then click next and select Nonparametric Spearman Rank
Correlation. Click next and you will get Spearman r = -0.9632, 95% confidence
interval: -0.9903 to -0.8658. The two-tailed P value is < 0.0001, considered
extremely significant. Click next and you will get this scatter figure.
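The GraphPad output can be reproduced with SciPy (a sketch; `scipy.stats.spearmanr` applies the same correction for ties):

```python
from scipy.stats import spearmanr

birth_weight = [112, 111, 107, 119, 92, 80, 81, 84, 118, 106, 103, 94]
pct_increase = [63, 66, 72, 52, 75, 118, 120, 114, 42, 72, 90, 91]

rho, p = spearmanr(birth_weight, pct_increase)
print(round(rho, 4))   # → -0.9632, with two-tailed P < 0.0001
```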
Try to do it manually (see below).
Spearman's Rank Correlation Coefficient
Spearman’s rank correlation is a direct nonparametric counterpart of
parametric Pearson’s correlation coefficient.
The Spearman's Rank Correlation Coefficient is used to discover the strength
and direction (negative or positive) of a relationship between two variables.
Example: This example looks at the strength of the link between the price of
a convenience item (a 500 ml bottle of water) and distance from the Pyramids in
Giza, Egypt. The hypothesis tested is that prices should decrease with distance
from the key area of gentrification surrounding the Pyramids. The lines followed
are the transects in the map below, with continuous sampling of the price of a
500 ml bottle of water at every convenience store (see the Table below).
Map to show the location of environmental gradients for transect lines in Giza
Hypothesis
We might expect to find that the price of a bottle of water decreases as distance
from the Pyramids increases. Higher property rents close to the Pyramids should
be reflected in higher prices in the shops.
The hypothesis might be written like this: The price of a convenience item
decreases as distance from the Pyramids increases.
The more objective scientific research method is always to assume that no such
price-distance relationship exists and to express the null hypothesis as:
there is no significant relationship between the price of a convenience item
and distance from the Pyramids.
What can go wrong? Having decided upon the wording of the hypothesis, you
should consider whether there are any other factors that may influence the study.
Some factors that may influence prices may include:
The type of retail outlet. You must be consistent in your choice of retail outlet.
For example, restaurants often charge significantly more for water than a
convenience store. You should decide which type of outlet to use and stick with it
for all your data collection.
Some shops have different prices for the same item: a high tourist and lower local
price, dependent upon the shopkeeper's perception of the customer.
Shops near main roads may charge more than shops in less accessible back
streets, due to the higher rents demanded for main road retail sites.
The positive spread effects from other nearby areas of gentrification or from
competing areas of tourist attraction. The negative spread effects from nearby
areas of urban decay.
Higher prices may be charged during the summer when demand is less flexible,
making seasonal comparisons less reliable.
Cumulative sampling may distort the expected price-distance gradient if several
shops cluster within a short area along the transect line followed by a
considerable gap before the next group of retail outlets.
You should mention such factors in your investigation.
The data collected (see the data table below) suggest a fairly strong negative
relationship, as shown in this scatter graph produced using GraphPad:
Scatter graph to show the change in the price of a convenience item (in Egyptian
Pound) with distance (meters) from the Pyramids.
The scatter graph shows the possibility of a negative correlation between the two
variables and the Spearman's rank correlation technique should be used to see if
there is indeed a correlation, and to test the strength of the relationship.
A correlation can easily be drawn as a scatter graph, but the most precise way to
compare several pairs of data is to use a statistical test - this establishes whether
the correlation is really significant or if it could have been the result of chance
alone.
Spearman’s Rank correlation coefficient is a technique which can be used to
summarise the strength and direction (negative or positive) of a relationship
between two variables.
The Spearman's Rank Correlation Coefficient is used to discover the strength of
a link between two sets of data. The result will always be between +1 and -1.
Manual Method - calculating the coefficient
Create a table from your data.
Rank the two data sets. Ranking is achieved by giving the ranking '1' to the
biggest number in a column, '2' to the second biggest value and so on. The
smallest value in the column will get the lowest ranking. This should be done for
both sets of measurements.
Tied scores are given the mean (average) rank. For example, the three tied scores
of 1 Egyptian Pound in the example below are ranked fifth in order of price, but
occupy three positions (fifth, sixth and seventh) in a ranking hierarchy of ten. The
mean rank in this case is calculated as (5+6+7) ÷ 3 = 6.
Find the difference in the ranks (d): This is the difference between the ranks of
the two values on each row of the table. The rank of the second value (price) is
subtracted from the rank of the first (distance from the Pyramids).
Square the differences (d²) to remove negative values, and then sum them (Σd²).
Calculate the coefficient (R) using the formula below.
The answer will always be between 1.0 (a perfect positive correlation) and -1.0 (a
perfect negative correlation).
When written in mathematical notation, the Spearman rank formula looks like this:
R = 1 − (6 Σd²) / (n³ − n)
Data Table: Spearman's Rank Correlation
Convenience Store   Distance (meter)   Rank distance   Price (Pound)   Rank price   Difference between ranks (d)   d²
1 50 10 1.80 2 8 64
2 175 9 1.20 3.5 5.5 30.25
3 270 8 2.00 1 7 49
4 375 7 1.00 6 1 1
5 425 6 1.00 6 0 0
6 580 5 1.20 3.5 1.5 2.25
7 710 4 0.80 9 -5 25
8 790 3 0.60 10 -7 49
9 890 2 1.00 6 -4 16
10 980 1 0.85 8 -7 49
Σd² = 285.5
Now to put all these values into the formula:
a. Find the sum of all the d² values by adding up all the values in the Difference²
column. In our example this is 285.5. Multiplying this by 6 gives 1713.
b. Now for the bottom line of the equation. The value n is the number of sites at
which you took measurements, which in our example is 10. Substituting into
n³ - n we get 1000 - 10 = 990.
c. We now have the formula R = 1 - (1713/990), which gives a value for R of
1 - 1.73 = -0.73.
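The manual calculation can be scripted as follows (a sketch of the simplified formula; `rankdata` ranks smallest-first while the text ranks biggest-first, but since both columns are ranked consistently the d² values are identical; note this simplified formula ignores the tie correction, so tie-corrected software output differs slightly):

```python
import numpy as np
from scipy.stats import rankdata

def spearman_simple(x, y):
    """Simplified Spearman formula: R = 1 - 6*sum(d^2)/(n^3 - n)."""
    rx, ry = rankdata(x), rankdata(y)   # ties get mean ranks
    d2 = ((rx - ry) ** 2).sum()         # sum of squared rank differences
    n = len(x)
    return 1 - 6 * d2 / (n ** 3 - n)

distance = [50, 175, 270, 375, 425, 580, 710, 790, 890, 980]
price = [1.80, 1.20, 2.00, 1.00, 1.00, 1.20, 0.80, 0.60, 1.00, 0.85]
print(round(spearman_simple(distance, price), 2))   # → -0.73
```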
What does this R value of -0.73 mean?
The closer R is to +1 or -1, the stronger the likely correlation. A perfect
positive correlation is +1 and a perfect negative correlation is -1. The R value
of -0.73 suggests a fairly strong negative relationship.
A further technique is now required to test the significance of the relationship.
The R value of -0.73 must be looked up on the Spearman Rank significance
table below as follows:
1. Work out the 'degrees of freedom' you need to use. This is the number of pairs
in your sample minus 2 (n-2). In the example it is 8 (10 - 2).
2. Now plot your result on the table.
3. If it is below the line marked 5%, then it is possible your result was the
product of chance and you must reject the hypothesis.
4. If it is above the 0.1% significance level, then we can be 99.9% confident the
correlation has not occurred by chance.
5. If it is above 1%, but below 0.1%, you can say you are 99% confident.
6. If it is above 5%, but below 1%, you can say you are 95% confident (i.e.
statistically there is a 5% likelihood the result occurred by chance).
In the example, the value 0.73 gives a significance level of slightly less than
5%. That means that the probability of the relationship you have found being a
chance event is about 5 in a 100. You are 95% certain that your hypothesis is
correct.
The reliability of your sample can be stated in terms of how many researchers
completing the same study as yours would obtain the same results: 95 out of
100.
The fact that two variables correlate cannot prove anything by itself - only further
research can actually show that one thing affects the other.
Data reliability is related to the size of the sample. The more data you collect,
the more reliable your result.
Manual solution of the first example
Birth Weight   Rank Weight   Increase in weight   Rank Incr. in weight   (d) Difference between ranks   d²
112            10            63                   3                      7                              49
111            9             66                   4                      5                              25
107            8             72                   5.5                    2.5                            6.25
119            12            52                   2                      10                             100
92             4             75                   7                      -3                             9
80             1             118                  11                     -10                            100
81             2             120                  12                     -10                            100
84             3             114                  10                     -7                             49
118            11            42                   1                      10                             100
106            7             72                   5.5                    1.5                            2.25
103            6             90                   8                      -2                             4
94             5             91                   9                      -4                             16

Σd² = 560.5; R = 1 − (6 × 560.5)/1716 = 1 − 3363/1716 ≈ −0.96
Spearman r = ~ -0.96
Look up on the Spearman Rank significance table above and you will find r significantly
different from zero at P < 0.0001 (r is above the 0.1% significance level, so we can
be 99.9% confident the correlation has not occurred by chance).
By using GraphPad you will get
Spearman r = -0.9632 (corrected for ties)
95% confidence interval: -0.9903 to -0.8658
The two-tailed P value is < 0.0001, considered extremely significant,
indicating a very strong negative association.
NB: If you performed Pearson's correlation coefficient you would get these results:
Correlation coefficient (r) = -0.9463
95% confidence interval: -0.9852 to -0.8149
Coefficient of determination (r squared) = 0.8955
Test: Is r significantly different from zero? The two-tailed P value is < 0.0001,
considered extremely significant. This result is very close to the value of r
(-0.9632) obtained by the Spearman method. This closeness is typical when there
are few or no extreme observations.
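Both coefficients can be compared directly in SciPy (a sketch, using the birth-weight data from the earlier example):

```python
from scipy.stats import pearsonr, spearmanr

birth_weight = [112, 111, 107, 119, 92, 80, 81, 84, 118, 106, 103, 94]
pct_increase = [63, 66, 72, 52, 75, 118, 120, 114, 42, 72, 90, 91]

r, p_r = pearsonr(birth_weight, pct_increase)       # parametric
rho, p_rho = spearmanr(birth_weight, pct_increase)  # rank-based
print(round(r, 4), round(rho, 4))                   # → -0.9463 -0.9632
```

Both p-values are below 0.0001, and the two coefficients agree closely because the data contain no extreme observations.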
Chi-Square Test (X²)
Studies often collect data on categorical variables (yes and no, cured and non-cured,
normal and abnormal, living and died, no environmental factors and environmental
factors, agree and disagree, positive and negative, male and female, I and II, present
and absent etc...) that can be summarized as a series of counts. These counts are
commonly arranged in a tabular format known as a contingency table.
1. Chi-Square test for two categories (2 x 2 contingency table)
For example, a study designed to determine whether or not there is an association
between cigarette smoking and asthma might collect data that could be assembled
into a 2 x 2 table. In this case, the two columns could be defined by whether the
subject smoked or not, while the rows could represent whether or not the subject
experienced symptoms of asthma. The cells of the table would contain the
number of observations or patients as defined by these two variables.
The chi-square test statistic can be used to evaluate whether there is an
association between the rows and columns in a contingency table. More
specifically, this statistic can be used to determine whether there is any difference
between the study groups in the proportions of the risk factor of interest.
Returning to our example, the chi-square statistic could be used to test whether
the proportion of individuals who smoke differs by asthmatic status.
The chi-square test statistic is designed to test the null hypothesis that there is no
association between the rows and columns of a contingency table. This statistic is
calculated by first obtaining for each cell in the table, the expected number of
events that will occur if the null hypothesis is true. When the observed number of
events deviates significantly from the expected counts, then it is unlikely that the
null hypothesis is true, and it is likely that there is a row-column association.
Conversely, a small chi-square value indicates that the observed values are
similar to the expected values leading us to conclude that the null hypothesis is
plausible.
The general formula used to calculate the X² test statistic for a 2 × 2 table is:
X² = (C × B − A × D)² / (E × F × G × H) × I

Results        Category 1   Category 2   Row total
Group 1        A            B            G
Group 2        C            D            H
Column total   E            F            I (grand total)

or, equivalently,
X² = Σ (Oi − Ei)² / Ei
where Oi = an observed frequency and Ei = an expected (theoretical) frequency
asserted by the null hypothesis (expected frequency = row total × column
total / grand total).
The degrees of freedom in a test of independence are equal to (number of rows)-1
× (number of columns)-1. Thus for a 2×2 table, there are (2-1)×(2-1) = 1 degree
of freedom; for a 4×3 table, there are (4-1)×(3-1) = 6 degrees of freedom.
For significance, the calculated X2 should be higher than the tabulated value.
The value of the chi-square statistic cannot be negative and can assume values
from zero to infinity.
The p-value for this test statistic is based on the chi-square probability
distribution and is generally extracted from published tables or estimated using
computer software programs. The p-value represents the probability that the chi-
square test statistic is as extreme as or more extreme than observed if the null
hypothesis were true. As with the t and F distributions, there is a different chi-
square distribution for each possible value of degrees of freedom.
Chi-square distributions with a small number of degrees of freedom are highly
skewed; however, this skewness is attenuated as the number of degrees of
freedom increases.
The use of the chi-square test can be illustrated by using hypothetical data from a
study investigating the association between smoking and asthma among adults
observed in a community health clinic. The results obtained from classifying 150
individuals are shown in the Table.
Hypothetical data showing chi-square test
                        Ever smoked cigarettes
Symptoms of asthma       Yes      No      Total
Yes                       20      30       50
No                        22      78      100
Total                     42     108      150
As the Table shows, among asthmatics the proportion of smokers was 40% (20/50),
while the corresponding proportion among asymptomatic individuals was 22%
(22/100). By applying the formula previously presented,
X2 = (C x B - A x D)2/(E x F x G x H) x I
X2 = (22 x 30 - 20 x 78)2/(42 x 108 x 50 x 100) x 150 = 0.0357 x 150 = 5.357.
The associated p-value for this chi-square statistic, which has one degree of
freedom, is 0.02. Therefore, if there were truly no association between smoking
and asthma, there would be a 2 out of 100 probability of observing a difference in
proportions at least as large as 18 percentage points (40% - 22%) by chance alone.
We would therefore conclude that the observed difference in the proportions is
unlikely to be explained by chance alone, and consider this result statistically
significant.
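This arithmetic can be checked with a short Python sketch using only the standard library. For one degree of freedom the chi-square upper-tail probability equals erfc(sqrt(X2/2)), because a chi-square variable with 1 df is a squared standard normal:

```python
import math

# Reproducing the smoking/asthma calculation from the table above.
a, b, c, d = 20, 30, 22, 78                    # observed cell counts
g, h = a + b, c + d                            # row totals: 50, 100
e, f = a + c, b + d                            # column totals: 42, 108
i = g + h                                      # grand total: 150

chi_sq = (c * b - a * d) ** 2 / (e * f * g * h) * i
p_value = math.erfc(math.sqrt(chi_sq / 2))     # valid for 1 df only

print(round(chi_sq, 3))   # 5.357
print(round(p_value, 4))  # 0.0206
```

The p-value agrees with the GraphPad output quoted below (0.0206), which the text rounds to 0.02.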
Computer analysis
The computer may be used to advantage in calculating the X2 test for
independence.
Go to GraphPad and select Analyze a contingency table and two rows, two
columns, then next.
Enter your data then next.
You will get a box: select chi-square without Yates' continuity correction,
two-sided P value, and calculate relative risk, then next.
Now you will get the results: the two-sided P value is 0.0206, considered
significant, and X2 = 5.357.
Because the construction of the chi-square test makes use of discrete data to
estimate a continuous distribution, some authors apply a continuity
correction when calculating this statistic. Specifically,
X2 = Σ (|Oi - Ei| - 0.5)2/Ei,
where |Oi - Ei| is the absolute value of the difference between Oi and Ei, and
the term 0.5 in the numerator is often referred to as Yates' correction
factor. This correction
factor serves to reduce the chi-square value, and, therefore, increases the
resulting p-value. It has been suggested that this correction yields an overly
conservative test that may fail to reject a false null hypothesis. However, as
long as the sample size is large, the effect of the correction factor is negligible.
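For illustration, here is a Python sketch of the corrected statistic applied to the smoking/asthma table above (the variable names are my own):

```python
# Yates' continuity correction: subtract 0.5 from each |O - E|
# before squaring, which shrinks the chi-square value.
observed = [20, 30, 22, 78]                    # smoking/asthma cells
row_totals, col_totals, grand = [50, 100], [42, 108], 150
expected = [r * c / grand for r in row_totals for c in col_totals]

chi_sq_yates = sum((abs(o - e) - 0.5) ** 2 / e
                   for o, e in zip(observed, expected))
print(round(chi_sq_yates, 3))  # 4.501, down from the uncorrected 5.357
```

The reduction from 5.357 to about 4.50 shows concretely how the correction raises the resulting p-value.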
When there are small counts in the table, the use of the chi-square
test statistic may not be appropriate. Specifically, it has been recommended
that this test not be used if any cell in the table has an expected count of less
than one, or if more than 20% of the cells have an expected count of less than
five. Under these circumstances, Fisher's exact test is recommended for
conducting tests of hypothesis.
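Fisher's exact test itself can be sketched in a few lines of Python using only math.comb. This version follows a common two-sided convention: sum, over all tables with the same margins, the hypergeometric probabilities that do not exceed that of the observed table (the function name is my own):

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    r1, r2, c1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):                     # hypergeometric P(top-left cell = x)
        return comb(r1, x) * comb(r2, c1 - x) / comb(n, c1)

    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs + 1e-12)

print(round(fisher_exact_2x2(8, 2, 1, 5), 4))  # 0.035
```

The 8/2/1/5 counts are an arbitrary small-count example where the chi-square approximation would be questionable and the exact test is preferred.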
2. Chi-Square test for more than two categories (large contingency tables)
It is used when samples are classified by two factors. For example, one factor
may represent disease severity (mild, moderate or severe) and the other
factor may represent blood group (A, B, O, AB). We are interested in
whether the two factors are associated. Are individuals of a particular blood
group likely to be more severely ill?
Example: a cross-sectional survey was carried out among the elderly
population living in Riyadh, with the objective of measuring the frequency of
cardiovascular disease. A total of 259 individuals, ranging between 65 and 95
years of age, were interviewed. Individuals were grouped into four age groups
(65-69, 70-74, 75-79, and 80+ years) at the time of interview. We used the
chi-square test to determine whether the prevalence of chest pain differed among
the four age groups. The observed and expected frequencies (%) are shown in
the following Table.
Manual solution: for each cell, compute (O - E)2/E using the general formula
presented earlier, and sum over all cells.
We refer X2 to the Table at 3 degrees of freedom ((2 rows - 1) x (4 columns - 1)
= 3): P > 0.1 (i.e. not significant).
There is insufficient evidence to reject the null hypothesis of no association
between chest pain and age in the population of elderly people.
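A generic r × c version can be sketched in Python, computing each expected count as row total × column total / grand total as described earlier (the function name is my own):

```python
def chi_square_rxc(table):
    """Chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    chi_sq = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            chi_sq += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi_sq, df

# Sanity check against the earlier smoking/asthma 2x2 table:
stat, df = chi_square_rxc([[20, 30], [22, 78]])
print(round(stat, 3), df)  # 5.357 1
```

The same function handles a 2×4 table such as the chest-pain-by-age-group data, returning 3 degrees of freedom.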
Computer analysis: The computer may be used to advantage in calculating the X2
test and the X2 test for trend, which takes into account the ordering of the
groups. Go to GraphPad and select Analyze a contingency table and larger
contingency table, then next.
Enter your data then next.
Now you will get the results.
The P value is 0.1839, considered not significant, and X2 = 4.839.
Chi-square for trend = 3.787 and the P value is 0.0517.
If you want to calculate X2 for trend manually, assign ordered scores to the
four age groups (1, 2, 3 and 4) and apply the trend formula.
Thus, there is not a significant linear trend between the ordered categories
defining the columns and the proportion of subjects in the top row.
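The original Riyadh table is not reproduced above, so the following Python sketch uses made-up counts purely to show the mechanics of the chi-square test for trend (Cochran-Armitage); the function name and example data are my own:

```python
def chi_square_trend(cases, totals, scores):
    """Chi-square for trend (1 df) for a 2 x k table.

    cases  - counts in the top row for each ordered column
    totals - column totals
    scores - ordered numeric codes for the columns (e.g. 1, 2, 3, 4)
    """
    n = sum(totals)                                   # grand total
    r = sum(cases)                                    # top-row total
    sum_rx = sum(c * x for c, x in zip(cases, scores))
    sum_nx = sum(t * x for t, x in zip(totals, scores))
    sum_nx2 = sum(t * x * x for t, x in zip(totals, scores))
    num = n * (n * sum_rx - r * sum_nx) ** 2
    den = r * (n - r) * (n * sum_nx2 - sum_nx ** 2)
    return num / den

# Hypothetical counts with a perfectly linear rise in proportion:
print(round(chi_square_trend([10, 20, 30], [50, 50, 50], [1, 2, 3]), 3))
```

Because the trend statistic has a single degree of freedom, it can detect an ordered pattern that the overall test, with its larger degrees of freedom, may miss, which is why the trend P value above (0.0517) is smaller than the overall P value (0.1839).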