Test of significance: Nonparametric Methods (PHL-541)
General Purpose:
The idea of significance testing. If we have a basic knowledge of the underlying
distribution of a variable, then we can make predictions about how, in repeated
samples of equal size, this particular statistic will "behave," that is, how it is
distributed. For example, if we draw 100 random samples of 100 adults each from
the general population, and compute the mean height in each sample, then the
distribution of the standardized means across samples will likely approximate the
normal distribution (Student's t distribution with 99 degrees of freedom). Now
imagine that we take an additional sample in a particular city ("X") where we
suspect that people are taller than the average population. If the mean height in
that sample falls in the upper 5% tail area of the t distribution (i.e., beyond
the 95th percentile), then we conclude that, indeed, the people of city X are
taller than the average population.
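The repeated-sampling idea can be simulated directly; the population mean and SD below (170 cm, 10 cm) are hypothetical values chosen for illustration, not taken from the text:

```python
import random
import statistics

random.seed(0)  # reproducible

# Draw 100 random samples of 100 "heights" each from a hypothetical
# normal population (mean 170 cm, SD 10 cm) and record each sample mean.
means = [statistics.mean(random.gauss(170, 10) for _ in range(100))
         for _ in range(100)]

# The sample means cluster around 170, with a spread close to
# 10 / sqrt(100) = 1, as the sampling-distribution argument predicts.
print(round(statistics.mean(means), 1))
print(round(statistics.stdev(means), 2))
```

Standardizing such means gives values that follow the t distribution described above.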
Are most variables normally distributed? In the above example we relied on
our knowledge that, in repeated samples of equal size, the standardized means (for
height) will be distributed following the t distribution (with a particular mean and
variance). However, this will only be true if in the population the variable of
interest (height in our example) is normally distributed, that is, if the distribution
of people of particular heights follows the normal distribution (the bell-shape
distribution). For many variables of interest, we simply do not know for sure that
this is the case. For example, is income distributed normally in the population? --
probably not. The incidence rates of rare diseases are not normally distributed in
the population, the number of car accidents is also not normally distributed, and
neither are very many other variables in which a researcher might be interested.
Sample size. Another factor that often limits the applicability of tests based on the
assumption that the sampling distribution is normal is the size of the sample of
data available for the analysis (sample size; n). We can assume that the sampling
distribution is normal even if we are not sure that the distribution of the variable in
the population is normal, as long as our sample is large enough (e.g., 100 or more
observations). However, if our sample is very small, then those tests can be used
only if we are sure that the variable is normally distributed, and there is no way to
test this assumption if the sample is small.
Measurement Scales: Variables differ in how well they can be measured, i.e., in
how much measurable information their measurement scale can provide. There is
obviously some measurement error involved in every measurement, which
determines the amount of information that we can obtain. Another factor that
determines the amount of information that can be provided by a variable is its type
of measurement scale. Specifically, variables are classified as (a) interval, (b)
ratio, (c) nominal, or (d) ordinal, as discussed below.
a) Interval variables allow us not only to rank order the items that are
measured, but also to quantify and compare the sizes of differences between
them. For example, temperature, as measured in degrees Fahrenheit or Celsius,
constitutes an interval scale. We can say that a temperature of 40 degrees is
higher than a temperature of 30 degrees, and that an increase from 20 to 40
degrees is twice as much as an increase from 30 to 40 degrees.
b) Ratio variables are very similar to interval variables; in addition to all the
properties of interval variables, they feature an identifiable absolute zero point,
thus, they allow for statements such as x is two times more than y. Typical
examples of ratio scales are measures of time or space. For example, as the
Kelvin temperature scale is a ratio scale, not only can we say that a
temperature of 200 degrees is higher than one of 100 degrees, we can correctly
state that it is twice as high. Interval scales do not have the ratio property.
Most statistical data analysis procedures do not distinguish between the
interval and ratio properties of the measurement scales.
c) Nominal variables allow for only qualitative classification. That is, they
can be measured only in terms of whether the individual items belong to some
distinctively different categories, but we cannot quantify or even rank order
those categories. For example, all we can say is that two individuals are
different in terms of variable A (e.g., they are of different race), but we cannot
say which one "has more" of the quality represented by the variable. Typical
examples of nominal variables are gender, race, color, city, etc.
d) Ordinal variables allow us to rank order the items we measure in terms of
which has less and which has more of the quality represented by the variable,
but still they do not allow us to say "how much more." A typical example of
an ordinal variable is the socioeconomic status of families. For example, we
know that upper-middle is higher than middle but we cannot say that it is, for
example, 18% higher. Also, this very distinction between nominal, ordinal,
and interval scales itself represents a good example of an ordinal variable. For
example, we can say that nominal measurement provides less information than
ordinal measurement, but we cannot say "how much less" or how this
difference compares to the difference between ordinal and interval scales.
Problems in measurement. Applications of tests that are based on the normality
assumptions are further limited by a lack of precise measurement. For example, let
us consider a study where grade point average is measured as the major variable
of interest. Is an A average twice as good as a C average? Is the difference
between a B and an A average comparable to the difference between a D and a C
average? Somehow, the grade point average is a crude measure of scholastic
activities that only allows us to establish a rank ordering of students from
"excellent" students to "poor" students. Most common statistical techniques such
as analysis of variance (and t- tests), regression, etc., assume that the underlying
3
Test of significance: Nonparametric Methods PHL-541…………………………………………………………………………………………..
measurements are at least of interval, meaning that equally spaced intervals on the
scale can be compared in a meaningful manner (e.g, B minus A is equal to D
minus C). However, as in our example, this assumption is very often not
acceptable, and the data rather represent a rank ordering of observations (ordinal)
rather than precise measurements.
Test of significance (Nonparametric methods) :
The test of significance is a test applied to two or more means to determine
whether the difference between them is a true difference or merely due to chance.
Parametric methods are statistical techniques that enable us to determine
whether there is a significant difference between two sample means, under
assumptions of normality, homogeneity of variances, and linearity. When these
assumptions cannot be met, the need is
evident for statistical procedures that enable us to process data of "low quality,"
from small samples, on variables about which nothing is known (concerning their
distribution). Specifically, nonparametric methods were developed to be used in
cases when the researcher knows nothing about the parameters of the variable of
interest in the population (hence the name nonparametric). In more technical
terms, nonparametric methods do not rely on the estimation of parameters (such as
the mean or the standard deviation) describing the distribution of the variable of
interest in the population. Therefore, these methods are also sometimes (and more
appropriately) called parameter-free methods or distribution-free methods.
Rationale for nonparametric methods
1. Nonparametric methods were developed for conditions in which assumptions
necessary for using parametric methods cannot be made.
2. Sometimes called distribution-free methods because it is not necessary to
assume that the observations are normally distributed.
3. Appropriate for dealing with data that are measured on a nominal or ordinal
scale and whose distribution is unknown.
NB. Nonparametric methods should be viewed as complementary statistical methods
rather than attractive alternatives. An inherent characteristic is that they deal with
ranks rather than values of observations.
What is meant by “nonparametric methods”?
1. Approaches for estimation and hypothesis testing when no underlying data
distribution is assumed.
2. Approaches where at least one of the following is satisfied:
Method can be used on ordinal-scaled data.
Method can be used on interval or ratio data, where distribution of the
random variable of interest is unspecified.
Method can be used on nominal-scaled data.
When do we use nonparametric methods?
1. When normality cannot be assumed.
2. When data cannot be transformed to normality.
3. When methods based on other non-normal distributions are not available or
appropriate.
4. When there is not sufficient sample size to assess the form of the distribution.
What are advantages of nonparametric methods?
1. Can be used with data on any measurement scale.
2. Easier to compute by hand for small samples (they were originally developed
before computers were in wide use).
3. Need not involve population parameters.
4. Results may be as exact as parametric procedures.
5. Make fewer assumptions; no underlying data distribution is assumed.
6. Can handle small samples without assuming something about the original data
distribution. (Recall normal distribution comes from averages of large
samples).
What are disadvantages of nonparametric methods?
1. May waste information if data permit using parametric procedures. Example:
converting data from ratio to ordinal scale.
2. Difficult to compute by hand for large samples.
3. They do not take advantage of all the special characteristics of a distribution.
Consequently, these methods do not fully utilize the information known about
the distribution.
4. Tables of critical values are not widely available.
What are the differences between parametric and nonparametric methods?
Parametric Test Procedures vs. Nonparametric Test Procedures:

1. Parametric: involve population parameters such as the mean and SD of the
   population distribution.
   Nonparametric: do not involve population parameters and do not assume the
   data are normally or t-distributed.

2. Parametric: the underlying measurements are at least on an interval scale,
   meaning that equally spaced intervals on the scale can be compared in a
   meaningful manner. For example, temperature, as measured in degrees F or C,
   constitutes an interval scale: a temperature of 40 degrees is higher than one
   of 30 degrees, and an increase from 20 to 40 degrees is twice as much as an
   increase from 30 to 40 degrees.
   Nonparametric: the dependent variable may be measured on any scale: interval
   or ratio data where the distribution of the random variable of interest is
   unspecified; ordinal (ranked) data such as good-better-best or
   upper-middle-lower; or nominal data such as gender, race, color, city.

3. Parametric: often require large sample sizes to appeal to normality.
   Nonparametric: sample sizes can be small.

4. Parametric: based on normality assumptions, i.e., have stringent (strict)
   assumptions of a normal (bell-shaped) distribution.
   Nonparametric: have few assumptions about the population distribution.

5. Parametric examples: t test, Z test, F test.
   Nonparametric example: Wilcoxon Rank Sum Test.
Parametric and Nonparametric Methods Equivalents:
Basically, there is at least one nonparametric equivalent for each parametric general
type of test. In general, these tests fall into the following categories:
1. Tests of differences between groups (independent samples).
Usually, when we have two samples that we want to compare concerning their
mean value for some variable of interest, we would use the t-test for
independent samples; nonparametric alternatives for this test are the Mann-
Whitney U test, and the Kolmogorov-Smirnov two-sample test.
If we have multiple groups, we would use analysis of variance (ANOVA); the
nonparametric equivalents to this method are the Kruskal-Wallis analysis of
ranks and the Median test.
2. Tests of differences between variables (dependent samples).
If we want to compare two variables measured in the same sample we would
customarily use the t-test for dependent samples. Nonparametric alternatives
to this test are the Wilcoxon's matched pairs test and Sign test. If the variables
of interest are dichotomous in nature (i.e., "pass" vs. "no pass") then
McNemar's Chi-square test is appropriate.
If there are more than two variables that were measured in the same sample,
then we would customarily use repeated measures ANOVA. Nonparametric
alternatives to this method are Friedman's two-way analysis of variance and
Cochran Q test (if the variable was measured in terms of categories, e.g.,
"passed" vs. "failed"). Cochran Q is particularly useful for measuring changes
in frequencies (proportions) across time.
3. Tests of relationships between variables.
To express a relationship between two variables one usually computes the
correlation coefficient. Nonparametric equivalents to the standard correlation
coefficient are Spearman R, Kendall Tau, and coefficient Gamma.
If the two variables of interest are categorical in nature (e.g., "passed" vs.
"failed" by "male" vs. "female") appropriate nonparametric statistics for
testing the relationship between the two variables are the Chi-square test, the
Phi coefficient, and the Fisher exact test. In addition, a simultaneous test for
relationships between multiple cases is available: Kendall coefficient of
concordance.
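As a rough mapping, and assuming SciPy is installed, many of the equivalents above correspond to functions in scipy.stats (the sample data here are made up purely for illustration):

```python
from scipy import stats  # assumes SciPy is available

a = [1.1, 2.3, 1.9, 3.0, 2.2]   # made-up group 1
b = [2.0, 2.8, 3.1, 2.7, 3.5]   # made-up group 2
c = [1.5, 2.1, 2.9, 2.4, 3.3]   # made-up group 3

u, p_u = stats.mannwhitneyu(a, b)        # vs. t-test for independent samples
h, p_h = stats.kruskal(a, b, c)          # vs. one-way ANOVA
w, p_w = stats.wilcoxon(a, b)            # vs. t-test for dependent samples
f, p_f = stats.friedmanchisquare(a, b, c)  # vs. repeated measures ANOVA
rho, p_r = stats.spearmanr(a, b)         # vs. Pearson correlation
tau, p_t = stats.kendalltau(a, b)        # vs. Pearson correlation
```

Each call returns the test statistic and a p-value; which test applies depends on the design (independent groups, paired observations, or correlation), as outlined above.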
When to Use Which Method?
It is not easy to give simple advice concerning the use of nonparametric procedures.
Each nonparametric procedure has its peculiar sensitivities and blind spots. For
example,
The Kolmogorov-Smirnov two-sample test is not only sensitive to differences in
the location of distributions (for example, differences in means) but is also greatly
affected by differences in their shapes.
The Wilcoxon matched pairs test assumes that one can rank order the degree of
differences in matched observations in a useful manner. If this is not the case, one
should rather use the Sign test.
In general, if the result of a study is important (e.g., does a very expensive and
painful drug therapy help people get better?), then it is always useful to run
different nonparametric tests; should discrepancies in the results occur depending
on which test is used, one should try to understand why some tests give different
results.
On the other hand, nonparametric statistics are less statistically powerful
(sensitive) than their parametric counterparts, and if it is important to detect even
small effects (e.g., is this food additive harmful to people?) one should be very
careful in the choice of a test statistic.
Large data sets and nonparametric methods.
Nonparametric methods are most appropriate when the sample sizes are small.
When the data set is large (e.g., n > 100) it often makes little sense to use
nonparametric statistics at all.
When the samples become very large, then the sample means will follow the
normal distribution even if the respective variable is not normally distributed in
the population, or is not measured very well. Thus, parametric methods, which are
usually much more sensitive (i.e., have more statistical power) are in most cases
appropriate for large samples.
However, the tests of significance of many of the nonparametric statistics
described here are based on asymptotic (large sample) theory; therefore, important
tests can often not be performed if the sample sizes become too small.
Some commonly used statistical tests (Table): each entry gives the example, the
parametric test, the nonparametric equivalent, and the purpose of the test.

1. To compare girls' heights with boys' heights.
   Parametric: Two-sample (unpaired) t test. Nonparametric: Mann-Whitney U test.
   Purpose: Compares two independent samples drawn from the same population.

2. To compare the weight of an infant before and after a feed.
   Parametric: One-sample (paired) t test. Nonparametric: Wilcoxon matched pairs test.
   Purpose: Compares two sets of observations on a single sample.

3. To determine whether plasma glucose is higher one, two, or three hours after a meal.
   Parametric: One-way analysis of variance (F test) using total sum of squares.
   Nonparametric: Kruskal-Wallis analysis of variance by ranks.
   Purpose: Effectively, a generalization of the paired t or Wilcoxon matched pairs
   test where three or more sets of observations are made on a single sample.

4. In the above example, to determine whether the results differ in male and female subjects.
   Parametric: Two-way analysis of variance. Nonparametric: Two-way analysis of
   variance by ranks.
   Purpose: As above, but tests the influence (and interaction) of two different covariates.

5. To determine whether acceptance into medical school is more likely if the
   applicant was born in the same country.
   Parametric: χ2 test. Nonparametric: Fisher's exact test.
   Purpose: Tests the null hypothesis that the distribution of a discontinuous
   variable is the same in two (or more) independent samples.

6. To assess whether and to what extent plasma HbA1 concentration is related to
   plasma triglyceride concentration in diabetic patients.
   Parametric: Product moment correlation coefficient (Pearson's r).
   Nonparametric: Spearman's rank correlation coefficient (rs).
   Purpose: Assesses the strength of the straight-line association between two
   continuous variables.

7. To see how peak expiratory flow rate varies with height.
   Parametric: Regression by least squares method. Nonparametric: Non-parametric
   regression (various tests).
   Purpose: Describes the numerical relation between two quantitative variables,
   allowing one value to be predicted from the other.

8. To determine whether and to what extent a person's age, body fat, and sodium
   intake determine their blood pressure.
   Parametric: Multiple regression by least squares method. Nonparametric:
   Non-parametric regression (various tests).
   Purpose: Describes the numerical relation between a dependent variable and
   several predictor variables (covariates).
Mann–Whitney U or Wilcoxon rank sum test (MWW)
Frank Wilcoxon (1892 - 1965) was an outstanding chemist, whose interest in statistics first started while studying fungicides, when he and colleagues studied Fisher’s newly published Statistical Methods for Research Workers. In 1945 he published his paper setting out the rank-sum and signed-rank tests which are still named after him. His background was colorful. A keen cyclist and motor cyclist, he and his twin sister were born in an Irish castle, to wealthy American parents. He grew up in the States, ran away to sea, worked as an oil worker and tree surgeon, and attended a military academy before finally entering college, aged 26, to read chemistry.
Mann-Whitney U test (also called the Mann-Whitney-Wilcoxon (MWW),
Wilcoxon rank-sum test, or Wilcoxon-Mann-Whitney test) is a non-parametric
test for assessing whether two independent samples of observations come from the
same distribution.
It is one of the best-known non-parametric significance tests.
It was proposed initially by Frank Wilcoxon in 1945, for equal sample sizes, and
extended to arbitrary sample sizes, and in other ways, by Mann and Whitney
(1947).
The Mann-Whitney-Wilcoxon test is practically identical to performing an ordinary
parametric two-sample t test on the data after ranking over the combined samples.
It is an excellent alternative to the t test if your data are significantly skewed.
NB: The Mann-Whitney-Wilcoxon test is equivalent to the unpaired t test, and the
Wilcoxon matched pairs test is equivalent to the paired t test.
The Mann-Whitney-Wilcoxon test tests for differences in medians and for the
chance of obtaining greater observations in one population versus the other.
The null hypothesis (H0) in the MWW test is that both populations have the same
probability of exceeding each other. i.e. no difference in the two population
distributions.
The alternative hypothesis (Ha) is that the variable in one population is
stochastically greater.
The test involves the calculation of a statistic, usually called U (derived from
the rank sums), whose distribution under the null hypothesis is known. For small
samples the distribution is tabulated, but for sample sizes above about 20 there
is a good approximation using the normal distribution.
The U test is included in most modern statistical packages. It is also easily
calculated by hand, especially for small samples.
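A minimal sketch of that normal approximation (the standard mean and variance of U under H0, with no tie or continuity correction):

```python
import math

def mww_z(u, m, n):
    """Normal approximation for the U statistic (no tie/continuity correction).
    Under H0, U has mean m*n/2 and variance m*n*(m+n+1)/12."""
    mean_u = m * n / 2
    sd_u = math.sqrt(m * n * (m + n + 1) / 12)
    return (u - mean_u) / sd_u

# e.g. the z-score for U = 17 with two samples of 8 observations each
print(round(mww_z(17, 8, 8), 2))   # -1.58
```

The resulting z is compared with the standard normal distribution; the approximation is usually considered adequate once both samples exceed about 20 observations.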
Assumptions for Mann-Whitney U Test:
The two samples under investigation are independent of each other, and the
observations within each sample are independent.
The observations are ordinal or continuous measurements (i.e., for any two
observations, one can at least say, whether they are equal or, if not, which one is
greater).
Data types that can be analysed with Mann-Whitney U-test:
Data points should be independent from each other.
Data do not have to be normal and variances do not have to be equal.
All individuals must be selected at random from the population.
All individuals must have equal chance of being selected.
Sample sizes should be as equal as possible but some differences are allowed.
Calculations: There are two ways of doing this.
Procedure # 1
Stage 1: Call one sample A and the other B.
Stage 2: Place all the values together in rank order (i.e. from lowest to
highest). If values from the two samples are equal, the 'A' sample is placed
first in the rank.
Stage 3: Inspect each 'B' sample in turn and count the number of 'A's which
precede (come before) it. Add up the total to get a U value.
Stage 4: Repeat stage 3, but this time inspect each A in turn and count the
number of B's which precede it. Add up the total to get a second U value.
Stage 5: Take the smaller of the two U values and look up the probability
value in the table below. This gives the percentage probability that the
difference between the two sets of data could have occurred by chance.
Example: The results of the cytogenetic analysis of abnormal cells after exposure to
the drug (Y) are shown below together with the concurrent control (X) data. Use the
Mann-Whitney U test to determine whether there is any evidence of a difference in
the median DISPERSION between the treated group and the control group.
Group (X) = 7; 3; 6; 2; 4; 3; 5; 5
Group (Y) = 3; 5; 6; 4; 6; 5; 7; 5
Solution: This example deals with two sets of sample data from two contrasting
groups, group X and group Y, with the aim of comparing them and demonstrating
differences. There are eight pairs of data in this example. Tests of significance are
used to tell us whether the differences between the two sets of sample data are truly
significant or whether these differences could have occurred by chance. Moreover,
tests of significance tell us the probability level that differences between the two
groups, X and Y are due to chance.
Stage 1:
Sample A = 7; 3; 6; 2; 4; 3; 5; 5
Sample B = 3; 5; 6; 4; 6; 5; 7; 5
Stage 2:
Group:  A   A   A   B   A   B   A   A   B   B   B   A   B   B   A   B
Value:  2   3   3   3   4   4   5   5   5   5   5   6   6   6   7   7
Rank:   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
Stage 3: For each 'B', count the number of 'A's that precede it; the total is
U = 3+4+6+6+6+7+7+8 = 47.
Stage 4: For each 'A', count the number of 'B's that precede it; the total is
U = 0+0+0+1+2+2+5+7 = 17.
Stage 5: U = 17 (the smaller of the two values).
The value from the table at U = 17 and n = 8 is 6.5; that is, the probability
that the difference between group Y and group X arose just by chance is 6.5 per
cent (see the next Table).
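The five stages can be sketched in a few lines of code, using the example data from this section; the tie rule follows Stage 2 (equal values from sample A are placed first):

```python
def u_by_counting(a, b):
    """Mann-Whitney U by the counting procedure (Stages 1-5).
    Ties are resolved by placing sample A first (Stage 2)."""
    # Stage 3: for each B value, count the A values that precede it
    # (A wins ties, so a <= y counts)
    u_b = sum(sum(1 for x in a if x <= y) for y in b)
    # Stage 4: for each A value, count the B values that precede it
    # (strictly smaller only, since A comes first on ties)
    u_a = sum(sum(1 for y in b if y < x) for x in a)
    assert u_a + u_b == len(a) * len(b)   # the two counts always sum to m*n
    return min(u_a, u_b)                  # Stage 5: take the smaller U

x = [7, 3, 6, 2, 4, 3, 5, 5]   # control group X (sample A)
y = [3, 5, 6, 4, 6, 5, 7, 5]   # treated group Y (sample B)
print(u_by_counting(x, y))     # 17
```

The two intermediate counts are 47 and 17, matching Stages 3 and 4 above.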
If you find that there is a significant probability that the differences could have
occurred by chance, this can mean:
1. Either the difference is not significant and there is little point in
looking further for explanations of it, OR
2. Your sample is too small. If you had taken a larger sample, you might
well find that the result of the test of significance changes: the difference
between the two groups becomes more certain.
Percentage probability associated with values of U (columns: sample size
n = 1 to 8):

 U      1     2     3     4     5     6     7     8
 0   11.1   2.2   0.6   0.2   0.1   0.0   0.0   0.0
 1   22.2   4.4   1.2   0.4   0.2   0.1   0.0   0.0
 2   33.3   8.9   2.4   0.8   0.3   0.1   0.1   0.0
 3   44.4  13.3   4.2   1.4   0.5   0.2   0.1   0.1
 4   55.6  20.0   6.7   2.4   0.9   0.4   0.2   0.1
 5         26.7   9.7   3.6   1.5   0.6   0.3   0.1
 6         35.6  13.9   5.5   2.3   1.0   0.5   0.2
 7         44.4  18.8   7.7   3.3   1.5   0.7   0.3
 8         55.6  24.8  10.7   4.7   2.1   1.0   0.5
 9               31.5  14.1   6.4   3.0   1.4   0.7
10               38.7  18.4   8.5   4.1   2.0   1.0
11               46.1  23.0  11.1   5.4   2.7   1.4
12               53.9  28.5  14.2   7.1   3.6   1.9
13                     34.1  17.7   9.1   4.7   2.5
14                     40.4  21.7  11.4   6.0   3.2
15                     46.7  26.2  14.1   7.6   4.1
16                     53.3  31.1  17.2   9.5   5.2
17                           36.2  20.7  11.6   6.5
18                           41.6  24.5  14.0   8.0
19                           47.2  28.6  16.8   9.7
Procedure # 2
Choose the sample for which the ranks seem to be smaller (The only reason to
do this is to make computation easier). Call this "sample 1," and call the other
sample "sample 2."
Taking each observation in sample 1, count the number of observations in
sample 2 that are smaller than it (count a half for any that are equal to it).
Calculate the rank sums R1 and R2, then use the following formulas:
U1 = m x n + m (m + 1)/2 – R1
U2 = m x n + n (n + 1)/2 – R2
U1 + U2 should equal m x n.
For significance, the smaller calculated U value should be less than the
tabulated critical value.
NB: If you have ties:
1. Rank them anyway, pretending they were slightly different.
2. Find the average of the ranks for the identical values, and give them all that
rank.
3. Carry on as if all the whole-number ranks have been used up.
These average ranks can now be used for the Mann-Whitney U test.
Example of handling ties:

Data:         14   2   5   4   2  14  18  14
Sorted data:   2   2   4   5  14  14  14  18
Rank:          1   2   3   4   5   6   7   8   (ranking them anyway, pretending
they were slightly different)

Find the average of the ranks for the identical values and give them all that
rank: the two 2's occupy ranks 1 and 2, so each receives the average rank 1.5;
the three 14's occupy ranks 5, 6 and 7, so each receives the average rank 6.

Rank (ties averaged):  1.5  1.5   3   4   6   6   6   8
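The three tie-handling steps can be sketched as a small function, applied here to the same eight data points:

```python
def rank_with_ties(data):
    """Rank data (1-based); identical values share the average of their ranks."""
    s = sorted(data)
    # each distinct value gets the average of the positions it occupies
    avg = {v: sum(i + 1 for i, x in enumerate(s) if x == v) / s.count(v)
           for v in set(s)}
    return [avg[v] for v in data]

print(rank_with_ties([2, 2, 4, 5, 14, 14, 14, 18]))
# [1.5, 1.5, 3.0, 4.0, 6.0, 6.0, 6.0, 8.0]
```

These average ranks are exactly what the rank-sum formulas of Procedure 2 expect.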
Solution of our example
Group (X) = 7; 3; 6; 2; 4; 3; 5; 5
Group (Y) = 3; 5; 6; 4; 6; 5; 7; 5
Try to rank them and find the average of the ranks
Sample 1 (Group Y):   3     5     6     4     6     5     7     5
Ranks (orders):       3     9    13    5.5   13     9   15.5    9    (m = 8)

Sample 2 (Group X):   7     3     6     2     4     3     5     5
Ranks (orders):     15.5    3    13     1    5.5    3     9     9    (n = 8)

(The ranks run from 1 for the lowest value, 2, up to 15.5 for the two highest
values, the 7's; tied values receive the average of their ranks.)

R1 (sum of the ranks of sample 1) = 77
R2 (sum of the ranks of sample 2) = 59
U1 = 8 x 8 + 8 (8 + 1)/2 – 77 = 23
U2 = 8 x 8 + 8 (8 + 1)/2 – 59 = 41    (check: U1 + U2 = 64 = m x n)
Look at the next Table at n = 8 and m = 8: the calculated value (23) is greater
than the tabulated value (16), so the difference is not significant and we fail
to reject H0.
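A minimal sketch of Procedure 2 (average ranks for ties, then the U1/U2 formulas); applied to the example data it reproduces R1 = 77, R2 = 59, U1 = 23, U2 = 41:

```python
def rank_sum_u(sample1, sample2):
    """Mann-Whitney U from rank sums; tied values receive average ranks."""
    pooled = sorted(sample1 + sample2)
    # average rank of each distinct value in the pooled, sorted data
    avg = {v: sum(i + 1 for i, x in enumerate(pooled) if x == v) / pooled.count(v)
           for v in set(pooled)}
    m, n = len(sample1), len(sample2)
    r1 = sum(avg[v] for v in sample1)
    r2 = sum(avg[v] for v in sample2)
    u1 = m * n + m * (m + 1) / 2 - r1
    u2 = m * n + n * (n + 1) / 2 - r2
    return r1, r2, u1, u2

y = [3, 5, 6, 4, 6, 5, 7, 5]   # sample 1 (treated group Y)
x = [7, 3, 6, 2, 4, 3, 5, 5]   # sample 2 (control group X)
print(rank_sum_u(y, x))        # (77.0, 59.0, 23.0, 41.0)
```

Note that U1 + U2 = 64 = m x n, as the check in Procedure 2 requires.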
Example: The results of the cytogenetic analysis of abnormal cells in Males (♂) and
Females (♀) are shown below. Test to see if there is a significant difference between
these two gender groups.
Group (♀) = 9; 4; 6; 8; 6
Group (♂) = 19; 16; 9; 19; 8
Example: In KKUH, the % of mothers bearing low-birth-weight babies over 7
monthly observations were 3, 0, 4, 0, 1, 2 and 3, while the % of mothers bearing
normal-birth-weight babies were 4, 3, 6, 11, 7, 8, and 9. Is there a significant
difference between these two groups?
Comparison to Student's t-test
The Mann-Whitney U test is useful in the same situations as the independent samples
Student's t-test, and the question arises of
which should be preferred?
Ordinal data: Mann-Whitney U remains the logical choice when the data are
ordinal but not interval scaled, so that the spacing between adjacent values
cannot be assumed to be constant.
Robustness: It is much less likely than the Student's t-test to give a spuriously
significant result because of one or two outliers - Mann-Whitney U is more
robust.
Efficiency: When normality holds, the Wilcoxon-Mann-Whitney test has an
efficiency of about 0.95 (95%) compared to the Student's t-test. For
distributions sufficiently far from normal and for sufficiently large sample
sizes, the Wilcoxon-Mann-Whitney test can be considerably more efficient
than the Student's t-test.
Overall, this robustness makes the Wilcoxon-Mann-Whitney test more widely
applicable than the Student's t-test, and for large samples from the normal
distribution, the efficiency loss compared to the Student's t-test is only
5%, so one can recommend the Wilcoxon-Mann-Whitney test as the default test
for comparing interval or ordinal measurements with similar distributions.
The relation between efficiency and power in concrete situations isn't trivial
though. For small sample sizes one should investigate the power of the
Wilcoxon-Mann-Whitney test versus Student's t-test.
Wilcoxon Matched Pairs Signed Ranks Test (Wilcoxon Test)
The Wilcoxon signed-rank test is a non-parametric statistical hypothesis test for the
case of two related samples or repeated measurements on a single sample. It can be
used as an alternative to the paired Student's t-test when the population cannot be
assumed to be normally distributed. The test is named for Frank Wilcoxon who, in a
single paper, proposed both it and the rank-sum test for two independent samples.
Assumptions
This test assumes that the data are at an ordinal-metric level, i.e., that the
original data can be validly ordered, that the data after the intervention can
be ordered, and that the differences between the two sets of data can be validly
ordered.
This assumption is slightly less critical than the interval level assumption
necessary for the t-test. The assumption of there being a normal distribution does
not have to be met, this is particularly practical if the maximum change is
somehow limited.
A positive aspect of the Wilcoxon test is that it is a very powerful test: if
all the assumptions for the t-test are met, the Wilcoxon test has about 95% of
the power of the t-test.
Data types that can be analyzed with Wilcoxon Matched pairs
the distributions do not have to be normal and variances do not have to be equal
all individuals must be selected at random from the population
all individuals must have equal chance of being selected
sample sizes should be equal
Setup
Suppose we collect 2n observations, two observations on each of n subjects.
Let i denote the particular subject being referred to, let the first observation
measured on subject i be denoted by Xi, and let the second observation be Yi;
for each i, Xi and Yi are paired together.
Let Zi = Yi – Xi for i = 1, ... , n. The differences Zi are assumed to be
independent; each Zi comes from the same continuous population and is symmetric
about a common median θ. Xi and Yi are ordinal variables, so that comparisons
such as greater than, less than, and equal to have meaning.
Test procedure
The null hypothesis tested is H0: median θ = 0.
The Wilcoxon signed rank statistic W+ is computed by ordering the absolute
values |Z1|, ..., |Zn|; the rank of each ordered |Zi| is denoted Ri. Let
φi = I(Zi > 0), where I(.) is the indicator function (φi = 1 if Zi > 0 and 0
otherwise). The Wilcoxon signed rank statistic W+ is then defined as
W+ = φ1R1 + φ2R2 + ... + φnRn = Σ φiRi  (sum over i = 1 to n).
It is often used to test the difference between scores of data collected before and
after an experimental manipulation, in which case the central point under the null
hypothesis would be expected to be zero.
Scores exactly equal to the central point are excluded and the absolute values of
the deviations from the central point of the remaining scores are ranked such that
the smallest deviation has a rank of 1.
Tied scores are assigned a mean rank.
The sums for the ranks of scores with positive and negative deviations from the
central point are then calculated separately.
A value S is defined as the smaller of these two rank sums. S is then compared to
a table of all possible distributions of ranks to calculate p, the statistical
probability of attaining S from a population of scores that is symmetrically
distributed around the central point.
As the number of scores used, n, increases, the distribution of all possible ranks S
tends towards the normal distribution. So although for n ≤ 20, exact probabilities
would usually be calculated, for n > 20, the normal approximation is used.
The Wilcoxon test was popularized by Siegel (1956) in his influential textbook
on non-parametric statistics. Siegel used the symbol T for the value defined here
as S. In consequence, the test is sometimes referred to as the Wilcoxon T test, and
the test statistic is reported as a value of T.
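The ranking steps above (exclude zero differences, rank the absolute deviations with mean ranks for ties, sum the positive and negative rank sums, take the smaller) can be sketched in Python. This helper and its name are illustrative, not from the original text; `scipy.stats.rankdata` applies the same mean-rank rule for ties described above.

```python
import numpy as np
from scipy.stats import rankdata

def wilcoxon_S(before, after):
    """Smaller of the positive and negative signed-rank sums (Siegel's T)."""
    z = np.asarray(after, float) - np.asarray(before, float)
    z = z[z != 0]                  # scores equal to the central point are excluded
    ranks = rankdata(np.abs(z))    # smallest deviation gets rank 1; ties get mean ranks
    w_plus = ranks[z > 0].sum()    # rank sum of positive deviations
    w_minus = ranks[z < 0].sum()   # rank sum of negative deviations
    return min(w_plus, w_minus)
```

For the 10-child data in the example that follows, this returns S = 18, matching the hand calculation.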
Example:
Scores of abnormal cells were taken at the start of treatment with a
chemotherapeutic drug and again 5 weeks later to determine the genotoxic effects of
the drug. Results for 10 children are presented in the next Table. Is there a
significant difference between these two observations?
Subject (i)   Xi (1st obser.)   Yi (2nd obser.)   Sign of Xi–Yi   Xi–Yi   Absolute Xi–Yi   Rank of Absolute   Signed Rank
1 125 110 + 15 15 7 7
2 115 122 - -7 7 3 -3
3 130 125 + 5 5 1.5 1.5
4 140 120 + 20 20 9 9
5 140 140 --- 0 0 --- ---
6 115 124 - -9 9 4 -4
7 140 123 + 17 17 8 8
8 125 137 - -12 12 6 -6
9 140 135 + 5 5 1.5 1.5
10 135 145 - -10 10 5 -5
1. The sign of Xi – Yi is denoted in the Sign column by either (+) or (-). If Xi
and Yi are equal, then the value is thrown out.
2. The values of Xi – Yi are given in the next two columns.
3. The last two columns are the ranks.
4. The absolute rank column has no signs, and the signed rank column gives the
ranks along with their signs.
5. The absolute differences are ranked from the smallest value to the largest value.
6. In the case of a tie, the ranks involved are added together and divided by the
number of tied values. For example, in these data there were two instances of the
value 5. The ranks corresponding to 5 are 1 and 2; their sum is 3. Dividing by
the number of ties gives a mean rank of 1.5, and this value is assigned to both
instances of 5.
7. The test statistic W+ is given by the sum of all of the positive values in the
Signed Rank column, and W- by the sum of the absolute values of the negative
entries in the Signed Rank column. For this example, W+ = 27 and W- = 18.
The minimum of these is 18.
8. Lastly, this test statistic is compared against the next table of critical values.
9. If the obtained test statistic is less than or equal to the critical value based on
the number of observations n, then the null hypothesis is rejected in favor of
the alternative hypothesis; otherwise the null is accepted. That is, the obtained
value is significant if it is ≤ the value in the table.
10. In this case the test statistic is W = 18 and the critical value is 8 for a two-
tailed p-value of 0.05. The test statistic must be less than or equal to this value
to be significant at this level, so the null hypothesis cannot be rejected (not significant).
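The same conclusion can be checked with SciPy (a sketch; `scipy.stats.wilcoxon` also discards zero differences by default, and because of the tied absolute differences it may use a normal approximation, so its p-value is approximate here):

```python
from scipy.stats import wilcoxon

x = [125, 115, 130, 140, 140, 115, 140, 125, 140, 135]  # 1st observation
y = [110, 122, 125, 120, 140, 124, 123, 137, 135, 145]  # 2nd observation

res = wilcoxon(x, y)   # paired test; the zero difference (subject 5) is dropped
print(res.pvalue)      # well above 0.05, so H0 is not rejected
```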
Table of critical values for the Wilcoxon test:
Example: The number of cigarettes smoked per day for a group of men was 8, 13, 24,
15, 7, 11, 20, 22, 6, 15, 20; after they decided to stop smoking, the numbers
were 5, 15, 11, 19, 0, 12, 15, 0, 0, 6, 20. Does the median of the differences
between the pre- and post-decision values differ significantly from zero?
Sum of all signed ranks (W) = 41.000, Sum of positive ranks (T+) = 48.000, Sum of
negative ranks (T-) = -7.000. The two-tailed P value is 0.0371, considered significant.
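This result can be reproduced with SciPy (a sketch; the pair with no change, 20 vs 20, is dropped automatically, leaving T+ = 48 and T- = 7 as reported above):

```python
from scipy.stats import wilcoxon

pre  = [8, 13, 24, 15, 7, 11, 20, 22, 6, 15, 20]
post = [5, 15, 11, 19, 0, 12, 15, 0, 0, 6, 20]

res = wilcoxon(pre, post)    # paired, two-sided by default
print(round(res.pvalue, 4))  # close to the reported two-tailed P of 0.0371
```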
Kruskal-Wallis one-way analysis of variance (KW)
In statistics, the Kruskal-Wallis one-way analysis of variance by ranks (named
after William Kruskal and W. Allen Wallis) is a non-parametric method for
testing equality of population medians among groups. It is identical to a one-way
analysis of variance with the data replaced by their ranks.
It is an extension of the Mann-Whitney U test (MWW) to 3 or more groups.
Since it is a non-parametric method, the Kruskal-Wallis test does not assume a
normal population, unlike the analogous one-way analysis of variance. However,
the test does assume an identically-shaped and scaled distribution for each group,
except for any difference in medians.
The test involves the calculation of a statistic, usually called KW (or H),
computed from the rank sums of the groups.
In Kruskal-Wallis test all data from all groups are ranked together; i.e., rank the
data from 1 to n ignoring group membership. Assign any tied values the average
of the ranks they would have received had they not been tied.
The null hypothesis (H0) will be rejected if the computed value of KW is so large
that the probability of obtaining a value that large or larger when H0 is true is
equal to or less than chosen significance level.
Post-hoc tests were proposed by Schaich and Hamerle (1984) as well as Conover
(1971, 1980) in order to decide which groups are significantly different from each
other, based upon the mean rank differences of the groups. These procedures are
detailed in Bortz, Lienert and Boehnke (2000, pp. 275); however, not all statistical
packages support Post-hoc analysis for Kruskal-Wallis test. But user contributed
code exists that provides these facilities (for example in SPSS).
Example: The effects of two drugs on reaction time to a certain stimulus were studied
in three samples of experimental animals. Sample 3 served as a control while the
animals in sample 1 were treated with drug A and those in sample 2 were treated with
drug B prior to the application of the stimulus. The data below shows the reaction
times in seconds of the 13 animals. Group 1: 17, 20, 40, 31, 35. Group 2: 8, 7, 9,
8. Group 3: 2, 5, 4, 3. Can we conclude that the three populations represented by
the three samples differ with respect to reaction time? We can so conclude if we can
reject the H0 that the three populations do not differ in their reaction times.
Solution
Firstly, if you used GraphPad and you selected One-way Analysis of Variance
(ANOVA) followed by Tukey-Kramer Multiple Comparisons Test you will get this:
Assumption test: Are the standard deviations of the groups equal? ANOVA
assumes that the data are sampled from populations with identical SDs. This
assumption is tested using the method of Bartlett. Bartlett statistic (corrected)
= 15.937. The P value is 0.0003. Bartlett's test suggests that the differences
among the SDs are extremely significant. Since ANOVA assumes populations
with equal SDs, you should consider transforming your data (reciprocal or log)
or selecting a nonparametric test.
Thus, at the moment select a nonparametric test i.e. Kruskal-Wallis Test
(Nonparametric ANOVA) followed by Dunn's Multiple Comparisons Test
because the data are independent.
Now you will get this result. The P value is 0.0047, considered very
significant and variation among column medians is significantly greater than
expected by chance.
Calculation details
Group     Number of Points   Sum of Ranks   Mean of Ranks
Drug A    5                  55.000         11.000
Drug B    4                  26.000         6.500
Control   4                  10.000         2.500
Kruskal-Wallis Statistic KW = 10.711 (corrected for ties)
Dunn's Multiple Comparisons Test
Comparison              Mean Rank Difference   P value
Drug A versus Drug B    4.500                  ns P>0.05
Drug A versus control   8.500                  ** P<0.01
Drug B versus control   4.000                  ns P>0.05
Data are presented as mean ± SD. **P < 0.01 versus control (Kruskal-Wallis test followed by Dunn’s multiple comparisons test).
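The KW statistic and p-value can be checked with SciPy (a sketch; `scipy.stats.kruskal` applies the same correction for ties, although Dunn's post test is not part of `scipy.stats`):

```python
from scipy.stats import kruskal

drug_a  = [17, 20, 40, 31, 35]
drug_b  = [8, 7, 9, 8]
control = [2, 5, 4, 3]

stat, p = kruskal(drug_a, drug_b, control)   # tie-corrected by default
print(round(stat, 3), round(p, 4))           # → 10.711 0.0047
```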
Friedman Test
The Friedman test is a non-parametric statistical test developed by the economist
Milton Friedman.
It is similar to the parametric one-way repeated measures ANOVA; however, the
data on the dependent variable are measured on an ordinal scale. In its use of
ranks it is similar to the Kruskal-Wallis one-way analysis of variance by
ranks.
The test assumes the study involves one independent variable, and that the same
participants are repeatedly observed under three or more conditions. Also, the
present test bears some resemblance to the Wilcoxon matched pairs signed rank
test. However, the Wilcoxon involves one independent variable with two and only
two levels and the participants in the study are in both conditions.
The null hypothesis of the test is that the treatments have identical effects.
The alternative hypothesis is that at least one of the treatments tends to yield
larger values than at least one of the other treatments.
The Friedman test is widely supported by many statistical packages such as SAS,
SPSS, R, MATLAB, WINKS, GraphPad, BioEstat and more.
Post-hoc tests were proposed by Schaich and Hamerle (1984) as well as Conover
(1971, 1980) in order to decide which groups are significantly different from each
other, based upon the mean rank differences of the groups. These procedures are
detailed in Bortz, Lienert and Boehnke (2000, pp. 275); however, not all statistical
packages support Post-hoc analysis for Friedman's test. But user contributed code
exists that provides these facilities (for example in SPSS).
Example
Suppose for example we want to find out if students have a preference for one
type of soda over others. They are blindfolded and given a taste test. They are
asked to take a sip of Brand X, Brand Y and Brand Z sodas and to rank order their
favourite for the three sodas where a 1 is the highest rank, a 2 the next highest
and a 3 the least preferred soda.
The data representing the rankings given by each participant to the three sodas
are:
Participants’ Rankings of the Three Brands of Soda
Participant   Brand X   Brand Y   Brand Z
Ahmed         2         1         3
Mohamed       1         3         2
Ibrahim       1         2         3
Anfal         1         3         2
Ali           1         3         2
Noha          1         2         3
Amal          1         3         2
Osman         1         2         3
Anas          1         3         2
Hany          2         1         3
Solution
Go to GraphPad, select compare means and raw data, click to enter data in 3
separate columns and mark them as Brand X, Brand Y, Brand Z. Look at the
normality test but do not worry about it here; click to select all columns, perform
repeated measures ANOVA, nonparametric methods, and mark "only perform post
test if P < 0.05". The Friedman test with post-test will now be active; click on it
and the results will appear as follows:
Friedman Test (Nonparametric Repeated Measures ANOVA)
The P value is 0.0063, considered very significant.
Variation among column medians is significantly greater than expected by chance.
Calculation detail:
Group     Sum of Ranks
Brand X   12.000
Brand Y   23.000
Brand Z   25.000
Number of Rows = 10
Number of Columns = 3
Friedman Statistic Fr = 9.800
Dunn's Multiple Comparisons Test
If the difference between rank sum means is greater than 10.710 then the P value is
less than 0.05.
Comparison               Rank Sum Difference   P value
Brand X versus Brand Y   -11.000               * P<0.05
Brand X versus Brand Z   -13.000               * P<0.05
Brand Y versus Brand Z   -2.000                ns P>0.05
Data are presented as mean ± SD. *P < 0.05 versus Brand X (Friedman test followed
by Dunn’s multiple comparisons test).
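The Friedman statistic can be checked with SciPy (a sketch; `scipy.stats.friedmanchisquare` uses the chi-square approximation, so its p-value is slightly larger than GraphPad's reported 0.0063):

```python
from scipy.stats import friedmanchisquare

# one list per brand, in participant order (Ahmed ... Hany)
brand_x = [2, 1, 1, 1, 1, 1, 1, 1, 1, 2]
brand_y = [1, 3, 2, 3, 3, 2, 3, 2, 3, 1]
brand_z = [3, 2, 3, 2, 2, 3, 2, 3, 2, 3]

stat, p = friedmanchisquare(brand_x, brand_y, brand_z)
print(round(stat, 1), round(p, 4))   # → 9.8 0.0074
```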
Data transformations
One advantage of using parametric statistics is that it makes it much easier to
describe your data.
If you have established that it follows a normal distribution you can be sure that a
particular set of measurements can be properly described by its mean and
standard deviation.
If your data are not normally distributed you cannot use any of the tests that
assume normality (e.g. t-test, ANOVA, regression analysis).
If your data are not normally distributed it is often possible to normalize them
by transforming them.
The need for transformations. We should always check the assumptions test (as in
GraphPad): Are the standard deviations equal? (i.e. that data follow a normal
distribution with uniform variance). If the data meet the assumptions i.e. the data
come from populations with equal SDs, we can analyze the raw data as described but
if they are not met, we have two possible strategies:
1. We can use a method which does not require these assumptions, such as a rank-
based method (nonparametric methods) or
2. We can transform the data mathematically to make them fit the assumptions more
closely before analysis.
Different transformations work for different data types:
Logarithms: Growth rates (or dynamics) are often
exponential and log transforms will often normalise them. Log transforms are
particularly appropriate if the variance increases with the mean.
Reciprocal: If a log transform does not normalise your data
you could try a reciprocal (1/x) transformation. This is often used for enzyme
reaction rate data, survival times.
Square root: This transform is often of value when the data
are counts, e.g. blood cells on a haemocytometer or woodlice in a garden.
Carrying out a square root transform will convert data with a Poisson distribution
to a normal distribution.
Arcsine: This transformation is also known as the angular
transformation and is especially useful for percentages and proportions.
Logit (logistic) transformation: This is the transformation we
apply most often to each proportion, p, in a set of proportions. We cannot take
the logit transformation if either p = 0 or p = 1, because the corresponding logit
values are -∞ and +∞. One solution is to take p as 1/(2n) instead of 0, and as
{1 - 1/(2n)} instead of 1. The logit linearizes a sigmoid curve.
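These transformations can be sketched in Python (an illustration, not from the original text; the function name is hypothetical, and `n` in the logit is the sample size behind each proportion, used for the 1/(2n) rule above):

```python
import numpy as np

def transform(x, kind, n=None):
    """Common normalizing transforms; choose `kind` by data type."""
    x = np.asarray(x, dtype=float)
    if kind == "log":          # growth rates; variance increasing with the mean
        return np.log(x)
    if kind == "reciprocal":   # enzyme reaction rates, survival times
        return 1.0 / x
    if kind == "sqrt":         # Poisson-distributed counts
        return np.sqrt(x)
    if kind == "arcsine":      # percentages/proportions (angular transform)
        return np.arcsin(np.sqrt(x))
    if kind == "logit":        # proportions; 1/(2n) rule keeps values finite
        x = np.where(x == 0, 1 / (2 * n), x)
        x = np.where(x == 1, 1 - 1 / (2 * n), x)
        return np.log(x / (1 - x))
    raise ValueError(kind)
```

For example, `transform([0.5], "logit", n=10)` gives 0, and p = 0 or p = 1 are mapped to finite logit values.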
Can all data be transformed?
Not all data can be transformed successfully.
1. Sometimes we have very long tails at both ends of the distribution, which
makes transformation ineffective.
2. Sometimes we have a bimodal distribution, which makes transformation
also ineffective.
3. Sometimes we have a large number of identical observations, which
will all transform to the same value whatever transformation we use.
These are often at one extreme of the distribution, usually at zero.
4. Sometimes transformation leads to variation in the p-value.
So, what can we do if we cannot transform the data to a suitable form? It is usually
safer to use methods that do not require such assumptions. These include the
non-parametric methods.
Correlation Coefficients and linear regression
Correlation is a concept that carries the common colloquial implication of
association, such as ‘‘height and weight are correlated’’.
The statistical procedure will give the word a technical meaning; we can actually
calculate a number that tells the strength of the association. The results of
correlation are expressed as a P-value (for the hypothesis test) and an r-value
(correlation coefficient) or r2 value (coefficient of determination).
Suppose that the data set consists of n pairs of observations expressing a possible
relationship between two continuous variables. We characterize the strength of
such a relationship by calculating the coefficient of correlation called Pearson’s
correlation coefficient. However, like other common parametric statistics, such
as the mean and the standard deviation, the correlation coefficient is very
sensitive to extreme observations. We may be interested in calculating a measure
of association that is more robust with respect to outlying values.
There are not one but two nonparametric procedures: Spearman’s and Kendall’s
tau rank correlations.
Generally: Values near +1 indicate a strong positive association. Values near -1
indicate a strong negative association. Values around 0 indicate a weak
association.
If they tend to go from lower left to upper right, we have a positive association; if
they tend to go from upper left to lower right, we have a negative association.
NB. The goal of linear regression is to find the equation (Y = intercept + slope × X)
of the line that best fits the points; this is then used as a visual summary of the
relationship between the variables, or for estimating unknown values of one
variable when given the value of the other.
Example: The below Table gives the values for the birth weight (x) and the
increase in weight between days 70 and 100 of life, expressed as a percentage of
the birth weight (y) for 12 infants. If we let each pair of numbers (x; y) be
represented by a dot in a diagram with the x’s on the horizontal axis, we have
Figure (see below). The dots do not fall perfectly on a straight line, but rather,
scatter around a line, very typical for statistical relationships. Because of this
scattering of dots, the diagram is called a scatter diagram. The positions of the
dots provide some information about the direction as well as the strength of the
association under investigation.
Birth weight, x (Oz)   Increase in weight (Oz)   Increase as % of birth weight, y
112                    70.56                     63
111                    73.26                     66
107                    77.04                     72
119                    61.88                     52
92                     69.00                     75
80                     94.40                     118
81                     97.20                     120
84                     95.76                     114
118                    49.56                     42
106                    76.32                     72
103                    92.70                     90
94                     85.54                     91
By using GraphPad, select Regression and correlation, X and Y (or two or more Y
replicates), then click next; put Birth weight in the X column and % of the birth
weight in Y1, then click next and select Nonparametric Spearman Rank
Correlation. Click next and you will get Spearman r = -0.9632, 95% confidence
interval: -0.9903 to -0.8658. The two-tailed P value is < 0.0001, considered
extremely significant. Click next and you will get this scatter figure.
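The GraphPad output can be reproduced with SciPy (a sketch; `scipy.stats.spearmanr` applies the same correction for ties):

```python
from scipy.stats import spearmanr

birth_weight = [112, 111, 107, 119, 92, 80, 81, 84, 118, 106, 103, 94]
pct_increase = [63, 66, 72, 52, 75, 118, 120, 114, 42, 72, 90, 91]

rho, p = spearmanr(birth_weight, pct_increase)
print(round(rho, 4))   # → -0.9632, with two-tailed P < 0.0001
```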
Try to do it manually (see below).
Spearman's Rank Correlation Coefficient
Spearman’s rank correlation is a direct nonparametric counterpart of
parametric Pearson’s correlation coefficient.
The Spearman's Rank Correlation Coefficient is used to discover the strength
and direction (negative or positive) of a relationship between two variables.
Example: This example looks at the strength of the link between the price of
a convenience item (a 500 ml bottle of water) and distance from the Pyramids in
Giza, Egypt. The hypothesis tested is that prices should decrease with distance
from the key area of gentrification surrounding the Pyramids. The lines followed
are the transects in the map below, with continuous sampling of the price of a
500 ml bottle of water at every convenience store (see the Table below).
Map to show the location of environmental gradients for transect lines in Giza
Hypothesis
We might expect to find that the price of a bottle of water decreases as distance
from the Pyramids increases. Higher property rents close to the Pyramids should
be reflected in higher prices in the shops.
The hypothesis might be written like this: The price of a convenience item
decreases as distance from the Pyramids increases.
The more objective scientific research method is always to assume that no such
price-distance relationship exists and to express the null hypothesis as:
there is no significant relationship between the price of a convenience item
and distance from the Pyramids.
What can go wrong? Having decided upon the wording of the hypothesis, you
should consider whether there are any other factors that may influence the study.
Some factors that may influence prices may include:
The type of retail outlet. You must be consistent in your choice of retail outlet.
For example, restaurants often charge significantly more for water than a
convenience store. You should decide which type of outlet to use and stick with it
for all your data collection.
Some shops have different prices for the same item: a high tourist and lower local
price, dependent upon the shopkeeper's perception of the customer.
Shops near main roads may charge more than shops in less accessible back
streets, due to the higher rents demanded for main road retail sites.
The positive spread effects from other nearby areas of gentrification or from
competing areas of tourist attraction. The negative spread effects from nearby
areas of urban decay.
Higher prices may be charged during the summer when demand is less flexible,
making seasonal comparisons less reliable.
Cumulative sampling may distort the expected price-distance gradient if several
shops cluster within a short area along the transect line followed by a
considerable gap before the next group of retail outlets.
You should mention such factors in your investigation.
The data collected (see the data table below) suggest a fairly strong negative
relationship, as shown in this scatter graph produced using GraphPad:
Scatter graph to show the change in the price of a convenience item (in Egyptian
Pound) with distance (meters) from the Pyramids.
The scatter graph shows the possibility of a negative correlation between the two
variables and the Spearman's rank correlation technique should be used to see if
there is indeed a correlation, and to test the strength of the relationship.
A correlation can easily be drawn as a scatter graph, but the most precise way to
compare several pairs of data is to use a statistical test - this establishes whether
the correlation is really significant or if it could have been the result of chance
alone.
Spearman’s Rank correlation coefficient is a technique which can be used to
summarise the strength and direction (negative or positive) of a relationship
between two variables.
The Spearman's Rank Correlation Coefficient is used to discover the strength of
a link between two sets of data. The result will always be between +1 and -1.
Manual Method - calculating the coefficient
Create a table from your data.
Rank the two data sets. Ranking is achieved by giving the ranking '1' to the
biggest number in a column, '2' to the second biggest value and so on. The
smallest value in the column will get the lowest ranking. This should be done for
both sets of measurements.
Tied scores are given the mean (average) rank. For example, the three tied scores
of 1 Egyptian Pound in the example below are ranked fifth in order of price, but
occupy three positions (fifth, sixth and seventh) in a ranking hierarchy of ten. The
mean rank in this case is calculated as (5+6+7) ÷ 3 = 6.
Find the difference in the ranks (d): This is the difference between the ranks of
the two values on each row of the table. The rank of the second value (price) is
subtracted from the rank of the first (distance from the Pyramids).
Square the differences (d²) to remove negative values, and then sum them (Σd²).
Calculate the coefficient (R) using the formula below.
The answer will always be between 1.0 (a perfect positive correlation) and -1.0 (a
perfect negative correlation).
When written in mathematical notation, the Spearman rank formula looks like this:
R = 1 − (6 Σd²) / (n³ − n)
Data Table: Spearman's Rank Correlation
Convenience Store   Distance (meter)   Rank distance   Price (Pound)   Rank price   Difference between ranks (d)   d²
1 50 10 1.80 2 8 64
2 175 9 1.20 3.5 5.5 30.25
3 270 8 2.00 1 7 49
4 375 7 1.00 6 1 1
5 425 6 1.00 6 0 0
6 580 5 1.20 3.5 1.5 2.25
7 710 4 0.80 9 -5 25
8 790 3 0.60 10 -7 49
9 890 2 1.00 6 -4 16
10 980 1 0.85 8 -7 49
Σd² = 285.5
Now to put all these values into the formula:
a. Find the sum of all the d² values by adding up all the values in the Difference²
column. In our example this is 285.5. Multiplying this by 6 gives 1713.
b. Now for the bottom line of the equation. The value n is the number of sites at
which you took measurements, which in our example is 10. Substituting into
n³ - n we get 1000 - 10 = 990.
c. We now have the formula R = 1 - (1713/990), which gives a value for R of
1 - 1.73 = -0.73.
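The manual calculation can be scripted as follows (a sketch of the simplified formula; `rankdata` ranks smallest-first while the text ranks biggest-first, but since both columns are ranked consistently the d² values are identical; note this simplified formula ignores the tie correction, so tie-corrected software output differs slightly):

```python
import numpy as np
from scipy.stats import rankdata

def spearman_simple(x, y):
    """Simplified Spearman formula: R = 1 - 6*sum(d^2)/(n^3 - n)."""
    rx, ry = rankdata(x), rankdata(y)   # ties get mean ranks
    d2 = ((rx - ry) ** 2).sum()         # sum of squared rank differences
    n = len(x)
    return 1 - 6 * d2 / (n ** 3 - n)

distance = [50, 175, 270, 375, 425, 580, 710, 790, 890, 980]
price = [1.80, 1.20, 2.00, 1.00, 1.00, 1.20, 0.80, 0.60, 1.00, 0.85]
print(round(spearman_simple(distance, price), 2))   # → -0.73
```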
What does this R value of -0.73 mean?
The closer R is to +1 or -1, the stronger the likely correlation. A perfect
positive correlation is +1 and a perfect negative correlation is -1. The R value
of -0.73 suggests a fairly strong negative relationship.
A further technique is now required to test the significance of the relationship.
The R value of -0.73 must be looked up on the Spearman Rank significance
table below as follows:
1. Work out the 'degrees of freedom' you need to use. This is the number of pairs
in your sample minus 2 (n-2). In the example it is 8 (10 - 2).
2. Now plot your result on the table.
3. If it is below the line marked 5%, then it is possible your result was the
product of chance and you must reject the hypothesis.
4. If it is above the 0.1% significance level, then we can be 99.9% confident the
correlation has not occurred by chance.
5. If it is above 1%, but below 0.1%, you can say you are 99% confident.
6. If it is above 5%, but below 1%, you can say you are 95% confident (i.e.
statistically there is a 5% likelihood the result occurred by chance).
In the example, the value 0.73 gives a significance level of slightly less than
5%. That means that the probability of the relationship you have found being a
chance event is about 5 in a 100. You are 95% certain that your hypothesis is
correct.
The reliability of your sample can be stated in terms of how many researchers
completing the same study as yours would obtain the same results: 95 out of
100.
The fact that two variables correlate cannot prove anything by itself - only further
research can actually show that one thing affects the other.
Data reliability is related to the size of the sample. The more data you collect,
the more reliable your result.
Manual solution of the first example
Birth Weight   Rank Weight   Increase in weight   Rank Incr. in weight   (d) Difference between ranks   d²
112            10            63                   3                      7                              49
111            9             66                   4                      5                              25
107            8             72                   5.5                    2.5                            6.25
119            12            52                   2                      10                             100
92             4             75                   7                      -3                             9
80             1             118                  11                     -10                            100
81             2             120                  12                     -10                            100
84             3             114                  10                     -7                             49
118            11            42                   1                      10                             100
106            7             72                   5.5                    1.5                            2.25
103            6             90                   8                      -2                             4
94             5             91                   9                      -4                             16

Σd² = 560.5; R = 1 − (6 × 560.5)/1716 = 1 − 3363/1716 ≈ −0.96
Spearman r = ~ -0.96
Look up on the Spearman Rank significance table above and you will find r significantly
different from zero at P < 0.0001 (r is above the 0.1% significance level, so we can
be 99.9% confident the correlation has not occurred by chance).
By using GraphPad you will get
Spearman r = -0.9632 (corrected for ties)
95% confidence interval: -0.9903 to -0.8658
The two-tailed P value is < 0.0001, considered extremely significant,
indicating a very strong negative association.
NB: If you performed Pearson's correlation coefficient you would get these results:
Correlation coefficient (r) = -0.9463
95% confidence interval: -0.9852 to -0.8149
Coefficient of determination (r squared) = 0.8955
Test: Is r significantly different from zero? The two-tailed P value is < 0.0001,
considered extremely significant. This result is very close to the value of r
(-0.9632) obtained by the Spearman method. This closeness is typical when there
are few or no extreme observations.
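Both coefficients can be compared directly in SciPy (a sketch, using the birth-weight data from the earlier example):

```python
from scipy.stats import pearsonr, spearmanr

birth_weight = [112, 111, 107, 119, 92, 80, 81, 84, 118, 106, 103, 94]
pct_increase = [63, 66, 72, 52, 75, 118, 120, 114, 42, 72, 90, 91]

r, p_r = pearsonr(birth_weight, pct_increase)       # parametric
rho, p_rho = spearmanr(birth_weight, pct_increase)  # rank-based
print(round(r, 4), round(rho, 4))                   # → -0.9463 -0.9632
```

Both p-values are below 0.0001, and the two coefficients agree closely because the data contain no extreme observations.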
Chi-Square Test (X²)
Studies often collect data on categorical variables (yes and no, cured and non-cured,
normal and abnormal, living and died, no environmental factors and environmental
factors, agree and disagree, positive and negative, male and female, I and II, present
and absent etc...) that can be summarized as a series of counts. These counts are
commonly arranged in a tabular format known as a contingency table.
1. Chi-Square test for two categories (2 x 2 contingency table)
For example, a study designed to determine whether or not there is an association
between cigarette smoking and asthma might collect data that could be assembled
into a 2 x 2 table. In this case, the two columns could be defined by whether the
subject smoked or not, while the rows could represent whether or not the subject
experienced symptoms of asthma. The cells of the table would contain the
number of observations or patients as defined by these two variables.
The chi-square test statistic can be used to evaluate whether there is an
association between the rows and columns in a contingency table. More
specifically, this statistic can be used to determine whether there is any difference
between the study groups in the proportions of the risk factor of interest.
Returning to our example, the chi-square statistic could be used to test whether
the proportion of individuals who smoke differs by asthmatic status.
The chi-square test statistic is designed to test the null hypothesis that there is no
association between the rows and columns of a contingency table. This statistic is
calculated by first obtaining for each cell in the table, the expected number of
events that will occur if the null hypothesis is true. When the observed number of
events deviates significantly from the expected counts, then it is unlikely that the
null hypothesis is true, and it is likely that there is a row-column association.
Conversely, a small chi-square value indicates that the observed values are
similar to the expected values leading us to conclude that the null hypothesis is
plausible.
The general formula used to calculate the X² test statistic for a 2 × 2 table is:
X² = (C × B − A × D)² / (E × F × G × H) × I

Results        Category 1   Category 2   Row total
Group 1        A            B            G
Group 2        C            D            H
Column total   E            F            I (grand total)

or, equivalently,
X² = Σ (Oi − Ei)² / Ei
where Oi = an observed frequency and Ei = an expected (theoretical) frequency
asserted by the null hypothesis (expected frequency = row total × column
total / grand total).
The degrees of freedom in a test of independence are equal to (number of rows)-1
× (number of columns)-1. Thus for a 2×2 table, there are (2-1)×(2-1) = 1 degree
of freedom; for a 4×3 table, there are (4-1)×(3-1) = 6 degrees of freedom.
For significance, the calculated X2 should be higher than the tabulated value.
The value of the chi-square statistic cannot be negative and can assume values
from zero to infinity.
The p-value for this test statistic is based on the chi-square probability
distribution and is generally extracted from published tables or estimated using
computer software programs. The p-value represents the probability that the chi-
square test statistic is as extreme as or more extreme than observed if the null
hypothesis were true. As with the t and F distributions, there is a different chi-
square distribution for each possible value of degrees of freedom.
Chi-square distributions with a small number of degrees of freedom are highly
skewed; however, this skewness is attenuated as the number of degrees of
freedom increases.
The use of the chi-square test can be illustrated by using hypothetical data from a
study investigating the association between smoking and asthma among adults
observed in a community health clinic. The results obtained from classifying 150
individuals are shown in the Table.
Hypothetical data showing chi-square test
                        Ever smoked cigarettes
Symptoms of asthma       Yes      No      Total
Yes                       20      30       50
No                        22      78      100
Total                     42     108      150
As the Table shows, among asthmatics the proportion of smokers was 40% (20/50),
while the corresponding proportion among asymptomatic individuals was 22%
(22/100). By applying the formula previously presented,
X2 = (C x B - A x D)2/(E x F x G x H) x I
X2 = (22 x 30 - 20 x 78)2/(42 x 108 x 50 x 100) x 150 = 0.0357 x 150 = 5.357.
The associated p-value for this chi-square statistic, which has one degree of
freedom, is 0.02. Therefore, if there were truly no association between smoking
and asthma, there would be a 2 out of 100 probability of observing a difference in
proportions at least as large as 18 percentage points (40% - 22%) by chance alone.
We would therefore conclude that the observed difference in the proportions is
unlikely to be explained by chance alone, and consider this result statistically
significant.
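This arithmetic can be checked with a short Python sketch using only the standard library. For one degree of freedom the chi-square upper-tail probability equals erfc(sqrt(X2/2)), because a chi-square variable with 1 df is a squared standard normal:

```python
import math

# Reproducing the smoking/asthma calculation from the table above.
a, b, c, d = 20, 30, 22, 78                    # observed cell counts
g, h = a + b, c + d                            # row totals: 50, 100
e, f = a + c, b + d                            # column totals: 42, 108
i = g + h                                      # grand total: 150

chi_sq = (c * b - a * d) ** 2 / (e * f * g * h) * i
p_value = math.erfc(math.sqrt(chi_sq / 2))     # valid for 1 df only

print(round(chi_sq, 3))   # 5.357
print(round(p_value, 4))  # 0.0206
```

The p-value agrees with the GraphPad output quoted below (0.0206), which the text rounds to 0.02.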
Computer analysis
The computer may be used to advantage in calculating the X2 test for
independence.
Go to GraphPad and select Analyze a contingency table and two rows, two
columns, then next.
Enter your data then next.
You will get a box: select chi-square without Yates' continuity correction,
two-sided P value, and calculate relative risk, then next.
Now you will get the results: the two-sided P value is 0.0206, considered
significant, and X2 = 5.357.
Because the construction of the chi-square test makes use of discrete data to
estimate a continuous distribution, some authors apply a continuity
correction when calculating this statistic. Specifically,
X2 = Σ (|Oi - Ei| - 0.5)2/Ei,
where |Oi - Ei| is the absolute value of the difference between Oi and Ei, and
the term 0.5 in the numerator is often referred to as Yates' correction
factor. This correction
factor serves to reduce the chi-square value, and, therefore, increases the
resulting p-value. It has been suggested that this correction yields an overly
conservative test that may fail to reject a false null hypothesis. However, as
long as the sample size is large, the effect of the correction factor is negligible.
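For illustration, here is a Python sketch of the corrected statistic applied to the smoking/asthma table above (the variable names are my own):

```python
# Yates' continuity correction: subtract 0.5 from each |O - E|
# before squaring, which shrinks the chi-square value.
observed = [20, 30, 22, 78]                    # smoking/asthma cells
row_totals, col_totals, grand = [50, 100], [42, 108], 150
expected = [r * c / grand for r in row_totals for c in col_totals]

chi_sq_yates = sum((abs(o - e) - 0.5) ** 2 / e
                   for o, e in zip(observed, expected))
print(round(chi_sq_yates, 3))  # 4.501, down from the uncorrected 5.357
```

The reduction from 5.357 to about 4.50 shows concretely how the correction raises the resulting p-value.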
When there are small counts in the table, the use of the chi-square
test statistic may not be appropriate. Specifically, it has been recommended
that this test not be used if any cell in the table has an expected count of less
than one, or if more than 20% of the cells have an expected count of less than
five. Under these circumstances, Fisher's exact test is recommended for
conducting tests of hypothesis.
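Fisher's exact test itself can be sketched in a few lines of Python using only math.comb. This version follows a common two-sided convention: sum, over all tables with the same margins, the hypergeometric probabilities that do not exceed that of the observed table (the function name is my own):

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    r1, r2, c1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):                     # hypergeometric P(top-left cell = x)
        return comb(r1, x) * comb(r2, c1 - x) / comb(n, c1)

    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs + 1e-12)

print(round(fisher_exact_2x2(8, 2, 1, 5), 4))  # 0.035
```

The 8/2/1/5 counts are an arbitrary small-count example where the chi-square approximation would be questionable and the exact test is preferred.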
2. Chi-Square test for more than two categories (large contingency tables)
It is used when samples are classified by two factors. For example, one factor
may represent disease severity (mild, moderate or severe) and the other
factor may represent blood group (A, B, O, AB). We are interested in
whether the two factors are associated. Are individuals of a particular blood
group likely to be more severely ill?
Example: a cross-sectional survey was carried out among the elderly
population living in Riyadh, with the objective of measuring the frequency of
cardiovascular disease. A total of 259 individuals, ranging between 65 and 95
years of age, were interviewed. Individuals were grouped into four age groups
(65-69, 70-74, 75-79, and 80+ years) at the time of interview. We used the
chi-square test to determine whether the prevalence of chest pain differed among
the four age groups. The observed and expected frequencies (%) are shown in
the following Table.
Manual solution: for each cell, compute (O - E)2/E using the general formula
presented earlier, and sum over all cells.
We refer X2 to the Table at 3 degrees of freedom ((2 rows - 1) x (4 columns - 1)
= 3): P > 0.1 (i.e. not significant).
There is insufficient evidence to reject the null hypothesis of no association
between chest pain and age in the population of elderly people.
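A generic r × c version can be sketched in Python, computing each expected count as row total × column total / grand total as described earlier (the function name is my own):

```python
def chi_square_rxc(table):
    """Chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    chi_sq = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            chi_sq += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi_sq, df

# Sanity check against the earlier smoking/asthma 2x2 table:
stat, df = chi_square_rxc([[20, 30], [22, 78]])
print(round(stat, 3), df)  # 5.357 1
```

The same function handles a 2×4 table such as the chest-pain-by-age-group data, returning 3 degrees of freedom.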
Computer analysis: The computer may be used to advantage in calculating the X2
test and the X2 test for trend, which takes into account the ordering of the
groups. Go to GraphPad and select Analyze a contingency table and larger
contingency table, then next.
Enter your data then next.
Now you will get the results.
The P value is 0.1839, considered not significant, and X2 = 4.839.
Chi-square for trend = 3.787 and the P value is 0.0517.
If you want to calculate X2 for trend manually, assign ordered scores to the
four age groups (1, 2, 3 and 4) and apply the trend formula.
Thus, there is not a significant linear trend between the ordered categories
defining the columns and the proportion of subjects in the top row.
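The original Riyadh table is not reproduced above, so the following Python sketch uses made-up counts purely to show the mechanics of the chi-square test for trend (Cochran-Armitage); the function name and example data are my own:

```python
def chi_square_trend(cases, totals, scores):
    """Chi-square for trend (1 df) for a 2 x k table.

    cases  - counts in the top row for each ordered column
    totals - column totals
    scores - ordered numeric codes for the columns (e.g. 1, 2, 3, 4)
    """
    n = sum(totals)                                   # grand total
    r = sum(cases)                                    # top-row total
    sum_rx = sum(c * x for c, x in zip(cases, scores))
    sum_nx = sum(t * x for t, x in zip(totals, scores))
    sum_nx2 = sum(t * x * x for t, x in zip(totals, scores))
    num = n * (n * sum_rx - r * sum_nx) ** 2
    den = r * (n - r) * (n * sum_nx2 - sum_nx ** 2)
    return num / den

# Hypothetical counts with a perfectly linear rise in proportion:
print(round(chi_square_trend([10, 20, 30], [50, 50, 50], [1, 2, 3]), 3))
```

Because the trend statistic has a single degree of freedom, it can detect an ordered pattern that the overall test, with its larger degrees of freedom, may miss, which is why the trend P value above (0.0517) is smaller than the overall P value (0.1839).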