EDEL812
STATISTICAL ANALYSIS IN RESEARCH MODULE
UNIVERSITY OF KWAZULU-NATAL, PIETERMARITZBURG CAMPUS
2002

Compiled by: Peter M Njuho, Ph.D.
Senior Lecturer, School of Statistics and Actuarial Science
University of KwaZulu-Natal
Private Bag X01, P O Box X01, Scottsville 3209, South Africa
E-mail: NjuhoP@ukzn.ac.za


Post on 08-Jul-2020




EDEL 812 STATISTICAL ANALYSIS IN RESEARCH

1. SURVEY AND DESIGN OF EXPERIMENT STUDY CONCEPTS

1.1 Survey versus design of experimental study

The difference between a survey study and a designed experimental study lies mainly in the study objectives. The researcher should understand the difference before undertaking the study. Failure to distinguish between the two forms of study leads to complicated data analyses whose results may fail to tie in with the study objectives.

The primary objective of a survey study is to observe the characteristics of the population of interest. For instance: Is the disease common across the different communities? Is the level of education distributed the same across races? Is the distribution of land even across different communities? What is the opinion of residents regarding new rules on rubbish disposal? In such situations we are concerned with the level of distribution rather than the actual difference. "Where is the variability high?" would be a question of great interest.

In a designed experimental study the primary interest is to investigate the relative performance of certain factors. The key questions to be answered are generally expressed as a statement of hypothesis that has to be verified or disproved through experimentation. The interest would be in answering questions such as:

- Are the three methods for treating the disease different? And if so, by how much?
- Is the new teaching method significantly different from the old method?

In a survey study the researcher has no control over the responses; he or she acts as an observer. The outcomes are mainly considered random. Survey studies can be classified into two types: the exploratory (or informal) survey and the formal survey. The exploratory survey is mainly used to obtain information about the population of interest, for example farmer circumstances. The approach places the interviewer in direct contact with the subjects and allows the interviewer to observe the characteristics of the population. An exploratory survey allows for quick gathering of information through informal interviews with many people. The information from an exploratory survey is used to design a well-focused formal survey by:

- identifying important topics bearing on research planning that should be the focus of the formal survey;
- ensuring that written questions in the formal survey are asked in a way that can be understood;
- designing and testing a sampling scheme.

Other important features of an exploratory survey

Towards the end of the exploratory survey, it should also be possible to give approximate frequencies of use for a given practice among the target population (e.g. 0-10%, 10-25%, 25-50%, 50-75%, 75-100% of farmers).

The exploratory survey narrows down the data to be collected in the formal survey to that which is essential for understanding present practices and prescreening technologies. An important part of the exploratory survey is to formulate hypotheses.


Examples of such hypotheses:

- A larger area can be planted as labour is a limiting factor at the planting period;
- There is a dry period three months after the start of the rains and late plantings may survive this period better than plantings that flower at that time;
- Early plantings give an early supply of new food and are particularly important when the previous harvest has been poor.

1.2 Formal survey study concept

The purpose of a formal survey study is to verify and quantify information and to test the hypotheses formulated in the exploratory survey. A formal survey involves the use of a well-designed questionnaire. The first step is to define the population of interest. Note that we interview a sample of respondents and use the information obtained from this sample to make statements or inferences about the population of interest. The following are general rules to follow when developing a questionnaire:

Organizing the questionnaire: The questionnaire should be divided into sections based on the study themes. Section one should always be designed to collect the bio-data (for example gender, age, education, marital status, etc.).

Language of the questionnaire: The questions should be constructed using clear and friendly language. The respondent must be given an opportunity to express himself or herself in a language of choice. Leading questions should be avoided. The question should be put in such a way that the respondent will provide more information. For example ask, "Did you apply fertilizer to wheat this year?" rather than, "Do you use fertilizer on wheat?"

Length of the questionnaire: Lengthy questionnaires should be avoided, because they may introduce fatigue. The construction of the questions should be compact and comprehensive. The role of the questionnaire is to obtain estimates of how widespread those problems and opinions are and whether there are differences between groups of respondents. Finalising a questionnaire for use in a formal survey study requires a pre-test before the final version is produced.

Subjective data

Consider a survey study where interest is in obtaining farmers' opinions regarding a certain technology. Note that information on what farmers do is objective and quantifiable, whereas farmers' opinions and perceptions about problems and technologies are subjective.

Sampling procedures for a formal survey study


The aim is to select, at a reasonable cost, a group of respondents that is roughly representative of all subjects in the population of interest. A representative sample must be selected at random; that is, each unit in the population or subgroup of the population has an equal chance of selection. Such an assumption requires all the sampling units to be homogeneous and non-overlapping. The nature of these non-overlapping units dictates the sampling technique to use. The following are some of the sampling procedures:

- Simple random sampling
- Stratified random sampling
- Multi-stage random sampling
- Systematic random sampling
- Cluster sampling
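As an illustration, the first two procedures can be sketched in Python as follows. This is a minimal sketch rather than a prescribed procedure: the function names, the farm identifiers and the regions used as strata are hypothetical, and the proportional allocation of the sample across strata is an assumption made here, not something the module specifies.

```python
import random

def simple_random_sample(units, n, seed=None):
    """Draw n units without replacement; every unit has the same chance of selection."""
    rng = random.Random(seed)
    return rng.sample(units, n)

def stratified_random_sample(strata, n, seed=None):
    """Draw a separate simple random sample from each stratum.

    strata maps a stratum name to its list of units. The sample is split
    across strata in proportion to stratum size (an assumed allocation rule).
    """
    rng = random.Random(seed)
    total = sum(len(units) for units in strata.values())
    sample = {}
    for name, units in strata.items():
        k = max(1, round(n * len(units) / total))  # at least one unit per stratum
        sample[name] = rng.sample(units, min(k, len(units)))
    return sample

# Hypothetical sampling frame: 60 farms in three regions.
frame = {
    "north": [f"N{i}" for i in range(30)],
    "central": [f"C{i}" for i in range(20)],
    "south": [f"S{i}" for i in range(10)],
}
all_farms = [farm for farms in frame.values() for farm in farms]

srs = simple_random_sample(all_farms, 12, seed=1)
stratified = stratified_random_sample(frame, 12, seed=1)
```

With this frame the stratified sample takes 6, 4 and 2 farms from the three regions, whereas the simple random sample gives no such guarantee about how the 12 farms are spread.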

Stratification of the population is the process of dividing the population into relatively homogeneous subgroups called strata, and then taking separate samples from each stratum.

Sample size: Depends on the variability within the population and not on the size of the population. It should conform to the time and cost constraints of the survey.

Major costs: Cost of developing the questionnaire, training enumerators, and establishing a suitable sampling method.

Form of analysis: Either to estimate population means, variance components or population size, or to establish causal relationships, predictive models, frequencies, etc.

Commonly used in analysis: Chi-square tests, mean estimation, non-parametric methods, regression (i.e. logit, probit, logistic, etc.).

1.3 Designed experimental study concepts

The researcher has control over the factors to be tested and the form of data to be collected. He or she sets up the experiment and observes the outcome. An experiment is a planned inquiry set up to obtain new facts, to confirm or disprove results from a previous experiment, or to verify certain biological phenomena.

Objectives: The objectives must be clearly stated as questions to be answered, hypotheses to be tested, and effects to be estimated. It is necessary to classify the objectives as major or minor, since certain experimental designs give greater precision for some treatment comparisons than others.

Precision: Precision, sensitivity, or amount of information is measured as the reciprocal of the variance of a mean. That is,

$$\text{Information} = \frac{1}{\operatorname{var}(\bar{y})} = \frac{n}{\sigma^2}$$

where $\operatorname{var}(\bar{y})$ denotes the variance of the sample mean $\bar{y}$.
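The relation Information = 1/var(ȳ) = n/σ² can be checked numerically. The sketch below assumes independent observations with a common variance σ², so that var(ȳ) = σ²/n; the function name and the values σ² = 4 with n = 5 and n = 20 are illustrative only.

```python
def information(sigma2, n):
    """Amount of information in a mean of n observations: 1 / var(ybar) = n / sigma2."""
    var_mean = sigma2 / n  # variance of the sample mean
    return 1.0 / var_mean

info_small = information(sigma2=4.0, n=5)    # about 1.25
info_large = information(sigma2=4.0, n=20)   # about 5.0
```

Quadrupling the number of observations quadruples the information, while doubling σ² would halve it.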

As the variance of an observation $y$, denoted by $\sigma^2$, increases, the amount of information decreases. Similarly, as $n$ increases, the amount of information increases.

Components of experimental design: The following are components that any researcher must clearly state when conducting a designed experimental study.

- Treatment structure
- Design structure
- Experimental unit
- Randomization
- Replications
- Assumptions

Treatment structure: A treatment is a procedure whose effect is to be measured and compared with other treatments, e.g. a standard ration, a spraying schedule, a temperature-humidity combination, etc. Treatment structures include: a set of treatments, e.g. sources of fertilizer such as DAP, CAN, TSP, manure, etc.; a one-way treatment structure, e.g. nitrogen levels, dairy meal levels, etc.; a two-way treatment structure, e.g. plant population and different hybrids; higher-order treatment structures, etc. The interest is to estimate effects, compare effects, predict, etc.

Experimental unit: This is the smallest unit of material to which the treatment is applied, e.g. an animal, 5 pigs in a pen, a half-leaf.

Sampling unit: This is often referred to as the observational unit. The treatment effect is measured on a sampling unit, which is basically a part of an experimental unit. Sometimes a sampling unit is a complete experimental unit.

Experimental error: This is a measure of the variation that exists among observations on experimental units treated alike. Aim at reducing experimental error in order to improve the power of the test.

Replication: When a treatment appears more than once in an experiment, it is said to be replicated. Replication is necessary to provide an estimate of experimental error, which is required for tests of significance. Without replication there is no basis for comparison. Valid replication requires that, for similarly different units, there are at least some sets of units treated identically. There are many situations in which there can be different levels of replication, providing different degrees of variation. It is necessary to identify the different levels of replication, the correct level of replication, and situations in which "false" replication is used. There are also many situations in which multiple levels of replication are necessary and relevant to the analysis. Replication provides a means of computing experimental error.


The amount of replication is determined by the extent to which the standard error must be reduced, which is in turn determined by the size of the treatment difference that the experiment should detect. Given the necessary amount of replication, we have the total number of units in the experiment. In an analysis of variance, the variation between these units can be modelled by dividing the total degrees of freedom (sample size minus 1) into control (design structure), questions (treatment structure) and error (random structure). We should design experiments so that the error degrees of freedom are between 10 and 20. Experiments not satisfying this requirement are, to some degree, inefficient and should be avoided.

Randomization: Done to ensure that we have a valid or unbiased estimate of experimental error and of treatment means and the differences among them. In other words, the procedure provides insurance against the possibility that the model for analysis is invalid. It also provides a basis for randomisation test arguments to support coincidence arguments in terms of significance. Randomization provides a valid measure of experimental error.

Design structure: Involves techniques for controlling known variation among the experimental units. Thus, experimental units are grouped into homogeneous groups referred to as blocks, such that variation within the groups is a minimum and between them is a maximum. The following are examples of design structure:

- Completely randomized design (CRD)
- Randomized complete block design (RCBD)
- Latin square design
- Cross-over design
- Incomplete block design
- Experiments with more than one size of experimental unit, such as:
  - Split plot design
  - Strip plot design
  - Repeated measures design

Assumptions: The design structure and treatment structure do not interact. The observed values are independently and identically normally distributed with a constant variance.

1.4 Conceptual models in scientific research

Conceptual models serve to organise research approaches and direct data presentation. Many inexperienced scientists do not make full use of conceptual models. Conceptual models assume many different forms that are not mutually exclusive. Different conceptual models may be dynamic and interactive. Working hypotheses are an essential component of all scientific approaches and must be elucidated in advance of more detailed research activities. Most working hypotheses may then be captured as either mathematical or statistical models. Simple diagrams should be based on one or more working hypotheses and constructed in advance of detailed research efforts to serve as a framework; they may often evolve into more complex forms during the course of many experiments and much thought.


Working hypotheses

Working hypotheses are reductionist word models based on logic and are an essential component of all research. All of scientific progress may be viewed as a long chain of working hypotheses that were framed, tested and either accepted or rejected, with conclusions that led to a more advanced hypothesis. A well-stated working hypothesis is specific and directs treatment selection and measurements. Global hypotheses are more general, are often a restatement of overall objectives, and generate one to many working hypotheses. A null hypothesis is stated in the negative context and is no longer considered essential provided that a working hypothesis is stated in a manner that may be either accepted or rejected. Note that hypothesis testing is a formal procedure by which we investigate research questions using inferential statistics to reach decisions about the validity of the null and alternative hypotheses. It is most reasonable for one scientist to ask another, "How do your findings reflect on your working hypothesis?" Working hypotheses should be stated as simply as possible but must be complete statements such as "X regulates Y under Z conditions".

Example 1.1

- Phosphorus availability is limiting maize production in nutrient-depleted smallholder farms on the highly weathered, sandy soils of KZN.
- Maize streak virus infects a greater proportion of the maize stand and reduces crop yields to a greater extent under continuous maize cultivation than in maize-legume rotation.

Both of the statements in Example 1.1 may be summarised as

If A and B then C (and D, etc.)

Always remember that working hypotheses are intended to be either accepted or rejected as a result of successful research and as such must be able to withstand various tests of logic. Do not be defensive when a particular hypothesis is challenged, but rather be complimented that another scientist considers it worthy of discussion. Beware of incomplete statements such as "Use of fertiliser is better for farmers" or "Maize streak virus is a serious problem". Also, tautological statements such as "Sustainable agriculture results in long-term food security" are unsatisfactory working hypotheses; rather, these sorts of statements should be included within introductions, justification sections or overall objectives.

Mathematical models

Mathematical equations may also serve as conceptual models. Equations attempt to quantify cause and effect relationships. The cause(s) are referred to as the independent variable(s), which direct an effect in the dependent variable. The mathematical relationship may also be stated in more general terms as a working hypothesis. The relationship may be either linear or non-linear. A general expression for a simple linear relationship is of the form

$$Y = \alpha + \beta X + \varepsilon$$

where
$\alpha$ is the intercept,
$\beta$ is the slope,
$\varepsilon$ is the random error,
$Y$ is the dependent variable,
$X$ is the independent variable.

Many different equations define non-linear relationships. Examples of such equations are:

- Power functions: $y = a x^b$, where $a$ and $b$ are constants
- Exponential growth curves: $y = a b^x$, where $a$ and $b$ are constants ($b$ can also be the exponential constant $e$)
- Hyperbolic functions, where $x$ is the reciprocal of $y$: $y = 1/x$
- Asymptotic decay curves, where $y$ approaches 0 as $x$ increases without limit: $y = a e^{-kx}$
- Polynomial curves: $y = a_0 + a_1 x + a_2 x^2 + \dots + a_p x^p$
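To get a feel for the shapes of these curves, they can be written as small Python functions and evaluated at a few points. This is only a sketch: the function names and the constants passed to them are illustrative assumptions, not values from the module.

```python
import math

def power(x, a, b):
    return a * x ** b                      # y = a x^b

def exponential_growth(x, a, b):
    return a * b ** x                      # y = a b^x

def hyperbolic(x):
    return 1.0 / x                         # y = 1/x, defined for x != 0

def asymptotic_decay(x, a, k):
    return a * math.exp(-k * x)            # y = a e^(-kx), tends to 0 when k > 0

def polynomial(x, coeffs):
    """coeffs = [a0, a1, ..., ap]; evaluates a0 + a1 x + ... + ap x^p."""
    return sum(c * x ** i for i, c in enumerate(coeffs))

xs = [1, 2, 4, 8]
decay = [asymptotic_decay(x, a=10.0, k=0.5) for x in xs]
```

Evaluating each function over the same grid of x values, as done for `decay`, makes it easy to compare how quickly the curves rise or level off before committing to one of them.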

In general, the researcher should identify a conceptual basis for selecting a given curvilinear relationship based upon the properties of the phenomenon under study.

Statistical models

Statistical models, particularly those that examine two or more factors simultaneously, are also useful as conceptual models. Differential effects of one factor on another result in interaction. For instance, in a study set up to investigate the effect of two factors, a statistical model would take the following form, assuming a completely randomised design. Suppose $y$ is the response variable.

$$y_{ijk} = \mu + A_i + B_j + AB_{ij} + \varepsilon_{ijk}$$

where
$\mu$ denotes the overall mean,
$A_i$ the $i$th effect of factor A,
$B_j$ the $j$th effect of factor B,
$AB_{ij}$ the $ij$th interaction effect,
$\varepsilon_{ijk}$ the random error.

A statistical model could also be considered as a process of partitioning the response value into components due to inherent and random variation. Setting up the model before the analysis enables the researcher to focus on the issues of interest. Interest would be in estimating the main effects of factors A and B, and their interaction effect.
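For a balanced layout (the same number of observations in every combination of A and B), the terms of the two-factor model can be estimated from cell, row and column means. The sketch below assumes such a balanced design; the data, the level labels and the function name are hypothetical, and the estimates use the usual sum-to-zero convention.

```python
from statistics import mean

def two_factor_effects(data):
    """Estimate mu, A_i, B_j and AB_ij from balanced data.

    data maps (i, j) -> list of replicate observations for level i of
    factor A and level j of factor B (equal replication is assumed).
    """
    a_levels = sorted({i for i, _ in data})
    b_levels = sorted({j for _, j in data})
    cell = {key: mean(obs) for key, obs in data.items()}
    mu = mean(cell.values())
    a_eff = {i: mean(cell[i, j] for j in b_levels) - mu for i in a_levels}
    b_eff = {j: mean(cell[i, j] for i in a_levels) - mu for j in b_levels}
    ab_eff = {(i, j): cell[i, j] - mu - a_eff[i] - b_eff[j]
              for i in a_levels for j in b_levels}
    return mu, a_eff, b_eff, ab_eff

# Hypothetical yields: factor A = fertiliser (2 levels), factor B = variety (2 levels).
yields = {
    ("f0", "v1"): [10.0, 12.0],
    ("f0", "v2"): [14.0, 16.0],
    ("f1", "v1"): [20.0, 22.0],
    ("f1", "v2"): [30.0, 32.0],
}
mu, a_eff, b_eff, ab_eff = two_factor_effects(yields)
```

Each cell mean is recovered exactly as mu + A_i + B_j + AB_ij; what remains within a cell is the variation attributed to random error.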


The random error would be used to construct statistical tests to ascertain whether these effects differ significantly from zero or from any specified value.

Exercise 1.1

1.1 Suppose you were approached to design a questionnaire to be used in a feasibility study regarding the settlement of a group of urban dwellers on new land within KZN.

a) What components would you include in your questionnaire?
b) What could be the sampling unit?
c) What could be the population of interest?
d) What could be the sample size?

1.2 Consider the newly constructed Casino in Pietermaritzburg. Suppose the manager wishes to collect the views of the residents of Pietermaritzburg regarding the business. Indicate how such information could be collected.

1.3 The Checkers in Scottsville, Pietermaritzburg underwent some renovation recently. Suppose the manager wishes to collect the customers' views regarding this change. Indicate how such information would be collected.

1.4 An experiment was conducted to determine the best way to manage citrus insect pests and diseases on smallholder farms. Six farms were selected. On each farm, 10 trees infested with white flies were selected. The investigator was interested in finding which treatment had the best effect in controlling the disease among the following four: 1) pruning, 2) fertiliser application, 3) pesticide application, and 4) farm activities intercropping practice.

a) Indicate how the treatments were applied.
b) What could have been the experimental unit?
c) How independent are these treatments?
d) What name could you give the experimental design used?
e) What possible questions could be answered in this study?


2. INFERENCES ABOUT ONE AND TWO POPULATION MEANS

2.1 Introduction to hypothesis testing

Hypothesis testing is an area of statistical inference in which we evaluate a conjecture, which we call a hypothesis, about some characteristic of the parent population. The hypothesis usually concerns the unknown parameters of the population.

The null hypothesis: This is the statement being tested and is denoted by $H_0$. It is usually stated as an equality, implying 'no difference'.

The alternative hypothesis: This is what is believed to be true if $H_0$ is rejected, and is denoted by $H_a$. Usually, the investigator wishes to establish that there is a difference between the parameter and the value being tested. Thus, the alternative is also called the research hypothesis.

Consider, for example, the hypothesis that the mean per capita income in a certain town is R800 per year. Denote the population mean by $\mu$. Suppose the investigator believes the mean per capita income of the town is greater than R800. The two hypotheses are stated as

$$H_0: \mu = 800 \quad \text{against} \quad H_a: \mu > 800$$

If the investigator believes the mean per capita income is less than R800, then the alternative hypothesis is stated as

$$H_a: \mu < 800$$

The alternative is stated in support of what the investigator wishes to believe.

The significance level: This is the probability with which we are willing to reject the null hypothesis when it is correct. A Type I error is committed if we reject the null hypothesis when it is in fact true. A Type II error is committed if we fail to reject the null hypothesis when it is false.

2.2 Inferences about a population mean

In reality, we encounter situations where interest is in confirming a known hypothesis. This relates to questions such as: has the average increased, decreased or remained static over time? Sometimes, an investigator would like to compare characteristics of two populations. Handling such investigations involves one-sample or two-sample situations. Consider a single population that is normally distributed with mean $\mu$ and variance $\sigma^2$. Suppose we want to test a hypothesis about $\mu$. Let $\mu_0$ denote a known mean.

Hypothesis: $H_0: \mu = \mu_0$ against $H_a: \mu \neq \mu_0$
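When σ is known, the test of H0: µ = µ0 is carried out by standardising the sample mean, z = (x̄ − µ0)/(σ/√n), and referring z to the standard normal distribution. A minimal Python sketch using only the standard library is given below; the function name and its `alternative` argument are illustrative choices, and the numbers are those of Example 2.1, which follows.

```python
from math import sqrt
from statistics import NormalDist

def z_test(xbar, mu0, sigma, n, alternative="two-sided"):
    """One-sample z test for a mean with known sigma; returns (z, p_value)."""
    z = (xbar - mu0) / (sigma / sqrt(n))
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if alternative == "less":        # Ha: mu < mu0
        p = std_normal.cdf(z)
    elif alternative == "greater":   # Ha: mu > mu0
        p = 1.0 - std_normal.cdf(z)
    else:                            # Ha: mu != mu0
        p = 2.0 * (1.0 - std_normal.cdf(abs(z)))
    return z, p

z, p = z_test(xbar=67, mu0=70, sigma=18, n=50, alternative="less")
reject = p < 0.05   # decision at the 5% significance level
```

With these figures z is about -1.18 and the p-value about 0.119, so H0 is not rejected at the 5% level.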


Example 2.1

The scores on a college placement exam in mathematics are assumed to be normally distributed with a mean of 70 and a standard deviation of 18. The exam is given to a random sample of 50 high school students who have been admitted to college. Their average score on the exam was 67. If this is a true random sample, is the evidence sufficient to suggest that the population mean score is lower than 70?

Solution: Let $\mu$ denote the true population mean of the placement exam. We wish to see whether there is evidence that $\mu < 70$. This is the research hypothesis. Thus, $\mu_0 = 70$, $\sigma^2 = 324$, and $n = 50$.

Hypothesis: $H_0: \mu = 70$ against $H_a: \mu < 70$

Significance level: $\alpha = 0.05$

Critical region: Reject $H_0$ if the p-value is less than 0.05, where p-value = the probability that $\bar{X} \le 67$ when $H_0$ is true.

Test statistic: The sample mean, $\bar{X} = 67$.

$$z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} = \frac{67 - 70}{18/\sqrt{50}} = -1.18$$

Thus, $P(\bar{X} \le 67) = P(Z \le -1.18) = 0.119$.

Conclusion: We fail to reject $H_0$ since the p-value = 0.119 is not less than 0.05. Based on the results of a random sample of 50 high school students, there is insufficient evidence to say that the mean score on the college placement exam is lower than 70.

Exercise 2.1

2.1 For the future planning and control of automatic sorting machines, a member of the General Post Office is instructed to take a random sample of the letters posted with a 10c stamp during a specific period of the year. The weights of these letters were recorded as follows (in grams):

25.7 23.2 25.8 25.8 29.1 23.1 17.2 26.4
31.9 18.3 19.2 20.7 23.6 21.6 21.9 21.8

a) Test a claim that the average weight of such letters is 19.6 g.
b) Test a claim that the average weight is greater than 21.6 g.


c) Find 95% and 99% confidence intervals for the average weight of the letters.
d) Use your results in (c) to test whether the mean is 27.9 g.

2.2 A certain department store conducts monthly checks amongst its branches to test whether the mean balance outstanding on 30-day charge accounts complies with the company policy of R100. For a particular branch store a sample of 100 accounts gave the following results:

$\bar{x}$ = R104.19, $s$ = R22.13

a) Test the claim, at the 5% significance level, that the branch was complying with company policy.
b) The department store financial controller claims that the mean balance is greater than R100. Test this claim. (Use α = 0.05).

2.3 State the null and alternative hypotheses for the following research questions.

a) Are children who have strict parents more disciplined than children who do not have strict parents?

b) Do babies with birth weights of 2.8kg and more have a greater mortality rate than those with birth weights lower than 2.8kg?

2.4 The records of the National Road Traffic Department reveal that the scores on the learner driver test are normally distributed with a mean of 62% and a standard deviation of 16. The traffic department is aware that people in KwaZulu-Natal tend to be better at obeying traffic rules than people from other provinces around the country. They administered the learner driver test to a random sample of 200 adults from KZN, and noted that their mean score was 69.9%.

a) Do people from KZN perform better than the national mean? (α = 0.05)
b) Conduct an analysis on the same data to test the research question that people from KZN perform differently from expectation. (α = 0.05)
c) Although the above tests have yielded similar conclusions, in what way do they differ?
d) How would the chance of making Type I and Type II errors change if we changed the significance level of the tests to α = 0.01?

2.5 The weight of humans is normally distributed with a mean of 73 kg and a variance of 144. To investigate whether the weight of rural South Africans is different from this international mean, we draw a random sample of 100 rural South Africans and calculate their mean weight to be 69 kg.

a) Determine whether the weight of rural South Africans is different from the international mean (α = 0.01).

2.3 Paired t-test problem

Consider a situation where a researcher is interested in the effect of a treatment given to randomly selected subjects. Measurements are made before and after application of the


treatment. The data are paired and interest is in finding out whether the effect is negative, nil or positive. Differencing eliminates the effect of the subject, leaving the effect due to the treatment and random variation. Suppose we have two treatments A and B applied to n subjects randomly selected from a normally distributed population with mean $\mu$ and variance $\sigma^2$.

Subject   Response to    Response to    Difference
          Treatment A    Treatment B    d = X - Y
1         X1             Y1             d1 = X1 - Y1
2         X2             Y2             d2 = X2 - Y2
3         X3             Y3             d3 = X3 - Y3
.         .              .              .
n         Xn             Yn             dn = Xn - Yn

We treat the differences as a single-sample problem, and compute estimates of the mean and the variance using the usual estimation formulae.

Assumption: The $d_i$'s ($i = 1, 2, \dots, n$) are a random sample from a normal population with mean $\mu_d$ and variance $\sigma_d^2$.

Hypothesis: $H_0: \mu_d = 0$ against $H_1: \mu_d \neq 0$, or $H_1: \mu_d < 0$, or $H_1: \mu_d > 0$, depending on the available information. That is, the effect may be suspected to differ, decrease or increase. Assuming $H_0: \mu_d = 0$ is true, we estimate the variance $\sigma_d^2$ by $s_d^2$, computed as follows:

$$s_d^2 = \frac{\sum d^2 - \left(\sum d\right)^2 / n}{n - 1}$$

The calculated t-value, denoted by $t_{calc}$, is

$$t_{calc} = \frac{\bar{d} - 0}{s_d / \sqrt{n}}$$

where
$\bar{d}$ = average of the paired differences,
$s_d$ = standard deviation of the paired differences,
$n$ = number of pairs.

Reject $H_0$ if $|t_{calc}| > t_{\alpha/2,\, n-1}$ and conclude that there is enough evidence that the treatment had a significant effect at the $\alpha$ level, where $\alpha$ measures the strength of evidence against $H_0$. The values of the t-distribution are given in Table B.
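The computation can be sketched in Python using only the standard library. The function name is an illustrative choice; the data are the monthly purchases of Example 2.2 below, and the critical value 2.262 (9 degrees of freedom, two-sided 5% level) is taken from that example rather than computed.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(x, y):
    """Return (dbar, s_d, t_calc) for paired samples x and y."""
    d = [xi - yi for xi, yi in zip(x, y)]   # paired differences
    n = len(d)
    dbar = mean(d)
    s_d = stdev(d)                          # uses the n - 1 divisor
    t_calc = (dbar - 0) / (s_d / sqrt(n))
    return dbar, s_d, t_calc

pick_n_pay = [140, 120, 230, 50, 70, 240, 190, 120, 250, 100]
checkers = [100, 150, 220, 80, 110, 180, 190, 140, 190, 100]

dbar, s_d, t_calc = paired_t(pick_n_pay, checkers)
t_crit = 2.262                      # table value quoted in Example 2.2
reject = abs(t_calc) > t_crit
```

Working from the list of differences keeps the calculation identical to the single-sample case, which is the point of pairing.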


The following example illustrates the computation procedures and the type of inference one can draw.

Example 2.2

A market research study was conducted in which each family was asked to record its total monthly purchases at Pick 'n Pay and its total monthly purchases at Checkers. The study aims to estimate the difference in average monthly expenditure by families at the two shopping centres. The data from 10 families selected at random are presented below. The data are in rands.

Family   Pick ‘n Pay   Checkers   Difference, d   (d - d̄)²
1        140           100         40             (40 - 5)²  = 1 225
2        120           150        -30             (-30 - 5)² = 1 225
3        230           220         10             (10 - 5)²  = 25
4         50            80        -30             (-30 - 5)² = 1 225
5         70           110        -40             (-40 - 5)² = 2 025
6        240           180         60             (60 - 5)²  = 3 025
7        190           190          0             (0 - 5)²   = 25
8        120           140        -20             (-20 - 5)² = 625
9        250           190         60             (60 - 5)²  = 3 025
10       100           100          0             (0 - 5)²   = 25
Sum                                50             12 450

$$\bar{d} = \frac{50}{10} = 5$$

$$s_d = \sqrt{\frac{\sum (d - \bar{d})^2}{n - 1}} = \sqrt{\frac{12\,450}{10 - 1}} = 37.193$$

Critical region: The t-table value with 9 degrees of freedom at the 5 % significance level is t = 2.262 (see Table B). We would reject H0 if |tcalc| > t(α/2, n-1) = 2.262.

Test statistic:

$$t_{calc} = \frac{\bar{d} - 0}{s_d / \sqrt{n}} = \frac{5 - 0}{37.193 / \sqrt{10}} = 0.425$$

Conclusion:


We fail to reject H0 since |tcalc| is not greater than 2.262. We conclude that there is no significant difference in average spending by families at the two shopping centres, based on the available data.

Exercises 2.2

2.6 Given two independent samples with the following information

Item   Sample 1   Sample 2
1      19.6       21.3
2      22.1       17.4
3      19.5       19.0
4      20.0       21.2
5      21.5       20.1
6      20.2       23.5
7      17.9       18.9
8      23.0       22.4
9      12.5       14.3
10     19.0       17.8

a) State the null hypothesis.
b) What assumption would you make?
c) Based on these paired samples, test at the α = 0.10 level whether the true average paired difference is 0.
d) State your conclusions.

2.7 A random sample of 15 cars passed through an urban speed trap. The following speeds in km per hour were recorded.

Car     1   2   3   4   5   6   7   8   9   10  11  12  13  14  15
Speed   71  49  68  65  64  57  80  63  62  69  45  61  66  66  55

a) Estimate µ, the true mean speed of cars passing that point.
b) Given that the speed limit is 60 km/h, test H0: µ = 60 to check if it is reasonable.
c) Set 95 % confidence limits about the true mean.
d) What assumption did you have to make?

2.8 A consumer organisation has sampled the TV sets of 20 owners and recorded the time in years to each set's first repair. The data are:

1.97 2.87 3.01 2.75 2.09 1.34 1.62 1.10 2.24 1.79
2.81 0.57 3.17 3.89 3.10 2.05 1.01 4.16 2.59 1.67

a) Estimate the mean time to first repair for the population sampled.
b) Set 99 % confidence limits for the true mean, µ.
c) Use the results obtained in part (b) to test H0: µ = 0.

2.9 The following data arise from a survey of aged people in Durban. The variable recorded per person is the monthly expenditure on medicines, recorded in rands.

34.42 9.66 40.40 31.00 6.30 52.82 2.20 20.00
6.50 48.24 57.13 24.64 37.80 36.00 58.16

a) The claim recently made in a local newspaper was that the mean monthly expenditure on medicines for elderly people exceeds R 30.00 per month. Test this claim at the 5 % significance level.
b) Use the sample to estimate the mean annual expenditure on medicines for this population and set 95 % confidence limits to this quantity.

2.10 The repainting of lines on freeways represents a large proportion of the expenditure of a roads department. It is decided that a new, cheaper paint should be tested. Twenty-five randomly chosen 1-km stretches are painted with the new paint. After a month an assessment is done at each site. The durability of the paint is measured by an instrument on whose scale the current paint registers 39.2. For the sample of 25 sites, the following calculations have been done.

x̄ = 39.65, s = 3.02

The department wishes to test (using α = 0.01) whether the new paint is better than the current paint.

a) State the appropriate null and alternative hypotheses.
b) Test your null hypothesis at the required level.
c) State your conclusions.

2.11 A random sample of nine local school children yielded the following sample statistics for the random variable X = IQ.

x̄ = 107, s = 3.88


a) Find a 99 % confidence interval for µ, the mean IQ, and use this interval to test H0 : µ = 100 at the 1% significance level.

2.12 A random sample of 16 pharmacies was selected in the Witwatersrand area. The price in rands charged for 100 tablets of a particular drug by each pharmacy in the sample was:

3.75 4.10 10.40 7.50 2.95 5.75 7.50 8.90
5.85 7.65 8.10 6.50 7.50 5.50 8.00 4.50

a) Estimate the mean price of 100 tablets of this drug for pharmacies in the area.
b) Set 95 % confidence limits to your estimate.
c) Carry out a test of significance to assist you in deciding whether the mean price of this drug in the Witwatersrand area is lower than R 7,95 (which is known to be the mean selling price in Cape Town).

2.13 Suppose that, after sampling 20 records at random, a sociologist finds the following durations (to the nearest tenth of a year) of marriage ending in divorce.

10.1 21.2 13.8 11.1 10.9 9.2 6.6 12.3 7.8 15.1
2.61 4.31 4.9 5.4 8.7 4.81 9.42 6.3 24.5 21.6

a) Set up an appropriate null hypothesis and alternative hypothesis.
b) Determine whether these data provide evidence, at the 5 % significance level, that the mean duration of marriage ending in divorce in the population has decreased from an earlier value of 14.9.
c) What distribution assumption is made in applying the hypothesis test?

2.14 A designer claims that by smoothing out parts of a particular automobile body to reduce air resistance, the average fuel consumption can be reduced below 8.0 litres per 100 km. In an attempt to support the claim, the designer has obtained a sample of fuel consumption for 15 modified automobiles. The sample mean was 7.4 litres per 100 km and the standard error of the mean was 0.8 litres per 100 km.

a) Do these results provide sufficient evidence to support the claim?

2.15 To test the durability of a new paint for white centre lines, a highway department painted test strips across heavily travelled roads in eight different locations, and automatic counters showed that they deteriorated after having been crossed by the following numbers of cars (in thousands).

142.6 167.8 136.5 108.3 126.4 133.7 162.0 149.0

a) Find 95 % confidence limits for µ, the average number of crossings that this paint can withstand before it deteriorates.
b) Find 99 % confidence limits for µ, the average number of crossings that this paint can withstand before it deteriorates.
c) Test the paint manufacturer's claim that µ = 160.0.


2.4 Inferences about two population means

Investigators are often interested in assessing the performance of one population compared with another, for instance comparing a new technology with an old one, or a new variety with an old variety. The two populations are assumed to be independent. Suppose we have two random samples, one of size n1, X1, X2, X3, . . ., Xn1, drawn from a normally distributed population with mean µ1 and variance σ², and the other of size n2, Y1, Y2, Y3, . . ., Yn2, drawn from a normally distributed population with mean µ2 and variance σ². The sample means x̄ and ȳ are unbiased estimators of the population means µ1 and µ2, respectively. Also, the sample variances s1² and s2² are both unbiased estimators of the common population variance σ².

Assumption
The two populations of X's and Y's are independent and normally distributed with possibly different means and a common variance.

The setting up of hypotheses depends on the study objectives. The following are the possible hypotheses:

Hypothesis: Ho : µ1 = µ2 against Ha : µ1 ≠ µ2 or Ha : µ1 < µ2 or Ha : µ1 > µ2

Test statistic: x̄ - ȳ is normally distributed with mean µ1 - µ2 and variance σ²(1/n1 + 1/n2). We estimate σ² by a pooled variance, sp², where

sp² = (Total Sum of Squares)/(Total Degrees of Freedom)

$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$

Testing the equality of the two means against the alternative that they are not equal requires the standard error of the difference in means to be computed first. For a combined sample size n1 + n2 < 30 we use the t-distribution; otherwise the normal distribution applies. The appropriate test statistic, assuming a common variance estimated by the pooled variance, is computed as

$$t_{cal} = \frac{(\bar{x} - \bar{y}) - (\mu_1 - \mu_2)}{s_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$$

where µ1 - µ2 = 0 under Ho.

Conclusion: Reject Ho : µ1 = µ2 in support of Ha : µ1 ≠ µ2 if |tcal| > tTable, where tTable is the two-sided table value with n1 + n2 - 2 degrees of freedom at the α level of significance.


When a common population variance cannot be assumed, say the variances are σ1² and σ2², the test statistic follows an approximate t-distribution with degrees of freedom, df, computed as

$$df = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}}$$

The computed df is rounded down to the nearest integer and the t-test is then used, noting that the fewer the degrees of freedom, the lower the power. In this case the variance of the difference between the sample means is estimated as

$$\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}$$
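As a hedged illustration of the unequal-variance case, the sketch below computes the approximate degrees of freedom and the corresponding t statistic. The function name `welch_test` is an assumption made for this sketch, and the data are the two coal samples of Example 2.3, used here only to show the arithmetic (the example itself assumes a common variance).

```python
import math
import statistics


def welch_test(x, y):
    """Return (t statistic, unrounded df) without assuming a common variance."""
    n1, n2 = len(x), len(y)
    v1 = statistics.variance(x) / n1   # s1^2 / n1
    v2 = statistics.variance(y) / n2   # s2^2 / n2
    t_calc = (statistics.mean(x) - statistics.mean(y)) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t_calc, df


# Heat-producing capacity of coal from the two mines in Example 2.3.
mine_1 = [8380, 8210, 8360, 7840, 7910]
mine_2 = [7540, 7720, 7750, 8100, 7690]

t_calc, df = welch_test(mine_1, mine_2)
df_used = math.floor(df)  # round down to the nearest integer, as described above
print(f"t = {t_calc:.2f}, df = {df:.2f}, df used = {df_used}")
```

Because the two samples have equal sizes, the t statistic coincides with the pooled one; only the degrees of freedom change.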

Example 2.3 Consider data collected to study heat-producing capacity. The heat-producing capacity (in millions of calories per ton) was measured on random samples of five specimens each of coal from two mines. The data and the test statistics follow.

Mine 1   8380   8210   8360   7840   7910
Mine 2   7540   7720   7750   8100   7690

Suppose we assume the sample from Mine 1 to be normally distributed with mean µ1 and variance σ², and the sample from Mine 2 to be also normally distributed with mean µ2 and variance σ².

Hypothesis: Test H0 : µ1 = µ2 against Ha : µ1 ≠ µ2
Significance level: α = 0.05
Critical region: Reject H0 if |tcal| > t*, where t* is the t-table value corresponding to 2(n - 1) = 8 degrees of freedom at the 5 % significance level.
Test statistics:
x̄1 = 8 140 and SS(x1) = Σ(x - x̄)² = 253 800
x̄2 = 7 760 and SS(x2) = 170 600
n1 = n2 = n = 5
t* = 2.306 with 8 degrees of freedom at the 5 % significance level.

Thus, the pooled estimate of the variance σ² is


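$$s_{pooled}^2 = \frac{SS(x_1) + SS(x_2)}{2(n - 1)} = \frac{253\,800 + 170\,600}{2(5 - 1)} = 53\,050$$

The estimated standard error of the mean difference is

$$S.E.(\bar{x}_1 - \bar{x}_2) = \sqrt{s_p^2\left(\frac{1}{n} + \frac{1}{n}\right)} = \sqrt{53\,050\left(\frac{2}{5}\right)} = 145.67$$

The calculated t-value is

$$t_{calc} = \frac{\bar{x}_1 - \bar{x}_2 - 0}{S.E.(\bar{x}_1 - \bar{x}_2)} = \frac{8140 - 7760}{145.67} = \frac{380}{145.67} = 2.61$$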

Conclusion: We reject H0 since |tcalc| = 2.61 is greater than t* = 2.306 and conclude that the heat-producing capacity of the coal from the two mines is not the same. The coal from Mine 1 is superior by an estimated 380 millions of calories per ton (standard error 145.67).
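The hand calculation in Example 2.3 can be checked with a short script. This is a sketch under the common-variance assumption of the example; the helper name `pooled_t` is hypothetical, and the critical value is taken from the text rather than computed.

```python
import math
import statistics


def pooled_t(x, y):
    """Return (pooled variance, standard error, t statistic, df) for two samples."""
    n1, n2 = len(x), len(y)
    x_bar, y_bar = statistics.mean(x), statistics.mean(y)
    ss1 = sum((v - x_bar) ** 2 for v in x)      # SS(x1)
    ss2 = sum((v - y_bar) ** 2 for v in y)      # SS(x2)
    df = n1 + n2 - 2
    sp2 = (ss1 + ss2) / df                      # pooled estimate of sigma^2
    se = math.sqrt(sp2 * (1 / n1 + 1 / n2))     # S.E. of the mean difference
    t_calc = (x_bar - y_bar - 0) / se
    return sp2, se, t_calc, df


mine_1 = [8380, 8210, 8360, 7840, 7910]
mine_2 = [7540, 7720, 7750, 8100, 7690]

sp2, se, t_calc, df = pooled_t(mine_1, mine_2)
t_star = 2.306  # table value for 8 degrees of freedom at 5 %, as quoted in the example

print(f"pooled variance = {sp2:.0f}, S.E. = {se:.2f}, t = {t_calc:.2f}, df = {df}")
print("reject H0" if abs(t_calc) > t_star else "fail to reject H0")
```

The same function applies unchanged to samples of unequal size, such as the two groups of Example 2.4.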

Example 2.4 A researcher wants to determine whether a given drug has any effect on the scores of human subjects performing a task of psychomotor co-ordination. Nineteen subjects were randomly selected from a subject pool and then randomly assigned to two groups. The nine subjects in group 1 received an oral administration of the drug prior to being tested. The ten subjects in group 2 received a placebo at the same time. The scores were as follows:

Group 1 (n1 = 9):    12  14  10   8  16   5   3   9  11
Group 2 (n2 = 10):   21  18  14  20  11  19   8  12  13  15


Total score for Group 1: ΣX1 = 88
Total score for Group 2: ΣX2 = 151

On the assumption that the scores are normally distributed, we wish to test whether the two groups are significantly different at the 5 % significance level.

Hypothesis: H0 : µ1 = µ2 against Ha : µ1 ≠ µ2
Significance level: α = 0.05
Critical region: Reject H0 if |tcal| > t*, where t* is the t-table value corresponding to n1 + n2 - 2 = 9 + 10 - 2 = 17 degrees of freedom at the 5 % significance level. Thus t* = 2.110.
Test statistics: The means and sums of squares, SS(x), are

Group 1: x̄1 = 9.778 and SS(x1) = 135.56
Group 2: x̄2 = 15.100 and SS(x2) = 164.90

$$s_{pooled}^2 = \frac{SS(x_1) + SS(x_2)}{n_1 + n_2 - 2} = \frac{135.56 + 164.90}{9 + 10 - 2} = 17.6742$$

Hence

$$S.E.(\bar{x}_1 - \bar{x}_2) = \sqrt{s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} = \sqrt{17.6742\left(\frac{1}{9} + \frac{1}{10}\right)} = 1.93$$

Thus,

$$t_{calc} = \frac{\bar{x}_1 - \bar{x}_2 - 0}{S.E.(\bar{x}_1 - \bar{x}_2)} = \frac{9.778 - 15.10}{1.93} = -2.758$$

Conclusion: We reject H0 since |tcalc| = 2.758 is greater than t* = 2.110 and conclude that the scores of the experimental group are significantly lower than those of the control group, by an estimated 5.32 units (standard error 1.93).

2.5 The process of setting hypotheses and testing


The investigator sets up the study objectives, which are translated into questions to be answered by the data collected. These questions are formulated as hypotheses. By convention, the null hypothesis is stated with an equality sign, whereas the alternative hypothesis is stated with a strict inequality (unequal, less than or greater than), based on the available information on the direction of the effect. The basic forms of the null and alternative hypotheses for a two-sample test are as follows.

Null hypothesis
H0 : µ1 = µ2 or µ1 - µ2 = 0

Possible alternative hypotheses
i) Ha : µ1 ≠ µ2 or µ1 - µ2 ≠ 0 < two tailed test >
ii) Ha : µ1 < µ2 or µ1 - µ2 < 0 < one tailed test >
iii) Ha : µ1 > µ2 or µ1 - µ2 > 0 < one tailed test >

The following are the necessary steps when performing a hypothesis test.

Step 1: State the assumptions associated with the random variable(s) related to the population(s) under investigation. Often, the random effect is assumed to be independently and identically distributed normal with a fixed mean and a constant variance.

Step 2: State both the null and alternative hypotheses. Normally, the alternative hypothesis is the statement we wish to prove.

Step 3: State the significance level. This is the probability of a type I error, that is, the probability of rejecting the null hypothesis when it is true. It is commonly referred to as the experimental error rate. The conventional levels are 10 %, 5 % and 1 %.

Step 4: Set the critical rule or decision rule. It is becoming customary to use the p-value, which is the probability, computed assuming the null hypothesis is true, of obtaining a test statistic at least as extreme as the one observed. The smaller the p-value, the stronger the evidence against the null hypothesis. Reject the null hypothesis when the p-value is less than the significance level. The rejection region consists of those values of the test statistic that lead to the rejection of the null hypothesis.

Step 5: Compute the test statistics. These are the sample mean(s), the variance of the mean(s), the standard error of the mean or mean difference, and the degrees of freedom. In general, the test statistic is the quantity calculated from the sample data that is used to test the null hypothesis.

Step 6: Draw the conclusions by comparing these statistics with the critical value(s). If the null hypothesis is rejected, declare that there is sufficient evidence in support of the alternative hypothesis. Otherwise, there is not enough evidence.

Two probability distributions, namely the normal and t-distributions, are used. The t-distribution is used when the variance(s) is/are unknown and the sample size is less than 30.


The normal distribution is used when the sample size is greater than or equal to 30. The variance is still estimated if it is unknown.

2.6 Inferences about a population proportion

Suppose P denotes the proportion in the population with the attribute. This is referred to as the probability of success. Suppose that a random sample of size n is drawn, so that the number in the sample with the attribute follows a binomial distribution. Let y be the number in the sample with the attribute. We estimate P using the statistic p, where

p = y/n = total number with the attribute divided by the sample size.

For large n, the sampling distribution of p is approximately normal with mean P and variance P(1 - P)/n.

Suppose the hypothesis is stated as H0 : P = P0 against Ha : P ≠ P0

Under the assumption that H0 is true, the variance of p becomes P0(1 - P0)/n. Thus,

$$z_{calc} = \frac{p - P_0}{\sqrt{\dfrac{P_0(1 - P_0)}{n}}}$$

Example 2.5 A recent report claimed that 20% of all college graduates find a job in their chosen area of study. A survey of a random sample of 500 graduates found that 110 obtained work in their area. Is there statistical evidence to refute the claim?

Solution: If P denotes the proportion of college graduates who find a job in their area of study, then

H0 : P = 0.20 against Ha : P ≠ 0.20

We denote the test statistic by p, the proportion of successes in the sample. Thus,

p = 110/500 = 0.22


The standardised test statistic is

$$z_{calc} = \frac{p - 0.20}{\sqrt{\dfrac{(0.20)(0.80)}{500}}} = \frac{0.22 - 0.20}{0.018} = 1.11$$

In the case of a one-sided test, p-value = P(z ≥ 1.11) = 0.1335. For a two-sided test, p-value = 2(0.1335) = 0.267.

Conclusion: There is not enough evidence to reject the null hypothesis at the 5 % significance level. Thus, we cannot refute the claim.

Testing with confidence intervals

The null hypothesis H0 : P = P0 against Ha : P ≠ P0 is rejected at the α level of significance if and only if the hypothesised value P0 falls outside a (1 - α)100% confidence interval for P.

Example 2.6 A news report in a major city stated that 80% of all violent crimes in that city involve firearms. A survey of violent crimes in the city for the past 2 years revealed that of 283 violent crimes, 240 involved firearms. Determine with a confidence interval whether the news report is correct.

Solution: H0 : P = 0.80 against Ha : P ≠ 0.80

Given P0 = 0.80, n = 283, y = 240

A 95% confidence interval for P is

$$p \pm 1.96\sqrt{\frac{p(1 - p)}{n}}$$

But p = 240/283 = 0.848


Thus,

$$0.848 \pm 1.96\sqrt{\frac{0.848(0.152)}{283}} = 0.848 \pm 0.042$$

The 95% confidence interval for P is (0.806, 0.890). We reject H0 at the 5% significance level because the hypothesised value P0 = 0.80 does not fall in the interval.
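Both proportion procedures can be sketched with the standard library alone. The function names `proportion_z_test` and `proportion_ci` are assumptions made for this illustration, and the two-sided p-value is obtained from the normal distribution through `math.erfc` rather than from a table. The numbers are those of Examples 2.5 and 2.6; note that the text rounds the standard error to 0.018 in Example 2.5, so its z of 1.11 differs slightly from the unrounded value.

```python
import math


def proportion_z_test(y, n, p0):
    """Return (sample proportion, z statistic, two-sided p-value) for H0: P = p0."""
    p_hat = y / n
    se0 = math.sqrt(p0 * (1 - p0) / n)               # standard error under H0
    z = (p_hat - p0) / se0
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))   # equals 2 * P(Z >= |z|)
    return p_hat, z, p_two_sided


def proportion_ci(y, n, z_crit=1.96):
    """Return the (lower, upper) limits of a large-sample interval for P."""
    p_hat = y / n
    margin = z_crit * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


# Example 2.5: 110 of 500 graduates, claimed proportion 0.20.
p_hat, z, p_value = proportion_z_test(110, 500, 0.20)
print(f"p = {p_hat:.2f}, z = {z:.3f}, two-sided p-value = {p_value:.3f}")

# Example 2.6: 240 of 283 crimes, claimed proportion 0.80.
lower, upper = proportion_ci(240, 283)
print(f"95% interval: ({lower:.3f}, {upper:.3f})")
print("reject H0" if not (lower <= 0.80 <= upper) else "fail to reject H0")
```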

Exercises 2.3

2.16 A claim was made that 60% of the adult population thinks that there is too much violence on television. A random sample of 200 adults found that 110 thought that there is too much violence on television. Is this enough evidence to reject the claim?

2.17 The government believes that no more than 25% of all college students would favour reducing the penalties for the use of marijuana. A sample of 2 400 college students revealed that 750 favour reducing the penalties.

a) Set up null and alternative hypotheses to evaluate the government's claim.
b) Give the form of the standardised test statistic and calculate its value.
c) Compute the p-value and determine whether there is sufficient statistical evidence to reject the government's claim.
d) State your conclusion.

2.18 A psychologist has developed a new aptitude test and believes that 80% of the public should score above 50 on the test. From a sample of 200 people, 164 scored above 50.

a) Is there statistical evidence that the claim made by the psychologist is not valid?
b) For the results to be significant at the 5% level of significance, how many out of 200 will have to score above 50 on the aptitude test?


3. ANALYSIS OF VARIANCE

3.1 Introduction to completely randomised design

Testing the equality of two population means is achieved through t-distribution procedures. The experimenter is at liberty to select the type I error rate (the probability of rejecting the null hypothesis when it is true) when setting the critical region or rejection rule. We draw inference on the population means based on the sample data. Using t-tests becomes complicated when there are more than two population means. For instance, with four treatments we require $\binom{4}{2} = 6$ (read "4 choose 2") pairwise comparisons, namely {(1,2), (1,3), (1,4), (2,3), (2,4), (3,4)}. Each comparison carries a type I error rate of, say, α, so the probability of making at least one type I error grows with the number of pairwise comparisons. The analysis of variance is an alternative procedure that tests the equality of all the population means simultaneously while keeping the type I error rate at α.

The design of an experiment is the process of planning a study. Conclusions are drawn from such experiments. The analysis of variance is concerned with the comparison of t population (treatment) means µ1, µ2, ..., µt. We would like to use sample results to draw inference on the means.

The model
A statistical model for an observation made on subject j receiving treatment i, denoted yij, is expressed as

$$y_{ij} = \mu_i + \varepsilon_{ij}, \quad \text{where } \mu_i = \mu + \tau_i, \quad i = 1, 2, \ldots, t, \quad j = 1, 2, \ldots, n_i$$

µ = overall mean.
µi = mean of the ith population or treatment.
τi = ith treatment effect.
εij = random effect due to the jth replication receiving the ith treatment.

The statistical model can be written in two forms, namely the means model and the fixed effects model. That is,

Means model: yij = µi + εij
Fixed effects model: yij = µ + τi + εij

The null hypothesis related to the fixed effects model is


H0 : τ1 = τ2 = ... = τt = 0, where τi = µi - µ, i = 1, 2, . . ., t. The null hypothesis for the means model is stated as H0 : µ1 = µ2 = ... = µt = µ, and the alternative hypothesis as Ha : not all treatment means are equal (i.e. µi ≠ µi' for some i ≠ i').

Assumptions
The statistical model for a completely randomised design is based on the following assumptions.

• Each population is normally distributed. That is, yij ~ N(µi, σ²), i = 1, 2, . . ., t.
• The variance, denoted σ², is the same for each population.
• The observations must be independent.

Usually, the above assumptions are summarised by the following mathematical expression: εij ~ i.i.d. N(0, σ²), i = 1, 2, . . ., t, j = 1, 2, ..., ni. Here 'i.i.d.' stands for independent and identically distributed, and N(0, σ²) denotes the normal distribution with mean zero and constant variance σ².

The design layout
Suppose an investigator intends to carry out an experiment to investigate the performance of four varieties. Suppose the available experimental material allows for 12 homogeneous experimental units. Thus, each variety occupies three units, say plots. Denote the 4 varieties by V1, V2, V3 and V4. A simple randomisation approach is to write each variety number on three pieces of paper, giving 12 pieces in all, fold each of them and shuffle them. Pick the pieces one at a time at random and allocate the varieties to the units sequentially. For this example, the layout as a completely randomised design would be

V4 V1 V3 V2 V2 V3 V4 V1 V4 V2 V1 V3
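The paper-slip randomisation described above can be mimicked in software. The sketch below is one possible way to do it, assuming Python's standard `random` module; the function name `randomise_crd` and the seed value are illustrative choices, not part of the module.

```python
import random


def randomise_crd(treatments, replications, seed=None):
    """Return a random allocation of treatments to units for a completely randomised design."""
    rng = random.Random(seed)                     # a seed makes the layout reproducible
    labels = [trt for trt in treatments for _ in range(replications)]
    rng.shuffle(labels)                           # plays the role of shuffling the paper slips
    return labels


# Four varieties, three plots each, allocated to 12 units in sequence.
layout = randomise_crd(["V1", "V2", "V3", "V4"], replications=3, seed=2002)
for unit, variety in enumerate(layout, start=1):
    print(f"unit {unit:2d}: {variety}")
```

Recording the seed alongside the field plan allows the allocation to be regenerated later if needed.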

The estimation
Under H0 : µ1 = µ2 = ... = µt = µ, each sample observation would have been drawn from the same normal probability distribution with mean µ and variance σ². Recall that the sampling distribution of the sample mean ȳ for a simple random sample of size n from a normal population is normal with mean µ and standard deviation

$$\sigma_{\bar{y}} = \frac{\sigma}{\sqrt{n}}$$

The best estimate of the mean of the sampling distribution of ȳ is the mean of the individual sample means. That is,

$$\bar{y}_{..} = \frac{\bar{y}_{1.} + \bar{y}_{2.} + \cdots + \bar{y}_{t.}}{t}$$

The between-samples variation provides a good estimate of σ² only if the null hypothesis is true. If the null hypothesis is false, the between-samples variation overestimates σ². The within-samples variation provides a good estimate of σ² in either case. If the null hypothesis is true, both estimates will be similar and their ratio will be close to 1. If the null hypothesis is false, the between-samples estimate will be larger than the within-samples estimate, and their ratio will be large.

The analysis of variance is a statistical technique for testing the hypothesis that the means of three or more populations (treatments) are equal. It can also be used to test the hypothesis that the means of two populations are equal.

Pooling is the process of combining the results of two or more independent simple random samples to provide an estimate of σ². When a simple random sample is selected from each population, each of the sample variances provides an unbiased estimate of the population variance σ². The estimate of σ² obtained by combining the individual estimates into an overall estimate is called the within-samples estimate.

The sample mean for the ith treatment

$$\bar{y}_{i.} = \frac{1}{n_i}\sum_{j=1}^{n_i} y_{ij}, \quad i = 1, 2, \ldots, t$$

The sample variance for the ith treatment

$$s_i^2 = \frac{1}{n_i - 1}\sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i.})^2$$

The total sample size

nT = n1 + n2 + . . . + nt

Recall that variance is a measure of the dispersion in a set of responses and is calculated as the average squared distance of the responses from their mean.

3.2 Between samples estimate of population variance

Consider the sample means, each estimating the population mean of one of the treatments under investigation. These sample means are statistics with a sampling distribution that is normal. The sampling distribution of sample mean i is ȳi. ~ N(µi, σ²/ni), i = 1, 2, . . ., t.

The investigator wishes to assess how each of these sample means differs from ȳ.., which estimates the overall population mean µ. The component that measures the deviation of the sample means from the overall sample mean is called the mean square between, denoted MSB. This is defined as

$$MSB = \frac{SSB}{t - 1} = \frac{1}{t - 1}\sum_{i=1}^{t} n_i(\bar{y}_{i.} - \bar{y}_{..})^2$$

where SSB = sum of squares between treatment means. The deviations are squared to remove the negative signs, and the divisor t - 1 is the corresponding degrees of freedom. Each deviation is weighted by the corresponding number of replications ni. The MSB is sometimes referred to as the systematic variance and can be explained in terms of the independent variables, independent groups or treatments. For instance, suppose we wish to test the performance of a pressure cooker at three different pressure settings. We run the pressure cooker at 20, 40 and 60 kilopascals and record the temperature at which the water boils. We take five temperature readings at each pressure. The deviations of the three means of five readings from the overall mean provide the measure of systematic variance.

The MSB is an unbiased estimator of σ² under H0. If the means of the t populations are not equal, the MSB is not an unbiased estimator of σ²; it overestimates it.

3.3 Within samples estimate of population variance

The component that measures the deviation of each observation from its own treatment mean is called the mean square within, denoted MSW. It is also an estimate of σ² and is defined as

$$MSW = \frac{SSW}{n_T - t} = \frac{1}{n_T - t}\sum_{i=1}^{t} (n_i - 1)s_i^2$$

where SSW = sum of squares within the treatments, and

$$s_i^2 = \frac{1}{n_i - 1}\sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i.})^2$$

is the sample variance for treatment i. The estimate MSW is not influenced by whether or not the null hypothesis is true, unlike the MSB. It always provides an unbiased estimate of σ². The MSW is referred to as the error variance or random error. This refers to the random variation that we find whenever we select random samples from a population.

3.4 Comparing the variance estimates


The sampling distribution of the ratio MSB/MSW of the two independent estimates of σ² follows an F distribution under H0. Thus, under H0 and when the assumptions are valid, the sampling distribution of MSB/MSW is an F distribution with numerator degrees of freedom equal to t - 1 and denominator degrees of freedom equal to nT - t. In general, an F random variable is the ratio of two independent chi-square random variables, each divided by its degrees of freedom. Thus, the range of F is from zero to positive infinity.
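To make the two variance estimates and their ratio concrete, the following sketch computes MSB, MSW and the F ratio from treatment summary statistics. The function name `anova_from_summary` is an assumption made for this illustration; the inputs are the sample sizes, means and variances quoted for the three manufacturers in Example 3.1, and the critical value 4.26 is read from a standard F table for (2, 9) degrees of freedom at 5 %, not computed.

```python
def anova_from_summary(n, means, variances):
    """Return (MSB, MSW, F, numerator df, denominator df) from per-treatment summaries."""
    t = len(means)
    n_total = sum(n)
    # Overall mean, weighted by sample size (equals the mean of the means when all n_i are equal).
    grand_mean = sum(ni * m for ni, m in zip(n, means)) / n_total
    ssb = sum(ni * (m - grand_mean) ** 2 for ni, m in zip(n, means))   # between treatments
    ssw = sum((ni - 1) * v for ni, v in zip(n, variances))            # within treatments
    msb = ssb / (t - 1)
    msw = ssw / (n_total - t)
    return msb, msw, msb / msw, t - 1, n_total - t


# Summary statistics quoted in Example 3.1 (three manufacturers, four batches each).
n = [4, 4, 4]
means = [23, 28, 21]
variances = [6.67, 4.67, 3.33]

msb, msw, f_calc, df1, df2 = anova_from_summary(n, means, variances)
f_table = 4.26  # assumed table value for (2, 9) degrees of freedom at 5 %

print(f"MSB = {msb:.2f}, MSW = {msw:.2f}, F = {f_calc:.2f} on ({df1}, {df2}) df")
print("reject H0" if f_calc > f_table else "fail to reject H0")
```

Working from summary statistics keeps the sketch short; with raw data the same quantities follow from the treatment means and variances computed first.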

The value of MSB/MSW is inflated when the means of the t populations are not equal, because MSB then overestimates σ². Hence, we reject H0 if the resulting value of MSB/MSW appears too large to have been selected at random from an F distribution with t - 1 numerator and nT - t denominator degrees of freedom. The value of MSB/MSW that will cause us to reject H0 depends on α, the level of significance. Table D provides the sampling distribution of MSB/MSW and the rejection region associated with a level of significance equal to α, where Fα denotes the critical value. To read the value of F from the table, you need the numerator and denominator degrees of freedom and the level of significance α. Locate the F value that corresponds to the within degrees of freedom in the first column and the between degrees of freedom in the first row for the given α level. Often, the F table is provided for α = 0.05 or 0.01.

You will note later that most statistical software produces all of these statistics. An important statistic among them is the p-value, which is the probability, computed under H0, of obtaining an F value at least as large as the calculated F value. The decision to reject or not to reject the null hypothesis is based on the comparison between the p-value and the α level. We reject the null hypothesis if p-value < α and otherwise fail to reject.

3.5 Computation formulae

The formulae previously discussed are difficult to apply by hand. Equivalent formulae that are easier to use are presented below.

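• Sum of squares total:

$$\mathrm{SST} = \sum_{i=1}^{t}\sum_{j=1}^{n_i} y_{ij}^{2} - \mathrm{C.F.}$$

where C.F. is the correction factor, calculated as

$$\mathrm{C.F.} = \frac{1}{n_T}\Big(\sum_{i=1}^{t}\sum_{j=1}^{n_i} y_{ij}\Big)^{2}.$$

• Sum of squares treatment:

$$\mathrm{SSTrt} = \sum_{i=1}^{t} n_i\,\bar{y}_{i.}^{2} - \mathrm{C.F.}$$

where $\bar{y}_{i.} = \dfrac{1}{n_i}\sum_{j=1}^{n_i} y_{ij}$ is the mean of treatment i.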


• Sum of squares error (SSE) = SST - SSTrt.

The mean squares are computed as the ratio of each sum of squares to its degrees of freedom. The analysis of variance (ANOVA) table is a convenient display of the between, within and total sums of squares, the associated degrees of freedom and the mean squares. All of its entries are non-negative. A general ANOVA table follows:

| Source of Variation | Degrees of Freedom | Sum of Squares | Mean Square | Fcalc | Ftable |
|---|---|---|---|---|---|
| Between | $t - 1$ | SSB | MSB | MSB/MSW | |
| Within | $n_T - t$ | SSW | MSW | | |
| Total | $n_T - 1$ | SST | | | |

The analysis of variance can be viewed as the process of partitioning the total sum of squares and degrees of freedom into two sources, between and within. Dividing each sum of squares by the appropriate degrees of freedom provides the variance estimates and the F value used to test the hypothesis of equal population means. The degrees of freedom and the sum of squares are the only additive columns. Thus, only two entries in each of these columns need to be computed; the third can be obtained by subtraction.

Example 3.1

To test if the mean time needed to mix a batch of material is the same for machines produced by three manufacturers, the following data on the time (in minutes) needed to mix the material were obtained.

| Manufacturer | 1 | 2 | 3 |
|---|---|---|---|
| | 20 | 28 | 20 |
| | 26 | 26 | 19 |
| | 24 | 31 | 23 |
| | 22 | 27 | 22 |
| Sample mean $\bar{y}_{i.}$ | 23 | 28 | 21 |
| Sample variance $s_i^2$ | 6.67 | 4.67 | 3.33 |

Test if the population mean times needed to mix a batch of material differ for the three manufacturers at the 5 % significance level.

Solution

Treatments, t = 3, and sample size per treatment, $n_1 = n_2 = n_3 = n = 4$.

$$\bar{y}_{..} = \frac{\bar{y}_{1.} + \bar{y}_{2.} + \bar{y}_{3.}}{3} = \frac{23 + 28 + 21}{3} = 24.$$


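Or

$$\bar{y}_{..} = \frac{1}{n_T}\sum_{i=1}^{t}\sum_{j=1}^{n_i} y_{ij} = \frac{288}{12} = 24.$$

$$\mathrm{SSB} = \sum_{i=1}^{t} n_i(\bar{y}_{i.} - \bar{y}_{..})^{2} = 4(23-24)^{2} + 4(28-24)^{2} + 4(21-24)^{2} = 104.$$

$$\mathrm{MSB} = \frac{\mathrm{SSB}}{t-1} = \frac{104}{2} = 52.$$

$$\mathrm{SSW} = \sum_{i=1}^{t}(n_i - 1)s_i^{2} = 3(6.67) + 3(4.67) + 3(3.33) = 44.01.$$

$$\mathrm{MSW} = \frac{\mathrm{SSW}}{n_T - t} = \frac{44.01}{12-3} = 4.89.$$

$$F_{calc} = \frac{\mathrm{MSB}}{\mathrm{MSW}} = \frac{52}{4.89} = 10.63.$$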

Ftable, 0.05(2, 9) = 4.26

The ANOVA Table

| Source of Variation | Degrees of Freedom | Sum of Squares | Mean Square | Fcalc | Ftable, 0.05 |
|---|---|---|---|---|---|
| Between | 2 | 104.00 | 52.00 | 10.63 | 4.26 |
| Within | 9 | 44.01 | 4.89 | | |
| Total | 11 | 148.01 | | | |

Conclusion: Since Fcalc = 10.63 > Ftable, 0.05(2, 9) = 4.26, we reject the null hypothesis that the mean time needed to mix a batch of material is the same for each manufacturer at the 5 % significance level. This means that there is at least one significant difference between the means.

The rejection of the null hypothesis using F does not pinpoint where the specific differences are. Further analysis is therefore required to investigate which treatment means are different. Multiple comparison tests (some more conservative than others) are used to achieve this. If the structure of the treatments is known a priori, contrast or regression techniques could be used. For instance, if the treatments have a qualitative structure, then reasonable contrasts can be constructed. If the structure is quantitative, then regression techniques can be applied. If the treatment structure is not known at all, which is unusual, multiple comparison test techniques can be used.
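The arithmetic of Example 3.1 can be checked with a short script. The following is a minimal Python sketch, not part of the original module: it works from the sample sizes, means and variances quoted in the example, and the function names are illustrative. The p-value helper assumes the numerator degrees of freedom equal 2 (as they do here with three treatments), in which case the upper tail of the F distribution has the closed form $(1 + 2F/\nu_2)^{-\nu_2/2}$; for other numerator degrees of freedom a statistical package would be needed.

```python
# One-way ANOVA from group summaries (values quoted in Example 3.1).
n = [4, 4, 4]                    # sample size per manufacturer
means = [23.0, 28.0, 21.0]       # sample means
variances = [6.67, 4.67, 3.33]   # sample variances


def one_way_anova(n, means, variances):
    """Return SSB, SSW, MSB, MSW, F and the two degrees of freedom."""
    t = len(n)
    n_total = sum(n)
    grand_mean = sum(ni * m for ni, m in zip(n, means)) / n_total
    ssb = sum(ni * (m - grand_mean) ** 2 for ni, m in zip(n, means))
    ssw = sum((ni - 1) * v for ni, v in zip(n, variances))
    df_between, df_within = t - 1, n_total - t
    msb, msw = ssb / df_between, ssw / df_within
    return ssb, ssw, msb, msw, msb / msw, df_between, df_within


def f_upper_tail_df1_equals_2(f_value, df_within):
    """P(F > f_value) for an F distribution with 2 numerator df (closed form)."""
    return (1.0 + 2.0 * f_value / df_within) ** (-df_within / 2.0)


ssb, ssw, msb, msw, f_calc, df_b, df_w = one_way_anova(n, means, variances)
p_value = f_upper_tail_df1_equals_2(f_calc, df_w)

print(f"SSB = {ssb:.2f}, SSW = {ssw:.2f}")
print(f"MSB = {msb:.2f}, MSW = {msw:.2f}")
print(f"F = {f_calc:.2f} on ({df_b}, {df_w}) df, p-value = {p_value:.4f}")
print("Reject H0 at the 5 % level" if p_value < 0.05 else "Fail to reject H0")
```

Comparing the p-value with α leads to the same decision as comparing Fcalc with the tabulated critical value.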


It should be noted that a nonsignificant F test in the analysis of variance indicates the failure of the experiment to detect any difference among treatments. A nonsignificant F test does not in any way prove that all treatments are the same, because the failure to detect a treatment difference could be the result of a very small or nil treatment difference, a very large experimental error, or both. Thus, one needs to examine the size of the experimental error and the numerical differences among treatment means whenever the F test is nonsignificant.

Steps in testing hypothesis

Below are useful steps to follow when conducting a test of hypothesis.
• State the statistical model and the associated assumptions based on the design of experiment used and the treatment structure.
• State the null and alternative hypotheses, based on the interest of the investigator.
• Choose the level of significance α, which depends on the desired confidence to be attached to the results.
• Develop the critical region (rejection region), which depends on the alternative hypothesis.
• Compute the test statistics: the sums of squares, mean squares, calculated F and p-value.
• Draw conclusions based on the analysis of variance results. Further statistical analyses are directed by the outcome of the ANOVA results.

3.6 Advantages and disadvantages

A completely randomised design has the following advantages over other designs:
• Easy to set up and analyse;
• Provides the maximum number of degrees of freedom for estimation of error variation;
• Missing values cause no difficulty.

Disadvantages
• The approach is insensitive when the experimental units are heterogeneous. This is because it assumes the units to be homogeneous;
• It is difficult to maintain homogeneity among units when the number of treatments is large. Thus, the approach is suitable only for small numbers of treatments.

Exercise 3.1

3.1 Use the F table to determine the critical value that a calculated F would have to exceed at the 0.01 significance level in each of the following cases:

i) F at df1 = 14 and df2 = 100 ii) F at df1 = 2 and df2 = 40


iii) F at df1 = 9 and df2 = 30

3.2 Consider the following set of data on scores.

| Group A | Group B | Group C |
|---|---|---|
| 62 | 60 | 59 |
| 60 | 60 | 49 |
| 50 | 58 | 49 |
| 48 | 53 | 47 |
| 47 | 49 | 42 |

i) Find the total sum of squares, the within groups sum of squares, and the between groups sum of squares.
ii) Present your results in an analysis of variance table.
iii) Is there enough evidence at the 5 % significance level to suggest that the three treatment means are significantly different?

3.3 A researcher investigates emotional stability in three groups of children: a control group who come from a stable background, children who have been physically abused, and children who have been sexually abused. Higher scores indicate greater stability. The researcher wants to test the hypothesis that any abused child shows less emotional stability. The following are the data.

| Control | Physically abused | Sexually abused |
|---|---|---|
| 8 | 3 | 4 |
| 9 | 4 | 2 |
| 7 | 3 | 2 |
| 8 | 2 | 3 |
| 9 | 4 | 3 |

a) State both the null and alternative hypotheses.
b) Set up an ANOVA table and test whether there is a significant difference between the groups at the 5 % significance level.

3.4 A researcher wants to know what type of humour appeals most to students. She looks at three different types: slapstick, puns and stand-up comedy. Three different groups laughed as follows.

| Slapstick | Puns | Stand-up comedy |
|---|---|---|
| 5 | 3 | 8 |
| 3 | 6 | 6 |
| 5 | 4 | 4 |
| 4 | 9 | 3 |
| 6 | 3 | 3 |


Conduct a one-way analysis of variance to test if the three types differ significantly at 5 % significance level.

3.5 A study investigated the perception of corporate ethical values among individuals specialising in marketing. The following scores were recorded, where higher scores indicate higher ethical values.

| | Marketing Managers | Marketing Research | Advertising |
|---|---|---|---|
| | 6 | 5 | 6 |
| | 5 | 5 | 7 |
| | 4 | 4 | 6 |
| | 5 | 4 | 5 |
| | 6 | 5 | 6 |
| | 4 | 4 | 6 |
| Sample mean | 5 | 4.5 | 6 |
| Sample variance | 0.8 | 0.3 | 0.4 |

Using a 5 % significance level, test if there are significant differences in perception among the groups of specialists.

3.6 As a result of the recent revisions to the tax law, investment in equity instruments has become increasingly attractive. The accompanying table lists the annual internal rates of return for several different investment portfolios managed by three separate investment firms.

Firm A Firm B Firm C

16.9 15.1 10.0 15.0 12.5 13.1

16.2 13.0 12.3 15.8 11.8 10.2 17.1 8.9

Carry out the analysis of the above data to test the equality of the three investment firms with respect to the mean annual internal rate of return earned on portfolios. Use a 5 % significance level.

3.7 Samples of peanut butter produced by three different manufacturers are tested for aflatoxin content with the following results:

| Brand A | Brand B | Brand C |
|---|---|---|
| 2.5 | 2.5 | 2.3 |
| 6.3 | 1.8 | 1.5 |
| 3.1 | 3.6 | 0.4 |
| 2.7 | 4.1 | 3.8 |
| 5.5 | 1.2 | 2.2 |
| 4.3 | 0.7 | 1.0 |


a) Determine whether there is a significant difference between the brand means at the 5 % significance level.
b) Outline the assumptions for a valid analysis of variance.

3.8 The following are the litres per 100 kilometres which a test driver obtained with measured quantities of five brands of petrol containing various additives:

| Brand S | Brand T | Brand C | Brand M | Brand E |
|---|---|---|---|---|
| 8.71 | 8.11 | 8.71 | 9.80 | 9.40 |
| 11.20 | 8.71 | 8.71 | 11.20 | 11.76 |
| 10.69 | 7.35 | 10.23 | 11.20 | 9.80 |

Test the hypothesis that the five brands of petrol give the same results. Use the 1 % significance level.

3.9 A postgraduate student in the Department of Dietetics studied the effect of diet on

blood sugar. Originally 32 subjects were selected for their uniformity and assigned randomly to four diet groups; eight individuals per diet group. A mishap resulted in the loss of the records for six subjects. The following are the results for the remaining cases:

Diet I II III IV 24 26 30 30 18 21 32 28 25 23 29 27 23 25 25 23 22 20 31 31 24 33 25

20 29 28

Determine whether the four diets have different effects on blood sugar levels. Use a 5 % significance level.

3.10 In an assessment of five different reading programmes, a number of children judged to be equivalent in abilities on the basis of pre-testing were assigned at random to the five programmes. Assessments of the reading capacities of the children completing the programmes produced the following scores:

| Programme I | Programme II | Programme III | Programme IV | Programme V |
|---|---|---|---|---|
| 63 | 81 | 72 | 59 | 62 |
| 67 | 71 | 77 | 65 | 71 |
| 59 | 74 | 79 | 70 | 73 |
| 60 | 70 | 83 | 71 | 67 |
| 72 | 73 | 70 | 67 | 68 |
| 58 | 83 | 82 | 60 | 61 |
| 65 | 79 | 71 | 62 | 68 |
| 64 | 80 | 77 | 66 | 66 |

Determine whether there are any differences among the five programmes. Use a 5 % significance level.

3.11 A random sample of 16 observations was selected from each of four populations. A portion of the ANOVA table is given below:

Source of Variation | Degrees of Freedom | Sum of Squares | Mean Square | F
Between 400
Within
Total 1 500

a) Complete the missing entries in the ANOVA table.
b) Test whether the treatment means of the four populations are equal, using a 5 % significance level.

3.12 Random samples of 25 observations were selected from each of three populations. For these data, sum of squares between (SSB) = 120 and sum of squares within (SSW) = 216.

a) Set up the ANOVA table for this problem.
b) What is the critical value of F? Use a 5 % significance level.
c) Are the three population means equal at the 5 % significance level?


4. BEYOND ANALYSIS OF VARIANCE

4.1 Introduction

A common first step is to subject the data to an analysis of variance to determine whether or not significant differences exist among the treatment means. The overall F test provides statistical evidence of the existence of some significant difference between the treatments under investigation. For instance, a rejection of the null hypothesis indicates that the treatment means are not all equal. That is, with three treatments, at least one of µ1 ≠ µ2, µ1 ≠ µ3 or µ2 ≠ µ3 holds. The F test cannot tell us where the differences between the means lie. While the t test applies only to two treatments, the F test applies to two or more treatments.

Interest then lies in finding the source of the difference that contributed to an overall significant F test. Various procedures are in use under such circumstances. A recent approach suggests the use of regression techniques if the treatments are of a quantitative nature. If they are of a qualitative nature and a priori information on the treatment structure is available, appropriate contrasts ("questions") can be formulated and tested through the ANOVA. If no structure is known and the treatments are of a qualitative nature, multiple comparison procedures can then be applied.

4.2 Multiple comparison procedures

After the analysis of variance, the data are further analysed in an attempt to explain the nature of the response in more detail. A number of statistical procedures may be used for this purpose. Among these are:

• Fitting response functions using regression techniques.
• Planned sets of contrasts among means, or groups of means.
• Pairwise multiple comparison procedures.

Some of these procedures are appropriate with some kinds of treatments and entirely inappropriate with others, so they are used under different circumstances. Most commonly used are the post hoc tests, which are modified t tests intended to control the familywise error rate.

Fisher's least significant difference (LSD)

This is the most widely used method for making pairwise comparisons of treatment means. Suppose the overall F test led to a rejection of H0: µ1 = µ2 = µ3. Any of the following pairwise differences could be the cause:

i) H0: µ1 = µ2 against Ha: µ1 ≠ µ2
ii) H0: µ1 = µ3 against Ha: µ1 ≠ µ3
iii) H0: µ2 = µ3 against Ha: µ2 ≠ µ3

To test any of the above possibilities, t test procedures can be applied. The test statistic for Fisher's LSD at the 5 % significance level is computed as follows:


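$$\mathrm{LSD}_{0.05} = t_{0.025} \times \mathrm{s.e.}(\bar{y}_{i} - \bar{y}_{i'})$$

where $t_{0.025}$ is read with the degrees of freedom associated with MSW, and

$$\mathrm{s.e.}(\bar{y}_{i} - \bar{y}_{i'}) = \sqrt{\mathrm{MSW}\Big(\frac{1}{n_i} + \frac{1}{n_{i'}}\Big)}.$$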

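Reject H0: $\mu_i = \mu_{i'}$ in support of the alternative at the 5 % significance level if $|\bar{y}_{i} - \bar{y}_{i'}| > \mathrm{LSD}_{0.05}$. Fisher's LSD test is commonly referred to as the protected or restricted LSD because it is only applied when the overall F test is significant.

Example 4.1

Consider the information obtained in Example 3.1.

| Treatment i | 1 | 2 | 3 |
|---|---|---|---|
| Sample size $n_i$ | 4 | 4 | 4 |
| Sample mean $\bar{y}_{i.}$ | 23 | 28 | 21 |

MSW = 4.89

$$\mathrm{s.e.}(\bar{y}_{i} - \bar{y}_{i'}) = \sqrt{\mathrm{MSW}\Big(\frac{1}{n_i} + \frac{1}{n_{i'}}\Big)} = \sqrt{4.89\Big(\frac{1}{4} + \frac{1}{4}\Big)} = 1.564$$

The table value with 9 degrees of freedom at the 5 % significance level is $t_{0.025} = 2.262$. Thus,

$$\mathrm{LSD}_{0.05} = 2.262 \times 1.564 = 3.538$$

Reject H0: $\mu_i = \mu_{i'}$ if $|\bar{y}_{i} - \bar{y}_{i'}| > \mathrm{LSD}_{0.05}$.

| Treatment difference | Difference | Status |
|---|---|---|
| $\bar{y}_1 - \bar{y}_2$ = 23 - 28 | -5 | Significant |
| $\bar{y}_1 - \bar{y}_3$ = 23 - 21 | 2 | Not significant |
| $\bar{y}_2 - \bar{y}_3$ = 28 - 21 | 7 | Significant |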
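The pairwise LSD comparisons can also be scripted. Below is a minimal Python sketch, not part of the original module, that assumes the summary values of Example 4.1 and takes the critical value $t_{0.025}$ with 9 degrees of freedom from the t table rather than computing it; the function and variable names are illustrative. It recomputes the standard error directly from MSW, so it handles unequal sample sizes as well.

```python
import math
from itertools import combinations

# Summary values quoted in Example 4.1.
means = {1: 23.0, 2: 28.0, 3: 21.0}   # treatment means
sizes = {1: 4, 2: 4, 3: 4}            # sample sizes
msw = 4.89                            # mean square within
t_crit = 2.262                        # t_{0.025} with 9 df, read from the t table


def lsd_comparisons(means, sizes, msw, t_crit):
    """Return (i, j, difference, LSD, significant?) for every pair of treatments."""
    rows = []
    for i, j in combinations(sorted(means), 2):
        se = math.sqrt(msw * (1.0 / sizes[i] + 1.0 / sizes[j]))
        lsd = t_crit * se
        diff = means[i] - means[j]
        rows.append((i, j, diff, lsd, abs(diff) > lsd))
    return rows


results = lsd_comparisons(means, sizes, msw, t_crit)
for i, j, diff, lsd, significant in results:
    status = "significant" if significant else "not significant"
    print(f"Trt {i} vs Trt {j}: diff = {diff:+.1f}, LSD = {lsd:.3f}, {status}")
```

Because the LSD is recomputed for each pair, the same function applies unchanged when the treatments have different numbers of observations.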

The overall F test assured us that at least two of the treatment means are significantly different at the 5 % significance level. Further analysis using Fisher's LSD indicates that the differences lie between treatment means 1 and 2 and between treatment means 2 and 3. Treatment mean 1 is not significantly different from treatment mean 3 at the 5 % significance level.

A confidence interval estimate of the form $(\bar{y}_{i} - \bar{y}_{i'}) \pm \mathrm{LSD}_{0.05}$ can also be used for the same test. If the interval includes the value 0, we fail to reject the hypothesis that the treatment means are equal. However, if the confidence interval does not include the value 0, we conclude that there is a difference between the treatment means. With $\mathrm{LSD}_{0.05} = 2.262 \times 1.564 = 3.538$, the 95 % confidence intervals for the treatment differences are:

Trt 1 vs Trt 2: (-5 ± 3.538) = (-8.538, -1.462)
Trt 1 vs Trt 3: (2 ± 3.538) = (-1.538, 5.538)
Trt 2 vs Trt 3: (7 ± 3.538) = (3.462, 10.538)

Comparison-wise Type I error rate: This is the error rate that indicates the level of significance associated with a single statistical test. Thus, the comparison-wise Type I error rate remains α, say α = 0.05.

Experiment-wise Type I error rate: Suppose we conduct pairwise tests and, for each single t test, we set α = 0.05. The probability that we will not make a Type I error is 1 - 0.05 = 0.95 for each test. The probability that we will not make a Type I error in two consecutive independent t tests is (0.95)(0.95) = 0.9025. Thus, the probability of making at least one Type I error is 1 - 0.9025 = 0.0975. When we sequentially test two sets of hypotheses, the Type I error rate associated with this is not 0.05, but actually 0.0975. This Type I error rate is called the experiment-wise Type I error rate.

In general, suppose we consider k treatments. The number of possible pairwise comparisons, C, is

$$C = \binom{k}{2} = \frac{k!}{(k-2)!\,2!} = \frac{k(k-1)}{2}$$

The probability of making at least one Type I error, the experiment-wise Type I error rate, is

$$\alpha_{EW} = 1 - (1-\alpha)^{C}$$

when the C tests are treated as independent. Fisher's LSD procedure therefore leads to an experiment-wise Type I error rate that depends on the comparison-wise Type I error rate, α, and the number of comparisons, C.

Bonferroni adjustment: $\alpha_{EW} = 1 - (1-\alpha)^{C} < C\alpha$ for C > 1. Thus, the probability of making at least one Type I error in the overall experiment can be held at or below a chosen $\alpha_{EW}$ if we use a comparison-wise Type I error rate of size $\alpha_{EW}/C$.

Example 4.2

Refer to the information in Example 4.1, using α = 0.05. The number of treatments is k = 3, thus the number of possible pairwise comparisons is

$$C = \frac{k(k-1)}{2} = \frac{3(3-1)}{2} = 3.$$


$$\alpha_{EW} = 1 - (1-\alpha)^{C} = 1 - (1-0.05)^{3} = 1 - (0.95)^{3} = 0.143$$

$$C\alpha = 3(0.05) = 0.15$$
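The calculation above extends to any number of treatments. The following minimal Python sketch, not part of the original module, tabulates the number of pairwise comparisons, the experiment-wise rate and the bound Cα for several values of k; it treats the tests as independent, as the formula does, and the function names are illustrative.

```python
def pairwise_comparisons(k):
    """Number of pairwise comparisons among k treatments: k(k - 1)/2."""
    return k * (k - 1) // 2


def experimentwise_rate(k, alpha):
    """Return (C, 1 - (1 - alpha)^C, C * alpha) for k treatments."""
    c = pairwise_comparisons(k)
    return c, 1.0 - (1.0 - alpha) ** c, c * alpha


def bonferroni_alpha(target_alpha_ew, k):
    """Comparison-wise rate that keeps the experiment-wise rate at or below the target."""
    return target_alpha_ew / pairwise_comparisons(k)


for k in (2, 3, 4, 5, 10):
    c, alpha_ew, bound = experimentwise_rate(k, 0.05)
    print(f"k = {k:2d}: C = {c:2d}, alpha_EW = {alpha_ew:.3f}, C*alpha = {bound:.2f}, "
          f"Bonferroni alpha = {bonferroni_alpha(0.05, k):.4f}")
```

The table makes it plain how quickly the experiment-wise rate grows with the number of treatments when every pair is tested at α = 0.05.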

Here $\alpha_{EW} = 0.143 < C\alpha = 0.15$, as the Bonferroni bound requires. Dividing the experiment-wise rate by the number of comparisons, $\alpha_{EW}/C = 0.143/3 = 0.048$, approximately recovers the comparison-wise rate of 0.05. To hold the experiment-wise rate at 0.05 instead, the adjustment would use a comparison-wise rate of 0.05/3 ≈ 0.017.

Tukey's procedure: This allows one to perform tests of all possible pairwise comparisons and still maintain an overall experiment-wise Type I error rate, such as $\alpha_{EW} = 0.05$. It uses the "studentised range" probability distribution. It assumes that all treatment means have the same sample size, n, and equal variances. However, a generalised Tukey's test can be used for the case of unequal sample sizes. Under these assumptions the sampling distribution of

$$q = \frac{\bar{y}_{max} - \bar{y}_{min}}{\sqrt{\mathrm{MSW}/n}}$$

where

$\bar{y}_{max}$ = largest sample mean,
$\bar{y}_{min}$ = smallest sample mean, and
MSW = mean square within treatments,

follows a studentised range distribution. Tukey's significant difference is denoted

$$\mathrm{TSD} = q\sqrt{\frac{\mathrm{MSW}}{n}}$$

Tukey's procedure is an unprotected testing approach: it does not require a significant overall F test first. Thus, Tukey's procedure provides an alternative to the analysis of variance for testing whether the treatment means of k populations are equal. However, to use Tukey's procedure we need to estimate the population variance using MSW.

Example 4.3

Consider the information given in Example 4.1.

| Treatment i | 1 | 2 | 3 |
|---|---|---|---|
| Sample size $n_i$ | 4 | 4 | 4 |
| Sample mean $\bar{y}_{i.}$ | 23 | 28 | 21 |

MSW = 4.89
Error degrees of freedom = 9


The critical value of the studentised range, q(α, k, v), for k = 3 treatments and v = 9 error degrees of freedom at the 5 % significance level is obtained from Table E. Thus,

q = q(α, k, v) = q(0.05, 3, 9) = 3.95

Hence,

$$\mathrm{TSD} = q\sqrt{\frac{\mathrm{MSW}}{n}} = 3.95\sqrt{\frac{4.89}{4}} = 4.367$$

$\bar{y}_{max}$ = 28
$\bar{y}_{min}$ = 21

We reject H0: µ2 = µ3 since $|\bar{y}_2 - \bar{y}_3|$ = 7 > TSD = 4.367 and conclude that the two treatment means are significantly different at the 5 % significance level. Similarly, we reject H0: µ2 = µ1 since $|\bar{y}_2 - \bar{y}_1|$ = 5 > TSD = 4.367 and conclude that the two treatment means are significantly different at the 5 % significance level. But we fail to reject H0: µ1 = µ3 since $|\bar{y}_1 - \bar{y}_3|$ = 2 < TSD = 4.367 and conclude that the two treatment means are not significantly different at the 5 % significance level.

These conclusions can be summarised as follows:

Ordered treatment means: µ2 µ1 µ3, with a line drawn under µ1 and µ3 only.

Any two treatment means sharing the same line are not significantly different at the 5 % significance level.

Remark: Multiple comparison tests are the most often used and the most often misused procedures. Their purpose is to detect possible groups among a set of unstructured treatments. They are not meant for quantitative treatments, for which fitting response functions by regression is more appropriate. Nor are they intended to substitute for meaningful orthogonal comparisons, which can be formulated in advance based on the treatments used. The following points should be noted:

• Care should be taken to select a statistical procedure which is appropriate for the data being analysed.

• For experiments involving factorial sets of treatments or graded levels of quantitative factors there is almost always a statistical procedure, which can be specified in advance and which is more appropriate than a multiple comparison test.

• For experiments involving qualitative treatments it is often possible to form planned sets of comparisons to answer the objectives of experiment.

• Multiple comparison tests may be useful for grouping means from experiments involving unstructured qualitative treatments.


• Indiscriminate use of multiple comparison tests can result in loss of information and reduced efficiency when more appropriate procedures are available.
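For unstructured treatments where a multiple comparison test is appropriate, the Tukey comparisons of Example 4.3 can be reproduced with a few lines of code. This is a minimal Python sketch, not part of the original module; it assumes equal sample sizes and takes the studentised range value q(0.05, 3, 9) = 3.95 from the table as an input rather than computing it. Variable names are illustrative.

```python
import math
from itertools import combinations

# Summary values quoted in Example 4.3.
means = {1: 23.0, 2: 28.0, 3: 21.0}   # treatment means
n_per_treatment = 4                   # common sample size
msw = 4.89                            # mean square within
q_crit = 3.95                         # q(0.05, 3, 9), read from the studentised range table

# Tukey's significant difference: one yardstick shared by every pair.
tsd = q_crit * math.sqrt(msw / n_per_treatment)

# Treatments ordered from largest to smallest mean, as in the underline display.
ordered = sorted(means, key=means.get, reverse=True)

# Pairs whose means differ by more than TSD are declared significantly different.
significant_pairs = [
    (i, j) for i, j in combinations(sorted(means), 2)
    if abs(means[i] - means[j]) > tsd
]

print(f"TSD = {tsd:.3f}")
print("Ordered treatments:", ordered)
print("Significantly different pairs:", significant_pairs)
```

Pairs absent from the printed list are the ones that would share a line in the ordered display of means.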

Exercise 4.1

Refer to Exercise 3.1 to answer the following questions.

4.1 Refer to the results from question 3.2 to compute the LSD at the 5 % significance level and determine which treatment means are different.

4.2 Refer to the results from question 3.3 to compute Tukey's (TSD) critical value at the 5 % significance level and determine which treatment means are different.

4.3 Refer to the results from question 3.6 to construct a 95 % confidence interval for each pairwise difference of treatment means. Use these intervals to test the equality of these means.

4.4 Refer to the results from question 3.8 to compute Tukey's (TSD) critical value at the 5 % significance level and determine which treatment means are different.

4.5 Refer to the results from question 3.7 to compute the LSD at the 5 % significance level and determine which treatment means are different.


5. RANDOMISED COMPLETE BLOCK DESIGN

5.1 Introduction

Extraneous factors not considered in the experiment can inflate the mean square within, also called the mean square error (MSE). This causes the F value to be small, thus signalling no significant difference among treatment means when in fact such a difference exists. We wish to compare the treatment means when all known variation is controlled, or rather eliminated, from the experimental error. One way of eliminating the known variation from the experimental error is by grouping the experimental units into homogeneous groups, commonly known as "blocks."

For instance, if an experiment is to be carried out in KZN and race or gender is truly known to have an effect, the setting of the experiment should take this known variation into consideration. The race could be used as blocks. The treatments under study should be applied to each block, where each application is based on independent randomisation. Or if a study involves assessment of different types of fuel in a given city, the cars should be considered as blocks because they are known to differ in fuel consumption. Or if different management practices are to be compared within the farming community in KZN, the size of farm (small, medium or large) should be considered as a blocking factor, because farms of different sizes are known to differ and this information should be incorporated into the experiment. Or if an experiment involves comparing different animal feeds, where the breed is known to have an effect, then the breed should be used as a blocking factor. Or if an agronomist wants to conduct an experiment on a field known to have different levels of soil fertility, then this information should be used as a blocking factor. And so on.

The randomised complete block design (RCBD) draws its name from the fact that the treatments are allocated at random in each block. Independent randomisation is applied in each block. "Complete" implies that each block contains a complete set of treatments.

This is an extension of a completely randomised design (CRD) to a situation where the experimental units are no longer homogeneous. The principle behind this design is to divide all experimental units into homogeneous groups before applying the treatments. Each group is referred to as a block, or as a replication in the case of a balanced design. The design is balanced because each treatment occurs equally often in each block. Differences between blocks cancel out for any comparison of treatments. The criteria applied in grouping should ensure that there is minimum variation within the blocks and maximum variation between them. Differences between blocks are then removed from the 'random' or unexplained variation.

The following should be noted with a RCBD:
a) Blocks should be laid perpendicular to the gradient in case of a directional variation.
b) Blocks need not be contiguous.


c) It is possible to replicate within a block. That is to say, a treatment may appear more than once in a block.
d) A block should signify a known source of variation that needs to be controlled by the experiment.
e) All the treatments should be randomised within each block, ensuring independent randomisation in each block.

Even when no obvious natural blocks exist, it is still sensible to define blocks representing major patterns of variation. Consider an experiment involving different varieties. Harvesting may be carried out one block per day if it is impossible to harvest all blocks on a single day. Such blocking controls the variability that may be introduced by the day (due to rain, for example).

Missing data can also occur in a RCBD. The good thing with the design is that the analysis can still be performed in the event of losing a complete block. A major restriction in the use of this design is the requirement that all treatments must appear in each block.

5.2 Aspect of blocking

The analysis of a completely randomised design assumes that the experimental units are homogeneous. Under this assumption, any difference between the treatment groups is expected to be due to the treatments only. Hence, the within treatments variation is assumed to be purely random. The experimental error is overestimated if the assumption is not true. The blocking technique is meant to utilise a priori information concerning the nature of the experimental units. Blocking is therefore defined as the process of grouping the experimental units into homogeneous groups such that the variation within the blocks is minimised and that between blocks is maximised. The approach aims at obtaining an unbiased estimate of the experimental error.

The field layout

Consider an experiment set up to investigate the effect of 5 nitrogen levels on the growth of a new variety. Three types of soil are used as the blocking factor. Thus, 5 x 3 = 15 experimental units were used. We denote the nitrogen levels by N0, N1, N2, N3 and N4. Suppose the soil types were clay, loam and sand. The five nitrogen levels are randomly assigned within each block, and a new randomisation is used for each block. The layout is presented below.

Block 1: N1 N3 N0 N2 N4
Block 2: N2 N4 N1 N0 N3
Block 3: N4 N0 N1 N3 N2

5.3 The model

Suppose the experimental material is grouped into b homogeneous groups (referred to as blocks) and the t treatments under investigation are randomly assigned within each block, ensuring independent randomisation for each block. Suppose $y_{ij}$ is the response variable corresponding to treatment i measured on block j, where i = 1, 2, . . ., t and j = 1, 2, . . ., b. We assume one measurement on each treatment in each block. We also assume that there is no treatment by block interaction. The response variable $y_{ij}$ is partitioned into components due to the overall mean, block, treatment and random error effects. The mathematical expression is

$$y_{ij} = \mu + \tau_i + \beta_j + \varepsilon_{ij}$$

where
µ = the overall mean
$\tau_i$ = the ith treatment effect
$\beta_j$ = the jth block effect
$\varepsilon_{ij}$ = the random effect

The random effects $\varepsilon_{ij}$ are assumed to be independently and identically distributed normal with zero mean and constant variance, i.e. $\varepsilon_{ij} \sim \text{i.i.d. } N(0, \sigma^2)$. The model is also assumed to be additive. The data collected from b blocks involving t treatments are usually summarised in a two-way table with treatment and block totals as follows:

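| Treatment | Block 1 | Block 2 | Block 3 | . . . | Block b | Treatment Totals |
|---|---|---|---|---|---|---|
| 1 | $y_{11}$ | $y_{12}$ | $y_{13}$ | . . . | $y_{1b}$ | $y_{1.}$ |
| 2 | $y_{21}$ | $y_{22}$ | $y_{23}$ | . . . | $y_{2b}$ | $y_{2.}$ |
| . . . | . . . | . . . | . . . | . . . | . . . | . . . |
| t | $y_{t1}$ | $y_{t2}$ | $y_{t3}$ | . . . | $y_{tb}$ | $y_{t.}$ |
| Block Totals | $y_{.1}$ | $y_{.2}$ | $y_{.3}$ | . . . | $y_{.b}$ | $y_{..}$ |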


Notations

The marginal treatment and block totals are denoted

$$y_{i.} = \sum_{j=1}^{b} y_{ij} \quad \text{and} \quad y_{.j} = \sum_{i=1}^{t} y_{ij},$$

respectively. The overall total is denoted

$$y_{..} = \sum_{i=1}^{t}\sum_{j=1}^{b} y_{ij}.$$

Similarly, the marginal means for the treatments and blocks are

$$\bar{y}_{i.} = \frac{1}{b}\sum_{j=1}^{b} y_{ij} \quad \text{and} \quad \bar{y}_{.j} = \frac{1}{t}\sum_{i=1}^{t} y_{ij},$$

respectively. The grand mean is

$$\bar{y}_{..} = \frac{1}{bt}\sum_{i=1}^{t}\sum_{j=1}^{b} y_{ij}.$$
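As an illustration of this notation, the marginal totals and means can be computed directly from a two-way table. The sketch below uses the tune-up times of Example 5.1, with rows as treatments and columns as blocks; the variable names are chosen for the example only.

```python
# Rows are treatments (i = 1..t), columns are blocks (j = 1..b).
# Data: Example 5.1, analysers (computerised, electronic) by car type
# (compact, intermediate, full-size).
y = [
    [50, 55, 63],
    [42, 44, 46],
]
t = len(y)       # number of treatments
b = len(y[0])    # number of blocks

treatment_totals = [sum(row) for row in y]                           # y_i.
block_totals = [sum(y[i][j] for i in range(t)) for j in range(b)]    # y_.j
grand_total = sum(treatment_totals)                                  # y_..

treatment_means = [total / b for total in treatment_totals]
block_means = [total / t for total in block_totals]
grand_mean = grand_total / (b * t)

print("Treatment totals:", treatment_totals)
print("Block totals:", block_totals)
print("Grand total:", grand_total, "Grand mean:", grand_mean)
```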

Definition formulae

A sum of squares is a squared deviation summed over the levels. Thus, the total sum of squares measures the overall deviation of each observation from the overall mean. These deviations are summed over the levels of treatments and blocks.

Sum of squares total:

$$SSTotal = \sum_{i=1}^{t}\sum_{j=1}^{b} (y_{ij} - \bar{y}_{..})^2$$

The total sum of squares is partitioned into three components: those due to blocks, treatments and random effects. The sum of squares for blocks measures the deviation of the block means from the overall mean.

Sum of squares block:

$$SSBlk = t\sum_{j=1}^{b} (\bar{y}_{.j} - \bar{y}_{..})^2$$

Similarly, the sum of squares treatment is a measure of deviation of treatment means from the overall mean.

Sum of squares treatment:

$$SSTrt = b\sum_{i=1}^{t} (\bar{y}_{i.} - \bar{y}_{..})^2$$

The sum of squares error measures the variation within experimental units, that is, the random variation among experimental units treated alike. It is also referred to as a measure of uncontrollable variation within the experimental units.

Sum of squares error:

$$SSE = \sum_{i=1}^{t}\sum_{j=1}^{b} (y_{ij} - \bar{y}_{i.} - \bar{y}_{.j} + \bar{y}_{..})^2$$
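The definition formulae can be checked numerically. The following sketch applies them to the Example 5.1 data (rows are treatments, columns are blocks) and confirms that the three components add up to the total sum of squares. The function name `definition_sums_of_squares` is introduced for illustration only.

```python
def definition_sums_of_squares(y):
    """Sums of squares for an RCBD from the definition formulae.

    y[i][j] is the observation for treatment i in block j.
    """
    t, b = len(y), len(y[0])
    grand_mean = sum(sum(row) for row in y) / (b * t)
    trt_means = [sum(row) / b for row in y]
    blk_means = [sum(y[i][j] for i in range(t)) / t for j in range(b)]

    ss_total = sum((y[i][j] - grand_mean) ** 2
                   for i in range(t) for j in range(b))
    ss_blk = t * sum((m - grand_mean) ** 2 for m in blk_means)
    ss_trt = b * sum((m - grand_mean) ** 2 for m in trt_means)
    ss_err = sum((y[i][j] - trt_means[i] - blk_means[j] + grand_mean) ** 2
                 for i in range(t) for j in range(b))
    return ss_total, ss_blk, ss_trt, ss_err

y = [[50, 55, 63], [42, 44, 46]]  # Example 5.1 tune-up times
ss_total, ss_blk, ss_trt, ss_err = definition_sums_of_squares(y)
print(ss_total, ss_blk, ss_trt, ss_err)
# The partition SSTotal = SSBlk + SSTrt + SSE should hold up to rounding.
print(abs(ss_total - (ss_blk + ss_trt + ss_err)) < 1e-9)
```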

Computation formulae

The analysis using the definition formulae is tedious. Algebraically equivalent formulae are therefore often used instead. We refer to these as the computation formulae for the sums of squares.


Usually the first item to compute is the correction factor, which is the sum of squares for the mean. This requires adding all the bt observations, squaring the result and dividing it by the total number of observations, bt. Thus,

$$C.F. = \frac{\left(\sum_{i=1}^{t}\sum_{j=1}^{b} y_{ij}\right)^2}{bt}$$

The total sum of squares requires each of the bt observations to be squared and summed, and the correction factor to be subtracted from the result. Thus,

$$SSTotal = \sum_{i=1}^{t}\sum_{j=1}^{b} y_{ij}^2 - C.F.$$

An easier way to compute the sum of squares for blocks and the sum of squares for treatments is to construct a two-way table of totals, both body and marginal. To compute the sum of squares for blocks, square each block total, add the squares, divide the sum by the number of treatments t and then subtract the correction factor. Thus,

$$SSBlk = \frac{1}{t}\sum_{j=1}^{b} y_{.j}^2 - C.F.$$

Similarly, the sum of squares for treatments is obtained by squaring each treatment total, adding the squares, dividing the sum by the number of blocks b and then subtracting the correction factor. Thus,

$$SSTrt = \frac{1}{b}\sum_{i=1}^{t} y_{i.}^2 - C.F.$$
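The equivalence between the definition and computation formulae can be verified by expanding the square. For the treatment sum of squares, a short derivation (using only the totals and means defined under Notations) is:

$$
\begin{aligned}
b\sum_{i=1}^{t} (\bar{y}_{i.} - \bar{y}_{..})^2
&= b\sum_{i=1}^{t} \bar{y}_{i.}^2 - 2b\,\bar{y}_{..}\sum_{i=1}^{t} \bar{y}_{i.} + bt\,\bar{y}_{..}^2 \\
&= b\sum_{i=1}^{t} \bar{y}_{i.}^2 - bt\,\bar{y}_{..}^2
\qquad \text{since } \sum_{i=1}^{t} \bar{y}_{i.} = t\,\bar{y}_{..} \\
&= \frac{1}{b}\sum_{i=1}^{t} y_{i.}^2 - \frac{y_{..}^2}{bt}
\qquad \text{since } \bar{y}_{i.} = \frac{y_{i.}}{b},\ \bar{y}_{..} = \frac{y_{..}}{bt} \\
&= \frac{1}{b}\sum_{i=1}^{t} y_{i.}^2 - C.F.
\end{aligned}
$$

The same steps, with the roles of t and b exchanged, give the computation formula for the block sum of squares.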

The additivity of the model allows the sum of squares error to be computed by subtracting both SSBlk and SSTrt from SSTotal. Thus,

SSE = SSTotal - SSBlk - SSTrt

The above sums of squares are called corrected or adjusted sums of squares. The unadjusted or uncorrected sums of squares are obtained when the correction factor is not subtracted during the computation.

The total degrees of freedom (df), obtained by subtracting one from the total number of observations, are bt - 1. These are partitioned into degrees of freedom due to treatments, blocks and error: (t-1) df due to treatments, (b-1) df due to blocks and (b-1)(t-1) df due to error.

Computation of mean squares

The mean squares are computed by dividing each sum of squares by its degrees of freedom. Under the normality assumption, a sum of squares divided by $\sigma^2$ follows a chi-square distribution with the corresponding degrees of freedom (for blocks and treatments, this holds when the respective effects are all zero).

Mean square blocks:

$$MSBlk = \frac{SSBlk}{b-1}$$

When there are no block effects, the quantity $SSBlk/\sigma^2$ is distributed chi-square with b-1 degrees of freedom.


Mean square treatment:

$$MSTrt = \frac{SSTrt}{t-1}$$

Similarly, when there are no treatment effects, $SSTrt/\sigma^2$ is distributed chi-square with t-1 degrees of freedom.

Mean square error:

$$MSE = \frac{SSE}{(b-1)(t-1)}$$

Also, $SSE/\sigma^2$ is distributed chi-square with (b-1)(t-1) degrees of freedom.

Computation of the F-value

When the null hypothesis of equal treatment means is true, that is, H0: $\tau_1 = \tau_2 = \dots = \tau_t = 0$, the ratio of MSTrt to MSE has an F-distribution with (t-1) numerator degrees of freedom and (b-1)(t-1) denominator degrees of freedom, and both MSTrt and MSE are unbiased estimators of the common variance $\sigma^2$. If the treatment effects are not equal, MSTrt tends to be larger than MSE. The larger the ratio, the more likely we are to reject the null hypothesis in favour of the alternative. Therefore, the calculated F-value for testing the null hypothesis at a specified significance level is

$$F_{calc} = \frac{MSTrt}{MSE}$$

The calculated F-value is compared against the tabulated value FTable obtained with (t-1) numerator df and (b-1)(t-1) denominator df at the α significance level. The null hypothesis is rejected if Fcalc is greater than FTable.

Similarly, the ratio of MSBlk to MSE is distributed F with (b-1) numerator degrees of freedom and (b-1)(t-1) denominator degrees of freedom. Often this test is not performed, simply because the information about the blocks is known a priori. The hypothesis tested by this ratio depends on whether the blocks are considered random or fixed effects. When blocks are considered fixed effects, the quantity

$$F_{calc} = \frac{MSBlk}{MSE}$$

tests H0: $\beta_1 = \beta_2 = \dots = \beta_b = 0$ against Ha: at least two blocks are different. When the blocks are considered random effects, the interest is in assessing the block variability, which indicates how effective the blocking was. The hypothesis tested by the calculated F under this condition is H0: $\sigma_\beta^2 = 0$ against Ha: $\sigma_\beta^2 > 0$, where $\sigma_\beta^2$ denotes the variance of the block effects.


Reject the null hypothesis in both cases (fixed or random effects) if Fcalc > FTable, obtained using (b-1) numerator df and (b-1)(t-1) denominator df at the α significance level.

The above computations are summarised in a table called the analysis of variance (ANOVA) table. The format of the ANOVA table is as follows:

Source of     Degrees of    Sum of     Mean       F-Calculated
Variation     freedom       squares    squares
Block         b-1           SSBlk      MSBlk      F = MSBlk/MSE
Treatment     t-1           SSTrt      MSTrt      F = MSTrt/MSE
Error         (b-1)(t-1)    SSE        MSE
Total         bt-1          SSTotal

Example 5.1

An automobile dealer conducted a test to determine if the time needed to complete a minor engine tune-up depends on whether a computerised engine analyser or an electronic analyser is used. Because tune-up time varies among compact, intermediate, and full-size cars, the three types of cars were used as blocks in the experiment. The data obtained are presented below.

                       Analyser
Car                Computerised    Electronic    Block Totals
Compact            50              42            92
Intermediate       55              44            99
Full-size          63              46            109
Treatment Total    168             132           300

We consider cars to be our blocking factor and analysers as the treatments under investigation. Thus we have three blocks and two treatments. We wish to test the equality of the two analyser methods at the 5 % significance level. Note: this is a very insensitive experiment because of the very few degrees of freedom for error.

Hypothesis: H0: $\tau_1 = \tau_2 = 0$ against Ha: $\tau_1 \neq \tau_2$

Critical region: Reject H0: $\tau_1 = \tau_2 = 0$ in favour of Ha: $\tau_1 \neq \tau_2$ if Fcalc > FTable (0.05, 1, 2).

Computation of the sums of squares:

$$C.F. = \frac{\left(\sum_{i=1}^{t}\sum_{j=1}^{b} y_{ij}\right)^2}{bt} = \frac{(300)^2}{(2)(3)} = 15\,000$$


$$SSTotal = \sum_{i=1}^{t}\sum_{j=1}^{b} y_{ij}^2 - C.F. = 50^2 + 42^2 + \dots + 46^2 - C.F. = 15\,310 - 15\,000 = 310$$

$$SSBlk = \frac{1}{t}\sum_{j=1}^{b} y_{.j}^2 - C.F. = \frac{1}{2}(92^2 + 99^2 + 109^2) - C.F. = \frac{1}{2}(30\,146) - 15\,000 = 73$$

$$SSTrt = \frac{1}{b}\sum_{i=1}^{t} y_{i.}^2 - C.F. = \frac{1}{3}(168^2 + 132^2) - C.F. = \frac{1}{3}(45\,648) - 15\,000 = 216$$

SSE = SSTotal - SSBlk - SSTrt = 310 - 73 - 216 = 21

Computation of the mean squares:

$$MSBlk = \frac{SSBlk}{b-1} = \frac{73}{3-1} = 36.5$$

$$MSTrt = \frac{SSTrt}{t-1} = \frac{216}{2-1} = 216$$

$$MSE = \frac{SSE}{(b-1)(t-1)} = \frac{21}{(3-1)(2-1)} = 10.5$$

Computation of Fcalc

For treatments (analysers):

$$F_{calc} = \frac{MSTrt}{MSE} = \frac{216}{10.5} = 20.571$$

For blocks (cars):

$$F_{calc} = \frac{MSBlk}{MSE} = \frac{36.5}{10.5} = 3.476$$

FTable values: FT(0.05, 1, 2) = 18.5; FT(0.05, 2, 2) = 19.0


The ANOVA Table

Source of    Degrees of    Sum of     Mean       F-Calculated    F-Table, 0.05
Variation    freedom       squares    squares
Cars         2             73         36.5       3.476           19.0
Analyser     1             216        216        20.571          18.5
Error        2             21         10.5
Total        5             310
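The hand computation above can be reproduced with a short Python sketch based on the computation formulae. The function name `rcbd_anova` and the layout of its result are assumptions made for this illustration; the critical values are taken from the table above rather than computed.

```python
def rcbd_anova(y):
    """ANOVA for an RCBD; y[i][j] is treatment i in block j."""
    t, b = len(y), len(y[0])
    grand_total = sum(sum(row) for row in y)
    cf = grand_total ** 2 / (b * t)                      # correction factor

    ss_total = sum(v ** 2 for row in y for v in row) - cf
    block_totals = [sum(y[i][j] for i in range(t)) for j in range(b)]
    ss_blk = sum(v ** 2 for v in block_totals) / t - cf
    ss_trt = sum(sum(row) ** 2 for row in y) / b - cf
    ss_err = ss_total - ss_blk - ss_trt                  # by subtraction

    df = {"Block": b - 1, "Treatment": t - 1, "Error": (b - 1) * (t - 1)}
    ss = {"Block": ss_blk, "Treatment": ss_trt, "Error": ss_err}
    ms = {source: ss[source] / df[source] for source in df}
    f = {source: ms[source] / ms["Error"] for source in ("Block", "Treatment")}
    return df, ss, ms, f, ss_total

y = [[50, 55, 63], [42, 44, 46]]  # Example 5.1: analysers by car type
df, ss, ms, f, ss_total = rcbd_anova(y)

f_table = {"Block": 19.0, "Treatment": 18.5}  # 5 % values quoted in the text
for source in ("Block", "Treatment"):
    decision = "reject H0" if f[source] > f_table[source] else "fail to reject H0"
    print(f"{source:<10} df={df[source]} SS={ss[source]:.1f} "
          f"MS={ms[source]:.1f} F={f[source]:.3f} -> {decision}")
print(f"{'Error':<10} df={df['Error']} SS={ss['Error']:.1f} MS={ms['Error']:.1f}")
print(f"{'Total':<10} df={sum(df.values())} SS={ss_total:.1f}")
```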

Conclusions: Reject H0: $\tau_1 = \tau_2 = 0$ in favour of Ha: $\tau_1 \neq \tau_2$ since Fcalc = 20.571 > FT(0.05, 1, 2) = 18.5. Thus, we have enough evidence that the two analyser methods are significantly different at the 5 % significance level.

If we assume the type of car to be a random effect, then we fail to reject H0: $\sigma_\beta^2 = 0$ in favour of Ha: $\sigma_\beta^2 > 0$, where $\sigma_\beta^2$ is the variance of the block (car type) effects, since Fcalc = 3.476 < FT(0.05, 2, 2) = 19.0. Thus, the variability among the car types was not significantly different from zero.

Remark: It should be noted that the multiple comparison tests discussed in Section 3 also apply to the randomised complete block design.

Exercise 5.1

5.1 A nation-wide real estate chain is in the process of comparing townhouse prices in four cities across the country. It is, however, known that the area of a townhouse is also a determining factor in price fixing and should be controlled by using blocks. Therefore, in each city, the selling prices of a 90-square-metre, a 120-square-metre, a 150-square-metre, a 180-square-metre and a 210-square-metre townhouse are randomly selected. The results are recorded to the nearest thousand Rand and are shown below.

Townhouse size (m2)    Bloemfontein    Durban    Port Elizabeth    Jo'burg
90                     165             185       173               200
120                    198             193       181               196
150                    251             215       197               278
180                    312             268       229               332
210                    405             381       294               446

Test whether the townhouse prices in the four cities are significantly different at the 5 % significance level. (Hint: The cities are the treatments and the townhouse sizes are the blocks.)


5.2 Five different auditing procedures were compared with respect to total audit time. To control for possible variation due to the person conducting the audit, four accountants were selected randomly and treated as blocks in the experiment. The following values were obtained using the ANOVA procedures:

SSTotal = 100; SSTrt = 45; SSBlk = 36.

a) Set up an ANOVA Table, filling in the missing information.
b) Test to see if there is any significant difference in total audit time stemming from the auditing procedure used. Use α = 0.05.
c) Determine which treatments could be significantly different, using Tukey's procedures.

5.3 An important factor in selecting software for word-processing and database management systems is the time required to learn how to use a particular system. To evaluate three file management systems, a firm designed a test involving five different word-processing operators. Since operator variability was believed to be a significant factor, each of the five operators was trained on each of the three file management systems. The data obtained are presented below:

                 System
Operator    A     B     C
1           16    16    24
2           19    17    22
3           14    13    19
4           13    12    18
5           18    17    22

a) Carry out the analysis of variance and present your results in an ANOVA Table.
b) Using α = 0.05, test to see if there is any significant difference in mean training times for the three systems.
c) Compute LSD at α = 0.05 and indicate which treatments could be significantly different.
d) Compute TSD at α = 0.05 and indicate which treatments could be significantly different.
e) Comment on the results obtained in parts (c) and (d).

5.4 Three groups of students are to be tested for the percentage of high-level questions asked by each group. As questions can be on various types of material, six lessons are taught to each group and a record is made of the percentage of high-level questions asked by each group on all six lessons.

a) Show a data layout for this situation.
b) Provide an ANOVA Table outline giving only the source of variation and degrees of freedom.

5.5 Suppose the data from question 5.4 are as follows:

               Group
Lesson    A     B     C
1         13    18    7
2         16    25    17
3         28    24    14
4         26    13    15
5         27    16    12
6         23    19    9

Carry out an analysis of variance on these data, treating each lesson as a block, and state your conclusions.

5.6 The effects of four types of graphite coaters on light box readings are to be studied. As these readings might differ from day to day, observations are to be taken on each of the four types every day for three days. The order of testing of the four types on any given day can be randomised. The results are

          Graphite Coater Type
Day    M      A      K      L
1      4.0    4.8    5.0    4.6
2      4.8    5.0    5.2    4.6
3      4.0    4.8    5.6    5.0

a) State the null and alternative hypotheses to test equality of the four graphite coater types.
b) Analyse the data as a randomised complete block design and present your results in an ANOVA Table.
c) Determine whether the four types are significantly different at the 1 % significance level.
d) Determine which types are different at the 1 % significance level using Tukey's test procedures.
e) State your overall conclusions.

5.7 A study on a physical strength measurement in kilogrammes on seven subjects before and after a specified training period gave the results shown below.

Subject    Pretest    Posttest
1          45.36      52.16
2          49.90      56.70
3          40.82      47.63
4          49.90      58.97
5          56.70      63.50
6          58.97      63.75
7          47.63      56.70

a) Carry out the analysis as a paired t-test, stating the hypothesis. Use α = 0.05.
b) Carry out the analysis as a randomised complete block design, using subjects as blocks. Use α = 0.05.
c) Using the results from parts (a) and (b), verify that t2 = F.


6. SPLIT-PLOT DESIGN

6.1 Introduction

A factor is a kind of treatment, and any factor can supply several treatments. For example, if diet is a factor under consideration, then several diets can be used. If baking temperature is a factor, then baking can be done at several temperatures. Such a factor provides a one-way treatment structure. A researcher may be interested in determining the combined effect of two or more factors. For instance, the interest may be in investigating the effect of humidity on seed germination in the presence of temperature. Such a joint effect is referred to as interaction. Formulating all possible combinations of the levels of these factors produces treatment combinations, which are then randomly applied to the experimental units. This process is called factorial arrangement.

6.2 The field layout

Consider the case of an agronomist who wishes to investigate the effect of spacing on maize yield in the presence of nitrogen. Suppose 3 spacings (s1, s2, and s3) and 4 nitrogen levels (n0, n1, n2, and n3) are considered. This is a two-way treatment structure with the two factors being spacing (at 3 levels) and nitrogen (at 4 levels). We formulate all possible combinations as

s1n0, s1n1, s1n2, s1n3, s2n0, s2n1, s2n2, s2n3, s3n0, s3n1, s3n2, s3n3

The 12 treatment combinations are randomly assigned to the experimental units according to the experimental design used, say CRD or RCBD. These treatments should be replicated in order to have an estimate of the experimental error needed for drawing inference.

Sometimes it is not practical to randomly assign these treatments completely according to these designs. Suppose the study involves mechanisation (say m1, m2, m3, etc.) as one factor and variety (v1, v2, v3, etc.) as another. Note that mechanisation may refer to the method of land preparation. It is impractical to formulate these combinations and then randomise them according to CRD or RCBD, especially when mechanisation involves the use of farm machinery. An alternative approach is to randomise the mechanisation factor first and then the varieties over each level of the first factor. We illustrate this point using 3 levels of one factor and 4 levels of the other factor.

Block I
M1: V2 V1 V4 V3
M2: V1 V3 V2 V4
M3: V4 V2 V3 V4

The process is repeated for the other replications, ensuring independent randomisation at each stage. The process involves two stages of randomisation. In the case of RCBD, we first randomise the three levels of mechanisation in each block and then the levels of variety over each level of mechanisation. The design discussed above is called a split-plot design. The word 'treatment' and 'factor'


are used interchangeably in this case since they mean the same thing.

The split-plot design involves a two- or higher-order treatment structure with an incomplete block design structure and at least two different sizes of experimental units. The bigger size is associated with the whole-plot treatment and the smaller size with the sub-plot treatment. The decision on which treatment to apply to a whole-plot or to a sub-plot is based on practicability and the precision required for each treatment. The treatment of greater interest is placed on the smaller experimental unit and that of less interest on the larger unit. The interaction is also measured with higher precision. Since in split-plot experiments variation among sub-plots is expected to be less than among whole-plots, the factors which require smaller amounts of experimental material, or which are of major importance, or which are expected to exhibit smaller differences, or for which greater precision is desired for any reason, are assigned to the sub-plots. The selection of such a design depends on the practicability of the treatments, for example applying fertiliser to whole plots and varieties to sub-plots.

The fact that there are two sizes of experimental units implies that there are two experimental errors, hereafter referred to as error (a) and error (b). The plot layout requires the whole-plot treatments to be randomly applied to the whole-plots and then the sub-plot treatments to be randomly applied within each whole-plot. Each application demands an independent randomisation. Split-plot designs are frequently used for factorial experiments. Such designs may incorporate one or more of the completely random, randomised complete block, or Latin square designs.

6.3 The model

Suppose we wish to investigate the joint effect of two factors, namely A and B, on the yield of maize. Let 'r' equal the number of blocks, 'a' the number of levels of A or whole-plots per block, and 'b' the number of levels of B or sub-plots per whole-plot. Thus, we have ab treatment combinations replicated r times, giving abr experimental units in total. Let $y_{ijk}$ be the observation associated with the ith block, the jth level of factor A and the kth level of factor B. The observation $y_{ijk}$ is expressed in mathematical form as

$$y_{ijk} = \mu + \rho_i + \alpha_j + \gamma_{ij} + \beta_k + (\alpha\beta)_{jk} + \varepsilon_{ijk}$$

i = 1, 2, . . ., r; j = 1, 2, . . ., a; k = 1, 2, . . ., b

where
$\mu$ = overall mean
$\rho_i$ = ith block effect
$\alpha_j$ = jth factor A effect
$\gamma_{ij}$ = ijth random effect associated with the whole-plot factor
$\beta_k$ = kth factor B effect
$(\alpha\beta)_{jk}$ = jkth interaction effect
$\varepsilon_{ijk}$ = random effect associated with the sub-plot factor
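The two-stage randomisation described in Section 6.2 (whole-plot levels first, then sub-plot levels within each whole-plot, independently in every block) can be sketched as follows. The function name `split_plot_layout`, the seed and the labels M1 to M3 and V1 to V4 are used here for illustration only.

```python
import random

def split_plot_layout(whole_plot_levels, sub_plot_levels, n_blocks, seed=None):
    """Return, for each block, a list of (whole-plot level, sub-plot order)."""
    rng = random.Random(seed)
    layout = []
    for _ in range(n_blocks):
        whole_order = list(whole_plot_levels)
        rng.shuffle(whole_order)              # stage 1: whole-plot treatments
        block = []
        for whole in whole_order:
            sub_order = list(sub_plot_levels)
            rng.shuffle(sub_order)            # stage 2: sub-plot treatments
            block.append((whole, sub_order))
        layout.append(block)
    return layout

whole_levels = ["M1", "M2", "M3"]
sub_levels = ["V1", "V2", "V3", "V4"]
layout = split_plot_layout(whole_levels, sub_levels, n_blocks=2, seed=6)
for block_number, block in enumerate(layout, start=1):
    print(f"Block {block_number}")
    for whole, sub_order in block:
        print(f"  {whole}: {' '.join(sub_order)}")
```

Every whole-plot receives each sub-plot level exactly once, and a separate shuffle is drawn at each stage.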


The effects $\gamma_{ij}$ and $\varepsilon_{ijk}$ are assumed to be normally and independently distributed about zero means, with $\sigma_\gamma^2$ as the common variance of the γ's, the whole-plot random components, and with $\sigma^2$ as the common variance of the ε's, the sub-plot random components. The form of the analysis of variance for a two-factor split-plot experiment in a randomised complete block design is presented below.

Source of     Degrees of     Sum of     Mean       F-Calculated
Variation     Freedom        Squares    Squares
Block         r-1
Factor A      a-1
Error (a)     (r-1)(a-1)
Factor B      b-1
A*B           (a-1)(b-1)
Error (b)     a(r-1)(b-1)
Total         abr-1

Error (a) is composed of the interaction between the whole-plot factor and the blocks. As mentioned earlier, the factor A by block interaction is assumed not to exist, so this term serves as the error for testing the equality of the level means of factor A (i.e. Error (a) = A*Block). Error (b) is composed of the factor B by block and the factor A by factor B by block interactions (Error (b) = B*Block + A*B*Block). The effects of factor B and those of the interaction between factors A and B are tested using error (b).

6.4 The analysis

The analysis of variance is illustrated through an example as follows. Four strains of perennial ryegrass were grown as swards at each of two fertiliser levels. The 4 strains were S23, New Zealand, Kent and X. The fertiliser levels were denoted by H, heavy, and A, average. The experiment was laid out as four blocks of four whole plots for the strains, each split in two for application of fertiliser. The midsummer dry matter yields, in units of 10 lb/acre, were as follows:

                              Block
Strains        Manure    1      2      3      4
S23            H         299    318    284    279
               A         247    202    171    183
New Zealand    H         315    247    289    307
               A         257    175    188    174
X              H         403    439    355    324
               A         222    170    192    176
Kent           H         382    353    383    310
               A         233    216    200    143


The whole-plot factor A is Strain and the sub-plot factor B is Manure (fertiliser). With respect to our example, r = 4, a = 4, and b = 2.

Computation of whole-plot analysis

This requires setting up a two-way table of block by factor A totals. Thus

                      Blocks
Strains         1       2       3       4       Strain Totals
S23             546     520     455     462     1983
New Zealand     572     422     477     481     1952
X               625     609     547     500     2281
Kent            615     569     583     453     2220
Block Totals    2358    2120    2062    1896    8436

Correction factor (C.F.)

$$C.F. = \frac{\left(\sum_{i,j,k} y_{ijk}\right)^2}{rab} = \frac{(8436)^2}{32} = 2223940.5$$

Sum of squares for the whole-plots

$$SS(\text{Whole-plot}) = \frac{1}{b}\sum_{i,j} y_{ij.}^2 - C.F. = \frac{1}{2}(546^2 + 520^2 + \dots + 583^2 + 453^2) - C.F. = \frac{1}{2}(4510942) - 2223940.5 = 31530.5$$

Sum of squares due to blocks

$$SSBlk = \frac{1}{ab}\sum_{i} y_{i..}^2 - C.F. = \frac{1}{8}(2358^2 + 2120^2 + 2062^2 + 1896^2) - C.F. = \frac{1}{8}(17901224) - 2223940.5 = 13712.5$$

Sum of squares due to strains

$$SS(\text{Strains}) = SS(A) = \frac{1}{rb}\sum_{j} y_{.j.}^2 - C.F. = \frac{1}{8}(1983^2 + 1952^2 + 2281^2 + 2220^2) - C.F. = \frac{1}{8}(17873954) - 2223940.5 = 10303.75$$
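The whole-plot quantities computed so far, together with the whole-plot error that follows from them by subtraction, can be checked from the raw yields. The sketch below is an illustration only; the dictionary layout and variable names are assumptions made for the example.

```python
# Yields from the ryegrass example; each list holds blocks 1 to 4.
yields = {
    "S23":         {"H": [299, 318, 284, 279], "A": [247, 202, 171, 183]},
    "New Zealand": {"H": [315, 247, 289, 307], "A": [257, 175, 188, 174]},
    "X":           {"H": [403, 439, 355, 324], "A": [222, 170, 192, 176]},
    "Kent":        {"H": [382, 353, 383, 310], "A": [233, 216, 200, 143]},
}
r, a, b = 4, len(yields), 2   # blocks, strains (factor A), manure levels (factor B)

# Two-way table of block by strain totals: one total per whole-plot.
whole_plot_totals = {
    strain: [sum(levels[m][i] for m in levels) for i in range(r)]
    for strain, levels in yields.items()
}
block_totals = [sum(whole_plot_totals[s][i] for s in yields) for i in range(r)]
strain_totals = {s: sum(totals) for s, totals in whole_plot_totals.items()}
grand_total = sum(block_totals)

cf = grand_total ** 2 / (r * a * b)
ss_whole = sum(v ** 2 for totals in whole_plot_totals.values() for v in totals) / b - cf
ss_blk = sum(v ** 2 for v in block_totals) / (a * b) - cf
ss_a = sum(v ** 2 for v in strain_totals.values()) / (r * b) - cf
sse_a = ss_whole - ss_blk - ss_a   # whole-plot error, by subtraction

print("C.F.           ", cf)
print("SS(Whole-plot) ", ss_whole)
print("SSBlk          ", ss_blk)
print("SS(A)          ", ss_a)
print("SSE(a)         ", sse_a)
```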

Sum of squares for whole-plot error

SSE(a) = SS(Whole-plot) - SSBlk - SS(A) = 31530.5 - 13712.5 - 10303.7 = 7514.3

Computation of sub-plot analysis

This section requires a two-way table of factor A by factor B totals.

                       Manure (Factor B)
Strains (Factor A)     H       A       Strain Totals
S23                    1180    803     1983
New Zealand            1158    794     1952
X                      1521    760     2281
Kent                   1428    792     2220
Manure Totals          5287    3149    8436

Sum of squares due to factor B

$$SS(B) = \frac{1}{ra}\sum_{k} y_{..k}^2 - C.F. = \frac{1}{16}(5287^2 + 3149^2) - C.F. = \frac{1}{16}(37868570) - 2223940.5 = 142845.1$$

Sum of squares due to factor A and B interaction

$$SS(AB) = \frac{1}{r}\sum_{j,k} y_{.jk}^2 - C.F. - SS(A) - SS(B) = \frac{1}{4}(1180^2 + 803^2 + \dots + 792^2) - C.F. - SS(A) - SS(B)$$

$$= \frac{1}{4}(9566098) - 2223940.5 - 10303.75 - 142845.13 = 14435.1$$

Sum of squares total

$$SSTotal = \sum_{i,j,k} y_{ijk}^2 - C.F. = 299^2 + 318^2 + \dots + 143^2 - 2223940.5 = 2420734.0 - 2223940.5 = 196793.5$$

Sum of squares for sub-plot error

SSE(b) = SSTotal - SS(Whole-plot) - SS(B) - SS(AB) = 196793.5 - 31530.5 - 142845.1 - 14435.1 = 7982.8

The above calculations are summarised in an ANOVA Table as follows:

The ANOVA Table

Source of variation    D.F.    SS          MS          F-Calculated    F-Table, 0.05
Block                  3       13712.5     4570.8      5.47
Strains                3       10303.8     3434.6      4.11            FT(3, 9) = 3.86
Error (a)              9       7514.3      834.9
Manure                 1       142845.1    142845.1    214.73          FT(1, 12) = 4.75
Strain*Manure          3       14435.1     4811.7      7.23            FT(3, 12) = 3.49
Error (b)              12      7982.8      665.2
Total                  31      196793.5

Critical region:

Testing the four strains: Reject H0: $\alpha_1 = \alpha_2 = \alpha_3 = \alpha_4 = 0$ if F-Calculated > FT(3, 9) = 3.86 and conclude that the strains are significantly different at the 5 % significance level.

Testing the effect of the two levels of manure: Reject H0: $\beta_1 = \beta_2 = 0$ if F-Calculated > FT(1, 12) = 4.75 and conclude that the two levels of manure are significantly different at the 5 % significance level.

Testing for the strain by manure interaction: Reject H0: $(\alpha\beta)_{11} = (\alpha\beta)_{12} = \dots = (\alpha\beta)_{42} = 0$ if F-Calculated > FT(3, 12) = 3.49 and conclude that the interaction between the strains and manure levels is significant at the 5 % significance level.
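The full split-plot analysis above can be reproduced with a short Python sketch. It follows the same computation formulae, testing the whole-plot terms against error (a) and the sub-plot terms against error (b). The helper `total_sq` and the dictionary layout are assumptions made for this illustration.

```python
# Yields from the ryegrass example; each list holds blocks 1 to 4.
yields = {
    "S23":         {"H": [299, 318, 284, 279], "A": [247, 202, 171, 183]},
    "New Zealand": {"H": [315, 247, 289, 307], "A": [257, 175, 188, 174]},
    "X":           {"H": [403, 439, 355, 324], "A": [222, 170, 192, 176]},
    "Kent":        {"H": [382, 353, 383, 310], "A": [233, 216, 200, 143]},
}
strains, manures = list(yields), ["H", "A"]
r, a, b = 4, len(strains), len(manures)

def total_sq(totals, divisor):
    """Sum of squared totals divided by the number of observations per total."""
    return sum(v ** 2 for v in totals) / divisor

all_values = [yields[s][m][i] for s in strains for m in manures for i in range(r)]
cf = sum(all_values) ** 2 / (r * a * b)

block_totals = [sum(yields[s][m][i] for s in strains for m in manures) for i in range(r)]
strain_totals = [sum(sum(yields[s][m]) for m in manures) for s in strains]
manure_totals = [sum(sum(yields[s][m]) for s in strains) for m in manures]
whole_plot_totals = [sum(yields[s][m][i] for m in manures) for s in strains for i in range(r)]
cell_totals = [sum(yields[s][m]) for s in strains for m in manures]

ss = {
    "Block": total_sq(block_totals, a * b) - cf,
    "Strains": total_sq(strain_totals, r * b) - cf,
    "Manure": total_sq(manure_totals, r * a) - cf,
}
ss_whole = total_sq(whole_plot_totals, b) - cf
ss_total = total_sq(all_values, 1) - cf
ss["Error (a)"] = ss_whole - ss["Block"] - ss["Strains"]
ss["Strain*Manure"] = total_sq(cell_totals, r) - cf - ss["Strains"] - ss["Manure"]
ss["Error (b)"] = ss_total - ss_whole - ss["Manure"] - ss["Strain*Manure"]

df = {
    "Block": r - 1, "Strains": a - 1, "Error (a)": (r - 1) * (a - 1),
    "Manure": b - 1, "Strain*Manure": (a - 1) * (b - 1),
    "Error (b)": a * (r - 1) * (b - 1),
}
ms = {source: ss[source] / df[source] for source in df}

# Whole-plot terms are tested against error (a), sub-plot terms against error (b).
f_values = {
    "Block": ms["Block"] / ms["Error (a)"],
    "Strains": ms["Strains"] / ms["Error (a)"],
    "Manure": ms["Manure"] / ms["Error (b)"],
    "Strain*Manure": ms["Strain*Manure"] / ms["Error (b)"],
}

for source in df:
    f_text = f"{f_values[source]:8.2f}" if source in f_values else ""
    print(f"{source:<14}{df[source]:>3}{ss[source]:>12.1f}{ms[source]:>12.1f} {f_text}")
print(f"{'Total':<14}{sum(df.values()):>3}{ss_total:>12.1f}")
```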


Conclusions:

We reject H0: $\alpha_1 = \alpha_2 = \alpha_3 = \alpha_4 = 0$ since F-Calculated = 4.11 > FT(3, 9) = 3.86 and conclude that the strains are significantly different at the 5 % significance level. Based on the means, strain X shows the highest performance, followed by Kent.

We reject H0: $\beta_1 = \beta_2 = 0$ since F-Calculated = 214.73 > FT(1, 12) = 4.75 and conclude that the two levels of manure are significantly different at the 5 % significance level. Based on the means, level H has a higher effect than level A.

We reject H0: $(\alpha\beta)_{11} = (\alpha\beta)_{12} = \dots = (\alpha\beta)_{42} = 0$ since F-Calculated = 7.23 > FT(3, 12) = 3.49 and conclude that the interaction between the strains and manure levels is significant at the 5 % significance level. The interaction is presented graphically in the figure "Strain by manure interaction". In the graph, the heavy level H (plotted as ManureA) consistently performed better than the average level A (plotted as ManureB), whose mean yield appears nearly constant across the strains. It is hard to note the source of the interaction from the graph.
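One way to look for the source of the interaction numerically is to tabulate the strain by manure means and the response to the heavy level (H minus A) for each strain. The sketch below starts from the strain by manure totals given earlier, each being a sum over r = 4 blocks; the variable names are illustrative.

```python
r = 4  # number of blocks contributing to each strain-by-manure total
cell_totals = {
    "S23":         {"H": 1180, "A": 803},
    "New Zealand": {"H": 1158, "A": 794},
    "X":           {"H": 1521, "A": 760},
    "Kent":        {"H": 1428, "A": 792},
}

# Mean yield per strain and manure level.
cell_means = {
    strain: {level: total / r for level, total in levels.items()}
    for strain, levels in cell_totals.items()
}
# Response to the heavy level: difference between the H and A means.
response = {strain: means["H"] - means["A"] for strain, means in cell_means.items()}

for strain in cell_totals:
    print(f"{strain:<12} H={cell_means[strain]['H']:7.2f} "
          f"A={cell_means[strain]['A']:7.2f} H-A={response[strain]:7.2f}")
```

Under these totals the response to the heavy level is noticeably larger for X and Kent than for S23 and New Zealand, which suggests where the interaction may come from.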

Exercises 6.1

6.1 A researcher is interested in the effects of moisture and nitrogen on the growth of wheat plants. In the experiment, a particular variety of wheat is planted in 10 tubs of soil in the greenhouse. Each tub is divided into 3 parts, and different levels of nitrogen (0, 10, 20) are applied randomly, one to each part. Five of the tubs are selected randomly and given high moisture while the other 5 are given normal moisture.

a) Identify both the whole-plot and sub-plot experimental units. Explain.
b) Make a sketch of the field layout and explain the randomisation process.
c) Give an outline of the ANOVA Table (source of variation and degrees of freedom only).

[Figure: Strain by manure interaction. Mean yield (vertical axis, 0 to 400) plotted against strains (S23, NZ, X, Kent), with one line for each manure level (ManureA, ManureB).]


6.2 An experiment was conducted using a split-plot design. The experiment consisted of 3 pairs of identical steers, each pair used as a block, 2 rations (A, B) as whole-plot treatments, and 2 cooking methods (1, 2) as sub-plot treatments. Within each pair of steers, one is assigned at random to ration A and one to ration B. After slaughter, two identical roasts are obtained from each steer and randomly assigned to the two cooking methods. The recorded data are weight losses due to cooking. (Assume methods and rations to be fixed effects.)

                        Block
Method    Ration    Pair1    Pair2    Pair3
1         A         11.0     17.0     11.0
2         A         2.5      9.0      6.5
1         B         5.0      8.0      8.0
2         B         3.5      4.0      4.5

a) Write down a mathematical model, stating the necessary assumptions.
b) State the null hypotheses for testing methods, rations and their interaction.
c) Analyse the data and present your results in an ANOVA Table.
d) State the critical regions for testing the hypotheses stated in part (b).
e) Present a two-way table of treatment means.
f) Compute the standard errors for testing the mean differences for methods, rations, and method by ration interactions.


7. NESTED DESIGNS 7.1 Introduction Consider an experiment involving two fertiliser levels and three varieties. Thus, we have 2 x 3 = 6 treatment combinations. We consider such as case, the factors are said to be crossed. This means that every level of every factor could be used in combination with every level of every other factor. The intersections of these factor levels are the subclasses or cells of the situation, wherein data arise. Absence of data from a cell does not imply non-existence of that cell, only that it has no data. The total number of cells in a crossed classification is the product of the number of levels of the various factors, noting that not all of them may have observations in them. Nesting in design structure occurs when we have sub-units within larger experimental units. Examples: pigs within pens; plants within pots; pies within an oven; farms within a region; technicians within a method; sires within progeny; insecticides within source, etc. In general, levels of B are nested within levels of A. Thus, we do not have A*B interaction effect, but have A effect and B within A (denoted B(A)) effect. More often, in the treatment structure, levels of A are crossed with levels of factor B. The following example illustrates the concept of nested classification: Example 7.1 Suppose that at a university a student survey is carried out to ascertain the reaction to instructors’ usage of a new computing facility. Suppose that all first years have to take English or Geology or Chemistry in their first semester. All three courses in the first semester are large and are divided into sections, each section with a different instructor and not all sections have the same number of students. Each student provided his or her opinion measured on a scale of 1-10, of his instructor’s use of the computer. The investigator’s interest is whether the instructors differ in their use of the computers. 
A schematic representation of this nested classification follows, where (nij) denotes the number of students in section j of course i (i = 1, 2, 3; j = 1, …, 4).

Course:   English       Geology       Chemistry
          Sec.1 (28)    Sec.1 (31)    Sec.1 (27)
          Sec.2 (27)    Sec.2 (29)    Sec.2 (32)
          Sec.3 (30)                  Sec.3 (29)
                                      Sec.4 (30)

A measure of the effect due to section j, say j = 1, taken across the English, Geology and Chemistry courses would be meaningless. This is because the three sections, composed of different groups of students, have nothing in common other than that they are all numbered 1 within their respective courses. The number is only for identification purposes: section 1 of English is in no way related to section 1 of Geology. This is unlike the variety by fertiliser treatment combinations discussed earlier, where fertiliser 1 on variety 1 was the same as fertiliser 1 on variety 2 and on variety 3. The sections are not related in this way; they are entities within their own courses. They are considered as sections within courses, that is, sections are nested within courses. Similarly, the students are nested within sections. An ANOVA outline would be:

Source of variation                          Degrees of freedom
Courses                                      2
Sections within Courses                      2 + 1 + 3 = 6
Students within Sections within Courses      254 (by subtraction)
Total                                        262

The main use of such a design is in assessing the degree of variation due to each component. An interesting question would be: is the variation greater between plants within pots, or between pots within treatments? Nested designs have the characteristic that interaction does not occur, but nesting does. For instance, when we say A is nested in B, we cannot then say A interacts with B. Nesting is often denoted by A(B), A:B or A/B, meaning A is nested in B, and the degrees of freedom are expressed as b(a-1), where a is the number of levels of A within each level of B and b is the number of levels of B. We say that levels of one factor are nested within, or are subsamples of, levels of another factor. Such experiments are also sometimes called hierarchical experiments. For instance, in an on-farm experiment you may have farm types, farms nested within types and replications nested within farms.

Farm types:                  1            2          3
Farms within types:          1  2  3      1  2       1  2  3  4
Replications within farms:   1 2 3   1 2 3   1 2 3   ...   ...

In general there is no limit to the degree of nesting that can be handled. The extent of its use depends entirely on the data and the environment from which they came.

Example 7.2
Consider an experiment involving the product of three manufacturing plants in each of two areas, A and B, and of two plants in area C. The observations on the quality of the product made in the eight manufacturing plants in the three areas are presented below.

Area            A                        B                      C
Plant           I    II     III          I      II     III      I    II
Observations    6    6, 8   6, 7, 8      5, 7   6, 7   6        7    7, 9

Two-way table of totals

Plant          Area A   Area B   Area C   Plant totals
I                 6       12        7          25
II               14       13       16          43
III              21        6        -          27
Area totals      41       31       23          95

(Area C has no plant III.)

Grand total: $\sum_{i,j,k} y_{ijk} = 95$. Total observations = 14.

$$\sum_{i,j,k} y_{ijk}^2 = 6^2 + 6^2 + \dots + 9^2 = 659$$

Correction factor

$$\text{C.F.} = \frac{(95)^2}{14} = 644.64$$

Total sum of squares

$$SS_{Total} = \sum_{i,j,k} y_{ijk}^2 - \text{C.F.} = 659 - 644.64 = 14.36$$

with 13 degrees of freedom.

Area sum of squares

$$SS_{Area} = \frac{41^2}{6} + \frac{31^2}{5} + \frac{23^2}{3} - \text{C.F.} = 648.7 - 644.64 = 4.06$$

with (3 areas - 1 = 2) degrees of freedom.

Plants sum of squares ignoring areas

$$SS_{Plants} = \frac{6^2}{1} + \frac{(6+8)^2}{2} + \dots + \frac{(7+9)^2}{2} - \text{C.F.} = 5.86$$

with (8 plants - 1 = 7) degrees of freedom.

Plants within area sum of squares

SSPlants(Area) = SSPlants (ignoring areas) - SSArea = 5.86 - 4.06 = 1.80 with (7 - 2 = 5) degrees of freedom.

Error sum of squares

SSE = SSTotal - SSPlants (ignoring areas) = 14.36 - 5.86 = 8.50 with 13 - 7 = 6 degrees of freedom.
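The sums of squares of Example 7.2 can be checked with a short script. The following is only a sketch in Python (the module itself prescribes no software); the function name `nested_sums_of_squares` and the dictionary layout are illustrative choices, not part of the original notes.

```python
# Observations from Example 7.2, keyed by area and then by plant.
data = {
    "A": {"I": [6], "II": [6, 8], "III": [6, 7, 8]},
    "B": {"I": [5, 7], "II": [6, 7], "III": [6]},
    "C": {"I": [7], "II": [7, 9]},
}


def nested_sums_of_squares(data):
    """Return {source: (sum of squares, degrees of freedom)} for a two-stage nested layout."""
    all_obs = [y for plants in data.values() for obs in plants.values() for y in obs]
    n_total = len(all_obs)
    correction = sum(all_obs) ** 2 / n_total  # C.F. = (grand total)^2 / N

    ss_total = sum(y * y for y in all_obs) - correction

    # Squared area totals, each divided by the number of observations in that area.
    ss_area = -correction
    for plants in data.values():
        area_obs = [y for obs in plants.values() for y in obs]
        ss_area += sum(area_obs) ** 2 / len(area_obs)

    # Squared plant totals, each divided by the number of observations in that plant.
    ss_plants = -correction
    n_plants = 0
    for plants in data.values():
        for obs in plants.values():
            ss_plants += sum(obs) ** 2 / len(obs)
            n_plants += 1

    n_areas = len(data)
    return {
        "total": (ss_total, n_total - 1),
        "area": (ss_area, n_areas - 1),
        "plants_within_area": (ss_plants - ss_area, n_plants - n_areas),
        "error": (ss_total - ss_plants, n_total - n_plants),
    }


result = nested_sums_of_squares(data)
for source, (ss, df) in result.items():
    print(f"{source:20s} SS = {ss:6.2f}  df = {df}")
```

Rounded to two decimals, the printed values should agree with the hand calculation above.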


The ANOVA table

Source of variation          Degrees of   Sum of    Mean      F-calc   F-table,
                             freedom      squares   squares            0.05
Area                          2            4.06     2.03      5.639    5.79
Plants within areas           5            1.80     0.36
Observations within plants    6            8.50     1.42
Total                        13           14.36

Often, nested designs are meant to provide information about variability, and it then makes no sense to compute an F value. If, however, the areas are fixed, the equality of the area means can be tested using an F-test. Estimation of experimental error is only possible if the replications are independent. In this case, plants within areas are independent but observations within plants are not. Therefore, we estimate experimental error using plants within areas. The F-value for testing the equality of the areas is obtained as

$$F = \frac{MS_{Area}}{MS_{P(A)}} = \frac{2.03}{0.36} = 5.639$$

which is compared against F-T = 5.79, obtained using 2 numerator and 5 denominator degrees of freedom at the 5 % significance level. We fail to reject H0: µ1 = µ2 = µ3 since F-calc = 5.639 is not greater than F-T = 5.79 at the 5 % significance level.

Suppose we assume observations within plants to be normally distributed with zero mean and a constant variance, $\sigma^2$, and plants within areas to be normally distributed with zero mean and variance $\sigma_p^2$. Some techniques, which are beyond the scope of this manual, are available for estimating these variance components. The following estimates are obtained through such techniques. The observations within plants variance component is estimated as $\hat{\sigma}^2 = 1.42$. The size of the estimate suggests that the total variance is almost entirely due to observations within plants. Similarly, an estimate of the plants within areas variance component is

$$\hat{\sigma}_p^2 = \frac{MS_{P(A)} - MSE}{1.64} = \frac{0.36 - 1.42}{1.64} \approx -0.64$$

Since a variance can never be negative, we consider the estimate not to be significantly different from zero. Thus $\hat{\sigma}_p^2 \cong 0$, indicating no variation between plants within areas.
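The text does not say where the divisor 1.64 comes from. One standard coefficient for unbalanced nested data, n0 = (N - Σi Σj nij²/ni.) / (number of plants - number of areas), happens to reproduce it; that this is the technique the module has in mind is an assumption. The sketch below (Python, with illustrative variable names) recomputes the F ratio and the variance component estimates from the ANOVA table values.

```python
# Number of observations per plant within each area (Example 7.2).
counts = {"A": [1, 2, 3], "B": [2, 2, 1], "C": [1, 2]}

# Mean squares built from the sums of squares and degrees of freedom in the ANOVA table.
ms_area = 4.06 / 2
ms_plants = 1.80 / 5   # plants within areas
ms_error = 8.50 / 6    # observations within plants

# F ratio for areas, using plants within areas as the error term.
f_area = ms_area / ms_plants

n_total = sum(sum(c) for c in counts.values())
n_plants = sum(len(c) for c in counts.values())
n_areas = len(counts)

# ASSUMPTION: coefficient for unbalanced data,
# n0 = (N - sum_i (sum_j n_ij^2 / n_i.)) / (plants - areas).
weighted = sum(sum(n * n for n in c) / sum(c) for c in counts.values())
n0 = (n_total - weighted) / (n_plants - n_areas)

var_error = ms_error
var_plants_raw = (ms_plants - ms_error) / n0
var_plants = max(0.0, var_plants_raw)  # a negative estimate is set to zero

print(f"F = {f_area:.3f}, n0 = {n0:.2f}")
print(f"sigma^2 = {var_error:.2f}, sigma_p^2 (raw) = {var_plants_raw:.2f}, truncated = {var_plants:.2f}")
```

Truncating a negative estimate at zero mirrors the conclusion drawn in the text; other treatments of negative variance estimates exist and are not covered here.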


Exercise 7.1

7.1 An educator proposes a new teaching method and wishes to compare the achievement of students using his method with that of students using a traditional method. Twenty students are randomly placed into two groups with ten students per group. Tests are given to all 20 students at the beginning of a semester, at the end of the semester, and ten weeks after the end of the semester. The educator wishes to see whether there is a difference in the average achievement between the two methods at each of the three time periods.

a) Write a mathematical model for this situation.
b) Set up an ANOVA table and show the F tests that can be made.

7.2 In a study made of the characteristics associated with guidance competence versus counselling competence, 144 students were divided into 9 groups of 16 each. These nine groups represented all combinations of three levels of guidance ranking (high, medium, low) and three levels of counselling ranking (high, medium, low). All subjects were then given nine subtests. Assume the rankings as two fixed factors, the subtests as fixed, and the subjects within the nine groups as random.

a) Present a schematic diagram for this information.
b) Give an outline of the ANOVA table with source of variation and degrees of freedom only.

7.3 Three days of sampling, where each sample was subjected to two types of size graders, gave the following results, coded by subtracting 4 percent moisture and multiplying by 10.

Day               1           2           3
Grader          A     B     A     B     A     B
Sample  1       4    11     5    11     0     6
        2       6     7    17    13    -1    -2
        3       6    10     8    15     2     5
        4      13    11     3    14     8     6
        5       7    10    14    20     8    10
        6       7    11    11    19     4    10
        7      14    16     6    11     5    18
        8      12    10    11    17    10    13
        9       9    12    16     4    16    17
       10       6     9    -1     9     8    15
       11       8    13     3    14     7    11

Assume graders fixed, days random, and samples within days random.

a) State the necessary hypotheses.
b) Give an outline of the ANOVA table with source of variation and degrees of freedom only.
c) Complete the ANOVA table by working out the calculations.


8. NONPARAMETRIC STATISTICS

8.1 Introduction

Nonparametric methods are often applicable in situations where parametric methods are not. They require less restrictive assumptions concerning the data and the form of the probability distributions generating the data.

The scale of measurement of the data partly determines whether to use parametric or nonparametric methods. Most parametric methods use interval or ratio-scaled data, for which means, medians, variances, standard deviations, interquartile ranges, etc., can be computed and interpreted. Parametric methods cannot be applied to nominal or ordinal-scaled data. Nonparametric methods are the only way nominal or ordinal-scaled data can be statistically analysed and sound conclusions drawn.

The assumptions made about how the data were generated also determine whether to use a parametric or a nonparametric method. Many parametric methods require distributional assumptions. For instance, in the small-sample case, a normal distribution with constant variance is required in order to apply the t-distribution. Nonparametric methods do not require assumptions about the population probability distribution, and can be used when one is not prepared to make distributional assumptions. This property has led to nonparametric methods being referred to as distribution-free methods.

The sign test, the Wilcoxon signed-rank test, the Mann-Whitney-Wilcoxon test, the Kruskal-Wallis test, and Spearman rank correlation are the nonparametric methods discussed.

8.2 Sign test

This section is best introduced through an example. Consider a study of consumer preference for two brands of orange juice, where 12 people were given unmarked samples of the two brands. The brand each individual tasted first was selected randomly. Each individual stated a preference for one of the two brands. The question of interest is whether the preferences for the two products are equal.

Hypothesis

Ho: P = 0.5 <No difference in preference for one brand over the other exists>
H1: P ≠ 0.5 <A difference in preference for one brand over the other exists>

where P = population proportion of consumers favouring one brand.

Suppose we denote a preference for brand A by ‘+’ and a preference for brand B by ‘-’. The data are recorded in the form of ‘+’ and ‘-’ signs, hence the name sign test. Under Ho, the number of ‘+’ signs is expected to equal the number of ‘-’ signs. If we consider a ‘+’ sign to denote a success, then with n = 12 and P = 0.5 the number of ‘+’ signs follows a binomial probability distribution. We can compute the probabilities of obtaining 0, 1, ..., 12 ‘+’ signs, giving a symmetric binomial distribution. This sampling distribution is used to determine a rejection rule.


[Figure: Binomial probabilities (P = 0.5, n = 12). Horizontal axis: number of + signs (0 to 12); vertical axis: probability (0 to 0.25).]

The rejection rule is established as follows. Suppose α = 0.05. For a two-tailed test, we place 0.025 in each tail. Starting at the lower end of the distribution, the probability of obtaining 0, 1 or 2 + signs is 0.0002 + 0.0029 + 0.0161 = 0.0192. Adding the probability of 3 + signs would give 0.0729, which exceeds the 0.025 allowed for the lower tail, so we stop at 2 + signs. At the upper tail, we likewise get a probability of 0.0192 for 10, 11 or 12 + signs. The closest we can get to 0.05 without exceeding it is therefore 0.0192 + 0.0192 = 0.0384. Thus, the rejection rule is: reject Ho if the number of + signs is less than 3 or greater than 9.
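The tail-accumulation argument above can be sketched in a few lines of Python. The function names `binomial_pmf` and `sign_test_rejection_rule` are illustrative and not part of the module; the sketch assumes P = 0.5 under Ho, so that the two tails are symmetric.

```python
from math import comb


def binomial_pmf(k, n, p=0.5):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def sign_test_rejection_rule(n, alpha=0.05):
    """Return (lower, upper, attained_alpha) for a two-tailed sign test.

    Ho is rejected when the number of + signs is <= lower or >= upper.
    If no value fits in the tail, lower is -1 and Ho is never rejected.
    """
    tail = 0.0
    lower = -1
    for k in range(n + 1):
        # Stop before the lower-tail probability would exceed alpha / 2.
        if tail + binomial_pmf(k, n) > alpha / 2:
            break
        tail += binomial_pmf(k, n)
        lower = k
    upper = n - lower  # symmetry of the binomial distribution when P = 0.5
    return lower, upper, 2 * tail


lower, upper, attained = sign_test_rejection_rule(12)
print(f"Reject Ho if + signs <= {lower} or >= {upper} (attained alpha = {attained:.4f})")
```

For n = 12 this reproduces the rule "less than 3 or greater than 9"; the attained significance level differs from 0.0384 only in the fourth decimal because the text adds rounded probabilities.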

The binomial probability distribution can be used for n up to 20 (small-sample case). For sample sizes n greater than 20, the large-sample normal approximation of the binomial probabilities can be used to determine the rejection rule for the sign test. The normal approximation of the sampling distribution of the number of + signs when no preference exists requires

Mean: $\mu = 0.5n$
Standard deviation: $\sigma = \sqrt{0.25n}$

Thus, $Z = \dfrac{X - \mu}{\sigma}$, where X is the observed number of + signs.

Example 8.1
The following data show the preferences indicated by 10 individuals in taste tests involving two brands of a product.

Individual   Brand A versus Brand B
 1           +
 2           +
 3           +
 4           -
 5           +
 6           +
 7           -
 8           +
 9           -
10           +

We test for a significant difference in the preferences for the two brands at the 5% significance level. A + indicates a preference for brand A over brand B.

Hypothesis

Ho: P = 0.5
H1: P ≠ 0.5

where P = population proportion of consumers favouring brand A.

The binomial probabilities for P = 0.5 and n = 10:

Number of + signs   Binomial probability
 0                  0.0010
 1                  0.0098
 2                  0.0439
 3                  0.1172
 4                  0.2051
 5                  0.2461
 6                  0.2051
 7                  0.1172
 8                  0.0439
 9                  0.0098
10                  0.0010

Starting at the lower end of the distribution: 0.0010 + 0.0098 = 0.0108 for 0 and 1 + signs. If we include 2, we get 0.0547, which exceeds the 0.025 allowed for the lower tail. Thus, we stop at 1. Similarly, from the upper end of the distribution we get 0.0010 + 0.0098 = 0.0108 for 9 and 10 + signs. Therefore, we reject Ho if the number of + signs is less than 2 or greater than 8. Since we have 7 + signs, we fail to reject Ho. There is no evidence from these data that individuals’ preferences differ significantly between the two brands at the 5 % significance level.

8.3 Wilcoxon Signed-Rank Test


The Wilcoxon Signed-Rank Test is the nonparametric alternative to the parametric paired-sample test. In the parametric case, the population of differences between pairs of observations is assumed to be normally distributed. The nonparametric Wilcoxon Signed-Rank Test can be used when the assumption of normality is in question. The procedure is illustrated by the following example.

Example 8.2
A manufacturing firm is attempting to determine whether a difference in task-completion times exists between two production methods. A sample of 11 workers was selected and each worker completed a production task using both production methods. The production method that each worker used first was selected randomly. A positive difference in task-completion times indicates that method 1 required more time and a negative difference indicates that method 2 required more time.

Production task-completion times (minutes)

                                            Absolute            Signed
Worker   Method 1   Method 2   Difference   difference   Rank   rank
 1       10.2        9.5        0.7         0.7           8      +8
 2        9.6        9.8       -0.2         0.2           2      -2
 3        9.2        8.8        0.4         0.4           3.5    +3.5
 4       10.6       10.1        0.5         0.5           5.5    +5.5
 5        9.9       10.3       -0.4         0.4           3.5    -3.5
 6       10.2        9.3        0.9         0.9          10     +10
 7       10.6       10.5        0.1         0.1           1      +1
 8       10.0       10.0        0.0         0.0           -       -
 9       11.2       10.6        0.6         0.6           7      +7
10       10.7       10.2        0.5         0.5           5.5    +5.5
11       10.6        9.8        0.8         0.8           9      +9

Sum of signed ranks                                             +44

Hypothesis

Ho: The populations are identical
H1: The populations are not identical

The first step is to rank the absolute differences between the two methods, from lowest to highest, discarding any differences of zero. Tied differences are assigned average rank values. Each rank is then given the sign of the original difference in the data. Finally, the sum of the signed ranks is obtained. For our example, we have +44. If the populations representing task-completion times for the two methods are identical, we would expect the positive ranks and the negative ranks to cancel out. Thus, we wish to test whether the sum of signed ranks is significantly different from zero.

Let T denote the sum of the signed-rank values in a Wilcoxon signed-rank test. The distribution of T is approximately normal when the number of pairs of data is 10 or more and the populations are identical. Thus, the sampling distribution of T for identical populations has

mean $\mu_T = 0$, and

standard deviation $\sigma_T = \sqrt{\dfrac{n(n+1)(2n+1)}{6}}$.

Referring to the above example, with n = 10 pairs remaining after the zero difference is discarded,

$$\sigma_T = \sqrt{\frac{10(11)(21)}{6}} = 19.62$$

$$Z = \frac{T - \mu_T}{\sigma_T} = \frac{44 - 0}{19.62} = 2.24$$

Conclusion: Reject Ho since Zcal = 2.24 is greater than Ztable = 1.96 at the 5 % significance level, and conclude that the two populations are not identical in terms of task-completion times. It is worth noting that the Wilcoxon signed-rank test does not enable us to conclude in what ways the populations differ.

Exercise 8.1

8.1 A test was conducted of two overnight mail-delivery services. Two samples of

identical deliveries were set up such that both delivery services were notified of the need for a delivery at the same time. The number of hours required to make each delivery is shown below for each service.

Delivery   Service 1   Service 2
 1         24.5        28.0
 2         26.0        25.5
 3         28.0        32.0
 4         21.0        20.0
 5         18.0        19.5
 6         36.0        28.0
 7         25.0        29.0
 8         21.0        22.0
 9         24.0        23.5
10         26.0        29.5
11         31.0        30.0

Test at the 5% significance level whether the data suggest a difference in the delivery times for the two services.
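The signed-rank computation of Example 8.2 can be sketched as follows. This is an illustrative Python version, not a method prescribed by the module: the helper `average_rank` is a hypothetical name, and the differences are rounded to one decimal place (the precision of the data) so that tied values compare equal despite floating-point error.

```python
from math import sqrt

# Task-completion times (minutes) from Example 8.2.
method_1 = [10.2, 9.6, 9.2, 10.6, 9.9, 10.2, 10.6, 10.0, 11.2, 10.7, 10.6]
method_2 = [9.5, 9.8, 8.8, 10.1, 10.3, 9.3, 10.5, 10.0, 10.6, 10.2, 9.8]


def average_rank(value, pooled):
    """Rank of value within pooled (smallest = 1); tied values share the average rank."""
    smaller = sum(1 for v in pooled if v < value)
    equal = sum(1 for v in pooled if v == value)
    return smaller + (equal + 1) / 2


# Differences, rounded to the precision of the data; zero differences are discarded.
differences = [round(a - b, 1) for a, b in zip(method_1, method_2)]
nonzero = [d for d in differences if d != 0]
absolute = [abs(d) for d in nonzero]

# Attach the sign of each difference to the rank of its absolute value.
signed_ranks = [
    average_rank(abs(d), absolute) if d > 0 else -average_rank(abs(d), absolute)
    for d in nonzero
]

t_stat = sum(signed_ranks)
n = len(nonzero)
sigma_t = sqrt(n * (n + 1) * (2 * n + 1) / 6)
z = (t_stat - 0) / sigma_t

print(f"T = {t_stat}, n = {n}, sigma_T = {sigma_t:.2f}, Z = {z:.2f}")
```

The same steps apply to paired data such as those of Exercise 8.1, provided the number of non-zero differences is at least 10 so that the normal approximation is reasonable.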

8.4 Mann-Whitney-Wilcoxon Test


This is a nonparametric test used to determine whether there is a difference between two populations. Unlike the Wilcoxon signed-rank test, it is not based on paired samples. It uses two independent random samples, one from each population. In the corresponding parametric test, normality and equality of variances are assumed. The Mann-Whitney-Wilcoxon (MWW) test requires neither of these assumptions. However, it does require that the measurement scale for the data generated by the two independent random samples be at least ordinal.

Small-Sample Case: Appropriate when both sample sizes are less than or equal to 10. The following steps are taken in carrying out the test. Combine the data from both samples and rank them, with the smallest value ranked 1 and the largest value given the highest rank. Sum the ranks for each sample separately. Let T denote the sum of the ranks for one of the samples. T can range from a smallest to a largest possible value, and under Ho it is expected to be near the average of these two extremes, which is also the midpoint of the lower and upper critical values TL and TU. That is,

T ≈ (TL + TU)/2.

Critical values of the MWW T statistic are tabulated for the case where both sample sizes are less than or equal to 10. The table gives TL, where n1 corresponds to the sample whose rank sum is being used in the test, and

TU = n1(n1 + n2 + 1) - TL.

Reject Ho if T is strictly less than TL or strictly greater than TU.

Large-Sample Case: Appropriate when both sample sizes are greater than or equal to 10. In this case, the MWW T statistic is approximately normal, with a sampling distribution that has

Mean $\mu_T = \frac{1}{2} n_1 (n_1 + n_2 + 1)$ and

Standard deviation $\sigma_T = \sqrt{\frac{1}{12}\, n_1 n_2 (n_1 + n_2 + 1)}$

General steps for the MWW T test:
1. Rank the combined sample observations from lowest to highest, with tied values being assigned the average of the tied rankings.
2. Compute T, the sum of the ranks for the first sample.

When we reject the hypothesis that the populations are identical using the MWW test, we cannot state how they differ. The populations could have different means, different variances, and/or different forms. The MWW test has the advantage that it does not require any probability distribution assumptions and can be used on ordinal data.

Example 8.3


Two fuel additives are being tested to determine their effect on fuel consumption. Seven cars were tested using additive 1 and another independent sample of nine cars was tested using additive 2. The data below show the kilometres per litre obtained using the additives. Use the MWW test to see if there is a significant difference in fuel consumption at the 5 % significance level.

Additive 1   Rank      Additive 2   Rank
17.3          2        18.7          8.5
18.4          6        17.8          4
19.1         10        21.3         15
16.7          1        21.0         14
18.2          5        22.1         16
18.6          7        18.7          8.5
17.5          3        19.8         11
                       20.7         13
                       20.2         12
Sum          34        Sum         102
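The ranks and rank sums in the table can be reproduced with a short sketch. It is written in Python with hypothetical helper names; the lower critical value TL = 41 is not computed but taken from the tabulated value quoted in this example.

```python
# Kilometres per litre from Example 8.3.
additive_1 = [17.3, 18.4, 19.1, 16.7, 18.2, 18.6, 17.5]
additive_2 = [18.7, 17.8, 21.3, 21.0, 22.1, 18.7, 19.8, 20.7, 20.2]


def average_rank(value, pooled):
    """Rank of value within pooled (smallest = 1); tied values share the average rank."""
    smaller = sum(1 for v in pooled if v < value)
    equal = sum(1 for v in pooled if v == value)
    return smaller + (equal + 1) / 2


# Step 1: rank the combined observations.
pooled = additive_1 + additive_2
ranks_1 = [average_rank(v, pooled) for v in additive_1]
ranks_2 = [average_rank(v, pooled) for v in additive_2]

# Step 2: rank sum for each sample; T is the rank sum of the first sample.
t_1 = sum(ranks_1)
t_2 = sum(ranks_2)

n1, n2 = len(additive_1), len(additive_2)
t_lower = 41                               # tabulated TL quoted in the example (alpha = 0.05)
t_upper = n1 * (n1 + n2 + 1) - t_lower     # TU = n1(n1 + n2 + 1) - TL
reject = t_1 < t_lower or t_1 > t_upper

print(f"T = {t_1}, rank sum of sample 2 = {t_2}, TL = {t_lower}, TU = {t_upper}, reject Ho: {reject}")
```

A quick check on the ranking is that the two rank sums add up to N(N + 1)/2, where N = n1 + n2 is the combined sample size.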

The combined samples are ranked and the rank sum for each sample is obtained. This is a small-sample test since n1 = 7 and n2 = 9, and T = 34. With α = 0.05, n1 = 7 and n2 = 9, the table gives TL = 41, and TU = 7(7 + 9 + 1) - 41 = 78.

Conclusion: Since T = 34 < 41, we reject Ho and conclude that there is a significant difference in fuel consumption.

8.5 Kruskal-Wallis Test

The Kruskal-Wallis test is an extension of the Mann-Whitney-Wilcoxon test to three or more populations. The hypothesis is stated as follows:

Ho: All k populations are identical
H1: Not all populations are identical

Recall that a parametric analysis, such as that of a completely randomised design, requires interval or ratio data. The Kruskal-Wallis test, which does not require the assumptions of normality and equal variance, can be used with ordinal data as well as with interval or ratio data. The Kruskal-Wallis test statistic, which is based on the sum of ranks for each of the samples, is computed as follows:

$$W = \left[\frac{12}{n_T(n_T+1)} \sum_{i=1}^{k} \frac{R_i^2}{n_i}\right] - 3(n_T+1)$$

where
k = the number of populations
ni = the number of items in sample i
nT = the total number of items in all samples
Ri = the sum of the ranks for sample i.

Under Ho, the populations are identical and the sampling distribution of W is approximated by a $\chi^2$ distribution with k-1 degrees of freedom. The approximation works well if each of the sample sizes is greater than or equal to 5. See Table C. The following example illustrates the computation procedure.

Example 8.4
Three products received the following performance ratings from a panel of 15 consumers. We wish to use the Kruskal-Wallis test to determine if there is a significant difference in the performance ratings for the products, at the 5% significance level.

A     Rank     B     Rank     C     Rank
50     4       80    11       60     7
62     8       95    14       45     2
75    10       98    15       30     1
48     3       87    12       58     6
65     9       90    13       57     5
Sum   34       Sum   65       Sum   21

The first step is to rank all 15 data values, with the lowest ranked 1 and the largest ranked 15. The average rank is assigned to tied data.

Sum of ranks: RA = 34, RB = 65, RC = 21
Sample sizes: nA = 5, nB = 5, nC = 5
Total number of items in all samples: nT = 15
k = 3, thus degrees of freedom = 2

$$W = \frac{12}{15(16)}\left[\frac{34^2}{5} + \frac{65^2}{5} + \frac{21^2}{5}\right] - 3(16) = 10.22$$
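The value of W can be checked with a short sketch, here in Python with illustrative function names. It ranks the pooled ratings, sums the ranks per product and applies the formula for W given earlier; the critical value is not computed but is taken to be the tabulated chi-square value for 2 degrees of freedom at the 5% level, 5.99.

```python
# Performance ratings from Example 8.4.
ratings = {
    "A": [50, 62, 75, 48, 65],
    "B": [80, 95, 98, 87, 90],
    "C": [60, 45, 30, 58, 57],
}


def average_rank(value, pooled):
    """Rank of value within pooled (smallest = 1); tied values share the average rank."""
    smaller = sum(1 for v in pooled if v < value)
    equal = sum(1 for v in pooled if v == value)
    return smaller + (equal + 1) / 2


def kruskal_wallis(samples):
    """Return (rank sums per sample, W statistic)."""
    pooled = [v for values in samples.values() for v in values]
    n_total = len(pooled)
    rank_sums = {
        name: sum(average_rank(v, pooled) for v in values)
        for name, values in samples.items()
    }
    weighted = sum(rank_sums[name] ** 2 / len(values) for name, values in samples.items())
    w = 12 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)
    return rank_sums, w


rank_sums, w = kruskal_wallis(ratings)
chi2_critical = 5.99  # tabulated value for 2 degrees of freedom at alpha = 0.05
print(rank_sums, f"W = {w:.2f}", "reject Ho" if w > chi2_critical else "fail to reject Ho")
```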

$\chi^2_{(2,\,0.05)} = 5.99$

Conclusion: Since W = 10.22 is greater than 5.99, reject Ho and conclude that the ratings for the products differ at the 5% significance level. Note that the procedure could have been applied directly to the original data if the data had been the ordinal rankings of the 15 consumers. The step of constructing the rank orderings from the performance ratings would then have been omitted.

8.6 Spearman Rank Correlation


Spearman’s rank correlation is used to find a measure of association between two random variables when only ordinal data are available. The Spearman rank-correlation coefficient is computed using the following formula:

$$r_s = 1 - \frac{6\sum d_i^2}{n(n^2-1)}$$

where
n = the number of items or individuals being ranked
xi = the rank of item i with respect to one variable
yi = the rank of item i with respect to a second variable
di = xi - yi
and 6 is a constant.

While r is a measure of linear correlation between X and Y, rs is a measure of an increasing or decreasing relationship. The value of rs ranges from -1 to 1. Positive values near 1 indicate a strong positive association between the rankings; that is, as one rank increases the other rank also increases. Similarly, negative values near -1 indicate a strong negative association between the ranks. When there is no rank correlation in the population, the sampling distribution of rs has

Mean: $\mu_{r_s} = 0$ and

Standard deviation: $\sigma_{r_s} = \sqrt{\dfrac{1}{n-1}}$ for n ≥ 10.

$$Z = \frac{r_s - \mu_{r_s}}{\sigma_{r_s}}$$

is approximately standard normal, with mean zero and unit variance.

Consider the following example to illustrate the computation procedure.

Example 8.5
At a wine tasting function, two judges were asked to independently rank the 10 wines on exhibit from most desirable (rank = 1) to least desirable (rank = 10). The preferences were as follows:

Judge A rank   Judge B rank   Difference di   di^2
 6              5              1               1
 2              2              0               0
 8              7              1               1
10              9              1               1
 7              6              1               1
 3              3              0               0
 1              1              0               0
 4              8             -4              16
 9             10             -1               1
 5              4              1               1

$\sum d_i^2 = 22$, n = 10. So,

$$r_s = 1 - \frac{6\sum d_i^2}{n(n^2-1)} = 1 - \frac{6(22)}{10(10^2-1)} = 0.867$$

Conclusion: The high value of rs = 0.867 suggests that the two judges’ preferences coincide very closely.
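As a check on Example 8.5, the coefficient can be computed directly from the two sets of ranks. The sketch below is in Python and `spearman_rs` is an illustrative name. The final Z value applies the sampling distribution stated above for n ≥ 10; the example itself stops at rs, so the Z step is an added illustration rather than part of the original working.

```python
from math import sqrt

# Ranks given by the two judges in Example 8.5.
judge_a = [6, 2, 8, 10, 7, 3, 1, 4, 9, 5]
judge_b = [5, 2, 7, 9, 6, 3, 1, 8, 10, 4]


def spearman_rs(x_ranks, y_ranks):
    """Spearman rank-correlation coefficient from two lists of ranks (no ties assumed)."""
    n = len(x_ranks)
    sum_d2 = sum((x - y) ** 2 for x, y in zip(x_ranks, y_ranks))
    return 1 - 6 * sum_d2 / (n * (n**2 - 1))


n = len(judge_a)
rs = spearman_rs(judge_a, judge_b)

# Z statistic using mean 0 and standard deviation sqrt(1 / (n - 1)).
sigma_rs = sqrt(1 / (n - 1))
z = (rs - 0) / sigma_rs

print(f"rs = {rs:.3f}, Z = {z:.2f}")
```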


9. REGRESSION ANALYSIS

9.1 Introduction

Regression analysis is a statistical procedure used to develop a mathematical equation showing how variables are related. The variable that is predicted using this equation is called the dependent variable, while the variable used to predict it is called the independent variable. Regression analysis involving only one independent and one dependent variable is called simple linear regression. With two or more independent variables it is called multiple regression analysis. Consider the following examples of pairs of random variables, where X is an independent variable and Y a dependent variable.

X                          Y
• Advertising              • Company turnover
• Training                 • Labour productivity
• Speed                    • Fuel consumption
• Hours worked             • Machine output
• Daily temperature        • Electricity demand
• Hours studied            • Statistics results
• Product X’s price        • Product X’s sales level
• Bond interest rate       • Number of bond defaulters
• Cost of living           • Poverty

Several objectives exist for carrying out regression analysis. Among them are to:

• See if X affects Y. The objective is to investigate whether there is a change in Y when the level of X is changed, thus establishing a functional relationship between the two variables. In this case, X is assumed to be a continuous variable. A scatter plot would show whether a relationship exists between the two variables.

• See how X affects Y. We would be interested in knowing by how much the value of Y changes per unit change in X.

• Predict Y given X. The objective in this case is to provide a mathematical function that can be used to predict values of Y for a given X.

Consider, for example, an experiment to estimate the mean weight gain per month for steers fed on a particular variety of feeds. The dependent variable, weight gain, could be affected by many factors such as the initial weight of the steer, the amount of feed offered per day, the protein content of the feed, the water content of the feed, and so on.

9.2 Simple Linear Regression


Simple linear regression involves an independent variable denoted by X and a dependent variable denoted by Y. The Xs are selected levels of the treatment under investigation, and the response corresponding to each level is measured. In simple linear regression we want to explain the behaviour of the dependent variable Y in terms of X, by establishing a linear function of the independent variable X. The procedure involves fitting a simple linear regression model to the data, in which the parameters are estimated. The suitability of the model is then assessed. The first step should be to plot the raw data in order to have an indication of the relation between Y and X. If such a relationship is not noticeable, then other reasons should be given for proceeding to fit the regression line.

The simplest type of model relating a response variable y to a single independent variable x is given by the following equation of a straight line:

$$y = \beta_0 + \beta_1 x + \varepsilon$$

where
$\beta_0$ is the intercept (value of y when x = 0)
$\beta_1$ is the slope of the straight line (change in y for a unit change in x)
$\varepsilon$ is a random error term.

Note that the random error term takes into account all unpredictable and unknown factors that are not included in the model. The interest is mainly in estimating the two unknown parameters $\beta_0$ and $\beta_1$, whose estimates are denoted by a and b, respectively. The statistics a and b are computed from the data using a technique called the least squares estimation procedure. The least squares method finds the straight line that provides the best approximation for the relationship between the independent and dependent variables. This line is called the estimated regression line or the estimated regression equation. The following equations have been shown, using calculus, to give the minimum sum of squared deviations between the observed values of the dependent variable $y_i$ and the estimated values of the dependent variable $\hat{y}_i$:

$$b = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \frac{SS_{xy}}{SS_{xx}}$$   <By definition>

$$b = \frac{\sum x_i y_i - \left(\sum x_i\right)\left(\sum y_i\right)/n}{\sum x_i^2 - \left(\sum x_i\right)^2/n}$$   <Used in computation>

$$a = \bar{y} - b\bar{x}$$

Example 9.1a


A property analyst is examining the relationship between the City Council’s valuation of residential property and the market value (selling price) of the properties. A random sample of eight recent property transactions was examined. The data are as follows:

City council valuation   Market value
(R1 000)                 (R1 000)
x                        y                x^2       xy        y^2
 12                        65              144       780       4225
 45                       220             2025      9900      48400
 32                       142             1024      4544      20164
 50                       310             2500     15500      96100
 28                       196              784      5488      38416
 56                       364             3136     20384     132496
 18                       116              324      2088      13456
 40                       260             1600     10400      67600
Total 281                1673            11537     69084     420857

The scatter diagram of the above data is presented below.

[Figure: City council values against market values. Horizontal axis: city council values (R1 000); vertical axis: market values (R1 000); fitted line y = 6.1912x - 8.3392.]

For the above example:

$\sum x = 281$, $\sum y = 1\,673$, $\sum xy = 69\,084$, $\sum x^2 = 11\,537$, $\sum y^2 = 420\,857$.

$$b = \frac{\sum x_i y_i - (\sum x_i)(\sum y_i)/n}{\sum x_i^2 - (\sum x_i)^2/n} = \frac{69084 - (281)(1673)/8}{11537 - (281)^2/8} = 6.1912$$

$$a = \bar{y} - b\bar{x} = 209.125 - (6.1912)(35.125) = -8.3392$$
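The computational formulas above can be checked with a short script. This is a minimal sketch in plain Python using the data of Example 9.1a; the function name `least_squares` is an illustrative choice, not part of the module.

```python
# Data from Example 9.1a (values in R1 000).
x = [12, 45, 32, 50, 28, 56, 18, 40]         # city council valuation
y = [65, 220, 142, 310, 196, 364, 116, 260]  # market value


def least_squares(x, y):
    """Return (a, b) for the line y = a + b*x using the computational formulas."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    s_xy = sum_xy - sum_x * sum_y / n  # corrected sum of cross-products
    s_xx = sum_x2 - sum_x ** 2 / n     # corrected sum of squares of x
    b = s_xy / s_xx                    # slope
    a = sum_y / n - b * sum_x / n      # intercept: y-bar minus b times x-bar
    return a, b


a, b = least_squares(x, y)
print(f"a = {a:.4f}, b = {b:.4f}")  # a = -8.3392, b = 6.1912
```

The intercept is computed from the unrounded slope, which is why it reproduces -8.3392 rather than the slightly different value obtained from the rounded b = 6.1912.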


Thus, the estimated regression line is

$$\hat{y} = -8.3392 + 6.1912x$$

The estimates of the intercept and slope, namely a and b, are unbiased estimators of the population parameters β0 and β1, respectively.

Caution: Extrapolation outside the range of x may lead to meaningless results. For instance, at x = 0 we get $\hat{y}$ = -8.3392. That is, at a zero city council valuation we get a market value of -8.3392 (in R1 000), which has no practical meaning. The above regression line is meaningful only when the x values fall within the observed interval 12 ≤ x ≤ 56.

Note: A regression line obtained using the standardised values of X and Y passes through the origin, and thus has zero intercept. The correlation coefficient r between standardised X and Y equals the slope b obtained using the same standardised values.

Example 9.1b
A substance used in biological and medical research is shipped by airfreight to users in cartons of 1,000 ampules. The data below, involving 10 shipments, were collected on the number of times the carton was transferred from one aircraft to another over the shipment route (X) and the number of ampules found to be broken upon arrival (Y). Assume a linear regression model.

i:    1    2    3    4    5    6    7    8    9    10
Xi:   1    0    2    0    3    1    0    1    2    0
Yi:   16   9    17   12   22   13   8    15   19   11

[Figure: Scatter plot of Y against X for Example 9.1b.]

i      Xi   Yi    Xi-X̄   Yi-Ȳ   (Xi-X̄)(Yi-Ȳ)   (Xi-X̄)²   (Yi-Ȳ)²   Ŷi     ei=Yi-Ŷi   ei²
1      1    16    0      1.8    0              0         3.24      14.2   1.8        3.24
2      0    9     -1     -5.2   5.2            1         27.04     10.2   -1.2       1.44
3      2    17    1      2.8    2.8            1         7.84      18.2   -1.2       1.44
4      0    12    -1     -2.2   2.2            1         4.84      10.2   1.8        3.24
5      3    22    2      7.8    15.6           4         60.84     22.2   -0.2       0.04
6      1    13    0      -1.2   0              0         1.44      14.2   -1.2       1.44
7      0    8     -1     -6.2   6.2            1         38.44     10.2   -2.2       4.84
8      1    15    0      0.8    0              0         0.64      14.2   0.8        0.64
9      2    19    1      4.8    4.8            1         23.04     18.2   0.8        0.64
10     0    11    -1     -3.2   3.2            1         10.24     10.2   0.8        0.64
Total  10   142                 40             10        177.6            0          17.6

Information required for computation:

n = 10, $\sum X_i = 10$, $\sum Y_i = 142$,
$S_{XY} = \sum (X_i - \bar{X})(Y_i - \bar{Y}) = 40$, $S_X = \sum (X_i - \bar{X})^2 = 10$, $S_Y = \sum (Y_i - \bar{Y})^2 = 177.6$

Computation

The estimate of the slope:

$$b = \frac{S_{XY}}{S_X} = \frac{40}{10} = 4$$

The estimate of the intercept:

$$a = \bar{Y} - b\bar{X} = 14.2 - 4(1) = 10.2$$

The estimated linear regression line is

$$\hat{Y}_i = a + bX_i = 10.2 + 4X_i$$

$$MSE = \frac{SSE}{n-2} = \frac{17.6}{8} = 2.2$$

Regression analysis

Estimator   Coef   Std Error   t-value   P-value
a           10.2   0.663       15.38     <0.001
b           4      0.469       8.53      <0.001
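The coefficients, standard errors and t-values in the regression analysis table can be reproduced as follows. This is a sketch in plain Python, assuming the standard error formulas for a and b that the module gives in its inference section (section 9.6); the variable names are illustrative.

```python
import math

# Data from Example 9.1b.
X = [1, 0, 2, 0, 3, 1, 0, 1, 2, 0]            # number of transfers
Y = [16, 9, 17, 12, 22, 13, 8, 15, 19, 11]    # broken ampules

n = len(X)
x_bar, y_bar = sum(X) / n, sum(Y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in X)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(X, Y))

b = s_xy / s_xx             # slope estimate
a = y_bar - b * x_bar       # intercept estimate

# Residual sum of squares and its mean square on n - 2 degrees of freedom.
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(X, Y))
mse = sse / (n - 2)

se_b = math.sqrt(mse / s_xx)
se_a = math.sqrt(mse * sum(xi ** 2 for xi in X) / (n * s_xx))

print(f"a = {a:.1f}, s.e. = {se_a:.3f}, t = {a / se_a:.2f}")
print(f"b = {b:.1f}, s.e. = {se_b:.3f}, t = {b / se_b:.2f}")
```

The P-values in the table come from the t distribution with n - 2 = 8 degrees of freedom and are not computed here, since that requires statistical tables or software.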


[Figure: Fitted regression line for Example 9.1b, Y against X.]

9.3 Model and assumptions
It is important to distinguish between a deterministic model and a probabilistic model when testing for significance in regression analysis. In a deterministic model, the relationship between X and Y is such that if the value of the independent variable is specified, the value of the dependent variable is determined exactly. A model is probabilistic if we are unable to guarantee a single value of Y for each value of X. Thus, mathematically,

Deterministic model: $y = \beta_0 + \beta_1 x$ (a model with no error)
Probabilistic model: $y = \beta_0 + \beta_1 x + \varepsilon$ (a model that allows for uncontrollable components, denoted by ε)

The difference between the two models is in ε, which measures how far the actual y value is above or below the regression line. The following are the assumptions about ε, the error term in the regression model.

• The error term ε is a random variable with a mean of zero.
• The variance of ε, denoted σ², is the same for all values of x.
• The values of ε are independent.
• The error term ε is a normally distributed random variable.

We are mostly concerned with assessing how well the fitted model explains the real-life situation. That is, the question of interest is: how close are the fitted values to the observed values? Thus, we aim at minimising the error term. The model stated above assumes a straight-line relationship, which is often not the case. A non-linear model may turn out to explain the data better than the straight line. The reliability of the final model depends on the validity of the underlying assumptions and on the adequacy of the fitted model in explaining the variation in the data.


The coefficient of determination, denoted by r² and expressed as the ratio of the regression sum of squares to the total sum of squares, is often used as a measure of the goodness of fit of the estimated regression line. A higher r² value is associated with a better fit. However, it does not allow us to conclude whether a regression relationship is statistically significant, because the computation of r² fails to take the sample size into consideration.

9.4 Partitioning the total sum of squares
The total sum of squares can be partitioned into the regression sum of squares and the residual sum of squares. That is:

Sum of squares about the mean = Sum of squares due to regression + Sum of squares for residual.

• Sum of squares about the sample mean: $\sum (y - \bar{y})^2$
• Sum of squares due to regression (the portion of the overall distance that can be attributed to the independent variable x): $\sum (\hat{y} - \bar{y})^2$
• Sum of squares due to residual (that portion of the distance between y and $\bar{y}$ that cannot be accounted for by the independent variable x): $\sum (y - \hat{y})^2$

In summary,

$$\sum (y - \bar{y})^2 = \sum (\hat{y} - \bar{y})^2 + \sum (y - \hat{y})^2$$

(total variability in y-values) = (variability explained by model) + (unexplained variability)

The following computations, obtained using the information given in Example 9.1a above, illustrate the point.


x      y     ŷ        (y-ŷ)    (y-ȳ)     (ŷ-ȳ)     (y-ŷ)²     (ŷ-ȳ)²     (y-ȳ)²
12     65    65.96    -0.96    -144.12   -143.16   0.912      20496.16   20770.57
45     220   270.26   -50.26   10.88     61.14     2526.550   3738.687   118.3744
32     142   189.78   -47.78   -67.12    -19.34    2282.852   374.0665   4505.094
50     310   301.22   8.78     100.88    92.10     77.074     8482.557   10176.77
28     196   165.01   30.99    -13.12    -44.11    960.107    1945.304   172.1344
56     364   338.37   25.63    154.88    129.25    656.999    16705.05   23987.81
18     116   103.10   12.90    -93.12    -106.02   166.348    11239.73   8671.334
40     260   239.31   20.69    50.88     30.19     428.126    911.3636   2588.774
Total                                              7098.970   63892.92   70990.88

where

$\sum (y - \bar{y})^2 = 70990.88$

$\sum (y - \hat{y})^2 = 7098.97$

$\sum (\hat{y} - \bar{y})^2 = 63892.92$

The results agree, except for rounding errors.

9.5 An estimate of the variance of residual term
The variance of ε, denoted by σ², is estimated using the sum of squares due to residual, SSE:

$$SSE = \sum (y - \hat{y})^2 = S_{yy} - bS_{xy}$$

The degrees of freedom indicate how many independent pieces of information, out of the n independent values, are used to compute the sum of squares. SSE is associated with n - 2 degrees of freedom because two parameters (β0 and β1) have to be estimated. A mean square is a number computed by dividing a sum of squares by its degrees of freedom. It has been shown that the mean square error, MSE or s², provides an estimate of σ². Thus,

$$MSE = \frac{SSE}{n-2}$$


From the above example,

$$MSE = \frac{SSE}{n-2} = \frac{7098.97}{8-2} = 1183.16$$
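The partition of the total sum of squares and the resulting MSE can be verified numerically. The following is a minimal sketch in plain Python for the data of Example 9.1a. It uses unrounded fitted values, so SSR comes out near 63891.9 rather than the 63892.92 obtained from the rounded table entries.

```python
# Data from Example 9.1a (values in R1 000).
x = [12, 45, 32, 50, 28, 56, 18, 40]
y = [65, 220, 142, 310, 196, 364, 116, 260]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b = s_xy / s_xx
a = y_bar - b * x_bar
y_hat = [a + b * xi for xi in x]          # fitted values

sst = sum((yi - y_bar) ** 2 for yi in y)                # total
ssr = sum((fi - y_bar) ** 2 for fi in y_hat)            # due to regression
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))   # residual

r_squared = ssr / sst     # coefficient of determination
mse = sse / (n - 2)       # estimate of the error variance

print(f"SST = {sst:.2f}, SSR = {ssr:.2f}, SSE = {sse:.2f}")
print(f"r^2 = {r_squared:.3f}, MSE = {mse:.2f}")
```

With exact arithmetic SST equals SSR + SSE, which confirms that the small discrepancy in the table is due to rounding only.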

9.6 Inference about the β0 and β1 parameters
The main interest is to test whether the slope is significantly different from zero, that is, whether y changes as x changes. An appropriate hypothesis is

Ho: β1 = 0
Ha: β1 ≠ 0

The above hypothesis can be tested using a t-test, an F-test or a confidence interval. We need to obtain b, the estimate of β1, and its associated variance in order to conduct the appropriate test. The sampling distribution of the estimate b is normal with mean β1 and variance $\sigma_b^2$, where

$$\sigma_b^2 = \frac{\sigma^2}{S_{xx}}, \qquad S_{xx} = \sum x_i^2 - (\sum x_i)^2/n$$

Since $\sigma_b^2$ is hardly ever known, it is estimated by $s_b^2$, where s² replaces σ² in the above equation. Thus,

$$s_b^2 = \frac{s^2}{S_{xx}}$$

The test statistic is

$$t_{calc} = \frac{b - \beta_1}{s_b}$$

which follows a t distribution with n - 2 degrees of freedom. Here β1 is the value specified under Ho, which is zero in this case. The decision rule is to reject Ho if the absolute value of tcalc, denoted by |tcalc|, is greater than $t_{\alpha/2}$. For the above example, b = 6.1912, and the standard error of b is s.e.(b) = $\sqrt{s_b^2}$ = $s_b$.


$$s = \sqrt{1183.16} = 34.4$$

$$S_{xx} = \sum x_i^2 - (\sum x_i)^2/n = 11537 - (281)^2/8 = 1666.875$$

$$s.e.(b) = \sqrt{\frac{1183.16}{1666.875}} = 0.8425$$

Thus,

$$t_{calc} = \frac{6.1912 - 0}{0.8425} = 7.349$$
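The t-test for the slope can be sketched as below. This is an illustrative plain-Python version for Example 9.1a; the critical value 2.447 is taken from the t table as quoted in the module (6 degrees of freedom, two-tailed α = 0.05) rather than computed, and the function name `slope_t_test` is hypothetical.

```python
import math

# Data from Example 9.1a (values in R1 000).
x = [12, 45, 32, 50, 28, 56, 18, 40]
y = [65, 220, 142, 310, 196, 364, 116, 260]

T_CRIT = 2.447  # t(0.025) with 6 d.f., read from the t table


def slope_t_test(x, y, beta1_null=0.0):
    """Return (b, s.e.(b), t_calc) for testing Ho: beta1 = beta1_null."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    mse = sse / (n - 2)            # s^2, the estimate of sigma^2
    se_b = math.sqrt(mse / s_xx)   # standard error of the slope
    return b, se_b, (b - beta1_null) / se_b


b, se_b, t_calc = slope_t_test(x, y)
reject_h0 = abs(t_calc) > T_CRIT
print(f"s.e.(b) = {se_b:.4f}, t = {t_calc:.3f}, reject Ho: {reject_h0}")
```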

From the t table, the value of t corresponding to 6 degrees of freedom and α = 0.05 for a two-tailed test is t0.025 = 2.447.

Conclusion: We reject Ho: β1 = 0 since |tcalc| is greater than t0.025 = 2.447, and conclude that the slope is significantly different from zero at the 5 % significance level.

An F-test also exists for testing the above hypothesis. The t-test and F-test give the same results for a regression model with only one independent variable. This is due to the relation between the two distributions for one independent variable (F = t²). The following computations are necessary in order to test the above hypothesis concerning the slope parameter.

Sum of squares due to regression, denoted by SSR = $\sum (\hat{y} - \bar{y})^2$, associated with 1 degree of freedom (number of parameters - 1).

Sum of squares due to residual, denoted by SSE = $\sum (y - \hat{y})^2$, associated with n - p degrees of freedom (n is the sample size and p is the number of regression parameters).

Sum of squares due to total, denoted by SST = $\sum (y - \bar{y})^2$, associated with n - 1 degrees of freedom.

The following are the corresponding mean squares:

$$MSR = \frac{SSR}{1} \quad \text{and} \quad MSE = \frac{SSE}{n-p}$$

Under Ho: β1 = 0, MSR and MSE are two independent estimates of σ². The ratio of MSR to MSE is known to have a sampling distribution that is F with 1 and n - p degrees of freedom. (In this case p = 2.)


For the above example, we get MSR = 63892.92 and MSE = 1183.16, which implies that

$$F_{calc} = \frac{MSR}{MSE} = \frac{63892.92}{1183.16} = 54.0$$
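The F ratio and its relation to the t statistic can be checked with a few lines of code. This is a sketch in plain Python for Example 9.1a; the critical value 5.99 is the tabulated F(1, 6; 0.05) quoted in the module, not computed here.

```python
import math

# Data from Example 9.1a (values in R1 000).
x = [12, 45, 32, 50, 28, 56, 18, 40]
y = [65, 220, 142, 310, 196, 364, 116, 260]

F_CRIT = 5.99   # F(1, 6; 0.05), read from the F table
p = 2           # number of regression parameters (intercept and slope)

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b = s_xy / s_xx
a = y_bar - b * x_bar

ssr = sum((a + b * xi - y_bar) ** 2 for xi in x)
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

msr = ssr / 1           # regression mean square, 1 degree of freedom
mse = sse / (n - p)     # residual mean square, n - p degrees of freedom
f_calc = msr / mse

t_calc = b / math.sqrt(mse / s_xx)   # t statistic for Ho: beta1 = 0
print(f"F = {f_calc:.1f}, t^2 = {t_calc ** 2:.1f}, reject Ho: {f_calc > F_CRIT}")
```

For a model with one independent variable the two printed values coincide, which is the F = t² relationship.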

Note: F = t² (i.e. 54.0 = 7.349²).

From the F-distribution table we get F(1, 6; 0.05) = 5.99. We reject Ho: β1 = 0 at the 5 % significance level since Fcalc > F(1, 6; 0.05) = 5.99, and conclude that there is a statistically significant relationship between x and y.

Caution: Rejection of the null hypothesis does not imply that the relationship between x and y is linear. A proper way to phrase the statement is that a linear relationship explains a significant amount of the variability in y over the range of x values observed in the sample.

Confidence interval for β1
A confidence interval provides an alternative to testing the hypothesis Ho: β1 = 0 against Ha: β1 ≠ 0. The following is a 95 % confidence interval for β1:

$$b \pm t_{0.025}\, s.e.(b)$$

For the above example, a 95 % confidence interval for β1 is

$$6.1912 \pm 2.447(0.8425)$$

Thus, (4.1296, 8.2528) is a 95 % confidence interval for β1. We reject Ho because the interval does not contain zero.

Similarly, the variance of the intercept estimate is given by the following formula:

$$s_a^2 = \frac{MSE \sum x^2}{n S_{xx}} = MSE\left[\frac{1}{n} + \frac{(\bar{x})^2}{S_{xx}}\right]$$
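The confidence interval for the slope and the variance of the intercept can be computed as in the sketch below (plain Python, Example 9.1a data). The critical value 2.447 is again taken from the t table as quoted above, and the second form of the intercept variance is the algebraically equivalent expression MSE[1/n + x̄²/Sxx].

```python
import math

# Data from Example 9.1a (values in R1 000).
x = [12, 45, 32, 50, 28, 56, 18, 40]
y = [65, 220, 142, 310, 196, 364, 116, 260]

T_CRIT = 2.447  # t(0.025) with 6 d.f., read from the t table

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b = s_xy / s_xx
a = y_bar - b * x_bar
mse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)

# 95 % confidence interval for the slope.
se_b = math.sqrt(mse / s_xx)
lower, upper = b - T_CRIT * se_b, b + T_CRIT * se_b

# Variance of the intercept estimate, in its two equivalent forms.
var_a_form1 = mse * sum(xi ** 2 for xi in x) / (n * s_xx)
var_a_form2 = mse * (1 / n + x_bar ** 2 / s_xx)

print(f"95% CI for slope: ({lower:.4f}, {upper:.4f})")
print(f"s.e.(a) = {math.sqrt(var_a_form1):.3f}")
```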

9.7 Confidence interval estimate of the mean value of y


There are two types of interval estimates, namely, the confidence interval estimate and the prediction interval estimate. The former is an estimate of the mean value of y for a particular value of x, while the latter concerns the prediction of an individual value of y corresponding to a given value of x. The point estimates computed from the equation $\hat{y} = a + bx$ are the same in both cases. The difference is only in the computation of the standard error.

Suppose we denote the estimate of the mean value by $\hat{y}_m$ and the individual value estimate by $\hat{y}_{ind}$. The corresponding values and their associated variances are computed using the following formulas:

Mean value: $\hat{y}_m = a + bx_m$

Estimated variance of $\hat{y}_m$:

$$s_m^2 = s^2\left[\frac{1}{n} + \frac{(x_m - \bar{x})^2}{\sum x^2 - (\sum x)^2/n}\right]$$

Individual value: $\hat{y}_{ind} = a + bx_{ind}$

Estimated variance of $\hat{y}_{ind}$:

$$s_{ind}^2 = s^2\left[1 + \frac{1}{n} + \frac{(x_{ind} - \bar{x})^2}{\sum x^2 - (\sum x)^2/n}\right]$$
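The two variance formulas above differ only by the extra term 1 inside the brackets. A minimal sketch in plain Python, using the data of Example 9.1a and an illustrative function name `interval_variances`, shows both side by side for a given x value:

```python
# Data from Example 9.1a (values in R1 000).
x = [12, 45, 32, 50, 28, 56, 18, 40]
y = [65, 220, 142, 310, 196, 364, 116, 260]


def interval_variances(x, y, x_new):
    """Return (y_hat, variance for the mean value, variance for an individual value)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    mse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)
    core = 1 / n + (x_new - x_bar) ** 2 / s_xx
    var_mean = mse * core          # for the mean value of y at x_new
    var_ind = mse * (1 + core)     # for an individual value of y at x_new
    return a + b * x_new, var_mean, var_ind


y_new, var_mean, var_ind = interval_variances(x, y, 30)
print(f"y_hat = {y_new:.2f}, s_m^2 = {var_mean:.2f}, s_ind^2 = {var_ind:.2f}")
```

Because the code keeps full precision for a, b and MSE, its output can differ in the last decimals from hand calculations based on rounded values.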

For our example, suppose we wish to estimate the mean value for a given value of xm=30.

$$\hat{y}_m = a + bx_m = -8.3392 + 6.1912(30) = 177.3968$$

and

$$s_m^2 = s^2\left[\frac{1}{n} + \frac{(x_m - \bar{x})^2}{\sum x^2 - (\sum x)^2/n}\right] = 1183.16\left[\frac{1}{8} + \frac{(30 - 35.125)^2}{1666.875}\right] = 166.5385$$

Suppose we wish to estimate the individual value for a given value of xind = 30.

$$\hat{y}_{ind} = a + bx_{ind} = -8.3392 + 6.1912(30) = 177.3968$$

and

$$s_{ind}^2 = s^2\left[1 + \frac{1}{n} + \frac{(x_{ind} - \bar{x})^2}{\sum x^2 - (\sum x)^2/n}\right] = 1183.16\left[1 + \frac{1}{8} + \frac{(30 - 35.125)^2}{1666.875}\right] = 1349.6980$$

Note: The variance associated with the individual value prediction is greater than that associated with the mean value. Consequently, the prediction interval for the individual value is wider than the confidence interval for the mean value.

Exercise 9.1

9.1 A restaurant operating on a ‘reservations only’ basis would like to use the number of advance reservations x to predict the number of dinners y to be prepared. Data on reservations and number of dinners served for one day chosen at random from each week in a 100-week period gave the following results:

$\bar{x} = 150$, $\bar{y} = 120$

$\sum (x - \bar{x})^2 = 90\,000$, $\sum (y - \bar{y})^2 = 70\,000$

$\sum (x - \bar{x})(y - \bar{y}) = 60\,000$

a) Find the least squares estimates a and b for the linear regression line $\hat{y} = a + bx$.

b) Predict the number of meals to be prepared if the number of reservations is 135.

c) Construct a 90 % confidence interval for the slope. Does information on x (number of advance reservations) help in predicting y (number of dinners prepared)?

9.2 Interest rates charged for home mortgages have, in general, declined over recent months. With the apparent favourable influence for new home building, the data shown below are the prevailing mortgage interest rates and the number of housing starts in a city over a period of 18 months.

Month   Interest rate x   Number of housing starts y
1       10.5              360
2       10.3              340
3       10.6              370
4       11.4              360
5       11.8              330
6       11.3              300
7       11.0              290
8       10.5              340
9       10.2              360
10      10.0              370
11      9.8               380
12      9.8               390
13      9.9               375
14      10.0              350
15      10.0              345
16      9.9               360
17      9.8               380
18      9.7               395

a) Plot the data.
b) Use these data to obtain a linear regression equation.
c) Is the slope significantly different from zero?
d) Predict the number of housing starts for interest rates of 10.2% and 9.5%.
e) Do you predict that the prevailing interest rate will increase or decrease next month (month 19)?

9.8 Testing model assumptions
A residual is the difference between the actual value of the dependent variable, $y_i$, and the value predicted by the regression equation, $\hat{y}_i$. The analysis of residuals plays an important role in validating the assumptions made in regression analysis. The hypothesis tests discussed above are valid only when the assumptions made about the regression equation are satisfied. Residual plots are graphical presentations of the residuals that help reveal patterns and thus help determine whether the assumptions concerning the error component and the form of the regression model are satisfied. The following are the common residual plots:

• A plot of residuals against the independent variable x.
• A plot of residuals against the predicted value of the dependent variable.
• A standardised residual plot in which each residual is standardised by dividing the residual by its standard deviation.

9.9 Diagnostic procedures
Residual plot against x


A residual plot against the independent variable x is constructed by placing x on the horizontal axis and the residuals on the vertical axis. The residual plot should give an overall impression of a horizontal band of points if the assumptions are valid and a linear relationship between x and y is appropriate.

[Figure: Residual plot against x. Horizontal axis: X; vertical axis: residual.]

Using the residual plot
a) An overall impression of a horizontal band of points in a residual plot implies that the model is valid and that a linear relationship between x and the expected value of y exists.
b) A cone-shaped pattern in the residual plot suggests that the variance is not constant. That is to say, the variability about the regression line is greater for larger values of x.
c) A quadratic pattern in the residual plot suggests that the linear model is not adequate and that a quadratic model should be fitted.

Note that for simple linear regression, both the residual plot against x and the residual plot against the predicted value $\hat{y}$ provide the same information. With multiple regression models, the residual plot against $\hat{y}$ is the one generally used, because there is more than one independent variable.

Standardised residual plots are provided by most computer software. A random variable is standardised by subtracting its mean and dividing the result by its standard deviation. The standard deviation of the ith residual is

$$s_{y_i - \hat{y}_i} = s\sqrt{1 - h_i}$$

where

$$h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum (x_i - \bar{x})^2}$$

and s² = MSE.

If the normality assumption is satisfied, about 95 % of the computed standardised residuals should lie between -2 and 2.

Outliers


Outliers represent observations that are suspect and warrant careful examination. Sometimes they occur due to erroneous data recording. They may also indicate a violation of the model assumptions, or they may simply be unusual values that occur by chance.

Example 9.2
Consider the following data set to illustrate the effect of an outlier.

x    y
1    45
1    55
2    50
3    75
3    40
3    45
4    30
4    35
5    25
6    15

[Figure: The effect of an outlier. Scatter plot of Y against X for Example 9.2.]
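To make the outlier in Example 9.2 concrete, the standardised residuals can be computed from the formula for the standard deviation of the ith residual given earlier in this section. The sketch below is plain Python; the function name `standardised_residuals` is illustrative, and the cut-off of 2 in absolute value is the rule of thumb quoted in this section.

```python
import math

# Data from Example 9.2.
x = [1, 1, 2, 3, 3, 3, 4, 4, 5, 6]
y = [45, 55, 50, 75, 40, 45, 30, 35, 25, 15]


def standardised_residuals(x, y):
    """Residuals of the fitted line divided by s * sqrt(1 - h_i)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    a = y_bar - b * x_bar
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))  # sqrt(MSE)
    result = []
    for xi, e in zip(x, residuals):
        h = 1 / n + (xi - x_bar) ** 2 / s_xx   # leverage of observation i
        result.append(e / (s * math.sqrt(1 - h)))
    return result


std_res = standardised_residuals(x, y)
flagged = [(xi, yi) for xi, yi, r in zip(x, y, std_res) if abs(r) > 2]
print(flagged)  # [(3, 75)]
```

Only the observation at x = 3, y = 75 is flagged, with a standardised residual of about 2.67.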

A negative linear relationship exists between X and Y, except for the observation at x = 3 and y = 75, which is out of the pattern. Most statistical software classifies an observation whose standardised residual is either less than -2 or more than 2 as an outlier.

Influential observations
An influential observation, which may or may not be an outlier, is an observation whose x value is far away from the mean of the x values. Consider the following data to illustrate the aspect of influential observation.

Example 9.3

x    y
10   125
10   130
15   120
20   115
20   120
25   110
70   100

[Figure: A high leverage observation. Scatter plot of Y against X for Example 9.3.]

The observation at x = 70 and y = 100 has an extreme value of x and thus corresponds to a high leverage. The leverage is computed using the following formula:

$$h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum (x_i - \bar{x})^2}$$
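The leverage values for the data of Example 9.3 can be computed directly from this formula. The following plain-Python sketch flags observations using the hi > 6/n rule stated in this section; the function name `leverages` is illustrative.

```python
# x values from Example 9.3 (the leverage depends on x only).
x = [10, 10, 15, 20, 20, 25, 70]


def leverages(x):
    """Leverage h_i = 1/n + (x_i - x_bar)^2 / sum((x_i - x_bar)^2)."""
    n = len(x)
    x_bar = sum(x) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    return [1 / n + (xi - x_bar) ** 2 / s_xx for xi in x]


h = leverages(x)
threshold = 6 / len(x)
influential = [xi for xi, hi in zip(x, h) if hi > threshold]

for xi, hi in zip(x, h):
    print(f"x = {xi:3d}  h = {hi:.3f}")
print(f"threshold 6/n = {threshold:.3f}, flagged x values: {influential}")
```

The observation at x = 70 has a leverage of about 0.94, which exceeds 6/7 ≈ 0.857, while every other leverage is well below the threshold. As a check, the leverages in simple linear regression always sum to 2, the number of estimated parameters.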

An observation is declared to be influential if hi > 6/n. The appropriate approach to handling data with influential observations is to run the regression analysis with and without the observation. Although time consuming, this approach will reveal the influence of the observation on the results.

Exercise 9.2

9.3 Consider the following data for two variables X and Y.

X   135   110   130   145   175   160   120
Y   145   100   120   120   130   130   110

a) Compute the standardised residuals for these data. Do there appear to be any outliers in the data? Explain.
b) Plot the standardised residuals against $\hat{y}$. Does this plot reveal any outliers?
c) Develop a scatter plot for these data. Does the scatter diagram indicate any outliers in the data? In general, what implications does this have for the simple linear regression?


9.10 Polynomial models
The response in the dependent variable y will not always be linear when the independent variable x is quantitative. Sometimes the response may be quadratic, cubic, or of a degree higher than three. For instance, a linear equation may not adequately represent the relationship between yield and the amount of fertiliser applied to a plot. The following data show the yield of tomatoes from plots receiving different amounts of fertiliser.

Plot   Amount of fertiliser x   Yield y
1      12                       24
2      5                        18
3      15                       31
4      17                       33
5      20                       26
6      14                       30
7      6                        20
8      23                       25
9      11                       25
10     13                       27
11     8                        21
12     18                       29
13     22                       29
14     25                       26

[Figure: Scatterplot of yield versus fertiliser. Horizontal axis: amount of fertiliser X; vertical axis: yield Y.]

A model describing the quadratic form shown in the above figure is

$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon$$

A general polynomial regression model relating a dependent variable y to a single quantitative independent variable x is given by

$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_p x^p + \varepsilon$$
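A polynomial model is still linear in its parameters, so it can be fitted by least squares. The sketch below is one possible plain-Python approach, assuming the normal equations are solved by Gaussian elimination; the helper names `solve`, `polyfit` and `sse` are illustrative, and statistical software would normally be used instead. It fits both a straight line and a quadratic to the fertiliser data and compares their residual sums of squares.

```python
# Fertiliser (x) and tomato yield (y) data from section 9.10.
x = [12, 5, 15, 17, 20, 14, 6, 23, 11, 13, 8, 18, 22, 25]
y = [24, 18, 31, 33, 26, 30, 20, 25, 25, 27, 21, 29, 29, 26]


def solve(A, rhs):
    """Solve A z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):                   # back substitution
        z[i] = (M[i][n] - sum(M[i][j] * z[j] for j in range(i + 1, n))) / M[i][i]
    return z


def polyfit(x, y, degree):
    """Least squares coefficients [b0, b1, ..., bp] from the normal equations."""
    powers = range(degree + 1)
    A = [[sum(xi ** (j + k) for xi in x) for k in powers] for j in powers]
    rhs = [sum(yi * xi ** j for xi, yi in zip(x, y)) for j in powers]
    return solve(A, rhs)


def sse(coefs, x, y):
    """Residual sum of squares for a fitted polynomial."""
    fitted = [sum(c * xi ** k for k, c in enumerate(coefs)) for xi in x]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))


linear = polyfit(x, y, 1)
quadratic = polyfit(x, y, 2)
print("linear coefficients:   ", [round(c, 4) for c in linear])
print("quadratic coefficients:", [round(c, 4) for c in quadratic])
print("SSE linear vs quadratic:", round(sse(linear, x, y), 2), round(sse(quadratic, x, y), 2))
```

A negative coefficient on x² together with a clearly smaller SSE for the quadratic fit is what the curved pattern in the scatterplot suggests.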


The choice of p, and hence the choice of an appropriate regression model, will depend on the experimental situation.

9.11 Multiple regression
The probabilistic model for multiple regression analysis is a direct extension of that for simple linear regression analysis. For p independent variables, we have

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon$$

The estimated regression equation is

$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_p x_p$$

It is referred to as a multiple regression model because it involves more than one independent variable. For example, consider an experiment set up to study the yield of a tomato crop. Several independent variables, say amount of fertiliser (X1), amount of water (X2), and hours of sunlight on clear days (X3), could all have an effect on the yield. The multiple regression model that relates a dependent variable y to a set of quantitative independent variables is a direct extension of a polynomial regression model in one independent variable. Any independent variable may be a power of other independent variables, for example x2 might be $x_1^2$, or a cross-product, for example x3 might be the term x1x2. A point to note is that no x may be a perfect linear function of the other xs.

In general, βj (j ≠ 0) represents the expected change in y for a unit increase in xj while holding all other xs constant. The simplest model that allows for interaction between x1 and x2 is

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \varepsilon$$

Say, for a given x2 = 2, the expected value of y, denoted E(y), is expressed as

$$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 (2) + \beta_3 x_1 (2) = (\beta_0 + 2\beta_2) + (\beta_1 + 2\beta_3) x_1$$

Here the intercept and slope are $(\beta_0 + 2\beta_2)$ and $(\beta_1 + 2\beta_3)$, respectively.
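The effect of the interaction term can be illustrated numerically. In the sketch below (plain Python), the coefficient values are hypothetical and chosen only for illustration; the point is that fixing x2 at 2 turns the interaction model into a straight line in x1 with intercept β0 + 2β2 and slope β1 + 2β3.

```python
def expected_y(x1, x2, beta):
    """E(y) for the interaction model b0 + b1*x1 + b2*x2 + b3*x1*x2."""
    b0, b1, b2, b3 = beta
    return b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2


# Hypothetical coefficient values, for illustration only.
beta = (1.0, 2.0, 3.0, 0.5)
x2_fixed = 2

# Intercept and slope of the line in x1 when x2 is held at 2.
intercept = beta[0] + x2_fixed * beta[2]
slope = beta[1] + x2_fixed * beta[3]

for x1 in range(4):
    print(x1, expected_y(x1, x2_fixed, beta), intercept + slope * x1)
```

Both printed columns agree for every x1. Choosing a different fixed value of x2 changes the slope as well as the intercept, which is what distinguishes an interaction model from a purely additive one.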


10. INTRODUCTION TO MULTIVARIATE ANALYSIS

10.1 An overview
Multivariate data occur in all branches of science. Almost all data collected by today’s researchers can be classified as multivariate data. For example, a marketing researcher might be interested in identifying characteristics of individuals that would enable the researcher to determine whether a certain individual is likely to purchase a specific product. A wheat breeder might be interested in more than just the yields of some new varieties of wheat. The wheat breeder may also be interested in these varieties’ resistance to insect damage and drought. A social scientist might be interested in studying relationships between teenage girls’ dating behaviours and their fathers’ attitudes. The objectives of scientific investigations to which multivariate techniques most naturally lend themselves include the following:

• Data reduction or structural simplification.
• Sorting and grouping.
• Investigation of the dependence among variables.
• Prediction.
• Hypothesis construction and testing.

Multivariate techniques are applicable when more than one variable is measured on an experimental unit. Such variables could be correlated, and univariate analysis would not be helpful in extracting the relevant information. Multivariate techniques are classified into two categories, namely variable directed and individual (or experimental unit) directed. Some of these techniques are:

Variable directed
• Principal component analysis (PCA)
• Factor analysis (FA)
• Canonical correlation analysis (CCA)
• Multiple regression analysis (MRA)

Individual directed
• Discriminant analysis (DA)
• Cluster analysis (CA)
• Multivariate analysis of variance (MANOVA)

The above techniques will be discussed with examples later in Section 9.3.

10.2 Possible areas of applications


Medicine and health

Example 10.1a
A study was conducted to investigate the reactions of cancer patients to radiotherapy. Measurements were made on 6 reaction variables for 98 patients.
Interest: data reduction.

Example 10.1b
Research on the genetic basis for alcoholism. One group has found that the activity of two enzymes (monoamine oxidase and adenylate cyclase) produced by platelets was significantly reduced in alcoholics. The results of this study hold promise for the development of a simple screening test for the early detection of alcoholism.
Interest: to identify and measure physiological variables that could be used effectively to discriminate alcoholics from nonalcoholics.

Sociology

Example 10.2a
Competing current theories suggest that one strong socioeconomic dimension and a few minor unexplored dimensions determine the structure of American occupations. Measurements on 25 variables for 583 occupations were analysed using multivariate methods in order to provide support for one or two of the positions.
Interest: hypothesis verification.

Example 10.2b
In a study of mobility, counts of the number of foreign-born and second-generation U S residents in 1970 were tabulated by country of origin and state of residence.
Interest: to find natural homogeneous groupings.

Business and economics

Example 10.3a
Measurements of 6 accounting and financial variables were used in developing a multivariate model to help insurance regulators identify potentially insolvent property-liability insurers. Using the model, an insurance company could be classified as solvent or distressed, and remedial steps could then be taken to prevent bankruptcy of the distressed firm.
Interest: to obtain a classification rule for distinguishing solvent firms from distressed firms.

Example 10.3b
Knowledge of the relationships among policy instruments and goals for underdeveloped countries can aid the process of national development and modernisation. Data from 74 non-communist underdeveloped countries allowed an investigator to find the subsets of goals and instruments most closely associated with each other and to estimate the nature of the simultaneous relationships between the two subsets.
Interest: to determine the dependence between two sets of variables corresponding to goals and instruments.

Education

Example 10.4
Scholastic Aptitude Test (SAT) scores and high school academic performance are often used as indicators of academic success in college. Measurements on 5 precollege predictor variables and 4 college performance criterion variables were used to determine the association between the predictor and criterion scores. The study was concerned with substantiating the usefulness of test scores and high school achievement as predictors of college performance.
Interest: prediction of college performance variables based on the set of predictor variables.

Biology

Example 10.5a
Two species of chickweed have proved difficult to identify. Measurements on 4 variables for chickweed plants, known to belong to the two species, were used to construct a function whose values allowed one to separate the two groups. Consequently, the function could be used to classify a new candidate plant as belonging to one species or the other.
Interest: sorting or classification.

Example 10.5b
In plant breeding it is necessary, after the end of one generation, to select those plants that will be the parents of the next generation. The selection is to be done in such a way that the succeeding generation will be improved in a number of characteristics over the previous generation. Many characteristics are often measured and evaluated. The plant breeder’s goal is to maximise the genetic gain in the minimum amount of time. Multivariate techniques were used in a bean-breeding programme to convert measurements on several variables relating to yield and protein content into a “selection index.” Scores on this index were then used to determine parents of the subsequent families of beans.
Interest: construction of an index to replace measurements on many variables and the development of a sorting rule.

Environmental studies

Example 10.6
The atmospheric concentrations of air pollutants in the Los Angeles area have been extensively studied. In one study, daily measurements on seven pollution-related variables were recorded over an extended period of time. Of immediate interest was whether the levels of air pollutants were roughly constant throughout the week or whether there was a noticeable difference between weekdays and weekends.
Interest: hypothesis testing and data reduction.


Other areas where multivariate techniques apply are meteorology, geology, psychology and sports.

10.2 Principal component analysis
Principal component analysis (PCA) is useful for discovering the dimensionality of the data, screening data, checking for clusters and finding abnormalities. It groups together variables that are highly correlated: variables within a group are highly correlated with one another, whereas variables in different groups are largely uncorrelated. The new variables, the principal components (PCs), are expressed as linear combinations of the p original variables. Principal component scores are often used as inputs to other analyses. For example, multiple regression analysis can suffer from multicollinearity, which arises when the predictor variables are correlated. In such a situation, the selected PC scores are used as regressors. Plots of the first few PC scores help to identify outliers and clusters that may be present in the data.

10.3 Factor analysis
Factor analysis follows the same principle as PCA. The main difference is that factor analysis rests on a model with distributional properties whereas PCA does not. The aim is for a few factors to explain the original variables with little loss of information. When the new factors cannot be interpreted, rotation techniques, some of which are orthogonal, are applied. The PCs selected using PCA can be used as the new factors.

10.4 Discriminant analysis
Discriminant analysis is a multivariate procedure used to develop a rule that separates two or more groups of individuals, given measurements for these individuals on several variables. Discriminant analysis is similar to regression analysis except that the dependent variable is categorical rather than continuous. In regression analysis the interest is in predicting the value of a variable based on a set of predictor variables. In discriminant analysis, the interest is in predicting the class membership of an individual observation based on a set of predictor variables. Several rules exist: a likelihood rule, the linear discriminant function rule, a Mahalanobis distance rule, a posterior probability rule, etc. The groups are known beforehand.

10.5 Cluster Analysis
Suppose a study on the farming system in a given area has been conducted. Variables measured on each farm in the data set might include the period the farm had been farmed, number of animals, fertiliser used, type of trees, average income, soil types, crops grown, size of


the family, labour, etc. The researcher wants to use this information to partition farmers into subgroups, so that farmers who fall into the same subgroup have similar characteristics with respect to the measured variables. The partition would allow for efficient use of the resources by the farmers.

In more general terms, suppose a researcher has data collected on a large number of experimental units. The basic question posed for cluster analysis is whether it is possible to devise a classification or grouping scheme that partitions the experimental units into classes or groups, called clusters, so that the units within a cluster are similar to one another while units in different clusters are dissimilar. Cluster analysis involves techniques that produce classifications from data that are initially unclassified. It must not be confused with discriminant analysis, where one initially knows how many distinct groups exist and has data known to come from each of these groups.

11. CATEGORICAL DATA ANALYTIC METHODS

11.1 Introduction
In many studies measurements are made on binary rather than numerical scales. An example is a study of attitudes or opinions in which the two categories of the response variable are agree or disagree. Other forms of response are exposed or not exposed, yes or no, present or absent, improved or unimproved. The data collected relate to questions such as: How many have the attribute? How many said yes? We end up with frequency counts.

Analysis of such data uses the chi-square distribution, denoted χ². This distribution is defined as the distribution of a sum of squares of independent, normally distributed variables with zero mean and unit variance. The table values are read at the intersection of the α value and the respective degrees of freedom. For example, for α = 0.05 and 6 degrees of freedom, the value from Table C is 12.6.

There are three areas of inferential statistics in which the chi-square test for significance is commonly applied. They are

• Tests for independence of association;
• Tests for equality of proportions in more than two populations; and
• Tests for goodness of fit.

The chi-square statistic tests the null hypothesis by comparing a set of observed frequencies, which are based on sample findings, with a set of expected frequencies, which describe the null hypothesis. It measures the extent to which the observed and expected frequencies differ. Large differences will result in the null hypothesis being rejected. The chi-square statistic is computed as


\[
\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}, \qquad i = 1, 2, \ldots, k
\]

where
Oi is the ith observed count,
Ei is the ith expected count, and
k is the number of categories.

The calculated χ² is compared against a table value obtained using k − 1 degrees of freedom and a specified α level. In a contingency table every cell contributes one term to the sum, so the number of terms equals the total number of cells; the degrees of freedom in that case are (r − 1)(c − 1), as given in Section 11.2.

11.2 Test for independence of association
This test is applied when an investigator wishes to determine the independence of two random variables. Independence implies that outcomes of one random variable in no way influence the outcomes of a second random variable. The null hypothesis and alternative hypothesis are stated as follows:

H0: The two variables are independent.
against
Ha: The two variables are dependent.

The procedure is illustrated through the following example.

Example 11.1 A certain brewery company manufactures and distributes three types of beer, categorised as 1) a low-calorie light beer, 2) a regular beer and 3) a dark beer. In an analysis of the market segments for the three beers, the firm’s market research group has raised the question of whether preferences for the three beers differ between male and female beer drinkers. If beer preference is independent of the sex of the beer drinker, one advertising campaign will be initiated for all the beers. However, if beer preference depends on the sex of the beer drinker, the company will tailor its promotions towards different target markets. The hypotheses of this test are stated as

H0: Beer preference is independent of the sex of the beer drinker. Against

Ha: Beer preference is not independent of the sex of the beer drinker (i.e., males and females differ in their preference).

A sample is selected and each individual is asked to state his or her preference among the company’s three beers. Every individual in the sample is placed in one of six cells (3 x 2 = 6). The table generated by the 3 x 2 cells is called a contingency table. The test of independence makes use of the contingency table format and for this reason is sometimes referred to as a contingency table test.


Suppose that a simple random sample of 150 beer drinkers has been selected. After taste-testing the three beers, the individuals in the sample are asked to state their preference, or first choice. The responses are presented in the contingency table below.

Observed frequencies (Oij)

                  Beer Preference
Sex         Light   Regular   Dark   Totals
Male           20        40     20       80
Female         30        30     10       70
Totals         50        70     30      150

Expected frequencies for the cells of the contingency table are based on the following rationale.

Assume the null hypothesis. Under this assumption, 50/150 = 1/3 of the beer drinkers prefer light beer, 70/150 = 7/15 prefer regular beer, and 30/150 = 1/5 prefer dark beer. If the independence assumption is valid, these same fractions must be applicable to both male and female beer drinkers. Thus, under the assumption of independence, we would expect the 80 male drinkers to show that (1/3)(80) = 26.67 prefer light beer, (7/15)(80) = 37.33 prefer regular beer, and (1/5)(80) = 16 prefer dark beer. A similar argument follows for female beer drinkers.

Expected frequencies if beer preference is independent of the sex of the beer drinker (Eij)

                  Beer Preference
Sex         Light   Regular    Dark   Totals
Male        26.67     37.33   16.00       80
Female      23.33     32.67   14.00       70
Totals         50        70      30      150

The general formula for computing expected frequencies for a contingency table in the test for independence is

Eij = (Row i Total)(Column j Total) / (Sample size)

In general, the contingency table test statistic is computed as

\[
\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
\]


where, Oij is the observed frequency for contingency table category in row i and column j.

Eij is the expected frequency for contingency table in row i and column j based on the assumption of independence.

With r rows and c columns in the contingency table, the test statistic has a chi-square distribution with (r-1)(c-1) degrees of freedom, provided the expected frequencies are 5 or more for all categories. Referring back to our example, we note that all expected frequencies are at least 5. Thus, the sample size is adequate and we can proceed to calculate the chi-square statistic.
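The calculation for Example 11.1 can also be sketched in Python as a cross-check on the hand arithmetic. This is a minimal illustration, not part of the original module: the function names are hypothetical, only the standard library is used, and the critical value is copied from Table C rather than computed.

```python
def expected_counts(observed):
    """Expected cell counts under independence: (row total)(column total) / n."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square_independence(observed):
    """Return (statistic, degrees of freedom, expected counts) for an r x c table."""
    expected = expected_counts(observed)
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return stat, df, expected


# Observed frequencies from Example 11.1
# (rows: male, female; columns: light, regular, dark).
beer = [[20, 40, 20], [30, 30, 10]]
stat, df, expected = chi_square_independence(beer)

CRITICAL_5PCT_DF2 = 5.991  # Table C entry for 2 degrees of freedom, P = 0.05
all_cells_adequate = all(e >= 5 for row in expected for e in row)
reject_h0 = stat > CRITICAL_5PCT_DF2
print(round(stat, 2), df, all_cells_adequate, reject_h0)
```

Because the sketch keeps the expected frequencies unrounded, it gives a statistic of about 6.12 rather than the 6.13 obtained from expected frequencies rounded to two decimals; the decision to reject H0 is the same either way.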

\[
\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
= \frac{(20 - 26.67)^2}{26.67} + \frac{(40 - 37.33)^2}{37.33} + \cdots + \frac{(10 - 14.00)^2}{14.00}
= 1.67 + 0.19 + \cdots + 1.14 = 6.13
\]

Degrees of freedom = (r-1)(c-1) = (2-1)(3-1) = 2. Using α = 0.05, the table value is χ² = 5.99. We reject H0 since the calculated χ² = 6.13 is greater than the table value 5.99, and conclude that the preference for the beers is not independent of the sex of the beer drinkers.

Exercise 11.1

11.1 The following data are on the distribution of employment status in five areas, denoted by polygon codes, from KZN. The study involved a random sample of 2942 persons.

                                     Polygons
Employment Status   5010012   5010013   5010014   5010015   5010016
EMPLOYED                344       435       291        30       257
UNEMPLOYED               13        25        14         1        15
NOT_WORKIN              211       276       189        72       130
UNSPECIFIE              178       218       125        14       104
TOTAL                   746       954       619       117       506

Using α = 0.05, test whether there is an association between employment status and the five areas.

11.2 The Abacus Media Company publishes 4 magazines for the teenager (between 13 and 17 years of age) market. The executive editor of Abacus would like to know whether readership preference for the four magazines is independent of gender. A survey of 200 teenagers in stationery stores was carried out. Randomly selected teenagers who bought at least one of the four magazines were asked to indicate which of the four magazines they preferred. Their responses are presented below.


               Magazine Preference
Gender    Beat   Youth   Grow   Live
Girls       18      12     20     28
Boys        38      26     34     24

Using α = 0.05, test whether there is an association between gender and magazine preference.

11.3 A motor vehicle distributor wishes to find out if the size of car bought is in any way related to the age of the buyer. From sales invoices over the past two years, a sample of 300 buyers was classified by size of the car bought and buyer’s age. The following contingency table was constructed.

                    Car size bought
Buyer’s Age    Small   Medium   Large
Under 30          10       22      34
30 – 45           24       42      48
Over 45           52       32      36

Using α = 0.05, test whether car size bought and buyer’s age are independent. Interpret your results.

11.4 A sample of parts provided the following contingency table data concerning part quality and production shift.

               Number
Shift      Good   Defective
First       368          32
Second      285          15
Third       176          24

Use α = 0.05 and test whether part quality is independent of the production shift. What is your conclusion?

11.3 Tests for equality of proportions in more than two populations
Earlier sections discussed the case of comparing two population proportions using either normal or t-distributions. The situation is different when more than two population proportions are to be compared. The chi-square distribution is used in such a situation. The test for equality of proportions in more than two populations is equivalent to the test for independence of association. The null hypothesis states that the proportion falling in a given category of one random variable is the same across all categories of a second random variable. The following example illustrates the procedure used to test for the equality of proportions in more than two populations.


Example 11.2 A local air carrier would like to know if there is any difference between the proportions of travellers classified as business or non-business making reservations for each of their four classes. A survey of 300 reservations over the past week shows the following use of each class of travel by passengers.

The observed frequencies

                              Class of Travel
Type of traveller   Emerald   Amethyst   Diamond   Ruby   Row Totals
Business                 32         22        42     32          128
Non-Business             48         26        68     30          172
Column Totals            80         48       110     62          300

Let
P1 = Proportion of Emerald class business travellers.
P2 = Proportion of Amethyst class business travellers.
P3 = Proportion of Diamond class business travellers.
P4 = Proportion of Ruby class business travellers.

Hypothesis
H0 : P1 = P2 = P3 = P4
H1 : At least one population proportion is different.

Note that the null hypothesis could also be stated as: type of traveller is independent of the class of travel used.

The expected frequencies

                              Class of Travel
Type of traveller   Emerald   Amethyst   Diamond   Ruby   Row Totals
Business               34.1       20.5      46.9   26.5          128
Non-Business           45.9       27.5      63.1   35.5          172
Column Totals            80         48       110     62          300

Test statistic

\[
\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
= \frac{(32 - 34.1)^2}{34.1} + \frac{(22 - 20.5)^2}{20.5} + \cdots + \frac{(30 - 35.5)^2}{35.5}
= 0.1293 + 0.1096 + \cdots + 0.8521 = 3.3028
\]
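The equality-of-proportions calculation for Example 11.2 can be sketched as follows. This is an illustrative check only, written under one assumption of interpretation: P1 to P4 are read as the share of business travellers within each class of travel, which is the reading under which H0 coincides with independence. Variable names are hypothetical and the critical value is taken from Table C.

```python
# Observed reservations from Example 11.2, one list per type of traveller.
observed = {
    "Business":     [32, 22, 42, 32],
    "Non-Business": [48, 26, 68, 30],
}
classes = ["Emerald", "Amethyst", "Diamond", "Ruby"]

col_totals = [sum(col) for col in zip(*observed.values())]
n = sum(col_totals)

# Sample share of business travellers within each class (estimates of P1..P4)
# and the pooled share that all four classes would have in common under H0.
business_share = [b / t for b, t in zip(observed["Business"], col_totals)]
pooled_share = sum(observed["Business"]) / n

# Chi-square statistic using unrounded expected frequencies.
chi_sq = 0.0
for row in observed.values():
    row_total = sum(row)
    for o, col_total in zip(row, col_totals):
        e = row_total * col_total / n
        chi_sq += (o - e) ** 2 / e

df = (len(observed) - 1) * (len(classes) - 1)
CRITICAL_5PCT_DF3 = 7.815  # Table C entry for 3 degrees of freedom, P = 0.05
reject_h0 = chi_sq > CRITICAL_5PCT_DF3

print([round(p, 3) for p in business_share], round(pooled_share, 3))
print(round(chi_sq, 2), df, reject_h0)
```

With unrounded expected frequencies the statistic comes out near 3.36, slightly above the hand value, which uses expected frequencies rounded to one decimal. Both values are well below 7.815, so the decision does not change.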


Using α = 0.05 and (2-1)(4-1) = 3 degrees of freedom, the table value is χ² = 7.815. We fail to reject H0 since the calculated χ² = 3.3028 is not greater than the table value 7.815. We conclude that there is no evidence that the proportion of business travellers differs between the classes of travel. This finding is equivalent to concluding that type of traveller and class of travel are independent in a test for independence of association.

Exercise 11.2

11.5 An insurance organisation sampled its field sales force in the four provinces concerning their attitudes towards compensation. Respondents were given the choice between the present method (fixed salary plus year-end bonus) and a proposed new method (straight commission).

                                 Province
Response preference   Cape   Transvaal   OFS   Natal
Present method          68         135    47      79
New method              32          50    23      31

a) Test, at the 5 % level of significance, whether the proportion of sales staff who prefer the present method differs between the four provinces.
b) Interpret your findings.

11.4 Tests for goodness of fit
The following are the general steps used to conduct a goodness of fit test for any hypothesised probability distribution:

• Formulate a null hypothesis indicating a hypothesised distribution for k classes or categories of a population.
• Select a simple random sample of n items, and record the observed frequencies for each of the k classes or categories.
• Based on the assumption that the null hypothesis is true, determine the expected frequencies for each category.
• Use the observed and expected frequencies to compute a value of χ² for the test.
• Reject H0 if the calculated χ² value is greater than the table χ² value obtained with k-1 degrees of freedom at the α level of significance.

We illustrate the computation through the following example.

Example 11.3 Patients who arrive for treatment at the emergency room of a large metropolitan hospital are assigned to one of the following three categories based on the seriousness of their condition.

Category 1: Patient condition is stable; immediate treatment by a physician is not required.


Category 2: Patient condition is serious; immediate treatment is not required, but the patient should be monitored for vital signs until a physician is available.
Category 3: Patient condition is critical; the patient’s life will be endangered without immediate treatment.

The population of interest is a multinomial population since the condition of each patient is classified into one and only one of the three categories stable, serious, and critical. The available information over the last year indicates that 50 % of the patients who arrived for treatment were classified as stable, 30 % as serious, and 20 % as critical. There has been an increased volume for the emergency room due to recent improvement. The director of the hospital is concerned that the percentages of patients classified as having stable, serious, or critical conditions may also have changed. Validation of this claim is required.

Let
P1 = fraction of patients classified as stable.
P2 = fraction of patients classified as serious.
P3 = fraction of patients classified as critical.

Hypothesis
H0 : P1 = 0.50, P2 = 0.30, P3 = 0.20
H1 : The population proportions are not P1 = 0.50, P2 = 0.30, P3 = 0.20

Suppose the hospital selected a sample of 200 patients who have been tested since the volume increased in the emergency room. The following are the observed frequencies.

Stable   Serious   Critical
    98        48         54

The expected frequencies for each category under H0 are

Stable              Serious            Critical
200(0.50) = 100     200(0.30) = 60     200(0.20) = 40

The goodness of fit test focuses on the differences between the observed frequencies and the expected frequencies. With the expected frequencies greater than 5 for all three categories, the sample size requirement is satisfied and we proceed to compute the test statistic.

Test statistic

\[
\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}
= \frac{(98 - 100)^2}{100} + \frac{(48 - 60)^2}{60} + \frac{(54 - 40)^2}{40}
= 0.04 + 2.40 + 4.90 = 7.34
\]
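The goodness of fit steps, applied to the emergency room data, can be sketched as below. The function name is hypothetical and only the standard library is used. The p-value line relies on a property specific to 2 degrees of freedom, where the chi-square upper-tail probability reduces to exp(-x/2); it is not a general formula for other degrees of freedom.

```python
import math


def goodness_of_fit(observed, hypothesised_probs):
    """Return (statistic, degrees of freedom, expected counts) for k categories."""
    n = sum(observed)
    expected = [n * p for p in hypothesised_probs]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, len(observed) - 1, expected


# Example 11.3: stable, serious, critical.
observed = [98, 48, 54]
stat, df, expected = goodness_of_fit(observed, [0.50, 0.30, 0.20])

CRITICAL_5PCT_DF2 = 5.991  # Table C entry for 2 degrees of freedom, P = 0.05
reject_h0 = stat > CRITICAL_5PCT_DF2

# Upper-tail probability; this closed form holds only because df == 2.
p_value = math.exp(-stat / 2)

print(round(stat, 2), df, reject_h0, round(p_value, 4))
```

The p-value of roughly 0.025 is below 0.05, which agrees with comparing 7.34 against the table value.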


Using α = 0.05 and k − 1 = 3 − 1 = 2 degrees of freedom, the table value is χ² = 5.99. We reject H0 since χ² = 7.34 is larger than the critical value 5.99. In rejecting H0 we conclude that the increase in volume for the emergency room has altered the percentages of patients whose conditions are stable, serious, or critical.

The goodness of fit test uses the chi-square distribution to determine whether a hypothesised probability distribution for a population provides a good fit. Acceptance or rejection of the hypothesised probability distribution depends on the differences between the observed frequencies in a sample and the expected frequencies based on the assumed probability distribution.

Exercise 11.3

11.6 Conduct a test of the following hypothesis using the chi-square goodness of fit test.

H0 : PA = 0.40, PB = 0.40, PC = 0.20
H1 : The population proportions are not PA = 0.40, PB = 0.40, PC = 0.20

11.7 A sample of size 200 yielded 60 in category A, 120 in category B, and 20 in category C. Using α = 0.01, test to see if the proportions are as stated in H0 of Exercise 11.6.

11.8 A manufacturer has adopted a new container design. Colour preferences indicated in a sample of 150 individuals are as follows.

Red   Blue   Green
 40     64      46

Test using α = 0.1 to see if the colour preferences are different. (Hint: Formulate the null hypothesis as H0 : P1 = P2 = P3 = 1/3.)

11.9 Grade distribution guidelines for a statistics course at a major university are as follows:

10 % A, 30 % B, 40 % C, 15 % D, and 5 % F.

A sample of 120 statistics grades at the end of a semester showed 18 A’s, 30 B’s, 40 C’s, 22 D’s, and 10 F’s.

Test using α = 0.05 to see if the actual grades deviate significantly from the grade distribution guidelines.

11.10 An accountant for a department store knows from past experience that 23 % of the store’s customers pay cash for their purchases, 35 % write cheques, and the remaining 42 % use credit cards. The accountant examines a random sample of 200 sales receipts for the week before Christmas and makes the following sales summary.


                      Cash   Cheque   Credit cards
Number of Customers     37       47            116

Use the chi-square goodness of fit test to see if the preceding percentages fit these observations. Use α = 0.05.

11.11 Consider the following data on age distribution in the two polygons.

Age Group    5010001 Frequency   5090061 Frequency
0-10                       177                  75
11-20                      231                  54
21-30                      240                  34
31-40                      141                  14
41-50                      169                  18
51-60                      124                  10
61-70                       38                   9
71-80                        8                   1
81-90                        1                   0
91-100                       0                   0
Over 101                     0                   0
UN                           2                   0
TOTAL                     1131                 215

Use the chi-square goodness of fit test to see if the age group distribution for polygon 5090061 follows a Poisson distribution. Use α = 0.05.


APPENDIXES

TABLE A
The Normal Distribution

\[
\Pr(Z \le z) = \Phi(z) = \int_{-\infty}^{z} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-w^2/(2\sigma^2)}\, dw, \qquad \sigma = 1
\]

Φ(-z) = 1 - Φ(z)

z       Φ(z)     z       Φ(z)     z       Φ(z)
0.00    0.500    1.10    0.864    2.05    0.980
0.05    0.520    1.15    0.875    2.10    0.982
0.10    0.540    1.20    0.885    2.15    0.984
0.15    0.560    1.25    0.894    2.20    0.986
0.20    0.579    1.282   0.900    2.25    0.988
0.25    0.599    1.30    0.903    2.30    0.989
0.30    0.618    1.35    0.911    2.326   0.990
0.35    0.637    1.40    0.919    2.35    0.991
0.40    0.655    1.45    0.926    2.40    0.992
0.45    0.674    1.50    0.933    2.45    0.993
0.50    0.691    1.55    0.939    2.50    0.994
0.55    0.709    1.60    0.945    2.55    0.995
0.60    0.726    1.645   0.950    2.576   0.995
0.65    0.742    1.65    0.951    2.60    0.995
0.70    0.758    1.70    0.955    2.65    0.996
0.75    0.773    1.75    0.960    2.70    0.997
0.80    0.788    1.80    0.964    2.75    0.997
0.85    0.802    1.85    0.968    2.80    0.997
0.90    0.816    1.90    0.971    2.85    0.998
0.95    0.829    1.95    0.974    2.90    0.998
1.00    0.841    1.960   0.975    2.95    0.998
1.05    0.853    2.00    0.977    3.00    0.999
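Values of Φ(z) that are not listed in Table A can be computed from the error function. The sketch below is an illustration using Python's standard library; the function name `phi` is a hypothetical choice, and the identity Φ(z) = ½[1 + erf(z/√2)] is the standard relation between the normal distribution function and erf.

```python
import math


def phi(z):
    """Standard normal distribution function Pr(Z <= z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Reproduce a few Table A entries, rounded to three decimals as in the table.
for z in (0.00, 1.00, 1.645, 1.960, 2.576):
    print(f"{z:5.3f}  {phi(z):.3f}")

# The symmetry relation stated with the table: phi(-z) = 1 - phi(z).
symmetry_gap = abs(phi(-1.25) - (1.0 - phi(1.25)))
```

Lower-tail probabilities for negative z follow from the symmetry relation, so the table only needs non-negative z.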


TABLE B
The t-Distribution

\[
\Pr(T \le t) = \int_{-\infty}^{t} \frac{\Gamma[(r+1)/2]}{\sqrt{\pi r}\,\Gamma(r/2)\,(1 + w^2/r)^{(r+1)/2}}\, dw
\]

[Pr(T ≤ -t) = 1 - Pr(T ≤ t)]

                     Pr(T ≤ t)
 r     0.90     0.95    0.975     0.99    0.995
 1    3.078    6.314   12.706   31.821   63.657
 2    1.886    2.920    4.303    6.965    9.925
 3    1.638    2.353    3.182    4.541    5.841
 4    1.533    2.132    2.776    3.747    4.604
 5    1.476    2.015    2.571    3.365    4.032
 6    1.440    1.943    2.447    3.143    3.707
 7    1.415    1.895    2.365    2.998    3.499
 8    1.397    1.860    2.306    2.896    3.355
 9    1.383    1.833    2.262    2.821    3.250
10    1.372    1.812    2.228    2.764    3.169
11    1.363    1.796    2.201    2.718    3.106
12    1.356    1.782    2.179    2.681    3.055
13    1.350    1.771    2.160    2.650    3.012
14    1.345    1.761    2.145    2.624    2.977
15    1.341    1.753    2.131    2.602    2.947
16    1.337    1.746    2.120    2.583    2.921
17    1.333    1.740    2.110    2.567    2.898
18    1.330    1.734    2.101    2.552    2.878
19    1.328    1.729    2.093    2.539    2.861
20    1.325    1.725    2.086    2.528    2.845
21    1.323    1.721    2.080    2.518    2.831
22    1.321    1.717    2.074    2.508    2.819
23    1.319    1.714    2.069    2.500    2.807
24    1.318    1.711    2.064    2.492    2.797
25    1.316    1.708    2.060    2.485    2.787
26    1.315    1.706    2.056    2.479    2.779
27    1.314    1.703    2.052    2.473    2.771
28    1.313    1.701    2.048    2.467    2.763
29    1.311    1.699    2.045    2.462    2.756
30    1.310    1.697    2.042    2.457    2.750


TABLE C
The Chi-square Distribution

Upper Probability Points

\[
P = \Pr\left(\chi^2_{\nu} \ge \chi^2_{\nu,P}\right)
\]

Entries in the table are the values χ²(ν, P) of the χ²-distribution for various degrees of freedom ν and one-tailed probabilities P.

                                              P
 ν     0.99    0.975     0.95     0.90     0.50     0.10     0.05    0.025     0.01    0.005
 1    0.000    0.001    0.004    0.016    0.455    2.706    3.841    5.024    6.635    7.879
 2    0.020    0.051    0.103    0.211    1.386    4.605    5.991    7.378    9.210   10.597
 3    0.115    0.216    0.352    0.584    2.366    6.251    7.815    9.348   11.345   12.838
 4    0.297    0.484    0.711    1.064    3.357    7.779    9.488   11.143   13.277   14.860
 5    0.554    0.831    1.145    1.610    4.351    9.236   11.070   12.833   15.086   16.750
 6    0.872    1.237    1.635    2.204    5.348   10.645   12.592   14.449   16.812   18.548
 7    1.239    1.690    2.167    2.833    6.346   12.017   14.067   16.013   18.475   20.278
 8    1.646    2.180    2.733    3.490    7.344   13.362   15.507   17.535   20.090   21.955
 9    2.088    2.700    3.325    4.168    8.343   14.684   16.919   19.023   21.666   23.589
10    2.558    3.247    3.940    4.865    9.342   15.987   18.307   20.483   23.209   25.188
11    3.053    3.816    4.575    5.578   10.341   17.275   19.675   21.920   24.725   26.757
12    3.571    4.404    5.226    6.304   11.340   18.549   21.026   23.337   26.217   28.300
13    4.107    5.009    5.892    7.042   12.340   19.812   22.362   24.736   27.688   29.819
14    4.660    5.629    6.571    7.790   13.339   21.064   23.685   26.119   29.141   31.319
15    5.229    6.262    7.261    8.547   14.339   22.307   24.996   27.488   30.578   32.801
16    5.812    6.908    7.962    9.312   15.338   23.542   26.296   28.845   32.000   34.267
17    6.408    7.564    8.672   10.085   16.338   24.769   27.587   30.191   33.409   35.718
18    7.015    8.231    9.390   10.865   17.338   25.989   28.869   31.526   34.805   37.156
19    7.633    8.907   10.117   11.651   18.338   27.204   30.144   32.852   36.191   38.582
20    8.260    9.591   10.851   12.443   19.337   28.412   31.410   34.170   37.566   39.997
21    8.897   10.283   11.591   13.240   20.337   29.615   32.671   35.479   38.932   41.401
22    9.542   10.982   12.338   14.041   21.337   30.813   33.924   36.781   40.289   42.796
23   10.196   11.689   13.091   14.848   22.337   32.007   35.172   38.076   41.638   44.181
24   10.856   12.401   13.848   15.659   23.337   33.196   36.415   39.364   42.980   45.559
25   11.524   13.120   14.611   16.473   24.337   34.382   37.652   40.646   44.314   46.928
26   12.198   13.844   15.379   17.292   25.336   35.563   38.885   41.923   45.642   48.290
27   12.879   14.573   16.151   18.114   26.336   36.741   40.113   43.195   46.963   49.645
28   13.565   15.308   16.928   18.939   27.336   37.916   41.337   44.461   48.278   50.993
29   14.256   16.047   17.708   19.768   28.336   39.087   42.557   45.722   49.588   52.336
30   14.953   16.791   18.493   20.599   29.336   40.256   43.773   46.979   50.892   53.672

For ν > 30, \(\sqrt{2\chi^2} - \sqrt{2\nu - 1}\) is approximately distributed as normal (0, 1).
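The normal approximation stated under Table C can be turned into approximate upper probability points for degrees of freedom beyond the table. Setting √(2χ²) − √(2ν − 1) equal to the standard normal upper point z_P and solving for χ² gives χ² ≈ ½(z_P + √(2ν − 1))². The sketch below is illustrative only; the function name is hypothetical and z_P is read from Table A.

```python
import math


def chi_square_upper_point_approx(nu, z_p):
    """Approximate upper point from sqrt(2*chi2) - sqrt(2*nu - 1) ~ N(0, 1)."""
    return 0.5 * (z_p + math.sqrt(2 * nu - 1)) ** 2


Z_05 = 1.645  # upper 5 % point of the standard normal, from Table A

# nu = 40 lies beyond the table, where the approximation is meant to be used.
approx_40 = chi_square_upper_point_approx(40, Z_05)

# nu = 30 is the last tabulated row, so it gives a rough accuracy check
# against the Table C entry 43.773.
approx_30 = chi_square_upper_point_approx(30, Z_05)

print(round(approx_40, 2), round(approx_30, 2))
```

At ν = 30 the approximation falls a little short of the tabulated 43.773, which gives a feel for the size of the error near the lower end of its intended range.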


TABLE D
The F-Distribution

\[
\Pr(F \le b) = \int_{0}^{b} \frac{\Gamma[(r_1 + r_2)/2]\,(r_1/r_2)^{r_1/2}\, w^{r_1/2 - 1}}{\Gamma(r_1/2)\,\Gamma(r_2/2)\,(1 + r_1 w / r_2)^{(r_1 + r_2)/2}}\, dw
\]

                                                 r1
Pr(F ≤ b)   r2      1      2      3      4      5      6      7      8      9     10     12     15
0.95         1    161    200    216    225    230    234    237    239    241    242    244    246
0.975             648    800    864    900    922    937    948    957    963    969    977    985
0.99             4052   4999   5403   5625   5764   5859   5928   5982   6023   6056   6106   6157
0.95         2   18.5   19.2   19.2   19.2   19.3   19.3   19.4   19.4   19.4   19.4   19.4   19.4
0.975            38.5   39.0   39.2   39.2   39.3   39.3   39.4   39.4   39.4   39.4   39.4   39.4
0.99             98.5   99.0   99.2   99.2   99.3   99.3   99.4   99.4   99.4   99.4   99.4   99.4
0.95         3   10.1   9.55   9.28   9.12   9.01   8.94   8.89   8.85   8.81   8.79   8.74   8.70
0.975            17.4   16.0   15.4   15.1   14.9   14.7   14.6   14.5   14.5   14.4   14.3   14.3
0.99             34.1   30.8   29.5   28.7   28.2   27.9   27.7   27.5   27.3   27.2   27.1   26.9
0.95         4   7.71   6.94   6.59   6.39   6.26   6.16   6.09   6.04   6.00   5.96   5.91   5.86
0.975            12.2   10.6   9.98   9.60   9.36   9.20   9.07   8.98   8.90   8.84   8.75   8.66
0.99             21.2   18.0   16.7   16.0   15.5   15.2   15.0   14.8   14.7   14.5   14.4   14.2
0.95         5   6.61   5.79   5.41   5.19   5.05   4.95   4.88   4.82   4.77   4.74   4.68   4.62
0.975            10.0   8.43   7.76   7.39   7.15   6.98   6.85   6.76   6.68   6.62   6.52   6.43
0.99             16.3   13.3   12.1   11.4   11.0   10.7   10.5   10.3   10.2   10.1   9.89   9.72
0.95         6   5.99   5.14   4.76   4.53   4.39   4.28   4.21   4.15   4.10   4.06   4.00   3.94
0.975            8.81   7.26   6.60   6.23   5.99   5.82   5.70   5.60   5.52   5.46   5.37   5.27
0.99             13.7   10.9   9.78   9.15   8.75   8.47   8.26   8.10   7.98   7.87   7.72   7.56
0.95         7   5.59   4.74   4.35   4.12   3.97   3.87   3.79   3.73   3.68   3.64   3.57   3.51
0.975            8.07   6.54   5.89   5.52   5.29   5.12   4.99   4.90   4.82   4.76   4.67   4.57
0.99             12.2   9.55   8.45   7.85   7.46   7.19   6.99   6.84   6.72   6.62   6.47   6.31
0.95         8   5.32   4.46   4.07   3.84   3.69   3.58   3.50   3.44   3.39   3.35   3.28   3.22
0.975            7.57   6.06   5.42   5.05   4.82   4.65   4.53   4.43   4.36   4.30   4.20   4.10
0.99             11.3   8.65   7.59   7.01   6.63   6.37   6.18   6.03   5.91   5.81   5.67   5.52
0.95         9   5.12   4.26   3.86   3.63   3.48   3.37   3.29   3.23   3.18   3.14   3.07   3.01
0.975            7.21   5.71   5.08   4.72   4.48   4.32   4.20   4.10   4.03   3.96   3.87   3.77
0.99             10.6   8.02   6.99   6.42   6.06   5.80   5.61   5.47   5.35   5.26   5.11   4.96
0.95        10   4.96   4.10   3.71   3.48   3.33   3.22   3.14   3.07   3.02   2.98   2.91   2.85
0.975            6.94   5.46   4.83   4.47   4.24   4.07   3.95   3.85   3.78   3.72   3.62   3.52
0.99             10.0   7.56   6.55   5.99   5.64   5.39   5.20   5.06   4.94   4.85   4.71   4.56
0.95        12   4.75   3.89   3.49   3.26   3.11   3.00   2.91   2.85   2.80   2.75   2.69   2.62
0.975            6.55   5.10   4.47   4.12   3.89   3.73   3.61   3.51   3.44   3.37   3.28   3.18
0.99             9.33   6.93   5.95   5.41   5.06   4.82   4.64   4.50   4.39   4.30   4.16   4.01
0.95        15   4.54   3.68   3.29   3.06   2.90   2.79   2.71   2.64   2.59   2.54   2.48   2.40
0.975            6.20   4.77   4.15   3.80   3.58   3.41   3.29   3.20   3.12   3.06   2.96   2.86
0.99             8.68   6.36   5.42   4.89   4.56   4.32   4.14   4.00   3.89   3.80   3.67   3.52

