

PRACTICAL MANUAL

STATISTICAL METHODS

( UG COURSE)

Compiled by

DEPARTMENT OF MATHEMATICS AND STATISTICS

Jawaharlal Nehru Krishi Vishwa Vidyalaya,

JABALPUR 482 004


Contents

1. Graphical Representation of Data
   (1) Construction of discrete and continuous frequency distributions
   (2) Construction of bar diagram, histogram, pie diagram, frequency curve and frequency polygon

2. Measures of Central Tendency
   (1) Definition, formula and calculation of mean, median, mode, geometric mean and harmonic mean for grouped and ungrouped data
   (2) Definition, formula and calculation of quartiles, deciles and percentiles for grouped and ungrouped data

3. Measures of Dispersion
   (1) Definition, formula and calculation of absolute measures of dispersion: range, quartile deviation, mean deviation, standard deviation
   (2) Definition, formula and calculation of relative measures of dispersion, CD and CV, for grouped and ungrouped data

4. Moments, Skewness and Kurtosis
   (1) Definition and types of moments, skewness and kurtosis
   (2) Formula and calculation of raw moments, moments about origin, central moments and different types of coefficients of skewness and kurtosis

5. Correlation and Regression
   (1) Definition and types of correlation and regression
   (2) Calculation of correlation and regression coefficients along with their tests of significance

6. Tests of Significance
   (1) Definition of null and alternative hypotheses and different tests of significance
   (2) Application of the t-test for a single mean, t-test for independent samples, paired t-test, F-test and Chi-square test

7. Analysis of Variance (One-way and Two-way Classification)
   (1) Definition and steps of analysis of one-way and two-way classification
   (2) Analysis of CRD and RBD as examples of one-way and two-way ANOVA

8. Sampling Methods
   (1) Definition of SRS, SRSWR and SRSWOR, and the difference between census and sampling
   (2) Procedures for selecting a simple random sample


1. Graphical Representation of data

Mujahida Sayyed, Asst. Professor (Maths & Stat.), College of Agriculture, JNKVV, Ganjbasoda 464221 (M.P.), India

Email id : [email protected]

Frequency Distribution: A tabular presentation of data in which the values of a variable are given along with their frequencies is called a frequency distribution. Two types of frequency distribution are in use:

1. Discrete Frequency Distribution: A frequency distribution formed by the distinct values of a discrete variable, e.g. 1, 2, 5, etc.

2. Continuous Frequency Distribution: A frequency distribution formed by class intervals of a continuous variable, e.g. 0-10, 10-20, 20-30, etc.

Process: For construction of Discrete Frequency Distribution

Step I. Set the data in ascending order.

Step II. Make a blank table consisting of three columns with the title: Variable, Tally Marks and

Frequency.

Step III. Read off the observations one by one from the given data and record a tally mark against the corresponding value. For each value, every fifth tally is recorded by drawing a diagonal stroke across the preceding four; the sixth tally then starts a new group of upright strokes, and so on.

Step IV. In the end, count all the tally marks in a row and write their number in the frequency

column.

Step V. Write down the total frequency in the last row at the bottom.

Objective : Prepare a discrete frequency distribution from the following data

Kinds of data:

5 5 2 6 1 5 2 9 5 4

3 4 11 7 2 5 12 6

Solution : First arrange the data in ascending order

1 2 2 2 3 4 4 5 5 5

5 5 6 6 7 9 11 12

Prepare a table in the format described in the process above. Counting the observations by the tally method gives the required discrete frequency distribution:

Variable (X) Tally Marks Frequency (f)

1 │ 1

2 │││ 3

3 │ 1

4 ││ 2

5 ││││ 5

6 ││ 2

7 │ 1

9 │ 1

11 │ 1

12 │ 1

Total 18
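The tally count above can be reproduced with a short Python sketch (the manual itself uses no software; this is only an illustrative check):

```python
from collections import Counter

# Observations from the worked example above
data = [5, 5, 2, 6, 1, 5, 2, 9, 5, 4,
        3, 4, 11, 7, 2, 5, 12, 6]

# Counter tallies each distinct value; sorting the items gives ascending order
freq = dict(sorted(Counter(data).items()))

for value, count in freq.items():
    print(value, count)
print("Total", sum(freq.values()))
```

The printed counts match the frequency column of the table, with a total of 18.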


Continuous Frequency Distribution:

A continuous frequency distribution is a frequency distribution obtained by dividing the entire range of the given observations on a continuous variable into groups (classes) and distributing the frequencies over these groups. It can be constructed by two methods:

1. Inclusive method of class intervals: the lower and upper limits of a class interval are both included in that class interval.

2. Exclusive method of class intervals: the upper limit of a class interval is equal to the lower limit of the next higher class interval, and is excluded from the lower class.

Process: For construction of Continuous Frequency Distribution

Step I. Set the data in ascending order.

Step II. Find the range = maximum value − minimum value.

Step III. Decide the approximate number of classes k by Sturges' formula k = 1 + 3.322 log10 N, where N is the total number of observations. Round up the answer to the next integer. Dividing the range by the number of classes gives the class interval (width).

Step IV. Classify the data by exclusive and/or inclusive method for the desired width of the class

intervals.

Step V. Make a blank table consisting of three columns with the title: Variable, Tally Marks and

Frequency.

Step VI. Read off the observations one by one from the data given and for each one record a tally

mark against each observation.

Step VII. In the end, count all the tally marks in a row and write their number in the frequency

column.

Step VIII. Write down the total frequency in the last row at the bottom.
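Step III above can be sketched in Python (an illustrative helper, not part of the manual; the function name is ours):

```python
import math

def sturges_classes(n_obs):
    """Approximate number of classes: k = 1 + 3.322 * log10(N), rounded up."""
    return math.ceil(1 + 3.322 * math.log10(n_obs))

# For the 20-observation example that follows, the rule suggests:
print(sturges_classes(20))
```

Note that the worked example below nevertheless uses 5 classes of width 10, since the class width is given in the problem.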

********************************************************************************

Objective : Prepare a continuous grouped frequency distribution from the following data.

Kinds of data: 20 students appear in an examination. The marks obtained out of 50 maximum

marks are as follows:

5, 16, 17, 17, 20, 21, 22, 22, 22, 25, 25, 26, 26, 30, 31, 31, 34, 35, 42 and 48.

Prepare a frequency distribution taking 10 as the width of the class-intervals .

Solution: Arrange the data in the ascending order

5 16 17 17 20 21 22 22 22 25

25 26 26 30 31 31 34 35 42 48

Here lower limit is 5 and upper limit is 48.

Since it is given that the desired class interval is 10, so frequency distribution for Inclusive Method

of Class intervals:

Marks Tally Marks No. of students

1-10 │ 1

11-20 ││││ 4

21-30 │││││││ │ 9

31-40 ││││ 4

41-50 ││ 2

Total 20


Exclusive Method of Class intervals:

Marks Tally Marks No. of students

0-10 │ 1

10-20 │││ 3

20-30 │││││││ │ 9

30-40 │││ │ 5

40-50 ││ 2

Total 20
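The exclusive-method counts can be checked with a small Python sketch (illustrative only; variable names are ours):

```python
marks = [5, 16, 17, 17, 20, 21, 22, 22, 22, 25,
         25, 26, 26, 30, 31, 31, 34, 35, 42, 48]

width = 10
counts = {}
for lo in range(0, 50, width):        # exclusive classes 0-10, 10-20, ..., 40-50
    hi = lo + width
    # Exclusive method: lower limit included, upper limit excluded
    counts[f"{lo}-{hi}"] = sum(lo <= x < hi for x in marks)

print(counts)
```

The counts 1, 3, 9, 5, 2 agree with the table above.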

********************************************************************************

Conversion of Inclusive series to Exclusive series: To apply any statistical technique (mean, median, etc.), the inclusive classes should first be converted to exclusive classes. For this purpose we find the correction

(lower limit of second class − upper limit of first class) / 2

and add this amount to the upper limit of the first class and subtract it from the lower limit of the next higher class, and so on for every pair of adjacent classes.

In the present example the correction = (11 − 10)/2 = 0.5. So we add 0.5 to 10 and subtract 0.5 from 11, finally getting the exclusive classes 1-10.5, 10.5-20.5, etc.
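The conversion can be sketched in Python, following the text's rule of adjusting every class boundary by the correction while leaving the very first lower limit as it is (an illustrative snippet; names are ours):

```python
# Inclusive classes from the example above
inclusive = [(1, 10), (11, 20), (21, 30), (31, 40), (41, 50)]

# Correction = (lower limit of 2nd class - upper limit of 1st class) / 2
corr = (inclusive[1][0] - inclusive[0][1]) / 2   # (11 - 10) / 2 = 0.5

# Add the correction to every upper limit and subtract it from every
# lower limit except the first, as in the text
uppers = [hi + corr for _, hi in inclusive]
lowers = [inclusive[0][0]] + [lo - corr for lo, _ in inclusive[1:]]
exclusive = list(zip(lowers, uppers))

print(exclusive)
```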

********************************************************************************

Graphical Representation of data:- Graphical Representation is a way of analysing

numerical data. It exhibits the relation between data, ideas, information and concepts in a diagram.

It is easy to understand and it is one of the most important learning strategies. It always depends on

the type of information in a particular domain. There are different types of graphical representation.

Some of them are as follows

• Bar Diagram – Bar Diagram is used to display the category of data and it compares the

data using solid bars to represent the quantities.

• Histogram – The graph that uses bars to represent the frequency of numerical data that are

organised into intervals. Since all the intervals are equal and continuous, all the bars have

the same width.

• Pie diagram – Shows the relationship of the parts to the whole. The circle represents 100%, and each category occupies a segment corresponding to its specific percentage, e.g. 15%, 56%, etc.

• Frequency Polygon – It shows the frequencies of the data plotted against the class mid-points and joined by straight lines.

• Frequency curve - Frequency curve is a graph of frequency distribution where the line is

smooth.

Merits of Using Graphs

Some of the merits of using graphs are as follows:

• The graph is easily understood by everyone without any prior knowledge.

• It saves time.

• It allows one to relate and compare the data for different time periods.

• It is used in statistics to determine the mean, median and mode for different data, as well as

in interpolation and extrapolation of data.


1. Simple Bar Diagram:

Bar graph is a diagram that uses bars to show comparisons between categories of

data. The bars can be either horizontal or vertical. Bar graphs with vertical bars are sometimes

called vertical bar graphs. A bar graph will have two axes. One axis will describe the types of

categories being compared, and the other will have numerical values that represent the values of the

data. It does not matter which axis is which, but it will determine what bar graph is shown. If the

descriptions are on the horizontal axis, the bars will be oriented vertically, and if the values are

along the horizontal axis, the bars will be oriented horizontally.

Objective : Prepare a simple Bar diagram for the given data:

Kinds of data: Aggregated figures for merchandise exports in India for eight years are as follows.

Years 1971 1972 1973 1974 1975 1976 1977 1978

Exports (million Rs.) 1962 2174 2419 3024 3852 4688 5555 5112

Solution: For Simple Bar Diagram

Step I: Draw X and Y axis.

Step II: Take year on X axis .

Step III: Take scale of 1000 on Y axis which represent Exports.

Step IV: Draw the equal width bars on X axis.

Results: The resulting figure shows the bar diagram of exports (million Rs.) by year.

********************************************************************************

2. Histogram :

Histogram consists of contiguous (adjoining) boxes. It has both a horizontal axis and a

vertical axis. The horizontal axis is more or less a number line, labelled with what the data

represents. The vertical axis is labelled either frequency or relative frequency (or percent frequency

or probability). The histogram (like the stemplot) can give the shape of the data, the center, and the

spread of the data. The shape of the data refers to the shape of the distribution, whether normal,


approximately normal, or skewed in some direction, whereas the center is thought of as the middle

of a data set, and the spread indicates how far the values are dispersed about the center. In a skewed

distribution, the mean is pulled toward the tail of the distribution. In a histogram, the area of each rectangle is proportional to the frequency of the corresponding range of the variable.

Objective: Construction of a histogram for the given data:

Kinds of data: The following data are the number of books bought by 50 part-time college

students at College;

1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6

Eleven students buy one book. Ten students buy two books. Sixteen students buy three books. Six

students buy four books. Five students buy five books. Two students buy six books. Calculate the

width of each bar/bin size/interval size.

Solution:

Process:

Step I: The smallest data value is 1 and the largest is 6. To make sure each is included in an interval, subtract 0.5 from the smallest value and add 0.5 to the largest, giving 0.5 and 6.5 as the boundaries. The range is then small, only 6 (6.5 − 0.5), so a small number of bins suffices; let us say six. The range of 6 divided by six bins gives a bin size (or interval size) of one.

Step II: Notice that different rational numbers may be chosen to add to, or subtract from, the maximum and minimum values when calculating the bin size.

Step III: Draw a contiguous bar for each interval, with height equal to its frequency.

Result: The histogram displays the number of books on the x-axis and the frequency on the y-axis.

********************************************************************************

3. PIE Diagram:

Pie charts are simple diagrams for displaying categorical or grouped data. These

charts are commonly used within industry to communicate simple ideas, for example market share.

They are used to show the proportions of a whole. They are best used when there are only a handful

of categories to display.

A pie chart consists of a circle divided into segments, one segment for each category. The size of

each segment is determined by the frequency of the category and measured by the angle of the

segment. As the total number of degrees in a circle is 360, the angle given to a segment is 360° times the fraction of the data in the category, that is

Angle = (Number in category / Total number in sample) × 360°

Equivalently, each segment can be expressed as a percentage of the whole.

Objective: Draw a pie chart to display the information.

Kinds of data: A family's weekly expenditure on its house mortgage, food and fuel is as follows:

Expense Rupees
Mortgage 300
Food 225
Fuel 75

Solution: Process:

Step I: The total weekly expenditure = 300 + 225 + 75 = Rs. 600.

Step II: Percentage of weekly expenditure on:

Mortgage = (300/600) × 100% = 50%
Food = (225/600) × 100% = 37.5%
Fuel = (75/600) × 100% = 12.5%

Step III: To draw the pie chart, divide the circle into 100 percentage parts, then allocate the number of percentage parts required for each item.

Result: The figure shows the pie diagram of the given data.
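The percentages and segment angles can be computed with a short Python sketch (illustrative only; the dictionary names are ours):

```python
# Weekly expenditure from the example (Rs.)
expenses = {"Mortgage": 300, "Food": 225, "Fuel": 75}
total = sum(expenses.values())                         # 600

percentages = {k: v / total * 100 for k, v in expenses.items()}
angles = {k: v / total * 360 for k, v in expenses.items()}   # segment angles

for item in expenses:
    print(item, percentages[item], angles[item])
```

The angles 180°, 135° and 45° sum to 360°, as they must.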

*******************************************************************************

4. Frequency Polygon: Frequency polygons are analogous to line graphs, and just as line graphs make continuous data visually easy to interpret, so too do frequency polygons. A frequency polygon is obtained by joining, with straight lines, the mid-points of the class intervals on the x-axis against their corresponding frequencies on the y-axis.

Step I: Examine the data and decide on the number of intervals and resulting interval size, for both

the x-axis and y-axis.

Step II: The x-axis will show the lower and upper bound for each interval, containing the data

values, whereas the y-axis will represent the frequencies of the values.

Step III: Each data point represents the frequency for each interval.

Step IV: If an interval has three data values in it, the frequency polygon will show a 3 at the upper

endpoint of that interval.

Step V: After choosing the appropriate intervals, begin plotting the data points. After all the points

are plotted, draw line segments to connect them.


Objective: Construction of a frequency polygon from the frequency table.

Kinds of data:

Frequency Distribution for Calculus Final Test Scores

Lower Bound Upper Bound Mid Value Frequency
49.5 59.5 54.5 5
59.5 69.5 64.5 10
69.5 79.5 74.5 30
79.5 89.5 84.5 40
89.5 99.5 94.5 15

Solution: Plot each frequency against the mid value of its class and join the successive points with straight line segments.

Result: The figure shows the frequency polygon of the given data.

*******************************************************************************

5. Frequency curve: The frequency-curve for a distribution can be obtained by drawing a smooth

and free hand curve through the mid-points of the upper sides of the rectangles forming the

histogram.

Result: The figure shows the frequency curve of the given data.

****************************************************************************



Exercise:

Q1. Define graphical representation. Also write the advantages of graphical representation of data.

Q2. The following data give the number of children involved in different activities.

Activities Dance Music Art Cricket Football

No. of Children 30 40 25 20 53

Draw Simple bar Diagram.

Q3. The percentage of total income spent under various heads by a family is given below.

Different Heads Food Clothing Health Education House Rent Miscellaneous
% of Total Income 40% 10% 10% 15% 20% 5%

Represent the above data in the form of a bar graph.

Q4. The following table shows the numbers of hours spent by a child on different events on a

working day. Represent the adjoining information on a pie chart.

Activity School Sleep Playing Study TV Others

No. of Hours 6 8 2 4 1 3

Q5. Make a frequency table and histogram of the following data:

3, 5, 8, 11, 13, 2, 19, 23, 22, 25, 3, 10, 21, 14, 9, 12, 17, 22, 23, 14

*******************************************************************************


Measures of Central Tendency

Umesh Singh

Assistant Professor (Statistics), College of Agriculture, Tikamgarh 472001, India

Email id : [email protected]

According to Professor Bowley, Averages are "statistical constants which enable us to comprehend

in a single effort the significance of the whole." They give us an idea about the concentration of the

values in the central part of the distribution. Plainly speaking, an average of a statistical series is the

value of the variable which is representative of the entire distribution.

The following are the five measures of central tendency that are in common use:

(i) Arithmetic Mean

(ii) Median

(iii) Mode

(iv) Geometric Mean

(v) Harmonic Mean

Requisites for an ideal Measure of Central Tendency

The following are the characteristics to be satisfied by an ideal measure of central tendency

(i) It should be rigidly defined.

(ii) It should be readily comprehensible and easy to calculate.

(iii) It should be based on all the observations.

(iv) It should be suitable for further mathematical treatment.

(v) It should be affected as little as possible by fluctuations of sampling.

(vi) It should not be affected much by extreme values.

1. Arithmetic Mean:

Arithmetic mean of a set of observations is their sum divided by the number of observations.

Arithmetic mean for ungrouped data: The arithmetic mean X̄ of n observations X1, X2, X3, ..., Xn is given by

X̄ = (X1 + X2 + X3 + ... + Xn)/n = (1/n) Σ Xi, summed over i = 1 to n.

Arithmetic mean for grouped data:

In the case of a frequency distribution Xi / fi, i = 1, 2, ..., n, where fi is the frequency of the variable Xi,

X̄ = (f1X1 + f2X2 + f3X3 + ... + fnXn)/(f1 + f2 + f3 + ... + fn) = (Σ fiXi)/(Σ fi) = (1/N) Σ fiXi, where N = Σ fi.

In the case of grouped or continuous frequency distribution, Xi is taken as the mid-value of the corresponding class.

Remark. The Greek capital letter Σ (sigma) is used to indicate summation of the elements of a set, sample or population. It usually carries an index showing how many elements are to be summed.

Properties of Arithmetic Mean

Property 1. The algebraic sum of the deviations of a set of values from their arithmetic mean is zero. If Xi / fi, i = 1, 2, ..., n is the frequency distribution, then Σ fi(Xi − X̄) = 0, X̄ being the mean of the distribution.


Property 2. The sum of the squares of the deviations of a set of values is minimum when the deviations are taken about their arithmetic mean.

Property 3. Mean of the composite series: If X̄i (i = 1, 2, ..., k) are the means of k component series of sizes ni (i = 1, 2, ..., k) respectively, then the mean X̄ of the composite series obtained on combining the component series is given by the formula:

X̄ = (n1X̄1 + n2X̄2 + n3X̄3 + ... + nkX̄k)/(n1 + n2 + n3 + ... + nk)

********************************************************************************

Objective: Find the arithmetic mean of the following ungrouped data.

Kinds of data: Suppose the data are 10, 7, 11, 9, 9, 10, 7, 9, 12.

Solution: We know that

Arithmetic mean X̄ = (1/n) Σ Xi = (10 + 7 + 11 + 9 + 9 + 10 + 7 + 9 + 12)/9 = 84/9 = 9.33
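The same value follows from the standard library (illustrative only; the manual uses no software):

```python
import statistics

data = [10, 7, 11, 9, 9, 10, 7, 9, 12]

mean = statistics.mean(data)     # equivalent to sum(data) / len(data) = 84 / 9
print(round(mean, 2))
```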

*******************************************************************************

Objective: Find the arithmetic mean of the following discrete frequency distribution.

Kinds of data:

Xi 2 9 16 35 32 89 95 65 55
fi 8 2 5 7 6 8 9 6 2

Solution:

Xi 2 9 16 35 32 89 95 65 55 Total
fi 8 2 5 7 6 8 9 6 2 53
fiXi 16 18 80 245 192 712 855 390 110 2618

X̄ = (Σ fiXi)/(Σ fi) = 2618/53 = 49.40
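A weighted-sum sketch in Python reproduces the calculation (illustrative only; names are ours):

```python
x = [2, 9, 16, 35, 32, 89, 95, 65, 55]   # values
f = [8, 2, 5, 7, 6, 8, 9, 6, 2]          # frequencies

N = sum(f)                                          # total frequency
mean = sum(fi * xi for fi, xi in zip(f, x)) / N     # sum(fi * Xi) / N
print(N, round(mean, 2))
```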

********************************************************************************

Objective: Find the arithmetic mean of the following continuous grouped frequency distribution.

Kinds of data:

Class 0-10 10-20 20-30 30-40 40-50 50-60 60-70 70-80 80-90
fi 8 2 5 7 6 8 9 6 2

Solution:

Class interval Xi (Mid-point) fi fiXi
0-10 5 8 40
10-20 15 2 30
20-30 25 5 125
30-40 35 7 245
40-50 45 6 270
50-60 55 8 440
60-70 65 9 585
70-80 75 6 450
80-90 85 2 170
Total 53 2355

X̄ = (Σ fiXi)/(Σ fi) = 2355/53 = 44.43

Objective: Find the arithmetic mean of the pooled data.

Kinds of data: The average of 5 numbers (first series) is 40 and the average of another 4 numbers (second series) is 50.

Solution: We know that the pooled mean = (n1X̄1 + n2X̄2)/(n1 + n2) = (5 × 40 + 4 × 50)/(5 + 4) = 400/9 = 44.44
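The composite-mean formula of Property 3 can be sketched as a small function (illustrative only; the function name is ours):

```python
def pooled_mean(means, sizes):
    """Composite mean of component series with the given means and sizes."""
    return sum(n * m for m, n in zip(means, sizes)) / sum(sizes)

# Two series: mean 40 of size 5, mean 50 of size 4
print(round(pooled_mean([40, 50], [5, 4]), 2))
```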

********************************************************************************

2. Median:

Median of a distribution is the value of the variable which divides it into two equal parts. It is the value which exceeds and is exceeded by the same number of observations, i.e., it is the value such that the number of observations above it is equal to the number of observations below it. The median is thus a positional average.

Median for ungrouped data:

In the case of ungrouped data, the values are first arranged in ascending or descending order of magnitude. If the number of observations n is odd, the median is the middle value:

Median = ((n + 1)/2)th term

If the number of observations is even, any value lying between the two middle values can in fact be taken as median, but conventionally we take the mean of the two middle terms:

Median = [ (n/2)th term + (n/2 + 1)th term ] / 2

In case of discrete frequency distribution median is obtained by considering the cumulative

frequencies. The steps for calculating median are given below:

(i) Find N/2, where N = ∑fi.

(ii) See the (less than) cumulative frequency (cf.) just greater than N/2.

(iii) The corresponding value of X is median.

********************************************************************************

Objective: Find the median of the ungrouped data when the number of observations is odd.

Kinds of data: The values are 5, 20, 15, 35, 18, 25, 40.

Solution: Step 1. Arrange the values in ascending order of magnitude:

5, 15, 18, 20, 25, 35, 40

Step 2. The number of observations is odd, i.e. n = 7. So,

Median = ((n + 1)/2)th term = ((7 + 1)/2)th term = 4th term, which is 20.

********************************************************************************


Objective: Find the median of the ungrouped data when the number of observations is even.

Kinds of data: The values are 8, 20, 50, 25, 15, 30

Solution: Step 1. Arrange the values in ascending order of magnitude:

8, 15, 20, 25, 30, 50

Step 2. The number of observations is even, i.e. n = 6. So,

Median = [ (n/2)th term + (n/2 + 1)th term ] / 2 = [ (6/2)th term + (6/2 + 1)th term ] / 2

= (3rd term + 4th term)/2 = (20 + 25)/2 = 22.5
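Both the odd-n and even-n rules can be sketched in one small Python function (illustrative only; the function name is ours):

```python
def median(values):
    s = sorted(values)                 # Step 1: arrange in ascending order
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]                  # odd n: the ((n+1)/2)th term
    return (s[mid - 1] + s[mid]) / 2   # even n: mean of the two middle terms

print(median([5, 20, 15, 35, 18, 25, 40]))   # odd-n example above
print(median([8, 20, 50, 25, 15, 30]))       # even-n example above
```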

********************************************************************************

Median for grouped data

In the case of continuous frequency distribution, the class corresponding to the c.f. just greater than

N/2 is called the median class and the value of median is obtained by the following formula:

Median = l + ((N/2 − C)/f) × h

where l is the lower limit of the median class,
f is the frequency of the median class,
h is the magnitude (width) of the median class,
C is the cumulative frequency preceding the median class, and N = Σfi.

********************************************************************************

Objective: Find the Median of the following discrete grouped frequency distribution:

Kinds of data:

Xi 1 2 3 4 5 6 7 8 9 Total

fi 8 10 11 16 20 25 15 9 6 120

Solution : Here N = ∑fi = 120

→ So, N/2 = 120/2 = 60

Xi 1 2 3 4 5 6 7 8 9 Total

fi 8 10 11 16 20 25 15 9 6 120

C.f. 8 18 29 45 65 90 105 114 120

The cumulative frequency (c.f.) just greater than N/2 = 60 is 65, and the value of X corresponding to 65 is 5. Therefore, the median is 5.

********************************************************************************

Objective: Find the Median wage of the following continuous grouped frequency distribution

Kinds of data:

Wages (in Rs.) 20-30 30-40 40-50 50-60 60-70 70-80 80-90

No. of labours 3 5 20 10 5 7 2


Solution:

Wages (in Rs.) 20-30 30-40 40-50 50-60 60-70 70-80 80-90

No. of labours 3 5 20 10 5 7 2

c.f. 3 8 28 38 43 50 52

Here N/2 = 52/2=26. Cumulative frequency just greater than 26 is 28 and corresponding class is 40-50. Thus

median class is 40-50.

Now, Median = l + ((N/2 − C)/f) × h

Median = 40 + ((26 − 8)/20) × 10 = 40 + 9 = 49

So, the median wage is Rs. 49.
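The interpolation formula can be written as a short Python function (illustrative only; the parameter letters follow the formula above):

```python
def grouped_median(l, N, C, f, h):
    """Median = l + ((N/2 - C) / f) * h for the median class."""
    return l + (N / 2 - C) / f * h

# Wage example: median class 40-50, so l=40, C=8, f=20, h=10, with N=52
print(grouped_median(l=40, N=52, C=8, f=20, h=10))
```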

********************************************************************************

3. Mode

Mode is the value which occurs most frequently in a set of observations and around which the other items of the set cluster densely. In other words, mode is the value of the variable which is predominant in the series. Thus, in the case of a discrete frequency distribution, mode is the value of X corresponding to the maximum frequency. For example, the mode of {4, 2, 4, 3, 2, 2, 1, 2} is 2 because it occurs four times, which is more than any other number. Now look at the following discrete series:

Variable 10 20 30 40 50 55 60 89 94

Frequency 2 3 12 30 25 11 9 7 3

Here, as you can see, the maximum frequency is 30, so the value of the mode is 40. In this case, as there is a unique value of the mode, the data are unimodal. But the mode is not necessarily unique, unlike the arithmetic mean and median. You can have data with two modes (bi-modal) or more than two modes (multi-modal). There may also be no mode at all, if no value appears more frequently than any other value in the distribution. For example, in the series 1, 1, 2, 2, 3, 3, 4, 4 there is no mode.

But in any one (or more) of the following cases:

(i) if the maximum frequency is repeated,
(ii) if the maximum frequency occurs in the very beginning or at the end of the distribution, or
(iii) if there are irregularities in the distribution,

the value of mode is determined by the method of grouping. This is illustrated below by an example.

Objective: Find the mode of the following frequency distribution:

Kinds of data:

Size ( X) 1 2 3 4 5 6 7 8 9 10 11 12

Frequency (f) 3 8 15 23 35 40 32 28 20 45 14 6

Solution: Here we see that the distribution is not regular: the frequencies increase steadily up to 40 and then decrease, but the frequency 45 coming after 20 does not seem consistent with the distribution. Hence we cannot say that, since the maximum frequency is 45, the mode is 10. Instead we locate the mode by the method of grouping, as explained below:


The frequencies in column (i) are the original frequencies. Column (ii) is obtained by combining the frequencies two by two. If we leave the first frequency and combine the remaining frequencies two by two, we get column (iii). We proceed to combine the frequencies three by three to obtain column (iv). Combining the frequencies three by three after leaving the first frequency gives column (v), and after leaving the first two frequencies gives column (vi).

To find mode we form the following table:

Column Number Maximum Frequency Value or combination of values of X

giving max. frequency

(i) 45 10

(ii) 75 5, 6

(iii) 72 6, 7

(iv) 98 4, 5, 6

(v) 107 5, 6, 7

(vi) 100 6, 7, 8

We find that the value 6 is repeated the maximum number of times, and hence the value of mode is 6 and not 10, which is an irregular item.

Mode for ungrouped data: In the case of discrete frequency distribution mode is the value of X

corresponding to maximum frequency.

***************************************************************************************

Objective: Find the Mode of the following ungrouped data:

Kinds of data: The values are 4, 2, 4, 3, 2, 2, 1, and 2.

Solution: Here the mode is 2 because it occurs four times, which is more than any other number.

**************************************************************************************

Mode for grouped data

In the case of continuous frequency distribution, mode is given by the formula:

Mode = l + ((f1 − f0)/(2f1 − f0 − f2)) × h

where l is the lower limit of the modal class, h the magnitude (width) of the modal class, f1 the frequency of the modal class, and f0 and f2 the frequencies of the classes preceding and succeeding the modal class respectively.

***************************************************************************************


Objective: Find the mode of the following continuous grouped frequency distribution:

Kinds of data:

Class interval 0-10 10-20 20-30 30-40 40-50 50-60 60-70 70-80

Frequency (f) 5 8 7 12 28 20 10 10

Solution: Here maximum frequency is 28. Thus the class 40-50 is the modal class.

So, l = 40, f1 = 28, f0 = 12, f2 = 20, h = 10.

Mode = l + ((f1 − f0)/(2f1 − f0 − f2)) × h = 40 + ((28 − 12)/(2 × 28 − 12 − 20)) × 10 = 40 + 6.667 = 46.667
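The grouped-mode formula can likewise be sketched as a Python function (illustrative only; parameter letters follow the formula above):

```python
def grouped_mode(l, f0, f1, f2, h):
    """Mode = l + ((f1 - f0) / (2*f1 - f0 - f2)) * h for the modal class."""
    return l + (f1 - f0) / (2 * f1 - f0 - f2) * h

# Modal class 40-50: l=40, f0=12, f1=28, f2=20, h=10
print(round(grouped_mode(l=40, f0=12, f1=28, f2=20, h=10), 3))
```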

**************************************************************************************

4. Geometric Mean (GM)

The geometric mean of a set of n observations is the nth root of their product.

Geometric mean for ungrouped data:

The geometric mean G of n observations xi, i = 1, 2, 3, ..., n is

G = (x1 x2 x3 ... xn)^(1/n)

Taking logarithms on both sides,

log G = (1/n)(log x1 + log x2 + ... + log xn) = (1/n) Σ log xi

G = Antilog[ (1/n) Σ log xi ]

Geometric mean for grouped data:

In the case of grouped or continuous frequency distribution, x is taken to be the mid-point of the class interval.

In the case of a frequency distribution xi / fi (i = 1, 2, ..., n), the geometric mean G is given by

G = (x1^f1 x2^f2 x3^f3 ... xn^fn)^(1/N), where N = Σfi.

Taking logarithms of both sides, we get

log G = (1/N)(f1 log x1 + f2 log x2 + ... + fn log xn) = (1/N) Σ fi log xi

Thus we see that the logarithm of G is the arithmetic mean of the logarithms of the given values. So

G = Antilog[ (1/N) Σ fi log xi ]

***************************************************************************************


Objective: Calculate Geometric mean from the following data: 3,13,11,15,5,4,2

Solution: In this example number of observation n=7, by definition of geometric mean

G = (3 × 13 × 11 × 15 × 5 × 4 × 2)^(1/7)

log G = (1/7) [log 3 + log 13 + log 11 + log 15 + log 5 + log 4 + log 2]

log G = (1/7) [0.4771 + 1.1139 + 1.0414 + 1.1761 + 0.6990 + 0.6021 + 0.3010]

log G = (1/7) [5.4106] = 0.7729

So, G = Antilog(0.7729) = 5.928
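The log-antilog computation above can be sketched in Python; exact arithmetic agrees with the four-figure log-table value to rounding:

```python
import math

data = [3, 13, 11, 15, 5, 4, 2]
# log G = (1/n) * sum of log10(x_i); G = 10 ** log G (the antilog)
log_mean = sum(math.log10(x) for x in data) / len(data)
gm = 10 ** log_mean
print(round(gm, 2))    # approximately 5.93, agreeing with the table-based 5.928
```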

***************************************************************************************

Objective: Calculate the geometric mean from the following continuous grouped frequency data:

Kinds of data:

Class Interval 0-10 10-20 20-30 30-40

Frequency 1 3 4 2

Solution: We know that in the case of grouped data

log G = (1/N) ∑ fi log xi

Calculations are given below in the table.

Class Interval   Frequency (fi)   Mid Value (xi)   log xi   fi log xi
0-10             1                5                0.699    0.699
10-20            3                15               1.176    3.528
20-30            4                25               1.398    5.592
30-40            2                35               1.544    3.088
Total            10                                         12.907

After substituting the values in the formula, we get log G = 12.907/10 = 1.2907

Hence GM = Antilog(1.2907) = 19.53
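The grouped computation GM = Antilog[(1/N) ∑ fi log xi], using mid-values as xi, can be sketched as:

```python
import math

mids  = [5, 15, 25, 35]    # mid-values of the classes 0-10, 10-20, 20-30, 30-40
freqs = [1, 3, 4, 2]
N = sum(freqs)
log_g = sum(f * math.log10(x) for f, x in zip(freqs, mids)) / N
gm = 10 ** log_g           # antilog
print(round(gm, 2))        # 19.53, as in the table above
```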

**************************************************************************************

5. Harmonic Mean. Harmonic mean of a number of observations is the reciprocal of the arithmetic mean of

the reciprocals of the given values.

Harmonic mean for ungrouped data:

The harmonic mean H of n observations xi, i = 1, 2, …, n is given by

H = n / ∑ (1/xi), i.e. 1/H = (1/n) ∑ (1/xi)

Harmonic Mean for grouped data:

In case of a frequency distribution xi | fi, i = 1, 2, …, n,

H = N / ∑ (fi/xi), where N = ∑fi.

***************************************************************************************

Objective: Find the harmonic mean for the following ungrouped data

Kinds of data: Suppose the data are 10, 7, 11, 9, 9, 10, 7, 9, 12

Solution: Harmonic mean

H = n / ∑ (1/xi)

H = 9 / (1/10 + 1/7 + 1/11 + 1/9 + 1/9 + 1/10 + 1/7 + 1/9 + 1/12)

H = 9 / 0.9933 = 9.06
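The reciprocal-of-reciprocals computation can be sketched in Python; the standard library's `statistics.harmonic_mean` computes the same quantity:

```python
import statistics

data = [10, 7, 11, 9, 9, 10, 7, 9, 12]
h = len(data) / sum(1 / x for x in data)    # H = n / sum(1/x_i)
print(round(h, 2))                          # 9.06, as in the worked example
assert abs(h - statistics.harmonic_mean(data)) < 1e-6
```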

********************************************************************************

Objective: Find the Harmonic Mean of the given class.

The table given below represent the frequency-distribution of ages for Standard college students.

Ages (years) 19 20 21 22 23 24 25 26

Number of students 5 8 7 12 28 20 10 10

Solution:

Ages (X) 19 20 21 22 23 24 25 26

Number of students(fi) 5 8 7 12 28 20 10 10

1/xi 0.053 0.050 0.048 0.045 0.043 0.042 0.040 0.038

fi *(1/xi) 0.263 0.400 0.333 0.545 1.217 0.833 0.400 0.385

H = N / ∑ (fi/xi)

H = 100 / (0.263 + 0.400 + 0.333 + 0.545 + 1.217 + 0.833 + 0.400 + 0.385)

H = 100 / 4.377 = 22.85 years
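The grouped formula H = N / ∑ (fi/xi) is a weighted harmonic mean and can be sketched as:

```python
ages  = [19, 20, 21, 22, 23, 24, 25, 26]
freqs = [5, 8, 7, 12, 28, 20, 10, 10]
N = sum(freqs)                                    # N = 100 students
h = N / sum(f / x for f, x in zip(freqs, ages))   # H = N / sum(f_i / x_i)
print(round(h, 2))                                # approximately 22.85 years
```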

***************************************************************************************

Objective: Computation of average speed using harmonic mean.

Kinds of data: A cyclist pedals from his house to his college at a speed of 10 m.p.h. and back from the

college to his house at 15 m.p.h. Find the average speed.

Solution. Let the distance from the house to the college be x miles. In going from house to college, the

distance (x miles) is covered in x/10 hours, while in coming from college to house, the distance is

covered in x/15 hours. Thus a total distance of 2x miles is covered in (x/10 + x/15) hours.

Hence average speed = Total distance travelled / Total time taken = 2x / (x/10 + x/15)

= 2 / (1/10 + 1/15) = 2 / (1/6) = 12 m.p.h.

***************************************************************************************


Partition Values - These are the values which divide the series into a number of equal parts.

1. Quartiles: The three points which divide the series into four equal parts are called quartiles. The first, second and third points are known as the first, second and third quartiles respectively. The first quartile, Q1, is the value which exceeds 25% of the observations and is exceeded by 75% of the observations. The second quartile, Q2, coincides with the median. The third quartile, Q3, is the point which has 75% of the observations before it and 25% after it.

2. Deciles: The nine points which divide the series into ten equal parts are called deciles.

3. Percentiles: The ninety-nine points which divide the series into hundred equal parts are called percentiles.

For example, D5, the fifth decile, has 50% of the observations before it, and P35, the thirty-fifth percentile, is the point which exceeds 35% of the observations. The methods of computing the partition values are the same as those of locating the median, in the case of both grouped and ungrouped data.

Formula & Examples for ungrouped data set

Arrange the data in ascending order, then

1. Quartiles: Qi = value of the [i(n+1)/4]th observation, where i = 1, 2, 3

2. Deciles: Di = value of the [i(n+1)/10]th observation, where i = 1, 2, 3, …, 9

3. Percentiles: Pi = value of the [i(n+1)/100]th observation, where i = 1, 2, 3, …, 99

*********************************************************************************************************************

Objective: Calculation of first Quartile, 3rd Deciles, 20th Percentile from the given data.

Kinds of data: 3,13,11,11,5,4,2

Solution:

Arranging Observations in the ascending order, We get :

2,3,4,5,11,11,13

Here, n=7

Qi = value of the [i(n+1)/4]th observation, where i = 1, 2, 3

For the first quartile, put i = 1:

Q1 = value of the [1(7+1)/4]th = (8/4)th = 2nd observation, which is 3.

Di = value of the [i(n+1)/10]th observation, where i = 1, 2, 3, …, 9

For the 3rd decile, put i = 3:

D3 = value of the [3(7+1)/10]th = (24/10)th = 2.4th observation

D3 = 2nd observation + 0.4 × (3rd observation − 2nd observation)

D3 = 3 + 0.4 × (4 − 3) = 3.4

Pi = value of the [i(n+1)/100]th observation, where i = 1, 2, 3, …, 99

For the 20th percentile, put i = 20:

P20 = value of the [20(7+1)/100]th = (160/100)th = 1.6th observation

P20 = 1st observation + 0.6 × (2nd observation − 1st observation)

P20 = 2 + 0.6 × (3 − 2) = 2.6
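The i(n+1)/k positioning rule with linear interpolation between neighbouring ordered observations can be sketched as a single function (the name `partition_value` is ours; positions below 1 are clamped to the smallest value):

```python
def partition_value(data, i, k):
    """i-th k-tile (k=4 quartile, k=10 decile, k=100 percentile) of ungrouped data."""
    xs = sorted(data)
    pos = i * (len(xs) + 1) / k          # 1-based, possibly fractional position
    if pos < 1:
        return xs[0]
    lo = int(pos)                        # integer part of the position
    if lo >= len(xs):                    # positions past the last value clamp to it
        return xs[-1]
    frac = pos - lo
    return xs[lo - 1] + frac * (xs[lo] - xs[lo - 1])

data = [3, 13, 11, 11, 5, 4, 2]
print(partition_value(data, 1, 4))      # Q1  -> 3.0
print(partition_value(data, 3, 10))     # D3  -> 3.4
print(partition_value(data, 20, 100))   # P20 -> 2.6
```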

********************************************************************************

Objective: Calculation of median, quartiles, 4th decile and 27th percentile.

Kinds of data: Eight coins were tossed together and the number of heads resulting was noted. The

operation was repeated 256 times and the frequencies (f) that were obtained for different values of

x, the number of heads, are shown in the following table.

x 0 1 2 3 4 5 6 7 8

f 1 9 26 59 72 52 29 7 1

Solution:

x 0 1 2 3 4 5 6 7 8

f 1 9 26 59 72 52 29 7 1

cf 1 10 36 95 167 219 248 255 256

Median: Here N/2 = 256/2 = 128. The cumulative frequency (c.f.) just greater than 128 is 167. Thus median = 4.

Q1: Here N/4 = 64. The c.f. just greater than 64 is 95. Hence Q1 = 3.

Q3: Here 3N/4 = 192. The c.f. just greater than 192 is 219. Thus Q3 = 5.

D4: Here 4N/10 = 4 × 256/10 = 102.4. The c.f. just greater than 102.4 is 167. Hence D4 = 4.

P27: Here 27N/100 = 27 × 256/100 = 69.12. The c.f. just greater than 69.12 is 95. Hence P27 = 3.

*******************************************************************************

Formula & Examples for grouped data set

The partition values may be determined from grouped data in the same way as the median. For calculating

partition values from grouped data we will form cumulative frequency column. Quartiles for grouped data

will be calculated from the following formulae-

1. Quartiles: Qi = l + (iN/4 − C)/f × h, where i = 1, 2, 3

2. Deciles: Di = l + (iN/10 − C)/f × h, where i = 1, 2, 3, …, 9

3. Percentiles: Pi = l + (iN/100 − C)/f × h, where i = 1, 2, 3, …, 99

where l is the lower limit of the class containing the quartile, decile or percentile, f is the frequency of that class, N = ∑fi, h is the magnitude (width) of that class, and C is the cumulative frequency of the class preceding it.

********************************************************************************

Objective: Calculation of 3rd quartiles, 4th decile and 37th percentile from the grouped data.

class 0-15 15-30 30-45 45-60 60-75 75-90 90-105 105-120 120-135 135-150

frequency 1 4 17 28 25 18 13 6 5 3

Solution:

Class 0-15 15-30 30-45 45-60 60-75 75-90 90-105 105-120 120-135 135-150

frequency 1 4 17 28 25 18 13 6 5 3

Cf 1 5 22 50 75 93 106 112 117 120

For the third quartile, 3N/4 = 3 × 120/4 = 90. The cumulative frequency just greater than 90 is 93 and the corresponding class is 75-90. Thus the Q3 class is 75-90. From the table, l = 75, h = 15, C = 75, f = 18.

Q3 = l + (3N/4 − C)/f × h = 75 + (90 − 75)/18 × 15 = 75 + 12.5 = 87.5

For the 4th decile, 4N/10 = 4 × 120/10 = 48. The cumulative frequency just greater than 48 is 50 and the corresponding class is 45-60. Thus the D4 class is 45-60. From the table, l = 45, h = 15, C = 22, f = 28.

D4 = l + (4N/10 − C)/f × h = 45 + (48 − 22)/28 × 15 = 45 + 13.93 = 58.93

For the 37th percentile, 37N/100 = 37 × 120/100 = 44.4. The cumulative frequency just greater than 44.4 is 50 and the corresponding class is 45-60. Thus the P37 class is 45-60. From the table, l = 45, h = 15, C = 22, f = 28.

P37 = l + (37N/100 − C)/f × h = 45 + (44.4 − 22)/28 × 15 = 45 + 12 = 57
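The grouped formula l + (iN/k − C)/f × h can be sketched as one function (`grouped_partition` is an illustrative name; equal class widths are assumed):

```python
def grouped_partition(lowers, freqs, h, i, k):
    """i-th k-tile of a grouped distribution with equal class width h."""
    N = sum(freqs)
    target = i * N / k                       # iN/k
    cum = 0                                  # C: c.f. preceding the current class
    for j, f in enumerate(freqs):
        if cum + f >= target:                # first class whose c.f. reaches iN/k
            return lowers[j] + (target - cum) / f * h
        cum += f

lowers = [0, 15, 30, 45, 60, 75, 90, 105, 120, 135]
freqs  = [1, 4, 17, 28, 25, 18, 13, 6, 5, 3]
print(round(grouped_partition(lowers, freqs, 15, 3, 4), 2))    # Q3  -> 87.5
print(round(grouped_partition(lowers, freqs, 15, 4, 10), 2))   # D4  -> 58.93
print(round(grouped_partition(lowers, freqs, 15, 37, 100), 2)) # P37 -> 57.0
```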

***************************************************************************************

Exercise:

Q1. Find the Arithmetic Mean, Median and Mode from the following distribution.

classes 10-14 15-19 20-24 25-29 30-34 35-39

frequency 22 35 52 40 32 19

(Ans: A.M.=24.05, Median=23.63, Mode=22.43)

Q2. Find the Arithmetic, Geometric and Harmonic mean of the following frequency distribution.

Marks 0-10 10-20 20-30 30-40

No. of students 5 8 3 4

(Ans: A.M.=18.00, GM=14.58, HM=11.31)

Q3. The average salary of male employees in a firm was Rs.5200 and that of females was Rs.4200. The

mean salary of all the employees was Rs.5000. Find the percentage of male and female employees.

(Ans: Male 80%, Female 20%)

Q4. The Median and Mode of the following wage distribution are known to be Rs. 33.50 and Rs. 34.00

respectively. Find the value of f3, f4 and f5.

Wages 0-10 10-20 20-30 30-40 40-50 50-60 60-70

Frequency 4 16 f3 f4 f5 6 4

(Ans: f3= 60, f4=100, f5=40)

Q5. Find the arithmetic mean of the following frequency distribution: (Ans:21.66)

Xi 1 4 7 13 19 25 28 22 81 16

fi 7 46 19 51 89 89 28 19 33 93

Q6. The strength of 7 colleges in a city are 385; 1748; 1343; 1935; 786; 2874 and 2108. Find its median.

(Ans:1748)

Q7. The mean mark of 100 students was given to be 40. It was found later that a mark 53 was read as 83.

What is the corrected mean mark? (Ans: 39.70)

Q8. Calculate 3rd Quartile, 6th Deciles and 45th Percentiles from the following data:-

81,96,76,108,85,80,100,83,70,95,32,33 (Ans: Q3= 102.5, D6= 84.6, P45=83.4)

Q9. Calculate D7 and P85 for the following data: 79, 82, 36, 38, 51, 72, 68, 70, 64, 63

(Ans: D7= 71.4, P85=81.45)

Q10. The following is frequency distribution of over time (per week) performed by various officers from a

certain software company. Determine the value of D5, Q1 and P45.

Overtime (in hours) 4-8 8-12 12-16 16-20 20-24 24-28

No. of officers 4 8 16 18 20 18

(Ans- D5= 19.11, Q1=15.3, P45=18.17)


3. Measures of Dispersion Surabhi Jain

Assistant professor (Statistics), College of Agriculture , JNKVV, Jabalpur (M.P.) 482004,India

Email id : [email protected]

Dispersion: The measures of central tendency give us a single value that represents the central part of the whole distribution, whereas dispersion gives us an idea about the scatteredness of the data. In statistics, dispersion (also called variability, scatter, or spread) is the extent to which a distribution is stretched or squeezed. Dispersion helps us to study the variability of the items. It indicates the extent to which all other values are dispersed about the central value in a particular distribution.

Measures of Dispersion: There are two types of measures of dispersion. The first is the absolute measure, which expresses dispersion in the same unit as the data. The second is the relative measure of dispersion, which expresses dispersion as a ratio or percentage. Dispersion also helps a researcher in comparing two or more series.

Characteristic of an ideal measure of Dispersion: To be an ideal measure, the measure of

dispersion should satisfy the following characteristics.

(1) It should be easy to calculate and easy to understand.

(2) It should be rigidly defined.

(3) It should be based upon all the observations.

(4) It should be suitable for further mathematical treatment.

(5) It should be affected as little as possible by fluctuations of sampling.

In statistics, there are many techniques that are applied to measure dispersion.

The absolute measures of dispersion are

(1) Range (2) Quartile Deviation (3) Mean Deviation (4)Standard Deviation

(1) Range: It is defined as the difference between the maximum and minimum value of any

dataset.

For ungrouped data Range = Maximum value – Minimum value

For grouped data Range = upper value of last class interval – lowest value of first class

interval

Characteristics: (1) It is the simplest but a crude measure of dispersion. (2) It takes less time to compute. (3) It is based only on the two extreme observations, so it is subject to chance fluctuations and cannot tell us anything about the character of the distribution. (4) Range cannot be computed in the case of "open-end" distributions, i.e., distributions where the lower limit of the first class or the upper limit of the last class is not given. (5) It is not suitable for further mathematical treatment.

(2) Quartile Deviation or Semi-interquartile range: It is half the difference between the third and first quartiles. It is a better method when we are interested in knowing the range within which a certain proportion of the items falls.

Formula: Quartile Deviation = (Q3 − Q1) / 2

Characteristics: (1) It is easy to calculate. (2) Since the quartile deviation uses only the middle 50% of the data, it is not a fully reliable measure of dispersion, but it is better than the range. (3) The quartile deviation is not affected by the extreme items; it depends entirely on the central items. If these values are irregular or abnormal, the result is bound to be affected. (4) This method can be applied in the case of open-end series, where the importance of extreme values is not considered.

(3) Mean Deviation: It is defined as the average of the absolute deviations of all the observations from their average A (A = mean, median or mode).

For ungrouped data MD = ∑|Xi − A| / n, where A = mean, median or mode

For grouped data MD = ∑ fi|Xi − A| / ∑fi

Characteristics: (1) It is based on all the observations, but the step of ignoring the signs of deviations creates artificiality and makes it useless for further mathematical treatment. (2) Mean deviation may be calculated by taking deviations from the mean, median or mode. (3) Mean deviation is not affected by extreme items. (4) It is easy to calculate and understand. (5) It is illogical and mathematically unsound to treat all negative signs as positive; because the method is not mathematically sound, the results obtained by it are not fully reliable. (6) This method is unsuitable for making comparisons either of the series or of the structure of the series.

(4) Standard Deviation (Best Measure): It is defined as the square root of the average of the squared deviations of all the observations from their mean. The concept of standard deviation, introduced by Karl Pearson, has practical significance because it is free from the defects that exist in the range, quartile deviation and mean deviation.

For ungrouped data SD = √[ ∑(Xi − X̄)² / n ] = √[ ∑Xi²/n − (∑Xi/n)² ]

For grouped data SD = √[ ∑ fi(Xi − X̄)² / ∑fi ] = √[ ∑fiXi²/∑fi − (∑fiXi/∑fi)² ]

Characteristics: (1) It is the best measure of dispersion among all. (2) It is comparatively difficult to compute. (3) The step of squaring the deviations overcomes the drawback of the mean deviation. (4) Standard deviation is the best measure of dispersion because it takes into account all the items and is capable of further algebraic treatment and statistical analysis; it is possible to calculate the standard deviation of two or more series combined. (5) It is the most suitable measure for comparing the variability of two or more series. (6) It assigns more weight to extreme items and less weight to items near the mean, because the squares of large deviations are proportionately greater than the squares of small deviations.

Mathematical properties of standard deviation (σ)

(i) If all the values are increased or decreased by a constant, the standard deviation remains the same. If all the values are multiplied or divided by a constant, the standard deviation is multiplied or divided by that constant.

(ii) A combined standard deviation can be obtained for two or more series with the formula given below. If n1 and n2 are the sizes, x̄1 and x̄2 the means, and σ1 and σ2 the standard deviations of the two series, then the standard deviation σ of the combined series of size n1 + n2 is given by


σ² = [ n1(σ1² + d1²) + n2(σ2² + d2²) ] / (n1 + n2), where d1 = x̄1 − x̄, d2 = x̄2 − x̄ and x̄ = (n1x̄1 + n2x̄2)/(n1 + n2) is the mean of the combined series.

(iii) Variance is independent of change of origin but not of scale: if we use di = xi − A then σ² = σd², but if we use di = (xi − A)/h then σ² = h²σd².

********************************************************************************

Relative Measures for comparison of two series:

(1) Coefficient of Dispersion (CD): To compare the variability of two series, coefficients of dispersion are used. They are pure numbers, independent of the unit of measurement. The coefficients of dispersion based upon the different measures of dispersion are:

(i) based on Range, CD = (Maximum value − Minimum value) / (Maximum value + Minimum value)

(ii) based on Quartile Deviation, CD = (Q3 − Q1) / (Q3 + Q1)

(iii) based on Mean Deviation, CD = Mean Deviation / (Average from which it is calculated)

(iv) based on Standard Deviation, CD = Standard Deviation / Mean

Characteristics: Used to compare the dispersion of two or more distributions. Selection of the appropriate measure depends upon the measures of central tendency and dispersion used.

(2) Coefficient of Variation (CV): 100 times the coefficient of dispersion based upon the standard deviation is called the coefficient of variation (a unitless measure).

CV = (Standard Deviation / Mean) × 100

Characteristics: It is expressed as a percentage. A lower value of the coefficient of variation indicates more consistency.

********************************************************************************

Objective: Computation of Measures of Dispersion by all methods for Ungrouped data.

Kinds of Data: Suppose the data are 10, 7, 5, 9, 9, 10, 7, 3, 12

Solution:

(1) Range=max. value - min. value = 12 – 3 = 9

(2) Quartile Deviation: the formula is QD = (Q3 − Q1) / 2

First arrange the observations in ascending order: 3, 5, 7, 7, 9, 9, 10, 10, 12

The formula for the quartile is Qi = value of the [i(n+1)/4]th observation, where i = 1, 2, 3 and n is the number of observations.

Q1 = [1(9+1)/4]th = (10/4)th = 2.5th observation

So Q1 = 2nd term + 0.5 × (3rd term − 2nd term) = 5 + 0.5 × (7 − 5) = 6

Similarly Q3 = [3(9+1)/4]th = (30/4)th = 7.5th observation

So Q3 = 10 + 0.5 × (10 − 10) = 10

Now QD = (10 − 6)/2 = 2

(3) Mean Deviation: The formula is MD = ∑|Xi − A| / n, where A = mean, median or mode. Here we calculate the mean deviation about the mean.

Mean = (10 + 7 + 5 + 9 + 9 + 10 + 7 + 3 + 12)/9 = 8

MD = (1/9) (|10−8| + |7−8| + |5−8| + |9−8| + |9−8| + |10−8| + |7−8| + |3−8| + |12−8|)

= (1/9) (2 + 1 + 3 + 1 + 1 + 2 + 1 + 5 + 4) = 20/9 = 2.22

(4) Standard Deviation: Mean = 8 (as above)

SD = √[ ((10−8)² + (7−8)² + (5−8)² + (9−8)² + (9−8)² + (10−8)² + (7−8)² + (3−8)² + (12−8)²) / 9 ]

= √[ (4 + 1 + 9 + 1 + 1 + 4 + 1 + 25 + 16) / 9 ] = √(62/9) = 2.62
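All four absolute measures for this data set can be sketched together; the `quartile` helper implements the same i(n+1)/4 interpolation rule used earlier in the manual:

```python
import math

data = [10, 7, 5, 9, 9, 10, 7, 3, 12]
n = len(data)
mean = sum(data) / n

rng = max(data) - min(data)                               # Range = 12 - 3 = 9
md = sum(abs(x - mean) for x in data) / n                 # Mean deviation about the mean
sd = math.sqrt(sum((x - mean) ** 2 for x in data) / n)    # Standard deviation

xs = sorted(data)
def quartile(i):
    """i-th quartile by the (i*(n+1)/4)-th ordered observation rule."""
    pos = i * (n + 1) / 4
    lo = int(pos)
    frac = pos - lo
    return xs[lo - 1] + frac * (xs[lo] - xs[lo - 1])

qd = (quartile(3) - quartile(1)) / 2                      # Quartile deviation
print(rng, qd, round(md, 2), round(sd, 2))                # 9 2.0 2.22 2.62
```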

********************************************************************************

Objective: Computation of Measures of Dispersion by all methods for Grouped data.

Kinds of data: The age distribution of 542 members is given below

Age(in years) 20-30 30-40 40-50 50-60 60-70 70-80 80-90 Total

No. of members 3 61 132 153 140 51 2 542

Solution:

(1)Range = 90-20=70

(2) Quartile Deviation: first we find the first and third quartiles.

Age (years)  No. of members (fi)  Cumulative frequency  Xi   fiXi   (Xi − X̄)   fi|Xi − X̄|   (Xi − X̄)²   fi(Xi − X̄)²
20-30        3                    3                     25   75     −29.7       89.2         883.3        2649.8
30-40        61                   64                    35   2135   −19.7       1202.9       388.9        23721.6
40-50        132                  196                   45   5940   −9.7        1283.0       94.5         12471.1
50-60        153                  349                   55   8415   0.3         42.8         0.1          12.0
60-70        140                  489                   65   9100   10.3        1439.2       105.7        14795.0
70-80        51                   540                   75   3825   20.3        1034.3       411.3        20975.2
80-90        2                    542                   85   170    30.3        60.6         916.9        1833.8
Total        542                                             29660              5152.0                    76458.5

First we determine the first-quartile class: N/4 = 542/4 = 135.5, which falls in the 40-50 class (c.f. 196). So the first quartile is

Q1 = 40 + (135.5 − 64)/132 × 10 = 40 + 5.42 = 45.42 years

Similarly, 3N/4 = 3 × 542/4 = 406.5, which falls in the 60-70 class (c.f. 489). So the third quartile is

Q3 = 60 + (406.5 − 349)/140 × 10 = 60 + 4.11 = 64.11 years

So the quartile deviation = (64.11 − 45.42)/2 = 18.69/2 = 9.345 years

(3) Mean Deviation: first calculate the mean.

Mean = 29660/542 = 54.72 years

From the table, Mean Deviation = 5152/542 = 9.51 years

(4) Standard Deviation = √(76458.5/542) = √141.07 = 11.88 years

********************************************************************************

Objective: Computation of variability of two series by coefficient of variation.

Kinds of data : Goals scored by two teams A and B in a football season were as follows

No. of goals scored in a match   0    1   2   3   4
No. of matches:  A               27   9   8   5   4
                 B               17   9   6   5   3

Solution: Here we calculate the CV of each team separately.

Goals (xi)  fA   fA·xi   (xi − x̄A)   (xi − x̄A)²   fA(xi − x̄A)²   fB   fB·xi   (xi − x̄B)   (xi − x̄B)²   fB(xi − x̄B)²
0           27   0       −1.05        1.10          29.77           17   0       −1.2         1.44          24.48
1           9    9       −0.05        0.00          0.02            9    9       −0.2         0.04          0.36
2           8    16      0.95         0.90          7.22            6    12      0.8          0.64          3.84
3           5    15      1.95         3.80          19.01           5    15      1.8          3.24          16.20
4           4    16      2.95         8.70          34.81           3    12      2.8          7.84          23.52
Total       53   56                                 90.83           40   48                                 68.40

First we calculate the mean and standard deviation of the first series (A):

x̄A = 56/53 = 1.05,  σA = √(90.83/53) = √1.714 = 1.31,  CVA = (σA/x̄A) × 100 = (1.31/1.05) × 100 = 124.76

Now we calculate the mean and standard deviation of the second series (B):

x̄B = 48/40 = 1.2,  σB = √(68.4/40) = √1.71 = 1.30,  CVB = (σB/x̄B) × 100 = (1.30/1.2) × 100 = 108.33

Comparing the coefficients of variation of series A and B, series B, having the lower CV, is the more consistent.
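The comparison can be sketched with a small helper (`mean_sd_cv` is an illustrative name). Exact arithmetic gives CVs of about 123.9 and 109.0, slightly different from the hand computation above because the table rounds the means and SDs first, but the conclusion is the same:

```python
import math

def mean_sd_cv(values, freqs):
    """Mean, SD and coefficient of variation of a discrete frequency table."""
    N = sum(freqs)
    mean = sum(f * x for f, x in zip(freqs, values)) / N
    sd = math.sqrt(sum(f * (x - mean) ** 2 for f, x in zip(freqs, values)) / N)
    return mean, sd, sd / mean * 100

goals = [0, 1, 2, 3, 4]
_, _, cv_a = mean_sd_cv(goals, [27, 9, 8, 5, 4])
_, _, cv_b = mean_sd_cv(goals, [17, 9, 6, 5, 3])
print(round(cv_a, 1), round(cv_b, 1))
print("Team B is more consistent" if cv_b < cv_a else "Team A is more consistent")
```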

********************************************************************************

Objective : Comparison of wage earners of two firms

Kinds of data : An analysis of monthly wages paid to workers in two firms A and B, belonging to

the same industry, gives the following results:

                                        Firm A            Firm B
Number of wage earners                  586 (nA)          648 (nB)
Average monthly wage                    Rs. 52.50 (x̄A)    Rs. 47.50 (x̄B)
Variance of the distribution of wages   100 (σA²)         121 (σB²)


(a) Which firm A or B pays out the larger amount as monthly wages? (Ans: Firm B)

(b) In which firm A or B, there is greater variability in individual wages? (Ans: Firm B)

(c) What are the measures of (i) the average monthly wage and (ii) the variance of the distribution of wages of all the workers in firms A and B taken together? (Ans: 49.87, 117.26)

Solution: (a) Here we have to find the total amount of monthly wages paid by firm A and firm B. Since the number of workers (nA) and the average monthly wage (x̄A) are given, we can calculate ∑XA. From x̄A = ∑XA/nA we get ∑XA = nA × x̄A = 586 × 52.50 = 30765.

Similarly, for firm B, ∑XB = nB × x̄B = 648 × 47.50 = 30780.

Hence firm B pays out the larger amount as monthly wages.

(b) Variability is compared by the coefficient of variation, so we calculate the CV of each firm: CVA = (σA/x̄A) × 100 and CVB = (σB/x̄B) × 100.

Putting in the values, CVA = (10/52.50) × 100 = 19.04 and CVB = (11/47.50) × 100 = 23.16.

Since CVB > CVA, firm B has the greater variability in individual wages.

(c) (i) x̄ = (nA x̄A + nB x̄B)/(nA + nB) = (586 × 52.50 + 648 × 47.50)/(586 + 648) = (30765 + 30780)/1234 = 49.87

(ii) The formula for the combined variance is

σ² = [ nA(σA² + dA²) + nB(σB² + dB²) ] / (nA + nB), where dA = x̄A − x̄ and dB = x̄B − x̄, and x̄ is the mean of the combined series.

Here dA = 52.50 − 49.87 = 2.63 and dB = 47.50 − 49.87 = −2.37.

Putting in the values,

σ² = [ 586(100 + (2.63)²) + 648(121 + (−2.37)²) ] / (586 + 648) = (62653.30 + 82047.75)/1234 = 117.26

The variance of the distribution of wages of all the workers in firms A and B taken together is 117.26.
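The pooled mean and the combined-variance formula can be sketched as one function (`combine` is an illustrative name; it takes and returns variances, not SDs):

```python
def combine(n1, m1, v1, n2, m2, v2):
    """Mean and variance of two series pooled together (v = variance)."""
    m = (n1 * m1 + n2 * m2) / (n1 + n2)                       # combined mean
    d1, d2 = m1 - m, m2 - m                                   # deviations of the group means
    v = (n1 * (v1 + d1 ** 2) + n2 * (v2 + d2 ** 2)) / (n1 + n2)
    return m, v

m, v = combine(586, 52.50, 100, 648, 47.50, 121)
print(round(m, 2), round(v, 2))    # 49.87 117.26, as in the solution
```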

********************************************************************************

Objective : Standard deviation of combined sample

Kinds of data : The first of two samples has 100 items with mean 15 and S.D. 3. If the whole

group has 250 items with mean 15.6 and SD √13.44. Find the SD of the second sample.

Solution: Here n1 = 100, x̄1 = 15, σ1 = 3, and n = 250, x̄ = 15.6, σ = √13.44.

We know the formula for the combined standard deviation:

σ² = [ n1(σ1² + d1²) + n2(σ2² + d2²) ] / (n1 + n2), where d1 = x̄1 − x̄ and d2 = x̄2 − x̄, and x̄ = (n1x̄1 + n2x̄2)/(n1 + n2) is the mean of the combined series.

First we find the size of the second sample: n2 = n − n1 = 250 − 100 = 150.

Since the mean of the first sample and the combined mean are given, we can find the mean of the second sample. Putting the values into x̄ = (n1x̄1 + n2x̄2)/(n1 + n2):

15.6 = (100 × 15 + 150 × x̄2)/(100 + 150), which gives x̄2 = 16.

Now d1 = 15 − 15.6 = −0.6 and d2 = 16 − 15.6 = 0.4.

Putting all these values into the formula for the combined variance:

13.44 = [ 100(3² + (−0.6)²) + 150(σ2² + (0.4)²) ] / (100 + 150)

Solving gives σ2 = 4.

********************************************************************************

Objective: Corrected mean and corrected standard deviation corresponding to the corrected

figures:

Kinds of data: for a group of 200 candidates, the mean and standard deviation of scores were

found to be 40 and 15 respectively. Later on it was discovered that the scores 43 and 35 were

misread as 34 and 53 respectively. Find the corrected mean and corrected standard deviation

corresponding to the corrected figures.

Solution: Here it is given that n=200, mean=40 and SD= 15.

Wrong scores are 34, 53 and corrected scores are 43 and 35.

(i) Corrected mean: To calculate the corrected mean, first find the total score from x̄ = ∑X/n:

∑X = 200 × 40 = 8000

Next, corrected total score = total score − wrong scores + correct scores = 8000 − (34 + 53) + (43 + 35) = 7991

Hence the corrected mean = corrected total score / number of candidates = 7991/200 = 39.95

(ii) Corrected SD: We know that SD = √[ ∑Xi²/n − (∑Xi/n)² ].

With SD = 15 and mean = 40, we first obtain the sum of squares ∑Xi² from this formula:

∑Xi² = n(σ² + x̄²) = 200 × (225 + 1600) = 365000

Now corrected ∑Xi² = 365000 − (sum of squares of wrong figures) + (sum of squares of corrected figures)

corrected ∑Xi² = 365000 − (34² + 53²) + (43² + 35²) = 365000 − 3965 + 3074 = 364109

Corrected SD = √(corrected sum of squares / n − corrected mean²) = √(364109/200 − 39.95²) = √224.54 = 14.98

Hence the corrected mean = 39.95 and the corrected standard deviation = 14.98.
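The correction procedure (patch ∑X and ∑X², then recompute) can be sketched as below. Exact arithmetic gives a corrected SD of about 14.97; the manual's 14.98 comes from rounding the corrected mean to 39.95 before squaring:

```python
import math

n, mean, sd = 200, 40, 15
wrong, right = [34, 53], [43, 35]          # misread scores and their true values

total = n * mean                            # sum(X) recovered from the mean
sum_sq = n * (sd ** 2 + mean ** 2)          # sum(X^2) recovered from the SD

total += sum(right) - sum(wrong)            # corrected sum of scores
sum_sq += sum(x * x for x in right) - sum(x * x for x in wrong)

c_mean = total / n
c_sd = math.sqrt(sum_sq / n - c_mean ** 2)
print(round(c_mean, 3), round(c_sd, 2))
```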

********************************************************************************

Important Points on Dispersion:

1. Range, QD, MD and SD are the absolute measures of dispersion.

2. CD and CV are the relative measures of dispersion.

3. Range is the crude measure of dispersion.

4. Standard deviation is the best measure of dispersion.

5. The coefficient of variation is a unitless measure of dispersion, suggested by Karl Pearson.

6. A low standard deviation indicates that the data points tend to be close to the mean.


Exercise:

Q1. Calculate the variance of the following series. (i) 5,5,5,5,5 (ii) 4,5,6. (Ans. (i) 0, (ii)0.67)

Q2. Mean and Standard deviation of 10 figures are 50 and 10 respectively. What will be the mean

and SD if (i) every figure is increased by 4 (ii) every figure is multiplied by 2 (iii) if the figures

are multiplied by 2 and then diminished by 4? (Ans. (i)54,10 (ii) 100,20 (iii) 96,20).

Q3. Calculate mean deviation and standard deviation from following table:

Classes 0-5 5-10 10-15 15-20 20-25 25-30 30-35 35-40

Frequency 2 5 7 13 21 16 8 3

(Ans: Mean Deviation = 6.23, Standard Deviation=8.05)

Q4. If the mean of 100 observations is 50 and CV is 40 %. Calculate the Standard Deviation.

(Ans. SD=20)

Q5. The arithmetic mean and variance of a set of 10 figures are known to be 17 and 33 respectively. One of the 10 figures (i.e. 26) was later found to be inaccurate and was weeded out. What are the resulting (a) arithmetic mean and (b) variance of the 9 figures?

(Ans: AM=16, variance = 26.67)

Q6. The means of two samples of size 50 and 100 respectively are 54.1 and 50.3 and the standard

deviations are 8 and 7. Obtain the mean and standard deviation of the sample of size 150

obtained by combining the two samples. (Ans: Combined mean=51.57, Combined S.D.= 7.5)

Q7. An analysis of monthly wages paid to workers in two firms A and B, belonging to the same

industry, gives the following results:

Firm A Firm B

Number of wage earners 500 600

Average monthly wage Rs. 186.00 Rs. 175.00

Variance of the distribution of wages 81 100

(a) Which firm A or B pays out the larger amount as monthly wages? (Ans: Firm B)

(b) In which firm A or B, there is greater variability in individual wages? (Ans: Firm B)

(c) What are the measures of (i) average monthly wage and (ii) the variance of the

distribution of wages of all the workers in the firms A and B taken together?

(Ans: Combined monthly wage: Rs. 180, Combined variance = 121.36)


4. Moments, Skewness and Kurtosis

R. S. Solanki

Assistant professor (Maths & Stat.) , College of Agriculture , Waraseoni, Balaghat (M.P.),India

Email id : [email protected]

1. Moments:

The word "moment" comes from mechanics, where it measures the turning effect of a force. In statistics, moments are the arithmetic means of the first, second, third and, in general, rth powers of the deviations taken from either the mean or an arbitrary point of a distribution. In other words, moments are statistical measures that give certain characteristics

of the distribution. In statistics, some moments are very important. Generally, in any frequency

distribution, four moments are obtained which are known as first, second, third and fourth

moments. These four moments describe the information about mean, variance, skewness and

kurtosis of a frequency distribution. Calculation of moments gives some features of a distribution

which are of statistical importance.

Moments can be classified into raw and central moments. Raw moments are measured about
any arbitrary point A (say). If A is taken to be zero then raw moments are called moments about

origin. When A is taken to be Arithmetic mean we get central moments. The first raw moment

about origin is mean whereas the first central moment is zero. The second raw and central moments

are mean square deviation and variance, respectively. The third and fourth moments are useful in

measuring skewness and kurtosis.

Methods of Calculation

1. Moments about an Arbitrary Point, i.e. raw moments

For Ungrouped Data

If x1, x2, …, xN are N observations of a variable x, then their moments about an arbitrary point A are

Zero order moment: μ′0 = (1/N) Σ (xi − A)^0 = 1
First order moment: μ′1 = (1/N) Σ (xi − A)
Second order moment: μ′2 = (1/N) Σ (xi − A)^2
Third order moment: μ′3 = (1/N) Σ (xi − A)^3
Fourth order moment: μ′4 = (1/N) Σ (xi − A)^4

In general, the rth order moment about the arbitrary point A is given by

μ′r = (1/N) Σ (xi − A)^r ;  r = 0, 1, 2, …
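As a minimal sketch (the manual itself gives no code; the function name `raw_moment` is ours), the rth raw moment about an arbitrary point A can be computed directly from the formula above:

```python
# r-th raw moment about an arbitrary point A: mu'_r = (1/N) * sum((x_i - A)**r)
def raw_moment(data, A, r):
    """Compute the r-th moment of `data` about the point A."""
    return sum((x - A) ** r for x in data) / len(data)

# Daily earnings data used in the worked example later in this chapter (A = 123):
earnings = [126, 121, 124, 122, 125, 124, 123]
print(round(raw_moment(earnings, 123, 1), 2))  # 0.57
print(round(raw_moment(earnings, 123, 2), 2))  # 2.86
```

Setting A to zero gives the moments about the origin, and setting A to the arithmetic mean gives the central moments.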

For Grouped Data

If x1, x2, …, xk are k values (or mid values in case of class intervals) of a variable x with corresponding frequencies f1, f2, …, fk, then the moments about an arbitrary point A are

Zero order moment: μ′0 = (1/N) Σ fi (xi − A)^0 = 1,  where N = Σ fi
First order moment: μ′1 = (1/N) Σ fi (xi − A)
Second order moment: μ′2 = (1/N) Σ fi (xi − A)^2
Third order moment: μ′3 = (1/N) Σ fi (xi − A)^3
Fourth order moment: μ′4 = (1/N) Σ fi (xi − A)^4

In general, the rth order moment about the arbitrary point A is given by

μ′r = (1/N) Σ fi (xi − A)^r ;  N = Σ fi,  r = 0, 1, 2, …

2. Moments about origin: In raw moments, if A is taken to be zero, the raw moments are called moments about origin and are denoted by mr.

In general, for ungrouped data mr = Σ (xi)^r / N, where N is the number of observations and r = 0, 1, 2, …

For grouped data mr = Σ fi (xi)^r / Σ fi.

3. Moments about the arithmetic mean, i.e. central moments

When we take the deviations from the arithmetic mean and calculate the moments, they are known as moments about the arithmetic mean, or central moments.

For Ungrouped Data

If x1, x2, …, xN are N observations of a variable x, then their moments about the arithmetic mean x̄ = (1/N) Σ xi are

Zero order moment: μ0 = (1/N) Σ (xi − x̄)^0 = 1
First order moment: μ1 = (1/N) Σ (xi − x̄) = 0
Second order moment: μ2 = (1/N) Σ (xi − x̄)^2 = Variance
Third order moment: μ3 = (1/N) Σ (xi − x̄)^3
Fourth order moment: μ4 = (1/N) Σ (xi − x̄)^4

In general, the rth order moment about the arithmetic mean x̄ is given by

μr = (1/N) Σ (xi − x̄)^r ;  r = 0, 1, 2, …


For Grouped Data

If x1, x2, …, xk are k values (or mid values in case of class intervals) of a variable x with corresponding frequencies f1, f2, …, fk, then the moments about the arithmetic mean x̄ = (1/N) Σ fi xi, where N = Σ fi, are

Zero order moment: μ0 = (1/N) Σ fi (xi − x̄)^0 = 1
First order moment: μ1 = (1/N) Σ fi (xi − x̄) = 0
Second order moment: μ2 = (1/N) Σ fi (xi − x̄)^2 = Variance
Third order moment: μ3 = (1/N) Σ fi (xi − x̄)^3
Fourth order moment: μ4 = (1/N) Σ fi (xi − x̄)^4

In general, the rth order moment about the arithmetic mean x̄ is given by

μr = (1/N) Σ fi (xi − x̄)^r ;  N = Σ fi,  r = 0, 1, 2, …

Relationship between central moments and raw moments: the central moments can be obtained from the raw moments (with μ′0 = 1) by

μr = μ′r − rC1 μ′1 μ′(r−1) + rC2 (μ′1)^2 μ′(r−2) − … + (−1)^r (μ′1)^r

In particular,

μ2 = μ′2 − (μ′1)^2
μ3 = μ′3 − 3 μ′1 μ′2 + 2 (μ′1)^3
μ4 = μ′4 − 4 μ′1 μ′3 + 6 (μ′1)^2 μ′2 − 3 (μ′1)^4

Important: (i) μ0 = μ′0 = 1; (ii) the first central moment is always zero; (iii) μ2 = (S.D.)^2 = variance.
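The particular relations above can be checked numerically; a small sketch (the helper name `raw_moment` is ours, and the data are the earnings used in the worked example below):

```python
# Verify mu_2 = mu'_2 - (mu'_1)^2 against the directly computed variance.
def raw_moment(data, A, r):
    """r-th moment of `data` about the arbitrary point A."""
    return sum((x - A) ** r for x in data) / len(data)

data = [126, 121, 124, 122, 125, 124, 123]
m1 = raw_moment(data, 123, 1)
m2 = raw_moment(data, 123, 2)
mu2 = m2 - m1 ** 2                         # second central moment via the relation

mean = sum(data) / len(data)
var = sum((x - mean) ** 2 for x in data) / len(data)  # variance computed directly
print(round(mu2, 4), round(var, 4))  # 2.5306 2.5306 — the relation checks out
```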

********************************************************************************

2. Skewness:

The skewness of a distribution is defined as its lack of symmetry. In a symmetrical distribution the mean, median and mode are equal, and the ordinate at the mean divides the distribution into two equal parts such that one part is the mirror image of the other. If some observations of very high (low) magnitude are added to such a distribution, its right (left) tail gets elongated. These observations are known as extreme observations. The presence of extreme observations on the right-hand side of a distribution makes it positively skewed, and the three averages, viz. mean, median and mode, are no longer equal: we have Mean > Median > Mode when a distribution is positively skewed. On the other hand, the presence of extreme observations on the left-hand side of a distribution makes it negatively skewed, and the relationship between mean, median and mode is Mean < Median < Mode (see the following figure).


Measures of Skewness

1. The Karl Pearson's coefficient of skewness Sk, based on the mode, is given by

Sk = (Mean − Mode) / S.D.

The sign of Sk gives the direction, and its magnitude gives the extent, of skewness. If Sk > 0, the distribution is positively skewed, and if Sk < 0 it is negatively skewed.

Karl Pearson's coefficient of skewness Sk is defined in terms of the median as

Sk = 3 (Mean − Median) / S.D.

The range of Karl Pearson's coefficient of skewness is −3 ≤ Sk ≤ +3.

2. The Bowley's coefficient of skewness (quartile coefficient of skewness)

Sb = [(Q3 − Q2) − (Q2 − Q1)] / [(Q3 − Q2) + (Q2 − Q1)] = (Q3 + Q1 − 2Q2) / (Q3 − Q1),

where Q1, Q2 and Q3 are the first, second and third quartiles, respectively. The range of Bowley's coefficient of skewness is −1 ≤ Sb ≤ +1.

3. Coefficient of skewness based on moments: the coefficient of skewness based on moments is given by

Sk = √β1 (β2 + 3) / [2 (5β2 − 6β1 − 9)],  where β1 = μ3^2 / μ2^3 and β2 = μ4 / μ2^2.

********************************************************************************

3. Kurtosis:

Kurtosis is another measure of the shape of a distribution. Whereas skewness measures the lack of symmetry of the frequency curve of a distribution, kurtosis measures the relative peakedness of its frequency curve. Frequency curves can be divided into three categories depending upon the shape of their peak; the three shapes are termed Leptokurtic, Mesokurtic and Platykurtic, as shown in the following figure.


Measures of Kurtosis

Karl Pearson developed the Beta and Gamma coefficients (or Beta and Gamma measures) of kurtosis based on the central moments, given respectively by

β2 = μ4 / μ2^2  and  γ2 = β2 − 3

The value β2 = 3 (γ2 = 0) holds for a mesokurtic (normal) curve. When β2 > 3 (γ2 > 0), the curve is more peaked than the mesokurtic curve and is termed leptokurtic. Similarly, when β2 < 3 (γ2 < 0), the curve is less peaked than the mesokurtic curve and is called platykurtic.
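The classification above can be put into a small helper; a minimal sketch (the function name `kurtosis_class` is ours) that takes the second and fourth central moments:

```python
def kurtosis_class(mu2, mu4):
    """Classify a curve from its central moments via beta2 = mu4 / mu2**2."""
    beta2 = mu4 / mu2 ** 2
    gamma2 = beta2 - 3
    shape = ("leptokurtic" if gamma2 > 0 else
             "platykurtic" if gamma2 < 0 else "mesokurtic")
    return round(beta2, 2), round(gamma2, 2), shape

# Central moments from the ungrouped worked example below (mu2=2.53, mu4=12.70):
print(kurtosis_class(2.53, 12.70))  # (1.98, -1.02, 'platykurtic')
```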

Objective: Moments, Measures of Skewness and Kurtosis (ungrouped data).

Kinds of data: The daily earnings (in rupees) of a sample of 7 agricultural workers are: 126, 121, 124, 122, 125, 124, 123. Compute the first four raw (about the point 123) and central moments, the coefficients of skewness and the coefficients of kurtosis.

Solution: Moments about an arbitrary value (A = 123), i.e. raw moments

Table: Calculation for raw moments.

Sr. No.    x    (x − 123)  (x − 123)^2  (x − 123)^3  (x − 123)^4
1         126       3           9            27           81
2         121      -2           4            -8           16
3         124       1           1             1            1
4         122      -1           1            -1            1
5         125       2           4             8           16
6         124       1           1             1            1
7         123       0           0             0            0
Total     865       4          20            28          116

The first raw moment: μ′1 = (1/N) Σ (xi − A) = 4/7 = 0.57

The second raw moment: μ′2 = (1/N) Σ (xi − A)^2 = 20/7 = 2.86

The third raw moment: μ′3 = (1/N) Σ (xi − A)^3 = 28/7 = 4

The fourth raw moment: μ′4 = (1/N) Σ (xi − A)^4 = 116/7 = 16.57

Moments about the Arithmetic Mean, i.e. central moments

The arithmetic mean of the daily earnings of the agricultural workers is

x̄ = (1/N) Σ xi = 865/7 = 123.57

Table: Calculation for central moments.

Sr.    x    (x − 123.57)  (x − 123.57)^2  (x − 123.57)^3  (x − 123.57)^4
1     126       2.43           5.90            14.35           34.87
2     121      -2.57           6.60           -16.97           43.62
3     124       0.43           0.18             0.08            0.03
4     122      -1.57           2.46            -3.87            6.08
5     125       1.43           2.04             2.92            4.18
6     124       0.43           0.18             0.08            0.03
7     123      -0.57           0.32            -0.19            0.11
Total 865       0.00          17.71            -3.60           88.92

The first central moment: μ1 = (1/N) Σ (xi − x̄) = 0.00/7 = 0.00

The second central moment: μ2 = (1/N) Σ (xi − x̄)^2 = 17.71/7 = 2.53

The third central moment: μ3 = (1/N) Σ (xi − x̄)^3 = −3.60/7 = −0.51

The fourth central moment: μ4 = (1/N) Σ (xi − x̄)^4 = 88.92/7 = 12.70

Karl Pearson's coefficient of skewness

The median (Md) of the daily earnings of the agricultural workers:

Arrange the data in ascending order: 121, 122, 123, 124, 124, 125, 126

Total number of observations N = 7 (odd); hence the median

Md = ((N + 1)/2)th term = ((7 + 1)/2)th term = 4th term = 124.

The mode (Mo) of the daily earnings of the agricultural workers:

Since the frequency of 124 is maximum (i.e. 2), Mo = 124.

Standard deviation (σ):

σ = √[ Σ (xi − x̄)^2 / N ] = √(17.71/7) = 1.59

Karl Pearson's coefficient of skewness based on the median

Sk = 3 (x̄ − Md) / σ = 3 (123.57 − 124) / 1.59 = −0.81

Karl Pearson's coefficient of skewness based on the mode

Sk = (x̄ − Mo) / σ = (123.57 − 124) / 1.59 = −0.27

Bowley's coefficient of skewness

Arrange the data in ascending order: 121, 122, 123, 124, 124, 125, 126

Total number of observations N = 7.

Hence the first quartile: Q1 = ((N + 1)/4)th term = ((7 + 1)/4)th term = 2nd term = 122

Second quartile: Q2 = Md = 124

Third quartile: Q3 = (3(N + 1)/4)th term = (3(7 + 1)/4)th term = 6th term = 125

Hence Bowley's coefficient of skewness

Sb = [(Q3 − Q2) − (Q2 − Q1)] / [(Q3 − Q2) + (Q2 − Q1)] = (Q3 + Q1 − 2Q2) / (Q3 − Q1) = (125 + 122 − 2 × 124) / (125 − 122) = −0.33

Coefficients of kurtosis:

β2 = μ4 / μ2^2 = 12.70 / (2.53)^2 = 1.98

and γ2 = β2 − 3 = 1.98 − 3 = −1.02.

Hence the curve is negatively skewed and platykurtic.
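The whole ungrouped example can be reproduced in a few lines; a sketch (names are ours) that follows this chapter's (N + 1)/4 rank convention for the quartiles, which works cleanly here since N + 1 is divisible by 4:

```python
import statistics

earnings = sorted([126, 121, 124, 122, 125, 124, 123])
n = len(earnings)
mean = sum(earnings) / n
sd = statistics.pstdev(earnings)       # population S.D., as used in the manual
median = statistics.median(earnings)

# Karl Pearson's coefficient based on the median:
sk = 3 * (mean - median) / sd
# Bowley's coefficient with the (N+1)/4 rank convention used above:
q1 = earnings[(n + 1) // 4 - 1]
q2 = median
q3 = earnings[3 * (n + 1) // 4 - 1]
sb = ((q3 - q2) - (q2 - q1)) / (q3 - q1)
print(round(sk, 2), round(sb, 2))  # -0.81 -0.33
```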


Objective: Moments, Measures of Skewness and Kurtosis (grouped data).

Kinds of data: Compute the first four raw (about A = 11) and central moments and the coefficients of skewness and kurtosis for the following data on milk yield:

Milk yield (kg)  4-6  6-8  8-10  10-12  12-14  14-16  16-18
No. of cows       8    10    27     38     25     20      7

Solution: Moments about an arbitrary value (A = 11), i.e. raw moments

Table: Calculation for raw moments.

Sr.  Milk yield (kg)  No. of cows (f)  Mid value (x)  f(x − A)  f(x − A)^2  f(x − A)^3  f(x − A)^4
1         4-6                8               5           -48        288        -1728       10368
2         6-8               10               7           -40        160         -640        2560
3         8-10              27               9           -54        108         -216         432
4         10-12             38              11             0          0            0           0
5         12-14             25              13            50        100          200         400
6         14-16             20              15            80        320         1280        5120
7         16-18              7              17            42        252         1512        9072
Total                    N = 135                          30       1228          408       27952

The first raw moment: μ′1 = (1/N) Σ fi (xi − A) = 30/135 = 0.22

The second raw moment: μ′2 = (1/N) Σ fi (xi − A)^2 = 1228/135 = 9.10

The third raw moment: μ′3 = (1/N) Σ fi (xi − A)^3 = 408/135 = 3.02

The fourth raw moment: μ′4 = (1/N) Σ fi (xi − A)^4 = 27952/135 = 207.05
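The grouped-data moments can be computed from the mid values and frequencies alone; a minimal sketch (the function name `grouped_raw_moment` is ours):

```python
# Grouped-data raw moment: mu'_r = sum(f_i * (x_i - A)**r) / N, with N = sum(f_i).
mids  = [5, 7, 9, 11, 13, 15, 17]       # mid values of the milk-yield classes
freqs = [8, 10, 27, 38, 25, 20, 7]      # numbers of cows

def grouped_raw_moment(mids, freqs, A, r):
    N = sum(freqs)
    return sum(f * (x - A) ** r for x, f in zip(mids, freqs)) / N

print(round(grouped_raw_moment(mids, freqs, 11, 1), 2))  # 0.22
print(round(grouped_raw_moment(mids, freqs, 11, 2), 2))  # 9.1
print(round(grouped_raw_moment(mids, freqs, 11, 4), 2))  # 207.05
```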

Moments about the Arithmetic Mean, i.e. central moments:

The arithmetic mean of the milk yield is

x̄ = (1/N) Σ fi xi = 1515/135 = 11.22

Table: Calculation for central moments.

Sr.  Milk yield (kg)  No. of cows (f)  Mid value (x)    fx    f(x − x̄)  f(x − x̄)^2  f(x − x̄)^3  f(x − x̄)^4
1         4-6                8               5           40     -49.78     309.73     -1927.20    11991.46
2         6-8               10               7           70     -42.22     178.27      -752.70     3178.08
3         8-10              27               9          243     -60.00     133.33      -296.30      658.44
4         10-12             38              11          418      -8.44       1.88        -0.42        0.09
5         12-14             25              13          325      44.44      79.01       140.47      249.72
6         14-16             20              15          300      75.56     285.43      1078.30     4073.57
7         16-18              7              17          119      40.44     233.68      1350.15     7800.84
Total                    N = 135                       1515       0.00    1221.33      -407.70    27952.20

The first central moment: μ1 = (1/N) Σ fi (xi − x̄) = 0.00/135 = 0.00

The second central moment: μ2 = (1/N) Σ fi (xi − x̄)^2 = 1221.33/135 = 9.05

The third central moment: μ3 = (1/N) Σ fi (xi − x̄)^3 = −407.70/135 = −3.02

The fourth central moment: μ4 = (1/N) Σ fi (xi − x̄)^4 = 27952.20/135 = 207.05

Karl Pearson's coefficient of skewness

Table: Calculation for median and mode.

Sr.  Milk yield (kg)  No. of cows (f)  Mid value (x)   cf
1         4-6                8               5            8
2         6-8               10               7           18
3         8-10              27               9           45
4         10-12             38              11           83
5         12-14             25              13          108
6         14-16             20              15          128
7         16-18              7              17          135
Total                    N = 135

Median number = (N + 1)/2 = (135 + 1)/2 = 68, so the median class is (10-12).

The maximum frequency is 38, so the modal class is (10-12).

The median (Md) of the milk yield:

L1 = 10, i = 2, f = 38, N = 135, C = 45

Md = L1 + (i/f) (N/2 − C) = 10 + (2/38) (135/2 − 45) = 11.18.
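The interpolation above generalises to any quartile by replacing N/2 with kN/4; a sketch (the function name `grouped_quantile` is ours) covering the median and the quartiles Q1 and Q3 used later in this example:

```python
def grouped_quantile(classes, freqs, k, q=4):
    """k-th q-tile by linear interpolation: L + (i/f) * (k*N/q - C)."""
    N = sum(freqs)
    target = k * N / q
    C = 0                        # cumulative frequency below the current class
    for (L, U), f in zip(classes, freqs):
        if C + f >= target:      # target falls inside this class
            return L + (U - L) / f * (target - C)
        C += f

classes = [(4, 6), (6, 8), (8, 10), (10, 12), (12, 14), (14, 16), (16, 18)]
freqs   = [8, 10, 27, 38, 25, 20, 7]
print(round(grouped_quantile(classes, freqs, 2), 2))  # median = 11.18
print(round(grouped_quantile(classes, freqs, 1), 2))  # Q1 = 9.17
print(round(grouped_quantile(classes, freqs, 3), 2))  # Q3 = 13.46
```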

The mode (Mo) of the milk yield:

L1 = 10, f1 = 38, f0 = 27, f2 = 25, i = 2

Mo = L1 + [(f1 − f0) / (2f1 − f0 − f2)] i = 10 + [(38 − 27) / (2 × 38 − 27 − 25)] × 2 = 10.92.

Standard deviation (σ):

σ = √[ Σ fi (xi − x̄)^2 / N ] = √(1221.33/135) = 3.01

Karl Pearson's coefficient of skewness based on the median

Sk = 3 (x̄ − Md) / σ = 3 (11.22 − 11.18) / 3.01 = 0.04.

Karl Pearson's coefficient of skewness based on the mode

Sk = (x̄ − Mo) / σ = (11.22 − 10.92) / 3.01 = 0.10.

Bowley's coefficient of skewness:

The first quartile: Q1 = (N/4)th term = (135/4)th term = 33.75 ≈ 34th term

The 34th term is in the class interval "8-10". Hence

L1 = 8, i = 2, f = 27, N = 135, C = 18

Q1 = L1 + (i/f) (N/4 − C) = 8 + (2/27) (135/4 − 18) = 9.17

Second quartile: Q2 = Md = 11.18

Third quartile: Q3 = (3N/4)th term = (405/4)th term = 101.25 ≈ 101st term

The 101st term is in the class interval "12-14". Hence

L1 = 12, i = 2, f = 25, N = 135, C = 83

Q3 = L1 + (i/f) (3N/4 − C) = 12 + (2/25) (3 × 135/4 − 83) = 13.46

Hence Bowley's coefficient of skewness

Sb = [(Q3 − Q2) − (Q2 − Q1)] / [(Q3 − Q2) + (Q2 − Q1)] = (Q3 + Q1 − 2Q2) / (Q3 − Q1) = (13.46 + 9.17 − 2 × 11.18) / (13.46 − 9.17) = 0.06

Coefficients of kurtosis:

β2 = μ4 / μ2^2 = 207.05 / (9.05)^2 = 2.53

and γ2 = β2 − 3 = 2.53 − 3 = −0.47. Hence the curve is positively skewed and platykurtic.

********************************************************************************

Objective: Computation of the mean and variance when the moments about an arbitrary value are given.

Kinds of data: The first three moments of a distribution about the value 2 of a variable are 1, 16 and −40.

Solution: Here the arbitrary value is A = 2 and the moments are μ′1 = 1, μ′2 = 16 and μ′3 = −40.

We know that μ′1 = Σ fi (xi − 2) / Σ fi = 1, hence

Σ fi xi / Σ fi − 2 Σ fi / Σ fi = 1, which gives x̄ = Σ fi xi / Σ fi = 1 + 2 = 3

Hence the mean is 3.

We know that μ2 = μ′2 − (μ′1)^2; putting in the values, we get

μ2 = μ′2 − (μ′1)^2 = 16 − 1 × 1 = 15

Hence the variance is 15.
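The two steps above amount to mean = A + μ′1 and variance = μ′2 − (μ′1)^2; a tiny sketch (the helper name `mean_var_from_raw` is ours):

```python
# Mean and variance recovered from the first two moments about a point A.
def mean_var_from_raw(A, m1, m2):
    """Return (mean, variance) given raw moments m1, m2 about the point A."""
    return A + m1, m2 - m1 ** 2

print(mean_var_from_raw(2, 1, 16))  # (3, 15)
```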

********************************************************************************

Exercise:

Q1. The marks obtained by 46 students in an examination are as follows:

Marks     0-5  5-10  10-15  15-20  20-25  25-30
Students   5     7     10     16      4      4

Calculate Karl Pearson's and Bowley's coefficients of skewness.
(Ans.: Karl Pearson's coefficient of skewness = -0.31 and Bowley's coefficient of skewness = -0.22)

Q2. Calculate Karl Pearson's and Bowley's coefficients of skewness for the following distribution:

Measurement  3.5  4.5  5.5  6.5  7.5  8.5  9.5
Frequency     3    7   22   60   85   32    8

(Ans.: Karl Pearson's coefficient of skewness = -0.36 and Bowley's coefficient of skewness = -1)

Q3. Compute the first four raw and central moments with the coefficient of kurtosis for the following data:

Plant Height (cm)  30-35  35-40  40-45  45-50  50-55  55-60  60-65  65-70
No. of plants        5     14     16     25     14     12      8      6

(Ans.: μ′1 = -3.65, μ′2 = 98.25, μ′3 = -766.25, μ′4 = 20756.25 (about A = 52.5);
μ1 = 0, μ2 = 84.93, μ3 = 212.34, μ4 = 16890.14; γ2 = -0.66)

********************************************************************************


5. Correlation and Regression

Surabhi Jain
Assistant Professor (Statistics), College of Agriculture, JNKVV, Jabalpur (M.P.) 482004, India
Email id : [email protected]

Correlation: Correlation is a measure of linear relationship between two variables. It is a statistical

technique that can show whether and how strongly pairs of variables are related. For example,

height and weight are related; taller people tend to be heavier than shorter people. Correlation

works for quantifiable data. It cannot be used for purely categorical data, such as gender, brands

purchased, or favorite color.

It can be defined as a bi-variate analysis that measures the strength of association between

two variables and the direction of the relationship.

Karl Pearson correlation coefficient (or product moment correlation coefficient): Pearson's r is the most widely used correlation statistic to measure the degree of the relationship between linearly related variables. The correlation coefficient between X on Y and Y on X is the same, and is calculated by the Karl Pearson correlation formula

r_xy = cov(x, y) / (σx σy) = Σ (xi − x̄)(yi − ȳ) / √[ Σ (xi − x̄)^2 Σ (yi − ȳ)^2 ] = [ n Σ xi yi − (Σ xi)(Σ yi) ] / [ √(n Σ xi^2 − (Σ xi)^2) √(n Σ yi^2 − (Σ yi)^2) ],

where n = number of observations, xi = value of the ith observation of the x variable and yi = value of the ith observation of the y variable.

Assumptions:

Normality: Both variables should be normally distributed (normally distributed variables have a bell-shaped curve).

Linearity: It assumes a straight-line relationship between the two variables.

Homoscedasticity: It assumes that the data are equally distributed about the regression line; essentially, the variances about the line of best fit remain similar as you move along the line.

Types of correlation:

Positive correlation: If two variables deviate in the same direction, the correlation is said to be positive. The line corresponding to the scatter plot is an increasing line sloping up from left to right. Examples: height and weight of a group of persons, income and expenditure, etc.

Negative correlation: If two variables deviate in opposite directions, an increase in one resulting in a decrease in the other, the correlation is said to be negative. The line corresponding to the scatter plot is a decreasing line sloping down from left to right. Example: price and demand of a commodity.

No correlation: occurs when there is no linear dependency between the variables.


Range of the correlation coefficient (r): The correlation coefficient lies between −1 and +1. It is a pure number, independent of the unit of measurement.

Effect of change of origin and scale: The correlation coefficient is independent of change of origin (Xi → Xi − A) and of scale (Xi → Xi/h).

Correlation between independent variables: Two independent variables are uncorrelated, but two uncorrelated variables (r = 0) need not be independent.

Test of significance of the correlation coefficient (null hypothesis: r = 0): To test the significance of the correlation coefficient, the t test statistic is used:

t_cal = r √(n − 2) / √(1 − r^2)  with (n − 2) d.f.,

where r is the calculated value of the correlation coefficient. To test the null hypothesis we compare the calculated value of t with the tabulated value of t at (n − 2) degrees of freedom.

If t_cal > t_tab, the null hypothesis is rejected and we conclude that the correlation is significant.

If t_cal < t_tab, the null hypothesis is accepted and we conclude that the correlation is non-significant.
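The r and t formulas above can be sketched as follows (function names are ours; the data are from the marks example below, computed here with unrounded means, so r comes out as 0.98 rather than the 0.983 obtained with rounded means):

```python
import math

def pearson_r(x, y):
    """Pearson's r from deviations about the means."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def t_for_r(r, n):
    """t statistic for H0: r = 0, with (n - 2) d.f."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

maths = [25, 30, 32, 35, 37, 40, 42, 45]
stats = [8, 10, 15, 17, 20, 22, 24, 25]
r = pearson_r(maths, stats)
print(round(r, 2))            # 0.98
print(t_for_r(r, 8) > 2.447)  # True: reject H0 at the 5% level with 6 d.f.
```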

**************************************************************************

Objective: Computation of the correlation coefficient and test of significance of the correlation coefficient for the given data.

Kinds of data: The marks obtained by 8 students in Mathematics and Statistics are given below:

Student      A   B   C   D   E   F   G   H
Mathematics 25  30  32  35  37  40  42  45
Statistics   8  10  15  17  20  22  24  25

Solution: Let the marks in Mathematics be X and the marks in Statistics be Y. We know that the formula for the correlation coefficient is

r = Σ (xi − x̄)(yi − ȳ) / √[ Σ (xi − x̄)^2 Σ (yi − ȳ)^2 ]

First we calculate the means of X and Y:

X̄ = Σ Xi / n = 286/8 = 35.75 ≈ 36  and  Ȳ = Σ Yi / n = 141/8 = 17.63 ≈ 18

The other calculations are presented in the table below.

Student  Mathematics (X)  Statistics (Y)  (Xi − X̄)  (Yi − Ȳ)  (Xi − X̄)(Yi − Ȳ)  (Xi − X̄)^2  (Yi − Ȳ)^2
A             25                8            -11        -10           110             121          100
B             30               10             -6         -8            48              36           64
C             32               15             -4         -3            12              16            9
D             35               17             -1         -1             1               1            1
E             37               20              1          2             2               1            4
F             40               22              4          4            16              16           16
G             42               24              6          6            36              36           36
H             45               25              9          7            63              81           49
Total        286              141             -2         -3           288             308          279

By putting these values into the formula we get

r = 288 / √(308 × 279) = 0.983


Test of significance of r = 0.983

To test the significance of the correlation coefficient, the t test statistic is used:

t_cal = r √(n − 2) / √(1 − r^2)  with (n − 2) d.f.

By putting the values into the formula we get

t_cal = 0.983 √(8 − 2) / √(1 − 0.983^2) = 13.11

The table value of t at 6 degrees of freedom at the 5% level of significance is 2.447.

Conclusion: Since the calculated value of t (13.11) is greater than the tabulated value of t (2.447) at 6 degrees of freedom, the null hypothesis is rejected, and we conclude that the correlation coefficient is highly significant. This indicates that marks in Mathematics are associated with marks in Statistics.

********************************************************************************

Objective: Corrected correlation coefficient corresponding to corrected figures.

Kinds of data: For two variables X and Y with 50 observations each, the following data were observed: X̄ = 10, σx = 3, Ȳ = 6, σy = 2 and r(x, y) = 0.3. On subsequent verification it was found that one value of X (= 10) and one value of Y (= 6) were inaccurate and hence weeded out. With the remaining 49 pairs of values, how is the original value of r affected?

Solution: First we find the corrected mean.

We know that X̄ = Σ Xi / n; here X̄ = 10 and n = 50, so Σ Xi = n X̄ = 50 × 10 = 500.

Since one value X = 10 was inaccurate, we remove it and get Σ Xi = 500 − 10 = 490; the number of observations is now 49.

Hence the corrected mean is X̄ = Σ Xi / n = 490/49 = 10.

We know that σx^2 = Σ Xi^2 / n − (Σ Xi / n)^2 = Σ Xi^2 / n − X̄^2, so Σ Xi^2 = n (σx^2 + X̄^2).

Here n = 50, σx = 3 and X̄ = 10, so Σ Xi^2 = 50 × (9 + 100) = 5450.

Now we find the corrected Σ Xi^2 by removing 10^2:

Corrected Σ Xi^2 = 5450 − 100 = 5350

Using the corrected Σ Xi^2, the corrected mean X̄ and n = 49,

corrected σx^2 = Σ Xi^2 / n − X̄^2 = 5350/49 − 10^2 = 109.18 − 100 = 9.18

Similarly, we repeat the same procedure for the variable Y and find the corrected mean Ȳ and Σ Yi^2:

Σ Yi = n Ȳ = 50 × 6 = 300

Since one value Y = 6 was inaccurate, we remove it and get Σ Yi = 300 − 6 = 294; the number of observations is now 49.

Hence the corrected mean is Ȳ = Σ Yi / n = 294/49 = 6.

σy^2 = Σ Yi^2 / n − Ȳ^2, so Σ Yi^2 = n (σy^2 + Ȳ^2) = 50 × (2^2 + 6^2) = 2000.

Now we find the corrected Σ Yi^2 by removing 6^2:

Corrected Σ Yi^2 = 2000 − 36 = 1964

Using the corrected Σ Yi^2, the corrected mean Ȳ and n = 49,

corrected σy^2 = Σ Yi^2 / n − Ȳ^2 = 1964/49 − 6^2 = 40.08 − 36 = 4.08

Since r = cov(x, y) / (σx σy) = (Σ xy / n − X̄ Ȳ) / (σx σy) = 0.3, we have cov(x, y) = r σx σy = Σ xy / n − X̄ Ȳ.

By putting in the values, 0.3 × 3 × 2 = Σ xy / 50 − 10 × 6, which gives Σ xy = 50 × (1.8 + 60) = 3090.

Next we get the corrected value of Σ xy by removing the wrong pair: Σ xy = 3090 − 10 × 6 = 3030.

Hence the corrected r = (Σ xy / n − X̄ Ȳ) / (σx σy) = (3030/49 − 10 × 6) / √(9.18 × 4.08) = 1.84/6.12 = 0.3

Hence we find that there is no change in the correlation coefficient.
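The correction procedure above can be sketched in a few lines: rebuild the raw sums from the summary figures, drop the inaccurate pair, and recompute r (the variable names are ours):

```python
import math

n, mx, sx, my, sy, r = 50, 10, 3, 6, 2, 0.3
Sx, Sy = n * mx, n * my               # sum(x), sum(y)
Sxx = n * (sx ** 2 + mx ** 2)         # sum(x^2), from sigma_x^2 = Sxx/n - mx^2
Syy = n * (sy ** 2 + my ** 2)         # sum(y^2)
Sxy = n * (r * sx * sy + mx * my)     # sum(xy), from cov = Sxy/n - mx*my

# Weed out the inaccurate pair (x, y) = (10, 6); n falls from 50 to 49.
n2 = n - 1
Sx, Sy, Sxx, Syy, Sxy = Sx - 10, Sy - 6, Sxx - 100, Syy - 36, Sxy - 60
cov = Sxy / n2 - (Sx / n2) * (Sy / n2)
var_x = Sxx / n2 - (Sx / n2) ** 2
var_y = Syy / n2 - (Sy / n2) ** 2
r2 = cov / math.sqrt(var_x * var_y)
print(round(r2, 2))  # 0.3 — the correlation is unchanged
```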

*******************************************************************************

Exercise:

Q1. Define correlation coefficient. Also write the properties of correlation coefficient.

Q2. Calculate the correlation coefficient between the variable X and Y from the following bi-

variate data.

X 71 68 70 67 70 71 70 73

Y 69 67 65 63 65 62 65 64

(Ans: r=0)

Q3. Calculate the correlation coefficient between the variable X and Y from the following bi-

variate data.

X 1 3 4 5 7 8 10

Y 2 6 8 10 14 16 20

(Ans: r=1)

Q4. Calculate the correlation coefficient between the heights of father and son from the following

data:

Height of father (inches) 65 66 67 68 69 70 71

Height of Son (inches) 67 68 66 69 72 72 69

Apply t test to test the significance and interpret the result. (Ans: r=+0.67, tcal=2.02)

Q5. A computer while calculating correlation coefficient between two variables X and Y from 25

pairs of observation obtained the following results: n=25, ∑ 𝑋=125, ∑ 𝑋2 = 650, ∑ 𝑌=100,

∑ 𝑌2 = 460, ∑ 𝑋𝑌=508. But on subsequent verification it was found that he had copied down

two pairs as (6,14) and (8,6) while the correct values were (8,12) and (6,8). Obtain the correct

value of correlation coefficient?

(Ans: 0.67)

*************************************************************************


Regression: The term regression was introduced by the British biometrician Sir Francis Galton. It is a mathematical measure of the average relationship between two or more variables. Regression is a technique used to model and analyse the relationships between variables, and often how they jointly contribute to producing a particular outcome. A linear regression is a regression model made up entirely of linear terms.

Lines of Regression: The line of regression is the line which gives the best estimate of the value of one variable for any specified value of the other variable. The line of regression is the line of best fit and is obtained by the principle of least squares. Both lines of regression pass through (intersect at) the point (x̄, ȳ).

In linear regression there are two lines of regression.

One is Y on X (Y = a + bX), where X is the independent variable and Y is the dependent variable. By applying the principle of least squares, the regression line of Y on X is given by

(y − ȳ) = b_yx (x − x̄),  where b_yx = cov(x, y) / σx^2 = r σx σy / σx^2 = r σy / σx = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)^2

The other is X on Y (X = a + bY), where Y is the independent variable and X is the dependent variable. By applying the principle of least squares, the regression line of X on Y is given by

(x − x̄) = b_xy (y − ȳ),  where b_xy = cov(x, y) / σy^2 = r σx σy / σy^2 = r σx / σy = Σ (xi − x̄)(yi − ȳ) / Σ (yi − ȳ)^2

Here b_yx and b_xy are the regression coefficients; each shows the change in the dependent variable for a unit change in the independent variable.
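The deviation forms of b_yx and b_xy above can be computed directly; a minimal sketch (the function name `regression_coeffs` is ours), using the brother/sister statures worked through later in this chapter:

```python
def regression_coeffs(x, y):
    """Return (b_yx, b_xy) from the deviation sums about the means."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sxx, sxy / syy

bro = [71, 68, 66, 67, 70, 71, 70, 73, 72, 65, 66]
sis = [69, 64, 65, 63, 65, 62, 65, 64, 66, 59, 62]
byx, bxy = regression_coeffs(bro, sis)
print(round(byx, 3), round(bxy, 3))  # 0.527 0.591
```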

Here 𝑏𝑦𝑥 and 𝑏𝑥𝑦are the regression coefficient and shows the change in dependent variable

with a unit change in independent variable.

Angle between the two lines of Regression:

The slopes of the two lines of regression are r σy / σx and σy / (r σx). If θ is the angle between the two lines of regression, then

tan θ = [(1 − r^2) / r] [σx σy / (σx^2 + σy^2)]

Properties of the regression coefficients and the relationship between the correlation and regression coefficients

• A regression coefficient lies between −∞ and +∞.
• Regression coefficients are independent of change of origin but not of scale.
• The correlation coefficient is the geometric mean of the regression coefficients: r = ±√(b_yx × b_xy).
• If one of the regression coefficients is greater than unity, the other must be less than unity.
• The arithmetic mean of the regression coefficients is at least the correlation coefficient r if r > 0: (1/2)(b_yx + b_xy) ≥ r.
• If the two variables are uncorrelated, the lines of regression are perpendicular to each other (if r = 0, θ = π/2).
• The two lines of regression coincide if r = ±1; then θ = 0 or π.
• The correlation coefficient and the regression coefficients have the same sign, because each depends on the sign of cov(x, y).


Test of significance of a regression coefficient (null hypothesis: b_yx = 0, b_xy = 0): To test the significance of a regression coefficient, the t test statistic is used:

t_cal = b_yx / S.E.(b_yx),  where S.E.(b_yx) = √{ [Σ(y − ȳ)^2 − (Σ(x − x̄)(y − ȳ))^2 / Σ(x − x̄)^2] / [(n − 2) Σ(x − x̄)^2] },  based on (n − 2) d.f. (for y on x)

t_cal = b_xy / S.E.(b_xy),  where S.E.(b_xy) = √{ [Σ(x − x̄)^2 − (Σ(x − x̄)(y − ȳ))^2 / Σ(y − ȳ)^2] / [(n − 2) Σ(y − ȳ)^2] },  based on (n − 2) d.f. (for x on y)

To test the null hypothesis we compare the calculated value of t with the tabulated value of t at (n − 2) degrees of freedom.

If t_cal > t_tab, the null hypothesis is rejected and we conclude that the regression coefficient is significant.

If t_cal < t_tab, the null hypothesis is accepted and we conclude that the coefficient is non-significant.

*************************************************************************

Objective: Determination of the lines of regression of Y on X and X on Y and their explanation.

Kinds of data: The lines of regression are Y = 5 + 2.8X and X = 3 − 0.5Y.

Solution: Here it is given that the line of regression of Y on X is Y = 5 + 2.8X, so the regression coefficient is b_yx = 2.8.

Similarly, for X on Y the line is X = 3 − 0.5Y, so the regression coefficient is b_xy = −0.5.

We know that the signs of the two regression coefficients must be the same. Here the signs of the two coefficients differ, which is not possible. Hence the equations cannot be the estimated regression equations of Y on X and X on Y, respectively.

*******************************************************************************

Objective: Determination of (i) the lines of regression of Y on X and X on Y, (ii) the mean of X and the mean of Y, and (iii) the variance of Y when the variance of X is given.

Kinds of data: The two lines of regression are X + 2Y − 5 = 0 and 2X + 3Y − 8 = 0, and the variance of X is 12.

Solution: (i) If we take the line X + 2Y − 5 = 0 as the regression line of Y on X, the equation can be written as 2Y = −X + 5, or Y = −0.5X + 2.5, so b_yx = −0.5.

Similarly, if we take the line 2X + 3Y − 8 = 0 as the regression line of X on Y, the equation can be written as 2X = −3Y + 8, or X = −1.5Y + 4, so b_xy = −1.5.

Here the signs of the two regression coefficients are the same, and one coefficient is greater than unity in magnitude while the other is smaller than unity. We can also verify that r = −√(b_yx × b_xy) = −√(−0.5 × −1.5) = −√0.75 = −0.87 (negative because both coefficients are negative), which lies between −1 and +1. So our identification of the lines of regression of Y on X and X on Y is correct.

(ii) Since both lines of regression pass through the point (x̄, ȳ), the equations can be written as

x̄ + 2ȳ − 5 = 0 ……… (1)
2x̄ + 3ȳ − 8 = 0 ……… (2)

Solving these equations gives x̄ and ȳ. Multiplying equation (1) by 2 gives 2x̄ + 4ȳ − 10 = 0 …… (3). Subtracting equation (2) from equation (3) gives 2x̄ + 4ȳ − 10 − 2x̄ − 3ȳ + 8 = 0, i.e. ȳ = 2. Putting this value of ȳ in equation (1) gives x̄ = 1.

Hence the means of X and Y are x̄ = 1 and ȳ = 2.

(iii) Here σx^2 = 12 is given; we have to find σy^2.

Since b_yx = r σy / σx, and r, b_yx and σx (= √12 = 3.46) are known, putting in these values gives −0.5 = −0.87 × σy / 3.46, so σy = 1.99 ≈ 2.

Hence σy^2 = 4.
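Step (ii) above is a 2×2 linear system in (x̄, ȳ), since both regression lines pass through the point of means; a small Cramer's-rule sketch (the function name `solve2` is ours):

```python
# Solve a1*x + b1*y = c1 and a2*x + b2*y = c2 by Cramer's rule.
def solve2(a1, b1, c1, a2, b2, c2):
    det = a1 * b2 - a2 * b1
    return (c1 * b2 - c2 * b1) / det, (a1 * c2 - a2 * c1) / det

# x + 2y = 5 and 2x + 3y = 8, i.e. the two regression lines above:
print(solve2(1, 2, 5, 2, 3, 8))  # (1.0, 2.0) — the means of X and Y
```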

*******************************************************************************

Objective: Construction of the line of regression and estimation of the dependent variable when the mean, standard deviation and correlation coefficient are given.

Kinds of data: The following results were obtained in the analysis of data on the yield of dry bark in ounces (Y) and age in years (X) of 200 cinchona plants:

                          X (age in years)   Y (yield)
Average                         9.2             16.5
Standard deviation              2.1              4.2
Correlation coefficient               +0.84

Estimate the yield of dry bark of a plant of age 8 years.

Solution: Here X̄ = 9.2, σX = 2.1, Ȳ = 16.5, σY = 4.2 and r = 0.84.

(i) Construction of the line of regression: the line of regression of Y on X is given by

(Y − Ȳ) = b_YX (X − X̄),  where b_YX = r σY / σX = 0.84 × 4.2/2.1 = 1.68

Putting in the values gives (Y − 16.5) = 1.68 (X − 9.2), i.e. Y = 1.68X + 1.04.

Similarly, the line of regression of X on Y is given by

(X − X̄) = b_XY (Y − Ȳ),  where b_XY = r σX / σY = 0.84 × 2.1/4.2 = 0.42

Putting in the values gives (X − 9.2) = 0.42 (Y − 16.5), i.e. X = 0.42Y + 2.27.

(ii) Estimation of the yield (Y) of dry bark of a plant of age 8 years (X):

The line of regression of Y on X is Y = 1.68X + 1.04. Putting X = 8 gives Y = 1.68 × 8 + 1.04 = 14.48.

Hence the yield of dry bark of a plant of age 8 years is 14.48 ounces.
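The construction above needs only the summary statistics; a minimal sketch (the function name `regress_y_on_x` is ours):

```python
def regress_y_on_x(mx, my, sx, sy, r):
    """Return (slope, intercept) of the Y-on-X line from summary statistics."""
    byx = r * sy / sx            # b_yx = r * sigma_y / sigma_x
    return byx, my - byx * mx    # line passes through the means

slope, intercept = regress_y_on_x(9.2, 16.5, 2.1, 4.2, 0.84)
print(round(slope, 2), round(intercept, 2))  # 1.68 1.04
print(round(slope * 8 + intercept, 2))       # 14.48 — predicted yield at age 8
```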

Objective: Computation of the correlation coefficient and the equations of the lines of regression of Y on X and X on Y, and estimation of the value of Y when the value of X is known and of X when the value of Y is known.

Kinds of data: The following table relates to data on the stature (inches) of brother and sister from Pearson and Lee's sample of 1,401 families.

Family number  1   2   3   4   5   6   7   8   9  10  11
Brother, X    71  68  66  67  70  71  70  73  72  65  66
Sister, Y     69  64  65  63  65  62  65  64  66  59  62

Solution: First we calculate the means: X̄ = 759/11 = 69 and Ȳ = 704/11 = 64.

Family number  Brother X  Sister Y  (Xi − X̄)  (Yi − Ȳ)  (Xi − X̄)^2  (Yi − Ȳ)^2  (Xi − X̄)(Yi − Ȳ)
1                 71         69         2          5          4           25             10
2                 68         64        -1          0          1            0              0
3                 66         65        -3          1          9            1             -3
4                 67         63        -2         -1          4            1              2
5                 70         65         1          1          1            1              1
6                 71         62         2         -2          4            4             -4
7                 70         65         1          1          1            1              1
8                 73         64         4          0         16            0              0
9                 72         66         3          2          9            4              6
10                65         59        -4         -5         16           25             20
11                66         62        -3         -2          9            4              6
Total            759        704                              74           66             39

Then by using the formula of the correlation coefficient, we have

r = Σ(xi − x̄)(yi − ȳ) / √[Σ(xi − x̄)² Σ(yi − ȳ)²] = 39/√(74 × 66) = 0.558

Test of significance of the correlation coefficient:

t = r√(n − 2) / √(1 − r²) = 0.558 × √(11 − 2) / √(1 − 0.558²) = 2.018

The table value of t at 9 d.f. and 5% level of significance is 2.262.

Since t calculated is less than t tabulated, the null hypothesis is accepted: the correlation coefficient is not significant.

Calculation of Regression Coefficients

Using the formulas of the regression coefficient of Y on X and of X on Y, we have

bYX = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² = 39/74 = 0.527, and bXY = Σ(xi − x̄)(yi − ȳ) / Σ(yi − ȳ)² = 39/66 = 0.591.

Hence, the equation of the regression line of Y on X is Y − 64 = 0.527(X − 69),

and the equation of the regression line of X on Y is X − 69 = 0.591(Y − 64).

Estimation of Y when X is given :

If we want to calculate the value of Y for X=70 then by putting X=70 in the line of regression of Y

on X we get Y - 64 =0.527*(70 -69)

Hence Y= 64 + 0.527 * 1 =64.527

Estimation of X when Y is given :

If we want to calculate the value of X for Y=62 then by putting Y=62 in the line of regression of X

on Y we get X - 69 =0.591(62 -64)

Hence X= 69 + 0.591 * (-2) =67.82

Test of significance of the regression coefficient of Y on X:

t = bYX / SE(bYX), where SE(bYX) = √{[Σ(y − ȳ)² − (Σ(x − x̄)(y − ȳ))²/Σ(x − x̄)²] / [(n − 2) Σ(x − x̄)²]}

= 0.527 / √{[66 − 39²/74] / [(11 − 2) × 74]} = 0.527/0.261 = 2.018


Test of significance of the regression coefficient of X on Y:

t = bXY / SE(bXY), where SE(bXY) = √{[Σ(x − x̄)² − (Σ(x − x̄)(y − ȳ))²/Σ(y − ȳ)²] / [(n − 2) Σ(y − ȳ)²]}

= 0.591 / √{[74 − 39²/66] / [(11 − 2) × 66]} = 0.591/0.293 = 2.018

Since the calculated value of t is less than the tabulated value (2.262), the regression coefficients are not significant.
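The figures in this worked example can be reproduced directly from the paired observations. The Python sketch below (illustrative, not part of the manual) recomputes the sums of squares, the correlation coefficient and its t statistic:

```python
from math import sqrt

# Sketch: r and its t statistic for the brother/sister statures above.
x = [71, 68, 66, 67, 70, 71, 70, 73, 72, 65, 66]   # brothers
y = [69, 64, 65, 63, 65, 62, 65, 64, 66, 59, 62]   # sisters
n = len(x)
xb, yb = sum(x) / n, sum(y) / n
sxx = sum((xi - xb) ** 2 for xi in x)                     # 74
syy = sum((yi - yb) ** 2 for yi in y)                     # 66
sxy = sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y))  # 39
r = sxy / sqrt(sxx * syy)
t = r * sqrt(n - 2) / sqrt(1 - r ** 2)
print(round(r, 3), round(t, 3))   # 0.558 2.018
```

The regression coefficients follow from the same sums: bYX = sxy/sxx and bXY = sxy/syy.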

*******************************************************************************

Exercise:

Q1. Define Regression Coefficient. Also write the properties of Regression coefficient.

Q2.The observations on X(Marks in Economics) and Y (Marks in Maths) for 10 students are given

below:

X 59 65 45 52 60 62 70 55 45 49

Y 75 70 55 65 60 69 80 65 59 61

Compute the least squares regression equations of Y on X and X on Y. Also estimate the value of Y for X = 61. (Ans: Y − 65.9 = 0.76(X − 56.2), X − 56.2 = 0.92(Y − 65.9), Y = 69.54 for X = 61)

Q3. The following data pertain to the marks in subjects A and B in a certain examination

Subject A Subject B

Mean marks 39.5 47.5

Standard Deviation of marks 10.8 16.8

Correlation coefficient +0.42

Find the two lines of regression and estimate the marks in B for candidates who secured 50

marks in A. (Ans. Y=0.65X+21.82, X=0.27Y+26.67, Y=54.34 for X=50)

Q4. From the observations of the age (X) and the mean blood pressure (Y), the following quantities were calculated: X̄ = 60, Ȳ = 141, Σx² = 1000, Σy² = 1936, Σxy = 1380, where x = X − X̄ and y = Y − Ȳ. Find the regression equation of Y on X and estimate the mean blood pressure for women of age 35 years. (Ans: Y = 1.38X + 58.2, Y = 106.5 for X = 35)


6. Test of Significance

Mujahida Sayyed

Asst. professor (Maths & Stat.), College of Agriculture, JNKVV, Ganjbasoda, 464221(M.P.), India

Email id : [email protected]

Once sample data has been gathered through an experiment, statistical inference allows analysts to

assess some claim about the population from which the sample has been drawn. The methods of

inference used to support or reject claims based on sample data are known as tests of significance.

Null Hypothesis: Every test of significance begins with a null hypothesis H0. H0 represents a

theory that has been put forward, either because it is believed to be true or because it is to be used

as a basis for argument, but has not been proved. For example, in a clinical trial of a new drug, the

null hypothesis might be that the new drug is no better, on average, than the current drug.

Null Hypothesis H0: there is no difference between the two drugs on average.

Alternative Hypothesis: The alternative hypothesis, Ha, is a statement of what a statistical

hypothesis test is set up to establish. For example, in a clinical trial of a new drug, the alternative

hypothesis might be that the new drug has a different effect, on average, compared to that of the

current drug.

Alternative Hypothesis Ha: the two drugs have different effects, on average.

The alternative hypothesis might also be that the new drug is better, on average, than the current

drug. In this case Ha: the new drug is better than the current drug, on average.

The final conclusion once the test has been carried out is always given in terms of the null

hypothesis. "reject H0 in favor of Ha" or "do not reject H0"; we never conclude "reject Ha", or even

"accept Ha".

If we conclude "do not reject H0", this does not necessarily mean that the null hypothesis is true, it

only suggests that there is not sufficient evidence against H0 in favor of Ha; rejecting the null

hypothesis then, suggests that the alternative hypothesis may be true.

Hypotheses are always stated in terms of a population parameter, such as the mean µ. An alternative hypothesis may be one-sided or two-sided. A one-sided hypothesis claims that a parameter is either larger or smaller than the value given by the null hypothesis. A two-sided hypothesis claims that a parameter is simply not equal to the value given by the null hypothesis; the direction does not matter.

Hypotheses for a one-sided test for a population mean take the following form:

H0: µ = k, Ha: µ > k

or

H0: µ = k, Ha: µ < k.

Hypotheses for a two-sided test for a population mean take the following form:

H0: µ = k, Ha: µ ≠ k.


1. t TEST FOR SINGLE MEAN:

A t-test is any statistical hypothesis test in which the test statistic follows a Student's t distribution if the

null hypothesis is supported. It can be used to determine if two sets of data are significantly different from

each other, and is most commonly applied when the test statistic would follow a normal distribution if the

value of a scaling term in the test statistic were known. When the scaling term is unknown and is replaced

by an estimate based on the data, the test statistic (under certain conditions) follows a Student's t

distribution.

t = (x̄ − µ)/(s/√n)

where

µ = population mean
x̄ = sample mean
s = sample standard deviation = √[Σ(xi − x̄)²/(n − 1)]
n = number of sample observations

If tcal > ttab, the difference is significant and the null hypothesis is rejected at the 5% or 1% level of significance.

If tcal < ttab, the difference is non-significant and the null hypothesis is accepted at the 5% or 1% level of significance.

*******************************************************************************

Objective: Test the significance of difference between sample mean and population mean.

Kinds of data: Based on field experiments, a new variety of greengram is expected to give a yield of 12 quintals per hectare. The variety was tested on 10 randomly selected farmers' fields. The yields (quintals/hectare) recorded were 14.3, 12.6, 13.7, 10.9, 13.7, 12.0, 11.4, 12.0, 12.6 and 13.1. Do the results conform to the expectation?

Solution: Here the null and alternative hypotheses are

H0: The average yield of the new variety of greengram is 12 q/ha,

vs H1: The average yield of the new variety of greengram is not 12 q/ha.

We know that the t-test for a single mean is given by t = (x̄ − µ)/(s/√n).

It is given that the population mean µ = 12 and n = 10; then we calculate the sample mean x̄ = 126.3/10 = 12.63.

Next we have to calculate s = √[Σ(xi − x̄)²/(n − 1)]:

Yields (xi)   14.3  12.6  13.7  10.9  13.7  12.0  11.4  12.0  12.6  13.1   Total 126.3
(xi − x̄)      1.67 −0.03  1.07 −1.73  1.07 −0.63 −1.23 −0.63 −0.03  0.47
(xi − x̄)²     2.79  0.00  1.14  2.99  1.14  0.40  1.51  0.40  0.00  0.22   Total 10.60

Standard deviation of the sample: s = √(10.60/9) = 1.085

By putting the values in the t statistic we get

t = (x̄ − µ)/(s/√n) = (12.63 − 12)/(1.085/√10) = 1.836,

and d.f. = 10 − 1 = 9.

The table value of t at 9 d.f. and 5% level of significance is ttab = 2.262.

Since tcal < ttab, the difference is not significant and we accept the null hypothesis.

Result: We accept the null hypothesis; this means that the new variety of greengram gives an average yield of 12 quintals per hectare.
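The computation can be verified with a short Python sketch (illustrative, not part of the manual):

```python
from math import sqrt

# Sketch: one-sample t test for the greengram yields above (mu0 = 12).
yields = [14.3, 12.6, 13.7, 10.9, 13.7, 12.0, 11.4, 12.0, 12.6, 13.1]
mu0 = 12.0
n = len(yields)
xbar = sum(yields) / n
s = sqrt(sum((v - xbar) ** 2 for v in yields) / (n - 1))  # sample SD
t = (xbar - mu0) / (s / sqrt(n))
print(round(xbar, 2), round(s, 3), round(t, 3))   # 12.63 1.085 1.836
```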

*******************************************************************************

2. t TEST FOR TWO SAMPLE MEANS:

Comparison of two sample means x̄ and ȳ, assumed to have been obtained on the basis of random samples of sizes n1 and n2 from the same population, which is assumed to be normal.

The test is given by (under H0: µ1 = µ2 against H1: µ1 ≠ µ2)

t = (x̄ − ȳ)/[s√(1/n1 + 1/n2)], where x̄ = ΣXi/n1 and ȳ = ΣYi/n2,

s² = [Σ(xi − x̄)² + Σ(yi − ȳ)²]/(n1 + n2 − 2) = (n1s1² + n2s2²)/(n1 + n2 − 2),

and t follows Student's t distribution with n1 + n2 − 2 d.f.

*******************************************************************************

Objective : To test the significance of difference between two treatment means.

Kinds of data: Two kinds of manure were applied to 15 plots of one acre each, other conditions remaining the same. The yields (in quintals) are given below:

Manure I: 14 20 34 48 32 42 30 44

Manure II: 31 18 22 28 40 26 45

Examine the significance of the difference between the mean yields due to the application of

different kinds of manure.

Solution: Here the null and alternative hypotheses are

H0: There is no significant difference between the two mean yields due to the application of different kinds of manure, vs

H1: There is a significant difference between the two mean yields due to the application of different kinds of manure.


We use the t test for the difference of means:

t = (x̄ − ȳ)/[s√(1/n1 + 1/n2)]

where x̄ and ȳ are the sample means of samples I and II:

x̄ = ΣXi/n1 = 264/8 = 33 and ȳ = ΣYi/n2 = 210/7 = 30.

Next we calculate s² = [Σ(xi − x̄)² + Σ(yi − ȳ)²]/(n1 + n2 − 2).

By putting the values we get s² = (968 + 554)/(8 + 7 − 2) = 117.08, so s = 10.82.

By putting all the values in t = (x̄ − ȳ)/[s√(1/n1 + 1/n2)] we get

t = (33 − 30)/[10.82 × √(1/8 + 1/7)] = 0.54,

d.f. = n1 + n2 − 2 = 13.

The tabulated value of t for 13 d.f. at 5% level of significance is 2.16.

Since tcal < ttab, the difference is not significant and we accept the null hypothesis.

Result: Since tcal < ttab, we conclude that there is no significant difference between the two mean yields due to the application of different kinds of manure.
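A Python sketch of the pooled two-sample t computation above (illustrative, not part of the manual):

```python
from math import sqrt

# Sketch: pooled two-sample t test for the manure yields above.
m1 = [14, 20, 34, 48, 32, 42, 30, 44]
m2 = [31, 18, 22, 28, 40, 26, 45]
n1, n2 = len(m1), len(m2)
xb, yb = sum(m1) / n1, sum(m2) / n2
ss1 = sum((v - xb) ** 2 for v in m1)       # 968
ss2 = sum((v - yb) ** 2 for v in m2)       # 554
s2 = (ss1 + ss2) / (n1 + n2 - 2)           # pooled variance, 117.08
t = (xb - yb) / sqrt(s2 * (1 / n1 + 1 / n2))
print(round(t, 2))   # 0.54
```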

*******************************************************************************

3. t TEST FOR PAIRED OBSERVATIONS:

This test is used for testing whether two series of paired observations are generated from the same population, on the basis of the difference in their sample means. The test is given by

t = d̄/(s/√n), which follows Student's t distribution with n − 1 d.f.

Here d̄ = Σdi/n and s² = Σ(di − d̄)²/(n − 1).

(Computation table for the manure example above:)

Manure I  (x − x̄)  (x − x̄)²    Manure II  (y − ȳ)  (y − ȳ)²
   14      −19       361            31        +1         1
   20      −13       169            18       −12       144
   34       +1         1            22        −8        64
   48      +15       225            28        −2         4
   32       −1         1            40       +10       100
   42       +9        81            26        −4        16
   30       −3         9            45       +15       225
   44      +11       121
  264                968           210                 554

Page 58: ( UG COURSE) - JNKVV

54

di = xi − yi being the difference of the ith observations in the two samples.

*******************************************************************************

Objective: To test the significance of the difference between two treatment means when observations are paired.

Kinds of data: Two treatments A and B are assigned randomly to two animals from each of six litters. The following increases in body weight (oz.) of the animals were observed at the end of the experiment:

Litter number   1   2   3   4   5   6
Treatment A    28  32  29  36  29  34
Treatment B    25  24  27  30  30  29

Test the significance of the difference between treatments A and B.

Solution: Hypotheses

H0: There is no significant difference between treatments A and B,

vs H1: There is a significant difference between treatments A and B.

Since the observations are paired, we use Student's t with n − 1 d.f.: t = d̄/(s/√n),

where d̄ = Σdi/n, s² = Σ(di − d̄)²/(n − 1) and di = xi − yi.

Litter number  Treatment A (xi)  Treatment B (yi)  di = xi − yi  (di − d̄)  (di − d̄)²
     1               28                25                3         −0.83       0.69
     2               32                24                8          4.17      17.36
     3               29                27                2         −1.83       3.36
     4               36                30                6          2.17       4.70
     5               29                30               −1         −4.83      23.36
     6               34                29                5          1.17       1.36
Total                                          Σdi = 23                       50.83

d̄ = Σdi/n = 23/6 = 3.83 and s² = Σ(di − d̄)²/(n − 1) = 50.83/(6 − 1) = 10.17,

then s = 3.19.

By putting the values we get t = 3.833/(3.189/√6) = 2.94.

Degrees of freedom = 6 − 1 = 5.

The table value of t at 5% level of significance and 5 degrees of freedom is 2.571.

Since tcal > ttab, the difference is significant, hence the null hypothesis is rejected.

Result: Since the null hypothesis is rejected, there is a significant difference between treatments A and B.
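A Python sketch of the paired t computation above (illustrative, not part of the manual):

```python
from math import sqrt

# Sketch: paired t test for treatments A and B above.
a = [28, 32, 29, 36, 29, 34]   # treatment A
b = [25, 24, 27, 30, 30, 29]   # treatment B
d = [ai - bi for ai, bi in zip(a, b)]
n = len(d)
dbar = sum(d) / n
s = sqrt(sum((di - dbar) ** 2 for di in d) / (n - 1))
t = dbar / (s / sqrt(n))
print(round(dbar, 2), round(t, 2))   # 3.83 2.94
```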

*******************************************************************************

4. F TEST (VARIANCE RATIO TEST):

The F distribution is applied in several tests of significance relating to the equality of two sampling variances obtained from independent samples from normal populations. The test is

Variance Ratio F = (larger estimate of variance)/(smaller estimate of variance) = s1²/s2²,

where s1² = Σ(xi − x̄)²/(n1 − 1) and s2² = Σ(yi − ȳ)²/(n2 − 1),

which follows the F distribution with (n1 − 1, n2 − 1) d.f.

******************************************************************************

Objective: To test the significance of equality of two sample variances.

Kinds of data: Two random samples are chosen from two normal populations

Sample I: 20 16 26 27 23 22 18 24 25 19

Sample II: 17 23 32 25 22 24 28 18 31 33 20 27

Obtain estimates of the variances of the populations and test whether the two populations have the same variance.

Solution: Here the null and alternative hypotheses are

H0: The two populations have the same variance,

vs H1: The two populations do not have the same variance.

We know that the Variance Ratio F = (larger estimate of variance)/(smaller estimate of variance) follows the F distribution.

Here s1² = Σ(xi − x̄)²/(n1 − 1) and s2² = Σ(yi − ȳ)²/(n2 − 1), with

x̄ = ΣXi/n1 = 220/10 = 22 and ȳ = ΣYi/n2 = 300/12 = 25.

        Sample I                     Sample II
xi   (xi − x̄)  (xi − x̄)²     yi   (yi − ȳ)  (yi − ȳ)²
20      −2         4          17      −8        64
16      −6        36          23      −2         4
26       4        16          32       7        49
27       5        25          25       0         0
23       1         1          22      −3         9
22       0         0          24      −1         1
18      −4        16          28       3         9
24       2         4          18      −7        49
25       3         9          31       6        36
19      −3         9          33       8        64
                              20      −5        25
                              27       2         4
220              120         300               314

By putting the values we get s1² = 120/(10 − 1) = 13.33 and s2² = 314/(12 − 1) = 28.55.

Hence F = s2²/s1² = 28.55/13.33 = 2.14.

The tabulated value of F at 5% level of significance for (11, 9) d.f. is 3.10.

Since Fcal < Ftab, it is not significant and the null hypothesis is accepted.

Result: Since Fcal < Ftab, the null hypothesis is accepted and we conclude that the two populations have the same variance.
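A Python sketch of the variance-ratio computation above (illustrative, not part of the manual):

```python
# Sketch: variance-ratio F test for the two samples above.
sample1 = [20, 16, 26, 27, 23, 22, 18, 24, 25, 19]
sample2 = [17, 23, 32, 25, 22, 24, 28, 18, 31, 33, 20, 27]

def sample_var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

v1, v2 = sample_var(sample1), sample_var(sample2)
F = max(v1, v2) / min(v1, v2)    # larger estimate over smaller
print(round(v1, 2), round(v2, 2), round(F, 2))   # 13.33 28.55 2.14
```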


******************************************************************************

5. CHI-SQUARE TEST (χ² TEST):

χ² test for goodness of fit:

Chi-square is a measure to evaluate the difference between observed and expected frequencies and to examine whether the difference so obtained is due to a chance factor or due to sampling error.

To test the goodness of fit, the chi-square test statistic is given by

χ² = Σ(Oi − Ei)²/Ei, with (n − 1) d.f., where Oi = observed frequency and Ei = expected frequency.

χ² test for a 2×2 contingency table: In a contingency table, if each attribute is divided into two classes, it is known as a 2×2 contingency table.

For such data, the statistical hypothesis under test is that the two attributes are independent of one another. For a 2×2 contingency table, the χ² test is given by

χ² = N(ad − bc)²/[(a + c)(b + d)(a + b)(c + d)], where N = a + b + c + d, with 1 d.f.

Alternatively, we can calculate the expected frequency of each cell and then apply the chi-square test of goodness of fit, e.g. E(a) = (a + b)(a + c)/(a + b + c + d), E(b) = (a + b)(b + d)/(a + b + c + d), and so on.

The test statistic is then χ² = Σ(Oi − Ei)²/Ei, with 1 d.f.

If χ²cal > χ²tab at 1 d.f., we reject the null hypothesis.

*******************************************************************************

Yates' correction for continuity: F. Yates suggested a correction for continuity in the χ² value calculated for a 2 × 2 table, particularly when cell frequencies are small (no cell frequency should be less than 5, though 10 is better, as stated earlier) and χ² is just on the significance level. The correction suggested by Yates is popularly known as Yates' correction. It involves reducing the deviation of observed from expected frequencies, which of course reduces the value of χ². The rule for the correction is to adjust the observed frequency in each cell of a 2 × 2 table so as to reduce the deviation of the observed from the expected frequency for that cell by 0.5, this adjustment being made in all the cells without disturbing the marginal totals. With the 2 × 2 table laid out as

a       b       (a+b)
c       d       (c+d)
(a+c)  (b+d)   N = a+b+c+d

the value of χ² after applying Yates' correction is χ² = N(|ad − bc| − N/2)²/[(a + b)(c + d)(a + c)(b + d)].


It may again be emphasised that Yates' correction is made only in the case of a 2 × 2 table, and that too only when cell frequencies are small.

*******************************************************************************

Objective: Testing whether the frequencies are equally distributed in a given dataset.

Kinds of data: 200 digits were chosen at random from a set of tables. The frequencies of the digits

were as follows.

Digits 0 1 2 3 4 5 6 7 8 9

Frequency 22 21 16 20 23 15 18 21 19 25

Solution: We set up the null hypothesis H0: the digits are equally distributed in the given dataset.

Under the null hypothesis, the expected frequency of each digit would be (sum of frequencies)/(number of digits) = 200/10 = 20.

Then the value of χ² = (22−20)²/20 + (21−20)²/20 + (16−20)²/20 + (20−20)²/20 + (23−20)²/20 + (15−20)²/20 + (18−20)²/20 + (21−20)²/20 + (19−20)²/20 + (25−20)²/20

= (1/20)(4 + 1 + 16 + 0 + 9 + 25 + 4 + 1 + 1 + 25) = 86/20 = 4.3

The tabulated value of χ² at 9 d.f. and 5% level of significance is 16.92. Since the calculated value of χ² is less than the tabulated value, the null hypothesis is accepted. Hence we conclude that the digits are equally distributed in the given dataset.
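A Python sketch of the goodness-of-fit computation above (illustrative, not part of the manual):

```python
# Sketch: chi-square goodness of fit for the digit frequencies above.
obs = [22, 21, 16, 20, 23, 15, 18, 21, 19, 25]
exp = sum(obs) / len(obs)        # 20 under the equal-distribution H0
chi2 = sum((o - exp) ** 2 / exp for o in obs)
print(round(chi2, 1))   # 4.3
```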

*****************************************************************************

Objective: Chi-square test for a 2×2 contingency table

Kinds of data: The table given below shows the data obtained during an epidemic of cholera:

                 Attacked   Not attacked
Inoculated          31          469
Not inoculated     185         1315

Test the effectiveness of inoculation in preventing the attack of cholera.

Solution: Here the null and alternative hypotheses are

H0: Inoculation is not effective in preventing the attack of cholera, i.e. Oi = Ei,

vs H1: Inoculation is effective in preventing the attack of cholera, i.e. Oi ≠ Ei.

Here we use the χ² test: χ² = Σ(Oi − Ei)²/Ei, where Oi = observed frequency and Ei = expected frequency.

The observed frequencies are:

                 Attacked   Not attacked   Total
Inoculated          31          469          500
Not inoculated     185         1315         1500
Total              216         1784         2000

Calculation of expected frequencies:

For attacked: E(31) = 500 × 216/2000 = 54, E(185) = 1500 × 216/2000 = 162

For not attacked: E(469) = 500 × 1784/2000 = 446, E(1315) = 1500 × 1784/2000 = 1338

The expected frequencies are:

                 Attacked   Not attacked   Total
Inoculated          54          446          500
Not inoculated     162         1338         1500
Total              216         1784         2000

Next we calculate χ² = Σ(Oi − Ei)²/Ei:

Observed (Oi)   Expected (Ei)   (Oi − Ei)   (Oi − Ei)²   (Oi − Ei)²/Ei
     31              54            −23          529           9.796
    469             446             23          529           1.186
    185             162             23          529           3.265
   1315            1338            −23          529           0.395
Total                                                        14.642

Here we get χ²cal = 14.642.

Degrees of freedom = (2 − 1)(2 − 1) = 1.

The table value for 1 degree of freedom at 5% level of significance is 3.841.

Since χ²cal = 14.642 > χ²tab = 3.841, we reject the null hypothesis.

Result: Since χ²cal > χ²tab, we reject the null hypothesis; that is, inoculation is effective in preventing the attack of cholera.
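The same value follows from the shortcut formula for a 2×2 table; a Python sketch (illustrative, not part of the manual):

```python
# Sketch: shortcut chi-square N(ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d))
# for the 2x2 inoculation table above.
a, b, c, d = 31, 469, 185, 1315
N = a + b + c + d
chi2 = N * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
print(round(chi2, 2))   # 14.64, matching the cell-wise total 14.642
```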

*******************************************************************************

Objective: Chi-square test for 2X2 contingency table when cell frequency is less than 5

Kinds of data: The following information was obtained in a sample of 50 small general shops. Can

it be said that there are relatively more women owners in villages than in town?

Shops

In Towns In Villages Total

Run by men 17 18 35

Run by women 3 12 15

Total 20 30 50

Test your result at the 5% level of significance. (χ² for 1 d.f. is 3.841)

Solution: The null and alternative hypotheses are

H0: There are not relatively more women owners in villages than in towns, vs

H1: There are relatively more women owners in villages than in towns.

Here, since one cell frequency is less than 5, we apply the chi-square formula with Yates' correction, as given below:

χ² = N(|ad − bc| − N/2)²/(C1 C2 R1 R2)

where the 2×2 contingency table has cells a, b in the first row and c, d in the second row,

C1 = sum of the first column, C2 = sum of the second column, R1 = sum of the first row, R2 = sum of the second row, and N = grand total.

By putting the values in the formula we get

χ² = 50 × (|17 × 12 − 18 × 3| − 50/2)²/(20 × 30 × 35 × 15) = 2.48

The critical value of χ² for 1 d.f. and α = 0.05 is 3.841, i.e. χ²cal = 2.48 and χ²tab = 3.841.

Since χ²cal < χ²tab, we accept the null hypothesis.

Result: Since χ²cal < χ²tab, we accept the null hypothesis. It may be concluded that there are not relatively more women owners in villages than in towns.
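A Python sketch of the Yates-corrected computation above (illustrative, not part of the manual):

```python
# Sketch: Yates-corrected chi-square for the shop-owner table above.
a, b, c, d = 17, 18, 3, 12
N = a + b + c + d
chi2 = N * (abs(a * d - b * c) - N / 2) ** 2 / (
    (a + b) * (c + d) * (a + c) * (b + d))
print(round(chi2, 2))   # 2.48
```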

*******************************************************************************

Exercise:

Q1. Six boys are selected at random from a school and their marks in Mathematics are found to be

63, 63, 64, 66, 60, 68 out of 100. In the light of these marks, discuss the general observation

that the mean marks in Mathematics in the school were 66. (Ans. tcal =-1.78)

Q2. The summary of the results of a yield trial on onion with two methods of propagation is given below. Determine whether the methods differ with regard to onion yield. The onion yield is given in kg/plot.

Method I    n1 = 12   x̄1 = 25.25   Sum of squares = 186.25
Method II   n2 = 12   x̄2 = 28.83   Sum of squares = 737.67

(Ans. tcal = −1.35)

Q3. A certain stimulus administered to each of 12 patients resulted in the following changes in blood pressure:

di 5 2 8 -1 3 0 -2 1 5 0 4

Can it be concluded that the stimulus will in general be accompanied by an increase in blood

pressure? (Ans. Paired t test, tcal = 2.89)

Q4. The following table gives the number of units produced per day by two workers A and B for a

number of days:

A: 40 30 38 41 38 35

B: 39 38 41 23 32 39 40 34

Should these results be accepted as evidence that B is the more stable worker?

(Ans. s1² = 16, s2² = 31.44)

Q5. A certain type of surgical operation can be performed either with a local anesthetic or with a

general anesthetic. Results are given below

Alive Dead

Local 511 24

General 147 18

Use the χ² test for testing the difference in the mortality rates associated with the different types of anesthetic. (Ans. χ²cal = 9.22)

Q6. Twenty two animals suffered from the same disease with the same severity. A serum was

administered to 10 of the animals and the remaining were uninoculated to serve as control. The

results were as follows:

Recovered Died Total

Inoculated 7 3 10

Uninoculated 3 9 12

Total 10 12 22

Apply the χ² test to test the association between inoculation and control of the disease. Interpret the result. (Ans. χ²cal = 2.82)


7. Analysis of Variance (One way and Two way classification)

P.Mishra

Assistant professor (Statistics) , college of agriculture , JNKVV, Powarkheda, (M.P.) 461110,India

Email id : [email protected]

Analysis of Variance (ANOVA) :The ANOVA is a powerful statistical tool for tests of

significance. The test of significance based on t-distribution is an adequate procedure only for

testing the significance of the difference between two sample means. In a situation when we have

two or more samples to consider at a time, an alternative procedure is needed for testing the

hypothesis that all the samples have been drawn from the same population. For example, if three

fertilizers are to be compared to find their efficacy, this could be done by a field experiment, in

which each fertilizer is applied to 10 plots and then the 30 plots are later harvested with the crop

yield being calculated for each plot. Now we have 3 groups of ten figures and we wish to know if

there are any differences between these groups. The answer to this problem is provided by the

technique of ANOVA.

Assumptions of ANOVA

The ANOVA test is carried out based on these below assumptions,

• The observations are normally distributed

• The observations are independent from each other

• The variances of the populations are equal

Treatments: The objects of comparison in an experiment are defined as treatments. For example:

(1) Suppose an agronomist wishes to know the effect of different spacings on the yield of a crop; each spacing will be called a treatment.

(2) A teacher practices different teaching methods on different groups in his class to see which yields the best results.

(3) A doctor treats a patient with a skin condition with different creams to see which is most effective.

Experimental unit: The experimental unit is the object to which a treatment is applied to record the observations. If treatments are different varieties, the objects to which the treatments are applied will be different plots of land; the plots will be called experimental units.

Blocks: In agricultural experiments, most of the time we divide the whole experimental field into relatively homogeneous sub-groups or strata. These strata, which are more uniform amongst themselves than the field as a whole, are known as blocks.

Degrees of freedom: It is defined as the difference between the total number of items and the total

number of constraints. If “n” is the total number of items and “k” the total number of constraints

then the degrees of freedom (d.f.) is given by d.f. = n-k. In other words the number of degrees of

freedom generally refers to the number of independent observations in a sample minus the number

of population parameters that must be estimated from sample data.

Level of significance (LOS): The maximum probability at which we would be willing to risk a Type-I error is known as the level of significance; that is, the size of the Type-I error is the level of significance. The levels of significance usually employed in testing of hypotheses are 5% and 1%. The level of significance is always fixed in advance, before collecting the sample information. An LOS of 5% means that the result obtained will be true in 95 out of 100 cases and may be wrong in 5 out of 100 cases.

Experimental error:

The variations in response among the different experimental units may be partitioned into two components:

i) the systematic / assignable part, and

ii) the non-systematic / non-assignable part.

Variations in experimental units due to different treatments, blocking etc., which are known to the experimenter, constitute the assignable part. On the other hand, the part of the variation which cannot be assigned to specific reasons or causes is termed the experimental error. It is often found that experimental units receiving the same treatments and experimental conditions nevertheless give differential responses. This type of variation in response may be due to inherent differences among the experimental units, error associated with measurement, etc.; these factors are known as extraneous factors, and the variation in responses due to them is termed the experimental error.

The purpose of designing an experiment is to increase the precision of the experiment. For reducing the experimental error, we adopt some techniques. These techniques form the three basic principles of experimental design.

1. Replication: The repetition of treatments under investigation is known as replication. Replication is used (i) to secure a more accurate estimate of the experimental error, a term which represents the differences that would be observed if the same treatments were applied several times to the same experimental units; and (ii) to reduce the experimental error and thereby to increase precision, which is a measure of the variability of the experimental error.

2. Randomization: The random allocation of treatments to different experimental units is known as randomization.

3. Local control: It has been observed that all extraneous sources of variation are not removed by randomization and replication. This necessitates a refinement in the experimental technique. For this purpose, we make use of local control, a term referring to the grouping of homogeneous experimental units. The main purpose of the principle of local control is to increase the efficiency of an experimental design by decreasing the experimental error.

One-Way ANOVA

One-way ANOVA is an inferential statistical model to analyze three or more variances at a time to test the equality between them. It is a test of hypothesis for several sample means, investigating a single factor at k levels corresponding to k populations. The hypothesis is tested by comparing the F-statistic estimated from the samples with the critical value of F at a stated level of significance (such as 1% or 5%) from the F-distribution table. Only one factor can be analyzed, at multiple levels, by this method. The technique allows each group of samples to have a different number of observations. The design of the experiment should satisfy replication and randomization.

ANOVA Table for One-Way Classification

The ANOVA table for one-way classification shows the formulas and input parameters used in the analysis of variance for one factor, which involves two or more treatment means together, to check whether the null hypothesis is accepted or rejected at a stated level of significance.


Sources of Variation    d.f.     SS     MSS                F-ratio
Between treatments      k − 1    SST    SST/(k−1) = MST    MST/MSE = F
Error                   N − k    SSE    SSE/(N−k) = MSE
Total                   N − 1

Notable Points for One-Way ANOVA Test

The below are the important notes of one-way ANOVA for test of hypothesis for a single factor

involves three or more treatment means together.

• The null hypothesis H0 : μ1 = μ2 = . . . = μk Alternative hypothesis H1 : μ1 ≠ μ2 ≠ . . . ≠ μk

• State the level of significance α (e.g. 1%, 5%, 10%).
• The sum of all N elements in all the samples is known as the Grand Total, denoted by G.
• The correction factor CF = G²/N.
• The Total Sum of Squares of all individual elements, abbreviated TSS, is TSS = ΣΣ xij² - CF.
• The Sum of Squares of all the class totals, abbreviated SST, is SST = Σ Ti²/ni - CF.
• The Sum of Squares due to Error, abbreviated SSE, is SSE = TSS - SST.
• Degrees of freedom: TSS has N - 1, SST has k - 1 and SSE has N - k.
• The Mean Sum of Squares for Treatments, MST = SST/(k - 1).
• The Mean Sum of Squares due to Error, MSE = SSE/(N - k).
• The variance ratio F = MST/MSE.
• The critical value of F is read from the F-distribution table for (k - 1, N - k) degrees of freedom at the stated level of significance.
• If the calculated F is less than the table value, the difference between the treatments is not significant; the null hypothesis H0 is accepted.
• If the calculated F is greater than the table value, the difference between the treatments is significant; the null hypothesis H0 is rejected.
*******************************************************************************
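The quantities listed above can be computed directly. The following Python sketch (illustrative, not part of the manual; the function name `one_way_anova` is our own) implements the one-way ANOVA formulas for groups with possibly unequal numbers of observations:

```python
# Sketch: one-way ANOVA quantities for k treatment groups with possibly
# unequal numbers of observations, following the formulas above.

def one_way_anova(groups):
    """groups: list of lists of observations, one list per treatment."""
    k = len(groups)
    N = sum(len(g) for g in groups)
    G = sum(sum(g) for g in groups)            # grand total
    CF = G ** 2 / N                            # correction factor
    TSS = sum(x ** 2 for g in groups for x in g) - CF
    SST = sum(sum(g) ** 2 / len(g) for g in groups) - CF
    SSE = TSS - SST
    MST = SST / (k - 1)
    MSE = SSE / (N - k)
    return {"TSS": TSS, "SST": SST, "SSE": SSE,
            "MST": MST, "MSE": MSE, "F": MST / MSE}
```

The calculated F is then compared with the tabular F for (k - 1, N - k) degrees of freedom.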


COMPLETELY RANDOMIZED DESIGN (CRD)

Completely randomized design (CRD) is the simplest of all designs where only two principles of

design of experiments i.e. replication and randomization have been used. The principle of local

control is not used in this design. The basic characteristic of this design is that the whole

experimental area (i) should be homogeneous in nature and (ii) should be divided into as many experimental units as the sum of the numbers of replications of all the treatments. Suppose there are five treatments A, B, C, D, E replicated 5, 4, 3, 3 and 5 times respectively; then according to this design we require the whole experimental area to be divided into 20 experimental

units of equal size. Thus, completely randomized design is applicable only when the experimental

area is homogeneous in nature. Under laboratory condition, where other conditions including the

environmental condition are controlled, completely randomized design is the most accepted and

widely used design. Let there be t treatments replicated r1, r2, …, rt times respectively. So in total we require an experimental area of n = r1 + r2 + … + rt homogeneous experimental units of equal size.

Randomization and Layout :

To facilitate easy understanding we shall demonstrate the layout and randomization procedure in a

field experiment conducted in CRD with 5 treatments A, B, C, D, E being replicated 5, 4, 3, 2, 6

times respectively. The steps are given as follows :

(i) Total number of experimental unit required is 5+4+3+2+6 = 20 . Divide the whole

experimental area into 20 experimental units of equal size. For laboratory experiments

the experimental units may be test tubes, petri dishes, beakers, pots etc. depending upon

the nature of the experiment.

(ii) Number the experimental units 1 to 20.

Figure – 1: Experimental area

Figure – 2: Experimental area divided and numbered into 20 experimental units

(iii) Assign the five treatments in to 20 experimental units randomly in such a way that the

treatments A, B, C, D, E are allotted 5, 4, 3, 2, 6 times respectively. For this we require

a random number table and follow the steps given below:

A) Method 1:

Start at any page, any point of row-column intersection of random number table. Let the

starting point be the intersection of 5th row – 4th column and read vertically downward to get

20 distinct random numbers of two digits. Since 80 is the highest two-digit multiple of 20, we reject the numbers 81 to 99 and 00. If a random number is more than 20

then it should be divided by 20 and the remainder will be taken. The process will continue till we have 20 distinct random numbers; if the remainder is zero then we shall take it as the last number, i.e. 20.

Numbered experimental units (as in Figure 2):

1  2  3  4  5
6  7  8  9  10
11 12 13 14 15
16 17 18 19 20

a) In the process the random numbers (after taking remainders) are 08, 12, 01, 18, 14, 18, 02, 12, 12, 20, 12, 10, 14, 00, 15, 07, 05, 16, 07, 18, 19, 03, 10, 08, 16, 09, 13, 14, 17, 18, 06, 17, 19, 08, 15 and 11.

b) Repeated random numbers appear in the above list, so we discard any random number which has appeared previously. Thus the selected random numbers are 08, 12, 01, 18, 14, 02, 20, 10, 15, 07, 05, 16, 19, 03, 09, 13, 04, 17, 06, 11. These random numbers correspond to the 20 experimental units.

c) The first 5 experimental units, corresponding to the first 5 selected random numbers, are allotted the first treatment A; the next 4 experimental units, corresponding to the next four random numbers, are allotted treatment B; and so on.

d) The whole process (a) to (c) is demonstrated in the following table:

Random number from table   Remainder   Selected random number   Treatment allotted

08   08   08             A
32   12   12             A
01   01   01             A
58   18   18             A
14   14   14             A
18   18   Not selected   -
02   02   02             B
12   12   Not selected   -
52   12   Not selected   -
20   20   20             B
12   12   Not selected   -
10   10   10             B
14   14   Not selected   -
00   00   Not selected   -
55   15   15             B
07   07   07             C
05   05   05             C
16   16   16             C
27   07   Not selected   -
18   18   Not selected   -
79   19   19             D
03   03   03             D
10   10   Not selected   -
08   08   Not selected   -
56   16   Not selected   -
29   09   09             E
13   13   13             E
14   14   14             E
17   17   17             E
18   18   Not selected   -
46   06   06             E
37   17   Not selected   -
59   19   Not selected   -
08   08   Not selected   -
15   15   Not selected   -
11   11   11             E


Figure – 3: Layout along with allocation of treatments (Method 1):

1 A   2 B   3 D   4 E   5 C
6 E   7 C   8 A   9 E   10 B
11 E  12 A  13 E  14 A  15 B
16 C  17 E  18 A  19 D  20 B

B) Method 2:

Step 1: In the first method we take 2-digit random numbers, and in the process have to reject many random numbers because of repetition. To avoid this, instead of 2-digit random numbers one may take 3-digit random numbers, starting from any page and any row-column intersection of the random number table. Let us use the same random number table and start at the intersection of the 5th row and 2nd column, i.e. 208. We take 20 distinct random numbers of 3 digits and the numbers are 208,

412, 480, 318, 094, 158, 082, 232, 252, 020, 392, 950, 394, 800, 435, 187, 851, 164, 273, 384.

Interestingly, we did not have to discard any number because of repetition; with 3-digit numbers the chance of ties is much smaller.

Step 2: Rank the random numbers, the smallest number getting the lowest rank 1. The random numbers along with their respective ranks are:

R No 208 412 480 318 94 158 82 232 252 20 392 950 394 800 435 187 851 164 273 384

Rank 7 15 17 11 3 4 2 8 9 1 13 20 14 18 16 6 19 5 10 12

These ranks correspond to the 20 numbered experimental units

Step 3: Allot the first treatment A to the first five ranks appearing in order, i.e. allot treatment A to the 7th, 15th, 17th, 11th and 3rd experimental units. Allot treatment B to the next four, i.e. the 4th, 2nd, 8th and 9th experimental units, and so on.

R No 208 412 480 318 94 158 82 232 252 20 392 950 394 800 435 187 851 164 273 384

Rank 7 15 17 11 3 4 2 8 9 1 13 20 14 18 16 6 19 5 10 12

Treat. A A A A A B B B B C C C D D E E E E E E

Layout :

1 C 2 B 3 A 4 B 5 E

6 E 7 A 8 B 9 B 10 E

11 A 12 E 13 C 14 D 15 A

16 E 17 A 18 D 19 E 20 C

Figure – 4: Layout along with allocation of treatments.
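The ranking step of Method 2 is easy to mechanize. The sketch below (an illustration, not from the manual; variable names are our own) ranks the 20 three-digit random numbers drawn above and allots the treatments to the experimental units in rank order:

```python
# Illustrative sketch of Method 2: rank 3-digit random numbers and allot
# treatments (with their replication counts) to units in rank order.
random_numbers = [208, 412, 480, 318, 94, 158, 82, 232, 252, 20,
                  392, 950, 394, 800, 435, 187, 851, 164, 273, 384]
treatments = [("A", 5), ("B", 4), ("C", 3), ("D", 2), ("E", 6)]

# Rank 1 = smallest number (the numbers are distinct, so ranks are unique).
order = sorted(range(len(random_numbers)), key=lambda i: random_numbers[i])
rank = [0] * len(random_numbers)
for r, i in enumerate(order, start=1):
    rank[i] = r

# Walk through the ranks in the order the numbers were drawn and allot
# each treatment to as many units as it has replications.
allotment = {}                       # experimental unit -> treatment
ranks_in_draw_order = iter(rank)
for label, reps in treatments:
    for _ in range(reps):
        allotment[next(ranks_in_draw_order)] = label

print(allotment[3], allotment[4])    # prints: A B
```

Running this reproduces the ranks 7, 15, 17, 11, 3, … shown above and the layout of Figure 4.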

C) Method 3 :

The above two methods are applicable only when a random number table is available. While conducting experiments at a farmer's field, a random number table may not be at hand. To overcome this difficulty, we may opt for the 'drawing of lots' technique for randomization. The procedure is as follows:



a) According to this problem we are to allocate five treatments to twenty experimental units. We take a sheet of paper and cut it into 20 small pieces of equal size and shape.

b) The twenty pieces of paper are then labeled according to the treatments and their numbers of replications, such that five papers are marked 'A', four 'B', three 'C', two 'D' and six 'E'.

c) Fold the papers uniformly and place them in a container (bucket, basket, jar etc.).

d) Draw one piece of paper at a time, without replacement, stirring the container after every draw.

e) Note the sequence of the appearance of the treatments.

f) Allot the treatments to the experimental units based on the treatment letter drawn and the sequence of draws; the draw sequence corresponds to the experimental units one to twenty. Let the appearance of the treatments in this case be as follows:

Sequence 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

Treatment D B A A C D B E B C A E E C B E A E E A

Thus treatment A is allotted to experimental units 3, 4, 11, 17 and 20, treatment B to units 2, 7, 9 and 15, and so on. The final layout is as follows:

1 D 2 B 3 A 4 A 5 C

6 D 7 B 8 E 9 B 10 C

11 A 12 E 13 E 14 C 15 B

16 E 17 A 18 E 19 E 20 A

Figure – 5: Layout along with allocation of treatments.
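The drawing-of-lots procedure amounts to shuffling one label per replication and dealing the labels out to the units in order. A minimal Python sketch, assuming the same treatments and replication counts as above (the seed is arbitrary, so the layout it prints is just one possible randomization):

```python
import random

# Drawing of lots: one label per replication, shuffled, then dealt out
# to experimental units 1..20 in order.
labels = ["A"] * 5 + ["B"] * 4 + ["C"] * 3 + ["D"] * 2 + ["E"] * 6

random.seed(1)            # fix the seed so the layout is reproducible
random.shuffle(labels)    # the "continuous stirring" of the container

layout = {unit: label for unit, label in enumerate(labels, start=1)}
for unit in range(1, 21):
    print(unit, layout[unit])
```

Each run with a different seed gives a fresh, equally valid random layout.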

Analysis :

Statistical Model: Let there be t treatments with r1, r2, r3, …, rt replications respectively in a completely randomized design. The model for the experiment will be:

yij = μ + τi + eij ;  i = 1, 2, 3, …, t;  j = 1, 2, …, ri

where yij = response corresponding to the j-th observation of the i-th treatment
μ = general effect
τi = additional effect due to the i-th treatment, with Σ ri τi = 0
eij = error associated with the j-th observation of the i-th treatment, i.i.d. N(0, σ²).

Assumption of the model:

The above model is based on the assumptions that the effects are additive in nature and the error components are identically and independently distributed normal variates with mean zero and constant variance.

Hypothesis to be tested:

H0 : τ1 = τ2 = τ3 = … = τi = … = τt = 0 against the alternative hypothesis
H1 : the τi's are not all equal.

Let the level of significance be α. Let the observations of the total n = r1 + r2 + … + rt experimental units be as follows:


Treatment   Observations                 Total   Mean
1           y11, y12, …, y1r1            y1o     ȳ1o
2           y21, y22, …, y2r2            y2o     ȳ2o
:           :                            :       :
i           yi1, yi2, …, yiri            yio     ȳio
:           :                            :       :
t           yt1, yt2, …, ytrt            yto     ȳto

The analysis for this type of data is the same as that of one-way classified data discussed in chapter 1, section (1.2). From the above table we calculate the following quantities:

Grand total G = Σi Σj yij = y11 + y21 + y31 + … + ytrt

Correction factor CF = G²/n

Total Sum of Squares (TSS) = Σi Σj yij² − CF = y11² + y21² + y31² + … + ytrt² − CF

Treatment Sum of Squares (TrSS) = Σi (yio²/ri) − CF
= y1o²/r1 + y2o²/r2 + y3o²/r3 + … + yio²/ri + … + yto²/rt − CF,
where yio = Σj yij is the sum of the observations for the i-th treatment.

Error Sum of Squares (by subtraction): ErSS = TSS − TrSS.

ANOVA table for Completely Randomized Design:

SOV         d.f.   SS     MS                  F-ratio     Tabulated F (0.05)   Tabulated F (0.01)
Treatment   t-1    TrSS   TrMS = TrSS/(t-1)   TrMS/ErMS
Error       n-t    ErSS   ErMS = ErSS/(n-t)
Total       n-1    TSS


The null hypothesis is rejected at the α level of significance if the calculated value of the F ratio corresponding to treatments is greater than the table value at the same level of significance with (t-1, n-t) degrees of freedom; that is, we reject H0 if Fcal > Ftab;(t-1),(n-t). Otherwise one cannot reject the null hypothesis. When the test is non-significant we conclude that there exist no significant differences among the treatments with respect to the particular character under consideration; all treatments are statistically at par.

When the test is significant i.e. when the null hypothesis is rejected then one should find out which

pair of treatments is significantly different and which treatment is the best or the worst with respect to the particular character under consideration.

One way to answer these queries is to use the t-test to compare all possible pairs of treatment means. This procedure is simplified with the help of the least significant difference (critical difference) value, given by the formula:

LSD (CD) = t α/2,(n−t) × √[ErMS (1/ri + 1/ri′)]

where i and i′ refer to the treatments involved in the comparison, t α/2,(n−t) is the table value of the t distribution at the α level of significance with (n−t) d.f., and √[ErMS (1/ri + 1/ri′)] is the standard error of the difference (SEd) between the means of treatments i and i′. Thus, if the absolute value of the difference between two treatment means exceeds the corresponding CD value, the two treatments are significantly different, and the better treatment is adjudged from the mean values in keeping with the nature of the character under study.

Advantages and disadvantages of CRD:

A) Advantages:

i) It is the simplest of all experimental designs.
ii) It is flexible: different numbers of replications can be adopted for different treatments. This is the only design in which this is possible. In practice this is very useful, because the experimenter sometimes faces varied availability of experimental material; and even if the response from a particular experimental unit is lost, the remaining data can still be analyzed when CRD was adopted.

B) Disadvantages:
i) The basic assumption of homogeneity of experimental units is rarely satisfied, particularly under field conditions. That is why this design is suitable mostly under laboratory or greenhouse conditions.
ii) The principle of "local control", which is very efficient in reducing the experimental error, is not used in this design.
With an increase in the number of treatments, especially under field conditions, it becomes very difficult to use this design, because of the difficulty of obtaining a larger number of homogeneous experimental units.

******************************************************************************


Objective: CRD analysis with unequal replication

Kinds of data: Mycelial growth in terms of diameter of the colony (mm) of R. solani isolates on PDA medium after 14 hours of incubation.

R. solani isolate   Repl. 1   Repl. 2   Repl. 3   Treatment total (Ti)   Treatment mean
RS 1                29.0      28.0      29.0      86.0                   28.67
RS 2                33.5      31.5      29.0      94.0                   31.33
RS 3                26.5      30.0      -         56.5                   28.25
RS 4                48.5      46.5      49.0      144.0                  48.00
RS 5                34.5      31.0      -         65.5                   32.75

Replication totals: 172.0, 167.0, 107.0;  Grand total = 446.0;  Grand mean = 34.31

Solution: Here we test whether the treatments differ significantly or not.

Grand total = 446

Correction factor = 446²/13 = 15301.23

Total sum of squares = (29² + 28² + ⋯ + 34.5² + 31²) − CF = 789.27

Treatment sum of squares = (86)²/3 + (94)²/3 + (56.5)²/2 + (144)²/3 + (65.5)²/2 − CF = 16063.92 − CF = 762.69

Error sum of squares = Total sum of squares − treatment sum of squares = 789.27 − 762.69 = 26.58

Source of variation   Degrees of freedom   Sum of squares   Mean square   Computed F   Tabular F 5%
Treatment             4                    762.69           190.67        57.38*       3.84
Error                 8                    26.58            3.32
Total                 12                   789.27

Since Fcal is greater than Ftab, the treatments differ significantly. Next we calculate the CD (LSD) as per the formula described above. For example, to compare treatment 1 and treatment 2 we calculate:

standard error = √[3.32 × (1/3 + 1/3)] = 1.49, and the t value at 5% with 8 degrees of freedom = 2.30.

Now CD (LSD) = 1.49 × 2.30 = 3.44


The difference between the treatment means of 1 and 2 is 2.66. Hence treatments 1 and 2 do not differ significantly. The comparisons between all the treatments, along with their significance, are given in the table below (CD values in parentheses).

Treatment   RS 1   RS 2          RS 3          RS 4            RS 5
RS 1        0.00   2.66 (3.44)   0.42 (3.84)   19.33* (3.44)   4.08* (3.84)
RS 2               0.00          3.08 (3.84)   16.67* (3.44)   1.42 (3.84)
RS 3                             0.00          19.75* (3.84)   4.50* (4.21)
RS 4                                           0.00            15.25* (3.84)
RS 5                                                           0.00
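The whole worked example can be verified in a few lines. The following sketch (ours, not part of the manual) recomputes the sums of squares, the F ratio, and the CD for comparing RS 1 with RS 2, using t(5%, 8 d.f.) = 2.306:

```python
import math

# Re-computing the unequal-replication CRD example above (illustrative sketch).
groups = {
    "RS1": [29.0, 28.0, 29.0],
    "RS2": [33.5, 31.5, 29.0],
    "RS3": [26.5, 30.0],
    "RS4": [48.5, 46.5, 49.0],
    "RS5": [34.5, 31.0],
}

N = sum(len(g) for g in groups.values())                 # 13 observations
G = sum(sum(g) for g in groups.values())                 # grand total 446.0
CF = G ** 2 / N                                          # correction factor
TSS = sum(x ** 2 for g in groups.values() for x in g) - CF
TrSS = sum(sum(g) ** 2 / len(g) for g in groups.values()) - CF
ErSS = TSS - TrSS
MST = TrSS / (len(groups) - 1)
MSE = ErSS / (N - len(groups))
F = MST / MSE

# CD for comparing two treatments with 3 and 3 replications (RS1 vs RS2):
SEd = math.sqrt(MSE * (1 / 3 + 1 / 3))
CD = 2.306 * SEd                         # t(5%, 8 d.f.) = 2.306

print(round(TSS, 2), round(TrSS, 2), round(F, 2), round(CD, 2))
# prints: 789.27 762.69 57.38 3.43
```

The slight difference in CD (3.43 vs 3.44 in the text) comes from rounding t to 2.30 in the hand calculation.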

*******************************************************************************

Objective: Analysis of CRD with equal replication

Kinds of data: Grain yield of rice resulting from the use of different foliar and granular insecticides for the control of brown planthoppers and stem borers, from a CRD experiment with 4 replications (r) and 7 treatments (t).

Treatment        R1     R2     R3     R4     Treatment total (T)   Treatment mean
Dol-mix (1 kg)   2537   2069   2104   1797   8507                  2127
Ferterra         3366   2591   2211   2544   10712                 2678
DDT + γ-BHC      2536   2459   2827   2385   10207                 2552
Standard         2387   2453   1556   2116   8512                  2128
Dimecron-Boom    1997   1679   1649   1859   7184                  1796
Dimecron-Knap    1796   1704   1904   1320   6724                  1681
Control          1401   1516   1270   1077   5264                  1316

Grand total (G) = 57110;  Grand mean = 2040

Solution: Here we test whether the treatments differ significantly or not.

Grand total G = 57110

Correction factor = (57110)²/28 = 116484004

Total sum of squares = (2537² + 2069² + … + 1077²) − CF = 124061416 − CF = 7577412.4

Treatment sum of squares = (8507² + 10712² + … + 5264²)/4 − CF = 122071179 − CF = 5587174.9

Error sum of squares = 7577412.4 − 5587174.9 = 1990237.5

ANOVA (CRD with equal replication) of rice yield data

Source of Variation   DF   SS        Mean Square   Fcal   Tabular F 5%   Tabular F 1%
Treatment             6    5587174   931196        9.83   2.57           3.81
Error                 21   1990238   94773
Total                 27   7577412

Hence we find that the treatments differ significantly. Next we calculate the critical difference.

The standard error of the difference between treatment means = √(2 × 94773/4) = 217.68.
The t value at the 5% level of significance and 21 error d.f. = 2.08, so the CD (LSD) at the 5% level = 217.68 × 2.08 = 452.77 kg/ha.
The t value at the 1% level of significance and 21 error d.f. = 2.831, so the CD (LSD) at the 1% level = 217.68 × 2.831 = 616.25 kg/ha.

Comparisons between the mean yield of the control and each of the six insecticide treatments using the LSD test are given in the table below.

Treatment        Mean yield (kg/ha)   Difference from control
Dol-mix (1 kg)   2127                 811**
Ferterra         2678                 1362**
DDT + γ-BHC      2552                 1236**
Standard         2128                 812**
Dimecron-Boom    1796                 480*
Dimecron-Knap    1681                 365 ns
Control          1316                 -

* indicates significant difference at 5%, ** indicates significant difference at 1% and ns indicates non-significant difference.
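As a check on the arithmetic, the rice-yield analysis can be recomputed as follows (an illustrative sketch, not part of the manual):

```python
# Re-computation of the equal-replication CRD (rice yield) example.
yields = {
    "Dol-mix":       [2537, 2069, 2104, 1797],
    "Ferterra":      [3366, 2591, 2211, 2544],
    "DDT + BHC":     [2536, 2459, 2827, 2385],
    "Standard":      [2387, 2453, 1556, 2116],
    "Dimecron-Boom": [1997, 1679, 1649, 1859],
    "Dimecron-Knap": [1796, 1704, 1904, 1320],
    "Control":       [1401, 1516, 1270, 1077],
}

r, t = 4, len(yields)                       # replications, treatments
N = r * t
G = sum(sum(v) for v in yields.values())    # grand total, 57110
CF = G ** 2 / N
TSS = sum(x ** 2 for v in yields.values() for x in v) - CF
TrSS = sum(sum(v) ** 2 / r for v in yields.values()) - CF
ErSS = TSS - TrSS
F = (TrSS / (t - 1)) / (ErSS / (N - t))

print(round(TSS, 1), round(TrSS, 1), round(F, 2))
# prints: 7577412.4 5587174.9 9.83
```

The computed F of 9.83 exceeds both tabular values (2.57 at 5%, 3.81 at 1%), confirming the conclusion above.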

Two-Way ANOVA

Two-way ANOVA is an inferential statistical model used to analyze three or more variances at a time, to test their equality and the inter-relationship between them. It is a test of hypothesis for several sample means that analyzes the inter-relationship between the factors and the influencing variables at k levels corresponding to k populations. The ANOVA table for the test of hypothesis (H0) for treatment means and subject (class) means is generated at a stated level of significance with the help of the F-distribution. In this analysis of variance, the observations drawn from the populations should be of the same length. The design of the experiment should satisfy the principles of replication, randomization and local control.


ANOVA Table for Two-Way Classification

The ANOVA table for two-way classification shows the formulas and input parameters used in the analysis of variance for more than one factor, involving two or more treatment means together, for testing the null hypothesis at a stated level of significance.

Sources of variation   Df               SS    MSS                          F-ratio
Between treatments     k - 1            SSR   MST = SSR/(k - 1)            FR = MST/MSE
Between blocks         h - 1            SSC   MSV = SSC/(h - 1)            FC = MSV/MSE
Error                  (h - 1)(k - 1)   SSE   MSE = SSE/[(k - 1)(h - 1)]
Total                  N - 1

Notable Points for Two-Way ANOVA Test

The following are the important notes on the two-way ANOVA test of hypothesis for two factors involving three or more treatment or subject means together.

• Null hypotheses H0 : μ1. = μ2. = . . . = μk. and H0 : μ.1 = μ.2 = . . . = μ.h (no significant difference between the means). Alternative hypotheses H1 : the row means μi. are not all equal, and H1 : the column means μ.j are not all equal (significant difference among the means).
• State the level of significance α (e.g. 1%, 5%, 10%).
• The sum of all N elements in all the samples is the Grand Total, denoted by G.
• The correction factor CF = G²/N = G²/(kh).
• The Total Sum of Squares of all individual elements, TSS = ΣΣ xij² - CF.
• The sum of squares of all the treatment (row) totals in the two-way table (h x k) is SST = SSR = Σ Ti.²/h - CF.
• The sum of squares between classes (columns) is SSV = SSC = Σ T.j²/k - CF, where k is the number of observations in each column.
• The sum of squares due to error is SSE = TSS - SSR - SSC.
• Degrees of freedom: TSS has N - 1 = hk - 1, SST has k - 1, SSV has h - 1 and SSE has (k - 1)(h - 1).
• The Mean Sum of Squares for Treatments, MST = SST/(k - 1).
• The Mean Sum of Squares for varieties, MSV = SSV/(h - 1).
• The Mean Sum of Squares due to Error, MSE = SSE/[(h - 1)(k - 1)].
• The variance ratio for treatments is FR = MST/MSE; the variance ratio for subjects or classes is FC = MSV/MSE.
• The critical value of F for between treatments (rows) is read from the F-distribution table for (k - 1, (k - 1)(h - 1)) degrees of freedom at the stated level of significance.
• The critical value of F for between varieties (columns) or subjects is read from the F-distribution table for (h - 1, (k - 1)(h - 1)) degrees of freedom at the stated level of significance.
• If the calculated F for rows is less than the table value, the difference between the treatments is not significant and H0 is accepted; if it is greater, the difference is significant and H0 is rejected.
• If the calculated F for columns is less than the table value, the difference between the subjects or varieties is not significant and H0 is accepted; if it is greater, the difference is significant and H0 is rejected.
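The two-way sums of squares above can be sketched in code. The data matrix below is made up purely for illustration; rows play the role of treatments and columns the role of blocks, with one observation per cell:

```python
# Illustrative two-way classification sums of squares for a k x h table
# (rows = treatments, columns = blocks), one observation per cell.
data = [
    [12.0, 14.0, 11.0],
    [15.0, 17.0, 16.0],
    [20.0, 19.0, 22.0],
]
k, h = len(data), len(data[0])
N = k * h
G = sum(sum(row) for row in data)
CF = G ** 2 / N
TSS = sum(x ** 2 for row in data for x in row) - CF
SSR = sum(sum(row) ** 2 / h for row in data) - CF                        # between treatments
SSC = sum(sum(row[j] for row in data) ** 2 / k for j in range(h)) - CF   # between blocks
SSE = TSS - SSR - SSC
FR = (SSR / (k - 1)) / (SSE / ((k - 1) * (h - 1)))
FC = (SSC / (h - 1)) / (SSE / ((k - 1) * (h - 1)))
```

FR and FC are then compared with the tabular F for (k - 1, (k - 1)(h - 1)) and (h - 1, (k - 1)(h - 1)) degrees of freedom respectively.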

*******************************************************************************

RANDOMIZED BLOCK DESIGN (RBD)

When the experimental material is heterogeneous, the principle of local control is adopted and the material is grouped into homogeneous sub-groups. Each sub-group is commonly termed a block. The blocks are formed with units having common characteristics which may influence the response under study.

Advantages and disadvantages of RBD:

A) Advantages:

1. The principal advantage of RBD is that it increases the precision of the experiment, due to the reduction of experimental error by the adoption of local control.

2. The amount of information obtained in RBD is greater than in CRD; hence RBD is more efficient than CRD. Since the layout of RBD involves equal replication of treatments, the statistical analysis is simple.

B) Disadvantages:

1. When the number of treatments is increased, the block size increases.

2. If the block size is large, maintaining homogeneity within blocks is difficult; hence, with a large number of treatments this design may not be suitable.


Analysis:

Let us suppose that we have t treatments, each replicated r times. The appropriate statistical model for RBD will be:

yij = μ + τi + βj + eij ;  i = 1, 2, …, t;  j = 1, 2, …, r

where yij = response corresponding to the j-th replication/block of the i-th treatment
μ = general effect
τi = additional effect due to the i-th treatment, with Σ τi = 0
βj = additional effect due to the j-th replication/block, with Σ βj = 0
eij = error associated with the j-th replication/block of the i-th treatment, i.i.d. N(0, σ²).

The above model is based on the assumptions that the effects are additive in nature and the error components are identically and independently distributed normal variates with mean zero and constant variance.

Let the level of significance be α.

Hypotheses to be tested:

The null hypotheses to be tested are
H0 : (1) τ1 = τ2 = … = τi = … = τt = 0
     (2) β1 = β2 = … = βj = … = βr = 0
against the alternative hypotheses
H1 : (1) the τ's are not all equal
     (2) the β's are not all equal.

Let the observations of these n = rt units be as follows:

             Replications/Blocks
Treatments   1     2     …   j     …   r     Total   Mean
1            y11   y12   …   y1j   …   y1r   y1o     ȳ1o
2            y21   y22   …   y2j   …   y2r   y2o     ȳ2o
:            :     :         :         :     :       :
i            yi1   yi2   …   yij   …   yir   yio     ȳio
:            :     :         :         :     :       :
t            yt1   yt2   …   ytj   …   ytr   yto     ȳto
Total        yo1   yo2   …   yoj   …   yor   yoo
Mean         ȳo1   ȳo2   …   ȳoj   …   ȳor

The analysis of this design is the same as that of two-way classified data with one observation per

cell discussed in chapter 1 section (1.3).

From the above table we calculate the following quantities:

Grand total G = Σi,j yij = y11 + y21 + y31 + … + ytr

Correction factor CF = G²/(rt)

Total Sum of Squares (TSS) = Σi,j yij² − CF = y11² + y21² + y31² + … + ytr² − CF

Treatment Sum of Squares (TrSS) = Σi (yio²/r) − CF = y1o²/r + y2o²/r + y3o²/r + … + yio²/r + … + yto²/r − CF

Replication Sum of Squares (RSS) = Σj (yoj²/t) − CF = yo1²/t + yo2²/t + yo3²/t + … + yoj²/t + … + yor²/t − CF

Error Sum of Squares (by subtraction): ErSS = TSS − TrSS − RSS

ANOVA table for RBD

Source of Variation   d.f.         SS     MS                         F-ratio     Tabulated F (0.05)   Tabulated F (0.01)
Treatment             t-1          TrSS   TrMS = TrSS/(t-1)          TrMS/ErMS
Replication (Block)   r-1          RSS    RMS = RSS/(r-1)            RMS/ErMS
Error                 (t-1)(r-1)   ErSS   ErMS = ErSS/[(t-1)(r-1)]
Total                 rt-1         TSS

The null hypotheses are rejected at the α level of significance if the calculated values of the F ratios corresponding to treatment and replication are greater than the corresponding table values at the same level of significance with (t-1), (t-1)(r-1) and (r-1), (t-1)(r-1) degrees of freedom respectively. That is, we reject H0 if Fcal > Ftab; otherwise one cannot reject the null hypothesis. When the test is non-significant we conclude that there exist no significant differences among the treatments/replications with respect to the particular character under consideration; all treatments/replications are statistically at par.

When a test is significant, we reject the corresponding null hypothesis and try to find out which replications or treatments are significantly different from each other. As in CRD, in RBD we use the least significant difference (critical difference) value for comparing a pair of means. The CD is calculated as follows:

LSD (CD) = t α/2;(t−1)(r−1) × √(2 ErMS/r)

where r is the number of replications and t α/2;(t−1)(r−1) is the table value of t at the α level of significance and (t−1)(r−1) degrees of freedom.

*******************************************************************************

Objective: Analysis of Randomized Block Design

Kinds of data: An experiment was conducted in RBD to study the comparative performance of fodder sorghum under rainfed conditions. The rearranged data are given in the table below: green matter yield of sorghum (kg/plot).

Variety        I       II      III     IV      Total   Mean
African Tall   22.9    25.9    39.1    33.9    121.8   30.45
Co-11          29.5    30.4    35.3    29.6    124.8   31.20
FS-1           28.8    24.4    32.1    28.6    113.9   28.48
K-7            47.0    40.9    42.8    32.1    162.8   40.70
Co-24          28.9    20.4    21.1    31.8    102.2   25.55
Total          157.1   142.0   170.4   156.0   625.5

Solution: Here we test whether the varieties differ significantly or not.

Correction factor = 625.5²/20 = 19562.51

Total sum of squares = (22.9² + 25.9² + ⋯ + 31.8²) − CF = 20514.95 − CF = 952.44

Block sum of squares = (157.1² + 142.0² + 170.4² + 156.0²)/5 − CF = 19643.31 − CF = 80.80

Variety sum of squares = (121.8² + 124.8² + 113.9² + 162.8² + 102.2²)/4 − CF = 20083.04 − CF = 520.53

Error sum of squares = 952.44 − 80.80 − 520.53 = 351.11

By putting the values in the ANOVA table we get:

Source of variation   DF   SS       MSS      F cal    F tab
Replication           3    80.80    26.9     < 1      3.490
Variety               4    520.53   130.13   4.45*    3.259
Error                 12   351.11   29.26
Total                 19   952.44

Here we find that the varieties differ significantly.

Variety        Mean
K-7            40.70
Co-11          31.20
African Tall   30.45
FS-1           28.48
Co-24          25.55

SE(d) = √(2 EMS/r) = √(2 × 29.2588/4) = √14.6294 = 3.8248

CD = t0.05,(12) × SE(d) = (2.179)(3.8248) = 8.33

From the bar chart it can be concluded that sorghum variety K-7 produces significantly higher green matter yield than all other varieties. The remaining varieties are all at par.
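The computations for this example can be checked with a short script (illustrative only, not part of the manual):

```python
# Re-computation of the sorghum RBD example above.
data = {                      # variety -> yields in blocks I..IV (kg/plot)
    "African Tall": [22.9, 25.9, 39.1, 33.9],
    "Co-11":        [29.5, 30.4, 35.3, 29.6],
    "FS-1":         [28.8, 24.4, 32.1, 28.6],
    "K-7":          [47.0, 40.9, 42.8, 32.1],
    "Co-24":        [28.9, 20.4, 21.1, 31.8],
}
t = len(data)                             # varieties (treatments)
r = len(next(iter(data.values())))        # blocks (replications)
G = sum(sum(v) for v in data.values())
CF = G ** 2 / (r * t)
TSS = sum(x ** 2 for v in data.values() for x in v) - CF
TrSS = sum(sum(v) ** 2 / r for v in data.values()) - CF
RSS = sum(sum(v[j] for v in data.values()) ** 2 / t for j in range(r)) - CF
ErSS = TSS - TrSS - RSS
F_variety = (TrSS / (t - 1)) / (ErSS / ((t - 1) * (r - 1)))

print(round(TrSS, 2), round(RSS, 2), round(F_variety, 2))
# prints: 520.53 80.8 4.45
```

Since 4.45 > 3.259 (tabular F at 5% for 4 and 12 d.f.), the varieties differ significantly, as stated above.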

*******************************************************************************

Objective: Analysis of Randomized Block Design

Kinds of data: The yields of 6 varieties of a crop in lbs., along with the plan of the experiment, are given below. The number of blocks is 5, the plot size is 1/20 acre and the varieties are represented by A, B, C, D, E and F. Analyze the data and state your conclusions.

B-I     B 12   E 26   D 10   C 15   A 26   F 62
B-II    E 23   C 16   F 56   A 30   D 20   B 10
B-III   A 28   B 9    E 35   F 64   D 23   C 14
B-IV    F 75   D 20   E 30   C 14   B 7    A 23
B-V     D 17   F 70   A 20   C 12   B 9    E 28

Solution:

Null hypothesis H01: there is no significant difference between variety means, i.e. τ1 = τ2 = τ3 = τ4 = τ5 = τ6

H02: there is no significant difference between block means, i.e. β1 = β2 = β3 = β4 = β5

Correction factor CF = (GT)²/(rk)

Variety sum of squares (VSS) = Σ vi²/r − CF

Block sum of squares (BSS) = Σ bj²/k − CF

Total sum of squares (TSS) = ΣΣ y² − CF

Error sum of squares (ESS) = TSS − VSS − BSS


First rearrange the given data:

Blocks           A      B     C     D     E      F      Block totals   Means
B1               26     12    15    10    26     62     151            25.17
B2               30     10    16    20    23     56     155            25.83
B3               28     9     14    23    35     64     173            28.83
B4               23     7     14    20    30     75     169            28.17
B5               20     9     12    17    28     70     156            26.00
Variety totals   127    47    71    90    142    327    GT = 804
Means            25.4   9.4   14.2  18.0  28.4   65.4

CF = 804²/30 = 21547.2

VSS = (127² + 47² + 71² + 90² + 142² + 327²)/5 − 21547.2 = 31714.4 − 21547.2 = 10167.2

BSS = (151² + 155² + 173² + 169² + 156²)/6 − 21547.2 = 21608.67 − 21547.2 = 61.47

TSS = (26² + 12² + 15² + … + 28² + 70²) − 21547.2 = 32194 − 21547.2 = 10646.8

ESS = TSS − BSS − VSS = 10646.8 − 61.47 − 10167.2 = 418.13

ANOVA TABLE

Sources of Variation   d.f.             S.S.      M.S.      F-cal. value   F-table value
Blocks                 5 - 1 = 4        61.47     15.37     0.74           F0.05 (4, 20) = 2.87
Varieties              6 - 1 = 5        10167.2   2033.44   97.25          F0.05 (5, 20) = 2.71
Error                  29 - 4 - 5 = 20  418.13    20.91
Total                  30 - 1 = 29      10646.8

Calculated value of F (Treatments) > Table value of F, H0 is rejected and hence

we conclude that there is highly significant difference between variety means.

Where SEm =√𝐸𝑀𝑆

𝑟 = √

20.91

4 = 2.04

SED = √2 * SEm = 1.414 * 2.04 = 2.88


Critical difference (CD) = SED × t-table value for error d.f. at 5% LOS

CD = 2.88 × 2.09 ≈ 6.04

Coefficient of variation = (√EMS / grand mean) × 100 = (√20.91 / 26.8) × 100 = 17%

(i) Pairs of means not underscored by the same line differ significantly.

(ii) Pairs of means underscored by the same line are non-significant (on par).

Variety F gives a significantly higher yield than all the other varieties; varieties E and A are on par, as are D and C, and C and B, while variety A yields significantly more than D, C and B.
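The sums of squares, F ratios and the post-hoc quantities (SEm, SED, CD, CV) worked out above can be cross-checked with a short script. This is a sketch, not part of the manual; the variable names are illustrative, and the tabulated t value 2.09 for 20 error d.f. at 5% LOS is the one used in the CD step above.

```python
import math

# Yields of six varieties (columns A..F) in five blocks (rows B1..B5)
data = [
    [26, 12, 15, 10, 26, 62],
    [30, 10, 16, 20, 23, 56],
    [28,  9, 14, 23, 35, 64],
    [23,  7, 14, 20, 30, 75],
    [20,  9, 12, 17, 28, 70],
]
r, k = len(data), len(data[0])          # r = 5 blocks, k = 6 varieties
gt = sum(sum(row) for row in data)      # grand total = 804
cf = gt ** 2 / (r * k)                  # correction factor

vss = sum(sum(row[j] for row in data) ** 2 for j in range(k)) / r - cf
bss = sum(sum(row) ** 2 for row in data) / k - cf
tss = sum(y * y for row in data for y in row) - cf
ess = tss - vss - bss

ms_v = vss / (k - 1)
ms_b = bss / (r - 1)
ms_e = ess / ((r - 1) * (k - 1))
f_v, f_b = ms_v / ms_e, ms_b / ms_e

sem = math.sqrt(ms_e / r)               # standard error of a variety mean
sed = math.sqrt(2) * sem                # standard error of a difference
cd = sed * 2.09                         # t(0.05, 20 d.f.) = 2.09 from tables
cv = math.sqrt(ms_e) / (gt / (r * k)) * 100

print(round(vss, 1), round(bss, 2), round(ess, 2))   # 10167.2 61.47 418.13
print(round(f_v, 2), round(f_b, 2))                  # F for varieties and blocks
print(round(sem, 2), round(cd, 2), round(cv, 1))
```

Small differences in the last decimal place (e.g. the F value or CD) arise only from when rounding is applied.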

Exercise:

Q1. Explain analysis of one way classification?

Q2. What do you understand by analysis of variance?

Q3.What are assumptions of analysis of variance?

Q4. The yields of four varieties of wheat per plot (in lbs.) obtained from an experiment in randomized block design are given below:

Variety   Replication
          I    II   III  IV   V
V1        7    9    8    10   10
V2        12   13   15   11   13
V3        15   20   15   18   16
V4        8    10   12   10   8

Analyze the data and state your conclusion. (Ans. Variety variance = 66.13, Error variance = 2.59)

Q5. The following table gives the yields in pounds per plot, of five varieties of wheat after being applied to 4, 3, 2, 4 and 3 plots respectively:

Varieties   Yield in lbs.
A           8    8    6    10
B           10   9    8
C           8    10
D           7    10   9    8
E           12   8    10

Analyze the data and state your conclusion. (Ans. Variety variance = 1.86, Error variance = 2.28)

Q6. Write short notes on:

(a) Local control
(b) Replication
(c) Advantages of C.R.D.

Variety means in descending order (for the underscoring comparison in the worked example above):

F 65.4   E 28.4   A 25.4   D 18.0   C 14.2   B 9.4


8. Sampling Methods

R. S. Solanki, Assistant Professor (Maths & Stat.), College of Agriculture, Waraseoni, Balaghat (M.P.), India

Email id: [email protected]

1. Introduction

The term "sampling" refers to the selection of a part of a group or an aggregate with

a view to obtaining information about the whole. This aggregate or the totality of all members is

known as Population although they need not be human beings. The selected part, which is used to

ascertain the characteristics of the population, is called Sample. While choosing a sample, the

population is assumed to be composed of individual units or members, some of which are included

in the sample. The total number of members of the population and the number included in the

sample are called Population Size and Sample Size respectively. The process of generalising on the

basis of information collected on a part is really a traditional practice. The annual production of a

certain crop in a region is computed on the basis of a sample. The quality of a product coming out

of a production process is ascertained on the basis of a sample. The government and its various

agencies conduct surveys from time to time to examine various economic and related issues

through samples. Sampling methodology can be used by an auditor or an accountant to estimate the

value of total inventory in the stores without actually inspecting all the items physically. Opinion

polls based on samples are used to forecast the result of a forthcoming election.

2. Advantage of sampling over census

The census or complete enumeration consists of collecting data from each and every unit of the population, whereas sampling studies only a chosen part of those units. Sampling has a number of advantages over complete enumeration, for a variety of reasons.

Less Expensive: The first obvious advantage of sampling is that it is less expensive. If we want to

study the consumer reaction before launching a new product it will be much less expensive to carry

out a consumer survey based on a sample rather than studying the entire population which is the

potential group of customers.

Less Time Consuming: The smaller size of the sample enables us to collect the data more quickly

than to survey all the units of the population even if we are willing to spend money. This is

particularly the case if the decision is time bound. An accountant may be interested to know the

total inventory value quickly to prepare a periodical report like a monthly balance sheet and a profit

and loss account. A detailed study on the inventory is likely to take too long to enable him to

prepare the report in time.

Greater Accuracy: It is possible to achieve greater accuracy by using appropriate sampling

techniques than by a complete enumeration of all the units of the population. Complete

enumeration may result in inaccuracies in the data. Consider an inspector who is visually inspecting

the quality of finishing of a certain machinery. After observing a large number of such items he

can no longer distinguish items with a defective finish from good ones. Once such inspection fatigue

develops the accuracy of examining the population completely is considerably decreased. On the

other hand, if a small number of items is observed the basic data will be much more accurate.


Destructive Enumeration: Sampling is indispensable if the enumeration is destructive. If you are

interested in computing the average life of fluorescent lamps supplied in a batch the life of the

entire batch cannot be examined to compute the average since this means that the entire supply will

be wasted. Thus, in this case there is no other alternative than to examine the life of a sample of

lamps and draw an inference about the entire batch.

3. Simple Random Sampling

The representative character of a sample is ensured by allocating some probability to each

unit of the population for being included in the sample. The simple random sample assigns equal

probability to each unit of the population. The simple random sample can be chosen both with and

without replacement.

Simple Random Sampling with Replacement (SRSWR): Suppose the population consists of N

units and we want to select a sample of size n. In simple random sampling with replacement we

choose an observation from the population in such a manner that every unit of the population has

an equal chance of 1/N to be included in the sample. After the first unit is selected its value is

recorded and it is again placed back in the population. The second unit is drawn exactly in the

same manner as the first unit. This procedure is continued until the nth unit of the sample is selected.

Clearly, in this case each unit of the population has an equal chance of 1/N to be included in each of

the n units of the sample.

In this case the number of possible samples of size n selected from the population of size N is Nⁿ.

The samples selected through this method are not distinct.

Simple Random Sampling without Replacement (SRSWOR): In this case when the first unit is

chosen every unit of the population has a chance of 1/N to be included in the sample. After the first

unit is chosen it is no longer replaced in the population. The second unit is selected from the

remaining (N−1) members of the population so that each unit has a chance of 1/(N−1) to be included

in the sample. The procedure is continued till nth unit of the sample is chosen with probability [1/

(N-n+1)].

In this case the number of possible samples of size n selected from the population of size N is NCn, i.e. the number of combinations of n units out of N. The samples selected through this method are distinct.
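The two counts, Nⁿ for SRSWR and NCn for SRSWOR, can be verified by brute-force enumeration. This is a sketch; the values N = 5 and n = 3 are illustrative choices, not from the manual.

```python
from itertools import product, combinations

N, n = 5, 3                                  # illustrative population and sample sizes
units = range(1, N + 1)

# With replacement: ordered draws, repeats allowed -> N**n possible samples
wr_samples = list(product(units, repeat=n))
# Without replacement: distinct unordered units -> N-choose-n possible samples
wor_samples = list(combinations(units, n))

print(len(wr_samples))    # 125 = 5**3
print(len(wor_samples))   # 10 = 5C3
```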

Advantages and Disadvantages of Simple Random Sampling:

Advantages: It is a fair method of sampling and, if applied appropriately, it helps to reduce bias compared with other sampling methods. It is a very basic method of collecting data: no technical knowledge is required beyond basic listening and recording skills. Simple random sampling offers researchers an opportunity to perform data analysis in a way that creates a lower margin of error within the information collected. It offers an equal chance of selection to everyone within the population group.

Disadvantages: It is a costlier method of sampling as it requires a complete list of all potential

respondents to be available beforehand. It relies on the quality of the researchers performing the

work. It can require a sample size that is too large. It does not work well with widely diverse or

dispersed population groups.


4. Selection of Simple Random Sample

The concept of "randomness" implies that every item being considered has an equal chance

of being selected as part of the sample. To ensure randomness of selection one may adopt either the

Lottery Method or use table of random numbers.

Lottery Method: This is a very popular method of taking a random sample. Under this method, all

items of the universe are numbered or named on separate slips of paper of identical size and shape.

These slips are then folded and mixed up in a container or drum. A blindfold selection is then made

of the number of slips required to constitute the desired sample size. The selection of items thus

depends entirely on chance. The method would be quite clear with the help of an example. If we

want to take a sample of 10 persons out of a population of 100, the procedure is to write the names

of all the 100 persons on separate slips of paper, fold these slips, mix them thoroughly and then

make a blindfold selection of 10 slips. The lottery method is very popular in lottery draws where a

decision about prizes is to be made. However, while adopting lottery method it is absolutely

essential to see that the slips are of identical size, shape and colour, otherwise there is a lot of

possibility of personal prejudice and bias affecting the results. The process of writing N number of

slips is cumbersome and shuffling a large number of slips, where population size is very large, is

difficult. Also human bias may enter while choosing the slips. Hence the other alternative i.e.

random numbers can be used.

Random Number Table Method: A random number table is a table of digits. The digit given in

each position in the table was originally chosen randomly from the digits 1, 2, 3, 4, 5, 6, 7, 8, 9, 0

by a random process in which each digit is equally likely to be chosen, as demonstrated in the small

sample shown below.

Table of Random Numbers

36518 36777 89116 05542 29705

46132 81380 75635 19428 88048

31841 77367 40791 97402 27569

84180 93793 64953 51472 65358

78435 37586 07015 98729 76703

83775 21564 81639 27973 62413

08747 20092 12615 35046 67753

90184 02338 39318 54936 34641

23701 75230 47200 78176 85248

16224 97661 79907 06611 26501

85652 62817 57881 90589 74567

69630 10883 13683 93389 92725

95525 86316 87384 22633 68158

The table usually contains 5-digit numbers, arranged in rows and columns, for ease of reading.

Typically, a full table may extend over as many as four or more pages. The occurrence of any two

digits in any two places is independent of each other. Several standard tables of random numbers

are available, among which the following may be specially mentioned, as they have been tested

extensively for randomness:


• Tippett’s (1927) random number tables consisting of 41,600 random digits grouped into

10,400 sets of four-digit random numbers.

• Fisher and Yates (1938) table of random numbers with 15,000 random digits arranged into

1,500 sets of ten-digit random numbers.

• Kendall and Babington Smith (1939) table of random numbers consisting of 1,00,000

random digits grouped into 25,000 sets of four-digit random numbers.

• Rand Corporation (1955) table of random numbers consisting of 1,00,000 random digits

grouped into 20,000 sets of five-digit random numbers.

• C.R. Rao, Mitra and Mathai (1966) table of random numbers.

How to use a random number table: This method is one from a variety of methods of reading

numbers from random number tables.

i. Assume you have the test scores for a population of 200 students. Each student has been

assigned a number from 1 to 200. We want to randomly sample only 5 of the students.

ii. Since the population size is a three-digit number, we will use the first three digits of the

numbers listed in the table.

iii. Without looking, point to a starting spot in the above random number table. Assume we

land on 93793 (2nd column, 4th entry).

iv. This location gives the first three digits to be 937. This choice is too large (> 200), so we

choose the next number in that column. Keep in mind that we are looking for numbers

whose first three digits are from 001 to 200 (representing students).

v. The second choice gives the first three digits to be 375, also too large. Continue down the

column until you find 5 of the numbers whose first three digits are less than or equal to 200.

vi. From this table, we arrive at 200 (20092), 023 (02338), 108 (10883), 070 (07015), and 126

(12615).

Students 23, 70, 108, 126, and 200 will be used for our random sample. Our sample set of students

has been randomly selected where each student had an equal chance of being selected and the

selection of one student did not influence the selection of other students.
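The scanning procedure in steps i–vi can be mimicked in code, using the small table reproduced above. This is a sketch: the table is stored column by column as strings so that leading zeros survive, the scan starts at the same spot as the worked example (2nd column, 4th entry), and wrap-around past the last column is included for generality.

```python
rows = [
    "36518 36777 89116 05542 29705",
    "46132 81380 75635 19428 88048",
    "31841 77367 40791 97402 27569",
    "84180 93793 64953 51472 65358",
    "78435 37586 07015 98729 76703",
    "83775 21564 81639 27973 62413",
    "08747 20092 12615 35046 67753",
    "90184 02338 39318 54936 34641",
    "23701 75230 47200 78176 85248",
    "16224 97661 79907 06611 26501",
    "85652 62817 57881 90589 74567",
    "69630 10883 13683 93389 92725",
    "95525 86316 87384 22633 68158",
]
table = [line.split() for line in rows]
ncols, nrows = 5, len(table)

# Read the table column by column, starting at the 2nd column, 4th entry
# (93793), continuing into the following columns as in the worked example.
stream = [table[r][c] for c in range(ncols) for r in range(nrows)]
start = 1 * nrows + 3                     # column index 1, row index 3
scan = stream[start:] + stream[:start]    # wrap around if needed

selected = []
for num in scan:
    code = int(num[:3])                   # first three digits of each number
    if 1 <= code <= 200:                  # valid student numbers: 001 to 200
        selected.append(code)
    if len(selected) == 5:
        break

print(selected)   # [200, 23, 108, 70, 126]
```

The output matches the students arrived at in step vi: 200 (20092), 023 (02338), 108 (10883), 070 (07015) and 126 (12615).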

******************************************************************************

Objective: Selection of simple random sample using random number table.

Kinds of data: The number of diseased plants (out of 9) in each of 24 areas is given in the following table:

S.No. of areas 1 2 3 4 5 6 7 8 9 10 11 12

Diseased Plants 1 4 1 2 5 1 1 1 7 2 3 3

S.No. of areas 13 14 15 16 17 18 19 20 21 22 23 24

Diseased Plants 2 2 3 1 2 7 2 6 3 5 3 4

Select a simple random sample with and without replacement of size 6. Compute the average

diseased plants based on the sample. Compare this with the average diseased plants of the

population.

Solution:

Simple random sample with replacement:

We have the diseased plants of population of 24 areas. Each area has been assigned a number from

1 to 24. We want to randomly sample with replacement of only 6 of the 24 areas.

Step 1: Since the population size is a two digit number, we will use the first two digits of the

numbers listed in the random number table (see appendix).

Step 2: Without looking, point to a starting spot in the random number table. Assume we land on

72918 (4th column, 21st entry). This location gives the first two digits to be 72. This choice is too


large (> 24), so we choose the next number in that column. Keep in mind that we are looking for

numbers whose first two digits are from 01 to 24 (representing areas).

Step 3: The second choice (12468) gives the first two digits to be 12 (≤ 24), so we accept it.

Step 4: Continue down the column until we find 6 of the numbers whose first two digits are less

than or equal to 24. From this table, we arrive at 12 (12468), 17 (17262), 02 (02401), 11 (11333),

10 (10631) and 17 (17220).

Areas 02, 10, 11, 12, 17 and 17 will be used for our random sample (area no. 17 appears twice because our random sample is with replacement).

Average diseased plants based on simple random sample with replacement:

S.No. of areas 02 10 11 12 17 17

Diseased Plants 4 2 3 3 2 2

Average diseased plants = (4 + 2 + 3 + 3 + 2 + 2) / 6 = 16/6 = 2.67

Simple random sample without replacement:

We have the diseased plants of population of 24 areas. Each area has been assigned a number from

1 to 24. We want to randomly sample without replacement of only 6 of the 24 areas.

Step 1: Since the population size is a two digit number, we will use the first two digits of the

numbers listed in the random number table (see appendix).

Step 2: Without looking, point to a starting spot in the random number table. Assume we land on

13211 (7th column, 17th entry). This location gives the first two digits to be 13. This choice is (≤

24), so we choose this number. Keep in mind that we are looking for numbers whose first two

digits are from 01 to 24 (representing areas).

Step 3: Continue down the column until we find 6 of the numbers (repeated numbers not allowed in

SRSWOR) whose first two digits are less than or equal to 24. From this table, we arrive at 22

(22250), 12 (12944), 04 (04014), 19 (19386), 01 (01573) and 20 (20963). Areas 01, 04, 12, 19, 20

and 22 will be used for our random sample.

Average diseased plants based on simple random sample without replacement:

S.No. of areas 01 04 12 19 20 22

Diseased Plants 1 2 3 2 6 5

Average diseased plants = (1 + 2 + 3 + 2 + 6 + 5) / 6 = 19/6 = 3.17

Average diseased plants based on population:

Average diseased plants = (1 + 4 + 1 + 2 + 5 + 1 + 1 + 1 + 7 + 2 + 3 + 3 + 2 + 2 + 3 + 1 + 2 + 7 + 2 + 6 + 3 + 5 + 3 + 4) / 24 = 71/24 = 2.96

Conclusion: From the above calculations it can be concluded that the average numbers of diseased plants based on the simple random samples with and without replacement, and for the population, are all almost equal to 3.
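The three averages in this exercise are easy to verify with a few lines of code (a sketch; area numbers are 1-based, as in the table, and the helper name is illustrative):

```python
# Diseased plants in the 24 areas, in serial order
plants = [1, 4, 1, 2, 5, 1, 1, 1, 7, 2, 3, 3,
          2, 2, 3, 1, 2, 7, 2, 6, 3, 5, 3, 4]

def mean_of_areas(areas):
    """Average diseased plants over the given (1-based) area numbers."""
    return sum(plants[a - 1] for a in areas) / len(areas)

wr_mean = mean_of_areas([2, 10, 11, 12, 17, 17])    # SRSWR sample (17 repeats)
wor_mean = mean_of_areas([1, 4, 12, 19, 20, 22])    # SRSWOR sample
pop_mean = sum(plants) / len(plants)                # whole population

print(round(wr_mean, 2), round(wor_mean, 2), round(pop_mean, 2))  # 2.67 3.17 2.96
```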

******************************************************************************

Objective: Selection of simple random sample under SRSWOR.

Kinds of data: The data relate to the hypothetical population whose units are 1, 2, 3, 4 and 5.

Draw a sample of size n=3 using SRSWOR and show sample mean is an estimate of population

mean.


Solution: The number of all possible samples of size n = 3 under SRSWOR is NCn = 5C3 = 10.

Population mean ȳN = Σyi / N = 15/5 = 3. Compute the mean ȳn = Σyi / n of each sample.

The 10 possible samples are given below in the table.

S.No.   Possible samples   Sample mean ȳn
1       1, 2, 3            2.00
2       2, 3, 4            3.00
3       3, 4, 5            4.00
4       4, 5, 1            3.33
5       5, 1, 2            2.67
6       1, 3, 4            2.67
7       2, 4, 5            3.67
8       3, 5, 1            3.00
9       4, 1, 2            2.33
10      5, 2, 3            3.33
Total                      30.00

Now we have to check whether E(ȳn) = ȳN:

E(ȳn) = Σȳn / NCn = 30/10 = 3 = ȳN

Hence we can say that the sample mean ȳn is an unbiased estimate of the population mean ȳN.
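The enumeration above can be reproduced in a few lines (a sketch of the check, not part of the manual):

```python
from itertools import combinations

population = [1, 2, 3, 4, 5]
n = 3

# All NCn = 5C3 = 10 distinct SRSWOR samples
samples = list(combinations(population, n))
sample_means = [sum(s) / n for s in samples]

pop_mean = sum(population) / len(population)          # 3.0
expectation = sum(sample_means) / len(samples)        # E(sample mean)

print(len(samples), round(expectation, 6), pop_mean)  # 10 3.0 3.0
```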

*******************************************************************************

Objective: Selection of simple random sample under SRSWR.

Kind of data: Consider a finite population of size N = 5 with sampling-unit values (1, 2, 3, 4, 5). Enumerate all possible samples of size n = 2 using SRSWR and check whether the sample mean is an estimate of the population mean.

Solution: The number of all possible samples of size n = 2 under SRSWR is Nⁿ = 5² = 25.

Population mean ȳN = Σyi / N = 15/5 = 3. Compute the mean ȳn = Σyi / n of each sample.

S.No.   Possible   Sample mean      S.No.   Possible   Sample mean
        samples    ȳn                       samples    ȳn
1       1, 2       1.5              13      4, 1       2.5
2       1, 3       2.0              14      5, 1       3.0
3       1, 4       2.5              15      3, 2       2.5
4       1, 5       3.0              16      4, 2       3.0
5       2, 3       2.5              17      5, 2       3.5
6       2, 4       3.0              18      4, 3       3.5
7       2, 5       3.5              19      5, 3       4.0
8       3, 4       3.5              20      5, 4       4.5
9       3, 5       4.0              21      1, 1       1.0
10      4, 5       4.5              22      2, 2       2.0
11      2, 1       1.5              23      3, 3       3.0
12      3, 1       2.0              24      4, 4       4.0
                                    25      5, 5       5.0
Total (all 25 samples)              75.0

Now we have to check whether E(ȳn) = ȳN:

E(ȳn) = Σȳn / Nⁿ = 75/25 = 3 = ȳN

Hence we can say that the sample mean ȳn is an unbiased estimate of the population mean.
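The same check for SRSWR, enumerating all ordered pairs with repetition, can be sketched as:

```python
from itertools import product

population = [1, 2, 3, 4, 5]
n = 2

# All N**n = 25 ordered SRSWR samples (repeats allowed)
samples = list(product(population, repeat=n))
sample_means = [sum(s) / n for s in samples]

pop_mean = sum(population) / len(population)
expectation = sum(sample_means) / len(samples)

print(len(samples), expectation, pop_mean)   # 25 3.0 3.0
```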

*******************************************************************************

Exercise:

Q1. The data below indicate the number of workers in the factory for twelve factories

Factory          1     2     3    4    5    6     7    8    9    10    11    12
No. of workers   2145  1547  745  215  784  3125  126  471  841  3215  2496  589

Select a simple random sample without replacement of size four with the help of random

number table (see Appendix). Compute the average number of workers per factory based on

the sample. Compare this number with the average number of workers per factory in the

population.

Q2. A class has 115 students. Select a simple random sample with replacement of size 15.

Q3. The following data are the yields (q/ha) of 30 varieties of paddy maintained in a research

station for breeding trials:

49 78 57 55 45 26 70 21 75 94 56 62 64 79 85

47 67 43 31 38 33 50 37 75 32 42 52 22 63 40

Select a simple random sample without replacement of size 8. Compute the average yield of

paddy based on the sample. Compare this yield with the average yield of paddy in the

population.

Q4. A population has 7 units: 1, 2, 3, 4, 5, 6, 7. Write down all possible samples of size 2 (without

replacement), which can be drawn from the given population and verify that the sample mean

is an estimate of the population mean.

Q5. How many random samples of size 5 can be drawn from a population of size 10 if sampling is done with replacement?

********************************************************************************


REFERENCES:

1. Practicals in Statistics, by H. L. Sharma.
2. Statistical Methods, by G. W. Snedecor.
3. Experimental Designs and Survey Sampling: Methods and Applications, by H. L. Sharma.
4. A Handbook of Agricultural Statistics, by S. R. S. Chandel.
5. The Theory of Sample Surveys and Statistical Decisions, by K. S. Kushwaha and Rajesh Kumar.
6. Fundamentals of Mathematical Statistics, by S. C. Gupta and V. K. Kapoor.
7. A Textbook of Agricultural Statistics, by R. Rangaswamy, New Age International (P) Limited.
8. Essentials of Statistics in Agriculture Sciences, edited by P. Mishra and F. Homa, Apple Academic Press / CRC Press (Taylor and Francis Group), New York.
9. Agriculture and Applied Statistics-I, by P. K. Sahu (2004), Kalyani Publishers.