bcc104 business statistics


Page 1: BCC104 Business Statistics

SIKKIM MANIPAL UNIVERSITY
Directorate of Distance Education

Business Statistics

Edition: Summer 2011

BCC104

B1463


Vikas® is the registered trademark of Vikas® Publishing House Pvt. Ltd.

VIKAS® PUBLISHING HOUSE PVT LTD
E-28, Sector-8, Noida - 201301 (UP)
Phone: 0120-4078900 • Fax: 0120-4078999
Regd. Office: 576, Masjid Road, Jangpura, New Delhi 110 014
Website: www.vikaspublishing.com • Email: [email protected]

Information contained in this book has been published by VIKAS® Publishing House Pvt. Ltd. and has been obtained by its Authors from sources believed to be reliable and are correct to the best of their knowledge. However, the Publisher and its Authors shall in no event be liable for any errors, omissions or damages arising out of use of this information and specifically disclaim any implied warranties or merchantability or fitness for any particular use.

This book is a distance education module comprising a collection of learning materials for our students. All rights reserved. No part of this work may be reproduced in any form by any means without permission in writing from Sikkim Manipal University, Gangtok, Sikkim. Printed and Published on behalf of Sikkim Manipal University, Gangtok, Sikkim by Mr Rajkumar Mascreen, GM, Manipal Universal Learning Pvt Ltd. Manipal - 576 104. Printed at Manipal Press Limited, Manipal.

Authors:
J.S. Chandan: Units (1.3-1.10, 2.1-2.4, 3.3, units-4, 5, 8, 11) Copyright © J.S. Chandan, 2011
G.S. Monga: (Unit-9) Copyright © G.S. Monga, 2011
Vijay Gupta: Unit (3.4-3.11) Copyright © Vijay Gupta, 2011
C.R. Kothari: Units (unit-6, 7, 10) Copyright © C.R. Kothari, 2011
Vikas® Publishing House: Units (1.1-1.2, 2.5-2.10, 3.1-3.2, units-12-14) Copyright © Reserved, 2011

SIKKIM MANIPAL UNIVERSITY (SMU DDE)

Dean
Directorate of Distance Education
Sikkim Manipal University (SMU DDE)

BOARD OF STUDIES

Chairman HOD Arts and Humanities SMU DDE Additional Registrar SMU DDE Controller of Examination SMU DDE Prof. Ramesh Murthy Principal Academics Manipal Universal Learning Pvt Ltd

Rajesh A.R., Head Employment, Manipal Universal Learning Pvt Ltd

Dr Ramesh Murthy, Director, SMU DE

Dr Gayathri Devi, Dean, SMU DE

Dr Shivram Krishnan, Professor & HOD, A&H, SMU DDE

Srinath P.S., Additional Registrar, Student Evaluation, SMU DDE

Ashok Kumar K., Additional Registrar, SMU DDE

Prof. S.N. Maheshwari, Director General, Delhi Institute of Advanced Studies, Delhi (Formerly, Principal, Hindu College, Delhi University & Professor & Dean, Faculty of Commerce and Business Management , Goa University)

Dr Anil Singh, Associate Professor, University of Delhi


Business Statistics

Contents

Unit 1 Information and Data Sources
Unit 2 Data Collection Methods
Unit 3 Data Analysis Techniques
Unit 4 Index Numbers
Unit 5 Data Representation
Unit 6 Correlation
Unit 7 Regression
Unit 8 Time Series
Unit 9 Testing of Hypothesis
Unit 10 Chi-Square Test
Unit 11 t-Test, z-Test and Analysis of Variance
Unit 12 Research Report Writing
Unit 13 Exercise I
Unit 14 Exercise II


SUBJECT INTRODUCTION

Business Statistics

Statistics is considered a mathematical science pertaining to the collection, analysis, interpretation or explanation and presentation of data. The subject of statistics is primarily concerned with making decisions about various disciplines of market and employment, such as stock market trends, unemployment rates in various sectors of industries, demographic shifts, interest rates, inflation rates over the years, and so on. Statistics is also considered a science that deals with numbers or figures describing the state of affairs of various situations with which we are generally and specifically concerned.

This book, Business Statistics, comprises fourteen units.

Unit 1- Information and Data Sources: Explains the need for information in decision making. It defines a problem and discusses how information is evaluated and processed. It also defines the various types of data.

Unit 2- Data Collection Methods: Discusses different methods of data collection, such as observation, questionnaire, interviews and experiments. It also lists the merits and demerits of data collection methods.

Unit 3- Data Analysis Techniques: Explains the various techniques of analysing data, including percentage, ratio, average, mean, mode, median, quartiles, range and standard deviation.

Unit 4- Index Numbers: Defines and classifies index numbers. It also explains the methods of construction of different types of index numbers.

Unit 5- Data Representation: Lists the various tools of data representation, including tables, graphs and diagrams, and discusses their features.

Unit 6- Correlation: Defines correlation analysis. It also discusses the concepts of coefficient of determination, coefficient of correlation, Karl Pearson’s coefficient and Spearman’s rank correlation.

Unit 7- Regression: Defines the term ‘regression’ and lists the assumptions in regression analysis. It also describes the simple regression model, scatter diagram method and least square method.

Unit 8- Time Series: Lists the components of time series. It also describes the various methods of measuring trends and seasonal variations.


Unit 9- Testing of Hypothesis: Defines a hypothesis and lists its characteristics. It also explains the various ways of formulating hypotheses.

Unit 10- Chi-Square Test: Discusses the meaning, characteristics and significance of the Chi-square test. It also lists the areas of application of the Chi-square test and the steps involved in finding the value of Chi-square.

Unit 11- t-Test, z-Test and Analysis of Variance: Describes the method to perform the t-Test, z-Test and Analysis of Variance. It also identifies the conditions in which these tests are applicable.

Unit 12- Research Report Writing: Describes the types, characteristics and mechanics of a report and explains how to write a good report. It also discusses the types of research reports.

Unit 13- Exercise I

Unit 14- Exercise II

Objectives of studying the subject

After studying this subject, you should be able to:

Explain why information or data is needed for decision-making

Explain the various techniques of collecting and analysing data

Define what index numbers are and use different methods to construct index numbers

Use tables, graphs and diagrams for representing data

Perform correlation analysis and regression analysis

Describe the various methods of measuring trends and seasonal variations

Formulate and test hypotheses

Use Chi-square test, t-test, z-test and analysis of variance

Describe the various aspects of report writing, including its types, characteristics and mechanics


Business Statistics Unit 1

Sikkim Manipal University Page No. 1

Unit 1 Information and Data Sources

Structure

1.1 Introduction
Objectives
1.2 Need for Information in Decision-Making
1.3 Types of Data
1.4 Data Sources: Primary vs Secondary
1.5 Research Procedure
1.6 Summary
1.7 Glossary
1.8 Terminal Questions
1.9 Answers
1.10 Further Reading

1.1 Introduction

Information is processed from raw data. It is verified to be accurate, specific and organized for a special purpose. The value of information lies solely in its ability to affect a behaviour, decision or outcome.

In this unit, you will learn about information, decision-making, data and its various types. The information should be context specific and available when it is required, i.e., timely. Data is the numerical result of measurements. The arrangement of the collected data defines its type. Data can be the basis of graphs, images, or observations of a set of variables. Raw or unprocessed data refers to a collection of numbers, characters, images or other outputs from devices that collect information to convert physical quantities into symbols. Statistics is the science of the collection, organization, and interpretation of data. It deals with all aspects of this, including the planning of data collection in terms of the design of surveys and experiments. You will also learn about variables and random variables. A variable is any characteristic which can assume different values.

In probability and statistics, a random or stochastic variable refers to a variable whose value results from a measurement on some type of random process. In formal terms, it refers to a function from a probability space, typically to the real numbers, which is measurable. Intuitively, a random variable is a numerical description of the outcome of an experiment, for example, the possible results of rolling two dice: (1, 1), (1, 2), etc. Random variables can be classified as either discrete or continuous. The former refers to a random variable that may assume either a finite number of values or an infinite sequence of values, while the latter refers to a variable that may assume any numerical value in an interval or collection of intervals. An example of a random variable of mixed type would be based on an experiment where a coin is flipped and a spinner is spun only if the result of the coin toss is heads.

A random variable can also be divided into two main categories, qualitative random variables and quantitative random variables. The classification of data is done before processing it. It involves separating items according to similar characteristics and grouping them into four classes: geographical, chronological, qualitative and quantitative.

Further in this unit, you will learn about primary data, secondary data and the sources from which these are collected. The validity and accuracy of the final judgement is most crucial and depends on how well the data was gathered in the first place. The quality of data will greatly affect the conclusions and hence, utmost importance must be given to this process and every possible precaution should be taken to ensure accuracy while gathering and collecting data.

Objectives

After studying this unit, you should be able to:

Explain why information is needed in decision-making

Define a problem, evaluate and process information, and take a decision

Explain the meaning and scope of data and list the types of variables

Define variable and its types

Differentiate between primary and secondary data

Explain the procedures of conducting research, including the methods of collecting primary and secondary data

1.2 Need for Information in Decision-Making

Information plays a vital role in decision-making. It is provided by the information system set up in the organization. Information consists of data (facts and figures) which is processed and retrieved to be used for forecasting and decision-making.


In an information-oriented and information-driven society, everyone is a user and a provider of information. Tremendous growth in technology in general and communication technologies in particular has served as a powerful driving force in industry. With the advent of computer systems, communication technologies have gained more power and have given birth to specialized fields such as information sciences and information technology.

On analysis we find that the real driving force behind this growth is the information behind these technologies and not the technologies themselves. The following steps are used while taking decisions based on information.

1.2.1 Defining the Problem

A problem understood properly is more than half its solution. This requires proper definition of the problem and finding the issue that is to be covered. That is decided on the basis of analysis of the information provided or gathered. Figure 1.1 shows a fish-bone diagram that helps in understanding and analysing a complex problem that is interlinked as a model of the problem under analysis.

[Diagram: a ‘Major Effect’ spine with branches labelled Cause 1 to Cause 8, each carrying numbered sub-causes]

Figure 1.1 Fish-bone Diagram

Example: The fishbone diagram portrays various causes for an effect or problem and is often used in brainstorming sessions.

The given diagram was drawn by a manufacturing team in order to understand the source of periodic iron contamination. Six generic terms were used to prompt ideas while the branches portray the causes of the problem.


[Fishbone Diagram for the effect ‘Iron in Product’, with six main branches: Measurement, Materials, Methods, Environment, Manpower and Machines. Causes shown on the branches include lab error, calibration, raw materials, suppliers, analytical procedure, solvent contamination, sampling, rusty pipes, heat exchanger leak, materials of construction (pipes, pumps, reactors, exchangers), maintenance, inexperienced analyst and rust near sample point.]

The figure shows that the term ‘machines’ contains the idea ‘materials of construction’ which shows four kinds of equipment having specific machine numbers. However, it must be noted that some ideas appear twice. ‘Calibration’ appears under ‘methods’ as a factor in the analytical procedure and under ‘measurement’ as a cause of lab error.

1.2.2 Evaluating the Information

The information should be context specific and available when it is required, i.e., timely. As a decision-maker, you must evaluate the accuracy of information and the sources of information used in taking a decision. If there are many sources available for information, then select the source that can provide authentic information. The following are the four states in which information can be categorized:

Information you have and you are aware of it.

Information you do not have but you are aware of it.

Information that you have but you are not aware of it.

Information you do not have and you are not aware of it.


[Pyramid, from top to bottom: Books, Articles and Documents; Interim Information Products; Raw Data]

Figure 1.2 Information Pyramid

An information pyramid explains the various sources of information as shown in Figure 1.2. Easy sources of information are books, articles and documents, which are shown at the top of the pyramid. Other sources of information are based on raw data collected by someone else or collected by you. Information is produced by converting raw data into meaningful information and is presented in an easily understandable format.

1.2.3 Processing the Information

When information is specifically arranged according to the requirement or problem, it is termed knowledge. Relevant information is extracted from various sources. If the information is not relevant or does not meet the requirement, the decision-maker either uses other sources or collects additional accurate data for further analysis and decision-making. The information is judged for its relevance, validity and inter-dependence. These are evaluated and integrated to arrive at a conclusion and take a decision. The process is shown in Figure 1.3.

[Flow diagram with the stages: Information Collected; Usable Information; Additional Information Needed; Value-adding to Information; Information Required for Decision-making]

Figure 1.3 Processing the Information


1.2.4 Taking the Decision

The process of decision-making is interactive and involves all concerned persons in an establishment or organization, who give their opinions so that a decision can be taken on the basis of the information collected. The decision taken should be able to solve the problem. If the problem is not solved, then the decision taken is reviewed and re-analysed. It may be examined with more insight and further modified to meet various needs. Figure 1.4 shows problems and their relationship with decisions.

[Diagram linking Problems and Decisions]

Figure 1.4 Relationship between Problem and Decision

Thus, an organization must be able to take effective decisions to organize its activities based on relevant information. It must develop proper mechanisms for efficient and harmonized information exchange between various departments.

Self-Assessment Questions

1. Fill in the blanks with the appropriate terms.

(a) Accurate and timely _____________ is considered as one of the most powerful resources.

(b) When information is specifically arranged according to the requirement or problem, it is termed as ______________.

2. State whether true or false.

(a) A problem understood properly is more than half its solution.

(b) If a problem is solved, then the decision taken is reviewed and re-analysed.


1.3 Types of Data

Data

Data can be defined as the qualitative or quantitative attributes of a variable or set of variables. They are usually the results of measurements and can be presented in the form of graphs, images or observations of a set of variables. Data is considered to be the lowest level of abstraction from which information and knowledge are derived.

Data comprise the numerical results of measurements. Data can also be used in singular sense such as a set of data. For example, if we ask the students in a classroom their ages and we write down their ages as they tell us, then a collection of these numbers would be considered as data. Similarly, information regarding incomes of families, IQ scores of students, test scores of students in a class, heights of policemen in Mumbai, and so on, when collected, is known as data. If this data is written down as collected, then it is known as raw data. If this data is written in an ascending or descending order, then it would be called ordered data. If this ordered data is arranged in arrays of rows and columns, then the data is known to be presented in an ordered array.
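The progression from raw data to ordered data to an ordered array can be sketched in a few lines of Python. The ages below are made-up values used only for illustration, and the choice of three columns for the array is an assumption rather than something the text prescribes.

```python
# Hypothetical ages, written down in the order the students reported them.
raw_data = [19, 22, 18, 21, 20, 18, 23, 19, 20]

# Ordered data: the same values arranged in ascending order.
ordered_data = sorted(raw_data)

# Ordered array: the ordered values laid out in rows of three columns
# (the number of columns is an arbitrary choice for this sketch).
columns = 3
ordered_array = [
    ordered_data[i:i + columns]
    for i in range(0, len(ordered_data), columns)
]

for row in ordered_array:
    print(row)
```

Sorting in descending order instead only requires `sorted(raw_data, reverse=True)`; the raw list itself is left unchanged in both cases.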

1.3.1 Types of Variables

Variable

A variable is any characteristic that can assume different values. Age, height, IQ, and so on are all variables since their values can change when applied to different people. For example, Mr X is a variable since X can represent anybody. On the other hand, a constant will always have the same value. For example, the number of days in a week is constant and will always remain the same. Consider the following illustration:

Let x + 6 > 10 be an inequality. Now, if x is a whole number, then it can have any value greater than 4. While the values 6 and 10 are constant and do not change, x can be 5, 6, 7... and up to any value. Thus, x is a variable which can have any number of different values.

There are two types of variables: discrete variable and continuous variable. A discrete variable takes whole number values and consists of distinct, recognizable individual elements that can be counted, such as the number of books in a library. Similarly, the number of children in a family would be considered as values of a discrete variable, since the children can be counted exactly.


On the other hand, a continuous variable is a variable whose values can theoretically take on an infinite number of values within a given range of values.

Hence, these values are measured as against being counted. However, since the measurement value would depend upon how accurately we measure it, any exact value would simply be one of the infinite number of values on a continuous scale between two given points. For example, the height of a child touches every one of the infinite number of points between, let us say, 40 inches and 40.1 inches as he/she grows from 40 inches to 40.1 inches. Accordingly, the value of a continuous variable is more accurately defined if it is stated as being between two points such as 40 inches and 40.1 inches.

A random variable

Roughly speaking, in probability and statistics a random or stochastic variable is a variable whose value results from a measurement on some type of random process. Formally, it is a measurable function from a probability space, typically to the real numbers (for finite probability spaces, the measurability requirement is superfluous). Intuitively, a random variable is a numerical description of the outcome of an experiment, such as the possible results of rolling two dice: (1, 1), (1, 2), and so on. Random variables can be classified as either discrete (the variable may assume a finite number of values or an infinite sequence of values) or continuous (the variable may assume any numerical value in an interval or a collection of intervals). A random variable’s possible values can represent the possible outcomes of a yet-to-be-performed experiment, or the potential values of a quantity whose already-existing value is uncertain (e.g., as a result of incomplete information or imprecise measurements). Realizations of a random variable are known as random variates.

Random variables are of two types, namely, discrete and continuous. For a continuous random variable, the probability of any specific value can be zero, whereas the probability of some infinite set of values (such as an interval of non-zero length) can be positive. A random variable can be ‘mixed’, with a part of its probability spread out over an interval like any typical continuous variable, and another part of it concentrated on particular values like a discrete variable. These classifications are equivalent to the categorization of probability distributions.
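The idea of a random variable as a function on the outcomes of an experiment can be made concrete with the two-dice example. The sketch below assumes two fair six-sided dice and, purely for illustration, takes the random variable X to be the sum of the two faces; neither assumption comes from the text, and the function name `total` is hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Sample space of the experiment: all 36 ordered results of rolling two dice.
sample_space = list(product(range(1, 7), repeat=2))

def total(outcome):
    """Random variable X: maps an outcome such as (1, 2) to the number 3."""
    return outcome[0] + outcome[1]

# Probability distribution of X, assuming every outcome is equally likely.
counts = Counter(total(outcome) for outcome in sample_space)
distribution = {
    value: Fraction(count, len(sample_space))
    for value, count in sorted(counts.items())
}

for value, probability in distribution.items():
    print(value, probability)
```

Because X takes only the eleven values 2 to 12, it is a discrete random variable, and the probabilities in `distribution` add up to exactly 1.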

A random variable is a phenomenon of interest in which the observed outcomes of an activity are determined entirely by chance, are unpredictable and may differ from response to response. Under simple random selection, each possible entity has the same chance of being considered. For instance, lottery drawings are considered to be random drawings so that each number has exactly the same chance of being picked. Similarly, the value of the outcome of a toss of a fair coin is random, since a head or a tail has the same chance of occurring.

A random variable may be qualitative or quantitative in nature. The qualitative random variables yield categorical responses so that the responses fit into one category or another. For example, a response to a question such as ‘Are you currently unemployed?’ would fit in the category of either ‘yes’ or ‘no’. On the other hand, quantitative random variables yield numerical responses. For example, responses to questions such as, ‘How many rooms are there in your house?’ or ‘How many children are there in the family?’ would be in numerical values. Also, these values being whole numbers are considered discrete values. These are the values of discrete quantitative random variables. On the other hand, responses to questions like, ‘How tall are you?’ or ‘How much do you weigh?’ would be the values of continuous quantitative random variables, since these values are measured and not counted. Some examples of these variables are:

(i) Qualitative random variables

Sex of students in the class

Political affiliation of a faculty member in the college

Opinions of economists regarding the economic conditions in the country

(ii) Quantitative random variables

(a) Discrete quantitative random variables

Number of people attending a conference

Number of eggs in the refrigerator

Number of children at a summer camp

(b) Continuous quantitative random variables

Heights of models in a beauty contest

Weights of people joining a diet programme

Lengths of steel bars produced in a given production run
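The three-way split above (qualitative, discrete quantitative, continuous quantitative) can be sketched as a small labelling rule. This is only an illustration under a loud assumption: it uses the Python type of a recorded value as a stand-in for how the value was obtained (text for categories, `int` for counts, `float` for measurements). The function name `variable_kind` and the sample responses are hypothetical.

```python
def variable_kind(value):
    """Return a rough label for one recorded response (illustrative rule only)."""
    if isinstance(value, (bool, str)):
        return "qualitative"              # categorical response, e.g. 'yes' or 'no'
    if isinstance(value, int):
        return "discrete quantitative"    # counted, whole-number value
    if isinstance(value, float):
        return "continuous quantitative"  # measured value
    raise TypeError(f"unsupported response type: {type(value).__name__}")

# Hypothetical survey responses.
responses = {
    "Are you currently unemployed?": "no",
    "How many rooms are there in your house?": 4,
    "How many children are there in the family?": 2,
    "How tall are you (cm)?": 171.5,
    "How much do you weigh (kg)?": 68.2,
}

for question, answer in responses.items():
    print(f"{question} -> {variable_kind(answer)}")
```

The rule is deliberately crude: a count stored as `4.0` would be mislabelled as continuous, so in practice the classification should follow from how the variable is defined, not from how it happens to be stored.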

1.3.2 Classification of Data

When the raw data has been collected and edited, it should be put into an ordered form (ascending or descending order), so that it can be looked at more objectively. The next important step towards processing the data is classification. Classification means separating items according to similar characteristics and grouping them into various classes. The items in different classes will differ from each other on the basis of some characteristics or attributes. Classification of data is very similar to sorting of mail at a post office, where mail is classified according to its geographical destination and may further be classified into the type of mail such as first class, parcel post, and so on. The data may be classified into four broad classes:

(i) Geographical. This classification groups the data according to locational differences among the items. The geographical areas are usually listed in alphabetical order for easy reference. For example, a book listing colleges and universities in various states in the USA would first list the states in alphabetical order and then the colleges and the universities within these states in alphabetical order.

(ii) Chronological. This classification includes data according to the time period in which the items under consideration occurred. For example, the sales of automobiles in India over the last ten years may be grouped according to the year in which such sales took place.

(iii) Qualitative. In this type of classification, the data is grouped together according to some distinguishing characteristic or attribute such as religion, sex, age, national origin, and so on. This classification simply identifies whether a given attribute is present or absent in a given population. For example, the population may be divided into two classes: male and female. Then the attribute of male will go into one class and the attribute of female will go into the other.

(iv) Quantitative. This refers to the classification of data according to some attribute which has magnitude and can be measured such as classification according to weight, height, income, and so on. For example, the salaries of professors at a university may be classified according to their rank such as instructor, assistant professor, associate professor and full professor.

Hence, the collected data should be arranged systematically to give it shape, form and meaning. The division of the data into homogeneous groups according to their characteristics, recorded in a statistical inquiry, is called classification.
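The four kinds of classification can be sketched as grouping the same records by different keys. The records, the helper names `classify` and `income_class`, and the class width of 10,000 are all hypothetical choices made for this illustration.

```python
from collections import defaultdict

# Hypothetical records: (state, year, sex, monthly income in rupees).
records = [
    ("Kerala", 2009, "female", 18000),
    ("Assam", 2010, "male", 32000),
    ("Kerala", 2010, "male", 27000),
    ("Sikkim", 2009, "female", 41000),
]

def classify(items, key):
    """Group items into classes that share the same key value, sorted by class."""
    classes = defaultdict(list)
    for item in items:
        classes[key(item)].append(item)
    return dict(sorted(classes.items()))

def income_class(record, width=10000):
    """Label the income class interval a record falls into (assumed width)."""
    lower = (record[3] // width) * width
    return f"{lower}-{lower + width - 1}"

geographical = classify(records, key=lambda r: r[0])   # by location
chronological = classify(records, key=lambda r: r[1])  # by time period
qualitative = classify(records, key=lambda r: r[2])    # by attribute
quantitative = classify(records, key=income_class)     # by measured magnitude

for label, members in quantitative.items():
    print(label, len(members))
```

Sorting the class labels mirrors the text's remark that geographical areas are usually listed alphabetically; for the chronological grouping the same sort puts the years in order.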


Self-Assessment Questions

3. State whether true or false.

(a) If the data is written in an ascending or descending order, it would be called ordered data.

(b) Items in different classes will differ from each other on the basis of some characteristics or attributes.

4. Fill in the blanks with the appropriate terms.

(a) A ______________ is any characteristic that can assume different values.

(b) Classification means separating items according to similar ________________ and grouping them into various classes.

1.4 Data Sources: Primary vs Secondary

The statistical data, as previously discussed, may be classified under two categories depending upon the sources utilized. These categories are:

1. Primary Data. Primary data is data collected by the investigator himself for the purpose of a specific inquiry or study. Such data is original in character and is generated by surveys conducted by individuals or research institutions. For example, if a researcher is interested in knowing what women think about the issue of abortion, he/she must undertake a survey and collect data on the opinions of women by asking relevant questions. Such data collected would be considered as primary data.

2. Secondary Data. When an investigator uses data which has already been collected by others, such data is called secondary data. This data is primary data for the agency that collected it and becomes secondary data for someone else who uses this data for his own purposes. Secondary data can be obtained from journals, reports, government publications, publications of professional and research organizations and so on. For example, if a researcher desires to analyse the weather conditions of different regions, he can get the required information or data from the records of the meteorology department. Secondary data is less expensive to collect in terms of money and time, and its quality may even be better under certain situations, because it may have been collected by persons who were specifically trained for that purpose. However, such secondary data must be used with utmost care. The reason is that such data may be full of errors due to the fact that the purpose of the collection of data by the primary agency may have been different from that of the user of the secondary data. Additionally, there may have been biases introduced during collection of data or analysis of data. For example, the size of the sample may have been inadequate or there may have been arithmetical or definitional errors. Hence, it is necessary to critically investigate the validity of secondary data as well as the credibility of the primary data collection agency.

Sources of Data

The following are some of the sources of data for collecting first hand information.

Census

World Bank

WHO (World Health Organization)

NSSO (National Sample Survey Organization)

Economic Survey

National Family and Health Surveys

SRS Surveys

Multiple Indicator Survey

CSO, RBI, Gov.nic.in, CMIE

Since the quality of the results obtained from statistical data for the purpose of using these outcomes for managerial decision-making depends upon the quality of the collected information itself, it is important that a sound investigative process be established to ensure that the data is highly representative and unbiased. This requires a high degree of skill and also certain precautionary measures are to be taken.

Activity 1

Collect first hand information from five families in your neighbourhood on education, health and economic status. Tabulate the data as qualitative or quantitative. Also classify the attributes as per the four measurement scales.


Self-Assessment Questions

5. Fill in the blanks with the appropriate terms.

(a) ___________ data is one which is collected by the investigator himself for the purpose of a specific inquiry or study.

(b) When an investigator uses the data which has already been collected by others, such data is called ________________ data.

6. Choose the right answer from the given options.

(a) To collect first hand information, we use ________________.

(i) Census (ii) Interview

(iii) Observation (iv) Questionnaire

(b) It is necessary to critically investigate the validity of ___________ data.

(i) World Bank (ii) Census

(iii) Secondary (iv) Primary

1.5 Research Procedure

In general, all data, whether qualitative or quantitative, is measured in someform. Even discrete quantitative data which is counted can fit into some form ofmeasurement. There are four widely accepted levels of measurement. Theselevels, from the weakest on the one extreme to the strongest on the other, inorder are: Nominal scale, Ordinal scale, Interval scale and Ratio scale. Beforediscussing these various measurement levels, let us look at some of the attributespossessed by these scales.1

The scales are explained later.

(i) Magnitude. This is the quantitative value that exists or is assigned to an attribute or characteristic and such values, when compared, will determine whether the value of a given attribute in one case is greater than, equal to or less than the value of the same attribute in another case. For example, if student X gets 100 per cent marks in the final examination in a course and student Y gets 40 per cent in the same exam, then student X may be considered as more knowledgeable in that area than student Y.

(ii) Equal intervals. Some measurement scales are constructed in such a manner that the interval between any two points along the scale has the same magnitude as an equal interval between any other two points along the same scale. For example, the difference in heights of students between 60 inches and 63 inches is the same in magnitude as the difference between 70 inches and 73 inches. This means that the value of the magnitude is 3 inches, no matter where such interval is measured on the scale. There may be some exceptions to this rule. For instance, the value of the difference between an IQ of 180 and 190 may be different from the value of the difference between an IQ of 80 and 90, even though, numerically, both these differences have the same value.

(iii) Absolute zero point. The third attribute of the measurement scale is the presence or absence of a zero point at which the attribute has no value at all. For example, the calendar year does not have an absolute zero point, since the year chosen as zero is a matter of convention and does not mean the absence of time. On the other hand, the number of TV sets in a family can have an absolute zero value if the family has no TV set at all. In some unique cases we may assign a zero value to an attribute for qualitative comparison purposes even when the value of such an attribute is a positive quantitative number. For example, we may say that an unintelligent person has zero intelligence, even though it does not mean absolute zero.

1.5.1 Measurement of Scale

In light of the three attributes, let us now consider and discuss the four measurement scales.

(i) Nominal scale. Applied to qualitative data only, it is also known as classificatory scale, where the objects or items are classified into various discrete and distinct groups or categories without any ranking or order associated with such classified data. It does not possess any of the three attributes discussed earlier: magnitude, equal intervals and absolute zero point. It is the weakest form of measurement, so much so that some statisticians do not consider it as a scale at all. Examples of nominal scale would be categorizing people according to their religion such as Christian, Muslim, Hindu and so on, or according to their political affiliation such as Democrat, Republican or Socialist. Other categories of nominal scale may be smoking or non-smoking, ownership of house or no ownership of house, and so on.

(ii) Ordinal scale. Also known as ranking scale, it possesses only the attribute of magnitude. This means that various categories of items can be compared with each other only in order of rank assigned to these categories. However, these ranks only indicate which category is greater or better, but do not indicate the magnitude of the difference among these categories. For example, the students in a class may be categorized according to their grades of A, B, C, D and F where A is better than B, and so on, and the classification is from the highest grade to the lowest grade. Another example of ordinal scaling would be the classification of teaching faculty ranks in the colleges as full professors, associate professors, assistant professors and instructors.

(iii) Interval scale. The interval scale measures the values of quantitative random variables and identifies not only which category is greater or better but also by how much. It is a stronger form of measurement and possesses two attributes, which are magnitude and equal intervals. It does not possess, however, the absolute zero point. Temperature measured in degrees Celsius or Fahrenheit and calendar dates are examples of interval scale.

(iv) Ratio scale. The ratio scale is also used for measurement of quantitative random variables, but it differs from interval scale in that it has a true zero point, meaning that a value of zero denotes the complete absence of the attribute. It makes mathematical manipulations such as divisions and multiplications meaningful. Examples of ratio scale are weight, income, temperature on the Kelvin scale, number of students registered in various classes, and so on. A temperature of zero on the Kelvin scale means the total absence of heat and it is also possible that zero students are registered for a given class. Weight is likewise measured on a ratio scale: a value of zero is never observed for a person, yet zero denotes the complete absence of weight, so a person weighing 80 kg is twice as heavy as one weighing 40 kg.

These measurement scales assist in designing survey methods for the purpose of collecting relevant data.
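The practical difference between the four scales lies in which summaries are meaningful for data measured on each of them. The following is a minimal Python sketch, not part of the original text, using made-up observations; all variable names and values are assumptions chosen purely for illustration.

```python
from statistics import mode

# Hypothetical observations, one variable per measurement scale.
religion = ["Hindu", "Muslim", "Hindu", "Christian", "Hindu"]  # nominal
grades = ["A", "C", "B", "B", "F"]                             # ordinal
survey_years = [1991, 2001, 2011]                              # interval
students_per_class = [0, 12, 24, 48]                           # ratio

# Nominal: categories can only be counted, so the mode is the usual summary.
most_common_religion = mode(religion)

# Ordinal: categories can be ranked, so a middle (median) grade is meaningful.
grade_rank = {"F": 0, "D": 1, "C": 2, "B": 3, "A": 4}
ranked_grades = sorted(grades, key=grade_rank.get)
median_grade = ranked_grades[len(ranked_grades) // 2]

# Interval: differences are meaningful, but ratios are not
# (the year 2011 is not "1.01 times" the year 1991 in any useful sense).
year_gap = survey_years[-1] - survey_years[0]

# Ratio: a true zero makes statements such as "twice as many" meaningful.
enrolment_ratio = students_per_class[3] / students_per_class[2]

print(most_common_religion, median_grade, year_gap, enrolment_ratio)
```

Each step uses only the operations that the corresponding scale supports: counting for nominal data, ranking for ordinal data, subtraction for interval data and division for ratio data.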

1.5.2 Methods of Collecting Primary and Secondary Data

Planning the Study

Before any procedures for data collection are established, the purpose and the scope of the study must be clearly specified. If any similar studies have been conducted prior to the current one, then the investigator may want to use some secondary data in his own study, and may redefine his objectives on the basis of the previous studies conducted. The scope of the study must take into consideration the field to be covered, and the time period in which to conduct the study. The time span is very important, because in certain areas, the conditions change very quickly, and hence, by the time the study is completed, it may become irrelevant. The statistical units and the desired accuracy of such units must be clearly specified.

Methods of Collecting Primary Data

Primary data is collected by the investigator for a specific study. This data should be unique in nature and should be kept secret until it is published. The following are the methods of collecting primary data.

Questionnaires: These are the most popular means of collecting primary data. A questionnaire is designed for a specific problem and can be used, for example, for interviewing or for a telephone survey. It can be posted, e-mailed or faxed and can be used for a large number of people or organizations. It does not require prior arrangements and there is no interviewer bias. The questionnaire must not be too long, too complex, uninteresting or too personal. The questions asked must be simple so that the respondent can read all questions and reply. The basic subject of the questionnaire must be made clear in a covering letter. The researcher must state his/her identity, why the data is being collected and a declaration of confidentiality and anonymity. A request and instructions to return the duly filled questionnaire must be mentioned with the return date. You can make a request as, ‘It would be greatly appreciated if you could return the completed questionnaire by.......... if it is possible.’

Interviews: This is a technique basically used to know the mind-set, likings or behaviour of the person being interviewed. Interviews can be conducted on a personal one-to-one basis or in a group. Interviews can be of structured, semi-structured and unstructured types. The structured type is based on a cautiously worded interview plan. In the semi-structured type, the interview is based on questions that provide scope to the respondent to answer at length. The unstructured type is also termed an in-depth interview. The interviewer starts with general questions to encourage the respondent to talk without restraint. For conducting an interview the researcher has to prepare a list of topics on which the information is required, select the type of interview, frame the relevant questions and then fix an appointment with the respondent.

Telephone interview: This is a type of interview which is conducted over the telephone instead of on a personal or face-to-face basis. It gives a high response rate and the answers can be taped for keeping a record. This method can be used only if the respondent has a telephone.

Focus group interviews: This type of interview is conducted by a qualified representative on a small group of respondents in a non-structured and natural manner. The representative leads the conversation and the main idea is to get insights by carefully listening to a small selected group of people on specific subjects.

Observation: In this method the behavioural styles of specific people, objects and happenings are recorded in a systematic way. Observational methods can be structured or unstructured, disguised or undisguised, natural, personal, mechanical, participant and non-participant. In structured observation, the researcher decides what is to be observed and how the observed records will be analysed, while in unstructured observation the researcher observes all phases of the event and then records the relevant ones. In personal observation, a researcher watches the real behaviour as it happens. In participant observation, the researcher becomes a part of the group being investigated, while in non-participant observation the researcher does not communicate with the group being observed.

Methods of Collecting Secondary Data

The chief sources of secondary data may be broadly classified into the following two groups:

(i) Published sources

(ii) Unpublished sources

(i) Published sources: There are a number of national organizations and international agencies which collect and publish statistical data relating to business, trade, labour, price, consumption, production, etc. These publications are useful sources of secondary data. Some of these published sources are as follows:

1. Official publications of the Central and State Governments such as monthly abstract of statistics, national income statistics and vital statistics of India.

2. Publications of semi-government organizations, e.g., the Reserve Bank of India bulletin.

3. Publications of research institutions, e.g., the publications of the Indian Council of Agricultural Research (ICAR), New Delhi.

4. Publications of commercial and financial institutions, e.g., the publications of the FICCI.

5. Reports of various committees and commissions appointed by the government, such as the Wanchoo Commission Report on Taxation.

6. Newspapers and periodicals like Economic Times and Statesman Yearbook also publish useful statistical data.

7. International publications like the U.N. Statistical Yearbook and Demographic Yearbook.

(ii) Unpublished sources: These include the records maintained by private firms or business houses, which may not like to release their data to any outside agency, and the studies carried out by research scholars in universities or research institutes. Both may provide useful statistical data.

Precautions in the use of secondary data: Secondary data should be used with extra caution since they have been collected by someone other than the investigator. Before using such data, the investigator must be satisfied in regard to the reliability, accuracy, adequacy and suitability of the data to the given problem under investigation. Before using secondary data, the investigator should examine the following questions.

1. Is the data suitable for the purpose of investigation? For this, he should compare the objectives, the nature and the scope of the given enquiry with the original investigation. He should also confirm that the various terms and units were clearly defined and were uniform throughout the earlier investigation and these definitions are suitable for the present enquiry as well.

2. Is the data reliable? For this, the investigator himself should be satisfied about the following:

(i) The reliability, integrity and experience of the collecting organization

(ii) The reliability of the source of information

(iii) The methods used for collection and analysis of the data.

(iv) The degree of accuracy desired by the company.

3. Is the data adequate? Adequacy of data is to be judged in the light of the requirements of the survey and the geographical areas covered by the available data. Adequacy of data is also to be considered in the light of the time period for which the data is available.

Hence, in order to arrive at conclusions free from limitations and inaccuracies, the secondary data must be subjected to thorough scrutiny and editing before it is accepted for use.
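The adequacy question above, whether the available data covers the required geographical areas and time period, can be checked mechanically once the coverage of a source is written down. The following Python sketch is only an illustration under stated assumptions: the `SecondarySource` record, the `is_adequate` helper and the sample bulletin are all hypothetical and do not describe any real publication.

```python
from dataclasses import dataclass


@dataclass
class SecondarySource:
    """Hypothetical description of what a secondary source covers."""
    name: str
    first_year: int
    last_year: int
    regions: set


def is_adequate(source, required_years, required_regions):
    """Return True if the source covers every required year and region."""
    start, end = required_years
    covers_period = source.first_year <= start and source.last_year >= end
    covers_area = set(required_regions) <= source.regions
    return covers_period and covers_area


# Made-up source covering two states for the years 2001 to 2008.
bulletin = SecondarySource("Hypothetical state bulletin", 2001, 2008,
                           {"Sikkim", "Assam"})

# Adequate: the survey needs 2003 to 2007 for Sikkim only.
adequate = is_adequate(bulletin, (2003, 2007), {"Sikkim"})

# Not adequate: the survey needs data up to 2010, but the source ends in 2008.
too_short = is_adequate(bulletin, (2003, 2010), {"Sikkim"})
```

Suitability and reliability still call for the investigator's judgement; only the coverage part of the scrutiny lends itself to a check of this kind.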

Activity 2

Observe a group of students participating in a debate competition. Collect data on their behaviour and categorize them as most active, moderately active and less active.

Self-Assessment Questions

7. Fill in the blanks with the appropriate terms.

(a) _____________ is the quantitative value that exists or is assigned to an attribute or characteristic.

(b) Absolute zero point refers to the _____________ which has no value at all on measurement scale.

8. State whether true or false.

(a) Nominal scale is the weakest form of measurement so that some statisticians do not consider it as a scale at all.

(b) In the observation method, the behavioural styles of specific people, objects and happenings are recorded in an unsystematic way.

1.6 Summary

Let us recapitulate the important concepts discussed in this unit:

Information plays a vital role in decision making. It is provided by the information system set up in the organization. The management depends on information systems for effective decision-making. Information consists of data (facts and figures) which is processed and retrieved to be used for forecasting and decision-making.

The information should be context specific and available when it is required.

When information is specifically arranged according to the requirement or problem, it is termed as knowledge.

Data comprise the numerical results of any measurement. Data can also be used in singular sense, such as a set of data.

A variable is any characteristic that can assume different values. There are two types of variables: discrete variable and continuous variable.

A random variable is a phenomenon in which the observed outcomes of an activity are determined entirely by chance, are unpredictable and may differ from response to response. By definition of randomness, each possible entity has the same chance of being considered. A random variable may be qualitative or quantitative in nature.

Classification means separating items according to similar characteristics and grouping them into various classes. The data may be classified into four broad classes as (i) geographical, (ii) chronological, (iii) qualitative and (iv) quantitative.

The statistical data may be classified under two categories depending upon the sources utilized as primary data and secondary data.

Primary data is data that is collected by the investigator himself for the purpose of a specific inquiry or study. Such data is original in character and is generated by surveys conducted by individuals or research institutions.

When an investigator uses the data which has already been collected by others, such data is called secondary data. This data is primary data for the agency that collected it and becomes secondary data for someone else who uses this data for his own purposes.

The various sources which give first hand information to collect data are Census, World Bank, WHO, NSSO, Economic Survey, Demographic and Health Surveys, etc.

In general, all data, whether qualitative or quantitative, is measured in some form. There are four widely accepted levels of measurement. These levels, from the weakest on the one extreme to the strongest on the other, in order are nominal scale, ordinal scale, interval scale and ratio scale.

1.7 Glossary

Data: Numerical results of any measurement

Variable: Any characteristic that can assume different values

Random variable: A qualitative or quantitative phenomenon in which the observed outcomes of an activity are determined entirely by chance, are unpredictable and may differ from response to response.

Primary data: Data collected by the investigator for the purpose of a specific inquiry or study. The data is original in character and is generated by surveys conducted by individuals or research institutions.

Secondary data: When an investigator uses the data which has already been collected by others, then the data is secondary data for the investigator but it remains primary data for those who collected it. It is obtained from journals, reports, government publications, etc.

1.8 Terminal Questions

1. What role does information play in decision-making?

2. How is information evaluated and processed? Explain with the help of examples.

3. Define the various types of data with the help of examples.

4. Explain the four categories of data classification with the help of examples.

5. Differentiate between primary data and secondary data. Under what circumstances would secondary data be more useful than primary data?

6. Describe in detail the four types of measurement scales. Illustrate your explanation with examples.

7. What are the various modes of data collection? Under what circumstances would each method be more suitable as compared to other methods? Give reasons for your beliefs.

1.9 Answers

Answers to Self-Assessment Questions

1. (a) Information; (b) Knowledge

2. (a) True; (b) False

3. (a) True; (b) True

4. (a) Variable; (b) Characteristics

5. (a) Primary; (b) Secondary

6. (a) i; (b) iii

7. (a) Magnitude; (b) Attribute

8. (a) True; (b) False

Answers to Terminal Questions

1. Refer to Section 1.2

2. Refer to Sections 1.2.2 and 1.2.3

3. Refer to Section 1.3

4. Refer to Section 1.3.2

5. Refer to Section 1.4

6. Refer to Section 1.5

7. Refer to Section 1.5.2

1.10 Further Reading

1. Elhance, D.N. Fundamentals of Statistics. Allahabad: Kitab Mahal, 2002.

2. Gupta, S.C. Fundamentals of Business Statistics. New Delhi: Sultan Chand & Sons, 2007.

Endnote

1. Aggarwal, Y.P. Statistical Methods, New Delhi: Sterling Publishers, 1986, p.5.

Unit 2 Data Collection Methods

Structure

2.1 Introduction
Objectives
2.2 Observation
2.3 Questionnaire
2.4 Interviews
2.5 Experiments
2.6 Summary
2.7 Glossary
2.8 Terminal Questions
2.9 Answers
2.10 Further Reading

2.1 Introduction

In the previous unit, you learnt about information and data sources. Data sources help in collecting data.

In this unit, you will learn about the various data collection methods. The unit describes the advantages and shortfalls of various types of observations. You will also learn about the process of preparing a questionnaire, what should be kept in mind while drafting it and what pattern of questions should be adopted, i.e., dichotomous, multiple choice or open questions. Also, you will learn about the different modes of interviews along with their merits and demerits. Accurate records have to be made to keep people updated about the current scenario of the society. As there are several methods of data collection, the methods that consume the least amount of time are put into use. Data collecting techniques such as questionnaires and interviews play a vital role in collecting a large amount of information in a short period of time and hence have been discussed in this unit. Experiments are resorted to when it is necessary to collect factual data and nothing is available for reference. An experiment may also be conducted to verify a theory. An experiment is a study conducted under controlled conditions.

Objectives

After studying this unit, you should be able to:

Prepare a questionnaire

Explain the significance of interviews

Discuss other modes of data collection along with their advantages

Explain the importance of experiments

2.2 Observation

Observation may be defined as recording behavioural patterns without verbal communication.

Primary data can be collected using the following method.

Direct personal observation. Under this method, the investigator presents himself personally before the informant and obtains first hand information. This method is most suitable when the field of enquiry is small and a greater degree of accuracy is required.

We shall now see the merits and limitations of the observation method.

Merits

(i) The first hand information obtained by the investigator is bound to be more reliable and accurate since the investigator can extract the correct information by removing doubts, if any, in the minds of the respondents regarding certain questions.

(ii) High response rate, since the answers to various questions are obtained on the spot.

(iii) It permits explanation of questions concerning difficult subject matter.

(iv) It permits evaluation of respondent, his circumstances and reliability.

(v) This method is useful where spontaneity of response is required.

(vi) It provides personal rapport, which helps to overcome reluctance to respond.

(vii) Where the investigator and the informant talk face to face, it becomes possible to explore questions in depth.

(viii) Information is collected promptly and there is no dribbling.

Limitations

(i) This method is suitable only for intensive studies and not for extensive enquiries.

(ii) This method is time-consuming and the investigation may have to be spanned over a long period.

(iii) This method is highly subjective in nature and the results of the enquiry may be adversely affected by the personal bias, whim and prejudices of the investigator.

Activity 1

Find a situation when direct personal observation is the perfect method for data collection.

Self-Assessment Questions

1. Fill in the blanks with the appropriate terms.

(a) Observation may be defined as recording ___________ patterns without verbal communication.

(b) Direct personal observation method is most suitable when the field of ____________ is small and a greater degree of accuracy is required.

2. State whether true or false.

(a) Direct personal observation does not permit explanation of questions concerning difficult subject matter.

(b) Direct personal observation provides personal rapport, which helps overcome reluctance to respond.

2.3 Questionnaire

The questionnaire method can be used either by mailing the questionnaires or by sending them through enumerators.

2.3.1 Mailed Questionnaire Method

Under this method, the investigator prepares a questionnaire containing a number of questions pertaining to the field of enquiry. These questionnaires are sent by post to the informants together with a polite covering letter explaining in detail the aims and objectives of collecting the information and requesting the respondents to cooperate by furnishing the correct replies and returning the questionnaire duly filled in. In order to ensure quick response, the return postage expenses are usually borne by the investigator. This method is usually adopted by research workers, private individuals and non-official agencies. The success of this method depends upon the proper drafting of the questionnaire and the cooperation of the respondents.

Merits

(i) By this method, a large field of investigation may be covered at a very low cost. In fact, this is the most economical method in terms of time, money and manpower.

(ii) Errors due to personal bias of the investigators or enumerators are completely eliminated as the information is supplied by the person concerned in his own handwriting.

Limitations

(i) This method can be used only if the respondents are educated and can understand the questions well, and reply in their own handwriting.

(ii) Sometimes, the informants may not send back the schedules and even if they return the schedules, they may be incorrectly filled in.

(iii) Sometimes, the informants are not willing to give written information in their own handwriting on certain personal questions like income, personal habits and property.

(iv) There is no scope for asking supplementary questions for cross-checking of the information supplied by the respondents.

2.3.2 Questionnaire Sent Through Enumerators

Under this method, instead of sending the questionnaire through post, the investigator appoints agents known as enumerators, who go to the respondents personally with the questionnaire, ask them the questions given therein, and record their replies. This method is generally used by business houses, large public enterprises and research institutions.

Merits

(i) The information collected through this method is more reliable as the enumerators can explain in detail the objectives and aims of the enquiry to the respondents and win their cooperation.

(ii) Since the enumerators personally call on the respondents, there is very little non-response.

(iii) This technique can be used with advantage even if the respondents are illiterate.

(iv) The enumerators can effectively check the accuracy of the information supplied through some intelligent cross-questioning by asking supplementary questions.

Limitations

(i) The method is more expensive and can only be used by financially strong bodies or institutions.

(ii) It is more time-consuming than the mailed questionnaire method.

(iii) The success of the method depends on the skill and efficiency of the enumerators who collect the information and also on the efficiency and wisdom with which the questionnaire is drafted.

2.3.3 Drafting or Framing the Questionnaire

Since the questionnaire is the only medium of communication between the investigator and the respondents, it must be designed or drafted with utmost care and caution so that all the relevant and essential information for the enquiry may be collected without any difficulty, ambiguity or vagueness. Designing of questionnaire, therefore, requires a high degree of skill and experience on the part of the investigator. No hard and fast rules can be laid down for designing or framing a questionnaire. However, it would help if the following general points are borne in mind while drafting a questionnaire:

1. The size of the questionnaire should be as small as possible. The number of questions should be kept to the minimum keeping in view the nature, objectives and purpose of enquiry. Respondents’ time should not be wasted by asking irrelevant and unimportant questions. Fifteen to twenty-five may be regarded as a fair number. If a larger number of questions is unavoidable in any enquiry, the questionnaire should preferably be divided into two or more parts.

2. Questions should be clear, brief, unambiguous, non-offending, courteous in tone, corroborative in nature and to the point.

3. Questions should be logically arranged.

4. Questions should be short, simple and easy to understand. The usage of vague or multiple meaning words should be avoided. Unless the respondents are technically trained, the use of technical terms should be avoided.

5. Questions should be so designed that the respondents can easily comprehend and answer them. Questions involving mathematical calculations should not be asked.

6. Questions of sensitive or personal nature should be avoided.

7. The questionnaire should provide necessary instructions to the enumerators.

8. If a particular question needs clarification, it should be explained by way of a footnote.

9. Questions should be capable of objective answer. Various types of questions in the questionnaire may be grouped under three categories:

(i) Dichotomous or simple alternate questions in which the respondent has to choose between two clear-cut alternatives like ‘Yes’ or ‘No’, ‘Right’ or ‘Wrong’, ‘Either’, ‘Or’, and so on. This technique can be applied elegantly in situations where two clear-cut alternatives exist.

(ii) Multiple choice questions in which the respondent is asked to select one out of a number of responses. All possible answers to a question are listed and the respondent chooses one of these. Such questions save time and facilitate tabulation. This method should be used only if a few alternative answers exist to a particular question.

(iii) Open questions are those in which no alternative answers are suggested and the respondents are free to express their frank and independent opinions on the problem in their own words usually in an essay form.

10. Cross-checks: The questionnaire should be so designed as to provide a cross-check on the accuracy of the information supplied by the respondents by including some connected questions.

11. Pre-testing the questionnaire: The questionnaire should be tried on a small group before using it for the given enquiry. This will help in improving or modifying the questionnaire in the light of the drawbacks, shortcomings and problems faced by the investigator in the pre-test.

12. A covering letter, stating briefly the aims and objectives of the enquiry, soliciting cooperation of the respondents, and explaining various terms and concepts, should be enclosed along with the questionnaire.

13. In case of a mailed questionnaire method, a self-addressed stamped envelope should be enclosed.

14. To ensure quick response, the respondents may be offered incentives in the form of gift coupons, a sample of the product to be introduced, or a promise to supply a copy of the findings after the survey work is over.

15. Method of tabulation and analysis, whether hand-operated, machine-operated or computerized, should also be kept in mind while designing the questionnaire.

16. Lastly, the questionnaire should be made attractive by a proper layout and an appealing get-up.
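The three categories of questions in point 9 and the computerized tabulation mentioned in point 15 can be illustrated with a short Python sketch. It is only one possible arrangement, written under assumptions: the `Question` class, the `tabulate` helper, the rule that a multiple choice question lists more than two options, and the sample answers are all hypothetical and are not taken from the text.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Question:
    """A hypothetical questionnaire item of one of the three categories."""
    text: str
    kind: str  # "dichotomous", "multiple_choice" or "open"
    options: list = field(default_factory=list)

    def __post_init__(self):
        if self.kind not in ("dichotomous", "multiple_choice", "open"):
            raise ValueError("unknown question category: " + self.kind)
        if self.kind == "dichotomous" and len(self.options) != 2:
            raise ValueError("a dichotomous question needs exactly two options")
        # Assumption: "a number of responses" is read as more than two.
        if self.kind == "multiple_choice" and len(self.options) < 3:
            raise ValueError("a multiple choice question needs several options")
        if self.kind == "open" and self.options:
            raise ValueError("an open question suggests no answers")

    def accepts(self, answer):
        """Closed questions accept listed options; open ones any non-blank text."""
        if self.kind == "open":
            return bool(answer.strip())
        return answer in self.options


def tabulate(question, answers):
    """Count the answers to a question, grouping unusable ones as 'invalid'."""
    counts = Counter(a if question.accepts(a) else "invalid" for a in answers)
    return dict(counts)


new_or_used = Question("Did you buy this car new or used?",
                       "dichotomous", ["New", "Used"])
table = tabulate(new_or_used, ["New", "Used", "New", "Maybe"])
print(table)
```

Closed questions tabulate directly because every valid answer is one of the listed options, which is why they "save time and facilitate tabulation"; answers to open questions would first have to be coded into categories by the investigator.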

2.3.4 A Specimen Questionnaire

This hypothetical study is adapted from a study developed by Deepak Mahendru in India. Assume that this study involves 200 professors in New York colleges who are asked about their interest in buying automobiles. The basic objective of this survey is to determine certain marketing trends among the population of professors in New York regarding their automobile buying patterns. The survey is based upon the following factors:

The profile of the decision-maker who finally decides to buy a particular type of car.

People around the decision-maker who influence the decision-making process.

The factors affecting the selection of a particular dealer of cars.

People in the family who make or affect decisions regarding the maximum budget that can be allocated for purchasing a car.

The effect of various options available in the car.

The image and reliability of the company that makes these cars.

The effect of heavy promotion on television about the utility of the car on the decision-maker.

(For the sake of simplicity, it is assumed that the professors have only one car in the family.)

The Questionnaire

1. General

Name: ...................................................................................

Age: ......................................................................................

Sex: M .................... F ....................

Marital status: Married .................. Unmarried .................

Number of members in the family

1–2...................

3–4...................

5–6...................

Over 6..............

Yearly income

Less than 30,000...................

30,000– 39,999......................

40,000– 49,999......................

50,000 and more...................

2. What type of car do you own now?

.................Indian

.................Japanese

.................European

3. What size of car do you own?

.................Luxury

.................Mid-size

.................Compact

4. Did you buy this car new or used?

.................New....................Used

5. If you bought a used car, did you buy it from a dealer or a private party?

.................Dealer.................Private party


6. If you bought a new car, how long have you owned this car?

.................Number of years

7. If you bought a used car, how old is this car now?

..............Number of years

8. Price paid for the car..........New..........Used

9. Who influenced your decision to purchase the above brand of car? Indicate if more than one.

...............Yourself ...................... Your wife

...............Your children ...................... Your friend

...............Your neighbour ...................... Your colleague

Others.................................................................................. .

10. Indicate as to who decided about the budget allocation for the car.

...............Yourself

...............Your spouse

...............Family decision

11. If you bought your car from a dealer, then who influenced your decision regarding the selection of a particular dealer?

...............Yourself

...............Your friend

...............Your colleague

...............Family decision

12. How did you come to know about this dealer?

...............TV commercial

...............Newspapers

...............Personal references

...............Others

13. Rank the following factors that affected the final decision at the time of purchasing the car. A rank of 1 measures the most important factor, a rank of 2 measures the second most important factor, and so on.

...............Very inconvenient without the car

...............Money was available


...............Reputation of car manufacturer

...............Discounts offered

...............Interest rate on financing

...............Guarantees and warranties offered

...........................Others

14. Did you make an extensive survey regarding price comparisons after you decided to buy the particular car? ............ Yes......... No

15. If you bought a used car, how did you learn about it?

............ Newspapers ...............Friend ............... Others

16. In order of preference, what were the major reasons for buying a used car?

...............Unavailability of adequate funds

...............Cheaper insurance

...............Lack of parking garage

...............Condition of the car

...............Others

17. Which of the following media do you think is most effective in creating an impact on the potential customer with regard to a particular brand of car?

................TV ...............Newspapers

................Magazines ...............Favourable news reports

................Word of mouth ...............Others

The responses to such questions would form the basis of analysis in order to achieve the set marketing objectives.

Activity 2

Draft a questionnaire to collect data on any awareness program, such as polio awareness.


Self-Assessment Questions

3. State whether true or false.

(a) Mailed questionnaires are sent by post to the informants together with a polite covering letter explaining in detail the aims and objectives of collecting the information and requesting the respondents to cooperate by furnishing the correct replies and returning the questionnaire duly filled in.

(b) Designing of questionnaire requires a high degree of skill and experience on the part of the investigator.

4. Fill in the blanks with the appropriate terms.

(a) If a particular question needs clarification, it should be explained by way of a __________________.

(b) Questions should be ________________ arranged.

2.4 Interviews

Indirect personal interview. Under this method, instead of directly approaching the informants, the investigator interviews several third persons who are directly or indirectly concerned with the subject matter of the enquiry and who are in possession of the requisite information. Such a procedure is followed by the enquiry committees and commissions appointed by the Government of India. The committee selects persons, known as witnesses, and collects information from them by getting answers to questions decided in advance. This method is highly suitable where direct personal investigation is not practicable, either because the informants are unwilling or reluctant to supply information or because the information desired is complex and the study in hand is extensive.

Merits

(i) This method is less costly and less time-consuming than direct personal investigation.

(ii) Under this method, the enquiry can be formulated and conducted more effectively and efficiently as it is possible to obtain the views and suggestions of the experts on the given problem.


Limitations

The success of this method depends upon:

(i) The representative character of the witnesses.

(ii) The personal knowledge of the witnesses about the subject matter of enquiry.

(iii) The personal prejudices of the witnesses as regards definiteness in stating what is wanted.

(iv) The ability of the interviewer to extract information from the witnesses by asking appropriate questions and cross-questions.

2.4.1 Other Methods

Telephone survey. Under this method, the investigator, instead of presenting himself before the informants, contacts them on telephone and collects information from them.

Merits

(i) The method is more convenient than personal interview.

(ii) This method is less time-consuming and can be applied even to extensive fields of enquiry. Telephone survey has all the other merits of personal interview.

Limitations

(i) This method excludes those who do not have a telephone as well as those who have unlisted telephones.

(ii) This method is also subjective in nature, and the personal bias, whims and prejudices of the investigator may adversely affect the results of the enquiry.

Information received through local agents. Under this method, the information is not collected formally by the investigator; instead, local agents, commonly known as correspondents, are appointed in different parts of the area under investigation. These agents collect information in their areas and transmit the same to the investigator. They apply their own judgement as to the best method of obtaining information. This method is usually employed by newspapers or periodical agencies which require information in different fields such as economic trends, business, stock and share markets, sports, politics and so on.


Merits

(i) This method is very cheap and economical for extensive investigations.

(ii) The required information can be obtained expeditiously since only rough estimates are required.

Limitations

(i) Since the correspondents apply their own judgement about the method of collecting the information, the results are often vitiated by the personal prejudices and whims of the correspondents. The data so obtained is thus not very reliable.

(ii) This method is suitable only if the purpose of investigation is to obtain rough and approximate estimates. It is unsuited where a high degree of accuracy is desired.

Activity 3

How will you conduct an interview if the person is not ready to give it? Give an example.

Self-Assessment Questions

5. Fill in the blanks with the appropriate terms.

(a) The committee selects persons, known as ____________, and collects information from them by getting answers to questions decided in advance.

(b) The local agents collect information in their areas and ____________ the same to the investigator.

6. Fill in the blanks with the appropriate terms.

(a) The success of the interview method depends upon the ______________ character of the witnesses.

(b) The telephone survey method is more convenient than personal ________________.


2.5 Experiments

Experiments are another method of collecting data. They are resorted to when factual data must be collected and nothing is available for reference. An experiment may also be conducted to verify a theory. It is a study conducted under controlled conditions. Experiments are made by researchers to understand cause-and-effect relationships. Such relationships are also examined in observational studies, but there the researcher has no control over how subjects are assigned to groups.

Experimental design

An experimental design consists of information-gathering exercises in which the variation is under the control of the experimenter. In observational studies, there is no control on conditions. Usually, an experimenter wants to know the effect of some process on certain objects, which are taken as ‘experimental units’. Such units may be a small section of people, a few groups, and so on. Experimental design finds broad application in the natural and social sciences.

A random design is very helpful in situations where a large amount of outcome data has to be analysed. The word ‘experiment’ or ‘random experiment’ is used when we face an uncertain situation and need to make some observations about it. Random does not imply haphazard; care is needed to ensure that appropriate random methods are used. The actual result of the uncertain situation is referred to as an outcome or sample point. In a random experiment, nothing can be said with certainty about the outcome. An experiment may comprise one or more observations. If there is a single observation, we use the term random trial or simply trial. An electric fan, for example, may be selected from a factory to examine whether or not it is defective. Selecting a single fan is a trial. We can select as many fans as we wish; the number of observations equals the number of fans selected. The properties of a random experiment may be listed as follows:

We can repeat the experiment any number of times.

A random trial comprises at least two possible outcomes.

We cannot say anything with certainty about the outcome of the random trial or random experiment.

There are three things in common in all statistical experiments.

1. The experiment can have many possible outcomes.

2. We can specify each possible outcome in advance.

3. The outcome of the experiment is dependent on chance.


A coin toss, for example, has all the attributes of a statistical experiment. In this case, there is more than one possible outcome. It is possible to specify each possible outcome (i.e., heads or tails) in advance. Finally, there is an element of chance, since the outcome is uncertain.
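The three attributes of the coin toss can be illustrated with a short simulation. This is only a sketch: the function names `toss_coin` and `run_experiment` and the seed value are assumptions made for illustration, and Python's pseudo-random generator stands in for a physical coin.

```python
import random

# The possible outcomes can be specified in advance.
OUTCOMES = ["heads", "tails"]

def toss_coin(rng):
    """One random trial: the result depends on chance."""
    return rng.choice(OUTCOMES)

def run_experiment(n_trials, seed=0):
    """Repeat the trial n_trials times; the seed only makes the run repeatable."""
    rng = random.Random(seed)
    return [toss_coin(rng) for _ in range(n_trials)]

outcomes = run_experiment(1000, seed=42)
share_heads = outcomes.count("heads") / len(outcomes)
print(f"share of heads in {len(outcomes)} tosses: {share_heads:.3f}")
```

No single toss can be predicted, but over many repetitions the share of heads is expected to settle near one half.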

The analysis of experimental designs has its foundation in the analysis of variance. In this analysis, the observed variance is partitioned into components attributable to different factors, and estimation and testing are then carried out.

We now consider another experiment, in which eight objects are to be weighed using a pan balance and a set of standard weights. Each weighing measures the difference between the weight of the objects in the left pan and that of the objects in the right pan: standard weights are added to the lighter pan until the balance is in equilibrium. Each weighing has a random error that averages zero. The standard deviation of the probability distribution of the errors is the same number $\sigma$ for every weighing, and the errors of different weighings are independent. We denote the true weights by $\theta_1, \ldots, \theta_8$. The experiments considered are:

1. Weigh each object in one pan, while the other pan is empty. We denote by $X_i$ the measured weight of the ith object, where i varies from 1 to 8.

2. Carry out eight weighings as per the schedule given below. We denote the measured difference by $Y_i$, where i varies from 1 to 8.

Weighing   Left pan           Right pan
1st        1 2 3 4 5 6 7 8    (empty)
2nd        1 2 3 8            4 5 6 7
3rd        1 4 5 8            2 3 6 7
4th        1 6 7 8            2 3 4 5
5th        2 4 6 8            1 3 5 7
6th        2 5 7 8            1 3 4 6
7th        3 4 7 8            1 2 5 6
8th        3 5 6 8            1 2 4 7

The estimated value of the weight $\theta_1$ is

$$\hat{\theta}_1 = \frac{Y_1 + Y_2 + Y_3 + Y_4 - Y_5 - Y_6 - Y_7 - Y_8}{8}.$$

Similar estimates can be found for the weights of the other items. For $\theta_2$, the estimate is

$$\hat{\theta}_2 = \frac{Y_1 + Y_2 - Y_3 - Y_4 + Y_5 + Y_6 - Y_7 - Y_8}{8}.$$

In decision-making one has to choose better alternatives. Here, $\sigma^2$ is the variance of the estimate $X_1$ of $\theta_1$ in the first experiment, whereas $\sigma^2/8$ is the variance of the estimate $\hat{\theta}_1$ in the second experiment. Thus, the second experiment gives eight times as much precision for the estimate of a single item, and it estimates all the items simultaneously with the same precision. If the items were weighed separately, 64 weighings would be needed to achieve what the second experiment achieves with 8 weighings. The estimates for the items in the second experiment, however, have errors that are correlated with one another.

This also serves as an example for the design of experiments that involvecombinatorial designs.
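The second weighing experiment can be checked numerically. The sketch below encodes the schedule as a matrix of signs (+1 for the left pan, -1 for the right pan) and applies the signed-sum estimator. The names `LEFT_PAN`, `SIGNS`, `measure` and `estimate`, as well as the true weights, are hypothetical and chosen only for illustration.

```python
# Objects placed in the left pan at each of the eight weighings (schedule above).
# All remaining objects go in the right pan; in the 1st weighing it stays empty.
LEFT_PAN = [
    {1, 2, 3, 4, 5, 6, 7, 8},
    {1, 2, 3, 8},
    {1, 4, 5, 8},
    {1, 6, 7, 8},
    {2, 4, 6, 8},
    {2, 5, 7, 8},
    {3, 4, 7, 8},
    {3, 5, 6, 8},
]

# SIGNS[i][j] is +1 if object j+1 is in the left pan at weighing i+1, else -1.
SIGNS = [[1 if obj in left else -1 for obj in range(1, 9)] for left in LEFT_PAN]

def measure(true_weights, errors=None):
    """Return the eight measured differences Y_i (left pan minus right pan)."""
    errors = errors or [0.0] * 8
    return [
        sum(s * w for s, w in zip(row, true_weights)) + e
        for row, e in zip(SIGNS, errors)
    ]

def estimate(y):
    """Estimate each weight as the signed sum of the Y_i divided by 8."""
    return [sum(SIGNS[i][j] * y[i] for i in range(8)) / 8 for j in range(8)]

# Hypothetical true weights, used only to check the estimator without noise.
theta = [10, 12, 15, 9, 20, 7, 11, 14]
theta_hat = estimate(measure(theta))

# Each estimate combines eight independent errors with coefficients +1/8 or -1/8,
# so its variance is 8 * (1/8)**2 = 1/8 of the variance of a single weighing.
variance_factor = sum((SIGNS[i][0] / 8) ** 2 for i in range(8))

print(theta_hat)
print(variance_factor)
```

Without measurement error the estimator returns the true weights exactly, and the variance factor of 1/8 agrees with the variance stated above for the second experiment.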

1. Selecting a problem: To design an experiment, one must select a problem and phrase it clearly. The statement of the problem directs the design as well as the outcomes of the experiment. Questions such as ‘Who, What, When, Why and How’ need to be addressed. Take the case of automobile accidents, for which an experiment is to be designed and data collected. Depending on how the problem is stated, the aim of the experiment may differ: it may lead either to the design of a road surface for existing automobiles or to a brand new automobile. To make the research more precise and give it greater depth, proven models should underlie the design of the experiment.

2. Determining dependent variables: Dependent variables are the variables measured in the experiment, and there may be several of them. They should be split into system-level and individual-level variables. Questions on the experiment are taken only at the system level. Such variables are created so that a conclusion can be drawn. Further, such conclusions should be supported from as many different angles as possible. Such operations are called converging operations. System-level dependent variables tell how many experimenters are there while a certain task is being done. At the individual level, dependent variables are measurements taken for a particular subject. Such measurements of dependent variables are to be analysed and reduced.

Dependent variables may consist of different kinds of measures, such as performance measures and subjective measures. Performance measures record the time taken by a participant to complete the task and the number of mistakes made during the task. Subjective measures tell about the method used or not used by participants.

3. Determining independent variables: These are the variables manipulated in the experiment. They are typically related to people: sex, age, level of education, general work experience or vision. To ensure that specifications are met, subjects are to be screened prior to running the experiment.

4. Determining the number of levels of independent variables: This determines the number of experimental conditions to be manipulated. If an experiment is designed to assess the relative performance of a few automobiles, say 10, then the independent variable has 10 levels.

5. Determining the possible combinations: The types of combinations of the independent variables need to be established. Only then can an experiment be taken as valid.

6. Determining the number of observations: Depending on the desired analysis, certain factors are to be considered before deciding on the number of observations. These include the number of trials needed to become familiar with the experiment.

7. Redesign: This is necessary for obtaining an optimal design. Redesign is essential when there are certain lacunae in the experiment design. Inconsistencies are caused by inaccuracy in stating the problem, selection of inadequate variables and non-availability of the desired apparatus. The recommended timeframe for redesign is:

Planning and scheduling — 44 per cent

Testing — 6–10 per cent

Reduction, analysis and writing — 45–50 per cent

8. Randomization: A trial that is randomized and controlled is the most reliable and impartial. Randomization is the process of assigning participants not by choice, but by chance, either to the group under investigation or to the control group. This ensures that the trial is not steered towards preferred results.

9. Data collection: Data collection must ensure that the experiment is supported by factual data. It consists of collecting raw data while adhering to the experimental conditions. The volume of data may be very large.

10. Data reduction: In data reduction, the raw data are broken into manageable chunks for further use. Some of the data may not be pertinent and should be excluded from the analysis.

11. Data verification: This is essential and is mostly carried out by plotting the reduced data, which gives a visual picture of how the data are located. Points that lie far from the rest indicate erroneous data collection.
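The randomization step (item 8 above) can be sketched in a few lines. The function name `randomize`, the group labels and the subject identifiers are assumptions made for illustration; shuffling followed by alternate assignment is one simple way of assigning by chance, not the only one.

```python
import random

def randomize(participants, groups=("investigation", "control"), seed=None):
    """Assign participants to groups by chance, keeping group sizes balanced."""
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)  # the order now depends on chance, not on choice
    assignment = {group: [] for group in groups}
    for position, person in enumerate(shuffled):
        assignment[groups[position % len(groups)]].append(person)
    return assignment

# Hypothetical subject identifiers S01 ... S12.
subjects = [f"S{number:02d}" for number in range(1, 13)]
allocation = randomize(subjects, seed=7)

for group, members in allocation.items():
    print(group, sorted(members))
```

Fixing the seed makes the allocation reproducible for audit purposes; in an actual study the seed would be recorded rather than chosen to obtain a convenient split.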


True experimental design needs an environment created to control spurious data that may mislead the experimental conclusion. A purchase laboratory is the approach best suited for this. Researchers modify one variable at a time to determine the effect on sales volume. Virtual purchase labs, which are Internet-based labs, are becoming popular.

Self-Assessment Questions

7. Fill in the blanks with the appropriate terms.

(a) Experiments are made by _______________ to understand the cause and effect relationships.

(b) Analysis of the experimental design has the foundation of ___________ analysis.

8. State whether true or false.

(a) In observational studies, there is no control on condition.

(b) In decision-making one has to choose worse alternatives.

2.6 Summary

Let us recapitulate the important concepts discussed in this unit:

Observation may be defined as recording behavioural patterns withoutverbal communication.

The questionnaire method of data collection can be used either by mailing the questionnaires or by sending them through enumerators.

The questionnaire is the only medium of communication between the investigator and the respondents, so it must be designed or drafted with utmost care and caution so that all the relevant and essential information for the enquiry may be collected without any difficulty, ambiguity or vagueness.

Instead of directly approaching the informants, the investigator can use the indirect personal interview method and interview several third persons who are directly or indirectly concerned with the subject matter of the enquiry and who are in possession of the requisite information.

Under the telephone survey method, the investigator, instead of presenting himself before the informants, contacts them on telephone and collects information from them.


Experiments are resorted to when factual data must be collected and nothing is available for reference. An experiment may also be conducted to verify a theory. It is a study conducted under controlled conditions.

2.7 Glossary

Direct personal observation: In this, the investigator himself is present before the informant and obtains first-hand information.

Mailed questionnaire method: In this, the investigator prepares a questionnaire containing a number of questions pertaining to the field of enquiry.

Questionnaire sent through enumerators: In this, the investigator appoints agents known as enumerators, who go to the respondents personally with the questionnaire and record the respondent’s replies.

Indirect personal interviews: In this, the investigator interviews several third persons who are directly or indirectly concerned with the subject matter of the enquiry and who are in possession of the requisite information.

Telephone survey: In this, the investigator contacts the informants on telephone and collects the information.

Information received through local agents: In this, the information is not collected formally by the investigator, but by local agents commonly known as correspondents.

2.8 Terminal Questions

1. What is observation? Why is it important for data collection?

2. Discuss the features of indirect personal interview.

3. Discuss the merits and demerits of both types of questionnaires.

4. What points must be considered while drafting a questionnaire?

5. How is information received through local agents? What are its merits and demerits?

6. What is the experiment method? What role does it play in data collection?


2.9 Answers

Answers to Self-Assessment Questions

1. (a) Behavioural; (b) Enquiry

2. (a) False; (b) True

3. (a) True; (b) True

4. (a) Footnote; (b) Logically

5. (a) Witnesses; (b) Transmit

6. (a) Representative; (b) Interview

7. (a) Researchers; (b) Variance

8. (a) True; (b) False

Answers to Terminal Questions

1. Refer Section 2.2

2. Refer Section 2.3.1 and 2.3.2

3. Refer Section 2.3.3

4. Refer Section 2.4

5. Refer Section 2.4.1

6. Refer Section 2.5

2.10 Further Reading

1. Elhance, D.N. Fundamentals of Statistics. Allahabad: Kitab Mahal, 2002.

2. Gupta, S.C. Fundamentals of Business Statistics. New Delhi: Sultan Chand & Sons, 2007.

Page 49: BCC104 Business Statistics

Business Statistics Unit 3

Sikkim Manipal University Page No. 43

Unit 3 Data Analysis Techniques

Structure

3.1 Introduction
Objectives
3.2 Percentages, Ratios and Averages
3.3 Mean, Mode and Median
3.4 Quartiles
3.5 Range
3.6 Standard Deviation
3.7 Summary
3.8 Glossary
3.9 Terminal Questions
3.10 Answers
3.11 Further Reading

3.1 Introduction

In the previous unit, you learnt about various data collection methods. The collected data is analysed to get useful information. In this unit, you will learn about the various techniques of data analysis. A percentage is the result obtained by multiplying a ratio by 100. If 50% of the students in a class are girls, it means that out of every 100 students, 50 are girls. A ratio is a comparison between two values. It shows the number of times one value is contained in or contains the other. For example, if the ratio of girls to boys in a class is 1:2, it means that the number of boys is two times the number of girls. An average is a measure of the central value of a data set. A measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set of data. The three common measures of central tendency (mean, median and mode) are explained in this unit. Dispersion tells us about the spread of data. The commonly used measures of dispersion are quartile deviation, range and standard deviation.

Objectives

After studying this unit, you should be able to:

Evaluate percentages, ratios and averages

Calculate arithmetic mean, median and mode


Evaluate and represent data using quartiles, deciles and percentiles

Calculate range and standard deviation

3.2 Percentages, Ratios and Averages

Cent is a French word for hundred. Per cent stands for ‘every hundred’ and is a powerful tool for the comparison of numerical and statistical data. Percentages are used in business and economic fields for making comparisons of profit, growth rate, magnitude, performance, etc. The concept of percentage applies mainly to ratios. A ratio, when multiplied by 100, becomes a percentage.

An average is a measure of the central tendency of a set of numbers. We often come across problems that require finding an average value for a set of numbers. For example, a student has secured 60% in mathematics, 70% in physics and 80% in chemistry. The average is calculated as (60 + 70 + 80)/3 = 70%. This average is also known as the arithmetic mean. The general formula for the average of n numbers x1, x2, x3, ..., xn is An = (x1 + x2 + x3 + ... + xn)/n.

3.2.1 Percentage

Mathematically, a percentage value is calculated for ratios that have a denominator. The denominator is the base value of a percentage. A ratio of 3 to 10 (3/10) literally means 3 in 10. To convert it into a percentage, we multiply it by 100, and it is then expressed as 30% (or 30 per cent).

When the value of a measured quantity changes, the change can be recorded as:

(i) Absolute value change

(ii) Percentage change

These two changes are related to each other.

(i) Absolute value change: This is defined as the actual change in the quantity. For example, if there is a sales figure of 220 crores in the year 2000 and 250 crores in the year 2001, the absolute value change is 30 crores.

(ii) Percentage change: Here, the change is expressed as a ratio of the original value and then multiplied by 100. In the example cited above, Percentage change = (Absolute value change/Original quantity) × 100 = (30/220) × 100 = 13.64%. Percentage change is always taken with reference to the original value, unless otherwise stated. Changes expressed as percentages present a better picture of the change.

Percentage point change and percentage change: Percentage point change notes only the difference between two percentages, whereas percentage change notes the change with reference to the original value. This is explained with the help of Example 3.1.

Example 3.1: Savings expressed as a percentage of Gross Domestic Product (GDP) were 20% in 2000 and 25% in 2002. What are the percentage point change and the percentage change in this period?

Solution: Percentage point change in savings rate = 25 – 20 = 5 percentage points.

Percentage change of savings rate = (25 – 20)/20 × 100 = 25% (twenty-five per cent).
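The three kinds of change discussed in this section can be expressed as small functions. The sketch below reuses the sales figures (220 and 250 crores) and the savings rates (20% and 25%) from the text; the function names are assumptions made for illustration.

```python
def absolute_change(old, new):
    """Actual change in the quantity."""
    return new - old

def percentage_change(old, new):
    """Change expressed relative to the original value, times 100."""
    return (new - old) / old * 100

def percentage_point_change(old_pct, new_pct):
    """Plain difference between two percentages."""
    return new_pct - old_pct

# Sales of 220 crores in 2000 and 250 crores in 2001.
sales_abs = absolute_change(220, 250)
sales_pct = round(percentage_change(220, 250), 2)

# Savings rate of 20% in 2000 and 25% in 2002 (Example 3.1).
savings_points = percentage_point_change(20, 25)
savings_pct = percentage_change(20, 25)

print(sales_abs, sales_pct, savings_points, savings_pct)
```

Keeping percentage points and percentage change as separate functions makes it harder to confuse a rise of 5 points with a rise of 5 per cent.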

Numerator and Denominator

The numerator has a direct relationship with the ratio or percentage. When the numerator increases, the ratio also increases, if the denominator remains constant. The denominator has an inverse relationship with the ratio or percentage. When the denominator increases, the ratio or percentage decreases, and when the denominator decreases, the ratio or percentage increases.

If changes take place both in the numerator and in the denominator, first solve for the change in one of them while keeping the other constant, and then apply the change in the other to the new value of the percentage. Example 3.2 illustrates this concept.

Example 3.2: Petrol prices increase by 20%. Ramesh has decided to reduce his consumption so that he does not incur additional expenditure. By what percentage should he reduce his petrol consumption?

Solution: Let us assume that Ramesh consumes 100 litres of petrol. Let the price of petrol be Rs x per litre. He was paying Rs 100x. After the increase, the same consumption would cost 1.2 × x × 100 = Rs 120x.

Suppose he reduces his consumption to y litres so that his expenditure is unchanged. Then 1.2 × x × y = 100 × x, so y = 100/1.2 = 83.33.

Percentage reduction in consumption = (100 – y)/100 × 100 = 100 – 83.33 = 16.67%.
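Example 3.2 generalizes to any price increase: if the price rises by p per cent, consumption must fall to 1/(1 + p/100) of its old level for expenditure to stay the same. The following sketch applies this; the function name `required_reduction` is an assumption made for illustration.

```python
def required_reduction(price_increase_pct):
    """Percentage cut in consumption that keeps expenditure unchanged."""
    factor = 1 + price_increase_pct / 100   # new price / old price
    return (1 - 1 / factor) * 100

reduction = required_reduction(20)

# Check: new price (1.2 per unit of old price) times new consumption
# should equal the old expenditure on 100 litres.
new_consumption = 100 * (1 - reduction / 100)
new_expenditure = 1.2 * new_consumption

print(round(reduction, 2), round(new_consumption, 2), round(new_expenditure, 2))
```

Note that the required reduction (about 16.67%) is smaller than the price increase (20%), because it is measured against the original consumption while the increase is measured against the original price.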

3.2.2 Ratio

When a comparison is carried out between two numbers, it is useful to know how many times one number is greater or smaller than the other. Thus, we are often required to express one number as a fraction of the other. The ratio of a number a to a number b is defined as the quotient of a and b.

The numbers that form the ratio are known as the terms of the ratio. The numerator of the ratio is known as the antecedent and the denominator is known as the consequent. A ratio of homogeneous quantities has no unit; it is just a number. In the case of heterogeneous quantities, the unit depends on the units of the numerator and the denominator. For example, specific gravity, which is a ratio of densities, is unitless. Current, in electricity, is the ratio of the flow of charge to time, so current is measured in coulomb per unit time. This unit has the special name ‘ampere’.

A ratio is expressed as a percentage by multiplying it by 100. For example, the ratio 3/5 = 0.6 can be expressed as 0.6 × 100 = 60%.

Properties of Ratio

(i) If the numerator and the denominator are multiplied by the same non-zero number, the ratio remains unchanged. This means a/b = ma/mb.

(ii) If the numerator and the denominator are divided by the same non-zero number, the ratio remains unchanged. This means a/b = (a/m)/(b/m).

(iii) To compare the magnitudes of two ratios, their denominators should be equated; the values of the numerators then decide which one is greater. To compare 8/3 and 11/4, we make a common denominator. We multiply the numerator and the denominator of 8/3 by 4 and get 32/12. We then multiply the numerator and the denominator of 11/4 by 3 and get 33/12. Thus, we find that 11/4 > 8/3.

(iv) The ratio of two fractions can be expressed as a ratio of two integers. Thus, a/b : c/d = ad/bc.

(v) If either of the terms of a ratio is a surd, then this ratio will never be an integer unless both the terms are equal or the numerator is an integral multiple of the denominator. Thus, the ratio sqrt(3)/sqrt(2) will never be an integer.

(vi) When two ratios are multiplied, their numerators and denominators are also multiplied. For example, a/b × c/d = ac/bd.

(vii) When the ratio a/b is compounded with itself, the resulting ratio a²/b² is known as the duplicate ratio; a³/b³ is the triplicate ratio and sqrt(a)/sqrt(b) is the sub-duplicate ratio of a/b.

(viii) If a/b = c/d = e/f = g/h = k, then (a + c + e + g)/(b + d + f + h) = k.

(ix) If a1/b1, a2/b2, a3/b3, ..., an/bn are unequal fractions with positive denominators, then the ratio (a1 + a2 + a3 + ... + an)/(b1 + b2 + b3 + ... + bn) lies between the lowest and the highest of these fractions.

(x) If there are two equations containing three unknowns, a1x + b1y + c1z = 0 and a2x + b2y + c2z = 0, then the values of x, y and z cannot be found unless we get a third equation, but the proportion x : y : z can be found.

(xi) If a/b > 1 and k is a positive number, then (a + k)/(b + k) < a/b and (a – k)/(b – k) > a/b. Similarly, if a/b < 1 and k is a positive number, then (a + k)/(b + k) > a/b and (a – k)/(b – k) < a/b. Here a and b are positive and k is smaller than b.
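Several of the properties above can be spot-checked with exact rational arithmetic. The sketch below uses Python's standard `fractions` module; the helper `combined_ratio` and the sample values are assumptions chosen for illustration, and a numerical check on a few values is not a proof.

```python
from fractions import Fraction

def combined_ratio(pairs):
    """(a1 + a2 + ... + an)/(b1 + b2 + ... + bn) for pairs (ai, bi)."""
    return Fraction(sum(a for a, _ in pairs), sum(b for _, b in pairs))

# Property (iii): compare 8/3 and 11/4.
larger = max(Fraction(8, 3), Fraction(11, 4))

# Property (viii): equal ratios keep their common value when combined.
equal_pairs = [(1, 2), (2, 4), (3, 6), (4, 8)]
common = combined_ratio(equal_pairs)

# Property (ix): unequal ratios combine to a value between the extremes.
unequal_pairs = [(1, 2), (2, 3), (3, 4)]
values = [Fraction(a, b) for a, b in unequal_pairs]
middle = combined_ratio(unequal_pairs)

# Property (xi): add or subtract k = 1 from both terms of 5/3 (a/b > 1).
a, b, k = 5, 3, 1
plus = Fraction(a + k, b + k)
minus = Fraction(a - k, b - k)

print(larger, common, middle, plus, minus)
```

Using `Fraction` instead of floating-point division avoids rounding, so the comparisons are exact.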

3.2.3 Averages

An average is a measure of the central tendency of a set of numbers. The general formula for the average of n numbers x1, x2, x3, ..., xn is An = (x1 + x2 + x3 + ... + xn)/n. There is another type of average, known as the weighted average.

When there are two or more groups with known averages, the combined average is found as a weighted average. If we have r groups with averages A1, A2, A3, ..., Ar and numbers of elements n1, n2, n3, ..., nr, then the weighted average is given as:

Aw = (n1A1 + n2A2 + n3A3 + ... + nrAr)/(n1 + n2 + n3 + ... + nr)

An average is also known as an arithmetic mean.

Example 3.3: A man travels from point A to point B at 60 kmph and returns at 100 kmph. Find the average speed.

Solution: Average speed = Total distance/Total time taken.

Let the distance between A and B be d. The time taken to go from A to B is d/60 and the time taken to return to A is d/100.

Total time is d/60 + d/100 = 16d/600.

Total distance = 2d.

Hence, average speed = 2d/[d/60 + d/100] = 2d × 600/(16d) = 75 kmph.

Example 3.4: The average marks of 20 students in an examination is reduced by 2 when the topper of the class, who secured 90 marks, is replaced by a new student. What was the score of this new student?

Solution: Let the average marks with the topper included be x. There are 20 students, so the total marks are 20x. The new average is x – 2 and hence the new total is 20(x – 2) = 20x – 40. Thus, there is a reduction of 40 marks, and this must be due to the new student, who got 40 marks less than the student he replaced. So, he got only 90 – 40 = 50 marks.
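The reasoning in Examples 3.3 and 3.4 can be sketched in Python as follows. The function names `average_speed` and `replacement_score` are hypothetical helpers introduced only for this illustration, and the distance of 300 km is an arbitrary assumption (any positive distance gives the same average speed).

```python
from fractions import Fraction

def average_speed(distance, speed_out, speed_back):
    """Total distance divided by total time for an out-and-back trip."""
    total_time = Fraction(distance, speed_out) + Fraction(distance, speed_back)
    return 2 * distance / total_time

def replacement_score(old_score, n_students, drop_in_average):
    """Score of the newcomer when replacing one student lowers the class average."""
    # The class total falls by n_students * drop_in_average marks.
    return old_score - n_students * drop_in_average

speed = average_speed(300, 60, 100)    # Example 3.3, assumed distance of 300 km
score = replacement_score(90, 20, 2)   # Example 3.4
print(speed, score)
```

Note that the average speed is not the simple mean of 60 and 100; it is total distance over total time, which is why the distance cancels out.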

Activity 1

An investor buys Rs 1200 worth of shares in a company each month. During the first five months, he bought the shares at a price of Rs 10, Rs 12, Rs 15, Rs 20 and Rs 24 per share. After 5 months, what is the average price paid for the shares by him?

Self-Assessment Questions

1. Fill in the blanks with the appropriate terms.

(a) Percentage value is calculated for ratios that have a ___________________.

(b) Numerator of the ratio is known as _________________ and the denominator is known as consequent.

2. State whether true or false.

(a) An average is the measure of central tendency of a set of numbers.

(b) The numerator does not have a direct relationship with ratio or percentage.

3.3 Mean, Mode and Median

3.3.1 Arithmetic Mean

There are several commonly used measures such as arithmetic mean, mode and median. These values are very useful not only in presenting the overall picture of the entire data but also for the purpose of making comparisons among two or more sets of data.

As an example, questions like ‘How hot is the month of June in Delhi?’ can be answered generally by a single figure of the average for that month. Similarly, suppose we want to find out if boys and girls of age 10 years differ in height for the purpose of making comparisons. Then, by taking the average height of boys of that age and the average height of girls of the same age, we can compare and record the differences.


While the arithmetic mean is the most commonly used measure of central tendency, the mode and the median are more suitable measures under certain conditions and for certain types of data. However, each measure of central tendency should meet the following requisites.

1. It should be easy to calculate and understand.

2. It should be rigidly defined. It should have only one interpretation so that the personal prejudice or bias of the investigator does not affect its usefulness.

3. It should be representative of the data. If it is calculated from a sample, the sample should be random enough to represent the population accurately.

4. It should have sampling stability. It should not be affected by sampling fluctuations. This means that if we pick ten different groups of college students at random and compute the average of each group, then we should expect to get approximately the same value from each of these groups.

5. It should not be affected much by extreme values. If a few very small or very large items are present in the data, they will unduly influence the value of the average by shifting it to one side or the other, so that the average would not be really typical of the entire series. Hence, the average chosen should be such that it is not unduly affected by such extreme values.

Arithmetic mean is also commonly known as the mean. Even though average, in general, means any measure of central tendency, when we use the word average in our daily routine, we always mean the arithmetic average. The term is widely used by almost everyone in daily communication. We speak of an individual being an average student or of average intelligence. We always talk about average family size or average family income or grade point average (GPA) for students, and so on.

For discussion purposes, let us assume a variable X which stands for some scores such as the ages of students. Let the ages of 5 students be 19, 20, 22, 22 and 17 years. Then variable X would represent these ages as follows:

X: 19, 20, 22, 22, 17

Placing the Greek symbol Σ (sigma) before X would indicate a command that all values of X are to be added together. Thus:

ΣX = 19 + 20 + 22 + 22 + 17


The mean is computed by adding all the data values and dividing the sum by the number of such values. The symbol used for the sample average is $\bar{X}$, so that:

$$\bar{X} = \frac{19 + 20 + 22 + 22 + 17}{5}$$

In general, if there are n values in the sample, then

$$\bar{X} = \frac{X_1 + X_2 + \dots + X_n}{n}$$

In other words,

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}, \quad i = 1, 2, \dots, n$$

According to this formula, the mean can be obtained by adding up all values of $X_i$, where the value of i starts at 1 and ends at n with unit increments so that i = 1, 2, 3, ..., n, and dividing the sum by n.
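The summation and division described above can be sketched in a few lines of Python. This is only an illustration using the five ages from the text; the standard-library function `statistics.mean` is shown alongside the manual calculation for comparison.

```python
from statistics import mean

ages = [19, 20, 22, 22, 17]          # the five ages used in the text

total = sum(ages)                    # sigma X: add every value
sample_mean = total / len(ages)      # divide by n, the number of values

print(total, sample_mean, mean(ages))
```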

If instead of taking a sample, we take the entire population in our calculations of the mean, then the symbol for the mean of the population is $\mu$ (mu) and the size of the population is N, so that:

$$\mu = \frac{\sum_{i=1}^{N} X_i}{N}, \quad i = 1, 2, \dots, N$$

In real-life cases, a population is usually very large and hence the population mean is considered an unknown constant. The value of N is also very large and may be in the thousands or millions, or sometimes even infinite. The sample mean is thus used as an estimator of the population mean.

If we have the data in grouped discrete form with frequencies, then the sample mean is given by:

$$\bar{X} = \frac{\sum f(X)}{\sum f}$$

Here, $\sum f$ = Summation of all frequencies = n

$\sum f(X)$ = Summation of each value of X multiplied by its corresponding frequency (f)


Example 3.5: Let us take the ages of 10 students as follows:

19, 20, 22, 22, 17, 22, 20, 23, 17, 18

Solution: This data can be arranged in a frequency distribution as follows:

(X)      (f)    f(X)
17       2      34
18       1      18
19       1      19
20       2      40
22       3      66
23       1      23
Total    10     200

In this case, we have $\sum f = 10$ and $\sum f(X) = 200$, so that:

$$\bar{X} = \frac{\sum f(X)}{\sum f} = \frac{200}{10} = 20$$

Example 3.6: Calculate the mean of the marks of 46 students given in the following table.

Frequency of Marks of 46 Students

Marks (X)    Frequency (f)
9            1
10           2
11           3
12           6
13           10
14           11
15           7
16           3
17           2
18           1
Total        46

Solution: This is a discrete frequency distribution, and the mean is calculated using the equation $\bar{X} = \frac{\sum f(X)}{\sum f}$. The following table shows the method of obtaining $\sum f(X)$.


Marks (X)    Frequency (f)    f(X)
9            1                9
10           2                20
11           3                33
12           6                72
13           10               130
14           11               154
15           7                105
16           3                48
17           2                34
18           1                18
             Σf = 46          Σf(X) = 623

$$\bar{X} = \frac{\sum f(X)}{\sum f} = \frac{623}{46} = 13.54$$
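The tabular method of Examples 3.5 and 3.6 can be sketched in Python as below. The helper name `grouped_mean` and the dictionary layout (value mapped to frequency) are assumptions made for this illustration.

```python
# Marks and frequencies from Example 3.6, stored as {value: frequency}.
marks_frequency = {9: 1, 10: 2, 11: 3, 12: 6, 13: 10,
                   14: 11, 15: 7, 16: 3, 17: 2, 18: 1}

def grouped_mean(table):
    """Mean of a discrete frequency table given as {value: frequency}."""
    sum_f = sum(table.values())                       # sigma f
    sum_fx = sum(x * f for x, f in table.items())     # sigma f(X)
    return sum_fx / sum_f

mean_marks = grouped_mean(marks_frequency)
print(round(mean_marks, 2))
```

The same function applied to the age table of Example 3.5 gives 200/10 = 20.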

Example 3.7: The mean age of a group of 100 persons (grouped in intervals 10–, 12–,..., etc.) was found to be 32.02. Later, it was discovered that age 57 was misread as 27. Find the corrected mean.

Solution: Let the mean be denoted by $\bar{X}$. So, putting the given values in the formula of arithmetic mean, we have,

$$32.02 = \frac{\sum X}{100}, \quad \text{i.e.,} \quad \sum X = 3202$$

$$\text{Correct } \sum X = 3202 - 27 + 57 = 3232$$

$$\text{Correct AM} = \frac{3232}{100} = 32.32$$
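The correction in Example 3.7 works on the total rather than on the raw data. A minimal sketch follows; `corrected_mean` is a hypothetical helper name, and `Decimal` is used here only so that 32.02 is represented exactly.

```python
from decimal import Decimal

def corrected_mean(reported_mean, n, wrong_value, right_value):
    """Undo a misread observation without access to the raw data."""
    reported_total = reported_mean * n                        # sigma X as first computed
    corrected_total = reported_total - wrong_value + right_value
    return corrected_total / n

# Values from Example 3.7: mean 32.02 over 100 persons, 57 misread as 27.
result = corrected_mean(Decimal("32.02"), 100, 27, 57)
print(result)
```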

Example 3.8: The mean monthly salary paid to all employees in a company is Rs 500. The monthly salaries paid to male and female employees average Rs 520 and Rs 420, respectively. Determine the percentage of males and females employed by the company.

Solution: Let $N_1$ be the number of males and $N_2$ be the number of females employed by the company. Also, let $\bar{x}_1$ and $\bar{x}_2$ be the monthly average salaries paid to male and female employees and $\bar{x}$ be the mean monthly salary paid to all the employees.

$$\bar{x} = \frac{N_1 \bar{x}_1 + N_2 \bar{x}_2}{N_1 + N_2}$$

$$\text{or} \quad 500 = \frac{520 N_1 + 420 N_2}{N_1 + N_2}$$

$$\text{or} \quad 20 N_1 = 80 N_2$$

$$\text{or} \quad \frac{N_1}{N_2} = \frac{80}{20} = \frac{4}{1}$$

Hence, the males and females are in the ratio of 4 : 1, or 80 per cent are males and 20 per cent are females among those employed by the company.
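The same result can be reached by solving the combined-mean equation for the share of the first group. The sketch below assumes two groups only; the helper `share_of_first_group` is introduced for illustration and follows from writing the overall mean as p × (first mean) + (1 – p) × (second mean).

```python
from fractions import Fraction

def share_of_first_group(overall_mean, mean_first, mean_second):
    """Fraction p of the first group when overall = p*first + (1 - p)*second."""
    return Fraction(overall_mean - mean_second, mean_first - mean_second)

# Salaries from Example 3.8: overall Rs 500, males Rs 520, females Rs 420.
male_share = share_of_first_group(500, 520, 420)
female_share = 1 - male_share
print(f"males {float(male_share):.0%}, females {float(female_share):.0%}")
```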

The Weighted Arithmetic Mean

In the computation of arithmetic mean we had given equal importance to each observation in the series. This equal importance may be misleading if the individual values constituting the series have different importance, as in the following example:

The Raja Toy shop sells

Toy Cars at Rs 3 each

Toy Locomotives at Rs 5 each

Toy Aeroplanes at Rs 7 each

Toy Double Deckers at Rs 9 each

What shall be the average price of the toys sold, if the shop sells 4 toys, one of each kind?

$$\text{Mean Price, i.e., } \bar{x} = \frac{\sum x}{4} = \text{Rs } \frac{24}{4} = \text{Rs } 6$$

In this case, the importance of each observation (price quotation) is equal in as much as one toy of each variety has been sold. In the above computation of the arithmetic mean, this fact has been taken care of by including ‘once only’ the price of each toy.

But if the shop sells 100 toys: 50 cars, 25 locomotives, 15 aeroplanes and 10 double deckers, the importance of the four price quotations to the dealer is not equal as a source of earning revenue. In fact, their respective importance is equal to the number of units of each toy sold, i.e.,

The importance of Toy Car 50

The importance of Locomotive 25

The importance of Aeroplane 15

The importance of Double Decker 10


It may be noted that 50, 25, 15, 10 are the quantities of the various classes of toys sold. It is for these quantities that the term ‘weights’ is used in statistical language. Weight is represented by the symbol ‘w’, and $\sum w$ represents the sum of weights.

While determining the ‘average price of toy sold’, these weights are of great importance and are taken into account in the manner illustrated below:

$$\bar{x} = \frac{w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4}{w_1 + w_2 + w_3 + w_4} = \frac{\sum wx}{\sum w}$$

where $w_1, w_2, w_3, w_4$ are the respective weights of $x_1, x_2, x_3, x_4$, which in turn represent the prices of the four varieties of toys, viz., car, locomotive, aeroplane and double decker, respectively.

$$\bar{x} = \frac{(50 \times 3) + (25 \times 5) + (15 \times 7) + (10 \times 9)}{50 + 25 + 15 + 10} = \frac{150 + 125 + 105 + 90}{100} = \frac{470}{100} = \text{Rs } 4.70$$

Table 3.1 summarizes the steps taken in the computation of the weighted arithmetic mean.

Table 3.1 Weighted Arithmetic Mean of Toys Sold by the Raja Toy Shop

Toys             Price per Toy (Rs)    Number Sold    Price × Weight
                 x                     w              xw
Car              3                     50             150
Locomotive       5                     25             125
Aeroplane        7                     15             105
Double Decker    9                     10             90
                                       Σw = 100       Σxw = 470

$$\sum w = 100; \quad \sum wx = 470$$

$$\bar{x} = \frac{\sum wx}{\sum w} = \frac{470}{100} = 4.70$$
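The contrast between the simple and the weighted mean in the toy-shop illustration can be sketched in Python as follows. The function name `weighted_mean` and the dictionary layout are assumptions made for this sketch.

```python
# Prices (x) and units sold (w) from the Raja Toy Shop illustration.
prices = {"Car": 3, "Locomotive": 5, "Aeroplane": 7, "Double Decker": 9}
units_sold = {"Car": 50, "Locomotive": 25, "Aeroplane": 15, "Double Decker": 10}

def weighted_mean(values, weights):
    """Sum of w*x divided by the sum of w, for dicts keyed by item name."""
    sum_wx = sum(weights[name] * values[name] for name in values)
    sum_w = sum(weights[name] for name in values)
    return sum_wx / sum_w

simple = sum(prices.values()) / len(prices)      # every price counted once
weighted = weighted_mean(prices, units_sold)     # prices weighted by units sold
print(simple, weighted)
```

The weighted figure is lower than the simple one because the cheapest toy accounts for half of all units sold.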

The weighted arithmetic mean is particularly useful where we have to compute the mean of means. If we are given two arithmetic means, one for each of two different series, in respect of the same variable, and are required to find the arithmetic mean of the combined series, the weighted arithmetic mean is the only suitable method of its determination.

Example 3.9: The arithmetic mean of daily wages of two manufacturing concerns A Ltd. and B Ltd. is Rs 5 and Rs 7, respectively. Determine the average daily wages of both concerns if the number of workers employed were 2,000 and 4,000 respectively.


Solution: (i) Multiply each average (viz. 5 and 7) by the number of workers in the concern it represents.

(ii) Add up the two products obtained in (i) above.

(iii) Divide the total obtained in (ii) by the total number of workers.

Weighted Mean of Mean Wages of A Ltd. and B Ltd.

Manufacturing Concern    Mean Wages    Workers Employed    Mean Wages × Workers Employed
                         x             w                   wx
A Ltd.                   5             2,000               10,000
B Ltd.                   7             4,000               28,000
                                       Σw = 6,000          Σwx = 38,000

$$\bar{x} = \frac{\sum wx}{\sum w} = \frac{38{,}000}{6{,}000} = \text{Rs } 6.33$$

The above mentioned examples explain that ‘Arithmetic Means and Percentage’ are not original data. They are derived figures and their importance is relative to the original data from which they are obtained. This relative importance must be taken into account by weighting while averaging them (means and percentage).

Advantages of Mean

1. Its concept is familiar to most people and is intuitively clear.

2. Every data set has a mean, which is unique and describes the entire data to some degree. For example, when we say that the average salary of a professor is Rs 25,000 per month, it gives us a reasonable idea about the salaries of professors.

3. It is a measure that can be easily calculated.

4. It includes all values of the data set in its calculation.

5. Its value varies very little from sample to sample taken from the same population.

6. It is useful for performing statistical procedures such as computing and comparing the means of several data sets.


Disadvantages of Mean

1. It is affected by extreme values, and hence is not very reliable when the data set has extreme values, especially when these extreme values are on one side of the ordered data. Thus, a mean of such data is not truly representative of the data. For example, the average age of three persons of ages 4, 6 and 80 years is 30, which is close to none of the three ages.

2. It is tedious to compute for a large data set as every point in the data set is to be used in computations.

3. We are unable to compute the mean for a data set that has open-ended classes either at the high or at the low end of the scale.

4. The mean cannot be calculated for qualitative characteristics such as beauty or intelligence, unless these can be converted into quantitative figures such as intelligence into IQs.

3.3.2 Median

The second measure of central tendency that has a wide usage in statistical works is the median. Median is that value of a variable which divides the series in such a manner that the number of items below it is equal to the number of items above it. Half of the total number of observations lies below the median and half above it. The median is thus a positional average.

The median of ungrouped data is found easily if the items are first arranged in order of magnitude. The median may then be located simply by counting, and its value can be obtained by reading the value of the middle observation. If we have five observations whose values are 8, 10, 1, 3 and 5, the values are first arrayed: 1, 3, 5, 8 and 10. It is now apparent that the value of the median is 5, since two observations are below that value and two observations are above it. When there is an even number of cases, there is no actual middle item and the median is taken to be the average of the values of the items lying on either side of position (N + 1)/2, where N is the total number of items. Thus, if the values of six items of a series are 1, 2, 3, 5, 8 and 10, then the median is the value of item number (6 + 1)/2 = 3.5, which is approximated as the average of the third and the fourth items, i.e., (3 + 5)/2 = 4.

Thus, the steps required for obtaining the median are:

1. Arrange the data as an array of increasing magnitude.

2. Obtain the value of the (N + 1)/2th item.
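The two steps above can be sketched in Python as follows. The function name `median` is chosen for this illustration (the standard library offers `statistics.median` with the same behaviour for numeric data); the two data sets are the ones used in the text.

```python
def median(values):
    """Middle value of the sorted data; mean of the two middle values if n is even."""
    ordered = sorted(values)            # step 1: arrange in increasing magnitude
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                      # odd n: item number (n + 1)/2
        return ordered[mid]
    # even n: average the items on either side of position (n + 1)/2
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_case = median([8, 10, 1, 3, 5])
even_case = median([1, 2, 3, 5, 8, 10])
print(odd_case, even_case)
```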


Frequency is the number of times a given value occurs in a data set. A relative frequency is the fraction of times a value occurs. Cumulative frequency is the accumulation of the relative frequencies up to and including the current value. For example, the data below gives the number of hours devoted by 20 students of a class to study at home:

5, 6, 3, 3, 2, 4, 7, 5, 2, 3, 5, 6, 5, 4, 4, 3, 5, 2, 5, 3

The following table gives the frequency distribution, relative frequency distribution and cumulative frequency distribution:

Hours    Number of Students (Frequency)    Relative Frequency    Cumulative Frequency
2        3                                 3/20 = 0.15           0.15
3        5                                 5/20 = 0.25           0.15 + 0.25 = 0.40
4        3                                 3/20 = 0.15           0.40 + 0.15 = 0.55
5        6                                 6/20 = 0.30           0.55 + 0.30 = 0.85
6        2                                 2/20 = 0.10           0.85 + 0.10 = 0.95
7        1                                 1/20 = 0.05           0.95 + 0.05 = 1.00
Total    20
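The three distributions in the table can be rebuilt from the raw data with a short Python sketch. Exact fractions are used for the running total so that the cumulative column does not pick up floating-point drift; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction

# Hours of home study reported by the 20 students.
hours = [5, 6, 3, 3, 2, 4, 7, 5, 2, 3, 5, 6, 5, 4, 4, 3, 5, 2, 5, 3]

frequency = Counter(hours)              # value -> number of occurrences
n = len(hours)

rows = []
running = Fraction(0)
for value in sorted(frequency):
    relative = Fraction(frequency[value], n)    # share of all observations
    running += relative                         # cumulative relative frequency
    rows.append((value, frequency[value], float(relative), float(running)))

for row in rows:
    print(row)
```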

Even in the case of grouped data, the procedure for obtaining the median is straightforward as long as the variable is discrete or non-continuous, as is clear from the following example.

Example 3.10: Obtain the median size of shoes sold from the following data.

Number of Shoes Sold by Size in One Year

Size     Number of Pairs    Cumulative Total
5        30                 30
5½       40                 70
6        50                 120
6½       150                270
7        300                570
7½       600                1170
8        950                2120
8½       820                2940
9        750                3690
9½       440                4130
10       250                4380
10½      150                4530
11       40                 4570
11½      39                 4609
Total    4609


Solution: The median is the value of the $\frac{N + 1}{2}$th $= \frac{4609 + 1}{2}$th $=$ 2305th item. Since the items are already arranged in ascending order (size-wise), the size of the 2305th item is easily determined by constructing the cumulative frequency. Thus, the median size of shoes sold is 8½, the size of the 2305th item.
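Locating the median item through the cumulative total can be sketched in Python as below. The helper `discrete_grouped_median` is a hypothetical name, half sizes are written as decimals (8½ as 8.5), and the sketch assumes the simple case shown in the example, where the (N + 1)/2th position falls inside a single size.

```python
from itertools import accumulate

# Shoe sizes and pairs sold from Example 3.10 (half sizes written as .5).
sizes = [5, 5.5, 6, 6.5, 7, 7.5, 8, 8.5, 9, 9.5, 10, 10.5, 11, 11.5]
pairs_sold = [30, 40, 50, 150, 300, 600, 950, 820, 750, 440, 250, 150, 40, 39]

def discrete_grouped_median(values, frequencies):
    """Value of the (N + 1)/2-th item in a frequency table sorted by value."""
    cumulative = list(accumulate(frequencies))      # running totals
    position = (cumulative[-1] + 1) / 2             # (N + 1)/2
    for value, running_total in zip(values, cumulative):
        if running_total >= position:               # first group reaching the position
            return value

median_size = discrete_grouped_median(sizes, pairs_sold)
print(median_size)
```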

In the case of grouped data with a continuous variable, the determination of the median is a bit more involved. Consider the following table, where the data relating to the distribution of male workers by average monthly earnings is given. Clearly, the median of 6291 workers is the earnings of the (6291 + 1)/2 = 3146th worker when arranged in ascending order of earnings.

From the cumulative frequency, it is clear that this worker has his income in the class interval 67.5–72.5. But, it is impossible to determine his exact income. We therefore resort to approximation by assuming that the 795 workers of this class are distributed uniformly across the interval 67.5 to 72.5. The median worker is the (3146 – 2713) = 433rd of these 795, and hence the value corresponding to him can be approximated as

$$67.5 + \frac{433}{795} \times (72.5 - 67.5) = 67.5 + 2.72 = 70.22$$

Distribution of Male Workers by Average Monthly Earnings

Group No.    Monthly Earnings (Rs)    No. of Workers    Cumulative No. of Workers
1            27.5–32.5                120               120
2            32.5–37.5                152               272
3            37.5–42.5                170               442
4            42.5–47.5                214               656
5            47.5–52.5                410               1066
6            52.5–57.5                429               1495
7            57.5–62.5                568               2063
8            62.5–67.5                650               2713
9            67.5–72.5                795               3508
10           72.5–77.5                915               4423
11           77.5–82.5                745               5168
12           82.5–87.5                530               5698
13           87.5–92.5                259               5957
14           92.5–97.5                152               6109
15           97.5–102.5               107               6216
16           102.5–107.5              50                6266
17           107.5–112.5              25                6291
Total                                 6291

The value of the median can thus be put in the form of the formula,

$$Me = l + \frac{\frac{N + 1}{2} - C}{f} \times i$$

where l is the lower limit of the median class, i its width, f its frequency, C the cumulative frequency up to (but not including) the median class, and N the total number of cases.
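The interpolation can be sketched in Python using the earnings distribution given above. The function name `grouped_median` is an assumption for this illustration, and the sketch relies on the same uniform-spread assumption within the median class, with equal class widths.

```python
# Lower class limits 27.5, 32.5, ..., 107.5 and class frequencies from the table.
lower_limits = [27.5 + 5 * k for k in range(17)]
workers = [120, 152, 170, 214, 410, 429, 568, 650, 795,
           915, 745, 530, 259, 152, 107, 50, 25]
width = 5

def grouped_median(lower_limits, frequencies, width):
    """Me = l + ((N + 1)/2 - C) / f * i for equal-width classes."""
    n = sum(frequencies)
    position = (n + 1) / 2
    cumulative_before = 0                       # C: total below the current class
    for l, f in zip(lower_limits, frequencies):
        if cumulative_before + f >= position:   # this is the median class
            return l + (position - cumulative_before) / f * width
        cumulative_before += f

me = grouped_median(lower_limits, workers, width)
print(round(me, 2))
```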

Finding Median by Graphical Analysis

The median can quite conveniently be determined by reference to the ogive, which plots the cumulative frequency against the variable. The value of the item below which half the items lie can easily be read from the ogive, as is shown in Example 3.11.

Example 3.11: Obtain the median of data given in the following table.

Monthly Earnings Frequency Less Than More Than

27.5 __ 0 6291

32.5 120 120 6171

37.5 152 272 6019

42.5 170 442 5849

47.5 214 656 5635

52.5 410 1066 5225

57.5 429 1495 4796

62.5 568 2063 4228

67.5 650 2713 3578

72.5 795 3508 2783

77.5 915 4423 1868

82.5 745 5168 1123

87.5 530 5698 593

92.5 259 5957 334

97.5 152 6109 182

102.5 107 6216 75

107.5 50 6266 25

112.5 25 6291 0


Solution: It is clear that this is grouped data. The first class is 27.5–32.5, whose frequency is 120, and the last class is 107.5–112.5, whose frequency is 25. Figure 3.2 shows the ogive of the ‘less than’ cumulative frequency. The median is the value below which N/2 = 6291/2 = 3145.5 items lie, which is read off from Figure 3.2 as about 70. More accuracy than this is unobtainable because of the space limitation on the earning scale.

[Figure: ‘less than’ and ‘more than’ cumulative frequency curves, with the median marked. Horizontal axis: Monthly Earnings in Rupees (27.5 to 112.5); vertical axis: Number of Workers (0 to 6291).]

Figure 3.1 Median Determination by Plotting Less than and More than Cumulative Frequency

The median can also be determined by plotting both ‘less than’ and ‘more than’ cumulative frequency as shown in Figure 3.1. It should be obvious that the two curves should intersect at the median of the data.

[Figure: cumulative frequency curve (ogive), with the median marked. Horizontal axis: Monthly Earnings in Rupees (27.5 to 112.5); vertical axis: Number of Workers (0 to 6000).]

Figure 3.2 Median

Advantages of Median

1. Median is a positional average and hence the extreme values in the data set do not affect it as much as they do the mean.

2. Median is easy to understand and can be calculated from any kind of data, even from grouped data with open-ended classes.

3. We can find the median even when our data set is qualitative and can be arranged in the ascending or the descending order, such as average beauty or average intelligence.

4. Similar to mean, median is also unique, meaning that there is only one median in a given set of data.

5. Median can be located visually when the data is in the form of ordered data.

6. The sum of absolute differences of all values in the data set from the median value is minimum. This means that it is less than the corresponding sum for any other value of central tendency in the data set, which makes the median more central in certain situations.


Disadvantages of Median

1. The data must be arranged in order to find the median. This can be very time consuming for a large number of elements in the data set.

2. The value of the median is affected more by sampling variations. Different samples from the same population may give significantly different values of the median.

3. The calculation of median in case of grouped data is based on the assumption that the values of observation are evenly spaced over the entire class interval and this is usually not so.

4. Median is comparatively less stable than mean, particularly for small samples, due to fluctuations in sampling.

5. Median is not suitable for further mathematical treatment. For example, we cannot compute the median of the combined group from the median values of different groups.

3.3.3 Mode

The mode is that value of the variable which occurs or repeats itself the greatest number of times. The mode is the most ‘fashionable’ size in the sense that it is the most common and typical, and is defined by Zizek as ‘the value occurring most frequently in a series (or group of items) and around which the other items are distributed most densely’.

The mode of a distribution is the value at the point around which the items tend to be most heavily concentrated. It is the most frequent or the most common value, provided that a sufficiently large number of items are available to give a smooth distribution. It will correspond to the value of the maximum point (ordinate) of a frequency distribution if it is an ‘ideal’ or smooth distribution. It may be regarded as the most typical of a series of values. The modal wage, for example, is the wage received by more individuals than any other wage. The modal ‘hat’ size is that which is worn by more persons than any other single size.

It may be noted that the occurrence of one or a few extremely high or low values has no effect upon the mode. If a series of data is unclassified, i.e., has not been either arrayed or put into a frequency distribution, the mode cannot be readily located.


Taking first an extremely simple example, if seven men are receiving daily wages of Rs 5, 6, 7, 7, 7, 8 and 10, it is clear that the modal wage is Rs 7 per day. If we have a series such as 2, 3, 5, 6, 7, 10 and 11, it is apparent that there is no mode.
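Locating the most frequently repeated value can be sketched in Python as follows. The helper `modes` is a hypothetical name; it adopts the convention used in the text that a series in which no value repeats has no mode, and it returns every value tied for the highest frequency.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s); an empty list if no value repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:                    # every value occurs once: no mode
        return []
    return [v for v, c in counts.items() if c == highest]

wage_mode = modes([5, 6, 7, 7, 7, 8, 10])
no_mode = modes([2, 3, 5, 6, 7, 10, 11])
print(wage_mode, no_mode)
```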

There are several methods of estimating the value of the mode. But, it is seldom that the different methods of ascertaining the mode give us identical results. Consequently, it becomes necessary to decide as to which method would be most suitable for the purpose in hand. In order that a choice of the method may be made, we should understand each of the methods and the differences that exist among them.

The four important methods of estimating mode of a series are: (i) Locating the most frequently repeated value in the array; (ii) Estimating the mode by interpolation; (iii) Locating the mode by graphic method; and (iv) Estimating the mode from the mean and the median. Only the last three methods are discussed in this unit.

Estimating the Mode by Interpolation. In the case of continuous frequency distributions, the problem of determining the value of the mode is not so simple as it might have appeared from the foregoing description. Having located the modal class of the data, the next problem in the case of continuous series is to interpolate the value of the mode within this ‘modal’ class.

The interpolation is made by the use of any one of the following formulae:

$$\text{(i)} \quad Mo = l_1 + \frac{f_2}{f_0 + f_2} \times i$$

$$\text{(ii)} \quad Mo = l_2 - \frac{f_0}{f_0 + f_2} \times i$$

$$\text{(iii)} \quad Mo = l_1 + \frac{f_1 - f_0}{(f_1 - f_0) + (f_1 - f_2)} \times i$$

where $l_1$ is the lower limit of the modal class, $l_2$ is the upper limit of the modal class, $f_0$ equals the frequency of the preceding class in value, $f_1$ equals the frequency of the modal class in value, $f_2$ equals the frequency of the following class (class next to modal class) in value, and i equals the interval of the modal class.


Example 3.12: Determine the mode for the data given in the following table.

Wage Group    Frequency (f)
14–18         6
18–22         18
22–26         19
26–30         12
30–34         5
34–38         4
38–42         3
42–46         2
46–50         1
50–54         0
54–58         1

Solution: In the given data, 22–26 is the modal class since it has the largest frequency. The lower limit of the modal class is 22, its upper limit is 26, its frequency is 19, the frequency of the preceding class is 18, and that of the following class is 12. The class interval is 4. Using the various methods of determining mode, we have,

$$\text{(i)} \quad Mo = 22 + \frac{12}{18 + 12} \times 4 = 22 + \frac{8}{5} = 23.6$$

$$\text{(ii)} \quad Mo = 26 - \frac{18 \times 4}{18 + 12} = 26 - \frac{12}{5} = 23.6$$

$$\text{(iii)} \quad Mo = 22 + \frac{19 - 18}{(19 - 18) + (19 - 12)} \times 4 = 22 + \frac{4}{8} = 22.5$$
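The three interpolation formulae can be compared side by side with a short Python sketch. The function names below are hypothetical labels for formulae (i), (ii) and (iii); the inputs are the modal-class figures of Example 3.12.

```python
def mode_from_lower_limit(l1, f0, f2, i):
    """Formula (i): start at the lower limit, pulled up by the following class."""
    return l1 + f2 / (f0 + f2) * i

def mode_from_upper_limit(l2, f0, f2, i):
    """Formula (ii): start at the upper limit, pulled down by the preceding class."""
    return l2 - f0 / (f0 + f2) * i

def mode_from_differences(l1, f0, f1, f2, i):
    """Formula (iii): uses the differences between the modal and adjoining frequencies."""
    return l1 + (f1 - f0) / ((f1 - f0) + (f1 - f2)) * i

# Modal class 22-26 with f0 = 18, f1 = 19, f2 = 12 and interval 4.
mode_i = mode_from_lower_limit(22, 18, 12, 4)
mode_ii = mode_from_upper_limit(26, 18, 12, 4)
mode_iii = mode_from_differences(22, 18, 19, 12, 4)
print(round(mode_i, 2), round(mode_ii, 2), round(mode_iii, 2))
```

Formulae (i) and (ii) always agree with each other because the two fractions add up to one full class interval; formula (iii) generally gives a different value.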

In formulae (i) and (ii), the frequency of the classes adjoining the modal class is used to pull the estimate of the mode away from the midpoint towards either the upper or lower class limit. In this particular case, the frequency of the class preceding the modal class is more than the frequency of the class following and therefore, the estimated mode is less than the midvalue of the modal class. This seems quite logical. If the frequencies are more on one side of the modal class than on the other, it can be reasonably concluded that the items in the modal class are concentrated more towards the class limit of the adjoining class with the larger frequency.

Formula (iii) is also based on a logic similar to that of (i) and (ii). In this case, to interpolate the value of the mode within the modal class, the differences between the frequency of the modal class and the respective frequencies of the classes adjoining it are used. This formula usually gives better results than the other two, and its results are exactly equal to those obtained by the graphic method. Formulae (i) and (ii) give values which are different from the value obtained by formula (iii) and are closer to the central point of the modal class. If the frequencies of the classes adjoining the modal class are equal, the mode is expected to be located at the midvalue of the modal class, but if the frequency on one of the sides is greater, the mode will be pulled away from the central point. It will be pulled further as the difference between the frequencies of the classes adjoining the modal class becomes larger. In Example 3.12, the frequency of the modal class is 19 and that of the preceding class is 18. So, the mode should be quite close to the lower limit of the modal class. The midpoint of the modal class is 24 and the lower limit of the modal class is 22.

Locating the Mode by the Graphic Method. The method of graphic interpolation is illustrated in Figure 3.3. The upper corners of the rectangle over the modal class have been joined by straight lines to those of the adjoining rectangles as shown in the diagram; the right corner to the corresponding one of the adjoining rectangle on the left, etc. If a perpendicular is drawn from the point of intersection of these lines, we have a value for the mode indicated on the base line. The graphic approach is, in principle, similar to the arithmetic interpolation explained earlier.

The mode may also be determined graphically from an ogive or cumulative frequency curve. It is found by drawing a perpendicular to the base from that point on the curve where the curve is most nearly vertical, i.e., steepest (in other words, where it passes through the greatest distance vertically and the smallest distance horizontally). The point where it cuts the base gives us the value of the mode. How accurately this method determines the mode is governed by: (i) the shape of the ogive, and (ii) the scale on which the curve is drawn.

Estimating the Mode from the Mean and the Median. There usually exists a relationship among the mean, median and mode for moderately asymmetrical distributions. If the distribution is symmetrical, the mean, median and mode will have identical values, but if the distribution is skewed (moderately), the mean, median and mode will pull apart. If the distribution tails off towards higher values, the mean and the median will be greater than the mode. If it tails off towards lower values, the mode will be greater than either of the other two measures. In either case, the median will be about one-third as far away from the mean as the mode is. This means that,

Mode = Mean – 3 (Mean – Median)

= 3 Median – 2 Mean


Figure 3.3 Method of Mode Determination by Graphic Interpolation

In the case of the average monthly earnings, the mean is 68.53 and the median is 70.2. If these values are substituted in the above formula, we get,

Mode = 68.5 – 3(68.5 – 70.2) = 68.5 + 5.1 = 73.6

According to the formula used earlier,

$$\text{Mode} = l_1 + \frac{f_2}{f_0 + f_2} \times i = 72.5 + \frac{745}{795 + 745} \times 5 = 72.5 + 2.4 = 74.9$$

OR

$$\text{Mode} = l_1 + \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \times i = 72.5 + \frac{915 - 795}{2 \times 915 - 795 - 745} \times 5 = 72.5 + \frac{120}{290} \times 5 = 74.57$$


The difference between the two estimates is due to the fact that theassumption of relationship between the mean, median and mode may not alwaysbe true which is obviously not valid in this case.Example 3.13: (i) In a moderately symmetrical distribution, the mode and meanare 32.1 and 35.4 respectively. Calculate the median.

(ii) If the mode and median of moderately asymmetrical series arerespectively 16'' and 15.7'', what would be its most probable median?

(iii) In a moderately skewed distribution, the mean and the median arerespectively 25.6 and 26.1 inches. What is the mode of the distribution?Solution: (i) We know,

Mean – Mode = 3 (Mean – Median)

or 3 Median = Mode + 2 Mean

$$\text{or Median} = \frac{32.1 + 2 \times 35.4}{3} = \frac{102.9}{3} = 34.3$$

(ii) 2 Mean = 3 Median – Mode

$$\text{or Mean} = \frac{1}{2}(3 \times 15.7 - 16.0) = \frac{31.1}{2} = 15.55$$

(iii) Mode = 3 Median – 2 Mean

= 3 × 26.1 – 2 × 25.6 = 78.3 – 51.2 = 27.1
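The three parts of Example 3.13 are rearrangements of the same empirical relation, Mean – Mode = 3 (Mean – Median). A minimal sketch, with illustrative function names, reproduces them; it is only as reliable as the relation itself, which holds approximately and only for moderately skewed distributions.

```python
def empirical_mode(mean, median):
    """Mode = 3 Median - 2 Mean."""
    return 3 * median - 2 * mean


def empirical_median(mean, mode):
    """Median = (Mode + 2 Mean) / 3."""
    return (mode + 2 * mean) / 3


def empirical_mean(median, mode):
    """Mean = (3 Median - Mode) / 2."""
    return (3 * median - mode) / 2


print(round(empirical_median(mean=35.4, mode=32.1), 2))   # part (i)
print(round(empirical_mean(median=15.7, mode=16.0), 2))   # part (ii)
print(round(empirical_mode(mean=25.6, median=26.1), 2))   # part (iii)
```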

Advantages of Mode

1. Similar to the median, the mode is not affected by extreme values in the data.

2. Its value can be obtained in open-ended distributions without ascertaining the class limits.

3. It can easily be used to describe qualitative phenomena. For example, if most people prefer a certain brand of tea, then this will become the modal point.

4. Mode is easy to calculate and understand. In some cases, it can be located simply by observation or inspection.

Disadvantages of Mode

1. Quite often, there is no modal value.

2. It can be bi-modal or multi-modal, or it can have all modal values, making its significance more difficult to measure.


3. If there is more than one modal value, the data is difficult to interpret.

4. A mode is not suitable for algebraic manipulations.

5. Since the mode is the value of maximum frequency in the data set, it cannot be rigidly defined if such frequency occurs at the beginning or at the end of the distribution.

6. It does not include all observations in the data set and hence is less reliable in most situations.

Activity 2

The following figures represent the number of books issued at the counter of a commerce library on 11 different days. Calculate the median.

96, 180, 98, 75, 270, 20, 102, 100, 94, 75, 200.

Self-Assessment Questions

3. State whether true or false.

(a) The mean is computed by adding all the data values and dividing the sum by the number of such values.

(b) The mode is that value of the variable which occurs or repeats itself the greatest number of times.

4. Fill in the blanks with the appropriate terms.

(a) Weight is represented by the symbol ‘w’, and Σw represents the ___________ of weights.

(b) Median is that ______________ of a variable which divides the series in such a manner that the number of items below it is equal to the number of items above it.

3.4 Quartiles

Some measures, other than measures of central tendency, are often employed when summarizing or describing a set of data where it is necessary to divide the data into equal parts. These are positional measures and are called quantiles and consist of quartiles, deciles and percentiles. The quartiles divide the data into four equal parts. The deciles divide the total ordered data into ten equal parts and the percentiles divide the data into 100 equal parts. Consequently,


there are three quartiles, nine deciles and 99 percentiles. The quartiles are denoted by the symbol Q, which can be subscripted as Q1, Q2 and Q3. Here, Q1 is the point in the ordered data which has 25 per cent of the data below it and 75 per cent of the data above it. In other words, Q1 is the value corresponding to the $\frac{n+1}{4}$th ordered observation. Similarly, Q2 divides the data in the middle and is equal to the median. Its value is given by:

Q2 = The value of the $\frac{2(n+1)}{4}$th ordered observation in the data.

Similarly, we can calculate the values of various deciles. For instance,

D1 = The $\frac{n+1}{10}$th observation in the ordered data, and

D7 = The $\frac{7(n+1)}{10}$th observation in the ordered data.

Percentiles are generally used in educational research, where people are given standard tests and it is desirable to compare the relative position of a subject’s performance on the test. Percentiles are similarly calculated as:

P7 = The $\frac{7(n+1)}{100}$th observation in the ordered data,

and,

P69 = The $\frac{69(n+1)}{100}$th observation in the ordered data.

Quartiles

The formula for calculating the values of quartiles for grouped data is given as follows:

Q = L + (j/f)C

Where,

Q = The quartile under consideration.

L = Lower limit of the class interval which contains the value of Q.

j = The number of observations still needed from the class interval which contains Q in order to reach the position of Q.


f = Frequency of the class interval containing Q.

C = Size of the class interval.

Let us assume that we took data on the ages of 100 students and that a frequency distribution for this data has been constructed as shown.

The frequency distribution is as follows:

Ages (CI)         Mid-point (X)    (f)    f(X)     f(X)²
16 and upto 17        16.5           4      66     1089.0
17 and upto 18        17.5          14     245     4287.5
18 and upto 19        18.5          18     333     6160.5
19 and upto 20        19.5          28     546    10647.0
20 and upto 21        20.5          20     410     8405.0
21 and upto 22        21.5          12     258     5547.0
22 and upto 23        22.5           4      90     2025.0

Total                              100    1948    38161.0

In our case, Q1 is the cut-off point such that 25 per cent of the data is below it and 75 per cent is above it. The first group has 4 students and the second group has 14 students, making a total of 18 students. Since Q1 cuts off at 25 students, it is the third class interval which contains Q1. This means that the value of L in our formula is 18.

Since we already have 18 students in the first two groups, we need 7 more students from the third group to make a total of 25 students, which is the position of Q1. Hence, the value of (j) is 7. Also, since the frequency of this third class interval which contains Q1 is 18, the value of (f) in our formula is 18. The size of the class interval C is given as 1. Substituting these values in the formula for Q, we get,

Q1 = 18 + (7/18)1

= 18 + 0.39 = 18.39

This means that 25 per cent of the students are below 18.39 years of age and 75 per cent are above this age.

Similarly, we can calculate the value of Q2, using the same formula. Hence,

Q2 = L + (j/f)C

= 19 + (14/28)1

= 19.5

This also happens to be the median.
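The walk through the cumulative frequencies can be sketched in code. This is a minimal illustration of Q = L + (j/f)C applied to the ages table above; the function name and the list-based layout of the table are assumptions made for the example.

```python
def grouped_quartile(k, lower_limits, freqs, class_size):
    """k-th quartile (k = 1, 2, 3) of grouped data via Q = L + (j / f) * C."""
    total = sum(freqs)
    position = k * total / 4          # how many observations lie below Q
    cumulative = 0                    # observations in the classes already passed
    for lower, f in zip(lower_limits, freqs):
        if f > 0 and cumulative + f >= position:
            j = position - cumulative  # observations still needed from this class
            return lower + (j / f) * class_size
        cumulative += f
    raise ValueError("quartile position lies beyond the data")


lower_limits = [16, 17, 18, 19, 20, 21, 22]
freqs = [4, 14, 18, 28, 20, 12, 4]

q1 = grouped_quartile(1, lower_limits, freqs, 1)
q2 = grouped_quartile(2, lower_limits, freqs, 1)
print(round(q1, 2), q2)
```

For Q1 the loop stops in the 18 to 19 class with j = 7 and f = 18, and for Q2 it stops in the 19 to 20 class with j = 14 and f = 28, matching the substitutions made in the text.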


By using the same formula and the same logic we can calculate the values of all deciles as well as percentiles.

We have defined the median as the value of the item which is located at the centre of the array. We can define other measures which are located at other specified points. Thus, the nth percentile of an array is the value of the item such that n per cent of the items lie below it. Clearly then, the nth percentile Pn of grouped data is given by,

$$P_n = l + \frac{\frac{nN}{100} - C}{f} \times i$$

Here, l is the lower limit of the class in which the (nN/100)th item lies, i its width, f its frequency, C the cumulative frequency upto (but not including) this class, and N is the total number of items.

We can similarly define the nth decile as the value of the item below which (nN/10) items of the array lie. Clearly,

$$D_n = P_{10n} = l + \frac{\frac{nN}{10} - C}{f} \times i$$

where the symbols have the obvious meanings.

The other most commonly referred to measures of location are the quartiles. Thus, the nth quartile is the value of the item which lies at the n(N/4)th item. Clearly, Q2, the second quartile, is the median for grouped data.

$$Q_n = P_{25n} = l + \frac{\frac{nN}{4} - C}{f} \times i$$
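Since deciles and quartiles are just particular percentiles, one function is enough for all three. The sketch below implements the percentile formula for grouped data and derives the others from it; the names and the tuple layout (lower limit, width, frequency) are assumptions for illustration, and the data are the student ages tabulated earlier in this section.

```python
def grouped_percentile(n, classes):
    """n-th percentile of grouped data.

    classes: list of (lower_limit, width, frequency) tuples in ascending order.
    """
    total = sum(f for _, _, f in classes)      # N in the text
    position = n * total / 100                 # nN/100
    cumulative = 0                             # C in the text
    for lower, width, f in classes:
        if f > 0 and cumulative + f >= position:
            return lower + (position - cumulative) / f * width
        cumulative += f
    raise ValueError("percentile must lie between 0 and 100")


def grouped_decile(n, classes):
    return grouped_percentile(10 * n, classes)   # D_n = P_10n


def grouped_quartile_from_percentile(n, classes):
    return grouped_percentile(25 * n, classes)   # Q_n = P_25n


ages = [(16, 1, 4), (17, 1, 14), (18, 1, 18), (19, 1, 28),
        (20, 1, 20), (21, 1, 12), (22, 1, 4)]

print(grouped_quartile_from_percentile(2, ages))  # the median
print(grouped_decile(9, ages))
```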

Self-Assessment Questions

5. Fill in the blanks with the appropriate terms.

(a) The positional measures are called ______________ and consist of quartiles, deciles and percentiles.

(b) The nth percentile of an ____________ is the value of the item such that n per cent of the items lie below it.

6. State whether true or false.

(a) The quartiles divide the data into eight equal parts.

(b) The deciles divide the total ordered data into ten equal parts.


3.5 Range

The crudest measure of dispersion is the range of the distribution. Range of any series is the difference between the highest and the lowest values in the series. If the marks received in an examination taken by 248 students are arranged in the ascending order, then the range will be equal to the difference between the highest and the lowest marks.

In a frequency distribution, the range is taken to be the difference between the lower limit of the class at the lower extreme of the distribution and the upper limit of the class at the upper extreme.

Table 3.2 Weekly Earnings of Labourers in Four Workshops of the Same Type

Weekly Earnings                    No. of Workers
(Rs)            Workshop A   Workshop B   Workshop C   Workshop D
15–16              ...          ...           2           ...
17–18              ...            2           4           ...
19–20              ...            4           4             4
21–22               10           10          10            14
23–24               22           14          16            16
25–26               20           18          14            16
27–28               14           16          12            12
29–30               14           10           6            12
31–32              ...            6           6             4
33–34              ...          ...           2             2
35–36              ...          ...         ...           ...
37–38              ...          ...           4           ...

Total               80           80          80            80

Mean              25.5         25.5        25.5          25.5

Consider the data on weekly earnings of workers in four workshops given in the table. We note the following:

Workshop Range

A 9

B 15

C 23

D 15

From these figures, it is clear that the greater the range, the greater is the variation of the values in the group.


The range is a measure of absolute dispersion and, as such, cannot be usefully employed for comparing the variability of two distributions expressed in different units. The amount of dispersion measured, say, in pounds, is not comparable with dispersion measured in inches. So, the need of measuring relative dispersion arises.

An absolute measure can be converted into a relative measure if we divide it by some other value regarded as standard for the purpose. We may use the mean of the distribution or any other positional average as the standard.

For Table 3.2, the relative dispersion would be:

Workshop A = 9/25.5          Workshop C = 23/25.5

Workshop B = 15/25.5         Workshop D = 15/25.5

An alternative method of converting an absolute variation into a relative one would be to use the total of the extremes as the standard. This amounts to dividing the difference of the extreme items by the total of the extreme items. Thus,

$$\text{Relative Dispersion} = \frac{\text{Difference of extreme items, i.e., Range}}{\text{Sum of extreme items}}$$

The relative dispersion of the series is called the coefficient or the ratio of dispersion. In our example of weekly earnings of workers considered earlier, the coefficients would be:

Workshop A = 9/(21 + 30) = 9/51

Workshop B = 15/(17 + 32) = 15/49

Workshop C = 23/(15 + 38) = 23/53

Workshop D = 15/(19 + 34) = 15/53
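The absolute and relative measures for the four workshops can be reproduced with a few lines. The sketch takes the extreme class limits read off Table 3.2 and the common mean of 25.5; the dictionary layout and function names are illustrative assumptions.

```python
# Lowest and highest class limits with non-zero frequency, from Table 3.2.
extremes = {"A": (21, 30), "B": (17, 32), "C": (15, 38), "D": (19, 34)}
MEAN_EARNINGS = 25.5


def value_range(lowest, highest):
    """Absolute dispersion: difference of the extreme items."""
    return highest - lowest


def range_relative_to_mean(lowest, highest, mean):
    """Range divided by the mean, taken as the standard."""
    return (highest - lowest) / mean


def coefficient_of_range(lowest, highest):
    """Range divided by the sum of the extreme items."""
    return (highest - lowest) / (highest + lowest)


for name, (lo, hi) in extremes.items():
    print(name,
          value_range(lo, hi),
          round(range_relative_to_mean(lo, hi, MEAN_EARNINGS), 3),
          round(coefficient_of_range(lo, hi), 3))
```

Workshops B and D share the same range and the same ratio to the mean, yet their coefficients of range differ because the sums of their extremes differ (49 against 53).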

Merits and Limitations of Range

Merits. Of the various characteristics that a good measure of dispersion should possess, the range has only two, viz., (i) it is easy to understand, and (ii) its computation is simple.

Limitations. Apart from the aforesaid two qualities, the range does not satisfy the other tests of a good measure, and hence it is often termed a crude measure of dispersion.

The following are the limitations that are inherent in the range as a concept of variability:


(i) Since it is based on two extreme cases in the entire distribution, the range may be considerably changed if either of the extreme cases happens to drop out, while the removal of any other case would not affect it at all.

(ii) It does not tell anything about the distribution of values in the series relative to a measure of central tendency.

(iii) It cannot be computed when the distribution has open-end classes.

(iv) It does not take into account the entire data.

These limitations can be illustrated by the data given in Table 3.3.

The table is designed to illustrate three distributions with the same number of cases but different variability. The removal of two extreme students from section A would make its range equal to that of B or C.

Table 3.3 Distribution with the Same Number of Cases, but Different Variability

                      No. of Students
Class        Section A   Section B   Section C
0–10            ...         ...         ...
10–20             1         ...         ...
20–30            12          12          19
30–40            17          20          18
40–50            29          35          16
50–60            18          25          18
60–70            16          10          18
70–80             6           8          21
80–90            11         ...         ...
90–100          ...         ...         ...

Total           110         110         110

Range            80          60          60

The greater range of A is not a description of the entire group of 110 students, but of the two most extreme students only. Further, though sections B and C have the same range, the students in section B cluster more closely around the central tendency of the group than they do in section C. Thus, the range fails to reveal the greater homogeneity of B or the greater dispersion of C. Due to this defect, it is seldom used as a measure of dispersion.

Specific Uses of Range

In spite of the numerous limitations of the range as a measure of dispersion, there are the following circumstances when it is the most appropriate one:


(i) In situations where the extremes involve some hazard for which preparation should be made, it may be more important to know the most extreme cases to be encountered than to know anything else about the distribution. For example, an explorer would like to know the lowest and the highest temperatures on record in the region he is about to enter; or an engineer would like to know the maximum rainfall during 24 hours for the construction of a storage facility.

(ii) In the study of prices of securities, range has a special field of activity. Thus, to highlight fluctuations in the prices of shares or bullion, it is a common practice to indicate the range over which the prices have moved during a certain period of time. This information, besides being of use to the operators, gives an indication of the stability of the bullion market, or that of the investment climate.

(iii) In statistical quality control, the range is used as a measure of variation. For example, we determine the range over which variations in quality are due to random causes, which is made the basis for the fixation of control limits.

Self-Assessment Questions

7. Fill in the blanks with the appropriate terms.

(a) Range of any series is the ______________ between the highest and the lowest values in the series.

(b) The relative dispersion of the series is called the ___________________ or the ratio of dispersion.

8. State whether true or false.

(a) The crudest measure of dispersion is the range of the distribution.

(b) An absolute measure cannot be converted into a relative measure if we divide it by some other value regarded as standard for the purpose.

3.6 Standard Deviation

By far, the most universally used and the most useful measure of dispersion is the standard deviation or the root mean square deviation about the mean. We have seen that all the methods of measuring dispersion so far discussed are


not universally adopted for want of adequacy and accuracy. The range is not satisfactory as its magnitude is determined by the most extreme cases in the entire group. Further, the range is not stable because it depends on items whose size is largely a matter of chance. The mean deviation is also an unsatisfactory measure of scatter, as it ignores the algebraic signs of the deviations. We desire a measure of scatter which is free from these shortcomings. To some extent, standard deviation is one such measure.

The calculation of standard deviation differs in the following respects from that of mean deviation. First, in calculating standard deviation, the deviations are squared. This is done so as to get rid of negative signs without committing algebraic violence. Further, the squaring of deviations provides added weight to the extreme items, a desirable feature for certain types of series.

Second, the deviations are always recorded from the arithmetic mean, because although the sum of absolute deviations is the minimum from the median, the sum of squares of deviations is minimum when deviations are measured from the arithmetic average. The standard deviation about $\bar{x}$ is represented by $\sigma$.

Thus, standard deviation, $\sigma$ (sigma), is defined as the square root of the mean of the squares of the deviations of individual items from their arithmetic mean.

$$\sigma = \sqrt{\frac{\sum (x - \bar{x})^2}{N}}$$

For grouped data (discrete variables)

$$\sigma = \sqrt{\frac{\sum f (x - \bar{x})^2}{\sum f}}$$

and, for grouped data (continuous variables)

$$\sigma = \sqrt{\frac{\sum f (M - \bar{x})^2}{\sum f}}$$

Where, M is the mid-value of the group.

The use of these formulae is illustrated by the following examples.

Example 3.14: Compute the standard deviation for the following data:

11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21


Solution: Here the formula $\sigma = \sqrt{\sum (x - \bar{x})^2 / N}$ is appropriate. We first calculate the mean as $\bar{x} = \sum x / N = 176/11 = 16$, and then calculate the deviations as follows:

x        (x – x̄)     (x – x̄)²
11         –5           25
12         –4           16
13         –3            9
14         –2            4
15         –1            1
16          0            0
17         +1            1
18         +2            4
19         +3            9
20         +4           16
21         +5           25

176                    110

Thus, by using the formula, we get

$$\sigma = \sqrt{\frac{110}{11}} = \sqrt{10} = 3.16$$
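The tabular calculation in Example 3.14 translates directly into code. The sketch below follows the definition step by step (mean, squared deviations, mean of squares, square root); the function name is illustrative. Note that the divisor is N, not N – 1, as in the formula above.

```python
import math


def population_sd(values):
    """Square root of the mean of squared deviations from the arithmetic mean."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    return math.sqrt(sum(squared_deviations) / n)


data = [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21]
print(round(population_sd(data), 2))
```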

Example 3.15: Find the standard deviation of the data in the following distribution:

x 12 13 14 15 16 17 18 20

f 4 11 32 21 15 8 5 4

Solution: For this discrete variable grouped data, we use the formula $\sigma = \sqrt{\sum f (x - \bar{x})^2 / \sum f}$. Since for the calculation of $\bar{x}$ we need $\sum fx$, and then for $\sigma$ we need $\sum f (x - \bar{x})^2$, the calculations are conveniently made in the following format:


x        f       fx      d = x – x̄     d²      fd²
12       4       48         –3           9       36
13      11      143         –2           4       44
14      32      448         –1           1       32
15      21      315          0           0        0
16      15      240          1           1       15
17       8      136          2           4       32
18       5       90          3           9       45
20       4       80          5          25      100

       100     1500                             304

Here, $\bar{x} = \sum fx / \sum f = 1500/100 = 15$

$$\text{and } \sigma = \sqrt{\frac{\sum f d^2}{\sum f}} = \sqrt{\frac{304}{100}} = \sqrt{3.04} = 1.74$$
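For a discrete frequency distribution the only change is that every value is weighted by its frequency. A minimal sketch using the figures of Example 3.15 (the function name and the parallel-list layout are assumptions for illustration):

```python
import math


def grouped_mean_and_sd(values, freqs):
    """Mean and standard deviation of a discrete frequency distribution."""
    total_f = sum(freqs)
    mean = sum(f * x for x, f in zip(values, freqs)) / total_f
    sum_fd2 = sum(f * (x - mean) ** 2 for x, f in zip(values, freqs))
    return mean, math.sqrt(sum_fd2 / total_f)


x_values = [12, 13, 14, 15, 16, 17, 18, 20]
frequencies = [4, 11, 32, 21, 15, 8, 5, 4]

mean, sd = grouped_mean_and_sd(x_values, frequencies)
print(mean, round(sd, 2))
```

For continuous variables the same function can be used by passing the class mid-values M in place of x.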

Calculation of Standard Deviation by Short-cut Method

The examples worked out previously have one common simplifying feature, namely that x̄ in each turned out to be an integer, thus simplifying calculations. In most cases, it is very unlikely that it will turn out to be so. In such cases, the calculation of d and d² becomes quite time-consuming. Short-cut methods have consequently been developed. These are on the same lines as those for the calculation of the mean itself.

In the short-cut method, we calculate deviations x' from an assumed mean A. Then, for ungrouped data

$$\sigma = \sqrt{\frac{\sum x'^2}{N} - \left(\frac{\sum x'}{N}\right)^2}$$

and for grouped data

$$\sigma = \sqrt{\frac{\sum f x'^2}{\sum f} - \left(\frac{\sum f x'}{\sum f}\right)^2}$$

This formula is valid for both discrete and continuous variables. In case of continuous variables, x in the equation x' = x – A stands for the mid-value of the class in question.


Note that the second term in each of the formulae is a correction term because of the difference in the values of A and x̄. When A is taken as x̄ itself, this correction is automatically reduced to zero. The following examples explain the use of these formulae.

Example 3.16: Compute the standard deviation by the short-cut method for the following data:

11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21

Solution: Let us assume that A = 15.

x        x' = (x – 15)     x'²
11           –4             16
12           –3              9
13           –2              4
14           –1              1
15            0              0
16            1              1
17            2              4
18            3              9
19            4             16
20            5             25
21            6             36

N = 11     Σx' = 11      Σx'² = 121

$$\sigma = \sqrt{\frac{\sum x'^2}{N} - \left(\frac{\sum x'}{N}\right)^2} = \sqrt{\frac{121}{11} - \left(\frac{11}{11}\right)^2} = \sqrt{11 - 1} = \sqrt{10} = 3.16$$
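A short sketch makes the role of the correction term concrete: whatever assumed mean A is chosen, the short-cut formula returns the same standard deviation. The function name is illustrative, and the alternative values of A tried at the end are arbitrary choices for the demonstration.

```python
import math


def shortcut_sd(values, assumed_mean):
    """Standard deviation from deviations x' = x - A about an assumed mean A."""
    n = len(values)
    deviations = [x - assumed_mean for x in values]
    mean_of_squares = sum(d * d for d in deviations) / n
    correction = (sum(deviations) / n) ** 2   # zero when A equals the true mean
    return math.sqrt(mean_of_squares - correction)


data = [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21]

print(round(shortcut_sd(data, 15), 2))        # A = 15, as in Example 3.16
for a in (10, 16, 20):                        # other assumed means give the same result
    print(a, round(shortcut_sd(data, a), 2))
```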

Example 3.17: Calculate the standard deviation of the following data by the short-cut method.

x     0–10   10–20   20–30   30–40   40–50   50–60   60–70   70–80
f      18      16      15      12      10       5       2       1


Solution:

Class     Mid-point   Frequency   Deviation from assumed      Deviation times    Squared deviation
            (x)         (f)       mean in class units (x')    frequency (fx')    times frequency (fx'²)
0–10         5           18              –2                       –36                 72
10–20       15           16              –1                       –16                 16
20–30       25           15               0                         0                  0
30–40       35           12               1                        12                 12
40–50       45           10               2                        20                 40
50–60       55            5               3                        15                 45
60–70       65            2               4                         8                 32
70–80       75            1               5                         5                 25

                     Σf = 79                          Σfx' = –52 + 60 = 8       Σfx'² = 242

Since the deviations are from an assumed mean and expressed in terms of class-interval units,

$$\sigma = i \times \sqrt{\frac{\sum f x'^2}{N} - \left(\frac{\sum f x'}{N}\right)^2} = 10 \times \sqrt{\frac{242}{79} - \left(\frac{8}{79}\right)^2}$$

= 10 × 1.75 = 17.5
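The step-deviation calculation of Example 3.17 can be sketched as follows. The assumed mean A = 25 is read off the table (it is the mid-point whose deviation is 0); the function names are illustrative. A direct calculation from the mid-points is included as a cross-check.

```python
import math


def step_deviation_sd(midpoints, freqs, assumed_mean, width):
    """sigma = i * sqrt(sum(f x'^2)/N - (sum(f x')/N)^2), with x' = (x - A) / i."""
    n = sum(freqs)
    steps = [(m - assumed_mean) / width for m in midpoints]
    sum_fx = sum(f * s for f, s in zip(freqs, steps))
    sum_fx2 = sum(f * s * s for f, s in zip(freqs, steps))
    return width * math.sqrt(sum_fx2 / n - (sum_fx / n) ** 2)


def direct_sd(midpoints, freqs):
    """Same quantity computed directly from deviations about the true mean."""
    n = sum(freqs)
    mean = sum(f * m for f, m in zip(freqs, midpoints)) / n
    return math.sqrt(sum(f * (m - mean) ** 2 for f, m in zip(freqs, midpoints)) / n)


midpoints = [5, 15, 25, 35, 45, 55, 65, 75]
freqs = [18, 16, 15, 12, 10, 5, 2, 1]

sd = step_deviation_sd(midpoints, freqs, assumed_mean=25, width=10)
print(round(sd, 2), round(direct_sd(midpoints, freqs), 2))
```

The unrounded result is slightly below the 17.5 quoted in the text, which multiplies by 10 after rounding the square root to 1.75.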

Combining Standard Deviations of Two Distributions

If we were given two sets of data of N1 and N2 items with means x̄1 and x̄2 and standard deviations σ1 and σ2 respectively, we can obtain the mean x̄ and the standard deviation σ of the combined distribution by the following formulae:

$$\bar{x} = \frac{N_1 \bar{x}_1 + N_2 \bar{x}_2}{N_1 + N_2}$$

$$\text{and } \sigma = \sqrt{\frac{N_1 \sigma_1^2 + N_2 \sigma_2^2 + N_1 (\bar{x} - \bar{x}_1)^2 + N_2 (\bar{x} - \bar{x}_2)^2}{N_1 + N_2}}$$


Example 3.18: The mean and the standard deviations of two distributions of 100 and 150 items are 50, 5 and 40, 6 respectively. Find the standard deviation of all taken together.

Solution: Combined mean,

$$\bar{x} = \frac{N_1 \bar{x}_1 + N_2 \bar{x}_2}{N_1 + N_2} = \frac{100 \times 50 + 150 \times 40}{100 + 150} = 44$$

Combined standard deviation,

$$\sigma = \sqrt{\frac{N_1 \sigma_1^2 + N_2 \sigma_2^2 + N_1 (\bar{x} - \bar{x}_1)^2 + N_2 (\bar{x} - \bar{x}_2)^2}{N_1 + N_2}}$$

$$= \sqrt{\frac{100 \times (5)^2 + 150 \times (6)^2 + 100 \times (44 - 50)^2 + 150 \times (44 - 40)^2}{100 + 150}}$$

= 7.46
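The combination formulae need only the sizes, means and standard deviations of the two groups, not the raw data. A minimal sketch with the figures of Example 3.18 (function and parameter names are illustrative):

```python
import math


def combine(n1, mean1, sd1, n2, mean2, sd2):
    """Mean and standard deviation of two distributions taken together."""
    n = n1 + n2
    mean = (n1 * mean1 + n2 * mean2) / n
    within = n1 * sd1 ** 2 + n2 * sd2 ** 2                        # spread inside each group
    between = n1 * (mean - mean1) ** 2 + n2 * (mean - mean2) ** 2  # spread of group means
    return mean, math.sqrt((within + between) / n)


combined_mean, combined_sd = combine(100, 50, 5, 150, 40, 6)
print(combined_mean, round(combined_sd, 2))
```

The combined standard deviation exceeds both 5 and 6 because the two group means differ, which adds the "between" term to the total.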

Comparison of Various Measures of Dispersion

The range is the easiest measure of dispersion to calculate, but since it depends on extreme values, it is extremely sensitive to the size of the sample and to sampling variability. In fact, as the sample size increases, the range tends to increase, because the more items one considers, the more likely it is that some item will turn up which is larger than the previous maximum or smaller than the previous minimum. So, in general, it is impossible to interpret properly the significance of a given range unless the sample size is constant. It is for this reason that there appears to be only one valid application of the range, namely in statistical quality control, where the same sample size is repeatedly used, so that comparisons of ranges are not distorted by differences in sample size.

The quartile deviations and other such positional measures of dispersion are also easy to calculate, but suffer from the disadvantage that they are not amenable to algebraic treatment. Similarly, the mean deviation is not suitable because we cannot obtain the mean deviation of a combined series from the deviations of component series. However, it is easy to interpret and easier to calculate than the standard deviation.

The standard deviation of a set of data, on the other hand, is one of the most important statistics describing it. It lends itself to rigorous algebraic treatment, is rigidly defined and is based on all observations. It is, therefore, quite insensitive to sample size (provided the size is ‘large enough’) and is least affected by sampling variations.


It is used extensively in testing hypotheses about population parameters based on sample statistics.

In fact, the standard deviation has such stable mathematical properties that it is used as a standard scale for measuring deviations from the mean. If we are told that the performance of an individual is 10 points better than the mean, it really does not tell us enough, for 10 points may or may not be a large enough difference to be of significance. But, if we know that the σ for the score is only 4 points, so that on this scale the performance is 2.5σ better than the mean, the statement becomes meaningful. This indicates an extremely good performance. This sigma scale is a very commonly used scale for measuring and specifying deviations, and it immediately suggests the significance of the deviation.
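Expressing a deviation on the sigma scale is a single division. In the sketch below the function name is illustrative, and the mean of 60 and score of 70 are hypothetical figures chosen only so that the deviation is the 10 points mentioned in the text.

```python
def sigma_units(value, mean, sd):
    """Deviation of a value from the mean, measured in standard deviations."""
    return (value - mean) / sd


# Hypothetical score 10 points above a hypothetical mean, with sigma = 4.
print(sigma_units(value=70, mean=60, sd=4))
```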

The only disadvantage of the standard deviation lies in the amount of work involved in its calculation, and the large weight it attaches to extreme values because of the process of squaring involved in its calculations.

Activity 3

Calculate standard deviation from the following data:

Size of Item 6 7 8 9 10 11 12

Frequency 3 6 9 13 8 5 4

Self-Assessment Questions

9. Fill in the blanks with the appropriate terms.

(a) The squaring of deviations provides added _________________ to the extreme items.

(b) Standard deviation, σ (sigma), is defined as the square root of the mean of the _________________ of the deviations of individual items from their arithmetic mean.

10. State whether true or false.

(a) In calculating standard deviation, the deviations are squared.

(b) The deviations are sometimes recorded from the arithmetic mean.


3.7 Summary

Let us recapitulate the important concepts discussed in this unit:

Percentage is used in business and economic fields for making comparisons of profit, growth rate, magnitude, performance, etc. The concept of percentage applies mainly to ratios.

Percentage point change only notes the change in percentage whereas percentage change notes the change with reference to the original value.

An average is the measure of central tendency of a set of numbers.

The mean is computed by adding all the data values and dividing the sum by the number of such values.

In weighted arithmetic mean, the weight is represented by the symbol ‘w’, and Σw represents the sum of weights. It is used to compute the mean of means.

Median is that value of a variable which divides the series in such a manner that the number of items below it is equal to the number of items above it. Half of the total number of observations lies below the median and half above it. The median is thus a positional average.

The mode is that value of the variable which occurs or repeats itself the greatest number of times. The mode of a distribution is the value at the point around which the items tend to be most heavily concentrated. It is the most frequent or the most common value, provided that a sufficiently large number of items are available to give a smooth distribution.

The positional measures are called quantiles and consist of quartiles, deciles and percentiles. The quartiles divide the data into four equal parts. The deciles divide the total ordered data into ten equal parts and the percentiles divide the data into 100 equal parts.

The crudest measure of dispersion is the range of the distribution. Range of any series is the difference between the highest and the lowest values in the series.

In a frequency distribution, the range is taken to be the difference between the lower limit of the class at the lower extreme of the distribution and the upper limit of the class at the upper extreme.

An absolute measure can be converted into a relative measure if we divide it by some other value regarded as standard for the purpose.


In calculating standard deviation, the deviations are squared. This is done so as to get rid of negative signs without committing algebraic violence. The deviations are always recorded from the arithmetic mean.

Standard deviation, σ (sigma), is defined as the square root of the mean of the squares of the deviations of individual items from their arithmetic mean.

3.8 Glossary

Mean: An arithmetic average and measure of central location.

Median: The measure of central tendency that appears in the centre of ordered data.

Mode: Another form of average that can be defined as the most frequently occurring value in the data.

Quartile: A positional measure that divides the data into four equal parts.

Range: The difference between the maximum and minimum values. It indicates the limits within which the values fall.

Standard deviation: A measure of the variability or dispersion of a population, a data set, or a probability distribution. A low standard deviation indicates that the data points tend to be very close to the same value (the mean), while a high standard deviation indicates that the data are spread out over a large range of values.

3.9 Terminal Questions

1. How is percentage change calculated? Name the two changes recorded in it.

2. What is the significance of ratio and averages?

3. Explain the methods for calculating mean.

4. Explain the term median with the help of an example.

5. Explain the significance of mode in statistical calculations.

6. What are quartiles?

7. What is range? How is it calculated?

8. How is standard deviation calculated? Explain with the help of an example.


3.10 Answers

Answers to Self-Assessment Questions

1. (a) Denominator; (b) Antecedent

2. (a) True; (b) False

3. (a) True; (b) True

4. (a) Sum; (b) Value

5. (a) Quantiles; (b) Array

6. (a) False; (b) True

7. (a) Difference; (b) Coefficient

8. (a) True; (b) False

9. (a) Weight; (b) Squares

10. (a) True; (b) False

Answers to Terminal Questions

1. Refer Section 3.2.1

2. Refer Sections 3.2.2 and 3.2.3

3. Refer Section 3.3.1

4. Refer Section 3.3.2

5. Refer Section 3.3.3

6. Refer Section 3.4

7. Refer Section 3.5

8. Refer Section 3.6

3.11 Further Reading

1. Elhance, D.N. Fundamentals of Statistics. Allahabad: Kitab Mahal, 2007.

2. Gupta, S.C. Fundamentals of Business Statistics. New Delhi: Sultan Chand & Sons, 2010.


Unit 4 Index Numbers

Structure

4.1 Introduction
Objectives
4.2 Index Numbers
4.3 Summary
4.4 Glossary
4.5 Terminal Questions
4.6 Answers
4.7 Further Reading

4.1 Introduction

In the previous unit you learnt about data analysis techniques such as measures of dispersion. In this unit you will learn about index numbers, their various types and the reasons why index numbers are required. Index numbers are a specialized type of average. They are designed to measure the relative change in the level of a phenomenon with respect to time, geographical location or some other characteristic. You will also learn about the different formulae and methods devised for constructing index numbers and the problems one faces while constructing them.

Objectives

After studying this unit, you should be able to:

Discuss the various formulae and methods used in constructing index numbers

Construct index numbers

Use index numbers for various purposes.

4.2 Index Numbers

Index numbers are a specialized type of average. They are designed to measure the relative change in the level of a phenomenon with respect to time, geographical location or some other characteristic.


Originally, index numbers were developed for measuring the effect of changes in the price level. But today, index numbers are also used to measure changes in industrial production, fluctuations in the level of business activities or variations in the agricultural output, etc. In fact, if we want to get an idea of what is happening to an economy, we have simply to look at a few important indices like those of industrial output, agricultural production and business activity. In the words of G. Simpson and F. Kafka, ‘Index numbers are today one of the most widely used statistical devices. They are used to take the pulse of the economy and they have come to be used as indicators of inflationary or deflationary tendencies’.

4.2.1 Types of Index Numbers

Methods of Construction of Index Numbers

Methods of constructing index numbers can broadly be divided into two classes:

(i) Unweighted indices

(ii) Weighted indices

In case of unweighted indices, weights are not expressly assigned, whereas in the weighted indices, weights are expressly assigned to the various items. Each of these types may be further classified under two heads:

(i) Aggregate of prices method

(ii) Average of price relatives method

The following chart illustrates the various methods:

Index Numbers
    Unweighted
        Simple aggregate of prices
        Simple average of price relatives
    Weighted
        Weighted aggregate of prices
        Weighted average of price relatives

A. Unweighted Index Numbers

1. Simple Aggregate of Prices Method: Under this method, the total of prices for all commodities in the current year is divided by the total of prices for these commodities in the base year and the quotient is multiplied by 100. Symbolically,


$$P_{01} = \frac{\sum P_1}{\sum P_0} \times 100$$

where,

ΣP1 = Total of current year prices for various commodities.

ΣP0 = Total of base year prices for various commodities.

This method of constructing index numbers is very simple and requires the following steps for its computation.

(i) Total the prices of various commodities for each time period to get ΣP0 and ΣP1. These totals are in rupees.

(ii) Divide the total of the given time period, ΣP1, by the base period total, ΣP0, and express the result in per cent by multiplying the quotient by 100.

Example 4.1: From the following data, construct an index number of prices by the simple aggregative method for 1982, taking 1981 as the base:

Commodity Unit Price in 1981 Price in 1982

Milk litre 2.00 2.50

Butter kg 12.00 15.00

Cheese kg 10.00 12.00

Bread One 2.00 2.50

Eggs dozen 4.00 5.00

Solution: Construction of index numbers

Commodity Unit P0 P1

Milk litre 2.00 2.50

Butter kg 12.00 15.00

Cheese kg 10.00 12.00

Bread One 2.00 2.50

Eggs dozen 4.00 5.00

ΣP0 = 30.00, ΣP1 = 37.00

$$P_{01} = \frac{\sum P_1}{\sum P_0} \times 100 = \frac{37}{30} \times 100 = 123.33\%$$


This means that as compared to 1981, there is a net increase of (123.33 − 100 =) 23.33 per cent in 1982 in the prices of commodities included in the index.

This method suffers from two drawbacks:

(i) The unit by which each item is priced introduces a concealed weight in the simple aggregate of actual prices. For example, milk is quoted per litre in Example 4.1. If the price is expressed in terms of per gallon, the index might be very different.

(ii) Equal weightage is given to all the items irrespective of their relative importance.
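The computation in Example 4.1 and the first drawback can be checked with a minimal Python sketch. The function name `simple_aggregate_index` and the re-quoted butter prices (per 100 g instead of per kg) are illustrative assumptions, not part of the text.

```python
def simple_aggregate_index(base_prices, current_prices):
    """Simple aggregate of prices: (sum of P1 / sum of P0) * 100."""
    return sum(current_prices) / sum(base_prices) * 100


# Example 4.1: milk, butter, cheese, bread, eggs
p0 = [2.00, 12.00, 10.00, 2.00, 4.00]  # 1981 prices
p1 = [2.50, 15.00, 12.00, 2.50, 5.00]  # 1982 prices
index = simple_aggregate_index(p0, p1)
print(round(index, 2))  # 123.33

# Hypothetical re-quotation (assumption, not in the text): butter priced
# per 100 g instead of per kg, so both butter prices are divided by 10.
p0_alt = [2.00, 1.20, 10.00, 2.00, 4.00]
p1_alt = [2.50, 1.50, 12.00, 2.50, 5.00]
index_alt = simple_aggregate_index(p0_alt, p1_alt)
print(round(index_alt, 2))  # 122.4
```

The second figure differs from the first although only the unit of quotation changed, which is the concealed weight described in drawback (i).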

2. Simple Average of Price Relative Method

Under this method, the price relatives for each commodity are calculated and their average is found out. The steps involved in the construction of this index are:

(i) Obtain the price relative by dividing the price of each commodity in the given time period, P1, by its price in the base period, P0, and express this result in per cent, i.e., obtain (P1/P0) × 100 for each commodity.

(ii) Average these price relatives for the given time period by dividing the total of price relatives for different commodities by the number of commodities. Symbolically,

$$P_{01} = \frac{\sum \left[ \frac{P_1}{P_0} \times 100 \right]}{N}$$

where N refers to the number of commodities (items) whose price relatives are thus averaged.

Example 4.2: From the data given in Example 4.1, compute the price index for 1982 with 1981 as base, by the simple average of price relatives method.


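Solution: Construction of price index

| Commodity | Unit | Price in 1981, P0 (Rs.) | Price in 1982, P1 (Rs.) | Price relative, (P1/P0) × 100 |
|---|---|---|---|---|
| Milk | litre | 2.00 | 2.50 | (2.50/2.00) × 100 = 125 |
| Butter | kg | 12.00 | 15.00 | (15/12) × 100 = 125 |
| Cheese | kg | 10.00 | 12.00 | (12/10) × 100 = 120 |
| Bread | one | 2.00 | 2.50 | (2.50/2.00) × 100 = 125 |
| Eggs | dozen | 4.00 | 5.00 | (5/4) × 100 = 125 |
| N = 5 | | | | Σ[(P1/P0) × 100] = 620 |

$$P_{01} = \frac{\sum \left[ \frac{P_1}{P_0} \times 100 \right]}{N} = \frac{620}{5} = 124$$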

The simple average of price relative method is superior to the simple aggregate of prices method in two respects:

(i) Since we are comparing price per litre with price per litre, and price per kilogram with price per kilogram, the concealed weight due to use of different units is completely removed.

(ii) The index is not influenced by extreme items, as equal importance is given to all items.
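As a hedged illustration of the two steps of this method, the following Python sketch recomputes Example 4.2. The helper names `price_relatives` and `simple_average_of_relatives` are assumptions introduced here for illustration.

```python
def price_relatives(base_prices, current_prices):
    """Price relative of each commodity: (P1 / P0) * 100."""
    return [p1 / p0 * 100 for p0, p1 in zip(base_prices, current_prices)]


def simple_average_of_relatives(base_prices, current_prices):
    """Arithmetic mean of the price relatives."""
    relatives = price_relatives(base_prices, current_prices)
    return sum(relatives) / len(relatives)


# Example 4.2 (data of Example 4.1): milk, butter, cheese, bread, eggs
p0 = [2.00, 12.00, 10.00, 2.00, 4.00]  # 1981 prices
p1 = [2.50, 15.00, 12.00, 2.50, 5.00]  # 1982 prices
relatives = price_relatives(p0, p1)
index = simple_average_of_relatives(p0, p1)
print([round(r) for r in relatives])  # [125, 125, 120, 125, 125]
print(round(index, 2))                # 124.0
```

Because each relative is a ratio of two prices quoted in the same unit, re-quoting a commodity in another unit leaves its relative, and hence the index, unchanged.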


However, the greatest drawback of unweighted indices is that equal importance or weight is given to all items included in the index number, which is not proper. As such, unweighted indices are of little use in practice.

B. Weighted Index Numbers

1. Weighted Aggregate of Prices Index: These indices are similar to the simple aggregative type with the fundamental difference that weights are assigned explicitly to the various items included in the index. In the matter of assigning weights, authors differ. As a result, a large number of formulae have been devised for constructing index numbers. Some of the important formulae are as follows:

(i) Laspeyre’s Method: In this method, base year quantities are taken as weights. The formula for constructing the index is:

$$P_{01} = \frac{\sum P_1 q_0}{\sum P_0 q_0} \times 100$$

where P1 = Price in the current year

P0 = Price in the base year

q0 = Quantity in the base year

According to this method, the index number for each year is obtained in three steps:

(i) The price of each commodity in each year is multiplied by the base year quantity of that commodity. For the base year, each product is symbolized by P0q0, and for the current year by P1q0.

(ii) The products for each year are totalled and ΣP1q0 and ΣP0q0 are obtained.

(iii) ΣP1q0 is divided by ΣP0q0 and the quotient is multiplied by 100 to obtain the index.

Example 4.3: From the following data, calculate the index number of prices for 1982 with 1972 as base using the Laspeyre’s method.

1972 1982

Item Price Quantity Price Quantity

A 2 8 4 6

B 5 10 6 5

C 4 14 5 10

D 2 19 2 13


Solution: Representing base year (1972) price by P0, base year quantity by q0, current year (1982) price by P1 and current year quantity by q1, we have:

Commodity P0 q0 P1 q1 P0 q0 P1 q0

A 2 8 4 6 16 32

B 5 10 6 5 50 60

C 4 14 5 10 56 70

D 2 19 2 13 38 38

ΣP0q0 = 160, ΣP1q0 = 200

Index number of prices by Laspeyre’s method:

$$P_{01} = \frac{\sum P_1 q_0}{\sum P_0 q_0} \times 100 = \frac{200}{160} \times 100 = 125$$
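The three steps of the base-year-weighted method can be sketched in Python as follows, using the data of Example 4.3. The function name `laspeyres_price_index` is an illustrative assumption.

```python
def laspeyres_price_index(p0, q0, p1):
    """Base-year-weighted aggregate: (sum of P1*q0 / sum of P0*q0) * 100."""
    current_value = sum(p * q for p, q in zip(p1, q0))  # sum of P1*q0
    base_value = sum(p * q for p, q in zip(p0, q0))     # sum of P0*q0
    return current_value / base_value * 100


# Example 4.3: items A, B, C, D
p0 = [2, 5, 4, 2]     # 1972 prices
q0 = [8, 10, 14, 19]  # 1972 quantities
p1 = [4, 6, 5, 2]     # 1982 prices
index = laspeyres_price_index(p0, q0, p1)
print(index)  # 125.0
```

Note that the current-year quantities are not needed at all, which is why this index is comparatively cheap to maintain.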

Laspeyre’s index is very widely used. It tells us about the change in the aggregate value of the base period list of goods when valued at a given period price.

However, this index has one drawback. It does not take into consideration the changes in the consumption pattern that take place with the passage of time.

(ii) Paasche’s Index: In this method, the current year quantities (q1) are taken as weights. The formula for constructing this index is:

$$P_{01} = \frac{\sum P_1 q_1}{\sum P_0 q_1} \times 100$$

Steps for constructing the Paasche’s index are the same as those taken in constructing Laspeyre’s index with the only difference that the price of each commodity in each year is multiplied by the quantity of that commodity in the current year rather than by the quantity in the base year.

Example 4.4: Taking the data given in Example 4.3, compute the index number of prices for 1982 with 1972 as base, using the Paasche’s method.


Solution: Construction of Paasche’s Index

Commodity P0 q0 P1 q1 P0 q1 P1 q1

A 2 8 4 6 12 24

B 5 10 6 5 25 30

C 4 14 5 10 40 50

D 2 19 2 13 26 26

ΣP0q1 = 103, ΣP1q1 = 130

Index number of prices by Paasche’s method:

$$P_{01} = \frac{\sum P_1 q_1}{\sum P_0 q_1} \times 100 = \frac{130}{103} \times 100 = 126.21$$
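A matching Python sketch for the current-year-weighted index, again on the data of Example 4.3 and Example 4.4. The function name `paasche_price_index` is an illustrative assumption.

```python
def paasche_price_index(p0, p1, q1):
    """Current-year-weighted aggregate: (sum of P1*q1 / sum of P0*q1) * 100."""
    current_value = sum(p * q for p, q in zip(p1, q1))  # sum of P1*q1
    base_value = sum(p * q for p, q in zip(p0, q1))     # sum of P0*q1
    return current_value / base_value * 100


# Example 4.4: items A, B, C, D
p0 = [2, 5, 4, 2]     # 1972 prices
p1 = [4, 6, 5, 2]     # 1982 prices
q1 = [6, 5, 10, 13]   # 1982 quantities
index = paasche_price_index(p0, p1, q1)
print(round(index, 2))  # 126.21
```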

Although this method takes into consideration the changes in the consumption pattern, the need for collecting data regarding quantities for each year or each period makes the method very expensive. Hence, where the number of commodities is large, Paasche’s method is not preferred.

(iii) Bowley-Drobisch Method: This method is the simple arithmetic mean of Laspeyre’s and Paasche’s indices. The formula for constructing Bowley-Drobisch index is:

$$P_{01} = \frac{\frac{\sum P_1 q_0}{\sum P_0 q_0} + \frac{\sum P_1 q_1}{\sum P_0 q_1}}{2} \times 100$$

or, equivalently,

$$P_{01} = \frac{L + P}{2}$$

where L = Laspeyre’s index

P = Paasche’s index

Example 4.5: Compute the index number of prices for 1976 with 1970 as base using the Bowley-Drobisch method from the following data.

1970 1976

Items Price Quantity Price Quantity

1 2 20 5 15

2 4 4 8 5

3 1 10 2 12

4 5 5 10 6


Solution: Computation of price index by Bowley-Drobisch formula,

Items P0 q0 P1 q1 P0q0 P0q1 P1q0 P1q1

1 2 20 5 15 40 30 100 75

2 4 4 8 5 16 20 32 40

3 1 10 2 12 10 12 20 24

4 5 5 10 6 25 30 50 60

ΣP0q0 = 91, ΣP0q1 = 92, ΣP1q0 = 202, ΣP1q1 = 199

According to Bowley-Drobisch formula:

$$P_{01} = \frac{\frac{\sum P_1 q_0}{\sum P_0 q_0} + \frac{\sum P_1 q_1}{\sum P_0 q_1}}{2} \times 100 = \frac{\frac{202}{91} + \frac{199}{92}}{2} \times 100 = \frac{2.2198 + 2.1630}{2} \times 100$$

= 4.3828 × 50 = 219.14

(iv) Marshall-Edgeworth Method: In this method, the sums of base year and current year quantities are taken as weights. The formula for constructing the index is:

$$P_{01} = \frac{\sum P_1 (q_0 + q_1)}{\sum P_0 (q_0 + q_1)} \times 100$$

or

$$P_{01} = \frac{\sum P_1 q_0 + \sum P_1 q_1}{\sum P_0 q_0 + \sum P_0 q_1} \times 100$$

Example 4.6: For the data given in Example 4.5, compute index number of prices for 1976 with 1970 as base using the Marshall-Edgeworth formula.

Solution: Computation of price index by Marshall-Edgeworth formula:

Item P0 q0 P1 q1 P0q0 P0q1 P1q0 P1q1

1 2 20 5 15 40 30 100 75

2 4 4 8 5 16 20 32 40

3 1 10 2 12 10 12 20 24

4 5 5 10 6 25 30 50 60

ΣP0q0 = 91, ΣP0q1 = 92, ΣP1q0 = 202, ΣP1q1 = 199


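According to Marshall-Edgeworth formula:

$$P_{01} = \frac{\sum P_1 (q_0 + q_1)}{\sum P_0 (q_0 + q_1)} \times 100 = \frac{\sum P_1 q_0 + \sum P_1 q_1}{\sum P_0 q_0 + \sum P_0 q_1} \times 100$$

$$= \frac{202 + 199}{91 + 92} \times 100 = \frac{401}{183} \times 100 = 219.13$$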
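Examples 4.5 and 4.6 use the same four totals, so both indices can be checked side by side. The following is a minimal Python sketch; the helper `total` and the two function names are illustrative assumptions.

```python
def total(prices, quantities):
    """Sum of price * quantity over all items."""
    return sum(p * q for p, q in zip(prices, quantities))


def bowley_drobisch_index(p0, q0, p1, q1):
    """Arithmetic mean of the Laspeyres and Paasche ratios, times 100."""
    laspeyres = total(p1, q0) / total(p0, q0)
    paasche = total(p1, q1) / total(p0, q1)
    return (laspeyres + paasche) / 2 * 100


def marshall_edgeworth_index(p0, q0, p1, q1):
    """Aggregate weighted by (q0 + q1), times 100."""
    numerator = total(p1, q0) + total(p1, q1)
    denominator = total(p0, q0) + total(p0, q1)
    return numerator / denominator * 100


# Examples 4.5 and 4.6: items 1 to 4
p0, q0 = [2, 4, 1, 5], [20, 4, 10, 5]    # 1970 prices and quantities
p1, q1 = [5, 8, 2, 10], [15, 5, 12, 6]   # 1976 prices and quantities
bowley = bowley_drobisch_index(p0, q0, p1, q1)
marshall = marshall_edgeworth_index(p0, q0, p1, q1)
print(round(bowley, 2))    # 219.14
print(round(marshall, 2))  # 219.13
```

The two results are close because both formulae compromise between base-year and current-year weights.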

(v) Kelly’s Method: In this method, neither base year nor current year quantities are taken as weights. Instead, the quantities of some reference year or the average quantity of two or more years may be taken as weights. The formula for constructing the index is:

$$P_{01} = \frac{\sum P_1 q}{\sum P_0 q} \times 100$$

where q is the quantity of some reference year.

Example 4.7: Calculate the index number of prices for 1981 with 1980 as base year for the following data, using Kelly’s method.

Item Quantity Price in 1980 Price in 1981

Bricks 10 units 100 160

Timber 7 ’’ 200 210

Board 15 ’’ 50 60

Sand 9 ’’ 20 30

Cement 10 ’’ 10 14

Solution: Computation of price index by Kelly’s method:

Item q P0 P1 P0q P1q

Bricks 10 100 160 1000 1600

Timber 7 200 210 1400 1470

Boards 15 50 60 750 900

Sand 9 20 30 180 270

Cement 10 10 14 100 140

ΣP0q = 3430, ΣP1q = 4380


According to Kelly’s method:

$$P_{01} = \frac{\sum P_1 q}{\sum P_0 q} \times 100 = \frac{4380}{3430} \times 100 = 127.697$$
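A short Python sketch of the fixed-weight calculation in Example 4.7. The function name `kelly_price_index` is an illustrative assumption.

```python
def kelly_price_index(p0, p1, q):
    """Fixed-weight aggregate: (sum of P1*q / sum of P0*q) * 100."""
    return (sum(p * w for p, w in zip(p1, q))
            / sum(p * w for p, w in zip(p0, q)) * 100)


# Example 4.7: bricks, timber, boards, sand, cement
q = [10, 7, 15, 9, 10]         # reference quantities
p0 = [100, 200, 50, 20, 10]    # 1980 prices
p1 = [160, 210, 60, 30, 14]    # 1981 prices
index = kelly_price_index(p0, p1, q)
print(round(index, 3))  # 127.697
```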

(vi) Fisher’s Ideal Index: This method is the geometric mean of Laspeyre’s and Paasche’s indices. The formula for constructing the index is:

$$P_{01} = \sqrt{\frac{\sum P_1 q_0}{\sum P_0 q_0} \times \frac{\sum P_1 q_1}{\sum P_0 q_1}} \times 100$$

Fisher’s formula is known as ideal index because of the following reasons:

(i) It takes into account prices and quantities of both the current year as well as the base year.

(ii) It uses geometric mean which, theoretically, is the best average for constructing index numbers.

(iii) It satisfies both the time reversal test and the factor reversal test.

(iv) It is free from bias. The weight biases embodied in Laspeyre’s and Paasche’s methods are crossed geometrically, and thus, eliminated completely.

Example 4.8: Construct the index number of prices for the year 1980 with 1979 as base using Fisher’s Ideal Method.

1979 1980

Commodity Price Quantity Price Quantity

A 20 8 40 6

B 50 10 60 5

C 40 15 50 10

D 20 20 20 15


Solution: Construction of price index by Fisher’s Ideal Formula:

Commodity P0 q0 P1 q1 P0q0 P0q1 P1q0 P1q1

A 20 8 40 6 160 120 320 240

B 50 10 60 5 500 250 600 300

C 40 15 50 10 600 400 750 500

D 20 20 20 15 400 300 400 300

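ΣP0q0 = 1660, ΣP0q1 = 1070, ΣP1q0 = 2070, ΣP1q1 = 1340

Price index by Fisher’s Ideal Formula is:

$$P_{01} = \sqrt{\frac{\sum P_1 q_0}{\sum P_0 q_0} \times \frac{\sum P_1 q_1}{\sum P_0 q_1}} \times 100 = \sqrt{\frac{2070}{1660} \times \frac{1340}{1070}} \times 100$$

$$= \sqrt{1.247 \times 1.252} \times 100 = \sqrt{1.5612} \times 100$$

≈ 1.25 × 100 = 125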
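The geometric-mean calculation of Example 4.8 can be reproduced with a minimal Python sketch. The function name `fisher_price_index` and the helper `total` are illustrative assumptions; without intermediate rounding the result is about 124.97, which the text rounds to 125.

```python
import math


def total(prices, quantities):
    """Sum of price * quantity over all items."""
    return sum(p * q for p, q in zip(prices, quantities))


def fisher_price_index(p0, q0, p1, q1):
    """Geometric mean of the Laspeyres and Paasche ratios, times 100."""
    laspeyres = total(p1, q0) / total(p0, q0)
    paasche = total(p1, q1) / total(p0, q1)
    return math.sqrt(laspeyres * paasche) * 100


# Example 4.8: commodities A, B, C, D
p0, q0 = [20, 50, 40, 20], [8, 10, 15, 20]   # 1979 prices and quantities
p1, q1 = [40, 60, 50, 20], [6, 5, 10, 15]    # 1980 prices and quantities
index = fisher_price_index(p0, q0, p1, q1)
print(round(index, 2))  # 124.97
```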

2. Weighted Average of Price Relatives

This method is similar to the simple average of price relatives method with the fundamental difference that explicit weights are assigned to each commodity included in the index. Since price relatives are in percentages, the weights used are value weights.

The following steps are taken in the construction of weighted average of price relatives index:

(i) Calculate the price relatives, (P1/P0) × 100, for each commodity.

(ii) Determine the value weight of each commodity in the group by multiplying its price in base year by its quantity in the base year, i.e., calculate P0q0 for each commodity. If, however, current year quantities are given, then the weights shall be represented by P1q1.

(iii) Multiply the price relative of each commodity by its value weight as calculated in (ii).


(iv) Sum up the products obtained under (iii).

(v) Divide the total obtained in (iv) above by the total of the value weights. Symbolically, the index number obtained by the method of weighted average of price relatives is:

$$P_{01} = \frac{\sum \left[ \left( \frac{P_1}{P_0} \times 100 \right) P_0 q_0 \right]}{\sum P_0 q_0} \quad \text{or} \quad P_{01} = \frac{\sum PV}{\sum V}$$

where P = (P1/P0) × 100 is the price relative and V = P0q0 is the value weight.

This method is also known as Family Budget method.

Example 4.9: Calculate consumer price index using weighted average of price relatives method for the year 1986 with 1985 as base for the following data:

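| Commodity | Quantity | Price in 1985 (Rs) | Price in 1986 (Rs) |
|---|---|---|---|
| A | 100 | 8 | 12 |
| B | 25 | 6 | 8 |
| C | 10 | 5 | 15 |
| D | 20 | 10 | 25 |

Solution: Calculation of consumer price index

| Commodity | q0 | P0 | P1 | Price relative, P = (P1/P0) × 100 | Value weight, V = P0q0 | PV |
|---|---|---|---|---|---|---|
| A | 100 | 8 | 12 | 150.00 | 800 | 120000 |
| B | 25 | 6 | 8 | 133.33 | 150 | 20000 |
| C | 10 | 5 | 15 | 300.00 | 50 | 15000 |
| D | 20 | 10 | 25 | 250.00 | 200 | 50000 |
| Total | | | | | ΣV = 1200 | ΣPV = 205000 |

Weighted average of price relatives index or consumer price index:

$$P_{01} = \frac{\sum \left[ \left( \frac{P_1}{P_0} \times 100 \right) P_0 q_0 \right]}{\sum P_0 q_0} = \frac{\sum PV}{\sum V} = \frac{205000}{1200} = 170.83$$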
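Steps (i) to (v) of this method can be sketched in Python on the data of Example 4.9. The function name `weighted_average_of_relatives` is an illustrative assumption.

```python
def weighted_average_of_relatives(p0, p1, q0):
    """Sum of (price relative * value weight) / sum of value weights,
    with base-year value weights V = P0 * q0."""
    relatives = [b / a * 100 for a, b in zip(p0, p1)]   # step (i)
    weights = [a * q for a, q in zip(p0, q0)]           # step (ii)
    products = [r * w for r, w in zip(relatives, weights)]  # step (iii)
    return sum(products) / sum(weights)                 # steps (iv) and (v)


# Example 4.9: commodities A, B, C, D
q0 = [100, 25, 10, 20]   # quantities
p0 = [8, 6, 5, 10]       # 1985 prices
p1 = [12, 8, 15, 25]     # 1986 prices
index = weighted_average_of_relatives(p0, p1, q0)
print(round(index, 2))  # 170.83

# Cross-check: with base-year value weights, P0 cancels in each product,
# so the result coincides with the base-year-weighted aggregate index.
aggregate = (sum(p * q for p, q in zip(p1, q0))
             / sum(p * q for p, q in zip(p0, q0)) * 100)
print(round(aggregate, 2))  # 170.83
```

The cross-check follows from (P1/P0) × P0q0 = P1q0 for every commodity; it would not hold if other value weights, such as P1q1, were used.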


Problems in the Construction of Index Numbers

Different problems are faced in the construction of different types of index numbers. We shall deal here with only those problems that must be tackled before constructing index numbers of prices.

Definition of Purpose

It is absolutely necessary that the purpose of the index numbers be rigorously defined. This would help in deciding the nature of data to be collected, the choice of the base year, the formula to be used and other related matters. For example, if an index number is intended to measure consumer prices, it must not include wholesale prices. Similarly, if a consumer price index number is intended to measure the changes in the cost of living of families with low incomes, great care should be exercised not to include goods ordinarily used by middle-income and upper-income groups. In fact, before constructing index numbers, we must precisely know what we want to measure, and what we intend to use this measurement for.

Selection of a Base Period

In order to make comparison between prices referring to several time periods, some point of reference is almost always established. This point of reference is called the base period. The prices of a certain time period are taken as the standard, and assigned the value of 100 per cent. Though the selection of the base period would primarily depend upon the purpose of the index, there are two important guidelines to consider in choosing a base.

(i) The base period should be a period of normal and stable economic conditions. It should be free from abnormalities and random or irregular fluctuations like wars, earthquakes, famines, strikes, lock-outs, booms, depressions, etc.

(ii) The base year should not be too distant in the past. Since the index numbers are useful in decision-making, and economic practices are often a matter of the short run, we should choose a base which is relatively close to the year being studied.

Fixed Base and Chain Base. While selecting the base year, a decision has to be made whether the base shall remain fixed or not. If the period of comparison is fixed for all current years, it is called fixed base method. If, on the other hand, the prices of the current year are linked with the prices of the preceding year and not with the fixed year or period, it is called chain base method. Chain base method is useful in cases where there are quick and frequent changes in fashion, tastes and habits of the people. In such cases comparison with the preceding year is more worthwhile.

Selection of Commodities or Items

While constructing an index number, it is not possible to take into account all the items whose price changes are to be represented by the index number. Hence, the need for selecting a sample. For example, while constructing a general purpose wholesale price index, it is impossible to take all the items. Thus, only a few representative items are selected from the whole lot. While selecting the sample, the following points should be kept in mind.

(i) The selected commodity or item should be representative of the tastes, customs and necessities of the people to whom the index number relates.

(ii) It should be stable in quality and as far as possible should be standardized or graded so that it can easily be identified after a time lapse.

(iii) The sample should be as large as possible. Theoretically, the larger the number of items, the more accurate would be the results disclosed by an index number. But it must be noted that the larger the number of items, the greater shall be the cost and time taken.

(iv) As different varieties of a commodity are sold in the market, a decision has to be made as to which variety should be included in the index numbers. Ordinarily, all those varieties which are in common use should be included.

Obtaining Price Quotations

After selecting the items, the next problem is to collect their prices. The price of a commodity varies from place to place and even from shop to shop in the same market. Just as it is not possible to include all the commodities in an index number, it is similarly impractical to collect price quotations from all places where a commodity is bought or sold. Thus, a selection is to be made of representative places and shops. Generally, such places and shops are selected where the commodity is bought and sold in large quantities. After selecting the places and shops from where price quotations are to be obtained, the next step is to appoint some representatives who will supply the price quotations from time to time.

Since prices can be quoted in two ways, i.e., either by expressing the quantity of commodity per unit of money or by expressing the quantity of money per unit of commodity, a decision has to be made regarding the manner in which prices are to be quoted. It is better to quote the price of a commodity X as 50 paise per kg rather than quoting it as 2 kg per one rupee.


Another decision in regard to price quotations is whether the wholesale prices or the retail prices are to be collected. In general, the larger the number of quotations, the better it is. Ordinarily, however, at least one quotation per week in case of weekly indices, and at least four quotations per month in case of monthly indices are essential.

Choice of Average

Since index numbers are specialized averages, a decision has to be made as to which particular average (i.e., arithmetic mean, mode, median, harmonic mean or geometric mean) should be used for the construction of index numbers. Mode, median and harmonic mean are almost never used in the construction of index numbers.

Therefore, a choice has to be made between arithmetic mean and geometric mean. Though, theoretically, geometric mean is better for the purpose, arithmetic mean, due to its simplicity of computation, is more commonly used.

Choice of Weights

A suitable method is devised by which the varying importance of different items is taken into account. This is done by assigning ‘weights’. The term ‘weight’ refers to the relative importance of different items in the construction of index.

There are two methods of assigning weights, (i) Implicit, and (ii) Explicit. In the case of implicit weighting, a commodity or its variety is included in the index a number of times. In the case of explicit weighting, on the other hand, some outward evidence of importance of various items in the index is given.

Selection of an Appropriate Formula

A large number of formulae have been devised for constructing index numbers. A decision has, therefore, to be made as to which formula is the most suitable for the purpose depending upon the availability of the data regarding the prices and quantities of the selected commodities in the base and/or current year.

Quantity or Volume Index Numbers

Price indices measure changes in the price level of certain commodities. On the other hand, quantity or volume index numbers measure the changes in the physical volume of goods produced, distributed or consumed. These indices are important indicators of the level of output in the economy or in parts of it.

The quantity indices can be obtained easily by replacing p by q and vice versa, in the various formulae discussed earlier.

The quantity index by different methods is:


(i) Laspeyre’s Method:

$$Q_{01} = \frac{\sum q_1 P_0}{\sum q_0 P_0} \times 100$$

(ii) Paasche’s Method:

$$Q_{01} = \frac{\sum q_1 P_1}{\sum q_0 P_1} \times 100$$

(iii) Bowley-Drobisch Method:

$$Q_{01} = \frac{\frac{\sum q_1 P_0}{\sum q_0 P_0} + \frac{\sum q_1 P_1}{\sum q_0 P_1}}{2} \times 100$$

(iv) Marshall-Edgeworth Method:

$$Q_{01} = \frac{\sum q_1 (P_0 + P_1)}{\sum q_0 (P_0 + P_1)} \times 100$$

(v) Fisher’s Ideal Index:

$$Q_{01} = \sqrt{\frac{\sum q_1 P_0}{\sum q_0 P_0} \times \frac{\sum q_1 P_1}{\sum q_0 P_1}} \times 100$$

(vi) Kelly’s Method:

$$Q_{01} = \frac{\sum q_1 P}{\sum q_0 P} \times 100$$

Example 4.10: Compute quantity index for the year 1982 with base 1980 = 100, for the following data, using (i) Laspeyre’s method, (ii) Paasche’s method, (iii) Bowley-Drobisch method, (iv) Marshall-Edgeworth method, and (v) Fisher’s ideal formula.

Prices Quantities

Commodity 1980 1982 1980 1982

A 5.00 6.50 5 7

B 7.75 8.80 6 10

C 9.63 7.75 4 6

D 12.50 12.75 9 9

Solution: Computation of quantity index

Commodity P0 q0 P1 q1 q0P0 q0P1 q1P0 q1P1

A 5.00 5 6.50 7 25.00 32.50 35.00 45.50

B 7.75 6 8.80 10 46.50 52.80 77.50 88.00

C 9.63 4 7.75 6 38.52 31.00 57.78 46.50

D 12.50 9 12.75 9 112.50 114.75 112.50 114.75

Σq0P0 = 222.52, Σq0P1 = 231.05, Σq1P0 = 282.78, Σq1P1 = 294.75


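(i) Laspeyre’s quantity index:

$$Q_{01} = \frac{\sum q_1 P_0}{\sum q_0 P_0} \times 100 = \frac{282.78}{222.52} \times 100 = 127.08$$

(ii) Paasche’s quantity index:

$$Q_{01} = \frac{\sum q_1 P_1}{\sum q_0 P_1} \times 100 = \frac{294.75}{231.05} \times 100 = 127.57$$

(iii) Bowley-Drobisch quantity index:

$$Q_{01} = \frac{\frac{\sum q_1 P_0}{\sum q_0 P_0} + \frac{\sum q_1 P_1}{\sum q_0 P_1}}{2} \times 100 = \frac{\frac{282.78}{222.52} + \frac{294.75}{231.05}}{2} \times 100 = \frac{1.2708 + 1.2757}{2} \times 100 = 127.325$$

(iv) Marshall-Edgeworth quantity index:

$$Q_{01} = \frac{\sum q_1 P_0 + \sum q_1 P_1}{\sum q_0 P_0 + \sum q_0 P_1} \times 100 = \frac{282.78 + 294.75}{222.52 + 231.05} \times 100 = 127.33$$

(v) Quantity index by Fisher’s ideal formula:

$$Q_{01} = \sqrt{\frac{\sum q_1 P_0}{\sum q_0 P_0} \times \frac{\sum q_1 P_1}{\sum q_0 P_1}} \times 100 = \sqrt{\frac{282.78}{222.52} \times \frac{294.75}{231.05}} \times 100$$

= 1.273 × 100

= 127.3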
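The five quantity indices of Example 4.10 all derive from the same four totals, which the following Python sketch makes explicit. The helper `total` and the variable names are illustrative assumptions.

```python
import math


def total(prices, quantities):
    """Sum of price * quantity over all items."""
    return sum(p * q for p, q in zip(prices, quantities))


# Example 4.10: commodities A, B, C, D
p0, p1 = [5.00, 7.75, 9.63, 12.50], [6.50, 8.80, 7.75, 12.75]  # 1980, 1982 prices
q0, q1 = [5, 6, 4, 9], [7, 10, 6, 9]                            # 1980, 1982 quantities

q0p0, q0p1 = total(p0, q0), total(p1, q0)  # about 222.52 and 231.05
q1p0, q1p1 = total(p0, q1), total(p1, q1)  # about 282.78 and 294.75

laspeyres = q1p0 / q0p0 * 100                            # about 127.08
paasche = q1p1 / q0p1 * 100                              # about 127.57
bowley = (q1p0 / q0p0 + q1p1 / q0p1) / 2 * 100           # about 127.3
marshall = (q1p0 + q1p1) / (q0p0 + q0p1) * 100           # about 127.3
fisher = math.sqrt((q1p0 / q0p0) * (q1p1 / q0p1)) * 100  # about 127.3

for name, value in [("Laspeyres", laspeyres), ("Paasche", paasche),
                    ("Bowley-Drobisch", bowley),
                    ("Marshall-Edgeworth", marshall), ("Fisher", fisher)]:
    print(f"{name}: {value:.2f}")
```

The three compromise formulae agree to the first decimal place here because the Laspeyres and Paasche results are themselves close.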


Value Index Numbers

Value means price times quantity. Thus, a value index ‘V’ is the sum of the values of a given year divided by the sum of the values for the base year. The formula, therefore, is:

$$V = \frac{\sum P_1 q_1}{\sum P_0 q_0} \times 100$$

where V = value index

In most cases, the value figures given in the formula may be stated more simply as:

$$V = \frac{\sum V_1}{\sum V_0} \times 100$$

In this type of index, both price and quantity are variable in the numerator. Weights are not to be applied because they are inherent in the value figures. A value index, therefore, is an aggregate of values.
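As a small illustration, the value index can be computed for the data of Example 4.3. This Python sketch is an addition for illustration; the function name `value_index` is an assumption and the text itself does not work this example.

```python
def value_index(p0, q0, p1, q1):
    """Value index: (sum of P1*q1 / sum of P0*q0) * 100."""
    current_value = sum(p * q for p, q in zip(p1, q1))
    base_value = sum(p * q for p, q in zip(p0, q0))
    return current_value / base_value * 100


# Data of Example 4.3: items A, B, C, D
p0, q0 = [2, 5, 4, 2], [8, 10, 14, 19]   # 1972 prices and quantities
p1, q1 = [4, 6, 5, 2], [6, 5, 10, 13]    # 1982 prices and quantities
index = value_index(p0, q0, p1, q1)
print(index)  # 81.25
```

Here the total value falls (index below 100) even though the price indices computed earlier for the same data are above 100, because the quantities bought declined.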

Tests of Consistency

As there are several formulae for constructing index numbers, the problem is to select the most appropriate formula in a given situation. Irving Fisher has suggested two tests for selecting an appropriate formula. These are:

(i) Time reversal test

(ii) Factor reversal test

Time reversal test

According to Fisher, the formula for calculating the index should be such that it gives the same ratio between one point of comparison and another no matter which of the two is taken as base. In other words, the index number prepared forward should be the reciprocal of the index number prepared backward. Thus, if from 1982 to 1983 the prices of a basket of goods have increased from Rs 400 to Rs 800, the index number for 1983 with 1982 as base is 200 per cent, and the index number for 1982 with 1983 as base should be 50 per cent. One figure is the reciprocal of the other and their product (2 × 0.5) is unity. Therefore, the time reversal test is satisfied if P01 × P10 = 1, where both indices are expressed as ratios, i.e., without the factor 100.

Time reversal test is satisfied by:

(i) Fisher’s Ideal Formula

(ii) Marshall-Edgeworth Method

(iii) Kelly’s Method

(iv) Simple Geometric mean of Price Relatives

Factor reversal test

According to Fisher, the formula for constructing the index number should permit not only the interchange of the two times without giving inconsistent results, it should also permit the interchange of prices and quantities (the two factors) without giving inconsistent results.

Simply stated, the test is satisfied if the change in price multiplied by the change in quantity is equal to the total change in value. Thus, factor reversal test is satisfied if:

$$P_{01} \times Q_{01} = \frac{\sum P_1 q_1}{\sum P_0 q_0}$$

where P01 represents change in price in the current year, Q01 represents change in quantity in the current year, ΣP1q1 represents total value in the current year, and ΣP0q0 represents total value in the base year.

The factor reversal test is satisfied only by Fisher’s Ideal Formula. Thus, Fisher’s formula satisfies both time reversal test and factor reversal test.

Proof

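According to Fisher’s Ideal Index:

$$P_{01} = \sqrt{\frac{\sum P_1 q_0}{\sum P_0 q_0} \times \frac{\sum P_1 q_1}{\sum P_0 q_1}}$$

$$P_{10} = \sqrt{\frac{\sum P_0 q_1}{\sum P_1 q_1} \times \frac{\sum P_0 q_0}{\sum P_1 q_0}}$$

$$Q_{01} = \sqrt{\frac{\sum q_1 P_0}{\sum q_0 P_0} \times \frac{\sum q_1 P_1}{\sum q_0 P_1}}$$

(i) Thus,

$$P_{01} \times P_{10} = \sqrt{\frac{\sum P_1 q_0}{\sum P_0 q_0} \times \frac{\sum P_1 q_1}{\sum P_0 q_1} \times \frac{\sum P_0 q_1}{\sum P_1 q_1} \times \frac{\sum P_0 q_0}{\sum P_1 q_0}} = \sqrt{1} = 1$$

Hence, the time reversal test is satisfied.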


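(ii) Similarly, according to Fisher’s Ideal Formula:

$$P_{01} \times Q_{01} = \sqrt{\frac{\sum P_1 q_0}{\sum P_0 q_0} \times \frac{\sum P_1 q_1}{\sum P_0 q_1} \times \frac{\sum q_1 P_0}{\sum q_0 P_0} \times \frac{\sum q_1 P_1}{\sum q_0 P_1}}$$

$$= \sqrt{\frac{\sum P_1 q_1}{\sum P_0 q_0} \times \frac{\sum q_1 P_1}{\sum q_0 P_0}} = \frac{\sum P_1 q_1}{\sum P_0 q_0}$$

Hence, the factor reversal test is also satisfied by Fisher’s Ideal Formula.

Besides these two tests, two other tests have been suggested by some authors.

These are (i) Unit test and (ii) Circular test.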

Unit test

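According to unit test, the formula for constructing index numbers should be independent of the units in which prices and quantities are quoted. This test is satisfied by all the methods discussed above except the simple aggregative index method, in which, as noted earlier, the unit of quotation acts as a concealed weight.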

Circular test

This test is just an extension of the time reversal test for more than two periods and is based on the shiftability of the base period. This test requires the index number to work in a circular manner such that if an index is constructed for the year a on base year b, and for the year b on base year c, we should get the same result as if we calculate directly an index for year a on base year c without going through b as an intermediary. Thus, if there are three periods, denoted 0, 1 and 2, the circular test is satisfied if,

$$P_{01} \times P_{12} \times P_{20} = 1$$

The circular test is satisfied only by the index number formulae based on,

(i) Simple aggregate of prices.

(ii) Kelly’s method or fixed weighted aggregate of prices.

An index which satisfies this test has the advantage of reducing the computations every time a change in the base year has to be made. Such indices can be adjusted from year to year without referring each time to the original base.
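Why the two formulae named above pass the circular test can be seen by writing out the product for three periods 0, 1 and 2, with each index expressed as a ratio (without the factor 100). The following short derivation is added here as an illustration; the symbols P2 (prices in period 2) and P20 (index for period 0 on base 2) extend the notation of the text.

For the simple aggregate of prices:

$$P_{01} \times P_{12} \times P_{20} = \frac{\sum P_1}{\sum P_0} \times \frac{\sum P_2}{\sum P_1} \times \frac{\sum P_0}{\sum P_2} = 1$$

For Kelly’s method with fixed weights q:

$$P_{01} \times P_{12} \times P_{20} = \frac{\sum P_1 q}{\sum P_0 q} \times \frac{\sum P_2 q}{\sum P_1 q} \times \frac{\sum P_0 q}{\sum P_2 q} = 1$$

In both cases every total appears once in a numerator and once in a denominator. With weights that change from period to period (q0, q1, q2), the totals no longer match and the cancellation fails in general.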


Example 4.11: From the following data, show that Fisher’s Ideal Index satisfies both the time reversal test and the factor reversal test.

1980 1981

Commodity Price Quantity Price Quantity

A 4 10 5 8

B 6 8 9 7

C 14 5 7 12

D 3 12 6 8

E 5 7 8 5

Solution: Computation for time reversal test and factor reversal test

Commodity P0 q0 P1 q1 P0q0 P0q1 P1q0 P1q1

A 4 10 5 8 40 32 50 40

B 6 8 9 7 48 42 72 63

C 14 5 7 12 70 168 35 84

D 3 12 6 8 36 24 72 48

E 5 7 8 5 35 25 56 40

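ΣP0q0 = 229, ΣP0q1 = 291, ΣP1q0 = 285, ΣP1q1 = 275

(i) Time reversal test is satisfied when P01 × P10 = 1.

According to Fisher’s ideal index,

$$P_{01} = \sqrt{\frac{\sum P_1 q_0}{\sum P_0 q_0} \times \frac{\sum P_1 q_1}{\sum P_0 q_1}} \quad \text{and} \quad P_{10} = \sqrt{\frac{\sum P_0 q_1}{\sum P_1 q_1} \times \frac{\sum P_0 q_0}{\sum P_1 q_0}}$$

$$P_{01} \times P_{10} = \sqrt{\frac{285}{229} \times \frac{275}{291} \times \frac{291}{275} \times \frac{229}{285}} = \sqrt{1} = 1$$

Hence, time reversal test is satisfied.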


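(ii) Factor reversal test is satisfied when

$$P_{01} \times Q_{01} = \frac{\sum P_1 q_1}{\sum P_0 q_0}$$

$$P_{01} = \sqrt{\frac{\sum P_1 q_0}{\sum P_0 q_0} \times \frac{\sum P_1 q_1}{\sum P_0 q_1}} \quad \text{and} \quad Q_{01} = \sqrt{\frac{\sum q_1 P_0}{\sum q_0 P_0} \times \frac{\sum q_1 P_1}{\sum q_0 P_1}}$$

$$P_{01} \times Q_{01} = \sqrt{\frac{285}{229} \times \frac{275}{291} \times \frac{291}{229} \times \frac{275}{285}} = \sqrt{\frac{275 \times 275}{229 \times 229}} = \frac{275}{229} = \frac{\sum P_1 q_1}{\sum P_0 q_0}$$

Hence, the factor reversal test is satisfied.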
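Both tests in Example 4.11 can also be verified numerically. The following Python sketch is illustrative only: `fisher_ratio` is a hypothetical helper that returns the Fisher index as a ratio (without the factor 100), and the comparison with the base-year-weighted (Laspeyre’s) index is added here to show a formula that fails the time reversal test on the same data.

```python
import math


def total(prices, quantities):
    """Sum of price * quantity over all items."""
    return sum(p * q for p, q in zip(prices, quantities))


def fisher_ratio(base_p, base_q, cur_p, cur_q):
    """Fisher's ideal index as a ratio (no factor 100)."""
    laspeyres = total(cur_p, base_q) / total(base_p, base_q)
    paasche = total(cur_p, cur_q) / total(base_p, cur_q)
    return math.sqrt(laspeyres * paasche)


# Example 4.11: commodities A to E
p0, q0 = [4, 6, 14, 3, 5], [10, 8, 5, 12, 7]   # 1980 prices and quantities
p1, q1 = [5, 9, 7, 6, 8], [8, 7, 12, 8, 5]     # 1981 prices and quantities

p01 = fisher_ratio(p0, q0, p1, q1)   # price index, 1981 on base 1980
p10 = fisher_ratio(p1, q1, p0, q0)   # price index, 1980 on base 1981
q01 = fisher_ratio(q0, p0, q1, p1)   # quantity index: roles of p and q swapped

time_reversal_ok = math.isclose(p01 * p10, 1.0)
factor_reversal_ok = math.isclose(p01 * q01, total(p1, q1) / total(p0, q0))
print(time_reversal_ok, factor_reversal_ok)  # True True

# Base-year-weighted index in both directions, for contrast
l01 = total(p1, q0) / total(p0, q0)
l10 = total(p0, q1) / total(p1, q1)
print(math.isclose(l01 * l10, 1.0))  # False
```

Swapping the argument roles of prices and quantities in `fisher_ratio` is exactly the interchange of factors that the factor reversal test asks for.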

Fixed and Chain Base Indices

As stated earlier, the base may be fixed or changing. It is said to be fixed when the period of comparison or the base year is fixed for all current years. Thus, if the indices of 1971, 1972, 1973 and 1974 are all calculated with 1970 as the base year, such indices will be called fixed base indices. If, on the other hand, the whole series of index numbers is not related to any one base period, but the indices for different years are obtained by relating each year’s price to that of the immediately preceding year, the indices so obtained are called chain base indices. For example, in the case of chain base indices, for 1974, 1973 will be the base; for 1973, 1972 will be the base; for 1972, 1971 will be the base, and so on. The relatives obtained by the chain base method are called link relatives, whereas the relatives obtained by the fixed base method are called chain relatives.

Example 4.12: From the following data relating to the wholesale prices of wheat for six years, construct index numbers using (i) 1980 as base, and (ii) the chain base method.

Year Price (per quintal) Year Price (per quintal)

Rs. Rs.

1980 100 1983 130

1981 120 1984 140

1982 125 1985 150


Solution: (i) Computation of index numbers with 1980 as base:

| Year | Price of wheat (Rs.) | Index number (1980 = 100) |
|---|---|---|
| 1980 | 100 | 100 |
| 1981 | 120 | (120/100) × 100 = 120 |
| 1982 | 125 | (125/100) × 100 = 125 |
| 1983 | 130 | (130/100) × 100 = 130 |
| 1984 | 140 | (140/100) × 100 = 140 |
| 1985 | 150 | (150/100) × 100 = 150 |

(ii) Construction of link relative indices (chain base method):

| Year | Price of wheat (Rs.) | Link relative index |
|---|---|---|
| 1980 | 100 | 100 |
| 1981 | 120 | (120/100) × 100 = 120 |
| 1982 | 125 | (125/120) × 100 = 104.167 |
| 1983 | 130 | (130/125) × 100 = 104 |
| 1984 | 140 | (140/130) × 100 = 107.692 |
| 1985 | 150 | (150/140) × 100 = 107.14 |

Conversion of Link Relatives into Chain Relatives

Chain relatives or chain indices can be obtained either directly or by converting link relatives into chain relatives with the help of the following formula:

Chain relative for current year = (Link relative for the current year × Chain relative for the previous year) / 100

Taking the data from Example 4.12, we can show the method of conversion as follows:


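| Year | Price of wheat | Link relative | Chain relative |
|---|---|---|---|
| 1980 | 100 | 100.00 | 100 |
| 1981 | 120 | 120.00 | (120 × 100)/100 = 120 |
| 1982 | 125 | 104.167 | (104.167 × 120)/100 = 125 |
| 1983 | 130 | 104.00 | (104 × 125)/100 = 130 |
| 1984 | 140 | 107.692 | (107.692 × 130)/100 = 140 |
| 1985 | 150 | 107.14 | (107.14 × 140)/100 = 150 |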
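The link relatives of Example 4.12 and their conversion into chain relatives can be sketched in Python as follows. The function names `link_relatives` and `chain_relatives` are illustrative assumptions.

```python
def link_relatives(prices):
    """Each year's price as a percentage of the preceding year's price.
    The first year has no predecessor and is set to 100."""
    links = [100.0]
    for previous, current in zip(prices, prices[1:]):
        links.append(current / previous * 100)
    return links


def chain_relatives(links):
    """Chain relative = link relative * previous chain relative / 100."""
    chain = [links[0]]
    for link in links[1:]:
        chain.append(link * chain[-1] / 100)
    return chain


# Example 4.12: wheat prices per quintal, 1980 to 1985
prices = [100, 120, 125, 130, 140, 150]
links = link_relatives(prices)
chain = chain_relatives(links)
print([round(x, 3) for x in links])  # [100.0, 120.0, 104.167, 104.0, 107.692, 107.143]
print([round(x, 2) for x in chain])  # [100.0, 120.0, 125.0, 130.0, 140.0, 150.0]
```

The chained figures reproduce the fixed base indices (1980 = 100), since multiplying successive link relatives cancels every intermediate year’s price.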

Base Shifting

Sometimes, it becomes necessary to shift the base from one period to another. This becomes necessary either because the previous base has become too old and useless for comparison purposes or because comparison has to be made with another series of index numbers having a different base period. This can be done in two ways,

(i) By reconstructing the series with the new base. This means that the relatives of each individual item are constructed with the new base and thus an entirely new series is formed.

(ii) By using a shorter method which is as follows: divide each index number of the series by the index number of the time period selected as new base and multiply the quotient by 100. Symbolically,

Index number (based on new base year) = (Current year’s old index number / New base year’s old index number) × 100

Example 4.13: The following are the index numbers of prices with 1939 asbase,

Year: 1939 1940 1945 1950 1955 1960

Index Number: 100 110 120 200 400 380

Shift the base to the year 1950.

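Solution: Index numbers with 1950 as base (1950 = 100)

| Year | Index number (1939 = 100) | Index number (1950 = 100) |
|------|---------------------------|---------------------------|
| 1939 | 100 | (100/200) × 100 = 50 |
| 1940 | 110 | (110/200) × 100 = 55 |
| 1945 | 120 | (120/200) × 100 = 60 |
| 1950 | 200 | (200/200) × 100 = 100 |
| 1955 | 400 | (400/200) × 100 = 200 |
| 1960 | 380 | (380/200) × 100 = 190 |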
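The shorter method of base shifting amounts to a single division per year. A minimal sketch, assuming the series is held as a year-to-index mapping (the name `shift_base` is illustrative only):

```python
def shift_base(indices, new_base_year):
    """Divide every index by the old index of the new base year and multiply by 100."""
    base_value = indices[new_base_year]
    return {year: value / base_value * 100 for year, value in indices.items()}

# Index numbers of Example 4.13 (1939 = 100).
old_series = {1939: 100, 1940: 110, 1945: 120, 1950: 200, 1955: 400, 1960: 380}
new_series = shift_base(old_series, 1950)
print({year: round(value, 2) for year, value in new_series.items()})
```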

Splicing

Sometimes, an index number series is discontinued because its base has become too old and so it has lost its utility. A new series of index numbers may be computed with some recent year as base. For example, the weights of an index number may have become out of date and a new index with new weights may be constructed. This would result in two series of index numbers. It may sometimes be necessary to connect the two series of index numbers into one continuous series. The procedure employed for connecting an old series of index numbers with a revised series, in order to make the series continuous, is called splicing. The process of splicing is very simple and is similar to the one used in shifting the base. The spliced index numbers are calculated with the help of the following formula:

$$\text{Spliced index number} = \frac{\text{Current year's new index number} \times \text{New base year's old index number}}{100}$$

Example 4.14: Index A was started in 1969 and continued up to 1975, in which year another index B was started. Splice index B to index A so that a continuous series of index numbers from 1969 up to date may be available:

Year: 1969 1970 1971 1972 1973 1974 1975

(A) Index Numbers (Old): 100 120 130 200 300 350 400

Year: 1975 1976 1977 1978 1979 1980

(B) Index Numbers (New): 100 110 90 110 98 96

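Solution: Index B spliced to Index A

| Year | Old index nos. (A) | New index nos. (B) | Index B spliced to index A (base 1969 = 100) |
|------|--------------------|--------------------|----------------------------------------------|
| 1969 | 100 | | |
| 1970 | 120 | | |
| 1971 | 130 | | |
| 1972 | 200 | | |
| 1973 | 300 | | |
| 1974 | 350 | | |
| 1975 | 400 | 100 | (400 × 100)/100 = 400 |
| 1976 | | 110 | (400 × 110)/100 = 440 |
| 1977 | | 90 | (400 × 90)/100 = 360 |
| 1978 | | 110 | (400 × 110)/100 = 440 |
| 1979 | | 98 | (400 × 98)/100 = 392 |
| 1980 | | 96 | (400 × 96)/100 = 384 |

Splicing is very useful for making comparisons between new and old index numbers.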
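The splicing formula can be applied to Example 4.14 in a few lines. This is a sketch under the assumption that the two series overlap in exactly one year (1975 here); the function name `splice_new_to_old` is hypothetical.

```python
def splice_new_to_old(old_series, new_series, overlap_year):
    """Express the new series on the old base and append it to the old series."""
    factor = old_series[overlap_year]  # old index number of the new base year
    spliced = dict(old_series)
    for year, value in new_series.items():
        spliced[year] = value * factor / 100
    return spliced

index_a = {1969: 100, 1970: 120, 1971: 130, 1972: 200, 1973: 300, 1974: 350, 1975: 400}
index_b = {1975: 100, 1976: 110, 1977: 90, 1978: 110, 1979: 98, 1980: 96}
continuous = splice_new_to_old(index_a, index_b, 1975)
print({year: round(value, 2) for year, value in continuous.items()})
```

The result is one series on the 1969 base that runs from 1969 to 1980.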

Deflating

Deflating is the process of making allowances for the effect of changing price levels. With increasing price levels, the purchasing power of money is reduced. As a result, the real wage figures are reduced and the real wages become less than the money wages. To get the real wage figure, the money wage figure may be reduced to the extent the price level has risen. The process of calculating the real wages by applying index numbers to the money wages so as to allow for the change in the price level is called deflating. Thus, deflating is the process by which a series of money wages or incomes can be corrected for price changes to find out the level of real wages or incomes. This is done with the help of the following formulae:

$$\text{Real wage} = \frac{\text{Money wage}}{\text{Price index}} \times 100$$

$$\text{Real wage index} = \frac{\text{Real wages for the current year}}{\text{Real wages for the base year}} \times 100$$

Example 4.15: The average of monthly wages in different years is as follows:

Year: 1977 1978 1979 1980 1981 1982 1983

Wages (Rs): 200 240 350 360 360 380 400

Price Index: 100 150 200 220 230 250 250

Calculate real wages index numbers.

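Solution: Construction of real wage indices

| Year | Wages (Rs) | Price index | Real wages | Real wages index (1977 = 100) |
|------|------------|-------------|------------|-------------------------------|
| 1977 | 200 | 100 | (200/100) × 100 = 200 | 100 |
| 1978 | 240 | 150 | (240/150) × 100 = 160 | (160/200) × 100 = 80 |
| 1979 | 350 | 200 | (350/200) × 100 = 175 | (175/200) × 100 = 87.5 |
| 1980 | 360 | 220 | (360/220) × 100 = 163.64 | (163.64/200) × 100 = 81.82 |
| 1981 | 360 | 230 | (360/230) × 100 = 156.52 | (156.52/200) × 100 = 78.26 |
| 1982 | 380 | 250 | (380/250) × 100 = 152 | (152/200) × 100 = 76 |
| 1983 | 400 | 250 | (400/250) × 100 = 160 | (160/200) × 100 = 80 |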
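The two deflating formulae can be verified against Example 4.15 with the sketch below. The function names are illustrative, and the first year of the series is assumed to be the base year.

```python
def real_wages(money_wages, price_index):
    """Real wage = money wage / price index * 100, year by year."""
    return [w / p * 100 for w, p in zip(money_wages, price_index)]

def real_wage_index(real):
    """Real wage index relative to the first year of the series (base = 100)."""
    base = real[0]
    return [r / base * 100 for r in real]

years = [1977, 1978, 1979, 1980, 1981, 1982, 1983]
wages = [200, 240, 350, 360, 360, 380, 400]
price_index = [100, 150, 200, 220, 230, 250, 250]

real = real_wages(wages, price_index)
index = real_wage_index(real)
for year, r, i in zip(years, real, index):
    print(year, round(r, 2), round(i, 2))
```

Although money wages doubled between 1977 and 1983, the real wage index ends at 80, that is, 20 per cent below its 1977 level.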

Uses and Importance of Index Numbers

Index numbers have become indispensable for analysing economic and business conditions, although they are used in almost all sciences: natural, social and physical. The main uses of index numbers can be summarized as follows:

1. They help in framing suitable policies

Index numbers of the data relating to prices, production, profits, imports and exports, personnel and financial matters are indispensable for any organization in framing suitable policies and formulating executive decisions. For example, the cost of living index numbers help employers in deciding the increase in the dearness allowance of their employees or in adjusting their salaries and wages in accordance with changes in their cost of living.

2. Index numbers help in studying trends and tendencies

Since index numbers study the relative changes in the level of a phenomenon over a period of time, the time series so formed enable us to study the general trend of the phenomenon under study. For example, by studying the index numbers of wholesale prices in India for the last ten years, we can say that the general price level in India is showing an upward trend as it is rising year after year. Similarly, by examining the index numbers of production (industrial and agricultural), volume of trade, imports and exports, etc., for the last few years, we can draw useful conclusions about the trend of production and business activity.

3. Index numbers are very useful in deflating

In time-series analysis, index numbers are used to adjust the original data for price changes, or to adjust wage changes for cost of living changes and thus transform nominal wages into real wages. Moreover, nominal income can be transformed into real income, and nominal sales into real sales, through appropriate index numbers.

Activity 1

Collect data on wholesale prices of rice for 5 continuous years starting from the year 2005. Construct index numbers using 2005 as base.

Self-Assessment Questions

1. Fill in the blanks with the appropriate terms.

(a) Index number shows by its ______________ the changes in a magnitude which is not susceptible either to accurate measurement in itself or to direct valuation in practice.

(b) _________________ is the process of making allowances for the effect of changing price levels.

2. State whether true or false.

(a) The simple average of price relative method is superior to the simple aggregate of prices method.

(b) The term ‘weight’ refers to the relative importance of similar items in the construction of index.

4.3 Summary

Let us recapitulate the important concepts discussed in this unit:

Index numbers are a specialized type of average. They are designed to measure the relative change in the level of a phenomenon with respect to time, geographical locations or some other characteristics.

In case of unweighted indices, weights are not expressly assigned, whereas in the weighted indices, weights are expressly assigned to the various items.

Weighted indices are similar to the simple aggregative type. The fundamental difference is that weights are assigned explicitly to the various items included in the index.

It is absolutely necessary that the purpose of the index numbers be rigorously defined. This would help in deciding the nature of data to be collected, the choice of the base year, the formula to be used and other related matters.

While selecting the base year, a decision has to be made whether the base shall remain fixed or not. If the period of comparison is fixed for all current years, it is called the fixed base method. If, on the other hand, the prices of the current year are linked with the prices of the preceding year and not with the fixed year or period, it is called the chain base method.

Value means price times quantity. Thus, a value index ‘V’ is the sum of the values of a given year divided by the sum of the values for the base year.

Deflating is the process of making allowances for the effect of changing price levels. With increasing price levels, the purchasing power of money is reduced.

4.4 Glossary

Index numbers: The index number measures the relative change in the magnitude of a group of related, distinct variables in two or more situations. Index numbers can be used to measure changes in price, wages, production, employment, national income, etc., over a period of time.

Splicing: The process employed for connecting an old series of index numbers with a revised series in order to make the series continuous.

Deflating: The process of making allowances for the effect of changing price levels.

4.5 Terminal Questions

1. Explain the importance of index numbers.

2. Broadly discuss the division of the two methods of construction of index numbers.

3. Describe the Marshall-Edgeworth method for constructing index numbers.

4. Why is it necessary to define the purpose of index numbers?

5. Differentiate between fixed and chain based indices.

4.6 Answers

Answers to Self-Assessment Questions

1. (a) Variations; (b) Deflating

2. (a) True; (b) False

Answers to Terminal Questions

1. Refer Section 4.2

2. Refer Section 4.2.1

3. Refer Section 4.2.1

4. Refer Section 4.2.1

5. Refer Section 4.2.1

4.7 Further Reading

1. Elhance, D.N. Fundamentals of Statistics. Allahabad: Kitab Mahal, 2007

2. Gupta, S.C. Fundamentals of Business Statistics. New Delhi: Sultan Chand & Sons, 2010

Unit 5 Data Representation

Structure

5.1 Introduction
Objectives
5.2 Tables
5.3 Graphs
5.4 Diagrams
5.5 Summary
5.6 Glossary
5.7 Terminal Questions
5.8 Answers
5.9 Further Reading

5.1 Introduction

In the previous unit, you learnt about index numbers, which are a specialized type of average.

In this unit, you will learn about the construction of tables, diagrams and graphs, how important these are to a business and how they are used. In any type of business firm, a large amount of raw data is generated from various business sources. Such data becomes quite cumbersome and confusing for management to handle and analyse. In a business firm, data can be of various types, relating to various categories such as the number of each item of the inventory, the record of sales from different departments, the accounts of all kinds of bills and so on. It is almost impossible for management to deal with all this data in raw form. Therefore, such data must be presented in a suitable and summarized form without any loss of relevant information so that it can be efficiently used for decision-making. Hence, we construct appropriate tables, graphs and diagrams to interpret and summarize the entire set of raw data.

In view of the ever increasing importance of statistical data in business operations and their management, this unit discusses the presentation of data in the form of graphs, tables and diagrams, their importance and use.

Objectives

After studying this unit, you should be able to:

Explain the types of tables, graphs and diagrams

Construct tables, graphs and diagrams

Describe the concept of frequency polygon and relative frequency

Explain the construction of ogive curves and their types

Construct histograms

Represent and evaluate data in diagrammatic and graphic forms

5.2 Tables

Classification of data is usually followed by tabulation, which is considered the mechanical part of classification.

Tabulation is the systematic arrangement of data in columns and rows. The analysis of data is done by arranging the columns and rows to facilitate comparisons.

Tabulation has the following objectives:

(i) Simplicity. The removal of unnecessary details gives a clear and concise picture of the data

(ii) Economy of space and time

(iii) Ease in comprehension and remembering

(iv) Facility of comparisons. Comparisons within a table and with other tables may be made

(v) Ease in handling of totals, analysis, interpretation, etc.

5.2.1 Construction of Tables

A table is constructed depending on the type of information to be presented and the requirements of statistical analysis. The following are the essential features of a table:

(i) Title. It should have a clear and relevant title, which describes the contents of the table. The title should be brief and self explanatory.

(ii) Stubs and captions. It should have clear headings and sub headings. Column headings are called captions and row headings are called stubs. The stubs are usually wider than the captions.

(iii) Unit. It should indicate all the units used.

(iv) Body. The body of the table should contain all information arranged according to description.

(v) Headnote. The headnote or prefatory note, placed just below the title, in a less prominent type, gives some additional explanation about the table. Sometimes, the headnote consists of the unit of measurement.

(vi) Footnotes. A footnote at the bottom of the table may clarify some omissions or special features. A source note gives information about the source used, if any.

(vii) Arrangement of data. Data may be arranged according to requirements in chronological, alphabetical, geographical, or any other order.

(viii) Emphasis. The items to be emphasized may be put in different print or marked suitably.

(ix) Other details. Percentages, ratios, etc. should be shown in separate columns. Thick and thin lines should be drawn at proper places.

A table should be easy to read and should contain only the relevant details. If the aim of clarification is not achieved, the table should be redesigned.

5.2.2 Types of Tables

Depending on the nature of the data and other requirements, tables may be divided into various types.

General tables or Reference tables. These contain detailed information for general use and reference, e.g., tables published by government agencies.

Specific purpose or Derivative tables. These are usually summarized from general tables and are useful for comparison and analytical purposes. Averages, percentages, etc. are incorporated along with information in these tables.

Simple and Complex tables. A table showing only one characteristic is a simple table. The complex table shows two or more characteristics or groups of items.

Table 5.1 represents a simple table.

Table 5.1 Cinema Attendance among Adult Male Factory Workers in Bombay, March 1972

| Frequency | Number of workers |
|-----------|-------------------|
| Less than once a month | 3780 |
| 1 to 4 times a month | 1652 |
| More than 4 times a month | 926 |

Table 5.2 is an example of a complex table and is the result of a survey on the cinema-going habits of adult factory workers.

Table 5.2 Cinema Attendance among Adult Male Factory Workers in Bombay, March 1972

| Cinema attendance frequency | Single, under 30 | Single, over 30 | Married, under 30 | Married, over 30 |
|-----------------------------|------------------|-----------------|-------------------|------------------|
| Less than once a month | 122 | 374 | 1404 | 1880 |
| 1–4 times a month | 1046 | 202 | 289 | 115 |
| More than 4 times a month | 881 | 23 | 112 | 10 |
| Total | 2049 | 599 | 1805 | 2005 |

It is obvious that the tabular form of classification of data is a great improvement over the narrative form.

Frequently, table construction involves deciding which attribute should be taken as primary and which as secondary. For the previous table, we can also consider whether it would be improved further if ‘under 30’ and ‘30 and over’ had been the main column headings and ‘single’ and ‘married’ the sub headings. The modifications depend on the purpose of the table. If the activities of age groups are to be compared, it is best left as it stands. But if a comparison between men of different marital status is required, the change would be an improvement.

5.2.3 Advantages of Tabulation of Data

(i) Tabulated data can be more easily understood and grasped than untabulated data.

(ii) A table facilitates comparisons between subdivisions and with other tables.

(iii) It enables the required figures to be located easily.

(iv) It reveals patterns within the figures, which might otherwise not have been obvious, e.g., from the previous table, we can conclude that regular and frequent cinema attendance is mainly confined to the younger age group.

(v) It makes the summation of items and the detection of errors and omissions easier.

(vi) It obviates repetition of explanatory phrases and headings and hence takes less space.

Self-Assessment Questions

1. Fill in the blanks with the appropriate terms.

(a) Tabulation is the _____________ arrangement of data in columns and rows.

(b) Tabulated data can be more easily understood and grasped than _____________ data.

2. State whether true or false.

(a) A table showing two characteristics is a simple table.

(b) A table facilitates comparisons between subdivisions and with other tables.

5.3 Graphs

In a graph, the independent variable should always be placed on the horizontal or X-axis and the dependent variable on the vertical or Y-axis.

5.3.1 Line Graph

Here, the points are plotted on paper (or graph paper) and joined by straight lines. Generally, continuous variables are plotted by the line graph.

Example 5.1: The monthly averages of the Retail Price Index from 1996 to 2003 (Jan. 1996 = 100) were as follows:

Year: 1996 1997 1998 1999 2000 2001 2002 2003

Retail Price Index: 100 105.8 109.0 109.6 110.7 114.5 119.3 122.3

Draw a diagram to display these figures.

Solution: Here, years are plotted along the horizontal line and the retail price index along the vertical line.

Erect perpendiculars to the horizontal line at the points marking the years 1997, 1998, ..., 2003 and cut off these ordinates according to the given retail price index values; the various points are thus plotted on the paper. Join these points by straight lines.

[Figure: line graph of the Retail Price Index (vertical axis, 100 to 125) against Year (horizontal axis, 1996 to 2003)]

5.3.2 Frequency Polygon

A frequency polygon is a line chart of a frequency distribution in which either the values of discrete variables or the midpoints of class intervals are plotted against the frequencies and these plotted points are joined together by straight lines. Since the frequencies generally do not start at zero or end at zero, this diagram as such would not touch the horizontal axis. However, since the area under the entire curve is the same as that of a histogram, which is 100 per cent of the data presented, the curve can be enclosed so that the starting point is joined with a fictitious preceding point whose value is zero. This ensures that the start of the curve is at the horizontal axis. Likewise, the last point is joined with a fictitious succeeding point whose value is also zero, so that the curve ends at the horizontal axis. This enclosed diagram is known as the frequency polygon.

We can construct the frequency polygon (Figure 5.1) from Table 5.3, presented for the ages of 30 workers, as follows:

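Table 5.3 Ages of 30 Workers

| Class interval (years) | Midpoint | Frequency (f) |
|------------------------|----------|---------------|
| 15 up to 25 | 20 | 5 |
| 25 up to 35 | 30 | 3 |
| 35 up to 45 | 40 | 7 |
| 45 up to 55 | 50 | 5 |
| 55 up to 65 | 60 | 3 |
| 65 up to 75 | 70 | 7 |

[Figure: frequency polygon through the points (20, 5), (30, 3), (40, 7), (50, 5), (60, 3) and (70, 7)]

Figure 5.1 Frequency Polygon Curve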
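The enclosing step described above, adding one fictitious zero-frequency point before the first midpoint and one after the last, can be sketched as follows. The function name `closed_polygon_points` is an illustrative assumption, and equal class widths are assumed.

```python
def closed_polygon_points(midpoints, frequencies):
    """Return the polygon vertices, closed at both ends on the horizontal axis."""
    width = midpoints[1] - midpoints[0]      # assumes equal class widths
    start = (midpoints[0] - width, 0)        # fictitious preceding point
    end = (midpoints[-1] + width, 0)         # fictitious succeeding point
    return [start, *zip(midpoints, frequencies), end]

# Midpoints and frequencies from Table 5.3.
points = closed_polygon_points([20, 30, 40, 50, 60, 70], [5, 3, 7, 5, 3, 7])
print(points)
```

Joining these eight points in order by straight lines gives the enclosed frequency polygon.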

5.3.3 Relative Frequency

In a frequency distribution, if the frequency in each class interval is converted into a proportion by dividing it by the total frequency, we get a series of proportions called relative frequencies. A distribution presented with relative frequencies rather than actual frequencies is called a relative frequency distribution. The sum of all relative frequencies in a distribution is 1.

Example 5.2: Calculate relative frequency from the given table.

| Class interval | Frequency |
|----------------|-----------|
| 25–35 | 7 |
| 35–45 | 9 |
| 45–55 | 22 |
| 55–65 | 7 |
| 65–75 | 3 |
| 75–85 | 2 |

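Solution: This example shows that the sum of all relative frequencies in a distribution is 1.

| Class interval | Frequency | Relative frequency | Explanation |
|----------------|-----------|--------------------|-------------|
| 25–35 | 7 | 0.14 | 7/50 = 0.14 |
| 35–45 | 9 | 0.18 | 9/50 = 0.18 |
| 45–55 | 22 | 0.44 | etc. |
| 55–65 | 7 | 0.14 | |
| 65–75 | 3 | 0.06 | |
| 75–85 | 2 | 0.04 | |
| Total | 50 | 1.00 | |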

The concept of relative frequencies is useful in sampling theory. It can also be used to compare two frequency distributions that have unequal total frequencies but the same series of class intervals, as in the following example.

Example 5.3: Compare the following frequency distributions.

| Class interval | f1 | f2 |
|----------------|----|----|
| 10–20 | 5 | 12 |
| 20–30 | 10 | 24 |
| 30–40 | 6 | 30 |
| 40–50 | 3 | 19 |
| 50–60 | 1 | 15 |

Solution: The following table shows the comparison.

| Class interval | f1 | f2 | Rel. freq. f1 | Rel. freq. f2 |
|----------------|----|----|---------------|---------------|
| 10–20 | 5 | 12 | 0.20 | 0.12 |
| 20–30 | 10 | 24 | 0.40 | 0.24 |
| 30–40 | 6 | 30 | 0.24 | 0.30 |
| 40–50 | 3 | 19 | 0.12 | 0.19 |
| 50–60 | 1 | 15 | 0.04 | 0.15 |
| Total | 25 | 100 | 1.00 | 1.00 |
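Both examples rest on the same one-line calculation, sketched below. The function name `relative_frequencies` is an illustrative choice.

```python
def relative_frequencies(frequencies):
    """Divide each class frequency by the total frequency."""
    total = sum(frequencies)
    return [f / total for f in frequencies]

# Example 5.2
rel_52 = relative_frequencies([7, 9, 22, 7, 3, 2])
# Example 5.3: two distributions with unequal totals (25 and 100)
rel_f1 = relative_frequencies([5, 10, 6, 3, 1])
rel_f2 = relative_frequencies([12, 24, 30, 19, 15])

print([round(r, 2) for r in rel_52])
print([round(r, 2) for r in rel_f1])
print([round(r, 2) for r in rel_f2])
```

Because each list sums to 1, the two distributions of Example 5.3 become directly comparable even though their totals differ.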

A direct visual comparison of two frequency distributions can be made by drawing their frequency polygons.

Example 5.4: Draw frequency polygons for the relative frequency distributions given in Example 5.3.

Solution: The following are the frequency polygons for the relative frequencies mentioned in Example 5.3.

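[Figure: frequency polygons of the two relative frequency distributions (curves 1 and 2); horizontal axis: class marks (5 to 65); vertical axis: relative frequency (0.1 to 0.4)]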

5.3.4 Ogive Curves

A cumulative frequency curve or ogive is the graphic representation of a cumulative frequency distribution. Ogives are of two types: the ‘less than’ ogive and the ‘greater than’ ogive. Both these ogives are constructed from Table 5.4 for the ages of 30 workers.

Table 5.4 Cumulative Frequency

| Class interval (years) | Midpoint | Frequency (f) | Cum. freq. (less than) | Cum. freq. (greater than) |
|------------------------|----------|---------------|------------------------|---------------------------|
| 15 and up to 25 | 20 | 5 | 5 (less than 25) | 30 (more than 15) |
| 25 and up to 35 | 30 | 3 | 8 (less than 35) | 25 (more than 25) |
| 35 and up to 45 | 40 | 7 | 15 (less than 45) | 22 (more than 35) |
| 45 and up to 55 | 50 | 5 | 20 (less than 55) | 15 (more than 45) |
| 55 and up to 65 | 60 | 3 | 23 (less than 65) | 10 (more than 55) |
| 65 and up to 75 | 70 | 7 | 30 (less than 75) | 7 (more than 65) |

(i) Less than ogive. In this case, the less than cumulative frequencies are plotted against the upper boundaries of their respective class intervals. Figure 5.2 shows the ‘less than’ ogive.

[Figure: ‘less than’ ogive; horizontal axis: class interval]

Figure 5.2 ‘Less than’ Ogive

(ii) Greater than ogive. In this case, the greater than cumulative frequencies are plotted against the lower boundaries of their respective class intervals.

[Figure: ‘more than’ ogive; horizontal axis: class interval; vertical axis: greater than cumulative frequency]

Figure 5.3 ‘Greater than’ Ogive
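The points to be plotted for the two ogives can be generated from the class frequencies alone. The following sketch uses the class boundaries and frequencies of the 30 workers; the variable names are illustrative assumptions.

```python
from itertools import accumulate

lower = [15, 25, 35, 45, 55, 65]   # lower class boundaries
upper = [25, 35, 45, 55, 65, 75]   # upper class boundaries
freqs = [5, 3, 7, 5, 3, 7]

# 'Less than' cumulative frequencies: running total of the frequencies.
less_than = list(accumulate(freqs))
# 'Greater than' cumulative frequencies: total minus everything below the class.
total = sum(freqs)
greater_than = [total - below for below in [0] + less_than[:-1]]

less_than_points = list(zip(upper, less_than))        # plotted at upper boundaries
greater_than_points = list(zip(lower, greater_than))  # plotted at lower boundaries
print(less_than_points)
print(greater_than_points)
```

The ‘less than’ ogive rises from 5 to 30 while the ‘greater than’ ogive falls from 30 to 7, matching the last two columns of the cumulative frequency table.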

These ogives can be used for comparison purposes. Several ogives can be drawn on the same grid, preferably with different colours for easier visualization and differentiation.

Although diagrams and graphs are powerful and effective media for presenting statistical data, they can only represent a limited amount of information and they are not of much help when intensive analysis of data is required.

5.3.5 Histograms

A histogram is the graphical description of data and is constructed from a frequency table. It displays the distribution of a data set and is used for statistical as well as mathematical calculations.

The word histogram is derived from the Greek words ‘histos’, which means ‘anything set upright’, and ‘gramma’, which means ‘drawing, record, writing’. It is considered the most important basic tool of the statistical quality control process.

In this type of representation, the given data is plotted in the form of a series of rectangles. Class intervals are marked along the X-axis and the frequencies along the Y-axis according to a suitable scale. Unlike the bar chart, which is one dimensional, meaning that only the length of the bar is important and not the width, a histogram is two-dimensional: both the length and the width are important. A histogram is constructed from a frequency distribution of grouped data, where the height of the rectangle is proportional to the respective frequency and the width represents the class interval. Each rectangle is joined with the next, and any blank space between the rectangles would mean that the category is empty and there are no values in that class interval.

Let us construct a histogram for our example of ages of 30 workers. For convenience's sake, we will present the frequency distribution along with the midpoint of each interval, where the midpoint is simply the average of the values of the lower and the upper boundary of each class interval. The frequency distribution table is shown as follows:

| Class interval (years) | Midpoint | Frequency (f) |
|------------------------|----------|---------------|
| 15 and up to 25 | 20 | 5 |
| 25 and up to 35 | 30 | 3 |
| 35 and up to 45 | 40 | 7 |
| 45 and up to 55 | 50 | 5 |
| 55 and up to 65 | 60 | 3 |
| 65 and up to 75 | 70 | 7 |

The histogram of this data would be shown as follows:

[Figure: histogram with six adjoining rectangles of heights 5, 3, 7, 5, 3 and 7; horizontal axis: class interval]
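As a rough check on the table and the drawing, the midpoints and a text version of the histogram can be produced as below. This is only an illustrative sketch (one `#` per worker, bars laid out horizontally rather than vertically); the function name `text_histogram` is an assumption.

```python
lower = [15, 25, 35, 45, 55, 65]
upper = [25, 35, 45, 55, 65, 75]
freqs = [5, 3, 7, 5, 3, 7]

# Midpoint = average of the lower and upper class boundaries.
midpoints = [(lo + hi) / 2 for lo, hi in zip(lower, upper)]

def text_histogram(lower, upper, freqs):
    """One row per class interval; bar length is proportional to the frequency."""
    return [f"{lo}-{hi} | {'#' * f}" for lo, hi, f in zip(lower, upper, freqs)]

print(midpoints)
for row in text_histogram(lower, upper, freqs):
    print(row)
```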

Activity 1

The following frequency distribution represents the number of days during a year that the faculty of the college was absent from work due to illness.

| Number of days | Number of employees |
|----------------|---------------------|
| 0–2 | 5 |
| 3–5 | 10 |
| 6–8 | 20 |
| 9–11 | 10 |
| 12–14 | 5 |
| Total | 50 |

(a) Construct a frequency distribution for this data.

(b) Construct a greater than cumulative frequency distribution as well as a less than cumulative frequency distribution for this data.

(c) How many employees were absent for less than 3 days during the year?

(d) How many employees were absent for more than 8 days during the year?

(e) Draw a frequency polygon for this data.

Self-Assessment Questions

3. State whether true or false.

(a) In a graph, the independent variable should always be placed on the vertical axis.

(b) A distribution presented with relative frequencies rather than actual frequencies is called a relative frequency distribution.

4. Fill in the blanks with the appropriate terms.

(a) A direct visual comparison of two ____________ distributions can be made by drawing their frequency polygons.

(b) A histogram is constructed from a frequency distribution of grouped data, where the height of the rectangle is _______________ to the respective frequency and the width represents the class interval.

5.4 Diagrams

The data we collect can often be more easily understood and interpreted if it is presented graphically or pictorially. Diagrams and graphs give visual indications of magnitudes, groupings, trends and patterns in the data. These important features are more simply presented in the form of graphs. Also, diagrams facilitate comparisons between two or more sets of data.

The diagrams should be clear and easy to read and understand. Too much information should not be shown in the same diagram; otherwise, it may become cumbersome and confusing. Each diagram should include a brief and self explanatory title dealing with the subject matter. The scale of the presentation should be chosen in such a way that the resulting diagram is of appropriate size. The intervals on the vertical as well as the horizontal axis should be of equal size; otherwise, distortions would occur.

Diagrams are more suitable to illustrate data which is discrete, while continuous data is better represented by graphs. The following are the diagrammatic and the graphic representation methods that are commonly used.

5.4.1 One Dimensional Diagrams

Bars are simply vertical lines where the lengths of the bars are proportional to their corresponding numerical values. The width of the bar is unimportant but all bars should have the same width so as not to confuse the reader of the diagram. Additionally, the bars should be equally spaced.

Example 5.5: Construct a subdivided bar chart for the three types of expenditures in dollars for a family of four for the years 1988, 1989, 1990 and 1991 given as follows:

Year Food Education Other Total

1988 3000 2000 3000 8000

1989 3500 3000 4000 10500

1990 4000 3500 5000 12500

1991 5000 5000 6000 16000

Solution: The subdivided bar chart would be as follows:

[Figure: A Subdivided Bar Diagram; horizontal axis: Year (1988 to 1991); vertical axis: Expenditure (0 to 16000); components: Food, Education, Other]

Percentage Component Bars or Divided Bar Charts

When, in the previous case, the component lengths represent the percentages (instead of the actual amounts) of each component, we get percentage component bar charts. The heights of all the bars will be the same, as shown in Figure 5.4.

Figure 5.4 Percentage Component Bar Chart showing Expenses and Savings of Mr X
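To see how the component lengths are obtained, the family expenditure data of Example 5.5 (not the data behind Figure 5.4) can be converted into percentages. This is a minimal sketch; the function name `percentage_components` is illustrative.

```python
# Expenditure in dollars from Example 5.5.
expenditure = {
    1988: {"Food": 3000, "Education": 2000, "Other": 3000},
    1989: {"Food": 3500, "Education": 3000, "Other": 4000},
    1990: {"Food": 4000, "Education": 3500, "Other": 5000},
    1991: {"Food": 5000, "Education": 5000, "Other": 6000},
}

def percentage_components(parts):
    """Express each component as a percentage of the bar's total."""
    total = sum(parts.values())
    return {name: value / total * 100 for name, value in parts.items()}

for year, parts in expenditure.items():
    shares = percentage_components(parts)
    print(year, {name: round(share, 2) for name, share in shares.items()})
```

Each year's percentages add up to 100, which is why all bars in a percentage component bar chart have the same height.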

Multiple Bar Charts

In multiple bar charts, the interrelated component parts are shown in adjoining bars, coloured or marked differently, thus allowing comparison between different parts as shown in Figure 5.5.

Figure 5.5 Multiple Bar Chart showing Expenses and Savings of Mr X

These charts can be used if the overall total is not required. Some charts given earlier show totals also.

5.4.2 Two Dimensional Diagrams

Two dimensional diagrams take two components of data for representation. These are also called area diagrams as they consider two dimensions. The types are rectangles, squares and pie. They can be best explained with the help of a squares diagram.

Squares: The square diagram is easy and simple to draw. Take the square root of the values of the various items that are to be shown in the diagram and then select a suitable scale to draw the squares.

Example 5.6: The yield of rice in kg per acre in five countries is as follows:

Country: USA Australia UK Canada India

Yield of rice (kg per acre): 6400 1600 2500 3600 4900

Represent this data using a square diagram.

Solution: To draw the square diagram, calculate as follows:

| Country | Yield | Square root | Side of the square (cm) |
|---------|-------|-------------|-------------------------|
| USA | 6400 | 80 | 4 |
| Australia | 1600 | 40 | 2 |
| UK | 2500 | 50 | 2.5 |
| Canada | 3600 | 60 | 3 |
| India | 4900 | 70 | 3.5 |

[Figure: five squares with sides 4 cm, 2 cm, 2.5 cm, 3 cm and 3.5 cm]
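The sides in this solution follow from the square roots once a scale is fixed. The sketch below assumes the scale implied by the table, 1 cm for every 20 units of the square root; the constant name `SCALE` is illustrative.

```python
import math

yields = {"USA": 6400, "Australia": 1600, "UK": 2500, "Canada": 3600, "India": 4900}
SCALE = 20  # assumed scale: 1 cm represents 20 units of the square root

# The area of each square is proportional to the yield, so its side
# is proportional to the square root of the yield.
sides_cm = {country: math.sqrt(value) / SCALE for country, value in yields.items()}
print(sides_cm)
```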

5.4.3 Pie Diagram

This type of diagram enables us to show the partitioning of a total into its component parts. The diagram is in the form of a circle and is also called a pie because the entire diagram looks like a pie and the components resemble slices cut from it. The size of the slice represents the proportion of the component out of the whole.

Example 5.7: The following figures relate to the cost of the construction of a house. The various components of cost that go into it are represented as percentages of the total cost.

Item              % Expenditure
Labour            25
Cement, Bricks    30
Steel             15
Timber, Glass     20
Miscellaneous     10

Construct a pie chart for the above data.

Solution: The pie chart for this data is presented as follows:

[Pie chart with five slices: Labour 25%, Cement and Bricks 30%, Steel 15%, Timber and Glass 20%, Miscellaneous 10%.]

Pie charts are very useful for comparison purposes, especially when there are only a few components. If there are too many components, it may become confusing to differentiate the relative values in the pie.
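To draw the slices by hand, each percentage has to be converted into an angle at the centre of the circle. The sketch below does this for the data of Example 5.7, using the fact that the whole circle is 360 degrees; the variable names are illustrative only.

```python
# Percentage shares from Example 5.7.
shares = {
    "Labour": 25,
    "Cement, Bricks": 30,
    "Steel": 15,
    "Timber, Glass": 20,
    "Miscellaneous": 10,
}

total = sum(shares.values())

# Angle of each slice = (component / total) * 360 degrees.
angles = {item: 360 * value / total for item, value in shares.items()}

for item, angle in angles.items():
    print(f"{item}: {angle:.0f} degrees")
```

Dividing by `total` rather than by 100 means the same code also works for raw counts, such as the population figures in Activity 2.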

5.4.4 Three Dimensional Diagrams

Three dimensional diagrams are also termed volume diagrams and consist of cubes, cylinders, spheres, etc. In these diagrams, three dimensions, namely length, width and height, are taken into account. Cubes are used where the side of a cube is drawn in proportion to the cube root of the magnitude of the data.

Example 5.8: Represent the following data using a volume diagram.

Category         Number of Students
Undergraduate    64000
Postgraduate     27000
Professional     8000


Solution: The sides of the cubes are calculated as follows:

Category         Number of Students   Cube Root   Side of Cube
Undergraduate    64000                40          4 cm
Postgraduate     27000                30          3 cm
Professional     8000                 20          2 cm

[Figure: three cubes with sides 4 cm, 3 cm and 2 cm, in the order of the table above.]
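The cube sides of Example 5.8 follow the same pattern as the squares, with a cube root in place of the square root. A minimal sketch, assuming a scale of 10 units of cube root per centimetre (inferred from the worked table):

```python
# Number of students from Example 5.8.
students = {"Undergraduate": 64000, "Postgraduate": 27000, "Professional": 8000}

# Assumed scale: 1 cm represents 10 units of the cube root.
UNITS_PER_CM = 10

# x ** (1/3) can be off by a tiny floating-point error, so round before scaling.
cube_sides_cm = {
    category: round(count ** (1 / 3), 6) / UNITS_PER_CM
    for category, count in students.items()
}

for category, side in cube_sides_cm.items():
    print(f"{category}: side = {side:.1f} cm")
```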

Activity 2

The following table represents the racial breakdown of people in the Flushing area in Queens, New York.

Race      White     Black    Hispanic   Asians   Others
Number    205,000   30,520   20,300     15,650   5,400

Construct a pie chart to represent this data. (Make sure that the slices of the pie proportionately represent the various ethnic populations.)

Self-Assessment Questions

5. Fill in the blanks with the appropriate terms.

(a) Each diagram should include a brief and self ______________ title dealing with the subject matter.

(b) Bars are simply vertical lines where the ______________ of the bars are proportional to their corresponding numerical values.

6. State whether true or false.

(a) Diagrams and graphs give visual indications of magnitudes, groupings, trends and patterns in the data.

(b) Diagrams facilitate comparisons between two or more sets of data.


5.5 Summary

Let us recapitulate the important concepts discussed in this unit:

Classification of data is usually followed by tabulation, which is considered the mechanical part of classification.

Tabulation is the systematic arrangement of data in columns and rows. The columns and rows are arranged so as to facilitate comparisons when the data is analysed.

A table should be easy to read and should contain only the relevant details. If the table does not make the data clearer, it should be redesigned.

In a graph, the independent variable should always be placed on the horizontal or X-axis and the dependent variable on the vertical or Y-axis.

A frequency polygon is a line chart of a frequency distribution in which the values of discrete variables or midpoints of class intervals are plotted against the frequencies and these plotted points are joined together by straight lines.

In a frequency distribution, if the frequency in each class interval is converted into a proportion by dividing it by the total frequency, we get a series of proportions called relative frequencies.

A cumulative frequency curve or ogive is the graphic representation of a cumulative frequency distribution. Ogives are of two types, ‘less than’ and ‘greater than’ ogives.

A histogram is the graphical description of data and is constructed from a frequency table. It displays the distribution of a data set and is used for statistical as well as mathematical calculations.

Diagrams and graphs give visual indications of magnitudes, groupings, trends and patterns in the data.

A pie diagram illustrates the partitioning of a total into its component parts.

5.6 Glossary

Table: The systematic arrangement of data in columns and rows.

Frequency polygon: A line chart of a frequency distribution in which the values of discrete variables or midpoints of class intervals are plotted against the frequencies and these plotted points are joined together by straight lines.

Relative frequency: The series of proportions obtained by dividing the frequency of each class interval by the total frequency.

Ogive curve: A graphic representation of a cumulative frequency distribution.

Histogram: The graphical description of data constructed from a frequency table. It displays the distribution of a data set and is used for statistical as well as mathematical calculations.

Pie diagram: A diagram that enables us to show the partitioning of a total into its component parts.

5.7 Terminal Questions

1. What are the essential features of a table?

2. Giving suitable examples, distinguish between a simple and a complex table.

3. Explain frequency polygon giving an example.

4. Define relative frequency. What are the areas where relative frequency is considered useful?

5. What is an ogive curve? Explain its types and significance.

6. How are histograms useful in data representation?

7. What features should be kept in mind while drawing a diagram?

8. Explain one dimensional, two dimensional and three dimensional diagrams with the help of examples.

5.8 Answers

Answers to Self-Assessment Questions

1. (a) Systematic; (b) Untabulated

2. (a) False; (b) True

3. (a) False; (b) True


4. (a) Frequency; (b) Proportional

5. (a) Explanatory; (b) Lengths

6. (a) True; (b) True

Answers to Terminal Questions

1. Refer Section 5.2.1

2. Refer Section 5.2.2

3. Refer Section 5.3.2

4. Refer Section 5.3.3

5. Refer Section 5.3.4

6. Refer Section 5.3.5

7. Refer Section 5.4

8. Refer Sections 5.4.1, 5.4.2 and 5.4.4

5.9 Further Reading

1. Elhance, D.N. Fundamentals of Statistics. Allahabad: Kitab Mahal, 2007.

2. Gupta, S.C. Fundamentals of Business Statistics. New Delhi: Sultan Chand & Sons, 2010.


Unit 6 Correlation

Structure

6.1 Introduction
    Objectives
6.2 Correlation Analysis
6.3 Coefficient of Correlation
6.4 Spearman’s Rank Correlation
6.5 Summary
6.6 Glossary
6.7 Terminal Questions
6.8 Answers
6.9 Further Reading

6.1 Introduction

In the previous unit, you learnt about various data representation techniques and their significance in decision-making.

In this unit, you will learn about correlation analysis. Correlation is one of the most widely used statistical measures. It can be defined as the interdependence between variable quantities. If the values of two variables change with respect to each other, then they are said to be correlated. For example, if the variables are stock prices and the price of one stock increases at the same time as the price of another stock increases, then the two stock prices are positively correlated. If the price of one stock goes down when the price of the other increases, then the two stock prices are negatively correlated. However, if we are unable to find a consistent pattern in the variation of the two stock prices, then they are uncorrelated.

The strength of correlation is measured by the coefficient of correlation. The value of the coefficient of correlation lies in the interval [–1, 1]. Positive correlations lie between 0 and 1; 0 means that there is no correlation; negative correlations lie between 0 and –1. The purpose of computing correlations is to allow us to make a prediction about one variable based on what we know about another variable.


Objectives

After studying this unit, you should be able to:

Explain correlation analysis

Evaluate coefficient of determination and coefficient of correlation

Calculate probable error of the coefficient of correlation

Calculate correlation using various methods

Define limitations of correlation analysis

6.2 Correlation Analysis

Correlation analysis is a statistical tool generally used to describe the degree to which one variable is related to another. The relationship, if any, is usually assumed to be a linear one. In fact, the word correlation refers to the relationship or interdependence between two variables. There are various phenomena that are related to each other. For instance, when the demand for a certain commodity increases, its price goes up, and when its demand decreases, its price goes down.

On the basis of the theory of correlation, one can study the comparative changes occurring in two related phenomena and examine their cause–effect relation. It should be borne in mind that relationships like ‘a black cat causes bad luck’ cannot be explained by the theory of correlation, since they are imaginary and cannot be justified mathematically. Thus, correlation is concerned with relationships between two related and quantifiable variables. For example, if the height of students as well as the height of the trees increases, we cannot call it a correlation because the two phenomena are not related to each other.

Correlation can be positive or negative. The sign of the correlation coefficient between two stock prices shows whether the two stock prices are positively or negatively correlated. If the coefficient of correlation is greater than zero but not greater than 1, then the stock prices are positively correlated and move in the same direction. If the coefficient of correlation is less than zero but not less than –1, then the stock prices are negatively correlated and move in opposite directions.

The correlation coefficient’s numerical value shows the strength of the correlation between the two stock prices. The stronger the positive correlation, the closer will be the value of the correlation coefficient to +1. The stronger the negative correlation, the closer will be the correlation coefficient to –1. If the two stock prices are perfectly uncorrelated, the value of the correlation coefficient is zero. This can be explained as under:

Changes in Independent Variable   Changes in Dependent Variable   Nature of Correlation
Increase (+)                      Increase (+)                    Positive (+)
Decrease (–)                      Decrease (–)                    Positive (+)
Increase (+)                      Decrease (–)                    Negative (–)
Decrease (–)                      Increase (+)                    Negative (–)

Statisticians have developed two measures for describing the correlation between two variables, viz., the coefficient of determination and the coefficient of correlation. We now explain, illustrate and interpret the two coefficients concerning the relationship between two variables.

6.2.1 The Coefficient of Determination

The coefficient of determination (symbolically indicated as $r^2$, though some people would prefer to put it as $R^2$) is a measure of the degree of linear association or correlation between two variables, say X and Y, one of which happens to be an independent variable and the other dependent. This coefficient is based on the following two kinds of variations:

(i) The variation of the Y values around the fitted regression line, viz. $\sum (Y - \hat{Y})^2$, technically known as the unexplained variation.

(ii) The variation of the Y values around their own mean, viz. $\sum (Y - \bar{Y})^2$, technically known as the total variation.

If we subtract the unexplained variation from the total variation, we obtain what is known as the explained variation, i.e., the variation explained by the line of regression. Thus,

Explained variation = (Total variation) – (Unexplained variation)

$$\sum (\hat{Y} - \bar{Y})^2 = \sum (Y - \bar{Y})^2 - \sum (Y - \hat{Y})^2$$

The total and explained as well as unexplained variations are shown in Figure 6.1.


[Figure: plot of consumption expenditure Y (’00 Rs) on the Y-axis against income X (’00 Rs) on the X-axis, showing the regression line of Y on X, the mean line of X and the mean line of Y, and, at a specific point, the total variation $(Y - \bar{Y})$, the explained variation $(\hat{Y} - \bar{Y})$ and the unexplained variation $(Y - \hat{Y})$.]

Figure 6.1 Diagram Showing Total, Explained and Unexplained Variations

The coefficient of determination is that fraction of the total variation of Y which is explained by the regression line. In other words, the coefficient of determination is the ratio of explained variation to total variation in the Y variable related to the X variable. Algebraically, it can be stated as follows:

$$r^2 = \frac{\text{Explained variation}}{\text{Total variation}} = \frac{\sum (\hat{Y} - \bar{Y})^2}{\sum (Y - \bar{Y})^2}$$

Alternatively, $r^2$ can also be stated as under:

$$r^2 = 1 - \frac{\text{Unexplained variation}}{\text{Total variation}} = 1 - \frac{\sum (Y - \hat{Y})^2}{\sum (Y - \bar{Y})^2}$$


6.2.2 Interpreting $r^2$

The coefficient of determination tells us how much of the variation in one factor can be explained by its relationship to another factor. It is the square of the correlation coefficient. For example, if you have two sets of scores on Tests X and Y and they correlate at r = 0.90, the coefficient of determination $r^2$ will be 0.81. This can be interpreted as follows: 81% of the variance in Test X is explained by Test Y.

As a matter of practice, it is the squared correlation that should be interpreted, because the correlation coefficient itself suggests more correlation than really exists, and the problem gets worse as the correlation approaches zero.

Example 6.1: Calculate the coefficient of determination ($r^2$) using the data given below and analyse the result.

Observation                             1    2    3    4    5    6    7     8    9    10
Income (X) (’00 Rs)                     41   65   50   57   96   94   110   30   79   65
Consumption Expenditure (Y) (’00 Rs)    44   60   39   51   80   68   84    34   55   48

Solution: $r^2$ can be worked out as shown below:

$$r^2 = 1 - \frac{\text{Unexplained variation}}{\text{Total variation}} = 1 - \frac{\sum (Y - \hat{Y})^2}{\sum (Y - \bar{Y})^2}$$

As $\sum (Y - \bar{Y})^2 = \sum Y^2 - n\bar{Y}^2$, we can write,

$$r^2 = 1 - \frac{\sum (Y - \hat{Y})^2}{\sum Y^2 - n\bar{Y}^2}$$

Calculating and putting in the various values, we have:

$$r^2 = 1 - \frac{260.54}{34223 - 10(56.3)^2} = 1 - \frac{260.54}{2526.10} = 0.897$$

Analysis of Result: The regression equation used to calculate the value of the coefficient of determination ($r^2$) from the sample data shows that about 90% of the variations in consumption expenditure can be explained. In other words, the variations in income explain about 90% of the variations in consumption expenditure.
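The calculation in Example 6.1 can be reproduced with the short Python sketch below. It fits the least-squares line to the ten observations and then forms $r^2$ as one minus the ratio of unexplained to total variation. The variable names are illustrative; because the code works with unrounded coefficients, the intermediate sums may differ slightly from the rounded figures quoted in the text.

```python
# Data from Example 6.1 (both series in '00 Rs).
X = [41, 65, 50, 57, 96, 94, 110, 30, 79, 65]   # income
Y = [44, 60, 39, 51, 80, 68, 84, 34, 55, 48]    # consumption expenditure

n = len(X)
mean_x = sum(X) / n
mean_y = sum(Y) / n

# Least-squares slope and intercept of the regression line of Y on X.
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y)) / sum((x - mean_x) ** 2 for x in X)
a = mean_y - b * mean_x

# Fitted values and the two kinds of variation.
fitted = [a + b * x for x in X]
unexplained = sum((y - y_hat) ** 2 for y, y_hat in zip(Y, fitted))
total = sum((y - mean_y) ** 2 for y in Y)

r_squared = 1 - unexplained / total

print(f"Y-hat = {a:.3f} + {b:.3f} X")
print(f"r^2 = {r_squared:.3f}")
```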


Activity 1

Using the various correlation methods discussed in the unit, compute the correlation for the following data:

Person   Height (x)   Self Esteem (y)
1        68           4.1
2        71           4.6
3        62           3.8
4        75           4.4
5        58           3.2
6        60           3.1

Self-Assessment Questions

1. Fill in the blanks with the appropriate terms.

(a) Correlation is concerned with the relationship between two related and ____________ variables.

(b) The coefficient of _____________ is that fraction of the total variation of Y which is explained by the regression line.

2. State whether true or false.

(a) The word correlation refers to the relationship or interdependence between two variables.

(b) Correlation can either be positive or negative.

6.3 Coefficient of Correlation

The coefficient of correlation, symbolically denoted by ‘r’, is another important measure to describe how well one variable is explained by another. It measures the degree of relationship between the two causally related variables. The value of this coefficient can never be more than +1 or less than –1. Thus, +1 and –1 are the limits of this coefficient. For a unit change in the independent variable, if there happens to be a constant change in the dependent variable in the same direction, then the value of the coefficient will be +1, indicating perfect positive correlation; but if such a change occurs in the opposite direction, the value of the coefficient will be –1, indicating perfect negative correlation. In practical life, the possibility of obtaining either a perfect positive or a perfect negative correlation is very remote, particularly in respect of phenomena concerning the social sciences. If the coefficient of correlation has a zero value, then there exists no correlation between the variables under study.

There are several methods of finding the coefficient of correlation but the following ones are considered important:

(i) Coefficient of Correlation by the Method of Least Squares.

(ii) Coefficient of Correlation using Simple Regression Coefficients.

(iii) Coefficient of Correlation through Product Moment Method or Karl Pearson’s Coefficient of Correlation.

Whichever of the above-mentioned three methods we adopt, we get the same value of r.

(i) Coefficient of Correlation by the Method of Least Squares

Under this method, first of all, the estimating equation is obtained using the least squares method of simple regression analysis. The equation is worked out as,

$$\hat{Y} = a + bX_i$$

Total variation $= \sum (Y - \bar{Y})^2$

Unexplained variation $= \sum (Y - \hat{Y})^2$

Explained variation $= \sum (\hat{Y} - \bar{Y})^2$

Then, by applying the following formulae, we can find the value of the coefficient of correlation:

$$r = \sqrt{r^2} = \sqrt{\frac{\text{Explained variation}}{\text{Total variation}}} = \sqrt{1 - \frac{\text{Unexplained variation}}{\text{Total variation}}} = \sqrt{1 - \frac{\sum (Y - \hat{Y})^2}{\sum (Y - \bar{Y})^2}}$$

This clearly shows that the coefficient of correlation happens to be the square root of the coefficient of determination.

The short-cut formula for finding the value of ‘r’ by the method of least squares can be written as follows:

$$r = \sqrt{\frac{a\sum Y + b\sum XY - n\bar{Y}^2}{\sum Y^2 - n\bar{Y}^2}}$$

Where, a = Y-intercept

b = Slope of the estimating equation

X = Values of the independent variable

Y = Values of the dependent variable

$\bar{Y}$ = Mean of the observed values of Y

n = Number of items in the sample (i.e., pairs of observed data)

The plus (+) or the minus (–) sign of the coefficient of correlation worked out by the method of least squares is related to the sign of ‘b’ in the estimating equation, viz. $\hat{Y} = a + bX_i$. If ‘b’ has a minus sign, the sign of ‘r’ will also be minus but if ‘b’ has a plus sign, then the sign of ‘r’ will also be plus. The value of ‘r’ indicates the degree along with the direction of the relationship between the two variables X and Y.

(ii) Coefficient of Correlation using Simple Regression Coefficients

Under this method, the estimating equation of Y and the estimating equation of X are worked out using the method of least squares. From these estimating equations we find the regression coefficient of X on Y, i.e., the slope of the estimating equation of X (symbolically written as $b_{XY}$), and this happens to be equal to $r\,\dfrac{\sigma_X}{\sigma_Y}$. Similarly, we find the regression coefficient of Y on X, i.e., the slope of the estimating equation of Y (symbolically written as $b_{YX}$), and this happens to be equal to $r\,\dfrac{\sigma_Y}{\sigma_X}$. For finding ‘r’, the square root of the product of these two regression coefficients is worked out as follows:

$$r = \sqrt{b_{XY} \cdot b_{YX}} = \sqrt{r\,\frac{\sigma_X}{\sigma_Y} \cdot r\,\frac{\sigma_Y}{\sigma_X}} = \sqrt{r^2} = r$$

As stated earlier, the sign of ‘r’ will depend upon the sign of the regression coefficients. If they have a minus sign, then ‘r’ will take a minus sign, but the sign of ‘r’ will be positive if the regression coefficients have positive signs.

6.3.1 Karl Pearson’s Coefficient

Karl Pearson’s method is the most widely used method of measuring the relationship between two variables. This coefficient is based on the following assumptions:

(i) There is a linear relationship between the two variables, which means that a straight line would be obtained if the observed data are plotted on a graph.

(ii) The two variables are causally related, which means that one of the variables is independent and the other one is dependent.

(iii) A large number of independent causes are operating in both the variables so as to produce a normal distribution.

According to Karl Pearson, ‘r’ can be worked out as under:

$$r = \frac{\sum xy}{n\,\sigma_X\,\sigma_Y}$$

Where, $x = (X - \bar{X})$

$y = (Y - \bar{Y})$

$\sigma_X$ = Standard deviation of the X series, equal to $\sqrt{\dfrac{\sum x^2}{n}}$

$\sigma_Y$ = Standard deviation of the Y series, equal to $\sqrt{\dfrac{\sum y^2}{n}}$

n = Number of pairs of X and Y observed.

A short-cut formula, known as the Product Moment Formula, can be derived from the above stated formula as under:

$$r = \frac{\sum xy}{n\,\sigma_X\,\sigma_Y} = \frac{\sum xy}{n\sqrt{\dfrac{\sum x^2}{n}}\sqrt{\dfrac{\sum y^2}{n}}} = \frac{\sum xy}{\sqrt{\sum x^2 \sum y^2}}$$

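The above formulae are based on obtaining the true means (viz. $\bar{X}$ and $\bar{Y}$) first and then doing all other calculations. This happens to be a tedious task, particularly if the true means are in fractions. To avoid difficult calculations, we make use of assumed means in taking out deviations and doing the related calculations. In such a situation, we can use the following formula for finding the value of ‘r’:

(i) In case of ungrouped data:

$$r = \frac{\dfrac{\sum dX \cdot dY}{n} - \dfrac{\sum dX}{n} \cdot \dfrac{\sum dY}{n}}{\sqrt{\dfrac{\sum dX^2}{n} - \left(\dfrac{\sum dX}{n}\right)^2}\,\sqrt{\dfrac{\sum dY^2}{n} - \left(\dfrac{\sum dY}{n}\right)^2}}$$

$$= \frac{\sum dX \cdot dY - \dfrac{\sum dX \sum dY}{n}}{\sqrt{\sum dX^2 - \dfrac{(\sum dX)^2}{n}}\,\sqrt{\sum dY^2 - \dfrac{(\sum dY)^2}{n}}}$$

Where, $\sum dX = \sum (X - X_A)$, with $X_A$ = Assumed average of X

$\sum dY = \sum (Y - Y_A)$, with $Y_A$ = Assumed average of Y

$\sum dX^2 = \sum (X - X_A)^2$

$\sum dY^2 = \sum (Y - Y_A)^2$

$\sum dX \cdot dY = \sum (X - X_A)(Y - Y_A)$

n = Number of pairs of observations of X and Y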


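(ii) In case of grouped data:

$$r = \frac{\dfrac{\sum f\,dX \cdot dY}{n} - \dfrac{\sum f\,dX}{n} \cdot \dfrac{\sum f\,dY}{n}}{\sqrt{\dfrac{\sum f\,dX^2}{n} - \left(\dfrac{\sum f\,dX}{n}\right)^2}\,\sqrt{\dfrac{\sum f\,dY^2}{n} - \left(\dfrac{\sum f\,dY}{n}\right)^2}}$$

or

$$r = \frac{\sum f\,dX \cdot dY - \dfrac{\sum f\,dX \sum f\,dY}{n}}{\sqrt{\sum f\,dX^2 - \dfrac{(\sum f\,dX)^2}{n}}\,\sqrt{\sum f\,dY^2 - \dfrac{(\sum f\,dY)^2}{n}}}$$

Where, $\sum f\,dX \cdot dY = \sum f (X - X_A)(Y - Y_A)$

$\sum f\,dX = \sum f (X - X_A)$

$\sum f\,dY = \sum f (Y - Y_A)$

$\sum f\,dY^2 = \sum f (Y - Y_A)^2$

$\sum f\,dX^2 = \sum f (X - X_A)^2$

n = Number of pairs of observations of X and Y.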

6.3.2 Probable Error (P.E.) of the Coefficient of Correlation

The Probable Error (P.E.) of r is very useful in interpreting the value of r and is worked out as under for Karl Pearson’s coefficient of correlation:

$$\text{P.E.} = 0.6745\,\frac{1 - r^2}{\sqrt{n}}$$

If r is less than its P.E., it is not at all significant. If r is more than its P.E., there is correlation. If r is more than 6 times its P.E. and its numerical value is greater than 0.5, then it is considered significant.
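The rule of thumb above can be written as a small helper. The sketch below is an illustration only: the function names `probable_error` and `interpret` are hypothetical, and the trial values r = 0.95 with n = 9 pairs are the ones that arise in Example 6.2.

```python
import math

def probable_error(r, n):
    """Probable error of Karl Pearson's r for n pairs of observations."""
    return 0.6745 * (1 - r ** 2) / math.sqrt(n)

def interpret(r, n):
    """Apply the P.E. rule of thumb to the numerical value of r."""
    pe = probable_error(r, n)
    if abs(r) < pe:
        return "not significant"
    if abs(r) > 6 * pe and abs(r) > 0.5:
        return "significant"
    return "some correlation"

pe = probable_error(0.95, 9)
print(f"P.E. = {pe:.4f}")
print(interpret(0.95, 9))
```

Using `abs(r)` lets the same rule be applied to negative coefficients, which is an interpretation of the phrase "greater than ± 0.5" in the text.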

Example 6.2: From the following data calculate ‘r’ between X and Y applying the following three methods:

(i) The method of least squares.

(ii) The method based on regression coefficients.

(iii) The product moment method of Karl Pearson.

Verify the obtained result of any one method with that of another.

X   1   2   3    4    5    6    7    8    9
Y   9   8   10   12   11   13   14   16   15

Solution: Let us develop the following table for calculating the value of ‘r’:

X    Y     $X^2$   $Y^2$   XY
1    9     1       81      9
2    8     4       64      16
3    10    9       100     30
4    12    16      144     48
5    11    25      121     55
6    13    36      169     78
7    14    49      196     98
8    16    64      256     128
9    15    81      225     135

n = 9, $\sum X = 45$, $\sum Y = 108$, $\sum X^2 = 285$, $\sum Y^2 = 1356$, $\sum XY = 597$

$\bar{X} = 5$; $\bar{Y} = 12$

(i) Coefficient of correlation by the method of least squares is worked out as follows:

First of all find out the estimating equation,

$$\hat{Y} = a + bX_i$$

Where,

$$b = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2} = \frac{597 - 9(5)(12)}{285 - 9(25)} = \frac{597 - 540}{285 - 225} = \frac{57}{60} = 0.95$$

and

$$a = \bar{Y} - b\bar{X} = 12 - 0.95(5) = 12 - 4.75 = 7.25$$

Hence, $\hat{Y} = 7.25 + 0.95X_i$

Now ‘r’ can be worked out as under by the method of least squares,

$$r = \sqrt{1 - \frac{\text{Unexplained variation}}{\text{Total variation}}} = \sqrt{1 - \frac{\sum (Y - \hat{Y})^2}{\sum (Y - \bar{Y})^2}} = \sqrt{\frac{\sum (\hat{Y} - \bar{Y})^2}{\sum (Y - \bar{Y})^2}} = \sqrt{\frac{a\sum Y + b\sum XY - n\bar{Y}^2}{\sum Y^2 - n\bar{Y}^2}}$$

The last expression is the short-cut formula, which gives

$$r = \sqrt{\frac{7.25(108) + 0.95(597) - 9(12)^2}{1356 - 9(12)^2}} = \sqrt{\frac{783 + 567.15 - 1296}{1356 - 1296}} = \sqrt{\frac{54.15}{60}} = \sqrt{0.9025} = 0.95$$

(ii) Coefficient of correlation by the method based on regression coefficients is worked out as follows:

Regression coefficient of Y on X,

$$b_{YX} = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2} = \frac{597 - 9(5)(12)}{285 - 9(5)^2} = \frac{597 - 540}{285 - 225} = \frac{57}{60}$$

Regression coefficient of X on Y,

$$b_{XY} = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum Y^2 - n\bar{Y}^2} = \frac{597 - 9(5)(12)}{1356 - 9(12)^2} = \frac{597 - 540}{1356 - 1296} = \frac{57}{60}$$

Hence,

$$r = \sqrt{b_{YX} \cdot b_{XY}} = \sqrt{\frac{57}{60} \times \frac{57}{60}} = \frac{57}{60} = 0.95$$

(iii) Coefficient of correlation by the product moment method of Karl Pearson is worked out as under:

$$r = \frac{\sum XY - n\bar{X}\bar{Y}}{\sqrt{\sum X^2 - n\bar{X}^2}\,\sqrt{\sum Y^2 - n\bar{Y}^2}} = \frac{597 - 9(5)(12)}{\sqrt{285 - 9(5)^2}\,\sqrt{1356 - 9(12)^2}}$$

$$= \frac{597 - 540}{\sqrt{285 - 225}\,\sqrt{1356 - 1296}} = \frac{57}{\sqrt{60}\,\sqrt{60}} = \frac{57}{60} = 0.95$$

Hence, we get the value of r = 0.95. We get the same value applying the other two methods also. Therefore, whichever method we apply, the results will be the same.
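The agreement of the three methods in Example 6.2 can also be confirmed numerically. The following Python sketch recomputes r by each route from the raw data; the variable names are illustrative, and the sign of the square root is taken from the slope, as the text describes.

```python
import math

# Data from Example 6.2.
X = [1, 2, 3, 4, 5, 6, 7, 8, 9]
Y = [9, 8, 10, 12, 11, 13, 14, 16, 15]

n = len(X)
sum_y = sum(Y)
sum_xy = sum(x * y for x, y in zip(X, Y))
mean_x = sum(X) / n
mean_y = sum_y / n

# Corrected sums used by all three methods.
s_xy = sum_xy - n * mean_x * mean_y               # 597 - 540
s_xx = sum(x * x for x in X) - n * mean_x ** 2    # 285 - 225
s_yy = sum(y * y for y in Y) - n * mean_y ** 2    # 1356 - 1296

# (i) Method of least squares, using the short-cut formula.
b = s_xy / s_xx
a = mean_y - b * mean_x
r_least_squares = math.copysign(
    math.sqrt((a * sum_y + b * sum_xy - n * mean_y ** 2) / s_yy), b
)

# (ii) Square root of the product of the two regression coefficients.
b_yx = s_xy / s_xx
b_xy = s_xy / s_yy
r_regression = math.copysign(math.sqrt(b_yx * b_xy), b_yx)

# (iii) Product moment method of Karl Pearson.
r_pearson = s_xy / math.sqrt(s_xx * s_yy)

print(round(r_least_squares, 4), round(r_regression, 4), round(r_pearson, 4))
```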

6.3.3 Some Other Measures

Two other measures are often talked about along with the coefficients of determination and correlation. These are as follows:

(i) Coefficient of Nondetermination. Instead of using the coefficient of determination, sometimes the coefficient of nondetermination is used. The coefficient of nondetermination (denoted by $k^2$) is the ratio of unexplained variation to total variation in the Y variable related to the X variable. Algebraically, we can write it as follows:

$$k^2 = \frac{\text{Unexplained variation}}{\text{Total variation}} = \frac{\sum (Y - \hat{Y})^2}{\sum (Y - \bar{Y})^2}$$

Concerning the data of Example 6.1 of this unit, the coefficient of nondetermination will be calculated as follows:

$$k^2 = \frac{260.54}{2526.10} = 0.103$$

The value of $k^2$ shows that about 10% of the variation in consumption expenditure remains unexplained by the regression equation we had worked out, viz. $\hat{Y} = 14.000 + 0.616X_i$. In simple terms, this means that variables other than X are responsible for 10% of the variations in the dependent variable Y in the given case.

The coefficient of nondetermination can as well be worked out as under:

$$k^2 = 1 - r^2$$

Accordingly, for Example 6.1 it will be equal to 1 – 0.897 = 0.103.

Note: Always remember that $r^2 + k^2 = 1$.

(ii) Coefficient of Alienation. Based on $k^2$, we can work out one more measure, namely the coefficient of alienation, symbolically written as ‘k’. Thus, the coefficient of alienation is $k = \sqrt{k^2}$.

Unlike $r^2 + k^2 = 1$, the sum of ‘r’ and ‘k’ will not be equal to 1 unless one of the two coefficients is 1, in which case the remaining coefficient must be zero. In all other cases, the numerical value of ‘r’ plus ‘k’ is greater than 1. The coefficient of alienation is not a popular measure from a practical point of view and is used very rarely.
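A quick numerical check of these relationships, using the value $r^2 = 0.897$ from Example 6.1, is sketched below. It is an illustration only and the variable names are chosen here.

```python
import math

r_squared = 0.897            # coefficient of determination from Example 6.1
k_squared = 1 - r_squared    # coefficient of nondetermination

r = math.sqrt(r_squared)     # coefficient of correlation (positive slope in Example 6.1)
k = math.sqrt(k_squared)     # coefficient of alienation

print(f"k^2 = {k_squared:.3f}")
print(f"r = {r:.3f}, k = {k:.3f}, r + k = {r + k:.3f}")
```

The squares add up to 1 by construction, while the sum of r and k comes out above 1, as the text states for values strictly between 0 and 1.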

Activity 2

Two random variables have the regression equations,

3X + 2Y – 26 = 0

6X + Y – 31 = 0

Find the mean value of X as well as of Y and the correlation coefficient between X and Y. If the variance of X is 25, find $\sigma_Y$ from the data given above.

Self-Assessment Questions

3. State whether true or false.

(a) The value of this coefficient can never be more than +1 or less than –1.

(b) Coefficient of determination (denoted by $k^2$) is the ratio of unexplained variation to total variation in the Y variable related to the X variable.

4. Fill in the blanks with the appropriate terms.

(a) The coefficient of correlation, symbolically denoted by ‘r’, measures the degree of relationship between the two _____________ related variables.

(b) If r is less than its probable error (P.E.), it is not at all significant but if r is more than P.E., there is _______________.


6.4 Spearman’s Rank Correlation

If observations on two variables are given in the form of ranks and not as numerical values, it is possible to compute what is known as rank correlation between the two series.

The rank correlation, written as $\rho$, is a descriptive index of agreement between ranks over individuals. It is the same as the ordinary coefficient of correlation computed on ranks, but its formula is simpler.

$$\rho = 1 - \frac{6\sum D_i^2}{n(n^2 - 1)}$$

Here, n is the number of observations and $D_i$ is the positive difference between the ranks associated with individual i.

Like r, the rank correlation lies between –1 and +1.

Example 6.3: The ranks given by two judges to 10 individuals are as follows:

Individual   Rank given by Judge I (x)   Rank given by Judge II (y)   D = |x – y|   $D^2$
1            1                           7                            6             36
2            2                           5                            3             9
3            7                           8                            1             1
4            9                           10                           1             1
5            8                           9                            1             1
6            6                           4                            2             4
7            4                           1                            3             9
8            3                           6                            3             9
9            10                          3                            7             49
10           5                           2                            3             9

$\sum D^2 = 128$

Solution: The rank correlation is given by,

$$\rho = 1 - \frac{6\sum D^2}{n^3 - n} = 1 - \frac{6 \times 128}{10^3 - 10} = 1 - 0.776 = 0.224$$

The value of $\rho$ = 0.224 shows that the agreement between the judges is not high.


Example 6.4: Consider Example 6.3 to compute r and then compare.

Solution: The simple coefficient of correlation r for the previous data is calculated as follows:

x     y     $x^2$   $y^2$   xy
1     7     1       49      7
2     5     4       25      10
7     8     49      64      56
9     10    81      100     90
8     9     64      81      72
6     4     36      16      24
4     1     16      1       4
3     6     9       36      18
10    3     100     9       30
5     2     25      4       10

$\sum x = 55$, $\sum y = 55$, $\sum x^2 = 385$, $\sum y^2 = 385$, $\sum xy = 321$

$$r = \frac{321 - 10\left(\dfrac{55}{10}\right)\left(\dfrac{55}{10}\right)}{\sqrt{385 - 10\left(\dfrac{55}{10}\right)^2}\,\sqrt{385 - 10\left(\dfrac{55}{10}\right)^2}} = \frac{18.5}{\sqrt{82.5 \times 82.5}} = \frac{18.5}{82.5} = 0.224$$

This shows that the Spearman $\rho$ for any two sets of ranks is the same as the Pearson r for the set of ranks. But it is much easier to compute $\rho$.
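The equality of the two coefficients on untied ranks can be checked with the sketch below, which uses the ranks of the two judges from Example 6.3. The variable names are illustrative.

```python
import math

# Ranks given by the two judges in Example 6.3.
judge_1 = [1, 2, 7, 9, 8, 6, 4, 3, 10, 5]
judge_2 = [7, 5, 8, 10, 9, 4, 1, 6, 3, 2]
n = len(judge_1)

# Spearman's rank correlation from the squared rank differences.
sum_d2 = sum((x - y) ** 2 for x, y in zip(judge_1, judge_2))
rho = 1 - 6 * sum_d2 / (n ** 3 - n)

# Pearson's r computed directly on the same ranks.
mean_1 = sum(judge_1) / n
mean_2 = sum(judge_2) / n
s_xy = sum(x * y for x, y in zip(judge_1, judge_2)) - n * mean_1 * mean_2
s_xx = sum(x * x for x in judge_1) - n * mean_1 ** 2
s_yy = sum(y * y for y in judge_2) - n * mean_2 ** 2
pearson_r = s_xy / math.sqrt(s_xx * s_yy)

print(f"rho = {rho:.3f}, Pearson r on ranks = {pearson_r:.3f}")
```

The two values agree because, when there are no ties, both sets of ranks are permutations of 1 to n and the Spearman formula is an algebraic simplification of the Pearson formula.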

Often, the ranks are not given. Instead, the numerical values of observations are given. In such a case, we must attach the ranks to these values to calculate $\rho$.

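Example 6.5: From the following table, compute the coefficient of correlation between age of husbands and age of wives:

Age of                          Age of wives
husbands    15–25   25–35   35–45   45–55   55–65   65–75   Total
15–25       1       1       –       –       –       –       2
25–35       2       12      1       –       –       –       15
35–45       –       4       10      1       –       –       15
45–55       –       –       3       6       1       –       10
55–65       –       –       –       2       4       2       8
65–75       –       –       –       –       1       2       3
Total       3       17      14      9       6       4       53

Solution: Let $d_x$ and $d_y$ be the step deviations of the class midpoints of the wives’ and husbands’ ages respectively, measured from an assumed mean of 40 in units of the class width 10. The table then gives N = 53, $\sum f d_x d_y = 86$, $\sum f d_x = 10$, $\sum f d_y = 16$, $\sum f d_x^2 = 98$ and $\sum f d_y^2 = 92$, so that

$$r = \frac{N\sum f d_x d_y - \sum f d_x \sum f d_y}{\sqrt{N\sum f d_x^2 - \left(\sum f d_x\right)^2}\,\sqrt{N\sum f d_y^2 - \left(\sum f d_y\right)^2}}$$

$$= \frac{53 \times 86 - 10 \times 16}{\sqrt{53 \times 98 - 10^2}\,\sqrt{53 \times 92 - 16^2}} = 0.907$$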
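The sums used in Example 6.5 can be generated from the two-way table with the sketch below. It assumes step deviations of the class midpoints from an assumed mean of 40 in units of the class width 10, which is one choice that reproduces the figures in the worked formula; the variable names are illustrative.

```python
import math

# Bivariate frequency table of Example 6.5.
# Rows: age of husbands; columns: age of wives; classes 15-25, ..., 65-75.
table = [
    [1, 1, 0, 0, 0, 0],
    [2, 12, 1, 0, 0, 0],
    [0, 4, 10, 1, 0, 0],
    [0, 0, 3, 6, 1, 0],
    [0, 0, 0, 2, 4, 2],
    [0, 0, 0, 0, 1, 2],
]

# Step deviations of the midpoints 20, 30, ..., 70 from the assumed mean 40.
d = [-2, -1, 0, 1, 2, 3]

cells = [(i, j, f) for i, row in enumerate(table) for j, f in enumerate(row)]

N = sum(f for _, _, f in cells)
sum_fdx = sum(f * d[j] for _, j, f in cells)             # wives (columns)
sum_fdy = sum(f * d[i] for i, _, f in cells)             # husbands (rows)
sum_fdx2 = sum(f * d[j] ** 2 for _, j, f in cells)
sum_fdy2 = sum(f * d[i] ** 2 for i, _, f in cells)
sum_fdxdy = sum(f * d[i] * d[j] for i, j, f in cells)

r = (N * sum_fdxdy - sum_fdx * sum_fdy) / (
    math.sqrt(N * sum_fdx2 - sum_fdx ** 2) * math.sqrt(N * sum_fdy2 - sum_fdy ** 2)
)

print(f"r = {r:.3f}")
```

Because r is unaffected by a change of origin and scale, any other assumed mean or step size would give the same coefficient.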

Example 6.6: If the covariance between the X and Y variables is 10 and the variances of X and Y are 16 and 9 respectively, find the coefficient of correlation.

Solution:

Covariance of X and Y, $\mu_{11} = \dfrac{\sum xy}{N} = 10$

Variance of X, $\sigma_x^2 = 16$, so $\sigma_x = 4$

Variance of Y, $\sigma_y^2 = 9$, so $\sigma_y = 3$

Thus,

$$r = \frac{\sum xy}{N\,\sigma_x\,\sigma_y} = \frac{\mu_{11}}{\sigma_x \cdot \sigma_y} = \frac{10}{4 \times 3} = 0.833$$


Example 6.7: The marks of 8 candidates in Mathematics and English are given below:

Mathematics   76   90   98   69   54   82   67   52
English       25   37   56   12   7    36   23   11

Calculate the rank coefficient of correlation.

Solution:

Marks in      Marks in   Rank in              Rank in          Rank difference        $D^2$
Mathematics   English    Mathematics ($R_1$)  English ($R_2$)  (D) = ($R_1$ – $R_2$)
76            25         4                    4                0                      0
90            37         2                    2                0                      0
98            56         1                    1                0                      0
69            12         5                    6                –1                     1
54            7          7                    8                –1                     1
82            36         3                    3                0                      0
67            23         6                    5                +1                     1
52            11         8                    7                +1                     1

Total                                                          $\sum D = 0$           $\sum D^2 = 4$

Here, N = 8. Rank correlation coefficient,

$$R = 1 - \frac{6\sum D^2}{N^3 - N} = 1 - \frac{6(4)}{8^3 - 8} = 0.952$$

Example 6.8: Compute rank correlation coefficient from the following data ofmarks obtained by eight students in the papers of Physics and Mathematics:

Marks in Physics 15 20 27 13 45 60 20 75

Marks in Mathematics 50 30 55 30 25 10 30 70

Page 166: BCC104 Business Statistics

Business Statistics Unit 6

Sikkim Manipal University Page No. 160

Solution:

Marks in Marks in Rank in Rank of Difference D2

Physics Mathematics Physics Mathematics (D) in Ranks

15 50 7 3 4 16

20 305 6

5.52

4 5 65

3

0.5 0.25

27 55 4 2 2 4

13 30 84 5 6

53

3 9

45 25 3 7 –4 1660 10 2 8 –6 36

20 305 6

5.52

4 5 6

53

0.5 0.25

75 70 1 1 0 0

Total D2 = 81.5

In this example, two students have secured equal marks, viz., 20 in Physics, so the ranks awarded to them are the arithmetic mean of the ranks that they would have got (viz., 5 and 6) had they differed at least by a small number. The rank awarded to each is therefore (5 + 6)/2 = 5.5.

Similarly, the three students who got equal marks (30 each) in Mathematics were each accorded the rank (4 + 5 + 6)/3 = 5.

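Now, with m = 2 tied values in Physics and n = 3 tied values in Mathematics,

$$ R = 1 - \frac{6\left[\sum D^2 + \frac{m^3 - m}{12} + \frac{n^3 - n}{12}\right]}{N^3 - N} $$

$$ = 1 - \frac{6\left[81.5 + \frac{2^3 - 2}{12} + \frac{3^3 - 3}{12}\right]}{8^3 - 8} = 1 - \frac{6 \times 84}{504} = 0 $$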
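A minimal Python sketch of the tie-adjusted calculation follows. It assumes tied values share the mean of the ranks they occupy and that a correction of (m³ – m)/12 is added for each group of m tied values, as in the formula above; the function names are illustrative only.

```python
def average_ranks(values):
    """Rank values with the largest as rank 1; tied values share the mean rank."""
    ordered = sorted(values, reverse=True)
    ranks = []
    for v in values:
        first = ordered.index(v) + 1           # best rank occupied by the tied group
        count = ordered.count(v)               # size of the tied group
        ranks.append(first + (count - 1) / 2)  # mean of the ranks occupied
    return ranks

def tie_correction(values):
    """Sum of (m^3 - m) / 12 over every group of m equal values."""
    return sum((values.count(v) ** 3 - values.count(v)) / 12 for v in set(values))

def spearman_with_ties(x, y):
    n = len(x)
    rx, ry = average_ranks(x), average_ranks(y)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    adjusted = sum_d2 + tie_correction(x) + tie_correction(y)
    return 1 - 6 * adjusted / (n ** 3 - n)

physics = [15, 20, 27, 13, 45, 60, 20, 75]
mathematics = [50, 30, 55, 30, 25, 10, 30, 70]
rho = spearman_with_ties(physics, mathematics)
print(rho)  # 0.0
```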

Example 6.9: Ten competitors in a beauty contest are ranked by three judges in the following order:

1st Judge 1 6 5 10 3 2 4 9 7 8

2nd Judge 3 5 8 4 7 10 2 1 6 9

3rd Judge 6 4 9 8 1 2 3 10 5 7


Use the rank correlation coefficient to determine which pair of judges has the nearest approach to common tastes in beauty.

R1   R2   R3   (R1 – R2)² = D²   (R2 – R3)² = D²   (R1 – R3)² = D²

 1    3    6          4                 9                25
 6    5    4          1                 1                 4
 5    8    9          9                 1                16
10    4    8         36                16                 4
 3    7    1         16                36                 4
 2   10    2         64                64                 0
 4    2    3          4                 1                 1
 9    1   10         64                81                 1
 7    6    5          1                 1                 4
 8    9    7          1                 4                 1

N = 10 for each judge; ΣD² = 200, ΣD² = 214 and ΣD² = 60 respectively

Rank correlation between the judgements of the 1st and 2nd judges:

$$ R_{12} = 1 - \frac{6\sum D^2}{N^3 - N} = 1 - \frac{6 \times 200}{10^3 - 10} = -0.212 $$

Rank correlation between the judgements of the 2nd and 3rd judges:

$$ R_{23} = 1 - \frac{6\sum D^2}{N^3 - N} = 1 - \frac{6 \times 214}{10^3 - 10} = -0.297 $$

Rank correlation between the judgements of the 1st and 3rd judges:

$$ R_{13} = 1 - \frac{6\sum D^2}{N^3 - N} = 1 - \frac{6 \times 60}{10^3 - 10} = +0.636 $$

Since the coefficient of rank correlation is maximum in the judgements of the first and third judges, we conclude that they have the nearest approach to common tastes in beauty.
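The pairwise comparison of the three judges can be sketched in Python as follows. The dictionary keys and the function name `rank_correlation` are illustrative choices, not part of the text.

```python
from itertools import combinations

def rank_correlation(r1, r2):
    """R = 1 - 6 * sum(D^2) / (N^3 - N) for two sets of untied ranks."""
    n = len(r1)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
    return 1 - 6 * sum_d2 / (n ** 3 - n)

judges = {
    "1st": [1, 6, 5, 10, 3, 2, 4, 9, 7, 8],
    "2nd": [3, 5, 8, 4, 7, 10, 2, 1, 6, 9],
    "3rd": [6, 4, 9, 8, 1, 2, 3, 10, 5, 7],
}

# Rank correlation for every pair of judges.
results = {
    (a, b): rank_correlation(judges[a], judges[b])
    for a, b in combinations(judges, 2)
}
for pair, value in results.items():
    print(pair, round(value, 3))

# The pair with the largest coefficient agrees most closely.
closest = max(results, key=results.get)
print(closest)  # ('1st', '3rd')
```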

Self-Assessment Questions

5. Fill in the blanks with the appropriate terms.

(a) If observations on two variables are given in the form of ranks and not as __________ values, then it is possible to compute rank correlation between the two series.


(b) The _____________________ for any two sets of ranks is the same as the Pearson r for the set of ranks.

6. State whether true or false.

(a) The rank correlation, written as ρ, is a descriptive index of agreement between ranks over individuals.

(b) Like r, the rank correlation lies between –1 and +1.

6.5 Summary

Let us recapitulate the important concepts discussed in this unit:

Correlation analysis is the statistical tool generally used to describe the degree to which one variable is related to another.

The theory by means of which quantitative connections between two sets of phenomena are determined is called the ‘Theory of Correlation’.

Correlation can either be positive or it can be negative.

The coefficient of determination can have a value ranging from zero to one. The value of one can occur only if the unexplained variation is zero, which simply means that all the data points in the scatter diagram fall exactly on the regression line.

The coefficient of correlation, symbolically denoted by ‘r’, is another important measure to describe how well one variable is explained by another. It measures the degree of relationship between the two causally related variables. The value of this coefficient can never be more than +1 or less than –1.

Karl Pearson’s method is the most widely used method of measuring the relationship between two variables.

If r is less than its P.E., it is not at all significant. If r is more than P.E., there is correlation.

If observations on two variables are given in the form of ranks and not as numerical values, it is possible to compute what is known as rank correlation between the two series.


6.6 Glossary

Correlation analysis: A statistical tool used to describe the degree to which one variable is related to another.

Coefficient of determination: A measure of the degree of linear association or correlation between two variables, one of which must be an independent variable and the other, a dependent variable.

Coefficient of correlation: It is symbolically denoted by ‘r’ and is an important measure to describe how well one variable is explained by another. It measures the degree of relationship between the two causally related variables.

6.7 Terminal Questions

1. What is the importance of correlation analysis?

2. How will you determine the coefficient of determination?

3. Explain the method to calculate the coefficient of correlation using simple regression coefficient.

4. Describe Karl Pearson’s method of measuring coefficient of correlation.

5. What is the relationship between coefficient of nondetermination and coefficient of alienation?

6. What is Spearman’s rank correlation?

6.8 Answers

Answers to Self-Assessment Questions

1. (a) Quantifiable; (b) Determination

2. (a) True; (b) True

3. (a) True; (b) False

4. (a) Causally; (b) Correlation

5. (a) Numerical; (b) Spearman

6. (a) True; (b) True


Answers to Terminal Questions

1. Refer Section 6.2

2. Refer Sections 6.2.1 and 6.2.2

3. Refer Section 6.3

4. Refer Section 6.3.1

5. Refer Section 6.3.3

6. Refer Section 6.4

6.9 Further Reading

1. Elhance, D.N. Fundamentals of Statistics. Allahabad: Kitab Mahal, 2007.

2. Gupta, S.C. Fundamentals of Business Statistics. New Delhi: Sultan Chand & Sons, 2010.

Endnotes

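1. Remember the short-cut formulae to work out $b_{XY}$ and $b_{YX}$:

$$ b_{XY} = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum Y^2 - n\bar{Y}^2} \quad \text{and} \quad b_{YX} = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2} $$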

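2. In case we take the assumed mean to be zero for the X variable as well as for the Y variable, then our formula will be as under:

$$ r = \frac{\frac{\sum XY}{n} - \left(\frac{\sum X}{n}\right)\left(\frac{\sum Y}{n}\right)}{\sqrt{\frac{\sum X^2}{n} - \left(\frac{\sum X}{n}\right)^2}\;\sqrt{\frac{\sum Y^2}{n} - \left(\frac{\sum Y}{n}\right)^2}} $$

or

$$ r = \frac{\frac{\sum XY}{n} - \bar{X}\bar{Y}}{\sqrt{\frac{\sum X^2}{n} - \bar{X}^2}\;\sqrt{\frac{\sum Y^2}{n} - \bar{Y}^2}} $$

$$ r = \frac{\sum XY - n\bar{X}\bar{Y}}{\sqrt{\sum X^2 - n\bar{X}^2}\;\sqrt{\sum Y^2 - n\bar{Y}^2}} $$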


Unit 7 Regression

Structure

7.1 Introduction
    Objectives
7.2 Regression Analysis
7.3 Simple Linear Regression Model
7.4 Summary
7.5 Glossary
7.6 Terminal Questions
7.7 Answers
7.8 Further Reading

7.1 Introduction

In the previous unit, you learnt about correlation, a technique that examines the relationship between variables and measures how closely they are associated.

In this unit, you will learn about regression analysis. Regression is a statistical measure that determines the strength of the relationship between a dependent variable (the variable to be predicted) and one or more independent variables (the variables on which the prediction is based). It is a commonly used tool in forecasting and financial analysis. For instance, suppose you want to forecast sales for your company and it is seen that your company’s sales go up and down depending on changes in GDP. The sales you are forecasting would be the dependent variable because their value depends on the value of GDP, which, in turn, would be the independent variable. You would then need to determine the strength of the relationship between these two variables in order to forecast sales. If GDP increases or decreases by 1%, how much will your sales increase or decrease? The regression equation is y = bx + a, where y is the dependent variable which we intend to forecast, x is the independent variable, b is the slope of the regression line and a is the y-intercept.

You can use this simple model to solve your business problems. If your research leads you to believe that the next GDP change will be a certain percentage, you can plug that percentage into the model and generate a sales forecast. This can help you develop a more objective plan and budget for the upcoming year. You will also learn about the scatter diagram, the least squares method and the standard error of estimate.


Objectives

After studying this unit, you should be able to:

Describe how assumptions are made in regression analysis

Explain simple linear regression model

Define scatter diagram method and least square method

Judge the accuracy of estimating equation

Compute and interpret standard error of the estimate

7.2 Regression Analysis

The term ‘regression’ was first used in 1877 by Sir Francis Galton, who made a study that showed that the height of children born to tall parents will tend to move back or ‘regress’ toward the mean height of the population. He designated the word regression as the name of the process of predicting one variable from another variable. Regression analysis is a statistical technique that attempts to explore and model the relationship between two or more variables. For example, an analyst may want to know if there is a relationship between road accidents and the age of the driver. If we find a correlation between these two, then it is possible to make use of this relationship in making estimates and to forecast the number of road accidents (dependent variable) on the basis of the age of the drivers (independent variable). Regression analysis forms an important part of the statistical analysis of the data obtained from designed experiments. The results of regression, along with the results from the analysis of variance, provide information that is useful to identify significant factors in an experiment and explore the nature of the relationship between these factors and the response. Similarly, an investigator may employ regression analysis to test a theory involving a cause and effect relationship. All this explains that regression analysis is an extremely useful tool, especially in problems of business and industry involving predictions.

7.2.1 Assumptions in Regression Analysis

While making use of the regression techniques for making predictions, it is always assumed that:

(a) There is an actual relationship between the dependent and independent variables.


(b) The values of the dependent variable are random but the values of the independent variable are fixed quantities without error and are chosen by the experimenter.

(c) There is a clear indication of the direction of the relationship. This means that the dependent variable is a function of the independent variable. (For example, when we say that advertising has an effect on sales, we are not saying that sales has an effect on advertising.)

(d) The conditions (that existed when the relationship between the dependent and independent variable was estimated by the regression) are the same when the regression model is being used. In other words, it simply means that the relationship has not changed since the regression equation was computed.

(e) The analysis is to be used to predict values within the range (and not for values outside the range) for which it is valid.

Activity 1

Construct a regression line for r = 1.00 and r = –1.00.

Self-Assessment Questions

1. Fill in the blanks with the appropriate terms.

(a) The values of the dependent variable are random but the values of the independent variable are fixed quantities without error and are chosen by the ________________.

(b) The conditions that existed when the relationship between the dependent and independent variable was estimated by the regression are the same when the ___________ model is being used.

2. State whether true or false.

(a) The regression analysis is to be used to predict values within the range (and not for values outside the range) for which it is valid.

(b) There is not an actual relationship between the dependent and independent variables.


7.3 Simple Linear Regression Model

In case of simple linear regression analysis, a single variable is used to predict another variable on the assumption of a linear relationship (i.e., a relationship of the type defined by Y = a + bX) between the given variables. The variable to be predicted is called the dependent variable and the variable on which the prediction is based is called the independent variable.

Simple linear regression model¹ (or the Regression Line) is stated as,

$Y_i = a + bX_i + e_i$

Where, $Y_i$ is the dependent variable

$X_i$ is the independent variable

$e_i$ is the unpredictable random element (usually called the residual or error term)

(a) a represents the Y-intercept, i.e., the intercept specifies the value of the dependent variable when the independent variable has a value of zero. (But this term has practical meaning only if a zero value for the independent variable is possible.)

(b) b is a constant, indicating the slope of the regression line. The slope of the line indicates the amount of change in the value of the dependent variable for a unit change in the independent variable.

If the two constants (viz., a and b) are known, the accuracy of our prediction of Y (denoted by $\hat{Y}$ and read as Y-hat) depends on the magnitude of the values of $e_i$. If, in the model, all the $e_i$ tend to have very large values, then the estimates will not be very good; but if these values are relatively small, then the predicted values ($\hat{Y}$) will tend to be close to the true values ($Y_i$).

Estimating the Intercept and Slope of the Regression Model (or Estimating the Regression Equation)

The two constants or the parameters, viz., ‘a’ and ‘b’, in the regression model for the entire population or universe are generally unknown and as such are estimated from sample information. The following are the two methods used for estimation:

(a) Scatter diagram method

(b) Least squares method


7.3.1 Scatter Diagram Method

This method makes use of the scatter diagram, also known as the dot diagram. Scatter diagram² is a diagram representing two series, with the known variable, i.e., the independent variable, plotted on the X-axis and the variable to be estimated, i.e., the dependent variable, plotted on the Y-axis on a graph paper (refer Figure 7.1). The following data are plotted:

Income X                Consumption Expenditure Y
(Hundreds of Rupees)    (Hundreds of Rupees)

 41                     44
 65                     60
 50                     39
 57                     51
 96                     80
 94                     68
110                     84
 30                     34
 79                     55
 65                     48

The scatter diagram by itself is not sufficient for predicting values of the dependent variable. Some formal expression of the relationship between the two variables is necessary for predictive purposes. For the purpose, one may simply take a ruler and draw a straight line through the points in the scatter diagram, and in this way determine the intercept and the slope of the said line. The line can then be defined as $\hat{Y} = a + bX_i$, with the help of which we can predict Y for a given value of X. But there are shortcomings in this approach. For example, if five different persons draw such a straight line in the same scatter diagram, it is possible that there may be five different estimates of a and b, especially when the dots are more dispersed in the diagram. Hence, the estimates cannot be worked out only through this approach. A more systematic and statistical method is required to estimate the constants of the predictive equation. The least squares method is used to draw the best fit line.


Figure 7.1 Scatter Diagram (X-axis: 0 to 120; Y-axis: Consumption Expenditure (’00 Rs), 0 to 120)

7.3.2 Least Square Method

The least squares method of fitting a line (the line of best fit or the regression line) through the scatter diagram is a method which minimizes the sum of the squared vertical deviations from the fitted line. In other words, the line to be fitted will pass through the points of the scatter diagram in such a way that the sum of the squares of the vertical deviations of these points from the line will be a minimum.

The meaning of the least squares criterion can be easily understood through reference to Figure 7.2, where the earlier scatter diagram has been reproduced along with a line which represents the least squares line fit to the data.

Figure 7.2 Scatter Diagram, Regression Line and Short Vertical Lines Representing ‘e’


In Figure 7.2, the vertical deviations of the individual points from the line are shown as the short vertical lines joining the points to the least squares line. These deviations will be denoted by the symbol ‘e’. The value of ‘e’ varies from one point to another. In some cases it is positive, while in others it is negative. If the line drawn happens to be the least squares line, then the value of $\sum e_i^2$ is the least possible. It is because of this feature that the method is known as the Least Squares Method.

Why we insist on minimizing the sum of squared deviations is a question that needs explanation. If we denote the deviation of the actual value Y from the estimated value $\hat{Y}$ as $(Y - \hat{Y})$ or $e_i$, it is logical that we want $\sum (Y - \hat{Y})$, or $\sum_{i=1}^{n} e_i$, to be as small as possible. However, merely examining $\sum (Y - \hat{Y})$, or $\sum_{i=1}^{n} e_i$, is inappropriate, since any $e_i$ can be positive or negative. Large positive values and large negative values could cancel one another. But large values of $e_i$, regardless of their sign, indicate a poor prediction. Even if we ignore the signs while working out $\sum_{i=1}^{n} |e_i|$, where

$$ |e_i| = \begin{cases} e_i & \text{if } e_i \ge 0 \\ -e_i & \text{if } e_i < 0 \end{cases} $$

the difficulties may continue. Hence, the standard procedure is to eliminate the effect of signs by squaring each observation. Squaring each term accomplishes two purposes, viz., (i) it magnifies (or penalizes) the larger errors, and (ii) it cancels the effect of the positive and negative values (since a negative error when squared becomes positive). Minimizing the sum of squared errors rather than the sum of the absolute values therefore favours a line with many small errors over one with a few large errors. Hence, in obtaining the regression line, we follow the approach that the sum of the squared deviations be a minimum, and on this basis work out the values of its constants, viz., ‘a’ and ‘b’, also known as the intercept and the slope of the line. This is done with the help of the following two normal equations:³

$\sum Y = na + b\sum X$

$\sum XY = a\sum X + b\sum X^2$

In the above two equations, ‘a’ and ‘b’ are unknowns and all other values, viz., $\sum X$, $\sum Y$, $\sum X^2$ and $\sum XY$, are the sums, sums of squares and sums of cross products to be calculated from the sample data, and ‘n’ means the number of observations in the sample.


The following examples explain the least squares method.

Example 7.1: Fit a regression line $\hat{Y} = a + bX_i$ by the method of least squares to the given sample information.

Observations                            1    2    3    4    5    6    7     8    9    10

Income (X) (’00 Rs)                     41   65   50   57   96   94   110   30   79   65

Consumption Expenditure (Y) (’00 Rs)    44   60   39   51   80   68   84    34   55   48

Solution: We are to fit a regression line $\hat{Y} = a + bX_i$ to the given data by the method of least squares. Accordingly, we work out the ‘a’ and ‘b’ values with the help of the normal equations stated above and, for the purpose, work out the $\sum X$, $\sum Y$, $\sum XY$ and $\sum X^2$ values from the given sample information in the table on Summations for Regression Equation.

Summations for Regression Equation

Observations   Income X    Consumption Expenditure Y   XY      X²      Y²
               (’00 Rs)    (’00 Rs)

 1              41          44                         1804    1681    1936
 2              65          60                         3900    4225    3600
 3              50          39                         1950    2500    1521
 4              57          51                         2907    3249    2601
 5              96          80                         7680    9216    6400
 6              94          68                         6392    8836    4624
 7             110          84                         9240   12100    7056
 8              30          34                         1020     900    1156
 9              79          55                         4345    6241    3025
10              65          48                         3120    4225    2304

n = 10         ΣX = 687    ΣY = 563                    ΣXY = 42358   ΣX² = 53173   ΣY² = 34223

Putting the values in the required normal equations we have,

563 = 10a + 687b

42358 = 687a + 53173b

Solving these two equations for a and b we obtain,

a = 14.000 and b = 0.616


Hence, the equation for the required regression line is,

$\hat{Y} = a + bX_i$

or, $\hat{Y} = 14.000 + 0.616X_i$

This equation is known as the regression equation of Y on X, from which Y values can be estimated for given values of the X variable.⁴
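The solution of the two normal equations can be sketched in Python. The function name `fit_line` is an illustrative assumption; the closed-form expressions for a and b are simply the two normal equations solved simultaneously.

```python
def fit_line(xs, ys):
    """Solve the normal equations for the intercept a and slope b of Y on X."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b

# Data of Example 7.1 (hundreds of rupees).
income = [41, 65, 50, 57, 96, 94, 110, 30, 79, 65]
consumption = [44, 60, 39, 51, 80, 68, 84, 34, 55, 48]

a, b = fit_line(income, consumption)
print(round(a, 2), round(b, 3))  # 14.0 0.616
```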

7.3.3 Checking the Accuracy of Equation

After finding the regression line as stated above, one can check its accuracy also. The method to be used for the purpose follows from the mathematical property of a line fitted by the method of least squares, viz., the individual positive and negative errors must sum to zero. In other words, using the estimating equation one must find out whether the term $\sum (Y - \hat{Y})$ is zero, and if this is so, then one can reasonably be sure that no mistake has been committed in determining the estimating equation.
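This check can be illustrated with a small Python sketch using the data of Example 7.1. Note one assumption made explicit here: with the rounded coefficients 14.000 and 0.616 the errors sum only approximately to zero, whereas the unrounded least squares coefficients give a sum of zero up to floating-point precision.

```python
income = [41, 65, 50, 57, 96, 94, 110, 30, 79, 65]
consumption = [44, 60, 39, 51, 80, 68, 84, 34, 55, 48]

n = len(income)
mean_x = sum(income) / n
mean_y = sum(consumption) / n

# Least squares slope and intercept in deviation form.
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(income, consumption))
     / sum((x - mean_x) ** 2 for x in income))
a = mean_y - b * mean_x

# Errors from the unrounded fit sum to zero (up to floating-point noise).
exact_sum = sum(y - (a + b * x) for x, y in zip(income, consumption))

# Errors from the rounded equation Y = 14.000 + 0.616X leave a small remainder.
rounded_sum = sum(y - (14.000 + 0.616 * x) for x, y in zip(income, consumption))

print(abs(exact_sum) < 1e-9)  # True
print(round(rounded_sum, 3))  # -0.192
```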

The Problem of Prediction

When we talk about prediction or estimation, we usually imply that if the relationship $Y_i = a + bX_i + e_i$ exists, then the regression equation, $\hat{Y} = a + bX_i$, provides a base for making estimates of the value of Y which will be associated with particular values of X. In Example 7.1, we worked out the regression equation for the income and consumption data as,

$\hat{Y} = 14.000 + 0.616X_i$

On the basis of this equation we can make a point estimate of Y for any given value of X. Suppose we wish to estimate the consumption expenditure of individuals with income of Rs 10,000. Since income is measured in hundreds of rupees, we substitute X = 100 in our equation and get an estimate of consumption expenditure as follows:

$\hat{Y} = 14.000 + 0.616(100) = 75.60$

Thus, the regression relationship indicates that individuals with Rs 10,000 of income may be expected to spend approximately Rs 7,560 on consumption. But this is only an expected or an estimated value, and it is possible that the actual consumption expenditure of some individual with that income may deviate from this amount. If so, our estimate will be in error, the likelihood of which will be high if the estimate is applied to any one individual. The interval estimate method is considered better and it states an interval in which the expected consumption expenditure may fall. Remember that the


wider the interval, the greater the level of confidence we can have, but the width of the interval (or what is technically known as the precision of the estimate) is associated with a specified level of confidence and is dependent on the variability (consumption expenditure in our case) found in the sample. This variability is measured by the standard deviation of the error term, ‘e’, and is popularly known as the standard error of the estimate.

Standard Error of Estimate

Standard error of estimate is a measure developed by statisticians for measuring the reliability of the estimating equation. Like the standard deviation, the Standard Error (S.E.) of $\hat{Y}$ measures the variability or scatter of the observed values of Y around the regression line. The Standard Error of Estimate (S.E. of $\hat{Y}$) is worked out as under:

$$ \text{S.E. of } \hat{Y} \ (\text{or } S_e) = \sqrt{\frac{\sum (Y - \hat{Y})^2}{n - 2}} = \sqrt{\frac{\sum e^2}{n - 2}} $$

where, S.E. of $\hat{Y}$ (or $S_e$) = Standard error of the estimate

Y = Observed value of Y

$\hat{Y}$ = Estimated value of Y

e = The error term = $(Y - \hat{Y})$

n = Number of observations in the sample

Note: In the above formula, n – 2 is used instead of n because two degrees of freedom are lost in basing the estimate on the variability of the sample observations about the line with two constants, viz., ‘a’ and ‘b’, whose position is determined by those same sample observations.
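As an illustration, the standard error of estimate can be computed for the data of Example 7.1 with the Python sketch below. The text does not report this value; the figure of about 5.70 (hundreds of rupees) is what the formula gives for these data, and the function name is an assumption made here.

```python
import math

def standard_error_of_estimate(xs, ys):
    """S_e = sqrt(sum((Y - Y_hat)^2) / (n - 2)) for the least squares line of Y on X."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    sum_e2 = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sum_e2 / (n - 2))  # n - 2 degrees of freedom

income = [41, 65, 50, 57, 96, 94, 110, 30, 79, 65]
consumption = [44, 60, 39, 51, 80, 68, 84, 34, 55, 48]

se = standard_error_of_estimate(income, consumption)
print(round(se, 2))  # 5.7
```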

The square of the $S_e$, also known as the variance of the error term, is the basic measure of reliability. The larger the variance, the more significant the magnitudes of the e’s and the less reliable the regression analysis in predicting the data.

Interpreting the Standard Error of Estimate and Finding the Confidence Limits for the Estimate in Large and Small Samples

The larger the S.E. of estimate ($SE_e$), the greater happens to be the dispersion, or scattering, of the given observations around the regression line. But if the S.E. of estimate happens to be zero, then the estimating equation is a ‘perfect’ estimator (i.e., cent per cent correct estimator) of the dependent variable.

In case of large samples, i.e., where n > 30 in a sample, it is assumed


that the observed points are normally distributed around the regression line and we may find,

68% of all points within $\hat{Y} \pm 1\,SE_e$ limits

95.5% of all points within $\hat{Y} \pm 2\,SE_e$ limits

99.7% of all points within $\hat{Y} \pm 3\,SE_e$ limits

This can be stated as,

(i) The observed values of Y are normally distributed around each estimated value of $\hat{Y}$; and

(ii) The variance of the distributions around each possible value of $\hat{Y}$ is the same.

In case of small samples, i.e., where n ≤ 30 in a sample, the ‘t’ distribution is used for finding the two limits more appropriately. This is done as follows:

Upper limit = $\hat{Y}$ + ‘t’ ($SE_e$)

Lower limit = $\hat{Y}$ – ‘t’ ($SE_e$)

Where, $\hat{Y}$ = The estimated value of Y for a given value of X.

$SE_e$ = The standard error of estimate.

‘t’ = Table value of ‘t’ for the given degrees of freedom for a specified confidence level.
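The two limits can be sketched in Python as below. The inputs are assumptions for illustration: the point estimate 75.60 comes from Example 7.1, the standard error 5.70 is the value the S.E. formula yields for those data (it is not stated in the text), and 2.306 is the standard table value of ‘t’ for 8 degrees of freedom at the 95 per cent confidence level.

```python
def estimate_limits(y_hat, se, t_value):
    """Return (lower, upper) limits Y_hat -/+ t * SE_e for a small sample."""
    margin = t_value * se
    return y_hat - margin, y_hat + margin

# Assumed inputs: Y_hat = 75.60, SE_e = 5.70, t = 2.306 (8 d.f., 95% level).
lower, upper = estimate_limits(75.60, 5.70, 2.306)
print(round(lower, 2), round(upper, 2))  # 62.46 88.74
```

Under these assumptions, consumption expenditure for an income of Rs 10,000 would be expected to lie between roughly Rs 6,246 and Rs 8,874.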

7.3.4 Some Other Details Concerning Simple Regression

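Sometimes the estimating equation of Y, also known as the regression equation of Y on X, is written as follows:

$$ \hat{Y} - \bar{Y} = r\frac{\sigma_Y}{\sigma_X}\left(X_i - \bar{X}\right) $$

or,

$$ \hat{Y} = r\frac{\sigma_Y}{\sigma_X}\left(X_i - \bar{X}\right) + \bar{Y} $$

Where, r = Coefficient of simple correlation between X and Y

$\sigma_Y$ = Standard deviation of Y

$\sigma_X$ = Standard deviation of X

$\bar{X}$ = Mean of X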


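$\bar{Y}$ = Mean of Y

$\hat{Y}$ = Value of Y to be estimated

$X_i$ = Any given value of X for which Y is to be estimated.

This is based on the formula we have used, i.e., $\hat{Y} = a + bX_i$. The coefficient of $X_i$ is defined as,

Coefficient of $X_i$ = b = $r\frac{\sigma_Y}{\sigma_X}$

(Also known as the regression coefficient of Y on X, or the slope of the regression line of Y on X, or $b_{YX}$.)

$$ = \frac{\sum XY - n\bar{X}\bar{Y}}{\sqrt{\sum Y^2 - n\bar{Y}^2}\;\sqrt{\sum X^2 - n\bar{X}^2}} \times \frac{\sqrt{\sum Y^2 - n\bar{Y}^2}}{\sqrt{\sum X^2 - n\bar{X}^2}} = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2} $$

and

$$ a = -r\frac{\sigma_Y}{\sigma_X}\bar{X} + \bar{Y} = \bar{Y} - b\bar{X}, \quad \text{since } b = r\frac{\sigma_Y}{\sigma_X} $$

Similarly, the estimating equation of X, also known as the regression equation of X on Y, can be stated as:

$$ \hat{X} - \bar{X} = r\frac{\sigma_X}{\sigma_Y}\left(Y_i - \bar{Y}\right) $$

or

$$ \hat{X} = r\frac{\sigma_X}{\sigma_Y}\left(Y_i - \bar{Y}\right) + \bar{X} $$

and the

$$ \text{Regression coefficient of X on Y (or } b_{XY}) = r\frac{\sigma_X}{\sigma_Y} = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum Y^2 - n\bar{Y}^2} $$

If we are given the two regression equations as stated above, along with the values of their ‘a’ and ‘b’ constants, and solve them simultaneously for X and Y, then the values of X and Y so obtained are the mean value of X (i.e., $\bar{X}$) and the mean value of Y (i.e., $\bar{Y}$).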


If we are given the two regression coefficients (viz., $b_{XY}$ and $b_{YX}$), then we can work out the value of the coefficient of correlation by just taking the square root of the product of the regression coefficients as shown below:

$$ r = \sqrt{b_{YX} \cdot b_{XY}} = \sqrt{r\frac{\sigma_Y}{\sigma_X} \cdot r\frac{\sigma_X}{\sigma_Y}} = \sqrt{r \cdot r} = r $$

The (±) sign of r will be determined on the basis of the sign of the regression coefficients given. If the regression coefficients have a minus sign, then r will be taken with a minus (–) sign, and if the regression coefficients have a plus sign, then r will be taken with a plus (+) sign. (Remember that both regression coefficients will necessarily have the same sign, whether it is minus or plus, for their sign is governed by the sign of the coefficient of correlation.)

Example 7.2: Given is the following information:

                      X       Y
Mean                  39.5    47.5
Standard Deviation    10.8    17.8

Simple correlation coefficient between X and Y is +0.42.

Find the estimating equations of Y and X.

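Solution:

The estimating equation of Y can be worked out as,

$$ \hat{Y} - \bar{Y} = r\frac{\sigma_Y}{\sigma_X}\left(X_i - \bar{X}\right) $$

or

$$ \hat{Y} = r\frac{\sigma_Y}{\sigma_X}\left(X_i - \bar{X}\right) + \bar{Y} $$

$$ = 0.42 \times \frac{17.8}{10.8}\left(X_i - 39.5\right) + 47.5 $$

$$ = 0.69X_i - 27.25 + 47.5 $$

$$ = 0.69X_i + 20.25 $$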

Similarly, the estimating equation of X can be worked out as under:

$$ \hat{X} - \bar{X} = r\frac{\sigma_X}{\sigma_Y}\left(Y_i - \bar{Y}\right) $$

or

$$ \hat{X} = r\frac{\sigma_X}{\sigma_Y}\left(Y_i - \bar{Y}\right) + \bar{X} $$

$$ = 0.42 \times \frac{10.8}{17.8}\left(Y_i - 47.5\right) + 39.5 $$

$$ = 0.25Y_i - 11.88 + 39.5 $$

$$ = 0.25Y_i + 27.62 $$
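The two estimating equations of Example 7.2 can be reproduced from the summary statistics with the sketch below. The function name `line_from_summary` is an illustrative assumption. Because the code carries the slopes unrounded, its intercepts (about 20.16 and 27.40) differ slightly from those obtained in the worked solution with slopes rounded to two decimals.

```python
def line_from_summary(r, sd_dep, sd_ind, mean_dep, mean_ind):
    """Return (intercept, slope) for regressing the dependent on the independent variable."""
    slope = r * sd_dep / sd_ind
    intercept = mean_dep - slope * mean_ind
    return intercept, slope

r = 0.42
mean_x, sd_x = 39.5, 10.8
mean_y, sd_y = 47.5, 17.8

a_yx, b_yx = line_from_summary(r, sd_y, sd_x, mean_y, mean_x)  # Y on X
a_xy, b_xy = line_from_summary(r, sd_x, sd_y, mean_x, mean_y)  # X on Y

print(round(b_yx, 3), round(a_yx, 2))
print(round(b_xy, 3), round(a_xy, 2))
```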

Example 7.3: Given is the following data:

Variance of X = 9

Regression equations:

4X – 5Y + 33 = 0

20X – 9Y – 107 = 0

Find: (i) Mean values of X and Y.

(ii) Coefficient of correlation between X and Y.

(iii) Standard deviation of Y.

Solution:

(i) For finding the mean values of X and Y, we solve the two given regression equations for the values of X and Y as follows:

4X – 5Y + 33 = 0 (1)

20X – 9Y – 107 = 0 (2)

If we multiply Equation (1) by 5 and take the constants to the right-hand side, we have the following equations:

20X – 25Y = –165 (3)

20X – 9Y = 107 (2)

Subtracting Equation (2) from Equation (3) we get,

–16Y = –272

or Y = 17

Putting this value of Y in Equation (1) we have,

4X = –33 + 5(17)

or X = (–33 + 85)/4 = 52/4 = 13

Hence, $\bar{X}$ = 13 and $\bar{Y}$ = 17


(ii) For finding the coefficient of correlation, first of all we presume one of the two given regression equations to be the estimating equation of X. Let equation 4X – 5Y + 33 = 0 be the estimating equation of X; then we have,

$$ \hat{X} = \frac{5}{4}Y_i - \frac{33}{4} $$

and from this we can write $b_{XY} = \frac{5}{4}$.

The other given equation is then taken as the estimating equation of Y and can be written as,

$$ \hat{Y} = \frac{20}{9}X_i - \frac{107}{9} $$

and from this we can write $b_{YX} = \frac{20}{9}$.

If the above suppositions are correct, then r must be equal to,

$$ r = \sqrt{\frac{5}{4} \times \frac{20}{9}} = \sqrt{\frac{25}{9}} = \frac{5}{3} = 1.67 $$

which is an impossible value, since r can in no case be greater than 1. Hence, we change our supposition about the estimating equations and, by reversing it, re-write the estimating equations as under:

$$ \hat{X} = \frac{9}{20}Y_i + \frac{107}{20} \quad \text{and} \quad \hat{Y} = \frac{4}{5}X_i + \frac{33}{5} $$

Hence,

$$ r = \sqrt{\frac{9}{20} \times \frac{4}{5}} = \sqrt{\frac{9}{25}} = \frac{3}{5} = 0.6 $$

Since the regression coefficients have plus signs, we take r = +0.6.

(iii) Standard deviation of Y can be calculated as follows:

Variance of X = 9, so standard deviation of X, $\sigma_X$ = 3

$$ b_{YX} = r\frac{\sigma_Y}{\sigma_X}: \quad \frac{4}{5} = 0.6 \times \frac{\sigma_Y}{3} = 0.2\,\sigma_Y $$

Hence, $\sigma_Y$ = 4


Alternatively, we can work it out as under:

$$ b_{XY} = r\frac{\sigma_X}{\sigma_Y}: \quad \frac{9}{20} = 0.6 \times \frac{3}{\sigma_Y} = \frac{1.8}{\sigma_Y} $$

Hence, $\sigma_Y$ = 4
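The trial-and-reverse reasoning of Example 7.3 can be sketched in Python. Each line is written as coef_x·X + coef_y·Y + const = 0; the representation and the function names are assumptions introduced for illustration. The code tries both ways of assigning the lines and keeps the one for which the product of the regression coefficients does not exceed 1.

```python
import math

def slopes(line):
    """For coef_x*X + coef_y*Y + const = 0, return (slope read as Y on X, slope read as X on Y)."""
    coef_x, coef_y, _ = line
    return -coef_x / coef_y, -coef_y / coef_x

def correlation_from_lines(line1, line2):
    """Return (r, b_yx, b_xy) for the assignment of lines that gives |r| <= 1."""
    for y_on_x, x_on_y in ((line1, line2), (line2, line1)):
        b_yx = slopes(y_on_x)[0]
        b_xy = slopes(x_on_y)[1]
        product = b_yx * b_xy
        if 0 <= product <= 1:
            # r carries the common sign of the two regression coefficients.
            return math.copysign(math.sqrt(product), b_yx), b_yx, b_xy
    raise ValueError("no valid assignment of the two lines")

r, b_yx, b_xy = correlation_from_lines((4, -5, 33), (20, -9, -107))
sd_x = math.sqrt(9)       # variance of X is 9
sd_y = b_yx * sd_x / r    # from b_YX = r * sd_y / sd_x
print(round(r, 2), round(sd_y, 2))  # 0.6 4.0
```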

Activity 2

Regression of savings (S) of a family on income (Y) may be expressed as $S = a + \frac{Y}{m}$, where ‘a’ and ‘m’ are constants. In a random sample of 100 families, the variance of savings is one-quarter of the variance of incomes and the coefficient of correlation is found to be +0.4. Obtain the estimate of ‘m’.

Example 7.4: Heights of the father and son are given below. Find the height of the son when the height of the father is 69 inches.

Father’s height (inches)   71   68   66   67   70   71   70   73   72   65   66

Son’s height (inches)      69   64   65   63   65   62   65   64   66   59   62

Solution: Let father’s height be X and son’s height be Y.

Regression line of Y on X

X     $(X - \bar{X})$ = x    x²     Y     $(Y - \bar{Y})$ = y    y²     xy

71    +2                      4     69    +5                     25    +10
68    –1                      1     64     0                      0      0
66    –3                      9     65    +1                      1     –3
67    –2                      4     63    –1                      1     +2
70    +1                      1     65    +1                      1     +1
71    +2                      4     62    –2                      4     –4
70    +1                      1     65    +1                      1     +1
73    +4                     16     64     0                      0      0
72    +3                      9     66    +2                      4     +6
65    –4                     16     59    –5                     25    +20
66    –3                      9     62    –2                      4     +6

ΣX = 759   Σx = 0   Σx² = 74   ΣY = 704   Σy = 0   Σy² = 66   Σxy = 39

$$ (\hat{Y} - \bar{Y}) = r\frac{\sigma_y}{\sigma_x}(X - \bar{X}) $$

$\bar{Y}$ = 704/11 = 64;


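$\bar{X}$ = 759/11 = 69

Note: $\bar{Y} = \frac{\sum Y}{N}$ and $\bar{X} = \frac{\sum X}{N}$

$$ r\frac{\sigma_y}{\sigma_x} = \frac{\sum xy}{\sum x^2} = \frac{39}{74} = 0.527 $$

$(\hat{Y} - 64) = 0.527(X - 69)$

or $\hat{Y} = 0.527X + 27.64$

For X = 69, $\hat{Y}$ = 0.527(69) + 27.64 = 64.003 ≈ 64.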
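The deviation method used in Example 7.4 can be sketched in Python as follows. The function name `predict_son` is an illustrative assumption; the slope is computed as Σxy/Σx², with x and y measured as deviations from their means.

```python
fathers = [71, 68, 66, 67, 70, 71, 70, 73, 72, 65, 66]
sons = [69, 64, 65, 63, 65, 62, 65, 64, 66, 59, 62]

n = len(fathers)
mean_x = sum(fathers) / n   # 69.0
mean_y = sum(sons) / n      # 64.0

# Deviations from the means.
x_dev = [x - mean_x for x in fathers]
y_dev = [y - mean_y for y in sons]

# Regression coefficient of Y on X: sum(xy) / sum(x^2) = 39 / 74.
b = sum(p * q for p, q in zip(x_dev, y_dev)) / sum(p * p for p in x_dev)

def predict_son(father_height):
    """Estimated son's height from the regression line of Y on X."""
    return mean_y + b * (father_height - mean_x)

print(round(b, 3))      # 0.527
print(predict_son(69))  # 64.0
```

Because 69 inches is exactly the mean height of the fathers, the estimate coincides with the mean height of the sons.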

Example 7.5: Obtain the two regression equations for the following data using the method of least squares:

x   1   2   3   4    5

y   5   7   9   10   11

x    y     xy    x²    y²

1     5     5     1     25
2     7    14     4     49
3     9    27     9     81
4    10    40    16    100
5    11    55    25    121

Σx = 15   Σy = 42   Σxy = 141   Σx² = 55   Σy² = 376

Regression equation of y on x:

y = a + bx

where Σy = Na + bΣx

and Σxy = aΣx + bΣx²

Thus, 42 = 5a + 15b ...(i)

141 = 15a + 55b ...(ii)

Solving (i) and (ii), we get a = 3.9 and b = 1.5

Thus, y = 3.9 + 1.5x

Regression equation of x on y:

x = a + by

where Σx = Na + bΣy

and Σxy = aΣy + bΣy²

Thus, 15 = 5a + 42b ...(iii)

and 141 = 42a + 376b ...(iv)


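Solving (iii) and (iv), we get

$$ b = \frac{75}{116} \approx 0.647 \quad \text{and} \quad a = \frac{15 - 42b}{5} = -\frac{141}{58} \approx -2.431 $$

Thus, x = –2.431 + 0.647y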
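Both pairs of normal equations in Example 7.5 can be solved with the same small routine, sketched below in Python. The function name `regress` is an illustrative assumption; calling it with the variables swapped gives the regression of x on y.

```python
def regress(dep, ind):
    """Solve the normal equations for dep = a + b * ind; return (a, b)."""
    n = len(dep)
    s_dep, s_ind = sum(dep), sum(ind)
    s_cross = sum(d * i for d, i in zip(dep, ind))
    s_ind2 = sum(i * i for i in ind)
    b = (n * s_cross - s_dep * s_ind) / (n * s_ind2 - s_ind ** 2)
    a = (s_dep - b * s_ind) / n
    return a, b

x = [1, 2, 3, 4, 5]
y = [5, 7, 9, 10, 11]

a_yx, b_yx = regress(y, x)  # regression of y on x
a_xy, b_xy = regress(x, y)  # regression of x on y

print(round(a_yx, 3), round(b_yx, 3))  # 3.9 1.5
print(round(a_xy, 3), round(b_xy, 3))  # -2.431 0.647
```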

Example 7.6: The following table shows the ages (x) and blood pressure (y) of 8 persons.

x   52  63  45  36  72  65  47  25

y   62  53  51  25  79  43  60  33

Obtain the regression equation of y on x and find the expected blood pressure of a person who is 49 years old.

Solution: Let Ax = 50 and Ay = 50 (assumed means)

  x    (x – 50) = dx    dx²     y    (y – 50) = dy    dy²    dxdy
 52         +2            4     62        +12         144     +24
 63        +13          169     53         +3           9     +39
 45         –5           25     51         +1           1      –5
 36        –14          196     25        –25         625    +350
 72        +22          484     79        +29         841    +638
 65        +15          225     43         –7          49    –105
 47         –3            9     60        +10         100     –30
 25        –25          625     33        –17         289    +425

Σx = 405   Σdx = 5   Σdx² = 1737   Σy = 406   Σdy = 6   Σdy² = 2058   Σdxdy = 1336

$$(y - \bar{y}) = r\,\frac{\sigma_y}{\sigma_x}\,(x - \bar{x})$$

$$\bar{y} = \frac{\sum y}{N} = \frac{406}{8} = 50.75; \qquad \bar{x} = \frac{\sum x}{N} = \frac{405}{8} = 50.625$$

$$r\,\frac{\sigma_y}{\sigma_x} = \frac{N\sum d_x d_y - \sum d_x \sum d_y}{N\sum d_x^2 - \left(\sum d_x\right)^2} = \frac{8(1336) - (5)(6)}{8(1737) - 5^2} = 0.768$$

(y – 50.75) = 0.768 (x – 50.625)
or y = 11.87 + 0.768x
y49 = 11.87 + 0.768 (49) = 49.502
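The assumed-mean shortcut of Example 7.6 can be sketched in Python as follows. The function name `slope_from_assumed_means` is an illustrative assumption. The sketch also shows that the regression coefficient does not depend on which assumed means are chosen.

```python
ages = [52, 63, 45, 36, 72, 65, 47, 25]       # x
pressures = [62, 53, 51, 25, 79, 43, 60, 33]  # y


def slope_from_assumed_means(xs, ys, ax, ay):
    """Regression coefficient of y on x from deviations about assumed means ax, ay."""
    n = len(xs)
    dx = [x - ax for x in xs]
    dy = [y - ay for y in ys]
    numerator = n * sum(p * q for p, q in zip(dx, dy)) - sum(dx) * sum(dy)
    denominator = n * sum(p * p for p in dx) - sum(dx) ** 2
    return numerator / denominator


b_yx = slope_from_assumed_means(ages, pressures, 50, 50)
x_bar = sum(ages) / len(ages)
y_bar = sum(pressures) / len(pressures)
intercept = y_bar - b_yx * x_bar
estimate_49 = intercept + b_yx * 49  # expected blood pressure at age 49

# Any other choice of assumed means gives the same coefficient.
same_slope = slope_from_assumed_means(ages, pressures, 0, 0)
print(round(b_yx, 3), round(intercept, 2), round(estimate_49, 2))
```

The unrounded coefficients give an estimate that differs from the hand calculation only in the third decimal place, because the worked solution rounds the slope to 0.768 before substituting.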


Example 7.7: The equations of two regression lines obtained in a correlation analysis of 60 observations are 5x = 6y + 24 and 1000y = 768x – 3608. What is the correlation coefficient and what is its probable error?

Show that the ratio of the coefficient of variation of x to that of y is 5/24. What is the ratio of the variances of x and y?

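The equations of the regression lines are given as
5x = 6y + 24 and 1000y = 768x – 3608

$$b_{xy} = r\,\frac{\sigma_x}{\sigma_y} = \frac{6}{5} \qquad \text{...(i)}$$

and

$$b_{yx} = r\,\frac{\sigma_y}{\sigma_x} = \frac{768}{1000} \qquad \text{...(ii)}$$

Multiplying these, we get

$$b_{xy} \times b_{yx} = r^2 = \frac{6}{5} \times \frac{768}{1000} = 0.9216 \;\Rightarrow\; r = \pm 0.96$$

Since both bxy and byx are positive, the correlation coefficient r is also positive and hence r = +0.96.

Also, probable error of r,

$$P.E._r = 0.6745\,\frac{1 - r^2}{\sqrt{N}} = 0.6745 \times \frac{1 - (0.96)^2}{\sqrt{60}} \approx 0.0068$$

Also we know that each regression line passes through $(\bar{x}, \bar{y})$. So from the given equations of these lines we have

$$5\bar{x} = 6\bar{y} + 24 \quad \text{and} \quad 1000\bar{y} = 768\bar{x} - 3608$$

Solving these we get

$$\bar{x} = 6 \quad \text{and} \quad \bar{y} = 1 \qquad \text{...(iii)}$$

Also from (i), we have $r\,\dfrac{\sigma_x}{\sigma_y} = \dfrac{6}{5}$ where r = 0.96

$$\text{or} \quad \frac{\sigma_x}{\sigma_y} = \frac{6}{5} \times \frac{1}{0.96} = \frac{5}{4} \qquad \text{...(iv)}$$

And the ratio of the coefficient of variation of x to that of y

$$\frac{\sigma_x / \bar{x}}{\sigma_y / \bar{y}} = \frac{\sigma_x}{\sigma_y} \times \frac{\bar{y}}{\bar{x}} = \frac{5}{4} \times \frac{1}{6} = \frac{5}{24} \qquad \text{...(from (iii) and (iv))}$$

Finally, from (iv), the ratio of the variances of x and y is $\sigma_x^2 / \sigma_y^2 = (5/4)^2 = 25/16$.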
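The chain of results in Example 7.7 can be reproduced numerically. This is a sketch only: the variable names are illustrative assumptions, and the means are obtained by solving the two line equations with Cramer's rule, which is one of several equivalent ways to do it.

```python
import math

# Regression coefficients read off the two lines:
# 5x = 6y + 24        ->  x = (6/5) y + 24/5       ->  b_xy = 6/5
# 1000y = 768x - 3608 ->  y = 0.768 x - 3.608      ->  b_yx = 0.768
b_xy = 6 / 5
b_yx = 768 / 1000
n = 60

# r takes the common sign of the two regression coefficients.
r = math.copysign(math.sqrt(b_xy * b_yx), b_xy)
probable_error = 0.6745 * (1 - r ** 2) / math.sqrt(n)

# Both lines pass through the means: 5x - 6y = 24 and 768x - 1000y = 3608.
det = 5 * (-1000) - (-6) * 768
x_bar = (24 * (-1000) - (-6) * 3608) / det
y_bar = (5 * 3608 - 768 * 24) / det

sd_ratio = b_xy / r                    # sigma_x / sigma_y
cv_ratio = sd_ratio * y_bar / x_bar    # ratio of coefficients of variation
variance_ratio = sd_ratio ** 2         # sigma_x^2 / sigma_y^2

print(round(r, 2), round(probable_error, 4), x_bar, y_bar)
print(round(sd_ratio, 4), round(cv_ratio, 4), round(variance_ratio, 4))
```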

Self-Assessment Questions

3. State whether true or false.

(a) The scatter diagram by itself is not sufficient for predicting values of the dependent variable.

(b) The interval estimate method is considered worse as it states an interval in which the expected consumption expenditure may fall.

4. Fill in the blanks with the appropriate terms.

(a) In case of simple linear regression analysis, a single variable is used to __________ another variable on the assumption of linear relationship (i.e., relationship of the type defined by Y = a + bX) between the given variables.

(b) Standard error of estimate is a measure developed by the statisticians for measuring the reliability of the _____________ equation.

7.4 Summary

Let us recapitulate the important concepts discussed in this unit:

The term ‘regression’ was first used in 1877 by Sir Francis Galton who made a study that showed the process of predicting one variable from another variable.

When there is a well established relationship between variables, it is possible to make use of this relationship in making estimates and to forecast the value of one variable (the unknown or the dependent variable) on the basis of the other variable/s (the known or the independent variable/s).

There is an actual relationship between dependent and independent variables.


In case of simple linear regression analysis, a single variable is used to predict another variable on the assumption of linear relationship (i.e., relationship of the type defined by Y = a + bX) between the given variables. The variable to be predicted is called the dependent variable and the variable on which the prediction is based is called the independent variable.

Scatter diagram is also known as Dot diagram. Scatter diagram represents two series with the known variable, i.e., independent variable plotted on the X-axis and the variable to be estimated, i.e., dependent variable to be plotted on the Y-axis.

Least squares method of fitting a line (the line of best fit or the regression line) through the scatter diagram is a method which minimizes the sum of the squared vertical deviations from the fitted line.

The variability in sample is measured by the standard deviation of the error term, ‘e’, and is popularly known as the standard error of the estimate.

The larger the S.E. of estimate (SEe), the greater happens to be the dispersion, or scattering, of given observations around the regression line.

The (±) sign of r will be determined on the basis of the sign of the regression coefficients given. If regression coefficients have minus sign then r will be taken with minus (–) sign and if regression coefficients have plus sign then r will be taken with plus (+) sign.

7.5 Glossary

Regression analysis: A relationship used for making estimates and forecasts about the value of one variable (the unknown or the dependent variable) on the basis of the other variable/s (the known or the independent variable/s).

Scatter diagram: Also known as a Dot diagram, used to represent two series with the known variables, i.e., independent variable plotted on the X-axis and the variable to be estimated, i.e., dependent variable to be plotted on the Y-axis on a graph paper for the given information.

Standard error of estimate: A measure developed by statisticians for measuring the reliability of the estimating equation.


7.6 Terminal Questions

1. Define regression analysis. How will you predict the value of a dependent variable?

2. Differentiate between Scatter diagram and Least Squares method.

3. Can the accuracy of estimated equation be checked? Explain.

4. How is the standard error of estimate calculated?

5. What is a Scatter diagram? How does it help in studying correlation between two variables? Explain.

7.7 Answers

Answers to Self-Assessment Questions

1. (a) Experimentor; (b) Regression

2. (a) True; (b) False

3. (a) True; (b) False

4. (a) Predict; (b) Estimating

Answers to Terminal Questions

1. Refer Section 7.2

2. Refer Section 7.3.1

3. Refer Sections 7.3.1 and 7.3.2

4. Refer Section 7.3.3

5. Refer Section 7.3.3

7.8 Further Reading

1. Elhance, D.N. Fundamentals of Statistics. Allahabad: Kitab Mahal, 2007.

2. Gupta, S.C. Fundamentals of Business Statistics. New Delhi: Sultan Chand & Sons, 2010.


Endnotes

1. Usually the estimate of Y, denoted by $\hat{Y}$, is written as

$$\hat{Y}_i = a + bX_i$$

on the assumption that the random disturbance to the system averages out or has an expected value of zero (i.e., e = 0) for any single observation. This regression model is known as the Regression line of Y on X, from which the value of Y can be estimated for the given value of X.

2. [Figure: five scatter diagrams, numbered (1) to (5)]

Five possible forms which a Scatter diagram may assume have been depicted in the above five diagrams. The first diagram is indicative of a perfect positive relationship, the second shows a perfect negative relationship, the third shows no relationship, the fourth shows a positive relationship and the fifth shows a negative relationship between the two variables under consideration.

3. If we proceed centering each variable, i.e., setting its origin at its mean, then the two equations will be as under:

$$\sum Y = na + b\sum X$$

$$\sum XY = a\sum X + b\sum X^2$$

But since ΣY and ΣX will be zero, the first equation and the first term of the second equation will disappear and we shall simply have the following equations:

$$\sum XY = b\sum X^2$$

$$b = \frac{\sum XY}{\sum X^2}$$

The value of ‘a’ can then be worked out as:

$$a = \bar{Y} - b\bar{X}$$

4. It should be pointed out that the equation used to estimate the Y variable values from values of X should not be used to estimate the values of the X variable from given values of the Y variable. Another regression equation (known as the regression equation of X on Y, of the type X = a + bY) that reverses the roles of the two variables should be used if it is desired to estimate X from values of Y.


Unit 8 Time Series

Structure

8.1 Introduction
    Objectives
8.2 Components of Time Series
8.3 Different Methods of Measuring Trend
8.4 Different Methods of Measuring Seasonal Variations
8.5 Summary
8.6 Glossary
8.7 Terminal Questions
8.8 Answers
8.9 Further Reading

8.1 Introduction

In the previous unit, you learnt about regression analysis and its significance in data analysis.

In this unit, you will learn how time series analysis differs from regression analysis. We often see a number of charts on company drawing boards or in newspapers, where we see lines going up and down from left to right on a graph. The vertical axis represents a variable such as productivity or crime data in the city and the horizontal axis represents the different periods of increasing time such as days, weeks, months or years. Analysis of the movements of such variables over periods of time is referred to as time series analysis. A time series can then be defined as a set of numeric observations of a dependent variable, measured at specific points in time in chronological order, usually at equal intervals; the analysis seeks to determine the relationship of time to such variables.

You will also learn that one of the major elements of planning, and specifically of strategic planning, in any organization is accurately forecasting the future events that would have an impact on the operations of the organization. Previous performances must be studied so as to forecast future activity. Even in our daily lives, we plan our future events on the basis of a reasonable estimate of the future environment that would affect our plans, whether it is forecasting rain on our picnic on Saturday or forecasting economic conditions for ten years. Textbook publishers, for example, must predict future sales of books to print enough copies for students. Financial advisors must predict the values of a variety of economic factors in order to advise clients regarding stocks, bonds and other business opportunities. Similarly, hotel builders in a city must project the future influx of tourists, and so on. The quality of such forecasts is strongly related to the relevant information that can be extracted and used from past data. In that respect, time series can be used to determine patterns in past data over a period of time and extrapolate the data into the future.

Objectives

After studying this unit, you should be able to:

Analyse the components of time series

Explain the different methods of measuring trend

Calculate simple averages and moving averages

Measure irregular variations and seasonal adjustments

8.2 Components of Time Series

The time series analysis method is quite accurate where the future is expected to be similar to the past. The underlying assumption in time series is that the same factors will continue to influence the future patterns of economic activity in a similar manner as in the past. These techniques are fairly sophisticated and require expertise to apply.

The classical approach to analyse a time series is in terms of four distinct types of variations or separate components that influence a time series.

1. Secular Trend or Simply Trend (T). Trend is a general long-term movement in the time series value of the variable (Y) over a fairly long period of time. The variable (Y) is the factor that we are interested in evaluating for the future. It could be sales, population, crime rate and so on.

These variables are observed over a long period of time and any changes related to time are noted and calculated and a trend of these changes is established.

If a trend can be determined and the rate of change can be ascertained, then tentative estimates on the same series values into the future can be made. However, such forecasts are based on the assumption that the conditions affecting the steady growth or decline are reasonably expected to remain unchanged in the future. A change in these conditions would affect the forecasts. For example, a time series involving increase in population over time is shown in Figure 8.1.

Figure 8.1 Time Series Graph on Population Increase

2. Cyclical Fluctuations (C). These refer to regular swings or patterns that repeat over a long period of time. The movements are considered cyclical only if they occur after time intervals of more than one year. These are the changes that take place as a result of economic booms or depressions. These may be up or down, and are recurrent in nature and have a duration of several years, usually lasting for two to ten years. These movements also differ in intensity or amplitude and each phase of movement changes gradually into the phase that follows it.

The cyclic variation for revenues in an industry against time is shown graphically in Figure 8.2.

Figure 8.2 Cyclic Variation for Revenues

3. Seasonal Variation (S). This involves patterns of change that repeat over a period of one year or less. Then they repeat from year to year and they are brought about by fixed events. For example, sales of consumer items increase prior to Deepawali due to the tradition of giving gifts.


Since these variations repeat during a period of twelve months, they can be predicted fairly accurately. Some factors that cause seasonal variations are:

(i) Season and climate. Changes in the climate and weather conditions have a profound effect on sales. For example, the sale of umbrellas in India is always more during monsoon season. Similarly, during winter, there is a greater demand for woollen clothes and hot drinks, while during summer months, there is an increase in sales of fans and air conditioners.

(ii) Customs and festivals. Customs and traditions affect the pattern of seasonal spending. For example, in India, festivals such as Baisakhi and Diwali mean a big demand for sweets and candy.

An accurate assessment of seasonal behaviour is an aid in business planning and scheduling such as in the area of production, inventory control, personnel, advertising, and so on. The seasonal fluctuations over four repeating quarters in a given year for sale of a given item are illustrated in Figure 8.3.

Figure 8.3 Seasonal Fluctuations Over Four Quarters in a Year

4. Irregular or Random Variation (I). These variations are accidental, random or simply due to chance factors. Thus, they are wholly unpredictable. These fluctuations may be caused by such isolated incidents as floods, famines, strikes or wars. Sudden changes in demand or a breakthrough in a technological development may be included in this category. Accordingly, it is almost impossible to isolate and measure the value and the impact of these erratic movements on forecasting models or techniques. This phenomenon is graphically shown in Figure 8.4.

Figure 8.4 Irregular or Random Variation

It is traditionally acknowledged that the value of the time series (Y) is a function of the impact of trend (T), seasonal variation (S), cyclical variation (C) and irregular fluctuation (I). These relationships may vary depending upon assumptions and purposes. The effects of these four components might be additive, multiplicative, or a combination thereof in a number of ways. However, the traditional time series analysis model is characterized by a multiplicative relationship, so that:

Y = T × S × C × I

This model is appropriate for those situations where percentage changes best represent movement in the series and the components are not viewed as absolute values but as relative values.

Another approach to define the relationship may be additive, so that:

Y = T + S + C + I

This model is useful when the variations in the time series are in absolute values and can be separated and traced to each of these four parts and each part can be measured independently.
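The difference between the two models can be seen in a small numerical sketch. All figures below are hypothetical and chosen only for illustration; the function names are likewise assumptions. In the multiplicative model, S, C and I are ratios around 1 applied to the trend value, while in the additive model they are amounts in the same units as the trend.

```python
def multiplicative_model(trend, seasonal, cyclical, irregular):
    """Y = T x S x C x I, with S, C and I expressed as ratios (1.0 = no effect)."""
    return trend * seasonal * cyclical * irregular


def additive_model(trend, seasonal, cyclical, irregular):
    """Y = T + S + C + I, with every component in the units of Y."""
    return trend + seasonal + cyclical + irregular


# Hypothetical quarter: trend value of 200 units.
y_mult = multiplicative_model(200, 1.10, 0.95, 1.02)  # +10% season, -5% cycle, +2% chance
y_add = additive_model(200, 20, -10, 4)               # +20, -10 and +4 units

print(round(y_mult, 2), y_add)
```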


Activity 1

The Indian Motorcycle Company is concerned about declining sales in the Western region. The following data shows monthly sales (in millions of ₹) of the motorcycles for the past twelve months.

Month        Sales (in millions of ₹)
January      6.5
February     6.0
March        6.3
April        5.1
May          5.6
June         4.8
July         4.0
August       3.6
September    3.5
October      3.1
November     3.0
December     3.0

(i) Plot the trend line and describe the relationship between sales and time.

(ii) What is the average monthly change in sales?

(iii) If the monthly sales fall below ₹ 2.4 million, then the West Coast office must be closed. Is it likely that the office will be closed during the next six months?

Self-Assessment Questions

1. Fill in the blanks with the appropriate terms.

(a) Trend is a general long-term ____________ in the time series value of the variable (Y) over a fairly long period of time.

(b) Cyclic fluctuations refer to ____________ swings or patterns that repeat over a long period of time.


2. State whether true or false.

(a) The time series analysis method is quite accurate where the future is expected to be similar to the past.

(b) Changes in the climate and weather conditions have a profound effect on sales.

8.3 Different Methods of Measuring Trend

8.3.1 Trend Analysis

While chance variations are difficult to identify, separate, control or predict, a more precise measurement of trend, cyclical effects and seasonal effects can be made in order to make the forecasts more reliable. In this section, we discuss techniques that would allow us to describe trend.

When a time series shows an upward or downward long-term linear trend, then regression analysis can be used to estimate this trend and project the trends into forecasting the future values of the variables involved. The equation for the straight line which we have used to describe the linear relationship between independent variable X and dependent variable Y is:

Y = b0 + b1X

Here, b0 = Intercept on the Y-axis and b1 = Slope of the straight line

In time series analysis, the independent variable is time, so we will use the symbol t in place of X and we will use the symbol Yt in place of Yc which we have used previously.

Hence, the equation for linear trend is given as:

Yt = b0 + b1t

Here, Yt = Forecast value of the time series in period t

b0 = Intercept of the trend line on Y-axis

b1 = Slope of the trend line

t = Time period

As discussed earlier, we can calculate the values of b0 and b1 by the following formulae:

$$b_1 = \frac{n\sum ty - \left(\sum t\right)\left(\sum y\right)}{n\sum t^2 - \left(\sum t\right)^2}, \quad \text{and} \quad b_0 = \bar{y} - b_1\bar{t}$$

Here, y = Actual value of the time series in time period t

n = Number of periods

$\bar{y}$ = Average value of time series = $\dfrac{\sum y}{n}$

$\bar{t}$ = Average value of t = $\dfrac{\sum t}{n}$

Knowing these values, we can calculate the value of Yt.

Example 8.1: A car fleet owner has five cars which have been in the fleet for several different years. The manager wants to establish if there is a linear relationship between the age of the car and the repairs in hundreds of rupees for a given year. This way, he can predict the repair expenses for each year as the cars become older. The information for the repair costs he collected for last year on these cars is given as follows:

Car #   Age (t)   Repairs (Y)
1       1         4
2       3         6
3       3         7
4       5         7
5       6         9

The manager wants to predict the repair expenses for the next year for the two cars that are three years old now.

Solution: The trend in repair costs suggests a linear relationship with the age of the car, so that the linear regression equation is given as:

$$Y_t = b_0 + b_1 t$$

Here,

$$b_1 = \frac{n\sum ty - \left(\sum t\right)\left(\sum y\right)}{n\sum t^2 - \left(\sum t\right)^2}$$

and,

$$b_0 = \bar{y} - b_1\bar{t}$$


To calculate the various values, let us form a new table as follows:

         Age of Car (t)   Repair Cost (Y)   tY     t²
              1                 4            4      1
              3                 6           18      9
              3                 7           21      9
              5                 7           35     25
              6                 9           54     36
Total        18                33          132     80

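Knowing that n = 5, let us substitute these values to calculate the regression coefficients b0 and b1.

Then,

$$b_1 = \frac{5(132) - (18)(33)}{5(80) - (18)^2} = \frac{660 - 594}{400 - 324} = \frac{66}{76} = 0.87$$

and,

$$b_0 = \bar{y} - b_1\bar{t}$$

Here,

$$\bar{y} = \frac{\sum y}{n} = \frac{33}{5} = 6.6 \quad \text{and} \quad \bar{t} = \frac{\sum t}{n} = \frac{18}{5} = 3.6$$

Then, b0 = 6.6 – 0.87(3.6)

= 6.6 – 3.13

= 3.47

Hence, Yt = 3.47 + 0.87t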

The cars that are 3 years old now will be 4 years old next year, so that t = 4.

Hence, Y(4) = 3.47 + 0.87(4)

= 3.47 + 3.48

= 6.95

Accordingly, the repair costs on each car that is 3 years old now are expected to be ₹ 695.00.
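The trend-line formulae used in Example 8.1 translate directly into code. The following is a minimal sketch; the function name `linear_trend` is an illustrative assumption. It keeps the coefficients unrounded, so the forecast agrees with the hand calculation only after rounding to two decimals.

```python
def linear_trend(ts, ys):
    """Return (b0, b1) of the least-squares trend line Y_t = b0 + b1 * t."""
    n = len(ts)
    sum_t, sum_y = sum(ts), sum(ys)
    sum_ty = sum(t * y for t, y in zip(ts, ys))
    sum_tt = sum(t * t for t in ts)
    b1 = (n * sum_ty - sum_t * sum_y) / (n * sum_tt - sum_t ** 2)
    b0 = sum_y / n - b1 * sum_t / n
    return b0, b1


car_ages = [1, 3, 3, 5, 6]   # t, years
repairs = [4, 6, 7, 7, 9]    # Y, in hundreds (units as in the example)

b0, b1 = linear_trend(car_ages, repairs)
forecast_age_4 = b0 + b1 * 4  # cars that are 3 years old now

print(round(b0, 2), round(b1, 2), round(forecast_age_4, 2))
```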

8.3.2 Measuring the Cyclical Effect

Cyclic variation, as we have discussed before, is a pattern that repeats over time periods longer than one year. These variations are generally unpredictable in relation to the time of occurrence, duration as well as amplitude. However, these variations have to be separated and identified. The measure we use to identify cyclical variation is the percentage of trend, and the procedure used is known as the residual trend.

As we have discussed earlier, there are four components of time series. These are secular trend (T), seasonal variation (S), cyclical variation (C) and irregular (or chance) variation (I). Since the time period considered for seasonal variation is less than one year, it can be excluded from the study, because when we look at time series consisting of annual data spread over many years, then only the secular trend, cyclical variation and irregular variation are considered.

Since the secular trend component can be described by the trend line (usually calculated by line of regression), we can isolate cyclical and irregular components from the trend. Furthermore, since irregular variation occurs by chance and cannot be predicted or identified accurately, it can be reasonably assumed that most of the variation in time series left unexplained by the trend component can be explained by the cyclical component. In that respect, cyclical variation can be considered as the residual, once other causes of variation have been identified.

The measure of cyclic variation as percentage of trend is calculated as follows:

(i) Determine the trend line (usually by regression analysis).

(ii) Compute the trend value Yt for each time period (t) under consideration.

(iii) Calculate the ratio Y/Yt for each time period.

(iv) Multiply this ratio by 100 to get the percentage of trend, so that,

$$\text{Percentage of trend} = \frac{Y}{Y_t} \times 100$$


Example 8.2: The following is the data for energy consumption (measured in quadrillions of BTU) in the United States from 1981 to 1986 as reported in the statistical abstracts of the United States.

Year    Time Period (t)    Annual Energy Consumption (Y)
1981          1                      74.0
1982          2                      70.8
1983          3                      70.5
1984          4                      74.1
1985          5                      74.0
1986          6                      73.9

Assuming a linear trend, calculate the percentage of trend for each year (cyclical variation).

Solution: First, we find the secular trend by the regression line method which is given by:

$$Y_t = b_0 + b_1 t$$

Here,

$$b_1 = \frac{n\sum ty - \left(\sum t\right)\left(\sum y\right)}{n\sum t^2 - \left(\sum t\right)^2}$$

and,

$$b_0 = \bar{y} - b_1\bar{t}$$

Let us make a table for these values.

 t      Y       tY      t²
 1     74.0     74.0     1
 2     70.8    141.6     4
 3     70.5    211.5     9
 4     74.1    296.4    16
 5     74.0    370.0    25
 6     73.9    443.4    36

Σt = 21   ΣY = 437.3   ΣtY = 1536.9   Σt² = 91

Substituting these values we get,

$$b_1 = \frac{6(1536.9) - (21)(437.3)}{6(91) - (21)^2} = \frac{9221.4 - 9183.3}{546 - 441} = \frac{38.1}{105} = 0.363$$

and,

$$b_0 = \bar{y} - b_1\bar{t}$$

Here,

$$\bar{y} = \frac{\sum y}{n} = \frac{437.3}{6} = 72.88 \quad \text{and} \quad \bar{t} = \frac{\sum t}{n} = \frac{21}{6} = 3.5$$

Hence, b0 = 72.88 – 0.363(3.5)

= 72.88 – 1.27

= 71.61

Then, Yt = 71.61 + 0.363t

Calculating the value of Yt for each time period, we get the following table for percentage of trend (Y/Yt)100.

Time Period   Energy Consumption   Trend    Percentage of Trend
    (t)              (Y)           (Yt)        (Y/Yt)100
     1               74.0          71.97        102.82
     2               70.8          72.34         97.87
     3               70.5          72.70         96.97
     4               74.1          73.06        101.42
     5               74.0          73.43        100.77
     6               73.9          73.79        100.15

The following graph shows the actual energy consumption (Y), trend line (Yt) and the cyclical fluctuations above and below the trend line over the time period (t) for 6 years.

[Graph: actual energy consumption (Y) and trend line (Yt) plotted against time period (t)]

Frequently, we draw a graph of cyclic variation as the percentage of trend. This process eliminates the trend line and isolates the cyclical component of the time series.

It must be understood that cyclical fluctuations are not accurately predictable, and hence, we cannot predict the future cyclic variations based upon such past cyclic variations.

The percentage of trend figures show that in 1981, the actual consumption of energy was 102.82% of expected consumption that year and in 1983, the actual consumption was 96.97% of the expected consumption.
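The four steps for the percentage of trend can be sketched in Python using the data of Example 8.2. The helper name `linear_trend` is an illustrative assumption. Because the sketch carries the coefficients at full precision rather than rounding them to 71.61 and 0.363, individual percentages may differ from the hand-worked table by about 0.01.

```python
def linear_trend(ts, ys):
    """Return (b0, b1) of the least-squares trend line Y_t = b0 + b1 * t."""
    n = len(ts)
    sum_t, sum_y = sum(ts), sum(ys)
    sum_ty = sum(t * y for t, y in zip(ts, ys))
    sum_tt = sum(t * t for t in ts)
    b1 = (n * sum_ty - sum_t * sum_y) / (n * sum_tt - sum_t ** 2)
    b0 = sum_y / n - b1 * sum_t / n
    return b0, b1


periods = [1, 2, 3, 4, 5, 6]                    # 1981 to 1986
energy = [74.0, 70.8, 70.5, 74.1, 74.0, 73.9]   # quadrillions of BTU

# Steps (i) and (ii): fit the trend line and compute the trend value per period.
b0, b1 = linear_trend(periods, energy)
trend = [b0 + b1 * t for t in periods]

# Steps (iii) and (iv): ratio of actual to trend, expressed as a percentage.
percent_of_trend = [100 * y / yt for y, yt in zip(energy, trend)]

for t, y, yt, pct in zip(periods, energy, trend, percent_of_trend):
    print(t, y, round(yt, 2), round(pct, 2))
```

Values above 100 mark years in which consumption ran above the trend line, and values below 100 mark years in which it ran below.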


Self-Assessment Questions

3. State whether true or false.

(a) The four components of time series are secular trend (T), seasonal variation (S), cyclical variation (C) and irregular (or chance) variation (I).

(b) The measure used to identify cyclical variation is the residual trend and the procedure used is the percentage of trend.

4. Fill in the blanks with the appropriate terms.

(a) When a time series shows an upward or downward long-term linear trend, then regression analysis can be used to ______________ this trend and project the trends into forecasting the future values of the variables involved.

(b) Cyclic variation is a pattern that ___________________ over time periods longer than one year.

8.4 Different Methods of Measuring Seasonal Variations

Seasonal variation has been defined as the predictable and repetitive movement around the trend line in a period of one year or less. For the measurement of seasonal variation, the time interval involved may be in terms of days, weeks, months or quarters. Because of the predictability of seasonal trends, we can plan in advance to meet these variations. For example, study of seasonal variations in the production data makes it possible to plan for hiring of additional personnel for peak periods of production or to accumulate an inventory of raw materials or to allocate vacation time to personnel, and so on. Some of the methods used for the measurement of seasonal variations are described as follows.

8.4.1 Simple Averages

This is the simplest method of isolating seasonal fluctuations in time series. It is based on the assumption that the series contain only the seasonal and irregular fluctuations. Assume that the time series involve monthly data over a time period of, say, five years. Assume further that we want to find the seasonal index for the month of March. (The seasonal variation will be the same for March in every year. Seasonal index describes the degree of seasonal variation.)

Then the seasonal index for the month of March will be calculated as follows:

$$\text{Seasonal index for March} = \frac{\text{Monthly average for March}}{\text{Average of monthly averages}} \times 100$$

The following steps can be used in the calculation of seasonal index (variation) for the month of March (or any month), over the 5-year period, regarding the sale of cars by one distributor.

(i) Calculate the average sale of cars for the month of March over the last five years.

(ii) Calculate the average sale of cars for each month over the five years and then calculate the average of these monthly averages.

(iii) Use the given formula to calculate seasonal index for March.

Let us say that the average sale of cars for the month of March over the period of 5 years is 360, and the average of all monthly averages is 316. Then the seasonal index for March = (360/316) × 100 = 113.92.
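The method of simple averages can be sketched as a short function. The function name `seasonal_indices` and the quarterly sales table are hypothetical and used only for illustration; the March figure reuses the averages of 360 and 316 quoted above.

```python
def seasonal_indices(table):
    """Seasonal indices (base 100) by the method of simple averages.

    table[year][period] holds the observation for each period (month or
    quarter) of each year; one index is returned per period.
    """
    n_years = len(table)
    n_periods = len(table[0])
    # Step (i)/(ii): average of each period across the years ...
    period_averages = [sum(row[p] for row in table) / n_years for p in range(n_periods)]
    # ... and the average of those period averages.
    grand_average = sum(period_averages) / n_periods
    # Step (iii): each period average as a percentage of the grand average.
    return [100 * avg / grand_average for avg in period_averages]


# The single-figure case from the text.
march_index = 100 * 360 / 316

# Hypothetical quarterly sales for three years (illustration only).
sales = [
    [20, 30, 40, 30],
    [22, 34, 44, 30],
    [24, 32, 42, 32],
]
indices = seasonal_indices(sales)

print(round(march_index, 2))
print([round(i, 2) for i in indices])
```

By construction the indices average to 100, so for quarterly data they add up to 400 and for monthly data to 1200.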

8.4.2 Moving Averages

This is the most widely used method of measuring seasonal variations. The seasonal index is based upon a mean of 100 with the degree of seasonal variation (seasonal index) measured by variations away from this base value. For example, if we look at the seasonality of rental of row boats at the lake during the three summer months (a quarter) and we find that the seasonal index is 135 and we also know that the total boat rentals for the entire last year was 1680, then we can estimate the number of summer rentals for the row boats.

The average number of boats rented per quarter = 1680/4 = 420.

The seasonal index, 135 for the summer quarter means that the summer rentals are 135 percent of the average quarterly rentals.

Hence, summer rentals = 420 × (135/100) = 567.

The steps required to compute the seasonal index can be enumerated by illustrating an example.


Example 8.3: Assume that a record of rental of row boats for the previous 3 years on a quarterly basis is given as follows:

Year         Rentals Per Quarter         Total
          I       II      III     IV
1991     350     300     450     400     1500
1992     330     360     500     410     1600
1993     370     350     520     440     1680

Solution:

Step 1. The first step is to calculate the four-quarter moving total for time series. This total is associated with the middle data point in the set of values for the four quarters, shown as follows.

Year    Quarters    Rentals    Moving Total
1991    I           350
        II          300
                               1500
        III         450
        IV          400

The moving total for the given values of four quarters is 1500, which is simply the addition of the four quarter values. This value of 1500 is placed in the middle of values 300 and 450 and recorded in the next column. For the next moving total of the four quarters, we will drop the value of the first quarter, which is 350, from the total and add the value of the fifth quarter (in other words, first quarter of the next year), and this total will be placed in the middle of the next two values, which are 450 and 400, and so on. These values of the moving totals are shown in column 4 of the next table.

Step 2. The next step is to calculate the quarter moving average. This can be done by dividing the four quarter moving total, as calculated in Step 1, by 4, since there are 4 quarters. The quarters moving average is recorded in column 5 in the next table. The entire table of calculations is shown as follows:

Year   Quarters   Rentals   Quarter    Quarter    Quarter    Percentage of
                            Moving     Moving     Centred    Actual to
                            Total      Average    Moving     Centred
                                                  Average    Moving Average
(1)    (2)        (3)       (4)        (5)        (6)        (7)

1991   I          350
       II         300
                            1500       375.0
       III        450                             372.50     120.80
                            1480       370.0
       IV         400                             377.50     105.96
                            1540       385.0
1992   I          330                             391.25      84.35
                            1590       397.5
       II         360                             398.75      90.28
                            1600       400.0
       III        500                             405.00     123.45
                            1640       410.0
       IV         410                             408.75     100.30
                            1630       407.5
1993   I          370                             410.00      90.24
                            1650       412.5
       II         350                             416.25      84.08
                            1680       420.0
       III        520
       IV         440

Step 3. After the moving averages for each consecutive 4 quarters have been taken, then we centre these moving averages. As we see from the above table, the quarterly moving average falls between the quarters. This is because the number of quarters is even, which is 4. If we had an odd number of time periods, such as 7 days of the week, then the moving average would already be centred and the third step here would not be necessary. Accordingly, we centre our averages in order to associate each average with the corresponding quarter, rather than between the quarters. This is shown in column 6, where the centred moving average is calculated as the average of the two consecutive moving averages.

The moving average (or the centred moving average) aims to eliminate seasonal and irregular fluctuations (S and I) from the original time series, so that this average represents the cyclical and trend components of the series.

As the following graph shows for this data, the centred moving average has smoothed the peaks and troughs of the original time series.

[Graph: original time series and centred moving average]

Step 4. Column 7 in the table contains calculated entries which are percentages of the actual values to the corresponding centred moving average values. For example, the first four-quarter centred moving average of 372.50 in the table has the corresponding actual value of 450, so that the percentage of actual value to centred moving average would be:

$$\frac{\text{Actual Value}}{\text{Centred Moving Average Value}} \times 100 = \frac{450}{372.5} \times 100 = 120.80$$

Step 5. The purpose of this step is to eliminate the remaining cyclical and irregular fluctuations still present in the values in Column 7 of the table. This can be done by calculating the ‘modified mean’ for each quarter. The modified mean for each quarter of the three-year time period under consideration is calculated as follows.

(i) Make a table of values in column 7 of the previous table (percentage of actual to moving average values) for each quarter of the three years as shown in the following table.

Year Quarter I Quarter II Quarter III Quarter IV

1991 – – 120.80 105.96

1992 84.35 90.28 123.45 100.30

1993 90.24 84.08 – –

(ii) We take the average of these values for each quarter. It should be noted that if there are many years and quarters taken into consideration instead of 3 years as we have taken, then the highest and the lowest values from each quarterly data would be discarded and the average of the remaining data would be considered. By discarding the highest and the lowest values from each quarter data, we tend to reduce the extreme cyclical and irregular fluctuations, which are further smoothed when we average the remaining values. Thus, the modified mean can be considered as an index of seasonal component. This modified mean for each quarter data is shown as follows:

$$
\begin{aligned}
\text{Quarter I} &= \frac{84.35 + 90.24}{2} = 87.295 \\
\text{Quarter II} &= \frac{90.28 + 84.08}{2} = 87.180 \\
\text{Quarter III} &= \frac{120.80 + 123.45}{2} = 122.125 \\
\text{Quarter IV} &= \frac{105.96 + 100.30}{2} = 103.13 \\
\text{Total} &= 399.73
\end{aligned}
$$

The modified means as calculated here are preliminary seasonal indices. These indices should average 100 per cent, or total 400 for the 4 quarters. However, our total is 399.73. This can be corrected by the following step.

Step 6. First, we calculate an adjustment factor. This is done by dividing the desired or expected total of 400 by the actual total obtained of 399.73, so that,

$$\text{Adjustment} = \frac{400}{399.73} = 1.0007$$


By multiplying the modified mean for each quarter by the adjustment factor, we get the seasonal index for each quarter, so that,

Quarter I = 87.295 × 1.0007 = 87.356

Quarter II = 87.180 × 1.0007 = 87.241

Quarter III = 122.125 × 1.0007 = 122.201

Quarter IV = 103.13 × 1.0007 = 103.202

Total = 400.000

Average seasonal index = 400/4 = 100

(This average seasonal index is approximated to 100 because of rounding off errors).
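Steps 4 to 6 can be traced with a short Python sketch. It assumes the rental figures and centred moving averages from the table, and the variable names are illustrative. Because it keeps full precision instead of rounding at each step, its results may differ from the figures in the text in the second or third decimal place.

```python
rentals = [350, 300, 450, 400, 330, 360, 500, 410, 370, 350, 520, 440]
centred = [372.50, 377.50, 391.25, 398.75, 405.00, 408.75, 410.00, 416.25]

# Step 4: percentage of actual to centred moving average, grouped by quarter.
# The first centred value belongs to the third quarter of the first year.
ratios = {1: [], 2: [], 3: [], 4: []}
for offset, cma in enumerate(centred):
    position = offset + 2            # index of the matching value in `rentals`
    quarter = position % 4 + 1       # 1 = quarter I, ..., 4 = quarter IV
    ratios[quarter].append(100 * rentals[position] / cma)

# Step 5: with only two ratios per quarter nothing is discarded,
# so the modified mean is just the plain mean.
modified_means = {q: sum(v) / len(v) for q, v in ratios.items()}

# Step 6: scale the means so that the four indices total 400.
adjustment = 400 / sum(modified_means.values())
seasonal_index = {q: m * adjustment for q, m in modified_means.items()}

for quarter, index in seasonal_index.items():
    print(quarter, round(index, 3))
```

With more years of data, the modified mean would first drop the highest and lowest ratio in each quarter, as described in Step 5.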

The logic behind this method is that the centred moving average represents the influence of secular trend and cyclical fluctuations (T × C), so dividing the actual values by it eliminates these two components. This may be represented by the following expression:

$$\frac{T \times S \times C \times I}{T \times C} = S \times I$$

Here, (T × S × C × I) is the influence of trend, seasonal variations, cyclic fluctuations and irregular or chance variations.

Thus, the ratio to moving average represents the influence of seasonal and irregular components. However, if these ratios for each quarter over a period of years are averaged, then most random or irregular fluctuations would be eliminated so that,

$$\frac{S \times I}{I} = S$$

and this would give us the value of seasonal influences.

8.4.3 Measuring Irregular Variation and Seasonal Adjustments

Typically, irregular variation is random in nature, unpredictable and occurs over comparatively short periods of time. Because of its unpredictability, it is generally not measured or explained mathematically. Usually, subjective and logical reasoning explains such variation. For example, cold weather in Brazil and Colombia is considered responsible for increase in the price of coffee beans, because cold weather destroys coffee plants. Similarly, the Persian Gulf War, an irregular factor, resulted in an increase in airline and ship travel for a number of months because of the movement of personnel and supplies. However, the irregular component can be isolated by eliminating other components from the time series data. For example, time series data contains (T × S × C × I) components and if we can eliminate (T × S × C) elements from the data, then we are left with the (I) component. We can follow the previous example to determine the (I) component as follows. The data presented has already been provided or calculated.

Year   Quarter   Rentals: Time Series Values (T × S × C × I)   Centred Moving Average (T × C)   (T × S × C × I)/(T × C) = S × I

1991 I 350 – –

II 300 – –

III 450 372.50 1.208

IV 400 377.50 1.060

1992 I 330 391.25 0.843

II 360 398.75 0.903

III 500 405.00 1.235

IV 410 408.75 1.003

1993 I 370 410.00 0.902

II 350 416.25 0.841

III 520 – –

IV 440 – –

The seasonal indices for each quarter have already been calculated as:

Quarter I = 87.356

Quarter II = 87.241

Quarter III = 122.201

Quarter IV = 103.202


Then the seasonal influence is given by:

Quarter I = 87.356/100 = 0.874

Quarter II = 87.241/100 = 0.872

Quarter III = 122.201/100 = 1.222

Quarter IV = 103.202/100 = 1.032

Making another table of (S × I) values and (S) values and dividing (S × I) by (S), we get the values of (I) as follows:

Year Quarters (S × I) (S) (I)

1991 I – – –

II – – –

III 1.208 1.222 0.988

IV 1.060 1.032 1.027

1992 I 0.843 0.874 0.965

II 0.903 0.872 1.036

III 1.235 1.222 1.011

IV 1.003 1.032 0.972

1993 I 0.902 0.874 1.032

II 0.841 0.872 0.964

III – – –

IV – – –
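The division of (S × I) by (S) can be checked with a small Python sketch. The values are copied from the two tables above; the list and dictionary names are illustrative. Results are left unrounded, so the last digit may differ slightly from the printed table.

```python
# (S x I) values for 1991 quarter III through 1993 quarter II.
s_times_i = [1.208, 1.060, 0.843, 0.903, 1.235, 1.003, 0.902, 0.841]
quarters = [3, 4, 1, 2, 3, 4, 1, 2]

# Seasonal influence (S) for each quarter: seasonal index divided by 100.
seasonal = {1: 0.874, 2: 0.872, 3: 1.222, 4: 1.032}

# Irregular component: I = (S x I) / S.
irregular = [si / seasonal[q] for si, q in zip(s_times_i, quarters)]

for q, value in zip(quarters, irregular):
    print(q, round(value, 3))
```

Values of (I) close to 1 indicate quarters in which chance factors had little effect on rentals.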

Seasonal Adjustments

Many times, we read about time series values as seasonally adjusted. This is accomplished by dividing the original time series values by their corresponding seasonal indices. These deseasonalized values allow more direct and equitable comparisons of values from different time periods. For example, in comparing the demands for rental row boats (example that we have been following), it would not be equitable to compare the demand of second quarter (spring) with the demand of third quarter (summer), when the demand is traditionally higher. However, these demand values can be compared when we remove the seasonal influence from these time series values.

The seasonally-adjusted values for the demand of row boats in each quarter are based on the values previously calculated and shown as follows.

Year   Quarter   Rentals (T × S × C × I)   (S)   Seasonally-Adjusted Values   Rounded-off Values

1991 I 350 – – –

II 300 – – –

III 450 1.222 368.25 368

IV 400 1.032 387.60 388

1992 I 330 0.874 377.57 378

II 360 0.872 412.80 413

III 500 1.222 409.16 409

IV 410 1.032 397.29 397

1993 I 370 0.874 423.34 423

II 350 0.872 401.38 401

III 520 – – –

IV 440 – – –

The seasonally-adjusted value for each quarter is calculated as:

$$\text{Seasonally-Adjusted Value} = \frac{\text{Original Value}}{\text{Seasonal Index}}$$
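The adjustment is a single division per quarter, as this short Python sketch shows. It uses the seasonal influences (index divided by 100) from the text; the function name `seasonally_adjust` is an illustrative choice.

```python
# Seasonal influence (S) for each quarter, taken from the text.
seasonal = {"I": 0.874, "II": 0.872, "III": 1.222, "IV": 1.032}


def seasonally_adjust(value, quarter):
    """Divide the original value by the seasonal influence of its quarter."""
    return value / seasonal[quarter]


# Two rows from the table: 1991 quarter III and 1991 quarter IV.
print(round(seasonally_adjust(450, "III"), 2))
print(round(seasonally_adjust(400, "IV"), 2))
```

After adjustment, the summer figure of 450 and the autumn figure of 400 become directly comparable.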

These calculations complete the process of separating and identifying the four components of the time series, namely secular trend (T), seasonal variation (S), cyclical variation (C) and irregular variation (I).

Activity 2

The following data represents the quarterly earnings per share of a software company for the last four years.

Quarter

Year 1 2 3 4

1st year 0.27 0.35 0.43 1.25

2nd year 0.40 0.55 0.45 1.35

3rd year 0.52 0.70 0.53 1.55

4th year 0.60 0.80 0.64 1.85

Analyse the quarterly time series to determine the effects of the trend, cyclical, seasonal and irregular components.

Self-Assessment Questions

5. Fill in the blanks with the appropriate terms.

(a) Seasonal variation has been defined as the ________________ and repetitive movement around the trend line in a period of one year or less.

(b) Time series values can be seasonally ______________ by dividing the original time series values by their corresponding seasonal indices.

6. State whether true or false.

(a) Simple average is a difficult method of isolating seasonal fluctuations in time series.

(b) Irregular variation is random in nature, unpredictable and occurs over comparatively short periods of time.

8.5 Summary

Let us recapitulate the important concepts discussed in this unit:

The time series analysis method is quite accurate where the future is expected to be similar to the past. The underlying assumption in time series is that the same factors will continue to influence the future patterns of economic activity in a similar manner as in the past.

Trend is a general long-term movement in the time series value of the variable (Y) over a fairly long period of time. The variable (Y) is the factor that we are interested in evaluating for the future.

Cyclic fluctuations refer to regular swings or patterns that repeat over a long period of time. The movements are considered cyclical only if they occur after time intervals of more than one year.

Changes in the climate and weather conditions have a profound effect on sales. Customs and traditions affect the pattern of seasonal spending.

Irregular or random variations are accidental, random or simply due to chance factors. Thus, they are wholly unpredictable.

When a time series shows an upward or downward long-term linear trend, then regression analysis can be used to estimate this trend and project the trends into forecasting the future values of the variables involved.

Cyclic variation is a pattern that repeats over time periods longer than one year. These variations are generally unpredictable in relation to the time of occurrence, duration as well as amplitude.

The measure used to identify cyclical variation is the percentage of trend and the procedure used is known as the residual trend.

Seasonal variation has been defined as the predictable and repetitive movement around the trend line in a period of one year or less. For the measurement of seasonal variation, the time interval involved may be in terms of days, weeks, months or quarters.

Seasonal index describes the degree of seasonal variation.

The moving average or the centred moving average aims to eliminate seasonal and irregular fluctuations (S and I) from the original time series, so that this average represents the cyclical and trend components of the series.

Irregular variation is random in nature, unpredictable and occurs over comparatively short periods of time.

8.6 Glossary

Seasonal variation: Patterns of change that repeat over a period of one year or less. The factors that cause seasonal variations are season and climate and customs and festivals.

Irregular variations: These variations are unpredictable and can be accidental, random or simply due to chance factors.

Cyclic variation: A pattern that repeats over time periods longer than one year.

8.7 Terminal Questions

1. Differentiate between secular trend and cyclic fluctuations.

2. How is irregular variation caused?

3. Define seasonal variation.

4. What do you understand by trend analysis?

5. How will you measure cyclical effect?

6. Describe the simple average method of isolating seasonal fluctuations in time series.

7. What are the ways of measuring irregular variation?

8. How are seasonal adjustments made?

8.8 Answers

Answers to Self-Assessment Questions

1. (a) Movement; (b) Regular

2. (a) True; (b) True

3. (a) True; (b) False

4. (a) Estimate; (b) Repeats

5. (a) Predictable; (b) Adjusted

6. (a) False; (b) True

Answers to Terminal Questions

1. Refer Section 8.2

2. Refer Section 8.2

3. Refer Section 8.2

4. Refer Section 8.3.1

5. Refer Section 8.3.2

6. Refer Section 8.4.1

7. Refer Section 8.4.3

8. Refer Section 8.4.3

8.9 Further Reading

1. Elhance, D.N. Fundamentals of Statistics. Allahabad: Kitab Mahal, 2007.

2. Gupta, S.C. Fundamentals of Business Statistics. New Delhi: Sultan Chand & Sons, 2010.

Unit 9 Testing of Hypothesis

Structure

9.1 Introduction
Objectives
9.2 Hypothesis Formulation
9.3 Summary
9.4 Glossary
9.5 Terminal Questions
9.6 Answers
9.7 Further Reading

9.1 Introduction

In the previous unit, you learnt about time series analysis and the measurement of trend, cyclical, seasonal and irregular variations.

In this unit, you will learn about hypothesis, null and alternative hypotheses, critical region, penalty, standard error and hypothesis testing. A hypothesis is an assumption that is tested to find its logical or empirical consequences. It refers to a provisional idea, with no settled meaning as yet, whose merit needs evaluation. A hypothesis should be clear and accurate. Various concepts, such as null and alternative hypotheses, enable us to verify the testability of an assumption. During the course of hypothesis testing, some inferences about population parameters like the mean and proportion are made. Any useful hypothesis will enable predictions by reasoning, including deductive reasoning. Statistical decisions have to be made in the presence of uncertainty. The null hypothesis tested about the population mean is that it has a specific value µ. Testing a statistical hypothesis on the basis of a sample enables us to decide whether the hypothesis should be accepted or rejected. The Critical Region (CR) or Rejection Region (RR) is the set of values of the test statistic for which the null hypothesis is rejected in a hypothesis test.

Objectives

After studying this unit, you should be able to:

Describe the concepts of hypothesis and list the types of errors

Explain the null and alternate hypotheses

Discuss the concepts of critical region or the region of hypothesis rejection

Calculate standard errors of statistics

9.2 Hypothesis Formulation

A hypothesis is a tentative assumption that a researcher wants to test for its logical or empirical consequences. It refers to a provisional idea, with no settled meaning as yet, whose merit needs evaluation, though it is often also described as a convenient mathematical approach for simplifying cumbersome calculations. Setting up and testing hypotheses is an integral part of statistical inference. Hypotheses are often statements about population parameters like variance and expected value. During the course of hypothesis testing, some inferences about population parameters like the mean and proportion are made. Any useful hypothesis will enable predictions by reasoning, including deductive reasoning. According to Karl Popper, a hypothesis must be falsifiable: a proposition or theory cannot be called scientific if it does not admit the possibility of being shown false. A hypothesis might predict the outcome of an experiment in a lab setting or the observation of a phenomenon in nature. Thus, a hypothesis is a proposed explanation of a phenomenon, or a proposal suggesting a possible correlation between multiple phenomena.

The characteristics of hypothesis are:

Clear and accurate: Hypothesis should be clear and accurate so as to draw a consistent conclusion.

Statement of relationship between variables: If a hypothesis is relational, it should state the relationship between different variables.

Testability: A hypothesis should be open to testing so that other deductions can be made from it and can be confirmed or disproved by observation. The researcher should do some prior study to make the hypothesis a testable one.

Specific with limited scope: A hypothesis which is specific, with limited scope, is more easily testable than a hypothesis with limitless scope. Therefore, a researcher should spend more time doing research on such kind of hypothesis.

Simplicity: A hypothesis should be stated in the most simple and clear terms to make it understandable.

Consistency: A hypothesis should be reliable and consistent with established and known facts.

Time limit: A hypothesis should be capable of being tested within a reasonable time. In other words, it can be said that the excellence of a hypothesis is judged by the time taken to collect the data needed for the test.

Empirical reference: A hypothesis should explain or support all the sufficient facts needed to understand what the problem is all about.

A hypothesis is a statement or assumption concerning a population. For the purpose of decision-making, a hypothesis has to be verified and then accepted or rejected. This is done with the help of observations. We test a sample and make a decision on the basis of the result obtained. Decision-making plays a significant role in different areas such as marketing, industry and management.

9.2.1 Statistical Decision-Making

Testing a statistical hypothesis on the basis of a sample enables us to decide whether the hypothesis should be accepted or rejected. The sample data enable us to accept or reject the hypothesis. Since the sample data give incomplete information about the population, the result of the test need not be considered to be final or unchallengeable. The procedure that enables us to decide, on the basis of sample results, whether a hypothesis is to be accepted or rejected is called Hypothesis Testing or Test of Significance.

Note 1: A test provides evidence, if any, against a hypothesis, usually called a null hypothesis. The test cannot prove the hypothesis to be correct. It can give some evidence against it.

The test of hypothesis is a procedure to decide whether to accept or reject a hypothesis.

Note 2: The acceptance of a hypothesis implies only that there is no evidence from the sample that we should believe otherwise.

The rejection of a hypothesis leads us to conclude that it is false. This way of putting the problem is convenient because of the uncertainty inherent in the problem. In view of this we must always briefly state a hypothesis that we hope to reject.

A hypothesis stated in the hope of being rejected is called a null hypothesis and is denoted by H0.

If H0 is rejected, it may lead to the acceptance of an alternative hypothesis denoted by H1.

For example, a new fragrance soap is introduced in the market. The null hypothesis H0, which may be rejected, is that the new soap is not better than the existing soap.

Similarly, a die is suspected to be loaded. Roll the die a number of times to test this.

The Null Hypothesis H0: p = 1/6 for showing six.

The Alternative hypothesis H1: p ≠ 1/6.

For example, skulls found at an ancient site may all belong to race X or race Y on the basis of their diameters. We may test a hypothesis about the mean µ of the population from which the present skulls came. We have the hypotheses,

H0: µ = µx, H1: µ = µy

Here, we should not insist on calling either hypothesis null and the other alternative since the reverse could also be true.

9.2.2 Committing Errors: Type I and Type II

Types of Errors: There are two types of errors in statistical hypothesis testing, which are as follows:

o Type I Error: In this type of error, you reject a null hypothesis when it is true. It means rejection of a hypothesis which should have been accepted. Its probability is denoted by α (alpha) and it is also known as alpha error.

o Type II Error: In this type of error, you accept a null hypothesis when it is not true. It means accepting a hypothesis which should have been rejected. Its probability is denoted by β (beta) and it is also known as beta error.

Type I error can be controlled by fixing it at a lower level. For example, if you fix it at 2%, then the maximum probability to commit Type I error is 0.02. But reducing Type I error has a disadvantage when the sample size is fixed, as it increases the chances of Type II error. In other words, it can be said that both types of errors cannot be reduced simultaneously. The only solution of this problem is to set an appropriate level by considering the costs and penalties attached to them or to strike a proper balance between both types of errors.

In a hypothesis test, a Type I error occurs when the null hypothesis is rejected when it is in fact true; that is, H0 is wrongly rejected. For example, in a clinical trial of a new drug, the null hypothesis might be that the new drug is no better, on average, than the current drug; that is H0: there is no difference between the two drugs on average. A Type I error would occur if we concluded that the two drugs produced different effects, when in fact there was no difference between them.

In a hypothesis test, a Type II error occurs when the null hypothesis H0 is not rejected when it is in fact false. For example, in a clinical trial of a new drug, the null hypothesis might be that the new drug is no better, on average, than the current drug; that is H0: there is no difference between the two drugs on average. A Type II error would occur if it were concluded that the two drugs produced the same effect, that is, there is no difference between the two drugs on average, when in fact they produced different ones.

In how many ways can we commit errors?

We reject a hypothesis when it may be true. This is Type I Error.

We accept a hypothesis when it may be false. This is Type II Error.

The other true situations are desirable:

We accept a hypothesis when it is true. We reject a hypothesis when it is false.

            Accept H0                          Reject H0
H0 True     Accept True H0 (Desirable)         Reject True H0 (Type I Error)
H0 False    Accept False H0 (Type II Error)    Reject False H0 (Desirable)

The level of significance implies the probability of Type I error. A five per cent level implies that the probability of committing a Type I error is 0.05. A one per cent level implies 0.01 probability of committing Type I error.

Lowering the significance level, and hence the probability of Type I error, is good but unfortunately it raises the probability of committing a Type II error.

To sum up:

Type I Error: Rejecting H0 when H0 is true.

Type II Error: Accepting H0 when H0 is false.

Note. The probability of making a Type I error is the level of significance of a statistical test. It is denoted by α, where,

α = Prob. (Rejecting H0 / H0 true)

1 – α = Prob. (Accepting H0 / H0 true)

The probability of making a Type II error is denoted by β, where,

β = Prob. (Accepting H0 / H0 false)

1 – β = Prob. (Rejecting H0 / H0 false) = Prob. (The test correctly rejects H0 when H0 is false)

1 – β is called the power of the test. It depends on the level of significance α, sample size n and the parameter value.
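The trade-off between the two error probabilities can be illustrated numerically. The sketch below is only an illustration under stated assumptions: a right-tailed z-test with known standard deviation, and hypothetical values (hypothesised mean 11, true mean 12, standard deviation 3, sample size 36) that do not come from the text. It uses `NormalDist` from Python's standard `statistics` module.

```python
from math import sqrt
from statistics import NormalDist


def type_ii_error(alpha, mu0, mu1, sigma, n):
    """Beta for a right-tailed z-test of H0: mu = mu0 when the true mean is mu1."""
    se = sigma / sqrt(n)
    # Reject H0 when the sample mean exceeds this cut-off.
    cutoff = mu0 + NormalDist().inv_cdf(1 - alpha) * se
    # Beta is the chance that the sample mean stays below the cut-off under mu1.
    return NormalDist(mu1, se).cdf(cutoff)


# Hypothetical values, chosen only for illustration.
mu0, mu1, sigma, n = 11, 12, 3, 36

for alpha in (0.05, 0.01):
    beta = type_ii_error(alpha, mu0, mu1, sigma, n)
    print(f"alpha = {alpha:.2f}  beta = {beta:.3f}  power = {1 - beta:.3f}")
```

With the sample size held fixed, lowering the level of significance from 5% to 1% raises the probability of a Type II error and lowers the power of the test.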

9.2.3 Null and Alternate Hypotheses

Hypothesis is usually considered as the principal instrument in research. Thebasic concepts regarding the testability of a hypothesis are as follows:

Null Hypothesis: When two different methods are compared in terms of their superiority, the assumption that both the methods are equally good is called the null hypothesis. It is also known as the statistical hypothesis and is symbolized as H0.

Alternate Hypothesis: When two different methods are compared in terms of their superiority, the statement that a particular method is good or bad as compared to the other one is called the alternate hypothesis. It is symbolized as H1.

Comparison of Null Hypothesis with Alternate Hypothesis

Following are the points of comparison between null hypothesis and alternate hypothesis:

Null hypothesis is always specific, while alternate hypothesis gives an approximate value.

The rejection of null hypothesis involves great risk, which is not the case with alternate hypothesis.

Null hypothesis is more frequently used in statistics than alternate hypothesis because it is specific and is not based on probabilities.


The hypothesis to be tested is called the null hypothesis and is denoted by H0. This is to be tested against other possible states of nature called the alternative hypothesis. The alternative is usually denoted by H1.

The null hypothesis implies that there is no difference between the statistic and the population parameter. To test whether there is no difference between the sample mean X̄ and the population mean µ, we write the null hypothesis,

H0: X̄ = µ

The alternative hypothesis would be,

H1: X̄ ≠ µ

This means X̄ > µ or X̄ < µ. This is called a two-tailed hypothesis.

The alternative hypothesis H1: X̄ > µ is right tailed.

The alternative hypothesis H1: X̄ < µ is left tailed.

These are one sided or one-tailed alternatives.

Note 1: The alternative hypothesis H1 implies all such values of the parameter, which are not specified by the null hypothesis H0.

Note 2: Testing a statistical hypothesis is a rule, which leads to a decision to accept or reject a hypothesis.

A one-tailed test requires rejection of the null hypothesis when the sample statistic is greater than the population value or less than the population value at a certain level of significance.

1. We may want to test if the sample mean X̄ exceeds the population mean µ. Then the null hypothesis is,

H0: X̄ > µ

2. In the other case the null hypothesis could be,

H0: X̄ < µ

Each of these two situations leads to a one-tailed test and has to be dealt with in the same manner as the two-tailed test. Here, the critical region is on one side only, right for X̄ > µ and left for X̄ < µ. Both the Figures 9.1 and 9.2 here show a five per cent level of test of significance.

For example, a minister in a certain government has an average life of 11 months without being involved in a scam. A new party claims to provide ministers with an average life of more than 11 months without scam. We would like to test if, on the average, the new ministers last longer than 11 months. We may write the null hypothesis H0: µ = 11 and alternative hypothesis H1: µ > 11.

Figure 9.1 H0: X̄ > µ

Figure 9.2 H0: X̄ < µ

9.2.4 Critical Region

The Critical Region (CR), or Rejection Region (RR), is the set of values of the test statistic for which the null hypothesis is rejected in a hypothesis test. It means that the sample space for the test statistic is partitioned into two regions; one region, the critical region, will lead us to reject the null hypothesis H0, the other not. So, if the observed value of the test statistic is a member of the critical region, we conclude ‘reject H0’; if it is not a member of the critical region then we conclude ‘do not reject H0’.

We shall consider test problems arising out of Type I Error.

The level of significance of a test is the maximum probability with which we are willing to take a risk of Type I error.

If we take a 5% significance level (α = 0.05), we are 95% confident (1 – α = 0.95) that a right decision has been made.

A 1% significance level (α = 0.01) makes us 99% confident (1 – α = 0.99) about the correctness of the decision.

The critical region is the area of the sampling distribution in which the test statistic must fall for the null hypothesis to be rejected.

We can say that the critical region corresponds to the range of values of the statistic, which according to the test requires the hypothesis to be rejected.
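The limits of the critical region for a test statistic that follows the standard normal distribution can be computed directly. The following is a minimal sketch using `NormalDist` from Python's standard `statistics` module; the function names `critical_values` and `in_critical_region` are illustrative and not from the text.

```python
from statistics import NormalDist


def critical_values(alpha, tail="two"):
    """Return (lower, upper) limits of the acceptance region; None means unbounded."""
    z = NormalDist()
    if tail == "two":
        limit = z.inv_cdf(1 - alpha / 2)
        return (-limit, limit)
    if tail == "right":
        return (None, z.inv_cdf(1 - alpha))
    if tail == "left":
        return (z.inv_cdf(alpha), None)
    raise ValueError("tail must be 'two', 'right' or 'left'")


def in_critical_region(z_value, alpha, tail="two"):
    """True when the observed z falls in the rejection region."""
    lower, upper = critical_values(alpha, tail)
    below = lower is not None and z_value < lower
    above = upper is not None and z_value > upper
    return below or above


print(critical_values(0.05))           # limits close to -1.96 and 1.96
print(in_critical_region(2.5, 0.05))   # beyond the upper limit, so H0 is rejected
```

For a 5% two-tailed test the limits come out near ±1.96, so an observed z between them leads to ‘do not reject H0’.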

Two-tailed and One-tailed Tests: A two-tailed test rejects the null hypothesis if the sample mean is either more or less than the hypothesized value of the mean of the population. It is considered to be apt when the null hypothesis is of some specific value whereas the alternate hypothesis is that the parameter is not equal to the value of the null hypothesis. In a two-tailed curve there are two rejection regions, also called critical regions, as shown in Figure 9.3.

[Figure: acceptance and rejection regions in case of a two-tailed test (with 5% significance level). Accept H0 if the sample mean X̄ falls in the acceptance region, which covers 0.475 of area on each side of the hypothesised mean (both taken together equal 0.95 or 95% of area). Reject H0 if the sample mean X̄ falls in either of the two rejection regions beyond the limits Z = –1.96 and Z = 1.96.]

Figure 9.3 Critical Region

Conditions for the Occurrence of One-tailed Test: When we wish to test whether the population mean is either lower or higher than some hypothesised value, a one-tailed test is considered to be appropriate. When the rejection region is only on the left tail of the curve, it is known as a left-tailed test (refer Figure 9.4).

Figure 9.4 Left-Tailed Test

For example, what will happen if the acceptance region is made larger? α will decrease. It will be more easily possible to accept H0 when H0 is false (Type II error), i.e., it will lower the probability α of making a Type I error, but raise that of β, Type II error. α, β are probabilities of making an error; 1 – α, 1 – β are probabilities of making correct decisions (refer Figure 9.5).

Figure 9.5 Acceptance Region


Example 9.1: Can we say α + β = 1?

Solution: No. Each is concerned with a different type of error. But the two are not independent of each other.

9.2.5 Penalty

Usually Type II error is considered the worse of the two, though it is mainly the circumstances of a case that decide the answer to this question.

If Type I error means accepting the hypothesis that a guilty person is innocent and if Type II error means accepting the hypothesis that an innocent person is guilty, then Type II error would be dangerous. The penalties and costs associated with an error determine the balance or trade off between Type I and Type II errors.

Usually Type I error is shown as the shaded area, say 5% of a normal curve which is supposed to represent the data. If the sample statistic, say the sample mean, falls in the shaded area, the hypothesis is rejected at 5 per cent level of significance.

9.2.6 Standard Error

The concept of Standard Error (SE) of statistics is used to test the precision of a sample and provides the confidence limits for the corresponding population parameter.

The statistic may be the sample arithmetic mean X̄, the sample proportion p, etc.

The SE of any such statistic is the standard deviation of the sampling distribution of the statistic. Given below is SE in common use.

SE of the difference between two means X̄1, X̄2, or two proportions p1, p2, for sample sizes n1, n2, can be stated as,

$$SE(\bar{X}_1 - \bar{X}_2) = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$

$$SE(p_1 - p_2) = \sqrt{\frac{P_1 Q_1}{n_1} + \frac{P_2 Q_2}{n_2}}$$

Where, n is the number of observations

X̄ is the sample mean

σ is the population standard deviation

p is the sample proportion, q = 1 – p

P is the population proportion, Q = 1 – P

9.2.7 Testing of Hypothesis

A note on statistical decision-making

Statistical decisions have to be made in the presence of uncertainty. In testing of hypothesis, the choice is between H0 and H1. In estimation, there are several choices available. The design of experiments requires one to choose between the nature and extent of observations. All this has to be done in the presence of uncertainty. A decision function D(x) assigns to every possible outcome a unique action. This may result in loss, positive or negative, depending on an unknown parameter w.

So, the loss function is L(w, D) which, since it depends on the outcome x, is a random variable. Its expected value is called the risk function.

9.2.8 Tests for a Sample Mean $\bar{X}$

We have to test the null hypothesis that the population mean has a specified value µ, i.e., H0: $\bar{X}$ = µ. For large n, if H0 is true then,

$$z = \frac{\bar{X} - \mu}{SE(\bar{X})}$$

is approximately normal. The theoretical region for z depending on the desired level of significance can be calculated.

For example, a factory produces items, each weighing 5 kg with variance 4. Can a random sample of size 900 with mean weight 4.45 kg be justified as having been taken from this factory?

$n = 900$, $\bar{X} = 4.45$, $\mu = 5$, $\sigma = \sqrt{4} = 2$

$$z = \frac{\bar{X} - \mu}{SE(\bar{X})} = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} = \frac{4.45 - 5}{2/30} = -8.25$$

We have | z | > 3. The null hypothesis is rejected. The sample may not be regarded as originating from the factory at the 0.27% level of significance (corresponding to a 99.73% acceptance region).
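The arithmetic of the factory example can be checked with a short Python sketch. The function name `z_one_sample` and its parameter names are assumptions introduced here for illustration; only the numbers come from the text.

```python
import math


def z_one_sample(sample_mean, mu, sigma, n):
    """Return z = (sample mean - mu) / (sigma / sqrt(n)) for a large sample."""
    standard_error = sigma / math.sqrt(n)
    return (sample_mean - mu) / standard_error


# Factory example: mu = 5 kg, variance 4 (so sigma = 2), n = 900, sample mean 4.45 kg
z = z_one_sample(sample_mean=4.45, mu=5.0, sigma=math.sqrt(4), n=900)
print(round(z, 2))

# |z| > 3 places the sample outside the 99.73% acceptance region
reject_h0 = abs(z) > 3
print(reject_h0)
```

The sign of z only shows that the sample mean lies below the hypothesised mean; the decision rests on | z |.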

9.2.9 Test for Equality of Two Proportions

If $p_1$, $p_2$ are the proportions of some characteristic in two samples of sizes $n_1$, $n_2$, drawn from populations with proportions $P_1$, $P_2$, then we have $H_0: P_1 = P_2$ vs $H_1: P_1 \neq P_2$

Case (I): If H0 is true, then let $P_1 = P_2 = p$, where p can be found from the data:

$$p = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2}, \qquad q = 1 - p$$

p is the weighted mean of the two sample proportions.

$$SE(p_1 - p_2) = \sqrt{pq\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$

$$z = \frac{p_1 - p_2}{SE(p_1 - p_2)}$$

is approximately normal (0, 1).

We write z ~ N(0, 1)

The usual rules for rejection or acceptance are applicable here.

Case (II): If it is assumed that the proportion under question is not the same in the two populations from which the samples are drawn and that $P_1$, $P_2$ are the true proportions, we write,

$$SE(p_1 - p_2) = \sqrt{\frac{P_1 Q_1}{n_1} + \frac{P_2 Q_2}{n_2}}$$

where $Q_1 = 1 - P_1$ and $Q_2 = 1 - P_2$.

We can also write the confidence interval for $P_1 - P_2$.

For two independent samples of sizes $n_1$, $n_2$ selected from two binomial populations, the 100(1 – α)% confidence limits for $P_1 - P_2$ are,

$$(p_1 - p_2) \pm z_{\alpha/2} \sqrt{\frac{P_1 Q_1}{n_1} + \frac{P_2 Q_2}{n_2}}$$

The 90% confidence limits would be [with α = 0.1, 100(1 – α) = 90]

$$(p_1 - p_2) \pm 1.645 \sqrt{\frac{P_1 Q_1}{n_1} + \frac{P_2 Q_2}{n_2}}$$

Example 9.2: Out of 5000 interviewees, 2400 are in favour of a proposal, and out of another set of 2000 interviewees, 1200 are in favour. Is the difference significant?

Solution: Given,

$$p_1 = \frac{2400}{5000} = 0.48, \qquad p_2 = \frac{1200}{2000} = 0.6$$

$n_1 = 5000$, $n_2 = 2000$

$$SE = \sqrt{\frac{0.48 \times 0.52}{5000} + \frac{0.6 \times 0.4}{2000}} = 0.013 \quad \text{(using Case (II))}$$

$$z = \frac{|p_1 - p_2|}{SE} = \frac{0.12}{0.013} = 9.2 > 3$$

The difference is highly significant at 0.27% level.
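As a hedged illustration of Cases (I) and (II), the sketch below recomputes Example 9.2 both ways. The helper names `se_unpooled` and `se_pooled` are hypothetical labels chosen here; the sample proportions stand in for the unknown population proportions, as the worked solution does.

```python
import math


def se_unpooled(p1, n1, p2, n2):
    """Case (II): SE of p1 - p2 when the two proportions are not assumed equal."""
    return math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)


def se_pooled(p1, n1, p2, n2):
    """Case (I): SE of p1 - p2 under H0, using the pooled proportion p."""
    p = (n1 * p1 + n2 * p2) / (n1 + n2)
    return math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))


# Example 9.2: 2400 of 5000 and 1200 of 2000 interviewees in favour
p1, n1 = 2400 / 5000, 5000
p2, n2 = 1200 / 2000, 2000

se = se_unpooled(p1, n1, p2, n2)
z = abs(p1 - p2) / se
z_pooled = abs(p1 - p2) / se_pooled(p1, n1, p2, n2)

print(round(se, 3), round(z, 1), round(z_pooled, 1))
```

Both versions give | z | well above 3, so the conclusion does not depend on which standard error is used in this example.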

9.2.10 Large Sample Test for Equality of Two Means $\bar{X}_1$, $\bar{X}_2$

Suppose two samples of sizes $n_1$ and $n_2$ are drawn from populations having means $\mu_1$, $\mu_2$ and standard deviations $\sigma_1$, $\sigma_2$.

To test the equality of means $\bar{X}_1$, $\bar{X}_2$ we write,

$$H_0: \mu_1 = \mu_2$$

$$H_1: \mu_1 \neq \mu_2$$

If we assume H0 is true then,

$$z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$$

is approximately normally distributed with mean 0 and S.D. = 1.

We write z ~ N(0, 1)

As usual, if | z | > 2 we reject H0 at 4.55% level of significance and so on.

Example 9.3: Two groups of sizes 121 and 81 are subjected to tests. Their means are found to be 84 and 81 and standard deviations 10 and 12. Test for the significance of difference between the groups.

Solution: $\bar{X}_1 = 84$, $\bar{X}_2 = 81$, $n_1 = 121$, $n_2 = 81$, $\sigma_1 = 10$, $\sigma_2 = 12$

$$z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} = \frac{84 - 81}{\sqrt{\dfrac{100}{121} + \dfrac{144}{81}}} = 1.86 < 1.96$$

The difference is not significant at the 5% level of significance.
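The calculation in Example 9.3 can be reproduced with a few lines of Python. This is a minimal sketch; `z_two_means` is a name assumed here, and the 1.96 cut-off is the two-tailed 5% value used in the text.

```python
import math


def z_two_means(mean1, sd1, n1, mean2, sd2, n2):
    """Large-sample z for the difference between two sample means."""
    standard_error = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (mean1 - mean2) / standard_error


# Example 9.3: groups of 121 and 81, means 84 and 81, S.D. 10 and 12
z = z_two_means(mean1=84, sd1=10, n1=121, mean2=81, sd2=12, n2=81)
print(round(z, 2))

significant_at_5_percent = abs(z) > 1.96
print(significant_at_5_percent)
```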

9.2.11 Small Sample Tests of Significance

The sampling distribution of many statistics for large samples is approximately normal. For small samples with n < 30, the normal distribution, as shown in Example 9.3, can be used only if the sample is from a normal population with known σ.

If σ is not known, we can use Student's t distribution instead of the normal. We then replace σ by the sample standard deviation s with some modification as given below.

Let x1, x2, ..., xn be a random sample of size n drawn from a normal population with mean µ and S.D. σ. Then,

$$t = \frac{\bar{x} - \mu}{s/\sqrt{n-1}}$$

Here, t follows the Student's t distribution with n – 1 degrees of freedom.

Note: For small samples of n < 30, the term $\sqrt{n-1}$, in SE = $s/\sqrt{n-1}$, corrects the bias resulting from the use of sample standard deviation as an estimator of σ. Also,

$$s^2 = \frac{n-1}{n} S^2 \quad \text{or} \quad s = S\sqrt{\frac{n-1}{n}}$$

where s is computed with divisor n and S with divisor n – 1.

Procedure: Small Samples

To test the null hypothesis $H_0: \bar{X} = \mu$, against the alternative hypothesis $H_1: \bar{X} \neq \mu$

Calculate

$$t = \frac{\bar{X} - \mu}{SE(\bar{X})}$$

and compare it with the table value with n – 1 degrees of freedom (d.f.) at level of significance α per cent

If this value > table value, reject H0

If this value < table value, accept H0

(Significance level idea same as for large samples)

We can also find the 95% (or any other) confidence limits for µ.

For the two-tailed test (use the same rules as for large samples; substitute t for z) the 95% confidence limits are,

$$\bar{X} \pm t_{0.025} \, \frac{s}{\sqrt{n-1}}$$

Rejection Region. At α% level for two-tailed test if | t | > $t_{\alpha/2}$ reject.

For one-tailed test, (right) if t > $t_{\alpha}$ reject

(left) if t < –$t_{\alpha}$ reject

At 5 per cent level the three cases are,

If | t | > t0.025 reject two-tailed

If t > t0.05 reject one-tailed right

If t < –t0.05 reject one-tailed left

For proportions, the same procedure is to be followed.

Example 9.4: A firm produces tubes of diameter 2 cm. A sample of 10 tubes is found to have a mean diameter of 2.01 cm and variance 0.004. Is the difference significant? Given t0.05,9 = 2.26

Solution:

$$t = \frac{\bar{X} - \mu}{s/\sqrt{n-1}} = \frac{2.01 - 2}{\sqrt{0.004}/\sqrt{10-1}} = \frac{0.01}{0.021} = 0.48$$

Since, |t| < 2.26, the difference is not significant at 5% level.
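The small-sample statistic of Example 9.4 can be sketched in Python as follows. The function name `t_small_sample` is an assumption made here; the formula is the one given above, with s taken as the square root of the stated variance.

```python
import math


def t_small_sample(sample_mean, mu, s, n):
    """t = (sample mean - mu) / (s / sqrt(n - 1)), with n - 1 degrees of freedom."""
    standard_error = s / math.sqrt(n - 1)
    return (sample_mean - mu) / standard_error


# Example 9.4: 10 tubes, mean diameter 2.01 cm, variance 0.004, specified diameter 2 cm
t = t_small_sample(sample_mean=2.01, mu=2.0, s=math.sqrt(0.004), n=10)
print(round(t, 3))

table_value = 2.26  # value given in the example for 9 d.f.
significant = abs(t) > table_value
print(significant)
```

Carried at full precision the statistic is about 0.47; the 0.48 in the worked solution comes from rounding the denominator to 0.021 first. The conclusion is the same either way.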

9.2.12 Paired Observations

t-Test for the Difference of Means

Let (x1, y1), (x2, y2), ..., (xn, yn) be the pairs of values for the same subjects, e.g., sales data before (x) and after (y) an advertisement campaign, or performance of candidates before (x) and after (y) training.

We have to test the significance of the difference between x, y values.

For each pair (xi, yi), find di = xi – yi

H0: µ1 = µ2, i.e., no difference before and after, and H1: µ1 ≠ µ2

We find the mean $\bar{d}$ of d values and use the statistics:

$$t = \frac{\bar{d}}{S/\sqrt{n}}, \qquad S = \sqrt{\frac{\sum (d - \bar{d})^2}{n-1}}$$

Example 9.5: Eleven students were given a test and their marks noted. After training, their marks in a second test were noted. Do the marks indicate any benefit from training?

Solution:

Student 1 2 3 4 5 6 7 8 9 10 11

x 23 20 19 21 18 20 18 17 23 16 19

y 24 19 22 18 20 22 20 20 23 20 17

di –1 1 –3 3 –2 –2 –2 –3 0 –4 2

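$$\bar{d} = \frac{\sum d}{n} = \frac{-11}{11} = -1$$

$$S = \sqrt{\frac{\sum (d - \bar{d})^2}{n-1}} = \sqrt{\frac{50}{10}} = 2.24$$

d.f. = 11 – 1 = 10

$$t = \frac{\bar{d}}{S/\sqrt{n}} = \frac{-1}{2.24/\sqrt{11}} = -1.48$$

For 10 d.f. the two-tailed table value of t at the 5% level is 2.228, which | t | = 1.48 does not exceed.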

The difference is not significant.
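The paired computation for Example 9.5 can be verified with the short Python sketch below. Variable names such as `d_bar` and `S` simply mirror the symbols in the formula; nothing beyond the marks in the table is assumed.

```python
import math

# Marks before (x) and after (y) training for the eleven students
x = [23, 20, 19, 21, 18, 20, 18, 17, 23, 16, 19]
y = [24, 19, 22, 18, 20, 22, 20, 20, 23, 20, 17]

d = [xi - yi for xi, yi in zip(x, y)]  # differences d_i = x_i - y_i
n = len(d)
d_bar = sum(d) / n                     # mean difference

# S uses the divisor n - 1, as in the formula for paired observations
S = math.sqrt(sum((di - d_bar) ** 2 for di in d) / (n - 1))
t = d_bar / (S / math.sqrt(n))

print(d_bar, round(S, 2), round(t, 2), n - 1)
```

The negative sign of t reflects that marks after training are, on average, one mark higher than before.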

9.2.13 Test for a Given Population Variance

In the test for a given population variance, since the variance is the square of the standard deviation, whatever you say about a variance can be, for all practical purposes, extended to a population standard deviation.

To test the hypothesis that a sample x1, x2, …, xn of size n has a specified variance σ0²:

Null hypothesis

$$H_0: \sigma^2 = \sigma_0^2 \quad \text{or} \quad \sigma = \sigma_0$$

$$H_1: \sigma^2 > \sigma_0^2$$

Test statistic

$$\chi^2 = \frac{ns^2}{\sigma_0^2} = \frac{\sum (x - \bar{x})^2}{\sigma_0^2}$$

If χ² is greater than the table value, we reject the null hypothesis.

Activity 1

A dice is rolled 49152 times. Of these, 25149 times it shows 4, 5 or 6. Test the hypothesis that the dice is unbiased.

Self-Assessment Questions

1. Fill in the blanks with the appropriate terms.

(a) A hypothesis is an approximate _________________ that a researcher wants to test for its logical or empirical consequences.

(b) The Critical Region (CR) or Rejection Region (RR) is a set of values of the test statistic for which the ________________ hypothesis is rejected in a hypothesis test.

2. State whether true or false.

(a) Hypothesis should be clear and accurate so as to draw a consistent conclusion.

(b) Type I error cannot be controlled by fixing it at a lower level.

9.3 Summary

Let us recapitulate the important concepts discussed in this unit:

A hypothesis is an approximate assumption that a researcher wants to test for its logical or empirical consequences. It refers to a provisional idea whose merit needs evaluation, but having no specific meaning.

A hypothesis should be reliable and consistent with established and known facts.

In a hypothesis test, a Type I error occurs when the null hypothesis is rejected when it is in fact true.

Null hypothesis is always specific, while alternate hypothesis gives an approximate value.

A one-tailed test requires rejection of the null hypothesis when the sample statistic is greater than the population value or less than the population value at a certain level of significance.

The Critical Region (CR) or Rejection Region (RR) is a set of values of the test statistic for which the null hypothesis is rejected in a hypothesis test.

A two-tailed test rejects the null hypothesis if the sample mean is either more or less than the hypothesized value of the mean of the population.

The concept of Standard Error (SE) of statistics is used to test the precision of a sample and provides the confidence limits for the corresponding population parameter.

The sampling distribution of many statistics for large samples is approximately normal.

9.4 Glossary

Hypothesis: An approximate assumption about population parameters like variance and expected value that is tested by a researcher for its logical or empirical consequences.

Critical region: A set of values of the test statistic for which the null hypothesis is rejected and the alternate hypothesis is accepted in a hypothesis test.

Standard error: In statistics, it is used to test the precision of a sample and provides the confidence limits for the corresponding population parameter.

9.5 Terminal Questions

1. What is a hypothesis? Explain the characteristics of hypothesis.

2. Explain the importance of statistical decision-making.

3. What are type I and type II errors?

4. Differentiate between null and alternative hypotheses.

5. Describe critical region with the help of an example.

6. What are the conditions for the occurrence of one-tailed test?

7. What is penalty?

8. How is standard error calculated?

9.6 Answers

Answers to Self-Assessment Questions

1. (a) Assumption; (b) Null

2. (a) True; (b) False

Answers to Terminal Questions

1. Refer Section 9.2

2. Refer Section 9.2.1

3. Refer Section 9.2.2

4. Refer Section 9.2.3

5. Refer Section 9.2.4

6. Refer Section 9.2.4

7. Refer Section 9.2.5

8. Refer Section 9.2.6

9.7 Further Reading

1. Elhance, D.N. Fundamentals of Statistics. Allahabad: Kitab Mahal, 2007.

2. Gupta, S.C. Fundamentals of Business Statistics. New Delhi: Sultan Chand & Sons, 2010.

Unit 10 Chi-Square Test

Structure

10.1 Introduction
Objectives
10.2 Chi-Square Test
10.3 Summary
10.4 Glossary
10.5 Terminal Questions
10.6 Answers
10.7 Further Reading

10.1 Introduction

In the previous unit you learnt about testing of hypothesis. In some tests, the statistic used for accepting or rejecting a null hypothesis is χ². In this unit you will learn about the Chi-square test, also called the Chi-squared or χ² test. Any statistical hypothesis test in which the test statistic has a Chi-square distribution when the null hypothesis is true is termed a Chi-square test. Chi-square test is a non-parametric test of statistical significance for bivariate tabular analysis, also known as cross-breaks. Amongst the several tests used in statistics for judging the significance of the sampling data, the Chi-square test, developed by Karl Pearson, is considered an important test. Chi-square, symbolically written as χ² (pronounced as Ki-square), is a statistical measure with the help of which it is possible to assess the significance of the difference between the observed frequencies and the expected frequencies obtained from some hypothetical universe. Chi-square tests enable us to test and compare whether more than two population proportions can be considered equal. Hence, it is a statistical test commonly used to compare observed data with expected data and to test the null hypothesis, which states that there is no significant difference between the expected and the observed result.

Objectives

After studying this unit, you should be able to:

Explain the Chi-square test of significance

Describe the degrees of freedom

Define the conditions for the application of test

Explain the additive property of Chi-square

10.2 Chi-Square Test

Chi-square test is a non-parametric test of statistical significance for bivariate tabular analysis (also known as cross-breaks). Any appropriate test of statistical significance lets you know the degree of confidence you can have in accepting or rejecting a hypothesis. Typically, the Chi-square test is any statistical hypothesis test in which the test statistic has a chi-square distribution when the null hypothesis is true. It is performed on different samples (of people) who are different enough in some characteristic or aspect of their behaviour that we can generalize from the samples selected. The population from which our samples are drawn should also be different in the behaviour or characteristic. Amongst the several tests used in statistics for judging the significance of the sampling data, the Chi-square test, developed by Karl Pearson, is considered an important test. Chi-square, symbolically written as χ² (pronounced as Ki-square), is a statistical measure with the help of which it is possible to assess the significance of the difference between the observed frequencies and the expected frequencies obtained from some hypothetical universe. Chi-square tests enable us to test whether more than two population proportions can be considered equal. In order that the Chi-square test may be applicable, both the frequencies must be grouped in the same way and the theoretical distribution must be adjusted to give the same total frequency as that of the observed frequencies. χ² is calculated with the help of the following formula:

$$\chi^2 = \sum \frac{(f_0 - f_e)^2}{f_e}$$

Where, f0 means the observed frequency; and

fe means the expected frequency.

Whether or not a calculated value of χ² is significant can be ascertained by looking at the tabulated values of χ² (given at the end of this book in appendix part) for given degrees of freedom at a certain level of significance (generally a 5% level is taken). If the calculated value of χ² exceeds the table value, the difference between the observed and expected frequencies is taken as significant, but if the table value is more than the calculated value of χ², then the difference between the observed and expected frequencies is considered as insignificant, i.e., considered to have arisen as a result of chance and as such can be ignored.

10.2.1 Degrees of Freedom

As already stated in the earlier unit, the number of independent constraints determines the number of degrees of freedom (or df). If there are 10 frequency classes and there is one independent constraint, then there are (10 – 1) = 9 degrees of freedom. Thus, if n is the number of groups and one constraint is placed by making the totals of observed and expected frequencies equal, df = (n – 1); when two constraints are placed by making the totals as well as the arithmetic means equal then df = (n – 2) and so on. In the case of a contingency table (i.e., a table with two columns and more than two rows or a table with two rows but more than two columns or a table with more than two rows and more than two columns) or in the case of a 2 × 2 table the degrees of freedom is worked out as follows:

df = (c – 1)(r – 1)

Where, c = Number of columns

r = Number of rows

10.2.2 Conditions for the Application of Test

The following conditions should be satisfied before the test can be applied:

(i) Observations recorded and used are collected on a random basis.

(ii) All the members (or items) in the sample must be independent.

(iii) No group should contain very few items, say less than 10. In cases where the frequencies are less than 10, regrouping is done by combining the frequencies of adjoining groups so that the new frequencies become greater than 10. Some statisticians take this number as 5, but 10 is regarded as better by most of the statisticians.

(iv) The overall number of items (i.e., N) must be reasonably large. It should at least be 50, howsoever small the number of groups may be.

(v) The constraints must be linear. Constraints which involve linear equations in the cell frequencies of a contingency table (i.e., equations containing no squares or higher powers of the frequencies) are known as linear constraints.

10.2.3 Areas of Application of Chi-Square Test

Chi-square test is applicable in a large number of problems. The test is, in fact, a technique through the use of which it is possible for us to (a) Test the goodness of fit; (b) Test the homogeneity of a number of frequency distributions; and (c) Test the significance of association between two attributes. In other words, Chi-square test is a test of independence, goodness of fit and homogeneity. At times Chi-square test is used as a test of population variance also.

As a test of goodness of fit, χ² test enables us to see how well the distribution of observed data fits the assumed theoretical distribution such as Binomial distribution, Poisson distribution or the Normal distribution.

As a test of independence, χ² test helps explain whether or not two attributes are associated. For instance, we may be interested in knowing whether a new medicine is effective in controlling fever or not and χ² test will help us in deciding this issue. In such a situation, we proceed on the null hypothesis that the two attributes (viz., new medicine and control of fever) are independent, which means that the new medicine is not effective in controlling fever. It may, however, be stated here that χ² is not a measure of the degree of relationship or the form of relationship between two attributes but it simply is a technique of judging the significance of such association or relationship between two attributes.

As a test of homogeneity, χ² test helps us in stating whether different samples come from the same universe. Through this test, we can also explain whether the results worked out on the basis of sample/samples are in conformity with a well defined hypothesis or the results fail to support the given hypothesis. As such the test can be taken as an important decision-making technique.

As a test of population variance, Chi-square is also used to test the significance of population variance through confidence intervals, specially in case of small samples.

10.2.4 Steps Involved in Finding the Value of Chi-Square

The various steps involved are as follows:

(i) First of all calculate the expected frequencies.

(ii) Obtain the difference between observed and expected frequencies and find out the squares of these differences, i.e., calculate (f0 – fe)².

(iii) Divide the quantity (f0 – fe)² obtained, as stated above, by the corresponding expected frequency to get $\frac{(f_0 - f_e)^2}{f_e}$.

(iv) Then find the summation of the $\frac{(f_0 - f_e)^2}{f_e}$ values, or what we call $\sum \frac{(f_0 - f_e)^2}{f_e}$. This is the required χ² value.

The χ² value obtained as such should be compared with relevant table value of χ² and inference may be drawn as stated above.

The following examples illustrate the use of Chi-square test.

Example 10.1: A dice is thrown 132 times with the following results:

Number Turned Up 1 2 3 4 5 6

Frequency 16 20 25 14 29 28

Test the hypothesis that the dice is unbiased.

Solution: Let us take the hypothesis that the dice is unbiased. If that is so, the probability of obtaining any one of the six numbers is 1/6 and as such the expected frequency of any one number coming upward is 132 × 1/6 = 22. Now, we can write the observed frequencies along with expected frequencies and work out the value of χ² as follows:

No. Turned Up   Observed Frequency (f0)   Expected Frequency (fe)   (f0 – fe)   (f0 – fe)²   (f0 – fe)²/fe
1               16                        22                        –6          36           36/22
2               20                        22                        –2          4            4/22
3               25                        22                        3           9            9/22
4               14                        22                        –8          64           64/22
5               29                        22                        7           49           49/22
6               28                        22                        6           36           36/22

$$\sum \frac{(f_0 - f_e)^2}{f_e} = \frac{198}{22} = 9$$

Hence, the calculated value of χ² = 9

Degrees of freedom in the given problem is (n – 1) = (6 – 1) = 5

The table value of χ² for 5 degrees of freedom at 5% level of significance is 11.070. If we compare the calculated and table values of χ² we find that the calculated value is less than the table value and as such could have arisen due to fluctuations of sampling. The result thus supports the hypothesis and it can be concluded that the dice is unbiased.
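The steps for finding χ² can be sketched in Python using the dice data of Example 10.1. The function name `chi_square` is an assumption made for this illustration; it simply sums (f0 – fe)²/fe over the classes.

```python
def chi_square(observed, expected):
    """Sum of (f0 - fe)^2 / fe over all classes."""
    return sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))


# Example 10.1: 132 throws of a dice
observed = [16, 20, 25, 14, 29, 28]

# Under the hypothesis of an unbiased dice each face is expected 132 / 6 = 22 times
expected = [sum(observed) / len(observed)] * len(observed)

chi2 = chi_square(observed, expected)
df = len(observed) - 1  # one constraint: the totals are made equal

print(round(chi2, 2), df)
```

The resulting value is then compared with the tabulated χ² for the same degrees of freedom, as described in the text.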

Example 10.2: Find the value of χ² for the following information:

Class                                  A    B    C    D    E
Observed Frequency                     8    29   44   15   4
Theoretical (or Expected) Frequency    7    24   38   24   7

Solution: Since some of the frequencies are less than 10, we shall first regroup the given data as follows and then work out the value of χ²:

Class      Observed Frequency (f0)   Expected Frequency (fe)   (f0 – fe)   (f0 – fe)²/fe
A and B    (8 + 29) = 37             (7 + 24) = 31             6           36/31
C          44                        38                        6           36/38
D and E    (15 + 4) = 19             (24 + 7) = 31             –12         144/31

$$\chi^2 = \sum \frac{(f_0 - f_e)^2}{f_e} = \frac{36}{31} + \frac{36}{38} + \frac{144}{31} = 6.75 \text{ approx.}$$

The table value of χ² for two degrees of freedom at 5% level of significance is 5.991. The calculated value of χ² is higher than this table value, which means that the calculated value cannot be said to have arisen just because of chance. It is significant. Hence, the hypothesis does not hold good. This means that the sampling techniques adopted by the two investigators differ and are not similar. Naturally, then the technique of one must be superior to that of the other.
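The regrouping used in Example 10.2 can be illustrated with the Python sketch below. The helper `regroup` and the `groups` index tuples are hypothetical conveniences introduced here; which adjoining classes to combine is taken from the worked solution, not derived automatically.

```python
def regroup(frequencies, groups):
    """Combine class frequencies according to tuples of class positions."""
    return [sum(frequencies[i] for i in group) for group in groups]


# Example 10.2: classes A to E
observed = [8, 29, 44, 15, 4]
expected = [7, 24, 38, 24, 7]

# Combine A with B and D with E so that no class has a frequency below 10
groups = [(0, 1), (2,), (3, 4)]
observed_grouped = regroup(observed, groups)
expected_grouped = regroup(expected, groups)

chi2 = sum((fo - fe) ** 2 / fe
           for fo, fe in zip(observed_grouped, expected_grouped))
df = len(groups) - 1

print(observed_grouped, expected_grouped, round(chi2, 2), df)
```

Note that the degrees of freedom are counted after regrouping, which is why the example uses two rather than four.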

10.2.5 Alternative Formula for Finding the Value of Chi-Square in a (2 × 2) Table

There is an alternative method of calculating the value of χ² in the case of a (2 × 2) table. Let us write the cell frequencies and marginal totals in case of a (2 × 2) table as follows:

a          b          (a + b)
c          d          (c + d)
(a + c)    (b + d)    N

Then the formula for calculating the value of χ² will be stated as follows:

$$\chi^2 = \frac{(ad - bc)^2 \cdot N}{(a + c)(b + d)(a + b)(c + d)}$$

Where, N means the total frequency, ad means the larger cross product, bc means the smaller cross product and (a + c), (b + d), (a + b) and (c + d) are the marginal totals. The alternative formula is rarely used in finding out the value of Chi-square as it is not applicable uniformly in all cases but can be used only in a (2 × 2) contingency table.
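To see that the alternative formula agrees with the usual Σ(f0 – fe)²/fe calculation, the following Python sketch computes both for one table. The cell values a = 10, b = 20, c = 30, d = 40 are made-up illustrative figures, and both function names are assumptions introduced here.

```python
def chi_square_2x2(a, b, c, d):
    """Alternative formula: (ad - bc)^2 * N / product of the four marginal totals."""
    n = a + b + c + d
    return (a * d - b * c) ** 2 * n / ((a + c) * (b + d) * (a + b) * (c + d))


def chi_square_from_expected(a, b, c, d):
    """Usual formula, with expected frequency = row total * column total / N."""
    n = a + b + c + d
    cells = [[a, b], [c, d]]
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    total = 0.0
    for i in range(2):
        for j in range(2):
            fe = row_totals[i] * col_totals[j] / n
            total += (cells[i][j] - fe) ** 2 / fe
    return total


# Hypothetical cell frequencies, chosen only to illustrate the agreement
shortcut = chi_square_2x2(10, 20, 30, 40)
usual = chi_square_from_expected(10, 20, 30, 40)
print(round(shortcut, 4), round(usual, 4))
```

The shortcut avoids computing the four expected frequencies, which is its only advantage; it has no counterpart for larger tables.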

10.2.6 Yates’ Correction

F. Yates suggested a correction in the χ² value calculated in connection with a (2 × 2) table, particularly when cell frequencies are small (since no cell frequency should be less than 5 in any case, though 10 is better as stated earlier) and χ² is just on the significance level. The correction suggested by Yates is popularly known as Yates' correction. It involves the reduction of the deviation of observed from expected frequencies, which of course reduces the value of χ². The rule for correction is to adjust the observed frequency in each cell of a (2 × 2) table in such a way as to reduce the deviation of the observed from the expected frequency for that cell by 0.5, and this adjustment is made in all the cells without disturbing the marginal totals. The formula for finding the value of χ² after applying Yates' correction is written as under:

$$\chi^2 \text{(corrected)} = \frac{N \, (|ad - bc| - 0.5N)^2}{(a + b)(c + d)(a + c)(b + d)}$$

In case we use the usual formula for calculating the value of Chi-square, viz., $\chi^2 = \sum \frac{(f_0 - f_e)^2}{f_e}$, then Yates' correction can be applied as under:

$$\chi^2 \text{(corrected)} = \frac{(|f_{01} - f_{e1}| - 0.5)^2}{f_{e1}} + \frac{(|f_{02} - f_{e2}| - 0.5)^2}{f_{e2}} + \ldots$$

It may again be emphasized that Yates' correction is made only in case of a (2 × 2) table and that too when cell frequencies are small.
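The two ways of applying Yates' correction can be compared numerically. In the Python sketch below the cell frequencies a = 3, b = 7, c = 8, d = 2 are hypothetical small counts chosen for illustration, and the function names are assumptions made here.

```python
def yates_2x2(a, b, c, d):
    """Corrected chi-square: N(|ad - bc| - 0.5N)^2 / product of marginal totals."""
    n = a + b + c + d
    numerator = n * (abs(a * d - b * c) - 0.5 * n) ** 2
    return numerator / ((a + b) * (c + d) * (a + c) * (b + d))


def yates_cellwise(a, b, c, d):
    """Corrected chi-square: reduce each |f0 - fe| by 0.5 before squaring."""
    n = a + b + c + d
    cells = [[a, b], [c, d]]
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    total = 0.0
    for i in range(2):
        for j in range(2):
            fe = row_totals[i] * col_totals[j] / n
            total += (abs(cells[i][j] - fe) - 0.5) ** 2 / fe
    return total


def uncorrected_2x2(a, b, c, d):
    """Uncorrected chi-square for a 2 x 2 table, for comparison."""
    n = a + b + c + d
    return (a * d - b * c) ** 2 * n / ((a + b) * (c + d) * (a + c) * (b + d))


# Hypothetical small cell frequencies
corrected = yates_2x2(3, 7, 8, 2)
corrected_cellwise = yates_cellwise(3, 7, 8, 2)
plain = uncorrected_2x2(3, 7, 8, 2)
print(round(plain, 4), round(corrected, 4), round(corrected_cellwise, 4))
```

For these figures the corrected value is noticeably smaller than the uncorrected one, which is the effect the text describes.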

10.2.7 Chi-Square as a Test of Population Variance

χ² is used, at times, to test the significance of population variance (σp)² through confidence intervals. This, in other words, means that we can use χ² test to judge if a random sample has been drawn from a normal population with mean (µ) and with specified variance (σp)². In such a situation, the test statistic for a null hypothesis will be as under:

$$\chi^2 = \frac{\sum (X_i - \bar{X}_s)^2}{(\sigma_p)^2} = \frac{n(\sigma_s)^2}{(\sigma_p)^2}$$

with (n – 1) degrees of freedom.

By comparing the calculated value (with the help of the above formula) with the table value of χ² for (n – 1) df at a certain level of significance, we may accept or reject the null hypothesis. If the calculated value is equal to or less than the table value, the null hypothesis is to be accepted, but if the calculated value is greater than the table value, the hypothesis is rejected. All this can be made clear by an example.

Example 10.3: Weight of 10 students is as follows:

Sl. No. 1 2 3 4 5 6 7 8 9 10

Weight in kg. 38 40 45 53 47 43 55 48 52 49

Can we say that the variance of the distribution of weights of all students from which the above sample of 10 students was drawn is equal to 20 square kg? Test this at 5% and 1% level of significance.

Solution: First of all, we should work out the standard deviation of the sample (σs).

Calculation of the sample standard deviation:

Sl. No.   Xi (Weight in kg)   Xi – X̄s   (Xi – X̄s)²
1         38                  –9         81
2         40                  –7         49
3         45                  –2         4
4         53                  +6         36
5         47                  0          0
6         43                  –4         16
7         55                  +8         64
8         48                  +1         1
9         52                  +5         25
10        49                  +2         4

n = 10, ΣXi = 470, Σ(Xi – X̄s)² = 280

$$\bar{X}_s = \frac{\sum X_i}{n} = \frac{470}{10} = 47 \text{ kg}$$

$$\sigma_s = \sqrt{\frac{\sum (X_i - \bar{X}_s)^2}{n}} = \sqrt{\frac{280}{10}} = \sqrt{28} = 5.3 \text{ kg}$$

(σs)² = 28

Taking the null hypothesis as H0: (σp)² = (σs)²

$$\text{The test statistic } \chi^2 = \frac{n(\sigma_s)^2}{(\sigma_p)^2} = \frac{10 \times 28}{20} = \frac{280}{20} = 14$$

Degrees of freedom in this case is (n – 1) = 10 – 1 = 9

Degrees of freedom in this case is (n – 1) = 10 – 1 = 9

At 5% level of significance, the table value of χ² = 16.92, and at 1% level of significance it is 21.67 for 9 df, and both these values are greater than the calculated value of χ² which is 14. Hence, we accept the null hypothesis and conclude that the variance of the given distribution can be taken as 20 square kg at 5% as well as at 1% level of significance.
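The variance test of Example 10.3 can be reproduced with the short Python sketch below. Variable names are assumptions chosen for readability; the two table values are the ones quoted in the example for 9 df.

```python
weights = [38, 40, 45, 53, 47, 43, 55, 48, 52, 49]  # weights in kg
hypothesised_variance = 20                           # square kg

n = len(weights)
mean = sum(weights) / n
sum_of_squares = sum((w - mean) ** 2 for w in weights)

sample_variance = sum_of_squares / n  # divisor n, as in the worked solution
chi2 = n * sample_variance / hypothesised_variance
df = n - 1

table_5_percent = 16.92  # quoted in the example for 9 df
table_1_percent = 21.67

print(mean, sum_of_squares, sample_variance, chi2, df)
print(chi2 <= table_5_percent, chi2 <= table_1_percent)
```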

10.2.8 Additive Property of Chi-Square (χ²)

An important property of χ² is its additive nature. This means that several values of χ² can be added together and if the degrees of freedom are also added, this number gives the degrees of freedom of the total value of χ². Thus, if a number of χ² values have been obtained from a number of samples of similar data, then, because of the additive nature of χ², we can combine the various values of χ² by just simply adding them. Such addition of various values of χ² gives one value of χ² which helps in forming a better idea about the significance of the problem under consideration. The following example illustrates the additive property of χ².

Example 10.4: The following values of χ² are obtained from different investigations carried out to examine the effectiveness of a recently invented medicine for checking malaria.

Investigation χ² df

1 2.5 1

2 3.2 1

3 4.1 1

4 3.7 1

5 4.5 1

What conclusion would you draw about the effectiveness of the new medicine on the basis of the five investigations taken together?

Solution: By adding all the values of χ², we obtain a value equal to 18.0. Also by adding the various d.f. as given in the question, we obtain a figure 5. We can now state that the value of χ² for 5 degrees of freedom (when all the five investigations are taken together) is 18.0.

Let us take the hypothesis that the new medicine is not effective. The table value of χ² for 5 degrees of freedom at 5% level of significance is 11.070. But our calculated value is higher than this table value, which means that the difference is significant and is not due to chance. As such the hypothesis is wrong and it can be concluded that the new medicine is effective in checking malaria.
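The pooling in Example 10.4 amounts to two sums, as the Python sketch below shows. The list-of-pairs layout is an assumption made here for compactness.

```python
# (chi-square value, degrees of freedom) for each of the five investigations
investigations = [(2.5, 1), (3.2, 1), (4.1, 1), (3.7, 1), (4.5, 1)]

total_chi2 = sum(value for value, _ in investigations)
total_df = sum(df for _, df in investigations)

print(round(total_chi2, 1), total_df)
```

The combined value is then judged against the table value for the combined degrees of freedom, not against the table value for 1 df.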

10.2.9 Important Characteristics of Chi-Square (χ²) Test

(i) This test is based on frequencies and not on the parameters like mean and standard deviation.

(ii) This test is used for testing the hypothesis and is not useful for estimation.

(iii) This test possesses the additive property.

(iv) This test can also be applied to a complex contingency table with several classes and as such is a very useful test in research work.

(v) This test is an important non-parametric (or a distribution free) test as no rigid assumptions are necessary in regard to the type of population and there is no need of the parameter values. It involves less mathematical details.

A Word of Caution in Using χ² Test

Chi-square test is no doubt a most frequently used test but its correct application is equally an uphill task. It should be borne in mind that the test is to be applied only when the individual observations of the sample are independent, which means that the occurrence of one individual observation (event) has no effect upon the occurrence of any other observation (event) in the sample under consideration. The researcher, while applying this test, must remain careful about all these things and must thoroughly understand the rationale of this important test before using it and drawing inferences concerning his hypothesis.

Activity 1

200 digits were chosen at random from a set of tables. The frequencies of the digits were:

Digit 0 1 2 3 4 5 6 7 8 9

Frequency 18 19 23 21 16 25 22 20 21 15

Calculate χ².

Self-Assessment Questions

1. Fill in the blanks with the appropriate terms.

(a) Chi-square test is a non-parametric test of statistical significance for ______________ tabular analysis.

(b) χ² is used to test the significance of population variance (σp)² through _____________ intervals.

2. State whether true or false.

(a) Chi-square tests enable us to test whether more than two population proportions can be considered equal.

(b) Chi-square test is based on frequencies and also on the parameters like mean and standard deviation.

10.3 Summary

Let us recapitulate the important concepts discussed in this unit:

Chi-square test is a non-parametric test of statistical significance for bivariate tabular analysis (also known as cross-breaks).

The Chi-square test is any statistical hypothesis test in which the test statistic has a chi-square distribution when the null hypothesis is true.

Chi-square, symbolically written as χ² (pronounced as Ki-square), is a statistical measure with the help of which it is possible to assess the significance of the difference between the observed frequencies and the expected frequencies obtained from some hypothetical universe.

The correction suggested by Yates is popularly known as Yates' correction. It involves the reduction of the deviation of observed from expected frequencies, which of course reduces the value of χ².

χ² is used to test the significance of population variance (σp)² through confidence intervals.

An important property of χ² is its additive nature. This means that several values of χ² can be added together and if the degrees of freedom are also added, this number gives the degrees of freedom of the total value of χ².

10.4 Glossary

Chi-square test: A non-parametric test of statistical significance used to compare observed data with expected data. It also tests the validity of the null hypothesis.

Degrees of freedom: The number of independent observations in a sample of data available to estimate a parameter of the population from which that sample is drawn.

10.5 Terminal Questions

1. Explain Chi-square test. Why is it considered an important test in statistical analysis?

2. Describe the term ‘Degrees of Freedom’.

3. State the necessary conditions required for the application of the $\chi^2$ test.

4. What are the areas of application of Chi-square test?

5. How will you find the value of Chi-square?

6. Define Yates’ correction formula for Chi-square.

7. Chi-square can be used as a test of population variance. Explain.

8. Describe the additive properties of Chi-square.

9. Explain the important characteristics of Chi-square test.


10.6 Answers

Answers to Self-Assessment Questions

1. (a) Bivariate; (b) Confidence

2. (a) True; (b) False

Answers to Terminal Questions

1. Refer Section 10.2

2. Refer Section 10.2.1

3. Refer Section 10.2.2

4. Refer Section 10.2.3

5. Refer Section 10.2.4

6. Refer Section 10.2.6

7. Refer Section 10.2.7

8. Refer Section 10.2.8

9. Refer Section 10.2.9

10.7 Further Reading

1. Elhance, D.N. Fundamentals of Statistics. Allahabad: Kitab Mahal, 2007.

2. Gupta, S.C. Fundamentals of Business Statistics. New Delhi: Sultan Chand & Sons, 2010.


Unit 11 t-Test, z-Test and Analysis of Variance

Structure

11.1 Introduction
     Objectives
11.2 t-Test
11.3 z-Test
11.4 Analysis of Variance
11.5 Summary
11.6 Glossary
11.7 Terminal Questions
11.8 Answers
11.9 Further Reading

11.1 Introduction

In the previous unit, you learnt about the Chi-square or $\chi^2$ test, which is a non-parametric test of statistical significance for bivariate tabular analysis. In this unit you will learn about the t-test, the z-test and analysis of variance or ANOVA. The z-test and the t-test are basically similar in that both compare two means to suggest whether both samples come from the same population. A t-test is any statistical hypothesis test in which the test statistic follows a Student’s t distribution if the null hypothesis is true. It is most commonly applied when the test statistic would follow a normal distribution if the population standard deviation were known. Similarly, a z-test is any statistical test for which the distribution of the test statistic under the null hypothesis can be approximated by a normal distribution. In statistics, analysis of variance (ANOVA) is a collection of statistical models and their associated procedures in which the observed variance in a particular variable is partitioned into components attributable to different sources of variation.

Objectives

After studying this unit, you should be able to:

Explain the significance of t-test

Discuss the importance of z-test

Define analysis of variance or ANOVA

Explain degrees of freedom and F distribution


11.2 t-Test

William S. Gosset (pen name Student) developed a significance test and through it made a significant contribution to the theory of sampling applicable in case of small samples. When the population variance is not known, the test is commonly known as Student’s t-test and is based on the t distribution.

Like normal distribution, t distribution is also symmetrical but happens to be flatter than normal distribution. Moreover, there is a different t distribution for every possible sample size. As the sample size gets larger, the shape of the t distribution loses its flatness and becomes approximately equal to the normal distribution. In fact, for sample sizes of more than 30, the t distribution is so close to the normal distribution that we will use the normal to approximate the t distribution. Thus, when n is small, the t distribution is far from normal, but when n is infinite, it is identical to normal distribution.

For applying the t-test in the context of small samples, the t value is calculated first and then compared with the table value of t at a certain level of significance for the given degrees of freedom. If the calculated value of t exceeds the table value (say $t_{0.05}$), we infer that the difference is significant at the 5% level; but if the calculated value of t is less than the corresponding table value, the difference is not treated as significant.

The t-test is used when two conditions are fulfilled:

(i) The sample size is less than 30, i.e., when $n < 30$.

(ii) The population standard deviation ($\sigma_p$) must be unknown.

In using the t-test, we assume the following:

(i) That the population is normal or approximately normal;

(ii) That the observations are independent and the samples are randomly drawn samples;

(iii) That there is no measurement error;

(iv) That in the case of two samples, population variances are regarded as equal if equality of the two population means is to be tested.

The following formulae are commonly used to calculate the t value:

(i) To test the significance of the mean of a random sample

$$t = \frac{|\bar{X} - \mu|}{SE_{\bar{X}}}$$

Where, $\bar{X}$ = Mean of the sample, $\mu$ = Mean of the universe

$SE_{\bar{X}}$ = S.E. of mean in case of small sample and is worked out as follows:

$$SE_{\bar{X}} = \frac{\sigma_s}{\sqrt{n}} = \frac{\sqrt{\dfrac{\sum (X_i - \bar{X})^2}{n - 1}}}{\sqrt{n}}$$

and the degrees of freedom = (n – 1)

The above stated formula for t can as well be stated as under:

$$t = \frac{|\bar{X} - \mu|}{SE_{\bar{X}}} = \frac{|\bar{X} - \mu|}{\sqrt{\dfrac{\sum (X_i - \bar{X})^2}{n - 1}} \Big/ \sqrt{n}} = \frac{|\bar{X} - \mu|\,\sqrt{n}}{\sqrt{\dfrac{\sum (X_i - \bar{X})^2}{n - 1}}}$$

If we want to work out the probable or fiducial limits of the population mean ($\mu$) in case of small samples, we can use either of the following:

(a) Probable limits with 95% confidence level:

$$\bar{X} \pm SE_{\bar{X}}\,(t_{0.05})$$

(b) Probable limits with 99% confidence level:

$$\bar{X} \pm SE_{\bar{X}}\,(t_{0.01})$$

At other confidence levels, the limits can be worked out in a similar manner, taking the corresponding table value of t just as we have taken $t_{0.05}$ in (a) and $t_{0.01}$ in (b) above.

(ii) To test the difference between the means of the two samples

$$t = \frac{|\bar{X}_1 - \bar{X}_2|}{SE_{\bar{X}_1 - \bar{X}_2}}$$

Where, $\bar{X}_1$ = Mean of sample 1, $\bar{X}_2$ = Mean of sample 2

$SE_{\bar{X}_1 - \bar{X}_2}$ = Standard Error of difference between two sample means and is worked out as follows:

$$SE_{\bar{X}_1 - \bar{X}_2} = \sqrt{\frac{\sum (X_{1i} - \bar{X}_1)^2 + \sum (X_{2i} - \bar{X}_2)^2}{n_1 + n_2 - 2}} \times \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$

and the degrees of freedom = $(n_1 + n_2 - 2)$.
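The two-sample formula can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name `two_sample_t` and the two data lists are hypothetical, and the equal-variance assumption stated earlier is taken for granted.

```python
import math


def two_sample_t(sample1, sample2):
    """Return (t, degrees of freedom) using the pooled standard error."""
    n1, n2 = len(sample1), len(sample2)
    mean1 = sum(sample1) / n1
    mean2 = sum(sample2) / n2
    # Sums of squared deviations of each sample about its own mean.
    ss1 = sum((x - mean1) ** 2 for x in sample1)
    ss2 = sum((x - mean2) ** 2 for x in sample2)
    df = n1 + n2 - 2
    pooled_sd = math.sqrt((ss1 + ss2) / df)
    se_diff = pooled_sd * math.sqrt(1 / n1 + 1 / n2)
    return abs(mean1 - mean2) / se_diff, df


# Hypothetical data, for illustration only.
group1 = [10, 12, 14, 16, 18]
group2 = [8, 9, 10, 11, 12]
t_value, df = two_sample_t(group1, group2)
print(round(t_value, 3), df)
```

The calculated t is then compared with the table value of t for the returned degrees of freedom, exactly as in the single-sample case.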

When the actual means are fractional, the use of assumed means is convenient. In such a case, the standard deviation of difference, i.e.,

$$\sqrt{\frac{\sum (X_{1i} - \bar{X}_1)^2 + \sum (X_{2i} - \bar{X}_2)^2}{n_1 + n_2 - 2}}$$

can be worked out by the following short-cut formula:

$$\sqrt{\frac{\sum (X_{1i} - A_1)^2 + \sum (X_{2i} - A_2)^2 - n_1(\bar{X}_1 - A_1)^2 - n_2(\bar{X}_2 - A_2)^2}{n_1 + n_2 - 2}}$$

Where, $A_1$ = Assumed mean of sample 1
$A_2$ = Assumed mean of sample 2
$\bar{X}_1$ = True mean of sample 1
$\bar{X}_2$ = True mean of sample 2

(iii) To test the significance of an observed correlation coefficient

$$t = \frac{r}{\sqrt{1 - r^2}} \times \sqrt{n - 2}$$

Here, t is based on (n – 2) degrees of freedom.

(iv) In context of the ‘difference test’

Difference test is applied in the case of paired data and in this context t is calculated as under:

$$t = \frac{\bar{X}_{Diff} - 0}{\sigma_{Diff} / \sqrt{n}} = \frac{\bar{X}_{Diff} - 0}{\sigma_{Diff}} \sqrt{n}$$

Where, $\bar{X}_{Diff}$ or $\bar{D}$ = Mean of the differences of sample items.

0 = The value zero on the hypothesis that there is no difference

$\sigma_{Diff}$ = Standard deviation of difference and is worked out as,

$$\sigma_{Diff} = \sqrt{\frac{\sum (D - \bar{X}_{Diff})^2}{n - 1}}$$

or

$$\sigma_{Diff} = \sqrt{\frac{\sum D^2 - (\bar{D})^2\, n}{n - 1}}$$

D = Differences

n = Number of pairs in two samples and is based on (n – 1) degrees of freedom.

The following examples would illustrate the application of t-test using the above stated formulae.

Example 11.1:
A sample of 10 measurements of the diameter of a sphere gave a mean $\bar{X}$ = 4.38 inches and a standard deviation $\sigma_s$ = 0.06 inches. Find (a) 95% and (b) 99% confidence limits for the actual diameter.

Solution:
On the basis of the given data the standard error of mean:

$$SE_{\bar{X}} = \frac{\sigma_s}{\sqrt{n - 1}} = \frac{0.06}{\sqrt{10 - 1}} = \frac{0.06}{3} = 0.02$$

Taking the sample mean 4.38 inches as the estimate of the population mean, the required limits are as follows:

(i) 95% confidence limits = $\bar{X} \pm SE_{\bar{X}}\,(t_{0.05})$ with 9 degrees of freedom

= 4.38 ± .02(2.262)

= 4.38 ± .04524

i.e., 4.335 to 4.425

(ii) 99% confidence limits = $\bar{X} \pm SE_{\bar{X}}\,(t_{0.01})$ with 9 degrees of freedom

= 4.38 ± .02(3.25) = 4.38 ± .0650

i.e., 4.3150 to 4.4450.
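The arithmetic of Example 11.1 can be reproduced with a short Python sketch. The helper `confidence_limits` is a hypothetical name; it assumes, as the example does, that the given standard deviation is divided by $\sqrt{n-1}$ to obtain the standard error, and the table values 2.262 and 3.25 are simply passed in rather than looked up.

```python
import math


def confidence_limits(mean, sd, n, t_table):
    """Return (lower, upper) limits: mean +/- SE * table value of t."""
    se = sd / math.sqrt(n - 1)  # standard error as worked out in Example 11.1
    margin = se * t_table
    return mean - margin, mean + margin


# Figures from Example 11.1; t values are for 9 degrees of freedom.
low95, high95 = confidence_limits(4.38, 0.06, 10, 2.262)
low99, high99 = confidence_limits(4.38, 0.06, 10, 3.25)

print(round(low95, 3), round(high95, 3))
print(round(low99, 3), round(high99, 3))
```

Changing the last argument to the table value for another confidence level gives the corresponding limits in the same way.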


Example 11.2:
The sales data of an item in six shops before and after a special promotional campaign are:

Shops                            A    B    C    D    E    F
Before the promotional campaign  53   28   31   48   50   42
After the campaign               58   29   30   55   56   45

Can the campaign be judged to be a success? Test at 5% level of significance.

Solution:
We take the hypothesis that the campaign does not bring any improvement in sales. We can thus write:

$$H_0: \bar{X}_A = \bar{X}_B, \qquad H_a: \bar{X}_A > \bar{X}_B$$

In order to judge this, we apply the ‘difference test’. For this purpose we calculate the mean and standard deviation of differences in two sample items as follows:

Shops   Sales before campaign ($X_{Bi}$)   Sales after campaign ($X_{Ai}$)   Difference $D$   $(D - \bar{D})$   $(D - \bar{D})^2$
A       53                                 58                                +5               +1.5              2.25
B       28                                 29                                +1               –2.5              6.25
C       31                                 30                                –1               –4.5              20.25
D       48                                 55                                +7               +3.5              12.25
E       50                                 56                                +6               +2.5              6.25
F       42                                 45                                +3               –0.5              0.25

(Difference $D$ = increase or decrease after the campaign.)

n = 6, $\sum D = 21$, $\sum (D - \bar{D})^2 = 47.50$

Mean of difference or

$$\bar{X}_{Diff} = \frac{\sum D}{n} = \frac{21}{6} = 3.5$$

Standard deviation of difference,

$$\sigma_{Diff} = \sqrt{\frac{\sum (D - \bar{D})^2}{n - 1}} = \sqrt{\frac{47.50}{6 - 1}} = 3.08$$

$$t = \frac{\bar{X}_{Diff} - 0}{\sigma_{Diff}} \sqrt{n} = \frac{3.5}{3.08} \times \sqrt{6}$$

= 1.14 × 2.45 = 2.793

Degrees of freedom = (n – 1) = (6 – 1) = 5

Table value of t at 5% level of significance for 5 degrees of freedom = 2.015 for one-tailed test.

Since the calculated value of t is greater than its table value, the difference is significant. Thus, the null hypothesis is rejected and the special promotional campaign can be taken as a success.

Example 11.3:
Memory capacity of 9 students was tested before and after training. From the following scores, state whether the training was effective or not.

Student 1 2 3 4 5 6 7 8 9

Before (XBi) 10 15 9 3 7 12 16 17 4

After (XAi) 12 17 8 5 6 11 18 20 3

Solution:
We take the hypothesis that training was not effective. We can write,

$$H_0: \bar{X}_A = \bar{X}_B, \qquad H_a: \bar{X}_A > \bar{X}_B$$

We apply the difference test, for which purpose we first calculate the mean and standard deviation of difference as follows:

Students   Before $X_{Bi}$   After $X_{Ai}$   Difference = $D$   $D^2$
1          10                12               2                  4
2          15                17               2                  4
3          9                 8                –1                 1
4          3                 5                2                  4
5          7                 6                –1                 1
6          12                11               –1                 1
7          16                18               2                  4
8          17                20               3                  9
9          4                 3                –1                 1

n = 9, $\sum D = 7$, $\sum D^2 = 29$

$$\bar{D} = \frac{\sum D}{n} = \frac{7}{9} = 0.78$$

$$\sigma_{Diff} = \sqrt{\frac{\sum D^2 - (\bar{D})^2\, n}{n - 1}} = \sqrt{\frac{29 - (0.78)^2 \times 9}{9 - 1}} = 1.71$$

$$t = \frac{0.78}{1.71 / \sqrt{9}} = 1.369$$

Degrees of freedom = (n – 1) = (9 – 1) = 8

Table value of t at 5% level of significance for 8 degrees of freedom = 1.860 for one-tailed test.

Since the calculated value of t is less than its table value, the difference is insignificant and the null hypothesis cannot be rejected. Hence it can be inferred that the training was not effective.

Example 11.4:
It was found that the coefficient of correlation between two variables calculated from a sample of 25 items was 0.37. Test the significance of r at 5% level with the help of t-test.

Solution:
To test the significance of r through t-test, we use the following formula for calculating t value:

$$t = \frac{r}{\sqrt{1 - r^2}} \times \sqrt{n - 2} = \frac{0.37}{\sqrt{1 - (0.37)^2}} \times \sqrt{25 - 2} = 1.903$$

Degrees of freedom = (n – 2) = (25 – 2) = 23

The table value of t at 5% level of significance for 23 degrees of freedom is 2.069 for a two-tailed test.

The calculated value of t is less than its table value, hence r is insignificant.
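The paired ‘difference test’ and the test for r used in Examples 11.2 to 11.4 can be sketched in Python as below. The function names `paired_t` and `correlation_t` are hypothetical. Because the code carries full precision instead of rounding at each step, it gives roughly 2.78, 1.36 and 1.91, slightly different from the hand-rounded figures in the examples; the conclusions are unchanged.

```python
import math


def paired_t(before, after):
    """Return (t, degrees of freedom) for the difference test on paired data."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    sd_diff = math.sqrt(sum((d - mean_diff) ** 2 for d in diffs) / (n - 1))
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1


def correlation_t(r, n):
    """Return (t, degrees of freedom) for an observed correlation coefficient."""
    return r / math.sqrt(1 - r ** 2) * math.sqrt(n - 2), n - 2


# Example 11.2: sales before and after the campaign.
t_112, df_112 = paired_t([53, 28, 31, 48, 50, 42], [58, 29, 30, 55, 56, 45])
# Example 11.3: memory scores before and after training.
t_113, df_113 = paired_t([10, 15, 9, 3, 7, 12, 16, 17, 4],
                         [12, 17, 8, 5, 6, 11, 18, 20, 3])
# Example 11.4: r = 0.37 from 25 items.
t_114, df_114 = correlation_t(0.37, 25)

print(round(t_112, 2), df_112)
print(round(t_113, 2), df_113)
print(round(t_114, 2), df_114)
```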

Activity 1

Select a variable. Compare the mean of the variable for a sample of 10 for one group with the mean of the variable for a sample of 10 for a second group using t-test.


Self-Assessment Questions

1. Fill in the blanks with the appropriate terms.

(a) When population ____________ is not known, the test is commonly known as Student’s t-test and is based on the t distribution.

(b) In t-test for the case of two samples, population variances are regarded as equal if _____________ of the two population means is to be tested.

2. State whether true or false.

(a) Like normal distribution, t distribution is not symmetrical but happens to be flatter than normal distribution.

(b) When n is small, the t distribution is far from normal but when n is infinite it is identical with normal distribution.

11.3 z-Test

A z-test is any statistical test for which the distribution of the test statistic can be approximated by normal distribution under the null hypothesis.

11.3.1 z-Test for Testing the Significance of ‘r’ in Case of Small Samples or z-Transformation

R.A. Fisher developed the z-test to test the significance of the correlation coefficient in small samples. While applying the test, r of the sample is transformed into z, on account of which the test is also known as z transformation. The z transformation is done as under:

$$z = \frac{1}{2} \log_e \frac{1 + r}{1 - r} = 1.15129\, \log_{10} \frac{1 + r}{1 - r}$$

where, r represents correlation coefficient on the basis of sample.

The statistic z is used to test (i) Whether an observed value of r is significantly different from a given hypothetical or known value of population correlation (ii) Whether two sample values of r differ significantly from each other.


The standard error of z is calculated as:

$$SE_z = \frac{1}{\sqrt{n - 3}}$$

where, n means the number of pairs in a sample.

and

$$\mu = 1.15129\, \log_{10} \frac{1 + \rho}{1 - \rho}$$

where, $\rho$ represents the population correlation coefficient and $\mu$ represents the corresponding population mean of z.

[Note. If $\rho$ is not known, then it is taken as zero, in which case $\mu$ = 0]

Finally the value of the Standard Normal Variate (S.N.V.) is calculated as follows:

$$S.N.V. = \frac{|z - \mu|}{\dfrac{1}{\sqrt{n - 3}}} = |z - \mu|\, \sqrt{n - 3}$$

If the value of S.N.V. exceeds 1.96, the difference is significant at 5%level.

The following example makes the application of z-test clear in testing the significance of r.

Example 11.5:
Test the significance of the coefficient of correlation, r = 0.5, discovered in a sample of 19 pairs against the hypothetical population correlation $\rho$ = 0.7. Apply z-transformation.

Solution:
The hypothesis that the correlation coefficient in the population is 0.7 has to be tested in this case.

Applying z-transformation, we obtain

$$z = 1.15129\, \log_{10} \frac{1 + r}{1 - r} = 1.15129\, \log_{10} \frac{1 + 0.5}{1 - 0.5} = 1.15129\, \log_{10} \frac{1.5}{0.5}$$

= 1.15129 log 3 = 1.15129 × 0.4771 = 0.549

$$\mu = 1.15129\, \log_{10} \frac{1 + 0.7}{1 - 0.7} = 1.15129\, \log_{10} \frac{1.7}{0.3}$$

= 1.15129 log 5.67 = 1.15129 × 0.7536 = 0.868

$$S.N.V. = \frac{|z - \mu|}{\dfrac{1}{\sqrt{n - 3}}} = \frac{|0.549 - 0.868|}{\dfrac{1}{\sqrt{19 - 3}}} = \frac{0.319}{\dfrac{1}{\sqrt{16}}}$$

= 0.319 × 4 = 1.276

Since the difference (0.319) is only 1.276 times the S.E., it is insignificant at 5% level and hence could have arisen due to sampling fluctuations. In other words, the hypothesis stands and $\rho$ may be taken as 0.7.

As has been stated above, z-test is also used to test the significance of the difference between two independent correlation coefficients. For this purpose, first of all, $r_1$ and $r_2$ values are transformed in the similar manner (as stated above) into $z_1$ and $z_2$ values respectively, and then the standard error of difference between $z_1$ and $z_2$ is worked out as under:

$$SE_{z_1 - z_2} = \sqrt{\frac{1}{n_1 - 3} + \frac{1}{n_2 - 3}}$$

where, $n_1$ = Number of pairs in Sample 1
$n_2$ = Number of pairs in Sample 2

Finally, we work out the ratio:

$$\frac{|z_1 - z_2|}{SE_{z_1 - z_2}}$$

If this ratio is greater than 1.96, the difference will be significant at 5% level and if this ratio is greater than 2.5758, the difference will be significant at 1% level. We take the following example to make the point clear:


Example 11.6:
Given the following information:

            No. of Items in the Sample   Coefficient of Correlation
Sample 1    23                           0.40
Sample 2    19                           0.65

Test the significance of the difference, at 5% level, between the two given values of coefficient of correlation, using z-transformation.

Solution:
Applying z-test, we obtain $z_1$ and $z_2$ values as under:

$$z_1 = 1.15129\, \log_{10} \frac{1 + r_1}{1 - r_1} = 1.15129\, \log_{10} \frac{1 + 0.40}{1 - 0.40} = 1.15129\, \log_{10} 2.333 = 0.424$$

$$z_2 = 1.15129\, \log_{10} \frac{1 + r_2}{1 - r_2} = 1.15129\, \log_{10} \frac{1 + 0.65}{1 - 0.65} = 1.15129\, \log_{10} 4.71 = 0.775$$

$$SE_{z_1 - z_2} = \sqrt{\frac{1}{n_1 - 3} + \frac{1}{n_2 - 3}} = \sqrt{\frac{1}{20} + \frac{1}{16}} = \sqrt{\frac{9}{80}} = 0.335$$

We now work out the ratio:

$$\frac{|z_1 - z_2|}{SE_{z_1 - z_2}} = \frac{|0.424 - 0.775|}{0.335} = \frac{0.351}{0.335} = 1.048$$

As this ratio is less than 1.96, the difference between the two given values of coefficient of correlation at 5% level is insignificant and it can be concluded that the two samples come from the same population.
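The z-transformation and both uses of the z-test can be sketched in Python. The function names below are hypothetical; natural logarithms are used directly, which is equivalent to the 1.15129 × log10 form in the text. Run on the figures of Examples 11.5 and 11.6, the unrounded results are about 1.27 and 1.05, in line with the hand calculations.

```python
import math


def fisher_z(r):
    """Fisher's z-transformation: 0.5 * ln((1 + r) / (1 - r))."""
    return 0.5 * math.log((1 + r) / (1 - r))


def z_test_against_population(r, rho, n):
    """S.N.V. for a sample r against a hypothetical population correlation rho."""
    standard_error = 1 / math.sqrt(n - 3)
    return abs(fisher_z(r) - fisher_z(rho)) / standard_error


def z_test_two_samples(r1, n1, r2, n2):
    """Ratio |z1 - z2| / S.E. for two independent correlation coefficients."""
    standard_error = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return abs(fisher_z(r1) - fisher_z(r2)) / standard_error


ratio_115 = z_test_against_population(0.5, 0.7, 19)   # Example 11.5
ratio_116 = z_test_two_samples(0.40, 23, 0.65, 19)    # Example 11.6

# Both ratios are compared with 1.96 for significance at the 5% level.
print(round(ratio_115, 2), ratio_115 > 1.96)
print(round(ratio_116, 2), ratio_116 > 1.96)
```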

Self-Assessment Questions

3. State whether true or false.

(a) z-test is used to test the significance of the correlation coefficient in small samples.

(b) The statistic z is used to test whether an observed value of r is significantly different from a given hypothetical or known value of population correlation.


4. Fill in the blanks with the appropriate terms.

(a) While applying the z test, r of the sample is transformed into z on account of which the test is also known as z ____________________.

(b) z-test is used to test the significance of the _____________________ between two independent correlation coefficients.

11.4 Analysis of Variance

In business decisions, we are often involved in determining if there are significant differences among various sample means, from which conclusions can be drawn about the differences among various population means. In the previous chapters, we discussed and evaluated the differences between two sample means. But what if we have to compare more than 2 sample means? For example, we may be interested in finding out if there are any significant differences in the average sales figures of 4 different salesmen employed by the same company, or we may be interested to find out if the average monthly expenditures of a family of 4 in 5 different localities are similar or not, or the telephone company may be interested in checking whether there are any significant differences in the average number of requests for information received in a given day among the 5 areas of New York City, and so on. The methodology used for such types of determinations is known as Analysis of Variance.

This technique is one of the most powerful techniques in statistical analysis and was developed by R.A. Fisher. It is also called the F-Test.

There are two types of classifications involved in the analysis of variance. The one-way analysis of variance refers to the situations when only one factor or variable is considered. For example, in testing for differences in sales for three salesmen, we are considering only one factor, which is the salesman’s selling ability. In the second type of classification, the response variable of interest may be affected by more than one factor. For example, the sales may be affected not only by the salesman’s selling ability, but also by the price charged or the extent of advertising in a given area.

For the sake of simplicity and necessity, our discussion will be limited to One-way Analysis of Variance.

The null hypothesis that we are going to test is based upon the assumption that there is no significant difference among the means of different populations.


For example, if we are testing for differences in the means of k populations, then,

$$H_0: \mu_1 = \mu_2 = \mu_3 = \dots = \mu_k$$

The alternate hypothesis ($H_1$) will state that at least two means are different from each other. In order to accept the null hypothesis, all means must be equal. Even if one mean is not equal to the others, then we cannot accept the null hypothesis. The simultaneous comparison of several population means is called ANalysis Of VAriance or ANOVA.

Assumptions

The methodology of ANOVA is based on the following assumptions.

(i) Each sample of size n is drawn randomly and each sample is independent of the other samples.

(ii) The populations are normally distributed.

(iii) The populations from which the samples are drawn have equal variances. This means that:

$$\sigma_1^2 = \sigma_2^2 = \sigma_3^2 = \dots = \sigma_k^2, \text{ for } k \text{ populations.}$$

11.4.1 The Rationale Behind Analysis of Variance

Why do we call it the Analysis of Variance, even though we are testing for means? Why not simply call it the Analysis of Means? How do we test for means by analysing the variances? As a matter of fact, in order to determine if the means of several populations are equal, we do consider the measure of variance, $\sigma^2$.

The population variance, $\sigma^2$, is estimated in two different ways, each by a different method. One approach is to compute an estimator of $\sigma^2$ in such a manner that even if the population means are not equal, it will have no effect on the value of this estimator. This means that the differences in the values of the population means do not alter the value of $\sigma^2$ as calculated by a given method. This estimator of $\sigma^2$ is the average of the variances found within each of the samples. For example, if we take 10 samples of size n, then each sample will have a mean and a variance. Then, the mean of these 10 variances would be considered as an unbiased estimator of $\sigma^2$, the population variance, and its value remains appropriate irrespective of whether the population means are equal or not. This is really done by pooling all the sample variances to estimate a common population variance, which is the average of all sample variances. This common variance is known as variance within samples or $\sigma^2_{\text{within}}$.


The second approach to calculating the estimate of $\sigma^2$ is based upon the Central Limit Theorem and is valid only under the null hypothesis assumption that all the population means are equal. This means that, if there are in fact no differences among the population means, then the value of $\sigma^2$ computed by the second approach should not differ significantly from the value of $\sigma^2$ computed by the first approach. Hence,

If these two values of $\sigma^2$ are approximately the same, then we can decide to accept the null hypothesis.

The second approach results in the following computation.

Based upon the Central Limit Theorem, we have previously found that the standard error of the sample means is calculated by:

$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$

or, the variance would be:

$$\sigma_{\bar{X}}^2 = \frac{\sigma^2}{n}$$

or,

$$\sigma^2 = n\, \sigma_{\bar{X}}^2$$

Thus, by knowing the square of the standard error of the mean $(\sigma_{\bar{X}})^2$, we could multiply it by n and obtain an estimate of $\sigma^2$. This approach of estimating $\sigma^2$ is known as $\sigma^2_{\text{between}}$. Now, if the null hypothesis is true, that is, if all population means are equal, then the $\sigma^2_{\text{between}}$ value should be approximately the same as the $\sigma^2_{\text{within}}$ value. A significant difference between these two values would lead us to conclude that this difference is the result of differences between the population means.

But how do we know whether any difference between these two values is significant or not? How do we know whether this difference, if any, is simply due to random sampling error or due to actual differences among the population means?

R.A. Fisher developed a Fisher test or F-test to answer the above question. He determined that the difference between the $\sigma^2_{\text{between}}$ and $\sigma^2_{\text{within}}$ values could be expressed as a ratio to be designated as the F-value, so that:

$$F = \frac{\sigma^2_{\text{between}}}{\sigma^2_{\text{within}}}$$


In the above case, if the population means are exactly the same, then $\sigma^2_{\text{between}}$ will be equal to $\sigma^2_{\text{within}}$ and the value of F will be equal to 1.

However, because of sampling errors and other variations, some disparity between these two values will be there, even when the null hypothesis is true, meaning that all population means are equal. The extent of disparity between the two variances and, consequently, the value of F will influence our decision on whether to accept or reject the null hypothesis. It is logical to conclude that, if the population means are not equal, then their sample means will also vary greatly from one another, resulting in a larger value of $\sigma^2_{\text{between}}$ and hence a larger value of F ($\sigma^2_{\text{within}}$ is based only on sample variances and not on sample means and hence is not affected by differences in sample means). Accordingly, the larger the value of F, the more likely the decision to reject the null hypothesis. But how large must the value of F be so as to reject the null hypothesis? The answer is that the computed value of F must be larger than the critical value of F, given in the table for a given level of significance and calculated number of degrees of freedom. The F distribution is a family of curves, so that there are different curves for different degrees of freedom.

11.4.2 Degrees of Freedom

We have talked about the F distribution being a family of curves, each curve reflecting the degrees of freedom relative to both $\sigma^2_{\text{between}}$ and $\sigma^2_{\text{within}}$. This means that the degrees of freedom are associated both with the numerator as well as with the denominator of the F-ratio.

(i) The numerator. Since the variance between samples, $\sigma^2_{\text{between}}$, comes from many samples, and if there are k samples, then the degrees of freedom associated with the numerator would be (k – 1).

(ii) The denominator is the mean of the variances of k samples and since each variance in each sample is associated with the size of the sample (n), the degrees of freedom associated with each sample would be (n – 1). Hence, the total degrees of freedom would be the sum of degrees of freedom of k samples or,

df = k(n – 1), when each sample is of size n.

11.4.3 The F Distribution

The major characteristics of the F distribution are as follows:

(i) Unlike normal distribution, which is only one type of curve irrespective of the value of the mean and the standard deviation, the F distribution is a family of curves. A particular curve is determined by two parameters. These are the degrees of freedom in the numerator and the degrees of freedom in the denominator. The shape of the curve changes as the number of degrees of freedom changes.

(ii) It is a continuous distribution and the value of F cannot be negative.

(iii) The curve representing the F distribution is positively skewed.

(iv) The values of F theoretically range from zero to infinity.

A diagram of F distribution curve is shown below.

[Figure: F distribution curve plotted from 0 along the F axis, with the area to the left of the critical value marked ‘Do not reject H0’ and the right-hand tail marked ‘Reject H0’.]

The rejection region is only in the right end tail of the curve because, unlike z distribution and t distribution which had negative values for areas below the mean, F distribution has only positive values by definition and only positive values of F that are larger than the critical values of F will lead to a decision to reject the null hypothesis.

Computation of F

The F ratio contains only two elements, which are the variance between the samples and the variance within the samples.

If all the means of samples were exactly equal and all samples were exactly representative of their respective populations so that all the sample means were exactly equal to each other and to the population mean, then there will be no variance. However, this can never be the case. We always have variation, both between samples and within samples, even if we take these samples randomly and from the same population. This variation is known as the total variation.

The total variation, designated by $\sum (X - \bar{\bar{X}})^2$, where $X$ represents individual observations for all samples and $\bar{\bar{X}}$ is the grand mean of all sample means (the estimate of $\mu$, the population mean), is also known as the total sum of squares or SST, and is simply the sum of squared differences between each observation and the overall mean. This total variation represents the contribution of two elements. These elements are:

(A) Variance between samples. The variance between samples may be due to the effect of different treatments, meaning that the population means may be affected by the factor under consideration, thus making the population means actually different, and some variance may be due to the inter-sample variability. This variance is also known as the sum of squares between samples. Let this sum of squares be designated as SSB.

Then, SSB is calculated by the following steps:

(i) Take k samples of size n each and calculate the mean of each sample, i.e., $\bar{X}_1, \bar{X}_2, \bar{X}_3, \dots, \bar{X}_k$.

(ii) Calculate the grand mean $\bar{\bar{X}}$ of the distribution of these sample means, so that,

$$\bar{\bar{X}} = \frac{\sum_{i=1}^{k} \bar{X}_i}{k}$$

(iii) Take the difference between the means of the various samples and the grand mean, i.e.,

$$(\bar{X}_1 - \bar{\bar{X}}),\ (\bar{X}_2 - \bar{\bar{X}}),\ (\bar{X}_3 - \bar{\bar{X}}),\ \dots,\ (\bar{X}_k - \bar{\bar{X}})$$

(iv) Square these deviations or differences individually, multiply each of these squared deviations by its respective sample size and sum up all these products, so that we get;

$$\sum_{i=1}^{k} n_i (\bar{X}_i - \bar{\bar{X}})^2,$$

where $n_i$ = Size of the ith sample.

This will be the value of the SSB.

However, if the individual observations of all samples are not available, and only the various means of these samples are available, where the samples are either of the same size n or different sizes, $n_1, n_2, n_3, \dots, n_k$, then the value of SSB can be calculated as:

$$SSB = n_1 (\bar{X}_1 - \bar{\bar{X}})^2 + n_2 (\bar{X}_2 - \bar{\bar{X}})^2 + \dots + n_k (\bar{X}_k - \bar{\bar{X}})^2$$

where,

$n_1$ = Number of items in sample 1

$n_2$ = Number of items in sample 2

$n_k$ = Number of items in sample k

$\bar{X}_1$ = Mean of sample 1

$\bar{X}_2$ = Mean of sample 2

$\bar{X}_k$ = Mean of sample k

$\bar{\bar{X}}$ = Grand mean or average of all items in all samples.

(v) Divide SSB by the degrees of freedom, which are (k – 1), where k is the number of samples and this would give us the value of $\sigma^2_{\text{between}}$, so that,

$$\sigma^2_{\text{between}} = \frac{SSB}{(k - 1)}$$

(This is also known as mean square between samples or MSB.)

(B) Variance within samples. Even though each observation in a given sample comes from the same population and is subjected to the same treatment, some chance variation can still occur. This variance may be due to sampling errors or other natural causes. This variance or sum of squares is calculated through the following steps:

(i) Calculate the mean value of each sample, i.e., $\bar{X}_1, \bar{X}_2, \bar{X}_3, \dots, \bar{X}_k$.

(ii) Take one sample at a time and take the deviation of each item in the sample from its mean. Do this for all the samples, so that we would have a difference between each value in each sample and their respective means for all values in all samples.

(iii) Square these differences and take a total sum of all these squared differences (or deviations). This sum is also known as SSW or sum of squares within samples.

(iv) Divide this SSW by the corresponding degrees of freedom. The degrees of freedom are obtained by subtracting the total number of samples from the total number of items. Thus, if N is the total number of items or observations, and k is the number of samples, then,

df = (N – k)

These are the degrees of freedom within samples. (If all samples are of equal size n, then df = k(n – 1), since (n – 1) are the degrees of freedom for each sample and there are k samples).

(v) This figure SSW/df is also known as $\sigma^2_{\text{within}}$, or MSW (mean of sum of squares within samples).

Now, the value of F can be computed as:


$$F = \frac{\sigma^2_{\text{between}}}{\sigma^2_{\text{within}}} = \frac{SSB/df}{SSW/df} = \frac{SSB/(k - 1)}{SSW/(N - k)} = \frac{MSB}{MSW}$$

This value of F is then compared with the critical value of F from the table and a decision is made about the validity of the null hypothesis.
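The steps for the variance between samples and the variance within samples can be sketched in code. The following is a minimal Python illustration, not a definitive implementation: the function name `one_way_anova` and the three small samples are hypothetical and chosen only to make the arithmetic easy to follow. The grand mean is computed as the average of all items in all samples, which coincides with the mean of the sample means when the samples are of equal size.

```python
from statistics import mean


def one_way_anova(samples):
    """Compute SSB, SSW, MSB, MSW and F for a list of samples (lists of numbers)."""
    k = len(samples)                      # number of samples
    N = sum(len(s) for s in samples)      # total number of observations
    sample_means = [mean(s) for s in samples]
    # Grand mean: average of all items in all samples.
    grand_mean = sum(sum(s) for s in samples) / N

    # SSB: squared deviation of each sample mean from the grand mean,
    # weighted by the sample size.
    ssb = sum(len(s) * (m - grand_mean) ** 2
              for s, m in zip(samples, sample_means))
    # SSW: squared deviation of each item from its own sample mean.
    ssw = sum((x - m) ** 2
              for s, m in zip(samples, sample_means)
              for x in s)

    msb = ssb / (k - 1)                   # variance between samples
    msw = ssw / (N - k)                   # variance within samples
    return {"SSB": ssb, "SSW": ssw, "MSB": msb, "MSW": msw, "F": msb / msw}


# Illustrative data only (not taken from the text).
result = one_way_anova([[2, 4, 6], [5, 7, 9], [8, 10, 12]])
print(result)
```

For these illustrative samples the means are 4, 7 and 10 and the grand mean is 7, so SSB = 3(9) + 3(0) + 3(9) = 54 and SSW = 8 + 8 + 8 = 24, giving MSB = 27, MSW = 4 and F = 6.75.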

11.4.4 ANOVA Table

After various calculations for SSB, SSW and the degrees of freedom have been made, these figures can be presented in a simple table called the Analysis of Variance table or simply the ANOVA table, as follows:

ANOVA Table

| Source of Variation | Sum of Squares | Degrees of Freedom | Mean Square | F |
|---|---|---|---|---|
| Treatment | SSB | (k – 1) | $MSB = \frac{SSB}{(k - 1)}$ | $\frac{MSB}{MSW}$ |
| Within | SSW | (N – k) | $MSW = \frac{SSW}{(N - k)}$ | |
| Total | SST | | | |

Then,

$$F = \frac{MSB}{MSW}$$

A Short-Cut Method

The formula developed above for the computation of the value of the F-statistic is rather complex and time consuming when we have to calculate the variance between samples and the variance within samples. However, a simpler short-cut method for these sums of squares is available, which considerably reduces the computational work. This technique is used through the following steps:

(i) Take the sum of all the observations of all the samples, either by adding all the individual values, or by multiplying the mean of each sample by its size and then adding up all these products as follows:

The Total Sum or $TS = n_1\bar{X}_1 + n_2\bar{X}_2 + \ldots + n_k\bar{X}_k$, for k samples

(ii) Calculate the value of a correction factor. The Correction Factor (CF) value is obtained by squaring the total sum obtained above and dividing it by the total number of observations N, so that:


$$CF = \frac{(TS)^2}{N}$$

(iii) The total sum of squares is obtained by squaring all individual observations of all samples, summing up these values and subtracting the CF from this sum.

In other words:

$$\text{Total sum of squares } SST = \sum X_1^2 + \sum X_2^2 + \ldots + \sum X_k^2 - \frac{(TS)^2}{N}$$

Where,

$\sum X_1^2$ = Summation of squares for all X’s in sample 1.

$\sum X_2^2$ = Summation of squares for all X’s in sample 2.

:

:

$\sum X_k^2$ = Summation of squares for all X’s in sample k.

(iv) The sum of squares between the samples (SSB) is obtained by the following formula:

$$SSB = \frac{(\sum X_1)^2}{n_1} + \frac{(\sum X_2)^2}{n_2} + \ldots + \frac{(\sum X_k)^2}{n_k} - \frac{(TS)^2}{N}$$

Where,

$(\sum X_1)^2$ = Square of the total of all values in sample 1.

$(\sum X_2)^2$ = Square of the total of all values in sample 2.

$(\sum X_k)^2$ = Square of the total of all values in sample k.

(v) Then sum of squares within samples SSW can be calculated as:

SSW = Total sum of squares – Sum of squares between samples

= SST – SSB

(vi) The rest of the procedure is similar to the previous method.
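The short-cut steps can also be sketched in code. The following Python fragment is an illustration under stated assumptions: the function name `anova_shortcut` and the three samples of unequal size are hypothetical, not taken from the text. It computes TS, CF, SST and SSB as described above, obtains SSW by subtraction, and then cross-checks SSW against the direct sum of squared deviations from each sample mean.

```python
def anova_shortcut(samples):
    """Short-cut method: SST and SSB from totals and a correction factor."""
    k = len(samples)                          # number of samples
    N = sum(len(s) for s in samples)          # total number of observations
    ts = sum(sum(s) for s in samples)         # (i) total sum TS
    cf = ts ** 2 / N                          # (ii) correction factor CF
    # (iii) SST: sum of squares of all observations minus CF
    sst = sum(x * x for s in samples for x in s) - cf
    # (iv) SSB: squared sample totals over sample sizes, minus CF
    ssb = sum(sum(s) ** 2 / len(s) for s in samples) - cf
    ssw = sst - ssb                           # (v) SSW by subtraction
    f_ratio = (ssb / (k - 1)) / (ssw / (N - k))
    return {"TS": ts, "CF": cf, "SST": sst, "SSB": ssb, "SSW": ssw, "F": f_ratio}


# Illustrative data only; sample sizes differ on purpose.
data = [[3, 5, 7], [4, 6], [8, 10, 12, 14]]
shortcut = anova_shortcut(data)

# Cross-check: SSW computed directly from deviations about each sample mean.
direct_ssw = sum((x - sum(s) / len(s)) ** 2 for s in data for x in s)
print(shortcut, direct_ssw)
```

Here TS = 69 and N = 9, so CF = 529, SST = 639 – 529 = 110 and SSB = 75 + 50 + 484 – 529 = 80. The difference SSW = 30 agrees with the directly computed value, and F = (80/2)/(30/6) = 8.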


Example 11.7:

To test whether all professors teach the same material in different sections of the introductory statistics class or not, four sections of the same course were selected and a common test was administered to five students selected at random from each section. The scores for each student from each section were noted and are given below. We want to test for any differences in learning, as reflected in the average scores for each section.

| Student # | Section 1 Scores ($X_1$) | Section 2 Scores ($X_2$) | Section 3 Scores ($X_3$) | Section 4 Scores ($X_4$) |
|---|---|---|---|---|
| 1 | 8 | 12 | 10 | 12 |
| 2 | 10 | 12 | 13 | 15 |
| 3 | 12 | 10 | 11 | 13 |
| 4 | 10 | 8 | 12 | 10 |
| 5 | 5 | 13 | 14 | 10 |
| Totals | $\sum X_1 = 45$ | $\sum X_2 = 55$ | $\sum X_3 = 60$ | $\sum X_4 = 60$ |
| Means | $\bar{X}_1 = 9$ | $\bar{X}_2 = 11$ | $\bar{X}_3 = 12$ | $\bar{X}_4 = 12$ |

Solution:

A. The traditional method

(i) State the null hypothesis. We are assuming that there is no significant difference among the average scores of students from these four sections and hence, all professors are teaching the same material with the same effectiveness, i.e.,

$$H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4$$

$H_1$: Not all means are equal, i.e., at least two means differ from each other.

(ii) Establish a level of significance. Let α = 0.05.

(iii) Calculate the variance between the samples, as follows:

(a) The mean of each sample is:

$$\bar{X}_1 = 9,\ \bar{X}_2 = 11,\ \bar{X}_3 = 12,\ \bar{X}_4 = 12$$

(b) The grand mean or $\bar{\bar{X}}$ is:

$$\bar{\bar{X}} = \frac{\sum \bar{X}}{k} = \frac{9 + 11 + 12 + 12}{4} = 11$$


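(c) Calculate the value of SSB:

$$SSB = \sum n(\bar{X} - \bar{\bar{X}})^2 = 5(9 - 11)^2 + 5(11 - 11)^2 + 5(12 - 11)^2 + 5(12 - 11)^2 = 20 + 0 + 5 + 5 = 30$$

(d) The variance between samples $\sigma^2_{\text{between}}$ or MSB is given by:

$$MSB = \frac{SSB}{df} = \frac{30}{(k - 1)} = \frac{30}{3} = 10$$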

(iv) Calculate the variance within samples, as follows:

To find the sum of squares within samples (SSW), we square each deviation between the individual value of each sample and its mean, for all samples, and then sum these squared deviations, as follows:

Sample 1: $\bar{X}_1 = 9$

$$\sum (X_1 - \bar{X}_1)^2 = (8 - 9)^2 + (10 - 9)^2 + (12 - 9)^2 + (10 - 9)^2 + (5 - 9)^2 = 1 + 1 + 9 + 1 + 16 = 28$$

Sample 2: $\bar{X}_2 = 11$

$$\sum (X_2 - \bar{X}_2)^2 = (12 - 11)^2 + (12 - 11)^2 + (10 - 11)^2 + (8 - 11)^2 + (13 - 11)^2 = 1 + 1 + 1 + 9 + 4 = 16$$

Sample 3: $\bar{X}_3 = 12$

$$\sum (X_3 - \bar{X}_3)^2 = (10 - 12)^2 + (13 - 12)^2 + (11 - 12)^2 + (12 - 12)^2 + (14 - 12)^2 = 4 + 1 + 1 + 0 + 4 = 10$$

Sample 4: $\bar{X}_4 = 12$

$$\sum (X_4 - \bar{X}_4)^2 = (12 - 12)^2 + (15 - 12)^2 + (13 - 12)^2 + (10 - 12)^2 + (10 - 12)^2 = 0 + 9 + 1 + 4 + 4 = 18$$


Then, SSW = 28 + 16 + 10 + 18 = 72

Now, the variance within samples, $\sigma^2_{\text{within}}$, or MSW is given by:

$$MSW = \frac{SSW}{df} = \frac{SSW}{(N - k)} = \frac{72}{20 - 4} = \frac{72}{16} = 4.5$$

Then, the F-ratio is:

$$F = \frac{MSB}{MSW} = \frac{10}{4.5} = 2.22$$

Now, we check for the critical value of F from the table for α = 0.05 and degrees of freedom as follows:

df (numerator) = (k – 1) = (4 – 1) = 3

df (denominator) = (N – k) = (20 – 4) = 16

This value of F from the table is given as 3.24. Now, since our calculated value of F = 2.22 is less than the critical value of F = 3.24, we cannot reject the null hypothesis.

B. The Short-Cut Method

Following the procedure outlined before for using the short-cut method, we get:

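(i) Total Sum (TS) $= \sum X = 220$

(ii) Correction factor:

$$CF = \frac{(TS)^2}{N} = \frac{(220)^2}{20} = 2420$$

(iii) Total sum of squares:

$$SST = \sum (X)^2 - CF = 2522 - 2420 = 102$$

(iv) Sum of squares between the samples SSB is obtained by:

$$SSB = \sum_{i=1}^{k} \frac{(\sum X_i)^2}{n_i} - CF = \frac{(\sum X_1)^2}{n_1} + \frac{(\sum X_2)^2}{n_2} + \ldots + \frac{(\sum X_k)^2}{n_k} - CF$$

$$= \frac{(45)^2}{5} + \frac{(55)^2}{5} + \frac{(60)^2}{5} + \frac{(60)^2}{5} - 2420 = 405 + 605 + 720 + 720 - 2420 = 30$$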


(v) SSW can be calculated by:

SSW = SST – SSB = 102 – 30 = 72

Now the F value can be calculated as:

$$F = \frac{SSB/df}{SSW/df} = \frac{30/(k - 1)}{72/(N - k)} = \frac{30/3}{72/16} = \frac{10}{4.5} = 2.22$$

As we see, we get the same value of F as obtained by the traditional method. So, we compare our value of F with the critical value of F from the table for α = 0.05, df (numerator) = 3 and df (denominator) = 16, and we get the critical value of F as 3.24. As before, we cannot reject the null hypothesis.

The ANOVA Table

We can construct an ANOVA table for the problem solved above as follows:

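ANOVA Table

| Source of Variation | Sum of Squares | Degrees of Freedom | Mean Square | F |
|---|---|---|---|---|
| Treatment | SSB = 30 | (k – 1) = 3 | $MSB = \frac{SSB}{(k - 1)} = \frac{30}{3} = 10$ | $\frac{MSB}{MSW} = \frac{10}{4.5} = 2.22$ |
| Within (or error) | SSW = 72 | (N – k) = 16 | $MSW = \frac{SSW}{(N - k)} = \frac{72}{16} = 4.5$ | |
| Total | SST = 102 | | | |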
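The figures of Example 11.7 can be checked with a short script. The sketch below is an illustration, not part of the original solution: it applies the short-cut formulas to the scores of the four sections, prints the rows of the ANOVA table and compares the computed F with the table value 3.24 quoted in the text. The variable names and the print layout are assumptions made for this sketch, and the total degrees of freedom (N – 1) are added for completeness although the table above does not list them.

```python
# Scores from Example 11.7, one list per section.
sections = {
    "Section 1": [8, 10, 12, 10, 5],
    "Section 2": [12, 12, 10, 8, 13],
    "Section 3": [10, 13, 11, 12, 14],
    "Section 4": [12, 15, 13, 10, 10],
}
F_CRITICAL = 3.24  # table value quoted in the text for alpha = 0.05, df = (3, 16)

samples = list(sections.values())
k = len(samples)                               # number of samples
N = sum(len(s) for s in samples)               # total number of observations
total_sum = sum(sum(s) for s in samples)       # TS
cf = total_sum ** 2 / N                        # correction factor
sst = sum(x * x for s in samples for x in s) - cf
ssb = sum(sum(s) ** 2 / len(s) for s in samples) - cf
ssw = sst - ssb
msb = ssb / (k - 1)
msw = ssw / (N - k)
f_ratio = msb / msw

# Rows: source, sum of squares, degrees of freedom, mean square, F.
rows = [
    ("Treatment", ssb, k - 1, msb, f_ratio),
    ("Within (error)", ssw, N - k, msw, None),
    ("Total", sst, N - 1, None, None),
]
print(f"{'Source':<16}{'SS':>8}{'df':>5}{'MS':>8}{'F':>8}")
for source, ss, df, ms, f in rows:
    ms_text = f"{ms:.2f}" if ms is not None else ""
    f_text = f"{f:.2f}" if f is not None else ""
    print(f"{source:<16}{ss:>8.1f}{df:>5}{ms_text:>8}{f_text:>8}")

reject_null = f_ratio > F_CRITICAL
print("Reject H0" if reject_null else "Cannot reject H0")
```

Since the computed F of about 2.22 is below 3.24, the script reaches the same decision as the worked solution.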

Activity 2

Prepare a list of magazines and categorize them into three groups according to the educational level of the readers as high, medium and low. Select six advertisements randomly from each of the magazines and for each advertisement collect three different readability measures. Perform one-way ANOVA tests to determine whether advertisement readabilities of the three groups of magazines are different.


Self-Assessment Questions

5. Fill in the blanks with the appropriate terms.

(a) The simultaneous ______________ of several population means is called analysis of variance or ANOVA.

(b) F ratio contains only _____________ elements, which are the variance between the samples and the variance within the samples.

6. State whether true or false.

(a) The one-way analysis of variance refers to the situations when only one factor or variable is considered.

(b) The F distribution is a family of curves, so that there are similar curves for different degrees of freedom.

11.5 Summary

Let us recapitulate the important concepts discussed in this unit:

William S. Gosset (pen name Student) developed a significance test and through it made a significant contribution to the theory of sampling applicable in the case of small samples. When the population variance is not known, the test is commonly known as Student’s t-test and is based on the t distribution.

When n is small, the t distribution is far from normal, but as n approaches infinity it becomes identical with the normal distribution.

For applying the t-test in the context of small samples, the t value is calculated first and then compared with the table value of t at a certain level of significance for the given degrees of freedom.

R.A. Fisher developed the z-test to test the significance of the correlation coefficient in small samples. While applying the test, r of the sample is transformed into z, on account of which the test is also known as the z transformation.

The statistic z is used to test (i) whether an observed value of r is significantly different from a given hypothetical or known value of the population correlation and (ii) whether two sample values of r differ significantly from each other.


The one-way analysis of variance refers to the situations when only one factor or variable is considered.

The simultaneous comparison of several population means is called Analysis of Variance or ANOVA.

The F distribution is a family of curves, so that there are different curves for different degrees of freedom.

F ratio contains only two elements, which are the variance between the samples and the variance within the samples.

11.6 Glossary

t-test: Any statistical hypothesis test in which the test statistic follows a Student’s t distribution if the null hypothesis is true.

z-test: Any statistical test for which the distribution of the test statistic under the null hypothesis can be approximated by a normal distribution.

ANOVA: In statistics, analysis of variance or ANOVA is a collection of statistical models and their associated procedures in which the observed variance in a particular variable is partitioned into components attributable to different sources of variation.

11.7 Terminal Questions

1. Who developed t-test? When is it used?

2. Define z-test.

3. On what assumptions is the ANOVA methodology based?

4. What is the rationale behind analysis of variance?

5. Define degree of freedom with respect to ANOVA.

6. What are the major characteristics of F distribution? How is F computed?

7. How is ANOVA table constructed? Explain with the help of an example.


11.8 Answers

Answers to Self-Assessment Questions

1. (a) Variance; (b) Equality

2. (a) False; (b) True

3. (a) True; (b) True

4. (a) Transformation; (b) Difference

5. (a) Comparison; (b) Two

6. (a) True; (b) False

Answers to Terminal Questions

1. Refer Section 11.2

2. Refer Section 11.3

3. Refer Section 11.4

4. Refer Section 11.4.1

5. Refer Section 11.4.2

6. Refer Section 11.4.3

7. Refer Section 11.4.4

11.9 Further Reading

1. Elhance, D.N. Fundamentals of Statistics. Allahabad: Kitab Mahal, 2007.

2. Gupta, S.C. Fundamentals of Business Statistics. New Delhi: Sultan Chand & Sons, 2010.


Unit 12 Research Report Writing

Structure

12.1 Introduction
Objectives
12.2 Introduction to Report Writing
12.3 Types of Research Reports
12.4 Summary
12.5 Glossary
12.6 Terminal Questions
12.7 Answers
12.8 Further Reading

12.1 Introduction

In the previous unit, you learnt about the t-test, the z-test and the analysis of variance.

In this unit, you will learn the concept of report writing. Reports are used for different purposes by different departments of an organization. Industries, governments and businesses all need to prepare reports in order to collect information and to keep track of their performance and progress. The most important aspect of a report is to convey information in clear terms. It should provide the facts in a direct, straightforward and accurate manner. This unit introduces you to the different types of reports and also shows you the correct methods of report presentation.

Objectives

After studying this unit, you should be able to:

Describe the importance of reports

Explain the different types of reports

Define the characteristics of a good report

Describe the structure of a report

Use the correct method of presenting reports


12.2 Introduction to Report Writing

A report can be defined as a written document which presents information in a specialized and concise manner. A list of employees prepared by the HR department for salary distribution can be termed as a report. In other words, a report is information presented in a logical and concise manner.

There is a difference between report writing and other compositions because a report is written in a short and conventional format. A report should cover all mandatory matters but nothing extra should be written. For writing a report, at first the relevant data is collected and then it is presented in a concise and objective manner. Then, after successfully establishing the structure of the report, the formatting features that improve the look and readability of the report are added.

12.2.1 Types of Reports

Reports can be divided into different categories. The two main types of reports are:

Informational report

Interpretive report

Informational report

A report that consists of a collection of data or facts and is written in an orderly way is called an informational report. The main purpose of this type of report is to present the information in its original form without any conclusion or recommendation. Informational reports are further divided into four types as follows:

Inspection reports: Reports which show the outcome of inspecting products or equipment, either to assure their proper functioning or to describe their quality, are called inspection reports. This type of report is mainly used in manufacturing organizations.

Inventory reports: Reports which are made to keep stock of various things like furniture, equipment, stationery, utensils and other accessories are called inventory reports.

Assessment reports: These reports are made to maintain the database of the employees in an organization. Generally, these reports are useful for the HR department.


Performance reports: The reports which are made to measure the performance of the employees in an organization for different purposes like appraisal or promotion are called performance reports.

Interpretive report

An interpretive report contains a collection of data with its interpretation or any recommendation explicitly specified by the writer. This type of report also includes data analysis and conclusions made by the report writer. Writing interpretive reports is different from writing informational reports because they contain different elements. The possible elements that can be used in interpretive reports are:

Cover

Frontispiece

Title page

Copyright notice

Forwarding letter

Preface

Acknowledgements

Table of contents

List of illustrations

Abstract and summary

Introduction

Discussion

Conclusions

Recommendations

Appendices

List of references

Bibliography

Glossary

Index


12.2.2 Characteristics of a Good Report

The characteristics of a good report can be classified under the following four heads:

Language and style of the report

Structure of the report

Presentation of the report

References in the report

Each of the above aspects of report writing needs to be given due attention as they are interrelated. A report written in a lucid style but with very little or merely hypothetical information is of no use to the reader. Similarly, the report writer needs to avoid overcrowding the report with information, which may leave the reader confused and lost in the data. A systematic scrutiny of each of these aspects of a report is, therefore, necessary.

1. Language and Style of a Report: A report must have a logical structure with a clear indication of where the ideas are leading. It should be able to make a good first impression. The presentation of the report is very important. All reports must be written in good language, using short sentences and correct grammar and spelling. The main points to be kept in mind in this light are as follows:

Context and style:

o Appropriate and informative title for the content of the report

o Crisp, specific, unbiased writing with minimal jargon

o Adequate analysis of prior relevant research

Questions/Hypotheses:

o Clearly stated questions or hypotheses

o Thorough operational definitions of key concepts along with exact wording or measurement of key variables

Research procedures:

o Full and clear description of the research design

o Demographic profile of the participants/subjects

o Specific data gathering procedures


Data analysis:

o Appropriate inferential statistics for sample or experimental data and appropriate use of descriptive statistics

o Clear and reasonable interpretation of the statistical findings, accompanied by effective tables and figures

Summary:

o Fair assessment of the implications and limitations of the findings

o Effective commentary on the overall implications of the findings for theory and/or policy

2. Structure of a Report: Before you write a report, you should define the high level structure of the report. Defining a clear logical structure will make the report easier to write and to read. There are two types of report structures, which are listed as follows:

Report Structure I: In general, the report writing structure comprises the following subheadings:

o Title Page

o Abstract

o Table of Contents

o Introduction

o Technical Detail and Results

o Discussion and Conclusions

o References

o Appendices

Report Structure II: There is also a specific structure of report writing pertaining to technical or scientific reports which is as follows:

o Introduction

o Background and Context

o Technical Details

o Results

o Discussion and Conclusion


Order of writing:

o Start with the technical chapters/sections.

o Follow with the discussion.

o Finally, write the conclusions, introduction and abstract, if you are including any.

Appendix: The appendix should contain the following:

o Material that supports the main report but cannot be included in the main text, either because it is too long or because it is not essential reading, for example, lists of parameter values.

o Bibliography, i.e., a list of all the sources of material you referred to in your report.

3. Presentation of a Report: As stated earlier, neither a large volume of data nor a lucid writing style is sufficient on its own for good report writing. Both aspects need to be given due consideration, so that together they give a simple, easy-to-read and comprehensive report. The same goes for the presentation of the contents of the report. Printing mistakes and inconsistent use of font size and style can distract the reader. On the other hand, effective use of tables and figures to present data and conclusions facilitates easy comprehension. The main points requiring the report writer's attention are as follows:

Capitals: This requires taking care of the following aspects:

o Using capitals only for proper nouns, place names, organization names, etc.

o Defining acronyms at the first point of usage. For example, Incorporated (Inc).

o Using bold, italics or underlines for emphasis, instead of capitals.

Headings: The basic points to be kept in mind for headings are as follows:

o Differentiate headings from the rest of the text using different fonts, bold, italics or underlines.

o Maintain consistency in formatting headings using predefined styles.

o Avoid headings beyond three levels.


Tables, figures and equations: In general, certain formatting standards are followed while giving tables and figures, which are as follows:

o All tables must be labelled descriptively at the top and referenced in the text.

o All figures must be labelled descriptively at the top and must be referenced in the text.

o All equations must be numbered consecutively.

General presentation:

o Sheets should be of white A4 size and printed on one side only.

o Text should be justified on both sides, with a blank line left between paragraphs.

o A staple in the top right hand corner is sufficient for most of the reports.

4. References in a Report: Several report types like scientific, engineering, technical and census reports contain either original writing or text adopted from previous work. As such, a report writer should be careful and should avoid any violation of copyright laws and plagiarism. The necessary rule of thumb in this regard can be stated as follows:

o Citations and referencing:

– A citation is the acknowledgement in your writing of the work of other authors and includes paraphrasing and making direct quotes.

– Unless a direct quotation is really necessary, you should write the material in your own words. This shows that you have understood what you have read and know how to apply it to your own context.

– Direct quotes should be used sparingly.

o Direct quotes:

– Short direct quotes: These need to be placed between quotation marks. For example, Rosenfield defines a cluster as a ‘geographically bounded concentration of similar, related or complementary businesses, with active channels for business transactions, communications and dialogue that share specialized infrastructure, common opportunities and threats’. This shows clearly that the words being used are not your own words.


– Longer direct quotes: There are occasions when it is useful to include longer direct quotes. If you are quoting more than forty words, you should again use quotation marks but also indent the text. For example, the sustainability of higher value added industry is grounded in the diminishing significance of cost structures. At the level of the European Union, a weak capacity to innovate has been identified as an innovation, in the sense of product, process, and organizational innovation, accounts for a very large amount, perhaps 80–90 per cent of the growth in productivity in advanced economies.

12.2.3 Mechanics of Writing a Report

There are several parameters that are strictly followed while preparing technical reports. The following points should be considered for writing a technical report:

Size and physical design: The manuscript, if handwritten, should be in black or blue ink and on unruled paper of 8½" × 11" size. A margin of at least one-and-a-half inches should be set on the left side and half an inch on the right side of the paper. The top and bottom margins should be of one inch each. If the manuscript is to be typed, then all typing should be double spaced and on one side of the paper, except for the insertion of long quotations.

Layout: According to the objective and nature of the research, the layout of the report should be decided and followed in a proper manner.

Quotations: Quotations should be placed in quotation marks and double spaced, forming an immediate part of the text. However, if a quotation is too lengthy, then it should be single spaced and indented at least half an inch to the right of the normal text margin.

Footnotes: Footnotes are meant for cross-references. They are placed at the bottom of the page, separated from the textual material by a space of half an inch and a line that is around one-and-a-half inches long. Footnotes are always typed in single space, though they are divided from one another by double space.

Documentation style: The first footnote reference to any given work should be complete, giving all essential facts about the edition used. Such footnotes follow a general sequence and order:

o In case of the single volume reference:

– Author’s name in normal order


– Title of work, underlined to indicate italics

– Place and date of publication

– Page number reference

For example,

John Gassner, Masters of the Drama, New York: Dover Publications, Inc., 1954, p. 315.

o In case of a multivolume reference:

– Author’s name in the normal order

– Title of work, underlined to indicate italics

– Place and date of publication

– Number of the volume

– Page number reference

For example,

George Birkbeck Hill, Life of Johnson, June 2004, Whitefish, Volume 2, p. 124.

o In case of works arranged alphabetically:

– For works arranged alphabetically such as encyclopaedias and dictionaries, no page reference is usually needed. In such cases, order is illustrated according to the names of the topics.

– Name of the Encyclopaedia

– Number of the edition

For example,

‘Salamanca’ Encyclopaedia Britannica, 14th Edition.

o In case of periodicals reference:

– Name of the author in normal order

– Title of article, in quotation marks

– Name of the periodical, underlined to indicate italics

– Volume number

– Date of issuance

– Pagination


For example,

Shahad, P.V. ‘Rajesh Jain’s Ecosystem’, in Business Today, Vol. 14, December 18, p. 28, 2005.

o In case of multiple authorship:

If there are more than two authors or editors, then in the documentation, the name of only the first is given and multiple authorship is indicated by ‘et al.’ or ‘and others’.

– Author’s name in normal order

– Title of work, underlined to indicate italics

– Place and date of publication

– Pagination references

For example,

Alexandra K. Wigdor, Ability Testing: Uses Consequences and Controversies, 1981, p. 23.

Subsequent references to the same work need not be detailed. If the work is cited again without any other work intervening, it may be indicated as ibid, followed by a comma and the page number.

Punctuations and abbreviations in footnotes: Punctuation concerning the book and author names has already been discussed. They are general rules to be strictly adhered to. Some English and Latin abbreviations are often used in bibliographies and footnotes to eliminate any repetition.

Table 12.1 shows the various English and Latin abbreviations used in bibliographies and footnotes.

Table 12.1 English and Latin Abbreviations used in Bibliographies and Footnotes

| Abbreviation | Meaning |
|---|---|
| Anon. | Anonymous |
| Ante | Before |
| Art. | Article |
| Aug. | Augmented |
| bk. | Book |
| bull. | Bulletin |
| cf. | Compare |
| ch. | Chapter |
| col. | Column |
| diss. | Dissertation |
| ed. | editor, edition, edited |
| ed. cit. | edition cited |
| e.g. | exempli gratia: for example |
| enl. | Enlarged |
| et al. | and others |
| et seq. | et sequens: and the following |
| ex. | Example |
| f., ff. | and the following |
| fig(s). | figure(s) |
| fn. | Footnote |
| ibid., ibidem | in the same place |
| id., idem | the same |
| ill., illus., or illust(s). | illustrated, illustration(s) |
| Intro., intro. | introduction |
| l., ll. | line(s) |
| loc. cit. | in the place cited; used as op. cit. |
| MS., MSS. | Manuscript(s) |
| N.B. | nota bene: note well |
| n.d. | no date |
| n.p. | no place |
| no pub. | no publisher |
| no(s). | number(s) |
| o.p. | out of print |
| op. cit. | in the work cited |
| p., pp. | page(s) |
| passim | here and there |
| Post | After |

Use of statistics, charts and graphs: Statistics contribute to clarity and simplicity in a report. They are usually presented in the form of tables, charts, bars, line-graphs and pictograms.

Final draft: It requires careful scrutiny with regard to grammatical errors, logical sequence and coherence in the sentences of the report.


Index: An index acts as a good guide to the reader. It can be prepared both as a subject index and an author index, giving names of subjects and names of authors, respectively. The names are followed by the page numbers of the report where they have appeared or been discussed.

12.2.4 Research Report: An Overview

In simple terms, a research report is a written document which describes the findings of an individual or group of individuals. It gives an account of something seen, heard, done, etc. The findings may comprise such information as data, surveys, resolutions or policies on which the concerned individual or individuals have to submit their reports, which should include the proceedings as well as the relevant conclusions.

The preparation and presentation of a research report is the most important part of the research process. No matter how well designed the research study is, it is of little value unless communicated effectively to others in the form of a research report. Moreover, if the report is confusing or poorly written, then the time and effort spent on gathering and analysing data would be wasted. It is therefore essential to summarize and communicate the result to the management of an organization with the help of an understandable and logical research report.

Research reports are helpful during the research study, in the sense that they facilitate maintenance of vast data in a logical way. Thus, in case the researcher experiences any difficulty during the course of the study, it becomes easier to refer to the contents of the report to get the relevant data. Research report writing essentially involves systematic arrangement of data. This helps in discovering flaws in reasoning, which may have been missed earlier while conducting the research.

1. Format of a Research Report: The layout of the research report is of utmost importance because the reader should be able to grasp logically what has been said and not feel lost in the bulk of findings mentioned in the research. This requires preparing a proper layout of the report. Report layout means arranging the research findings in a comprehensible format. The layout should contain the following points:

Preliminary pages: In the preliminary pages, the report should carry a ‘title’ and a ‘date’, followed by acknowledgements in the form of a ‘Preface’ or ‘Foreword’. The ‘Table of Contents’ should come next, followed by a ‘list of tables and illustrations’. This facilitates easy reading and quick location of the required information.

Page 297: BCC104 Business Statistics

Business Statistics Unit 12

Sikkim Manipal University Page No. 291

Main text: The main text comprises the complete outline of the research report with all the details. The title of the research study is repeated at the top of the first page of the main text, followed by the other details on pages numbered consecutively, beginning with the second page. The main text can be classified into the following sections:

o Introduction: The purpose of the introduction is to introduce the research project to the readers. It should clearly state the objectives of the research, i.e., it should clarify why the problem was considered worth investigating. A brief summary of other relevant research can be included as well, to enable the reader to see the present study in that context.

o The methodology used for performing the study: The introduction should contain answers to questions such as: How was the study carried out? What was the basic design? What were the experimental directions? What questions were asked in the questionnaires used? Besides this, the scope and limitations of the study must be marked out.

o Statement of findings and recommendations: The research report should contain a statement of findings and recommendations in non-technical language so that it is easily comprehensible.

o Results: A detailed presentation of the findings of the study, with supporting data in tabular form along with the validation of results, should be given. This section should contain statistical summaries and deductions from the data rather than the raw data. There should be a logical sequence and sectional presentation of the results.

o Implications of the result: The researcher should restate the results clearly and precisely at the end of the main text. The implications derived from the results of the research study should be stated in the research plan. The report should also mention the conclusion drawn from the study, which should be clearly related to the hypothesis stated in the introductory section.

o Summary: The next step is to conclude the report with a short summary, mentioning in brief the research problem, the methodology, the major findings and the major conclusions drawn from the research results.

o End matter: The end of the research report should consist of appendices listed with respect to all technical data such as questionnaires, sample information and mathematical derivations. The bibliography of the referred sources and an index should also be given.

2. Precautions for Writing Research Reports: A research report is the means of conveying the research study to a specific target audience. The following precautions should be taken while preparing the research report:

It should be long enough to cover the subject and short enough to preserve the interest.

It should not be dull and complicated.

It should be simple, without the usage of abstract terms and technical jargon.

It should offer ready availability of findings with the help of charts, tables and graphs, as readers prefer quick knowledge of main findings.

The layout of the report should be in accordance with the objective of the research study.

There should be no grammatical errors and the writing should adhere to the techniques of report writing in the case of quotations, footnotes and documentation.

It should be original, intellectual and should contribute to the solution of a problem or add knowledge to the concerned field.

Appendices should be listed with respect to all the technical data in the report.

It should be attractive, neat and clean, whether handwritten or typed.

The report writer should not confuse the possessive form of the word ‘it’ with the contraction ‘it’s’. The possessive form of ‘it’ is ‘its’; ‘it’s’ is short for ‘it is’.

A report should not have contractions. Examples are ‘didn’t’ or ‘it’s’. In report writing, it is best to use the non-contracted form. Hence, the examples would be replaced by ‘did not’ and ‘it is’. Using ‘Figure’ instead of ‘Fig.’ and ‘Table’ instead of ‘Tab.’ will spare the reader from having to translate the abbreviations while reading. If abbreviations are used, use them consistently throughout the report. For example, do not switch between ‘versus’ and ‘vs’.

It is advisable to avoid using the word ‘very’ and other such words that try to embellish a description. They do not add any extra meaning and, therefore, should be dropped.

Repetition hampers lucidity. The report writer must avoid using the same word more than once within a sentence.

When you use the word ‘this’ or ‘these’, make sure you indicate what you are referring to. This reduces the ambiguity in your writing and helps to tie the sentences together.

Do not use the word ‘they’ to refer to a singular person. You can either rewrite the sentence to avoid using such a reference or use the singular ‘he’ or ‘she’.

12.2.5 Written and Oral Reports

A written report plays a vital role in every business operation. The manner in which an organization writes business letters and business reports creates an impression of its standard. Therefore, the organization should emphasize the improvement of the writing skills of its employees in order to maintain effective relations with its customers.

Preparing an effective written report requires a lot of hard work. Therefore, before you begin writing, it is important to know the objective, i.e., the purpose of writing, and to collect and organize the required data.

1. Written Report: Writing a report is the best way to communicate, and often the only way to convey one’s ideas to others. Thus, it is necessary that the writing should be effective. To improve the effectiveness of a written report, the following important points should be kept in mind:

Take breaks in between writing, since this gives you the time to incubate the ideas.

Start by writing a short manuscript first, and later on, the detailed one. Create an outline and organize the complete work.

Make a checklist of the important points that must be covered in the manuscript.

Focus on one objective at a time.

Use a dictionary and relevant reference materials as and when required.

Principles of writing a report

To write a useful report, it is necessary to follow certain principles. The following are the principles that must be followed while writing a report:

Principle of purpose: A report must have a clear and meaningful purpose that can be converted into effective management action. A clear statement of the purpose helps prepare a well-focussed report on which the management can work. Specification of the purpose is important because,

o Reports are the analysis of facts and proposals.

o Reports are the record of a particular business activity.

Principle of organization: A report should be well-designed and well-ordered. The managerial plan of a report must include the following:

o Purpose of report

o Information required to be included in the report

o Method used to collect report data

o Summary of the report

o Problems and solutions of the subject mentioned in the report

o An appendix that describes and confirms the content and conclusion of the report

Principle of brevity: Reports should be concise. It is essential because,

o Long reports are costly.

o Long reports are difficult to examine.

o Long reports are prone to disapproval, as they seem insufficient.

o Long reports focus on irrelevant minor details that may lead to the ignorance of major points.

Principle of clarity: Reports should be clear. Clarity can be maintained by using simple language for writing the report. New terms, if any, should be properly explained to avoid confusion.

Principle of scheduling: Reports should be prepared at a time when there is no undue burden on the staff or when the staff has sufficient time to prepare reports. However, the time period between the gathering of data and the generation of finished reports should not be long; otherwise, the report may become outdated and useless.

Principle of cost: While preparing reports, it is necessary to carry out a cost-benefit analysis of the report. A report should minimize cost and maximize benefit. If the cost of preparing the report is high but its benefit is low, then it is not advisable to prepare that report.

Different formats of written reports

A written report can be written in various formats, some of which are as follows:

Straight-line format: This format is used when the information is to be presented in alphabetical, sequential or numerical order. This format is used to generate descriptive reports.

Building blocks format: This format is used when the information presented leads to some conclusion. The report in this format starts with a brief introduction, contains some logical facts and finally gives the conclusions and recommendations.

Inverted pyramid format: The report in this format has the most important item at the top and the least important item at the bottom. That is, items are listed in descending order of importance. This style of writing or format is also known as the journalistic style or format.

2. Oral Report: At times, oral presentation of the results drawn from research is considered effective, particularly in cases where policy recommendations are to be made. This approach proves beneficial because it provides a medium of interaction between the listener and the speaker. This leads to a better understanding of the findings and their implications. However, the main drawback of oral presentation is the lack of any permanent record of the research. Oral presentation of the report is also effective when it is supported by various visual devices such as slides, wall charts and white boards that help in better understanding of the research reports.

Advantages of oral reports

Oral reports help in direct communication without any delay. The following are some of the advantages of an oral report:

It provides immediate feedback to the participants of the oral report. Moreover, participants can also ask for further clarification, elaboration and justification.

It is time saving.

It helps develop relationships among employees by building a healthy atmosphere in an organization.

It is an effective tool of persuasion in business.

It is economical as it saves a large amount of money spent on stationery.

It provides the speaker with the opportunity to correct himself and make himself clear on the spot.

It helps speakers to immediately understand the reaction of the group that they are addressing.

Disadvantages of oral reports

There are many disadvantages of oral reports; these include the following:

Oral reports may not always be time saving. Sometimes, the meeting between the speaker and the listener can continue for a very long time without any satisfactory conclusion.

A listener of the oral report cannot always retain the entire message.

The messages in oral reports do not have any legal validity as they are not documented.

Oral reports may sometimes be misleading if the thoughts of the speaker are not organized carefully.

Lengthy oral messages may sometimes cause problems.

Activity 1

Suppose you have to write a report on the socio-economic awareness of your country. List the headings and the procedure that you will include for your research purpose.

Self-Assessment Questions

1. Fill in the blanks with the appropriate terms.

(a) A report that consists of a collection of data or facts and is written in an orderly way is called an __________________ report.

(b) An interpretive report contains a collection of data with its interpretation or any _____________________ explicitly specified by the writer.

2. State whether true or false.

(a) A report should cover all mandatory matters but nothing extra should be written.

(b) For writing a report, at first the relevant data is collected and then it is presented in a concise and objective manner.

12.3 Types of Research Reports

Research reports are designed to convey and record information that will be of practical use to the reader. They are organized into distinct units of specific and highly visible information. The kind of audience addressed in the research report decides the type of report. Research reports can be categorized on the following basis:

On the basis of information

On the basis of representation

12.3.1 Classification on the Basis of Information

Following are the ways in which the results of the research report can be presented on the basis of information contained:

Technical report: A technical report is written for other researchers. In writing technical reports, emphasis is mainly placed on the methods that have been used to collect the information and the data, the presumptions that are made and, finally, the various presentation techniques that are used to present the findings and the data. Following are the main features of a technical report:

o Summary: It covers a brief analysis of the findings of the research in a very few pages.

o Nature: It contains the reasons for which the research is undertaken, and the analysis and the data that are required in order to prepare a report.

o Methods employed: It contains a description of the methods that were employed in order to collect the data.

o Data: It covers a brief analysis of the various sources from which the data has been collected, with their features and drawbacks.

o Analysis of data and presentation of the findings: It contains the various forms through which the data that has been analyzed can be presented.

o Conclusions: It contains a brief explanation of findings of the research.

o Bibliography: It contains a detailed list of the various sources that have been used in order to conduct the research.

o Technical appendices: It contains the appendices for the technical matters and for questionnaires and mathematical derivations.

o Index: The index of the technical report must be provided at the end of the report.

Popular report: A popular report is formulated when there is a need to draw the conclusions of the findings of the research report. One of the main points to keep in mind while formulating a popular report is that it must be simple and attractive. It must be written in a very simple manner that is understandable to all. It must also be made attractive by using large print, various subheadings and occasional cartoons. Following are the main points that must be kept in mind while preparing a popular report:

o Findings and their implications: While preparing a popular report, the main importance is given to the findings and the conclusions that can be drawn from these findings.

o Recommendations for action: If there are any deviations in the report, then recommendations are made for taking corrective action in order to rectify the errors.

o Objective of the study: In a popular report, the specific objective for which the research has been undertaken is presented.

o Methods employed: The report must contain the various methods that have been employed in order to conduct the research.

o Results: The results of the research findings must be presented in a suitable and appropriate manner with the help of charts and diagrams.

o Technical appendices: The report must contain, in the form of appendices, in-depth information on the methods used to collect the data.

12.3.2 Classification on the Basis of Representation

Following are the ways through which the results of the research report can be presented on the basis of representation:

Written report

Oral report

For details of these two categories of reports, see Section 12.2.5.

Activity 2

Collect a sample of technical appendices from any printed research report.

Self-Assessment Questions

3. State whether true or false.

(a) Research reports are not designed to convey and record the information that will be of practical use to the reader.

(b) The index of the technical report must be provided at the end of the report.

4. Fill in the blanks with the appropriate terms.

(a) A _________________ report is formulated when there is a need to draw the conclusions of the findings of the research report.

(b) If there are any _______________ in the report, then recommendations are made for taking corrective action in order to rectify the errors.

12.4 Summary

Let us recapitulate the important concepts discussed in this unit:

A report can be defined as a written document which presents information in a specialized and concise manner.

There is a difference between report writing and other compositions because a report is written in a short and conventional format. A report should cover all mandatory matters but nothing extra should be written.

A report that consists of a collection of data or facts and is written in an orderly way is called an informational report. The main purpose of this type of report is to present the information in its original form without any conclusion and recommendation.

An interpretive report contains a collection of data with its interpretation or any recommendation explicitly specified by the writer.

Defining a clear logical structure will make the report easier to write and to read.

A citation is the acknowledgement in your writing of the work of other authors and includes paraphrasing and making direct quotes.

In simple terms, a research report is a written document which describes the findings of an individual or a group of individuals. It gives an account of something seen, heard, done, etc. Research reports are helpful during the research study, in the sense that they facilitate the maintenance of vast data in a logical way.

A written report plays a vital role in every business operation. Writing a report is the best way to communicate, and often the only way to convey one’s ideas to others. Thus, it is necessary that the writing should be effective.

Oral reports help in direct communication without any delay. They provide the speaker with the opportunity to correct himself and make himself clear on the spot.

In writing technical reports, emphasis is mainly placed on the methods that have been used to collect the information and the data, the presumptions that are made and, finally, the various presentation techniques that are used to present the findings and the data.

12.5 Glossary

Report: A written document presenting information in a specialized and concise manner.

Informational report: A report consisting of a collection of data or facts written in an orderly manner.

Interpretive report: A report containing a collection of data with its interpretation or any recommendation explicitly specified by the writer.

Research report: A written document describing the findings of an individual or a group of individuals.

12.6 Terminal Questions

1. What are the different formats of written reports?

2. What are the points that you should keep in mind while writing a popular report?

3. What care should be taken with the following elements of a report:

(i) Direct quotes

(ii) Citations

(iii) Referencing

4. State the mechanics of writing a report.

5. Differentiate between written and oral reports.

6. What are the different types of research reports?

12.7 Answers

Answers to Self-Assessment Questions

1. (a) Informational; (b) Recommendation

2. (a) True; (b) True

3. (a) False; (b) True

4. (a) Popular; (b) Deviations

Answers to Terminal Questions

1. Refer Section 12.2.1

2. Refer Section 12.2.2

3. Refer Section 12.2.2

4. Refer Section 12.2.3

5. Refer Section 12.2.5

6. Refer Section 12.3

12.8 Further Reading

1. Elhance, D.N. Fundamentals of Statistics. Allahabad: Kitab Mahal, 2007.

2. Gupta, S.C. Fundamentals of Business Statistics. New Delhi: Sultan Chand & Sons, 2010.

Page 308: BCC104 Business Statistics
Page 309: BCC104 Business Statistics

Business Statistics Unit 13

Sikkim Manipal University Page No. 303

Unit 13 Exercise-I

Example 1: How will you classify people according to gender using a nominal scale?

Solution: In the example below, the number ‘1’ is assigned to ‘male’ and the number ‘2’ is assigned to ‘female’. We can just as easily assign the number ‘1’ to ‘female’ and ‘2’ to ‘male’. The purpose of the number is merely to name the characteristic or give it an ‘identity’.

As we can see from the graphs, changing the number assigned to ‘male’ and ‘female’ does not have any impact on the data: we still have the same number of men and women in the data set.

Example 2:

What type of questions should be avoided in a questionnaire?

Solution: The following types of questions should be avoided when preparing a questionnaire.

1. Embarrassing Questions: Embarrassing questions are questions that ask respondents for details about personal and private matters. Embarrassing questions are mostly avoided because you would lose the trust of your respondents. Your respondents might also feel uncomfortable answering such questions and might refuse to answer your questionnaire.

2. Positive/Negative Connotation Questions: Since most verbs, adjectives and nouns in the English language have either positive or negative connotations, questions are bound to take on a positive or negative connotation. While defining a question, strong negative or positive overtones must be avoided. Depending on the positive or negative connotation of your question, you will get different data. Ideal questions should have neutral or subtle overtones.

3. Hypothetical Questions: Hypothetical questions are questions that are based on speculation and fantasy. An example of a hypothetical question would be “If you were the CEO of ABC organization, what changes would you bring?” Questions of this type force the respondent to give his or her ideas on a particular subject. However, this kind of question will not give you consistent or clear data. Hypothetical questions are mostly avoided in questionnaires.

Example 3:

Find the mode of the following data set:

48, 45, 46, 35, 45, 46, 35, 57, 34, 46, 48, 48, 46, 67

Solution:

The mode is 46, which occurs 4 times.
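The count behind this answer can be checked with a short Python sketch. The variable names are illustrative only; the tally uses the standard-library `collections.Counter`.

```python
from collections import Counter

# Data set from Example 3
data = [48, 45, 46, 35, 45, 46, 35, 57, 34, 46, 48, 48, 46, 67]

# Tally how often each value occurs and take the most frequent one
counts = Counter(data)
mode_value, mode_count = counts.most_common(1)[0]

print(mode_value, mode_count)  # 46 4
```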

Example 4:

Find the median of the following data set:

12 18 16 21 10 13 17 19

Solution:

Arrange the data values in order from the lowest value to the highest value:

10 12 13 16 17 18 19 21

The number of values in the data set is 8, which is even. So, the median is the average of the two middle values.

\[ \text{Median} = \frac{4\text{th data value} + 5\text{th data value}}{2} = \frac{16 + 17}{2} = 16.5 \]
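The same rule (sort, then average the two middle values when the count is even) can be sketched in Python. The helper name `median` is an illustrative choice, not part of the text.

```python
def median(values):
    """Return the middle value, or the mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    # Even count: average the values on either side of the middle
    return (ordered[mid - 1] + ordered[mid]) / 2

data = [12, 18, 16, 21, 10, 13, 17, 19]
result = median(data)
print(result)  # 16.5
```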

Example 5:

The marks of seven students in a mathematics test with a maximum possible mark of 20 are given below:

15 13 18 16 14 17 12

Find the mean of this set of data values.

Solution:

\[ \text{Mean} = \frac{15 + 13 + 18 + 16 + 14 + 17 + 12}{7} = \frac{105}{7} = 15 \]

So, the mean is 15.

Example 6:

Find the mean, median, mode and range for the following list of values:

13, 18, 13, 14, 13, 16, 14, 21, 13

Solution:

The mean is the usual average, so:

Mean = (13 + 18 + 13 + 14 + 13 + 16 + 14 + 21 + 13) / 9 = 15

The median is the middle value, so arrange the data in ascending order as follows:

13, 13, 13, 13, 14, 14, 16, 18, 21

There are nine numbers in the list, so the middle one will be the (9 + 1) ÷ 2 = 10 ÷ 2 = 5th number:

13, 13, 13, 13, 14, 14, 16, 18, 21

So the median is 14.

The mode is the number that is repeated more often than any other, so mode is 13.

The largest value in the list is 21 and the smallest is 13, so the range is 21 – 13 = 8.
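As a cross-check, the four measures can be computed with Python's standard `statistics` module. This is a minimal sketch; the variable names are illustrative.

```python
import statistics

values = [13, 18, 13, 14, 13, 16, 14, 21, 13]

mean_value = statistics.mean(values)      # arithmetic average
median_value = statistics.median(values)  # middle value of the sorted list
mode_value = statistics.mode(values)      # most frequent value
range_value = max(values) - min(values)   # largest minus smallest

print(mean_value, median_value, mode_value, range_value)  # 15 14 13 8
```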

Example 7:

Calculate the standard deviation for the data given below:

4, 2, 5, 8, 6

Solution:

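Calculate the mean of the data:

\[ \bar{x} = \frac{4 + 2 + 5 + 8 + 6}{5} = \frac{25}{5} = 5 \]

Now, for each value in the sample, find the deviation from the mean and square it:

| x | x − x̄ | (x − x̄)² |
|---|---|---|
| 4 | −1 | 1 |
| 2 | −3 | 9 |
| 5 | 0 | 0 |
| 8 | 3 | 9 |
| 6 | 1 | 1 |
| Total | 0 | 20 |

Now the standard deviation is:

\[ s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{\frac{20}{4}} = \sqrt{5} \approx 2.24 \]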
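The answer of 2.24 corresponds to the sample standard deviation, which divides the sum of squared deviations by n − 1. The sketch below computes it step by step and, for comparison only, also shows the population form (division by n); the variable names are illustrative.

```python
import math

data = [4, 2, 5, 8, 6]
n = len(data)

mean = sum(data) / n                               # 5.0
squared_deviations = [(x - mean) ** 2 for x in data]
sum_sq = sum(squared_deviations)                   # 20.0

sample_sd = math.sqrt(sum_sq / (n - 1))            # divides by n - 1
population_sd = math.sqrt(sum_sq / n)              # divides by n, shown for comparison

print(round(sample_sd, 2))   # 2.24
print(population_sd)         # 2.0
```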

Example 8: From the following data, construct index numbers of prices for 1986 with 1980 as base, using (i) Laspeyre’s method, (ii) Paasche’s method, (iii) Bowley-Drobisch method, (iv) Marshall-Edgeworth method, (v) Fisher’s ideal formula.

| Commodity | 1980: Price per unit | 1980: Expenditure in rupees | 1986: Price per unit | 1986: Expenditure in rupees |
|---|---|---|---|---|
| A | 2 | 10 | 4 | 16 |
| B | 3 | 12 | 6 | 18 |
| C | 1 | 8 | 2 | 14 |
| D | 4 | 20 | 8 | 32 |

Solution:

Since we are given the price and the total expenditure for the years 1980 and 1986, we shall first calculate the quantities for the two years by dividing the expenditure by the price, and then we shall calculate the index numbers as follows:

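| Commodity | p0 | q0 | p1 | q1 | p0q0 | p0q1 | p1q0 | p1q1 |
|---|---|---|---|---|---|---|---|---|
| A | 2 | 5 | 4 | 4 | 10 | 8 | 20 | 16 |
| B | 3 | 4 | 6 | 3 | 12 | 9 | 24 | 18 |
| C | 1 | 8 | 2 | 7 | 8 | 7 | 16 | 14 |
| D | 4 | 5 | 8 | 4 | 20 | 16 | 40 | 32 |
| Total | | | | | Σp0q0 = 50 | Σp0q1 = 40 | Σp1q0 = 100 | Σp1q1 = 80 |

(i) Laspeyre’s price index:

\[ P_{01} = \frac{\sum p_1 q_0}{\sum p_0 q_0} \times 100 = \frac{100}{50} \times 100 = 200 \]

(ii) Paasche’s price index:

\[ P_{01} = \frac{\sum p_1 q_1}{\sum p_0 q_1} \times 100 = \frac{80}{40} \times 100 = 200 \]

(iii) Bowley-Drobisch price index:

\[ P_{01} = \frac{\dfrac{\sum p_1 q_0}{\sum p_0 q_0} + \dfrac{\sum p_1 q_1}{\sum p_0 q_1}}{2} \times 100 = \frac{\dfrac{100}{50} + \dfrac{80}{40}}{2} \times 100 = 200 \]

(iv) Marshall-Edgeworth price index:

\[ P_{01} = \frac{\sum p_1 q_0 + \sum p_1 q_1}{\sum p_0 q_0 + \sum p_0 q_1} \times 100 = \frac{100 + 80}{50 + 40} \times 100 = 200 \]

(v) Fisher’s ideal price index:

\[ P_{01} = \sqrt{\frac{\sum p_1 q_0}{\sum p_0 q_0} \times \frac{\sum p_1 q_1}{\sum p_0 q_1}} \times 100 = \sqrt{\frac{100}{50} \times \frac{80}{40}} \times 100 = \sqrt{2 \times 2} \times 100 = 200 \]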
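The five index numbers can be reproduced with a short Python sketch. It derives the quantities from expenditure divided by price, as in the solution, and then applies each formula; the dictionary layout and variable names are illustrative assumptions.

```python
import math

# commodity: (price 1980, expenditure 1980, price 1986, expenditure 1986)
commodities = {
    "A": (2, 10, 4, 16),
    "B": (3, 12, 6, 18),
    "C": (1, 8, 2, 14),
    "D": (4, 20, 8, 32),
}

sum_p0q0 = sum_p0q1 = sum_p1q0 = sum_p1q1 = 0.0
for p0, e0, p1, e1 in commodities.values():
    q0 = e0 / p0  # base-year quantity = expenditure / price
    q1 = e1 / p1  # current-year quantity
    sum_p0q0 += p0 * q0
    sum_p0q1 += p0 * q1
    sum_p1q0 += p1 * q0
    sum_p1q1 += p1 * q1

laspeyres = sum_p1q0 / sum_p0q0 * 100          # base-year quantities as weights
paasche = sum_p1q1 / sum_p0q1 * 100            # current-year quantities as weights
bowley = (laspeyres + paasche) / 2             # arithmetic mean of the two
marshall_edgeworth = (sum_p1q0 + sum_p1q1) / (sum_p0q0 + sum_p0q1) * 100
fisher = math.sqrt(laspeyres * paasche)        # geometric mean of the two

print(laspeyres, paasche, bowley, marshall_edgeworth, fisher)
```

All five indices coincide here because every price exactly doubles between 1980 and 1986, so the choice of weights makes no difference.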

Example 9: Construct a pie chart in percentage for the given data of a publishing house (cost is in rupees):

| Item | Cost |
|---|---|
| Promotion cost | 10,000 |
| Royalty cost | 15,000 |
| Binding cost | 20,000 |
| Paper cost | 25,000 |
| Transportation cost | 10,000 |
| Printing cost | 20,000 |

Solution:

The following pie chart shows the percentage distribution of the expenditure incurred in publishing a book as per the given data. The total cost is 100,000, so the shares are: promotion 10%, royalty 15%, binding 20%, paper 25%, transportation 10% and printing 20%.

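The percentage share of each cost item, and the corresponding angle of its slice (share × 360°), can be computed as sketched below. The variable names are illustrative, and the angle calculation is an addition that follows from the usual construction of a pie chart.

```python
# Cost data from Example 9 (in rupees)
costs = {
    "Promotion": 10_000,
    "Royalty": 15_000,
    "Binding": 20_000,
    "Paper": 25_000,
    "Transportation": 10_000,
    "Printing": 20_000,
}

total = sum(costs.values())

# Percentage share and slice angle for each item
shares = {item: cost / total * 100 for item, cost in costs.items()}
angles = {item: cost / total * 360 for item, cost in costs.items()}

for item in costs:
    print(f"{item}: {shares[item]:.0f}% ({angles[item]:.0f} degrees)")
```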
Example 10:

The ranks of 15 students in two subjects A and B are given below:

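| Student | Subject A | Subject B |
|---|---|---|
| 1 | 1 | 10 |
| 2 | 2 | 7 |
| 3 | 3 | 2 |
| 4 | 4 | 6 |
| 5 | 5 | 4 |
| 6 | 6 | 8 |
| 7 | 7 | 3 |
| 8 | 8 | 1 |
| 9 | 9 | 11 |
| 10 | 10 | 15 |
| 11 | 11 | 9 |
| 12 | 12 | 5 |
| 13 | 13 | 14 |
| 14 | 14 | 12 |
| 15 | 15 | 13 |

Use Spearman's formula to find the rank correlation coefficient.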

Solution:

| Rank in A (R1) | Rank in B (R2) | D = R1 − R2 | D² |
|---|---|---|---|
| 1 | 10 | −9 | 81 |
| 2 | 7 | −5 | 25 |
| 3 | 2 | 1 | 1 |
| 4 | 6 | −2 | 4 |
| 5 | 4 | 1 | 1 |
| 6 | 8 | −2 | 4 |
| 7 | 3 | 4 | 16 |
| 8 | 1 | 7 | 49 |
| 9 | 11 | −2 | 4 |
| 10 | 15 | −5 | 25 |
| 11 | 9 | 2 | 4 |
| 12 | 5 | 7 | 49 |
| 13 | 14 | −1 | 1 |
| 14 | 12 | 2 | 4 |
| 15 | 13 | 2 | 4 |
| n = 15 | | ΣD = 0 | ΣD² = 272 |

Spearman’s coefficient of correlation:

\[ \rho = 1 - \frac{6 \sum D_i^2}{n(n^2 - 1)} = 1 - \frac{6 \times 272}{15(225 - 1)} = 1 - 0.4857 = 0.5143 \]
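A minimal Python sketch of the same calculation follows: it forms the rank differences, sums their squares and applies Spearman's formula. The list and variable names are illustrative.

```python
# Ranks of the 15 students in subjects A and B
rank_a = list(range(1, 16))
rank_b = [10, 7, 2, 6, 4, 8, 3, 1, 11, 15, 9, 5, 14, 12, 13]

n = len(rank_a)
differences = [a - b for a, b in zip(rank_a, rank_b)]
sum_d = sum(differences)                    # should be 0 for valid rankings
sum_d2 = sum(d ** 2 for d in differences)   # 272

# Spearman's rank correlation coefficient (no tied ranks)
rho = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

print(sum_d, sum_d2, round(rho, 4))
```

The check that ΣD = 0 is a quick guard against transcription errors in the ranks.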

Example 11:

Researchers at the European Centre for Road Safety Testing are trying to find out how the age of cars affects their braking capability. They test a group of ten cars of differing ages and find out the minimum stopping distances that the cars can achieve. The results are set out in the table below:

| Car | Age of car in months | Minimum stopping distance at 40 kph (metres) |
|---|---|---|
| A | 9 | 28.4 |
| B | 15 | 29.3 |
| C | 24 | 37.6 |
| D | 30 | 36.2 |
| E | 38 | 36.5 |
| F | 46 | 35.3 |
| G | 53 | 36.2 |
| H | 60 | 44.1 |
| I | 64 | 44.8 |
| J | 76 | 47.2 |

Calculate the coefficient of correlation using the method of least squares.

Solution:

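Let us develop the following table for calculating the value of r:

| X | Y | X² | Y² | XY |
|---|---|---|---|---|
| 9 | 28.4 | 81 | 806.56 | 255.6 |
| 15 | 29.3 | 225 | 858.49 | 439.5 |
| 24 | 37.6 | 576 | 1413.76 | 902.4 |
| 30 | 36.2 | 900 | 1310.44 | 1086 |
| 38 | 36.5 | 1444 | 1332.25 | 1387 |
| 46 | 35.3 | 2116 | 1246.09 | 1623.8 |
| 53 | 36.2 | 2809 | 1310.44 | 1918.6 |
| 60 | 44.1 | 3600 | 1944.81 | 2646 |
| 64 | 44.8 | 4096 | 2007.04 | 2867.2 |
| 76 | 47.2 | 5776 | 2227.84 | 3587.2 |
| Total 415 | 375.6 | 21623 | 14457.72 | 16713.3 |

\[ \bar{X} = \frac{415}{10} = 41.5, \qquad \bar{Y} = \frac{375.6}{10} = 37.56 \]

Method of least squares:

\[ b = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2} = \frac{16713.3 - 10(41.5)(37.56)}{21623 - 10(41.5)^2} = \frac{16713.3 - 15587.4}{21623 - 17222.5} = \frac{1125.9}{4400.5} \approx 0.2559 \]

\[ a = \bar{Y} - b\bar{X} = 37.56 - 0.2559(41.5) \approx 37.56 - 10.62 = 26.94 \]

\[ r^2 = \frac{a \sum Y + b \sum XY - n\bar{Y}^2}{\sum Y^2 - n\bar{Y}^2} = \frac{26.94(375.6) + 0.2559(16713.3) - 10(37.56)^2}{14457.72 - 10(37.56)^2} \approx \frac{10118.664 + 4276.933 - 14107.536}{350.184} = \frac{288.061}{350.184} \approx 0.8226 \]

\[ r = \sqrt{0.8226} \approx 0.907 \]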
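The least-squares route to r can be sketched in Python as follows. The means are computed from the column totals rather than typed in, so rounding of intermediate values does not affect the result; the variable names are illustrative, and the direct product-moment form is included only as a cross-check.

```python
import math

ages = [9, 15, 24, 30, 38, 46, 53, 60, 64, 76]                              # X, months
distances = [28.4, 29.3, 37.6, 36.2, 36.5, 35.3, 36.2, 44.1, 44.8, 47.2]    # Y, metres

n = len(ages)
sum_y = sum(distances)
sum_xy = sum(x * y for x, y in zip(ages, distances))
mean_x = sum(ages) / n
mean_y = sum_y / n

sxy = sum_xy - n * mean_x * mean_y                  # sum XY - n * Xbar * Ybar
sxx = sum(x * x for x in ages) - n * mean_x ** 2    # sum X^2 - n * Xbar^2
syy = sum(y * y for y in distances) - n * mean_y ** 2

b = sxy / sxx              # slope of the regression of Y on X
a = mean_y - b * mean_x    # intercept

# Coefficient of determination from the fitted line, then r
r_squared = (a * sum_y + b * sum_xy - n * mean_y ** 2) / syy
r = math.copysign(math.sqrt(r_squared), b)   # r takes the sign of the slope

# Cross-check against the direct product-moment formula
r_direct = sxy / math.sqrt(sxx * syy)

print(round(b, 4), round(a, 2), round(r, 3))
```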

Page 318: BCC104 Business Statistics
Page 319: BCC104 Business Statistics

Business Statistics Unit 14

Sikkim Manipal University Page No. 313

Unit 14 Exercise-II

Example 1: Find the most likely production corresponding to a rainfall of 40 from the following data:

| | Rainfall | Production |
|---|---|---|
| Average | 30 | 500 kg |
| Standard deviation | 5 | 100 kg |

Coefficient of correlation = 0.8

Solution: Let Y stand for production and X for rainfall.

Now, the regression line of Y on X is given by

\[ (Y - \bar{Y}) = r \frac{\sigma_y}{\sigma_x} (X - \bar{X}) \]

\[ (Y - 500) = 0.8 \times \frac{100}{5} (X - 30) \]

or Y = 20 + 16X

For X = 40, Y = 16(40) + 20 = 660 kg
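The regression line of Y on X can be built directly from the summary statistics, as in this short Python sketch. The function name `predict_production` is an illustrative choice.

```python
# Summary statistics from Example 1
mean_x, sd_x = 30, 5       # rainfall
mean_y, sd_y = 500, 100    # production in kg
r = 0.8                    # coefficient of correlation

slope = r * sd_y / sd_x                 # regression coefficient of Y on X
intercept = mean_y - slope * mean_x     # so that the line passes through the means

def predict_production(rainfall):
    """Most likely production (kg) for a given rainfall."""
    return intercept + slope * rainfall

predicted = predict_production(40)
print(slope, intercept, predicted)
```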

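Example 2: The tangent of the angle between the lines of regression of y on x and x on y is 0.6, and \(\sigma_x = \tfrac{1}{2}\sigma_y\). Find \(r_{xy}\).

Solution:

\[ \tan\theta = \frac{1 - r^2}{r} \cdot \frac{\sigma_x \sigma_y}{\sigma_x^2 + \sigma_y^2} = 0.6 \]

\[ \sigma_x = \tfrac{1}{2}\sigma_y \]

\[ \tan\theta = 0.6 = \frac{1 - r^2}{r} \cdot \frac{\tfrac{1}{2}\sigma_y \cdot \sigma_y}{\tfrac{1}{4}\sigma_y^2 + \sigma_y^2} \]

or

\[ \frac{6}{10} = \frac{1 - r^2}{r} \cdot \frac{\tfrac{1}{2}}{1 + \tfrac{1}{4}} = \frac{1 - r^2}{r} \cdot \frac{2}{5} \]

\[ 2r^2 + 3r - 2 = 0 \]

\[ (r + 2)(2r - 1) = 0 \]

\[ r = -2 \ \text{or} \ r = \tfrac{1}{2} \]

But \(r^2\) must be \(\le 1\), so \(r \ne -2\), and therefore the required value of r or \(r_{xy}\) is \(\tfrac{1}{2}\), that is, 0.5.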
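The quadratic and the admissibility check on r can be verified numerically. The sketch below solves 2r² + 3r − 2 = 0, keeps the root with |r| ≤ 1, and substitutes it back into the tangent formula; the variable names are illustrative.

```python
import math

# Coefficients of 2r^2 + 3r - 2 = 0
a, b, c = 2, 3, -2
discriminant = b * b - 4 * a * c    # 25

roots = [(-b + sign * math.sqrt(discriminant)) / (2 * a) for sign in (1, -1)]

# A correlation coefficient must lie between -1 and 1
valid = [root for root in roots if abs(root) <= 1]
r = valid[0]

# Substitute back: sigma_x / sigma_y = 1/2
ratio = 0.5
tan_theta = (1 - r ** 2) / r * ratio / (ratio ** 2 + 1)

print(roots, r, tan_theta)
```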

Example 3: The following table shows the number of public sector industry failures in India during the period 1987 to 1993. Using a four-year moving average method, calculate the mean square error (MSE) for this data.

Year Number of Failures

1987 32

1988 26

1989 30

1990 28

1991 24

1992 22

1993 26

Solution: The 4-year moving averages are calculated as follows:

(1) 1987 to 1990: Moving average = (32 + 26 + 30 + 28) / 4 = 29

(2) 1988 to 1991: Moving average = (26 + 30 + 28 + 24) / 4 = 27

(3) 1989 to 1992: Moving average = (30 + 28 + 24 + 22) / 4 = 26

(4) 1990 to 1993: Moving average = (28 + 24 + 22 + 26) / 4 = 25

To calculate the value of MSE, the following table is constructed.

| Year | Time series value (Y) | Moving average | Error | Error squared |
|---|---|---|---|---|
| 1987 | 32 | | | |
| 1988 | 26 | | | |
| 1989 | 30 | | | |
| 1990 | 28 | 29 | −1 | 1 |
| 1991 | 24 | 27 | −3 | 9 |
| 1992 | 22 | 26 | −4 | 16 |
| 1993 | 26 | 25 | 1 | 1 |

Then,

\[ \text{MSE} = \frac{1 + 9 + 16 + 1}{4} = \frac{27}{4} = 6.75 \]
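The moving averages and the MSE can be reproduced with the following Python sketch. It follows the convention used in this solution, in which each four-year average is compared with the last value in its window; the variable names are illustrative.

```python
# Number of failures by year (Example 3)
failures = {1987: 32, 1988: 26, 1989: 30, 1990: 28, 1991: 24, 1992: 22, 1993: 26}

years = sorted(failures)
values = [failures[year] for year in years]
window = 4

rows = []  # (year, moving average, error) for each complete window
for end in range(window - 1, len(values)):
    average = sum(values[end - window + 1:end + 1]) / window
    error = values[end] - average      # actual value minus moving average
    rows.append((years[end], average, error))

mse = sum(error ** 2 for _, _, error in rows) / len(rows)

for year, average, error in rows:
    print(year, average, error)
print("MSE =", mse)  # MSE = 6.75
```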

Example 4: The Dean of the School of Business at Atlantic University, which operates on a trimester system, has compiled the following quarterly new enrolment of MBA students for the last 3 years from 1992 to 1994 and the results are shown as follows:

Year Fall Winter Spring Summer

1992 200 180 185 95

1993 220 188 173 83

1994 220 176 161 87

By using the ratio to moving average method, calculate the seasonal index for each trimester.

Solution: In order to calculate the seasonal indices for fall, winter, spring and summer academic sessions, we need to find quarter moving averages, quarter centred moving averages and percentages of actual to centred moving averages as explained previously.

We construct the following table:

| Year (1) | Quarter (2) | Value (3) | Quarter moving total (4) | Quarter moving average (5) | Quarter centred moving average (6) | Percentage of actual to centred moving average (7) |
|---|---|---|---|---|---|---|
| 1992 | I | 200 | | | | |
| | II | 180 | | | | |
| | | | 660 | 165 | | |
| | III | 185 | | | 167.5 | 110.45 |
| | | | 680 | 170 | | |
| | IV | 95 | | | 171.0 | 55.55 |
| | | | 688 | 172 | | |
| 1993 | I | 220 | | | 170.5 | 129.03 |
| | | | 676 | 169 | | |
| | II | 188 | | | 167.5 | 112.24 |
| | | | 664 | 166 | | |
| | III | 173 | | | 166.0 | 104.22 |
| | | | 664 | 166 | | |
| | IV | 83 | | | 164.5 | 50.46 |
| | | | 652 | 163 | | |
| 1994 | I | 220 | | | 161.5 | 136.22 |
| | | | 640 | 160 | | |
| | II | 176 | | | 160.5 | 109.66 |
| | | | 644 | 161 | | |
| | III | 161 | | | | |
| | IV | 87 | | | | |

Now, we calculate the modified mean for each quarter. This can be done by the following steps.

The first step is to make a table of values already calculated and placed in column (7) of this table. These are the percentage of actual to moving average values for the various quarters of the three years. These are shown in the following table:

Year    Fall      Winter    Spring    Summer
1992    –         –         110.45    55.55
1993    129.03    112.24    104.22    50.46
1994    136.22    109.66    –         –

The second step is to take the average of these values for each quarter. The modified mean for each quarter data is shown as follows:

Fall $= \frac{129.03 + 136.22}{2} = \frac{265.25}{2} = 132.625$

Winter $= \frac{112.24 + 109.66}{2} = \frac{221.90}{2} = 110.950$

Spring $= \frac{110.45 + 104.22}{2} = \frac{214.67}{2} = 107.335$

Summer $= \frac{55.55 + 50.46}{2} = \frac{106.01}{2} = 53.005$

Total = 403.915

These modified means are preliminary seasonal indices. These should average 100 or a total of 400 for these 4 quarters. However, our total is 403.915. Accordingly, we calculate the adjustment factor as follows:

\[ \text{Adjustment Factor} = \frac{400}{403.915} = 0.9903 \]

We get the seasonal index for each quarter by multiplying the modified mean for each quarter by the adjustment factor. Then, the seasonal index for each quarter is shown as follows:

Fall: 132.625 × 0.9903 = 131.34
Winter: 110.950 × 0.9903 = 109.87
Spring: 107.335 × 0.9903 = 106.29
Summer: 53.005 × 0.9903 = 52.50

Total = 400.00
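The ratio to moving average steps of Example 4 can be sketched as follows. The names `seasonal_indices` and `SEASONS` are assumptions made for this illustration. The sketch carries full precision through every step, whereas the worked solution rounds at each stage, so the results agree only to within rounding (Summer, for instance, comes out slightly below the rounded 52.50).

```python
# Quarterly enrolment in the order Fall, Winter, Spring, Summer (Example 4).
enrolment = [200, 180, 185, 95,    # 1992
             220, 188, 173, 83,    # 1993
             220, 176, 161, 87]    # 1994
SEASONS = ["Fall", "Winter", "Spring", "Summer"]

def seasonal_indices(values, period=4):
    """Seasonal indices by the ratio to moving average method (even period)."""
    # Step 1: moving averages over `period` consecutive values.
    moving_avg = [sum(values[i:i + period]) / period
                  for i in range(len(values) - period + 1)]
    # Step 2: centre them by averaging adjacent moving averages;
    # centred[k] lines up with values[k + period // 2].
    centred = [(a + b) / 2 for a, b in zip(moving_avg, moving_avg[1:])]
    # Step 3: percentage of actual to centred moving average, grouped by season.
    ratios = [[] for _ in range(period)]
    for k, cma in enumerate(centred):
        idx = k + period // 2
        ratios[idx % period].append(100 * values[idx] / cma)
    # Step 4: mean per season, then scale so the indices total 100 * period.
    means = [sum(r) / len(r) for r in ratios]
    factor = 100 * period / sum(means)
    return [m * factor for m in means]

indices = dict(zip(SEASONS, seasonal_indices(enrolment)))
for season, value in indices.items():
    print(f"{season}: {value:.2f}")
```

Scaling in the last step guarantees that the four indices total exactly 400, which is the purpose of the adjustment factor in the text.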

Example 5: In the previous problem which gives us the data about new admissions into the MBA programme of the university for each trimester, separate the seasonal and irregular influences on the time series and calculate the irregular (I) component as well as the seasonally-adjusted values for each quarter.

Solution:

We have already calculated the various values that are needed. We knowthat:

Time Series Values = T × S × C × I

Centred Moving Average = T × C

Hence,

\[ S \times I = \frac{T \times S \times C \times I}{T \times C} \]

Let us restate the needed values in the following table.

Year Quarter T × S × C × I T × C S × I

1992 I 200 – –

II 180 – –

III 185 167.5 1.105

IV 95 171.0 0.556

1993 I 220 170.5 1.290

II 188 167.5 1.122

III 173 166.0 1.042

IV 83 164.5 0.505

1994 I 220 161.5 1.362

II 176 160.5 1.097

III 161 – –

IV 87 – –

The seasonal indices for each quarter have already been calculated as:

Fall: 131.34

Winter: 109.87

Spring: 106.29

Summer: 52.50

Then the seasonal influence (S) is given by:

Fall: 131.34/100 = 1.3134
Winter: 109.87/100 = 1.0987
Spring: 106.29/100 = 1.0629
Summer: 52.50/100 = 0.5250

Now, we make another table with the (S × I) values calculated in the previous table and the (S) values for the fall, winter, spring and summer quarters. We then get the values of (I) by dividing the (S × I) values by the (S) values. These are shown in the following table:

Year Quarter S × I (S) (I)

1992 I – – –

II – – –

III 1.105 1.0629 1.040

IV 0.556 0.5250 1.059

1993 I 1.290 1.3134 0.982

II 1.122 1.0987 1.021

III 1.042 1.0629 0.980

IV 0.505 0.5250 0.962

1994 I 1.362 1.3134 1.037

II 1.097 1.0987 0.998

III – – –

IV – – –

Now, we can find the seasonally-adjusted values by dividing the original time series values by their corresponding seasonal influences (S), that is, the seasonal indices divided by 100. This is shown as follows:

Year Quarter Time Series Values (T × S × C × I) (S) Seasonally-adjusted Values

1992 I 200 – –

II 180 – –

III 185 1.0629 174.05

IV 95 0.5250 180.95

1993 I 220 1.3134 167.50

II 188 1.0987 171.11

III 173 1.0629 162.76

IV 83 0.5250 158.09

1994 I 220 1.3134 167.50

II 176 1.0987 160.19

III 161 – –

IV 87 – –
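The two divisions used in Example 5 can be sketched together. The function name `decompose` and the tuple layout are assumptions of this illustration; the S values and centred moving averages are taken from the tables above. Because the sketch divides unrounded ratios, a few third-decimal figures may differ by one unit from the tabulated ones.

```python
# Seasonal influence S per quarter (seasonal index / 100), from Example 5.
S = {"I": 1.3134, "II": 1.0987, "III": 1.0629, "IV": 0.5250}

# (year, quarter, actual value, centred moving average) for the quarters
# that have a centred moving average.
rows = [
    (1992, "III", 185, 167.5), (1992, "IV", 95, 171.0),
    (1993, "I", 220, 170.5), (1993, "II", 188, 167.5),
    (1993, "III", 173, 166.0), (1993, "IV", 83, 164.5),
    (1994, "I", 220, 161.5), (1994, "II", 176, 160.5),
]

def decompose(rows, seasonal):
    """Return (year, quarter, S*I, I, seasonally adjusted value) per row."""
    out = []
    for year, quarter, actual, cma in rows:
        s_times_i = actual / cma                    # (T*S*C*I) / (T*C) = S*I
        irregular = s_times_i / seasonal[quarter]   # I = (S*I) / S
        adjusted = actual / seasonal[quarter]       # seasonally adjusted value
        out.append((year, quarter, round(s_times_i, 3),
                    round(irregular, 3), round(adjusted, 2)))
    return out

result = decompose(rows, S)
for row in result:
    print(row)
```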

Example 6: The life time of electric bulbs for a random sample of 10 from a large consignment gave the following data:

Item                    1    2    3    4    5    6    7    8    9    10
Life (in '000 hours)    4.2  4.6  3.9  4.1  5.2  3.8  3.9  4.3  4.4  5.6

Can we accept the hypothesis that the average life time of bulbs is 4000 hours?

Solution: Let us take the null hypothesis that there is no significant difference between the sample mean and the hypothetical population mean.

Applying the t-test (as the sample is small in size, because 10 < 30),

\[ t = \frac{\bar{x} - \mu}{S}\sqrt{n} \]

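Calculation of $\bar{x}$ and $S$

x       (x – x̄) = (x – 4.4)    (x – x̄)² = (x – 4.4)²
4.2     – 0.2                   0.04
4.6     + 0.2                   0.04
3.9     – 0.5                   0.25
4.1     – 0.3                   0.09
5.2     + 0.8                   0.64
3.8     – 0.6                   0.36
3.9     – 0.5                   0.25
4.3     – 0.1                   0.01
4.4     0                       0
5.6     + 1.2                   1.44
Σx = 44                         Σ(x – x̄)² = 3.12

\[ \bar{x} = \frac{\sum x}{N} = \frac{44}{10} = 4.4 \]

\[ S = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{\frac{3.12}{9}} = 0.589 \]

Substituting these values, with $\mu = 4$ (i.e., 4000 hours):

\[ t = \frac{4.4 - 4}{0.589}\sqrt{10} = 2.148 \]

No. of degrees of freedom, $\nu = (n - 1) = (10 - 1) = 9$.

For $\nu = 9$, $t_{0.05} = 2.262$.

The calculated value of t is less than the table value. Hence the hypothesis is accepted.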

The average life time of the bulbs could be 4000 hours.
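The t-test of Example 6 can be checked with a short sketch using only the standard library. The function name `one_sample_t` and the constant `T_CRITICAL` are assumptions of this illustration; the critical value is simply the table value quoted in the example, not something the code looks up.

```python
import math

# Bulb lifetimes in '000 hours, from Example 6.
lifetimes = [4.2, 4.6, 3.9, 4.1, 5.2, 3.8, 3.9, 4.3, 4.4, 5.6]

def one_sample_t(sample, mu0):
    """Return (mean, S, t, degrees of freedom) for a one-sample t-test."""
    n = len(sample)
    mean = sum(sample) / n
    ss = sum((x - mean) ** 2 for x in sample)   # sum of squared deviations
    s = math.sqrt(ss / (n - 1))                 # sample standard deviation
    t = (mean - mu0) / s * math.sqrt(n)
    return mean, s, t, n - 1

mean, s, t, dof = one_sample_t(lifetimes, mu0=4.0)
T_CRITICAL = 2.262  # table value for 9 d.f. at the 5% level, as quoted above
print(round(mean, 2), round(s, 3), round(t, 3), dof)
print("accept H0" if abs(t) < T_CRITICAL else "reject H0")
```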

Example 7: A Personnel Manager is interested in trying to determine whether absenteeism is greater on one day of the week than on another. His records for the past year show this sample distribution:

Day of the week     Monday  Tuesday  Wednesday  Thursday  Friday
No. of Absentees    66      57       54         48        75

Test whether the absence is uniformly distributed over the week.

Solution: Let us take the (null) hypothesis that absenteeism is uniformly distributed over the week.

On the basis of this hypothesis, we should expect (66 + 57 + 54 + 48 + 75)/5 = 300/5 = 60 absentees on each day of the week.

fo      fe      (fo – fe)²/fe
66      60      0.60
57      60      0.15
54      60      0.60
48      60      2.40
75      60      3.75

\[ \sum \frac{(f_o - f_e)^2}{f_e} = 7.50 \]

\[ \chi^2 = \sum \frac{(f_o - f_e)^2}{f_e} = 7.5 \]

$\nu = (n - 1) = (5 - 1) = 4$

For $\nu = 4$, $\chi^2_{0.05} = 9.49$.

The calculated value of $\chi^2$ is less than the table value. Hence, the (null) hypothesis is accepted.
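The goodness-of-fit calculation of Example 7 can be sketched as below. The function name `chi_square_uniform` and the constant `CHI2_CRITICAL` are assumptions of this illustration; the critical value is the table value quoted in the example.

```python
# Observed absentees, Monday to Friday, from Example 7.
observed = [66, 57, 54, 48, 75]

def chi_square_uniform(observed):
    """Chi-square statistic against a uniform expectation across categories."""
    expected = sum(observed) / len(observed)    # same expected count each day
    chi2 = sum((fo - expected) ** 2 / expected for fo in observed)
    return expected, chi2, len(observed) - 1    # d.f. = categories - 1

expected, chi2, dof = chi_square_uniform(observed)
CHI2_CRITICAL = 9.49  # table value for 4 d.f. at the 5% level, as quoted above
print(expected, round(chi2, 2), dof)
print("accept H0" if chi2 < CHI2_CRITICAL else "reject H0")
```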

Example 8: An automobile company gives you the following information about age groups and the liking for a particular model of car which it plans to introduce.

                        Age groups
Persons                 Below 20    20-39    40-59    60     Total
Who liked the car       140         80       40       20     280
Who disliked the car    60          50       30       80     220
Total                   200         130      70       100    500

On the basis of this data, can it be concluded that the model appeal is independent of the age groups? (Given $\nu = 3$, $\chi^2_{0.05} = 7.815$.)

Solution:

Let the null hypothesis be that the model appeal is independent of the age group. Applying the $\chi^2$ test:

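\[ f_{e11} = \frac{280 \times 200}{500} = 112, \quad \text{i.e., } \frac{\text{Row total} \times \text{Column total}}{\text{Grand total}} \]

\[ f_{e12} = \frac{280 \times 130}{500} = 72.8 \]

\[ f_{e13} = \frac{280 \times 70}{500} = 39.2 \quad \text{and so on.} \]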

Thus, the table of expected frequencies will be obtained as follows:

112 72.8 39.2 56 280

88 57.2 30.8 44 220

200 130 70 100 500

fo fe (fo – fe)2 (fo – fe)2/fe

140 112.0 784.00 7.000

60 88.0 784.00 8.910

80 72.8 51.84 0.712

50 57.2 51.84 0.906

40 39.2 0.64 0.016

30 30.8 0.64 0.021

20 56.0 1296.00 23.143

80 44.0 1296.00 29.454

\[ \sum \frac{(f_o - f_e)^2}{f_e} = 70.162 \]

Thus, $\chi^2 = 70.162$

$\nu = (r - 1)(c - 1) = (2 - 1)(4 - 1) = 3$

For $\nu = 3$, $\chi^2_{0.05} = 7.815$.

The calculated value is much greater than the table value. Hence, the null hypothesis is rejected. We, therefore, conclude that the model appeal is not independent of the age groups.
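The expected frequencies and the statistic of Example 8 can be reproduced with a short sketch. The function name `chi_square_independence` is an assumption of this illustration; it applies the rule (row total × column total)/grand total stated in the solution.

```python
# Observed frequencies from Example 8: rows = liked / disliked,
# columns = the four age groups.
observed = [[140, 80, 40, 20],
            [60, 50, 30, 80]]

def chi_square_independence(table):
    """Return (expected table, chi-square, degrees of freedom)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    # Expected frequency = row total * column total / grand total.
    expected = [[r * c / grand for c in col_totals] for r in row_totals]
    chi2 = sum((fo - fe) ** 2 / fe
               for obs_row, exp_row in zip(table, expected)
               for fo, fe in zip(obs_row, exp_row))
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return expected, chi2, dof

expected, chi2, dof = chi_square_independence(observed)
print(expected)
print(round(chi2, 3), dof)
```

The statistic is compared with the table value 7.815 quoted in the example for 3 degrees of freedom.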

Example 9: A random sample of size 16 has 53 as mean. The sum of the squares of the deviations from mean is 135. Can this sample be regarded as taken from the population having 56 as mean? Obtain 95% and 99% confidence limits of the mean of the population. (For $\nu = 15$, $t_{0.05} = 2.13$ and for $\nu = 15$, $t_{0.01} = 2.95$.)

Solution:

Let us take the (null) hypothesis that there is no significant difference between the sample mean and the hypothetical population mean. Applying t-test:

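\[ t = \frac{\bar{x} - \mu}{S / \sqrt{n}} \]

Given: $\bar{x} = 53$; $\mu = 56$; $n = 16$; $\sum (x - \bar{x})^2 = 135$

\[ S = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{\frac{135}{15}} = 3 \]

\[ t = \frac{|53 - 56|}{3 / \sqrt{16}} = 4 \]

$\nu = (n - 1) = (16 - 1) = 15$

And for $\nu = 15$, $t_{0.05} = 2.13$ (table value).

The calculated value of t is more than the table value. Hence, the null hypothesis gets rejected. Thus, we can say that the sample has not come from a population having 56 as mean.

95% confidence limits of the population mean:

\[ \bar{x} \pm t_{0.05} \frac{S}{\sqrt{n}} = 53 \pm 2.13 \times \frac{3}{\sqrt{16}} = 51.4 \text{ to } 54.6 \]

99% confidence limits of the population mean:

\[ \bar{x} \pm t_{0.01} \frac{S}{\sqrt{n}} = 53 \pm (2.95) \frac{3}{\sqrt{16}} = 50.788 \text{ to } 55.212 \]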
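When only summary figures are given, as in Example 9, the same test and the confidence limits can be sketched as follows. The function names `t_from_summary` and `confidence_limits` are assumptions of this illustration, and the two t values are the table values supplied in the problem.

```python
import math

def t_from_summary(mean, mu0, ss_dev, n):
    """Return (S, standard error, |t|) from a sample mean and the sum of squared deviations."""
    s = math.sqrt(ss_dev / (n - 1))
    se = s / math.sqrt(n)
    return s, se, abs(mean - mu0) / se

def confidence_limits(mean, se, t_table):
    """Limits mean -/+ t_table * standard error."""
    return mean - t_table * se, mean + t_table * se

s, se, t = t_from_summary(mean=53, mu0=56, ss_dev=135, n=16)
limits_95 = confidence_limits(53, se, 2.13)   # t(0.05) for 15 d.f., as given
limits_99 = confidence_limits(53, se, 2.95)   # t(0.01) for 15 d.f., as given
print(s, se, t)
print(limits_95, limits_99)
```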

Example 10: A random sample of 27 pairs of observations from a normal population gives a correlation coefficient of 0.42. Is it likely that the variables in the population are uncorrelated?

Solution:

Let the null hypothesis be that there is no significant difference between the sample correlation and correlation in the population. Applying t-test:

\[ t = r \times \sqrt{\frac{n - 2}{1 - r^2}} = 0.42 \times \sqrt{\frac{27 - 2}{1 - (0.42)^2}} \approx 2.31 \]

No. of degrees of freedom, $\nu = n - 2 = 27 - 2 = 25$.

For $\nu = 25$, $t_{0.05} = 1.708$. The calculated value of t is more than the table value and hence the null hypothesis is rejected. Thus, it can be said that it is unlikely that the variables in the population are uncorrelated.
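The statistic used in Example 10 is a one-line computation, sketched below. The function name `correlation_t` is an assumption of this illustration; evaluated without intermediate rounding the statistic comes to about 2.31, which exceeds the table value of 1.708 quoted in the example.

```python
import math

def correlation_t(r, n):
    """t statistic and degrees of freedom for testing a zero population correlation."""
    t = r * math.sqrt((n - 2) / (1 - r ** 2))
    return t, n - 2

t, dof = correlation_t(r=0.42, n=27)
T_TABLE = 1.708  # table value for 25 d.f., as quoted in the example
print(round(t, 2), dof)
print("reject H0" if t > T_TABLE else "accept H0")
```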

Example 11: The mean of a sample of 100 units is 3 and its standard deviation is 2. Find the standard error and estimate the sample error at (i) 5% level of significance, (ii) 99.73% level of probability and (iii) 95.45% level of confidence.

Solution:

Standard Error of mean,

\[ \sigma_{\bar{x}} = \frac{s}{\sqrt{n}} = \frac{2}{\sqrt{100}} = \frac{2}{10} = 0.2 \]

(i) Sample error at 5% level of significance = 0.2 × 1.96 = 0.392

(ii) Sample error at 99.73% level of probability = 0.2 × 3 = 0.6

(iii) Sample error at 95.45% level of confidence = 0.2 × 2 = 0.4.
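The arithmetic of Example 11 can be sketched as below. The function name `standard_error` and the dictionary `multipliers` are assumptions of this illustration; the multipliers 1.96, 3 and 2 are the ones used in the solution for parts (i), (ii) and (iii).

```python
import math

def standard_error(s, n):
    """Standard error of the mean: s / sqrt(n)."""
    return s / math.sqrt(n)

se = standard_error(s=2, n=100)

# Multiples of the standard error used in parts (i), (ii) and (iii).
multipliers = {"(i)": 1.96, "(ii)": 3, "(iii)": 2}
sample_errors = {part: se * z for part, z in multipliers.items()}

print(se)
for part, error in sample_errors.items():
    print(part, round(error, 3))
```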

Example 12: Set up an ANOVA table for the following per acre production data for three kinds or varieties of wheat, each grown on 4 plots and state if the variety differences are significant.

Plot of land Per acre production data (variety of wheat)

A B C

1 6 5 5

2 7 5 4

3 3 3 3

4 8 7 4

Solution:

We can solve the problem either by the direct method or by the short-cut method, but in each case we shall get the same result. We try both methods below.

Direct Method:

First we calculate the mean of each of these samples.

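\[ \bar{x}_1 = \frac{6 + 7 + 3 + 8}{4} = 6 \]

\[ \bar{x}_2 = \frac{5 + 5 + 3 + 7}{4} = 5 \]

\[ \bar{x}_3 = \frac{5 + 4 + 3 + 4}{4} = 4 \]

Mean of the sample means or

\[ \bar{\bar{x}} = \frac{\bar{x}_1 + \bar{x}_2 + \bar{x}_3}{3} = \frac{6 + 5 + 4}{3} = 5 \]

Now, we work out SS between and SS within samples:

\[ \text{SS between} = n_1 (\bar{x}_1 - \bar{\bar{x}})^2 + n_2 (\bar{x}_2 - \bar{\bar{x}})^2 + n_3 (\bar{x}_3 - \bar{\bar{x}})^2 \]

\[ = [4(6 - 5)^2 + 4(5 - 5)^2 + 4(4 - 5)^2] = (4 + 0 + 4) = 8. \]

\[ \text{SS within} = \sum (x_{1i} - \bar{x}_1)^2 + \sum (x_{2i} - \bar{x}_2)^2 + \sum (x_{3i} - \bar{x}_3)^2, \quad i = 1, 2, 3, 4 \]

\[ = \{(6 - 6)^2 + (7 - 6)^2 + (3 - 6)^2 + (8 - 6)^2\} \]

\[ + \{(5 - 5)^2 + (5 - 5)^2 + (3 - 5)^2 + (7 - 5)^2\} \]

\[ + \{(5 - 4)^2 + (4 - 4)^2 + (3 - 4)^2 + (4 - 4)^2\} \]

\[ = 24 \]

\[ \text{SS for total variance} = \sum (x_{ij} - \bar{\bar{x}})^2, \quad i = 1, 2, 3, \ldots \text{ and } j = 1, 2, 3, \ldots \]

\[ = \{(6 - 5)^2 + (7 - 5)^2 + (3 - 5)^2 + (8 - 5)^2\} \]

\[ + \{(5 - 5)^2 + (5 - 5)^2 + (3 - 5)^2 + (7 - 5)^2\} \]

\[ + \{(5 - 5)^2 + (4 - 5)^2 + (3 - 5)^2 + (4 - 5)^2\} \]

\[ = 32 \]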

Alternatively, (SS for total variance) can also be worked out thus:

SS for total = (SS between + SS within) = (8 + 24) = 32

We can now set up the ANOVA table for this problem:

Source of variation    SS    d.f.             MS             F-ratio            5% F-limit (from the F-table)
Between sample         8     (3 – 1) = 2      8/2 = 4.00     4.00/2.67 = 1.5    F (2, 9) = 4.26
Within sample          24    (12 – 3) = 9     24/9 = 2.67
Total                  32    (12 – 1) = 11

The above table shows that the calculated value of F is 1.5, which is less than the table value of 4.26 at the 5% level with d.f. being $\nu_1 = 2$ and $\nu_2 = 9$, and hence could have arisen due to chance. This analysis supports the null hypothesis of no difference in sample means. We may, therefore, conclude that the difference in wheat output due to varieties is insignificant and is just a matter of chance.
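The direct method can be sketched in Python as follows. The function name `one_way_anova` and the dictionary keys are assumptions of this illustration. The sketch uses the overall mean of all observations as the grand mean, which coincides with the mean of the sample means here because every variety has the same number of plots.

```python
# Per acre production for three wheat varieties on 4 plots each (Example 12).
samples = {"A": [6, 7, 3, 8], "B": [5, 5, 3, 7], "C": [5, 4, 3, 4]}

def one_way_anova(groups):
    """Direct-method one-way ANOVA; returns the SS, d.f., MS and F values."""
    means = [sum(g) / len(g) for g in groups]
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return {"ss_between": ss_between, "ss_within": ss_within,
            "ss_total": ss_between + ss_within,
            "df_between": df_between, "df_within": df_within,
            "ms_between": ms_between, "ms_within": ms_within,
            "f_ratio": ms_between / ms_within}

anova = one_way_anova(list(samples.values()))
F_TABLE = 4.26  # 5% F-limit for (2, 9) d.f., as quoted in the example
print(anova)
print("significant" if anova["f_ratio"] > F_TABLE else "not significant")
```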

Aliter (Short-cut Method):
In this case, we first take the total of all the individual values of n items and call it T.

T in the given case = 60 and n = 12

Hence, the correction factor $= \frac{(T)^2}{n} = \frac{(60)^2}{12} = 300$.

Now SS total, SS between and SS within can be worked out as under:

\[ \text{SS total} = \sum x_{ij}^2 - \frac{(T)^2}{n}, \quad i = 1, 2, 3, \ldots \text{ and } j = 1, 2, 3, \ldots \]

\[ = (6^2 + 7^2 + 3^2 + 8^2 + 5^2 + 5^2 + 3^2 + 7^2 + 5^2 + 4^2 + 3^2 + 4^2) - \frac{(60)^2}{12} \]

\[ = 332 - 300 = 32 \]

\[ \text{SS between} = \sum \frac{(T_j)^2}{n_j} - \frac{(T)^2}{n} \]

\[ = \left( \frac{24^2}{4} + \frac{20^2}{4} + \frac{16^2}{4} \right) - \frac{60^2}{12} \]

\[ = 8 \]

\[ \text{SS within} = \sum x_{ij}^2 - \sum \frac{(T_j)^2}{n_j} \]

\[ = (332 - 308) = 24. \]

It may be noted that we get exactly the same result as we had obtained in the case of direct method. From now onwards, we can set up ANOVA table and interpret F-ratio in the same manner as we have already done under the direct method.
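The short-cut method can be sketched in the same way. The function name `anova_shortcut` and the dictionary keys are assumptions of this illustration; the quantities follow the correction-factor formulas given above.

```python
# Same wheat data as in Example 12, one list per variety.
groups = [[6, 7, 3, 8], [5, 5, 3, 7], [5, 4, 3, 4]]

def anova_shortcut(groups):
    """Sums of squares via the correction factor T^2 / n."""
    total = sum(sum(g) for g in groups)              # T
    n = sum(len(g) for g in groups)                  # number of items
    correction = total ** 2 / n                      # T^2 / n
    sum_of_squares = sum(x * x for g in groups for x in g)
    group_term = sum(sum(g) ** 2 / len(g) for g in groups)   # sum of Tj^2 / nj
    return {"correction": correction,
            "ss_total": sum_of_squares - correction,
            "ss_between": group_term - correction,
            "ss_within": sum_of_squares - group_term}

shortcut = anova_shortcut(groups)
print(shortcut)
# {'correction': 300.0, 'ss_total': 32.0, 'ss_between': 8.0, 'ss_within': 24.0}
```

The short-cut form avoids computing deviations from the means, which is convenient when the means are not round numbers.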
