ch01 2statistics intro

7/31/2019 Ch01 2Statistics Intro

1/74

Introduction to Statistics

Chapter 1


2/74

Statistics: Introduction

Definition

Data types, Scales of measurement

Basic concepts

Tabulation & Graphs

Measures of central tendency


3/74

Definition

Statistics is a science which deals with the

collection, classification, presentation

(tabulation, graphs), analysis and interpretation

of numerical data.


4/74

Collection of data

primary data- when data is collected directly

from the original source, example census,

survey with sampling (using questionnaire)

secondary data- when data is taken indirectly

from any other source, example magazines,

news paper, NGOs, etc.


5/74

Classification of data

Qualitative data
describe a quality (attribute) of the individual.
Examples: Eye Color, First Name, Favorite Movie, Political Party

Quantitative data
are those for which the value has numerical meaning. You can do mathematical operations on the values of quantitative variables (like taking an average).
Examples: a person's Height, Weight, Income, Resting Pulse Rate, Blood Alcohol Level


6/74

Categorical data

Geographical

Chronological


7/74

Quantitative data

Discrete variables are usually obtained by

counting. There are a finite or countable

number of choices available with discrete

data. You can't have 2.63 people in the room.

Continuous variables are usually obtained by

measuring. Length, weight, and time are all

examples of continuous variables. Their values can be defined over intervals.


8/74

Levels of Measurement

nominal

ordinal

interval

Ratio


9/74

Levels of Measurement contd.

Nominal

Only names are meaningful.

Can be qualitative only.

Data values serve as labels.

Labels have no meaningful order.

Examples: Blood Type, Breed of Dog


10/74


Ordinal

Adds an order to the names.

It can be qualitative or quantitative.

Data values serve as labels but the labels have

a natural meaningful order.

Differences between values, however, are

meaningless.

Examples: Statistics Grade, Basketball

Rankings


11/74


Interval

Data values are numerical

They have a meaningful order

Differences between data values are meaningful

The ratio of two data values, however, is meaningless. This occurs when zero is an arbitrary measurement rather than actually indicating nothing

Examples: Temperature, Year of Birth


12/74


Ratio

Data values are numerical, have order, and both differences and ratios of values are meaningful.

Zero measurement indicates absence of the quantity being measured.

Examples: Weight, Volume, Number of Children


13/74

Population vs Sample

Population - data set consisting of all

outcomes, measurements, or responses of

interest. (group of observations under study)

Sample - data set which is a subset of the

population data set


14/74

Types of Sampling

There are 5 types of sampling methods:

Random sampling

Example: telephone polling of randomly chosen telephone numbers; drawing names out of a hat.

Systematic sampling

Example: choosing a sample of registered voters by taking every 25th voter from the county registration roll; testing every 300th product from the assembly line.


15/74

Types of Sampling contd.

Convenience sampling

Example- Internet Polls, Mail-In Customer Survey

Cluster sampling is accomplished by dividing the population into groups, usually geographically. These groups are called clusters or blocks. The clusters are randomly selected, and every element in the selected clusters is used.

Example: randomly choose 10 polling stations in a city and exit poll all voters at those stations.


16/74

Types of Sampling contd.

Stratified sampling also divides the population into groups, called strata, and then a sample is taken from each of these strata using either random, systematic, or convenience sampling.

Example: choosing 100 men and 100 women for a sample.

Example: stratify the population of 1000 employees by income level and then choose a sample of low, middle, and high income individuals.
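The random, systematic, cluster, and stratified schemes above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population of 1000 numbered voters, the 20 equal-sized polling stations, the men/women split, and all function names are hypothetical and not part of the slides.

```python
import random

def simple_random_sample(population, k, rng):
    """Draw k distinct elements; every subset of size k is equally likely."""
    return rng.sample(population, k)

def systematic_sample(population, step, rng):
    """Pick a random start among the first `step` items, then take every step-th item."""
    start = rng.randrange(step)
    return population[start::step]

def cluster_sample(clusters, n_clusters, rng):
    """Randomly choose whole clusters and keep every element inside them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [item for name in chosen for item in clusters[name]]

def stratified_sample(strata, per_stratum, rng):
    """Draw a simple random sample of fixed size from every stratum."""
    return {name: rng.sample(members, per_stratum) for name, members in strata.items()}

rng = random.Random(0)          # fixed seed so the sketch is reproducible
voters = list(range(1, 1001))   # hypothetical roll of 1000 voter IDs

srs = simple_random_sample(voters, 40, rng)
every_25th = systematic_sample(voters, 25, rng)

# Hypothetical clusters: 20 polling stations of 50 voters each.
stations = {f"station_{i:02d}": voters[i * 50:(i + 1) * 50] for i in range(20)}
exit_poll = cluster_sample(stations, 10, rng)

# Hypothetical strata: first half "men", second half "women".
strata = {"men": voters[:500], "women": voters[500:]}
by_gender = stratified_sample(strata, 100, rng)

print(len(srs), len(every_25th), len(exit_poll), len(by_gender["men"]))
```

Convenience sampling is left out on purpose: it involves no random mechanism to simulate.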


17/74

Parameters and Statistics

Parameters are statistical measures associated with populations; statistics are the corresponding measures computed from samples.

Parameters are usually denoted using Greek letters ($\mu$, $\sigma$), while statistics are usually denoted using Roman letters ($\bar{x}$, $s$).


18/74

Descriptive and Inferential Statistics

We compute statistics and use them to estimate parameters. The computation, which is the first part of a statistics course, is called Descriptive Statistics; the estimation, which is the second part, is called Inferential Statistics.


19/74

Presentation of Data

Tabulation

Frequency distribution

Grouped

Ungrouped

Graphs


20/74

Frequency distribution

According to Croxton & Cowden, a Frequency Distribution is a statistical table which shows the set of all distinct values of the variable arranged in order of magnitude, either individually or in groups, with their corresponding frequencies side by side.


21/74

Ungrouped Frequency of a Discrete Variable:

Suppose that a survey of 150 houses was conducted and the number of rooms in each house was recorded as below:

5 4 4 6 3 2 2 6 6 2 6 3 3 4 5 6 3 2 2 5 3
1 4 5 1 5 1 4 3 2 5 1 5 3 2 2 4 2 2 4 4 6
3 2 4 2 3 2 4 6 3 3 2 6 4 1 4 4 5 2 4 1 4
2 1 5 1 3 3 2 5 6 1 3 1 5 3 4 3 1 1 4 1 1
2 2 1 5 2 3 5 3 4 3 1 1 4 1 1 2 2 1 5 2 3
6 3 5 2 2 3 3 3 3 4 5 1 6 2 1 2 1 1 6 5 2
1 1 5 6 4 2 2 3 3 3 4 3 2 1 5 2 3 1 1 4 6
4 6 2


22/74

Counting of frequency using tally marks (each full group is five)

No. of Rooms (X)   Tally Marks                               Frequency
1                  ||||| ||||| ||||| ||||| ||||| ||||        29
2                  ||||| ||||| ||||| ||||| ||||| ||||| |||   33
3                  ||||| ||||| ||||| ||||| ||||| |||||       30
4                  ||||| ||||| ||||| ||||| ||||              24
5                  ||||| ||||| ||||| ||||                    19
6                  ||||| ||||| |||||                         15
Total                                                        150
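The tally can be reproduced programmatically. The sketch below reads the recorded values as 150 single digits from 1 to 6 (an assumption about how the run-together survey rows should be split) and counts them with `collections.Counter`; the variable names are illustrative only.

```python
from collections import Counter

# The 150 recorded room counts, read as single digits 1 to 6.
rows = """
5 4 4 6 3 2 2 6 6 2 6 3 3 4 5 6 3 2 2 5 3
1 4 5 1 5 1 4 3 2 5 1 5 3 2 2 4 2 2 4 4 6
3 2 4 2 3 2 4 6 3 3 2 6 4 1 4 4 5 2 4 1 4
2 1 5 1 3 3 2 5 6 1 3 1 5 3 4 3 1 1 4 1 1
2 2 1 5 2 3 5 3 4 3 1 1 4 1 1 2 2 1 5 2 3
6 3 5 2 2 3 3 3 3 4 5 1 6 2 1 2 1 1 6 5 2
1 1 5 6 4 2 2 3 3 3 4 3 2 1 5 2 3 1 1 4 6
4 6 2
"""
rooms = [int(token) for token in rows.split()]
freq = Counter(rooms)

for value in sorted(freq):
    count = freq[value]
    groups, rest = divmod(count, 5)  # full groups of five, plus leftover strokes
    tally = " ".join(["|||||"] * groups + (["|" * rest] if rest else []))
    print(f"{value}  {tally:<42} {count}")
```

With this reading of the data the counts agree with the frequency column of the table: 29, 33, 30, 24, 19, and 15.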


23/74

Grouped Frequency Distributions - Guidelines for classes

Number of classes should be between 5 and 20.

The classes must be mutually exclusive.

The classes must be all inclusive or exhaustive.

The classes must be continuous.

The classes must be equal in width.


24/74

Creating a Grouped Frequency Distribution

Compute the Range = Maximum - Minimum (from the observations).

Select the number of classes desired. This is usually between 5 and 20.

Or use Sturges' formula, $n = 1 + 3.322 \log_{10} N$, which gives the necessary number of classes for any continuous frequency distribution; N denotes the total number of observations.

Find the class width by dividing the range by the number of classes and rounding up.

Pick a suitable starting point less than or equal to the minimum value. You will be able to cover "the class width times the number of classes" values. Your starting point is the lower limit of the first class. Continue to add the class width to this lower limit to get the rest of the lower limits.
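As a rough sketch of the procedure above, the following Python computes the Sturges estimate and the lower class limits. The function names are hypothetical, and the explicit coverage check is an added assumption (the slides do not say what to do when the maximum lands exactly on the last upper limit). The sample figures (minimum 12, maximum 58, 5 classes, starting point 10) are illustrative.

```python
import math

def sturges_classes(n_obs):
    """Sturges' estimate n = 1 + 3.322 * log10(N); usually rounded to a whole number."""
    return 1 + 3.322 * math.log10(n_obs)

def lower_limits(minimum, maximum, n_classes, start=None):
    """Return (class width, list of lower class limits)."""
    width = math.ceil((maximum - minimum) / n_classes)  # range / classes, rounded up
    if start is None:
        start = minimum
    if start > minimum:
        raise ValueError("starting point must not exceed the minimum")
    # The classes cover [start, start + width * n_classes); the maximum must fall inside.
    if start + width * n_classes <= maximum:
        raise ValueError("classes do not cover the maximum; widen or add a class")
    return width, [start + i * width for i in range(n_classes)]

width, limits = lower_limits(12, 58, 5, start=10)
print(width, limits)                   # 10 [10, 20, 30, 40, 50]
print(round(sturges_classes(150), 2))  # Sturges' estimate for N = 150 observations
```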


25/74


26/74

Tabulating Numerical Data: Frequency Distributions

Sort Raw Data in Ascending Order: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

Find Range: 58 - 12 = 46

Select Number of Classes: 5 (usually between 5 and 15)

Compute Class Interval (width): 10 (46/5 = 9.2, then round up)

Determine Class Boundaries (limits): 10, 20, 30, 40, 50, 60

Compute Class Midpoints: 15, 25, 35, 45, 55

Count Observations & Assign to Classes


27/74

Frequency Distributions, Relative Frequency Distributions and Percentage Distributions

Data in ordered array:

12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

Class              Frequency   Relative Frequency   Percentage
10 but under 20        3            .15                 15
20 but under 30        6            .30                 30
30 but under 40        5            .25                 25
40 but under 50        4            .20                 20
50 but under 60        2            .10                 10
Total                 20           1                   100
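The table can be checked with a short Python sketch that assigns each observation to its "lower limit but under upper limit" class. The function name `frequency_table` is illustrative, not taken from the slides.

```python
data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]
boundaries = [10, 20, 30, 40, 50, 60]

def frequency_table(values, bounds):
    """Return rows of (lower, upper, frequency, relative frequency, percentage)."""
    n = len(values)
    rows = []
    for lower, upper in zip(bounds, bounds[1:]):
        count = sum(lower <= v < upper for v in values)  # "lower but under upper"
        rows.append((lower, upper, count, count / n, 100 * count / n))
    return rows

table = frequency_table(data, boundaries)
for lower, upper, f, rel, pct in table:
    print(f"{lower} but under {upper}: {f:2d}  {rel:.2f}  {pct:.0f}")
```

The half-open comparison `lower <= v < upper` is what keeps the classes mutually exclusive: a value of exactly 30 goes to the 30-40 class only.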


28/74

Graphing Numerical Data: The Histogram

[Figure: histogram of the 20 observations tabulated above. X axis: class midpoints (5, 15, 25, 35, 45, 55, More); Y axis: frequency (0 to 7). Bar heights: 0, 3, 6, 5, 4, 2, 0. Annotations: no gaps between bars; the bars meet at the class boundaries.]


29/74

Graphing Numerical Data: The Frequency Polygon

[Figure: frequency polygon of the same 20 observations. X axis: class midpoints (5, 15, 25, 35, 45, 55, More); Y axis: frequency (0 to 7).]


30/74

Tabulating Numerical Data: Cumulative Frequency

Lower Limit   Cumulative Frequency   Cumulative % Frequency
10                  0                       0
20                  3                      15
30                  9                      45
40                 14                      70
50                 18                      90
60                 20                     100
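A minimal sketch of how the cumulative columns follow from the class frequencies 3, 6, 5, 4, 2, using `itertools.accumulate`. The "more than" column is included as an assumption about what the later ogive example asks for; the variable names are illustrative.

```python
from itertools import accumulate

boundaries = [10, 20, 30, 40, 50, 60]
frequencies = [3, 6, 5, 4, 2]

# Number of observations below each boundary ("less than" type).
less_than = [0] + list(accumulate(frequencies))
total = less_than[-1]

# Number of observations at or above each boundary ("more than" type).
more_than = [total - c for c in less_than]

cumulative_pct = [100 * c / total for c in less_than]

for b, lt, mt, pct in zip(boundaries, less_than, more_than, cumulative_pct):
    print(f"{b:>3}  less than: {lt:>2}  more than: {mt:>2}  cumulative %: {pct:.0f}")
```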


31/74


32/74

Class boundaries: upper limits (not midpoints)


33/74

Example

Using the data given below, construct a 'more than' cumulative frequency table and draw the ogive (more-than ogive).


34/74


35/74


36/74

Locate median on ogive

The median is the value on the X axis of the point corresponding to N/2 on the Y axis.

(This point is at the intersection of the less-than and more-than ogives.)


37/74

Tabulating and Graphing Univariate

Categorical Data

Categorical Data

Tabulating Data

The Summary Table

Graphing Data

Pie Charts

Pareto Diagram

Bar Charts


38/74

Summary Table (for an Investor's Portfolio)

Investment Category   Amount (in thousands $)   Percentage
Stocks                       46.5                 42.27
Bonds                        32                   29.09
CD                           15.5                 14.09
Savings                      16                   14.55
Total                       110                  100

Variables are Categorical.
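The percentage column, and the cumulative percentages that a Pareto diagram plots, can be derived from the amounts. This is a sketch with illustrative variable names; sorting categories from largest to smallest is the usual Pareto convention and is assumed here.

```python
amounts = {"Stocks": 46.5, "Bonds": 32.0, "CD": 15.5, "Savings": 16.0}  # thousands of $
total = sum(amounts.values())

# Pareto order: categories sorted by amount, largest first.
ordered = sorted(amounts.items(), key=lambda item: item[1], reverse=True)

pareto = []
running = 0.0
for name, amount in ordered:
    pct = 100 * amount / total
    running += pct
    pareto.append((name, round(pct, 2), round(running, 2)))

for name, pct, cumulative in pareto:
    print(f"{name:<8} {pct:6.2f}%  cumulative {cumulative:6.2f}%")
```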


39/74

Graphing Univariate Categorical Data

[Figures: bar chart and Pareto-style chart of the amounts invested by category (Stocks, Bonds, Savings, CD), shown beside the outline of summary table, pie charts, Pareto diagram, and bar charts.]


40/74


41/74

Pie Chart (for an Investor's Portfolio)

[Figure: pie chart of amount invested in K$: Stocks 42%, Bonds 29%, Savings 15%, CD 14%.]

Percentages are rounded to the nearest percent.


42/74

Pareto Diagram

[Figure: Pareto diagram for Stocks, Bonds, Savings, CD. The axis for the bar chart (0% to 45%) shows the % invested in each category; the axis for the line graph (0% to 100%) shows the cumulative % invested.]


43/74


44/74


45/74

Tabulating and Graphing Bivariate

Categorical Data

Side by Side Charts

[Figure: "Comparing Investors", a side-by-side bar chart of Stocks, Bonds, CD, and Savings for Investor A, Investor B, and Investor C; axis from 0 to 60.]


46/74


47/74

Errors in Presenting Data

Using Chart Junk

No Relative Basis in Comparing Data between

Groups

Compressing the Vertical Axis

No Zero Point on the Vertical Axis


48/74

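Chart Junk

[Figure: two versions of a "Minimum Wage" chart, one labeled Bad Presentation and one labeled Good Presentation. Recoverable data: 1960: $1.00, 1970: $1.60, 1980: $3.10, 1990: $3.80. One version plots $ (0 to 4) against the years 1960, 1970, 1980, 1990.]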


49/74

No Relative Basis

[Figure: two charts titled "A's received by students." for FR, SO, JR, SR. Bad Presentation: vertical axis shows frequency (Freq., 0 to 300). Good Presentation: vertical axis shows percentage (%, 10 to 30).]

FR = Freshmen, SO = Sophomore, JR = Junior, SR = Senior


50/74

Compressing Vertical Axis

[Figure: two "Quarterly Sales" charts for Q1 to Q4, one labeled Good Presentation and one labeled Bad Presentation. One vertical axis runs from $0 to $50, the other from $0 to $200.]


51/74


52/74

Chapter 2

Measures of Central Tendency


53/74

An average is a value which represents the whole data; it is a central value of the data. The arithmetic mean, median, and mode are commonly used measures of central tendency.



54/74


Arithmetic Mean (AM)

Population Mean: $\mu = \frac{\sum X}{N}$

Sample Mean: $\bar{X} = \frac{\sum X}{n}$

Frequency Distribution: $\bar{X} = \frac{\sum f X}{\sum f}$


55/74

Simple problems on AM

For Raw data

$\bar{X} = \frac{\sum X}{N}$, where N = total no. of observations

Find the mean of 5, 34, 32, 34, 21, 23, 34, 31, 30, 35


56/74

Answer: the sum of the 10 observations is 279, so the mean is 279 / 10 = 27.9
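The raw-data mean is just the sum divided by the count. A quick check of the answer in Python (variable names are illustrative):

```python
values = [5, 34, 32, 34, 21, 23, 34, 31, 30, 35]

total = sum(values)         # sum of the observations
n = len(values)             # N, the total number of observations
mean = total / n

print(total, n, mean)       # 279 10 27.9
```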


57/74



58/74

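Grouped frequency distribution

$\bar{X} = A + h \, \frac{\sum f d}{\sum f}$, with $\sum f = N$

where A is an assumed mean (a convenient class midpoint), h is the class width, and d = (x - A) / h for each class midpoint x.

Marks:             0-10  10-20  20-30  30-40  40-50  50-60
No. of students:    12     18     27     20     17      6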
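A sketch of the step-deviation calculation for the marks table above (frequencies 12, 18, 27, 20, 17, 6). It takes A as an assumed mean, h as the class width, and d = (x - A) / h for each class midpoint x. The choice A = 25 and the function name are assumptions made for illustration; any class midpoint gives the same mean.

```python
classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50), (50, 60)]
freqs = [12, 18, 27, 20, 17, 6]

def step_deviation_mean(classes, freqs, assumed_mean):
    """Mean = A + h * sum(f * d) / sum(f), with d = (midpoint - A) / h."""
    h = classes[0][1] - classes[0][0]                  # common class width
    midpoints = [(lo + hi) / 2 for lo, hi in classes]
    d = [(x - assumed_mean) / h for x in midpoints]
    n = sum(freqs)
    return assumed_mean + h * sum(f * di for f, di in zip(freqs, d)) / n

mean_step = step_deviation_mean(classes, freqs, assumed_mean=25)

# Cross-check with the direct formula sum(f * x) / sum(f).
mean_direct = sum(f * (lo + hi) / 2 for f, (lo, hi) in zip(freqs, classes)) / sum(freqs)

print(mean_step, mean_direct)  # 28.0 28.0
```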


59/74

Missing frequency

Marks:             0-10  10-20  20-30  30-40  40-50
No. of students:     5     10     25      ?      5

Find the missing frequency if the arithmetic mean is 25. (Ans: 10)
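One way to reach the stated answer, writing f for the unknown frequency of the 30-40 class and using the class midpoints 5, 15, 25, 35, 45 (a worked sketch; the slides give only the result):

```latex
\begin{align*}
\bar{X} = \frac{\sum f x}{\sum f}
  &= \frac{5(5) + 10(15) + 25(25) + f(35) + 5(45)}{5 + 10 + 25 + f + 5}
   = \frac{1025 + 35f}{45 + f} = 25 \\
1025 + 35f &= 1125 + 25f \\
10f &= 100 \\
f &= 10
\end{align*}
```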


60/74

Combined Series

Let $X_1$ and $X_2$ be two series of data, where

the $X_1$ series has size $n_1$ and AM $\bar{X}_1$,

the $X_2$ series has size $n_2$ and AM $\bar{X}_2$.

Then the AM of the combined series of size $n_1 + n_2$ is given by

$\bar{X}_{12} = \frac{n_1 \bar{X}_1 + n_2 \bar{X}_2}{n_1 + n_2}$

This formula can be generalized for k series.


61/74

AM of Combined Series

Q. The average honey obtained from 5 beehives is 30 kg. The average of 3 of them is 31 kg. What is the average of the remaining two?

Q. The mean wage of 200 workers working in three shifts in a factory is Rs. 520. The average wage of the 90 workers in the first shift is Rs. 500. The average wage of the 60 workers in the second shift is Rs. 450. What is the average wage of the workers in the third shift? (Rs. 640)
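Both questions rearrange the combined-mean formula to solve for one unknown group mean. A minimal sketch, with hypothetical helper names:

```python
def combined_mean(sizes, means):
    """AM of the combined series: sum(n_i * mean_i) / sum(n_i)."""
    return sum(n * m for n, m in zip(sizes, means)) / sum(sizes)

def missing_group_mean(total_n, total_mean, known_sizes, known_means):
    """Mean of the one remaining group, given the overall mean and the other groups."""
    known_n = sum(known_sizes)
    known_total = sum(n * m for n, m in zip(known_sizes, known_means))
    return (total_n * total_mean - known_total) / (total_n - known_n)

# Beehives: 5 hives average 30 kg, 3 of them average 31 kg.
honey = missing_group_mean(5, 30, [3], [31])

# Factory: 200 workers average Rs. 520; shifts of 90 and 60 average Rs. 500 and Rs. 450.
third_shift = missing_group_mean(200, 520, [90, 60], [500, 450])

print(honey, third_shift)                                # 28.5 640.0
print(combined_mean([90, 60, 50], [500, 450, 640]))      # 520.0
```

The last line plugs the third-shift answer back into the combined-mean formula as a consistency check.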


62/74


63/74


64/74

Median : A Positional Average

The data must be ranked (sorted in ascending order) first.

The median is the number in the middle.

To find the depth (position) of the median, there are several formulas that could be used; the one that we will use is:

Depth of median = 0.5 * (n + 1) = (n + 1)/2


65/74

How to Find Median

Raw data: the median is the number in the middle position.

If the sample size is odd, there is one value at the center, which is the median. With N observations it is at the (N+1)/2-th position.

If the sample size is even, the median is the midpoint between the two middle numbers of the data. The (N+1)/2 formula still applies: it gives a position ending in .5, halfway between the two middle values.

In the case of a frequency distribution, we first need to find the cumulative frequencies (less-than type).


66/74

Examples

a) The marks of 9 students are 7, 9, 8, 5, 7, 7, 6, 8 and 9.

b) The weights (in kg) of 8 children are 12, 11.5, 13, 13.5, 10.5, 14, 12, 15.

Ungrouped frequency distribution

c) Calculate the median weight of the group of children.

Weights            30  31  32  33  34  35  36  37
No. of children     8  12  15  25  20  12   5   2


67/74

Calculation of median for an ungrouped or grouped frequency distribution

First arrange the values in ascending (or descending) order.

Find the cumulative frequencies (less-than type).

Compute (N+1)/2 and find the first cumulative frequency that is equal to or higher than it; the corresponding value of the variable is the median (for grouped frequency data, the corresponding class is the median class).
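The positional rule can be sketched for raw data and for an ungrouped frequency distribution, applied to examples (a), (b), and (c). The function names are illustrative. For the frequency table the code follows the "equal or higher" cumulative-frequency rule literally, so it never averages two neighbouring values.

```python
def median_raw(values):
    """Median of raw data: middle value, or midpoint of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def median_ungrouped(values, freqs):
    """First value whose less-than-type cumulative frequency reaches (N + 1) / 2."""
    depth = (sum(freqs) + 1) / 2
    cumulative = 0
    for value, f in sorted(zip(values, freqs)):
        cumulative += f
        if cumulative >= depth:
            return value

marks = [7, 9, 8, 5, 7, 7, 6, 8, 9]
weights_kg = [12, 11.5, 13, 13.5, 10.5, 14, 12, 15]
child_weights = [30, 31, 32, 33, 34, 35, 36, 37]
child_counts = [8, 12, 15, 25, 20, 12, 5, 2]

print(median_raw(marks))                              # 7
print(median_raw(weights_kg))                         # 12.5
print(median_ungrouped(child_weights, child_counts))  # 33
```

In (c), N = 99, so the depth is 50; the cumulative frequencies run 8, 20, 35, 60, and the first one reaching 50 belongs to the weight 33.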


68/74


69/74

Examples

Grouped frequency distribution

d) Calculate the median weight for the following group of persons.

Weights            50-55  55-60  60-65  65-70  70-75  75-80
No. of persons       8     10     25     35     15      7

Missing frequency

e) Given that the median sales of shops were Rs. 2,400, find the missing frequency.

Sales in 100s of Rs.   0-10  10-20  20-30  30-40  40-50
No. of shops             5     25     -     18      7
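The slides do not show a formula for the grouped median. The sketch below assumes the standard textbook interpolation, Median = L + ((N/2 - cf) / f) * h, where L is the lower boundary of the median class, cf the cumulative frequency before it, f its frequency, and h its width. Using N/2 for the position matches the ogive rule given earlier. The function name is hypothetical.

```python
def grouped_median(classes, freqs):
    """Interpolated median: L + ((N/2 - cf) / f) * h within the median class."""
    half = sum(freqs) / 2
    cum_before = 0                      # cumulative frequency below the current class
    for (lower, upper), f in zip(classes, freqs):
        if cum_before + f >= half:      # first class whose cumulative frequency reaches N/2
            return lower + (half - cum_before) / f * (upper - lower)
        cum_before += f

# Example (d): weights of 100 persons.
weight_classes = [(50, 55), (55, 60), (60, 65), (65, 70), (70, 75), (75, 80)]
weight_freqs = [8, 10, 25, 35, 15, 7]
median_weight = grouped_median(weight_classes, weight_freqs)

# Example (e): try candidate frequencies until the median is 24 (Rs. 2,400 in 100s).
sales_classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]
missing = next(
    f for f in range(1, 200)
    if abs(grouped_median(sales_classes, [5, 25, f, 18, 7]) - 24) < 1e-9
)

print(median_weight, missing)
```

Under this assumed formula, example (d) gives a median of 66 (median class 65-70, cf = 43) and example (e) gives a missing frequency of 25.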


70/74

Mode

Mode is the value of the variable corresponding to the highest frequency. It can be calculated only for frequency data.

The mode can be found just by inspection in the case of ungrouped frequency data.

The mode can be easily located on a histogram in the case of continuous grouped frequency data. Otherwise it can be calculated using a formula after detecting the modal class.


71/74

Mode for grouped frequency distributions

Find the modal class, that is, the class with the maximum frequency. Locating the modal class is not enough; to find the mode itself the following formula is used:

$\text{Mode} = L + \frac{f_m - f_1}{(f_m - f_1) + (f_m - f_2)} \times h$

L is the lower class boundary of the modal class

$f_m$ is the frequency of the modal class

$f_1$ is the frequency of the class previous to the modal class

$f_2$ is the frequency of the class next to the modal class

h is the size (width) of the modal class


72/74

Example:

Following is the distribution of 100 families according to expenditures on a luxury product (in 1000 rupees). Find the mode of the following data.

Expenditures       0-10  10-20  20-30  30-40  40-50
No. of families     14     23     28     20     15
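A sketch of the modal-class formula applied to this example. The function name is illustrative. Treating a missing neighbour class as frequency 0 is an assumption, and the degenerate case where the modal class and both neighbours have equal frequencies is not handled.

```python
def grouped_mode(classes, freqs):
    """Mode = L + (fm - f1) / ((fm - f1) + (fm - f2)) * h for the modal class."""
    m = max(range(len(freqs)), key=freqs.__getitem__)   # index of the modal class
    lower, upper = classes[m]
    fm = freqs[m]
    f1 = freqs[m - 1] if m > 0 else 0                   # class before the modal class
    f2 = freqs[m + 1] if m + 1 < len(freqs) else 0      # class after the modal class
    return lower + (fm - f1) / ((fm - f1) + (fm - f2)) * (upper - lower)

expenditure_classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]
families = [14, 23, 28, 20, 15]

mode = grouped_mode(expenditure_classes, families)
print(round(mode, 2))  # 23.85
```

Here the modal class is 20-30 with fm = 28, f1 = 23, f2 = 20, so the mode is 20 + (5 / 13) * 10, about 23.85 thousand rupees.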


73/74

When Mode is Ill-defined

Sometimes data have more than one mode: they may be bimodal, trimodal, or multimodal.

For a moderately skewed (non-symmetrical) distribution, the mode can be estimated using the following empirical formula:

Mode = 3 Median - 2 Mean

Problem: For a skewed distribution, the mean and mode are 35.4 and 32.1 respectively. Find the value of the median.

Ans. 34.3
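The answer follows by rearranging the empirical relation for the median (a worked sketch of the arithmetic):

```latex
\begin{align*}
\text{Mode} &= 3\,\text{Median} - 2\,\text{Mean} \\
\text{Median} &= \frac{\text{Mode} + 2\,\text{Mean}}{3}
  = \frac{32.1 + 2(35.4)}{3}
  = \frac{102.9}{3}
  = 34.3
\end{align*}
```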


74/74

Assignment

What are the geometric mean (GM) and the harmonic mean (HM)? Define them.

What are the merits, demerits, and uses of the median, mode, HM, and GM?

Solve 3 problems on each average from a reference book.
