ch01 2statistics intro
TRANSCRIPT
-
7/31/2019 Ch01 2Statistics Intro
1/74
Introduction to Statistics
Chapter 1
-
Statistics: Introduction
Definition
Data types, Scales of measurement
Basic concepts
Tabulation & Graphs
Measures of central tendency
-
Definition
Statistics is a science which deals with the
collection, classification, presentation
(tabulation, graphs), analysis and interpretation
of numerical data.
-
Collection of data
Primary data: data collected directly from the original source, for example a census or a sample survey (using a questionnaire).
Secondary data: data taken indirectly from another source, for example magazines, newspapers, NGOs, etc.
-
Classification of data
Qualitative Data
Data which show a quality of the individual.
For example: Eye Color, First Name, Favorite Movie, Political Party
Quantitative Data
Data for which the value has numerical meaning. You can do mathematical operations on the values of quantitative variables (like taking an average).
Examples: a person's Height, Weight, Income, Resting Pulse Rate, Blood Alcohol Level
-
Categorical data
Geographical
Chronological
-
Quantitative data
Discrete variables are usually obtained by
counting. There are a finite or countable
number of choices available with discrete
data. You can't have 2.63 people in the room.
Continuous variables are usually obtained by
measuring. Length, weight, and time are all
examples of continuous variables. They can be defined in intervals.
-
Levels of Measurement
Nominal
Ordinal
Interval
Ratio
-
Levels of Measurement contd.
Nominal
Only names are meaningful.
Can be qualitative only.
Data values serve as labels.
Labels have no meaningful order.
Examples: Blood Type, Breed of Dog
-
Levels of Measurement contd.
Ordinal
Adds an order to the names.
It can be qualitative or quantitative.
Data values serve as labels but the labels have
a natural meaningful order.
Differences between values, however, are
meaningless.
Examples: Statistics Grade, Basketball
Rankings
-
Levels of Measurement contd.
Interval
Data values are numerical.
They have a meaningful order.
Differences between data values are meaningful.
The ratio of two data values, however, is meaningless. This occurs when zero is an arbitrary measurement rather than actually indicating nothing.
Examples: Temperature (in °C or °F), Year of Birth
-
Levels of Measurement contd.
Ratio
Data values are numerical, have order, and both differences and ratios of values are meaningful.
Zero measurement indicates absence of the quantity being measured.
Examples: Weight, Volume, Number of Children
-
Population vs Sample
Population - data set consisting of all
outcomes, measurements, or responses of
interest. (group of observations under study)
Sample - data set which is a subset of the
population data set
-
Types of Sampling
There are 5 types of sampling methods:
Random sampling
Example: telephone polling of random telephone numbers; drawing names out of a hat.
Systematic sampling
Example: choosing a sample of registered voters by taking every 25th voter from the county registration roll; testing every 300th product from the assembly line.
-
Types of Sampling contd.
Convenience sampling
Example: Internet polls, mail-in customer surveys.
Cluster sampling is accomplished by dividing the population into groups, usually geographically. These groups are called clusters or blocks. The clusters are randomly selected, and each element in the selected clusters is used.
Example: randomly choose 10 polling stations in a city and exit poll all voters at those stations.
-
Types of Sampling contd.
Stratified sampling also divides the population into groups, called strata, and then a sample is taken from each of these strata using either random, systematic, or convenience sampling.
Example: choosing 100 men and 100 women for a sample.
Example: stratify the population of 1000 employees by income level and then choose a sample of low, middle, and high income individuals.
-
Parameters and Statistics
Parameters are statistical measures associated with populations; statistics are statistical measures associated with samples.
Parameters are usually denoted using Greek letters (μ, σ) while statistics are usually denoted using Roman letters (x̄, s).
-
Descriptive and Inferential Statistics
We compute statistics and use them to estimate parameters. The computation is the first part of the statistics course, called Descriptive Statistics, and the estimation is the second part, called Inferential Statistics.
-
Presentation of Data
Tabulation
Frequency distribution
Grouped
Ungrouped
Graphs
-
Frequency distribution
According to Croxton & Cowden, a Frequency Distribution is a statistical table which shows the set of all distinct values of the variable arranged in order of magnitude, either individually or in groups, with their corresponding frequencies side by side.
-
Ungrouped Frequency of a Discrete Variable:
Suppose that a survey of 150 houses was conducted and the number of rooms in each house is recorded as below
5 4 4 6 3 2 2 6 6 2 6 3 3 4 5 6 3 2 2 5 3
1 4 5 1 5 1 4 3 2 5 1 5 3 2 2 4 2 2 4 4 6
3 2 4 2 3 2 4 6 3 3 2 6 4 1 4 4 5 2 4 1 4
2 1 5 1 3 3 2 5 6 1 3 1 5 3 4 3 1 1 4 1 1
2 2 1 5 2 3 5 3 4 3 1 1 4 1 1 2 2 1 5 2 3
6 3 5 2 2 3 3 3 3 4 5 1 6 2 1 2 1 1 6 5 2
1 1 5 6 4 2 2 3 3 3 4 3 2 1 5 2 3 1 1 4 6
4 6 2
-
Counting of frequency using tally marks
No. of Rooms (X) Tally Marks Frequency
1 llll llll llll llll llll llll 29
2 llll llll llll llll llll llll lll 33
3 llll llll llll llll llll llll 30
4 llll llll llll llll llll 24
5 llll llll llll llll 19
6 llll llll llll 15
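The tally can be checked mechanically. The following is a minimal Python sketch, assuming the 150 room counts from the survey slide are read as single digits; the variable names are illustrative and do not come from the slides.

```python
from collections import Counter

# Number of rooms in each of the 150 surveyed houses (from the survey slide).
rooms_text = """
5 4 4 6 3 2 2 6 6 2 6 3 3 4 5 6 3 2 2 5 3
1 4 5 1 5 1 4 3 2 5 1 5 3 2 2 4 2 2 4 4 6
3 2 4 2 3 2 4 6 3 3 2 6 4 1 4 4 5 2 4 1 4
2 1 5 1 3 3 2 5 6 1 3 1 5 3 4 3 1 1 4 1 1
2 2 1 5 2 3 5 3 4 3 1 1 4 1 1 2 2 1 5 2 3
6 3 5 2 2 3 3 3 3 4 5 1 6 2 1 2 1 1 6 5 2
1 1 5 6 4 2 2 3 3 3 4 3 2 1 5 2 3 1 1 4 6
4 6 2
"""
rooms = [int(token) for token in rooms_text.split()]

# Counter plays the role of the tally marks: one count per distinct value.
frequency = Counter(rooms)

for value in sorted(frequency):
    print(value, frequency[value])
```

Each printed pair is a row of the ungrouped frequency distribution: the value of X followed by its frequency.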
-
Grouped Frequency Distributions - Guidelines for classes
Number of classes should be between 5 and 20.
The classes must be mutually exclusive.
The classes must be all inclusive or exhaustive.
The classes must be continuous.
The classes must be equal in width.
-
Creating a Grouped Frequency Distribution
Compute the Range = Maximum - Minimum (from observations).
Select the number of classes desired. This is usually between 5 and 20.
Or use Sturges' formula: $n = 1 + 3.322 \log_{10} N$, which gives the necessary number of classes for any continuous frequency distribution, where N denotes the total number of observations.
Find the class width by dividing the range by the number of classes and rounding up.
Pick a suitable starting point less than or equal to the minimum value. You will be able to cover "the class width times the number of classes" values. Your starting point is the lower limit of the first class. Continue to add the class width to this lower limit to get the rest of the lower limits.
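The steps above can be sketched in Python as follows. The function names are hypothetical, and rounding the Sturges value up to a whole number is an assumption, since the slide does not say how to round it. The values 12 and 58 are the minimum and maximum of the 20-value example used later in this deck.

```python
import math

def sturges_classes(n_observations):
    """Suggested number of classes: 1 + 3.322 * log10(N), rounded up (assumed)."""
    return math.ceil(1 + 3.322 * math.log10(n_observations))

def class_width(minimum, maximum, n_classes):
    """Range divided by the number of classes, rounded up to a whole number."""
    return math.ceil((maximum - minimum) / n_classes)

def lower_limits(start, width, n_classes):
    """Lower limit of each class, from a starting point at or below the minimum."""
    return [start + i * width for i in range(n_classes)]

n_classes = sturges_classes(150)      # e.g. a survey of 150 observations
width = class_width(12, 58, 5)        # range 46 split into 5 classes
limits = lower_limits(10, width, 5)   # starting point 10 is below the minimum 12
print(n_classes, width, limits)
```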
-
Tabulating Numerical Data: Frequency
Distributions
Sort Raw Data in Ascending Order: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
Find Range: 58 - 12 = 46
Select Number of Classes: 5 (usually between 5 and 15)
Compute Class Interval (width): 10 (46/5 then round up)
Determine Class Boundaries (limits): 10, 20, 30, 40, 50, 60
Compute Class Midpoints: 15, 25, 35, 45, 55
Count Observations & Assign to Classes
-
Frequency Distributions, Relative Frequency
Distributions and Percentage Distributions
Class              Frequency   Relative Frequency   Percentage
10 but under 20        3             .15                15
20 but under 30        6             .30                30
30 but under 40        5             .25                25
40 but under 50        4             .20                20
50 but under 60        2             .10                10
Total                 20            1                  100
Data in ordered array:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
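The table can be reproduced from the ordered array. This is a minimal Python sketch, assuming each class is read as "lower limit but under upper limit" as in the table; the variable names are illustrative.

```python
data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]
boundaries = [10, 20, 30, 40, 50, 60]

# Each class covers "lower but under upper".
classes = list(zip(boundaries[:-1], boundaries[1:]))
frequencies = [sum(lower <= x < upper for x in data) for lower, upper in classes]

total = len(data)
relative = [f / total for f in frequencies]
percentage = [100 * f / total for f in frequencies]
midpoints = [(lower + upper) / 2 for lower, upper in classes]

for (lower, upper), f, r, p in zip(classes, frequencies, relative, percentage):
    print(f"{lower} but under {upper}: {f}  {r:.2f}  {p:.0f}")
```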
-
Graphing Numerical Data:
The Histogram
[Figure: histogram of Frequency against Class Midpoints (15, 25, 35, 45, 55), with bar heights 3, 6, 5, 4, 2. Bars are drawn between the Class Boundaries, with No Gaps Between Bars.]
-
Graphing Numerical Data:
The Frequency Polygon
[Figure: frequency polygon of Frequency against Class Midpoints (5, 15, 25, 35, 45, 55, More).]
-
Tabulating Numerical Data:
Cumulative Frequency
Lower Limit   Cumulative Frequency   Cumulative % Frequency
10                  0                      0
20                  3                     15
30                  9                     45
40                 14                     70
50                 18                     90
60                 20                    100
Data in ordered array:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
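The cumulative columns follow from running totals of the class frequencies. A minimal Python sketch, assuming the class frequencies 3, 6, 5, 4, 2 from the frequency distribution; the "more than" column is added here for comparison and is not part of the table above.

```python
from itertools import accumulate

boundaries = [10, 20, 30, 40, 50, 60]
frequencies = [3, 6, 5, 4, 2]          # class frequencies
total = sum(frequencies)

# "Less than" cumulative frequency at each boundary, starting from 0.
less_than = [0] + list(accumulate(frequencies))
less_than_pct = [100 * c / total for c in less_than]

# "More than" cumulative frequency: observations at or above each boundary.
more_than = [total - c for c in less_than]

for b, lt, pct, mt in zip(boundaries, less_than, less_than_pct, more_than):
    print(b, lt, pct, mt)
```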
-
Plot the ogive against the class boundaries (upper limits), not the class midpoints.
-
Example
Using the data given below, construct a 'more than' cumulative frequency table and draw the Ogive
(More than ogive)
-
Locate median on Ogive
The median is the value on the X axis of the point corresponding to N/2 on the Y axis.
(The point is at the intersection of the less-than and more-than ogives.)
-
Tabulating and Graphing Univariate
Categorical Data
Categorical Data
Tabulating Data
The Summary Table
Graphing Data
Pie Charts
Pareto Diagram
Bar Charts
-
Summary Table (for an Investor's Portfolio)
Investment Category   Amount (in thousands $)   Percentage
Stocks                       46.5                  42.27
Bonds                        32                    29.09
CD                           15.5                  14.09
Savings                      16                    14.55
Total                       110                   100
Variables are Categorical.
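The Percentage column is each amount divided by the total. A short Python sketch, assuming the amounts from the summary table; the dictionary name is illustrative.

```python
amounts = {"Stocks": 46.5, "Bonds": 32.0, "CD": 15.5, "Savings": 16.0}  # thousands of $

total = sum(amounts.values())
percentages = {name: round(100 * amount / total, 2)
               for name, amount in amounts.items()}
print(total, percentages)
```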
-
Graphing Univariate
Categorical Data
[Figure: example bar chart, pie chart and Pareto diagram for the categories Stocks, Bonds, Savings and CD.]
-
Pie Chart
(for an Investor's Portfolio)
[Figure: pie chart of Amount Invested in K$: Stocks 42%, Bonds 29%, Savings 15%, CD 14%.]
Percentages are rounded to the nearest percent.
-
Pareto Diagram
[Figure: Pareto diagram for the categories Stocks, Bonds, Savings, CD. The axis for the bar chart (0% to 45%) shows % invested in each category; the axis for the line graph (0% to 100%) shows cumulative % invested.]
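The two series of a Pareto diagram can be computed as follows. This is a minimal Python sketch, assuming the percentages from the investor's summary table; the bars are the sorted percentages and the line is their running total.

```python
percentages = {"Stocks": 42.27, "Bonds": 29.09, "CD": 14.09, "Savings": 14.55}

# Pareto order: categories sorted from largest to smallest share.
ordered = sorted(percentages.items(), key=lambda item: item[1], reverse=True)

cumulative = []
running = 0.0
for name, pct in ordered:
    running += pct                      # cumulative % invested so far
    cumulative.append((name, pct, round(running, 2)))

for row in cumulative:
    print(row)
```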
-
Tabulating and Graphing Bivariate
Categorical Data
Side by Side Charts
[Figure: side-by-side bar chart "Comparing Investors" for Stocks, Bonds, CD and Savings, with one bar each for Investor A, Investor B and Investor C; horizontal axis from 0 to 60.]
-
Errors in Presenting Data
Using Chart Junk
No Relative Basis in Comparing Data between
Groups
Compressing the Vertical Axis
No Zero Point on the Vertical Axis
-
Chart Junk
[Figure: the Minimum Wage (1960: $1.00, 1970: $1.60, 1980: $3.10, 1990: $3.80) shown twice, once as a Bad Presentation cluttered with chart junk and once as a Good Presentation plotting $ against year.]
-
No Relative Basis
[Figure: A's received by students, by class (FR, SO, JR, SR), shown twice. The Bad Presentation plots raw frequencies (Freq.); the Good Presentation plots percentages (%).]
FR = Freshmen, SO = Sophomore, JR = Junior, SR = Senior
-
Compressing Vertical Axis
[Figure: Quarterly Sales ($) for Q1 to Q4 shown twice. The Good Presentation uses a vertical axis from 0 to 50; the Bad Presentation compresses the data by using a vertical axis from 0 to 200.]
-
Chapter 2
Measures of Central Tendency
-
Measures of Central Tendency
An average is a value which represents the whole data, or it is a central value of the data. The arithmetic mean, median, and mode are commonly used measures of central tendency.
-
Measures of Central Tendency
Arithmetic Mean (AM)
Population Mean: $\mu = \frac{\sum X}{N}$
Sample Mean: $\bar{X} = \frac{\sum X}{n}$
Frequency Distribution: $\bar{X} = \frac{\sum fX}{\sum f}$
-
Simple problems on AM
For Raw data
$\bar{X} = \frac{\sum X}{N}$, N = total no. of observations
Find the mean of 5, 34, 32, 34, 21, 23, 34, 31, 30, 35
Answer = 27.9
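The answer can be checked by summing the ten values and dividing by their count. A short Python sketch; `statistics.mean` from the standard library is used only as a cross-check.

```python
from statistics import mean

values = [5, 34, 32, 34, 21, 23, 34, 31, 30, 35]

total = sum(values)            # sum of X
n = len(values)                # N, the number of observations
print(total, n, total / n, mean(values))
```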
-
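Grouped frequency distribution
$\bar{X} = A + h \frac{\sum fd}{\sum f}$, $\sum f = N$
where A is an assumed mean, h is the class width and d = (X - A)/h for each class midpoint X.
Marks:            0-10  10-20  20-30  30-40  40-50  50-60
No. of students:   12     18     27     20     17      6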
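The formula can be applied to the marks table as sketched below. This assumes the task is to find the mean marks, which the slide does not state explicitly, and the assumed mean A = 25 is an arbitrary choice made here; any class midpoint gives the same result.

```python
lower_limits = [0, 10, 20, 30, 40, 50]
frequencies = [12, 18, 27, 20, 17, 6]   # number of students per class
h = 10                                  # class width
A = 25                                  # assumed mean (midpoint of 20-30), chosen here

midpoints = [lower + h / 2 for lower in lower_limits]
d = [(x - A) / h for x in midpoints]    # step deviations

N = sum(frequencies)
sum_fd = sum(f * di for f, di in zip(frequencies, d))
mean_step = A + h * sum_fd / N

# Direct computation for comparison: sum(f * x) / sum(f).
mean_direct = sum(f * x for f, x in zip(frequencies, midpoints)) / N
print(N, sum_fd, mean_step, mean_direct)
```

Both routes give the same mean; the step-deviation form only keeps the hand arithmetic small.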
-
Missing frequency
Marks:            0-10  10-20  20-30  30-40  40-50
No. of students:    5     10     25    ---      5
Find the missing frequency if the arithmetic mean is 25.
(Ans: 10)
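One way to reach the stated answer, taking the class midpoints 5, 15, 25, 35, 45 and writing $x$ for the missing frequency (the symbol $x$ is introduced here for illustration):

```latex
\begin{align*}
\bar{X} = \frac{\sum fX}{\sum f}
  &= \frac{5(5) + 10(15) + 25(25) + x(35) + 5(45)}{5 + 10 + 25 + x + 5} = 25 \\
1025 + 35x &= 25(45 + x) = 1125 + 25x \\
10x &= 100 \quad\Rightarrow\quad x = 10
\end{align*}
```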
-
AM of Combined Series
Let X1 and X2 be two series of data, where
the X1 series is of size n1 with AM $\bar{X}_1$, and
the X2 series is of size n2 with AM $\bar{X}_2$.
Then the AM of the combined series of size n1 + n2 is given by
$\bar{X}_{12} = \frac{n_1 \bar{X}_1 + n_2 \bar{X}_2}{n_1 + n_2}$
This formula can be generalized for k series.
-
AM of Combined Series
Q. The average honey obtained from 5 beehives is 30 kg. The average of 3 of them is 31 kg. What is the average of the remaining two?
Q. The mean wage of 200 workers working in three shifts in a factory is Rs. 520. The average of 90 workers working in the first shift is Rs. 500. The average of 60 workers working in the second shift is Rs. 450. What is the average of the workers working in the third shift? (Rs. 640)
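Both questions rearrange the combined-mean formula to solve for one unknown group mean. A minimal Python sketch; the helper `remaining_mean` is a hypothetical name introduced here, not something from the slides.

```python
def remaining_mean(total_n, total_mean, known_groups):
    """Mean of the one group not listed in known_groups.

    known_groups is a list of (size, mean) pairs; the combined-mean
    formula is rearranged to solve for the missing group's mean.
    """
    known_n = sum(n for n, _ in known_groups)
    known_sum = sum(n * m for n, m in known_groups)
    return (total_n * total_mean - known_sum) / (total_n - known_n)

honey = remaining_mean(5, 30, [(3, 31)])                         # beehives, kg
third_shift = remaining_mean(200, 520, [(90, 500), (60, 450)])   # wages, Rs.
print(honey, third_shift)
```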
-
Median : A Positional Average
The data must be ranked (sorted in ascending order) first.
The median is the number in the middle.
To find the depth (position) of the median, there are several formulas that could be used; the one that we will use is:
Depth of median = 0.5 * (n + 1) = (n+1)/2
-
How to Find Median
Raw Data - The median is the number in the middle position.
If the sample size is odd, there is one value at the center, which is the median. With N observations it is at the (N+1)/2 th position.
If the sample size is even, the median is the midpoint between the two middle numbers of the data. (N+1)/2 is then not a whole number, and the median lies halfway between the values on either side of that position.
In case of a frequency distribution, we first need to find the cumulative frequencies (less than type).
-
Examples
a) The marks of 9 students are 7,9,8,5,7,7,6,8 and 9
b) The weights of 8 children are given as (kgs)
12,11.5,13,13.5,10.5,14,12,15.
Ungrouped frequency distribution
c) Calculate the median weight of group of
children
Weights 30 31 32 33 34 35 36 37
No. of children 8 12 15 25 20 12 5 2
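Examples a) and b) can be worked with the depth rule (n + 1)/2. A minimal Python sketch; the function name `median_by_depth` is illustrative, and `statistics.median` from the standard library is printed as a cross-check.

```python
from statistics import median

marks = [7, 9, 8, 5, 7, 7, 6, 8, 9]                  # example a), n = 9
weights = [12, 11.5, 13, 13.5, 10.5, 14, 12, 15]     # example b), n = 8

def median_by_depth(values):
    """Median located at depth (n + 1) / 2 in the ranked data."""
    ranked = sorted(values)
    depth = (len(ranked) + 1) / 2          # 1-based position
    if depth.is_integer():                 # odd n: a single middle value
        return ranked[int(depth) - 1]
    below = ranked[int(depth) - 1]         # even n: average the two neighbours
    above = ranked[int(depth)]
    return (below + above) / 2

print(median_by_depth(marks), median(marks))
print(median_by_depth(weights), median(weights))
```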
-
Calculation of median for ungrouped or grouped frequency distribution
First arrange the values in ascending order.
Find the cumulative frequencies (less than type).
Find (N+1)/2, then locate the first cumulative frequency in the table that is equal to or higher than (N+1)/2. The corresponding value of the variable is the median (for grouped frequency data it gives the median class).
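Applied to the children's weights in example c), the procedure can be sketched in Python as below; the variable names are illustrative.

```python
from itertools import accumulate

weights = [30, 31, 32, 33, 34, 35, 36, 37]
children = [8, 12, 15, 25, 20, 12, 5, 2]     # frequencies from example c)

N = sum(children)
position = (N + 1) / 2

# "Less than" cumulative frequencies, one per value of the variable.
cumulative = list(accumulate(children))

# Median: first value whose cumulative frequency reaches the position.
median_weight = next(w for w, cf in zip(weights, cumulative) if cf >= position)
print(N, position, cumulative, median_weight)
```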
-
Examples
Grouped frequency distribution
d) Calculate median weight for the following group of persons
Weights:          50-55  55-60  60-65  65-70  70-75  75-80
No. of persons:     8     10     25     35     15      7
Missing frequency
e) Given that the median sales of shops were Rs. 2,400, find the missing frequency.
Sales in 100s of Rs.:  0-10  10-20  20-30  30-40  40-50
No. of shops:            5     25     -      18      7
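Example d) can be worked with linear interpolation inside the median class. The formula used below, L + ((N/2 - cf) / f) * h, is an assumption: it is the common textbook formula and is not quoted in this transcript. The function name is also hypothetical.

```python
from itertools import accumulate

def grouped_median(lower_limits, frequencies, width):
    """Median by linear interpolation inside the median class.

    Uses L + ((N/2 - cf) / f) * h, where cf is the cumulative frequency
    before the median class (assumed formula, not taken from the slides).
    """
    N = sum(frequencies)
    cumulative = list(accumulate(frequencies))
    # Median class: first class whose cumulative frequency reaches N/2.
    k = next(i for i, cf in enumerate(cumulative) if cf >= N / 2)
    cf_before = cumulative[k - 1] if k > 0 else 0
    return lower_limits[k] + (N / 2 - cf_before) / frequencies[k] * width

# Example d): weights of 100 persons in classes of width 5.
median_weight = grouped_median([50, 55, 60, 65, 70, 75], [8, 10, 25, 35, 15, 7], 5)
print(median_weight)
```

For example e), the same function can be evaluated with trial values of the missing frequency until the result equals 24 (that is, Rs. 2,400 in units of Rs. 100).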
-
Mode
Mode is the value of the variable corresponding to the highest frequency. It can be calculated only for frequency data.
The mode can be found just by inspection in case of ungrouped frequency data.
The mode can be easily located on a histogram in case of continuous grouped frequency data. Otherwise it can be calculated using a formula after detecting the modal class.
-
Mode for grouped frequency distributions
Find the modal class, that is, the class with the maximum frequency. Locating the modal class is not enough; to find the mode the following formula is used:
$\text{Mode} = L + \frac{f_m - f_1}{(f_m - f_1) + (f_m - f_2)} \times h$
L is the lower class boundary of the modal class
fm is the frequency of the modal class
f1 is the frequency of the class previous to the modal class
f2 is the frequency of the class next to the modal class
h is the size of the modal class
-
Example:
The following is the distribution of 100 families according to expenditures on a luxury product, in 1000 rupees. Find the mode of the data.
Expenditures:     0-10  10-20  20-30  30-40  40-50
No. of families:   14     23     28     20     15
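The grouped-mode formula can be applied to this example as sketched below. The function name is hypothetical, and treating a missing neighbouring class as frequency 0 (when the modal class is first or last) is an assumption made here; it does not arise in this example.

```python
def grouped_mode(lower_limits, frequencies, width):
    """Mode = L + (fm - f1) / ((fm - f1) + (fm - f2)) * h for the modal class."""
    k = frequencies.index(max(frequencies))          # modal class index
    fm = frequencies[k]
    f1 = frequencies[k - 1] if k > 0 else 0          # class before (0 if none, assumed)
    f2 = frequencies[k + 1] if k + 1 < len(frequencies) else 0   # class after
    return lower_limits[k] + (fm - f1) / ((fm - f1) + (fm - f2)) * width

mode = grouped_mode([0, 10, 20, 30, 40], [14, 23, 28, 20, 15], 10)
print(round(mode, 2))
```

The modal class is 20-30, so the mode is 20 + (5/13) × 10, about 23.85 thousand rupees.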
-
When Mode is Ill-defined
Sometimes data have more than one mode: they may be bimodal, trimodal or multimodal.
For a non-symmetrical distribution the mode can be calculated using the following formula:
Mode = 3 Median - 2 Mean
Problem: For a skewed distribution, the mean and mode are 35.4 and 32.1 respectively. Find the value of the median.
Ans. 34.3
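The stated answer follows by substituting the given mean and mode into the relation and solving for the median:

```latex
\begin{align*}
\text{Mode} &= 3\,\text{Median} - 2\,\text{Mean} \\
32.1 &= 3\,\text{Median} - 2(35.4) \\
\text{Median} &= \frac{32.1 + 70.8}{3} = \frac{102.9}{3} = 34.3
\end{align*}
```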
-
Assignment
What are geometric mean (GM) and harmonic
mean(HM)? Define.
What are the merits , demerits and uses of
median, mode, HM and GM?
Solve 3 problems of each average from a
reference book.