bcor 1020 business statistics lecture 5 – january 31, 2008
Post on 21-Dec-2015
BCOR 1020 Business Statistics
Lecture 5 – January 31, 2008
Overview
• Chapter 4 – Descriptive Statistics…
– Standardized Data
– Percentiles and Quartiles
– Boxplots
Chapter 4 – Standardized Data
Chebyshev’s Theorem – Developed by mathematicians Jules Bienaymé (1796-1878) and Pafnuty Chebyshev (1821-1894).
• For any population with mean μ and standard deviation σ, the percentage of observations that lie within k standard deviations of the mean must be at least 100[1 – 1/k²].
– For k = 2 standard deviations, 100[1 – 1/2²] = 75% (so at least 75.0% will lie within μ ± 2σ).
– For k = 3 standard deviations, 100[1 – 1/3²] = 88.9% (so at least 88.9% will lie within μ ± 3σ).
• Although applicable to any data set, these limits tend to be too wide to be useful.
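The Chebyshev bound is simple arithmetic, so a short Python sketch can reproduce the two percentages quoted on the slide. The function name `chebyshev_min_percent` is made up for this illustration and is not part of the lecture.

```python
def chebyshev_min_percent(k: float) -> float:
    """Minimum percentage of observations within k standard deviations.

    Implements 100 * (1 - 1/k^2). The bound only says something useful
    for k > 1, so smaller values are rejected.
    """
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 100 * (1 - 1 / k**2)


for k in (2, 3):
    print(f"k = {k}: at least {chebyshev_min_percent(k):.1f}%")
```

The same function can be called with any other k greater than 1.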
Clickers
Using Chebyshev’s Theorem, determine the minimum percentage of observations that lie within 4 standard deviations of the mean.
100[1 – 1/k²]
A = 75.0%
B = 88.9%
C = 93.8%
D = 96.0%
Chapter 4 – Standardized Data
The Empirical Rule:
• The normal or Gaussian distribution was named for Karl Gauss (1777-1855).
• The normal distribution is symmetric and is also known as the bell-shaped curve.
• The Empirical Rule states that given data from a normal distribution, we expect that for…
k = 1: About 68.26% will lie within μ ± 1σ
k = 2: About 95.44% will lie within μ ± 2σ
k = 3: About 99.73% will lie within μ ± 3σ
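As a rough check on these percentages, the share of a normal distribution within k standard deviations of the mean can be computed from the error function as erf(k/√2). The sketch below uses only Python's standard library; the function name `normal_within` is an assumption made for this example. The results for k = 1 and k = 2 differ from the slide in the second decimal place, most likely because the slide values come from a rounded z-table.

```python
import math


def normal_within(k: float) -> float:
    """Percentage of a normal distribution within k standard deviations of the mean."""
    return 100 * math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"k = {k}: about {normal_within(k):.2f}%")
```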
Chapter 4 – Standardized Data
The Empirical Rule:
• Distance from the mean is measured in terms of the number of standard deviations.
• Unusual Observations:
Unusual observations are those that lie beyond μ ± 2σ.
Outliers are observations that lie beyond μ ± 3σ.
Note: no upper bound is given. Data values outside μ ± 3σ are rare.
Clickers
Suppose 80 students take an exam. Assuming exam scores follow a normal distribution, approximately how many students would you expect to have scores within 2 standard deviations of the mean?
A = 55
B = 76
C = 79
D = 80
Chapter 4 – Standardized Data
Defining a Standardized Variable:
• A standardized variable (Z) redefines each observation in terms of the number of standard deviations from the mean.
Standardization formula for a population: $z_i = \frac{x_i - \mu}{\sigma}$
Standardization formula for a sample: $z_i = \frac{x_i - \bar{x}}{s}$
• $z_i$ tells how far away the observation is from the mean (in terms of σ, or s for a sample).
• A negative z value means the observation is below the mean.
• Positive z means the observation is above the mean.
Chapter 4 – Standardized Data
Defining a Standardized Variable:
• MegaStat calculates standardized values and also checks for outliers.
• In Excel, use =STANDARDIZE(x, Mean, StDev) to calculate the standardized z value for a single observation x.
Chapter 4 – Standardized Data
Example: Unusual Observations in the P/E Data
• The P/E ratio data contains several large data values. Are they unusual or outliers?
Raw Data:
7 8 8 10 10 10 10 12 13 13 13 13
13 13 13 14 14 14 15 15 15 15 15 16
16 16 17 18 18 18 18 19 19 19 19 19
20 20 20 21 21 21 22 22 23 23 23 24
25 26 26 26 26 27 29 29 30 31 34 36
37 40 41 45 48 55 68 91
Standardized Data: [table not preserved in the transcript]
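The standardized values can be recomputed from the raw data. The sketch below is one way to do it, assuming the sample formula (x̄ and s with an n − 1 divisor) and the cutoffs from the earlier slide: beyond 2 standard deviations is unusual, beyond 3 is an outlier. The variable names are illustrative only.

```python
import statistics

pe_ratios = [
    7, 8, 8, 10, 10, 10, 10, 12, 13, 13, 13, 13,
    13, 13, 13, 14, 14, 14, 15, 15, 15, 15, 15, 16,
    16, 16, 17, 18, 18, 18, 18, 19, 19, 19, 19, 19,
    20, 20, 20, 21, 21, 21, 22, 22, 23, 23, 23, 24,
    25, 26, 26, 26, 26, 27, 29, 29, 30, 31, 34, 36,
    37, 40, 41, 45, 48, 55, 68, 91,
]

mean = statistics.mean(pe_ratios)
s = statistics.stdev(pe_ratios)  # sample standard deviation (n - 1 divisor)

# Standardize every observation: z = (x - mean) / s
z_scores = [(x - mean) / s for x in pe_ratios]

# Classify by distance from the mean, measured in standard deviations
unusual = [x for x, z in zip(pe_ratios, z_scores) if 2 < abs(z) <= 3]
outliers = [x for x, z in zip(pe_ratios, z_scores) if abs(z) > 3]

print(f"mean = {mean:.2f}, s = {s:.2f}")
print("unusual (2 < |z| <= 3):", unusual)
print("outliers (|z| > 3):", outliers)
```

Because the mean and standard deviation are themselves pulled upward by the largest values, this approach flags fewer points than a quartile-based rule would.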
Chapter 4 – Standardized Data
Outliers: What do we do with outliers in a data set?
• If due to erroneous data, then discard.
• An outrageous observation (one completely outside of an expected range) is certainly invalid.
• Recognize unusual data points and outliers and their potential impact on your study.
• Research books and articles on how to handle outliers.
Chapter 4 – Standardized Data
Estimating Sigma:
• It is common to use the sample standard deviation (s) as an estimate of σ.
• We can also use the empirical rule to define a simple (quick-and-dirty) estimate:
– For a normal distribution, the range of 99.73% of the values is 6σ (from μ – 3σ to μ + 3σ).
– If you know the range R (high – low), you can estimate the standard deviation as σ = R/6.
– Useful for approximating the standard deviation when only R is known.
– This estimate depends on the assumption of normality.
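The R/6 shortcut is a one-line calculation. The sketch below applies it to the lowest and highest P/E ratios in the lecture's data set (7 and 91); the function name `sigma_from_range` is hypothetical. Since the estimate assumes normality, it should be read as a rough guide only for skewed data such as these.

```python
def sigma_from_range(low: float, high: float) -> float:
    """Quick estimate of sigma as R/6, assuming roughly normal data."""
    return (high - low) / 6


# Lowest and highest P/E ratios in the lecture's data set
print(sigma_from_range(7, 91))  # 14.0
```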
Chapter 4 – Percentiles & Quartiles
Percentiles:
• Percentiles are data that have been divided into 100 groups.
– For example, you score in the 83rd percentile on a standardized test. That means that 83% of the test-takers scored below you.
• Deciles are data that have been divided into 10 groups (i.e. 10th, 20th, 30th, etc. percentiles).
• Quintiles are data that have been divided into 5 groups (i.e. 20th, 40th, 60th, 80th, 100th percentiles).
• Quartiles are data that have been divided into 4 groups (i.e. 25th, 50th, 75th, 100th percentiles).
Chapter 4 – Percentiles & Quartiles
Percentiles:
• Percentiles are used to establish benchmarks for comparison purposes…
• (e.g., health care, manufacturing and banking industries use 5, 25, 50, 75 and 90 percentiles).
– Percentiles are used in employee merit evaluation and salary benchmarking.
Chapter 4 – Percentiles & Quartiles
Quartiles:
• Quartiles are scale points that divide the sorted data into four groups of approximately equal size.
• The three values that separate the four groups are called Q1, Q2, and Q3, respectively.
– Quartiles (25, 50, and 75 percent) are commonly used to assess financial performance and stock portfolios.
Lower 25% | Q1 | Second 25% | Q2 | Third 25% | Q3 | Upper 25%
Chapter 4 – Percentiles & Quartiles
Quartiles:
• The second quartile Q2 is the median, an important indicator of central tendency.
Lower 50% | Q2 | Upper 50%
• Q1 and Q3 measure dispersion since the interquartile range Q3 – Q1 measures the degree of spread in the middle 50 percent of data values.
Lower 25% | Q1 | Middle 50% | Q3 | Upper 25%
Chapter 4 – Percentiles & Quartiles
Method of Medians:
• For small data sets, find quartiles using the method of medians:
Step 1. Sort the observations.
Step 2. Find the median Q2.
Step 3. Find the median of the data values that lie below Q2. This is Q1.
Step 4. Find the median of the data values that lie above Q2. This is Q3.
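The four steps above can be sketched in a few lines of Python. This is one reading of the method, under the assumption that "below" and "above" Q2 refer to positions in the sorted list, so the middle observation is left out of both halves when n is odd. The function name and the sample values are hypothetical.

```python
import statistics


def quartiles_by_medians(data):
    """Return (Q1, Q2, Q3) using the method of medians."""
    xs = sorted(data)                  # Step 1: sort the observations
    n = len(xs)
    q2 = statistics.median(xs)         # Step 2: the median is Q2
    lower = xs[: n // 2]               # observations positioned below the median
    upper = xs[(n + 1) // 2:]          # observations positioned above the median
    q1 = statistics.median(lower)      # Step 3: median of the lower half
    q3 = statistics.median(upper)      # Step 4: median of the upper half
    return q1, q2, q3


sample = [3, 7, 8, 5, 12, 14, 21, 13, 18]  # hypothetical data, n = 9
print(quartiles_by_medians(sample))
```

For this sample the sorted values are 3, 5, 7, 8, 12, 13, 14, 18, 21, so Q2 = 12, Q1 is the median of the first four values (6), and Q3 is the median of the last four (16).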
Clickers
Recall the following P/E ratios for 68 stocks in a portfolio. First find Q1, Q2 and Q3.
We can use quartiles to define benchmarks for stocks that are low-priced (bottom quartile or Q1) or high-priced (top quartile or Q3). What is the P/E ratio benchmark for high-priced stocks in this portfolio?
A = 14 B = 19
C = 26 D = 36
7 8 8 10 10 10 10 12 13 13 13 13 13 13 13 14 14
14 15 15 15 15 15 16 16 16 17 18 18 18 18 19 19 19
19 19 20 20 20 21 21 21 22 22 23 23 23 24 25 26 26
26 26 27 29 29 30 31 34 36 37 40 41 45 48 55 68 91
Chapter 4 – Percentiles & Quartiles
Example: P/E Ratios and Quartiles:
• Recall from the previous question:
Lower 25% of P/E Ratios | Q1 = 14 | Second 25% of P/E Ratios | Q2 = 19 | Third 25% of P/E Ratios | Q3 = 26 | Upper 25% of P/E Ratios
• These quartiles express central tendency (M = Q2) and dispersion (the interquartile range IQR).
• Because of clustering of identical data values, these quartiles do not provide clean cut points between groups of observations.
Chapter 4 – Percentiles & Quartiles
Excel Quartiles:
• Use Excel function =QUARTILE(Array, k) to return the kth quartile.
• Excel treats quartiles as a special case of percentiles. For example, to calculate Q3…
– We can use either =QUARTILE(Array, 3) or =PERCENTILE(Array, 0.75)
• Excel calculates the quartile positions as:
Position of Q1 = 0.25n + 0.75
Position of Q2 = 0.50n + 0.50
Position of Q3 = 0.75n + 0.25
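To see how these positions turn into quartile values, the sketch below computes the position and then interpolates linearly between the two neighboring sorted observations. The interpolation step is an assumption about how fractional positions are resolved; the slide gives only the position formulas. The function name `quartile_by_position` is made up for this example.

```python
import math


def quartile_by_position(data, k):
    """k-th quartile (k = 1, 2, 3) from the position (k/4)*n + (1 - k/4)."""
    xs = sorted(data)
    n = len(xs)
    p = k / 4
    position = p * n + (1 - p)         # 1-based position, may be fractional
    lower = math.floor(position)       # sorted observation just below the position
    fraction = position - lower
    if lower >= n:                     # only reachable for a single observation
        return xs[-1]
    # Linear interpolation between the two neighboring sorted values
    return xs[lower - 1] + fraction * (xs[lower] - xs[lower - 1])


sample = [1, 2, 4, 4, 8, 8, 8, 8]
print([quartile_by_position(sample, k) for k in (1, 2, 3)])
```

For this eight-value sample the Q1 position is 0.25(8) + 0.75 = 2.75, which falls between the 2nd value (2) and the 3rd value (4) and gives 3.5. The method of medians gives Q1 = 3 for the same data, so the two approaches need not agree.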
Chapter 4 – Percentiles & Quartiles
Caution:
• Quartiles generally resist outliers.
• However, quartiles do not provide clean cut points in the sorted data, especially in small samples with repeating data values.
Data set A: 1, 2, 4, 4, 8, 8, 8, 8 → Q1 = 3, Q2 = 6, Q3 = 8
Data set B: 0, 3, 3, 6, 6, 6, 10, 15 → Q1 = 3, Q2 = 6, Q3 = 8
• Although they have identical quartiles, these two data sets are not similar. The quartiles do not represent either data set well.
Chapter 4 – Percentiles & Quartiles
Central Tendency & Dispersion Using Quartiles:
Some robust measures of central tendency using quartiles are:
• Median (M = Q2) – we’ve already discussed.
• Midhinge – The mean of the 1st and 3rd quartiles: $\text{Midhinge} = \frac{Q_1 + Q_3}{2}$
Both are robust measures of central tendency since they ignore extreme values (outliers).
Chapter 4 – Percentiles & Quartiles
Central Tendency & Dispersion Using Quartiles:
Some robust measures of dispersion using quartiles are:
• Midspread (Interquartile Range, IQR) – A robust measure of dispersion: $\text{Midspread} = Q_3 - Q_1$
• Coefficient of Quartile Variation (CQV) – Measures relative dispersion, expresses the midspread as a percent of the midhinge: $CQV = 100 \times \frac{Q_3 - Q_1}{Q_3 + Q_1}$
– Similar to the CV, CQV can be used to compare data sets measured in different units or with different means.
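The three quartile-based measures can be written as small helper functions. The sketch below uses illustrative quartiles Q1 = 3 and Q3 = 8; the function names are assumptions for this example, not taken from the lecture or from any library.

```python
def midhinge(q1, q3):
    """Mean of the first and third quartiles."""
    return (q1 + q3) / 2


def midspread(q1, q3):
    """Interquartile range, Q3 - Q1."""
    return q3 - q1


def cqv(q1, q3):
    """Coefficient of quartile variation, 100 * (Q3 - Q1) / (Q3 + Q1)."""
    return 100 * (q3 - q1) / (q3 + q1)


q1, q3 = 3, 8  # illustrative quartiles
print(f"midhinge  = {midhinge(q1, q3)}")
print(f"midspread = {midspread(q1, q3)}")
print(f"CQV       = {cqv(q1, q3):.1f}%")
```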
Clickers
Recall from the data set of 68 P/E ratios:
Min = 7, Q1 = 14, Q2 = 19, Q3 = 26, Max = 91
What is the Midspread (Interquartile Range)?
A) 12
B) 19
C) 77
D) 84
Chapter 4 – Boxplots
Boxplots – A useful tool of exploratory data analysis (EDA).
• Also called a box-and-whisker plot.
• Based on a five-number summary:
Xmin, Q1, Q2, Q3, Xmax
Example: Consider the five-number summary for the 68 P/E ratios…
Xmin = 7, Q1 = 14, Q2 = 19, Q3 = 26, Xmax = 91
Chapter 4 – Boxplots
• The Boxplot for the P/E ratio data is …
[Figure: boxplot of the P/E ratios, labeled with Minimum, Q1, Median (Q2), Q3, Maximum, Box, and Whiskers; the distribution is right-skewed.]
Chapter 4 – Boxplots
Fences and Unusual Data Values – Use quartiles to detect unusual data points.
• The cutoff points are called fences and can be found using the following formulas:
              Inner fences         Outer fences
Lower fence   Q1 – 1.5 (Q3–Q1)     Q1 – 3.0 (Q3–Q1)
Upper fence   Q3 + 1.5 (Q3–Q1)     Q3 + 3.0 (Q3–Q1)
• Values outside the inner fences are unusual while those outside the outer fences are outliers.
Chapter 4 – Boxplots
Fences and Unusual Data Values:
• Truncate the whisker at the fences and display unusual values and outliers as dots.
Example: Boxplot of P/E ratios with fences…
[Figure: boxplot of the P/E ratios with the inner and outer fences marked; unusual values and outliers are shown as dots.]
Based on these fences, there are three unusual P/E values and two outliers.
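The count of unusual values and outliers can be checked directly from the fence formulas. The sketch below assumes the quartiles quoted in the lecture (Q1 = 14, Q3 = 26) and the 68 P/E ratios listed earlier; the helper `outside` is a made-up name for this example.

```python
pe_ratios = [
    7, 8, 8, 10, 10, 10, 10, 12, 13, 13, 13, 13,
    13, 13, 13, 14, 14, 14, 15, 15, 15, 15, 15, 16,
    16, 16, 17, 18, 18, 18, 18, 19, 19, 19, 19, 19,
    20, 20, 20, 21, 21, 21, 22, 22, 23, 23, 23, 24,
    25, 26, 26, 26, 26, 27, 29, 29, 30, 31, 34, 36,
    37, 40, 41, 45, 48, 55, 68, 91,
]

q1, q3 = 14, 26                 # quartiles quoted in the lecture
iqr = q3 - q1                   # midspread

inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)   # (lower, upper) inner fences
outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)   # (lower, upper) outer fences


def outside(x, fence):
    """True if x lies below the lower fence or above the upper fence."""
    low, high = fence
    return x < low or x > high


outliers = [x for x in pe_ratios if outside(x, outer)]
unusual = [x for x in pe_ratios if outside(x, inner) and not outside(x, outer)]

print("inner fences:", inner)
print("outer fences:", outer)
print("unusual:", unusual)
print("outliers:", outliers)
```

Both lower fences are negative here, so only the upper fences (44 and 62) matter for this data set.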
Chapter 4 – Standardized Data
Example: Unusual Observations in the P/E Data
• The P/E ratio data contains several large data values. Are they unusual or outliers? Compare the boxplot to standardized data analysis…
Standardized Data: [table not preserved in the transcript]