bcor 1020 business statistics lecture 5 – january 31, 2008

BCOR 1020 Business Statistics

Lecture 5 – January 31, 2008

Overview

• Chapter 4 – Descriptive Statistics…
– Standardized Data
– Percentiles and Quartiles
– Boxplots

Chapter 4 – Standardized Data

Chebyshev’s Theorem – Developed by mathematicians Jules Bienaymé (1796-1878) and Pafnuty Chebyshev (1821-1894).

• For any population with mean μ and standard deviation σ, the percentage of observations that lie within k standard deviations of the mean must be at least 100[1 – 1/k²].
– For k = 2 standard deviations, 100[1 – 1/2²] = 75% (so at least 75.0% will lie within μ ± 2σ).
– For k = 3 standard deviations, 100[1 – 1/3²] = 88.9% (so at least 88.9% will lie within μ ± 3σ).

• Although applicable to any data set, these limits tend to be too wide to be useful.
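The bound can be checked numerically. The following is a minimal sketch in Python; the function name `chebyshev_min_percent` is hypothetical and chosen only for illustration.

```python
def chebyshev_min_percent(k: float) -> float:
    """Minimum percentage of observations within k standard deviations of the mean."""
    if k <= 1:
        # For k <= 1 the bound is zero or negative, so it says nothing useful.
        raise ValueError("k must be greater than 1")
    return 100 * (1 - 1 / k**2)


for k in (2, 3):
    print(f"k = {k}: at least {chebyshev_min_percent(k):.1f}%")
# k = 2: at least 75.0%
# k = 3: at least 88.9%
```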

Clickers

Using Chebyshev’s Theorem, determine the minimum percentage of observations that lie within 4 standard deviations of the mean.

100[1 – 1/k²]

A = 75.0%

B = 88.9%

C = 93.8%

D = 96.0%

The Empirical Rule:
• The normal or Gaussian distribution was named for Karl Gauss (1777-1855).
• The normal distribution is symmetric and is also known as the bell-shaped curve.
• The Empirical Rule states that given data from a normal distribution, we expect that for…
k = 1: About 68.26% will lie within μ ± 1σ
k = 2: About 95.44% will lie within μ ± 2σ
k = 3: About 99.73% will lie within μ ± 3σ
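These percentages can be reproduced from the normal distribution itself. The sketch below uses the standard identity P(|Z| ≤ k) = erf(k/√2); the function name `normal_percent_within` is hypothetical. The exact values differ from the slide's 68.26% and 95.44% in the last digit, most likely because the slide's figures come from a rounded normal table.

```python
import math


def normal_percent_within(k: float) -> float:
    """Percentage of a normal distribution within k standard deviations of the mean."""
    return 100 * math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"k = {k}: about {normal_percent_within(k):.2f}%")
# k = 1: about 68.27%
# k = 2: about 95.45%
# k = 3: about 99.73%
```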

The Empirical Rule:
• Distance from the mean is measured in terms of the number of standard deviations.
• Unusual Observations:

Unusual observations are those that lie beyond μ ± 2σ.

Outliers are observations that lie beyond μ ± 3σ.

Note: no upper bound is given. Data values outside μ ± 3σ are rare.

Clickers

Suppose 80 students take an exam. Assuming exam scores follow a normal distribution, approximately how many students would you expect to have scores within 2 standard deviations of the mean?

A = 55

B = 76

C = 79

D = 80

Defining a Standardized Variable:
• A standardized variable (Z) redefines each observation in terms of the number of standard deviations from the mean.

Standardization formula for a population: zi = (xi – μ) / σ

Standardization formula for a sample: zi = (xi – x̄) / s

• zi tells how far away the observation is from the mean (in terms of σ).

• A negative z value means the observation is below the mean.

• Positive z means the observation is above the mean.
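As an illustration of the sample formula, here is a minimal Python sketch. The function name `standardize_sample` and the five exam scores are hypothetical, not taken from the lecture.

```python
from statistics import mean, stdev


def standardize_sample(values):
    """Return z = (x - x_bar) / s for each observation, using the sample standard deviation."""
    x_bar = mean(values)
    s = stdev(values)  # divides by n - 1
    return [(x - x_bar) / s for x in values]


scores = [60, 70, 80, 90, 100]  # hypothetical exam scores
z = standardize_sample(scores)
print([round(v, 2) for v in z])
# [-1.26, -0.63, 0.0, 0.63, 1.26]
```

The score of 80 equals the mean, so its z value is 0; scores below the mean get negative z values and scores above it get positive ones.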

Defining a Standardized Variable:
• MegaStat calculates standardized values and also checks for outliers.
• In Excel, use =STANDARDIZE(x, Mean, StDev) to calculate the standardized z value of a single observation x.

Chapter 4 – Standardized Data

Example: Unusual Observations in the P/E Data
• The P/E ratio data contains several large data values. Are they unusual or outliers?

[Table: standardized values]

Raw Data:

7 8 8 10 10 10 10 12 13 13 13 13

13 13 13 14 14 14 15 15 15 15 15 16

16 16 17 18 18 18 18 19 19 19 19 19

20 20 20 21 21 21 22 22 23 23 23 24

25 26 26 26 26 27 29 29 30 31 34 36

37 40 41 45 48 55 68 91
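One way to answer the question is to standardize the 68 raw values and apply the ±2 and ±3 cutoffs from the Empirical Rule. The sketch below assumes the sample formula (x̄ and s); the variable names are illustrative.

```python
from statistics import mean, stdev

pe_ratios = [
    7, 8, 8, 10, 10, 10, 10, 12, 13, 13, 13, 13,
    13, 13, 13, 14, 14, 14, 15, 15, 15, 15, 15, 16,
    16, 16, 17, 18, 18, 18, 18, 19, 19, 19, 19, 19,
    20, 20, 20, 21, 21, 21, 22, 22, 23, 23, 23, 24,
    25, 26, 26, 26, 26, 27, 29, 29, 30, 31, 34, 36,
    37, 40, 41, 45, 48, 55, 68, 91,
]

x_bar = mean(pe_ratios)
s = stdev(pe_ratios)  # sample standard deviation (n - 1 divisor)

# Unusual: more than 2 but at most 3 standard deviations from the mean.
unusual = [x for x in pe_ratios if 2 < abs((x - x_bar) / s) <= 3]
# Outliers: more than 3 standard deviations from the mean.
outliers = [x for x in pe_ratios if abs((x - x_bar) / s) > 3]

print(f"mean = {x_bar:.2f}, s = {s:.2f}")  # mean = 22.72, s = 14.08
print("unusual:", unusual)                 # unusual: [55]
print("outliers:", outliers)               # outliers: [68, 91]
```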

Outliers: What do we do with outliers in a data set?
• If due to erroneous data, then discard.
• An outrageous observation (one completely outside of an expected range) is certainly invalid.

• Recognize unusual data points and outliers and their potential impact on your study.

• Research books and articles on how to handle outliers.

Estimating Sigma:
• It is common to use the sample standard deviation (S) as an estimate of σ.
• We can also use the empirical rule to define a simple (quick-and-dirty) estimate:
– For a normal distribution, the range of 99.73% of the values is 6σ (from μ – 3σ to μ + 3σ).
– If you know the range R (high – low), you can estimate the standard deviation as σ = R/6.
– Useful for approximating the standard deviation when only R is known.
– This estimate depends on the assumption of normality.
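The R/6 estimate is a one-line calculation. A minimal sketch follows; the function name `estimate_sigma_from_range` and the low and high values are hypothetical.

```python
def estimate_sigma_from_range(low: float, high: float) -> float:
    """Quick estimate of sigma as R/6, assuming the data are roughly normal."""
    return (high - low) / 6


# Hypothetical example: observations range from 40 to 100, so R = 60.
print(estimate_sigma_from_range(40, 100))  # 10.0
```

For strongly skewed data the normality assumption fails and this estimate can be far from the sample standard deviation.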

Chapter 4 – Percentiles & Quartiles

Percentiles:
• Percentiles are data that have been divided into 100 groups.
– For example, you score in the 83rd percentile on a standardized test. That means that 83% of the test-takers scored below you.

• Deciles are data that have been divided into 10 groups (i.e. 10th, 20th, 30th, etc. percentiles).

• Quintiles are data that have been divided into 5 groups (i.e. 20th, 40th, 60th, 80th, 100th percentiles).

• Quartiles are data that have been divided into 4 groups (i.e. 25th, 50th, 75th, 100th percentiles).

Percentiles:
• Percentiles are used to establish benchmarks for comparison purposes (e.g., health care, manufacturing and banking industries use the 5th, 25th, 50th, 75th and 90th percentiles).
– Percentiles are used in employee merit evaluation and salary benchmarking.

Quartiles:
• Quartiles are scale points that divide the sorted data into four groups of approximately equal size.
• The three values that separate the four groups are called Q1, Q2, and Q3, respectively.
– Quartiles (25, 50, and 75 percent) are commonly used to assess financial performance and stock portfolios.

Lower 25% | Q1 | Second 25% | Q2 | Third 25% | Q3 | Upper 25%

Lower 50% | Upper 50%

• The second quartile Q2 is the median, an important indicator of central tendency.

• Q1 and Q3 measure dispersion since the interquartile range Q3 – Q1 measures the degree of spread in the middle 50 percent of data values.

Quartiles:

Lower 25% | Middle 50% | Upper 25%

Method of Medians:
• For small data sets, find quartiles using the method of medians:

Step 1. Sort the observations.

Step 2. Find the median Q2.

Step 3. Find the median of the data values that lie below Q2. This is Q1.

Step 4. Find the median of the data values that lie above Q2. This is Q3.
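The four steps above can be sketched in Python as follows. The function name `quartiles_by_medians` and the sample data are hypothetical. The sketch splits the sorted data by position: for an even n it uses the lower and upper halves, and for an odd n it leaves the middle value out of both halves. That is one reasonable reading of "values that lie below/above Q2", not the only one.

```python
from statistics import median


def quartiles_by_medians(values):
    """Return (Q1, Q2, Q3) using the method of medians."""
    data = sorted(values)                 # Step 1: sort the observations
    n = len(data)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower = data[:half]                   # positions below the median
    upper = data[half:] if n % 2 == 0 else data[half + 1:]
    q2 = median(data)                     # Step 2: median of all data
    q1 = median(lower)                    # Step 3: median of the lower group
    q3 = median(upper)                    # Step 4: median of the upper group
    return q1, q2, q3


sample = [12, 7, 3, 9, 15, 20, 5]         # hypothetical small data set
print(quartiles_by_medians(sample))       # (5, 9, 15)
```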

Clickers

Recall the following P/E ratios for 68 stocks in a portfolio. First find Q1, Q2 and Q3.

We can use quartiles to define benchmarks for stocks that are low-priced (bottom Quartile or Q1) or high-priced (top quartile or Q3). What is the P/E ratio benchmark for high-priced stocks in this portfolio?

A = 14 B = 19

C = 26 D = 36

7 8 8 10 10 10 10 12 13 13 13 13 13 13 13 14 14

14 15 15 15 15 15 16 16 16 17 18 18 18 18 19 19 19

19 19 20 20 20 21 21 21 22 22 23 23 23 24 25 26 26

26 26 27 29 29 30 31 34 36 37 40 41 45 48 55 68 91

Example: P/E Ratios and Quartiles:
• Recall from the previous question:

Lower 25% of P/E Ratios | Q1 = 14 | Second 25% of P/E Ratios | Q2 = 19 | Third 25% of P/E Ratios | Q3 = 26 | Upper 25% of P/E Ratios

• These quartiles express central tendency (M = Q2) and dispersion (the interquartile range IQR).

• Because of clustering of identical data values, these quartiles do not provide clean cut points between groups of observations.

Excel Quartiles:
• Use Excel function =QUARTILE(Array, k) to return the kth quartile.
• Excel treats quartiles as a special case of percentiles. For example, to calculate Q3…
– We can use either =QUARTILE(Array, 3) or =PERCENTILE(Array, 0.75)

• Excel calculates the quartile positions as:

Position of Q1 = 0.25n + 0.75

Position of Q2 = 0.50n + 0.50

Position of Q3 = 0.75n + 0.25
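The position formulas can be sketched in Python. The function name `excel_style_quartile` and the sample data are hypothetical, and the sketch assumes that a fractional position is resolved by linear interpolation between the two neighboring sorted values.

```python
def excel_style_quartile(values, k):
    """Quartile k (1, 2, or 3) at 1-based position p*n + (1 - p), where p = k/4."""
    if k not in (1, 2, 3):
        raise ValueError("k must be 1, 2, or 3")
    data = sorted(values)
    n = len(data)
    p = k / 4
    position = p * n + (1 - p)        # e.g. 0.25n + 0.75 for Q1
    lower_index = int(position)       # whole part of the position
    fraction = position - lower_index
    lower_value = data[lower_index - 1]
    if fraction == 0:
        return lower_value
    upper_value = data[lower_index]
    # Assumed: interpolate linearly between the two neighboring values.
    return lower_value + fraction * (upper_value - lower_value)


sample = [3, 5, 7, 9, 12, 15, 20]     # hypothetical sorted data, n = 7
print([excel_style_quartile(sample, k) for k in (1, 2, 3)])
# [6.0, 9, 13.5]
```

Here Q1 sits at position 0.25(7) + 0.75 = 2.5, halfway between the 2nd and 3rd values (5 and 7). Results from position-based formulas can differ from those of the method of medians on the same data.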

Chapter 4 – Percentiles & Quartiles

Caution:
• Quartiles generally resist outliers.

• However, quartiles do not provide clean cut points in the sorted data, especially in small samples with repeating data values.

Data set A: 1, 2, 4, 4, 8, 8, 8, 8 Q1 = 3, Q2 = 6, Q3 = 8

Data set B: 0, 3, 3, 6, 6, 6, 10, 15 Q1 = 3, Q2 = 6, Q3 = 8

• Although they have identical quartiles, these two data sets are not similar. The quartiles do not represent either data set well.
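A quick comparison of other summary measures shows how different the two data sets are despite their shared quartiles. This is a minimal sketch using only the values listed above.

```python
from statistics import mean

data_a = [1, 2, 4, 4, 8, 8, 8, 8]
data_b = [0, 3, 3, 6, 6, 6, 10, 15]

for name, data in (("A", data_a), ("B", data_b)):
    spread = max(data) - min(data)
    print(f"Data set {name}: min = {min(data)}, max = {max(data)}, "
          f"range = {spread}, mean = {mean(data):.3f}")
# Data set A: min = 1, max = 8, range = 7, mean = 5.375
# Data set B: min = 0, max = 15, range = 15, mean = 6.125
```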

Central Tendency & Dispersion Using Quartiles:
Some robust measures of central tendency using quartiles are:
• Median (M = Q2) – we’ve already discussed.

• Midhinge – The mean of the 1st and 3rd quartiles:

Midhinge = (Q1 + Q3) / 2

Both are robust measures of central tendency since they ignore extreme values (outliers).

Central Tendency & Dispersion Using Quartiles:
Some robust measures of dispersion using quartiles are:
• Midspread (Interquartile Range, IQR) – A robust measure of dispersion:

Midspread = Q3 – Q1

• Coefficient of Quartile Variation (CQV) – Measures relative dispersion, expresses the midspread as a percent of the midhinge:

CQV = 100 × (Q3 – Q1) / (Q3 + Q1)

– Similar to the CV, CQV can be used to compare data sets measured in different units or with different means.
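The three quartile-based measures can be computed directly from Q1 and Q3. The sketch below assumes the common definition CQV = 100 × (Q3 – Q1) / (Q3 + Q1); the function names are hypothetical. It uses the quartiles Q1 = 3 and Q3 = 8 reported earlier for Data set A.

```python
def midhinge(q1, q3):
    """Mean of the first and third quartiles."""
    return (q1 + q3) / 2


def midspread(q1, q3):
    """Interquartile range, Q3 - Q1."""
    return q3 - q1


def cqv(q1, q3):
    """Coefficient of quartile variation, in percent (assumed definition)."""
    return 100 * (q3 - q1) / (q3 + q1)


q1, q3 = 3, 8  # quartiles of Data set A
print(midhinge(q1, q3))          # 5.5
print(midspread(q1, q3))         # 5
print(f"{cqv(q1, q3):.1f}%")     # 45.5%
```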

Clickers

Recall from the data set of 68 P/E ratios:
Min = 7, Q1 = 14, Q2 = 19, Q3 = 26, Max = 91

What is the Midspread (Interquartile Range)?

Chapter 4 – Boxplots

Boxplots – A useful tool of exploratory data analysis (EDA).
• Also called a box-and-whisker plot.
• Based on a five-number summary:

Xmin, Q1, Q2, Q3, Xmax

Example: Consider the five-number summary for the 68 P/E ratios…

Xmin = 7, Q1 = 14, Q2 = 19, Q3 = 26, Xmax = 91

• The Boxplot for the P/E ratio data is …

[Figure: boxplot of the P/E ratios, labeling the minimum, Q1, median (Q2), Q3, maximum, the box, and the whiskers. The distribution is right-skewed.]

Fences and Unusual Data Values – Use quartiles to detect unusual data points.
• The cutoff points are called fences and can be found using the following formulas:

              Inner fences         Outer fences
Lower fence   Q1 – 1.5 (Q3–Q1)     Q1 – 3.0 (Q3–Q1)
Upper fence   Q3 + 1.5 (Q3–Q1)     Q3 + 3.0 (Q3–Q1)

• Values outside the inner fences are unusual while those outside the outer fences are outliers.

Fences and Unusual Data Values:
• Truncate the whiskers at the fences and display unusual values and outliers as dots.

Example: Boxplot of P/E ratios with fences…

Based on these fences, there are three unusual P/E values and two outliers.

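[Figure: boxplot of the P/E ratios with the inner and outer fences marked. Points beyond the inner fence are labeled unusual and points beyond the outer fence are labeled outliers.]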
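The fence calculation for the P/E data can be sketched in Python, taking Q1 = 14 and Q3 = 26 from the five-number summary. The variable and function names are illustrative.

```python
pe_ratios = [
    7, 8, 8, 10, 10, 10, 10, 12, 13, 13, 13, 13, 13, 13, 13, 14, 14,
    14, 15, 15, 15, 15, 15, 16, 16, 16, 17, 18, 18, 18, 18, 19, 19, 19,
    19, 19, 20, 20, 20, 21, 21, 21, 22, 22, 23, 23, 23, 24, 25, 26, 26,
    26, 26, 27, 29, 29, 30, 31, 34, 36, 37, 40, 41, 45, 48, 55, 68, 91,
]

q1, q3 = 14, 26            # quartiles from the five-number summary
iqr = q3 - q1              # midspread = 12

inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)   # (lower, upper) inner fences
outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)   # (lower, upper) outer fences


def is_outside(x, fences):
    """True if x lies below the lower fence or above the upper fence."""
    return x < fences[0] or x > fences[1]


outliers = [x for x in pe_ratios if is_outside(x, outer)]
unusual = [x for x in pe_ratios
           if is_outside(x, inner) and not is_outside(x, outer)]

print("inner fences:", inner)   # inner fences: (-4.0, 44.0)
print("outer fences:", outer)   # outer fences: (-22.0, 62.0)
print("unusual:", unusual)      # unusual: [45, 48, 55]
print("outliers:", outliers)    # outliers: [68, 91]
```

This matches the count stated above: three unusual values and two outliers. The lower fences are negative, so no P/E ratio can fall below them.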

Chapter 4 – Standardized Data

Example: Unusual Observations in the P/E Data
• The P/E ratio data contains several large data values. Are they unusual or outliers? Compare the boxplot to standardized data analysis…

[Table: standardized values]
