chapter 1 describing data: graphical and numerical

Chapter 1 Describing Data: Graphical and Numerical PROBABILITY (6MTCOAE205)

Upload: candy

Post on 15-Feb-2016



TRANSCRIPT

Page 1: Chapter 1 Describing Data: Graphical and Numerical

Chapter 1

Describing Data: Graphical and Numerical

PROBABILITY (6MTCOAE205)

Page 2: Chapter 1 Describing Data: Graphical and Numerical

Dealing with Uncertainty

Everyday decisions are based on incomplete information

Consider:
• Will the job market be strong when I graduate?
• Will the price of Yahoo stock be higher in six months than it is now?
• Will interest rates remain low for the rest of the year if the federal budget deficit is as high as predicted?

Assist. Prof. Dr. İmran Göker Ch. 1-2

Page 3: Chapter 1 Describing Data: Graphical and Numerical

Dealing with Uncertainty

Numbers and data are used to assist decision making

Statistics is a tool to help process, summarize, analyze, and interpret data

Assist. Prof. Dr. İmran Göker Ch. 1-3

(continued)

Page 4: Chapter 1 Describing Data: Graphical and Numerical

Key Definitions

A population is the collection of all items of interest or under investigation
• N represents the population size

A sample is an observed subset of the population
• n represents the sample size

A parameter is a specific characteristic of a population

A statistic is a specific characteristic of a sample

Assist. Prof. Dr. İmran Göker Ch. 1-4

Page 5: Chapter 1 Describing Data: Graphical and Numerical

Population vs. Sample

[Figure: a population of items labeled a through z, and a sample drawn from it: b, c, g, i, n, o, r, u, y]

Values calculated using population data are called parameters

Values computed from sample data are called statistics

Page 6: Chapter 1 Describing Data: Graphical and Numerical

Examples of Populations

• Names of all registered voters in the Turkish Republic
• Incomes of all families living in Ankara
• Osteoporosis incidence in Turkish women older than 45 years
• Grade point averages of all the students in our university

Assist. Prof. Dr. İmran Göker Ch. 1-6

Page 7: Chapter 1 Describing Data: Graphical and Numerical

Random Sampling

Simple random sampling is a procedure in which
• each member of the population is chosen strictly by chance,
• each member of the population is equally likely to be chosen,
• every possible sample of n objects is equally likely to be chosen

The resulting sample is called a random sample
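A minimal sketch of drawing a simple random sample, assuming a hypothetical population of 26 labeled items and a sample size of 9 (both chosen only for illustration). Python's `random.sample` draws without replacement, so every subset of size n is equally likely to be selected.

```python
import random

# Hypothetical population: 26 labeled items (an illustrative assumption).
population = list("abcdefghijklmnopqrstuvwxyz")
n = 9  # sample size

# A fixed seed is used only to make this sketch reproducible.
rng = random.Random(42)

# Draw n distinct members; each possible subset of size n is equally likely.
sample = rng.sample(population, n)

print("N =", len(population), "n =", len(sample))
print("sample:", sorted(sample))
```

Any value computed from `sample` is a statistic; the same value computed from `population` would be a parameter.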

Assist. Prof. Dr. İmran Göker Ch. 1-7

Page 8: Chapter 1 Describing Data: Graphical and Numerical

Descriptive and Inferential Statistics

Two branches of statistics:

Descriptive statistics
• Graphical and numerical procedures to summarize and process data

Inferential statistics
• Using data to make predictions, forecasts, and estimates to assist decision making

Assist. Prof. Dr. İmran Göker Ch. 1-8

Page 9: Chapter 1 Describing Data: Graphical and Numerical

Descriptive Statistics

Collect data
• e.g., Survey

Present data
• e.g., Tables and graphs

Summarize data
• e.g., Sample mean = $\frac{\sum X_i}{n}$

Page 10: Chapter 1 Describing Data: Graphical and Numerical

Inferential Statistics

Assist. Prof. Dr. İmran Göker Ch. 1-10

Estimation
• e.g., Estimate the population mean weight using the sample mean weight

Hypothesis testing
• e.g., Test the claim that the population mean weight is 140 pounds

Inference is the process of drawing conclusions or making decisions about a population based on sample results

Page 11: Chapter 1 Describing Data: Graphical and Numerical

Types of Data

Data
• Categorical (defined categories or groups). Examples: Marital Status, Are you registered to vote?, Eye Color
• Numerical, discrete (counted items). Examples: Number of Children, Defects per hour
• Numerical, continuous (measured characteristics). Examples: Weight, Voltage

Assist. Prof. Dr. İmran Göker Ch. 1-11

Page 12: Chapter 1 Describing Data: Graphical and Numerical

Measurement Levels

Quantitative Data
• Ratio Data: differences between measurements, true zero exists
• Interval Data: differences between measurements but no true zero

Qualitative Data
• Ordinal Data: ordered categories (rankings, order, or scaling)
• Nominal Data: categories (no ordering or direction)

Assist. Prof. Dr. İmran Göker Ch. 1-12

Page 13: Chapter 1 Describing Data: Graphical and Numerical

Graphical Presentation of Data

Data in raw form are usually not easy to use for decision making

Some type of organization is needed
• Table
• Graph

The type of graph to use depends on the variable being summarized

Assist. Prof. Dr. İmran Göker Ch. 1-13

Page 14: Chapter 1 Describing Data: Graphical and Numerical

Graphical Presentation of Data

Techniques reviewed in this chapter:

Categorical Variables
• Frequency distribution
• Bar chart
• Pie chart
• Pareto diagram

Numerical Variables
• Line chart
• Frequency distribution
• Histogram and ogive
• Stem-and-leaf display
• Scatter plot

(continued)

Assist. Prof. Dr. İmran Göker Ch. 1-14

Page 15: Chapter 1 Describing Data: Graphical and Numerical

Tables and Graphs for Categorical Variables

Categorical Data

Tabulating Data
• Frequency Distribution Table

Graphing Data
• Bar Chart
• Pie Chart
• Pareto Diagram

Assist. Prof. Dr. İmran Göker Ch. 1-15

Page 16: Chapter 1 Describing Data: Graphical and Numerical

The Frequency Distribution Table

Summarize data by category

Example: Hospital Patients by Unit

Hospital Unit      Number of Patients
Cardiac Care       1,052
Emergency          2,245
Intensive Care     340
Maternity          552
Surgery            4,630

(Variables are categorical)

Assist. Prof. Dr. İmran Göker Ch. 1-16

Page 17: Chapter 1 Describing Data: Graphical and Numerical

Bar and Pie Charts

Bar charts and Pie charts are often used for qualitative (category) data

Height of bar or size of pie slice shows the frequency or percentage for each category

Assist. Prof. Dr. İmran Göker Ch. 1-17

Page 18: Chapter 1 Describing Data: Graphical and Numerical

Bar Chart Example

[Bar chart: Hospital Patients by Unit; horizontal axis: hospital unit (Cardiac Care, Emergency, Intensive Care, Maternity, Surgery); vertical axis: number of patients per year (0 to 5000)]

Hospital Unit      Number of Patients
Cardiac Care       1,052
Emergency          2,245
Intensive Care     340
Maternity          552
Surgery            4,630

Assist. Prof. Dr. İmran Göker Ch. 1-18

Page 19: Chapter 1 Describing Data: Graphical and Numerical

Pie Chart Example

[Pie chart: Hospital Patients by Unit; slices: Surgery 53%, Emergency 25%, Cardiac Care 12%, Maternity 6%, Intensive Care 4%]

(Percentages are rounded to the nearest percent)

Hospital Unit      Number of Patients    % of Total
Cardiac Care       1,052                 11.93
Emergency          2,245                 25.46
Intensive Care     340                   3.86
Maternity          552                   6.26
Surgery            4,630                 52.50
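The percentage column is each category's count divided by the grand total. A short sketch of that arithmetic, using the hospital counts from the table; the variable names are illustrative only.

```python
# Patients per hospital unit, taken from the frequency table.
patients = {
    "Cardiac Care": 1052,
    "Emergency": 2245,
    "Intensive Care": 340,
    "Maternity": 552,
    "Surgery": 4630,
}

total = sum(patients.values())

# Share of the total for each unit, as a percentage rounded to two decimals.
percent = {unit: round(100 * count / total, 2) for unit, count in patients.items()}

for unit, count in patients.items():
    print(f"{unit:<15} {count:>6,} {percent[unit]:>7.2f}")
print(f"{'Total':<15} {total:>6,}")
```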

Assist. Prof. Dr. İmran Göker Ch. 1-19

Page 20: Chapter 1 Describing Data: Graphical and Numerical

Pareto Diagram

• Used to portray categorical data
• A bar chart, where categories are shown in descending order of frequency
• A cumulative polygon is often shown in the same graph
• Used to separate the “vital few” from the “trivial many”

Assist. Prof. Dr. İmran Göker Ch. 1-20

Page 21: Chapter 1 Describing Data: Graphical and Numerical

Pareto Diagram Example

Example: 400 defective items are examined for cause of defect:

Source of Manufacturing Error    Number of defects
Bad Weld                         34
Poor Alignment                   223
Missing Part                     25
Paint Flaw                       78
Electrical Short                 19
Cracked case                     21
Total                            400

Assist. Prof. Dr. İmran Göker Ch. 1-21

Page 22: Chapter 1 Describing Data: Graphical and Numerical

Pareto Diagram Example

Step 1: Sort the defect causes in descending order of frequency
Step 2: Determine % in each category

Source of Manufacturing Error    Number of defects    % of Total Defects
Poor Alignment                   223                  55.75
Paint Flaw                       78                   19.50
Bad Weld                         34                   8.50
Missing Part                     25                   6.25
Cracked case                     21                   5.25
Electrical Short                 19                   4.75
Total                            400                  100%
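The two steps, plus the cumulative percentages that a Pareto diagram draws as a line, can be sketched as follows. The counts come from the defect table; the variable names are illustrative.

```python
from itertools import accumulate

# Defect counts by cause, in the order they were recorded.
defects = {
    "Bad Weld": 34,
    "Poor Alignment": 223,
    "Missing Part": 25,
    "Paint Flaw": 78,
    "Electrical Short": 19,
    "Cracked case": 21,
}

# Step 1: sort causes by frequency, largest first.
ordered = sorted(defects.items(), key=lambda item: item[1], reverse=True)

# Step 2: percentage of all defects in each category.
total = sum(defects.values())
percents = [100 * count / total for _, count in ordered]

# Running total of the percentages (the cumulative line of the diagram).
cumulative = list(accumulate(percents))

for (cause, count), pct, cum in zip(ordered, percents, cumulative):
    print(f"{cause:<17} {count:>4} {pct:>6.2f} {cum:>7.2f}")
```

With these counts the first two causes already account for just over 75% of all defects, which is the "vital few" pattern the diagram is meant to expose.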

(continued)

Assist. Prof. Dr. İmran Göker Ch. 1-22

Page 23: Chapter 1 Describing Data: Graphical and Numerical

Pareto Diagram Example (continued)

Step 3: Show results graphically

[Figure: Pareto Diagram: Cause of Manufacturing Defect. Bars show the % of defects in each category (left axis, 0% to 60%); a line graph shows the cumulative % (right axis, 0% to 100%). Categories, left to right: Poor Alignment, Paint Flaw, Bad Weld, Missing Part, Cracked case, Electrical Short]

Assist. Prof. Dr. İmran Göker Ch. 1-23

Page 24: Chapter 1 Describing Data: Graphical and Numerical

Graphs for Time-Series Data

A line chart (time-series plot) is used to show the values of a variable over time

Time is measured on the horizontal axis

The variable of interest is measured on the vertical axis

Assist. Prof. Dr. İmran Göker Ch. 1-24

Page 25: Chapter 1 Describing Data: Graphical and Numerical

Line Chart Example

[Line chart: Magazine Subscriptions by Year; horizontal axis: year (1990 to 2006); vertical axis: thousands of subscribers (0 to 350)]

Assist. Prof. Dr. İmran Göker Ch. 1-25

Page 26: Chapter 1 Describing Data: Graphical and Numerical

Graphs to Describe Numerical Variables

Numerical Data
• Frequency Distributions and Cumulative Distributions: Histogram, Ogive
• Stem-and-Leaf Display

Assist. Prof. Dr. İmran Göker Ch. 1-26

Page 27: Chapter 1 Describing Data: Graphical and Numerical

Frequency Distributions

What is a Frequency Distribution?

A frequency distribution is a list or a table
• containing class groupings (categories or ranges within which the data fall)
• and the corresponding frequencies with which data fall within each class or category

Assist. Prof. Dr. İmran Göker Ch. 1-27

Page 28: Chapter 1 Describing Data: Graphical and Numerical

Why Use Frequency Distributions?

A frequency distribution is a way to summarize data

The distribution condenses the raw data into a more useful form...

and allows for a quick visual interpretation of the data

Assist. Prof. Dr. İmran Göker Ch. 1-28

Page 29: Chapter 1 Describing Data: Graphical and Numerical

Class Intervals and Class Boundaries

• Each class grouping has the same width
• Determine the width of each interval by

$w = \text{interval width} = \frac{\text{largest number} - \text{smallest number}}{\text{number of desired intervals}}$

• Use at least 5 but no more than 15-20 intervals
• Intervals never overlap
• Round up the interval width to get desirable interval endpoints

Assist. Prof. Dr. İmran Göker Ch. 1-29

Page 30: Chapter 1 Describing Data: Graphical and Numerical

Frequency Distribution Example

Example: A manufacturer of insulation randomly selects 20 winter days and records the daily high temperature

24, 35, 17, 21, 24, 37, 26, 46, 58, 30, 32, 13, 12, 38, 41, 43, 44, 27, 53, 27

Assist. Prof. Dr. İmran Göker Ch. 1-30

Page 31: Chapter 1 Describing Data: Graphical and Numerical

Frequency Distribution Example

• Sort raw data in ascending order: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
• Find range: 58 - 12 = 46
• Select number of classes: 5 (usually between 5 and 15)
• Compute interval width: 10 (46/5 = 9.2, then round up)
• Determine interval boundaries: 10 but less than 20, 20 but less than 30, . . . , 50 but less than 60
• Count observations and assign to classes
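The counting step can be sketched in a few lines. This assumes the class layout described in the example (five classes of width 10, the first starting at 10); the variable names are illustrative.

```python
# Daily high temperatures for the 20 sampled winter days.
temps = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
         32, 13, 12, 38, 41, 43, 44, 27, 53, 27]

lower_bound = 10   # start of the first class
width = 10         # interval width (range 46 / 5 classes, rounded up)
num_classes = 5

# Count the observations in each class "a but less than a + width".
counts = [0] * num_classes
for t in temps:
    counts[(t - lower_bound) // width] += 1

n = len(temps)
for k, count in enumerate(counts):
    start = lower_bound + k * width
    print(f"{start} but less than {start + width}: "
          f"{count}  rel. freq. {count / n:.2f}  {100 * count / n:.0f}%")
```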

(continued)

Assist. Prof. Dr. İmran Göker Ch. 1-31

Page 32: Chapter 1 Describing Data: Graphical and Numerical

Frequency Distribution Example

Data in ordered array: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

Interval               Frequency    Relative Frequency    Percentage
10 but less than 20    3            .15                   15
20 but less than 30    6            .30                   30
30 but less than 40    5            .25                   25
40 but less than 50    4            .20                   20
50 but less than 60    2            .10                   10
Total                  20           1.00                  100

(continued)

Assist. Prof. Dr. İmran Göker Ch. 1-32

Page 33: Chapter 1 Describing Data: Graphical and Numerical

Histogram

A graph of the data in a frequency distribution is called a histogram

The interval endpoints are shown on the horizontal axis

the vertical axis is either frequency, relative frequency, or percentage

Bars of the appropriate heights are used to represent the number of observations within each class

Assist. Prof. Dr. İmran Göker Ch. 1-33

Page 34: Chapter 1 Describing Data: Graphical and Numerical

Histogram Example

[Histogram: Daily High Temperature; horizontal axis: temperature in degrees (0 to 70); vertical axis: frequency (0 to 7); bar heights 3, 6, 5, 4, 2 for the five classes from 10 to 60; no gaps between bars]

Interval               Frequency
10 but less than 20    3
20 but less than 30    6
30 but less than 40    5
40 but less than 50    4
50 but less than 60    2

Assist. Prof. Dr. İmran Göker Ch. 1-34

Page 35: Chapter 1 Describing Data: Graphical and Numerical

Histograms in Excel

1. Select Data Tab

2. Click on Data Analysis

Page 36: Chapter 1 Describing Data: Graphical and Numerical

Histograms in Excel (continued)

3. Choose Histogram

4. Input data range and bin range (bin range is a cell range containing the upper interval endpoints for each class grouping)

Select Chart Output and click “OK”

Assist. Prof. Dr. İmran Göker Ch. 1-36

Page 37: Chapter 1 Describing Data: Graphical and Numerical

Questions for Grouping Data into Intervals

1. How wide should each interval be? (How many classes should be used?)

2. How should the endpoints of the intervals be determined?

Often answered by trial and error, subject to user judgment

The goal is to create a distribution that is neither too "jagged" nor too "blocky"

Goal is to appropriately show the pattern of variation in the data

Assist. Prof. Dr. İmran Göker Ch. 1-37

Page 38: Chapter 1 Describing Data: Graphical and Numerical

How Many Class Intervals?

Many (narrow class intervals)
• may yield a very jagged distribution with gaps from empty classes
• can give a poor indication of how frequency varies across classes

Few (wide class intervals)
• may compress variation too much and yield a blocky distribution
• can obscure important patterns of variation

[Figures: two histograms of Temperature against Frequency, one with narrow classes (upper endpoints 4, 8, 12, ..., 60, More) and one with wide classes (upper endpoints 0, 30, 60, More). X axis labels are upper class endpoints]

Assist. Prof. Dr. İmran Göker Ch. 1-38

Page 39: Chapter 1 Describing Data: Graphical and Numerical

The Cumulative Frequency Distribution

Data in ordered array: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

Class                  Frequency    Percentage    Cumulative Frequency    Cumulative Percentage
10 but less than 20    3            15            3                       15
20 but less than 30    6            30            9                       45
30 but less than 40    5            25            14                      70
40 but less than 50    4            20            18                      90
50 but less than 60    2            10            20                      100
Total                  20           100
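The cumulative columns are running totals of the class frequencies. A brief sketch using the five class frequencies from the temperature example:

```python
from itertools import accumulate

# Class frequencies for 10-20, 20-30, 30-40, 40-50, 50-60.
frequencies = [3, 6, 5, 4, 2]

# Running total of the frequencies.
cum_freq = list(accumulate(frequencies))

# Express each running total as a percentage of all observations.
n = cum_freq[-1]
cum_pct = [100 * c / n for c in cum_freq]

print(cum_freq)
print(cum_pct)
```

Plotting `cum_pct` against the upper endpoint of each class gives the ogive.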

Assist. Prof. Dr. İmran Göker Ch. 1-39

Page 40: Chapter 1 Describing Data: Graphical and Numerical

The Ogive: Graphing Cumulative Frequencies

[Ogive: Daily High Temperature; horizontal axis: interval endpoints (10 to 60); vertical axis: cumulative percentage (0 to 100)]

Interval               Upper interval endpoint    Cumulative Percentage
Less than 10           10                         0
10 but less than 20    20                         15
20 but less than 30    30                         45
30 but less than 40    40                         70
40 but less than 50    50                         90
50 but less than 60    60                         100

Assist. Prof. Dr. İmran Göker Ch. 1-40

Page 41: Chapter 1 Describing Data: Graphical and Numerical

Stem-and-Leaf Diagram

A simple way to see distribution details in a data set

METHOD: Separate the sorted data series into leading digits (the stem) and the trailing digits (the leaves)

Assist. Prof. Dr. İmran Göker Ch. 1-41

Page 42: Chapter 1 Describing Data: Graphical and Numerical

Example

Here, use the 10’s digit for the stem unit:

Data in ordered array: 21, 24, 24, 26, 27, 27, 30, 32, 38, 41

                  Stem    Leaf
21 is shown as    2       1
38 is shown as    3       8

Assist. Prof. Dr. İmran Göker Ch. 1-42

Page 43: Chapter 1 Describing Data: Graphical and Numerical

Example

Data in ordered array: 21, 24, 24, 26, 27, 27, 30, 32, 38, 41

Completed stem-and-leaf diagram:

Stem    Leaves
2       1 4 4 6 7 7
3       0 2 8
4       1
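A small sketch of how the display can be built for two-digit data, splitting each value into its tens digit (stem) and units digit (leaf). The variable names are illustrative.

```python
from collections import defaultdict

data = [21, 24, 24, 26, 27, 27, 30, 32, 38, 41]

# Group the units digits (leaves) under their tens digit (stem).
stems = defaultdict(list)
for value in sorted(data):
    stem, leaf = divmod(value, 10)
    stems[stem].append(leaf)

print("Stem  Leaves")
for stem in sorted(stems):
    print(f"{stem:<5} {' '.join(str(leaf) for leaf in stems[stem])}")
```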

Assist. Prof. Dr. İmran Göker Ch. 1-43

Page 44: Chapter 1 Describing Data: Graphical and Numerical

Using other stem units

Using the 100’s digit as the stem:
• Round off the 10’s digit to form the leaves

                      Stem    Leaf
613 would become      6       1
776 would become      7       8
. . .
1224 becomes          12      2

Assist. Prof. Dr. İmran Göker Ch. 1-44

Page 45: Chapter 1 Describing Data: Graphical and Numerical

Using other stem units

Using the 100’s digit as the stem:

Data: 613, 632, 658, 717, 722, 750, 776, 827, 841, 859, 863, 891, 894, 906, 928, 933, 955, 982, 1034, 1047, 1056, 1140, 1169, 1224

The completed stem-and-leaf display:

Stem    Leaves
6       1 3 6
7       2 2 5 8
8       3 4 6 6 9 9
9       1 3 3 6 8
10      3 5 6
11      4 7
12      2

Assist. Prof. Dr. İmran Göker Ch. 1-45

Page 46: Chapter 1 Describing Data: Graphical and Numerical

Relationships Between Variables

Graphs illustrated so far have involved only a single variable

When two variables exist other techniques are used:

• Categorical (Qualitative) Variables: Cross tables
• Numerical (Quantitative) Variables: Scatter plots

Assist. Prof. Dr. İmran Göker Ch. 1-46

Page 47: Chapter 1 Describing Data: Graphical and Numerical

Scatter Diagrams

Scatter Diagrams are used for paired observations taken from two numerical variables

The Scatter Diagram:
• one variable is measured on the vertical axis and the other variable is measured on the horizontal axis

Assist. Prof. Dr. İmran Göker Ch. 1-47

Page 48: Chapter 1 Describing Data: Graphical and Numerical

Scatter Diagram Example

[Scatter diagram: Cost per Day vs. Production Volume; horizontal axis: volume per day (0 to 70); vertical axis: cost per day (0 to 250)]

Volume per day    Cost per day
23                125
26                140
29                146
33                160
38                167
42                170
50                188
55                195
60                200

Assist. Prof. Dr. İmran Göker Ch. 1-48

Page 49: Chapter 1 Describing Data: Graphical and Numerical

Scatter Diagrams in Excel

1. Select the Insert tab

2. Select Scatter type from the Charts section

3. When prompted, enter the data range, desired legend, and desired destination to complete the scatter diagram

Assist. Prof. Dr. İmran Göker Ch. 1-49

Page 50: Chapter 1 Describing Data: Graphical and Numerical

Cross Tables

Cross Tables (or contingency tables) list the number of observations for every combination of values for two categorical or ordinal variables

If there are r categories for the first variable (rows) and c categories for the second variable (columns), the table is called an r x c cross table

Assist. Prof. Dr. İmran Göker Ch. 1-50

Page 51: Chapter 1 Describing Data: Graphical and Numerical

Cross Table Example

4 x 3 Cross Table for Investment Choices by Investor (values in $1000’s)

Investment Category    Investor A    Investor B    Investor C    Total
Stocks                 46.5          55            27.5          129
Bonds                  32.0          44            19.0          95
CD                     15.5          20            13.5          49
Savings                16.0          28            7.0           51
Total                  110.0         147           67.0          324
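The margins of a cross table are row and column sums. A sketch that recomputes them from the twelve cell values of the investment example (values in $1000's); the dictionary layout is just one convenient representation.

```python
# Rows: investment category; columns: Investor A, Investor B, Investor C.
table = {
    "Stocks":  [46.5, 55, 27.5],
    "Bonds":   [32.0, 44, 19.0],
    "CD":      [15.5, 20, 13.5],
    "Savings": [16.0, 28, 7.0],
}

# Row totals: amount held in each investment category.
row_totals = {category: sum(values) for category, values in table.items()}

# Column totals: amount invested by each investor.
col_totals = [sum(column) for column in zip(*table.values())]

grand_total = sum(row_totals.values())

print(row_totals)
print(col_totals, grand_total)
```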

Assist. Prof. Dr. İmran Göker Ch. 1-51

Page 52: Chapter 1 Describing Data: Graphical and Numerical

Graphing Multivariate Categorical Data

Side-by-side bar charts

[Side-by-side bar chart: Comparing Investors; categories: Stocks, Bonds, CD, Savings; one bar each for Investor A, Investor B, Investor C; value axis 0 to 60]

Assist. Prof. Dr. İmran Göker Ch. 1-52

Page 53: Chapter 1 Describing Data: Graphical and Numerical

Side-by-Side Chart Example

Sales by quarter for three sales territories:

         1st Qtr    2nd Qtr    3rd Qtr    4th Qtr
East     20.4       27.4       59         20.4
West     30.6       38.6       34.6       31.6
North    45.9       46.9       45         43.9

[Side-by-side bar chart: one bar per territory (East, West, North) in each quarter; value axis 0 to 60]

Assist. Prof. Dr. İmran Göker Ch. 1-53

Page 54: Chapter 1 Describing Data: Graphical and Numerical

Data Presentation Errors

Goals for effective data presentation:

• Present data to display essential information
• Communicate complex ideas clearly and accurately
• Avoid distortion that might convey the wrong message

Assist. Prof. Dr. İmran Göker Ch. 1-54

Page 55: Chapter 1 Describing Data: Graphical and Numerical

Data Presentation Errors

• Unequal histogram interval widths
• Compressing or distorting the vertical axis
• Providing no zero point on the vertical axis
• Failing to provide a relative basis in comparing data between groups

(continued)

Assist. Prof. Dr. İmran Göker Ch. 1-55

Page 56: Chapter 1 Describing Data: Graphical and Numerical

Describing Data Numerically

Central Tendency
• Arithmetic Mean
• Median
• Mode

Variation
• Range
• Interquartile Range
• Variance
• Standard Deviation
• Coefficient of Variation

Page 57: Chapter 1 Describing Data: Graphical and Numerical

Measures of Central Tendency

Overview

Central Tendency
• Mean: arithmetic average, $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
• Median: midpoint of ranked values
• Mode: most frequently observed value

Page 58: Chapter 1 Describing Data: Graphical and Numerical

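Arithmetic Mean

The arithmetic mean (mean) is the most common measure of central tendency

For a population of N values:

$\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \cdots + x_N}{N}$

where the $x_i$ are the population values and N is the population size

For a sample of size n:

$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$

where the $x_i$ are the observed values and n is the sample size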

Page 59: Chapter 1 Describing Data: Graphical and Numerical

Arithmetic Mean

(continued)

• The most common measure of central tendency
• Mean = sum of values divided by the number of values
• Affected by extreme values (outliers)

[Figure: two dot plots on a 0 to 10 axis]

Data 1, 2, 3, 4, 5: Mean = 3

$\frac{1 + 2 + 3 + 4 + 5}{5} = \frac{15}{5} = 3$

Data 1, 2, 3, 4, 10: Mean = 4

$\frac{1 + 2 + 3 + 4 + 10}{5} = \frac{20}{5} = 4$

Page 60: Chapter 1 Describing Data: Graphical and Numerical

Median

In an ordered list, the median is the “middle” number (50% above, 50% below)

Not affected by extreme values

[Figure: two dot plots on a 0 to 10 axis, each with Median = 3]

Page 61: Chapter 1 Describing Data: Graphical and Numerical

Finding the Median

The location of the median:

$\text{Median position} = \frac{n+1}{2} \text{ position in the ordered data}$

• If the number of values is odd, the median is the middle number
• If the number of values is even, the median is the average of the two middle numbers

Note that $\frac{n+1}{2}$ is not the value of the median, only the position of the median in the ranked data

Page 62: Chapter 1 Describing Data: Graphical and Numerical

Mode

• A measure of central tendency
• Value that occurs most often
• Not affected by extreme values
• Used for either numerical or categorical data
• There may be no mode
• There may be several modes

[Figure: dot plot on a 0 to 14 axis with Mode = 9; dot plot on a 0 to 6 axis with No Mode]

Page 63: Chapter 1 Describing Data: Graphical and Numerical

Review Example

Assist. Prof. Dr. İmran Göker

Five houses on a hill by the beach

House Prices:
$2,000,000
$500,000
$300,000
$100,000
$100,000

Ch. 1-63

Page 64: Chapter 1 Describing Data: Graphical and Numerical

Review Example:Summary Statistics

Assist. Prof. Dr. İmran Göker

Mean: ($3,000,000/5) = $600,000

Median: middle value of ranked data = $300,000

Mode: most frequent value = $100,000

House Prices:

$2,000,000 500,000 300,000 100,000 100,000

Sum 3,000,000
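The three summary statistics for the house prices can be checked with Python's standard `statistics` module. This is only a verification sketch; the variable names are illustrative.

```python
import statistics

# House prices from the review example, in dollars.
prices = [2_000_000, 500_000, 300_000, 100_000, 100_000]

mean = statistics.mean(prices)      # sum of values / number of values
median = statistics.median(prices)  # middle value of the ranked data
mode = statistics.mode(prices)      # most frequent value

print(f"Mean:   ${mean:,.0f}")
print(f"Median: ${median:,.0f}")
print(f"Mode:   ${mode:,.0f}")
```

The single $2,000,000 house pulls the mean to twice the median, which is why the median is often preferred when outliers are present.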

Ch. 1-64

Page 65: Chapter 1 Describing Data: Graphical and Numerical

Which measure of location is the “best”?

Assist. Prof. Dr. İmran Göker

• Mean is generally used, unless extreme values (outliers) exist
• Then median is often used, since the median is not sensitive to extreme values
• Example: Median home prices may be reported for a region, since they are less sensitive to outliers

Ch. 1-65

Page 66: Chapter 1 Describing Data: Graphical and Numerical

Shape of a Distribution

• Describes how data are distributed
• Measures of shape: symmetric or skewed

Left-Skewed: Mean < Median
Symmetric: Mean = Median
Right-Skewed: Median < Mean

Ch. 1-66

Page 67: Chapter 1 Describing Data: Graphical and Numerical

Assist. Prof. Dr. İmran Göker

Geometric Mean

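Geometric mean
• Used to measure the rate of change of a variable over time

$\bar{x}_g = \sqrt[n]{x_1 \times x_2 \times \cdots \times x_n} = (x_1 \times x_2 \times \cdots \times x_n)^{1/n}$

Geometric mean rate of return
• Measures the status of an investment over time

$\bar{r}_g = (x_1 \times x_2 \times \cdots \times x_n)^{1/n} - 1$

where $x_i$ is the rate of return in time period i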

Ch. 1-67

Page 68: Chapter 1 Describing Data: Graphical and Numerical

Assist. Prof. Dr. İmran Göker

Example

An investment of $100,000 rose to $150,000 at the end of year one and increased to $180,000 at end of year two:

$X_1 = \$100,000 \qquad X_2 = \$150,000 \qquad X_3 = \$180,000$

From $X_1$ to $X_2$: 50% increase. From $X_2$ to $X_3$: 20% increase.

What is the mean percentage return over time?

Ch. 1-68

Page 69: Chapter 1 Describing Data: Graphical and Numerical

Assist. Prof. Dr. İmran Göker

Example

Use the 1-year returns to compute the arithmetic mean and the geometric mean:

Arithmetic mean rate of return:

$\bar{X} = \frac{(50\%) + (20\%)}{2} = 35\%$ (misleading result)

Geometric mean rate of return:

$\bar{r}_g = (x_1 \times x_2)^{1/n} - 1 = [(50) \times (20)]^{1/2} - 1 = (1000)^{1/2} - 1 = 31.623 - 1 = 30.623\%$ (more accurate result)
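As a cross-check on this example, one can compound growth factors (1 + rate) rather than the raw percentage figures. That choice is an assumption of this sketch, not the slide's own computation. Under it, growing from $100,000 to $180,000 in two periods corresponds to roughly 34.16% per period, which differs from the 30.623% obtained when the formula is applied to the numbers 50 and 20 directly.

```python
# Values of the investment at the start and after each year.
start_value = 100_000
end_value = 180_000
periods = 2

# Assumption of this sketch: returns are combined as growth factors (1 + rate).
growth_factors = [1.50, 1.20]  # +50% in year one, +20% in year two

product = 1.0
for factor in growth_factors:
    product *= factor

# Geometric mean of the growth factors, minus 1, gives the per-period rate.
geometric_rate = product ** (1 / periods) - 1

# Arithmetic mean of the two yearly rates, for comparison.
arithmetic_rate = sum(factor - 1 for factor in growth_factors) / periods

# Compounding the geometric rate should reproduce the final value.
implied_end = start_value * (1 + geometric_rate) ** periods

print(f"arithmetic mean rate: {arithmetic_rate:.2%}")
print(f"geometric mean rate:  {geometric_rate:.2%}")
print(f"implied end value:    {implied_end:,.0f}")
```

The check on `implied_end` is the useful part: a mean rate of return is only consistent with the example if compounding it over two years leads back to $180,000.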

(continued)

Ch. 1-69

Page 70: Chapter 1 Describing Data: Graphical and Numerical

Measures of Variability

Assist. Prof. Dr. İmran Göker

Measures of variation give information on the spread or variability of the data values.

Variation
• Range
• Interquartile Range
• Variance
• Standard Deviation
• Coefficient of Variation

[Figure: two distributions with the same center, different variation]

Ch. 1-70

Page 71: Chapter 1 Describing Data: Graphical and Numerical

Range

• Simplest measure of variation
• Difference between the largest and the smallest observations:

Range = Xlargest – Xsmallest

Example:

[Figure: dot plot on a 0 to 14 axis]

Range = 14 - 1 = 13

Ch. 1-71

Page 72: Chapter 1 Describing Data: Graphical and Numerical

Disadvantages of the Range

• Ignores the way in which data are distributed

[Figure: two dot plots on a 7 to 12 axis with different distributions; in both, Range = 12 - 7 = 5]

• Sensitive to outliers

1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4

1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119

Ch. 1-72

Page 73: Chapter 1 Describing Data: Graphical and Numerical

Interquartile Range

Can eliminate some outlier problems by using the interquartile range

Eliminate high- and low-valued observations and calculate the range of the middle 50% of the data

Interquartile range = 3rd quartile – 1st quartile IQR = Q3 – Q1

Assist. Prof. Dr. İmran Göker Ch. 1-73

Page 74: Chapter 1 Describing Data: Graphical and Numerical

Interquartile Range

Assist. Prof. Dr. İmran Göker

Example (box-and-whisker plot):

X minimum    Q1    Median (Q2)    Q3    X maximum
12           30    45             57    70

Each of the four segments between these values contains 25% of the data

Interquartile range = 57 – 30 = 27

Page 75: Chapter 1 Describing Data: Graphical and Numerical

Quartiles

Quartiles split the ranked data into 4 segments with an equal number of values per segment (25% each, separated by Q1, Q2, and Q3)

• The first quartile, Q1, is the value for which 25% of the observations are smaller and 75% are larger
• Q2 is the same as the median (50% are smaller, 50% are larger)
• Only 25% of the observations are greater than the third quartile, Q3

Ch. 1-75

Page 76: Chapter 1 Describing Data: Graphical and Numerical

Quartile Formulas

Assist. Prof. Dr. İmran Göker

Find a quartile by determining the value in the appropriate position in the ranked data, where

First quartile position: Q1 = 0.25(n+1)

Second quartile position: Q2 = 0.50(n+1) (the median position)

Third quartile position: Q3 = 0.75(n+1)

where n is the number of observed values

Ch. 1-76

Page 77: Chapter 1 Describing Data: Graphical and Numerical

Quartiles

Assist. Prof. Dr. İmran Göker

Example: Find the first quartile

Sample Ranked Data: 11 12 13 16 16 17 18 21 22

(n = 9)

Q1 is in the 0.25(9+1) = 2.5 position of the ranked data,

so use the value halfway between the 2nd and 3rd values,

so Q1 = 12.5
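The position rule can be sketched as a small helper that interpolates between neighbouring ranked values when the position is fractional. The helper name `value_at_position` is hypothetical. As a cross-check, the standard library's `statistics.quantiles` with `method="exclusive"` is used here on the understanding that it follows the same (n+1)-based positions.

```python
import statistics

def value_at_position(ranked, position):
    """Value at a 1-based, possibly fractional, position in ranked data."""
    n = len(ranked)
    lower = int(position)        # whole part of the position
    fraction = position - lower  # how far to move toward the next value
    if lower < 1:
        return ranked[0]
    if lower >= n:
        return ranked[-1]
    return ranked[lower - 1] + fraction * (ranked[lower] - ranked[lower - 1])

ranked = sorted([11, 12, 13, 16, 16, 17, 18, 21, 22])
n = len(ranked)

q1 = value_at_position(ranked, 0.25 * (n + 1))  # position 2.5
q2 = value_at_position(ranked, 0.50 * (n + 1))  # position 5 (the median)
q3 = value_at_position(ranked, 0.75 * (n + 1))  # position 7.5
iqr = q3 - q1

# Cross-check against the standard library (assumed to use the same rule).
library_quartiles = statistics.quantiles(ranked, n=4, method="exclusive")

print(q1, q2, q3, iqr)
print(library_quartiles)
```

Other software may use different quartile conventions, so small differences from these values are possible elsewhere.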

Ch. 1-77

Page 78: Chapter 1 Describing Data: Graphical and Numerical

Population Variance

Average of squared deviations of values from the mean

Population variance:

$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$

Where
μ = population mean
N = population size
xi = ith value of the variable x

Ch. 1-78

Page 79: Chapter 1 Describing Data: Graphical and Numerical

Sample Variance

Average (approximately) of squared deviations of values from the mean

Sample variance:

$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$

Where
$\bar{x}$ = arithmetic mean
n = sample size
xi = ith value of the variable x

Ch. 1-79

Page 80: Chapter 1 Describing Data: Graphical and Numerical

Population Standard Deviation

• Most commonly used measure of variation
• Shows variation about the mean
• Has the same units as the original data

Population standard deviation:

$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$

Ch. 1-80

Page 81: Chapter 1 Describing Data: Graphical and Numerical

Sample Standard Deviation

• Most commonly used measure of variation
• Shows variation about the mean
• Has the same units as the original data

Sample standard deviation:

$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$

Ch. 1-81

Page 82: Chapter 1 Describing Data: Graphical and Numerical

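Calculation Example: Sample Standard Deviation

Sample Data (xi): 10 12 14 15 17 18 18 24

n = 8, Mean = $\bar{x}$ = 16

$s = \sqrt{\frac{(10 - \bar{x})^2 + (12 - \bar{x})^2 + (14 - \bar{x})^2 + \cdots + (24 - \bar{x})^2}{n - 1}}$

$= \sqrt{\frac{(10 - 16)^2 + (12 - 16)^2 + (14 - 16)^2 + \cdots + (24 - 16)^2}{8 - 1}}$

$= \sqrt{\frac{130}{7}} = 4.3095$

A measure of the “average” scatter around the mean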
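The calculation can be verified step by step. For these eight values the squared deviations from the mean of 16 are 36, 16, 4, 1, 1, 4, 4, and 64, which sum to 130. The sketch below recomputes s by hand and compares it with `statistics.stdev`, which also divides by n - 1.

```python
import math
import statistics

data = [10, 12, 14, 15, 17, 18, 18, 24]

n = len(data)
mean = sum(data) / n

# Sum of squared deviations from the sample mean.
sum_sq = sum((x - mean) ** 2 for x in data)

# Sample variance divides by n - 1; the standard deviation is its square root.
variance = sum_sq / (n - 1)
s = math.sqrt(variance)

print(f"mean = {mean}, sum of squares = {sum_sq}")
print(f"s^2 = {variance:.4f}, s = {s:.4f}")
print(f"statistics.stdev = {statistics.stdev(data):.4f}")
```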

Ch. 1-82

Page 83: Chapter 1 Describing Data: Graphical and Numerical

Measuring variation

Assist. Prof. Dr. İmran Göker

Small standard deviation

Large standard deviation

Ch. 1-83

Page 84: Chapter 1 Describing Data: Graphical and Numerical

Comparing Standard Deviations

Assist. Prof. Dr. İmran Göker

[Figure: three dot plots on an 11 to 21 axis, all with the same mean]

Data A: Mean = 15.5, s = 3.338
Data B: Mean = 15.5, s = 0.926
Data C: Mean = 15.5, s = 4.570

Ch. 1-84

Page 85: Chapter 1 Describing Data: Graphical and Numerical

Advantages of Variance and Standard Deviation

Each value in the data set is used in the calculation

Values far from the mean are given extra weight (because deviations from the mean are squared)

Assist. Prof. Dr. İmran Göker Ch. 1-85

Page 86: Chapter 1 Describing Data: Graphical and Numerical

Coefficient of Variation

• Measures relative variation
• Always in percentage (%)
• Shows variation relative to mean
• Can be used to compare two or more sets of data measured in different units

$CV = \left(\frac{s}{\bar{x}}\right) \cdot 100\%$

Ch. 1-86

Page 87: Chapter 1 Describing Data: Graphical and Numerical

Comparing Coefficient of Variation

Stock A:
• Average price last year = $50
• Standard deviation = $5

$CV_A = \left(\frac{s}{\bar{x}}\right) \cdot 100\% = \frac{\$5}{\$50} \cdot 100\% = 10\%$

Stock B:
• Average price last year = $100
• Standard deviation = $5

$CV_B = \left(\frac{s}{\bar{x}}\right) \cdot 100\% = \frac{\$5}{\$100} \cdot 100\% = 5\%$

Both stocks have the same standard deviation, but stock B is less variable relative to its price
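The comparison can be written as a small function. The name `coefficient_of_variation` is illustrative; the inputs are the two stocks' standard deviations and average prices from the example.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation as a percentage of the mean."""
    return 100 * std_dev / mean

cv_a = coefficient_of_variation(5, 50)    # Stock A: s = $5, mean = $50
cv_b = coefficient_of_variation(5, 100)   # Stock B: s = $5, mean = $100

print(f"CV_A = {cv_a:.0f}%")
print(f"CV_B = {cv_b:.0f}%")
```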

Ch. 1-87

Page 88: Chapter 1 Describing Data: Graphical and Numerical

Using Microsoft Excel

Descriptive Statistics can be obtained from Microsoft® Excel

Select:

data / data analysis / descriptive statistics

Enter details in dialog box

Assist. Prof. Dr. İmran Göker Ch. 1-88

Page 89: Chapter 1 Describing Data: Graphical and Numerical

Using Excel

Assist. Prof. Dr. İmran Göker

Select data / data analysis / descriptive statistics

Ch. 1-89

Page 90: Chapter 1 Describing Data: Graphical and Numerical

Using Excel

Enter input range details

Check box for summary statistics

Click OK

Assist. Prof. Dr. İmran Göker Ch. 1-90

Page 91: Chapter 1 Describing Data: Graphical and Numerical

Excel output

Assist. Prof. Dr. İmran Göker

Microsoft Excel descriptive statistics output, using the house price data:

House Prices:

$2,000,000 500,000 300,000 100,000 100,000

Ch. 1-91

Page 92: Chapter 1 Describing Data: Graphical and Numerical

Chebychev’s Theorem

For any population with mean μ and standard deviation σ, and k > 1, the percentage of observations that fall within the interval

[μ ± kσ]

is at least

$100\left[1 - \frac{1}{k^2}\right]\%$

Ch. 1-92

Page 93: Chapter 1 Describing Data: Graphical and Numerical

Chebychev’s Theorem (continued)

Regardless of how the data are distributed, at least $(1 - 1/k^2)$ of the values will fall within k standard deviations of the mean (for k > 1)

Examples:

At least $(1 - 1/1.5^2)$ = 55.6% within k = 1.5 (μ ± 1.5σ)

At least $(1 - 1/2^2)$ = 75% within k = 2 (μ ± 2σ)

At least $(1 - 1/3^2)$ = 89% within k = 3 (μ ± 3σ)
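The three examples follow directly from the formula. A minimal sketch, assuming a helper named `chebyshev_lower_bound` (the name is illustrative):

```python
def chebyshev_lower_bound(k):
    """Minimum percentage of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only stated for k > 1")
    return 100.0 * (1.0 - 1.0 / k**2)


for k in (1.5, 2, 3):
    bound = chebyshev_lower_bound(k)
    print(f"k = {k}: at least {bound:.1f}% of values lie within mu ± {k} sigma")
```

These are lower bounds that hold for any distribution, so an actual data set will often have a larger share of its values inside each interval.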

Page 94: Chapter 1 Describing Data: Graphical and Numerical

The Empirical Rule

If the data distribution is bell-shaped, then the interval:

μ ± 1σ contains about 68% of the values in the population or the sample

[Figure: bell-shaped curve centered at μ, with 68% of the area between μ - 1σ and μ + 1σ]

Page 95: Chapter 1 Describing Data: Graphical and Numerical

The Empirical Rule

μ ± 2σ contains about 95% of the values in the population or the sample

μ ± 3σ contains almost all (about 99.7%) of the values in the population or the sample

[Figure: two bell-shaped curves, one with 95% of the area between μ - 2σ and μ + 2σ, the other with 99.7% of the area between μ - 3σ and μ + 3σ]
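The 68%, 95%, and 99.7% figures can be checked against an exact normal distribution, which is one common model for "bell-shaped" data. The sketch below assumes that model and uses `statistics.NormalDist` from the Python standard library; the helper name `share_within` is illustrative.

```python
from statistics import NormalDist

# Assumption: the bell-shaped distribution is modeled as a standard normal
standard_normal = NormalDist(mu=0.0, sigma=1.0)


def share_within(k):
    """Share of a normal distribution lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"mu ± {k} sigma contains about {share_within(k) * 100:.1f}% of values")
```

Real data that are only roughly bell-shaped will match these percentages approximately, which is why the rule is stated with "about".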

Page 96: Chapter 1 Describing Data: Graphical and Numerical

Weighted Mean

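The weighted mean of a set of data is

$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_n x_n}{\sum_{i=1}^{n} w_i}$$

where $w_i$ is the weight of the $i$th observation and the denominator $\sum_{i=1}^{n} w_i$ is the sum of the weights

Use when data are already grouped into n classes, with $w_i$ values in the $i$th class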
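A short sketch of the weighted mean follows. The data are hypothetical (course grades weighted by credit hours) and are not taken from the slides; the function name `weighted_mean` is likewise an illustrative choice.

```python
def weighted_mean(values, weights):
    """Return sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("the weights must not sum to zero")
    return sum(w * x for w, x in zip(weights, values)) / total_weight


# Hypothetical example: three course grades weighted by credit hours
grades = [4.0, 3.0, 2.0]
credit_hours = [3, 4, 1]

gpa = weighted_mean(grades, credit_hours)
print(f"Weighted mean grade: {gpa:.2f}")
```

With equal weights the function reduces to the ordinary arithmetic mean.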

Page 97: Chapter 1 Describing Data: Graphical and Numerical

Approximations for Grouped Data

Suppose data are grouped into K classes, with frequencies f1, f2, . . ., fK, and the midpoints of the classes are m1, m2, . . ., mK

For a sample of n observations, the mean is

$$\bar{x} = \frac{\sum_{i=1}^{K} f_i m_i}{n}$$

where

$$n = \sum_{i=1}^{K} f_i$$

Page 98: Chapter 1 Describing Data: Graphical and Numerical

Approximations for Grouped Data

Suppose data are grouped into K classes, with frequencies f1, f2, . . ., fK, and the midpoints of the classes are m1, m2, . . ., mK

For a sample of n observations, the variance is

$$s^2 = \frac{\sum_{i=1}^{K} f_i (m_i - \bar{x})^2}{n - 1}$$
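Both grouped-data approximations can be computed together. The sketch below uses hypothetical classes (0 to 10, 10 to 20, 20 to 30) with made-up frequencies; neither the data nor the function name `grouped_mean_and_variance` comes from the slides.

```python
def grouped_mean_and_variance(midpoints, frequencies):
    """Approximate the sample mean and variance from class midpoints and frequencies."""
    if len(midpoints) != len(frequencies):
        raise ValueError("midpoints and frequencies must have the same length")
    n = sum(frequencies)  # n is the sum of the class frequencies
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(f * m for f, m in zip(frequencies, midpoints)) / n
    variance = sum(f * (m - mean) ** 2 for f, m in zip(frequencies, midpoints)) / (n - 1)
    return mean, variance


# Hypothetical classes 0-10, 10-20, 20-30, represented by their midpoints
midpoints = [5, 15, 25]
frequencies = [2, 5, 3]

grouped_mean, grouped_variance = grouped_mean_and_variance(midpoints, frequencies)
print(f"Approximate mean:     {grouped_mean:.2f}")
print(f"Approximate variance: {grouped_variance:.2f}")
```

The results are approximations because every observation in a class is treated as if it sat at the class midpoint.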

Page 99: Chapter 1 Describing Data: Graphical and Numerical

The Sample Covariance

The covariance measures the strength of the linear relationship between two variables

The population covariance:

$$\mathrm{Cov}(x, y) = \sigma_{xy} = \frac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{N}$$

The sample covariance:

$$\mathrm{Cov}(x, y) = s_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$

Only concerned with the strength of the relationship

No causal effect is implied

Page 100: Chapter 1 Describing Data: Graphical and Numerical

Interpreting Covariance

Covariance between two variables:

Cov(x,y) > 0: x and y tend to move in the same direction

Cov(x,y) < 0: x and y tend to move in opposite directions

Cov(x,y) = 0: x and y have no linear relationship (zero covariance alone does not imply that x and y are independent)
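The three sign cases can be illustrated with the sample covariance formula. The data below are hypothetical and chosen only to make each case visible; the function name `sample_covariance` is an illustrative choice. The last series is y = x², which depends on x exactly yet has zero sample covariance with it, since covariance only picks up linear association.

```python
def sample_covariance(xs, ys):
    """Sample covariance: sum((x_i - x_bar) * (y_i - y_bar)) / (n - 1)."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two paired observations")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


# Hypothetical paired data
x = [-2, -1, 0, 1, 2]
y_rising = [1, 2, 3, 4, 5]       # moves with x
y_falling = [5, 4, 3, 2, 1]      # moves against x
y_squared = [v ** 2 for v in x]  # depends on x, but not linearly

cov_rising = sample_covariance(x, y_rising)
cov_falling = sample_covariance(x, y_falling)
cov_squared = sample_covariance(x, y_squared)

print(f"Rising:  {cov_rising:+.2f}")
print(f"Falling: {cov_falling:+.2f}")
print(f"Squared: {cov_squared:+.2f}")
```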

Page 101: Chapter 1 Describing Data: Graphical and Numerical

Coefficient of Correlation

Measures the relative strength of the linear relationship between two variables

Population correlation coefficient:

$$\rho = \frac{\mathrm{Cov}(x, y)}{\sigma_X \sigma_Y}$$

Sample correlation coefficient:

$$r = \frac{\mathrm{Cov}(x, y)}{s_X s_Y}$$

Page 102: Chapter 1 Describing Data: Graphical and Numerical

Features of Correlation Coefficient, r

Unit free

Ranges between –1 and 1

The closer to –1, the stronger the negative linear relationship

The closer to 1, the stronger the positive linear relationship

The closer to 0, the weaker the linear relationship
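The sample correlation coefficient can be sketched directly from its definition, r = Cov(x, y) / (s_X s_Y). The data are hypothetical and the function name `sample_correlation` is illustrative. Rescaling one variable (for example, changing its units) leaves r unchanged, which is what "unit free" means in practice.

```python
import statistics


def sample_correlation(xs, ys):
    """Sample correlation: sample covariance divided by the product of sample std devs."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two paired observations")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    return covariance / (statistics.stdev(xs) * statistics.stdev(ys))


# Hypothetical paired data
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

r_original = sample_correlation(x, y)
r_rescaled = sample_correlation(x, [100 * v for v in y])  # same data, different units
r_perfect_negative = sample_correlation(x, [-v for v in x])

print(f"r (original units): {r_original:.3f}")
print(f"r (rescaled y):     {r_rescaled:.3f}")
print(f"r (y = -x):         {r_perfect_negative:.3f}")
```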

Page 103: Chapter 1 Describing Data: Graphical and Numerical

Scatter Plots of Data with Various Correlation Coefficients

[Figure: six scatter plots of Y against X, with panels labeled r = -1, r = -.6, r = 0, r = +.3, r = +1, and r = 0]

Page 104: Chapter 1 Describing Data: Graphical and Numerical

Using Excel to Find the Correlation Coefficient

Select Data / Data Analysis

Choose Correlation from the selection menu

Click OK . . .

Page 105: Chapter 1 Describing Data: Graphical and Numerical

Using Excel to Find the Correlation Coefficient (continued)

Input data range and select appropriate options

Click OK to get output

Page 106: Chapter 1 Describing Data: Graphical and Numerical

Interpreting the Result

r = .733

There is a relatively strong positive linear relationship between test score #1 and test score #2

Students who scored high on the first test tended to score high on the second test

[Figure: Scatter Plot of Test Scores; x-axis: Test #1 Score, y-axis: Test #2 Score, both axes running from 70 to 100]