Chapter 2 – Organization and Description of Data (jgleaton/LectNotesSta3032Ch2Fa12.pdf)


Upload: phamtruc

Post on 21-Aug-2018


Chapter 2 – Organization and Description of Data

When data are in their original form, as collected, they are called

raw data. The first task to be done with raw data is clean-up; this step is always performed. The data must be double-checked to see that they were collected accurately. Any unusual data values should be followed up to see whether they resulted from errors in data collection or from unusual members of the sample. When the data are entered into a calculator or spreadsheet, they should be double-checked to see that they were entered correctly.

After the clean-up procedure, the next task is to describe the data.

There are two kinds of methods for summarizing and describing data:

graphical techniques and numerical summaries. We will discuss

some graphical techniques first.

With non-numeric data, we often want a graph which is a variation

on the histogram, called a Pareto chart. This type of graph is useful

in quality control and process improvement studies, in which the

data often represent the different types of defects or failure modes.

A Pareto chart graphs the frequencies of occurrences of the different

types of defects, ordered from the most frequent to the least

frequent. The purpose of a Pareto chart is to focus on the main

causes or modes of failure.

Example: We have data, listed below, on the number of accidents (hull losses) between 1959 and 1999 for each of 22 different types of aircraft, as well as the number of accidents per million flights (departures).


Aircraft type     Actual no. of    Hull losses per
                  hull losses      million departures
MD-11                   5               6.54
707/720               115               6.46
DC-8                   71               5.84
F-28                   32               3.94
BAC 1-11               22               2.64
DC-10                  20               2.57
747-Early              21               1.90
A310                    4               1.40
A300-600                3               1.34
DC-9                   75               1.29
A300-Early              7               1.29
737-1 & 2              62               1.23
727                    70               0.97
A310/319/321            7               0.96
F100                    3               0.80
L1011                   4               0.77
BAe 146                 3               0.59
747-400                 1               0.49
757                     4               0.46
MD-80/90               10               0.43
767                     3               0.41
737-3, 4 & 5           12               0.39

The Pareto chart is shown below. To construct the graph using

Excel, we enter the data, with the categories listed in the first

column, and the frequencies or relative frequencies listed in the

second column. Highlight the data, and choose Insert, Chart,

Column.
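For readers working outside Excel, the ordering step that defines a Pareto chart can be sketched in a few lines of Python. This is only an illustration: the function names `pareto_order` and `text_bar_chart` are made up for this sketch, the bars are drawn as text rather than as a real chart, and only a handful of rows from the table are used.

```python
# Hull losses per million departures for a few aircraft types,
# copied from the table above (deliberately listed out of order).
rates = {
    "DC-8": 5.84,
    "767": 0.41,
    "MD-11": 6.54,
    "F-28": 3.94,
    "707/720": 6.46,
    "747-400": 0.49,
}


def pareto_order(freqs):
    """Return (category, value) pairs sorted from largest to smallest value."""
    return sorted(freqs.items(), key=lambda item: item[1], reverse=True)


def text_bar_chart(pairs, chars_per_unit=5):
    """Build one text line per category; bar length is proportional to the value."""
    lines = []
    for category, value in pairs:
        bar = "#" * round(value * chars_per_unit)
        lines.append(f"{category:>8} | {bar} {value:.2f}")
    return lines


ordered = pareto_order(rates)
for line in text_bar_chart(ordered):
    print(line)
```

The sort from most frequent to least frequent is what distinguishes a Pareto chart from an ordinary column chart of categories.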

[Figure: Pareto chart "Aircraft Accident Rates, 1959 - 1999"; horizontal axis: Type of Aircraft; vertical axis: Number of Accidents per Million Flights.]

In this case, of the 22 types of aircraft, we see that the MD-11 had the highest accident rate, followed by the Boeing 707/720 and the DC-8. The latter two are no longer in service in most of the world. The years of service of the MD-11 were 1990 – 1999.

Frequency Distributions and Histograms

For numeric data, there are a number of different graphical techniques available. The author presents several, including the dot-plot. We will not include the dot-plot, as other types of graphs, such as histograms, are equally useful.

Often, with univariate data (resulting from a single measured characteristic of a sample), there are too many different data values for a listing of the raw data to be useful in visualizing the characteristics of the data. It is common to divide the interval of values of the data into a relatively small number of subintervals, called classes, and to tabulate the data using the frequencies. Each frequency is the number of occurrences of data values within a subinterval. We sometimes want also to use relative frequencies. The relative frequency for a class is found by dividing the frequency for that class by the size of the entire data set.

Defn: A histogram is a graph that displays numeric data by using

vertical bars of various heights to represent the frequencies of

occurrence of data values within a subinterval.

Characteristics of a histogram:

1) The classes are listed in order along the horizontal axis.

2) The vertical axis provides a scale for the frequencies.

3) A bar is drawn for each class having width equal to the class

width and height equal to the class frequency.

4) The axes are labeled and the graph is titled.

Note: The number of classes, or subintervals, depends on the size of

the data set. A good rule of thumb is to choose the number of

classes to be approximately equal to the square root of the size of the

data set. For example, if n = 25, then we would use 5 classes; if n =

80, then we would use 9 classes.

Note: The class width is found by dividing the range of the data by

the number of classes and rounding up slightly, so that the largest

data value will be included in the last class.
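The two rules of thumb above can be checked with a short Python sketch. The helper names are hypothetical, and rounding the class width up to one decimal place is an assumption made here to match the phrase "rounding up slightly". The values 76 and 245 are the smallest and largest observations in the compressive strength example later in this chapter.

```python
import math


def number_of_classes(n):
    """Square-root rule of thumb: about sqrt(n) classes, to the nearest integer."""
    return round(math.sqrt(n))


def class_width(smallest, largest, classes, decimals=1):
    """Range divided by the number of classes, rounded UP to `decimals` places."""
    factor = 10 ** decimals
    return math.ceil((largest - smallest) / classes * factor) / factor


print(number_of_classes(25))    # data set of size n = 25
print(number_of_classes(80))    # data set of size n = 80
print(class_width(76, 245, 9))  # range 169 split into 9 classes
```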

The class limits are the uppermost and lowermost data values that

could be included in the class (note that there may be no actual data

values equal to the upper- or lower-class limit for any given class).

Since we may do the histogram with the calculator or with Excel, we

do the histogram first, followed by the grouped frequency

distribution.


Example: Compressive strength, in pounds per square inch (psi) of

specimens of a new aluminum-lithium alloy undergoing evaluation

for possible use in aircraft structural components. The data are

listed in the following table.

105 221 183 186 121 181 180 143 97 154 153 174 120 168

167 141 245 228 174 199 181 158 176 110 163 131 154 115

160 208 158 133 207 180 190 193 194 133 156 123 134 178

76 167 184 135 229 146 218 157 101 171 165 172 158 169

199 151 142 163 145 171 148 158 160 175 149 87 160 237

150 135 196 201 200 176 150 170 118 149

We will construct a histogram for the data using Excel. We have a data set with n = 80. We will choose to use 9 classes. The range is 245 - 76 = 169. Therefore the class width will be $169/9 \approx 18.78$, which we round up to 18.8.

The lower limit of the first class will be the smallest data value, 76 (the author sometimes chooses a different value for the lower class limit of the first class). To construct the histogram in Excel:

1) Enter the data.

2) Enter a second column giving the upper class limits for all

classes except the last class – 94.8, 113.6, 132.4, 151.2, 170.0,

188.8, 207.6, 226.4.

3) Choose Tools, Data Analysis, Histogram.

4) The input range will be a1..a80. The bin range will be b1..b8.

5) The output range will be c1.

6) The type of output will be chart output.
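The Excel steps above amount to counting how many data values fall at or below each upper class limit. A minimal Python sketch of that tally follows, assuming (as Excel's Histogram tool does) that a value equal to an upper limit is counted in that class; the function name `class_frequencies` is made up for this illustration.

```python
from bisect import bisect_left

# Compressive strength data (psi), copied from the table above.
data = [
    105, 221, 183, 186, 121, 181, 180, 143, 97, 154, 153, 174, 120, 168,
    167, 141, 245, 228, 174, 199, 181, 158, 176, 110, 163, 131, 154, 115,
    160, 208, 158, 133, 207, 180, 190, 193, 194, 133, 156, 123, 134, 178,
    76, 167, 184, 135, 229, 146, 218, 157, 101, 171, 165, 172, 158, 169,
    199, 151, 142, 163, 145, 171, 148, 158, 160, 175, 149, 87, 160, 237,
    150, 135, 196, 201, 200, 176, 150, 170, 118, 149,
]

# Upper class limits for all classes except the last, as entered in step 2.
upper_limits = [94.8, 113.6, 132.4, 151.2, 170.0, 188.8, 207.6, 226.4]


def class_frequencies(values, limits):
    """Count values per class; a value equal to an upper limit stays in that class."""
    counts = [0] * (len(limits) + 1)
    for value in values:
        counts[bisect_left(limits, value)] += 1
    return counts


frequencies = class_frequencies(data, upper_limits)
relative = [count / len(data) for count in frequencies]
for count, rel in zip(frequencies, relative):
    print(f"{count:3d}  {rel:.4f}")
```

Each relative frequency is simply the class frequency divided by n = 80, so the relative frequencies sum to 1.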

Below is the resulting histogram, followed by the grouped frequency

table, constructed using the information from the histogram (In the

table, relative frequencies are included).


Class (psi)       Frequency   Relative Frequency
 76.0 -  94.8         2       0.0250 =  2.50%
 94.9 - 113.6         4       0.0500 =  5.00%
113.7 - 132.4         6       0.0750 =  7.50%
132.5 - 151.2        16       0.2000 = 20.00%
151.3 - 170.0        20       0.2500 = 25.00%
170.1 - 188.8        16       0.2000 = 20.00%
188.9 - 207.6         9       0.1125 = 11.25%
207.7 - 226.4         3       0.0375 =  3.75%
226.5 - 245.2         4       0.0500 =  5.00%

Looking at a histogram of a data set can sometimes provide a quick

way of answering questions about data, by simply noting the

characteristics of the graph.

[Figure: histogram "Compressive Strength of New Al Alloy"; horizontal axis: Compressive Strength (p.s.i.); vertical axis: Frequency.]


Example 1: p. 18

It is immediately apparent from the graph that there are two

superimposed distributions, perhaps due to two different operating

processes.

Example 2: p. 19

It is immediately obvious from the histogram that most of the

interrequest times are relatively small, with only a few very large

times.

Sometimes we want to do a relative frequency histogram of a data

set (sometimes called a density histogram, for reasons to be covered

in Chapter 6).

Example: pp. 19 – 20

The density histogram shows an approximately symmetric, bell-

shaped distribution for the compressive strengths.

Numerical Descriptive Measures

One type of numerical summary describes, in some sense, the

location of the center of a data set. There are several measures of

central tendency, the most important of which is the mean.

Defn: For a variable X measured for every member of a finite population of size N, yielding a set of values $x_1, x_2, \ldots, x_N$, the mean, or average, is given by $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$. For a sample of size n chosen from the population, yielding a set of values $x_1, x_2, \ldots, x_n$, the sample mean, or average, is given by $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.

Sometimes, the sample mean is not the most useful measure of

central tendency. For example, sometimes a data set has some

extreme values (either very large or very small). These extreme

values are called outliers (more on this topic later). The value of the

sample mean may be strongly affected by these outliers. In such a

case, a more useful measure of central tendency may be the sample

median.

Defn: The sample median, $\tilde{x}$, is the center of the data set when the

data are ordered from smallest to largest. If n is odd, then the

median is the middle item of data. If n is even, then the median is

the average of the two middle items of data.

The median is not usually affected by outliers (Example on page

26).

Example: In the original compression strength data set, n = 80, so $\tilde{x} = \frac{160 + 163}{2} = 161.5$ psi.

In addition to locating the center of the data set, we want to describe

the dispersion of the data values.

The simplest, although least useful, measure of dispersion is the

range of the data set.

Defn: The range of a data set is the difference between the largest

and smallest values of the data; the range is a simple measure of the

dispersion of the data.


Example: For the compression strength data,

Range = 245 psi – 76 psi = 169 psi

The range cannot distinguish between the dispersion of two data sets

that have the same largest and smallest values, even though the

values in between may be quite different from one data set to the

other. For this reason, we need a measure of dispersion that takes

into consideration the location of each data value relative to the

center of the data set.

Consider a data set with data values $x_1, x_2, \ldots, x_n$. For each data value $x_i$, we define the deviation from the mean as $x_i - \bar{x}$. This

value gives the (directed) distance of the ith data value from the

mean of the sample data. We may consider using the sum of all of

these deviations as our measure of dispersion. However, it would be

useless to do so, as you will show in Exercise 2.50.
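Before working the exercise, it can help to see what happens numerically. The sketch below uses the first five compressive strength values from the earlier example and exact fractions, so that rounding error does not obscure the result; it is a numerical check on one small sample, not a proof.

```python
from fractions import Fraction

sample = [105, 221, 183, 186, 121]  # first five strengths (psi)

# Exact sample mean as a fraction, to avoid floating-point rounding.
x_bar = Fraction(sum(sample), len(sample))

# Directed distance of each value from the mean.
deviations = [x - x_bar for x in sample]
total = sum(deviations)

print([float(d) for d in deviations])
print(total)
```

The positive and negative deviations cancel, which is why the plain sum of deviations carries no information about spread.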

Instead, we define two other measures of dispersion, the variance

and the standard deviation.

Defn: For a variable X measured for every member of a finite population of size N, yielding a set of values $x_1, x_2, \ldots, x_N$, the variance of the data is given by $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$, and the standard deviation is given by $\sigma = \sqrt{\sigma^2}$. For a sample of size n chosen from the population, yielding a set of values $x_1, x_2, \ldots, x_n$, the sample variance is given by $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$, and the sample standard deviation is $s = \sqrt{s^2}$.


Note: In the above definitions, $\mu$ and $\sigma$ are parameters; these two quantities have fixed but usually unknown values. The two quantities $\bar{x}$ and s are statistics; the values of these two quantities

depend on the particular sample chosen from the population.

If all of the data values in a data set are the same, then the variance

and standard deviation are both 0. If there are any differences

among the data values, then both the variance and standard deviation

are positive; the greater the differences among the data values, the

greater the values of the variance and standard deviation.

Note: While the defining formulae for the population mean and the

sample mean have the same form, the defining formulae for the

population variance and the sample variance differ. For the

population, the variance is the mean of the squared deviations of the

data values from the mean value. For the sample, the variance is

almost the mean of the squared deviations of the data values from

the mean value. Instead of dividing the sum of squared deviations

by the sample size, we divide by n – 1. The reason for doing so has

to do with the fact that we want the sample variance to be a good

estimator of the population variance. A better estimator is given by

dividing by n – 1, rather than by n. Statistically, we say that there

are n – 1 degrees of freedom associated with the sample variance.

Note: If we select a random sample of size n from a population or

distribution, we start out with n quantities which are free to vary, so

that we have n degrees of freedom. Each time we use the data to

estimate a parameter (such as using the sample mean to estimate the

population mean), we use up one degree of freedom. Thus, we have

only n – 1 degrees of freedom associated with the sample variance.
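The difference between the two divisors can be seen on a small sample. The sketch below again uses the first five compressive strength values and compares the hand computation with Python's standard `statistics` module, where `variance` divides by n - 1 and `pvariance` divides by n.

```python
from statistics import pvariance, variance

sample = [105, 221, 183, 186, 121]  # first five strengths (psi)
n = len(sample)
x_bar = sum(sample) / n

# Sum of squared deviations from the sample mean.
ss = sum((x - x_bar) ** 2 for x in sample)

divide_by_n = ss / n                # population-style formula
divide_by_n_minus_1 = ss / (n - 1)  # sample variance, n - 1 degrees of freedom

print(divide_by_n, pvariance(sample))
print(divide_by_n_minus_1, variance(sample))
```

Dividing by n - 1 always gives the larger of the two values, and the gap shrinks as n grows.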


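Note: Another, and often simpler, way to calculate the variance is to use the following fact:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i^2 - 2x_i\bar{x} + \bar{x}^2\right)$$

$$= \frac{1}{n-1}\left[\sum_{i=1}^{n}x_i^2 - \frac{2}{n}\left(\sum_{i=1}^{n}x_i\right)\left(\sum_{i=1}^{n}x_i\right) + \frac{1}{n}\left(\sum_{i=1}^{n}x_i\right)^2\right] = \frac{1}{n-1}\left[\sum_{i=1}^{n}x_i^2 - \frac{\left(\sum_{i=1}^{n}x_i\right)^2}{n}\right].$$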

Example: Compressive strength, in pounds per square inch (psi) of

specimens of a new aluminum-lithium alloy undergoing evaluation

for possible use in aircraft structural components. The data are

listed in the following table.

105 221 183 186 121 181 180 143 97 154 153 174 120 168

167 141 245 228 174 199 181 158 176 110 163 131 154 115

160 208 158 133 207 180 190 193 194 133 156 123 134 178

76 167 184 135 229 146 218 157 101 171 165 172 158 169

199 151 142 163 145 171 148 158 160 175 149 87 160 237

150 135 196 201 200 176 150 170 118 149

The sum of the data values is $\sum_{i=1}^{80} x_i = 13013$ psi. The sum of the squared data values is $\sum_{i=1}^{80} x_i^2 = 2206837$ psi². Hence, the sample mean is 162.6625 psi; the sample variance is 1140.6315 psi². The sample standard deviation is then 33.7732 psi.

The above example illustrates the usefulness of the standard deviation as a measure of variation; the data have units of psi. The variance has units of psi². The standard deviation has the same units

of measurement as the data.
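The computation in this example can be reproduced with the shortcut formula for the sample variance (sum of squares minus the squared sum over n, all divided by n - 1). The following Python sketch is one way to do so; the variable names are chosen here for illustration.

```python
import math

# Compressive strength data (psi), copied from the table above.
data = [
    105, 221, 183, 186, 121, 181, 180, 143, 97, 154, 153, 174, 120, 168,
    167, 141, 245, 228, 174, 199, 181, 158, 176, 110, 163, 131, 154, 115,
    160, 208, 158, 133, 207, 180, 190, 193, 194, 133, 156, 123, 134, 178,
    76, 167, 184, 135, 229, 146, 218, 157, 101, 171, 165, 172, 158, 169,
    199, 151, 142, 163, 145, 171, 148, 158, 160, 175, 149, 87, 160, 237,
    150, 135, 196, 201, 200, 176, 150, 170, 118, 149,
]

n = len(data)
sum_x = sum(data)                   # sum of the data values
sum_x2 = sum(x * x for x in data)   # sum of the squared data values

x_bar = sum_x / n                             # sample mean
s2 = (sum_x2 - sum_x ** 2 / n) / (n - 1)      # shortcut formula for s^2
s = math.sqrt(s2)                             # sample standard deviation

print(f"sum = {sum_x}, sum of squares = {sum_x2}")
print(f"mean = {x_bar:.4f} psi, variance = {s2:.4f} psi^2, sd = {s:.4f} psi")
```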


As an example of the uses of the sample statistics, let us find the fraction of the compression strength data that lie within two standard deviations on either side of the mean. We have

$\bar{x} - 2s = 162.6625 - (2)(33.7732) = 95.1161$ psi and

$\bar{x} + 2s = 162.6625 + (2)(33.7732) = 230.2089$ psi.

From the data set, we see that there are two data values below 95.1161 psi, and two above 230.2089 psi. Hence, the fraction of the data set that lie within two standard deviations on either side of the mean is

$\frac{80 - 4}{80} = \frac{76}{80} = 0.95.$

(Hint: Remember this number.)
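The count can be verified directly from the data. This is a minimal sketch using Python's standard `statistics` module, where `stdev` is the sample standard deviation (divisor n - 1).

```python
from statistics import mean, stdev

# Compressive strength data (psi), copied from the table above.
data = [
    105, 221, 183, 186, 121, 181, 180, 143, 97, 154, 153, 174, 120, 168,
    167, 141, 245, 228, 174, 199, 181, 158, 176, 110, 163, 131, 154, 115,
    160, 208, 158, 133, 207, 180, 190, 193, 194, 133, 156, 123, 134, 178,
    76, 167, 184, 135, 229, 146, 218, 157, 101, 171, 165, 172, 158, 169,
    199, 151, 142, 163, 145, 171, 148, 158, 160, 175, 149, 87, 160, 237,
    150, 135, 196, 201, 200, 176, 150, 170, 118, 149,
]

x_bar = mean(data)
s = stdev(data)
lower, upper = x_bar - 2 * s, x_bar + 2 * s

# Values within two standard deviations on either side of the mean.
inside = [x for x in data if lower <= x <= upper]
fraction = len(inside) / len(data)

print(f"interval: ({lower:.4f}, {upper:.4f}) psi")
print(f"{len(inside)} of {len(data)} values inside, fraction = {fraction}")
```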

Quartiles and Percentiles

Defn: The first quartile, Q1, of a data set is a number such that 25%

of the data values are no greater than that number and 75% of the

data values are no less than that number. The third quartile, Q3, of a

data set is a number such that 75% of the data values are no greater

than that number and 25% of the data values are no less than that

number.

Example: For the aluminum-lithium alloy compression strength data, $Q_1 = \frac{143 + 145}{2} = 144$ psi, and $Q_3 = \frac{181 + 181}{2} = 181$ psi.

25% of the specimens had compressive strengths no greater than

144 psi, and 75% of the specimens had compressive strengths no

greater than 181 psi.


Defn: The interquartile range, IQR, is the difference between the

third and first quartiles. IQR is a measure of spread of the data set.

Example: For the original compression strength data, IQR = 181 - 144 = 37 psi.

Defn: The 100pth percentile of a data set (where 0 < p < 1) is a number such that 100p% of the data are no greater than that number and 100(1-p)% of the data values are no less than that number.

Steps in calculating the 100pth percentile for a numeric data set:

1. Re-order the data values from smallest to largest.

2. Determine the value of the product np, where n is the size of the

data set.

3. If np is not an integer, round it up to the next integer. Count up to

that position in the listed data to find the 100 pth percentile.

If np is an integer, count up to the npth position in the listed data, and

calculate the average of that data value and the next higher data

value.
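The steps above can be sketched in Python as follows. To avoid floating-point trouble in deciding whether np is an integer, this sketch takes the percentile as a whole-number percent (for example 35 for p = 0.35) and works in integer arithmetic; that interface, and the function name `percentile`, are choices made here for illustration. Note that other software may use different interpolation rules and return slightly different values.

```python
def percentile(values, percent):
    """Percentile by the counting rule above; `percent` is an integer with 0 < percent < 100."""
    ordered = sorted(values)                                   # step 1
    position, remainder = divmod(len(ordered) * percent, 100)  # step 2: np
    if remainder:
        # np is not an integer: round up and take that value
        # (1-based position + 1 is 0-based index `position`).
        return ordered[position]
    # np is an integer: average the np-th value and the next higher one.
    return (ordered[position - 1] + ordered[position]) / 2


# Compressive strength data (psi) from the example earlier in this chapter.
data = [
    105, 221, 183, 186, 121, 181, 180, 143, 97, 154, 153, 174, 120, 168,
    167, 141, 245, 228, 174, 199, 181, 158, 176, 110, 163, 131, 154, 115,
    160, 208, 158, 133, 207, 180, 190, 193, 194, 133, 156, 123, 134, 178,
    76, 167, 184, 135, 229, 146, 218, 157, 101, 171, 165, 172, 158, 169,
    199, 151, 142, 163, 145, 171, 148, 158, 160, 175, 149, 87, 160, 237,
    150, 135, 196, 201, 200, 176, 150, 170, 118, 149,
]

for pct in (25, 35, 50, 75):
    print(pct, percentile(data, pct))
```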

Example: For the aluminum-lithium alloy compression strength

data, the 35th percentile is a number such that 35% of the data

values, or 28 values, are no greater than that number. From the

stem-and-leaf plot, we see that the 35th percentile is 152. Thirty-five

percent of the specimens in the sample have compression strengths

no greater than 152 psi.

Alternatively:

1. The data presented in the stem-and-leaf plot are already ordered.

2. np = (80)(0.35) = 28. This is an integer, so we average the 28th and the 29th data values, obtaining $\frac{151 + 153}{2} = 152$ psi.


Boxplots

Defn: The five-number summary of a data set consists of the

minimum value, the first quartile, the median, the third quartile, and

the maximum value.

Example: For the aluminum-lithium alloy compression strength data, $X_{min}$ = 76 psi, $Q_1$ = 144 psi, $\tilde{x}$ = 161.5 psi, $Q_3$ = 181 psi, and $X_{max}$ = 245 psi.

Defn: A boxplot is a graphical representation of a numeric data set using the 5-number summary. The data values between the first and third quartiles are represented by a box, with a vertical line at the median value. The data values between $X_{min}$ and the first quartile are represented by a line drawn from one end of the box; the data values between the third quartile and $X_{max}$ are represented by a line drawn from the other end of the box. These two lines are called whiskers.

Note: Excel does boxplots, but not readily; some Excel

programming is required. Excel can help in constructing boxplots

through providing the 5-number summary for the data, using the

Rank and Percentile function under Data Analysis.

Example: For the compression strength data, the boxplot is shown

below. To find the 5-number summary with Excel, we enter the

data, and use Tools, Data Analysis, Rank and Percentiles.

Aluminum-Lithium Alloy Compression Strength

____________

-----------------------|_____|______|------------------------------

|______|______|______|______|______|______|______|______|__

75 115 135 155 175 195 215 235 255

Compression Strength (psi)


If the median line is approximately in the center of the box, and if the two whiskers (the lines extending from either end of the box) are of approximately equal length, then the data distribution is approximately symmetric.

Defn: An outlier is an observation whose value is quite different

from the values of most of the observations in the data set.

Note: When outliers are encountered, they should be investigated.

They may result from mistakes in data collection or in data entry.

Or they may result from unusual members of the sample.

Note: Practically speaking, an outlier is an observation whose value

is either at least 1.5 IQR’s below Q1, or at least 1.5 IQR’s above Q3.

An extreme outlier is an observation whose value is either at least 3

IQR’s below Q1, or at least 3 IQR’s above Q3.
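The 1.5 IQR and 3 IQR rules can be applied to the compression strength data with a short Python sketch. The quartiles are computed with the counting rule given earlier in this chapter (np an integer, so two neighboring values are averaged); the helper name `quartile_by_counting` is hypothetical and handles only that case.

```python
# Compressive strength data (psi) from the example earlier in this chapter.
data = [
    105, 221, 183, 186, 121, 181, 180, 143, 97, 154, 153, 174, 120, 168,
    167, 141, 245, 228, 174, 199, 181, 158, 176, 110, 163, 131, 154, 115,
    160, 208, 158, 133, 207, 180, 190, 193, 194, 133, 156, 123, 134, 178,
    76, 167, 184, 135, 229, 146, 218, 157, 101, 171, 165, 172, 158, 169,
    199, 151, 142, 163, 145, 171, 148, 158, 160, 175, 149, 87, 160, 237,
    150, 135, 196, 201, 200, 176, 150, 170, 118, 149,
]


def quartile_by_counting(values, quarter):
    """Q1 (quarter=1) or Q3 (quarter=3); assumes len(values) is a multiple of 4."""
    ordered = sorted(values)
    position = len(ordered) * quarter // 4   # np, an integer here
    return (ordered[position - 1] + ordered[position]) / 2


q1 = quartile_by_counting(data, 1)
q3 = quartile_by_counting(data, 3)
iqr = q3 - q1

# "At least 1.5 IQRs below Q1 or above Q3" marks an outlier; 3 IQRs marks an extreme outlier.
outliers = sorted(x for x in data if x <= q1 - 1.5 * iqr or x >= q3 + 1.5 * iqr)
extreme = sorted(x for x in data if x <= q1 - 3 * iqr or x >= q3 + 3 * iqr)

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
print("outliers:", outliers)
print("extreme outliers:", extreme)
```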

Example: A boxplot of the compression strength data, with outliers

indicated, is shown below:

Aluminum-Lithium Alloy Compression Strength

____________

* *------------------|_____|______|-------------------------- * *

|______|______|______|______|______|______|______|______|__

75 115 135 155 175 195 215 235 255

Compression Strength (psi)

Side-by-side boxplots are often useful in comparing the central

tendencies and variabilities of several data sets, as in the results of

scientific experiments.

Example: pp. 32-33.


From examination of the side-by-side boxplots, we see that the

quality index is most variable for Plant 2, is lowest (on average) for

Plant 4, and is highest (on average) for Plant 3.

Example: Handout

Time Series Plots

Often, in a manufacturing situation, we are interested in the

development of the value of a variable over time. The other graphs

we have discussed examine data collected at a single point in time.

A time series is an ordered sequence of observations. Usually the

ordering is over time, although it may also be over some spatial

dimension. The key point here is that successive observations are

dependent, or correlated with each other. This is what makes time

series data different from the other types of data we have looked at.
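One common way to quantify the dependence between successive observations is the lag-1 autocorrelation, which is not defined in these notes and is introduced here only as an illustration. The sketch below uses a made-up series of ten thickness readings with a downward drift; the numbers are hypothetical and are not taken from the textbook examples.

```python
def lag1_autocorrelation(series):
    """Correlation between each observation and the one that follows it."""
    n = len(series)
    mean = sum(series) / n
    numerator = sum((series[t] - mean) * (series[t + 1] - mean) for t in range(n - 1))
    denominator = sum((x - mean) ** 2 for x in series)
    return numerator / denominator


# Hypothetical thickness readings taken in time order (invented for illustration).
thickness = [10.0, 9.8, 9.7, 9.5, 9.2, 9.1, 8.9, 8.6, 8.5, 8.2]

r1 = lag1_autocorrelation(thickness)
print(f"lag-1 autocorrelation = {r1:.3f}")
```

A value well above zero indicates that neighboring observations tend to be on the same side of the mean, as happens when the series has a trend.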

In time series analysis, we are looking for two types of

characteristics in the data – trends and cycles.

The following two graphs show the two types of characteristics.

Example 1: p. 33

We see that for the measurement instrument, the measurements of

material thickness display a decreasing trend over time. The

instrument is not being consistent in its measurements.

Example 2: Handout