ten tools

73
Neil W. Polhemus, CTO, StatPoint Technologies, Inc. George H. Dyson, Director of Six Sigma Services Copyright © 2010 by StatPoint Technologies, Inc. Ten Data Analysis Tools You Can’t Afford to be Without

Upload: pasticcio

Post on 24-Oct-2015

9 views

Category:

Documents


5 download

DESCRIPTION

tools six sigma

TRANSCRIPT

Page 1: Ten Tools

Neil W. Polhemus, CTO, StatPoint Technologies, Inc.

George H. Dyson, Director of Six Sigma ServicesCopyright ©

2010 by StatPoint Technologies, Inc.

Ten Data Analysis Tools You Can’t Afford to be Without

Page 2: Ten Tools

Business Improvement Objectives

Businesses must create value. This output must be greater than the inputs needed to produce it.

If the output meets the customers’

needs, the business is effective.

If the business creates added value with minimum resources, the business is efficient.

The Role of Six Sigma is to Help a Business Produce the Maximum Value While Using Minimum ResourcesThe Role of Six Sigma is to Help a Business Produce the Maximum Value While Using Minimum Resources

(Pyzdek 2003)

2

Page 3: Ten Tools

Six Sigma Business Successes

• Cost reductions

• Market - share growth

• Defect reductions

• Culture changes

• Productivity improvements

• Customer relations improvements

• Product and service improvements

• Cycle - time reductions

(CSSBB Primer, 2001) / (Pande, 2000)

All these Successes have a common thread….

3

Page 4: Ten Tools

DATA!!!

Data drives analysis.

Analysis uses statistical models & tools of all kinds.

To extract meaningful information

To uncover signals in the presence of noise

To understand the past

To monitor the present

To forecast the future

This understanding results in reduced cost and saving money.

Deming: “Doesn’t anyone care about Profit?”

4

Page 5: Ten Tools

Examples

Product comparisons

Survey analysis

Distribution fitting

Comparison of multiple samples

Outlier detection

Curve fitting

Response surface modeling

Time series forecasting

Event rate modeling

Interactive maps

5

Page 6: Ten Tools

Problem #1 –

Product Comparisons

Consumer Reports 2010: Sedans (Family, Upscale, Luxury)

7 variables

Price (dollars)

Road-test score (0 –

100)

Predicted reliability (1 –

5)

Owner satisfaction (1 –

5)

Owner cost (1 –

5)

Safety (1 –

5)

Fuel economy (overall mpg)

6

Page 7: Ten Tools

Data file: cars.sgd

(n=30)

7

Page 8: Ten Tools

Multivariate Visualization

The data consist of 7 quantitative variables plus one categorical factor. Each car is thus a point in 7-dimensional space with one other feature. How can we visualize what’s going on?

1.

2-D scatterplot2.

3-D scatterplot3.

Scatterplot matrix4.

Bubble chart5.

Parallel coordinates plot6.

Star glyphs7.

Chernoff

faces

8.

Radar/spider plot

8

Page 9: Ten Tools

2-D ScatterplotUseful for plotting 2 dimensions.

9

Plot of MPG vs Price

24 34 44 54 64(X 1000)Price

17

20

23

26

29

32

35

MPG

ClassFamily Luxury Upscale

Page 10: Ten Tools

3-D ScatterplotUseful for plotting 3 dimensions.

10

Plot of MPG vs Price and Road Test

2434

4454

64(X 1000)Price 58

6878

8898

Road Test

17202326293235

MPG

ClassFamily Luxury Upscale

Toyota Camry Hybrid

Nissan Altima Hybrid

Page 11: Ten Tools

Bubble ChartUses size of bubble to illustrate a third dimension.

Bubble Chart for Reliability

24 34 44 54 64(X 1000)Price

17

20

23

26

29

32

35M

PGClass

Family Luxury Upscale

11

Page 12: Ten Tools

Scatterplot Matrix Plots all pairs of variables.

12

ClassFamily Luxury Upscale

Price

Road Test

Reliability

Owner Satisfaction

Owner Cost

Safety

MPG

Page 13: Ten Tools

Parallel Coordinates PlotEach case is shown as a line connecting the values of the variables.

13

Parallel Coordinates Plot

0

0.2

0.4

0.6

0.8

1 ClassFamily Luxury Upscale

MPG

Price

Road Tes

tReli

abilit

yOwner

Satisfac

tion

Safety

Owner Cost

Page 14: Ten Tools

Star GlyphsEach case is shown as a polygon with vertices scaled by the variables.

1/Price

MPG

Road Test

ReliabilityOwner Satisfaction

Safety

Owner Cost

14

Page 15: Ten Tools

Star GlyphsVariables are scaled so that larger area is better.

ClassFamily Luxury Upscale

Nissan Altima Honda Accord Toyota Camry XLE Volkswagen Passat Toyota Camry Hybrid Hyundai Sonata

Chevrolet Malibu Ford Fusion Mercury Milan Nissan Altima Hybrid Ford Taurus Kia Optima

Chevrolet Impala Lexus ES Toyota Avalon Acura TL Hyundai Azera Lincoln MKZ

Buick Lucerne Volvo S60 Infiniti M35 AWD Infinit M35 Audi A6 Acura RL

BMW 535i Cadillac STS Cadillac DTS Lexus GS 300 Lexus GS 450 Hybrid Volvo S80

15

Page 16: Ten Tools

Chernoff

FacesEach variable is assigned to a different feature of the faces.

ClassFamily Luxury Upscale

Nissan Altima Honda Accord Toyota Camry XLE Volkswagen Passat Toyota Camry Hybrid Hyundai Sonata

Chevrolet Malibu Ford Fusion Mercury Milan Nissan Altima Hybrid Ford Taurus Kia Optima

Chevrolet Impala Lexus ES Toyota Avalon Acura TL Hyundai Azera Lincoln MKZ

Buick Lucerne Volvo S60 Infiniti M35 AWD Infinit M35 Audi A6 Acura RL

BMW 535i Cadillac STS Cadillac DTS Lexus GS 300 Lexus GS 450 Hybrid Volvo S80

16

Page 17: Ten Tools

Feature AssignmentsUnfortunately, some features have a greater impact than others.

17

Page 18: Ten Tools

Radar/Spider PlotGood for comparing a small number of cases.

Owner Cost (0.0-5.0)

Price (25000.0-55000.0)

Road Test (50.0-100.0)

Radar/Spider Plot

MPG (15.0-25.0)

Owner Satisfaction (0.0-5.0)

Reliability (0.0-5.0)

Safety (0.0-5.0)

ModelVolkswagen PassatChevrolet ImpalaLincoln MKZCadillac STS

18

Page 19: Ten Tools

Problem #2: Survey Analysis

The most commonly used statistical procedure is the calculation of a two-way table (a tabulation of responses that can be classified in 2 ways).

Such crosstabulations

result in contingency tables that provide

much useful information.

Example: On January 18-19, 2010, Rasmussen Reports asked 1000 likely voters: “How would your rate the U.S. healthcare system?”

19

Page 20: Ten Tools

Data file: healthcare.sgd

20

Page 21: Ten Tools

Mosaic PlotScales the area of each bar according to the counts in the table.

21

Page 22: Ten Tools

Chi-square TestTests for lack of independence between row and column classification.

22

Page 23: Ten Tools

Correspondence AnalysisUsed to help visualize the important information in two-way tables.

23

Page 24: Ten Tools

Problem #3: Distribution Fitting

In many studies, determining the distribution of a quantitative variable is critical.

Common examples covered in Six Sigma include capability studies.

Distribution fitting is also quite critical in many design problems.

24

Page 25: Ten Tools

Data file: waves.sgd

(n=26,304)

25

Source: www.iahr.net

Page 26: Ten Tools

Frequency HistogramShows the number of observations in non-overlapping intervals.

26

Histogram

0 2 4 6 8 10 12Height

0

400

800

1200

1600

2000

2400

frequ

ency

Page 27: Ten Tools

Normal DistributionThe normal distribution is a poor model for this data.

27

Histogram for Height

-1 2 5 8 11 14Height

0

1

2

3

4(X 1000)

frequ

ency

DistributionNormal

Page 28: Ten Tools

Comparison of DistributionsFits many distributions and sorts them by goodness of fit.

28

Page 29: Ten Tools

Lognormal DistributionThe 3-parameter lognormal distribution is much better.

29

Histogram for Height

0 2 4 6 8 10 12Height

0

400

800

1200

1600

2000

2400

frequ

ency

DistributionLargest Extreme ValueLognormal (3-Parameter)Normal

Page 30: Ten Tools

Problem #4: Multiple Samples

Data are frequently obtained from more than one sample.

Asserting a significant difference between the samples (or lack thereof) is an important application of data analysis.

30

Page 31: Ten Tools

Data file: thickness.sgd

(n=480)

31

Page 32: Ten Tools

Box-and-Whisker PlotsA very useful plot for comparing samples (from John Tukey).

32

(>1.5 x IQR below 25th Percentile,or >1.5 x IQR above 75th Percentile)

63

64

Leng

t h

Maximum Observation(within 1.5 x IQR above 75th)

75th Percentile

Median (50th Percentile)

25th Percentile

Minimum Observation(within 1.5 x IQR below 25th)

IQR

* Outlier

1.5 x IQR57

58

59

60

61

62

54

55

56

Plus sign may be added to show sample mean.

Page 33: Ten Tools

Notched Box-and-Whisker PlotsNon-overlapping notches indicate significantly different medians.

33

Page 34: Ten Tools

HSD IntervalsAllow pairwise comparison of all level means.

34

Page 35: Ten Tools

Problem #5: Outlier Detection

Many data sets contain aberrant observations that don’t come from the same distribution as the others.

Identifying outliers and treating them separately often results in better models.

35

Page 36: Ten Tools

Data file: bodytemp.sgd

(n=130)

36

Page 37: Ten Tools

Outlier PlotShows each data value with lines at 1, 2, 3 and 4-sigma.

37

Page 38: Ten Tools

Grubbs’

TestSmall P-value indicates that the extreme Studentized deviate (ESD) is highly unusual.

38

Page 39: Ten Tools

Problem #6: Curve Fitting

A common data analysis problem involves determining the relationship between a response variable Y and a predictor variable X.

If we can estimate a model where Y = f(X), then we can use that model to make predictions.

39

Page 40: Ten Tools

Data file: chlorine.sgd

(n=44)

40

Page 41: Ten Tools

Simple Linear RegressionFits a linear model of the form Y = mX + b.

41

95% confidence limits for mean

95% prediction limits for new observations

Plot of Fitted Modelchlorine = 0.48551 - 0.00271679*weeks

0 10 20 30 40 50weeks

0.38

0.4

0.42

0.44

0.46

0.48

0.5ch

lorin

e

Page 42: Ten Tools

Comparison of Alternative ModelsFits many transformable nonlinear models and sorts by R-Squared.

42

Page 43: Ten Tools

Reciprocal X ModelA nonlinear model of the form Y = m/X + b is much better.

Plot of Fitted Modelchlorine = 0.368053 + 1.02553/weeks

0 10 20 30 40 50weeks

0.38

0.4

0.42

0.44

0.46

0.48

0.5ch

lorin

e

43

Page 44: Ten Tools

Problem #7: Response Surfaces

The MISSION: air dominance at the lowest possible price.

Use optimization models, built from performance data, to design the best aircraft engine.

Note: the data have been altered and are for demonstration purposes only.

44

Page 45: Ten Tools

Problem Statement

Optimize 3 response variables:

Minimize total fleet acquisition cost (Y1)

Maximize climb rate (Y2)

Maximize launch rate (Y3)

Input factors:

X1: Fan Pressure Ratio: 3.9 to 4.7

X2: Overall Pressure Ratio : 34 to 40

X3: Inlet airflow : 240 to 270 pps

45

Page 46: Ten Tools

Data file: engines.sgx

(n=584)

46

Page 47: Ten Tools

Design PlotShows the location of the historical data within the factor space.

47

Experimental Region

3.94.1

4.34.5

4.7FPR 3435

3637

3839

40

OPR

240245250255260265270

Air

Flow

Page 48: Ten Tools

Standardized Pareto ChartsShow the significant factors affecting each response.

48

Standardized Pareto Chart for Y1 Cost

0 10 20 30 40Standardized effect

AC

CC

AA

BC

BB

AB

B:OPR

A:FPR

C:Air Flow +-

Standardized Pareto Chart for Y2 Ps

0 20 40 60 80 100 120Standardized effect

AC

BB

CC

AB

BC

AA

B:OPR

A:FPR

C:Air Flow +-

Standardized Pareto Chart for Y3 Launch

0 20 40 60 80 100 120Standardized effect

AC

CC

BB

AB

BC

AA

B:OPR

A:FPR

C:Air Flow +-

Page 49: Ten Tools

Desirability FunctionQuantifies the desirability of a joint response (Y1, Y2, Y3).

49

For responses to be minimized:

For responses to be maximized:

Combined desirability:

D=d(Y1)*d(Y2)*d(Y3)

Page 50: Ten Tools

Optimal ConditionsFound at the levels shown below:

50

Page 51: Ten Tools

Response SurfaceShow the estimated desirability throughout the experimental region.

51

Desirability PlotAir Flow=254.697

3.6 3.9 4.2 4.5 4.8 5.1FPR

31 33 35 37 39 41 43

OPR

220

240

260

280

300

Air

Flow

Desirability0.00.10.20.30.40.50.60.70.80.91.0

Desirability

Page 52: Ten Tools

Problem #8: Time Series Data

Data recorded at equally spaced points in time is called a time series.

Time series models are used for various purposes:

Analysis of trends and seasonal effects

Forecasting

Control

Autocorrelation between adjacent observations requires special models.

52

Page 53: Ten Tools

Data file: customers.sgd

(n=168)

53

Page 54: Ten Tools

Time Sequence PlotPlots the data versus time.

Time Series Plot for Customers

1/96 1/99 1/02 1/05 1/08 1/11Month

73

83

93

103

113(X 1000)

Cus

tom

ers

54

Page 55: Ten Tools

Autocorrelation FunctionEstimates the correlation between observations at different lags.

Estimated Autocorrelations for Customers

0 12 24 36lag

-1

-0.6

-0.2

0.2

0.6

1

Aut

ocor

rela

tions

55

Page 56: Ten Tools

Seasonal DecompositionShows the average value during each season (scaled to 100).

Seasonal Index Plot for Customers

1 2 3 4 5 6 7 8 9 10 11 12season

90

94

98

102

106

110

114se

ason

al in

dex

56

Page 57: Ten Tools

Seasonal Subseries PlotShows the seasonal averages and trend within each season.

Seasonal Subseries Plot for Customers

0 2 4 6 8 10 12 14Season

73

83

93

103

113(X 1000)

Cus

tom

ers

57

Page 58: Ten Tools

Annual Subseries PlotShows the seasonal effect separately for each cycle.

58

Annual Subseries Plot for Customers

0 3 6 9 12 15Season

73

83

93

103

113(X 1000)

Cust

omer

s

Cycle19961997199819992000200120022003200420052006200720082009

Page 59: Ten Tools

Seasonally Adjusted DataRemoves the seasonal effects from the data.

Seasonally Adjusted Data Plot for Customers

1/96 1/99 1/02 1/05 1/08 1/11Month

76

81

86

91

96

101

106(X 1000)

seas

onal

ly a

djus

ted

59

Page 60: Ten Tools

Automatic ForecastingFits many models and automatically selects the best.

60

Page 61: Ten Tools

ARIMA ModelSelected model is a rather complicated seasonal ARIMA model.

Time Sequence Plot for CustomersARIMA(2,1,1)x(1,1,2)12

1/96 1/00 1/04 1/08 1/12Month

73

83

93

103

113

123(X 1000)

Cus

tom

ers

actualforecast95.0% limits

61

Page 62: Ten Tools

Problem #9: Event Rate Modeling

When the data to be analyzed consist of the time at which events occur, the process by which those events are generated is called a point process.

The most common type of point process is a Poisson process, in which the times between events are independent and follow a negative exponential distribution.

If the event rate is constant, the process is called homogeneous. If the event rate changes over time, then the process is called nonhomogeneous.

62

Page 63: Ten Tools

Data file: earthquakes.sgd

(n=52)

63

Page 64: Ten Tools

Plot of Magnitude Versus DateShows an apparent increase in magnitude over the sampling period.

Plot of Magnitude vs Date

1/1/08 12/31/08 12/31/09 12/31/10Date

5.4

5.9

6.4

6.9

7.4

7.9

8.4M

agni

tude

64

Page 65: Ten Tools

Point Process PlotShows the dates of occurrence only.

Events Plot for Date

1/1/08 12/31/08 12/31/09 12/31/10Time or distance

Earthquake

65

Page 66: Ten Tools

Event Rate

66

The critical parameter in a point process is the rate of events per unit time (such as earthquakes per year). This parameter is usually called .

The rate parameter is also related to the mean time between events, with MTBE = 1 /

may be constant (a homogeneous process) or vary over

time (a nonhomogenous

process).

Page 67: Ten Tools

Cumulative Events PlotThe slope of the line is related to the event rate.

67

Mean

Cumulative Events Plot for Date

1/1/08 12/31/08 12/31/09 12/31/10Time or distance

0

20

40

60

80N

umbe

r of

eve

nts

Page 68: Ten Tools

Trend TestA small P-value would indicate a significant trend.

68

Page 69: Ten Tools

Other UsesPoint process models are also very useful for estimating failure rates.

69

Aircraft 2

Aircraft 4

Aircraft 6

Aircraft 8Aircraft 9

Aircraft 10Aircraft 11Aircraft 12Aircraft 13

Events Plot

0 500 1000 1500 2000 2500Time or distance

Aircraft 1

Aircraft 3

Aircraft 5

Aircraft 7

Page 70: Ten Tools

Between Group ComparisonsTests can be made to determine whether there are significant differences.

70

Mean

Cumulative Events Plot

0 500 1000 1500 2000 2500Time or distance

0

5

10

15

20

25

30N

umbe

r of e

vent

s

Aircraft 1Aircraft 2Aircraft 3Aircraft 4Aircraft 5Aircraft 6Aircraft 7Aircraft 8Aircraft 9Aircraft 10Aircraft 11Aircraft 12Aircraft 13

Page 71: Ten Tools

Problem #10: Interactive Maps

Graphics that allow the user to interact with the data are extremely useful.

Maps are one important example.

71

Page 72: Ten Tools

Data file: census2000.sgd (n=51)

72

Page 73: Ten Tools

U.S. Map StatletThe slider changes the cutoff between the red and blue states.

73