ten tools
DESCRIPTION
tools six sigmaTRANSCRIPT
Neil W. Polhemus, CTO, StatPoint Technologies, Inc.
George H. Dyson, Director of Six Sigma ServicesCopyright ©
2010 by StatPoint Technologies, Inc.
Ten Data Analysis Tools You Can’t Afford to be Without
Business Improvement Objectives
Businesses must create value. This output must be greater than the inputs needed to produce it.
If the output meets the customers’
needs, the business is effective.
If the business creates added value with minimum resources, the business is efficient.
The Role of Six Sigma is to Help a Business Produce the Maximum Value While Using Minimum ResourcesThe Role of Six Sigma is to Help a Business Produce the Maximum Value While Using Minimum Resources
(Pyzdek 2003)
2
Six Sigma Business Successes
• Cost reductions
• Market - share growth
• Defect reductions
• Culture changes
• Productivity improvements
• Customer relations improvements
• Product and service improvements
• Cycle - time reductions
(CSSBB Primer, 2001) / (Pande, 2000)
All these Successes have a common thread….
3
DATA!!!
Data drives analysis.
Analysis uses statistical models & tools of all kinds.
To extract meaningful information
To uncover signals in the presence of noise
To understand the past
To monitor the present
To forecast the future
This understanding results in reduced cost and saving money.
Deming: “Doesn’t anyone care about Profit?”
4
Examples
Product comparisons
Survey analysis
Distribution fitting
Comparison of multiple samples
Outlier detection
Curve fitting
Response surface modeling
Time series forecasting
Event rate modeling
Interactive maps
5
Problem #1 –
Product Comparisons
Consumer Reports 2010: Sedans (Family, Upscale, Luxury)
7 variables
Price (dollars)
Road-test score (0 –
100)
Predicted reliability (1 –
5)
Owner satisfaction (1 –
5)
Owner cost (1 –
5)
Safety (1 –
5)
Fuel economy (overall mpg)
6
Data file: cars.sgd
(n=30)
7
Multivariate Visualization
The data consist of 7 quantitative variables plus one categorical factor. Each car is thus a point in 7-dimensional space with one other feature. How can we visualize what’s going on?
1.
2-D scatterplot2.
3-D scatterplot3.
Scatterplot matrix4.
Bubble chart5.
Parallel coordinates plot6.
Star glyphs7.
Chernoff
faces
8.
Radar/spider plot
8
2-D ScatterplotUseful for plotting 2 dimensions.
9
Plot of MPG vs Price
24 34 44 54 64(X 1000)Price
17
20
23
26
29
32
35
MPG
ClassFamily Luxury Upscale
3-D ScatterplotUseful for plotting 3 dimensions.
10
Plot of MPG vs Price and Road Test
2434
4454
64(X 1000)Price 58
6878
8898
Road Test
17202326293235
MPG
ClassFamily Luxury Upscale
Toyota Camry Hybrid
Nissan Altima Hybrid
Bubble ChartUses size of bubble to illustrate a third dimension.
Bubble Chart for Reliability
24 34 44 54 64(X 1000)Price
17
20
23
26
29
32
35M
PGClass
Family Luxury Upscale
11
Scatterplot Matrix Plots all pairs of variables.
12
ClassFamily Luxury Upscale
Price
Road Test
Reliability
Owner Satisfaction
Owner Cost
Safety
MPG
Parallel Coordinates PlotEach case is shown as a line connecting the values of the variables.
13
Parallel Coordinates Plot
0
0.2
0.4
0.6
0.8
1 ClassFamily Luxury Upscale
MPG
Price
Road Tes
tReli
abilit
yOwner
Satisfac
tion
Safety
Owner Cost
Star GlyphsEach case is shown as a polygon with vertices scaled by the variables.
1/Price
MPG
Road Test
ReliabilityOwner Satisfaction
Safety
Owner Cost
14
Star GlyphsVariables are scaled so that larger area is better.
ClassFamily Luxury Upscale
Nissan Altima Honda Accord Toyota Camry XLE Volkswagen Passat Toyota Camry Hybrid Hyundai Sonata
Chevrolet Malibu Ford Fusion Mercury Milan Nissan Altima Hybrid Ford Taurus Kia Optima
Chevrolet Impala Lexus ES Toyota Avalon Acura TL Hyundai Azera Lincoln MKZ
Buick Lucerne Volvo S60 Infiniti M35 AWD Infinit M35 Audi A6 Acura RL
BMW 535i Cadillac STS Cadillac DTS Lexus GS 300 Lexus GS 450 Hybrid Volvo S80
15
Chernoff
FacesEach variable is assigned to a different feature of the faces.
ClassFamily Luxury Upscale
Nissan Altima Honda Accord Toyota Camry XLE Volkswagen Passat Toyota Camry Hybrid Hyundai Sonata
Chevrolet Malibu Ford Fusion Mercury Milan Nissan Altima Hybrid Ford Taurus Kia Optima
Chevrolet Impala Lexus ES Toyota Avalon Acura TL Hyundai Azera Lincoln MKZ
Buick Lucerne Volvo S60 Infiniti M35 AWD Infinit M35 Audi A6 Acura RL
BMW 535i Cadillac STS Cadillac DTS Lexus GS 300 Lexus GS 450 Hybrid Volvo S80
16
Feature AssignmentsUnfortunately, some features have a greater impact than others.
17
Radar/Spider PlotGood for comparing a small number of cases.
Owner Cost (0.0-5.0)
Price (25000.0-55000.0)
Road Test (50.0-100.0)
Radar/Spider Plot
MPG (15.0-25.0)
Owner Satisfaction (0.0-5.0)
Reliability (0.0-5.0)
Safety (0.0-5.0)
ModelVolkswagen PassatChevrolet ImpalaLincoln MKZCadillac STS
18
Problem #2: Survey Analysis
The most commonly used statistical procedure is the calculation of a two-way table (a tabulation of responses that can be classified in 2 ways).
Such crosstabulations
result in contingency tables that provide
much useful information.
Example: On January 18-19, 2010, Rasmussen Reports asked 1000 likely voters: “How would your rate the U.S. healthcare system?”
19
Data file: healthcare.sgd
20
Mosaic PlotScales the area of each bar according to the counts in the table.
21
Chi-square TestTests for lack of independence between row and column classification.
22
Correspondence AnalysisUsed to help visualize the important information in two-way tables.
23
Problem #3: Distribution Fitting
In many studies, determining the distribution of a quantitative variable is critical.
Common examples covered in Six Sigma include capability studies.
Distribution fitting is also quite critical in many design problems.
24
Data file: waves.sgd
(n=26,304)
25
Source: www.iahr.net
Frequency HistogramShows the number of observations in non-overlapping intervals.
26
Histogram
0 2 4 6 8 10 12Height
0
400
800
1200
1600
2000
2400
frequ
ency
Normal DistributionThe normal distribution is a poor model for this data.
27
Histogram for Height
-1 2 5 8 11 14Height
0
1
2
3
4(X 1000)
frequ
ency
DistributionNormal
Comparison of DistributionsFits many distributions and sorts them by goodness of fit.
28
Lognormal DistributionThe 3-parameter lognormal distribution is much better.
29
Histogram for Height
0 2 4 6 8 10 12Height
0
400
800
1200
1600
2000
2400
frequ
ency
DistributionLargest Extreme ValueLognormal (3-Parameter)Normal
Problem #4: Multiple Samples
Data are frequently obtained from more than one sample.
Asserting a significant difference between the samples (or lack thereof) is an important application of data analysis.
30
Data file: thickness.sgd
(n=480)
31
Box-and-Whisker PlotsA very useful plot for comparing samples (from John Tukey).
32
(>1.5 x IQR below 25th Percentile,or >1.5 x IQR above 75th Percentile)
63
64
Leng
t h
Maximum Observation(within 1.5 x IQR above 75th)
75th Percentile
Median (50th Percentile)
25th Percentile
Minimum Observation(within 1.5 x IQR below 25th)
IQR
* Outlier
1.5 x IQR57
58
59
60
61
62
54
55
56
Plus sign may be added to show sample mean.
Notched Box-and-Whisker PlotsNon-overlapping notches indicate significantly different medians.
33
HSD IntervalsAllow pairwise comparison of all level means.
34
Problem #5: Outlier Detection
Many data sets contain aberrant observations that don’t come from the same distribution as the others.
Identifying outliers and treating them separately often results in better models.
35
Data file: bodytemp.sgd
(n=130)
36
Outlier PlotShows each data value with lines at 1, 2, 3 and 4-sigma.
37
Grubbs’
TestSmall P-value indicates that the extreme Studentized deviate (ESD) is highly unusual.
38
Problem #6: Curve Fitting
A common data analysis problem involves determining the relationship between a response variable Y and a predictor variable X.
If we can estimate a model where Y = f(X), then we can use that model to make predictions.
39
Data file: chlorine.sgd
(n=44)
40
Simple Linear RegressionFits a linear model of the form Y = mX + b.
41
95% confidence limits for mean
95% prediction limits for new observations
Plot of Fitted Modelchlorine = 0.48551 - 0.00271679*weeks
0 10 20 30 40 50weeks
0.38
0.4
0.42
0.44
0.46
0.48
0.5ch
lorin
e
Comparison of Alternative ModelsFits many transformable nonlinear models and sorts by R-Squared.
42
Reciprocal X ModelA nonlinear model of the form Y = m/X + b is much better.
Plot of Fitted Modelchlorine = 0.368053 + 1.02553/weeks
0 10 20 30 40 50weeks
0.38
0.4
0.42
0.44
0.46
0.48
0.5ch
lorin
e
43
Problem #7: Response Surfaces
The MISSION: air dominance at the lowest possible price.
Use optimization models, built from performance data, to design the best aircraft engine.
Note: the data have been altered and are for demonstration purposes only.
44
Problem Statement
Optimize 3 response variables:
Minimize total fleet acquisition cost (Y1)
Maximize climb rate (Y2)
Maximize launch rate (Y3)
Input factors:
X1: Fan Pressure Ratio: 3.9 to 4.7
X2: Overall Pressure Ratio : 34 to 40
X3: Inlet airflow : 240 to 270 pps
45
Data file: engines.sgx
(n=584)
46
Design PlotShows the location of the historical data within the factor space.
47
Experimental Region
3.94.1
4.34.5
4.7FPR 3435
3637
3839
40
OPR
240245250255260265270
Air
Flow
Standardized Pareto ChartsShow the significant factors affecting each response.
48
Standardized Pareto Chart for Y1 Cost
0 10 20 30 40Standardized effect
AC
CC
AA
BC
BB
AB
B:OPR
A:FPR
C:Air Flow +-
Standardized Pareto Chart for Y2 Ps
0 20 40 60 80 100 120Standardized effect
AC
BB
CC
AB
BC
AA
B:OPR
A:FPR
C:Air Flow +-
Standardized Pareto Chart for Y3 Launch
0 20 40 60 80 100 120Standardized effect
AC
CC
BB
AB
BC
AA
B:OPR
A:FPR
C:Air Flow +-
Desirability FunctionQuantifies the desirability of a joint response (Y1, Y2, Y3).
49
For responses to be minimized:
For responses to be maximized:
Combined desirability:
D=d(Y1)*d(Y2)*d(Y3)
Optimal ConditionsFound at the levels shown below:
50
Response SurfaceShow the estimated desirability throughout the experimental region.
51
Desirability PlotAir Flow=254.697
3.6 3.9 4.2 4.5 4.8 5.1FPR
31 33 35 37 39 41 43
OPR
220
240
260
280
300
Air
Flow
Desirability0.00.10.20.30.40.50.60.70.80.91.0
Desirability
Problem #8: Time Series Data
Data recorded at equally spaced points in time is called a time series.
Time series models are used for various purposes:
Analysis of trends and seasonal effects
Forecasting
Control
Autocorrelation between adjacent observations requires special models.
52
Data file: customers.sgd
(n=168)
53
Time Sequence PlotPlots the data versus time.
Time Series Plot for Customers
1/96 1/99 1/02 1/05 1/08 1/11Month
73
83
93
103
113(X 1000)
Cus
tom
ers
54
Autocorrelation FunctionEstimates the correlation between observations at different lags.
Estimated Autocorrelations for Customers
0 12 24 36lag
-1
-0.6
-0.2
0.2
0.6
1
Aut
ocor
rela
tions
55
Seasonal DecompositionShows the average value during each season (scaled to 100).
Seasonal Index Plot for Customers
1 2 3 4 5 6 7 8 9 10 11 12season
90
94
98
102
106
110
114se
ason
al in
dex
56
Seasonal Subseries PlotShows the seasonal averages and trend within each season.
Seasonal Subseries Plot for Customers
0 2 4 6 8 10 12 14Season
73
83
93
103
113(X 1000)
Cus
tom
ers
57
Annual Subseries PlotShows the seasonal effect separately for each cycle.
58
Annual Subseries Plot for Customers
0 3 6 9 12 15Season
73
83
93
103
113(X 1000)
Cust
omer
s
Cycle19961997199819992000200120022003200420052006200720082009
Seasonally Adjusted DataRemoves the seasonal effects from the data.
Seasonally Adjusted Data Plot for Customers
1/96 1/99 1/02 1/05 1/08 1/11Month
76
81
86
91
96
101
106(X 1000)
seas
onal
ly a
djus
ted
59
Automatic ForecastingFits many models and automatically selects the best.
60
ARIMA ModelSelected model is a rather complicated seasonal ARIMA model.
Time Sequence Plot for CustomersARIMA(2,1,1)x(1,1,2)12
1/96 1/00 1/04 1/08 1/12Month
73
83
93
103
113
123(X 1000)
Cus
tom
ers
actualforecast95.0% limits
61
Problem #9: Event Rate Modeling
When the data to be analyzed consist of the time at which events occur, the process by which those events are generated is called a point process.
The most common type of point process is a Poisson process, in which the times between events are independent and follow a negative exponential distribution.
If the event rate is constant, the process is called homogeneous. If the event rate changes over time, then the process is called nonhomogeneous.
62
Data file: earthquakes.sgd
(n=52)
63
Plot of Magnitude Versus DateShows an apparent increase in magnitude over the sampling period.
Plot of Magnitude vs Date
1/1/08 12/31/08 12/31/09 12/31/10Date
5.4
5.9
6.4
6.9
7.4
7.9
8.4M
agni
tude
64
Point Process PlotShows the dates of occurrence only.
Events Plot for Date
1/1/08 12/31/08 12/31/09 12/31/10Time or distance
Earthquake
65
Event Rate
66
The critical parameter in a point process is the rate of events per unit time (such as earthquakes per year). This parameter is usually called .
The rate parameter is also related to the mean time between events, with MTBE = 1 /
may be constant (a homogeneous process) or vary over
time (a nonhomogenous
process).
Cumulative Events PlotThe slope of the line is related to the event rate.
67
Mean
Cumulative Events Plot for Date
1/1/08 12/31/08 12/31/09 12/31/10Time or distance
0
20
40
60
80N
umbe
r of
eve
nts
Trend TestA small P-value would indicate a significant trend.
68
Other UsesPoint process models are also very useful for estimating failure rates.
69
Aircraft 2
Aircraft 4
Aircraft 6
Aircraft 8Aircraft 9
Aircraft 10Aircraft 11Aircraft 12Aircraft 13
Events Plot
0 500 1000 1500 2000 2500Time or distance
Aircraft 1
Aircraft 3
Aircraft 5
Aircraft 7
Between Group ComparisonsTests can be made to determine whether there are significant differences.
70
Mean
Cumulative Events Plot
0 500 1000 1500 2000 2500Time or distance
0
5
10
15
20
25
30N
umbe
r of e
vent
s
Aircraft 1Aircraft 2Aircraft 3Aircraft 4Aircraft 5Aircraft 6Aircraft 7Aircraft 8Aircraft 9Aircraft 10Aircraft 11Aircraft 12Aircraft 13
Problem #10: Interactive Maps
Graphics that allow the user to interact with the data are extremely useful.
Maps are one important example.
71
Data file: census2000.sgd (n=51)
72
U.S. Map StatletThe slider changes the cutoff between the red and blue states.
73