javier garcia - verdugo sanchez - six sigma training - w2 correlation and regression
TRANSCRIPT
Correlation and Regression
Week 2
[Figure: scatter plots of Output versus Input]
Knorr-Bremse Group
Overview and Content
With correlation and regression you have a tool available to describe in an easy way the relation between continuous factors (x1, x2 etc.) and continuously measurable results (y).
• Regression and regression coefficient
• Correlation and correlation coefficient
• Fitted Line Plots
• Simple regression
• Multiple regression
Knorr-Bremse Group 07 BB W2 Regression 08, D. Szemkus/H. Winkler Page 2/24
Validation of Factors Y = f (x)
Overview of the validation of single factors against single results

                                Factor X = Input
Result Y = Output               Discrete / Attributive     Continuous / Variable
Discrete / Attributive          Chi-Square                 Logistic Regression
Continuous / Variable           T-Test                     Regression
                                ANOVA (F-Test)
                                Variance Test
Regression
$\hat{y} = b_1 + b_2 x$

$\hat{y}$: The fitted, estimated value of the dependent variable
$b_1$: The zero point shift (intercept)
$b_2$: The slope of the straight line
$e_i = y_i - \hat{y}_i$: The difference between the observed values and the fitted (calculated) values

$b_1 = \frac{\sum_{i=1}^{n} x_i^2 \sum_{i=1}^{n} y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} x_i y_i}{n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2}$

$b_2 = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2}$

[Figure: Regression Plot, Recieving Ch versus Final check]
Recieving Ch = 91,4033 + 0,476288 Final check
S = 6,77854 R-Sq = 69,5 % R-Sq(adj) = 67,9 %
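The sum formulas for the two coefficients can be tried out directly. The following is a minimal sketch, assuming the 21 value pairs listed in the calculation table of this deck are the content of the worksheet; the function name `least_squares` is only illustrative.

```python
# Soften temperature data as listed in this deck
# (x = final check at the supplier, y = receiving check at the customer).
final_check = [168, 209, 177.5, 222.5, 182.5, 227.5, 197.5, 202.5, 173, 214.5, 182.5,
               222.5, 197.5, 232.5, 173, 208.5, 182.5, 222.5, 194, 229.5, 217.5]
receiving_check = [162.5, 187.5, 183.5, 192.5, 187.5, 197.5, 197.5, 182.5, 177.5, 192.5,
                   182.5, 202.5, 187.5, 202.5, 167.5, 197.5, 172.5, 197.5, 176.5, 207.5,
                   182.5]


def least_squares(x, y):
    """Return (b1, b2) = (zero point shift, slope) from the sum formulas."""
    n = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xx = sum(xi * xi for xi in x)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    denominator = n * sum_xx - sum_x ** 2
    b1 = (sum_xx * sum_y - sum_x * sum_xy) / denominator
    b2 = (n * sum_xy - sum_x * sum_y) / denominator
    return b1, b2


b1, b2 = least_squares(final_check, receiving_check)
print(f"Recieving Check = {b1:.4f} + {b2:.6f} Final check")
```

With these values the result should agree with the equation shown on the regression plot.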
Regression
The method of least squares (smallest quadratic deviations) has 4 important properties:
• The sum of the residuals is zero
• The sum of the products of the values of the x variable and the corresponding residuals is equal to zero
• The arithmetic means of the measured Y variable and the theoretically calculated Y variable (fitted values) are equal
• The regression straight line runs through the “center of gravity” of the scatter plot
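The four properties can be checked numerically. This is a small sketch on hypothetical illustration data (the six value pairs are made up and do not come from the deck); any data set fitted by least squares should behave the same way.

```python
# Hypothetical illustration data, not taken from the training files.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.7]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

# Least-squares slope and zero point shift.
slope = (sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
         / sum((xi - x_mean) ** 2 for xi in x))
intercept = y_mean - slope * x_mean

fitted = [intercept + slope * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

sum_residuals = sum(residuals)                                     # property 1
sum_x_residuals = sum(xi * ei for xi, ei in zip(x, residuals))     # property 2
mean_fitted = sum(fitted) / n                                      # property 3
fit_at_center = intercept + slope * x_mean                         # property 4

print(f"sum of residuals:            {sum_residuals:.2e}")
print(f"sum of x * residual:         {sum_x_residuals:.2e}")
print(f"mean of y vs mean of fits:   {y_mean:.4f} vs {mean_fitted:.4f}")
print(f"line at x mean vs y mean:    {fit_at_center:.4f} vs {y_mean:.4f}")
```

The first two sums are zero only up to floating-point rounding, which is why they are printed in scientific notation.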
Which statement can we make about the significance of the relation?
Regression Example
An example: The results show the softening temperature measured during the final check at the supplier and during the receiving check at the customer. The values of two different plastic types are included in the two columns.
File: Soften temperature.mtw
Stat > Regression > Fitted Line Plot…

[Figure: Fitted Line Plot, Recieving Check versus Final check]
Recieving Check = 91,40 + 0,4763 Final check
S 6,77854   R-Sq 69,5%   R-Sq(adj) 67,9%

Final check   Recieving Check   Material
168           162,5             1
209           187,5             2
177,5         183,5             1
222,5         192,5             2
182,5         187,5             1
227,5         197,5             2
197,5         197,5             2
202,5         182,5             2
173           177,5             1
214,5         192,5             2
182,5         182,5             1
222,5         202,5             2
197,5         187,5             2
Regression
In the session window we also get the regression equation.
In addition, the significance is calculated by the analysis of variance.

Regression Analysis: Recieving Check versus Final check
The regression equation is
Recieving Check = 91,4 + 0,476 Final check
S = 6,77854   R-Sq = 69,5%   R-Sq(adj) = 67,9%

Analysis of Variance
Source           DF   SS       MS       F       P
Regression        1   1989,0   1989,0   43,29   0,000
Residual Error   19    873,0     45,9
Total            20   2862,0
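The sums of squares in the analysis of variance can be recomputed by hand. The sketch below assumes the 21 value pairs from the calculation table of this deck; it stops at the F value, because the P value needs the F distribution, which the Python standard library does not provide.

```python
# Soften temperature data as listed in this deck.
final_check = [168, 209, 177.5, 222.5, 182.5, 227.5, 197.5, 202.5, 173, 214.5, 182.5,
               222.5, 197.5, 232.5, 173, 208.5, 182.5, 222.5, 194, 229.5, 217.5]
receiving_check = [162.5, 187.5, 183.5, 192.5, 187.5, 197.5, 197.5, 182.5, 177.5, 192.5,
                   182.5, 202.5, 187.5, 202.5, 167.5, 197.5, 172.5, 197.5, 176.5, 207.5,
                   182.5]

n = len(final_check)
x_mean = sum(final_check) / n
y_mean = sum(receiving_check) / n

# Least-squares line of the receiving check on the final check.
sxx = sum((x - x_mean) ** 2 for x in final_check)
sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(final_check, receiving_check))
slope = sxy / sxx
intercept = y_mean - slope * x_mean
fitted = [intercept + slope * x for x in final_check]

# Split the total variation into an explained part and a residual part.
ss_total = sum((y - y_mean) ** 2 for y in receiving_check)
ss_error = sum((y - f) ** 2 for y, f in zip(receiving_check, fitted))
ss_regression = ss_total - ss_error

df_regression = 1
df_error = n - 2
ms_regression = ss_regression / df_regression
ms_error = ss_error / df_error
f_value = ms_regression / ms_error

print(f"Regression      DF {df_regression:2d}  SS {ss_regression:7.1f}  MS {ms_regression:7.1f}  F {f_value:.2f}")
print(f"Residual Error  DF {df_error:2d}  SS {ss_error:7.1f}  MS {ms_error:7.1f}")
print(f"Total           DF {n - 1:2d}  SS {ss_total:7.1f}")
```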
R² and R² adj.: Practical Significance
• R² is a statistical measure that shows the practical significance of an effect.
$R^2 = \frac{SS_{Regression}}{SS_{Total}} = \frac{1989}{2862} = 0{,}695$
Explained variation (SS Regression) divided by the total variation (SS Total). Approximately 70% of the variation is explained by the regression model.
• R² adj. is a similar measure of the practical significance of an effect. It is helpful if we use several factors in a model. E.g. R² adj. can get smaller if an additional factor is added to the model, because every reduction of SS Error can be offset by the loss of degrees of freedom. The values for R² adj. are always a little bit smaller than for R².
$R^2_{adj} = 1 - \frac{MS_{Error}}{SS_{Total} / DF_{Total}} = 1 - \frac{45{,}95}{2862 / 20} = 0{,}68$
• S is the pooled standard deviation (averaged within-group variation). S is the square root of MS Error.
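The three key figures follow from the analysis of variance table by simple arithmetic. A short sketch, using only the rounded values from the Minitab output above (so the last digits can differ slightly from Minitab):

```python
import math

# Values taken from the Analysis of Variance table of the soften temperature example.
ss_regression = 1989.0
ss_error = 873.0
ss_total = 2862.0
df_error = 19
df_total = 20

r_sq = ss_regression / ss_total                  # share of explained variation
ms_error = ss_error / df_error                   # residual variance
r_sq_adj = 1 - ms_error / (ss_total / df_total)  # corrected for degrees of freedom
s = math.sqrt(ms_error)                          # S is the square root of MS Error

print(f"R-Sq      = {r_sq:.1%}")
print(f"R-Sq(adj) = {r_sq_adj:.1%}")
print(f"S         = {s:.4f}")
```

The value of S obtained this way agrees with S = 6,77854 in the session window, which confirms that S is the square root of MS Error and not the other way round.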
Correlation
• Correlation is a measure for the strength of the relation between two quantitative variables (e.g. measurement at supplier and customer).
• Correlation measures the degree of linearity between two variables.
• The value of the correlation coefficient r ranges between -1 and 1.
• Rule: A correlation > 0,80 or < -0,80 is significant, a correlation between -0,80 and 0,80 is not significant.
• Let's have a look at the example soften temperature.
Covariance:
$s_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$
Correlation coefficient:
$r = \frac{s_{xy}}{s_x s_y} = \frac{1}{n-1} \sum_{i=1}^{n} \frac{(x_i - \bar{x})(y_i - \bar{y})}{s_x s_y}$
The Calculation
The calculation of the covariance and correlation coefficient

Final Insp   Incoming Insp   Yi - Ymean   Xi - Xmean   Covariance   r
168          162,5           -25          -33,8        844          3,37
209          187,5           0            7,2          0            0,00
177,5        183,5           -4           -24,3        97           0,39
222,5        192,5           5            20,7         104          0,41
182,5        187,5           0            -19,3        0            0,00
227,5        197,5           10           25,7         257          1,03
197,5        197,5           10           -4,3         -43          -0,17
202,5        182,5           -5           0,7          -4           -0,01
173          177,5           -10          -28,8        288          1,15
214,5        192,5           5            12,7         64           0,25
182,5        182,5           -5           -19,3        96           0,38
222,5        202,5           15           20,7         311          1,24
197,5        187,5           0            -4,3         0            0,00
232,5        202,5           15           30,7         461          1,84
173          167,5           -20          -28,8        575          2,30
208,5        197,5           10           6,7          67           0,27
182,5        172,5           -15          -19,3        289          1,15
222,5        197,5           10           20,7         207          0,83
194          176,5           -11          -7,8         85           0,34
229,5        207,5           20           27,7         555          2,21
217,5        182,5           -5           15,7         -79          -0,31

Sum                                                    4176         16,67
Mean         201,8           187,5
Stdev        20,9            12,0
Covariance = 4176 / 20 = 208,8
r = 16,67 / 20 = 0,83

The Covariance column holds the product of the two deviations in each row; the r column holds this product divided by the product of the two standard deviations. Both sums are divided by n - 1 = 20.
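The table calculation can be repeated in a few lines. This sketch assumes the 21 value pairs shown in the table and uses only the Python standard library; the variable names are illustrative.

```python
import statistics

# Value pairs from the calculation table (x = final inspection, y = incoming inspection).
final_insp = [168, 209, 177.5, 222.5, 182.5, 227.5, 197.5, 202.5, 173, 214.5, 182.5,
              222.5, 197.5, 232.5, 173, 208.5, 182.5, 222.5, 194, 229.5, 217.5]
incoming_insp = [162.5, 187.5, 183.5, 192.5, 187.5, 197.5, 197.5, 182.5, 177.5, 192.5,
                 182.5, 202.5, 187.5, 202.5, 167.5, 197.5, 172.5, 197.5, 176.5, 207.5,
                 182.5]

n = len(final_insp)
x_mean = statistics.mean(final_insp)
y_mean = statistics.mean(incoming_insp)

# Row-wise products of the deviations (the "Covariance" column of the table).
products = [(x - x_mean) * (y - y_mean) for x, y in zip(final_insp, incoming_insp)]

covariance = sum(products) / (n - 1)
s_x = statistics.stdev(final_insp)      # sample standard deviation, divisor n - 1
s_y = statistics.stdev(incoming_insp)
r = covariance / (s_x * s_y)

print(f"covariance = {covariance:.1f}")
print(f"s_x = {s_x:.1f}, s_y = {s_y:.1f}")
print(f"r = {r:.3f}, r^2 = {r * r:.3f}")
```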
Calculation in Minitab
File: Soften temperature.mtw
Stat > Basic Statistics > Correlation…
Correlation of the final check and receiving check: r = 0,834
r² = 0,695
Exercise: Simulated Data
• We generate two columns with 50 random numbers each and correlate these values.
  Calc > Random Data > Normal…
  – Mean: 100
  – Standard deviation: 10
• Which value do we expect for the correlation?
• Investigate the correlation.
  Stat > Basic Statistics > Correlation…
  – Does the correlation correspond to our expectations?
• Use the Fitted Line Plot function and investigate r².
  Stat > Regression > Fitted Line Plot…
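For readers without Minitab at hand, the exercise can be imitated in Python. This is only a sketch of the same idea: the seed is an arbitrary choice for reproducibility, and the helper `pearson_r` is written out here rather than taken from a library.

```python
import math
import random


def pearson_r(x, y):
    """Correlation coefficient r of two equally long lists."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    syy = sum((yi - y_mean) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


random.seed(1)  # arbitrary seed, only to make the run repeatable

# Two independent columns with 50 normally distributed values each.
column_1 = [random.gauss(100, 10) for _ in range(50)]
column_2 = [random.gauss(100, 10) for _ in range(50)]

r = pearson_r(column_1, column_2)
print(f"r = {r:.3f}, r^2 = {r * r:.3f}")
```

Because the two columns are generated independently, r is expected to be close to zero, but it is rarely exactly zero. Changing the seed shows how much r scatters for n = 50.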
Examples for Positive Correlation
[Figure: three fitted line plots of Output versus Input]
Strong Positive Correlation: Output = 57,39 + 0,1732 Input; S 0,838232, R-Sq 93,2%, R-Sq(adj) 93,1%
Moderate Positive Correlation: Output = 53,28 + 0,2109 Input; S 5,18519, R-Sq 34,6%, R-Sq(adj) 33,5%
Weak Positive Correlation: Output = 58,31 + 0,1635 Input; S 10,4391, R-Sq 7,3%, R-Sq(adj) 5,7%
Examples for Negative Correlation
[Figure: three fitted line plots of Output versus Input]
Strong Negative Correlation: Output = 56,48 - 0,1786 Input; S 1,16327, R-Sq 88,3%, R-Sq(adj) 88,1%
Moderate Negative Correlation: Output = 58,46 - 0,1999 Input; S 4,44849, R-Sq 39,3%, R-Sq(adj) 38,3%
Weak Negative Correlation: Output = 57,34 - 0,1813 Input; S 8,74951, R-Sq 12,1%, R-Sq(adj) 10,6%
How large should the Coefficient „r“ be?
Compare your correlation value with the value in the table according to your sample size. If the value is larger than the one noted in the table, the correlation is “important” or statistically significant.

$r = \frac{t_\alpha}{\sqrt{n - 2 + t_\alpha^2}}$   or   $t_\alpha = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}}$

Attention! With big sample sizes, r-values < 0,8 are also significant. Be aware that the risk of misinterpretation is relatively high here.

Sample size   d.f.   Significance level
n             n-2    0,05     0,025    0,01     0,005
3             1      0,9877   0,9969   0,9995   0,9999
4             2      0,9000   0,9500   0,9800   0,9900
5             3      0,8054   0,8783   0,9343   0,9587
6             4      0,7293   0,8114   0,8822   0,9172
7             5      0,6694   0,7545   0,8329   0,8745
8             6      0,6215   0,7067   0,7887   0,8343
9             7      0,5822   0,6664   0,7498   0,7977
10            8      0,5494   0,6319   0,7155   0,7646
11            9      0,5214   0,6021   0,6851   0,7348
12            10     0,4973   0,5760   0,6581   0,7079
13            11     0,4762   0,5529   0,6339   0,6835
14            12     0,4575   0,5324   0,6120   0,6614
15            13     0,4409   0,5140   0,5923   0,6411
16            14     0,4259   0,4973   0,5742   0,6226
17            15     0,4124   0,4821   0,5577   0,6055
18            16     0,4000   0,4683   0,5425   0,5897
19            17     0,3887   0,4555   0,5285   0,5751
20            18     0,3783   0,4438   0,5155   0,5614
21            19     0,3687   0,4329   0,5034   0,5487
22            20     0,3598   0,4227   0,4921   0,5368
27            25     0,3233   0,3809   0,4451   0,4869
32            30     0,2960   0,3494   0,4093   0,4487
37            35     0,2746   0,3246   0,3810   0,4182
42            40     0,2573   0,3044   0,3578   0,3932
47            45     0,2429   0,2876   0,3384   0,3721
52            50     0,2306   0,2732   0,3218   0,3542
62            60     0,2108   0,2500   0,2948   0,3248
72            70     0,1954   0,2319   0,2737   0,3017
82            80     0,1829   0,2172   0,2565   0,2830
92            90     0,1726   0,2050   0,2422   0,2673
102           100    0,1638   0,1946   0,2301   0,2540
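The two formulas convert between a t value and a critical r value, so a table row can be reproduced from a t table. A minimal sketch: the t quantile 2.093 for 19 degrees of freedom is an assumption taken from a standard t table (it does not appear in this deck), and the function names are illustrative.

```python
import math


def r_from_t(t, n):
    """Critical correlation coefficient for a given t value and sample size n."""
    return t / math.sqrt(n - 2 + t * t)


def t_from_r(r, n):
    """t value that belongs to an observed correlation coefficient r."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


n = 21           # sample size of the soften temperature example
t_crit = 2.093   # assumed t quantile, 19 d.f., level 0,025 (from a standard t table)

r_crit = r_from_t(t_crit, n)
r_observed = 0.834
t_observed = t_from_r(r_observed, n)

print(f"critical r for n = {n}: {r_crit:.4f}")
print(f"observed r = {r_observed} gives t = {t_observed:.2f}")
print("significant" if abs(r_observed) > r_crit else "not significant")
```

The critical value obtained this way should match the table entry for n = 21 in the 0,025 column.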
Avoid Quick Conclusions
If y and x1 correlate well, that does not necessarily mean that a variation of x1 will cause a variation of y.
A third variable could be in the background which is responsible for the change of x as well as of y.
An example from production shows a strong negative correlation between the pressure (x) and yield (y) in a reactor, but…
There are contaminations (x2), which are not measured and vary from process cycle to process cycle.
– Contamination is causing foaming
– Contamination is causing poor yield
– The pressure is used to reduce the foam build-up
– The pressure is a reaction to the foam build-up and has no effect on the yield
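The reactor situation can be imitated with simulated data. The sketch below is purely hypothetical: the numbers, the noise levels and the seed are invented for illustration. Yield is generated from contamination only, yet pressure and yield still show a strong negative correlation, which disappears once the contamination is taken out of both variables.

```python
import math
import random


def pearson_r(x, y):
    """Correlation coefficient r of two equally long lists."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    syy = sum((yi - y_mean) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


def residuals_after_fit(x, y):
    """Residuals of the least-squares straight line of y on x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    slope = (sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
             / sum((xi - x_mean) ** 2 for xi in x))
    intercept = y_mean - slope * x_mean
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


random.seed(7)  # arbitrary seed for a repeatable run
cycles = 200

# Hidden variable x2: contamination per process cycle (hypothetical scale 0..1).
contamination = [random.uniform(0, 1) for _ in range(cycles)]
# Pressure is raised as a reaction to foaming, which grows with contamination.
pressure = [1.0 + 2.0 * c + random.gauss(0, 0.1) for c in contamination]
# Yield depends on contamination only; pressure does not enter this formula.
yield_pct = [90.0 - 10.0 * c + random.gauss(0, 0.5) for c in contamination]

r_raw = pearson_r(pressure, yield_pct)
r_without_contamination = pearson_r(residuals_after_fit(contamination, pressure),
                                    residuals_after_fit(contamination, yield_pct))

print(f"pressure vs yield:                       r = {r_raw:.3f}")
print(f"after removing the contamination effect: r = {r_without_contamination:.3f}")
```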
Another Example
• Open the file: MYSTERY.MTW
• Calculate the correlation.
• Is there a correlation between the two variables?
• Create a plot for both variables.
• What is your conclusion for the correlation?
[Figure: Scatterplot of Output vs Input]
Simple Regression
Correlation describes the linear dependence of two variables; regression defines this relation in more detail.
Regression leads to an equation, which uses one (or more) variables to explain the variation of the output variable.
Stat > Regression > Regression…
Performs simple and multiple regression
Stat > Regression > Fitted Line Plot…
Scatter plot, fitted line, equation and r²
Stat > Regression > Residuals Plots…
Stores the residuals of the "Regression" or "Fitted Line Plot"
Checks the basic assumptions about the behavior of the residuals
Summary
• Correlation is a useful tool to describe dependencies during many improvement activities.
• Correlation is the measure of the linear relation between two quantitative variables.
• Avoid drawing conclusions about causes too quickly.
• Correlation is the basis for the regression method.
• Regression describes the relation of the variables in more detail and provides an equation as a model.
Appendix
Further Examples
Example: Retailer Sales and Cost of Production
File: Sales.mtw
A retailer chain wants to investigate how sales depend on the shop location (Area) and the passerby frequency.
What kind of relations can you describe?

Area   Frequency   Sales
310    10240       2930
980    7510        5270
1210   10810       6850
1290   9890        7010
1120   13720       7020
1490   13920       8350
780    8540        4330
940    12360       5770
1290   12270       7680
480    11010       3160
240    8250        1520
550    9310        3150

File: Cost.mtw
The table shows the fixed production costs and the number of units over 10 months.
Determine the favorable production size.

Units   Cost
3200    32200
4100    32700
10700   70100
8700    48200
6500    38600
9400    55400
11200   77200
1400    24300
6000    37500
4200    34000
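One possible way to start on the retailer example is to compare how strongly each factor correlates with the sales and then fit a straight line for the stronger one. This is only a sketch with the 12 rows of the Sales table typed in by hand; the helper names are illustrative and the approach is not necessarily the one intended by the authors.

```python
import math

# Rows of Sales.mtw as listed in the table.
area = [310, 980, 1210, 1290, 1120, 1490, 780, 940, 1290, 480, 240, 550]
frequency = [10240, 7510, 10810, 9890, 13720, 13920, 8540, 12360, 12270, 11010, 8250, 9310]
sales = [2930, 5270, 6850, 7010, 7020, 8350, 4330, 5770, 7680, 3160, 1520, 3150]


def pearson_r(x, y):
    """Correlation coefficient r of two equally long lists."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    syy = sum((yi - y_mean) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


def fit_line(x, y):
    """Least-squares (intercept, slope) of y on x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    slope = (sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
             / sum((xi - x_mean) ** 2 for xi in x))
    return y_mean - slope * x_mean, slope


r_area = pearson_r(area, sales)
r_frequency = pearson_r(frequency, sales)
intercept, slope = fit_line(area, sales)

print(f"r(Area, Sales)      = {r_area:.3f}")
print(f"r(Frequency, Sales) = {r_frequency:.3f}")
print(f"Sales = {intercept:.1f} + {slope:.3f} Area")
```

The fitted equation for Area can be compared with the fitted line plot in the residual diagnosis example at the end of this deck.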
Example: Salary
File: Salery.mtw
Evaluate the factors: which of them has the strongest effect on salary?

Salary   Year in the job   Company years   Education   Age   Pers. No.   Sex   Sex Group
38985    18                7               9           52    412         M     0
28938    12                5               4           39    517         F     1
32920    15                3               9           45    458         F     1
29548    5                 6               1           30    604         M     0
31138    11                11              6           46    562         F     1
24749    6                 2               0           26    598         F     1
41889    22                16              7           63    351         M     0
31528    3                 11              3           35    674         M     0
38791    21                4               5           48    356         M     0
39828    18                6               5           47    415         F     1
The Mystery Example
If we use Stat > Regression > Fitted Line Plot > Linear we get…
[Figure: Fitted Line Plot, Output versus Input]
Output = 1,145 - 0,4340 Input
S 1,69190   R-Sq 6,4%   R-Sq(adj) 5,4%

If we use Stat > Regression > Fitted Line Plot > Quadratic we get a strong (non-linear) relation.
[Figure: Fitted Line Plot, Output versus Input, with 95% CI]
Output = 0,1401 + 0,0413 Input + 1,025 Input**2
S 1,02499   R-Sq 66,0%   R-Sq(adj) 65,3%
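The effect behind the mystery example can be reproduced with idealized data. The sketch below does not use MYSTERY.MTW; it assumes a noise-free, hypothetical parabola on a symmetric input range, which is the extreme case: the linear correlation vanishes although the output is completely determined by the input.

```python
import math


def pearson_r(x, y):
    """Correlation coefficient r of two equally long lists."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    syy = sum((yi - y_mean) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: inputs from -2 to 2 in steps of 0.25, output = input squared.
inputs = [-2.0 + 0.25 * i for i in range(17)]
outputs = [x ** 2 for x in inputs]

# Linear view: correlate the output with the input itself.
r_linear = pearson_r(inputs, outputs)

# Quadratic view: correlate the output with the squared input.
squared_inputs = [x ** 2 for x in inputs]
r_quadratic = pearson_r(squared_inputs, outputs)

print(f"r(Input, Output)    = {r_linear:.3f}")
print(f"r(Input**2, Output) = {r_quadratic:.3f}")
```

A correlation coefficient near zero therefore only says that there is no linear relation; a plot of the data is needed before concluding that there is no relation at all.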
Example: Retailer Sales
Diagnosis at regression
Stat > Regression > Residual Plot…

[Figure: Fitted Line Plot, Sales versus Area]
Sales = 605,7 + 5,222 Area
S 408,182   R-Sq 96,9%   R-Sq(adj) 96,6%

Evaluation as for ANOVA
[Figure: Residual Plots for Sales: Normal Probability Plot of the Residuals; Residuals Versus the Fitted Values; Histogram of the Residuals; Residuals Versus the Order of the Data]

Minitab needs the residuals and the fits in one column. Storage of residuals and fits is possible during every evaluation.
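The fits and residuals that Minitab stores for the residual plots can also be computed by hand. A minimal sketch, assuming the 12 rows of Area and Sales from the Sales.mtw table in this appendix; plotting is left out and only the stored columns and the key figures are calculated.

```python
import math

# Area and Sales from the Sales.mtw table.
area = [310, 980, 1210, 1290, 1120, 1490, 780, 940, 1290, 480, 240, 550]
sales = [2930, 5270, 6850, 7010, 7020, 8350, 4330, 5770, 7680, 3160, 1520, 3150]

n = len(area)
x_mean = sum(area) / n
y_mean = sum(sales) / n
slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(area, sales))
         / sum((x - x_mean) ** 2 for x in area))
intercept = y_mean - slope * x_mean

# The two columns used by the residual plots.
fits = [intercept + slope * x for x in area]
residuals = [y - f for y, f in zip(sales, fits)]

ss_error = sum(e ** 2 for e in residuals)
ss_total = sum((y - y_mean) ** 2 for y in sales)
s = math.sqrt(ss_error / (n - 2))
r_sq = 1 - ss_error / ss_total
r_sq_adj = 1 - (ss_error / (n - 2)) / (ss_total / (n - 1))

print(f"S = {s:.3f}   R-Sq = {r_sq:.1%}   R-Sq(adj) = {r_sq_adj:.1%}")
for order, (f, e) in enumerate(zip(fits, residuals), start=1):
    print(f"{order:2d}  fit {f:8.1f}  residual {e:8.1f}")
```

The printed residuals, listed in observation order, are the values behind the four residual plots; patterns in them (trends, funnels, outliers) are what the diagnosis looks for.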