Javier Garcia - Verdugo Sanchez - Six Sigma Training - W2 Correlation and Regression


Upload: j-garcia-verdugo

Post on 13-Apr-2017


Page 1: Javier Garcia - Verdugo Sanchez - Six Sigma Training - W2 Correlation and Regression

Correlation and Regression

[Figure: scatter plot of Output versus Input with fitted regression line]

Week 2

Knorr-Bremse Group

Overview and Content

With correlation and regression you have a tool available to describe in an easy way the relation between continuous factors (x1, x2 etc.) and continuously measurable results (y).

• Regression and regression coefficient

• Correlation and correlation coefficient

• Fitted Line Plots

• Simple regression

• Multiple regression

Knorr-Bremse Group 07 BB W2 Regression 08, D. Szemkus/H. Winkler Page 2/24


Validation of Factors Y = f (x)

Overview of the validation of single factors against single results

                           Factor X = Input
Result Y = Output          Discrete / Attributive       Continuous / Variable
Discrete / Attributive     Chi-Square                   Logistic Regression
Continuous / Variable      T-Test                       Regression
                           ANOVA (F-Test)
                           Variance Test

Regression

$$\hat{y} = b_1 + b_2 x$$

• $\hat{y}$: the fitted, estimated value of the dependent variable

• $b_1$: the zero point shift (intercept)

• $b_2$: the slope of the straight line

• $e_i$: the difference between the fitted (calculated) values and the observed values

$$b_1 = \frac{\sum_{i=1}^{n} x_i^2 \sum_{i=1}^{n} y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} x_i y_i}{n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}$$

$$b_2 = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}$$

[Figure: Regression Plot, Recieving Ch versus Final check. Recieving Ch = 91,4033 + 0,476288 Final check; S = 6,77854, R-Sq = 69,5 %, R-Sq(adj) = 67,9 %]

Regression

The method of least squares has 4 important properties:

• The sum of the residuals is zero

• The sum of the products of the x values and the corresponding residuals is zero

• The arithmetic means of the measured Y variable and of the calculated Y variable (fitted values) are equal

• The regression line runs through the “center of gravity” of the scatter plot
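The four properties can be checked numerically. The following is a minimal sketch, assuming the 21 softening-temperature pairs from this deck's calculation table (x = final check, y = receiving check) and the closed-form sums for b1 and b2 from the Regression slide. The variable names are illustrative and not part of the training material.

```python
# Softening-temperature pairs taken from the deck's calculation table
# (x = final check at the supplier, y = receiving check at the customer).
x = [168, 209, 177.5, 222.5, 182.5, 227.5, 197.5, 202.5, 173, 214.5, 182.5,
     222.5, 197.5, 232.5, 173, 208.5, 182.5, 222.5, 194, 229.5, 217.5]
y = [162.5, 187.5, 183.5, 192.5, 187.5, 197.5, 197.5, 182.5, 177.5, 192.5, 182.5,
     202.5, 187.5, 202.5, 167.5, 197.5, 172.5, 197.5, 176.5, 207.5, 182.5]

n = len(x)
sum_x = sum(x)
sum_y = sum(y)
sum_xx = sum(xi * xi for xi in x)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))

# Closed-form least-squares coefficients (b1 = intercept, b2 = slope).
denominator = n * sum_xx - sum_x ** 2
b1 = (sum_xx * sum_y - sum_x * sum_xy) / denominator
b2 = (n * sum_xy - sum_x * sum_y) / denominator

fitted = [b1 + b2 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# Property 1: the residuals sum to zero.
sum_residuals = sum(residuals)
# Property 2: the products x_i * e_i sum to zero.
sum_x_times_residuals = sum(xi * ei for xi, ei in zip(x, residuals))
# Property 3: mean of observed y equals mean of fitted y.
mean_y = sum_y / n
mean_fitted = sum(fitted) / n
# Property 4: the line passes through (mean of x, mean of y).
y_at_mean_x = b1 + b2 * (sum_x / n)

print(f"b1 = {b1:.4f}, b2 = {b2:.6f}")
print(f"sum of residuals        = {sum_residuals:.2e}")
print(f"sum of x * residuals    = {sum_x_times_residuals:.2e}")
print(f"mean y vs. mean fitted  = {mean_y:.4f} vs. {mean_fitted:.4f}")
print(f"fitted value at mean x  = {y_at_mean_x:.4f}")
```

The two sums are zero only up to floating-point rounding, so the sketch prints them in scientific notation instead of comparing them with exactly 0.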

Which statement can we make about the significance of the relation?


Regression Example

An example: The results show the softening temperature measured during the final check at the supplier and during the receiving check at the customer. The values of two different plastic types are included in the two columns.

File: Soften temperature.mtw

Stat
>Regression
>Fitted Line Plot…

[Figure: Fitted Line Plot, Recieving Check versus Final check. Recieving Check = 91,40 + 0,4763 Final check; S 6,77854, R-Sq 69,5%, R-Sq(adj) 67,9%]

Worksheet excerpt:

Final check   Recieving Check   Material
168           162,5             1
209           187,5             2
177,5         183,5             1
222,5         192,5             2
182,5         187,5             1
227,5         197,5             2
197,5         197,5             2
202,5         182,5             2
173           177,5             1
214,5         192,5             2
182,5         182,5             1
222,5         202,5             2
197,5         187,5             2

Regression

The session window also shows the regression equation. In addition, the significance is assessed by the analysis of variance.

Regression Analysis: Recieving Check versus Final check

The regression equation is
Recieving Check = 91,4 + 0,476 Final check

S = 6,77854   R-Sq = 69,5%   R-Sq(adj) = 67,9%

Analysis of Variance

Source           DF      SS      MS      F      P
Regression        1  1989,0  1989,0  43,29  0,000
Residual Error   19   873,0    45,9
Total            20  2862,0
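As a cross-check, the MS and F columns follow from the SS and DF columns. The worked sketch below uses only the numbers printed in the session output; the subscript names are illustrative.

```latex
MS_{Regression} = \frac{SS_{Regression}}{DF_{Regression}} = \frac{1989,0}{1} = 1989,0
\qquad
MS_{Error} = \frac{SS_{Error}}{DF_{Error}} = \frac{873,0}{19} \approx 45,9

F = \frac{MS_{Regression}}{MS_{Error}} = \frac{1989,0}{873,0 / 19} \approx 43,29
```

A large F value with P = 0,000 means that the slope differs significantly from zero.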


R² and R² adj.: Practical Significance

• R² is a statistical measure of the practical significance of an effect.

$$R^2 = \frac{SS_{Regression}}{SS_{Total}} = \frac{1989}{2862} = 0,695$$

Explained variation (SS Regression) divided by the total variation (SS Total). Approximately 70% of the variation is explained by the regression.

• R² adj. is a similar measure of the practical significance of an effect. It is helpful if we use several factors in a model. For example, R² adj. can get smaller when an additional factor is added to the model, because any reduction of SS Error can be offset by the loss of degrees of freedom. The values for R² adj. are always a little bit smaller than for R².

$$R^2_{adj} = 1 - \frac{MS_{Error}}{SS_{Total} / DF_{Total}} = 1 - \frac{45,95}{2862 / 20} = 0,68$$

• S is the pooled standard deviation (averaged within-group variation). S is the square root of MS Error.
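The three summary values can be recomputed from the analysis-of-variance table. This is a minimal sketch that uses only the rounded SS and DF values printed in the session output, so the last digits may differ slightly from Minitab's display.

```python
import math

# Values copied from the Analysis of Variance table in the session output.
ss_regression = 1989.0
ss_error = 873.0
ss_total = 2862.0
df_error = 19
df_total = 20

# R-Sq: explained variation divided by total variation.
r_sq = ss_regression / ss_total

# R-Sq(adj): penalizes the model for the degrees of freedom it uses.
ms_error = ss_error / df_error
r_sq_adj = 1 - ms_error / (ss_total / df_total)

# S: the square root of MS Error.
s = math.sqrt(ms_error)

print(f"R-Sq = {r_sq:.1%}, R-Sq(adj) = {r_sq_adj:.1%}, S = {s:.2f}")
```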


Correlation

• Correlation is a measure of the strength of the relationship between two quantitative variables (e.g. measurement at supplier and customer).

• Correlation measures the degree of linearity between two variables.

• The value of the correlation coefficient r ranges between -1 and 1.

• Rule: A correlation > 0,80 or < -0,80 is significant; a correlation between -0,80 and 0,80 is not significant.

• Let's have a look at the softening temperature example.

Covariance:

$$s_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$

Correlation coefficient:

$$r_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} \frac{(x_i - \bar{x})(y_i - \bar{y})}{s_x s_y}$$

The Calculation

The calculation of the covariance and correlation coefficient

Final Insp   Incoming Insp   Yi - Ymean   Xi - Xmean   Covariance   r
168          162,5           -25          -33,8        844          3,37
209          187,5           0            7,2          0            0,00
177,5        183,5           -4           -24,3        97           0,39
222,5        192,5           5            20,7         104          0,41
182,5        187,5           0            -19,3        0            0,00
227,5        197,5           10           25,7         257          1,03
197,5        197,5           10           -4,3         -43          -0,17
202,5        182,5           -5           0,7          -4           -0,01
173          177,5           -10          -28,8        288          1,15
214,5        192,5           5            12,7         64           0,25
182,5        182,5           -5           -19,3        96           0,38
222,5        202,5           15           20,7         311          1,24
197,5        187,5           0            -4,3         0            0,00
232,5        202,5           15           30,7         461          1,84
173          167,5           -20          -28,8        575          2,30
208,5        197,5           10           6,7          67           0,27
182,5        172,5           -15          -19,3        289          1,15
222,5        197,5           10           20,7         207          0,83
194          176,5           -11          -7,8         85           0,34
229,5        207,5           20           27,7         555          2,21
217,5        182,5           -5           15,7         -79          -0,31

Mean         201,8           187,5
Stdev        20,9            12,0
Sum                                                    4176         16,67
Covariance   208,8
r            0,83
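The table can be reproduced in a few lines. The sketch below assumes the 21 value pairs listed above and follows the same steps: deviations from the means, their products, the covariance, and finally r. The variable names are illustrative.

```python
import math

# Value pairs from the calculation table (final vs. incoming inspection).
final_insp = [168, 209, 177.5, 222.5, 182.5, 227.5, 197.5, 202.5, 173, 214.5,
              182.5, 222.5, 197.5, 232.5, 173, 208.5, 182.5, 222.5, 194, 229.5,
              217.5]
incoming_insp = [162.5, 187.5, 183.5, 192.5, 187.5, 197.5, 197.5, 182.5, 177.5,
                 192.5, 182.5, 202.5, 187.5, 202.5, 167.5, 197.5, 172.5, 197.5,
                 176.5, 207.5, 182.5]

n = len(final_insp)
mean_x = sum(final_insp) / n
mean_y = sum(incoming_insp) / n

# Column "Covariance" of the table: product of the two deviations per row.
products = [(xi - mean_x) * (yi - mean_y)
            for xi, yi in zip(final_insp, incoming_insp)]

# Sample covariance and sample standard deviations (divisor n - 1).
covariance = sum(products) / (n - 1)
s_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in final_insp) / (n - 1))
s_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in incoming_insp) / (n - 1))

# Correlation coefficient: covariance scaled by both standard deviations.
r = covariance / (s_x * s_y)

print(f"mean x = {mean_x:.1f}, mean y = {mean_y:.1f}")
print(f"s_x = {s_x:.1f}, s_y = {s_y:.1f}")
print(f"covariance = {covariance:.1f}, r = {r:.3f}, r^2 = {r * r:.3f}")
```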


Calculation in Minitab

File: Soften temperature.mtw

Stat
>Basic Statistics
>Correlation…

Correlation of the final check and receiving check: r = 0,834

r² = 0,695

Exercise: Simulated Data

• We generate two columns with 50 random numbers each and correlate these values.

Calc
>Random Data
>Normal…

– Mean: 100

– Standard deviation: 10

• Which value do we expect for the correlation?

• Investigate the correlation.

Stat
>Basic Statistics
>Correlation…

– Does the correlation correspond to our expectations?

• Use the Fitted Line Plot function and investigate r².

Stat
>Regression
>Fitted Line Plot…
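Outside Minitab, the same exercise can be sketched with Python's standard library. The seed and the helper name `pearson_r` are assumptions made only to keep the sketch repeatable and self-contained. Because the two columns are generated independently, r should be close to 0, though it will rarely be exactly 0.

```python
import math
import random

random.seed(1)  # arbitrary seed (assumption), only for repeatable output

# Two independent columns of 50 normal random numbers (mean 100, st. dev. 10).
column_1 = [random.gauss(100, 10) for _ in range(50)]
column_2 = [random.gauss(100, 10) for _ in range(50)]


def pearson_r(xs, ys):
    """Correlation coefficient of two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


r = pearson_r(column_1, column_2)
print(f"r = {r:.3f}, r^2 = {r * r:.3f}")
```

Re-running with different seeds shows how much r scatters around 0 for n = 50. That scatter is the reason a critical value depending on the sample size is needed.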


Examples for Positive Correlation

[Figure: Strong Positive Correlation, Output versus Input. Output = 57,39 + 0,1732 Input; S 0,838232, R-Sq 93,2%, R-Sq(adj) 93,1%]

[Figure: Moderate Positive Correlation, Output versus Input. Output = 53,28 + 0,2109 Input; S 5,18519, R-Sq 34,6%, R-Sq(adj) 33,5%]

[Figure: Weak Positive Correlation, Output versus Input. Output = 58,31 + 0,1635 Input; S 10,4391, R-Sq 7,3%, R-Sq(adj) 5,7%]

Examples for Negative Correlation

[Figure: Strong Negative Correlation, Output versus Input. Output = 56,48 - 0,1786 Input; S 1,16327, R-Sq 88,3%, R-Sq(adj) 88,1%]

[Figure: Moderate Negative Correlation, Output versus Input. Output = 58,46 - 0,1999 Input; S 4,44849, R-Sq 39,3%, R-Sq(adj) 38,3%]

[Figure: Weak Negative Correlation, Output versus Input. Output = 57,34 - 0,1813 Input; S 8,74951, R-Sq 12,1%, R-Sq(adj) 10,6%]

How large should the Coefficient "r" be?

Compare your correlation value with the value in the table for your sample size. If the value is larger than the one noted in the table, the correlation is “important”, i.e. statistically significant.

$$r = \frac{t_\alpha}{\sqrt{n - 2 + t_\alpha^2}} \quad \text{or} \quad t_\alpha = \frac{r \cdot \sqrt{n - 2}}{\sqrt{1 - r^2}}$$

Attention! With large sample sizes, r values < 0,8 are also significant. Be aware that the risk of misinterpretation is relatively high here.

Sample size   d.f.   Significance level
n             n-2    0,05     0,025    0,01     0,005
3             1      0,9877   0,9969   0,9995   0,9999
4             2      0,9000   0,9500   0,9800   0,9900
5             3      0,8054   0,8783   0,9343   0,9587
6             4      0,7293   0,8114   0,8822   0,9172
7             5      0,6694   0,7545   0,8329   0,8745
8             6      0,6215   0,7067   0,7887   0,8343
9             7      0,5822   0,6664   0,7498   0,7977
10            8      0,5494   0,6319   0,7155   0,7646
11            9      0,5214   0,6021   0,6851   0,7348
12            10     0,4973   0,5760   0,6581   0,7079
13            11     0,4762   0,5529   0,6339   0,6835
14            12     0,4575   0,5324   0,6120   0,6614
15            13     0,4409   0,5140   0,5923   0,6411
16            14     0,4259   0,4973   0,5742   0,6226
17            15     0,4124   0,4821   0,5577   0,6055
18            16     0,4000   0,4683   0,5425   0,5897
19            17     0,3887   0,4555   0,5285   0,5751
20            18     0,3783   0,4438   0,5155   0,5614
21            19     0,3687   0,4329   0,5034   0,5487
22            20     0,3598   0,4227   0,4921   0,5368
27            25     0,3233   0,3809   0,4451   0,4869
32            30     0,2960   0,3494   0,4093   0,4487
37            35     0,2746   0,3246   0,3810   0,4182
42            40     0,2573   0,3044   0,3578   0,3932
47            45     0,2429   0,2876   0,3384   0,3721
52            50     0,2306   0,2732   0,3218   0,3542
62            60     0,2108   0,2500   0,2948   0,3248
72            70     0,1954   0,2319   0,2737   0,3017
82            80     0,1829   0,2172   0,2565   0,2830
92            90     0,1726   0,2050   0,2422   0,2673
102           100    0,1638   0,1946   0,2301   0,2540
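The two formulas on this slide convert between a t value and a critical r. The sketch below applies them to the softening temperature example (n = 21, r = 0,834). The t value 1.729 is the one-sided 5 % point of Student's t distribution for 19 degrees of freedom, taken from a standard t table. It is an assumption brought in for illustration and does not appear on the slide. The function names are likewise illustrative.

```python
import math


def r_critical(t_alpha, n):
    """Critical correlation coefficient for sample size n and a given t value."""
    return t_alpha / math.sqrt(n - 2 + t_alpha ** 2)


def t_from_r(r, n):
    """t statistic that corresponds to an observed correlation coefficient."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


n = 21             # sample size of the softening temperature example
r_observed = 0.834
t_alpha = 1.729    # assumption: one-sided 5 % t value for n - 2 = 19 d.f.

r_crit = r_critical(t_alpha, n)
t_observed = t_from_r(r_observed, n)
significant = r_observed > r_crit

print(f"critical r for n = {n}: {r_crit:.4f}")
print(f"observed r = {r_observed}, t = {t_observed:.2f}, significant: {significant}")
```

The critical r computed this way agrees with the table row for n = 21 at the 0,05 level. For simple regression the squared t statistic should also match the F value of the analysis of variance, apart from rounding of r.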

Avoid Quick Conclusions

If y and x1 correlate well, that does not necessarily mean that a variation of x1 will cause a variation of y.

A third variable could be in the background which is responsible for the change of x as well as of y.

An example from production shows a strong negative correlation between the pressure (x) and yield (y) in a reactor, but…

There are contaminations (x2), which are not measured and vary from process cycle to process cycle.

– Contamination causes foaming

– Contamination causes poor yield

– The pressure is used to reduce the foam build-up

– The pressure is a reaction to the foam build-up and has no effect on the yield

Another Example

• Open the file: MYSTERY.MTW

• Calculate the correlation.

• Is there a correlation between the two variables?

• Create a plot for both variables.

• What is your conclusion for the correlation?

[Figure: Scatterplot of Output vs Input]

Simple Regression

Correlation describes the linear dependence of two variables; regression defines this relation in more detail.

Regression leads to an equation, which uses one (or more) variables to explain the variation of the output variable.

Stat > Regression > Regression…

Performs simple and multiple regression

Stat > Regression > Fitted Line Plot…

Scatter plot, fitted line, equation and r²

Stat > Regression > Residual Plots…

Stores the residuals of the "Regression" or "Fitted Line Plot"

Checks basic assumptions about the behavior of the residuals

Summary

• Correlation is a useful tool to describe dependencies during many improvement activities.

• Correlation is the measure of the linear relation between two quantitative variables.

• Avoid jumping to conclusions about causes.

• Correlation is the basis for the regression method.

• Regression describes the relation of the variables in more detail and provides an equation model.

Appendix

Further Examples

Example: Retailer Sales and Cost of Production

File: Sales.mtw

A retailer chain wants to investigate the dependence of sales on shop location (Area) and the passerby frequency.

What kind of relations can you describe?

Area    Frequency   Sales
310     10240       2930
980     7510        5270
1210    10810       6850
1290    9890        7010
1120    13720       7020
1490    13920       8350
780     8540        4330
940     12360       5770
1290    12270       7680
480     11010       3160
240     8250        1520
550     9310        3150

File: Cost.mtw

The table shows the production fix costs and the number of units over 10 months.

Determine the favorable production size.

Units   Cost
3200    32200
4100    32700
10700   70100
8700    48200
6500    38600
9400    55400
11200   77200
1400    24300
6000    37500
4200    34000

Example: Salary

File: Salery.mtw

Evaluate the factors: which of them has the strongest effect on salary?

Salary   Year in the job   Company years   Education   Age   Pers. No.   Sex   Sex Group
38985    18                7               9           52    412         M     0
28938    12                5               4           39    517         F     1
32920    15                3               9           45    458         F     1
29548    5                 6               1           30    604         M     0
31138    11                11              6           46    562         F     1
24749    6                 2               0           26    598         F     1
41889    22                16              7           63    351         M     0
31528    3                 11              3           35    674         M     0
38791    21                4               5           48    356         M     0
39828    18                6               5           47    415         F     1

The Mystery Example

If we use Stat > Regression > Fitted Line Plot > Linear we get…

[Figure: Fitted Line Plot, Output versus Input. Output = 1,145 - 0,4340 Input; S 1,69190, R-Sq 6,4%, R-Sq(adj) 5,4%]

If we use Stat > Regression > Fitted Line Plot > Quadratic we get a strong correlation.

[Figure: Fitted Line Plot with 95% CI, Output versus Input. Output = 0,1401 + 0,0413 Input + 1,025 Input**2; S 1,02499, R-Sq 66,0%, R-Sq(adj) 65,3%]
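The effect can be illustrated without the MYSTERY.MTW file, whose values are not reproduced in this deck. The sketch below uses small synthetic data (an assumption, chosen so that Output is exactly Input squared on a symmetric range). It shows that the linear correlation coefficient can be zero although Output depends completely on Input.

```python
import math


def pearson_r(xs, ys):
    """Correlation coefficient of two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


# Synthetic illustration data, NOT the worksheet from the training:
# inputs symmetric around zero, output is a pure quadratic function.
inputs = [i / 2 for i in range(-4, 5)]          # -2.0, -1.5, ..., 2.0
outputs = [x ** 2 for x in inputs]

# Linear correlation between Input and Output.
r_linear = pearson_r(inputs, outputs)

# Correlation after transforming the input to Input**2.
r_quadratic = pearson_r([x ** 2 for x in inputs], outputs)

print(f"r (Output vs Input)    = {r_linear:.3f}")
print(f"r (Output vs Input**2) = {r_quadratic:.3f}")
```

This is why the exercise asks for a plot in addition to the correlation value: r only measures the linear part of a relation.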

Example: Retailer Sales

Diagnosis of the regression

Stat
>Regression
>Residual Plots…

Evaluation as for ANOVA

[Figure: Fitted Line Plot, Sales versus Area. Sales = 605,7 + 5,222 Area; S 408,182, R-Sq 96,9%, R-Sq(adj) 96,6%]

[Figure: Residual Plots for Sales, four panels: Normal Probability Plot of the Residuals; Residuals Versus the Fitted Values; Histogram of the Residuals; Residuals Versus the Order of the Data]

Minitab needs the residuals and the fits in one column. Storage of residuals and fits is possible during every evaluation.
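The fits and residuals that Minitab stores for the residual plots can also be computed by hand. The following is a minimal sketch, assuming the Area and Sales columns of Sales.mtw as listed in the appendix table. It fits Sales on Area and lists the fit and residual for each observation in data order. The variable names are illustrative.

```python
import math

# Area and Sales columns of the retailer example (12 shops).
area = [310, 980, 1210, 1290, 1120, 1490, 780, 940, 1290, 480, 240, 550]
sales = [2930, 5270, 6850, 7010, 7020, 8350, 4330, 5770, 7680, 3160, 1520, 3150]

n = len(area)
mean_area = sum(area) / n
mean_sales = sum(sales) / n

# Least-squares slope and intercept for Sales = intercept + slope * Area.
s_xx = sum((a - mean_area) ** 2 for a in area)
s_xy = sum((a - mean_area) * (s - mean_sales) for a, s in zip(area, sales))
slope = s_xy / s_xx
intercept = mean_sales - slope * mean_area

# Fits and residuals, i.e. the two columns used by the residual plots.
fits = [intercept + slope * a for a in area]
residuals = [s - f for s, f in zip(sales, fits)]

# Summary values as reported next to the fitted line plot.
ss_error = sum(e ** 2 for e in residuals)
ss_total = sum((s - mean_sales) ** 2 for s in sales)
r_sq = 1 - ss_error / ss_total
s_value = math.sqrt(ss_error / (n - 2))

print(f"Sales = {intercept:.1f} + {slope:.3f} Area")
print(f"S = {s_value:.3f}, R-Sq = {r_sq:.1%}")
for order, (fit, res) in enumerate(zip(fits, residuals), start=1):
    print(f"{order:2d}  fit = {fit:8.1f}  residual = {res:8.1f}")
```

Plotting the residuals against the fits and against the observation order gives the same information as the corresponding panels of the residual plots.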