Dr. Mario Mazzocchi, Research Methods & Data Analysis: Correlation and regression analysis (Week 8)


Post on 22-Dec-2015


Correlation and regression analysis

Week 8

Research Methods & Data Analysis


Lecture outline

• Correlation
• Regression analysis
• The least squares estimation method
• SPSS and regression output
• Task overview


Correlation

• Correlation measures to what extent two (or more) variables are related

– Correlation expresses a relationship that is not necessarily precise (e.g. height and weight)

– Positive correlation indicates that the two variables move in the same direction

– Negative correlation indicates that they move in opposite directions


Covariance

• Covariance measures the “joint variability” of two variables

• If two variables are independent, then the covariance is zero (however, Cov = 0 does not mean that two variables are independent)

$$\mathrm{Cov}(x, y) = \sigma_{xy} = E(xy) - E(x)\,E(y)$$

• Where E(…) indicates the expected value (i.e. average value)
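As a minimal sketch of the covariance definition E(xy) − E(x)E(y), the Python snippet below computes it on two made-up data sets. The height and weight values are hypothetical, and the function name `covariance` is chosen here for illustration only. The second data set shows a case where the covariance is zero although y is completely determined by x.

```python
from statistics import fmean

def covariance(x, y):
    """Population covariance computed as E(xy) - E(x)E(y)."""
    e_xy = fmean([xi * yi for xi, yi in zip(x, y)])
    return e_xy - fmean(x) * fmean(y)

# Hypothetical data: heights (cm) and weights (kg) of five people.
height = [160, 165, 170, 175, 180]
weight = [55, 62, 66, 74, 78]
cov_hw = covariance(height, weight)   # positive: the two move together

# y depends entirely on x (y = x squared), yet the covariance is zero.
x = [-2, -1, 0, 1, 2]
y = [xi ** 2 for xi in x]
cov_xy = covariance(x, y)

print(cov_hw, cov_xy)
```

The second case is why zero covariance cannot be read as independence: covariance only picks up linear co-movement.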


Correlation coefficient

• The correlation coefficient r gives a measure (in the range –1, +1) of the relationship between two variables
– r=0 means no correlation
– r=+1 means perfect positive correlation
– r=-1 means perfect negative correlation

• Perfect correlation indicates that the points lie exactly on a straight line: a given variation in x always corresponds to a proportional variation in y


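Correlation coefficient and covariance

Correlation coefficient - POPULATION

$$r = \frac{\mathrm{Cov}(x, y)}{\sqrt{\mathrm{Var}(x)\,\mathrm{Var}(y)}} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$$

Pearson correlation coefficient - SAMPLE

$$s_{xy} = \frac{1}{n}\sum_{i=1}^{n} x_i y_i - \left(\frac{1}{n}\sum_{i=1}^{n} x_i\right)\left(\frac{1}{n}\sum_{i=1}^{n} y_i\right)$$

$$r = \frac{s_{xy}}{s_x s_y}$$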
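The sample Pearson coefficient (sample covariance divided by the product of the two sample standard deviations) can be sketched in a few lines of Python. The data are made up for illustration, and `pearson_r` is a hypothetical helper name, not part of the lecture material.

```python
from math import sqrt
from statistics import fmean

def pearson_r(x, y):
    """Sample Pearson correlation: s_xy / (s_x * s_y), all computed with 1/n."""
    x_bar, y_bar = fmean(x), fmean(y)
    s_xy = fmean([xi * yi for xi, yi in zip(x, y)]) - x_bar * y_bar
    s_x = sqrt(fmean([xi * xi for xi in x]) - x_bar ** 2)
    s_y = sqrt(fmean([yi * yi for yi in y]) - y_bar ** 2)
    return s_xy / (s_x * s_y)

# Hypothetical heights (cm) and weights (kg): strong but not perfect correlation.
height = [160, 165, 170, 175, 180]
weight = [55, 62, 66, 74, 78]
r_hw = pearson_r(height, weight)

# An exact linear relationship gives r = +1.
r_perfect = pearson_r([1, 2, 3, 4], [3, 5, 7, 9])

print(round(r_hw, 4), round(r_perfect, 4))
```

Because the same divisor is used for the covariance and for both standard deviations, the result is unchanged if n − 1 is used instead of n throughout.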


Bivariate and multivariate correlation

• Bivariate correlation
– 2 variables
– Pearson correlation coefficient
• Partial correlation
– The correlation between two variables after allowing for the effect of other “control” variables


Significance level in correlation

• Level of correlation (value of the correlation coefficient): indicates to what extent the two variables “move together”
• Significance of correlation (p value): given that the correlation coefficient is computed on a sample, indicates whether the relationship appears to be statistically significant
• Examples
– Correlation is 0.50, but not significant: the sampling error is so high that the actual correlation could even be 0
– Correlation is 0.10 and highly significant: the level of correlation is very low, but we can be confident in the value of such correlation
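The two examples can be made concrete with the usual t statistic for testing whether a correlation differs from 0, t = r·√(n − 2) / √(1 − r²), which has n − 2 degrees of freedom. The sketch below is illustrative only: the sample sizes (10 and 2000) are assumptions chosen to reproduce the two situations, not values from the lecture.

```python
from math import sqrt

def correlation_t(r, n):
    """t statistic for H0: the population correlation is 0 (n - 2 degrees of freedom)."""
    return r * sqrt(n - 2) / sqrt(1 - r * r)

# Assumed sample sizes, chosen only to illustrate the two cases.
t_small = correlation_t(0.50, 10)     # high correlation, very small sample
t_large = correlation_t(0.10, 2000)   # low correlation, very large sample

# Two-tailed 5% critical values: about 2.306 for 8 d.f., about 1.96 for large d.f.
print(f"r=0.50, n=10:   t = {t_small:.3f} (critical value 2.306)")
print(f"r=0.10, n=2000: t = {t_large:.3f} (critical value 1.96)")
```

With these assumed sample sizes the first t value falls below its critical value (not significant) while the second is well above it, even though the correlation itself is much weaker.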


Correlation and covariance in SPSS

Choose between bivariate & partial


Bivariate correlation

• Select the variables you want to analyse
• Request the significance level (two tailed)
• Ask for additional statistics (if necessary)


Bivariate correlation output

Correlations

                                       Shopping style   Use coupons   Amount spent
Shopping style   Pearson Correlation   1                .157**        .159**
                 Sig. (2-tailed)       .                .000          .000
                 N                     779              779           779
Use coupons      Pearson Correlation   .157**           1             .291**
                 Sig. (2-tailed)       .000             .             .000
                 N                     779              779           779
Amount spent     Pearson Correlation   .159**           .291**        1
                 Sig. (2-tailed)       .000             .000          .
                 N                     779              779           779

**. Correlation is significant at the 0.01 level (2-tailed).


Partial correlations

List of variables to be analysed

Control variables


Partial correlation output

- - - P A R T I A L   C O R R E L A T I O N   C O E F F I C I E N T S - - -

Controlling for..   SIZE   STYLE

             AMTSPENT    USECOUP    ORG

AMTSPENT     1.0000      .2677      -.0116
             (   0)      ( 775)     ( 775)
             P= .        P= .000    P= .746

USECOUP      .2677       1.0000     .0500
             ( 775)      (   0)     ( 775)
             P= .000     P= .       P= .164

ORG          -.0116      .0500      1.0000
             ( 775)      ( 775)     (   0)
             P= .746     P= .164    P= .

(Coefficient / (D.F.) / 2-tailed Significance)

" . " is printed if a coefficient cannot be computed

Partial correlations still measure the correlation between two variables, but eliminate the effect of other variables, i.e. the correlations are computed on consumers shopping in stores of identical size and with the same shopping style
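To illustrate how a control variable is "eliminated", the sketch below applies the standard first-order partial correlation formula, r_xy.z = (r_xy − r_xz·r_yz) / √((1 − r_xz²)(1 − r_yz²)), to the bivariate correlations between Amount spent, Use coupons and Shopping style. It controls for shopping style only, so it is an assumption-laden simplification: the SPSS output also controls for store size and therefore reports a slightly different value (.2677).

```python
from math import sqrt

def partial_r(r_xy, r_xz, r_yz):
    """First-order partial correlation between x and y, controlling for one variable z."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Bivariate Pearson correlations among the three variables.
r_amount_coupons = 0.291
r_amount_style = 0.159
r_coupons_style = 0.157

# Amount spent vs. Use coupons, controlling for Shopping style only.
r_partial = partial_r(r_amount_coupons, r_amount_style, r_coupons_style)
print(round(r_partial, 3))
```

Controlling for more than one variable requires either applying the formula recursively or correlating regression residuals; this sketch does not attempt that.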


Bivariate and partial correlations

• Correlation between Amount spent and Use of coupons
– Bivariate correlation: 0.291 (p value 0.000)
– Partial correlation: 0.268 (p value 0.000)

• The amount spent is positively correlated with the use of coupons (0=no use, 1=from newspaper, 2=from mailing, 3=both)

• The level of correlation does not change much after accounting for different shop sizes and shopping styles


Linear regression analysis

$$y_i = \alpha + \beta x_i + \varepsilon_i$$

• $y_i$: dependent variable
• $\alpha$: intercept
• $\beta$: regression coefficient
• $x_i$: independent variable (explanatory variable, regressor…)
• $\varepsilon_i$: error


Regression analysis

[Figure: scatter plot of cholesterol (mg/100 ml) on the y axis against age on the x axis]


Example

• We want to investigate whether there is a relationship between cholesterol and age on a sample of 18 people

• The dependent variable is the cholesterol level

• The explanatory variable is age


What regression analysis does

• Determines whether a relationship exists between the dependent and explanatory variables

• Determines how much of the variation in the dependent variable is explained by the independent variable (goodness of fit)

• Allows us to predict the values of the dependent variable


Regression and correlation

• Correlation: there is no causal relationship assumed

• Regression: we assume that the explanatory variables “cause” the dependent variable
– Bivariate: one explanatory variable
– Multivariate: two or more explanatory variables


How to estimate the regression coefficients

• The objective is to estimate the population parameters α and β (intercept and regression coefficient) on our data sample

• A good way to estimate them is by minimising the error $e_i$, which represents the difference between the actual observation and the estimated (predicted) one

$$y_i = a + b x_i + e_i$$


Linear Regression

[Figure: scatter plot of cholesterol (mg/100 ml) against age, with the fitted regression line]

Cholesterol (mg/100 ml) = 140.36 + 4.58 * age

R-Square = 0.65

The objective is to identify the line (i.e. the a and b coefficients) that minimises the distance between the actual points and the fitted line


The least squares method

• This is based on minimising the square of the distance (error) rather than the distance

$$b = \frac{\mathrm{Cov}(x, y)}{\mathrm{Var}(x)} = \frac{s_{xy}}{s_x^2} = r\,\frac{s_y}{s_x}$$

$$a = \bar{y} - b\,\bar{x}$$
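The least squares estimates, slope b = Cov(x, y) / Var(x) and intercept a = ȳ − b·x̄, can be sketched as follows. The ages and cholesterol levels are hypothetical values made up for illustration (they are not the 18 observations used in the lecture), and `least_squares` is a helper name introduced here.

```python
from statistics import fmean

def least_squares(x, y):
    """Return (a, b) for the fitted line y = a + b*x."""
    x_bar, y_bar = fmean(x), fmean(y)
    s_xy = fmean([xi * yi for xi, yi in zip(x, y)]) - x_bar * y_bar  # Cov(x, y)
    s_xx = fmean([xi * xi for xi in x]) - x_bar ** 2                 # Var(x)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b

# Hypothetical ages (years) and cholesterol levels (mg/100 ml).
age = [25, 35, 45, 55, 60]
cholesterol = [260, 290, 360, 380, 420]
a, b = least_squares(age, cholesterol)
print(f"cholesterol = {a:.2f} + {b:.2f} * age")

# Sanity check on data lying exactly on the line y = 1 + 2x.
a_exact, b_exact = least_squares([1, 2, 3, 4], [3, 5, 7, 9])
```

By construction, the fitted line always passes through the point of means (x̄, ȳ).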


Bivariate regression in SPSS


Regression dialog box

• Dependent variable
• Explanatory variable
• Leave this unchanged!


Regression output

Coefficients(a)

                   Unstandardized Coefficients   Standardized Coefficients
Model              B          Std. Error         Beta                        t       Sig.
1   (Constant)     140.359    34.715                                         4.043   .001
    Age            4.577      .838               .807                        5.464   .000

a. Dependent Variable: Cholesterol (mg/100 ml)

• Value of the coefficients (B)
• Statistical significance (t, Sig.): is the coefficient different from 0?
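As a small check on how the columns of the coefficients table relate, the t value for testing whether a coefficient differs from 0 is the coefficient B divided by its standard error. The sketch below uses the rounded figures from the table, so the results match the printed t values only up to rounding.

```python
# (B, Std. Error) pairs as printed in the coefficients table.
coefficients = {
    "(Constant)": (140.359, 34.715),
    "Age": (4.577, 0.838),
}

# t = B / Std. Error for each coefficient.
t_values = {name: b / se for name, (b, se) in coefficients.items()}

for name, t in t_values.items():
    print(f"{name}: t = {t:.3f}")
```

The small gap between the recomputed t for Age and the printed 5.464 comes from SPSS using unrounded values internally.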


Model diagnostics: goodness of fit

Model Summary

Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .807(a)   .651       .629                45.218

a. Predictors: (Constant), Age

The value of R square lies between 0 and 1 and represents the proportion of total variation that is explained by the regression model


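R-square

$$SS_y = SS_{reg} + SS_{res}$$

• $SS_y$: total variation
• $SS_{reg}$: variation explained by regression
• $SS_{res}$: residual variation

$$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

where $\hat{y}_i = a + b x_i$

$$R^2 = \frac{SS_{reg}}{SS_y}$$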
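The split of total variation into a part explained by the regression and a residual part can be verified numerically. The sketch below fits a least squares line to hypothetical age and cholesterol values (made up for illustration, not the lecture's data) and computes the three sums of squares and R-square; `sums_of_squares` is a helper name introduced here.

```python
from statistics import fmean

def sums_of_squares(x, y):
    """Fit y = a + b*x by least squares and return (SS_y, SS_reg, SS_res)."""
    x_bar, y_bar = fmean(x), fmean(y)
    b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
    a = y_bar - b * x_bar
    fitted = [a + b * xi for xi in x]
    ss_y = sum((yi - y_bar) ** 2 for yi in y)                   # total variation
    ss_reg = sum((fi - y_bar) ** 2 for fi in fitted)            # explained by regression
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))   # residual variation
    return ss_y, ss_reg, ss_res

# Hypothetical ages (years) and cholesterol levels (mg/100 ml).
age = [25, 35, 45, 55, 60]
cholesterol = [260, 290, 360, 380, 420]

ss_y, ss_reg, ss_res = sums_of_squares(age, cholesterol)
r_square = ss_reg / ss_y
print(f"SS_y = {ss_y:.1f}, SS_reg + SS_res = {ss_reg + ss_res:.1f}, R-square = {r_square:.3f}")
```

The identity SS_y = SS_reg + SS_res holds only when the line is the least squares fit with an intercept; for an arbitrary line the two sides generally differ.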


Multivariate regression

• The principle is identical to bivariate regression, but there are more explanatory variables

• The goodness of fit can be measured through the adjusted R-square, which takes into account the number of explanatory variables

$$y_i = b_0 + b_1 x_{1i} + b_2 x_{2i} + \dots + b_n x_{ni} + e_i$$


Multivariate regression in SPSS

• Analyze / Regression / Linear

Simply select more than one explanatory variable


Output

Coefficients(a)

                          Unstandardized Coefficients   Standardized Coefficients
Model                     B          Std. Error         Beta                        t        Sig.
1   (Constant)            296.482    19.792                                         14.980   .000
    Health food store     9.721      15.012             .024                        .648     .517
    Size of store         9.753      6.070              .059                        1.607    .109
    Gender                -69.598    7.483              -.302                       -9.301   .000
    Vegetarian            -1.910     12.570             -.005                       -.152    .879
    Shopping style        22.760     6.069              .123                        3.750    .000
    Use coupons           30.417     3.512              .285                        8.662    .000

a. Dependent Variable: Amount spent


Coefficient interpretation

• The constant represents the amount spent when all other variables are 0 (£296.5)

• The coefficients of Health food store, Size of store and Vegetarian are not significantly different from 0

• Gender coeff = -69.6: on average, being a woman (G=1) implies spending £69.6 less

• Shopping style coeff = +22.8 S
– S=1 (shops for himself) = +22.8
– S=2 (shops for himself & spouse) = +45.6
– S=3 (shops for himself & family) = +68.4

• Coupon use coeff = +30.4 C
– C=1 (does not use coupons) = +30.4
– C=2 (coupons from newspapers) = +60.8
– C=3 (coupons from mailings) = +91.2
– C=4 (coupons from both) = +121.6

Categorization problems?


Prediction

• On average, how much will someone with the following characteristics spend:
– Male (G=0)
– Shopping for family (S=3)
– Not using coupons (C=1)

$$AMT = 296.5 - 69.6\,G + 22.8\,S + 30.4\,C = 296.5 - 69.6(0) + 22.8(3) + 30.4(1) = 395.3$$
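The prediction can be reproduced by plugging the customer's characteristics into the estimated equation. The sketch below uses the rounded significant coefficients (constant 296.5, gender −69.6, shopping style 22.8, coupon use 30.4); `predict_amount` is a helper name introduced here for illustration.

```python
# Rounded coefficients from the multivariate regression (significant terms only).
CONSTANT = 296.5
COEF_GENDER = -69.6   # G: 0 = male, 1 = female
COEF_STYLE = 22.8     # S: 1 = himself, 2 = himself & spouse, 3 = himself & family
COEF_COUPON = 30.4    # C: 1 = no coupons, ..., 4 = coupons from both sources

def predict_amount(g, s, c):
    """Predicted average amount spent (in £) for gender g, shopping style s, coupon use c."""
    return CONSTANT + COEF_GENDER * g + COEF_STYLE * s + COEF_COUPON * c

# Male, shopping for family, not using coupons.
amount = predict_amount(g=0, s=3, c=1)
print(round(amount, 1))
```

The same function gives the predicted effect of any single change, for example switching G from 0 to 1 lowers the prediction by 69.6.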


How good is the model?

Model Summary

Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .439(a)   .193       .187                104.08167

a. Predictors: (Constant), Use coupons, Vegetarian, Gender, Health food store, Shopping style, Size of store

• The regression model explains less than 19% of the total variation in the amount spent (adjusted R square = 0.187)
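The adjusted R square values in the two Model Summary tables can be reproduced with the standard formula 1 − (1 − R²)(n − 1)/(n − k − 1), where k is the number of explanatory variables. In the sketch below, n = 18 comes from the cholesterol example; n = 779 for the shopping model is an assumption taken from the N reported in the correlation output.

```python
def adjusted_r_square(r_square, n, k):
    """Adjusted R square for n observations and k explanatory variables."""
    return 1 - (1 - r_square) * (n - 1) / (n - k - 1)

# Cholesterol on age: 18 people, 1 explanatory variable, R square = .651.
adj_cholesterol = adjusted_r_square(0.651, n=18, k=1)

# Amount spent on 6 explanatory variables, R square = .193.
# n = 779 is assumed from the N shown in the correlation output.
adj_shopping = adjusted_r_square(0.193, n=779, k=6)

print(round(adj_cholesterol, 3), round(adj_shopping, 3))
```

The penalty for extra explanatory variables is small when n is large, which is why the adjustment moves the shopping model's R square only from .193 to about .187.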


Task A

• Examine the relationship between the amount spent and the following customer characteristics:
– Being male/female
– Being vegetarian
– Shopping for himself / for himself and others
– Shopping style (weekly, bi-weekly, etc.)

Potential methods:

• Battery of hypothesis testing & Analysis of variance

• Regression Analysis


Task B

• Examine the relationship between the amount spent and the following customer characteristics:
– Hypothesis: the average amount spent in health-oriented shops is higher than in other shops. True or false?

– Test the same hypothesis accounting for different shop sizes

Potential methods:

• Battery of hypothesis testing & Analysis of variance

• Regression Analysis


Task C

• Find a relationship between the average amount spent per store and the following store characteristics:
– Size of store
– Health-oriented store
– Store organisation

Potential methods:

• Transform the customer data set into a store data set

• Battery of ANOVA

• Regression Analysis


Task D

• Hypothesis: is the amount spent by those who use coupons significantly higher?
• What is the most effective way of distributing coupons:
– By mail
– In newspapers
– Both

Potential methods:

• Recode the variable into 1=not using coupon and 2=using coupon

• Hypothesis testing

• Analysis of variance