Post on 22-Dec-2015
Dr. Mario Mazzocchi Research Methods & Data Analysis
Correlation and regression analysis
Week 8, Research Methods & Data Analysis

Lecture outline
• Correlation
• Regression analysis
• The least squares estimation method
• SPSS and regression output
• Task overview

Correlation
• Correlation measures the extent to which two (or more) variables are related
– Correlation expresses a relationship that is not necessarily precise (e.g. height and weight)
– Positive correlation indicates that the two variables move in the same direction
– Negative correlation indicates that they move in opposite directions

Covariance
• Covariance measures the “joint variability” of two variables
• If two variables are independent, then the covariance is zero (however, Cov = 0 does not mean that the two variables are independent)
$$\mathrm{Cov}(x, y) = \sigma_{xy} = E(xy) - E(x)\,E(y)$$
• Where E(…) indicates the expected value (i.e. average value)
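The covariance formula can be checked on a few numbers. The sketch below is illustrative only: the data are made up, and the helper names `mean` and `covariance` are not from the lecture. It also shows why zero covariance does not imply independence: y = x² depends entirely on x, yet its covariance with a symmetric x is zero.

```python
def mean(values):
    """Average value, playing the role of E(...) on a finite data set."""
    return sum(values) / len(values)


def covariance(x, y):
    """Covariance computed as E(xy) - E(x)E(y)."""
    xy = [a * b for a, b in zip(x, y)]
    return mean(xy) - mean(x) * mean(y)


# Hypothetical data, symmetric around zero
x = [-2, -1, 0, 1, 2]
y_linear = [2 * v + 1 for v in x]   # moves in the same direction as x
y_square = [v ** 2 for v in x]      # fully determined by x, but not linearly

cov_linear = covariance(x, y_linear)
cov_square = covariance(x, y_square)

print("Cov(x, 2x+1) =", cov_linear)   # positive
print("Cov(x, x^2)  =", cov_square)   # zero, although y depends on x
```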

Correlation coefficient
• The correlation coefficient r gives a measure (in the range -1 to +1) of the linear relationship between two variables
– r = 0 means no correlation
– r = +1 means perfect positive correlation
– r = -1 means perfect negative correlation
• Perfect correlation indicates that the points lie exactly on a straight line, so a given variation in x always corresponds to a proportional variation in y

Correlation coefficient and covariance
Correlation coefficient - POPULATION
$$r = \frac{\mathrm{Cov}(x, y)}{\sqrt{\mathrm{Var}(x)\,\mathrm{Var}(y)}} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$$
Pearson correlation coefficient - SAMPLE
$$s_{xy} = \frac{\sum_{i=1}^{n} x_i y_i}{n} - \frac{\sum_{i=1}^{n} x_i}{n} \cdot \frac{\sum_{i=1}^{n} y_i}{n}$$
$$r = \frac{s_{xy}}{s_x s_y}$$
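As a minimal sketch of the sample formulas, the Pearson coefficient can be computed by dividing the sample covariance by the two standard deviations. The height and weight figures below are hypothetical, and the function names are chosen for illustration only.

```python
import math


def sample_covariance(x, y):
    """s_xy = sum(x_i * y_i)/n - (sum(x_i)/n) * (sum(y_i)/n)."""
    n = len(x)
    return sum(a * b for a, b in zip(x, y)) / n - (sum(x) / n) * (sum(y) / n)


def pearson_r(x, y):
    """r = s_xy / (s_x * s_y); the variance of x is its covariance with itself."""
    s_xy = sample_covariance(x, y)
    s_x = math.sqrt(sample_covariance(x, x))
    s_y = math.sqrt(sample_covariance(y, y))
    return s_xy / (s_x * s_y)


# Hypothetical data: height (cm) and weight (kg) of five people
height = [160, 165, 170, 175, 180]
weight = [55, 62, 66, 74, 79]

r = pearson_r(height, weight)
r_perfect_positive = pearson_r([1, 2, 3, 4, 5], [3, 5, 7, 9, 11])
r_perfect_negative = pearson_r([1, 2, 3, 4, 5], [5, 4, 3, 2, 1])

print("height vs weight:", round(r, 3))
print("perfect positive:", round(r_perfect_positive, 3))
print("perfect negative:", round(r_perfect_negative, 3))
```

The height and weight relationship is strong but not exact, so r is close to +1 without reaching it; the two exact straight lines give +1 and -1.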

Bivariate and multivariate correlation
• Bivariate correlation
– 2 variables
– Pearson correlation coefficient
• Partial correlation
– The correlation between two variables after allowing for the effect of other “control” variables

Significance level in correlation
• Level of correlation (value of the correlation coefficient): indicates to what extent the two variables “move together”
• Significance of correlation (p value): given that the correlation coefficient is computed on a sample, indicates whether the relationship appears to be statistically significant
• Examples
– Correlation is 0.50, but not significant: the sampling error is so high that the actual correlation could even be 0
– Correlation is 0.10 and highly significant: the level of correlation is very low, but we can be confident that the actual correlation is not 0
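The two examples can be reproduced with the standard t test for a correlation coefficient, t = r·√(n−2)/√(1−r²) with n−2 degrees of freedom. The slides do not state this formula or any sample sizes, so the sample sizes below (10 and 2000) are assumptions picked purely to illustrate the point.

```python
import math


def correlation_t_statistic(r, n):
    """t statistic for H0: the true correlation is 0 (n - 2 degrees of freedom)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


# Assumed sample sizes, for illustration only
t_small_sample = correlation_t_statistic(0.50, 10)     # r = 0.50 on 10 observations
t_large_sample = correlation_t_statistic(0.10, 2000)   # r = 0.10 on 2000 observations

# Two-tailed 5% critical values: 2.306 with 8 degrees of freedom,
# about 1.96 with a very large number of degrees of freedom.
print("r = 0.50, n = 10:   t =", round(t_small_sample, 3))
print("r = 0.10, n = 2000: t =", round(t_large_sample, 3))
```

With only 10 observations, a correlation of 0.50 gives a t value below the critical value, so it is not significant; with 2000 observations, a correlation of 0.10 gives a t value well above it.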

Correlation and covariance in SPSS
Choose between bivariate & partial

Bivariate correlation
Select the variables you want to analyse
Request the significance level (two-tailed)
Ask for additional statistics (if necessary)

Bivariate correlation output

Correlations

                                       Shopping style   Use coupons   Amount spent
Shopping style   Pearson Correlation   1                .157**        .159**
                 Sig. (2-tailed)       .                .000          .000
                 N                     779              779           779
Use coupons      Pearson Correlation   .157**           1             .291**
                 Sig. (2-tailed)       .000             .             .000
                 N                     779              779           779
Amount spent     Pearson Correlation   .159**           .291**        1
                 Sig. (2-tailed)       .000             .000          .
                 N                     779              779           779

**. Correlation is significant at the 0.01 level (2-tailed).

Partial correlations
List of variables to be analysed
Control variables

Partial correlation output
- - - P A R T I A L C O R R E L A T I O N C O E F F I C I E N T S - - -
Controlling for.. SIZE STYLE
AMTSPENT USECOUP ORG
AMTSPENT 1.0000 .2677 -.0116
( 0) ( 775) ( 775)
P= . P= .000 P= .746
USECOUP .2677 1.0000 .0500
( 775) ( 0) ( 775)
P= .000 P= . P= .164
ORG -.0116 .0500 1.0000
( 775) ( 775) ( 0)
P= .746 P= .164 P= .
(Coefficient / (D.F.) / 2-tailed Significance)
" . " is printed if a coefficient cannot be computed
Partial correlations still measure the correlation between two variables, but eliminate the effect of other variables, i.e. it is as if the correlations were computed on consumers shopping in stores of identical size and with the same shopping style.
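With a single control variable, the partial correlation can be obtained from the three bivariate correlations using the standard first-order formula. The sketch below applies it to the coefficients in the bivariate output (Use coupons, Amount spent, Shopping style). It controls for shopping style only, whereas the SPSS output above also controls for store size, so it is not expected to reproduce the .2677 figure exactly. The function name is illustrative.

```python
import math


def first_order_partial_correlation(r_xy, r_xz, r_yz):
    """Correlation between x and y after removing the linear effect of z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Bivariate correlations taken from the SPSS "Correlations" table:
# x = Use coupons, y = Amount spent, z = Shopping style (the only control here)
partial = first_order_partial_correlation(r_xy=0.291, r_xz=0.157, r_yz=0.159)

print("Partial correlation, controlling for shopping style:", round(partial, 3))
```

The result stays close to the bivariate value of 0.291, in line with the SPSS partial correlation that controls for both variables.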

Bivariate and partial correlations
• Correlation between Amount spent and Use of coupons
– Bivariate correlation: 0.291 (p value 0.000)
– Partial correlation: 0.268 (p value 0.000)
• The amount spent is positively correlated with the use of coupons (0=no use, 1=from newspaper, 2=from mailing, 3=both)
• The level of correlation does not change much after accounting for different shop sizes and shopping styles

Linear regression analysis
$$y_i = \alpha + \beta x_i + \varepsilon_i$$
• y_i: dependent variable
• α: intercept
• β: regression coefficient
• x_i: independent variable (explanatory variable, regressor…)
• ε_i: error

Regression analysis
[Scatter plot: Cholesterol (mg/100 ml) (y) against Age (x)]

Example
• We want to investigate whether there is a relationship between cholesterol and age in a sample of 18 people
• The dependent variable is the cholesterol level
• The explanatory variable is age

What regression analysis does
• Determines whether a relationship exists between the dependent and explanatory variables
• Determines how much of the variation in the dependent variable is explained by the independent variable (goodness of fit)
• Allows the values of the dependent variable to be predicted

Regression and correlation
• Correlation: there is no causal relationship assumed
• Regression: we assume that the explanatory variables “cause” the dependent variable
– Bivariate: one explanatory variable
– Multivariate: two or more explanatory variables

How to estimate the regression coefficients
• The objective is to estimate the population parameters α and β from our data sample
• A good way to estimate them is by minimising the errors e_i, each of which represents the difference between the actual observation and the estimated (predicted) one
$$y_i = a + b x_i + e_i$$

Linear regression
[Scatter plot with fitted line: Cholesterol (mg/100 ml) against Age]
Cholesterol (mg/100 ml) = 140.36 + 4.58 * age
R-Square = 0.65
The objective is to identify the line (i.e. the a and b coefficients) that minimises the distance between the actual points and the fitted line

The least squares method
• This is based on minimising the sum of the squared distances (errors) rather than the distances themselves
$$b = \frac{\mathrm{Cov}(x, y)}{\mathrm{Var}(x)} = \frac{s_{xy}}{s_x^2} = r\,\frac{s_y}{s_x}$$
$$a = \bar{y} - b\,\bar{x}$$
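The two least squares formulas translate directly into a few lines of code. This is a minimal sketch on made-up data: the ages and cholesterol values below are hypothetical and lie exactly on the line 125 + 5 × age, so the recovered coefficients are easy to check. They are not the 18 observations used in the lecture.

```python
def mean(values):
    return sum(values) / len(values)


def least_squares(x, y):
    """Return (a, b) with b = Cov(x, y) / Var(x) and a = mean(y) - b * mean(x)."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = mean([(xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)])
    s_xx = mean([(xi - x_bar) ** 2 for xi in x])
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b


# Hypothetical data lying exactly on cholesterol = 125 + 5 * age
age = [25, 35, 45, 55]
cholesterol = [250, 300, 350, 400]

a, b = least_squares(age, cholesterol)
print("intercept a =", a, "slope b =", b)
```

On real data the points do not lie on the line, and a and b are the values that make the sum of squared errors as small as possible.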

Bivariate regression in SPSS

Regression dialog box
Dependent variable
Explanatory variable
Leave this unchanged!

Regression output

Coefficients(a)

Model 1        Unstandardized Coefficients   Standardized Coefficients   t        Sig.
               B          Std. Error         Beta
(Constant)     140.359    34.715                                         4.043    .001
Age            4.577      .838               .807                        5.464    .000

a. Dependent Variable: Cholesterol (mg/100 ml)

B column: value of the coefficients
t and Sig. columns: statistical significance (is the coefficient different from 0?)
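As a worked check of the output, each t value is the coefficient divided by its standard error. Here a and b denote the estimated intercept and slope, and SE their standard errors; the small difference from the printed 5.464 comes from the rounding of the displayed figures.

```latex
t_a = \frac{a}{SE(a)} = \frac{140.359}{34.715} \approx 4.04,
\qquad
t_b = \frac{b}{SE(b)} = \frac{4.577}{0.838} \approx 5.46
```

With 18 observations and two estimated coefficients, these ratios are compared with a t distribution with 16 degrees of freedom, which gives the values in the Sig. column.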

Model diagnostics: goodness of fit

Model Summary

Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .807(a)   .651       .629                45.218

a. Predictors: (Constant), Age

The value of the R square lies between 0 and 1 and represents the proportion of total variation that is explained by the regression model

R-square
$$SS_y = SS_{reg} + SS_{res}$$
Total variation = Variation explained by regression + Residual variation
$$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
$$R^2 = \frac{SS_{reg}}{SS_y}$$
where
$$\hat{y}_i = a + b x_i$$
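The decomposition of the total variation can be verified numerically. The sketch below fits a line by least squares to five made-up points (not the lecture data) and checks that the total sum of squares equals the regression plus the residual sum of squares; variable names are illustrative.

```python
def mean(values):
    return sum(values) / len(values)


def fit_line(x, y):
    """Least squares intercept a and slope b."""
    x_bar, y_bar = mean(x), mean(y)
    b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - b * x_bar, b


# Hypothetical data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b = fit_line(x, y)
y_hat = [a + b * xi for xi in x]   # predicted values
y_bar = mean(y)

ss_total = sum((yi - y_bar) ** 2 for yi in y)               # SS_y
ss_reg = sum((yh - y_bar) ** 2 for yh in y_hat)             # SS_reg
ss_res = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))    # SS_res
r_squared = ss_reg / ss_total

print("SS_y =", round(ss_total, 3))
print("SS_reg + SS_res =", round(ss_reg + ss_res, 3))
print("R-square =", round(r_squared, 3))
```

The identity holds only when a and b are the least squares estimates; for any other line the two components do not add up to the total.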

Multivariate regression
• The principle is identical to bivariate regression, but there are more explanatory variables
• The goodness of fit can be measured through the adjusted R-square, which takes into account the number of explanatory variables
$$y_i = b_0 + b_1 x_{1i} + b_2 x_{2i} + \dots + b_n x_{ni} + e_i$$
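The slides do not spell out how the adjusted R-square is computed. A minimal sketch, assuming the usual textbook formula 1 − (1 − R²)(N − 1)/(N − k − 1) with N observations and k explanatory variables, reproduces the value reported for the cholesterol example (R Square .651, 18 people, one regressor). The function and argument names are illustrative.

```python
def adjusted_r_squared(r_squared, n_obs, n_regressors):
    """Adjusted R-square: penalises R-square for the number of explanatory variables."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_regressors - 1)


# Cholesterol example: R Square = .651, 18 observations, 1 explanatory variable (age)
adj_one_regressor = adjusted_r_squared(0.651, n_obs=18, n_regressors=1)

# Same R-square obtained with one more regressor: the adjusted value is lower
adj_two_regressors = adjusted_r_squared(0.651, n_obs=18, n_regressors=2)

print("1 regressor: ", round(adj_one_regressor, 3))   # SPSS reports .629
print("2 regressors:", round(adj_two_regressors, 3))
```

An extra explanatory variable therefore improves the adjusted R-square only if it raises R-square by enough to offset the penalty.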

Multivariate regression in SPSS
• Analyze / Regression / Linear
Simply select more than one explanatory variable

Output

Coefficients(a)

Model 1             Unstandardized Coefficients   Standardized Coefficients   t         Sig.
                    B          Std. Error         Beta
(Constant)          296.482    19.792                                         14.980    .000
Health food store   9.721      15.012             .024                        .648      .517
Size of store       9.753      6.070              .059                        1.607     .109
Gender              -69.598    7.483              -.302                       -9.301    .000
Vegetarian          -1.910     12.570             -.005                       -.152     .879
Shopping style      22.760     6.069              .123                        3.750     .000
Use coupons         30.417     3.512              .285                        8.662     .000

a. Dependent Variable: Amount spent

Coefficient interpretation
• The constant (£296.5) represents the amount spent when all other variables are equal to 0
• The coefficients of Health food store, Size of store and Vegetarian are not significantly different from 0
• Gender coeff = -69.6: on average, being a woman (G=1) implies spending £69.6 less
• Shopping style coeff = +22.8 × S
– S=1 (shops for himself) = +22.8
– S=2 (shops for himself & spouse) = +45.6
– S=3 (shops for himself & family) = +68.4
• Coupon use coeff = +30.4 × C
– C=1 (does not use coupons) = +30.4
– C=2 (coupons from newspapers) = +60.8
– C=3 (coupons from mailings) = +91.2
– C=4 (coupons from both) = +121.6
Categorization problems?

Prediction
• On average, how much will someone with the following characteristics spend:
– Male (G=0)
– Shopping for family (S=3)
– Not using coupons (C=1)
$$AMT = 296.5 - 69.6\,G + 22.8\,S + 30.4\,C = 296.5 - 69.6 \cdot 0 + 22.8 \cdot 3 + 30.4 \cdot 1 = 395.3$$
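The prediction can be reproduced by plugging the values into the estimated equation. The sketch below uses the rounded coefficients from the slide and, like the slide, leaves out the three variables whose coefficients are not significantly different from 0. The function and argument names are illustrative.

```python
# Rounded coefficients from the regression output (significant variables only)
CONSTANT = 296.5
GENDER_COEFF = -69.6          # G: 0 = male, 1 = female
SHOPPING_STYLE_COEFF = 22.8   # S: 1 = himself, 2 = himself & spouse, 3 = himself & family
COUPON_COEFF = 30.4           # C: 1 = no coupons, ..., 4 = coupons from both sources


def predict_amount_spent(gender, shopping_style, coupon_use):
    """Predicted average amount spent (in pounds) for the given characteristics."""
    return (CONSTANT
            + GENDER_COEFF * gender
            + SHOPPING_STYLE_COEFF * shopping_style
            + COUPON_COEFF * coupon_use)


# Male, shopping for the family, not using coupons
predicted = predict_amount_spent(gender=0, shopping_style=3, coupon_use=1)
print("Predicted amount spent:", round(predicted, 1))
```

Changing `gender` to 1 with the other values unchanged lowers the prediction by 69.6, which is the interpretation of the gender coefficient.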

How good is the model?

Model Summary

Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .439(a)   .193       .187                104.08167

a. Predictors: (Constant), Use coupons, Vegetarian, Gender, Health food store, Shopping style, Size of store

• The regression model explains about 19% of the total variation in the amount spent (R Square = .193, Adjusted R Square = .187)

Task A
• Examine the relationship between the amount spent and the following customer characteristics:
– Being male/female
– Being vegetarian
– Shopping for himself / for himself and others
– Shopping style (weekly, bi-weekly, etc.)
Potential methods:
• Battery of hypothesis testing & Analysis of variance
• Regression Analysis

Task B
• Examine the relationship between the amount spent and the following customer characteristics:
– Hypothesis: the average amount spent in health-oriented shops is higher than in other shops. True or false?
– Test the same hypothesis accounting for different shop sizes
Potential methods:
• Battery of hypothesis testing & Analysis of variance
• Regression Analysis

Task C
• Find a relationship between the average amount spent per store and the following store characteristics:
– Size of store
– Health-oriented store
– Store organisation
Potential methods:
• Transform the customer data set into a store data set
• Battery of ANOVA
• Regression Analysis

Task D
• Hypothesis: is the amount spent by those who use coupons significantly higher?
• What is the most effective way of distributing coupons:
– By mail
– In newspapers
– Both
Potential methods:
• Recode the variable into 1=not using coupon and 2=using coupon
• Hypothesis testing
• Analysis of variance