Introduction to Correlation and Regression Analysis


Page 1: Introduction to correlation and regression analysis

Introduction to Correlation & Regression Analysis

Farzad Javidanrad

November 2013

Page 2: Introduction to correlation and regression analysis

Some Basic Concepts:

o Variable: A letter (symbol) which represents the elements of a specific set.

o Random Variable: A variable whose values appear randomly, according to a probability distribution.

o Probability Distribution: A rule (function) which assigns a probability to the values of a random variable (individually or to a set of them). E.g., for the number of heads $x$ in one toss of a fair coin (sample space $H, T$ in one trial; $HH, HT, TH, TT$ in two trials):

$x$:      0     1
$P(x)$:   0.5   0.5

Page 3: Introduction to correlation and regression analysis

Correlation: Is there any relation between:

fast food sales and different seasons?

a specific crime and religion?

smoking cigarettes and lung cancer?

maths score and overall score in an exam?

temperature and earthquakes?

cost of advertisement and number of items sold?

To answer each question, two sets of corresponding data need to be randomly collected. Let random variable $x$ represent the first group of data and random variable $y$ represent the second.

Question: Is it true that students who have a better overall result are good at maths?

Page 4: Introduction to correlation and regression analysis

Our aim is to find out whether there is any linear association between $x$ and $y$. In statistics, the technical term for linear association is "correlation". So, we are looking to see if there is any correlation between the two scores.

"Linear association": the variables are related at their levels, i.e. $x$ with $y$, not with $y^2$, $y^3$, $\frac{1}{y}$ or even $\Delta y$.

Imagine we have a random sample of scores in a school as follows:

Page 5: Introduction to correlation and regression analysis

In our example, the correlation between $x$ and $y$ can be shown in a scatter diagram:

[Scatter diagram: maths score ($X$, horizontal axis) against overall score ($Y$, vertical axis), both from 0 to 100. Caption: Correlation between maths score and overall score.]

The graph shows a positive correlation between maths scores and overall scores, i.e. when $x$ increases, $y$ increases too.

Page 6: Introduction to correlation and regression analysis

Different scatter diagrams show different types of correlation:

• Is this enough? Are we happy? Certainly not!! We think we know things better when they are described by numbers!!!!

Although scatter diagrams are informative, to find the degree (strength) of a correlation between two variables we need a numerical measurement.

Adapted from www.pdesas.org

Page 7: Introduction to correlation and regression analysis

Following the work of Francis Galton on the regression line, in 1896 Karl Pearson introduced a formula for measuring the correlation between two variables, called the Correlation Coefficient or Pearson's Correlation Coefficient.

For a sample of size $n$, the sample correlation coefficient $r_{xy}$ can be calculated by:

$$r_{xy} = \frac{\frac{1}{n}\sum(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\frac{1}{n}\sum(x_i - \bar{x})^2} \cdot \sqrt{\frac{1}{n}\sum(y_i - \bar{y})^2}} = \frac{cov(x, y)}{S_x \cdot S_y}$$

Where $\bar{x}$ and $\bar{y}$ are the mean values of $x$ and $y$ in the sample and $S$ represents the biased version of the "standard deviation"*. The covariance between $x$ and $y$, $cov(x, y)$, shows how much $x$ and $y$ change together.

Page 8: Introduction to correlation and regression analysis

Alternatively, if there is an opportunity to observe all available data, the population correlation coefficient ($\rho_{xy}$) can be obtained by:

$$\rho_{xy} = \frac{E[(x_i - \mu_x)(y_i - \mu_y)]}{\sqrt{E[(x_i - \mu_x)^2]} \cdot \sqrt{E[(y_i - \mu_y)^2]}} = \frac{cov(x, y)}{\sigma_x \cdot \sigma_y}$$

Where $E$, $\mu$ and $\sigma$ are the expected value, mean and standard deviation of the random variables, respectively, and $N$ is the size of the population.

Question: Under what conditions can we use this population correlation coefficient?

Page 9: Introduction to correlation and regression analysis

If ๐’™ = ๐’‚๐’š + ๐’ƒ ๐’“๐’™๐’š = ๐Ÿ

Maximum (perfect) positive correlation.

If ๐’™ = ๐’‚๐’š + ๐’ƒ ๐’“๐’™๐’š = โˆ’๐Ÿ

Maximum (perfect) negative correlation.

If there is no linear association between ๐’™ and ๐’šthen ๐’“๐’™๐’š = ๐ŸŽ.

Note 1: If there is no linear association between two

random variables they might have non linear

association or no association at all.

For all ๐’‚ , ๐’ƒ โˆˆ ๐‘นAnd ๐’‚ > ๐ŸŽ

For all ๐’‚ , ๐’ƒ โˆˆ ๐‘นAnd ๐’‚ < ๐ŸŽ

Page 10: Introduction to correlation and regression analysis

In our example, the sample correlation coefficient is calculated from the following table (with $\bar{x} = 58$ and $\bar{y} = 59.1$):

$x_i$   $y_i$   $x_i-\bar{x}$   $y_i-\bar{y}$   $(x_i-\bar{x})(y_i-\bar{y})$   $(x_i-\bar{x})^2$   $(y_i-\bar{y})^2$
70      73       12      13.9     166.8     144    193.21
85      90       27      30.9     834.3     729    954.81
22      31      -36     -28.1    1011.6    1296    789.61
66      50        8      -9.1     -72.8      64     82.81
15      31      -43     -28.1    1208.3    1849    789.61
58      50        0      -9.1       0         0     82.81
69      56       11      -3.1     -34.1     121      9.61
49      55       -9      -4.1      36.9      81     16.81
73      80       15      20.9     313.5     225    436.81
61      49        3     -10.1     -30.3       9    102.01
77      79       19      19.9     378.1     361    396.01
44      58      -14      -1.1      15.4     196      1.21
35      40      -23     -19.1     439.3     529    364.81
88      85       30      25.9     777       900    670.81
69      73       11      13.9     152.9     121    193.21
Sums:                            5196.9    6625   5084.15

$$r_{xy} = \frac{\frac{1}{n}\sum(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\frac{1}{n}\sum(x_i-\bar{x})^2}\cdot\sqrt{\frac{1}{n}\sum(y_i-\bar{y})^2}} = \frac{5196.9}{\sqrt{6625 \times 5084.15}} = 0.895$$

which shows a strong positive correlation between maths score and overall score.
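To double-check the arithmetic, here is a minimal sketch in Python (not part of the original slides) which recomputes $r_{xy}$ from the raw scores in the table; the $\frac{1}{n}$ factors cancel, so only the sums are needed:

```python
# A minimal sketch recomputing the sample correlation coefficient
# for the maths/overall scores used in the worked example.
maths   = [70, 85, 22, 66, 15, 58, 69, 49, 73, 61, 77, 44, 35, 88, 69]  # x
overall = [73, 90, 31, 50, 31, 50, 56, 55, 80, 49, 79, 58, 40, 85, 73]  # y

n = len(maths)
mean_x = sum(maths) / n
mean_y = sum(overall) / n

# sums of cross-products and squared deviations (the 1/n factors cancel)
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(maths, overall))
s_xx = sum((x - mean_x) ** 2 for x in maths)
s_yy = sum((y - mean_y) ** 2 for y in overall)

r = s_xy / (s_xx * s_yy) ** 0.5
print(round(r, 3))  # 0.895
```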

Page 11: Introduction to correlation and regression analysis

[Grid of scatter diagrams, adapted and modified from www.tice.agrocampus-ouest.fr. Columns: Positive Linear Association, No Linear Association, Negative Linear Association. Rows: $S_x > S_y$, $S_x = S_y$, $S_x < S_y$.]

Across the diagrams the correlation ranges from perfect to none:

$r_{xy} = 1$: Perfect; $r_{xy} \approx 1$: Strong; $0 < r_{xy} < 1$: Weak; $r_{xy} = 0$: No Correlation; $-1 < r_{xy} < 0$: Weak; $r_{xy} \approx -1$: Strong; $r_{xy} = -1$: Perfect.

Page 12: Introduction to correlation and regression analysis

Some properties of the correlation coefficient (sample or population):

a. It lies between -1 and 1, i.e. $-1 \le r_{xy} \le 1$.

b. It is symmetrical with respect to $x$ and $y$, i.e. $r_{xy} = r_{yx}$. This means the direction of calculation is not important.

c. It is just a pure number, independent of the units of measurement of $x$ and $y$.

d. It is independent of the choice of origin and scale of the measurements of $x$ and $y$; that is,

$$r_{xy} = r_{(ax+b)(cy+d)} \qquad (a, c > 0)$$

Page 13: Introduction to correlation and regression analysis

e. ๐’‡ ๐’™, ๐’š = ๐’‡ ๐’™ . ๐’‡(๐’š) ๐’“๐’™๐’š = ๐ŸŽ

Important Note:Many researchers wrongly construct a theory just based on a

simple correlation test.

Correlation does not imply causation.

If there is a high correlation between number of smoked

cigarettes and the number of infected lungโ€™s cells it does not

necessarily mean that smoking causes lung cancer. Causality

test (sometimes called Granger causality test) is different from

correlation test.

In causality test it is important to know about the direction of

causality (e.g. ๐’™ on ๐’š and not vice versa) but in correlation we

are trying to find if two variables moving together (same or

opposite directions).

๐’™ and ๐’š are statistically independent, where ๐’‡(๐’™, ๐’š) is the joint Probability

Density Function (PDF)

Page 14: Introduction to correlation and regression analysis

Determination Coefficient and Correlation Coefficient:

$r_{xy} = \pm 1$: perfect linear relationship between the variables; i.e. $x$ is the only factor which describes the variations of $y$ at the level (linearly): $y = a + bx$.

$r_{xy} \approx \pm 1$: $x$ is not the only factor which describes the variations of $y$, but we can still imagine a line representing this relationship which passes through most of the points or which has, in total, a minimum vertical distance from them. This line is called the "line of best fit", known technically as the "regression line".

[Graph adapted from www.ncetm.org.uk/public/files/195322/G3fb.jpg: a line of best fit between the age of a car and its price. Imagine the line has the equation $y = a + bx$.]

Page 15: Introduction to correlation and regression analysis

The criterion for choosing a line among others is its goodness of fit, which can be measured through the determination coefficient, $r^2$.

In the previous example, the age of a car is only one factor among many others that explain the price of a car. Can you find some other factors?

If $y$ and $x$ represent the price and age of cars respectively, the percentage of the variation of $y$ which is determined (explained) by the variation of $x$ is called the "determination coefficient".

The determination coefficient can be understood better through Venn-Euler diagrams:

Page 16: Introduction to correlation and regression analysis

[Four Venn-Euler diagrams showing the overlap of circles $y$ and $x$ growing from no overlap to $y = x$.]

$r^2 = 0$: none of the variation of $y$ can be determined by $x$ (no linear association).

$r^2 \approx 0$: a small percentage of the variation of $y$ can be determined by $x$ (weak linear association).

$r^2 \approx 1$: a large percentage of the variation of $y$ can be determined by $x$ (strong linear association).

$r^2 = 1$: all the variation of $y$ can be determined by $x$ and no other factors (complete linear association).

The shaded area shows the percentage of the variation of $y$ which can be determined by $x$. It is easy to understand that $0 \le r^2 \le 1$.

Page 17: Introduction to correlation and regression analysis

Although the determination coefficient ($r^2$) is conceptually different from the correlation coefficient ($r_{xy}$), one can be calculated from the other; in fact:

$$r_{xy} = \pm\sqrt{r^2}$$

Or, alternatively:

$$r^2 = b^2\,\frac{\frac{1}{n}\sum(x_i - \bar{x})^2}{\frac{1}{n}\sum(y_i - \bar{y})^2} = b^2\,\frac{S_x^2}{S_y^2}$$

Where $b$ is the slope coefficient in the regression line $y = a + bx$.

Note: If $y = a + bx$ is the regression line ($y$ on $x$) and $x = c + dy$ is another regression line ($x$ on $y$), then we have $r^2 = b \cdot d$.

Page 18: Introduction to correlation and regression analysis

Summary of Correlation & Determination Coefficients:

• Correlation means a linear association between two random variables, which could be positive, negative or zero.

• Linear association means that the variables are related at their levels (linearly).

• The correlation coefficient measures the strength of the linear association between two variables. It can be calculated for a sample or for the whole population.

• The value of the correlation coefficient lies between -1 and 1; these endpoints show the strongest (negative or positive) correlation, and moving towards zero the correlation becomes weaker.

• Correlation does not imply causation.

• The determination coefficient shows the percentage of the variation of one variable which can be described by another variable, and it is a measure of the goodness of fit for lines passing through the plotted points.

• The value of the determination coefficient lies between 0 and 1 and can be obtained by squaring the correlation coefficient.

Page 19: Introduction to correlation and regression analysis

• Knowing that two random variables are just linearly associated is not very satisfying. There is sometimes a strong idea (hypothesis) that the variation of one variable can solidly explain the variation of another.

• To test this idea we need another analytical approach, which is called "regression analysis".

• In regression analysis we try to study or predict the mean (average) value of a dependent variable $Y$ based on the knowledge we have about the independent (explanatory) variable(s) $X_1, X_2, \dots, X_n$. This is familiar to those who know the meaning of conditional probabilities, as we are going to build a linear model whose deterministic part in regression analysis is:

$$E(Y \mid X_1, X_2, \dots, X_n) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n$$

Page 20: Introduction to correlation and regression analysis

• The deterministic part of the regression model does reflect the structure of the relationship between $Y$ and the $X$'s in a mathematical world, but we live in a stochastic world.

• God's knowledge (if the term is applicable) is deterministic, but our perception of everything in this world is always stochastic, and our model should be built in this way.

• To understand the concept of a stochastic model, let's look at an example:

If we make a model between monthly consumption expenditure $C$ and monthly income $I$, the model cannot be deterministic (mathematical), such that for every value of $I$ there is one and only one value of $C$ (which is the concept of a functional relationship in maths). Why?

Page 21: Introduction to correlation and regression analysis

Although income is the main variable determining the amount of consumption expenditure, many other factors, such as people's mood, their wealth, the interest rate, etc., are overlooked in a simple mathematical model such as $C = f(I)$, but their influences can change the value of $C$ even at the same level of $I$. If we believe that the average impact of all these omitted variables is random (sometimes positive and sometimes negative), then in order to make a realistic model we need to add a stochastic (random) term $u$ to our mathematical model: $C = f(I) + u$.

For example:

$I$ = £1000:  $C$ = £800, £1000, £750
$I$ = £1400:  $C$ = £900, £1200, £1150
⋮

The change in consumption expenditure comes from the change of income ($I$) or the change of some random elements ($u$), so we can write $C = f(I) + u$.

Page 22: Introduction to correlation and regression analysis

• The general stochastic model for our purpose would be as follows, and is called the "Linear Regression Model**":

$$Y_i = E(Y_i \mid X_{1i}, \dots, X_{ni}) + u_i$$

Which can be written as:

$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_n X_{ni} + u_i$$

Where $i$ ($i = 1, 2, \dots, n$) denotes the time period (days, weeks, months, years, etc.) and $u_i$ is an error (stochastic) term, a representative of all other influential variables which are not considered in the model and are ignored.

• The deterministic part of the model,

$$E(Y_i \mid X_{1i}, \dots, X_{ni}) = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_n X_{ni}$$

is called the Population Regression Function (PRF).

Page 23: Introduction to correlation and regression analysis

• The general form of the Linear Regression Model with $k$ explanatory variables and $n$ observations can be shown in matrix form as:

$$\mathbf{Y}_{n\times 1} = \mathbf{X}_{n\times k}\,\boldsymbol{\beta}_{k\times 1} + \mathbf{u}_{n\times 1}$$

Or simply $\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u}$, where

$$\mathbf{Y} = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix}, \quad \mathbf{X} = \begin{bmatrix} 1 & X_{11} & X_{21} & \cdots & X_{k1} \\ 1 & X_{12} & X_{22} & \cdots & X_{k2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & X_{1n} & X_{2n} & \cdots & X_{kn} \end{bmatrix}, \quad \boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{bmatrix}, \quad \mathbf{u} = \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{bmatrix}$$

$\mathbf{Y}$ is also called the regressand and $\mathbf{X}$ is the matrix of regressors.
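As a small illustration of this matrix form, the sketch below (with made-up numbers for $k = 2$ explanatory variables and $n = 4$ observations) builds $\mathbf{X}$, $\boldsymbol{\beta}$ and $\mathbf{u}$ and generates the regressand $\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u}$:

```python
# A sketch of Y = X.beta + u with made-up numbers.
import numpy as np

X = np.array([[1, 2.0, 5.0],    # each row: [1, X_1i, X_2i]
              [1, 3.0, 1.0],
              [1, 4.0, 2.0],
              [1, 5.0, 8.0]])
beta = np.array([0.5, 2.0, -1.0])    # [beta_0, beta_1, beta_2]
u = np.array([0.1, -0.2, 0.0, 0.3])  # disturbances

Y = X @ beta + u   # the regressand, an (n x 1) vector
print(Y)
```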

Page 24: Introduction to correlation and regression analysis

โ€ข ๐œท๐ŸŽ is the intercept but ๐œท๐’Šโ€ฒ๐’” are slope coefficients which are also

called regression parameters. The value of each parameter

shows the magnitude of one unit change in the associated

regressor ๐‘ฟ๐’Š on the mean value of the regressand ๐’€๐’Š. The idea

is to estimate the unknown value of the population

regression parameters based on estimators which use

sample data.

โ€ข The sample counterpart of the regression line can be written in

the form of:

๐’€๐’Š = ๐’€๐’Š + ๐’–๐’Š

or

๐’€๐’Š = ๐’ƒ๐ŸŽ + ๐’ƒ๐Ÿ๐‘ฟ๐Ÿ๐’Š + ๐’ƒ๐Ÿ๐‘ฟ๐Ÿ๐’Š + โ‹ฏ + ๐’ƒ๐’๐‘ฟ๐’๐’Š + ๐’†๐’Š

Where ๐’€๐’Š = ๐’ƒ๐ŸŽ + ๐’ƒ๐Ÿ๐‘ฟ๐Ÿ๐’Š + ๐’ƒ๐Ÿ๐‘ฟ๐Ÿ๐’Š + โ‹ฏ + ๐’ƒ๐’๐‘ฟ๐’๐’Š is the deterministic

part of the sample model and is called โ€œSample Regression

Function (SRF) โ€œand ๐’ƒ๐’Šโ€ฒ๐’” are estimators of unknown parameters

๐œท๐’Šโ€ฒ๐’” and ๐’–๐’Š = ๐’†๐’Š is a residual.

Page 25: Introduction to correlation and regression analysis

The following graph shows the important elements of the PRF and SRF:

[Graph adapted and altered from http://marketingclassic.blogspot.co.uk/2011_12_01_archive.html: an observation $Y_i$, its estimate based on the PRF, $E(Y \mid X_i)$, and its estimate based on the SRF, $\hat{Y}_i$.]

In the PRF: $Y_i - E(Y \mid X_i) = u_i$

In the SRF: $Y_i - \hat{Y}_i = \hat{u}_i = e_i$

$$SRF: \hat{Y}_i = b_0 + b_1 X_i \qquad\qquad PRF: E(Y \mid X_i) = \beta_0 + \beta_1 X_i$$

The PRF is a hypothetical line which we know nothing about, but we try to estimate its parameters based on the data in a sample.

Page 26: Introduction to correlation and regression analysis

• Now the question is how to calculate the $b_i$'s based on the sample observations, and how to ensure that they are good and unbiased estimators of the $\beta_i$'s in the population.

• There are two main methods for calculating the $b_i$'s and constructing the SRF, called the "method of Ordinary Least Squares (OLS)" and the "method of Maximum Likelihood (ML)". Here we focus on the OLS method as it is the most widely used. For simplicity, we start with the two-variable PRF ($E(Y \mid X_i) = \beta_0 + \beta_1 X_i$) and its SRF counterpart ($\hat{Y}_i = b_0 + b_1 X_i$).

• According to the OLS method, we try to minimise the sum of the squared residuals in a hypothetical sample; i.e.

$$\sum \hat{u}_i^2 = \sum e_i^2 = \sum (Y_i - \hat{Y}_i)^2 = \sum (Y_i - b_0 - b_1 X_i)^2 \qquad (A)$$

Page 27: Introduction to correlation and regression analysis

• It is obvious from the previous equation that the sum of squared residuals is a function of $b_0$ and $b_1$, i.e.

$$\sum e_i^2 = f(b_0, b_1)$$

because if these two parameters (intercept and slope) change, $\sum e_i^2$ will change (see the graph on slide 25).

• Differentiating (A) partially with respect to $b_0$ and $b_1$, and following the first-order (necessary) conditions for optimisation in calculus, we have:

$$\frac{\partial \sum e_i^2}{\partial b_0} = -2\sum (Y_i - b_0 - b_1 X_i) = -2\sum e_i = 0$$

$$\frac{\partial \sum e_i^2}{\partial b_1} = -2\sum X_i (Y_i - b_0 - b_1 X_i) = -2\sum X_i e_i = 0 \qquad (B)$$

Page 28: Introduction to correlation and regression analysis

After simplification we reach two equations in the two unknowns $b_0$ and $b_1$:

$$\sum Y_i = n b_0 + b_1 \sum X_i$$

$$\sum Y_i X_i = b_0 \sum X_i + b_1 \sum X_i^2$$

Where $n$ is the sample size. So (writing $x_i = X_i - \bar{X}$ and $y_i = Y_i - \bar{Y}$ for deviations from the means):

$$b_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2} = \frac{\sum x_i y_i}{\sum x_i^2} = \frac{cov(x, y)}{S_x^2}$$

Where $S_x$ is the biased version of the sample standard deviation, i.e. we have $n$ instead of $(n-1)$ in the denominator:

$$S_x = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n}}$$

Page 29: Introduction to correlation and regression analysis

And

$$b_0 = \bar{Y} - b_1 \bar{X}$$
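Applying these formulas to the maths/overall scores from the correlation example gives the fitted line; this is a minimal sketch, not part of the original slides:

```python
# A minimal sketch of the OLS formulas just derived, applied to the
# maths (X) / overall (Y) scores from the correlation example.
X = [70, 85, 22, 66, 15, 58, 69, 49, 73, 61, 77, 44, 35, 88, 69]
Y = [73, 90, 31, 50, 31, 50, 56, 55, 80, 49, 79, 58, 40, 85, 73]

n = len(X)
x_bar = sum(X) / n
y_bar = sum(Y) / n

# b1 = sum(x_i * y_i) / sum(x_i^2) in deviation form; b0 = Ybar - b1*Xbar
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / \
     sum((x - x_bar) ** 2 for x in X)
b0 = y_bar - b1 * x_bar
print(round(b0, 2), round(b1, 2))  # ~ 13.96 and 0.78: Y^ = b0 + b1*X
```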

โ€ข The ๐’ƒ๐ŸŽ and ๐’ƒ๐Ÿ obtained from OLS method are the point

estimators of ๐œท๐ŸŽ and ๐œท๐Ÿin the population but in order to test

some hypothesis about the population parameters we need to

have knowledge about the distributions of their estimators. For

that reason we need to make some assumptions about the

explanatory variables and the error term in PRF. (see the

equations in B to find the reason).

The Assumptions Underlying the OLS Method:

1. The regression model is linear in terms of its parameters (coefficients).*

2. The values of the explanatory variable(s) are fixed in repeated sampling.

This means that the nature of explanatory variables (๐‘ฟโ€ฒ๐’”) is non-stochastic.

The only stochastic variables are error term (๐’–๐’Š) and regressand (๐’€๐’Š).

3. The disturbance (error) terms are normally distributed with zero mean and

equal variance; given the value of ๐‘ฟโ€ฒ๐’”. That is: ๐’–๐’Š~๐‘ต(๐ŸŽ, ๐ˆ๐Ÿ)

Page 30: Introduction to correlation and regression analysis

4. There is no autocorrelation between the error terms, i.e.

$$cov(u_i, u_j) = 0 \qquad (i \neq j)$$

This means they are completely random, with no association between them or any pattern in their appearance.

5. There is no correlation between the error terms and the explanatory variables, i.e.

$$cov(u_i, X_i) = 0$$

6. The number of observations (sample size) should be bigger than the number of parameters in the model.

7. The model should be logically and correctly specified in terms of its functional form and the type and nature of the variables entering the model.

These are the assumptions of the Classical Linear Regression Model (CLRM), which are sometimes called the Gaussian assumptions on linear regression models.

Page 31: Introduction to correlation and regression analysis

• Under these assumptions, and also by the central limit theorem, the OLS estimators in the sampling distribution (repeated sampling), when $n \to \infty$, have a normal distribution:

$$b_0 \sim N\!\left(\beta_0,\; \frac{\sum X_i^2}{n \sum x_i^2}\,\sigma^2\right)$$

$$b_1 \sim N\!\left(\beta_1,\; \frac{\sigma^2}{\sum x_i^2}\right)$$

where $\sigma^2$ is the variance of the error term ($var(u_i) = \sigma^2$), and it can itself be estimated through the estimator $\hat{\sigma}^2$, where:

$$\hat{\sigma}^2 = \frac{\sum e_i^2}{n - 2} \quad\text{or}\quad \hat{\sigma}^2 = \frac{\sum e_i^2}{n - k}\ \text{ when there are } k \text{ parameters in the model.}$$

Page 32: Introduction to correlation and regression analysis

• Based on the assumptions of the classical linear regression model (CLRM), the Gauss-Markov theorem asserts that the least squares estimators, among unbiased estimators, have the minimum variance. So they are the Best Linear Unbiased Estimators (BLUE).

Interval Estimation For Population Parameters:

• In order to construct a confidence interval for the unknown $\beta$'s (the PRF's parameters) we can either follow the Z distribution (if we have prior knowledge about $\sigma$) or the t-distribution (if we use $\hat{\sigma}$ instead).

• The confidence interval for the slope parameter at any level of significance $\alpha$ would be*:

$$P\!\left(b_1 - z_{\alpha/2} \cdot \sigma_{b_1} \le \beta_1 \le b_1 + z_{\alpha/2} \cdot \sigma_{b_1}\right) = 1 - \alpha$$

Or

$$P\!\left(b_1 - t_{\alpha/2,\,(n-2)} \cdot \hat{\sigma}_{b_1} \le \beta_1 \le b_1 + t_{\alpha/2,\,(n-2)} \cdot \hat{\sigma}_{b_1}\right) = 1 - \alpha$$

Page 33: Introduction to correlation and regression analysis

Hypothesis Testing For Parameters:

• The critical values (Z or t) in the confidence intervals can be used to find the rejection area(s) and test any hypothesis on the parameters.

• For example, to test $H_0: \beta_1 = 0$ against the alternative $H_1: \beta_1 \neq 0$, after finding the critical values of t (which means we do not have prior knowledge of $\sigma$ and use $\hat{\sigma}$ instead) at a significance level $\alpha$, we will have two critical regions, and if the value of the test statistic

$$t = \frac{b_1 - \beta_1}{\hat{\sigma}/\sqrt{\sum x_i^2}}$$

falls in a critical region, $H_0: \beta_1 = 0$ must be rejected.

• In case we have more than one slope parameter, the degrees of freedom for the t-distribution will be the sample size $n$ minus the number of estimated parameters, including the intercept; i.e. for $k$ parameters, $df = n - k$.
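The same numbers give a quick sketch of the test of $H_0: \beta_1 = 0$ (again, the standard error is an assumed figure for illustration):

```python
# A sketch of the two-sided t-test of H0: beta_1 = 0.
from scipy import stats

n, k = 15, 2              # observations and estimated parameters
b1, se_b1 = 0.78, 0.11    # slope estimate and (assumed) standard error

t_stat = b1 / se_b1       # t = (b1 - 0) / se(b1)
p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df=n - k))
print(t_stat, p_value)    # reject H0 at level alpha if p_value < alpha
```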

Page 34: Introduction to correlation and regression analysis

Determination Coefficient ๐’“๐Ÿ and Goodness of Fit:

โ€ข In early slides we talked about determination coefficient and

its relationship with correlation coefficient. The coefficient of

determination ๐’“๐Ÿ come to our attention when there is no issue

about estimation of regression parameters.

โ€ข It is a measure which shows how well the SRF fits the data.

โ€ข to understand this measure properly letโ€™s have a look at it

from different angle.

We know that

๐’€๐’Š = ๐’€๐’Š + ๐’†๐’Š

And in the deviation form after

subtracting ๐’€ from both sides

๐’€๐’Š โˆ’ ๐’€ = ๐’€๐’Š โˆ’ ๐’€ + ๐’†๐’Š

We know that ๐’†๐’Š = ๐’€๐’Š โˆ’ ๐’€๐’Š

๐’†๐’Š Ad

op

ted

from

Basic Eco

no

me

trics Go

jaratiP7

6

๐‘Œ

๐’€๐’Š โˆ’ ๐’€

Page 35: Introduction to correlation and regression analysis

So;๐’€๐’Š โˆ’ ๐’€ = ( ๐’€๐’Š โˆ’ ๐’€) + (๐’€๐’Š โˆ’ ๐’€๐’Š)

Or in the deviation form๐’š๐’Š = ๐’š๐’Š + ๐’†๐’Š

By squaring both sides and adding all over the sample we have:

๐’š๐’Š๐Ÿ = ๐’š๐’Š

๐Ÿ + ๐Ÿ ๐’š๐’Š ๐’†๐’Š + ๐’†๐’Š๐Ÿ

= ๐’š๐’Š๐Ÿ + ๐’†๐’Š

๐Ÿ

Where ๐’š๐’Š ๐’†๐’Š = ๐ŸŽ according to the OLSโ€™s assumptions 3 and 5.

And if we change it to the non-deviated form:

๐’€๐’Š โˆ’ ๐’€ 2 = ๐’€๐’Š โˆ’ ๐’€2

+ ๐’€๐’Š โˆ’ ๐’€๐’Š2

Total variation of the observed Y values around their mean =Total Sum of

Squares= TSS

Total explained variation of the estimated Y values around their

mean = Explained Sum of Squares (by explanatory

variables)= ESS

Total unexplained variation of the observed Y values around the regression line= Residual Sum of Squares (Explained by

error terms)= RSS

Page 36: Introduction to correlation and regression analysis

Dividing both sides by the Total Sum of Squares (TSS) we have:

$$1 = \frac{ESS}{TSS} + \frac{RSS}{TSS} = \frac{\sum (\hat{Y}_i - \bar{Y})^2}{\sum (Y_i - \bar{Y})^2} + \frac{\sum (Y_i - \hat{Y}_i)^2}{\sum (Y_i - \bar{Y})^2}$$

Where $\frac{\sum (\hat{Y}_i - \bar{Y})^2}{\sum (Y_i - \bar{Y})^2} = \frac{ESS}{TSS}$ is the percentage of the variation of the actual (observed) $Y_i$ which is explained by the explanatory variables (by the regression line).

• A good reader knows that this is not a new concept; the determination coefficient $r^2$ was already described as a measure of the goodness of fit between different alternative sample regression functions (SRFs).

$$1 = r^2 + \frac{RSS}{TSS} \;\rightarrow\; r^2 = 1 - \frac{RSS}{TSS} = 1 - \frac{\sum e_i^2}{\sum (Y_i - \bar{Y})^2}$$

Page 37: Introduction to correlation and regression analysis

• A good model must have a reasonably high $r^2$, but this does not mean that any model with a high $r^2$ is a good model. An extremely high $r^2$ could be the result of a spurious regression line, due to a variety of reasons such as non-stationarity of the data, cointegration problems, etc.

• In a regression model with two parameters, $r^2$ can be calculated directly (using $\bar{Y} = b_0 + b_1\bar{X}$):

$$r^2 = \frac{\sum (\hat{Y}_i - \bar{Y})^2}{\sum (Y_i - \bar{Y})^2} = \frac{\sum (b_0 + b_1 X_i - b_0 - b_1 \bar{X})^2}{\sum (Y_i - \bar{Y})^2} = \frac{b_1^2 \sum (X_i - \bar{X})^2}{\sum (Y_i - \bar{Y})^2} = \frac{b_1^2 \sum x_i^2}{\sum y_i^2} = b_1^2\,\frac{S_X^2}{S_Y^2}$$

Where $S_X$ and $S_Y$ are the standard deviations of $X$ and $Y$, respectively.

Page 38: Introduction to correlation and regression analysis

Multiple Regression Analysis:

• If there is more than one explanatory variable in the regression, we need additional assumptions about the independence of the explanatory variables, in particular that there is no exact linear relationship between them.

• The population and sample regression models for the three-variable model can be described as follows:

In the population: $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + u_i$

In the sample: $Y_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + e_i$

• The OLS estimators can be obtained by minimising $\sum e_i^2$. So, the values of the SRF parameters in the deviation form are as follows:

$$b_1 = \frac{(\sum x_{1i} y_i)(\sum x_{2i}^2) - (\sum x_{2i} y_i)(\sum x_{1i} x_{2i})}{(\sum x_{1i}^2)(\sum x_{2i}^2) - (\sum x_{1i} x_{2i})^2}$$

Page 39: Introduction to correlation and regression analysis

๐’ƒ๐Ÿ =( ๐’™๐Ÿ๐’Š๐’š๐’Š)( ๐’™๐Ÿ๐’Š

๐Ÿ) โˆ’ ( ๐’™๐Ÿ๐’Š๐’š๐’Š)( ๐’™๐Ÿ๐’Š๐’™๐Ÿ๐’Š)

( ๐’™๐Ÿ๐’Š๐Ÿ)( ๐’™๐Ÿ๐’Š

๐Ÿ) โˆ’ ( ๐’™๐Ÿ๐’Š๐’™๐Ÿ๐’Š)๐Ÿ

And the intercept parameter will be calculated in the non-deviated

form as:

๐’ƒ๐ŸŽ = ๐’€ โˆ’ ๐’ƒ๐Ÿ๐‘ฟ๐Ÿ โˆ’ ๐’ƒ๐Ÿ๐‘ฟ๐Ÿ

โ€ข Under the classical assumptions and also the central limit

theorem the OLS estimators in sampling distribution (repeated

sampling),when ๐’ โ†’ โˆž, have a normal distribution:

๐’ƒ๐Ÿ~๐‘ต(๐œท๐Ÿ,๐ˆ๐’–

๐Ÿ. ๐’™๐Ÿ๐’Š๐Ÿ

( ๐’™๐Ÿ๐’Š๐Ÿ)( ๐’™๐Ÿ๐’Š

๐Ÿ) โˆ’ ( ๐’™๐Ÿ๐’Š๐’™๐Ÿ๐’Š)๐Ÿ)

๐’ƒ๐Ÿ~๐‘ต(๐œท๐Ÿ,๐ˆ๐’–

๐Ÿ. ๐’™๐Ÿ๐’Š๐Ÿ

( ๐’™๐Ÿ๐’Š๐Ÿ)( ๐’™๐Ÿ๐’Š

๐Ÿ) โˆ’ ( ๐’™๐Ÿ๐’Š๐’™๐Ÿ๐’Š)๐Ÿ)

Page 40: Introduction to correlation and regression analysis

• The distribution of the intercept parameter $b_0$ is not of primary concern, as in many cases it has no practical importance.

• If the variance of the disturbance (error) term ($\sigma_u^2$) is not known, the residual variance ($\hat{\sigma}_u^2$), which is an unbiased estimator of the former, can be used:

$$\hat{\sigma}_u^2 = \frac{\sum e_i^2}{n - k}$$

Where $k$ is the number of parameters in the model (including the intercept $b_0$). Therefore, in a regression model with two slope parameters and one intercept parameter, the residual variance can be calculated by:

$$\hat{\sigma}_u^2 = \frac{\sum e_i^2}{n - 3}$$

Page 41: Introduction to correlation and regression analysis

So, for a model with two slope parameters, the unbiased estimates of the variances of these parameters are:

$$S_{b_1}^2 = \frac{\sum e_i^2}{n - 3} \cdot \frac{\sum x_{2i}^2}{(\sum x_{1i}^2)(\sum x_{2i}^2) - (\sum x_{1i} x_{2i})^2} = \frac{\hat{\sigma}_u^2}{\sum x_{1i}^2\,(1 - r_{12}^2)}$$

Where

$$r_{12}^2 = \frac{(\sum x_{1i} x_{2i})^2}{\sum x_{1i}^2 \sum x_{2i}^2}$$

and

$$S_{b_2}^2 = \frac{\sum e_i^2}{n - 3} \cdot \frac{\sum x_{1i}^2}{(\sum x_{1i}^2)(\sum x_{2i}^2) - (\sum x_{1i} x_{2i})^2} = \frac{\hat{\sigma}_u^2}{\sum x_{2i}^2\,(1 - r_{12}^2)}$$

Page 42: Introduction to correlation and regression analysis

The Coefficient of Multiple Determination ($R^2$ and $\bar{R}^2$):

The same concept of the coefficient of determination used for a bivariate model can be extended to a multivariate model.

• If $R^2$ denotes the coefficient of multiple determination, it shows the proportion (percentage) of the total variation of $Y$ explained by the explanatory variables, and it is calculated by:

$$R^2 = \frac{ESS}{TSS} = \frac{\sum \hat{y}_i^2}{\sum y_i^2} = \frac{b_1 \sum y_i x_{1i} + b_2 \sum y_i x_{2i}}{\sum y_i^2} \qquad (C)$$

And we know that $0 \le R^2 \le 1$.

Note that $R^2$ can also be calculated through RSS, i.e.

$$R^2 = 1 - \frac{RSS}{TSS} = 1 - \frac{\sum e_i^2}{\sum y_i^2}$$

Page 43: Introduction to correlation and regression analysis

โ€ข ๐‘น๐Ÿ is likely to increase by including an additional explanatory

variable (see ). Therefore, in case we have two alternative

models with the same dependent variable ๐’€ but different

number of explanatory variables we should not be misled by the

high ๐‘น๐Ÿof the model with more variables.

โ€ข To solve this problem we need to bring the degrees of freedom

into our consideration as a reduction factor against adding

additional explanatory variables. So, the adjusted ๐‘น๐Ÿ which can

be shown by ๐‘น๐Ÿ is considered as an alternative coefficient of

determination and it is calculated as:

๐‘…2 = 1 โˆ’

๐‘’๐‘–2

๐‘› โˆ’ ๐‘˜ ๐‘ฆ๐‘–

2

๐‘› โˆ’ 1

= 1 โˆ’๐‘› โˆ’ 1

๐‘› โˆ’ ๐‘˜. ๐‘’๐‘–

2

๐‘ฆ๐‘–2

= 1 โˆ’๐‘›โˆ’1

๐‘›โˆ’๐‘˜(1 โˆ’ ๐‘…2)

C
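A small helper capturing the adjustment; the $R^2$ values fed to it below are assumed numbers, chosen to show that a higher $R^2$ from an extra regressor can still lower $\bar{R}^2$:

```python
# A sketch of the adjusted R^2 formula:
# R2_bar = 1 - (n - 1)/(n - k) * (1 - R2).
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """n observations, k estimated parameters (including the intercept)."""
    return 1 - (n - 1) / (n - k) * (1 - r2)

# adding a regressor raises R2 mechanically; the adjustment penalises it
print(adjusted_r2(0.80, n=15, k=2))  # 0.7846...
print(adjusted_r2(0.81, n=15, k=3))  # 0.7783...: higher R2, lower R2_bar
```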

Page 44: Introduction to correlation and regression analysis

Partial Correlation Coefficients:

• For a three-variable regression model such as

$$Y_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + e_i$$

we can talk about three linear associations (correlations): between $Y$ and $X_1$ ($r_{yx_1}$), between $Y$ and $X_2$ ($r_{yx_2}$), and finally between $X_1$ and $X_2$ ($r_{x_1 x_2}$). These correlations are called simple (gross) correlation coefficients, but they do not reflect the true linear association between two variables, as the influence of the third variable on the other two is not removed.

• The net linear association between two variables can be obtained through the partial correlation coefficient, where the influence of the third variable is removed (the variable is held constant). Symbolically, $r_{yx_1 . x_2}$ represents the partial correlation coefficient between $Y$ and $X_1$, holding $X_2$ constant.

Page 45: Introduction to correlation and regression analysis

• The two partial correlation coefficients in our model can be calculated as follows:

$$r_{yx_1 . x_2} = \frac{r_{yx_1} - r_{yx_2}\, r_{x_1 x_2}}{\sqrt{1 - r_{x_1 x_2}^2} \cdot \sqrt{1 - r_{yx_2}^2}}$$

$$r_{yx_2 . x_1} = \frac{r_{yx_2} - r_{yx_1}\, r_{x_1 x_2}}{\sqrt{1 - r_{x_1 x_2}^2} \cdot \sqrt{1 - r_{yx_1}^2}}$$

• The correlation coefficient $r_{x_1 x_2 . y}$ has no practical importance. Specifically, when the direction of causality is from the $X$'s to $Y$, we can simply use the simple correlation coefficient in this case:

$$r_{x_1 x_2} = \frac{\sum x_{1i} x_{2i}}{\sqrt{\sum x_{1i}^2} \cdot \sqrt{\sum x_{2i}^2}}$$

• Partial correlation coefficients can be used to find out which explanatory variable has more linear association with the dependent variable.
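A sketch of the first formula as a function; the three gross correlations passed in are assumed numbers for illustration:

```python
# A sketch of the partial correlation formula r_{y x1 . x2}, built from
# the simple (gross) correlation coefficients.
def partial_corr(r_y1, r_y2, r_12):
    """Correlation of y and x1, holding x2 constant."""
    return (r_y1 - r_y2 * r_12) / (((1 - r_12 ** 2) * (1 - r_y2 ** 2)) ** 0.5)

# e.g. gross correlations r_yx1 = 0.8, r_yx2 = 0.5, r_x1x2 = 0.4
print(round(partial_corr(0.8, 0.5, 0.4), 3))  # 0.756
```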

Page 46: Introduction to correlation and regression analysis

Hypothesis Testing in Multiple Regression Models:

In a multiple regression model, hypotheses are formed to test different aspects of this type of regression model:

i. Testing a hypothesis about an individual parameter of the model. For example,

$$H_0: \beta_j = 0 \quad\text{against}\quad H_1: \beta_j \neq 0$$

If $\sigma$ is unknown and is replaced by $\hat{\sigma}$, the test statistic

$$t = \frac{b_j - \beta_j}{se(b_j)} = \frac{b_j}{se(b_j)}$$

follows the t-distribution with $n - k$ df (for a regression model with three parameters, including the intercept, $df = n - 3$).

Page 47: Introduction to correlation and regression analysis

ii. Testing a hypothesis about the equality of two parameters in the model. For example,

$$H_0: \beta_i = \beta_j \quad\text{against}\quad H_1: \beta_i \neq \beta_j$$

Again, if $\sigma$ is unknown and is replaced by $\hat{\sigma}$, the test statistic

$$t = \frac{(b_i - b_j) - (\beta_i - \beta_j)}{se(b_i - b_j)} = \frac{b_i - b_j}{\sqrt{var(b_i) + var(b_j) - 2\,cov(b_i, b_j)}}$$

follows the t-distribution with $n - k$ df.

• If the value of the test statistic $|t| > t_{\alpha/2,\,(n-k)}$, we must reject $H_0$; otherwise, there is not enough evidence to reject it.

Page 48: Introduction to correlation and regression analysis

iii. Testing a hypothesis about the overall significance of the estimated model, by checking whether all the slope parameters are simultaneously zero. For example, to test

$$H_0: \beta_i = 0 \;(\forall i) \quad\text{against}\quad H_1: \exists\, \beta_i \neq 0$$

the analysis of variance (ANOVA) table can be used to find out whether the mean sum of squares (MSS) due to the regression (or the explanatory variables) is very far from the MSS due to the residuals. If it is, it means the variation of the explanatory variables contributes more towards the variation of the dependent variable than the variation of the residuals, so the ratio

$$\frac{MSS \text{ due to regression (explanatory variables)}}{MSS \text{ due to residuals (random elements)}}$$

should be much higher than one.

Page 49: Introduction to correlation and regression analysis

• The ANOVA table for the three-variable regression model can be formed as follows:

Source of variation             Sum of Squares (SS)                          df       Mean Sum of Squares (MSS)
Due to explanatory variables    $b_1 \sum y_i x_{1i} + b_2 \sum y_i x_{2i}$  2        $\frac{b_1 \sum y_i x_{1i} + b_2 \sum y_i x_{2i}}{2}$
Due to residuals                $\sum e_i^2$                                 $n - 3$  $\hat{\sigma}^2 = \frac{\sum e_i^2}{n - 3}$
Total                           $\sum y_i^2$                                 $n - 1$

• If we believe that the regression model is meaningless, then we cannot reject the null hypothesis that all slope coefficients are simultaneously equal to zero; otherwise, the test statistic

$$F = \frac{ESS/df}{RSS/df} = \frac{(b_1 \sum y_i x_{1i} + b_2 \sum y_i x_{2i})/2}{\sum e_i^2/(n - 3)}$$

which follows the F-distribution with 2 and $n - 3$ df, must be much bigger than 1.

Page 50: Introduction to correlation and regression analysis

• In general, to test the overall significance of the sample regression for a multi-variable model (e.g. with $k$ parameters, including the intercept), the null and alternative hypotheses and the test statistic are as follows:

$$H_0: \beta_1 = \beta_2 = \dots = \beta_{k-1} = 0 \qquad H_1: \text{at least one } \beta_i \neq 0$$

$$F = \frac{ESS/(k - 1)}{RSS/(n - k)}$$

• If $F > F_{\alpha,\,k-1,\,n-k}$ we reject $H_0$ at the significance level $\alpha$; otherwise, there is not enough evidence to reject it.

• It is sometimes easier to use the determination coefficient $R^2$ to run the above test, because

$$R^2 = \frac{ESS}{TSS} \;\rightarrow\; ESS = R^2 \cdot TSS$$

and also

$$RSS = (1 - R^2) \cdot TSS$$

Page 51: Introduction to correlation and regression analysis

• The ANOVA table can also be written as:

Source of variation             Sum of Squares (SS)       df       Mean Sum of Squares (MSS)
Due to explanatory variables    $R^2 \sum y_i^2$          $k - 1$  $\frac{R^2 \sum y_i^2}{k - 1}$
Due to residuals                $(1 - R^2) \sum y_i^2$    $n - k$  $\hat{\sigma}^2 = \frac{(1 - R^2) \sum y_i^2}{n - k}$
Total                           $\sum y_i^2$              $n - 1$

• So the test statistic F can be written as:

$$F = \frac{R^2 \sum y_i^2/(k - 1)}{(1 - R^2) \sum y_i^2/(n - k)} = \frac{n - k}{k - 1} \cdot \frac{R^2}{1 - R^2}$$

Page 52: Introduction to correlation and regression analysis

iv. Testing hypotheses about parameters when they satisfy certain restrictions.*

e.g. $H_0: \beta_i + \beta_j = 1$ against $H_1: \beta_i + \beta_j \neq 1$

v. Testing hypotheses about the stability of the estimated regression model over a specific time period or across two cross-sectional units.**

vi. Testing hypotheses about different functional forms of regression models.***