Introduction to Correlation & Regression Analysis

Farzad Javidanrad

November 2013

Some Basic Concepts:

o Variable: A letter (symbol) which represents the elements of a specific set.

o Random Variable: A variable whose values appear randomly according to a probability distribution.

o Probability Distribution: A rule (function) which assigns a probability to the values of a random variable (individually or to a set of them). E.g. (coin toss):

In one trial the outcomes are H, T. In two trials the outcomes are HH, HT, TH, TT.

x       0     1
P(x)    0.5   0.5

Correlation:

Is there any relation between:

• fast food sales and different seasons?
• a specific crime and religion?
• cigarette smoking and lung cancer?
• maths score and overall score in an exam?
• temperature and earthquakes?
• cost of advertisement and number of sold items?

To answer each question, two sets of corresponding data need to be randomly collected.

Let random variable "𝒙" represent the first group of data and random variable "𝒚" represent the second.

Question: Is it true that students who have a better overall result are good at maths?

Our aim is to find out whether there is any linear association between 𝒙 and 𝒚. In statistics, the technical term for linear association is “correlation”. So, we are looking to see if there is any correlation between the two scores.

“Linear association”: variables are related at their levels, i.e. $x$ with $y$, not with $y^2$, $y^3$, $1/y$ or even $\Delta y$.

Imagine we have a random sample of scores in a school as follows:

In our example, the correlation between 𝒙 and 𝒚

can be shown in a scatter diagram:

[Figure: scatter diagram titled "Correlation between maths score and overall score"; horizontal axis X and vertical axis Y, both running from 0 to 100.]

The graph shows a positive correlation between maths scores and overall scores, i.e. when 𝒙 increases 𝒚 increases too.

Different scatter diagrams show different types of correlation:

[Figure: scatter diagrams showing different types of correlation. Adopted from www.pdesas.org]

• Is this enough? Are we happy? Certainly not! We think we know things better when they are described by numbers!

Although scatter diagrams are informative, to find the degree (strength) of a correlation between two variables we need a numerical measurement.

Following the work of Francis Galton on regression

line, in 1896 Karl Pearson introduced a formula for

measuring correlation between two variables, called

Correlation Coefficient or Pearson’s Correlation

Coefficient.

For a sample of size $n$, the sample correlation coefficient $r_{xy}$ can be calculated by:

$$r_{xy} = \frac{\frac{1}{n}\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\frac{1}{n}\sum (x_i - \bar{x})^2} \cdot \sqrt{\frac{1}{n}\sum (y_i - \bar{y})^2}} = \frac{cov(x, y)}{S_x \cdot S_y}$$

Where $\bar{x}$ and $\bar{y}$ are the mean values of $x$ and $y$ in the sample and $S$ represents the biased version of “standard deviation”*. The covariance between $x$ and $y$, $cov(x, y)$, shows how much $x$ and $y$ change together.

Alternatively, if there is an opportunity to observe all available data, the population correlation coefficient ($\rho_{xy}$) can be obtained by:

$$\rho_{xy} = \frac{E\left[(x_i - \mu_x)(y_i - \mu_y)\right]}{\sqrt{E(x_i - \mu_x)^2} \cdot \sqrt{E(y_i - \mu_y)^2}} = \frac{cov(x, y)}{\sigma_x \cdot \sigma_y}$$

Where $E$, $\mu$ and $\sigma$ are the expected value, mean and standard deviation of the random variables, respectively, and $N$ is the size of the population.

Question: Under what conditions can we use this population correlation coefficient?

If $x = ay + b$ for any $a, b \in R$ with $a > 0$, then $r_{xy} = 1$: maximum (perfect) positive correlation.

If $x = ay + b$ for any $a, b \in R$ with $a < 0$, then $r_{xy} = -1$: maximum (perfect) negative correlation.

If there is no linear association between $x$ and $y$ then $r_{xy} = 0$.

Note 1: If there is no linear association between two random variables they might have a non-linear association or no association at all.

In our example, the sample correlation coefficient is:

x_i   y_i   x_i − x̄   y_i − ȳ   (x_i − x̄)(y_i − ȳ)   (x_i − x̄)²   (y_i − ȳ)²
70    73      12       13.9         166.8              144        193.21
85    90      27       30.9         834.3              729        954.81
22    31     -36      -28.1        1011.6             1296        789.61
66    50       8       -9.1         -72.8               64         82.81
15    31     -43      -28.1        1208.3             1849        789.61
58    50       0       -9.1           0                  0         82.81
69    56      11       -3.1         -34.1              121          9.61
49    55      -9       -4.1          36.9               81         16.81
73    80      15       20.9         313.5              225        436.81
61    49       3      -10.1         -30.3                9        102.01
77    79      19       19.9         378.1              361        396.01
44    58     -14       -1.1          15.4              196          1.21
35    40     -23      -19.1         439.3              529        364.81
88    85      30       25.9         777                900        670.81
69    73      11       13.9         152.9              121        193.21
Sum                                5196.9             6625       5084.15

$$r_{xy} = \frac{\frac{1}{n}\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\frac{1}{n}\sum (x_i - \bar{x})^2} \cdot \sqrt{\frac{1}{n}\sum (y_i - \bar{y})^2}} = \frac{5196.9}{\sqrt{6625 \times 5084.15}} = 0.895$$

which shows a strong positive correlation between maths score and overall score.
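The calculation above can be reproduced with a few lines of code. The following is a minimal sketch in plain Python; the function name `pearson_r` is an illustrative choice, not part of the original slides. It uses the exact sample means, so its intermediate sums may differ slightly from the table, which appears to use rounded means, while the final coefficient agrees to three decimals.

```python
import math

# Maths scores (x) and overall scores (y) taken from the table above.
x = [70, 85, 22, 66, 15, 58, 69, 49, 73, 61, 77, 44, 35, 88, 69]
y = [73, 90, 31, 50, 31, 50, 56, 55, 80, 49, 79, 58, 40, 85, 73]


def pearson_r(xs, ys):
    """Sample correlation coefficient; the 1/n factors cancel out."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys))
    s_xx = sum((a - x_bar) ** 2 for a in xs)
    s_yy = sum((b - y_bar) ** 2 for b in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


r = pearson_r(x, y)
print(f"r_xy = {r:.3f}")
```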

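[Figure: scatter diagrams showing positive linear association, no linear association and negative linear association, each for $S_x > S_y$, $S_x = S_y$ and $S_x < S_y$. Adapted and modified from www.tice.agrocampus-ouest.fr]

Strength of correlation:

$r_{xy} = 1$: Perfect
$r_{xy} \approx 1$: Strong
$0 < r_{xy} < 1$: Weak
$r_{xy} = 0$: No Correlation
$-1 < r_{xy} < 0$: Weak
$r_{xy} \approx -1$: Strong
$r_{xy} = -1$: Perfect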

Some properties of the correlation coefficient (sample or population):

a. It lies between -1 and 1, i.e. $-1 \le r_{xy} \le 1$.

b. It is symmetrical with respect to $x$ and $y$, i.e. $r_{xy} = r_{yx}$. This means the direction of calculation is not important.

c. It is just a pure number and independent of the unit of measurement of $x$ and $y$.

d. It is independent of the choice of origin and scale of the measurements of $x$ and $y$, that is: $r_{xy} = r_{(ax+b)(cy+d)}$ for $a, c > 0$.

e. $f(x, y) = f(x) \cdot f(y) \Rightarrow r_{xy} = 0$
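Properties (b) and (d) are easy to check numerically. The sketch below uses a small made-up data set and an illustrative helper `pearson_r` (both are assumptions, not taken from the slides) to confirm that swapping the variables, or shifting and rescaling them with positive scale factors, leaves the coefficient unchanged.

```python
import math


def pearson_r(xs, ys):
    """Sample correlation coefficient of two equally long sequences."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys))
    s_xx = sum((a - x_bar) ** 2 for a in xs)
    s_yy = sum((b - y_bar) ** 2 for b in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical data, chosen only for illustration.
x = [1.0, 2.0, 4.0, 7.0, 11.0]
y = [2.0, 1.0, 5.0, 6.0, 12.0]

r_xy = pearson_r(x, y)
r_yx = pearson_r(y, x)  # property (b): symmetry

# Property (d): change of origin and (positive) scale.
x_new = [3.0 * v + 10.0 for v in x]
y_new = [0.5 * v - 4.0 for v in y]
r_new = pearson_r(x_new, y_new)

print(r_xy, r_yx, r_new)
```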

Important Note: Many researchers wrongly construct a theory just based on a simple correlation test.

Correlation does not imply causation.

If there is a high correlation between the number of smoked cigarettes and the number of infected lung cells, it does not necessarily mean that smoking causes lung cancer. A causality test (sometimes called the Granger causality test) is different from a correlation test.

In a causality test it is important to know the direction of causality (e.g. 𝒙 on 𝒚 and not vice versa), but in correlation we are only trying to find out whether two variables move together (in the same or opposite directions).

Note on property (e): the condition $f(x, y) = f(x) \cdot f(y)$ means that $x$ and $y$ are statistically independent, where $f(x, y)$ is the joint Probability Density Function (PDF).

Determination Coefficient and Correlation Coefficient:

$r_{xy} = \pm 1$: perfect linear relationship between the variables, i.e. $x$ is the only factor which describes the variations of $y$ at the level (linearly); $y = a + bx$.

$r_{xy} \approx \pm 1$: $x$ is not the only factor which describes the variations of $y$, but we can still imagine a line that represents this relationship, passing through most of the points or having the minimum total vertical distance from them. This line is called the “line of best fit”, known technically as the “regression line”.

[Figure: line of best fit between the age of a car and its price. Adopted from www.ncetm.org.uk/public/files/195322/G3fb.jpg]

The graph shows a line of best fit between the age of a car and its price. Imagine the line has the equation $y = a + bx$.

The criterion for choosing one line among others is the goodness of fit, which can be calculated through the determination coefficient, $r^2$.

In the previous example, the age of a car is only one factor among many that explain the price of a car. Can you find some other factors?

If $y$ and $x$ represent the price and age of cars respectively, the percentage of the variation of $y$ which is determined (explained) by the variation of $x$ is called the “determination coefficient”.

The determination coefficient can be understood better by Venn-Euler diagrams:

[Figure: four Venn-Euler diagrams of y and x; in the last one y = x.]

• $r^2 = 0$: none of the variation of y can be determined by x (no linear association)
• $r^2 \approx 0$: a small percentage of the variation of y can be determined by x (weak linear association)
• $r^2 \approx 1$: a large percentage of the variation of y can be determined by x (strong linear association)
• $r^2 = 1$: all variation of y can be determined by x and no other factors (complete linear association)

The shaded area shows the percentage of variation of y which can be determined by x. It is easy to understand that $0 \le r^2 \le 1$.

Although the determination coefficient ($r^2$) is conceptually different from the correlation coefficient ($r_{xy}$), one can be calculated from the other; in fact:

$$r_{xy} = \pm\sqrt{r^2}$$

Or, alternatively

$$r^2 = b^2 \, \frac{\frac{1}{n}\sum (x_i - \bar{x})^2}{\frac{1}{n}\sum (y_i - \bar{y})^2} = b^2 \, \frac{S_x^2}{S_y^2}$$

Where $b$ is the slope coefficient in the regression line $y = a + bx$.

Note: If $y = a + bx$ shows the regression line ($y$ on $x$) and $x = c + dy$ shows another regression line ($x$ on $y$) then we have: $r^2 = b \cdot d$
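The identity $r^2 = b \cdot d$ can be checked on the maths/overall score sample used earlier. This is a minimal sketch; the helper `ols_slope` is an illustrative name, and it uses the least-squares slope formula (covariance divided by the variance of the regressor).

```python
# Maths scores (x) and overall scores (y) from the earlier example.
x = [70, 85, 22, 66, 15, 58, 69, 49, 73, 61, 77, 44, 35, 88, 69]
y = [73, 90, 31, 50, 31, 50, 56, 55, 80, 49, 79, 58, 40, 85, 73]


def ols_slope(regressor, regressand):
    """Least-squares slope of `regressand` on `regressor`."""
    n = len(regressor)
    m_r = sum(regressor) / n
    m_d = sum(regressand) / n
    s_rd = sum((a - m_r) * (c - m_d) for a, c in zip(regressor, regressand))
    s_rr = sum((a - m_r) ** 2 for a in regressor)
    return s_rd / s_rr


b = ols_slope(x, y)  # slope of the regression of y on x
d = ols_slope(y, x)  # slope of the regression of x on y
r_squared = b * d

print(f"b = {b:.4f}, d = {d:.4f}, r^2 = b*d = {r_squared:.4f}")
```

Because both slopes share the same covariance in the numerator, their product equals the squared covariance divided by the product of the two variances, which is exactly $r_{xy}^2$.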

Summary of Correlation & Determination Coefficients:

• Correlation means a linear association between two random variables, which could be positive, negative or zero.

• Linear association means that variables are related at their levels (linearly).

• The correlation coefficient measures the strength of linear association between two variables. It can be calculated for a sample or for the whole population.

• The value of the correlation coefficient lies between -1 and 1. These two limits show the strongest (negative or positive) correlation; the closer the value is to zero, the weaker the correlation.

• Correlation does not imply causation.

• The determination coefficient shows the percentage of variation of one variable which can be described by another variable, and it is a measure of the goodness of fit for lines passing through plotted points.

• The value of the determination coefficient is between 0 and 1 and can be obtained from the correlation coefficient by squaring it.

• Knowing that two random variables are just linearly associated is not very satisfactory. There is sometimes a strong idea that the variation of one variable can solidly explain the variation of another.

• To test this idea (hypothesis) we need another analytical approach, which is called “regression analysis”.

• In regression analysis we try to study or predict the mean (average) value of a dependent variable $Y$ based on the knowledge we have about the independent (explanatory) variable(s) $X_1, X_2, \dots, X_n$. This is familiar to those who know the meaning of conditional probabilities, as we are going to build a linear model such as the following, which is the deterministic part of the model in regression analysis:

$$E(Y \mid X_1, X_2, \dots, X_n) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n$$

• The deterministic part of the regression model does reflect the

structure of the relationship between 𝒀 and 𝑿′𝒔 in a

mathematical world but we live in a stochastic world.

• God’s knowledge (if the term is applicable) is deterministic but

our perception about everything in this world is always

stochastic and our model should be built in this way.

• To understand the concept of stochastic model let’s have an

example:

If we make a model between monthly consumption expenditure

𝑪 and monthly income 𝑰, the model cannot be deterministic

(mathematical) such that for every value of 𝑰 there is one and

only one value of 𝑪 (which is the concept of functional

relationship in maths). Why?

Although income is the main variable determining the amount of consumption expenditure, many other factors such as the mood of people, their wealth, the interest rate, etc. are overlooked in a simple mathematical model such as $C = f(I)$, yet their influences can change the value of $C$ even at the same level of $I$. If we believe that the average impact of all these omitted variables is random (sometimes positive and sometimes negative), then, in order to make a realistic model, we need to add a stochastic (random) term $u$ to our mathematical model: $C = f(I) + u$

[Figure: mapping from income I to consumption C. I = £1000 corresponds to C = £800, £1000, £750; I = £1400 corresponds to C = £900, £1200, £1150.]

The change in consumption expenditure comes from a change in income ($I$) or a change in some random elements ($u$), so we can write $C = f(I) + u$.
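To make the idea of a stochastic model concrete, here is a small simulation sketch. Everything numeric in it is a hypothetical assumption for illustration only: a linear $f(I) = 200 + 0.6\,I$ and a normally distributed $u$ with standard deviation 100. It shows that the same income produces different consumption values, while their average stays close to $f(I)$.

```python
import random


def consumption(income, rng, a=200.0, b=0.6, sigma=100.0):
    """C = f(I) + u with a hypothetical linear f and a normal random term u."""
    u = rng.gauss(0.0, sigma)  # stochastic part: omitted influences
    return a + b * income + u  # deterministic part plus random term


rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Five households with the same income end up with different consumption.
draws = [consumption(1000.0, rng) for _ in range(5)]

# Averaging over many households recovers the deterministic part f(I) = 800.
many = [consumption(1000.0, rng) for _ in range(20000)]
mean_c = sum(many) / len(many)

print([round(c) for c in draws])
print(round(mean_c, 1))
```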

• The general stochastic model for our purpose would be as follows, which is called the “Linear Regression Model**”:

$$Y_i = E(Y_i \mid X_{1i}, \dots, X_{ni}) + u_i$$

Which can be written as:

$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_n X_{ni} + u_i$$

Where $i$ ($i = 1, 2, \dots, n$) shows the time period (days, weeks, months, years, etc.) and $u_i$ is an error (stochastic) term and also a representative of all other influential variables which are not considered in the model and are ignored.

• The deterministic part of the model

$$E(Y_i \mid X_{1i}, \dots, X_{ni}) = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_n X_{ni}$$

is called the Population Regression Function (PRF).

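• The general form of the Linear Regression Model with $k$ explanatory variables (plus an intercept) and $n$ observations can be shown in matrix form as:

$$\mathbf{Y}_{n \times 1} = \mathbf{X}_{n \times (k+1)} \, \boldsymbol{\beta}_{(k+1) \times 1} + \mathbf{u}_{n \times 1}$$

Or simply:

$$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u}$$

Where

$$\mathbf{Y} = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix}, \quad
\mathbf{X} = \begin{bmatrix} 1 & X_{11} & X_{21} & \dots & X_{k1} \\ 1 & X_{12} & X_{22} & \dots & X_{k2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & X_{1n} & X_{2n} & \dots & X_{kn} \end{bmatrix}, \quad
\boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{bmatrix} \quad \text{and} \quad
\mathbf{u} = \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{bmatrix}$$

$\mathbf{Y}$ is also called the regressand and $\mathbf{X}$ is the matrix of regressors.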
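The matrix form maps directly onto array code. The following sketch assumes NumPy is available and uses simulated data with hypothetical "true" coefficients; it builds the regressor matrix with a leading column of ones and estimates the coefficients by least squares with `numpy.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator for reproducibility
n = 50                          # number of observations

# Two hypothetical explanatory variables.
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)

# Regressor matrix: a column of ones for the intercept, then the X's.
X = np.column_stack([np.ones(n), X1, X2])

beta = np.array([1.0, 2.0, -0.5])   # assumed "population" parameters
u = rng.normal(scale=0.1, size=n)   # stochastic error term
Y = X @ beta + u                    # Y = X beta + u

# Least-squares estimate b of beta from the sample.
b, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(b)
```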

• $\beta_0$ is the intercept and the other $\beta_i$'s are slope coefficients; together they are called regression parameters. The value of each slope parameter shows the effect of a one-unit change in the associated regressor $X_i$ on the mean value of the regressand $Y_i$. The idea is to estimate the unknown values of the population regression parameters using estimators based on sample data.

• The sample counterpart of the regression line can be written in the form of:

$$Y_i = \hat{Y}_i + \hat{u}_i$$

or

$$Y_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \dots + b_n X_{ni} + e_i$$

Where $\hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \dots + b_n X_{ni}$ is the deterministic part of the sample model and is called the “Sample Regression Function (SRF)”, the $b_i$'s are estimators of the unknown parameters $\beta_i$'s, and $\hat{u}_i = e_i$ is a residual.

The following graph shows the important elements of PRF and SRF:

[Figure: an observation $Y_i$ plotted together with the two lines SRF: $\hat{Y}_i = b_0 + b_1 X_i$ and PRF: $E(Y \mid X_i) = \beta_0 + \beta_1 X_i$, marking the estimation of $Y_i$ based on SRF and on PRF. In PRF: $Y_i - E(Y \mid X_i) = u_i$. In SRF: $Y_i - \hat{Y}_i = \hat{u}_i = e_i$. Adopted and altered from http://marketingclassic.blogspot.co.uk/2011_12_01_archive.html]

The PRF is a hypothetical line which we have no direct knowledge of; we try to estimate its parameters based on the data in the sample.

• Now the question is how to calculate the $b_i$'s based on the sample observations and how to ensure that they are good and unbiased estimators of the $\beta_i$'s in the population.

• There are two main methods of calculating the $b_i$'s and constructing the SRF, called the “method of Ordinary Least Squares (OLS)” and the “method of Maximum Likelihood (ML)”. Here, we focus on the OLS method as it is the most widely used. For simplicity, we start with the two-variable PRF ($E(Y \mid X_i) = \beta_0 + \beta_1 X_i$) and its SRF counterpart ($\hat{Y}_i = b_0 + b_1 X_i$).

• According to the OLS method we try to minimise the sum of the squared residuals in a hypothetical sample; i.e.

$$\sum \hat{u}_i^2 = \sum e_i^2 = \sum (Y_i - \hat{Y}_i)^2 = \sum (Y_i - b_0 - b_1 X_i)^2 \qquad \text{(A)}$$

• It is obvious from the previous equation that the sum of squared residuals is a function of $b_0$ and $b_1$, i.e.

$$\sum e_i^2 = f(b_0, b_1)$$

because if these two parameters (intercept and slope) change, $\sum e_i^2$ will change (see the graph on slide 25).

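• Differentiating the sum of squared residuals (A) partially with respect to $b_0$ and $b_1$ and following the first-order (necessary) conditions for optimisation in calculus we have:

$$\left.\begin{aligned}
\frac{\partial \sum e_i^2}{\partial b_0} &= -2 \sum (Y_i - b_0 - b_1 X_i) = -2 \sum e_i = 0 \\
\frac{\partial \sum e_i^2}{\partial b_1} &= -2 \sum X_i (Y_i - b_0 - b_1 X_i) = -2 \sum X_i e_i = 0
\end{aligned}\right\} \qquad \text{(B)}$$

After simplification we reach two equations with two unknowns $b_0$ and $b_1$:

$$\sum Y_i = n b_0 + b_1 \sum X_i$$

$$\sum Y_i X_i = b_0 \sum X_i + b_1 \sum X_i^2$$

Where $n$ is the sample size. So:

$$b_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2} = \frac{\sum x_i y_i}{\sum x_i^2} = \frac{cov(x, y)}{S_x^2}$$

Where lower-case $x_i = X_i - \bar{X}$ and $y_i = Y_i - \bar{Y}$ are deviations from the sample means and $S_x$ is the biased version of the sample standard deviation, i.e. we have $n$ instead of $(n - 1)$ in the denominator:

$$S_x = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n}}$$

And

$$b_0 = \bar{Y} - b_1 \bar{X}$$

• The $b_0$ and $b_1$ obtained from the OLS method are the point estimators of $\beta_0$ and $\beta_1$ in the population, but in order to test hypotheses about the population parameters we need to know the distributions of their estimators. For that reason we need to make some assumptions about the explanatory variables and the error term in the PRF (see the equations in B to find the reason).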
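The closed-form solutions for $b_0$ and $b_1$ can be applied to the maths/overall score sample from the correlation example. The sketch below, in plain Python, also checks the two first-order conditions ($\sum e_i = 0$ and $\sum X_i e_i = 0$) numerically; the variable names are illustrative.

```python
# Maths scores (X) and overall scores (Y) from the earlier example.
X = [70, 85, 22, 66, 15, 58, 69, 49, 73, 61, 77, 44, 35, 88, 69]
Y = [73, 90, 31, 50, 31, 50, 56, 55, 80, 49, 79, 58, 40, 85, 73]

n = len(X)
x_bar = sum(X) / n
y_bar = sum(Y) / n

# Sums of cross-products and squares in deviation form.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(X, Y))
s_xx = sum((xi - x_bar) ** 2 for xi in X)

b1 = s_xy / s_xx          # slope estimator
b0 = y_bar - b1 * x_bar   # intercept estimator

# Residuals of the fitted sample regression function.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(X, Y)]

# First-order conditions: both sums should be zero up to rounding error.
sum_e = sum(residuals)
sum_xe = sum(xi * ei for xi, ei in zip(X, residuals))

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
print(f"sum(e) = {sum_e:.2e}, sum(X*e) = {sum_xe:.2e}")
```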

The Assumptions Underlying the OLS Method:

1. The regression model is linear in terms of its parameters (coefficients).*

2. The values of the explanatory variable(s) are fixed in repeated sampling.

This means that the nature of explanatory variables (𝑿′𝒔) is non-stochastic.

The only stochastic variables are error term (𝒖𝒊) and regressand (𝒀𝒊).

3. The disturbance (error) terms are normally distributed with zero mean and equal variance, given the values of the $X$'s. That is: $u_i \sim N(0, \sigma^2)$

4. There is no autocorrelation between error terms, i.e. $cov(u_i, u_j) = 0$ for $i \ne j$. This means they are completely random and there is no association between them or any pattern in their appearance.

5. There is no correlation between error terms and explanatory variables, i.e. $cov(u_i, X_i) = 0$

6. The number of observations (sample size) should be bigger than the number of parameters in the model.

7. The model should be logically and correctly specified in terms of functional form and of the type and nature of the variables entering the model.

These assumptions are the assumptions of the Classical Linear Regression Model (CLRM), which are sometimes called the Gaussian assumptions on linear regression models.

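• Under these assumptions and also the central limit theorem, the OLS estimators in the sampling distribution (repeated sampling), when $n \to \infty$, have a normal distribution:

$$b_0 \sim N\left(\beta_0, \; \frac{\sum X_i^2}{n \sum x_i^2} \cdot \sigma^2\right)$$

$$b_1 \sim N\left(\beta_1, \; \frac{\sigma^2}{\sum x_i^2}\right)$$

where $\sigma^2$ is the variance of the error term ($var(u_i) = \sigma^2$), which can itself be estimated through the estimator $\hat{\sigma}$, where:

$$\hat{\sigma} = \sqrt{\frac{\sum e_i^2}{n - 2}} \quad \text{or} \quad \hat{\sigma} = \sqrt{\frac{\sum e_i^2}{n - k}} \quad \text{when there are } k \text{ parameters in the model.}$$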

• Based on the assumptions of the classical linear regression model (CLRM), the Gauss-Markov Theorem asserts that the least squares estimators have the minimum variance among all linear unbiased estimators. So they are the Best Linear Unbiased Estimators (BLUE).

Interval Estimation For Population Parameters:

• In order to construct a confidence interval for the unknown $\beta$'s (the PRF's parameters) we can either follow the Z distribution (if we have prior knowledge of $\sigma$) or the t-distribution (if we use $\hat{\sigma}$ instead).

• The confidence interval for the slope parameter at any level of significance $\alpha$ would be*:

$$P\left(b_1 - Z_{\alpha/2} \cdot \sigma_{b_1} \le \beta_1 \le b_1 + Z_{\alpha/2} \cdot \sigma_{b_1}\right) = 1 - \alpha$$

Or

$$P\left(b_1 - t_{\alpha/2, (n-2)} \cdot \hat{\sigma}_{b_1} \le \beta_1 \le b_1 + t_{\alpha/2, (n-2)} \cdot \hat{\sigma}_{b_1}\right) = 1 - \alpha$$

Hypothesis Testing For Parameters:

• The critical values (Z or t) in the confidence intervals can be used to find the rejection area(s) and test any hypothesis on the parameters.

• For example, to test $H_0: \beta_1 = 0$ against the alternative $H_1: \beta_1 \ne 0$, after finding the critical values of t (which means we do not have prior knowledge of $\sigma$ and use $\hat{\sigma}$ instead) at any significance level $\alpha$, we will have two critical regions, and if the value of the test statistic

$$t = \frac{b_1 - \beta_1}{\hat{\sigma} / \sqrt{\sum x_i^2}}$$

falls in a critical region, $H_0: \beta_1 = 0$ must be rejected.

• In case we have more than one slope parameter, the degrees of freedom for the t-distribution will be the sample size $n$ minus the number of estimated parameters including the intercept parameter, i.e. for $k$ parameters $df = n - k$.
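The interval estimate and the t-test for the slope can be sketched on the maths/overall score sample. The code below is plain Python; the critical value 2.160 is taken from a standard t-table for 13 degrees of freedom at $\alpha = 0.05$ (two-sided) and is hard-coded as an assumption rather than computed.

```python
import math

# Maths scores (X) and overall scores (Y) from the earlier example.
X = [70, 85, 22, 66, 15, 58, 69, 49, 73, 61, 77, 44, 35, 88, 69]
Y = [73, 90, 31, 50, 31, 50, 56, 55, 80, 49, 79, 58, 40, 85, 73]

n = len(X)
x_bar = sum(X) / n
y_bar = sum(Y) / n
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(X, Y))
s_xx = sum((xi - x_bar) ** 2 for xi in X)

b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Residual sum of squares and the estimator of sigma with n - 2 df.
rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(X, Y))
sigma_hat = math.sqrt(rss / (n - 2))
se_b1 = sigma_hat / math.sqrt(s_xx)  # estimated standard error of b1

T_CRIT = 2.160  # assumed table value: t(0.025, df = 13)

# 95% confidence interval for beta_1.
ci = (b1 - T_CRIT * se_b1, b1 + T_CRIT * se_b1)

# Test statistic for H0: beta_1 = 0 against H1: beta_1 != 0.
t_stat = (b1 - 0.0) / se_b1
reject_h0 = abs(t_stat) > T_CRIT

print(f"b1 = {b1:.3f}, se(b1) = {se_b1:.3f}")
print(f"95% CI = ({ci[0]:.3f}, {ci[1]:.3f}), t = {t_stat:.2f}, reject H0: {reject_h0}")
```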

Determination Coefficient $r^2$ and Goodness of Fit:

• In earlier slides we talked about the determination coefficient and its relationship with the correlation coefficient. The coefficient of determination $r^2$ comes to our attention once there is no issue with the estimation of the regression parameters.

• It is a measure which shows how well the SRF fits the data.

• To understand this measure properly, let's look at it from a different angle.

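We know that

$$Y_i = \hat{Y}_i + e_i$$

And in the deviation form, after subtracting $\bar{Y}$ from both sides:

$$Y_i - \bar{Y} = \hat{Y}_i - \bar{Y} + e_i$$

We know that $e_i = Y_i - \hat{Y}_i$. So:

$$Y_i - \bar{Y} = (\hat{Y}_i - \bar{Y}) + (Y_i - \hat{Y}_i)$$

Or in the deviation form

$$y_i = \hat{y}_i + e_i$$

[Figure: decomposition of $Y_i - \bar{Y}$ into the explained part and the residual $e_i$ around $\bar{Y}$. Adopted from Basic Econometrics, Gujarati, p. 76]

By squaring both sides and summing over the sample we have:

$$\sum y_i^2 = \sum \hat{y}_i^2 + 2 \sum \hat{y}_i e_i + \sum e_i^2 = \sum \hat{y}_i^2 + \sum e_i^2$$

Where $\sum \hat{y}_i e_i = 0$, which follows from the first-order conditions $\sum e_i = 0$ and $\sum X_i e_i = 0$ (the sample counterparts of OLS assumptions 3 and 5).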

And if we change it to the non-deviated form:

$$\sum (Y_i - \bar{Y})^2 = \sum (\hat{Y}_i - \bar{Y})^2 + \sum (Y_i - \hat{Y}_i)^2$$

• $\sum (Y_i - \bar{Y})^2$: total variation of the observed Y values around their mean = Total Sum of Squares = TSS

• $\sum (\hat{Y}_i - \bar{Y})^2$: total explained variation of the estimated Y values around their mean = Explained Sum of Squares (by explanatory variables) = ESS

• $\sum (Y_i - \hat{Y}_i)^2$: total unexplained variation of the observed Y values around the regression line = Residual Sum of Squares (explained by error terms) = RSS

Dividing both sides by the Total Sum of Squares (TSS) we have:

$$1 = \frac{ESS}{TSS} + \frac{RSS}{TSS} = \frac{\sum (\hat{Y}_i - \bar{Y})^2}{\sum (Y_i - \bar{Y})^2} + \frac{\sum (Y_i - \hat{Y}_i)^2}{\sum (Y_i - \bar{Y})^2}$$

Where $\frac{\sum (\hat{Y}_i - \bar{Y})^2}{\sum (Y_i - \bar{Y})^2} = \frac{ESS}{TSS}$ is the percentage of the variation of the actual (observed) $Y_i$ which is explained by the explanatory variables (by the regression line).

• A good reader knows that this is not a new concept; the determination coefficient $r^2$ was already described as a measure of the goodness of fit between different alternative sample regression functions (SRFs).

$$1 = r^2 + \frac{RSS}{TSS} \;\rightarrow\; r^2 = 1 - \frac{RSS}{TSS} = 1 - \frac{\sum e_i^2}{\sum (Y_i - \bar{Y})^2}$$
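The decomposition TSS = ESS + RSS and the resulting $r^2$ can be verified numerically. The sketch below, in plain Python with illustrative variable names, fits the two-variable regression to the maths/overall score sample and computes the three sums of squares.

```python
import math

# Maths scores (X) and overall scores (Y) from the earlier example.
X = [70, 85, 22, 66, 15, 58, 69, 49, 73, 61, 77, 44, 35, 88, 69]
Y = [73, 90, 31, 50, 31, 50, 56, 55, 80, 49, 79, 58, 40, 85, 73]

n = len(X)
x_bar = sum(X) / n
y_bar = sum(Y) / n

# OLS estimates for the two-variable model.
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(X, Y))
      / sum((xi - x_bar) ** 2 for xi in X))
b0 = y_bar - b1 * x_bar
Y_hat = [b0 + b1 * xi for xi in X]  # fitted values from the SRF

tss = sum((yi - y_bar) ** 2 for yi in Y)               # total variation
ess = sum((yh - y_bar) ** 2 for yh in Y_hat)           # explained variation
rss = sum((yi - yh) ** 2 for yi, yh in zip(Y, Y_hat))  # residual variation

r_squared = 1.0 - rss / tss

print(f"TSS = {tss:.2f}, ESS = {ess:.2f}, RSS = {rss:.2f}")
print(f"r^2 = {r_squared:.4f}")
```

The value of $r^2$ obtained this way matches the square of the correlation coefficient found earlier for the same sample.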

• A good model must have a reasonably high $r^2$, but this does not mean that any model with a high $r^2$ is a good model. An extremely high level of $r^2$ could be the result of a spurious regression line, due to a variety of reasons such as non-stationarity of data, cointegration problems, etc.

• In a regression model with two parameters, $r^2$ can be directly calculated:

$$r^2 = \frac{\sum (\hat{Y}_i - \bar{Y})^2}{\sum (Y_i - \bar{Y})^2} = \frac{\sum (b_0 + b_1 X_i - b_0 - b_1 \bar{X})^2}{\sum (Y_i - \bar{Y})^2} = \frac{b_1^2 \sum (X_i - \bar{X})^2}{\sum (Y_i - \bar{Y})^2} = \frac{b_1^2 \sum x_i^2}{\sum y_i^2} = b_1^2 \, \frac{S_X^2}{S_Y^2}$$

Where $S_X^2$ and $S_Y^2$ are the variances (squared standard deviations) of $X$ and $Y$ respectively.

Multiple Regression Analysis:

• If there are two or more explanatory variables in the regression line we need additional assumptions about the independence of the explanatory variables, namely that there is no exact linear relationship between them.

• The population and the sample regression models for the three-variable model can be described as follows:

In Population: $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + u_i$

In Sample: $Y_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + e_i$

• The OLS estimators can be obtained by minimising $\sum e_i^2$. So, the values of the SRF parameters in the deviation form are as follows:

$$b_1 = \frac{(\sum x_{1i} y_i)(\sum x_{2i}^2) - (\sum x_{2i} y_i)(\sum x_{1i} x_{2i})}{(\sum x_{1i}^2)(\sum x_{2i}^2) - (\sum x_{1i} x_{2i})^2}$$

$$b_2 = \frac{(\sum x_{2i} y_i)(\sum x_{1i}^2) - (\sum x_{1i} y_i)(\sum x_{1i} x_{2i})}{(\sum x_{1i}^2)(\sum x_{2i}^2) - (\sum x_{1i} x_{2i})^2}$$

And the intercept parameter will be calculated in the non-deviated form as:

$$b_0 = \bar{Y} - b_1 \bar{X}_1 - b_2 \bar{X}_2$$
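The deviation-form formulas for $b_1$, $b_2$ and $b_0$ translate directly into code. The sketch below is plain Python and uses hypothetical data constructed so that $Y = 1 + 2X_1 + 3X_2$ holds exactly (no error term), which makes it easy to see that the formulas recover the coefficients; the function name `ols_two_regressors` is illustrative.

```python
def ols_two_regressors(X1, X2, Y):
    """OLS estimates (b0, b1, b2) for Y = b0 + b1*X1 + b2*X2 + e."""
    n = len(Y)
    m1, m2, my = sum(X1) / n, sum(X2) / n, sum(Y) / n

    # Variables in deviation form (lower-case letters in the text).
    x1 = [v - m1 for v in X1]
    x2 = [v - m2 for v in X2]
    y = [v - my for v in Y]

    s11 = sum(a * a for a in x1)
    s22 = sum(a * a for a in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * b for a, b in zip(x1, y))
    s2y = sum(a * b for a, b in zip(x2, y))

    # Zero only if X1 and X2 are exactly linearly related.
    denom = s11 * s22 - s12 ** 2
    b1 = (s1y * s22 - s2y * s12) / denom
    b2 = (s2y * s11 - s1y * s12) / denom
    b0 = my - b1 * m1 - b2 * m2
    return b0, b1, b2


# Hypothetical data generated without noise from Y = 1 + 2*X1 + 3*X2.
X1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
X2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
Y = [1.0 + 2.0 * a + 3.0 * b for a, b in zip(X1, X2)]

b0, b1, b2 = ols_two_regressors(X1, X2, Y)
print(b0, b1, b2)
```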

• Under the classical assumptions and also the central limit theorem, the OLS estimators in the sampling distribution (repeated sampling), when $n \to \infty$, have a normal distribution:

$$b_1 \sim N\left(\beta_1, \; \frac{\sigma_u^2 \cdot \sum x_{2i}^2}{(\sum x_{1i}^2)(\sum x_{2i}^2) - (\sum x_{1i} x_{2i})^2}\right)$$

$$b_2 \sim N\left(\beta_2, \; \frac{\sigma_u^2 \cdot \sum x_{1i}^2}{(\sum x_{1i}^2)(\sum x_{2i}^2) - (\sum x_{1i} x_{2i})^2}\right)$$

• The distribution of the intercept parameter $b_0$ is not of primary concern as in many cases it has no practical importance.

• If the variance of the disturbance (error) term ($\sigma_u^2$) is not known, the residual variance (sample variance) $\hat{\sigma}_u^2$ can be used, which is an unbiased estimator of the former:

$$\hat{\sigma}_u^2 = \frac{\sum e_i^2}{n - k}$$

Where $k$ is the number of parameters in the model (including the intercept $b_0$). Therefore, in a regression model with two slope parameters and one intercept parameter the residual variance can be calculated by:

$$\hat{\sigma}_u^2 = \frac{\sum e_i^2}{n - 3}$$

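So, for a model with two slope parameters, the unbiased estimates of the variances of these parameters are:

$$S_{b_1}^2 = \frac{\sum e_i^2}{n - 3} \cdot \frac{\sum x_{2i}^2}{(\sum x_{1i}^2)(\sum x_{2i}^2) - (\sum x_{1i} x_{2i})^2} = \frac{\hat{\sigma}_u^2}{\sum x_{1i}^2 \, (1 - r_{12}^2)}$$

and

$$S_{b_2}^2 = \frac{\sum e_i^2}{n - 3} \cdot \frac{\sum x_{1i}^2}{(\sum x_{1i}^2)(\sum x_{2i}^2) - (\sum x_{1i} x_{2i})^2} = \frac{\hat{\sigma}_u^2}{\sum x_{2i}^2 \, (1 - r_{12}^2)}$$

Where $r_{12}^2 = \dfrac{(\sum x_{1i} x_{2i})^2}{\sum x_{1i}^2 \sum x_{2i}^2}$.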

The Coefficient of Multiple Determination ($R^2$ and $\bar{R}^2$):

The same concept of the coefficient of determination used for a bivariate model can be extended to a multivariate model.

• If $R^2$ denotes the coefficient of multiple determination, it shows the proportion (percentage) of the total variation of $Y$ explained by the explanatory variables and it is calculated by:

$$R^2 = \frac{ESS}{TSS} = \frac{\sum \hat{y}_i^2}{\sum y_i^2} = \frac{b_1 \sum y_i x_{1i} + b_2 \sum y_i x_{2i}}{\sum y_i^2}$$

And we know that:

$$0 \le R^2 \le 1$$

Note that $R^2$ can also be calculated through RSS, i.e.

$$R^2 = 1 - \frac{RSS}{TSS} = 1 - \frac{\sum e_i^2}{\sum y_i^2} \qquad \text{(C)}$$

• $R^2$ is likely to increase by including an additional explanatory variable (see C). Therefore, in case we have two alternative models with the same dependent variable $Y$ but different numbers of explanatory variables, we should not be misled by the high $R^2$ of the model with more variables.

• To solve this problem we need to bring the degrees of freedom into consideration as a reduction factor against adding additional explanatory variables. So, the adjusted $R^2$, denoted by $\bar{R}^2$, is considered as an alternative coefficient of determination and it is calculated as:

$$\bar{R}^2 = 1 - \frac{\sum e_i^2 / (n - k)}{\sum y_i^2 / (n - 1)} = 1 - \frac{n - 1}{n - k} \cdot \frac{\sum e_i^2}{\sum y_i^2} = 1 - \frac{n - 1}{n - k} (1 - R^2)$$
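The last form of the adjusted coefficient is a one-line function. The sketch below is plain Python; the function name and the numbers in the usage example are hypothetical. It illustrates how a small gain in $R^2$ from an extra regressor can still lower $\bar{R}^2$.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for n observations and k parameters (incl. intercept)."""
    if n <= k:
        raise ValueError("need more observations than parameters")
    return 1.0 - (n - 1) / (n - k) * (1.0 - r2)


# Hypothetical comparison: same sample size, one extra regressor.
small_model = adjusted_r2(0.800, n=15, k=2)
large_model = adjusted_r2(0.801, n=15, k=3)

print(f"{small_model:.4f} vs {large_model:.4f}")
```

In this made-up case the larger model has the higher $R^2$ but the lower $\bar{R}^2$, because the penalty from the lost degree of freedom outweighs the tiny improvement in fit.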

Partial Correlation Coefficients:

• For a three-variable regression model such as

$$Y_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + e_i$$

we can talk about three linear associations (correlations): between $Y$ and $X_1$ ($r_{yx_1}$), between $Y$ and $X_2$ ($r_{yx_2}$) and finally between $X_1$ and $X_2$ ($r_{x_1 x_2}$). These correlations are called simple (gross) correlation coefficients, but they do not reflect the true linear association between two variables as the influence of the third variable on the other two is not removed.

• The net linear association between two variables can be obtained through the partial correlation coefficient, where the influence of the third variable is removed (the variable is held constant). Symbolically, $r_{yx_1 . x_2}$ represents the partial correlation coefficient between $Y$ and $X_1$ holding $X_2$ constant.

• Two partial correlation coefficients in our model can be calculated as follows:

$$r_{yx_1 . x_2} = \frac{r_{yx_1} - r_{yx_2} \, r_{x_1 x_2}}{\sqrt{1 - r_{x_1 x_2}^2} \cdot \sqrt{1 - r_{yx_2}^2}}$$

$$r_{yx_2 . x_1} = \frac{r_{yx_2} - r_{yx_1} \, r_{x_1 x_2}}{\sqrt{1 - r_{x_1 x_2}^2} \cdot \sqrt{1 - r_{yx_1}^2}}$$

• The correlation coefficient $r_{x_1 x_2 . y}$ has no practical importance. Specifically, when the direction of causality is from the $X$'s to $Y$ we can simply use the simple correlation coefficient in this case:

$$r = \frac{\sum x_1 x_2}{\sqrt{\sum x_1^2 \cdot \sum x_2^2}}$$

• The partial correlation coefficients can be used to find out which explanatory variable has more linear association with the dependent variable.
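The partial correlation formula needs only the three simple correlation coefficients. The sketch below is plain Python; the function name `partial_corr` and the input values are hypothetical and chosen purely for illustration.

```python
import math


def partial_corr(r_ab, r_ac, r_bc):
    """Partial correlation between a and b, holding c constant.

    r_ab, r_ac and r_bc are the simple (gross) correlation coefficients.
    """
    return (r_ab - r_ac * r_bc) / (
        math.sqrt(1.0 - r_bc ** 2) * math.sqrt(1.0 - r_ac ** 2)
    )


# Hypothetical simple correlations between Y, X1 and X2.
r_yx1, r_yx2, r_x1x2 = 0.8, 0.6, 0.5

r_yx1_given_x2 = partial_corr(r_yx1, r_yx2, r_x1x2)  # Y and X1, X2 constant
r_yx2_given_x1 = partial_corr(r_yx2, r_yx1, r_x1x2)  # Y and X2, X1 constant

print(f"{r_yx1_given_x2:.4f}, {r_yx2_given_x1:.4f}")
```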

Hypothesis Testing in Multiple Regression Models:

In a multiple regression model hypotheses are formed to test different aspects of this type of regression model:

i. Testing a hypothesis about an individual parameter of the model. For example:

$$H_0: \beta_j = 0 \quad \text{against} \quad H_1: \beta_j \ne 0$$

If $\sigma$ is unknown and is replaced by $\hat{\sigma}$, the test statistic

$$t = \frac{b_j - \beta_j}{se(b_j)} = \frac{b_j}{se(b_j)}$$

follows the t-distribution with $n - k$ df (for a regression model with three parameters, including the intercept, $df = n - 3$).

ii. Testing a hypothesis about the equality of two parameters in the model. For example:

$$H_0: \beta_i = \beta_j \quad \text{against} \quad H_1: \beta_i \ne \beta_j$$

Again, if $\sigma$ is unknown and is replaced by $\hat{\sigma}$, the test statistic

$$t = \frac{(b_i - b_j) - (\beta_i - \beta_j)}{se(b_i - b_j)} = \frac{b_i - b_j}{\sqrt{var(b_i) + var(b_j) - 2\,cov(b_i, b_j)}}$$

follows the t-distribution with $n - k$ df.

• If the value of the test statistic satisfies $|t| > t_{\alpha/2, (n-k)}$ we must reject $H_0$; otherwise there is not enough evidence to reject it.

iii. Testing a hypothesis about the overall significance of the estimated model by checking if all the slope parameters are simultaneously zero. For example, to test

$$H_0: \beta_i = 0 \; (\forall i) \quad \text{against} \quad H_1: \exists \, \beta_i \ne 0$$

the analysis of variance (ANOVA) table can be used to find out whether the mean sum of squares (MSS) due to the regression (the explanatory variables) is far from the MSS due to the residuals. If this is true, it means the variation of the explanatory variables contributes more towards the variation of the dependent variable than the variation of the residuals does, so the ratio

$$\frac{MSS \text{ due to regression (explanatory variables)}}{MSS \text{ due to residuals (random elements)}}$$

should be much higher than one.

• The ANOVA table for the three-variable regression model can be formed as follows:

| Source of variation | Sum of Squares (SS) | df | Mean Sum of Squares (MSS) |
| Due to Explanatory Variables | $b_1 \sum y_i x_{1i} + b_2 \sum y_i x_{2i}$ | $2$ | $\dfrac{b_1 \sum y_i x_{1i} + b_2 \sum y_i x_{2i}}{2}$ |
| Due to Residuals | $\sum e_i^2$ | $n - 3$ | $\hat{\sigma}^2 = \dfrac{\sum e_i^2}{n - 3}$ |
| Total | $\sum y_i^2$ | $n - 1$ | |

• If the regression model is meaningless, we cannot reject the null hypothesis that all slope coefficients are simultaneously equal to zero; otherwise the test statistic

$$F = \frac{ESS / df}{RSS / df} = \frac{\left(b_1 \sum y_i x_{1i} + b_2 \sum y_i x_{2i}\right) / 2}{\sum e_i^2 / (n - 3)}$$

which follows the F-distribution with 2 and $n - 3$ df, must be much bigger than 1.

• In general, to test the overall significance of the sample regression for a multi-variable model (e.g. with $k$ parameters, including the intercept) the null and alternative hypotheses and the test statistic are as follows:

$$H_0: \beta_1 = \beta_2 = \dots = \beta_{k-1} = 0$$

$$H_1: \text{there is at least one } \beta_i \ne 0$$

$$F = \frac{ESS / (k - 1)}{RSS / (n - k)}$$

• If $F > F_{\alpha, \, k-1, \, n-k}$ we reject $H_0$ at the significance level of $\alpha$; otherwise there is not enough evidence to reject it.

• It is sometimes easier to use the determination coefficient $R^2$ to run the above test, because

$$R^2 = \frac{ESS}{TSS} \;\rightarrow\; ESS = R^2 \cdot TSS$$

and also

$$RSS = (1 - R^2) \cdot TSS$$

• The ANOVA table can also be written as:

| Source of variation | Sum of Squares (SS) | df | Mean Sum of Squares (MSS) |
| Due to Explanatory Variables | $R^2 \sum y_i^2$ | $k - 1$ | $\dfrac{R^2 \sum y_i^2}{k - 1}$ |
| Due to Residuals | $(1 - R^2) \sum y_i^2$ | $n - k$ | $\hat{\sigma}^2 = \dfrac{(1 - R^2) \sum y_i^2}{n - k}$ |
| Total | $\sum y_i^2$ | $n - 1$ | |

• So, the test statistic F can be written as:

$$F = \frac{R^2 \sum y_i^2 / (k - 1)}{(1 - R^2) \sum y_i^2 / (n - k)} = \frac{n - k}{k - 1} \cdot \frac{R^2}{1 - R^2}$$
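Since the F statistic depends only on $R^2$, $n$ and $k$, it can be computed without the individual sums of squares. The sketch below is plain Python; the function name and the input values are hypothetical. The resulting value would then be compared with the critical value $F_{\alpha, \, k-1, \, n-k}$ from an F-table.

```python
def f_from_r2(r2, n, k):
    """Overall-significance F statistic from R^2.

    n is the sample size and k the number of parameters (incl. intercept),
    so the statistic has (k - 1) and (n - k) degrees of freedom.
    """
    if not 0.0 <= r2 < 1.0:
        raise ValueError("R^2 must lie in [0, 1)")
    return (n - k) / (k - 1) * r2 / (1.0 - r2)


# Hypothetical example: R^2 = 0.8 from 15 observations and 2 parameters.
f_stat = f_from_r2(0.8, n=15, k=2)
print(f"F = {f_stat:.2f} with (1, 13) df")
```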

iv. Testing a hypothesis about parameters when they satisfy certain restrictions.*

e.g. $H_0: \beta_i + \beta_j = 1$ against $H_1: \beta_i + \beta_j \ne 1$

v. Testing a hypothesis about the stability of the estimated regression model in a specific time period or in two cross-sectional units.**

vi. Testing a hypothesis about different functional forms of regression models.***
