understanding regression


Upload: caden

Post on 16-Feb-2016


Page 1: Understanding regression

Understanding regression

A regression is an average

• Experiment:

Imagine that you are looking at people coming through a door. Imagine also that you had “metric eyes” (rather like Superman’s x-ray vision) and could accurately estimate the height of each person as they passed through.

After 10 people had gone through the door, what would be the best prediction for the height of the eleventh person?

• Answer – the average

• This is why the “average” is also called the “expected value.”
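One way to see why the average is the best prediction: among all single-number guesses, the mean gives the smallest total squared error over the people already observed. The sketch below illustrates this with ten made-up heights (the values and the function name are assumptions for illustration, not data from the slides).

```python
from statistics import mean

# Hypothetical heights (cm) of the first 10 people through the door.
heights = [158, 162, 165, 167, 170, 171, 173, 175, 178, 181]


def sum_squared_error(guess, observations):
    """Total squared distance between one guess and every observation."""
    return sum((h - guess) ** 2 for h in observations)


# The prediction for the 11th person is the average of the first 10.
prediction = mean(heights)

# Guesses 5 cm below or above the average do worse on the data seen so far.
error_at_mean = sum_squared_error(prediction, heights)
error_below = sum_squared_error(prediction - 5, heights)
error_above = sum_squared_error(prediction + 5, heights)
```

Here `prediction` is 170, and both alternative guesses produce a larger squared error than the average does.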

[Figure: height (cm) plotted against person (1 to 11), with an X for each of the first 10 people. A line marks the average for the first 10, which is the best prediction for the 11th.]

The expected value of the height of the 11th is the average of the previous 10.


Imagine that as you are estimating the height of the persons coming through the door, you also note their gender.

Information on gender improves our ability to predict height.

[Figure: height (cm) plotted against person (1 to 11), with men marked X and women marked O. Separate lines mark the average for the men and the average for the women; the prediction for the 11th person is conditional on gender.]
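A prediction that is conditional on gender can be sketched as two separate averages. The heights below are invented for illustration; only the X/O ordering follows the slide.

```python
from statistics import mean

# Hypothetical (mark, height in cm) pairs for the first 10 people,
# using the slide's marks: "X" for men, "O" for women.
observations = [
    ("X", 172), ("O", 163), ("X", 168), ("X", 174), ("X", 170),
    ("O", 166), ("X", 169), ("X", 167), ("O", 164), ("O", 167),
]


def conditional_mean(data, group):
    """Average height among the observations that belong to one group."""
    return mean(h for g, h in data if g == group)


overall_mean = mean(h for _, h in observations)
mean_men = conditional_mean(observations, "X")
mean_women = conditional_mean(observations, "O")


def predict_height(mark):
    """Prediction for the next person, conditional on gender."""
    return mean_men if mark == "X" else mean_women
```

With these numbers the overall average is 168 cm, while the conditional predictions are 170 cm for a man and 165 cm for a woman.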

Regression

• Two basic purposes:

– Explanation

– Prediction

• Regression is an efficient way to analyze the structure of the data.

• A regression model is a sentence that connects the average or expected value of something (a person’s height) to several other variables at once (multivariate analysis).

The regression sentence

• The regression equation may be read as a sentence that summarizes the simultaneous influence of independent variables (causes or drivers) on a single dependent variable (effect or outcome).

• Here is a simple, single variable model.

Height = 165 + 5D (D = 1 for a man and 0 for a woman)

• The regression sentence:

The predicted (expected) height for people coming through the door is 165 cm plus 5 cm if that person is a man.

• In other words:

Women have an expected height of 165 cm and men have an expected height of 170 cm.

[Callout on the equation: Regression coefficient]
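With a single dummy variable, the fitted intercept is the average for the group coded 0 and the coefficient is the gap between the two group averages. A minimal sketch using ordinary least squares on made-up heights (the data and the helper `fit_line` are assumptions for illustration):

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical data: D = 1 for a man, 0 for a woman; heights in cm.
D = [1, 0, 1, 1, 1, 0, 1, 1, 0, 0]
height = [172, 163, 168, 174, 170, 166, 169, 167, 164, 167]

intercept, coefficient = fit_line(D, height)
# intercept is the women's average; intercept + coefficient is the men's.
```

For these invented heights the fit gives an intercept of about 165 and a coefficient of about 5, which reproduces the sentence Height = 165 + 5D.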


Adding variables

Adding more variables conditions our prediction (expectation) for the height of people.

Typical variables could include:

– number of litres of milk consumed per week

– income of parents ($’000s)

– kilometres above sea level at birth

[Figure: height (cm) plotted against number of litres of milk consumed each week (horizontal axis marked 0 and 5), with the value 100 marked on the vertical axis.]

Height = 100 + 15L

• For every litre consumed, height increases 15 cm.

• No milk consumption implies an expected height of 100 cm.

• Someone who drinks 20 litres of milk each week has an expected height of 400 cm.
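The 400 cm result appears to illustrate what happens when a regression sentence is applied far outside the data it was fitted on. A short sketch of that arithmetic follows; treating 0 to 5 litres as the observed range is an assumption read off the figure's axis, and the function names are hypothetical.

```python
def expected_height(litres_per_week):
    """Slide equation: Height = 100 + 15L, in cm."""
    return 100 + 15 * litres_per_week


# Assumption: the figure's horizontal axis (0 to 5 litres) is the range of
# milk consumption actually observed in the data.
OBSERVED_RANGE = (0, 5)


def is_extrapolation(litres_per_week):
    """True when the input lies outside the assumed observed range."""
    low, high = OBSERVED_RANGE
    return not (low <= litres_per_week <= high)


for litres in (0, 5, 20):
    print(litres, expected_height(litres), is_extrapolation(litres))
```

Inside the assumed range the predictions run from 100 cm to 175 cm; at 20 litres the same equation returns 400 cm, and the check flags that input as an extrapolation.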

Regression sentences

• An earnings regression simply relates expected earnings to several variables.

Y = 6,000 + 200.5 AGE + 1000.5 YEARS_ED
(Y = annual income)

• “Expected annual income for the sample is $6,000 plus 200.5 times AGE plus 1000.5 times years of education.”

• A 30-year-old with 12 years of education can expect to earn: $6,000 + 200.5(30) + 1000.5 (12) = $24,021

• For every additional year of education, expected annual income increases by $1,000.50, holding age constant.

[Callout on the equation: Regression coefficient]
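The earnings sentence can be checked by writing it as a small function. This is only a sketch of the slide's equation; the function and parameter names are chosen for illustration.

```python
def expected_income(age, years_ed):
    """Slide equation: Y = 6,000 + 200.5 AGE + 1000.5 YEARS_ED."""
    return 6000 + 200.5 * age + 1000.5 * years_ed


# A 30-year-old with 12 years of education.
income = expected_income(age=30, years_ed=12)

# One more year of education, with age held fixed, adds the coefficient.
gain = expected_income(30, 13) - expected_income(30, 12)
```

`income` evaluates to 24021.0 and `gain` to 1000.5, matching the figures on the slide.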

Example - LMAPD impact analysis

• Wanted to associate labour market programming with outcome

• Wanted to assess the presence and intensity of programming

• Built a regression sentence that expressed this relationship

Hours Worked = a_1 + a_2 (Female) + a_3 (Aboriginal) + … + a_(k-1) (Employ Inter.) + a_k (# Employ Inter.)

• Output appears more complicated, but follows the same principles.

Ex. LMAPD: Estimating VR counselling hours (LMAPD VRhours)

• Admin data includes the total cost of services spent by the VR program on a particular client, but it does not include the cost of VR counselling.

• To estimate VR counselling costs per client, 281 VR clients with currently active VR counsellors were selected.

• VR counsellors were provided a short questionnaire including the following question to be answered for each VR client:

On average, over the entire time that you have been this client’s counsellor, how many hours per month did you spend on this client’s case?


Ex. LMAPD VRhours

• Surveys for 270 clients were returned.

• Information from the surveys was merged with the administrative data.

• The next step was to run a regression using the sample of 270 VR clients to calculate the coefficients for the independent variables (from the admin data) to estimate VR counselling costs for the entire sample of VR clients (n=1,062).


Ex. LMAPD VRhours

• Dependent variable: Average monthly time in hours spent by VR counsellors on the clients’ files (survey question)

• Independent variables:

– Demographic: gender, Aboriginal status, minority status, age, disability type

– Service data: urban/rural service delivery region, organization that delivered services

Ex. LMAPD VRhours: Independent variables

Variables                          Type          Mean
(Male gender)                      M.E. dummy    0.61
Female gender                      M.E. dummy    0.39
(Non-Aboriginal)                   M.E. dummy    0.98
Aboriginal                         M.E. dummy    0.02
(Non-minority)                     M.E. dummy    0.99
Minority                           M.E. dummy    0.01
Age                                Continuous    35.09
Cognitive disability               N.E. dummy    0.17
Physical disability                N.E. dummy    0.30
Psychiatric disability             N.E. dummy    0.28
Hearing disability                 N.E. dummy    0.09
Vision disability                  N.E. dummy    0.13
Learning disability                N.E. dummy    0.14
(Urban service delivery region)    M.E. dummy    0.69
Rural service delivery region      M.E. dummy    0.31
(Provincial service delivery)      M.E. dummy    0.52
SMD service delivery               M.E. dummy    0.31
CPA service delivery               M.E. dummy    0.06
CNIB service delivery              M.E. dummy    0.12

Ex. LMAPD VRhours: Independent variables

• Variables in parentheses (X) are the dummy variables excluded from the regression. Each excluded dummy is the reference category: the coefficients on the included dummies in the same group are measured relative to it.

• Types of variables:

– Continuous

– Mutually exclusive dummy variable (M.E. dummy)

– Not mutually exclusive dummy variable (N.E. dummy)
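Why one category of each mutually exclusive group is excluded can be seen by coding the dummies by hand. In the sketch below the category names come from the slides, while the client list and the `encode` helper are assumptions made for illustration.

```python
# Service delivery categories from the slides; "Provincial" is the excluded
# (reference) category, shown in parentheses in the table.
CATEGORIES = ["Provincial", "SMD", "CPA", "CNIB"]
REFERENCE = "Provincial"


def encode(value, drop_reference=True):
    """Return the 0/1 dummies for one client's service delivery organization."""
    kept = [c for c in CATEGORIES if not (drop_reference and c == REFERENCE)]
    return [1 if value == c else 0 for c in kept]


# Hypothetical clients, for illustration only.
clients = ["SMD", "Provincial", "CNIB", "CPA", "Provincial"]

full = [encode(c, drop_reference=False) for c in clients]
reduced = [encode(c) for c in clients]

# With every category kept, each client's dummies sum to exactly 1, which
# duplicates the constant term. Dropping one category removes the redundancy.
rows_sum_to_one = all(sum(row) == 1 for row in full)
```

In the reduced coding a provincial client is `[0, 0, 0]`, so the constant carries the provincial baseline and each remaining coefficient is a difference from it.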

Ex. LMAPD VRhours: Regression results

Independent variables                Coefficient   P-value
Constant                             2.05          0.01
Female gender (fg)                   0.10          0.74
Aboriginal (ab)                      -1.14         0.26
Minority (m)                         3.98          0.01
Age (ag)                             -0.01         0.36
Cognitive disability (cd)            0.20          0.78
Physical disability (phd)            5.43          0.00
Psychiatric disability (psd)         1.08          0.10
Hearing disability (hd)              6.34          0.00
Vision disability (vd)               0.58          0.73
Learning disability (ld)             -0.61         0.35
Rural service delivery region (r)    0.17          0.612
SMD service delivery (smd)           -6.06         0.00
CPA service delivery (cpa)           -5.16         0.00
CNIB service delivery (cnib)         0.61          0.74

Sample: 270

Adj. R2: 0.1508

Ex. LMAPD VRhours: Coefficients

• Aboriginal status is associated with fewer hours per month (-1.14), although the coefficient is not statistically significant (p = 0.26).

• Minority status is associated with 3.98 more hours of VR counselling per month (p = 0.01).

• Rural clients logged slightly more hours in counselling than urban clients (0.17, not statistically significant).

• Those with physical and hearing disabilities require substantial support (5.43 and 6.34 more hours per month, respectively).

Ex. LMAPD VRhours: Regression sentence

• VRhours = 2.05 + 0.10 fg - 1.14 ab + 3.98 m - 0.01 ag + 0.20 cd + 5.43 phd + 1.08 psd + 6.34 hd + 0.58 vd - 0.61 ld + 0.17 r - 6.06 smd - 5.16 cpa + 0.61 cnib

• Can now use the estimated coefficients and the independent variable values for all 1,062 VR participants to calculate the estimated number of VR hours required for each client.
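Applying the regression sentence to a client amounts to multiplying each coefficient by the client's value and adding the constant. The sketch below uses the coefficients from the results slide; the example client and the function name are hypothetical.

```python
# Coefficients copied from the regression results slide.
COEFFICIENTS = {
    "fg": 0.10, "ab": -1.14, "m": 3.98, "ag": -0.01, "cd": 0.20,
    "phd": 5.43, "psd": 1.08, "hd": 6.34, "vd": 0.58, "ld": -0.61,
    "r": 0.17, "smd": -6.06, "cpa": -5.16, "cnib": 0.61,
}
CONSTANT = 2.05


def estimate_vr_hours(client):
    """Estimated monthly VR counselling hours for one client.

    `client` maps variable abbreviations to values. Variables left out of
    the mapping are treated as 0, i.e. the excluded category for a dummy.
    """
    return CONSTANT + sum(
        coef * client.get(name, 0) for name, coef in COEFFICIENTS.items()
    )


# Hypothetical client: a 35-year-old woman with a physical disability,
# served in a rural region through provincial service delivery.
example_client = {"fg": 1, "ag": 35, "phd": 1, "r": 1}
hours = estimate_vr_hours(example_client)
```

For this invented client the estimate works out to 2.05 + 0.10 - 0.35 + 5.43 + 0.17, or about 7.4 hours per month.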


Assessing the quality of a regression

1. Goodness of fit (R2) measures the proportion of the variation in Y explained by the model.

[Figure: two scatter plots of Y against X, one labelled “Low R2” and one labelled “High R2”.]

The R2 varies between 0 (low) and 1 (high).
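A common way to compute R2 is one minus the ratio of the unexplained variation to the total variation in Y. The sketch below uses that definition on a few made-up numbers; the function name and data are assumptions for illustration.

```python
def r_squared(actual, predicted):
    """Share of the variation in `actual` explained by `predicted`."""
    mean_y = sum(actual) / len(actual)
    ss_total = sum((y - mean_y) ** 2 for y in actual)
    ss_residual = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    return 1 - ss_residual / ss_total


# Hypothetical outcome values.
y = [2.0, 4.0, 6.0, 8.0]

# Predictions that match every observation explain all of the variation.
perfect_fit = r_squared(y, y)

# Predicting the overall average for everyone explains none of it.
mean_only = r_squared(y, [5.0, 5.0, 5.0, 5.0])
```

The two cases give the end points of the scale: 1.0 for the perfect fit and 0.0 for the mean-only prediction.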

Assessing the quality of a regression

2. Statistical significance

• The larger the coefficient relative to its standard error, the more confident we are that it is not zero.

• The lower the standard error, the more confident we are that we have measured the effect reliably.

• Coefficient divided by its standard error is the t value.

• The rule of 2 is applied again as a “t” test: a t value of about 2 or more in absolute size suggests the coefficient differs from zero.

Y = 6,000 + 20.5 AGE + 100.5 YEARS_ED
(t values: 2.5 for the constant, 3.8 for AGE, 1.2 for YEARS_ED)

Computer output reports t values (as above) and standard errors, p values and a host of other diagnostics.
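The t test can be sketched in a few lines. The t values below are the ones printed under the slide's equation, read in order; the standard error in the last example is hypothetical, and the threshold of 2 is the rule of thumb rather than an exact critical value.

```python
def t_value(coefficient, standard_error):
    """t value: the coefficient divided by its standard error."""
    return coefficient / standard_error


def passes_rule_of_two(t):
    """Rule of thumb: |t| of about 2 or more suggests a non-zero coefficient."""
    return abs(t) >= 2


# t values as printed under the slide's equation.
reported_t = {"constant": 2.5, "AGE": 3.8, "YEARS_ED": 1.2}
significant = {name: passes_rule_of_two(t) for name, t in reported_t.items()}

# Hypothetical standard error, for illustration only: a coefficient of 20.5
# with a standard error of 5.0.
example_t = t_value(20.5, 5.0)
```

Under the rule of 2, the constant and AGE pass while YEARS_ED (t = 1.2) does not; the hypothetical example gives a t value of about 4.1.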

Photo radar and traffic safety

Model 1

Deaths = A + B (Number of installations)

(The test is whether B is positive.)

[Figure: “Traffic accidents and photo radar for Canada’s largest cities”: number of deaths from traffic accidents plotted against number of photo radar installations.]

Model 2

Deaths = A + B (Year) + C (D)

D = 0 (year < 2000)
D = 1 (year > 2001)

(The test is whether C is negative.)

[Figure: “Traffic accidents in Winnipeg: 1995 - 2008”: number of deaths from traffic accidents plotted against year, with the introduction of photo radar marked.]
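Model 2 can be sketched as a least-squares fit with a time trend and a policy dummy. The slides do not give the underlying counts, so the deaths below are generated from chosen values (A = 30, B = -0.5, C = -6) purely so the fit can be checked; the helper functions are a plain normal-equations solver written for this illustration.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


# Years 2000 and 2001 are left out because the slide's definition of D
# (0 before 2000, 1 after 2001) does not assign them a value.
years = [y for y in range(1995, 2009) if y < 2000 or y > 2001]
dummy = [1 if y > 2001 else 0 for y in years]

# Hypothetical deaths per year, built from A = 30, B = -0.5, C = -6.
deaths = [30 - 0.5 * (y - 1995) - 6 * d for y, d in zip(years, dummy)]

# Year is measured from 1995 to keep the numbers small.
rows = [[1.0, y - 1995, d] for y, d in zip(years, dummy)]
a, b, c = ols(rows, deaths)
```

The fit recovers the values used to build the data, including a negative C. With real counts, the sign of C would be judged together with its t value.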


Regression variables

• Dependent (Outcome)

• Independent (Causal)

– Context (age, gender, ethnicity)

– Driver (policy)

• Policy can be measured directly ($, person years) or as a change in state (dummy variable).


Building a regression model

• Identify the dependent (effect or outcome) variable(s).

• What are the independent (causal) variables?

• Are there policy impacts?

• How are these to be measured?
