Understanding regression

Post on 16-Feb-2016



A regression is an average

• Experiment:

Imagine that you are looking at people coming through a door. Imagine also that you had “metric eyes” (rather like Superman’s x-ray vision) and could accurately estimate the height of each person as they passed through.

After 10 people had gone through the door, what would be the best prediction for the height of the eleventh person?

• Answer – the average

• This is why the “average” is also called the “expected value.”


[Figure: height (cm) for persons 1 to 10, each marked X, with a horizontal line at the average for the first 10. That average is the best prediction for the 11th.]

The expected value of the height of the 11th is the average of the previous 10.
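The door experiment can be sketched in a few lines of Python. The ten heights below are made up for illustration. The average is the natural guess here because it is the single number that minimizes the sum of squared prediction errors over the people already seen.

```python
from statistics import mean

# Hypothetical heights (cm) of the first 10 people through the door.
heights_cm = [172, 163, 168, 171, 169, 166, 170, 170, 164, 167]

# With no other information, the sample average is the prediction
# for the next (11th) person.
prediction_11th = mean(heights_cm)
print(f"Predicted height of person 11: {prediction_11th:.1f} cm")
```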


Imagine that as you are estimating the height of the persons coming through the door, you also note their gender.

Information on gender improves our ability to predict height.

[Figure: height (cm) for persons 1 to 10; O = women, X = men. Horizontal lines mark the average for the men and the average for the women. The prediction for the 11th person is conditional on gender.]
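A minimal sketch of a prediction conditional on gender, using made-up heights and the hypothetical helper `conditional_mean`. Once the gender of the 11th person is known, the prediction switches from the overall average to the average of the matching group.

```python
from statistics import mean

# Hypothetical observations: (height in cm, gender) for the first 10 people.
observed = [
    (172, "M"), (163, "F"), (168, "M"), (171, "M"), (169, "M"),
    (166, "F"), (170, "M"), (170, "M"), (164, "F"), (167, "F"),
]

def conditional_mean(data, gender):
    """Average height among the observations with the given gender."""
    return mean(h for h, g in data if g == gender)

overall = mean(h for h, _ in observed)
men = conditional_mean(observed, "M")
women = conditional_mean(observed, "F")

print(f"Overall average: {overall} cm")
print(f"Prediction if person 11 is a man:   {men} cm")
print(f"Prediction if person 11 is a woman: {women} cm")
```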


Regression

• Two basic purposes:

– Explanation

– Prediction

• Regression is an efficient way to analyze the structure of the data.

• A regression model is a sentence that relates the average or expected value of something (a person’s height) to several other variables at once (multivariate analysis).


The regression sentence

• The regression equation may be read as a sentence that summarizes the simultaneous influence of independent variables (causes or drivers) on a single dependent variable (effects or outcomes).

• Here is a simple, single variable model.

Height = 165 + 5D (D = 1 for a man and 0 for a woman)

• The regression sentence:

The predicted (expected) height for people coming through the door is 165 cm plus 5 cm if that person is a man.

• In other words:

Women have an expected height of 165 cm and men have an expected height of 170 cm.

(The 5 multiplying D is the regression coefficient.)
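As a rough illustration, a least-squares fit with a single dummy variable reproduces the two group averages: the constant is the women’s average and the coefficient on D is the gap between men and women. The heights are invented so that the fit lands on the slide’s numbers; NumPy’s `lstsq` is used here only as one convenient way to run the fit.

```python
import numpy as np

# Hypothetical heights (cm); D = 1 for a man, 0 for a woman.
height = np.array([172, 163, 168, 171, 169, 166, 170, 170, 164, 167], dtype=float)
D = np.array([1, 0, 1, 1, 1, 0, 1, 1, 0, 0], dtype=float)

# Design matrix: a column of ones (the constant) and the dummy.
X = np.column_stack([np.ones_like(D), D])

# Ordinary least squares fit of Height = a + b*D.
(a, b), *_ = np.linalg.lstsq(X, height, rcond=None)

print(f"Height = {a:.0f} + {b:.0f}D")
```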


Adding variables

Adding more variables conditions our prediction (expectation) for the height of people.

Typical variables could include:

– number of litres of milk consumed per week

– income of parents ($’000s)

– kilometres above sea level at birth


[Figure: height (cm) plotted against number of litres of milk consumed each week (x-axis from 0 to 5), with a fitted line that starts at 100 cm.]

Height = 100 + 15L

• For every litre consumed, height increases 15 cm.

• No milk consumption implies an expected height of 100 cm.

• Someone who drinks 20 litres of milk each week has an expected height of 400 cm.
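The arithmetic behind these statements can be checked directly. The sketch below evaluates the line Height = 100 + 15L at three values and flags predictions outside the range of the data. The range of 0 to 5 litres is an assumption taken from the axis of the figure, and reading the 400 cm result as a warning about extrapolation is one interpretation of the slide.

```python
def expected_height_cm(litres_per_week: float) -> float:
    """The slide's fitted line: Height = 100 + 15L."""
    return 100 + 15 * litres_per_week

OBSERVED_RANGE = (0, 5)  # assumption: the range shown on the figure's x-axis

for litres in (0, 5, 20):
    inside = OBSERVED_RANGE[0] <= litres <= OBSERVED_RANGE[1]
    note = "within observed range" if inside else "extrapolation"
    print(f"{litres:>2} L/week -> {expected_height_cm(litres)} cm ({note})")
```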


Regression sentences

• An earnings regression simply relates expected earnings to several variables.

Y = 6,000 + 200.5 AGE + 1000.5 YEARS_ED (Y = annual income)

• “Expected annual income for the sample is $6,000 plus 200.5 times AGE plus 1000.5 times years of education.”

• A 30-year-old with 12 years of education can expect to earn: $6,000 + 200.5(30) + 1000.5 (12) = $24,021

• For every additional year of education, expected annual income increases by $1,000.50, holding age constant.

(200.5 and 1000.5 are the regression coefficients.)
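The earnings sentence translates directly into a small function. The function name `expected_income` is illustrative only; the coefficients are the ones on the slide.

```python
def expected_income(age: float, years_ed: float) -> float:
    """Y = 6,000 + 200.5 AGE + 1000.5 YEARS_ED (annual income in dollars)."""
    return 6000 + 200.5 * age + 1000.5 * years_ed

# The slide's example: a 30-year-old with 12 years of education.
base = expected_income(30, 12)

# One more year of education, with age held at 30.
one_more_year = expected_income(30, 13)

print(f"Expected income: ${base:,.2f}")
print(f"Effect of one extra year of education: ${one_more_year - base:,.2f}")
```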


Example - LMAPD impact analysis

• Wanted to associate labour market programming with outcome

• Wanted to assess the presence and intensity of programming

• Built a regression sentence that expressed this relationship

Hours Worked = a_1 + a_2 Female + a_3 Aboriginal + … + a_(k-1) (Employ Inter.) + a_k (# Employ Inter.)

• Output appears more complicated, but follows the same principles.


Ex. LMAPD: Estimating VR counselling hours (LMAPD VRhours)

• Admin data includes the total amount spent on services by the VR program for a particular client, but it does not include the cost of VR counselling.

• To estimate VR counselling costs per client, 281 VR clients with currently active VR counsellors were selected.

• VR counsellors were provided a short questionnaire including the following question to be answered for each VR client:

On average, over the entire time that you have been this client’s counsellor, how many hours per month did you spend on this client’s case?


Ex. LMAPD VRhours

• Surveys for 270 clients were returned.

• Information from the surveys was merged with the administrative data.

• The next step was to run a regression using the sample of 270 VR clients to calculate the coefficients for the independent variables (from the admin data) to estimate VR counselling costs for the entire sample of VR clients (n=1,062).
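The two-step procedure (fit on the surveyed clients, then predict for everyone) might look roughly like the sketch below. All data are simulated, only two predictors are used, and the variable names are hypothetical; only the sample sizes 270 and 1,062 come from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

N_ALL, N_SURVEYED = 1062, 270   # sample sizes from the slides

# Hypothetical admin-data predictors for all clients: a constant,
# a female dummy, and age. The real model uses more variables.
female = rng.integers(0, 2, size=N_ALL)
age = rng.uniform(18, 65, size=N_ALL)
X_all = np.column_stack([np.ones(N_ALL), female, age])

# Hours are observed only for the surveyed clients (simulated here).
surveyed = rng.choice(N_ALL, size=N_SURVEYED, replace=False)
noise = rng.normal(0, 1, size=N_SURVEYED)
hours_surveyed = 2.0 + 0.1 * female[surveyed] - 0.01 * age[surveyed] + noise

# Step 1: estimate the coefficients on the surveyed subsample.
coef, *_ = np.linalg.lstsq(X_all[surveyed], hours_surveyed, rcond=None)

# Step 2: apply them to every client's admin data.
estimated_hours = X_all @ coef

print("Estimated coefficients:", np.round(coef, 3))
print("Clients with an estimate:", estimated_hours.shape[0])
```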


Ex. LMAPD VRhours

• Dependent variable: Average monthly time in hours spent by VR counsellors on the clients’ files (survey question)

• Independent variables:

– Demographic: gender, Aboriginal status, minority status, age, disability type

– Service data: urban/rural service delivery region, organization that delivered services


Ex. LMAPD VRhours: Independent variables

Variables                          Type          Mean
(Male gender)                      M.E. dummy    0.61
Female gender                      M.E. dummy    0.39
(Non-Aboriginal)                   M.E. dummy    0.98
Aboriginal                         M.E. dummy    0.02
(Non-minority)                     M.E. dummy    0.99
Minority                           M.E. dummy    0.01
Age                                Continuous    35.09
Cognitive disability               N.E. dummy    0.17
Physical disability                N.E. dummy    0.30
Psychiatric disability             N.E. dummy    0.28
Hearing disability                 N.E. dummy    0.09
Vision disability                  N.E. dummy    0.13
Learning disability                N.E. dummy    0.14
(Urban service delivery region)    M.E. dummy    0.69
Rural service delivery region      M.E. dummy    0.31
(Provincial service delivery)      M.E. dummy    0.52
SMD service delivery               M.E. dummy    0.31
CPA service delivery               M.E. dummy    0.06
CNIB service delivery              M.E. dummy    0.12


Ex. LMAPD VRhours: Independent variables

• Variables shown in parentheses, such as (Male gender), are the dummy variables excluded from the regression. They serve as the reference categories.

• Types of variables:

– Continuous

– Mutually exclusive (M.E.) dummy variable

– Not mutually exclusive (N.E.) dummy variable
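One reason to exclude a category from each set of mutually exclusive dummies: if the model has a constant and both the male and the female dummy, the two dummies always add up to the constant column, so the coefficients cannot be estimated separately. A small sketch with made-up data shows the redundancy as a rank deficiency in the design matrix.

```python
import numpy as np

gender = ["M", "F", "M", "M", "F", "M"]  # hypothetical clients
male = np.array([1.0 if g == "M" else 0.0 for g in gender])
female = 1.0 - male
const = np.ones(len(gender))

# Constant + both dummies: male + female equals the constant column.
with_both = np.column_stack([const, male, female])

# Constant + female only: male is the excluded (reference) category.
with_one = np.column_stack([const, female])

rank_both = np.linalg.matrix_rank(with_both)
rank_one = np.linalg.matrix_rank(with_one)

print(f"3 columns, rank {rank_both}: one column is redundant")
print(f"2 columns, rank {rank_one}: all columns carry information")
```

Dummies that are not mutually exclusive, such as the disability types, do not automatically have this problem because a client can have more than one of them.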


Ex. LMAPD VRhours: Regression results

Independent variables                Coefficient    P-value
Constant                              2.05          0.01
Female gender (fg)                    0.10          0.74
Aboriginal (ab)                      -1.14          0.26
Minority (m)                          3.98          0.01
Age (ag)                             -0.01          0.36
Cognitive disability (cd)             0.20          0.78
Physical disability (phd)             5.43          0.00
Psychiatric disability (psd)          1.08          0.10
Hearing disability (hd)               6.34          0.00
Vision disability (vd)                0.58          0.73
Learning disability (ld)             -0.61          0.35
Rural service delivery region (r)     0.17          0.612
SMD service delivery (smd)           -6.06          0.00
CPA service delivery (cpa)           -5.16          0.00
CNIB service delivery (cnib)          0.61          0.74

Sample: 270

Adj. R²: 0.1508


Ex. LMAPD VRhours: Coefficients

• Aboriginal status is associated with fewer hours per month (-1.14, not statistically significant; p = 0.26).

• Minority clients required 3.98 more hours of VR counselling per month.

• Rural clients logged slightly more hours in counselling than urban clients (0.17, not statistically significant).

• Those with physical and hearing disabilities require substantial support (5.43 and 6.34 additional hours per month, respectively).
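To separate the coefficients that are statistically distinguishable from zero from those that are not, one can filter the results table by p-value. The sketch below copies the coefficients and p-values from the regression results; the 0.05 cut-off is a conventional choice assumed here, not something the slides state.

```python
# (coefficient, p-value) pairs copied from the regression results.
results = {
    "Constant": (2.05, 0.01),
    "Female gender": (0.10, 0.74),
    "Aboriginal": (-1.14, 0.26),
    "Minority": (3.98, 0.01),
    "Age": (-0.01, 0.36),
    "Cognitive disability": (0.20, 0.78),
    "Physical disability": (5.43, 0.00),
    "Psychiatric disability": (1.08, 0.10),
    "Hearing disability": (6.34, 0.00),
    "Vision disability": (0.58, 0.73),
    "Learning disability": (-0.61, 0.35),
    "Rural service delivery region": (0.17, 0.612),
    "SMD service delivery": (-6.06, 0.00),
    "CPA service delivery": (-5.16, 0.00),
    "CNIB service delivery": (0.61, 0.74),
}

ALPHA = 0.05  # assumption: conventional 5% significance level

significant = [name for name, (_, p) in results.items() if p < ALPHA]

for name in significant:
    coef, p = results[name]
    print(f"{name}: {coef:+.2f} (p = {p:.2f})")
```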


Ex. LMAPD VRhours: Regression sentence

• VRhours = 2.05 + 0.10 fg - 1.14 ab + 3.98 m - 0.01 ag + 0.20 cd + 5.43 phd + 1.08 psd + 6.34 hd + 0.58 vd - 0.61 ld + 0.17 r - 6.06 smd - 5.16 cpa + 0.61 cnib

• Can now use the estimated coefficients and the independent variable values for all 1,062 VR participants to calculate the estimated number of VR hours required for each client.
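Applying the regression sentence to one client is a sum of coefficient times value. In the sketch below the coefficients are the estimated ones from the slides, while the client and the helper `predicted_vr_hours` are hypothetical.

```python
# Estimated coefficients, keyed by the abbreviations used in the sentence.
COEF = {
    "const": 2.05, "fg": 0.10, "ab": -1.14, "m": 3.98, "ag": -0.01,
    "cd": 0.20, "phd": 5.43, "psd": 1.08, "hd": 6.34, "vd": 0.58,
    "ld": -0.61, "r": 0.17, "smd": -6.06, "cpa": -5.16, "cnib": 0.61,
}

def predicted_vr_hours(client: dict) -> float:
    """Constant plus coefficient * value; variables not listed count as 0."""
    return COEF["const"] + sum(
        COEF[name] * value for name, value in client.items()
    )

# Hypothetical client: a 40-year-old woman with a physical disability,
# served in a rural region by the provincial program (reference category).
client = {"fg": 1, "ag": 40, "phd": 1, "r": 1}
hours = predicted_vr_hours(client)

print(f"Estimated VR hours per month: {hours:.2f}")
```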


Assessing the quality of a regression

1. Goodness of fit (R²) measures the proportion of the variation in Y explained by the model.

[Figure: two scatter plots of Y against X, one labelled “Low R²” and one labelled “High R²”.]

The R² varies between 0 (low) and 1 (high).
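A short sketch of how R² can be computed for a straight-line fit, as 1 minus the ratio of unexplained to total variation. Both data sets are invented: the same underlying line with a small or a large disturbance added. The helper `r_squared` is illustrative, not a library function.

```python
import numpy as np

def r_squared(x, y):
    """R² of a straight-line least-squares fit of y on x."""
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (intercept + slope * x)
    ss_res = np.sum(residuals ** 2)        # variation left unexplained
    ss_tot = np.sum((y - y.mean()) ** 2)   # total variation in y
    return 1 - ss_res / ss_tot

x = np.arange(10, dtype=float)
wiggle = np.array([1, -1] * 5, dtype=float)  # hypothetical disturbance

y_tight = 1 + 2 * x + 0.5 * wiggle   # points close to the line
y_loose = 1 + 2 * x + 8.0 * wiggle   # points far from the line

r2_tight = r_squared(x, y_tight)
r2_loose = r_squared(x, y_loose)

print(f"Points close to the line: R² = {r2_tight:.2f}")
print(f"Points far from the line: R² = {r2_loose:.2f}")
```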


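Assessing the quality of a regression

2. Statistical significance

• The larger the coefficient relative to its standard error, the more confident we are that it is not zero.

• The lower the standard error (SE), the more confident we are that we have measured the effect reliably.

• Coefficient divided by its standard error is the t value.

• The rule of 2 is applied again as a “t” test: a t value larger than about 2 in absolute value suggests the coefficient differs from zero (roughly the 5% significance level).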

Y = 6,000 + 20.5 AGE + 100.5 YEARS_ED

(t values: 2.5 for the constant, 3.8 for AGE, 1.2 for YEARS_ED)

Computer output reports t values (as above) and standard errors, p values and a host of other diagnostics.
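The rule of 2 can be applied mechanically to the example above. The sketch assumes the three bracketed values are the t values of the constant, AGE and YEARS_ED in that order; the standard errors are not reported on the slide and are backed out here from t = coefficient / standard error.

```python
# (coefficient, t value) pairs, assuming the bracketed values line up
# with the terms of the equation in order.
terms = {
    "constant": (6000.0, 2.5),
    "AGE": (20.5, 3.8),
    "YEARS_ED": (100.5, 1.2),
}

for name, (coef, t) in terms.items():
    std_err = coef / t  # implied by t = coefficient / standard error
    verdict = "significant" if abs(t) > 2 else "not significant"
    print(f"{name}: coef = {coef}, SE = {std_err:.2f}, t = {t} ({verdict})")

# Rule of 2: keep the terms whose |t| exceeds 2.
significant = [name for name, (_, t) in terms.items() if abs(t) > 2]
```

Under that reading, YEARS_ED has the largest slope coefficient but fails the test, which illustrates that the size of a coefficient alone does not establish significance.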


Photo radar and traffic safety

Model 1

[Figure: “Traffic accidents and photo radar for Canada’s largest cities”. Number of deaths from traffic accidents plotted against number of photo radar installations.]

Deaths = A + B (Number of installations)

(The test is whether B is positive.)

Model 2

[Figure: “Traffic accidents in Winnipeg: 1995 - 2008”. Number of deaths from traffic accidents plotted against year, with the introduction of photo radar marked.]

Deaths = A + B (Year) + C (D)

D = 0 (year < 2000)

D = 1 (year > 2001)

(The test is whether C is negative.)
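Model 2 can be sketched as a regression on a time trend plus a policy dummy. The death counts below are invented, the years 2000 and 2001 are dropped because D is not defined for them, and the year is measured from 1995, which changes only the constant A.

```python
import numpy as np

# Years 1995-2008, leaving out 2000 and 2001 where D is not defined.
years = np.array([y for y in range(1995, 2009) if y not in (2000, 2001)])
D = (years > 2001).astype(float)   # 0 before photo radar, 1 after
t = years - 1995                   # trend, measured from 1995

# Hypothetical deaths: a downward trend, a drop of 6 after photo radar,
# and a small disturbance.
wiggle = np.array([0.5, -0.5] * 6)
deaths = 30 - 0.5 * t - 6 * D + wiggle

# Fit Deaths = A + B*(Year - 1995) + C*D by least squares.
X = np.column_stack([np.ones_like(D), t, D])
(A, B, C), *_ = np.linalg.lstsq(X, deaths, rcond=None)

print(f"Trend per year (B): {B:.2f}")
print(f"Shift after photo radar (C): {C:.2f}")
print("C is negative" if C < 0 else "C is not negative")
```

Including the trend matters: if deaths were already falling before photo radar, a simple before/after comparison would credit the policy with the whole decline.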


Regression variables

• Dependent (Outcome)

• Independent (Causal)

– Context (age, gender, ethnicity)

– Driver (policy)

• Policy can be measured directly ($, person years) or as a change in state (dummy variable).


Building a regression model

• Identify the dependent (effect or outcome) variable(s).

• What are the independent (causal) variables?

• Are there policy impacts?

• How are these to be measured?
