146 41 multiple_regression

28
MATH& 146 Lesson 41 Section 6.1 Multiple Regression 1

Upload: greg-kent

Post on 15-Feb-2017

127 views

Category:

Education


0 download

TRANSCRIPT

Page 1: 146 41 multiple_regression

MATH& 146

Lesson 41

Section 6.1

Multiple Regression

1

Page 2: 146 41 multiple_regression

Multiple Regression

The principles of simple linear regression lay the foundation for more sophisticated regression methods used in a wide range of challenging settings.

Multiple regression extends simple two-variable regression to the case that still has one response but many predictors (denoted x1, x2, x3, ...). The method is motivated by scenarios where many variables may be simultaneously connected to an output.

2

Page 3: 146 41 multiple_regression

Mario Kart

We will consider eBay auctions of a video game

called Mario Kart for the Nintendo Wii. The

outcome variable of interest is the total price of an

auction, which is the highest bid plus the shipping

cost.

We will try to determine how total price is related to

each characteristic in an auction while

simultaneously controlling for other variables.

3

Page 4: 146 41 multiple_regression

Mario Kart

For instance, all other characteristics held

constant, are longer auctions associated with

higher or lower prices?

And, on average, how much more do buyers tend

to pay for additional Wii wheels (plastic steering

wheels that attach to the Wii controller) in

auctions?

Multiple regression can help us answer these and

other questions.

4

Page 5: 146 41 multiple_regression

Mario Kart

The Mario Kart data set includes results from 141

auctions. Four observations from this data set are

shown in the table below.

5

price cond_new stock_photo duration wheels

1 51.55 1 1 3 1

2 37.04 0 1 7 1

… … … … … …

140 38.76 0 0 7 0

141 54.51 1 1 1 2

Page 6: 146 41 multiple_regression

Mario Kart

Descriptions for each variable are shown below.

Notice that the cond_new and stock_photo

variables are indicator variables (variables that

only have values of 0 or 1).

6

Page 7: 146 41 multiple_regression

Mario Kart

For instance, the cond_new variable takes value 1

if the game up for auction is new and 0 if it is used.

Using indicator variables in place of category

names allows for these variables to be directly

used in regression.

7

price cond_new stock_photo duration wheels

1 51.55 1 1 3 1

2 37.04 0 1 7 1

… … … … … …

140 38.76 0 0 7 0

141 54.51 1 1 1 2

Page 8: 146 41 multiple_regression

Mario Kart

Multiple regression also allows for categorical

variables with many levels, though we do not have

any such variables in this analysis, and we save

these details for a second or third course.

8

price cond_new stock_photo duration wheels

1 51.55 1 1 3 1

2 37.04 0 1 7 1

… … … … … …

140 38.76 0 0 7 0

141 54.51 1 1 1 2

Page 9: 146 41 multiple_regression

Mario Kart

Let's fit a linear regression model with the game's

condition as a predictor of auction price. The model

may be written as

Results of this model are shown below.

9

Estimate Std. Error t value Pr(>|t|)

(Intercept) 42.8711 0.8140 52.67 0.0000

cond_new 10.8996 1.2583 8.66 0.0000

df = 139

Page 10: 146 41 multiple_regression

Mario Kart

A scatterplot of the total auction price against the

game's condition is shown, which takes value 1 when

the game is new and 0 if the game is used. The least

squares line is also shown.

10

Page 11: 146 41 multiple_regression

Example 1

Does the linear model seem reasonable?

11

Page 12: 146 41 multiple_regression

Example 2

Interpret the coefficient for the game's condition in the

model. Is this coefficient significantly different from 0?

12

Estimate Std. Error t value Pr(>|t|)

(Intercept) 42.8711 0.8140 52.67 0.0000

cond_new 10.8996 1.2583 8.66 0.0000

df = 139

Page 13: 146 41 multiple_regression

Multiple Regression

Sometimes there are underlying structures or

relationships between predictor variables. For

instance, new games sold on eBay tend to come

with more Wii wheels, which may have led to

higher prices for those auctions.

We would like to fit a model that includes all

potentially important variables simultaneously.

13

Page 14: 146 41 multiple_regression

Multiple Regression

This would help us evaluate the relationship

between a predictor variable and the outcome

while controlling for the potential influence of other

variables. This is the strategy used in multiple

regression.

While we remain cautious about making any

causal interpretations using multiple regression,

such models are a common first step in providing

evidence of a causal connection.

14

Page 15: 146 41 multiple_regression

Multiple Regression

We want to construct a model that accounts for not

only the game condition, but simultaneously

accounts for three other variables: stock_photo,

duration, and wheels.

15

price cond_new stock_photo duration wheels

1 51.55 1 1 3 1

2 37.04 0 1 7 1

… … … … … …

140 38.76 0 0 7 0

141 54.51 1 1 1 2

Page 16: 146 41 multiple_regression

Multiple Regression

That is,

In this equation, y represents the total price, x1

indicates whether the game was new, x2 indicates

whether a stock photo was used, x3 was the duration

of the auction, and x4 was the number of Wii wheels

included with the game.

16

Page 17: 146 41 multiple_regression

Multiple Regression

We (by that, I mean computers) estimate the

parameters β0, β1, β2, β3, and β4 in the same way

as we did in the case of a single predictor.

We (again, computers) select b0, b1, b2, b3, and b4

that minimize the sum of the squared residuals:

17

141 141

22 2 2 2

1 2 141

1 1

ˆi i i

i i

SSE e e e e y y

Page 18: 146 41 multiple_regression

Multiple Regression

There are 141 residuals for the Mario Kart data, one

for each auction. A computer minimizes the sum of

these squared residuals, and computes point

estimates, shown in the regression table below.

18

Estimate Std. Error t value Pr(>|t|)

(Intercept) 36.2110 1.5140 23.92 0.0000

cond_new 5.1306 1.0511 4.88 0.0000

stock_photo 1.0803 1.0568 1.02 0.3085

duration –0.0268 0.1904 –0.14 0.8882

wheels 7.2852 0.5547 13.13 0.0000

df = 136

Page 19: 146 41 multiple_regression

Example 3

Write out the model using the point estimates below.

How many predictors are there in this model?

19

Estimate Std. Error t value Pr(>|t|)

(Intercept) 36.2110 1.5140 23.92 0.0000

cond_new 5.1306 1.0511 4.88 0.0000

stock_photo 1.0803 1.0568 1.02 0.3085

duration –0.0268 0.1904 –0.14 0.8882

wheels 7.2852 0.5547 13.13 0.0000

df = 136

Page 20: 146 41 multiple_regression

Example 4

What does β4, the coefficient of variable x4 (Wii

wheels), represent? What is the point estimate of β4?

20

Estimate Std. Error t value Pr(>|t|)

(Intercept) 36.2110 1.5140 23.92 0.0000

cond_new 5.1306 1.0511 4.88 0.0000

stock_photo 1.0803 1.0568 1.02 0.3085

duration –0.0268 0.1904 –0.14 0.8882

wheels 7.2852 0.5547 13.13 0.0000

df = 136

Page 21: 146 41 multiple_regression

Example 5

Compute the residual of the first observation using

the multiple regression equation identified in the

previous examples.

21

price cond_new stock_photo duration wheels

1 51.55 1 1 3 1

2 37.04 0 1 7 1

… … … … … …

140 38.76 0 0 7 0

141 54.51 1 1 1 2

Page 22: 146 41 multiple_regression

Example 6

What do the two models below have in common?

Where do they differ?

22

Estimate Std. Error t value Pr(>|t|)

(Intercept) 42.8711 0.8140 52.67 0.0000

cond_new 10.8996 1.2583 8.66 0.0000

df = 139

Estimate Std. Error t value Pr(>|t|)

(Intercept) 36.2110 1.5140 23.92 0.0000

cond_new 5.1306 1.0511 4.88 0.0000

stock_photo 1.0803 1.0568 1.02 0.3085

duration –0.0268 0.1904 –0.14 0.8882

wheels 7.2852 0.5547 13.13 0.0000

df = 136

Page 23: 146 41 multiple_regression

Example 7

We estimated a coefficient for cond_new of b1 = 10.90

with a standard error of SE(b1) = 1.26 when using

simple linear regression. Why might there be a

difference between that estimate and the one in the

multiple regression setting?

23

1 1

simple linear regression 10.90 1.26

multiple regression 5.13 1.05

b SE b

Page 24: 146 41 multiple_regression

Example 7 answer

When we estimated the connection of the outcome

price and predictor cond_new using simple linear

regression, we were unable to control for other

variables like the number of Wii wheels included in the

auction.

24

Estimate Std. Error t value Pr(>|t|)

(Intercept) 36.2110 1.5140 23.92 0.0000

cond_new 5.1306 1.0511 4.88 0.0000

stock_photo 1.0803 1.0568 1.02 0.3085

duration –0.0268 0.1904 –0.14 0.8882

wheels 7.2852 0.5547 13.13 0.0000

df = 136

Page 25: 146 41 multiple_regression

Example 7 answer

We say that the simple linear model was biased by the

confounding variable wheels. When we use both

variables, this particular underlying and unintentional

bias is reduced or eliminated (though bias from other

confounding variables may still remain).

25

Estimate Std. Error t value Pr(>|t|)

(Intercept) 36.2110 1.5140 23.92 0.0000

cond_new 5.1306 1.0511 4.88 0.0000

stock_photo 1.0803 1.0568 1.02 0.3085

duration –0.0268 0.1904 –0.14 0.8882

wheels 7.2852 0.5547 13.13 0.0000

df = 136

Page 26: 146 41 multiple_regression

Collinearity

The previous example describes a common issue in

multiple regression: correlation among predictor

variables. We say the two predictor variables are

collinear when they are correlated, and this

collinearity complicates model estimation.

While it is impossible to prevent collinearity from

arising in observational data, experiments are usually

designed to prevent predictors from being collinear.

26

Page 27: 146 41 multiple_regression

Example 8

The estimated value of the intercept is 36.21, and one

might be tempted to make some interpretation of this

coefficient, such as, it is the model's predicted price

when each of the variables take value zero: the game

is used, the primary image is not a stock photo, the

auction duration is zero days, and there are no wheels

included.

Is there any value gained by making this

interpretation?

27

Page 28: 146 41 multiple_regression

Example 8 answer

Three of the variables (cond_new, stock photo,

and wheels) do take value 0, but the auction

duration is always one or more days. If the auction

is not up for any days, then no one can bid on it!

That means the total auction price would always

be zero for such an auction; the interpretation of

the intercept in this setting is not insightful.

28