
Statistics for Business and Economics

Chapter 11: Multiple Regression and Model Building

Learning Objectives

1. Explain the Linear Multiple Regression Model

2. Describe Inference About Individual Parameters

3. Test Overall Significance

4. Explain Estimation and Prediction

5. Describe Various Types of Models

6. Describe Model Building

7. Explain Residual Analysis

8. Describe Regression Pitfalls

Types of Regression Models

• Simple (1 explanatory variable): Linear or Non-Linear

• Multiple (2+ explanatory variables): Linear or Non-Linear

Multiple Regression Model

• General form:

y = β0 + β1x1 + β2x2 + … + βkxk + ε

• k independent variables

• x1, x2, …, xk may be functions of variables

— e.g., x2 = (x1)²

Regression Modeling Steps

1. Hypothesize deterministic component

2. Estimate unknown model parameters

3. Specify probability distribution of random error term

• Estimate standard deviation of error

4. Evaluate model

5. Use model for prediction and estimation

First-Order Multiple Regression Model

Relationship between 1 dependent and 2 or more independent variables is a linear function:

y = β0 + β1x1 + β2x2 + … + βkxk + ε

where y is the dependent (response) variable, x1, …, xk are the independent (explanatory) variables, β0 is the population y-intercept, β1, …, βk are the population slopes, and ε is the random error.

First-Order Model With 2 Independent Variables

• Relationship between 1 dependent and 2 independent variables is a linear function

• Model: E(y) = β0 + β1x1 + β2x2

• Assumes no interaction between x1 and x2

— Effect of x1 on E(y) is the same regardless of x2 values

Population Multiple Regression Model

Bivariate model:

yi = β0 + β1x1i + β2x2i + εi

E(y) = β0 + β1x1i + β2x2i

[Figure: the response plane E(y) over the (x1, x2) plane; each observed yi at (x1i, x2i) lies a vertical distance εi from the plane.]

Sample Multiple Regression Model

Bivariate model:

yi = β̂0 + β̂1x1i + β̂2x2i + ε̂i

ŷi = β̂0 + β̂1x1i + β̂2x2i

[Figure: the fitted response plane ŷ; each observed yi at (x1i, x2i) lies a vertical distance ε̂i (the residual) from the plane.]


Multiple Linear Regression Equations

Too complicated by hand! Ouch!

1st Order Model Example

You work in advertising for the New York Times. You want to find the effect of ad size (sq. in.) and newspaper circulation (000) on the number of ad responses (00). Estimate the unknown parameters.

You’ve collected the following data:

Resp (y)  Size (x1)  Circ (x2)
   1          1          2
   4          8          8
   1          3          1
   3          5          7
   2          6          4
   4         10          6

See ResponsesVsAdsizeAndCirculationData.jmp

The estimates β̂0, β̂1, β̂2 give the fitted equation:

ŷ = .0640 + .2049x1 + .2805x2
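The estimates above come from JMP, but they can be reproduced outside it; here is a minimal NumPy sketch (Python and NumPy are assumptions here, not part of the slides):

```python
import numpy as np

# The six data rows from the table above
y  = np.array([1, 4, 1, 3, 2, 4], dtype=float)   # responses (00)
x1 = np.array([1, 8, 3, 5, 6, 10], dtype=float)  # ad size (sq. in.)
x2 = np.array([2, 8, 1, 7, 4, 6], dtype=float)   # circulation (000)

# Design matrix with an intercept column
X = np.column_stack([np.ones_like(y), x1, x2])

# Least-squares estimates of beta0, beta1, beta2
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta, 4))  # [0.064  0.2049 0.2805]
```

The rounded estimates match the fitted equation ŷ = .0640 + .2049x1 + .2805x2.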

Interpretation of Coefficients Solution

• Slope β̂1 = .2049: The number of responses to the ad is expected to increase by .2049 (20.49) for each 1 sq. in. increase in ad size, holding circulation constant

• Slope β̂2 = .2805: The number of responses to the ad is expected to increase by .2805 (28.05) for each 1 unit (1,000) increase in circulation, holding ad size constant


Estimation of σ²

s² = SSE / (n − (k + 1)),  where SSE = Σ(yi − ŷi)²

s = √s² = √(SSE / (n − (k + 1)))

For a model with k predictors (k + 1 parameters)

More About JMP Output

s² = SSE / (n − (k + 1)) = .250264 / (6 − 3) = .083421 (also called “mean squared error”)

s = √.083421 = .2888 (also called “standard error of the regression”)
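The SSE, s², and s in the JMP output can be checked by hand; a NumPy sketch using the same six rows as the ad example:

```python
import numpy as np

# Refit the ad-response example (same six rows as in the data table)
y = np.array([1, 4, 1, 3, 2, 4], dtype=float)
X = np.column_stack([np.ones(6),
                     [1, 8, 3, 5, 6, 10],   # ad size x1
                     [2, 8, 1, 7, 4, 6]])   # circulation x2
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

n, k = 6, 2
sse = np.sum((y - X @ beta) ** 2)   # ~ 0.2503
s2 = sse / (n - (k + 1))            # ~ 0.0834, the mean squared error
s = np.sqrt(s2)                     # ~ 0.2888, std. error of the regression
```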


Evaluating Multiple Regression Model Steps

1. Examine variation measures

2. Test parameter significance

• Individual coefficients

• Overall model

3. Do residual analysis

Inference for an Individual β Parameter

• Confidence Interval (rarely used in regression): β̂i ± t(α/2) s(β̂i), df = n − (k + 1)

• Hypothesis Test (used all the time!): H0: βi = 0 vs. Ha: βi ≠ 0 (or < or >)

• Test Statistic (how far is the sample slope from zero?):

t = (β̂i − 0) / s(β̂i),  df = n − (k + 1)

Easy way: Just examine p-values

Both coefficients significant! Reject H0 for both tests
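For the ad example, the t statistics can be computed directly from s² and (X′X)⁻¹. With df = 3, the two-sided .05 critical value is 3.182, and both slopes exceed it, matching the JMP conclusion (a NumPy sketch, not the JMP output itself):

```python
import numpy as np

y = np.array([1, 4, 1, 3, 2, 4], dtype=float)
X = np.column_stack([np.ones(6),
                     [1, 8, 3, 5, 6, 10],   # ad size x1
                     [2, 8, 1, 7, 4, 6]])   # circulation x2
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

n, k = 6, 2
s2 = np.sum((y - X @ beta) ** 2) / (n - (k + 1))   # mean squared error

# Standard errors: sqrt of the diagonal of s^2 * (X'X)^-1
se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
t = beta / se                    # t statistics, df = n - (k + 1) = 3

# Two-sided critical value t_.025 with 3 df is 3.182
print(np.abs(t[1:]) > 3.182)    # both slopes significant
```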

Testing Overall Significance

• Shows if there is a linear relationship between all x variables together and y

• Hypotheses

— H0: β1 = β2 = ... = βk = 0 (no linear relationship)

— Ha: At least one coefficient is not 0 (at least one x variable affects y)

• Test Statistic

F = MS(Model) / MS(Error)

• Degrees of Freedom: ν1 = k, ν2 = n − (k + 1)

— k = number of independent variables

— n = sample size

Testing Overall Significance: Computer Output

Analysis of Variance

Source    DF   Sum of Squares   Mean Square   F Value   Prob>F
Model      2       9.2497          4.6249      55.440   0.0043
Error      3       0.2503          0.0834
C Total    5       9.5000

Model DF = k; Error DF = n − (k + 1); F = MS(Model) / MS(Error); Prob>F is the p-value.

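The F statistic in the ANOVA table is just MS(Model) / MS(Error); a NumPy check for the ad example:

```python
import numpy as np

y = np.array([1, 4, 1, 3, 2, 4], dtype=float)
X = np.column_stack([np.ones(6),
                     [1, 8, 3, 5, 6, 10],   # ad size x1
                     [2, 8, 1, 7, 4, 6]])   # circulation x2
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

n, k = 6, 2
sse = np.sum((y - X @ beta) ** 2)          # Error SS  ~ 0.2503
ssm = np.sum((X @ beta - y.mean()) ** 2)   # Model SS  ~ 9.2497
F = (ssm / k) / (sse / (n - (k + 1)))      # MS(Model) / MS(Error)
print(round(F, 2))  # 55.44
```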

Types of Regression Models

• 1 Quantitative Variable: 1st-Order, 2nd-Order, or 3rd-Order Model

• 2 or More Quantitative Variables: 1st-Order, 2nd-Order, or Interaction Model

• 1 Qualitative Variable: Dummy-Variable Model

Interaction Model With 2 Independent Variables

• Hypothesizes interaction between pairs of x variables

— Response to one x variable varies at different levels of another x variable

• Model: E(y) = β0 + β1x1 + β2x2 + β3x1x2

• Contains two-way cross-product terms

• Can be combined with other models

— Example: dummy-variable model

Interaction Model Relationships

Effect (slope) of x1 on E(y) depends on the x2 value. For example, with

E(y) = 1 + 2x1 + 3x2 + 4x1x2:

at x2 = 1: E(y) = 1 + 2x1 + 3(1) + 4x1(1) = 4 + 6x1

at x2 = 0: E(y) = 1 + 2x1 + 3(0) + 4x1(0) = 1 + 2x1

[Figure: the two lines plotted over 0 ≤ x1 ≤ 1.5 have different slopes.]
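The changing slope can be verified numerically; a small sketch of the example function above:

```python
# E(y) = 1 + 2*x1 + 3*x2 + 4*x1*x2, so the slope in x1 is 2 + 4*x2
def expected_y(x1, x2):
    return 1 + 2 * x1 + 3 * x2 + 4 * x1 * x2

# Slope in x1 = change in E(y) when x1 increases by 1, x2 held fixed
slope_at_x2_0 = expected_y(1, 0) - expected_y(0, 0)
slope_at_x2_1 = expected_y(1, 1) - expected_y(0, 1)
print(slope_at_x2_0, slope_at_x2_1)  # 2 6
```

The slope is 2 when x2 = 0 and 6 when x2 = 1, exactly as the two lines above show.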

Interaction Example

You work in advertising for the New York Times. You want to find the effect of ad size (sq. in.), x1, and newspaper circulation (000), x2, on the number of ad responses (00), y. Conduct a test for interaction. Use α = .05.

Adding Interactions in JMP is Easy

1. Analyze >> Fit Model

2. Click on the response variable and click the Y button

3. Highlight the two X variables and click on the Add button

4. While the two X variables are highlighted, click on the Cross button

5. Run Model

You can also combine steps 3 and 4 into one step: Highlight the two X variables and, from the “Macros” pull-down menu, choose “Factorial to Degree.” The default degree is 2, so you will get all two-factor interactions in the model.

JMP Interaction Output

Interaction not important:

p-value > .05


Second-Order Model With 1 Independent Variable

• Relationship between 1 dependent and 1 independent variable is a quadratic function

• Useful first model if a non-linear relationship is suspected

• Model: E(y) = β0 + β1x + β2x²  (β1x is the linear effect; β2x² is the curvilinear effect)

Second-Order Model Relationships

[Figure: the curves open upward when β2 > 0 and downward when β2 < 0.]

Types of Regression Models

Linear (first order):  Ŷi = β̂0 + β̂1Xi

Quadratic (second order):  Ŷi = β̂0 + β̂1Xi + β̂2Xi²

Cubic (third order):  Ŷi = β̂0 + β̂1Xi + β̂2Xi² + β̂3Xi³

2nd Order Model Example

The data show the number of weeks employed and the number of errors made per day for a sample of assembly-line workers. Find a 2nd order model, conduct the global F-test, and test if β2 ≠ 0. Use α = .05 for all tests.

1. Analyze >> Fit Y by X

2. From the hot spot menu choose:

3. Fit Polynomial >> 2, quadratic

Could also use:

Analyze >> Fit Model, select Y, then highlight X and, from the “Macros” pull-down menu, choose “Polynomial to Degree.” The default degree is 2, so you will get the quadratic (2nd order) polynomial. But from Fit Model, you won’t get the cool fitted line plot.
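Outside JMP, a quadratic fit is one call to np.polyfit. The worker data are not reproduced on the slide, so the values below are hypothetical, generated exactly from a known quadratic so the fit is easy to check:

```python
import numpy as np

# Hypothetical (weeks, errors) values -- the actual worker data are not
# shown here. Errors are generated exactly from 20 - 3w + 0.125w^2.
weeks  = np.array([1, 2, 4, 6, 8, 10, 12], dtype=float)
errors = 20 - 3 * weeks + 0.125 * weeks**2

# Fit E(y) = b0 + b1*x + b2*x^2 (polyfit returns highest degree first)
b2, b1, b0 = np.polyfit(weeks, errors, deg=2)
print(round(b0, 3), round(b1, 3), round(b2, 3))  # 20.0 -3.0 0.125
```

Because the demo data lie exactly on the quadratic, the fit recovers the generating coefficients.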


Second-Order (Response Surface) Model With 2 Independent Variables

• Relationship between 1 dependent and 2 independent variables is a quadratic function

• Useful first model if a non-linear relationship is suspected

• Model: E(y) = β0 + β1x1 + β2x2 + β3x1x2 + β4x1² + β5x2²
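Constructing the full second-order design matrix by hand is straightforward (illustrative x values; JMP's Response Surface macro adds the same kinds of terms, after centering the predictors):

```python
import numpy as np

# Illustrative x values
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

# Columns for E(y) = b0 + b1*x1 + b2*x2 + b3*x1*x2 + b4*x1^2 + b5*x2^2
X = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2, x1**2, x2**2])
print(X.shape)  # (5, 6): 5 rows, 6 parameters (k + 1 = 6)
```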

Second-Order Model Relationships

For E(y) = β0 + β1x1 + β2x2 + β3x1x2 + β4x1² + β5x2²:

[Figure: the surface opens upward when β4 + β5 > 0, opens downward when β4 + β5 < 0, and is saddle-shaped when β3² > 4β4β5.]

From JMP:

To specify the model, all you need to do is:

1. Analyze >> Fit Model

2. Highlight the X variables

3. From the “Macros” pull-down menu, choose “Response Surface.”

4. The default degree is 2, so you will get the full second-order model having all squared terms and all cross-products.


Qualitative-Variable Model

• Involves a categorical x variable with 2 levels

— e.g., male/female; college/no college

• Variable levels coded 0 and 1

• Number of dummy variables is 1 less than the number of levels of the variable

• May be combined with a quantitative variable (1st order or 2nd order model)
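Zero-one coding is easy to do by hand; a NumPy sketch with hypothetical data (the variable names and values are illustrative, not the salary data from the slides):

```python
import numpy as np

# Hypothetical data: salary (in $000) vs. years and a 2-level qualitative
# variable (college / no college). Names and values are illustrative only.
years   = np.array([1, 2, 3, 4, 5, 6], dtype=float)
college = np.array(["no", "yes", "no", "yes", "no", "yes"])
salary  = np.array([30, 40, 34, 44, 38, 48], dtype=float)

# A 2-level variable needs one dummy: 1 = college, 0 = no college
d = (college == "yes").astype(float)

X = np.column_stack([np.ones(6), years, d])
beta, *_ = np.linalg.lstsq(X, salary, rcond=None)
# beta[2] is the estimated salary shift for college vs. no college
print(np.round(beta, 3))  # [28.  2.  8.]
```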

Qualitative Predictors in JMP

1. Analyze >> Fit Model

2. Specify a qualitative variable

3. JMP will automatically create the needed zero-one variables for you and run the regression! (All transparent: it does not save the zero-one variables in your data table.)

You need to do one thing (only once): Go to JMP >> Preferences. Click the “Platforms” icon on the left panel, then click on “Fit Least Squares.” Check the box on the right marked “Indicator Parameterization Estimates.” If you don’t do this your regression will still be correct, but JMP will use a different form of zero-one (dummy) variables.

First 32 (out of 100) rows of the salary data.

Now do Analyze >> Fit Model

Residual Analysis

• Graphical analysis of residuals

— Plot estimated errors (residuals) versus xi values: the difference between the actual yi and the predicted ŷi

— Plot histogram or stem-and-leaf of residuals

• Purposes

— Examine functional form (linear vs. non-linear model)

— Evaluate violations of assumptions

Residual Plot for Functional Form

[Figure: a curved residual pattern means add an x² term; a random scatter means correct specification.]

Residual Plot for Equal Variance

[Figure: a fan-shaped residual pattern means unequal variance; a random scatter means correct specification. Standardized residuals are typically used.]

Residual Plot for Independence

[Figure: a residual pattern that tracks the sequence in which the data were collected means the errors are not independent; a random scatter means correct specification.]

Residual Analysis Using JMP

1. Fit the full model and examine the “residual plot” of residuals (Y) vs. predicted values (Ŷ). This plot appears automatically. Look for outliers, curvature, or non-constant variance. Hope for a random, shotgun scatter: then everything is OK.

2. Save the residuals (red tab >> Save Columns >> residuals)

3. Analyze >> Distribution (of saved residuals) and obtain a normal quantile (probability) plot using the red tabs.

4. Use the red tab to obtain a goodness-of-fit test for normality of the residuals.

5. If the data were collected sequentially over time, obtain a plot of residuals vs. row number to see if there are any patterns related to time. This is one check of the “independence” of residuals assumption.

Residual Analysis via JMP

Step 1: From the regression of SalePrice on three predictors: so far so good

Step 2: Save residuals for more analysis

Steps 3 and 4: Normality OK (sort of, approximately anyway)

Step 5: Only needed if the data are in time order. Graph >> Overlay Plot (specify residuals for Y, no X needed). No pattern apparent.

Selecting Variables in Model Building

Model Building with Computer Searches

• Rule: Use as few x variables as possible

• Stepwise Regression

— Computer selects the x variable most highly correlated with y

— Continues to add or remove variables depending on SSE

• Best subset approach

— Computer examines all possible sets

Subset Selection

Simple models tend to work best:

1. Give best predictions

2. Simplest explanations of underlying phenomena

3. Avoid multicollinearity (redundant X variables)

Manual Stepwise Regression:

1. Start with the full model.

2. If all p-values < .05, stop. Otherwise, drop the variable that has the largest p-value.

3. Refit the model. Go to step 2.

Automatic Stepwise Regression:

Let the computer do it for you!

1. Stepwise regression. Backward stepwise automates the manual stepwise procedure.

2. Best subsets regression. Computes all possible models and summarizes each.
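The manual backward-stepwise procedure can be sketched in a few lines of Python. This version drops the weakest predictor by |t| rather than by p-value (equivalent within a single fit, since df is fixed), and the data are synthetic:

```python
import numpy as np

def backward_stepwise(predictors, y, t_crit):
    """Manual backward stepwise (sketch): fit, and while any predictor's
    |t| falls below t_crit, drop the weakest one and refit."""
    cols = dict(predictors)                  # name -> 1-D array
    while cols:
        names = list(cols)
        X = np.column_stack([np.ones(len(y))] + [cols[n] for n in names])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        s2 = np.sum((y - X @ beta) ** 2) / (len(y) - X.shape[1])
        se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
        t = np.abs(beta / se)[1:]            # skip the intercept
        if t.min() >= t_crit:                # every predictor significant
            return names
        cols.pop(names[int(t.argmin())])     # drop the weakest, refit
    return []

# Synthetic demo: y depends strongly on x1 only; x2 is pure noise
rng = np.random.default_rng(0)
x1 = rng.normal(size=40)
x2 = rng.normal(size=40)
y = 3 * x1 + rng.normal(scale=0.1, size=40)
kept = backward_stepwise({"x1": x1, "x2": x2}, y, t_crit=2.0)
```

With a strong signal on x1, the procedure retains x1 in the final model.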

JMP Stepwise Example: Car Leasing

To appropriately price new car leases, car dealers need to accurately predict the value of the cars at the conclusion of the leases. These resale values are generally determined at wholesale auctions. Data collected on 54 1997 new car models are listed on the next two pages.

(a) Use backward stepwise regression to find the best predictors of resale value, y.

(b) Use forward stepwise regression to find the best predictors of resale value. Does your answer agree with what you had already found in part (a)?

(c) Use all possible regressions to find the model that minimizes s (root mean square error). Does this agree with either part (a) or (b)?

(d) What would you choose for a final model and why?

Leasing Data

Y: Resale value in 2000
X1: 1997 price
X2: Price increase in model from 1997-2000
X3: Consumer Reports quality index
X4: Consumer Reports reliability index
X5: Number of model vehicles sold in 1997
X6: Yes if minor change made in model in 1998, 1999, or 2000; No if not
X7: Yes if major change made in model in 1998, 1999, or 2000; No if not

Backward Stepwise:

Analyze >> Fit Model and specify the model

Change “Personality” to Stepwise

Enter all model terms and press “Go”

(Change “Prob to Enter” to .1 or .05)

Forward Stepwise:

Analyze >> Fit Model and specify the model

Change “Personality” to Stepwise

Press “Go”

(Change “Prob to Enter” to .1 or .05)

All Possible Regressions Output:

1. Under Stepwise Fit, use the red hot spot to select “All Possible Models”

2. Request the best (1) model of each model size

The best model minimizes RMSE (that is, s), which is the same as maximizing adjusted R².

Or you could choose to minimize AICc (corrected Akaike Information Criterion).

Regression Pitfalls

• Parameter Estimability

— Number of different x-values must be at least one more than the order of the model

• Multicollinearity

— Two or more x-variables in the model are correlated

• Extrapolation

— Predicting y-values outside the sampled range

• Correlated Errors

Multicollinearity

• High correlation between x variables

• Coefficients measure combined effect

• Leads to unstable coefficients depending on x variables in model

• Always exists – matter of degree

• Example: using both age and height as explanatory variables in same model

Detecting Multicollinearity

• Significant correlations between pairs of x variables that are higher than their correlations with the y variable

• Non-significant t-tests for most of the individual parameters, but the overall model test is significant

• Estimated parameters have the wrong sign

• Always do a scatterplot matrix of your data before analysis: look for outliers and relationships between x variables (Graph >> Scatterplot Matrix)
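A quick way to spot a highly correlated pair before modeling is a correlation matrix; a NumPy sketch with synthetic predictors where x2 nearly duplicates x1 (like age and height):

```python
import numpy as np

# Synthetic predictors: x2 is nearly a copy of x1
rng = np.random.default_rng(1)
x1 = rng.normal(size=30)
x2 = x1 + rng.normal(scale=0.05, size=30)   # almost collinear with x1
x3 = rng.normal(size=30)                    # unrelated

corr = np.corrcoef(np.vstack([x1, x2, x3]))
# A pairwise correlation near 1 between two x variables flags trouble
print(corr[0, 1] > 0.9)  # True
```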

Any problems?

Outliers?

Collinearity?

What are the best predictors of Resale (y)?

Solutions to Multicollinearity

• Eliminate one or more of the correlated x variables

• Center predictors before computing polynomial terms (squares, cross-products); JMP does this automatically!

• Avoid inference on individual parameters

• Do not extrapolate

Extrapolation

[Figure: interpolation predicts y within the sampled range of x; extrapolation predicts y outside the sampled range.]

Conclusion

1. Explained the Linear Multiple Regression Model

2. Described Inference About Individual Parameters

3. Tested Overall Significance

4. Explained Estimation and Prediction

5. Described Various Types of Models

6. Described Model Building

7. Explained Residual Analysis

8. Described Regression Pitfalls
