Statistics for Business and Economics Chapter 11 Multiple Regression and Model Building

Upload: eden-dodson

Post on 01-Jan-2016

TRANSCRIPT

Page 1: Statistics for Business and Economics

Statistics for Business and Economics

Chapter 11 Multiple Regression and Model Building

Page 2: Statistics for Business and Economics

Learning Objectives

1. Explain the Linear Multiple Regression Model

2. Describe Inference About Individual Parameters

3. Test Overall Significance

4. Explain Estimation and Prediction

5. Describe Various Types of Models

6. Describe Model Building

7. Explain Residual Analysis

8. Describe Regression Pitfalls

Page 3: Statistics for Business and Economics

Types of Regression Models

[Diagram: regression models split into Simple (1 explanatory variable) and Multiple (2+ explanatory variables); each branch is either Linear or Non-Linear]

Page 4: Statistics for Business and Economics

Multiple Regression Model

• General form: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$

• k independent variables

• x1, x2, …, xk may be functions of variables

— e.g. $x_2 = (x_1)^2$

Page 5: Statistics for Business and Economics

Regression Modeling Steps

1. Hypothesize deterministic component

2. Estimate unknown model parameters

3. Specify probability distribution of random error term

• Estimate standard deviation of error

4. Evaluate model

5. Use model for prediction and estimation

Page 6: Statistics for Business and Economics

First–Order Multiple Regression Model

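Relationship between 1 dependent and 2 or more independent variables is a linear function

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$

• $y$: Dependent (response) variable

• $x_1, \ldots, x_k$: Independent (explanatory) variables

• $\beta_1, \ldots, \beta_k$: Population slopes

• $\beta_0$: Population Y-intercept

• $\varepsilon$: Random error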

Page 7: Statistics for Business and Economics

First-Order Model With 2 Independent Variables

• Relationship between 1 dependent and 2 independent variables is a linear function

• Model: $E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$

• Assumes no interaction between x1 and x2

— Effect of x1 on E(y) is the same regardless of x2 values

Page 8: Statistics for Business and Economics

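Population Multiple Regression Model

Bivariate model: $y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \varepsilon_i$

Response plane: $E(y) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i}$

[Figure: response plane over the (x1, x2) axes with intercept β0; the observed y at (x1i, x2i) lies a distance εi from the plane]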

Page 9: Statistics for Business and Economics

Sample Multiple Regression Model

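Bivariate model: $y_i = \hat\beta_0 + \hat\beta_1 x_{1i} + \hat\beta_2 x_{2i} + \hat\varepsilon_i$

Response plane: $\hat y_i = \hat\beta_0 + \hat\beta_1 x_{1i} + \hat\beta_2 x_{2i}$

[Figure: estimated response plane over the (x1, x2) axes with intercept β̂0; the observed y at (x1i, x2i) lies a distance ε̂i from the plane]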

Page 10: Statistics for Business and Economics
Page 11: Statistics for Business and Economics
Page 12: Statistics for Business and Economics

Regression Modeling Steps

1. Hypothesize Deterministic Component

2. Estimate Unknown Model Parameters

3. Specify Probability Distribution of Random Error Term

• Estimate Standard Deviation of Error

4. Evaluate Model

5. Use Model for Prediction & Estimation

Page 13: Statistics for Business and Economics

Multiple Linear Regression Equations

Too complicated by hand! Ouch!

Page 14: Statistics for Business and Economics

1st Order Model Example

You work in advertising for the New York Times. You want to find the effect of ad size (sq. in.) and newspaper circulation (000) on the number of ad responses (00). Estimate the unknown parameters.

You’ve collected the following data:

Resp (y)   Size (x1)   Circ (x2)
1          1           2
4          8           8
1          3           1
3          5           7
2          6           4
4          10          6

See ResponsesVsAdsizeAndCirculationData.jmp

Page 15: Statistics for Business and Economics

$\hat y = .0640 + .2049 x_1 + .2805 x_2$

($\hat\beta_0 = .0640$, $\hat\beta_1 = .2049$, $\hat\beta_2 = .2805$)
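The slides leave the arithmetic to JMP. As a rough cross-check, the estimates can be reproduced by solving the least-squares normal equations directly. The sketch below is not the JMP procedure; the helper names `solve` and `fit_ols` are made up for illustration, and only the six data rows come from the example.

```python
def solve(a, b):
    """Solve a*x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(xs, y):
    """Least-squares coefficients [b0, b1, ..., bk] for rows of predictors xs."""
    X = [[1.0] + list(row) for row in xs]            # prepend intercept column
    p = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    return solve(xtx, xty)                            # (X'X) b = X'y


size = [1, 8, 3, 5, 6, 10]   # x1, ad size (sq. in.)
circ = [2, 8, 1, 7, 4, 6]    # x2, circulation (000)
resp = [1, 4, 1, 3, 2, 4]    # y, ad responses (00)

b0, b1, b2 = fit_ols(list(zip(size, circ)), resp)
print(f"y-hat = {b0:.4f} + {b1:.4f} x1 + {b2:.4f} x2")
```

Solving the normal equations this way is fine for a six-row teaching example; statistical software uses numerically safer decompositions.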

Page 16: Statistics for Business and Economics

Interpretation of Coefficients Solution

1. Slope ($\hat\beta_1$)

• Number of responses to ad is expected to increase by .2049 (20.49) for each 1 sq. in. increase in ad size holding circulation constant

2. Slope ($\hat\beta_2$)

• Number of responses to ad is expected to increase by .2805 (28.05) for each 1 unit (1,000) increase in circulation holding ad size constant

Page 17: Statistics for Business and Economics

Regression Modeling Steps

1. Hypothesize Deterministic Component

2. Estimate Unknown Model Parameters

3. Specify Probability Distribution of Random Error Term

• Estimate Standard Deviation of Error

4. Evaluate Model

5. Use Model for Prediction & Estimation

Page 18: Statistics for Business and Economics

Estimation of σ2

$s^2 = \dfrac{SSE}{n - (k + 1)}$ where $SSE = \sum (y_i - \hat y_i)^2$

$s = \sqrt{s^2} = \sqrt{\dfrac{SSE}{n - (k + 1)}}$

For a model with k predictors (k+1 parameters)

Page 19: Statistics for Business and Economics

More About JMP Output

[JMP output with SSE, s², and s marked]

$s^2 = \dfrac{.250264}{6 - 3} = .083421$ (also called “mean squared error”)

$s = \sqrt{.083421} = .2888$ (also called “standard error of the regression”)
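The same numbers can be recovered from the data and the fitted equation. This is a minimal sketch that plugs in the rounded coefficients from the example, so the last digits may differ slightly from JMP's unrounded values.

```python
import math

size = [1, 8, 3, 5, 6, 10]   # x1, ad size
circ = [2, 8, 1, 7, 4, 6]    # x2, circulation
resp = [1, 4, 1, 3, 2, 4]    # y, responses

b0, b1, b2 = 0.0640, 0.2049, 0.2805   # rounded estimates from the fitted equation

fitted = [b0 + b1 * x1 + b2 * x2 for x1, x2 in zip(size, circ)]
sse = sum((y - f) ** 2 for y, f in zip(resp, fitted))

n, k = len(resp), 2           # 6 observations, 2 predictors
s2 = sse / (n - (k + 1))      # mean squared error
s = math.sqrt(s2)             # standard error of the regression

print(f"SSE = {sse:.4f}, s^2 = {s2:.4f}, s = {s:.4f}")
```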

Page 20: Statistics for Business and Economics

Regression Modeling Steps

1. Hypothesize Deterministic Component

2. Estimate Unknown Model Parameters

3. Specify Probability Distribution of Random Error Term

• Estimate Standard Deviation of Error

4. Evaluate Model

5. Use Model for Prediction & Estimation

Page 21: Statistics for Business and Economics

Evaluating Multiple Regression Model Steps

1. Examine variation measures

2. Test parameter significance

• Individual coefficients

• Overall model

3. Do residual analysis

Page 22: Statistics for Business and Economics

Inference for an Individual β Parameter

• Confidence Interval (rarely used in regression)

$\hat\beta_i \pm t_{\alpha/2}\, s_{\hat\beta_i}$, df = n – (k + 1)

• Hypothesis Test (used all the time!)

H0: βi = 0

Ha: βi ≠ 0 (or < or >)

• Test Statistic (how far is the sample slope from zero?)

$t = \dfrac{\hat\beta_i - 0}{s_{\hat\beta_i}}$

Page 23: Statistics for Business and Economics

Easy way: Just examine p-values

Both coefficients significant!

Reject H0 for both tests
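For readers who want to see where those p-values come from, the t statistics for the two slopes in the ad example can be sketched as follows. The standard-error formula used here is a shortcut that holds only for a model with exactly two predictors, the helper `centered_ss` is a made-up name, and the critical value 3.182 is the tabled t value for α/2 = .025 with 3 degrees of freedom.

```python
import math

size = [1, 8, 3, 5, 6, 10]   # x1, ad size
circ = [2, 8, 1, 7, 4, 6]    # x2, circulation
b1, b2 = 0.2049, 0.2805      # slope estimates from the fitted equation
s2 = 0.083421                # mean squared error from the example
n, k = 6, 2


def centered_ss(u, v):
    """Sum of cross-products of u and v about their means."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))


s11 = centered_ss(size, size)
s22 = centered_ss(circ, circ)
s12 = centered_ss(size, circ)
det = s11 * s22 - s12 ** 2

# Two-predictor shortcut for the standard errors of the slopes
se_b1 = math.sqrt(s2 * s22 / det)
se_b2 = math.sqrt(s2 * s11 / det)

t1, t2 = b1 / se_b1, b2 / se_b2   # test statistics for H0: beta_i = 0
t_crit = 3.182                    # t(.025) with n - (k + 1) = 3 df

print(f"t1 = {t1:.2f}, t2 = {t2:.2f}, reject H0 when |t| > {t_crit}")
```

Both statistics exceed the critical value, which agrees with the conclusion on the slide.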

Page 24: Statistics for Business and Economics

Testing Overall Significance

• Shows if there is a linear relationship between all x variables together and y

• Hypotheses

— H0: β1 = β2 = ... = βk = 0 (No linear relationship)

— Ha: At least one coefficient is not 0 (At least one x variable affects y)

Page 25: Statistics for Business and Economics

Testing Overall Significance

• Test Statistic

$F = \dfrac{MS(\text{Model})}{MS(\text{Error})}$

• Degrees of Freedom: ν1 = k, ν2 = n – (k + 1)

— k = Number of independent variables

— n = Sample size

Page 26: Statistics for Business and Economics

Testing Overall Significance: Computer Output

Analysis of Variance

Source     DF   Sum of Squares   Mean Square   F Value   Prob>F
Model       2   9.2497           4.6249        55.440    0.0043
Error       3   0.2503           0.0834
C Total     5   9.5000

Model DF = k; Error DF = n – (k + 1); the Mean Square column holds MS(Model) and MS(Error); Prob>F is the P-value.
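The F value and P-value in the output can be checked from the sums of squares alone. The sketch below uses the rounded table entries, so F comes out a little different from 55.440 in the second decimal. The closed-form tail probability is a special case that is valid only when the numerator degrees of freedom equal 2, as they do here; it is not a general F-distribution routine.

```python
ss_model, ss_error = 9.2497, 0.2503   # sums of squares from the ANOVA table
n, k = 6, 2

df_model = k
df_error = n - (k + 1)

ms_model = ss_model / df_model
ms_error = ss_error / df_error
f_stat = ms_model / ms_error

# Upper-tail probability of F(2, df_error); this formula requires df_model == 2
p_value = (1 + 2 * f_stat / df_error) ** (-df_error / 2)

print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```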

Page 27: Statistics for Business and Economics

Testing Overall Significance: Computer Output

[JMP output with k, n – (k + 1), MS(Model), MS(Error), and the P-value marked]

Page 28: Statistics for Business and Economics

Types of Regression Models

[Diagram: models by type of explanatory variable. 1 quantitative variable: 1st-order, 2nd-order, and 3rd-order models. 2 or more quantitative variables: 1st-order, interaction, and 2nd-order models. 1 qualitative variable: dummy-variable model.]

Page 29: Statistics for Business and Economics

Interaction Model With 2 Independent Variables

• Hypothesizes interaction between pairs of x variables

— Response to one x variable varies at different levels of another x variable

• Contains two-way cross product terms

$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2$

• Can be combined with other models

— Example: dummy-variable model

Page 30: Statistics for Business and Economics

Interaction Model Relationships

Effect (slope) of x1 on E(y) depends on x2 value

E(y) = 1 + 2x1 + 3x2 + 4x1x2

x2 = 1: E(y) = 1 + 2x1 + 3(1) + 4x1(1) = 4 + 6x1

x2 = 0: E(y) = 1 + 2x1 + 3(0) + 4x1(0) = 1 + 2x1

[Figure: the two lines plotted as E(y) versus x1, for x1 from 0 to 1.5]
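A small sketch of the same point: with the interaction term in the model, the change in E(y) per unit of x1 is 2 + 4x2, so it differs at each level of x2. The function names are illustrative only.

```python
def expected_y(x1, x2):
    """E(y) = 1 + 2*x1 + 3*x2 + 4*x1*x2, the model from the slide."""
    return 1 + 2 * x1 + 3 * x2 + 4 * x1 * x2


def slope_in_x1(x2):
    """Change in E(y) for a one-unit increase in x1 at a fixed x2."""
    return expected_y(1, x2) - expected_y(0, x2)


for x2 in (0, 1):
    print(f"x2 = {x2}: intercept {expected_y(0, x2)}, slope {slope_in_x1(x2)}")
```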

Page 31: Statistics for Business and Economics

Interaction Example

You work in advertising for the New York Times. You want to find the effect of ad size (sq. in.), x1, and newspaper circulation (000), x2, on the number of ad responses (00), y. Conduct a test for interaction. Use α = .05.

Page 32: Statistics for Business and Economics

Adding Interactions in JMP is Easy

1. Analyze >> Fit Model

2. Click on the response variable and click the Y button

3. Highlight the two X variables and click on the Add button

4. While the two X variables are highlighted, click on the Cross button

5. Run Model

You can also combine steps 3 and 4 into one step: Highlight the two X variables and, from the “Macros” pull down menu, choose “Factorial to Degree.” The default for degree is 2, so you will get all two-factor interactions in the model.

Page 33: Statistics for Business and Economics

JMP Interaction Output

Interaction not important:

p-value > .05

Page 34: Statistics for Business and Economics
Page 35: Statistics for Business and Economics

Types of Regression Models

[Diagram: models by type of explanatory variable. 1 quantitative variable: 1st-order, 2nd-order, and 3rd-order models. 2 or more quantitative variables: 1st-order, interaction, and 2nd-order models. 1 qualitative variable: dummy-variable model.]

Page 36: Statistics for Business and Economics

Second-Order Model With 1 Independent Variable

• Relationship between 1 dependent and 1 independent variable is a quadratic function

• Useful 1st model if non-linear relationship suspected

• Model: $E(y) = \beta_0 + \beta_1 x + \beta_2 x^2$

— Linear effect: $\beta_1 x$

— Curvilinear effect: $\beta_2 x^2$

Page 37: Statistics for Business and Economics

Second-Order Model Relationships

[Figure: four panels of y versus x1, two labeled β2 > 0 and two labeled β2 < 0]

Page 38: Statistics for Business and Economics

Types of Regression Models

Linear (First order): $\hat Y_i = \hat\beta_0 + \hat\beta_1 X_i$

Quadratic (Second order): $\hat Y_i = \hat\beta_0 + \hat\beta_1 X_i + \hat\beta_2 X_i^2$

Cubic (Third order): $\hat Y_i = \hat\beta_0 + \hat\beta_1 X_i + \hat\beta_2 X_i^2 + \hat\beta_3 X_i^3$

Page 39: Statistics for Business and Economics

2nd Order Model Example

The data shows the number of weeks employed and the number of errors made per day for a sample of assembly line workers. Find a 2nd order model, conduct the global F–test, and test if β2 ≠ 0. Use α = .05 for all tests.

Page 40: Statistics for Business and Economics

1. Analyze >> Fit Y by X

2. From hot spot menu choose:

3. Fit Polynomial >> 2, quadratic

Could also use:

Analyze >> Fit Model, select Y, then highlight X and, from the “Macros” pull down menu, choose “Polynomial to Degree.” The default for degree is 2, so you will get the quadratic (2nd order) polynomial. But from Fit Model, you won’t get the cool fitted line plot.

Page 41: Statistics for Business and Economics

Types of Regression Models

[Diagram: models by type of explanatory variable. 1 quantitative variable: 1st-order, 2nd-order, and 3rd-order models. 2 or more quantitative variables: 1st-order, interaction, and 2nd-order models. 1 qualitative variable: dummy-variable model.]

Page 42: Statistics for Business and Economics

Second-Order (Response Surface) Model With 2 Independent Variables

• Relationship between 1 dependent and 2 independent variables is a quadratic function

• Useful 1st model if non-linear relationship suspected

• Model

$E(y) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{1i} x_{2i} + \beta_4 x_{1i}^2 + \beta_5 x_{2i}^2$

Page 43: Statistics for Business and Economics

Second-Order Model Relationships

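$E(y) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{1i} x_{2i} + \beta_4 x_{1i}^2 + \beta_5 x_{2i}^2$

[Figure: three response surfaces of y over (x1, x2), labeled β4 + β5 > 0, β4 + β5 < 0, and β3² > 4β4β5]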

Page 44: Statistics for Business and Economics

From JMP:

To specify the model, all you need to do is:

1. Analyze >> Fit Model

2. Highlight the X variables

3. From the “Macros” pull down menu, choose “Response Surface.”

4. The default for degree is 2, so you will get the full second-order model having all squared terms and all cross products.

Page 45: Statistics for Business and Economics
Page 46: Statistics for Business and Economics

Types of Regression Models

[Diagram: models by type of explanatory variable. 1 quantitative variable: 1st-order, 2nd-order, and 3rd-order models. 2 or more quantitative variables: 1st-order, interaction, and 2nd-order models. 1 qualitative variable: qualitative-variable model.]

Page 47: Statistics for Business and Economics

Qualitative-Variable Model

• Involves categorical x variable with 2 levels

— e.g., male-female; college-no college

• Variable levels coded 0 and 1

• Number of dummy variables is 1 less than number of levels of variable

• May be combined with quantitative variable (1st order or 2nd order model)
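The 0/1 coding rule above can be sketched in a few lines. The function `dummy_code` and the five sample values are hypothetical, made up only to show that a variable with L levels yields L - 1 indicator columns once a baseline level is left out.

```python
def dummy_code(values, baseline):
    """Return {level: 0/1 column} for every level except the baseline."""
    levels = sorted(set(values))
    return {
        level: [1 if v == level else 0 for v in values]
        for level in levels
        if level != baseline
    }


# Hypothetical observations of a two-level qualitative variable
education = ["college", "no college", "college", "college", "no college"]

dummies = dummy_code(education, baseline="no college")
print(dummies)   # one indicator column, since 2 levels - 1 = 1
```

The baseline level is the one coded 0 on every indicator; its mean response is absorbed into the intercept.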

Page 48: Statistics for Business and Economics
Page 49: Statistics for Business and Economics
Page 50: Statistics for Business and Economics
Page 51: Statistics for Business and Economics
Page 52: Statistics for Business and Economics
Page 53: Statistics for Business and Economics

Qualitative Predictors in JMP

1. Analyze >> Fit Model

2. Specify a qualitative variable

3. JMP will automatically create the needed zero-one variables for you, and run the regression! (All transparent---it does not save the zero-one variables in your data table.)

You need to do one thing (only once): Go to JMP >> Preferences. Now click the “Platforms” icon on the left panel, and click on “Fit Least Squares.” Now check the box on the right marked “Indicator Parameterization Estimates.” If you don’t do this your regression will still be correct, but JMP will use a different form of zero-one (dummy) variables.

Page 54: Statistics for Business and Economics

First 32 (out of 100) rows of the salary data.

Now do Analyze >> Fit Model

Page 55: Statistics for Business and Economics
Page 56: Statistics for Business and Economics

Residual Analysis

Page 57: Statistics for Business and Economics

Residual Analysis

• Graphical analysis of residuals

— Plot estimated errors versus xi values

Estimated errors are called residuals: the difference between actual yi and predicted yi

— Plot histogram or stem-&-leaf of residuals

• Purposes

— Examine functional form (linear v. non-linear model)

— Evaluate violations of assumptions
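As an illustration of what gets plotted, the residuals for the ad-response example can be paired with an x variable like this. It is a sketch that reuses the rounded coefficients from the fitted equation; the plotting step itself is left out.

```python
size = [1, 8, 3, 5, 6, 10]   # x1, ad size
circ = [2, 8, 1, 7, 4, 6]    # x2, circulation
resp = [1, 4, 1, 3, 2, 4]    # y, responses

b0, b1, b2 = 0.0640, 0.2049, 0.2805   # rounded estimates from the fitted equation

# Residual = actual y minus predicted y
residuals = [y - (b0 + b1 * x1 + b2 * x2)
             for y, x1, x2 in zip(resp, size, circ)]

# Pairs to plot: residual against ad size, ordered by ad size
points = sorted(zip(size, residuals))
for x1, e in points:
    print(f"x1 = {x1:2d}  residual = {e:+.4f}")
```

With an intercept in the model, least-squares residuals sum to zero; here the sum is only approximately zero because the coefficients are rounded.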

Page 58: Statistics for Business and Economics

Residual Plot for Functional Form

[Figure: two plots of residuals e versus x. Left panel: Add x² Term. Right panel: Correct Specification.]

Page 59: Statistics for Business and Economics

Residual Plot for Equal Variance

[Figure: two plots of residuals e versus x. Left panel: Unequal Variance (fan-shaped). Right panel: Correct Specification.]

Standardized residuals used typically.

Page 60: Statistics for Business and Economics

Residual Plot for Independence

[Figure: two plots of residuals e versus x. Left panel: Not Independent. Right panel: Correct Specification.]

Plots reflect sequence data were collected.

Page 61: Statistics for Business and Economics

Residual Analysis Using JMP

1. Fit full model, and examine “residual plot” of residuals (Y) vs. predicted values (Yhat). This plot automatically appears. Look for outliers, or curvature, or non-constant variance. Hope for a random, shotgun scatter---everything is OK then.

2. Save the residuals (red tab >> Save Columns >> residuals)

3. Analyze >> Distribution (of saved residuals) and obtain a normal quantile (probability plot) using the red tabs.

4. Use the red tab to obtain a goodness of fit test for normality of the residuals.

5. If the data were collected sequentially over time, obtain a plot of residuals vs. row number to see if there are any patterns related to time. This is one check of the “independence” of residuals assumption.

Page 62: Statistics for Business and Economics

Residual Analysis via JMP

Step 1: From regression of SalePrice on three predictors---so far so good

Page 63: Statistics for Business and Economics

Residual Analysis via JMP

Step 2: Save residuals for more analysis

Steps 3 and 4:

(Normality OK, sort of, approximately anyway)

Step 5: Only needed if data in time order. Graph >> Overlay Plot (specify residuals for Y, no X needed!---see next page.)

Page 64: Statistics for Business and Economics

Residual Analysis via JMP

Step 2: Save residuals for more analysis

Steps 3 and 4:

(Normality OK, sort of, approximately anyway)

Step 5: Only needed if data in time order. Graph >> Overlay Plot (no pattern apparent)

Page 65: Statistics for Business and Economics

Selecting Variables in Model Building

Page 66: Statistics for Business and Economics

Model Building with Computer Searches

• Rule: Use as few x variables as possible

• Stepwise Regression

— Computer selects x variable most highly correlated with y

— Continues to add or remove variables depending on SSE

• Best subset approach

— Computer examines all possible sets

Page 67: Statistics for Business and Economics

Subset Selection

Simple models tend to work best

1. Give best predictions

2. Simplest explanations of underlying phenomena

3. Avoids multicollinearity (redundant X variables)

Page 68: Statistics for Business and Economics

Manual Stepwise Regression:

1. Start with full model.

2. If all p-values < .05, stop. Otherwise, drop the variable that has the largest p value.

3. Refit the model. Go to step 2.

Page 69: Statistics for Business and Economics

Automatic Stepwise Regression:

Let the computer do it for you!

1. Stepwise Regression. Backward stepwise automates the manual stepwise procedure

2. Best subsets regression. Computes all possible models and summarizes each.

Page 70: Statistics for Business and Economics

JMP Stepwise Example: Car Leasing

To appropriately price new car leases, car dealers need to accurately predict the value of the cars at the conclusion of the leases. These resale values are generally determined at wholesale auctions. Data collected on 54 1997 new car models are listed on the next two pages.

(a) Use backward stepwise regression to find the best predictors of resale value, y.

(b) Use forward stepwise regression to find the best predictors of resale value. Does your answer agree with what you had already found in part (a)?

(c) Use all possible regressions to find the model that minimizes s (root mean square error). Does this agree with either part (a) or (b)?

(d) What would you choose for a final model and why?

Page 71: Statistics for Business and Economics

Leasing Data

Y: Resale value in 2000

X1: 1997 Price

X2: Price increase in model from 1997-2000

X3: Consumer Reports quality index

X4: Consumer Reports reliability index

X5: Number of model vehicles sold in 1997

X6: = Yes, if minor change made in model in 1998, 1999, or 2000; = No, if not

X7: = Yes, if major change made in model in 1998, 1999, or 2000; = No, if not

Page 72: Statistics for Business and Economics
Page 73: Statistics for Business and Economics

Backward Stepwise:

Analyze >>Fit Model and specify model

Change “Personality” to Stepwise

Enter all model terms and press “Go”

(Change “Prob to Enter” to .1 or .05)

Forward Stepwise:

Analyze >>Fit Model and specify model

Change “Personality” to Stepwise

Press “Go”

(Change “Prob to Enter” to .1 or .05)

Page 74: Statistics for Business and Economics

All Possible Regressions Output:

1. Under Stepwise Fit, use red hot spot to select “All Possible Models”

2. Requested the best (1) model of each model size

Best model minimizes RMSE (that is, s---same as maximizing adjusted R2)

Or you could choose to minimize AICc (corrected Akaike Information Criterion)
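Why minimizing s is the same as maximizing adjusted R² follows from the usual definition of adjusted R². The symbol $SS_{yy}$ (the total sum of squares, labeled “C Total” in the ANOVA output) is notation assumed here; it does not appear on the slides.

```latex
R^2_{adj} \;=\; 1 - \frac{SSE / [\,n - (k + 1)\,]}{SS_{yy} / (n - 1)}
          \;=\; 1 - \frac{s^2}{SS_{yy} / (n - 1)}
```

The denominator $SS_{yy}/(n-1)$ depends only on the y data, not on which x variables are in the model, so the model with the smallest s has the largest adjusted R². For the ad example, $1 - .083421 / (9.5/5) \approx .956$.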

Page 75: Statistics for Business and Economics

Regression Pitfalls

• Parameter Estimability

— Number of different x–values must be at least one more than order of model

• Multicollinearity

— Two or more x–variables in the model are correlated

• Extrapolation

— Predicting y–values outside sampled range

• Correlated Errors

Page 76: Statistics for Business and Economics

Multicollinearity

• High correlation between x variables

• Coefficients measure combined effect

• Leads to unstable coefficients depending on x variables in model

• Always exists – matter of degree

• Example: using both age and height as explanatory variables in same model

Page 77: Statistics for Business and Economics

Detecting Multicollinearity

• Correlations between pairs of x variables are stronger than their correlations with the y variable

• Non–significant t–tests for most of the individual parameters, but overall model test is significant

• Estimated parameters have wrong sign

• Always do a scatterplot matrix of your data before analysis---look for outliers and relationships between x variables (Graph >> Scatterplot Matrix)

Page 78: Statistics for Business and Economics

Any problems?

Outliers?

Collinearity?

What are the best predictors of Resale (y)?

Page 79: Statistics for Business and Economics

Solutions to Multicollinearity

• Eliminate one or more of the correlated x variables

• Center predictors before computing polynomial terms (squares, crossproducts) (JMP does this automatically!)

• Avoid inference on individual parameters

• Do not extrapolate
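To see why centering helps with polynomial terms, compare the correlation between x and x² before and after centering. The sketch below uses the ad-size values from the earlier example purely as convenient numbers; `corr` is a hand-written Pearson correlation, not a JMP function.

```python
import math


def corr(u, v):
    """Pearson correlation between two equal-length lists."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


size = [1, 8, 3, 5, 6, 10]                       # ad size values
mean = sum(size) / len(size)
centered = [x - mean for x in size]

r_raw = corr(size, [x ** 2 for x in size])            # x versus x squared
r_centered = corr(centered, [d ** 2 for d in centered])  # after centering

print(f"raw: {r_raw:.3f}, centered: {r_centered:.3f}")
```

For these particular values the centered correlation is essentially zero because the sample happens to be symmetric about its mean; in general, centering reduces the correlation without eliminating it.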

Page 80: Statistics for Business and Economics

Extrapolation

[Figure: y versus x. Interpolation lies within the sampled range of x; extrapolation lies on either side of it.]
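A simple guard against extrapolation is to check each new x value against the range observed in the sample before predicting. The function name `in_sampled_range` is hypothetical, and the data are the ad-size and circulation values from the earlier example.

```python
def in_sampled_range(point, sample_columns):
    """True when every coordinate lies within the observed min..max of its x."""
    return all(min(col) <= v <= max(col)
               for v, col in zip(point, sample_columns))


size = [1, 8, 3, 5, 6, 10]   # x1, sampled ad sizes
circ = [2, 8, 1, 7, 4, 6]    # x2, sampled circulations

inside = in_sampled_range((4, 5), (size, circ))    # both values were covered
outside = in_sampled_range((12, 5), (size, circ))  # ad size 12 exceeds max of 10

print(inside, outside)
```

This per-variable check is only a first screen: a combination of x values can sit inside each individual range and still be far from any observed combination.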

Page 81: Statistics for Business and Economics

Conclusion

1. Explained the Linear Multiple Regression Model

2. Described Inference About Individual Parameters

3. Tested Overall Significance

4. Explained Estimation and Prediction

5. Described Various Types of Models

6. Described Model Building

7. Explained Residual Analysis

8. Described Regression Pitfalls