data mining regression

Multiple Linear Regression

Dr. Sunil D. Lakdawala

MLR Application Situation

• Predicted variable is a continuous variable

• Predictors are continuous, binary or ordinal class variables

MLR Theoretical Foundation - 1

This procedure performs linear regression on the selected dataset. It fits a linear model of the form

Y = b0 + b1X1 + b2X2 + ... + bkXk + e

where Y is the dependent variable (response), X1, X2, ..., Xk are the independent variables (predictors), and e is random error. b0, b1, b2, ..., bk are known as the regression coefficients, which have to be estimated from the data.
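As a minimal sketch of how the coefficients b0, b1, ..., bk can be estimated from data, the following uses ordinary least squares via NumPy. The function names `fit_mlr` and `predict` and the small synthetic dataset are assumptions made for illustration; they are not part of XLMiner.

```python
import numpy as np

def fit_mlr(X, y):
    """Estimate b0..bk for Y = b0 + b1*X1 + ... + bk*Xk + e by least squares."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so that the intercept b0 is estimated as well.
    design = np.column_stack([np.ones(len(X)), X])
    coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coeffs

def predict(coeffs, X):
    """Return predicted Y values for the rows of X."""
    X = np.asarray(X, dtype=float)
    return coeffs[0] + X @ coeffs[1:]

# Synthetic, noise-free data generated from Y = 2 + 3*X1 - 1*X2 (illustrative only).
X_demo = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0], [5.0, 8.0]])
y_demo = 2.0 + 3.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1]

b = fit_mlr(X_demo, y_demo)
print(np.round(b, 6))  # estimated b0, b1, b2
```

Because the synthetic data contain no random error, the estimated coefficients should recover the generating values; with real data the estimates only approximate the underlying relationship.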

Regression Model

[Figure: chart of Sales (vertical axis, 0 to 1200) against Advertising Expenditure (horizontal axis, 0 to 80), with a fitted linear trend line, Linear (Sales)]

Validity of MLR - Intuitive

The multiple linear regression algorithm in XLMiner™ chooses the regression coefficients so as to minimize the sum of squared differences between predicted values and actual values (least squares).

[Figure: Predicted and Actual Values, chart of Predicted Value and Actual Value by record number]

[Figure: Residual, chart of Residual by record number]

Validation Data

Predicted Value   Actual Value   Residual
30.19297657       34.7            4.507023434
18.75084324       15             -3.75084324
19.83233146       20.4            0.567668545
19.54158638       18.2           -1.34158638
19.6134897        19.9            0.286510301
16.45707912       20.2            3.742920884
18.76402488       18.2           -0.56402488
15.45156408       15.2           -0.25156408
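The residual column in the validation data above is the actual value minus the predicted value. A short sketch that recomputes it from the eight rows shown (the variable names are illustrative):

```python
# (predicted value, actual value) pairs copied from the validation rows above.
rows = [
    (30.19297657, 34.7),
    (18.75084324, 15.0),
    (19.83233146, 20.4),
    (19.54158638, 18.2),
    (19.6134897, 19.9),
    (16.45707912, 20.2),
    (18.76402488, 18.2),
    (15.45156408, 15.2),
]

# Residual = actual value - predicted value.
residuals = [actual - predicted for predicted, actual in rows]

for (predicted, actual), residual in zip(rows, residuals):
    print(f"{predicted:12.6f} {actual:6.1f} {residual:10.6f}")
```

A positive residual means the model under-predicted that record; a negative residual means it over-predicted.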

Validity of MLR - Measured

Validation Data Prediction Error Report (these errors should be at a minimum)

Total sum of squared errors   RMS Error     Average Error
3166.458039                   4.564204288   -0.09095719

Regression (Multiple R-squared should be on the higher side, towards 1.0)

Residual df          239
Multiple R-squared   0.7101965
Std. Dev. Estimate   5.11450529
Residual SS          6251.80127
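The measures in these reports can be sketched as follows. The helper `error_report` is a hypothetical name, and taking the error as actual minus predicted is an assumption that matches the residual column shown earlier. The second part only checks that the reported regression figures are mutually consistent, using the Total SS (21572.55157) from the ANOVA table in this deck.

```python
import math

def error_report(actual, predicted):
    """Total sum of squared errors, RMS error and average error."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    sse = sum(r * r for r in residuals)
    n = len(residuals)
    return {
        "total_sse": sse,
        "rms_error": math.sqrt(sse / n),
        "average_error": sum(residuals) / n,
    }

# Tiny illustrative input (not from the slides).
demo = error_report(actual=[3.0, 5.0], predicted=[2.0, 7.0])
print(demo)

# Consistency check of the reported regression figures.
residual_ss = 6251.80127
residual_df = 239
total_ss = 21572.55157  # Total SS from the ANOVA table

std_dev_estimate = math.sqrt(residual_ss / residual_df)
r_squared = 1.0 - residual_ss / total_ss
print(std_dev_estimate, r_squared)
```

The recomputed values agree with the reported Std. Dev. Estimate and Multiple R-squared to the precision shown on the slide.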

Validity of MLR – Test of Significance 1

• Is the result significant, as reflected by the p-value being < .05?

• If yes, the null hypothesis (H0: β1 = 0 and β2 = 0 ...) is rejected at the 5% significance level (95% confidence)

• That is, there is at least one predictor whose coefficient is not 0

ANOVA (is the p-value lower than .05?)

Source       df    SS            MS            p-value
Regression   13    15320.7503    1178.519254   8.75607E-57
Error        239   6251.80127    26.15816431
Total        252   21572.55157
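The ANOVA figures are related arithmetically: each mean square (MS) is SS divided by df, and the F statistic behind the p-value is the ratio of the regression MS to the error MS. A small sketch using the values above (variable names are illustrative):

```python
# (df, SS) for each source, copied from the ANOVA table.
anova = {
    "Regression": (13, 15320.7503),
    "Error": (239, 6251.80127),
}

# Mean square = sum of squares / degrees of freedom.
mean_squares = {source: ss / df for source, (df, ss) in anova.items()}

# F statistic for H0: all slope coefficients are 0.
f_stat = mean_squares["Regression"] / mean_squares["Error"]

# The Total row is the sum of the Regression and Error rows.
total_df = sum(df for df, _ in anova.values())
total_ss = sum(ss for _, ss in anova.values())

print(mean_squares, f_stat, total_df, total_ss)
```

The p-value is the upper-tail probability of this F statistic under an F distribution with 13 and 239 degrees of freedom; assuming SciPy is available, one way to obtain it would be `scipy.stats.f.sf(f_stat, 13, 239)`.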

Validity of MLR – Test of Significance 2

• Which variables are valid predictors, i.e. which have a p-value less than .05, so that the null hypothesis (H0: βk = 0) is rejected for that variable?

Predictor (Indep. Var.)   p-value
Constant                  0
ZN                        0.00100423
INDUS                     0.84421581
CHAS                      0.18414988
NOX                       0.00680984
RM                        0.00000333
AGE                       0.55937093
DIS                       0.00000109
RAD                       0.00013665
CRIM                      0.00235436
TAX                       0.00271487
PTRATIO                   0.00000157
B                         0.03476212
LSTAT                     0
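Applying the .05 threshold to the p-values above can be sketched as a simple filter. The constant term is left out because it is not a predictor; the names `ALPHA`, `significant` and `not_significant` are illustrative.

```python
# p-values of the predictors, copied from the table above (constant omitted).
p_values = {
    "ZN": 0.00100423,
    "INDUS": 0.84421581,
    "CHAS": 0.18414988,
    "NOX": 0.00680984,
    "RM": 0.00000333,
    "AGE": 0.55937093,
    "DIS": 0.00000109,
    "RAD": 0.00013665,
    "CRIM": 0.00235436,
    "TAX": 0.00271487,
    "PTRATIO": 0.00000157,
    "B": 0.03476212,
    "LSTAT": 0.0,
}

ALPHA = 0.05  # significance threshold used on the slide

significant = [name for name, p in p_values.items() if p < ALPHA]
not_significant = [name for name, p in p_values.items() if p >= ALPHA]

print("Reject H0 for:", significant)
print("Fail to reject H0 for:", not_significant)
```

With these values, INDUS, CHAS and AGE are the predictors whose coefficients are not significantly different from 0.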

MLR Subset Selection

• This procedure gives us the subset of variables that are the best predictors.

• We can find the subset through the selection of any of the following procedures:
  – Backward elimination
  – Exhaustive search
  – Stepwise selection
  – Forward selection
  – Sequential replacement
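To illustrate one of the procedures listed above, here is a minimal sketch of forward selection. It greedily adds the predictor that most reduces the residual sum of squares and stops when the relative improvement is small. The stopping rule, the `min_improvement` parameter and the function names are assumptions for illustration; XLMiner's own selection criteria may differ.

```python
import numpy as np

def residual_ss(X, y, columns):
    """Residual sum of squares of a least-squares fit on the given columns."""
    design = np.column_stack([np.ones(len(y))] + [X[:, j] for j in columns])
    coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coeffs
    return float(resid @ resid)

def forward_selection(X, y, min_improvement=0.05):
    """Greedily add predictors while each step cuts the residual SS enough."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    selected = []
    remaining = list(range(X.shape[1]))
    current = residual_ss(X, y, selected)  # intercept-only model
    while remaining:
        # Try each remaining predictor and keep the one with the lowest residual SS.
        scores = {j: residual_ss(X, y, selected + [j]) for j in remaining}
        best = min(scores, key=scores.get)
        if current <= 0 or (current - scores[best]) / current < min_improvement:
            break
        selected.append(best)
        remaining.remove(best)
        current = scores[best]
    return selected

# Synthetic data: only columns 0 and 2 influence y (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = 1.0 + 2.0 * X_demo[:, 0] - 3.0 * X_demo[:, 2] + rng.normal(scale=0.1, size=200)

chosen = forward_selection(X_demo, y_demo)
print(chosen)
```

Backward elimination works in the opposite direction, starting from all predictors and removing the least useful one at each step.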

Logistic Regression

Dr. Sunil D. Lakdawala


LR Application Situation

• Predicted variable is a binary class variable – ‘1’ for success and ‘0’ for failure

• The results indicate the probability of success

• Predictors are continuous, binary or ordinal class variables

LR Theoretical Foundation - 1

• Logistic regression is a variation of ordinary regression

• Logistic regression models the log-odds, log(p/(1-p)), where p = P(Y=1), as a linear combination of the explanatory variables

• P(Y=1) = Exp(β0 + β1X1 + ... + βkXk) / (1 + Exp(β0 + β1X1 + ... + βkXk))

• 1 - P(Y=1) = 1 / (1 + Exp(β0 + β1X1 + ... + βkXk))
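The formulas above can be checked numerically with a short sketch. The coefficient values and the input record are hypothetical, chosen only to show that the log-odds of the resulting probability equals the linear combination.

```python
import math

def logistic_probability(betas, x):
    """P(Y=1) = Exp(z) / (1 + Exp(z)), with z = b0 + b1*x1 + ... + bk*xk."""
    z = betas[0] + sum(b * xi for b, xi in zip(betas[1:], x))
    return math.exp(z) / (1.0 + math.exp(z))

# Hypothetical coefficients (b0, b1, b2) and one record (x1, x2).
betas = [-1.0, 0.5, 2.0]
x = [2.0, 0.25]

z = betas[0] + betas[1] * x[0] + betas[2] * x[1]  # linear combination
p = logistic_probability(betas, x)                # probability of success
q = 1.0 / (1.0 + math.exp(z))                     # probability of failure, 1 - P
log_odds = math.log(p / (1.0 - p))                # recovers z

print(p, q, log_odds)
```

For large positive z the equivalent form 1 / (1 + Exp(-z)) avoids overflow in the exponential; the version above follows the slide's formula directly.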
