Lab 2, assignment 1: OLS regression of electricity consumption on temperature at 53 sites


Upload: fatima-cherry

Post on 01-Jan-2016



Page 1: Lab 2, assignment 1: OLS regression of electricity consumption on temperature at 53 sites

Data mining and statistical learning - lab2-4

[Figure: estimated OLS parameters (vertical axis "Parameter", -10000 to 10000) by predictor (horizontal axis "Predictor"); labelled sites: ARJEPLOG, BR_M_N, FLODA, GUSTAVSFORS, HELSINGBORG, KOLM_RDEN_STR_MS..., LULE__KALLAX, MORA, R_NGEDALA, SKILLINGE, SVANBERGA, VILHELMINA, _LVSBYN, _VERKALIX_SVARTBYN]

Page 2

SAS code for ridge regression

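/* Fit the model for ridge constants k = 0, 1, ..., 10; estimates are saved in dtempbeta */
proc reg data=mining.dailytemperature outest=dtempbeta ridge=0 to 10 by 1;
  model daily_consumption = stockholm g_teborg malm_ / p;
  output out=olsoutput pred=olspred;
/* List the OLS (PARMS) and ridge (RIDGE) estimates */
proc print data=dtempbeta;
run;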

_TYPE_  _DEPVAR_           _RIDGE_   _RMSE_  Intercept  STOCKHOLM  G_TEBORG    MALM_
PARMS   Daily_Consumption           30845.8   480268.9    -5364.6    -548.3  -3598.2
RIDGE   Daily_Consumption        0  30845.8   480268.9    -5364.6    -548.3  -3598.2
RIDGE   Daily_Consumption        1  36314.6   462824.0    -2327.8   -2357.6  -2512.6
RIDGE   Daily_Consumption        2  43008.7   450349.7    -1830.1   -1899.4  -2011.6
RIDGE   Daily_Consumption        3  48325.9   442054.5    -1514.3   -1584.8  -1674.9
RIDGE   Daily_Consumption        4  52401.2   436146.6    -1292.7   -1358.6  -1434.4
RIDGE   Daily_Consumption        5  55571.5   431726.2    -1128.0   -1188.6  -1254.1
RIDGE   Daily_Consumption        6  58092.1   428294.6    -1000.8   -1056.3  -1114.1
RIDGE   Daily_Consumption        7  60138.0   425553.4     -899.4    -950.4  -1002.1
RIDGE   Daily_Consumption        8  61829.0   423313.5     -816.7    -863.8   -910.6
RIDGE   Daily_Consumption        9  63248.9   421448.8     -747.9    -791.7   -834.4
RIDGE   Daily_Consumption       10  64457.3   419872.4     -689.8    -730.6   -770.0
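To make the shrinkage visible in this output more concrete, here is a minimal Python sketch of ridge regression for two centred predictors, using the textbook estimator (X'X + kI)^(-1) X'y. The data, the function name ridge_2d and the grid of k values are invented for illustration. PROC REG may scale the predictors differently before applying the ridge constant, so the sketch is not meant to reproduce the numbers in dtempbeta.

```python
# Ridge estimates for two centred predictors: beta(k) = (X'X + k I)^(-1) X'y.
# All data below are synthetic and only illustrate the shrinkage effect.

def ridge_2d(x1, x2, y, k):
    """Return (b1, b2) minimising sum((y - b1*x1 - b2*x2)^2) + k*(b1^2 + b2^2)
    after centring x1, x2 and y (so no intercept is needed)."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    c1 = [v - m1 for v in x1]
    c2 = [v - m2 for v in x2]
    cy = [v - my for v in y]
    # Elements of X'X + k I and of X'y
    s11 = sum(a * a for a in c1) + k
    s22 = sum(b * b for b in c2) + k
    s12 = sum(a * b for a, b in zip(c1, c2))
    t1 = sum(a * v for a, v in zip(c1, cy))
    t2 = sum(b * v for b, v in zip(c2, cy))
    # Solve the 2x2 system by Cramer's rule
    det = s11 * s22 - s12 * s12
    return ((s22 * t1 - s12 * t2) / det, (s11 * t2 - s12 * t1) / det)

# Two strongly correlated "temperature" series and a response (all invented).
x1 = [-5.0, -2.0, 0.0, 3.0, 6.0, 10.0, 14.0, 18.0]
x2 = [-4.0, -2.5, 1.0, 3.5, 5.0, 11.0, 13.0, 19.0]
y = [620.0, 585.0, 540.0, 500.0, 470.0, 400.0, 365.0, 300.0]

path = {k: ridge_2d(x1, x2, y, k) for k in (0, 1, 10, 100, 1000)}
for k, (b1, b2) in path.items():
    print(f"k={k:5d}  b1={b1:9.3f}  b2={b2:9.3f}")
```

With k = 0 the function returns the OLS estimates; as k grows, the length of the coefficient vector decreases, which is the pattern the _RIDGE_ rows show.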

Page 3

Estimated regression parameters in ridge regression

[Figure: estimated parameters for STOCKHOLM, G_TEBORG and MALM_ (vertical axis "Parameter", -6000 to 0) against the ridge constant (horizontal axis "Shrinkage", 0 to 10)]

Page 4

Predicted vs observed values in OLS regression and ridge regression

- trade-off between variance and bias

[Figure: predicted vs observed values, both axes from 200000 to 700000; series: OLS regression, Ridge regression]
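The trade-off named on this slide is usually written as a decomposition of the expected squared error of a prediction at a point. The notation below (true regression function f, fitted function \hat{f}, evaluation point x_0) is an assumption added for illustration and does not appear on the slide.

```latex
\mathrm{E}\!\left[\big(\hat{f}(x_0) - f(x_0)\big)^2\right]
  = \underbrace{\big(\mathrm{E}[\hat{f}(x_0)] - f(x_0)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathrm{Var}\big[\hat{f}(x_0)\big]}_{\text{variance}}
```

For ridge regression, k = 0 gives the OLS fit, which is unbiased under the linear model but can have a large variance when the predictors are strongly correlated. Larger values of k reduce the variance at the price of some bias.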

Page 5

Fat content vs absorbance in different channels (wavelengths)

[Figure: fat content (%), 0 to 60, vs absorbance, 2 to 6; series: Channel 1, Channel 40, Channel 60, Channel 100]

Page 6

OLS regression fat vs channel10, channel30, channel50, channel70, channel90

Model: MODEL1
Dependent Variable: Fat
Number of Observations Used: 215

Analysis of Variance

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              5            29469    5893.82795    233.90   <.0001
Error            209       5266.30507      25.19763
Corrected Total  214            34735

Root MSE          5.01972    R-Square   0.8484
Dependent Mean   18.14233    Adj R-Sq   0.8448
Coeff Var        27.66858

Parameter Estimates

Variable    Label       DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept   Intercept    1             42.20859          3.82059     11.05     <.0001
Channel10   Channel10    1           -245.26494         10.05660    -24.39     <.0001
Channel30   Channel30    1            361.41787         23.53244     15.36     <.0001
Channel50   Channel50    1           -203.28522         33.52937     -6.06     <.0001
Channel70   Channel70    1            104.37041         19.91571      5.24     <.0001
Channel90   Channel90    1            -34.48938          9.15823     -3.77     0.0002
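The summary statistics in this output follow from the ANOVA table by simple arithmetic. The Python sketch below recomputes them from the printed (rounded) sums of squares; the variable names are illustrative and small differences in the last digit can come from that rounding.

```python
import math

# Values copied from the ANOVA table for the five-channel model (rounded as printed).
ss_error, df_error = 5266.30507, 209
ss_total, df_total = 34735.0, 214
ms_model = 5893.82795

ms_error = ss_error / df_error                       # mean square error
r_square = 1 - ss_error / ss_total                   # share of variation explained
adj_r_square = 1 - ms_error / (ss_total / df_total)  # adjusted for degrees of freedom
f_value = ms_model / ms_error                        # overall F statistic
root_mse = math.sqrt(ms_error)                       # residual standard deviation

print(round(ms_error, 5), round(r_square, 4), round(adj_r_square, 4),
      round(f_value, 2), round(root_mse, 5))
```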

Page 7

OLS regression fat vs channel1 – channel100

Model: MODEL1
Dependent Variable: Fat
Number of Observations Used: 215

Analysis of Variance

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model             53            34326     647.66185    254.72   <.0001
Error            161        409.36692       2.54265
Corrected Total  214            34735

Root MSE          1.59457    R-Square   0.9882
Dependent Mean   18.14233    Adj R-Sq   0.9843
Coeff Var         8.78922

NOTE: Model is not full rank. Least-squares solutions for the parameters are not unique. Some statistics will be misleading. A reported DF of 0 or B means that the estimate is biased.

NOTE: The following parameters have been set to 0, since the variables are a linear combination of other variables as shown.

Channel3 = -9.37E-6 * Intercept - 0.03975 * Channel1 + 0.47341 * Channel2 - 0.66366 * Channel4
  + 0.1448 * Channel6 - 0.04202 * Channel8 - 0.0296 * Channel10 + 0.04022 * Channel12
  - 0.1013 * Channel14 + 0.08297 * Channel16 + 0.09432 * Channel18 - 0.1725 * Channel20
  + 0.07997 * Channel21 - 0.00495 * Channel23 + 0.02818 * Channel25 + 0.00606 * Channel27
  - 0.08143 * Channel28 + 0.08083 * Channel30 - 0.05219 * Channel32 + 0.01912 * Channel33
  + 0.01284 * Channel35 - 0.01179 * Channel36 + 0.03298 * Channel37 - 0.02684 * Channel38
  + 0.00346 * Channel39 - 0.04165 * Channel41 + 0.04493 * Channel42 - 0.01572 * Channel44
  + 0.01452 * Channel46 + 0.00074 * Channel48 - 0.0342 * Channel49 + 0.08672 * Channel51
  - 0.0911 * Channel52 + 0.03303 * Channel53 - 0.00125 * Channel55 - 0.00744 * Channel56
  + 0.01541 * Channel58 - 0.00663 * Channel59 - 0.02578 * Channel61 + 0.02883 * Channel63
  - 0.01135 * Channel65 + 0.04673 * Channel67 - 0.04764 * Channel69 - 0.00365 * Channel71
  + 0.01601 * Channel73 - 0.01333 * Channel75 - 0.00651 * Channel77 - 0.00392 * Channel80
  + 0.03827 * Channel83 - 0.02069 * Channel86 + 0.01285 * Channel89 - 0.01378 * Channel92
  - 0.00849 * Channel95 + 0.0093 * Channel98

Page 8

OLS regression fat vs channel1 – channel 100

Parameter Estimates

Variable    Label       DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept   Intercept    B              7.67989          2.01644      3.81     0.0002
Channel1    Channel1     B           7550.89847       3181.94418      2.37     0.0188
Channel2    Channel2     B          -6236.59799       4650.43463     -1.34     0.1818
Channel3    Channel3     0                    0                .         .          .
Channel4    Channel4     B          -2576.07036       3776.80152     -0.68     0.4962
Channel5    Channel5     0                    0                .         .          .
Channel6    Channel6     B          -7766.73338       4103.41990     -1.89     0.0602
Channel7    Channel7     0                    0                .         .          .
Channel8    Channel8     B           5660.86411       4248.60674      1.33     0.1846
Channel9    Channel9     0                    0                .         .          .
Channel10   Channel10    B           4509.28620       4503.11172      1.00     0.3182
Channel11   Channel11    0                    0                .         .          .
Channel12   Channel12    B           8050.98503       4080.26245      1.97     0.0502
Channel13   Channel13    0                    0                .         .          .
Channel14   Channel14    B          -7368.85561       4319.59587     -1.71     0.0900
Channel15   Channel15    0                    0                .         .          .
Channel16   Channel16    B          -5251.52459       3382.29352     -1.55     0.1225
Channel17   Channel17    0                    0                .         .          .
. . .

Page 9

OLS regression with strongly correlated predictors

If the X^T X matrix does not have full rank (some X-variables are linearly dependent), the least-squares solution is not unique

If the X-variables are strongly correlated, then:

(i) the regression coefficients will be uncertain;

(ii) the predictions may be OK
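A small Python sketch of the extreme case, where one predictor is an exact multiple of another: X^T X is then singular, and quite different coefficient vectors give identical fitted values. The data and the three candidate coefficient vectors are invented for illustration.

```python
# Exactly collinear predictors (invented data): x2 = 2 * x1, and y = 1 + 2 * x1.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0 * v for v in x1]
y = [3.0, 5.0, 7.0, 9.0, 11.0]

def centred(values):
    m = sum(values) / len(values)
    return [v - m for v in values]

# Determinant of the centred 2x2 matrix X'X; zero means it cannot be inverted.
c1, c2 = centred(x1), centred(x2)
s11 = sum(a * a for a in c1)
s22 = sum(b * b for b in c2)
s12 = sum(a * b for a, b in zip(c1, c2))
det = s11 * s22 - s12 * s12

def fitted(b0, b1, b2):
    return [b0 + b1 * a + b2 * b for a, b in zip(x1, x2)]

def rss(b0, b1, b2):
    return sum((obs - fit) ** 2 for obs, fit in zip(y, fitted(b0, b1, b2)))

# Any (b0, b1, b2) with b0 = 1 and b1 + 2 * b2 = 2 reproduces y exactly.
candidates = [(1.0, 2.0, 0.0), (1.0, 0.0, 1.0), (1.0, 4.0, -1.0)]
fits = [fitted(*c) for c in candidates]

print("det(X'X) =", det)
for c in candidates:
    print(c, "RSS =", rss(*c))
```

All three candidates are least-squares solutions with the same predictions, so the individual coefficients cannot be interpreted, while the predictions are unaffected. With strong but not exact correlation the solution is unique, but the coefficients are poorly determined in the same direction.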

Page 10

Principal Component Analysis of lake survey data

Some variables vary much more than others

How does this influence principal components derived from the covariance and correlation matrices, respectively?

[Figure: Tot-N (mg/l) vs Cl (meq/l); both axes from 0 to 7000]
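One way to explore the question above is a two-variable toy example in Python. The data and the names wide and narrow are invented (they are not the lake survey variables), and the closed-form eigenvector is only valid for a symmetric 2x2 matrix with a non-zero off-diagonal element.

```python
import math

def covariance(u, v):
    """Sample covariance of two equally long sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (n - 1)

def first_component(a, b, c):
    """Unit eigenvector for the largest eigenvalue of [[a, b], [b, c]] (needs b != 0)."""
    lam = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b * b)
    vx, vy = b, lam - a
    norm = math.hypot(vx, vy)
    return vx / norm, vy / norm

# Invented data: "wide" varies over thousands, "narrow" over about one unit.
wide = [10.0, 200.0, 3500.0, 50.0, 6400.0, 120.0]
narrow = [0.3, 0.5, 1.1, 0.4, 1.6, 0.2]

s_ww = covariance(wide, wide)
s_nn = covariance(narrow, narrow)
s_wn = covariance(wide, narrow)
r = s_wn / math.sqrt(s_ww * s_nn)             # correlation coefficient

pc_cov = first_component(s_ww, s_wn, s_nn)    # loadings from the covariance matrix
pc_cor = first_component(1.0, r, 1.0)         # loadings from the correlation matrix

print("covariance matrix, PC1 loadings: ", pc_cov)
print("correlation matrix, PC1 loadings:", pc_cor)
```

In this toy case the first component from the covariance matrix is almost entirely the high-variance variable, while the correlation matrix weights the two standardized variables equally.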

Page 11

Principal Component Analysis of lake survey data

- score plot derived from the correlation matrix

Page 12

Principal Component Analysis of lake survey data

- eigenvectors derived from the correlation matrix

[Figure: eigenvector elements of PRIN1 and PRIN2 (axis from -0.2 to 0.5) for the variables pH_, Cond__mS_m25_C, Ca_meq_l, Mg_meq_l, Na_meq_l, K_meq_l, Alk__Acid_meq_l, SO4_IC_meq_l, Cl_meq_l, NO2_NO3_N_ug_l, Tot_N_ps_ug_l, Tot_P_ug_l, Abs__F_420nm_5c, TOC_mg_l, Si_mg_l]

Page 13

Principal Component Analysis of lake survey data with outliers removed

- score plot derived from the correlation matrix

Page 14

Principal Component Analysis of lake survey data with outliers removed

- eigenvectors derived from the correlation matrix

[Figure: loadings of PRIN 1 and PRIN 2 (vertical axis "Loading", -0.6 to 0.8) for the variables pH_, Cond__mS_m25_C, Ca_meq_l, Mg_meq_l, Na_meq_l, K_meq_l, Alk__Acid_meq_l, SO4_IC_meq_l, Cl_meq_l, NO2_NO3_N_ug_l, Tot_N_ps_ug_l, Tot_P_ug_l, Abs__F_420nm_5cm, TOC_mg_l, Si_mg_l]

Page 15

Principal Component Analysis of lake survey data with outliers removed

- MINITAB score plot derived from the correlation matrix

[Figure: Score Plot of pH, ..., Si mg/l; Second Component (-5.0 to 12.5) vs First Component (0 to 40)]

Page 16

Principal Component Analysis of lake survey data with outliers removed

- MINITAB loading plot derived from the correlation matrix

[Figure: Loading Plot of pH, ..., Si mg/l; Second Component (-0.50 to 0.50) vs First Component (0.0 to 0.4); labelled variables: Si mg/l, TOC mg/l, Abs._F 420nm/5cm, Tot-P ug/l, Tot-N_ps ug/l, NO2+NO3-N ug/l, Cl meq/l, SO4_IC meq/l, Alk./Acid meq/l, K meq/l, Na meq/l, Mg meq/l, Ca meq/l, Cond. mS/m25°C, pH]

Page 17

Regression of an indicator matrix

[Figure: scatter plot of x2 (0 to 16) vs x1 (2 to 10); series: Class 1, Class 2]

Find a linear function

$\hat{f}_1(x_1, x_2) = \hat{\beta}_{10} + \hat{\beta}_{11} x_1 + \hat{\beta}_{12} x_2$

which is (on average) one for objects in class 1 and otherwise (on average) zero

Find a linear function

$\hat{f}_2(x_1, x_2) = \hat{\beta}_{20} + \hat{\beta}_{21} x_1 + \hat{\beta}_{22} x_2$

which is (on average) one for objects in class 2 and otherwise (on average) zero

Assign a new object to class 1 if

$\hat{f}_1(x_1, x_2) > \hat{f}_2(x_1, x_2)$
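The procedure on this slide can be sketched in Python as follows: regress one 0/1 indicator column per class on (1, x1, x2) by least squares, then assign a new object to the class whose fitted function is larger. The training points, the helper names solve, fit_indicator, f and classify, and the use of the normal equations are all assumptions made for this illustration.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [u - factor * v for u, v in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_indicator(points, indicator):
    """Least-squares fit of indicator ~ b0 + b1*x1 + b2*x2 via the normal equations."""
    rows = [(1.0, x1, x2) for x1, x2 in points]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * t for r, t in zip(rows, indicator)) for i in range(3)]
    return solve(xtx, xty)

def f(beta, x1, x2):
    """Evaluate the fitted linear function at (x1, x2)."""
    return beta[0] + beta[1] * x1 + beta[2] * x2

# Invented training data: class 1 has small x2 values, class 2 large ones.
class1 = [(3.0, 2.0), (4.0, 3.0), (5.0, 2.5), (6.0, 4.0), (7.0, 3.5)]
class2 = [(3.0, 10.0), (4.0, 12.0), (5.0, 11.0), (6.0, 13.0), (7.0, 12.5)]
points = class1 + class2
y1 = [1.0] * len(class1) + [0.0] * len(class2)   # indicator column for class 1
y2 = [0.0] * len(class1) + [1.0] * len(class2)   # indicator column for class 2

beta1 = fit_indicator(points, y1)
beta2 = fit_indicator(points, y2)

def classify(x1, x2):
    """Assign to class 1 if f1 > f2, otherwise to class 2."""
    return 1 if f(beta1, x1, x2) > f(beta2, x1, x2) else 2

print(classify(5.0, 3.0), classify(5.0, 12.0))
```

Because the two indicator columns sum to one and the model contains an intercept, the two fitted functions also sum to one, so the rule is equivalent to checking whether the class 1 function exceeds 0.5.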

Page 18

Discriminant analysis

- decision border

[Figure: scatter plot of x2 (0 to 16) vs x1 (2 to 10); series: Class 1, Class 2, Discr. (decision border)]

Page 19

3D-plot of an indicator matrix for class 1

[Figure: 3D Scatterplot of Class_1 vs x2 vs x1; Class_1 takes the values 0.0 to 1.0]

Page 20

3D-plot of an indicator matrix for class 2

[Figure: 3D Scatterplot of Class_2 vs x2 vs x1; Class_2 takes the values 0.0 to 1.0]

Page 21

Regression of an indicator matrix

- discriminating function

[Figure: scatter plot of x2 (0 to 25) vs x1 (2 to 10); series: Class 1, Class 2, Class 3]

Estimate discriminant functions $\delta_k(x)$ for each class, and then classify a new object to the class with the largest value for its discriminant function

Page 22

Linear discriminant analysis (LDA)

LDA is an optimal classification method when the data arise from Gaussian distributions with different means and a common covariance matrix

[Figure: scatter plot of three classes (Class 1, Class 2, Class 3); horizontal axis from 2 to 12, vertical axis from 4 to 18]
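A minimal Python sketch of LDA for two predictors and three classes, assuming the standard linear discriminant function x' S^-1 mu_k - 0.5 mu_k' S^-1 mu_k + log(pi_k), with class means mu_k, a pooled within-class covariance matrix S and priors pi_k estimated from the class proportions. The data and all function names are invented for illustration and do not correspond to the points in the figure.

```python
import math

def mean(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def pooled_covariance(groups, means):
    """Pooled within-class covariance (s11, s12, s22) with divisor N - K."""
    s11 = s12 = s22 = 0.0
    for pts, (m1, m2) in zip(groups, means):
        for x1, x2 in pts:
            s11 += (x1 - m1) ** 2
            s12 += (x1 - m1) * (x2 - m2)
            s22 += (x2 - m2) ** 2
    dof = sum(len(g) for g in groups) - len(groups)
    return s11 / dof, s12 / dof, s22 / dof

def discriminant(x, mu, cov, prior):
    """Linear discriminant x' S^-1 mu - 0.5 mu' S^-1 mu + log(prior)."""
    s11, s12, s22 = cov
    det = s11 * s22 - s12 * s12
    # a = S^-1 mu for the symmetric 2x2 matrix S
    a1 = (s22 * mu[0] - s12 * mu[1]) / det
    a2 = (s11 * mu[1] - s12 * mu[0]) / det
    return x[0] * a1 + x[1] * a2 - 0.5 * (mu[0] * a1 + mu[1] * a2) + math.log(prior)

# Invented training data: three classes with the same spread but different centres.
groups = [
    [(3.0, 6.0), (4.0, 7.0), (3.5, 8.0), (5.0, 7.5), (4.5, 6.5)],
    [(7.0, 12.0), (8.0, 13.0), (7.5, 14.0), (9.0, 13.5), (8.5, 12.5)],
    [(9.0, 6.0), (10.0, 7.0), (9.5, 8.0), (11.0, 7.5), (10.5, 6.5)],
]
means = [mean(g) for g in groups]
cov = pooled_covariance(groups, means)
n_total = sum(len(g) for g in groups)
priors = [len(g) / n_total for g in groups]

def classify(x):
    """Return the class (numbered from 1) with the largest discriminant value."""
    scores = [discriminant(x, m, cov, p) for m, p in zip(means, priors)]
    return scores.index(max(scores)) + 1

print([classify(m) for m in means], classify((4.0, 6.0)))
```

With equal priors, each class mean is assigned to its own class, and the borders between classes are straight lines because the quadratic terms cancel under the common covariance matrix.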