
Page 1:

Data Mining, Data Perturbation and Degrees of Freedom of Projection Regression

T.C. Lin

* to appear in JRSS Series C

Page 2:

The Canadian lynx data record the annual number of lynx trapped in the Mackenzie River district of north-west Canada for the period 1821-1934 (n = 114).

Some facts about this data set:

• It is a fairly noisy set of field data.

• It has nonlinear and non-Gaussian characteristics (see Tong, 1990).

• It has been used as a benchmark to gauge the performance of various time series methods (cf. Campbell and Walker (1977), Tong (1980, 1990), Lim (1987)).

Page 3:

Page 4:

Page 5:

Nowadays the same data set is used routinely in the formulation, selection, estimation, diagnostics and prediction of statistical models.

Parametric:

• Linear: AR, MA, ARMA

• Nonlinear: SETAR, ARCH, ...

Nonparametric*

• Most general: X_t = f(X_{t-i_1}, ..., X_{t-i_p}) + e_t

• Additive: X_t = f_1(X_{t-i_1}) + ... + f_p(X_{t-i_p}) + e_t (ADD)

• PPR: Y = β_0 + Σ_{k=1}^K β_k f_k(a_k' X) + e

*Selection and estimation are usually based on smoothing, backfitting, BRUTO, ACE, projector methods, etc.

(Hastie and Tibshirani, 1990)

Page 6:

Questions:

• Is it possible to compare the performance of such models?

(Or: can nonparametric methods produce models with better predictive power than parametric methods?)

• What are the degrees of freedom (df) of a model?

(Or: how can one assess and adjust for the impact of data mining?)

While no universal answer can be expected, a data perturbation procedure (Wong and Ye, 1996) is used here to assess the df of each fitted model and to adjust for its impact.

Page 7:

Why Data Mining?

The theory of statistical inference is based on the assumption that the model for the data is given a priori, i.e.,

x_1, ..., x_n ~ p(x|Θ),

where p(x|Θ) is a known distribution with unknown parameters Θ.

In practice this assumption is rarely tenable, since the model is seldom formulated from subject-matter knowledge or data-free procedures alone.

Consequently, overfitting, or data mining, occurs frequently in the modern data-analysis environment.

Page 8:

How to count the df?

•Parametric

df = # of parameters in the model.

•Nonparametric

df is at the heart of the problem of assessing the impact of data mining.

• Example 1. For a linear smoother Ŷ = WY, df = tr(W); see Hastie and Tibshirani (1990). (A quick numerical check of this is sketched below.)

• Example 2. df = tr(H), introduced later.
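As an illustration of Example 1 (mine, not from the talk): for any linear smoother Ŷ = WY, the df can be computed directly as tr(W). A minimal Python sketch, using the hat matrix of a cubic polynomial fit so that tr(W) recovers the familiar parametric count:

```python
import numpy as np

# Hat matrix of a cubic polynomial regression: Y_hat = W Y with
# W = X (X'X)^{-1} X', so df = tr(W) = number of parameters = 4.
x = np.linspace(0.0, 1.0, 50)
X = np.vander(x, N=4)                  # design columns: x^3, x^2, x, 1
W = X @ np.linalg.solve(X.T @ X, X.T)  # smoother (hat) matrix W
print(np.trace(W))                     # ~4.0, matching the parametric df
```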

Page 9:

Idea of data perturbation:

Intuitively and ideally, any estimated model should be validated using a new data set.

Data perturbation can be viewed as a method of generating new data close to the observed response Y (a generalization of Breiman's little bootstrap method).

Writing the fit as

$$\hat{Y} = \hat{f}(Y) = (\hat{f}_1(Y), \ldots, \hat{f}_n(Y))', \qquad Y = (Y_1, \ldots, Y_n)',$$

we would like to have, for a small perturbation δ,

$$\hat{Y}(\delta) = \hat{f}(Y+\delta) \approx \hat{f}(Y) + H\delta, \qquad \hat{f}_i(Y+\delta) \approx \hat{f}_i(Y) + h_{ii}\,\delta_i,$$

where $H = [h_{ij}] = [\partial \hat{f}_i / \partial Y_j]$, $i, j = 1, \ldots, n$.

=> Effective degrees of freedom (EDF) $= \operatorname{tr}(H) = \sum_{i=1}^{n} h_{ii}$.

Page 10:

Table 1. MSE & SD of five models fitted to the lynx data.

Model     AR(2)    SETAR    ADD(1,2)   ADD(1,2,9)   PPR
MSE       0.0459   0.0358   0.0455     0.0380       0.0194
MSE_adj   --       --       0.0443     0.0365       0.0377
SD        0.295    0.136    0.100      0.347        0.247

About SD: fit the same class of models to the first 100 observations, keeping the last 14 for out-of-sample prediction. SD = the standard deviation of the multi-step-ahead prediction errors.

Page 11:

Models for the lynx data.

◆ Model 1: Moran's AR(2) model

X_t = 1.05 + 1.41 X_{t-1} - 0.77 X_{t-2} + e_t,  e_t ~ WN(0, 0.04591).

◆ Model 2: Tong's SETAR(2;7,2) model

X_t = 0.546 + 1.032 X_{t-1} - 0.173 X_{t-2} + 0.171 X_{t-3} - 0.431 X_{t-4} + 0.332 X_{t-5} - 0.284 X_{t-6} + 0.210 X_{t-7} + e_t^(1),  if X_{t-2} ≤ 3.116;

X_t = 2.632 + 1.492 X_{t-1} - 1.324 X_{t-2} + e_t^(2),  if X_{t-2} > 3.116,

with var(e_t^(1)) = 0.0259 and var(e_t^(2)) = 0.0505 (pooled var = 0.0358). A one-step predictor for this model is sketched below.
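As a concrete reading of Model 2, a minimal Python sketch of the two-regime prediction rule (a transcription of the slide's equations, not the author's code; the lag ordering of the input is my assumption):

```python
def setar_predict(lags):
    """One-step SETAR(2;7,2) forecast; lags = (X_{t-1}, ..., X_{t-7})."""
    if lags[1] <= 3.116:  # lower regime: threshold variable is X_{t-2}
        coefs = [1.032, -0.173, 0.171, -0.431, 0.332, -0.284, 0.210]
        return 0.546 + sum(c * x for c, x in zip(coefs, lags))
    # upper regime uses only the first two lags
    return 2.632 + 1.492 * lags[0] - 1.324 * lags[1]
```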

Page 12:

BRUTO Algorithm (see HT 90)

is a forward model-selection procedure using a modified GCV, defined by

$$\mathrm{GCV}(\lambda) = \frac{n^{-1}\sum_{i=1}^{n}\Big\{y_i - \sum_{j=1}^{p}\hat{f}_{\lambda_j,\,j}(x_{ij})\Big\}^{2}}{\Big\{1 - \big[1 + \sum_{j=1}^{p}\big(\operatorname{tr}(S_{\lambda_j,\,j}) - 1\big)\big]\big/n\Big\}^{2}},$$

to choose the significant variables and their smoothing parameters (here $S_{\lambda_j,\,j}$ denotes the smoother matrix for the j-th variable). A numeric rendering of the criterion follows.
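A minimal sketch of this criterion as a function (the helper name and inputs are mine, not BRUTO's implementation):

```python
import numpy as np

def modified_gcv(y, fitted, trace_S):
    """GCV(lambda) as on the slide: `fitted` holds sum_j f_j(x_ij) for
    each i; `trace_S` holds tr(S_{lambda_j, j}) for each selected term."""
    n = len(y)
    rss = np.mean((np.asarray(y) - np.asarray(fitted)) ** 2)  # numerator
    df = 1.0 + sum(t - 1.0 for t in trace_S)                  # total fitted df
    return rss / (1.0 - df / n) ** 2                          # squared denominator
```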

Page 13:

K = 5 (number of potential variables)

• Model 3: ADD(1,2)

X_t = 1.07 + 1.37 X_{t-1} + s(X_{t-2}, 3) + e_t,

where

• e_t ~ WN(0, 0.0455),

• s(x, d) stands for a cubic smoothing spline with df = d fitted to a time series x; a sketch of how such a df calibration can be done follows.
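One way to read "a smoothing spline with df = d": choose the smoothing parameter so that the smoother matrix has trace d. A minimal sketch under that assumption, using a simple difference-penalty (Whittaker-type) smoother as a stand-in for the cubic spline:

```python
import numpy as np
from scipy.optimize import brentq

def smoother_with_df(n, d):
    """Return S = (I + lam D'D)^{-1} with lam chosen so tr(S) = d.
    tr(S) runs from n (lam -> 0) down to 2 (lam -> inf), so need 2 < d < n."""
    D = np.diff(np.eye(n), n=2, axis=0)        # second-difference operator
    P = D.T @ D                                # roughness penalty matrix
    gap = lambda lam: np.trace(np.linalg.inv(np.eye(n) + lam * P)) - d
    lam = brentq(gap, 1e-8, 1e8)               # df is decreasing in lam
    return np.linalg.inv(np.eye(n) + lam * P)  # apply as y_smooth = S @ y
```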

Page 14:

K = 10:

• Model 4: ADD(1,2,9)

X_t = 0.61 + 1.15 X_{t-1} + s(X_{t-2}, 3) + s(X_{t-9}, 3) + e_t,

where e_t ~ WN(0, 0.0381).

Page 15:

What is PPR (projection pursuit regression)?

Let

• X_{t-1} = (X_{t-i_1}, X_{t-i_2}, ..., X_{t-i_p})' be a random vector, and

• a_1, a_2, ... denote some p-dimensional unit "direction" vectors.

PPR models are additive models in linear combinations of past values,

$$X_t = \sum_{k=1}^{K} f_k^{*}(a_k' X_{t-1}) + e_t = \beta_0 + \sum_{k=1}^{K} \beta_k f_k(a_k' X_{t-1}) + e_t.$$

An evaluation sketch of this form follows.
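To make the form concrete, a minimal evaluation sketch (function and argument names are mine; the fitted ridge functions f_k would come from the PPR algorithm):

```python
import numpy as np

def ppr_predict(X_lag, beta0, betas, directions, ridge_fns):
    """Evaluate beta0 + sum_k beta_k f_k(a_k' X_{t-1}) row by row.
    X_lag: (n, p) lagged covariates; directions: unit vectors a_k;
    ridge_fns: fitted one-dimensional (vectorized) functions f_k."""
    pred = np.full(X_lag.shape[0], float(beta0))
    for b, a, f in zip(betas, directions, ridge_fns):
        pred += b * f(X_lag @ np.asarray(a))  # project, then apply f_k
    return pred
```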

Page 16:

• Model 5: PPR

In light of Model 4, and for simplicity, we take the covariate vector to be (X_{t-1}, X_{t-2}, X_{t-9})'. Using the PPR algorithm, we obtain the projections below, with MSE = 0.0194.

k   β_k      a_k (projections)
1   0.4677   (0.8619, -0.5021, 0.0703)
2   0.1241   (0.6515, -0.4106, -0.6378)
3   0.1380   (0.0665, -0.9888, 0.1335)
4   0.1251   (-0.6473, 0.6631, 0.3756)
5   0.1384   (-0.5305, 0.6342, 0.5623)

Page 17:

Data Perturbation Procedure:

• For an integer m > 1 (the Monte Carlo sample size), generate δ_1, ..., δ_m as i.i.d. N(0, t²I_n), where t > 0 and I_n is the n×n identity matrix.

• Use the "perturbed" data Y + δ_j to re-compute (refit) the model, giving Ŷ(δ_j) = f̂(Y + δ_j), j = 1, 2, ..., m.

• For i = 1, 2, ..., n, the slope of the LS line fitted to (f̂_i(Y + δ_j), δ_ij), j = 1, ..., m, gives an estimate of h_ii. A sketch of this procedure appears below.
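A minimal Monte Carlo sketch of the procedure (the `refit` callback, defaults, and names are mine; any model that returns fitted values from a response vector can plug in):

```python
import numpy as np

def perturbation_edf(y, refit, t=0.1, m=200, seed=0):
    """Estimate EDF = tr(H) = sum_i h_ii by data perturbation.
    refit(y_pert) must refit the model to the perturbed response and
    return its n fitted values f_hat(y_pert)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    delta = rng.normal(0.0, t, size=(m, n))         # delta_j ~ N(0, t^2 I_n)
    fits = np.array([refit(y + d) for d in delta])  # refit per perturbation
    dc = delta - delta.mean(axis=0)                 # center for LS slopes
    fc = fits - fits.mean(axis=0)
    h_ii = (dc * fc).sum(axis=0) / (dc ** 2).sum(axis=0)  # per-i LS slope
    return h_ii.sum()                               # EDF = tr(H)
```

For a linear smoother one can take refit = lambda y_p: W @ y_p, and the estimate converges to tr(W) as m grows, matching Example 1.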

Page 18:

Page 19:

Conclusion:

The comparison and our findings bring home the danger of not accounting for the impact of data mining. Evidently, extensive data mining yields models and estimates that are too faithful to the data, with small in-sample prediction error yet considerable loss in out-of-sample prediction accuracy.