
Page 1:

Data Mining, Data Perturbation and Degrees of Freedom of Projection Regression

T.C. Lin

* to appear in JRSS Series C

Page 2:

The Canadian lynx data record the annual number of lynx trapped in the Mackenzie River district of north-west Canada for the period 1821-1934 (n = 114).

Some facts about this data set:

• It is a fairly noisy set of field data.

• It has nonlinear and non-Gaussian characteristics (see Tong, 1990).

• It has been used as a benchmark to gauge the performance of various time series methods (cf. Campbell and Walker (1977), Tong (1980, 1990), Lim (1987)).

Page 3:

Page 4:

Page 5:

Nowadays the same data set is used routinely in the formulation, selection, estimation, diagnostics and prediction of statistical models.

Parametric:

• Linear: AR, MA, ARMA

• Nonlinear: SETAR, ARCH, ...

Nonparametric*

• Most general: X_t = f(X_{t-i_1}, ..., X_{t-i_p}) + e_t

• Additive: X_t = f_1(X_{t-i_1}) + ... + f_p(X_{t-i_p}) + e_t (ADD)

• PPR: Y = β_0 + Σ_{k=1}^K β_k f_k(a_k' X) + e

*Selection and estimation are usually based on smoothing, backfitting, BRUTO, ACE, projector methods, etc.

(Hastie and Tibshirani, 1990)

Page 6:

Questions:

• Is it possible to compare the performance of such models?

(Or: can nonparametric methods produce models with better predictive power than parametric methods?)

• What are the degrees of freedom (df) of a model?

(Or: how can one assess and adjust for the impact of data mining?)

While no universal answer can be expected, a data perturbation procedure (Wong and Ye, 1996) is used here to assess the df of each fitted model and to adjust for its impact.

Page 7:

Why Data Mining?

The theory of statistical inference is based on the assumption that the model for the data is given a priori, i.e.,

x_1, ..., x_n ~ p(x|Θ),

where p(x|Θ) is a known distribution with unknown parameters Θ.

In practice this assumption is rarely tenable, since the model is seldom formulated from subject-matter knowledge or data-free procedures alone.

Consequently, overfitting, or data mining, occurs frequently in the modern data-analysis environment.

Page 8:

How to count the df?

•Parametric

df = # of parameters in the model.

•Nonparametric

df is at the heart of the problem of assessing the impact of data mining.

• Example 1. For a linear smoother Ŷ = WY, df = tr(W); see Hastie and Tibshirani (1990). (A quick numerical check of this is sketched below.)

• Example 2. df = tr(H), introduced later.
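As an illustration of Example 1 (mine, not from the talk): for any linear smoother Ŷ = WY, the df can be computed directly as tr(W). A minimal Python sketch, using the hat matrix of a cubic polynomial fit so that tr(W) recovers the familiar parametric count:

```python
import numpy as np

# Hat matrix of a cubic polynomial regression: Y_hat = W Y with
# W = X (X'X)^{-1} X', so df = tr(W) = number of parameters = 4.
x = np.linspace(0.0, 1.0, 50)
X = np.vander(x, N=4)                  # design columns: x^3, x^2, x, 1
W = X @ np.linalg.solve(X.T @ X, X.T)  # smoother (hat) matrix W
print(np.trace(W))                     # ~4.0, matching the parametric df
```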

Page 9:

Idea of data perturbation:

Intuitively and ideally, any estimated model should be validated using a new data set.

Data perturbation can be viewed as a method of generating new data close to the observed response Y (a generalization of Breiman's little bootstrap method).

Writing the fit as

$$\hat{Y} = \hat{f}(Y) = (\hat{f}_1(Y), \ldots, \hat{f}_n(Y))', \qquad Y = (Y_1, \ldots, Y_n)',$$

we would like to have, for a small perturbation δ,

$$\hat{Y}(\delta) = \hat{f}(Y+\delta) \approx \hat{f}(Y) + H\delta, \qquad \hat{f}_i(Y+\delta) \approx \hat{f}_i(Y) + h_{ii}\,\delta_i,$$

where $H = [h_{ij}] = [\partial \hat{f}_i / \partial Y_j]$, $i, j = 1, \ldots, n$.

=> Effective degrees of freedom (EDF) $= \operatorname{tr}(H) = \sum_{i=1}^{n} h_{ii}$.

Page 10:

Table 1. MSE & SD of five models fitted to the lynx data.

Model     AR(2)    SETAR    ADD(1,2)   ADD(1,2,9)   PPR
MSE       0.0459   0.0358   0.0455     0.0380       0.0194
MSE_adj   --       --       0.0443     0.0365       0.0377
SD        0.295    0.136    0.100      0.347        0.247

About SD: fit the same class of models to the first 100 observations, keeping the last 14 for out-of-sample prediction. SD = the standard deviation of the multi-step-ahead prediction errors.

Page 11:

Models for the lynx data.

◆ Model 1: Moran's AR(2) model

X_t = 1.05 + 1.41 X_{t-1} - 0.77 X_{t-2} + e_t,  e_t ~ WN(0, 0.04591).

◆ Model 2: Tong's SETAR(2;7,2) model

X_t = 0.546 + 1.032 X_{t-1} - 0.173 X_{t-2} + 0.171 X_{t-3} - 0.431 X_{t-4} + 0.332 X_{t-5} - 0.284 X_{t-6} + 0.210 X_{t-7} + e_t^(1),  if X_{t-2} ≤ 3.116;

X_t = 2.632 + 1.492 X_{t-1} - 1.324 X_{t-2} + e_t^(2),  if X_{t-2} > 3.116,

with var(e_t^(1)) = 0.0259 and var(e_t^(2)) = 0.0505 (pooled var = 0.0358). A one-step predictor for this model is sketched below.
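As a concrete reading of Model 2, a minimal Python sketch of the two-regime prediction rule (a transcription of the slide's equations, not the author's code; the lag ordering of the input is my assumption):

```python
def setar_predict(lags):
    """One-step SETAR(2;7,2) forecast; lags = (X_{t-1}, ..., X_{t-7})."""
    if lags[1] <= 3.116:  # lower regime: threshold variable is X_{t-2}
        coefs = [1.032, -0.173, 0.171, -0.431, 0.332, -0.284, 0.210]
        return 0.546 + sum(c * x for c, x in zip(coefs, lags))
    # upper regime uses only the first two lags
    return 2.632 + 1.492 * lags[0] - 1.324 * lags[1]
```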

Page 12:

BRUTO Algorithm (see HT 90)

is a forward model-selection procedure using a modified GCV, defined by

$$\mathrm{GCV}(\lambda) = \frac{n^{-1}\sum_{i=1}^{n}\Big\{y_i - \sum_{j=1}^{p}\hat{f}_{\lambda_j,\,j}(x_{ij})\Big\}^{2}}{\Big\{1 - \big[1 + \sum_{j=1}^{p}\big(\operatorname{tr}(S_{\lambda_j,\,j}) - 1\big)\big]\big/n\Big\}^{2}},$$

to choose the significant variables and their smoothing parameters (here $S_{\lambda_j,\,j}$ denotes the smoother matrix for the j-th variable). A numeric rendering of the criterion follows.
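A minimal sketch of this criterion as a function (the helper name and inputs are mine, not BRUTO's implementation):

```python
import numpy as np

def modified_gcv(y, fitted, trace_S):
    """GCV(lambda) as on the slide: `fitted` holds sum_j f_j(x_ij) for
    each i; `trace_S` holds tr(S_{lambda_j, j}) for each selected term."""
    n = len(y)
    rss = np.mean((np.asarray(y) - np.asarray(fitted)) ** 2)  # numerator
    df = 1.0 + sum(t - 1.0 for t in trace_S)                  # total fitted df
    return rss / (1.0 - df / n) ** 2                          # squared denominator
```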

Page 13:

K = 5 (number of potential variables)

• Model 3: ADD(1,2)

X_t = 1.07 + 1.37 X_{t-1} + s(X_{t-2}, 3) + e_t,

where

• e_t ~ WN(0, 0.0455),

• s(x, d) stands for a cubic smoothing spline with df = d fitted to a time series x; a sketch of how such a df calibration can be done follows.
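One way to read "a smoothing spline with df = d": choose the smoothing parameter so that the smoother matrix has trace d. A minimal sketch under that assumption, using a simple difference-penalty (Whittaker-type) smoother as a stand-in for the cubic spline:

```python
import numpy as np
from scipy.optimize import brentq

def smoother_with_df(n, d):
    """Return S = (I + lam D'D)^{-1} with lam chosen so tr(S) = d.
    tr(S) runs from n (lam -> 0) down to 2 (lam -> inf), so need 2 < d < n."""
    D = np.diff(np.eye(n), n=2, axis=0)        # second-difference operator
    P = D.T @ D                                # roughness penalty matrix
    gap = lambda lam: np.trace(np.linalg.inv(np.eye(n) + lam * P)) - d
    lam = brentq(gap, 1e-8, 1e8)               # df is decreasing in lam
    return np.linalg.inv(np.eye(n) + lam * P)  # apply as y_smooth = S @ y
```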

Page 14:

K = 10:

• Model 4: ADD(1,2,9)

X_t = 0.61 + 1.15 X_{t-1} + s(X_{t-2}, 3) + s(X_{t-9}, 3) + e_t,

where e_t ~ WN(0, 0.0381).

Page 15:

What is PPR (projection pursuit regression)?

Let

• X_{t-1} = (X_{t-i_1}, X_{t-i_2}, ..., X_{t-i_p})' be a random vector, and

• a_1, a_2, ... denote some p-dimensional unit "direction" vectors.

PPR models are additive models in linear combinations of past values,

$$X_t = \sum_{k=1}^{K} f_k^{*}(a_k' X_{t-1}) + e_t = \beta_0 + \sum_{k=1}^{K} \beta_k f_k(a_k' X_{t-1}) + e_t.$$

An evaluation sketch of this form follows.
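To make the form concrete, a minimal evaluation sketch (function and argument names are mine; the fitted ridge functions f_k would come from the PPR algorithm):

```python
import numpy as np

def ppr_predict(X_lag, beta0, betas, directions, ridge_fns):
    """Evaluate beta0 + sum_k beta_k f_k(a_k' X_{t-1}) row by row.
    X_lag: (n, p) lagged covariates; directions: unit vectors a_k;
    ridge_fns: fitted one-dimensional (vectorized) functions f_k."""
    pred = np.full(X_lag.shape[0], float(beta0))
    for b, a, f in zip(betas, directions, ridge_fns):
        pred += b * f(X_lag @ np.asarray(a))  # project, then apply f_k
    return pred
```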

Page 16:

• Model 5: PPR

In light of Model 4, and for simplicity, we take the covariate vector to be (X_{t-1}, X_{t-2}, X_{t-9})'. Using the PPR algorithm, we obtain the projections below, with MSE = 0.0194.

k   β_k      a_k (projections)
1   0.4677   (0.8619, -0.5021, 0.0703)
2   0.1241   (0.6515, -0.4106, -0.6378)
3   0.1380   (0.0665, -0.9888, 0.1335)
4   0.1251   (-0.6473, 0.6631, 0.3756)
5   0.1384   (-0.5305, 0.6342, 0.5623)

Page 17:

Data Perturbation Procedure:

• For an integer m > 1 (the Monte Carlo sample size), generate δ_1, ..., δ_m as i.i.d. N(0, t²I_n), where t > 0 and I_n is the n×n identity matrix.

• Use the "perturbed" data Y + δ_j to re-compute (refit) the model, giving Ŷ(δ_j) = f̂(Y + δ_j), j = 1, 2, ..., m.

• For i = 1, 2, ..., n, the slope of the LS line fitted to (f̂_i(Y + δ_j), δ_ij), j = 1, ..., m, gives an estimate of h_ii. A sketch of this procedure appears below.
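A minimal Monte Carlo sketch of the procedure (the `refit` callback, defaults, and names are mine; any model that returns fitted values from a response vector can plug in):

```python
import numpy as np

def perturbation_edf(y, refit, t=0.1, m=200, seed=0):
    """Estimate EDF = tr(H) = sum_i h_ii by data perturbation.
    refit(y_pert) must refit the model to the perturbed response and
    return its n fitted values f_hat(y_pert)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    delta = rng.normal(0.0, t, size=(m, n))         # delta_j ~ N(0, t^2 I_n)
    fits = np.array([refit(y + d) for d in delta])  # refit per perturbation
    dc = delta - delta.mean(axis=0)                 # center for LS slopes
    fc = fits - fits.mean(axis=0)
    h_ii = (dc * fc).sum(axis=0) / (dc ** 2).sum(axis=0)  # per-i LS slope
    return h_ii.sum()                               # EDF = tr(H)
```

For a linear smoother one can take refit = lambda y_p: W @ y_p, and the estimate converges to tr(W) as m grows, matching Example 1.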

Page 18:

Page 19:

Conclusion:

The comparison and our findings bring home the danger of not accounting for the impact of data mining. Evidently, extensive data mining yields models and estimates that are too faithful to the data, with small in-sample prediction error yet considerable loss in out-of-sample prediction accuracy.