TRANSCRIPT
© 2006 Elder Research, Inc.
John F. Elder IV, Ph.D.
Elder Research, Inc.
635 Berkmar Circle
Charlottesville, Virginia 22901
434-973-7673
www.datamininglab.com
Salford Systems Data Mining Conference
San Diego, CA
March 31, 2006
The Generalization Paradox of Ensembles
Complexity:
Its importance, influences, and inference.
And, are ensembles as complex as they look?

Outline:
• Ensembles of models generalize well - a theoretical surprise
• Chief danger of data mining: overfit
• Occam's razor: regulate complexity to avoid overfit
• But does the razor work? Counter-evidence has been gathering
• What if complexity is measured incorrectly?
• Generalized degrees of freedom (Ye, 1998)
• Experiments: single vs. bagged decision trees
• Summary: factors that matter
[Illustrations of five algorithms: Decision Tree, Nearest Neighbor, Kernel,
Neural Network (or Polynomial Network), Delaunay Triangles]
Relative Performance Examples: 5 Algorithms on 6 Datasets
(John Elder, Elder Research & Stephen Lee, U. Idaho, 1997)
[Chart: error relative to peer techniques (lower is better)]
Essentially every Bundling method improves performance
[Chart: error relative to peer techniques (lower is better)]
Overfit models generalize poorly
[Plot: Y = f(X) over X from 0.1 to 9.6, with test points and polynomial fits
of increasing order (YhatX through YhatX9); the higher-order fits track the
noise in the training data]
Regression error vs. #parameters
[Log-scale plot: Mean Squared Error vs. # Regression Parameters, k (1-9),
for Training and Evaluation data]
Additional terms always help performance on training data
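The pattern of these two slides -- training error falling monotonically as terms are added, while held-out error eventually turns back up -- can be reproduced in a few lines. A minimal sketch (numpy only; the true Y = f(X) here is an assumed sine curve, and the sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical stand-in for the slide's unknown Y = f(X): a sine curve plus noise.
x = np.linspace(0.0, 3.0, 15)
y = np.sin(x) + 0.3 * rng.normal(size=15)
x_new = np.linspace(0.0, 3.0, 200)               # held-out evaluation points
y_new = np.sin(x_new) + 0.3 * rng.normal(size=200)

def mse(degree, xs, ys):
    """Fit a polynomial of the given degree to the TRAINING data (x, y),
    then measure mean squared error on (xs, ys)."""
    coef = np.polynomial.polynomial.polyfit(x, y, degree)
    pred = np.polynomial.polynomial.polyval(xs, coef)
    return float(((pred - ys) ** 2).mean())

train = [mse(d, x, y) for d in range(1, 10)]
evaln = [mse(d, x_new, y_new) for d in range(1, 10)]

# Training error is non-increasing in the number of terms...
assert all(a >= b - 1e-6 for a, b in zip(train, train[1:]))
# ...but the largest model is not the best on held-out data.
assert evaln[-1] > min(evaln)
```

Because degree-d polynomials nest inside degree-(d+1) ones, least squares can never do worse on the training set as terms are added; only the held-out curve reveals the overfit.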
Regulate Complexity by:
1) Reserving data
*Cross-validation
*Bootstrapping
2) Complexity penalty
*Mallows' Cp
*Minimum Description Length (MDL) (Rissanen, Barron, Wallace & Bolton)
*Generalized Cross-Validation (Wahba)
3) Roughness penalty
*integrated square of second derivative
*cubic splines
4) Parameter shrinkage
*Ridge regression -> Bayesian priors
*"Optimal brain damage" (weight decay in neural networks)
Overfit is the Chief Danger in Model Induction
[Inset plot: error, e, vs. #terms, p]
Popular complexity penalties
(N cases, K terms, and an error variance estimate, σ̂p²)

MSE = SSE/N
Adjusted Mean Squared Error, aMSE = SSE/(N−K)
Akaike's (1970) Future Prediction Error, FPE = aMSE·(N+K)/N
Akaike's (1973) Information Criterion, AIC = ln(MSE) + 2K/N
Mallows' (1973) Cp = SSE/σ̂p² − N + 2K
Barron's (1984) Predicted Squared Error, PSE = MSE + 2σ̂p²·K/N
Schwarz's (1978) Bayesian Criterion, SBC = ln(MSE) + K·ln(N)/N
Rissanen's (1983) Minimum Description Length, MDL = MSE + ln(N)·K·σ̂p²/N
Sawa's BIC = ln(MSE) + 2q(K+2−q)/N, where q = σ̂p²/MSE

In every case the idea is the same: minimize score = e + λk
(error plus a penalty proportional to complexity)
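A few of these scores can be computed side by side. A minimal Python sketch (the SSE values, sample size, and term counts below are made up for illustration):

```python
import math

def complexity_scores(sse, n, k):
    """Several of the complexity-penalized scores above, from the training
    sum of squared errors (sse), sample size (n), and number of terms (k)."""
    mse = sse / n
    amse = sse / (n - k)                        # adjusted MSE
    fpe = amse * (n + k) / n                    # Akaike's Future Prediction Error
    aic = math.log(mse) + 2 * k / n             # Akaike's Information Criterion
    sbc = math.log(mse) + k * math.log(n) / n   # Schwarz's Bayesian Criterion
    return {"MSE": mse, "aMSE": amse, "FPE": fpe, "AIC": aic, "SBC": sbc}

# Example: two candidate models on n = 100 cases.
small = complexity_scores(sse=52.0, n=100, k=3)   # simpler, slightly worse fit
big   = complexity_scores(sse=50.0, n=100, k=9)   # more terms, better raw fit

# The raw MSE always favors the bigger model...
assert big["MSE"] < small["MSE"]
# ...but the penalized scores can reverse the ranking.
assert big["SBC"] > small["SBC"]
```

Raw MSE always prefers the larger model; the penalized scores can reverse that ranking, which is exactly what the λk term is for.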
Occam’s Razor
• Numquam ponenda est pluralitas sine necessitate
“Entities should not be multiplied beyond necessity”
“If (nearly) as descriptive, the simpler model is more correct.”
• But, gathering doubts:
– Ensemble methods which employ multiple models (e.g.,
bagging, boosting, bundling, Bayesian model averaging)
– Nonlinear terms can have a greater (or lesser) effect than linear ones
– Much overfit is from excessive search (e.g., Jensen 2000),
rather than over-parameterization
– Neural network structures are fixed, but their degree of fit
grows with time
• Domingos (1998) won KDD Best Paper arguing for its death
Target Shuffling can be used to measure the "vast search effect"
(see Battery MCT, Monte Carlo Target, in CART® --
added at my suggestion; thanks!)
1) Break link between target, Y, and features, X
by shuffling Y to form Ys.
2) Model new Ys ~ f(X)
3) Measure quality of resulting (random) model
4) Repeat to build distribution
-> Best (or mean) shuffled (i.e., useless) model
sets the baseline above which true model
performance may be measured
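The four steps above can be sketched in a few lines. A minimal sketch (numpy only; the slides' tree model is replaced by a simple correlation-based fit for brevity, and all data here are synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)

def r_squared(x, y):
    """Quality of the best simple linear fit y ~ a + b*x,
    which equals the squared correlation of x and y."""
    r = np.corrcoef(x, y)[0, 1]
    return r * r

x = rng.normal(size=30)
y = 2 * x + rng.normal(size=30)       # real structure is present
true_score = r_squared(x, y)

# Steps 1-4: shuffle Y to break the link to X, refit, record quality, repeat.
shuffled = [r_squared(x, rng.permutation(y)) for _ in range(200)]
baseline = max(shuffled)              # best "useless" model

# Only gains above the shuffled baseline are evidence of real structure.
assert true_score > baseline
```

The distribution of shuffled scores is a Monte Carlo estimate of what "vast search" alone can achieve; the model's true performance is measured above that floor, not above zero.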
Y = Tree(x1,x2) + noise; 8 random input variables, 20 training cases
Small difference between original tree (.17) and best shuffled tree (.30)
-> means that most (6/7ths) of the model’s gains are illusion.
Y = Tree(x1,x2) + noise; 8 random input variables, 50 training cases
Larger difference between original tree (.07) and best shuffled tree (.47)
-> means ~1/2 of the model’s gains are real.
Counting (and penalizing) terms
• Much faster than cross-validation
• Allows one to use all the data for training
but
• A single parameter in a nonlinear method can have <1 or >5 effective
degrees of freedom. "[The results of Hastie and Tibshirani (1985)], together
with those of Hinkley (1969, 1970) and Feder (1975), indicate that the number
of degrees of freedom associated with nonlinear least squares regression can
be considerably more than the number of parameters involved in the fit."
(Friedman and Silverman, 1989)
• The final model form doesn't necessarily reveal the extent of the
structure search. Ex: The winning KDD Cup 2001 model used 3 variables.
But there were 140,000 candidates, and only 2,000 constraints (cases).
Hjorth (1989): "... the evaluation of a selected model can not be based on
that model alone, but requires information about the class of models and
the selection procedure."
→ Need model-selection metrics which include the effect of model selection!
So a complexity metric is needed
which additionally takes into account:
• Extent of model space search:
– Algorithm thoroughness (e.g., greedy, optimal subsets)
– Input breadth and diversity
• Power of parameters
• Degree of structure in data
-> Must be empirical (involving re-sampling).
(Removes the speed advantage.
But, a complexity slope could likely be estimated for each
data/algorithm pair from early experiments.)
Complexity should be measured by the
Flexibility of the Modeling Process
• Generalized Degrees of Freedom, GDF (Jianming Ye, JASA, March 1998)
– Perturb output, re-fit procedure, measure changes in estimates
• Covariance Inflation Criterion, CIC (Tibshirani & Knight, 1999)
– Shuffle output, re-fit procedure, measure covariance between new and old estimates.
• Key step (loop around modeling procedure) reminiscent of Regression
Analysis Tool, RAT (Faraway, 1991) -- where resampling tests of a 2-second
procedure took 2 days to run.
Generalized Degrees of Freedom
• #terms in Linear Regression (LR) = DoF, k
• Nonlinear terms (e.g., MARS) can have effect of ~3k (Friedman, Owen '91)
• Other parameters can have effects < 1
(e.g., under-trained neural networks)
Procedure (Ye, 1998):
• For LR, k = trace(Hat matrix) = Σ ∂ŷ/∂y
• Define GDF to be the sum of the sensitivity of each fitted value, ŷ, to
perturbations in the corresponding output, y. That is, instead of
extrapolating from LR by counting terms, use an alternate trace measure
which is equivalent under LR.
• (Similarly, the effective degrees of freedom of a spline model is
estimated by the trace of the projection matrix, S: ŷ = Sy)
• Put a y-perturbation loop around the entire modeling process (which can
involve multiple stages)
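Ye's procedure can be sketched directly. A minimal sketch (numpy only; plain linear regression stands in for the modeling process, chosen because its GDF should come back near its term count k -- the data and sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_linear(X, y):
    """Ordinary least squares fitted values: yhat = H y."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ beta

def gdf(fit, X, y, sigma=0.25, runs=100):
    """Estimate Generalized Degrees of Freedom: perturb y with N(0, sigma)
    noise, refit the whole procedure, and sum each observation's estimated
    sensitivity d(yhat_i)/d(y_i)."""
    deltas = sigma * rng.normal(size=(runs, len(y)))
    yhats = np.array([fit(X, y + d) for d in deltas])
    base = fit(X, y)
    # Ye's robustness trick: per observation, average the sensitivity across
    # runs via the least-squares slope of (yhat_i - base_i) on delta_i.
    slopes = ((yhats - base) * deltas).sum(0) / (deltas ** 2).sum(0)
    return slopes.sum()

n, k = 50, 4
X = rng.normal(size=(n, k))
y = X @ rng.normal(size=k) + rng.normal(size=n)

est = gdf(fit_linear, X, y)
assert abs(est - k) < 1.0   # for linear regression, GDF ≈ #terms = k
```

Nothing in `gdf` looks inside the model: the same loop can wrap a tree, a bagged ensemble, or a multi-stage pipeline, which is exactly what makes GDF a fair yardstick across procedures.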
GDF computation

[Diagram: X feeds the Modeling Process to produce ŷe, where the perturbed
output ye = y + N(0,σ); each run yields observations (ye, ŷe)]

Ye robustness trick: average, across perturbation runs, the sensitivities
for a given observation.

[Scatter plot of ŷe vs. ye for one observation, with fitted line
y = 0.2502x − 0.8218; the slope estimates that observation's sensitivity]
Example problem: underlying data surface
is piecewise constant in 2 dimensions
Additive N(0,.5) noise
100 random training samples
Estimation Surfaces for bundles of 5 trees:
4-leaf trees
8-leaf trees (some of the finer structure is real)
Bagging produces gentler stairsteps than a raw tree
(illustrating how it generalizes better for smooth functions)
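The gentler stairsteps can be reproduced with a toy one-dimensional version. A minimal sketch (numpy only; the one-split "stump" below is a hypothetical stand-in for CART, and the data are synthetic):

```python
import numpy as np

rng = np.random.default_rng(2)

def fit_stump(x, y):
    """Best single-split constant model; returns a prediction function."""
    best = None
    for t in np.unique(x)[:-1]:
        left, right = y[x <= t], y[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, lm, rm = best
    return lambda q: np.where(q <= t, lm, rm)

# Piecewise-constant truth plus noise, echoing the slides' example surface.
x = rng.uniform(0, 1, 200)
y = np.where(x < 0.5, 0.0, 1.0) + 0.1 * rng.normal(size=200)

single = fit_stump(x, y)
# Bag: refit on bootstrap resamples and average the predictions.
stumps = []
for _ in range(50):
    idx = rng.integers(0, len(x), len(x))
    stumps.append(fit_stump(x[idx], y[idx]))
bagged = lambda q: np.mean([s(q) for s in stumps], axis=0)

grid = np.linspace(0, 1, 101)
# The single stump takes exactly 2 values; the bag blends many slightly
# different split points into intermediate levels -- gentler stairsteps.
assert len(np.unique(single(grid))) == 2
assert len(np.unique(bagged(grid))) > 2
```

Each bootstrap stump places its split at a slightly different threshold, so the averaged prediction ramps through intermediate values near the true breakpoint instead of jumping.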
Equivalent tree for 8-leaf bundle (actually 25% pruned)
So a bundled tree is still a tree.
But is it as complex as it looks?
Experiment: Introduce selection noise (an additional 8 candidate input variables)
[Estimation surfaces for a 5-bag of 4-leaf trees … and for 8-leaf trees]
Main structure here is clear enough for simple models to avoid the noise inputs,
but their eventual use leads to a distribution of estimates on the 2-d projection.
Estimated GDF vs. #parameters

[Chart: estimated GDF vs. # parameters, k (1-9), for five procedures:
Tree 2d, Bagged Tree 2d, Tree 2d + 8 noise, Bagged Tree 2d + 8 noise, and
Linear Regression 1d; GDF grows at ~1 per parameter for linear regression
and at roughly 3 to 5 per parameter for the tree variants]

Bagging reduces complexity.
Noise x's increase complexity.
Outlier/Influential Point Detection
A portion of complexity is assigned to each case
[Surfaces shown for an 8-leaf tree and a 4-leaf tree]
Danger of Interpretability
• Accuracy * Interpretability < Breiman’s constant
• Model can be useful without being "correct" or explanatory.
• Over-interpretation: We usually read too much into the particular variables
picked by the "best" model -- which barely won out over hundreds of other
models of the billions tried, using a score function only approximating
one's goals, on finite, noisy data.
• Many similar variables can lead the structure of the top model to vary
chaotically (especially Trees and Polynomial Networks).
But, structural similarity is not functional similarity.
(Competing models can look different, but act the same.)
• We modelers fall for our hand-crafted variables.
• We can interpret anything. [heart ex.] [cloud ex.]
Ensembles & Complexity
• Bundling competing models improves generalization.
• Different model families are a good source of component diversity.
• If we measure complexity as flexibility (GDF), the classic relation
between complexity and overfit is revived.
– The more a modeling process can match an arbitrary change made to its
output, the more complex it is.
– Simplicity is not parsimony.
• Complexity increases with distracting variables.
• It is expected to increase with parameter power and search thoroughness,
and decrease with priors, shrinking, and clarity of structure in data.
Constraints (observations) may go either way…
• Model ensembles often have less complexity than their components.
• Diverse modeling procedures can be fairly compared using GDF.
John F. Elder IV
Chief Scientist, Elder Research, Inc.
John obtained a BS and MEE in Electrical Engineering from Rice University, and a
PhD in Systems Engineering from the University of Virginia, where he’s recently
been an adjunct professor, teaching Optimization. Prior to a decade leading ERI, he
spent 5 years in aerospace consulting, 4 heading research at an investment
management firm, and 2 in Rice's Computational & Applied Mathematics department.
Dr. Elder has authored innovative data mining tools, is active on Statistics,
Engineering, and Finance conferences and boards, is a frequent keynote
conference speaker, and was a Program co-chair of the 2004 Knowledge
Discovery and Data Mining conference. John’s courses on analysis techniques --
taught at dozens of universities, companies, and government labs -- are noted for
their clarity and effectiveness. Since the Fall of 2001, Dr. Elder has been
honored to serve on a panel appointed by Congress to guide technology for the
National Security Agency.
John is a follower of Christ and the proud father of 5.
Dr. John Elder heads a data mining consulting team with
offices in Charlottesville, Virginia and Washington DC,
(www.datamininglab.com). Founded in 1995, Elder Research,
Inc. focuses on investment, commercial, and security
applications of pattern discovery and optimization,
including stock selection, image recognition, text mining,
process optimization, cross-selling, biometrics, drug
efficacy, credit scoring, market timing, and fraud detection.