The Generalization Paradox of Ensembles (docs.salford-systems.com/JohnElder.pdf)



TRANSCRIPT

Page 1: The Generalization Paradox of Ensembles
docs.salford-systems.com/JohnElder.pdf
© 2006 Elder Research, Inc.

John F. Elder IV, Ph.D.

[email protected]

Elder Research, Inc.

635 Berkmar Circle

Charlottesville, Virginia 22901

434-973-7673

www.datamininglab.com

Salford Systems Data Mining Conference

San Diego, CA

March 31, 2006

The Generalization Paradox of Ensembles

Page 2

Complexity

Its importance, influences, and inference.
And, are ensembles as complex as they look?

Outline:

• Ensembles of models generalize well - a theoretical surprise

• Chief danger of data mining: overfit

• Occam’s razor: regulate complexity to avoid overfit

• But, does the razor work? - counter-evidence has been gathering

• What if complexity is measured incorrectly?

• Generalized degrees of freedom (Ye, 1998)

• Experiments: single vs. bagged decision trees

• Summary: factors that matter

Page 3

[Figure: example model forms from five algorithm families: Decision Tree, Nearest Neighbor, Kernel, Neural Network (or Polynomial Network), Delaunay Triangles]

Page 4

Relative Performance Examples: 5 Algorithms on 6 Datasets
(John Elder, Elder Research & Stephen Lee, U. Idaho, 1997)

[Chart: Error Relative to Peer Techniques (lower is better)]

Page 5

Essentially every Bundling method improves performance

[Chart: Error Relative to Peer Techniques (lower is better)]

Page 6

Overfit models generalize poorly

[Chart: Y = f(X), test points (TestPnts), and polynomial fits YhatX through YhatX9 of increasing order]

Page 7

Regression error vs. #parameters

[Chart: Mean Squared Error (log scale, 0.001-1.000) vs. # Regression Parameters, k (1-9), for Training and Evaluation data]
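The shape of that chart can be reproduced with a small sketch: fit polynomials of increasing order to noisy samples of a known function and score them on held-out points. This is an illustration only -- the function, noise level, and sample sizes here are invented for the demo, not the slide's original data.

```python
import math
import random

def polyfit(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations,
    solved with Gaussian elimination (partial pivoting)."""
    m = degree + 1
    # Build X^T X and X^T y for the Vandermonde design matrix
    a = [[sum(x ** (i + j) for x in xs) for j in range(m)] for i in range(m)]
    b = [sum((x ** i) * y for x, y in zip(xs, ys)) for i in range(m)]
    for col in range(m):                      # forward elimination
        piv = max(range(col, m), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, m):
            f = a[r][col] / a[col][col]
            for c in range(col, m):
                a[r][c] -= f * a[col][c]
            b[r] -= f * b[col]
    coef = [0.0] * m
    for r in range(m - 1, -1, -1):            # back substitution
        coef[r] = (b[r] - sum(a[r][c] * coef[c]
                              for c in range(r + 1, m))) / a[r][r]
    return coef

def mse(coef, xs, ys):
    preds = (sum(c * x ** i for i, c in enumerate(coef)) for x in xs)
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)

rng = random.Random(0)
train_x = [i / 20 for i in range(20)]
eval_x = [(i + 0.5) / 20 for i in range(20)]
truth = lambda x: math.sin(2 * math.pi * x)
train_y = [truth(x) + rng.gauss(0, 0.2) for x in train_x]
eval_y = [truth(x) + rng.gauss(0, 0.2) for x in eval_x]

train_mses, eval_mses = [], []
for k in range(1, 10):                        # k parameters = degree k-1
    coef = polyfit(train_x, train_y, k - 1)
    train_mses.append(mse(coef, train_x, train_y))
    eval_mses.append(mse(coef, eval_x, eval_y))
# Training error shrinks with every added parameter; evaluation error
# improves at first, then typically stops improving as the fit chases noise.
```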

Page 8

Overfit is the Chief Danger in Model Induction

Additional terms always help performance on training data.
[Sketch: training error e vs. #terms, p, falling as terms are added]

Regulate Complexity by:

1) Reserving data
   * Cross-validation
   * Bootstrapping

2) Complexity penalty
   * Mallows’ Cp
   * Minimum Description Length (MDL) (Rissanen, Barron, Wallace & Bolton)
   * Generalized Cross-Validation (Wahba)

3) Roughness penalty
   * integrated square of the second derivative
   * cubic splines

4) Parameter shrinkage
   * Ridge regression -> Bayesian priors
   * ”Optimal brain damage” (weight decay in neural networks)

Page 9

Popular complexity penalties
(N cases, K terms, and an error variance estimate, σp²)

MSE = SSE / N

adjusted Mean Squared Error, aMSE = SSE / (N - K)

Akaike’s (1970) Future Prediction Error, FPE = aMSE · (N + K) / N

Akaike’s (1973) Information Criterion, AIC = ln(MSE) + 2K/N

Mallows’ (1973) Cp = SSE/σp² - N + 2K

Barron’s (1984) Predicted Squared Error, PSE = MSE + 2σp²K/N

Schwarz’s (1978) Bayesian Criterion, SBC = ln(MSE) + K·ln(N)/N

Rissanen’s (1983) Minimum Description Length, MDL = MSE + ln(N)·K·σp²/N

Sawa’s BIC = ln(MSE) + 2q(K+2-q)/N, where q = σp²/MSE

[Sketch: error e vs. k; minimize score = e + λk]
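The penalties above can all be computed directly from a fit's SSE, N, K, and σp². A minimal sketch of them as written on the slide; the example numbers at the bottom are hypothetical:

```python
import math

def complexity_penalties(sse, n, k, sigma2_p):
    """Compute the slide's complexity penalties for one fitted model.

    sse      : sum of squared errors on the training data
    n        : number of cases (N)
    k        : number of fitted terms (K)
    sigma2_p : an estimate of the error variance (sigma_p^2)
    """
    mse = sse / n
    amse = sse / (n - k)                        # adjusted MSE
    fpe = amse * (n + k) / n                    # Akaike 1970
    aic = math.log(mse) + 2 * k / n             # Akaike 1973
    cp = sse / sigma2_p - n + 2 * k             # Mallows 1973
    pse = mse + 2 * sigma2_p * k / n            # Barron 1984
    sbc = math.log(mse) + k * math.log(n) / n   # Schwarz 1978
    mdl = mse + math.log(n) * k * sigma2_p / n  # Rissanen 1983
    q = sigma2_p / mse
    bic = math.log(mse) + 2 * q * (k + 2 - q) / n  # Sawa
    return {"MSE": mse, "aMSE": amse, "FPE": fpe, "AIC": aic, "Cp": cp,
            "PSE": pse, "SBC": sbc, "MDL": mdl, "BIC": bic}

# Hypothetical fit: 100 cases, 4 terms, SSE = 25, noise variance estimate 0.25
scores = complexity_penalties(sse=25.0, n=100, k=4, sigma2_p=0.25)
```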

Page 10

Occam’s Razor

• Nunquam ponenda est pluralitas sine necessitate

“Entities should not be multiplied beyond necessity”

“If (nearly) as descriptive, the simpler model is more correct.”

• But, gathering doubts:

– Ensemble methods employ multiple models (e.g., bagging, boosting, bundling, Bayesian model averaging)

– Nonlinear terms have higher (or lower) than linear effect

– Much overfit is from excessive search (e.g., Jensen 2000), rather than over-parameterization

– Neural network structures are fixed, but their degree of fit grows with training time

• Domingos (1998) won the KDD Best Paper award arguing for its death

Page 11

Target Shuffling can be used to measure the “vast search effect”

(see Battery MCT, Monte Carlo Target, in CART; added at my suggestion; thanks!)

1) Break the link between the target, Y, and the features, X, by shuffling Y to form Ys.

2) Model the new Ys ~ f(X)

3) Measure the quality of the resulting (random) model

4) Repeat to build a distribution

-> The best (or mean) shuffled (i.e., useless) model sets the baseline above which true model performance may be measured
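The four steps above amount to a small loop wrapped around any modeling procedure. A sketch under invented assumptions: the “modeling process” here is a one-variable linear regression scored by training R², and the data and helper names are made up for illustration.

```python
import random
import statistics

def shuffled_baseline(xs, ys, fit, score, n_shuffles=200, seed=0):
    """Target shuffling: permute Y to break the X-Y link, refit,
    and record each (useless) model's score.  The resulting
    distribution is the baseline a real model must beat."""
    rng = random.Random(seed)
    baseline = []
    for _ in range(n_shuffles):
        ys_shuffled = ys[:]
        rng.shuffle(ys_shuffled)                 # step 1: break the link
        model = fit(xs, ys_shuffled)             # step 2: model shuffled Ys
        baseline.append(score(model, xs, ys_shuffled))  # step 3: score it
    return baseline                              # step 4: a distribution

def fit_line(xs, ys):
    """Toy modeling process: simple linear regression, y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return (my - b * mx, b)

def r_squared(model, xs, ys):
    a, b = model
    my = statistics.mean(ys)
    ss_tot = sum((y - my) ** 2 for y in ys)
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return 1 - ss_res / ss_tot

rng = random.Random(1)
xs = list(range(20))
ys = [0.5 * x + rng.gauss(0, 1) for x in xs]     # real signal plus noise

base = shuffled_baseline(xs, ys, fit_line, r_squared)
real = r_squared(fit_line(xs, ys), xs, ys)
# The true model's R^2 should sit well above the shuffled distribution;
# gains below max(base) could have arisen from search alone.
```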

Page 12

Y = Tree(x1,x2) + noise; 8 random input variables, 20 training cases

Small difference between the original tree (.17) and the best shuffled tree (.30)

-> most (6/7ths) of the model’s gains are illusory.

Page 13

Y = Tree(x1,x2) + noise; 8 random input variables, 50 training cases

Larger difference between the original tree (.07) and the best shuffled tree (.47)

-> ~1/2 of the model’s gains are real.

Page 14

Counting (and penalizing) terms

• Much faster than cross-validation

• Allows one to use all the data for training

but

• A single parameter in a nonlinear method can have <1 or >5 effective degrees of freedom.

“[The results of Hastie and Tibshirani (1985)], together with those of Hinkley (1969, 1970) and Feder (1975), indicate that the number of degrees of freedom associated with nonlinear least squares regression can be considerably more than the number of parameters involved in the fit.” (Friedman and Silverman, 1989)

• The final model form doesn’t necessarily reveal the extent of the structure search.

Ex: The winning KDD Cup 2001 model used 3 variables. But there were 140,000 candidates, and only 2,000 constraints (cases).

Hjorth (1989): “... the evaluation of a selected model can not be based on that model alone, but requires information about the class of models and the selection procedure.”

=> Need model selection metrics which include the effect of model selection!

Page 15

So a complexity metric is needed

which additionally takes into account:

• Extent of model space search:

– Algorithm thoroughness (e.g., greedy, optimal subsets)

– Input breadth and diversity

• Power of parameters

• Degree of structure in data

-> Must be empirical (involving re-sampling).

(Removes the speed advantage.

But, a complexity slope could likely be estimated for each

data/algorithm pair from early experiments.)

Page 16

Complexity should be measured by the

Flexibility of the Modeling Process

• Generalized Degrees of Freedom, GDF (Jianming Ye, JASA, March 1998)

– Perturb output, re-fit procedure, measure changes in estimates

• Covariance Inflation Criterion, CIC (Tibshirani & Knight, 1999)

– Shuffle output, re-fit procedure, measure covariance between new and old estimates.

• Key step (loop around modeling procedure) reminiscent of Regression Analysis Tool, RAT (Faraway, 1991) -- where resampling tests of a 2-second procedure took 2 days to run.

Page 17

Generalized Degrees of Freedom

• #terms in Linear Regression (LR) = DoF, k

• Nonlinear terms (e.g., MARS) can have the effect of ~3k (Friedman, Owen ‘91)

• Other parameters can have effects < 1 (e.g., under-trained neural networks)

Procedure (Ye, 1998):

• For LR, k = trace(Hat Matrix) = Σ ∂yhat/∂y

• Define GDF to be the sum of the sensitivity of each fitted value, yhat, to perturbations in the corresponding output, y. That is, instead of extrapolating from LR by counting terms, use the alternate trace measure, which is equivalent under LR.

• (Similarly, the effective degrees of freedom of a spline model is estimated by the trace of the projection matrix, S: yhat = Sy)

• Put a y-perturbation loop around the entire modeling process (which can involve multiple stages)
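Ye's procedure can be sketched directly: perturb y, re-run the whole modeling process, and sum each fitted value's estimated sensitivity (the slope of yhat_i against y_i across perturbation runs). The helper `linear_fit_values` is an invented stand-in modeling process; for ordinary linear regression the estimate should land near the true parameter count, k = 2.

```python
import random

def linear_fit_values(xs, ys):
    """Stand-in modeling process: 1-d linear regression,
    returning the fitted values yhat for the training cases."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return [a + b * x for x in xs]

def gdf(xs, ys, modeling_process, n_perturb=200, sigma=0.5, seed=0):
    """Generalized Degrees of Freedom (Ye, 1998), estimated empirically.

    Put a y-perturbation loop around the entire modeling process, then
    sum, over cases, the sensitivity of each fitted value to its own
    output.  Each sensitivity is the slope of yhat_i regressed on y_i
    across perturbation runs (Ye's robustness trick)."""
    rng = random.Random(seed)
    runs_y, runs_yhat = [], []
    for _ in range(n_perturb):
        ye = [y + rng.gauss(0, sigma) for y in ys]   # perturb the output
        runs_y.append(ye)
        runs_yhat.append(modeling_process(xs, ye))   # re-fit everything
    total = 0.0
    for i in range(len(ys)):
        yi = [run[i] for run in runs_y]
        hi = [run[i] for run in runs_yhat]
        my, mh = sum(yi) / n_perturb, sum(hi) / n_perturb
        slope = (sum((a - my) * (b - mh) for a, b in zip(yi, hi))
                 / sum((a - my) ** 2 for a in yi))
        total += slope                               # d(yhat_i)/d(y_i)
    return total

xs = list(range(10))
ys = [2.0 * x + 1.0 for x in xs]
g = gdf(xs, ys, linear_fit_values)
# For plain linear regression the estimate should land near k = 2
```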

Page 18

GDF computation

[Diagram: the inputs X and a perturbed output ye = y + N(0, σ) feed the Modeling Process, which returns fitted values yehat; the pairs (ye, yehat) are collected as observations across perturbation runs]

Ye robustness trick: average, across perturbation runs, the sensitivities for a given observation.

[Scatter plot: yehat vs. ye for one observation, with fitted line y = 0.2502x - 0.8218, i.e. an estimated sensitivity of about 0.25]

Page 19

Example problem: underlying data surface

is piecewise constant in 2 dimensions

Page 20

Additive N(0,.5) noise

Page 21

100 random training samples

Page 22

Estimation Surfaces for bundles of 5 trees

[Surfaces: 4-leaf trees; 8-leaf trees (some of the finer structure is real)]

Bagging produces gentler stairsteps than a raw tree
(illustrating how it generalizes better for smooth functions)
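A minimal illustration of why bagging gives gentler stairsteps: average several stumps (2-leaf trees), each fit on a bootstrap resample. The data here are invented (a noisy 1-d step function); the point is only that an average of hard steps climbs more gradually than any single step.

```python
import random

def fit_stump(xs, ys):
    """Best single-split regression stump (a 2-leaf tree) in one dimension."""
    best = None
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    for cut in range(1, len(xs)):
        left = [ys[i] for i in order[:cut]]
        right = [ys[i] for i in order[cut:]]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((v - ml) ** 2 for v in left)
               + sum((v - mr) ** 2 for v in right))
        thresh = (xs[order[cut - 1]] + xs[order[cut]]) / 2
        if best is None or sse < best[0]:
            best = (sse, thresh, ml, mr)
    _, thresh, ml, mr = best
    return lambda x: ml if x < thresh else mr

def bagged_stumps(xs, ys, n_bags=25, seed=0):
    """Bag stumps: fit each on a bootstrap resample, average predictions.
    The average of many hard steps is a gentler staircase."""
    rng = random.Random(seed)
    stumps = []
    for _ in range(n_bags):
        idx = [rng.randrange(len(xs)) for _ in xs]   # bootstrap sample
        stumps.append(fit_stump([xs[i] for i in idx],
                                [ys[i] for i in idx]))
    return lambda x: sum(s(x) for s in stumps) / len(stumps)

# Step function: y = 0 for x < 5, y = 1 for x >= 5, plus a little noise
rng = random.Random(1)
xs = [i * 0.5 for i in range(20)]
ys = [(0.0 if x < 5 else 1.0) + rng.gauss(0, 0.1) for x in xs]

single = fit_stump(xs, ys)
bag = bagged_stumps(xs, ys)
# The single stump jumps in one abrupt step; the bag's thresholds vary
# across bootstrap samples, so it climbs in smaller steps near x = 5.
```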

Page 23

Equivalent tree for the 8-leaf bundle (actually 25% pruned)

So a bundled tree is still a tree.

But is it as complex as it looks?

Page 24

Experiment: Introduce selection noise
(an additional 8 candidate input variables)

[Estimation surface for a 5-bag of 4-leaf trees … for 8-leaf trees]

The main structure here is clear enough for simple models to avoid the noise inputs, but their eventual use leads to a distribution of estimates on the 2-d projection.

Page 25

Estimated GDF vs. #parameters

[Chart: GDF (0-40) vs. Parameters, k (1-9), for Tree 2d, Bagged Tree 2d, Tree 2d + 8 noise, Bagged Tree 2d + 8 noise, and Linear Regression 1d; slope markers on the curves read 1, ~3, ~4, ~4, and ~5]

Bagging reduces complexity
Noise x’s increase complexity

Page 26

Outlier/Influential Point Detection
A portion of complexity is assigned to each case

[Figures: per-case complexity shown for an 8-leaf tree and a 4-leaf tree]

Page 27

Danger of Interpretability

• Accuracy * Interpretability < Breiman’s constant

• Model can be useful without being "correct" or explanatory.

• Over-interpretation: We usually read too much into particular variables

picked by "best" model -- which barely won out over hundreds of other

models of the billions tried, using a score function only approximating

one's goals, on finite, noisy data.

• Many similar variables can lead the structure of the top model to vary

chaotically (especially Trees and Polynomial Networks).

But, structural similarity is not functional similarity.

(Competing models can look different, but act the same.)

• We modelers fall for our hand-crafted variables.

• We can interpret anything. [heart ex.] [cloud ex.]

Page 28

Ensembles & Complexity

• Bundling competing models improves generalization.

• Different model families are a good source of component diversity.

• If we measure complexity as flexibility (GDF), the classic relation between complexity and overfit is revived.

– The more a modeling process can match an arbitrary change made to its output, the more complex it is.

– Simplicity is not parsimony.

• Complexity increases with distracting variables.

• It is expected to increase with parameter power and search thoroughness, and decrease with priors, shrinking, and clarity of structure in data. Constraints (observations) may go either way…

• Model ensembles often have less complexity than their components.

• Diverse modeling procedures can be fairly compared using GDF.

Page 29

John F. Elder IV

Chief Scientist, Elder Research, Inc.

John obtained a BS and MEE in Electrical Engineering from Rice University, and a

PhD in Systems Engineering from the University of Virginia, where he’s recently

been an adjunct professor, teaching Optimization. Prior to a decade leading ERI, he

spent 5 years in aerospace consulting, 4 heading research at an investment

management firm, and 2 in Rice's Computational & Applied Mathematics department.

Dr. Elder has authored innovative data mining tools, is active on Statistics,

Engineering, and Finance conferences and boards, is a frequent keynote

conference speaker, and was a Program co-chair of the 2004 Knowledge

Discovery and Data Mining conference. John’s courses on analysis techniques --

taught at dozens of universities, companies, and government labs -- are noted for

their clarity and effectiveness. Since the Fall of 2001, Dr. Elder has been

honored to serve on a panel appointed by Congress to guide technology for the

National Security Agency.

John is a follower of Christ and the proud father of 5.

Dr. John Elder heads a data mining consulting team with

offices in Charlottesville, Virginia and Washington DC,

(www.datamininglab.com). Founded in 1995, Elder Research,

Inc. focuses on investment, commercial, and security

applications of pattern discovery and optimization,

including stock selection, image recognition, text mining,

process optimization, cross-selling, biometrics, drug

efficacy, credit scoring, market timing, and fraud detection.