TRANSCRIPT
© 2006 Elder Research, Inc.
John F. Elder IV, Ph.D.
Elder Research, Inc.
635 Berkmar Circle
Charlottesville, Virginia 22901
434-973-7673
www.datamininglab.com
Salford Systems Data Mining Conference
San Diego, CA
March 31, 2006
The Generalization Paradox of Ensembles
Complexity:
Its importance, influences, and inference.
And, are ensembles as complex as they look?

Outline:
• Ensembles of models generalize well - a theoretical surprise
• Chief danger of data mining: overfit
• Occam's razor: regulate complexity to avoid overfit
• But does the razor work? Counter-evidence has been gathering
• What if complexity is measured incorrectly?
• Generalized degrees of freedom (Ye, 1998)
• Experiments: single vs. bagged decision trees
• Summary: factors that matter
[Illustrations of five algorithms: Decision Tree, Nearest Neighbor, Kernel,
Neural Network (or Polynomial Network), Delaunay Triangles]
Relative Performance Examples: 5 Algorithms on 6 Datasets
(John Elder, Elder Research & Stephen Lee, U. Idaho, 1997)
[Chart: error relative to peer techniques (lower is better)]
Essentially every Bundling method improves performance
[Chart: error relative to peer techniques (lower is better)]
Overfit models generalize poorly
[Plot: Y = f(X) over X from 0.1 to 9.6, with test points and polynomial fits
of increasing order (YhatX through YhatX9); the higher-order fits track the
noise in the training data]
Regression error vs. #parameters
[Log-scale plot: Mean Squared Error vs. # Regression Parameters, k (1-9),
for Training and Evaluation data]
Additional terms always help performance on training data
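The pattern of these two slides -- training error falling monotonically as terms are added, while held-out error eventually turns back up -- can be reproduced in a few lines. A minimal sketch (numpy only; the true Y = f(X) here is an assumed sine curve, and the sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical stand-in for the slide's unknown Y = f(X): a sine curve plus noise.
x = np.linspace(0.0, 3.0, 15)
y = np.sin(x) + 0.3 * rng.normal(size=15)
x_new = np.linspace(0.0, 3.0, 200)               # held-out evaluation points
y_new = np.sin(x_new) + 0.3 * rng.normal(size=200)

def mse(degree, xs, ys):
    """Fit a polynomial of the given degree to the TRAINING data (x, y),
    then measure mean squared error on (xs, ys)."""
    coef = np.polynomial.polynomial.polyfit(x, y, degree)
    pred = np.polynomial.polynomial.polyval(xs, coef)
    return float(((pred - ys) ** 2).mean())

train = [mse(d, x, y) for d in range(1, 10)]
evaln = [mse(d, x_new, y_new) for d in range(1, 10)]

# Training error is non-increasing in the number of terms...
assert all(a >= b - 1e-6 for a, b in zip(train, train[1:]))
# ...but the largest model is not the best on held-out data.
assert evaln[-1] > min(evaln)
```

Because degree-d polynomials nest inside degree-(d+1) ones, least squares can never do worse on the training set as terms are added; only the held-out curve reveals the overfit.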
Regulate Complexity by:
1) Reserving data
*Cross-validation
*Bootstrapping
2) Complexity penalty
*Mallows' Cp
*Minimum Description Length (MDL) (Rissanen, Barron, Wallace & Bolton)
*Generalized Cross-Validation (Wahba)
3) Roughness penalty
*integrated square of second derivative
*cubic splines
4) Parameter shrinkage
*Ridge regression -> Bayesian priors
*"Optimal brain damage" (weight decay in neural networks)
Overfit is the Chief Danger in Model Induction
[Inset plot: error, e, vs. #terms, p]
Popular complexity penalties
(N cases, K terms, and an error variance estimate, σ̂p²)

MSE = SSE/N
Adjusted Mean Squared Error, aMSE = SSE/(N−K)
Akaike's (1970) Future Prediction Error, FPE = aMSE·(N+K)/N
Akaike's (1973) Information Criterion, AIC = ln(MSE) + 2K/N
Mallows' (1973) Cp = SSE/σ̂p² − N + 2K
Barron's (1984) Predicted Squared Error, PSE = MSE + 2σ̂p²·K/N
Schwarz's (1978) Bayesian Criterion, SBC = ln(MSE) + K·ln(N)/N
Rissanen's (1983) Minimum Description Length, MDL = MSE + ln(N)·K·σ̂p²/N
Sawa's BIC = ln(MSE) + 2q(K+2−q)/N, where q = σ̂p²/MSE

In every case the idea is the same: minimize score = e + λk
(error plus a penalty proportional to complexity)
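A few of these scores can be computed side by side. A minimal Python sketch (the SSE values, sample size, and term counts below are made up for illustration):

```python
import math

def complexity_scores(sse, n, k):
    """Several of the complexity-penalized scores above, from the training
    sum of squared errors (sse), sample size (n), and number of terms (k)."""
    mse = sse / n
    amse = sse / (n - k)                        # adjusted MSE
    fpe = amse * (n + k) / n                    # Akaike's Future Prediction Error
    aic = math.log(mse) + 2 * k / n             # Akaike's Information Criterion
    sbc = math.log(mse) + k * math.log(n) / n   # Schwarz's Bayesian Criterion
    return {"MSE": mse, "aMSE": amse, "FPE": fpe, "AIC": aic, "SBC": sbc}

# Example: two candidate models on n = 100 cases.
small = complexity_scores(sse=52.0, n=100, k=3)   # simpler, slightly worse fit
big   = complexity_scores(sse=50.0, n=100, k=9)   # more terms, better raw fit

# The raw MSE always favors the bigger model...
assert big["MSE"] < small["MSE"]
# ...but the penalized scores can reverse the ranking.
assert big["SBC"] > small["SBC"]
```

Raw MSE always prefers the larger model; the penalized scores can reverse that ranking, which is exactly what the λk term is for.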
Occam’s Razor
• Numquam ponenda est pluralitas sine necessitate
“Entities should not be multiplied beyond necessity”
“If (nearly) as descriptive, the simpler model is more correct.”
• But, gathering doubts:
– Ensemble methods which employ multiple models (e.g.,
bagging, boosting, bundling, Bayesian model averaging)
– Nonlinear terms can have a greater (or lesser) effect than linear ones
– Much overfit is from excessive search (e.g., Jensen 2000),
rather than over-parameterization
– Neural network structures are fixed, but their degree of fit
grows with time
• Domingos (1998) won KDD Best Paper arguing for its death
Target Shuffling can be used to measure the "vast search effect"
(see Battery MCT, Monte Carlo Target, in CART® --
added at my suggestion; thanks!)
1) Break link between target, Y, and features, X
by shuffling Y to form Ys.
2) Model new Ys ~ f(X)
3) Measure quality of resulting (random) model
4) Repeat to build distribution
-> Best (or mean) shuffled (i.e., useless) model
sets the baseline above which true model
performance may be measured
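The four steps above can be sketched in a few lines. A minimal sketch (numpy only; the slides' tree model is replaced by a simple correlation-based fit for brevity, and all data here are synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)

def r_squared(x, y):
    """Quality of the best simple linear fit y ~ a + b*x,
    which equals the squared correlation of x and y."""
    r = np.corrcoef(x, y)[0, 1]
    return r * r

x = rng.normal(size=30)
y = 2 * x + rng.normal(size=30)       # real structure is present
true_score = r_squared(x, y)

# Steps 1-4: shuffle Y to break the link to X, refit, record quality, repeat.
shuffled = [r_squared(x, rng.permutation(y)) for _ in range(200)]
baseline = max(shuffled)              # best "useless" model

# Only gains above the shuffled baseline are evidence of real structure.
assert true_score > baseline
```

The distribution of shuffled scores is a Monte Carlo estimate of what "vast search" alone can achieve; the model's true performance is measured above that floor, not above zero.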
Y = Tree(x1,x2) + noise; 8 random input variables, 20 training cases
Small difference between original tree (.17) and best shuffled tree (.30)
-> means that most (6/7ths) of the model’s gains are illusion.
Y = Tree(x1,x2) + noise; 8 random input variables, 50 training cases
Larger difference between original tree (.07) and best shuffled tree (.47)
-> means ~1/2 of the model’s gains are real.
Counting (and penalizing) terms
• Much faster than cross-validation
• Allows one to use all the data for training
but
• A single parameter in a nonlinear method can have <1 or >5 effective
degrees of freedom. "[The results of Hastie and Tibshirani (1985)], together
with those of Hinkley (1969, 1970) and Feder (1975), indicate that the number
of degrees of freedom associated with nonlinear least squares regression can
be considerably more than the number of parameters involved in the fit."
(Friedman and Silverman, 1989)
• The final model form doesn't necessarily reveal the extent of the
structure search. Ex: The winning KDD Cup 2001 model used 3 variables.
But there were 140,000 candidates, and only 2,000 constraints (cases).
Hjorth (1989): "... the evaluation of a selected model can not be based on
that model alone, but requires information about the class of models and
the selection procedure."
→ Need model-selection metrics which include the effect of model selection!
So a complexity metric is needed
which additionally takes into account:
• Extent of model space search:
– Algorithm thoroughness (e.g., greedy, optimal subsets)
– Input breadth and diversity
• Power of parameters
• Degree of structure in data
-> Must be empirical (involving re-sampling).
(Removes the speed advantage.
But, a complexity slope could likely be estimated for each
data/algorithm pair from early experiments.)
Complexity should be measured by the
Flexibility of the Modeling Process
• Generalized Degrees of Freedom, GDF (Jianming Ye, JASA, March 1998)
– Perturb output, re-fit procedure, measure changes in estimates
• Covariance Inflation Criterion, CIC (Tibshirani & Knight, 1999)
– Shuffle output, re-fit procedure, measure covariance between new and old estimates.
• Key step (loop around modeling procedure) reminiscent of Regression
Analysis Tool, RAT (Faraway, 1991) -- where resampling tests of a 2-second
procedure took 2 days to run.
Generalized Degrees of Freedom
• #terms in Linear Regression (LR) = DoF, k
• Nonlinear terms (e.g., MARS) can have effect of ~3k (Friedman, Owen '91)
• Other parameters can have effects < 1
(e.g., under-trained neural networks)
Procedure (Ye, 1998):
• For LR, k = trace(Hat matrix) = Σ ∂ŷ/∂y
• Define GDF to be the sum of the sensitivity of each fitted value, ŷ, to
perturbations in the corresponding output, y. That is, instead of
extrapolating from LR by counting terms, use an alternate trace measure
which is equivalent under LR.
• (Similarly, the effective degrees of freedom of a spline model is
estimated by the trace of the projection matrix, S: ŷ = Sy)
• Put a y-perturbation loop around the entire modeling process (which can
involve multiple stages)
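Ye's procedure can be sketched directly. A minimal sketch (numpy only; plain linear regression stands in for the modeling process, chosen because its GDF should come back near its term count k -- the data and sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_linear(X, y):
    """Ordinary least squares fitted values: yhat = H y."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ beta

def gdf(fit, X, y, sigma=0.25, runs=100):
    """Estimate Generalized Degrees of Freedom: perturb y with N(0, sigma)
    noise, refit the whole procedure, and sum each observation's estimated
    sensitivity d(yhat_i)/d(y_i)."""
    deltas = sigma * rng.normal(size=(runs, len(y)))
    yhats = np.array([fit(X, y + d) for d in deltas])
    base = fit(X, y)
    # Ye's robustness trick: per observation, average the sensitivity across
    # runs via the least-squares slope of (yhat_i - base_i) on delta_i.
    slopes = ((yhats - base) * deltas).sum(0) / (deltas ** 2).sum(0)
    return slopes.sum()

n, k = 50, 4
X = rng.normal(size=(n, k))
y = X @ rng.normal(size=k) + rng.normal(size=n)

est = gdf(fit_linear, X, y)
assert abs(est - k) < 1.0   # for linear regression, GDF ≈ #terms = k
```

Nothing in `gdf` looks inside the model: the same loop can wrap a tree, a bagged ensemble, or a multi-stage pipeline, which is exactly what makes GDF a fair yardstick across procedures.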
GDF computation

[Diagram: X feeds the Modeling Process to produce ŷe, where the perturbed
output ye = y + N(0,σ); each run yields observations (ye, ŷe)]

Ye robustness trick: average, across perturbation runs, the sensitivities
for a given observation.

[Scatter plot of ŷe vs. ye for one observation, with fitted line
y = 0.2502x − 0.8218; the slope estimates that observation's sensitivity]
Example problem: underlying data surface
is piecewise constant in 2 dimensions
Additive N(0,.5) noise
100 random training samples
Estimation Surfaces for bundles of 5 trees:
4-leaf trees
8-leaf trees (some of the finer structure is real)
Bagging produces gentler stairsteps than a raw tree
(illustrating how it generalizes better for smooth functions)
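The gentler stairsteps can be reproduced with a toy one-dimensional version. A minimal sketch (numpy only; the one-split "stump" below is a hypothetical stand-in for CART, and the data are synthetic):

```python
import numpy as np

rng = np.random.default_rng(2)

def fit_stump(x, y):
    """Best single-split constant model; returns a prediction function."""
    best = None
    for t in np.unique(x)[:-1]:
        left, right = y[x <= t], y[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, lm, rm = best
    return lambda q: np.where(q <= t, lm, rm)

# Piecewise-constant truth plus noise, echoing the slides' example surface.
x = rng.uniform(0, 1, 200)
y = np.where(x < 0.5, 0.0, 1.0) + 0.1 * rng.normal(size=200)

single = fit_stump(x, y)
# Bag: refit on bootstrap resamples and average the predictions.
stumps = []
for _ in range(50):
    idx = rng.integers(0, len(x), len(x))
    stumps.append(fit_stump(x[idx], y[idx]))
bagged = lambda q: np.mean([s(q) for s in stumps], axis=0)

grid = np.linspace(0, 1, 101)
# The single stump takes exactly 2 values; the bag blends many slightly
# different split points into intermediate levels -- gentler stairsteps.
assert len(np.unique(single(grid))) == 2
assert len(np.unique(bagged(grid))) > 2
```

Each bootstrap stump places its split at a slightly different threshold, so the averaged prediction ramps through intermediate values near the true breakpoint instead of jumping.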
Equivalent tree for 8-leaf bundle (actually 25% pruned)
So a bundled tree is still a tree.
But is it as complex as it looks?
Experiment: Introduce selection noise (an additional 8 candidate input variables)
[Estimation surfaces for a 5-bag of 4-leaf trees … and for 8-leaf trees]
Main structure here is clear enough for simple models to avoid the noise inputs,
but their eventual use leads to a distribution of estimates on the 2-d projection.
Estimated GDF vs. #parameters

[Chart: estimated GDF vs. # parameters, k (1-9), for five procedures:
Tree 2d, Bagged Tree 2d, Tree 2d + 8 noise, Bagged Tree 2d + 8 noise, and
Linear Regression 1d; GDF grows at ~1 per parameter for linear regression
and at roughly 3 to 5 per parameter for the tree variants]

Bagging reduces complexity.
Noise x's increase complexity.
Outlier/Influential Point Detection
A portion of complexity is assigned to each case
[Surfaces shown for an 8-leaf tree and a 4-leaf tree]
Danger of Interpretability
• Accuracy * Interpretability < Breiman’s constant
• Model can be useful without being "correct" or explanatory.
• Over-interpretation: We usually read too much into the particular variables
picked by the "best" model -- which barely won out over hundreds of other
models of the billions tried, using a score function only approximating
one's goals, on finite, noisy data.
• Many similar variables can lead the structure of the top model to vary
chaotically (especially Trees and Polynomial Networks).
But, structural similarity is not functional similarity.
(Competing models can look different, but act the same.)
• We modelers fall for our hand-crafted variables.
• We can interpret anything. [heart ex.] [cloud ex.]
Ensembles & Complexity
• Bundling competing models improves generalization.
• Different model families are a good source of component diversity.
• If we measure complexity as flexibility (GDF), the classic relation
between complexity and overfit is revived.
– The more a modeling process can match an arbitrary change made to its
output, the more complex it is.
– Simplicity is not parsimony.
• Complexity increases with distracting variables.
• It is expected to increase with parameter power and search thoroughness,
and decrease with priors, shrinking, and clarity of structure in data.
Constraints (observations) may go either way…
• Model ensembles often have less complexity than their components.
• Diverse modeling procedures can be fairly compared using GDF.
John F. Elder IV
Chief Scientist, Elder Research, Inc.
John obtained a BS and MEE in Electrical Engineering from Rice University, and a
PhD in Systems Engineering from the University of Virginia, where he’s recently
been an adjunct professor, teaching Optimization. Prior to a decade leading ERI, he
spent 5 years in aerospace consulting, 4 heading research at an investment
management firm, and 2 in Rice's Computational & Applied Mathematics department.
Dr. Elder has authored innovative data mining tools, is active on Statistics,
Engineering, and Finance conferences and boards, is a frequent keynote
conference speaker, and was a Program co-chair of the 2004 Knowledge
Discovery and Data Mining conference. John’s courses on analysis techniques --
taught at dozens of universities, companies, and government labs -- are noted for
their clarity and effectiveness. Since the Fall of 2001, Dr. Elder has been
honored to serve on a panel appointed by Congress to guide technology for the
National Security Agency.
John is a follower of Christ and the proud father of 5.
Dr. John Elder heads a data mining consulting team with
offices in Charlottesville, Virginia and Washington DC,
(www.datamininglab.com). Founded in 1995, Elder Research,
Inc. focuses on investment, commercial, and security
applications of pattern discovery and optimization,
including stock selection, image recognition, text mining,
process optimization, cross-selling, biometrics, drug
efficacy, credit scoring, market timing, and fraud detection.